diff --git a/README.md b/README.md
index 90ce7b8fed..0d91b737db 100644
--- a/README.md
+++ b/README.md
@@ -1,10361 +1,10361 @@
 # arxiv-daily
- Automated deployment @ 2025-02-16 09:10:43 Asia/Taipei
+ Automated deployment @ 2025-02-16 20:27:09 Asia/Taipei
 > Welcome to contribute! Add your topics and keywords in [`topic.yml`](https://github.com/jawatech/arxiv-daily-in-place/blob/main/database/topic.yml).
 > You can also view historical data through the [storage](https://github.com/jawatech/arxiv-daily-in-place/blob/main/database/storage).
 
 ## AI
 
-### Medical explainable AI
+### Knowledge Graphs
 |Publish Date|Title|Authors|Homepage|Code|
 | :---: | :---: | :---: | :---: | :---: |
-|**2025-02-13**|**Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**|Flavio Bertini et.al.|[2502.09218v1](http://arxiv.org/abs/2502.09218v1)|null|
-|**2025-02-10**|**Foundation Model of Electronic Medical Records for Adaptive Risk Estimation**|Pawel Renc et.al.|[2502.06124v1](http://arxiv.org/abs/2502.06124v1)|null|
-|**2025-01-27**|**An Explainable Disease Surveillance System for Early Prediction of Multiple Chronic Diseases**|Shaheer Ahmad Khan et.al.|[2501.15969v1](http://arxiv.org/abs/2501.15969v1)|null|
-|**2025-01-23**|**Ensuring Medical AI Safety: Explainable AI-Driven Detection and Mitigation of Spurious Model Behavior and Associated Data**|Frederik Pahde et.al.|[2501.13818v1](http://arxiv.org/abs/2501.13818v1)|[link](https://github.com/frederikpahde/medical-ai-safety)|
-|**2025-01-19**|**Enhanced Suicidal Ideation Detection from Social Media Using a CNN-BiLSTM Hybrid Model**|Mohaiminul Islam Bhuiyan et.al.|[2501.11094v1](http://arxiv.org/abs/2501.11094v1)|null|
-|**2025-01-17**|**SEANN: A Domain-Informed Neural Network for Epidemiological Insights**|Jean-Baptiste Guimbaud et.al.|[2501.10273v1](http://arxiv.org/abs/2501.10273v1)|null|
-|**2025-01-16**|**Artificial Intelligence-Driven Clinical Decision Support Systems**|Muhammet Alkan et.al.|[2501.09628v1](http://arxiv.org/abs/2501.09628v1)|null|
-|**2025-01-12**|**MedGrad E-CLIP: Enhancing Trust and Transparency in AI-Driven Skin Lesion Diagnosis**|Sadia Kamal et.al.|[2501.06887v1](http://arxiv.org/abs/2501.06887v1)|null|
-|**2025-01-06**|**Explaining Humour Style Classifications: An XAI Approach to Understanding Computational Humour Analysis**|Mary Ogbuka Kenneth et.al.|[2501.02891v1](http://arxiv.org/abs/2501.02891v1)|null|
-|**2024-12-28**|**The Emotional Spectrum of LLMs: Leveraging Empathy and Emotion-Based Markers for Mental Health Support**|Alessandro De Grandi et.al.|[2412.20068v1](http://arxiv.org/abs/2412.20068v1)|null|
-|**2024-12-27**|**A Review on the Integration of Artificial Intelligence and Medical Imaging in IVF Ovarian Stimulation**|Jana Zakall et.al.|[2412.19688v1](http://arxiv.org/abs/2412.19688v1)|null|
-|**2024-12-23**|**Enhancing Cancer Diagnosis with Explainable & Trustworthy Deep Learning Models**|Badaru I. Olumuyiwa et.al.|[2412.17527v1](http://arxiv.org/abs/2412.17527v1)|null|
-|**2024-12-20**|**Towards Interpretable Radiology Report Generation via Concept Bottlenecks using a Multi-Agentic RAG**|Hasan Md Tusfiqur Alam et.al.|[2412.16086v2](http://arxiv.org/abs/2412.16086v2)|[link](https://github.com/tifat58/irr-with-cbm-rag)|
-|**2024-12-20**|**Critique of Impure Reason: Unveiling the reasoning behaviour of medical Large Language Models**|Shamus Sim et.al.|[2412.15748v1](http://arxiv.org/abs/2412.15748v1)|null|
-|**2024-12-18**|**Cognition Chain for Explainable Psychological Stress Detection on Social Media**|Xin Wang et.al.|[2412.14009v1](http://arxiv.org/abs/2412.14009v1)|null|
-|**2024-11-30**|**2-Factor Retrieval for Improved Human-AI Decision Making in Radiology**|Jim Solomon et.al.|[2412.00372v1](http://arxiv.org/abs/2412.00372v1)|null|
-|**2024-11-28**|**Mapping Public Perception of Artificial Intelligence: Expectations, Risk-Benefit Tradeoffs, and Value As Determinants for Societal Acceptance**|Philipp Brauner et.al.|[2411.19356v1](http://arxiv.org/abs/2411.19356v1)|null|
-|**2024-11-26**|**Explainable AI for Classifying UTI Risk Groups Using a Real-World Linked EHR and Pathology Lab Dataset**|Yujie Dai et.al.|[2411.17645v2](http://arxiv.org/abs/2411.17645v2)|null|
-|**2024-11-18**|**Exploring the Requirements of Clinicians for Explainable AI Decision Support Systems in Intensive Care**|Jeffrey N. Clark et.al.|[2411.11774v1](http://arxiv.org/abs/2411.11774v1)|null|
-|**2024-11-15**|**Artificial Intelligence in Pediatric Echocardiography: Exploring Challenges, Opportunities, and Clinical Applications with Explainable AI and Federated Learning**|Mohammed Yaseen Jabarulla et.al.|[2411.10255v1](http://arxiv.org/abs/2411.10255v1)|null|
-|**2024-11-01**|**Enhancing Osteoporosis Detection: An Explainable Multi-Modal Learning Framework with Feature Fusion and Variable Clustering**|Mehdi Hosseini Chagahi et.al.|[2411.00916v2](http://arxiv.org/abs/2411.00916v2)|null|
-|**2024-10-25**|**A Review of Deep Learning Approaches for Non-Invasive Cognitive Impairment Detection**|Muath Alsuhaibani et.al.|[2410.19898v1](http://arxiv.org/abs/2410.19898v1)|null|
-|**2024-10-23**|**An Ontology-Enabled Approach For User-Centered and Knowledge-Enabled Explanations of AI Systems**|Shruthi Chari et.al.|[2410.17504v1](http://arxiv.org/abs/2410.17504v1)|[link](https://github.com/tetherless-world/metaexplainer)|
-|**2024-10-22**|**Contrasting Attitudes Towards Current and Future AI Applications for Computerised Interpretation of ECG: A Clinical Stakeholder Interview Study**|Lukas Hughes-Noehrer et.al.|[2410.16879v1](http://arxiv.org/abs/2410.16879v1)|null|
-|**2024-10-19**|**Pathologist-like explainable AI for interpretable Gleason grading in prostate cancer**|Gesa Mittmann et.al.|[2410.15012v1](http://arxiv.org/abs/2410.15012v1)|null|
-|**2024-10-15**|**Explainable AI Methods for Multi-Omics Analysis: A Survey**|Ahmad Hussein et.al.|[2410.11910v1](http://arxiv.org/abs/2410.11910v1)|null|
-|**2024-10-14**|**Study on the Helpfulness of Explainable Artificial Intelligence**|Tobias Labarta et.al.|[2410.11896v1](http://arxiv.org/abs/2410.11896v1)|[link](https://github.com/tlabarta/helpfulnessofxai)|
-|**2024-10-12**|**Use of What-if Scenarios to Help Explain Artificial Intelligence Models for Neonatal Health**|Abdullah Mamun et.al.|[2410.09635v1](http://arxiv.org/abs/2410.09635v1)|[link](https://github.com/ab9mamun/aimen)|
-|**2024-10-10**|**Artificial intelligence techniques in inherited retinal diseases: A review**|Han Trinh et.al.|[2410.09105v1](http://arxiv.org/abs/2410.09105v1)|null|
-|**2024-10-07**|**CasiMedicos-Arg: A Medical Question Answering Dataset Annotated with Explanatory Argumentative Structures**|Ekaterina Sviridova et.al.|[2410.05235v2](http://arxiv.org/abs/2410.05235v2)|[link](https://github.com/ixa-ehu/antidote-casimedicos)|
-|**2024-10-01**|**Explainable Diagnosis Prediction through Neuro-Symbolic Integration**|Qiuhao Lu et.al.|[2410.01855v2](http://arxiv.org/abs/2410.01855v2)|null|
-|**2024-10-01**|**Easydiagnos: a framework for accurate feature selection for automatic diagnosis in smart healthcare**|Prasenjit Maji et.al.|[2410.00366v1](http://arxiv.org/abs/2410.00366v1)|null|
-|**2024-09-20**|**Dermatologist-like explainable AI enhances melanoma diagnosis accuracy: eye-tracking study**|Tirtha Chanda et.al.|[2409.13476v1](http://arxiv.org/abs/2409.13476v1)|null|
-|**2024-09-19**|**Explainable AI for Autism Diagnosis: Identifying Critical Brain Regions Using fMRI Data**|Suryansh Vidya et.al.|[2409.15374v1](http://arxiv.org/abs/2409.15374v1)|null|
-|**2024-09-19**|**Improving Prototypical Parts Abstraction for Case-Based Reasoning Explanations Designed for the Kidney Stone Type Recognition**|Daniel Flores-Araiza et.al.|[2409.12883v1](http://arxiv.org/abs/2409.12883v1)|null|
-|**2024-09-18**|**Towards Interpretable End-Stage Renal Disease (ESRD) Prediction: Utilizing Administrative Claims Data with Explainable AI Techniques**|Yubo Li et.al.|[2409.12087v3](http://arxiv.org/abs/2409.12087v3)|null|
-|**2024-09-13**|**Contextual Evaluation of Large Language Models for Classifying Tropical and Infectious Diseases**|Mercy Asiedu et.al.|[2409.09201v3](http://arxiv.org/abs/2409.09201v3)|null|
-|**2024-09-09**|**Explainable AI: Definition and attributes of a good explanation for health AI**|Evangelia Kyrimi et.al.|[2409.15338v1](http://arxiv.org/abs/2409.15338v1)|null|
-|**2024-08-30**|**Exploring the Effect of Explanation Content and Format on User Comprehension and Trust**|Antonio Rago et.al.|[2408.17401v1](http://arxiv.org/abs/2408.17401v1)|null|
-|**2024-08-29**|**A Survey for Large Language Models in Biomedicine**|Chong Wang et.al.|[2409.00133v1](http://arxiv.org/abs/2409.00133v1)|null|
-|**2024-08-27**|**Aligning XAI with EU Regulations for Smart Biomedical Devices: A Methodology for Compliance Analysis**|Francesco Sovrano et.al.|[2408.15121v1](http://arxiv.org/abs/2408.15121v1)|null|
-|**2024-08-24**|**Towards Case-based Interpretability for Medical Federated Learning**|Laura Latorre et.al.|[2408.13626v1](http://arxiv.org/abs/2408.13626v1)|null|
-|**2024-08-22**|**AI in radiological imaging of soft-tissue and bone tumours: a systematic review evaluating against CLAIM and FUTURE-AI guidelines**|Douwe J. Spaanderman et.al.|[2408.12491v1](http://arxiv.org/abs/2408.12491v1)|null|
-|**2024-08-14**|**Evaluating Explainable AI Methods in Deep Learning Models for Early Detection of Cerebral Palsy**|Kimji N. Pellano et.al.|[2409.00001v1](http://arxiv.org/abs/2409.00001v1)|null|
-|**2024-08-06**|**MicroXercise: A Micro-Level Comparative and Explainable System for Remote Physical Therapy**|Hanchen David Wang et.al.|[2408.11837v1](http://arxiv.org/abs/2408.11837v1)|null|
-|**2024-08-05**|**The Literature Review Network: An Explainable Artificial Intelligence for Systematic Literature Reviews, Meta-analyses, and Method Development**|Joshua Morriss et.al.|[2408.05239v1](http://arxiv.org/abs/2408.05239v1)|null|
-|**2024-08-05**|**Enhancing Medical Learning and Reasoning Systems: A Boxology-Based Comparative Analysis of Design Patterns**|Chi Him Ng et.al.|[2408.02709v1](http://arxiv.org/abs/2408.02709v1)|null|
-|**2024-08-05**|**Bayesian Kolmogorov Arnold Networks (Bayesian_KANs): A Probabilistic Approach to Enhance Accuracy and Interpretability**|Masoud Muhammed Hassan et.al.|[2408.02706v1](http://arxiv.org/abs/2408.02706v1)|null|
-|**2024-07-26**|**MLtoGAI: Semantic Web based with Machine Learning for Enhanced Disease Prediction and Personalized Recommendations using Generative AI**|Shyam Dongre et.al.|[2407.20284v1](http://arxiv.org/abs/2407.20284v1)|null|
-|**2024-07-25**|**Introducing δ-XAI: a novel sensitivity-based method for local AI explanations**|Alessandro De Carlo et.al.|[2407.18343v2](http://arxiv.org/abs/2407.18343v2)|null|
-|**2024-07-24**|**Enhanced Deep Learning Methodologies and MRI Selection Techniques for Dementia Diagnosis in the Elderly Population**|Nikolaos Ntampakis et.al.|[2407.17324v2](http://arxiv.org/abs/2407.17324v2)|null|
-|**2024-07-24**|**Using Large Language Models to Compare Explainable Models for Smart Home Human Activity Recognition**|Michele Fiori et.al.|[2408.06352v1](http://arxiv.org/abs/2408.06352v1)|null|
-|**2024-07-21**|**Explainable AI-based Intrusion Detection System for Industry 5.0: An Overview of the Literature, associated Challenges, the existing Solutions, and Potential Research Directions**|Naseem Khan et.al.|[2408.03335v1](http://arxiv.org/abs/2408.03335v1)|null|
-|**2024-07-18**|**A Comparative Study on Automatic Coding of Medical Letters with Explainability**|Jamie Glen et.al.|[2407.13638v1](http://arxiv.org/abs/2407.13638v1)|[link](https://github.com/Glenj01/Medical-Coding)|
-|**2024-07-09**|**Explainable AI for Enhancing Efficiency of DL-based Channel Estimation**|Abdul Karim Gizzini et.al.|[2407.07009v1](http://arxiv.org/abs/2407.07009v1)|null|
-|**2024-07-07**|**Explainable AI: Comparative Analysis of Normal and Dilated ResNet Models for Fundus Disease Classification**|P. N. Karthikayan et.al.|[2407.05440v2](http://arxiv.org/abs/2407.05440v2)|null|
-|**2024-07-03**|**A Survey on Trustworthiness in Foundation Models for Medical Image Analysis**|Congzhen Shi et.al.|[2407.15851v2](http://arxiv.org/abs/2407.15851v2)|null|
-|**2024-07-01**|**The Impact of an XAI-Augmented Approach on Binary Classification with Scarce Data**|Ximing Wen et.al.|[2407.06206v1](http://arxiv.org/abs/2407.06206v1)|null|
-|**2024-06-28**|**Can GPT-4 Help Detect Quit Vaping Intentions? An Exploration of Automatic Data Annotation Approach**|Sai Krishna Revanth Vuruma et.al.|[2407.00167v1](http://arxiv.org/abs/2407.00167v1)|null|
-|**2024-06-25**|**Towards Compositional Interpretability for XAI**|Sean Tull et.al.|[2406.17583v1](http://arxiv.org/abs/2406.17583v1)|null|
-|**2024-06-17**|**Slicing Through Bias: Explaining Performance Gaps in Medical Image Analysis using Slice Discovery Methods**|Vincent Olesen et.al.|[2406.12142v2](http://arxiv.org/abs/2406.12142v2)|[link](https://github.com/volesen/slicing-through-bias)|
-|**2024-06-11**|**Unlocking the Potential of Metaverse in Innovative and Immersive Digital Health**|Fatemeh Ebrahimzadeh et.al.|[2406.07114v2](http://arxiv.org/abs/2406.07114v2)|null|
-|**2024-06-10**|**AI-Driven Predictive Analytics Approach for Early Prognosis of Chronic Kidney Disease Using Ensemble Learning and Explainable AI**|K M Tawsik Jawad et.al.|[2406.06728v2](http://arxiv.org/abs/2406.06728v2)|null|
-|**2024-06-10**|**Explainable AI for Mental Disorder Detection via Social Media: A survey and outlook**|Yusif Ibrahimov et.al.|[2406.05984v1](http://arxiv.org/abs/2406.05984v1)|null|
-|**2024-06-09**|**Methodology and Real-World Applications of Dynamic Uncertain Causality Graph for Clinical Diagnosis with Explainability and Invariance**|Zhan Zhang et.al.|[2406.05746v1](http://arxiv.org/abs/2406.05746v1)|null|
-|**2024-06-07**|**Advancing Histopathology-Based Breast Cancer Diagnosis: Insights into Multi-Modality and Explainability**|Faseela Abdullakutty et.al.|[2406.12897v1](http://arxiv.org/abs/2406.12897v1)|null|
-|**2024-06-04**|**Using Explainable AI for EEG-based Reduced Montage Neonatal Seizure Detection**|Dinuka Sandun Udayantha et.al.|[2406.16908v3](http://arxiv.org/abs/2406.16908v3)|[link](https://github.com/dinuka-1999/braineocare)|
-|**2024-06-01**|**Breast Cancer Diagnosis: A Comprehensive Exploration of Explainable Artificial Intelligence (XAI) Techniques**|Samita Bai et.al.|[2406.00532v1](http://arxiv.org/abs/2406.00532v1)|null|
-|**2024-06-01**|**Unveiling Hidden Factors: Explainable AI for Feature Boosting in Speech Emotion Recognition**|Alaa Nfissi et.al.|[2406.01624v2](http://arxiv.org/abs/2406.01624v2)|[link](https://github.com/alaanfissi/unveiling-hidden-factors-explainable-ai-for-feature-boosting-in-speech-emotion-recognition)|
-|**2024-05-31**|**The Explanation Necessity for Healthcare AI**|Michail Mamalakis et.al.|[2406.00216v1](http://arxiv.org/abs/2406.00216v1)|null|
-|**2024-05-29**|**Interdisciplinary Expertise to Advance Equitable Explainable AI**|Chloe R. Bennett et.al.|[2406.18563v1](http://arxiv.org/abs/2406.18563v1)|null|
-|**2024-05-27**|**"It depends": Configuring AI to Improve Clinical Usefulness Across Contexts**|Hubert D. Zając et.al.|[2407.11978v1](http://arxiv.org/abs/2407.11978v1)|null|
-|**2024-05-26**|**Improving Health Professionals' Onboarding with AI and XAI for Trustworthy Human-AI Collaborative Decision Making**|Min Hun Lee et.al.|[2405.16424v1](http://arxiv.org/abs/2405.16424v1)|null|
-|**2024-05-26**|**Exploring Nutritional Impact on Alzheimer's Mortality: An Explainable AI Approach**|Ziming Liu et.al.|[2405.17502v1](http://arxiv.org/abs/2405.17502v1)|null|
-|**2024-05-24**|**Explainable AI Enhances Glaucoma Referrals, Yet the Human-AI Team Still Falls Short of the AI Alone**|Catalina Gomez et.al.|[2407.11974v1](http://arxiv.org/abs/2407.11974v1)|null|
-|**2024-05-23**|**Decoding Decision Reasoning: A Counterfactual-Powered Model for Knowledge Discovery**|Yingying Fang et.al.|[2406.18552v1](http://arxiv.org/abs/2406.18552v1)|null|
-|**2024-05-21**|**The Role of Emotions in Informational Support Question-Response Pairs in Online Health Communities: A Multimodal Deep Learning Approach**|Mohsen Jozani et.al.|[2405.13099v1](http://arxiv.org/abs/2405.13099v1)|null|
-|**2024-05-17**|**ChatGPT in Classrooms: Transforming Challenges into Opportunities in Education**|Harris Bin Munawar et.al.|[2405.10645v1](http://arxiv.org/abs/2405.10645v1)|null|
-|**2024-05-13**|**Evaluating the Explainable AI Method Grad-CAM for Breath Classification on Newborn Time Series Data**|Camelia Oprea et.al.|[2405.07590v1](http://arxiv.org/abs/2405.07590v1)|null|
-|**2024-05-10**|**XAI4LLM. Let Machine Learning Models and LLMs Collaborate for Enhanced In-Context Learning in Healthcare**|Fatemeh Nazary et.al.|[2405.06270v3](http://arxiv.org/abs/2405.06270v3)|null|
-|**2024-05-09**|**To Trust or Not to Trust: Towards a novel approach to measure trust for XAI systems**|Miquel Miró-Nicolau et.al.|[2405.05766v1](http://arxiv.org/abs/2405.05766v1)|null|
-|**2024-05-05**|**Region-specific Risk Quantification for Interpretable Prognosis of COVID-19**|Zhusi Zhong et.al.|[2405.02815v1](http://arxiv.org/abs/2405.02815v1)|[link](https://github.com/zzs95/RSP_COVID)|
-|**2024-04-26**|**Rad4XCNN: a new agnostic method for post-hoc global explanation of CNN-derived features by means of radiomics**|Francesco Prinzi et.al.|[2405.02334v2](http://arxiv.org/abs/2405.02334v2)|null|
-|**2024-04-25**|**Attributing Responsibility in AI-Induced Incidents: A Computational Reflective Equilibrium Framework for Accountability**|Yunfei Ge et.al.|[2404.16957v1](http://arxiv.org/abs/2404.16957v1)|null|
-|**2024-04-19**|**Explainable AI for Fair Sepsis Mortality Predictive Model**|Chia-Hsuan Chang et.al.|[2404.13139v1](http://arxiv.org/abs/2404.13139v1)|null|
-|**2024-04-19**|**Multi Class Depression Detection Through Tweets using Artificial Intelligence**|Muhammad Osama Nusrat et.al.|[2404.13104v1](http://arxiv.org/abs/2404.13104v1)|[link](https://github.com/mnusrat786/masters-thesis)|
-|**2024-04-19**|**COIN: Counterfactual inpainting for weakly supervised semantic segmentation for medical images**|Dmytro Shvetsov et.al.|[2404.12832v2](http://arxiv.org/abs/2404.12832v2)|[link](https://github.com/dmytro-shvetsov/counterfactual-search)|
-|**2024-04-15**|**Hybrid Intelligence for Digital Humanities**|Victor de Boer et.al.|[2406.15374v1](http://arxiv.org/abs/2406.15374v1)|null|
-|**2024-04-14**|**Ethical Framework for Responsible Foundational Models in Medical Imaging**|Abhijit Das et.al.|[2406.11868v1](http://arxiv.org/abs/2406.11868v1)|null|
-|**2024-04-09**|**Advancements in Radiomics and Artificial Intelligence for Thyroid Cancer Diagnosis**|Milad Yousefi et.al.|[2404.07239v1](http://arxiv.org/abs/2404.07239v1)|null|
-|**2024-04-06**|**Predictive Modeling for Breast Cancer Classification in the Context of Bangladeshi Patients: A Supervised Machine Learning Approach with Explainable AI**|Taminul Islam et.al.|[2404.04686v1](http://arxiv.org/abs/2404.04686v1)|null|
-|**2024-04-05**|**Enhancing Breast Cancer Diagnosis in Mammography: Evaluation and Integration of Convolutional Neural Networks and Explainable AI**|Maryam Ahmed et.al.|[2404.03892v3](http://arxiv.org/abs/2404.03892v3)|null|
-|**2024-03-30**|**Advancing Multimodal Data Fusion in Pain Recognition: A Strategy Leveraging Statistical Correlation and Human-Centered Perspectives**|Xingrui Gu et.al.|[2404.00320v2](http://arxiv.org/abs/2404.00320v2)|null|
-|**2024-03-26**|**Addressing Social Misattributions of Large Language Models: An HCXAI-based Approach**|Andrea Ferrario et.al.|[2403.17873v1](http://arxiv.org/abs/2403.17873v1)|null|
-|**2024-03-26**|**Clinical Domain Knowledge-Derived Template Improves Post Hoc AI Explanations in Pneumothorax Classification**|Han Yuan et.al.|[2403.18871v1](http://arxiv.org/abs/2403.18871v1)|[link](https://github.com/han-yuan-med/template-explanation)|
-|**2024-03-03**|**Enhancing Neural Machine Translation of Low-Resource Languages: Corpus Development, Human Evaluation and Explainable AI Architectures**|Séamus Lankford et.al.|[2403.01580v1](http://arxiv.org/abs/2403.01580v1)|null|
-|**2024-02-28**|**Cause and Effect: Can Large Language Models Truly Understand Causality?**|Swagata Ashwani et.al.|[2402.18139v3](http://arxiv.org/abs/2402.18139v3)|null|
-|**2024-02-28**|**Artificial Intelligence and Diabetes Mellitus: An Inside Look Through the Retina**|Yasin Sadeghi Bazargani et.al.|[2402.18600v1](http://arxiv.org/abs/2402.18600v1)|null|
-|**2024-02-22**|**Multi-stakeholder Perspective on Responsible Artificial Intelligence and Acceptability in Education**|A. J. Karran et.al.|[2402.15027v2](http://arxiv.org/abs/2402.15027v2)|null|
-|**2024-02-12**|**Deciphering Heartbeat Signatures: A Vision Transformer Approach to Explainable Atrial Fibrillation Detection from ECG Signals**|Aruna Mohan et.al.|[2402.09474v2](http://arxiv.org/abs/2402.09474v2)|null|
-
-#### Abstracts
-##### **Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**
-2502.09218v1 by Flavio Bertini, Alessandro Dal Palù, Federica Zaglio, Francesco Fabiano, Andrea Formisano
-
-This paper presents a complete explainable system that interprets a set of
-data, abstracts the underlying features and describes them in a natural
-language of choice. The system relies on two crucial stages: (i) identifying
-emerging properties from data and transforming them into abstract concepts, and
-(ii) converting these concepts into natural language. Despite the impressive
-natural language generation capabilities demonstrated by Large Language Models,
-their statistical nature and the intricacy of their internal mechanism still
-force us to employ these techniques as black boxes, forgoing trustworthiness.
-Developing an explainable pipeline for data interpretation would allow
-facilitating its use in safety-critical environments like processing medical
-information and allowing non-experts and visually impaired people to access
-narrated information. To this end, we believe that the fields of knowledge
-representation and automated reasoning research could present a valid
-alternative. Expanding on prior research that tackled the first stage (i), we
-focus on the second stage, named Concept2Text. Being explainable, data
-translation is easily modeled through logic-based rules, once again emphasizing
-the role of declarative programming in achieving AI explainability. This paper
-explores a Prolog/CLP-based rewriting system to interpret concepts-articulated
-in terms of classes and relations, plus common knowledge-derived from a generic
-ontology, generating natural language text. Its main features include
-hierarchical tree rewritings, modular multilingual generation, support for
-equivalent variants across semantic, grammar, and lexical levels, and a
-transparent rule-based system. We outline the architecture and demonstrate its
-flexibility through some examples capable of generating numerous diverse and
-equivalent rewritings based on the input concept.
-
-摘要：<paragraph>這篇論文提出了一個完整的可解釋系統，它可以解釋一組資料，抽象出基礎特徵，並以選擇的自然語言描述它們。系統依賴兩個關鍵階段：(i) 從資料中識別新興屬性，並將它們轉換為抽象概念，以及 (ii) 將這些概念轉換為自然語言。儘管大型語言模型展示了令人印象深刻的自然語言生成能力，但它們的統計性質和內部機制的複雜性仍然迫使我們將這些技術用作黑盒子，放棄可信度。開發一個可解釋的資料解釋管道將有助於促進在安全關鍵環境中使用它，例如處理醫療資訊，並允許非專家和視障人士存取敘述資訊。為此，我們相信知識表示和自動推理研究領域可以提出一個有效的替代方案。在擴展解決第一階段 (i) 的先前研究的基礎上，我們專注於第二階段，稱為 Concept2Text。由於具有可解釋性，資料翻譯很容易透過基於邏輯的規則建模，再次強調宣告式程式設計在實現 AI 可解釋性中的作用。本文探討了一個基於 Prolog/CLP 的重寫系統，以解釋概念，這些概念以類別和關係的形式表達，再加上從通用本体衍生的常識，產生自然語言文字。它的主要特點包括階層樹重寫、模組化多語言生成、支援語義、語法和詞彙層面的等效變體，以及一個透明的基於規則的系統。我們概述了架構，並透過一些範例展示了它的靈活性，這些範例能夠根據輸入概念生成許多不同的等效重寫。</paragraph>
-
-##### **Foundation Model of Electronic Medical Records for Adaptive Risk Estimation**
-2502.06124v1 by Pawel Renc, Michal K. Grzeszczyk, Nassim Oufattole, Deirdre Goode, Yugang Jia, Szymon Bieganski, Matthew B. A. McDermott, Jaroslaw Was, Anthony E. Samir, Jonathan W. Cunningham, David W. Bates, Arkadiusz Sitek
-
-We developed the Enhanced Transformer for Health Outcome Simulation (ETHOS),
-an AI model that tokenizes patient health timelines (PHTs) from EHRs. ETHOS
-predicts future PHTs using transformer-based architectures. The Adaptive Risk
-Estimation System (ARES) employs ETHOS to compute dynamic and personalized risk
-probabilities for clinician-defined critical events. ARES incorporates a
-personalized explainability module that identifies key clinical factors
-influencing risk estimates for individual patients. ARES was evaluated on the
-MIMIC-IV v2.2 dataset in emergency department (ED) settings, benchmarking its
-performance against traditional early warning systems and machine learning
-models. We processed 299,721 unique patients from MIMIC-IV into 285,622 PHTs,
-with 60% including hospital admissions. The dataset contained over 357 million
-tokens. ETHOS outperformed benchmark models in predicting hospital admissions,
-ICU admissions, and prolonged hospital stays, achieving superior AUC scores.
-ETHOS-based risk estimates demonstrated robustness across demographic subgroups
-with strong model reliability, confirmed via calibration curves. The
-personalized explainability module provides insights into patient-specific
-factors contributing to risk. ARES, powered by ETHOS, advances predictive
-healthcare AI by providing dynamic, real-time, and personalized risk estimation
-with patient-specific explainability to enhance clinician trust. Its
-adaptability and superior accuracy position it as a transformative tool for
-clinical decision-making, potentially improving patient outcomes and resource
-allocation in emergency and inpatient settings. We release the full code at
-github.com/ipolharvard/ethos-ares to facilitate future research.
-
-摘要：我們開發了增強型健康結果模擬轉換器 (ETHOS)，
-一種從電子健康紀錄 (EHR) 中將患者健康時間軸 (PHT) 標記化的 AI 模型。ETHOS
-使用基於轉換器的架構預測未來的 PHT。自適應風險評估系統 (ARES) 使用 ETHOS 計算由臨床醫生定義的危急事件的動態且個人化的風險機率。ARES 結合了個人化的可解釋性模組，可找出影響個別患者風險評估的主要臨床因素。ARES 在急診部門 (ED) 設定中針對 MIMIC-IV v2.2 資料集進行評估，並將其效能與傳統的預警系統和機器學習模型進行基準測試。我們將 299,721 位 MIMIC-IV 的獨特患者處理成 285,622 個 PHT，其中 60% 包含住院記錄。該資料集包含超過 3.57 億個標記。ETHOS 在預測住院、加護病房 (ICU) 住院和延長住院時間方面表現優於基準模型，並獲得了較高的 AUC 分數。基於 ETHOS 的風險評估顯示出跨人口統計子群的穩健性，並通過校準曲線確認了強大的模型可靠性。個人化的可解釋性模組提供了對導致風險的患者特定因素的見解。由 ETHOS 驅動的 ARES 透過提供動態、即時且個人化的風險評估，以及患者特定的可解釋性來增強臨床醫生的信任，從而推動了預測性醫療保健 AI 的發展。其適應性和卓越的準確性使其成為臨床決策制定的一種變革性工具，有可能改善緊急和住院環境中的患者結果和資源分配。我們在 github.com/ipolharvard/ethos-ares 上釋出完整程式碼，以利未來的研究。
-
-##### **An Explainable Disease Surveillance System for Early Prediction of Multiple Chronic Diseases**
-2501.15969v1 by Shaheer Ahmad Khan, Muhammad Usamah Shahid, Ahmad Abdullah, Ibrahim Hashmat, Muddassar Farooq
-
-This study addresses a critical gap in the healthcare system by developing a
-clinically meaningful, practical, and explainable disease surveillance system
-for multiple chronic diseases, utilizing routine EHR data from multiple U.S.
-practices integrated with CureMD's EMR/EHR system. Unlike traditional
-systems--using AI models that rely on features from patients' labs--our
-approach focuses on routinely available data, such as medical history, vitals,
-diagnoses, and medications, to preemptively assess the risks of chronic
-diseases in the next year. We trained three distinct models for each chronic
-disease: prediction models that forecast the risk of a disease 3, 6, and 12
-months before a potential diagnosis. We developed Random Forest models, which
-were internally validated using F1 scores and AUROC as performance metrics and
-further evaluated by a panel of expert physicians for clinical relevance based
-on inferences grounded in medical knowledge. Additionally, we discuss our
-implementation of integrating these models into a practical EMR system. Beyond
-using Shapley attributes and surrogate models for explainability, we also
-introduce a new rule-engineering framework to enhance the intrinsic
-explainability of Random Forests.
-
-摘要：本研究透過開發一個臨床有意義、實用且可解釋的多重慢性疾病疾病監測系統，來解決醫療保健系統中的重大缺口，利用整合 CureMD 的 EMR/EHR 系統，來自多個美國實務的例行 EHR 資料。與傳統系統不同的是，我們的做法著重在例行可得的資料，例如病歷、生命徵象、診斷和藥物，以預先評估未來一年慢性疾病的風險，而非仰賴病患實驗室特徵的 AI 模型。我們針對每種慢性疾病訓練了三個不同的模型：預測模型，用以預測在潛在診斷前 3、6 和 12 個月的疾病風險。我們開發了隨機森林模型，並使用 F1 分數和 AUROC 作為效能指標，進行內部驗證，並進一步由專家醫師小組根據植基於醫學知識的推論，評估其臨床相關性。此外，我們討論了將這些模型整合到實用 EMR 系統中的實作方式。除了使用 Shapley 屬性和代理模型來解釋外，我們還引進了一個新的規則工程架構，以增強隨機森林的內在可解釋性。
-
-##### **Ensuring Medical AI Safety: Explainable AI-Driven Detection and Mitigation of Spurious Model Behavior and Associated Data**
-2501.13818v1 by Frederik Pahde, Thomas Wiegand, Sebastian Lapuschkin, Wojciech Samek
-
-Deep neural networks are increasingly employed in high-stakes medical
-applications, despite their tendency for shortcut learning in the presence of
-spurious correlations, which can have potentially fatal consequences in
-practice. Detecting and mitigating shortcut behavior is a challenging task that
-often requires significant labeling efforts from domain experts. To alleviate
-this problem, we introduce a semi-automated framework for the identification of
-spurious behavior from both data and model perspective by leveraging insights
-from eXplainable Artificial Intelligence (XAI). This allows the retrieval of
-spurious data points and the detection of model circuits that encode the
-associated prediction rules. Moreover, we demonstrate how these shortcut
-encodings can be used for XAI-based sample- and pixel-level data annotation,
-providing valuable information for bias mitigation methods to unlearn the
-undesired shortcut behavior. We show the applicability of our framework using
-four medical datasets across two modalities, featuring controlled and
-real-world spurious correlations caused by data artifacts. We successfully
-identify and mitigate these biases in VGG16, ResNet50, and contemporary Vision
-Transformer models, ultimately increasing their robustness and applicability
-for real-world medical tasks.
-
-摘要：深度神经网络越来越多地用于高风险医疗应用中，尽管它们在存在虚假相关性的情况下倾向于捷径学习，这在实践中可能产生致命的后果。检测和缓解捷径行为是一项艰巨的任务，通常需要领域专家的大量标记工作。为了缓解这个问题，我们引入了一个半自动框架，用于从数据和模型的角度识别虚假行为，方法是利用可解释人工智能 (XAI) 的见解。这允许检索虚假数据点并检测对关联预测规则进行编码的模型电路。此外，我们演示了如何使用这些捷径编码进行基于 XAI 的样本和像素级数据注释，为偏差缓解方法提供有价值的信息，以消除不需要的捷径行为。我们使用跨越两种方式的四个医学数据集展示了我们框架的适用性，这些数据集具有由数据伪像引起的受控和真实世界虚假相关性。我们成功地识别并减轻了 VGG16、ResNet50 和当代 Vision Transformer 模型中的这些偏差，最终提高了它们的鲁棒性和在真实世界医疗任务中的适用性。
-
-##### **Enhanced Suicidal Ideation Detection from Social Media Using a CNN-BiLSTM Hybrid Model**
-2501.11094v1 by Mohaiminul Islam Bhuiyan, Nur Shazwani Kamarudin, Nur Hafieza Ismail
-
-Suicidal ideation detection is crucial for preventing suicides, a leading
-cause of death worldwide. Many individuals express suicidal thoughts on social
-media, offering a vital opportunity for early detection through advanced
-machine learning techniques. The identification of suicidal ideation in social
-media text is improved by utilising a hybrid framework that integrates
-Convolutional Neural Networks (CNN) and Bidirectional Long Short-Term Memory
-(BiLSTM), enhanced with an attention mechanism. To enhance the interpretability
-of the model's predictions, Explainable AI (XAI) methods are applied, with a
-particular focus on SHapley Additive exPlanations (SHAP), are incorporated. At
-first, the model managed to reach an accuracy of 92.81%. By applying
-fine-tuning and early stopping techniques, the accuracy improved to 94.29%. The
-SHAP analysis revealed key features influencing the model's predictions, such
-as terms related to mental health struggles. This level of transparency boosts
-the model's credibility while helping mental health professionals understand
-and trust the predictions. This work highlights the potential for improving the
-accuracy and interpretability of detecting suicidal tendencies, making a
-valuable contribution to the progress of mental health monitoring systems. It
-emphasizes the significance of blending powerful machine learning methods with
-explainability to develop reliable and impactful mental health solutions.
-
-摘要：自殺意念偵測對於預防自殺至關重要，而自殺是全球主要的死亡原因。許多人在社群媒體上表達自殺念頭，這提供了透過進階機器學習技術進行早期偵測的重要機會。透過整合卷積神經網路 (CNN) 和雙向長短期記憶 (BiLSTM) 的混合架構，並加入注意力機制，可以提升在社群媒體文字中辨識自殺意念的能力。為了加強模型預測的可解釋性，我們採用可解釋人工智慧 (XAI) 方法，特別著重於 SHapley 加法解釋 (SHAP)。一開始，模型成功達到 92.81% 的準確度。透過套用微調和早期停止技術，準確度提升至 94.29%。SHAP 分析揭露了影響模型預測的關鍵特徵，例如與心理健康困境相關的詞彙。這種透明度提升了模型的可信度，同時協助心理健康專業人員理解和信賴預測結果。這項工作突顯了提升偵測自殺傾向的準確度和可解釋性的潛力，為心理健康監控系統的進展做出寶貴的貢獻。它強調了將強大的機器學習方法與可解釋性相結合以開發可靠且有影響力的心理健康解決方案的重要性。
-
-##### **SEANN: A Domain-Informed Neural Network for Epidemiological Insights**
-2501.10273v1 by Jean-Baptiste Guimbaud, Marc Plantevit, Léa Maître, Rémy Cazabet
-
-In epidemiology, traditional statistical methods such as logistic regression,
-linear regression, and other parametric models are commonly employed to
-investigate associations between predictors and health outcomes. However,
-non-parametric machine learning techniques, such as deep neural networks
-(DNNs), coupled with explainable AI (XAI) tools, offer new opportunities for
-this task. Despite their potential, these methods face challenges due to the
-limited availability of high-quality, high-quantity data in this field. To
-address these challenges, we introduce SEANN, a novel approach for informed
-DNNs that leverages a prevalent form of domain-specific knowledge: Pooled
-Effect Sizes (PES). PESs are commonly found in published Meta-Analysis studies,
-in different forms, and represent a quantitative form of a scientific
-consensus. By direct integration within the learning procedure using a custom
-loss, we experimentally demonstrate significant improvements in the
-generalizability of predictive performances and the scientific plausibility of
-extracted relationships compared to a domain-knowledge agnostic neural network
-in a scarce and noisy data setting.
-
-摘要：在流行病學中，傳統的統計方法，例如邏輯迴歸、線性迴歸和其他參數模型通常用於調查預測因子與健康結果之間的關聯。然而，非參數機器學習技術，例如深度神經網路 (DNN)，結合可解釋的 AI (XAI) 工具，為這項任務提供了新的機會。儘管這些方法具有潛力，但由於該領域缺乏高品質、高數量資料，因此這些方法面臨挑戰。為了應對這些挑戰，我們引入了 SEANN，這是一種新穎的方法，用於獲取知識的 DNN，它利用了一種流行的領域特定知識形式：彙總效應量 (PES)。PES 通常以不同的形式出現在已發表的 Meta 分析研究中，並代表科學共識的量化形式。通過使用自訂損失函數直接整合在學習程序中，我們以實驗方式證明了預測效能的概括性以及與從缺乏領域知識的神經網路中提取的關係相比，科學合理性的顯著提升，且是在稀少且有雜訊的資料設定中。
-
-##### **Artificial Intelligence-Driven Clinical Decision Support Systems**
-2501.09628v1 by Muhammet Alkan, Idris Zakariyya, Samuel Leighton, Kaushik Bhargav Sivangi, Christos Anagnostopoulos, Fani Deligianni
-
-As artificial intelligence (AI) becomes increasingly embedded in healthcare
-delivery, this chapter explores the critical aspects of developing reliable and
-ethical Clinical Decision Support Systems (CDSS). Beginning with the
-fundamental transition from traditional statistical models to sophisticated
-machine learning approaches, this work examines rigorous validation strategies
-and performance assessment methods, including the crucial role of model
-calibration and decision curve analysis. The chapter emphasizes that creating
-trustworthy AI systems in healthcare requires more than just technical
-accuracy; it demands careful consideration of fairness, explainability, and
-privacy. The challenge of ensuring equitable healthcare delivery through AI is
-stressed, discussing methods to identify and mitigate bias in clinical
-predictive models. The chapter then delves into explainability as a cornerstone
-of human-centered CDSS. This focus reflects the understanding that healthcare
-professionals must not only trust AI recommendations but also comprehend their
-underlying reasoning. The discussion advances in an analysis of privacy
-vulnerabilities in medical AI systems, from data leakage in deep learning
-models to sophisticated attacks against model explanations. The text explores
-privacy-preservation strategies such as differential privacy and federated
-learning, while acknowledging the inherent trade-offs between privacy
-protection and model performance. This progression, from technical validation
-to ethical considerations, reflects the multifaceted challenges of developing
-AI systems that can be seamlessly and reliably integrated into daily clinical
-practice while maintaining the highest standards of patient care and data
-protection.
-
-摘要：隨著人工智慧 (AI) 在醫療保健中的應用日益普及，本章探討了開發可靠且符合道德標準的臨床決策支援系統 (CDSS) 的關鍵面向。從傳統統計模型到複雜機器學習方法的基本轉變開始，這項工作審查了嚴謹的驗證策略和效能評估方法，包括模型校準和決策曲線分析的關鍵角色。本章強調，在醫療保健中建立值得信賴的 AI 系統不只是技術上的準確性；它需要仔細考量公平性、可解釋性和隱私權。本章強調了透過 AI 確保公平的醫療保健服務的挑戰，並討論了識別和減輕臨床預測模型中偏差的方法。接著，本章深入探討可解釋性，作為以人為中心的 CDSS 的基石。這種關注反映了醫療保健專業人員不僅必須信任 AI 建議，還必須理解其背後的推理。討論進一步分析了醫療 AI 系統中的隱私漏洞，從深度學習模型中的資料外洩到針對模型解釋的複雜攻擊。本文探討了隱私保護策略，例如差分隱私和聯合學習，同時承認隱私保護和模型效能之間的固有取捨。這種從技術驗證到道德考量的進展，反映了開發 AI 系統的多面向挑戰，這些系統可以無縫且可靠地整合到日常臨床實務中，同時維持最高的病患照護和資料保護標準。
-
-##### **MedGrad E-CLIP: Enhancing Trust and Transparency in AI-Driven Skin Lesion Diagnosis**
-2501.06887v1 by Sadia Kamal, Tim Oates
-
-As deep learning models gain attraction in medical data, ensuring transparent
-and trustworthy decision-making is essential. In skin cancer diagnosis, while
-advancements in lesion detection and classification have improved accuracy, the
-black-box nature of these methods poses challenges in understanding their
-decision processes, leading to trust issues among physicians. This study
-leverages the CLIP (Contrastive Language-Image Pretraining) model, trained on
-different skin lesion datasets, to capture meaningful relationships between
-visual features and diagnostic criteria terms. To further enhance transparency,
-we propose a method called MedGrad E-CLIP, which builds on gradient-based
-E-CLIP by incorporating a weighted entropy mechanism designed for complex
-medical imaging like skin lesions. This approach highlights critical image
-regions linked to specific diagnostic descriptions. The developed integrated
-pipeline not only classifies skin lesions by matching corresponding
-descriptions but also adds an essential layer of explainability developed
-especially for medical data. By visually explaining how different features in
-an image relates to diagnostic criteria, this approach demonstrates the
-potential of advanced vision-language models in medical image analysis,
-ultimately improving transparency, robustness, and trust in AI-driven
-diagnostic systems.
-
-摘要：随着深度学习模型在医学数据中获得关注，确保透明且值得信赖的决策至关重要。在皮肤癌诊断中，虽然病灶检测和分类的进步提高了准确性，但这些方法的黑盒性质对理解其决策过程构成了挑战，导致医生之间的信任问题。本研究利用在不同皮肤病变数据集上训练的 CLIP（对比语言图像预训练）模型，以捕捉视觉特征和诊断标准术语之间的有意义关系。为了进一步提高透明度，我们提出了一种名为 MedGrad E-CLIP 的方法，该方法通过结合专为皮肤病变等复杂医学影像设计的加权熵机制，建立在基于梯度的 E-CLIP 之上。此方法突出了与特定诊断描述相关联的关键图像区域。开发的集成管道不仅通过匹配相应的描述对皮肤病变进行分类，还添加了一层专门为医学数据开发的基本可解释性。通过直观地解释图像中不同特征与诊断标准的关系，这种方法展示了高级视觉语言模型在医学图像分析中的潜力，最终提高了透明度、稳健性和对人工智能驱动的诊断系统的信任。
-
-##### **Explaining Humour Style Classifications: An XAI Approach to Understanding Computational Humour Analysis**
-2501.02891v1 by Mary Ogbuka Kenneth, Foaad Khosmood, Abbas Edalat
-
-Humour styles can have either a negative or a positive impact on well-being.
-Given the importance of these styles to mental health, significant research has
-been conducted on their automatic identification. However, the automated
-machine learning models used for this purpose are black boxes, making their
-prediction decisions opaque. Clarity and transparency are vital in the field of
-mental health. This paper presents an explainable AI (XAI) framework for
-understanding humour style classification, building upon previous work in
-computational humour analysis. Using the best-performing single model
-(ALI+XGBoost) from prior research, we apply comprehensive XAI techniques to
-analyse how linguistic, emotional, and semantic features contribute to humour
-style classification decisions. Our analysis reveals distinct patterns in how
-different humour styles are characterised and misclassified, with particular
-emphasis on the challenges in distinguishing affiliative humour from other
-styles. Through detailed examination of feature importance, error patterns, and
-misclassification cases, we identify key factors influencing model decisions,
-including emotional ambiguity, context misinterpretation, and target
-identification. The framework demonstrates significant utility in understanding
-model behaviour, achieving interpretable insights into the complex interplay of
-features that define different humour styles. Our findings contribute to both
-the theoretical understanding of computational humour analysis and practical
-applications in mental health, content moderation, and digital humanities
-research.
-
-摘要：幽默風格對幸福感可能產生負面或正面的影響。
-鑑於這些風格對心理健康的重要性，已經對其自動識別進行了大量研究。然而，用於此目的的自動機器學習模型是黑盒子，使得其預測決策不透明。清晰度和透明度在心理健康領域至關重要。本文提出了一個可解釋的 AI (XAI) 框架，用於理解幽默風格分類，建立在計算幽默分析的先前工作之上。使用先前研究中表現最好的單一模型 (ALI+XGBoost)，我們應用全面的 XAI 技術來分析語言、情緒和語義特徵如何影響幽默風格分類決策。我們的分析揭示了不同幽默風格如何被表徵和錯誤分類的不同模式，特別強調了區分聯屬幽默與其他風格的挑戰。通過仔細檢查特徵重要性、錯誤模式和錯誤分類案例，我們確定了影響模型決策的關鍵因素，包括情緒模糊、情境誤解和目標識別。該框架展示了在理解模型行為方面的顯著效用，實現了對定義不同幽默風格的特徵之間複雜相互作用的可解釋見解。我們的發現有助於計算幽默分析的理論理解和心理健康、內容審核和數字人文研究中的實際應用。
-
-##### **The Emotional Spectrum of LLMs: Leveraging Empathy and Emotion-Based Markers for Mental Health Support**
-2412.20068v1 by Alessandro De Grandi, Federico Ravenda, Andrea Raballo, Fabio Crestani
-
-The increasing demand for mental health services has highlighted the need for
-innovative solutions, particularly in the realm of psychological conversational
-AI, where the availability of sensitive data is scarce. In this work, we
-explored the development of a system tailored for mental health support with a
-novel approach to psychological assessment based on explainable emotional
-profiles in combination with empathetic conversational models, offering a
-promising tool for augmenting traditional care, particularly where immediate
-expertise is unavailable. Our work can be divided into two main parts,
-intrinsecaly connected to each other. First, we present RACLETTE, a
-conversational system that demonstrates superior emotional accuracy compared to
-state-of-the-art benchmarks in both understanding users' emotional states and
-generating empathetic responses during conversations, while progressively
-building an emotional profile of the user through their interactions. Second,
-we show how the emotional profiles of a user can be used as interpretable
-markers for mental health assessment. These profiles can be compared with
-characteristic emotional patterns associated with different mental disorders,
-providing a novel approach to preliminary screening and support.
-
-摘要：隨著對心理健康服務需求的增加，凸顯了創新解決方案的需求，特別是在心理對話式人工智慧領域，那裡缺乏敏感資料。在這項工作中，我們探索了開發一個針對心理健康支持的系統，採用一種基於可解釋的情緒特徵的新方法進行心理評估，結合同理心對話模式，提供了一個有前途的工具，用於擴充傳統照護，特別是在無法立即獲得專業知識的情況下。我們的工作可以分為兩個主要部分，彼此內在相關。首先，我們展示了 RACLETTE，一個對話系統，與最先進的基準相比，在理解使用者情緒狀態和在對話中產生同理心回應方面表現出優越的情緒準確性，同時透過他們的互動逐漸建立使用者的情緒特徵。其次，我們展示了使用者的情緒特徵如何可用作心理健康評估的可解釋標記。這些特徵可以與與不同心理疾病相關的典型情緒模式進行比較，提供了一種初步篩選和支持的新方法。
-
-##### **A Review on the Integration of Artificial Intelligence and Medical Imaging in IVF Ovarian Stimulation**
-2412.19688v1 by Jana Zakall, Birgit Pohn, Antonia Graf, Daniel Kovatchki, Arezoo Borji, Ragib Shahriar Islam, Hossam Haick, Heinz Strohmer, Sepideh Hatamikia
-
-Artificial intelligence (AI) has emerged as a powerful tool to enhance
-decision-making and optimize treatment protocols in in vitro fertilization
-(IVF). In particular, AI shows significant promise in supporting
-decision-making during the ovarian stimulation phase of the IVF process. This
-review evaluates studies focused on the applications of AI combined with
-medical imaging in ovarian stimulation, examining methodologies, outcomes, and
-current limitations. Our analysis of 13 studies on this topic reveals that,
-reveal that while AI algorithms demonstrated notable potential in predicting
-optimal hormonal dosages, trigger timing, and oocyte retrieval outcomes, the
-medical imaging data utilized predominantly came from two-dimensional (2D)
-ultrasound which mainly involved basic quantifications, such as follicle size
-and number, with limited use of direct feature extraction or advanced image
-analysis techniques. This points to an underexplored opportunity where advanced
-image analysis approaches, such as deep learning, and more diverse imaging
-modalities, like three-dimensional (3D) ultrasound, could unlock deeper
-insights. Additionally, the lack of explainable AI (XAI) in most studies raises
-concerns about the transparency and traceability of AI-driven decisions - key
-factors for clinical adoption and trust. Furthermore, many studies relied on
-single-center designs and small datasets, which limit the generalizability of
-their findings. This review highlights the need for integrating advanced
-imaging analysis techniques with explainable AI methodologies, as well as the
-importance of leveraging multicenter collaborations and larger datasets.
-Addressing these gaps has the potential to enhance ovarian stimulation
-management, paving the way for efficient, personalized, and data-driven
-treatment pathways that improve IVF outcomes.
-
-摘要：人工智慧（AI）已成為增強體外受精（IVF）決策制定和優化治療方案的強大工具。特別是，AI 在支持 IVF 過程中卵巢刺激階段的決策制定方面顯示出顯著的前景。本綜述評估了專注於 AI 結合卵巢刺激中的醫學影像應用、檢驗方法、結果和當前限制的研究。我們對 13 項關於此主題的研究分析顯示，雖然 AI 演算法在預測最佳荷爾蒙劑量、觸發時機和卵子取出結果方面表現出顯著的潛力，但所利用的醫學影像數據主要來自於二次元（2D）超音波，而二次元超音波主要涉及基本量化，例如濾泡大小和數量，且有限使用直接特徵提取或進階影像分析技術。這指向一個尚未探索的機會，例如深度學習等進階影像分析方法，以及更多元的影像模式，例如三維（3D）超音波，可以解鎖更深入的見解。此外，大多數研究缺乏可解釋 AI（XAI），這引起了人們對 AI 驅動決策的透明度和可追溯性的擔憂，而透明度和可追溯性是臨床採用和信任的關鍵因素。此外，許多研究依賴於單中心設計和小型數據集，這限制了其發現的普遍性。本綜述強調了將進階影像分析技術與可解釋 AI 方法整合起來的必要性，以及利用多中心合作和大型數據集的重要性。解決這些差距有可能增強卵巢刺激管理，為有效、個人化和數據驅動的治療途徑鋪平道路，進而改善 IVF 結果。
-
-##### **Enhancing Cancer Diagnosis with Explainable & Trustworthy Deep Learning Models**
-2412.17527v1 by Badaru I. Olumuyiwa, The Anh Han, Zia U. Shamszaman
-
-This research presents an innovative approach to cancer diagnosis and
-prediction using explainable Artificial Intelligence (XAI) and deep learning
-techniques. With cancer causing nearly 10 million deaths globally in 2020,
-early and accurate diagnosis is crucial. Traditional methods often face
-challenges in cost, accuracy, and efficiency. Our study develops an AI model
-that provides precise outcomes and clear insights into its decision-making
-process, addressing the "black box" problem of deep learning models. By
-employing XAI techniques, we enhance interpretability and transparency,
-building trust among healthcare professionals and patients. Our approach
-leverages neural networks to analyse extensive datasets, identifying patterns
-for cancer detection. This model has the potential to revolutionise diagnosis
-by improving accuracy, accessibility, and clarity in medical decision-making,
-possibly leading to earlier detection and more personalised treatment
-strategies. Furthermore, it could democratise access to high-quality
-diagnostics, particularly in resource-limited settings, contributing to global
-health equity. The model's applications extend beyond cancer diagnosis,
-potentially transforming various aspects of medical decision-making and saving
-millions of lives worldwide.
-
-摘要：本研究提出了一個創新的癌症診斷和預測方法，使用可解釋的人工智慧 (XAI) 和深度學習技術。由於癌症在 2020 年造成全球近 1,000 萬人死亡，因此早期準確的診斷至關重要。傳統方法通常面臨成本、準確性和效率方面的挑戰。我們的研究開發了一個 AI 模型，它提供精確的結果並清楚地了解其決策過程，解決了深度學習模型的「黑箱」問題。通過採用 XAI 技術，我們增強了解釋性和透明度，在醫療專業人員和患者之間建立信任。我們的做法利用神經網路分析廣泛的數據集，識別癌症檢測模式。這個模型有可能通過提高醫療決策的準確性、可及性和清晰度來革新診斷，可能導致更早的檢測和更個性化的治療策略。此外，它可以使更多人獲得高品質的診斷，特別是在資源有限的環境中，有助於全球健康公平。該模型的應用範圍不僅限於癌症診斷，還可能轉變醫療決策的各個方面，並拯救全球數百萬人的生命。
-
-##### **Towards Interpretable Radiology Report Generation via Concept Bottlenecks using a Multi-Agentic RAG**
-2412.16086v2 by Hasan Md Tusfiqur Alam, Devansh Srivastav, Md Abdul Kadir, Daniel Sonntag
-
-Deep learning has advanced medical image classification, but interpretability
-challenges hinder its clinical adoption. This study enhances interpretability
-in Chest X-ray (CXR) classification by using concept bottleneck models (CBMs)
-and a multi-agent Retrieval-Augmented Generation (RAG) system for report
-generation. By modeling relationships between visual features and clinical
-concepts, we create interpretable concept vectors that guide a multi-agent RAG
-system to generate radiology reports, enhancing clinical relevance,
-explainability, and transparency. Evaluation of the generated reports using an
-LLM-as-a-judge confirmed the interpretability and clinical utility of our
-model's outputs. On the COVID-QU dataset, our model achieved 81% classification
-accuracy and demonstrated robust report generation performance, with five key
-metrics ranging between 84% and 90%. This interpretable multi-agent framework
-bridges the gap between high-performance AI and the explainability required for
-reliable AI-driven CXR analysis in clinical settings. Our code is available at
-https://github.com/tifat58/IRR-with-CBM-RAG.git.
-
-摘要：深度學習已提升醫學影像分類，但可解釋性挑戰阻礙其臨床應用。本研究透過使用概念瓶頸模型 (CBM) 和多代理檢索增強生成 (RAG) 系統進行報告生成，來增強胸部 X 光 (CXR) 分類的可解釋性。透過建模視覺特徵與臨床概念之間的關係，我們建立可解釋的概念向量，引導多代理 RAG 系統生成放射報告，增強臨床相關性、可解釋性和透明度。使用 LLM 作為評審員對生成報告進行評估，確認了我們模型輸出的可解釋性和臨床效用。在 COVID-QU 資料集上，我們的模型達到了 81% 的分類準確率，並展示了穩健的報告生成效能，五項關鍵指標介於 84% 至 90% 之間。這個可解釋的多代理架構彌合了高性能 AI 與臨床環境中可靠的 AI 驅動 CXR 分析所需的解釋性之間的差距。我們的程式碼可於 https://github.com/tifat58/IRR-with-CBM-RAG.git 取得。
-
-##### **Critique of Impure Reason: Unveiling the reasoning behaviour of medical Large Language Models**
-2412.15748v1 by Shamus Sim, Tyrone Chen
-
-Background: Despite the current ubiquity of Large Language Models (LLMs)
-across the medical domain, there is a surprising lack of studies which address
-their reasoning behaviour. We emphasise the importance of understanding
-reasoning behaviour as opposed to high-level prediction accuracies, since it is
-equivalent to explainable AI (XAI) in this context. In particular, achieving
-XAI in medical LLMs used in the clinical domain will have a significant impact
-across the healthcare sector. Results: Therefore, we define the concept of
-reasoning behaviour in the specific context of medical LLMs. We then categorise
-and discuss the current state of the art of methods which evaluate reasoning
-behaviour in medical LLMs. Finally, we propose theoretical frameworks which can
-empower medical professionals or machine learning engineers to gain insight
-into the low-level reasoning operations of these previously obscure models.
-Conclusion: The subsequent increased transparency and trust in medical machine
-learning models by clinicians as well as patients will accelerate the
-integration, application as well as further development of medical AI for the
-healthcare system as a whole
-
-摘要：背景：儘管大型語言模型 (LLM) 目前在醫療領域無所不在，但令人驚訝的是，探討其推理行為的研究卻相當缺乏。我們強調了解推理行為而非高層級的預測準確度非常重要，因為在這種情況下，這等同於可解釋 AI (XAI)。尤其是在臨床領域中使用的醫療 LLM 中實現 XAI，將對整個醫療保健產業產生重大影響。結果：因此，我們在醫療 LLM 的特定背景下定義了推理行為的概念。接著我們分類並探討當前評估醫療 LLM 中推理行為的方法的最新技術。最後，我們提出理論架構，讓醫療專業人員或機器學習工程師得以深入了解這些先前模糊模型的低層級推理運算。結論：臨床醫生和患者對醫療機器學習模型的透明度和信任度隨之提升，將加速醫療 AI 在整個醫療保健系統中的整合、應用和進一步發展。
-
-##### **Cognition Chain for Explainable Psychological Stress Detection on Social Media**
-2412.14009v1 by Xin Wang, Boyan Gao, Yi Dai, Lei Cao, Liang Zhao, Yibo Yang, David Clifton
-
-Stress is a pervasive global health issue that can lead to severe mental
-health problems. Early detection offers timely intervention and prevention of
-stress-related disorders. The current early detection models perform "black
-box" inference suffering from limited explainability and trust which blocks the
-real-world clinical application. Thanks to the generative properties introduced
-by the Large Language Models (LLMs), the decision and the prediction from such
-models are semi-interpretable through the corresponding description. However,
-the existing LLMs are mostly trained for general purposes without the guidance
-of psychological cognitive theory. To this end, we first highlight the
-importance of prior theory with the observation of performance boosted by the
-chain-of-thoughts tailored for stress detection. This method termed Cognition
-Chain explicates the generation of stress through a step-by-step cognitive
-perspective based on cognitive appraisal theory with a progress pipeline:
-Stimulus $\rightarrow$ Evaluation $\rightarrow$ Reaction $\rightarrow$ Stress
-State, guiding LLMs to provide comprehensive reasoning explanations. We further
-study the benefits brought by the proposed Cognition Chain format by utilising
-it as a synthetic dataset generation template for LLMs instruction-tuning and
-introduce CogInstruct, an instruction-tuning dataset for stress detection. This
-dataset is developed using a three-stage self-reflective annotation pipeline
-that enables LLMs to autonomously generate and refine instructional data. By
-instruction-tuning Llama3 with CogInstruct, we develop CogLLM, an explainable
-stress detection model. Evaluations demonstrate that CogLLM achieves
-outstanding performance while enhancing explainability. Our work contributes a
-novel approach by integrating cognitive theories into LLM reasoning processes,
-offering a promising direction for future explainable AI research.
-
-摘要：壓力是一個普遍的全球性健康問題，可能會導致嚴重的精神
-健康問題。早期發現提供及時的干預和預防
-壓力相關疾病。目前的早期發現模型執行「黑
-盒子」推論，存在可解釋性和信任度有限的問題，阻礙了
-現實世界的臨床應用。多虧了大型語言模型 (LLM) 引入的生成屬性，此類
-模型的決策和預測通過對應描述具有半可解釋性。然而，
-現有的 LLM 主要針對一般用途進行訓練，沒有心理認知理論的指導。為此，我們首先強調
-先驗理論的重要性，並觀察到針對壓力檢測量身定制的思想鏈提升了性能。這種方法稱為認知
-鏈通過基於認知評估理論的循序漸進的認知視角闡明了壓力的產生，並具有進度管道：
-刺激 $\rightarrow$ 評估 $\rightarrow$ 反應 $\rightarrow$ 壓力
-狀態，指導 LLM 提供全面的推理解釋。我們進一步
-通過將其用作 LLM 指令調整的合成數據集生成模板來研究所提出的認知鏈格式帶來的優點，並介紹 CogInstruct，這是一個針對壓力檢測的指令調整數據集。這個
-數據集是使用一個三階段的自省標註管道開發的，使 LLM 能夠自主生成和優化指令數據。通過
-使用 CogInstruct 對 Llama3 進行指令調整，我們開發了 CogLLM，這是一個可解釋的
-壓力檢測模型。評估表明，CogLLM 在提高可解釋性的同時實現了出色的性能。我們的研究通過將認知理論整合到 LLM 推理過程中，提出了一種新穎的方法，
-為未來的可解釋人工智能研究提供了一個有希望的方向。
-
-##### **2-Factor Retrieval for Improved Human-AI Decision Making in Radiology**
-2412.00372v1 by Jim Solomon, Laleh Jalilian, Alexander Vilesov, Meryl Mathew, Tristan Grogan, Arash Bedayat, Achuta Kadambi
-
-Human-machine teaming in medical AI requires us to understand to what degree
-a trained clinician should weigh AI predictions. While previous work has shown
-the potential of AI assistance at improving clinical predictions, existing
-clinical decision support systems either provide no explainability of their
-predictions or use techniques like saliency and Shapley values, which do not
-allow for physician-based verification. To address this gap, this study
-compares previously used explainable AI techniques with a newly proposed
-technique termed '2-factor retrieval (2FR)', which is a combination of
-interface design and search retrieval that returns similarly labeled data
-without processing this data. This results in a 2-factor security blanket
-where: (a) correct images need to be retrieved by the AI; and (b) humans should
-associate the retrieved images with the current pathology under test. We find
-that when tested on chest X-ray diagnoses, 2FR leads to increases in clinician
-accuracy, with particular improvements when clinicians are radiologists and
-have low confidence in their decision. Our results highlight the importance of
-understanding how different modes of human-AI decision making may impact
-clinician accuracy in clinical decision support systems.
-
-摘要：人機協作在醫療 AI 中，需要我們理解受過訓練的臨床醫生在多大程度上應重視 AI 預測。雖然先前的研究顯示 AI 輔助在改善臨床預測方面的潛力，但現有的臨床決策支援系統，要不就沒有提供預測的可解釋性，要不就是使用像顯著性和 Shapley 值之類的技術，這些技術不允許基於醫生的驗證。為了解決這個差距，本研究將先前使用的可解釋 AI 技術與一種新提出的稱為「2 因子檢索 (2FR)」的技術進行比較，後者是一種介面設計和搜尋檢索的組合，它會傳回標籤相似的資料，而不會處理這些資料。這會產生一個 2 因子安全機制，其中：(a) 正確的影像需要由 AI 檢索；(b) 人類應將檢索的影像與正在測試中的病理聯想起來。我們發現，當在胸部 X 光診斷上進行測試時，2FR 會提高臨床醫生的準確度，特別是在臨床醫生是放射科醫生且對其決策信心不足時，會有顯著的改善。我們的結果強調了理解人機決策的不同模式如何影響臨床醫生在臨床決策支援系統中的準確性的重要性。
-
-##### **Mapping Public Perception of Artificial Intelligence: Expectations, Risk-Benefit Tradeoffs, and Value As Determinants for Societal Acceptance**
-2411.19356v1 by Philipp Brauner, Felix Glawe, Gian Luca Liehner, Luisa Vervier, Martina Ziefle
-
-Understanding public perception of artificial intelligence (AI) and the
-tradeoffs between potential risks and benefits is crucial, as these perceptions
-might shape policy decisions, influence innovation trajectories for successful
-market strategies, and determine individual and societal acceptance of AI
-technologies. Using a representative sample of 1100 participants from Germany,
-this study examines mental models of AI. Participants quantitatively evaluated
-71 statements about AI's future capabilities (e.g., autonomous driving, medical
-care, art, politics, warfare, and societal divides), assessing the expected
-likelihood of occurrence, perceived risks, benefits, and overall value. We
-present rankings of these projections alongside visual mappings illustrating
-public risk-benefit tradeoffs. While many scenarios were deemed likely,
-participants often associated them with high risks, limited benefits, and low
-overall value. Across all scenarios, 96.4% ($r^2=96.4\%$) of the variance in
-value assessment can be explained by perceived risks ($\beta=-.504$) and
-perceived benefits ($\beta=+.710$), with no significant relation to expected
-likelihood. Demographics and personality traits influenced perceptions of
-risks, benefits, and overall evaluations, underscoring the importance of
-increasing AI literacy and tailoring public information to diverse user needs.
-These findings provide actionable insights for researchers, developers, and
-policymakers by highlighting critical public concerns and individual factors
-essential to align AI development with individual values.
-
-摘要：<paragraph>了解公眾對人工智慧 (AI) 的認知以及潛在風險與好處之間的權衡至關重要，因為這些認知可能會影響政策決策、影響成功市場策略的創新軌跡，並決定個人和社會對 AI 技術的接受度。本研究使用來自德國的 1100 名參與者的代表性樣本，探討了 AI 的心智模型。參與者對 71 項關於 AI 未來能力的陳述（例如，自動駕駛、醫療保健、藝術、政治、戰爭和社會分歧）進行了定量評估，評估預期的發生可能性、感知風險、好處和整體價值。我們展示了這些預測的排名，並附上視覺化映射，說明了公眾的風險收益權衡。儘管許多場景被認為是可能的，但參與者通常將它們與高風險、有限的好處和低整體價值聯繫起來。在所有場景中，96.4% ($r^2=96.4\%$) 的價值評估差異可以用感知風險 ($\beta=-.504$) 和感知好處 ($\beta=+.710$) 來解釋，與預期的可能性沒有顯著關係。人口統計和人格特質影響了對風險、好處和整體評估的看法，這凸顯了提高 AI 素養和根據不同的使用者需求調整公共資訊的重要性。這些發現通過強調關鍵的公共關注和與個人價值觀一致的 AI 開發必不可少的個人因素，為研究人員、開發人員和政策制定者提供了可行的見解。</paragraph>
-
-##### **Explainable AI for Classifying UTI Risk Groups Using a Real-World Linked EHR and Pathology Lab Dataset**
-2411.17645v2 by Yujie Dai, Brian Sullivan, Axel Montout, Amy Dillon, Chris Waller, Peter Acs, Rachel Denholm, Philip Williams, Alastair D Hay, Raul Santos-Rodriguez, Andrew Dowsey
-
-The use of machine learning and AI on electronic health records (EHRs) holds
-substantial potential for clinical insight. However, this approach faces
-challenges due to data heterogeneity, sparsity, temporal misalignment, and
-limited labeled outcomes. In this context, we leverage a linked EHR dataset of
-approximately one million de-identified individuals from Bristol, North
-Somerset, and South Gloucestershire, UK, to characterize urinary tract
-infections (UTIs). We implemented a data pre-processing and curation pipeline
-that transforms the raw EHR data into a structured format suitable for
-developing predictive models focused on data fairness, accountability and
-transparency. Given the limited availability and biases of ground truth UTI
-outcomes, we introduce a UTI risk estimation framework informed by clinical
-expertise to estimate UTI risk across individual patient timelines. Pairwise
-XGBoost models are trained using this framework to differentiate UTI risk
-categories with explainable AI techniques applied to identify key predictors
-and support interpretability. Our findings reveal differences in clinical and
-demographic predictors across risk groups. While this study highlights the
-potential of AI-driven insights to support UTI clinical decision-making,
-further investigation of patient sub-strata and extensive validation are needed
-to ensure robustness and applicability in clinical practice.
-
-摘要：電子健康紀錄 (EHR) 中機器學習和 AI 的使用對於臨床見解具有相當大的潛力。然而，由於資料異質性、稀疏性、時間錯位和標籤結果有限，此方法面臨挑戰。在此背景下，我們利用來自英國布里斯托、北薩默塞特和南格洛斯特郡約一百萬名去識別個人連結的 EHR 資料集，來描述尿路感染 (UTI)。我們實施了將原始 EHR 資料轉換為結構化格式的資料前處理和整理管線，適合開發專注於資料公平性、問責制和透明度的預測模型。鑑於 UTI 真實結果的可用性有限和偏差，我們引入了由臨床專業知識告知的 UTI 風險評估架構，以估計個別患者時間軸上的 UTI 風險。成對的 XGBoost 模型使用此架構進行訓練，以區分 UTI 風險類別，並應用可解釋的 AI 技術來識別關鍵預測因子並支持可解釋性。我們的研究結果揭示了不同風險群組在臨床和人口統計預測因子上的差異。雖然這項研究強調了 AI 驅動見解在支援 UTI 臨床決策制定方面的潛力，但仍需要進一步調查患者子群體和廣泛驗證，以確保在臨床實務中的穩健性和適用性。
-
-##### **Exploring the Requirements of Clinicians for Explainable AI Decision Support Systems in Intensive Care**
-2411.11774v1 by Jeffrey N. Clark, Matthew Wragg, Emily Nielsen, Miquel Perello-Nieto, Nawid Keshtmand, Michael Ambler, Shiv Sharma, Christopher P. Bourdeaux, Amberly Brigden, Raul Santos-Rodriguez
-
-There is a growing need to understand how digital systems can support
-clinical decision-making, particularly as artificial intelligence (AI) models
-become increasingly complex and less human-interpretable. This complexity
-raises concerns about trustworthiness, impacting safe and effective adoption of
-such technologies. Improved understanding of decision-making processes and
-requirements for explanations coming from decision support tools is a vital
-component in providing effective explainable solutions. This is particularly
-relevant in the data-intensive, fast-paced environments of intensive care units
-(ICUs). To explore these issues, group interviews were conducted with seven ICU
-clinicians, representing various roles and experience levels. Thematic analysis
-revealed three core themes: (T1) ICU decision-making relies on a wide range of
-factors, (T2) the complexity of patient state is challenging for shared
-decision-making, and (T3) requirements and capabilities of AI decision support
-systems. We include design recommendations from clinical input, providing
-insights to inform future AI systems for intensive care.
-
-摘要：隨著人工智慧 (AI) 模型變得越來越複雜，且越來越難以被人理解，了解數位系統如何支援臨床決策的需求也日益增加。這種複雜性引發了對可信度的疑慮，影響了此類技術的安全且有效採用。改善對決策制定流程的理解，以及對決策支援工具所提供說明的要求，是提供有效可解釋解決方案的重要組成部分。這在資料密集、快節奏的加護病房 (ICU) 環境中特別相關。為了探討這些問題，對七位 ICU 臨床醫師進行了小組訪談，這些醫師代表了不同的角色和經驗層級。主題分析揭露了三個核心主題：(T1) ICU 決策制定依賴於廣泛的因素，(T2) 病患狀態的複雜性對共同決策制定構成挑戰，以及 (T3) AI 決策支援系統的要求和能力。我們納入了臨床輸入的設計建議，提供見解以提供資訊給未來用於加護的 AI 系統。
-
-##### **Artificial Intelligence in Pediatric Echocardiography: Exploring Challenges, Opportunities, and Clinical Applications with Explainable AI and Federated Learning**
-2411.10255v1 by Mohammed Yaseen Jabarulla, Theodor Uden, Thomas Jack, Philipp Beerbaum, Steffen Oeltze-Jafra
-
-Pediatric heart diseases present a broad spectrum of congenital and acquired
-diseases. More complex congenital malformations require a differentiated and
-multimodal decision-making process, usually including echocardiography as a
-central imaging method. Artificial intelligence (AI) offers considerable
-promise for clinicians by facilitating automated interpretation of pediatric
-echocardiography data. However, adapting AI technologies for pediatric
-echocardiography analysis has challenges such as limited public data
-availability, data privacy, and AI model transparency. Recently, researchers
-have focused on disruptive technologies, such as federated learning (FL) and
-explainable AI (XAI), to improve automatic diagnostic and decision support
-workflows. This study offers a comprehensive overview of the limitations and
-opportunities of AI in pediatric echocardiography, emphasizing the synergistic
-workflow and role of XAI and FL, identifying research gaps, and exploring
-potential future developments. Additionally, three relevant clinical use cases
-demonstrate the functionality of XAI and FL with a focus on (i) view
-recognition, (ii) disease classification, (iii) segmentation of cardiac
-structures, and (iv) quantitative assessment of cardiac function.
-
-摘要：小兒心臟疾病呈現先天性與後天性疾病的廣泛光譜。較複雜的先天性畸形需要一個差異化且多模式的決策過程，通常包括超音波檢查作為主要的影像方法。人工智慧 (AI) 為臨床醫生提供了相當大的希望，因為它可以促進小兒超音波檢查資料的自動化解讀。然而，將人工智慧技術應用於小兒超音波檢查分析有許多挑戰，例如有限的公開資料可用性、資料隱私和人工智慧模型透明度。最近，研究人員專注於破壞性技術，例如聯合學習 (FL) 和可解釋人工智慧 (XAI)，以改善自動診斷和決策支援工作流程。本研究提供了人工智慧在小兒超音波檢查中的限制和機會的全面概述，強調了 XAI 和 FL 的協同工作流程和角色，找出研究差距並探討潛在的未來發展。此外，三個相關的臨床使用案例展示了 XAI 和 FL 的功能，重點在於 (i) 檢視辨識、(ii) 疾病分類、(iii) 心臟結構分割和 (iv) 心臟功能的量化評估。
-
-##### **Enhancing Osteoporosis Detection: An Explainable Multi-Modal Learning Framework with Feature Fusion and Variable Clustering**
-2411.00916v2 by Mehdi Hosseini Chagahi, Saeed Mohammadi Dashtaki, Niloufar Delfan, Nadia Mohammadi, Alireza Samari, Behzad Moshiri, Md. Jalil Piran, Oliver Faust
-
-Osteoporosis is a common condition that increases fracture risk, especially
-in older adults. Early diagnosis is vital for preventing fractures, reducing
-treatment costs, and preserving mobility. However, healthcare providers face
-challenges like limited labeled data and difficulties in processing medical
-images. This study presents a novel multi-modal learning framework that
-integrates clinical and imaging data to improve diagnostic accuracy and model
-interpretability. The model utilizes three pre-trained networks-VGG19,
-InceptionV3, and ResNet50-to extract deep features from X-ray images. These
-features are transformed using PCA to reduce dimensionality and focus on the
-most relevant components. A clustering-based selection process identifies the
-most representative components, which are then combined with preprocessed
-clinical data and processed through a fully connected network (FCN) for final
-classification. A feature importance plot highlights key variables, showing
-that Medical History, BMI, and Height were the main contributors, emphasizing
-the significance of patient-specific data. While imaging features were
-valuable, they had lower importance, indicating that clinical data are crucial
-for accurate predictions. This framework promotes precise and interpretable
-predictions, enhancing transparency and building trust in AI-driven diagnoses
-for clinical integration.
-
-摘要：骨質疏鬆症是一種常見的疾病，會增加骨折的風險，特別是老年人。早期診斷對於預防骨折、降低治療成本和維持行動能力至關重要。然而，醫療保健提供者面臨著標記數據有限和處理醫學影像困難等挑戰。本研究提出了一個新穎的多模式學習框架，該框架整合了臨床和影像數據，以提高診斷準確性和模型可解釋性。該模型利用三個預訓練的網路，VGG19、InceptionV3 和 ResNet50，從 X 射線影像中提取深度特徵。這些特徵使用 PCA 轉換以降低維度並專注於最相關的組成部分。基於聚類的選擇過程識別出最具代表性的組成部分，然後將這些組成部分與預處理的臨床數據結合，並通過全連接網路 (FCN) 進行最終分類。特徵重要性圖突出了關鍵變數，表明病史、BMI 和身高是主要貢獻因素，強調了患者特定數據的重要性。雖然影像特徵很有價值，但它們的重要性較低，這表明臨床數據對於準確預測至關重要。此框架促进了準確且可解釋的預測，提高了透明度，並建立了對 AI 驅動診斷在臨床整合中的信任。
+|**2025-02-13**|**Visual Graph Question Answering with ASP and LLMs for Language Parsing**|Jakob Johannes Bauer et.al.|[2502.09211v1](http://arxiv.org/abs/2502.09211v1)|null|
+|**2025-02-12**|**Representation Learning to Advance Multi-institutional Studies with Electronic Health Record Data**|Doudou Zhou et.al.|[2502.08547v1](http://arxiv.org/abs/2502.08547v1)|null|
+|**2025-02-12**|**Trustworthy GNNs with LLMs: A Systematic Review and Taxonomy**|Ruizhan Xue et.al.|[2502.08353v1](http://arxiv.org/abs/2502.08353v1)|null|
+|**2025-02-12**|**Graph Foundation Models for Recommendation: A Comprehensive Survey**|Bin Wu et.al.|[2502.08346v1](http://arxiv.org/abs/2502.08346v1)|null|
+|**2025-02-12**|**Self-Evaluation for Job-Shop Scheduling**|Imanol Echeverria et.al.|[2502.08684v1](http://arxiv.org/abs/2502.08684v1)|null|
+|**2025-02-12**|**Improving Existing Optimization Algorithms with LLMs**|Camilo Chacón Sartori et.al.|[2502.08298v1](http://arxiv.org/abs/2502.08298v1)|null|
+|**2025-02-12**|**ACCESS : A Benchmark for Abstract Causal Event Discovery and Reasoning**|Vy Vo et.al.|[2502.08148v1](http://arxiv.org/abs/2502.08148v1)|null|
+|**2025-02-12**|**GCoT: Chain-of-Thought Prompt Learning for Graphs**|Xingtong Yu et.al.|[2502.08092v1](http://arxiv.org/abs/2502.08092v1)|null|
+|**2025-02-11**|**Deep Semantic Graph Learning via LLM based Node Enhancement**|Chuanqi Shi et.al.|[2502.07982v1](http://arxiv.org/abs/2502.07982v1)|null|
+|**2025-02-10**|**Cardiverse: Harnessing LLMs for Novel Card Game Prototyping**|Danrui Li et.al.|[2502.07128v1](http://arxiv.org/abs/2502.07128v1)|null|
+|**2025-02-10**|**GraNNite: Enabling High-Performance Execution of Graph Neural Networks on Resource-Constrained Neural Processing Units**|Arghadip Das et.al.|[2502.06921v2](http://arxiv.org/abs/2502.06921v2)|[link](https://github.com/arghadippurdue/GraNNite)|
+|**2025-02-10**|**Automatic Annotation Augmentation Boosts Translation between Molecules and Natural Language**|Zhiqiang Zhong et.al.|[2502.06634v1](http://arxiv.org/abs/2502.06634v1)|null|
+|**2025-02-10**|**KARMA: Leveraging Multi-Agent LLMs for Automated Knowledge Graph Enrichment**|Yuxing Lu et.al.|[2502.06472v1](http://arxiv.org/abs/2502.06472v1)|null|
+|**2025-02-10**|**RoToR: Towards More Reliable Responses for Order-Invariant Inputs**|Soyoung Yoon et.al.|[2502.08662v1](http://arxiv.org/abs/2502.08662v1)|null|
+|**2025-02-10**|**K-ON: Stacking Knowledge On the Head Layer of Large Language Model**|Lingbing Guo et.al.|[2502.06257v1](http://arxiv.org/abs/2502.06257v1)|null|
+|**2025-02-10**|**LegalViz: Legal Text Visualization by Text To Diagram Generation**|Eri Onami et.al.|[2502.06147v2](http://arxiv.org/abs/2502.06147v2)|null|
+|**2025-02-09**|**Deconstructing Depression Stigma: Integrating AI-driven Data Collection and Analysis with Causal Knowledge Graphs**|Han Meng et.al.|[2502.06075v1](http://arxiv.org/abs/2502.06075v1)|null|
+|**2025-02-09**|**LegalSeg: Unlocking the Structure of Indian Legal Judgments Through Rhetorical Role Classification**|Shubham Kumar Nigam et.al.|[2502.05836v1](http://arxiv.org/abs/2502.05836v1)|null|
+|**2025-02-08**|**LLM-Powered Decentralized Generative Agents with Adaptive Hierarchical Knowledge Graph for Cooperative Planning**|Hanqing Yang et.al.|[2502.05453v1](http://arxiv.org/abs/2502.05453v1)|null|
+|**2025-02-08**|**SAMGPT: Text-free Graph Foundation Model for Multi-domain Pre-training and Cross-domain Adaptation**|Xingtong Yu et.al.|[2502.05424v1](http://arxiv.org/abs/2502.05424v1)|null|
+|**2025-02-08**|**Graph-based Molecular In-context Learning Grounded on Morgan Fingerprints**|Ali Al-Lawati et.al.|[2502.05414v1](http://arxiv.org/abs/2502.05414v1)|null|
+|**2025-02-08**|**Knowledge Graph-Guided Retrieval Augmented Generation**|Xiangrong Zhu et.al.|[2502.06864v1](http://arxiv.org/abs/2502.06864v1)|[link](https://github.com/nju-websoft/KG2RAG)|
+|**2025-02-07**|**Can Large Language Models Understand Intermediate Representations?**|Hailong Jiang et.al.|[2502.06854v1](http://arxiv.org/abs/2502.06854v1)|null|
+|**2025-02-07**|**GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity?**|Yang Zhou et.al.|[2502.05252v1](http://arxiv.org/abs/2502.05252v1)|null|
+|**2025-02-07**|**Causality can systematically address the monsters under the bench(marks)**|Felix Leeb et.al.|[2502.05085v1](http://arxiv.org/abs/2502.05085v1)|null|
+|**2025-02-07**|**Adaptive Graph of Thoughts: Test-Time Adaptive Reasoning Unifying Chain, Tree, and Graph Structures**|Tushar Pandey et.al.|[2502.05078v1](http://arxiv.org/abs/2502.05078v1)|[link](https://github.com/AgnostiqHQ/multi-agent-llm)|
+|**2025-02-07**|**Enhancing Knowledge Graph Construction: Evaluating with Emphasis on Hallucination, Omission, and Graph Similarity Metrics**|Hussam Ghanem et.al.|[2502.05239v1](http://arxiv.org/abs/2502.05239v1)|null|
+|**2025-02-07**|**Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research**|Junde Wu et.al.|[2502.04644v1](http://arxiv.org/abs/2502.04644v1)|[link](https://github.com/theworldofagents/agentic-reasoning)|
+|**2025-02-07**|**Position-aware Automatic Circuit Discovery**|Tal Haklay et.al.|[2502.04577v1](http://arxiv.org/abs/2502.04577v1)|[link](https://github.com/technion-cs-nlp/peap)|
+|**2025-02-06**|**Heterogeneous Swarms: Jointly Optimizing Model Roles and Weights for Multi-LLM Systems**|Shangbin Feng et.al.|[2502.04510v1](http://arxiv.org/abs/2502.04510v1)|null|
+|**2025-02-06**|**MedRAG: Enhancing Retrieval-augmented Generation with Knowledge Graph-Elicited Reasoning for Healthcare Copilot**|Xuejiao Zhao et.al.|[2502.04413v1](http://arxiv.org/abs/2502.04413v1)|[link](https://github.com/snowteam2023/medrag)|
+|**2025-02-06**|**Ontology-Guided, Hybrid Prompt Learning for Generalization in Knowledge Graph Question Answering**|Longquan Jiang et.al.|[2502.03992v1](http://arxiv.org/abs/2502.03992v1)|[link](https://github.com/longquanjiang/ontoscprompt)|
+|**2025-02-06**|**Multimodal Medical Code Tokenizer**|Xiaorui Su et.al.|[2502.04397v2](http://arxiv.org/abs/2502.04397v2)|null|
+|**2025-02-06**|**Division-of-Thoughts: Harnessing Hybrid Language Model Synergy for Efficient On-Device Agents**|Chenyang Shao et.al.|[2502.04392v1](http://arxiv.org/abs/2502.04392v1)|null|
+|**2025-02-06**|**Boosting Knowledge Graph-based Recommendations through Confidence-Aware Augmentation with Large Language Models**|Rui Cai et.al.|[2502.03715v1](http://arxiv.org/abs/2502.03715v1)|null|
+|**2025-02-05**|**A Schema-Guided Reason-while-Retrieve framework for Reasoning on Scene Graphs with Large-Language-Models (LLMs)**|Yiye Chen et.al.|[2502.03450v1](http://arxiv.org/abs/2502.03450v1)|null|
+|**2025-02-05**|**SymAgent: A Neural-Symbolic Self-Learning Agent Framework for Complex Reasoning over Knowledge Graphs**|Ben Liu et.al.|[2502.03283v1](http://arxiv.org/abs/2502.03283v1)|null|
+|**2025-02-05**|**Analyze Feature Flow to Enhance Interpretation and Steering in Language Models**|Daniil Laptev et.al.|[2502.03032v2](http://arxiv.org/abs/2502.03032v2)|null|
+|**2025-02-05**|**A Benchmark for the Detection of Metalinguistic Disagreements between LLMs and Knowledge Graphs**|Bradley P. Allen et.al.|[2502.02896v1](http://arxiv.org/abs/2502.02896v1)|null|
+|**2025-02-05**|**Mol-LLM: Generalist Molecular LLM with Improved Graph Utilization**|Chanhui Lee et.al.|[2502.02810v1](http://arxiv.org/abs/2502.02810v1)|null|
+|**2025-02-05**|**Leveraging the true depth of LLMs**|Ramón Calvo González et.al.|[2502.02790v1](http://arxiv.org/abs/2502.02790v1)|null|
+|**2025-02-04**|**Modular Training of Neural Networks aids Interpretability**|Satvik Golechha et.al.|[2502.02470v2](http://arxiv.org/abs/2502.02470v2)|null|
+|**2025-02-04**|**Premise-Augmented Reasoning Chains Improve Error Identification in Math reasoning with LLMs**|Sagnik Mukherjee et.al.|[2502.02362v3](http://arxiv.org/abs/2502.02362v3)|null|
+|**2025-02-04**|**AdaptBot: Combining LLM with Knowledge Graphs and Human Input for Generic-to-Specific Task Decomposition and Knowledge Refinement**|Shivam Singh et.al.|[2502.02067v1](http://arxiv.org/abs/2502.02067v1)|[link](https://github.com/sssshivvvv/adaptbot)|
+|**2025-02-03**|**On Bob Dylan: A Computational Perspective**|Prashant Garg et.al.|[2502.01772v1](http://arxiv.org/abs/2502.01772v1)|null|
+|**2025-02-03**|**VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos**|Xubin Ren et.al.|[2502.01549v1](http://arxiv.org/abs/2502.01549v1)|null|
+|**2025-02-03**|**Transformers trained on proteins can learn to attend to Euclidean distance**|Isaac Ellmen et.al.|[2502.01533v1](http://arxiv.org/abs/2502.01533v1)|[link](https://github.com/Ellmen/attending-to-distance)|
+|**2025-02-03**|**Common Foundations for SHACL, ShEx, and PG-Schema**|S. Ahmetaj et.al.|[2502.01295v1](http://arxiv.org/abs/2502.01295v1)|null|
+|**2025-02-03**|**GFM-RAG: Graph Foundation Model for Retrieval Augmented Generation**|Linhao Luo et.al.|[2502.01113v1](http://arxiv.org/abs/2502.01113v1)|[link](https://github.com/RManLuo/gfm-rag)|
+|**2025-02-03**|**Knowledge Synthesis of Photosynthesis Research Using a Large Language Model**|Seungri Yoon et.al.|[2502.01059v1](http://arxiv.org/abs/2502.01059v1)|null|
+|**2025-02-03**|**Encrypted Large Model Inference: The Equivariant Encryption Paradigm**|James Buban et.al.|[2502.01013v1](http://arxiv.org/abs/2502.01013v1)|null|
+|**2025-02-02**|**Metastable Dynamics of Chain-of-Thought Reasoning: Provable Benefits of Search, RL and Distillation**|Juno Kim et.al.|[2502.01694v1](http://arxiv.org/abs/2502.01694v1)|null|
+|**2025-02-02**|**PhiP-G: Physics-Guided Text-to-3D Compositional Scene Generation**|Qixuan Li et.al.|[2502.00708v1](http://arxiv.org/abs/2502.00708v1)|null|
+|**2025-02-02**|**A Survey of Quantized Graph Representation Learning: Connecting Graph Structures with Large Language Models**|Qika Lin et.al.|[2502.00681v1](http://arxiv.org/abs/2502.00681v1)|null|
+|**2025-02-01**|**Challenges and Innovations in LLM-Powered Fake News Detection: A Synthesis of Approaches and Future Directions**|Jingyuan Yi et.al.|[2502.00339v1](http://arxiv.org/abs/2502.00339v1)|null|
+|**2025-02-01**|**DEUCE: Dual-diversity Enhancement and Uncertainty-awareness for Cold-start Active Learning**|Jiaxin Guo et.al.|[2502.00305v1](http://arxiv.org/abs/2502.00305v1)|null|
+|**2025-01-31**|**Longer Attention Span: Increasing Transformer Context Length with Sparse Graph Processing Techniques**|Nathaniel Tomczak et.al.|[2502.01659v2](http://arxiv.org/abs/2502.01659v2)|[link](https://github.com/KLab-AI3/Graph-Processing-Attention-IPDPS-2025)|
+|**2025-01-31**|**Improving vision-language alignment with graph spiking hybrid Networks**|Siyu Zhang et.al.|[2501.19069v1](http://arxiv.org/abs/2501.19069v1)|null|
+|**2025-01-30**|**Semantic Web and Creative AI -- A Technical Report from ISWS 2023**|Raia Abu Ahmad et.al.|[2501.18542v1](http://arxiv.org/abs/2501.18542v1)|null|
+|**2025-01-30**|**Leveraging LLM Agents for Automated Optimization Modeling for SASP Problems: A Graph-RAG based Approach**|Tianpeng Pan et.al.|[2501.18320v1](http://arxiv.org/abs/2501.18320v1)|null|
+|**2025-01-30**|**Mixed-Precision Graph Neural Quantization for Low Bit Large Language Models**|Wanlong Liu et.al.|[2501.18154v1](http://arxiv.org/abs/2501.18154v1)|null|
+|**2025-01-30**|**Self-supervised Quantized Representation for Seamlessly Integrating Knowledge Graphs with Large Language Models**|Qika Lin et.al.|[2501.18119v1](http://arxiv.org/abs/2501.18119v1)|null|
+|**2025-01-29**|**Hybrid Graphs for Table-and-Text based Question Answering using LLMs**|Ankush Agarwal et.al.|[2501.17767v1](http://arxiv.org/abs/2501.17767v1)|null|
+|**2025-01-29**|**Query-Aware Learnable Graph Pooling Tokens as Prompt for Large Language Models**|Wooyoung Kim et.al.|[2501.17549v1](http://arxiv.org/abs/2501.17549v1)|null|
+|**2025-01-29**|**General Scene Adaptation for Vision-and-Language Navigation**|Haodong Hong et.al.|[2501.17403v1](http://arxiv.org/abs/2501.17403v1)|[link](https://github.com/honghd16/gsa-vln)|
+|**2025-01-28**|**Comprehensive Evaluation for a Large Scale Knowledge Graph Question Answering Service**|Saloni Potdar et.al.|[2501.17270v1](http://arxiv.org/abs/2501.17270v1)|null|
+|**2025-01-28**|**FactCG: Enhancing Fact Checkers with Graph-Based Multi-Hop Data**|Deren Lei et.al.|[2501.17144v1](http://arxiv.org/abs/2501.17144v1)|[link](https://github.com/derenlei/factcg)|
+|**2025-01-28**|**LLM-AutoDiff: Auto-Differentiate Any LLM Workflow**|Li Yin et.al.|[2501.16673v2](http://arxiv.org/abs/2501.16673v2)|[link](https://github.com/sylphai-inc/adalflow)|
+|**2025-01-27**|**360Brew: A Decoder-only Foundation Model for Personalized Ranking and Recommendation**|Hamed Firooz et.al.|[2501.16450v3](http://arxiv.org/abs/2501.16450v3)|null|
+|**2025-01-27**|**Raiders of the Lost Dependency: Fixing Dependency Conflicts in Python using LLMs**|Antony Bartlett et.al.|[2501.16191v1](http://arxiv.org/abs/2501.16191v1)|null|
+|**2025-01-27**|**Harnessing Diverse Perspectives: A Multi-Agent Framework for Enhanced Error Detection in Knowledge Graphs**|Yu Li et.al.|[2501.15791v1](http://arxiv.org/abs/2501.15791v1)|[link](https://github.com/kse-eleven/makged)|
+|**2025-01-27**|**Automatic Feedback Generation for Short Answer Questions using Answer Diagnostic Graphs**|Momoka Furuhashi et.al.|[2501.15777v1](http://arxiv.org/abs/2501.15777v1)|null|
+|**2025-01-26**|**Transformer-Based Multimodal Knowledge Graph Completion with Link-Aware Contexts**|Haodi Ma et.al.|[2501.15688v1](http://arxiv.org/abs/2501.15688v1)|null|
+|**2025-01-26**|**How to Mitigate Information Loss in Knowledge Graphs for GraphRAG: Leveraging Triple Context Restoration and Query-Driven Feedback**|Manzong Huang et.al.|[2501.15378v1](http://arxiv.org/abs/2501.15378v1)|null|
+|**2025-01-24**|**Explaining Categorical Feature Interactions Using Graph Covariance and LLMs**|Cencheng Shen et.al.|[2501.14932v1](http://arxiv.org/abs/2501.14932v1)|null|
+|**2025-01-24**|**Causal Graphs Meet Thoughts: Enhancing Complex Reasoning in Graph-Augmented LLMs**|Hang Luo et.al.|[2501.14892v1](http://arxiv.org/abs/2501.14892v1)|null|
+|**2025-01-24**|**GraPPI: A Retrieve-Divide-Solve GraphRAG Framework for Large-scale Protein-protein Interaction Exploration**|Ziwen Li et.al.|[2501.16382v1](http://arxiv.org/abs/2501.16382v1)|[link](https://github.com/aaronli43/grappi)|
+|**2025-01-24**|**Evaluating and Improving Graph to Text Generation with Large Language Models**|Jie He et.al.|[2501.14497v1](http://arxiv.org/abs/2501.14497v1)|[link](https://github.com/probe2/kg_text)|
+|**2025-01-24**|**Fast Think-on-Graph: Wider, Deeper and Faster Reasoning of Large Language Model on Knowledge Graph**|Xujian Liang et.al.|[2501.14300v1](http://arxiv.org/abs/2501.14300v1)|[link](https://github.com/dosonleung/fasttog)|
+|**2025-01-24**|**Top Ten Challenges Towards Agentic Neural Graph Databases**|Jiaxin Bai et.al.|[2501.14224v1](http://arxiv.org/abs/2501.14224v1)|null|
+|**2025-01-23**|**GraphRAG under Fire**|Jiacheng Liang et.al.|[2501.14050v1](http://arxiv.org/abs/2501.14050v1)|null|
+|**2025-01-23**|**EICopilot: Search and Explore Enterprise Information over Large-scale Knowledge Graphs with LLM-driven Agents**|Yuhui Yun et.al.|[2501.13746v1](http://arxiv.org/abs/2501.13746v1)|null|
+|**2025-01-23**|**Pseudocode-Injection Magic: Enabling LLMs to Tackle Graph Computational Tasks**|Chang Gong et.al.|[2501.13731v1](http://arxiv.org/abs/2501.13731v1)|null|
+|**2025-01-23**|**CAPRAG: A Large Language Model Solution for Customer Service and Automatic Reporting using Vector and Graph Retrieval-Augmented Generation**|Hamza Landolsi et.al.|[2501.13993v1](http://arxiv.org/abs/2501.13993v1)|null|
+|**2025-01-23**|**Dual-Branch HNSW Approach with Skip Bridges and LID-Driven Optimization**|Hy Nguyen et.al.|[2501.13992v1](http://arxiv.org/abs/2501.13992v1)|null|
+|**2025-01-23**|**Comprehensive Modeling and Question Answering of Cancer Clinical Practice Guidelines using LLMs**|Bhumika Gupta et.al.|[2501.13984v1](http://arxiv.org/abs/2501.13984v1)|null|
+|**2025-01-21**|**LLM-Assisted Knowledge Graph Completion for Curriculum and Domain Modelling in Personalized Higher Education Recommendations**|Hasan Abu-Rasheed et.al.|[2501.12300v1](http://arxiv.org/abs/2501.12300v1)|null|
+|**2025-01-21**|**Divide-Then-Aggregate: An Efficient Tool Learning Method via Parallel Tool Invocation**|Dongsheng Zhu et.al.|[2501.12432v1](http://arxiv.org/abs/2501.12432v1)|null|
+|**2025-01-21**|**InsTALL: Context-aware Instructional Task Assistance with Multi-modal Large Language Models**|Pha Nguyen et.al.|[2501.12231v1](http://arxiv.org/abs/2501.12231v1)|null|
+|**2025-01-21**|**Leveraging Graph Structures and Large Language Models for End-to-End Synthetic Task-Oriented Dialogues**|Maya Medjad et.al.|[2501.11977v1](http://arxiv.org/abs/2501.11977v1)|[link](https://github.com/reecall/graphtod)|
+|**2025-01-21**|**Bridging Visualization and Optimization: Multimodal Large Language Models on Graph-Structured Combinatorial Optimization**|Jie Zhao et.al.|[2501.11968v1](http://arxiv.org/abs/2501.11968v1)|null|
+|**2025-01-21**|**A Survey of Graph Retrieval-Augmented Generation for Customized Large Language Models**|Qinggang Zhang et.al.|[2501.13958v1](http://arxiv.org/abs/2501.13958v1)|[link](https://github.com/deep-polyu/awesome-graphrag)|
+|**2025-01-21**|**Network-informed Prompt Engineering against Organized Astroturf Campaigns under Extreme Class Imbalance**|Nikos Kanakaris et.al.|[2501.11849v2](http://arxiv.org/abs/2501.11849v2)|[link](https://github.com/nkanak/brag-fake-news-campaigns)|
+|**2025-01-21**|**Large Language Models Meet Graph Neural Networks for Text-Numeric Graph Reasoning**|Haoran Song et.al.|[2501.16361v1](http://arxiv.org/abs/2501.16361v1)|null|
+|**2025-01-20**|**Zep: A Temporal Knowledge Graph Architecture for Agent Memory**|Preston Rasmussen et.al.|[2501.13956v1](http://arxiv.org/abs/2501.13956v1)|[link](https://github.com/getzep/graphiti)|
+|**2025-01-20**|**Explainable Lane Change Prediction for Near-Crash Scenarios Using Knowledge Graph Embeddings and Retrieval Augmented Generation**|M. Manzour et.al.|[2501.11560v1](http://arxiv.org/abs/2501.11560v1)|null|
+|**2025-01-20**|**Each Graph is a New Language: Graph Learning with LLMs**|Huachi Zhou et.al.|[2501.11478v2](http://arxiv.org/abs/2501.11478v2)|null|
+|**2025-01-20**|**Few-shot Policy (de)composition in Conversational Question Answering**|Kyle Erwin et.al.|[2501.11335v1](http://arxiv.org/abs/2501.11335v1)|null|
+|**2025-01-20**|**Reasoning Language Models: A Blueprint**|Maciej Besta et.al.|[2501.11223v3](http://arxiv.org/abs/2501.11223v3)|[link](https://github.com/spcl/x1)|
+|**2025-01-19**|**IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems**|Elad Levi et.al.|[2501.11067v1](http://arxiv.org/abs/2501.11067v1)|[link](https://github.com/plurai-ai/intellagent)|
 
-##### **A Review of Deep Learning Approaches for Non-Invasive Cognitive Impairment Detection**
-2410.19898v1 by Muath Alsuhaibani, Ali Pourramezan Fard, Jian Sun, Farida Far Poor, Peter S. Pressman, Mohammad H. Mahoor
+#### Abstracts
+##### **Visual Graph Question Answering with ASP and LLMs for Language Parsing**
+2502.09211v1 by Jakob Johannes Bauer, Thomas Eiter, Nelson Higuera Ruiz, Johannes Oetsch
 
-This review paper explores recent advances in deep learning approaches for
-non-invasive cognitive impairment detection. We examine various non-invasive
-indicators of cognitive decline, including speech and language, facial, and
-motoric mobility. The paper provides an overview of relevant datasets,
-feature-extracting techniques, and deep-learning architectures applied to this
-domain. We have analyzed the performance of different methods across modalities
-and observed that speech and language-based methods generally achieved the
-highest detection performance. Studies combining acoustic and linguistic
-features tended to outperform those using a single modality. Facial analysis
-methods showed promise for visual modalities but were less extensively studied.
-Most papers focused on binary classification (impaired vs. non-impaired), with
-fewer addressing multi-class or regression tasks. Transfer learning and
-pre-trained language models emerged as popular and effective techniques,
-especially for linguistic analysis. Despite significant progress, several
-challenges remain, including data standardization and accessibility, model
-explainability, longitudinal analysis limitations, and clinical adaptation.
-Lastly, we propose future research directions, such as investigating
-language-agnostic speech analysis methods, developing multi-modal diagnostic
-systems, and addressing ethical considerations in AI-assisted healthcare. By
-synthesizing current trends and identifying key obstacles, this review aims to
-guide further development of deep learning-based cognitive impairment detection
-systems to improve early diagnosis and ultimately patient outcomes.
+Visual Question Answering (VQA) is a challenging problem that requires to
+process multimodal input. Answer-Set Programming (ASP) has shown great
+potential in this regard to add interpretability and explainability to modular
+VQA architectures. In this work, we address the problem of how to integrate ASP
+with modules for vision and natural language processing to solve a new and
+demanding VQA variant that is concerned with images of graphs (not graphs in
+symbolic form). Images containing graph-based structures are an ubiquitous and
+popular form of visualisation. Here, we deal with the particular problem of
+graphs inspired by transit networks, and we introduce a novel dataset that
+amends an existing one by adding images of graphs that resemble metro lines.
+Our modular neuro-symbolic approach combines optical graph recognition for
+graph parsing, a pretrained optical character recognition neural network for
+parsing labels, Large Language Models (LLMs) for language processing, and ASP
+for reasoning. This method serves as a first baseline and achieves an overall
+average accuracy of 73% on the dataset. Our evaluation provides further
+evidence of the potential of modular neuro-symbolic systems, in particular with
+pretrained models that do not involve any further training and logic
+programming for reasoning, to solve complex VQA tasks.
 
-摘要：本篇評論探討了深度學習方法在非侵入式認知功能障礙檢測上的最新進展。我們檢視了各種非侵入式的認知衰退指標，包括語言和語言、面部和運動機能。本文概述了與此領域相關的資料集、特徵提取技術和深度學習架構。我們分析了不同方法在不同方式上的表現，並觀察到基於語言和語言的方法通常能達到最高的檢測表現。結合聲學和語言特徵的研究往往優於使用單一方式的研究。面部分析方法顯示出視覺方式的潛力，但研究較少。大多數論文專注於二元分類（受損與未受損），較少探討多類或回歸任務。遷移學習和預訓練語言模型已成為流行且有效的技術，特別是對於語言分析。儘管取得了重大進展，但仍存在一些挑戰，包括資料標準化和可及性、模型可解釋性、縱向分析限制和臨床適應性。最後，我們提出了未來的研究方向，例如調查與語言無關的語音分析方法、開發多模式診斷系統，以及解決人工智慧輔助醫療保健中的倫理考量。透過綜合目前的趨勢和找出關鍵障礙，本篇評論旨在引導深度學習為基礎的認知功能障礙檢測系統的進一步發展，以改善早期診斷，並最終改善患者的治療結果。
+摘要：視覺問答（VQA）是一項具有挑戰性的問題，需要處理多模態輸入。答案集程式設計（ASP）在這方面顯示出巨大的潛力，可以為模組化 VQA 架構增加可解釋性和說明性。在這項工作中，我們探討如何將 ASP 與視覺和自然語言處理模組整合，以解決一個新的且要求嚴格的 VQA 變體，該變體與圖形影像（而非符號形式的圖形）有關。包含圖形結構的影像是一種普遍且流行的可視化形式。在這裡，我們處理受交通網路啟發的圖形特定問題，並引入一個新的資料集，透過新增類似地鐵路線的圖形影像來修正現有資料集。我們的模組化神經符號方法結合光學圖形辨識進行圖形解析、預先訓練的光學字元辨識神經網路進行標籤解析、大型語言模型（LLM）進行語言處理，以及 ASP 進行推理。此方法作為第一個基準，在資料集上達到 73% 的整體平均準確度。我們的評估進一步證明了模組化神經符號系統的潛力，特別是預先訓練的模型，這些模型不涉及任何進一步的訓練和邏輯程式設計進行推理，以解決複雜的 VQA 任務。
 
-##### **An Ontology-Enabled Approach For User-Centered and Knowledge-Enabled Explanations of AI Systems**
-2410.17504v1 by Shruthi Chari
+##### **Representation Learning to Advance Multi-institutional Studies with Electronic Health Record Data**
+2502.08547v1 by Doudou Zhou, Han Tong, Linshanshan Wang, Suqi Liu, Xin Xiong, Ziming Gan, Romain Griffier, Boris Hejblum, Yun-Chung Liu, Chuan Hong, Clara-Lea Bonzel, Tianrun Cai, Kevin Pan, Yuk-Lam Ho, Lauren Costa, Vidul A. Panickan, J. Michael Gaziano, Kenneth Mandl, Vianney Jouhet, Rodolphe Thiebaut, Zongqi Xia, Kelly Cho, Katherine Liao, Tianxi Cai
 
-Explainable Artificial Intelligence (AI) focuses on helping humans understand
-the working of AI systems or their decisions and has been a cornerstone of AI
-for decades. Recent research in explainability has focused on explaining the
-workings of AI models or model explainability. There have also been several
-position statements and review papers detailing the needs of end-users for
-user-centered explainability but fewer implementations. Hence, this thesis
-seeks to bridge some gaps between model and user-centered explainability. We
-create an explanation ontology (EO) to represent literature-derived explanation
-types via their supporting components. We implement a knowledge-augmented
-question-answering (QA) pipeline to support contextual explanations in a
-clinical setting. Finally, we are implementing a system to combine explanations
-from different AI methods and data modalities. Within the EO, we can represent
-fifteen different explanation types, and we have tested these representations
-in six exemplar use cases. We find that knowledge augmentations improve the
-performance of base large language models in the contextualized QA, and the
-performance is variable across disease groups. In the same setting, clinicians
-also indicated that they prefer to see actionability as one of the main foci in
-explanations. In our explanations combination method, we plan to use similarity
-metrics to determine the similarity of explanations in a chronic disease
-detection setting. Overall, through this thesis, we design methods that can
-support knowledge-enabled explanations across different use cases, accounting
-for the methods in today's AI era that can generate the supporting components
-of these explanations and domain knowledge sources that can enhance them.
+The adoption of EHRs has expanded opportunities to leverage data-driven
+algorithms in clinical care and research. A major bottleneck in effectively
+conducting multi-institutional EHR studies is the data heterogeneity across
+systems with numerous codes that either do not exist or represent different
+clinical concepts across institutions. The need for data privacy further limits
+the feasibility of including multi-institutional patient-level data required to
+study similarities and differences across patient subgroups. To address these
+challenges, we developed the GAME algorithm. Tested and validated across 7
+institutions and 2 languages, GAME integrates data in several levels: (1) at
+the institutional level with knowledge graphs to establish relationships
+between codes and existing knowledge sources, providing the medical context for
+standard codes and their relationship to each other; (2) between institutions,
+leveraging language models to determine the relationships between
+institution-specific codes with established standard codes; and (3) quantifying
+the strength of the relationships between codes using a graph attention
+network. Jointly trained embeddings are created using transfer and federated
+learning to preserve data privacy. In this study, we demonstrate the
+applicability of GAME in selecting relevant features as inputs for AI-driven
+algorithms in a range of conditions, e.g., heart failure, rheumatoid arthritis.
+We then highlight the application of GAME harmonized multi-institutional EHR
+data in a study of Alzheimer's disease outcomes and suicide risk among patients
+with mental health disorders, without sharing patient-level data outside
+individual institutions.
 
-摘要：可解釋人工智慧（AI）專注於協助人類了解 AI 系統運作或其決策，數十年來一直是 AI 的基石。最近的可解釋性研究專注於解釋 AI 模型或模型可解釋性的運作。也有幾份立場聲明和評論論文詳細說明了最終使用者對以使用者為中心的可解釋性的需求，但實作較少。因此，本論文旨在彌補模型和以使用者為中心的可解釋性之間的一些差距。我們建立一個解釋本體（EO）以透過其支援元件來表示從文獻中衍生的解釋類型。我們實作一個知識增強的問答（QA）管線，以在臨床環境中支援情境解釋。最後，我們正在實作一個系統，以結合來自不同 AI 方法和資料模式的解釋。在 EO 中，我們可以表示 15 種不同的解釋類型，並且我們已在六個範例使用案例中測試這些表示。我們發現，知識增強改善了基礎大型語言模型在情境化 QA 中的效能，並且效能因疾病群組而異。在相同的環境中，臨床醫生也表示他們希望將可操作性視為解釋中的主要焦點之一。在我們的解釋組合方法中，我們計畫使用相似性指標來確定慢性病偵測環境中解釋的相似性。總體而言，透過本論文，我們設計了可以在不同使用案例中支援知識啟用解釋的方法，考量到當今 AI 時代中可以產生這些解釋的支援元件和可以增強這些解釋的領域知識來源的方法。
+摘要：電子健康紀錄的採用擴大了在臨床照護和研究中利用資料驅動演算法的機會。在有效進行多機構電子健康紀錄研究時，一個主要的瓶頸是系統間資料異質性，其中有許多代碼在機構間不存在或表示不同的臨床概念。資料隱私的需求進一步限制了納入多機構患者層級資料的可行性，而這些資料對於研究患者亞群之間的相似性和差異性是必要的。為了應對這些挑戰，我們開發了 GAME 演算法。GAME 已在 7 個機構和 2 種語言中進行測試和驗證，它整合了多個層級的資料：(1) 在機構層級，使用知識圖表來建立代碼和現有知識來源之間的關係，為標準代碼及其彼此之間的關係提供醫療背景；(2) 在機構之間，利用語言模型來確定機構特定代碼與已建立的標準代碼之間的關係；(3) 使用圖形注意網路量化代碼之間關係的強度。使用遷移和聯合學習建立聯合訓練的嵌入，以保護資料隱私。在本研究中，我們展示了 GAME 在選擇相關特徵作為 AI 驅動演算法輸入時的適用性，適用於各種情況，例如心臟衰竭、類風濕性關節炎。然後，我們重點介紹了 GAME 和諧化多機構電子健康紀錄資料在阿茲海默症疾病結果和精神疾病患者自殺風險研究中的應用，而無需在個別機構之外共享患者層級資料。
 
-##### **Contrasting Attitudes Towards Current and Future AI Applications for Computerised Interpretation of ECG: A Clinical Stakeholder Interview Study**
-2410.16879v1 by Lukas Hughes-Noehrer, Leda Channer, Gabriel Strain, Gregory Yates, Richard Body, Caroline Jay
+##### **Trustworthy GNNs with LLMs: A Systematic Review and Taxonomy**
+2502.08353v1 by Ruizhan Xue, Huimin Deng, Fang He, Maojun Wang, Zeyu Zhang
 
-Objectives: To investigate clinicians' attitudes towards current automated
-interpretation of ECG and novel AI technologies and their perception of
-computer-assisted interpretation. Materials and Methods: We conducted a series
-of interviews with clinicians in the UK. Our study: (i) explores the potential
-for AI, specifically future 'human-like' computing approaches, to facilitate
-ECG interpretation and support clinical decision making, and (ii) elicits their
-opinions about the importance of explainability and trustworthiness of AI
-algorithms. Results: We performed inductive thematic analysis on interview
-transcriptions from 23 clinicians and identified the following themes: (i) a
-lack of trust in current systems, (ii) positive attitudes towards future AI
-applications and requirements for these, (iii) the relationship between the
-accuracy and explainability of algorithms, and (iv) opinions on education,
-possible deskilling, and the impact of AI on clinical competencies. Discussion:
-Clinicians do not trust current computerised methods, but welcome future 'AI'
-technologies. Where clinicians trust future AI interpretation to be accurate,
-they are less concerned that it is explainable. They also preferred ECG
-interpretation that demonstrated the results of the algorithm visually. Whilst
-clinicians do not fear job losses, they are concerned about deskilling and the
-need to educate the workforce to use AI responsibly. Conclusion: Clinicians are
-positive about the future application of AI in clinical decision-making.
-Accuracy is a key factor of uptake and visualisations are preferred over
-current computerised methods. This is viewed as a potential means of training
-and upskilling, in contrast to the deskilling that automation might be
-perceived to bring.
+With the extensive application of Graph Neural Networks (GNNs) across various
+domains, their trustworthiness has emerged as a focal point of research. Some
+existing studies have shown that the integration of large language models
+(LLMs) can improve the semantic understanding and generation capabilities of
+GNNs, which in turn improves the trustworthiness of GNNs from various aspects.
+Our review introduces a taxonomy that offers researchers a clear framework for
+comprehending the principles and applications of different methods and helps
+clarify the connections and differences among various approaches. Then we
+systematically survey representative approaches along the four categories of
+our taxonomy. Through our taxonomy, researchers can understand the applicable
+scenarios, potential advantages, and limitations of each approach for the the
+trusted integration of GNNs with LLMs. Finally, we present some promising
+directions of work and future trends for the integration of LLMs and GNNs to
+improve model trustworthiness.
 
-摘要：<paragraph>目的：調查臨床醫生對目前自動化心電圖解讀和新的人工智慧技術的態度，以及他們對電腦輔助解讀的看法。材料和方法：我們對英國的臨床醫生進行了一系列訪談。我們的研究：(i) 探討人工智慧的潛力，特別是未來的「類人類」運算方法，以促進心電圖解讀並支持臨床決策制定，以及 (ii) 徵求他們對人工智慧演算法的可解釋性和可信度的看法。結果：我們對 23 位臨床醫生的訪談記錄進行了歸納主題分析，並找出以下主題：(i) 對目前系統缺乏信任，(ii) 對未來人工智慧應用和對這些應用的要求持正面態度，(iii) 演算法的準確性和可解釋性之間的關係，以及 (iv) 對教育、可能的技能退化，以及人工智慧對臨床能力的影響的看法。討論：臨床醫生不信任目前的電腦化方法，但歡迎未來的「人工智慧」技術。在臨床醫生相信未來的 AI 解讀準確的情況下，他們不太擔心它是否可解釋。他們也比較喜歡能以視覺方式呈現演算法結果的心電圖解讀。雖然臨床醫生不害怕失業，但他們擔心技能退化，以及需要教育員工負責任地使用人工智慧。結論：臨床醫生對人工智慧在臨床決策制定中的未來應用持正面態度。準確性是採用人工智慧的一個關鍵因素，而視覺化比目前的電腦化方法更受青睞。這被視為一種潛在的培訓和提升技能的方法，與自動化可能帶來的技能退化形成對比。</paragraph>
+摘要：隨著圖神經網路 (GNN) 在各種領域的廣泛應用，其可信度已成為研究的焦點。一些現有研究表明，整合大型語言模型 (LLM) 可以提升 GNN 的語意理解和生成能力，進而從各方面提升 GNN 的可信度。我們的評論介紹了一種分類法，為研究人員提供了一個清晰的架構，用於理解不同方法的原理和應用，並有助於釐清各種方法之間的關聯和差異。然後，我們系統性地針對分類法的四個類別進行代表性方法的調查。研究人員透過我們的分類法，可以了解每種方法在 GNN 與 LLM 的可信整合中適用的場景、潛在優點和限制。最後，我們提出 LLM 與 GNN 整合的一些有前景的工作方向和未來趨勢，以提升模型的可信度。
 
-##### **Pathologist-like explainable AI for interpretable Gleason grading in prostate cancer**
-2410.15012v1 by Gesa Mittmann, Sara Laiouar-Pedari, Hendrik A. Mehrtens, Sarah Haggenmüller, Tabea-Clara Bucher, Tirtha Chanda, Nadine T. Gaisa, Mathias Wagner, Gilbert Georg Klamminger, Tilman T. Rau, Christina Neppl, Eva Maria Compérat, Andreas Gocht, Monika Hämmerle, Niels J. Rupp, Jula Westhoff, Irene Krücken, Maximillian Seidl, Christian M. Schürch, Marcus Bauer, Wiebke Solass, Yu Chun Tam, Florian Weber, Rainer Grobholz, Jaroslaw Augustyniak, Thomas Kalinski, Christian Hörner, Kirsten D. Mertz, Constanze Döring, Andreas Erbersdobler, Gabriele Deubler, Felix Bremmer, Ulrich Sommer, Michael Brodhun, Jon Griffin, Maria Sarah L. Lenon, Kiril Trpkov, Liang Cheng, Fei Chen, Angelique Levi, Guoping Cai, Tri Q. Nguyen, Ali Amin, Alessia Cimadamore, Ahmed Shabaik, Varsha Manucha, Nazeel Ahmad, Nidia Messias, Francesca Sanguedolce, Diana Taheri, Ezra Baraban, Liwei Jia, Rajal B. Shah, Farshid Siadat, Nicole Swarbrick, Kyung Park, Oudai Hassan, Siamak Sakhaie, Michelle R. Downes, Hiroshi Miyamoto, Sean R. Williamson, Tim Holland-Letz, Carolin V. Schneider, Jakob Nikolas Kather, Yuri Tolkach, Titus J. Brinker
+##### **Graph Foundation Models for Recommendation: A Comprehensive Survey**
+2502.08346v1 by Bin Wu, Yihang Wang, Yuanhao Zeng, Jiawei Liu, Jiashu Zhao, Cheng Yang, Yawen Li, Long Xia, Dawei Yin, Chuan Shi
 
-The aggressiveness of prostate cancer, the most common cancer in men
-worldwide, is primarily assessed based on histopathological data using the
-Gleason scoring system. While artificial intelligence (AI) has shown promise in
-accurately predicting Gleason scores, these predictions often lack inherent
-explainability, potentially leading to distrust in human-machine interactions.
-To address this issue, we introduce a novel dataset of 1,015 tissue microarray
-core images, annotated by an international group of 54 pathologists. The
-annotations provide detailed localized pattern descriptions for Gleason grading
-in line with international guidelines. Utilizing this dataset, we develop an
-inherently explainable AI system based on a U-Net architecture that provides
-predictions leveraging pathologists' terminology. This approach circumvents
-post-hoc explainability methods while maintaining or exceeding the performance
-of methods trained directly for Gleason pattern segmentation (Dice score: 0.713
-$\pm$ 0.003 trained on explanations vs. 0.691 $\pm$ 0.010 trained on Gleason
-patterns). By employing soft labels during training, we capture the intrinsic
-uncertainty in the data, yielding strong results in Gleason pattern
-segmentation even in the context of high interobserver variability. With the
-release of this dataset, we aim to encourage further research into segmentation
-in medical tasks with high levels of subjectivity and to advance the
-understanding of pathologists' reasoning processes.
+Recommender systems (RS) serve as a fundamental tool for navigating the vast
+expanse of online information, with deep learning advancements playing an
+increasingly important role in improving ranking accuracy. Among these, graph
+neural networks (GNNs) excel at extracting higher-order structural information,
+while large language models (LLMs) are designed to process and comprehend
+natural language, making both approaches highly effective and widely adopted.
+Recent research has focused on graph foundation models (GFMs), which integrate
+the strengths of GNNs and LLMs to model complex RS problems more efficiently by
+leveraging the graph-based structure of user-item relationships alongside
+textual understanding. In this survey, we provide a comprehensive overview of
+GFM-based RS technologies by introducing a clear taxonomy of current
+approaches, diving into methodological details, and highlighting key challenges
+and future directions. By synthesizing recent advancements, we aim to offer
+valuable insights into the evolving landscape of GFM-based recommender systems.
 
-摘要：前列腺癌是全球男性最常見的癌症，其惡性程度主要根據 Gleason 評分系統使用組織病理學數據進行評估。雖然人工智慧 (AI) 在準確預測 Gleason 評分方面已展現潛力，但這些預測通常缺乏內在的可解釋性，可能會導致對人機互動的不信任。為了解決這個問題，我們引進了一個由 54 位病理學家組成的國際團隊註解的 1,015 個組織微陣列核心影像的新穎資料集。這些註解提供了詳細的局部模式描述，用於符合國際準則的 Gleason 分級。利用這個資料集，我們開發了一個基於 U-Net 架構的內在可解釋 AI 系統，該系統提供了利用病理學家術語進行預測。這種方法規避了事後可解釋性方法，同時維持或超越了直接訓練用於 Gleason 模式分割的方法的效能（Dice 分數：0.713 ± 0.003，訓練於解釋，相對於 0.691 ± 0.010，訓練於 Gleason 模式）。透過在訓練期間採用軟標籤，我們捕捉了資料中的內在不確定性，即使在觀察者間變異性高的情況下，也能在 Gleason 模式分割中產生強大的結果。透過釋出這個資料集，我們旨在鼓勵進一步研究主觀性高的醫療任務中的分割，並增進對病理學家推理過程的理解。
+摘要：推薦系統 (RS) 是導航廣闊線上資訊的基本工具，深度學習的進展在提升排名準確度方面扮演著日益重要的角色。在這些進展中，圖形神經網路 (GNN) 擅長萃取高階結構資訊，而大型語言模型 (LLM) 則設計用於處理和理解自然語言，這兩種方法都非常有效且廣泛採用。最近的研究專注於圖形基礎模型 (GFM)，它整合了 GNN 和 LLM 的優點，透過利用使用者與項目關係的圖形化結構，以及文字理解，更有效率地建構複雜的 RS 問題。在這項調查中，我們提供 GFM-based RS 技術的全面概觀，介紹當前方法的明確分類法，深入探討方法論的細節，並強調關鍵挑戰和未來方向。透過綜合最近的進展，我們旨在提供有價值的見解，了解 GFM-based 推薦系統不斷演變的樣貌。
 
-##### **Explainable AI Methods for Multi-Omics Analysis: A Survey**
-2410.11910v1 by Ahmad Hussein, Mukesh Prasad, Ali Braytee
+##### **Self-Evaluation for Job-Shop Scheduling**
+2502.08684v1 by Imanol Echeverria, Maialen Murua, Roberto Santana
 
-Advancements in high-throughput technologies have led to a shift from
-traditional hypothesis-driven methodologies to data-driven approaches.
-Multi-omics refers to the integrative analysis of data derived from multiple
-'omes', such as genomics, proteomics, transcriptomics, metabolomics, and
-microbiomics. This approach enables a comprehensive understanding of biological
-systems by capturing different layers of biological information. Deep learning
-methods are increasingly utilized to integrate multi-omics data, offering
-insights into molecular interactions and enhancing research into complex
-diseases. However, these models, with their numerous interconnected layers and
-nonlinear relationships, often function as black boxes, lacking transparency in
-decision-making processes. To overcome this challenge, explainable artificial
-intelligence (xAI) methods are crucial for creating transparent models that
-allow clinicians to interpret and work with complex data more effectively. This
-review explores how xAI can improve the interpretability of deep learning
-models in multi-omics research, highlighting its potential to provide
-clinicians with clear insights, thereby facilitating the effective application
-of such models in clinical settings.
+Combinatorial optimization problems, such as scheduling and route planning,
+are crucial in various industries but are computationally intractable due to
+their NP-hard nature. Neural Combinatorial Optimization methods leverage
+machine learning to address these challenges but often depend on sequential
+decision-making, which is prone to error accumulation as small mistakes
+propagate throughout the process. Inspired by self-evaluation techniques in
+Large Language Models, we propose a novel framework that generates and
+evaluates subsets of assignments, moving beyond traditional stepwise
+approaches. Applied to the Job-Shop Scheduling Problem, our method integrates a
+heterogeneous graph neural network with a Transformer to build a policy model
+and a self-evaluation function. Experimental validation on challenging,
+well-known benchmarks demonstrates the effectiveness of our approach,
+surpassing state-of-the-art methods.
 
-摘要：高通量技術的進步導致從傳統的假設驅動方法轉變為資料驅動的方法。多組學是指整合分析來自多個「組學」的資料，例如基因組學、蛋白質組學、轉錄組學、代謝組學和微生物組學。此方法透過擷取生物資訊的不同層面，能全面了解生物系統。深度學習方法愈來愈常被用於整合多組學資料，提供分子交互作用的洞察力，並加強對複雜疾病的研究。然而，這些模型具有許多相互連接的層級和非線性關係，通常會像黑盒子一樣運作，缺乏決策過程的透明度。為了克服此挑戰，可解釋人工智慧 (xAI) 方法對於建立透明模型至關重要，讓臨床醫生可以更有效地解釋和處理複雜資料。此評論探討 xAI 如何能改善多組學研究中深度學習模型的可解釋性，強調其提供臨床醫生明確見解的潛力，進而促進此類模型在臨床環境中的有效應用。
+摘要：組合優化問題，例如排程和路線規劃，在各行各業中至關重要，但由於它們的 NP 難度，在計算上難以處理。神經組合優化方法利用機器學習來解決這些挑戰，但通常依賴於序貫決策制定，而序貫決策制定容易發生錯誤累積，因為小錯誤會在整個過程中傳播。受大型語言模型中的自我評估技術啟發，我們提出了一個新的框架，可生成和評估作業子集，超越傳統的分步方法。應用於工作車間排程問題，我們的方法將異質圖神經網路與 Transformer 整合在一起，以建立策略模型和自我評估函數。在具有挑戰性的著名基準上的實驗驗證證明了我們方法的有效性，超越了最先進的方法。
 
-##### **Study on the Helpfulness of Explainable Artificial Intelligence**
-2410.11896v1 by Tobias Labarta, Elizaveta Kulicheva, Ronja Froelian, Christian Geißler, Xenia Melman, Julian von Klitzing
+##### **Improving Existing Optimization Algorithms with LLMs**
+2502.08298v1 by Camilo Chacón Sartori, Christian Blum
 
-Explainable Artificial Intelligence (XAI) is essential for building advanced
-machine learning-powered applications, especially in critical domains such as
-medical diagnostics or autonomous driving. Legal, business, and ethical
-requirements motivate using effective XAI, but the increasing number of
-different methods makes it challenging to pick the right ones. Further, as
-explanations are highly context-dependent, measuring the effectiveness of XAI
-methods without users can only reveal a limited amount of information,
-excluding human factors such as the ability to understand it. We propose to
-evaluate XAI methods via the user's ability to successfully perform a proxy
-task, designed such that a good performance is an indicator for the explanation
-to provide helpful information. In other words, we address the helpfulness of
-XAI for human decision-making. Further, a user study on state-of-the-art
-methods was conducted, showing differences in their ability to generate trust
-and skepticism and the ability to judge the rightfulness of an AI decision
-correctly. Based on the results, we highly recommend using and extending this
-approach for more objective-based human-centered user studies to measure XAI
-performance in an end-to-end fashion.
+The integration of Large Language Models (LLMs) into optimization has created
+a powerful synergy, opening exciting research opportunities. This paper
+investigates how LLMs can enhance existing optimization algorithms. Using their
+pre-trained knowledge, we demonstrate their ability to propose innovative
+heuristic variations and implementation strategies. To evaluate this, we
+applied a non-trivial optimization algorithm, Construct, Merge, Solve and Adapt
+(CMSA) -- a hybrid metaheuristic for combinatorial optimization problems that
+incorporates a heuristic in the solution construction phase. Our results show
+that an alternative heuristic proposed by GPT-4o outperforms the
+expert-designed heuristic of CMSA, with the performance gap widening on larger
+and denser graphs. Project URL: https://imp-opt-algo-llms.surge.sh/
 
-摘要：可解釋人工智慧 (XAI) 對於建構先進的機器學習驅動應用程式至關重要，特別是在醫療診斷或自動駕駛等關鍵領域。法律、商業和倫理要求促使使用有效的 XAI，但數量日益增加的不同方法使得挑選正確的方法具有挑戰性。此外，由於解釋高度依賴於背景，在沒有使用者的情況下衡量 XAI 方法的有效性只能揭示有限的資訊，排除人類因素，例如理解它的能力。我們建議透過使用者成功執行代理任務的能力來評估 XAI 方法，設計使得良好的執行表現是解釋提供有用資訊的指標。換句話說，我們探討 XAI 對人類決策制定的幫助。此外，對最先進的方法進行使用者研究，顯示出它們在產生信任和懷疑的能力以及正確判斷 AI 決策是否正確的能力方面存在差異。根據結果，我們強烈建議使用和擴充這種方法，以進行更多以目標為基礎的人為中心使用者研究，以終端到終端的方式衡量 XAI 效能。
+摘要：大型语言模型 (LLM) 与优化相结合，创造了一种强大的协同作用，开启了令人兴奋的研究机会。本文探讨了 LLM 如何增强现有的优化算法。利用其预先训练的知识，我们展示了它们提出创新启发式变体和实施策略的能力。为了评估这一点，我们应用了一种非平凡的优化算法，构建、合并、求解和适应 (CMSA)——一种用于组合优化问题的混合元启发式算法，它在求解构建阶段纳入了启发式算法。我们的结果表明，GPT-4o 提出的替代启发式算法优于 CMSA 的专家设计的启发式算法，并且随着图形变得更大、更密集，性能差距也在扩大。项目网址：https://imp-opt-algo-llms.surge.sh/
 
-##### **Use of What-if Scenarios to Help Explain Artificial Intelligence Models for Neonatal Health**
-2410.09635v1 by Abdullah Mamun, Lawrence D. Devoe, Mark I. Evans, David W. Britt, Judith Klein-Seetharaman, Hassan Ghasemzadeh
+##### **ACCESS : A Benchmark for Abstract Causal Event Discovery and Reasoning**
+2502.08148v1 by Vy Vo, Lizhen Qu, Tao Feng, Yuncheng Hua, Xiaoxi Kang, Songhai Fan, Tim Dwyer, Lay-Ki Soon, Gholamreza Haffari
 
-Early detection of intrapartum risk enables interventions to potentially
-prevent or mitigate adverse labor outcomes such as cerebral palsy. Currently,
-there is no accurate automated system to predict such events to assist with
-clinical decision-making. To fill this gap, we propose "Artificial Intelligence
-(AI) for Modeling and Explaining Neonatal Health" (AIMEN), a deep learning
-framework that not only predicts adverse labor outcomes from maternal, fetal,
-obstetrical, and intrapartum risk factors but also provides the model's
-reasoning behind the predictions made. The latter can provide insights into
-what modifications in the input variables of the model could have changed the
-predicted outcome. We address the challenges of imbalance and small datasets by
-synthesizing additional training data using Adaptive Synthetic Sampling
-(ADASYN) and Conditional Tabular Generative Adversarial Networks (CTGAN). AIMEN
-uses an ensemble of fully-connected neural networks as the backbone for its
-classification with the data augmentation supported by either ADASYN or CTGAN.
-AIMEN, supported by CTGAN, outperforms AIMEN supported by ADASYN in
-classification. AIMEN can predict a high risk for adverse labor outcomes with
-an average F1 score of 0.784. It also provides counterfactual explanations that
-can be achieved by changing 2 to 3 attributes on average. Resources available:
-https://github.com/ab9mamun/AIMEN.
+Identifying cause-and-effect relationships is critical to understanding
+real-world dynamics and ultimately causal reasoning. Existing methods for
+identifying event causality in NLP, including those based on Large Language
+Models (LLMs), exhibit difficulties in out-of-distribution settings due to the
+limited scale and heavy reliance on lexical cues within available benchmarks.
+Modern benchmarks, inspired by probabilistic causal inference, have attempted
+to construct causal graphs of events as a robust representation of causal
+knowledge, where \texttt{CRAB} \citep{romanou2023crab} is one such recent
+benchmark along this line. In this paper, we introduce \texttt{ACCESS}, a
+benchmark designed for discovery and reasoning over abstract causal events.
+Unlike existing resources, \texttt{ACCESS} focuses on causality of everyday
+life events on the abstraction level. We propose a pipeline for identifying
+abstractions for event generalizations from \texttt{GLUCOSE}
+\citep{mostafazadeh-etal-2020-glucose}, a large-scale dataset of implicit
+commonsense causal knowledge, from which we subsequently extract $1,4$K causal
+pairs. Our experiments highlight the ongoing challenges of using statistical
+methods and/or LLMs for automatic abstraction identification and causal
+discovery in NLP. Nonetheless, we demonstrate that the abstract causal
+knowledge provided in \texttt{ACCESS} can be leveraged for enhancing QA
+reasoning performance in LLMs.
 
-摘要：產程中風險的早期偵測有助於進行干預措施，以預防或減輕不利的生產結果，例如腦性麻痺。目前，沒有準確的自動化系統可以預測此類事件，以協助臨床決策。為了填補這一空白，我們提出「用於建模和解釋新生兒健康的人工智慧」(AIMEN)，這是一個深度學習架構，它不僅可以根據孕產婦、胎兒、產科和產程風險因素預測不利的生產結果，還能提供模型做出預測背後的原因。後者可以提供見解，說明模型輸入變數中的哪些修改可能會改變預測結果。我們透過使用適應性合成抽樣 (ADASYN) 和條件表格生成對抗網路 (CTGAN) 來合成額外的訓練資料，以解決不平衡和小型資料集的挑戰。AIMEN 使用全連接神經網路的集合作為其分類的骨幹，並透過 ADASYN 或 CTGAN 支援資料擴充。由 CTGAN 支援的 AIMEN 在分類方面優於由 ADASYN 支援的 AIMEN。AIMEN 可以預測不利的生產結果的高風險，平均 F1 分數為 0.784。它還提供反事實解釋，可透過平均變更 2 至 3 個屬性來達成。可用資源：https://github.com/ab9mamun/AIMEN。
+摘要：<paragraph>找出因果關係對於理解現實世界的動態和最終的因果推理至關重要。現有的 NLP 事件因果關係識別方法，包括基於大型語言模型 (LLM) 的方法，由於規模有限且過度依賴於可用基準中的詞彙線索，在分佈外環境中表現出困難。受機率因果推論啟發的現代基準已嘗試建構事件的因果圖，作為因果知識的強健表示，其中 \texttt{CRAB} \citep{romanou2023crab} 是這條路徑上最近的一個基準。在本文中，我們介紹 \texttt{ACCESS}，一個專門設計來探索和推理抽象因果事件的基準。與現有資源不同，\texttt{ACCESS} 專注於抽象層面上日常生活事件的因果關係。我們提出一個管道，用於從 \texttt{GLUCOSE} \citep{mostafazadeh-etal-2020-glucose} 找出事件概括的抽象，\texttt{GLUCOSE} 是隱含常識因果知識的大規模資料集，我們隨後從中萃取出 1,4K 因果對。我們的實驗突顯出使用統計方法和/或 LLM 進行 NLP 中的自動抽象識別和因果發現的持續挑戰。儘管如此，我們證明了 \texttt{ACCESS} 中提供的抽象因果知識可用於增強 LLM 中的問答推理效能。</paragraph>
 
-##### **Artificial intelligence techniques in inherited retinal diseases: A review**
-2410.09105v1 by Han Trinh, Jordan Vice, Jason Charng, Zahra Tajbakhsh, Khyber Alam, Fred K. Chen, Ajmal Mian
+##### **GCoT: Chain-of-Thought Prompt Learning for Graphs**
+2502.08092v1 by Xingtong Yu, Chang Zhou, Zhongwei Kuai, Xinming Zhang, Yuan Fang
 
-Inherited retinal diseases (IRDs) are a diverse group of genetic disorders
-that lead to progressive vision loss and are a major cause of blindness in
-working-age adults. The complexity and heterogeneity of IRDs pose significant
-challenges in diagnosis, prognosis, and management. Recent advancements in
-artificial intelligence (AI) offer promising solutions to these challenges.
-However, the rapid development of AI techniques and their varied applications
-have led to fragmented knowledge in this field. This review consolidates
-existing studies, identifies gaps, and provides an overview of AI's potential
-in diagnosing and managing IRDs. It aims to structure pathways for advancing
-clinical applications by exploring AI techniques like machine learning and deep
-learning, particularly in disease detection, progression prediction, and
-personalized treatment planning. Special focus is placed on the effectiveness
-of convolutional neural networks in these areas. Additionally, the integration
-of explainable AI is discussed, emphasizing its importance in clinical settings
-to improve transparency and trust in AI-based systems. The review addresses the
-need to bridge existing gaps in focused studies on AI's role in IRDs, offering
-a structured analysis of current AI techniques and outlining future research
-directions. It concludes with an overview of the challenges and opportunities
-in deploying AI for IRDs, highlighting the need for interdisciplinary
-collaboration and the continuous development of robust, interpretable AI models
-to advance clinical applications.
+Chain-of-thought (CoT) prompting has achieved remarkable success in natural
+language processing (NLP). However, its vast potential remains largely
+unexplored for graphs. This raises an interesting question: How can we design
+CoT prompting for graphs to guide graph models to learn step by step? On one
+hand, unlike natural languages, graphs are non-linear and characterized by
+complex topological structures. On the other hand, many graphs lack textual
+data, making it difficult to formulate language-based CoT prompting. In this
+work, we propose the first CoT prompt learning framework for text-free graphs,
+GCoT. Specifically, we decompose the adaptation process for each downstream
+task into a series of inference steps, with each step consisting of
+prompt-based inference, ``thought'' generation, and thought-conditioned prompt
+learning. While the steps mimic CoT prompting in NLP, the exact mechanism
+differs significantly. Specifically, at each step, an input graph, along with a
+prompt, is first fed into a pre-trained graph encoder for prompt-based
+inference. We then aggregate the hidden layers of the encoder to construct a
+``thought'', which captures the working state of each node in the current step.
+Conditioned on this thought, we learn a prompt specific to each node based on
+the current state. These prompts are fed into the next inference step,
+repeating the cycle. To evaluate and analyze the effectiveness of GCoT, we
+conduct comprehensive experiments on eight public datasets, which demonstrate
+the advantage of our approach.
 
-摘要：遺傳性視網膜疾病 (IRD) 是一組多樣化的遺傳疾病，
-會導致視力逐漸喪失，是工作年齡成人失明的主要原因。IRD 的複雜性和異質性對診斷、預後和管理提出了重大挑戰。最近人工智能 (AI) 的進步為這些挑戰提供了有希望的解決方案。
-然而，AI 技術的快速發展及其多種應用導致了該領域的知識分散。本綜述整合了現有研究，找出差距，並概述了 AI 在診斷和管理 IRD 中的潛力。它旨在通過探索機器學習和深度學習等 AI 技術，特別是在疾病檢測、進程預測和個性化治療計劃中，為推進臨床應用構建途徑。特別關注這些領域中卷積神經網路的有效性。此外，討論了可解釋 AI 的整合，強調了其在臨床環境中提高透明度和對基於 AI 的系統的信任的重要性。該綜述解決了彌合 AI 在 IRD 中作用的重點研究中現有差距的必要性，提供了對當前 AI 技術的結構化分析，並概述了未來的研究方向。最後概述了在 IRD 中部署 AI 的挑戰和機遇，強調了跨學科合作和持續開發強大、可解釋的 AI 模型以推進臨床應用的必要性。
+摘要：<paragraph>鏈式思考 (CoT) 提示在自然語言處理 (NLP) 中取得了顯著的成功。然而，其龐大的潛力在圖形方面仍未得到充分探索。這提出了一個有趣的問題：我們如何設計圖形的 CoT 提示來指導圖形模型逐步學習？一方面，與自然語言不同，圖形是非線性的，並且具有複雜的拓撲結構。另一方面，許多圖形缺乏文本數據，這使得難以制定基於語言的 CoT 提示。在這項工作中，我們提出了第一個適用於無文本圖形的 CoT 提示學習框架 GCoT。具體來說，我們將每個下游任務的適應過程分解為一系列推理步驟，每個步驟都包含基於提示的推理、「思想」生成以及基於思想的提示學習。雖然這些步驟模擬了 NLP 中的 CoT 提示，但具體機制卻有很大不同。具體來說，在每一步中，一個輸入圖形連同一個提示首先被輸入到一個預訓練的圖形編碼器中進行基於提示的推理。然後，我們聚合編碼器的隱藏層以構建一個「思想」，它捕獲了當前步驟中每個節點的工作狀態。基於這個思想，我們根據當前狀態學習一個特定於每個節點的提示。這些提示被輸入到下一個推理步驟中，重複這個循環。為了評估和分析 GCoT 的有效性，我們對八個公共數據集進行了全面的實驗，這證明了我們方法的優勢。</paragraph>
 
-##### **CasiMedicos-Arg: A Medical Question Answering Dataset Annotated with Explanatory Argumentative Structures**
-2410.05235v2 by Ekaterina Sviridova, Anar Yeginbergen, Ainara Estarrona, Elena Cabrio, Serena Villata, Rodrigo Agerri
+##### **Deep Semantic Graph Learning via LLM based Node Enhancement**
+2502.07982v1 by Chuanqi Shi, Yiyi Tao, Hang Zhang, Lun Wang, Shaoshuai Du, Yixian Shen, Yanxin Shen
 
-Explaining Artificial Intelligence (AI) decisions is a major challenge
-nowadays in AI, in particular when applied to sensitive scenarios like medicine
-and law. However, the need to explain the rationale behind decisions is a main
-issue also for human-based deliberation as it is important to justify
-\textit{why} a certain decision has been taken. Resident medical doctors for
-instance are required not only to provide a (possibly correct) diagnosis, but
-also to explain how they reached a certain conclusion. Developing new tools to
-aid residents to train their explanation skills is therefore a central
-objective of AI in education. In this paper, we follow this direction, and we
-present, to the best of our knowledge, the first multilingual dataset for
-Medical Question Answering where correct and incorrect diagnoses for a clinical
-case are enriched with a natural language explanation written by doctors. These
-explanations have been manually annotated with argument components (i.e.,
-premise, claim) and argument relations (i.e., attack, support), resulting in
-the Multilingual CasiMedicos-Arg dataset which consists of 558 clinical cases
-in four languages (English, Spanish, French, Italian) with explanations, where
-we annotated 5021 claims, 2313 premises, 2431 support relations, and 1106
-attack relations. We conclude by showing how competitive baselines perform over
-this challenging dataset for the argument mining task.
+Graph learning has attracted significant attention due to its widespread
+real-world applications. Current mainstream approaches rely on text node
+features and obtain initial node embeddings through shallow embedding learning
+using GNNs, which shows limitations in capturing deep textual semantics. Recent
+advances in Large Language Models (LLMs) have demonstrated superior
+capabilities in understanding text semantics, transforming traditional text
+feature processing. This paper proposes a novel framework that combines Graph
+Transformer architecture with LLM-enhanced node features. Specifically, we
+leverage LLMs to generate rich semantic representations of text nodes, which
+are then processed by a multi-head self-attention mechanism in the Graph
+Transformer to capture both local and global graph structural information. Our
+model utilizes the Transformer's attention mechanism to dynamically aggregate
+neighborhood information while preserving the semantic richness provided by LLM
+embeddings. Experimental results demonstrate that the LLM-enhanced node
+features significantly improve the performance of graph learning models on node
+classification tasks. This approach shows promising results across multiple
+graph learning tasks, offering a practical direction for combining graph
+networks with language models.
 
-摘要：解釋人工智慧 (AI) 的決策是現在 AI 的一項重大挑戰，特別是應用於像醫學和法律等敏感情境時。然而，解釋決策背後理由的需求也是基於人類的考量的一個主要問題，因為有必要證明為什麼做出某個決策。例如，住院醫師不僅需要提供（可能是正確的）診斷，還需要解釋他們如何達成某個結論。因此，開發新的工具來幫助住院醫師訓練他們的解釋技巧是教育中 AI 的一項核心目標。在本文中，我們遵循這個方向，並且根據我們的了解，提出第一個多語言醫學問答資料集，其中臨床病例的正確和不正確診斷都附有由醫生撰寫的自然語言解釋。這些解釋已使用論證組成（即前提、主張）和論證關係（即攻擊、支持）進行手動註解，產生多語言 CasiMedicos-Arg 資料集，其中包含 558 個具有解釋的四種語言（英語、西班牙語、法語、義大利語）的臨床病例，我們註解了 5021 個主張、2313 個前提、2431 個支持關係和 1106 個攻擊關係。我們最後展示了競爭基準如何針對論證探勘任務執行此具挑戰性的資料集。
+摘要：圖形學習因其廣泛的現實世界應用而備受關注。目前的熱門方法依賴於文本節點特徵，並通過使用 GNN 的淺層嵌入學習來獲取初始節點嵌入，這在捕捉深度文本語義方面表現出局限性。大語言模型 (LLM) 的最新進展已證明在理解文本語義方面具有優越的能力，轉換了傳統的文本特徵處理。本文提出了一種新的框架，將圖形轉換器架構與 LLM 增強的節點特徵相結合。具體來說，我們利用 LLM 來生成文本節點的豐富語義表示，然後在圖形轉換器中由多頭自我注意機制處理，以捕捉局部和全局圖形結構信息。我們的模型利用 Transformer 的注意機制來動態聚合鄰域信息，同時保留 LLM 嵌入提供的語義豐富性。實驗結果表明，LLM 增強的節點特徵顯著提高了圖形學習模型在節點分類任務上的性能。這種方法在多個圖形學習任務中顯示出有希望的結果，為將圖形網絡與語言模型相結合提供了實用的方向。
 
-##### **Explainable Diagnosis Prediction through Neuro-Symbolic Integration**
-2410.01855v2 by Qiuhao Lu, Rui Li, Elham Sagheb, Andrew Wen, Jinlian Wang, Liwei Wang, Jungwei W. Fan, Hongfang Liu
+##### **Cardiverse: Harnessing LLMs for Novel Card Game Prototyping**
+2502.07128v1 by Danrui Li, Sen Zhang, Sam S. Sohn, Kaidong Hu, Muhammad Usman, Mubbasir Kapadia
 
-Diagnosis prediction is a critical task in healthcare, where timely and
-accurate identification of medical conditions can significantly impact patient
-outcomes. Traditional machine learning and deep learning models have achieved
-notable success in this domain but often lack interpretability which is a
-crucial requirement in clinical settings. In this study, we explore the use of
-neuro-symbolic methods, specifically Logical Neural Networks (LNNs), to develop
-explainable models for diagnosis prediction. Essentially, we design and
-implement LNN-based models that integrate domain-specific knowledge through
-logical rules with learnable thresholds. Our models, particularly
-$M_{\text{multi-pathway}}$ and $M_{\text{comprehensive}}$, demonstrate superior
-performance over traditional models such as Logistic Regression, SVM, and
-Random Forest, achieving higher accuracy (up to 80.52\%) and AUROC scores (up
-to 0.8457) in the case study of diabetes prediction. The learned weights and
-thresholds within the LNN models provide direct insights into feature
-contributions, enhancing interpretability without compromising predictive
-power. These findings highlight the potential of neuro-symbolic approaches in
-bridging the gap between accuracy and explainability in healthcare AI
-applications. By offering transparent and adaptable diagnostic models, our work
-contributes to the advancement of precision medicine and supports the
-development of equitable healthcare solutions. Future research will focus on
-extending these methods to larger and more diverse datasets to further validate
-their applicability across different medical conditions and populations.
+The prototyping of computer games, particularly card games, requires
+extensive human effort in creative ideation and gameplay evaluation. Recent
+advances in Large Language Models (LLMs) offer opportunities to automate and
+streamline these processes. However, it remains challenging for LLMs to design
+novel game mechanics beyond existing databases, generate consistent gameplay
+environments, and develop scalable gameplay AI for large-scale evaluations.
+This paper addresses these challenges by introducing a comprehensive automated
+card game prototyping framework. The approach highlights a graph-based indexing
+method for generating novel game designs, an LLM-driven system for consistent
+game code generation validated by gameplay records, and a gameplay AI
+constructing method that uses an ensemble of LLM-generated action-value
+functions optimized through self-play. These contributions aim to accelerate
+card game prototyping, reduce human labor, and lower barriers to entry for game
+developers.
 
-摘要：診斷預測是醫療保健中的關鍵任務，及時且準確地識別醫療狀況會顯著影響患者的結果。傳統的機器學習和深度學習模型已在這個領域取得顯著成功，但通常缺乏可解釋性，這在臨床環境中是一項關鍵要求。在本研究中，我們探討了神經符號方法的應用，特別是邏輯神經網路 (LNN)，以開發用於診斷預測的可解釋模型。基本上，我們設計並實作了基於 LNN 的模型，這些模型透過具有可學習閾值的邏輯規則整合領域特定知識。我們的模型，特別是 $M_{\text{multi-pathway}}$ 和 $M_{\text{comprehensive}}$，表現出優於傳統模型（例如邏輯迴歸、SVM 和隨機森林）的優異效能，在糖尿病預測的案例研究中達到了更高的準確度（高達 80.52%）和 AUROC 分數（高達 0.8457）。LNN 模型中學習到的權重和閾值提供了對特徵貢獻的直接見解，增強了可解釋性，同時不影響預測能力。這些發現突顯了神經符號方法在彌合醫療保健 AI 應用中準確性和可解釋性差距方面的潛力。透過提供透明且適應性強的診斷模型，我們的研究有助於推進精準醫療，並支援公平醫療保健解決方案的開發。未來的研究將專注於將這些方法擴展到更大且更多樣化的資料集，以進一步驗證其在不同醫療狀況和人群中的適用性。
+摘要：電腦遊戲，尤其是卡牌遊戲的原型製作，需要大量的人力在創意構思和遊戲玩法評估上。大型語言模型 (LLM) 的最新進展提供了自動化和簡化這些流程的機會。然而，LLM 在設計超越現有資料庫的新穎遊戲機制、生成一致的遊戲環境，以及開發用於大規模評估的可擴充遊戲 AI 方面仍然面臨挑戰。本文通過引入一個全面的自動化卡牌遊戲原型製作框架來應對這些挑戰。該方法強調了一種基於圖表的索引方法，用於生成新穎的遊戲設計，一個由 LLM 驅動的系統，用於一致的遊戲程式碼生成，並由遊戲記錄驗證，以及一個遊戲 AI 構建方法，該方法使用由 LLM 生成的動作值函數的集合，通過自我對弈進行最佳化。這些貢獻旨在加速卡牌遊戲原型製作，減少人力，並降低遊戲開發人員的進入門檻。
 
-##### **Easydiagnos: a framework for accurate feature selection for automatic diagnosis in smart healthcare**
-2410.00366v1 by Prasenjit Maji, Amit Kumar Mondal, Hemanta Kumar Mondal, Saraju P. Mohanty
+##### **GraNNite: Enabling High-Performance Execution of Graph Neural Networks on Resource-Constrained Neural Processing Units**
+2502.06921v2 by Arghadip Das, Shamik Kundu, Arnab Raha, Soumendu Ghosh, Deepak Mathaikutty, Vijay Raghunathan
 
-The rapid advancements in artificial intelligence (AI) have revolutionized
-smart healthcare, driving innovations in wearable technologies, continuous
-monitoring devices, and intelligent diagnostic systems. However, security,
-explainability, robustness, and performance optimization challenges remain
-critical barriers to widespread adoption in clinical environments. This
-research presents an innovative algorithmic method using the Adaptive Feature
-Evaluator (AFE) algorithm to improve feature selection in healthcare datasets
-and overcome problems. AFE integrating Genetic Algorithms (GA), Explainable
-Artificial Intelligence (XAI), and Permutation Combination Techniques (PCT),
-the algorithm optimizes Clinical Decision Support Systems (CDSS), thereby
-enhancing predictive accuracy and interpretability. The proposed method is
-validated across three diverse healthcare datasets using six distinct machine
-learning algorithms, demonstrating its robustness and superiority over
-conventional feature selection techniques. The results underscore the
-transformative potential of AFE in smart healthcare, enabling personalized and
-transparent patient care. Notably, the AFE algorithm, when combined with a
-Multi-layer Perceptron (MLP), achieved an accuracy of up to 98.5%, highlighting
-its capability to improve clinical decision-making processes in real-world
-healthcare applications.
+Graph Neural Networks (GNNs) are vital for learning from graph-structured
+data, enabling applications in network analysis, recommendation systems, and
+speech analytics. Deploying them on edge devices like client PCs and laptops
+enhances real-time processing, privacy, and cloud independence. GNNs aid
+Retrieval-Augmented Generation (RAG) for Large Language Models (LLMs) and
+enable event-based vision tasks. However, irregular memory access, sparsity,
+and dynamic structures cause high latency and energy overhead on
+resource-constrained devices. While modern edge processors integrate CPUs,
+GPUs, and NPUs, NPUs designed for data-parallel tasks struggle with irregular
+GNN computations. We introduce GraNNite, the first hardware-aware framework
+optimizing GNN execution on commercial-off-the-shelf (COTS) SOTA DNN
+accelerators via a structured three-step methodology: (1) enabling NPU
+execution, (2) optimizing performance, and (3) trading accuracy for efficiency
+gains. Step 1 employs GraphSplit for workload distribution and StaGr for static
+aggregation, while GrAd and NodePad handle dynamic graphs. Step 2 boosts
+performance using EffOp for control-heavy tasks and GraSp for sparsity
+exploitation. Graph Convolution optimizations PreG, SymG, and CacheG reduce
+redundancy and memory transfers. Step 3 balances quality versus efficiency,
+where QuantGr applies INT8 quantization, and GrAx1, GrAx2, and GrAx3 accelerate
+attention, broadcast-add, and SAGE-max aggregation. On Intel Core Ultra AI PCs,
+GraNNite achieves 2.6X to 7.6X speedups over default NPU mappings and up to
+8.6X energy gains over CPUs and GPUs, delivering 10.8X and 6.7X higher
+performance than CPUs and GPUs, respectively, across GNN models.
 
-摘要：人工智慧 (AI) 的快速進展徹底改變了智慧醫療保健，推動了可穿戴技術、持續監控裝置和智慧診斷系統的創新。然而，安全性、可解釋性、穩健性和效能最佳化挑戰仍然是臨床環境中廣泛採用的關鍵障礙。本研究提出一個創新的演算法方法，使用自適應特徵評估器 (AFE) 演算法來改善醫療保健資料集中的特徵選取並克服問題。AFE 整合了遺傳演算法 (GA)、可解釋人工智慧 (XAI) 和排列組合技術 (PCT)，該演算法最佳化了臨床決策支援系統 (CDSS)，從而提高了預測準確性和可解釋性。所提出的方法使用六種不同的機器學習演算法驗證了三個不同的醫療保健資料集，證明了其穩健性和優於傳統特徵選取技術。結果強調了 AFE 在智慧醫療保健中的轉變潛力，實現了個人化和透明的患者照護。值得注意的是，AFE 演算法與多層感知器 (MLP) 結合使用時，準確度高達 98.5%，突顯了其改善實際醫療保健應用中臨床決策制定流程的能力。
+摘要：圖形神經網路 (GNN) 對於從圖形結構資料中學習至關重要，能應用於網路分析、推薦系統和語音分析。將其部署在邊緣裝置（例如用戶端電腦和筆電）上可增強即時處理、隱私和雲端獨立性。GNN 協助大型語言模型 (LLM) 的檢索增強生成 (RAG)，並支援基於事件的視覺任務。然而，不規則的記憶體存取、稀疏性和動態結構會導致資源受限裝置上的高延遲和能源負擔。儘管現代邊緣處理器整合了 CPU、GPU 和 NPU，但針對資料平行任務所設計的 NPU 難以處理不規則的 GNN 計算。我們引入了 GraNNite，這是第一個硬體感知框架，透過結構化的三步驟方法最佳化商用現成 (COTS) SOTA DNN 加速器上的 GNN 執行：(1) 啟用 NPU 執行，(2) 最佳化效能，以及 (3) 以準確度換取效率提升。步驟 1 使用 GraphSplit 進行工作負載分配，並使用 StaGr 進行靜態聚合，而 GrAd 和 NodePad 則處理動態圖形。步驟 2 使用 EffOp 提升控制密集型任務的效能，並使用 GraSp 進行稀疏性利用。圖形卷積最佳化 PreG、SymG 和 CacheG 減少了冗餘和記憶體傳輸。步驟 3 平衡品質與效率，其中 QuantGr 適用 INT8 量化，而 GrAx1、GrAx2 和 GrAx3 則加速注意力、廣播加法和 SAGE-max 聚合。在 Intel Core Ultra AI PC 上，GraNNite 在預設 NPU 映射上實現了 2.6X 到 7.6X 的加速，在 CPU 和 GPU 上實現了高達 8.6X 的能源增益，在 GNN 模型中分別提供了比 CPU 和 GPU 高出 10.8X 和 6.7X 的效能。
 
-##### **Dermatologist-like explainable AI enhances melanoma diagnosis accuracy: eye-tracking study**
-2409.13476v1 by Tirtha Chanda, Sarah Haggenmueller, Tabea-Clara Bucher, Tim Holland-Letz, Harald Kittler, Philipp Tschandl, Markus V. Heppt, Carola Berking, Jochen S. Utikal, Bastian Schilling, Claudia Buerger, Cristian Navarrete-Dechent, Matthias Goebeler, Jakob Nikolas Kather, Carolin V. Schneider, Benjamin Durani, Hendrike Durani, Martin Jansen, Juliane Wacker, Joerg Wacker, Reader Study Consortium, Titus J. Brinker
+##### **Automatic Annotation Augmentation Boosts Translation between Molecules and Natural Language**
+2502.06634v1 by Zhiqiang Zhong, Simon Sataa-Yu Larsen, Haoyu Guo, Tao Tang, Kuangyu Zhou, Davide Mottin
 
-Artificial intelligence (AI) systems have substantially improved
-dermatologists' diagnostic accuracy for melanoma, with explainable AI (XAI)
-systems further enhancing clinicians' confidence and trust in AI-driven
-decisions. Despite these advancements, there remains a critical need for
-objective evaluation of how dermatologists engage with both AI and XAI tools.
-In this study, 76 dermatologists participated in a reader study, diagnosing 16
-dermoscopic images of melanomas and nevi using an XAI system that provides
-detailed, domain-specific explanations. Eye-tracking technology was employed to
-assess their interactions. Diagnostic performance was compared with that of a
-standard AI system lacking explanatory features. Our findings reveal that XAI
-systems improved balanced diagnostic accuracy by 2.8 percentage points relative
-to standard AI. Moreover, diagnostic disagreements with AI/XAI systems and
-complex lesions were associated with elevated cognitive load, as evidenced by
-increased ocular fixations. These insights have significant implications for
-clinical practice, the design of AI tools for visual tasks, and the broader
-development of XAI in medical diagnostics.
+Recent advancements in AI for biological research focus on integrating
+molecular data with natural language to accelerate drug discovery. However, the
+scarcity of high-quality annotations limits progress in this area. This paper
+introduces LA$^3$, a Language-based Automatic Annotation Augmentation framework
+that leverages large language models to augment existing datasets, thereby
+improving AI training. We demonstrate the effectiveness of LA$^3$ by creating
+an enhanced dataset, LaChEBI-20, where we systematically rewrite the
+annotations of molecules from an established dataset. These rewritten
+annotations preserve essential molecular information while providing more
+varied sentence structures and vocabulary. Using LaChEBI-20, we train LaMolT5
+based on a benchmark architecture to learn the mapping between molecular
+representations and augmented annotations.
+  Experimental results on text-based *de novo* molecule generation and molecule
+captioning demonstrate that LaMolT5 outperforms state-of-the-art models.
+Notably, incorporating LA$^3$ leads to improvements of up to 301% over the
+benchmark architecture. Furthermore, we validate the effectiveness of LA$^3$
+notable applications in *image*, *text* and *graph* tasks, affirming its
+versatility and utility.
 
-摘要：人工智慧 (AI) 系統已大幅改善皮膚科醫師對黑色素瘤的診斷準確度，而可解釋 AI (XAI) 系統進一步提升臨床醫師對 AI 驅動決策的信心與信賴。儘管有這些進展，對於皮膚科醫師如何使用 AI 和 XAI 工具，仍有客觀評估的迫切需求。在這項研究中，76 位皮膚科醫師參與了一項讀者研究，使用 XAI 系統診斷 16 張黑色素瘤和痣的皮膚鏡影像，該系統提供詳細的領域特定說明。採用眼球追蹤技術來評估他們的互動。將診斷表現與缺乏說明功能的標準 AI 系統進行比較。我們的研究結果顯示，XAI 系統相較於標準 AI，將平衡診斷準確度提升了 2.8 個百分點。此外，與 AI/XAI 系統的診斷分歧和複雜的病灶與認知負擔升高有關，這由增加的眼睛注視次數所證實。這些見解對臨床實務、視覺任務 AI 工具的設計和醫學診斷中 XAI 的廣泛發展具有重大意義。
+摘要：<paragraph>人工智慧在生物研究上的最新進展，專注於將分子資料與自然語言整合，以加速藥物發現。然而，高品質註解的稀少限制了此領域的進展。這篇論文介紹了 LA$^3$，一個基於語言的自動註解擴充框架，它利用大型語言模型來擴充現有的資料集，進而改善人工智慧訓練。我們透過建立一個增強的資料集 LaChEBI-20 來展示 LA$^3$ 的有效性，我們系統性地改寫了一個既定資料集中分子的註解。這些改寫的註解保留了重要的分子資訊，同時提供了更多樣化的句子結構和詞彙。使用 LaChEBI-20，我們在基於基準架構上訓練 LaMolT5，以學習分子表示和擴充註解之間的對應。
+在基於文字的 *從頭開始* 分子生成和分子標題上的實驗結果表明，LaMolT5 優於最先進的模型。值得注意的是，納入 LA$^3$ 可讓基準架構的改進幅度高達 301%。此外，我們驗證了 LA$^3$ 在 *影像*、*文字* 和 *圖形* 任務中的有效性，肯定了它的多功能性和實用性。</paragraph>
 
-##### **Explainable AI for Autism Diagnosis: Identifying Critical Brain Regions Using fMRI Data**
-2409.15374v1 by Suryansh Vidya, Kush Gupta, Amir Aly, Andy Wills, Emmanuel Ifeachor, Rohit Shankar
+##### **KARMA: Leveraging Multi-Agent LLMs for Automated Knowledge Graph Enrichment**
+2502.06472v1 by Yuxing Lu, Jinzhuo Wang
 
-Early diagnosis and intervention for Autism Spectrum Disorder (ASD) has been
-shown to significantly improve the quality of life of autistic individuals.
-However, diagnostics methods for ASD rely on assessments based on clinical
-presentation that are prone to bias and can be challenging to arrive at an
-early diagnosis. There is a need for objective biomarkers of ASD which can help
-improve diagnostic accuracy. Deep learning (DL) has achieved outstanding
-performance in diagnosing diseases and conditions from medical imaging data.
-Extensive research has been conducted on creating models that classify ASD
-using resting-state functional Magnetic Resonance Imaging (fMRI) data. However,
-existing models lack interpretability. This research aims to improve the
-accuracy and interpretability of ASD diagnosis by creating a DL model that can
-not only accurately classify ASD but also provide explainable insights into its
-working. The dataset used is a preprocessed version of the Autism Brain Imaging
-Data Exchange (ABIDE) with 884 samples. Our findings show a model that can
-accurately classify ASD and highlight critical brain regions differing between
-ASD and typical controls, with potential implications for early diagnosis and
-understanding of the neural basis of ASD. These findings are validated by
-studies in the literature that use different datasets and modalities,
-confirming that the model actually learned characteristics of ASD and not just
-the dataset. This study advances the field of explainable AI in medical imaging
-by providing a robust and interpretable model, thereby contributing to a future
-with objective and reliable ASD diagnostics.
+Maintaining comprehensive and up-to-date knowledge graphs (KGs) is critical
+for modern AI systems, but manual curation struggles to scale with the rapid
+growth of scientific literature. This paper presents KARMA, a novel framework
+employing multi-agent large language models (LLMs) to automate KG enrichment
+through structured analysis of unstructured text. Our approach employs nine
+collaborative agents, spanning entity discovery, relation extraction, schema
+alignment, and conflict resolution that iteratively parse documents, verify
+extracted knowledge, and integrate it into existing graph structures while
+adhering to domain-specific schema. Experiments on 1,200 PubMed articles from
+three different domains demonstrate the effectiveness of KARMA in knowledge
+graph enrichment, with the identification of up to 38,230 new entities while
+achieving 83.1\% LLM-verified correctness and reducing conflict edges by 18.6\%
+through multi-layer assessments.
 
-摘要：自閉症譜系障礙 (ASD) 的早期診斷和介入已被證實能顯著改善自閉症患者的生活品質。然而，ASD 的診斷方法依賴於基於臨床表現的評估，容易產生偏見，且可能難以做出早期診斷。有必要找出 ASD 的客觀生物標記，以幫助提高診斷準確性。深度學習 (DL) 在從醫學影像資料診斷疾病和病症方面取得傑出的表現。已經針對建立使用靜態功能性磁振造影 (fMRI) 資料對 ASD 進行分類的模型進行廣泛的研究。然而，現有的模型缺乏可解釋性。本研究旨在透過建立一個不僅能準確分類 ASD，還能提供可解釋見解說明其運作原理的 DL 模型，來改善 ASD 診斷的準確性和可解釋性。所使用的資料集是自閉症大腦影像資料交換 (ABIDE) 的預處理版本，包含 884 個樣本。我們的研究結果顯示，該模型能準確分類 ASD，並強調 ASD 與典型對照組之間存在差異的關鍵腦區，對於 ASD 的早期診斷和神經基礎的理解具有潛在的意義。這些研究結果已由使用不同資料集和方式的文獻研究驗證，證實該模型實際上學習了 ASD 的特徵，而不僅僅是資料集。本研究透過提供一個強健且可解釋的模型，推動了醫學影像中可解釋 AI 的領域，從而為未來提供客觀且可靠的 ASD 診斷做出貢獻。
+摘要：維護全面且最新的知識圖譜 (KG) 對現代 AI 系統至關重要，但手動策劃難以隨著科學文獻的快速增長而擴展。本文提出了 KARMA，一個採用多代理大型語言模型 (LLM) 的新框架，透過對非結構化文本的結構化分析來自動化 KG 豐富化。我們的做法採用九個協作代理，涵蓋實體發現、關係提取、架構比對和衝突解決，這些代理會反覆分析文件、驗證提取的知識，並將其整合到現有的圖結構中，同時遵守特定領域的架構。針對來自三個不同領域的 1,200 篇 PubMed 文章進行的實驗證明了 KARMA 在知識圖譜豐富化方面的有效性，識別出多達 38,230 個新實體，同時達到 83.1% 的 LLM 驗證正確性，並透過多層評估將衝突邊緣降低了 18.6%。
 
-##### **Improving Prototypical Parts Abstraction for Case-Based Reasoning Explanations Designed for the Kidney Stone Type Recognition**
-2409.12883v1 by Daniel Flores-Araiza, Francisco Lopez-Tiro, Clément Larose, Salvador Hinojosa, Andres Mendez-Vazquez, Miguel Gonzalez-Mendoza, Gilberto Ochoa-Ruiz, Christian Daul
+##### **RoToR: Towards More Reliable Responses for Order-Invariant Inputs**
+2502.08662v1 by Soyoung Yoon, Dongha Ahn, Youngwon Lee, Minkyu Jung, HyungJoo Jang, Seung-won Hwang
 
-The in-vivo identification of the kidney stone types during an ureteroscopy
-would be a major medical advance in urology, as it could reduce the time of the
-tedious renal calculi extraction process, while diminishing infection risks.
-Furthermore, such an automated procedure would make possible to prescribe
-anti-recurrence treatments immediately. Nowadays, only few experienced
-urologists are able to recognize the kidney stone types in the images of the
-videos displayed on a screen during the endoscopy. Thus, several deep learning
-(DL) models have recently been proposed to automatically recognize the kidney
-stone types using ureteroscopic images. However, these DL models are of black
-box nature whicl limits their applicability in clinical settings. This
-contribution proposes a case-based reasoning DL model which uses prototypical
-parts (PPs) and generates local and global descriptors. The PPs encode for each
-class (i.e., kidney stone type) visual feature information (hue, saturation,
-intensity and textures) similar to that used by biologists. The PPs are
-optimally generated due a new loss function used during the model training.
-Moreover, the local and global descriptors of PPs allow to explain the
-decisions ("what" information, "where in the images") in an understandable way
-for biologists and urologists. The proposed DL model has been tested on a
-database including images of the six most widespread kidney stone types. The
-overall average classification accuracy was 90.37. When comparing this results
-with that of the eight other DL models of the kidney stone state-of-the-art, it
-can be seen that the valuable gain in explanability was not reached at the
-expense of accuracy which was even slightly increased with respect to that
-(88.2) of the best method of the literature. These promising and interpretable
-results also encourage urologists to put their trust in AI-based solutions.
+Mitigating positional bias of language models (LMs) for listwise inputs is a
+well-known and important problem (e.g., lost-in-the-middle). While zero-shot
+order-invariant LMs have been proposed to solve this issue, their success on
+practical listwise problems has been limited. In this work, as a first
+contribution, we identify and overcome two limitations to make zero-shot
+invariant LMs more practical: (1) training and inference distribution mismatch
+arising from modifying positional ID assignments to enforce invariance, and (2)
+failure to adapt to a mixture of order-invariant and sensitive inputs in
+practical listwise problems. To overcome, we propose (1) RoToR, a zero-shot
+invariant LM for genuinely order-invariant inputs with minimal modifications of
+positional IDs, and (2) Selective Routing, an adaptive framework that handles
+both order-invariant and order-sensitive inputs in listwise tasks. On the Lost
+in the middle (LitM), Knowledge Graph Question Answering (KGQA), and MMLU
+benchmarks, we show that RoToR with Selective Routing can effectively handle
+practical listwise input tasks in a zero-shot manner.
 
-摘要：尿路鏡檢查中腎結石類型的體內識別將是泌尿科的一項重大進展，因為它可以減少繁瑣的腎結石取出過程的時間，同時降低感染風險。此外，這種自動化程序將使立即開立抗復發治療成為可能。如今，只有少數經驗豐富的泌尿科醫生能夠在內視鏡檢查期間屏幕上顯示的視頻圖像中識別腎結石類型。因此，最近已提出多種深度學習 (DL) 模型，以使用輸尿管鏡圖像自動識別腎結石類型。然而，這些 DL 模型本質上是黑盒子，這限制了它們在臨床環境中的應用性。本文提出了一個基於案例推理的 DL 模型，它使用原型部分 (PP) 並生成局部和全局描述符。PP 為每種類型（即腎結石類型）編碼視覺特徵信息（色調、飽和度、強度和紋理），類似於生物學家使用的信息。由於在模型訓練期間使用的新損失函數，PP 得到了最佳生成。此外，PP 的局部和全局描述符允許以生物學家和泌尿科醫生可以理解的方式解釋決策（“什麼”信息，“圖像中的什麼位置”）。所提出的 DL 模型已在一個包含六種最廣泛的腎結石類型圖像的數據庫上進行了測試。總體平均分類準確率為 90.37。將此結果與腎結石最先進的八個其他 DL 模型的結果進行比較時，可以看出，可解釋性的寶貴增益並未以準確性為代價，甚至略有增加與文獻中最好的方法 (88.2) 相比。這些有希望且可解釋的結果也鼓勵泌尿科醫生相信基於人工智能的解決方案。
+摘要：語言模型 (LM) 的位置偏差緩解對於列表輸入來說是一個廣為人知且重要的問題（例如，迷失在中間）。雖然已經提出零次學習順序不變的 LM 來解決這個問題，但它們在實際列表問題上的成功卻很有限。在這項工作中，作為第一個貢獻，我們找出並克服了兩個限制，讓零次學習不變的 LM 更有實用性：(1) 訓練和推論分布不匹配，這是由於修改位置 ID 分配以強制不變性所造成的，以及 (2) 無法適應實際列表問題中不變和敏感輸入的組合。為了克服這些問題，我們提出 (1) RoToR，一個零次學習不變的 LM，用於真正不變的輸入，並對位置 ID 進行最小的修改，以及 (2) 選擇性路由，一個自適應框架，用於處理列表任務中不變和敏感的輸入。在迷失在中間 (LitM)、知識圖譜問答 (KGQA) 和 MMLU 基準測試中，我們展示了 RoToR 與選擇性路由可以有效地以零次學習的方式處理實際的列表輸入任務。
 
-##### **Towards Interpretable End-Stage Renal Disease (ESRD) Prediction: Utilizing Administrative Claims Data with Explainable AI Techniques**
-2409.12087v3 by Yubo Li, Saba Al-Sayouri, Rema Padman
+##### **K-ON: Stacking Knowledge On the Head Layer of Large Language Model**
+2502.06257v1 by Lingbing Guo, Yichi Zhang, Zhongpu Bo, Zhuo Chen, Mengshu Sun, Zhiqiang Zhang, Wen Zhang, Huajun Chen
 
-This study explores the potential of utilizing administrative claims data,
-combined with advanced machine learning and deep learning techniques, to
-predict the progression of Chronic Kidney Disease (CKD) to End-Stage Renal
-Disease (ESRD). We analyze a comprehensive, 10-year dataset provided by a major
-health insurance organization to develop prediction models for multiple
-observation windows using traditional machine learning methods such as Random
-Forest and XGBoost as well as deep learning approaches such as Long Short-Term
-Memory (LSTM) networks. Our findings demonstrate that the LSTM model,
-particularly with a 24-month observation window, exhibits superior performance
-in predicting ESRD progression, outperforming existing models in the
-literature. We further apply SHapley Additive exPlanations (SHAP) analysis to
-enhance interpretability, providing insights into the impact of individual
-features on predictions at the individual patient level. This study underscores
-the value of leveraging administrative claims data for CKD management and
-predicting ESRD progression.
+Recent advancements in large language models (LLMs) have significantly
+improved various natural language processing (NLP) tasks. Typically, LLMs are
+trained to predict the next token, aligning well with many NLP tasks. However,
+in knowledge graph (KG) scenarios, entities are the fundamental units and
+identifying an entity requires at least several tokens. This leads to a
+granularity mismatch between KGs and natural languages. To address this issue,
+we propose K-ON, which integrates KG knowledge into the LLM by employing
+multiple head layers for next k-step prediction. K-ON can not only generate
+entity-level results in one step, but also enables contrastive loss against
+entities, which is the most powerful tool in KG representation learning.
+Experimental results show that K-ON outperforms state-of-the-art methods that
+incorporate text and even the other modalities.
 
-摘要：本研究探討利用行政申報資料，結合先進機器學習與深度學習技術，預測慢性腎臟病 (CKD) 進展至末期腎臟疾病 (ESRD) 的可能性。我們分析一家大型健康保險組織提供的 10 年綜合資料集，使用傳統機器學習方法（例如隨機森林和 XGBoost）以及深度學習方法（例如長期短期記憶 (LSTM) 網路）開發多個觀察視窗的預測模型。我們的研究結果顯示，LSTM 模型（尤其是 24 個月觀察視窗）在預測 ESRD 進展方面表現優異，優於文獻中的現有模型。我們進一步應用 SHapley 可加性解釋 (SHAP) 分析以增強可解釋性，深入了解個別特徵對個別患者層級預測的影響。本研究強調了利用行政申報資料進行 CKD 管理和預測 ESRD 進展的價值。
+摘要：大型語言模型 (LLM) 的最新進展顯著提升了各種自然語言處理 (NLP) 任務。通常，LLM 會接受訓練以預測下一個符號，這與許多 NLP 任務非常吻合。然而，在知識圖譜 (KG) 場景中，實體是基本單位，而識別實體至少需要幾個符號。這導致 KG 和自然語言之間的粒度不匹配。為了解決這個問題，我們提出了 K-ON，它透過採用多個頭部層進行下一個 k 步預測，將 KG 知識整合到 LLM 中。K-ON 不僅可以在一個步驟中產生實體層級的結果，還能針對實體啟用對比損失，這是 KG 表示學習中最有力的工具。實驗結果顯示，K-ON 優於將文字甚至其他方式納入考量的最新方法。
 
-##### **Contextual Evaluation of Large Language Models for Classifying Tropical and Infectious Diseases**
-2409.09201v3 by Mercy Asiedu, Nenad Tomasev, Chintan Ghate, Tiya Tiyasirichokchai, Awa Dieng, Oluwatosin Akande, Geoffrey Siwo, Steve Adudans, Sylvanus Aitkins, Odianosen Ehiakhamen, Eric Ndombi, Katherine Heller
+##### **LegalViz: Legal Text Visualization by Text To Diagram Generation**
+2502.06147v2 by Eri Onami, Taiki Miyanishi, Koki Maeda, Shuhei Kurita
 
-While large language models (LLMs) have shown promise for medical question
-answering, there is limited work focused on tropical and infectious
-disease-specific exploration. We build on an opensource tropical and infectious
-diseases (TRINDs) dataset, expanding it to include demographic and semantic
-clinical and consumer augmentations yielding 11000+ prompts. We evaluate LLM
-performance on these, comparing generalist and medical LLMs, as well as LLM
-outcomes to human experts. We demonstrate through systematic experimentation,
-the benefit of contextual information such as demographics, location, gender,
-risk factors for optimal LLM response. Finally we develop a prototype of
-TRINDs-LM, a research tool that provides a playground to navigate how context
-impacts LLM outputs for health.
+Legal documents including judgments and court orders require highly
+sophisticated legal knowledge for understanding. To disclose expert knowledge
+for non-experts, we explore the problem of visualizing legal texts with
+easy-to-understand diagrams and propose a novel dataset of LegalViz with 23
+languages and 7,010 cases of legal document and visualization pairs, using the
+DOT graph description language of Graphviz. LegalViz provides a simple diagram
+from a complicated legal corpus identifying legal entities, transactions, legal
+sources, and statements at a glance, that are essential in each judgment. In
+addition, we provide new evaluation metrics for the legal diagram visualization
+by considering graph structures, textual similarities, and legal contents. We
+conducted empirical studies on few-shot and finetuning large language models
+for generating legal diagrams and evaluated them with these metrics, including
+legal content-based evaluation within 23 languages. Models trained with
+LegalViz outperform existing models including GPTs, confirming the
+effectiveness of our dataset.
 
-摘要：儘管大型語言模型 (LLM) 在醫療問題解答方面展現出前景，但專注於熱帶和傳染病特定探索的研究有限。我們建立在一個開放原始碼熱帶和傳染病 (TRINDs) 資料集上，並將其擴展為納入人口統計和語義臨床和消費者擴充，產生超過 11000 個提示。我們評估了 LLM 在這些方面的效能，比較了通才和醫療 LLM，以及 LLM 結果與人類專家的比較。我們透過系統性實驗證明了背景資訊（例如人口統計、位置、性別、最佳 LLM 回應的風險因素）的好處。最後，我們開發了 TRINDs-LM 的原型，這是一個研究工具，提供一個探索背景如何影響 LLM 健康輸出的平台。
+摘要：法律文件，包括判決和法院命令，需要高度專業的法律知識才能理解。為了向非專家揭露專家知識，我們探討了使用易於理解的圖表將法律文本視覺化的問題，並提出了一個新的 LegalViz 數據集，其中包含 23 種語言和 7,010 個法律文件和視覺化配對，使用 Graphviz 的 DOT 圖形描述語言。LegalViz 從複雜的法律語料庫中提供了一個簡單的圖表，可以一目了然地識別法律實體、交易、法律來源和陳述，這些在每項判決中都是必不可少的。此外，我們通過考慮圖形結構、文本相似性和法律內容，為法律圖表視覺化提供了新的評估指標。我們對少次學習和微調大型語言模型進行了實證研究，以生成法律圖表，並使用這些指標對它們進行了評估，包括在 23 種語言中基於法律內容的評估。使用 LegalViz 訓練的模型優於現有的模型，包括 GPT，證實了我們數據集的有效性。
 
-##### **Explainable AI: Definition and attributes of a good explanation for health AI**
-2409.15338v1 by Evangelia Kyrimi, Scott McLachlan, Jared M Wohlgemut, Zane B Perkins, David A. Lagnado, William Marsh, the ExAIDSS Expert Group
+##### **Deconstructing Depression Stigma: Integrating AI-driven Data Collection and Analysis with Causal Knowledge Graphs**
+2502.06075v1 by Han Meng, Renwen Zhang, Ganyi Wang, Yitian Yang, Peinuan Qin, Jungup Lee, Yi-Chieh Lee
 
-Proposals of artificial intelligence (AI) solutions based on increasingly
-complex and accurate predictive models are becoming ubiquitous across many
-disciplines. As the complexity of these models grows, transparency and users'
-understanding often diminish. This suggests that accurate prediction alone is
-insufficient for making an AI-based solution truly useful. In the development
-of healthcare systems, this introduces new issues related to accountability and
-safety. Understanding how and why an AI system makes a recommendation may
-require complex explanations of its inner workings and reasoning processes.
-Although research on explainable AI (XAI) has significantly increased in recent
-years and there is high demand for XAI in medicine, defining what constitutes a
-good explanation remains ad hoc, and providing adequate explanations continues
-to be challenging. To fully realize the potential of AI, it is critical to
-address two fundamental questions about explanations for safety-critical AI
-applications, such as health-AI: (1) What is an explanation in health-AI? and
-(2) What are the attributes of a good explanation in health-AI? In this study,
-we examined published literature and gathered expert opinions through a
-two-round Delphi study. The research outputs include (1) a definition of what
-constitutes an explanation in health-AI and (2) a comprehensive list of
-attributes that characterize a good explanation in health-AI.
+Mental-illness stigma is a persistent social problem, hampering both
+treatment-seeking and recovery. Accordingly, there is a pressing need to
+understand it more clearly, but analyzing the relevant data is highly
+labor-intensive. Therefore, we designed a chatbot to engage participants in
+conversations; coded those conversations qualitatively with AI assistance; and,
+based on those coding results, built causal knowledge graphs to decode stigma.
+The results we obtained from 1,002 participants demonstrate that conversation
+with our chatbot can elicit rich information about people's attitudes toward
+depression, while our AI-assisted coding was strongly consistent with
+human-expert coding. Our novel approach combining large language models (LLMs)
+and causal knowledge graphs uncovered patterns in individual responses and
+illustrated the interrelationships of psychological constructs in the dataset
+as a whole. The paper also discusses these findings' implications for HCI
+researchers in developing digital interventions, decomposing human
+psychological constructs, and fostering inclusive attitudes.
 
-摘要：隨著越來越複雜且準確的預測模型，基於人工智慧 (AI) 解決方案的提案在許多領域中變得無處不在。隨著這些模型複雜性的增加，透明度和使用者的理解力往往會降低。這表示僅有準確的預測並不足以讓 AI 解決方案真正有用。在醫療保健系統的開發中，這引入了與問責制和安全性相關的新問題。瞭解 AI 系統如何以及為何提出建議可能需要對其內部運作和推理過程進行複雜的說明。儘管近年來對可解釋 AI (XAI) 的研究已大幅增加，且醫學領域對 XAI 有很高的需求，但定義什麼構成一個好的解釋仍是臨時性的，而提供適當的解釋仍然具有挑戰性。為了充分發揮 AI 的潛力，對於安全關鍵型 AI 應用（例如健康 AI）的解釋，探討兩個基本問題至關重要：(1) 什麼是健康 AI 中的解釋？以及 (2) 健康 AI 中一個好的解釋有哪些屬性？在本研究中，我們檢視了已發表的文獻，並透過兩輪德爾菲研究收集了專家意見。研究成果包括：(1) 健康 AI 中什麼構成解釋的定義，以及 (2) 健康 AI 中一個好解釋的屬性清單。
+摘要：精神疾病的污名化是一個持續存在的社會問題，阻礙了尋求治療和康復。因此，迫切需要更清楚地了解它，但分析相關數據非常費力。因此，我們設計了一個聊天機器人，讓參與者參與對話；使用 AI 協助對這些對話進行定性編碼；並根據這些編碼結果，構建因果知識圖譜來破譯污名化。我們從 1,002 名參與者那裡獲得的結果表明，與我們的聊天機器人的對話可以引出人們對憂鬱症的豐富資訊，而我們 AI 輔助的編碼與人類專家編碼非常一致。我們將大型語言模型 (LLM) 和因果知識圖譜相結合的新方法揭示了個別反應中的模式，並說明了資料集中心理建構之間的相互關係。本文還討論了這些發現對 HCI 研究人員在開發數位介入措施、分解人類心理建構和培養包容態度方面的影響。
 
-##### **Exploring the Effect of Explanation Content and Format on User Comprehension and Trust**
-2408.17401v1 by Antonio Rago, Bence Palfi, Purin Sukpanichnant, Hannibal Nabli, Kavyesh Vivek, Olga Kostopoulou, James Kinross, Francesca Toni
+##### **LegalSeg: Unlocking the Structure of Indian Legal Judgments Through Rhetorical Role Classification**
+2502.05836v1 by Shubham Kumar Nigam, Tanmay Dubey, Govind Sharma, Noel Shallum, Kripabandhu Ghosh, Arnab Bhattacharya
 
-In recent years, various methods have been introduced for explaining the
-outputs of "black-box" AI models. However, it is not well understood whether
-users actually comprehend and trust these explanations. In this paper, we focus
-on explanations for a regression tool for assessing cancer risk and examine the
-effect of the explanations' content and format on the user-centric metrics of
-comprehension and trust. Regarding content, we experiment with two explanation
-methods: the popular SHAP, based on game-theoretic notions and thus potentially
-complex for everyday users to comprehend, and occlusion-1, based on feature
-occlusion which may be more comprehensible. Regarding format, we present SHAP
-explanations as charts (SC), as is conventional, and occlusion-1 explanations
-as charts (OC) as well as text (OT), to which their simpler nature also lends
-itself. The experiments amount to user studies questioning participants, with
-two different levels of expertise (the general population and those with some
-medical training), on their subjective and objective comprehension of and trust
-in explanations for the outputs of the regression tool. In both studies we
-found a clear preference in terms of subjective comprehension and trust for
-occlusion-1 over SHAP explanations in general, when comparing based on content.
-However, direct comparisons of explanations when controlling for format only
-revealed evidence for OT over SC explanations in most cases, suggesting that
-the dominance of occlusion-1 over SHAP explanations may be driven by a
-preference for text over charts as explanations. Finally, we found no evidence
-of a difference between the explanation types in terms of objective
-comprehension. Thus overall, the choice of the content and format of
-explanations needs careful attention, since in some contexts format, rather
-than content, may play the critical role in improving user experience.
+In this paper, we address the task of semantic segmentation of legal
+documents through rhetorical role classification, with a focus on Indian legal
+judgments. We introduce LegalSeg, the largest annotated dataset for this task,
+comprising over 7,000 documents and 1.4 million sentences, labeled with 7
+rhetorical roles. To benchmark performance, we evaluate multiple
+state-of-the-art models, including Hierarchical BiLSTM-CRF,
+TransformerOverInLegalBERT (ToInLegalBERT), Graph Neural Networks (GNNs), and
+Role-Aware Transformers, alongside an exploratory RhetoricLLaMA, an
+instruction-tuned large language model. Our results demonstrate that models
+incorporating broader context, structural relationships, and sequential
+sentence information outperform those relying solely on sentence-level
+features. Additionally, we conducted experiments using surrounding context and
+predicted or actual labels of neighboring sentences to assess their impact on
+classification accuracy. Despite these advancements, challenges persist in
+distinguishing between closely related roles and addressing class imbalance.
+Our work underscores the potential of advanced techniques for improving legal
+document understanding and sets a strong foundation for future research in
+legal NLP.
 
-摘要：<paragraph>近年來，已經引進各種方法來解釋「黑箱」AI 模型的輸出。然而，目前並不清楚使用者是否實際理解和信任這些解釋。在本文中，我們專注於評估癌症風險的回歸工具的解釋，並探討解釋的內容和格式對以使用者為中心的理解和信任指標的影響。關於內容，我們實驗了兩種解釋方法：流行的 SHAP，基於博弈論概念，因此對於日常使用者來說可能很複雜，以及基於特徵遮蔽的 occlusion-1，可能更易於理解。關於格式，我們將 SHAP 解釋呈現為圖表 (SC)，這是慣例，而將 occlusion-1 解釋呈現為圖表 (OC) 以及文字 (OT)，其較為簡單的性質也適用於此。這些實驗等同於使用者研究，詢問參與者，具有兩種不同程度的專業知識（一般民眾和具備一些醫學訓練的人），他們對回歸工具輸出解釋的主觀和客觀理解和信任。在兩項研究中，我們發現，在基於內容進行比較時，一般來說，occlusion-1 優於 SHAP 解釋，在主觀理解和信任方面有明顯的偏好。然而，在僅控制格式的情況下直接比較解釋，在大多數情況下只顯示 OT 優於 SC 解釋的證據，這表明 occlusion-1 優於 SHAP 解釋的主導地位可能是由偏好文字而非圖表作為解釋所驅動的。最後，我們沒有發現解釋類型在客觀理解方面的差異證據。因此，總體而言，對解釋的內容和格式的選擇需要仔細注意，因為在某些情況下，格式而非內容，可能在改善使用者體驗方面發揮關鍵作用。</paragraph>
+摘要：<paragraph>在本文中，我們通過修辭角色分類來探討法律文件的語義分段任務，重點關注印度法律判決。我們引入了 LegalSeg，這是此任務中最大的註釋資料集，包含超過 7,000 份文件和 140 萬個句子，並標記了 7 個修辭角色。為了評量效能，我們評估了多個最先進的模型，包括分層 BiLSTM-CRF、TransformerOverInLegalBERT (ToInLegalBERT)、圖神經網路 (GNN) 和角色感知Transformer，以及探索性的 RhetoricLLaMA，一種經過指令調整的大型語言模型。我們的結果表明，結合廣泛背景、結構關係和順序句子資訊的模型，表現優於僅依賴句子層級特徵的模型。此外，我們使用周圍的背景和鄰近句子的預測或實際標籤進行實驗，以評估它們對分類精度的影響。儘管有這些進展，但在區分密切相關的角色和解決類別不平衡方面仍存在挑戰。我們的研究強調了先進技術在改善法律文件理解方面的潛力，並為法律自然語言處理的未來研究奠定了堅實的基礎。</paragraph>
 
-##### **A Survey for Large Language Models in Biomedicine**
-2409.00133v1 by Chong Wang, Mengyao Li, Junjun He, Zhongruo Wang, Erfan Darzi, Zan Chen, Jin Ye, Tianbin Li, Yanzhou Su, Jing Ke, Kaili Qu, Shuxin Li, Yi Yu, Pietro Liò, Tianyun Wang, Yu Guang Wang, Yiqing Shen
+##### **LLM-Powered Decentralized Generative Agents with Adaptive Hierarchical Knowledge Graph for Cooperative Planning**
+2502.05453v1 by Hanqing Yang, Jingdi Chen, Marie Siew, Tania Lorido-Botran, Carlee Joe-Wong
 
-Recent breakthroughs in large language models (LLMs) offer unprecedented
-natural language understanding and generation capabilities. However, existing
-surveys on LLMs in biomedicine often focus on specific applications or model
-architectures, lacking a comprehensive analysis that integrates the latest
-advancements across various biomedical domains. This review, based on an
-analysis of 484 publications sourced from databases including PubMed, Web of
-Science, and arXiv, provides an in-depth examination of the current landscape,
-applications, challenges, and prospects of LLMs in biomedicine, distinguishing
-itself by focusing on the practical implications of these models in real-world
-biomedical contexts. Firstly, we explore the capabilities of LLMs in zero-shot
-learning across a broad spectrum of biomedical tasks, including diagnostic
-assistance, drug discovery, and personalized medicine, among others, with
-insights drawn from 137 key studies. Then, we discuss adaptation strategies of
-LLMs, including fine-tuning methods for both uni-modal and multi-modal LLMs to
-enhance their performance in specialized biomedical contexts where zero-shot
-fails to achieve, such as medical question answering and efficient processing
-of biomedical literature. Finally, we discuss the challenges that LLMs face in
-the biomedicine domain including data privacy concerns, limited model
-interpretability, issues with dataset quality, and ethics due to the sensitive
-nature of biomedical data, the need for highly reliable model outputs, and the
-ethical implications of deploying AI in healthcare. To address these
-challenges, we also identify future research directions of LLM in biomedicine
-including federated learning methods to preserve data privacy and integrating
-explainable AI methodologies to enhance the transparency of LLMs.
+Developing intelligent agents for long-term cooperation in dynamic open-world
+scenarios is a major challenge in multi-agent systems. Traditional Multi-agent
+Reinforcement Learning (MARL) frameworks like centralized training
+decentralized execution (CTDE) struggle with scalability and flexibility. They
+require centralized long-term planning, which is difficult without custom
+reward functions, and face challenges in processing multi-modal data. CTDE
+approaches also assume fixed cooperation strategies, making them impractical in
+dynamic environments where agents need to adapt and plan independently. To
+address decentralized multi-agent cooperation, we propose Decentralized
+Adaptive Knowledge Graph Memory and Structured Communication System (DAMCS) in
+a novel Multi-agent Crafter environment. Our generative agents, powered by
+Large Language Models (LLMs), are more scalable than traditional MARL agents by
+leveraging external knowledge and language for long-term planning and
+reasoning. Instead of fully sharing information from all past experiences,
+DAMCS introduces a multi-modal memory system organized as a hierarchical
+knowledge graph and a structured communication protocol to optimize agent
+cooperation. This allows agents to reason from past interactions and share
+relevant information efficiently. Experiments on novel multi-agent open-world
+tasks show that DAMCS outperforms both MARL and LLM baselines in task
+efficiency and collaboration. Compared to single-agent scenarios, the two-agent
+scenario achieves the same goal with 63% fewer steps, and the six-agent
+scenario with 74% fewer steps, highlighting the importance of adaptive memory
+and structured communication in achieving long-term goals. We publicly release
+our project at: https://happyeureka.github.io/damcs.
 
-摘要：大型語言模型 (LLM) 的最新突破提供了前所未有的自然語言理解和生成能力。然而，現有關於生物醫學中 LLM 的調查通常專注於特定應用或模型架構，缺乏整合各種生物醫學領域最新進展的全面分析。本綜述基於對來自 PubMed、Web of Science 和 arXiv 等數據庫的 484 篇出版物的分析，深入探討了生物醫學中 LLM 的當前現況、應用、挑戰和前景，其特點是關注這些模型在現實世界生物醫學背景中的實際應用。首先，我們探討了 LLM 在廣泛的生物醫學任務中的零次學習能力，包括診斷輔助、藥物發現和個性化醫療等，並從 137 項關鍵研究中汲取見解。然後，我們討論了 LLM 的適應策略，包括單模態和多模態 LLM 的微調方法，以增強它們在零次學習無法實現的專業生物醫學背景中的性能，例如醫療問題解答和生物醫學文獻的有效處理。最後，我們討論了 LLM 在生物醫學領域面臨的挑戰，包括數據隱私問題、模型可解釋性有限、數據集質量問題以及由於生物醫學數據的敏感性、對高度可靠模型輸出的需求以及在醫療保健中部署 AI 的倫理影響而產生的倫理問題。為了應對這些挑戰，我們還確定了生物醫學中 LLM 未來的研究方向，包括用於保護數據隱私的聯合學習方法以及整合可解釋 AI 方法以增強 LLM 的透明度。
+摘要：<paragraph>在動態開放世界情境中開發用於長期合作的智慧代理是多重代理系統中的一項重大挑戰。傳統的多重代理強化學習 (MARL) 框架，例如集中式訓練去中心化執行 (CTDE)，在可擴充性和靈活性方面面臨困難。它們需要集中式長期規劃，這在沒有自訂獎勵函數的情況下很難執行，並且在處理多模式數據時會面臨挑戰。CTDE 方法還假設固定的合作策略，這使得它們在代理需要獨立適應和規劃的動態環境中不切實際。為了解決分散式多重代理合作問題，我們在一個新穎的多重代理工匠環境中提出了分散式自適應知識圖譜記憶體和結構化通訊系統 (DAMCS)。我們的生成代理由大型語言模型 (LLM) 提供支援，透過利用外部知識和語言進行長期規劃和推理，比傳統的 MARL 代理更具可擴充性。DAMCS 沒有完全分享來自所有過去經驗的資訊，而是引入了多模式記憶體系統，該系統組織成階層式知識圖譜和結構化通訊協定，以最佳化代理合作。這允許代理根據過去的互動進行推理並有效地分享相關資訊。在新的多重代理開放世界任務上的實驗表明，DAMCS 在任務效率和協作方面優於 MARL 和 LLM 基準。與單一代理情境相比，雙重代理情境以少 63% 的步驟達成相同的目標，而六重代理情境則以少 74% 的步驟達成目標，突顯了自適應記憶體和結構化通訊在達成長期目標中的重要性。我們公開發布我們的專案於：https://happyeureka.github.io/damcs。</paragraph>
 
-##### **Aligning XAI with EU Regulations for Smart Biomedical Devices: A Methodology for Compliance Analysis**
-2408.15121v1 by Francesco Sovrano, Michael Lognoul, Giulia Vilone
+##### **SAMGPT: Text-free Graph Foundation Model for Multi-domain Pre-training and Cross-domain Adaptation**
+2502.05424v1 by Xingtong Yu, Zechuan Gong, Chang Zhou, Yuan Fang, Hui Zhang
 
-Significant investment and development have gone into integrating Artificial
-Intelligence (AI) in medical and healthcare applications, leading to advanced
-control systems in medical technology. However, the opacity of AI systems
-raises concerns about essential characteristics needed in such sensitive
-applications, like transparency and trustworthiness. Our study addresses these
-concerns by investigating a process for selecting the most adequate Explainable
-AI (XAI) methods to comply with the explanation requirements of key EU
-regulations in the context of smart bioelectronics for medical devices. The
-adopted methodology starts with categorising smart devices by their control
-mechanisms (open-loop, closed-loop, and semi-closed-loop systems) and delving
-into their technology. Then, we analyse these regulations to define their
-explainability requirements for the various devices and related goals.
-Simultaneously, we classify XAI methods by their explanatory objectives. This
-allows for matching legal explainability requirements with XAI explanatory
-goals and determining the suitable XAI algorithms for achieving them. Our
-findings provide a nuanced understanding of which XAI algorithms align better
-with EU regulations for different types of medical devices. We demonstrate this
-through practical case studies on different neural implants, from chronic
-disease management to advanced prosthetics. This study fills a crucial gap in
-aligning XAI applications in bioelectronics with stringent provisions of EU
-regulations. It provides a practical framework for developers and researchers,
-ensuring their AI innovations advance healthcare technology and adhere to legal
-and ethical standards.
+Graphs are able to model interconnected entities in many online services,
+supporting a wide range of applications on the Web. This raises an important
+question: How can we train a graph foundational model on multiple source
+domains and adapt to an unseen target domain? A major obstacle is that graphs
+from different domains often exhibit divergent characteristics. Some studies
+leverage large language models to align multiple domains based on textual
+descriptions associated with the graphs, limiting their applicability to
+text-attributed graphs. For text-free graphs, a few recent works attempt to
+align different feature distributions across domains, while generally
+neglecting structural differences. In this work, we propose a novel Structure
+Alignment framework for text-free Multi-domain Graph Pre-Training and
+cross-domain adaptation (SAMGPT). It is designed to learn multi-domain
+knowledge from graphs originating in multiple source domains, which can then be
+adapted to address applications in an unseen target domain. Specifically, we
+introduce a set of structure tokens to harmonize structure-based aggregation
+across source domains during the pre-training phase. Next, for cross-domain
+adaptation, we design dual prompts, namely, holistic prompts and specific
+prompts, which adapt unified multi-domain structural knowledge and
+fine-grained, domain-specific information, respectively, to a target domain.
+Finally, we conduct comprehensive experiments on seven public datasets to
+evaluate and analyze the effectiveness of SAMGPT.
 
-摘要：人工智慧（AI）在醫療和保健應用中投入了大量的投資和開發，進而導致醫療技術中的先進控制系統。然而，AI 系統的不透明性引發了對此類敏感應用中所需基本特性的擔憂，例如透明度和可信度。我們的研究透過調查一個程序來解決這些問題，用於選擇最充分的可解釋 AI（XAI）方法，以符合歐盟法規在醫療器材的智慧型生物電子學中的說明要求。採用的方法從透過其控制機制（開迴路、閉迴路和半閉迴路系統）對智慧型裝置進行分類，並深入探討其技術開始。然後，我們分析這些法規以定義其對各種裝置和相關目標的可解釋性要求。同時，我們透過其說明目標對 XAI 方法進行分類。這允許將法律可解釋性要求與 XAI 說明目標相匹配，並確定適當的 XAI 演算法來達成它們。我們的研究結果提供了對哪些 XAI 演算法更符合歐盟法規以適用於不同類型的醫療器材的細緻理解。我們透過不同神經植入物的實際案例研究來證明這一點，從慢性疾病管理到先進的義肢。這項研究填補了將生物電子學中的 XAI 應用與歐盟法規的嚴格規定相符的重要空白。它為開發人員和研究人員提供了一個實用的架構，確保其 AI 創新能促進醫療技術並遵守法律和道德標準。
+摘要：圖表能夠在許多線上服務中對相互關聯的實體進行建模，
+支援網路上廣泛的應用程式。這提出了重要的問題：我們如何針對多個來源網域訓練圖表基礎模型，並適應未見過的目標網域？一個主要的障礙是，來自不同網域的圖表通常表現出不同的特性。一些研究利用大型語言模型，根據與圖表相關的文字描述，對齊多個網域，限制其適用性於有文字屬性的圖表。對於沒有文字的圖表，最近的一些作品嘗試對齊跨網域的不同特徵分佈，同時通常忽略結構上的差異。在這項工作中，我們提出了一個新的結構對齊框架，用於無文字多網域圖表預訓練和跨網域適應 (SAMGPT)。它被設計為從起源於多個來源網域的圖表中學習多網域知識，然後可以適應於未見過的目標網域中的應用程式。具體來說，我們引入了一組結構化代碼，以在預訓練階段，調和跨來源網域的基於結構的聚合。接下來，對於跨網域適應，我們設計了雙重提示，即整體提示和具體提示，分別將統一的多網域結構知識和細緻的、特定於網域的資訊適應到目標網域。最後，我們在七個公共資料集上進行了全面的實驗，以評估和分析 SAMGPT 的有效性。
 
-##### **Towards Case-based Interpretability for Medical Federated Learning**
-2408.13626v1 by Laura Latorre, Liliana Petrychenko, Regina Beets-Tan, Taisiya Kopytova, Wilson Silva
+##### **Graph-based Molecular In-context Learning Grounded on Morgan Fingerprints**
+2502.05414v1 by Ali Al-Lawati, Jason Lucas, Zhiwei Zhang, Prasenjit Mitra, Suhang Wang
 
-We explore deep generative models to generate case-based explanations in a
-medical federated learning setting. Explaining AI model decisions through
-case-based interpretability is paramount to increasing trust and allowing
-widespread adoption of AI in clinical practice. However, medical AI training
-paradigms are shifting towards federated learning settings in order to comply
-with data protection regulations. In a federated scenario, past data is
-inaccessible to the current user. Thus, we use a deep generative model to
-generate synthetic examples that protect privacy and explain decisions. Our
-proof-of-concept focuses on pleural effusion diagnosis and uses publicly
-available Chest X-ray data.
+In-context learning (ICL) effectively conditions large language models (LLMs)
+for molecular tasks, such as property prediction and molecule captioning, by
+embedding carefully selected demonstration examples into the input prompt. This
+approach avoids the computational overhead of extensive pertaining and
+fine-tuning. However, current prompt retrieval methods for molecular tasks have
+relied on molecule feature similarity, such as Morgan fingerprints, which do
+not adequately capture the global molecular and atom-binding relationships. As
+a result, these methods fail to represent the full complexity of molecular
+structures during inference. Moreover, small-to-medium-sized LLMs, which offer
+simpler deployment requirements in specialized systems, have remained largely
+unexplored in the molecular ICL literature. To address these gaps, we propose a
+self-supervised learning technique, GAMIC (Graph-Aligned Molecular In-Context
+learning, which aligns global molecular structures, represented by graph neural
+networks (GNNs), with textual captions (descriptions) while leveraging local
+feature similarity through Morgan fingerprints. In addition, we introduce a
+Maximum Marginal Relevance (MMR) based diversity heuristic during retrieval to
+optimize input prompt demonstration samples. Our experimental findings using
+diverse benchmark datasets show GAMIC outperforms simple Morgan-based ICL
+retrieval methods across all tasks by up to 45%.
 
-摘要：我們探索深度生成模型，在醫療聯邦學習設置中生成基於案例的說明。透過基於案例的可解釋性來解釋 AI 模型決策，對於增加信任並允許 AI 在臨床實務中廣泛採用至關重要。然而，醫療 AI 訓練範例正轉向聯邦學習設置，以符合資料保護法規。在聯邦情境中，過去的資料對目前的使用者而言是無法取得的。因此，我們使用深度生成模型來產生保護隱私和解釋決策的合成範例。我們的概念驗證著重於胸腔積液診斷，並使用公開可取得的胸部 X 光資料。
+摘要：<paragraph>情境學習 (ICL) 有效地調整大型語言模型 (LLM)，以執行分子任務，例如屬性預測和分子標題，方法是將仔細挑選的示範範例嵌入輸入提示中。這種方法避免了廣泛相關和微調的計算開銷。然而，目前針對分子任務的提示檢索方法依賴於分子特徵相似性，例如 Morgan 指紋，而無法充分捕捉全局分子和原子鍵結關係。因此，這些方法無法在推理過程中表示分子結構的完整複雜性。此外，在專業系統中提供更簡單部署需求的小到中型的 LLM，在分子 ICL 文獻中仍未得到充分探索。為了解決這些差距，我們提出了一種自我監督學習技術，GAMIC（圖形對齊分子情境學習），它將由圖形神經網路 (GNN) 表示的全局分子結構與文字標題（描述）對齊，同時透過 Morgan 指紋利用局部特徵相似性。此外，我們在檢索過程中引入了一個基於最大邊際相關性 (MMR) 的多樣性啟發法，以最佳化輸入提示示範樣本。我們使用不同的基準資料集進行的實驗結果顯示，GAMIC 在所有任務中都優於基於 Morgan 的簡單 ICL 檢索方法，最多可達 45%。</paragraph>
 
-##### **AI in radiological imaging of soft-tissue and bone tumours: a systematic review evaluating against CLAIM and FUTURE-AI guidelines**
-2408.12491v1 by Douwe J. Spaanderman, Matthew Marzetti, Xinyi Wan, Andrew F. Scarsbrook, Philip Robinson, Edwin H. G. Oei, Jacob J. Visser, Robert Hemke, Kirsten van Langevelde, David F. Hanff, Geert J. L. H. van Leenders, Cornelis Verhoef, Dirk J. Gruühagen, Wiro J. Niessen, Stefan Klein, Martijn P. A. Starmans
+##### **Knowledge Graph-Guided Retrieval Augmented Generation**
+2502.06864v1 by Xiangrong Zhu, Yuexiang Xie, Yi Liu, Yaliang Li, Wei Hu
 
-Soft-tissue and bone tumours (STBT) are rare, diagnostically challenging
-lesions with variable clinical behaviours and treatment approaches. This
-systematic review provides an overview of Artificial Intelligence (AI) methods
-using radiological imaging for diagnosis and prognosis of these tumours,
-highlighting challenges in clinical translation, and evaluating study alignment
-with the Checklist for AI in Medical Imaging (CLAIM) and the FUTURE-AI
-international consensus guidelines for trustworthy and deployable AI to promote
-the clinical translation of AI methods. The review covered literature from
-several bibliographic databases, including papers published before 17/07/2024.
-Original research in peer-reviewed journals focused on radiology-based AI for
-diagnosing or prognosing primary STBT was included. Exclusion criteria were
-animal, cadaveric, or laboratory studies, and non-English papers. Abstracts
-were screened by two of three independent reviewers for eligibility. Eligible
-papers were assessed against guidelines by one of three independent reviewers.
-The search identified 15,015 abstracts, from which 325 articles were included
-for evaluation. Most studies performed moderately on CLAIM, averaging a score
-of 28.9$\pm$7.5 out of 53, but poorly on FUTURE-AI, averaging 5.1$\pm$2.1 out
-of 30. Imaging-AI tools for STBT remain at the proof-of-concept stage,
-indicating significant room for improvement. Future efforts by AI developers
-should focus on design (e.g. define unmet clinical need, intended clinical
-setting and how AI would be integrated in clinical workflow), development (e.g.
-build on previous work, explainability), evaluation (e.g. evaluating and
-addressing biases, evaluating AI against best practices), and data
-reproducibility and availability (making documented code and data publicly
-available). Following these recommendations could improve clinical translation
-of AI methods.
+Retrieval-augmented generation (RAG) has emerged as a promising technology
+for addressing hallucination issues in the responses generated by large
+language models (LLMs). Existing studies on RAG primarily focus on applying
+semantic-based approaches to retrieve isolated relevant chunks, which ignore
+their intrinsic relationships. In this paper, we propose a novel Knowledge
+Graph-Guided Retrieval Augmented Generation (KG$^2$RAG) framework that utilizes
+knowledge graphs (KGs) to provide fact-level relationships between chunks,
+improving the diversity and coherence of the retrieved results. Specifically,
+after performing a semantic-based retrieval to provide seed chunks, KG$^2$RAG
+employs a KG-guided chunk expansion process and a KG-based chunk organization
+process to deliver relevant and important knowledge in well-organized
+paragraphs. Extensive experiments conducted on the HotpotQA dataset and its
+variants demonstrate the advantages of KG$^2$RAG compared to existing RAG-based
+approaches, in terms of both response quality and retrieval quality.
 
-摘要：軟組織和骨骼腫瘤（STBT）是罕見、診斷具有挑戰性的病灶，其臨床行為和治療方法各不相同。這篇系統性回顧提供了使用放射影像進行診斷和預後的人工智慧 (AI) 方法的概觀，重點說明了臨床轉譯的挑戰，並評估研究與醫療影像 AI 核查表 (CLAIM) 和 FUTURE-AI 可信賴且可部署 AI 的國際共識準則的一致性，以促進 AI 方法的臨床轉譯。這篇回顧涵蓋了幾個書目資料庫中的文獻，包括在 2024 年 7 月 17 日之前發表的論文。納入了以放射為基礎的 AI 診斷或預後原發性 STBT 的同行評審期刊中的原始研究。排除標準是動物、屍體或實驗室研究，以及非英文論文。摘要由三位獨立審查員中的兩位篩選資格。合格的論文由三位獨立審查員中的一位根據準則進行評估。搜索識別出 15,015 篇摘要，其中 325 篇文章被納入評估。大多數研究在 CLAIM 中表現中等，平均得分為 53 分中的 28.9±7.5 分，但在 FUTURE-AI 中表現不佳，平均得分為 30 分中的 5.1±2.1 分。STBT 的影像 AI 工具仍處於概念驗證階段，表明有顯著的改進空間。AI 開發人員未來的努力應集中在設計（例如定義未滿足的臨床需求、預期的臨床環境以及 AI 如何整合到臨床工作流程中）、開發（例如建立在先前的工作、可解釋性）、評估（例如評估和解決偏差、評估 AI 與最佳實務）、以及數據可複製性和可用性（公開提供文件化的代碼和數據）。遵循這些建議可以改善 AI 方法的臨床轉譯。
+摘要：檢索增強生成 (RAG) 已成為一項有前途的技術，用於解決大型語言模型 (LLM) 所產生回應中的幻覺問題。現有關於 RAG 的研究主要專注於應用基於語義的方法來檢索孤立相關的區塊，而忽略它們的內在關係。在本文中，我們提出了一個新穎的知識圖表引導檢索增強生成 (KG$^2$RAG) 框架，它利用知識圖表 (KG) 來提供區塊之間的事實層級關係，從而提高檢索結果的多樣性和一致性。具體來說，在執行基於語義的檢索以提供種子區塊後，KG$^2$RAG 採用 KG 引導的區塊擴充程序和基於 KG 的區塊組織程序，以在組織良好的段落中傳達相關且重要的知識。在 HotpotQA 資料集及其變體上進行的大量實驗證明了 KG$^2$RAG 在回應品質和檢索品質方面優於現有的基於 RAG 的方法。
 
-##### **Evaluating Explainable AI Methods in Deep Learning Models for Early Detection of Cerebral Palsy**
-2409.00001v1 by Kimji N. Pellano, Inga Strümke, Daniel Groos, Lars Adde, Espen Alexander F. Ihlen
+##### **Can Large Language Models Understand Intermediate Representations?**
+2502.06854v1 by Hailong Jiang, Jianfeng Zhu, Yao Wan, Bo Fang, Hongyu Zhang, Ruoming Jin, Qiang Guan
 
-Early detection of Cerebral Palsy (CP) is crucial for effective intervention
-and monitoring. This paper tests the reliability and applicability of
-Explainable AI (XAI) methods using a deep learning method that predicts CP by
-analyzing skeletal data extracted from video recordings of infant movements.
-Specifically, we use XAI evaluation metrics -- namely faithfulness and
-stability -- to quantitatively assess the reliability of Class Activation
-Mapping (CAM) and Gradient-weighted Class Activation Mapping (Grad-CAM) in this
-specific medical application. We utilize a unique dataset of infant movements
-and apply skeleton data perturbations without distorting the original dynamics
-of the infant movements. Our CP prediction model utilizes an ensemble approach,
-so we evaluate the XAI metrics performances for both the overall ensemble and
-the individual models. Our findings indicate that both XAI methods effectively
-identify key body points influencing CP predictions and that the explanations
-are robust against minor data perturbations. Grad-CAM significantly outperforms
-CAM in the RISv metric, which measures stability in terms of velocity. In
-contrast, CAM performs better in the RISb metric, which relates to bone
-stability, and the RRS metric, which assesses internal representation
-robustness. Individual models within the ensemble show varied results, and
-neither CAM nor Grad-CAM consistently outperform the other, with the ensemble
-approach providing a representation of outcomes from its constituent models.
+Intermediate Representations (IRs) are essential in compiler design and
+program analysis, yet their comprehension by Large Language Models (LLMs)
+remains underexplored. This paper presents a pioneering empirical study to
+investigate the capabilities of LLMs, including GPT-4, GPT-3, Gemma 2, LLaMA
+3.1, and Code Llama, in understanding IRs. We analyze their performance across
+four tasks: Control Flow Graph (CFG) reconstruction, decompilation, code
+summarization, and execution reasoning. Our results indicate that while LLMs
+demonstrate competence in parsing IR syntax and recognizing high-level
+structures, they struggle with control flow reasoning, execution semantics, and
+loop handling. Specifically, they often misinterpret branching instructions,
+omit critical IR operations, and rely on heuristic-based reasoning, leading to
+errors in CFG reconstruction, IR decompilation, and execution reasoning. The
+study underscores the necessity for IR-specific enhancements in LLMs,
+recommending fine-tuning on structured IR datasets and integration of explicit
+control flow models to augment their comprehension and handling of IR-related
+tasks.
+
+摘要：中間表徵 (IR) 在編譯器設計和程式分析中至關重要，但大型語言模型 (LLM) 對其理解仍未得到充分探討。本文提出了一項開創性的實證研究，以探討 LLM（包括 GPT-4、GPT-3、Gemma 2、LLaMA 3.1 和 Code Llama）理解 IR 的能力。我們分析了它們在四項任務中的表現：控制流程圖 (CFG) 重建、反編譯、程式碼摘要和執行推理。我們的結果表明，儘管 LLM 在解析 IR 語法和識別高階結構方面表現出能力，但它們在控制流程推理、執行語義和迴圈處理方面存在困難。具體而言，它們經常誤解分支指令、省略關鍵 IR 操作，並依賴於基於啟發式的推理，導致 CFG 重建、IR 反編譯和執行推理出現錯誤。這項研究強調了 LLM 中對 IR 特定的增強的必要性，建議對結構化的 IR 資料集進行微調，並整合明確的控制流程模型，以增強其對 IR 相關任務的理解和處理。
+
+##### **GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity?**
+2502.05252v1 by Yang Zhou, Hongyi Liu, Zhuoming Chen, Yuandong Tian, Beidi Chen
+
+Long-context large language models (LLMs) have recently shown strong
+performance in information retrieval and long-document QA. However, to tackle
+the most challenging intellectual problems, LLMs must reason effectively in
+long and complex contexts (e.g., frontier mathematical research). Studying how
+LLMs handle increasing reasoning complexity and context length is essential,
+yet existing benchmarks lack a solid basis for quantitative evaluation.
+Inspired by the abstraction of GSM-8K problems as computational graphs, and the
+ability to introduce noise by adding unnecessary nodes and edges, we develop a
+grade school math problem generator capable of producing arithmetic problems
+with infinite difficulty and context length under fine-grained control. Using
+our newly synthesized GSM-Infinite benchmark, we comprehensively evaluate
+existing LLMs. We find a consistent sigmoid decline in reasoning performance as
+complexity increases, along with a systematic inference scaling trend:
+exponentially increasing inference computation yields only linear performance
+gains. These findings underscore the fundamental limitations of current
+long-context LLMs and the key challenges in scaling reasoning capabilities. Our
+GSM-Infinite benchmark provides a scalable and controllable testbed for
+systematically studying and advancing LLM reasoning in long and complex
+contexts.
 
-摘要：腦性麻痺 (CP) 的早期偵測對於有效的介入和監測至關重要。本文測試了可解釋 AI (XAI) 方法的可靠性和適用性，使用深度學習方法，透過分析從嬰兒動作影片記錄中提取的骨骼資料來預測 CP。具體來說，我們使用 XAI 評估指標（即忠實度和穩定性）來量化評估類別激活映射 (CAM) 和梯度加權類別激活映射 (Grad-CAM) 在這個特定醫療應用中的可靠性。我們利用一個獨特的嬰兒動作資料集，並應用骨骼資料擾動，而不會扭曲嬰兒動作的原始動力。我們的 CP 預測模型利用整體方法，因此我們評估了整體整體和個別模型的 XAI 指標表現。我們的研究結果表明，兩種 XAI 方法都能有效識別影響 CP 預測的關鍵身體部位，並且這些解釋對於微小的資料擾動具有魯棒性。Grad-CAM 在 RISv 指標中顯著優於 CAM，該指標衡量速度方面的穩定性。相比之下，CAM 在 RISb 指標中表現得更好，該指標與骨骼穩定性有關，而 RRS 指標則評估內部表示的魯棒性。整體中的個別模型顯示出不同的結果，CAM 和 Grad-CAM 都不一致地優於另一種，整體方法提供了其組成模型結果的表示。
+摘要：長文本大型語言模型 (LLM) 最近在資訊檢索和長文件問答中展示了強大的效能。然而，若要解決最具挑戰性的智力問題，LLM 必須在長且複雜的脈絡中有效推理（例如，前沿數學研究）。研究 LLM 如何處理增加的推理複雜性和脈絡長度至關重要，但現有的基準缺乏定量評估的穩固基礎。受到 GSM-8K 問題抽象化為計算圖形的啟發，以及透過加入不必要的節點和邊緣來引入雜訊的能力，我們開發了一個小學數學問題產生器，能夠在細緻的控制下產生具有無限難度和脈絡長度的算術問題。使用我們新合成的 GSM-Infinite 基準，我們全面評估現有的 LLM。我們發現推理效能會隨著複雜性的增加而持續呈 S 形下降，並伴隨著系統性的推論縮放趨勢：指數增加的推論計算僅產生線性的效能增益。這些發現強調了當前長脈絡 LLM 的基本限制，以及擴展推理能力的主要挑戰。我們的 GSM-Infinite 基準提供了一個可擴充且可控的測試平台，用於系統性地研究和提升 LLM 在長且複雜脈絡中的推理能力。
 
-##### **MicroXercise: A Micro-Level Comparative and Explainable System for Remote Physical Therapy**
-2408.11837v1 by Hanchen David Wang, Nibraas Khan, Anna Chen, Nilanjan Sarkar, Pamela Wisniewski, Meiyi Ma
+##### **Causality can systematically address the monsters under the bench(marks)**
+2502.05085v1 by Felix Leeb, Zhijing Jin, Bernhard Schölkopf
 
-Recent global estimates suggest that as many as 2.41 billion individuals have
-health conditions that would benefit from rehabilitation services. Home-based
-Physical Therapy (PT) faces significant challenges in providing interactive
-feedback and meaningful observation for therapists and patients. To fill this
-gap, we present MicroXercise, which integrates micro-motion analysis with
-wearable sensors, providing therapists and patients with a comprehensive
-feedback interface, including video, text, and scores. Crucially, it employs
-multi-dimensional Dynamic Time Warping (DTW) and attribution-based explainable
-methods to analyze the existing deep learning neural networks in monitoring
-exercises, focusing on a high granularity of exercise. This synergistic
-approach is pivotal, providing output matching the input size to precisely
-highlight critical subtleties and movements in PT, thus transforming complex AI
-analysis into clear, actionable feedback. By highlighting these micro-motions
-in different metrics, such as stability and range of motion, MicroXercise
-significantly enhances the understanding and relevance of feedback for
-end-users. Comparative performance metrics underscore its effectiveness over
-traditional methods, such as a 39% and 42% improvement in Feature Mutual
-Information (FMI) and Continuity. MicroXercise is a step ahead in home-based
-physical therapy, providing a technologically advanced and intuitively helpful
-solution to enhance patient care and outcomes.
+Effective and reliable evaluation is essential for advancing empirical
+machine learning. However, the increasing accessibility of generalist models
+and the progress towards ever more complex, high-level tasks make systematic
+evaluation more challenging. Benchmarks are plagued by various biases,
+artifacts, or leakage, while models may behave unreliably due to poorly
+explored failure modes. Haphazard treatments and inconsistent formulations of
+such "monsters" can contribute to a duplication of efforts, a lack of trust in
+results, and unsupported inferences. In this position paper, we argue causality
+offers an ideal framework to systematically address these challenges. By making
+causal assumptions in an approach explicit, we can faithfully model phenomena,
+formulate testable hypotheses with explanatory power, and leverage principled
+tools for analysis. To make causal model design more accessible, we identify
+several useful Common Abstract Topologies (CATs) in causal graphs which help
+gain insight into the reasoning abilities in large language models. Through a
+series of case studies, we demonstrate how the precise yet pragmatic language
+of causality clarifies the strengths and limitations of a method and inspires
+new approaches for systematic progress.
 
-摘要：最近的全球估計表明，多達 24.1 億人有
-健康狀況可從復健服務中受益。居家
-物理治療 (PT) 在提供互動式
-回饋和有意義的觀察方面面臨重大挑戰，供治療師和患者使用。為了填補這
-個缺口，我們提出 MicroXercise，它將微動作分析與
-可穿戴式感測器整合在一起，為治療師和患者提供一個全面的
-回饋介面，包括影片、文字和分數。至關重要的是，它採用
-多維動態時間規整 (DTW) 和基於歸因的可解釋
-方法來分析監控運動中現有的深度學習神經網路，專注於運動的高粒度。這種協同
-方法至關重要，提供與輸入大小匹配的輸出，以精確地
-突出 PT 中關鍵的細微差別和動作，從而將複雜的 AI
-分析轉換為清晰、可操作的回饋。透過在不同指標中突顯這些微動作，例如穩定性和動作範圍，MicroXercise
-顯著提升最終使用者對回饋的理解和相關性。比較效能指標強調其優於
-傳統方法的有效性，例如特徵互惠資訊 (FMI) 和連續性分別提升了 39% 和 42%。MicroXercise 在居家
-物理治療方面更進一步，提供技術先進且直覺有用的
-解決方案，以提升患者照護和結果。
+摘要：有效的、可靠的評估對於推進經驗機器學習至關重要。然而，一般化模型的可及性日益提高，以及朝著更複雜、更高級別任務的進展，使得系統評估更具挑戰性。基準測試受到各種偏差、人工製品或洩漏的困擾，而模型由於探索不充分的故障模式而可能表現得不可靠。隨意處理和不一致的表述等「怪物」可能會導致重複工作、對結果缺乏信任以及不支援的推論。在本文中，我們論證因果關係提供了一個系統性解決這些挑戰的理想框架。通過在方法中明確因果假設，我們可以忠實地模擬現象，制定具有解釋力的可測試假設，並利用原則性的分析工具。為了使因果模型設計更易於使用，我們在因果圖中識別出幾個有用的通用抽象拓撲 (CAT)，有助於深入了解大型語言模型中的推理能力。通過一系列案例研究，我們展示了因果關係的精確但務實的語言如何釐清方法的優缺點，並激發系統進展的新方法。
 
-##### **The Literature Review Network: An Explainable Artificial Intelligence for Systematic Literature Reviews, Meta-analyses, and Method Development**
-2408.05239v1 by Joshua Morriss, Tod Brindle, Jessica Bah Rösman, Daniel Reibsamen, Andreas Enz
+##### **Adaptive Graph of Thoughts: Test-Time Adaptive Reasoning Unifying Chain, Tree, and Graph Structures**
+2502.05078v1 by Tushar Pandey, Ara Ghukasyan, Oktay Goktas, Santosh Kumar Radha
 
-Systematic literature reviews are the highest quality of evidence in
-research. However, the review process is hindered by significant resource and
-data constraints. The Literature Review Network (LRN) is the first of its kind
-explainable AI platform adhering to PRISMA 2020 standards, designed to automate
-the entire literature review process. LRN was evaluated in the domain of
-surgical glove practices using 3 search strings developed by experts to query
-PubMed. A non-expert trained all LRN models. Performance was benchmarked
-against an expert manual review. Explainability and performance metrics
-assessed LRN's ability to replicate the experts' review. Concordance was
-measured with the Jaccard index and confusion matrices. Researchers were
-blinded to the other's results until study completion. Overlapping studies were
-integrated into an LRN-generated systematic review. LRN models demonstrated
-superior classification accuracy without expert training, achieving 84.78% and
-85.71% accuracy. The highest performance model achieved high interrater
-reliability (k = 0.4953) and explainability metrics, linking 'reduce',
-'accident', and 'sharp' with 'double-gloving'. Another LRN model covered 91.51%
-of the relevant literature despite diverging from the non-expert's judgments (k
-= 0.2174), with the terms 'latex', 'double' (gloves), and 'indication'. LRN
-outperformed the manual review (19,920 minutes over 11 months), reducing the
-entire process to 288.6 minutes over 5 days. This study demonstrates that
-explainable AI does not require expert training to successfully conduct
-PRISMA-compliant systematic literature reviews like an expert. LRN summarized
-the results of surgical glove studies and identified themes that were nearly
-identical to the clinical researchers' findings. Explainable AI can accurately
-expedite our understanding of clinical practices, potentially revolutionizing
-healthcare research.
+Large Language Models (LLMs) have demonstrated impressive reasoning
+capabilities, yet their performance is highly dependent on the prompting
+strategy and model scale. While reinforcement learning and fine-tuning have
+been deployed to boost reasoning, these approaches incur substantial
+computational and data overhead. In this work, we introduce Adaptive Graph of
+Thoughts (AGoT), a dynamic, graph-based inference framework that enhances LLM
+reasoning solely at test time. Rather than relying on fixed-step methods like
+Chain of Thought (CoT) or Tree of Thoughts (ToT), AGoT recursively decomposes
+complex queries into structured subproblems, forming an dynamic directed
+acyclic graph (DAG) of interdependent reasoning steps. By selectively expanding
+only those subproblems that require further analysis, AGoT unifies the
+strengths of chain, tree, and graph paradigms into a cohesive framework that
+allocates computation where it is most needed. We validate our approach on
+diverse benchmarks spanning multi-hop retrieval, scientific reasoning, and
+mathematical problem-solving, achieving up to 46.2% improvement on scientific
+reasoning tasks (GPQA) - comparable to gains achieved through computationally
+intensive reinforcement learning approaches and outperforming state-of-the-art
+iterative approaches. These results suggest that dynamic decomposition and
+structured recursion offer a scalable, cost-effective alternative to
+post-training modifications, paving the way for more robust, general-purpose
+reasoning in LLMs.
 
-摘要：系統性文獻回顧是研究中證據品質最高的。然而，回顧過程受到顯著資源和資料限制的阻礙。文獻回顧網路 (LRN) 是第一個遵循 PRISMA 2020 標準的可解釋 AI 平台，旨在自動化整個文獻回顧過程。LRN 在外科手套實務領域中進行評估，使用專家開發的 3 個搜尋字串來查詢 PubMed。非專家訓練所有 LRN 模型。效能以專家手動回顧作為基準。可解釋性和效能指標評估 LRN 複製專家回顧的能力。一致性以 Jaccard 指數和混淆矩陣測量。研究人員在研究完成前對彼此的結果保密。重疊的研究整合到 LRN 生成的系統性回顧中。LRN 模型在沒有專家訓練的情況下展現出優異的分類準確率，達到 84.78% 和 85.71% 的準確率。效能最高的模型達到了高評分者間信賴度 (k = 0.4953) 和可解釋性指標，將「減少」、「意外」和「銳利」與「雙重戴手套」連結在一起。另一個 LRN 模型涵蓋了 91.51% 的相關文獻，儘管與非專家的判斷不同 (k = 0.2174)，但包含了「乳膠」、「雙重」（手套）和「適應症」等詞彙。LRN 優於手動回顧（11 個月超過 19,920 分鐘），將整個過程縮短為 5 天超過 288.6 分鐘。這項研究顯示，可解釋的 AI 不需要專家訓練即可成功進行專家等級的 PRISMA 相容系統性文獻回顧。LRN 總結了外科手套研究的結果，並找出與臨床研究人員發現幾乎相同的主题。可解釋的 AI 可以準確地加快我們對臨床實務的理解，有潛力革新醫療保健研究。
+摘要：大型語言模型 (LLM) 已展現令人印象深刻的推理能力，但其效能高度依賴於提示策略和模型規模。雖然強化學習和微調已被用於提升推理，但這些方法會造成大量的運算和資料開銷。在這項工作中，我們引入了「適應性思考圖」(AGoT)，一個動態的、基於圖形的推論架構，它僅在測試時就能增強 LLM 推理。AGoT 並非依賴於鏈式思考 (CoT) 或樹狀思考 (ToT) 等固定步驟方法，而是遞迴地將複雜的查詢分解成結構化的子問題，形成一個由相互依賴的推理步驟所組成的動態有向無環圖 (DAG)。透過選擇性地僅擴充那些需要進一步分析的子問題，AGoT 將鏈式、樹狀和圖形範例的優勢統一到一個緊密的架構中，將運算分配到最需要的地方。我們在跨越多重跳躍檢索、科學推理和數學問題解決等多樣基準上驗證了我們的做法，在科學推理任務 (GPQA) 上達到了高達 46.2% 的改進，這與透過運算密集的強化學習方法所獲得的增益相當，並且優於最先進的迭代方法。這些結果表明，動態分解和結構化遞迴提供了一個可擴充、具成本效益的替代方案，用於訓練後修改，為 LLM 中更強健、更通用的推理鋪平了道路。
 
-##### **Enhancing Medical Learning and Reasoning Systems: A Boxology-Based Comparative Analysis of Design Patterns**
-2408.02709v1 by Chi Him Ng
+##### **Enhancing Knowledge Graph Construction: Evaluating with Emphasis on Hallucination, Omission, and Graph Similarity Metrics**
+2502.05239v1 by Hussam Ghanem, Christophe Cruz
 
-This study analyzes hybrid AI systems' design patterns and their
-effectiveness in clinical decision-making using the boxology framework. It
-categorizes and copares various architectures combining machine learning and
-rule-based reasoning to provide insights into their structural foundations and
-healthcare applications. Addressing two main questions, how to categorize these
-systems againts established design patterns and how to extract insights through
-comparative analysis, the study uses design patterns from software engineering
-to understand and optimize healthcare AI systems. Boxology helps identify
-commonalities and create reusable solutions, enhancing these systems'
-scalability, reliability, and performance. Five primary architectures are
-examined: REML, MLRB, RBML, RMLT, and PERML. Each has unique strengths and
-weaknesses, highlighting the need for tailored approaches in clinical tasks.
-REML excels in high-accuracy prediction for datasets with limited data; MLRB in
-handling large datasets and complex data integration; RBML in explainability
-and trustworthiness; RMLT in managing high-dimensional data; and PERML, though
-limited in analysis, shows promise in urgent care scenarios. The study
-introduces four new patterns, creates five abstract categorization patterns,
-and refines those five further to specific systems. These contributions enhance
-Boxlogy's taxonomical organization and offer novel approaches to integrating
-expert knowledge with machine learning. Boxology's structured, modular apporach
-offers significant advantages in developing and analyzing hybrid AI systems,
-revealing commonalities, and promoting reusable solutions. In conclusion, this
-study underscores hybrid AI systems' crucial role in advancing healthcare and
-Boxology's potential to drive further innovation in AI integration, ultimately
-improving clinical decision support and patient outcomes.
+Recent advancements in large language models have demonstrated significant
+potential in the automated construction of knowledge graphs from unstructured
+text. This paper builds upon our previous work [16], which evaluated various
+models using metrics like precision, recall, F1 score, triple matching, and
+graph matching, and introduces a refined approach to address the critical
+issues of hallucination and omission. We propose an enhanced evaluation
+framework incorporating BERTScore for graph similarity, setting a practical
+threshold of 95% for graph matching. Our experiments focus on the Mistral
+model, comparing its original and fine-tuned versions in zero-shot and few-shot
+settings. We further extend our experiments using examples from the KELM-sub
+training dataset, illustrating that the fine-tuned model significantly improves
+knowledge graph construction accuracy while reducing the exact hallucination
+and omission. However, our findings also reveal that the fine-tuned models
+perform worse in generalization tasks on the KELM-sub dataset. This study
+underscores the importance of comprehensive evaluation metrics in advancing the
+state-of-the-art in knowledge graph construction from textual data.
 
-摘要：本研究使用盒子學框架分析混合人工智慧系統的設計模式及其在臨床決策中的有效性。它分類並比較結合機器學習和基於規則的推理的各種架構，以深入了解其結構基礎和醫療保健應用。針對兩個主要問題，如何根據既定的設計模式對這些系統進行分類，以及如何通過比較分析提取見解，本研究使用軟體工程中的設計模式來了解和優化醫療保健人工智慧系統。盒子學有助於識別共性並建立可重複使用的解決方案，從而增強這些系統的可擴充性、可靠性和效能。檢查了五種主要的架構：REML、MLRB、RBML、RMLT 和 PERML。每種架構都有獨特的優缺點，強調了在臨床任務中需要量身打造的方法。REML 在資料有限的資料集中表現出高精度的預測；MLRB 在處理大型資料集和複雜資料整合方面表現出色；RBML 在可解釋性和可信度方面表現出色；RMLT 在管理高維資料方面表現出色；而 PERML 儘管在分析方面有限，但在緊急照護場景中表現出潛力。本研究引入了四種新模式，建立了五種抽象分類模式，並進一步將這五種模式細化為具體的系統。這些貢獻增強了盒子學的分類組織，並提供了將專家知識與機器學習整合的新方法。盒子學的結構化、模組化方法在開發和分析混合人工智慧系統、揭示共性以及推廣可重複使用的解決方案方面具有顯著優勢。總之，本研究強調了混合人工智慧系統在推進醫療保健中的關鍵作用，以及盒子學在推動人工智慧整合進一步創新方面的潛力，最終改善臨床決策支援和患者的治療成果。
+摘要：大型語言模型的最新進展已證明在從非結構化文字自動建構知識圖譜方面具有顯著的潛力。本文建立在我們先前的研究 [16] 之上，該研究使用準確度、召回率、F1 分數、三元組匹配和圖形匹配等指標評估各種模型，並引入了一種改進的方法來解決幻覺和遺漏的關鍵問題。我們提出一個增強的評估框架，結合 BERTScore 來進行圖形相似性，並將圖形匹配的實際閾值設定為 95%。我們的實驗重點在 Mistral 模型上，比較其原始版本和微調版本在零次學習和少量學習的設定中。我們進一步使用 KELM-sub 訓練資料集中的範例來擴展我們的實驗，說明微調後的模型顯著提高了知識圖譜建構的準確度，同時減少了精確的幻覺和遺漏。然而，我們的研究結果也顯示，微調後的模型在 KELM-sub 資料集上的泛化任務表現較差。這項研究強調了全面評估指標在推進從文字資料建構知識圖譜的最新技術方面的重要性。
 
-##### **Bayesian Kolmogorov Arnold Networks (Bayesian_KANs): A Probabilistic Approach to Enhance Accuracy and Interpretability**
-2408.02706v1 by Masoud Muhammed Hassan
+##### **Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research**
+2502.04644v1 by Junde Wu, Jiayuan Zhu, Yuyuan Liu
 
-Because of its strong predictive skills, deep learning has emerged as an
-essential tool in many industries, including healthcare. Traditional deep
-learning models, on the other hand, frequently lack interpretability and omit
-to take prediction uncertainty into account two crucial components of clinical
-decision making. In order to produce explainable and uncertainty aware
-predictions, this study presents a novel framework called Bayesian Kolmogorov
-Arnold Networks (BKANs), which combines the expressive capacity of Kolmogorov
-Arnold Networks with Bayesian inference. We employ BKANs on two medical
-datasets, which are widely used benchmarks for assessing machine learning
-models in medical diagnostics: the Pima Indians Diabetes dataset and the
-Cleveland Heart Disease dataset. Our method provides useful insights into
-prediction confidence and decision boundaries and outperforms traditional deep
-learning models in terms of prediction accuracy. Moreover, BKANs' capacity to
-represent aleatoric and epistemic uncertainty guarantees doctors receive more
-solid and trustworthy decision support. Our Bayesian strategy improves the
-interpretability of the model and considerably minimises overfitting, which is
-important for tiny and imbalanced medical datasets, according to experimental
-results. We present possible expansions to further use BKANs in more
-complicated multimodal datasets and address the significance of these
-discoveries for future research in building reliable AI systems for healthcare.
-This work paves the way for a new paradigm in deep learning model deployment in
-vital sectors where transparency and reliability are crucial.
+We introduce Agentic Reasoning, a framework that enhances large language
+model (LLM) reasoning by integrating external tool-using agents. Unlike
+conventional LLM-based reasoning approaches, which rely solely on internal
+inference, Agentic Reasoning dynamically engages web search, code execution,
+and structured reasoning-context memory to solve complex problems requiring
+deep research and multi-step logical deduction. Our framework introduces the
+Mind Map agent, which constructs a structured knowledge graph to track logical
+relationships, improving deductive reasoning. Additionally, the integration of
+web-search and coding agents enables real-time retrieval and computational
+analysis, enhancing reasoning accuracy and decision-making. Evaluations on
+PhD-level scientific reasoning (GPQA) and domain-specific deep research tasks
+demonstrate that our approach significantly outperforms existing models,
+including leading retrieval-augmented generation (RAG) systems and
+closed-source LLMs. Moreover, our results indicate that agentic reasoning
+improves expert-level knowledge synthesis, test-time scalability, and
+structured problem-solving. The code is at:
+https://github.com/theworldofagents/Agentic-Reasoning.
 
-摘要：由於其強大的預測能力，深度學習已成為許多產業中不可或缺的工具，包括醫療保健。然而，傳統的深度學習模型通常缺乏可解釋性，並且忽略了將預測不確定性納入考量，而這兩個因素是臨床決策制定的關鍵組成部分。為了產生可解釋且具有不確定性意識的預測，本研究提出了一個名為貝氏柯爾莫哥洛夫阿諾德網路 (BKAN) 的新架構，它結合了柯爾莫哥洛夫阿諾德網路的表達能力與貝氏推論。我們在兩個醫學資料集上使用 BKAN，這些資料集是評估機器學習模型在醫學診斷中的廣泛使用基準：皮馬印第安人糖尿病資料集和克里夫蘭心臟病資料集。我們的模型提供了對預測信心和決策邊界的有益見解，並且在預測準確度方面優於傳統的深度學習模型。此外，BKAN 表現隨機和認識不確定性的能力，可確保醫生獲得更可靠且值得信賴的決策支援。根據實驗結果，我們的貝氏策略提高了模型的可解釋性，並大幅減少了過度擬合，這對於小型且不平衡的醫學資料集非常重要。我們提出了可能的擴充功能，以進一步將 BKAN 用於更複雜的多模式資料集，並探討這些發現對於未來建立可靠的醫療保健 AI 系統研究的重要性。這項工作為深度學習模型部署在透明度和可靠性至關重要的重要領域中開啟了一個新的典範。
+摘要：我們引入了代理推理，一個透過整合外部工具使用代理來增強大型語言模型 (LLM) 推理的框架。與僅依賴於內部推論的傳統基於 LLM 的推理方法不同，代理推理動態地運用網路搜尋、程式碼執行和結構化推理情境記憶來解決需要深入研究和多步驟邏輯推論的複雜問題。我們的框架引入了心智圖代理，它建立一個結構化的知識圖譜來追蹤邏輯關係，改善演繹推理。此外，整合網路搜尋和編碼代理能進行即時擷取和運算分析，增強推理準確度和決策制定。在博士等級科學推理 (GPQA) 和特定領域的深入研究任務上的評估顯示，我們的做法明顯優於現有模型，包括領先的檢索增強生成 (RAG) 系統和封閉原始碼 LLM。此外，我們的結果顯示，代理推理改進了專家級知識綜合、測試時間可擴充性和結構化問題解決。程式碼在：https://github.com/theworldofagents/Agentic-Reasoning。
 
-##### **MLtoGAI: Semantic Web based with Machine Learning for Enhanced Disease Prediction and Personalized Recommendations using Generative AI**
-2407.20284v1 by Shyam Dongre, Ritesh Chandra, Sonali Agarwal
+##### **Position-aware Automatic Circuit Discovery**
+2502.04577v1 by Tal Haklay, Hadas Orgad, David Bau, Aaron Mueller, Yonatan Belinkov
 
-In modern healthcare, addressing the complexities of accurate disease
-prediction and personalized recommendations is both crucial and challenging.
-This research introduces MLtoGAI, which integrates Semantic Web technology with
-Machine Learning (ML) to enhance disease prediction and offer user-friendly
-explanations through ChatGPT. The system comprises three key components: a
-reusable disease ontology that incorporates detailed knowledge about various
-diseases, a diagnostic classification model that uses patient symptoms to
-detect specific diseases accurately, and the integration of Semantic Web Rule
-Language (SWRL) with ontology and ChatGPT to generate clear, personalized
-health advice. This approach significantly improves prediction accuracy and
-ensures results that are easy to understand, addressing the complexity of
-diseases and diverse symptoms. The MLtoGAI system demonstrates substantial
-advancements in accuracy and user satisfaction, contributing to developing more
-intelligent and accessible healthcare solutions. This innovative approach
-combines the strengths of ML algorithms with the ability to provide
-transparent, human-understandable explanations through ChatGPT, achieving
-significant improvements in prediction accuracy and user comprehension. By
-leveraging semantic technology and explainable AI, the system enhances the
-accuracy of disease prediction and ensures that the recommendations are
-relevant and easily understood by individual patients. Our research highlights
-the potential of integrating advanced technologies to overcome existing
-challenges in medical diagnostics, paving the way for future developments in
-intelligent healthcare systems. Additionally, the system is validated using 200
-synthetic patient data records, ensuring robust performance and reliability.
+A widely used strategy to discover and understand language model mechanisms
+is circuit analysis. A circuit is a minimal subgraph of a model's computation
+graph that executes a specific task. We identify a gap in existing circuit
+discovery methods: they assume circuits are position-invariant, treating model
+components as equally relevant across input positions. This limits their
+ability to capture cross-positional interactions or mechanisms that vary across
+positions. To address this gap, we propose two improvements to incorporate
+positionality into circuits, even on tasks containing variable-length examples.
+First, we extend edge attribution patching, a gradient-based method for circuit
+discovery, to differentiate between token positions. Second, we introduce the
+concept of a dataset schema, which defines token spans with similar semantics
+across examples, enabling position-aware circuit discovery in datasets with
+variable length examples. We additionally develop an automated pipeline for
+schema generation and application using large language models. Our approach
+enables fully automated discovery of position-sensitive circuits, yielding
+better trade-offs between circuit size and faithfulness compared to prior work.
+
+摘要：廣泛用於發現和了解語言模型機制的策略是電路分析。電路是模型計算圖的最小子圖，可執行特定任務。我們找出電路發現方法中的一個缺口：它們假設電路與位置無關，將模型組件視為在輸入位置中同樣相關。這限制了它們捕捉跨位置互動或在不同位置中變化的機制的能力。為了解決這個缺口，我們提出兩項改進，將位置性納入電路中，即使在包含變長範例的任務中也是如此。首先，我們擴充邊緣屬性修補，一種基於梯度的電路發現方法，以區分符號位置。其次，我們引入了資料集架構的概念，它定義了在範例中具有類似語義的符號跨距，使我們可以在具有變長範例的資料集中進行與位置相關的電路發現。此外，我們開發了一個自動化管線，用於使用大型語言模型進行架構生成和應用。我們的做法能讓位置敏感電路的發現完全自動化，與先前的研究相比，在電路大小和忠實度之間產生了更好的權衡。
+
+##### **Heterogeneous Swarms: Jointly Optimizing Model Roles and Weights for Multi-LLM Systems**
+2502.04510v1 by Shangbin Feng, Zifeng Wang, Palash Goyal, Yike Wang, Weijia Shi, Huang Xia, Hamid Palangi, Luke Zettlemoyer, Yulia Tsvetkov, Chen-Yu Lee, Tomas Pfister
+
+We propose Heterogeneous Swarms, an algorithm to design multi-LLM systems by
+jointly optimizing model roles and weights. We represent multi-LLM systems as
+directed acyclic graphs (DAGs) of LLMs with topological message passing for
+collaborative generation. Given a pool of LLM experts and a utility function,
+Heterogeneous Swarms employs two iterative steps: role-step and weight-step.
+For role-step, we interpret model roles as learning a DAG that specifies the
+flow of inputs and outputs between LLMs. Starting from a swarm of random
+continuous adjacency matrices, we decode them into discrete DAGs, call the LLMs
+in topological order, evaluate on the utility function (e.g. accuracy on a
+task), and optimize the adjacency matrices with particle swarm optimization
+based on the utility score. For weight-step, we assess the contribution of
+individual LLMs in the multi-LLM systems and optimize model weights with swarm
+intelligence. We propose JFK-score to quantify the individual contribution of
+each LLM in the best-found DAG of the role-step, then optimize model weights
+with particle swarm optimization based on the JFK-score. Experiments
+demonstrate that Heterogeneous Swarms outperforms 15 role- and/or weight-based
+baselines by 18.5% on average across 12 tasks. Further analysis reveals that
+Heterogeneous Swarms discovers multi-LLM systems with heterogeneous model roles
+and substantial collaborative gains, and benefits from the diversity of
+language models.
 
-摘要：在現代醫療保健中，解決準確疾病預測和個性化建議的複雜性既至關重要又具有挑戰性。本研究引入了 MLtoGAI，它將語義網路技術與機器學習 (ML) 相結合，以增強疾病預測並透過 ChatGPT 提供使用者友善的說明。該系統包含三個關鍵組成部分：一個可重複使用的疾病本体，其中包含有關各種疾病的詳細知識；一個診斷分類模型，它使用患者症狀來準確檢測特定疾病；以及語義網路規則語言 (SWRL) 與本体和 ChatGPT 的整合，以產生清晰、個性化的健康建議。這種方法顯著提高了預測準確性，並確保了易於理解的結果，解決了疾病和不同症狀的複雜性。MLtoGAI 系統展示了準確性和使用者滿意度的實質性進步，有助於開發更智慧且更易於取得的醫療保健解決方案。這種創新的方法結合了 ML 演算法的優點，以及透過 ChatGPT 提供透明且人類可以理解的說明的能力，在預測準確性和使用者理解方面取得了顯著的進步。透過利用語義技術和可解釋的 AI，該系統提高了疾病預測的準確性，並確保了建議與個別患者相關且易於理解。我們的研究強調了整合先進技術以克服醫療診斷中現有挑戰的潛力，為智慧醫療保健系統的未來發展鋪路。此外，該系統使用 200 個合成患者資料記錄進行驗證，確保了穩健的效能和可靠性。
+摘要：<paragraph>我們提出異質群體，一種演算法，透過共同最佳化模型角色和權重來設計多 LLM 系統。我們將多 LLM 系統表示為 LLM 的有向非循環圖 (DAG)，並透過拓撲訊息傳遞進行協作產生。給定一組 LLM 專家和一個效用函數，異質群體使用兩個反覆步驟：角色步驟和權重步驟。對於角色步驟，我們將模型角色解釋為學習一個 DAG，它指定 LLM 之間輸入和輸出的流動。從一組隨機連續鄰接矩陣開始，我們將它們解碼為離散 DAG，以拓撲順序呼叫 LLM，根據效用函數（例如任務的準確度）進行評估，並根據效用分數使用粒子群最佳化最佳化鄰接矩陣。對於權重步驟，我們評估個別 LLM 在多 LLM 系統中的貢獻，並使用群體智慧最佳化模型權重。我們提出 JFK 分數來量化每個 LLM 在角色步驟中找到的最佳 DAG 中的個別貢獻，然後根據 JFK 分數使用粒子群最佳化最佳化模型權重。實驗表明，異質群體在 12 項任務中平均比 15 個基於角色和/或權重的基線高出 18.5%。進一步的分析表明，異質群體發現具有異質模型角色和大量協作收益的多 LLM 系統，並受益於語言模型的多樣性。</paragraph>
 
-##### **Introducing δ-XAI: a novel sensitivity-based method for local AI explanations**
-2407.18343v2 by Alessandro De Carlo, Enea Parimbelli, Nicola Melillo, Giovanna Nicora
+##### **MedRAG: Enhancing Retrieval-augmented Generation with Knowledge Graph-Elicited Reasoning for Healthcare Copilot**
+2502.04413v1 by Xuejiao Zhao, Siyan Liu, Su-Yin Yang, Chunyan Miao
 
-Explainable Artificial Intelligence (XAI) is central to the debate on
-integrating Artificial Intelligence (AI) and Machine Learning (ML) algorithms
-into clinical practice. High-performing AI/ML models, such as ensemble learners
-and deep neural networks, often lack interpretability, hampering clinicians'
-trust in their predictions. To address this, XAI techniques are being developed
-to describe AI/ML predictions in human-understandable terms. One promising
-direction is the adaptation of sensitivity analysis (SA) and global sensitivity
-analysis (GSA), which inherently rank model inputs by their impact on
-predictions. Here, we introduce a novel delta-XAI method that provides local
-explanations of ML model predictions by extending the delta index, a GSA
-metric. The delta-XAI index assesses the impact of each feature's value on the
-predicted output for individual instances in both regression and classification
-problems. We formalize the delta-XAI index and provide code for its
-implementation. The delta-XAI method was evaluated on simulated scenarios using
-linear regression models, with Shapley values serving as a benchmark. Results
-showed that the delta-XAI index is generally consistent with Shapley values,
-with notable discrepancies in models with highly impactful or extreme feature
-values. The delta-XAI index demonstrated higher sensitivity in detecting
-dominant features and handling extreme feature values. Qualitatively, the
-delta-XAI provides intuitive explanations by leveraging probability density
-functions, making feature rankings clearer and more explainable for
-practitioners. Overall, the delta-XAI method appears promising for robustly
-obtaining local explanations of ML model predictions. Further investigations in
-real-world clinical settings will be conducted to evaluate its impact on
-AI-assisted clinical workflows.
+Retrieval-augmented generation (RAG) is a well-suited technique for
+retrieving privacy-sensitive Electronic Health Records (EHR). It can serve as a
+key module of the healthcare copilot, helping reduce misdiagnosis for
+healthcare practitioners and patients. However, the diagnostic accuracy and
+specificity of existing heuristic-based RAG models used in the medical domain
+are inadequate, particularly for diseases with similar manifestations. This
+paper proposes MedRAG, a RAG model enhanced by knowledge graph (KG)-elicited
+reasoning for the medical domain that retrieves diagnosis and treatment
+recommendations based on manifestations. MedRAG systematically constructs a
+comprehensive four-tier hierarchical diagnostic KG encompassing critical
+diagnostic differences of various diseases. These differences are dynamically
+integrated with similar EHRs retrieved from an EHR database, and reasoned
+within a large language model. This process enables more accurate and specific
+decision support, while also proactively providing follow-up questions to
+enhance personalized medical decision-making. MedRAG is evaluated on both a
+public dataset DDXPlus and a private chronic pain diagnostic dataset (CPDD)
+collected from Tan Tock Seng Hospital, and its performance is compared against
+various existing RAG methods. Experimental results show that, leveraging the
+information integration and relational abilities of the KG, our MedRAG provides
+more specific diagnostic insights and outperforms state-of-the-art models in
+reducing misdiagnosis rates. Our code will be available at
+https://github.com/SNOWTEAM2023/MedRAG
 
-摘要：可解釋人工智慧 (XAI) 是將人工智慧 (AI) 和機器學習 (ML) 演算法整合到臨床實務中的辯論核心。高執行效能的 AI/ML 模型，例如整體學習器和深度神經網路，通常缺乏可解釋性，阻礙臨床醫生對其預測的信任。為了解決這個問題，正在開發 XAI 技術，以人類可以理解的術語描述 AI/ML 預測。一個有希望的方向是採用敏感度分析 (SA) 和全球敏感度分析 (GSA)，它們本質上會依據模型輸入對預測的影響來對其進行排名。在此，我們介紹一種新的 delta-XAI 方法，透過擴充 GSA 指標 delta 指數來提供 ML 模型預測的局部解釋。delta-XAI 指數評估每個特徵值對回歸和分類問題中個別例項的預測輸出之影響。我們將 delta-XAI 指數形式化，並提供其實作的程式碼。使用線性回歸模型對模擬情境評估 delta-XAI 方法，並以 Shapley 值作為基準。結果顯示 delta-XAI 指數通常與 Shapley 值一致，但在具有高度影響力或極端特徵值的模型中存在顯著差異。delta-XAI 指數在偵測主要特徵和處理極端特徵值方面表現出更高的敏感度。定性地來說，delta-XAI 透過利用機率密度函數提供直觀的解釋，使特徵排名更清晰且對從業人員來說更具可解釋性。總體而言，delta-XAI 方法對於穩健地取得 ML 模型預測的局部解釋似乎很有希望。將在真實世界的臨床環境中進行進一步調查，以評估其對 AI 輔助臨床工作流程的影響。
+摘要：檢索增強生成 (RAG) 是一種適用於檢索隱私敏感的電子健康記錄 (EHR) 的技術。它可以作為醫療保健副駕駛的一個關鍵模組，協助減少醫療保健從業人員和患者的誤診。然而，在醫療領域中使用的現有基於啟發法的 RAG 模型的診斷準確性和特異性不足，特別是對於具有類似表現的疾病。本文提出 MedRAG，一種由知識圖譜 (KG) 引發的推理增強的 RAG 模型，用於醫療領域，它根據表現檢索診斷和治療建議。MedRAG 系統性地構建了一個全面的四層階層式診斷 KG，涵蓋各種疾病的關鍵診斷差異。這些差異與從 EHR 資料庫中檢索到的類似 EHR 動態整合，並在大型語言模型中進行推理。這個過程可以實現更準確和具體的決策支援，同時主動提供後續問題，以增強個人化醫療決策制定。MedRAG 在公共資料集 DDXPlus 和從陳篤生醫院收集的私人慢性疼痛診斷資料集 (CPDD) 上進行評估，並將其效能與各種現有 RAG 方法進行比較。實驗結果顯示，利用 KG 的資訊整合和關係能力，我們的 MedRAG 提供了更具體的診斷見解，並在降低誤診率方面優於最先進的模型。我們的程式碼將在 https://github.com/SNOWTEAM2023/MedRAG 提供
 
-##### **Enhanced Deep Learning Methodologies and MRI Selection Techniques for Dementia Diagnosis in the Elderly Population**
-2407.17324v2 by Nikolaos Ntampakis, Konstantinos Diamantaras, Ioanna Chouvarda, Vasileios Argyriou, Panagiotis Sarigianndis
+##### **Ontology-Guided, Hybrid Prompt Learning for Generalization in Knowledge Graph Question Answering**
+2502.03992v1 by Longquan Jiang, Junbo Huang, Cedric Möller, Ricardo Usbeck
 
-Dementia, a debilitating neurological condition affecting millions worldwide,
-presents significant diagnostic challenges. In this work, we introduce a novel
-methodology for the classification of demented and non-demented elderly
-patients using 3D brain Magnetic Resonance Imaging (MRI) scans. Our approach
-features a unique technique for selectively processing MRI slices, focusing on
-the most relevant brain regions and excluding less informative sections. This
-methodology is complemented by a confidence-based classification committee
-composed of three custom deep learning models: Dem3D ResNet, Dem3D CNN, and
-Dem3D EfficientNet. These models work synergistically to enhance
-decision-making accuracy, leveraging their collective strengths. Tested on the
-Open Access Series of Imaging Studies(OASIS) dataset, our method achieved an
-impressive accuracy of 94.12%, surpassing existing methodologies. Furthermore,
-validation on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset
-confirmed the robustness and generalizability of our approach. The use of
-explainable AI (XAI) techniques and comprehensive ablation studies further
-substantiate the effectiveness of our techniques, providing insights into the
-decision-making process and the importance of our methodology. This research
-offers a significant advancement in dementia diagnosis, providing a highly
-accurate and efficient tool for clinical applications.
+Most existing Knowledge Graph Question Answering (KGQA) approaches are
+designed for a specific KG, such as Wikidata, DBpedia or Freebase. Due to the
+heterogeneity of the underlying graph schema, topology and assertions, most
+KGQA systems cannot be transferred to unseen Knowledge Graphs (KGs) without
+resource-intensive training data. We present OntoSCPrompt, a novel Large
+Language Model (LLM)-based KGQA approach with a two-stage architecture that
+separates semantic parsing from KG-dependent interactions. OntoSCPrompt first
+generates a SPARQL query structure (including SPARQL keywords such as SELECT,
+ASK, WHERE and placeholders for missing tokens) and then fills them with
+KG-specific information. To enhance the understanding of the underlying KG, we
+present an ontology-guided, hybrid prompt learning strategy that integrates KG
+ontology into the learning process of hybrid prompts (e.g., discrete and
+continuous vectors). We also present several task-specific decoding strategies
+to ensure the correctness and executability of generated SPARQL queries in both
+stages. Experimental results demonstrate that OntoSCPrompt performs as well as
+SOTA approaches without retraining on a number of KGQA datasets such as CWQ,
+WebQSP and LC-QuAD 1.0 in a resource-efficient manner and can generalize well
+to unseen domain-specific KGs like DBLP-QuAD and CoyPu KG Code:
+\href{https://github.com/LongquanJiang/OntoSCPrompt}{https://github.com/LongquanJiang/OntoSCPrompt}
 
-摘要：失智症是一種影響全球數百萬人的衰弱性神經疾病，在診斷上具有重大挑戰。在這項工作中，我們提出了一種新的方法，用於對失智和非失智老年患者進行分類，使用 3D 大腦磁振造影 (MRI) 掃描。我們的做法採用了一種獨特技術，用於選擇性處理 MRI 切片，重點關注最相關的大腦區域，並排除信息量較少的部分。這種方法由一個基於信心的分類委員會補充，該委員會由三個自定義深度學習模型組成：Dem3D ResNet、Dem3D CNN 和 Dem3D EfficientNet。這些模型協同工作以增強決策的準確性，利用它們的集體優勢。在影像研究開放存取系列 (OASIS) 資料集上進行測試，我們的模型達到了 94.12% 的驚人準確度，超過了現有方法。此外，在阿茲海默症神經影像倡議 (ADNI) 資料集上的驗證證實了我們方法的穩健性和普遍性。可解釋 AI (XAI) 技術和全面的消融研究進一步證實了我們技術的有效性，提供了對決策過程和我們方法重要性的見解。這項研究為失智症診斷提供了重大進展，為臨床應用提供了一個高度準確且高效的工具。
+摘要：現有的知識圖譜問答（KGQA）方法大多是為特定 KG 而設計的，例如 Wikidata、DBpedia 或 Freebase。由於底層圖形模式、拓撲和斷言的異質性，大多數 KGQA 系統無法在沒有資源密集型訓練資料的情況下轉移到未見過的知識圖譜（KG）。我們提出 OntoSCPrompt，這是一種基於大型語言模型（LLM）的新型 KGQA 方法，採用兩階段架構，將語義解析與依賴 KG 的互動分開。OntoSCPrompt 首先生成 SPARQL 查詢結構（包括 SPARQL 關鍵字，例如 SELECT、ASK、WHERE 和缺失令牌的佔位符），然後用 KG 特定的資訊填寫它們。為了增強對底層 KG 的理解，我們提出了一種由本体指導的混合提示學習策略，將 KG 本体整合到混合提示（例如，離散和連續向量）的學習過程中。我們還提出了多種特定任務的解碼策略，以確保在兩個階段中生成的 SPARQL 查詢的正確性和可執行性。實驗結果表明，OntoSCPrompt 在 CWQ、WebQSP 和 LC-QuAD 1.0 等多個 KGQA 資料集上執行時，效能與 SOTA 方法一樣好，且資源使用效率高，並且可以很好地概括到未見過的特定領域 KG，例如 DBLP-QuAD 和 CoyPu KG Code：
+\href{https://github.com/LongquanJiang/OntoSCPrompt}{https://github.com/LongquanJiang/OntoSCPrompt}
 
-##### **Using Large Language Models to Compare Explainable Models for Smart Home Human Activity Recognition**
-2408.06352v1 by Michele Fiori, Gabriele Civitarese, Claudio Bettini
+##### **Multimodal Medical Code Tokenizer**
+2502.04397v2 by Xiaorui Su, Shvat Messica, Yepeng Huang, Ruth Johnson, Lukas Fesser, Shanghua Gao, Faryad Sahneh, Marinka Zitnik
 
-Recognizing daily activities with unobtrusive sensors in smart environments
-enables various healthcare applications. Monitoring how subjects perform
-activities at home and their changes over time can reveal early symptoms of
-health issues, such as cognitive decline. Most approaches in this field use
-deep learning models, which are often seen as black boxes mapping sensor data
-to activities. However, non-expert users like clinicians need to trust and
-understand these models' outputs. Thus, eXplainable AI (XAI) methods for Human
-Activity Recognition have emerged to provide intuitive natural language
-explanations from these models. Different XAI methods generate different
-explanations, and their effectiveness is typically evaluated through user
-surveys, that are often challenging in terms of costs and fairness. This paper
-proposes an automatic evaluation method using Large Language Models (LLMs) to
-identify, in a pool of candidates, the best XAI approach for non-expert users.
-Our preliminary results suggest that LLM evaluation aligns with user surveys.
+Foundation models trained on patient electronic health records (EHRs) require
+tokenizing medical data into sequences of discrete vocabulary items. Existing
+tokenizers treat medical codes from EHRs as isolated textual tokens. However,
+each medical code is defined by its textual description, its position in
+ontological hierarchies, and its relationships to other codes, such as disease
+co-occurrences and drug-treatment associations. Medical vocabularies contain
+more than 600,000 codes with critical information for clinical reasoning. We
+introduce MedTok, a multimodal medical code tokenizer that uses the text
+descriptions and relational context of codes. MedTok processes text using a
+language model encoder and encodes the relational structure with a graph
+encoder. It then quantizes both modalities into a unified token space,
+preserving modality-specific and cross-modality information. We integrate
+MedTok into five EHR models and evaluate it on operational and clinical tasks
+across in-patient and out-patient datasets, including outcome prediction,
+diagnosis classification, drug recommendation, and risk stratification.
+Swapping standard EHR tokenizers with MedTok improves AUPRC across all EHR
+models, by 4.10% on MIMIC-III, 4.78% on MIMIC-IV, and 11.30% on EHRShot, with
+the largest gains in drug recommendation. Beyond EHR modeling, we demonstrate
+using MedTok tokenizer with medical QA systems. Our results demonstrate the
+potential of MedTok as a unified tokenizer for medical codes, improving
+tokenization for medical foundation models.
 
-摘要：藉由智慧環境中不引人注目的感測器辨識日常活動，能啟用各種醫療保健應用。監控受試者在家中如何執行活動，以及其隨著時間的變化，可以揭示健康問題的早期症狀，例如認知能力下降。此領域中的大多數方法都使用深度學習模型，這些模型通常被視為將感測器資料對應至活動的黑盒子。然而，非專家使用者（例如臨床醫師）需要信任並了解這些模型的輸出。因此，人類活動辨識的可解釋 AI (XAI) 方法應運而生，以提供來自這些模型的直覺自然語言說明。不同的 XAI 方法會產生不同的說明，而其有效性通常透過使用者調查來評估，這在成本和公平性方面通常具有挑戰性。本文提出使用大型語言模型 (LLM) 的自動評估方法，以在候選者中找出最適合非專家使用者的 XAI 方法。我們的初步結果表明，LLM 評估與使用者調查一致。
+摘要：<paragraph>在患者电子健康记录 (EHR) 上训练的基础模型需要将医学数据标记为离散词汇项序列。现有的标记器将 EHR 中的医学代码视为孤立的文本标记。然而，每个医学代码都由其文本描述、在本体层次结构中的位置以及与其他代码的关系（例如疾病共现和药物治疗关联）来定义。医学词汇表包含超过 600,000 个代码，这些代码包含临床推理的关键信息。我们引入了 MedTok，这是一种多模态医学代码标记器，它使用文本描述和代码的关系上下文。MedTok 使用语言模型编码器处理文本，并使用图编码器对关系结构进行编码。然后，它将这两种模态量化为一个统一的标记空间，保留特定于模态和跨模态的信息。我们将 MedTok 集成到五个 EHR 模型中，并在住院和门诊数据集（包括结果预测、诊断分类、药物推荐和风险分层）上对其实施操作和临床任务进行评估。用 MedTok 替换标准 EHR 标记器可提高所有 EHR 模型的 AUPRC，在 MIMIC-III 上提高 4.10%，在 MIMIC-IV 上提高 4.78%，在 EHRShot 上提高 11.30%，其中药物推荐的增益最大。除了 EHR 建模之外，我们还演示了将 MedTok 标记器与医学问答系统结合使用。我们的结果证明了 MedTok 作为医学代码的统一标记器的潜力，改进了医学基础模型的标记化。</paragraph>
 
-##### **Explainable AI-based Intrusion Detection System for Industry 5.0: An Overview of the Literature, associated Challenges, the existing Solutions, and Potential Research Directions**
-2408.03335v1 by Naseem Khan, Kashif Ahmad, Aref Al Tamimi, Mohammed M. Alani, Amine Bermak, Issa Khalil
+##### **Division-of-Thoughts: Harnessing Hybrid Language Model Synergy for Efficient On-Device Agents**
+2502.04392v1 by Chenyang Shao, Xinyuan Hu, Yutang Lin, Fengli Xu
 
-Industry 5.0, which focuses on human and Artificial Intelligence (AI)
-collaboration for performing different tasks in manufacturing, involves a
-higher number of robots, Internet of Things (IoTs) devices and
-interconnections, Augmented/Virtual Reality (AR), and other smart devices. The
-huge involvement of these devices and interconnection in various critical
-areas, such as economy, health, education and defense systems, poses several
-types of potential security flaws. AI itself has been proven a very effective
-and powerful tool in different areas of cybersecurity, such as intrusion
-detection, malware detection, and phishing detection, among others. Just as in
-many application areas, cybersecurity professionals were reluctant to accept
-black-box ML solutions for cybersecurity applications. This reluctance pushed
-forward the adoption of eXplainable Artificial Intelligence (XAI) as a tool
-that helps explain how decisions are made in ML-based systems. In this survey,
-we present a comprehensive study of different XAI-based intrusion detection
-systems for industry 5.0, and we also examine the impact of explainability and
-interpretability on Cybersecurity practices through the lens of Adversarial
-XIDS (Adv-XIDS) approaches. Furthermore, we analyze the possible opportunities
-and challenges in XAI cybersecurity systems for industry 5.0 that elicit future
-research toward XAI-based solutions to be adopted by high-stakes industry 5.0
-applications. We believe this rigorous analysis will establish a foundational
-framework for subsequent research endeavors within the specified domain.
+The rapid expansion of web content has made on-device AI assistants
+indispensable for helping users manage the increasing complexity of online
+tasks. The emergent reasoning ability in large language models offer a
+promising path for next-generation on-device AI agents. However, deploying
+full-scale Large Language Models (LLMs) on resource-limited local devices is
+challenging. In this paper, we propose Division-of-Thoughts (DoT), a
+collaborative reasoning framework leveraging the synergy between locally
+deployed Smaller-scale Language Models (SLMs) and cloud-based LLMs. DoT
+leverages a Task Decomposer to elicit the inherent planning abilities in
+language models to decompose user queries into smaller sub-tasks, which allows
+hybrid language models to fully exploit their respective strengths. Besides,
+DoT employs a Task Scheduler to analyze the pair-wise dependency of sub-tasks
+and create a dependency graph, facilitating parallel reasoning of sub-tasks and
+the identification of key steps. To allocate the appropriate model based on the
+difficulty of sub-tasks, DoT leverages a Plug-and-Play Adapter, which is an
+additional task head attached to the SLM that does not alter the SLM's
+parameters. To boost adapter's task allocation capability, we propose a
+self-reinforced training method that relies solely on task execution feedback.
+Extensive experiments on various benchmarks demonstrate that our DoT
+significantly reduces LLM costs while maintaining competitive reasoning
+accuracy. Specifically, DoT reduces the average reasoning time and API costs by
+66.12% and 83.57%, while achieving comparable reasoning accuracy with the best
+baseline methods.
 
-摘要：工業 5.0 著重於人類與人工智慧 (AI) 合作執行製造中的不同任務，涉及更多機器人、物聯網 (IoT) 裝置和互連、擴增/虛擬實境 (AR) 和其他智慧裝置。這些裝置和互連在經濟、醫療保健、教育和國防系統等各種關鍵領域的廣泛參與，引發了多種類型的潛在安全漏洞。AI 本身已被證明是網路安全不同領域中非常有效且強大的工具，例如入侵偵測、惡意軟體偵測和網路釣魚偵測等。就像在許多應用領域一樣，網路安全專業人員不願意接受黑盒 ML 解決方案來應用於網路安全。這種不願意促使可解釋人工智慧 (XAI) 作為一種工具被採用，有助於說明在基於 ML 的系統中如何做出決策。在這項調查中，我們對工業 5.0 的不同基於 XAI 的入侵偵測系統進行了全面的研究，並且我們也透過對抗式 XIDS (Adv-XIDS) 方法的觀點來探討可解釋性和可詮釋性對網路安全實務的影響。此外，我們分析了工業 5.0 的 XAI 網路安全系統中可能存在的機會和挑戰，引發了未來針對 XAI 基礎解決方案的研究，以供高風險的工業 5.0 應用採用。我們相信這項嚴謹的分析將為指定領域內的後續研究工作建立基礎架構。
+摘要：<paragraph>網頁內容快速擴充，使得行動裝置上的 AI 助理在協助使用者管理日益複雜的線上工作上變得不可或缺。大型語言模型中浮現的推理能力為新一代行動裝置上的 AI 代理提供了一條有希望的途徑。然而，在資源有限的本機裝置上部署全規模的大型語言模型 (LLM) 是一項挑戰。在本文中，我們提出了思想分工 (DoT)，一個協作推理框架，利用了本地部署的小型語言模型 (SLM) 與雲端 LLM 之間的協同效應。DoT 利用任務分解器引出語言模型中固有的規劃能力，將使用者查詢分解成較小的子任務，這允許混合語言模型充分發揮其各自的優勢。此外，DoT 雇用了一個任務排程器來分析子任務的成對依賴性並建立一個依賴性圖，促進子任務的並行推理和關鍵步驟的識別。為了根據子任務的難度分配適當的模型，DoT 利用了即插即用適配器，這是一個附加在 SLM 上的任務頭，不會改變 SLM 的參數。為了提升適配器的任務分配能力，我們提出了一種自我強化訓練方法，它僅依賴於任務執行回饋。在各種基準上的廣泛實驗表明，我們的 DoT 大幅降低了 LLM 成本，同時維持了有競爭力的推理準確度。具體來說，DoT 將平均推理時間和 API 成本分別降低了 66.12% 和 83.57%，同時達到了與最佳基準方法相當的推理準確度。</paragraph>
 
-##### **A Comparative Study on Automatic Coding of Medical Letters with Explainability**
-2407.13638v1 by Jamie Glen, Lifeng Han, Paul Rayson, Goran Nenadic
+##### **Boosting Knowledge Graph-based Recommendations through Confidence-Aware Augmentation with Large Language Models**
+2502.03715v1 by Rui Cai, Chao Wang, Qianyi Cai, Dazhong Shen, Hui Xiong
 
-This study aims to explore the implementation of Natural Language Processing
-(NLP) and machine learning (ML) techniques to automate the coding of medical
-letters with visualised explainability and light-weighted local computer
-settings. Currently in clinical settings, coding is a manual process that
-involves assigning codes to each condition, procedure, and medication in a
-patient's paperwork (e.g., 56265001 heart disease using SNOMED CT code). There
-are preliminary research on automatic coding in this field using
-state-of-the-art ML models; however, due to the complexity and size of the
-models, the real-world deployment is not achieved. To further facilitate the
-possibility of automatic coding practice, we explore some solutions in a local
-computer setting; in addition, we explore the function of explainability for
-transparency of AI models. We used the publicly available MIMIC-III database
-and the HAN/HLAN network models for ICD code prediction purposes. We also
-experimented with the mapping between ICD and SNOMED CT knowledge bases. In our
-experiments, the models provided useful information for 97.98\% of codes. The
-result of this investigation can shed some light on implementing automatic
-clinical coding in practice, such as in hospital settings, on the local
-computers used by clinicians , project page
-\url{https://github.com/Glenj01/Medical-Coding}.
+Knowledge Graph-based recommendations have gained significant attention due
+to their ability to leverage rich semantic relationships. However, constructing
+and maintaining Knowledge Graphs (KGs) is resource-intensive, and the accuracy
+of KGs can suffer from noisy, outdated, or irrelevant triplets. Recent
+advancements in Large Language Models (LLMs) offer a promising way to improve
+the quality and relevance of KGs for recommendation tasks. Despite this,
+integrating LLMs into KG-based systems presents challenges, such as efficiently
+augmenting KGs, addressing hallucinations, and developing effective joint
+learning methods. In this paper, we propose the Confidence-aware KG-based
+Recommendation Framework with LLM Augmentation (CKG-LLMA), a novel framework
+that combines KGs and LLMs for recommendation task. The framework includes: (1)
+an LLM-based subgraph augmenter for enriching KGs with high-quality
+information, (2) a confidence-aware message propagation mechanism to filter
+noisy triplets, and (3) a dual-view contrastive learning method to integrate
+user-item interactions and KG data. Additionally, we employ a confidence-aware
+explanation generation process to guide LLMs in producing realistic
+explanations for recommendations. Finally, extensive experiments demonstrate
+the effectiveness of CKG-LLMA across multiple public datasets.
 
-摘要：本研究旨在探討將自然語言處理 (NLP) 和機器學習 (ML) 技術實作於醫療信函編碼自動化，並具備視覺化說明能力和輕量化的本地電腦設定。目前在臨床環境中，編碼是一種手動流程，涉及為病患文件中的每項病症、程序和藥物指派代碼 (例如，使用 SNOMED CT 代碼 56265001 表示心臟病)。此領域有使用最新 ML 模型進行自動編碼的初步研究；然而，由於模型的複雜性和大小，並未實現實際部署。為了進一步促進自動編碼實務的可能性，我們在本地電腦設定中探討了一些解決方案；此外，我們探討了說明功能在 AI 模型透明度中的功能。我們使用公開的 MIMIC-III 資料庫和 HAN/HLAN 網路模型進行 ICD 代碼預測。我們還試驗了 ICD 和 SNOMED CT 知識庫之間的對應。在我們的實驗中，這些模型提供了 97.98% 代碼的有用資訊。這項調查結果可以為實務中的自動臨床編碼實作提供一些見解，例如在醫院環境中，由臨床醫生使用的本地電腦，專案頁面 \url{https://github.com/Glenj01/Medical-Coding}。
+摘要：基於知識圖譜的推薦因其利用豐富語義關係的能力而備受關注。然而，構建和維護知識圖譜 (KG) 是一項資源密集型任務，而 KG 的準確性可能會受到雜訊、過時或無關的三元組的影響。大型語言模型 (LLM) 的最新進展為提高 KG 在推薦任務中的品質和相關性提供了一種有前途的方法。儘管如此，將 LLM 整合到基於 KG 的系統中會帶來挑戰，例如有效擴充 KG、處理幻覺，以及開發有效的聯合學習方法。在本文中，我們提出具有 LLM 擴充的信心感知型基於 KG 的推薦框架 (CKG-LLMA)，這是一個結合 KG 和 LLM 進行推薦任務的新穎框架。該框架包括：(1) 一個基於 LLM 的子圖擴充器，用於使用高品質資訊豐富 KG，(2) 一個信心感知型訊息傳播機制，用於過濾雜訊三元組，以及 (3) 一個雙視圖對比學習方法，用於整合使用者-項目互動和 KG 資料。此外，我們採用一個信心感知型解釋產生程序，以引導 LLM 為推薦產生逼真的解釋。最後，大量的實驗證明了 CKG-LLMA 在多個公開資料集中的有效性。
 
-##### **Explainable AI for Enhancing Efficiency of DL-based Channel Estimation**
-2407.07009v1 by Abdul Karim Gizzini, Yahia Medjahdi, Ali J. Ghandour, Laurent Clavier
+##### **A Schema-Guided Reason-while-Retrieve framework for Reasoning on Scene Graphs with Large-Language-Models (LLMs)**
+2502.03450v1 by Yiye Chen, Harpreet Sawhney, Nicholas Gydé, Yanan Jian, Jack Saunders, Patricio Vela, Ben Lundell
 
-The support of artificial intelligence (AI) based decision-making is a key
-element in future 6G networks, where the concept of native AI will be
-introduced. Moreover, AI is widely employed in different critical applications
-such as autonomous driving and medical diagnosis. In such applications, using
-AI as black-box models is risky and challenging. Hence, it is crucial to
-understand and trust the decisions taken by these models. Tackling this issue
-can be achieved by developing explainable AI (XAI) schemes that aim to explain
-the logic behind the black-box model behavior, and thus, ensure its efficient
-and safe deployment. Recently, we proposed a novel perturbation-based XAI-CHEST
-framework that is oriented toward channel estimation in wireless
-communications. The core idea of the XAI-CHEST framework is to identify the
-relevant model inputs by inducing high noise on the irrelevant ones. This
-manuscript provides the detailed theoretical foundations of the XAI-CHEST
-framework. In particular, we derive the analytical expressions of the XAI-CHEST
-loss functions and the noise threshold fine-tuning optimization problem. Hence
-the designed XAI-CHEST delivers a smart input feature selection methodology
-that can further improve the overall performance while optimizing the
-architecture of the employed model. Simulation results show that the XAI-CHEST
-framework provides valid interpretations, where it offers an improved bit error
-rate performance while reducing the required computational complexity in
-comparison to the classical DL-based channel estimation.
+Scene graphs have emerged as a structured and serializable environment
+representation for grounded spatial reasoning with Large Language Models
+(LLMs). In this work, we propose SG-RwR, a Schema-Guided Retrieve-while-Reason
+framework for reasoning and planning with scene graphs. Our approach employs
+two cooperative, code-writing LLM agents: a (1) Reasoner for task planning and
+information queries generation, and a (2) Retriever for extracting
+corresponding graph information following the queries. Two agents collaborate
+iteratively, enabling sequential reasoning and adaptive attention to graph
+information. Unlike prior works, both agents are prompted only with the scene
+graph schema rather than the full graph data, which reduces the hallucination
+by limiting input tokens, and drives the Reasoner to generate reasoning trace
+abstractly.Following the trace, the Retriever programmatically query the scene
+graph data based on the schema understanding, allowing dynamic and global
+attention on the graph that enhances alignment between reasoning and retrieval.
+Through experiments in multiple simulation environments, we show that our
+framework surpasses existing LLM-based approaches in numerical Q\&A and
+planning tasks, and can benefit from task-level few-shot examples, even in the
+absence of agent-level demonstrations. Project code will be released.
 
-摘要：人工智能 (AI) 支持的決策制定是未來 6G 網路中的關鍵元素，其中將引入原生 AI 的概念。此外，AI 廣泛用於不同的關鍵應用中，例如自動駕駛和醫療診斷。在這些應用中，使用 AI 作為黑盒模型是有風險且具有挑戰性的。因此，理解和信任這些模型做出的決策至關重要。解決此問題的方法是開發可解釋 AI (XAI) 架構，旨在解釋黑盒模型行為背後的邏輯，從而確保其有效且安全的部署。最近，我們提出了一個新的基於擾動的 XAI-CHEST 框架，該框架面向無線通信中的信道估計。XAI-CHEST 框架的核心思想是通過在無關輸入上引入高噪聲來識別相關模型輸入。這份手稿提供了 XAI-CHEST 框架的詳細理論基礎。特別是，我們推導了 XAI-CHEST 損失函數和噪聲閾值微調優化問題的解析表達式。因此，設計的 XAI-CHEST 提供了一種智能輸入特徵選擇方法，可以在優化所用模型的架構的同時進一步提高整體性能。模擬結果表明，XAI-CHEST 框架提供了有效的解釋，在降低所需的計算複雜度的同時，提供了改進的比特錯誤率性能，而這與基於傳統 DL 的信道估計相比。
+摘要：場景圖表已成為大型語言模型 (LLM) 以基礎空間推理為基礎的結構化且可序列化的環境表徵。在這項工作中，我們提出 SG-RwR，一個以綱要為導向的檢索與推理框架，用於場景圖表的推理和規劃。我們的做法採用了兩個協作的、編寫程式碼的 LLM 代理：一個 (1) 推論器，用於任務規劃和資訊查詢產生，以及一個 (2) 檢索器，用於根據查詢提取對應的圖形資訊。兩個代理反覆合作，實現對圖形資訊的順序推理和適應性關注。與先前的作品不同，兩個代理僅提示場景圖表綱要，而不是完整的圖形資料，這透過限制輸入代碼減少了幻覺，並驅使推論器抽象地產生推理軌跡。根據軌跡，檢索器根據綱要理解以程式化方式查詢場景圖形資料，允許對圖形進行動態和整體關注，增強推理和檢索之間的一致性。透過在多個模擬環境中的實驗，我們表明我們的框架在數值問答和規劃任務中超越了現有的基於 LLM 的方法，並且可以受益於任務級別的少次範例，即使在沒有代理級別示範的情況下也是如此。專案程式碼將會釋出。
 
-##### **Explainable AI: Comparative Analysis of Normal and Dilated ResNet Models for Fundus Disease Classification**
-2407.05440v2 by P. N. Karthikayan, Yoga Sri Varshan V, Hitesh Gupta Kattamuri, Umarani Jayaraman
+##### **SymAgent: A Neural-Symbolic Self-Learning Agent Framework for Complex Reasoning over Knowledge Graphs**
+2502.03283v1 by Ben Liu, Jihai Zhang, Fangquan Lin, Cheng Yang, Min Peng, Wotao Yin
 
-This paper presents dilated Residual Network (ResNet) models for disease
-classification from retinal fundus images. Dilated convolution filters are used
-to replace normal convolution filters in the higher layers of the ResNet model
-(dilated ResNet) in order to improve the receptive field compared to the normal
-ResNet model for disease classification. This study introduces
-computer-assisted diagnostic tools that employ deep learning, enhanced with
-explainable AI techniques. These techniques aim to make the tool's
-decision-making process transparent, thereby enabling medical professionals to
-understand and trust the AI's diagnostic decision. They are particularly
-relevant in today's healthcare landscape, where there is a growing demand for
-transparency in AI applications to ensure their reliability and ethical use.
-The dilated ResNet is used as a replacement for the normal ResNet to enhance
-the classification accuracy of retinal eye diseases and reduce the required
-computing time. The dataset used in this work is the Ocular Disease Intelligent
-Recognition (ODIR) dataset which is a structured ophthalmic database with eight
-classes covering most of the common retinal eye diseases. The evaluation
-metrics used in this work include precision, recall, accuracy, and F1 score. In
-this work, a comparative study has been made between normal ResNet models and
-dilated ResNet models on five variants namely ResNet-18, ResNet-34, ResNet-50,
-ResNet-101, and ResNet-152. The dilated ResNet model shows promising results as
-compared to normal ResNet with an average F1 score of 0.71, 0.70, 0.69, 0.67,
-and 0.70 respectively for the above respective variants in ODIR multiclass
-disease classification.
+Recent advancements have highlighted that Large Language Models (LLMs) are
+prone to hallucinations when solving complex reasoning problems, leading to
+erroneous results. To tackle this issue, researchers incorporate Knowledge
+Graphs (KGs) to improve the reasoning ability of LLMs. However, existing
+methods face two limitations: 1) they typically assume that all answers to the
+questions are contained in KGs, neglecting the incompleteness issue of KGs, and
+2) they treat the KG as a static repository and overlook the implicit logical
+reasoning structures inherent in KGs. In this paper, we introduce SymAgent, an
+innovative neural-symbolic agent framework that achieves collaborative
+augmentation between KGs and LLMs. We conceptualize KGs as dynamic environments
+and transform complex reasoning tasks into a multi-step interactive process,
+enabling KGs to participate deeply in the reasoning process. SymAgent consists
+of two modules: Agent-Planner and Agent-Executor. The Agent-Planner leverages
+LLM's inductive reasoning capability to extract symbolic rules from KGs,
+guiding efficient question decomposition. The Agent-Executor autonomously
+invokes predefined action tools to integrate information from KGs and external
+documents, addressing the issues of KG incompleteness. Furthermore, we design a
+self-learning framework comprising online exploration and offline iterative
+policy updating phases, enabling the agent to automatically synthesize
+reasoning trajectories and improve performance. Experimental results
+demonstrate that SymAgent with weak LLM backbones (i.e., 7B series) yields
+better or comparable performance compared to various strong baselines. Further
+analysis reveals that our agent can identify missing triples, facilitating
+automatic KG updates.
 
-摘要：这篇论文提出了用于从视网膜眼底图像进行疾病分类的扩张残差网络 (ResNet) 模型。扩张卷积滤波器用于替换 ResNet 模型较高层中的正常卷积滤波器（扩张 ResNet），以改善感知场，从而针对疾病分类对正常 ResNet 模型进行改进。本研究引入了采用深度学习的计算机辅助诊断工具，并通过可解释的 AI 技术进行了增强。这些技术旨在使该工具的决策过程透明化，从而使医学专业人士能够理解和信任 AI 的诊断决策。它们与当今的医疗保健领域尤为相关，在该领域，对 AI 应用的透明度需求不断增长，以确保其可靠性和合乎道德的使用。扩张 ResNet 用作正常 ResNet 的替代品，以提高视网膜眼部疾病的分类准确性并减少所需的计算时间。本工作中使用的数据集是眼科疾病智能识别 (ODIR) 数据集，这是一个结构化的眼科数据库，包含八类涵盖大多数常见视网膜眼部疾病。本工作中使用的评估指标包括精确度、召回率、准确度和 F1 得分。在这项工作中，对 ResNet-18、ResNet-34、ResNet-50、ResNet-101 和 ResNet-152 五个变体的正常 ResNet 模型和扩张 ResNet 模型进行了比较研究。与正常 ResNet 相比，扩张 ResNet 模型显示出有希望的结果，在 ODIR 多类疾病分类中，上述各个变体的平均 F1 得分为 0.71、0.70、0.69、0.67 和 0.70。
+摘要：<paragraph>最近的研究表明，大型语言模型 (LLM) 在解决复杂的推理问题时容易出现幻觉，从而导致错误的结果。为了解决这个问题，研究人员结合了知识图谱 (KG) 来提高 LLM 的推理能力。然而，现有方法面临两个局限性：1) 它们通常假设问题的答案都包含在 KG 中，忽略了 KG 不完整的问题，2) 它们将 KG 视为一个静态存储库，而忽略了 KG 中固有的隐式逻辑推理结构。在本文中，我们介绍了 SymAgent，这是一个创新的神经符号代理框架，可以在 KG 和 LLM 之间实现协作增强。我们将 KG 概念化为动态环境，并将复杂的推理任务转化为一个多步骤的交互过程，使 KG 能够深入参与推理过程。SymAgent 由两个模块组成：Agent-Planner 和 Agent-Executor。Agent-Planner 利用 LLM 的归纳推理能力从 KG 中提取符号规则，指导高效的问题分解。Agent-Executor 自主调用预定义的动作工具来整合来自 KG 和外部文档的信息，解决 KG 不完整的问题。此外，我们设计了一个自学习框架，包括在线探索和离线迭代策略更新阶段，使代理能够自动合成推理轨迹并提高性能。实验结果表明，具有弱 LLM 主干的 SymAgent（即 7B 系列）与各种强大的基线相比，产生了更好或相当的性能。进一步的分析表明，我们的代理可以识别缺失的三元组，促进自动 KG 更新。</paragraph>
 
-##### **A Survey on Trustworthiness in Foundation Models for Medical Image Analysis**
-2407.15851v2 by Congzhen Shi, Ryan Rezai, Jiaxi Yang, Qi Dou, Xiaoxiao Li
+##### **Analyze Feature Flow to Enhance Interpretation and Steering in Language Models**
+2502.03032v2 by Daniil Laptev, Nikita Balagansky, Yaroslav Aksenov, Daniil Gavrilov
 
-The rapid advancement of foundation models in medical imaging represents a
-significant leap toward enhancing diagnostic accuracy and personalized
-treatment. However, the deployment of foundation models in healthcare
-necessitates a rigorous examination of their trustworthiness, encompassing
-privacy, robustness, reliability, explainability, and fairness. The current
-body of survey literature on foundation models in medical imaging reveals
-considerable gaps, particularly in the area of trustworthiness. Additionally,
-existing surveys on the trustworthiness of foundation models do not adequately
-address their specific variations and applications within the medical imaging
-domain. This survey aims to fill that gap by presenting a novel taxonomy of
-foundation models used in medical imaging and analyzing the key motivations for
-ensuring their trustworthiness. We review current research on foundation models
-in major medical imaging applications, focusing on segmentation, medical report
-generation, medical question and answering (Q\&A), and disease diagnosis. These
-areas are highlighted because they have seen a relatively mature and
-substantial number of foundation models compared to other applications. We
-focus on literature that discusses trustworthiness in medical image analysis
-manuscripts. We explore the complex challenges of building trustworthy
-foundation models for each application, summarizing current concerns and
-strategies for enhancing trustworthiness. Furthermore, we examine the potential
-of these models to revolutionize patient care. Our analysis underscores the
-imperative for advancing towards trustworthy AI in medical image analysis,
-advocating for a balanced approach that fosters innovation while ensuring
-ethical and equitable healthcare delivery.
+We introduce a new approach to systematically map features discovered by
+sparse autoencoder across consecutive layers of large language models,
+extending earlier work that examined inter-layer feature links. By using a
+data-free cosine similarity technique, we trace how specific features persist,
+transform, or first appear at each stage. This method yields granular flow
+graphs of feature evolution, enabling fine-grained interpretability and
+mechanistic insights into model computations. Crucially, we demonstrate how
+these cross-layer feature maps facilitate direct steering of model behavior by
+amplifying or suppressing chosen features, achieving targeted thematic control
+in text generation. Together, our findings highlight the utility of a causal,
+cross-layer interpretability framework that not only clarifies how features
+develop through forward passes but also provides new means for transparent
+manipulation of large language models.
 
-摘要：基礎模型在醫學影像方面的快速進展，代表著在加強診斷準確性和個人化治療方面邁出一大步。然而，基礎模型在醫療保健中的部署需要對其可信度進行嚴格的審查，包括隱私、穩健性、可靠性、可解釋性和公平性。目前關於醫學影像中基礎模型的調查文獻中顯示出相當大的差距，特別是在可信度方面。此外，現有關於基礎模型可信度的調查並未充分解決其在醫學影像領域中的特定變化和應用。本調查旨在通過提出醫學影像中使用的基礎模型的新分類法並分析確保其可信度的關鍵動機，來填補這一空白。我們回顧了基礎模型在主要醫學影像應用中的當前研究，重點關注分割、醫療報告生成、醫療問題和回答 (Q&A) 以及疾病診斷。這些領域之所以被強調，是因為與其他應用相比，它們已經看到相對成熟且大量的基礎模型。我們專注於探討醫學影像分析手稿中可信度的文獻。我們探討了為每個應用構建可信基礎模型的複雜挑戰，總結了當前關注點和增強可信度的策略。此外，我們探討了這些模型在革新患者護理方面的潛力。我們的分析強調了在醫學影像分析中朝著可信賴的人工智慧邁進的必要性，並倡導一種平衡的方法，既能促進創新，又能確保道德和公平的醫療保健服務。
+摘要：我們提出了一種新方法，用於系統性地繪製大型語言模型連續層中稀疏自動編碼器發現的功能，擴展了先前研究層間特徵連結的工作。透過使用無資料餘弦相似性技術，我們追蹤特定特徵在每個階段如何持續、轉換或首次出現。此方法產生了特徵演化的細粒度流程圖，實現了細粒度的可解釋性和對模型運算的機制見解。至關重要的是，我們展示了這些跨層特徵圖如何透過放大或抑制所選特徵來促進模型行為的直接引導，在文字生成中實現目標主題控制。我們的研究結果共同突出了因果、跨層可解釋性框架的效用，不僅闡明了特徵如何透過前向傳遞發展，還提供了新的方法來透明地操作大型語言模型。
 
-##### **The Impact of an XAI-Augmented Approach on Binary Classification with Scarce Data**
-2407.06206v1 by Ximing Wen, Rosina O. Weber, Anik Sen, Darryl Hannan, Steven C. Nesbit, Vincent Chan, Alberto Goffi, Michael Morris, John C. Hunninghake, Nicholas E. Villalobos, Edward Kim, Christopher J. MacLellan
+##### **A Benchmark for the Detection of Metalinguistic Disagreements between LLMs and Knowledge Graphs**
+2502.02896v1 by Bradley P. Allen, Paul T. Groth
 
-Point-of-Care Ultrasound (POCUS) is the practice of clinicians conducting and
-interpreting ultrasound scans right at the patient's bedside. However, the
-expertise needed to interpret these images is considerable and may not always
-be present in emergency situations. This reality makes algorithms such as
-machine learning classifiers extremely valuable to augment human decisions.
-POCUS devices are becoming available at a reasonable cost in the size of a
-mobile phone. The challenge of turning POCUS devices into life-saving tools is
-that interpretation of ultrasound images requires specialist training and
-experience. Unfortunately, the difficulty to obtain positive training images
-represents an important obstacle to building efficient and accurate
-classifiers. Hence, the problem we try to investigate is how to explore
-strategies to increase accuracy of classifiers trained with scarce data. We
-hypothesize that training with a few data instances may not suffice for
-classifiers to generalize causing them to overfit. Our approach uses an
-Explainable AI-Augmented approach to help the algorithm learn more from less
-and potentially help the classifier better generalize.
+Evaluating large language models (LLMs) for tasks like fact extraction in
+support of knowledge graph construction frequently involves computing accuracy
+metrics using a ground truth benchmark based on a knowledge graph (KG). These
+evaluations assume that errors represent factual disagreements. However, human
+discourse frequently features metalinguistic disagreement, where agents differ
+not on facts but on the meaning of the language used to express them. Given the
+complexity of natural language processing and generation using LLMs, we ask: do
+metalinguistic disagreements occur between LLMs and KGs? Based on an
+investigation using the T-REx knowledge alignment dataset, we hypothesize that
+metalinguistic disagreement does in fact occur between LLMs and KGs, with
+potential relevance for the practice of knowledge graph engineering. We propose
+a benchmark for evaluating the detection of factual and metalinguistic
+disagreements between LLMs and KGs. An initial proof of concept of such a
+benchmark is available on Github.
 
-摘要：床邊超音波 (POCUS) 是臨床醫師在患者床邊進行和解讀超音波掃描的實務。然而，解讀這些影像所需的專業知識相當可觀，而且在緊急情況下可能並非隨時具備。這種現實情況使得機器學習分類器等演算法對於加強人類決策變得極為有價值。POCUS 裝置正以合理成本推出，尺寸為手機大小。將 POCUS 裝置轉變為救生工具的挑戰在於，解讀超音波影像需要專門訓練和經驗。不幸的是，取得正向訓練影像的困難度代表著建置有效率且準確的分類器的一大障礙。因此，我們嘗試探討的問題是如何探索策略，以提高使用稀疏資料訓練的分類器的準確度。我們假設使用少數資料實例進行訓練可能不足以讓分類器概括，導致它們過度擬合。我們的做法使用可解釋 AI 增強方法，以協助演算法從較少的資料中學習更多，並潛在協助分類器更好地概括。
+摘要：評估大型語言模型 (LLM) 執行知識圖譜建構支援事實萃取等任務時，通常會使用基於知識圖譜 (KG) 的基準事實計算準確度指標。這些評估假設錯誤代表事實上的分歧。然而，人類話語經常出現元語言分歧，其中代理人之間的差異不在於事實，而在於用於表達事實的語言的含義。鑑於使用 LLM 處理和產生自然語言的複雜性，我們提出疑問：LLM 和 KG 之間是否會發生元語言分歧？根據使用 T-REx 知識比對資料集進行的調查，我們假設元語言分歧確實會發生在 LLM 和 KG 之間，並可能與知識圖譜工程實務有關。我們提出一個基準，用於評估 LLM 和 KG 之間的事實和元語言分歧的偵測。此基準的初步概念驗證可在 Github 上取得。
 
-##### **Can GPT-4 Help Detect Quit Vaping Intentions? An Exploration of Automatic Data Annotation Approach**
-2407.00167v1 by Sai Krishna Revanth Vuruma, Dezhi Wu, Saborny Sen Gupta, Lucas Aust, Valerie Lookingbill, Wyatt Bellamy, Yang Ren, Erin Kasson, Li-Shiun Chen, Patricia Cavazos-Rehg, Dian Hu, Ming Huang
+##### **Mol-LLM: Generalist Molecular LLM with Improved Graph Utilization**
+2502.02810v1 by Chanhui Lee, Yuheon Song, YongJun Jeong, Hanbum Ko, Rodrigo Hormazabal, Sehui Han, Kyunghoon Bae, Sungbin Lim, Sungwoong Kim
 
-In recent years, the United States has witnessed a significant surge in the
-popularity of vaping or e-cigarette use, leading to a notable rise in cases of
-e-cigarette and vaping use-associated lung injury (EVALI) that caused
-hospitalizations and fatalities during the EVALI outbreak in 2019, highlighting
-the urgency to comprehend vaping behaviors and develop effective strategies for
-cessation. Due to the ubiquity of social media platforms, over 4.7 billion
-users worldwide use them for connectivity, communications, news, and
-entertainment with a significant portion of the discourse related to health,
-thereby establishing social media data as an invaluable organic data resource
-for public health research. In this study, we extracted a sample dataset from
-one vaping sub-community on Reddit to analyze users' quit-vaping intentions.
-Leveraging OpenAI's latest large language model GPT-4 for sentence-level quit
-vaping intention detection, this study compares the outcomes of this model
-against layman and clinical expert annotations. Using different prompting
-strategies such as zero-shot, one-shot, few-shot and chain-of-thought
-prompting, we developed 8 prompts with varying levels of detail to explain the
-task to GPT-4 and also evaluated the performance of the strategies against each
-other. These preliminary findings emphasize the potential of GPT-4 in social
-media data analysis, especially in identifying users' subtle intentions that
-may elude human detection.
+Recent advances in Large Language Models (LLMs) have motivated the
+development of general LLMs for molecular tasks. While several studies have
+demonstrated that fine-tuned LLMs can achieve impressive benchmark
+performances, they are far from genuine generalist molecular LLMs due to a lack
+of fundamental understanding of molecular structure. Specifically, when given
+molecular task instructions, LLMs trained with naive next-token prediction
+training assign similar likelihood scores to both original and negatively
+corrupted molecules, revealing their lack of molecular structure understanding
+that is crucial for reliable and general molecular LLMs. To overcome this
+limitation and obtain a true generalist molecular LLM, we introduce a novel
+multi-modal training method based on a thorough multi-modal instruction tuning
+as well as a molecular structure preference optimization between chosen and
+rejected graphs. On various molecular benchmarks, the proposed generalist
+molecular LLM, called Mol-LLM, achieves state-of-the-art performances among
+generalist LLMs on most tasks, at the same time, surpassing or comparable to
+state-of-the-art specialist LLMs. Moreover, Mol-LLM also shows superior
+generalization performances in reaction prediction tasks, demonstrating the
+effect of the molecular structure understanding for generalization perspective.
 
-摘要：近年來，美國見證了電子煙或電子香菸使用率大幅激增，導致電子煙和電子煙使用相關肺損傷 (EVALI) 病例顯著增加，在 2019 年 EVALI 爆發期間造成住院和死亡，凸顯了理解電子煙行為和制定有效戒菸策略的迫切性。由於社群媒體平台的普及，全球超過 47 億使用者使用它們進行連結、溝通、新聞和娛樂，其中很大一部分與健康相關，因此將社群媒體資料建立為公共衛生研究中無價的有機資料資源。在本研究中，我們從 Reddit 上一個電子煙子社群中提取一個範例資料集，以分析使用者的戒電子煙意圖。利用 OpenAI 最新的大型語言模型 GPT-4 進行句子層級的戒電子煙意圖偵測，本研究比較了此模型的結果與外行人和臨床專家註解。使用不同的提示策略，例如零次學習、一次學習、少次學習和思考鏈提示，我們開發了 8 個提示，詳細程度不同，向 GPT-4 解釋任務，並評估這些策略彼此之間的效能。這些初步發現強調了 GPT-4 在社群媒體資料分析中的潛力，特別是在識別人類偵測可能無法察覺的使用者微妙意圖方面。
+摘要：大型語言模型 (LLM) 的近期進展激勵了針對分子任務開發通用 LLM。雖然多項研究已證明微調 LLM 可實現令人印象深刻的基準效能，但由於缺乏對分子結構的基本理解，它們遠非真正的通才分子 LLM。具體來說，當給予分子任務說明時，使用天真的下一個符號預測訓練訓練的 LLM 會將類似的可能性評分分配給原始分子和負面損壞分子，這顯示出它們缺乏對分子結構的理解，而這對於可靠且通用的分子 LLM 至關重要。為了克服這個限制並獲得真正的通才分子 LLM，我們引入了一種新穎的多模態訓練方法，該方法基於徹底的多模態說明調整以及在所選和拒絕圖形之間的分子結構偏好最佳化。在各種分子基準測試中，所提出的通才分子 LLM（稱為 Mol-LLM）在多數任務中實現了通才 LLM 中的最新效能，同時超越或與最新的專家 LLM 相當。此外，Mol-LLM 在反應預測任務中也展現出優異的泛化效能，證明了分子結構理解對泛化觀點的影響。
 
-##### **Towards Compositional Interpretability for XAI**
-2406.17583v1 by Sean Tull, Robin Lorenz, Stephen Clark, Ilyas Khan, Bob Coecke
+##### **Leveraging the true depth of LLMs**
+2502.02790v1 by Ramón Calvo González, Daniele Paliotta, Matteo Pagliardini, Martin Jaggi, François Fleuret
 
-Artificial intelligence (AI) is currently based largely on black-box machine
-learning models which lack interpretability. The field of eXplainable AI (XAI)
-strives to address this major concern, being critical in high-stakes areas such
-as the finance, legal and health sectors.
-  We present an approach to defining AI models and their interpretability based
-on category theory. For this we employ the notion of a compositional model,
-which sees a model in terms of formal string diagrams which capture its
-abstract structure together with its concrete implementation. This
-comprehensive view incorporates deterministic, probabilistic and quantum
-models. We compare a wide range of AI models as compositional models, including
-linear and rule-based models, (recurrent) neural networks, transformers, VAEs,
-and causal and DisCoCirc models.
-  Next we give a definition of interpretation of a model in terms of its
-compositional structure, demonstrating how to analyse the interpretability of a
-model, and using this to clarify common themes in XAI. We find that what makes
-the standard 'intrinsically interpretable' models so transparent is brought out
-most clearly diagrammatically. This leads us to the more general notion of
-compositionally-interpretable (CI) models, which additionally include, for
-instance, causal, conceptual space, and DisCoCirc models.
-  We next demonstrate the explainability benefits of CI models. Firstly, their
-compositional structure may allow the computation of other quantities of
-interest, and may facilitate inference from the model to the modelled
-phenomenon by matching its structure. Secondly, they allow for diagrammatic
-explanations for their behaviour, based on influence constraints, diagram
-surgery and rewrite explanations. Finally, we discuss many future directions
-for the approach, raising the question of how to learn such meaningfully
-structured models in practice.
+Large Language Models demonstrate remarkable capabilities at the cost of high
+compute requirements. While recent research has shown that intermediate layers
+can be removed or have their order shuffled without impacting performance
+significantly, these findings have not been employed to reduce the
+computational cost of inference. We investigate several potential ways to
+reduce the depth of pre-trained LLMs without significantly affecting
+performance. Leveraging our insights, we present a novel approach that exploits
+this decoupling between layers by grouping some of them into pairs that can be
+evaluated in parallel.
+  This modification of the computational graph -- through better parallelism --
+results in an average improvement of around 1.20x on the number of tokens
+generated per second, without re-training nor fine-tuning, while retaining
+95%-99% of the original accuracy. Empirical evaluation demonstrates that this
+approach significantly improves serving efficiency while maintaining model
+performance, offering a practical improvement for large-scale LLM deployment.
 
-摘要：<paragraph>人工智慧（AI）目前在很大程度上依賴於缺乏可解釋性的黑盒機器學習模型。可解釋性人工智慧（XAI）領域致力於解決這個主要問題，這在金融、法律和健康等高風險領域至關重要。
-我們提出了一種基於範疇論定義 AI 模型及其可解釋性的方法。為此，我們採用組合模型的概念，它以形式弦圖的形式看待模型，這些弦圖捕獲了模型的抽象結構及其具體實現。這種綜合觀點包含了確定性、概率性和量子模型。我們將各種 AI 模型作為組合模型進行比較，包括線性和基於規則的模型、（遞迴）神經網路、Transformer、VAE，以及因果和 DisCoCirc 模型。
-接下來，我們根據模型的組合結構給出模型解釋的定義，展示如何分析模型的可解釋性，並使用它來澄清 XAI 中的常見主題。我們發現，讓標準的「內在可解釋」模型如此透明的原因在圖表中表現得最為清楚。這引導我們得出更一般的組合可解釋（CI）模型概念，它另外還包括因果、概念空間和 DisCoCirc 模型。
-接下來，我們展示了 CI 模型的可解釋性優勢。首先，它們的組合結構允許計算其他感興趣的量，並可能通過匹配模型的結構來促進從模型到被建模現象的推理。其次，它們允許對其行為進行圖解說明，這些說明基於影響約束、圖解手術和重寫說明。最後，我們討論了這種方法的許多未來方向，提出了如何在實踐中學習這種有意義的結構化模型的問題。</paragraph>
+摘要：大型语言模型展示了其强大的功能，但代价是较高的计算需求。虽然最近的研究表明，中间层可以被移除或重新排列其顺序，而不会显著影响性能，但这些发现尚未被用来降低推理的计算成本。我们研究了几种潜在的方法来减少预训练 LLM 的深度，而不会显著影响性能。利用我们的见解，我们提出了一种新颖的方法，该方法通过将其中一些分组为可以并行评估的成对来利用层之间的这种解耦。
+通过更好的并行性对计算图进行修改，平均而言，每秒生成的令牌数量提高了约 1.20 倍，而无需重新训练或微调，同时保留了 95%-99% 的原始准确性。经验评估表明，这种方法显著提高了服务效率，同时保持了模型性能，为大规模 LLM 部署提供了实际改进。
 
-##### **Slicing Through Bias: Explaining Performance Gaps in Medical Image Analysis using Slice Discovery Methods**
-2406.12142v2 by Vincent Olesen, Nina Weng, Aasa Feragen, Eike Petersen
+##### **Modular Training of Neural Networks aids Interpretability**
+2502.02470v2 by Satvik Golechha, Maheep Chaudhary, Joan Velja, Alessandro Abate, Nandi Schoots
 
-Machine learning models have achieved high overall accuracy in medical image
-analysis. However, performance disparities on specific patient groups pose
-challenges to their clinical utility, safety, and fairness. This can affect
-known patient groups - such as those based on sex, age, or disease subtype - as
-well as previously unknown and unlabeled groups. Furthermore, the root cause of
-such observed performance disparities is often challenging to uncover,
-hindering mitigation efforts. In this paper, to address these issues, we
-leverage Slice Discovery Methods (SDMs) to identify interpretable
-underperforming subsets of data and formulate hypotheses regarding the cause of
-observed performance disparities. We introduce a novel SDM and apply it in a
-case study on the classification of pneumothorax and atelectasis from chest
-x-rays. Our study demonstrates the effectiveness of SDMs in hypothesis
-formulation and yields an explanation of previously observed but unexplained
-performance disparities between male and female patients in widely used chest
-X-ray datasets and models. Our findings indicate shortcut learning in both
-classification tasks, through the presence of chest drains and ECG wires,
-respectively. Sex-based differences in the prevalence of these shortcut
-features appear to cause the observed classification performance gap,
-representing a previously underappreciated interaction between shortcut
-learning and model fairness analyses.
+An approach to improve neural network interpretability is via clusterability,
+i.e., splitting a model into disjoint clusters that can be studied
+independently. We define a measure for clusterability and show that pre-trained
+models form highly enmeshed clusters via spectral graph clustering. We thus
+train models to be more modular using a "clusterability loss" function that
+encourages the formation of non-interacting clusters. Using automated
+interpretability techniques, we show that our method can help train models that
+are more modular and learn different, disjoint, and smaller circuits. We
+investigate CNNs trained on MNIST and CIFAR, small transformers trained on
+modular addition, and language models. Our approach provides a promising
+direction for training neural networks that learn simpler functions and are
+easier to interpret.
 
-摘要：機器學習模型在醫學影像分析中已達到整體高準確度。然而，特定患者群體的效能差異對其臨床效用、安全性與公平性構成挑戰。這可能會影響已知的患者群體（例如基於性別、年齡或疾病亞型）以及先前未知且未標籤的群體。此外，此類觀察到的效能差異的根本原因通常難以發現，阻礙了緩解措施。在本文中，為了解決這些問題，我們利用切片發現方法 (SDM) 來識別可解釋的資料效能不佳子集，並針對觀察到的效能差異原因制定假設。我們引入一種新的 SDM，並在胸部 X 光片中肺炎和肺不張分類的案例研究中應用它。我們的研究證明了 SDM 在假設制定中的有效性，並對廣泛使用的胸部 X 光片資料集和模型中先前觀察到但無法解釋的男性和女性患者之間的效能差異提供了解釋。我們的發現表明，在分類任務中，透過胸腔引流管和心電圖導線的存在，存在捷徑學習。這些捷徑特徵的盛行率存在基於性別的差異，似乎會導致觀察到的分類效能差距，這代表捷徑學習和模型公平性分析之間先前未受到重視的交互作用。
+摘要：一種改善神經網路可解釋性的方法是透過群集性，
+也就是將模型分割成可獨立研究的不相交群集。我們定義一個群集性的度量，並顯示預訓練的
+模型透過光譜圖形群集形成高度糾纏的群集。因此，我們使用「群集性損失」函數訓練模型，使其更具模組化，
+這鼓勵形成非交互群集。使用自動化可解釋性技術，我們顯示我們的模型可以幫助訓練更具模組化的模型，並學習不同、不相交且較小的電路。我們
+研究了在 MNIST 和 CIFAR 上訓練的 CNN，在模組化加法上訓練的小型Transformer，以及語言模型。我們的做法為訓練學習更簡單函數且更容易解釋的神經網路提供了有希望的方向。
 
-##### **Unlocking the Potential of Metaverse in Innovative and Immersive Digital Health**
-2406.07114v2 by Fatemeh Ebrahimzadeh, Ramin Safa
+##### **Premise-Augmented Reasoning Chains Improve Error Identification in Math reasoning with LLMs**
+2502.02362v3 by Sagnik Mukherjee, Abhinav Chinta, Takyoung Kim, Tarun Anoop Sharma, Dilek Hakkani-Tür
 
-The concept of Metaverse has attracted a lot of attention in various fields
-and one of its important applications is health and treatment. The Metaverse
-has enormous potential to transform healthcare by changing patient care,
-medical education, and the way teaching/learning and research are done. The
-purpose of this research is to provide an introduction to the basic concepts
-and fundamental technologies of the Metaverse. This paper examines the pros and
-cons of the Metaverse in healthcare context and analyzes its potential from the
-technology and AI perspective. In particular, the role of machine learning
-methods is discussed; We will explain how machine learning algorithms can be
-applied to the Metaverse generated data to gain better insights in healthcare
-applications. Additionally, we examine the future visions of the Metaverse in
-health delivery, by examining emerging technologies such as blockchain and also
-addressing privacy concerns. The findings of this study contribute to a deeper
-understanding of the applications of Metaverse in healthcare and its potential
-to revolutionize the delivery of medical services.
+Chain-of-Thought (CoT) prompting enhances mathematical reasoning in large
+language models (LLMs) by enabling detailed step-by-step solutions. However,
+due to the verbosity of LLMs, the resulting reasoning chains can be long,
+making it harder to verify the reasoning steps and trace issues resulting from
+dependencies between the steps that may be farther away in the sequence of
+steps. Importantly, mathematical reasoning allows each step to be derived from
+a small set of premises, which are a subset of the preceding steps in the
+reasoning chain. In this paper, we present a framework that identifies the
+premises for each step, to improve the evaluation of reasoning. We restructure
+conventional linear reasoning chains into Premise Augmented Reasoning Chains
+(PARC) by introducing premise links, resulting in a directed acyclic graph
+where the nodes are the steps and the edges are the premise links. Through
+experiments with a PARC-based dataset that we built, namely PERL (Premises and
+ERrors identification in LLMs), we demonstrate that LLMs can reliably identify
+premises within complex reasoning chains. In particular, even open-source LLMs
+achieve 90% recall in premise identification. We also show that PARC helps to
+identify errors in reasoning chains more reliably. The accuracy of error
+identification improves by 6% to 16% absolute when step-by-step verification is
+carried out in PARC under the premises. Our findings highlight the utility of
+premise-centric representations in addressing complex problem-solving tasks and
+open new avenues for improving the reliability of LLM-based reasoning
+evaluations.
 
-摘要：元宇宙的概念在各個領域都備受關注，其重要應用之一便是醫療保健。元宇宙有巨大的潛力透過改變病患照護、醫學教育，以及教學/學習和研究的方式來轉型醫療保健。本研究的目的是提供元宇宙基本概念和基礎技術的介紹。本文探討了元宇宙在醫療保健背景下的優缺點，並從技術和 AI 的角度分析其潛力。特別是，討論了機器學習方法的角色；我們將說明如何將機器學習演算法應用於元宇宙產生的資料，以獲得醫療保健應用方面的更佳見解。此外，我們透過探討區塊鏈等新興技術，並解決隱私問題，來探討元宇宙在醫療保健方面的未來願景。本研究的發現有助於更深入地了解元宇宙在醫療保健中的應用，以及其在醫療服務提供方面發揮革命性變革的潛力。
+摘要：<paragraph>思考鏈（CoT）提示透過提供詳細的逐步解法，增強大型語言模型（LLM）的數學推理能力。然而，由於 LLM 的冗長，產生的推理鏈可能很長，這使得驗證推理步驟和追蹤由步驟之間相依關係所產生的問題變得更加困難，而這些步驟可能在步驟順序中相距較遠。重要的是，數學推理允許每個步驟從一組小的前提中推導出來，這些前提是推理鏈中前一個步驟的子集。在本文中，我們提出了一個框架，用於識別每個步驟的前提，以改進推理評估。我們透過引入前提連結，將傳統的線性推理鏈重組為前提擴充推理鏈（PARC），產生一個有向無環圖，其中節點是步驟，而邊緣是前提連結。透過我們建立的基於 PARC 的資料集（即 PERL（LLM 中的前提和錯誤識別））進行的實驗，我們證明 LLM 能夠在複雜的推理鏈中可靠地識別前提。特別是，即使是開源 LLM 在前提識別中也能達到 90% 的召回率。我們還表明，PARC 有助於更可靠地識別推理鏈中的錯誤。在前提下於 PARC 中執行逐步驗證時，錯誤識別的準確度提高了 6% 到 16%。我們的研究結果突顯了以前提為中心的表示在解決複雜問題解決任務中的效用，並為改進基於 LLM 的推理評估的可靠性開闢了新途徑。</paragraph>
 
-##### **AI-Driven Predictive Analytics Approach for Early Prognosis of Chronic Kidney Disease Using Ensemble Learning and Explainable AI**
-2406.06728v2 by K M Tawsik Jawad, Anusha Verma, Fathi Amsaad, Lamia Ashraf
+##### **AdaptBot: Combining LLM with Knowledge Graphs and Human Input for Generic-to-Specific Task Decomposition and Knowledge Refinement**
+2502.02067v1 by Shivam Singh, Karthik Swaminathan, Nabanita Dash, Ramandeep Singh, Snehasis Banerjee, Mohan Sridharan, Madhava Krishna
 
-Chronic Kidney Disease (CKD) is one of the widespread Chronic diseases with
-no known ultimo cure and high morbidity. Research demonstrates that progressive
-Chronic Kidney Disease (CKD) is a heterogeneous disorder that significantly
-impacts kidney structure and functions, eventually leading to kidney failure.
-With the progression of time, chronic kidney disease has moved from a
-life-threatening disease affecting few people to a common disorder of varying
-severity. The goal of this research is to visualize dominating features,
-feature scores, and values exhibited for early prognosis and detection of CKD
-using ensemble learning and explainable AI. For that, an AI-driven predictive
-analytics approach is proposed to aid clinical practitioners in prescribing
-lifestyle modifications for individual patients to reduce the rate of
-progression of this disease. Our dataset is collected on body vitals from
-individuals with CKD and healthy subjects to develop our proposed AI-driven
-solution accurately. In this regard, blood and urine test results are provided,
-and ensemble tree-based machine-learning models are applied to predict unseen
-cases of CKD. Our research findings are validated after lengthy consultations
-with nephrologists. Our experiments and interpretation results are compared
-with existing explainable AI applications in various healthcare domains,
-including CKD. The comparison shows that our developed AI models, particularly
-the Random Forest model, have identified more features as significant
-contributors than XgBoost. Interpretability (I), which measures the ratio of
-important to masked features, indicates that our XgBoost model achieved a
-higher score, specifically a Fidelity of 98\%, in this metric and naturally in
-the FII index compared to competing models.
+Embodied agents assisting humans are often asked to complete a new task in a
+new scenario. An agent preparing a particular dish in the kitchen based on a
+known recipe may be asked to prepare a new dish or to perform cleaning tasks in
+the storeroom. There may not be sufficient resources, e.g., time or labeled
+examples, to train the agent for these new situations. Large Language Models
+(LLMs) trained on considerable knowledge across many domains are able to
+predict a sequence of abstract actions for such new tasks and scenarios,
+although it may not be possible for the agent to execute this action sequence
+due to task-, agent-, or domain-specific constraints. Our framework addresses
+these challenges by leveraging the generic predictions provided by LLM and the
+prior domain-specific knowledge encoded in a Knowledge Graph (KG), enabling an
+agent to quickly adapt to new tasks and scenarios. The robot also solicits and
+uses human input as needed to refine its existing knowledge. Based on
+experimental evaluation over cooking and cleaning tasks in simulation domains,
+we demonstrate that the interplay between LLM, KG, and human input leads to
+substantial performance gains compared with just using the LLM output.
 
-摘要：慢性腎臟病 (CKD) 是一種廣泛的慢性疾病，目前尚未找到最終的治療方法，且發病率很高。研究表明，進行性慢性腎臟病 (CKD) 是一種異質性疾病，會顯著影響腎臟結構和功能，最終導致腎衰竭。隨著時間的推移，慢性腎臟病已從影響少數人的致命疾病演變成一種嚴重程度不一的常見疾病。本研究的目標是使用整體學習和可解釋的 AI 來視覺化支配性特徵、特徵分數和值，以進行 CKD 的早期預後和檢測。為此，提出了一種 AI 驅動的預測分析方法，以幫助臨床醫生為個別患者開具生活方式的修改建議，以降低此疾病的進展速度。我們的數據集是從 CKD 患者和健康受試者的身體生命徵象中收集的，以準確開發我們提出的 AI 驅動的解決方案。在這方面，提供了血液和尿液檢測結果，並應用基於集成樹的機器學習模型來預測未見的 CKD 病例。我們的研究結果在與腎臟科醫師進行長時間諮詢後得到驗證。我們的實驗和解釋結果與各種醫療保健領域中現有的可解釋 AI 應用進行了比較，包括 CKD。比較表明，我們開發的 AI 模型，特別是隨機森林模型，已經確定了比 XgBoost 更多的特徵作為顯著的貢獻者。可解釋性 (I) 衡量重要特徵與被遮蔽特徵的比率，表明我們的 XgBoost 模型在此指標中取得了更高的分數，特別是 98% 的保真度，並且在 FII 指數中自然高於競爭模型。
+摘要：具身代理协助人类时，通常需要在新的情境中完成新的任务。基于已知食谱在厨房准备特定菜肴的代理可能会被要求准备新菜肴或在储藏室执行清洁任务。可能没有足够资源（例如时间或标记的示例）来训练代理以应对这些新情况。在许多领域接受大量知识训练的大型语言模型 (LLM) 能够预测此类新任务和情境的抽象动作序列，尽管代理可能无法执行此动作序列，因为任务、代理或特定于域的约束。我们的框架通过利用 LLM 提供的通用预测和知识图 (KG) 中编码的先前特定于域的知识来应对这些挑战，使代理能够快速适应新任务和情境。该机器人还会根据需要征求并使用人类输入来完善其现有知识。基于在模拟域中对烹饪和清洁任务的实验评估，我们证明了 LLM、KG 和人类输入之间的相互作用与仅使用 LLM 输出相比带来了巨大的性能提升。
 
-##### **Explainable AI for Mental Disorder Detection via Social Media: A survey and outlook**
-2406.05984v1 by Yusif Ibrahimov, Tarique Anwar, Tommy Yuan
+##### **On Bob Dylan: A Computational Perspective**
+2502.01772v1 by Prashant Garg
 
-Mental health constitutes a complex and pervasive global challenge, affecting
-millions of lives and often leading to severe consequences. In this paper, we
-conduct a thorough survey to explore the intersection of data science,
-artificial intelligence, and mental healthcare, focusing on the recent
-developments of mental disorder detection through online social media (OSM). A
-significant portion of the population actively engages in OSM platforms,
-creating a vast repository of personal data that holds immense potential for
-mental health analytics. The paper navigates through traditional diagnostic
-methods, state-of-the-art data- and AI-driven research studies, and the
-emergence of explainable AI (XAI) models for mental healthcare. We review
-state-of-the-art machine learning methods, particularly those based on modern
-deep learning, while emphasising the need for explainability in healthcare AI
-models. The experimental design section provides insights into prevalent
-practices, including available datasets and evaluation approaches. We also
-identify key issues and challenges in the field and propose promising future
-research directions. As mental health decisions demand transparency,
-interpretability, and ethical considerations, this paper contributes to the
-ongoing discourse on advancing XAI in mental healthcare through social media.
-The comprehensive overview presented here aims to guide researchers,
-practitioners, and policymakers in developing the area of mental disorder
-detection.
+Cass Sunstein's essay 'On Bob Dylan' describes Dylan's 'dishabituating' style
+-- a constant refusal to conform to expectation and a penchant for reinventing
+his musical and lyrical identity. In this paper, I extend Sunstein's
+observations through a large-scale computational analysis of Dylan's lyrics
+from 1962 to 2012. Using o3-mini-high (a large language model), I extract
+concept-to-concept relationships from the lyrics and construct directed
+knowledge graphs that capture Dylan's thematic structure. I then quantify
+shifts in sentiment, metaphorical expression, thematic diversity, and network
+complexity over time. The results indicate that Dylan's lyrics increasingly
+rely on metaphor, display an evolving sentiment profile, and exhibit heightened
+dishabituation -- measured here as a growing variance in the network centrality
+of key concepts. I also find that references to movement, protest, and mythic
+imagery fluctuate in ways that align with well-known phases of Dylan's career,
+reflecting the dynamic and unpredictable quality of his art. These findings not
+only deepen our empirical understanding of Sunstein's thesis but also introduce
+a novel computational method for analyzing an artist's evolution-offering
+broader applicability to the study of cultural and creative change.
 
-摘要：心理健康構成了一項複雜且普遍的全球挑戰，影響了數百萬人的生活，並經常導致嚴重的後果。在本文中，我們進行了一項徹底的調查，以探索數據科學、人工智慧和心理保健的交集，重點關注通過線上社交媒體 (OSM) 進行心理疾病檢測的最新發展。很大一部分人口積極參與 OSM 平台，創造了一個龐大的人員資料庫，對心理健康分析具有巨大的潛力。本文探討了傳統的診斷方法、最先進的資料和 AI 驅動的研究，以及心理保健中可解釋 AI (XAI) 模型的出現。我們回顧了最先進的機器學習方法，特別是那些基於現代深度學習的方法，同時強調了醫療保健 AI 模型中可解釋性的必要性。實驗設計部分提供了對普遍做法的見解，包括可用的資料集和評估方法。我們還找出該領域的主要問題和挑戰，並提出了有希望的未來研究方向。由於心理健康決策需要透明度、可解釋性和道德考量，本文有助於推進心理保健中透過社交媒體推進 XAI 的持續討論。這裡提出的全面概述旨在引導研究人員、從業人員和政策制定者發展心理疾病檢測領域。
+摘要：卡斯·桑斯坦的論文「論鮑伯·迪倫」描述了迪倫「去習慣化」的風格
+-- 這種風格不斷拒絕符合預期，並熱衷於重新塑造他的音樂和歌詞認同。在本文中，我透過對迪倫 1962 年至 2012 年歌詞進行大規模的運算分析，來延伸桑斯坦的觀察。使用 o3-mini-high（一個大型語言模型），我從歌詞中提取概念對概念的關係，並建構有向知識圖，以捕捉迪倫的主題結構。然後，我量化情緒、隱喻表達、主題多樣性和網路複雜性隨時間的變化。結果顯示，迪倫的歌詞越來越依賴隱喻，展現出不斷演化的情緒輪廓，並表現出高度的去習慣化 -- 在這裡測量為關鍵概念的網路中心性的變異增加。我也發現，對運動、抗議和神話意象的引用，會以與迪倫職業生涯中眾所周知階段一致的方式波動，反映了他藝術的動態和不可預測的品質。這些發現不僅加深了我們對桑斯坦論文的經驗理解，也引入了分析藝術家演變的新穎運算方法，為文化和創造性變化的研究提供了更廣泛的適用性。
 
-##### **Methodology and Real-World Applications of Dynamic Uncertain Causality Graph for Clinical Diagnosis with Explainability and Invariance**
-2406.05746v1 by Zhan Zhang, Qin Zhang, Yang Jiao, Lin Lu, Lin Ma, Aihua Liu, Xiao Liu, Juan Zhao, Yajun Xue, Bing Wei, Mingxia Zhang, Ru Gao, Hong Zhao, Jie Lu, Fan Li, Yang Zhang, Yiming Wang, Lei Zhang, Fengwei Tian, Jie Hu, Xin Gou
+##### **VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos**
+2502.01549v1 by Xubin Ren, Lingrui Xu, Long Xia, Shuaiqiang Wang, Dawei Yin, Chao Huang
 
-AI-aided clinical diagnosis is desired in medical care. Existing deep
-learning models lack explainability and mainly focus on image analysis. The
-recently developed Dynamic Uncertain Causality Graph (DUCG) approach is
-causality-driven, explainable, and invariant across different application
-scenarios, without problems of data collection, labeling, fitting, privacy,
-bias, generalization, high cost and high energy consumption. Through close
-collaboration between clinical experts and DUCG technicians, 46 DUCG models
-covering 54 chief complaints were constructed. Over 1,000 diseases can be
-diagnosed without triage. Before being applied in real-world, the 46 DUCG
-models were retrospectively verified by third-party hospitals. The verified
-diagnostic precisions were no less than 95%, in which the diagnostic precision
-for every disease including uncommon ones was no less than 80%. After
-verifications, the 46 DUCG models were applied in the real-world in China. Over
-one million real diagnosis cases have been performed, with only 17 incorrect
-diagnoses identified. Due to DUCG's transparency, the mistakes causing the
-incorrect diagnoses were found and corrected. The diagnostic abilities of the
-clinicians who applied DUCG frequently were improved significantly. Following
-the introduction to the earlier presented DUCG methodology, the recommendation
-algorithm for potential medical checks is presented and the key idea of DUCG is
-extracted.
+Retrieval-Augmented Generation (RAG) has demonstrated remarkable success in
+enhancing Large Language Models (LLMs) through external knowledge integration,
+yet its application has primarily focused on textual content, leaving the rich
+domain of multi-modal video knowledge predominantly unexplored. This paper
+introduces VideoRAG, the first retrieval-augmented generation framework
+specifically designed for processing and understanding extremely long-context
+videos. Our core innovation lies in its dual-channel architecture that
+seamlessly integrates (i) graph-based textual knowledge grounding for capturing
+cross-video semantic relationships, and (ii) multi-modal context encoding for
+efficiently preserving visual features. This novel design empowers VideoRAG to
+process unlimited-length videos by constructing precise knowledge graphs that
+span multiple videos while maintaining semantic dependencies through
+specialized multi-modal retrieval paradigms. Through comprehensive empirical
+evaluation on our proposed LongerVideos benchmark-comprising over 160 videos
+totaling 134+ hours across lecture, documentary, and entertainment
+categories-VideoRAG demonstrates substantial performance compared to existing
+RAG alternatives and long video understanding methods. The source code of
+VideoRAG implementation and the benchmark dataset are openly available at:
+https://github.com/HKUDS/VideoRAG.
+
+摘要：檢索增強生成 (RAG) 已證明在透過外部知識整合增強大型語言模型 (LLM) 方面取得顯著成功，但其應用主要集中在文字內容上，而豐富的多模態影片知識領域則鮮少被探索。本文介紹 VideoRAG，這是第一個檢索增強生成架構，專門設計用於處理和理解極長語境的影片。我們的核心創新在於其雙通道架構，它無縫整合 (i) 基於圖形文字知識基礎，用於擷取跨影片語義關係，以及 (ii) 多模態語境編碼，用於有效保留視覺特徵。這個新穎的設計讓 VideoRAG 能夠透過建構跨越多個影片的精確知識圖譜來處理長度不限的影片，同時透過專門的多模態檢索範例來維持語義依賴性。透過我們提出的 LongerVideos 基準的全面經驗評估，該基準包含超過 160 部影片，總時數超過 134 小時，涵蓋演講、紀錄片和娛樂類別，VideoRAG 與現有的 RAG 替代方案和長影片理解方法相比，展現出顯著的效能。VideoRAG 實作的原始碼和基準資料集已公開於：https://github.com/HKUDS/VideoRAG。
+
+##### **Transformers trained on proteins can learn to attend to Euclidean distance**
+2502.01533v1 by Isaac Ellmen, Constantin Schneider, Matthew I. J. Raybould, Charlotte M. Deane
+
+While conventional Transformers generally operate on sequence data, they can
+be used in conjunction with structure models, typically SE(3)-invariant or
+equivariant graph neural networks (GNNs), for 3D applications such as protein
+structure modelling. These hybrids typically involve either (1)
+preprocessing/tokenizing structural features as input for Transformers or (2)
+taking Transformer embeddings and processing them within a structural
+representation. However, there is evidence that Transformers can learn to
+process structural information on their own, such as the AlphaFold3 structural
+diffusion model. In this work we show that Transformers can function
+independently as structure models when passed linear embeddings of coordinates.
+We first provide a theoretical explanation for how Transformers can learn to
+filter attention as a 3D Gaussian with learned variance. We then validate this
+theory using both simulated 3D points and in the context of masked token
+prediction for proteins. Finally, we show that pre-training protein Transformer
+encoders with structure improves performance on a downstream task, yielding
+better performance than custom structural models. Together, this work provides
+a basis for using standard Transformers as hybrid structure-language models.
 
-摘要：<paragraph>醫療照護中需要 AI 輔助的臨床診斷。現有的深度學習模型缺乏可解釋性，並且主要專注於影像分析。最近開發的動態不確定因果關係圖 (DUCG) 方法是因果驅動的、可解釋的，並且在不同的應用場景中是不變的，沒有資料收集、標記、擬合、隱私、偏見、概化、高成本和高能耗的問題。通過臨床專家和 DUCG 技術人員之間的密切合作，構建了涵蓋 54 個主訴的 46 個 DUCG 模型。可以在沒有分流的情況下診斷出 1,000 多種疾病。在應用於實際世界之前，46 個 DUCG 模型已由第三方醫院回溯性驗證。驗證的診斷精度不低於 95%，其中包括罕見疾病在內的每種疾病的診斷精度不低於 80%。驗證後，46 個 DUCG 模型已在中國實際應用。已經執行了超過一百萬個真實診斷案例，僅發現 17 個不正確的診斷。由於 DUCG 的透明性，發現並糾正了導致不正確診斷的錯誤。頻繁應用 DUCG 的臨床醫生的診斷能力得到了顯著提高。在介紹了前面提出的 DUCG 方法論之後，提出了潛在健康檢查的推薦演算法，並提取了 DUCG 的關鍵思想。</paragraph>
+摘要：雖然傳統的 Transformer 通常處理序列資料，但它們可用於結構模型，通常是 SE(3) 不變式或等變式圖神經網路 (GNN)，用於蛋白質結構建模等 3D 應用。這些混合模型通常包含 (1) 將結構特徵預處理/標記化為 Transformer 的輸入或 (2) 取用 Transformer 嵌入並在結構表示中處理它們。然而，有證據表明 Transformer 可以自行學習處理結構資訊，例如 AlphaFold3 結構擴散模型。在這項工作中，我們展示了 Transformer 在傳遞座標的線性嵌入時，可以獨立作為結構模型運作。我們首先提供了 Transformer 如何學習將注意力濾波為具有學習變異的 3D 高斯的理論解釋。然後我們使用模擬 3D 點和在蛋白質遮罩標記預測的背景下驗證此理論。最後，我們展示了使用結構預訓練蛋白質 Transformer 編碼器會改善下游任務的效能，產生比自訂結構模型更好的效能。綜合來說，這項工作提供了使用標準 Transformer 作為混合結構語言模型的基礎。
 
-##### **Advancing Histopathology-Based Breast Cancer Diagnosis: Insights into Multi-Modality and Explainability**
-2406.12897v1 by Faseela Abdullakutty, Younes Akbari, Somaya Al-Maadeed, Ahmed Bouridane, Rifat Hamoudi
+##### **Common Foundations for SHACL, ShEx, and PG-Schema**
+2502.01295v1 by S. Ahmetaj, I. Boneva, J. Hidders, K. Hose, M. Jakubowski, J. E. Labra-Gayo, W. Martens, F. Mogavero, F. Murlak, C. Okulmus, A. Polleres, O. Savkovic, M. Simkus, D. Tomaszuk
 
-It is imperative that breast cancer is detected precisely and timely to
-improve patient outcomes. Diagnostic methodologies have traditionally relied on
-unimodal approaches; however, medical data analytics is integrating diverse
-data sources beyond conventional imaging. Using multi-modal techniques,
-integrating both image and non-image data, marks a transformative advancement
-in breast cancer diagnosis. The purpose of this review is to explore the
-burgeoning field of multimodal techniques, particularly the fusion of
-histopathology images with non-image data. Further, Explainable AI (XAI) will
-be used to elucidate the decision-making processes of complex algorithms,
-emphasizing the necessity of explainability in diagnostic processes. This
-review utilizes multi-modal data and emphasizes explainability to enhance
-diagnostic accuracy, clinician confidence, and patient engagement, ultimately
-fostering more personalized treatment strategies for breast cancer, while also
-identifying research gaps in multi-modality and explainability, guiding future
-studies, and contributing to the strategic direction of the field.
+Graphs have emerged as an important foundation for a variety of applications,
+including capturing and reasoning over factual knowledge, semantic data
+integration, social networks, and providing factual knowledge for machine
+learning algorithms. To formalise certain properties of the data and to ensure
+data quality, there is a need to describe the schema of such graphs. Because of
+the breadth of applications and availability of different data models, such as
+RDF and property graphs, both the Semantic Web and the database community have
+independently developed graph schema languages: SHACL, ShEx, and PG-Schema.
+Each language has its unique approach to defining constraints and validating
+graph data, leaving potential users in the dark about their commonalities and
+differences. In this paper, we provide formal, concise definitions of the core
+components of each of these schema languages. We employ a uniform framework to
+facilitate a comprehensive comparison between the languages and identify a
+common set of functionalities, shedding light on both overlapping and
+distinctive features of the three languages.
 
-摘要：精確且及時地偵測乳癌對於改善患者預後至關重要。診斷方法傳統上依賴於單一模式方法；然而，醫療資料分析正在整合超越傳統影像的各種資料來源。使用整合影像和非影像資料的多模式技術，標誌著乳癌診斷的變革性進展。本篇綜述的目的是探討多模式技術的新興領域，特別是將組織病理學影像與非影像資料融合。此外，可解釋人工智慧 (XAI) 將用於闡明複雜演算法的決策過程，強調診斷過程中可解釋性的必要性。本綜述利用多模式資料並強調可解釋性，以提高診斷準確性、臨床醫師的信心和患者參與度，最終促進乳癌更個人化的治療策略，同時也找出多模式和可解釋性的研究差距，引導未來的研究，並為該領域的策略方向做出貢獻。
+摘要：圖表已成為各種應用的重要基礎，包括擷取和推理事實知識、語義資料整合、社群網路，以及為機器學習演算法提供事實知識。為了形式化資料的特定屬性並確保資料品質，有必要描述此類圖表的架構。由於應用範圍廣泛且有不同的資料模型可用，例如 RDF 和屬性圖表，因此語義網路和資料庫社群已獨立開發圖表架構語言：SHACL、ShEx 和 PG-Schema。每種語言都有其定義約束和驗證圖表資料的獨特方法，讓潛在使用者不清楚它們的共性和差異。在本文中，我們提供這些架構語言中每個核心元件的正式簡潔定義。我們採用統一的框架來促進語言之間的全面比較，並找出功能的共同集合，說明這三種語言的重疊和獨特功能。
 
-##### **Using Explainable AI for EEG-based Reduced Montage Neonatal Seizure Detection**
-2406.16908v3 by Dinuka Sandun Udayantha, Kavindu Weerasinghe, Nima Wickramasinghe, Akila Abeyratne, Kithmin Wickremasinghe, Jithangi Wanigasinghe, Anjula De Silva, Chamira U. S. Edussooriya
+##### **GFM-RAG: Graph Foundation Model for Retrieval Augmented Generation**
+2502.01113v1 by Linhao Luo, Zicheng Zhao, Gholamreza Haffari, Dinh Phung, Chen Gong, Shirui Pan
 
-The neonatal period is the most vulnerable time for the development of
-seizures. Seizures in the immature brain lead to detrimental consequences,
-therefore require early diagnosis. The gold-standard for neonatal seizure
-detection currently relies on continuous video-EEG monitoring; which involves
-recording multi-channel electroencephalogram (EEG) alongside real-time video
-monitoring within a neonatal intensive care unit (NICU). However, video-EEG
-monitoring technology requires clinical expertise and is often limited to
-technologically advanced and resourceful settings. Cost-effective new
-techniques could help the medical fraternity make an accurate diagnosis and
-advocate treatment without delay. In this work, a novel explainable deep
-learning model to automate the neonatal seizure detection process with a
-reduced EEG montage is proposed, which employs convolutional nets, graph
-attention layers, and fully connected layers. Beyond its ability to detect
-seizures in real-time with a reduced montage, this model offers the unique
-advantage of real-time interpretability. By evaluating the performance on the
-Zenodo dataset with 10-fold cross-validation, the presented model achieves an
-absolute improvement of 8.31% and 42.86% in area under curve (AUC) and recall,
-respectively.
+Retrieval-augmented generation (RAG) has proven effective in integrating
+knowledge into large language models (LLMs). However, conventional RAGs
+struggle to capture complex relationships between pieces of knowledge, limiting
+their performance in intricate reasoning that requires integrating knowledge
+from multiple sources. Recently, graph-enhanced retrieval augmented generation
+(GraphRAG) builds graph structure to explicitly model these relationships,
+enabling more effective and efficient retrievers. Nevertheless, its performance
+is still hindered by the noise and incompleteness within the graph structure.
+To address this, we introduce GFM-RAG, a novel graph foundation model (GFM) for
+retrieval augmented generation. GFM-RAG is powered by an innovative graph
+neural network that reasons over graph structure to capture complex
+query-knowledge relationships. The GFM with 8M parameters undergoes a two-stage
+training process on large-scale datasets, comprising 60 knowledge graphs with
+over 14M triples and 700k documents. This results in impressive performance and
+generalizability for GFM-RAG, making it the first graph foundation model
+applicable to unseen datasets for retrieval without any fine-tuning required.
+Extensive experiments on three multi-hop QA datasets and seven domain-specific
+RAG datasets demonstrate that GFM-RAG achieves state-of-the-art performance
+while maintaining efficiency and alignment with neural scaling laws,
+highlighting its potential for further improvement.
 
-摘要：新生兒期是大腦發育最脆弱的時期，容易出現癲癇發作。大腦發育不成熟時出現癲癇發作會造成不良後果，因此需要及早診斷。目前新生兒癲癇發作的黃金標準依賴於連續的視訊腦電圖 (EEG) 監測；其中包括在新生兒加護病房 (NICU) 內同時進行多頻道腦電圖 (EEG) 記錄和即時視訊監控。然而，視訊腦電圖監控技術需要臨床專業知識，而且通常僅限於技術先進且資源豐富的環境。具成本效益的新技術可以幫助醫療界準確診斷並立即提倡治療。在這項工作中，提出了一個新穎的可解釋深度學習模型，以自動化新生兒癲癇發作偵測過程，並採用減少的腦電圖裝置，其中採用了卷積神經網路、圖形注意力層和全連接層。除了能夠使用減少的裝置即時偵測癲癇發作外，此模型還提供了即時可解釋性的獨特優勢。透過在 Zenodo 資料集上使用 10 倍交叉驗證評估效能，所提出的模型在曲線下面積 (AUC) 和召回率方面分別達到了 8.31% 和 42.86% 的絕對改善。
+摘要：檢索增強生成 (RAG) 已證明在整合知識到大語言模型 (LLM) 中有效。然而，傳統的 RAG 難以捕捉知識片段之間的複雜關係，限制了它們在需要整合來自多個來源的知識的複雜推理中的表現。最近，圖表增強檢索增強生成 (GraphRAG) 建立圖表結構來明確建模這些關係，從而實現更有效率的檢索器。儘管如此，其效能仍受到圖表結構中雜訊和不完整性的阻礙。為了解決這個問題，我們引入了 GFM-RAG，一種用於檢索增強生成的全新圖表基礎模型 (GFM)。GFM-RAG 由一個創新的圖神經網路驅動，該網路在圖表結構上進行推理以捕捉複雜的查詢知識關係。具有 8M 參數的 GFM 在大型資料集上進行兩階段訓練流程，包括 60 個包含超過 14M 個三元組和 700k 個文件的文件。這為 GFM-RAG 帶來了令人印象深刻的效能和通用性，使其成為第一個適用於未見過資料集的圖表基礎模型，而無需任何微調。在三個多跳問答資料集和七個特定領域 RAG 資料集上的廣泛實驗表明，GFM-RAG 達到了最先進的效能，同時保持了效率並與神經擴充定律保持一致，突顯了其進一步改進的潛力。
 
-##### **Breast Cancer Diagnosis: A Comprehensive Exploration of Explainable Artificial Intelligence (XAI) Techniques**
-2406.00532v1 by Samita Bai, Sidra Nasir, Rizwan Ahmed Khan, Sheeraz Arif, Alexandre Meyer, Hubert Konik
+##### **Knowledge Synthesis of Photosynthesis Research Using a Large Language Model**
+2502.01059v1 by Seungri Yoon, Woosang Jeon, Sanghyeok Choi, Taehyeong Kim, Tae In Ahn
 
-Breast cancer (BC) stands as one of the most common malignancies affecting
-women worldwide, necessitating advancements in diagnostic methodologies for
-better clinical outcomes. This article provides a comprehensive exploration of
-the application of Explainable Artificial Intelligence (XAI) techniques in the
-detection and diagnosis of breast cancer. As Artificial Intelligence (AI)
-technologies continue to permeate the healthcare sector, particularly in
-oncology, the need for transparent and interpretable models becomes imperative
-to enhance clinical decision-making and patient care. This review discusses the
-integration of various XAI approaches, such as SHAP, LIME, Grad-CAM, and
-others, with machine learning and deep learning models utilized in breast
-cancer detection and classification. By investigating the modalities of breast
-cancer datasets, including mammograms, ultrasounds and their processing with
-AI, the paper highlights how XAI can lead to more accurate diagnoses and
-personalized treatment plans. It also examines the challenges in implementing
-these techniques and the importance of developing standardized metrics for
-evaluating XAI's effectiveness in clinical settings. Through detailed analysis
-and discussion, this article aims to highlight the potential of XAI in bridging
-the gap between complex AI models and practical healthcare applications,
-thereby fostering trust and understanding among medical professionals and
-improving patient outcomes.
+The development of biological data analysis tools and large language models
+(LLMs) has opened up new possibilities for utilizing AI in plant science
+research, with the potential to contribute significantly to knowledge
+integration and research gap identification. Nonetheless, current LLMs struggle
+to handle complex biological data and theoretical models in photosynthesis
+research and often fail to provide accurate scientific contexts. Therefore,
+this study proposed a photosynthesis research assistant (PRAG) based on
+OpenAI's GPT-4o with retrieval-augmented generation (RAG) techniques and prompt
+optimization. Vector databases and an automated feedback loop were used in the
+prompt optimization process to enhance the accuracy and relevance of the
+responses to photosynthesis-related queries. PRAG showed an average improvement
+of 8.7% across five metrics related to scientific writing, with a 25.4%
+increase in source transparency. Additionally, its scientific depth and domain
+coverage were comparable to those of photosynthesis research papers. A
+knowledge graph was used to structure PRAG's responses with papers within and
+outside the database, which allowed PRAG to match key entities with 63% and
+39.5% of the database and test papers, respectively. PRAG can be applied for
+photosynthesis research and broader plant science domains, paving the way for
+more in-depth data analysis and predictive capabilities.
 
-摘要：乳癌 (BC) 是影響全球女性最常見的惡性腫瘤之一，因此需要進步的診斷方法，以改善臨床結果。本文全面探討了可解釋人工智慧 (XAI) 技術在乳癌偵測和診斷中的應用。隨著人工智慧 (AI) 技術持續滲透醫療保健領域，特別是在腫瘤學中，透明且可解釋的模型需求變得勢在必行，以增強臨床決策制定和患者照護。此篇評論探討了各種 XAI 方法的整合，例如 SHAP、LIME、Grad-CAM 等，以及用於乳癌偵測和分類的機器學習和深度學習模型。透過探討乳癌資料集的模式，包括乳房攝影、超音波及其在 AI 中的處理，本文重點說明 XAI 如何能導致更準確的診斷和個人化治療計畫。它也探討了實施這些技術的挑戰，以及制定標準化評量指標以評估 XAI 在臨床環境中的有效性的重要性。透過詳細的分析和討論，本文旨在強調 XAI 在縮小複雜 AI 模型與實務醫療保健應用之間差距的潛力，進而促進醫療專業人員之間的信任與理解，並改善患者的結果。
+摘要：生物資料分析工具和大型語言模型 (LLM) 的發展，為利用人工智慧於植物科學研究開啟了新的可能性，並有潛力對知識整合和研究差距的識別做出重大貢獻。儘管如此，目前的 LLM 在處理光合作用研究中的複雜生物資料和理論模型時仍有困難，而且常常無法提供準確的科學背景。因此，本研究提出了一個基於 OpenAI 的 GPT-4o、具備檢索增強生成 (RAG) 技術和提示最佳化的光合作用研究助理 (PRAG)。在提示最佳化過程中，使用了向量資料庫和自動回饋迴路，以增強對與光合作用相關查詢的回應的準確性和相關性。PRAG 在與科學寫作相關的五項指標中顯示出平均改善了 8.7%，來源透明度增加了 25.4%。此外，其科學深度和領域涵蓋範圍與光合作用研究論文相當。知識圖譜用於建構 PRAG 的回應，其中包含資料庫內外論文，這使得 PRAG 能夠分別與資料庫和測試論文中的 63% 和 39.5% 的關鍵實體相匹配。PRAG 可應用於光合作用研究和更廣泛的植物科學領域，為更深入的資料分析和預測能力鋪路。
 
-##### **Unveiling Hidden Factors: Explainable AI for Feature Boosting in Speech Emotion Recognition**
-2406.01624v2 by Alaa Nfissi, Wassim Bouachir, Nizar Bouguila, Brian Mishara
+##### **Encrypted Large Model Inference: The Equivariant Encryption Paradigm**
+2502.01013v1 by James Buban, Hongyang Zhang, Claudio Angione, Harry Yang, Ahmad Farhan, Seyfal Sultanov, Michael Du, Xuran Ma, Zihao Wang, Yue Zhao, Arria Owlia, Fielding Johnston, Patrick Colangelo
 
-Speech emotion recognition (SER) has gained significant attention due to its
-several application fields, such as mental health, education, and
-human-computer interaction. However, the accuracy of SER systems is hindered by
-high-dimensional feature sets that may contain irrelevant and redundant
-information. To overcome this challenge, this study proposes an iterative
-feature boosting approach for SER that emphasizes feature relevance and
-explainability to enhance machine learning model performance. Our approach
-involves meticulous feature selection and analysis to build efficient SER
-systems. In addressing our main problem through model explainability, we employ
-a feature evaluation loop with Shapley values to iteratively refine feature
-sets. This process strikes a balance between model performance and
-transparency, which enables a comprehensive understanding of the model's
-predictions. The proposed approach offers several advantages, including the
-identification and removal of irrelevant and redundant features, leading to a
-more effective model. Additionally, it promotes explainability, facilitating
-comprehension of the model's predictions and the identification of crucial
-features for emotion determination. The effectiveness of the proposed method is
-validated on the SER benchmarks of the Toronto emotional speech set (TESS),
-Berlin Database of Emotional Speech (EMO-DB), Ryerson Audio-Visual Database of
-Emotional Speech and Song (RAVDESS), and Surrey Audio-Visual Expressed Emotion
-(SAVEE) datasets, outperforming state-of-the-art methods. To the best of our
-knowledge, this is the first work to incorporate model explainability into an
-SER framework. The source code of this paper is publicly available via this
-https://github.com/alaaNfissi/Unveiling-Hidden-Factors-Explainable-AI-for-Feature-Boosting-in-Speech-Emotion-Recognition.
+Large scale deep learning model, such as modern language models and diffusion
+architectures, have revolutionized applications ranging from natural language
+processing to computer vision. However, their deployment in distributed or
+decentralized environments raises significant privacy concerns, as sensitive
+data may be exposed during inference. Traditional techniques like secure
+multi-party computation, homomorphic encryption, and differential privacy offer
+partial remedies but often incur substantial computational overhead, latency
+penalties, or limited compatibility with non-linear network operations. In this
+work, we introduce Equivariant Encryption (EE), a novel paradigm designed to
+enable secure, "blind" inference on encrypted data with near zero performance
+overhead. Unlike fully homomorphic approaches that encrypt the entire
+computational graph, EE selectively obfuscates critical internal
+representations within neural network layers while preserving the exact
+functionality of both linear and a prescribed set of non-linear operations.
+This targeted encryption ensures that raw inputs, intermediate activations, and
+outputs remain confidential, even when processed on untrusted infrastructure.
+We detail the theoretical foundations of EE, compare its performance and
+integration complexity against conventional privacy preserving techniques, and
+demonstrate its applicability across a range of architectures, from
+convolutional networks to large language models. Furthermore, our work provides
+a comprehensive threat analysis, outlining potential attack vectors and
+baseline strategies, and benchmarks EE against standard inference pipelines in
+decentralized settings. The results confirm that EE maintains high fidelity and
+throughput, effectively bridging the gap between robust data confidentiality
+and the stringent efficiency requirements of modern, large scale model
+inference.
 
-摘要：語音情緒辨識 (SER) 由於其在心理健康、教育和人機互動等多個應用領域而備受關注。然而，SER 系統的準確性受到高維特徵集的阻礙，這些特徵集可能包含不相關和冗餘的資訊。為了克服這個挑戰，本研究提出了一種用於 SER 的迭代特徵提升方法，該方法強調特徵相關性和可解釋性，以增強機器學習模型的效能。我們的做法涉及仔細的特徵選擇和分析，以建立高效的 SER 系統。為了透過模型可解釋性解決我們的核心問題，我們採用了具有 Shapley 值的特徵評估迴圈，以反覆改善特徵集。這個過程在模型效能和透明度之間取得平衡，這使得我們能夠全面了解模型的預測。所提出的方法提供了多項優點，包括識別和移除不相關和冗餘的特徵，從而建立更有效的模型。此外，它促進了可解釋性，有助於理解模型的預測以及識別情緒決定的關鍵特徵。所提出的方法的有效性已在多倫多情緒語音集 (TESS)、柏林情緒語音資料庫 (EMO-DB)、賴爾森音訊視覺情緒語音和歌曲資料庫 (RAVDESS) 和薩里音訊視覺表達情緒 (SAVEE) 資料集的 SER 基準上得到驗證，其效能優於現有方法。據我們所知，這是第一個將模型可解釋性納入 SER 架構的研究。本文的原始碼可透過此連結公開取得：https://github.com/alaaNfissi/Unveiling-Hidden-Factors-Explainable-AI-for-Feature-Boosting-in-Speech-Emotion-Recognition。
+摘要：大型深度學習模型，例如現代語言模型和擴散架構，徹底改變了從自然語言處理到電腦視覺等各種應用。然而，它們在分散式或分散式環境中的部署引發了重大的隱私問題，因為敏感數據可能會在推理過程中遭到揭露。安全多方計算、同態加密和差分隱私等傳統技術提供了部分補救措施，但通常會產生大量的計算開銷、延遲處罰，或與非線性網路操作相容性有限。在這項工作中，我們引入了等變加密 (EE)，這是一種新穎的範例，旨在以接近零效能開銷對加密數據進行安全、「盲目」推理。與加密整個計算圖形的完全同態方法不同，EE 有選擇性地混淆神經網路層內的關鍵內部表示，同時保留線性和規定的一組非線性操作的精確功能。這種有針對性的加密確保了原始輸入、中間激活和輸出保持機密，即使在不受信任的基礎設施上處理也是如此。我們詳細說明了 EE 的理論基礎，比較了其效能和整合複雜度與傳統的隱私保護技術，並展示了其在從卷積網路到大語言模型等各種架構中的適用性。此外，我們的研究提供了全面的威脅分析，概述了潛在的攻擊媒介和基準策略，並在分散式設定中將 EE 與標準推理管道進行比較。結果證實，EE 保持了高保真度和高傳輸量，有效地彌合了強大的數據機密性與現代化、大規模模型推理的嚴格效率要求之間的差距。
 
-##### **The Explanation Necessity for Healthcare AI**
-2406.00216v1 by Michail Mamalakis, Héloïse de Vareilles, Graham Murray, Pietro Lio, John Suckling
+##### **Metastable Dynamics of Chain-of-Thought Reasoning: Provable Benefits of Search, RL and Distillation**
+2502.01694v1 by Juno Kim, Denny Wu, Jason Lee, Taiji Suzuki
 
-Explainability is often critical to the acceptable implementation of
-artificial intelligence (AI). Nowhere is this more important than healthcare
-where decision-making directly impacts patients and trust in AI systems is
-essential. This trust is often built on the explanations and interpretations
-the AI provides. Despite significant advancements in AI interpretability, there
-remains the need for clear guidelines on when and to what extent explanations
-are necessary in the medical context. We propose a novel categorization system
-with four distinct classes of explanation necessity, guiding the level of
-explanation required: patient or sample (local) level, cohort or dataset
-(global) level, or both levels. We introduce a mathematical formulation that
-distinguishes these categories and offers a practical framework for researchers
-to determine the necessity and depth of explanations required in medical AI
-applications. Three key factors are considered: the robustness of the
-evaluation protocol, the variability of expert observations, and the
-representation dimensionality of the application. In this perspective, we
-address the question: When does an AI medical application need to be explained,
-and at what level of detail?
+A key paradigm to improve the reasoning capabilities of large language models
+(LLMs) is to allocate more inference-time compute to search against a verifier
+or reward model. This process can then be utilized to refine the pretrained
+model or distill its reasoning patterns into more efficient models. In this
+paper, we study inference-time compute by viewing chain-of-thought (CoT)
+generation as a metastable Markov process: easy reasoning steps (e.g.,
+algebraic manipulations) form densely connected clusters, while hard reasoning
+steps (e.g., applying a relevant theorem) create sparse, low-probability edges
+between clusters, leading to phase transitions at longer timescales. Under this
+framework, we prove that implementing a search protocol that rewards sparse
+edges improves CoT by decreasing the expected number of steps to reach
+different clusters. In contrast, we establish a limit on reasoning capability
+when the model is restricted to local information of the pretrained graph. We
+also show that the information gained by search can be utilized to obtain a
+better reasoning model: (1) the pretrained model can be directly finetuned to
+favor sparse edges via policy gradient methods, and moreover (2) a compressed
+metastable representation of the reasoning dynamics can be distilled into a
+smaller, more efficient model.
 
-摘要：可解释性通常对于人工智能 (AI) 的可接受实施至关重要。在医疗保健领域，这一点尤为重要，因为决策直接影响患者，并且对 AI 系统的信任至关重要。这种信任通常建立在 AI 提供的解释和诠释之上。尽管 AI 可解释性取得了重大进展，但仍然需要明确的指导方针，说明在医疗环境中何时以及在多大程度上需要解释。我们提出了一种新颖的分类系统，该系统具有四种不同的解释必要性类别，指导所需的解释级别：患者或样本（局部）级别、队列或数据集（全局）级别，或两个级别。我们引入了一个数学公式，该公式区分了这些类别，并为研究人员提供了一个实用框架，以确定医疗 AI 应用中所需的解释的必要性和深度。考虑了三个关键因素：评估协议的稳健性、专家观察的可变性以及应用程序的表示维数。从这个角度来看，我们解决了这个问题：AI 医疗应用何时需要解释，以及需要解释到何种程度？
+摘要：<paragraph>提升大型語言模型 (LLM) 推理能力的一個關鍵範例，是分配更多推論時間運算來搜尋驗證器或獎勵模型。此程序接著可用於改善預訓練模型或將其推理模式提煉到更有效率的模型中。在這篇論文中，我們透過將思維鏈 (CoT) 生成視為亞穩態馬可夫過程來研究推論時間運算：簡單的推理步驟（例如代數運算）形成密集連接的叢集，而困難的推理步驟（例如應用相關定理）則在叢集之間建立稀疏、低機率的邊緣，導致在較長時間尺度上產生相變。在此架構下，我們證明實作一種獎勵稀疏邊緣的搜尋協定，會透過減少到達不同叢集所需的預期步驟數來改善 CoT。相反地，當模型受限於預訓練圖形的局部資訊時，我們建立了推理能力的限制。我們也顯示搜尋所獲得的資訊可用於取得更好的推理模型：(1) 預訓練模型可以直接微調以透過策略梯度方法偏好稀疏邊緣，而且 (2) 推理動態的壓縮亞穩態表徵可以提煉到更小、更有效率的模型中。</paragraph>
 
-##### **Interdisciplinary Expertise to Advance Equitable Explainable AI**
-2406.18563v1 by Chloe R. Bennett, Heather Cole-Lewis, Stephanie Farquhar, Naama Haamel, Boris Babenko, Oran Lang, Mat Fleck, Ilana Traynis, Charles Lau, Ivor Horn, Courtney Lyles
+##### **PhiP-G: Physics-Guided Text-to-3D Compositional Scene Generation**
+2502.00708v1 by Qixuan Li, Chao Wang, Zongjin He, Yan Peng
 
-The field of artificial intelligence (AI) is rapidly influencing health and
-healthcare, but bias and poor performance persists for populations who face
-widespread structural oppression. Previous work has clearly outlined the need
-for more rigorous attention to data representativeness and model performance to
-advance equity and reduce bias. However, there is an opportunity to also
-improve the explainability of AI by leveraging best practices of social
-epidemiology and health equity to help us develop hypotheses for associations
-found. In this paper, we focus on explainable AI (XAI) and describe a framework
-for interdisciplinary expert panel review to discuss and critically assess AI
-model explanations from multiple perspectives and identify areas of bias and
-directions for future research. We emphasize the importance of the
-interdisciplinary expert panel to produce more accurate, equitable
-interpretations which are historically and contextually informed.
-Interdisciplinary panel discussions can help reduce bias, identify potential
-confounders, and identify opportunities for additional research where there are
-gaps in the literature. In turn, these insights can suggest opportunities for
-AI model improvement.
+Text-to-3D asset generation has achieved significant optimization under the
+supervision of 2D diffusion priors. However, when dealing with compositional
+scenes, existing methods encounter several challenges: 1). failure to ensure
+that composite scene layouts comply with physical laws; 2). difficulty in
+accurately capturing the assets and relationships described in complex scene
+descriptions; 3). limited autonomous asset generation capabilities among layout
+approaches leveraging large language models (LLMs). To avoid these compromises,
+we propose a novel framework for compositional scene generation, PhiP-G, which
+seamlessly integrates generation techniques with layout guidance based on a
+world model. Leveraging LLM-based agents, PhiP-G analyzes the complex scene
+description to generate a scene graph, and integrating a multimodal 2D
+generation agent and a 3D Gaussian generation method for targeted assets
+creation. For the stage of layout, PhiP-G employs a physical pool with adhesion
+capabilities and a visual supervision agent, forming a world model for layout
+prediction and planning. Extensive experiments demonstrate that PhiP-G
+significantly enhances the generation quality and physical rationality of the
+compositional scenes. Notably, PhiP-G attains state-of-the-art (SOTA)
+performance in CLIP scores, achieves parity with the leading methods in
+generation quality as measured by the T$^3$Bench, and improves efficiency by
+24x.
 
-摘要：人工智慧 (AI) 領域正快速影響著健康與醫療保健，但對於面臨廣泛結構性壓迫的人群來說，偏見和不良表現依然存在。先前的研究已清楚說明，需要更嚴格地注意資料代表性和模型效能，以促進公平性並減少偏見。然而，我們有機會透過運用社會流行病學和健康公平的最佳實務，來改善 AI 的可解釋性，以幫助我們針對發現的關聯性，發展假設。在本文中，我們專注於可解釋 AI (XAI)，並描述一個跨領域專家小組審查架構，以從多重觀點討論和批判性評估 AI 模型的解釋，並找出偏見領域和未來研究的方向。我們強調跨領域專家小組對於產生更準確、公平的詮釋至關重要，而這些詮釋是根據歷史和脈絡而來的。跨領域小組討論有助於減少偏見、找出潛在的混淆因素，並在文獻中有缺口時找出額外研究的機會。反過來，這些見解可以建議 AI 模型改進的機會。
+摘要：<paragraph>在 2D 擴散先驗的監督下，文字轉 3D 資產生成已取得顯著的最佳化。然而，在處理合成場景時，現有方法會遇到幾個挑戰：1) 無法確保複合場景佈局符合物理定律；2) 難以準確捕捉複雜場景描述中所描述的資產和關係；3) 在利用大型語言模型 (LLM) 的佈局方法中，自主資產生成能力有限。為了避免這些折衷，我們提出了一個合成場景生成的新框架 PhiP-G，它將生成技術與基於世界模型的佈局指導無縫整合。利用基於 LLM 的代理，PhiP-G 分析複雜的場景描述以生成場景圖，並整合多模態 2D 生成代理和 3D 高斯生成方法以進行目標資產創建。對於佈局階段，PhiP-G 採用具有附著能力的物理池和視覺監督代理，形成用於佈局預測和規劃的世界模型。大量的實驗證明，PhiP-G 大幅提升了合成場景的生成品質和物理合理性。值得注意的是，PhiP-G 在 CLIP 分數中獲得了最先進 (SOTA) 的效能，在 T$^3$Bench 測量的生成品質中與領先的方法達到同等水準，並將效率提升了 24 倍。</paragraph>
 
-##### **"It depends": Configuring AI to Improve Clinical Usefulness Across Contexts**
-2407.11978v1 by Hubert D. Zając, Jorge M. N. Ribeiro, Silvia Ingala, Simona Gentile, Ruth Wanjohi, Samuel N. Gitau, Jonathan F. Carlsen, Michael B. Nielsen, Tariq O. Andersen
+##### **A Survey of Quantized Graph Representation Learning: Connecting Graph Structures with Large Language Models**
+2502.00681v1 by Qika Lin, Zhen Peng, Kaize Shi, Kai He, Yiming Xu, Erik Cambria, Mengling Feng
 
-Artificial Intelligence (AI) repeatedly match or outperform radiologists in
-lab experiments. However, real-world implementations of radiological AI-based
-systems are found to provide little to no clinical value. This paper explores
-how to design AI for clinical usefulness in different contexts. We conducted 19
-design sessions and design interventions with 13 radiologists from 7 clinical
-sites in Denmark and Kenya, based on three iterations of a functional AI-based
-prototype. Ten sociotechnical dependencies were identified as crucial for the
-design of AI in radiology. We conceptualised four technical dimensions that
-must be configured to the intended clinical context of use: AI functionality,
-AI medical focus, AI decision threshold, and AI Explainability. We present four
-design recommendations on how to address dependencies pertaining to the medical
-knowledge, clinic type, user expertise level, patient context, and user
-situation that condition the configuration of these technical dimensions.
+Recent years have witnessed rapid advances in graph representation learning,
+with the continuous embedding approach emerging as the dominant paradigm.
+However, such methods encounter issues regarding parameter efficiency,
+interpretability, and robustness. Thus, Quantized Graph Representation (QGR)
+learning has recently gained increasing interest, which represents the graph
+structure with discrete codes instead of conventional continuous embeddings.
+Given its analogous representation form to natural language, QGR also possesses
+the capability to seamlessly integrate graph structures with large language
+models (LLMs). As this emerging paradigm is still in its infancy yet holds
+significant promise, we undertake this thorough survey to promote its rapid
+future prosperity. We first present the background of the general quantization
+methods and their merits. Moreover, we provide an in-depth demonstration of
+current QGR studies from the perspectives of quantized strategies, training
+objectives, distinctive designs, knowledge graph quantization, and
+applications. We further explore the strategies for code dependence learning
+and integration with LLMs. At last, we give discussions and conclude future
+directions, aiming to provide a comprehensive picture of QGR and inspire future
+research.
 
-摘要：人工智慧（AI）在實驗室實驗中不斷地與放射科醫師匹敵或表現得更出色。然而，發現放射科 AI 為基礎系統的實際執行幾乎沒有提供臨床價值。本文探討如何為 AI 設計在不同情境中臨床上的效用。我們根據功能性 AI 為基礎原型的三次迭代，在丹麥和肯亞的 7 個臨床場域與 13 位放射科醫師進行了 19 次設計會議和設計介入。十個社會技術依賴關係被認為對於放射科中 AI 的設計至關重要。我們概念化了四個技術面向，必須根據預期的臨床使用情境進行設定：AI 功能、AI 醫療重點、AI 決策門檻，以及 AI 可解釋性。我們提出四項設計建議，說明如何處理與醫療知識、診所類型、使用者專業知識等級、患者情境，以及影響這些技術面向設定的使用者情境相關的依賴關係。
+摘要：近年来，图表示学习取得了快速进展，其中连续嵌入方法作为主导范式出现。然而，此类方法遇到了参数效率、可解释性和鲁棒性方面的问题。因此，量化图表示 (QGR) 学习最近引起了越来越多的兴趣，它使用离散代码而不是传统的连续嵌入来表示图结构。鉴于其与自然语言类似的表示形式，QGR 也具备将图结构与大型语言模型 (LLM) 无缝集成的能力。由于这种新兴范式仍处于起步阶段，但前景广阔，我们进行了这项全面调查以促进其快速未来的繁荣。我们首先介绍了通用量化方法的背景及其优点。此外，我们从量化策略、训练目标、独特设计、知识图谱量化和应用的角度对当前的 QGR 研究进行了深入的论证。我们进一步探索了代码依赖性学习和与 LLM 集成的策略。最后，我们给出了讨论并总结了未来的方向，旨在提供 QGR 的全面图景并激发未来的研究。
 
-##### **Improving Health Professionals' Onboarding with AI and XAI for Trustworthy Human-AI Collaborative Decision Making**
-2405.16424v1 by Min Hun Lee, Silvana Xin Yi Choo, Shamala D/O Thilarajah
+##### **Challenges and Innovations in LLM-Powered Fake News Detection: A Synthesis of Approaches and Future Directions**
+2502.00339v1 by Jingyuan Yi, Zeqiu Xu, Tianyi Huang, Peiyang Yu
 
-With advanced AI/ML, there has been growing research on explainable AI (XAI)
-and studies on how humans interact with AI and XAI for effective human-AI
-collaborative decision-making. However, we still have a lack of understanding
-of how AI systems and XAI should be first presented to users without technical
-backgrounds. In this paper, we present the findings of semi-structured
-interviews with health professionals (n=12) and students (n=4) majoring in
-medicine and health to study how to improve onboarding with AI and XAI. For the
-interviews, we built upon human-AI interaction guidelines to create onboarding
-materials of an AI system for stroke rehabilitation assessment and AI
-explanations and introduce them to the participants. Our findings reveal that
-beyond presenting traditional performance metrics on AI, participants desired
-benchmark information, the practical benefits of AI, and interaction trials to
-better contextualize AI performance, and refine the objectives and performance
-of AI. Based on these findings, we highlight directions for improving
-onboarding with AI and XAI and human-AI collaborative decision-making.
+The pervasiveness of the dissemination of fake news through social media
+platforms poses critical risks to the trust of the general public, societal
+stability, and democratic institutions. This challenge calls for novel
+methodologies in detection, which can keep pace with the dynamic and
+multi-modal nature of misinformation. Recent works include powering the
+detection using large language model advances in multimodal frameworks,
+methodologies using graphs, and adversarial training in the literature of fake
+news. Based on the different approaches which can bring success, some key
+highlights will be underlined: enhanced LLM-improves accuracy through more
+advanced semantics and cross-modality fusion for robust detections. The review
+further identifies critical gaps in adaptability to dynamic social media
+trends, real-time, and cross-platform detection capabilities, as well as the
+ethical challenges thrown up by the misuse of LLMs. Future directions underline
+the development of style-agnostic models, cross-lingual detection frameworks,
+and robust policies with a view to mitigating LLM-driven misinformation. This
+synthesis thus lays a concrete foundation for those researchers and
+practitioners committed to reinforcing fake news detection systems with
+complications that keep on growing in the digital landscape.
 
-摘要：隨著先進的 AI/ML，對可解釋 AI (XAI) 的研究不斷增加，以及關於人類如何與 AI 和 XAI 互動以進行有效的人工智慧協作決策制定。然而，我們仍然缺乏對 AI 系統和 XAI 應如何首先呈現給沒有技術背景的用戶的了解。在本文中，我們展示了與醫療專業人員 (n=12) 和主修醫學和健康的學生 (n=4) 進行半結構化訪談的結果，以研究如何改善 AI 和 XAI 的入門。對於訪談，我們建立在人機互動準則之上，為中風康復評估和 AI 解釋的 AI 系統創建入門材料，並將它們介紹給參與者。我們的研究結果表明，除了呈現傳統的 AI 性能指標外，參與者還希望基准信息、AI 的實際好處以及交互試驗，以更好地將 AI 性能情境化，並完善 AI 的目標和性能。根據這些發現，我們強調了改進 AI 和 XAI 以及人機協作決策制定的入門方向。
+摘要：社群媒體平台上假新聞散播的普遍性對一般大眾的信任、社會穩定性與民主制度構成重大風險。這項挑戰需要在偵測方面採用創新的方法論，才能跟上錯誤資訊的動態和多模態特性。最近的研究包括使用多模態架構中大型語言模型的進展、使用圖形的方法論，以及在假新聞文獻中進行對抗訓練來強化偵測。根據可以帶來成功的不同方法，將重點說明一些重點：增強的 LLM 可透過更進階的語意和跨模態融合來提升準確度，以進行穩健的偵測。這篇評論進一步找出在適應動態社群媒體趨勢、即時和跨平台偵測能力方面的重大差距，以及 LLM 遭濫用的道德挑戰。未來的方向強調開發與風格無關的模型、跨語言偵測架構和穩健的政策，以減輕 LLM 驅動的錯誤資訊。因此，這種綜合分析為那些致力於強化假新聞偵測系統的研究人員和從業人員奠定了具體的基礎，而這些複雜性在數位環境中持續增長。
 
-##### **Exploring Nutritional Impact on Alzheimer's Mortality: An Explainable AI Approach**
-2405.17502v1 by Ziming Liu, Longjian Liu, Robert E. Heidel, Xiaopeng Zhao
+##### **DEUCE: Dual-diversity Enhancement and Uncertainty-awareness for Cold-start Active Learning**
+2502.00305v1 by Jiaxin Guo, C. L. Philip Chen, Shuzhen Li, Tong Zhang
 
-This article uses machine learning (ML) and explainable artificial
-intelligence (XAI) techniques to investigate the relationship between
-nutritional status and mortality rates associated with Alzheimers disease (AD).
-The Third National Health and Nutrition Examination Survey (NHANES III)
-database is employed for analysis. The random forest model is selected as the
-base model for XAI analysis, and the Shapley Additive Explanations (SHAP)
-method is used to assess feature importance. The results highlight significant
-nutritional factors such as serum vitamin B12 and glycated hemoglobin. The
-study demonstrates the effectiveness of random forests in predicting AD
-mortality compared to other diseases. This research provides insights into the
-impact of nutrition on AD and contributes to a deeper understanding of disease
-progression.
+Cold-start active learning (CSAL) selects valuable instances from an
+unlabeled dataset for manual annotation. It provides high-quality data at a low
+annotation cost for label-scarce text classification. However, existing CSAL
+methods overlook weak classes and hard representative examples, resulting in
+biased learning. To address these issues, this paper proposes a novel
+dual-diversity enhancing and uncertainty-aware (DEUCE) framework for CSAL.
+Specifically, DEUCE leverages a pretrained language model (PLM) to efficiently
+extract textual representations, class predictions, and predictive uncertainty.
+Then, it constructs a Dual-Neighbor Graph (DNG) to combine information on both
+textual diversity and class diversity, ensuring a balanced data distribution.
+It further propagates uncertainty information via density-based clustering to
+select hard representative instances. DEUCE performs well in selecting
+class-balanced and hard representative data by dual-diversity and
+informativeness. Experiments on six NLP datasets demonstrate the superiority
+and efficiency of DEUCE.
 
-摘要：本文使用機器學習 (ML) 和可解釋人工智慧 (XAI) 技術來探討營養狀況與阿茲海默症 (AD) 相關的死亡率之間的關係。採用第三次全國健康與營養檢查調查 (NHANES III) 資料庫進行分析。選擇隨機森林模型作為 XAI 分析的基礎模型，並使用 Shapley Additive Explanations (SHAP) 方法來評估特徵重要性。結果突顯了重要的營養因素，例如血清維生素 B12 和糖化血紅蛋白。該研究證明了隨機森林在預測 AD 死亡率方面相較於其他疾病的有效性。本研究提供了營養對 AD 的影響的見解，並有助於更深入地了解疾病的進展。
+摘要：冷啟動主動學習 (CSAL) 從未標記的資料集中選取有價值的實例進行手動標記。它以低標記成本提供高品質的資料，用於標籤稀少的文字分類。然而，現有的 CSAL 方法忽略了弱類別和難以代表的範例，導致有偏差的學習。為了解決這些問題，本文提出了一個新的雙重多樣性增強和不確定性感知 (DEUCE) 架構，用於 CSAL。具體來說，DEUCE 利用預訓練的語言模型 (PLM) 來有效地提取文字表徵、類別預測和預測不確定性。然後，它構建一個雙鄰居圖 (DNG) 來結合文字多樣性和類別多樣性的資訊，確保平衡的資料分佈。它進一步通過基於密度的聚類來傳播不確定性資訊，以選擇難以代表的實例。DEUCE 在通過雙重多樣性和資訊性選擇類別平衡和難以代表的資料方面表現良好。在六個 NLP 資料集上的實驗證明了 DEUCE 的優越性和效率。
 
-##### **Explainable AI Enhances Glaucoma Referrals, Yet the Human-AI Team Still Falls Short of the AI Alone**
-2407.11974v1 by Catalina Gomez, Ruolin Wang, Katharina Breininger, Corinne Casey, Chris Bradley, Mitchell Pavlak, Alex Pham, Jithin Yohannan, Mathias Unberath
+##### **Longer Attention Span: Increasing Transformer Context Length with Sparse Graph Processing Techniques**
+2502.01659v2 by Nathaniel Tomczak, Sanmukh Kuppannagari
 
-Primary care providers are vital for initial triage and referrals to
-specialty care. In glaucoma, asymptomatic and fast progression can lead to
-vision loss, necessitating timely referrals to specialists. However, primary
-eye care providers may not identify urgent cases, potentially delaying care.
-Artificial Intelligence (AI) offering explanations could enhance their referral
-decisions. We investigate how various AI explanations help providers
-distinguish between patients needing immediate or non-urgent specialist
-referrals. We built explainable AI algorithms to predict glaucoma surgery needs
-from routine eyecare data as a proxy for identifying high-risk patients. We
-incorporated intrinsic and post-hoc explainability and conducted an online
-study with optometrists to assess human-AI team performance, measuring referral
-accuracy and analyzing interactions with AI, including agreement rates, task
-time, and user experience perceptions. AI support enhanced referral accuracy
-among 87 participants (59.9%/50.8% with/without AI), though Human-AI teams
-underperformed compared to AI alone. Participants believed they included AI
-advice more when using the intrinsic model, and perceived it more useful and
-promising. Without explanations, deviations from AI recommendations increased.
-AI support did not increase workload, confidence, and trust, but reduced
-challenges. On a separate test set, our black-box and intrinsic models achieved
-an accuracy of 77% and 71%, respectively, in predicting surgical outcomes. We
-identify opportunities of human-AI teaming for glaucoma management in primary
-eye care, noting that while AI enhances referral accuracy, it also shows a
-performance gap compared to AI alone, even with explanations. Human involvement
-remains essential in medical decision making, underscoring the need for future
-research to optimize collaboration, ensuring positive experiences and safe AI
-use.
+Transformers have demonstrated great success in numerous domains including
+natural language processing and bioinformatics. This success stems from the use
+of the attention mechanism by these models in order to represent and propagate
+pairwise interactions between individual tokens of sequential data. However,
+the primary limitation of this operation is its quadratic memory and time
+complexity in relation to the input's context length - the length of a sequence
+over which the interactions need to be captured. This significantly limits the
+length of sequences that can be inferred upon by these models. Extensive
+research has been conducted to reduce the number of pairwise interactions to
+sub-quadratic in relation to the context length by introducing sparsity into
+the attention mechanism through the development of sparse attention masks.
+However, efficient implementations that achieve "true sparsity" are lacking.
+  In this work, we address this issue by proposing a graph computing view of
+attention where tokens are perceived as nodes of the graph and the attention
+mask determines the edges of the graph. Using this view, we develop graph
+processing algorithms to implement the attention mechanism. Both theoretically
+and empirically, we demonstrate that our algorithms only perform the needed
+computations, i.e., they are work optimal. We also perform extensive
+experimentation using popular attention masks to explore the impact of sparsity
+on execution time and achievable context length. Our experiments demonstrate
+significant speedups in execution times compared to state-of-the-art attention
+implementations such as FlashAttention for large sequence lengths. We also
+demonstrate that our algorithms are able to achieve extremely long sequence
+lengths of as high as 160 million on a single NVIDIA A100 GPU (SXM4 80GB).
 
-摘要：<paragraph>初級保健提供者對於最初的分流和轉診到專科照護至關重要。在青光眼的情況下，無症狀且快速惡化可能導致視力喪失，因此需要及時轉診給專家。然而，初級眼科保健提供者可能無法識別緊急情況，可能會延誤照護。提供解釋的人工智慧 (AI) 可以加強他們的轉診決策。我們研究各種 AI 解釋如何幫助提供者區分需要立即或非緊急專科轉診的患者。我們建立了解釋性 AI 演算法，以從例行眼科護理資料預測青光眼手術需求，作為識別高風險患者的代理。我們納入了內在和事後解釋性，並與驗光師進行了一項線上研究，以評估人機團隊的表現，衡量轉診準確度並分析與 AI 的互動，包括同意率、任務時間和使用者體驗感知。在 87 名參與者中，AI 支援提高了轉診準確度（使用 AI/未使用的比例為 59.9%/50.8%），儘管人機團隊的表現不如單獨使用 AI。參與者認為他們在使用內在模型時更多地納入了 AI 建議，並認為它更有用且更有希望。沒有解釋，AI 建議的偏差會增加。AI 支援並未增加工作量、信心和信任，但減少了挑戰。在一個單獨的測試集中，我們的黑盒子和內在模型在預測手術結果方面分別達到了 77% 和 71% 的準確度。我們找出在初級眼科保健中，人機團隊合作管理青光眼的機會，並注意到雖然 AI 提高了轉診準確度，但即使有解釋，它也顯示出與單獨使用 AI 相比的效能差距。人類參與在醫療決策中仍然至關重要，這強調了未來研究優化協作、確保正面經驗和安全使用 AI 的必要性。</paragraph>
+摘要：變形金剛已在許多領域展現出巨大的成功，包括自然語言處理和生物資訊學。這種成功源自於這些模型使用注意機制來表示和傳播序列資料中各個標記之間成對的互動。然而，這種運算的主要限制在於其二次記憶體和時間複雜度與輸入的內容長度有關，也就是需要擷取互動的序列長度。這會顯著限制這些模型可以推論的序列長度。已經進行了大量的研究來減少成對互動的數量，使其與內容長度成次二次關係，方法是透過開發稀疏注意遮罩來將稀疏性引入注意機制。然而，缺乏能達成「真實稀疏性」的高效實作。在這項工作中，我們透過提出注意力的圖形運算檢視來解決這個問題，其中標記被視為圖形的節點，而注意力遮罩則決定圖形中的邊緣。使用這種檢視，我們開發了圖形處理演算法來實作注意力機制。我們在理論上和經驗上都證明了我們的演算法只執行必要的運算，也就是說，它們是工作最優的。我們也使用流行的注意力遮罩進行廣泛的實驗，以探討稀疏性對執行時間和可達成的內容長度的影響。我們的實驗證明，與最先進的注意力實作（例如 FlashAttention）相比，對於大型序列長度，我們的演算法在執行時間方面有顯著的加速。我們也證明了我們的演算法能夠在單一的 NVIDIA A100 GPU (SXM4 80GB) 上達成極長的序列長度，最高可達 1.6 億。
 
-##### **Decoding Decision Reasoning: A Counterfactual-Powered Model for Knowledge Discovery**
-2406.18552v1 by Yingying Fang, Zihao Jin, Xiaodan Xing, Simon Walsh, Guang Yang
+##### **Improving vision-language alignment with graph spiking hybrid Networks**
+2501.19069v1 by Siyu Zhang, Heming Zheng, Yiming Wu, Yeming Chen
 
-In medical imaging, particularly in early disease detection and prognosis
-tasks, discerning the rationale behind an AI model's predictions is crucial for
-evaluating the reliability of its decisions. Conventional explanation methods
-face challenges in identifying discernible decisive features in medical image
-classifications, where discriminative features are subtle or not immediately
-apparent. To bridge this gap, we propose an explainable model that is equipped
-with both decision reasoning and feature identification capabilities. Our
-approach not only detects influential image patterns but also uncovers the
-decisive features that drive the model's final predictions. By implementing our
-method, we can efficiently identify and visualise class-specific features
-leveraged by the data-driven model, providing insights into the decision-making
-processes of deep learning models. We validated our model in the demanding
-realm of medical prognosis task, demonstrating its efficacy and potential in
-enhancing the reliability of AI in healthcare and in discovering new knowledge
-in diseases where prognostic understanding is limited.
+To bridge the semantic gap between vision and language (VL), it is necessary
+to develop a good alignment strategy, which includes handling semantic
+diversity, abstract representation of visual information, and generalization
+ability of models. Recent works use detector-based bounding boxes or patches
+with regular partitions to represent visual semantics. While current paradigms
+have made strides, they are still insufficient for fully capturing the nuanced
+contextual relations among various objects. This paper proposes a comprehensive
+visual semantic representation module, necessitating the utilization of
+panoptic segmentation to generate coherent fine-grained semantic features.
+Furthermore, we propose a novel Graph Spiking Hybrid Network (GSHN) that
+integrates the complementary advantages of Spiking Neural Networks (SNNs) and
+Graph Attention Networks (GATs) to encode visual semantic information.
+Intriguingly, the model not only encodes the discrete and continuous latent
+variables of instances but also adeptly captures both local and global
+contextual features, thereby significantly enhancing the richness and diversity
+of semantic representations. Leveraging the spatiotemporal properties inherent
+in SNNs, we employ contrastive learning (CL) to enhance the similarity-based
+representation of embeddings. This strategy alleviates the computational
+overhead of the model and enriches meaningful visual representations by
+constructing positive and negative sample pairs. We design an innovative
+pre-training method, Spiked Text Learning (STL), which uses text features to
+improve the encoding ability of discrete semantics. Experiments show that the
+proposed GSHN exhibits promising results on multiple VL downstream tasks.
 
-摘要：在醫學影像中，特別是在早期疾病檢測和預後任務中，辨別 AI 模型預測背後的原理對於評估其決策的可靠性至關重要。傳統的解釋方法在識別醫學影像分類中可識別的決定性特徵時面臨挑戰，其中區別性特徵很微妙或並不明顯。為了彌合這一差距，我們提出了一個可解釋的模型，該模型具備決策推理和特徵識別能力。我們的做法不僅檢測有影響力的影像模式，還揭示了推動模型最終預測的決定性特徵。通過實施我們的模型，我們可以有效識別和視覺化由數據驅動模型利用的類特定特徵，從而深入了解深度學習模型的決策過程。我們在要求嚴格的醫學預後任務領域驗證了我們的模型，展示了其在提高 AI 在醫療保健中的可靠性和發現預後理解受限疾病的新知識方面的功效和潛力。
+摘要：<paragraph>為了彌合視覺和語言 (VL) 之間的語意差距，必須制定良好的對齊策略，其中包括處理語意多樣性、視覺資訊的抽象表示以及模型的泛化能力。最近的研究使用基於偵測器的邊界框或具有規則分割的區塊來表示視覺語意。雖然目前的範例已取得進展，但對於完全捕捉各種物件之間的細微脈絡關係仍不足夠。本文提出了一個全面的視覺語意表示模組，需要利用全景分割來產生連貫的細粒度語意特徵。此外，我們提出了一個新穎的圖形脈衝混合網路 (GSHN)，它整合了脈衝神經網路 (SNN) 和圖形注意力網路 (GAT) 的互補優勢來編碼視覺語意資訊。有趣的是，該模型不僅編碼實例的離散和連續潛在變數，還能巧妙地捕捉局部和全域脈絡特徵，從而顯著增強語意表示的豐富性和多樣性。利用 SNN 中固有的時空特性，我們採用對比學習 (CL) 來增強嵌入的基於相似性的表示。此策略減輕了模型的計算負擔，並透過建構正負樣本對來豐富有意義的視覺表示。我們設計了一個創新的預訓練方法，脈衝文本學習 (STL)，它使用文本特徵來提高離散語意的編碼能力。實驗表明，所提出的 GSHN 在多個 VL 下游任務上展現出有希望的結果。</paragraph>
 
-##### **The Role of Emotions in Informational Support Question-Response Pairs in Online Health Communities: A Multimodal Deep Learning Approach**
-2405.13099v1 by Mohsen Jozani, Jason A. Williams, Ahmed Aleroud, Sarbottam Bhagat
+##### **Semantic Web and Creative AI -- A Technical Report from ISWS 2023**
+2501.18542v1 by Raia Abu Ahmad, Reham Alharbi, Roberto Barile, Martin Böckling, Francisco Bolanos, Sara Bonfitto, Oleksandra Bruns, Irene Celino, Yashrajsinh Chudasama, Martin Critelli, Claudia d'Amato, Giada D'Ippolito, Ioannis Dasoulas, Stefano De Giorgis, Vincenzo De Leo, Chiara Di Bonaventura, Marco Di Panfilo, Daniil Dobriy, John Domingue, Xuemin Duan, Michel Dumontier, Sefika Efeoglu, Ruben Eschauzier, Fakih Ginwa, Nicolas Ferranti, Arianna Graciotti, Philipp Hanisch, George Hannah, Golsa Heidari, Aidan Hogan, Hassan Hussein, Alexane Jouglar, Jan-Christoph Kalo, Manoé Kieffer, Antonis Klironomos, Inês Koch, Weronika Lajewska, Nicolas Lazzari, Mikael Lindekrans, Anna Sofia Lippolis, Majlinda Llugiqi, Eleonora Mancini, Eleonora Marzi, Laura Menotti, Daniela Milon Flores, Soulakshmee Nagowah, Kerstin Neubert, Emetis Niazmand, Ebrahim Norouzi, Beatriz Olarte Martinez, Anouk Michelle Oudshoorn, Andrea Poltronieri, Valentina Presutti, Disha Purohit, Ensiyeh Raoufi, Celian Ringwald, Johanna Rockstroh, Sebastian Rudolph, Harald Sack, Zafar Saeed, Mohammad Javad Saeedizade, Aya Sahbi, Cristian Santini, Aleksandra Simic, Dennis Sommer, Rita Sousa, Mary Ann Tan, Vidyashree Tarikere, Tabea Tietz, Liam Tirpitz, Arnaldo Tomasino, Frank van Harmelen, Joao Vissoci, Caitlin Woods, Bohui Zhang, Xinyue Zhang, Heng Zheng
 
-This study explores the relationship between informational support seeking
-questions, responses, and helpfulness ratings in online health communities. We
-created a labeled data set of question-response pairs and developed multimodal
-machine learning and deep learning models to reliably predict informational
-support questions and responses. We employed explainable AI to reveal the
-emotions embedded in informational support exchanges, demonstrating the
-importance of emotion in providing informational support. This complex
-interplay between emotional and informational support has not been previously
-researched. The study refines social support theory and lays the groundwork for
-the development of user decision aids. Further implications are discussed.
+The International Semantic Web Research School (ISWS) is a week-long
+intensive program designed to immerse participants in the field. This document
+reports a collaborative effort performed by ten teams of students, each guided
+by a senior researcher as their mentor, attending ISWS 2023. Each team provided
+a different perspective to the topic of creative AI, substantiated by a set of
+research questions as the main subject of their investigation. The 2023 edition
+of ISWS focuses on the intersection of Semantic Web technologies and Creative
+AI. ISWS 2023 explored various intersections between Semantic Web technologies
+and creative AI. A key area of focus was the potential of LLMs as support tools
+for knowledge engineering. Participants also delved into the multifaceted
+applications of LLMs, including legal aspects of creative content production,
+humans in the loop, decentralised approaches to multimodal generative AI
+models, nanopublications and AI for personal scientific knowledge graphs,
+commonsense knowledge in automatic story and narrative completion, generative
+AI for art critique, prompt engineering, automatic music composition,
+commonsense prototyping and conceptual blending, and elicitation of tacit
+knowledge. As Large Language Models and semantic technologies continue to
+evolve, new exciting prospects are emerging: a future where the boundaries
+between creative expression and factual knowledge become increasingly permeable
+and porous, leading to a world of knowledge that is both informative and
+inspiring.
+
+摘要：國際語意網路研究學校 (ISWS) 是一個為期一週的密集課程，旨在讓參與者沉浸在該領域中。本文件報告了由十個學生團隊進行的合作成果，每個團隊都由一位資深研究員作為導師，參加了 2023 年 ISWS。每個團隊都從不同的角度探討了創意 AI 主題，並以一系列研究問題作為調查的主要主題。2023 年版的 ISWS 關注於語意網路技術和創意 AI 的交集。ISWS 2023 探索了語意網路技術和創意 AI 之間的各種交集。一個重點關注領域是 LLM 作為知識工程的支援工具的潛力。參與者還深入探討了 LLM 的多方面應用，包括創意內容製作的法律方面、循環中的人類、多模態生成式 AI 模型的分散式方法、納米出版物和用於個人科學知識圖譜的 AI、自動故事和敘述完成中的常識知識、生成式 AI 用於藝術評論、提示工程、自動音樂創作、常識原型和概念混合，以及對默會知識的引導。隨著大型語言模型和語意技術的持續發展，新的令人興奮的前景正在出現：一個創意表達和事實知識之間的界限變得越來越可滲透和多孔的未來，從而導致一個既有資訊性又有啟發性的知識世界。
 
-摘要：本研究探討線上健康社群中尋求資訊支持的問題、回應，以及有幫助的評分之間的關係。我們建立了一組標記的問答配對資料集，並開發了多模態機器學習和深度學習模型，以可靠地預測資訊支持問題和回應。我們採用可解釋的 AI 來揭示資訊支持交流中蘊含的情緒，證明情緒在提供資訊支持中的重要性。這種情緒支持和資訊支持之間的複雜交互作用以前並未被研究過。本研究改進了社會支持理論，並為使用者決策輔助工具的開發奠定了基礎。討論了進一步的影響。
+##### **Leveraging LLM Agents for Automated Optimization Modeling for SASP Problems: A Graph-RAG based Approach**
+2501.18320v1 by Tianpeng Pan, Wenqiang Pu, Licheng Zhao, Rui Zhou
 
-##### **ChatGPT in Classrooms: Transforming Challenges into Opportunities in Education**
-2405.10645v1 by Harris Bin Munawar, Nikolaos Misirlis
+Automated optimization modeling (AOM) has evoked considerable interest with
+the rapid evolution of large language models (LLMs). Existing approaches
+predominantly rely on prompt engineering, utilizing meticulously designed
+expert response chains or structured guidance. However, prompt-based techniques
+have failed to perform well in the sensor array signal processing (SASP) area
+due the lack of specific domain knowledge. To address this issue, we propose an
+automated modeling approach based on retrieval-augmented generation (RAG)
+technique, which consists of two principal components: a multi-agent (MA)
+structure and a graph-based RAG (Graph-RAG) process. The MA structure is
+tailored for the architectural AOM process, with each agent being designed
+based on principles of human modeling procedure. The Graph-RAG process serves
+to match user query with specific SASP modeling knowledge, thereby enhancing
+the modeling result. Results on ten classical signal processing problems
+demonstrate that the proposed approach (termed as MAG-RAG) outperforms several
+AOM benchmarks.
 
-In the era of exponential technology growth, one unexpected guest has claimed
-a seat in classrooms worldwide, Artificial Intelligence. Generative AI, such as
-ChatGPT, promises a revolution in education, yet it arrives with a double-edged
-sword. Its potential for personalized learning is offset by issues of cheating,
-inaccuracies, and educators struggling to incorporate it effectively into their
-lesson design. We are standing on the brink of this educational frontier, and
-it is clear that we need to navigate this terrain with a lot of care. This is a
-major challenge that could undermine the integrity and value of our educational
-process. So, how can we turn these challenges into opportunities? When used
-inappropriately, AI tools can become the perfect tool for the cut copy paste
-mentality, and quickly begin to corrode critical thinking, creativity, and deep
-understanding, the most important skills in our rapidly changing world.
-Teachers feel that they are not equipped to leverage this technology, widening
-the digital divide among educators and institutions. Addressing these concerns
-calls for an in depth research approach. We will employ empirical research,
-drawing on the Technology Acceptance Model, to assess the attitudes toward
-generative AI among educators and students. Understanding their perceptions,
-usage patterns, and hurdles is the first crucial step in creating an effective
-solution. The present study will be used as a process manual for future
-researchers to apply, running their own data, based on the steps explained here
+摘要：自動化最佳化建模 (AOM) 隨著大型語言模型 (LLM) 的快速演進而引起相當大的興趣。現有方法主要依賴提示工程，利用精心設計的專家回應鏈或結構化指導。然而，基於提示的技術由於缺乏特定領域知識，無法在感測器陣列訊號處理 (SASP) 領域中表現良好。為了解決這個問題，我們提出一個基於檢索增強生成 (RAG) 技術的自動化建模方法，它包含兩個主要組成部分：多代理 (MA) 結構和基於圖形的 RAG (Graph-RAG) 程序。MA 結構是針對架構 AOM 程序量身打造，每個代理都是根據人類建模程序的原理設計的。Graph-RAG 程序用於將使用者查詢與特定的 SASP 建模知識相匹配，從而增強建模結果。在十個經典訊號處理問題上的結果表明，所提出的方法（稱為 MAG-RAG）優於多個 AOM 基準。
 
-摘要：在科技飛速發展的時代，一位意外的訪客已在全球教室中佔有一席之地，那就是人工智慧。生成式 AI，例如 ChatGPT，承諾在教育領域掀起一場革命，但它卻是一把雙面刃。它在個人化學習方面的潛力，卻因作弊、不準確以及教育工作者難以將其有效融入教學設計等問題而抵銷。我們正站在這教育前沿的邊緣，顯然我們需要非常小心地探索這片領域。這是一個重大的挑戰，可能會損害我們教育過程的完整性和價值。那麼，我們如何將這些挑戰轉化為機遇？當不適當地使用時，AI 工具可能會成為複製貼上心態的完美工具，並迅速腐蝕批判性思維、創造力和深入理解，這些都是我們快速變化的世界中最重要的技能。教師們覺得他們沒有能力利用這項技術，這擴大了教育工作者和機構之間的數位鴻溝。解決這些問題需要深入的研究方法。我們將採用實證研究，借鑑技術接受模型，來評估教育工作者和學生對生成式 AI 的態度。了解他們的看法、使用模式和障礙是創造有效解決方案的第一個關鍵步驟。本研究將作為未來研究人員應用的流程手冊，根據此處說明的步驟運行他們自己的數據
+##### **Mixed-Precision Graph Neural Quantization for Low Bit Large Language Models**
+2501.18154v1 by Wanlong Liu, Yichen Xiao, Dingyi Zeng, Hongyang Zhao, Wenyu Chen, Malu Zhang
 
-##### **Evaluating the Explainable AI Method Grad-CAM for Breath Classification on Newborn Time Series Data**
-2405.07590v1 by Camelia Oprea, Mike Grüne, Mateusz Buglowski, Lena Olivier, Thorsten Orlikowsky, Stefan Kowalewski, Mark Schoberer, André Stollenwerk
+Post-Training Quantization (PTQ) is pivotal for deploying large language
+models (LLMs) within resource-limited settings by significantly reducing
+resource demands. However, existing PTQ strategies underperform at low bit
+levels < 3 bits due to the significant difference between the quantized and
+original weights. To enhance the quantization performance at low bit widths, we
+introduce a Mixed-precision Graph Neural PTQ (MG-PTQ) approach, employing a
+graph neural network (GNN) module to capture dependencies among weights and
+adaptively assign quantization bit-widths. Through the information propagation
+of the GNN module, our method more effectively captures dependencies among
+target weights, leading to a more accurate assessment of weight importance and
+optimized allocation of quantization strategies. Extensive experiments on the
+WikiText2 and C4 datasets demonstrate that our MG-PTQ method outperforms
+previous state-of-the-art PTQ method GPTQ, setting new benchmarks for
+quantization performance under low-bit conditions.
 
-With the digitalization of health care systems, artificial intelligence
-becomes more present in medicine. Especially machine learning shows great
-potential for complex tasks such as time series classification, usually at the
-cost of transparency and comprehensibility. This leads to a lack of trust by
-humans and thus hinders its active usage. Explainable artificial intelligence
-tries to close this gap by providing insight into the decision-making process,
-the actual usefulness of its different methods is however unclear. This paper
-proposes a user study based evaluation of the explanation method Grad-CAM with
-application to a neural network for the classification of breaths in time
-series neonatal ventilation data. We present the perceived usefulness of the
-explainability method by different stakeholders, exposing the difficulty to
-achieve actual transparency and the wish for more in-depth explanations by many
-of the participants.
+摘要：訓練後量化 (PTQ) 對於在資源受限的設定中部署大型語言模型 (LLM) 至關重要，因為它能顯著降低資源需求。然而，現有的 PTQ 策略在低位元層級 < 3 位元時表現不佳，因為量化後的權重與原始權重之間有顯著的差異。為了提升低位元寬度的量化效能，我們提出混合精度圖神經網路 PTQ (MG-PTQ) 方法，採用圖神經網路 (GNN) 模組來擷取權重之間的依存關係，並動態分配量化位元寬度。透過 GNN 模組的資訊傳播，我們的方法能更有效地擷取目標權重之間的依存關係，進而更準確地評估權重重要性，並最佳化量化策略的配置。在 WikiText2 和 C4 資料集上的廣泛實驗證明，我們的 MG-PTQ 方法優於先前的最先進 PTQ 方法 GPTQ，在低位元條件下設定了量化效能的新基準。
 
-摘要：隨著醫療保健系統的數位化，人工智慧在醫學領域中變得更加普及。特別是機器學習在時間序列分類等複雜任務中展現出極大的潛力，但通常是以透明度和可理解性為代價。這導致人類缺乏信任，從而阻礙了其積極使用。可解釋的人工智慧試圖通過提供對決策過程的洞察來彌補這一差距，但其不同方法的實際效用尚不清楚。本文提出了一個基於使用者研究的評估，其中包含了 Grad-CAM 解釋方法，並將其應用於神經網路以分類時間序列新生兒呼吸數據中的呼吸。我們展示了不同利益相關者對可解釋性方法的感知效用，揭示了實現實際透明度的難度，以及許多參與者希望獲得更深入的解釋。
+##### **Self-supervised Quantized Representation for Seamlessly Integrating Knowledge Graphs with Large Language Models**
+2501.18119v1 by Qika Lin, Tianzhe Zhao, Kai He, Zhen Peng, Fangzhi Xu, Ling Huang, Jingying Ma, Mengling Feng
 
-##### **XAI4LLM. Let Machine Learning Models and LLMs Collaborate for Enhanced In-Context Learning in Healthcare**
-2405.06270v3 by Fatemeh Nazary, Yashar Deldjoo, Tommaso Di Noia, Eugenio di Sciascio
+Due to the presence of the natural gap between Knowledge Graph (KG)
+structures and the natural language, the effective integration of holistic
+structural information of KGs with Large Language Models (LLMs) has emerged as
+a significant question. To this end, we propose a two-stage framework to learn
+and apply quantized codes for each entity, aiming for the seamless integration
+of KGs with LLMs. Firstly, a self-supervised quantized representation (SSQR)
+method is proposed to compress both KG structural and semantic knowledge into
+discrete codes (\ie, tokens) that align the format of language sentences. We
+further design KG instruction-following data by viewing these learned codes as
+features to directly input to LLMs, thereby achieving seamless integration. The
+experiment results demonstrate that SSQR outperforms existing unsupervised
+quantized methods, producing more distinguishable codes. Further, the
+fine-tuned LLaMA2 and LLaMA3.1 also have superior performance on KG link
+prediction and triple classification tasks, utilizing only 16 tokens per entity
+instead of thousands in conventional prompting methods.
 
-The integration of Large Language Models (LLMs) into healthcare diagnostics
-offers a promising avenue for clinical decision-making. This study outlines the
-development of a novel method for zero-shot/few-shot in-context learning (ICL)
-by integrating medical domain knowledge using a multi-layered structured
-prompt. We also explore the efficacy of two communication styles between the
-user and LLMs: the Numerical Conversational (NC) style, which processes data
-incrementally, and the Natural Language Single-Turn (NL-ST) style, which
-employs long narrative prompts.
-  Our study systematically evaluates the diagnostic accuracy and risk factors,
-including gender bias and false negative rates, using a dataset of 920 patient
-records in various few-shot scenarios. Results indicate that traditional
-clinical machine learning (ML) models generally outperform LLMs in zero-shot
-and few-shot settings. However, the performance gap narrows significantly when
-employing few-shot examples alongside effective explainable AI (XAI) methods as
-sources of domain knowledge. Moreover, with sufficient time and an increased
-number of examples, the conversational style (NC) nearly matches the
-performance of ML models. Most notably, LLMs demonstrate comparable or superior
-cost-sensitive accuracy relative to ML models.
-  This research confirms that, with appropriate domain knowledge and tailored
-communication strategies, LLMs can significantly enhance diagnostic processes.
-The findings highlight the importance of optimizing the number of training
-examples and communication styles to improve accuracy and reduce biases in LLM
-applications.
+摘要：由於知識圖譜 (KG) 結構與自然語言之間存在自然差距，將 KG 的整體結構資訊與大型語言模型 (LLM) 有效整合已成為一個重要的問題。為此，我們提出了一個兩階段架構來學習和應用每個實體的量化碼，旨在將 KG 與 LLM 無縫整合。首先，提出了一個自監督量化表示 (SSQR) 方法，將 KG 結構和語義知識壓縮成離散碼（即，符號），以對齊語言句子的格式。我們進一步設計 KG 指令遵循資料，將這些學習到的碼視為直接輸入 LLM 的特徵，從而實現無縫整合。實驗結果表明，SSQR 優於現有的無監督量化方法，產生更具區別性的碼。此外，微調後的 LLaMA2 和 LLaMA3.1 在 KG 連結預測和三元分類任務上也具有優異的性能，每個實體僅使用 16 個符號，而不是傳統提示方法中的數千個。
 
-摘要：大型語言模型 (LLM) 與醫療診斷整合
-為臨床決策提供了一個有前景的途徑。本研究概述了一種新穎方法的開發，用於零次學習/少量學習情境學習 (ICL)，方法是使用多層結構化提示整合醫療領域知識。我們還探討了使用者與 LLM 之間兩種溝通方式的功效：數值對話 (NC) 方式，它會逐步處理資料，以及自然語言單回合 (NL-ST) 方式，它會使用長篇敘事提示。
-我們的研究系統性地評估了診斷準確性和風險因子，包括性別偏見和假陰性率，使用了一個包含 920 個患者記錄的資料集，採用各種少量學習情境。結果表明，傳統的臨床機器學習 (ML) 模型通常在零次學習和少量學習設定中表現優於 LLM。然而，當使用少量學習範例以及有效的可解釋 AI (XAI) 方法作為領域知識來源時，效能差距會顯著縮小。此外，隨著時間充足和範例數量增加，對話方式 (NC) 幾乎可以媲美 ML 模型的效能。最值得注意的是，LLM 相對於 ML 模型展現出相當或更佳的成本敏感準確度。
-本研究證實，透過適當的領域知識和量身打造的溝通策略，LLM 可以顯著增強診斷程序。這些發現突顯了最佳化訓練範例數量和溝通方式的重要性，以提高準確度並減少 LLM 應用中的偏差。
+##### **Hybrid Graphs for Table-and-Text based Question Answering using LLMs**
+2501.17767v1 by Ankush Agarwal, Ganesh S, Chaitanya Devaguptapu
 
-##### **To Trust or Not to Trust: Towards a novel approach to measure trust for XAI systems**
-2405.05766v1 by Miquel Miró-Nicolau, Gabriel Moyà-Alcover, Antoni Jaume-i-Capó, Manuel González-Hidalgo, Maria Gemma Sempere Campello, Juan Antonio Palmer Sancho
+Answering questions that require reasoning and aggregation across both
+structured (tables) and unstructured (raw text) data sources presents
+significant challenges. Current methods rely on fine-tuning and high-quality,
+human-curated data, which is difficult to obtain. Recent advances in Large
+Language Models (LLMs) have shown promising results for multi-hop question
+answering (QA) over single-source text data in a zero-shot setting, yet
+exploration into multi-source Table-Text QA remains limited. In this paper, we
+present a novel Hybrid Graph-based approach for Table-Text QA that leverages
+LLMs without fine-tuning. Our method constructs a unified Hybrid Graph from
+textual and tabular data, pruning information based on the input question to
+provide the LLM with relevant context concisely. We evaluate our approach on
+the challenging Hybrid-QA and OTT-QA datasets using state-of-the-art LLMs,
+including GPT-3.5, GPT-4, and LLaMA-3. Our method achieves the best zero-shot
+performance on both datasets, improving Exact Match scores by up to 10% on
+Hybrid-QA and 5.4% on OTT-QA. Moreover, our approach reduces token usage by up
+to 53% compared to the original context.
 
-The increasing reliance on Deep Learning models, combined with their inherent
-lack of transparency, has spurred the development of a novel field of study
-known as eXplainable AI (XAI) methods. These methods seek to enhance the trust
-of end-users in automated systems by providing insights into the rationale
-behind their decisions. This paper presents a novel approach for measuring user
-trust in XAI systems, allowing their refinement. Our proposed metric combines
-both performance metrics and trust indicators from an objective perspective. To
-validate this novel methodology, we conducted a case study in a realistic
-medical scenario: the usage of XAI system for the detection of pneumonia from
-x-ray images.
+摘要：回答需要對結構化（表格）和非結構化（原始文字）資料來源進行推理和彙總的問題會帶來重大挑戰。目前的辦法仰賴微調和高品質、人工整理的資料，而這很難取得。大型語言模型（LLM）的最新進展已針對零次學習設定的單一來源文字資料多跳問題回答（QA）展現出有希望的結果，但對多來源表格文字 QA 的探討仍然有限。在本文中，我們提出了一種新穎的基於混合圖表的表格文字 QA 方法，它利用 LLM 而無需微調。我們的辦法從文字和表格資料建構一個統一的混合圖表，根據輸入問題修剪資訊，以簡潔地為 LLM 提供相關脈絡。我們使用最先進的 LLM，包括 GPT-3.5、GPT-4 和 LLaMA-3，針對具有挑戰性的 Hybrid-QA 和 OTT-QA 資料集評估我們的辦法。我們的辦法在兩個資料集上都達到了最佳的零次學習效能，在 Hybrid-QA 上將完全比對分數提高了 10%，在 OTT-QA 上將完全比對分數提高了 5.4%。此外，與原始脈絡相比，我們的辦法將符號使用量減少了 53%。
 
-摘要：隨著對深度學習模型依賴性的增加，加上其固有的透明度不足，促使一個新的研究領域發展，稱為可解釋 AI (XAI) 方法。這些方法旨在透過深入了解決策背後的原理，來提升最終使用者對自動化系統的信賴。本文提出了一種衡量使用者對 XAI 系統信賴度的新穎方法，允許對其進行改進。我們提出的指標結合了客觀觀點下的效能指標和信賴指標。為了驗證這個新穎的方法，我們在一個真實的醫療場景中進行了一個案例研究：使用 XAI 系統從 X 光影像中偵測肺炎。
+##### **Query-Aware Learnable Graph Pooling Tokens as Prompt for Large Language Models**
+2501.17549v1 by Wooyoung Kim, Byungyoon Park, Wooju Kim
 
-##### **Region-specific Risk Quantification for Interpretable Prognosis of COVID-19**
-2405.02815v1 by Zhusi Zhong, Jie Li, Zhuoqi Ma, Scott Collins, Harrison Bai, Paul Zhang, Terrance Healey, Xinbo Gao, Michael K. Atalay, Zhicheng Jiao
+Graph-structured data plays a vital role in numerous domains, such as social
+networks, citation networks, commonsense reasoning graphs and knowledge graphs.
+While graph neural networks have been employed for graph processing, recent
+advancements have explored integrating large language models for graph-based
+tasks. In this paper, we propose a novel approach named Learnable Graph Pooling
+Token (LGPT), which addresses the limitations of the scalability issues in
+node-level projection and information loss in graph-level projection. LGPT
+enables flexible and efficient graph representation by introducing learnable
+parameters that act as tokens in large language models, balancing fine-grained
+and global graph information. Additionally, we investigate an Early Query
+Fusion technique, which fuses query context before constructing the graph
+representation, leading to more effective graph embeddings. Our method achieves
+a 4.13\% performance improvement on the GraphQA benchmark without training the
+large language model, demonstrating significant gains in handling complex
+textual-attributed graph data.
 
-The COVID-19 pandemic has strained global public health, necessitating
-accurate diagnosis and intervention to control disease spread and reduce
-mortality rates. This paper introduces an interpretable deep survival
-prediction model designed specifically for improved understanding and trust in
-COVID-19 prognosis using chest X-ray (CXR) images. By integrating a large-scale
-pretrained image encoder, Risk-specific Grad-CAM, and anatomical region
-detection techniques, our approach produces regional interpretable outcomes
-that effectively capture essential disease features while focusing on rare but
-critical abnormal regions. Our model's predictive results provide enhanced
-clarity and transparency through risk area localization, enabling clinicians to
-make informed decisions regarding COVID-19 diagnosis with better understanding
-of prognostic insights. We evaluate the proposed method on a multi-center
-survival dataset and demonstrate its effectiveness via quantitative and
-qualitative assessments, achieving superior C-indexes (0.764 and 0.727) and
-time-dependent AUCs (0.799 and 0.691). These results suggest that our
-explainable deep survival prediction model surpasses traditional survival
-analysis methods in risk prediction, improving interpretability for clinical
-decision making and enhancing AI system trustworthiness.
+摘要：圖形結構資料在許多領域中扮演著至關重要的角色，例如社交網路、引用網路、常識推理圖形和知識圖形。雖然圖形神經網路已用於圖形處理，但最近的進展已探討整合大型語言模型以進行基於圖形的任務。在本文中，我們提出了一種名為可學習圖形池化令牌 (LGPT) 的新方法，它解決了節點層級投影中的可擴充性問題和圖形層級投影中的資訊遺失限制。LGPT 透過引入可學習的參數（在大型語言模型中作為令牌運作）來啟用彈性和高效的圖形表示，平衡細粒度和整體圖形資訊。此外，我們研究了一種早期查詢融合技術，它在建構圖形表示之前融合查詢內容，進而產生更有效的圖形嵌入。我們的方法在 GraphQA 基準上達到了 4.13% 的效能提升，而無需訓練大型語言模型，證明了在處理複雜的文字屬性圖形資料方面有顯著的進展。
 
-摘要：COVID-19 疫情對全球公共衛生造成壓力，必須進行準確的診斷和干預，以控制疾病傳播並降低死亡率。本文介紹了一個可解釋的深度生存預測模型，專門設計用於透過胸部 X 光 (CXR) 影像改善對 COVID-19 預後的理解和信賴。透過整合大規模預訓練影像編碼器、風險特定 Grad-CAM 和解剖區域偵測技術，我們的做法產生區域可解釋的結果，有效捕捉必要的疾病特徵，同時專注於罕見但關鍵的異常區域。我們的模型預測結果透過風險區域定位提供增強的清晰度和透明度，讓臨床醫生能夠在更了解預後見解的情況下，就 COVID-19 診斷做出明智的決策。我們在多中心生存資料集上評估所提出的方法，並透過量化和質化評估證明其有效性，達到優異的 C 指數（0.764 和 0.727）和時間相關 AUC（0.799 和 0.691）。這些結果表明，我們可解釋的深度生存預測模型在風險預測方面超越傳統的生存分析方法，提升臨床決策的解釋性，並增強 AI 系統的信賴度。
+##### **General Scene Adaptation for Vision-and-Language Navigation**
+2501.17403v1 by Haodong Hong, Yanyuan Qiao, Sen Wang, Jiajun Liu, Qi Wu
 
-##### **Rad4XCNN: a new agnostic method for post-hoc global explanation of CNN-derived features by means of radiomics**
-2405.02334v2 by Francesco Prinzi, Carmelo Militello, Calogero Zarcaro, Tommaso Vincenzo Bartolotta, Salvatore Gaglio, Salvatore Vitabile
+Vision-and-Language Navigation (VLN) tasks mainly evaluate agents based on
+one-time execution of individual instructions across multiple environments,
+aiming to develop agents capable of functioning in any environment in a
+zero-shot manner. However, real-world navigation robots often operate in
+persistent environments with relatively consistent physical layouts, visual
+observations, and language styles from instructors. Such a gap in the task
+setting presents an opportunity to improve VLN agents by incorporating
+continuous adaptation to specific environments. To better reflect these
+real-world conditions, we introduce GSA-VLN, a novel task requiring agents to
+execute navigation instructions within a specific scene and simultaneously
+adapt to it for improved performance over time. To evaluate the proposed task,
+one has to address two challenges in existing VLN datasets: the lack of OOD
+data, and the limited number and style diversity of instructions for each
+scene. Therefore, we propose a new dataset, GSA-R2R, which significantly
+expands the diversity and quantity of environments and instructions for the R2R
+dataset to evaluate agent adaptability in both ID and OOD contexts.
+Furthermore, we design a three-stage instruction orchestration pipeline that
+leverages LLMs to refine speaker-generated instructions and apply role-playing
+techniques to rephrase instructions into different speaking styles. This is
+motivated by the observation that each individual user often has consistent
+signatures or preferences in their instructions. We conducted extensive
+experiments on GSA-R2R to thoroughly evaluate our dataset and benchmark various
+methods. Based on our findings, we propose a novel method, GR-DUET, which
+incorporates memory-based navigation graphs with an environment-specific
+training strategy, achieving state-of-the-art results on all GSA-R2R splits.
 
-In recent years, machine learning-based clinical decision support systems
-(CDSS) have played a key role in the analysis of several medical conditions.
-Despite their promising capabilities, the lack of transparency in AI models
-poses significant challenges, particularly in medical contexts where
-reliability is a mandatory aspect. However, it appears that explainability is
-inversely proportional to accuracy. For this reason, achieving transparency
-without compromising predictive accuracy remains a key challenge. This paper
-presents a novel method, namely Rad4XCNN, to enhance the predictive power of
-CNN-derived features with the inherent interpretability of radiomic features.
-Rad4XCNN diverges from conventional methods based on saliency maps, by
-associating intelligible meaning to CNN-derived features by means of Radiomics,
-offering new perspectives on explanation methods beyond visualization maps.
-Using a breast cancer classification task as a case study, we evaluated
-Rad4XCNN on ultrasound imaging datasets, including an online dataset and two
-in-house datasets for internal and external validation. Some key results are:
-i) CNN-derived features guarantee more robust accuracy when compared against
-ViT-derived and radiomic features; ii) conventional visualization map methods
-for explanation present several pitfalls; iii) Rad4XCNN does not sacrifice
-model accuracy for their explainability; iv) Rad4XCNN provides a global
-explanation enabling the physician to extract global insights and findings. Our
-method can mitigate some concerns related to the explainability-accuracy
-trade-off. This study highlighted the importance of proposing new methods for
-model explanation without affecting their accuracy.
+摘要：視覺語言導航 (VLN) 任務主要根據代理程式在多個環境中執行個別指令的一次性執行來評估代理程式，旨在開發能夠在任何環境中以零次學習的方式運作的代理程式。然而，真實世界的導航機器人通常在持續性的環境中運作，而這些環境具有相對一致的物理配置、視覺觀察和指令的語言風格。任務設定中的這種差距提供了一個機會，可以透過將連續適應特定環境納入其中來改善 VLN 代理程式。為了更好地反映這些真實世界的條件，我們推出了 GSA-VLN，這是一個新任務，要求代理程式在特定場景中執行導航指令，並同時適應該場景，以隨著時間推移而提高效能。為了評估所提出的任務，必須解決現有 VLN 資料集中的兩個挑戰：缺乏 OOD 資料，以及每個場景的指令數量和風格多樣性有限。因此，我們提出了一個新的資料集 GSA-R2R，它顯著擴展了 R2R 資料集的環境和指令的多樣性和數量，以評估代理程式在 ID 和 OOD 背景下的適應能力。此外，我們設計了一個三階段指令編排管道，該管道利用大型語言模型 (LLM) 來精煉由說話者產生的指令，並應用角色扮演技巧將指令改寫成不同的說話風格。這項技術的靈感來自於觀察到每個個別使用者通常在其指令中具有相符的簽名或偏好。我們針對 GSA-R2R 進行了大量的實驗，以徹底評估我們的資料集和基準各種方法。根據我們的研究結果，我們提出了一種新的方法 GR-DUET，它將基於記憶的導航圖表與特定於環境的訓練策略結合在一起，在所有 GSA-R2R 分割中取得了最先進的結果。
+
+##### **Comprehensive Evaluation for a Large Scale Knowledge Graph Question Answering Service**
+2501.17270v1 by Saloni Potdar, Daniel Lee, Omar Attia, Varun Embar, De Meng, Ramesh Balaji, Chloe Seivwright, Eric Choi, Mina H. Farid, Yiwen Sun, Yunyao Li
+
+Question answering systems for knowledge graph (KGQA), answer factoid
+questions based on the data in the knowledge graph. KGQA systems are complex
+because the system has to understand the relations and entities in the
+knowledge-seeking natural language queries and map them to structured queries
+against the KG to answer them. In this paper, we introduce Chronos, a
+comprehensive evaluation framework for KGQA at industry scale. It is designed
+to evaluate such a multi-component system comprehensively, focusing on (1)
+end-to-end and component-level metrics, (2) scalable to diverse datasets and
+(3) a scalable approach to measure the performance of the system prior to
+release. In this paper, we discuss the unique challenges associated with
+evaluating KGQA systems at industry scale, review the design of Chronos, and
+how it addresses these challenges. We will demonstrate how it provides a base
+for data-driven decisions and discuss the challenges of using it to measure and
+improve a real-world KGQA system.
 
-摘要：<paragraph>近年来，基于机器学习的临床决策支持系统 (CDSS) 在多种疾病的分析中扮演了关键角色。尽管它们具有广阔的前景，但 AI 模型缺乏透明度，尤其在医疗领域，可靠性是强制性方面，这带来了重大挑战。然而，解释性似乎与准确性成反比。因此，在不影响预测准确性的情况下实现透明度仍然是一个关键挑战。本文提出了一种新方法，即 Rad4XCNN，以通过放射组学的内在可解释性来增强 CNN 衍生特征的预测能力。Rad4XCNN 通过放射组学将可理解的含义与 CNN 衍生特征关联起来，从而偏离了基于显着性图的传统方法，为超越可视化图的解释方法提供了新的视角。使用乳腺癌分类任务作为案例研究，我们在超声成像数据集上评估了 Rad4XCNN，包括一个在线数据集和两个用于内部和外部验证的内部数据集。一些关键结果是：i) 与 ViT 衍生和放射组学特征相比，CNN 衍生特征保证了更稳健的准确性；ii) 用于解释的传统可视化图方法存在一些缺陷；iii) Rad4XCNN 不会为了可解释性而牺牲模型准确性；iv) Rad4XCNN 提供全局解释，使医生能够提取全局见解和发现。我们的方法可以减轻一些与可解释性-准确性权衡相关的担忧。本研究强调了提出新方法来解释模型而不影响其准确性的重要性。</paragraph>
+摘要：知識圖譜問答系統 (KGQA) 根據知識圖譜中的資料回答事實問題。KGQA 系統很複雜，因為系統必須理解知識尋求自然語言查詢中的關係和實體，並將它們對映到針對知識圖譜的結構化查詢，才能回答這些查詢。在本文中，我們介紹了 Chronos，這是一個用於產業規模 KGQA 的全面評估框架。它旨在全面評估這種多組件系統，重點關注：(1) 端對端和組件層級指標，(2) 可擴充至各種資料集，以及 (3) 可擴充的方法，用於在釋出前衡量系統的效能。在本文中，我們討論了與產業規模 KGQA 系統評估相關的獨特挑戰，檢視 Chronos 的設計，以及它如何應對這些挑戰。我們將展示它如何提供資料驅動決策的基礎，並討論使用它來衡量和改善真實世界 KGQA 系統的挑戰。
 
-##### **Attributing Responsibility in AI-Induced Incidents: A Computational Reflective Equilibrium Framework for Accountability**
-2404.16957v1 by Yunfei Ge, Quanyan Zhu
+##### **FactCG: Enhancing Fact Checkers with Graph-Based Multi-Hop Data**
+2501.17144v1 by Deren Lei, Yaxi Li, Siyao Li, Mengya Hu, Rui Xu, Ken Archer, Mingyu Wang, Emily Ching, Alex Deng
 
-The pervasive integration of Artificial Intelligence (AI) has introduced
-complex challenges in the responsibility and accountability in the event of
-incidents involving AI-enabled systems. The interconnectivity of these systems,
-ethical concerns of AI-induced incidents, coupled with uncertainties in AI
-technology and the absence of corresponding regulations, have made traditional
-responsibility attribution challenging. To this end, this work proposes a
-Computational Reflective Equilibrium (CRE) approach to establish a coherent and
-ethically acceptable responsibility attribution framework for all stakeholders.
-The computational approach provides a structured analysis that overcomes the
-limitations of conceptual approaches in dealing with dynamic and multifaceted
-scenarios, showcasing the framework's explainability, coherence, and adaptivity
-properties in the responsibility attribution process. We examine the pivotal
-role of the initial activation level associated with claims in equilibrium
-computation. Using an AI-assisted medical decision-support system as a case
-study, we illustrate how different initializations lead to diverse
-responsibility distributions. The framework offers valuable insights into
-accountability in AI-induced incidents, facilitating the development of a
-sustainable and resilient system through continuous monitoring, revision, and
-reflection.
+Prior research on training grounded factuality classification models to
+detect hallucinations in large language models (LLMs) has relied on public
+natural language inference (NLI) data and synthetic data. However, conventional
+NLI datasets are not well-suited for document-level reasoning, which is
+critical for detecting LLM hallucinations. Recent approaches to document-level
+synthetic data generation involve iteratively removing sentences from documents
+and annotating factuality using LLM-based prompts. While effective, this method
+is computationally expensive for long documents and limited by the LLM's
+capabilities. In this work, we analyze the differences between existing
+synthetic training data used in state-of-the-art models and real LLM output
+claims. Based on our findings, we propose a novel approach for synthetic data
+generation, CG2C, that leverages multi-hop reasoning on context graphs
+extracted from documents. Our fact checker model, FactCG, demonstrates improved
+performance with more connected reasoning, using the same backbone models.
+Experiments show it even outperforms GPT-4-o on the LLM-Aggrefact benchmark
+with much smaller model size.
 
-摘要：隨著人工智慧 (AI) 的普及整合，在涉及 AI 驅動系統的事故中，責任和義務歸屬產生了複雜的挑戰。這些系統的互連性、AI 引發事故的倫理問題，加上 AI 技術的不確定性和缺乏相應法規，使得傳統責任歸屬面臨挑戰。為此，本研究提出了一種計算反思均衡 (CRE) 方法，以建立一個連貫且在倫理上可接受的責任歸屬架構，適用於所有利害關係人。計算方法提供了結構化的分析，克服了概念方法在處理動態且多面向情境時的限制，展示了該架構在責任歸屬過程中具備的可解釋性、連貫性和適應性。我們探討了與均衡計算中索賠相關的初始啟動層級的關鍵作用。我們以 AI 輔助醫療決策支援系統為案例研究，說明不同的初始化如何導致不同的責任分配。該架構提供了對 AI 引發事故中問責制的寶貴見解，透過持續監控、修訂和反思，促進了永續且有韌性的系統發展。
+摘要：先前的研究訓練了基於事實的分類模型，以偵測大型語言模型 (LLM) 中的幻覺，依賴於公開的自然語言推論 (NLI) 資料和合成資料。然而，傳統的 NLI 資料集並不適合文件層級的推理，這對於偵測 LLM 的幻覺至關重要。最近的文件層級合成資料生成方法涉及從文件中反覆移除句子，並使用基於 LLM 的提示註解事實。雖然有效，但此方法對於長文件來說在運算上很昂貴，且受限於 LLM 的能力。在這項工作中，我們分析了現有合成訓練資料與最先進模型中使用的真實 LLM 輸出宣告之間的差異。根據我們的研究結果，我們提出了一個用於合成資料生成的創新方法 CG2C，它利用從文件中提取的內容圖表進行多跳推理。我們的查核模型 FactCG 使用相同的骨幹模型，展示了在更多連結的推理下改進的效能。實驗表明，它甚至在 LLM-Aggrefact 基準上優於 GPT-4-o，且模型大小小得多。
 
-##### **Explainable AI for Fair Sepsis Mortality Predictive Model**
-2404.13139v1 by Chia-Hsuan Chang, Xiaoyang Wang, Christopher C. Yang
+##### **LLM-AutoDiff: Auto-Differentiate Any LLM Workflow**
+2501.16673v2 by Li Yin, Zhangyang Wang
 
-Artificial intelligence supports healthcare professionals with predictive
-modeling, greatly transforming clinical decision-making. This study addresses
-the crucial need for fairness and explainability in AI applications within
-healthcare to ensure equitable outcomes across diverse patient demographics. By
-focusing on the predictive modeling of sepsis-related mortality, we propose a
-method that learns a performance-optimized predictive model and then employs
-the transfer learning process to produce a model with better fairness. Our
-method also introduces a novel permutation-based feature importance algorithm
-aiming at elucidating the contribution of each feature in enhancing fairness on
-predictions. Unlike existing explainability methods concentrating on explaining
-feature contribution to predictive performance, our proposed method uniquely
-bridges the gap in understanding how each feature contributes to fairness. This
-advancement is pivotal, given sepsis's significant mortality rate and its role
-in one-third of hospital deaths. Our method not only aids in identifying and
-mitigating biases within the predictive model but also fosters trust among
-healthcare stakeholders by improving the transparency and fairness of model
-predictions, thereby contributing to more equitable and trustworthy healthcare
-delivery.
+Large Language Models (LLMs) have reshaped natural language processing,
+powering applications from multi-hop retrieval and question answering to
+autonomous agent workflows. Yet, prompt engineering -- the task of crafting
+textual inputs to effectively direct LLMs -- remains difficult and
+labor-intensive, particularly for complex pipelines that combine multiple LLM
+calls with functional operations like retrieval and data formatting. We
+introduce LLM-AutoDiff: a novel framework for Automatic Prompt Engineering
+(APE) that extends textual gradient-based methods (such as Text-Grad) to
+multi-component, potentially cyclic LLM architectures. Implemented within the
+AdalFlow library, LLM-AutoDiff treats each textual input as a trainable
+parameter and uses a frozen backward engine LLM to generate feedback-akin to
+textual gradients -- that guide iterative prompt updates. Unlike prior
+single-node approaches, LLM-AutoDiff inherently accommodates functional nodes,
+preserves time-sequential behavior in repeated calls (e.g., multi-hop loops),
+and combats the "lost-in-the-middle" problem by isolating distinct sub-prompts
+(instructions, formats, or few-shot examples). It further boosts training
+efficiency by focusing on error-prone samples through selective gradient
+computation. Across diverse tasks, including single-step classification,
+multi-hop retrieval-based QA, and agent-driven pipelines, LLM-AutoDiff
+consistently outperforms existing textual gradient baselines in both accuracy
+and training cost. By unifying prompt optimization through a graph-centric
+lens, LLM-AutoDiff offers a powerful new paradigm for scaling and automating
+LLM workflows - mirroring the transformative role that automatic
+differentiation libraries have long played in neural network research.
 
-摘要：人工智慧透過預測模型協助醫療專業人員，大幅轉變了臨床決策制定。本研究探討了在醫療保健中使用人工智慧應用程式時公平性和可解釋性的關鍵需求，以確保在不同的患者人口統計資料中獲得公平的結果。透過專注於敗血症相關死亡率的預測模型，我們提出了一種方法，該方法會學習一個效能最佳化的預測模型，然後採用轉移學習過程來產生一個具有更好公平性的模型。我們的模型還引入了一種新穎的基於排列的特徵重要性演算法，旨在闡明每個特徵在增強預測公平性方面的貢獻。與現有的可解釋性方法專注於解釋特徵對預測效能的貢獻不同，我們提出的方法獨特地彌補了理解每個特徵如何有助於公平性的差距。這項進展至關重要，因為敗血症的死亡率很高，且在三分之一的醫院死亡中扮演著角色。我們的模型不僅有助於識別和減輕預測模型中的偏差，還能透過提高模型預測的透明度和公平性來培養醫療保健利益相關者之間的信任，進而有助於提供更公平且值得信賴的醫療保健服務。
+摘要：大型語言模型 (LLM) 已重塑自然語言處理，
+為從多跳檢索和問答到
+自主代理工作流程的應用提供動力。然而，提示工程 -- 編寫
+文本輸入以有效指導 LLM 的任務 -- 仍然困難且
+勞動密集，特別是對於將多個 LLM
+呼叫與檢索和數據格式化等功能操作相結合的複雜管道。我們
+介紹 LLM-AutoDiff：一個用於自動提示工程 (APE) 的新框架，它將基於文本梯度的
+方法（例如 Text-Grad）擴展到多組件、潛在循環 LLM 架構中。在
+AdalFlow 庫中實施，LLM-AutoDiff 將每個文本輸入視為一個可訓練
+參數，並使用凍結的後向引擎 LLM 生成反饋——類似於
+文本梯度——指導迭代提示更新。與先前的
+單節點方法不同，LLM-AutoDiff 本質上適應功能節點，
+在重複呼叫（例如，多跳循環）中保留時間順序行為，
+並通過隔離不同的子提示（說明、格式或少數鏡頭示例）來解決“迷失在中間”問題。它進一步提高訓練
+效率，通過選擇性梯度
+計算專注於容易出錯的樣本。在包括單步分類、
+多跳基於檢索的問答和代理驅動管道在內的各種任務中，LLM-AutoDiff
+在準確性和訓練成本方面始終優於現有的文本梯度基準。通過圖形中心化
+視角統一提示優化，LLM-AutoDiff 為擴展和自動化
+LLM 工作流程提供了一個強大的新範例——反映了自動
+微分庫在神經網絡研究中長期扮演的變革性角色。
 
-##### **Multi Class Depression Detection Through Tweets using Artificial Intelligence**
-2404.13104v1 by Muhammad Osama Nusrat, Waseem Shahzad, Saad Ahmed Jamal
+##### **360Brew: A Decoder-only Foundation Model for Personalized Ranking and Recommendation**
+2501.16450v3 by Hamed Firooz, Maziar Sanjabi, Adrian Englhardt, Aman Gupta, Ben Levine, Dre Olgiati, Gungor Polatkan, Iuliia Melnychuk, Karthik Ramgopal, Kirill Talanine, Kutta Srinivasan, Luke Simon, Natesh Sivasubramoniapillai, Necip Fazil Ayan, Qingquan Song, Samira Sriram, Souvik Ghosh, Tao Song, Tejas Dharamsi, Vignesh Kothapalli, Xiaoling Zhai, Ya Xu, Yu Wang, Yun Dai
 
-Depression is a significant issue nowadays. As per the World Health
-Organization (WHO), in 2023, over 280 million individuals are grappling with
-depression. This is a huge number; if not taken seriously, these numbers will
-increase rapidly. About 4.89 billion individuals are social media users. People
-express their feelings and emotions on platforms like Twitter, Facebook,
-Reddit, Instagram, etc. These platforms contain valuable information which can
-be used for research purposes. Considerable research has been conducted across
-various social media platforms. However, certain limitations persist in these
-endeavors. Particularly, previous studies were only focused on detecting
-depression and the intensity of depression in tweets. Also, there existed
-inaccuracies in dataset labeling. In this research work, five types of
-depression (Bipolar, major, psychotic, atypical, and postpartum) were predicted
-using tweets from the Twitter database based on lexicon labeling. Explainable
-AI was used to provide reasoning by highlighting the parts of tweets that
-represent type of depression. Bidirectional Encoder Representations from
-Transformers (BERT) was used for feature extraction and training. Machine
-learning and deep learning methodologies were used to train the model. The BERT
-model presented the most promising results, achieving an overall accuracy of
-0.96.
+Ranking and recommendation systems are the foundation for numerous online
+experiences, ranging from search results to personalized content delivery.
+These systems have evolved into complex, multilayered architectures that
+leverage vast datasets and often incorporate thousands of predictive models.
+The maintenance and enhancement of these models is a labor intensive process
+that requires extensive feature engineering. This approach not only exacerbates
+technical debt but also hampers innovation in extending these systems to
+emerging problem domains. In this report, we present our research to address
+these challenges by utilizing a large foundation model with a textual interface
+for ranking and recommendation tasks. We illustrate several key advantages of
+our approach: (1) a single model can manage multiple predictive tasks involved
+in ranking and recommendation, (2) decoder models with textual interface due to
+their comprehension of reasoning capabilities, can generalize to new
+recommendation surfaces and out-of-domain problems, and (3) by employing
+natural language interfaces for task definitions and verbalizing member
+behaviors and their social connections, we eliminate the need for feature
+engineering and the maintenance of complex directed acyclic graphs of model
+dependencies. We introduce our research pre-production model, 360Brew V1.0, a
+150B parameter, decoder-only model that has been trained and fine-tuned on
+LinkedIn's data and tasks. This model is capable of solving over 30 predictive
+tasks across various segments of the LinkedIn platform, achieving performance
+levels comparable to or exceeding those of current production systems based on
+offline metrics, without task-specific fine-tuning. Notably, each of these
+tasks is conventionally addressed by dedicated models that have been developed
+and maintained over multiple years by teams of a similar or larger size than
+our own.
 
-摘要：現今，憂鬱症是一個重要的議題。根據世界衛生組織 (WHO) 的資料，在 2023 年，超過 2.8 億人正在與憂鬱症搏鬥。這是一個龐大的數字；如果不認真看待，這些數字將會快速增加。大約有 48.9 億人是社群媒體使用者。人們在 Twitter、Facebook、Reddit、Instagram 等平台上表達自己的感受和情緒。這些平台包含有價值的資訊，可用於研究目的。已經在各種社群媒體平台上進行了大量的研究。然而，這些努力仍存在某些限制。特別是，先前的研究僅專注於偵測推文中的憂鬱症和憂鬱症的強度。此外，資料集標籤中存在不準確的情況。在這項研究工作中，使用基於詞彙標籤的 Twitter 資料庫中的推文預測了五種類型的憂鬱症（雙極型、重度、精神病型、非典型和產後）。可解釋的 AI 用於透過強調代表憂鬱症類型的推文部分來提供推理。從 Transformers（BERT）中提取的雙向編碼器表示用於特徵提取和訓練。機器學習和深度學習方法用於訓練模型。BERT 模型呈現出最有希望的結果，達到 0.96 的整體準確度。
+摘要：排名和推薦系統是許多線上體驗的基礎，從搜尋結果到個人化內容傳遞。
+這些系統已演變成複雜的多層架構，利用龐大的資料集，並經常納入數千個預測模型。
+這些模型的維護和增強是一個勞力密集的過程，需要廣泛的特徵工程。
+這種方法不僅加劇了技術債務，也阻礙了將這些系統擴展到新興問題領域的創新。
+在此報告中，我們提出了我們的研究，以利用具有文字介面的大型基礎模型來解決這些挑戰，以進行排名和推薦任務。
+我們說明了我們方法的幾個主要優點：(1) 單一模型可以管理排名和推薦中涉及的多個預測任務，(2) 由於解碼器模型具有文字介面，因此它們對推理能力的理解，可以推廣到新的推薦表面和領域外問題，以及 (3) 通過採用自然語言介面進行任務定義和表達成員行為及其社交連接，我們消除了對特徵工程和維護複雜的模型相依性有向無環圖的需求。
+我們介紹了我們的研究前製作業模型 360Brew V1.0，這是一個 150B 參數，僅解碼器模型，已在 LinkedIn 的資料和任務上進行訓練和微調。
+此模型能夠解決 LinkedIn 平臺各個區塊中超過 30 個預測任務，在不針對任務進行微調的情況下，達到與基於離線指標的現行製作系統相當或超越的效能水準。
+值得注意的是，這些任務中的每個任務通常由專用模型處理，這些模型是由與我們規模相當或更大的團隊在多年間開發和維護的。
 
-##### **COIN: Counterfactual inpainting for weakly supervised semantic segmentation for medical images**
-2404.12832v2 by Dmytro Shvetsov, Joonas Ariva, Marharyta Domnich, Raul Vicente, Dmytro Fishman
+##### **Raiders of the Lost Dependency: Fixing Dependency Conflicts in Python using LLMs**
+2501.16191v1 by Antony Bartlett, Cynthia Liem, Annibale Panichella
 
-Deep learning is dramatically transforming the field of medical imaging and
-radiology, enabling the identification of pathologies in medical images,
-including computed tomography (CT) and X-ray scans. However, the performance of
-deep learning models, particularly in segmentation tasks, is often limited by
-the need for extensive annotated datasets. To address this challenge, the
-capabilities of weakly supervised semantic segmentation are explored through
-the lens of Explainable AI and the generation of counterfactual explanations.
-The scope of this research is development of a novel counterfactual inpainting
-approach (COIN) that flips the predicted classification label from abnormal to
-normal by using a generative model. For instance, if the classifier deems an
-input medical image X as abnormal, indicating the presence of a pathology, the
-generative model aims to inpaint the abnormal region, thus reversing the
-classifier's original prediction label. The approach enables us to produce
-precise segmentations for pathologies without depending on pre-existing
-segmentation masks. Crucially, image-level labels are utilized, which are
-substantially easier to acquire than creating detailed segmentation masks. The
-effectiveness of the method is demonstrated by segmenting synthetic targets and
-actual kidney tumors from CT images acquired from Tartu University Hospital in
-Estonia. The findings indicate that COIN greatly surpasses established
-attribution methods, such as RISE, ScoreCAM, and LayerCAM, as well as an
-alternative counterfactual explanation method introduced by Singla et al. This
-evidence suggests that COIN is a promising approach for semantic segmentation
-of tumors in CT images, and presents a step forward in making deep learning
-applications more accessible and effective in healthcare, where annotated data
-is scarce.
+Fixing Python dependency issues is a tedious and error-prone task for
+developers, who must manually identify and resolve environment dependencies and
+version constraints of third-party modules and Python interpreters. Researchers
+have attempted to automate this process by relying on large knowledge graphs
+and database lookup tables. However, these traditional approaches face
+limitations due to the variety of dependency error types, large sets of
+possible module versions, and conflicts among transitive dependencies. This
+study explores the potential of using large language models (LLMs) to
+automatically fix dependency issues in Python programs. We introduce PLLM
+(pronounced "plum"), a novel technique that employs retrieval-augmented
+generation (RAG) to help an LLM infer Python versions and required modules for
+a given Python file. PLLM builds a testing environment that iteratively (1)
+prompts the LLM for module combinations, (2) tests the suggested changes, and
+(3) provides feedback (error messages) to the LLM to refine the fix. This
+feedback cycle leverages natural language processing (NLP) to intelligently
+parse and interpret build error messages. We benchmark PLLM on the Gistable
+HG2.9K dataset, a collection of challenging single-file Python gists. We
+compare PLLM against two state-of-the-art automatic dependency inference
+approaches, namely PyEGo and ReadPyE, w.r.t. the ability to resolve dependency
+issues. Our results indicate that PLLM can fix more dependency issues than the
+two baselines, with +218 (+15.97%) more fixes over ReadPyE and +281 (+21.58%)
+over PyEGo. Our deeper analyses suggest that PLLM is particularly beneficial
+for projects with many dependencies and for specific third-party numerical and
+machine-learning modules. Our findings demonstrate the potential of LLM-based
+approaches to iteratively resolve Python dependency issues.
 
-摘要：深度学习正大幅轉變醫學影像和放射線學領域，能辨識醫學影像中的病理，包括電腦斷層掃描 (CT) 和 X 光掃描。然而，深度學習模型的效能，特別是在分割任務中，常常受到廣泛註解資料集需求的限制。為了應對此挑戰，透過可解釋 AI 和反事實解釋的產生，探索弱監督語意分割的能力。本研究的範圍是開發一種新的反事實內插方法 (COIN)，該方法使用生成模型將預測的分類標籤從異常翻轉為正常。例如，如果分類器將輸入的醫學影像 X 視為異常，表示存在病理，則生成模型旨在內插異常區域，從而逆轉分類器的原始預測標籤。此方法使我們能夠產生病理的精確分割，而無需依賴於預先存在的分割遮罩。至關重要的是，利用影像層級標籤，這比建立詳細的分割遮罩容易取得。該方法的有效性透過分割合成目標和從愛沙尼亞塔爾圖大學醫院取得的 CT 影像中的實際腎臟腫瘤來證明。研究結果表明，COIN 遠遠超過已建立的歸因方法，例如 RISE、ScoreCAM 和 LayerCAM，以及 Singla 等人提出的另一種反事實解釋方法。此證據表明，COIN 是一種很有前途的 CT 影像中腫瘤語意分割方法，並在醫療保健中讓深度學習應用更易於取得和更有效率邁進一步，其中註解資料很稀少。
+摘要：<paragraph>修復 Python 依賴項問題對開發人員來說是一項繁瑣且容易出錯的任務，他們必須手動識別和解決第三方模組和 Python 解譯器的環境依賴項和版本限制。研究人員已嘗試透過依賴大型知識圖譜和資料庫查詢表來自動化此程序。然而，這些傳統方法由於依賴項錯誤類型多樣、可能的模組版本數量龐大，以及傳遞依賴項之間的衝突，而面臨限制。本研究探討使用大型語言模型 (LLM) 自動修復 Python 程式中的依賴項問題的可能性。我們介紹 PLLM（發音為「plum」），這是一種新穎的技術，採用檢索增強生成 (RAG) 來協助 LLM 推論 Python 版本和給定 Python 檔案所需的模組。PLLM 建立一個測試環境，反覆 (1) 提示 LLM 模組組合，(2) 測試建議的變更，以及 (3) 提供回饋（錯誤訊息）給 LLM 以改善修正。此回饋循環利用自然語言處理 (NLP) 來智慧解析和詮釋建置錯誤訊息。我們在 Gistable HG2.9K 資料集上對 PLLM 進行基準測試，該資料集是一個具有挑戰性的單一檔案 Python gist 集合。我們將 PLLM 與兩種最先進的自動依賴項推論方法進行比較，即 PyEGo 和 ReadPyE，以比較解決依賴項問題的能力。我們的結果顯示，PLLM 可以修復比這兩個基準更多的依賴項問題，比 ReadPyE 多修復了 +218 (+15.97%) 個，比 PyEGo 多修復了 +281 (+21.58%) 個。我們更深入的分析表明，PLLM 對具有許多依賴項的專案以及特定第三方數值和機器學習模組特別有益。我們的研究結果證明了基於 LLM 的方法反覆解決 Python 依賴項問題的可能性。</paragraph>
 
-##### **Hybrid Intelligence for Digital Humanities**
-2406.15374v1 by Victor de Boer, Lise Stork
+##### **Harnessing Diverse Perspectives: A Multi-Agent Framework for Enhanced Error Detection in Knowledge Graphs**
+2501.15791v1 by Yu Li, Yi Huang, Guilin Qi, Junlan Feng, Nan Hu, Songlin Zhai, Haohan Xue, Yongrui Chen, Ruoyan Shen, Tongtong Wu
+
+Knowledge graphs are widely used in industrial applications, making error
+detection crucial for ensuring the reliability of downstream applications.
+Existing error detection methods often fail to effectively leverage
+fine-grained subgraph information and rely solely on fixed graph structures,
+while also lacking transparency in their decision-making processes, which
+results in suboptimal detection performance. In this paper, we propose a novel
+Multi-Agent framework for Knowledge Graph Error Detection (MAKGED) that
+utilizes multiple large language models (LLMs) in a collaborative setting. By
+concatenating fine-grained, bidirectional subgraph embeddings with LLM-based
+query embeddings during training, our framework integrates these
+representations to produce four specialized agents. These agents utilize
+subgraph information from different dimensions to engage in multi-round
+discussions, thereby improving error detection accuracy and ensuring a
+transparent decision-making process. Extensive experiments on FB15K and WN18RR
+demonstrate that MAKGED outperforms state-of-the-art methods, enhancing the
+accuracy and robustness of KG evaluation. For specific industrial scenarios,
+our framework can facilitate the training of specialized agents using
+domain-specific knowledge graphs for error detection, which highlights the
+potential industrial application value of our framework. Our code and datasets
+are available at https://github.com/kse-ElEvEn/MAKGED.
 
-In this paper, we explore the synergies between Digital Humanities (DH) as a
-discipline and Hybrid Intelligence (HI) as a research paradigm. In DH research,
-the use of digital methods and specifically that of Artificial Intelligence is
-subject to a set of requirements and constraints. We argue that these are
-well-supported by the capabilities and goals of HI. Our contribution includes
-the identification of five such DH requirements: Successful AI systems need to
-be able to 1) collaborate with the (human) scholar; 2) support data criticism;
-3) support tool criticism; 4) be aware of and cater to various perspectives and
-5) support distant and close reading. We take the CARE principles of Hybrid
-Intelligence (collaborative, adaptive, responsible and explainable) as
-theoretical framework and map these to the DH requirements. In this mapping, we
-include example research projects. We finally address how insights from DH can
-be applied to HI and discuss open challenges for the combination of the two
-disciplines.
+摘要：知識圖譜廣泛應用於工業應用中，使得錯誤偵測對於確保下游應用的可靠性至關重要。現有的錯誤偵測方法通常無法有效利用細粒度的子圖資訊，並且僅依賴於固定的圖形結構，同時在它們的決策過程中也缺乏透明度，這導致次佳的偵測效能。在本文中，我們提出了一個用於知識圖譜錯誤偵測 (MAKGED) 的新多代理架構，它在協作設定中利用了多個大型語言模型 (LLM)。透過在訓練期間將細粒度、雙向子圖嵌入與基於 LLM 的查詢嵌入串接，我們的架構整合了這些表示以產生四個專門代理。這些代理利用不同維度的子圖資訊參與多輪討論，從而提高錯誤偵測準確度並確保透明的決策過程。在 FB15K 和 WN18RR 上的廣泛實驗表明，MAKGED 優於最先進的方法，增強了 KG 評估的準確性和穩健性。對於特定產業情境，我們的架構可以利用特定領域的知識圖譜來促進專門代理的訓練以進行錯誤偵測，這突顯了我們架構的潛在產業應用價值。我們的程式碼和資料集可在 https://github.com/kse-ElEvEn/MAKGED 取得。
 
-摘要：在本文中，我們探討數位人文學科 (DH) 作為一門學科與混合智能 (HI) 作為一個研究典範之間的協同作用。在 DH 研究中，數位方法的使用，特別是人工智慧的使用，受到一系列要求和限制。我們認為這些要求和限制獲得 HI 的能力和目標的充分支持。我們的貢獻包括找出五個這樣的 DH 要求：成功的 AI 系統需要能夠 1) 與（人類）學者合作；2) 支援資料批評；3) 支援工具批評；4) 察覺並迎合各種觀點；5) 支援遠距和近距離閱讀。我們將混合智能的 CARE 原則（協作、適應、負責和可解釋）作為理論架構，並將這些原則對應到 DH 要求。在此對應中，我們納入範例研究專案。最後，我們探討如何將 DH 的見解應用於 HI，並討論結合這兩個學科的開放挑戰。
+##### **Automatic Feedback Generation for Short Answer Questions using Answer Diagnostic Graphs**
+2501.15777v1 by Momoka Furuhashi, Hiroaki Funayama, Yuya Iwase, Yuichiroh Matsubayashi, Yoriko Isobe, Toru Nagahama, Saku Sugawara, Kentaro Inui
 
-##### **Ethical Framework for Responsible Foundational Models in Medical Imaging**
-2406.11868v1 by Abhijit Das, Debesh Jha, Jasmer Sanjotra, Onkar Susladkar, Suramyaa Sarkar, Ashish Rauniyar, Nikhil Tomar, Vanshali Sharma, Ulas Bagci
+Short-reading comprehension questions help students understand text structure
+but lack effective feedback. Students struggle to identify and correct errors,
+while manual feedback creation is labor-intensive. This highlights the need for
+automated feedback linking responses to a scoring rubric for deeper
+comprehension.
+  Despite advances in Natural Language Processing (NLP), research has focused
+on automatic grading, with limited work on feedback generation. To address
+this, we propose a system that generates feedback for student responses.
+  Our contributions are twofold. First, we introduce the first system for
+feedback on short-answer reading comprehension. These answers are derived from
+the text, requiring structural understanding. We propose an "answer diagnosis
+graph," integrating the text's logical structure with feedback templates. Using
+this graph and NLP techniques, we estimate students' comprehension and generate
+targeted feedback.
+  Second, we evaluate our feedback through an experiment with Japanese high
+school students (n=39). They answered two 70-80 word questions and were divided
+into two groups with minimal academic differences. One received a model answer,
+the other system-generated feedback. Both re-answered the questions, and we
+compared score changes. A questionnaire assessed perceptions and motivation.
+  Results showed no significant score improvement between groups, but
+system-generated feedback helped students identify errors and key points in the
+text. It also significantly increased motivation. However, further refinement
+is needed to enhance text structure understanding.
 
-Foundational models (FMs) have tremendous potential to revolutionize medical
-imaging. However, their deployment in real-world clinical settings demands
-extensive ethical considerations. This paper aims to highlight the ethical
-concerns related to FMs and propose a framework to guide their responsible
-development and implementation within medicine. We meticulously examine ethical
-issues such as privacy of patient data, bias mitigation, algorithmic
-transparency, explainability and accountability. The proposed framework is
-designed to prioritize patient welfare, mitigate potential risks, and foster
-trust in AI-assisted healthcare.
+摘要：短篇閱讀理解題目有助學生理解文章結構，但缺乏有效的回饋。學生難以找出並更正錯誤，而手動建立回饋又很費力。這突顯了自動化回饋的必要性，將回應連結到評分標準，以獲得更深入的理解。
 
-摘要：基礎模型 (FM) 具有徹底改變醫學影像的巨大潛力。然而，它們在現實世界臨床環境中的部署需要廣泛的倫理考量。本文旨在強調與 FM 相關的倫理問題，並提出一個框架來指導它們在醫學中的負責任開發和實施。我們仔細審查了倫理問題，例如患者數據隱私、偏差緩解、演算法透明度、可解釋性和問責制。所提出的框架旨在優先考慮患者福利、減輕潛在風險，並培養對 AI 輔助醫療保健的信任。
+儘管自然語言處理 (NLP) 有所進展，但研究一直集中在自動評分上，而回饋生成的工作有限。為了解決這個問題，我們提出了一個系統，用於為學生的回答產生回饋。
 
-##### **Advancements in Radiomics and Artificial Intelligence for Thyroid Cancer Diagnosis**
-2404.07239v1 by Milad Yousefi, Shadi Farabi Maleki, Ali Jafarizadeh, Mahya Ahmadpour Youshanlui, Aida Jafari, Siamak Pedrammehr, Roohallah Alizadehsani, Ryszard Tadeusiewicz, Pawel Plawiak
+我們的貢獻有兩個方面。首先，我們引入了第一個針對簡答閱讀理解提供回饋的系統。這些答案來自於文本，需要結構化的理解。我們提出了一個「答案診斷圖」，將文本的邏輯結構與回饋範本整合在一起。使用這個圖表和 NLP 技術，我們估計學生的理解力並產生有針對性的回饋。
 
-Thyroid cancer is an increasing global health concern that requires advanced
-diagnostic methods. The application of AI and radiomics to thyroid cancer
-diagnosis is examined in this review. A review of multiple databases was
-conducted in compliance with PRISMA guidelines until October 2023. A
-combination of keywords led to the discovery of an English academic publication
-on thyroid cancer and related subjects. 267 papers were returned from the
-original search after 109 duplicates were removed. Relevant studies were
-selected according to predetermined criteria after 124 articles were eliminated
-based on an examination of their abstract and title. After the comprehensive
-analysis, an additional six studies were excluded. Among the 28 included
-studies, radiomics analysis, which incorporates ultrasound (US) images,
-demonstrated its effectiveness in diagnosing thyroid cancer. Various results
-were noted, some of the studies presenting new strategies that outperformed the
-status quo. The literature has emphasized various challenges faced by AI
-models, including interpretability issues, dataset constraints, and operator
-dependence. The synthesized findings of the 28 included studies mentioned the
-need for standardization efforts and prospective multicenter studies to address
-these concerns. Furthermore, approaches to overcome these obstacles were
-identified, such as advances in explainable AI technology and personalized
-medicine techniques. The review focuses on how AI and radiomics could transform
-the diagnosis and treatment of thyroid cancer. Despite challenges, future
-research on multidisciplinary cooperation, clinical applicability validation,
-and algorithm improvement holds the potential to improve patient outcomes and
-diagnostic precision in the treatment of thyroid cancer.
+其次，我們透過一項針對日本高中生的實驗（n=39）來評估我們的回饋。他們回答了兩個 70-80 字的問題，並被分成兩組，學術差異最小。一組收到範本答案，另一組收到系統產生的回饋。兩組都重新回答了問題，我們比較了分數的變化。一份問卷評估了認知和動機。
 
-摘要：甲狀腺癌是一種日益嚴重的全球健康問題，需要先進的診斷方法。本篇評論探討了人工智能與放射特徵分析在甲狀腺癌診斷中的應用。在符合 PRISMA 指南的情況下，對多個資料庫進行了回顧，直到 2023 年 10 月。通過結合關鍵字，發現了一篇關於甲狀腺癌和相關主題的英文學術出版物。在移除 109 篇重複文獻後，原始搜尋共回傳 267 篇論文。在根據預先確定的標準，淘汰了 124 篇文章的摘要和標題後，選出了相關研究。在進行全面分析後，額外排除了六項研究。在納入的 28 項研究中，結合超音波 (US) 影像的放射特徵分析，證明了其在診斷甲狀腺癌方面的有效性。研究結果不一，有些研究提出了優於現狀的新策略。文獻強調了人工智能模型面臨的各種挑戰，包括可解釋性問題、資料集限制和操作員依賴性。28 項納入研究的綜合發現提到，需要標準化工作和前瞻性多中心研究來解決這些問題。此外，還確定了克服這些障礙的方法，例如可解釋人工智能技術和個人化醫療技術的進步。本篇評論重點探討了人工智能和放射特徵分析如何轉變甲狀腺癌的診斷和治療。儘管存在挑戰，但未來對多學科合作、臨床適用性驗證和演算法改進的研究，仍有潛力改善甲狀腺癌治療中的患者預後和診斷精準度。
+結果顯示兩組之間沒有顯著的分數進步，但系統產生的回饋有助於學生找出文本中的錯誤和重點。它也顯著地提高了動機。然而，需要進一步的改進來增強對文本結構的理解。
 
-##### **Predictive Modeling for Breast Cancer Classification in the Context of Bangladeshi Patients: A Supervised Machine Learning Approach with Explainable AI**
-2404.04686v1 by Taminul Islam, Md. Alif Sheakh, Mst. Sazia Tahosin, Most. Hasna Hena, Shopnil Akash, Yousef A. Bin Jardan, Gezahign Fentahun Wondmie, Hiba-Allah Nafidi, Mohammed Bourhia
+##### **Transformer-Based Multimodal Knowledge Graph Completion with Link-Aware Contexts**
+2501.15688v1 by Haodi Ma, Dzmitry Kasinets, Daisy Zhe Wang
 
-Breast cancer has rapidly increased in prevalence in recent years, making it
-one of the leading causes of mortality worldwide. Among all cancers, it is by
-far the most common. Diagnosing this illness manually requires significant time
-and expertise. Since detecting breast cancer is a time-consuming process,
-preventing its further spread can be aided by creating machine-based forecasts.
-Machine learning and Explainable AI are crucial in classification as they not
-only provide accurate predictions but also offer insights into how the model
-arrives at its decisions, aiding in the understanding and trustworthiness of
-the classification results. In this study, we evaluate and compare the
-classification accuracy, precision, recall, and F-1 scores of five different
-machine learning methods using a primary dataset (500 patients from Dhaka
-Medical College Hospital). Five different supervised machine learning
-techniques, including decision tree, random forest, logistic regression, naive
-bayes, and XGBoost, have been used to achieve optimal results on our dataset.
-Additionally, this study applied SHAP analysis to the XGBoost model to
-interpret the model's predictions and understand the impact of each feature on
-the model's output. We compared the accuracy with which several algorithms
-classified the data, as well as contrasted with other literature in this field.
-After final evaluation, this study found that XGBoost achieved the best model
-accuracy, which is 97%.
+Multimodal knowledge graph completion (MMKGC) aims to predict missing links
+in multimodal knowledge graphs (MMKGs) by leveraging information from various
+modalities alongside structural data. Existing MMKGC approaches primarily
+extend traditional knowledge graph embedding (KGE) models, which often require
+creating an embedding for every entity. This results in large model sizes and
+inefficiencies in integrating multimodal information, particularly for
+real-world graphs. Meanwhile, Transformer-based models have demonstrated
+competitive performance in knowledge graph completion (KGC). However, their
+focus on single-modal knowledge limits their capacity to utilize cross-modal
+information. Recently, Large vision-language models (VLMs) have shown potential
+in cross-modal tasks but are constrained by the high cost of training. In this
+work, we propose a novel approach that integrates Transformer-based KGE models
+with cross-modal context generated by pre-trained VLMs, thereby extending their
+applicability to MMKGC. Specifically, we employ a pre-trained VLM to transform
+relevant visual information from entities and their neighbors into textual
+sequences. We then frame KGC as a sequence-to-sequence task, fine-tuning the
+model with the generated cross-modal context. This simple yet effective method
+significantly reduces model size compared to traditional KGE approaches while
+achieving competitive performance across multiple large-scale datasets with
+minimal hyperparameter tuning.
 
-摘要：<paragraph>近年來，乳癌的盛行率迅速增加，使其成為全球主要的死亡原因之一。在所有癌症中，乳癌迄今為止是最常見的。手動診斷此疾病需要大量的時間和專業知識。由於乳癌的檢測過程耗時，因此透過建立機器學習模型來預測，有助於防止其進一步擴散。機器學習和可解釋 AI 在分類中至關重要，因為它們不僅可以提供準確的預測，還可以深入了解模型如何做出決策，有助於理解和信賴分類結果。在此研究中，我們評估並比較了五種不同的機器學習方法的分類準確度、精確度、召回率和 F1 分數，使用了一個主要的資料集（達卡醫學院醫院的 500 名患者）。五種不同的監督式機器學習技術，包括決策樹、隨機森林、邏輯迴歸、朴素貝氏和 XGBoost，已用於在我們的資料集上取得最佳結果。此外，本研究將 SHAP 分析應用於 XGBoost 模型，以解釋模型的預測並了解每個特徵對模型輸出的影響。我們比較了幾種演算法對資料進行分類的準確度，並與該領域的其他文獻進行對比。在最後評估後，本研究發現 XGBoost 達到了最佳的模型準確度，為 97%。</paragraph>
+摘要：多模態知識圖譜補全 (MMKGC) 旨在透過利用來自各種模態與結構化資料的資訊，來預測多模態知識圖譜 (MMKG) 中的缺失連結。現有的 MMKGC 方法主要擴充傳統的知識圖譜嵌入 (KGE) 模型，這些模型通常需要為每個實體建立一個嵌入。這會導致模型尺寸過大，且在整合多模態資訊時效率低下，特別是對於真實世界的圖譜。與此同時，基於 Transformer 的模型已在知識圖譜補全 (KGC) 中展現出競爭力。然而，它們著重於單模態知識，限制了它們利用跨模態資訊的能力。最近，大型視覺語言模型 (VLM) 已在跨模態任務中展現潛力，但受限於訓練成本過高。在這項工作中，我們提出了一種創新的方法，它將基於 Transformer 的 KGE 模型與預先訓練的 VLM 所產生的跨模態內容整合在一起，從而擴展它們在 MMKGC 中的適用性。具體來說，我們採用預先訓練的 VLM，將實體及其鄰居相關的視覺資訊轉換成文字序列。然後，我們將 KGC 架構成一個序列到序列的任務，並使用產生的跨模態內容微調模型。這種簡單但有效的方法，與傳統的 KGE 方法相比，大幅減少了模型尺寸，同時在多個大型資料集上達到了競爭力的效能，且只需最少的超參數調整。
 
-##### **Enhancing Breast Cancer Diagnosis in Mammography: Evaluation and Integration of Convolutional Neural Networks and Explainable AI**
-2404.03892v3 by Maryam Ahmed, Tooba Bibi, Rizwan Ahmed Khan, Sidra Nasir
+##### **How to Mitigate Information Loss in Knowledge Graphs for GraphRAG: Leveraging Triple Context Restoration and Query-Driven Feedback**
+2501.15378v1 by Manzong Huang, Chenyang Bu, Yi He, Xindong Wu
 
-The Deep learning (DL) models for diagnosing breast cancer from mammographic
-images often operate as "black boxes", making it difficult for healthcare
-professionals to trust and understand their decision-making processes. The
-study presents an integrated framework combining Convolutional Neural Networks
-(CNNs) and Explainable Artificial Intelligence (XAI) for the enhanced diagnosis
-of breast cancer using the CBIS-DDSM dataset. The methodology encompasses an
-elaborate data preprocessing pipeline and advanced data augmentation techniques
-to counteract dataset limitations and transfer learning using pre-trained
-networks such as VGG-16, Inception-V3 and ResNet was employed. A focal point of
-our study is the evaluation of XAI's effectiveness in interpreting model
-predictions, highlighted by utilizing the Hausdorff measure to assess the
-alignment between AI-generated explanations and expert annotations
-quantitatively. This approach is critical for XAI in promoting trustworthiness
-and ethical fairness in AI-assisted diagnostics. The findings from our research
-illustrate the effective collaboration between CNNs and XAI in advancing
-diagnostic methods for breast cancer, thereby facilitating a more seamless
-integration of advanced AI technologies within clinical settings. By enhancing
-the interpretability of AI driven decisions, this work lays the groundwork for
-improved collaboration between AI systems and medical practitioners, ultimately
-enriching patient care. Furthermore, the implications of our research extended
-well beyond the current methodologies. It encourages further research into how
-to combine multimodal data and improve AI explanations to meet the needs of
-clinical practice.
+Knowledge Graph (KG)-augmented Large Language Models (LLMs) have recently
+propelled significant advances in complex reasoning tasks, thanks to their
+broad domain knowledge and contextual awareness. Unfortunately, current methods
+often assume KGs to be complete, which is impractical given the inherent
+limitations of KG construction and the potential loss of contextual cues when
+converting unstructured text into entity-relation triples. In response, this
+paper proposes the Triple Context Restoration and Query-driven Feedback
+(TCR-QF) framework, which reconstructs the textual context underlying each
+triple to mitigate information loss, while dynamically refining the KG
+structure by iteratively incorporating query-relevant missing knowledge.
+Experiments on five benchmark question-answering datasets substantiate the
+effectiveness of TCR-QF in KG and LLM integration, where itachieves a 29.1%
+improvement in Exact Match and a 15.5% improvement in F1 over its
+state-of-the-art GraphRAG competitors.
 
-摘要：深度學習 (DL) 用於從乳房攝影術影像診斷乳癌的模型通常以「黑盒子」方式運作，這使得醫療保健專業人員難以信任和理解其決策過程。本研究提出一個整合架構，結合卷積神經網路 (CNN) 和可解釋人工智慧 (XAI)，以使用 CBIS-DDSM 資料集增強乳癌的診斷。方法包含一個精細的資料前處理管線和進階資料擴充技術，以對抗資料集限制，並採用預先訓練的網路（例如 VGG-16、Inception-V3 和 ResNet）進行遷移學習。我們研究的重點是評估 XAI 在解釋模型預測中的有效性，重點利用豪斯多夫測度量化評估 AI 生成的解釋和專家註解之間的一致性。這種方法對於 XAI 在促進 AI 輔助診斷中的可信度和倫理公平性至關重要。我們研究的發現說明了 CNN 和 XAI 在推進乳癌診斷方法中的有效協作，從而促進了先進 AI 技術在臨床環境中的更順暢整合。透過增強 AI 驅動決策的可解釋性，這項工作為 AI 系統和醫療從業人員之間的改善協作奠定了基礎，最終豐富了患者照護。此外，我們研究的影響遠遠超出了目前的技術。它鼓勵進一步研究如何結合多模式資料並改善 AI 解釋，以滿足臨床實務的需求。
+摘要：知識圖譜 (KG) 增強大型語言模型 (LLM) 最近推動複雜推理任務的重大進展，這要歸功於它們廣泛的領域知識和語境感知。不幸的是，目前的模型通常假設 KG 是完整的，這在考慮到 KG 建構的固有限制和在將非結構化文字轉換為實體關係三元組時潛在的語境線索損失時是不切實際的。為了解決這個問題，本文提出了三元組語境還原和查詢驅動回饋 (TCR-QF) 架構，它重建每個三元組底層的文字語境以減輕資訊損失，同時透過反覆納入與查詢相關的遺失知識來動態優化 KG 結構。在五個基準問題回答資料集上的實驗證實了 TCR-QF 在 KG 和 LLM 整合方面的有效性，它在 Exact Match 中獲得 29.1% 的改進，在 F1 中獲得 15.5% 的改進，優於最先進的 GraphRAG 競爭對手。
 
-##### **Advancing Multimodal Data Fusion in Pain Recognition: A Strategy Leveraging Statistical Correlation and Human-Centered Perspectives**
-2404.00320v2 by Xingrui Gu, Zhixuan Wang, Irisa Jin, Zekun Wu
+##### **Explaining Categorical Feature Interactions Using Graph Covariance and LLMs**
+2501.14932v1 by Cencheng Shen, Darren Edge, Jonathan Larson, Carey E. Priebe
 
-This research presents a novel multimodal data fusion methodology for pain
-behavior recognition, integrating statistical correlation analysis with
-human-centered insights. Our approach introduces two key innovations: 1)
-integrating data-driven statistical relevance weights into the fusion strategy
-to effectively utilize complementary information from heterogeneous modalities,
-and 2) incorporating human-centric movement characteristics into multimodal
-representation learning for detailed modeling of pain behaviors. Validated
-across various deep learning architectures, our method demonstrates superior
-performance and broad applicability. We propose a customizable framework that
-aligns each modality with a suitable classifier based on statistical
-significance, advancing personalized and effective multimodal fusion.
-Furthermore, our methodology provides explainable analysis of multimodal data,
-contributing to interpretable and explainable AI in healthcare. By highlighting
-the importance of data diversity and modality-specific representations, we
-enhance traditional fusion techniques and set new standards for recognizing
-complex pain behaviors. Our findings have significant implications for
-promoting patient-centered healthcare interventions and supporting explainable
-clinical decision-making.
+Modern datasets often consist of numerous samples with abundant features and
+associated timestamps. Analyzing such datasets to uncover underlying events
+typically requires complex statistical methods and substantial domain
+expertise. A notable example, and the primary data focus of this paper, is the
+global synthetic dataset from the Counter Trafficking Data Collaborative (CTDC)
+-- a global hub of human trafficking data containing over 200,000 anonymized
+records spanning from 2002 to 2022, with numerous categorical features for each
+record. In this paper, we propose a fast and scalable method for analyzing and
+extracting significant categorical feature interactions, and querying large
+language models (LLMs) to generate data-driven insights that explain these
+interactions. Our approach begins with a binarization step for categorical
+features using one-hot encoding, followed by the computation of graph
+covariance at each time. This graph covariance quantifies temporal changes in
+dependence structures within categorical data and is established as a
+consistent dependence measure under the Bernoulli distribution. We use this
+measure to identify significant feature pairs, such as those with the most
+frequent trends over time or those exhibiting sudden spikes in dependence at
+specific moments. These extracted feature pairs, along with their timestamps,
+are subsequently passed to an LLM tasked with generating potential explanations
+of the underlying events driving these dependence changes. The effectiveness of
+our method is demonstrated through extensive simulations, and its application
+to the CTDC dataset reveals meaningful feature pairs and potential data stories
+underlying the observed feature interactions.
 
-摘要：本研究提出了一種創新的多模態數據融合方法，用於疼痛行為識別，將統計相關分析與以人為中心的見解相結合。我們的做法引入了兩項關鍵創新：1) 將數據驅動的統計相關權重整合到融合策略中，以有效利用來自異質模態的補充信息，以及 2) 將以人為中心的運動特徵納入多模態表示學習中，以詳細建模疼痛行為。我們的模型在各種深度學習架構中得到驗證，展示了卓越的性能和廣泛的適用性。我們提出了一個可自定義的框架，根據統計顯著性將每個模態與合適的分類器對齊，推進個性化和有效的多模態融合。此外，我們的模型提供對多模態數據的可解釋分析，有助於醫療保健中的可解釋和可解釋 AI。通過強調數據多樣性和模態特定表示的重要性，我們增強了傳統的融合技術，並為識別複雜的疼痛行為設定了新的標準。我們的發現對促進以患者為中心的醫療保健干預和支持可解釋的臨床決策制定具有重要意義。
+摘要：現代資料集通常包含許多具有豐富特徵和關聯時間戳的樣本。分析此類資料集以揭示底層事件通常需要複雜的統計方法和大量的領域專業知識。一個值得注意的範例，也是本文的主要資料重點，是來自反人口販運資料合作組織 (CTDC) 的全球合成資料集，這是全球人口販運資料的樞紐，包含超過 200,000 筆從 2002 年到 2022 年的匿名記錄，每個記錄都有許多分類特徵。在本文中，我們提出了一種快速且可擴充的方法，用於分析和提取重要的分類特徵交互作用，並查詢大型語言模型 (LLM)，以產生資料驅動的見解來解釋這些交互作用。我們的做法從使用獨熱編碼對分類特徵進行二元化步驟開始，然後在每個時間點計算圖形共變異數。此圖形共變異數量化了分類資料中依賴結構的時間變化，並在伯努利分佈下建立為一致的依賴度量。我們使用此度量來識別重要的特徵對，例如隨時間推移趨勢最頻繁的特徵對，或在特定時刻表現出依賴性突然激增的特徵對。這些提取的特徵對及其時間戳隨後傳遞給 LLM，後者負責產生對驅動這些依賴性變化的底層事件的潛在解釋。我們的方法的有效性已通過廣泛的模擬得到證明，其在 CTDC 資料集中的應用揭示了有意義的特徵對和潛在的資料故事，這些故事是觀察到的特徵交互作用的基礎。
 
-##### **Addressing Social Misattributions of Large Language Models: An HCXAI-based Approach**
-2403.17873v1 by Andrea Ferrario, Alberto Termine, Alessandro Facchini
+##### **Causal Graphs Meet Thoughts: Enhancing Complex Reasoning in Graph-Augmented LLMs**
+2501.14892v1 by Hang Luo, Jian Zhang, Chujun Li
 
-Human-centered explainable AI (HCXAI) advocates for the integration of social
-aspects into AI explanations. Central to the HCXAI discourse is the Social
-Transparency (ST) framework, which aims to make the socio-organizational
-context of AI systems accessible to their users. In this work, we suggest
-extending the ST framework to address the risks of social misattributions in
-Large Language Models (LLMs), particularly in sensitive areas like mental
-health. In fact LLMs, which are remarkably capable of simulating roles and
-personas, may lead to mismatches between designers' intentions and users'
-perceptions of social attributes, risking to promote emotional manipulation and
-dangerous behaviors, cases of epistemic injustice, and unwarranted trust. To
-address these issues, we propose enhancing the ST framework with a fifth
-'W-question' to clarify the specific social attributions assigned to LLMs by
-its designers and users. This addition aims to bridge the gap between LLM
-capabilities and user perceptions, promoting the ethically responsible
-development and use of LLM-based technology.
+In knowledge-intensive tasks, especially in high-stakes domains like medicine
+and law, it is critical not only to retrieve relevant information but also to
+provide causal reasoning and explainability. Large language models (LLMs) have
+achieved remarkable performance in natural language understanding and
+generation tasks. However, they often suffer from limitations such as
+difficulty in incorporating new knowledge, generating hallucinations, and
+explaining their reasoning process. To address these challenges, integrating
+knowledge graphs with Graph Retrieval-Augmented Generation (Graph RAG) has
+emerged as an effective solution. Traditional Graph RAG methods often rely on
+simple graph traversal or semantic similarity, which do not capture causal
+relationships or align well with the model's internal reasoning steps. This
+paper proposes a novel pipeline that filters large knowledge graphs to
+emphasize cause-effect edges, aligns the retrieval process with the model's
+chain-of-thought (CoT), and enhances reasoning through multi-stage path
+improvements. Experiments on medical question-answering tasks show consistent
+gains, with up to a 10\% absolute improvement across multiple large language
+models (LLMs). This approach demonstrates the value of combining causal
+reasoning with stepwise retrieval, leading to more interpretable and logically
+grounded solutions for complex queries.
 
-摘要：以人为本的可解释 AI (HCXAI) 倡导将社会层面整合到 AI 解释中。HCXAI 话语的核心是社会透明度 (ST) 框架，其目标是让 AI 系统的社会组织背景对用户来说是可理解的。在这项工作中，我们建议扩展 ST 框架以解决大型语言模型 (LLM) 中社会错误归因的风险，尤其是在心理健康等敏感领域。事实上，LLM 能够出色地模拟角色和人格，这可能导致设计者的意图和用户对社会属性的认知之间出现错配，从而有风险促进情绪操纵和危险行为、认知不公正和不合理的信任。为了解决这些问题，我们建议用第五个“W 问题”来增强 ST 框架，以明确设计者和用户赋予 LLM 的具体社会属性。此补充旨在弥合 LLM 能力和用户认知之间的差距，促进基于 LLM 的技术在道德上负责任地开发和使用。
+摘要：在知識密集型任務中，特別是在醫學和法律等高風險領域，不僅檢索相關資訊至關重要，還必須提供因果推理和可解釋性。大型語言模型 (LLM) 在自然語言理解和生成任務中取得了顯著的表現。然而，它們通常會遇到一些限制，例如難以納入新知識、產生幻覺，以及解釋其推理過程。為了應對這些挑戰，將知識圖與圖形檢索增強生成 (Graph RAG) 整合在一起已成為一種有效的解決方案。傳統的 Graph RAG 方法通常依賴於簡單的圖形遍歷或語義相似性，這無法捕捉因果關係或與模型的內部推理步驟很好地對齊。本文提出了一個新穎的管道，該管道過濾大型知識圖以強調因果邊緣，將檢索過程與模型的思想鏈 (CoT) 對齊，並通過多階段路徑改進來增強推理。在醫療問題解答任務上的實驗顯示出一致的收益，在多個大型語言模型 (LLM) 中絕對改進幅度高達 10%。這種方法展示了將因果推理與逐步檢索相結合的價值，從而為複雜查詢提供更具可解釋性和邏輯依據的解決方案。
 
-##### **Clinical Domain Knowledge-Derived Template Improves Post Hoc AI Explanations in Pneumothorax Classification**
-2403.18871v1 by Han Yuan, Chuan Hong, Pengtao Jiang, Gangming Zhao, Nguyen Tuan Anh Tran, Xinxing Xu, Yet Yen Yan, Nan Liu
+##### **GraPPI: A Retrieve-Divide-Solve GraphRAG Framework for Large-scale Protein-protein Interaction Exploration**
+2501.16382v1 by Ziwen Li, Xiang 'Anthony' Chen, Youngseung Jeon
 
-Background: Pneumothorax is an acute thoracic disease caused by abnormal air
-collection between the lungs and chest wall. To address the opaqueness often
-associated with deep learning (DL) models, explainable artificial intelligence
-(XAI) methods have been introduced to outline regions related to pneumothorax
-diagnoses made by DL models. However, these explanations sometimes diverge from
-actual lesion areas, highlighting the need for further improvement. Method: We
-propose a template-guided approach to incorporate the clinical knowledge of
-pneumothorax into model explanations generated by XAI methods, thereby
-enhancing the quality of these explanations. Utilizing one lesion delineation
-created by radiologists, our approach first generates a template that
-represents potential areas of pneumothorax occurrence. This template is then
-superimposed on model explanations to filter out extraneous explanations that
-fall outside the template's boundaries. To validate its efficacy, we carried
-out a comparative analysis of three XAI methods with and without our template
-guidance when explaining two DL models in two real-world datasets. Results: The
-proposed approach consistently improved baseline XAI methods across twelve
-benchmark scenarios built on three XAI methods, two DL models, and two
-datasets. The average incremental percentages, calculated by the performance
-improvements over the baseline performance, were 97.8% in Intersection over
-Union (IoU) and 94.1% in Dice Similarity Coefficient (DSC) when comparing model
-explanations and ground-truth lesion areas. Conclusions: In the context of
-pneumothorax diagnoses, we proposed a template-guided approach for improving AI
-explanations. We anticipate that our template guidance will forge a fresh
-approach to elucidating AI models by integrating clinical domain expertise.
+Drug discovery (DD) has tremendously contributed to maintaining and improving
+public health. Hypothesizing that inhibiting protein misfolding can slow
+disease progression, researchers focus on target identification (Target ID) to
+find protein structures for drug binding. While Large Language Models (LLMs)
+and Retrieval-Augmented Generation (RAG) frameworks have accelerated drug
+discovery, integrating models into cohesive workflows remains challenging. We
+conducted a user study with drug discovery researchers to identify the
+applicability of LLMs and RAGs in Target ID. We identified two main findings:
+1) an LLM should provide multiple Protein-Protein Interactions (PPIs) based on
+an initial protein and protein candidates that have a therapeutic impact; 2)
+the model must provide the PPI and relevant explanations for better
+understanding. Based on these observations, we identified three limitations in
+previous approaches for Target ID: 1) semantic ambiguity, 2) lack of
+explainability, and 3) short retrieval units. To address these issues, we
+propose GraPPI, a large-scale knowledge graph (KG)-based retrieve-divide-solve
+agent pipeline RAG framework to support large-scale PPI signaling pathway
+exploration in understanding therapeutic impacts by decomposing the analysis of
+entire PPI pathways into sub-tasks focused on the analysis of PPI edges.
 
-摘要：<paragraph>背景：氣胸是一種因肺部與胸壁之間異常集氣所引起的急性胸腔疾病。為了解決深度學習（DL）模型經常伴隨的不透明性，可解釋人工智慧（XAI）方法已被引入，用於概述與 DL 模型做出的氣胸診斷相關的區域。然而，這些解釋有時會與實際病灶區域有所出入，突顯出進一步改進的必要性。方法：我們提出了一種模板引導式方法，將氣胸的臨床知識納入 XAI 方法產生的模型解釋中，從而提升這些解釋的品質。利用放射科醫師建立的病灶描繪，我們的做法首先產生一個模板，用於表示氣胸可能發生的區域。然後將此模板疊加在模型解釋上，以篩選出超出模板邊界的無關解釋。為了驗證其效力，我們對三種 XAI 方法進行了比較分析，在兩個真實世界資料集中解釋兩個 DL 模型時，分別採用和不採用我們的模板引導。結果：所提出的方法在建立於三種 XAI 方法、兩個 DL 模型和兩個資料集的十二種基準情境中，始終改善了基準 XAI 方法。在比較模型解釋和真實病灶區域時，透過基準效能的效能改進計算出的平均增量百分比為交集比（IoU）的 97.8% 和骰子相似性係數（DSC）的 94.1%。結論：在氣胸診斷的背景下，我們提出了一種模板引導式方法，用於改善 AI 解釋。我們預期我們的模板引導將透過整合臨床領域專業知識，為闡明 AI 模型建立一種新方法。</paragraph>
+摘要：药物发现 (DD) 极大地促进了公共卫生的维护和改善。研究人员假设抑制蛋白质错误折叠可以减缓疾病进展，因此专注于靶点识别 (Target ID) 以找到用于药物结合的蛋白质结构。虽然大型语言模型 (LLM) 和检索增强生成 (RAG) 框架加速了药物发现，但将模型整合到内聚工作流中仍然具有挑战性。我们与药物发现研究人员进行了一项用户研究，以确定 LLM 和 RAG 在 Target ID 中的适用性。我们确定了两个主要发现：1) LLM 应该基于初始蛋白质和具有治疗作用的蛋白质候选物提供多个蛋白质-蛋白质相互作用 (PPI)；2) 该模型必须提供 PPI 和相关解释以更好地理解。基于这些观察，我们发现了先前 Target ID 方法中的三个局限性：1) 语义歧义，2) 缺乏可解释性，3) 检索单元短。为了解决这些问题，我们提出了 GraPPI，这是一种基于大规模知识图 (KG) 的检索-分解-求解代理管道 RAG 框架，以支持大规模 PPI 信号通路探索，通过将整个 PPI 通路的分析分解为专注于 PPI 边缘分析的子任务来理解治疗影响。
 
-##### **Enhancing Neural Machine Translation of Low-Resource Languages: Corpus Development, Human Evaluation and Explainable AI Architectures**
-2403.01580v1 by Séamus Lankford
+##### **Evaluating and Improving Graph to Text Generation with Large Language Models**
+2501.14497v1 by Jie He, Yijun Yang, Wanqiu Long, Deyi Xiong, Victor Gutierrez Basulto, Jeff Z. Pan
 
-In the current machine translation (MT) landscape, the Transformer
-architecture stands out as the gold standard, especially for high-resource
-language pairs. This research delves into its efficacy for low-resource
-language pairs including both the English$\leftrightarrow$Irish and
-English$\leftrightarrow$Marathi language pairs. Notably, the study identifies
-the optimal hyperparameters and subword model type to significantly improve the
-translation quality of Transformer models for low-resource language pairs.
-  The scarcity of parallel datasets for low-resource languages can hinder MT
-development. To address this, gaHealth was developed, the first bilingual
-corpus of health data for the Irish language. Focusing on the health domain,
-models developed using this in-domain dataset exhibited very significant
-improvements in BLEU score when compared with models from the LoResMT2021
-Shared Task. A subsequent human evaluation using the multidimensional quality
-metrics error taxonomy showcased the superior performance of the Transformer
-system in reducing both accuracy and fluency errors compared to an RNN-based
-counterpart.
-  Furthermore, this thesis introduces adaptNMT and adaptMLLM, two open-source
-applications streamlined for the development, fine-tuning, and deployment of
-neural machine translation models. These tools considerably simplify the setup
-and evaluation process, making MT more accessible to both developers and
-translators. Notably, adaptNMT, grounded in the OpenNMT ecosystem, promotes
-eco-friendly natural language processing research by highlighting the
-environmental footprint of model development. Fine-tuning of MLLMs by adaptMLLM
-demonstrated advancements in translation performance for two low-resource
-language pairs: English$\leftrightarrow$Irish and
-English$\leftrightarrow$Marathi, compared to baselines from the LoResMT2021
-Shared Task.
+Large language models (LLMs) have demonstrated immense potential across
+various tasks. However, research for exploring and improving the capabilities
+of LLMs in interpreting graph structures remains limited. To address this gap,
+we conduct a comprehensive evaluation of prompting current open-source LLMs on
+graph-to-text generation tasks. Although we explored the optimal prompting
+strategies and proposed a novel and effective diversity-difficulty-based
+few-shot sample selection method, we found that the improvements from
+tuning-free approaches were incremental, as LLMs struggle with planning on
+complex graphs, particularly those with a larger number of triplets. To further
+improve LLMs in planning with graph sequences and grounding in truth, we
+introduce a new graph-to-text dataset, PlanGTG, annotated with two sub-tasks:
+reordering and attribution. Through extensive automatic and human evaluations,
+we demonstrate significant improvements in the quality of generated text from
+both few-shot learning and fine-tuning perspectives using the PlanGTG dataset.
+Our study paves the way for new research directions in graph-to-text
+generation. PlanGTG datasets can be found in https://github.com/probe2/kg_text.
 
-摘要：<paragraph>在當前機器翻譯 (MT) 領域中，Transformer 架構脫穎而出，成為黃金標準，特別是對於高資源語言對。本研究探討其對低資源語言對的效能，包括英語↔愛爾蘭語和英語↔馬拉地語語言對。值得注意的是，本研究識別出最佳超參數和子詞模型類型，以顯著提高 Transformer 模型對低資源語言對的翻譯品質。
-低資源語言的平行資料集的稀缺會阻礙 MT 的發展。為了解決這個問題，開發了 gaHealth，這是愛爾蘭語的第一個雙語健康資料語料庫。專注於健康領域，使用此域內資料集開發的模型在 BLEU 得分方面表現出非常顯著的進步，與 LoResMT2021 共享任務中的模型相比。隨後使用多維品質指標錯誤分類法進行的人工評估顯示，與基於 RNN 的對應模型相比，Transformer 系統在減少準確性和流暢性錯誤方面表現出優異的性能。
-此外，本論文介紹了 adaptNMT 和 adaptMLLM，這兩個開源應用程式簡化了神經機器翻譯模型的開發、微調和部署。這些工具大幅簡化了設定和評估流程，讓 MT 更容易讓開發人員和翻譯人員使用。值得注意的是，adaptNMT 以 OpenNMT 生態系統為基礎，通過強調模型開發的環境足跡來促進生態友好的自然語言處理研究。與 LoResMT2021 共享任務中的基準相比，adaptMLLM 對 MLLM 的微調證明了英語↔愛爾蘭語和英語↔馬拉地語這兩個低資源語言對的翻譯性能進步。</paragraph>
+摘要：大型語言模型（LLM）已在各種任務中展現出巨大的潛力。然而，探索和提升 LLM 在詮釋圖形結構方面的能力的研究仍然有限。為了解決這個差距，我們對提示目前開源的 LLM 執行圖形轉文字生成任務進行全面評估。儘管我們探索了最佳提示策略並提出了一種新穎且有效的基於多樣性難度的少樣本選擇方法，但我們發現無調校方法的改進是漸進的，因為 LLM 難以規劃複雜的圖形，特別是那些具有較多三元組的圖形。為了進一步提升 LLM 在圖形序列規劃和真實依據方面的能力，我們引入了一個新的圖形轉文字資料集 PlanGTG，並註解了兩個子任務：重新排序和歸因。透過廣泛的自動化和人工評估，我們證明了使用 PlanGTG 資料集從少樣本學習和微調角度產生文字的品質有顯著提升。我們的研究為圖形轉文字生成中的新研究方向鋪路。PlanGTG 資料集可以在 https://github.com/probe2/kg_text 中找到。
 
-##### **Cause and Effect: Can Large Language Models Truly Understand Causality?**
-2402.18139v3 by Swagata Ashwani, Kshiteesh Hegde, Nishith Reddy Mannuru, Mayank Jindal, Dushyant Singh Sengar, Krishna Chaitanya Rao Kathala, Dishant Banga, Vinija Jain, Aman Chadha
+##### **Fast Think-on-Graph: Wider, Deeper and Faster Reasoning of Large Language Model on Knowledge Graph**
+2501.14300v1 by Xujian Liang, Zhaoquan Gu
 
-With the rise of Large Language Models(LLMs), it has become crucial to
-understand their capabilities and limitations in deciphering and explaining the
-complex web of causal relationships that language entails. Current methods use
-either explicit or implicit causal reasoning, yet there is a strong need for a
-unified approach combining both to tackle a wide array of causal relationships
-more effectively. This research proposes a novel architecture called Context
-Aware Reasoning Enhancement with Counterfactual Analysis(CARE CA) framework to
-enhance causal reasoning and explainability. The proposed framework
-incorporates an explicit causal detection module with ConceptNet and
-counterfactual statements, as well as implicit causal detection through LLMs.
-Our framework goes one step further with a layer of counterfactual explanations
-to accentuate LLMs understanding of causality. The knowledge from ConceptNet
-enhances the performance of multiple causal reasoning tasks such as causal
-discovery, causal identification and counterfactual reasoning. The
-counterfactual sentences add explicit knowledge of the not caused by scenarios.
-By combining these powerful modules, our model aims to provide a deeper
-understanding of causal relationships, enabling enhanced interpretability.
-Evaluation of benchmark datasets shows improved performance across all metrics,
-such as accuracy, precision, recall, and F1 scores. We also introduce
-CausalNet, a new dataset accompanied by our code, to facilitate further
-research in this domain.
+Graph Retrieval Augmented Generation (GRAG) is a novel paradigm that takes
+the naive RAG system a step further by integrating graph information, such as
+knowledge graph (KGs), into large-scale language models (LLMs) to mitigate
+hallucination. However, existing GRAG still encounter limitations: 1) simple
+paradigms usually fail with the complex problems due to the narrow and shallow
+correlations capture from KGs 2) methods of strong coupling with KGs tend to be
+high computation cost and time consuming if the graph is dense. In this paper,
+we propose the Fast Think-on-Graph (FastToG), an innovative paradigm for
+enabling LLMs to think ``community by community" within KGs. To do this,
+FastToG employs community detection for deeper correlation capture and two
+stages community pruning - coarse and fine pruning for faster retrieval.
+Furthermore, we also develop two Community-to-Text methods to convert the graph
+structure of communities into textual form for better understanding by LLMs.
+Experimental results demonstrate the effectiveness of FastToG, showcasing
+higher accuracy, faster reasoning, and better explainability compared to the
+previous works.
 
-摘要：隨著大型語言模型 (LLM) 的興起，了解它們在解碼和解釋語言所蘊含的複雜因果關係網路中的能力和限制變得至關重要。目前的技術使用明確或隱含的因果推理，但強烈需要一種統一的方法，結合兩者以更有效地處理廣泛的因果關係。本研究提出了一種稱為情境感知推理增強與反事實分析 (CARE CA) 框架的新架構，以增強因果推理和可解釋性。提出的框架結合了使用 ConceptNet 和反事實陳述的明確因果檢測模組，以及透過 LLM 進行的隱含因果檢測。我們的框架更進一步，加入一層反事實解釋，以強調 LLM 對因果關係的理解。來自 ConceptNet 的知識增強了多項因果推理任務的執行，例如因果發現、因果識別和反事實推理。反事實句加入了未由情境造成的明確知識。透過結合這些強大的模組，我們的模型旨在提供對因果關係更深入的理解，實現增強的可解釋性。基準資料集的評估顯示在所有指標（例如準確度、精確度、召回率和 F1 分數）上都有所提升。我們還引入了 CausalNet，一個新的資料集，並附上了我們的程式碼，以促進在這個領域的進一步研究。
+摘要：圖表檢索增強生成 (GRAG) 是一種新穎的範例，它透過將圖表資訊（例如知識圖表 (KG)) 整合到大型語言模型 (LLM) 中，進一步提升了樸素的 RAG 系統以減輕幻覺。然而，現有的 GRAG 仍會遇到限制：1) 簡單的範例通常會因從 KG 中擷取的關聯性狹隘且淺薄而無法解決複雜的問題 2) 如果圖表很密集，與 KG 強耦合的方法往往會導致高運算成本和耗時。在本文中，我們提出了 Fast Think-on-Graph (FastToG)，這是一種創新的範例，可讓 LLM 在 KG 中「逐個社群」進行思考。為此，FastToG 使用社群偵測來擷取更深入的關聯性，並使用兩個階段的社群修剪（粗略修剪和精細修剪）來加快檢索速度。此外，我們還開發了兩種社群到文字的方法，將社群的圖表結構轉換為文字形式，以便 LLM 更容易理解。實驗結果證明了 FastToG 的有效性，與先前的研究相比，展示出更高的準確性、更快的推理速度和更好的可解釋性。
 
-##### **Artificial Intelligence and Diabetes Mellitus: An Inside Look Through the Retina**
-2402.18600v1 by Yasin Sadeghi Bazargani, Majid Mirzaei, Navid Sobhi, Mirsaeed Abdollahi, Ali Jafarizadeh, Siamak Pedrammehr, Roohallah Alizadehsani, Ru San Tan, Sheikh Mohammed Shariful Islam, U. Rajendra Acharya
+##### **Top Ten Challenges Towards Agentic Neural Graph Databases**
+2501.14224v1 by Jiaxin Bai, Zihao Wang, Yukun Zhou, Hang Yin, Weizhi Fei, Qi Hu, Zheye Deng, Jiayang Cheng, Tianshi Zheng, Hong Ting Tsang, Yisen Gao, Zhongwei Xie, Yufei Li, Lixin Fan, Binhang Yuan, Wei Wang, Lei Chen, Xiaofang Zhou, Yangqiu Song
 
-Diabetes mellitus (DM) predisposes patients to vascular complications.
-Retinal images and vasculature reflect the body's micro- and macrovascular
-health. They can be used to diagnose DM complications, including diabetic
-retinopathy (DR), neuropathy, nephropathy, and atherosclerotic cardiovascular
-disease, as well as forecast the risk of cardiovascular events. Artificial
-intelligence (AI)-enabled systems developed for high-throughput detection of DR
-using digitized retinal images have become clinically adopted. Beyond DR
-screening, AI integration also holds immense potential to address challenges
-associated with the holistic care of the patient with DM. In this work, we aim
-to comprehensively review the literature for studies on AI applications based
-on retinal images related to DM diagnosis, prognostication, and management. We
-will describe the findings of holistic AI-assisted diabetes care, including but
-not limited to DR screening, and discuss barriers to implementing such systems,
-including issues concerning ethics, data privacy, equitable access, and
-explainability. With the ability to evaluate the patient's health status vis a
-vis DM complication as well as risk prognostication of future cardiovascular
-complications, AI-assisted retinal image analysis has the potential to become a
-central tool for modern personalized medicine in patients with DM.
+Graph databases (GDBs) like Neo4j and TigerGraph excel at handling
+interconnected data but lack advanced inference capabilities. Neural Graph
+Databases (NGDBs) address this by integrating Graph Neural Networks (GNNs) for
+predictive analysis and reasoning over incomplete or noisy data. However, NGDBs
+rely on predefined queries and lack autonomy and adaptability. This paper
+introduces Agentic Neural Graph Databases (Agentic NGDBs), which extend NGDBs
+with three core functionalities: autonomous query construction, neural query
+execution, and continuous learning. We identify ten key challenges in realizing
+Agentic NGDBs: semantic unit representation, abductive reasoning, scalable
+query execution, and integration with foundation models like large language
+models (LLMs). By addressing these challenges, Agentic NGDBs can enable
+intelligent, self-improving systems for modern data-driven applications, paving
+the way for adaptable and autonomous data management solutions.
 
-摘要：糖尿病（DM）使患者容易出現血管併發症。
-視網膜影像和血管反映身體的微血管和巨血管健康狀況。它們可用於診斷糖尿病併發症，包括糖尿病視網膜病變（DR）、神經病變、腎病和動脈粥樣硬化性心血管疾病，以及預測心血管事件的風險。為使用數位化視網膜影像進行高通量 DR 檢測而開發的人工智慧（AI）啟用系統已在臨床採用。除了 DR 篩檢外，AI 整合也具有巨大的潛力來應對與糖尿病患者整體照護相關的挑戰。在這項工作中，我們旨在全面回顧基於視網膜影像的 AI 應用相關研究的文獻，這些研究與糖尿病的診斷、預後和管理有關。我們將描述整體 AI 輔助糖尿病照護的發現，包括但不限於 DR 篩檢，並討論實施此類系統的障礙，包括與倫理、資料隱私、公平存取和可解釋性有關的問題。透過評估患者的健康狀況，同時考量糖尿病併發症以及未來心血管併發症的風險預後，AI 輔助視網膜影像分析有潛力成為糖尿病患者現代化個人化醫療的中心工具。
+摘要：圖形資料庫（GDB），例如 Neo4j 和 TigerGraph，擅長處理相互連接的資料，但缺乏進階的推論能力。神經圖形資料庫（NGDB）透過整合圖形神經網路（GNN）來解決這個問題，以進行預測分析和對不完整或有雜訊的資料進行推理。然而，NGDB 依賴於預先定義的查詢，並且缺乏自主性和適應性。本文介紹了代理神經圖形資料庫（Agentic NGDB），它以三項核心功能擴充了 NGDB：自動查詢建構、神經查詢執行和持續學習。我們找出實現 Agentic NGDB 的十大關鍵挑戰：語義單元表示、演繹推理、可擴充查詢執行，以及與基礎模型（例如大型語言模型 (LLM)）整合。透過解決這些挑戰，Agentic NGDB 可以為現代資料驅動應用打造智慧且自我改善的系統，為適應性和自主資料管理解決方案鋪路。
 
-##### **Multi-stakeholder Perspective on Responsible Artificial Intelligence and Acceptability in Education**
-2402.15027v2 by A. J. Karran, P. Charland, J-T. Martineau, A. Ortiz de Guinea Lopez de Arana, AM. Lesage, S. Senecal, P-M. Leger
+##### **GraphRAG under Fire**
+2501.14050v1 by Jiacheng Liang, Yuhui Wang, Changjiang Li, Rongyi Zhu, Tanqiu Jiang, Neil Gong, Ting Wang
 
-This study investigates the acceptability of different artificial
-intelligence (AI) applications in education from a multi-stakeholder
-perspective, including students, teachers, and parents. Acknowledging the
-transformative potential of AI in education, it addresses concerns related to
-data privacy, AI agency, transparency, explainability and the ethical
-deployment of AI. Through a vignette methodology, participants were presented
-with four scenarios where AI's agency, transparency, explainability, and
-privacy were manipulated. After each scenario, participants completed a survey
-that captured their perceptions of AI's global utility, individual usefulness,
-justice, confidence, risk, and intention to use each scenario's AI if
-available. The data collection comprising a final sample of 1198
-multi-stakeholder participants was distributed through a partner institution
-and social media campaigns and focused on individual responses to four AI use
-cases. A mediation analysis of the data indicated that acceptance and trust in
-AI varies significantly across stakeholder groups. We found that the key
-mediators between high and low levels of AI's agency, transparency, and
-explainability, as well as the intention to use the different educational AI,
-included perceived global utility, justice, and confidence. The study
-highlights that the acceptance of AI in education is a nuanced and multifaceted
-issue that requires careful consideration of specific AI applications and their
-characteristics, in addition to the diverse stakeholders' perceptions.
+GraphRAG advances retrieval-augmented generation (RAG) by structuring
+external knowledge as multi-scale knowledge graphs, enabling language models to
+integrate both broad context and granular details in their reasoning. While
+GraphRAG has demonstrated success across domains, its security implications
+remain largely unexplored. To bridge this gap, this work examines GraphRAG's
+vulnerability to poisoning attacks, uncovering an intriguing security paradox:
+compared to conventional RAG, GraphRAG's graph-based indexing and retrieval
+enhance resilience against simple poisoning attacks; meanwhile, the same
+features also create new attack surfaces. We present GRAGPoison, a novel attack
+that exploits shared relations in the knowledge graph to craft poisoning text
+capable of compromising multiple queries simultaneously. GRAGPoison employs
+three key strategies: i) relation injection to introduce false knowledge, ii)
+relation enhancement to amplify poisoning influence, and iii) narrative
+generation to embed malicious content within coherent text. Empirical
+evaluation across diverse datasets and models shows that GRAGPoison
+substantially outperforms existing attacks in terms of effectiveness (up to 98%
+success rate) and scalability (using less than 68% poisoning text). We also
+explore potential defensive measures and their limitations, identifying
+promising directions for future research.
 
-摘要：這項研究從多個利害關係人的角度探討不同的人工智慧 (AI) 應用在教育上的可接受性，包括學生、老師和家長。承認 AI 在教育上的轉型潛力，它解決了與資料隱私、AI 代理、透明度、可解釋性和 AI 的道德部署相關的疑慮。透過小插曲方法，參與者被呈現了四種情境，其中 AI 的代理、透明度、可解釋性和隱私受到操縱。在每個情境後，參與者完成了一項調查，該調查捕捉了他們對 AI 的整體效用、個人效用、正義、信心、風險和如果可用，使用每個情境的 AI 的意圖的看法。資料蒐集包含來自合作機構和社群媒體活動的 1198 位多利害關係人參與者的最終樣本，並專注於對四個 AI 使用案例的個別回應。對資料的調解分析表明，對 AI 的接受度和信任在利害關係人團體之間有顯著差異。我們發現，AI 的代理、透明度和可解釋性高低程度之間的關鍵調解者，以及使用不同教育 AI 的意圖，包括感知到的整體效用、正義和信心。這項研究強調，接受 AI 在教育上的應用是一個微妙且多面向的問題，除了不同的利害關係人的看法外，還需要仔細考慮具體的 AI 應用及其特徵。
+摘要：GraphRAG 透過將外部知識結構化為多尺度知識圖譜，推動了檢索增強生成 (RAG)，使語言模型能夠在其推理中整合廣泛的背景和細微的細節。儘管 GraphRAG 在各個領域都已展現出成功，但其安全性影響在很大程度上仍未被探索。為了彌補這一差距，本研究探討了 GraphRAG 對投毒攻擊的脆弱性，揭示了一個有趣的安全悖論：與傳統的 RAG 相比，GraphRAG 基於圖表的索引和檢索增強了對簡單投毒攻擊的韌性；同時，相同的特徵也創造了新的攻擊面。我們提出了 GRAGPoison，這是一種新穎的攻擊，它利用知識圖譜中的共享關係來製作中毒文本，能夠同時危害多個查詢。GRAGPoison 採用了三項關鍵策略：i) 關係注入以引入錯誤的知識，ii) 關係增強以擴大投毒影響，以及 iii) 敘事生成以將惡意內容嵌入連貫的文本中。在各種數據集和模型上的經驗評估表明，GRAGPoison 在有效性（成功率高達 98%）和可擴展性（使用不到 68% 的投毒文本）方面都明顯優於現有的攻擊。我們還探討了潛在的防禦措施及其局限性，確定了未來研究的有希望的方向。
 
-##### **Deciphering Heartbeat Signatures: A Vision Transformer Approach to Explainable Atrial Fibrillation Detection from ECG Signals**
-2402.09474v2 by Aruna Mohan, Danne Elbers, Or Zilbershot, Fatemeh Afghah, David Vorchheimer
+##### **EICopilot: Search and Explore Enterprise Information over Large-scale Knowledge Graphs with LLM-driven Agents**
+2501.13746v1 by Yuhui Yun, Huilong Ye, Xinru Li, Ruojia Li, Jingfeng Deng, Li Li, Haoyi Xiong
+
+The paper introduces EICopilot, an novel agent-based solution enhancing
+search and exploration of enterprise registration data within extensive online
+knowledge graphs like those detailing legal entities, registered capital, and
+major shareholders. Traditional methods necessitate text-based queries and
+manual subgraph explorations, often resulting in time-consuming processes.
+EICopilot, deployed as a chatbot via Baidu Enterprise Search, improves this
+landscape by utilizing Large Language Models (LLMs) to interpret natural
+language queries. This solution automatically generates and executes Gremlin
+scripts, providing efficient summaries of complex enterprise relationships.
+Distinct feature a data pre-processing pipeline that compiles and annotates
+representative queries into a vector database of examples for In-context
+learning (ICL), a comprehensive reasoning pipeline combining Chain-of-Thought
+with ICL to enhance Gremlin script generation for knowledge graph search and
+exploration, and a novel query masking strategy that improves intent
+recognition for heightened script accuracy. Empirical evaluations demonstrate
+the superior performance of EICopilot, including speed and accuracy, over
+baseline methods, with the \emph{Full Mask} variant achieving a syntax error
+rate reduction to as low as 10.00% and an execution correctness of up to
+82.14%. These components collectively contribute to superior querying
+capabilities and summarization of intricate datasets, positioning EICopilot as
+a groundbreaking tool in the exploration and exploitation of large-scale
+knowledge graphs for enterprise information search.
 
-Remote patient monitoring based on wearable single-lead electrocardiogram
-(ECG) devices has significant potential for enabling the early detection of
-heart disease, especially in combination with artificial intelligence (AI)
-approaches for automated heart disease detection. There have been prior studies
-applying AI approaches based on deep learning for heart disease detection.
-However, these models are yet to be widely accepted as a reliable aid for
-clinical diagnostics, in part due to the current black-box perception
-surrounding many AI algorithms. In particular, there is a need to identify the
-key features of the ECG signal that contribute toward making an accurate
-diagnosis, thereby enhancing the interpretability of the model. In the present
-study, we develop a vision transformer approach to identify atrial fibrillation
-based on single-lead ECG data. A residual network (ResNet) approach is also
-developed for comparison with the vision transformer approach. These models are
-applied to the Chapman-Shaoxing dataset to classify atrial fibrillation, as
-well as another common arrhythmia, sinus bradycardia, and normal sinus rhythm
-heartbeats. The models enable the identification of the key regions of the
-heartbeat that determine the resulting classification, and highlight the
-importance of P-waves and T-waves, as well as heartbeat duration and signal
-amplitude, in distinguishing normal sinus rhythm from atrial fibrillation and
-sinus bradycardia.
+摘要：本文介紹了 EICopilot，這是一種基於代理的新型解決方案，可增強在廣泛的線上知識圖譜中搜尋和探索企業註冊資料，例如詳細說明法律實體、註冊資本和主要股東的資料。傳統方法需要基於文字的查詢和手動子圖探索，通常會導致耗時的流程。EICopilot 部署為百度企業搜尋的聊天機器人，透過利用大型語言模型 (LLM) 來詮釋自然語言查詢，進而改善這項技術。此解決方案會自動產生並執行 Gremlin 腳本，提供複雜企業關係的有效摘要。其獨特功能為資料前處理管線，可將具代表性的查詢編譯並註解到範例的向量資料庫中，以進行脈絡中學習 (ICL)，這是一個結合了思考鏈與 ICL 的綜合推理管線，用於增強 Gremlin 腳本產生，以進行知識圖譜搜尋和探索，以及一種新穎的查詢遮罩策略，可改善意圖辨識，進而提高腳本準確度。實證評估顯示，EICopilot 的效能優於基線方法，包括速度和準確度，其中「完整遮罩」變體將語法錯誤率降低至低於 10.00%，執行正確率高達 82.14%。這些元件共同促成了優異的查詢功能和複雜資料集的摘要，將 EICopilot 定位為探索和利用大規模知識圖譜進行企業資訊搜尋的創新工具。
 
-摘要：<paragraph>基於可穿戴式單導程心電圖 (ECG) 裝置的遠端病患監測在早期偵測心臟疾病方面具有顯著的潛力，特別是與用於自動化心臟疾病偵測的人工智慧 (AI) 方法結合使用時。先前已有研究應用基於深度學習的 AI 方法進行心臟疾病偵測。然而，這些模型尚未被廣泛接受為臨床診斷的可靠輔助工具，部分原因在於圍繞許多 AI 演算法的當前黑箱感知。特別是，有必要找出有助於做出準確診斷的 ECG 訊號關鍵特徵，從而增強模型的可解釋性。在本研究中，我們開發了一種視覺轉換器方法，以根據單導程 ECG 資料找出心房顫動。殘差網路 (ResNet) 方法也已開發出來，以便與視覺轉換器方法進行比較。這些模型應用於 Chapman-Shaoxing 資料集，以分類心房顫動，以及另一種常見的心律不整，竇性心動過緩，和正常竇性心律的心跳。這些模型能夠找出決定最終分類的心跳關鍵區域，並強調 P 波和 T 波，以及心跳持續時間和訊號振幅在區分正常竇性心律與心房顫動和竇性心動過緩方面的重要性。</paragraph>
+##### **Pseudocode-Injection Magic: Enabling LLMs to Tackle Graph Computational Tasks**
+2501.13731v1 by Chang Gong, Wanrui Bian, Zhijie Zhang, Weiguo Zheng
 
+Graph computational tasks are inherently challenging and often demand the
+development of advanced algorithms for effective solutions. With the emergence
+of large language models (LLMs), researchers have begun investigating their
+potential to address these tasks. However, existing approaches are constrained
+by LLMs' limited capability to comprehend complex graph structures and their
+high inference costs, rendering them impractical for handling large-scale
+graphs. Inspired by human approaches to graph problems, we introduce a novel
+framework, PIE (Pseudocode-Injection-Enhanced LLM Reasoning for Graph
+Computational Tasks), which consists of three key steps: problem understanding,
+prompt design, and code generation. In this framework, LLMs are tasked with
+understanding the problem and extracting relevant information to generate
+correct code. The responsibility for analyzing the graph structure and
+executing the code is delegated to the interpreter. We inject task-related
+pseudocodes into the prompts to further assist the LLMs in generating efficient
+code. We also employ cost-effective trial-and-error techniques to ensure that
+the LLM-generated code executes correctly. Unlike other methods that require
+invoking LLMs for each individual test case, PIE only calls the LLM during the
+code generation phase, allowing the generated code to be reused and
+significantly reducing inference costs. Extensive experiments demonstrate that
+PIE outperforms existing baselines in terms of both accuracy and computational
+efficiency.
 
-### Medical
-|Publish Date|Title|Authors|Homepage|Code|
-| :---: | :---: | :---: | :---: | :---: |
-|**2025-02-13**|**Metamorphic Testing for Pose Estimation Systems**|Matias Duran et.al.|[2502.09460v1](http://arxiv.org/abs/2502.09460v1)|null|
-|**2025-02-13**|**The Joint Entity-Relation Extraction Model Based on Span and Interactive Fusion Representation for Chinese Medical Texts with Complex Semantics**|Danni Feng et.al.|[2502.09247v1](http://arxiv.org/abs/2502.09247v1)|null|
-|**2025-02-13**|**From large language models to multimodal AI: A scoping review on the potential of generative AI in medicine**|Lukas Buess et.al.|[2502.09242v1](http://arxiv.org/abs/2502.09242v1)|null|
-|**2025-02-13**|**Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**|Flavio Bertini et.al.|[2502.09218v1](http://arxiv.org/abs/2502.09218v1)|null|
-|**2025-02-13**|**Logical Lease Litigation: Prolog and LLMs for Rental Law Compliance in New York**|Sanskar Sehgal et.al.|[2502.09204v1](http://arxiv.org/abs/2502.09204v1)|null|
-|**2025-02-13**|**Two-Stage Representation Learning for Analyzing Movement Behavior Dynamics in People Living with Dementia**|Jin Cui et.al.|[2502.09173v1](http://arxiv.org/abs/2502.09173v1)|null|
-|**2025-02-12**|**HistoSmith: Single-Stage Histology Image-Label Generation via Conditional Latent Diffusion for Enhanced Cell Segmentation and Classification**|Valentina Vadori et.al.|[2502.08754v1](http://arxiv.org/abs/2502.08754v1)|null|
-|**2025-02-12**|**Brain Latent Progression: Individual-based Spatiotemporal Disease Progression on 3D Brain MRIs via Latent Diffusion**|Lemuel Puglisi et.al.|[2502.08560v1](http://arxiv.org/abs/2502.08560v1)|[link](https://github.com/lemuelpuglisi/brlp)|
-|**2025-02-12**|**Representation Learning to Advance Multi-institutional Studies with Electronic Health Record Data**|Doudou Zhou et.al.|[2502.08547v1](http://arxiv.org/abs/2502.08547v1)|null|
-|**2025-02-12**|**EEG Artifact Detection and Correction with Deep Autoencoders**|David Aquilué-Llorens et.al.|[2502.08686v1](http://arxiv.org/abs/2502.08686v1)|null|
-|**2025-02-12**|**SycEval: Evaluating LLM Sycophancy**|Aaron Fanous et.al.|[2502.08177v1](http://arxiv.org/abs/2502.08177v1)|null|
-|**2025-02-11**|**Caught in the Web of Words: Do LLMs Fall for Spin in Medical Literature?**|Hye Sun Yun et.al.|[2502.07963v1](http://arxiv.org/abs/2502.07963v1)|null|
-|**2025-02-11**|**An Advanced NLP Framework for Automated Medical Diagnosis with DeBERTa and Dynamic Contextual Positional Gating**|Mohammad Ali Labbaf Khaniki et.al.|[2502.07755v1](http://arxiv.org/abs/2502.07755v1)|null|
-|**2025-02-11**|**Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension**|Wenbo Gong et.al.|[2502.07752v1](http://arxiv.org/abs/2502.07752v1)|null|
-|**2025-02-11**|**The Devil is in the Prompts: De-Identification Traces Enhance Memorization Risks in Synthetic Chest X-Ray Generation**|Raman Dutt et.al.|[2502.07516v1](http://arxiv.org/abs/2502.07516v1)|[link](https://github.com/Raman1121/diffusion_memorization)|
-|**2025-02-11**|**KPIs 2024 Challenge: Advancing Glomerular Segmentation from Patch- to Slide-Level**|Ruining Deng et.al.|[2502.07288v1](http://arxiv.org/abs/2502.07288v1)|[link](https://github.com/agaldran/kpis)|
-|**2025-02-11**|**Early Risk Prediction of Pediatric Cardiac Arrest from Electronic Health Records via Multimodal Fused Transformer**|Jiaying Lu et.al.|[2502.07158v1](http://arxiv.org/abs/2502.07158v1)|null|
-|**2025-02-11**|**Explaining 3D Computed Tomography Classifiers with Counterfactuals**|Joseph Paul Cohen et.al.|[2502.07156v1](http://arxiv.org/abs/2502.07156v1)|[link](https://github.com/ieee8023/ct-counterfactuals)|
-|**2025-02-10**|**Interactive Data Harmonization with LLM Agents**|Aécio Santos et.al.|[2502.07132v1](http://arxiv.org/abs/2502.07132v1)|null|
-|**2025-02-10**|**Machine Learning for Everyone: Simplifying Healthcare Analytics with BigQuery ML**|Mohammad Amir Salari et.al.|[2502.07026v1](http://arxiv.org/abs/2502.07026v1)|null|
-|**2025-02-10**|**AIMS.au: A Dataset for the Analysis of Modern Slavery Countermeasures in Corporate Statements**|Adriana Eufrosiana Bora et.al.|[2502.07022v1](http://arxiv.org/abs/2502.07022v1)|null|
-|**2025-02-10**|**Recent Advances, Applications and Open Challenges in Machine Learning for Health: Reflections from Research Roundtables at ML4H 2024 Symposium**|Amin Adibi et.al.|[2502.06693v1](http://arxiv.org/abs/2502.06693v1)|null|
-|**2025-02-10**|**Automatic Evaluation of Healthcare LLMs Beyond Question-Answering**|Anna Arias-Duart et.al.|[2502.06666v1](http://arxiv.org/abs/2502.06666v1)|null|
-|**2025-02-10**|**Few-Shot Classification and Anatomical Localization of Tissues in SPECT Imaging**|Mohammed Abdul Hafeez Khan et.al.|[2502.06632v1](http://arxiv.org/abs/2502.06632v1)|null|
-|**2025-02-10**|**Illegal Waste Detection in Remote Sensing Images: A Case Study**|Federico Gibellini et.al.|[2502.06607v2](http://arxiv.org/abs/2502.06607v2)|null|
-|**2025-02-10**|**FEMBA: Efficient and Scalable EEG Analysis with a Bidirectional Mamba Foundation Model**|Anna Tegon et.al.|[2502.06438v1](http://arxiv.org/abs/2502.06438v1)|null|
-|**2025-02-10**|**Is an Ultra Large Natural Image-Based Foundation Model Superior to a Retina-Specific Model for Detecting Ocular and Systemic Diseases?**|Qingshan Hou et.al.|[2502.06289v1](http://arxiv.org/abs/2502.06289v1)|null|
-|**2025-02-10**|**Integrating Sequence and Image Modeling in Irregular Medical Time Series Through Self-Supervised Learning**|Liuqing Chen et.al.|[2502.06134v1](http://arxiv.org/abs/2502.06134v1)|null|
-|**2025-02-10**|**Foundation Model of Electronic Medical Records for Adaptive Risk Estimation**|Pawel Renc et.al.|[2502.06124v1](http://arxiv.org/abs/2502.06124v1)|null|
-|**2025-02-10**|**Can ChatGPT Diagnose Alzheimer's Disease?**|Quoc-Toan Nguyen et.al.|[2502.06907v1](http://arxiv.org/abs/2502.06907v1)|null|
-|**2025-02-09**|**Protecting Intellectual Property of EEG-based Neural Networks with Watermarking**|Ahmed Abdelaziz et.al.|[2502.05931v1](http://arxiv.org/abs/2502.05931v1)|[link](https://github.com/Prog-Jacob/watermarking-eeg-models)|
-|**2025-02-09**|**Enhancing Depression Detection with Chain-of-Thought Prompting: From Emotion to Reasoning Using Large Language Models**|Shiyu Teng et.al.|[2502.05879v1](http://arxiv.org/abs/2502.05879v1)|null|
-|**2025-02-09**|**LLMs for Drug-Drug Interaction Prediction: A Comprehensive Comparison**|Gabriele De Vito et.al.|[2502.06890v1](http://arxiv.org/abs/2502.06890v1)|null|
-|**2025-02-09**|**Decoding Complexity: Intelligent Pattern Exploration with CHPDA (Context Aware Hybrid Pattern Detection Algorithm)**|Lokesh Koli et.al.|[2502.07815v1](http://arxiv.org/abs/2502.07815v1)|null|
-|**2025-02-09**|**WatchGuardian: Enabling User-Defined Personalized Just-in-Time Intervention on Smartwatch**|Ying Lei et.al.|[2502.05783v1](http://arxiv.org/abs/2502.05783v1)|null|
-|**2025-02-09**|**RECOVER: Designing a Large Language Model-based Remote Patient Monitoring System for Postoperative Gastrointestinal Cancer Care**|Ziqi Yang et.al.|[2502.05740v1](http://arxiv.org/abs/2502.05740v1)|null|
-|**2025-02-08**|**4D VQ-GAN: Synthesising Medical Scans at Any Time Point for Personalised Disease Progression Modelling of Idiopathic Pulmonary Fibrosis**|An Zhao et.al.|[2502.05713v1](http://arxiv.org/abs/2502.05713v1)|null|
-|**2025-02-08**|**KMI: A Dataset of Korean Motivational Interviewing Dialogues for Psychotherapy**|Hyunjong Kim et.al.|[2502.05651v1](http://arxiv.org/abs/2502.05651v1)|null|
-|**2025-02-08**|**ELMTEX: Fine-Tuning Large Language Models for Structured Clinical Information Extraction. A Case Study on Clinical Reports**|Aynur Guluzade et.al.|[2502.05638v1](http://arxiv.org/abs/2502.05638v1)|[link](https://gitlab.cc-asp.fraunhofer.de/health-open/elmtex)|
-|**2025-02-08**|**Multi-scale Masked Autoencoder for Electrocardiogram Anomaly Detection**|Ya Zhou et.al.|[2502.05494v1](http://arxiv.org/abs/2502.05494v1)|null|
-|**2025-02-08**|**DCENWCNet: A Deep CNN Ensemble Network for White Blood Cell Classification with LIME-Based Explainability**|Sibasish Dhibar et.al.|[2502.05459v1](http://arxiv.org/abs/2502.05459v1)|null|
-|**2025-02-07**|**Multi-Class Segmentation of Aortic Branches and Zones in Computed Tomography Angiography: The AortaSeg24 Challenge**|Muhammad Imran et.al.|[2502.05330v1](http://arxiv.org/abs/2502.05330v1)|[link](https://github.com/MaxwellEng/MICCAI_CHANLLENGE24_HJL)|
-|**2025-02-07**|**Homeomorphism Prior for False Positive and Negative Problem in Medical Image Dense Contrastive Representation Learning**|Yuting He et.al.|[2502.05282v1](http://arxiv.org/abs/2502.05282v1)|null|
-|**2025-02-07**|**"It Felt Like I Was Left in the Dark": Exploring Information Needs and Design Opportunities for Family Caregivers of Older Adult Patients in Critical Care Settings**|Shihan Fu et.al.|[2502.05115v1](http://arxiv.org/abs/2502.05115v1)|null|
-|**2025-02-07**|**Mitigating Unintended Memorization with LoRA in Federated Learning for LLMs**|Thierry Bossy et.al.|[2502.05087v1](http://arxiv.org/abs/2502.05087v1)|[link](https://github.com/tuneinsight/federated-llms)|
-|**2025-02-07**|**MedMimic: Physician-Inspired Multimodal Fusion for Early Diagnosis of Fever of Unknown Origin**|Minrui Chen et.al.|[2502.04794v1](http://arxiv.org/abs/2502.04794v1)|null|
-|**2025-02-06**|**MedGNN: Towards Multi-resolution Spatiotemporal Graph Learning for Medical Time Series Classification**|Wei Fan et.al.|[2502.04515v1](http://arxiv.org/abs/2502.04515v1)|[link](https://github.com/aikunyi/MedGNN)|
-|**2025-02-06**|**Integrating Generative Artificial Intelligence in ADRD: A Framework for Streamlining Diagnosis and Care in Neurodegenerative Diseases**|Andrew G. Breithaupt et.al.|[2502.06842v1](http://arxiv.org/abs/2502.06842v1)|null|
-|**2025-02-06**|**Primary Care Diagnoses as a Reliable Predictor for Orthopedic Surgical Interventions**|Khushboo Verma et.al.|[2502.04423v1](http://arxiv.org/abs/2502.04423v1)|null|
-|**2025-02-06**|**Automatic quantification of breast cancer biomarkers from multiple 18F-FDG PET image segmentation**|Tewele W. Tareke et.al.|[2502.04083v1](http://arxiv.org/abs/2502.04083v1)|null|
-|**2025-02-06**|**Generalize Drug Response Prediction by Latent Independent Projection for Asymmetric Constrained Domain Generalization**|Ran Song et.al.|[2502.04034v1](http://arxiv.org/abs/2502.04034v1)|null|
-|**2025-02-06**|**MedRAG: Enhancing Retrieval-augmented Generation with Knowledge Graph-Elicited Reasoning for Healthcare Copilot**|Xuejiao Zhao et.al.|[2502.04413v1](http://arxiv.org/abs/2502.04413v1)|[link](https://github.com/snowteam2023/medrag)|
-|**2025-02-06**|**Transforming Multimodal Models into Action Models for Radiotherapy**|Matteo Ferrante et.al.|[2502.04408v1](http://arxiv.org/abs/2502.04408v1)|null|
-|**2025-02-06**|**Online Location Planning for AI-Defined Vehicles: Optimizing Joint Tasks of Order Serving and Spatio-Temporal Heterogeneous Model Fine-Tuning**|Bokeng Zheng et.al.|[2502.04399v1](http://arxiv.org/abs/2502.04399v1)|null|
-|**2025-02-06**|**Multimodal Medical Code Tokenizer**|Xiaorui Su et.al.|[2502.04397v2](http://arxiv.org/abs/2502.04397v2)|null|
-|**2025-02-06**|**A Retrospective Systematic Study on Hierarchical Sparse Query Transformer-assisted Ultrasound Screening for Early Hepatocellular Carcinoma**|Chaoyin She et.al.|[2502.03772v1](http://arxiv.org/abs/2502.03772v1)|[link](https://github.com/Asunatan/HSQformer)|
-|**2025-02-05**|**Towards Fair Medical AI: Adversarial Debiasing of 3D CT Foundation Embeddings**|Guangyao Zheng et.al.|[2502.04386v1](http://arxiv.org/abs/2502.04386v1)|[link](https://github.com/BioIntelligence-Lab/VAE-Adversarial-Debiasing)|
-|**2025-02-05**|**Clinically-Inspired Hierarchical Multi-Label Classification of Chest X-rays with a Penalty-Based Loss Function**|Mehrdad Asadi et.al.|[2502.03591v1](http://arxiv.org/abs/2502.03591v1)|[link](https://github.com/the-mercury/CIHMLC)|
-|**2025-02-05**|**Code Simulation as a Proxy for High-order Tasks in Large Language Models**|Emanuele La Malfa et.al.|[2502.03568v1](http://arxiv.org/abs/2502.03568v1)|null|
-|**2025-02-05**|**Limitations of Large Language Models in Clinical Problem-Solving Arising from Inflexible Reasoning**|Jonathan Kim et.al.|[2502.04381v1](http://arxiv.org/abs/2502.04381v1)|null|
-|**2025-02-05**|**Accurate AI-Driven Emergency Vehicle Location Tracking in Healthcare ITS Digital Twin**|Sarah Al-Shareeda et.al.|[2502.03396v1](http://arxiv.org/abs/2502.03396v1)|null|
-|**2025-02-05**|**RadVLM: A Multitask Conversational Vision-Language Model for Radiology**|Nicolas Deperrois et.al.|[2502.03333v1](http://arxiv.org/abs/2502.03333v1)|null|
-|**2025-02-05**|**MeDiSumQA: Patient-Oriented Question-Answer Generation from Discharge Letters**|Amin Dada et.al.|[2502.03298v1](http://arxiv.org/abs/2502.03298v1)|null|
-|**2025-02-05**|**Deep Learning Pipeline for Fully Automated Myocardial Infarct Segmentation from Clinical Cardiac MR Scans**|Matthias Schwab et.al.|[2502.03272v1](http://arxiv.org/abs/2502.03272v1)|null|
-|**2025-02-05**|**Long-tailed Medical Diagnosis with Relation-aware Representation Learning and Iterative Classifier Calibration**|Li Pan et.al.|[2502.03238v2](http://arxiv.org/abs/2502.03238v2)|[link](https://github.com/peterlipan/lmd)|
-|**2025-02-05**|**Fine-Tuning Strategies for Continual Online EEG Motor Imagery Decoding: Insights from a Large-Scale Longitudinal Study**|Martin Wimpff et.al.|[2502.06828v1](http://arxiv.org/abs/2502.06828v1)|null|
-|**2025-02-05**|**MedBioLM: Optimizing Medical and Biological QA with Fine-Tuned Large Language Models and Retrieval-Augmented Generation**|Seonok Kim et.al.|[2502.03004v1](http://arxiv.org/abs/2502.03004v1)|null|
-|**2025-02-05**|**Contrastive Token-level Explanations for Graph-based Rumour Detection**|Daniel Wai Kit Chin et.al.|[2502.04366v1](http://arxiv.org/abs/2502.04366v1)|null|
-|**2025-02-05**|**AI-Based Thermal Video Analysis in Privacy-Preserving Healthcare: A Case Study on Detecting Time of Birth**|Jorge García-Torres et.al.|[2502.04365v1](http://arxiv.org/abs/2502.04365v1)|null|
-|**2025-02-04**|**3D Foundation AI Model for Generalizable Disease Detection in Head Computed Tomography**|Weicheng Zhu et.al.|[2502.02779v1](http://arxiv.org/abs/2502.02779v1)|null|
-|**2025-02-04**|**Adaptive Voxel-Weighted Loss Using L1 Norms in Deep Neural Networks for Detection and Segmentation of Prostate Cancer Lesions in PET/CT Images**|Obed Korshie Dzikunu et.al.|[2502.02756v1](http://arxiv.org/abs/2502.02756v1)|[link](https://github.com/obeddzik/pca_segment)|
-|**2025-02-04**|**Diffusion Instruction Tuning**|Chen Jin et.al.|[2502.06814v1](http://arxiv.org/abs/2502.06814v1)|null|
-|**2025-02-04**|**MedRAX: Medical Reasoning Agent for Chest X-ray**|Adibvafa Fallahpour et.al.|[2502.02673v1](http://arxiv.org/abs/2502.02673v1)|[link](https://github.com/bowang-lab/medrax)|
-|**2025-02-04**|**Open Foundation Models in Healthcare: Challenges, Paradoxes, and Opportunities with GenAI Driven Personalized Prescription**|Mahdi Alkaeed et.al.|[2502.04356v1](http://arxiv.org/abs/2502.04356v1)|null|
-|**2025-02-04**|**Decision Theoretic Foundations for Conformal Prediction: Optimal Uncertainty Quantification for Risk-Averse Agents**|Shayan Kiyani et.al.|[2502.02561v1](http://arxiv.org/abs/2502.02561v1)|null|
-|**2025-02-04**|**CoRPA: Adversarial Image Generation for Chest X-rays Using Concept Vector Perturbations and Generative Models**|Amy Rafferty et.al.|[2502.05214v1](http://arxiv.org/abs/2502.05214v1)|null|
-|**2025-02-04**|**A Self-Supervised Framework for Improved Generalisability in Ultrasound B-mode Image Segmentation**|Edward Ellis et.al.|[2502.02489v1](http://arxiv.org/abs/2502.02489v1)|null|
-|**2025-02-04**|**Medical Multimodal Model Stealing Attacks via Adversarial Domain Alignment**|Yaling Shen et.al.|[2502.02438v1](http://arxiv.org/abs/2502.02438v1)|null|
-|**2025-02-04**|**Test Time Training for 4D Medical Image Interpolation**|Qikang Zhang et.al.|[2502.02341v1](http://arxiv.org/abs/2502.02341v1)|[link](https://github.com/chaostheproducer/ttt4d)|
-|**2025-02-04**|**Conversation AI Dialog for Medicare powered by Finetuning and Retrieval Augmented Generation**|Atharva Mangeshkumar Agrawal et.al.|[2502.02249v1](http://arxiv.org/abs/2502.02249v1)|null|
-|**2025-02-04**|**Deep Learning-Based Facial Expression Recognition for the Elderly: A Systematic Review**|F. Xavier Gaya-Morey et.al.|[2502.02618v1](http://arxiv.org/abs/2502.02618v1)|null|
-|**2025-02-04**|**Causally-informed Deep Learning towards Explainable and Generalizable Outcomes Prediction in Critical Care**|Yuxiao Cheng et.al.|[2502.02109v1](http://arxiv.org/abs/2502.02109v1)|null|
-|**2025-02-04**|**JingFang: A Traditional Chinese Medicine Large Language Model of Expert-Level Medical Diagnosis and Syndrome Differentiation-Based Treatment**|Yehan Yan et.al.|[2502.04345v1](http://arxiv.org/abs/2502.04345v1)|null|
-|**2025-02-03**|**An Agentic AI Workflow for Detecting Cognitive Concerns in Real-world Data**|Jiazi Tian et.al.|[2502.01789v1](http://arxiv.org/abs/2502.01789v1)|null|
-|**2025-02-03**|**Can Domain Experts Rely on AI Appropriately? A Case Study on AI-Assisted Prostate Cancer MRI Diagnosis**|Chacha Chen et.al.|[2502.03482v1](http://arxiv.org/abs/2502.03482v1)|null|
-|**2025-02-03**|**Improving Transformer World Models for Data-Efficient RL**|Antoine Dedieu et.al.|[2502.01591v1](http://arxiv.org/abs/2502.01591v1)|null|
-|**2025-02-03**|**Data-Efficient Model for Psychological Resilience Prediction based on Neurological Data**|Zhi Zhang et.al.|[2502.01377v1](http://arxiv.org/abs/2502.01377v1)|null|
-|**2025-02-03**|**OphthBench: A Comprehensive Benchmark for Evaluating Large Language Models in Chinese Ophthalmology**|Chengfeng Zhou et.al.|[2502.01243v1](http://arxiv.org/abs/2502.01243v1)|null|
-|**2025-02-03**|**MIND: Modality-Informed Knowledge Distillation Framework for Multimodal Clinical Prediction Tasks**|Alejandro Guerra-Manzanares et.al.|[2502.01158v1](http://arxiv.org/abs/2502.01158v1)|null|
-|**2025-02-03**|**Beyond Yes or No: Predictive Compliance Monitoring Approaches for Quantifying the Magnitude of Compliance Violations**|Qian Chen et.al.|[2502.01141v1](http://arxiv.org/abs/2502.01141v1)|null|
-|**2025-02-03**|**Pulse-PPG: An Open-Source Field-Trained PPG Foundation Model for Wearable Applications Across Lab and Field Settings**|Mithun Saha et.al.|[2502.01108v1](http://arxiv.org/abs/2502.01108v1)|null|
-|**2025-02-03**|**Tutorial on Using Machine Learning and Deep Learning Models for Mental Illness Detection**|Yeyubei Zhang et.al.|[2502.04342v1](http://arxiv.org/abs/2502.04342v1)|null|
-|**2025-02-02**|**Agent-Based Uncertainty Awareness Improves Automated Radiology Report Labeling with an Open-Source Large Language Model**|Hadas Ben-Atya et.al.|[2502.01691v1](http://arxiv.org/abs/2502.01691v1)|null|
-|**2025-02-02**|**Automated Extraction of Spatio-Semantic Graphs for Identifying Cognitive Impairment**|Si-Ioi Ng et.al.|[2502.01685v1](http://arxiv.org/abs/2502.01685v1)|null|
-|**2025-02-02**|**Registration-Enhanced Segmentation Method for Prostate Cancer in Ultrasound Images**|Shengtian Sang et.al.|[2502.00712v1](http://arxiv.org/abs/2502.00712v1)|null|
-|**2025-02-02**|**TMI-CLNet: Triple-Modal Interaction Network for Chronic Liver Disease Prognosis From Imaging, Clinical, and Radiomic Data Fusion**|Linglong Wu et.al.|[2502.00695v1](http://arxiv.org/abs/2502.00695v1)|null|
-|**2025-02-02**|**Safety at Scale: A Comprehensive Survey of Large Model Safety**|Xingjun Ma et.al.|[2502.05206v2](http://arxiv.org/abs/2502.05206v2)|null|
-|**2025-02-02**|**Enhanced Convolutional Neural Networks for Improved Image Classification**|Xiaoran Yang et.al.|[2502.00663v1](http://arxiv.org/abs/2502.00663v1)|null|
-|**2025-02-02**|**Distribution-aware Fairness Learning in Medical Image Segmentation From A Control-Theoretic Perspective**|Yujin Oh et.al.|[2502.00619v1](http://arxiv.org/abs/2502.00619v1)|null|
-|**2025-02-01**|**Generating crossmodal gene expression from cancer histopathology improves multimodal AI predictions**|Samiran Dey et.al.|[2502.00568v3](http://arxiv.org/abs/2502.00568v3)|[link](https://github.com/Samiran-Dey/PathoGen)|
+摘要：圖表計算任務本質上具有挑戰性，而且通常需要開發先進的演算法才能有效解決。隨著大型語言模型 (LLM) 的出現，研究人員已開始探討其解決這些任務的可能性。然而，現有方法受到 LLM 理解複雜圖形結構的能力有限以及其高推理成本的限制，這使得它們不切實際地處理大規模圖形。受到人類解決圖形問題的方法啟發，我們引入了 PIE（偽代碼注入增強 LLM 圖形計算任務推理）這個新框架，它包含三個關鍵步驟：問題理解、提示設計和代碼生成。在此框架中，LLM 的任務是理解問題並擷取相關資訊以產生正確的代碼。分析圖形結構和執行代碼的責任委派給解釋器。我們將與任務相關的偽代碼注入提示中，以進一步協助 LLM 產生有效的代碼。我們還採用具有成本效益的試錯技術，以確保 LLM 生成的代碼正確執行。與需要為每個個別測試案例呼叫 LLM 的其他方法不同，PIE 僅在代碼產生階段呼叫 LLM，允許重複使用產生的代碼並大幅降低推理成本。大量的實驗證明，PIE 在準確性和計算效率方面都優於現有的基準。
 
-#### Abstracts
-##### **Metamorphic Testing for Pose Estimation Systems**
-2502.09460v1 by Matias Duran, Thomas Laurent, Ellen Rushe, Anthony Ventresque
+##### **CAPRAG: A Large Language Model Solution for Customer Service and Automatic Reporting using Vector and Graph Retrieval-Augmented Generation**
+2501.13993v1 by Hamza Landolsi, Kais Letaief, Nizar Taghouti, Ines Abdeljaoued-Tej
 
-Pose estimation systems are used in a variety of fields, from sports
-analytics to livestock care. Given their potential impact, it is paramount to
-systematically test their behaviour and potential for failure. This is a
-complex task due to the oracle problem and the high cost of manual labelling
-necessary to build ground truth keypoints. This problem is exacerbated by the
-fact that different applications require systems to focus on different subjects
-(e.g., human versus animal) or landmarks (e.g., only extremities versus whole
-body and face), which makes labelled test data rarely reusable. To combat these
-problems we propose MET-POSE, a metamorphic testing framework for pose
-estimation systems that bypasses the need for manual annotation while assessing
-the performance of these systems under different circumstances. MET-POSE thus
-allows users of pose estimation systems to assess the systems in conditions
-that more closely relate to their application without having to label an ad-hoc
-test dataset or rely only on available datasets, which may not be adapted to
-their application domain. While we define MET-POSE in general terms, we also
-present a non-exhaustive list of metamorphic rules that represent common
-challenges in computer vision applications, as well as a specific way to
-evaluate these rules. We then experimentally show the effectiveness of MET-POSE
-by applying it to Mediapipe Holistic, a state of the art human pose estimation
-system, with the FLIC and PHOENIX datasets. With these experiments, we outline
-numerous ways in which the outputs of MET-POSE can uncover faults in pose
-estimation systems at a similar or higher rate than classic testing using hand
-labelled data, and show that users can tailor the rule set they use to the
-faults and level of accuracy relevant to their application.
+The introduction of new features and services in the banking sector often
+overwhelms customers, creating an opportunity for banks to enhance user
+experience through financial chatbots powered by large language models (LLMs).
+We initiated an AI agent designed to provide customers with relevant
+information about banking services and insights from annual reports. We
+proposed a hybrid Customer Analysis Pipeline Retrieval-Augmented Generation
+(CAPRAG) that effectively addresses both relationship-based and contextual
+queries, thereby improving customer engagement in the digital banking
+landscape. To implement this, we developed a processing pipeline to refine text
+data, which we utilized in two main frameworks: Vector RAG and Graph RAG. This
+dual approach enables us to populate both vector and graph databases with
+processed data for efficient retrieval. The Cypher query component is employed
+to effectively query the graph database. When a user submits a query, it is
+first expanded by a query expansion module before being routed to construct a
+final query from the hybrid Knowledge Base (KB). This final query is then sent
+to an open-source LLM for response generation. Overall, our innovative,
+designed to international banks, serves bank's customers in an increasingly
+complex digital environment, enhancing clarity and accessibility of
+information.
 
-摘要：姿勢估計系統應用於各種領域，從運動分析到牲畜照護。鑑於其潛在影響，系統性地測試其行為和故障潛力至關重要。由於預言機問題以及建立地面實況關鍵點所需的手動標記成本高，這是一項複雜的任務。這個問題因不同的應用需要系統專注於不同的主體（例如，人類對動物）或地標（例如，只有四肢對全身和臉部）而加劇，這使得標記的測試數據很少可以重複使用。為了解決這些問題，我們提出了 MET-POSE，這是一個姿勢估計系統的變形測試框架，在評估這些系統在不同情況下的性能時，可以繞過手動註解的需要。因此，MET-POSE 允許姿勢估計系統的使用者在更接近其應用程式的條件下評估系統，而無需標記臨時測試數據集或僅依賴可用數據集，這些數據集可能不適合其應用領域。雖然我們以一般術語定義 MET-POSE，但我們也提供了一個非詳盡的變形規則列表，這些規則代表了電腦視覺應用中的常見挑戰，以及評估這些規則的具體方法。然後，我們通過將 MET-POSE 應用於 Mediapipe Holistic（一種先進的人類姿勢估計系統），並使用 FLIC 和 PHOENIX 數據集，以實驗方式展示 MET-POSE 的有效性。通過這些實驗，我們概述了 MET-POSE 的輸出可以揭示姿勢估計系統中故障的許多方法，其速度與使用手動標記數據的傳統測試類似或更高，並表明使用者可以根據其應用程式相關的故障和準確度等級來調整他們使用的規則集。
+摘要：銀行業中新功能和服務的推出經常讓客戶感到不知所措，這為銀行透過大型語言模型 (LLM) 驅動的金融聊天機器人來提升使用者體驗創造了機會。我們啟動了一個人工智慧代理，旨在為客戶提供有關銀行服務和年度報告見解的相關資訊。我們提出了一個混合式客戶分析管道檢索擴充生成 (CAPRAG)，它有效地處理基於關係和情境式的查詢，從而提升數位銀行環境中的客戶參與度。為了實作這一點，我們開發了一個處理管道來精煉文字資料，我們在兩個主要架構中使用它：Vector RAG 和 Graph RAG。這種雙管齊下的方法讓我們能夠使用處理過的資料來填補向量和圖形資料庫，以利於有效檢索。Cypher 查詢元件用於有效查詢圖形資料庫。當使用者提交查詢時，它會先由查詢擴充模組擴充，然後再路由到混合式知識庫 (KB) 中建構最終查詢。然後這個最終查詢會傳送給開源 LLM 以產生回應。整體而言，我們創新的設計服務於國際銀行，在日益複雜的數位環境中服務銀行客戶，提升資訊的清晰度和可及性。
 
-##### **The Joint Entity-Relation Extraction Model Based on Span and Interactive Fusion Representation for Chinese Medical Texts with Complex Semantics**
-2502.09247v1 by Danni Feng, Runzhi Li, Jing Wang, Siyu Yan, Lihong Ma, Yunli Xing
+##### **Dual-Branch HNSW Approach with Skip Bridges and LID-Driven Optimization**
+2501.13992v1 by Hy Nguyen, Nguyen Hung Nguyen, Nguyen Linh Bao Nguyen, Srikanth Thudumu, Hung Du, Rajesh Vasa, Kon Mouzakis
 
-Joint entity-relation extraction is a critical task in transforming
-unstructured or semi-structured text into triplets, facilitating the
-construction of large-scale knowledge graphs, and supporting various downstream
-applications. Despite its importance, research on Chinese text, particularly
-with complex semantics in specialized domains like medicine, remains limited.
-To address this gap, we introduce the CH-DDI, a Chinese drug-drug interactions
-dataset designed to capture the intricacies of medical text. Leveraging the
-strengths of attention mechanisms in capturing long-range dependencies, we
-propose the SEA module, which enhances the extraction of complex contextual
-semantic information, thereby improving entity recognition and relation
-extraction. Additionally, to address the inefficiencies of existing methods in
-facilitating information exchange between entity recognition and relation
-extraction, we present an interactive fusion representation module. This module
-employs Cross Attention for bidirectional information exchange between the
-tasks and further refines feature extraction through BiLSTM. Experimental
-results on both our CH-DDI dataset and public CoNLL04 dataset demonstrate that
-our model exhibits strong generalization capabilities. On the CH-DDI dataset,
-our model achieves an F1-score of 96.73% for entity recognition and 78.43% for
-relation extraction. On the CoNLL04 dataset, it attains an entity recognition
-precision of 89.54% and a relation extraction accuracy of 71.64%.
+The Hierarchical Navigable Small World (HNSW) algorithm is widely used for
+approximate nearest neighbor (ANN) search, leveraging the principles of
+navigable small-world graphs. However, it faces some limitations. The first is
+the local optima problem, which arises from the algorithm's greedy search
+strategy, selecting neighbors based solely on proximity at each step. This
+often leads to cluster disconnections. The second limitation is that HNSW
+frequently fails to achieve logarithmic complexity, particularly in
+high-dimensional datasets, due to the exhaustive traversal through each layer.
+To address these limitations, we propose a novel algorithm that mitigates local
+optima and cluster disconnections while enhancing the construction speed,
+maintaining inference speed. The first component is a dual-branch HNSW
+structure with LID-based insertion mechanisms, enabling traversal from multiple
+directions. This improves outlier node capture, enhances cluster connectivity,
+accelerates construction speed and reduces the risk of local minima. The second
+component incorporates a bridge-building technique that bypasses redundant
+intermediate layers, maintaining inference and making up the additional
+computational overhead introduced by the dual-branch structure. Experiments on
+various benchmarks and datasets showed that our algorithm outperforms the
+original HNSW in both accuracy and speed. We evaluated six datasets across
+Computer Vision (CV), and Natural Language Processing (NLP), showing recall
+improvements of 18\% in NLP, and up to 30\% in CV tasks while reducing the
+construction time by up to 20\% and maintaining the inference speed. We did not
+observe any trade-offs in our algorithm. Ablation studies revealed that
+LID-based insertion had the greatest impact on performance, followed by the
+dual-branch structure and bridge-building components.
 
-摘要：聯合實體關係抽取是將非結構化或半結構化文字轉換為三元組的重要任務，有助於建構大規模知識圖譜，並支援各種下游應用程式。儘管其重要性，但針對中文文本的研究，特別是醫學等專業領域中具有複雜語義的研究仍十分有限。為了解決這個差距，我們引入了 CH-DDI，一個中文藥物-藥物交互作用資料集，旨在擷取醫學文本的複雜性。利用注意力機制在擷取長程依賴關係方面的優勢，我們提出了 SEA 模組，增強了複雜脈絡語義資訊的抽取，從而改進了實體辨識和關係抽取。此外，為了解決現有方法在促進實體辨識和關係抽取之間資訊交換方面的低效率問題，我們提出了互動式融合表示模組。此模組採用交叉注意力，在任務之間進行雙向資訊交換，並透過 BiLSTM 進一步精煉特徵抽取。在我們的 CH-DDI 資料集和公開的 CoNLL04 資料集上的實驗結果表明，我們的模型展現出強大的泛化能力。在 CH-DDI 資料集上，我們的模型在實體辨識方面達到了 96.73% 的 F1 分數，在關係抽取方面達到了 78.43% 的 F1 分數。在 CoNLL04 資料集上，它在實體辨識方面達到了 89.54% 的準確度，在關係抽取方面達到了 71.64% 的準確度。
+摘要：分層可導航小世界 (HNSW) 演算法廣泛用於近似最近鄰居 (ANN) 搜尋，並利用可導航小世界圖形的原理。然而，它面臨一些限制。第一個是局部最佳化問題，這源自於演算法的貪婪搜尋策略，在每個步驟中僅根據鄰近度來選擇鄰居。這通常會導致群集斷線。第二個限制是，由於透過每一層的窮舉式遍歷，HNSW 常常無法在高維度資料集中達成對數複雜度。為了解決這些限制，我們提出了一種新的演算法，它可以減輕局部最佳化和群集斷線，同時提高建構速度，並維持推論速度。第一個組成部分是一個具有基於 LID 的插入機制的雙分支 HNSW 結構，它能從多個方向進行遍歷。這改善了異常值節點的擷取，增強了群集連通性，加速了建構速度，並降低了局部最小值的風險。第二個組成部分包含一種橋樑建構技術，它繞過了多餘的中間層，維持推論並彌補了雙分支結構所帶來的額外運算負擔。在各種基準和資料集上的實驗顯示，我們的演算法在準確度和速度上都優於原始的 HNSW。我們評估了電腦視覺 (CV) 和自然語言處理 (NLP) 中的六個資料集，顯示 NLP 中的召回率提高了 18%，CV 任務中提高了 30%，同時將建構時間縮短了 20%，並維持了推論速度。我們沒有在我們的演算法中觀察到任何取捨。消融研究顯示，基於 LID 的插入對效能的影響最大，其次是雙分支結構和橋樑建構組成部分。
 
-##### **From large language models to multimodal AI: A scoping review on the potential of generative AI in medicine**
-2502.09242v1 by Lukas Buess, Matthias Keicher, Nassir Navab, Andreas Maier, Soroosh Tayebi Arasteh
+##### **Comprehensive Modeling and Question Answering of Cancer Clinical Practice Guidelines using LLMs**
+2501.13984v1 by Bhumika Gupta, Pralaypati Ta, Keerthi Ram, Mohanasankar Sivaprakasam
 
-Generative artificial intelligence (AI) models, such as diffusion models and
-OpenAI's ChatGPT, are transforming medicine by enhancing diagnostic accuracy
-and automating clinical workflows. The field has advanced rapidly, evolving
-from text-only large language models for tasks such as clinical documentation
-and decision support to multimodal AI systems capable of integrating diverse
-data modalities, including imaging, text, and structured data, within a single
-model. The diverse landscape of these technologies, along with rising interest,
-highlights the need for a comprehensive review of their applications and
-potential. This scoping review explores the evolution of multimodal AI,
-highlighting its methods, applications, datasets, and evaluation in clinical
-settings. Adhering to PRISMA-ScR guidelines, we systematically queried PubMed,
-IEEE Xplore, and Web of Science, prioritizing recent studies published up to
-the end of 2024. After rigorous screening, 144 papers were included, revealing
-key trends and challenges in this dynamic field. Our findings underscore a
-shift from unimodal to multimodal approaches, driving innovations in diagnostic
-support, medical report generation, drug discovery, and conversational AI.
-However, critical challenges remain, including the integration of heterogeneous
-data types, improving model interpretability, addressing ethical concerns, and
-validating AI systems in real-world clinical settings. This review summarizes
-the current state of the art, identifies critical gaps, and provides insights
-to guide the development of scalable, trustworthy, and clinically impactful
-multimodal AI solutions in healthcare.
+The updated recommendations on diagnostic procedures and treatment pathways
+for a medical condition are documented as graphical flows in Clinical Practice
+Guidelines (CPGs). For effective use of the CPGs in helping medical
+professionals in the treatment decision process, it is necessary to fully
+capture the guideline knowledge, particularly the contexts and their
+relationships in the graph. While several existing works have utilized these
+guidelines to create rule bases for Clinical Decision Support Systems, limited
+work has been done toward directly capturing the full medical knowledge
+contained in CPGs. This work proposes an approach to create a contextually
+enriched, faithful digital representation of National Comprehensive Cancer
+Network (NCCN) Cancer CPGs in the form of graphs using automated extraction and
+node & relationship classification. We also implement semantic enrichment of
+the model by using Large Language Models (LLMs) for node classification,
+achieving an accuracy of 80.86% and 88.47% with zero-shot learning and few-shot
+learning, respectively. Additionally, we introduce a methodology for answering
+natural language questions with constraints to guideline text by leveraging
+LLMs to extract the relevant subgraph from the guideline knowledge base. By
+generating natural language answers based on subgraph paths and semantic
+information, we mitigate the risk of incorrect answers and hallucination
+associated with LLMs, ensuring factual accuracy in medical domain Question
+Answering.
 
-摘要：生成式人工智能 (AI) 模型，例如扩散模型和 OpenAI 的 ChatGPT，通过提高诊断准确性和自动化临床工作流程，正在改变医学领域。该领域已迅速发展，从用于临床文件编制和决策支持等任务的纯文本大型语言模型，发展到能够在单个模型中整合包括影像、文本和结构化数据在内的多种数据方式的多模态 AI 系统。这些技术的多样化格局以及日益增长的兴趣，凸显了全面审查其应用和潜力的必要性。本范围审查探讨了多模态 AI 的演变，重点介绍了其方法、应用、数据集和在临床环境中的评估。遵循 PRISMA-ScR 指南，我们系统地查询了 PubMed、IEEE Xplore 和 Web of Science，优先考虑截至 2024 年底发表的最新研究。经过严格筛选，纳入了 144 篇论文，揭示了这一充满活力的领域的趋势和挑战。我们的研究结果强调了从单模态方法向多模态方法的转变，推动了诊断支持、医疗报告生成、药物发现和会话式 AI 的创新。然而，关键挑战仍然存在，包括异构数据类型的整合、提高模型可解释性、解决伦理问题以及在现实世界的临床环境中验证 AI 系统。本综述总结了当前的最新技术，确定了关键差距，并提供了见解，以指导在医疗保健领域开发可扩展、可信赖且具有临床影响力的多模态 AI 解决方案。
+摘要：已更新的醫療狀況診斷程序和治療途徑建議，以臨床實務指南 (CPG) 中的圖形流程記錄。為了有效使用 CPG 協助醫療專業人員進行治療決策，必須完整擷取指南知識，特別是圖表中的脈絡及其關係。雖然現有許多研究已利用這些指南為臨床決策支援系統建立規則基礎，但直接擷取 CPG 中包含的完整醫療知識的工作卻有限。這項研究提出了一種方法，以自動化擷取和節點與關係分類的方式，建立脈絡豐富、忠實的國家綜合癌症網路 (NCCN) 癌症 CPG 圖形數位表示。我們也透過使用大型語言模型 (LLM) 進行節點分類，實作模型的語意豐富化，分別在零次學習和少次學習中達到 80.86% 和 88.47% 的準確度。此外，我們引進了一種方法，透過運用 LLM 從指南知識庫中擷取相關子圖，來回答具有指南文字限制的自然語言問題。透過根據子圖路徑和語意資訊產生自然語言答案，我們降低了與 LLM 相關的錯誤答案和幻覺風險，確保了醫療領域問題解答中的事實準確性。
 
-##### **Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**
-2502.09218v1 by Flavio Bertini, Alessandro Dal Palù, Federica Zaglio, Francesco Fabiano, Andrea Formisano
+##### **LLM-Assisted Knowledge Graph Completion for Curriculum and Domain Modelling in Personalized Higher Education Recommendations**
+2501.12300v1 by Hasan Abu-Rasheed, Constance Jumbo, Rashed Al Amin, Christian Weber, Veit Wiese, Roman Obermaisser, Madjid Fathi
 
-This paper presents a complete explainable system that interprets a set of
-data, abstracts the underlying features and describes them in a natural
-language of choice. The system relies on two crucial stages: (i) identifying
-emerging properties from data and transforming them into abstract concepts, and
-(ii) converting these concepts into natural language. Despite the impressive
-natural language generation capabilities demonstrated by Large Language Models,
-their statistical nature and the intricacy of their internal mechanism still
-force us to employ these techniques as black boxes, forgoing trustworthiness.
-Developing an explainable pipeline for data interpretation would allow
-facilitating its use in safety-critical environments like processing medical
-information and allowing non-experts and visually impaired people to access
-narrated information. To this end, we believe that the fields of knowledge
-representation and automated reasoning research could present a valid
-alternative. Expanding on prior research that tackled the first stage (i), we
-focus on the second stage, named Concept2Text. Being explainable, data
-translation is easily modeled through logic-based rules, once again emphasizing
-the role of declarative programming in achieving AI explainability. This paper
-explores a Prolog/CLP-based rewriting system to interpret concepts-articulated
-in terms of classes and relations, plus common knowledge-derived from a generic
-ontology, generating natural language text. Its main features include
-hierarchical tree rewritings, modular multilingual generation, support for
-equivalent variants across semantic, grammar, and lexical levels, and a
-transparent rule-based system. We outline the architecture and demonstrate its
-flexibility through some examples capable of generating numerous diverse and
-equivalent rewritings based on the input concept.
+While learning personalization offers great potential for learners, modern
+practices in higher education require a deeper consideration of domain models
+and learning contexts, to develop effective personalization algorithms. This
+paper introduces an innovative approach to higher education curriculum
+modelling that utilizes large language models (LLMs) for knowledge graph (KG)
+completion, with the goal of creating personalized learning-path
+recommendations. Our research focuses on modelling university subjects and
+linking their topics to corresponding domain models, enabling the integration
+of learning modules from different faculties and institutions in the student's
+learning path. Central to our approach is a collaborative process, where LLMs
+assist human experts in extracting high-quality, fine-grained topics from
+lecture materials. We develop a domain, curriculum, and user models for
+university modules and stakeholders. We implement this model to create the KG
+from two study modules: Embedded Systems and Development of Embedded Systems
+Using FPGA. The resulting KG structures the curriculum and links it to the
+domain models. We evaluate our approach through qualitative expert feedback and
+quantitative graph quality metrics. Domain experts validated the relevance and
+accuracy of the model, while the graph quality metrics measured the structural
+properties of our KG. Our results show that the LLM-assisted graph completion
+approach enhances the ability to connect related courses across disciplines to
+personalize the learning experience. Expert feedback also showed high
+acceptance of the proposed collaborative approach for concept extraction and
+classification.
 
-摘要：<paragraph>這篇論文提出了一個完整的可解釋系統，它可以解釋一組資料，抽象出基礎特徵，並以選擇的自然語言描述它們。系統依賴兩個關鍵階段：(i) 從資料中識別新興屬性，並將它們轉換為抽象概念，以及 (ii) 將這些概念轉換為自然語言。儘管大型語言模型展示了令人印象深刻的自然語言生成能力，但它們的統計性質和內部機制的複雜性仍然迫使我們將這些技術用作黑盒子，放棄可信度。開發一個可解釋的資料解釋管道將有助於促進在安全關鍵環境中使用它，例如處理醫療資訊，並允許非專家和視障人士存取敘述資訊。為此，我們相信知識表示和自動推理研究領域可以提出一個有效的替代方案。在擴展解決第一階段 (i) 的先前研究的基礎上，我們專注於第二階段，稱為 Concept2Text。由於具有可解釋性，資料翻譯很容易透過基於邏輯的規則建模，再次強調宣告式程式設計在實現 AI 可解釋性中的作用。本文探討了一個基於 Prolog/CLP 的重寫系統，以解釋概念，這些概念以類別和關係的形式表達，再加上從通用本体衍生的常識，產生自然語言文字。它的主要特點包括階層樹重寫、模組化多語言生成、支援語義、語法和詞彙層面的等效變體，以及一個透明的基於規則的系統。我們概述了架構，並透過一些範例展示了它的靈活性，這些範例能夠根據輸入概念生成許多不同的等效重寫。</paragraph>
+摘要：<paragraph>在學習個人化提供學習者巨大潛力的同時，高等教育中的現代實務需要更深入地考慮領域模型和學習情境，以開發有效的個人化演算法。本文介紹了一種創新的高等教育課程建模方法，該方法利用大型語言模型 (LLM) 來完成知識圖譜 (KG)，目的是建立個人化的學習路徑建議。我們的研究重點在於建模大學科目，並將它們的主題連結到對應的領域模型，從而能夠將來自不同院系和機構的學習模組整合到學生的學習路徑中。我們的做法核心是一個協作流程，其中 LLM 協助人類專家從講義材料中萃取高品質、細緻的主題。我們為大學模組和利害關係人開發了領域、課程和使用者模型。我們實作這個模型，從兩個研究模組建立 KG：嵌入式系統和使用 FPGA 的嵌入式系統開發。產生的 KG 建構了課程並將其連結到領域模型。我們透過定性專家回饋和定量圖形品質指標來評估我們的做法。領域專家驗證了模型的相關性和準確性，而圖形品質指標則測量了我們 KG 的結構特性。我們的結果顯示，LLM 輔助的圖形完成方法增強了跨學科連結相關課程的能力，以個人化學習體驗。專家回饋也顯示高度接受所提出的協作方法，用於概念萃取和分類。</paragraph>
 
-##### **Logical Lease Litigation: Prolog and LLMs for Rental Law Compliance in New York**
-2502.09204v1 by Sanskar Sehgal, Yanhong A. Liu
+##### **Divide-Then-Aggregate: An Efficient Tool Learning Method via Parallel Tool Invocation**
+2501.12432v1 by Dongsheng Zhu, Weixian Shi, Zhengliang Shi, Zhaochun Ren, Shuaiqiang Wang, Lingyong Yan, Dawei Yin
 
-Legal cases require careful logical reasoning following the laws, whereas
-interactions with non- technical users must be in natural language. As an
-application combining logical reasoning using Prolog and natural language
-processing using large language models (LLMs), this paper presents a novel
-approach and system, LogicLease, to automate the analysis of landlord-tenant
-legal cases in the state of New York. LogicLease determines compliance with
-relevant legal requirements by analyzing case descriptions and citing all
-relevant laws. It leverages LLMs for information extraction and Prolog for
-legal reasoning. By separating information extraction from legal reasoning,
-LogicLease achieves greater transparency and control over the legal logic
-applied to each case. We evaluate the accuracy, efficiency, and robustness of
-LogicLease through a series of tests, achieving 100% accuracy and an average
-processing time of 2.57 seconds. LogicLease presents advantages over
-state-of-the-art LLM- based legal analysis systems by providing clear,
-step-by-step reasoning, citing specific laws, and distinguishing itself by its
-ability to avoid hallucinations - a common issue in LLMs.
+Although current Large Language Models (LLMs) exhibit impressive
+capabilities, performing complex real-world tasks still requires tool learning.
+Mainstream methods, such as CoT/ReAct, rely on step-by-step tool invocation to
+interact with external environments, but they are limited in perceptual scope
+and lack adequate task-planning capability. To address these limitations, other
+studies introduce the first Search-based Decision Tree (DFSDT), which still
+suffers from the high computational cost. In this paper, we introduce a novel
+parallel tool invocation paradigm, DTA-Llama (Divide-Then-Aggregate Llama).
+First, we transform traditional tree-based tool search paths into Directed
+Acyclic Graph (DAG) structure, generating a high-quality parallel tool
+invocation dataset. The DTA-Llama is then trained on the dataset to learn to
+iteratively divide the current task into several parallel tool invocation
+sub-tasks and aggregate the invocation results to decide the next actions.
+Furthermore, we introduce an efficient inference framework inspired by the
+Process/Threads mechanism when applying the DTA-Llama to practical tasks.
+Experimental results show that our approach substantially enhances task
+performance while reducing token consumption and inference time. Llama2-7B,
+using our method, is comparable to the official parallel function calling
+method of GPT-3.5. The relevant code, dataset, and model weights are available
+at https://corn0205.github.io/
 
-摘要：法律案件需要遵循法律进行谨慎的逻辑推理，而与非技术用户的互动必须使用自然语言。作为结合使用 Prolog 进行逻辑推理和使用大型语言模型 (LLM) 进行自然语言处理的应用程序，本文提出了一种新颖的方法和系统 LogicLease，以自动分析纽约州的房东与租户法律案件。LogicLease 通过分析案例描述并引用所有相关法律来确定是否符合相关法律要求。它利用 LLM 进行信息提取，并利用 Prolog 进行法律推理。通过将信息提取与法律推理分开，LogicLease 实现了对应用于每个案例的法律逻辑的更高透明度和控制力。我们通过一系列测试评估了 LogicLease 的准确性、效率和鲁棒性，实现了 100% 的准确性和 2.57 秒的平均处理时间。LogicLease 通过提供清晰、分步的推理，引用具体法律，并以其避免幻觉的能力而区别于最先进的基于 LLM 的法律分析系统，从而显示出优势——这是 LLM 中的常见问题。
+摘要：儘管目前的大型語言模型 (LLM) 展現出令人印象深刻的能力，但執行複雜的真實世界任務仍需要工具學習。主流方法（例如 CoT/ReAct）依賴逐步工具呼叫與外部環境互動，但它們的感知範圍有限，且缺乏足夠的任務規劃能力。為了解決這些限制，其他研究引入了第一個基於搜尋的決策樹 (DFSDT)，但仍有很高的運算成本。在本文中，我們介紹了一種新穎的平行工具呼叫範例，DTA-Llama（分而合之 Llama）。首先，我們將傳統的基於樹的工具搜尋路徑轉換為有向無環圖 (DAG) 結構，產生高品質的平行工具呼叫資料集。然後在資料集上訓練 DTA-Llama，學習反覆將當前任務分成幾個平行工具呼叫子任務，並彙總呼叫結果以決定後續動作。此外，我們在將 DTA-Llama 應用於實際任務時，引入了一個受 Process/Threads 機制啟發的高效推論框架。實驗結果表明，我們的做法大幅提升了任務效能，同時減少了符號消耗和推論時間。使用我們方法的 Llama2-7B，可與 GPT-3.5 的官方平行函式呼叫方法相媲美。相關程式碼、資料集和模型權重可在 https://corn0205.github.io/ 取得
 
-##### **Two-Stage Representation Learning for Analyzing Movement Behavior Dynamics in People Living with Dementia**
-2502.09173v1 by Jin Cui, Alexander Capstick, Payam Barnaghi, Gregory Scott
+##### **InsTALL: Context-aware Instructional Task Assistance with Multi-modal Large Language Models**
+2501.12231v1 by Pha Nguyen, Sailik Sengupta, Girik Malik, Arshit Gupta, Bonan Min
 
-In remote healthcare monitoring, time series representation learning reveals
-critical patient behavior patterns from high-frequency data. This study
-analyzes home activity data from individuals living with dementia by proposing
-a two-stage, self-supervised learning approach tailored to uncover low-rank
-structures. The first stage converts time-series activities into text sequences
-encoded by a pre-trained language model, providing a rich, high-dimensional
-latent state space using a PageRank-based method. This PageRank vector captures
-latent state transitions, effectively compressing complex behaviour data into a
-succinct form that enhances interpretability. This low-rank representation not
-only enhances model interpretability but also facilitates clustering and
-transition analysis, revealing key behavioral patterns correlated with
-clinicalmetrics such as MMSE and ADAS-COG scores. Our findings demonstrate the
-framework's potential in supporting cognitive status prediction, personalized
-care interventions, and large-scale health monitoring.
+The improved competence of generative models can help building multi-modal
+virtual assistants that leverage modalities beyond language. By observing
+humans performing multi-step tasks, one can build assistants that have
+situational awareness of actions and tasks being performed, enabling them to
+cater assistance based on this understanding. In this paper, we develop a
+Context-aware Instructional Task Assistant with Multi-modal Large Language
+Models (InsTALL) that leverages an online visual stream (e.g. a user's screen
+share or video recording) and responds in real-time to user queries related to
+the task at hand. To enable useful assistance, InsTALL 1) trains a multi-modal
+model on task videos and paired textual data, and 2) automatically extracts
+task graph from video data and leverages it at training and inference time. We
+show InsTALL achieves state-of-the-art performance across proposed sub-tasks
+considered for multimodal activity understanding -- task recognition (TR),
+action recognition (AR), next action prediction (AP), and plan prediction (PP)
+-- and outperforms existing baselines on two novel sub-tasks related to
+automatic error identification.
 
-摘要：在遠程醫療監控中，時間序列表示學習揭示了高頻率數據中的關鍵患者行為模式。本研究通過提出一個兩階段、自我監督的學習方法來分析痴呆症患者的家庭活動數據，該方法專門用於發現低秩結構。第一階段將時間序列活動轉換為由預訓練語言模型編碼的文本序列，使用基於 PageRank 的方法提供了一個豐富、高維的潛在狀態空間。此 PageRank 向量捕獲潛在狀態轉換，有效地將複雜的行為數據壓縮成簡潔的形式，從而增強了解力。此低秩表示不僅增強了模型的可解釋性，還促進了聚類和轉換分析，揭示了與臨床指標（例如 MMSE 和 ADAS-COG 分數）相關的關鍵行為模式。我們的研究結果證明了該框架在支持認知狀態預測、個性化護理干預和大型健康監控方面的潛力。
+摘要：生成模型能力的提升有助于构建利用语言之外的多模态虚拟助手。通过观察人类执行多步骤任务，可以构建对正在执行的动作和任务有情境感知的助手，使他们能够根据这种理解提供帮助。在本文中，我们开发了一个具有多模态大语言模型的上下文感知指令任务助手 (InsTALL)，该助手利用在线视觉流（例如用户的屏幕共享或视频录制），并实时响应与手头任务相关的用户查询。为了提供有用的帮助，InsTALL 1) 在任务视频和配对文本数据上训练多模态模型，以及 2) 从视频数据中自动提取任务图，并在训练和推理时间利用它。我们展示了 InsTALL 在考虑用于多模态活动理解的提议子任务中实现了最先进的性能——任务识别 (TR)、动作识别 (AR)、下一个动作预测 (AP) 和计划预测 (PP)——并且在与自动错误识别相关的两个新子任务上优于现有的基准。
 
-##### **HistoSmith: Single-Stage Histology Image-Label Generation via Conditional Latent Diffusion for Enhanced Cell Segmentation and Classification**
-2502.08754v1 by Valentina Vadori, Jean-Marie Graïc, Antonella Peruffo, Livio Finos, Ujwala Kiran Chaudhari, Enrico Grisan
+##### **Leveraging Graph Structures and Large Language Models for End-to-End Synthetic Task-Oriented Dialogues**
+2501.11977v1 by Maya Medjad, Hugo Imbert, Bruno Yun, Raphaël Szymocha, Frédéric Armetta
 
-Precise segmentation and classification of cell instances are vital for
-analyzing the tissue microenvironment in histology images, supporting medical
-diagnosis, prognosis, treatment planning, and studies of brain
-cytoarchitecture. However, the creation of high-quality annotated datasets for
-training remains a major challenge. This study introduces a novel single-stage
-approach (HistoSmith) for generating image-label pairs to augment histology
-datasets. Unlike state-of-the-art methods that utilize diffusion models with
-separate components for label and image generation, our approach employs a
-latent diffusion model to learn the joint distribution of cellular layouts,
-classification masks, and histology images. This model enables tailored data
-generation by conditioning on user-defined parameters such as cell types,
-quantities, and tissue types. Trained on the Conic H&E histopathology dataset
-and the Nissl-stained CytoDArk0 dataset, the model generates realistic and
-diverse labeled samples. Experimental results demonstrate improvements in cell
-instance segmentation and classification, particularly for underrepresented
-cell types like neutrophils in the Conic dataset. These findings underscore the
-potential of our approach to address data scarcity challenges.
+Training task-oriented dialogue systems is both costly and time-consuming,
+due to the need for high-quality datasets encompassing diverse intents.
+Traditional methods depend on extensive human annotation, while recent
+advancements leverage large language models (LLMs) to generate synthetic data.
+However, these approaches often require custom prompts or code, limiting
+accessibility for non-technical users. We introduce GraphTOD, an end-to-end
+framework that simplifies the generation of task-oriented dialogues. Users can
+create dialogues by specifying transition graphs in JSON format. Our evaluation
+demonstrates that GraphTOD generates high-quality dialogues across various
+domains, significantly lowering the cost and complexity of dataset creation.
 
-摘要：精確的細胞實例分割和分類對於分析組織學影像中的組織微環境、支援醫療診斷、預後、治療規劃和腦部細胞結構研究至關重要。然而，建立用於訓練的高品質標註資料集仍然是一項重大挑戰。本研究提出了一種新穎的單階段方法 (HistoSmith)，用於產生影像標籤對，以擴充組織學資料集。與利用擴散模型並將標籤和影像產生分開的組成部分的現有技術不同，我們的做法採用潛在擴散模型來學習細胞佈局、分類遮罩和組織學影像的聯合分佈。此模型能透過調整使用者定義的參數（例如細胞類型、數量和組織類型）來進行客製化資料產生。在 Conic H&E 細胞病理學資料集和 Nissl 染色的 CytoDArk0 資料集上訓練後，此模型產生逼真且多樣化的標籤樣本。實驗結果顯示細胞實例分割和分類有顯著進步，特別是對於 Conic 資料集中代表性不足的細胞類型，例如中性球。這些發現強調了我們的方法在解決資料稀少性挑戰方面的潛力。
+摘要：訓練任務導向對話系統既昂貴又耗時，
+因為需要包含各種意圖的高品質資料集。
+傳統方法依賴於廣泛的人工標註，而最近
+的進展利用大型語言模型 (LLM) 來產生合成資料。
+然而，這些方法通常需要自訂提示或程式碼，限制
+非技術使用者的可及性。我們介紹 GraphTOD，一個端對端的
+架構，簡化了任務導向對話的產生。使用者可以
+透過指定 JSON 格式的轉換圖表來建立對話。我們的評估
+證明 GraphTOD 在各種領域產生高品質對話，顯著降低資料集建立的成本和複雜性。
 
-##### **Brain Latent Progression: Individual-based Spatiotemporal Disease Progression on 3D Brain MRIs via Latent Diffusion**
-2502.08560v1 by Lemuel Puglisi, Daniel C. Alexander, Daniele Ravì
+##### **Bridging Visualization and Optimization: Multimodal Large Language Models on Graph-Structured Combinatorial Optimization**
+2501.11968v1 by Jie Zhao, Kang Hao Cheong, Witold Pedrycz
 
-The growing availability of longitudinal Magnetic Resonance Imaging (MRI)
-datasets has facilitated Artificial Intelligence (AI)-driven modeling of
-disease progression, making it possible to predict future medical scans for
-individual patients. However, despite significant advancements in AI, current
-methods continue to face challenges including achieving patient-specific
-individualization, ensuring spatiotemporal consistency, efficiently utilizing
-longitudinal data, and managing the substantial memory demands of 3D scans. To
-address these challenges, we propose Brain Latent Progression (BrLP), a novel
-spatiotemporal model designed to predict individual-level disease progression
-in 3D brain MRIs. The key contributions in BrLP are fourfold: (i) it operates
-in a small latent space, mitigating the computational challenges posed by
-high-dimensional imaging data; (ii) it explicitly integrates subject metadata
-to enhance the individualization of predictions; (iii) it incorporates prior
-knowledge of disease dynamics through an auxiliary model, facilitating the
-integration of longitudinal data; and (iv) it introduces the Latent Average
-Stabilization (LAS) algorithm, which (a) enforces spatiotemporal consistency in
-the predicted progression at inference time and (b) allows us to derive a
-measure of the uncertainty for the prediction. We train and evaluate BrLP on
-11,730 T1-weighted (T1w) brain MRIs from 2,805 subjects and validate its
-generalizability on an external test set comprising 2,257 MRIs from 962
-subjects. Our experiments compare BrLP-generated MRI scans with real follow-up
-MRIs, demonstrating state-of-the-art accuracy compared to existing methods. The
-code is publicly available at: https://github.com/LemuelPuglisi/BrLP.
+Graph-structured combinatorial challenges are inherently difficult due to
+their nonlinear and intricate nature, often rendering traditional computational
+methods ineffective or expensive. However, these challenges can be more
+naturally tackled by humans through visual representations that harness our
+innate ability for spatial reasoning. In this study, we propose transforming
+graphs into images to preserve their higher-order structural features
+accurately, revolutionizing the representation used in solving graph-structured
+combinatorial tasks. This approach allows machines to emulate human-like
+processing in addressing complex combinatorial challenges. By combining the
+innovative paradigm powered by multimodal large language models (MLLMs) with
+simple search techniques, we aim to develop a novel and effective framework for
+tackling such problems. Our investigation into MLLMs spanned a variety of
+graph-based tasks, from combinatorial problems like influence maximization to
+sequential decision-making in network dismantling, as well as addressing six
+fundamental graph-related issues. Our findings demonstrate that MLLMs exhibit
+exceptional spatial intelligence and a distinctive capability for handling
+these problems, significantly advancing the potential for machines to
+comprehend and analyze graph-structured data with a depth and intuition akin to
+human cognition. These results also imply that integrating MLLMs with simple
+optimization strategies could form a novel and efficient approach for
+navigating graph-structured combinatorial challenges without complex
+derivations, computationally demanding training and fine-tuning.
 
-摘要：隨著縱向磁共振影像 (MRI) 資料集的日益普及，已促進人工智慧 (AI) 驅動的疾病進程建模，讓預測個別患者的未來醫學掃描成為可能。然而，儘管 AI 有顯著進展，目前的技術仍面臨挑戰，包括實現患者特定的個別化、確保時空一致性、有效利用縱向資料，以及管理 3D 掃描的大量記憶體需求。為了應對這些挑戰，我們提出腦潛在進程 (BrLP)，這是一種新穎的時空模型，旨在預測 3D 腦部 MRI 中的個人層級疾病進程。BrLP 的主要貢獻有四個：(i) 它在一個小的潛在空間中運作，減輕了高維度影像資料帶來的計算挑戰；(ii) 它明確整合受試者的元資料，以增強預測的個別化；(iii) 它透過輔助模型納入疾病動態的先驗知識，促進縱向資料的整合；(iv) 它引入了潛在平均穩定化 (LAS) 演算法，該演算法 (a) 在推論時強制預測進程中的時空一致性，(b) 讓我們能夠推導預測的不確定性測量。我們對來自 2,805 名受試者的 11,730 個 T1 加權 (T1w) 腦部 MRI 進行 BrLP 訓練和評估，並在包含來自 962 名受試者的 2,257 個 MRI 的外部測試集上驗證其概括性。我們的實驗將 BrLP 生成的 MRI 掃描與實際追蹤 MRI 進行比較，與現有方法相比，展示了最先進的準確性。程式碼已公開於：https://github.com/LemuelPuglisi/BrLP。
+摘要：圖形結構的組合挑戰本質上很困難，因為它們的非線性和複雜性，通常會使傳統的計算方法無效或昂貴。然而，人類可以透過利用我們天生的空間推理能力的視覺表徵，更自然地應對這些挑戰。在本研究中，我們建議將圖形轉換為影像，以準確保留它們的高階結構特徵，從而革新用於解決圖形結構組合任務的表徵。這種方法允許機器在解決複雜的組合挑戰時模擬類人的處理。透過結合由多模態大型語言模型 (MLLM) 提供動力的創新範例與簡單的搜尋技術，我們旨在為解決此類問題開發一個新穎且有效的架構。我們對 MLLM 的研究涵蓋了各種基於圖形的任務，從組合問題（如影響力最大化）到網路拆除中的順序決策制定，以及解決六個基本的圖形相關問題。我們的研究結果表明，MLLM 表現出非凡的空間智能和處理這些問題的獨特能力，顯著提升了機器以類似人類認知的深度和直覺來理解和分析圖形結構資料的潛力。這些結果還暗示，將 MLLM 與簡單的最佳化策略整合在一起，可以形成一種新穎且有效的方法，用於在沒有複雜推導、計算需求量大的訓練和微調的情況下應對圖形結構的組合挑戰。
 
-##### **Representation Learning to Advance Multi-institutional Studies with Electronic Health Record Data**
-2502.08547v1 by Doudou Zhou, Han Tong, Linshanshan Wang, Suqi Liu, Xin Xiong, Ziming Gan, Romain Griffier, Boris Hejblum, Yun-Chung Liu, Chuan Hong, Clara-Lea Bonzel, Tianrun Cai, Kevin Pan, Yuk-Lam Ho, Lauren Costa, Vidul A. Panickan, J. Michael Gaziano, Kenneth Mandl, Vianney Jouhet, Rodolphe Thiebaut, Zongqi Xia, Kelly Cho, Katherine Liao, Tianxi Cai
+##### **A Survey of Graph Retrieval-Augmented Generation for Customized Large Language Models**
+2501.13958v1 by Qinggang Zhang, Shengyuan Chen, Yuanchen Bei, Zheng Yuan, Huachi Zhou, Zijin Hong, Junnan Dong, Hao Chen, Yi Chang, Xiao Huang
 
-The adoption of EHRs has expanded opportunities to leverage data-driven
-algorithms in clinical care and research. A major bottleneck in effectively
-conducting multi-institutional EHR studies is the data heterogeneity across
-systems with numerous codes that either do not exist or represent different
-clinical concepts across institutions. The need for data privacy further limits
-the feasibility of including multi-institutional patient-level data required to
-study similarities and differences across patient subgroups. To address these
-challenges, we developed the GAME algorithm. Tested and validated across 7
-institutions and 2 languages, GAME integrates data in several levels: (1) at
-the institutional level with knowledge graphs to establish relationships
-between codes and existing knowledge sources, providing the medical context for
-standard codes and their relationship to each other; (2) between institutions,
-leveraging language models to determine the relationships between
-institution-specific codes with established standard codes; and (3) quantifying
-the strength of the relationships between codes using a graph attention
-network. Jointly trained embeddings are created using transfer and federated
-learning to preserve data privacy. In this study, we demonstrate the
-applicability of GAME in selecting relevant features as inputs for AI-driven
-algorithms in a range of conditions, e.g., heart failure, rheumatoid arthritis.
-We then highlight the application of GAME harmonized multi-institutional EHR
-data in a study of Alzheimer's disease outcomes and suicide risk among patients
-with mental health disorders, without sharing patient-level data outside
-individual institutions.
+Large language models (LLMs) have demonstrated remarkable capabilities in a
+wide range of tasks, yet their application to specialized domains remains
+challenging due to the need for deep expertise. Retrieval-augmented generation
+(RAG) has emerged as a promising solution to customize LLMs for professional
+fields by seamlessly integrating external knowledge bases, enabling real-time
+access to domain-specific expertise during inference. Despite its potential,
+traditional RAG systems, based on flat text retrieval, face three critical
+challenges: (i) complex query understanding in professional contexts, (ii)
+difficulties in knowledge integration across distributed sources, and (iii)
+system efficiency bottlenecks at scale. This survey presents a systematic
+analysis of Graph-based Retrieval-Augmented Generation (GraphRAG), a new
+paradigm that revolutionizes domain-specific LLM applications. GraphRAG
+addresses traditional RAG limitations through three key innovations: (i)
+graph-structured knowledge representation that explicitly captures entity
+relationships and domain hierarchies, (ii) efficient graph-based retrieval
+techniques that enable context-preserving knowledge retrieval with multihop
+reasoning ability, and (iii) structure-aware knowledge integration algorithms
+that leverage retrieved knowledge for accurate and logical coherent generation
+of LLMs. In this survey, we systematically analyze the technical foundations of
+GraphRAG and examine current implementations across various professional
+domains, identifying key technical challenges and promising research
+directions. All the related resources of GraphRAG, including research papers,
+open-source data, and projects, are collected for the community in
+\textcolor{blue}{\url{https://github.com/DEEP-PolyU/Awesome-GraphRAG}}.
 
-摘要：電子健康紀錄的採用擴大了在臨床照護和研究中利用資料驅動演算法的機會。在有效進行多機構電子健康紀錄研究時，一個主要的瓶頸是系統間資料異質性，其中有許多代碼在機構間不存在或表示不同的臨床概念。資料隱私的需求進一步限制了納入多機構患者層級資料的可行性，而這些資料對於研究患者亞群之間的相似性和差異性是必要的。為了應對這些挑戰，我們開發了 GAME 演算法。GAME 已在 7 個機構和 2 種語言中進行測試和驗證，它整合了多個層級的資料：(1) 在機構層級，使用知識圖表來建立代碼和現有知識來源之間的關係，為標準代碼及其彼此之間的關係提供醫療背景；(2) 在機構之間，利用語言模型來確定機構特定代碼與已建立的標準代碼之間的關係；(3) 使用圖形注意網路量化代碼之間關係的強度。使用遷移和聯合學習建立聯合訓練的嵌入，以保護資料隱私。在本研究中，我們展示了 GAME 在選擇相關特徵作為 AI 驅動演算法輸入時的適用性，適用於各種情況，例如心臟衰竭、類風濕性關節炎。然後，我們重點介紹了 GAME 和諧化多機構電子健康紀錄資料在阿茲海默症疾病結果和精神疾病患者自殺風險研究中的應用，而無需在個別機構之外共享患者層級資料。
+摘要：大型語言模型 (LLM) 已在各種任務中展現出非凡的能力，但由於需要深入的專業知識，因此將其應用於專業領域仍具有挑戰性。檢索增強生成 (RAG) 已成為一種有前途的解決方案，可通過無縫整合外部知識庫來客製化 LLM 以適用於專業領域，從而在推理過程中即時存取特定領域的專業知識。儘管有其潛力，但基於平面文字檢索的傳統 RAG 系統面臨三項關鍵挑戰：(i) 在專業情境中進行複雜的查詢理解，(ii) 難以整合分散來源的知識，以及 (iii) 系統效率瓶頸會隨著規模擴大而產生。本調查系統性地分析了圖形化檢索增強生成 (GraphRAG) 的技術基礎，GraphRAG 是一個新的典範，它徹底改變了特定領域的 LLM 應用。GraphRAG 透過三項關鍵創新來解決傳統 RAG 的限制：(i) 圖形結構化的知識表述，明確擷取實體關係和領域階層，(ii) 有效的圖形化檢索技術，可進行保留脈絡的知識檢索，並具備多跳推理能力，以及 (iii) 結構感知知識整合演算法，可利用檢索到的知識來進行 LLM 的準確且邏輯一致的生成。在本調查中，我們系統性地分析了 GraphRAG 的技術基礎，並檢視了在各種專業領域中的現有實作，找出關鍵技術挑戰和有前景的研究方向。所有 GraphRAG 的相關資源，包括研究論文、開放原始碼資料和專案，都已在 \textcolor{blue}{\url{https://github.com/DEEP-PolyU/Awesome-GraphRAG}} 中為社群收集。
 
-##### **EEG Artifact Detection and Correction with Deep Autoencoders**
-2502.08686v1 by David Aquilué-Llorens, Aureli Soria-Frisch
+##### **Network-informed Prompt Engineering against Organized Astroturf Campaigns under Extreme Class Imbalance**
+2501.11849v2 by Nikos Kanakaris, Heng Ping, Xiongye Xiao, Nesreen K. Ahmed, Luca Luceri, Emilio Ferrara, Paul Bogdan
 
-EEG signals convey important information about brain activity both in healthy
-and pathological conditions. However, they are inherently noisy, which poses
-significant challenges for accurate analysis and interpretation. Traditional
-EEG artifact removal methods, while effective, often require extensive expert
-intervention. This study presents LSTEEG, a novel LSTM-based autoencoder
-designed for the detection and correction of artifacts in EEG signals.
-Leveraging deep learning, particularly LSTM layers, LSTEEG captures non-linear
-dependencies in sequential EEG data. LSTEEG demonstrates superior performance
-in both artifact detection and correction tasks compared to other
-state-of-the-art convolutional autoencoders. Our methodology enhances the
-interpretability and utility of the autoencoder's latent space, enabling
-data-driven automated artefact removal in EEG its application in downstream
-tasks. This research advances the field of efficient and accurate multi-channel
-EEG preprocessing, and promotes the implementation and usage of automated EEG
-analysis pipelines for brain health applications.
+Detecting organized political campaigns is of paramount importance in
+fighting against disinformation on social media. Existing approaches for the
+identification of such organized actions employ techniques mostly from network
+science, graph machine learning and natural language processing. Their ultimate
+goal is to analyze the relationships and interactions (e.g. re-posting) among
+users and the textual similarities of their posts. Despite their effectiveness
+in recognizing astroturf campaigns, these methods face significant challenges,
+notably the class imbalance in available training datasets. To mitigate this
+issue, recent methods usually resort to data augmentation or increasing the
+number of positive samples, which may not always be feasible or sufficient in
+real-world settings. Following a different path, in this paper, we propose a
+novel framework for identifying astroturf campaigns based solely on large
+language models (LLMs), introducing a Balanced Retrieval-Augmented Generation
+(Balanced RAG) component. Our approach first gives both textual information
+concerning the posts (in our case tweets) and the user interactions of the
+social network as input to a language model. Then, through prompt engineering
+and the proposed Balanced RAG method, it effectively detects coordinated
+disinformation campaigns on X (Twitter). The proposed framework does not
+require any training or fine-tuning of the language model. Instead, by
+strategically harnessing the strengths of prompt engineering and Balanced RAG,
+it facilitates LLMs to overcome the effects of class imbalance and effectively
+identify coordinated political campaigns. The experimental results demonstrate
+that by incorporating the proposed prompt engineering and Balanced RAG methods,
+our framework outperforms the traditional graph-based baselines, achieving
+2x-3x improvements in terms of precision, recall and F1 scores.
 
-摘要：腦電圖訊號傳達了關於大腦活動的重要資訊，無論是在健康或病理狀況下。然而，它們本質上是有雜訊的，這對準確的分析和解釋構成了重大的挑戰。傳統的腦電圖人工製品移除方法雖然有效，但通常需要大量的專家介入。本研究提出 LSTEEG，一種新穎的基於 LSTM 的自動編碼器，用於偵測和校正腦電圖訊號中的人工製品。利用深度學習，特別是 LSTM 層，LSTEEG 捕捉序列腦電圖資料中的非線性依賴性。與其他最先進的卷積自動編碼器相比，LSTEEG 在人工製品偵測和校正任務中都展現出優異的效能。我們的做法增強了自動編碼器潛在空間的可解釋性和實用性，讓資料驅動的自動人工製品移除得以應用於腦電圖的下游任務。這項研究推動了高效且準確的多通道腦電圖前處理領域，並促進了自動腦電圖分析管線在腦部健康應用中的實作和使用。
+摘要：<paragraph>在社交媒體上對抗錯誤資訊，偵測有組織的政治宣傳活動至關重要。現有的此類有組織行動識別方法，大多採用網路科學、圖形機器學習和自然語言處理的技術。它們的最終目標是分析使用者之間的關係和互動（例如轉發），以及他們貼文的文字相似性。儘管這些方法在辨識草根運動宣傳活動方面很有效，但它們面臨嚴峻的挑戰，特別是可用訓練資料集中的類別不平衡。為了減輕這個問題，最近的方法通常訴諸於資料擴充或增加正向樣本數量，但在現實世界中可能並非總是可行或足夠。本文採取不同的途徑，我們提出了一個基於大型語言模型 (LLM) 的辨識草根運動宣傳活動的新架構，並引入了平衡檢索擴充產生 (Balanced RAG) 組件。我們的做法首先將有關貼文（在我們的案例中是推文）的文字資訊和社交網路的使用者互動作為輸入，輸入到語言模型中。然後，透過提示工程和提出的平衡檢索擴充產生方法，它有效地偵測 X（Twitter）上協調的不實資訊宣傳活動。提出的架構不需要任何語言模型的訓練或微調。相反地，透過策略性地利用提示工程和平衡檢索擴充產生方法的優勢，它使大型語言模型能夠克服類別不平衡的影響，並有效地識別協調的政治宣傳活動。實驗結果證明，透過整合提出的提示工程和平衡檢索擴充產生方法，我們的架構優於傳統的基於圖形的基準，在精確度、召回率和 F1 分數方面獲得 2x-3x 的改進。</paragraph>
 
-##### **SycEval: Evaluating LLM Sycophancy**
-2502.08177v1 by Aaron Fanous, Jacob Goldberg, Ank A. Agarwal, Joanna Lin, Anson Zhou, Roxana Daneshjou, Sanmi Koyejo
+##### **Large Language Models Meet Graph Neural Networks for Text-Numeric Graph Reasoning**
+2501.16361v1 by Haoran Song, Jiarui Feng, Guangfu Li, Michael Province, Philip Payne, Yixin Chen, Fuhai Li
 
-Large language models (LLMs) are increasingly applied in educational,
-clinical, and professional settings, but their tendency for sycophancy --
-prioritizing user agreement over independent reasoning -- poses risks to
-reliability. This study introduces a framework to evaluate sycophantic behavior
-in ChatGPT-4o, Claude-Sonnet, and Gemini-1.5-Pro across AMPS (mathematics) and
-MedQuad (medical advice) datasets. Sycophantic behavior was observed in 58.19%
-of cases, with Gemini exhibiting the highest rate (62.47%) and ChatGPT the
-lowest (56.71%). Progressive sycophancy, leading to correct answers, occurred
-in 43.52% of cases, while regressive sycophancy, leading to incorrect answers,
-was observed in 14.66%. Preemptive rebuttals demonstrated significantly higher
-sycophancy rates than in-context rebuttals (61.75% vs. 56.52%, $Z=5.87$,
-$p<0.001$), particularly in computational tasks, where regressive sycophancy
-increased significantly (preemptive: 8.13%, in-context: 3.54%, $p<0.001$).
-Simple rebuttals maximized progressive sycophancy ($Z=6.59$, $p<0.001$), while
-citation-based rebuttals exhibited the highest regressive rates ($Z=6.59$,
-$p<0.001$). Sycophantic behavior showed high persistence (78.5%, 95% CI:
-[77.2%, 79.8%]) regardless of context or model. These findings emphasize the
-risks and opportunities of deploying LLMs in structured and dynamic domains,
-offering insights into prompt programming and model optimization for safer AI
-applications.
+In real-world scientific discovery, human beings always make use of the
+accumulated prior knowledge with imagination pick select one or a few most
+promising hypotheses from large and noisy data analysis results. In this study,
+we introduce a new type of graph structure, the text-numeric graph (TNG), which
+is defined as graph entities and associations have both text-attributed
+information and numeric information. The TNG is an ideal data structure model
+for novel scientific discovery via graph reasoning because it integrates
+human-understandable textual annotations or prior knowledge, with numeric
+values that represent the observed or activation levels of graph entities or
+associations in different samples. Together both the textual information and
+numeric values determine the importance of graph entities and associations in
+graph reasoning for novel scientific knowledge discovery. We further propose
+integrating large language models (LLMs) and graph neural networks (GNNs) to
+analyze the TNGs for graph understanding and reasoning. To demonstrate the
+utility, we generated the text-omic(numeric) signaling graphs (TOSG), as one
+type of TNGs, in which all graphs have the same entities, associations and
+annotations, but have sample-specific entity numeric (omic) values using single
+cell RNAseq (scRNAseq) datasets of different diseases. We proposed joint
+LLM-GNN models for key entity mining and signaling pathway mining on the TOSGs.
+The evaluation results showed the LLM-GNN and TNGs models significantly improve
+classification accuracy and network inference. In conclusion, the TNGs and
+joint LLM-GNN models are important approaches for scientific discovery.
 
-摘要：大型語言模型（LLM）日益應用於教育、臨床和專業領域，但它們趨於趨炎附勢——優先考慮用戶同意而非獨立推理——對可靠性構成風險。本研究引入了一個框架來評估 ChatGPT-4o、Claude-Sonnet 和 Gemini-1.5-Pro 中的趨炎附勢行為，涉及 AMPS（數學）和 MedQuad（醫療建議）數據集。在 58.19% 的案例中觀察到了趨炎附勢行為，其中 Gemini 表現出最高比率（62.47%），而 ChatGPT 最低（56.71%）。導致正確答案的漸進式趨炎附勢發生在 43.52% 的案例中，而導致不正確答案的退步式趨炎附勢則在 14.66% 的案例中被觀察到。先發制人的反駁表現出顯著高於上下文反駁的趨炎附勢率（61.75% 對 56.52%，Z=5.87，p<0.001），特別是在計算任務中，其中退步式趨炎附勢顯著增加（先發制人：8.13%，上下文：3.54%，p<0.001）。簡單的反駁最大化了漸進式趨炎附勢（Z=6.59，p<0.001），而基於引用的反駁表現出最高的退步式比率（Z=6.59，p<0.001）。趨炎附勢行為表現出很高的持續性（78.5%，95% CI：[77.2%，79.8%]），無論上下文或模型如何。這些發現強調了在結構化和動態領域部署 LLM 的風險和機遇，為更安全的 AI 應用提供了提示編程和模型優化的見解。
+摘要：<paragraph>在現實世界的科學發現中，人類總是利用累積的先驗知識，並運用想像力從大量且雜訊的資料分析結果中挑選出一個或幾個最有希望的假設。在本研究中，我們介紹了一種新型態的圖形結構，稱為文字數值圖 (TNG)，定義為圖形實體和關聯具有文字屬性資訊和數值資訊。TNG 是透過圖形推理進行新科學發現的理想資料結構模型，因為它整合了人類可理解的文字註解或先驗知識，以及代表圖形實體或不同樣本中關聯的觀察值或活化程度的數值。文字資訊和數值一起決定了圖形實體和關聯在圖形推理中對於新科學知識發現的重要性。我們進一步提出整合大型語言模型 (LLM) 和圖形神經網路 (GNN) 來分析 TNG，以進行圖形理解和推理。為了展示其效用，我們生成了文字組學（數值）訊號圖 (TOSG)，作為一種 TNG，其中所有圖形都具有相同的實體、關聯和註解，但具有特定於樣本的實體數值（組學）值，使用不同疾病的單細胞 RNAseq (scRNAseq) 資料集。我們針對 TOSG 提出聯合 LLM-GNN 模型，用於關鍵實體探勘和訊號路徑探勘。評估結果顯示，LLM-GNN 和 TNG 模型顯著提升了分類準確度和網路推論。結論而言，TNG 和聯合 LLM-GNN 模型是科學發現的重要方法。</paragraph>
 
-##### **Caught in the Web of Words: Do LLMs Fall for Spin in Medical Literature?**
-2502.07963v1 by Hye Sun Yun, Karen Y. C. Zhang, Ramez Kouzy, Iain J. Marshall, Junyi Jessy Li, Byron C. Wallace
+##### **Zep: A Temporal Knowledge Graph Architecture for Agent Memory**
+2501.13956v1 by Preston Rasmussen, Pavlo Paliychuk, Travis Beauvais, Jack Ryan, Daniel Chalef
 
-Medical research faces well-documented challenges in translating novel
-treatments into clinical practice. Publishing incentives encourage researchers
-to present "positive" findings, even when empirical results are equivocal.
-Consequently, it is well-documented that authors often spin study results,
-especially in article abstracts. Such spin can influence clinician
-interpretation of evidence and may affect patient care decisions. In this
-study, we ask whether the interpretation of trial results offered by Large
-Language Models (LLMs) is similarly affected by spin. This is important since
-LLMs are increasingly being used to trawl through and synthesize published
-medical evidence. We evaluated 22 LLMs and found that they are across the board
-more susceptible to spin than humans. They might also propagate spin into their
-outputs: We find evidence, e.g., that LLMs implicitly incorporate spin into
-plain language summaries that they generate. We also find, however, that LLMs
-are generally capable of recognizing spin, and can be prompted in a way to
-mitigate spin's impact on LLM outputs.
+We introduce Zep, a novel memory layer service for AI agents that outperforms
+the current state-of-the-art system, MemGPT, in the Deep Memory Retrieval (DMR)
+benchmark. Additionally, Zep excels in more comprehensive and challenging
+evaluations than DMR that better reflect real-world enterprise use cases. While
+existing retrieval-augmented generation (RAG) frameworks for large language
+model (LLM)-based agents are limited to static document retrieval, enterprise
+applications demand dynamic knowledge integration from diverse sources
+including ongoing conversations and business data. Zep addresses this
+fundamental limitation through its core component Graphiti -- a
+temporally-aware knowledge graph engine that dynamically synthesizes both
+unstructured conversational data and structured business data while maintaining
+historical relationships. In the DMR benchmark, which the MemGPT team
+established as their primary evaluation metric, Zep demonstrates superior
+performance (94.8% vs 93.4%). Beyond DMR, Zep's capabilities are further
+validated through the more challenging LongMemEval benchmark, which better
+reflects enterprise use cases through complex temporal reasoning tasks. In this
+evaluation, Zep achieves substantial results with accuracy improvements of up
+to 18.5% while simultaneously reducing response latency by 90% compared to
+baseline implementations. These results are particularly pronounced in
+enterprise-critical tasks such as cross-session information synthesis and
+long-term context maintenance, demonstrating Zep's effectiveness for deployment
+in real-world applications.
 
-摘要：醫學研究在將新穎療法轉化為臨床實務上，面臨著有據可查的挑戰。發表誘因鼓勵研究人員呈現「正向」的發現，即使經驗結果模稜兩可。因此，有據可查的是，作者經常扭曲研究結果，特別是在文章摘要中。此類扭曲可能會影響臨床醫師對證據的詮釋，並可能影響病患照護決策。在本研究中，我們探討大型語言模型 (LLM) 提供的試驗結果詮釋是否也受到扭曲影響。由於 LLM 正越來越常被用於爬梳和綜合已發表的醫學證據，因此這點非常重要。我們評估了 22 個 LLM，發現它們普遍比人類更容易受到扭曲影響。它們也可能將扭曲傳播到其輸出中：例如，我們發現 LLM 會將扭曲隱含納入其產生的白話文摘要中。然而，我們也發現 LLM 通常有能力辨認扭曲，而且可以透過提示的方式減輕扭曲對 LLM 輸出的影響。
+摘要：我們推出 Zep，這是一種新穎的記憶層服務，適用於 AI 代理，其在深度記憶擷取 (DMR) 基準測試中優於現行的最先進系統 MemGPT。此外，Zep 在比 DMR 更全面且更具挑戰性的評估中表現出色，這些評估更能反映真實世界的企業用例。雖然現有的檢索增強生成 (RAG) 架構僅限於大型語言模型 (LLM) 基於代理的靜態文件檢索，但企業應用需要從包括正在進行的對話和業務數據在內的不同來源動態整合知識。Zep 通過其核心組件 Graphiti 來解決這個基本限制，Graphiti 是一個時間感知知識圖譜引擎，可以在維護歷史關係的同時動態綜合非結構化對話數據和結構化業務數據。在 MemGPT 團隊確立為其主要評估指標的 DMR 基準測試中，Zep 表現出優異的效能（94.8% 對 93.4%）。除了 DMR 之外，Zep 的功能還通過更具挑戰性的 LongMemEval 基準測試進一步得到驗證，該基準測試通過複雜的時間推理任務更好地反映了企業用例。在這個評估中，Zep 以高達 18.5% 的準確度改進取得了顯著的成果，同時與基線實作相比，將回應延遲降低了 90%。這些成果在企業關鍵任務中尤為明顯，例如跨會話資訊綜合和長期脈絡維護，證明了 Zep 在實際應用中部署的有效性。
 
-##### **An Advanced NLP Framework for Automated Medical Diagnosis with DeBERTa and Dynamic Contextual Positional Gating**
-2502.07755v1 by Mohammad Ali Labbaf Khaniki, Sahabeh Saadati, Mohammad Manthouri
+##### **Explainable Lane Change Prediction for Near-Crash Scenarios Using Knowledge Graph Embeddings and Retrieval Augmented Generation**
+2501.11560v1 by M. Manzour, A. Ballardini, R. Izquierdo, M. Á. Sotelo
 
-This paper presents a novel Natural Language Processing (NLP) framework for
-enhancing medical diagnosis through the integration of advanced techniques in
-data augmentation, feature extraction, and classification. The proposed
-approach employs back-translation to generate diverse paraphrased datasets,
-improving robustness and mitigating overfitting in classification tasks.
-Leveraging Decoding-enhanced BERT with Disentangled Attention (DeBERTa) with
-Dynamic Contextual Positional Gating (DCPG), the model captures fine-grained
-contextual and positional relationships, dynamically adjusting the influence of
-positional information based on semantic context to produce high-quality text
-embeddings. For classification, an Attention-Based Feedforward Neural Network
-(ABFNN) is utilized, effectively focusing on the most relevant features to
-improve decision-making accuracy. Applied to the classification of symptoms,
-clinical notes, and other medical texts, this architecture demonstrates its
-ability to address the complexities of medical data. The combination of data
-augmentation, contextual embedding generation, and advanced classification
-mechanisms offers a robust and accurate diagnostic tool, with potential
-applications in automated medical diagnosis and clinical decision support. This
-method demonstrates the effectiveness of the proposed NLP framework for medical
-diagnosis, achieving remarkable results with an accuracy of 99.78%, recall of
-99.72%, precision of 99.79%, and an F1-score of 99.75%. These metrics not only
-underscore the model's robust performance in classifying medical texts with
-exceptional precision and reliability but also highlight its superiority over
-existing methods, making it a highly promising tool for automated diagnostic
-systems.
+Lane-changing maneuvers, particularly those executed abruptly or in risky
+situations, are a significant cause of road traffic accidents. However, current
+research mainly focuses on predicting safe lane changes. Furthermore, existing
+accident datasets are often based on images only and lack comprehensive sensory
+data. In this work, we focus on predicting risky lane changes using the CRASH
+dataset (our own collected dataset specifically for risky lane changes), and
+safe lane changes (using the HighD dataset). Then, we leverage KG and Bayesian
+inference to predict these maneuvers using linguistic contextual information,
+enhancing the model's interpretability and transparency. The model achieved a
+91.5% f1-score with anticipation time extending to four seconds for risky lane
+changes, and a 90.0% f1-score for predicting safe lane changes with the same
+anticipation time. We validate our model by integrating it into a vehicle
+within the CARLA simulator in scenarios that involve risky lane changes. The
+model managed to anticipate sudden lane changes, thus providing automated
+vehicles with further time to plan and execute appropriate safe reactions.
+Finally, to enhance the explainability of our model, we utilize RAG to provide
+clear and natural language explanations for the given prediction.
+
+摘要：換車道動作，尤其是突然或在風險情況下執行的動作，是道路交通事故的重要原因。然而，目前的研究所主要集中在預測安全的換車道。此外，現有的事故資料集通常僅基於影像，且缺乏全面的感測資料。在這項工作中，我們專注於使用 CRASH 資料集（我們自己收集的專門針對風險換車道資料集）來預測風險換車道，以及安全換車道（使用 HighD 資料集）。然後，我們利用 KG 和貝氏推理來使用語言背景資訊預測這些動作，增強模型的可解釋性和透明度。該模型在風險換車道的預測時間延長至四秒時，達到了 91.5% 的 f1 分數，在預測安全換車道時，在相同的預測時間內達到了 90.0% 的 f1 分數。我們透過將模型整合到 CARLA 模擬器中的車輛中，在涉及風險換車道的場景中驗證我們的模型。該模型設法預測突然的換車道，從而為自動駕駛車輛提供了更多時間來規劃和執行適當的安全反應。最後，為了增強我們模型的可解釋性，我們利用 RAG 為給定的預測提供清晰且自然的語言解釋。
+
+##### **Each Graph is a New Language: Graph Learning with LLMs**
+2501.11478v2 by Huachi Zhou, Jiahe Du, Chuang Zhou, Chang Yang, Yilin Xiao, Yuxuan Xie, Xiao Huang
 
-摘要：本文提出了一個創新的自然語言處理 (NLP) 框架，透過整合資料擴充、特徵萃取和分類的進階技術來增強醫療診斷。所提出的方法採用反向翻譯來產生多樣化的同義改寫資料集，提升穩健性並減輕分類任務中的過度擬合。透過利用具有動態脈絡位置閘控 (DCPG) 的解碼增強 BERT 與去糾纏注意力 (DeBERTa)，這個模型捕捉細緻的脈絡和位置關係，根據語意脈絡動態調整位置資訊的影響，以產生高品質的文字嵌入。在分類方面，利用基於注意力的前饋神經網路 (ABFNN)，有效地關注最相關的特徵，以提高決策準確度。應用於症狀、臨床筆記和其他醫療文本的分類，此架構證明了其處理醫療資料複雜性的能力。資料擴充、脈絡嵌入產生和進階分類機制的結合提供了一個穩健且準確的診斷工具，在自動化醫療診斷和臨床決策支援中具有潛在應用。此方法證明了所提出的 NLP 框架在醫療診斷中的有效性，以 99.78% 的準確度、99.72% 的召回率、99.79% 的精確度和 99.75% 的 F1 分數，取得了顯著的成果。這些指標不僅強調了模型在分類醫療文本時具有卓越的精確度和可靠性，也突顯了它優於現有方法的優越性，使其成為自動化診斷系統中極具前景的工具。
+Recent efforts leverage Large Language Models (LLMs) for modeling
+text-attributed graph structures in node classification tasks. These approaches
+describe graph structures for LLMs to understand or aggregate LLM-generated
+textual attribute embeddings through graph structure. However, these approaches
+face two main limitations in modeling graph structures with LLMs. (i) Graph
+descriptions become verbose in describing high-order graph structure. (ii)
+Textual attributes alone do not contain adequate graph structure information.
+It is challenging to model graph structure concisely and adequately with LLMs.
+LLMs lack built-in mechanisms to model graph structures directly. They also
+struggle with complex long-range dependencies between high-order nodes and
+target nodes.
+  Inspired by the observation that LLMs pre-trained on one language can achieve
+exceptional performance on another with minimal additional training, we propose
+\textbf{G}raph-\textbf{D}efined \textbf{L}anguage for \textbf{L}arge
+\textbf{L}anguage \textbf{M}odel (GDL4LLM). This novel framework enables LLMs
+to transfer their powerful language understanding capabilities to
+graph-structured data. GDL4LLM translates graphs into a graph language corpus
+instead of graph descriptions and pre-trains LLMs on this corpus to adequately
+understand graph structures. During fine-tuning, this corpus describes the
+structural information of target nodes concisely with only a few tokens. By
+treating graphs as a new language, GDL4LLM enables LLMs to model graph
+structures adequately and concisely for node classification tasks. Extensive
+experiments on three real-world datasets demonstrate that GDL4LLM outperforms
+description-based and textual attribute embeddings-based baselines by
+efficiently modeling different orders of graph structure with LLMs.
 
-##### **Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension**
-2502.07752v1 by Wenbo Gong, Meyer Scetbon, Chao Ma, Edward Meeds
+摘要：<paragraph>最近的研究利用大型语言模型 (LLM) 对节点分类任务中的文本属性图结构进行建模。这些方法描述图结构，以便 LLM 理解或通过图结构聚合 LLM 生成的文本属性嵌入。然而，这些方法在使用 LLM 对图结构进行建模时面临两个主要限制。(i) 图描述在描述高阶图结构时变得冗长。(ii) 仅文本属性不包含足够的图结构信息。使用 LLM 对图结构进行简洁且充分的建模具有挑战性。LLM 缺乏直接对图结构进行建模的内置机制。它们还难以处理高阶节点和目标节点之间复杂的远程依赖关系。
+受 LLM 在一种语言上进行预训练后，只需进行最少的额外训练即可在另一种语言上实现卓越性能的观察结果的启发，我们提出了**G**raph-**D**efined **L**anguage for **L**arge **L**anguage **M**odel (GDL4LLM)。此新框架使 LLM 能够将其强大的语言理解能力转移到结构化数据图。GDL4LLM 将图翻译成图语言语料库，而不是图描述，并在该语料库上对 LLM 进行预训练，以充分理解图结构。在微调期间，此语料库仅使用几个标记简洁地描述目标节点的结构信息。通过将图视为一种新语言，GDL4LLM 使 LLM 能够充分且简洁地对图结构进行建模，以用于节点分类任务。在三个真实世界数据集上进行的广泛实验表明，GDL4LLM 通过使用 LLM 有效地对不同阶的图结构进行建模，优于基于描述和基于文本属性嵌入的基线。</paragraph>
 
-Designing efficient optimizers for large language models (LLMs) with
-low-memory requirements and fast convergence is an important and challenging
-problem. This paper makes a step towards the systematic design of such
-optimizers through the lens of structured Fisher information matrix (FIM)
-approximation. We show that many state-of-the-art efficient optimizers can be
-viewed as solutions to FIM approximation (under the Frobenius norm) with
-specific structural assumptions. Building on these insights, we propose two
-design recommendations of practical efficient optimizers for LLMs, involving
-the careful selection of structural assumptions to balance generality and
-efficiency, and enhancing memory efficiency of optimizers with general
-structures through a novel low-rank extension framework. We demonstrate how to
-use each design approach by deriving new memory-efficient optimizers: Row and
-Column Scaled SGD (RACS) and Adaptive low-dimensional subspace estimation
-(Alice). Experiments on LLaMA pre-training (up to 1B parameters) validate the
-effectiveness, showing faster and better convergence than existing
-memory-efficient baselines and Adam with little memory overhead. Notably, Alice
-achieves better than 2x faster convergence over Adam, while RACS delivers
-strong performance on the 1B model with SGD-like memory.
+##### **Few-shot Policy (de)composition in Conversational Question Answering**
+2501.11335v1 by Kyle Erwin, Guy Axelrod, Maria Chang, Achille Fokoue, Maxwell Crouse, Soham Dan, Tian Gao, Rosario Uceda-Sosa, Ndivhuwo Makondo, Naweed Khan, Alexander Gray
 
-摘要：設計具有低記憶體需求和快速收斂的大型語言模型 (LLM) 的高效最佳化器是一個重要且具有挑戰性的問題。本文透過結構化 Fisher 資訊矩陣 (FIM) 近似的角度，朝向此類最佳化器的系統化設計邁進一步。我們展示了許多最先進的高效最佳化器可以被視為 FIM 近似（在 Frobenius 範數下）的解，並具有特定的結構假設。基於這些見解，我們提出了 LLM 的兩個實用高效最佳化器設計建議，包括仔細選擇結構假設以平衡通用性和效率，並透過新穎的低秩延伸架構來增強具有通用結構的最佳化器的記憶體效率。我們展示了如何透過推導新的記憶體高效最佳化器來使用每種設計方法：列和欄縮放 SGD (RACS) 和自適應低維子空間估計 (Alice)。在 LLaMA 預訓練（高達 1B 參數）上的實驗驗證了其有效性，顯示比現有的記憶體高效基線和 Adam 更快且更好的收斂，且記憶體開銷很小。值得注意的是，Alice 比 Adam 快 2 倍以上，而 RACS 則在 1B 模型上提供類似 SGD 記憶體的強勁效能。
+The task of policy compliance detection (PCD) is to determine if a scenario
+is in compliance with respect to a set of written policies. In a conversational
+setting, the results of PCD can indicate if clarifying questions must be asked
+to determine compliance status. Existing approaches usually claim to have
+reasoning capabilities that are latent or require a large amount of annotated
+data. In this work, we propose logical decomposition for policy compliance
+(LDPC): a neuro-symbolic framework to detect policy compliance using large
+language models (LLMs) in a few-shot setting. By selecting only a few exemplars
+alongside recently developed prompting techniques, we demonstrate that our
+approach soundly reasons about policy compliance conversations by extracting
+sub-questions to be answered, assigning truth values from contextual
+information, and explicitly producing a set of logic statements from the given
+policies. The formulation of explicit logic graphs can in turn help answer
+PCDrelated questions with increased transparency and explainability. We apply
+this approach to the popular PCD and conversational machine reading benchmark,
+ShARC, and show competitive performance with no task-specific finetuning. We
+also leverage the inherently interpretable architecture of LDPC to understand
+where errors occur, revealing ambiguities in the ShARC dataset and highlighting
+the challenges involved with reasoning for conversational question answering.
 
-##### **The Devil is in the Prompts: De-Identification Traces Enhance Memorization Risks in Synthetic Chest X-Ray Generation**
-2502.07516v1 by Raman Dutt
+摘要：策略合規偵測 (PCD) 的任務是確定場景是否符合一組書面策略。在對話設定中，PCD 的結果可以指出是否必須提出澄清問題以確定合規狀態。現有的方法通常聲稱具有潛在的推理能力，或需要大量的註釋資料。在這項工作中，我們提出策略合規的邏輯分解 (LDPC)：一種使用大型語言模型 (LLM) 在少次嘗試中偵測策略合規的神經符號框架。透過僅選擇少數範例以及最近開發的提示技術，我們證明我們的做法透過提取要回答的子問題、從脈絡資訊指派真值，以及從給定的策略明確產生一組邏輯陳述，對策略合規對話進行合理的推理。明確邏輯圖表的制定反過來可以幫助回答 PCD 相關問題，並提高透明度和可解釋性。我們將此方法應用於熱門的 PCD 和對話式機器閱讀基準 ShARC，並在沒有特定任務微調的情況下展現出競爭力。我們也利用 LDPC 固有的可解釋架構來了解錯誤發生在哪裡，揭露 ShARC 資料集中的歧義，並強調對話式問題解答推理的挑戰。
 
-Generative models, particularly text-to-image (T2I) diffusion models, play a
-crucial role in medical image analysis. However, these models are prone to
-training data memorization, posing significant risks to patient privacy.
-Synthetic chest X-ray generation is one of the most common applications in
-medical image analysis with the MIMIC-CXR dataset serving as the primary data
-repository for this task. This study adopts a data-driven approach and presents
-the first systematic attempt to identify prompts and text tokens in MIMIC-CXR
-that contribute the most to training data memorization. Our analysis reveals an
-unexpected finding: prompts containing traces of de-identification procedures
-are among the most memorized, with de-identification markers contributing the
-most. Furthermore, we also find existing inference-time memorization mitigation
-strategies are ineffective and fail to sufficiently reduce the model's reliance
-on memorized text tokens highlighting a broader issue in T2I synthesis with
-MIMIC-CXR. On this front, we propose actionable strategies to enhance privacy
-and improve the reliability of generative models in medical imaging. Finally,
-our results provide a foundation for future work on developing and benchmarking
-memorization mitigation techniques for synthetic chest X-ray generation using
-the MIMIC-CXR dataset.
+##### **Reasoning Language Models: A Blueprint**
+2501.11223v3 by Maciej Besta, Julia Barth, Eric Schreiber, Ales Kubicek, Afonso Catarino, Robert Gerstenberger, Piotr Nyczyk, Patrick Iff, Yueling Li, Sam Houliston, Tomasz Sternal, Marcin Copik, Grzegorz Kwaśniewski, Jürgen Müller, Łukasz Flis, Hannes Eberhard, Hubert Niewiadomski, Torsten Hoefler
 
-摘要：生成模型，尤其是文字轉圖像 (T2I) 擴散模型，在醫學影像分析中扮演著至關重要的角色。然而，這些模型容易訓練資料記憶，對病患隱私造成重大風險。合成胸部 X 光線生成是醫學影像分析中最常見的應用之一，其中 MIMIC-CXR 資料集作為此任務的主要資料儲存庫。本研究採用資料驅動的方法，並提出首次系統性嘗試，以識別 MIMIC-CXR 中最有助於訓練資料記憶的提示和文字代碼。我們的分析揭露了一個意外的發現：包含去識別程序痕跡的提示是最常被記憶的，其中去識別標記的貢獻最大。此外，我們也發現現有的推論時間記憶減緩策略無效，且無法充分降低模型對記憶文字代碼的依賴性，突顯了使用 MIMIC-CXR 進行 T2I 合成的更廣泛問題。針對此問題，我們提出可行的策略，以增強隱私並改善生成模型在醫學影像中的可靠性。最後，我們的結果為未來使用 MIMIC-CXR 資料集開發和評量合成胸部 X 光線生成的記憶減緩技術奠定了基礎。
+Reasoning language models (RLMs), also known as Large Reasoning Models
+(LRMs), such as OpenAI's o1 and o3, DeepSeek-V3, and Alibaba's QwQ, have
+redefined AI's problem-solving capabilities by extending LLMs with advanced
+reasoning mechanisms. Yet, their high costs, proprietary nature, and complex
+architectures - uniquely combining Reinforcement Learning (RL), search
+heuristics, and LLMs - present accessibility and scalability challenges. To
+address these, we propose a comprehensive blueprint that organizes RLM
+components into a modular framework, based on a survey and analysis of all RLM
+works. This blueprint incorporates diverse reasoning structures (chains, trees,
+graphs, and nested forms), reasoning strategies (e.g., Monte Carlo Tree Search,
+Beam Search), RL concepts (policy, value models and others), supervision
+schemes (Outcome-Based and Process-Based Supervision), and other related
+concepts (e.g., Test-Time Compute, Retrieval-Augmented Generation, agent
+tools). We also provide detailed mathematical formulations and algorithmic
+specifications to simplify RLM implementation. By showing how schemes like
+LLaMA-Berry, QwQ, Journey Learning, and Graph of Thoughts fit as special cases,
+we demonstrate the blueprint's versatility and unifying potential. To
+illustrate its utility, we introduce x1, a modular implementation for rapid RLM
+prototyping and experimentation. Using x1 and a literature review, we provide
+key insights, such as multi-phase training for policy and value models, and the
+importance of familiar training distributions. Finally, we discuss scalable RLM
+cloud deployments and we outline how RLMs can integrate with a broader LLM
+ecosystem. Our work demystifies RLM construction, democratizes advanced
+reasoning capabilities, and fosters innovation, aiming to mitigate the gap
+between "rich AI" and "poor AI" by lowering barriers to RLM design and
+experimentation.
 
-##### **KPIs 2024 Challenge: Advancing Glomerular Segmentation from Patch- to Slide-Level**
-2502.07288v1 by Ruining Deng, Tianyuan Yao, Yucheng Tang, Junlin Guo, Siqi Lu, Juming Xiong, Lining Yu, Quan Huu Cap, Pengzhou Cai, Libin Lan, Ze Zhao, Adrian Galdran, Amit Kumar, Gunjan Deotale, Dev Kumar Das, Inyoung Paik, Joonho Lee, Geongyu Lee, Yujia Chen, Wangkai Li, Zhaoyang Li, Xuege Hou, Zeyuan Wu, Shengjin Wang, Maximilian Fischer, Lars Kramer, Anghong Du, Le Zhang, Maria Sanchez Sanchez, Helena Sanchez Ulloa, David Ribalta Heredia, Carlos Perez de Arenaza Garcia, Shuoyu Xu, Bingdou He, Xinping Cheng, Tao Wang, Noemie Moreau, Katarzyna Bozek, Shubham Innani, Ujjwal Baid, Kaura Solomon Kefas, Bennett A. Landman, Yu Wang, Shilin Zhao, Mengmeng Yin, Haichun Yang, Yuankai Huo
+摘要：推理語言模型 (RLM)，又稱為大型推理模型 (LRM)，例如 OpenAI 的 o1 和 o3、DeepSeek-V3 以及阿里巴巴的 QwQ，透過擴充 LLM 的先進推理機制，重新定義了 AI 的問題解決能力。然而，它們的高成本、專有性質和複雜架構（獨特地結合了強化學習 (RL)、搜尋啟發法和 LLM）提出了可及性和可擴充性的挑戰。為了解決這些問題，我們提出了一個全面的藍圖，將 RLM 組件組織成一個模組化架構，這是基於對所有 RLM 作品的調查和分析。此藍圖包含多樣化的推理結構（鏈、樹、圖和巢狀形式）、推理策略（例如蒙地卡羅樹搜尋、波束搜尋）、RL 概念（策略、價值模型等）、監督方案（基於結果和基於流程的監督）和其他相關概念（例如測試時間運算、檢索增強生成、代理工具）。我們還提供了詳細的數學公式和演算法規範，以簡化 RLM 的實作。透過展示 LLaMA-Berry、QwQ、Journey Learning 和 Graph of Thoughts 等方案如何作為特殊情況，我們展示了藍圖的多功能性和統一潛力。為了說明其效用，我們介紹了 x1，這是一個模組化實作，用於快速 RLM 原型製作和實驗。使用 x1 和文獻回顧，我們提供了關鍵見解，例如策略和價值模型的多階段訓練，以及熟悉訓練分佈的重要性。最後，我們討論了可擴充的 RLM 雲端部署，並概述了 RLM 如何與更廣泛的 LLM 生態系統整合。我們的研究揭開了 RLM 建構的神秘面紗，使先進的推理能力民主化，並促進創新，旨在透過降低 RLM 設計和實驗的障礙，來縮小「富裕 AI」和「貧窮 AI」之間的差距。
 
-Chronic kidney disease (CKD) is a major global health issue, affecting over
-10% of the population and causing significant mortality. While kidney biopsy
-remains the gold standard for CKD diagnosis and treatment, the lack of
-comprehensive benchmarks for kidney pathology segmentation hinders progress in
-the field. To address this, we organized the Kidney Pathology Image
-Segmentation (KPIs) Challenge, introducing a dataset that incorporates
-preclinical rodent models of CKD with over 10,000 annotated glomeruli from 60+
-Periodic Acid Schiff (PAS)-stained whole slide images. The challenge includes
-two tasks, patch-level segmentation and whole slide image segmentation and
-detection, evaluated using the Dice Similarity Coefficient (DSC) and F1-score.
-By encouraging innovative segmentation methods that adapt to diverse CKD models
-and tissue conditions, the KPIs Challenge aims to advance kidney pathology
-analysis, establish new benchmarks, and enable precise, large-scale
-quantification for disease research and diagnosis.
+##### **IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems**
+2501.11067v1 by Elad Levi, Ilan Kadar
 
-摘要：慢性腎臟病 (CKD) 是全球主要的健康問題，影響超過
-10% 的人口，並造成顯著的死亡率。雖然腎臟活檢
-仍然是 CKD 診斷和治療的黃金標準，但缺乏
-腎臟病理學分割的全面基準阻礙了該領域的進展。
-為了解決這個問題，我們組織了腎臟病理影像
-分割 (KPIs) 挑戰，引入了包含超過 10,000 個註解的
-CKD 臨床前嚙齒動物模型的資料集，這些註解來自 60 多個
-週期性酸性雪夫 (PAS) 染色的全幻燈片影像。挑戰包括
-兩個任務，修補層級分割和全幻燈片影像分割和
-偵測，使用 Dice 相似係數 (DSC) 和 F1 分數進行評估。
-通過鼓勵創新的分割方法來適應不同的 CKD 模型
-和組織條件，KPIs 挑戰旨在推進腎臟病理
-分析，建立新的基準，並實現精確、大規模的
-疾病研究和診斷量化。
+Large Language Models (LLMs) are transforming artificial intelligence,
+evolving into task-oriented systems capable of autonomous planning and
+execution. One of the primary applications of LLMs is conversational AI
+systems, which must navigate multi-turn dialogues, integrate domain-specific
+APIs, and adhere to strict policy constraints. However, evaluating these agents
+remains a significant challenge, as traditional methods fail to capture the
+complexity and variability of real-world interactions. We introduce
+IntellAgent, a scalable, open-source multi-agent framework designed to evaluate
+conversational AI systems comprehensively. IntellAgent automates the creation
+of diverse, synthetic benchmarks by combining policy-driven graph modeling,
+realistic event generation, and interactive user-agent simulations. This
+innovative approach provides fine-grained diagnostics, addressing the
+limitations of static and manually curated benchmarks with coarse-grained
+metrics. IntellAgent represents a paradigm shift in evaluating conversational
+AI. By simulating realistic, multi-policy scenarios across varying levels of
+complexity, IntellAgent captures the nuanced interplay of agent capabilities
+and policy constraints. Unlike traditional methods, it employs a graph-based
+policy model to represent relationships, likelihoods, and complexities of
+policy interactions, enabling highly detailed diagnostics. IntellAgent also
+identifies critical performance gaps, offering actionable insights for targeted
+optimization. Its modular, open-source design supports seamless integration of
+new domains, policies, and APIs, fostering reproducibility and community
+collaboration. Our findings demonstrate that IntellAgent serves as an effective
+framework for advancing conversational AI by addressing challenges in bridging
+research and deployment. The framework is available at
+https://github.com/plurai-ai/intellagent
 
-##### **Early Risk Prediction of Pediatric Cardiac Arrest from Electronic Health Records via Multimodal Fused Transformer**
-2502.07158v1 by Jiaying Lu, Stephanie R. Brown, Songyuan Liu, Shifan Zhao, Kejun Dong, Del Bold, Michael Fundora, Alaa Aljiffry, Alex Fedorov, Jocelyn Grunwell, Xiao Hu
+摘要：大型語言模型 (LLM) 正在轉變人工智慧，演變成具備自主規劃和執行能力的任務導向系統。LLM 的主要應用之一是對話式 AI 系統，它必須應對多輪對話、整合特定領域的 API，並遵守嚴格的政策約束。然而，評估這些代理仍然是一項重大挑戰，因為傳統方法無法捕捉現實世界互動的複雜性和變異性。我們引入了 IntellAgent，一個可擴充、開放原始碼的多代理架構，旨在全面評估對話式 AI 系統。IntellAgent 自動化建立多樣化、合成的基準，方法是結合策略驅動的圖形建模、逼真的事件產生和互動使用者代理模擬。這種創新方法提供了細緻的診斷，解決了具有粗略指標的靜態和手動策劃基準的限制。IntellAgent 代表了評估對話式 AI 的典範轉移。通過模擬不同層級複雜性的逼真多策略場景，IntellAgent 捕捉到了代理功能和策略約束之間的細微交互。與傳統方法不同，它採用基於圖形的策略模型來表示策略交互的關係、可能性和複雜性，從而實現高度詳細的診斷。IntellAgent 還識別出關鍵效能差距，提供可行的見解，以進行目標最佳化。其模組化、開放原始碼的設計支援無縫整合新的領域、策略和 API，促進了可複製性和社群協作。我們的研究結果表明，IntellAgent 可作為一個有效的框架，透過解決研究和部署之間的挑戰來推進對話式 AI。這個框架可在 https://github.com/plurai-ai/intellagent 取得
 
-Early prediction of pediatric cardiac arrest (CA) is critical for timely
-intervention in high-risk intensive care settings. We introduce PedCA-FT, a
-novel transformer-based framework that fuses tabular view of EHR with the
-derived textual view of EHR to fully unleash the interactions of
-high-dimensional risk factors and their dynamics. By employing dedicated
-transformer modules for each modality view, PedCA-FT captures complex temporal
-and contextual patterns to produce robust CA risk estimates. Evaluated on a
-curated pediatric cohort from the CHOA-CICU database, our approach outperforms
-ten other artificial intelligence models across five key performance metrics
-and identifies clinically meaningful risk factors. These findings underscore
-the potential of multimodal fusion techniques to enhance early CA detection and
-improve patient care.
 
-摘要：早期預測兒童心臟驟停 (CA) 對高風險重症監護環境中的及時干預至關重要。我們引入了 PedCA-FT，這是一個新的基於Transformer的框架，它將 EHR 的表格視圖與 EHR 的派生文本視圖融合在一起，以充分釋放高維風險因素及其動態的交互作用。通過為每個模態視圖採用專用的Transformer模塊，PedCA-FT 捕獲復雜的時間和上下文模式以產生穩健的 CA 風險估計。在 CHOA-CICU 資料庫中經過策劃的兒科隊列上進行評估，我們的做法在五個關鍵性能指標上優於其他十個人工智慧模型，並識別出臨床上有意義的風險因素。這些發現強調了多模態融合技術在增強早期 CA 檢測和改善患者護理方面的潛力。
+### LLM
+|Publish Date|Title|Authors|Homepage|Code|
+| :---: | :---: | :---: | :---: | :---: |
+|**2025-02-13**|**Theoretical Benefit and Limitation of Diffusion Language Model**|Guhao Feng et.al.|[2502.09622v1](http://arxiv.org/abs/2502.09622v1)|null|
+|**2025-02-13**|**MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency**|Dongzhi Jiang et.al.|[2502.09621v1](http://arxiv.org/abs/2502.09621v1)|null|
+|**2025-02-13**|**Exploring the Potential of Encoder-free Architectures in 3D LMMs**|Yiwen Tang et.al.|[2502.09620v1](http://arxiv.org/abs/2502.09620v1)|null|
+|**2025-02-13**|**DexTrack: Towards Generalizable Neural Tracking Control for Dexterous Manipulation from Human References**|Xueyi Liu et.al.|[2502.09614v1](http://arxiv.org/abs/2502.09614v1)|null|
+|**2025-02-13**|**Score-of-Mixture Training: Training One-Step Generative Models Made Simple**|Tejas Jayashankar et.al.|[2502.09609v1](http://arxiv.org/abs/2502.09609v1)|null|
+|**2025-02-13**|**Human-LLM Coevolution: Evidence from Academic Writing**|Mingmeng Geng et.al.|[2502.09606v1](http://arxiv.org/abs/2502.09606v1)|null|
+|**2025-02-13**|**SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models**|Yung-Sung Chuang et.al.|[2502.09604v1](http://arxiv.org/abs/2502.09604v1)|null|
+|**2025-02-13**|**CoT-Valve: Length-Compressible Chain-of-Thought Tuning**|Xinyin Ma et.al.|[2502.09601v1](http://arxiv.org/abs/2502.09601v1)|null|
+|**2025-02-13**|**Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs**|Siyan Zhao et.al.|[2502.09597v1](http://arxiv.org/abs/2502.09597v1)|null|
+|**2025-02-13**|**KIMAs: A Configurable Knowledge Integrated Multi-Agent System**|Zitao Li et.al.|[2502.09596v1](http://arxiv.org/abs/2502.09596v1)|null|
+|**2025-02-13**|**Logical forms complement probability in understanding language model (and human) performance**|Yixuan Wang et.al.|[2502.09589v1](http://arxiv.org/abs/2502.09589v1)|null|
+|**2025-02-13**|**Optimizing GPT for Video Understanding: Zero-Shot Performance and Prompt Engineering**|Mark Beliaev et.al.|[2502.09573v1](http://arxiv.org/abs/2502.09573v1)|null|
+|**2025-02-13**|**MorphNLI: A Stepwise Approach to Natural Language Inference Using Text Morphing**|Vlad Andrei Negru et.al.|[2502.09567v1](http://arxiv.org/abs/2502.09567v1)|null|
+|**2025-02-13**|**Zero-shot generation of synthetic neurosurgical data with large language models**|Austin A. Barr et.al.|[2502.09566v1](http://arxiv.org/abs/2502.09566v1)|null|
+|**2025-02-13**|**MDCrow: Automating Molecular Dynamics Workflows with Large Language Models**|Quintina Campbell et.al.|[2502.09565v1](http://arxiv.org/abs/2502.09565v1)|null|
+|**2025-02-13**|**EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents**|Rui Yang et.al.|[2502.09560v1](http://arxiv.org/abs/2502.09560v1)|null|
+|**2025-02-13**|**Mind the Gap! Choice Independence in Using Multilingual LLMs for Persuasive Co-Writing Tasks in Different Languages**|Shreyan Biswas et.al.|[2502.09532v1](http://arxiv.org/abs/2502.09532v1)|null|
+|**2025-02-13**|**Diffusion Models for Molecules: A Survey of Methods and Tasks**|Liang Wang et.al.|[2502.09511v1](http://arxiv.org/abs/2502.09511v1)|null|
+|**2025-02-13**|**AttentionSmithy: A Modular Framework for Rapid Transformer Development and Customization**|Caleb Cranney et.al.|[2502.09503v1](http://arxiv.org/abs/2502.09503v1)|null|
+|**2025-02-13**|**Improve LLM-based Automatic Essay Scoring with Linguistic Features**|Zhaoyi Joey Hou et.al.|[2502.09497v1](http://arxiv.org/abs/2502.09497v1)|null|
+|**2025-02-13**|**Cracking the Code: Enhancing Development finance understanding with artificial intelligence**|Pierre Beaucoral et.al.|[2502.09495v1](http://arxiv.org/abs/2502.09495v1)|null|
+|**2025-02-13**|**Objective quantification of mood states using large language models**|Jakub Onysk et.al.|[2502.09487v1](http://arxiv.org/abs/2502.09487v1)|null|
+|**2025-02-13**|**The Multilingual Mind : A Survey of Multilingual Reasoning in Language Models**|Akash Ghosh et.al.|[2502.09457v1](http://arxiv.org/abs/2502.09457v1)|null|
+|**2025-02-13**|**Pixel-Level Reasoning Segmentation via Multi-turn Conversations**|Dexian Cai et.al.|[2502.09447v1](http://arxiv.org/abs/2502.09447v1)|null|
+|**2025-02-13**|**Dual Formulation for Non-Rectangular Lp Robust Markov Decision Processes**|Navdeep Kumar et.al.|[2502.09432v1](http://arxiv.org/abs/2502.09432v1)|null|
+|**2025-02-13**|**Transformer-Enhanced Variational Autoencoder for Crystal Structure Prediction**|Ziyi Chen et.al.|[2502.09423v1](http://arxiv.org/abs/2502.09423v1)|null|
+|**2025-02-13**|**On multi-token prediction for efficient LLM inference**|Somesh Mehra et.al.|[2502.09419v1](http://arxiv.org/abs/2502.09419v1)|null|
+|**2025-02-13**|**SQuARE: Sequential Question Answering Reasoning Engine for Enhanced Chain-of-Thought in Large Language Models**|Daniel Fleischer et.al.|[2502.09390v1](http://arxiv.org/abs/2502.09390v1)|null|
+|**2025-02-13**|**Truth Knows No Language: Evaluating Truthfulness Beyond English**|Blanca Calvo Figueras et.al.|[2502.09387v1](http://arxiv.org/abs/2502.09387v1)|null|
+|**2025-02-13**|**A Deep Inverse-Mapping Model for a Flapping Robotic Wing**|Hadar Sharvit et.al.|[2502.09378v1](http://arxiv.org/abs/2502.09378v1)|null|
+|**2025-02-13**|**Language Agents as Digital Representatives in Collective Decision-Making**|Daniel Jarrett et.al.|[2502.09369v1](http://arxiv.org/abs/2502.09369v1)|null|
+|**2025-02-13**|**Neural Spatiotemporal Point Processes: Trends and Challenges**|Sumantrak Mukherjee et.al.|[2502.09341v1](http://arxiv.org/abs/2502.09341v1)|null|
+|**2025-02-13**|**Graph Diffusion Network for Drug-Gene Prediction**|Jiayang Wu et.al.|[2502.09335v1](http://arxiv.org/abs/2502.09335v1)|null|
+|**2025-02-13**|**Beyond English: The Impact of Prompt Translation Strategies across Languages and Tasks in Multilingual LLMs**|Itai Mondshine et.al.|[2502.09331v1](http://arxiv.org/abs/2502.09331v1)|null|
+|**2025-02-13**|**A Judge-free LLM Open-ended Generation Benchmark Based on the Distributional Hypothesis**|Kentaro Imajo et.al.|[2502.09316v1](http://arxiv.org/abs/2502.09316v1)|null|
+|**2025-02-13**|**When the LM misunderstood the human chuckled: Analyzing garden path effects in humans and language models**|Samuel Joseph Amouyal et.al.|[2502.09307v1](http://arxiv.org/abs/2502.09307v1)|null|
+|**2025-02-13**|**Indeterminacy in Affective Computing: Considering Meaning and Context in Data Collection Practices**|Bernd Dudzik et.al.|[2502.09294v1](http://arxiv.org/abs/2502.09294v1)|null|
+|**2025-02-13**|**SparQLe: Speech Queries to Text Translation Through LLMs**|Amirbek Djanibekov et.al.|[2502.09284v1](http://arxiv.org/abs/2502.09284v1)|null|
+|**2025-02-13**|**LiSA: Leveraging Link Recommender to Attack Graph Neural Networks via Subgraph Injection**|Wenlun Zhang et.al.|[2502.09271v1](http://arxiv.org/abs/2502.09271v1)|null|
+|**2025-02-13**|**AnomalyGFM: Graph Foundation Model for Zero/Few-shot Anomaly Detection**|Hezhe Qiao et.al.|[2502.09254v1](http://arxiv.org/abs/2502.09254v1)|null|
+|**2025-02-13**|**The Joint Entity-Relation Extraction Model Based on Span and Interactive Fusion Representation for Chinese Medical Texts with Complex Semantics**|Danni Feng et.al.|[2502.09247v1](http://arxiv.org/abs/2502.09247v1)|null|
+|**2025-02-13**|**You Do Not Fully Utilize Transformer's Representation Capacity**|Gleb Gerasimov et.al.|[2502.09245v1](http://arxiv.org/abs/2502.09245v1)|null|
+|**2025-02-13**|**From large language models to multimodal AI: A scoping review on the potential of generative AI in medicine**|Lukas Buess et.al.|[2502.09242v1](http://arxiv.org/abs/2502.09242v1)|null|
+|**2025-02-13**|**Reliable Conversational Agents under ASP Control that Understand Natural Language**|Yankai Zeng et.al.|[2502.09237v1](http://arxiv.org/abs/2502.09237v1)|null|
+|**2025-02-13**|**Commonsense Reasoning-Aided Autonomous Vehicle Systems**|Keegan Kimbrell et.al.|[2502.09233v1](http://arxiv.org/abs/2502.09233v1)|null|
+|**2025-02-13**|**Logical foundations of Smart Contracts**|Kalonji Kalala et.al.|[2502.09232v1](http://arxiv.org/abs/2502.09232v1)|null|
+|**2025-02-13**|**Relating Answer Set Programming and Many-sorted Logics for Formal Verification**|Zachary Hansen et.al.|[2502.09230v1](http://arxiv.org/abs/2502.09230v1)|null|
+|**2025-02-13**|**Computational methods for Dynamic Answer Set Programming**|Susana Hahn et.al.|[2502.09228v1](http://arxiv.org/abs/2502.09228v1)|null|
+|**2025-02-13**|**Generating Causally Compliant Counterfactual Explanations using ASP**|Sopam Dasgupta et.al.|[2502.09226v1](http://arxiv.org/abs/2502.09226v1)|null|
+|**2025-02-13**|**Order-Sorted Intensional Logic: Expressing Subtyping Polymorphism with Typing Assertions and Quantification over Concepts**|Đorđe Marković et.al.|[2502.09224v1](http://arxiv.org/abs/2502.09224v1)|null|
+|**2025-02-13**|**ASP-driven User-interaction with Clinguin**|Alexander Beiser et.al.|[2502.09222v1](http://arxiv.org/abs/2502.09222v1)|null|
+|**2025-02-13**|**Pearce's Characterisation in an Epistemic Domain**|Ezgi Iraz Su et.al.|[2502.09221v1](http://arxiv.org/abs/2502.09221v1)|null|
+|**2025-02-13**|**Graphical Conditions for the Existence, Unicity and Number of Regular Models**|Van-Giang Trinh et.al.|[2502.09220v1](http://arxiv.org/abs/2502.09220v1)|null|
+|**2025-02-13**|**Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**|Flavio Bertini et.al.|[2502.09218v1](http://arxiv.org/abs/2502.09218v1)|null|
+|**2025-02-13**|**Mind the Gaps: Logical English, Prolog, and Multi-agent Systems for Autonomous Vehicles**|Galileo Sartor et.al.|[2502.09216v1](http://arxiv.org/abs/2502.09216v1)|null|
+|**2025-02-13**|**Architecture for Simulating Behavior Mode Changes in Norm-Aware Autonomous Agents**|Sean Glaze et.al.|[2502.09215v1](http://arxiv.org/abs/2502.09215v1)|null|
+|**2025-02-13**|**Neuro-Symbolic Contrastive Learning for Cross-domain Inference**|Mingyue Liu et.al.|[2502.09213v1](http://arxiv.org/abs/2502.09213v1)|null|
+|**2025-02-13**|**LP-LM: No Hallucinations in Question Answering with Logic Programming**|Katherine Wu et.al.|[2502.09212v1](http://arxiv.org/abs/2502.09212v1)|null|
+|**2025-02-13**|**Visual Graph Question Answering with ASP and LLMs for Language Parsing**|Jakob Johannes Bauer et.al.|[2502.09211v1](http://arxiv.org/abs/2502.09211v1)|null|
+|**2025-02-13**|**On LLM-generated Logic Programs and their Inference Execution Methods**|Paul Tarau et.al.|[2502.09209v1](http://arxiv.org/abs/2502.09209v1)|null|
+|**2025-02-13**|**Efficient OWL2QL Meta-reasoning Using ASP-based Hybrid Knowledge Bases**|Haya Majid Qureshi et.al.|[2502.09206v1](http://arxiv.org/abs/2502.09206v1)|null|
+|**2025-02-13**|**Counterfactual Explanations as Plans**|Vaishak Belle et.al.|[2502.09205v1](http://arxiv.org/abs/2502.09205v1)|null|
+|**2025-02-13**|**Logical Lease Litigation: Prolog and LLMs for Rental Law Compliance in New York**|Sanskar Sehgal et.al.|[2502.09204v1](http://arxiv.org/abs/2502.09204v1)|null|
+|**2025-02-13**|**Thinking beyond the anthropomorphic paradigm benefits LLM research**|Lujain Ibrahim et.al.|[2502.09192v1](http://arxiv.org/abs/2502.09192v1)|null|
+|**2025-02-13**|**Matina: A Large-Scale 73B Token Persian Text Corpus**|Sara Bourbour Hosseinbeigi et.al.|[2502.09188v1](http://arxiv.org/abs/2502.09188v1)|null|
+|**2025-02-13**|**RefineCoder: Iterative Improving of Large Language Models via Adaptive Critique Refinement for Code Generation**|Changzhi Zhou et.al.|[2502.09183v1](http://arxiv.org/abs/2502.09183v1)|null|
+|**2025-02-13**|**FLAME: Flexible LLM-Assisted Moderation Engine**|Ivan Bakulin et.al.|[2502.09175v1](http://arxiv.org/abs/2502.09175v1)|null|
+|**2025-02-13**|**Two-Stage Representation Learning for Analyzing Movement Behavior Dynamics in People Living with Dementia**|Jin Cui et.al.|[2502.09173v1](http://arxiv.org/abs/2502.09173v1)|null|
+|**2025-02-13**|**Musical Heritage Historical Entity Linking**|Arianna Graciotti et.al.|[2502.09168v1](http://arxiv.org/abs/2502.09168v1)|null|
+|**2025-02-13**|**Improving TCM Question Answering through Tree-Organized Self-Reflective Retrieval with LLMs**|Chang Liu et.al.|[2502.09156v1](http://arxiv.org/abs/2502.09156v1)|null|
+|**2025-02-13**|**A Novel Dialect-Aware Framework for the Classification of Arabic Dialects and Emotions**|Nasser A Alsadhan et.al.|[2502.09128v1](http://arxiv.org/abs/2502.09128v1)|null|
+|**2025-02-13**|**Automatic Pruning via Structured Lasso with Class-wise Information**|Xiang Liu et.al.|[2502.09125v1](http://arxiv.org/abs/2502.09125v1)|null|
+|**2025-02-13**|**The influence of visual and linguistic cues on ignorance inference in Vision-Language Models (VLMs)**|Ye-eun Cho et.al.|[2502.09120v1](http://arxiv.org/abs/2502.09120v1)|null|
+|**2025-02-13**|**One-shot Federated Learning Methods: A Practical Guide**|Xiang Liu et.al.|[2502.09104v1](http://arxiv.org/abs/2502.09104v1)|null|
+|**2025-02-13**|**Logical Reasoning in Large Language Models: A Survey**|Hanmeng Liu et.al.|[2502.09100v1](http://arxiv.org/abs/2502.09100v1)|null|
+|**2025-02-13**|**A Hybrid Transformer Model for Fake News Detection: Leveraging Bayesian Optimization and Bidirectional Recurrent Unit**|Tianyi Huang et.al.|[2502.09097v1](http://arxiv.org/abs/2502.09097v1)|null|
+|**2025-02-13**|**A Hybrid Model for Few-Shot Text Classification Using Transfer and Meta-Learning**|Jia Gao et.al.|[2502.09086v1](http://arxiv.org/abs/2502.09086v1)|null|
+|**2025-02-13**|**Show Me the Work: Fact-Checkers' Requirements for Explainable Automated Fact-Checking**|Greta Warren et.al.|[2502.09083v1](http://arxiv.org/abs/2502.09083v1)|null|
+|**2025-02-13**|**CoSER: Coordinating LLM-Based Persona Simulation of Established Roles**|Xintao Wang et.al.|[2502.09082v1](http://arxiv.org/abs/2502.09082v1)|null|
+|**2025-02-13**|**Enhancing RAG with Active Learning on Conversation Records: Reject Incapables and Answer Capables**|Xuzhao Geng et.al.|[2502.09073v1](http://arxiv.org/abs/2502.09073v1)|null|
+|**2025-02-13**|**An Open Recipe: Adapting Language-Specific LLMs to a Reasoning Model in One Day via Model Merging**|Kunat Pipatanakul et.al.|[2502.09056v1](http://arxiv.org/abs/2502.09056v1)|null|
+|**2025-02-13**|**Cost-Saving LLM Cascades with Early Abstention**|Michael J. Zellinger et.al.|[2502.09054v1](http://arxiv.org/abs/2502.09054v1)|null|
+|**2025-02-13**|**Game Theory Meets Large Language Models: A Systematic Survey**|Haoran Sun et.al.|[2502.09053v1](http://arxiv.org/abs/2502.09053v1)|null|
+|**2025-02-13**|**AIDE: Agentically Improve Visual Language Model with Domain Experts**|Ming-Chang Chiu et.al.|[2502.09051v1](http://arxiv.org/abs/2502.09051v1)|null|
+|**2025-02-13**|**Leveraging Member-Group Relations via Multi-View Graph Filtering for Effective Group Recommendation**|Chae-Hyun Kim et.al.|[2502.09050v1](http://arxiv.org/abs/2502.09050v1)|null|
+|**2025-02-13**|**Criteria-Aware Graph Filtering: Extremely Fast Yet Accurate Multi-Criteria Recommendation**|Jin-Duk Park et.al.|[2502.09046v1](http://arxiv.org/abs/2502.09046v1)|null|
+|**2025-02-13**|**Typhoon T1: An Open Thai Reasoning Model**|Pittawat Taveekitworachai et.al.|[2502.09042v1](http://arxiv.org/abs/2502.09042v1)|null|
+|**2025-02-13**|**Mechanistic Unveiling of Transformer Circuits: Self-Influence as a Key to Model Reasoning**|Lin Zhang et.al.|[2502.09022v1](http://arxiv.org/abs/2502.09022v1)|null|
+|**2025-02-13**|**EventSTR: A Benchmark Dataset and Baselines for Event Stream based Scene Text Recognition**|Xiao Wang et.al.|[2502.09020v1](http://arxiv.org/abs/2502.09020v1)|null|
+|**2025-02-13**|**Zero-shot Concept Bottleneck Models**|Shin'ya Yamaguchi et.al.|[2502.09018v1](http://arxiv.org/abs/2502.09018v1)|null|
+|**2025-02-13**|**Diversity Enhances an LLM's Performance in RAG and Long-context Task**|Zhchao Wang et.al.|[2502.09017v1](http://arxiv.org/abs/2502.09017v1)|null|
+|**2025-02-13**|**Hope vs. Hate: Understanding User Interactions with LGBTQ+ News Content in Mainstream US News Media through the Lens of Hope Speech**|Jonathan Pofcher et.al.|[2502.09004v1](http://arxiv.org/abs/2502.09004v1)|null|
+|**2025-02-13**|**RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models**|Quan Wei et.al.|[2502.09003v1](http://arxiv.org/abs/2502.09003v1)|null|
+|**2025-02-13**|**PixLift: Accelerating Web Browsing via AI Upscaling**|Yonas Atinafu et.al.|[2502.08995v1](http://arxiv.org/abs/2502.08995v1)|null|
+|**2025-02-13**|**RLSA-PFL: Robust Lightweight Secure Aggregation with Model Inconsistency Detection in Privacy-Preserving Federated Learning**|Nazatul H. Sultan et.al.|[2502.08989v1](http://arxiv.org/abs/2502.08989v1)|null|
+|**2025-02-13**|**Neural Force Field: Learning Generalized Physical Representation from a Few Examples**|Shiqian Li et.al.|[2502.08987v1](http://arxiv.org/abs/2502.08987v1)|null|
+|**2025-02-13**|**Tuning-Free Personalized Alignment via Trial-Error-Explain In-Context Learning**|Hyundong Cho et.al.|[2502.08972v1](http://arxiv.org/abs/2502.08972v1)|null|
+|**2025-02-13**|**RTBAS: Defending LLM Agents Against Prompt Injection and Privacy Leakage**|Peter Yong Zhong et.al.|[2502.08966v1](http://arxiv.org/abs/2502.08966v1)|null|
+|**2025-02-13**|**Biologically Plausible Brain Graph Transformer**|Ciyuan Peng et.al.|[2502.08958v1](http://arxiv.org/abs/2502.08958v1)|null|
+|**2025-02-13**|**Medicine on the Edge: Comparative Performance Analysis of On-Device LLMs for Clinical Reasoning**|Leon Nissen et.al.|[2502.08954v1](http://arxiv.org/abs/2502.08954v1)|null|
 
-##### **Explaining 3D Computed Tomography Classifiers with Counterfactuals**
-2502.07156v1 by Joseph Paul Cohen, Louis Blankemeier, Akshay Chaudhari
+#### Abstracts
+##### **Theoretical Benefit and Limitation of Diffusion Language Model**
+2502.09622v1 by Guhao Feng, Yihan Geng, Jian Guan, Wei Wu, Liwei Wang, Di He
 
-Counterfactual explanations in medical imaging are critical for understanding
-the predictions made by deep learning models. We extend the Latent Shift
-counterfactual generation method from 2D applications to 3D computed tomography
-(CT) scans. We address the challenges associated with 3D data, such as limited
-training samples and high memory demands, by implementing a slice-based
-approach. This method leverages a 2D encoder trained on CT slices, which are
-subsequently combined to maintain 3D context. We demonstrate this technique on
-two models for clinical phenotype prediction and lung segmentation. Our
-approach is both memory-efficient and effective for generating interpretable
-counterfactuals in high-resolution 3D medical imaging.
+Diffusion language models have emerged as a promising approach for text
+generation. One would naturally expect this method to be an efficient
+replacement for autoregressive models since multiple tokens can be sampled in
+parallel during each diffusion step. However, its efficiency-accuracy trade-off
+is not yet well understood. In this paper, we present a rigorous theoretical
+analysis of a widely used type of diffusion language model, the Masked
+Diffusion Model (MDM), and find that its effectiveness heavily depends on the
+target evaluation metric. Under mild conditions, we prove that when using
+perplexity as the metric, MDMs can achieve near-optimal perplexity in sampling
+steps regardless of sequence length, demonstrating that efficiency can be
+achieved without sacrificing performance. However, when using the sequence
+error rate--which is important for understanding the "correctness" of a
+sequence, such as a reasoning chain--we show that the required sampling steps
+must scale linearly with sequence length to obtain "correct" sequences, thereby
+eliminating MDM's efficiency advantage over autoregressive models. Our analysis
+establishes the first theoretical foundation for understanding the benefits and
+limitations of MDMs. All theoretical findings are supported by empirical
+studies.
 
-摘要：反事實解釋在醫學影像中對於理解深度學習模型所做的預測至關重要。我們將 Latent Shift 反事實生成方法從 2D 應用程式延伸到 3D 電腦斷層掃描 (CT) 掃描。我們透過實作基於切片的做法，來解決與 3D 資料相關的挑戰，例如受限的訓練樣本和高記憶體需求。此方法利用經過 CT 切片訓練的 2D 編碼器，隨後將這些切片結合起來以維護 3D 背景。我們在兩個用於臨床表型預測和肺部分割的模型上展示此技術。我們的做法對於在高解析度 3D 醫學影像中產生可解釋的反事實，既節省記憶體又有效。
+摘要：擴散語言模型已成為文字生成的一種有前途的方法。由於在每個擴散步驟期間可以並行採樣多個符號，因此人們自然會期望這種方法成為自迴歸模型的有效替代方案。然而，它的效率準確性權衡尚未得到很好的理解。在本文中，我們對廣泛使用的擴散語言模型類型，即遮罩擴散模型 (MDM) 進行了嚴格的理論分析，並發現其有效性在很大程度上取決於目標評估指標。在溫和條件下，我們證明了當使用困惑度作為指標時，MDM 可以無論序列長度如何，在採樣步驟中實現近乎最佳的困惑度，這表明可以在不犧牲性能的情況下實現效率。然而，當使用序列錯誤率（對於理解序列的「正確性」很重要，例如推理鏈）時，我們表明所需的採樣步驟必須隨著序列長度線性縮放才能獲得「正確」的序列，從而消除了 MDM 相對於自迴歸模型的效率優勢。我們的分析為理解 MDM 的優點和局限性建立了第一個理論基礎。所有理論發現都得到了實證研究的支持。
 
-##### **Interactive Data Harmonization with LLM Agents**
-2502.07132v1 by Aécio Santos, Eduardo H. M. Pena, Roque Lopez, Juliana Freire
+##### **MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency**
+2502.09621v1 by Dongzhi Jiang, Renrui Zhang, Ziyu Guo, Yanwei Li, Yu Qi, Xinyan Chen, Liuhui Wang, Jianhan Jin, Claire Guo, Shen Yan, Bo Zhang, Chaoyou Fu, Peng Gao, Hongsheng Li
 
-Data harmonization is an essential task that entails integrating datasets
-from diverse sources. Despite years of research in this area, it remains a
-time-consuming and challenging task due to schema mismatches, varying
-terminologies, and differences in data collection methodologies. This paper
-presents the case for agentic data harmonization as a means to both empower
-experts to harmonize their data and to streamline the process. We introduce
-Harmonia, a system that combines LLM-based reasoning, an interactive user
-interface, and a library of data harmonization primitives to automate the
-synthesis of data harmonization pipelines. We demonstrate Harmonia in a
-clinical data harmonization scenario, where it helps to interactively create
-reusable pipelines that map datasets to a standard format. Finally, we discuss
-challenges and open problems, and suggest research directions for advancing our
-vision.
+Answering questions with Chain-of-Thought (CoT) has significantly enhanced
+the reasoning capabilities of Large Language Models (LLMs), yet its impact on
+Large Multimodal Models (LMMs) still lacks a systematic assessment and in-depth
+investigation. In this paper, we introduce MME-CoT, a specialized benchmark
+evaluating the CoT reasoning performance of LMMs, spanning six domains: math,
+science, OCR, logic, space-time, and general scenes. As the first comprehensive
+study in this area, we propose a thorough evaluation suite incorporating three
+novel metrics that assess the reasoning quality, robustness, and efficiency at
+a fine-grained level. Leveraging curated high-quality data and a unique
+evaluation strategy, we conduct an in-depth analysis of state-of-the-art LMMs,
+uncovering several key insights: 1) Models with reflection mechanism
+demonstrate a superior CoT quality, with Kimi k1.5 outperforming GPT-4o and
+demonstrating the highest quality results; 2) CoT prompting often degrades LMM
+performance on perception-heavy tasks, suggesting a potentially harmful
+overthinking behavior; and 3) Although the CoT quality is high, LMMs with
+reflection exhibit significant inefficiency in both normal response and
+self-correction phases. We hope MME-CoT serves as a foundation for advancing
+multimodal reasoning in LMMs. Project Page: https://mmecot.github.io/
 
-摘要：資料調和是一項整合不同來源資料集的重要任務。儘管多年來針對此領域的研究不斷，但由於架構不匹配、術語不同，以及資料收集方法的差異，它仍然是一項耗時且具有挑戰性的任務。本文提出代理資料調和，作為賦能專家調和其資料並簡化流程的方法。我們介紹 Harmonia，一個結合了基於 LLM 的推理、互動式使用者介面和資料調和原語庫的系統，以自動化資料調和管線的合成。我們在臨床資料調和場景中展示了 Harmonia，它有助於互動式建立可重複使用的管線，將資料集對應至標準格式。最後，我們討論挑戰和開放性問題，並建議研究方向以推進我們的願景。
+摘要：<paragraph>透過思維鏈（CoT）回答問題，大幅提升了大型語言模型（LLM）的推理能力，但其對大型多模態模型（LMM）的影響仍缺乏系統性的評估和深入探討。在本文中，我們引入了 MME-CoT，一個專門的基準測試，用於評估 LMM 的 CoT 推理效能，涵蓋六個領域：數學、科學、OCR、邏輯、時空和一般場景。作為該領域的第一個全面性研究，我們提出了一個全面的評估套件，包含三個創新的指標，用於評估推理品質、穩健性和效率，並達到細微的層級。透過利用策展的高品質資料和獨特的評估策略，我們對最先進的 LMM 進行深入分析，發現了幾個關鍵見解：1）具有反思機制的模型展現出優異的 CoT 品質，其中 Kimi k1.5 優於 GPT-4o，並展現出最高品質的結果；2）CoT 提示通常會降低 LMM 在感知密集任務上的效能，這表示潛在有害的過度思考行為；3）儘管 CoT 品質很高，但具有反思能力的 LMM 在一般回應和自我修正階段都展現出顯著的低效率。我們希望 MME-CoT 能作為促進 LMM 中多模態推理的基礎。專案頁面：https://mmecot.github.io/</paragraph>
 
-##### **Machine Learning for Everyone: Simplifying Healthcare Analytics with BigQuery ML**
-2502.07026v1 by Mohammad Amir Salari, Bahareh Rahmani
+##### **Exploring the Potential of Encoder-free Architectures in 3D LMMs**
+2502.09620v1 by Yiwen Tang, Zoey Guo, Zhuhao Wang, Ray Zhang, Qizhi Chen, Junli Liu, Delin Qu, Zhigang Wang, Dong Wang, Xuelong Li, Bin Zhao
 
-Machine learning (ML) is transforming healthcare by enabling predictive
-analytics, personalized treatments, and improved patient outcomes. However,
-traditional ML workflows require specialized skills, infrastructure, and
-resources, limiting accessibility for many healthcare professionals. This paper
-explores how Google Cloud's BigQuery ML simplifies the development and
-deployment of ML models using SQL, reducing technical barriers. Through a case
-study on diabetes prediction using the Diabetes Health Indicators Dataset, we
-evaluate three predictive models: Logistic Regression, Boosted Tree, and Deep
-Neural Network (DNN). Our results demonstrate that the Boosted Tree model
-achieves the highest performance, making it highly effective for diabetes
-prediction. This study highlights BigQuery ML's role in democratizing machine
-learning by providing a scalable, efficient, and accessible solution for
-healthcare analytics.
+Encoder-free architectures have been preliminarily explored in the 2D visual
+domain, yet it remains an open question whether they can be effectively applied
+to 3D understanding scenarios. In this paper, we present the first
+comprehensive investigation into the potential of encoder-free architectures to
+overcome the challenges of encoder-based 3D Large Multimodal Models (LMMs).
+These challenges include the failure to adapt to varying point cloud
+resolutions and the point features from the encoder not meeting the semantic
+needs of Large Language Models (LLMs). We identify key aspects for 3D LMMs to
+remove the encoder and enable the LLM to assume the role of the 3D encoder: 1)
+We propose the LLM-embedded Semantic Encoding strategy in the pre-training
+stage, exploring the effects of various point cloud self-supervised losses. And
+we present the Hybrid Semantic Loss to extract high-level semantics. 2) We
+introduce the Hierarchical Geometry Aggregation strategy in the instruction
+tuning stage. This incorporates inductive bias into the LLM early layers to
+focus on the local details of the point clouds. To the end, we present the
+first Encoder-free 3D LMM, ENEL. Our 7B model rivals the current
+state-of-the-art model, ShapeLLM-13B, achieving 55.0%, 50.92%, and 42.7% on the
+classification, captioning, and VQA tasks, respectively. Our results
+demonstrate that the encoder-free architecture is highly promising for
+replacing encoder-based architectures in the field of 3D understanding. The
+code is released at https://github.com/Ivan-Tang-3D/ENEL
 
-摘要：機器學習 (ML) 透過啟用預測分析、個人化治療和改善病患結果，正在轉型醫療保健。然而，傳統的 ML 工作流程需要專業技能、基礎設施和資源，限制了許多醫療保健專業人員的可及性。本文探討 Google Cloud 的 BigQuery ML 如何使用 SQL 簡化 ML 模型的開發和部署，降低技術障礙。透過使用糖尿病健康指標資料集對糖尿病預測進行個案研究，我們評估了三個預測模型：邏輯迴歸、提升樹和深度神經網路 (DNN)。我們的結果證明，提升樹模型達到了最高的效能，使其對於糖尿病預測非常有效。這項研究強調了 BigQuery ML 在民主化機器學習中扮演的角色，提供可擴充、有效率且可存取的醫療保健分析解決方案。
+摘要：<paragraph>編碼器免費架構已在 2D 視覺領域中初步探索，但它們是否能有效應用於 3D 理解場景仍是一個開放的問題。在本文中，我們提出了對編碼器免費架構潛力的首次全面調查，以克服基於編碼器的 3D 大型多模態模型 (LMM) 的挑戰。這些挑戰包括無法適應不同的點雲解析度，且來自編碼器的點特徵無法滿足大型語言模型 (LLM) 的語義需求。我們識別出 3D LMM 的關鍵方面，以移除編碼器並讓 LLM 承擔 3D 編碼器的角色：1) 我們在預訓練階段提出 LLM 嵌入式語義編碼策略，探索各種點雲自我監督損失的影響。我們提出混合語義損失來提取高階語義。2) 我們在指令調整階段引入分層幾何聚合策略。這將歸納偏差納入 LLM 早期層，以專注於點雲的局部細節。最後，我們提出第一個無編碼器 3D LMM，ENEL。我們的 7B 模型與當前最先進的模型 ShapeLLM-13B 相媲美，分別在分類、字幕和 VQA 任務中達到 55.0%、50.92% 和 42.7%。我們的結果表明，無編碼器架構極有望取代基於編碼器的架構在 3D 理解領域的應用。程式碼發布於 https://github.com/Ivan-Tang-3D/ENEL</paragraph>
 
-##### **AIMS.au: A Dataset for the Analysis of Modern Slavery Countermeasures in Corporate Statements**
-2502.07022v1 by Adriana Eufrosiana Bora, Pierre-Luc St-Charles, Mirko Bronzi, Arsène Fansi Tchango, Bruno Rousseau, Kerrie Mengersen
+##### **DexTrack: Towards Generalizable Neural Tracking Control for Dexterous Manipulation from Human References**
+2502.09614v1 by Xueyi Liu, Jianibieke Adalibieke, Qianwei Han, Yuzhe Qin, Li Yi
 
-Despite over a decade of legislative efforts to address modern slavery in the
-supply chains of large corporations, the effectiveness of government oversight
-remains hampered by the challenge of scrutinizing thousands of statements
-annually. While Large Language Models (LLMs) can be considered a well
-established solution for the automatic analysis and summarization of documents,
-recognizing concrete modern slavery countermeasures taken by companies and
-differentiating those from vague claims remains a challenging task. To help
-evaluate and fine-tune LLMs for the assessment of corporate statements, we
-introduce a dataset composed of 5,731 modern slavery statements taken from the
-Australian Modern Slavery Register and annotated at the sentence level. This
-paper details the construction steps for the dataset that include the careful
-design of annotation specifications, the selection and preprocessing of
-statements, and the creation of high-quality annotation subsets for effective
-model evaluations. To demonstrate our dataset's utility, we propose a machine
-learning methodology for the detection of sentences relevant to mandatory
-reporting requirements set by the Australian Modern Slavery Act. We then follow
-this methodology to benchmark modern language models under zero-shot and
-supervised learning settings.
+We address the challenge of developing a generalizable neural tracking
+controller for dexterous manipulation from human references. This controller
+aims to manage a dexterous robot hand to manipulate diverse objects for various
+purposes defined by kinematic human-object interactions. Developing such a
+controller is complicated by the intricate contact dynamics of dexterous
+manipulation and the need for adaptivity, generalizability, and robustness.
+Current reinforcement learning and trajectory optimization methods often fall
+short due to their dependence on task-specific rewards or precise system
+models. We introduce an approach that curates large-scale successful robot
+tracking demonstrations, comprising pairs of human references and robot
+actions, to train a neural controller. Utilizing a data flywheel, we
+iteratively enhance the controller's performance, as well as the number and
+quality of successful tracking demonstrations. We exploit available tracking
+demonstrations and carefully integrate reinforcement learning and imitation
+learning to boost the controller's performance in dynamic environments. At the
+same time, to obtain high-quality tracking demonstrations, we individually
+optimize per-trajectory tracking by leveraging the learned tracking controller
+in a homotopy optimization method. The homotopy optimization, mimicking
+chain-of-thought, aids in solving challenging trajectory tracking problems to
+increase demonstration diversity. We showcase our success by training a
+generalizable neural controller and evaluating it in both simulation and real
+world. Our method achieves over a 10% improvement in success rates compared to
+leading baselines. The project website with animated results is available at
+https://meowuu7.github.io/DexTrack/.
 
-摘要：儘管立法努力超過十年，旨在解決大型企業供應鏈中的現代奴隸制，但政府監督的有效性仍然受到每年審查數千份聲明的挑戰所阻礙。雖然大型語言模型（LLM）可以被認為是文件自動分析和摘要的完善解決方案，但要辨識公司採取的具體現代奴隸制對策，並將其與含糊的聲明區分開來，仍然是一項具有挑戰性的任務。為了幫助評估和微調 LLM 以評估企業聲明，我們引入了一個由 5,731 份現代奴隸制聲明組成的資料集，這些聲明取自澳洲現代奴隸制註冊處，並在句子層級進行註解。本文詳細說明了資料集的建構步驟，其中包括註解規格的仔細設計、聲明的選擇和預處理，以及用於有效模型評估的高品質註解子集的建立。為了展示我們的資料集的效用，我們提出了一種機器學習方法，用於檢測與澳洲現代奴隸制法規定的強制性報告要求相關的句子。然後，我們遵循這種方法，在零次學習和監督學習設定下對現代語言模型進行基準測試。
+摘要：<paragraph>我們解決了從人類參照中開發靈巧操作通用神經追蹤控制器的挑戰。此控制器旨在管理靈巧機器人手，以操作各種物體，以實現由運動學人機互動定義的各種目的。由於靈巧操作的複雜接觸動力學以及對適應性、通用性和魯棒性的需求，開發此類控制器很複雜。目前的強化學習和軌跡優化方法通常由於依賴於特定任務的獎勵或精確的系統模型而表現不佳。我們引入了一種方法，它策劃了大規模成功的機器人追蹤示範，包括人體參照和機器人動作對，以訓練神經控制器。利用數據飛輪，我們反覆增強控制器的性能，以及成功追蹤示範的數量和品質。我們利用可用的追蹤示範，並仔細整合強化學習和模仿學習，以提升控制器在動態環境中的性能。同時，為了獲得高品質的追蹤示範，我們透過在同倫優化方法中利用已學習的追蹤控制器，個別優化每個軌跡的追蹤。同倫優化模擬思考鏈，有助於解決具有挑戰性的軌跡追蹤問題，以增加示範的多樣性。我們展示了我們在訓練通用神經控制器並在模擬和真實世界中評估它的成功。與領先的基準相比，我們的模型在成功率方面提高了 10% 以上。包含動畫結果的專案網站可在 https://meowuu7.github.io/DexTrack/ 取得。</paragraph>
 
-##### **Recent Advances, Applications and Open Challenges in Machine Learning for Health: Reflections from Research Roundtables at ML4H 2024 Symposium**
-2502.06693v1 by Amin Adibi, Xu Cao, Zongliang Ji, Jivat Neet Kaur, Winston Chen, Elizabeth Healey, Brighton Nuwagira, Wenqian Ye, Geoffrey Woollard, Maxwell A Xu, Hejie Cui, Johnny Xi, Trenton Chang, Vasiliki Bikia, Nicole Zhang, Ayush Noori, Yuan Xia, Md. Belal Hossain, Hanna A. Frank, Alina Peluso, Yuan Pu, Shannon Zejiang Shen, John Wu, Adibvafa Fallahpour, Sazan Mahbub, Ross Duncan, Yuwei Zhang, Yurui Cao, Zuheng Xu, Michael Craig, Rahul G. Krishnan, Rahmatollah Beheshti, James M. Rehg, Mohammad Ehsanul Karim, Megan Coffee, Leo Anthony Celi, Jason Alan Fries, Mohsen Sadatsafavi, Dennis Shung, Shannon McWeeney, Jessica Dafflon, Sarah Jabbour
+##### **Score-of-Mixture Training: Training One-Step Generative Models Made Simple**
+2502.09609v1 by Tejas Jayashankar, J. Jon Ryu, Gregory Wornell
 
-The fourth Machine Learning for Health (ML4H) symposium was held in person on
-December 15th and 16th, 2024, in the traditional, ancestral, and unceded
-territories of the Musqueam, Squamish, and Tsleil-Waututh Nations in Vancouver,
-British Columbia, Canada. The symposium included research roundtable sessions
-to foster discussions between participants and senior researchers on timely and
-relevant topics for the ML4H community. The organization of the research
-roundtables at the conference involved 13 senior and 27 junior chairs across 13
-tables. Each roundtable session included an invited senior chair (with
-substantial experience in the field), junior chairs (responsible for
-facilitating the discussion), and attendees from diverse backgrounds with an
-interest in the session's topic.
+We propose Score-of-Mixture Training (SMT), a novel framework for training
+one-step generative models by minimizing a class of divergences called the
+$\alpha$-skew Jensen-Shannon divergence. At its core, SMT estimates the score
+of mixture distributions between real and fake samples across multiple noise
+levels. Similar to consistency models, our approach supports both training from
+scratch (SMT) and distillation using a pretrained diffusion model, which we
+call Score-of-Mixture Distillation (SMD). It is simple to implement, requires
+minimal hyperparameter tuning, and ensures stable training. Experiments on
+CIFAR-10 and ImageNet 64x64 show that SMT/SMD are competitive with and can even
+outperform existing methods.
 
-摘要：第四屆醫療機器學習 (ML4H) 研討會於 2024 年 12 月 15 日和 16 日在加拿大不列顛哥倫比亞省溫哥華的 Musqueam、Squamish 和 Tsleil-Waututh 國家的傳統、祖先和未割讓領土上舉行。研討會包括研究圓桌會議，以促進參與者和高級研究人員之間關於 ML4H 社群的及時和相關主題的討論。在會議上組織研究圓桌會議涉及 13 張桌子上的 13 位高級主席和 27 位初級主席。每個圓桌會議都包括一位受邀的高級主席（在該領域擁有豐富的經驗）、初級主席（負責促進討論）以及對會議主題感興趣的來自不同背景的與會者。
+摘要：我們提出混合評分訓練 (SMT)，一種透過最小化稱為 $\alpha$-偏斜 Jensen-Shannon 距離的距離類別來訓練單步生成模型的新穎架構。在核心部分，SMT 估計真實和虛假樣本之間在多個雜訊層級的混合分配評分。與一致性模型類似，我們的做法支援從頭開始訓練 (SMT) 和使用預先訓練的擴散模型進行蒸餾，我們稱之為混合評分蒸餾 (SMD)。它易於實作，只需要最小的超參數調整，並確保穩定的訓練。在 CIFAR-10 和 ImageNet 64x64 上的實驗顯示，SMT/SMD 具有競爭力，甚至可以優於現有方法。
 
-##### **Automatic Evaluation of Healthcare LLMs Beyond Question-Answering**
-2502.06666v1 by Anna Arias-Duart, Pablo Agustin Martin-Torres, Daniel Hinjos, Pablo Bernabeu-Perez, Lucia Urcelay Ganzabal, Marta Gonzalez Mallo, Ashwin Kumar Gururajan, Enrique Lopez-Cuena, Sergio Alvarez-Napagao, Dario Garcia-Gasulla
+##### **Human-LLM Coevolution: Evidence from Academic Writing**
+2502.09606v1 by Mingmeng Geng, Roberto Trotta
 
-Current Large Language Models (LLMs) benchmarks are often based on open-ended
-or close-ended QA evaluations, avoiding the requirement of human labor.
-Close-ended measurements evaluate the factuality of responses but lack
-expressiveness. Open-ended capture the model's capacity to produce discourse
-responses but are harder to assess for correctness. These two approaches are
-commonly used, either independently or together, though their relationship
-remains poorly understood. This work is focused on the healthcare domain, where
-both factuality and discourse matter greatly. It introduces a comprehensive,
-multi-axis suite for healthcare LLM evaluation, exploring correlations between
-open and close benchmarks and metrics. Findings include blind spots and
-overlaps in current methodologies. As an updated sanity check, we release a new
-medical benchmark--CareQA--, with both open and closed variants. Finally, we
-propose a novel metric for open-ended evaluations --Relaxed Perplexity-- to
-mitigate the identified limitations.
+With a statistical analysis of arXiv paper abstracts, we report a marked drop
+in the frequency of several words previously identified as overused by ChatGPT,
+such as "delve", starting soon after they were pointed out in early 2024. The
+frequency of certain other words favored by ChatGPT, such as "significant", has
+instead kept increasing. These phenomena suggest that some authors of academic
+papers have adapted their use of large language models (LLMs), for example, by
+selecting outputs or applying modifications to the LLM-generated content. Such
+coevolution and cooperation of humans and LLMs thus introduce additional
+challenges to the detection of machine-generated text in real-world scenarios.
+Estimating the impact of LLMs on academic writing by examining word frequency
+remains feasible, and more attention should be paid to words that were already
+frequently employed, including those that have decreased in frequency.
 
-摘要：當前大型語言模型 (LLM) 基準通常基於開放式或封閉式問答評量，避免了人力需求。封閉式測量評估回應的事實性，但缺乏表達力。開放式測量捕捉模型產生論述回應的能力，但較難評估正確性。這兩種方法通常獨立或合併使用，儘管它們之間的關係仍然知之甚少。這項工作專注於醫療保健領域，在該領域中，事實性和論述都非常重要。它引入了一個全面的多軸套件，用於醫療保健 LLM 評量，探索開放式和封閉式基準和指標之間的關聯性。研究結果包括當前方法中的盲點和重疊。作為更新的健全性檢查，我們發布了一個新的醫療基準--CareQA--，包含開放式和封閉式變體。最後，我們提出了一個用於開放式評量的全新指標--放鬆困惑度--以減輕已識別的限制。
+摘要：透過對 arXiv 論文摘要進行統計分析，我們報告了幾個先前被認為 ChatGPT 過度使用的詞彙的頻率大幅下降，例如「深入探討」，從 2024 年初被指出後不久就開始下降。相反地，ChatGPT 偏好的某些其他詞彙，例如「顯著」，頻率持續增加。這些現象表明，一些學術論文作者已經調整了他們使用大型語言模型 (LLM) 的方式，例如，透過選擇輸出或對 LLM 生成的內容進行修改。因此，人類和 LLM 的這種共同演化和合作為在現實世界場景中偵測機器產生的文字帶來了額外的挑戰。透過檢視詞彙頻率來評估 LLM 對學術寫作的影響仍然可行，並且應該對已經頻繁使用的詞彙給予更多關注，包括那些頻率下降的詞彙。
 
-##### **Few-Shot Classification and Anatomical Localization of Tissues in SPECT Imaging**
-2502.06632v1 by Mohammed Abdul Hafeez Khan, Samuel Morries Boddepalli, Siddhartha Bhattacharyya, Debasis Mitra
+##### **SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models**
+2502.09604v1 by Yung-Sung Chuang, Benjamin Cohen-Wang, Shannon Zejiang Shen, Zhaofeng Wu, Hu Xu, Xi Victoria Lin, James Glass, Shang-Wen Li, Wen-tau Yih
 
-Accurate classification and anatomical localization are essential for
-effective medical diagnostics and research, which may be efficiently performed
-using deep learning techniques. However, availability of limited labeled data
-poses a significant challenge. To address this, we adapted Prototypical
-Networks and the Propagation-Reconstruction Network (PRNet) for few-shot
-classification and localization, respectively, in Single Photon Emission
-Computed Tomography (SPECT) images. For the proof of concept we used a
-2D-sliced image cropped around heart. The Prototypical Network, with a
-pre-trained ResNet-18 backbone, classified ventricles, myocardium, and liver
-tissues with 96.67% training and 93.33% validation accuracy. PRNet, adapted for
-2D imaging with an encoder-decoder architecture and skip connections, achieved
-a training loss of 1.395, accurately reconstructing patches and capturing
-spatial relationships. These results highlight the potential of Prototypical
-Networks for tissue classification with limited labeled data and PRNet for
-anatomical landmark localization, paving the way for improved performance in
-deep learning frameworks.
+We introduce SelfCite, a novel self-supervised approach that aligns LLMs to
+generate high-quality, fine-grained, sentence-level citations for the
+statements in their generated responses. Instead of only relying on costly and
+labor-intensive annotations, SelfCite leverages a reward signal provided by the
+LLM itself through context ablation: If a citation is necessary, removing the
+cited text from the context should prevent the same response; if sufficient,
+retaining the cited text alone should preserve the same response. This reward
+can guide the inference-time best-of-N sampling strategy to improve citation
+quality significantly, as well as be used in preference optimization to
+directly fine-tune the models for generating better citations. The
+effectiveness of SelfCite is demonstrated by increasing citation F1 up to 5.3
+points on the LongBench-Cite benchmark across five long-form question answering
+tasks.
 
-摘要：精確的分類和解剖定位對於有效的醫療診斷和研究至關重要，而這可以使用深度學習技術有效執行。然而，標記資料有限的取得會造成重大的挑戰。為了解決這個問題，我們分別調整了原型網路和傳播重建網路 (PRNet)，用於單光子發射電腦斷層掃描 (SPECT) 影像中的少量分類和定位。為了證明這個概念，我們使用圍繞心臟裁切的 2D 切片影像。原型網路，使用預先訓練的 ResNet-18 主幹，對心室、心肌和肝臟組織進行分類，訓練準確度為 96.67%，驗證準確度為 93.33%。PRNet，調整為使用編碼器解碼器架構和跳躍連接的 2D 影像，達到了 1.395 的訓練損失，精確地重建了區塊並擷取了空間關係。這些結果突出了原型網路在標記資料有限的情況下進行組織分類的潛力，以及 PRNet 在解剖標誌定位方面的潛力，為深度學習架構中效能的提升鋪平了道路。
+摘要：我們介紹 SelfCite，一種新穎的自監督方法，它將 LLM 對齊以針對其生成回應中的陳述生成高品質、細粒度、句子級別的引用。SelfCite 不僅依賴於昂貴且勞動密集的註解，還利用 LLM 本身通過上下文消融提供的獎勵信號：如果需要引用，從上下文中移除被引用的文字應當會阻止相同的回應；如果足夠，僅保留被引用的文字應當會保留相同的回應。此獎勵可以引導推理時間最佳 N 個取樣策略以顯著改善引文品質，並用於偏好最佳化以直接微調模型以生成更好的引文。SelfCite 的有效性通過在五個長篇問答任務中將 LongBench-Cite 基準上的引文 F1 提高多達 5.3 點來證明。
 
-##### **Illegal Waste Detection in Remote Sensing Images: A Case Study**
-2502.06607v2 by Federico Gibellini, Piero Fraternali, Giacomo Boracchi, Luca Morandini, Andrea Diecidue, Simona Malegori
+##### **CoT-Valve: Length-Compressible Chain-of-Thought Tuning**
+2502.09601v1 by Xinyin Ma, Guangnian Wan, Runpeng Yu, Gongfan Fang, Xinchao Wang
 
-Environmental crime currently represents the third largest criminal activity
-worldwide while threatening ecosystems as well as human health. Among the
-crimes related to this activity, improper waste management can nowadays be
-countered more easily thanks to the increasing availability and decreasing cost
-of Very-High-Resolution Remote Sensing images, which enable semi-automatic
-territory scanning in search of illegal landfills. This paper proposes a
-pipeline, developed in collaboration with professionals from a local
-environmental agency, for detecting candidate illegal dumping sites leveraging
-a classifier of Remote Sensing images. To identify the best configuration for
-such classifier, an extensive set of experiments was conducted and the impact
-of diverse image characteristics and training settings was thoroughly analyzed.
-The local environmental agency was then involved in an experimental exercise
-where outputs from the developed classifier were integrated in the experts'
-everyday work, resulting in time savings with respect to manual
-photo-interpretation. The classifier was eventually run with valuable results
-on a location outside of the training area, highlighting potential for
-cross-border applicability of the proposed pipeline.
+Chain-of-Thought significantly enhances a model's reasoning capability, but
+it also comes with a considerable increase in inference costs due to long
+chains. With the observation that the reasoning path can be easily compressed
+under easy tasks but struggle on hard tasks, we explore the feasibility of
+elastically controlling the length of reasoning paths with only one model,
+thereby reducing the inference overhead of reasoning models dynamically based
+on task difficulty. We introduce a new tuning and inference strategy named
+CoT-Valve, designed to allow models to generate reasoning chains of varying
+lengths. To achieve this, we propose to identify a direction in the parameter
+space that, when manipulated, can effectively control the length of generated
+CoT. Moreover, we show that this property is valuable for compressing the
+reasoning chain. We construct datasets with chains from long to short for the
+same questions and explore two enhanced strategies for CoT-Valve: (1) a precise
+length-compressible CoT tuning method, and (2) a progressive chain length
+compression approach. Our experiments show that CoT-Valve successfully enables
+controllability and compressibility of the chain and shows better performance
+than the prompt-based control. We applied this method to QwQ-32B-Preview,
+reducing reasoning chains on GSM8K from 741 to 225 tokens with a minor
+performance drop (95.07% to 94.92%) and on AIME from 6827 to 4629 tokens, with
+only one additional incorrect answer.
 
-摘要：環境犯罪目前是全球第三大犯罪活動，威脅生態系統和人類健康。在與此活動相關的犯罪中，不當廢物管理現在可以更容易地得到解決，這要歸功於超高解析度遙測影像越來越普及且成本下降，這使得半自動領土掃描能夠搜尋非法垃圾掩埋場。本文提出了一條管道，與當地環境機構的專業人士合作開發，用於檢測候選非法傾倒地點，利用遙測影像分類器。為了找出這種分類器的最佳配置，進行了一系列廣泛的實驗，並徹底分析了不同影像特徵和訓練設定的影響。然後，當地環境機構參與了一項實驗練習，其中將已開發分類器的輸出整合到專家的日常工作中，從而節省了人工照片解譯的時間。最後在訓練區域外的某個位置執行分類器，獲得了有價值的結果，突出了所提出管道的跨境適用性潛力。
+摘要：<paragraph>連續思考大幅提升了模型的推理能力，但由於鏈條過長，也大幅增加了推理成本。由於觀察到推理路徑在簡單的任務中可以輕易壓縮，但在困難的任務中卻很吃力，我們探索了僅使用一個模型彈性控制推理路徑長度的可行性，從而根據任務難度動態減少推理模型的推理開銷。我們引入了一種名為 CoT-Valve 的新調校和推理策略，旨在讓模型產生長度不一的推理鏈。為此，我們提議在參數空間中識別一個方向，在操作時可以有效控制生成的 CoT 的長度。此外，我們展示了此屬性對於壓縮推理鏈是有價值的。我們構造了從長到短的鏈條的資料集，用於相同的問題，並探索了 CoT-Valve 的兩種增強策略：(1) 精確的長度可壓縮 CoT 調校方法，以及 (2) 漸進式鏈長壓縮方法。我們的實驗表明，CoT-Valve 成功地實現了鏈條的可控性和可壓縮性，並顯示出比基於提示的控制更好的效能。我們將此方法應用於 QwQ-32B-Preview，將 GSM8K 上的推理鏈條從 741 個代幣減少到 225 個代幣，效能僅略微下降 (95.07% 至 94.92%)，而在 AIME 上從 6827 個代幣減少到 4629 個代幣，只多了一個錯誤答案。</paragraph>
 
-##### **FEMBA: Efficient and Scalable EEG Analysis with a Bidirectional Mamba Foundation Model**
-2502.06438v1 by Anna Tegon, Thorir Mar Ingolfsson, Xiaying Wang, Luca Benini, Yawei Li
+##### **Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs**
+2502.09597v1 by Siyan Zhao, Mingyi Hong, Yang Liu, Devamanyu Hazarika, Kaixiang Lin
 
-Accurate and efficient electroencephalography (EEG) analysis is essential for
-detecting seizures and artifacts in long-term monitoring, with applications
-spanning hospital diagnostics to wearable health devices. Robust EEG analytics
-have the potential to greatly improve patient care. However, traditional deep
-learning models, especially Transformer-based architectures, are hindered by
-their quadratic time and memory complexity, making them less suitable for
-resource-constrained environments. To address these challenges, we present
-FEMBA (Foundational EEG Mamba + Bidirectional Architecture), a novel
-self-supervised framework that establishes new efficiency benchmarks for EEG
-analysis through bidirectional state-space modeling. Unlike Transformer-based
-models, which incur quadratic time and memory complexity, FEMBA scales linearly
-with sequence length, enabling more scalable and efficient processing of
-extended EEG recordings. Trained on over 21,000 hours of unlabeled EEG and
-fine-tuned on three downstream tasks, FEMBA achieves competitive performance in
-comparison with transformer models, with significantly lower computational
-cost. Specifically, it reaches 81.82% balanced accuracy (0.8921 AUROC) on TUAB
-and 0.949 AUROC on TUAR, while a tiny 7.8M-parameter variant demonstrates
-viability for resource-constrained devices. These results pave the way for
-scalable, general-purpose EEG analytics in both clinical and highlight FEMBA as
-a promising candidate for wearable applications.
+Large Language Models (LLMs) are increasingly used as chatbots, yet their
+ability to personalize responses to user preferences remains limited. We
+introduce PrefEval, a benchmark for evaluating LLMs' ability to infer, memorize
+and adhere to user preferences in a long-context conversational setting.
+PrefEval comprises 3,000 manually curated user preference and query pairs
+spanning 20 topics. PrefEval contains user personalization or preference
+information in both explicit and implicit forms, and evaluates LLM performance
+using a generation and a classification task. With PrefEval, we evaluated the
+aforementioned preference following capabilities of 10 open-source and
+proprietary LLMs in multi-session conversations with varying context lengths up
+to 100k tokens. We benchmark with various prompting, iterative feedback, and
+retrieval-augmented generation methods. Our benchmarking effort reveals that
+state-of-the-art LLMs face significant challenges in proactively following
+users' preferences during conversations. In particular, in zero-shot settings,
+preference following accuracy falls below 10% at merely 10 turns (~3k tokens)
+across most evaluated models. Even with advanced prompting and retrieval
+methods, preference following still deteriorates in long-context conversations.
+Furthermore, we show that fine-tuning on PrefEval significantly improves
+performance. We believe PrefEval serves as a valuable resource for measuring,
+understanding, and enhancing LLMs' preference following abilities, paving the
+way for personalized conversational agents. Our code and dataset are available
+at https://prefeval.github.io/.
 
-摘要：準確且有效的腦電圖 (EEG) 分析對於偵測長時間監控中的癲癇發作和偽像至關重要，其應用範圍涵蓋醫院診斷到可穿戴式健康裝置。穩健的 EEG 分析具有大幅改善病患照護的潛力。然而，傳統深度學習模型，特別是基於 Transformer 的架構，受到其二次時間和記憶體複雜度的阻礙，使其不太適合資源受限的環境。為了應對這些挑戰，我們提出 FEMBA (基礎 EEG Mamba + 雙向架構)，一種創新的自我監督架構，透過雙向狀態空間建模為 EEG 分析建立新的效率基準。與會產生二次時間和記憶體複雜度的基於 Transformer 的模型不同，FEMBA 隨著序列長度線性縮放，支援更具可擴充性和效率的延伸 EEG 記錄處理。FEMBA 在超過 21,000 小時的未標記 EEG 上訓練並在三個下游任務上進行微調，與Transformer模型相比，在計算成本顯著降低的情況下，實現了具有競爭力的效能。具體來說，它在 TUAB 上達到 81.82% 的平衡準確度 (0.8921 AUROC) 和在 TUAR 上達到 0.949 AUROC，而一個微小的 7.8M 參數變體證明了其在資源受限裝置上的可行性。這些結果為臨床和可穿戴應用中可擴充的通用 EEG 分析鋪平了道路，並突顯 FEMBA 是可穿戴應用中一個有前景的候選者。
+摘要：大型語言模型（LLM）正日益被用作聊天機器人，但它們根據使用者偏好個人化回應的能力仍然有限。我們引入了 PrefEval，一個用於評估 LLM 在長時間對話環境中推論、記憶和遵守使用者偏好的能力的基準。PrefEval 包含 3,000 個手動策劃的使用者偏好和查詢對，涵蓋 20 個主題。PrefEval 包含以明確和隱含形式表達的使用者個人化或偏好資訊，並使用生成和分類任務評估 LLM 效能。透過 PrefEval，我們評估了 10 個開源和專有 LLM 在多重對話中上述的偏好追蹤能力，對話內容長度最高達 100k 個符號。我們使用各種提示、迭代回饋和檢索增強生成方法進行基準測試。我們的基準測試工作顯示，最先進的 LLM 在對話中主動追蹤使用者偏好時面臨重大挑戰。特別是在零次學習設定中，在多數評估模型中，在僅 10 個回合（約 3k 個符號）時，偏好追蹤準確度低於 10%。即使使用進階提示和檢索方法，在長時間對話中偏好追蹤仍然會惡化。此外，我們展示了在 PrefEval 上進行微調會大幅改善效能。我們相信 PrefEval 可作為衡量、理解和提升 LLM 偏好追蹤能力的寶貴資源，為個人化對話代理鋪路。我們的程式碼和資料集可在 https://prefeval.github.io/ 取得。
 
-##### **Is an Ultra Large Natural Image-Based Foundation Model Superior to a Retina-Specific Model for Detecting Ocular and Systemic Diseases?**
-2502.06289v1 by Qingshan Hou, Yukun Zhou, Jocelyn Hui Lin Goh, Ke Zou, Samantha Min Er Yew, Sahana Srinivasan, Meng Wang, Thaddaeus Lo, Xiaofeng Lei, Siegfried K. Wagner, Mark A. Chia, Dawei Yang, Hongyang Jiang, AnRan Ran, Rui Santos, Gabor Mark Somfai, Juan Helen Zhou, Haoyu Chen, Qingyu Chen, Carol Yim-Lui Cheung, Pearse A. Keane, Yih Chung Tham
+##### **KIMAs: A Configurable Knowledge Integrated Multi-Agent System**
+2502.09596v1 by Zitao Li, Fei Wei, Yuexiang Xie, Dawei Gao, Weirui Kuang, Zhijian Ma, Bingchen Qian, Yaliang Li, Bolin Ding
 
-The advent of foundation models (FMs) is transforming medical domain. In
-ophthalmology, RETFound, a retina-specific FM pre-trained sequentially on 1.4
-million natural images and 1.6 million retinal images, has demonstrated high
-adaptability across clinical applications. Conversely, DINOv2, a
-general-purpose vision FM pre-trained on 142 million natural images, has shown
-promise in non-medical domains. However, its applicability to clinical tasks
-remains underexplored. To address this, we conducted head-to-head evaluations
-by fine-tuning RETFound and three DINOv2 models (large, base, small) for ocular
-disease detection and systemic disease prediction tasks, across eight
-standardized open-source ocular datasets, as well as the Moorfields AlzEye and
-the UK Biobank datasets. DINOv2-large model outperformed RETFound in detecting
-diabetic retinopathy (AUROC=0.850-0.952 vs 0.823-0.944, across three datasets,
-all P<=0.007) and multi-class eye diseases (AUROC=0.892 vs. 0.846, P<0.001). In
-glaucoma, DINOv2-base model outperformed RETFound (AUROC=0.958 vs 0.940,
-P<0.001). Conversely, RETFound achieved superior performance over all DINOv2
-models in predicting heart failure, myocardial infarction, and ischaemic stroke
-(AUROC=0.732-0.796 vs 0.663-0.771, all P<0.001). These trends persisted even
-with 10% of the fine-tuning data. These findings showcase the distinct
-scenarios where general-purpose and domain-specific FMs excel, highlighting the
-importance of aligning FM selection with task-specific requirements to optimise
-clinical performance.
+Knowledge-intensive conversations supported by large language models (LLMs)
+have become one of the most popular and helpful applications that can assist
+people in different aspects. Many current knowledge-intensive applications are
+centered on retrieval-augmented generation (RAG) techniques. While many
+open-source RAG frameworks facilitate the development of RAG-based
+applications, they often fall short in handling practical scenarios complicated
+by heterogeneous data in topics and formats, conversational context management,
+and the requirement of low-latency response times. This technical report
+presents a configurable knowledge integrated multi-agent system, KIMAs, to
+address these challenges. KIMAs features a flexible and configurable system for
+integrating diverse knowledge sources with 1) context management and query
+rewrite mechanisms to improve retrieval accuracy and multi-turn conversational
+coherency, 2) efficient knowledge routing and retrieval, 3) simple but
+effective filter and reference generation mechanisms, and 4) optimized
+parallelizable multi-agent pipeline execution. Our work provides a scalable
+framework for advancing the deployment of LLMs in real-world settings. To show
+how KIMAs can help developers build knowledge-intensive applications with
+different scales and emphases, we demonstrate how we configure the system to
+three applications already running in practice with reliable performance.
 
-摘要：基礎模型 (FM) 的出現正在轉變醫療領域。在眼科，RETFound 是一個視網膜專用 FM，依序使用 140 萬張自然影像和 160 萬張視網膜影像進行預訓練，已展現出高度適應性，可應用於各種臨床應用。相反地，DINOv2 是一個通用視覺 FM，使用 1.42 億張自然影像進行預訓練，已展現出在非醫療領域的潛力。然而，其在臨床任務中的適用性仍未被充分探索。為了解決這個問題，我們針對眼部疾病偵測和全身性疾病預測任務，對 RETFound 和三個 DINOv2 模型（大型、基礎、小型）進行微調，並進行一對一的評估，使用八個標準化的開源眼科資料集，以及 Moorfields AlzEye 和 UK Biobank 資料集。DINOv2 大型模型在糖尿病視網膜病變偵測方面優於 RETFound（三個資料集的 AUROC=0.850-0.952，相較於 0.823-0.944，所有 P<=0.007）和多類眼部疾病（AUROC=0.892，相較於 0.846，P<0.001）。在青光眼方面，DINOv2 基礎模型優於 RETFound（AUROC=0.958，相較於 0.940，P<0.001）。相反地，RETFound 在預測心臟衰竭、心肌梗塞和缺血性中風方面優於所有 DINOv2 模型（AUROC=0.732-0.796，相較於 0.663-0.771，所有 P<0.001）。即使使用 10% 的微調資料，這些趨勢仍然持續。這些發現展示了通用和領域專用 FM 各自擅長的場景，突顯了根據任務特定需求調整 FM 選擇，以最佳化臨床表現的重要性。
+摘要：由大型語言模型 (LLM) 支持的知識密集型對話
+已成為最受歡迎且有用的應用程式之一，可協助
+人們在不同面向獲得協助。許多當前的知識密集型應用程式
+都以檢索增強生成 (RAG) 技術為中心。雖然許多
+開放原始碼 RAG 架構促進了基於 RAG 的應用程式開發，但它們在處理
+主題和格式中異質資料、對話內容管理，以及低延遲回應時間的要求所造成的實際情況時，通常力有未逮。這份技術報告
+提出了可設定的知識整合多重代理系統，KIMAs，以
+解決這些挑戰。KIMAs 具備靈活且可設定的系統，可整合多樣化的知識來源，並具備 1) 內容管理和查詢
+改寫機制，以提升檢索準確度和多輪對話的連貫性，2) 有效的知識路由和檢索，3) 簡單但
+有效的篩選和參考產生機制，以及 4) 最佳化的可平行化多重代理管線執行。我們的作品提供了可擴充的
+架構，以推動在實際環境中部署 LLM。為了展示 KIMAs 如何協助開發人員建置不同規模和重點的知識密集型應用程式，我們示範如何設定系統至
+三個已實際執行且效能良好的應用程式。
 
-##### **Integrating Sequence and Image Modeling in Irregular Medical Time Series Through Self-Supervised Learning**
-2502.06134v1 by Liuqing Chen, Shuhong Xiao, Shixian Ding, Shanhai Hu, Lingyun Sun
+##### **Logical forms complement probability in understanding language model (and human) performance**
+2502.09589v1 by Yixuan Wang, Freda Shi
 
-Medical time series are often irregular and face significant missingness,
-posing challenges for data analysis and clinical decision-making. Existing
-methods typically adopt a single modeling perspective, either treating series
-data as sequences or transforming them into image representations for further
-classification. In this paper, we propose a joint learning framework that
-incorporates both sequence and image representations. We also design three
-self-supervised learning strategies to facilitate the fusion of sequence and
-image representations, capturing a more generalizable joint representation. The
-results indicate that our approach outperforms seven other state-of-the-art
-models in three representative real-world clinical datasets. We further
-validate our approach by simulating two major types of real-world missingness
-through leave-sensors-out and leave-samples-out techniques. The results
-demonstrate that our approach is more robust and significantly surpasses other
-baselines in terms of classification performance.
+With the increasing interest in using large language models (LLMs) for
+planning in natural language, understanding their behaviors becomes an
+important research question. This work conducts a systematic investigation of
+LLMs' ability to perform logical reasoning in natural language. We introduce a
+controlled dataset of hypothetical and disjunctive syllogisms in propositional
+and modal logic and use it as the testbed for understanding LLM performance.
+Our results lead to novel insights in predicting LLM behaviors: in addition to
+the probability of input (Gonen et al., 2023; McCoy et al., 2024), logical
+forms should be considered as orthogonal factors. In addition, we show
+similarities and differences between the logical reasoning performances of
+humans and LLMs by comparing LLM and human behavioral results.
 
-摘要：醫療時間序列通常不規則且會面臨顯著的缺失，對資料分析和臨床決策制定構成挑戰。現有方法通常採用單一建模觀點，將序列資料視為序列或將其轉換為影像表示以進行進一步分類。在本文中，我們提出了一個聯合學習架構，結合序列和影像表示。我們還設計了三種自我監督學習策略，以促進序列和影像表示的融合，捕捉更具概括性的聯合表示。結果表明，我們的做法在三個具有代表性的真實世界臨床資料集中優於其他七個最先進的模型。我們進一步通過留出感測器和留出樣本的技術模擬兩種主要的真實世界缺失類型來驗證我們的做法。結果表明，我們的做法更強大，並且在分類效能方面顯著優於其他基準。
+摘要：隨著在自然語言規劃中使用大型語言模型（LLM）的興趣日益濃厚，理解其行為已成為一項重要的研究課題。本研究對 LLM 在自然語言中執行邏輯推理的能力進行了系統性調查。我們引入了一個由假設和析取三段論組成的受控資料集，並使用它作為理解 LLM 效能的測試平台。我們的結果產生了預測 LLM 行為的新見解：除了輸入的機率（Gonen 等人，2023 年；McCoy 等人，2024 年）之外，邏輯形式應被視為正交因子。此外，我們透過比較 LLM 和人類行為結果，展示了人類和 LLM 在邏輯推理表現上的相似性和差異性。
 
-##### **Foundation Model of Electronic Medical Records for Adaptive Risk Estimation**
-2502.06124v1 by Pawel Renc, Michal K. Grzeszczyk, Nassim Oufattole, Deirdre Goode, Yugang Jia, Szymon Bieganski, Matthew B. A. McDermott, Jaroslaw Was, Anthony E. Samir, Jonathan W. Cunningham, David W. Bates, Arkadiusz Sitek
+##### **Optimizing GPT for Video Understanding: Zero-Shot Performance and Prompt Engineering**
+2502.09573v1 by Mark Beliaev, Victor Yang, Madhura Raju, Jiachen Sun, Xinghai Hu
 
-We developed the Enhanced Transformer for Health Outcome Simulation (ETHOS),
-an AI model that tokenizes patient health timelines (PHTs) from EHRs. ETHOS
-predicts future PHTs using transformer-based architectures. The Adaptive Risk
-Estimation System (ARES) employs ETHOS to compute dynamic and personalized risk
-probabilities for clinician-defined critical events. ARES incorporates a
-personalized explainability module that identifies key clinical factors
-influencing risk estimates for individual patients. ARES was evaluated on the
-MIMIC-IV v2.2 dataset in emergency department (ED) settings, benchmarking its
-performance against traditional early warning systems and machine learning
-models. We processed 299,721 unique patients from MIMIC-IV into 285,622 PHTs,
-with 60% including hospital admissions. The dataset contained over 357 million
-tokens. ETHOS outperformed benchmark models in predicting hospital admissions,
-ICU admissions, and prolonged hospital stays, achieving superior AUC scores.
-ETHOS-based risk estimates demonstrated robustness across demographic subgroups
-with strong model reliability, confirmed via calibration curves. The
-personalized explainability module provides insights into patient-specific
-factors contributing to risk. ARES, powered by ETHOS, advances predictive
-healthcare AI by providing dynamic, real-time, and personalized risk estimation
-with patient-specific explainability to enhance clinician trust. Its
-adaptability and superior accuracy position it as a transformative tool for
-clinical decision-making, potentially improving patient outcomes and resource
-allocation in emergency and inpatient settings. We release the full code at
-github.com/ipolharvard/ethos-ares to facilitate future research.
+In this study, we tackle industry challenges in video content classification
+by exploring and optimizing GPT-based models for zero-shot classification
+across seven critical categories of video quality. We contribute a novel
+approach to improving GPT's performance through prompt optimization and policy
+refinement, demonstrating that simplifying complex policies significantly
+reduces false negatives. Additionally, we introduce a new
+decomposition-aggregation-based prompt engineering technique, which outperforms
+traditional single-prompt methods. These experiments, conducted on real
+industry problems, show that thoughtful prompt design can substantially enhance
+GPT's performance without additional finetuning, offering an effective and
+scalable solution for improving video classification systems across various
+domains in industry.
 
-摘要：我們開發了增強型健康結果模擬轉換器 (ETHOS)，
-一種從電子健康紀錄 (EHR) 中將患者健康時間軸 (PHT) 標記化的 AI 模型。ETHOS
-使用基於轉換器的架構預測未來的 PHT。自適應風險評估系統 (ARES) 使用 ETHOS 計算由臨床醫生定義的危急事件的動態且個人化的風險機率。ARES 結合了個人化的可解釋性模組，可找出影響個別患者風險評估的主要臨床因素。ARES 在急診部門 (ED) 設定中針對 MIMIC-IV v2.2 資料集進行評估，並將其效能與傳統的預警系統和機器學習模型進行基準測試。我們將 299,721 位 MIMIC-IV 的獨特患者處理成 285,622 個 PHT，其中 60% 包含住院記錄。該資料集包含超過 3.57 億個標記。ETHOS 在預測住院、加護病房 (ICU) 住院和延長住院時間方面表現優於基準模型，並獲得了較高的 AUC 分數。基於 ETHOS 的風險評估顯示出跨人口統計子群的穩健性，並通過校準曲線確認了強大的模型可靠性。個人化的可解釋性模組提供了對導致風險的患者特定因素的見解。由 ETHOS 驅動的 ARES 透過提供動態、即時且個人化的風險評估，以及患者特定的可解釋性來增強臨床醫生的信任，從而推動了預測性醫療保健 AI 的發展。其適應性和卓越的準確性使其成為臨床決策制定的一種變革性工具，有可能改善緊急和住院環境中的患者結果和資源分配。我們在 github.com/ipolharvard/ethos-ares 上釋出完整程式碼，以利未來的研究。
+摘要：在這項研究中，我們透過探索和最佳化基於 GPT 的模型，來處理影片內容分類中的產業挑戰，並針對影片品質的七個關鍵類別進行零次學習分類。我們貢獻了一種透過提示最佳化和政策改善來提升 GPT 效能的新方法，證明簡化複雜政策能大幅減少假陰性。此外，我們還引入了一種新的基於分解聚合的提示工程技術，其效能優於傳統的單一提示方法。這些在真實產業問題上執行的實驗顯示，經過深思熟慮的提示設計可以在不進行額外微調的情況下大幅提升 GPT 的效能，為提升產業中各種領域的影片分類系統提供了一個有效且可擴充的解決方案。
 
-##### **Can ChatGPT Diagnose Alzheimer's Disease?**
-2502.06907v1 by Quoc-Toan Nguyen, Linh Le, Xuan-The Tran, Thomas Do, Chin-Teng Lin
+##### **MorphNLI: A Stepwise Approach to Natural Language Inference Using Text Morphing**
+2502.09567v1 by Vlad Andrei Negru, Robert Vacareanu, Camelia Lemnaru, Mihai Surdeanu, Rodica Potolea
 
-Can ChatGPT diagnose Alzheimer's Disease (AD)? AD is a devastating
-neurodegenerative condition that affects approximately 1 in 9 individuals aged
-65 and older, profoundly impairing memory and cognitive function. This paper
-utilises 9300 electronic health records (EHRs) with data from Magnetic
-Resonance Imaging (MRI) and cognitive tests to address an intriguing question:
-As a general-purpose task solver, can ChatGPT accurately detect AD using EHRs?
-We present an in-depth evaluation of ChatGPT using a black-box approach with
-zero-shot and multi-shot methods. This study unlocks ChatGPT's capability to
-analyse MRI and cognitive test results, as well as its potential as a
-diagnostic tool for AD. By automating aspects of the diagnostic process, this
-research opens a transformative approach for the healthcare system,
-particularly in addressing disparities in resource-limited regions where AD
-specialists are scarce. Hence, it offers a foundation for a promising method
-for early detection, supporting individuals with timely interventions, which is
-paramount for Quality of Life (QoL).
+We introduce MorphNLI, a modular step-by-step approach to natural language
+inference (NLI). When classifying the premise-hypothesis pairs into
+{entailment, contradiction, neutral}, we use a language model to generate the
+necessary edits to incrementally transform (i.e., morph) the premise into the
+hypothesis. Then, using an off-the-shelf NLI model we track how the entailment
+progresses with these atomic changes, aggregating these intermediate labels
+into a final output. We demonstrate the advantages of our proposed method
+particularly in realistic cross-domain settings, where our method always
+outperforms strong baselines with improvements up to 12.6% (relative). Further,
+our proposed approach is explainable as the atomic edits can be used to
+understand the overall NLI label.
 
-摘要：ChatGPT 能否診斷出阿茲海默症 (AD)？AD 是一種毀滅性的神經退化性疾病，影響約 1/9 的 65 歲及以上人士，嚴重損害記憶力和認知功能。這篇論文利用了 9300 份電子健康紀錄 (EHR)，其中包含磁共振成像 (MRI) 和認知測試的數據，來解決一個有趣的問題：作為一個通用任務解決器，ChatGPT 能否使用 EHR 準確地檢測出 AD？我們使用黑盒方法對 ChatGPT 進行了深入評估，採用零次嘗試和多次嘗試的方法。這項研究揭示了 ChatGPT 分析 MRI 和認知測試結果的能力，以及其作為 AD 診斷工具的潛力。通過自動化診斷過程的各個方面，這項研究為醫療保健系統開啟了一種變革性的方法，特別是在解決資源有限的地區中 AD 專家稀缺的不平等問題方面。因此，它為一種有希望的早期檢測方法奠定了基礎，通過及時干預來支持個人，這對於生活品質 (QoL) 至關重要。
+摘要：我們引入 MorphNLI，一種模組化逐步方法，用於自然語言推論 (NLI)。當對前提假設對進行分類時，我們使用語言模型來產生必要的編輯，以逐步轉換（即，變形）前提成為假設。然後，使用現成的 NLI 模型，我們追蹤推論如何隨著這些原子變化而進展，將這些中間標籤彙總成最終輸出。我們展示了我們提出的方法的優點，特別是在現實的跨網域設置中，我們的模型始終優於強大的基線，改進幅度高達 12.6%（相對）。此外，我們提出的方法是可以解釋的，因為原子編輯可以用來理解整體 NLI 標籤。
 
-##### **Protecting Intellectual Property of EEG-based Neural Networks with Watermarking**
-2502.05931v1 by Ahmed Abdelaziz, Ahmed Fathi, Ahmed Fares
+##### **Zero-shot generation of synthetic neurosurgical data with large language models**
+2502.09566v1 by Austin A. Barr, Eddie Guo, Emre Sezgin
 
-EEG-based neural networks, pivotal in medical diagnosis and brain-computer
-interfaces, face significant intellectual property (IP) risks due to their
-reliance on sensitive neurophysiological data and resource-intensive
-development. Current watermarking methods, particularly those using abstract
-trigger sets, lack robust authentication and fail to address the unique
-challenges of EEG models. This paper introduces a cryptographic wonder
-filter-based watermarking framework tailored for EEG-based neural networks.
-Leveraging collision-resistant hashing and public-key encryption, the wonder
-filter embeds the watermark during training, ensuring minimal distortion ($\leq
-5\%$ drop in EEG task accuracy) and high reliability (100\% watermark
-detection). The framework is rigorously evaluated against adversarial attacks,
-including fine-tuning, transfer learning, and neuron pruning. Results
-demonstrate persistent watermark retention, with classification accuracy for
-watermarked states remaining above 90\% even after aggressive pruning, while
-primary task performance degrades faster, deterring removal attempts. Piracy
-resistance is validated by the inability to embed secondary watermarks without
-severe accuracy loss ( $>10\%$ in EEGNet and CCNN models). Cryptographic
-hashing ensures authentication, reducing brute-force attack success
-probabilities. Evaluated on the DEAP dataset across models (CCNN, EEGNet,
-TSception), the method achieves $>99.4\%$ null-embedding accuracy, effectively
-eliminating false positives. By integrating wonder filters with EEG-specific
-adaptations, this work bridges a critical gap in IP protection for
-neurophysiological models, offering a secure, tamper-proof solution for
-healthcare and biometric applications. The framework's robustness against
-adversarial modifications underscores its potential to safeguard sensitive EEG
-models while maintaining diagnostic utility.
+Clinical data is fundamental to advance neurosurgical research, but access is
+often constrained by data availability, small sample sizes, privacy
+regulations, and resource-intensive preprocessing and de-identification
+procedures. Synthetic data offers a potential solution to challenges associated
+with accessing and using real-world data (RWD). This study aims to evaluate the
+capability of zero-shot generation of synthetic neurosurgical data with a large
+language model (LLM), GPT-4o, by benchmarking with the conditional tabular
+generative adversarial network (CTGAN). Synthetic datasets were compared to
+real-world neurosurgical data to assess fidelity (means, proportions,
+distributions, and bivariate correlations), utility (ML classifier performance
+on RWD), and privacy (duplication of records from RWD). The GPT-4o-generated
+datasets matched or exceeded CTGAN performance, despite no fine-tuning or
+access to RWD for pre-training. Datasets demonstrated high univariate and
+bivariate fidelity to RWD without directly exposing any real patient records,
+even at amplified sample size. Training an ML classifier on GPT-4o-generated
+data and testing on RWD for a binary prediction task showed an F1 score (0.706)
+with comparable performance to training on the CTGAN data (0.705) for
+predicting postoperative functional status deterioration. GPT-4o demonstrated a
+promising ability to generate high-fidelity synthetic neurosurgical data. These
+findings also indicate that data synthesized with GPT-4o can effectively
+augment clinical data with small sample sizes, and train ML models for
+prediction of neurosurgical outcomes. Further investigation is necessary to
+improve the preservation of distributional characteristics and boost classifier
+performance.
 
-摘要：<paragraph>基於 EEG 的神經網路在醫學診斷和腦電腦介面中至關重要，由於其依賴敏感的神經生理資料和資源密集型的開發，面臨重大的智慧財產權 (IP) 風險。目前的浮水印方法，特別是那些使用抽象觸發集的方法，缺乏強健的驗證，且無法解決 EEG 模型的獨特挑戰。本文介紹了一個專為基於 EEG 的神經網路量身打造的密碼學 wonder 濾波器浮水印架構。利用抗碰撞雜湊和公開金鑰加密，wonder 濾波器在訓練期間嵌入浮水印，確保最小的失真（EEG 任務準確度下降 $\leq 5\%$）和高可靠性（100% 浮水印檢測）。該架構針對對抗性攻擊進行了嚴格的評估，包括微調、遷移學習和神經元剪枝。結果證明了持續的浮水印保留，即使在激進的剪枝後，浮水印狀態的分類準確度仍保持在 90% 以上，而主要任務的性能下降得更快，阻止了移除嘗試。盜版抵抗力通過無法嵌入次要浮水印而得到驗證，而不會造成嚴重的準確度損失（在 EEGNet 和 CCNN 模型中 $>10\%$）。密碼學雜湊確保驗證，降低了暴力攻擊成功機率。在 DEAP 資料集上針對模型（CCNN、EEGNet、TSception）進行評估，該方法達到了 $>99.4\%$ 的空嵌入準確度，有效地消除了假陽性。透過將 wonder 濾波器與 EEG 特定的適應相整合，這項工作彌補了神經生理模型 IP 保護中的關鍵差距，為醫療保健和生物特徵應用提供了一個安全、防篡改的解決方案。該架構對抗敵對修改的強健性突顯了其在維護診斷效用的同時保護敏感 EEG 模型的潛力。</paragraph>
+摘要：<paragraph>臨床數據是推進神經外科研究的基礎，但訪問通常受到數據可用性、樣本量小、隱私法規以及資源密集型預處理和去識別程序的限制。合成數據為與存取和使用真實世界數據 (RWD) 相關的挑戰提供了潛在解決方案。本研究旨在評估使用大型語言模型 (LLM) GPT-4o 零次生成合成神經外科數據的能力，並通過條件表格生成對抗網路 (CTGAN) 進行基準測試。將合成數據集與真實世界的神經外科數據進行比較，以評估保真度（平均值、比例、分布和二元相關性）、實用性（RWD 上的 ML 分類器性能）和隱私（RWD 中記錄的重複）。儘管沒有微調或訪問 RWD 進行預訓練，但 GPT-4o 生成的數據集與 CTGAN 性能相匹配或超過 CTGAN 性能。數據集證明了對 RWD 的高單變量和二變量保真度，即使在擴充的樣本量下也不會直接公開任何真實患者記錄。在 GPT-4o 生成的數據上訓練 ML 分類器，並在 RWD 上測試二元預測任務，顯示 F1 分數 (0.706) 與在 CTGAN 數據上訓練以預測術後功能狀態惡化時的性能相當 (0.705)。GPT-4o 展示了生成高保真合成神經外科數據的潛力。這些發現還表明，使用 GPT-4o 合成的數據可以有效地增加樣本量小的臨床數據，並訓練 ML 模型以預測神經外科結果。需要進一步研究以改善分佈特徵的保留並提升分類器性能。</paragraph>
 
-##### **Enhancing Depression Detection with Chain-of-Thought Prompting: From Emotion to Reasoning Using Large Language Models**
-2502.05879v1 by Shiyu Teng, Jiaqing Liu, Rahul Kumar Jain, Shurong Chai, Ruibo Hou, Tomoko Tateyama, Lanfen Lin, Yen-wei Chen
+##### **MDCrow: Automating Molecular Dynamics Workflows with Large Language Models**
+2502.09565v1 by Quintina Campbell, Sam Cox, Jorge Medina, Brittany Watterson, Andrew D. White
 
-Depression is one of the leading causes of disability worldwide, posing a
-severe burden on individuals, healthcare systems, and society at large. Recent
-advancements in Large Language Models (LLMs) have shown promise in addressing
-mental health challenges, including the detection of depression through
-text-based analysis. However, current LLM-based methods often struggle with
-nuanced symptom identification and lack a transparent, step-by-step reasoning
-process, making it difficult to accurately classify and explain mental health
-conditions. To address these challenges, we propose a Chain-of-Thought
-Prompting approach that enhances both the performance and interpretability of
-LLM-based depression detection. Our method breaks down the detection process
-into four stages: (1) sentiment analysis, (2) binary depression classification,
-(3) identification of underlying causes, and (4) assessment of severity. By
-guiding the model through these structured reasoning steps, we improve
-interpretability and reduce the risk of overlooking subtle clinical indicators.
-We validate our method on the E-DAIC dataset, where we test multiple
-state-of-the-art large language models. Experimental results indicate that our
-Chain-of-Thought Prompting technique yields superior performance in both
-classification accuracy and the granularity of diagnostic insights, compared to
-baseline approaches.
+Molecular dynamics (MD) simulations are essential for understanding
+biomolecular systems but remain challenging to automate. Recent advances in
+large language models (LLM) have demonstrated success in automating complex
+scientific tasks using LLM-based agents. In this paper, we introduce MDCrow, an
+agentic LLM assistant capable of automating MD workflows. MDCrow uses
+chain-of-thought over 40 expert-designed tools for handling and processing
+files, setting up simulations, analyzing the simulation outputs, and retrieving
+relevant information from literature and databases. We assess MDCrow's
+performance across 25 tasks of varying required subtasks and difficulty, and we
+evaluate the agent's robustness to both difficulty and prompt style.
+\texttt{gpt-4o} is able to complete complex tasks with low variance, followed
+closely by \texttt{llama3-405b}, a compelling open-source model. While prompt
+style does not influence the best models' performance, it has significant
+effects on smaller models.
 
-摘要：憂鬱症是全球殘障的主要原因之一，對個人、醫療保健系統和整個社會造成嚴重負擔。大型語言模型 (LLM) 的最新進展已展現出解決心理健康挑戰的希望，包括透過基於文字的分析來偵測憂鬱症。然而，現有的基於 LLM 的方法通常難以辨識細微的症狀，而且缺乏透明且逐步的推理過程，這使得準確分類和解釋心理健康狀況變得困難。為了應對這些挑戰，我們提出了一種思考鏈提示方法，它增強了基於 LLM 的憂鬱症偵測的效能和可解釋性。我們的這項方法將偵測過程分解為四個階段：(1) 情緒分析，(2) 二元憂鬱症分類，(3) 找出潛在原因，以及 (4) 評估嚴重程度。透過引導模型完成這些結構化的推理步驟，我們提升了可解釋性，並降低了忽略細微臨床指標的風險。我們在 E-DAIC 資料集上驗證了我們的這項方法，並在其中測試了多種最先進的大型語言模型。實驗結果顯示，與基線方法相比，我們的思考鏈提示技術在分類準確度和診斷見解的精細度方面都表現出優異的效能。
+摘要：分子動力學 (MD) 模擬對於理解生物分子系統至關重要，但自動化仍然具有挑戰性。大型語言模型 (LLM) 的最新進展已證明使用基於 LLM 的代理自動化複雜的科學任務是成功的。在本文中，我們介紹了 MDCrow，這是一個代理 LLM 助理，能夠自動化 MD 工作流程。MDCrow 使用 40 多種專家設計的工具的思考鏈來處理和處理檔案、設定模擬、分析模擬輸出，以及從文獻和資料庫中檢索相關資訊。我們評估了 MDCrow 在 25 項任務中的表現，這些任務所需的子任務和難度各不相同，並且我們評估了代理對難度和提示樣式的穩健性。\texttt{gpt-4o} 能夠以低變異完成複雜的任務，緊隨其後的是一個引人注目的開源模型 \texttt{llama3-405b}。雖然提示樣式不會影響最佳模型的效能，但它對較小的模型有顯著的影響。
 
-##### **LLMs for Drug-Drug Interaction Prediction: A Comprehensive Comparison**
-2502.06890v1 by Gabriele De Vito, Filomena Ferrucci, Athanasios Angelakis
+##### **EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents**
+2502.09560v1 by Rui Yang, Hanyang Chen, Junyu Zhang, Mark Zhao, Cheng Qian, Kangrui Wang, Qineng Wang, Teja Venkat Koripella, Marziyeh Movahedi, Manling Li, Heng Ji, Huan Zhang, Tong Zhang
 
-The increasing volume of drug combinations in modern therapeutic regimens
-needs reliable methods for predicting drug-drug interactions (DDIs). While
-Large Language Models (LLMs) have revolutionized various domains, their
-potential in pharmaceutical research, particularly in DDI prediction, remains
-largely unexplored. This study thoroughly investigates LLMs' capabilities in
-predicting DDIs by uniquely processing molecular structures (SMILES), target
-organisms, and gene interaction data as raw text input from the latest DrugBank
-dataset. We evaluated 18 different LLMs, including proprietary models (GPT-4,
-Claude, Gemini) and open-source variants (from 1.5B to 72B parameters), first
-assessing their zero-shot capabilities in DDI prediction. We then fine-tuned
-selected models (GPT-4, Phi-3.5 2.7B, Qwen-2.5 3B, Gemma-2 9B, and Deepseek R1
-distilled Qwen 1.5B) to optimize their performance. Our comprehensive
-evaluation framework included validation across 13 external DDI datasets,
-comparing against traditional approaches such as l2-regularized logistic
-regression. Fine-tuned LLMs demonstrated superior performance, with Phi-3.5
-2.7B achieving a sensitivity of 0.978 in DDI prediction, with an accuracy of
-0.919 on balanced datasets (50% positive, 50% negative cases). This result
-represents an improvement over both zero-shot predictions and state-of-the-art
-machine-learning methods used for DDI prediction. Our analysis reveals that
-LLMs can effectively capture complex molecular interaction patterns and cases
-where drug pairs target common genes, making them valuable tools for practical
-applications in pharmaceutical research and clinical settings.
+Leveraging Multi-modal Large Language Models (MLLMs) to create embodied
+agents offers a promising avenue for tackling real-world tasks. While
+language-centric embodied agents have garnered substantial attention,
+MLLM-based embodied agents remain underexplored due to the lack of
+comprehensive evaluation frameworks. To bridge this gap, we introduce
+EmbodiedBench, an extensive benchmark designed to evaluate vision-driven
+embodied agents. EmbodiedBench features: (1) a diverse set of 1,128 testing
+tasks across four environments, ranging from high-level semantic tasks (e.g.,
+household) to low-level tasks involving atomic actions (e.g., navigation and
+manipulation); and (2) six meticulously curated subsets evaluating essential
+agent capabilities like commonsense reasoning, complex instruction
+understanding, spatial awareness, visual perception, and long-term planning.
+Through extensive experiments, we evaluated 13 leading proprietary and
+open-source MLLMs within EmbodiedBench. Our findings reveal that: MLLMs excel
+at high-level tasks but struggle with low-level manipulation, with the best
+model, GPT-4o, scoring only 28.9% on average. EmbodiedBench provides a
+multifaceted standardized evaluation platform that not only highlights existing
+challenges but also offers valuable insights to advance MLLM-based embodied
+agents. Our code is available at https://embodiedbench.github.io.
 
-摘要：<paragraph>現代治療方案中藥物組合的數量越來越多，需要可靠的方法來預測藥物間交互作用 (DDI)。儘管大型語言模型 (LLM) 已在各個領域掀起革命，它們在藥物研究中的潛力，特別是在 DDI 預測中的潛力，仍未得到充分探索。本研究通過獨特地處理分子結構 (SMILES)、目標生物和基因交互資料作為來自最新 DrugBank 資料集的原始文字輸入，徹底調查了 LLM 在預測 DDI 中的能力。我們評估了 18 種不同的 LLM，包括專有模型（GPT-4、Claude、Gemini）和開源變體（從 1.5B 到 72B 參數），首先評估它們在 DDI 預測中的零次學習能力。然後，我們微調選定的模型（GPT-4、Phi-3.5 2.7B、Qwen-2.5 3B、Gemma-2 9B 和 Deepseek R1 蒸餾 Qwen 1.5B）以最佳化其效能。我們的全面評估框架包括跨 13 個外部 DDI 資料集進行驗證，並與傳統方法（例如 l2 正則化邏輯迴歸）進行比較。微調後的 LLM 表現出優異的效能，其中 Phi-3.5 2.7B 在 DDI 預測中達到 0.978 的靈敏度，在平衡資料集（50% 正例，50% 反例）上的準確度為 0.919。此結果優於零次學習預測和用於 DDI 預測的最新機器學習方法。我們的分析表明，LLM 可以有效捕捉複雜的分子交互模式和藥物對靶向共同基因的情況，使其成為藥物研究和臨床環境中實用應用的寶貴工具。</paragraph>
+摘要：<paragraph>利用多模態大型語言模型 (MLLM) 來建立具身代理，提供了解決現實世界任務的有前景途徑。儘管以語言為中心的具身代理已獲得大量關注，但由於缺乏全面的評估框架，基於 MLLM 的具身代理仍未得到充分探索。為了彌補這一差距，我們引入了 EmbodiedBench，這是一個廣泛的基準測試，旨在評估以視覺為導向的具身代理。EmbodiedBench 的特點：(1) 跨越四個環境的 1,128 項多樣化測試任務，範圍從高層級語義任務（例如，家庭）到涉及原子動作的低層級任務（例如，導航和操作）；以及 (2) 六個精心策劃的子集，用於評估基本的代理能力，例如常識推理、複雜指令理解、空間感知、視覺感知和長期規劃。通過廣泛的實驗，我們在 EmbodiedBench 中評估了 13 個領先的專有和開源 MLLM。我們的研究結果表明：MLLM 在高層級任務中表現出色，但在低層級操作中遇到困難，表現最好的模型 GPT-4o 平均得分僅為 28.9%。EmbodiedBench 提供了一個多方面的標準化評估平台，不僅突出了現有挑戰，還提供了有價值的見解來推進基於 MLLM 的具身代理。我們的程式碼可在 https://embodiedbench.github.io/ 取得。</paragraph>
 
-##### **Decoding Complexity: Intelligent Pattern Exploration with CHPDA (Context Aware Hybrid Pattern Detection Algorithm)**
-2502.07815v1 by Lokesh Koli, Shubham Kalra, Karanpreet Singh
+##### **Mind the Gap! Choice Independence in Using Multilingual LLMs for Persuasive Co-Writing Tasks in Different Languages**
+2502.09532v1 by Shreyan Biswas, Alexander Erlei, Ujwal Gadiraju
 
-Detecting sensitive data such as Personally Identifiable Information (PII)
-and Protected Health Information (PHI) is critical for data security platforms.
-This study evaluates regex-based pattern matching algorithms and exact-match
-search techniques to optimize detection speed, accuracy, and scalability. Our
-benchmarking results indicate that Google RE2 provides the best balance of
-speed (10-15 ms/MB), memory efficiency (8-16 MB), and accuracy (99.5%) among
-regex engines, outperforming PCRE while maintaining broader hardware
-compatibility than Hyperscan. For exact matching, Aho-Corasick demonstrated
-superior performance (8 ms/MB) and scalability for large datasets. Performance
-analysis revealed that regex processing time scales linearly with dataset size
-and pattern complexity. A hybrid AI + Regex approach achieved the highest F1
-score (91. 6%) by improving recall and minimizing false positives. Device
-benchmarking confirmed that our solution maintains efficient CPU and memory
-usage on both high-performance and mid-range systems. Despite its
-effectiveness, challenges remain, such as limited multilingual support and the
-need for regular pattern updates. Future work should focus on expanding
-language coverage, integrating data security and privacy management (DSPM) with
-data loss prevention (DLP) tools, and enhancing regulatory compliance for
-broader global adoption.
+Recent advances in generative AI have precipitated a proliferation of novel
+writing assistants. These systems typically rely on multilingual large language
+models (LLMs), providing globalized workers the ability to revise or create
+diverse forms of content in different languages. However, there is substantial
+evidence indicating that the performance of multilingual LLMs varies between
+languages. Users who employ writing assistance for multiple languages are
+therefore susceptible to disparate output quality. Importantly, recent research
+has shown that people tend to generalize algorithmic errors across independent
+tasks, violating the behavioral axiom of choice independence. In this paper, we
+analyze whether user utilization of novel writing assistants in a charity
+advertisement writing task is affected by the AI's performance in a second
+language. Furthermore, we quantify the extent to which these patterns translate
+into the persuasiveness of generated charity advertisements, as well as the
+role of peoples' beliefs about LLM utilization in their donation choices. Our
+results provide evidence that writers who engage with an LLM-based writing
+assistant violate choice independence, as prior exposure to a Spanish LLM
+reduces subsequent utilization of an English LLM. While these patterns do not
+affect the aggregate persuasiveness of the generated advertisements, people's
+beliefs about the source of an advertisement (human versus AI) do. In
+particular, Spanish-speaking female participants who believed that they read an
+AI-generated advertisement strongly adjusted their donation behavior downwards.
+Furthermore, people are generally not able to adequately differentiate between
+human-generated and LLM-generated ads. Our work has important implications for
+the design, development, integration, and adoption of multilingual LLMs as
+assistive agents -- particularly in writing tasks.
 
-摘要：偵測個人身分資訊 (PII) 和受保護健康資訊 (PHI) 等敏感資料，對於資料安全平台至關重要。本研究評估基於 regex 的模式配對演算法和精確配對搜尋技術，以最佳化偵測速度、準確度和可擴充性。我們的基準測試結果顯示，在 regex 引擎中，Google RE2 在速度 (10-15 ms/MB)、記憶體效率 (8-16 MB) 和準確度 (99.5%) 方面取得最佳平衡，優於 PCRE，同時比 Hyperscan 擁有更廣泛的硬體相容性。對於精確配對，Aho-Corasick 展現出優異的效能 (8 ms/MB) 和大資料集的可擴充性。效能分析顯示，regex 處理時間會隨著資料集大小和模式複雜度線性擴充。混合 AI + Regex 方法透過提升召回率和將假陽性降至最低，達到了最高的 F1 分數 (91. 6%)。裝置基準測試確認我們的解決方案在高性能和中階系統上都能維持高效的 CPU 和記憶體使用率。儘管有效，但仍有挑戰存在，例如多語言支援有限，以及需要定期更新模式。未來的研究應著重於擴展語言涵蓋範圍，將資料安全和隱私管理 (DSPM) 與資料遺失防護 (DLP) 工具整合，以及加強法規遵循以利更廣泛的全球採用。
+摘要：<paragraph>生成式 AI 的最新進展加速了新穎寫作助理的激增。這些系統通常依賴多語言大型語言模型 (LLM)，讓全球化的工作者能夠以不同的語言修改或建立各種形式的內容。然而，有大量證據顯示多語言 LLM 的表現因語言而異。因此，使用多語言寫作協助的使用者容易受到不同的輸出品質影響。重要的是，最近的研究顯示人們傾向於在獨立的任務中概化演算法錯誤，違反了選擇獨立性的行為公理。在本文中，我們分析使用者在慈善廣告寫作任務中使用新穎寫作助理是否會受到 AI 在第二語言中的表現影響。此外，我們量化這些模式轉化為所產生慈善廣告說服力的程度，以及人們對 LLM 使用在捐款選擇中的信念所扮演的角色。我們的結果提供證據，表明與基於 LLM 的寫作助理互動的寫作者會違反選擇獨立性，因為先前接觸過西班牙語 LLM 會減少後續使用英語 LLM 的情況。雖然這些模式不會影響所產生廣告的整體說服力，但人們對廣告來源（人類與 AI）的信念會影響。特別是，相信自己閱讀 AI 生成的廣告的西班牙語系女性參與者大幅調整了他們的捐款行為。此外，人們通常無法充分區分人類產生的廣告和 LLM 產生的廣告。我們的研究對多語言 LLM 作為輔助代理的設計、開發、整合和採用具有重要的意義，特別是在寫作任務中。</paragraph>
 
-##### **WatchGuardian: Enabling User-Defined Personalized Just-in-Time Intervention on Smartwatch**
-2502.05783v1 by Ying Lei, Yancheng Cao, Will Wang, Yuanzhe Dong, Changchang Yin, Weidan Cao, Ping Zhang, Jingzhen Yang, Bingsheng Yao, Yifan Peng, Chunhua Weng, Randy Auerbach, Lena Mamykina, Dakuo Wang, Yuntao Wang, Xuhai Xu
+##### **Diffusion Models for Molecules: A Survey of Methods and Tasks**
+2502.09511v1 by Liang Wang, Chao Song, Zhiyuan Liu, Yu Rong, Qiang Liu, Shu Wu, Liang Wang
 
-While just-in-time interventions (JITIs) have effectively targeted common
-health behaviors, individuals often have unique needs to intervene in personal
-undesirable actions that can negatively affect physical, mental, and social
-well-being. We present WatchGuardian, a smartwatch-based JITI system that
-empowers users to define custom interventions for these personal actions with a
-small number of samples. For the model to detect new actions based on limited
-new data samples, we developed a few-shot learning pipeline that finetuned a
-pre-trained inertial measurement unit (IMU) model on public hand-gesture
-datasets. We then designed a data augmentation and synthesis process to train
-additional classification layers for customization. Our offline evaluation with
-26 participants showed that with three, five, and ten examples, our approach
-achieved an average accuracy of 76.8%, 84.7%, and 87.7%, and an F1 score of
-74.8%, 84.2%, and 87.2% We then conducted a four-hour intervention study to
-compare WatchGuardian against a rule-based intervention. Our results
-demonstrated that our system led to a significant reduction by 64.0 +- 22.6% in
-undesirable actions, substantially outperforming the baseline by 29.0%. Our
-findings underscore the effectiveness of a customizable, AI-driven JITI system
-for individuals in need of behavioral intervention in personal undesirable
-actions. We envision that our work can inspire broader applications of
-user-defined personalized intervention with advanced AI solutions.
+Generative tasks about molecules, including but not limited to molecule
+generation, are crucial for drug discovery and material design, and have
+consistently attracted significant attention. In recent years, diffusion models
+have emerged as an impressive class of deep generative models, sparking
+extensive research and leading to numerous studies on their application to
+molecular generative tasks. Despite the proliferation of related work, there
+remains a notable lack of up-to-date and systematic surveys in this area.
+Particularly, due to the diversity of diffusion model formulations, molecular
+data modalities, and generative task types, the research landscape is
+challenging to navigate, hindering understanding and limiting the area's
+growth. To address this, this paper conducts a comprehensive survey of
+diffusion model-based molecular generative methods. We systematically review
+the research from the perspectives of methodological formulations, data
+modalities, and task types, offering a novel taxonomy. This survey aims to
+facilitate understanding and further flourishing development in this area. The
+relevant papers are summarized at:
+https://github.com/AzureLeon1/awesome-molecular-diffusion-models.
 
-摘要：<paragraph>雖然即時介入（JITIs）有效地針對常見的健康行為，但個人通常有獨特的需求來介入可能會對身心和社會福祉產生負面影響的個人不良行為。我們提出 WatchGuardian，這是一個基於智慧手錶的 JITI 系統，它使用少數樣本讓使用者能夠為這些個人行為定義自訂介入措施。為了讓模型根據有限的新資料樣本偵測新行為，我們開發了一個小樣本學習管道，微調了公共手勢資料集上的預訓練慣性測量單元（IMU）模型。然後，我們設計了一個資料擴充和合成流程，以訓練其他分類層以進行自訂。我們對 26 位參與者進行的離線評估顯示，我們的做法使用三個、五個和十個範例，達到了 76.8%、84.7% 和 87.7% 的平均準確度，以及 74.8%、84.2% 和 87.2% 的 F1 分數。然後，我們進行了一項為時四小時的介入研究，以將 WatchGuardian 與基於規則的介入進行比較。我們的結果表明，我們的系統導致不良行為顯著減少了 64.0 +- 22.6%，大幅優於基線 29.0%。我們的研究結果強調了可自訂、AI 驅動的 JITI 系統對需要行為介入以應對個人不良行為的個人的有效性。我們預計我們的研究可以激勵使用者定義個人化介入的更廣泛應用，並採用先進的 AI 解決方案。</paragraph>
+摘要：<paragraph>包括但不限於分子生成在內的分子生成任務，對於藥物發現和材料設計至關重要，並持續吸引大量關注。近年來，擴散模型已成為深度生成模型中令人印象深刻的一類，激發了廣泛的研究，並導致對其應用於分子生成任務的眾多研究。儘管相關工作不斷增加，但這個領域仍然缺乏最新的系統性綜述。特別是，由於擴散模型公式、分子數據方式和生成任務類型的多樣性，研究領域難以瀏覽，阻礙了理解並限制了該領域的發展。為了解決這個問題，本文對基於擴散模型的分子生成方法進行了全面的調查。我們從方法論公式、數據方式和任務類型的角度系統性地回顧了研究，提供了一種新穎的分類法。本調查旨在促進理解並進一步促進該領域的蓬勃發展。相關論文總結如下：
+https://github.com/AzureLeon1/awesome-molecular-diffusion-models。</paragraph>
 
-##### **RECOVER: Designing a Large Language Model-based Remote Patient Monitoring System for Postoperative Gastrointestinal Cancer Care**
-2502.05740v1 by Ziqi Yang, Yuxuan Lu, Jennifer Bagdasarian, Vedant Das Swain, Ritu Agarwal, Collin Campbell, Waddah Al-Refaire, Jehan El-Bayoumi, Guodong Gao, Dakuo Wang, Bingsheng Yao, Nawar Shara
+##### **AttentionSmithy: A Modular Framework for Rapid Transformer Development and Customization**
+2502.09503v1 by Caleb Cranney, Jesse G. Meyer
 
-Cancer surgery is a key treatment for gastrointestinal (GI) cancers, a group
-of cancers that account for more than 35% of cancer-related deaths worldwide,
-but postoperative complications are unpredictable and can be life-threatening.
-In this paper, we investigate how recent advancements in large language models
-(LLMs) can benefit remote patient monitoring (RPM) systems through clinical
-integration by designing RECOVER, an LLM-powered RPM system for postoperative
-GI cancer care. To closely engage stakeholders in the design process, we first
-conducted seven participatory design sessions with five clinical staff and
-interviewed five cancer patients to derive six major design strategies for
-integrating clinical guidelines and information needs into LLM-based RPM
-systems. We then designed and implemented RECOVER, which features an
-LLM-powered conversational agent for cancer patients and an interactive
-dashboard for clinical staff to enable efficient postoperative RPM. Finally, we
-used RECOVER as a pilot system to assess the implementation of our design
-strategies with four clinical staff and five patients, providing design
-implications by identifying crucial design elements, offering insights on
-responsible AI, and outlining opportunities for future LLM-powered RPM systems.
+Transformer architectures have transformed AI applications but remain complex
+to customize for domain experts lacking low-level implementation expertise. We
+introduce AttentionSmithy, a modular software package that simplifies
+transformer innovation by breaking down key components into reusable building
+blocks: attention modules, feed-forward networks, normalization layers, and
+positional encodings. Users can rapidly prototype and evaluate transformer
+variants without extensive coding. Our framework supports four positional
+encoding strategies and integrates with neural architecture search for
+automated design. We validate AttentionSmithy by replicating the original
+transformer under resource constraints and optimizing translation performance
+by combining positional encodings. Additionally, we demonstrate its
+adaptability in gene-specific modeling, achieving over 95% accuracy in cell
+type classification. These case studies highlight AttentionSmithy's potential
+to accelerate research across diverse fields by removing framework
+implementation barriers.
 
-摘要：癌症手術是胃腸道 (GI) 癌症的主要治療方式，這類癌症佔全球癌症相關死亡人數的 35% 以上，但術後併發症無法預測，且可能危及生命。在本文中，我們探討大型語言模型 (LLM) 的近期進展如何透過臨床整合造福遠端病患監控 (RPM) 系統，方法是設計 RECOVER，一個由 LLM 驅動的 RPM 系統，用於術後胃腸道癌症照護。為了讓利害關係人密切參與設計流程，我們首先與五位臨床人員進行七場參與式設計會議，並訪談五位癌症患者，以找出六項整合臨床指南和資訊需求至基於 LLM 的 RPM 系統的主要設計策略。接著，我們設計並實作 RECOVER，其特色在於一個由 LLM 驅動的對話式代理人，供癌症患者使用，以及一個互動式儀表板，供臨床人員使用，以進行有效的術後 RPM。最後，我們使用 RECOVER 作為試點系統，與四位臨床人員和五位患者評估我們設計策略的實作，並透過找出重要的設計元素、提供對負責任 AI 的見解，以及概述未來由 LLM 驅動的 RPM 系統的機會，提出設計意涵。
+摘要：Transformer 架構已轉變 AI 應用，但對於缺乏低階實作專業知識的領域專家而言，自訂仍很複雜。我們推出 AttentionSmithy，這是一個模組化軟體套件，透過將關鍵元件分解成可重複使用的建構區塊（注意力模組、前饋網路、正規化層和位置編碼）來簡化 Transformer 創新。使用者可以快速建置原型和評估 Transformer 變體，而無需大量編碼。我們的架構支援四種位置編碼策略，並整合神經架構搜尋以進行自動化設計。我們透過在資源限制下複製原始 Transformer 和結合位置編碼來最佳化翻譯效能，驗證 AttentionSmithy。此外，我們展示其在基因特定建模中的適應性，在細胞類型分類中達到超過 95% 的準確度。這些案例研究突顯 AttentionSmithy 在移除架構實作障礙後，加速各個領域研究的潛力。
 
-##### **4D VQ-GAN: Synthesising Medical Scans at Any Time Point for Personalised Disease Progression Modelling of Idiopathic Pulmonary Fibrosis**
-2502.05713v1 by An Zhao, Moucheng Xu, Ahmed H. Shahin, Wim Wuyts, Mark G. Jones, Joseph Jacob, Daniel C. Alexander
+##### **Improve LLM-based Automatic Essay Scoring with Linguistic Features**
+2502.09497v1 by Zhaoyi Joey Hou, Alejandro Ciuba, Xiang Lorraine Li
 
-Understanding the progression trajectories of diseases is crucial for early
-diagnosis and effective treatment planning. This is especially vital for
-life-threatening conditions such as Idiopathic Pulmonary Fibrosis (IPF), a
-chronic, progressive lung disease with a prognosis comparable to many cancers.
-Computed tomography (CT) imaging has been established as a reliable diagnostic
-tool for IPF. Accurately predicting future CT scans of early-stage IPF patients
-can aid in developing better treatment strategies, thereby improving survival
-outcomes. In this paper, we propose 4D Vector Quantised Generative Adversarial
-Networks (4D-VQ-GAN), a model capable of generating realistic CT volumes of IPF
-patients at any time point. The model is trained using a two-stage approach. In
-the first stage, a 3D-VQ-GAN is trained to reconstruct CT volumes. In the
-second stage, a Neural Ordinary Differential Equation (ODE) based temporal
-model is trained to capture the temporal dynamics of the quantised embeddings
-generated by the encoder in the first stage. We evaluate different
-configurations of our model for generating longitudinal CT scans and compare
-the results against ground truth data, both quantitatively and qualitatively.
-For validation, we conduct survival analysis using imaging biomarkers derived
-from generated CT scans and achieve a C-index comparable to that of biomarkers
-derived from the real CT scans. The survival analysis results demonstrate the
-potential clinical utility inherent to generated longitudinal CT scans, showing
-that they can reliably predict survival outcomes.
+Automatic Essay Scoring (AES) assigns scores to student essays, reducing the
+grading workload for instructors. Developing a scoring system capable of
+handling essays across diverse prompts is challenging due to the flexibility
+and diverse nature of the writing task. Existing methods typically fall into
+two categories: supervised feature-based approaches and large language model
+(LLM)-based methods. Supervised feature-based approaches often achieve higher
+performance but require resource-intensive training. In contrast, LLM-based
+methods are computationally efficient during inference but tend to suffer from
+lower performance. This paper combines these approaches by incorporating
+linguistic features into LLM-based scoring. Experimental results show that this
+hybrid method outperforms baseline models for both in-domain and out-of-domain
+writing prompts.
 
-摘要：了解疾病的進程軌跡對於早期診斷和有效的治療計畫至關重要。這對於特發性肺纖維化 (IPF) 等威脅生命的疾病尤其重要，IPF 是一種慢性、進行性肺部疾病，其預後與許多癌症相當。電腦斷層掃描 (CT) 影像已被確立為 IPF 的可靠診斷工具。準確預測早期 IPF 患者的未來 CT 掃描有助於制定更好的治療策略，從而改善存活結果。在本文中，我們提出 4D 向量量化生成對抗網路 (4D-VQ-GAN)，這是一個模型，能夠在任何時間點生成 IPF 患者的逼真 CT 體積。該模型使用兩階段方法進行訓練。在第一階段，訓練 3D-VQ-GAN 以重建 CT 體積。在第二階段，訓練基於神經常微分方程 (ODE) 的時間模型，以捕捉第一階段編碼器生成的量化嵌入的時間動態。我們評估了我們的模型的不同配置，以生成縱向 CT 掃描，並在定量和定性方面將結果與真實數據進行比較。為了驗證，我們使用從生成的 CT 掃描中得出的影像生物標記進行存活分析，並獲得與從真實 CT 掃描中得出的生物標記相當的 C 指數。存活分析結果證明了生成縱向 CT 掃描固有的潛在臨床效用，表明它們可以可靠地預測存活結果。
+摘要：自動化論文評分 (AES) 會為學生的論文評分，以減輕教師的評分工作負擔。由於寫作任務的靈活性與多樣性，開發一種評分系統來處理各種提示的論文是一項挑戰。現有方法通常分為兩類：監督式特徵方法和大型語言模型 (LLM) 方法。監督式特徵方法通常能達到較高的效能，但需要大量資源進行訓練。相比之下，LLM 方法在推論期間的計算效率很高，但效能往往較低。本文結合了這些方法，將語言特徵納入 LLM 評分中。實驗結果顯示，這種混合方法在領域內和領域外寫作提示方面都優於基準模型。
 
-##### **KMI: A Dataset of Korean Motivational Interviewing Dialogues for Psychotherapy**
-2502.05651v1 by Hyunjong Kim, Suyeon Lee, Yeongjae Cho, Eunseo Ryu, Yohan Jo, Suran Seong, Sungzoon Cho
+##### **Cracking the Code: Enhancing Development finance understanding with artificial intelligence**
+2502.09495v1 by Pierre Beaucoral
 
-The increasing demand for mental health services has led to the rise of
-AI-driven mental health chatbots, though challenges related to privacy, data
-collection, and expertise persist. Motivational Interviewing (MI) is gaining
-attention as a theoretical basis for boosting expertise in the development of
-these chatbots. However, existing datasets are showing limitations for training
-chatbots, leading to a substantial demand for publicly available resources in
-the field of MI and psychotherapy. These challenges are even more pronounced in
-non-English languages, where they receive less attention. In this paper, we
-propose a novel framework that simulates MI sessions enriched with the
-expertise of professional therapists. We train an MI forecaster model that
-mimics the behavioral choices of professional therapists and employ Large
-Language Models (LLMs) to generate utterances through prompt engineering. Then,
-we present KMI, the first synthetic dataset theoretically grounded in MI,
-containing 1,000 high-quality Korean Motivational Interviewing dialogues.
-Through an extensive expert evaluation of the generated dataset and the
-dialogue model trained on it, we demonstrate the quality, expertise, and
-practicality of KMI. We also introduce novel metrics derived from MI theory in
-order to evaluate dialogues from the perspective of MI.
+Analyzing development projects is crucial for understanding donors aid
+strategies, recipients priorities, and to assess development finance capacity
+to adress development issues by on-the-ground actions. In this area, the
+Organisation for Economic Co-operation and Developments (OECD) Creditor
+Reporting System (CRS) dataset is a reference data source. This dataset
+provides a vast collection of project narratives from various sectors
+(approximately 5 million projects). While the OECD CRS provides a rich source
+of information on development strategies, it falls short in informing project
+purposes due to its reporting process based on donors self-declared main
+objectives and pre-defined industrial sectors. This research employs a novel
+approach that combines Machine Learning (ML) techniques, specifically Natural
+Language Processing (NLP), an innovative Python topic modeling technique called
+BERTopic, to categorise (cluster) and label development projects based on their
+narrative descriptions. By revealing existing yet hidden topics of development
+finance, this application of artificial intelligence enables a better
+understanding of donor priorities and overall development funding and provides
+methods to analyse public and private projects narratives.
 
-摘要：由於對心理健康服務的需求日益增加，導致以人工智慧為基礎的心理健康聊天機器人興起，儘管與隱私、資料蒐集和專業知識相關的挑戰依然存在。動機性訪談 (MI) 正作為提升這些聊天機器人在開發方面專業知識的理論基礎而備受關注。然而，現有的資料集顯示出訓練聊天機器人的限制，導致對 MI 和心理治療領域中公開可用資源的需求大幅增加。這些挑戰在非英語語言中更加明顯，因為它們受到的關注較少。在本文中，我們提出了一個新穎的架構，它模擬了豐富專業治療師專業知識的 MI 課程。我們訓練了一個 MI 預測模型，它模擬了專業治療師的行為選擇，並採用大型語言模型 (LLM) 透過提示工程來產生話語。然後，我們展示了 KMI，這是第一個理論上以 MI 為基礎的合成資料集，其中包含 1,000 個高品質的韓語動機性訪談對話。透過對所產生的資料集和在該資料集上訓練的對話模型進行廣泛的專家評估，我們展示了 KMI 的品質、專業知識和實用性。我們還引入了從 MI 理論中衍生的新指標，以便從 MI 的角度評估對話。
+摘要：分析發展專案對於了解捐助者援助策略、受贈者優先事項，以及評估發展資金能力以透過實際行動解決發展問題至關重要。在這個領域中，經濟合作暨發展組織 (OECD) 債權人報告系統 (CRS) 資料集是一個參考資料來源。此資料集提供來自各個部門的大量專案敘述（約 500 萬個專案）。雖然 OECD CRS 提供了豐富的發展策略資訊來源，但由於其報告程序基於捐助者自行申報的主要目標和預先定義的產業部門，因此在告知專案目的方面有所不足。本研究採用一種新穎的方法，結合機器學習 (ML) 技術，特別是自然語言處理 (NLP)，一種稱為 BERTopic 的創新 Python 主題建模技術，根據其敘述描述對發展專案進行分類（叢集）和標籤。透過揭露發展資金現有但隱藏的主題，這種人工智慧應用程式可以更好地了解捐助者的優先事項和整體發展資金，並提供分析公共和私人專案敘述的方法。
 
-##### **ELMTEX: Fine-Tuning Large Language Models for Structured Clinical Information Extraction. A Case Study on Clinical Reports**
-2502.05638v1 by Aynur Guluzade, Naguib Heiba, Zeyd Boukhers, Florim Hamiti, Jahid Hasan Polash, Yehya Mohamad, Carlos A Velasco
+##### **Objective quantification of mood states using large language models**
+2502.09487v1 by Jakub Onysk, Quentin Huys
 
-Europe's healthcare systems require enhanced interoperability and
-digitalization, driving a demand for innovative solutions to process legacy
-clinical data. This paper presents the results of our project, which aims to
-leverage Large Language Models (LLMs) to extract structured information from
-unstructured clinical reports, focusing on patient history, diagnoses,
-treatments, and other predefined categories. We developed a workflow with a
-user interface and evaluated LLMs of varying sizes through prompting strategies
-and fine-tuning. Our results show that fine-tuned smaller models match or
-surpass larger counterparts in performance, offering efficiency for
-resource-limited settings. A new dataset of 60,000 annotated English clinical
-summaries and 24,000 German translations was validated with automated and
-manual checks. The evaluations used ROUGE, BERTScore, and entity-level metrics.
-The work highlights the approach's viability and outlines future improvements.
+Emotional states influence human behaviour and cognition, leading to diverse
+thought trajectories. Similarly, Large Language Models (LLMs) showcase an
+excellent level of response consistency across wide-ranging contexts (prompts).
+We leverage these parallels to establish a framework for quantifying mental
+states. Our approach utilises self-report questionnaires that reliably assess
+these states due to their inherent sensitivity to patterns of co-occurring
+responses. Specifically, we recruited a large sample of participants (N=422) to
+investigate how well an LLM (Mistral-7B-OpenOrca) quantifies a heterogenous set
+of depressive mood states measured with participants' open-ended responses to a
+depression questionnaire. We show LLM responses to held-out multiple-choice
+questions, given participants' open-ended answers, correlate strongly (r:
+0.52-0.84) with true questionnaire scores, demonstrating LLM's generalisation
+from mood representations. We explore a link between these representations and
+factor analysis. Using ridge regression, we find depression-related subspaces
+within LLM hidden states. We show these subspaces to be predictive of
+participants' "Depression" and "Somatic & Emotional Distress" factor scores, as
+well as suicidality severity. Overall, LLMs can provide quantitative measures
+of mental states. The reliability of these hinges upon how informative the
+questions we ask participants are. Used correctly, this approach could
+supplement mental state assessment in a variety of settings.
 
-摘要：歐洲的醫療保健系統需要增強互通性和數位化，這驅動了對創新解決方案的需求，以處理傳統的臨床數據。本文介紹了我們專案的成果，該專案旨在利用大型語言模型 (LLM) 從非結構化的臨床報告中提取結構化的資訊，重點放在病歷、診斷、治療和其他預定義類別上。我們開發了一個具有使用者介面的工作流程，並透過提示策略和微調來評估不同規模的 LLM。我們的結果顯示，微調後的較小模型在效能上與較大的模型相匹配或超越它們，為資源有限的環境提供了效率。一個包含 60,000 個註解英文臨床摘要和 24,000 個德文翻譯的新資料集已透過自動化和手動檢查進行驗證。評估使用了 ROUGE、BERTScore 和實體層級的指標。這項工作突出了這種方法的可行性，並概述了未來的改進。
+摘要：情緒狀態會影響人類行為和認知，導致不同的思維軌跡。同樣地，大型語言模型 (LLM) 在廣泛的脈絡（提示）中展示出極佳的反應一致性。我們利用這些相似之處來建立一個量化心理狀態的框架。我們的做法利用自我報告問卷，由於這些問卷對共生反應模式具有內在敏感性，因此可以可靠地評估這些狀態。具體來說，我們招募了大量的參與者樣本 (N=422) 來調查 LLM (Mistral-7B-OpenOrca) 如何量化一組異質的抑鬱情緒狀態，這些狀態是根據參與者對抑鬱症問卷的開放式回答來衡量的。我們展示了 LLM 對保留的多選題的回答，給定參與者的開放式回答，與真正的問卷分數密切相關 (r：0.52-0.84)，這證明了 LLM 從情緒表徵中進行概括。我們探索這些表徵與因子分析之間的聯繫。使用嶺回歸，我們在 LLM 隱藏狀態內發現了與抑鬱相關的子空間。我們展示這些子空間可以預測參與者的「抑鬱」和「軀體和情緒困擾」因子分數，以及自殺嚴重性。總體而言，LLM 可以提供心理狀態的量化測量。這些測量的可靠性取決於我們詢問參與者的問題的資訊性。如果使用得當，這種方法可以補充各種環境中的心理狀態評估。
 
-##### **Multi-scale Masked Autoencoder for Electrocardiogram Anomaly Detection**
-2502.05494v1 by Ya Zhou, Yujie Yang, Jianhuang Gan, Xiangjie Li, Jing Yuan, Wei Zhao
+##### **The Multilingual Mind : A Survey of Multilingual Reasoning in Language Models**
+2502.09457v1 by Akash Ghosh, Debayan Datta, Sriparna Saha, Chirag Agarwal
 
-Electrocardiogram (ECG) analysis is a fundamental tool for diagnosing
-cardiovascular conditions, yet anomaly detection in ECG signals remains
-challenging due to their inherent complexity and variability. We propose
-Multi-scale Masked Autoencoder for ECG anomaly detection (MMAE-ECG), a novel
-end-to-end framework that effectively captures both global and local
-dependencies in ECG data. Unlike state-of-the-art methods that rely on
-heartbeat segmentation or R-peak detection, MMAE-ECG eliminates the need for
-such pre-processing steps, enhancing its suitability for clinical deployment.
-MMAE-ECG partitions ECG signals into non-overlapping segments, with each
-segment assigned learnable positional embeddings. A novel multi-scale masking
-strategy and multi-scale attention mechanism, along with distinct positional
-embeddings, enable a lightweight Transformer encoder to effectively capture
-both local and global dependencies. The masked segments are then reconstructed
-using a single-layer Transformer block, with an aggregation strategy employed
-during inference to refine the outputs. Experimental results demonstrate that
-our method achieves performance comparable to state-of-the-art approaches while
-significantly reducing computational complexity-approximately 1/78 of the
-floating-point operations (FLOPs) required for inference. Ablation studies
-further validate the effectiveness of each component, highlighting the
-potential of multi-scale masked autoencoders for anomaly detection.
+While reasoning and multilingual capabilities in Language Models (LMs) have
+achieved remarkable progress in recent years, their integration into a unified
+paradigm, multilingual reasoning, is at a nascent stage. Multilingual reasoning
+requires language models to handle logical reasoning across languages while
+addressing misalignment, biases, and challenges in low-resource settings. This
+survey provides the first in-depth review of multilingual reasoning in LMs. In
+this survey, we provide a systematic overview of existing methods that leverage
+LMs for multilingual reasoning, specifically outlining the challenges,
+motivations, and foundational aspects of applying language models to reason
+across diverse languages. We provide an overview of the standard data resources
+used for training multilingual reasoning in LMs and the evaluation benchmarks
+employed to assess their multilingual capabilities. Next, we analyze various
+state-of-the-art methods and their performance on these benchmarks. Finally, we
+explore future research opportunities to improve multilingual reasoning in LMs,
+focusing on enhancing their ability to handle diverse languages and complex
+reasoning tasks.
 
-摘要：心電圖 (ECG) 分析是診斷心血管疾病的基本工具，但由於 ECG 訊號本身的複雜性和變異性，異常偵測仍然是一項挑戰。我們提出用於 ECG 異常偵測的多尺度遮罩自編碼器 (MMAE-ECG)，這是一個新穎的端對端架構，可有效擷取 ECG 資料中的全局和局部依賴關係。與依賴於心跳區段或 R 波峰偵測的最新方法不同，MMAE-ECG 消除了對此類前處理步驟的需求，增強其適用於臨床部署。MMAE-ECG 將 ECG 訊號分割成不相疊的區段，每個區段都指派可學習的位置嵌入。新穎的多尺度遮罩策略和多尺度注意力機制，以及不同的位置嵌入，使輕量級 Transformer 編碼器能夠有效擷取局部和全局依賴關係。然後使用單層 Transformer 區塊重建遮罩區段，並在推理期間採用聚合策略來優化輸出。實驗結果表明，我們的模型達到了與最新方法相當的效能，同時大幅降低運算複雜度，約為推理所需的浮點運算 (FLOP) 的 1/78。消融研究進一步驗證了每個組件的有效性，突顯了多尺度遮罩自編碼器在異常偵測方面的潛力。
+摘要：儘管語言模型 (LM) 的推理和多語言能力在近年來取得顯著進展，但它們整合至統一典範（多語言推理）仍處於萌芽階段。多語言推理要求語言模型跨語言處理邏輯推理，同時解決低資源環境中的錯位、偏見和挑戰。本調查提供了 LM 中多語言推理的首次深入探討。在本調查中，我們系統性地概述了現有利用 LM 進行多語言推理的方法，特別概述了將語言模型應用於跨不同語言推理的挑戰、動機和基礎方面。我們概述了用於訓練 LM 中多語言推理的標準數據資源，以及用於評估其多語言能力的評估基準。接下來，我們分析了各種最先進的方法及其在這些基準上的表現。最後，我們探討了改進 LM 中多語言推理的未來研究機會，重點關注增強其處理不同語言和複雜推理任務的能力。
 
-##### **DCENWCNet: A Deep CNN Ensemble Network for White Blood Cell Classification with LIME-Based Explainability**
-2502.05459v1 by Sibasish Dhibar
+##### **Pixel-Level Reasoning Segmentation via Multi-turn Conversations**
+2502.09447v1 by Dexian Cai, Xiaocui Yang, Yongkang Liu, Daling Wang, Shi Feng, Yifei Zhang, Soujanya Poria
 
-White blood cells (WBC) are important parts of our immune system, and they
-protect our body against infections by eliminating viruses, bacteria, parasites
-and fungi. The number of WBC types and the total number of WBCs provide
-important information about our health status. A traditional method,
-convolutional neural networks (CNN), a deep learning architecture, can classify
-the blood cell from a part of an object and perform object recognition. Various
-CNN models exhibit potential; however, their development often involves ad-hoc
-processes that neglect unnecessary layers, leading to issues with unbalanced
-datasets and insufficient data augmentation. To address these challenges, we
-propose a novel ensemble approach that integrates three CNN architectures, each
-uniquely configured with different dropout and max-pooling layer settings to
-enhance feature learning. This ensemble model, named DCENWCNet, effectively
-balances the bias-variance trade-off. When evaluated on the widely recognized
-Rabbin-WBC dataset, our model outperforms existing state-of-the-art networks,
-achieving highest mean accuracy. Additionally, it demonstrates superior
-performance in precision, recall, F1-score, and Area Under the ROC Curve (AUC)
-across all categories. To delve deeper into the interpretability of
-classifiers, we employ reliable post-hoc explanation techniques, including
-Local Interpretable Model-Agnostic Explanations (LIME). These methods
-approximate the behavior of a black-box model by elucidating the relationships
-between feature values and predictions. Interpretable results enable users to
-comprehend and validate the model's predictions, thereby increasing their
-confidence in the automated diagnosis.
+Existing visual perception systems focus on region-level segmentation in
+single-turn dialogues, relying on complex and explicit query instructions. Such
+systems cannot reason at the pixel level and comprehend dynamic user intent
+that changes over interaction. Our work tackles this issue by introducing a
+novel task, Pixel-level Reasoning Segmentation (Pixel-level RS) based on
+multi-turn conversations, tracking evolving user intent via multi-turn
+interactions for fine-grained segmentation. To establish a benchmark for this
+novel task, we build a Pixel-level ReasonIng Segmentation Dataset Based on
+Multi-Turn Conversations (PRIST), comprising 24k utterances from 8.3k
+multi-turn conversational scenarios with segmentation targets. Building on
+PRIST, we further propose MIRAS, a Multi-turn Interactive ReAsoning
+Segmentation framework, integrates pixel-level segmentation with robust
+multi-turn conversation understanding, generating pixel-grounded explanations
+aligned with user intent. The PRIST dataset and MIRSA framework fill the gap in
+pixel-level reasoning segmentation. Experimental results on the PRIST dataset
+demonstrate that our method outperforms current segmentation-specific baselines
+in terms of segmentation and LLM-based reasoning metrics. The code and data are
+available at: https://github.com/ccccai239/PixelRIST.
 
-摘要：白血球 (WBC) 是我們免疫系統的重要組成部分，它們通過清除病毒、細菌、寄生蟲和真菌來保護我們的機體免受感染。WBC 類型數量和 WBC 總數提供了有關我們健康狀況的重要資訊。傳統方法卷積神經網路 (CNN) 是一種深度學習架構，可以對物體的一部分進行血細胞分類並執行物體識別。各種 CNN 模型展現出潛力；然而，它們的開發通常涉及忽略不必要層的臨時過程，導致不平衡的資料集和資料擴充不足的問題。為了應對這些挑戰，我們提出了一種新穎的整體方法，它整合了三種 CNN 架構，每種架構都採用不同的中斷和最大池化層設定進行獨特配置，以增強特徵學習。這種名為 DCENWCNet 的整體模型有效地平衡了偏差變異取捨。在廣泛認可的 Rabbin-WBC 資料集上進行評估時，我們的模型優於現有的最先進網路，達到了最高的平均準確度。此外，它在所有類別中都展示了在精確度、召回率、F1 分數和 ROC 曲線下面積 (AUC) 方面的卓越效能。為了更深入地研究分類器的可解釋性，我們採用了可靠的事後解釋技術，包括局部可解釋模型不可知解釋 (LIME)。這些方法通過闡明特徵值和預測之間的關係來近似黑盒模型的行為。可解釋的結果使用戶能夠理解和驗證模型的預測，從而增加他們對自動化診斷的信心。
+摘要：現有的視覺感知系統專注於單輪對話中的區域級分割，依賴於複雜且明確的查詢指令。此類系統無法在像素級別推理和理解在互動中不斷變化的動態使用者意圖。我們的研究通過引入一項基於多輪對話的像素級推理分割（像素級 RS）新任務來解決此問題，通過多輪互動追蹤不斷演變的使用者意圖，以進行精細分割。為了建立此新任務的基準，我們建立了一個基於多輪對話的像素級推理分割資料集（PRIST），其中包含來自 8.3k 多輪對話場景的 24k 個語句，以及分割目標。在 PRIST 的基礎上，我們進一步提出了 MIRAS，這是一個多輪互動推理分割框架，它將像素級分割與強大的多輪對話理解整合在一起，生成符合使用者意圖的像素級解釋。PRIST 資料集和 MIRSA 框架填補了像素級推理分割的空白。在 PRIST 資料集上的實驗結果表明，我們的模型在分割和基於 LLM 的推理指標方面優於目前的特定於分割的基準。程式碼和資料可在 https://github.com/ccccai239/PixelRIST 獲得。
 
-##### **Multi-Class Segmentation of Aortic Branches and Zones in Computed Tomography Angiography: The AortaSeg24 Challenge**
-2502.05330v1 by Muhammad Imran, Jonathan R. Krebs, Vishal Balaji Sivaraman, Teng Zhang, Amarjeet Kumar, Walker R. Ueland, Michael J. Fassler, Jinlong Huang, Xiao Sun, Lisheng Wang, Pengcheng Shi, Maximilian Rokuss, Michael Baumgartner, Yannick Kirchhof, Klaus H. Maier-Hein, Fabian Isensee, Shuolin Liu, Bing Han, Bong Thanh Nguyen, Dong-jin Shin, Park Ji-Woo, Mathew Choi, Kwang-Hyun Uhm, Sung-Jea Ko, Chanwoong Lee, Jaehee Chun, Jin Sung Kim, Minghui Zhang, Hanxiao Zhang, Xin You, Yun Gu, Zhaohong Pan, Xuan Liu, Xiaokun Liang, Markus Tiefenthaler, Enrique Almar-Munoz, Matthias Schwab, Mikhail Kotyushev, Rostislav Epifanov, Marek Wodzinski, Henning Muller, Abdul Qayyum, Moona Mazher, Steven A. Niederer, Zhiwei Wang, Kaixiang Yang, Jintao Ren, Stine Sofia Korreman, Yuchong Gao, Hongye Zeng, Haoyu Zheng, Rui Zheng, Jinghua Yue, Fugen Zhou, Bo Liu, Alexander Cosman, Muxuan Liang, Chang Zhao, Gilbert R. Upchurch Jr., Jun Ma, Yuyin Zhou, Michol A. Cooper, Wei Shao
+##### **Dual Formulation for Non-Rectangular Lp Robust Markov Decision Processes**
+2502.09432v1 by Navdeep Kumar, Adarsh Gupta, Maxence Mohamed Elfatihi, Giorgia Ramponi, Kfir Yehuda Levy, Shie Mannor
 
-Multi-class segmentation of the aorta in computed tomography angiography
-(CTA) scans is essential for diagnosing and planning complex endovascular
-treatments for patients with aortic dissections. However, existing methods
-reduce aortic segmentation to a binary problem, limiting their ability to
-measure diameters across different branches and zones. Furthermore, no
-open-source dataset is currently available to support the development of
-multi-class aortic segmentation methods. To address this gap, we organized the
-AortaSeg24 MICCAI Challenge, introducing the first dataset of 100 CTA volumes
-annotated for 23 clinically relevant aortic branches and zones. This dataset
-was designed to facilitate both model development and validation. The challenge
-attracted 121 teams worldwide, with participants leveraging state-of-the-art
-frameworks such as nnU-Net and exploring novel techniques, including cascaded
-models, data augmentation strategies, and custom loss functions. We evaluated
-the submitted algorithms using the Dice Similarity Coefficient (DSC) and
-Normalized Surface Distance (NSD), highlighting the approaches adopted by the
-top five performing teams. This paper presents the challenge design, dataset
-details, evaluation metrics, and an in-depth analysis of the top-performing
-algorithms. The annotated dataset, evaluation code, and implementations of the
-leading methods are publicly available to support further research. All
-resources can be accessed at https://aortaseg24.grand-challenge.org.
+We study robust Markov decision processes (RMDPs) with non-rectangular
+uncertainty sets, which capture interdependencies across states unlike
+traditional rectangular models. While non-rectangular robust policy evaluation
+is generally NP-hard, even in approximation, we identify a powerful class of
+$L_p$-bounded uncertainty sets that avoid these complexity barriers due to
+their structural simplicity. We further show that this class can be decomposed
+into infinitely many \texttt{sa}-rectangular $L_p$-bounded sets and leverage
+its structural properties to derive a novel dual formulation for $L_p$ RMDPs.
+This formulation provides key insights into the adversary's strategy and
+enables the development of the first robust policy evaluation algorithms for
+non-rectangular RMDPs. Empirical results demonstrate that our approach
+significantly outperforms brute-force methods, establishing a promising
+foundation for future investigation into non-rectangular robust MDPs.
 
-摘要：多類別主動脈電腦斷層血管攝影 (CTA) 掃描分割對於診斷和規劃主動脈剝離患者的複雜血管內治療至關重要。然而，現有方法將主動脈分割簡化為二元問題，限制了其測量不同分支和區域直徑的能力。此外，目前沒有開放原始碼數據集可用於支援多類別主動脈分割方法的開發。為了解決此問題，我們組織了 AortaSeg24 MICCAI 挑戰，引入了第一個包含 100 個 CTA 體積的數據集，這些體積針對 23 個臨床上相關的主動脈分支和區域進行了註釋。此數據集旨在促進模型開發和驗證。該挑戰吸引了來自世界各地的 121 個團隊，參與者利用了 nnU-Net 等最先進的框架，並探索了創新技術，包括串聯模型、數據擴充策略和自訂損失函數。我們使用 Dice 相似性係數 (DSC) 和標準化表面距離 (NSD) 評估了提交的演算法，重點介紹了前五名表現最佳團隊採用的方法。本文介紹了挑戰設計、數據集詳細資訊、評估指標以及對表現最佳演算法的深入分析。已公開註釋的數據集、評估程式碼和領先方法的實作，以支援進一步的研究。所有資源都可以在 https://aortaseg24.grand-challenge.org/ 獲得。
+摘要：我們研究具有非矩形不確定性集合的強健馬可夫決策過程 (RMDP)，它能捕捉到不同於傳統矩形模型的跨狀態相互依賴性。雖然非矩形強健策略評估通常是 NP-hard，即使在近似中也是如此，我們識別了一類強大的 $L_p$ 有界不確定性集合，由於其結構的簡潔性，可以避免這些複雜性障礙。我們進一步表明，此類可以分解為無限多的 \texttt{sa} 矩形 $L_p$ 有界集合，並利用其結構屬性為 $L_p$ RMDP 導出一個新的對偶公式。此公式提供了對抗者策略的重要見解，並能夠開發出第一個非矩形 RMDP 的強健策略評估演算法。實證結果表明，我們的做法顯著優於蠻力方法，為未來對非矩形強健 MDP 的研究奠定了有希望的基礎。
 
-##### **Homeomorphism Prior for False Positive and Negative Problem in Medical Image Dense Contrastive Representation Learning**
-2502.05282v1 by Yuting He, Boyu Wang, Rongjun Ge, Yang Chen, Guanyu Yang, Shuo Li
+##### **Transformer-Enhanced Variational Autoencoder for Crystal Structure Prediction**
+2502.09423v1 by Ziyi Chen, Yang Yuan, Siming Zheng, Jialong Guo, Sihan Liang, Yangang Wang, Zongguo Wang
 
-Dense contrastive representation learning (DCRL) has greatly improved the
-learning efficiency for image-dense prediction tasks, showing its great
-potential to reduce the large costs of medical image collection and dense
-annotation. However, the properties of medical images make unreliable
-correspondence discovery, bringing an open problem of large-scale false
-positive and negative (FP&N) pairs in DCRL. In this paper, we propose GEoMetric
-vIsual deNse sImilarity (GEMINI) learning which embeds the homeomorphism prior
-to DCRL and enables a reliable correspondence discovery for effective dense
-contrast. We propose a deformable homeomorphism learning (DHL) which models the
-homeomorphism of medical images and learns to estimate a deformable mapping to
-predict the pixels' correspondence under topological preservation. It
-effectively reduces the searching space of pairing and drives an implicit and
-soft learning of negative pairs via a gradient. We also propose a geometric
-semantic similarity (GSS) which extracts semantic information in features to
-measure the alignment degree for the correspondence learning. It will promote
-the learning efficiency and performance of deformation, constructing positive
-pairs reliably. We implement two practical variants on two typical
-representation learning tasks in our experiments. Our promising results on
-seven datasets which outperform the existing methods show our great
-superiority. We will release our code on a companion link:
-https://github.com/YutingHe-list/GEMINI.
+Crystal structure forms the foundation for understanding the physical and
+chemical properties of materials. Generative models have emerged as a new
+paradigm in crystal structure prediction(CSP), however, accurately capturing
+key characteristics of crystal structures, such as periodicity and symmetry,
+remains a significant challenge. In this paper, we propose a
+Transformer-Enhanced Variational Autoencoder for Crystal Structure Prediction
+(TransVAE-CSP), who learns the characteristic distribution space of stable
+materials, enabling both the reconstruction and generation of crystal
+structures. TransVAE-CSP integrates adaptive distance expansion with
+irreducible representation to effectively capture the periodicity and symmetry
+of crystal structures, and the encoder is a transformer network based on an
+equivariant dot product attention mechanism. Experimental results on the
+carbon_24, perov_5, and mp_20 datasets demonstrate that TransVAE-CSP
+outperforms existing methods in structure reconstruction and generation tasks
+under various modeling metrics, offering a powerful tool for crystal structure
+design and optimization.
 
-摘要：密集对比表征学习（DCRL）极大地提高了影像密集预测任务的学习效率，显示出其在降低医学影像收集和密集标注的大量成本方面的巨大潜力。然而，医学影像的特性使得对应关系发现不可靠，给 DCRL 带来大规模假阳性和假阴性（FP&N）对的开放性问题。在本文中，我们提出了 GEoMetric vIsual deNse sImilarity（GEMINI）学习，它将同胚先验嵌入 DCRL 中，并针对有效密集对比提供了可靠的对应关系发现。我们提出了一种可变形同胚学习（DHL），它对医学影像的同胚进行建模，并学习估计可变形映射，以预测在拓扑保持下的像素对应关系。它有效地减少了配对的搜索空间，并通过梯度驱动了负对的隐式和软学习。我们还提出了几何语义相似性（GSS），它提取特征中的语义信息，以测量对应关系学习的对齐度。它将促进变形学习的效率和性能，可靠地构建正对。我们在实验中针对两个典型的表征学习任务实现了两个实际变体。我们在七个数据集上的有希望的结果优于现有方法，显示出我们的巨大优势。我们将在配套链接中发布我们的代码：https://github.com/YutingHe-list/GEMINI。
+摘要：晶體結構形成了解材料物理和化學性質的基礎。生成模型已成為晶體結構預測 (CSP) 的新典範，然而，準確捕捉晶體結構的關鍵特徵（例如週期性和對稱性）仍然是一項重大挑戰。在本文中，我們提出了一種用於晶體結構預測的 Transformer 增強變異自動編碼器 (TransVAE-CSP)，它學習穩定材料的特徵分佈空間，使晶體結構的重建和生成成為可能。TransVAE-CSP 將自適應距離擴展與不可約表示相結合，以有效地捕捉晶體結構的週期性和對稱性，並且編碼器是一個基於等變點積注意力機制的 Transformer 網路。在 carbon_24、perov_5 和 mp_20 資料集上的實驗結果表明，TransVAE-CSP 在各種建模指標下，在結構重建和生成任務中優於現有方法，為晶體結構設計和最佳化提供了一個強大的工具。
 
-##### **"It Felt Like I Was Left in the Dark": Exploring Information Needs and Design Opportunities for Family Caregivers of Older Adult Patients in Critical Care Settings**
-2502.05115v1 by Shihan Fu, Bingsheng Yao, Smit Desai, Yuqi Hu, Yuling Sun, Samantha Stonbraker, Yanjun Gao, Elizabeth M. Goldberg, Dakuo Wang
+##### **On multi-token prediction for efficient LLM inference**
+2502.09419v1 by Somesh Mehra, Javier Alonso Garcia, Lukas Mauch
 
-Older adult patients constitute a rapidly growing subgroup of Intensive Care
-Unit (ICU) patients. In these situations, their family caregivers are expected
-to represent the unconscious patients to access and interpret patients' medical
-information. However, caregivers currently have to rely on overloaded
-clinicians for information updates and typically lack the health literacy to
-understand complex medical information. Our project aims to explore the
-information needs of caregivers of ICU older adult patients, from which we can
-propose design opportunities to guide future AI systems. The project begins
-with formative interviews with 11 caregivers to identify their challenges in
-accessing and interpreting medical information; From these findings, we then
-synthesize design requirements and propose an AI system prototype to cope with
-caregivers' challenges. The system prototype has two key features: a timeline
-visualization to show the AI extracted and summarized older adult patients' key
-medical events; and an LLM-based chatbot to provide context-aware informational
-support. We conclude our paper by reporting on the follow-up user evaluation of
-the system and discussing future AI-based systems for ICU caregivers of older
-adults.
+We systematically investigate multi-token prediction (MTP) capabilities
+within LLMs pre-trained for next-token prediction (NTP). We first show that
+such models inherently possess MTP capabilities via numerical marginalization
+over intermediate token probabilities, though performance is data-dependent and
+improves with model scale. Furthermore, we explore the challenges of
+integrating MTP heads into frozen LLMs and find that their hidden layers are
+strongly specialized for NTP, making adaptation non-trivial. Finally, we show
+that while joint training of MTP heads with the backbone improves performance,
+it cannot fully overcome this barrier, prompting further research in this
+direction. Our findings provide a deeper understanding of MTP applied to
+pretrained LLMs, informing strategies for accelerating inference through
+parallel token prediction.
 
-摘要：老年患者構成加護病房 (ICU) 患者中快速成長的子群。在這些情況下，預期他們的家庭照護者能代表無意識的患者取得並解讀患者的醫療資訊。然而，照護者目前必須依賴工作繁重的臨床醫師提供資訊更新，而且通常缺乏了解複雜醫療資訊的健康素養。我們的專案旨在探索 ICU 老年患者照護者的資訊需求，我們可以根據這些需求提出設計機會，以引導未來的 AI 系統。這個專案從對 11 位照護者的形成性訪談開始，以找出他們在取得和解讀醫療資訊方面的挑戰；根據這些發現，我們接著綜合設計需求，並提出一個 AI 系統原型，以應對照護者的挑戰。這個系統原型具有兩個關鍵特點：一個時間軸視覺化，以顯示 AI 萃取並摘要出的老年患者關鍵醫療事件；以及一個基於 LLM 的聊天機器人，以提供情境感知的資訊支援。我們透過報告系統的後續使用者評估，以及討論未來針對老年人 ICU 照護者的 AI 系統，來總結我們的論文。
+摘要：我們系統性地研究了在預先訓練下用於下一個代幣預測 (NTP) 的 LLM 中的多代幣預測 (MTP) 功能。我們首先表明，此類模型透過中間代幣機率的數值邊際化本質上具備 MTP 功能，儘管效能依賴於資料，且會隨著模型規模而提升。此外，我們探討了將 MTP 頭整合到凍結 LLM 中的挑戰，發現其隱藏層高度專門用於 NTP，使得適應變得不簡單。最後，我們顯示，儘管 MTP 頭與主幹的聯合訓練會提升效能，但無法完全克服此障礙，促使我們進一步研究這個方向。我們的發現提供了對應用於預先訓練 LLM 的 MTP 更深入的理解，並為透過平行代幣預測加速推論提供策略。
 
-##### **Mitigating Unintended Memorization with LoRA in Federated Learning for LLMs**
-2502.05087v1 by Thierry Bossy, Julien Vignoud, Tahseen Rabbani, Juan R. Troncoso Pastoriza, Martin Jaggi
+##### **SQuARE: Sequential Question Answering Reasoning Engine for Enhanced Chain-of-Thought in Large Language Models**
+2502.09390v1 by Daniel Fleischer, Moshe Berchansky, Gad Markovits, Moshe Wasserblat
 
-Federated learning (FL) is a popular paradigm for collaborative training
-which avoids direct data exposure between clients. However, data privacy issues
-still remain: FL-trained large language models are capable of memorizing and
-completing phrases and sentences contained in training data when given with
-their prefixes. Thus, it is possible for adversarial and honest-but-curious
-clients to recover training data of other participants simply through targeted
-prompting. In this work, we demonstrate that a popular and simple fine-tuning
-strategy, low-rank adaptation (LoRA), reduces memorization during FL up to a
-factor of 10. We study this effect by performing a medical question-answering
-fine-tuning task and injecting multiple replicas of out-of-distribution
-sensitive sequences drawn from an external clinical dataset. We observe a
-reduction in memorization for a wide variety of Llama 2 and 3 models, and find
-that LoRA can reduce memorization in centralized learning as well. Furthermore,
-we show that LoRA can be combined with other privacy-preserving techniques such
-as gradient clipping and Gaussian noising, secure aggregation, and Goldfish
-loss to further improve record-level privacy while maintaining performance.
+In the rapidly evolving field of Natural Language Processing, Large Language
+Models (LLMs) are tasked with increasingly complex reasoning challenges.
+Traditional methods like chain-of-thought prompting have shown promise but
+often fall short in fully leveraging a model's reasoning capabilities. This
+paper introduces SQuARE (Sequential Question Answering Reasoning Engine), a
+novel prompting technique designed to improve reasoning through a
+self-interrogation paradigm. Building upon CoT frameworks, SQuARE prompts
+models to generate and resolve multiple auxiliary questions before tackling the
+main query, promoting a more thorough exploration of various aspects of a
+topic. Our expansive evaluations, conducted with Llama 3 and GPT-4o models
+across multiple question-answering datasets, demonstrate that SQuARE
+significantly surpasses traditional CoT prompts and existing
+rephrase-and-respond methods. By systematically decomposing queries, SQuARE
+advances LLM capabilities in reasoning tasks. The code is publicly available at
+https://github.com/IntelLabs/RAG-FiT/tree/square.
 
-摘要：聯邦學習 (FL) 是一種流行的協作訓練範例，可避免客戶端之間直接公開資料。然而，資料隱私問題仍然存在：經過 FL 訓練的大型語言模型能夠記憶並完成訓練資料中包含的片語和句子，只要給予其前綴即可。因此，對抗和誠實但好奇的客戶端有可能僅透過目標提示來恢復其他參與者的訓練資料。在這項工作中，我們證明了一種流行且簡單的微調策略，低秩適應 (LoRA)，可將 FL 期間的記憶減少多達 10 倍。我們透過執行醫學問答微調任務並注入從外部臨床資料集抽取的非分佈敏感序列的多次複製品來研究此效應。我們觀察到各種 Llama 2 和 3 模型的記憶力降低，並發現 LoRA 也能減少集中式學習中的記憶力。此外，我們展示 LoRA 可以與其他隱私保護技術結合使用，例如梯度裁剪和高斯雜訊、安全聚合和 Goldfish 損失，以進一步改善記錄級隱私，同時維持效能。
+摘要：在快速發展的自然語言處理領域中，大型語言模型 (LLM) 負責越來越複雜的推理挑戰。
+傳統方法（如思考鏈提示）已展現潛力，但通常無法充分利用模型的推理能力。本文介紹 SQuARE（順序式問答推理引擎），這是一種新穎的提示技術，旨在透過自我提問模式來改善推理。建立在 CoT 架構之上，SQuARE 提示模型在處理主要查詢之前產生並解決多個輔助問題，促進對某個主題的各個面向進行更徹底的探討。我們使用 Llama 3 和 GPT-4o 模型對多個問答資料集進行廣泛評估，結果顯示 SQuARE 明顯優於傳統 CoT 提示和現有的改寫並回應方法。透過系統性地分解查詢，SQuARE 提升了 LLM 在推理任務中的能力。程式碼已公開於 https://github.com/IntelLabs/RAG-FiT/tree/square。
 
-##### **MedMimic: Physician-Inspired Multimodal Fusion for Early Diagnosis of Fever of Unknown Origin**
-2502.04794v1 by Minrui Chen, Yi Zhou, Huidong Jiang, Yuhan Zhu, Guanjie Zou, Minqi Chen, Rong Tian, Hiroto Saigo
+##### **Truth Knows No Language: Evaluating Truthfulness Beyond English**
+2502.09387v1 by Blanca Calvo Figueras, Eneko Sagarzazu, Julen Etxaniz, Jeremy Barnes, Pablo Gamallo, Iria De Dios Flores, Rodrigo Agerri
 
-Fever of unknown origin FUO remains a diagnostic challenge. MedMimic is
-introduced as a multimodal framework inspired by real-world diagnostic
-processes. It uses pretrained models such as DINOv2, Vision Transformer, and
-ResNet-18 to convert high-dimensional 18F-FDG PET/CT imaging into
-low-dimensional, semantically meaningful features. A learnable
-self-attention-based fusion network then integrates these imaging features with
-clinical data for classification. Using 416 FUO patient cases from Sichuan
-University West China Hospital from 2017 to 2023, the multimodal fusion
-classification network MFCN achieved macro-AUROC scores ranging from 0.8654 to
-0.9291 across seven tasks, outperforming conventional machine learning and
-single-modality deep learning methods. Ablation studies and five-fold
-cross-validation further validated its effectiveness. By combining the
-strengths of pretrained large models and deep learning, MedMimic offers a
-promising solution for disease classification.
+We introduce a professionally translated extension of the TruthfulQA
+benchmark designed to evaluate truthfulness in Basque, Catalan, Galician, and
+Spanish. Truthfulness evaluations of large language models (LLMs) have
+primarily been conducted in English. However, the ability of LLMs to maintain
+truthfulness across languages remains under-explored. Our study evaluates 12
+state-of-the-art open LLMs, comparing base and instruction-tuned models using
+human evaluation, multiple-choice metrics, and LLM-as-a-Judge scoring. Our
+findings reveal that, while LLMs perform best in English and worst in Basque
+(the lowest-resourced language), overall truthfulness discrepancies across
+languages are smaller than anticipated. Furthermore, we show that
+LLM-as-a-Judge correlates more closely with human judgments than
+multiple-choice metrics, and that informativeness plays a critical role in
+truthfulness assessment. Our results also indicate that machine translation
+provides a viable approach for extending truthfulness benchmarks to additional
+languages, offering a scalable alternative to professional translation.
+Finally, we observe that universal knowledge questions are better handled
+across languages than context- and time-dependent ones, highlighting the need
+for truthfulness evaluations that account for cultural and temporal
+variability. Dataset and code are publicly available under open licenses.
 
-摘要：不明原因發燒 (FUO) 仍然是診斷上的挑戰。MedMimic 是一個多模式架構，靈感來自於真實世界的診斷過程。它使用預先訓練的模型，例如 DINOv2、視覺轉換器和 ResNet-18，將高維 18F-FDG PET/CT 影像轉換為低維、語義有意義的特徵。一個可學習的自注意力融合網路接著將這些影像特徵與臨床資料整合，用於分類。使用 2017 年至 2023 年四川大學華西醫院的 416 個 FUO 病患病例，多模式融合分類網路 MFCN 在七項任務中達到了 0.8654 到 0.9291 的巨觀 AUROC 分數，優於傳統機器學習和單一模式深度學習方法。消融研究和五倍交叉驗證進一步驗證了其有效性。MedMimic 結合了預先訓練的大模型和深度學習的優點，為疾病分類提供了一個有前景的解決方案。
+摘要：我們針對 TruthfulQA 推出專業翻譯的延伸版本，旨在評估巴斯克語、加泰隆尼亞語、加利西亞語和西班牙語中的真實性。大型語言模型 (LLM) 的真實性評估主要以英語進行。然而，LLM 在不同語言中維持真實性的能力仍未得到充分探索。我們的研究評估了 12 個最先進的開放 LLM，使用人類評估、多選項指標和 LLM 作為評分標準比較基礎和指令調整模型。我們的研究結果表明，雖然 LLM 在英語中的表現最好，而在巴斯克語（資源最少的語言）中的表現最差，但整體上不同語言之間的真實性差異小於預期。此外，我們表明，與多選項指標相比，LLM 作為評分標準與人類判斷更密切相關，而且信息豐富性在真實性評估中發揮著至關重要的作用。我們的結果還表明，機器翻譯提供了一種可行的途徑，可以將真實性基準擴展到其他語言，從而提供了一種可擴展的專業翻譯替代方案。最後，我們觀察到，與上下文和時間依賴的問題相比，通用知識問題在不同語言之間的處理效果更好，這突顯了考慮文化和時間可變性的真實性評估的必要性。數據集和代碼在開放許可下公開可用。
 
-##### **MedGNN: Towards Multi-resolution Spatiotemporal Graph Learning for Medical Time Series Classification**
-2502.04515v1 by Wei Fan, Jingru Fei, Dingyu Guo, Kun Yi, Xiaozhuang Song, Haolong Xiang, Hangting Ye, Min Li
+##### **A Deep Inverse-Mapping Model for a Flapping Robotic Wing**
+2502.09378v1 by Hadar Sharvit, Raz Karl, Tsevi Beatus
 
-Medical time series has been playing a vital role in real-world healthcare
-systems as valuable information in monitoring health conditions of patients.
-Accurate classification for medical time series, e.g., Electrocardiography
-(ECG) signals, can help for early detection and diagnosis. Traditional methods
-towards medical time series classification rely on handcrafted feature
-extraction and statistical methods; with the recent advancement of artificial
-intelligence, the machine learning and deep learning methods have become more
-popular. However, existing methods often fail to fully model the complex
-spatial dynamics under different scales, which ignore the dynamic
-multi-resolution spatial and temporal joint inter-dependencies. Moreover, they
-are less likely to consider the special baseline wander problem as well as the
-multi-view characteristics of medical time series, which largely hinders their
-prediction performance. To address these limitations, we propose a
-Multi-resolution Spatiotemporal Graph Learning framework, MedGNN, for medical
-time series classification. Specifically, we first propose to construct
-multi-resolution adaptive graph structures to learn dynamic multi-scale
-embeddings. Then, to address the baseline wander problem, we propose Difference
-Attention Networks to operate self-attention mechanisms on the finite
-difference for temporal modeling. Moreover, to learn the multi-view
-characteristics, we utilize the Frequency Convolution Networks to capture
-complementary information of medical time series from the frequency domain. In
-addition, we introduce the Multi-resolution Graph Transformer architecture to
-model the dynamic dependencies and fuse the information from different
-resolutions. Finally, we have conducted extensive experiments on multiple
-medical real-world datasets that demonstrate the superior performance of our
-method. Our Code is available.
+In systems control, the dynamics of a system are governed by modulating its
+inputs to achieve a desired outcome. For example, to control the thrust of a
+quad-copter propeller the controller modulates its rotation rate, relying on a
+straightforward mapping between the input rotation rate and the resulting
+thrust. This mapping can be inverted to determine the rotation rate needed to
+generate a desired thrust. However, in complex systems, such as flapping-wing
+robots where intricate fluid motions are involved, mapping inputs (wing
+kinematics) to outcomes (aerodynamic forces) is nontrivial and inverting this
+mapping for real-time control is computationally impractical. Here, we report a
+machine-learning solution for the inverse mapping of a flapping-wing system
+based on data from an experimental system we have developed. Our model learns
+the input wing motion required to generate a desired aerodynamic force outcome.
+We used a sequence-to-sequence model tailored for time-series data and
+augmented it with a novel adaptive-spectrum layer that implements
+representation learning in the frequency domain. To train our model, we
+developed a flapping wing system that simultaneously measures the wing's
+aerodynamic force and its 3D motion using high-speed cameras. We demonstrate
+the performance of our system on an additional open-source dataset of a
+flapping wing in a different flow regime. Results show superior performance
+compared with more complex state-of-the-art transformer-based models, with 11%
+improvement on the test datasets median loss. Moreover, our model shows
+superior inference time, making it practical for onboard robotic control. Our
+open-source data and framework may improve modeling and real-time control of
+systems governed by complex dynamics, from biomimetic robots to biomedical
+devices.
 
-摘要：<paragraph>醫療時間序列在真實世界的醫療保健系統中扮演著至關重要的角色，作為監控患者健康狀況的寶貴資訊。
-準確分類醫療時間序列，例如心電圖 (ECG) 訊號，有助於早期偵測和診斷。傳統的醫療時間序列分類方法仰賴手工特徵萃取和統計方法；隨著人工智慧的最新進展，機器學習和深度學習方法變得更為普及。然而，現有方法通常無法完全建模不同尺度下的複雜空間動態，忽略了動態多解析度空間和時間關節相互依賴性。此外，它們不太可能考慮特殊的基線漂移問題以及醫療時間序列的多視角特性，這在很大程度上阻礙了它們的預測效能。為了解決這些限制，我們提出了一個多解析度時空圖形學習架構 MedGNN，用於醫療時間序列分類。具體來說，我們首先提出構建多解析度自適應圖形結構以學習動態多尺度嵌入。然後，為了解決基線漂移問題，我們提出差分注意力網路，對時間建模的有限差分運算自注意力機制。此外，為了學習多視角特性，我們利用頻率卷積網路從頻域擷取醫療時間序列的互補資訊。此外，我們引入了多解析度圖形Transformer架構來建模動態依賴性，並融合來自不同解析度的資訊。最後，我們對多個醫療真實世界資料集進行了廣泛的實驗，證明了我們方法的優異效能。我們的程式碼已公開。</paragraph>
+摘要：<paragraph>在系統控制中，系統的動態受調節其輸入以實現所需結果的影響。例如，為了控制四軸旋翼推進器的推力，控制器會調節其旋轉速率，依賴於輸入旋轉速率和所產生的推力之間的直接映射。此映射可以反轉以確定產生所需推力所需的旋轉速率。然而，在複雜的系統中，例如涉及複雜流體運動的拍打式機翼機器人，將輸入（機翼運動學）映射到輸出（空氣動力）並非易事，並且反轉此映射以進行實時控制在計算上不切實際。在此，我們報告了一個基於我們開發的實驗系統數據的拍打式機翼系統反向映射的機器學習解決方案。我們的模型學習產生所需空氣動力結果所需的輸入機翼運動。我們使用了一個專門針對時間序列數據的序列到序列模型，並用一個在頻域中實現表示學習的新型自適應譜層對其進行了擴充。為了訓練我們的模型，我們開發了一個拍打式機翼系統，該系統同時使用高速相機測量機翼的空氣動力和其 3D 運動。我們在一個不同的流動狀態下拍打機翼的另一個開源數據集上展示了我們系統的性能。結果表明，與更複雜的基於Transformer的最先進模型相比，性能優異，在測試數據集中損失中值改進了 11%。此外，我們的模型顯示出優異的推理時間，使其適用於機載機器人控制。我們的開源數據和框架可以改進受複雜動態支配的系統的建模和實時控制，從仿生機器人到生物醫學設備。</paragraph>
 
-##### **Integrating Generative Artificial Intelligence in ADRD: A Framework for Streamlining Diagnosis and Care in Neurodegenerative Diseases**
-2502.06842v1 by Andrew G. Breithaupt, Alice Tang, Bruce L. Miller, Pedro Pinheiro-Chagas
+##### **Language Agents as Digital Representatives in Collective Decision-Making**
+2502.09369v1 by Daniel Jarrett, Miruna Pîslar, Michiel A. Bakker, Michael Henry Tessler, Raphael Köster, Jan Balaguer, Romuald Elie, Christopher Summerfield, Andrea Tacchetti
 
-Healthcare systems are struggling to meet the growing demand for neurological
-care, with challenges particularly acute in Alzheimer's disease and related
-dementias (ADRD). While artificial intelligence research has often focused on
-identifying patterns beyond human perception, implementing such predictive
-capabilities remains challenging as clinicians cannot readily verify insights
-they cannot themselves detect. We propose that large language models (LLMs)
-offer more immediately practical applications by enhancing clinicians'
-capabilities in three critical areas: comprehensive data collection,
-interpretation of complex clinical information, and timely application of
-relevant medical knowledge. These challenges stem from limited time for proper
-diagnosis, growing data complexity, and an overwhelming volume of medical
-literature that exceeds any clinician's capacity to fully master. We present a
-framework for responsible AI integration that leverages LLMs' ability to
-communicate effectively with both patients and providers while maintaining
-human oversight. This approach prioritizes standardized, high-quality data
-collection to enable a system that learns from every patient encounter while
-incorporating the latest clinical evidence, continuously improving care
-delivery. We begin to address implementation challenges and initiate important
-discussions around ethical considerations and governance needs. While developed
-for ADRD, this roadmap provides principles for responsible AI integration
-across neurology and other medical specialties, with potential to improve
-diagnostic accuracy, reduce care disparities, and advance clinical knowledge
-through a learning healthcare system.
+Consider the process of collective decision-making, in which a group of
+individuals interactively select a preferred outcome from among a universe of
+alternatives. In this context, "representation" is the activity of making an
+individual's preferences present in the process via participation by a proxy
+agent -- i.e. their "representative". To this end, learned models of human
+behavior have the potential to fill this role, with practical implications for
+multi-agent scenario studies and mechanism design. In this work, we investigate
+the possibility of training \textit{language agents} to behave in the capacity
+of representatives of human agents, appropriately expressing the preferences of
+those individuals whom they stand for. First, we formalize the setting of
+\textit{collective decision-making} -- as the episodic process of interaction
+between a group of agents and a decision mechanism. On this basis, we then
+formalize the problem of \textit{digital representation} -- as the simulation
+of an agent's behavior to yield equivalent outcomes from the mechanism.
+Finally, we conduct an empirical case study in the setting of
+\textit{consensus-finding} among diverse humans, and demonstrate the
+feasibility of fine-tuning large language models to act as digital
+representatives.
 
-摘要：醫療體系正努力滿足日益增長的神經照護需求，其中阿茲海默症和相關失智症 (ADRD) 的挑戰特別嚴重。雖然人工智慧研究通常專注於識別人類感知之外的模式，但實作此類預測功能仍然具有挑戰性，因為臨床醫生無法輕易驗證他們自己無法偵測到的見解。我們提出大型語言模型 (LLM) 可透過提升臨床醫生在三個關鍵領域的能力，提供更直接且實用的應用：全面的資料收集、複雜臨床資訊的詮釋，以及適時應用相關的醫學知識。這些挑戰源自於適當診斷時間有限、資料複雜性日益增加，以及龐大的醫學文獻量超過任何臨床醫生所能完全掌握的容量。我們提出了一個負責任的 AI 整合架構，利用 LLM 與患者和提供者有效溝通的能力，同時維持人為監督。此方法優先考慮標準化、高品質的資料收集，以建立一個從每次患者接觸中學習的系統，同時納入最新的臨床證據，持續改善照護提供。我們開始探討實作挑戰，並展開關於倫理考量和治理需求的重要討論。儘管是為 ADRD 所開發，此藍圖提供了神經科和其他醫學專科負責任 AI 整合的原則，有潛力透過學習型醫療保健系統改善診斷準確性、減少照護差異，並推進臨床知識。
+摘要：考慮集體決策的過程，其中一群個人互動式地從一系列備選方案中選擇一個偏好的結果。在此脈絡中，「代表」是透過代理人（即他們的「代表」）參與，讓個人的偏好出現在這個過程中的活動。為此，人類行為的學習模型有可能填補這個角色，對多重代理人情境研究和機制設計具有實際意義。在這項工作中，我們探討訓練「語言代理人」的可能性，以代表人類代理人的身分行事，適當地表達他們所代表的那些個人的偏好。首先，我們將「集體決策」的設定形式化，作為一群代理人與決策機制之間互動的間歇性過程。在此基礎上，我們接著將「數位代表」的問題形式化，作為模擬代理人的行為，從機制中產生等效結果。最後，我們在多元人類的「共識尋求」設定中進行一個實證個案研究，並展示微調大型語言模型以作為數位代表的可行性。
 
-##### **Primary Care Diagnoses as a Reliable Predictor for Orthopedic Surgical Interventions**
-2502.04423v1 by Khushboo Verma, Alan Michels, Ergi Gumusaneli, Shilpa Chitnis, Smita Sinha Kumar, Christopher Thompson, Lena Esmail, Guruprasath Srinivasan, Chandini Panchada, Sushovan Guha, Satwant Kumar
+##### **Neural Spatiotemporal Point Processes: Trends and Challenges**
+2502.09341v1 by Sumantrak Mukherjee, Mouad Elhamdi, George Mohler, David A. Selby, Yao Xie, Sebastian Vollmer, Gerrit Grossmann
 
-Referral workflow inefficiencies, including misaligned referrals and delays,
-contribute to suboptimal patient outcomes and higher healthcare costs. In this
-study, we investigated the possibility of predicting procedural needs based on
-primary care diagnostic entries, thereby improving referral accuracy,
-streamlining workflows, and providing better care to patients. A de-identified
-dataset of 2,086 orthopedic referrals from the University of Texas Health at
-Tyler was analyzed using machine learning models built on Base General
-Embeddings (BGE) for semantic extraction. To ensure real-world applicability,
-noise tolerance experiments were conducted, and oversampling techniques were
-employed to mitigate class imbalance. The selected optimum and parsimonious
-embedding model demonstrated high predictive accuracy (ROC-AUC: 0.874, Matthews
-Correlation Coefficient (MCC): 0.540), effectively distinguishing patients
-requiring surgical intervention. Dimensionality reduction techniques confirmed
-the model's ability to capture meaningful clinical relationships. A threshold
-sensitivity analysis identified an optimal decision threshold (0.30) to balance
-precision and recall, maximizing referral efficiency. In the predictive
-modeling analysis, the procedure rate increased from 11.27% to an optimal
-60.1%, representing a 433% improvement with significant implications for
-operational efficiency and healthcare revenue.
-  The results of our study demonstrate that referral optimization can enhance
-primary and surgical care integration. Through this approach, precise and
-timely predictions of procedural requirements can be made, thereby minimizing
-delays, improving surgical planning, and reducing administrative burdens. In
-addition, the findings highlight the potential of clinical decision support as
-a scalable solution for improving patient outcomes and the efficiency of the
-healthcare system.
+Spatiotemporal point processes (STPPs) are probabilistic models for events
+occurring in continuous space and time. Real-world event data often exhibit
+intricate dependencies and heterogeneous dynamics. By incorporating modern deep
+learning techniques, STPPs can model these complexities more effectively than
+traditional approaches. Consequently, the fusion of neural methods with STPPs
+has become an active and rapidly evolving research area. In this review, we
+categorize existing approaches, unify key design choices, and explain the
+challenges of working with this data modality. We further highlight emerging
+trends and diverse application domains. Finally, we identify open challenges
+and gaps in the literature.
 
-摘要：轉診流程效率低落，包括轉診不當和延誤，
-導致次優的患者結果和更高的醫療保健成本。在這
-項研究中，我們探討了根據初級保健診斷條目預測程序需求的可能性，從而提高轉診準確性，
-簡化工作流程，並為患者提供更好的照護。一個去識別化
-德克薩斯大學健康中心的 2,086 個骨科轉診的資料集
-泰勒使用建立在基本通用
-語義提取的嵌入 (BGE) 上的機器學習模型進行分析。為了確保現實世界的適用性，
-進行了噪聲容忍度實驗，並採用了過採樣技術來減輕類別不平衡。所選的最佳和簡約
-嵌入模型展示了高預測準確度 (ROC-AUC：0.874，馬修斯
-相關系數 (MCC)：0.540)，有效區分需要手術干預的患者。降維
-技術證實了模型捕捉有意義的臨床關係的能力。閾值
-敏感性分析確定了一個最佳決策閾值 (0.30) 來平衡
-精確度和召回率，最大化轉診效率。在預測中
-建模分析中，程序率從 11.27% 增加到最佳的
-60.1%，代表 433% 的改進，對運營效率和醫療保健收入具有重大影響。
-我們研究的結果表明，轉診優化可以增強
-初級和外科護理整合。通過這種方法，可以對程序需求進行準確及時的預測，從而最大程度地減少
-延誤，改善手術計劃，並減輕行政負擔。此外，研究結果強調了臨床決策支持作為
-一個可擴展的解決方案的潛力，用於改善患者結果和醫療保健系統的效率。
+摘要：時空點過程 (STPP) 是事件在連續時空發生的機率模型。真實世界的事件資料通常會展現錯綜複雜的依賴關係和異質動態。透過結合現代深度學習技術，STPP 可以比傳統方法更有效地模擬這些複雜性。因此，神經方法與 STPP 的融合已成為一個活躍且快速發展的研究領域。在本篇評論中，我們對現有方法進行分類、統一關鍵設計選擇，並說明處理這種資料模式的挑戰。我們進一步強調新興趨勢和多樣化的應用領域。最後，我們找出文獻中的開放性挑戰和空白。
 
-##### **Automatic quantification of breast cancer biomarkers from multiple 18F-FDG PET image segmentation**
-2502.04083v1 by Tewele W. Tareke, Neree Payan, Alexandre Cochet, Laurent Arnould, Benoit Presles, Jean-Marc Vrigneaud, Fabrice Meriaudeau, Alain Lalande
+##### **Graph Diffusion Network for Drug-Gene Prediction**
+2502.09335v1 by Jiayang Wu, Wensheng Gan, Philip S. Yu
 
-Neoadjuvant chemotherapy (NAC) has become a standard clinical practice for
-tumor downsizing in breast cancer with 18F-FDG Positron Emission Tomography
-(PET). Our work aims to leverage PET imaging for the segmentation of breast
-lesions. The focus is on developing an automated system that accurately
-segments primary tumor regions and extracts key biomarkers from these areas to
-provide insights into the evolution of breast cancer following the first course
-of NAC. 243 baseline 18F-FDG PET scans (PET_Bl) and 180 follow-up 18F-FDG PET
-scans (PET_Fu) were acquired before and after the first course of NAC,
-respectively. Firstly, a deep learning-based breast tumor segmentation method
-was developed. The optimal baseline model (model trained on baseline exams) was
-fine-tuned on 15 follow-up exams and adapted using active learning to segment
-tumor areas in PET_Fu. The pipeline computes biomarkers such as maximum
-standardized uptake value (SUVmax), metabolic tumor volume (MTV), and total
-lesion glycolysis (TLG) to evaluate tumor evolution between PET_Fu and PET_Bl.
-Quality control measures were employed to exclude aberrant outliers. The nnUNet
-deep learning model outperformed in tumor segmentation on PET_Bl, achieved a
-Dice similarity coefficient (DSC) of 0.89 and a Hausdorff distance (HD) of 3.52
-mm. After fine-tuning, the model demonstrated a DSC of 0.78 and a HD of 4.95 mm
-on PET_Fu exams. Biomarkers analysis revealed very strong correlations whatever
-the biomarker between manually segmented and automatically predicted regions.
-The significant average decrease of SUVmax, MTV and TLG were 5.22, 11.79 cm3
-and 19.23 cm3, respectively. The presented approach demonstrates an automated
-system for breast tumor segmentation from 18F-FDG PET. Thanks to the extracted
-biomarkers, our method enables the automatic assessment of cancer progression.
+Predicting drug-gene associations is crucial for drug development and disease
+treatment. While graph neural networks (GNN) have shown effectiveness in this
+task, they face challenges with data sparsity and efficient contrastive
+learning implementation. We introduce a graph diffusion network for drug-gene
+prediction (GDNDGP), a framework that addresses these limitations through two
+key innovations. First, it employs meta-path-based homogeneous graph learning
+to capture drug-drug and gene-gene relationships, ensuring similar entities
+share embedding spaces. Second, it incorporates a parallel diffusion network
+that generates hard negative samples during training, eliminating the need for
+exhaustive negative sample retrieval. Our model achieves superior performance
+on the DGIdb 4.0 dataset and demonstrates strong generalization capability on
+tripartite drug-gene-disease networks. Results show significant improvements
+over existing methods in drug-gene prediction tasks, particularly in handling
+complex heterogeneous relationships. The source code is publicly available at
+https://github.com/csjywu1/GDNDGP.
 
-摘要：新辅助化疗 (NAC) 已成为乳腺癌中采用 18F-FDG 正电子发射断层扫描 (PET) 进行肿瘤缩小的标准临床实践。我们的工作旨在利用 PET 影像分割乳腺病变。重点在于开发一个自动系统，该系统可以准确分割原发性肿瘤区域并从这些区域提取关键生物标记，以深入了解乳腺癌在第一疗程 NAC 后的演变。分别在第一疗程 NAC 之前和之后采集了 243 例基线 18F-FDG PET 扫描 (PET_Bl) 和 180 例随访 18F-FDG PET 扫描 (PET_Fu)。首先，开发了一种基于深度学习的乳腺肿瘤分割方法。对 15 例随访检查对最优基线模型（在基线检查中训练的模型）进行了微调，并使用主动学习对 PET_Fu 中的肿瘤区域进行了分割。该管道计算诸如最大标准摄取值 (SUVmax)、代谢肿瘤体积 (MTV) 和总病灶糖酵解 (TLG) 等生物标记，以评估 PET_Fu 和 PET_Bl 之间的肿瘤演变。采用质量控制措施来排除异常值。nnUNet 深度学习模型在 PET_Bl 上的肿瘤分割方面表现出色，达到 0.89 的 Dice 相似性系数 (DSC) 和 3.52 毫米的 Hausdorff 距离 (HD)。微调后，该模型在 PET_Fu 检查中显示出 0.78 的 DSC 和 4.95 毫米的 HD。无论手动分割区域和自动预测区域之间的生物标记如何，生物标记分析都显示出非常强的相关性。SUVmax、MTV 和 TLG 的平均显着下降分别为 5.22、11.79 cm3 和 19.23 cm3。所提出的方法展示了一个用于从 18F-FDG PET 分割乳腺肿瘤的自动化系统。由于提取了生物标记，我们的方法能够自动评估癌症进展。
+摘要：預測藥物基因關聯對藥物開發和疾病治療至關重要。雖然圖神經網路 (GNN) 已顯示在這個任務中的有效性，但它們在資料稀疏性和高效對比學習實作方面面臨挑戰。我們引入了一個用於藥物基因預測的圖擴散網路 (GDNDGP)，這是一個透過兩項關鍵創新來解決這些限制的框架。首先，它採用基於元路徑的同質圖學習來捕捉藥物-藥物和基因-基因關係，確保類似實體共享嵌入空間。其次，它整合了一個並行擴散網路，在訓練期間產生困難的負面樣本，消除了對詳盡負面樣本擷取的需求。我們的模型在 DGIdb 4.0 資料集上取得了卓越的效能，並在三方藥物-基因-疾病網路中展現強大的概化能力。結果顯示在藥物基因預測任務中，相較於現有方法有顯著的進步，特別是在處理複雜的異質關係方面。原始碼已公開於 https://github.com/csjywu1/GDNDGP。
 
-##### **Generalize Drug Response Prediction by Latent Independent Projection for Asymmetric Constrained Domain Generalization**
-2502.04034v1 by Ran Song, Yinpu Bai, Hui Liu
+##### **Beyond English: The Impact of Prompt Translation Strategies across Languages and Tasks in Multilingual LLMs**
+2502.09331v1 by Itai Mondshine, Tzuf Paz-Argaman, Reut Tsarfaty
 
-The accurate prediction of drug responses remains a formidable challenge,
-particularly at the single-cell level and in clinical treatment contexts. Some
-studies employ transfer learning techniques to predict drug responses in
-individual cells and patients, but they require access to target-domain data
-during training, which is often unavailable or only obtainable in future. In
-this study, we propose a novel domain generalization framework, termed
-panCancerDR, to address this challenge. We conceptualize each cancer type as a
-distinct source domain, with its cell lines serving as domain-specific samples.
-Our primary objective is to extract domain-invariant features from the
-expression profiles of cell lines across diverse cancer types, thereby
-generalize the predictive capacity to out-of-distribution samples. To enhance
-robustness, we introduce a latent independence projection (LIP) module that
-encourages the encoder to extract informative yet non-redundant features. Also,
-we propose an asymmetric adaptive clustering constraint, which clusters
-drug-sensitive samples into a compact group while drives resistant samples
-dispersed across separate clusters in the latent space. Our empirical
-experiments demonstrate that panCancerDR effectively learns task-relevant
-features from diverse source domains, and achieves accurate predictions of drug
-response for unseen cancer type during training. Furthermore, when evaluated on
-single-cell and patient-level prediction tasks, our model-trained solely on in
-vitro cell line data without access to target-domain information-consistently
-outperforms and matched current state-of-the-art methods. These findings
-highlights the potential of our method for real-world clinical applications.
+Despite advances in the multilingual capabilities of Large Language Models
+(LLMs) across diverse tasks, English remains the dominant language for LLM
+research and development. So, when working with a different language, this has
+led to the widespread practice of pre-translation, i.e., translating the task
+prompt into English before inference. Selective pre-translation, a more
+surgical approach, focuses on translating specific prompt components. However,
+its current use is sporagic and lacks a systematic research foundation.
+Consequently, the optimal pre-translation strategy for various multilingual
+settings and tasks remains unclear. In this work, we aim to uncover the optimal
+setup for pre-translation by systematically assessing its use. Specifically, we
+view the prompt as a modular entity, composed of four functional parts:
+instruction, context, examples, and output, either of which could be translated
+or not. We evaluate pre-translation strategies across 35 languages covering
+both low and high-resource languages, on various tasks including Question
+Answering (QA), Natural Language Inference (NLI), Named Entity Recognition
+(NER), and Abstractive Summarization. Our experiments show the impact of
+factors as similarity to English, translation quality and the size of
+pre-trained data, on the model performance with pre-translation. We suggest
+practical guidelines for choosing optimal strategies in various multilingual
+settings.
 
-摘要：<paragraph>準確預測藥物反應仍然是一項艱鉅的挑戰，特別是在單細胞層級和臨床治療背景中。一些研究採用遷移學習技術來預測個別細胞和患者的藥物反應，但它們需要在訓練期間存取目標網域資料，而這些資料通常無法取得，或只能在未來取得。在這項研究中，我們提出一個新穎的網域概化架構，稱為 panCancerDR，以應對這項挑戰。我們將每種類型的癌症概念化為一個不同的來源網域，其細胞株作為特定網域的樣本。我們的首要目標是從不同癌症類型的細胞株表現特徵中萃取網域不變特徵，從而將預測能力概化到分布外的樣本。為了增強穩健性，我們引入一個潛在獨立投影 (LIP) 模組，鼓勵編碼器萃取有資訊但非冗餘的特徵。此外，我們提出一個非對稱自適應聚類約束，將對藥物敏感的樣本聚類到一個緊湊的群組中，同時驅動抗藥性樣本分散在潛在空間中的不同群組中。我們的實證實驗證明，panCancerDR 有效地從不同的來源網域學習與任務相關的特徵，並在訓練期間對未見的癌症類型實現準確的藥物反應預測。此外，當在單細胞和患者層級預測任務中進行評估時，我們的模型僅在體外細胞株資料上訓練，而沒有存取目標網域資訊，始終優於並符合當前的最新方法。這些發現突顯了我們的方法在實際臨床應用中的潛力。</paragraph>
+摘要：儘管大型語言模型 (LLM) 在各種任務中的多語言能力有進步，英語仍然是 LLM 研究和開發的主導語言。因此，在使用不同語言時，這導致了預翻譯的廣泛實務，即在推理之前將任務提示翻譯成英語。選擇性預翻譯是一種更精準的方法，專注於翻譯特定提示組成部分。然而，目前的使用是零星的，缺乏系統性的研究基礎。因此，各種多語言設定和任務的最佳預翻譯策略仍不清楚。在這項工作中，我們旨在透過系統性評估預翻譯的使用，找出其最佳設定。具體來說，我們將提示視為一個模組化實體，由四個功能部分組成：說明、背景、範例和輸出，其中任何一個都可以翻譯或不翻譯。我們在 35 種語言中評估預翻譯策略，涵蓋低資源語言和高資源語言，以及各種任務，包括問答 (QA)、自然語言推理 (NLI)、命名實體識別 (NER) 和抽象摘要。我們的實驗顯示了與英語的相似性、翻譯品質和預訓練資料大小等因素對預翻譯模型效能的影響。我們建議在各種多語言設定中選擇最佳策略的實用指南。
 
-##### **MedRAG: Enhancing Retrieval-augmented Generation with Knowledge Graph-Elicited Reasoning for Healthcare Copilot**
-2502.04413v1 by Xuejiao Zhao, Siyan Liu, Su-Yin Yang, Chunyan Miao
+##### **A Judge-free LLM Open-ended Generation Benchmark Based on the Distributional Hypothesis**
+2502.09316v1 by Kentaro Imajo, Masanori Hirano, Shuji Suzuki, Hiroaki Mikami
 
-Retrieval-augmented generation (RAG) is a well-suited technique for
-retrieving privacy-sensitive Electronic Health Records (EHR). It can serve as a
-key module of the healthcare copilot, helping reduce misdiagnosis for
-healthcare practitioners and patients. However, the diagnostic accuracy and
-specificity of existing heuristic-based RAG models used in the medical domain
-are inadequate, particularly for diseases with similar manifestations. This
-paper proposes MedRAG, a RAG model enhanced by knowledge graph (KG)-elicited
-reasoning for the medical domain that retrieves diagnosis and treatment
-recommendations based on manifestations. MedRAG systematically constructs a
-comprehensive four-tier hierarchical diagnostic KG encompassing critical
-diagnostic differences of various diseases. These differences are dynamically
-integrated with similar EHRs retrieved from an EHR database, and reasoned
-within a large language model. This process enables more accurate and specific
-decision support, while also proactively providing follow-up questions to
-enhance personalized medical decision-making. MedRAG is evaluated on both a
-public dataset DDXPlus and a private chronic pain diagnostic dataset (CPDD)
-collected from Tan Tock Seng Hospital, and its performance is compared against
-various existing RAG methods. Experimental results show that, leveraging the
-information integration and relational abilities of the KG, our MedRAG provides
-more specific diagnostic insights and outperforms state-of-the-art models in
-reducing misdiagnosis rates. Our code will be available at
-https://github.com/SNOWTEAM2023/MedRAG
+Evaluating the open-ended text generation of large language models (LLMs) is
+challenging because of the lack of a clear ground truth and the high cost of
+human or LLM-based assessments. We propose a novel benchmark that evaluates
+LLMs using n-gram statistics and rules, without relying on human judgement or
+LLM-as-a-judge approaches. Using 50 question and reference answer sets, we
+introduce three new metrics based on n-grams and rules: Fluency, Truthfulness,
+and Helpfulness. Our benchmark strongly correlates with GPT-4o-based
+evaluations while requiring significantly fewer computational resources,
+demonstrating its effectiveness as a scalable alternative for assessing LLMs'
+open-ended generation capabilities.
+
+摘要：評估大型語言模型 (LLM) 的開放式文字生成具有挑戰性，因為缺乏明確的基礎真實性，以及人工或基於 LLM 的評估成本高昂。我們提出一個新基準，使用 n-gram 統計和規則來評估 LLM，而不依賴於人工判斷或 LLM 作為評審的方法。使用 50 個問題和參考答案集，我們基於 n-gram 和規則引入了三項新指標：流暢度、真實性和有幫助性。我們的基準與基於 GPT-4o 的評估密切相關，同時需要明顯更少的計算資源，證明了其作為評估 LLM 的開放式生成能力的可擴充替代方案的有效性。
+
+##### **When the LM misunderstood the human chuckled: Analyzing garden path effects in humans and language models**
+2502.09307v1 by Samuel Joseph Amouyal, Aya Meltzer-Asscher, Jonathan Berant
+
+Modern Large Language Models (LLMs) have shown human-like abilities in many
+language tasks, sparking interest in comparing LLMs' and humans' language
+processing. In this paper, we conduct a detailed comparison of the two on a
+sentence comprehension task using garden-path constructions, which are
+notoriously challenging for humans. Based on psycholinguistic research, we
+formulate hypotheses on why garden-path sentences are hard, and test these
+hypotheses on human participants and a large suite of LLMs using comprehension
+questions. Our findings reveal that both LLMs and humans struggle with specific
+syntactic complexities, with some models showing high correlation with human
+comprehension. To complement our findings, we test LLM comprehension of
+garden-path constructions with paraphrasing and text-to-image generation tasks,
+and find that the results mirror the sentence comprehension question results,
+further validating our findings on LLM understanding of these constructions.
 
-摘要：檢索增強生成 (RAG) 是一種適用於檢索隱私敏感的電子健康記錄 (EHR) 的技術。它可以作為醫療保健副駕駛的一個關鍵模組，協助減少醫療保健從業人員和患者的誤診。然而，在醫療領域中使用的現有基於啟發法的 RAG 模型的診斷準確性和特異性不足，特別是對於具有類似表現的疾病。本文提出 MedRAG，一種由知識圖譜 (KG) 引發的推理增強的 RAG 模型，用於醫療領域，它根據表現檢索診斷和治療建議。MedRAG 系統性地構建了一個全面的四層階層式診斷 KG，涵蓋各種疾病的關鍵診斷差異。這些差異與從 EHR 資料庫中檢索到的類似 EHR 動態整合，並在大型語言模型中進行推理。這個過程可以實現更準確和具體的決策支援，同時主動提供後續問題，以增強個人化醫療決策制定。MedRAG 在公共資料集 DDXPlus 和從陳篤生醫院收集的私人慢性疼痛診斷資料集 (CPDD) 上進行評估，並將其效能與各種現有 RAG 方法進行比較。實驗結果顯示，利用 KG 的資訊整合和關係能力，我們的 MedRAG 提供了更具體的診斷見解，並在降低誤診率方面優於最先進的模型。我們的程式碼將在 https://github.com/SNOWTEAM2023/MedRAG 提供
+摘要：現代大型語言模型（LLM）在許多語言任務中展現出類似人類的能力，引發了比較 LLM 與人類語言處理的興趣。在本文中，我們使用對人類來說極具挑戰的花園路徑結構，對這兩者進行了詳細比較，以進行句子理解任務。根據心理語言學研究，我們制定了關於為什麼花園路徑句子困難的假設，並使用理解問題對人類參與者和大量 LLM 測試這些假設。我們的研究結果表明，LLM 和人類都難以應付特定的句法複雜性，其中一些模型與人類理解力高度相關。為了補充我們的研究結果，我們測試了 LLM 對花園路徑結構的理解，並進行了改寫和文字轉換為圖像的生成任務，並發現結果反映了句子理解問題的結果，進一步驗證了我們對 LLM 理解這些結構的研究結果。
 
-##### **Transforming Multimodal Models into Action Models for Radiotherapy**
-2502.04408v1 by Matteo Ferrante, Alessandra Carosi, Rolando Maria D Angelillo, Nicola Toschi
+##### **Indeterminacy in Affective Computing: Considering Meaning and Context in Data Collection Practices**
+2502.09294v1 by Bernd Dudzik, Tiffany Matej Hrkalovic, Chenxu Hao, Chirag Raman, Masha Tsfasman
 
-Radiotherapy is a crucial cancer treatment that demands precise planning to
-balance tumor eradication and preservation of healthy tissue. Traditional
-treatment planning (TP) is iterative, time-consuming, and reliant on human
-expertise, which can potentially introduce variability and inefficiency. We
-propose a novel framework to transform a large multimodal foundation model
-(MLM) into an action model for TP using a few-shot reinforcement learning (RL)
-approach. Our method leverages the MLM's extensive pre-existing knowledge of
-physics, radiation, and anatomy, enhancing it through a few-shot learning
-process. This allows the model to iteratively improve treatment plans using a
-Monte Carlo simulator. Our results demonstrate that this method outperforms
-conventional RL-based approaches in both quality and efficiency, achieving
-higher reward scores and more optimal dose distributions in simulations on
-prostate cancer data. This proof-of-concept suggests a promising direction for
-integrating advanced AI models into clinical workflows, potentially enhancing
-the speed, quality, and standardization of radiotherapy treatment planning.
+Automatic Affect Prediction (AAP) uses computational analysis of input data
+such as text, speech, images, and physiological signals to predict various
+affective phenomena (e.g., emotions or moods). These models are typically
+constructed using supervised machine-learning algorithms, which rely heavily on
+labeled training datasets. In this position paper, we posit that all AAP
+training data are derived from human Affective Interpretation Processes,
+resulting in a form of Affective Meaning. Research on human affect indicates a
+form of complexity that is fundamental to such meaning: it can possess what we
+refer to here broadly as Qualities of Indeterminacy (QIs) - encompassing
+Subjectivity (meaning depends on who is interpreting), Uncertainty (lack of
+confidence regarding meanings' correctness), Ambiguity (meaning contains
+mutually exclusive concepts) and Vagueness (meaning is situated at different
+levels in a nested hierarchy). Failing to appropriately consider QIs leads to
+results incapable of meaningful and reliable predictions. Based on this
+premise, we argue that a crucial step in adequately addressing indeterminacy in
+AAP is the development of data collection practices for modeling corpora that
+involve the systematic consideration of 1) a relevant set of QIs and 2) context
+for the associated interpretation processes. To this end, we are 1) outlining a
+conceptual model of AIPs and the QIs associated with the meaning these produce
+and a conceptual structure of relevant context, supporting understanding of its
+role. Finally, we use our framework for 2) discussing examples of
+context-sensitivity-related challenges for addressing QIs in data collection
+setups. We believe our efforts can stimulate a structured discussion of both
+the role of aspects of indeterminacy and context in research on AAP, informing
+the development of better practices for data collection and analysis.
 
-摘要：放射治療是一種重要的癌症治療方法，需要精確的規劃來平衡腫瘤根除和健康組織的保留。傳統的治療規劃（TP）是反覆的、耗時的，並且依賴於人為專業知識，這可能會引入變異性和低效率。我們提出了一個新穎的框架，使用少次強化學習 (RL) 方法將大型多模態基礎模型 (MLM) 轉換為 TP 的動作模型。我們的模型利用了 MLM 對物理、輻射和解剖學的廣泛預先存在的知識，並通過少次學習過程對其進行增強。這允許模型使用蒙特卡羅模擬器反覆改進治療計劃。我們的結果表明，這種方法在質量和效率方面都優於基於傳統 RL 的方法，在對前列腺癌數據進行模擬時，獲得了更高的獎勵分數和更優化的劑量分佈。這個概念驗證表明了一個有希望的方向，即將先進的人工智慧模型整合到臨床工作流程中，從而有可能提高放射治療計劃的速度、質量和標準化。
+摘要：自動影響預測 (AAP) 使用輸入資料的運算分析，例如文字、語音、影像和生理訊號，來預測各種情感現象（例如情緒或心情）。這些模型通常使用監督式機器學習演算法建構，而這些演算法高度依賴標籤訓練資料集。在此立場文件中，我們主張所有 AAP 訓練資料都是從人類的情感詮釋過程中衍生而來的，進而形成一種情感意義。對人類情感的研究指出，這種複雜性是此種意義的基本要素：它可能具備我們在此廣泛稱之為不確定性品質 (QI)，包括主觀性（意義取決於詮釋者）、不確定性（對於意義正確性的信心不足）、歧義性（意義包含相互排斥的概念）和模糊性（意義位於嵌套層級的不同層級）。未能適當地考量 QI 會導致無法進行有意義且可靠預測的結果。基於此前提，我們主張，在 AAP 中適當地處理不確定性的關鍵步驟，是針對建模語料庫制定資料收集實務，其中涉及系統性地考量 1) 一組相關的 QI，以及 2) 相關詮釋過程的脈絡。為此，我們 1) 概述了 AIP 的概念模型，以及與這些 AIP 所產生的意義相關的 QI，以及相關脈絡的概念結構，支持對其角色的理解。最後，我們使用我們的架構 2) 討論了在資料收集設定中處理 QI 時，與脈絡敏感性相關的挑戰範例。我們相信我們的努力可以激勵對不確定性和脈絡面向在 AAP 研究中扮演的角色進行結構化的討論，為資料收集和分析的最佳實務發展提供資訊。
 
-##### **Online Location Planning for AI-Defined Vehicles: Optimizing Joint Tasks of Order Serving and Spatio-Temporal Heterogeneous Model Fine-Tuning**
-2502.04399v1 by Bokeng Zheng, Bo Rao, Tianxiang Zhu, Chee Wei Tan, Jingpu Duan, Zhi Zhou, Xu Chen, Xiaoxi Zhang
+##### **SparQLe: Speech Queries to Text Translation Through LLMs**
+2502.09284v1 by Amirbek Djanibekov, Hanan Aldarmaki
 
-Advances in artificial intelligence (AI) including foundation models (FMs),
-are increasingly transforming human society, with smart city driving the
-evolution of urban living.Meanwhile, vehicle crowdsensing (VCS) has emerged as
-a key enabler, leveraging vehicles' mobility and sensor-equipped capabilities.
-In particular, ride-hailing vehicles can effectively facilitate flexible data
-collection and contribute towards urban intelligence, despite resource
-limitations. Therefore, this work explores a promising scenario, where
-edge-assisted vehicles perform joint tasks of order serving and the emerging
-foundation model fine-tuning using various urban data. However, integrating the
-VCS AI task with the conventional order serving task is challenging, due to
-their inconsistent spatio-temporal characteristics: (i) The distributions of
-ride orders and data point-of-interests (PoIs) may not coincide in geography,
-both following a priori unknown patterns; (ii) they have distinct forms of
-temporal effects, i.e., prolonged waiting makes orders become instantly invalid
-while data with increased staleness gradually reduces its utility for model
-fine-tuning.To overcome these obstacles, we propose an online framework based
-on multi-agent reinforcement learning (MARL) with careful augmentation. A new
-quality-of-service (QoS) metric is designed to characterize and balance the
-utility of the two joint tasks, under the effects of varying data volumes and
-staleness. We also integrate graph neural networks (GNNs) with MARL to enhance
-state representations, capturing graph-structured, time-varying dependencies
-among vehicles and across locations. Extensive experiments on our testbed
-simulator, utilizing various real-world foundation model fine-tuning tasks and
-the New York City Taxi ride order dataset, demonstrate the advantage of our
-proposed method.
+With the growing influence of Large Language Models (LLMs), there is
+increasing interest in integrating speech representations with them to enable
+more seamless multi-modal processing and speech understanding. This study
+introduces a novel approach that leverages self-supervised speech
+representations in combination with instruction-tuned LLMs for speech-to-text
+translation. The proposed approach leverages a modality adapter to align
+extracted speech features with instruction-tuned LLMs using English-language
+data. Our experiments demonstrate that this method effectively preserves the
+semantic content of the input speech and serves as an effective bridge between
+self-supervised speech models and instruction-tuned LLMs, offering a promising
+solution for various speech understanding applications.
 
-摘要：人工智能（AI）的進展，包括基礎模型（FM），正日益轉變人類社會，智慧城市推動著城市生活的演進。同時，車輛群感測（VCS）已成為關鍵推動因素，利用車輛的機動性和配備感測器的能力。特別是，儘管有資源限制，叫車服務車輛能有效促進靈活的資料收集，並有助於城市智慧。因此，這項工作探索了一個有前途的場景，其中邊緣輔助車輛執行訂單服務和新興基礎模型微調的聯合任務，使用各種城市資料。然而，由於 VCS AI 任務與傳統訂單服務任務的不一致時空特徵，整合它們具有挑戰性：(i) 叫車訂單和資料感興趣點 (PoI) 的分佈在地域上可能不重合，兩者都遵循先驗未知的模式；(ii) 它們具有不同的時間效應形式，即長時間等待會使訂單立即失效，而過時的資料會逐漸降低其對模型微調的效用。為了解決這些障礙，我們提出了一個基於多智能體強化學習 (MARL) 的線上架構，並進行了仔細的擴充。設計了一個新的服務品質 (QoS) 指標，用於表徵和平衡這兩個聯合任務的效用，在不同資料量和過時性的影響下。我們還將圖神經網路（GNN）與 MARL 整合，以增強狀態表示，捕捉車輛之間和不同地點之間的圖結構、時變依賴性。在我們的測試平台模擬器上進行的廣泛實驗，利用各種真實世界的基礎模型微調任務和紐約市計程車叫車訂單資料集，證明了我們提出的方法的優點。
+摘要：隨著大型語言模型（LLM）影響力逐漸擴大，將語音表徵與其整合，以實現更順暢的多模態處理和語音理解，已引起越來越多的興趣。本研究提出了一種新穎的方法，該方法利用自監督語音表徵，結合指令調整的 LLM，進行語音轉文字翻譯。所提出的方法利用模態適配器，使用英語語言資料，將提取的語音特徵與指令調整的 LLM 對齊。我們的實驗證明，此方法有效地保留了輸入語音的語義內容，並作為自監督語音模型和指令調整的 LLM 之間的有效橋樑，為各種語音理解應用程式提供了一個有前景的解決方案。
 
-##### **Multimodal Medical Code Tokenizer**
-2502.04397v2 by Xiaorui Su, Shvat Messica, Yepeng Huang, Ruth Johnson, Lukas Fesser, Shanghua Gao, Faryad Sahneh, Marinka Zitnik
+##### **LiSA: Leveraging Link Recommender to Attack Graph Neural Networks via Subgraph Injection**
+2502.09271v1 by Wenlun Zhang, Enyan Dai, Kentaro Yoshioka
 
-Foundation models trained on patient electronic health records (EHRs) require
-tokenizing medical data into sequences of discrete vocabulary items. Existing
-tokenizers treat medical codes from EHRs as isolated textual tokens. However,
-each medical code is defined by its textual description, its position in
-ontological hierarchies, and its relationships to other codes, such as disease
-co-occurrences and drug-treatment associations. Medical vocabularies contain
-more than 600,000 codes with critical information for clinical reasoning. We
-introduce MedTok, a multimodal medical code tokenizer that uses the text
-descriptions and relational context of codes. MedTok processes text using a
-language model encoder and encodes the relational structure with a graph
-encoder. It then quantizes both modalities into a unified token space,
-preserving modality-specific and cross-modality information. We integrate
-MedTok into five EHR models and evaluate it on operational and clinical tasks
-across in-patient and out-patient datasets, including outcome prediction,
-diagnosis classification, drug recommendation, and risk stratification.
-Swapping standard EHR tokenizers with MedTok improves AUPRC across all EHR
-models, by 4.10% on MIMIC-III, 4.78% on MIMIC-IV, and 11.30% on EHRShot, with
-the largest gains in drug recommendation. Beyond EHR modeling, we demonstrate
-using MedTok tokenizer with medical QA systems. Our results demonstrate the
-potential of MedTok as a unified tokenizer for medical codes, improving
-tokenization for medical foundation models.
+Graph Neural Networks (GNNs) have demonstrated remarkable proficiency in
+modeling data with graph structures, yet recent research reveals their
+susceptibility to adversarial attacks. Traditional attack methodologies, which
+rely on manipulating the original graph or adding links to artificially created
+nodes, often prove impractical in real-world settings. This paper introduces a
+novel adversarial scenario involving the injection of an isolated subgraph to
+deceive both the link recommender and the node classifier within a GNN system.
+Specifically, the link recommender is mislead to propose links between targeted
+victim nodes and the subgraph, encouraging users to unintentionally establish
+connections and that would degrade the node classification accuracy, thereby
+facilitating a successful attack. To address this, we present the LiSA
+framework, which employs a dual surrogate model and bi-level optimization to
+simultaneously meet two adversarial objectives. Extensive experiments on
+real-world datasets demonstrate the effectiveness of our method.
 
-摘要：<paragraph>在患者电子健康记录 (EHR) 上训练的基础模型需要将医学数据标记为离散词汇项序列。现有的标记器将 EHR 中的医学代码视为孤立的文本标记。然而，每个医学代码都由其文本描述、在本体层次结构中的位置以及与其他代码的关系（例如疾病共现和药物治疗关联）来定义。医学词汇表包含超过 600,000 个代码，这些代码包含临床推理的关键信息。我们引入了 MedTok，这是一种多模态医学代码标记器，它使用文本描述和代码的关系上下文。MedTok 使用语言模型编码器处理文本，并使用图编码器对关系结构进行编码。然后，它将这两种模态量化为一个统一的标记空间，保留特定于模态和跨模态的信息。我们将 MedTok 集成到五个 EHR 模型中，并在住院和门诊数据集（包括结果预测、诊断分类、药物推荐和风险分层）上对其实施操作和临床任务进行评估。用 MedTok 替换标准 EHR 标记器可提高所有 EHR 模型的 AUPRC，在 MIMIC-III 上提高 4.10%，在 MIMIC-IV 上提高 4.78%，在 EHRShot 上提高 11.30%，其中药物推荐的增益最大。除了 EHR 建模之外，我们还演示了将 MedTok 标记器与医学问答系统结合使用。我们的结果证明了 MedTok 作为医学代码的统一标记器的潜力，改进了医学基础模型的标记化。</paragraph>
+摘要：圖形神經網路 (GNN) 已展現出在對具有圖形結構的資料進行建模方面的卓越能力，但最近的研究揭露了它們容易受到對抗性攻擊的影響。傳統的攻擊方法依賴於操縱原始圖形或將連結新增至人工建立的節點，在真實世界設定中通常被證明不切實際。本文介紹了一種新穎的對抗性場景，涉及注入一個孤立的子圖形，以欺騙 GNN 系統中的連結推薦器和節點分類器。具體來說，連結推薦器被誤導為在目標受害節點和子圖形之間提出連結，鼓勵使用者無意間建立連結，這將降低節點分類準確度，從而促成攻擊成功。為了解決這個問題，我們提出了 LiSA 框架，它採用雙重代理模型和雙層最佳化，以同時滿足兩個對抗性目標。對真實世界資料集進行的廣泛實驗證明了我們方法的有效性。
 
-##### **A Retrospective Systematic Study on Hierarchical Sparse Query Transformer-assisted Ultrasound Screening for Early Hepatocellular Carcinoma**
-2502.03772v1 by Chaoyin She, Ruifang Lu, Danni He, Jiayi Lv, Yadan Lin, Meiqing Cheng, Hui Huang, Lida Chen, Wei Wang, Qinghua Huang
+##### **AnomalyGFM: Graph Foundation Model for Zero/Few-shot Anomaly Detection**
+2502.09254v1 by Hezhe Qiao, Chaoxi Niu, Ling Chen, Guansong Pang
 
-Hepatocellular carcinoma (HCC) ranks as the third leading cause of
-cancer-related mortality worldwide, with early detection being crucial for
-improving patient survival rates. However, early screening for HCC using
-ultrasound suffers from insufficient sensitivity and is highly dependent on the
-expertise of radiologists for interpretation. Leveraging the latest
-advancements in artificial intelligence (AI) in medical imaging, this study
-proposes an innovative Hierarchical Sparse Query Transformer (HSQformer) model
-that combines the strengths of Convolutional Neural Networks (CNNs) and Vision
-Transformers (ViTs) to enhance the accuracy of HCC diagnosis in ultrasound
-screening. The HSQformer leverages sparse latent space representations to
-capture hierarchical details at various granularities without the need for
-complex adjustments, and adopts a modular, plug-and-play design philosophy,
-ensuring the model's versatility and ease of use. The HSQformer's performance
-was rigorously tested across three distinct clinical scenarios: single-center,
-multi-center, and high-risk patient testing. In each of these settings, it
-consistently outperformed existing state-of-the-art models, such as ConvNext
-and SwinTransformer. Notably, the HSQformer even matched the diagnostic
-capabilities of senior radiologists and comprehensively surpassed those of
-junior radiologists. The experimental results from this study strongly
-demonstrate the effectiveness and clinical potential of AI-assisted tools in
-HCC screening. The full code is available at
-https://github.com/Asunatan/HSQformer.
+Graph anomaly detection (GAD) aims to identify abnormal nodes that differ
+from the majority of the nodes in a graph, which has been attracting
+significant attention in recent years. Existing generalist graph models have
+achieved remarkable success in different graph tasks but struggle to generalize
+to the GAD task. This limitation arises from their difficulty in learning
+generalized knowledge for capturing the inherently infrequent, irregular and
+heterogeneous abnormality patterns in graphs from different domains. To address
+this challenge, we propose AnomalyGFM, a GAD-oriented graph foundation model
+that supports zero-shot inference and few-shot prompt tuning for GAD in diverse
+graph datasets. One key insight is that graph-agnostic representations for
+normal and abnormal classes are required to support effective zero/few-shot GAD
+across different graphs. Motivated by this, AnomalyGFM is pre-trained to align
+data-independent, learnable normal and abnormal class prototypes with node
+representation residuals (i.e., representation deviation of a node from its
+neighbors). The residual features essentially project the node information into
+a unified feature space where we can effectively measure the abnormality of
+nodes from different graphs in a consistent way. This provides a driving force
+for the learning of graph-agnostic, discriminative prototypes for the normal
+and abnormal classes, which can be used to enable zero-shot GAD on new graphs,
+including very large-scale graphs. If there are few-shot labeled normal nodes
+available in the new graphs, AnomalyGFM can further support prompt tuning to
+leverage these nodes for better adaptation. Comprehensive experiments on 11
+widely-used GAD datasets with real anomalies, demonstrate that AnomalyGFM
+significantly outperforms state-of-the-art competing methods under both zero-
+and few-shot GAD settings.
 
-摘要：肝細胞癌（HCC）是全球第三大癌症相關死亡原因，早期檢測對於提高患者存活率至關重要。然而，使用超音波進行 HCC 早期篩檢的靈敏度不足，且高度依賴放射科醫師的專業知識進行判讀。本研究利用醫學影像中人工智慧（AI）的最新進展，提出了一種創新的分層稀疏查詢Transformer（HSQformer）模型，結合了卷積神經網路（CNN）和視覺Transformer（ViT）的優點，以提高超音波篩檢中 HCC 診斷的準確性。HSQformer 利用稀疏潛在空間表示，在不需要複雜調整的情況下擷取各種粒度層級的細節，並採用模組化、即插即用的設計理念，確保模型的多功能性和易用性。HSQformer 的效能經過三個不同的臨床場景的嚴格測試：單中心、多中心和高風險患者測試。在這些設定中，它始終優於現有的最先進模型，例如 ConvNext 和 SwinTransformer。值得注意的是，HSQformer 甚至匹配了資深放射科醫師的診斷能力，並全面超越了初級放射科醫師的診斷能力。本研究的實驗結果有力地證明了 AI 輔助工具在 HCC 篩檢中的有效性和臨床潛力。完整程式碼可在 https://github.com/Asunatan/HSQformer 取得。
+摘要：圖形異常偵測 (GAD) 的目標是找出與圖形中大多數節點不同的異常節點，這在近年來引起了廣泛的關注。現有的通才圖形模型在不同的圖形任務中都取得了顯著的成功，但卻難以推廣到 GAD 任務。這種限制來自於它們難以學習廣泛的知識，用於擷取來自不同領域圖形中固有的罕見、不規則和異質異常模式。為了應對這個挑戰，我們提出了 AnomalyGFM，一個面向 GAD 的圖形基礎模型，它支援零次學習推論和少次提示調整，用於在不同的圖形資料集中進行 GAD。一個關鍵見解是，需要圖形不可知的正常和異常類別表示，以支援跨不同圖形的有效零次/少次 GAD。受此啟發，AnomalyGFM 被預先訓練以將與資料無關的可學習正常和異常類別原型與節點表示殘差（即節點與其鄰居的表示偏差）對齊。殘差特徵基本上將節點資訊投射到一個統一的特徵空間中，在這個空間中，我們可以有效地測量來自不同圖形的節點異常，並且方式一致。這為學習正常和異常類別的圖形不可知、有區別的原型提供了驅動力，這些原型可用於對新的圖形（包括非常大規模的圖形）啟用零次 GAD。如果新的圖形中有少量的標籤正常節點，AnomalyGFM 可以進一步支援提示調整，以利用這些節點進行更好的適應。在 11 個廣泛使用的具有真實異常值的 GAD 資料集上的綜合實驗表明，在零次和少次 GAD 設定下，AnomalyGFM 明顯優於最先進的競爭方法。
 
-##### **Towards Fair Medical AI: Adversarial Debiasing of 3D CT Foundation Embeddings**
-2502.04386v1 by Guangyao Zheng, Michael A. Jacobs, Vladimir Braverman, Vishwa S. Parekh
+##### **The Joint Entity-Relation Extraction Model Based on Span and Interactive Fusion Representation for Chinese Medical Texts with Complex Semantics**
+2502.09247v1 by Danni Feng, Runzhi Li, Jing Wang, Siyu Yan, Lihong Ma, Yunli Xing
 
-Self-supervised learning has revolutionized medical imaging by enabling
-efficient and generalizable feature extraction from large-scale unlabeled
-datasets. Recently, self-supervised foundation models have been extended to
-three-dimensional (3D) computed tomography (CT) data, generating compact,
-information-rich embeddings with 1408 features that achieve state-of-the-art
-performance on downstream tasks such as intracranial hemorrhage detection and
-lung cancer risk forecasting. However, these embeddings have been shown to
-encode demographic information, such as age, sex, and race, which poses a
-significant risk to the fairness of clinical applications.
-  In this work, we propose a Variation Autoencoder (VAE) based adversarial
-debiasing framework to transform these embeddings into a new latent space where
-demographic information is no longer encoded, while maintaining the performance
-of critical downstream tasks. We validated our approach on the NLST lung cancer
-screening dataset, demonstrating that the debiased embeddings effectively
-eliminate multiple encoded demographic information and improve fairness without
-compromising predictive accuracy for lung cancer risk at 1-year and 2-year
-intervals. Additionally, our approach ensures the embeddings are robust against
-adversarial bias attacks. These results highlight the potential of adversarial
-debiasing techniques to ensure fairness and equity in clinical applications of
-self-supervised 3D CT embeddings, paving the way for their broader adoption in
-unbiased medical decision-making.
+Joint entity-relation extraction is a critical task in transforming
+unstructured or semi-structured text into triplets, facilitating the
+construction of large-scale knowledge graphs, and supporting various downstream
+applications. Despite its importance, research on Chinese text, particularly
+with complex semantics in specialized domains like medicine, remains limited.
+To address this gap, we introduce the CH-DDI, a Chinese drug-drug interactions
+dataset designed to capture the intricacies of medical text. Leveraging the
+strengths of attention mechanisms in capturing long-range dependencies, we
+propose the SEA module, which enhances the extraction of complex contextual
+semantic information, thereby improving entity recognition and relation
+extraction. Additionally, to address the inefficiencies of existing methods in
+facilitating information exchange between entity recognition and relation
+extraction, we present an interactive fusion representation module. This module
+employs Cross Attention for bidirectional information exchange between the
+tasks and further refines feature extraction through BiLSTM. Experimental
+results on both our CH-DDI dataset and public CoNLL04 dataset demonstrate that
+our model exhibits strong generalization capabilities. On the CH-DDI dataset,
+our model achieves an F1-score of 96.73% for entity recognition and 78.43% for
+relation extraction. On the CoNLL04 dataset, it attains an entity recognition
+precision of 89.54% and a relation extraction accuracy of 71.64%.
 
-摘要：自我監督學習透過從大規模未標記資料集中提取有效且可概化的特徵，進而革新了醫學影像。最近，自我監督基礎模型已擴展到三維 (3D) 電腦斷層掃描 (CT) 資料，產生緊湊、資訊豐富的嵌入，包含 1408 個特徵，在顱內出血偵測和肺癌風險預測等下游任務中達到最先進的效能。然而，這些嵌入已被證明會編碼人口統計資訊，例如年齡、性別和種族，這對臨床應用的公平性構成重大風險。
-在這項工作中，我們提出一個基於變異自編碼器 (VAE) 的對抗性去偏框架，將這些嵌入轉換到一個新的潛在空間，其中不再編碼人口統計資訊，同時維持關鍵下游任務的效能。我們在 NLST 肺癌篩檢資料集上驗證了我們的做法，證明去偏嵌入有效消除了多重編碼的人口統計資訊，並在不損害 1 年和 2 年間隔的肺癌風險預測準確性的情況下提高了公平性。此外，我們的做法確保了嵌入對抗性偏誤攻擊具有魯棒性。這些結果突顯了對抗性去偏技術的潛力，可確保自我監督 3D CT 嵌入在臨床應用中的公平性和公正性，為其在無偏見醫療決策中的廣泛採用鋪路。
+摘要：聯合實體關係抽取是將非結構化或半結構化文字轉換為三元組的重要任務，有助於建構大規模知識圖譜，並支援各種下游應用程式。儘管其重要性，但針對中文文本的研究，特別是醫學等專業領域中具有複雜語義的研究仍十分有限。為了解決這個差距，我們引入了 CH-DDI，一個中文藥物-藥物交互作用資料集，旨在擷取醫學文本的複雜性。利用注意力機制在擷取長程依賴關係方面的優勢，我們提出了 SEA 模組，增強了複雜脈絡語義資訊的抽取，從而改進了實體辨識和關係抽取。此外，為了解決現有方法在促進實體辨識和關係抽取之間資訊交換方面的低效率問題，我們提出了互動式融合表示模組。此模組採用交叉注意力，在任務之間進行雙向資訊交換，並透過 BiLSTM 進一步精煉特徵抽取。在我們的 CH-DDI 資料集和公開的 CoNLL04 資料集上的實驗結果表明，我們的模型展現出強大的泛化能力。在 CH-DDI 資料集上，我們的模型在實體辨識方面達到了 96.73% 的 F1 分數，在關係抽取方面達到了 78.43% 的 F1 分數。在 CoNLL04 資料集上，它在實體辨識方面達到了 89.54% 的準確度，在關係抽取方面達到了 71.64% 的準確度。
 
-##### **Clinically-Inspired Hierarchical Multi-Label Classification of Chest X-rays with a Penalty-Based Loss Function**
-2502.03591v1 by Mehrdad Asadi, Komi Sodoké, Ian J. Gerard, Marta Kersten-Oertel
+##### **You Do Not Fully Utilize Transformer's Representation Capacity**
+2502.09245v1 by Gleb Gerasimov, Yaroslav Aksenov, Nikita Balagansky, Viacheslav Sinii, Daniil Gavrilov
+
+In contrast to RNNs, which compress previous tokens into a single hidden
+state, Transformers can attend to all previous tokens directly. However,
+standard Transformers only use representations from the immediately preceding
+layer. In this paper, we show that this design choice causes representation
+collapse and leads to suboptimal performance. To address this issue, we
+introduce Layer-Integrated Memory (LIMe), a simple yet powerful approach that
+preserves the model's overall memory footprint while expanding its
+representational capacity by allowing access to hidden states from earlier
+layers. Through extensive experiments across various architectures and
+different lookup mechanisms, we demonstrate consistent performance improvements
+on a wide range of tasks. Moreover, our analysis of the learned representation
+dynamics and our exploration of depthwise circuits reveal how LIMe integrates
+information across layers, pointing to promising directions for future
+research.
 
-In this work, we present a novel approach to multi-label chest X-ray (CXR)
-image classification that enhances clinical interpretability while maintaining
-a streamlined, single-model, single-run training pipeline. Leveraging the
-CheXpert dataset and VisualCheXbert-derived labels, we incorporate hierarchical
-label groupings to capture clinically meaningful relationships between
-diagnoses. To achieve this, we designed a custom hierarchical binary
-cross-entropy (HBCE) loss function that enforces label dependencies using
-either fixed or data-driven penalty types. Our model achieved a mean area under
-the receiver operating characteristic curve (AUROC) of 0.903 on the test set.
-Additionally, we provide visual explanations and uncertainty estimations to
-further enhance model interpretability. All code, model configurations, and
-experiment details are made available.
+摘要：與將先前符號壓縮成單一隱藏狀態的遞迴神經網路不同，Transformer 可以直接關注所有先前的符號。然而，標準 Transformer 僅使用緊鄰前一層的表示。在本文中，我們說明此設計選擇會導致表示崩潰，並導致次優效能。為了解決此問題，我們引入了「層整合式記憶體」(LIMe)，這是一種簡單但強大的方法，可在擴充表示能力的同時，保留模型的整體記憶體使用量，方法是允許存取來自較早層的隱藏狀態。透過各種架構和不同查詢機制的廣泛實驗，我們展示了在各種任務上的一致效能提升。此外，我們對已學習表示動態的分析和對深度電路的探討，揭示了 LIMe 如何整合跨層資訊，並指出未來研究有望發展的方向。
 
-摘要：在本文中，我們提出胸部 X 光（CXR）影像多標籤分類的新方法，在維持簡化的單一模型、單次執行訓練管線的同時，提升臨床可解釋性。利用 CheXpert 資料集和 VisualCheXbert 衍生的標籤，我們納入階層標籤群組，以擷取診斷之間具有臨床意義的關聯性。為此，我們設計了自訂的階層二元交叉熵 (HBCE) 損失函數，使用固定或資料驅動的懲罰類型來強制執行標籤依賴性。我們的模型在測試集上達到受試者工作特性曲線 (AUROC) 下的平均面積為 0.903。此外，我們提供視覺化說明和不確定性估計，以進一步提升模型可解釋性。所有程式碼、模型組態和實驗詳細資料皆已公開。
+##### **From large language models to multimodal AI: A scoping review on the potential of generative AI in medicine**
+2502.09242v1 by Lukas Buess, Matthias Keicher, Nassir Navab, Andreas Maier, Soroosh Tayebi Arasteh
 
-##### **Code Simulation as a Proxy for High-order Tasks in Large Language Models**
-2502.03568v1 by Emanuele La Malfa, Christoph Weinhuber, Orazio Torre, Fangru Lin, X. Angelo Huang, Samuele Marro, Anthony Cohn, Nigel Shadbolt, Michael Wooldridge
+Generative artificial intelligence (AI) models, such as diffusion models and
+OpenAI's ChatGPT, are transforming medicine by enhancing diagnostic accuracy
+and automating clinical workflows. The field has advanced rapidly, evolving
+from text-only large language models for tasks such as clinical documentation
+and decision support to multimodal AI systems capable of integrating diverse
+data modalities, including imaging, text, and structured data, within a single
+model. The diverse landscape of these technologies, along with rising interest,
+highlights the need for a comprehensive review of their applications and
+potential. This scoping review explores the evolution of multimodal AI,
+highlighting its methods, applications, datasets, and evaluation in clinical
+settings. Adhering to PRISMA-ScR guidelines, we systematically queried PubMed,
+IEEE Xplore, and Web of Science, prioritizing recent studies published up to
+the end of 2024. After rigorous screening, 144 papers were included, revealing
+key trends and challenges in this dynamic field. Our findings underscore a
+shift from unimodal to multimodal approaches, driving innovations in diagnostic
+support, medical report generation, drug discovery, and conversational AI.
+However, critical challenges remain, including the integration of heterogeneous
+data types, improving model interpretability, addressing ethical concerns, and
+validating AI systems in real-world clinical settings. This review summarizes
+the current state of the art, identifies critical gaps, and provides insights
+to guide the development of scalable, trustworthy, and clinically impactful
+multimodal AI solutions in healthcare.
 
-Many reasoning, planning, and problem-solving tasks share an intrinsic
-algorithmic nature: correctly simulating each step is a sufficient condition to
-solve them correctly. We collect pairs of naturalistic and synthetic reasoning
-tasks to assess the capabilities of Large Language Models (LLM). While
-naturalistic tasks often require careful human handcrafting, we show that
-synthetic data is, in many cases, a good proxy that is much easier to collect
-at scale. We leverage common constructs in programming as the counterpart of
-the building blocks of naturalistic reasoning tasks, such as straight-line
-programs, code that contains critical paths, and approximate and redundant
-instructions. We further assess the capabilities of LLMs on sorting problems
-and repeated operations via sorting algorithms and nested loops. Our synthetic
-datasets further reveal that while the most powerful LLMs exhibit relatively
-strong execution capabilities, the process is fragile: it is negatively
-affected by memorisation and seems to rely heavily on pattern recognition. Our
-contribution builds upon synthetically testing the reasoning capabilities of
-LLMs as a scalable complement to handcrafted human-annotated problems.
+摘要：生成式人工智能 (AI) 模型，例如扩散模型和 OpenAI 的 ChatGPT，通过提高诊断准确性和自动化临床工作流程，正在改变医学领域。该领域已迅速发展，从用于临床文件编制和决策支持等任务的纯文本大型语言模型，发展到能够在单个模型中整合包括影像、文本和结构化数据在内的多种数据方式的多模态 AI 系统。这些技术的多样化格局以及日益增长的兴趣，凸显了全面审查其应用和潜力的必要性。本范围审查探讨了多模态 AI 的演变，重点介绍了其方法、应用、数据集和在临床环境中的评估。遵循 PRISMA-ScR 指南，我们系统地查询了 PubMed、IEEE Xplore 和 Web of Science，优先考虑截至 2024 年底发表的最新研究。经过严格筛选，纳入了 144 篇论文，揭示了这一充满活力的领域的趋势和挑战。我们的研究结果强调了从单模态方法向多模态方法的转变，推动了诊断支持、医疗报告生成、药物发现和会话式 AI 的创新。然而，关键挑战仍然存在，包括异构数据类型的整合、提高模型可解释性、解决伦理问题以及在现实世界的临床环境中验证 AI 系统。本综述总结了当前的最新技术，确定了关键差距，并提供了见解，以指导在医疗保健领域开发可扩展、可信赖且具有临床影响力的多模态 AI 解决方案。
 
-摘要：許多推理、規劃和問題解決任務共享一個內在的演算法性質：正確模擬每一步就足以正確解決它們。我們收集自然主義和合成推理任務對，以評估大型語言模型 (LLM) 的功能。雖然自然主義任務通常需要仔細的人工製作，但我們表明在許多情況下，合成資料是一個很好的代理，而且更容易大規模收集。我們利用程式設計中的常見建構，作為自然主義推理任務構建區塊的對應物，例如直線程式、包含關鍵路徑的程式碼，以及近似和冗餘指令。我們進一步評估 LLM 在排序問題和重複運算上的功能，透過排序演算法和巢狀迴圈。我們的合成資料集進一步揭示，雖然最強大的 LLM 表現出相對強大的執行能力，但這個過程很脆弱：它受到記憶的負面影響，而且似乎嚴重依賴模式辨識。我們的貢獻建立在以合成方式測試 LLM 的推理能力之上，作為手工編寫人類標註問題的可擴充補充。
+##### **Reliable Conversational Agents under ASP Control that Understand Natural Language**
+2502.09237v1 by Yankai Zeng
 
-##### **Limitations of Large Language Models in Clinical Problem-Solving Arising from Inflexible Reasoning**
-2502.04381v1 by Jonathan Kim, Anna Podlasek, Kie Shidara, Feng Liu, Ahmed Alaa, Danilo Bernardo
+Efforts have been made to make machines converse like humans in the past few
+decades. The recent techniques of Large Language Models (LLMs) make it possible
+to have human-like conversations with machines, but LLM's flaws of lacking
+understanding and reliability are well documented. We believe that the best way
+to eliminate this problem is to use LLMs only as parsers to translate text to
+knowledge and vice versa and carry out the conversation by reasoning over this
+knowledge using the answer set programming. I have been developing a framework
+based on LLMs and ASP to realize reliable chatbots that "understand" human
+conversation. This framework has been used to develop task-specific chatbots as
+well as socialbots. My future research is focused on making these chatbots
+scalable and trainable.
 
-Large Language Models (LLMs) have attained human-level accuracy on medical
-question-answer (QA) benchmarks. However, their limitations in navigating
-open-ended clinical scenarios have recently been shown, raising concerns about
-the robustness and generalizability of LLM reasoning across diverse, real-world
-medical tasks. To probe potential LLM failure modes in clinical
-problem-solving, we present the medical abstraction and reasoning corpus
-(M-ARC). M-ARC assesses clinical reasoning through scenarios designed to
-exploit the Einstellung effect -- the fixation of thought arising from prior
-experience, targeting LLM inductive biases toward inflexible pattern matching
-from their training data rather than engaging in flexible reasoning. We find
-that LLMs, including current state-of-the-art o1 and Gemini models, perform
-poorly compared to physicians on M-ARC, often demonstrating lack of commonsense
-medical reasoning and a propensity to hallucinate. In addition, uncertainty
-estimation analyses indicate that LLMs exhibit overconfidence in their answers,
-despite their limited accuracy. The failure modes revealed by M-ARC in LLM
-medical reasoning underscore the need to exercise caution when deploying these
-models in clinical settings.
+摘要：在過去的幾十年裡，人們一直努力讓機器像人類一樣對話。大型語言模型 (LLM) 的最新技術讓與機器進行類人對話成為可能，但 LLM 缺乏理解力和可靠性的缺陷已被充分記錄。我們相信消除這個問題的最佳方法是僅將 LLM 作為解析器，將文字轉換為知識，反之亦然，並使用答案集程式設計對此知識進行推理來進行對話。我一直在開發一個基於 LLM 和 ASP 的框架，以實現「理解」人類對話的可靠聊天機器人。這個框架已被用於開發特定任務的聊天機器人以及社交機器人。我未來的研究重點在於讓這些聊天機器人具有可擴充性和可訓練性。
 
-摘要：大型語言模型 (LLM) 已在醫療問題解答 (QA) 基準上達到人類層級的準確度。然而，它們在應對開放式臨床場景中的局限性最近已被揭示，引發了人們對 LLM 推理在多樣化、真實世界醫療任務中的穩健性和概括性的擔憂。為了探討臨床問題解決中 LLM 的潛在故障模式，我們提出了醫療抽象和推理語料庫 (M-ARC)。M-ARC 通過旨在利用艾賓浩斯錯覺（由先前經驗產生的思維定勢）來評估臨床推理，針對 LLM 歸納偏誤，使其從訓練數據中進行僵化的模式匹配，而不是進行靈活的推理。我們發現，包括當前最先進的 o1 和 Gemini 模型在內的 LLM，在 M-ARC 上的表現遠不如醫生，它們經常表現出缺乏常識性的醫療推理和產生幻覺的傾向。此外，不確定性估計分析表明，儘管 LLM 準確性有限，但它們對自己的答案表現出過度自信。M-ARC 揭示的 LLM 醫療推理故障模式強調了在臨床環境中部署這些模型時需要謹慎。
+##### **Commonsense Reasoning-Aided Autonomous Vehicle Systems**
+2502.09233v1 by Keegan Kimbrell
 
-##### **Accurate AI-Driven Emergency Vehicle Location Tracking in Healthcare ITS Digital Twin**
-2502.03396v1 by Sarah Al-Shareeda, Yasar Celik, Bilge Bilgili, Ahmed Al-Dubai, Berk Canberk
+Autonomous Vehicle (AV) systems have been developed with a strong reliance on
+machine learning techniques. While machine learning approaches, such as deep
+learning, are extremely effective at tasks that involve observation and
+classification, they struggle when it comes to performing higher level
+reasoning about situations on the road. This research involves incorporating
+commonsense reasoning models that use image data to improve AV systems. This
+will allow AV systems to perform more accurate reasoning while also making them
+more adjustable, explainable, and ethical. This paper will discuss the findings
+so far and motivate its direction going forward.
 
-Creating a Digital Twin (DT) for Healthcare Intelligent Transportation
-Systems (HITS) is a hot research trend focusing on enhancing HITS management,
-particularly in emergencies where ambulance vehicles must arrive at the crash
-scene on time and track their real-time location is crucial to the medical
-authorities. Despite the claim of real-time representation, a temporal
-misalignment persists between the physical and virtual domains, leading to
-discrepancies in the ambulance's location representation. This study proposes
-integrating AI predictive models, specifically Support Vector Regression (SVR)
-and Deep Neural Networks (DNN), within a constructed mock DT data pipeline
-framework to anticipate the medical vehicle's next location in the virtual
-world. These models align virtual representations with their physical
-counterparts, i.e., metaphorically offsetting the synchronization delay between
-the two worlds. Trained meticulously on a historical geospatial dataset, SVR
-and DNN exhibit exceptional prediction accuracy in MATLAB and Python
-environments. Through various testing scenarios, we visually demonstrate the
-efficacy of our methodology, showcasing SVR and DNN's key role in significantly
-reducing the witnessed gap within the HITS's DT. This transformative approach
-enhances real-time synchronization in emergency HITS by approximately 88% to
-93%.
+摘要：自動駕駛車輛 (AV) 系統的開發高度依賴機器學習技術。儘管機器學習方法（例如深度學習）在涉及觀察和分類的任務中非常有效，但它們在對路況進行更高層級推理時會遇到困難。本研究涉及整合使用影像資料的常識推理模型，以改善 AV 系統。這將使 AV 系統能夠執行更準確的推理，同時也讓它們更具可調整性、可解釋性和道德性。本文將探討迄今為止的發現，並說明其未來的發展方向。
 
-摘要：建立醫療智慧交通系統（HITS）的數位分身（DT）是熱門的研究趨勢，其重點在於提升 HITS 管理，特別是在救護車必須準時抵達車禍現場的緊急情況中，追蹤其即時位置對於醫療單位至關重要。儘管聲稱即時呈現，但實體和虛擬領域之間仍存在時間上的錯位，導致救護車位置呈現上的差異。本研究建議在建構的虛擬 DT 資料管道架構中整合人工智慧預測模型，特別是支援向量回歸（SVR）和深度神經網路（DNN），以預測醫療車輛在虛擬世界的下一個位置。這些模型將虛擬呈現與其實體對應物對齊，也就是說，在兩個世界之間比喻性地抵銷同步延遲。在歷史地理空間資料集上經過仔細訓練，SVR 和 DNN 在 MATLAB 和 Python 環境中展現出卓越的預測準確性。透過各種測試情境，我們視覺化展示了我們方法論的效能，展示了 SVR 和 DNN 在顯著縮小 HITS 的 DT 中見證到的差距方面的關鍵作用。這種變革性的方法將緊急 HITS 中的即時同步提升了大約 88% 到 93%。
+##### **Logical foundations of Smart Contracts**
+2502.09232v1 by Kalonji Kalala
 
-##### **RadVLM: A Multitask Conversational Vision-Language Model for Radiology**
-2502.03333v1 by Nicolas Deperrois, Hidetoshi Matsuo, Samuel Ruipérez-Campillo, Moritz Vandenhirtz, Sonia Laguna, Alain Ryser, Koji Fujimoto, Mizuho Nishio, Thomas M. Sutter, Julia E. Vogt, Jonas Kluckert, Thomas Frauenfelder, Christian Blüthgen, Farhad Nooralahzadeh, Michael Krauthammer
+Nowadays, sophisticated domains are emerging which require appropriate
+formalisms to be specified accurately in order to reason about them. One such
+domain is constituted of smart contracts that have emerged in cyber physical
+systems as a way of enforcing formal agreements between components of these
+systems. Smart contracts self-execute to run and share business processes
+through blockchain, in decentralized systems, with many different participants.
+Legal contracts are in many cases complex documents, with a number of
+exceptions, and many subcontracts. The implementation of smart contracts based
+on legal contracts is a long and laborious task, that needs to include all
+actions, procedures, and the effects of actions related to the execution of the
+contract. An ongoing open problem in this area is to formally account for smart
+contracts using a uniform and somewhat universal formalism. This thesis
+proposes logical foundations to smart contracts using the Situation Calculus, a
+logic for reasoning about actions. Situation Calculus is one of the prominent
+logic-based artificial intelligence approaches that provides enough logical
+mechanism to specify and implement dynamic and complex systems such as
+contracts. Situation Calculus is suitable to show how worlds dynamically
+change. Smart contracts are going to be implement with Golog (written en
+Prolog), a Situation Calculus-based programming language for modeling complex
+and dynamic behaviors.
 
-The widespread use of chest X-rays (CXRs), coupled with a shortage of
-radiologists, has driven growing interest in automated CXR analysis and
-AI-assisted reporting. While existing vision-language models (VLMs) show
-promise in specific tasks such as report generation or abnormality detection,
-they often lack support for interactive diagnostic capabilities. In this work
-we present RadVLM, a compact, multitask conversational foundation model
-designed for CXR interpretation. To this end, we curate a large-scale
-instruction dataset comprising over 1 million image-instruction pairs
-containing both single-turn tasks -- such as report generation, abnormality
-classification, and visual grounding -- and multi-turn, multi-task
-conversational interactions. After fine-tuning RadVLM on this instruction
-dataset, we evaluate it across different tasks along with re-implemented
-baseline VLMs. Our results show that RadVLM achieves state-of-the-art
-performance in conversational capabilities and visual grounding while remaining
-competitive in other radiology tasks. Ablation studies further highlight the
-benefit of joint training across multiple tasks, particularly for scenarios
-with limited annotated data. Together, these findings highlight the potential
-of RadVLM as a clinically relevant AI assistant, providing structured CXR
-interpretation and conversational capabilities to support more effective and
-accessible diagnostic workflows.
+摘要：如今，正在出现需要适当形式化来准确指定以对其进行推理的复杂领域。此类领域之一由在网络物理系统中出现的智能合约构成，作为强制执行这些系统组件之间正式协议的一种方式。智能合约自执行以在去中心化系统中通过区块链运行和共享业务流程，并有许多不同的参与者。法律合约在许多情况下是复杂的文档，有许多例外和许多分包合同。基于法律合约实施智能合约是一项漫长而艰巨的任务，需要包括所有操作、程序以及与执行合约相关的操作效果。该领域的持续开放问题是使用统一且某种程度上通用的形式化来正式说明智能合约。本论文提出了使用情景演算（一种用于推理操作的逻辑）为智能合约提供逻辑基础。情景演算是基于逻辑的人工智能方法之一，提供了足够的逻辑机制来指定和实现动态且复杂的系统，例如合约。情景演算适用于展示世界如何动态变化。智能合约将使用 Golog（以 Prolog 编写的）实现，这是一种基于情景演算的编程语言，用于建模复杂且动态的行为。
 
-摘要：胸部 X 光 (CXR) 的广泛使用，加上放射科醫師短缺，促使人們對自動化 CXR 分析和 AI 輔助報告產生越來越濃厚的興趣。雖然現有的視覺語言模型 (VLM) 在特定任務中顯示出前景，例如報告生成或異常偵測，但它們通常缺乏對互動式診斷功能的支持。在這項工作中，我們提出 RadVLM，這是一個緊湊的多任務對話式基礎模型，專為 CXR 解釋而設計。為此，我們策劃了一個大型指令資料集，包含超過 100 萬個影像指令對，其中包含單輪任務（例如報告生成、異常分類和視覺基礎），以及多輪、多任務對話互動。在對這個指令資料集進行微調後，我們對 RadVLM 進行評估，並與重新實作的基準 VLM 一起執行不同的任務。我們的結果顯示，RadVLM 在對話能力和視覺基礎方面取得了最先進的效能，同時在其他放射學任務中仍具有競爭力。消融研究進一步突顯了跨多個任務進行聯合訓練的好處，特別是對於帶有標註資料有限的場景。這些發現共同突顯了 RadVLM 作為臨床相關 AI 助理的潛力，提供結構化的 CXR 解釋和對話能力，以支援更有效且可存取的診斷工作流程。
+##### **Relating Answer Set Programming and Many-sorted Logics for Formal Verification**
+2502.09230v1 by Zachary Hansen
 
-##### **MeDiSumQA: Patient-Oriented Question-Answer Generation from Discharge Letters**
-2502.03298v1 by Amin Dada, Osman Alperen Koras, Marie Bauer, Amanda Butler, Kaleb E. Smith, Jens Kleesiek, Julian Friedrich
+Answer Set Programming (ASP) is an important logic programming paradigm
+within the field of Knowledge Representation and Reasoning. As a concise,
+human-readable, declarative language, ASP is an excellent tool for developing
+trustworthy (especially, artificially intelligent) software systems. However,
+formally verifying ASP programs offers some unique challenges, such as
+  1. a lack of modularity (the meanings of rules are difficult to define in
+isolation from the enclosing program),
+  2. the ground-and-solve semantics (the meanings of rules are dependent on the
+input data with which the program is grounded), and
+  3. limitations of existing tools.
+  My research agenda has been focused on addressing these three issues with the
+intention of making ASP verification an accessible, routine task that is
+regularly performed alongside program development. In this vein, I have
+investigated alternative semantics for ASP based on translations into the logic
+of here-and-there and many-sorted first-order logic. These semantics promote a
+modular understanding of logic programs, bypass grounding, and enable us to use
+automated theorem provers to automatically verify properties of programs.
 
-While increasing patients' access to medical documents improves medical care,
-this benefit is limited by varying health literacy levels and complex medical
-terminology. Large language models (LLMs) offer solutions by simplifying
-medical information. However, evaluating LLMs for safe and patient-friendly
-text generation is difficult due to the lack of standardized evaluation
-resources. To fill this gap, we developed MeDiSumQA. MeDiSumQA is a dataset
-created from MIMIC-IV discharge summaries through an automated pipeline
-combining LLM-based question-answer generation with manual quality checks. We
-use this dataset to evaluate various LLMs on patient-oriented
-question-answering. Our findings reveal that general-purpose LLMs frequently
-surpass biomedical-adapted models, while automated metrics correlate with human
-judgment. By releasing MeDiSumQA on PhysioNet, we aim to advance the
-development of LLMs to enhance patient understanding and ultimately improve
-care outcomes.
+摘要：<paragraph>答案集程式設計 (ASP) 是知識表徵與推理領域中一個重要的邏輯程式設計範式。ASP 作為一種簡潔、人類可讀、宣告式的語言，是開發值得信賴的 (特別是人工智慧) 軟體系統的絕佳工具。然而，正式驗證 ASP 程式提供了一些獨特的挑戰，例如
+  1. 缺乏模組化 (規則的含義難以與封閉程式隔離定義)，
+  2. 基礎與求解語意 (規則的含義取決於程式基礎的輸入資料)，以及
+  3. 現有工具的限制。
+  我的研究議程一直專注於解決這三個問題，目的是讓 ASP 驗證成為一個可存取的、例行任務，並在程式開發過程中定期執行。在這個脈絡下，我研究了基於翻譯成此處和彼處邏輯以及多種排序一階邏輯的 ASP 替代語意。這些語意促進了邏輯程式的模組化理解，繞過基礎，並使我們能夠使用自動化定理證明器自動驗證程式的屬性。</paragraph>
 
-摘要：儘管讓患者更能取得醫療文件有助於改善醫療照護，
-但此優點受到不同的健康素養程度和複雜的醫療術語所限制。大型語言模型 (LLM) 提供了簡化醫療資訊的解決方案。然而，由於缺乏標準化的評估資源，因此難以評估 LLM 以確保其安全且對患者友善的文字產生。為了填補此缺口，我們開發了 MeDiSumQA。MeDiSumQA 是透過自動化流程從 MIMIC-IV 出院摘要中建立的資料集，結合了基於 LLM 的問答產生和手動品質檢查。我們使用此資料集來評估各種 LLM 在以患者為導向的問答中。我們的發現顯示，通用 LLM 經常超越生物醫學適應模型，而自動化指標與人類判斷相關。透過在 PhysioNet 上發布 MeDiSumQA，我們旨在推動 LLM 的發展，以增進患者理解，並最終改善照護成果。
+##### **Computational methods for Dynamic Answer Set Programming**
+2502.09228v1 by Susana Hahn
 
-##### **Deep Learning Pipeline for Fully Automated Myocardial Infarct Segmentation from Clinical Cardiac MR Scans**
-2502.03272v1 by Matthias Schwab, Mathias Pamminger, Christian Kremser, Agnes Mayr
+In our daily lives and industrial settings, we often encounter dynamic
+problems that require reasoning over time and metric constraints. These include
+tasks such as scheduling, routing, and production sequencing. Dynamic logics
+have traditionally addressed these needs but often lack the flexibility and
+integration required for comprehensive problem modeling. This research aims to
+extend Answer Set Programming (ASP), a powerful declarative problem-solving
+approach, to handle dynamic domains effectively. By integrating concepts from
+dynamic, temporal, and metric logics into ASP, we seek to develop robust
+systems capable of modeling complex dynamic problems and performing efficient
+reasoning tasks, thereby enhancing ASPs applicability in industrial contexts.
 
-Purpose: To develop and evaluate a deep learning-based method that allows to
-perform myocardial infarct segmentation in a fully-automated way.
-  Materials and Methods: For this retrospective study, a cascaded framework of
-two and three-dimensional convolutional neural networks (CNNs), specialized on
-identifying ischemic myocardial scars on late gadolinium enhancement (LGE)
-cardiac magnetic resonance (CMR) images, was trained on an in-house training
-dataset consisting of 144 examinations. On a separate test dataset from the
-same institution, including images from 152 examinations obtained between 2021
-and 2023, a quantitative comparison between artificial intelligence (AI)-based
-segmentations and manual segmentations was performed. Further, qualitative
-assessment of segmentation accuracy was evaluated for both human and
-AI-generated contours by two CMR experts in a blinded experiment.
-  Results: Excellent agreement could be found between manually and
-automatically calculated infarct volumes ($\rho_c$ = 0.9). The qualitative
-evaluation showed that compared to human-based measurements, the experts rated
-the AI-based segmentations to better represent the actual extent of infarction
-significantly (p < 0.001) more often (33.4% AI, 25.1% human, 41.5% equal). On
-the contrary, for segmentation of microvascular obstruction (MVO), manual
-measurements were still preferred (11.3% AI, 55.6% human, 33.1% equal).
-  Conclusion: This fully-automated segmentation pipeline enables CMR infarct
-size to be calculated in a very short time and without requiring any
-pre-processing of the input images while matching the segmentation quality of
-trained human observers. In a blinded experiment, experts preferred automated
-infarct segmentations more often than manual segmentations, paving the way for
-a potential clinical application.
+摘要：在我們的日常生活和工業環境中，我們經常會遇到動態問題，需要隨著時間和公制約束進行推理。這些問題包括排程、路由和生產順序等任務。動態邏輯傳統上解決了這些需求，但通常缺乏全面問題建模所需的靈活性與整合性。本研究旨在擴展強大的宣告式問題解決方法「Answer Set Programming (ASP)」，以有效處理動態領域。透過將動態、時態和公制邏輯的概念整合到 ASP 中，我們尋求開發強健的系統，能夠建模複雜的動態問題並執行有效的推理任務，進而增強 ASP 在工業環境中的適用性。
+
+##### **Generating Causally Compliant Counterfactual Explanations using ASP**
+2502.09226v1 by Sopam Dasgupta
 
-摘要：<paragraph>目的：開發和評估一種基於深度學習的方法，允許以全自動的方式執行心肌梗塞分割。
-材料和方法：對於這項回顧性研究，一個由二維和三維卷積神經網路 (CNN) 組成的串聯架構，專門用於識別晚期釓增強 (LGE) 心臟磁振造影 (CMR) 影像上的缺血性心肌疤痕，並在包含 144 項檢查的內部訓練資料集上受訓。在來自同一家機構的獨立測試資料集上，包括 2021 年至 2023 年間獲得的 152 項檢查的影像，執行基於人工智慧 (AI) 的分割和手動分割之間的定量比較。此外，由兩位 CMR 專家在盲測實驗中評估人類和 AI 生成的輪廓的分割準確度。
-結果：在手動和自動計算的梗塞體積之間可以發現極佳的一致性（ρ_c = 0.9）。定性評估顯示，與基於人類的測量相比，專家評估 AI 基於分割能更能代表梗塞的實際範圍，顯著（p < 0.001）更常發生（33.4% AI，25.1% 人類，41.5% 相等）。相反，對於微血管阻塞 (MVO) 的分割，手動測量仍然較受青睞（11.3% AI，55.6% 人類，33.1% 相等）。
-結論：這個全自動分割管道可以在很短的時間內計算 CMR 梗塞大小，而且無需對輸入影像進行任何前處理，同時匹配受過訓練的人類觀察者的分割品質。在盲測實驗中，專家比手動分割更常偏好自動梗塞分割，為潛在的臨床應用鋪平了道路。</paragraph>
+This research is focused on generating achievable counterfactual
+explanations. Given a negative outcome computed by a machine learning model or
+a decision system, the novel CoGS approach generates (i) a counterfactual
+solution that represents a positive outcome and (ii) a path that will take us
+from the negative outcome to the positive one, where each node in the path
+represents a change in an attribute (feature) value. CoGS computes paths that
+respect the causal constraints among features. Thus, the counterfactuals
+computed by CoGS are realistic. CoGS utilizes rule-based machine learning
+algorithms to model causal dependencies between features. The paper discusses
+the current status of the research and the preliminary results obtained.
 
-##### **Long-tailed Medical Diagnosis with Relation-aware Representation Learning and Iterative Classifier Calibration**
-2502.03238v2 by Li Pan, Yupei Zhang, Qiushi Yang, Tan Li, Zhen Chen
+摘要：本研究重點在於產生可實現的反事實解釋。給定由機器學習模型或決策系統計算出的負面結果，創新的 CoGS 方法會產生 (i) 代表正面結果的反事實解，以及 (ii) 一條將我們從負面結果帶到正面結果的途徑，其中途徑中的每個節點代表屬性 (特徵) 值的變化。CoGS 計算出符合特徵之間因果關係的途徑。因此，CoGS 計算出的反事實是切合實際的。CoGS 利用基於規則的機器學習演算法來建模特徵之間的因果關係。本文探討了研究的現況和獲得的初步結果。
 
-Recently computer-aided diagnosis has demonstrated promising performance,
-effectively alleviating the workload of clinicians. However, the inherent
-sample imbalance among different diseases leads algorithms biased to the
-majority categories, leading to poor performance for rare categories. Existing
-works formulated this challenge as a long-tailed problem and attempted to
-tackle it by decoupling the feature representation and classification. Yet, due
-to the imbalanced distribution and limited samples from tail classes, these
-works are prone to biased representation learning and insufficient classifier
-calibration. To tackle these problems, we propose a new Long-tailed Medical
-Diagnosis (LMD) framework for balanced medical image classification on
-long-tailed datasets. In the initial stage, we develop a Relation-aware
-Representation Learning (RRL) scheme to boost the representation ability by
-encouraging the encoder to capture intrinsic semantic features through
-different data augmentations. In the subsequent stage, we propose an Iterative
-Classifier Calibration (ICC) scheme to calibrate the classifier iteratively.
-This is achieved by generating a large number of balanced virtual features and
-fine-tuning the encoder using an Expectation-Maximization manner. The proposed
-ICC compensates for minority categories to facilitate unbiased classifier
-optimization while maintaining the diagnostic knowledge in majority classes.
-Comprehensive experiments on three public long-tailed medical datasets
-demonstrate that our LMD framework significantly surpasses state-of-the-art
-approaches. The source code can be accessed at
-https://github.com/peterlipan/LMD.
+##### **Order-Sorted Intensional Logic: Expressing Subtyping Polymorphism with Typing Assertions and Quantification over Concepts**
+2502.09224v1 by Đorđe Marković, Marc Denecker
 
-摘要：<paragraph>最近，计算机辅助诊断已展现出可观的表现，有效减轻了临床医生的工作量。然而，不同疾病之间固有的样本不平衡导致算法偏向于多数类别，从而导致罕见类别表现不佳。现有工作将这一挑战表述为长尾问题，并尝试通过解耦特征表示和分类来解决它。然而，由于不平衡分布和尾类样本有限，这些工作容易出现有偏差的表示学习和分类器校准不足。为了解决这些问题，我们提出了一个新的长尾医学诊断 (LMD) 框架，用于对长尾数据集进行平衡的医学图像分类。在初始阶段，我们开发了一个关系感知表示学习 (RRL) 方案，通过鼓励编码器通过不同的数据增强来捕获内在语义特征，从而提升表示能力。在后续阶段，我们提出了一个迭代分类器校准 (ICC) 方案，以迭代方式校准分类器。这是通过生成大量的平衡虚拟特征并使用期望最大化方式微调编码器来实现的。所提出的 ICC 补偿了少数类别，以促进无偏分类器优化，同时保持多数类别的诊断知识。在三个公共长尾医学数据集上进行的综合实验表明，我们的 LMD 框架明显超越了最先进的方法。源代码可在 https://github.com/peterlipan/LMD 处获取。</paragraph>
+Subtyping, also known as subtype polymorphism, is a concept extensively
+studied in programming language theory, delineating the substitutability
+relation among datatypes. This property ensures that programs designed for
+supertype objects remain compatible with their subtypes.
+  In this paper, we explore the capability of order-sorted logic for utilizing
+these ideas in the context of Knowledge Representation. We recognize two
+fundamental limitations: First, the inability of this logic to address the
+concept rather than the value of non-logical symbols, and second, the lack of
+language constructs for constraining the type of terms. Consequently, we
+propose guarded order-sorted intensional logic, where guards are language
+constructs for annotating typing information and intensional logic provides
+support for quantification over concepts.
 
-##### **Fine-Tuning Strategies for Continual Online EEG Motor Imagery Decoding: Insights from a Large-Scale Longitudinal Study**
-2502.06828v1 by Martin Wimpff, Bruno Aristimunha, Sylvain Chevallier, Bin Yang
+摘要：子類型化，也稱為子類型多態性，是一個在程式語言理論中廣泛研究的概念，用於描述資料類型之間的可替換關係。此特性可確保為超類型物件設計的程式與其子類型相容。
+在本文中，我們探討了使用排序邏輯在知識表徵中運用這些想法的能力。我們發現了兩個基本限制：首先，此邏輯無法處理非邏輯符號的概念而非值，其次，缺乏約束項類型的語言結構。因此，我們提出了受保護的排序邏輯，其中保護是註解類型資訊的語言結構，而內涵邏輯則支援對概念量化。
 
-This study investigates continual fine-tuning strategies for deep learning in
-online longitudinal electroencephalography (EEG) motor imagery (MI) decoding
-within a causal setting involving a large user group and multiple sessions per
-participant. We are the first to explore such strategies across a large user
-group, as longitudinal adaptation is typically studied in the single-subject
-setting with a single adaptation strategy, which limits the ability to
-generalize findings. First, we examine the impact of different fine-tuning
-approaches on decoder performance and stability. Building on this, we integrate
-online test-time adaptation (OTTA) to adapt the model during deployment,
-complementing the effects of prior fine-tuning. Our findings demonstrate that
-fine-tuning that successively builds on prior subject-specific information
-improves both performance and stability, while OTTA effectively adapts the
-model to evolving data distributions across consecutive sessions, enabling
-calibration-free operation. These results offer valuable insights and
-recommendations for future research in longitudinal online MI decoding and
-highlight the importance of combining domain adaptation strategies for
-improving BCI performance in real-world applications. Clinical Relevance: Our
-investigation enables more stable and efficient long-term motor imagery
-decoding, which is critical for neurorehabilitation and assistive technologies.
+##### **ASP-driven User-interaction with Clinguin**
+2502.09222v1 by Alexander Beiser, Susana Hahn, Torsten Schaub
 
-摘要：本研究探討在因果關係設定中涉及大量使用者群組和每個參與者多個階段的線上縱向腦電圖 (EEG) 運動想像 (MI) 解碼中，深度學習的持續微調策略。我們是第一個在大量使用者群組中探討此類策略，因為縱向適應通常在單一主體設定中研究，並使用單一適應策略，這限制了推廣研究結果的能力。首先，我們探討不同微調方法對解碼器效能和穩定性的影響。在此基礎上，我們整合線上測試時間適應 (OTTA) 以在部署期間適應模型，補充先前微調的效果。我們的研究結果表明，連續建立在先前特定主體資訊上的微調可以同時改善效能和穩定性，而 OTTA 可以有效地適應連續階段中不斷變化的資料分佈，從而實現無需校準的操作。這些結果為縱向線上 MI 解碼的未來研究提供了有價值的見解和建議，並強調了結合領域適應策略以改善實際應用中 BCI 效能的重要性。臨床相關性：我們的研究可以實現更穩定、更有效的長期運動想像解碼，這對於神經復健和輔助技術至關重要。
+We present clinguin, a system for ASP-driven user interface design. Clinguin
+streamlines the development of user interfaces for ASP developers by letting
+them build interactive prototypes directly in ASP, eliminating the need for
+separate frontend languages. To this end, clinguin uses a few dedicated
+predicates to define user interfaces and the treatment of user-triggered
+events. This simple design greatly facilitates the specification of user
+interactions with an ASP system, in our case clingo.
 
-##### **MedBioLM: Optimizing Medical and Biological QA with Fine-Tuned Large Language Models and Retrieval-Augmented Generation**
-2502.03004v1 by Seonok Kim
+摘要：我們提出 clinguin，一個用於 ASP 驅動使用者介面設計的系統。Clinguin 透過讓 ASP 開發人員直接在 ASP 中建立互動式原型，簡化了使用者介面的開發，消除了對個別前端語言的需求。為此，clinguin 使用一些專用的謂詞來定義使用者介面和處理使用者觸發的事件。這個簡單的設計極大地簡化了使用者與 ASP 系統互動的規範，在我們的案例中是 clingo。
 
-Large Language Models (LLMs) have demonstrated impressive capabilities across
-natural language processing tasks. However, their application to specialized
-domains such as medicine and biology requires further optimization to ensure
-factual accuracy, reliability, and contextual depth. We introduce MedBioLM, a
-domain-adapted biomedical question-answering model designed to enhance both
-short-form and long-form queries. By integrating fine-tuning and
-retrieval-augmented generation (RAG), MedBioLM dynamically incorporates
-domain-specific knowledge, improving reasoning abilities and factual accuracy.
-To evaluate its effectiveness, we fine-tuned the model on diverse biomedical QA
-datasets, covering structured multiple-choice assessments and complex clinical
-reasoning tasks. Fine-tuning significantly improves accuracy on benchmark
-datasets, while RAG enhances factual consistency. These results highlight the
-potential of domain-optimized LLMs in advancing biomedical research, medical
-education, and clinical decision support.
+##### **Pearce's Characterisation in an Epistemic Domain**
+2502.09221v1 by Ezgi Iraz Su
 
-摘要：大型語言模型 (LLM) 已展現出在自然語言處理任務中令人印象深刻的能力。然而，要將其應用於醫學和生物學等特定領域，需要進一步最佳化，以確保事實的準確性、可靠性以及脈絡的深度。我們引進了 MedBioLM，這是一個適應領域的生物醫學問答模型，旨在增強短式和長式查詢。透過整合微調和檢索增強生成 (RAG)，MedBioLM 能動態地納入領域特定的知識，從而提升推理能力和事實準確性。為了評估其有效性，我們對模型進行微調，使其涵蓋結構化的多重選擇評量和複雜的臨床推理任務等多樣化的生物醫學問答資料集。微調顯著提升了基準資料集的準確性，而 RAG 則增強了事實的一致性。這些結果突顯了領域最佳化的 LLM 在推進生物醫學研究、醫學教育和臨床決策支援方面的潛力。
+Answer-set programming (ASP) is a successful problem-solving approach in
+logic-based AI. In ASP, problems are represented as declarative logic programs,
+and solutions are identified through their answer sets. Equilibrium logic (EL)
+is a general-purpose nonmonotonic reasoning formalism, based on a monotonic
+logic called here-and-there logic. EL was basically proposed by Pearce as a
+foundational framework of ASP. Epistemic specifications (ES) are extensions of
+ASP-programs with subjective literals. These new modal constructs in the
+ASP-language make it possible to check whether a regular literal of ASP is true
+in every (or some) answer-set of a program. ES-programs are interpreted by
+world-views, which are essentially collections of answer-sets. (Reflexive)
+autoepistemic logic is a nonmonotonic formalism, modeling self-belief
+(knowledge) of ideally rational agents. A relatively new semantics for ES is
+based on a combination of EL and (reflexive) autoepistemic logic. In this
+paper, we first propose an overarching framework in the epistemic ASP domain.
+We then establish a correspondence between existing (reflexive) (auto)epistemic
+equilibrium logics and our easily-adaptable comprehensive framework, building
+on Pearce's characterisation of answer-sets as equilibrium models. We achieve
+this by extending Ferraris' work on answer sets for propositional theories to
+the epistemic case and reveal the relationship between some ES-semantic
+proposals.
 
-##### **Contrastive Token-level Explanations for Graph-based Rumour Detection**
-2502.04366v1 by Daniel Wai Kit Chin, Roy Ka-Wei Lee
+摘要：<paragraph>答案集程式設計（ASP）是基於邏輯的人工智慧中一種成功的問題解決方法。在 ASP 中，問題表示為宣告式邏輯程式，並透過其答案集來找出解答。平衡邏輯（EL）是一種通用的非單調推理形式主義，基於一種稱為此處和彼處邏輯的單調邏輯。EL 基本是由 Pearce 作為 ASP 的基礎架構所提出。知識規範（ES）是 ASP 程式與主觀文字的延伸。ASP 語言中的這些新模態建構使得可以檢查 ASP 的常規文字是否在程式的每個（或某些）答案集中為真。ES 程式由世界觀來詮釋，其本質上是答案集的集合。（反身）自認識邏輯是一種非單調形式主義，用來建模理想理性主體的自信念（知識）。ES 的一種相對新的語意是基於 EL 和（反身）自認識邏輯的組合。在本文中，我們首先提出一個涵蓋知識 ASP 領域的架構。然後，我們建立現有（反身）（自）認識平衡邏輯與我們容易適應的綜合架構之間的對應關係，建立在 Pearce 將答案集描述為平衡模型的特性之上。我們透過將 Ferraris 在命題理論的答案集上的工作延伸到知識案例，並揭示一些 ES 語義提案之間的關係來達成這一點。</paragraph>
 
-The widespread use of social media has accelerated the dissemination of
-information, but it has also facilitated the spread of harmful rumours, which
-can disrupt economies, influence political outcomes, and exacerbate public
-health crises, such as the COVID-19 pandemic. While Graph Neural Network
-(GNN)-based approaches have shown significant promise in automated rumour
-detection, they often lack transparency, making their predictions difficult to
-interpret. Existing graph explainability techniques fall short in addressing
-the unique challenges posed by the dependencies among feature dimensions in
-high-dimensional text embeddings used in GNN-based models. In this paper, we
-introduce Contrastive Token Layerwise Relevance Propagation (CT-LRP), a novel
-framework designed to enhance the explainability of GNN-based rumour detection.
-CT-LRP extends current graph explainability methods by providing token-level
-explanations that offer greater granularity and interpretability. We evaluate
-the effectiveness of CT-LRP across multiple GNN models trained on three
-publicly available rumour detection datasets, demonstrating that it
-consistently produces high-fidelity, meaningful explanations, paving the way
-for more robust and trustworthy rumour detection systems.
+##### **Graphical Conditions for the Existence, Unicity and Number of Regular Models**
+2502.09220v1 by Van-Giang Trinh, Belaid Benhamou, Sylvain Soliman, François Fages
 
-摘要：社群媒體的廣泛使用加速了資訊的傳播，但也促进了有害謠言的散播，這可能會擾亂經濟、影響政治結果，並加劇公共衛生危機，例如 COVID-19 大流行。雖然基於圖神經網路 (GNN) 的方法在自動化謠言偵測方面展現了顯著的前景，但它們通常缺乏透明度，這使得它們的預測難以解釋。現有的圖形可解釋性技術無法解決 GNN 模型中使用的維度嵌入式文本之間的依賴性所帶來的獨特挑戰。在本文中，我們介紹了對比標記分層關聯性傳播 (CT-LRP)，這是一個新穎的框架，旨在增強基於 GNN 的謠言偵測的可解釋性。CT-LRP 透過提供標記級別的解釋來擴充當前的圖形可解釋性方法，這些解釋提供了更細緻的粒度和可解釋性。我們在三個公開的謠言偵測資料集上訓練的幾個 GNN 模型中評估了 CT-LRP 的有效性，證明它始終產生高保真、有意義的解釋，為更強健且值得信賴的謠言偵測系統鋪路。
+The regular models of a normal logic program are a particular type of partial
+(i.e. 3-valued) models which correspond to stable partial models with minimal
+undefinedness. In this paper, we explore graphical conditions on the dependency
+graph of a finite ground normal logic program to analyze the existence, unicity
+and number of regular models for the program. We show three main results: 1) a
+necessary condition for the existence of non-trivial (i.e. non-2-valued)
+regular models, 2) a sufficient condition for the unicity of regular models,
+and 3) two upper bounds for the number of regular models based on positive
+feedback vertex sets. The first two conditions generalize the finite cases of
+the two existing results obtained by You and Yuan (1994) for normal logic
+programs with well-founded stratification. The third result is also new to the
+best of our knowledge. Key to our proofs is a connection that we establish
+between finite ground normal logic programs and Boolean network theory.
 
-##### **AI-Based Thermal Video Analysis in Privacy-Preserving Healthcare: A Case Study on Detecting Time of Birth**
-2502.04365v1 by Jorge García-Torres, Øyvind Meinich-Bache, Siren Rettedal, Kjersti Engan
+摘要：正规模型的常规模型是一种特殊类型的局部模型（即 3 值）模型，它对应于具有最小未定义性的稳定局部模型。在本文中，我们探索了有限接地正规逻辑程序的依赖图上的图形条件，以分析程序的正规模型的存在性、唯一性和数量。我们展示了三个主要结果：1) 非平凡（即非 2 值）正规模型存在的必要条件，2) 正规模型唯一性的充分条件，3) 基于正反馈顶点集的正规模型数目的两个上限。前两个条件概括了 You 和 Yuan (1994) 为具有良好基础分层的正规逻辑程序获得的两个现有结果的有限情况。据我们所知，第三个结果也是新的。我们证明的关键是我们在有限接地正规逻辑程序和布尔网络理论之间建立的联系。
 
-Approximately 10% of newborns need some assistance to start breathing and 5\%
-proper ventilation. It is crucial that interventions are initiated as soon as
-possible after birth. Accurate documentation of Time of Birth (ToB) is thereby
-essential for documenting and improving newborn resuscitation performance.
-However, current clinical practices rely on manual recording of ToB, typically
-with minute precision. In this study, we present an AI-driven, video-based
-system for automated ToB detection using thermal imaging, designed to preserve
-the privacy of healthcare providers and mothers by avoiding the use of
-identifiable visual data. Our approach achieves 91.4% precision and 97.4%
-recall in detecting ToB within thermal video clips during performance
-evaluation. Additionally, our system successfully identifies ToB in 96% of test
-cases with an absolute median deviation of 1 second compared to manual
-annotations. This method offers a reliable solution for improving ToB
-documentation and enhancing newborn resuscitation outcomes.
+##### **Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**
+2502.09218v1 by Flavio Bertini, Alessandro Dal Palù, Federica Zaglio, Francesco Fabiano, Andrea Formisano
 
-摘要：約 10% 的新生兒需要協助才能開始呼吸，5% 需要適當的通氣。在出生後盡快開始介入至關重要。準確記錄出生時間 (ToB) 對於記錄和改善新生兒復甦表現至關重要。然而，目前的臨床實務依賴於手動記錄 ToB，通常精確到分鐘。在這項研究中，我們提出一個以 AI 為主的、基於影片的系統，用於使用熱影像自動偵測 ToB，旨在透過避免使用可識別的視覺資料來保護醫療保健提供者和母親的隱私。我們的做法在執行評估期間，在熱影像片段中偵測 ToB 時達到了 91.4% 的精確度和 97.4% 的召回率。此外，我們的系統在 96% 的測試案例中成功識別出 ToB，與手動註解相比，絕對中位數偏差為 1 秒。此方法提供了一個可靠的解決方案，用於改善 ToB 記錄和增強新生兒復甦結果。
+This paper presents a complete explainable system that interprets a set of
+data, abstracts the underlying features and describes them in a natural
+language of choice. The system relies on two crucial stages: (i) identifying
+emerging properties from data and transforming them into abstract concepts, and
+(ii) converting these concepts into natural language. Despite the impressive
+natural language generation capabilities demonstrated by Large Language Models,
+their statistical nature and the intricacy of their internal mechanism still
+force us to employ these techniques as black boxes, forgoing trustworthiness.
+Developing an explainable pipeline for data interpretation would allow
+facilitating its use in safety-critical environments like processing medical
+information and allowing non-experts and visually impaired people to access
+narrated information. To this end, we believe that the fields of knowledge
+representation and automated reasoning research could present a valid
+alternative. Expanding on prior research that tackled the first stage (i), we
+focus on the second stage, named Concept2Text. Being explainable, data
+translation is easily modeled through logic-based rules, once again emphasizing
+the role of declarative programming in achieving AI explainability. This paper
+explores a Prolog/CLP-based rewriting system to interpret concepts-articulated
+in terms of classes and relations, plus common knowledge-derived from a generic
+ontology, generating natural language text. Its main features include
+hierarchical tree rewritings, modular multilingual generation, support for
+equivalent variants across semantic, grammar, and lexical levels, and a
+transparent rule-based system. We outline the architecture and demonstrate its
+flexibility through some examples capable of generating numerous diverse and
+equivalent rewritings based on the input concept.
 
-##### **3D Foundation AI Model for Generalizable Disease Detection in Head Computed Tomography**
-2502.02779v1 by Weicheng Zhu, Haoxu Huang, Huanze Tang, Rushabh Musthyala, Boyang Yu, Long Chen, Emilio Vega, Thomas O'Donnell, Seena Dehkharghani, Jennifer A. Frontera, Arjun V. Masurkar, Kara Melmed, Narges Razavian
+摘要：<paragraph>這篇論文提出了一個完整的可解釋系統，它可以解釋一組資料，抽象出基礎特徵，並以選擇的自然語言描述它們。系統依賴兩個關鍵階段：(i) 從資料中識別新興屬性，並將它們轉換為抽象概念，以及 (ii) 將這些概念轉換為自然語言。儘管大型語言模型展示了令人印象深刻的自然語言生成能力，但它們的統計性質和內部機制的複雜性仍然迫使我們將這些技術用作黑盒子，放棄可信度。開發一個可解釋的資料解釋管道將有助於促進在安全關鍵環境中使用它，例如處理醫療資訊，並允許非專家和視障人士存取敘述資訊。為此，我們相信知識表示和自動推理研究領域可以提出一個有效的替代方案。在擴展解決第一階段 (i) 的先前研究的基礎上，我們專注於第二階段，稱為 Concept2Text。由於具有可解釋性，資料翻譯很容易透過基於邏輯的規則建模，再次強調宣告式程式設計在實現 AI 可解釋性中的作用。本文探討了一個基於 Prolog/CLP 的重寫系統，以解釋概念，這些概念以類別和關係的形式表達，再加上從通用本体衍生的常識，產生自然語言文字。它的主要特點包括階層樹重寫、模組化多語言生成、支援語義、語法和詞彙層面的等效變體，以及一個透明的基於規則的系統。我們概述了架構，並透過一些範例展示了它的靈活性，這些範例能夠根據輸入概念生成許多不同的等效重寫。</paragraph>
 
-Head computed tomography (CT) imaging is a widely-used imaging modality with
-multitudes of medical indications, particularly in assessing pathology of the
-brain, skull, and cerebrovascular system. It is commonly the first-line imaging
-in neurologic emergencies given its rapidity of image acquisition, safety,
-cost, and ubiquity. Deep learning models may facilitate detection of a wide
-range of diseases. However, the scarcity of high-quality labels and
-annotations, particularly among less common conditions, significantly hinders
-the development of powerful models. To address this challenge, we introduce
-FM-CT: a Foundation Model for Head CT for generalizable disease detection,
-trained using self-supervised learning. Our approach pre-trains a deep learning
-model on a large, diverse dataset of 361,663 non-contrast 3D head CT scans
-without the need for manual annotations, enabling the model to learn robust,
-generalizable features. To investigate the potential of self-supervised
-learning in head CT, we employed both discrimination with self-distillation and
-masked image modeling, and we construct our model in 3D rather than at the
-slice level (2D) to exploit the structure of head CT scans more comprehensively
-and efficiently. The model's downstream classification performance is evaluated
-using internal and three external datasets, encompassing both in-distribution
-(ID) and out-of-distribution (OOD) data. Our results demonstrate that the
-self-supervised foundation model significantly improves performance on
-downstream diagnostic tasks compared to models trained from scratch and
-previous 3D CT foundation models on scarce annotated datasets. This work
-highlights the effectiveness of self-supervised learning in medical imaging and
-sets a new benchmark for head CT image analysis in 3D, enabling broader use of
-artificial intelligence for head CT-based diagnosis.
+##### **Mind the Gaps: Logical English, Prolog, and Multi-agent Systems for Autonomous Vehicles**
+2502.09216v1 by Galileo Sartor, Adam Wyner, Giuseppe Contissa
+
+In this paper, we present a modular system for representing and reasoning
+with legal aspects of traffic rules for autonomous vehicles. We focus on a
+subset of the United Kingdom's Highway Code (HC) related to junctions. As human
+drivers and automated vehicles (AVs) will interact on the roads, especially in
+urban environments, we claim that an accessible, unitary, high-level
+computational model should exist and be applicable to both users. Autonomous
+vehicles introduce a shift in liability that should not bring disadvantages or
+increased burden on human drivers. We develop a system "in silico" of the
+model. The proposed system is built of three main components: a natural
+language interface, using Logical English, which encodes the rules; an internal
+representation of the rules in Prolog; and an multi-agent-based simulation
+environment, built in NetLogo. The three components interact: Logical English
+is translated into and out of Prolog (along with some support code); Prolog and
+NetLogo interface via predicates. Such a modular approach enables the different
+components to carry different "burdens" in the overall system; it also allows
+swapping of modules. Given NetLogo, we can visualize the effect of the modeled
+rules as well as validate the system with a simple dynamic running scenario.
+Designated agents monitor the behaviour of the vehicles for compliance and
+record potential violations where they occur. The information on potential
+violations is then utilized by Validators, to determine whether the violation
+is punishable, differentiating between exceptions and cases.
 
-摘要：頭部電腦斷層掃描（CT）影像是一種廣泛使用的影像模式，具有
-大量的醫療適應症，特別是在評估腦部、頭骨和腦血管系統的病理時。由於其影像擷取速度快、安全性、成本低和普遍性，通常是神經緊急情況下的第一線影像。深度學習模型可以促進對各種疾病的檢測。然而，高品質標籤和註釋的稀缺，特別是在較不常見的疾病中，顯著地阻礙了強大模型的發展。為了應對這一挑戰，我們引入了 FM-CT：一個用於頭部 CT 的基礎模型，用於可概化的疾病檢測，並使用自我監督學習進行訓練。我們的做法在一個包含 361,663 個非對比 3D 頭部 CT 掃描的大型、多樣化的數據集上預訓練一個深度學習模型，而無需手動註釋，使模型能夠學習強健、可概化的特徵。為了探討自我監督學習在頭部 CT 中的潛力，我們同時採用了帶有自我蒸餾的判別和遮罩影像建模，並且我們以 3D 而不是切片層級（2D）構建我們的模型，以更全面、有效地利用頭部 CT 掃描的結構。該模型的下游分類效能使用內部和三個外部數據集進行評估，包括分佈內 (ID) 和分佈外 (OOD) 資料。我們的結果表明，與從頭開始訓練的模型和先前在稀疏註釋數據集上訓練的 3D CT 基礎模型相比，自我監督基礎模型顯著改善了下游診斷任務的效能。這項工作突顯了自我監督學習在醫學影像中的有效性，並為 3D 頭部 CT 影像分析設定了一個新的基準，讓人工智慧能夠更廣泛地用於基於頭部 CT 的診斷。
+摘要：<paragraph>在本文中，我們提出了一個模組化系統，用於表示和推理自動駕駛車輛交通規則的法律層面。我們專注於與路口相關的英國公路法規 (HC) 子集。由於人類駕駛和自動駕駛車輛 (AV) 將在道路上互動，尤其是在城市環境中，我們主張應存在一個可存取、統一、高階的運算模型，並適用於這兩種使用者。自動駕駛車輛引入了責任轉移，不應給人類駕駛帶來劣勢或增加負擔。我們開發了一個模型的「電腦模擬」系統。所提出的系統由三個主要組成部分建構而成：使用邏輯英語的自然語言介面，用於編碼規則；使用 Prolog 的規則內部表示；以及使用 NetLogo 建構的多主體模擬環境。這三個組成部分會進行互動：邏輯英語會翻譯成 Prolog（以及一些支援程式碼），再從 Prolog 翻譯回來；Prolog 和 NetLogo 會透過謂詞進行介面。這種模組化方法讓不同的組成部分可以在整體系統中承擔不同的「負擔」；它也允許模組交換。有了 NetLogo，我們可以視覺化已建模規則的效果，並使用一個簡單的動態執行範例來驗證系統。指定的代理會監控車輛的行為，以確保遵守規定，並記錄發生的潛在違規行為。然後，驗證者會利用潛在違規行為的資訊，來確定違規行為是否應受懲罰，並區分例外情況和案例。</paragraph>
 
-##### **Adaptive Voxel-Weighted Loss Using L1 Norms in Deep Neural Networks for Detection and Segmentation of Prostate Cancer Lesions in PET/CT Images**
-2502.02756v1 by Obed Korshie Dzikunu, Shadab Ahamed, Amirhossein Toosi, Xiaoxiao Li, Arman Rahmim
+##### **Architecture for Simulating Behavior Mode Changes in Norm-Aware Autonomous Agents**
+2502.09215v1 by Sean Glaze, Daniela Inclezan
 
-This study proposes a new loss function for deep neural networks, L1-weighted
-Dice Focal Loss (L1DFL), that leverages L1 norms for adaptive weighting of
-voxels based on their classification difficulty, towards automated detection
-and segmentation of metastatic prostate cancer lesions in PET/CT scans. We
-obtained 380 PSMA [18-F] DCFPyL PET/CT scans of patients diagnosed with
-biochemical recurrence metastatic prostate cancer. We trained two 3D
-convolutional neural networks, Attention U-Net and SegResNet, and concatenated
-the PET and CT volumes channel-wise as input. The performance of our custom
-loss function was evaluated against the Dice and Dice Focal Loss functions. For
-clinical significance, we considered a detected region of interest (ROI) as a
-true positive if at least the voxel with the maximum standardized uptake value
-falls within the ROI. We assessed the models' performance based on the number
-of lesions in an image, tumour volume, activity, and extent of spread. The
-L1DFL outperformed the comparative loss functions by at least 13% on the test
-set. In addition, the F1 scores of the Dice Loss and the Dice Focal Loss were
-lower than that of L1DFL by at least 6% and 34%, respectively. The Dice Focal
-Loss yielded more false positives, whereas the Dice Loss was more sensitive to
-smaller volumes and struggled to segment larger lesions accurately. They also
-exhibited network-specific variations and yielded declines in segmentation
-accuracy with increased tumour spread. Our results demonstrate the potential of
-L1DFL to yield robust segmentation of metastatic prostate cancer lesions in
-PSMA PET/CT images. The results further highlight potential complexities
-arising from the variations in lesion characteristics that may influence
-automated prostate cancer tumour detection and segmentation. The code is
-publicly available at: https://github.com/ObedDzik/pca_segment.git.
+This paper presents an architecture for simulating the actions of a
+norm-aware intelligent agent whose behavior with respect to norm compliance is
+set, and can later be changed, by a human controller. Updating an agent's
+behavior mode from a norm-abiding to a riskier one may be relevant when the
+agent is involved in time-sensitive rescue operations, for example. We base our
+work on the Authorization and Obligation Policy Language AOPL designed by
+Gelfond and Lobo for the specification of norms. We introduce an architecture
+and a prototype software system that can be used to simulate an agent's plans
+under different behavior modes that can later be changed by the controller. We
+envision such software to be useful to policy makers, as they can more readily
+understand how agents may act in certain situations based on the agents'
+attitudes towards norm-compliance. Policy makers may then refine their policies
+if simulations show unwanted consequences.
 
-摘要：<paragraph>本研究針對深度神經網路提出一個新的損失函數，L1 加權 Dice 焦點損失 (L1DFL)，它利用 L1 範數根據體素的分類難度進行自適應加權，用於自動偵測和分割 PET/CT 掃描中轉移性前列腺癌病灶。我們取得 380 個經診斷為生化復發轉移性前列腺癌的患者的 PSMA [18-F] DCFPyL PET/CT 掃描。我們訓練了兩個 3D 捲積神經網路，Attention U-Net 和 SegResNet，並將 PET 和 CT 體積按通道連接作為輸入。我們自訂的損失函數的效能與 Dice 和 Dice 焦點損失函數進行評估。為了臨床意義，我們將一個偵測到的感興趣區域 (ROI) 視為真陽性，如果至少具有最大標準攝取值的體素落在 ROI 內。我們根據影像中的病灶數量、腫瘤體積、活性，以及擴散程度評估模型的效能。L1DFL 在測試組中至少比比較損失函數高出 13%。此外，Dice 損失和 Dice 焦點損失的 F1 分數分別比 L1DFL 低至少 6% 和 34%。Dice 焦點損失產生更多假陽性，而 Dice 損失對較小體積較為敏感，且難以準確分割較大病灶。它們也展現出網路特定的變化，並隨著腫瘤擴散而導致分割準確度下降。我們的結果證明 L1DFL 具有在 PSMA PET/CT 影像中產生轉移性前列腺癌病灶的強健分割的潛力。結果進一步強調由病灶特徵變化所產生的潛在複雜性，這可能會影響自動化前列腺癌腫瘤偵測和分割。程式碼公開於：https://github.com/ObedDzik/pca_segment.git。</paragraph>
+摘要：本文提出了一個架構，用於模擬一個規範感知智能代理的行為，其行為遵守規範，並可以由人類控制者設定，並可以在稍後進行更改。當代理參與時間敏感的救援行動時，將代理的行為模式從遵守規範更新為更冒險的行為模式可能是相關的。我們的工作基於 Gelfond 和 Lobo 為規範規範設計的授權和義務政策語言 AOPL。我們引入了一個架構和一個原型軟體系統，可用於模擬代理在不同行為模式下的計畫，這些行為模式稍後可以由控制者更改。我們預計此類軟體對政策制定者很有用，因為他們可以更容易地根據代理對規範遵守的態度了解代理在特定情況下的行為方式。如果模擬顯示出不希望的後果，政策制定者可以修改他們的政策。
 
-##### **Diffusion Instruction Tuning**
-2502.06814v1 by Chen Jin, Ryutaro Tanno, Amrutha Saseendran, Tom Diethe, Philip Teare
+##### **Neuro-Symbolic Contrastive Learning for Cross-domain Inference**
+2502.09213v1 by Mingyue Liu, Ryo Ueda, Zhen Wan, Katsumi Inoue, Chris G. Willcocks
 
-We introduce Lavender, a simple supervised fine-tuning (SFT) method that
-boosts the performance of advanced vision-language models (VLMs) by leveraging
-state-of-the-art image generation models such as Stable Diffusion.
-Specifically, Lavender aligns the text-vision attention in the VLM transformer
-with the equivalent used by Stable Diffusion during SFT, instead of adapting
-separate encoders. This alignment enriches the model's visual understanding and
-significantly boosts performance across in- and out-of-distribution tasks.
-Lavender requires just 0.13 million training examples, 2.5% of typical
-large-scale SFT datasets, and fine-tunes on standard hardware (8 GPUs) in a
-single day. It consistently improves state-of-the-art open-source multimodal
-LLMs (e.g., Llama-3.2-11B, MiniCPM-Llama3-v2.5), achieving up to 30% gains and
-a 68% boost on challenging out-of-distribution medical QA tasks. By efficiently
-transferring the visual expertise of image generators with minimal supervision,
-Lavender offers a scalable solution for more accurate vision-language systems.
-All code, training data, and models will be shared at
-https://astrazeneca.github.io/vlm/.
+Pre-trained language models (PLMs) have made significant advances in natural
+language inference (NLI) tasks, however their sensitivity to textual
+perturbations and dependence on large datasets indicate an over-reliance on
+shallow heuristics. In contrast, inductive logic programming (ILP) excels at
+inferring logical relationships across diverse, sparse and limited datasets,
+but its discrete nature requires the inputs to be precisely specified, which
+limits their application. This paper proposes a bridge between the two
+approaches: neuro-symbolic contrastive learning. This allows for smooth and
+differentiable optimisation that improves logical accuracy across an otherwise
+discrete, noisy, and sparse topological space of logical functions. We show
+that abstract logical relationships can be effectively embedded within a
+neuro-symbolic paradigm, by representing data as logic programs and sets of
+logic rules. The embedding space captures highly varied textual information
+with similar semantic logical relations, but can also separate similar textual
+relations that have dissimilar logical relations. Experimental results
+demonstrate that our approach significantly improves the inference capabilities
+of the models in terms of generalisation and reasoning.
 
-摘要：<paragraph>我們介紹 Lavender，一種簡單的監督微調 (SFT) 方法，它透過利用 Stable Diffusion 等最先進的影像生成模型來提升先進視覺語言模型 (VLM) 的效能。
-具體來說，Lavender 在 SFT 期間將 VLM 轉換器中的文字視覺注意力與 Stable Diffusion 使用的等效注意力對齊，而不是調整單獨的編碼器。此對齊豐富了模型的視覺理解，並顯著提升了分佈內外任務的效能。
-Lavender 只需要 0.13 百萬個訓練範例，相當於典型大型 SFT 資料集的 2.5%，並在標準硬體 (8 個 GPU) 上於一天內進行微調。它持續改善最先進的開放原始碼多模態 LLM（例如 Llama-3.2-11B、MiniCPM-Llama3-v2.5），在具有挑戰性的分佈外醫療 QA 任務中獲得高達 30% 的收益和 68% 的提升。透過有效轉移影像生成器的視覺專業知識，並僅需最少的監督，Lavender 提供了一個可擴充的解決方案，以實現更準確的視覺語言系統。
-所有程式碼、訓練資料和模型將在 https://astrazeneca.github.io/vlm/ 分享。</paragraph>
+摘要：預訓練語言模型 (PLM) 在自然語言推理 (NLI) 任務中取得了重大進展，然而它們對文本擾動的敏感性和對大型資料集的依賴性表明過度依賴於淺層啟發法。相比之下，歸納邏輯規劃 (ILP) 擅長推論跨越多樣化、稀疏和有限資料集的邏輯關係，但其離散性質要求輸入被精確指定，這限制了它們的應用。本文提出了兩種方法之間的橋樑：神經符號對比學習。這允許平滑且可微分的優化，從而提高邏輯函數的離散、嘈雜和稀疏拓撲空間中的邏輯準確性。我們展示了抽象邏輯關係可以通過將資料表示為邏輯程式和邏輯規則集，有效地嵌入到神經符號範例中。嵌入空間捕獲具有相似語義邏輯關係的高度多變的文本資訊，但也可以分離具有不同邏輯關係的相似文本關係。實驗結果表明，我們的做法在泛化和推理方面顯著提高了模型的推理能力。
 
-##### **MedRAX: Medical Reasoning Agent for Chest X-ray**
-2502.02673v1 by Adibvafa Fallahpour, Jun Ma, Alif Munim, Hongwei Lyu, Bo Wang
+##### **LP-LM: No Hallucinations in Question Answering with Logic Programming**
+2502.09212v1 by Katherine Wu, Yanhong A. Liu
 
-Chest X-rays (CXRs) play an integral role in driving critical decisions in
-disease management and patient care. While recent innovations have led to
-specialized models for various CXR interpretation tasks, these solutions often
-operate in isolation, limiting their practical utility in clinical practice. We
-present MedRAX, the first versatile AI agent that seamlessly integrates
-state-of-the-art CXR analysis tools and multimodal large language models into a
-unified framework. MedRAX dynamically leverages these models to address complex
-medical queries without requiring additional training. To rigorously evaluate
-its capabilities, we introduce ChestAgentBench, a comprehensive benchmark
-containing 2,500 complex medical queries across 7 diverse categories. Our
-experiments demonstrate that MedRAX achieves state-of-the-art performance
-compared to both open-source and proprietary models, representing a significant
-step toward the practical deployment of automated CXR interpretation systems.
-Data and code have been publicly available at
-https://github.com/bowang-lab/MedRAX
+Large language models (LLMs) are able to generate human-like responses to
+user queries. However, LLMs exhibit inherent limitations, especially because
+they hallucinate. This paper introduces LP-LM, a system that grounds answers to
+questions in known facts contained in a knowledge base (KB), facilitated
+through semantic parsing in Prolog, and always produces answers that are
+reliable.
+  LP-LM generates a most probable constituency parse tree along with a
+corresponding Prolog term for an input question via Prolog definite clause
+grammar (DCG) parsing. The term is then executed against a KB of natural
+language sentences also represented as Prolog terms for question answering. By
+leveraging DCG and tabling, LP-LM runs in linear time in the size of input
+sentences for sufficiently many grammar rules. Performing experiments comparing
+LP-LM with current well-known LLMs in accuracy, we show that LLMs hallucinate
+on even simple questions, unlike LP-LM.
 
-摘要：胸部 X 光片 (CXR) 在疾病管理和患者照護中扮演著不可或缺的角色，推動著關鍵決策的制定。儘管近期的創新已針對各種 CXR 解讀任務開發出專門的模型，但這些解決方案通常獨立運作，限制了它們在臨床實務中的實際效用。我們提出 MedRAX，這是一款首創的多功能 AI 代理，它將最先進的 CXR 分析工具和多模態大型語言模型無縫整合到一個統一的架構中。MedRAX 動態運用這些模型來解決複雜的醫療查詢，而無需額外的訓練。為了嚴格評估其功能，我們引入了 ChestAgentBench，這是一個全面的基準，包含 7 個不同類別的 2,500 個複雜醫療查詢。我們的實驗證明，與開源和專有模型相比，MedRAX 達到了最先進的效能，這代表了自動化 CXR 解讀系統實際部署的重要一步。資料和程式碼已公開於 https://github.com/bowang-lab/MedRAX
+摘要：大型語言模型 (LLM) 能產生類似人類的回應來回答使用者的問題。然而，LLM 顯示出內在的限制，特別是因為它們會產生幻覺。本文介紹 LP-LM，一個系統，它將問題的答案建立在知識庫 (KB) 中已知的事實上，透過 Prolog 中的語義解析來促進，並始終產生可靠的答案。
+LP-LM 透過 Prolog 明確條款語法 (DCG) 解析產生一個最可能的成分解析樹，以及輸入問題對應的 Prolog 詞彙。然後，針對一個自然語言句子的 KB 執行該詞彙，也表示為 Prolog 詞彙，以進行問題解答。透過利用 DCG 和 tabling，LP-LM 在輸入句子的大小上以線性時間執行，對於足夠多的語法規則。執行實驗比較 LP-LM 與目前眾所周知的 LLM 在準確性上，我們顯示出 LLM 甚至會對簡單的問題產生幻覺，這與 LP-LM 不同。
 
-##### **Open Foundation Models in Healthcare: Challenges, Paradoxes, and Opportunities with GenAI Driven Personalized Prescription**
-2502.04356v1 by Mahdi Alkaeed, Sofiat Abioye, Adnan Qayyum, Yosra Magdi Mekki, Ilhem Berrou, Mohamad Abdallah, Ala Al-Fuqaha, Muhammad Bilal, Junaid Qadir
+##### **Visual Graph Question Answering with ASP and LLMs for Language Parsing**
+2502.09211v1 by Jakob Johannes Bauer, Thomas Eiter, Nelson Higuera Ruiz, Johannes Oetsch
 
-In response to the success of proprietary Large Language Models (LLMs) such
-as OpenAI's GPT-4, there is a growing interest in developing open,
-non-proprietary LLMs and AI foundation models (AIFMs) for transparent use in
-academic, scientific, and non-commercial applications. Despite their inability
-to match the refined functionalities of their proprietary counterparts, open
-models hold immense potential to revolutionize healthcare applications. In this
-paper, we examine the prospects of open-source LLMs and AIFMs for developing
-healthcare applications and make two key contributions. Firstly, we present a
-comprehensive survey of the current state-of-the-art open-source healthcare
-LLMs and AIFMs and introduce a taxonomy of these open AIFMs, categorizing their
-utility across various healthcare tasks. Secondly, to evaluate the
-general-purpose applications of open LLMs in healthcare, we present a case
-study on personalized prescriptions. This task is particularly significant due
-to its critical role in delivering tailored, patient-specific medications that
-can greatly improve treatment outcomes. In addition, we compare the performance
-of open-source models with proprietary models in settings with and without
-Retrieval-Augmented Generation (RAG). Our findings suggest that, although less
-refined, open LLMs can achieve performance comparable to proprietary models
-when paired with grounding techniques such as RAG. Furthermore, to highlight
-the clinical significance of LLMs-empowered personalized prescriptions, we
-perform subjective assessment through an expert clinician. We also elaborate on
-ethical considerations and potential risks associated with the misuse of
-powerful LLMs and AIFMs, highlighting the need for a cautious and responsible
-implementation in healthcare.
+Visual Question Answering (VQA) is a challenging problem that requires to
+process multimodal input. Answer-Set Programming (ASP) has shown great
+potential in this regard to add interpretability and explainability to modular
+VQA architectures. In this work, we address the problem of how to integrate ASP
+with modules for vision and natural language processing to solve a new and
+demanding VQA variant that is concerned with images of graphs (not graphs in
+symbolic form). Images containing graph-based structures are an ubiquitous and
+popular form of visualisation. Here, we deal with the particular problem of
+graphs inspired by transit networks, and we introduce a novel dataset that
+amends an existing one by adding images of graphs that resemble metro lines.
+Our modular neuro-symbolic approach combines optical graph recognition for
+graph parsing, a pretrained optical character recognition neural network for
+parsing labels, Large Language Models (LLMs) for language processing, and ASP
+for reasoning. This method serves as a first baseline and achieves an overall
+average accuracy of 73% on the dataset. Our evaluation provides further
+evidence of the potential of modular neuro-symbolic systems, in particular with
+pretrained models that do not involve any further training and logic
+programming for reasoning, to solve complex VQA tasks.
 
-摘要：<paragraph>為了回應 OpenAI 的 GPT-4 等專有大型語言模型 (LLM) 的成功，開發開放、非專有的 LLM 和人工智慧基礎模型 (AIFM) 以透明地用於學術、科學和非商業應用中，引起了越來越大的興趣。儘管無法與其專有對應產品的精緻功能相匹配，但開放模型在革新醫療保健應用方面具有巨大的潛力。在本文中，我們探討了開放原始碼 LLM 和 AIFM 在開發醫療保健應用方面的前景，並提出了兩項關鍵貢獻。首先，我們對當前最先進的開放原始碼醫療保健 LLM 和 AIFM 進行了全面的調查，並介紹了這些開放 AIFM 的分類法，對它們在各種醫療保健任務中的效用進行了分類。其次，為了評估開放 LLM 在醫療保健中的通用應用，我們對個人化處方進行了案例研究。這項任務特別重要，因為它在提供量身定制的患者特定藥物方面發揮著關鍵作用，可以大大改善治療效果。此外，我們比較了開放原始碼模型與專有模型在有和沒有檢索增強生成 (RAG) 的設置中的性能。我們的研究結果表明，儘管不太精緻，但開放 LLM 在與 RAG 等基礎技術配對時，可以實現與專有模型相當的性能。此外，為了強調 LLM 賦能的個性化處方的臨床意義，我們通過專家臨床醫生進行了主觀評估。我們還詳細說明了與濫用強大的 LLM 和 AIFM 相關的倫理考量和潛在風險，強調了在醫療保健中謹慎和負責任地實施的必要性。</paragraph>
+摘要：視覺問答（VQA）是一項具有挑戰性的問題，需要處理多模態輸入。答案集程式設計（ASP）在這方面顯示出巨大的潛力，可以為模組化 VQA 架構增加可解釋性和說明性。在這項工作中，我們探討如何將 ASP 與視覺和自然語言處理模組整合，以解決一個新的且要求嚴格的 VQA 變體，該變體與圖形影像（而非符號形式的圖形）有關。包含圖形結構的影像是一種普遍且流行的可視化形式。在這裡，我們處理受交通網路啟發的圖形特定問題，並引入一個新的資料集，透過新增類似地鐵路線的圖形影像來修正現有資料集。我們的模組化神經符號方法結合光學圖形辨識進行圖形解析、預先訓練的光學字元辨識神經網路進行標籤解析、大型語言模型（LLM）進行語言處理，以及 ASP 進行推理。此方法作為第一個基準，在資料集上達到 73% 的整體平均準確度。我們的評估進一步證明了模組化神經符號系統的潛力，特別是預先訓練的模型，這些模型不涉及任何進一步的訓練和邏輯程式設計進行推理，以解決複雜的 VQA 任務。
 
-##### **Decision Theoretic Foundations for Conformal Prediction: Optimal Uncertainty Quantification for Risk-Averse Agents**
-2502.02561v1 by Shayan Kiyani, George Pappas, Aaron Roth, Hamed Hassani
+##### **On LLM-generated Logic Programs and their Inference Execution Methods**
+2502.09209v1 by Paul Tarau
 
-A fundamental question in data-driven decision making is how to quantify the
-uncertainty of predictions in ways that can usefully inform downstream action.
-This interface between prediction uncertainty and decision-making is especially
-important in risk-sensitive domains, such as medicine. In this paper, we
-develop decision-theoretic foundations that connect uncertainty quantification
-using prediction sets with risk-averse decision-making. Specifically, we answer
-three fundamental questions: (1) What is the correct notion of uncertainty
-quantification for risk-averse decision makers? We prove that prediction sets
-are optimal for decision makers who wish to optimize their value at risk. (2)
-What is the optimal policy that a risk averse decision maker should use to map
-prediction sets to actions? We show that a simple max-min decision policy is
-optimal for risk-averse decision makers. Finally, (3) How can we derive
-prediction sets that are optimal for such decision makers? We provide an exact
-characterization in the population regime and a distribution free finite-sample
-construction. Answering these questions naturally leads to an algorithm,
-Risk-Averse Calibration (RAC), which follows a provably optimal design for
-deriving action policies from predictions. RAC is designed to be both
-practical-capable of leveraging the quality of predictions in a black-box
-manner to enhance downstream utility-and safe-adhering to a user-defined risk
-threshold and optimizing the corresponding risk quantile of the user's
-downstream utility. Finally, we experimentally demonstrate the significant
-advantages of RAC in applications such as medical diagnosis and recommendation
-systems. Specifically, we show that RAC achieves a substantially improved
-trade-off between safety and utility, offering higher utility compared to
-existing methods while maintaining the safety guarantee.
+Large Language Models (LLMs) trained on petabytes of data are highly
+compressed repositories of a significant proportion of the knowledge
+accumulated and distilled so far. In this paper we study techniques to elicit
+this knowledge in the form of several classes of logic programs, including
+propositional Horn clauses, Dual Horn clauses, relational triplets and Definite
+Clause Grammars. Exposing this knowledge as logic programs enables sound
+reasoning methods that can verify alignment of LLM outputs to their intended
+uses and extend their inference capabilities. We study new execution methods
+for the generated programs, including soft-unification of abducible facts
+against LLM-generated content stored in a vector database as well as GPU-based
+acceleration of minimal model computation that supports inference with large
+LLM-generated programs.
 
-摘要：<paragraph>在資料驅動決策中，一個基本問題是，如何量化預測的不確定性，以能有用地告知下游行動。
-預測不確定性和決策制定之間的這種介面，在風險敏感領域中特別重要，例如醫學。在本文中，我們
-發展了決策理論基礎，它利用預測集合將不確定性量化與風險規避決策制定聯繫起來。具體來說，我們回答
-了三個基本問題：(1) 對於風險規避決策者來說，不確定性量化的正確概念是什麼？我們證明，對於希望最佳化其風險價值的決策者來說，預測集合是最佳的。(2)
-風險規避決策者應使用什麼最佳政策，將預測集合映射到行動？我們表明，對於風險規避決策者來說，一個簡單的最大最小決策政策是最佳的。最後，(3) 我們如何推導出對此類決策者來說最佳的預測集合？我們在總體範圍內提供了一個確切的表徵，並提供了一個不依賴分佈的有限樣本建構。回答這些問題自然會導致一個演算法，風險規避校準 (RAC)，它遵循一個可證明最佳的設計，從預測中推導出行動政策。RAC 被設計為既實用——能夠以黑盒方式利用預測的品質來增強下游效用——又安全——遵守使用者定義的風險閾值，並最佳化使用者的下游效用的對應風險分位數。最後，我們在醫學診斷和推薦系統等應用中，以實驗方式證明了 RAC 的顯著優點。具體來說，我們表明，與現有方法相比，RAC 在安全性和效用之間實現了顯著改善的折衷，在維持安全保證的同時，提供了更高的效用。</paragraph>
+摘要：大型語言模型 (LLM) 在數位位元組的資料上受過訓練，是目前為止累積和提煉的知識中，高度濃縮的儲存庫。在本文中，我們研究了以數種邏輯程式類別的形式引出這些知識的技術，包括命題霍恩子句、雙重霍恩子句、關聯三元組和確定子句文法。將這些知識作為邏輯程式揭露，能啟用健全的推理方法，驗證 LLM 輸出的對齊方式，符合其預期的用途，並擴展其推論能力。我們研究了產生程式的新執行方法，包括對儲存在向量資料庫中的 LLM 產生內容，進行可約簡事實的軟統一，以及支援使用大型 LLM 產生程式進行推論的，基於 GPU 的最小模型計算加速。
 
-##### **CoRPA: Adversarial Image Generation for Chest X-rays Using Concept Vector Perturbations and Generative Models**
-2502.05214v1 by Amy Rafferty, Rishi Ramaesh, Ajitha Rajan
+##### **Efficient OWL2QL Meta-reasoning Using ASP-based Hybrid Knowledge Bases**
+2502.09206v1 by Haya Majid Qureshi, Wolfgang Faber
 
-Deep learning models for medical image classification tasks are becoming
-widely implemented in AI-assisted diagnostic tools, aiming to enhance
-diagnostic accuracy, reduce clinician workloads, and improve patient outcomes.
-However, their vulnerability to adversarial attacks poses significant risks to
-patient safety. Current attack methodologies use general techniques such as
-model querying or pixel value perturbations to generate adversarial examples
-designed to fool a model. These approaches may not adequately address the
-unique characteristics of clinical errors stemming from missed or incorrectly
-identified clinical features. We propose the Concept-based Report Perturbation
-Attack (CoRPA), a clinically-focused black-box adversarial attack framework
-tailored to the medical imaging domain. CoRPA leverages clinical concepts to
-generate adversarial radiological reports and images that closely mirror
-realistic clinical misdiagnosis scenarios. We demonstrate the utility of CoRPA
-using the MIMIC-CXR-JPG dataset of chest X-rays and radiological reports. Our
-evaluation reveals that deep learning models exhibiting strong resilience to
-conventional adversarial attacks are significantly less robust when subjected
-to CoRPA's clinically-focused perturbations. This underscores the importance of
-addressing domain-specific vulnerabilities in medical AI systems. By
-introducing a specialized adversarial attack framework, this study provides a
-foundation for developing robust, real-world-ready AI models in healthcare,
-ensuring their safe and reliable deployment in high-stakes clinical
-environments.
+Metamodeling refers to scenarios in ontologies in which classes and roles can
+be members of classes or occur in roles. This is a desirable modelling feature
+in several applications, but allowing it without restrictions is problematic
+for several reasons, mainly because it causes undecidability. Therefore,
+practical languages either forbid metamodeling explicitly or treat occurrences
+of classes as instances to be semantically different from other occurrences,
+thereby not allowing metamodeling semantically. Several extensions have been
+proposed to provide metamodeling to some extent. Building on earlier work that
+reduces metamodeling query answering to Datalog query answering, recently
+reductions to query answering over hybrid knowledge bases were proposed with
+the aim of using the Datalog transformation only where necessary. Preliminary
+work showed that the approach works, but the hoped-for performance improvements
+were not observed yet. In this work we expand on this body of work by improving
+the theoretical basis of the reductions and by using alternative tools that
+show competitive performance.
 
-摘要：深度学习模型用于医学影像分类任务，在人工智能辅助诊断工具中得到广泛应用，旨在提高诊断准确性、减少临床医生的工作量并改善患者的治疗效果。然而，它们对对抗性攻击的脆弱性给患者安全带来了重大风险。目前的攻击方法使用通用技术，例如模型查询或像素值扰动来生成对抗性示例，旨在欺骗模型。这些方法可能无法充分解决源自遗漏或错误识别的临床特征的临床错误的独特特征。我们提出了基于概念的报告扰动攻击 (CoRPA)，这是一种以临床为中心的、针对医学成像领域的、黑盒对抗性攻击框架。CoRPA 利用临床概念来生成对抗性放射学报告和图像，这些报告和图像与现实的临床误诊场景非常相似。我们使用胸部 X 射线和放射学报告的 MIMIC-CXR-JPG 数据集演示了 CoRPA 的效用。我们的评估表明，对传统对抗性攻击表现出强大弹性的深度学习模型在受到 CoRPA 以临床为中心的扰动时，其鲁棒性明显降低。这强调了在医疗人工智能系统中解决特定领域漏洞的重要性。通过引入专门的对抗性攻击框架，本研究为在医疗保健领域开发健壮、面向现实世界的 AI 模型奠定了基础，确保它们在高风险临床环境中安全可靠地部署。
+摘要：元建模是指本体中的場景，其中類別和角色可以是類別成員或出現在角色中。這是一個在多個應用中理想的建模功能，但允許它不受限制會因多個原因而產生問題，主要是因為它會導致無法決定。因此，實用的語言會明確禁止元建模，或將類別的出現視為與其他出現語義不同的實例，從而語義上不允許元建模。已經提出多個擴充功能，在一定程度上提供元建模。建立在將元建模查詢回答簡化為 Datalog 查詢回答的早期工作之上，最近提出了將查詢回答簡化為混合知識庫的簡化，目的是僅在必要時使用 Datalog 轉換。初步工作顯示該方法有效，但尚未觀察到預期的效能改善。在這項工作中，我們透過改善簡化的理論基礎和使用表現競爭力的替代工具，擴展了這項工作。
 
-##### **A Self-Supervised Framework for Improved Generalisability in Ultrasound B-mode Image Segmentation**
-2502.02489v1 by Edward Ellis, Andrew Bulpitt, Nasim Parsa, Michael F Byrne, Sharib Ali
+##### **Counterfactual Explanations as Plans**
+2502.09205v1 by Vaishak Belle
 
-Ultrasound (US) imaging is clinically invaluable due to its noninvasive and
-safe nature. However, interpreting US images is challenging, requires
-significant expertise, and time, and is often prone to errors. Deep learning
-offers assistive solutions such as segmentation. Supervised methods rely on
-large, high-quality, and consistently labeled datasets, which are challenging
-to curate. Moreover, these methods tend to underperform on out-of-distribution
-data, limiting their clinical utility. Self-supervised learning (SSL) has
-emerged as a promising alternative, leveraging unlabeled data to enhance model
-performance and generalisability. We introduce a contrastive SSL approach
-tailored for B-mode US images, incorporating a novel Relation Contrastive Loss
-(RCL). RCL encourages learning of distinct features by differentiating positive
-and negative sample pairs through a learnable metric. Additionally, we propose
-spatial and frequency-based augmentation strategies for the representation
-learning on US images. Our approach significantly outperforms traditional
-supervised segmentation methods across three public breast US datasets,
-particularly in data-limited scenarios. Notable improvements on the Dice
-similarity metric include a 4% increase on 20% and 50% of the BUSI dataset,
-nearly 6% and 9% improvements on 20% and 50% of the BrEaST dataset, and 6.4%
-and 3.7% improvements on 20% and 50% of the UDIAT dataset, respectively.
-Furthermore, we demonstrate superior generalisability on the
-out-of-distribution UDIAT dataset with performance boosts of 20.6% and 13.6%
-compared to the supervised baseline using 20% and 50% of the BUSI and BrEaST
-training data, respectively. Our research highlights that domain-inspired SSL
-can improve US segmentation, especially under data-limited conditions.
+There has been considerable recent interest in explainability in AI,
+especially with black-box machine learning models. As correctly observed by the
+planning community, when the application at hand is not a single-shot decision
+or prediction, but a sequence of actions that depend on observations, a richer
+notion of explanations are desirable.
+  In this paper, we look to provide a formal account of ``counterfactual
+explanations," based in terms of action sequences. We then show that this
+naturally leads to an account of model reconciliation, which might take the
+form of the user correcting the agent's model, or suggesting actions to the
+agent's plan. For this, we will need to articulate what is true versus what is
+known, and we appeal to a modal fragment of the situation calculus to formalise
+these intuitions. We consider various settings: the agent knowing partial
+truths, weakened truths and having false beliefs, and show that our definitions
+easily generalize to these different settings.
 
-摘要：超音波 (US) 影像由於其非侵入性且安全的特性，在臨床上極具價值。然而，解讀超音波影像具有挑戰性，需要大量的專業知識和時間，而且經常容易出錯。深度學習提供了輔助解決方案，例如分割。監督式方法依賴於大量、高品質且標籤一致的資料集，而這在策劃上具有挑戰性。此外，這些方法在分佈外資料上的表現往往不佳，這限制了它們的臨床效用。自監督學習 (SSL) 已成為一種有前途的替代方案，它利用未標籤資料來增強模型效能和泛化能力。我們提出了一種對比式 SSL 方法，專門針對 B 模式超音波影像，並納入了新穎的關係對比損失 (RCL)。RCL 透過一個可學習的指標區分正負樣本對，來鼓勵學習不同的特徵。此外，我們提出了用於超音波影像上表徵學習的空間和頻率增強策略。我們的做法在三個公開的乳房超音波資料集上顯著優於傳統的監督式分割方法，特別是在資料有限的情況下。在 Dice 相似性指標上的顯著改進包括在 BUSI 資料集的 20% 和 50% 上增加了 4%，在 BrEaST 資料集的 20% 和 50% 上增加了近 6% 和 9%，以及在 UDIAT 資料集的 20% 和 50% 上分別增加了 6.4% 和 3.7%。此外，我們在分佈外的 UDIAT 資料集上展示了卓越的泛化能力，與使用 BUSI 和 BrEaST 訓練資料的 20% 和 50% 的監督式基準相比，效能分別提升了 20.6% 和 13.6%。我們的研究強調，領域啟發的 SSL 可以改善超音波分割，特別是在資料有限的條件下。
+摘要：最近在人工智能中對於可解釋性產生了相當大的興趣，
+特別是對於黑盒機器學習模型。正如規劃社群正確觀察到的，當手邊的應用程式不是單次決策或預測，而是一連串依賴於觀察的動作時，一個更豐富的解釋概念是可取的。
+在本文中，我們著眼於提供「反事實解釋」的一個正式說明，以動作序列為基礎。然後我們展示這自然會導致一個模型調和說明，其形式可能是使用者修正代理人的模型，或建議代理人的計畫採取行動。為此，我們需要說明什麼是真實的，什麼是已知的，我們訴諸情境演算的一個模態片段來形式化這些直覺。我們考慮各種設定：代理人知道部分真實、虛弱真實和擁有錯誤信念，並展示我們的定義輕鬆地概括到這些不同的設定。
 
-##### **Medical Multimodal Model Stealing Attacks via Adversarial Domain Alignment**
-2502.02438v1 by Yaling Shen, Zhixiong Zhuang, Kun Yuan, Maria-Irina Nicolae, Nassir Navab, Nicolas Padoy, Mario Fritz
+##### **Logical Lease Litigation: Prolog and LLMs for Rental Law Compliance in New York**
+2502.09204v1 by Sanskar Sehgal, Yanhong A. Liu
 
-Medical multimodal large language models (MLLMs) are becoming an instrumental
-part of healthcare systems, assisting medical personnel with decision making
-and results analysis. Models for radiology report generation are able to
-interpret medical imagery, thus reducing the workload of radiologists. As
-medical data is scarce and protected by privacy regulations, medical MLLMs
-represent valuable intellectual property. However, these assets are potentially
-vulnerable to model stealing, where attackers aim to replicate their
-functionality via black-box access. So far, model stealing for the medical
-domain has focused on classification; however, existing attacks are not
-effective against MLLMs. In this paper, we introduce Adversarial Domain
-Alignment (ADA-STEAL), the first stealing attack against medical MLLMs.
-ADA-STEAL relies on natural images, which are public and widely available, as
-opposed to their medical counterparts. We show that data augmentation with
-adversarial noise is sufficient to overcome the data distribution gap between
-natural images and the domain-specific distribution of the victim MLLM.
-Experiments on the IU X-RAY and MIMIC-CXR radiology datasets demonstrate that
-Adversarial Domain Alignment enables attackers to steal the medical MLLM
-without any access to medical data.
+Legal cases require careful logical reasoning following the laws, whereas
+interactions with non- technical users must be in natural language. As an
+application combining logical reasoning using Prolog and natural language
+processing using large language models (LLMs), this paper presents a novel
+approach and system, LogicLease, to automate the analysis of landlord-tenant
+legal cases in the state of New York. LogicLease determines compliance with
+relevant legal requirements by analyzing case descriptions and citing all
+relevant laws. It leverages LLMs for information extraction and Prolog for
+legal reasoning. By separating information extraction from legal reasoning,
+LogicLease achieves greater transparency and control over the legal logic
+applied to each case. We evaluate the accuracy, efficiency, and robustness of
+LogicLease through a series of tests, achieving 100% accuracy and an average
+processing time of 2.57 seconds. LogicLease presents advantages over
+state-of-the-art LLM- based legal analysis systems by providing clear,
+step-by-step reasoning, citing specific laws, and distinguishing itself by its
+ability to avoid hallucinations - a common issue in LLMs.
 
-摘要：醫療多模態大型語言模型 (MLLM) 正在成為醫療保健系統中不可或缺的一部分，協助醫療人員進行決策和結果分析。放射報告生成的模型能夠解釋醫學影像，從而減輕放射科醫師的工作負擔。由於醫療資料稀少且受隱私法規保護，醫療 MLLM 代表了有價值的智慧財產。然而，這些資產潛在地容易受到模型竊取的攻擊，攻擊者旨在透過黑盒存取來複製其功能。到目前為止，針對醫療領域的模型竊取一直專注於分類；然而，現有的攻擊對 MLLM 沒有效。在本文中，我們介紹了對抗域對齊 (ADA-STEAL)，這是針對醫療 MLLM 的第一個竊取攻擊。與醫療對應物相反，ADA-STEAL 依賴於公開且廣泛可用的自然影像。我們表明，對抗雜訊的資料擴充足以克服自然影像與受害者 MLLM 的特定領域分佈之間的資料分佈差距。在 IU X-RAY 和 MIMIC-CXR 放射學資料集上進行的實驗表明，對抗域對齊使攻擊者能夠在不存取任何醫療資料的情況下竊取醫療 MLLM。
+摘要：法律案件需要遵循法律进行谨慎的逻辑推理，而与非技术用户的互动必须使用自然语言。作为结合使用 Prolog 进行逻辑推理和使用大型语言模型 (LLM) 进行自然语言处理的应用程序，本文提出了一种新颖的方法和系统 LogicLease，以自动分析纽约州的房东与租户法律案件。LogicLease 通过分析案例描述并引用所有相关法律来确定是否符合相关法律要求。它利用 LLM 进行信息提取，并利用 Prolog 进行法律推理。通过将信息提取与法律推理分开，LogicLease 实现了对应用于每个案例的法律逻辑的更高透明度和控制力。我们通过一系列测试评估了 LogicLease 的准确性、效率和鲁棒性，实现了 100% 的准确性和 2.57 秒的平均处理时间。LogicLease 通过提供清晰、分步的推理，引用具体法律，并以其避免幻觉的能力而区别于最先进的基于 LLM 的法律分析系统，从而显示出优势——这是 LLM 中的常见问题。
 
-##### **Test Time Training for 4D Medical Image Interpolation**
-2502.02341v1 by Qikang Zhang, Yingjie Lei, Zihao Zheng, Ziyang Chen, Zhonghao Xie
+##### **Thinking beyond the anthropomorphic paradigm benefits LLM research**
+2502.09192v1 by Lujain Ibrahim, Myra Cheng
 
-4D medical image interpolation is essential for improving temporal resolution
-and diagnostic precision in clinical applications. Previous works ignore the
-problem of distribution shifts, resulting in poor generalization under
-different distribution. A natural solution would be to adapt the model to a new
-test distribution, but this cannot be done if the test input comes without a
-ground truth label. In this paper, we propose a novel test time training
-framework which uses self-supervision to adapt the model to a new distribution
-without requiring any labels. Indeed, before performing frame interpolation on
-each test video, the model is trained on the same instance using a
-self-supervised task, such as rotation prediction or image reconstruction. We
-conduct experiments on two publicly available 4D medical image interpolation
-datasets, Cardiac and 4D-Lung. The experimental results show that the proposed
-method achieves significant performance across various evaluation metrics on
-both datasets. It achieves higher peak signal-to-noise ratio values, 33.73dB on
-Cardiac and 34.02dB on 4D-Lung. Our method not only advances 4D medical image
-interpolation but also provides a template for domain adaptation in other
-fields such as image segmentation and image registration.
+Anthropomorphism, or the attribution of human traits to technology, is an
+automatic and unconscious response that occurs even in those with advanced
+technical expertise. In this position paper, we analyze hundreds of thousands
+of computer science research articles from the past decade and present
+empirical evidence of the prevalence and growth of anthropomorphic terminology
+in research on large language models (LLMs). This terminology reflects deeper
+anthropomorphic conceptualizations which shape how we think about and conduct
+LLM research. We argue these conceptualizations may be limiting, and that
+challenging them opens up new pathways for understanding and improving LLMs
+beyond human analogies. To illustrate this, we identify and analyze five core
+anthropomorphic assumptions shaping prominent methodologies across the LLM
+development lifecycle, from the assumption that models must use natural
+language for reasoning tasks to the assumption that model capabilities should
+be evaluated through human-centric benchmarks. For each assumption, we
+demonstrate how non-anthropomorphic alternatives can open new directions for
+research and development.
 
-摘要：4D 醫學影像插值對於提升時間解析度及臨床應用中的診斷精準度至關重要。過往的研究忽略了分佈轉移問題，導致在不同分佈下泛化能力不佳。一個自然的解決方案是將模型適應到新的測試分佈，但如果測試輸入沒有真實標籤，就無法做到這一點。在本文中，我們提出了一個新的測試時間訓練架構，它使用自我監督來適應模型到一個新的分佈，而不需要任何標籤。事實上，在對每個測試影片執行幀插值之前，使用自我監督任務（例如旋轉預測或影像重建）在同一個實例上訓練模型。我們在兩個公開的 4D 醫學影像插值資料集（Cardiac 和 4D-Lung）上進行實驗。實驗結果表明，所提出的方法在兩個資料集上的各種評估指標中都取得了顯著的效能。它達到了更高的峰值信噪比值，在 Cardiac 上為 33.73dB，在 4D-Lung 上為 34.02dB。我們的技術不僅推動了 4D 醫學影像插值，還為其他領域（例如影像分割和影像配準）中的領域適應提供了一個範本。
+摘要：擬人化，或將人類特質歸因於科技，是一種自動且無意識的反應，即使是那些擁有進階技術專業知識的人也會發生。在本文中，我們分析了過去十年數十萬篇電腦科學研究文章，並提出實證證據證明擬人化術語在大型語言模型 (LLM) 研究中的普遍性和增長。這些術語反映了更深層的擬人化概念化，塑造了我們思考和進行 LLM 研究的方式。我們認為這些概念化可能是有限制的，並且挑戰它們為超越人類類比來理解和改進 LLM 開闢了新的途徑。為了說明這一點，我們識別並分析了五個核心擬人化假設，這些假設塑造了 LLM 開發生命週期中的顯著方法論，從模型必須使用自然語言進行推理任務的假設到模型能力應該通過以人為中心的基準進行評估的假設。對於每個假設，我們展示了非擬人化替代方案如何為研究和開發打開新方向。
 
-##### **Conversation AI Dialog for Medicare powered by Finetuning and Retrieval Augmented Generation**
-2502.02249v1 by Atharva Mangeshkumar Agrawal, Rutika Pandurang Shinde, Vasanth Kumar Bhukya, Ashmita Chakraborty, Sagar Bharat Shah, Tanmay Shukla, Sree Pradeep Kumar Relangi, Nilesh Mutyam
+##### **Matina: A Large-Scale 73B Token Persian Text Corpus**
+2502.09188v1 by Sara Bourbour Hosseinbeigi, Fatemeh Taherinezhad, Heshaam Faili, Hamed Baghbani, Fatemeh Nadi, Mostafa Amiri
 
-Large language models (LLMs) have shown impressive capabilities in natural
-language processing tasks, including dialogue generation. This research aims to
-conduct a novel comparative analysis of two prominent techniques, fine-tuning
-with LoRA (Low-Rank Adaptation) and the Retrieval-Augmented Generation (RAG)
-framework, in the context of doctor-patient chat conversations with multiple
-datasets of mixed medical domains. The analysis involves three state-of-the-art
-models: Llama-2, GPT, and the LSTM model. Employing real-world doctor-patient
-dialogues, we comprehensively evaluate the performance of models, assessing key
-metrics such as language quality (perplexity, BLEU score), factual accuracy
-(fact-checking against medical knowledge bases), adherence to medical
-guidelines, and overall human judgments (coherence, empathy, safety). The
-findings provide insights into the strengths and limitations of each approach,
-shedding light on their suitability for healthcare applications. Furthermore,
-the research investigates the robustness of the models in handling diverse
-patient queries, ranging from general health inquiries to specific medical
-conditions. The impact of domain-specific knowledge integration is also
-explored, highlighting the potential for enhancing LLM performance through
-targeted data augmentation and retrieval strategies.
+Text corpora are essential for training models used in tasks like
+summarization, translation, and large language models (LLMs). While various
+efforts have been made to collect monolingual and multilingual datasets in many
+languages, Persian has often been underrepresented due to limited resources for
+data collection and preprocessing. Existing Persian datasets are typically
+small and lack content diversity, consisting mainly of weblogs and news
+articles. This shortage of high-quality, varied data has slowed the development
+of NLP models and open-source LLMs for Persian. Since model performance depends
+heavily on the quality of training data, we address this gap by introducing the
+Matina corpus, a new Persian dataset of 72.9B tokens, carefully preprocessed
+and deduplicated to ensure high data quality. We further assess its
+effectiveness by training and evaluating transformer-based models on key NLP
+tasks. Both the dataset and preprocessing codes are publicly available,
+enabling researchers to build on and improve this resource for future Persian
+NLP advancements.
 
-摘要：大型語言模型 (LLM) 在自然語言處理任務中展現了令人印象深刻的能力，包括對話生成。本研究旨在對兩種著名的技術進行新穎的比較分析，即微調 LoRA (低秩適應) 和檢索增強生成 (RAG) 框架，在具有混合醫療領域的多個資料集的醫患聊天對話中。分析涉及三個最先進的模型：Llama-2、GPT 和 LSTM 模型。採用真實世界的醫患對話，我們全面評估模型的性能，評估語言品質（困惑度、BLEU 分數）、事實準確性（對照醫學知識庫進行事實查核）、遵守醫療指南以及整體人類判斷（連貫性、同理心、安全性）等關鍵指標。研究結果深入了解了每種方法的優點和限制，闡明了它們適用於醫療保健應用的適當性。此外，該研究調查了模型在處理多樣化患者查詢時的穩健性，範圍從一般健康詢問到特定醫療狀況。還探討了特定領域知識整合的影響，強調了通過有針對性的資料擴充和檢索策略來增強 LLM 性能的潛力。
+摘要：文字語料庫對於訓練用於摘要、翻譯和大型語言模型 (LLM) 等任務的模型至關重要。儘管已做出各種努力來收集許多語言中的單語和多語言資料集，但由於資料收集和預處理資源有限，波斯語常常代表性不足。現有的波斯語資料集通常很小，而且缺乏內容多樣性，主要由網誌和新聞文章組成。這種優質、多樣化資料的短缺減緩了波斯語的 NLP 模型和開源 LLM 的開發。由於模型效能高度依賴訓練資料的品質，我們透過推出 Matina 語料庫來解決這個差距，Matina 語料庫是一個新的波斯語資料集，包含 72.9B 個字元，經過仔細預處理和去重，以確保資料品質。我們進一步透過在關鍵 NLP 任務上訓練和評估基於轉換器的模型來評估其有效性。資料集和預處理程式碼都是公開的，使研究人員能夠建立和改善這個資源，以促進未來的波斯語 NLP 進展。
 
-##### **Deep Learning-Based Facial Expression Recognition for the Elderly: A Systematic Review**
-2502.02618v1 by F. Xavier Gaya-Morey, Jose M. Buades-Rubio, Philippe Palanque, Raquel Lacuesta, Cristina Manresa-Yee
+##### **RefineCoder: Iterative Improving of Large Language Models via Adaptive Critique Refinement for Code Generation**
+2502.09183v1 by Changzhi Zhou, Xinyu Zhang, Dandan Song, Xiancai Chen, Wanli Gu, Huipeng Ma, Yuhang Tian, Mengdi Zhang, Linmei Hu
 
-The rapid aging of the global population has highlighted the need for
-technologies to support elderly, particularly in healthcare and emotional
-well-being. Facial expression recognition (FER) systems offer a non-invasive
-means of monitoring emotional states, with applications in assisted living,
-mental health support, and personalized care. This study presents a systematic
-review of deep learning-based FER systems, focusing on their applications for
-the elderly population. Following a rigorous methodology, we analyzed 31
-studies published over the last decade, addressing challenges such as the
-scarcity of elderly-specific datasets, class imbalances, and the impact of
-age-related facial expression differences. Our findings show that convolutional
-neural networks remain dominant in FER, and especially lightweight versions for
-resource-constrained environments. However, existing datasets often lack
-diversity in age representation, and real-world deployment remains limited.
-Additionally, privacy concerns and the need for explainable artificial
-intelligence emerged as key barriers to adoption. This review underscores the
-importance of developing age-inclusive datasets, integrating multimodal
-solutions, and adopting XAI techniques to enhance system usability,
-reliability, and trustworthiness. We conclude by offering recommendations for
-future research to bridge the gap between academic progress and real-world
-implementation in elderly care.
+Code generation has attracted increasing attention with the rise of Large
+Language Models (LLMs). Many studies have developed powerful code LLMs by
+synthesizing code-related instruction data and applying supervised fine-tuning.
+However, these methods are limited by teacher model distillation and ignore the
+potential of iterative refinement by self-generated code. In this paper, we
+propose Adaptive Critique Refinement (ACR), which enables the model to refine
+itself by self-generated code and external critique, rather than directly
+imitating the code responses of the teacher model. Concretely, ACR includes a
+composite scoring system with LLM-as-a-Judge to evaluate the quality of code
+responses and a selective critique strategy with LLM-as-a-Critic to critique
+self-generated low-quality code responses. We develop the RefineCoder series by
+iteratively applying ACR, achieving continuous performance improvement on
+multiple code generation benchmarks. Compared to the baselines of the same
+size, our proposed RefineCoder series can achieve comparable or even superior
+performance using less data.
 
-摘要：全球人口快速老龄化突显了对技术的需求，以支持老年人，尤其是在医疗保健和情绪健康方面。面部表情识别 (FER) 系统提供了一种非侵入性的情绪状态监测手段，在辅助生活、心理健康支持和个性化护理中得到应用。本研究对基于深度学习的 FER 系统进行了系统的回顾，重点关注它们在老年人群中的应用。遵循严格的方法，我们分析了在过去十年中发表的 31 项研究，解决了诸如老年人特定数据集的稀缺性、类别不平衡以及与年龄相关的面部表情差异的影响等挑战。我们的研究结果表明，卷积神经网络在 FER 中仍然占主导地位，特别是针对资源受限环境的轻量级版本。然而，现有数据集往往缺乏年龄代表性的多样性，并且现实世界的部署仍然有限。此外，隐私问题和对可解释人工智能的需求已成为采用过程中的主要障碍。本次审查强调了开发包容年龄的数据集、整合多模式解决方案以及采用 XAI 技术以增强系统可用性、可靠性和可信度的重要性。最后，我们提出了未来研究的建议，以弥合学术进展与老年护理中的现实世界实施之间的差距。
+摘要：隨著大型語言模型 (LLM) 的興起，程式碼生成備受關注。許多研究透過綜合與程式碼相關的指令資料並應用監督式微調來開發強大的程式碼 LLM。然而，這些方法受到教師模型蒸餾的限制，且忽略了透過自行產生的程式碼進行反覆改進的潛力。在本文中，我們提出適應性批判改進 (ACR)，它使模型能夠透過自行產生的程式碼和外部批判來改進自身，而不是直接模仿教師模型的程式碼回應。具體來說，ACR 包含一個複合評分系統，其中 LLM 作為評審員來評估程式碼回應的品質，以及一個選擇性批判策略，其中 LLM 作為批判者來批判自行產生的低品質程式碼回應。我們透過反覆套用 ACR 來開發 RefineCoder 系列，在多個程式碼生成基準上實現持續的效能改善。與相同規模的基準相比，我們提出的 RefineCoder 系列可以使用較少資料來實現相當甚至更優異的效能。
 
-##### **Causally-informed Deep Learning towards Explainable and Generalizable Outcomes Prediction in Critical Care**
-2502.02109v1 by Yuxiao Cheng, Xinxin Song, Ziqian Wang, Qin Zhong, Kunlun He, Jinli Suo
+##### **FLAME: Flexible LLM-Assisted Moderation Engine**
+2502.09175v1 by Ivan Bakulin, Ilia Kopanichuk, Iaroslav Bespalov, Nikita Radchenko, Vladimir Shaposhnikov, Dmitry Dylov, Ivan Oseledets
 
-Recent advances in deep learning (DL) have prompted the development of
-high-performing early warning score (EWS) systems, predicting clinical
-deteriorations such as acute kidney injury, acute myocardial infarction, or
-circulatory failure. DL models have proven to be powerful tools for various
-tasks but come with the cost of lacking interpretability and limited
-generalizability, hindering their clinical applications. To develop a practical
-EWS system applicable to various outcomes, we propose causally-informed
-explainable early prediction model, which leverages causal discovery to
-identify the underlying causal relationships of prediction and thus owns two
-unique advantages: demonstrating the explicit interpretation of the prediction
-while exhibiting decent performance when applied to unfamiliar environments.
-Benefiting from these features, our approach achieves superior accuracy for 6
-different critical deteriorations and achieves better generalizability across
-different patient groups, compared to various baseline algorithms. Besides, we
-provide explicit causal pathways to serve as references for assistant clinical
-diagnosis and potential interventions. The proposed approach enhances the
-practical application of deep learning in various medical scenarios.
+The rapid advancement of Large Language Models (LLMs) has introduced
+significant challenges in moderating user-model interactions. While LLMs
+demonstrate remarkable capabilities, they remain vulnerable to adversarial
+attacks, particularly ``jailbreaking'' techniques that bypass content safety
+measures. Current content moderation systems, which primarily rely on input
+prompt filtering, have proven insufficient, with techniques like Best-of-N
+(BoN) jailbreaking achieving success rates of 80% or more against popular LLMs.
+In this paper, we introduce Flexible LLM-Assisted Moderation Engine (FLAME): a
+new approach that shifts the focus from input filtering to output moderation.
+Unlike traditional circuit-breaking methods that analyze user queries, FLAME
+evaluates model responses, offering several key advantages: (1) computational
+efficiency in both training and inference, (2) enhanced resistance to BoN
+jailbreaking attacks, and (3) flexibility in defining and updating safety
+criteria through customizable topic filtering. Our experiments demonstrate that
+FLAME significantly outperforms current moderation systems. For example, FLAME
+reduces attack success rate in GPT-4o-mini and DeepSeek-v3 by a factor of ~9,
+while maintaining low computational overhead. We provide comprehensive
+evaluation on various LLMs and analyze the engine's efficiency against the
+state-of-the-art jailbreaking. This work contributes to the development of more
+robust and adaptable content moderation systems for LLMs.
 
-摘要：深度學習 (DL) 的最新進展促使開發出高性能早期預警評分 (EWS) 系統，預測急性腎臟損傷、急性心肌梗塞或循環衰竭等臨床惡化。DL 模型已被證明是各種任務的強大工具，但代價是缺乏可解釋性和有限的概括性，阻礙了其臨床應用。為了開發適用於各種結果的實用 EWS 系統，我們提出了因果關係解釋性早期預測模型，它利用因果發現來識別預測的潛在因果關係，從而擁有兩個獨特的優點：展示預測的明確解釋，同時在應用於不熟悉的環境時表現出良好的性能。得益於這些特性，與各種基線演算法相比，我們的模型在 6 種不同的危重惡化中實現了更高的準確度，並在不同的患者群體中實現了更好的概括性。此外，我們提供了明確的因果途徑，作為輔助臨床診斷和潛在干預措施的參考。所提出的方法增強了深度學習在各種醫療場景中的實際應用。
+摘要：大型語言模型 (LLM) 的快速進步為調節使用者與模型互動帶來重大挑戰。儘管 LLM 展現出非凡的能力，但它們仍然容易受到對抗性攻擊，特別是繞過內容安全措施的「越獄」技術。目前的內容審核系統主要依賴輸入提示過濾，已被證明不足，例如 Best-of-N (BoN) 越獄對抗熱門 LLM 的成功率達到 80% 以上。在本文中，我們介紹了靈活的 LLM 輔助審核引擎 (FLAME)：一種新的方法，將重點從輸入過濾轉移到輸出審核。與分析使用者查詢的傳統電路中斷方法不同，FLAME 評估模型回應，提供幾個關鍵優勢：(1) 訓練和推理中的計算效率，(2) 增強對 BoN 越獄攻擊的抵抗力，以及 (3) 透過可自訂主題過濾定義和更新安全標準的靈活性。我們的實驗證明，FLAME 明顯優於目前的審核系統。例如，FLAME 將 GPT-4o-mini 和 DeepSeek-v3 的攻擊成功率降低了約 9 倍，同時保持較低的計算負擔。我們對各種 LLM 進行了全面的評估，並分析了引擎對抗最新越獄的效率。這項工作有助於開發更強大且適應性更強的 LLM 內容審核系統。
 
-##### **JingFang: A Traditional Chinese Medicine Large Language Model of Expert-Level Medical Diagnosis and Syndrome Differentiation-Based Treatment**
-2502.04345v1 by Yehan Yan, Tianhao Ma, Ruotai Li, Xinhan Zheng, Guodong Shan, Chisheng Li
+##### **Two-Stage Representation Learning for Analyzing Movement Behavior Dynamics in People Living with Dementia**
+2502.09173v1 by Jin Cui, Alexander Capstick, Payam Barnaghi, Gregory Scott
 
-Traditional Chinese medicine (TCM) plays a vital role in health protection
-and disease treatment, but its practical application requires extensive medical
-knowledge and clinical experience. Existing TCM Large Language Models (LLMs)
-exhibit critical limitations of uncomprehensive medical consultation and
-diagnoses, and inaccurate syndrome differentiation-based treatment. To address
-these issues, this study establishes JingFang (JF): a novel TCM Large Language
-Model that demonstrates the expert-level capability of medical diagnosis and
-syndrome differentiation-based treatment. We innovate a Multi-agent Dynamic
-Collaborative Chain-of-Thought Mechanism (MDCCTM) for medical consultation,
-enabling JF with effective and accurate diagnostic ability. In addition, a
-Syndrome Agent and a Dual-Stage Retrieval Scheme (DSRS) are developed to
-significantly enhance the capacity of JF for disease treatment based on
-syndrome differentiation. JingFang not only facilitates the application of LLMs
-but also promotes the effective practice of TCM in human health protection and
-disease treatment.
+In remote healthcare monitoring, time series representation learning reveals
+critical patient behavior patterns from high-frequency data. This study
+analyzes home activity data from individuals living with dementia by proposing
+a two-stage, self-supervised learning approach tailored to uncover low-rank
+structures. The first stage converts time-series activities into text sequences
+encoded by a pre-trained language model, providing a rich, high-dimensional
+latent state space using a PageRank-based method. This PageRank vector captures
+latent state transitions, effectively compressing complex behaviour data into a
+succinct form that enhances interpretability. This low-rank representation not
+only enhances model interpretability but also facilitates clustering and
+transition analysis, revealing key behavioral patterns correlated with
+clinicalmetrics such as MMSE and ADAS-COG scores. Our findings demonstrate the
+framework's potential in supporting cognitive status prediction, personalized
+care interventions, and large-scale health monitoring.
 
-摘要：中醫藥在保健與疾病治療中扮演著重要的角色，但其實務應用需要深厚的醫學知識與臨床經驗。現有的中醫大語言模型（LLM）存在著醫療諮詢與診斷不全面、症候分型治療不準確的重大限制。為了解決這些問題，本研究建立了精方（JF）：一個新穎的中醫大語言模型，展示了專家級的醫療診斷與症候分型治療能力。我們創新了一個多智能體動態協作思考鏈機制（MDCCTM）用於醫療諮詢，讓 JF 具備有效且準確的診斷能力。此外，還開發了一個症候智能體和一個雙階段檢索方案（DSRS），以顯著增強 JF 基於症候分型的疾病治療能力。精方不僅促進了 LLM 的應用，也推動了中醫藥在人類保健與疾病治療中的有效實踐。
+摘要：在遠程醫療監控中，時間序列表示學習揭示了高頻率數據中的關鍵患者行為模式。本研究通過提出一個兩階段、自我監督的學習方法來分析痴呆症患者的家庭活動數據，該方法專門用於發現低秩結構。第一階段將時間序列活動轉換為由預訓練語言模型編碼的文本序列，使用基於 PageRank 的方法提供了一個豐富、高維的潛在狀態空間。此 PageRank 向量捕獲潛在狀態轉換，有效地將複雜的行為數據壓縮成簡潔的形式，從而增強了解力。此低秩表示不僅增強了模型的可解釋性，還促進了聚類和轉換分析，揭示了與臨床指標（例如 MMSE 和 ADAS-COG 分數）相關的關鍵行為模式。我們的研究結果證明了該框架在支持認知狀態預測、個性化護理干預和大型健康監控方面的潛力。
 
-##### **An Agentic AI Workflow for Detecting Cognitive Concerns in Real-world Data**
-2502.01789v1 by Jiazi Tian, Liqin Wang, Pedram Fard, Valdery Moura Junior, Deborah Blacker, Jennifer S. Haas, Chirag Patel, Shawn N. Murphy, Lidia M. V. R. Moura, Hossein Estiri
+##### **Musical Heritage Historical Entity Linking**
+2502.09168v1 by Arianna Graciotti, Nicolas Lazzari, Valentina Presutti, Rocco Tripodi
 
-Early identification of cognitive concerns is critical but often hindered by
-subtle symptom presentation. This study developed and validated a fully
-automated, multi-agent AI workflow using LLaMA 3 8B to identify cognitive
-concerns in 3,338 clinical notes from Mass General Brigham. The agentic
-workflow, leveraging task-specific agents that dynamically collaborate to
-extract meaningful insights from clinical notes, was compared to an
-expert-driven benchmark. Both workflows achieved high classification
-performance, with F1-scores of 0.90 and 0.91, respectively. The agentic
-workflow demonstrated improved specificity (1.00) and achieved prompt
-refinement in fewer iterations. Although both workflows showed reduced
-performance on validation data, the agentic workflow maintained perfect
-specificity. These findings highlight the potential of fully automated
-multi-agent AI workflows to achieve expert-level accuracy with greater
-efficiency, offering a scalable and cost-effective solution for detecting
-cognitive concerns in clinical settings.
+Linking named entities occurring in text to their corresponding entity in a
+Knowledge Base (KB) is challenging, especially when dealing with historical
+texts. In this work, we introduce Musical Heritage named Entities Recognition,
+Classification and Linking (MHERCL), a novel benchmark consisting of manually
+annotated sentences extrapolated from historical periodicals of the music
+domain. MHERCL contains named entities under-represented or absent in the most
+famous KBs. We experiment with several State-of-the-Art models on the Entity
+Linking (EL) task and show that MHERCL is a challenging dataset for all of
+them. We propose a novel unsupervised EL model and a method to extend
+supervised entity linkers by using Knowledge Graphs (KGs) to tackle the main
+difficulties posed by historical documents. Our experiments reveal that relying
+on unsupervised techniques and improving models with logical constraints based
+on KGs and heuristics to predict NIL entities (entities not represented in the
+KB of reference) results in better EL performance on historical documents.
 
-摘要：及早辨識認知問題至關重要，但常常受到症狀呈現過於細微的阻礙。本研究開發並驗證了一個全自動化、多重代理的 AI 工作流程，使用 LLaMA 3 8B 來辨識來自麻省總醫院布萊根分院的 3,338 則臨床筆記中的認知問題。這個代理工作流程利用了特定任務的代理，這些代理會動態合作從臨床筆記中萃取出有意義的見解，並與專家驅動的基準進行比較。這兩個工作流程都達到了很高的分類效能，F1 分數分別為 0.90 和 0.91。代理工作流程展現出更好的特異性（1.00），並且在更少的反覆運算中達到了提示精煉。儘管這兩個工作流程在驗證資料上的效能都降低了，但代理工作流程維持了完美的特異性。這些發現突顯了全自動化多重代理 AI 工作流程的潛力，它們能以更高的效率達到專家級的準確度，為在臨床環境中偵測認知問題提供了一個可擴充且具成本效益的解決方案。
+摘要：將文本中出現的名稱實體連結到知識庫 (KB) 中對應的實體具有挑戰性，尤其是在處理歷史文本時。在這項工作中，我們引入了音樂遺產命名實體識別、分類和連結 (MHERCL)，這是一個由從音樂領域的歷史期刊中外推的手動標註句子組成的全新基準。MHERCL 包含在最著名的 KB 中代表性不足或不存在的名稱實體。我們在實體連結 (EL) 任務中對多個最先進的模型進行了實驗，並表明 MHERCL 對所有模型來說都是一個具有挑戰性的資料集。我們提出了一個新的無監督 EL 模型和一個通過使用知識圖 (KG) 來擴充監督式實體連結器的的方法，以解決歷史文件提出的主要難題。我們的實驗表明，依賴無監督技術並使用基於 KG 和啟發法的邏輯約束來改善模型以預測 NIL 實體（未在參考 KB 中表示的實體）會在歷史文件中產生更好的 EL 效能。
 
-##### **Can Domain Experts Rely on AI Appropriately? A Case Study on AI-Assisted Prostate Cancer MRI Diagnosis**
-2502.03482v1 by Chacha Chen, Han Liu, Jiamin Yang, Benjamin M. Mervak, Bora Kalaycioglu, Grace Lee, Emre Cakmakli, Matteo Bonatti, Sridhar Pudu, Osman Kahraman, Gul Gizem Pamuk, Aytekin Oto, Aritrick Chatterjee, Chenhao Tan
+##### **Improving TCM Question Answering through Tree-Organized Self-Reflective Retrieval with LLMs**
+2502.09156v1 by Chang Liu, Ying Chang, Jianmin Li, Yiqian Qu, Yu Li, Lingyong Cao, Shuyuan Lin
 
-Despite the growing interest in human-AI decision making, experimental
-studies with domain experts remain rare, largely due to the complexity of
-working with domain experts and the challenges in setting up realistic
-experiments. In this work, we conduct an in-depth collaboration with
-radiologists in prostate cancer diagnosis based on MRI images. Building on
-existing tools for teaching prostate cancer diagnosis, we develop an interface
-and conduct two experiments to study how AI assistance and performance feedback
-shape the decision making of domain experts. In Study 1, clinicians were asked
-to provide an initial diagnosis (human), then view the AI's prediction, and
-subsequently finalize their decision (human-AI team). In Study 2 (after a
-memory wash-out period), the same participants first received aggregated
-performance statistics from Study 1, specifically their own performance, the
-AI's performance, and their human-AI team performance, and then directly viewed
-the AI's prediction before making their diagnosis (i.e., no independent initial
-diagnosis). These two workflows represent realistic ways that clinical AI tools
-might be used in practice, where the second study simulates a scenario where
-doctors can adjust their reliance and trust on AI based on prior performance
-feedback. Our findings show that, while human-AI teams consistently outperform
-humans alone, they still underperform the AI due to under-reliance, similar to
-prior studies with crowdworkers. Providing clinicians with performance feedback
-did not significantly improve the performance of human-AI teams, although
-showing AI decisions in advance nudges people to follow AI more. Meanwhile, we
-observe that the ensemble of human-AI teams can outperform AI alone, suggesting
-promising directions for human-AI collaboration.
+Objectives: Large language models (LLMs) can harness medical knowledge for
+intelligent question answering (Q&A), promising support for auxiliary diagnosis
+and medical talent cultivation. However, there is a deficiency of highly
+efficient retrieval-augmented generation (RAG) frameworks within the domain of
+Traditional Chinese Medicine (TCM). Our purpose is to observe the effect of the
+Tree-Organized Self-Reflective Retrieval (TOSRR) framework on LLMs in TCM Q&A
+tasks.
+  Materials and Methods: We introduce the novel approach of knowledge
+organization, constructing a tree structure knowledge base with hierarchy. At
+inference time, our self-reflection framework retrieves from this knowledge
+base, integrating information across chapters. Questions from the TCM Medical
+Licensing Examination (MLE) and the college Classics Course Exam (CCE) were
+randomly selected as benchmark datasets.
+  Results: By coupling with GPT-4, the framework can improve the best
+performance on the TCM MLE benchmark by 19.85% in absolute accuracy, and
+improve recall accuracy from 27% to 38% on CCE datasets. In manual evaluation,
+the framework improves a total of 18.52 points across dimensions of safety,
+consistency, explainability, compliance, and coherence.
+  Conclusion: The TOSRR framework can effectively improve LLM's capability in
+Q&A tasks of TCM.
 
-摘要：儘管人們對人類與 AI 決策制定越來越感興趣，但與領域專家合作的實驗研究仍然很少見，這在很大程度上是因為與領域專家合作的複雜性，以及在設定實際實驗時面臨的挑戰。在這項工作中，我們與放射科醫師進行深入合作，基於 MRI 影像診斷前列腺癌。建立在用於教授前列腺癌診斷的現有工具上，我們開發了一個介面並進行了兩項實驗，以研究 AI 協助和效能回饋如何塑造領域專家的決策制定。在研究 1 中，要求臨床醫師提供初步診斷（人類），然後檢視 AI 的預測，並隨後確定他們的決策（人類-AI 團隊）。在研究 2（經過一段記憶清除期）中，同一位參與者首先收到研究 1 的彙總效能統計資料，特別是他們自己的效能、AI 的效能，以及他們的人類-AI 團隊效能，然後在做出診斷前直接檢視 AI 的預測（即，沒有獨立的初步診斷）。這兩個工作流程代表了臨床 AI 工具在實務中可能被使用的方式，其中第二個研究模擬了醫生可以根據先前的效能回饋調整他們對 AI 的依賴和信任的情況。我們的研究結果顯示，儘管人類-AI 團隊始終優於單獨的人類，但由於依賴不足，他們仍然表現不如 AI，這與之前針對群眾工作者的研究類似。儘管事先顯示 AI 決策會促使人們更多地遵循 AI，但向臨床醫師提供效能回饋並未顯著改善人類-AI 團隊的效能。同時，我們觀察到人類-AI 團隊的集合可以優於單獨的 AI，這表明了人類-AI 合作的前景。
+摘要：目標：大型語言模型（LLM）可以利用醫療知識進行智能問答（Q&A），承諾支持輔助診斷和醫療人才培養。然而，在中醫領域內缺乏高效的檢索增強生成（RAG）框架。我們的目的是觀察樹組織自省檢索（TOSRR）框架對中醫問答任務中 LLM 的影響。
+材料和方法：我們引入了知識組織的新方法，構建了一個具有層次的樹結構知識庫。在推理時間，我們的自省框架從這個知識庫中檢索，整合章節中的信息。中醫醫師資格考試（MLE）和大學經典課程考試（CCE）中的問題被隨機選為基準數據集。
+結果：通過與 GPT-4 結合，該框架可以將中醫 MLE 基準上的最佳性能提高 19.85% 的絕對準確度，並將 CCE 數據集上的召回準確度從 27% 提高到 38%。在手動評估中，該框架在安全性、一致性、可解釋性、合規性和連貫性方面總共提高了 18.52 分。
+結論：TOSRR 框架可以有效提升 LLM 在中醫問答任務中的能力。
 
-##### **Improving Transformer World Models for Data-Efficient RL**
-2502.01591v1 by Antoine Dedieu, Joseph Ortiz, Xinghua Lou, Carter Wendelken, Wolfgang Lehrach, J Swaroop Guntupalli, Miguel Lazaro-Gredilla, Kevin Patrick Murphy
+##### **A Novel Dialect-Aware Framework for the Classification of Arabic Dialects and Emotions**
+2502.09128v1 by Nasser A Alsadhan
 
-We present an approach to model-based RL that achieves a new state of the art
-performance on the challenging Craftax-classic benchmark, an open-world 2D
-survival game that requires agents to exhibit a wide range of general abilities
--- such as strong generalization, deep exploration, and long-term reasoning.
-With a series of careful design choices aimed at improving sample efficiency,
-our MBRL algorithm achieves a reward of 67.4% after only 1M environment steps,
-significantly outperforming DreamerV3, which achieves 53.2%, and, for the first
-time, exceeds human performance of 65.0%. Our method starts by constructing a
-SOTA model-free baseline, using a novel policy architecture that combines CNNs
-and RNNs. We then add three improvements to the standard MBRL setup: (a) "Dyna
-with warmup", which trains the policy on real and imaginary data, (b) "nearest
-neighbor tokenizer" on image patches, which improves the scheme to create the
-transformer world model (TWM) inputs, and (c) "block teacher forcing", which
-allows the TWM to reason jointly about the future tokens of the next timestep.
+Arabic is one of the oldest languages still in use today. As a result,
+several Arabic-speaking regions have developed dialects that are unique to
+them. Dialect and emotion recognition have various uses in Arabic text
+analysis, such as determining an online customer's origin based on their
+comments. Furthermore, intelligent chatbots that are aware of a user's emotions
+can respond appropriately to the user. Current research in emotion detection in
+the Arabic language lacks awareness of how emotions are exhibited in different
+dialects, which motivates the work found in this study. This research addresses
+the problems of dialect and emotion classification in Arabic. Specifically,
+this is achieved by building a novel framework that can identify and predict
+Arabic dialects and emotions from a given text. The framework consists of three
+modules: A text-preprocessing module, a classification module, and a clustering
+module with the novel capability of building new dialect-aware emotion
+lexicons. The proposed framework generated a new emotional lexicon for
+different dialects. It achieved an accuracy of 88.9% in classifying Arabic
+dialects, which outperforms the state-of-the-art results by 6.45 percentage
+points. Furthermore, the framework achieved 89.1-79% accuracy in detecting
+emotions in the Egyptian and Gulf dialects, respectively.
 
-摘要：我們提出了一個基於模型的 RL 方法，在具有挑戰性的 Craftax-classic 基準上實現了新的技術水準，這是一個開放世界的 2D 生存遊戲，要求代理人展現廣泛的一般能力，例如強大的概括能力、深入探索和長期推理。通過一系列旨在提高樣本效率的仔細設計選擇，我們的 MBRL 演算法在僅 1M 環境步驟後就實現了 67.4% 的獎勵，顯著優於 DreamerV3（實現 53.2%），並且首次超過了人類的 65.0% 的表現。我們的演算法首先通過使用結合 CNN 和 RNN 的新穎策略架構來建構一個 SOTA 無模型基線。然後，我們對標準 MBRL 設定新增了三項改進：(a)「帶熱身的 Dyna」，它在真實和假想資料上訓練策略，(b) 影像貼片的「最近鄰代碼化器」，它改進了建立轉換器世界模型 (TWM) 輸入的方案，以及 (c)「區塊教師強制」，它允許 TWM 共同推理下一個時間步長的未來代碼。
+摘要：阿拉伯語是現今仍在使用中最古老的語言之一。因此，幾個講阿拉伯語的地區發展出獨特的方言。方言和情緒辨識在阿拉伯語文本分析中有多種用途，例如根據在線客戶的評論來確定其來源。此外，知道使用者情緒的智慧聊天機器人可以適當地回應使用者。目前對阿拉伯語情緒偵測的研究缺乏對不同方言如何表現情緒的認識，這激勵了本研究中的工作。本研究探討了阿拉伯語中的方言和情緒分類問題。具體而言，這是通過建立一個新的框架來實現的，該框架可以識別和預測給定文本中的阿拉伯方言和情緒。該框架包含三個模組：文字預處理模組、分類模組和聚類模組，具有建立新的方言感知情緒詞彙表的新功能。所提出的框架為不同的方言生成了新的情緒詞彙表。它在分類阿拉伯方言方面達到了 88.9% 的準確率，比最先進的結果高出 6.45 個百分點。此外，該框架在檢測埃及和海灣方言的情緒方面分別達到了 89.1-79% 的準確率。
 
-##### **Data-Efficient Model for Psychological Resilience Prediction based on Neurological Data**
-2502.01377v1 by Zhi Zhang, Yan Liu, Mengxia Gao, Yu Yang, Jiannong Cao, Wai Kai Hou, Shirley Li, Sonata Yau, Yun Kwok Wing, Tatia M. C. Lee
+##### **Automatic Pruning via Structured Lasso with Class-wise Information**
+2502.09125v1 by Xiang Liu, Mingchen Li, Xia Li, Leigang Qu, Zifan Peng, Yijun Song, Zemin Liu, Linshan Jiang, Jialin Li
 
-Psychological resilience, defined as the ability to rebound from adversity,
-is crucial for mental health. Compared with traditional resilience assessments
-through self-reported questionnaires, resilience assessments based on
-neurological data offer more objective results with biological markers, hence
-significantly enhancing credibility. This paper proposes a novel data-efficient
-model to address the scarcity of neurological data. We employ Neuro
-Kolmogorov-Arnold Networks as the structure of the prediction model. In the
-training stage, a new trait-informed multimodal representation algorithm with a
-smart chunk technique is proposed to learn the shared latent space with limited
-data. In the test stage, a new noise-informed inference algorithm is proposed
-to address the low signal-to-noise ratio of the neurological data. The proposed
-model not only shows impressive performance on both public datasets and
-self-constructed datasets but also provides some valuable psychological
-hypotheses for future research.
+Most pruning methods concentrate on unimportant filters of neural networks.
+However, they face the loss of statistical information due to a lack of
+consideration for class-wise data. In this paper, from the perspective of
+leveraging precise class-wise information for model pruning, we utilize
+structured lasso with guidance from Information Bottleneck theory. Our approach
+ensures that statistical information is retained during the pruning process.
+With these techniques, we introduce two innovative adaptive network pruning
+schemes: sparse graph-structured lasso pruning with Information Bottleneck
+(\textbf{sGLP-IB}) and sparse tree-guided lasso pruning with Information
+Bottleneck (\textbf{sTLP-IB}). The key aspect is pruning model filters using
+sGLP-IB and sTLP-IB to better capture class-wise relatedness. Compared to
+multiple state-of-the-art methods, our approaches demonstrate superior
+performance across three datasets and six model architectures in extensive
+experiments. For instance, using the VGG16 model on the CIFAR-10 dataset, we
+achieve a parameter reduction of 85%, a decrease in FLOPs by 61%, and maintain
+an accuracy of 94.10% (0.14% higher than the original model); we reduce the
+parameters by 55% with the accuracy at 76.12% using the ResNet architecture on
+ImageNet (only drops 0.03%). In summary, we successfully reduce model size and
+computational resource usage while maintaining accuracy. Our codes are at
+https://anonymous.4open.science/r/IJCAI-8104.
 
-摘要：心理韌性，定義為從逆境中反彈的能力，對心理健康至關重要。與通過自我報告問卷的傳統韌性評估相比，基於神經數據的韌性評估提供了更客觀的結果和生物標記，從而顯著提高了可信度。本文提出了一個新穎的數據高效模型來解決神經數據的稀缺性。我們採用神經科爾莫哥羅夫-阿諾德網路作為預測模型的結構。在訓練階段，提出了一種新的特徵信息多模態表示算法，採用智能塊技術，以有限的數據學習共享潛在空間。在測試階段，提出了一種新的噪聲信息推理算法，以解決神經數據的信噪比低的問題。所提出的模型不僅在公共數據集和自構數據集上都顯示出令人印象深刻的性能，還為未來的研究提供了一些有價值的心理假設。
+摘要：大多數剪枝方法都集中在神經網路中不重要的濾波器上。
+然而，由於缺乏對類別資料的考量，它們面臨統計資訊的遺失。在本文中，我們從利用精確類別資訊進行模型剪枝的角度，利用結構化套索搭配資訊瓶頸理論的指導。我們的做法確保在剪枝過程中保留統計資訊。藉由這些技術，我們引入了兩個創新的自適應網路剪枝方案：帶有資訊瓶頸的稀疏圖形結構套索剪枝（sGLP-IB）和帶有資訊瓶頸的稀疏樹導引套索剪枝（sTLP-IB）。關鍵方面是使用 sGLP-IB 和 sTLP-IB 剪枝模型濾波器，以更好地擷取類別關聯性。與多種最先進的方法相比，我們的做法在廣泛的實驗中展現出跨三個資料集和六個模型架構的卓越效能。例如，在 CIFAR-10 資料集上使用 VGG16 模型，我們達到了 85% 的參數減少、61% 的 FLOP 減少，並維持 94.10% 的準確度（比原始模型高 0.14%）；我們在 ImageNet 上使用 ResNet 架構將參數減少了 55%，準確度為 76.12%（僅下降 0.03%）。總之，我們成功地減少了模型大小和計算資源使用，同時維持準確度。我們的程式碼位於 https://anonymous.4open.science/r/IJCAI-8104。
 
-##### **OphthBench: A Comprehensive Benchmark for Evaluating Large Language Models in Chinese Ophthalmology**
-2502.01243v1 by Chengfeng Zhou, Ji Wang, Juanjuan Qin, Yining Wang, Ling Sun, Weiwei Dai
+##### **The influence of visual and linguistic cues on ignorance inference in Vision-Language Models (VLMs)**
+2502.09120v1 by Ye-eun Cho, Yunho Maeng
 
-Large language models (LLMs) have shown significant promise across various
-medical applications, with ophthalmology being a notable area of focus. Many
-ophthalmic tasks have shown substantial improvement through the integration of
-LLMs. However, before these models can be widely adopted in clinical practice,
-evaluating their capabilities and identifying their limitations is crucial. To
-address this research gap and support the real-world application of LLMs, we
-introduce the OphthBench, a specialized benchmark designed to assess LLM
-performance within the context of Chinese ophthalmic practices. This benchmark
-systematically divides a typical ophthalmic clinical workflow into five key
-scenarios: Education, Triage, Diagnosis, Treatment, and Prognosis. For each
-scenario, we developed multiple tasks featuring diverse question types,
-resulting in a comprehensive benchmark comprising 9 tasks and 591 questions.
-This comprehensive framework allows for a thorough assessment of LLMs'
-capabilities and provides insights into their practical application in Chinese
-ophthalmology. Using this benchmark, we conducted extensive experiments and
-analyzed the results from 39 popular LLMs. Our evaluation highlights the
-current gap between LLM development and its practical utility in clinical
-settings, providing a clear direction for future advancements. By bridging this
-gap, we aim to unlock the potential of LLMs and advance their development in
-ophthalmology.
+This study explored how Vision-Language Models (VLMs) process ignorance
+implicatures with visual and linguistic cues. Particularly, we focused on the
+effects of contexts (precise and approximate contexts) and modifier types (bare
+numerals, superlative, and comparative modifiers), which were considered
+pragmatic and semantic factors respectively. Methodologically, we conducted a
+truth-value judgment task in visually grounded settings using GPT-4o and Gemini
+1.5 Pro. The results indicate that while both models exhibited sensitivity to
+linguistic cues (modifier), they failed to process ignorance implicatures with
+visual cues (context) as humans do. Specifically, the influence of context was
+weaker and inconsistent across models, indicating challenges in pragmatic
+reasoning for VLMs. On the other hand, superlative modifiers were more strongly
+associated with ignorance implicatures as compared to comparative modifiers,
+supporting the semantic view. These findings highlight the need for further
+advancements in VLMs to process language-vision information in a
+context-dependent way to achieve human-like pragmatic inference.
 
-摘要：大型語言模型 (LLM) 在各種醫療應用中已展現出顯著的潛力，其中眼科是一個值得關注的重要領域。許多眼科任務已透過整合 LLM 而大幅進步。然而，在這些模型能廣泛應用於臨床實務之前，評估其能力並找出其限制至關重要。為了解決這個研究差距並支援 LLM 的實際應用，我們引入了 OphthBench，這是一個專門的基準測試，旨在評估 LLM 在中國眼科實務中的表現。此基準測試系統性地將典型眼科臨床工作流程劃分為五個關鍵情境：教育、分流、診斷、治療和預後。對於每個情境，我們開發了多項任務，包含多樣化的問題類型，最後組成一個包含 9 項任務和 591 個問題的綜合基準測試。此綜合架構可徹底評估 LLM 的能力，並提供其在中國眼科的實際應用見解。使用此基準測試，我們進行了廣泛的實驗，並分析了來自 39 個熱門 LLM 的結果。我們的評估強調了 LLM 開發與其在臨床環境中的實際效用之間的差距，為未來的進展提供了明確的方向。透過彌合此差距，我們旨在釋放 LLM 的潛力，並促進其在眼科的發展。
+摘要：本研究探討了視覺語言模型 (VLM) 如何處理視覺和語言線索中的無知含義。特別是，我們專注於語境（精確和近似語境）和修飾語類型（裸數字、最高級和比較級修飾語）的影響，這些分別被視為語用和語義因素。在方法論上，我們使用 GPT-4o 和 Gemini 1.5 Pro 在視覺基礎設置中進行了真值判斷任務。結果表明，儘管這兩個模型都對語言線索（修飾語）表現出敏感性，但它們未能像人類那樣處理帶有視覺線索（語境）的無知含義。具體來說，語境的影響在各個模型中較弱且不一致，表明 VLM 在語用推理方面存在挑戰。另一方面，與比較級修飾語相比，最高級修飾語與無知含義的關聯性更強，這支持了語義觀點。這些發現強調了 VLM 進一步發展的必要性，以以語境依賴的方式處理語言視覺信息，以實現類人語用推理。
+
+##### **One-shot Federated Learning Methods: A Practical Guide**
+2502.09104v1 by Xiang Liu, Zhenheng Tang, Xia Li, Yijun Song, Sijie Ji, Zemin Liu, Bo Han, Linshan Jiang, Jialin Li
+
+One-shot Federated Learning (OFL) is a distributed machine learning paradigm
+that constrains client-server communication to a single round, addressing
+privacy and communication overhead issues associated with multiple rounds of
+data exchange in traditional Federated Learning (FL). OFL demonstrates the
+practical potential for integration with future approaches that require
+collaborative training models, such as large language models (LLMs). However,
+current OFL methods face two major challenges: data heterogeneity and model
+heterogeneity, which result in subpar performance compared to conventional FL
+methods. Worse still, despite numerous studies addressing these limitations, a
+comprehensive summary is still lacking. To address these gaps, this paper
+presents a systematic analysis of the challenges faced by OFL and thoroughly
+reviews the current methods. We also offer an innovative categorization method
+and analyze the trade-offs of various techniques. Additionally, we discuss the
+most promising future directions and the technologies that should be integrated
+into the OFL field. This work aims to provide guidance and insights for future
+research.
 
-##### **MIND: Modality-Informed Knowledge Distillation Framework for Multimodal Clinical Prediction Tasks**
-2502.01158v1 by Alejandro Guerra-Manzanares, Farah E. Shamout
+摘要：單次聯邦學習 (OFL) 是一種分散式機器學習範例，將客戶端與伺服器通訊限制在單一輪次中，解決傳統聯邦學習 (FL) 中多輪次資料交換相關的隱私和通訊負擔問題。OFL 展示了與需要協作訓練模型的未來方法整合的實際潛力，例如大型語言模型 (LLM)。然而，目前的 OFL 方法面臨兩大挑戰：資料異質性和模型異質性，這導致與傳統 FL 方法相比，效能較差。更糟的是，儘管有許多研究探討這些限制，但仍缺乏全面的摘要。為了解決這些差距，本文對 OFL 面臨的挑戰進行系統分析，並徹底檢視目前的方法。我們還提供創新的分類方法，並分析各種技術的權衡取捨。此外，我們討論最有希望的未來方向，以及應整合到 OFL 領域的技術。這項工作旨在為未來的研究提供指導和見解。
 
-Multimodal fusion leverages information across modalities to learn better
-feature representations with the goal of improving performance in fusion-based
-tasks. However, multimodal datasets, especially in medical settings, are
-typically smaller than their unimodal counterparts, which can impede the
-performance of multimodal models. Additionally, the increase in the number of
-modalities is often associated with an overall increase in the size of the
-multimodal network, which may be undesirable in medical use cases. Utilizing
-smaller unimodal encoders may lead to sub-optimal performance, particularly
-when dealing with high-dimensional clinical data. In this paper, we propose the
-Modality-INformed knowledge Distillation (MIND) framework, a multimodal model
-compression approach based on knowledge distillation that transfers knowledge
-from ensembles of pre-trained deep neural networks of varying sizes into a
-smaller multimodal student. The teacher models consist of unimodal networks,
-allowing the student to learn from diverse representations. MIND employs
-multi-head joint fusion models, as opposed to single-head models, enabling the
-use of unimodal encoders in the case of unimodal samples without requiring
-imputation or masking of absent modalities. As a result, MIND generates an
-optimized multimodal model, enhancing both multimodal and unimodal
-representations. It can also be leveraged to balance multimodal learning during
-training. We evaluate MIND on binary and multilabel clinical prediction tasks
-using time series data and chest X-ray images. Additionally, we assess the
-generalizability of the MIND framework on three non-medical multimodal
-multiclass datasets. Experimental results demonstrate that MIND enhances the
-performance of the smaller multimodal network across all five tasks, as well as
-various fusion methods and multimodal architectures, compared to
-state-of-the-art baselines.
+##### **Logical Reasoning in Large Language Models: A Survey**
+2502.09100v1 by Hanmeng Liu, Zhizhang Fu, Mengru Ding, Ruoxi Ning, Chaoli Zhang, Xiaozhang Liu, Yue Zhang
 
-摘要：多模态融合利用跨模态的信息来学习更好的特征表示，目标是提升基于融合的任务的性能。然而，多模态数据集，尤其是在医疗环境中，通常比它们的单模态对应数据集小，这会阻碍多模态模型的性能。此外，模态数量的增加通常与多模态网络尺寸的整体增加相关，这在医疗用例中可能是不可取的。利用较小的单模态编码器可能会导致次优性能，尤其是在处理高维临床数据时。在本文中，我们提出了模态信息知识蒸馏 (MIND) 框架，这是一种基于知识蒸馏的多模态模型压缩方法，它将来自不同大小的预训练深度神经网络的集合中的知识转移到一个较小的多模态学生中。教师模型由单模态网络组成，允许学生从不同的表示中学习。MIND 采用多头联合融合模型，而不是单头模型，从而能够在单模态样本的情况下使用单模态编码器，而不需要缺失模态的插补或掩蔽。因此，MIND 生成了一个经过优化的多模态模型，增强了多模态和单模态表示。它还可以用来在训练期间平衡多模态学习。我们使用时间序列数据和胸部 X 射线图像对二元和多标签临床预测任务评估了 MIND。此外，我们评估了 MIND 框架在三个非医疗多模态多分类数据集上的泛化性。实验结果表明，与最先进的基线相比，MIND 增强了较小的多模态网络在所有五个任务以及各种融合方法和多模态架构中的性能。
+With the emergence of advanced reasoning models like OpenAI o3 and
+DeepSeek-R1, large language models (LLMs) have demonstrated remarkable
+reasoning capabilities. However, their ability to perform rigorous logical
+reasoning remains an open question. This survey synthesizes recent advancements
+in logical reasoning within LLMs, a critical area of AI research. It outlines
+the scope of logical reasoning in LLMs, its theoretical foundations, and the
+benchmarks used to evaluate reasoning proficiency. We analyze existing
+capabilities across different reasoning paradigms - deductive, inductive,
+abductive, and analogical - and assess strategies to enhance reasoning
+performance, including data-centric tuning, reinforcement learning, decoding
+strategies, and neuro-symbolic approaches. The review concludes with future
+directions, emphasizing the need for further exploration to strengthen logical
+reasoning in AI systems.
 
-##### **Beyond Yes or No: Predictive Compliance Monitoring Approaches for Quantifying the Magnitude of Compliance Violations**
-2502.01141v1 by Qian Chen, Stefanie Rinderle-Ma, Lijie Wen
+摘要：隨著 OpenAI o3 和 DeepSeek-R1 等先進推理模型的出現，大型語言模型 (LLM) 已展現出非凡的推理能力。然而，它們執行嚴謹邏輯推理的能力仍是一個開放性的問題。此調查綜合了 LLM 中邏輯推理的最新進展，這是 AI 研究的一個關鍵領域。它概述了 LLM 中邏輯推理的範圍、其理論基礎，以及用於評估推理能力的基準。我們分析了不同推理範例（演繹、歸納、外推和類比）中的現有能力，並評估增強推理效能的策略，包括以數據為中心的調整、強化學習、解碼策略和神經符號方法。此評論以未來的方向作為結論，強調需要進一步探索以強化 AI 系統中的邏輯推理。
 
-Most existing process compliance monitoring approaches detect compliance
-violations in an ex post manner. Only predicate prediction focuses on
-predicting them. However, predicate prediction provides a binary yes/no notion
-of compliance, lacking the ability to measure to which extent an ongoing
-process instance deviates from the desired state as specified in constraints.
-Here, being able to quantify the magnitude of violation would provide
-organizations with deeper insights into their operational performance, enabling
-informed decision making to reduce or mitigate the risk of non-compliance.
-Thus, we propose two predictive compliance monitoring approaches to close this
-research gap. The first approach reformulates the binary classification problem
-as a hybrid task that considers both classification and regression, while the
-second employs a multi-task learning method to explicitly predict the
-compliance status and the magnitude of violation for deviant cases
-simultaneously. In this work, we focus on temporal constraints as they are
-significant in almost any application domain, e.g., health care. The evaluation
-on synthetic and real-world event logs demonstrates that our approaches are
-capable of quantifying the magnitude of violations while maintaining comparable
-performance for compliance predictions achieved by state-of-the-art approaches.
+##### **A Hybrid Transformer Model for Fake News Detection: Leveraging Bayesian Optimization and Bidirectional Recurrent Unit**
+2502.09097v1 by Tianyi Huang, Zeqiu Xu, Peiyang Yu, Jingyuan Yi, Xiaochuan Xu
 
-摘要：現有的流程合規監控方法大多會在事後偵測到合規違規。只有謂詞預測專注於預測這些違規。然而，謂詞預測提供的是合規與否的二元概念，無法衡量正在進行的流程實例偏離約束中所指定之理想狀態的程度。在此，能夠量化違規的嚴重程度，將能讓組織深入了解其營運績效，並能據此做出明智的決策，以降低或減輕不合規的風險。因此，我們提出兩種預測合規監控方法來填補此研究空白。第一種方法將二元分類問題重新表述為同時考量分類和回歸的混合任務，而第二種方法則採用多任務學習方法，同時明確預測合規狀態和偏差案例的違規嚴重程度。在這項工作中，我們專注於時間約束，因為它們幾乎在任何應用領域（例如醫療保健）中都很重要。在合成和真實世界事件記錄上的評估顯示，我們的做法能夠量化違規的嚴重程度，同時維持與現有方法所達成的合規預測相當的績效。
+In this paper, we propose an optimized Transformer model that integrates
+Bayesian algorithms with a Bidirectional Gated Recurrent Unit (BiGRU), and
+apply it to fake news classification for the first time. First, we employ the
+TF-IDF method to extract features from news texts and transform them into
+numeric representations to facilitate subsequent machine learning tasks. Two
+sets of experiments are then conducted for fake news detection and
+classification: one using a Transformer model optimized only with BiGRU, and
+the other incorporating Bayesian algorithms into the BiGRU-based Transformer.
+Experimental results show that the BiGRU-optimized Transformer achieves 100%
+accuracy on the training set and 99.67% on the test set, while the addition of
+the Bayesian algorithm maintains 100% accuracy on the training set and slightly
+improves test-set accuracy to 99.73%. This indicates that the Bayesian
+algorithm boosts model accuracy by 0.06%, further enhancing the detection
+capability for fake news. Moreover, the proposed algorithm converges rapidly at
+around the 10th training epoch with accuracy nearing 100%, demonstrating both
+its effectiveness and its fast classification ability. Overall, the optimized
+Transformer model, enhanced by the Bayesian algorithm and BiGRU, exhibits
+excellent continuous learning and detection performance, offering a robust
+technical means to combat the spread of fake news in the current era of
+information overload.
 
-##### **Pulse-PPG: An Open-Source Field-Trained PPG Foundation Model for Wearable Applications Across Lab and Field Settings**
-2502.01108v1 by Mithun Saha, Maxwell A. Xu, Wanting Mao, Sameer Neupane, James M. Rehg, Santosh Kumar
+摘要：<paragraph>在本文中，我們提出了一個最佳化的 Transformer 模型，它將貝氏演算法與雙向門控遞迴單元 (BiGRU) 整合在一起，並首次將其應用於假新聞分類。首先，我們採用 TF-IDF 方法從新聞文本中提取特徵，並將它們轉換為數值表示，以利於後續的機器學習任務。接著進行兩組實驗，分別針對假新聞偵測和分類：一組使用僅使用 BiGRU 最佳化的 Transformer 模型，另一組將貝氏演算法納入基於 BiGRU 的 Transformer 中。實驗結果顯示，BiGRU 最佳化的 Transformer 在訓練組上達到 100% 的準確度，在測試組上達到 99.67%，而加入貝氏演算法後，在訓練組上維持 100% 的準確度，並將測試組的準確度略微提升至 99.73%。這表示貝氏演算法將模型準確度提升了 0.06%，進一步增強了對假新聞的偵測能力。此外，所提出的演算法在約第 10 個訓練週期時快速收斂，準確度接近 100%，證明了它的有效性和快速的分類能力。總的來說，由貝氏演算法和 BiGRU 增強的最佳化 Transformer 模型展現出絕佳的持續學習和偵測效能，提供了一個強健的技術手段來對抗在當前資訊過載時代中假新聞的散布。</paragraph>
 
-Photoplethysmography (PPG)-based foundation models are gaining traction due
-to the widespread use of PPG in biosignal monitoring and their potential to
-generalize across diverse health applications. In this paper, we introduce
-Pulse-PPG, the first open-source PPG foundation model trained exclusively on
-raw PPG data collected over a 100-day field study with 120 participants.
-Existing PPG foundation models are either open-source but trained on clinical
-data or closed-source, limiting their applicability in real-world settings. We
-evaluate Pulse-PPG across multiple datasets and downstream tasks, comparing its
-performance against a state-of-the-art foundation model trained on clinical
-data. Our results demonstrate that Pulse-PPG, trained on uncurated field data,
-exhibits superior generalization across clinical and mobile health applications
-in both lab and field settings. This suggests that exposure to real-world
-variability enables the model to learn fine-grained representations, making it
-more adaptable across tasks. Furthermore, pre-training on field data
-surprisingly outperforms its pre-training on clinical data in many tasks,
-reinforcing the importance of training on real-world, diverse datasets. To
-encourage further advancements in robust foundation models leveraging field
-data, we plan to release Pulse-PPG, providing researchers with a powerful
-resource for developing more generalizable PPG-based models.
+##### **A Hybrid Model for Few-Shot Text Classification Using Transfer and Meta-Learning**
+2502.09086v1 by Jia Gao, Shuangquan Lyu, Guiran Liu, Binrong Zhu, Hongye Zheng, Xiaoxuan Liao
 
-摘要：基於光電容積描記術 (PPG) 的基礎模型由於 PPG 在生物訊號監控中的廣泛使用及其在各種健康應用中推廣的潛力而備受關注。在本文中，我們介紹 Pulse-PPG，這是第一個開放原始碼 PPG 基礎模型，專門針對在為期 100 天的現場研究中收集的 120 位參與者的原始 PPG 資料進行訓練。現有的 PPG 基礎模型要不是開放原始碼，但訓練於臨床資料，不然就是閉源，這限制了它們在真實世界中的應用性。我們評估了 Pulse-PPG 在多個資料集和下游任務中的表現，並將其效能與訓練於臨床資料的最新基礎模型進行比較。我們的結果表明，訓練於未整理現場資料的 Pulse-PPG 在實驗室和現場環境中，在臨床和行動健康應用中展現出優異的泛化能力。這表明接觸真實世界的變異性使模型能夠學習細粒度的表示，使其更能適應各種任務。此外，令人驚訝的是，現場資料的預訓練在許多任務中優於臨床資料的預訓練，這強化了在真實世界、多樣化的資料集上訓練的重要性。為了鼓勵在利用現場資料的強健基礎模型方面進一步發展，我們計畫發布 Pulse-PPG，為研究人員提供一個強大的資源，用於開發更具泛化性的基於 PPG 的模型。
+With the continuous development of natural language processing (NLP)
+technology, text classification tasks have been widely used in multiple
+application fields. However, obtaining labeled data is often expensive and
+difficult, especially in few-shot learning scenarios. To solve this problem,
+this paper proposes a few-shot text classification model based on transfer
+learning and meta-learning. The model uses the knowledge of the pre-trained
+model for transfer and optimizes the model's rapid adaptability in few-sample
+tasks through a meta-learning mechanism. Through a series of comparative
+experiments and ablation experiments, we verified the effectiveness of the
+proposed method. The experimental results show that under the conditions of few
+samples and medium samples, the model based on transfer learning and
+meta-learning significantly outperforms traditional machine learning and deep
+learning methods. In addition, ablation experiments further analyzed the
+contribution of each component to the model performance and confirmed the key
+role of transfer learning and meta-learning in improving model accuracy.
+Finally, this paper discusses future research directions and looks forward to
+the potential of this method in practical applications.
 
-##### **Tutorial on Using Machine Learning and Deep Learning Models for Mental Illness Detection**
-2502.04342v1 by Yeyubei Zhang, Zhongyan Wang, Zhanyi Ding, Yexin Tian, Jianglai Dai, Xiaorui Shen, Yunchong Liu, Yuchen Cao
+摘要：隨著自然語言處理 (NLP) 技術的持續發展，文本分類任務已廣泛應用於多個應用領域。然而，獲取標記資料通常既昂貴又困難，特別是在小樣本學習場景中。為了解決這個問題，本文提出了一個基於遷移學習和元學習的少樣本文本分類模型。該模型利用預訓練模型的知識進行遷移，並透過元學習機制最佳化模型在少樣本任務中的快速適應性。透過一系列的比較實驗和消融實驗，我們驗證了所提出方法的有效性。實驗結果表明，在少樣本和中等樣本的條件下，基於遷移學習和元學習的模型明顯優於傳統機器學習和深度學習方法。此外，消融實驗進一步分析了各個組成部分對模型效能的貢獻，並確認了遷移學習和元學習在提升模型準確度中的關鍵作用。最後，本文探討了未來的研究方向，並期待此方法在實際應用中的潛力。
 
-Social media has become an important source for understanding mental health,
-providing researchers with a way to detect conditions like depression from
-user-generated posts. This tutorial provides practical guidance to address
-common challenges in applying machine learning and deep learning methods for
-mental health detection on these platforms. It focuses on strategies for
-working with diverse datasets, improving text preprocessing, and addressing
-issues such as imbalanced data and model evaluation. Real-world examples and
-step-by-step instructions demonstrate how to apply these techniques
-effectively, with an emphasis on transparency, reproducibility, and ethical
-considerations. By sharing these approaches, this tutorial aims to help
-researchers build more reliable and widely applicable models for mental health
-research, contributing to better tools for early detection and intervention.
+##### **Show Me the Work: Fact-Checkers' Requirements for Explainable Automated Fact-Checking**
+2502.09083v1 by Greta Warren, Irina Shklovski, Isabelle Augenstein
 
-摘要：社群媒體已成為了解心理健康的重要來源，
-為研究人員提供一種方式，從使用者發布的貼文中偵測憂鬱症等狀況。
-本教學提供實務指南，說明如何處理在這些平台上使用機器學習和深度學習方法進行心理健康偵測時常見的挑戰。
-它專注於處理不同資料集、改善文字前處理，以及處理不平衡資料和模型評估等問題的策略。
-實際範例和逐步說明示範如何有效應用這些技術，並強調透明度、可複製性，以及倫理考量。
-透過分享這些方法，本教學指南旨在協助研究人員建構更可靠且廣泛適用的心理健康研究模型，
-進而有助於早期偵測和介入的工具。
+The pervasiveness of large language models and generative AI in online media
+has amplified the need for effective automated fact-checking to assist
+fact-checkers in tackling the increasing volume and sophistication of
+misinformation. The complex nature of fact-checking demands that automated
+fact-checking systems provide explanations that enable fact-checkers to
+scrutinise their outputs. However, it is unclear how these explanations should
+align with the decision-making and reasoning processes of fact-checkers to be
+effectively integrated into their workflows. Through semi-structured interviews
+with fact-checking professionals, we bridge this gap by: (i) providing an
+account of how fact-checkers assess evidence, make decisions, and explain their
+processes; (ii) examining how fact-checkers use automated tools in practice;
+and (iii) identifying fact-checker explanation requirements for automated
+fact-checking tools. The findings show unmet explanation needs and identify
+important criteria for replicable fact-checking explanations that trace the
+model's reasoning path, reference specific evidence, and highlight uncertainty
+and information gaps.
 
-##### **Agent-Based Uncertainty Awareness Improves Automated Radiology Report Labeling with an Open-Source Large Language Model**
-2502.01691v1 by Hadas Ben-Atya, Naama Gavrielov, Zvi Badash, Gili Focht, Ruth Cytter-Kuint, Talar Hagopian, Dan Turner, Moti Freiman
+摘要：大型語言模型和生成式 AI 在線上媒體的普及
+放大了對有效自動查核事實的需求，以協助查核員應對日益增加的錯誤資訊量和複雜性。查核事實的複雜性質要求自動查核事實系統提供說明，讓查核員能夠仔細審查他們的輸出。然而，目前尚不清楚這些說明應如何與查核員的決策制定和推理過程保持一致，才能有效整合到他們的流程中。透過與查核事實專業人士進行半結構式訪談，我們透過以下方式彌補這個差距：(i) 提供查核員如何評估證據、做出決策和解釋其流程的說明；(ii) 檢視查核員如何實際使用自動化工具；以及 (iii) 找出查核員對自動查核事實工具的說明需求。研究結果顯示未滿足的說明需求，並找出可複製查核事實說明的重要準則，這些準則追蹤模型的推理路徑、參考具體證據，並強調不確定性和資訊差距。
 
-Reliable extraction of structured data from radiology reports using Large
-Language Models (LLMs) remains challenging, especially for complex, non-English
-texts like Hebrew. This study introduces an agent-based uncertainty-aware
-approach to improve the trustworthiness of LLM predictions in medical
-applications. We analyzed 9,683 Hebrew radiology reports from Crohn's disease
-patients (from 2010 to 2023) across three medical centers. A subset of 512
-reports was manually annotated for six gastrointestinal organs and 15
-pathological findings, while the remaining reports were automatically annotated
-using HSMP-BERT. Structured data extraction was performed using Llama 3.1
-(Llama 3-8b-instruct) with Bayesian Prompt Ensembles (BayesPE), which employed
-six semantically equivalent prompts to estimate uncertainty. An Agent-Based
-Decision Model integrated multiple prompt outputs into five confidence levels
-for calibrated uncertainty and was compared against three entropy-based models.
-Performance was evaluated using accuracy, F1 score, precision, recall, and
-Cohen's Kappa before and after filtering high-uncertainty cases. The
-agent-based model outperformed the baseline across all metrics, achieving an F1
-score of 0.3967, recall of 0.6437, and Cohen's Kappa of 0.3006. After filtering
-high-uncertainty cases (greater than or equal to 0.5), the F1 score improved to
-0.4787, and Kappa increased to 0.4258. Uncertainty histograms demonstrated
-clear separation between correct and incorrect predictions, with the
-agent-based model providing the most well-calibrated uncertainty estimates. By
-incorporating uncertainty-aware prompt ensembles and an agent-based decision
-model, this approach enhances the performance and reliability of LLMs in
-structured data extraction from radiology reports, offering a more
-interpretable and trustworthy solution for high-stakes medical applications.
+##### **CoSER: Coordinating LLM-Based Persona Simulation of Established Roles**
+2502.09082v1 by Xintao Wang, Heng Wang, Yifei Zhang, Xinfeng Yuan, Rui Xu, Jen-tse Huang, Siyu Yuan, Haoran Guo, Jiangjie Chen, Wei Wang, Yanghua Xiao, Shuchang Zhou
 
-摘要：<paragraph>使用大型語言模型 (LLM) 從放射科報告中可靠地提取結構化數據仍然具有挑戰性，尤其是對於希伯來語等複雜的非英語文本。本研究引入了一種基於代理的不確定性感知方法，以提高 LLM 預測在醫療應用中的可信度。我們分析了來自三個醫療中心的 9,683 份克隆氏症患者的希伯來語放射科報告（從 2010 年到 2023 年）。其中 512 份報告的手動註釋包括六個胃腸器官和 15 個病理發現，而其餘報告則使用 HSMP-BERT 自動註釋。結構化數據提取使用 Llama 3.1（Llama 3-8b-instruct）與貝葉斯提示集合（BayesPE）進行，它採用六個語義等效提示來估計不確定性。基於代理的決策模型將多個提示輸出整合到五個置信度級別中以校準不確定性，並與三個基於熵的模型進行比較。在過濾掉高度不確定性的情況之前和之後，使用準確度、F1 分數、精確度、召回率和 Cohen's Kappa 評估性能。基於代理的模型在所有指標上都優於基線，F1 分數達到 0.3967，召回率達到 0.6437，Cohen's Kappa 達到 0.3006。在過濾掉高度不確定性的情況（大於或等於 0.5）後，F1 分數提高到 0.4787，Kappa 提高到 0.4258。不確定性直方圖顯示了正確預測和不正確預測之間的明顯區別，基於代理的模型提供了校準最好的不確定性估計。通過結合不確定性感知提示集合和基於代理的決策模型，這種方法增強了 LLM 在放射科報告中結構化數據提取中的性能和可靠性，為高風險醫療應用提供了更具可解釋性和可信度的解決方案。</paragraph>
+Role-playing language agents (RPLAs) have emerged as promising applications
+of large language models (LLMs). However, simulating established characters
+presents a challenging task for RPLAs, due to the lack of authentic character
+datasets and nuanced evaluation methods using such data. In this paper, we
+present CoSER, a collection of a high-quality dataset, open models, and an
+evaluation protocol towards effective RPLAs of established characters. The
+CoSER dataset covers 17,966 characters from 771 renowned books. It provides
+authentic dialogues with real-world intricacies, as well as diverse data types
+such as conversation setups, character experiences and internal thoughts.
+Drawing from acting methodology, we introduce given-circumstance acting for
+training and evaluating role-playing LLMs, where LLMs sequentially portray
+multiple characters in book scenes. Using our dataset, we develop CoSER 8B and
+CoSER 70B, i.e., advanced open role-playing LLMs built on LLaMA-3.1 models.
+Extensive experiments demonstrate the value of the CoSER dataset for RPLA
+training, evaluation and retrieval. Moreover, CoSER 70B exhibits
+state-of-the-art performance surpassing or matching GPT-4o on our evaluation
+and three existing benchmarks, i.e., achieving 75.80% and 93.47% accuracy on
+the InCharacter and LifeChoice benchmarks respectively.
 
-##### **Automated Extraction of Spatio-Semantic Graphs for Identifying Cognitive Impairment**
-2502.01685v1 by Si-Ioi Ng, Pranav S. Ambadi, Kimberly D. Mueller, Julie Liss, Visar Berisha
+摘要：角色扮演語言代理（RPLA）已成為大型語言模型（LLM）的有前途的應用。然而，由於缺乏真實角色資料集和使用此類資料的細緻評估方法，模擬既有角色對 RPLA 來說是一項具有挑戰性的任務。在本文中，我們提出了 CoSER，這是一個高品質資料集、開放模型和評估協議的集合，用於有效地扮演既有角色的 RPLA。CoSER 資料集涵蓋了來自 771 本著名書籍的 17,966 個角色。它提供了具有真實世界複雜性的真實對話，以及對話設定、角色體驗和內心想法等多種資料類型。借鑑表演方法，我們引入了既定情境表演，用於訓練和評估角色扮演 LLM，其中 LLM 在書籍場景中依次扮演多個角色。使用我們的資料集，我們開發了 CoSER 8B 和 CoSER 70B，即建立在 LLaMA-3.1 模型上的先進開放角色扮演 LLM。大量的實驗證明了 CoSER 資料集對於 RPLA 訓練、評估和檢索的價值。此外，CoSER 70B 在我們的評估和三個現有基準上展現了超越或匹配 GPT-4o 的最先進效能，即分別在 InCharacter 和 LifeChoice 基準上達到了 75.80% 和 93.47% 的準確率。
 
-Existing methods for analyzing linguistic content from picture descriptions
-for assessment of cognitive-linguistic impairment often overlook the
-participant's visual narrative path, which typically requires eye tracking to
-assess. Spatio-semantic graphs are a useful tool for analyzing this narrative
-path from transcripts alone, however they are limited by the need for manual
-tagging of content information units (CIUs). In this paper, we propose an
-automated approach for estimation of spatio-semantic graphs (via automated
-extraction of CIUs) from the Cookie Theft picture commonly used in
-cognitive-linguistic analyses. The method enables the automatic
-characterization of the visual semantic path during picture description.
-Experiments demonstrate that the automatic spatio-semantic graphs effectively
-differentiate between cognitively impaired and unimpaired speakers. Statistical
-analyses reveal that the features derived by the automated method produce
-comparable results to the manual method, with even greater group differences
-between clinical groups of interest. These results highlight the potential of
-the automated approach for extracting spatio-semantic features in developing
-clinical speech models for cognitive impairment assessment.
+##### **Enhancing RAG with Active Learning on Conversation Records: Reject Incapables and Answer Capables**
+2502.09073v1 by Xuzhao Geng, Haozhao Wang, Jun Wang, Wei Liu, Ruixuan Li
 
-摘要：現有的用於分析圖像描述中的語言內容的方法，用於評估認知語言障礙，通常會忽略參與者的視覺敘事路徑，這通常需要眼球追蹤來評估。時空語義圖是一種有用的工具，可以僅從轉錄本中分析此敘事路徑，但是它們受到手動標記內容資訊單元 (CIU) 的需求所限制。在本文中，我們提出了一種自動化方法，用於從認知語言分析中常用的 Cookie Theft 圖像估計時空語義圖（通過自動提取 CIU）。該方法能夠自動表徵圖片描述期間的視覺語義路徑。實驗表明，自動時空語義圖有效地區分了認知受損和未受損的說話者。統計分析表明，自動化方法衍生的特徵產生了與手動方法相當的結果，甚至在感興趣的臨床組之間產生了更大的組差異。這些結果突出了自動化方法在提取時空語義特徵以開發用於認知障礙評估的臨床語音模型方面的潛力。
+Retrieval-augmented generation (RAG) is a key technique for leveraging
+external knowledge and reducing hallucinations in large language models (LLMs).
+However, RAG still struggles to fully prevent hallucinated responses. To
+address this, it is essential to identify samples prone to hallucination or
+guide LLMs toward correct responses, which experts then annotate to develop
+high-quality datasets for refining LLMs. However, the growing scarcity of such
+datasets makes their creation challenging. This paper proposes using the vast
+amount of conversations from widespread LLM usage to build these datasets,
+training LLMs to avoid hallucination-prone questions while accurately
+responding to manageable ones. Given the impracticality of expert-annotating
+all conversation records, the paper introduces AL4RAG, which uses active
+learning to select the most suitable conversation samples for annotation,
+optimizing performance within an annotation budget. Additionally, recognizing
+that traditional active learning methods are not fully compatible with RAG due
+to unsuitable distance metrics, we develop a novel sample distance measurement
+for RAG active learning. Extensive experiments show that our method
+consistently outperforms baselines across multiple metrics.
 
-##### **Registration-Enhanced Segmentation Method for Prostate Cancer in Ultrasound Images**
-2502.00712v1 by Shengtian Sang, Hassan Jahanandish, Cynthia Xinran Li, Indrani Bhattachary, Jeong Hoon Lee, Lichun Zhang, Sulaiman Vesal, Pejman Ghanouni, Richard Fan, Geoffrey A. Sonn, Mirabela Rusu
+摘要：檢索增強生成 (RAG) 是一種關鍵技術，用於利用外部知識並減少大型語言模型 (LLM) 中的幻覺。然而，RAG 仍難以完全防止幻覺反應。為了解決這個問題，必須找出容易產生幻覺的範例，或引導 LLM 朝向正確的反應，然後由專家註解以開發用於精煉 LLM 的高品質資料集。然而，此類資料集日益稀少，使得其建立極具挑戰性。本文提出使用來自廣泛 LLM 使用的大量對話來建立這些資料集，訓練 LLM 以避免容易產生幻覺的問題，同時準確回應可管理的問題。鑑於由專家為所有對話記錄加上註解並不切實際，本文引入了 AL4RAG，它使用主動學習來選擇最適合註解的對話範例，在註解預算內最佳化效能。此外，認識到傳統主動學習方法由於不適當的距離度量而無法與 RAG 完全相容，我們為 RAG 主動學習開發了一種新穎的範例距離度量。廣泛的實驗表明，我們的模型在多種度量標準上始終優於基準。
 
-Prostate cancer is a major cause of cancer-related deaths in men, where early
-detection greatly improves survival rates. Although MRI-TRUS fusion biopsy
-offers superior accuracy by combining MRI's detailed visualization with TRUS's
-real-time guidance, it is a complex and time-intensive procedure that relies
-heavily on manual annotations, leading to potential errors. To address these
-challenges, we propose a fully automatic MRI-TRUS fusion-based segmentation
-method that identifies prostate tumors directly in TRUS images without
-requiring manual annotations. Unlike traditional multimodal fusion approaches
-that rely on naive data concatenation, our method integrates a
-registration-segmentation framework to align and leverage spatial information
-between MRI and TRUS modalities. This alignment enhances segmentation accuracy
-and reduces reliance on manual effort. Our approach was validated on a dataset
-of 1,747 patients from Stanford Hospital, achieving an average Dice coefficient
-of 0.212, outperforming TRUS-only (0.117) and naive MRI-TRUS fusion (0.132)
-methods, with significant improvements (p $<$ 0.01). This framework
-demonstrates the potential for reducing the complexity of prostate cancer
-diagnosis and provides a flexible architecture applicable to other multimodal
-medical imaging tasks.
+##### **An Open Recipe: Adapting Language-Specific LLMs to a Reasoning Model in One Day via Model Merging**
+2502.09056v1 by Kunat Pipatanakul, Pittawat Taveekitworachai, Potsawee Manakul, Kasima Tharnpipitchai
 
-摘要：前列腺癌是男性癌症相關死亡的主要原因，早期發現可大幅提升存活率。儘管 MRI-TRUS 融合切片檢查結合了 MRI 的詳細視覺化與 TRUS 的即時導引，可提供更高的準確度，但它是一種仰賴大量手動註解的複雜且耗時的程序，容易導致錯誤。為了解決這些挑戰，我們提出了一種全自動的 MRI-TRUS 融合式分割方法，它可以在 TRUS 影像中直接辨識出前列腺腫瘤，而不需要手動註解。與依賴於天真資料串接的傳統多模態融合方法不同，我們的方法整合了一個配準分割架構，以對齊並利用 MRI 與 TRUS 模態之間的空間資訊。這種對齊提升了分割準確度，並減少了對手動作業的依賴。我們的方法已通過來自 Stanford 醫院的 1,747 位患者的資料集進行驗證，達到了 0.212 的平均 Dice 係數，優於僅使用 TRUS (0.117) 和天真的 MRI-TRUS 融合 (0.132) 方法，並有顯著的改善（p < 0.01）。這個架構證明了降低前列腺癌診斷複雜性的潛力，並提供了一個適用於其他多模態醫學影像任務的彈性架構。
+This paper investigates data selection and model merging methodologies aimed
+at incorporating advanced reasoning capabilities such as those of DeepSeek R1
+into language-specific large language models (LLMs), with a particular focus on
+the Thai LLM. Our goal is to enhance the reasoning capabilities of
+language-specific LLMs while maintaining their target language abilities.
+DeepSeek R1 excels in reasoning but primarily benefits high-resource languages
+such as English and Chinese. However, low-resource languages remain underserved
+due to the dominance of English-centric training data and model optimizations,
+which limit performance in these languages. This limitation results in
+unreliable code-switching and diminished effectiveness on tasks in low-resource
+languages. Meanwhile, local and regional LLM initiatives have attempted to
+bridge this gap by developing language-specific LLMs that focus on improving
+local linguistic fidelity. We demonstrate that, with only publicly available
+datasets and a computational budget of $120, it is possible to enhance the
+reasoning capabilities of language-specific LLMs to match the level of DeepSeek
+R1, without compromising their performance on target language tasks.
 
-##### **TMI-CLNet: Triple-Modal Interaction Network for Chronic Liver Disease Prognosis From Imaging, Clinical, and Radiomic Data Fusion**
-2502.00695v1 by Linglong Wu, Xuhao Shan, Ruiquan Ge, Ruoyu Liang, Chi Zhang, Yonghong Li, Ahmed Elazab, Huoling Luo, Yunbi Liu, Changmiao Wang
+摘要：本文探討資料選取與模型合併方法，旨在將深度搜尋 R1 等先進推理能力整合至特定語言的大型語言模型 (LLM)，特別著重於泰語 LLM。我們的目標是提升特定語言 LLM 的推理能力，同時維持其目標語言能力。深度搜尋 R1 在推理方面表現出色，但主要受益於英語和中文等資源豐富的語言。然而，由於以英語為中心的訓練資料和模型最佳化佔據主導地位，資源貧乏的語言仍未獲得充分服務，這限制了這些語言的效能。此限制導致不可靠的代碼切換，並降低了資源貧乏語言任務的效能。與此同時，在地區 LLM 計畫已嘗試透過開發專注於改善在地語言忠實度的特定語言 LLM 來彌合此差距。我們證明，僅使用公開可用的資料集和 120 美元的運算預算，即可提升特定語言 LLM 的推理能力，使其達到深度搜尋 R1 的水準，同時不損及它們在目標語言任務上的效能。
 
-Chronic liver disease represents a significant health challenge worldwide and
-accurate prognostic evaluations are essential for personalized treatment plans.
-Recent evidence suggests that integrating multimodal data, such as computed
-tomography imaging, radiomic features, and clinical information, can provide
-more comprehensive prognostic information. However, modalities have an inherent
-heterogeneity, and incorporating additional modalities may exacerbate the
-challenges of heterogeneous data fusion. Moreover, existing multimodal fusion
-methods often struggle to adapt to richer medical modalities, making it
-difficult to capture inter-modal relationships. To overcome these limitations,
-We present the Triple-Modal Interaction Chronic Liver Network (TMI-CLNet).
-Specifically, we develop an Intra-Modality Aggregation module and a
-Triple-Modal Cross-Attention Fusion module, which are designed to eliminate
-intra-modality redundancy and extract cross-modal information, respectively.
-Furthermore, we design a Triple-Modal Feature Fusion loss function to align
-feature representations across modalities. Extensive experiments on the liver
-prognosis dataset demonstrate that our approach significantly outperforms
-existing state-of-the-art unimodal models and other multi-modal techniques. Our
-code is available at https://github.com/Mysterwll/liver.git.
+##### **Cost-Saving LLM Cascades with Early Abstention**
+2502.09054v1 by Michael J. Zellinger, Rex Liu, Matt Thomson
 
-摘要：慢性肝病在全球范围内代表著重大的健康挑戰，而準確的預後評估對於個人化治療計畫至關重要。最近的證據表明，整合多模態資料（例如電腦斷層影像、放射特徵和臨床資訊）可以提供更全面的預後資訊。然而，模態具有內在異質性，而納入額外的模態可能會加劇異質化資料融合的挑戰。此外，現有的多模態融合方法通常難以適應更豐富的醫療模態，這使得難以捕捉模態間的關係。為了克服這些限制，我們提出了三模態交互慢性肝臟網路 (TMI-CLNet)。具體來說，我們開發了一個模態內聚合模組和一個三模態交叉注意力融合模組，它們分別旨在消除模態內冗餘和提取跨模態資訊。此外，我們設計了一個三模態特徵融合損失函數，以對齊跨模態的特徵表示。在肝臟預後資料集上的廣泛實驗表明，我們的做法顯著優於現有的最先進單模態模型和其他多模態技術。我們的程式碼可以在 https://github.com/Mysterwll/liver.git 上取得。
+LLM cascades are based on the idea that processing all queries with the
+largest and most expensive LLMs is inefficient. Instead, cascades deploy small
+LLMs to answer the majority of queries, limiting the use of large and expensive
+LLMs to only the most difficult queries. This approach can significantly reduce
+costs without impacting performance. However, risk-sensitive domains such as
+finance or medicine place an additional premium on avoiding model errors.
+Recognizing that even the most expensive models may make mistakes, applications
+in these domains benefit from allowing LLM systems to completely abstain from
+answering a query when the chance of making a mistake is significant. However,
+giving a cascade the ability to abstain poses an immediate design question for
+LLM cascades: should abstention only be allowed at the final model or also at
+earlier models? Since the error patterns of small and large models are
+correlated, the latter strategy may further reduce inference costs by letting
+inexpensive models anticipate abstention decisions by expensive models, thereby
+obviating the need to run the expensive models. We investigate the benefits of
+"early abstention" in LLM cascades and find that it reduces the overall test
+loss by 2.2% on average across six benchmarks (GSM8K, MedMCQA, MMLU, TriviaQA,
+TruthfulQA, and XSum). These gains result from a more effective use of
+abstention, which trades a 4.1% average increase in the overall abstention rate
+for a 13.0% reduction in cost and a 5.0% reduction in error rate. Our findings
+demonstrate that it is possible to leverage correlations between the error
+patterns of different language models to drive performance improvements for LLM
+systems with abstention.
 
-##### **Safety at Scale: A Comprehensive Survey of Large Model Safety**
-2502.05206v2 by Xingjun Ma, Yifeng Gao, Yixu Wang, Ruofan Wang, Xin Wang, Ye Sun, Yifan Ding, Hengyuan Xu, Yunhao Chen, Yunhan Zhao, Hanxun Huang, Yige Li, Jiaming Zhang, Xiang Zheng, Yang Bai, Zuxuan Wu, Xipeng Qiu, Jingfeng Zhang, Yiming Li, Jun Sun, Cong Wang, Jindong Gu, Baoyuan Wu, Siheng Chen, Tianwei Zhang, Yang Liu, Mingming Gong, Tongliang Liu, Shirui Pan, Cihang Xie, Tianyu Pang, Yinpeng Dong, Ruoxi Jia, Yang Zhang, Shiqing Ma, Xiangyu Zhang, Neil Gong, Chaowei Xiao, Sarah Erfani, Bo Li, Masashi Sugiyama, Dacheng Tao, James Bailey, Yu-Gang Jiang
+摘要：<paragraph>LLM 級聯基於以下概念：使用最大且最昂貴的 LLM 處理所有查詢效率低下。相反，級聯會部署小型 LLM 來回答大部分查詢，將大型且昂貴的 LLM 的使用限制在最困難的查詢上。這種方法可以大幅降低成本，而不會影響效能。然而，像金融或醫學等對風險敏感的領域會額外重視避免模型錯誤。認識到即使是最昂貴的模型也可能會出錯，在這些領域中的應用程式可受益於允許 LLM 系統在出錯機率很大的情況下完全不回答查詢。然而，賦予級聯不回答的能力會對 LLM 級聯提出立即的設計問題：是否只允許在最終模型中不回答，還是也在較早的模型中不回答？由於小型和大型模型的錯誤模式相關，後一種策略可以讓便宜的模型預測昂貴模型的不回答決策，進而降低推論成本，從而避免執行昂貴的模型。我們調查了 LLM 級聯中「早期不回答」的好處，並發現它平均降低了六個基準測試（GSM8K、MedMCQA、MMLU、TriviaQA、TruthfulQA 和 XSum）的整體測試損失 2.2%。這些收益來自於更有效地使用不回答，以整體不回答率平均增加 4.1% 的代價換取成本降低 13.0% 和錯誤率降低 5.0%。我們的研究結果證明，可以利用不同語言模型的錯誤模式之間的關聯性，來推動具有不回答功能的 LLM 系統的效能改進。</paragraph>
 
-The rapid advancement of large models, driven by their exceptional abilities
-in learning and generalization through large-scale pre-training, has reshaped
-the landscape of Artificial Intelligence (AI). These models are now
-foundational to a wide range of applications, including conversational AI,
-recommendation systems, autonomous driving, content generation, medical
-diagnostics, and scientific discovery. However, their widespread deployment
-also exposes them to significant safety risks, raising concerns about
-robustness, reliability, and ethical implications. This survey provides a
-systematic review of current safety research on large models, covering Vision
-Foundation Models (VFMs), Large Language Models (LLMs), Vision-Language
-Pre-training (VLP) models, Vision-Language Models (VLMs), Diffusion Models
-(DMs), and large-model-based Agents. Our contributions are summarized as
-follows: (1) We present a comprehensive taxonomy of safety threats to these
-models, including adversarial attacks, data poisoning, backdoor attacks,
-jailbreak and prompt injection attacks, energy-latency attacks, data and model
-extraction attacks, and emerging agent-specific threats. (2) We review defense
-strategies proposed for each type of attacks if available and summarize the
-commonly used datasets and benchmarks for safety research. (3) Building on
-this, we identify and discuss the open challenges in large model safety,
-emphasizing the need for comprehensive safety evaluations, scalable and
-effective defense mechanisms, and sustainable data practices. More importantly,
-we highlight the necessity of collective efforts from the research community
-and international collaboration. Our work can serve as a useful reference for
-researchers and practitioners, fostering the ongoing development of
-comprehensive defense systems and platforms to safeguard AI models.
+##### **Game Theory Meets Large Language Models: A Systematic Survey**
+2502.09053v1 by Haoran Sun, Yusen Wu, Yukun Cheng, Xu Chu
 
-摘要：<paragraph>大型模型的快速進展，得益於它們在通過大規模預訓練進行學習和概括方面的卓越能力，已經重塑了人工智能 (AI) 的格局。這些模型現在是廣泛應用程式（包括對話式 AI、推薦系統、自動駕駛、內容生成、醫療診斷和科學發現）的基礎。然而，它們的廣泛部署也使它們面臨重大的安全風險，引發了對穩健性、可靠性和倫理影響的擔憂。本調查提供了對大型模型當前安全研究的系統性回顧，涵蓋視覺基礎模型 (VFM)、大型語言模型 (LLM)、視覺語言預訓練 (VLP) 模型、視覺語言模型 (VLM)、擴散模型 (DM) 和基於大型模型的代理。我們的貢獻總結如下：(1) 我們提出了一個針對這些模型的安全威脅的全面分類，包括對抗性攻擊、資料中毒、後門攻擊、越獄和提示注入攻擊、能量延遲攻擊、資料和模型提取攻擊以及新興的特定代理威脅。(2) 我們檢視了針對每種類型攻擊提出的防禦策略（如果有的話），並總結了安全研究中常用的資料集和基準。(3) 基於此，我們找出並討論了大型模型安全中的開放性挑戰，強調了對全面安全評估、可擴充且有效的防禦機制以及永續資料實務的需求。更重要的是，我們強調了研究社群和國際合作共同努力的必要性。我們的研究可作為研究人員和從業人員的有用參考，促進全面防禦系統和平台的持續發展，以保護 AI 模型。</paragraph>
+Game theory establishes a fundamental framework for analyzing strategic
+interactions among rational decision-makers. The rapid advancement of large
+language models (LLMs) has sparked extensive research exploring the
+intersection of these two fields. Specifically, game-theoretic methods are
+being applied to evaluate and enhance LLM capabilities, while LLMs themselves
+are reshaping classic game models. This paper presents a comprehensive survey
+of the intersection of these fields, exploring a bidirectional relationship
+from three perspectives: (1) Establishing standardized game-based benchmarks
+for evaluating LLM behavior; (2) Leveraging game-theoretic methods to improve
+LLM performance through algorithmic innovations; (3) Characterizing the
+societal impacts of LLMs through game modeling. Among these three aspects, we
+also highlight how the equilibrium analysis for traditional game models is
+impacted by LLMs' advanced language understanding, which in turn extends the
+study of game theory. Finally, we identify key challenges and future research
+directions, assessing their feasibility based on the current state of the
+field. By bridging theoretical rigor with emerging AI capabilities, this survey
+aims to foster interdisciplinary collaboration and drive progress in this
+evolving research area.
 
-##### **Enhanced Convolutional Neural Networks for Improved Image Classification**
-2502.00663v1 by Xiaoran Yang, Shuhan Yu, Wenxi Xu
+摘要：博弈論建立一個基本架構，用來分析理性決策者之間的策略互動。大型語言模型 (LLM) 的快速進展，激發了廣泛的研究，探討這兩個領域的交集。具體來說，博弈論方法被應用於評估和增強 LLM 能力，而 LLM 本身正在重塑經典博弈模型。本文對這些領域的交集進行了全面的調查，從三個角度探討了雙向關係：(1) 建立標準化的基於博弈的基準，用於評估 LLM 行為；(2) 利用博弈論方法，通過演算法創新來改善 LLM 效能；(3) 透過博弈模型，描述 LLM 對社會的影響。在這三個方面中，我們還強調了 LLM 的先進語言理解如何影響傳統博弈模型的均衡分析，這反過來又擴展了博弈論的研究。最後，我們找出關鍵挑戰和未來的研究方向，根據該領域的現狀評估其可行性。透過將理論嚴謹性與新興的 AI 能力相結合，這項調查旨在促進跨學科合作，並推動這個不斷演變的研究領域的進展。
 
-Image classification is a fundamental task in computer vision with diverse
-applications, ranging from autonomous systems to medical imaging. The CIFAR-10
-dataset is a widely used benchmark to evaluate the performance of
-classification models on small-scale, multi-class datasets. Convolutional
-Neural Networks (CNNs) have demonstrated state-of-the-art results; however,
-they often suffer from overfitting and suboptimal feature representation when
-applied to challenging datasets like CIFAR-10. In this paper, we propose an
-enhanced CNN architecture that integrates deeper convolutional blocks, batch
-normalization, and dropout regularization to achieve superior performance. The
-proposed model achieves a test accuracy of 84.95%, outperforming baseline CNN
-architectures. Through detailed ablation studies, we demonstrate the
-effectiveness of the enhancements and analyze the hierarchical feature
-representations. This work highlights the potential of refined CNN
-architectures for tackling small-scale image classification problems
-effectively.
+##### **AIDE: Agentically Improve Visual Language Model with Domain Experts**
+2502.09051v1 by Ming-Chang Chiu, Fuxiao Liu, Karan Sapra, Andrew Tao, Yaser Jacoob, Xuezhe Ma, Zhiding Yu, Guilin Liu
 
-摘要：影像分類是電腦視覺中的一項基本任務，應用範圍廣泛，從自動系統到醫學影像皆有。CIFAR-10 資料集是一個廣泛使用的基準，用於評估分類模型在小規模、多類別資料集上的效能。卷積神經網路 (CNN) 已展現出最先進的成果；然而，當應用於 CIFAR-10 等具挑戰性的資料集時，它們常常會發生過度擬合和次佳特徵表示的問題。在本文中，我們提出一個增強的 CNN 架構，它整合了更深的卷積區塊、批次正規化和中斷正規化，以達成卓越的效能。所提出的模型達到了 84.95% 的測試準確度，優於基準 CNN 架構。透過詳細的消融研究，我們證明了這些增強功能的有效性，並分析了階層式特徵表示。這項工作突顯了精進的 CNN 架構在有效解決小規模影像分類問題上的潛力。
+The enhancement of Visual Language Models (VLMs) has traditionally relied on
+knowledge distillation from larger, more capable models. This dependence
+creates a fundamental bottleneck for improving state-of-the-art systems,
+particularly when no superior models exist. We introduce AIDE (Agentic
+Improvement through Domain Experts), a novel framework that enables VLMs to
+autonomously enhance their capabilities by leveraging specialized domain expert
+models. AIDE operates through a four-stage process: (1) identifying instances
+for refinement, (2) engaging domain experts for targeted analysis, (3)
+synthesizing expert outputs with existing data, and (4) integrating enhanced
+instances into the training pipeline. Experiments on multiple benchmarks,
+including MMMU, MME, MMBench, etc., demonstrate AIDE's ability to achieve
+notable performance gains without relying on larger VLMs nor human supervision.
+Our framework provides a scalable, resource-efficient approach to continuous
+VLM improvement, addressing critical limitations in current methodologies,
+particularly valuable when larger models are unavailable to access.
 
-##### **Distribution-aware Fairness Learning in Medical Image Segmentation From A Control-Theoretic Perspective**
-2502.00619v1 by Yujin Oh, Pengfei Jin, Sangjoon Park, Sekeun Kim, Siyeop Yoon, Kyungsang Kim, Jin Sung Kim, Xiang Li, Quanzheng Li
+摘要：視覺語言模型 (VLM) 的增強傳統上依賴於從更大、功能更強大的模型中進行知識萃取。這種依賴性會造成改善最先進系統的基本瓶頸，尤其在沒有更優越的模型時。我們引進 AIDE（透過領域專家進行代理式改善），一個創新的架構，讓 VLM 能夠透過利用專業的領域專家模型，自主增強其功能。AIDE 透過四階段流程運作：(1) 識別需要改善的實例，(2) 聘請領域專家進行有針對性的分析，(3) 將專家輸出與現有資料綜合，以及 (4) 將增強的實例整合到訓練流程中。在多個基準測試上的實驗，包括 MMMU、MME、MMBench 等，證明了 AIDE 能夠在不依賴更大型的 VLM 或人工監督的情況下，實現顯著的效能提升。我們的架構提供了一個可擴充、資源效率高的持續 VLM 改進方法，解決了當前方法中的關鍵限制，特別是在無法取得大型模型時，這一點特別有價值。
 
-Ensuring fairness in medical image segmentation is critical due to biases in
-imbalanced clinical data acquisition caused by demographic attributes (e.g.,
-age, sex, race) and clinical factors (e.g., disease severity). To address these
-challenges, we introduce Distribution-aware Mixture of Experts (dMoE), inspired
-by optimal control theory. We provide a comprehensive analysis of its
-underlying mechanisms and clarify dMoE's role in adapting to heterogeneous
-distributions in medical image segmentation. Furthermore, we integrate dMoE
-into multiple network architectures, demonstrating its broad applicability
-across diverse medical image analysis tasks. By incorporating demographic and
-clinical factors, dMoE achieves state-of-the-art performance on two 2D
-benchmark datasets and a 3D in-house dataset. Our results highlight the
-effectiveness of dMoE in mitigating biases from imbalanced distributions,
-offering a promising approach to bridging control theory and medical image
-segmentation within fairness learning paradigms. The source code will be made
-available.
+##### **Leveraging Member-Group Relations via Multi-View Graph Filtering for Effective Group Recommendation**
+2502.09050v1 by Chae-Hyun Kim, Yoon-Ryung Choi, Jin-Duk Park, Won-Yong Shin
 
-摘要：在医学影像分割中，由於人口屬性（例如年齡、性別、種族）和臨床因素（例如疾病嚴重程度）導致不平衡的臨床數據採集中存在偏差，因此確保公平性至關重要。為了應對這些挑戰，我們引入了受最優控制理論啟發的感知混合專家 (dMoE)。我們對其底層機制進行了全面分析，並釐清了 dMoE 在適應醫學影像分割中的異質分佈中的作用。此外，我們將 dMoE 整合到多個網路架構中，展示了其在各種醫學影像分析任務中的廣泛適用性。通過納入人口統計和臨床因素，dMoE 在兩個 2D 基準數據集和一個 3D 內部數據集上實現了最先進的性能。我們的結果突出了 dMoE 在減輕不平衡分佈的偏差方面的有效性，為在公平性學習範例中橋接控制理論和醫學影像分割提供了一個有前景的方法。原始碼將會公開。
+Group recommendation aims at providing optimized recommendations tailored to
+diverse groups, enabling groups to enjoy appropriate items. On the other hand,
+most existing group recommendation methods are built upon deep neural network
+(DNN) architectures designed to capture the intricate relationships between
+member-level and group-level interactions. While these DNN-based approaches
+have proven their effectiveness, they require complex and expensive training
+procedures to incorporate group-level interactions in addition to member-level
+interactions. To overcome such limitations, we introduce Group-GF, a new
+approach for extremely fast recommendations of items to each group via
+multi-view graph filtering (GF) that offers a holistic view of complex
+member-group dynamics, without the need for costly model training.
+Specifically, in Group-GF, we first construct three item similarity graphs
+manifesting different viewpoints for GF. Then, we discover a distinct
+polynomial graph filter for each similarity graph and judiciously aggregate the
+three graph filters. Extensive experiments demonstrate the effectiveness of
+Group-GF in terms of significantly reducing runtime and achieving
+state-of-the-art recommendation accuracy.
 
-##### **Generating crossmodal gene expression from cancer histopathology improves multimodal AI predictions**
-2502.00568v3 by Samiran Dey, Christopher R. S. Banerji, Partha Basuchowdhuri, Sanjoy K. Saha, Deepak Parashar, Tapabrata Chakraborti
+摘要：群組推薦旨在提供針對不同群組量身打造的最佳推薦，讓群組可以享受適當的項目。另一方面，現有的群組推薦方法大多建立在深度神經網路 (DNN) 架構上，旨在捕捉成員層級和群組層級互動之間的複雜關係。雖然這些基於 DNN 的方法已證明其有效性，但它們需要複雜且昂貴的訓練程序，才能在成員層級互動之外納入群組層級互動。為了克服這些限制，我們引入了 Group-GF，這是一種透過多視圖圖形過濾 (GF) 為每個群組提供極快速項目推薦的新方法，它提供了複雜成員群組動態的整體視圖，而無需進行昂貴的模型訓練。具體來說，在 Group-GF 中，我們首先建構三個項目相似度圖形，展現 GF 的不同觀點。然後，我們為每個相似度圖形發現一個不同的多項式圖形過濾器，並明智地彙總這三個圖形過濾器。廣泛的實驗證明了 Group-GF 在顯著減少執行時間和達成最先進的推薦準確度方面的有效性。
 
-Emerging research has highlighted that artificial intelligence based
-multimodal fusion of digital pathology and transcriptomic features can improve
-cancer diagnosis (grading/subtyping) and prognosis (survival risk) prediction.
-However, such direct fusion for joint decision is impractical in real clinical
-settings, where histopathology is still the gold standard for diagnosis and
-transcriptomic tests are rarely requested, at least in the public healthcare
-system. With our novel diffusion based crossmodal generative AI model PathGen,
-we show that genomic expressions synthesized from digital histopathology
-jointly predicts cancer grading and patient survival risk with high accuracy
-(state-of-the-art performance), certainty (through conformal coverage
-guarantee) and interpretability (through distributed attention maps). PathGen
-code is available for open use by the research community through GitHub at
-https://github.com/Samiran-Dey/PathGen.
+##### **Criteria-Aware Graph Filtering: Extremely Fast Yet Accurate Multi-Criteria Recommendation**
+2502.09046v1 by Jin-Duk Park, Jaemin Yoo, Won-Yong Shin
 
-摘要：新興研究強調，基於人工智慧的多模態融合數位病理學和轉錄組特徵，可以改善癌症診斷（分級/分型）和預後（存活風險）預測。
-然而，這種直接融合對於聯合決策在實際臨床環境中並不切實際，在實際臨床環境中，組織病理學仍然是診斷的黃金標準，而轉錄組檢測很少被要求，至少在公共醫療保健系統中是如此。透過我們新穎的基於擴散的跨模態生成式 AI 模型 PathGen，我們展示了從數位組織病理學合成的基因體表達共同預測癌症分級和患者存活風險，具有很高的準確度（最先進的效能）、確定性（透過共形覆蓋保證）和可解釋性（透過分佈式注意力圖）。PathGen 程式碼可透過 GitHub 上的 https://github.com/Samiran-Dey/PathGen 供研究社群公開使用。
+Multi-criteria (MC) recommender systems, which utilize MC rating information
+for recommendation, are increasingly widespread in various e-commerce domains.
+However, the MC recommendation using training-based collaborative filtering,
+requiring consideration of multiple ratings compared to single-criterion
+counterparts, often poses practical challenges in achieving state-of-the-art
+performance along with scalable model training. To solve this problem, we
+propose CA-GF, a training-free MC recommendation method, which is built upon
+criteria-aware graph filtering for efficient yet accurate MC recommendations.
+Specifically, first, we construct an item-item similarity graph using an MC
+user-expansion graph. Next, we design CA-GF composed of the following key
+components, including 1) criterion-specific graph filtering where the optimal
+filter for each criterion is found using various types of polynomial low-pass
+filters and 2) criteria preference-infused aggregation where the smoothed
+signals from each criterion are aggregated. We demonstrate that CA-GF is (a)
+efficient: providing the computational efficiency, offering the extremely fast
+runtime of less than 0.2 seconds even on the largest benchmark dataset, (b)
+accurate: outperforming benchmark MC recommendation methods, achieving
+substantial accuracy gains up to 24% compared to the best competitor, and (c)
+interpretable: providing interpretations for the contribution of each criterion
+to the model prediction based on visualizations.
 
+摘要：多準則 (MC) 推薦系統在各種電子商務領域中日益普及，該系統利用 MC 評分資訊進行推薦。
+然而，與單準則對應項目相比，使用基於訓練的協同過濾的 MC 推薦，通常在達成最先進的效能以及可擴充模型訓練方面造成實務上的挑戰，需要考慮多個評分。為了解決這個問題，我們提出 CA-GF，一種無需訓練的 MC 推薦方法，它建立於準則感知圖形過濾之上，用於有效且準確的 MC 推薦。
+具體來說，首先，我們使用 MC 使用者擴展圖形來建構一個項目相似度圖形。接下來，我們設計 CA-GF，它包含以下關鍵組成部分，包括 1) 準則特定圖形過濾，其中使用各種類型的多項式低通濾波器來找出每個準則的最佳濾波器，以及 2) 準則偏好注入聚合，其中來自每個準則的平滑訊號被聚合。我們證明 CA-GF 是 (a) 有效的：提供運算效率，即使在最大的基準資料集上，也能提供低於 0.2 秒的極快執行時間，(b) 準確的：優於基準 MC 推薦方法，與最佳競爭者相比，獲得高達 24% 的顯著準確性提升，以及 (c) 可解釋的：根據視覺化提供對每個準則對模型預測的貢獻的解釋。
 
-### Knowledge Graphs
-|Publish Date|Title|Authors|Homepage|Code|
-| :---: | :---: | :---: | :---: | :---: |
-|**2025-02-13**|**Visual Graph Question Answering with ASP and LLMs for Language Parsing**|Jakob Johannes Bauer et.al.|[2502.09211v1](http://arxiv.org/abs/2502.09211v1)|null|
-|**2025-02-12**|**Representation Learning to Advance Multi-institutional Studies with Electronic Health Record Data**|Doudou Zhou et.al.|[2502.08547v1](http://arxiv.org/abs/2502.08547v1)|null|
-|**2025-02-12**|**Trustworthy GNNs with LLMs: A Systematic Review and Taxonomy**|Ruizhan Xue et.al.|[2502.08353v1](http://arxiv.org/abs/2502.08353v1)|null|
-|**2025-02-12**|**Graph Foundation Models for Recommendation: A Comprehensive Survey**|Bin Wu et.al.|[2502.08346v1](http://arxiv.org/abs/2502.08346v1)|null|
-|**2025-02-12**|**Self-Evaluation for Job-Shop Scheduling**|Imanol Echeverria et.al.|[2502.08684v1](http://arxiv.org/abs/2502.08684v1)|null|
-|**2025-02-12**|**Improving Existing Optimization Algorithms with LLMs**|Camilo Chacón Sartori et.al.|[2502.08298v1](http://arxiv.org/abs/2502.08298v1)|null|
-|**2025-02-12**|**ACCESS : A Benchmark for Abstract Causal Event Discovery and Reasoning**|Vy Vo et.al.|[2502.08148v1](http://arxiv.org/abs/2502.08148v1)|null|
-|**2025-02-12**|**GCoT: Chain-of-Thought Prompt Learning for Graphs**|Xingtong Yu et.al.|[2502.08092v1](http://arxiv.org/abs/2502.08092v1)|null|
-|**2025-02-11**|**Deep Semantic Graph Learning via LLM based Node Enhancement**|Chuanqi Shi et.al.|[2502.07982v1](http://arxiv.org/abs/2502.07982v1)|null|
-|**2025-02-10**|**Cardiverse: Harnessing LLMs for Novel Card Game Prototyping**|Danrui Li et.al.|[2502.07128v1](http://arxiv.org/abs/2502.07128v1)|null|
-|**2025-02-10**|**GraNNite: Enabling High-Performance Execution of Graph Neural Networks on Resource-Constrained Neural Processing Units**|Arghadip Das et.al.|[2502.06921v2](http://arxiv.org/abs/2502.06921v2)|[link](https://github.com/arghadippurdue/GraNNite)|
-|**2025-02-10**|**Automatic Annotation Augmentation Boosts Translation between Molecules and Natural Language**|Zhiqiang Zhong et.al.|[2502.06634v1](http://arxiv.org/abs/2502.06634v1)|null|
-|**2025-02-10**|**KARMA: Leveraging Multi-Agent LLMs for Automated Knowledge Graph Enrichment**|Yuxing Lu et.al.|[2502.06472v1](http://arxiv.org/abs/2502.06472v1)|null|
-|**2025-02-10**|**RoToR: Towards More Reliable Responses for Order-Invariant Inputs**|Soyoung Yoon et.al.|[2502.08662v1](http://arxiv.org/abs/2502.08662v1)|null|
-|**2025-02-10**|**K-ON: Stacking Knowledge On the Head Layer of Large Language Model**|Lingbing Guo et.al.|[2502.06257v1](http://arxiv.org/abs/2502.06257v1)|null|
-|**2025-02-10**|**LegalViz: Legal Text Visualization by Text To Diagram Generation**|Eri Onami et.al.|[2502.06147v2](http://arxiv.org/abs/2502.06147v2)|null|
-|**2025-02-09**|**Deconstructing Depression Stigma: Integrating AI-driven Data Collection and Analysis with Causal Knowledge Graphs**|Han Meng et.al.|[2502.06075v1](http://arxiv.org/abs/2502.06075v1)|null|
-|**2025-02-09**|**LegalSeg: Unlocking the Structure of Indian Legal Judgments Through Rhetorical Role Classification**|Shubham Kumar Nigam et.al.|[2502.05836v1](http://arxiv.org/abs/2502.05836v1)|null|
-|**2025-02-08**|**LLM-Powered Decentralized Generative Agents with Adaptive Hierarchical Knowledge Graph for Cooperative Planning**|Hanqing Yang et.al.|[2502.05453v1](http://arxiv.org/abs/2502.05453v1)|null|
-|**2025-02-08**|**SAMGPT: Text-free Graph Foundation Model for Multi-domain Pre-training and Cross-domain Adaptation**|Xingtong Yu et.al.|[2502.05424v1](http://arxiv.org/abs/2502.05424v1)|null|
-|**2025-02-08**|**Graph-based Molecular In-context Learning Grounded on Morgan Fingerprints**|Ali Al-Lawati et.al.|[2502.05414v1](http://arxiv.org/abs/2502.05414v1)|null|
-|**2025-02-08**|**Knowledge Graph-Guided Retrieval Augmented Generation**|Xiangrong Zhu et.al.|[2502.06864v1](http://arxiv.org/abs/2502.06864v1)|[link](https://github.com/nju-websoft/KG2RAG)|
-|**2025-02-07**|**Can Large Language Models Understand Intermediate Representations?**|Hailong Jiang et.al.|[2502.06854v1](http://arxiv.org/abs/2502.06854v1)|null|
-|**2025-02-07**|**GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity?**|Yang Zhou et.al.|[2502.05252v1](http://arxiv.org/abs/2502.05252v1)|null|
-|**2025-02-07**|**Causality can systematically address the monsters under the bench(marks)**|Felix Leeb et.al.|[2502.05085v1](http://arxiv.org/abs/2502.05085v1)|null|
-|**2025-02-07**|**Adaptive Graph of Thoughts: Test-Time Adaptive Reasoning Unifying Chain, Tree, and Graph Structures**|Tushar Pandey et.al.|[2502.05078v1](http://arxiv.org/abs/2502.05078v1)|[link](https://github.com/AgnostiqHQ/multi-agent-llm)|
-|**2025-02-07**|**Enhancing Knowledge Graph Construction: Evaluating with Emphasis on Hallucination, Omission, and Graph Similarity Metrics**|Hussam Ghanem et.al.|[2502.05239v1](http://arxiv.org/abs/2502.05239v1)|null|
-|**2025-02-07**|**Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research**|Junde Wu et.al.|[2502.04644v1](http://arxiv.org/abs/2502.04644v1)|[link](https://github.com/theworldofagents/agentic-reasoning)|
-|**2025-02-07**|**Position-aware Automatic Circuit Discovery**|Tal Haklay et.al.|[2502.04577v1](http://arxiv.org/abs/2502.04577v1)|[link](https://github.com/technion-cs-nlp/peap)|
-|**2025-02-06**|**Heterogeneous Swarms: Jointly Optimizing Model Roles and Weights for Multi-LLM Systems**|Shangbin Feng et.al.|[2502.04510v1](http://arxiv.org/abs/2502.04510v1)|null|
-|**2025-02-06**|**MedRAG: Enhancing Retrieval-augmented Generation with Knowledge Graph-Elicited Reasoning for Healthcare Copilot**|Xuejiao Zhao et.al.|[2502.04413v1](http://arxiv.org/abs/2502.04413v1)|[link](https://github.com/snowteam2023/medrag)|
-|**2025-02-06**|**Ontology-Guided, Hybrid Prompt Learning for Generalization in Knowledge Graph Question Answering**|Longquan Jiang et.al.|[2502.03992v1](http://arxiv.org/abs/2502.03992v1)|[link](https://github.com/longquanjiang/ontoscprompt)|
-|**2025-02-06**|**Multimodal Medical Code Tokenizer**|Xiaorui Su et.al.|[2502.04397v2](http://arxiv.org/abs/2502.04397v2)|null|
-|**2025-02-06**|**Division-of-Thoughts: Harnessing Hybrid Language Model Synergy for Efficient On-Device Agents**|Chenyang Shao et.al.|[2502.04392v1](http://arxiv.org/abs/2502.04392v1)|null|
-|**2025-02-06**|**Boosting Knowledge Graph-based Recommendations through Confidence-Aware Augmentation with Large Language Models**|Rui Cai et.al.|[2502.03715v1](http://arxiv.org/abs/2502.03715v1)|null|
-|**2025-02-05**|**A Schema-Guided Reason-while-Retrieve framework for Reasoning on Scene Graphs with Large-Language-Models (LLMs)**|Yiye Chen et.al.|[2502.03450v1](http://arxiv.org/abs/2502.03450v1)|null|
-|**2025-02-05**|**SymAgent: A Neural-Symbolic Self-Learning Agent Framework for Complex Reasoning over Knowledge Graphs**|Ben Liu et.al.|[2502.03283v1](http://arxiv.org/abs/2502.03283v1)|null|
-|**2025-02-05**|**Analyze Feature Flow to Enhance Interpretation and Steering in Language Models**|Daniil Laptev et.al.|[2502.03032v2](http://arxiv.org/abs/2502.03032v2)|null|
-|**2025-02-05**|**A Benchmark for the Detection of Metalinguistic Disagreements between LLMs and Knowledge Graphs**|Bradley P. Allen et.al.|[2502.02896v1](http://arxiv.org/abs/2502.02896v1)|null|
-|**2025-02-05**|**Mol-LLM: Generalist Molecular LLM with Improved Graph Utilization**|Chanhui Lee et.al.|[2502.02810v1](http://arxiv.org/abs/2502.02810v1)|null|
-|**2025-02-05**|**Leveraging the true depth of LLMs**|Ramón Calvo González et.al.|[2502.02790v1](http://arxiv.org/abs/2502.02790v1)|null|
-|**2025-02-04**|**Modular Training of Neural Networks aids Interpretability**|Satvik Golechha et.al.|[2502.02470v2](http://arxiv.org/abs/2502.02470v2)|null|
-|**2025-02-04**|**Premise-Augmented Reasoning Chains Improve Error Identification in Math reasoning with LLMs**|Sagnik Mukherjee et.al.|[2502.02362v3](http://arxiv.org/abs/2502.02362v3)|null|
-|**2025-02-04**|**AdaptBot: Combining LLM with Knowledge Graphs and Human Input for Generic-to-Specific Task Decomposition and Knowledge Refinement**|Shivam Singh et.al.|[2502.02067v1](http://arxiv.org/abs/2502.02067v1)|[link](https://github.com/sssshivvvv/adaptbot)|
-|**2025-02-03**|**On Bob Dylan: A Computational Perspective**|Prashant Garg et.al.|[2502.01772v1](http://arxiv.org/abs/2502.01772v1)|null|
-|**2025-02-03**|**VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos**|Xubin Ren et.al.|[2502.01549v1](http://arxiv.org/abs/2502.01549v1)|null|
-|**2025-02-03**|**Transformers trained on proteins can learn to attend to Euclidean distance**|Isaac Ellmen et.al.|[2502.01533v1](http://arxiv.org/abs/2502.01533v1)|[link](https://github.com/Ellmen/attending-to-distance)|
-|**2025-02-03**|**Common Foundations for SHACL, ShEx, and PG-Schema**|S. Ahmetaj et.al.|[2502.01295v1](http://arxiv.org/abs/2502.01295v1)|null|
-|**2025-02-03**|**GFM-RAG: Graph Foundation Model for Retrieval Augmented Generation**|Linhao Luo et.al.|[2502.01113v1](http://arxiv.org/abs/2502.01113v1)|[link](https://github.com/RManLuo/gfm-rag)|
-|**2025-02-03**|**Knowledge Synthesis of Photosynthesis Research Using a Large Language Model**|Seungri Yoon et.al.|[2502.01059v1](http://arxiv.org/abs/2502.01059v1)|null|
-|**2025-02-03**|**Encrypted Large Model Inference: The Equivariant Encryption Paradigm**|James Buban et.al.|[2502.01013v1](http://arxiv.org/abs/2502.01013v1)|null|
-|**2025-02-02**|**Metastable Dynamics of Chain-of-Thought Reasoning: Provable Benefits of Search, RL and Distillation**|Juno Kim et.al.|[2502.01694v1](http://arxiv.org/abs/2502.01694v1)|null|
-|**2025-02-02**|**PhiP-G: Physics-Guided Text-to-3D Compositional Scene Generation**|Qixuan Li et.al.|[2502.00708v1](http://arxiv.org/abs/2502.00708v1)|null|
-|**2025-02-02**|**A Survey of Quantized Graph Representation Learning: Connecting Graph Structures with Large Language Models**|Qika Lin et.al.|[2502.00681v1](http://arxiv.org/abs/2502.00681v1)|null|
-|**2025-02-01**|**Challenges and Innovations in LLM-Powered Fake News Detection: A Synthesis of Approaches and Future Directions**|Jingyuan Yi et.al.|[2502.00339v1](http://arxiv.org/abs/2502.00339v1)|null|
-|**2025-02-01**|**DEUCE: Dual-diversity Enhancement and Uncertainty-awareness for Cold-start Active Learning**|Jiaxin Guo et.al.|[2502.00305v1](http://arxiv.org/abs/2502.00305v1)|null|
-|**2025-01-31**|**Longer Attention Span: Increasing Transformer Context Length with Sparse Graph Processing Techniques**|Nathaniel Tomczak et.al.|[2502.01659v2](http://arxiv.org/abs/2502.01659v2)|[link](https://github.com/KLab-AI3/Graph-Processing-Attention-IPDPS-2025)|
-|**2025-01-31**|**Improving vision-language alignment with graph spiking hybrid Networks**|Siyu Zhang et.al.|[2501.19069v1](http://arxiv.org/abs/2501.19069v1)|null|
-|**2025-01-30**|**Semantic Web and Creative AI -- A Technical Report from ISWS 2023**|Raia Abu Ahmad et.al.|[2501.18542v1](http://arxiv.org/abs/2501.18542v1)|null|
-|**2025-01-30**|**Leveraging LLM Agents for Automated Optimization Modeling for SASP Problems: A Graph-RAG based Approach**|Tianpeng Pan et.al.|[2501.18320v1](http://arxiv.org/abs/2501.18320v1)|null|
-|**2025-01-30**|**Mixed-Precision Graph Neural Quantization for Low Bit Large Language Models**|Wanlong Liu et.al.|[2501.18154v1](http://arxiv.org/abs/2501.18154v1)|null|
-|**2025-01-30**|**Self-supervised Quantized Representation for Seamlessly Integrating Knowledge Graphs with Large Language Models**|Qika Lin et.al.|[2501.18119v1](http://arxiv.org/abs/2501.18119v1)|null|
-|**2025-01-29**|**Hybrid Graphs for Table-and-Text based Question Answering using LLMs**|Ankush Agarwal et.al.|[2501.17767v1](http://arxiv.org/abs/2501.17767v1)|null|
-|**2025-01-29**|**Query-Aware Learnable Graph Pooling Tokens as Prompt for Large Language Models**|Wooyoung Kim et.al.|[2501.17549v1](http://arxiv.org/abs/2501.17549v1)|null|
-|**2025-01-29**|**General Scene Adaptation for Vision-and-Language Navigation**|Haodong Hong et.al.|[2501.17403v1](http://arxiv.org/abs/2501.17403v1)|[link](https://github.com/honghd16/gsa-vln)|
-|**2025-01-28**|**Comprehensive Evaluation for a Large Scale Knowledge Graph Question Answering Service**|Saloni Potdar et.al.|[2501.17270v1](http://arxiv.org/abs/2501.17270v1)|null|
-|**2025-01-28**|**FactCG: Enhancing Fact Checkers with Graph-Based Multi-Hop Data**|Deren Lei et.al.|[2501.17144v1](http://arxiv.org/abs/2501.17144v1)|[link](https://github.com/derenlei/factcg)|
-|**2025-01-28**|**LLM-AutoDiff: Auto-Differentiate Any LLM Workflow**|Li Yin et.al.|[2501.16673v2](http://arxiv.org/abs/2501.16673v2)|[link](https://github.com/sylphai-inc/adalflow)|
-|**2025-01-27**|**360Brew: A Decoder-only Foundation Model for Personalized Ranking and Recommendation**|Hamed Firooz et.al.|[2501.16450v3](http://arxiv.org/abs/2501.16450v3)|null|
-|**2025-01-27**|**Raiders of the Lost Dependency: Fixing Dependency Conflicts in Python using LLMs**|Antony Bartlett et.al.|[2501.16191v1](http://arxiv.org/abs/2501.16191v1)|null|
-|**2025-01-27**|**Harnessing Diverse Perspectives: A Multi-Agent Framework for Enhanced Error Detection in Knowledge Graphs**|Yu Li et.al.|[2501.15791v1](http://arxiv.org/abs/2501.15791v1)|[link](https://github.com/kse-eleven/makged)|
-|**2025-01-27**|**Automatic Feedback Generation for Short Answer Questions using Answer Diagnostic Graphs**|Momoka Furuhashi et.al.|[2501.15777v1](http://arxiv.org/abs/2501.15777v1)|null|
-|**2025-01-26**|**Transformer-Based Multimodal Knowledge Graph Completion with Link-Aware Contexts**|Haodi Ma et.al.|[2501.15688v1](http://arxiv.org/abs/2501.15688v1)|null|
-|**2025-01-26**|**How to Mitigate Information Loss in Knowledge Graphs for GraphRAG: Leveraging Triple Context Restoration and Query-Driven Feedback**|Manzong Huang et.al.|[2501.15378v1](http://arxiv.org/abs/2501.15378v1)|null|
-|**2025-01-24**|**Explaining Categorical Feature Interactions Using Graph Covariance and LLMs**|Cencheng Shen et.al.|[2501.14932v1](http://arxiv.org/abs/2501.14932v1)|null|
-|**2025-01-24**|**Causal Graphs Meet Thoughts: Enhancing Complex Reasoning in Graph-Augmented LLMs**|Hang Luo et.al.|[2501.14892v1](http://arxiv.org/abs/2501.14892v1)|null|
-|**2025-01-24**|**GraPPI: A Retrieve-Divide-Solve GraphRAG Framework for Large-scale Protein-protein Interaction Exploration**|Ziwen Li et.al.|[2501.16382v1](http://arxiv.org/abs/2501.16382v1)|[link](https://github.com/aaronli43/grappi)|
-|**2025-01-24**|**Evaluating and Improving Graph to Text Generation with Large Language Models**|Jie He et.al.|[2501.14497v1](http://arxiv.org/abs/2501.14497v1)|[link](https://github.com/probe2/kg_text)|
-|**2025-01-24**|**Fast Think-on-Graph: Wider, Deeper and Faster Reasoning of Large Language Model on Knowledge Graph**|Xujian Liang et.al.|[2501.14300v1](http://arxiv.org/abs/2501.14300v1)|[link](https://github.com/dosonleung/fasttog)|
-|**2025-01-24**|**Top Ten Challenges Towards Agentic Neural Graph Databases**|Jiaxin Bai et.al.|[2501.14224v1](http://arxiv.org/abs/2501.14224v1)|null|
-|**2025-01-23**|**GraphRAG under Fire**|Jiacheng Liang et.al.|[2501.14050v1](http://arxiv.org/abs/2501.14050v1)|null|
-|**2025-01-23**|**EICopilot: Search and Explore Enterprise Information over Large-scale Knowledge Graphs with LLM-driven Agents**|Yuhui Yun et.al.|[2501.13746v1](http://arxiv.org/abs/2501.13746v1)|null|
-|**2025-01-23**|**Pseudocode-Injection Magic: Enabling LLMs to Tackle Graph Computational Tasks**|Chang Gong et.al.|[2501.13731v1](http://arxiv.org/abs/2501.13731v1)|null|
-|**2025-01-23**|**CAPRAG: A Large Language Model Solution for Customer Service and Automatic Reporting using Vector and Graph Retrieval-Augmented Generation**|Hamza Landolsi et.al.|[2501.13993v1](http://arxiv.org/abs/2501.13993v1)|null|
-|**2025-01-23**|**Dual-Branch HNSW Approach with Skip Bridges and LID-Driven Optimization**|Hy Nguyen et.al.|[2501.13992v1](http://arxiv.org/abs/2501.13992v1)|null|
-|**2025-01-23**|**Comprehensive Modeling and Question Answering of Cancer Clinical Practice Guidelines using LLMs**|Bhumika Gupta et.al.|[2501.13984v1](http://arxiv.org/abs/2501.13984v1)|null|
-|**2025-01-21**|**LLM-Assisted Knowledge Graph Completion for Curriculum and Domain Modelling in Personalized Higher Education Recommendations**|Hasan Abu-Rasheed et.al.|[2501.12300v1](http://arxiv.org/abs/2501.12300v1)|null|
-|**2025-01-21**|**Divide-Then-Aggregate: An Efficient Tool Learning Method via Parallel Tool Invocation**|Dongsheng Zhu et.al.|[2501.12432v1](http://arxiv.org/abs/2501.12432v1)|null|
-|**2025-01-21**|**InsTALL: Context-aware Instructional Task Assistance with Multi-modal Large Language Models**|Pha Nguyen et.al.|[2501.12231v1](http://arxiv.org/abs/2501.12231v1)|null|
-|**2025-01-21**|**Leveraging Graph Structures and Large Language Models for End-to-End Synthetic Task-Oriented Dialogues**|Maya Medjad et.al.|[2501.11977v1](http://arxiv.org/abs/2501.11977v1)|[link](https://github.com/reecall/graphtod)|
-|**2025-01-21**|**Bridging Visualization and Optimization: Multimodal Large Language Models on Graph-Structured Combinatorial Optimization**|Jie Zhao et.al.|[2501.11968v1](http://arxiv.org/abs/2501.11968v1)|null|
-|**2025-01-21**|**A Survey of Graph Retrieval-Augmented Generation for Customized Large Language Models**|Qinggang Zhang et.al.|[2501.13958v1](http://arxiv.org/abs/2501.13958v1)|[link](https://github.com/deep-polyu/awesome-graphrag)|
-|**2025-01-21**|**Network-informed Prompt Engineering against Organized Astroturf Campaigns under Extreme Class Imbalance**|Nikos Kanakaris et.al.|[2501.11849v2](http://arxiv.org/abs/2501.11849v2)|[link](https://github.com/nkanak/brag-fake-news-campaigns)|
-|**2025-01-21**|**Large Language Models Meet Graph Neural Networks for Text-Numeric Graph Reasoning**|Haoran Song et.al.|[2501.16361v1](http://arxiv.org/abs/2501.16361v1)|null|
-|**2025-01-20**|**Zep: A Temporal Knowledge Graph Architecture for Agent Memory**|Preston Rasmussen et.al.|[2501.13956v1](http://arxiv.org/abs/2501.13956v1)|[link](https://github.com/getzep/graphiti)|
-|**2025-01-20**|**Explainable Lane Change Prediction for Near-Crash Scenarios Using Knowledge Graph Embeddings and Retrieval Augmented Generation**|M. Manzour et.al.|[2501.11560v1](http://arxiv.org/abs/2501.11560v1)|null|
-|**2025-01-20**|**Each Graph is a New Language: Graph Learning with LLMs**|Huachi Zhou et.al.|[2501.11478v2](http://arxiv.org/abs/2501.11478v2)|null|
-|**2025-01-20**|**Few-shot Policy (de)composition in Conversational Question Answering**|Kyle Erwin et.al.|[2501.11335v1](http://arxiv.org/abs/2501.11335v1)|null|
-|**2025-01-20**|**Reasoning Language Models: A Blueprint**|Maciej Besta et.al.|[2501.11223v3](http://arxiv.org/abs/2501.11223v3)|[link](https://github.com/spcl/x1)|
-|**2025-01-19**|**IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems**|Elad Levi et.al.|[2501.11067v1](http://arxiv.org/abs/2501.11067v1)|[link](https://github.com/plurai-ai/intellagent)|
+##### **Typhoon T1: An Open Thai Reasoning Model**
+2502.09042v1 by Pittawat Taveekitworachai, Potsawee Manakul, Kasima Tharnpipitchai, Kunat Pipatanakul
 
-#### Abstracts
-##### **Visual Graph Question Answering with ASP and LLMs for Language Parsing**
-2502.09211v1 by Jakob Johannes Bauer, Thomas Eiter, Nelson Higuera Ruiz, Johannes Oetsch
+This paper introduces Typhoon T1, an open effort to develop an open Thai
+reasoning model. A reasoning model is a relatively new type of generative model
+built on top of large language models (LLMs). A reasoning model generates a
+long chain of thought before arriving at a final answer, an approach found to
+improve performance on complex tasks. However, details on developing such a
+model are limited, especially for reasoning models that can generate traces in
+a low-resource language. Typhoon T1 presents an open effort that dives into the
+details of developing a reasoning model in a more cost-effective way by
+leveraging supervised fine-tuning using open datasets, instead of reinforcement
+learning. This paper shares the details about synthetic data generation and
+training, as well as our dataset and model weights. Additionally, we provide
+insights gained from developing a reasoning model that generalizes across
+domains and is capable of generating reasoning traces in a low-resource
+language, using Thai as an example. We hope this open effort provides a
+foundation for further research in this field.
 
-Visual Question Answering (VQA) is a challenging problem that requires to
-process multimodal input. Answer-Set Programming (ASP) has shown great
-potential in this regard to add interpretability and explainability to modular
-VQA architectures. In this work, we address the problem of how to integrate ASP
-with modules for vision and natural language processing to solve a new and
-demanding VQA variant that is concerned with images of graphs (not graphs in
-symbolic form). Images containing graph-based structures are an ubiquitous and
-popular form of visualisation. Here, we deal with the particular problem of
-graphs inspired by transit networks, and we introduce a novel dataset that
-amends an existing one by adding images of graphs that resemble metro lines.
-Our modular neuro-symbolic approach combines optical graph recognition for
-graph parsing, a pretrained optical character recognition neural network for
-parsing labels, Large Language Models (LLMs) for language processing, and ASP
-for reasoning. This method serves as a first baseline and achieves an overall
-average accuracy of 73% on the dataset. Our evaluation provides further
-evidence of the potential of modular neuro-symbolic systems, in particular with
-pretrained models that do not involve any further training and logic
-programming for reasoning, to solve complex VQA tasks.
+摘要：本文介紹 Typhoon T1，這是一個開放的計畫，旨在開發開放的泰語推理模型。推理模型是一種相對較新的生成模型，建構於大型語言模型 (LLM) 之上。推理模型會在得出最終答案之前產生一連串的思考，這種方法被發現可以改善複雜任務的效能。然而，關於如何開發這種模型的詳細資訊有限，特別是對於能夠以低資源語言產生軌跡的推理模型而言。Typhoon T1 提出了一個開放的計畫，深入探討如何以更具成本效益的方式開發推理模型，方法是利用開放式資料集進行監督微調，而不是強化學習。本文分享了關於合成資料產生和訓練的詳細資訊，以及我們的資料集和模型權重。此外，我們提供了從開發推理模型中獲得的見解，該模型可以跨領域概括，並能夠以低資源語言產生推理軌跡，以泰語為例。我們希望這個開放的計畫能為此領域的進一步研究奠定基礎。
 
-摘要：視覺問答（VQA）是一項具有挑戰性的問題，需要處理多模態輸入。答案集程式設計（ASP）在這方面顯示出巨大的潛力，可以為模組化 VQA 架構增加可解釋性和說明性。在這項工作中，我們探討如何將 ASP 與視覺和自然語言處理模組整合，以解決一個新的且要求嚴格的 VQA 變體，該變體與圖形影像（而非符號形式的圖形）有關。包含圖形結構的影像是一種普遍且流行的可視化形式。在這裡，我們處理受交通網路啟發的圖形特定問題，並引入一個新的資料集，透過新增類似地鐵路線的圖形影像來修正現有資料集。我們的模組化神經符號方法結合光學圖形辨識進行圖形解析、預先訓練的光學字元辨識神經網路進行標籤解析、大型語言模型（LLM）進行語言處理，以及 ASP 進行推理。此方法作為第一個基準，在資料集上達到 73% 的整體平均準確度。我們的評估進一步證明了模組化神經符號系統的潛力，特別是預先訓練的模型，這些模型不涉及任何進一步的訓練和邏輯程式設計進行推理，以解決複雜的 VQA 任務。
+##### **Mechanistic Unveiling of Transformer Circuits: Self-Influence as a Key to Model Reasoning**
+2502.09022v1 by Lin Zhang, Lijie Hu, Di Wang
 
-##### **Representation Learning to Advance Multi-institutional Studies with Electronic Health Record Data**
-2502.08547v1 by Doudou Zhou, Han Tong, Linshanshan Wang, Suqi Liu, Xin Xiong, Ziming Gan, Romain Griffier, Boris Hejblum, Yun-Chung Liu, Chuan Hong, Clara-Lea Bonzel, Tianrun Cai, Kevin Pan, Yuk-Lam Ho, Lauren Costa, Vidul A. Panickan, J. Michael Gaziano, Kenneth Mandl, Vianney Jouhet, Rodolphe Thiebaut, Zongqi Xia, Kelly Cho, Katherine Liao, Tianxi Cai
+Transformer-based language models have achieved notable success, yet their
+internal reasoning mechanisms remain largely opaque due to complex non-linear
+interactions and high-dimensional operations. While previous research suggests
+that these models implicitly encode reasoning structures, it is still unclear
+which specific multi-step thought processes they employ to solve complex tasks.
+To address this gap, we propose a novel mechanistic interpretability framework,
+SICAF, designed to trace and analyze the reasoning strategies that language
+models use in multi-step inference tasks. By employing circuit analysis and
+self-influence functions, we quantify the evolving importance of each token
+throughout the reasoning process, thereby mapping the pathways the model uses
+for inference. Applying SICAF to the GPT-2 model on the Indirect Object
+Identification (IOI) prediction task, we demonstrate how underlying circuits
+can reveal a reasoning process that aligns with human interpretability,
+offering new insights into the model's internal logic.
 
-The adoption of EHRs has expanded opportunities to leverage data-driven
-algorithms in clinical care and research. A major bottleneck in effectively
-conducting multi-institutional EHR studies is the data heterogeneity across
-systems with numerous codes that either do not exist or represent different
-clinical concepts across institutions. The need for data privacy further limits
-the feasibility of including multi-institutional patient-level data required to
-study similarities and differences across patient subgroups. To address these
-challenges, we developed the GAME algorithm. Tested and validated across 7
-institutions and 2 languages, GAME integrates data in several levels: (1) at
-the institutional level with knowledge graphs to establish relationships
-between codes and existing knowledge sources, providing the medical context for
-standard codes and their relationship to each other; (2) between institutions,
-leveraging language models to determine the relationships between
-institution-specific codes with established standard codes; and (3) quantifying
-the strength of the relationships between codes using a graph attention
-network. Jointly trained embeddings are created using transfer and federated
-learning to preserve data privacy. In this study, we demonstrate the
-applicability of GAME in selecting relevant features as inputs for AI-driven
-algorithms in a range of conditions, e.g., heart failure, rheumatoid arthritis.
-We then highlight the application of GAME harmonized multi-institutional EHR
-data in a study of Alzheimer's disease outcomes and suicide risk among patients
-with mental health disorders, without sharing patient-level data outside
-individual institutions.
+摘要：基於 Transformer 的語言模型已取得顯著的成功，但由於複雜的非線性交互和高維度運算，它們的內部推理機制在很大程度上仍然不透明。儘管先前的研究表明這些模型隱含地編碼推理結構，但目前仍不清楚它們採用哪些具體的多步驟思考過程來解決複雜任務。為了解決這個差距，我們提出了一個新穎的機制可解釋性框架 SICAF，旨在追蹤和分析語言模型在多步驟推理任務中使用的推理策略。通過採用電路分析和自影響函數，我們量化了推理過程中每個標記的演化重要性，從而繪製出模型用於推理的路徑。將 SICAF 應用於 GPT-2 模型上的間接賓語識別 (IOI) 預測任務，我們展示了底層電路如何揭示與人類可解釋性相符的推理過程，從而對模型的內部邏輯提供了新的見解。
 
-摘要：電子健康紀錄的採用擴大了在臨床照護和研究中利用資料驅動演算法的機會。在有效進行多機構電子健康紀錄研究時，一個主要的瓶頸是系統間資料異質性，其中有許多代碼在機構間不存在或表示不同的臨床概念。資料隱私的需求進一步限制了納入多機構患者層級資料的可行性，而這些資料對於研究患者亞群之間的相似性和差異性是必要的。為了應對這些挑戰，我們開發了 GAME 演算法。GAME 已在 7 個機構和 2 種語言中進行測試和驗證，它整合了多個層級的資料：(1) 在機構層級，使用知識圖表來建立代碼和現有知識來源之間的關係，為標準代碼及其彼此之間的關係提供醫療背景；(2) 在機構之間，利用語言模型來確定機構特定代碼與已建立的標準代碼之間的關係；(3) 使用圖形注意網路量化代碼之間關係的強度。使用遷移和聯合學習建立聯合訓練的嵌入，以保護資料隱私。在本研究中，我們展示了 GAME 在選擇相關特徵作為 AI 驅動演算法輸入時的適用性，適用於各種情況，例如心臟衰竭、類風濕性關節炎。然後，我們重點介紹了 GAME 和諧化多機構電子健康紀錄資料在阿茲海默症疾病結果和精神疾病患者自殺風險研究中的應用，而無需在個別機構之外共享患者層級資料。
+##### **EventSTR: A Benchmark Dataset and Baselines for Event Stream based Scene Text Recognition**
+2502.09020v1 by Xiao Wang, Jingtao Jiang, Dong Li, Futian Wang, Lin Zhu, Yaowei Wang, Yongyong Tian, Jin Tang
 
-##### **Trustworthy GNNs with LLMs: A Systematic Review and Taxonomy**
-2502.08353v1 by Ruizhan Xue, Huimin Deng, Fang He, Maojun Wang, Zeyu Zhang
+Mainstream Scene Text Recognition (STR) algorithms are developed based on RGB
+cameras which are sensitive to challenging factors such as low illumination,
+motion blur, and cluttered backgrounds. In this paper, we propose to recognize
+the scene text using bio-inspired event cameras by collecting and annotating a
+large-scale benchmark dataset, termed EventSTR. It contains 9,928
+high-definition (1280 * 720) event samples and involves both Chinese and
+English characters. We also benchmark multiple STR algorithms as the baselines
+for future works to compare. In addition, we propose a new event-based scene
+text recognition framework, termed SimC-ESTR. It first extracts the event
+features using a visual encoder and projects them into tokens using a Q-former
+module. More importantly, we propose to augment the vision tokens based on a
+memory mechanism before feeding into the large language models. A
+similarity-based error correction mechanism is embedded within the large
+language model to correct potential minor errors fundamentally based on
+contextual information. Extensive experiments on the newly proposed EventSTR
+dataset and two simulation STR datasets fully demonstrate the effectiveness of
+our proposed model. We believe that the dataset and algorithmic model can
+innovatively propose an event-based STR task and are expected to accelerate the
+application of event cameras in various industries. The source code and
+pre-trained models will be released on https://github.com/Event-AHU/EventSTR
 
-With the extensive application of Graph Neural Networks (GNNs) across various
-domains, their trustworthiness has emerged as a focal point of research. Some
-existing studies have shown that the integration of large language models
-(LLMs) can improve the semantic understanding and generation capabilities of
-GNNs, which in turn improves the trustworthiness of GNNs from various aspects.
-Our review introduces a taxonomy that offers researchers a clear framework for
-comprehending the principles and applications of different methods and helps
-clarify the connections and differences among various approaches. Then we
-systematically survey representative approaches along the four categories of
-our taxonomy. Through our taxonomy, researchers can understand the applicable
-scenarios, potential advantages, and limitations of each approach for the the
-trusted integration of GNNs with LLMs. Finally, we present some promising
-directions of work and future trends for the integration of LLMs and GNNs to
-improve model trustworthiness.
+摘要：主流場景文字辨識（STR）演算法是基於對低光源、動態模糊和雜亂背景等挑戰性因素敏感的 RGB 相機開發的。在本文中，我們提出使用生物靈感事件相機辨識場景文字，方法是收集和標註一個稱為 EventSTR 的大規模基準資料集。它包含 9,928 個高畫質（1280 * 720）事件範例，並包含中文字和英文字元。我們也基準化多個 STR 演算法作為未來工作的基準，以進行比較。此外，我們提出一個新的基於事件的場景文字辨識架構，稱為 SimC-ESTR。它首先使用視覺編碼器萃取事件特徵，並使用 Q-former 模組將它們投影到代幣中。更重要的是，我們提出在輸入大型語言模型之前，基於記憶機制擴充視覺代幣。一個基於相似性的錯誤修正機制嵌入在大型語言模型中，以根據上下文資訊從根本上修正潛在的輕微錯誤。在最新提出的 EventSTR 資料集和兩個模擬 STR 資料集上進行的廣泛實驗充分證明了我們提出的模型的有效性。我們相信，該資料集和演算法模型可以創新地提出一個基於事件的 STR 任務，並有望加速事件相機在各個產業的應用。原始碼和預先訓練的模型將在 https://github.com/Event-AHU/EventSTR 上釋出
 
-摘要：隨著圖神經網路 (GNN) 在各種領域的廣泛應用，其可信度已成為研究的焦點。一些現有研究表明，整合大型語言模型 (LLM) 可以提升 GNN 的語意理解和生成能力，進而從各方面提升 GNN 的可信度。我們的評論介紹了一種分類法，為研究人員提供了一個清晰的架構，用於理解不同方法的原理和應用，並有助於釐清各種方法之間的關聯和差異。然後，我們系統性地針對分類法的四個類別進行代表性方法的調查。研究人員透過我們的分類法，可以了解每種方法在 GNN 與 LLM 的可信整合中適用的場景、潛在優點和限制。最後，我們提出 LLM 與 GNN 整合的一些有前景的工作方向和未來趨勢，以提升模型的可信度。
+##### **Zero-shot Concept Bottleneck Models**
+2502.09018v1 by Shin'ya Yamaguchi, Kosuke Nishida, Daiki Chijiwa, Yasutoshi Ida
 
-##### **Graph Foundation Models for Recommendation: A Comprehensive Survey**
-2502.08346v1 by Bin Wu, Yihang Wang, Yuanhao Zeng, Jiawei Liu, Jiashu Zhao, Cheng Yang, Yawen Li, Long Xia, Dawei Yin, Chuan Shi
+Concept bottleneck models (CBMs) are inherently interpretable and
+intervenable neural network models, which explain their final label prediction
+by the intermediate prediction of high-level semantic concepts. However, they
+require target task training to learn input-to-concept and concept-to-label
+mappings, incurring target dataset collections and training resources. In this
+paper, we present \textit{zero-shot concept bottleneck models} (Z-CBMs), which
+predict concepts and labels in a fully zero-shot manner without training neural
+networks. Z-CBMs utilize a large-scale concept bank, which is composed of
+millions of vocabulary extracted from the web, to describe arbitrary input in
+various domains. For the input-to-concept mapping, we introduce concept
+retrieval, which dynamically finds input-related concepts by the cross-modal
+search on the concept bank. In the concept-to-label inference, we apply concept
+regression to select essential concepts from the retrieved concepts by sparse
+linear regression. Through extensive experiments, we confirm that our Z-CBMs
+provide interpretable and intervenable concepts without any additional
+training. Code will be available at https://github.com/yshinya6/zcbm.
 
-Recommender systems (RS) serve as a fundamental tool for navigating the vast
-expanse of online information, with deep learning advancements playing an
-increasingly important role in improving ranking accuracy. Among these, graph
-neural networks (GNNs) excel at extracting higher-order structural information,
-while large language models (LLMs) are designed to process and comprehend
-natural language, making both approaches highly effective and widely adopted.
-Recent research has focused on graph foundation models (GFMs), which integrate
-the strengths of GNNs and LLMs to model complex RS problems more efficiently by
-leveraging the graph-based structure of user-item relationships alongside
-textual understanding. In this survey, we provide a comprehensive overview of
-GFM-based RS technologies by introducing a clear taxonomy of current
-approaches, diving into methodological details, and highlighting key challenges
-and future directions. By synthesizing recent advancements, we aim to offer
-valuable insights into the evolving landscape of GFM-based recommender systems.
+摘要：概念瓶頸模型 (CBM) 本質上是可解釋且可干預的神經網路模型，它們透過對高階語意概念的中間預測來解釋其最終標籤預測。然而，它們需要目標任務訓練來學習輸入到概念和概念到標籤的對應，導致目標資料集收集和訓練資源。在本文中，我們展示了「零次學習概念瓶頸模型」(Z-CBM)，它以完全零次學習的方式預測概念和標籤，而無需訓練神經網路。Z-CBM 利用一個大型概念庫，其中包含從網路中擷取的數百萬個詞彙，來描述各種領域中的任意輸入。對於輸入到概念的對應，我們引入了概念擷取，它透過對概念庫的跨模態搜尋，動態地找出與輸入相關的概念。在概念到標籤的推論中，我們應用概念迴歸，透過稀疏線性迴歸從擷取的概念中選擇必要的概念。透過廣泛的實驗，我們確認我們的 Z-CBM 在沒有任何額外訓練的情況下提供了可解釋且可干預的概念。程式碼將可在 https://github.com/yshinya6/zcbm 取得。
 
-摘要：推薦系統 (RS) 是導航廣闊線上資訊的基本工具，深度學習的進展在提升排名準確度方面扮演著日益重要的角色。在這些進展中，圖形神經網路 (GNN) 擅長萃取高階結構資訊，而大型語言模型 (LLM) 則設計用於處理和理解自然語言，這兩種方法都非常有效且廣泛採用。最近的研究專注於圖形基礎模型 (GFM)，它整合了 GNN 和 LLM 的優點，透過利用使用者與項目關係的圖形化結構，以及文字理解，更有效率地建構複雜的 RS 問題。在這項調查中，我們提供 GFM-based RS 技術的全面概觀，介紹當前方法的明確分類法，深入探討方法論的細節，並強調關鍵挑戰和未來方向。透過綜合最近的進展，我們旨在提供有價值的見解，了解 GFM-based 推薦系統不斷演變的樣貌。
+##### **Diversity Enhances an LLM's Performance in RAG and Long-context Task**
+2502.09017v1 by Zhchao Wang, Bin Bi, Yanqi Luo, Sitaram Asur, Claire Na Cheng
 
-##### **Self-Evaluation for Job-Shop Scheduling**
-2502.08684v1 by Imanol Echeverria, Maialen Murua, Roberto Santana
+The rapid advancements in large language models (LLMs) have highlighted the
+challenge of context window limitations, primarily due to the quadratic time
+complexity of the self-attention mechanism (\(O(N^2)\), where \(N\) denotes the
+context window length). This constraint impacts tasks such as
+retrieval-augmented generation (RAG) in question answering (Q\&A) and long
+context summarization. A common approach involves selecting content with the
+highest similarity to the query; however, this often leads to redundancy and
+the exclusion of diverse yet relevant information. Building on principles from
+Maximal Marginal Relevance (MMR) and Farthest Point Sampling (FPS), we
+integrate diversity into the content selection process. Our findings reveal
+that incorporating diversity substantially increases the recall of selecting
+relevant sentences or chunks before LLM-based Q\&A and summarization. These
+results highlight the importance of maintaining diversity in future LLM
+applications to further improve summarization and Q\&A outcomes.
 
-Combinatorial optimization problems, such as scheduling and route planning,
-are crucial in various industries but are computationally intractable due to
-their NP-hard nature. Neural Combinatorial Optimization methods leverage
-machine learning to address these challenges but often depend on sequential
-decision-making, which is prone to error accumulation as small mistakes
-propagate throughout the process. Inspired by self-evaluation techniques in
-Large Language Models, we propose a novel framework that generates and
-evaluates subsets of assignments, moving beyond traditional stepwise
-approaches. Applied to the Job-Shop Scheduling Problem, our method integrates a
-heterogeneous graph neural network with a Transformer to build a policy model
-and a self-evaluation function. Experimental validation on challenging,
-well-known benchmarks demonstrates the effectiveness of our approach,
-surpassing state-of-the-art methods.
+摘要：大型語言模型 (LLM) 的快速進步凸顯了上下文視窗限制的挑戰，這主要是由於自注意力機制的二次時間複雜度（\(O(N^2)\)），其中 \(N\) 表示上下文視窗長度。此限制會影響任務，例如問答 (Q&A) 中的檢索增強生成 (RAG) 和長文摘要。一種常見的方法涉及選擇與查詢最相似的內容；然而，這通常會導致冗餘，並排除多樣化但相關的資訊。我們根據最大邊際相關性 (MMR) 和最遠點取樣 (FPS) 的原則，將多樣性整合到內容選擇過程中。我們的研究結果顯示，在基於 LLM 的問答和摘要之前，納入多樣性會大幅增加選擇相關句子或區塊的召回率。這些結果突顯了在未來的 LLM 應用中維持多樣性的重要性，以進一步改善摘要和問答的結果。
 
-摘要：組合優化問題，例如排程和路線規劃，在各行各業中至關重要，但由於它們的 NP 難度，在計算上難以處理。神經組合優化方法利用機器學習來解決這些挑戰，但通常依賴於序貫決策制定，而序貫決策制定容易發生錯誤累積，因為小錯誤會在整個過程中傳播。受大型語言模型中的自我評估技術啟發，我們提出了一個新的框架，可生成和評估作業子集，超越傳統的分步方法。應用於工作車間排程問題，我們的方法將異質圖神經網路與 Transformer 整合在一起，以建立策略模型和自我評估函數。在具有挑戰性的著名基準上的實驗驗證證明了我們方法的有效性，超越了最先進的方法。
+##### **Hope vs. Hate: Understanding User Interactions with LGBTQ+ News Content in Mainstream US News Media through the Lens of Hope Speech**
+2502.09004v1 by Jonathan Pofcher, Christopher M. Homan, Randall Sell, Ashiqur R. KhudaBukhsh
 
-##### **Improving Existing Optimization Algorithms with LLMs**
-2502.08298v1 by Camilo Chacón Sartori, Christian Blum
+This paper makes three contributions. First, via a substantial corpus of
+1,419,047 comments posted on 3,161 YouTube news videos of major US cable news
+outlets, we analyze how users engage with LGBTQ+ news content. Our analyses
+focus both on positive and negative content. In particular, we construct a
+fine-grained hope speech classifier that detects positive (hope speech),
+negative, neutral, and irrelevant content. Second, in consultation with a
+public health expert specializing on LGBTQ+ health, we conduct an annotation
+study with a balanced and diverse political representation and release a
+dataset of 3,750 instances with fine-grained labels and detailed annotator
+demographic information. Finally, beyond providing a vital resource for the
+LGBTQ+ community, our annotation study and subsequent in-the-wild assessments
+reveal (1) strong association between rater political beliefs and how they rate
+content relevant to a marginalized community; (2) models trained on individual
+political beliefs exhibit considerable in-the-wild disagreement; and (3)
+zero-shot large language models (LLMs) align more with liberal raters.
 
-The integration of Large Language Models (LLMs) into optimization has created
-a powerful synergy, opening exciting research opportunities. This paper
-investigates how LLMs can enhance existing optimization algorithms. Using their
-pre-trained knowledge, we demonstrate their ability to propose innovative
-heuristic variations and implementation strategies. To evaluate this, we
-applied a non-trivial optimization algorithm, Construct, Merge, Solve and Adapt
-(CMSA) -- a hybrid metaheuristic for combinatorial optimization problems that
-incorporates a heuristic in the solution construction phase. Our results show
-that an alternative heuristic proposed by GPT-4o outperforms the
-expert-designed heuristic of CMSA, with the performance gap widening on larger
-and denser graphs. Project URL: https://imp-opt-algo-llms.surge.sh/
+摘要：本文做出了三項貢獻。首先，透過一個龐大的語料庫，其中包含 1,419,047 則評論，這些評論張貼在 3,161 部美國有線新聞頻道的 YouTube 新聞影片上，我們分析了使用者如何參與 LGBTQ+ 新聞內容。我們的分析重點在於正面和負面的內容。特別是，我們建構了一個細緻的希望言論分類器，用來偵測正面的（希望言論）、負面的、中立的和不相關的內容。其次，在諮詢了一位專門研究 LGBTQ+ 健康的公共衛生專家後，我們進行了一項標註研究，其中包含平衡且多元的政治代表性，並發布了一個包含 3,750 個實例的資料集，其中包含細緻的標籤和詳細的標註者人口統計資訊。最後，除了為 LGBTQ+ 社群提供重要的資源外，我們的標註研究和後續的實際評估揭示了：(1) 評分者的政治信仰與他們如何評分與邊緣化社群相關的內容之間有很強的關聯性；(2) 根據個人政治信仰訓練的模型在實際應用中表現出相當大的分歧；(3) 零次學習大型語言模型 (LLM) 與自由派評分者的看法更一致。
 
-摘要：大型语言模型 (LLM) 与优化相结合，创造了一种强大的协同作用，开启了令人兴奋的研究机会。本文探讨了 LLM 如何增强现有的优化算法。利用其预先训练的知识，我们展示了它们提出创新启发式变体和实施策略的能力。为了评估这一点，我们应用了一种非平凡的优化算法，构建、合并、求解和适应 (CMSA)——一种用于组合优化问题的混合元启发式算法，它在求解构建阶段纳入了启发式算法。我们的结果表明，GPT-4o 提出的替代启发式算法优于 CMSA 的专家设计的启发式算法，并且随着图形变得更大、更密集，性能差距也在扩大。项目网址：https://imp-opt-algo-llms.surge.sh/
+##### **RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models**
+2502.09003v1 by Quan Wei, Chung-Yiu Yau, Hoi-To Wai, Yang, Zhao, Dongyeop Kang, Youngsuk Park, Mingyi Hong
 
-##### **ACCESS : A Benchmark for Abstract Causal Event Discovery and Reasoning**
-2502.08148v1 by Vy Vo, Lizhen Qu, Tao Feng, Yuncheng Hua, Xiaoxi Kang, Songhai Fan, Tim Dwyer, Lay-Ki Soon, Gholamreza Haffari
+Supervised fine-tuning is a standard method for adapting pre-trained large
+language models (LLMs) to downstream tasks. Quantization has been recently
+studied as a post-training technique for efficient LLM deployment. To obtain
+quantized fine-tuned LLMs, conventional pipelines would first fine-tune the
+pre-trained models, followed by post-training quantization. This often yields
+suboptimal performance as it fails to leverage the synergy between fine-tuning
+and quantization. To effectively realize low-bit quantization of weights,
+activations, and KV caches in LLMs, we propose an algorithm named Rotated
+Straight-Through-Estimator (RoSTE), which combines quantization-aware
+supervised fine-tuning (QA-SFT) with an adaptive rotation strategy that
+identifies an effective rotation configuration to reduce activation outliers.
+We provide theoretical insights on RoSTE by analyzing its prediction error when
+applied to an overparameterized least square quantized training problem. Our
+findings reveal that the prediction error is directly proportional to the
+quantization error of the converged weights, which can be effectively managed
+through an optimized rotation configuration. Experiments on Pythia and Llama
+models of different sizes demonstrate the effectiveness of RoSTE. Compared to
+existing post-SFT quantization baselines, our method consistently achieves
+superior performances across various tasks and different LLM architectures.
 
-Identifying cause-and-effect relationships is critical to understanding
-real-world dynamics and ultimately causal reasoning. Existing methods for
-identifying event causality in NLP, including those based on Large Language
-Models (LLMs), exhibit difficulties in out-of-distribution settings due to the
-limited scale and heavy reliance on lexical cues within available benchmarks.
-Modern benchmarks, inspired by probabilistic causal inference, have attempted
-to construct causal graphs of events as a robust representation of causal
-knowledge, where \texttt{CRAB} \citep{romanou2023crab} is one such recent
-benchmark along this line. In this paper, we introduce \texttt{ACCESS}, a
-benchmark designed for discovery and reasoning over abstract causal events.
-Unlike existing resources, \texttt{ACCESS} focuses on causality of everyday
-life events on the abstraction level. We propose a pipeline for identifying
-abstractions for event generalizations from \texttt{GLUCOSE}
-\citep{mostafazadeh-etal-2020-glucose}, a large-scale dataset of implicit
-commonsense causal knowledge, from which we subsequently extract $1,4$K causal
-pairs. Our experiments highlight the ongoing challenges of using statistical
-methods and/or LLMs for automatic abstraction identification and causal
-discovery in NLP. Nonetheless, we demonstrate that the abstract causal
-knowledge provided in \texttt{ACCESS} can be leveraged for enhancing QA
-reasoning performance in LLMs.
+摘要：監督式微調是將預訓練的大型語言模型 (LLM) 適應至下游任務的標準方法。量化最近已被研究作為一種訓練後技術，用於高效部署 LLM。為了獲得量化的微調 LLM，傳統管道會先微調預訓練模型，然後再進行訓練後量化。這通常會產生次佳效能，因為它無法利用微調和量化之間的協同效應。為了有效實現 LLM 中權重、激活和 KV 快取的低位元量化，我們提出了一種名為旋轉直通估計器 (RoSTE) 的演算法，它結合了量化感知監督式微調 (QA-SFT) 和一種自適應旋轉策略，該策略會識別有效的旋轉組態以減少激活異常值。我們透過分析 RoSTE 在應用於過度參數化最小平方量化訓練問題時的預測誤差，提供了關於 RoSTE 的理論見解。我們的研究結果顯示，預測誤差與收斂權重的量化誤差成正比，而這可透過最佳化的旋轉組態有效地管理。在不同大小的 Pythia 和 Llama 模型上進行的實驗證明了 RoSTE 的有效性。與現有的訓練後 SFT 量化基準相比，我們的模型在各種任務和不同的 LLM 架構中持續獲得優異的效能。
 
-摘要：<paragraph>找出因果關係對於理解現實世界的動態和最終的因果推理至關重要。現有的 NLP 事件因果關係識別方法，包括基於大型語言模型 (LLM) 的方法，由於規模有限且過度依賴於可用基準中的詞彙線索，在分佈外環境中表現出困難。受機率因果推論啟發的現代基準已嘗試建構事件的因果圖，作為因果知識的強健表示，其中 \texttt{CRAB} \citep{romanou2023crab} 是這條路徑上最近的一個基準。在本文中，我們介紹 \texttt{ACCESS}，一個專門設計來探索和推理抽象因果事件的基準。與現有資源不同，\texttt{ACCESS} 專注於抽象層面上日常生活事件的因果關係。我們提出一個管道，用於從 \texttt{GLUCOSE} \citep{mostafazadeh-etal-2020-glucose} 找出事件概括的抽象，\texttt{GLUCOSE} 是隱含常識因果知識的大規模資料集，我們隨後從中萃取出 1,4K 因果對。我們的實驗突顯出使用統計方法和/或 LLM 進行 NLP 中的自動抽象識別和因果發現的持續挑戰。儘管如此，我們證明了 \texttt{ACCESS} 中提供的抽象因果知識可用於增強 LLM 中的問答推理效能。</paragraph>
+##### **PixLift: Accelerating Web Browsing via AI Upscaling**
+2502.08995v1 by Yonas Atinafu, Sarthak Malla, HyunSeok Daniel Jang, Nouar Aldahoul, Matteo Varvello, Yasir Zaki
 
-##### **GCoT: Chain-of-Thought Prompt Learning for Graphs**
-2502.08092v1 by Xingtong Yu, Chang Zhou, Zhongwei Kuai, Xinming Zhang, Yuan Fang
+Accessing the internet in regions with expensive data plans and limited
+connectivity poses significant challenges, restricting information access and
+economic growth. Images, as a major contributor to webpage sizes, exacerbate
+this issue, despite advances in compression formats like WebP and AVIF. The
+continued growth of complex and curated web content, coupled with suboptimal
+optimization practices in many regions, has prevented meaningful reductions in
+web page sizes. This paper introduces PixLift, a novel solution to reduce
+webpage sizes by downscaling their images during transmission and leveraging AI
+models on user devices to upscale them. By trading computational resources for
+bandwidth, PixLift enables more affordable and inclusive web access. We address
+key challenges, including the feasibility of scaled image requests on popular
+websites, the implementation of PixLift as a browser extension, and its impact
+on user experience. Through the analysis of 71.4k webpages, evaluations of
+three mainstream upscaling models, and a user study, we demonstrate PixLift's
+ability to significantly reduce data usage without compromising image quality,
+fostering a more equitable internet.
 
-Chain-of-thought (CoT) prompting has achieved remarkable success in natural
-language processing (NLP). However, its vast potential remains largely
-unexplored for graphs. This raises an interesting question: How can we design
-CoT prompting for graphs to guide graph models to learn step by step? On one
-hand, unlike natural languages, graphs are non-linear and characterized by
-complex topological structures. On the other hand, many graphs lack textual
-data, making it difficult to formulate language-based CoT prompting. In this
-work, we propose the first CoT prompt learning framework for text-free graphs,
-GCoT. Specifically, we decompose the adaptation process for each downstream
-task into a series of inference steps, with each step consisting of
-prompt-based inference, ``thought'' generation, and thought-conditioned prompt
-learning. While the steps mimic CoT prompting in NLP, the exact mechanism
-differs significantly. Specifically, at each step, an input graph, along with a
-prompt, is first fed into a pre-trained graph encoder for prompt-based
-inference. We then aggregate the hidden layers of the encoder to construct a
-``thought'', which captures the working state of each node in the current step.
-Conditioned on this thought, we learn a prompt specific to each node based on
-the current state. These prompts are fed into the next inference step,
-repeating the cycle. To evaluate and analyze the effectiveness of GCoT, we
-conduct comprehensive experiments on eight public datasets, which demonstrate
-the advantage of our approach.
+摘要：在數據方案昂貴且連線有限的地區存取網路會造成重大挑戰，限制了資訊存取和經濟成長。圖像作為網頁大小的主要貢獻者，儘管 WebP 和 AVIF 等壓縮格式進步，但仍加劇了這個問題。複雜且經過策劃的網路內容持續成長，加上許多地區次佳的最佳化實務，已阻礙了網頁大小的顯著減少。本文介紹 PixLift，這是一種創新的解決方案，可在傳輸過程中縮小圖像大小，並利用使用者裝置上的 AI 模型來放大圖像，藉此縮小網頁大小。PixLift 透過以運算資源換取頻寬，讓網路存取更經濟實惠且更具包容性。我們解決了關鍵挑戰，包括熱門網站上縮放圖像要求的可行性、將 PixLift 實作為瀏覽器擴充功能，以及它對使用者體驗的影響。透過分析 71.4k 個網頁、評估三個主流放大模型，以及使用者研究，我們展示了 PixLift 在不影響影像品質的情況下顯著減少資料用量的能力，促進了更公平的網路。
 
-摘要：<paragraph>鏈式思考 (CoT) 提示在自然語言處理 (NLP) 中取得了顯著的成功。然而，其龐大的潛力在圖形方面仍未得到充分探索。這提出了一個有趣的問題：我們如何設計圖形的 CoT 提示來指導圖形模型逐步學習？一方面，與自然語言不同，圖形是非線性的，並且具有複雜的拓撲結構。另一方面，許多圖形缺乏文本數據，這使得難以制定基於語言的 CoT 提示。在這項工作中，我們提出了第一個適用於無文本圖形的 CoT 提示學習框架 GCoT。具體來說，我們將每個下游任務的適應過程分解為一系列推理步驟，每個步驟都包含基於提示的推理、「思想」生成以及基於思想的提示學習。雖然這些步驟模擬了 NLP 中的 CoT 提示，但具體機制卻有很大不同。具體來說，在每一步中，一個輸入圖形連同一個提示首先被輸入到一個預訓練的圖形編碼器中進行基於提示的推理。然後，我們聚合編碼器的隱藏層以構建一個「思想」，它捕獲了當前步驟中每個節點的工作狀態。基於這個思想，我們根據當前狀態學習一個特定於每個節點的提示。這些提示被輸入到下一個推理步驟中，重複這個循環。為了評估和分析 GCoT 的有效性，我們對八個公共數據集進行了全面的實驗，這證明了我們方法的優勢。</paragraph>
+##### **RLSA-PFL: Robust Lightweight Secure Aggregation with Model Inconsistency Detection in Privacy-Preserving Federated Learning**
+2502.08989v1 by Nazatul H. Sultan, Yan Bo, Yansong Gao, Seyit Camtepe, Arash Mahboubi, Hang Thanh Bui, Aufeef Chauhan, Hamed Aboutorab, Michael Bewong, Praveen Gauravaram, Rafiqul Islam, Sharif Abuadbba
 
-##### **Deep Semantic Graph Learning via LLM based Node Enhancement**
-2502.07982v1 by Chuanqi Shi, Yiyi Tao, Hang Zhang, Lun Wang, Shaoshuai Du, Yixian Shen, Yanxin Shen
+Federated Learning (FL) allows users to collaboratively train a global
+machine learning model by sharing local model only, without exposing their
+private data to a central server. This distributed learning is particularly
+appealing in scenarios where data privacy is crucial, and it has garnered
+substantial attention from both industry and academia. However, studies have
+revealed privacy vulnerabilities in FL, where adversaries can potentially infer
+sensitive information from the shared model parameters. In this paper, we
+present an efficient masking-based secure aggregation scheme utilizing
+lightweight cryptographic primitives to mitigate privacy risks. Our scheme
+offers several advantages over existing methods. First, it requires only a
+single setup phase for the entire FL training session, significantly reducing
+communication overhead. Second, it minimizes user-side overhead by eliminating
+the need for user-to-user interactions, utilizing an intermediate server layer
+and a lightweight key negotiation method. Third, the scheme is highly resilient
+to user dropouts, and the users can join at any FL round. Fourth, it can detect
+and defend against malicious server activities, including recently discovered
+model inconsistency attacks. Finally, our scheme ensures security in both
+semi-honest and malicious settings. We provide security analysis to formally
+prove the robustness of our approach. Furthermore, we implemented an end-to-end
+prototype of our scheme. We conducted comprehensive experiments and
+comparisons, which show that it outperforms existing solutions in terms of
+communication and computation overhead, functionality, and security.
 
-Graph learning has attracted significant attention due to its widespread
-real-world applications. Current mainstream approaches rely on text node
-features and obtain initial node embeddings through shallow embedding learning
-using GNNs, which shows limitations in capturing deep textual semantics. Recent
-advances in Large Language Models (LLMs) have demonstrated superior
-capabilities in understanding text semantics, transforming traditional text
-feature processing. This paper proposes a novel framework that combines Graph
-Transformer architecture with LLM-enhanced node features. Specifically, we
-leverage LLMs to generate rich semantic representations of text nodes, which
-are then processed by a multi-head self-attention mechanism in the Graph
-Transformer to capture both local and global graph structural information. Our
-model utilizes the Transformer's attention mechanism to dynamically aggregate
-neighborhood information while preserving the semantic richness provided by LLM
-embeddings. Experimental results demonstrate that the LLM-enhanced node
-features significantly improve the performance of graph learning models on node
-classification tasks. This approach shows promising results across multiple
-graph learning tasks, offering a practical direction for combining graph
-networks with language models.
+摘要：聯合式學習 (FL) 使用者可以透過僅分享本機模型，在不將其私人資料揭露給中央伺服器的情況下，共同訓練全球機器學習模型。這種分散式學習在資料隱私至關重要的場景中特別具有吸引力，並且已獲得業界和學術界的廣泛關注。然而，研究顯示 FL 中存在隱私漏洞，其中對手可能會從共享模型參數中推斷出敏感資訊。在本文中，我們提出了一種有效率的基於遮罩的安全聚合方案，利用輕量級的密碼原語來降低隱私風險。我們的方案相較於現有方法提供了多項優點。首先，它僅需要在整個 FL 訓練階段進行一次設定階段，大幅降低了通訊開銷。其次，透過消除使用者間互動的需要，利用中間伺服器層和輕量級金鑰協商方法，將使用者端的開銷降到最低。第三，該方案對使用者中斷具有高度的復原力，使用者可以在任何 FL 回合中加入。第四，它可以偵測和防禦惡意伺服器活動，包括最近發現的模型不一致攻擊。最後，我們的方案確保在半誠實和惡意設定中都能獲得安全性。我們提供了安全分析，以正式證明我們方法的穩健性。此外，我們實作了我們方案的端對端原型。我們進行了全面的實驗和比較，結果顯示，在通訊和運算開銷、功能和安全性方面，它優於現有的解決方案。
 
-摘要：圖形學習因其廣泛的現實世界應用而備受關注。目前的熱門方法依賴於文本節點特徵，並通過使用 GNN 的淺層嵌入學習來獲取初始節點嵌入，這在捕捉深度文本語義方面表現出局限性。大語言模型 (LLM) 的最新進展已證明在理解文本語義方面具有優越的能力，轉換了傳統的文本特徵處理。本文提出了一種新的框架，將圖形轉換器架構與 LLM 增強的節點特徵相結合。具體來說，我們利用 LLM 來生成文本節點的豐富語義表示，然後在圖形轉換器中由多頭自我注意機制處理，以捕捉局部和全局圖形結構信息。我們的模型利用 Transformer 的注意機制來動態聚合鄰域信息，同時保留 LLM 嵌入提供的語義豐富性。實驗結果表明，LLM 增強的節點特徵顯著提高了圖形學習模型在節點分類任務上的性能。這種方法在多個圖形學習任務中顯示出有希望的結果，為將圖形網絡與語言模型相結合提供了實用的方向。
+##### **Neural Force Field: Learning Generalized Physical Representation from a Few Examples**
+2502.08987v1 by Shiqian Li, Ruihong Shen, Chi Zhang, Yixin Zhu
 
-##### **Cardiverse: Harnessing LLMs for Novel Card Game Prototyping**
-2502.07128v1 by Danrui Li, Sen Zhang, Sam S. Sohn, Kaidong Hu, Muhammad Usman, Mubbasir Kapadia
+Physical reasoning is a remarkable human ability that enables rapid learning
+and generalization from limited experience. Current AI models, despite
+extensive training, still struggle to achieve similar generalization,
+especially in Out-of-distribution (OOD) settings. This limitation stems from
+their inability to abstract core physical principles from observations. A key
+challenge is developing representations that can efficiently learn and
+generalize physical dynamics from minimal data. Here we present Neural Force
+Field (NFF) a modeling framework built on Neural Ordinary Differential Equation
+(NODE) that learns interpretable force field representations which can be
+efficiently integrated through an Ordinary Differential Equation ( ODE) solver
+to predict object trajectories. Unlike existing approaches that rely on
+high-dimensional latent spaces, NFF captures fundamental physical concepts such
+as gravity, support, and collision in an interpretable manner. Experiments on
+two challenging physical reasoning tasks demonstrate that NFF, trained with
+only a few examples, achieves strong generalization to unseen scenarios. This
+physics-grounded representation enables efficient forward-backward planning and
+rapid adaptation through interactive refinement. Our work suggests that
+incorporating physics-inspired representations into learning systems can help
+bridge the gap between artificial and human physical reasoning capabilities.
 
-The prototyping of computer games, particularly card games, requires
-extensive human effort in creative ideation and gameplay evaluation. Recent
-advances in Large Language Models (LLMs) offer opportunities to automate and
-streamline these processes. However, it remains challenging for LLMs to design
-novel game mechanics beyond existing databases, generate consistent gameplay
-environments, and develop scalable gameplay AI for large-scale evaluations.
-This paper addresses these challenges by introducing a comprehensive automated
-card game prototyping framework. The approach highlights a graph-based indexing
-method for generating novel game designs, an LLM-driven system for consistent
-game code generation validated by gameplay records, and a gameplay AI
-constructing method that uses an ensemble of LLM-generated action-value
-functions optimized through self-play. These contributions aim to accelerate
-card game prototyping, reduce human labor, and lower barriers to entry for game
-developers.
+摘要：物理推理是人类非凡的能力，它能从有限的经验中快速学习和概括。尽管经过广泛的训练，但当前的人工智能模型在实现类似的概括方面仍然存在困难，尤其是在分布外 (OOD) 设置中。这种限制源于它们无法从观察中抽象出核心物理原理。一个关键挑战是开发能够从最少数据中有效学习和概括物理动力学的表示。在这里，我们介绍了神经力场 (NFF)，这是一种建立在神经常微分方程 (NODE) 上的建模框架，它学习可解释的力场表示，这些表示可以通过常微分方程 (ODE) 求解器有效地进行积分，以预测物体轨迹。与依赖于高维潜在空间的现有方法不同，NFF 以可解释的方式捕获了诸如重力、支撑和碰撞等基本物理概念。在两个具有挑战性的物理推理任务上的实验表明，仅通过几个示例训练的 NFF 实现了对看不见场景的强大概括。这种基于物理的表示能够进行高效的前向后向规划，并通过交互式细化实现快速适应。我们的工作表明，将受物理启发的表示纳入学习系统可以帮助弥合人工智能和人类物理推理能力之间的差距。
 
-摘要：電腦遊戲，尤其是卡牌遊戲的原型製作，需要大量的人力在創意構思和遊戲玩法評估上。大型語言模型 (LLM) 的最新進展提供了自動化和簡化這些流程的機會。然而，LLM 在設計超越現有資料庫的新穎遊戲機制、生成一致的遊戲環境，以及開發用於大規模評估的可擴充遊戲 AI 方面仍然面臨挑戰。本文通過引入一個全面的自動化卡牌遊戲原型製作框架來應對這些挑戰。該方法強調了一種基於圖表的索引方法，用於生成新穎的遊戲設計，一個由 LLM 驅動的系統，用於一致的遊戲程式碼生成，並由遊戲記錄驗證，以及一個遊戲 AI 構建方法，該方法使用由 LLM 生成的動作值函數的集合，通過自我對弈進行最佳化。這些貢獻旨在加速卡牌遊戲原型製作，減少人力，並降低遊戲開發人員的進入門檻。
+##### **Tuning-Free Personalized Alignment via Trial-Error-Explain In-Context Learning**
+2502.08972v1 by Hyundong Cho, Karishma Sharma, Nicolaas Jedema, Leonardo F. R. Ribeiro, Alessandro Moschitti, Ravi Krishnan, Jonathan May
 
-##### **GraNNite: Enabling High-Performance Execution of Graph Neural Networks on Resource-Constrained Neural Processing Units**
-2502.06921v2 by Arghadip Das, Shamik Kundu, Arnab Raha, Soumendu Ghosh, Deepak Mathaikutty, Vijay Raghunathan
+Language models are aligned to the collective voice of many, resulting in
+generic outputs that do not align with specific users' styles. In this work, we
+present Trial-Error-Explain In-Context Learning (TICL), a tuning-free method
+that personalizes language models for text generation tasks with fewer than 10
+examples per user. TICL iteratively expands an in-context learning prompt via a
+trial-error-explain process, adding model-generated negative samples and
+explanations that provide fine-grained guidance towards a specific user's
+style. TICL achieves favorable win rates on pairwise comparisons with
+LLM-as-a-judge up to 91.5% against the previous state-of-the-art and
+outperforms competitive tuning-free baselines for personalized alignment tasks
+of writing emails, essays and news articles. Both lexical and qualitative
+analyses show that the negative samples and explanations enable language models
+to learn stylistic context more effectively and overcome the bias towards
+structural and formal phrases observed in their zero-shot outputs. By
+front-loading inference compute to create a user-specific in-context learning
+prompt that does not require extra generation steps at test time, TICL presents
+a novel yet simple approach for personalized alignment.
 
-Graph Neural Networks (GNNs) are vital for learning from graph-structured
-data, enabling applications in network analysis, recommendation systems, and
-speech analytics. Deploying them on edge devices like client PCs and laptops
-enhances real-time processing, privacy, and cloud independence. GNNs aid
-Retrieval-Augmented Generation (RAG) for Large Language Models (LLMs) and
-enable event-based vision tasks. However, irregular memory access, sparsity,
-and dynamic structures cause high latency and energy overhead on
-resource-constrained devices. While modern edge processors integrate CPUs,
-GPUs, and NPUs, NPUs designed for data-parallel tasks struggle with irregular
-GNN computations. We introduce GraNNite, the first hardware-aware framework
-optimizing GNN execution on commercial-off-the-shelf (COTS) SOTA DNN
-accelerators via a structured three-step methodology: (1) enabling NPU
-execution, (2) optimizing performance, and (3) trading accuracy for efficiency
-gains. Step 1 employs GraphSplit for workload distribution and StaGr for static
-aggregation, while GrAd and NodePad handle dynamic graphs. Step 2 boosts
-performance using EffOp for control-heavy tasks and GraSp for sparsity
-exploitation. Graph Convolution optimizations PreG, SymG, and CacheG reduce
-redundancy and memory transfers. Step 3 balances quality versus efficiency,
-where QuantGr applies INT8 quantization, and GrAx1, GrAx2, and GrAx3 accelerate
-attention, broadcast-add, and SAGE-max aggregation. On Intel Core Ultra AI PCs,
-GraNNite achieves 2.6X to 7.6X speedups over default NPU mappings and up to
-8.6X energy gains over CPUs and GPUs, delivering 10.8X and 6.7X higher
-performance than CPUs and GPUs, respectively, across GNN models.
+摘要：語言模型與眾人的集體聲音保持一致，導致產出內容流於一般，無法與特定使用者的風格相符。在這項工作中，我們提出了試驗錯誤解釋情境內學習 (TICL)，一種免調校方法，能為文字生成任務個人化語言模型，每個使用者少於 10 個範例。TICL 透過試驗錯誤解釋程序反覆擴充情境內學習提示，加入模型產生的負面範例和說明，提供細緻的指導，引導至特定使用者的風格。TICL 在與 LLM 作為評審的成對比較中獲得了高勝率，高達 91.5%，優於先前的技術水準，並在個人化對齊任務中超越了競爭性的免調校基準，包括撰寫電子郵件、論文和新聞文章。詞彙和質性分析皆顯示，負面範例和說明讓語言模型能更有效地學習風格脈絡，並克服零次學習產出中觀察到的結構性和正式詞組偏誤。透過預先加載推論運算，建立使用者特定的情境內學習提示，無需在測試時額外產生步驟，TICL 呈現一種新穎卻簡潔的方法，用於個人化對齊。
 
-摘要：圖形神經網路 (GNN) 對於從圖形結構資料中學習至關重要，能應用於網路分析、推薦系統和語音分析。將其部署在邊緣裝置（例如用戶端電腦和筆電）上可增強即時處理、隱私和雲端獨立性。GNN 協助大型語言模型 (LLM) 的檢索增強生成 (RAG)，並支援基於事件的視覺任務。然而，不規則的記憶體存取、稀疏性和動態結構會導致資源受限裝置上的高延遲和能源負擔。儘管現代邊緣處理器整合了 CPU、GPU 和 NPU，但針對資料平行任務所設計的 NPU 難以處理不規則的 GNN 計算。我們引入了 GraNNite，這是第一個硬體感知框架，透過結構化的三步驟方法最佳化商用現成 (COTS) SOTA DNN 加速器上的 GNN 執行：(1) 啟用 NPU 執行，(2) 最佳化效能，以及 (3) 以準確度換取效率提升。步驟 1 使用 GraphSplit 進行工作負載分配，並使用 StaGr 進行靜態聚合，而 GrAd 和 NodePad 則處理動態圖形。步驟 2 使用 EffOp 提升控制密集型任務的效能，並使用 GraSp 進行稀疏性利用。圖形卷積最佳化 PreG、SymG 和 CacheG 減少了冗餘和記憶體傳輸。步驟 3 平衡品質與效率，其中 QuantGr 適用 INT8 量化，而 GrAx1、GrAx2 和 GrAx3 則加速注意力、廣播加法和 SAGE-max 聚合。在 Intel Core Ultra AI PC 上，GraNNite 在預設 NPU 映射上實現了 2.6X 到 7.6X 的加速，在 CPU 和 GPU 上實現了高達 8.6X 的能源增益，在 GNN 模型中分別提供了比 CPU 和 GPU 高出 10.8X 和 6.7X 的效能。
+##### **RTBAS: Defending LLM Agents Against Prompt Injection and Privacy Leakage**
+2502.08966v1 by Peter Yong Zhong, Siyuan Chen, Ruiqi Wang, McKenna McCall, Ben L. Titzer, Heather Miller
 
-##### **Automatic Annotation Augmentation Boosts Translation between Molecules and Natural Language**
-2502.06634v1 by Zhiqiang Zhong, Simon Sataa-Yu Larsen, Haoyu Guo, Tao Tang, Kuangyu Zhou, Davide Mottin
+Tool-Based Agent Systems (TBAS) allow Language Models (LMs) to use external
+tools for tasks beyond their standalone capabilities, such as searching
+websites, booking flights, or making financial transactions. However, these
+tools greatly increase the risks of prompt injection attacks, where malicious
+content hijacks the LM agent to leak confidential data or trigger harmful
+actions. Existing defenses (OpenAI GPTs) require user confirmation before every
+tool call, placing onerous burdens on users. We introduce Robust TBAS (RTBAS),
+which automatically detects and executes tool calls that preserve integrity and
+confidentiality, requiring user confirmation only when these safeguards cannot
+be ensured. RTBAS adapts Information Flow Control to the unique challenges
+presented by TBAS. We present two novel dependency screeners, using
+LM-as-a-judge and attention-based saliency, to overcome these challenges.
+Experimental results on the AgentDojo Prompt Injection benchmark show RTBAS
+prevents all targeted attacks with only a 2% loss of task utility when under
+attack, and further tests confirm its ability to obtain near-oracle performance
+on detecting both subtle and direct privacy leaks.
 
-Recent advancements in AI for biological research focus on integrating
-molecular data with natural language to accelerate drug discovery. However, the
-scarcity of high-quality annotations limits progress in this area. This paper
-introduces LA$^3$, a Language-based Automatic Annotation Augmentation framework
-that leverages large language models to augment existing datasets, thereby
-improving AI training. We demonstrate the effectiveness of LA$^3$ by creating
-an enhanced dataset, LaChEBI-20, where we systematically rewrite the
-annotations of molecules from an established dataset. These rewritten
-annotations preserve essential molecular information while providing more
-varied sentence structures and vocabulary. Using LaChEBI-20, we train LaMolT5
-based on a benchmark architecture to learn the mapping between molecular
-representations and augmented annotations.
-  Experimental results on text-based *de novo* molecule generation and molecule
-captioning demonstrate that LaMolT5 outperforms state-of-the-art models.
-Notably, incorporating LA$^3$ leads to improvements of up to 301% over the
-benchmark architecture. Furthermore, we validate the effectiveness of LA$^3$
-notable applications in *image*, *text* and *graph* tasks, affirming its
-versatility and utility.
+摘要：基於工具的代理系統 (TBAS) 允許語言模型 (LM) 使用外部工具來執行超出其獨立功能的任務，例如搜尋網站、預訂航班或進行金融交易。然而，這些工具大幅增加了提示注入攻擊的風險，其中惡意內容劫持 LM 代理程式以洩露機密資料或觸發有害動作。現有的防禦措施 (OpenAI GPT) 在每次呼叫工具之前都需要使用者確認，這會對使用者造成沉重的負擔。我們引入了穩健的 TBAS (RTBAS)，它會自動偵測並執行保留完整性與機密性的工具呼叫，僅在無法確保這些防護措施時才需要使用者確認。RTBAS 將資訊流控制調整為 TBAS 呈現的獨特挑戰。我們提出兩種新穎的相依性篩選器，使用 LM 作為判斷者和基於注意力的顯著性，以克服這些挑戰。AgentDojo 提示注入基準上的實驗結果顯示，RTBAS 在受到攻擊時僅損失 2% 的任務效用，即可防止所有目標攻擊，進一步的測試證實了其在偵測細微和直接的隱私洩漏方面獲得接近神諭效能的能力。
 
-摘要：<paragraph>人工智慧在生物研究上的最新進展，專注於將分子資料與自然語言整合，以加速藥物發現。然而，高品質註解的稀少限制了此領域的進展。這篇論文介紹了 LA$^3$，一個基於語言的自動註解擴充框架，它利用大型語言模型來擴充現有的資料集，進而改善人工智慧訓練。我們透過建立一個增強的資料集 LaChEBI-20 來展示 LA$^3$ 的有效性，我們系統性地改寫了一個既定資料集中分子的註解。這些改寫的註解保留了重要的分子資訊，同時提供了更多樣化的句子結構和詞彙。使用 LaChEBI-20，我們在基於基準架構上訓練 LaMolT5，以學習分子表示和擴充註解之間的對應。
-在基於文字的 *從頭開始* 分子生成和分子標題上的實驗結果表明，LaMolT5 優於最先進的模型。值得注意的是，納入 LA$^3$ 可讓基準架構的改進幅度高達 301%。此外，我們驗證了 LA$^3$ 在 *影像*、*文字* 和 *圖形* 任務中的有效性，肯定了它的多功能性和實用性。</paragraph>
+##### **Biologically Plausible Brain Graph Transformer**
+2502.08958v1 by Ciyuan Peng, Yuelong Huang, Qichao Dong, Shuo Yu, Feng Xia, Chengqi Zhang, Yaochu Jin
 
-##### **KARMA: Leveraging Multi-Agent LLMs for Automated Knowledge Graph Enrichment**
-2502.06472v1 by Yuxing Lu, Jinzhuo Wang
+State-of-the-art brain graph analysis methods fail to fully encode the
+small-world architecture of brain graphs (accompanied by the presence of hubs
+and functional modules), and therefore lack biological plausibility to some
+extent. This limitation hinders their ability to accurately represent the
+brain's structural and functional properties, thereby restricting the
+effectiveness of machine learning models in tasks such as brain disorder
+detection. In this work, we propose a novel Biologically Plausible Brain Graph
+Transformer (BioBGT) that encodes the small-world architecture inherent in
+brain graphs. Specifically, we present a network entanglement-based node
+importance encoding technique that captures the structural importance of nodes
+in global information propagation during brain graph communication,
+highlighting the biological properties of the brain structure. Furthermore, we
+introduce a functional module-aware self-attention to preserve the functional
+segregation and integration characteristics of brain graphs in the learned
+representations. Experimental results on three benchmark datasets demonstrate
+that BioBGT outperforms state-of-the-art models, enhancing biologically
+plausible brain graph representations for various brain graph analytical tasks
 
-Maintaining comprehensive and up-to-date knowledge graphs (KGs) is critical
-for modern AI systems, but manual curation struggles to scale with the rapid
-growth of scientific literature. This paper presents KARMA, a novel framework
-employing multi-agent large language models (LLMs) to automate KG enrichment
-through structured analysis of unstructured text. Our approach employs nine
-collaborative agents, spanning entity discovery, relation extraction, schema
-alignment, and conflict resolution that iteratively parse documents, verify
-extracted knowledge, and integrate it into existing graph structures while
-adhering to domain-specific schema. Experiments on 1,200 PubMed articles from
-three different domains demonstrate the effectiveness of KARMA in knowledge
-graph enrichment, with the identification of up to 38,230 new entities while
-achieving 83.1\% LLM-verified correctness and reducing conflict edges by 18.6\%
-through multi-layer assessments.
+摘要：目前最先进的大腦圖形分析方法無法完全編碼大腦圖形的小世界架構（伴隨著樞紐和功能模組的存在），因此在某種程度上缺乏生物學上的可信度。這種限制阻礙了它們準確表示大腦結構和功能特性的能力，從而限制了機器學習模型在腦部疾病檢測等任務中的有效性。在這項工作中，我們提出了一個新的生物學上可信的大腦圖形轉換器 (BioBGT)，它編碼了大腦圖形中固有的、小世界的架構。具體來說，我們提出了一種基於網路糾纏的節點重要性編碼技術，它捕捉了大腦圖形通信過程中節點在全球資訊傳播中的結構重要性，突出了大腦結構的生物學特性。此外，我們引入了一個功能模組感知自注意力，以保留學習表徵中大腦圖形的功能分離和整合特性。在三個基準資料集上的實驗結果表明，BioBGT 優於最先進的模型，增強了各種大腦圖形分析任務的生物學上可信的大腦圖形表徵
 
-摘要：維護全面且最新的知識圖譜 (KG) 對現代 AI 系統至關重要，但手動策劃難以隨著科學文獻的快速增長而擴展。本文提出了 KARMA，一個採用多代理大型語言模型 (LLM) 的新框架，透過對非結構化文本的結構化分析來自動化 KG 豐富化。我們的做法採用九個協作代理，涵蓋實體發現、關係提取、架構比對和衝突解決，這些代理會反覆分析文件、驗證提取的知識，並將其整合到現有的圖結構中，同時遵守特定領域的架構。針對來自三個不同領域的 1,200 篇 PubMed 文章進行的實驗證明了 KARMA 在知識圖譜豐富化方面的有效性，識別出多達 38,230 個新實體，同時達到 83.1% 的 LLM 驗證正確性，並透過多層評估將衝突邊緣降低了 18.6%。
+##### **Medicine on the Edge: Comparative Performance Analysis of On-Device LLMs for Clinical Reasoning**
+2502.08954v1 by Leon Nissen, Philipp Zagar, Vishnu Ravi, Aydin Zahedivash, Lara Marie Reimer, Stephan Jonas, Oliver Aalami, Paul Schmiedmayer
 
-##### **RoToR: Towards More Reliable Responses for Order-Invariant Inputs**
-2502.08662v1 by Soyoung Yoon, Dongha Ahn, Youngwon Lee, Minkyu Jung, HyungJoo Jang, Seung-won Hwang
+The deployment of Large Language Models (LLM) on mobile devices offers
+significant potential for medical applications, enhancing privacy, security,
+and cost-efficiency by eliminating reliance on cloud-based services and keeping
+sensitive health data local. However, the performance and accuracy of on-device
+LLMs in real-world medical contexts remain underexplored. In this study, we
+benchmark publicly available on-device LLMs using the AMEGA dataset, evaluating
+accuracy, computational efficiency, and thermal limitation across various
+mobile devices. Our results indicate that compact general-purpose models like
+Phi-3 Mini achieve a strong balance between speed and accuracy, while medically
+fine-tuned models such as Med42 and Aloe attain the highest accuracy. Notably,
+deploying LLMs on older devices remains feasible, with memory constraints
+posing a greater challenge than raw processing power. Our study underscores the
+potential of on-device LLMs for healthcare while emphasizing the need for more
+efficient inference and models tailored to real-world clinical reasoning.
 
-Mitigating positional bias of language models (LMs) for listwise inputs is a
-well-known and important problem (e.g., lost-in-the-middle). While zero-shot
-order-invariant LMs have been proposed to solve this issue, their success on
-practical listwise problems has been limited. In this work, as a first
-contribution, we identify and overcome two limitations to make zero-shot
-invariant LMs more practical: (1) training and inference distribution mismatch
-arising from modifying positional ID assignments to enforce invariance, and (2)
-failure to adapt to a mixture of order-invariant and sensitive inputs in
-practical listwise problems. To overcome, we propose (1) RoToR, a zero-shot
-invariant LM for genuinely order-invariant inputs with minimal modifications of
-positional IDs, and (2) Selective Routing, an adaptive framework that handles
-both order-invariant and order-sensitive inputs in listwise tasks. On the Lost
-in the middle (LitM), Knowledge Graph Question Answering (KGQA), and MMLU
-benchmarks, we show that RoToR with Selective Routing can effectively handle
-practical listwise input tasks in a zero-shot manner.
+摘要：大型語言模型 (LLM) 在行動裝置上的部署為醫療應用程式提供了巨大的潛力，透過消除對雲端服務的依賴並將敏感的健康資料儲存在本地，進而提升隱私、安全性，並提高成本效益。然而，在實際的醫療環境中，裝置上 LLM 的效能和準確度仍未受到充分的探討。在此研究中，我們使用 AMEGA 資料集來評量公開可用的裝置上 LLM，並評估其在各種行動裝置上的準確度、運算效率和熱限制。我們的結果顯示，像 Phi-3 Mini 等精簡的一般用途模型在速度和準確度之間取得了良好的平衡，而經過醫學微調的模型，例如 Med42 和 Aloe，則達到了最高的準確度。值得注意的是，在較舊的裝置上部署 LLM 仍然可行，記憶體限制比原始處理能力構成更大的挑戰。我們的研究強調了裝置上 LLM 在醫療保健方面的潛力，同時強調了對更有效率的推理和針對實際臨床推理量身打造的模型的需求。
 
-摘要：語言模型 (LM) 的位置偏差緩解對於列表輸入來說是一個廣為人知且重要的問題（例如，迷失在中間）。雖然已經提出零次學習順序不變的 LM 來解決這個問題，但它們在實際列表問題上的成功卻很有限。在這項工作中，作為第一個貢獻，我們找出並克服了兩個限制，讓零次學習不變的 LM 更有實用性：(1) 訓練和推論分布不匹配，這是由於修改位置 ID 分配以強制不變性所造成的，以及 (2) 無法適應實際列表問題中不變和敏感輸入的組合。為了克服這些問題，我們提出 (1) RoToR，一個零次學習不變的 LM，用於真正不變的輸入，並對位置 ID 進行最小的修改，以及 (2) 選擇性路由，一個自適應框架，用於處理列表任務中不變和敏感的輸入。在迷失在中間 (LitM)、知識圖譜問答 (KGQA) 和 MMLU 基準測試中，我們展示了 RoToR 與選擇性路由可以有效地以零次學習的方式處理實際的列表輸入任務。
 
-##### **K-ON: Stacking Knowledge On the Head Layer of Large Language Model**
-2502.06257v1 by Lingbing Guo, Yichi Zhang, Zhongpu Bo, Zhuo Chen, Mengshu Sun, Zhiqiang Zhang, Wen Zhang, Huajun Chen
+### Medical explainable AI
+|Publish Date|Title|Authors|Homepage|Code|
+| :---: | :---: | :---: | :---: | :---: |
+|**2025-02-13**|**Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**|Flavio Bertini et.al.|[2502.09218v1](http://arxiv.org/abs/2502.09218v1)|null|
+|**2025-02-10**|**Foundation Model of Electronic Medical Records for Adaptive Risk Estimation**|Pawel Renc et.al.|[2502.06124v1](http://arxiv.org/abs/2502.06124v1)|null|
+|**2025-01-27**|**An Explainable Disease Surveillance System for Early Prediction of Multiple Chronic Diseases**|Shaheer Ahmad Khan et.al.|[2501.15969v1](http://arxiv.org/abs/2501.15969v1)|null|
+|**2025-01-23**|**Ensuring Medical AI Safety: Explainable AI-Driven Detection and Mitigation of Spurious Model Behavior and Associated Data**|Frederik Pahde et.al.|[2501.13818v1](http://arxiv.org/abs/2501.13818v1)|[link](https://github.com/frederikpahde/medical-ai-safety)|
+|**2025-01-19**|**Enhanced Suicidal Ideation Detection from Social Media Using a CNN-BiLSTM Hybrid Model**|Mohaiminul Islam Bhuiyan et.al.|[2501.11094v1](http://arxiv.org/abs/2501.11094v1)|null|
+|**2025-01-17**|**SEANN: A Domain-Informed Neural Network for Epidemiological Insights**|Jean-Baptiste Guimbaud et.al.|[2501.10273v1](http://arxiv.org/abs/2501.10273v1)|null|
+|**2025-01-16**|**Artificial Intelligence-Driven Clinical Decision Support Systems**|Muhammet Alkan et.al.|[2501.09628v1](http://arxiv.org/abs/2501.09628v1)|null|
+|**2025-01-12**|**MedGrad E-CLIP: Enhancing Trust and Transparency in AI-Driven Skin Lesion Diagnosis**|Sadia Kamal et.al.|[2501.06887v1](http://arxiv.org/abs/2501.06887v1)|null|
+|**2025-01-06**|**Explaining Humour Style Classifications: An XAI Approach to Understanding Computational Humour Analysis**|Mary Ogbuka Kenneth et.al.|[2501.02891v1](http://arxiv.org/abs/2501.02891v1)|null|
+|**2024-12-28**|**The Emotional Spectrum of LLMs: Leveraging Empathy and Emotion-Based Markers for Mental Health Support**|Alessandro De Grandi et.al.|[2412.20068v1](http://arxiv.org/abs/2412.20068v1)|null|
+|**2024-12-27**|**A Review on the Integration of Artificial Intelligence and Medical Imaging in IVF Ovarian Stimulation**|Jana Zakall et.al.|[2412.19688v1](http://arxiv.org/abs/2412.19688v1)|null|
+|**2024-12-23**|**Enhancing Cancer Diagnosis with Explainable & Trustworthy Deep Learning Models**|Badaru I. Olumuyiwa et.al.|[2412.17527v1](http://arxiv.org/abs/2412.17527v1)|null|
+|**2024-12-20**|**Towards Interpretable Radiology Report Generation via Concept Bottlenecks using a Multi-Agentic RAG**|Hasan Md Tusfiqur Alam et.al.|[2412.16086v2](http://arxiv.org/abs/2412.16086v2)|[link](https://github.com/tifat58/irr-with-cbm-rag)|
+|**2024-12-20**|**Critique of Impure Reason: Unveiling the reasoning behaviour of medical Large Language Models**|Shamus Sim et.al.|[2412.15748v1](http://arxiv.org/abs/2412.15748v1)|null|
+|**2024-12-18**|**Cognition Chain for Explainable Psychological Stress Detection on Social Media**|Xin Wang et.al.|[2412.14009v1](http://arxiv.org/abs/2412.14009v1)|null|
+|**2024-11-30**|**2-Factor Retrieval for Improved Human-AI Decision Making in Radiology**|Jim Solomon et.al.|[2412.00372v1](http://arxiv.org/abs/2412.00372v1)|null|
+|**2024-11-28**|**Mapping Public Perception of Artificial Intelligence: Expectations, Risk-Benefit Tradeoffs, and Value As Determinants for Societal Acceptance**|Philipp Brauner et.al.|[2411.19356v1](http://arxiv.org/abs/2411.19356v1)|null|
+|**2024-11-26**|**Explainable AI for Classifying UTI Risk Groups Using a Real-World Linked EHR and Pathology Lab Dataset**|Yujie Dai et.al.|[2411.17645v2](http://arxiv.org/abs/2411.17645v2)|null|
+|**2024-11-18**|**Exploring the Requirements of Clinicians for Explainable AI Decision Support Systems in Intensive Care**|Jeffrey N. Clark et.al.|[2411.11774v1](http://arxiv.org/abs/2411.11774v1)|null|
+|**2024-11-15**|**Artificial Intelligence in Pediatric Echocardiography: Exploring Challenges, Opportunities, and Clinical Applications with Explainable AI and Federated Learning**|Mohammed Yaseen Jabarulla et.al.|[2411.10255v1](http://arxiv.org/abs/2411.10255v1)|null|
+|**2024-11-01**|**Enhancing Osteoporosis Detection: An Explainable Multi-Modal Learning Framework with Feature Fusion and Variable Clustering**|Mehdi Hosseini Chagahi et.al.|[2411.00916v2](http://arxiv.org/abs/2411.00916v2)|null|
+|**2024-10-25**|**A Review of Deep Learning Approaches for Non-Invasive Cognitive Impairment Detection**|Muath Alsuhaibani et.al.|[2410.19898v1](http://arxiv.org/abs/2410.19898v1)|null|
+|**2024-10-23**|**An Ontology-Enabled Approach For User-Centered and Knowledge-Enabled Explanations of AI Systems**|Shruthi Chari et.al.|[2410.17504v1](http://arxiv.org/abs/2410.17504v1)|[link](https://github.com/tetherless-world/metaexplainer)|
+|**2024-10-22**|**Contrasting Attitudes Towards Current and Future AI Applications for Computerised Interpretation of ECG: A Clinical Stakeholder Interview Study**|Lukas Hughes-Noehrer et.al.|[2410.16879v1](http://arxiv.org/abs/2410.16879v1)|null|
+|**2024-10-19**|**Pathologist-like explainable AI for interpretable Gleason grading in prostate cancer**|Gesa Mittmann et.al.|[2410.15012v1](http://arxiv.org/abs/2410.15012v1)|null|
+|**2024-10-15**|**Explainable AI Methods for Multi-Omics Analysis: A Survey**|Ahmad Hussein et.al.|[2410.11910v1](http://arxiv.org/abs/2410.11910v1)|null|
+|**2024-10-14**|**Study on the Helpfulness of Explainable Artificial Intelligence**|Tobias Labarta et.al.|[2410.11896v1](http://arxiv.org/abs/2410.11896v1)|[link](https://github.com/tlabarta/helpfulnessofxai)|
+|**2024-10-12**|**Use of What-if Scenarios to Help Explain Artificial Intelligence Models for Neonatal Health**|Abdullah Mamun et.al.|[2410.09635v1](http://arxiv.org/abs/2410.09635v1)|[link](https://github.com/ab9mamun/aimen)|
+|**2024-10-10**|**Artificial intelligence techniques in inherited retinal diseases: A review**|Han Trinh et.al.|[2410.09105v1](http://arxiv.org/abs/2410.09105v1)|null|
+|**2024-10-07**|**CasiMedicos-Arg: A Medical Question Answering Dataset Annotated with Explanatory Argumentative Structures**|Ekaterina Sviridova et.al.|[2410.05235v2](http://arxiv.org/abs/2410.05235v2)|[link](https://github.com/ixa-ehu/antidote-casimedicos)|
+|**2024-10-01**|**Explainable Diagnosis Prediction through Neuro-Symbolic Integration**|Qiuhao Lu et.al.|[2410.01855v2](http://arxiv.org/abs/2410.01855v2)|null|
+|**2024-10-01**|**Easydiagnos: a framework for accurate feature selection for automatic diagnosis in smart healthcare**|Prasenjit Maji et.al.|[2410.00366v1](http://arxiv.org/abs/2410.00366v1)|null|
+|**2024-09-20**|**Dermatologist-like explainable AI enhances melanoma diagnosis accuracy: eye-tracking study**|Tirtha Chanda et.al.|[2409.13476v1](http://arxiv.org/abs/2409.13476v1)|null|
+|**2024-09-19**|**Explainable AI for Autism Diagnosis: Identifying Critical Brain Regions Using fMRI Data**|Suryansh Vidya et.al.|[2409.15374v1](http://arxiv.org/abs/2409.15374v1)|null|
+|**2024-09-19**|**Improving Prototypical Parts Abstraction for Case-Based Reasoning Explanations Designed for the Kidney Stone Type Recognition**|Daniel Flores-Araiza et.al.|[2409.12883v1](http://arxiv.org/abs/2409.12883v1)|null|
+|**2024-09-18**|**Towards Interpretable End-Stage Renal Disease (ESRD) Prediction: Utilizing Administrative Claims Data with Explainable AI Techniques**|Yubo Li et.al.|[2409.12087v3](http://arxiv.org/abs/2409.12087v3)|null|
+|**2024-09-13**|**Contextual Evaluation of Large Language Models for Classifying Tropical and Infectious Diseases**|Mercy Asiedu et.al.|[2409.09201v3](http://arxiv.org/abs/2409.09201v3)|null|
+|**2024-09-09**|**Explainable AI: Definition and attributes of a good explanation for health AI**|Evangelia Kyrimi et.al.|[2409.15338v1](http://arxiv.org/abs/2409.15338v1)|null|
+|**2024-08-30**|**Exploring the Effect of Explanation Content and Format on User Comprehension and Trust**|Antonio Rago et.al.|[2408.17401v1](http://arxiv.org/abs/2408.17401v1)|null|
+|**2024-08-29**|**A Survey for Large Language Models in Biomedicine**|Chong Wang et.al.|[2409.00133v1](http://arxiv.org/abs/2409.00133v1)|null|
+|**2024-08-27**|**Aligning XAI with EU Regulations for Smart Biomedical Devices: A Methodology for Compliance Analysis**|Francesco Sovrano et.al.|[2408.15121v1](http://arxiv.org/abs/2408.15121v1)|null|
+|**2024-08-24**|**Towards Case-based Interpretability for Medical Federated Learning**|Laura Latorre et.al.|[2408.13626v1](http://arxiv.org/abs/2408.13626v1)|null|
+|**2024-08-22**|**AI in radiological imaging of soft-tissue and bone tumours: a systematic review evaluating against CLAIM and FUTURE-AI guidelines**|Douwe J. Spaanderman et.al.|[2408.12491v1](http://arxiv.org/abs/2408.12491v1)|null|
+|**2024-08-14**|**Evaluating Explainable AI Methods in Deep Learning Models for Early Detection of Cerebral Palsy**|Kimji N. Pellano et.al.|[2409.00001v1](http://arxiv.org/abs/2409.00001v1)|null|
+|**2024-08-06**|**MicroXercise: A Micro-Level Comparative and Explainable System for Remote Physical Therapy**|Hanchen David Wang et.al.|[2408.11837v1](http://arxiv.org/abs/2408.11837v1)|null|
+|**2024-08-05**|**The Literature Review Network: An Explainable Artificial Intelligence for Systematic Literature Reviews, Meta-analyses, and Method Development**|Joshua Morriss et.al.|[2408.05239v1](http://arxiv.org/abs/2408.05239v1)|null|
+|**2024-08-05**|**Enhancing Medical Learning and Reasoning Systems: A Boxology-Based Comparative Analysis of Design Patterns**|Chi Him Ng et.al.|[2408.02709v1](http://arxiv.org/abs/2408.02709v1)|null|
+|**2024-08-05**|**Bayesian Kolmogorov Arnold Networks (Bayesian_KANs): A Probabilistic Approach to Enhance Accuracy and Interpretability**|Masoud Muhammed Hassan et.al.|[2408.02706v1](http://arxiv.org/abs/2408.02706v1)|null|
+|**2024-07-26**|**MLtoGAI: Semantic Web based with Machine Learning for Enhanced Disease Prediction and Personalized Recommendations using Generative AI**|Shyam Dongre et.al.|[2407.20284v1](http://arxiv.org/abs/2407.20284v1)|null|
+|**2024-07-25**|**Introducing δ-XAI: a novel sensitivity-based method for local AI explanations**|Alessandro De Carlo et.al.|[2407.18343v2](http://arxiv.org/abs/2407.18343v2)|null|
+|**2024-07-24**|**Enhanced Deep Learning Methodologies and MRI Selection Techniques for Dementia Diagnosis in the Elderly Population**|Nikolaos Ntampakis et.al.|[2407.17324v2](http://arxiv.org/abs/2407.17324v2)|null|
+|**2024-07-24**|**Using Large Language Models to Compare Explainable Models for Smart Home Human Activity Recognition**|Michele Fiori et.al.|[2408.06352v1](http://arxiv.org/abs/2408.06352v1)|null|
+|**2024-07-21**|**Explainable AI-based Intrusion Detection System for Industry 5.0: An Overview of the Literature, associated Challenges, the existing Solutions, and Potential Research Directions**|Naseem Khan et.al.|[2408.03335v1](http://arxiv.org/abs/2408.03335v1)|null|
+|**2024-07-18**|**A Comparative Study on Automatic Coding of Medical Letters with Explainability**|Jamie Glen et.al.|[2407.13638v1](http://arxiv.org/abs/2407.13638v1)|[link](https://github.com/Glenj01/Medical-Coding)|
+|**2024-07-09**|**Explainable AI for Enhancing Efficiency of DL-based Channel Estimation**|Abdul Karim Gizzini et.al.|[2407.07009v1](http://arxiv.org/abs/2407.07009v1)|null|
+|**2024-07-07**|**Explainable AI: Comparative Analysis of Normal and Dilated ResNet Models for Fundus Disease Classification**|P. N. Karthikayan et.al.|[2407.05440v2](http://arxiv.org/abs/2407.05440v2)|null|
+|**2024-07-03**|**A Survey on Trustworthiness in Foundation Models for Medical Image Analysis**|Congzhen Shi et.al.|[2407.15851v2](http://arxiv.org/abs/2407.15851v2)|null|
+|**2024-07-01**|**The Impact of an XAI-Augmented Approach on Binary Classification with Scarce Data**|Ximing Wen et.al.|[2407.06206v1](http://arxiv.org/abs/2407.06206v1)|null|
+|**2024-06-28**|**Can GPT-4 Help Detect Quit Vaping Intentions? An Exploration of Automatic Data Annotation Approach**|Sai Krishna Revanth Vuruma et.al.|[2407.00167v1](http://arxiv.org/abs/2407.00167v1)|null|
+|**2024-06-25**|**Towards Compositional Interpretability for XAI**|Sean Tull et.al.|[2406.17583v1](http://arxiv.org/abs/2406.17583v1)|null|
+|**2024-06-17**|**Slicing Through Bias: Explaining Performance Gaps in Medical Image Analysis using Slice Discovery Methods**|Vincent Olesen et.al.|[2406.12142v2](http://arxiv.org/abs/2406.12142v2)|[link](https://github.com/volesen/slicing-through-bias)|
+|**2024-06-11**|**Unlocking the Potential of Metaverse in Innovative and Immersive Digital Health**|Fatemeh Ebrahimzadeh et.al.|[2406.07114v2](http://arxiv.org/abs/2406.07114v2)|null|
+|**2024-06-10**|**AI-Driven Predictive Analytics Approach for Early Prognosis of Chronic Kidney Disease Using Ensemble Learning and Explainable AI**|K M Tawsik Jawad et.al.|[2406.06728v2](http://arxiv.org/abs/2406.06728v2)|null|
+|**2024-06-10**|**Explainable AI for Mental Disorder Detection via Social Media: A survey and outlook**|Yusif Ibrahimov et.al.|[2406.05984v1](http://arxiv.org/abs/2406.05984v1)|null|
+|**2024-06-09**|**Methodology and Real-World Applications of Dynamic Uncertain Causality Graph for Clinical Diagnosis with Explainability and Invariance**|Zhan Zhang et.al.|[2406.05746v1](http://arxiv.org/abs/2406.05746v1)|null|
+|**2024-06-07**|**Advancing Histopathology-Based Breast Cancer Diagnosis: Insights into Multi-Modality and Explainability**|Faseela Abdullakutty et.al.|[2406.12897v1](http://arxiv.org/abs/2406.12897v1)|null|
+|**2024-06-04**|**Using Explainable AI for EEG-based Reduced Montage Neonatal Seizure Detection**|Dinuka Sandun Udayantha et.al.|[2406.16908v3](http://arxiv.org/abs/2406.16908v3)|[link](https://github.com/dinuka-1999/braineocare)|
+|**2024-06-01**|**Breast Cancer Diagnosis: A Comprehensive Exploration of Explainable Artificial Intelligence (XAI) Techniques**|Samita Bai et.al.|[2406.00532v1](http://arxiv.org/abs/2406.00532v1)|null|
+|**2024-06-01**|**Unveiling Hidden Factors: Explainable AI for Feature Boosting in Speech Emotion Recognition**|Alaa Nfissi et.al.|[2406.01624v2](http://arxiv.org/abs/2406.01624v2)|[link](https://github.com/alaanfissi/unveiling-hidden-factors-explainable-ai-for-feature-boosting-in-speech-emotion-recognition)|
+|**2024-05-31**|**The Explanation Necessity for Healthcare AI**|Michail Mamalakis et.al.|[2406.00216v1](http://arxiv.org/abs/2406.00216v1)|null|
+|**2024-05-29**|**Interdisciplinary Expertise to Advance Equitable Explainable AI**|Chloe R. Bennett et.al.|[2406.18563v1](http://arxiv.org/abs/2406.18563v1)|null|
+|**2024-05-27**|**"It depends": Configuring AI to Improve Clinical Usefulness Across Contexts**|Hubert D. Zając et.al.|[2407.11978v1](http://arxiv.org/abs/2407.11978v1)|null|
+|**2024-05-26**|**Improving Health Professionals' Onboarding with AI and XAI for Trustworthy Human-AI Collaborative Decision Making**|Min Hun Lee et.al.|[2405.16424v1](http://arxiv.org/abs/2405.16424v1)|null|
+|**2024-05-26**|**Exploring Nutritional Impact on Alzheimer's Mortality: An Explainable AI Approach**|Ziming Liu et.al.|[2405.17502v1](http://arxiv.org/abs/2405.17502v1)|null|
+|**2024-05-24**|**Explainable AI Enhances Glaucoma Referrals, Yet the Human-AI Team Still Falls Short of the AI Alone**|Catalina Gomez et.al.|[2407.11974v1](http://arxiv.org/abs/2407.11974v1)|null|
+|**2024-05-23**|**Decoding Decision Reasoning: A Counterfactual-Powered Model for Knowledge Discovery**|Yingying Fang et.al.|[2406.18552v1](http://arxiv.org/abs/2406.18552v1)|null|
+|**2024-05-21**|**The Role of Emotions in Informational Support Question-Response Pairs in Online Health Communities: A Multimodal Deep Learning Approach**|Mohsen Jozani et.al.|[2405.13099v1](http://arxiv.org/abs/2405.13099v1)|null|
+|**2024-05-17**|**ChatGPT in Classrooms: Transforming Challenges into Opportunities in Education**|Harris Bin Munawar et.al.|[2405.10645v1](http://arxiv.org/abs/2405.10645v1)|null|
+|**2024-05-13**|**Evaluating the Explainable AI Method Grad-CAM for Breath Classification on Newborn Time Series Data**|Camelia Oprea et.al.|[2405.07590v1](http://arxiv.org/abs/2405.07590v1)|null|
+|**2024-05-10**|**XAI4LLM. Let Machine Learning Models and LLMs Collaborate for Enhanced In-Context Learning in Healthcare**|Fatemeh Nazary et.al.|[2405.06270v3](http://arxiv.org/abs/2405.06270v3)|null|
+|**2024-05-09**|**To Trust or Not to Trust: Towards a novel approach to measure trust for XAI systems**|Miquel Miró-Nicolau et.al.|[2405.05766v1](http://arxiv.org/abs/2405.05766v1)|null|
+|**2024-05-05**|**Region-specific Risk Quantification for Interpretable Prognosis of COVID-19**|Zhusi Zhong et.al.|[2405.02815v1](http://arxiv.org/abs/2405.02815v1)|[link](https://github.com/zzs95/RSP_COVID)|
+|**2024-04-26**|**Rad4XCNN: a new agnostic method for post-hoc global explanation of CNN-derived features by means of radiomics**|Francesco Prinzi et.al.|[2405.02334v2](http://arxiv.org/abs/2405.02334v2)|null|
+|**2024-04-25**|**Attributing Responsibility in AI-Induced Incidents: A Computational Reflective Equilibrium Framework for Accountability**|Yunfei Ge et.al.|[2404.16957v1](http://arxiv.org/abs/2404.16957v1)|null|
+|**2024-04-19**|**Explainable AI for Fair Sepsis Mortality Predictive Model**|Chia-Hsuan Chang et.al.|[2404.13139v1](http://arxiv.org/abs/2404.13139v1)|null|
+|**2024-04-19**|**Multi Class Depression Detection Through Tweets using Artificial Intelligence**|Muhammad Osama Nusrat et.al.|[2404.13104v1](http://arxiv.org/abs/2404.13104v1)|[link](https://github.com/mnusrat786/masters-thesis)|
+|**2024-04-19**|**COIN: Counterfactual inpainting for weakly supervised semantic segmentation for medical images**|Dmytro Shvetsov et.al.|[2404.12832v2](http://arxiv.org/abs/2404.12832v2)|[link](https://github.com/dmytro-shvetsov/counterfactual-search)|
+|**2024-04-15**|**Hybrid Intelligence for Digital Humanities**|Victor de Boer et.al.|[2406.15374v1](http://arxiv.org/abs/2406.15374v1)|null|
+|**2024-04-14**|**Ethical Framework for Responsible Foundational Models in Medical Imaging**|Abhijit Das et.al.|[2406.11868v1](http://arxiv.org/abs/2406.11868v1)|null|
+|**2024-04-09**|**Advancements in Radiomics and Artificial Intelligence for Thyroid Cancer Diagnosis**|Milad Yousefi et.al.|[2404.07239v1](http://arxiv.org/abs/2404.07239v1)|null|
+|**2024-04-06**|**Predictive Modeling for Breast Cancer Classification in the Context of Bangladeshi Patients: A Supervised Machine Learning Approach with Explainable AI**|Taminul Islam et.al.|[2404.04686v1](http://arxiv.org/abs/2404.04686v1)|null|
+|**2024-04-05**|**Enhancing Breast Cancer Diagnosis in Mammography: Evaluation and Integration of Convolutional Neural Networks and Explainable AI**|Maryam Ahmed et.al.|[2404.03892v3](http://arxiv.org/abs/2404.03892v3)|null|
+|**2024-03-30**|**Advancing Multimodal Data Fusion in Pain Recognition: A Strategy Leveraging Statistical Correlation and Human-Centered Perspectives**|Xingrui Gu et.al.|[2404.00320v2](http://arxiv.org/abs/2404.00320v2)|null|
+|**2024-03-26**|**Addressing Social Misattributions of Large Language Models: An HCXAI-based Approach**|Andrea Ferrario et.al.|[2403.17873v1](http://arxiv.org/abs/2403.17873v1)|null|
+|**2024-03-26**|**Clinical Domain Knowledge-Derived Template Improves Post Hoc AI Explanations in Pneumothorax Classification**|Han Yuan et.al.|[2403.18871v1](http://arxiv.org/abs/2403.18871v1)|[link](https://github.com/han-yuan-med/template-explanation)|
+|**2024-03-03**|**Enhancing Neural Machine Translation of Low-Resource Languages: Corpus Development, Human Evaluation and Explainable AI Architectures**|Séamus Lankford et.al.|[2403.01580v1](http://arxiv.org/abs/2403.01580v1)|null|
+|**2024-02-28**|**Cause and Effect: Can Large Language Models Truly Understand Causality?**|Swagata Ashwani et.al.|[2402.18139v3](http://arxiv.org/abs/2402.18139v3)|null|
+|**2024-02-28**|**Artificial Intelligence and Diabetes Mellitus: An Inside Look Through the Retina**|Yasin Sadeghi Bazargani et.al.|[2402.18600v1](http://arxiv.org/abs/2402.18600v1)|null|
+|**2024-02-22**|**Multi-stakeholder Perspective on Responsible Artificial Intelligence and Acceptability in Education**|A. J. Karran et.al.|[2402.15027v2](http://arxiv.org/abs/2402.15027v2)|null|
+|**2024-02-12**|**Deciphering Heartbeat Signatures: A Vision Transformer Approach to Explainable Atrial Fibrillation Detection from ECG Signals**|Aruna Mohan et.al.|[2402.09474v2](http://arxiv.org/abs/2402.09474v2)|null|
+
+#### Abstracts
+##### **Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**
+2502.09218v1 by Flavio Bertini, Alessandro Dal Palù, Federica Zaglio, Francesco Fabiano, Andrea Formisano
 
-Recent advancements in large language models (LLMs) have significantly
-improved various natural language processing (NLP) tasks. Typically, LLMs are
-trained to predict the next token, aligning well with many NLP tasks. However,
-in knowledge graph (KG) scenarios, entities are the fundamental units and
-identifying an entity requires at least several tokens. This leads to a
-granularity mismatch between KGs and natural languages. To address this issue,
-we propose K-ON, which integrates KG knowledge into the LLM by employing
-multiple head layers for next k-step prediction. K-ON can not only generate
-entity-level results in one step, but also enables contrastive loss against
-entities, which is the most powerful tool in KG representation learning.
-Experimental results show that K-ON outperforms state-of-the-art methods that
-incorporate text and even the other modalities.
+This paper presents a complete explainable system that interprets a set of
+data, abstracts the underlying features and describes them in a natural
+language of choice. The system relies on two crucial stages: (i) identifying
+emerging properties from data and transforming them into abstract concepts, and
+(ii) converting these concepts into natural language. Despite the impressive
+natural language generation capabilities demonstrated by Large Language Models,
+their statistical nature and the intricacy of their internal mechanism still
+force us to employ these techniques as black boxes, forgoing trustworthiness.
+Developing an explainable pipeline for data interpretation would allow
+facilitating its use in safety-critical environments like processing medical
+information and allowing non-experts and visually impaired people to access
+narrated information. To this end, we believe that the fields of knowledge
+representation and automated reasoning research could present a valid
+alternative. Expanding on prior research that tackled the first stage (i), we
+focus on the second stage, named Concept2Text. Being explainable, data
+translation is easily modeled through logic-based rules, once again emphasizing
+the role of declarative programming in achieving AI explainability. This paper
+explores a Prolog/CLP-based rewriting system to interpret concepts-articulated
+in terms of classes and relations, plus common knowledge-derived from a generic
+ontology, generating natural language text. Its main features include
+hierarchical tree rewritings, modular multilingual generation, support for
+equivalent variants across semantic, grammar, and lexical levels, and a
+transparent rule-based system. We outline the architecture and demonstrate its
+flexibility through some examples capable of generating numerous diverse and
+equivalent rewritings based on the input concept.
 
-摘要：大型語言模型 (LLM) 的最新進展顯著提升了各種自然語言處理 (NLP) 任務。通常，LLM 會接受訓練以預測下一個符號，這與許多 NLP 任務非常吻合。然而，在知識圖譜 (KG) 場景中，實體是基本單位，而識別實體至少需要幾個符號。這導致 KG 和自然語言之間的粒度不匹配。為了解決這個問題，我們提出了 K-ON，它透過採用多個頭部層進行下一個 k 步預測，將 KG 知識整合到 LLM 中。K-ON 不僅可以在一個步驟中產生實體層級的結果，還能針對實體啟用對比損失，這是 KG 表示學習中最有力的工具。實驗結果顯示，K-ON 優於將文字甚至其他方式納入考量的最新方法。
+摘要：<paragraph>這篇論文提出了一個完整的可解釋系統，它可以解釋一組資料，抽象出基礎特徵，並以選擇的自然語言描述它們。系統依賴兩個關鍵階段：(i) 從資料中識別新興屬性，並將它們轉換為抽象概念，以及 (ii) 將這些概念轉換為自然語言。儘管大型語言模型展示了令人印象深刻的自然語言生成能力，但它們的統計性質和內部機制的複雜性仍然迫使我們將這些技術用作黑盒子，放棄可信度。開發一個可解釋的資料解釋管道將有助於促進在安全關鍵環境中使用它，例如處理醫療資訊，並允許非專家和視障人士存取敘述資訊。為此，我們相信知識表示和自動推理研究領域可以提出一個有效的替代方案。在擴展解決第一階段 (i) 的先前研究的基礎上，我們專注於第二階段，稱為 Concept2Text。由於具有可解釋性，資料翻譯很容易透過基於邏輯的規則建模，再次強調宣告式程式設計在實現 AI 可解釋性中的作用。本文探討了一個基於 Prolog/CLP 的重寫系統，以解釋概念，這些概念以類別和關係的形式表達，再加上從通用本体衍生的常識，產生自然語言文字。它的主要特點包括階層樹重寫、模組化多語言生成、支援語義、語法和詞彙層面的等效變體，以及一個透明的基於規則的系統。我們概述了架構，並透過一些範例展示了它的靈活性，這些範例能夠根據輸入概念生成許多不同的等效重寫。</paragraph>
 
-##### **LegalViz: Legal Text Visualization by Text To Diagram Generation**
-2502.06147v2 by Eri Onami, Taiki Miyanishi, Koki Maeda, Shuhei Kurita
+##### **Foundation Model of Electronic Medical Records for Adaptive Risk Estimation**
+2502.06124v1 by Pawel Renc, Michal K. Grzeszczyk, Nassim Oufattole, Deirdre Goode, Yugang Jia, Szymon Bieganski, Matthew B. A. McDermott, Jaroslaw Was, Anthony E. Samir, Jonathan W. Cunningham, David W. Bates, Arkadiusz Sitek
 
-Legal documents including judgments and court orders require highly
-sophisticated legal knowledge for understanding. To disclose expert knowledge
-for non-experts, we explore the problem of visualizing legal texts with
-easy-to-understand diagrams and propose a novel dataset of LegalViz with 23
-languages and 7,010 cases of legal document and visualization pairs, using the
-DOT graph description language of Graphviz. LegalViz provides a simple diagram
-from a complicated legal corpus identifying legal entities, transactions, legal
-sources, and statements at a glance, that are essential in each judgment. In
-addition, we provide new evaluation metrics for the legal diagram visualization
-by considering graph structures, textual similarities, and legal contents. We
-conducted empirical studies on few-shot and finetuning large language models
-for generating legal diagrams and evaluated them with these metrics, including
-legal content-based evaluation within 23 languages. Models trained with
-LegalViz outperform existing models including GPTs, confirming the
-effectiveness of our dataset.
+We developed the Enhanced Transformer for Health Outcome Simulation (ETHOS),
+an AI model that tokenizes patient health timelines (PHTs) from EHRs. ETHOS
+predicts future PHTs using transformer-based architectures. The Adaptive Risk
+Estimation System (ARES) employs ETHOS to compute dynamic and personalized risk
+probabilities for clinician-defined critical events. ARES incorporates a
+personalized explainability module that identifies key clinical factors
+influencing risk estimates for individual patients. ARES was evaluated on the
+MIMIC-IV v2.2 dataset in emergency department (ED) settings, benchmarking its
+performance against traditional early warning systems and machine learning
+models. We processed 299,721 unique patients from MIMIC-IV into 285,622 PHTs,
+with 60% including hospital admissions. The dataset contained over 357 million
+tokens. ETHOS outperformed benchmark models in predicting hospital admissions,
+ICU admissions, and prolonged hospital stays, achieving superior AUC scores.
+ETHOS-based risk estimates demonstrated robustness across demographic subgroups
+with strong model reliability, confirmed via calibration curves. The
+personalized explainability module provides insights into patient-specific
+factors contributing to risk. ARES, powered by ETHOS, advances predictive
+healthcare AI by providing dynamic, real-time, and personalized risk estimation
+with patient-specific explainability to enhance clinician trust. Its
+adaptability and superior accuracy position it as a transformative tool for
+clinical decision-making, potentially improving patient outcomes and resource
+allocation in emergency and inpatient settings. We release the full code at
+github.com/ipolharvard/ethos-ares to facilitate future research.
 
-摘要：法律文件，包括判決和法院命令，需要高度專業的法律知識才能理解。為了向非專家揭露專家知識，我們探討了使用易於理解的圖表將法律文本視覺化的問題，並提出了一個新的 LegalViz 數據集，其中包含 23 種語言和 7,010 個法律文件和視覺化配對，使用 Graphviz 的 DOT 圖形描述語言。LegalViz 從複雜的法律語料庫中提供了一個簡單的圖表，可以一目了然地識別法律實體、交易、法律來源和陳述，這些在每項判決中都是必不可少的。此外，我們通過考慮圖形結構、文本相似性和法律內容，為法律圖表視覺化提供了新的評估指標。我們對少次學習和微調大型語言模型進行了實證研究，以生成法律圖表，並使用這些指標對它們進行了評估，包括在 23 種語言中基於法律內容的評估。使用 LegalViz 訓練的模型優於現有的模型，包括 GPT，證實了我們數據集的有效性。
+摘要：我們開發了增強型健康結果模擬轉換器 (ETHOS)，
+一種從電子健康紀錄 (EHR) 中將患者健康時間軸 (PHT) 標記化的 AI 模型。ETHOS
+使用基於轉換器的架構預測未來的 PHT。自適應風險評估系統 (ARES) 使用 ETHOS 計算由臨床醫生定義的危急事件的動態且個人化的風險機率。ARES 結合了個人化的可解釋性模組，可找出影響個別患者風險評估的主要臨床因素。ARES 在急診部門 (ED) 設定中針對 MIMIC-IV v2.2 資料集進行評估，並將其效能與傳統的預警系統和機器學習模型進行基準測試。我們將 299,721 位 MIMIC-IV 的獨特患者處理成 285,622 個 PHT，其中 60% 包含住院記錄。該資料集包含超過 3.57 億個標記。ETHOS 在預測住院、加護病房 (ICU) 住院和延長住院時間方面表現優於基準模型，並獲得了較高的 AUC 分數。基於 ETHOS 的風險評估顯示出跨人口統計子群的穩健性，並通過校準曲線確認了強大的模型可靠性。個人化的可解釋性模組提供了對導致風險的患者特定因素的見解。由 ETHOS 驅動的 ARES 透過提供動態、即時且個人化的風險評估，以及患者特定的可解釋性來增強臨床醫生的信任，從而推動了預測性醫療保健 AI 的發展。其適應性和卓越的準確性使其成為臨床決策制定的一種變革性工具，有可能改善緊急和住院環境中的患者結果和資源分配。我們在 github.com/ipolharvard/ethos-ares 上釋出完整程式碼，以利未來的研究。
 
-##### **Deconstructing Depression Stigma: Integrating AI-driven Data Collection and Analysis with Causal Knowledge Graphs**
-2502.06075v1 by Han Meng, Renwen Zhang, Ganyi Wang, Yitian Yang, Peinuan Qin, Jungup Lee, Yi-Chieh Lee
+##### **An Explainable Disease Surveillance System for Early Prediction of Multiple Chronic Diseases**
+2501.15969v1 by Shaheer Ahmad Khan, Muhammad Usamah Shahid, Ahmad Abdullah, Ibrahim Hashmat, Muddassar Farooq
 
-Mental-illness stigma is a persistent social problem, hampering both
-treatment-seeking and recovery. Accordingly, there is a pressing need to
-understand it more clearly, but analyzing the relevant data is highly
-labor-intensive. Therefore, we designed a chatbot to engage participants in
-conversations; coded those conversations qualitatively with AI assistance; and,
-based on those coding results, built causal knowledge graphs to decode stigma.
-The results we obtained from 1,002 participants demonstrate that conversation
-with our chatbot can elicit rich information about people's attitudes toward
-depression, while our AI-assisted coding was strongly consistent with
-human-expert coding. Our novel approach combining large language models (LLMs)
-and causal knowledge graphs uncovered patterns in individual responses and
-illustrated the interrelationships of psychological constructs in the dataset
-as a whole. The paper also discusses these findings' implications for HCI
-researchers in developing digital interventions, decomposing human
-psychological constructs, and fostering inclusive attitudes.
+This study addresses a critical gap in the healthcare system by developing a
+clinically meaningful, practical, and explainable disease surveillance system
+for multiple chronic diseases, utilizing routine EHR data from multiple U.S.
+practices integrated with CureMD's EMR/EHR system. Unlike traditional
+systems--using AI models that rely on features from patients' labs--our
+approach focuses on routinely available data, such as medical history, vitals,
+diagnoses, and medications, to preemptively assess the risks of chronic
+diseases in the next year. We trained three distinct models for each chronic
+disease: prediction models that forecast the risk of a disease 3, 6, and 12
+months before a potential diagnosis. We developed Random Forest models, which
+were internally validated using F1 scores and AUROC as performance metrics and
+further evaluated by a panel of expert physicians for clinical relevance based
+on inferences grounded in medical knowledge. Additionally, we discuss our
+implementation of integrating these models into a practical EMR system. Beyond
+using Shapley attributes and surrogate models for explainability, we also
+introduce a new rule-engineering framework to enhance the intrinsic
+explainability of Random Forests.
 
-摘要：精神疾病的污名化是一個持續存在的社會問題，阻礙了尋求治療和康復。因此，迫切需要更清楚地了解它，但分析相關數據非常費力。因此，我們設計了一個聊天機器人，讓參與者參與對話；使用 AI 協助對這些對話進行定性編碼；並根據這些編碼結果，構建因果知識圖譜來破譯污名化。我們從 1,002 名參與者那裡獲得的結果表明，與我們的聊天機器人的對話可以引出人們對憂鬱症的豐富資訊，而我們 AI 輔助的編碼與人類專家編碼非常一致。我們將大型語言模型 (LLM) 和因果知識圖譜相結合的新方法揭示了個別反應中的模式，並說明了資料集中心理建構之間的相互關係。本文還討論了這些發現對 HCI 研究人員在開發數位介入措施、分解人類心理建構和培養包容態度方面的影響。
+摘要：本研究透過開發一個臨床有意義、實用且可解釋的多重慢性疾病疾病監測系統，來解決醫療保健系統中的重大缺口，利用整合 CureMD 的 EMR/EHR 系統，來自多個美國實務的例行 EHR 資料。與傳統系統不同的是，我們的做法著重在例行可得的資料，例如病歷、生命徵象、診斷和藥物，以預先評估未來一年慢性疾病的風險，而非仰賴病患實驗室特徵的 AI 模型。我們針對每種慢性疾病訓練了三個不同的模型：預測模型，用以預測在潛在診斷前 3、6 和 12 個月的疾病風險。我們開發了隨機森林模型，並使用 F1 分數和 AUROC 作為效能指標，進行內部驗證，並進一步由專家醫師小組根據植基於醫學知識的推論，評估其臨床相關性。此外，我們討論了將這些模型整合到實用 EMR 系統中的實作方式。除了使用 Shapley 屬性和代理模型來解釋外，我們還引進了一個新的規則工程架構，以增強隨機森林的內在可解釋性。
 
-##### **LegalSeg: Unlocking the Structure of Indian Legal Judgments Through Rhetorical Role Classification**
-2502.05836v1 by Shubham Kumar Nigam, Tanmay Dubey, Govind Sharma, Noel Shallum, Kripabandhu Ghosh, Arnab Bhattacharya
+##### **Ensuring Medical AI Safety: Explainable AI-Driven Detection and Mitigation of Spurious Model Behavior and Associated Data**
+2501.13818v1 by Frederik Pahde, Thomas Wiegand, Sebastian Lapuschkin, Wojciech Samek
 
-In this paper, we address the task of semantic segmentation of legal
-documents through rhetorical role classification, with a focus on Indian legal
-judgments. We introduce LegalSeg, the largest annotated dataset for this task,
-comprising over 7,000 documents and 1.4 million sentences, labeled with 7
-rhetorical roles. To benchmark performance, we evaluate multiple
-state-of-the-art models, including Hierarchical BiLSTM-CRF,
-TransformerOverInLegalBERT (ToInLegalBERT), Graph Neural Networks (GNNs), and
-Role-Aware Transformers, alongside an exploratory RhetoricLLaMA, an
-instruction-tuned large language model. Our results demonstrate that models
-incorporating broader context, structural relationships, and sequential
-sentence information outperform those relying solely on sentence-level
-features. Additionally, we conducted experiments using surrounding context and
-predicted or actual labels of neighboring sentences to assess their impact on
-classification accuracy. Despite these advancements, challenges persist in
-distinguishing between closely related roles and addressing class imbalance.
-Our work underscores the potential of advanced techniques for improving legal
-document understanding and sets a strong foundation for future research in
-legal NLP.
+Deep neural networks are increasingly employed in high-stakes medical
+applications, despite their tendency for shortcut learning in the presence of
+spurious correlations, which can have potentially fatal consequences in
+practice. Detecting and mitigating shortcut behavior is a challenging task that
+often requires significant labeling efforts from domain experts. To alleviate
+this problem, we introduce a semi-automated framework for the identification of
+spurious behavior from both data and model perspective by leveraging insights
+from eXplainable Artificial Intelligence (XAI). This allows the retrieval of
+spurious data points and the detection of model circuits that encode the
+associated prediction rules. Moreover, we demonstrate how these shortcut
+encodings can be used for XAI-based sample- and pixel-level data annotation,
+providing valuable information for bias mitigation methods to unlearn the
+undesired shortcut behavior. We show the applicability of our framework using
+four medical datasets across two modalities, featuring controlled and
+real-world spurious correlations caused by data artifacts. We successfully
+identify and mitigate these biases in VGG16, ResNet50, and contemporary Vision
+Transformer models, ultimately increasing their robustness and applicability
+for real-world medical tasks.
 
-摘要：<paragraph>在本文中，我們通過修辭角色分類來探討法律文件的語義分段任務，重點關注印度法律判決。我們引入了 LegalSeg，這是此任務中最大的註釋資料集，包含超過 7,000 份文件和 140 萬個句子，並標記了 7 個修辭角色。為了評量效能，我們評估了多個最先進的模型，包括分層 BiLSTM-CRF、TransformerOverInLegalBERT (ToInLegalBERT)、圖神經網路 (GNN) 和角色感知Transformer，以及探索性的 RhetoricLLaMA，一種經過指令調整的大型語言模型。我們的結果表明，結合廣泛背景、結構關係和順序句子資訊的模型，表現優於僅依賴句子層級特徵的模型。此外，我們使用周圍的背景和鄰近句子的預測或實際標籤進行實驗，以評估它們對分類精度的影響。儘管有這些進展，但在區分密切相關的角色和解決類別不平衡方面仍存在挑戰。我們的研究強調了先進技術在改善法律文件理解方面的潛力，並為法律自然語言處理的未來研究奠定了堅實的基礎。</paragraph>
+摘要：深度神经网络越来越多地用于高风险医疗应用中，尽管它们在存在虚假相关性的情况下倾向于捷径学习，这在实践中可能产生致命的后果。检测和缓解捷径行为是一项艰巨的任务，通常需要领域专家的大量标记工作。为了缓解这个问题，我们引入了一个半自动框架，用于从数据和模型的角度识别虚假行为，方法是利用可解释人工智能 (XAI) 的见解。这允许检索虚假数据点并检测对关联预测规则进行编码的模型电路。此外，我们演示了如何使用这些捷径编码进行基于 XAI 的样本和像素级数据注释，为偏差缓解方法提供有价值的信息，以消除不需要的捷径行为。我们使用跨越两种方式的四个医学数据集展示了我们框架的适用性，这些数据集具有由数据伪像引起的受控和真实世界虚假相关性。我们成功地识别并减轻了 VGG16、ResNet50 和当代 Vision Transformer 模型中的这些偏差，最终提高了它们的鲁棒性和在真实世界医疗任务中的适用性。
 
-##### **LLM-Powered Decentralized Generative Agents with Adaptive Hierarchical Knowledge Graph for Cooperative Planning**
-2502.05453v1 by Hanqing Yang, Jingdi Chen, Marie Siew, Tania Lorido-Botran, Carlee Joe-Wong
+##### **Enhanced Suicidal Ideation Detection from Social Media Using a CNN-BiLSTM Hybrid Model**
+2501.11094v1 by Mohaiminul Islam Bhuiyan, Nur Shazwani Kamarudin, Nur Hafieza Ismail
 
-Developing intelligent agents for long-term cooperation in dynamic open-world
-scenarios is a major challenge in multi-agent systems. Traditional Multi-agent
-Reinforcement Learning (MARL) frameworks like centralized training
-decentralized execution (CTDE) struggle with scalability and flexibility. They
-require centralized long-term planning, which is difficult without custom
-reward functions, and face challenges in processing multi-modal data. CTDE
-approaches also assume fixed cooperation strategies, making them impractical in
-dynamic environments where agents need to adapt and plan independently. To
-address decentralized multi-agent cooperation, we propose Decentralized
-Adaptive Knowledge Graph Memory and Structured Communication System (DAMCS) in
-a novel Multi-agent Crafter environment. Our generative agents, powered by
-Large Language Models (LLMs), are more scalable than traditional MARL agents by
-leveraging external knowledge and language for long-term planning and
-reasoning. Instead of fully sharing information from all past experiences,
-DAMCS introduces a multi-modal memory system organized as a hierarchical
-knowledge graph and a structured communication protocol to optimize agent
-cooperation. This allows agents to reason from past interactions and share
-relevant information efficiently. Experiments on novel multi-agent open-world
-tasks show that DAMCS outperforms both MARL and LLM baselines in task
-efficiency and collaboration. Compared to single-agent scenarios, the two-agent
-scenario achieves the same goal with 63% fewer steps, and the six-agent
-scenario with 74% fewer steps, highlighting the importance of adaptive memory
-and structured communication in achieving long-term goals. We publicly release
-our project at: https://happyeureka.github.io/damcs.
+Suicidal ideation detection is crucial for preventing suicides, a leading
+cause of death worldwide. Many individuals express suicidal thoughts on social
+media, offering a vital opportunity for early detection through advanced
+machine learning techniques. The identification of suicidal ideation in social
+media text is improved by utilising a hybrid framework that integrates
+Convolutional Neural Networks (CNN) and Bidirectional Long Short-Term Memory
+(BiLSTM), enhanced with an attention mechanism. To enhance the interpretability
+of the model's predictions, Explainable AI (XAI) methods are applied, with a
+particular focus on SHapley Additive exPlanations (SHAP), are incorporated. At
+first, the model managed to reach an accuracy of 92.81%. By applying
+fine-tuning and early stopping techniques, the accuracy improved to 94.29%. The
+SHAP analysis revealed key features influencing the model's predictions, such
+as terms related to mental health struggles. This level of transparency boosts
+the model's credibility while helping mental health professionals understand
+and trust the predictions. This work highlights the potential for improving the
+accuracy and interpretability of detecting suicidal tendencies, making a
+valuable contribution to the progress of mental health monitoring systems. It
+emphasizes the significance of blending powerful machine learning methods with
+explainability to develop reliable and impactful mental health solutions.
 
-摘要：<paragraph>在動態開放世界情境中開發用於長期合作的智慧代理是多重代理系統中的一項重大挑戰。傳統的多重代理強化學習 (MARL) 框架，例如集中式訓練去中心化執行 (CTDE)，在可擴充性和靈活性方面面臨困難。它們需要集中式長期規劃，這在沒有自訂獎勵函數的情況下很難執行，並且在處理多模式數據時會面臨挑戰。CTDE 方法還假設固定的合作策略，這使得它們在代理需要獨立適應和規劃的動態環境中不切實際。為了解決分散式多重代理合作問題，我們在一個新穎的多重代理工匠環境中提出了分散式自適應知識圖譜記憶體和結構化通訊系統 (DAMCS)。我們的生成代理由大型語言模型 (LLM) 提供支援，透過利用外部知識和語言進行長期規劃和推理，比傳統的 MARL 代理更具可擴充性。DAMCS 沒有完全分享來自所有過去經驗的資訊，而是引入了多模式記憶體系統，該系統組織成階層式知識圖譜和結構化通訊協定，以最佳化代理合作。這允許代理根據過去的互動進行推理並有效地分享相關資訊。在新的多重代理開放世界任務上的實驗表明，DAMCS 在任務效率和協作方面優於 MARL 和 LLM 基準。與單一代理情境相比，雙重代理情境以少 63% 的步驟達成相同的目標，而六重代理情境則以少 74% 的步驟達成目標，突顯了自適應記憶體和結構化通訊在達成長期目標中的重要性。我們公開發布我們的專案於：https://happyeureka.github.io/damcs。</paragraph>
+摘要：自殺意念偵測對於預防自殺至關重要，而自殺是全球主要的死亡原因。許多人在社群媒體上表達自殺念頭，這提供了透過進階機器學習技術進行早期偵測的重要機會。透過整合卷積神經網路 (CNN) 和雙向長短期記憶 (BiLSTM) 的混合架構，並加入注意力機制，可以提升在社群媒體文字中辨識自殺意念的能力。為了加強模型預測的可解釋性，我們採用可解釋人工智慧 (XAI) 方法，特別著重於 SHapley 加法解釋 (SHAP)。一開始，模型成功達到 92.81% 的準確度。透過套用微調和早期停止技術，準確度提升至 94.29%。SHAP 分析揭露了影響模型預測的關鍵特徵，例如與心理健康困境相關的詞彙。這種透明度提升了模型的可信度，同時協助心理健康專業人員理解和信賴預測結果。這項工作突顯了提升偵測自殺傾向的準確度和可解釋性的潛力，為心理健康監控系統的進展做出寶貴的貢獻。它強調了將強大的機器學習方法與可解釋性相結合以開發可靠且有影響力的心理健康解決方案的重要性。
 
-##### **SAMGPT: Text-free Graph Foundation Model for Multi-domain Pre-training and Cross-domain Adaptation**
-2502.05424v1 by Xingtong Yu, Zechuan Gong, Chang Zhou, Yuan Fang, Hui Zhang
+##### **SEANN: A Domain-Informed Neural Network for Epidemiological Insights**
+2501.10273v1 by Jean-Baptiste Guimbaud, Marc Plantevit, Léa Maître, Rémy Cazabet
 
-Graphs are able to model interconnected entities in many online services,
-supporting a wide range of applications on the Web. This raises an important
-question: How can we train a graph foundational model on multiple source
-domains and adapt to an unseen target domain? A major obstacle is that graphs
-from different domains often exhibit divergent characteristics. Some studies
-leverage large language models to align multiple domains based on textual
-descriptions associated with the graphs, limiting their applicability to
-text-attributed graphs. For text-free graphs, a few recent works attempt to
-align different feature distributions across domains, while generally
-neglecting structural differences. In this work, we propose a novel Structure
-Alignment framework for text-free Multi-domain Graph Pre-Training and
-cross-domain adaptation (SAMGPT). It is designed to learn multi-domain
-knowledge from graphs originating in multiple source domains, which can then be
-adapted to address applications in an unseen target domain. Specifically, we
-introduce a set of structure tokens to harmonize structure-based aggregation
-across source domains during the pre-training phase. Next, for cross-domain
-adaptation, we design dual prompts, namely, holistic prompts and specific
-prompts, which adapt unified multi-domain structural knowledge and
-fine-grained, domain-specific information, respectively, to a target domain.
-Finally, we conduct comprehensive experiments on seven public datasets to
-evaluate and analyze the effectiveness of SAMGPT.
+In epidemiology, traditional statistical methods such as logistic regression,
+linear regression, and other parametric models are commonly employed to
+investigate associations between predictors and health outcomes. However,
+non-parametric machine learning techniques, such as deep neural networks
+(DNNs), coupled with explainable AI (XAI) tools, offer new opportunities for
+this task. Despite their potential, these methods face challenges due to the
+limited availability of high-quality, high-quantity data in this field. To
+address these challenges, we introduce SEANN, a novel approach for informed
+DNNs that leverages a prevalent form of domain-specific knowledge: Pooled
+Effect Sizes (PES). PESs are commonly found in published Meta-Analysis studies,
+in different forms, and represent a quantitative form of a scientific
+consensus. By direct integration within the learning procedure using a custom
+loss, we experimentally demonstrate significant improvements in the
+generalizability of predictive performances and the scientific plausibility of
+extracted relationships compared to a domain-knowledge agnostic neural network
+in a scarce and noisy data setting.
 
-摘要：圖表能夠在許多線上服務中對相互關聯的實體進行建模，
-支援網路上廣泛的應用程式。這提出了重要的問題：我們如何針對多個來源網域訓練圖表基礎模型，並適應未見過的目標網域？一個主要的障礙是，來自不同網域的圖表通常表現出不同的特性。一些研究利用大型語言模型，根據與圖表相關的文字描述，對齊多個網域，限制其適用性於有文字屬性的圖表。對於沒有文字的圖表，最近的一些作品嘗試對齊跨網域的不同特徵分佈，同時通常忽略結構上的差異。在這項工作中，我們提出了一個新的結構對齊框架，用於無文字多網域圖表預訓練和跨網域適應 (SAMGPT)。它被設計為從起源於多個來源網域的圖表中學習多網域知識，然後可以適應於未見過的目標網域中的應用程式。具體來說，我們引入了一組結構化代碼，以在預訓練階段，調和跨來源網域的基於結構的聚合。接下來，對於跨網域適應，我們設計了雙重提示，即整體提示和具體提示，分別將統一的多網域結構知識和細緻的、特定於網域的資訊適應到目標網域。最後，我們在七個公共資料集上進行了全面的實驗，以評估和分析 SAMGPT 的有效性。
+摘要：在流行病學中，傳統的統計方法，例如邏輯迴歸、線性迴歸和其他參數模型通常用於調查預測因子與健康結果之間的關聯。然而，非參數機器學習技術，例如深度神經網路 (DNN)，結合可解釋的 AI (XAI) 工具，為這項任務提供了新的機會。儘管這些方法具有潛力，但由於該領域缺乏高品質、高數量資料，因此這些方法面臨挑戰。為了應對這些挑戰，我們引入了 SEANN，這是一種新穎的方法，用於獲取知識的 DNN，它利用了一種流行的領域特定知識形式：彙總效應量 (PES)。PES 通常以不同的形式出現在已發表的 Meta 分析研究中，並代表科學共識的量化形式。通過使用自訂損失函數直接整合在學習程序中，我們以實驗方式證明了預測效能的概括性以及與從缺乏領域知識的神經網路中提取的關係相比，科學合理性的顯著提升，且是在稀少且有雜訊的資料設定中。
 
-##### **Graph-based Molecular In-context Learning Grounded on Morgan Fingerprints**
-2502.05414v1 by Ali Al-Lawati, Jason Lucas, Zhiwei Zhang, Prasenjit Mitra, Suhang Wang
+##### **Artificial Intelligence-Driven Clinical Decision Support Systems**
+2501.09628v1 by Muhammet Alkan, Idris Zakariyya, Samuel Leighton, Kaushik Bhargav Sivangi, Christos Anagnostopoulos, Fani Deligianni
 
-In-context learning (ICL) effectively conditions large language models (LLMs)
-for molecular tasks, such as property prediction and molecule captioning, by
-embedding carefully selected demonstration examples into the input prompt. This
-approach avoids the computational overhead of extensive pertaining and
-fine-tuning. However, current prompt retrieval methods for molecular tasks have
-relied on molecule feature similarity, such as Morgan fingerprints, which do
-not adequately capture the global molecular and atom-binding relationships. As
-a result, these methods fail to represent the full complexity of molecular
-structures during inference. Moreover, small-to-medium-sized LLMs, which offer
-simpler deployment requirements in specialized systems, have remained largely
-unexplored in the molecular ICL literature. To address these gaps, we propose a
-self-supervised learning technique, GAMIC (Graph-Aligned Molecular In-Context
-learning, which aligns global molecular structures, represented by graph neural
-networks (GNNs), with textual captions (descriptions) while leveraging local
-feature similarity through Morgan fingerprints. In addition, we introduce a
-Maximum Marginal Relevance (MMR) based diversity heuristic during retrieval to
-optimize input prompt demonstration samples. Our experimental findings using
-diverse benchmark datasets show GAMIC outperforms simple Morgan-based ICL
-retrieval methods across all tasks by up to 45%.
+As artificial intelligence (AI) becomes increasingly embedded in healthcare
+delivery, this chapter explores the critical aspects of developing reliable and
+ethical Clinical Decision Support Systems (CDSS). Beginning with the
+fundamental transition from traditional statistical models to sophisticated
+machine learning approaches, this work examines rigorous validation strategies
+and performance assessment methods, including the crucial role of model
+calibration and decision curve analysis. The chapter emphasizes that creating
+trustworthy AI systems in healthcare requires more than just technical
+accuracy; it demands careful consideration of fairness, explainability, and
+privacy. The challenge of ensuring equitable healthcare delivery through AI is
+stressed, discussing methods to identify and mitigate bias in clinical
+predictive models. The chapter then delves into explainability as a cornerstone
+of human-centered CDSS. This focus reflects the understanding that healthcare
+professionals must not only trust AI recommendations but also comprehend their
+underlying reasoning. The discussion advances in an analysis of privacy
+vulnerabilities in medical AI systems, from data leakage in deep learning
+models to sophisticated attacks against model explanations. The text explores
+privacy-preservation strategies such as differential privacy and federated
+learning, while acknowledging the inherent trade-offs between privacy
+protection and model performance. This progression, from technical validation
+to ethical considerations, reflects the multifaceted challenges of developing
+AI systems that can be seamlessly and reliably integrated into daily clinical
+practice while maintaining the highest standards of patient care and data
+protection.
 
-摘要：<paragraph>情境學習 (ICL) 有效地調整大型語言模型 (LLM)，以執行分子任務，例如屬性預測和分子標題，方法是將仔細挑選的示範範例嵌入輸入提示中。這種方法避免了廣泛相關和微調的計算開銷。然而，目前針對分子任務的提示檢索方法依賴於分子特徵相似性，例如 Morgan 指紋，而無法充分捕捉全局分子和原子鍵結關係。因此，這些方法無法在推理過程中表示分子結構的完整複雜性。此外，在專業系統中提供更簡單部署需求的小到中型的 LLM，在分子 ICL 文獻中仍未得到充分探索。為了解決這些差距，我們提出了一種自我監督學習技術，GAMIC（圖形對齊分子情境學習），它將由圖形神經網路 (GNN) 表示的全局分子結構與文字標題（描述）對齊，同時透過 Morgan 指紋利用局部特徵相似性。此外，我們在檢索過程中引入了一個基於最大邊際相關性 (MMR) 的多樣性啟發法，以最佳化輸入提示示範樣本。我們使用不同的基準資料集進行的實驗結果顯示，GAMIC 在所有任務中都優於基於 Morgan 的簡單 ICL 檢索方法，最多可達 45%。</paragraph>
+摘要：隨著人工智慧 (AI) 在醫療保健中的應用日益普及，本章探討了開發可靠且符合道德標準的臨床決策支援系統 (CDSS) 的關鍵面向。從傳統統計模型到複雜機器學習方法的基本轉變開始，這項工作審查了嚴謹的驗證策略和效能評估方法，包括模型校準和決策曲線分析的關鍵角色。本章強調，在醫療保健中建立值得信賴的 AI 系統不只是技術上的準確性；它需要仔細考量公平性、可解釋性和隱私權。本章強調了透過 AI 確保公平的醫療保健服務的挑戰，並討論了識別和減輕臨床預測模型中偏差的方法。接著，本章深入探討可解釋性，作為以人為中心的 CDSS 的基石。這種關注反映了醫療保健專業人員不僅必須信任 AI 建議，還必須理解其背後的推理。討論進一步分析了醫療 AI 系統中的隱私漏洞，從深度學習模型中的資料外洩到針對模型解釋的複雜攻擊。本文探討了隱私保護策略，例如差分隱私和聯合學習，同時承認隱私保護和模型效能之間的固有取捨。這種從技術驗證到道德考量的進展，反映了開發 AI 系統的多面向挑戰，這些系統可以無縫且可靠地整合到日常臨床實務中，同時維持最高的病患照護和資料保護標準。
 
-##### **Knowledge Graph-Guided Retrieval Augmented Generation**
-2502.06864v1 by Xiangrong Zhu, Yuexiang Xie, Yi Liu, Yaliang Li, Wei Hu
+##### **MedGrad E-CLIP: Enhancing Trust and Transparency in AI-Driven Skin Lesion Diagnosis**
+2501.06887v1 by Sadia Kamal, Tim Oates
 
-Retrieval-augmented generation (RAG) has emerged as a promising technology
-for addressing hallucination issues in the responses generated by large
-language models (LLMs). Existing studies on RAG primarily focus on applying
-semantic-based approaches to retrieve isolated relevant chunks, which ignore
-their intrinsic relationships. In this paper, we propose a novel Knowledge
-Graph-Guided Retrieval Augmented Generation (KG$^2$RAG) framework that utilizes
-knowledge graphs (KGs) to provide fact-level relationships between chunks,
-improving the diversity and coherence of the retrieved results. Specifically,
-after performing a semantic-based retrieval to provide seed chunks, KG$^2$RAG
-employs a KG-guided chunk expansion process and a KG-based chunk organization
-process to deliver relevant and important knowledge in well-organized
-paragraphs. Extensive experiments conducted on the HotpotQA dataset and its
-variants demonstrate the advantages of KG$^2$RAG compared to existing RAG-based
-approaches, in terms of both response quality and retrieval quality.
+As deep learning models gain attraction in medical data, ensuring transparent
+and trustworthy decision-making is essential. In skin cancer diagnosis, while
+advancements in lesion detection and classification have improved accuracy, the
+black-box nature of these methods poses challenges in understanding their
+decision processes, leading to trust issues among physicians. This study
+leverages the CLIP (Contrastive Language-Image Pretraining) model, trained on
+different skin lesion datasets, to capture meaningful relationships between
+visual features and diagnostic criteria terms. To further enhance transparency,
+we propose a method called MedGrad E-CLIP, which builds on gradient-based
+E-CLIP by incorporating a weighted entropy mechanism designed for complex
+medical imaging like skin lesions. This approach highlights critical image
+regions linked to specific diagnostic descriptions. The developed integrated
+pipeline not only classifies skin lesions by matching corresponding
+descriptions but also adds an essential layer of explainability developed
+especially for medical data. By visually explaining how different features in
+an image relates to diagnostic criteria, this approach demonstrates the
+potential of advanced vision-language models in medical image analysis,
+ultimately improving transparency, robustness, and trust in AI-driven
+diagnostic systems.
 
-摘要：檢索增強生成 (RAG) 已成為一項有前途的技術，用於解決大型語言模型 (LLM) 所產生回應中的幻覺問題。現有關於 RAG 的研究主要專注於應用基於語義的方法來檢索孤立相關的區塊，而忽略它們的內在關係。在本文中，我們提出了一個新穎的知識圖表引導檢索增強生成 (KG$^2$RAG) 框架，它利用知識圖表 (KG) 來提供區塊之間的事實層級關係，從而提高檢索結果的多樣性和一致性。具體來說，在執行基於語義的檢索以提供種子區塊後，KG$^2$RAG 採用 KG 引導的區塊擴充程序和基於 KG 的區塊組織程序，以在組織良好的段落中傳達相關且重要的知識。在 HotpotQA 資料集及其變體上進行的大量實驗證明了 KG$^2$RAG 在回應品質和檢索品質方面優於現有的基於 RAG 的方法。
+摘要：随着深度学习模型在医学数据中获得关注，确保透明且值得信赖的决策至关重要。在皮肤癌诊断中，虽然病灶检测和分类的进步提高了准确性，但这些方法的黑盒性质对理解其决策过程构成了挑战，导致医生之间的信任问题。本研究利用在不同皮肤病变数据集上训练的 CLIP（对比语言图像预训练）模型，以捕捉视觉特征和诊断标准术语之间的有意义关系。为了进一步提高透明度，我们提出了一种名为 MedGrad E-CLIP 的方法，该方法通过结合专为皮肤病变等复杂医学影像设计的加权熵机制，建立在基于梯度的 E-CLIP 之上。此方法突出了与特定诊断描述相关联的关键图像区域。开发的集成管道不仅通过匹配相应的描述对皮肤病变进行分类，还添加了一层专门为医学数据开发的基本可解释性。通过直观地解释图像中不同特征与诊断标准的关系，这种方法展示了高级视觉语言模型在医学图像分析中的潜力，最终提高了透明度、稳健性和对人工智能驱动的诊断系统的信任。
 
-##### **Can Large Language Models Understand Intermediate Representations?**
-2502.06854v1 by Hailong Jiang, Jianfeng Zhu, Yao Wan, Bo Fang, Hongyu Zhang, Ruoming Jin, Qiang Guan
+##### **Explaining Humour Style Classifications: An XAI Approach to Understanding Computational Humour Analysis**
+2501.02891v1 by Mary Ogbuka Kenneth, Foaad Khosmood, Abbas Edalat
 
-Intermediate Representations (IRs) are essential in compiler design and
-program analysis, yet their comprehension by Large Language Models (LLMs)
-remains underexplored. This paper presents a pioneering empirical study to
-investigate the capabilities of LLMs, including GPT-4, GPT-3, Gemma 2, LLaMA
-3.1, and Code Llama, in understanding IRs. We analyze their performance across
-four tasks: Control Flow Graph (CFG) reconstruction, decompilation, code
-summarization, and execution reasoning. Our results indicate that while LLMs
-demonstrate competence in parsing IR syntax and recognizing high-level
-structures, they struggle with control flow reasoning, execution semantics, and
-loop handling. Specifically, they often misinterpret branching instructions,
-omit critical IR operations, and rely on heuristic-based reasoning, leading to
-errors in CFG reconstruction, IR decompilation, and execution reasoning. The
-study underscores the necessity for IR-specific enhancements in LLMs,
-recommending fine-tuning on structured IR datasets and integration of explicit
-control flow models to augment their comprehension and handling of IR-related
-tasks.
+Humour styles can have either a negative or a positive impact on well-being.
+Given the importance of these styles to mental health, significant research has
+been conducted on their automatic identification. However, the automated
+machine learning models used for this purpose are black boxes, making their
+prediction decisions opaque. Clarity and transparency are vital in the field of
+mental health. This paper presents an explainable AI (XAI) framework for
+understanding humour style classification, building upon previous work in
+computational humour analysis. Using the best-performing single model
+(ALI+XGBoost) from prior research, we apply comprehensive XAI techniques to
+analyse how linguistic, emotional, and semantic features contribute to humour
+style classification decisions. Our analysis reveals distinct patterns in how
+different humour styles are characterised and misclassified, with particular
+emphasis on the challenges in distinguishing affiliative humour from other
+styles. Through detailed examination of feature importance, error patterns, and
+misclassification cases, we identify key factors influencing model decisions,
+including emotional ambiguity, context misinterpretation, and target
+identification. The framework demonstrates significant utility in understanding
+model behaviour, achieving interpretable insights into the complex interplay of
+features that define different humour styles. Our findings contribute to both
+the theoretical understanding of computational humour analysis and practical
+applications in mental health, content moderation, and digital humanities
+research.
 
-摘要：中間表徵 (IR) 在編譯器設計和程式分析中至關重要，但大型語言模型 (LLM) 對其理解仍未得到充分探討。本文提出了一項開創性的實證研究，以探討 LLM（包括 GPT-4、GPT-3、Gemma 2、LLaMA 3.1 和 Code Llama）理解 IR 的能力。我們分析了它們在四項任務中的表現：控制流程圖 (CFG) 重建、反編譯、程式碼摘要和執行推理。我們的結果表明，儘管 LLM 在解析 IR 語法和識別高階結構方面表現出能力，但它們在控制流程推理、執行語義和迴圈處理方面存在困難。具體而言，它們經常誤解分支指令、省略關鍵 IR 操作，並依賴於基於啟發式的推理，導致 CFG 重建、IR 反編譯和執行推理出現錯誤。這項研究強調了 LLM 中對 IR 特定的增強的必要性，建議對結構化的 IR 資料集進行微調，並整合明確的控制流程模型，以增強其對 IR 相關任務的理解和處理。
+摘要：幽默風格對幸福感可能產生負面或正面的影響。
+鑑於這些風格對心理健康的重要性，已經對其自動識別進行了大量研究。然而，用於此目的的自動機器學習模型是黑盒子，使得其預測決策不透明。清晰度和透明度在心理健康領域至關重要。本文提出了一個可解釋的 AI (XAI) 框架，用於理解幽默風格分類，建立在計算幽默分析的先前工作之上。使用先前研究中表現最好的單一模型 (ALI+XGBoost)，我們應用全面的 XAI 技術來分析語言、情緒和語義特徵如何影響幽默風格分類決策。我們的分析揭示了不同幽默風格如何被表徵和錯誤分類的不同模式，特別強調了區分聯屬幽默與其他風格的挑戰。通過仔細檢查特徵重要性、錯誤模式和錯誤分類案例，我們確定了影響模型決策的關鍵因素，包括情緒模糊、情境誤解和目標識別。該框架展示了在理解模型行為方面的顯著效用，實現了對定義不同幽默風格的特徵之間複雜相互作用的可解釋見解。我們的發現有助於計算幽默分析的理論理解和心理健康、內容審核和數字人文研究中的實際應用。
 
-##### **GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity?**
-2502.05252v1 by Yang Zhou, Hongyi Liu, Zhuoming Chen, Yuandong Tian, Beidi Chen
+##### **The Emotional Spectrum of LLMs: Leveraging Empathy and Emotion-Based Markers for Mental Health Support**
+2412.20068v1 by Alessandro De Grandi, Federico Ravenda, Andrea Raballo, Fabio Crestani
 
-Long-context large language models (LLMs) have recently shown strong
-performance in information retrieval and long-document QA. However, to tackle
-the most challenging intellectual problems, LLMs must reason effectively in
-long and complex contexts (e.g., frontier mathematical research). Studying how
-LLMs handle increasing reasoning complexity and context length is essential,
-yet existing benchmarks lack a solid basis for quantitative evaluation.
-Inspired by the abstraction of GSM-8K problems as computational graphs, and the
-ability to introduce noise by adding unnecessary nodes and edges, we develop a
-grade school math problem generator capable of producing arithmetic problems
-with infinite difficulty and context length under fine-grained control. Using
-our newly synthesized GSM-Infinite benchmark, we comprehensively evaluate
-existing LLMs. We find a consistent sigmoid decline in reasoning performance as
-complexity increases, along with a systematic inference scaling trend:
-exponentially increasing inference computation yields only linear performance
-gains. These findings underscore the fundamental limitations of current
-long-context LLMs and the key challenges in scaling reasoning capabilities. Our
-GSM-Infinite benchmark provides a scalable and controllable testbed for
-systematically studying and advancing LLM reasoning in long and complex
-contexts.
+The increasing demand for mental health services has highlighted the need for
+innovative solutions, particularly in the realm of psychological conversational
+AI, where the availability of sensitive data is scarce. In this work, we
+explored the development of a system tailored for mental health support with a
+novel approach to psychological assessment based on explainable emotional
+profiles in combination with empathetic conversational models, offering a
+promising tool for augmenting traditional care, particularly where immediate
+expertise is unavailable. Our work can be divided into two main parts,
+intrinsecaly connected to each other. First, we present RACLETTE, a
+conversational system that demonstrates superior emotional accuracy compared to
+state-of-the-art benchmarks in both understanding users' emotional states and
+generating empathetic responses during conversations, while progressively
+building an emotional profile of the user through their interactions. Second,
+we show how the emotional profiles of a user can be used as interpretable
+markers for mental health assessment. These profiles can be compared with
+characteristic emotional patterns associated with different mental disorders,
+providing a novel approach to preliminary screening and support.
 
-摘要：長文本大型語言模型 (LLM) 最近在資訊檢索和長文件問答中展示了強大的效能。然而，若要解決最具挑戰性的智力問題，LLM 必須在長且複雜的脈絡中有效推理（例如，前沿數學研究）。研究 LLM 如何處理增加的推理複雜性和脈絡長度至關重要，但現有的基準缺乏定量評估的穩固基礎。受到 GSM-8K 問題抽象化為計算圖形的啟發，以及透過加入不必要的節點和邊緣來引入雜訊的能力，我們開發了一個小學數學問題產生器，能夠在細緻的控制下產生具有無限難度和脈絡長度的算術問題。使用我們新合成的 GSM-Infinite 基準，我們全面評估現有的 LLM。我們發現推理效能會隨著複雜性的增加而持續呈 S 形下降，並伴隨著系統性的推論縮放趨勢：指數增加的推論計算僅產生線性的效能增益。這些發現強調了當前長脈絡 LLM 的基本限制，以及擴展推理能力的主要挑戰。我們的 GSM-Infinite 基準提供了一個可擴充且可控的測試平台，用於系統性地研究和提升 LLM 在長且複雜脈絡中的推理能力。
+摘要：隨著對心理健康服務需求的增加，凸顯了創新解決方案的需求，特別是在心理對話式人工智慧領域，那裡缺乏敏感資料。在這項工作中，我們探索了開發一個針對心理健康支持的系統，採用一種基於可解釋的情緒特徵的新方法進行心理評估，結合同理心對話模式，提供了一個有前途的工具，用於擴充傳統照護，特別是在無法立即獲得專業知識的情況下。我們的工作可以分為兩個主要部分，彼此內在相關。首先，我們展示了 RACLETTE，一個對話系統，與最先進的基準相比，在理解使用者情緒狀態和在對話中產生同理心回應方面表現出優越的情緒準確性，同時透過他們的互動逐漸建立使用者的情緒特徵。其次，我們展示了使用者的情緒特徵如何可用作心理健康評估的可解釋標記。這些特徵可以與與不同心理疾病相關的典型情緒模式進行比較，提供了一種初步篩選和支持的新方法。
 
-##### **Causality can systematically address the monsters under the bench(marks)**
-2502.05085v1 by Felix Leeb, Zhijing Jin, Bernhard Schölkopf
+##### **A Review on the Integration of Artificial Intelligence and Medical Imaging in IVF Ovarian Stimulation**
+2412.19688v1 by Jana Zakall, Birgit Pohn, Antonia Graf, Daniel Kovatchki, Arezoo Borji, Ragib Shahriar Islam, Hossam Haick, Heinz Strohmer, Sepideh Hatamikia
 
-Effective and reliable evaluation is essential for advancing empirical
-machine learning. However, the increasing accessibility of generalist models
-and the progress towards ever more complex, high-level tasks make systematic
-evaluation more challenging. Benchmarks are plagued by various biases,
-artifacts, or leakage, while models may behave unreliably due to poorly
-explored failure modes. Haphazard treatments and inconsistent formulations of
-such "monsters" can contribute to a duplication of efforts, a lack of trust in
-results, and unsupported inferences. In this position paper, we argue causality
-offers an ideal framework to systematically address these challenges. By making
-causal assumptions in an approach explicit, we can faithfully model phenomena,
-formulate testable hypotheses with explanatory power, and leverage principled
-tools for analysis. To make causal model design more accessible, we identify
-several useful Common Abstract Topologies (CATs) in causal graphs which help
-gain insight into the reasoning abilities in large language models. Through a
-series of case studies, we demonstrate how the precise yet pragmatic language
-of causality clarifies the strengths and limitations of a method and inspires
-new approaches for systematic progress.
+Artificial intelligence (AI) has emerged as a powerful tool to enhance
+decision-making and optimize treatment protocols in in vitro fertilization
+(IVF). In particular, AI shows significant promise in supporting
+decision-making during the ovarian stimulation phase of the IVF process. This
+review evaluates studies focused on the applications of AI combined with
+medical imaging in ovarian stimulation, examining methodologies, outcomes, and
+current limitations. Our analysis of 13 studies on this topic reveals that,
+reveal that while AI algorithms demonstrated notable potential in predicting
+optimal hormonal dosages, trigger timing, and oocyte retrieval outcomes, the
+medical imaging data utilized predominantly came from two-dimensional (2D)
+ultrasound which mainly involved basic quantifications, such as follicle size
+and number, with limited use of direct feature extraction or advanced image
+analysis techniques. This points to an underexplored opportunity where advanced
+image analysis approaches, such as deep learning, and more diverse imaging
+modalities, like three-dimensional (3D) ultrasound, could unlock deeper
+insights. Additionally, the lack of explainable AI (XAI) in most studies raises
+concerns about the transparency and traceability of AI-driven decisions - key
+factors for clinical adoption and trust. Furthermore, many studies relied on
+single-center designs and small datasets, which limit the generalizability of
+their findings. This review highlights the need for integrating advanced
+imaging analysis techniques with explainable AI methodologies, as well as the
+importance of leveraging multicenter collaborations and larger datasets.
+Addressing these gaps has the potential to enhance ovarian stimulation
+management, paving the way for efficient, personalized, and data-driven
+treatment pathways that improve IVF outcomes.
+
+摘要：人工智慧（AI）已成為增強體外受精（IVF）決策制定和優化治療方案的強大工具。特別是，AI 在支持 IVF 過程中卵巢刺激階段的決策制定方面顯示出顯著的前景。本綜述評估了專注於 AI 結合卵巢刺激中的醫學影像應用、檢驗方法、結果和當前限制的研究。我們對 13 項關於此主題的研究分析顯示，雖然 AI 演算法在預測最佳荷爾蒙劑量、觸發時機和卵子取出結果方面表現出顯著的潛力，但所利用的醫學影像數據主要來自於二次元（2D）超音波，而二次元超音波主要涉及基本量化，例如濾泡大小和數量，且有限使用直接特徵提取或進階影像分析技術。這指向一個尚未探索的機會，例如深度學習等進階影像分析方法，以及更多元的影像模式，例如三維（3D）超音波，可以解鎖更深入的見解。此外，大多數研究缺乏可解釋 AI（XAI），這引起了人們對 AI 驅動決策的透明度和可追溯性的擔憂，而透明度和可追溯性是臨床採用和信任的關鍵因素。此外，許多研究依賴於單中心設計和小型數據集，這限制了其發現的普遍性。本綜述強調了將進階影像分析技術與可解釋 AI 方法整合起來的必要性，以及利用多中心合作和大型數據集的重要性。解決這些差距有可能增強卵巢刺激管理，為有效、個人化和數據驅動的治療途徑鋪平道路，進而改善 IVF 結果。
+
+##### **Enhancing Cancer Diagnosis with Explainable & Trustworthy Deep Learning Models**
+2412.17527v1 by Badaru I. Olumuyiwa, The Anh Han, Zia U. Shamszaman
+
+This research presents an innovative approach to cancer diagnosis and
+prediction using explainable Artificial Intelligence (XAI) and deep learning
+techniques. With cancer causing nearly 10 million deaths globally in 2020,
+early and accurate diagnosis is crucial. Traditional methods often face
+challenges in cost, accuracy, and efficiency. Our study develops an AI model
+that provides precise outcomes and clear insights into its decision-making
+process, addressing the "black box" problem of deep learning models. By
+employing XAI techniques, we enhance interpretability and transparency,
+building trust among healthcare professionals and patients. Our approach
+leverages neural networks to analyse extensive datasets, identifying patterns
+for cancer detection. This model has the potential to revolutionise diagnosis
+by improving accuracy, accessibility, and clarity in medical decision-making,
+possibly leading to earlier detection and more personalised treatment
+strategies. Furthermore, it could democratise access to high-quality
+diagnostics, particularly in resource-limited settings, contributing to global
+health equity. The model's applications extend beyond cancer diagnosis,
+potentially transforming various aspects of medical decision-making and saving
+millions of lives worldwide.
 
-摘要：有效的、可靠的評估對於推進經驗機器學習至關重要。然而，一般化模型的可及性日益提高，以及朝著更複雜、更高級別任務的進展，使得系統評估更具挑戰性。基準測試受到各種偏差、人工製品或洩漏的困擾，而模型由於探索不充分的故障模式而可能表現得不可靠。隨意處理和不一致的表述等「怪物」可能會導致重複工作、對結果缺乏信任以及不支援的推論。在本文中，我們論證因果關係提供了一個系統性解決這些挑戰的理想框架。通過在方法中明確因果假設，我們可以忠實地模擬現象，制定具有解釋力的可測試假設，並利用原則性的分析工具。為了使因果模型設計更易於使用，我們在因果圖中識別出幾個有用的通用抽象拓撲 (CAT)，有助於深入了解大型語言模型中的推理能力。通過一系列案例研究，我們展示了因果關係的精確但務實的語言如何釐清方法的優缺點，並激發系統進展的新方法。
+摘要：本研究提出了一個創新的癌症診斷和預測方法，使用可解釋的人工智慧 (XAI) 和深度學習技術。由於癌症在 2020 年造成全球近 1,000 萬人死亡，因此早期準確的診斷至關重要。傳統方法通常面臨成本、準確性和效率方面的挑戰。我們的研究開發了一個 AI 模型，它提供精確的結果並清楚地了解其決策過程，解決了深度學習模型的「黑箱」問題。通過採用 XAI 技術，我們增強了解釋性和透明度，在醫療專業人員和患者之間建立信任。我們的做法利用神經網路分析廣泛的數據集，識別癌症檢測模式。這個模型有可能通過提高醫療決策的準確性、可及性和清晰度來革新診斷，可能導致更早的檢測和更個性化的治療策略。此外，它可以使更多人獲得高品質的診斷，特別是在資源有限的環境中，有助於全球健康公平。該模型的應用範圍不僅限於癌症診斷，還可能轉變醫療決策的各個方面，並拯救全球數百萬人的生命。
 
-##### **Adaptive Graph of Thoughts: Test-Time Adaptive Reasoning Unifying Chain, Tree, and Graph Structures**
-2502.05078v1 by Tushar Pandey, Ara Ghukasyan, Oktay Goktas, Santosh Kumar Radha
+##### **Towards Interpretable Radiology Report Generation via Concept Bottlenecks using a Multi-Agentic RAG**
+2412.16086v2 by Hasan Md Tusfiqur Alam, Devansh Srivastav, Md Abdul Kadir, Daniel Sonntag
 
-Large Language Models (LLMs) have demonstrated impressive reasoning
-capabilities, yet their performance is highly dependent on the prompting
-strategy and model scale. While reinforcement learning and fine-tuning have
-been deployed to boost reasoning, these approaches incur substantial
-computational and data overhead. In this work, we introduce Adaptive Graph of
-Thoughts (AGoT), a dynamic, graph-based inference framework that enhances LLM
-reasoning solely at test time. Rather than relying on fixed-step methods like
-Chain of Thought (CoT) or Tree of Thoughts (ToT), AGoT recursively decomposes
-complex queries into structured subproblems, forming an dynamic directed
-acyclic graph (DAG) of interdependent reasoning steps. By selectively expanding
-only those subproblems that require further analysis, AGoT unifies the
-strengths of chain, tree, and graph paradigms into a cohesive framework that
-allocates computation where it is most needed. We validate our approach on
-diverse benchmarks spanning multi-hop retrieval, scientific reasoning, and
-mathematical problem-solving, achieving up to 46.2% improvement on scientific
-reasoning tasks (GPQA) - comparable to gains achieved through computationally
-intensive reinforcement learning approaches and outperforming state-of-the-art
-iterative approaches. These results suggest that dynamic decomposition and
-structured recursion offer a scalable, cost-effective alternative to
-post-training modifications, paving the way for more robust, general-purpose
-reasoning in LLMs.
+Deep learning has advanced medical image classification, but interpretability
+challenges hinder its clinical adoption. This study enhances interpretability
+in Chest X-ray (CXR) classification by using concept bottleneck models (CBMs)
+and a multi-agent Retrieval-Augmented Generation (RAG) system for report
+generation. By modeling relationships between visual features and clinical
+concepts, we create interpretable concept vectors that guide a multi-agent RAG
+system to generate radiology reports, enhancing clinical relevance,
+explainability, and transparency. Evaluation of the generated reports using an
+LLM-as-a-judge confirmed the interpretability and clinical utility of our
+model's outputs. On the COVID-QU dataset, our model achieved 81% classification
+accuracy and demonstrated robust report generation performance, with five key
+metrics ranging between 84% and 90%. This interpretable multi-agent framework
+bridges the gap between high-performance AI and the explainability required for
+reliable AI-driven CXR analysis in clinical settings. Our code is available at
+https://github.com/tifat58/IRR-with-CBM-RAG.git.
 
-摘要：大型語言模型 (LLM) 已展現令人印象深刻的推理能力，但其效能高度依賴於提示策略和模型規模。雖然強化學習和微調已被用於提升推理，但這些方法會造成大量的運算和資料開銷。在這項工作中，我們引入了「適應性思考圖」(AGoT)，一個動態的、基於圖形的推論架構，它僅在測試時就能增強 LLM 推理。AGoT 並非依賴於鏈式思考 (CoT) 或樹狀思考 (ToT) 等固定步驟方法，而是遞迴地將複雜的查詢分解成結構化的子問題，形成一個由相互依賴的推理步驟所組成的動態有向無環圖 (DAG)。透過選擇性地僅擴充那些需要進一步分析的子問題，AGoT 將鏈式、樹狀和圖形範例的優勢統一到一個緊密的架構中，將運算分配到最需要的地方。我們在跨越多重跳躍檢索、科學推理和數學問題解決等多樣基準上驗證了我們的做法，在科學推理任務 (GPQA) 上達到了高達 46.2% 的改進，這與透過運算密集的強化學習方法所獲得的增益相當，並且優於最先進的迭代方法。這些結果表明，動態分解和結構化遞迴提供了一個可擴充、具成本效益的替代方案，用於訓練後修改，為 LLM 中更強健、更通用的推理鋪平了道路。
+摘要：深度學習已提升醫學影像分類，但可解釋性挑戰阻礙其臨床應用。本研究透過使用概念瓶頸模型 (CBM) 和多代理檢索增強生成 (RAG) 系統進行報告生成，來增強胸部 X 光 (CXR) 分類的可解釋性。透過建模視覺特徵與臨床概念之間的關係，我們建立可解釋的概念向量，引導多代理 RAG 系統生成放射報告，增強臨床相關性、可解釋性和透明度。使用 LLM 作為評審員對生成報告進行評估，確認了我們模型輸出的可解釋性和臨床效用。在 COVID-QU 資料集上，我們的模型達到了 81% 的分類準確率，並展示了穩健的報告生成效能，五項關鍵指標介於 84% 至 90% 之間。這個可解釋的多代理架構彌合了高性能 AI 與臨床環境中可靠的 AI 驅動 CXR 分析所需的解釋性之間的差距。我們的程式碼可於 https://github.com/tifat58/IRR-with-CBM-RAG.git 取得。
 
-##### **Enhancing Knowledge Graph Construction: Evaluating with Emphasis on Hallucination, Omission, and Graph Similarity Metrics**
-2502.05239v1 by Hussam Ghanem, Christophe Cruz
+##### **Critique of Impure Reason: Unveiling the reasoning behaviour of medical Large Language Models**
+2412.15748v1 by Shamus Sim, Tyrone Chen
 
-Recent advancements in large language models have demonstrated significant
-potential in the automated construction of knowledge graphs from unstructured
-text. This paper builds upon our previous work [16], which evaluated various
-models using metrics like precision, recall, F1 score, triple matching, and
-graph matching, and introduces a refined approach to address the critical
-issues of hallucination and omission. We propose an enhanced evaluation
-framework incorporating BERTScore for graph similarity, setting a practical
-threshold of 95% for graph matching. Our experiments focus on the Mistral
-model, comparing its original and fine-tuned versions in zero-shot and few-shot
-settings. We further extend our experiments using examples from the KELM-sub
-training dataset, illustrating that the fine-tuned model significantly improves
-knowledge graph construction accuracy while reducing the exact hallucination
-and omission. However, our findings also reveal that the fine-tuned models
-perform worse in generalization tasks on the KELM-sub dataset. This study
-underscores the importance of comprehensive evaluation metrics in advancing the
-state-of-the-art in knowledge graph construction from textual data.
+Background: Despite the current ubiquity of Large Language Models (LLMs)
+across the medical domain, there is a surprising lack of studies which address
+their reasoning behaviour. We emphasise the importance of understanding
+reasoning behaviour as opposed to high-level prediction accuracies, since it is
+equivalent to explainable AI (XAI) in this context. In particular, achieving
+XAI in medical LLMs used in the clinical domain will have a significant impact
+across the healthcare sector. Results: Therefore, we define the concept of
+reasoning behaviour in the specific context of medical LLMs. We then categorise
+and discuss the current state of the art of methods which evaluate reasoning
+behaviour in medical LLMs. Finally, we propose theoretical frameworks which can
+empower medical professionals or machine learning engineers to gain insight
+into the low-level reasoning operations of these previously obscure models.
+Conclusion: The subsequent increased transparency and trust in medical machine
+learning models by clinicians as well as patients will accelerate the
+integration, application as well as further development of medical AI for the
+healthcare system as a whole
 
-摘要：大型語言模型的最新進展已證明在從非結構化文字自動建構知識圖譜方面具有顯著的潛力。本文建立在我們先前的研究 [16] 之上，該研究使用準確度、召回率、F1 分數、三元組匹配和圖形匹配等指標評估各種模型，並引入了一種改進的方法來解決幻覺和遺漏的關鍵問題。我們提出一個增強的評估框架，結合 BERTScore 來進行圖形相似性，並將圖形匹配的實際閾值設定為 95%。我們的實驗重點在 Mistral 模型上，比較其原始版本和微調版本在零次學習和少量學習的設定中。我們進一步使用 KELM-sub 訓練資料集中的範例來擴展我們的實驗，說明微調後的模型顯著提高了知識圖譜建構的準確度，同時減少了精確的幻覺和遺漏。然而，我們的研究結果也顯示，微調後的模型在 KELM-sub 資料集上的泛化任務表現較差。這項研究強調了全面評估指標在推進從文字資料建構知識圖譜的最新技術方面的重要性。
+摘要：背景：儘管大型語言模型 (LLM) 目前在醫療領域無所不在，但令人驚訝的是，探討其推理行為的研究卻相當缺乏。我們強調了解推理行為而非高層級的預測準確度非常重要，因為在這種情況下，這等同於可解釋 AI (XAI)。尤其是在臨床領域中使用的醫療 LLM 中實現 XAI，將對整個醫療保健產業產生重大影響。結果：因此，我們在醫療 LLM 的特定背景下定義了推理行為的概念。接著我們分類並探討當前評估醫療 LLM 中推理行為的方法的最新技術。最後，我們提出理論架構，讓醫療專業人員或機器學習工程師得以深入了解這些先前模糊模型的低層級推理運算。結論：臨床醫生和患者對醫療機器學習模型的透明度和信任度隨之提升，將加速醫療 AI 在整個醫療保健系統中的整合、應用和進一步發展。
 
-##### **Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research**
-2502.04644v1 by Junde Wu, Jiayuan Zhu, Yuyuan Liu
+##### **Cognition Chain for Explainable Psychological Stress Detection on Social Media**
+2412.14009v1 by Xin Wang, Boyan Gao, Yi Dai, Lei Cao, Liang Zhao, Yibo Yang, David Clifton
 
-We introduce Agentic Reasoning, a framework that enhances large language
-model (LLM) reasoning by integrating external tool-using agents. Unlike
-conventional LLM-based reasoning approaches, which rely solely on internal
-inference, Agentic Reasoning dynamically engages web search, code execution,
-and structured reasoning-context memory to solve complex problems requiring
-deep research and multi-step logical deduction. Our framework introduces the
-Mind Map agent, which constructs a structured knowledge graph to track logical
-relationships, improving deductive reasoning. Additionally, the integration of
-web-search and coding agents enables real-time retrieval and computational
-analysis, enhancing reasoning accuracy and decision-making. Evaluations on
-PhD-level scientific reasoning (GPQA) and domain-specific deep research tasks
-demonstrate that our approach significantly outperforms existing models,
-including leading retrieval-augmented generation (RAG) systems and
-closed-source LLMs. Moreover, our results indicate that agentic reasoning
-improves expert-level knowledge synthesis, test-time scalability, and
-structured problem-solving. The code is at:
-https://github.com/theworldofagents/Agentic-Reasoning.
+Stress is a pervasive global health issue that can lead to severe mental
+health problems. Early detection offers timely intervention and prevention of
+stress-related disorders. The current early detection models perform "black
+box" inference suffering from limited explainability and trust which blocks the
+real-world clinical application. Thanks to the generative properties introduced
+by the Large Language Models (LLMs), the decision and the prediction from such
+models are semi-interpretable through the corresponding description. However,
+the existing LLMs are mostly trained for general purposes without the guidance
+of psychological cognitive theory. To this end, we first highlight the
+importance of prior theory with the observation of performance boosted by the
+chain-of-thoughts tailored for stress detection. This method termed Cognition
+Chain explicates the generation of stress through a step-by-step cognitive
+perspective based on cognitive appraisal theory with a progress pipeline:
+Stimulus $\rightarrow$ Evaluation $\rightarrow$ Reaction $\rightarrow$ Stress
+State, guiding LLMs to provide comprehensive reasoning explanations. We further
+study the benefits brought by the proposed Cognition Chain format by utilising
+it as a synthetic dataset generation template for LLMs instruction-tuning and
+introduce CogInstruct, an instruction-tuning dataset for stress detection. This
+dataset is developed using a three-stage self-reflective annotation pipeline
+that enables LLMs to autonomously generate and refine instructional data. By
+instruction-tuning Llama3 with CogInstruct, we develop CogLLM, an explainable
+stress detection model. Evaluations demonstrate that CogLLM achieves
+outstanding performance while enhancing explainability. Our work contributes a
+novel approach by integrating cognitive theories into LLM reasoning processes,
+offering a promising direction for future explainable AI research.
 
-摘要：我們引入了代理推理，一個透過整合外部工具使用代理來增強大型語言模型 (LLM) 推理的框架。與僅依賴於內部推論的傳統基於 LLM 的推理方法不同，代理推理動態地運用網路搜尋、程式碼執行和結構化推理情境記憶來解決需要深入研究和多步驟邏輯推論的複雜問題。我們的框架引入了心智圖代理，它建立一個結構化的知識圖譜來追蹤邏輯關係，改善演繹推理。此外，整合網路搜尋和編碼代理能進行即時擷取和運算分析，增強推理準確度和決策制定。在博士等級科學推理 (GPQA) 和特定領域的深入研究任務上的評估顯示，我們的做法明顯優於現有模型，包括領先的檢索增強生成 (RAG) 系統和封閉原始碼 LLM。此外，我們的結果顯示，代理推理改進了專家級知識綜合、測試時間可擴充性和結構化問題解決。程式碼在：https://github.com/theworldofagents/Agentic-Reasoning。
+摘要：壓力是一個普遍的全球性健康問題，可能會導致嚴重的精神
+健康問題。早期發現提供及時的干預和預防
+壓力相關疾病。目前的早期發現模型執行「黑
+盒子」推論，存在可解釋性和信任度有限的問題，阻礙了
+現實世界的臨床應用。多虧了大型語言模型 (LLM) 引入的生成屬性，此類
+模型的決策和預測通過對應描述具有半可解釋性。然而，
+現有的 LLM 主要針對一般用途進行訓練，沒有心理認知理論的指導。為此，我們首先強調
+先驗理論的重要性，並觀察到針對壓力檢測量身定制的思想鏈提升了性能。這種方法稱為認知
+鏈通過基於認知評估理論的循序漸進的認知視角闡明了壓力的產生，並具有進度管道：
+刺激 $\rightarrow$ 評估 $\rightarrow$ 反應 $\rightarrow$ 壓力
+狀態，指導 LLM 提供全面的推理解釋。我們進一步
+通過將其用作 LLM 指令調整的合成數據集生成模板來研究所提出的認知鏈格式帶來的優點，並介紹 CogInstruct，這是一個針對壓力檢測的指令調整數據集。這個
+數據集是使用一個三階段的自省標註管道開發的，使 LLM 能夠自主生成和優化指令數據。通過
+使用 CogInstruct 對 Llama3 進行指令調整，我們開發了 CogLLM，這是一個可解釋的
+壓力檢測模型。評估表明，CogLLM 在提高可解釋性的同時實現了出色的性能。我們的研究通過將認知理論整合到 LLM 推理過程中，提出了一種新穎的方法，
+為未來的可解釋人工智能研究提供了一個有希望的方向。
 
-##### **Position-aware Automatic Circuit Discovery**
-2502.04577v1 by Tal Haklay, Hadas Orgad, David Bau, Aaron Mueller, Yonatan Belinkov
+##### **2-Factor Retrieval for Improved Human-AI Decision Making in Radiology**
+2412.00372v1 by Jim Solomon, Laleh Jalilian, Alexander Vilesov, Meryl Mathew, Tristan Grogan, Arash Bedayat, Achuta Kadambi
 
-A widely used strategy to discover and understand language model mechanisms
-is circuit analysis. A circuit is a minimal subgraph of a model's computation
-graph that executes a specific task. We identify a gap in existing circuit
-discovery methods: they assume circuits are position-invariant, treating model
-components as equally relevant across input positions. This limits their
-ability to capture cross-positional interactions or mechanisms that vary across
-positions. To address this gap, we propose two improvements to incorporate
-positionality into circuits, even on tasks containing variable-length examples.
-First, we extend edge attribution patching, a gradient-based method for circuit
-discovery, to differentiate between token positions. Second, we introduce the
-concept of a dataset schema, which defines token spans with similar semantics
-across examples, enabling position-aware circuit discovery in datasets with
-variable length examples. We additionally develop an automated pipeline for
-schema generation and application using large language models. Our approach
-enables fully automated discovery of position-sensitive circuits, yielding
-better trade-offs between circuit size and faithfulness compared to prior work.
+Human-machine teaming in medical AI requires us to understand to what degree
+a trained clinician should weigh AI predictions. While previous work has shown
+the potential of AI assistance at improving clinical predictions, existing
+clinical decision support systems either provide no explainability of their
+predictions or use techniques like saliency and Shapley values, which do not
+allow for physician-based verification. To address this gap, this study
+compares previously used explainable AI techniques with a newly proposed
+technique termed '2-factor retrieval (2FR)', which is a combination of
+interface design and search retrieval that returns similarly labeled data
+without processing this data. This results in a 2-factor security blanket
+where: (a) correct images need to be retrieved by the AI; and (b) humans should
+associate the retrieved images with the current pathology under test. We find
+that when tested on chest X-ray diagnoses, 2FR leads to increases in clinician
+accuracy, with particular improvements when clinicians are radiologists and
+have low confidence in their decision. Our results highlight the importance of
+understanding how different modes of human-AI decision making may impact
+clinician accuracy in clinical decision support systems.
 
-摘要：廣泛用於發現和了解語言模型機制的策略是電路分析。電路是模型計算圖的最小子圖，可執行特定任務。我們找出電路發現方法中的一個缺口：它們假設電路與位置無關，將模型組件視為在輸入位置中同樣相關。這限制了它們捕捉跨位置互動或在不同位置中變化的機制的能力。為了解決這個缺口，我們提出兩項改進，將位置性納入電路中，即使在包含變長範例的任務中也是如此。首先，我們擴充邊緣屬性修補，一種基於梯度的電路發現方法，以區分符號位置。其次，我們引入了資料集架構的概念，它定義了在範例中具有類似語義的符號跨距，使我們可以在具有變長範例的資料集中進行與位置相關的電路發現。此外，我們開發了一個自動化管線，用於使用大型語言模型進行架構生成和應用。我們的做法能讓位置敏感電路的發現完全自動化，與先前的研究相比，在電路大小和忠實度之間產生了更好的權衡。
+摘要：人機協作在醫療 AI 中，需要我們理解受過訓練的臨床醫生在多大程度上應重視 AI 預測。雖然先前的研究顯示 AI 輔助在改善臨床預測方面的潛力，但現有的臨床決策支援系統，要不就沒有提供預測的可解釋性，要不就是使用像顯著性和 Shapley 值之類的技術，這些技術不允許基於醫生的驗證。為了解決這個差距，本研究將先前使用的可解釋 AI 技術與一種新提出的稱為「2 因子檢索 (2FR)」的技術進行比較，後者是一種介面設計和搜尋檢索的組合，它會傳回標籤相似的資料，而不會處理這些資料。這會產生一個 2 因子安全機制，其中：(a) 正確的影像需要由 AI 檢索；(b) 人類應將檢索的影像與正在測試中的病理聯想起來。我們發現，當在胸部 X 光診斷上進行測試時，2FR 會提高臨床醫生的準確度，特別是在臨床醫生是放射科醫生且對其決策信心不足時，會有顯著的改善。我們的結果強調了理解人機決策的不同模式如何影響臨床醫生在臨床決策支援系統中的準確性的重要性。
 
-##### **Heterogeneous Swarms: Jointly Optimizing Model Roles and Weights for Multi-LLM Systems**
-2502.04510v1 by Shangbin Feng, Zifeng Wang, Palash Goyal, Yike Wang, Weijia Shi, Huang Xia, Hamid Palangi, Luke Zettlemoyer, Yulia Tsvetkov, Chen-Yu Lee, Tomas Pfister
+##### **Mapping Public Perception of Artificial Intelligence: Expectations, Risk-Benefit Tradeoffs, and Value As Determinants for Societal Acceptance**
+2411.19356v1 by Philipp Brauner, Felix Glawe, Gian Luca Liehner, Luisa Vervier, Martina Ziefle
 
-We propose Heterogeneous Swarms, an algorithm to design multi-LLM systems by
-jointly optimizing model roles and weights. We represent multi-LLM systems as
-directed acyclic graphs (DAGs) of LLMs with topological message passing for
-collaborative generation. Given a pool of LLM experts and a utility function,
-Heterogeneous Swarms employs two iterative steps: role-step and weight-step.
-For role-step, we interpret model roles as learning a DAG that specifies the
-flow of inputs and outputs between LLMs. Starting from a swarm of random
-continuous adjacency matrices, we decode them into discrete DAGs, call the LLMs
-in topological order, evaluate on the utility function (e.g. accuracy on a
-task), and optimize the adjacency matrices with particle swarm optimization
-based on the utility score. For weight-step, we assess the contribution of
-individual LLMs in the multi-LLM systems and optimize model weights with swarm
-intelligence. We propose JFK-score to quantify the individual contribution of
-each LLM in the best-found DAG of the role-step, then optimize model weights
-with particle swarm optimization based on the JFK-score. Experiments
-demonstrate that Heterogeneous Swarms outperforms 15 role- and/or weight-based
-baselines by 18.5% on average across 12 tasks. Further analysis reveals that
-Heterogeneous Swarms discovers multi-LLM systems with heterogeneous model roles
-and substantial collaborative gains, and benefits from the diversity of
-language models.
+Understanding public perception of artificial intelligence (AI) and the
+tradeoffs between potential risks and benefits is crucial, as these perceptions
+might shape policy decisions, influence innovation trajectories for successful
+market strategies, and determine individual and societal acceptance of AI
+technologies. Using a representative sample of 1100 participants from Germany,
+this study examines mental models of AI. Participants quantitatively evaluated
+71 statements about AI's future capabilities (e.g., autonomous driving, medical
+care, art, politics, warfare, and societal divides), assessing the expected
+likelihood of occurrence, perceived risks, benefits, and overall value. We
+present rankings of these projections alongside visual mappings illustrating
+public risk-benefit tradeoffs. While many scenarios were deemed likely,
+participants often associated them with high risks, limited benefits, and low
+overall value. Across all scenarios, 96.4% ($r^2=96.4\%$) of the variance in
+value assessment can be explained by perceived risks ($\beta=-.504$) and
+perceived benefits ($\beta=+.710$), with no significant relation to expected
+likelihood. Demographics and personality traits influenced perceptions of
+risks, benefits, and overall evaluations, underscoring the importance of
+increasing AI literacy and tailoring public information to diverse user needs.
+These findings provide actionable insights for researchers, developers, and
+policymakers by highlighting critical public concerns and individual factors
+essential to align AI development with individual values.
+
+摘要：<paragraph>了解公眾對人工智慧 (AI) 的認知以及潛在風險與好處之間的權衡至關重要，因為這些認知可能會影響政策決策、影響成功市場策略的創新軌跡，並決定個人和社會對 AI 技術的接受度。本研究使用來自德國的 1100 名參與者的代表性樣本，探討了 AI 的心智模型。參與者對 71 項關於 AI 未來能力的陳述（例如，自動駕駛、醫療保健、藝術、政治、戰爭和社會分歧）進行了定量評估，評估預期的發生可能性、感知風險、好處和整體價值。我們展示了這些預測的排名，並附上視覺化映射，說明了公眾的風險收益權衡。儘管許多場景被認為是可能的，但參與者通常將它們與高風險、有限的好處和低整體價值聯繫起來。在所有場景中，96.4% ($r^2=96.4\%$) 的價值評估差異可以用感知風險 ($\beta=-.504$) 和感知好處 ($\beta=+.710$) 來解釋，與預期的可能性沒有顯著關係。人口統計和人格特質影響了對風險、好處和整體評估的看法，這凸顯了提高 AI 素養和根據不同的使用者需求調整公共資訊的重要性。這些發現通過強調關鍵的公共關注和與個人價值觀一致的 AI 開發必不可少的個人因素，為研究人員、開發人員和政策制定者提供了可行的見解。</paragraph>
+
+##### **Explainable AI for Classifying UTI Risk Groups Using a Real-World Linked EHR and Pathology Lab Dataset**
+2411.17645v2 by Yujie Dai, Brian Sullivan, Axel Montout, Amy Dillon, Chris Waller, Peter Acs, Rachel Denholm, Philip Williams, Alastair D Hay, Raul Santos-Rodriguez, Andrew Dowsey
+
+The use of machine learning and AI on electronic health records (EHRs) holds
+substantial potential for clinical insight. However, this approach faces
+challenges due to data heterogeneity, sparsity, temporal misalignment, and
+limited labeled outcomes. In this context, we leverage a linked EHR dataset of
+approximately one million de-identified individuals from Bristol, North
+Somerset, and South Gloucestershire, UK, to characterize urinary tract
+infections (UTIs). We implemented a data pre-processing and curation pipeline
+that transforms the raw EHR data into a structured format suitable for
+developing predictive models focused on data fairness, accountability and
+transparency. Given the limited availability and biases of ground truth UTI
+outcomes, we introduce a UTI risk estimation framework informed by clinical
+expertise to estimate UTI risk across individual patient timelines. Pairwise
+XGBoost models are trained using this framework to differentiate UTI risk
+categories with explainable AI techniques applied to identify key predictors
+and support interpretability. Our findings reveal differences in clinical and
+demographic predictors across risk groups. While this study highlights the
+potential of AI-driven insights to support UTI clinical decision-making,
+further investigation of patient sub-strata and extensive validation are needed
+to ensure robustness and applicability in clinical practice.
 
-摘要：<paragraph>我們提出異質群體，一種演算法，透過共同最佳化模型角色和權重來設計多 LLM 系統。我們將多 LLM 系統表示為 LLM 的有向非循環圖 (DAG)，並透過拓撲訊息傳遞進行協作產生。給定一組 LLM 專家和一個效用函數，異質群體使用兩個反覆步驟：角色步驟和權重步驟。對於角色步驟，我們將模型角色解釋為學習一個 DAG，它指定 LLM 之間輸入和輸出的流動。從一組隨機連續鄰接矩陣開始，我們將它們解碼為離散 DAG，以拓撲順序呼叫 LLM，根據效用函數（例如任務的準確度）進行評估，並根據效用分數使用粒子群最佳化最佳化鄰接矩陣。對於權重步驟，我們評估個別 LLM 在多 LLM 系統中的貢獻，並使用群體智慧最佳化模型權重。我們提出 JFK 分數來量化每個 LLM 在角色步驟中找到的最佳 DAG 中的個別貢獻，然後根據 JFK 分數使用粒子群最佳化最佳化模型權重。實驗表明，異質群體在 12 項任務中平均比 15 個基於角色和/或權重的基線高出 18.5%。進一步的分析表明，異質群體發現具有異質模型角色和大量協作收益的多 LLM 系統，並受益於語言模型的多樣性。</paragraph>
+摘要：電子健康紀錄 (EHR) 中機器學習和 AI 的使用對於臨床見解具有相當大的潛力。然而，由於資料異質性、稀疏性、時間錯位和標籤結果有限，此方法面臨挑戰。在此背景下，我們利用來自英國布里斯托、北薩默塞特和南格洛斯特郡約一百萬名去識別個人連結的 EHR 資料集，來描述尿路感染 (UTI)。我們實施了將原始 EHR 資料轉換為結構化格式的資料前處理和整理管線，適合開發專注於資料公平性、問責制和透明度的預測模型。鑑於 UTI 真實結果的可用性有限和偏差，我們引入了由臨床專業知識告知的 UTI 風險評估架構，以估計個別患者時間軸上的 UTI 風險。成對的 XGBoost 模型使用此架構進行訓練，以區分 UTI 風險類別，並應用可解釋的 AI 技術來識別關鍵預測因子並支持可解釋性。我們的研究結果揭示了不同風險群組在臨床和人口統計預測因子上的差異。雖然這項研究強調了 AI 驅動見解在支援 UTI 臨床決策制定方面的潛力，但仍需要進一步調查患者子群體和廣泛驗證，以確保在臨床實務中的穩健性和適用性。
 
-##### **MedRAG: Enhancing Retrieval-augmented Generation with Knowledge Graph-Elicited Reasoning for Healthcare Copilot**
-2502.04413v1 by Xuejiao Zhao, Siyan Liu, Su-Yin Yang, Chunyan Miao
+##### **Exploring the Requirements of Clinicians for Explainable AI Decision Support Systems in Intensive Care**
+2411.11774v1 by Jeffrey N. Clark, Matthew Wragg, Emily Nielsen, Miquel Perello-Nieto, Nawid Keshtmand, Michael Ambler, Shiv Sharma, Christopher P. Bourdeaux, Amberly Brigden, Raul Santos-Rodriguez
 
-Retrieval-augmented generation (RAG) is a well-suited technique for
-retrieving privacy-sensitive Electronic Health Records (EHR). It can serve as a
-key module of the healthcare copilot, helping reduce misdiagnosis for
-healthcare practitioners and patients. However, the diagnostic accuracy and
-specificity of existing heuristic-based RAG models used in the medical domain
-are inadequate, particularly for diseases with similar manifestations. This
-paper proposes MedRAG, a RAG model enhanced by knowledge graph (KG)-elicited
-reasoning for the medical domain that retrieves diagnosis and treatment
-recommendations based on manifestations. MedRAG systematically constructs a
-comprehensive four-tier hierarchical diagnostic KG encompassing critical
-diagnostic differences of various diseases. These differences are dynamically
-integrated with similar EHRs retrieved from an EHR database, and reasoned
-within a large language model. This process enables more accurate and specific
-decision support, while also proactively providing follow-up questions to
-enhance personalized medical decision-making. MedRAG is evaluated on both a
-public dataset DDXPlus and a private chronic pain diagnostic dataset (CPDD)
-collected from Tan Tock Seng Hospital, and its performance is compared against
-various existing RAG methods. Experimental results show that, leveraging the
-information integration and relational abilities of the KG, our MedRAG provides
-more specific diagnostic insights and outperforms state-of-the-art models in
-reducing misdiagnosis rates. Our code will be available at
-https://github.com/SNOWTEAM2023/MedRAG
+There is a growing need to understand how digital systems can support
+clinical decision-making, particularly as artificial intelligence (AI) models
+become increasingly complex and less human-interpretable. This complexity
+raises concerns about trustworthiness, impacting safe and effective adoption of
+such technologies. Improved understanding of decision-making processes and
+requirements for explanations coming from decision support tools is a vital
+component in providing effective explainable solutions. This is particularly
+relevant in the data-intensive, fast-paced environments of intensive care units
+(ICUs). To explore these issues, group interviews were conducted with seven ICU
+clinicians, representing various roles and experience levels. Thematic analysis
+revealed three core themes: (T1) ICU decision-making relies on a wide range of
+factors, (T2) the complexity of patient state is challenging for shared
+decision-making, and (T3) requirements and capabilities of AI decision support
+systems. We include design recommendations from clinical input, providing
+insights to inform future AI systems for intensive care.
 
-摘要：檢索增強生成 (RAG) 是一種適用於檢索隱私敏感的電子健康記錄 (EHR) 的技術。它可以作為醫療保健副駕駛的一個關鍵模組，協助減少醫療保健從業人員和患者的誤診。然而，在醫療領域中使用的現有基於啟發法的 RAG 模型的診斷準確性和特異性不足，特別是對於具有類似表現的疾病。本文提出 MedRAG，一種由知識圖譜 (KG) 引發的推理增強的 RAG 模型，用於醫療領域，它根據表現檢索診斷和治療建議。MedRAG 系統性地構建了一個全面的四層階層式診斷 KG，涵蓋各種疾病的關鍵診斷差異。這些差異與從 EHR 資料庫中檢索到的類似 EHR 動態整合，並在大型語言模型中進行推理。這個過程可以實現更準確和具體的決策支援，同時主動提供後續問題，以增強個人化醫療決策制定。MedRAG 在公共資料集 DDXPlus 和從陳篤生醫院收集的私人慢性疼痛診斷資料集 (CPDD) 上進行評估，並將其效能與各種現有 RAG 方法進行比較。實驗結果顯示，利用 KG 的資訊整合和關係能力，我們的 MedRAG 提供了更具體的診斷見解，並在降低誤診率方面優於最先進的模型。我們的程式碼將在 https://github.com/SNOWTEAM2023/MedRAG 提供
+摘要：隨著人工智慧 (AI) 模型變得越來越複雜，且越來越難以被人理解，了解數位系統如何支援臨床決策的需求也日益增加。這種複雜性引發了對可信度的疑慮，影響了此類技術的安全且有效採用。改善對決策制定流程的理解，以及對決策支援工具所提供說明的要求，是提供有效可解釋解決方案的重要組成部分。這在資料密集、快節奏的加護病房 (ICU) 環境中特別相關。為了探討這些問題，對七位 ICU 臨床醫師進行了小組訪談，這些醫師代表了不同的角色和經驗層級。主題分析揭露了三個核心主題：(T1) ICU 決策制定依賴於廣泛的因素，(T2) 病患狀態的複雜性對共同決策制定構成挑戰，以及 (T3) AI 決策支援系統的要求和能力。我們納入了臨床輸入的設計建議，提供見解以提供資訊給未來用於加護的 AI 系統。
 
-##### **Ontology-Guided, Hybrid Prompt Learning for Generalization in Knowledge Graph Question Answering**
-2502.03992v1 by Longquan Jiang, Junbo Huang, Cedric Möller, Ricardo Usbeck
+##### **Artificial Intelligence in Pediatric Echocardiography: Exploring Challenges, Opportunities, and Clinical Applications with Explainable AI and Federated Learning**
+2411.10255v1 by Mohammed Yaseen Jabarulla, Theodor Uden, Thomas Jack, Philipp Beerbaum, Steffen Oeltze-Jafra
 
-Most existing Knowledge Graph Question Answering (KGQA) approaches are
-designed for a specific KG, such as Wikidata, DBpedia or Freebase. Due to the
-heterogeneity of the underlying graph schema, topology and assertions, most
-KGQA systems cannot be transferred to unseen Knowledge Graphs (KGs) without
-resource-intensive training data. We present OntoSCPrompt, a novel Large
-Language Model (LLM)-based KGQA approach with a two-stage architecture that
-separates semantic parsing from KG-dependent interactions. OntoSCPrompt first
-generates a SPARQL query structure (including SPARQL keywords such as SELECT,
-ASK, WHERE and placeholders for missing tokens) and then fills them with
-KG-specific information. To enhance the understanding of the underlying KG, we
-present an ontology-guided, hybrid prompt learning strategy that integrates KG
-ontology into the learning process of hybrid prompts (e.g., discrete and
-continuous vectors). We also present several task-specific decoding strategies
-to ensure the correctness and executability of generated SPARQL queries in both
-stages. Experimental results demonstrate that OntoSCPrompt performs as well as
-SOTA approaches without retraining on a number of KGQA datasets such as CWQ,
-WebQSP and LC-QuAD 1.0 in a resource-efficient manner and can generalize well
-to unseen domain-specific KGs like DBLP-QuAD and CoyPu KG Code:
-\href{https://github.com/LongquanJiang/OntoSCPrompt}{https://github.com/LongquanJiang/OntoSCPrompt}
+Pediatric heart diseases present a broad spectrum of congenital and acquired
+diseases. More complex congenital malformations require a differentiated and
+multimodal decision-making process, usually including echocardiography as a
+central imaging method. Artificial intelligence (AI) offers considerable
+promise for clinicians by facilitating automated interpretation of pediatric
+echocardiography data. However, adapting AI technologies for pediatric
+echocardiography analysis has challenges such as limited public data
+availability, data privacy, and AI model transparency. Recently, researchers
+have focused on disruptive technologies, such as federated learning (FL) and
+explainable AI (XAI), to improve automatic diagnostic and decision support
+workflows. This study offers a comprehensive overview of the limitations and
+opportunities of AI in pediatric echocardiography, emphasizing the synergistic
+workflow and role of XAI and FL, identifying research gaps, and exploring
+potential future developments. Additionally, three relevant clinical use cases
+demonstrate the functionality of XAI and FL with a focus on (i) view
+recognition, (ii) disease classification, (iii) segmentation of cardiac
+structures, and (iv) quantitative assessment of cardiac function.
 
-摘要：現有的知識圖譜問答（KGQA）方法大多是為特定 KG 而設計的，例如 Wikidata、DBpedia 或 Freebase。由於底層圖形模式、拓撲和斷言的異質性，大多數 KGQA 系統無法在沒有資源密集型訓練資料的情況下轉移到未見過的知識圖譜（KG）。我們提出 OntoSCPrompt，這是一種基於大型語言模型（LLM）的新型 KGQA 方法，採用兩階段架構，將語義解析與依賴 KG 的互動分開。OntoSCPrompt 首先生成 SPARQL 查詢結構（包括 SPARQL 關鍵字，例如 SELECT、ASK、WHERE 和缺失令牌的佔位符），然後用 KG 特定的資訊填寫它們。為了增強對底層 KG 的理解，我們提出了一種由本体指導的混合提示學習策略，將 KG 本体整合到混合提示（例如，離散和連續向量）的學習過程中。我們還提出了多種特定任務的解碼策略，以確保在兩個階段中生成的 SPARQL 查詢的正確性和可執行性。實驗結果表明，OntoSCPrompt 在 CWQ、WebQSP 和 LC-QuAD 1.0 等多個 KGQA 資料集上執行時，效能與 SOTA 方法一樣好，且資源使用效率高，並且可以很好地概括到未見過的特定領域 KG，例如 DBLP-QuAD 和 CoyPu KG Code：
-\href{https://github.com/LongquanJiang/OntoSCPrompt}{https://github.com/LongquanJiang/OntoSCPrompt}
+摘要：小兒心臟疾病呈現先天性與後天性疾病的廣泛光譜。較複雜的先天性畸形需要一個差異化且多模式的決策過程，通常包括超音波檢查作為主要的影像方法。人工智慧 (AI) 為臨床醫生提供了相當大的希望，因為它可以促進小兒超音波檢查資料的自動化解讀。然而，將人工智慧技術應用於小兒超音波檢查分析有許多挑戰，例如有限的公開資料可用性、資料隱私和人工智慧模型透明度。最近，研究人員專注於破壞性技術，例如聯合學習 (FL) 和可解釋人工智慧 (XAI)，以改善自動診斷和決策支援工作流程。本研究提供了人工智慧在小兒超音波檢查中的限制和機會的全面概述，強調了 XAI 和 FL 的協同工作流程和角色，找出研究差距並探討潛在的未來發展。此外，三個相關的臨床使用案例展示了 XAI 和 FL 的功能，重點在於 (i) 檢視辨識、(ii) 疾病分類、(iii) 心臟結構分割和 (iv) 心臟功能的量化評估。
 
-##### **Multimodal Medical Code Tokenizer**
-2502.04397v2 by Xiaorui Su, Shvat Messica, Yepeng Huang, Ruth Johnson, Lukas Fesser, Shanghua Gao, Faryad Sahneh, Marinka Zitnik
+##### **Enhancing Osteoporosis Detection: An Explainable Multi-Modal Learning Framework with Feature Fusion and Variable Clustering**
+2411.00916v2 by Mehdi Hosseini Chagahi, Saeed Mohammadi Dashtaki, Niloufar Delfan, Nadia Mohammadi, Alireza Samari, Behzad Moshiri, Md. Jalil Piran, Oliver Faust
 
-Foundation models trained on patient electronic health records (EHRs) require
-tokenizing medical data into sequences of discrete vocabulary items. Existing
-tokenizers treat medical codes from EHRs as isolated textual tokens. However,
-each medical code is defined by its textual description, its position in
-ontological hierarchies, and its relationships to other codes, such as disease
-co-occurrences and drug-treatment associations. Medical vocabularies contain
-more than 600,000 codes with critical information for clinical reasoning. We
-introduce MedTok, a multimodal medical code tokenizer that uses the text
-descriptions and relational context of codes. MedTok processes text using a
-language model encoder and encodes the relational structure with a graph
-encoder. It then quantizes both modalities into a unified token space,
-preserving modality-specific and cross-modality information. We integrate
-MedTok into five EHR models and evaluate it on operational and clinical tasks
-across in-patient and out-patient datasets, including outcome prediction,
-diagnosis classification, drug recommendation, and risk stratification.
-Swapping standard EHR tokenizers with MedTok improves AUPRC across all EHR
-models, by 4.10% on MIMIC-III, 4.78% on MIMIC-IV, and 11.30% on EHRShot, with
-the largest gains in drug recommendation. Beyond EHR modeling, we demonstrate
-using MedTok tokenizer with medical QA systems. Our results demonstrate the
-potential of MedTok as a unified tokenizer for medical codes, improving
-tokenization for medical foundation models.
+Osteoporosis is a common condition that increases fracture risk, especially
+in older adults. Early diagnosis is vital for preventing fractures, reducing
+treatment costs, and preserving mobility. However, healthcare providers face
+challenges like limited labeled data and difficulties in processing medical
+images. This study presents a novel multi-modal learning framework that
+integrates clinical and imaging data to improve diagnostic accuracy and model
+interpretability. The model utilizes three pre-trained networks-VGG19,
+InceptionV3, and ResNet50-to extract deep features from X-ray images. These
+features are transformed using PCA to reduce dimensionality and focus on the
+most relevant components. A clustering-based selection process identifies the
+most representative components, which are then combined with preprocessed
+clinical data and processed through a fully connected network (FCN) for final
+classification. A feature importance plot highlights key variables, showing
+that Medical History, BMI, and Height were the main contributors, emphasizing
+the significance of patient-specific data. While imaging features were
+valuable, they had lower importance, indicating that clinical data are crucial
+for accurate predictions. This framework promotes precise and interpretable
+predictions, enhancing transparency and building trust in AI-driven diagnoses
+for clinical integration.
 
-摘要：<paragraph>在患者电子健康记录 (EHR) 上训练的基础模型需要将医学数据标记为离散词汇项序列。现有的标记器将 EHR 中的医学代码视为孤立的文本标记。然而，每个医学代码都由其文本描述、在本体层次结构中的位置以及与其他代码的关系（例如疾病共现和药物治疗关联）来定义。医学词汇表包含超过 600,000 个代码，这些代码包含临床推理的关键信息。我们引入了 MedTok，这是一种多模态医学代码标记器，它使用文本描述和代码的关系上下文。MedTok 使用语言模型编码器处理文本，并使用图编码器对关系结构进行编码。然后，它将这两种模态量化为一个统一的标记空间，保留特定于模态和跨模态的信息。我们将 MedTok 集成到五个 EHR 模型中，并在住院和门诊数据集（包括结果预测、诊断分类、药物推荐和风险分层）上对其实施操作和临床任务进行评估。用 MedTok 替换标准 EHR 标记器可提高所有 EHR 模型的 AUPRC，在 MIMIC-III 上提高 4.10%，在 MIMIC-IV 上提高 4.78%，在 EHRShot 上提高 11.30%，其中药物推荐的增益最大。除了 EHR 建模之外，我们还演示了将 MedTok 标记器与医学问答系统结合使用。我们的结果证明了 MedTok 作为医学代码的统一标记器的潜力，改进了医学基础模型的标记化。</paragraph>
+摘要：骨質疏鬆症是一種常見的疾病，會增加骨折的風險，特別是老年人。早期診斷對於預防骨折、降低治療成本和維持行動能力至關重要。然而，醫療保健提供者面臨著標記數據有限和處理醫學影像困難等挑戰。本研究提出了一個新穎的多模式學習框架，該框架整合了臨床和影像數據，以提高診斷準確性和模型可解釋性。該模型利用三個預訓練的網路，VGG19、InceptionV3 和 ResNet50，從 X 射線影像中提取深度特徵。這些特徵使用 PCA 轉換以降低維度並專注於最相關的組成部分。基於聚類的選擇過程識別出最具代表性的組成部分，然後將這些組成部分與預處理的臨床數據結合，並通過全連接網路 (FCN) 進行最終分類。特徵重要性圖突出了關鍵變數，表明病史、BMI 和身高是主要貢獻因素，強調了患者特定數據的重要性。雖然影像特徵很有價值，但它們的重要性較低，這表明臨床數據對於準確預測至關重要。此框架促进了準確且可解釋的預測，提高了透明度，並建立了對 AI 驅動診斷在臨床整合中的信任。
 
-##### **Division-of-Thoughts: Harnessing Hybrid Language Model Synergy for Efficient On-Device Agents**
-2502.04392v1 by Chenyang Shao, Xinyuan Hu, Yutang Lin, Fengli Xu
+##### **A Review of Deep Learning Approaches for Non-Invasive Cognitive Impairment Detection**
+2410.19898v1 by Muath Alsuhaibani, Ali Pourramezan Fard, Jian Sun, Farida Far Poor, Peter S. Pressman, Mohammad H. Mahoor
 
-The rapid expansion of web content has made on-device AI assistants
-indispensable for helping users manage the increasing complexity of online
-tasks. The emergent reasoning ability in large language models offer a
-promising path for next-generation on-device AI agents. However, deploying
-full-scale Large Language Models (LLMs) on resource-limited local devices is
-challenging. In this paper, we propose Division-of-Thoughts (DoT), a
-collaborative reasoning framework leveraging the synergy between locally
-deployed Smaller-scale Language Models (SLMs) and cloud-based LLMs. DoT
-leverages a Task Decomposer to elicit the inherent planning abilities in
-language models to decompose user queries into smaller sub-tasks, which allows
-hybrid language models to fully exploit their respective strengths. Besides,
-DoT employs a Task Scheduler to analyze the pair-wise dependency of sub-tasks
-and create a dependency graph, facilitating parallel reasoning of sub-tasks and
-the identification of key steps. To allocate the appropriate model based on the
-difficulty of sub-tasks, DoT leverages a Plug-and-Play Adapter, which is an
-additional task head attached to the SLM that does not alter the SLM's
-parameters. To boost adapter's task allocation capability, we propose a
-self-reinforced training method that relies solely on task execution feedback.
-Extensive experiments on various benchmarks demonstrate that our DoT
-significantly reduces LLM costs while maintaining competitive reasoning
-accuracy. Specifically, DoT reduces the average reasoning time and API costs by
-66.12% and 83.57%, while achieving comparable reasoning accuracy with the best
-baseline methods.
+This review paper explores recent advances in deep learning approaches for
+non-invasive cognitive impairment detection. We examine various non-invasive
+indicators of cognitive decline, including speech and language, facial, and
+motoric mobility. The paper provides an overview of relevant datasets,
+feature-extracting techniques, and deep-learning architectures applied to this
+domain. We have analyzed the performance of different methods across modalities
+and observed that speech and language-based methods generally achieved the
+highest detection performance. Studies combining acoustic and linguistic
+features tended to outperform those using a single modality. Facial analysis
+methods showed promise for visual modalities but were less extensively studied.
+Most papers focused on binary classification (impaired vs. non-impaired), with
+fewer addressing multi-class or regression tasks. Transfer learning and
+pre-trained language models emerged as popular and effective techniques,
+especially for linguistic analysis. Despite significant progress, several
+challenges remain, including data standardization and accessibility, model
+explainability, longitudinal analysis limitations, and clinical adaptation.
+Lastly, we propose future research directions, such as investigating
+language-agnostic speech analysis methods, developing multi-modal diagnostic
+systems, and addressing ethical considerations in AI-assisted healthcare. By
+synthesizing current trends and identifying key obstacles, this review aims to
+guide further development of deep learning-based cognitive impairment detection
+systems to improve early diagnosis and ultimately patient outcomes.
 
-摘要：<paragraph>網頁內容快速擴充，使得行動裝置上的 AI 助理在協助使用者管理日益複雜的線上工作上變得不可或缺。大型語言模型中浮現的推理能力為新一代行動裝置上的 AI 代理提供了一條有希望的途徑。然而，在資源有限的本機裝置上部署全規模的大型語言模型 (LLM) 是一項挑戰。在本文中，我們提出了思想分工 (DoT)，一個協作推理框架，利用了本地部署的小型語言模型 (SLM) 與雲端 LLM 之間的協同效應。DoT 利用任務分解器引出語言模型中固有的規劃能力，將使用者查詢分解成較小的子任務，這允許混合語言模型充分發揮其各自的優勢。此外，DoT 雇用了一個任務排程器來分析子任務的成對依賴性並建立一個依賴性圖，促進子任務的並行推理和關鍵步驟的識別。為了根據子任務的難度分配適當的模型，DoT 利用了即插即用適配器，這是一個附加在 SLM 上的任務頭，不會改變 SLM 的參數。為了提升適配器的任務分配能力，我們提出了一種自我強化訓練方法，它僅依賴於任務執行回饋。在各種基準上的廣泛實驗表明，我們的 DoT 大幅降低了 LLM 成本，同時維持了有競爭力的推理準確度。具體來說，DoT 將平均推理時間和 API 成本分別降低了 66.12% 和 83.57%，同時達到了與最佳基準方法相當的推理準確度。</paragraph>
+摘要：本篇評論探討了深度學習方法在非侵入式認知功能障礙檢測上的最新進展。我們檢視了各種非侵入式的認知衰退指標，包括語言和語言、面部和運動機能。本文概述了與此領域相關的資料集、特徵提取技術和深度學習架構。我們分析了不同方法在不同方式上的表現，並觀察到基於語言和語言的方法通常能達到最高的檢測表現。結合聲學和語言特徵的研究往往優於使用單一方式的研究。面部分析方法顯示出視覺方式的潛力，但研究較少。大多數論文專注於二元分類（受損與未受損），較少探討多類或回歸任務。遷移學習和預訓練語言模型已成為流行且有效的技術，特別是對於語言分析。儘管取得了重大進展，但仍存在一些挑戰，包括資料標準化和可及性、模型可解釋性、縱向分析限制和臨床適應性。最後，我們提出了未來的研究方向，例如調查與語言無關的語音分析方法、開發多模式診斷系統，以及解決人工智慧輔助醫療保健中的倫理考量。透過綜合目前的趨勢和找出關鍵障礙，本篇評論旨在引導深度學習為基礎的認知功能障礙檢測系統的進一步發展，以改善早期診斷，並最終改善患者的治療結果。
 
-##### **Boosting Knowledge Graph-based Recommendations through Confidence-Aware Augmentation with Large Language Models**
-2502.03715v1 by Rui Cai, Chao Wang, Qianyi Cai, Dazhong Shen, Hui Xiong
+##### **An Ontology-Enabled Approach For User-Centered and Knowledge-Enabled Explanations of AI Systems**
+2410.17504v1 by Shruthi Chari
 
-Knowledge Graph-based recommendations have gained significant attention due
-to their ability to leverage rich semantic relationships. However, constructing
-and maintaining Knowledge Graphs (KGs) is resource-intensive, and the accuracy
-of KGs can suffer from noisy, outdated, or irrelevant triplets. Recent
-advancements in Large Language Models (LLMs) offer a promising way to improve
-the quality and relevance of KGs for recommendation tasks. Despite this,
-integrating LLMs into KG-based systems presents challenges, such as efficiently
-augmenting KGs, addressing hallucinations, and developing effective joint
-learning methods. In this paper, we propose the Confidence-aware KG-based
-Recommendation Framework with LLM Augmentation (CKG-LLMA), a novel framework
-that combines KGs and LLMs for recommendation task. The framework includes: (1)
-an LLM-based subgraph augmenter for enriching KGs with high-quality
-information, (2) a confidence-aware message propagation mechanism to filter
-noisy triplets, and (3) a dual-view contrastive learning method to integrate
-user-item interactions and KG data. Additionally, we employ a confidence-aware
-explanation generation process to guide LLMs in producing realistic
-explanations for recommendations. Finally, extensive experiments demonstrate
-the effectiveness of CKG-LLMA across multiple public datasets.
+Explainable Artificial Intelligence (AI) focuses on helping humans understand
+the working of AI systems or their decisions and has been a cornerstone of AI
+for decades. Recent research in explainability has focused on explaining the
+workings of AI models or model explainability. There have also been several
+position statements and review papers detailing the needs of end-users for
+user-centered explainability but fewer implementations. Hence, this thesis
+seeks to bridge some gaps between model and user-centered explainability. We
+create an explanation ontology (EO) to represent literature-derived explanation
+types via their supporting components. We implement a knowledge-augmented
+question-answering (QA) pipeline to support contextual explanations in a
+clinical setting. Finally, we are implementing a system to combine explanations
+from different AI methods and data modalities. Within the EO, we can represent
+fifteen different explanation types, and we have tested these representations
+in six exemplar use cases. We find that knowledge augmentations improve the
+performance of base large language models in the contextualized QA, and the
+performance is variable across disease groups. In the same setting, clinicians
+also indicated that they prefer to see actionability as one of the main foci in
+explanations. In our explanations combination method, we plan to use similarity
+metrics to determine the similarity of explanations in a chronic disease
+detection setting. Overall, through this thesis, we design methods that can
+support knowledge-enabled explanations across different use cases, accounting
+for the methods in today's AI era that can generate the supporting components
+of these explanations and domain knowledge sources that can enhance them.
+
+摘要：可解釋人工智慧（AI）專注於協助人類了解 AI 系統運作或其決策，數十年來一直是 AI 的基石。最近的可解釋性研究專注於解釋 AI 模型或模型可解釋性的運作。也有幾份立場聲明和評論論文詳細說明了最終使用者對以使用者為中心的可解釋性的需求，但實作較少。因此，本論文旨在彌補模型和以使用者為中心的可解釋性之間的一些差距。我們建立一個解釋本體（EO）以透過其支援元件來表示從文獻中衍生的解釋類型。我們實作一個知識增強的問答（QA）管線，以在臨床環境中支援情境解釋。最後，我們正在實作一個系統，以結合來自不同 AI 方法和資料模式的解釋。在 EO 中，我們可以表示 15 種不同的解釋類型，並且我們已在六個範例使用案例中測試這些表示。我們發現，知識增強改善了基礎大型語言模型在情境化 QA 中的效能，並且效能因疾病群組而異。在相同的環境中，臨床醫生也表示他們希望將可操作性視為解釋中的主要焦點之一。在我們的解釋組合方法中，我們計畫使用相似性指標來確定慢性病偵測環境中解釋的相似性。總體而言，透過本論文，我們設計了可以在不同使用案例中支援知識啟用解釋的方法，考量到當今 AI 時代中可以產生這些解釋的支援元件和可以增強這些解釋的領域知識來源的方法。
+
+##### **Contrasting Attitudes Towards Current and Future AI Applications for Computerised Interpretation of ECG: A Clinical Stakeholder Interview Study**
+2410.16879v1 by Lukas Hughes-Noehrer, Leda Channer, Gabriel Strain, Gregory Yates, Richard Body, Caroline Jay
+
+Objectives: To investigate clinicians' attitudes towards current automated
+interpretation of ECG and novel AI technologies and their perception of
+computer-assisted interpretation. Materials and Methods: We conducted a series
+of interviews with clinicians in the UK. Our study: (i) explores the potential
+for AI, specifically future 'human-like' computing approaches, to facilitate
+ECG interpretation and support clinical decision making, and (ii) elicits their
+opinions about the importance of explainability and trustworthiness of AI
+algorithms. Results: We performed inductive thematic analysis on interview
+transcriptions from 23 clinicians and identified the following themes: (i) a
+lack of trust in current systems, (ii) positive attitudes towards future AI
+applications and requirements for these, (iii) the relationship between the
+accuracy and explainability of algorithms, and (iv) opinions on education,
+possible deskilling, and the impact of AI on clinical competencies. Discussion:
+Clinicians do not trust current computerised methods, but welcome future 'AI'
+technologies. Where clinicians trust future AI interpretation to be accurate,
+they are less concerned that it is explainable. They also preferred ECG
+interpretation that demonstrated the results of the algorithm visually. Whilst
+clinicians do not fear job losses, they are concerned about deskilling and the
+need to educate the workforce to use AI responsibly. Conclusion: Clinicians are
+positive about the future application of AI in clinical decision-making.
+Accuracy is a key factor of uptake and visualisations are preferred over
+current computerised methods. This is viewed as a potential means of training
+and upskilling, in contrast to the deskilling that automation might be
+perceived to bring.
 
-摘要：基於知識圖譜的推薦因其利用豐富語義關係的能力而備受關注。然而，構建和維護知識圖譜 (KG) 是一項資源密集型任務，而 KG 的準確性可能會受到雜訊、過時或無關的三元組的影響。大型語言模型 (LLM) 的最新進展為提高 KG 在推薦任務中的品質和相關性提供了一種有前途的方法。儘管如此，將 LLM 整合到基於 KG 的系統中會帶來挑戰，例如有效擴充 KG、處理幻覺，以及開發有效的聯合學習方法。在本文中，我們提出具有 LLM 擴充的信心感知型基於 KG 的推薦框架 (CKG-LLMA)，這是一個結合 KG 和 LLM 進行推薦任務的新穎框架。該框架包括：(1) 一個基於 LLM 的子圖擴充器，用於使用高品質資訊豐富 KG，(2) 一個信心感知型訊息傳播機制，用於過濾雜訊三元組，以及 (3) 一個雙視圖對比學習方法，用於整合使用者-項目互動和 KG 資料。此外，我們採用一個信心感知型解釋產生程序，以引導 LLM 為推薦產生逼真的解釋。最後，大量的實驗證明了 CKG-LLMA 在多個公開資料集中的有效性。
+摘要：<paragraph>目的：調查臨床醫生對目前自動化心電圖解讀和新的人工智慧技術的態度，以及他們對電腦輔助解讀的看法。材料和方法：我們對英國的臨床醫生進行了一系列訪談。我們的研究：(i) 探討人工智慧的潛力，特別是未來的「類人類」運算方法，以促進心電圖解讀並支持臨床決策制定，以及 (ii) 徵求他們對人工智慧演算法的可解釋性和可信度的看法。結果：我們對 23 位臨床醫生的訪談記錄進行了歸納主題分析，並找出以下主題：(i) 對目前系統缺乏信任，(ii) 對未來人工智慧應用和對這些應用的要求持正面態度，(iii) 演算法的準確性和可解釋性之間的關係，以及 (iv) 對教育、可能的技能退化，以及人工智慧對臨床能力的影響的看法。討論：臨床醫生不信任目前的電腦化方法，但歡迎未來的「人工智慧」技術。在臨床醫生相信未來的 AI 解讀準確的情況下，他們不太擔心它是否可解釋。他們也比較喜歡能以視覺方式呈現演算法結果的心電圖解讀。雖然臨床醫生不害怕失業，但他們擔心技能退化，以及需要教育員工負責任地使用人工智慧。結論：臨床醫生對人工智慧在臨床決策制定中的未來應用持正面態度。準確性是採用人工智慧的一個關鍵因素，而視覺化比目前的電腦化方法更受青睞。這被視為一種潛在的培訓和提升技能的方法，與自動化可能帶來的技能退化形成對比。</paragraph>
 
-##### **A Schema-Guided Reason-while-Retrieve framework for Reasoning on Scene Graphs with Large-Language-Models (LLMs)**
-2502.03450v1 by Yiye Chen, Harpreet Sawhney, Nicholas Gydé, Yanan Jian, Jack Saunders, Patricio Vela, Ben Lundell
+##### **Pathologist-like explainable AI for interpretable Gleason grading in prostate cancer**
+2410.15012v1 by Gesa Mittmann, Sara Laiouar-Pedari, Hendrik A. Mehrtens, Sarah Haggenmüller, Tabea-Clara Bucher, Tirtha Chanda, Nadine T. Gaisa, Mathias Wagner, Gilbert Georg Klamminger, Tilman T. Rau, Christina Neppl, Eva Maria Compérat, Andreas Gocht, Monika Hämmerle, Niels J. Rupp, Jula Westhoff, Irene Krücken, Maximillian Seidl, Christian M. Schürch, Marcus Bauer, Wiebke Solass, Yu Chun Tam, Florian Weber, Rainer Grobholz, Jaroslaw Augustyniak, Thomas Kalinski, Christian Hörner, Kirsten D. Mertz, Constanze Döring, Andreas Erbersdobler, Gabriele Deubler, Felix Bremmer, Ulrich Sommer, Michael Brodhun, Jon Griffin, Maria Sarah L. Lenon, Kiril Trpkov, Liang Cheng, Fei Chen, Angelique Levi, Guoping Cai, Tri Q. Nguyen, Ali Amin, Alessia Cimadamore, Ahmed Shabaik, Varsha Manucha, Nazeel Ahmad, Nidia Messias, Francesca Sanguedolce, Diana Taheri, Ezra Baraban, Liwei Jia, Rajal B. Shah, Farshid Siadat, Nicole Swarbrick, Kyung Park, Oudai Hassan, Siamak Sakhaie, Michelle R. Downes, Hiroshi Miyamoto, Sean R. Williamson, Tim Holland-Letz, Carolin V. Schneider, Jakob Nikolas Kather, Yuri Tolkach, Titus J. Brinker
 
-Scene graphs have emerged as a structured and serializable environment
-representation for grounded spatial reasoning with Large Language Models
-(LLMs). In this work, we propose SG-RwR, a Schema-Guided Retrieve-while-Reason
-framework for reasoning and planning with scene graphs. Our approach employs
-two cooperative, code-writing LLM agents: a (1) Reasoner for task planning and
-information queries generation, and a (2) Retriever for extracting
-corresponding graph information following the queries. Two agents collaborate
-iteratively, enabling sequential reasoning and adaptive attention to graph
-information. Unlike prior works, both agents are prompted only with the scene
-graph schema rather than the full graph data, which reduces the hallucination
-by limiting input tokens, and drives the Reasoner to generate reasoning trace
-abstractly.Following the trace, the Retriever programmatically query the scene
-graph data based on the schema understanding, allowing dynamic and global
-attention on the graph that enhances alignment between reasoning and retrieval.
-Through experiments in multiple simulation environments, we show that our
-framework surpasses existing LLM-based approaches in numerical Q\&A and
-planning tasks, and can benefit from task-level few-shot examples, even in the
-absence of agent-level demonstrations. Project code will be released.
+The aggressiveness of prostate cancer, the most common cancer in men
+worldwide, is primarily assessed based on histopathological data using the
+Gleason scoring system. While artificial intelligence (AI) has shown promise in
+accurately predicting Gleason scores, these predictions often lack inherent
+explainability, potentially leading to distrust in human-machine interactions.
+To address this issue, we introduce a novel dataset of 1,015 tissue microarray
+core images, annotated by an international group of 54 pathologists. The
+annotations provide detailed localized pattern descriptions for Gleason grading
+in line with international guidelines. Utilizing this dataset, we develop an
+inherently explainable AI system based on a U-Net architecture that provides
+predictions leveraging pathologists' terminology. This approach circumvents
+post-hoc explainability methods while maintaining or exceeding the performance
+of methods trained directly for Gleason pattern segmentation (Dice score: 0.713
+$\pm$ 0.003 trained on explanations vs. 0.691 $\pm$ 0.010 trained on Gleason
+patterns). By employing soft labels during training, we capture the intrinsic
+uncertainty in the data, yielding strong results in Gleason pattern
+segmentation even in the context of high interobserver variability. With the
+release of this dataset, we aim to encourage further research into segmentation
+in medical tasks with high levels of subjectivity and to advance the
+understanding of pathologists' reasoning processes.
 
-摘要：場景圖表已成為大型語言模型 (LLM) 以基礎空間推理為基礎的結構化且可序列化的環境表徵。在這項工作中，我們提出 SG-RwR，一個以綱要為導向的檢索與推理框架，用於場景圖表的推理和規劃。我們的做法採用了兩個協作的、編寫程式碼的 LLM 代理：一個 (1) 推論器，用於任務規劃和資訊查詢產生，以及一個 (2) 檢索器，用於根據查詢提取對應的圖形資訊。兩個代理反覆合作，實現對圖形資訊的順序推理和適應性關注。與先前的作品不同，兩個代理僅提示場景圖表綱要，而不是完整的圖形資料，這透過限制輸入代碼減少了幻覺，並驅使推論器抽象地產生推理軌跡。根據軌跡，檢索器根據綱要理解以程式化方式查詢場景圖形資料，允許對圖形進行動態和整體關注，增強推理和檢索之間的一致性。透過在多個模擬環境中的實驗，我們表明我們的框架在數值問答和規劃任務中超越了現有的基於 LLM 的方法，並且可以受益於任務級別的少次範例，即使在沒有代理級別示範的情況下也是如此。專案程式碼將會釋出。
+摘要：前列腺癌是全球男性最常見的癌症，其惡性程度主要根據 Gleason 評分系統使用組織病理學數據進行評估。雖然人工智慧 (AI) 在準確預測 Gleason 評分方面已展現潛力，但這些預測通常缺乏內在的可解釋性，可能會導致對人機互動的不信任。為了解決這個問題，我們引進了一個由 54 位病理學家組成的國際團隊註解的 1,015 個組織微陣列核心影像的新穎資料集。這些註解提供了詳細的局部模式描述，用於符合國際準則的 Gleason 分級。利用這個資料集，我們開發了一個基於 U-Net 架構的內在可解釋 AI 系統，該系統提供了利用病理學家術語進行預測。這種方法規避了事後可解釋性方法，同時維持或超越了直接訓練用於 Gleason 模式分割的方法的效能（Dice 分數：0.713 ± 0.003，訓練於解釋，相對於 0.691 ± 0.010，訓練於 Gleason 模式）。透過在訓練期間採用軟標籤，我們捕捉了資料中的內在不確定性，即使在觀察者間變異性高的情況下，也能在 Gleason 模式分割中產生強大的結果。透過釋出這個資料集，我們旨在鼓勵進一步研究主觀性高的醫療任務中的分割，並增進對病理學家推理過程的理解。
 
-##### **SymAgent: A Neural-Symbolic Self-Learning Agent Framework for Complex Reasoning over Knowledge Graphs**
-2502.03283v1 by Ben Liu, Jihai Zhang, Fangquan Lin, Cheng Yang, Min Peng, Wotao Yin
+##### **Explainable AI Methods for Multi-Omics Analysis: A Survey**
+2410.11910v1 by Ahmad Hussein, Mukesh Prasad, Ali Braytee
 
-Recent advancements have highlighted that Large Language Models (LLMs) are
-prone to hallucinations when solving complex reasoning problems, leading to
-erroneous results. To tackle this issue, researchers incorporate Knowledge
-Graphs (KGs) to improve the reasoning ability of LLMs. However, existing
-methods face two limitations: 1) they typically assume that all answers to the
-questions are contained in KGs, neglecting the incompleteness issue of KGs, and
-2) they treat the KG as a static repository and overlook the implicit logical
-reasoning structures inherent in KGs. In this paper, we introduce SymAgent, an
-innovative neural-symbolic agent framework that achieves collaborative
-augmentation between KGs and LLMs. We conceptualize KGs as dynamic environments
-and transform complex reasoning tasks into a multi-step interactive process,
-enabling KGs to participate deeply in the reasoning process. SymAgent consists
-of two modules: Agent-Planner and Agent-Executor. The Agent-Planner leverages
-LLM's inductive reasoning capability to extract symbolic rules from KGs,
-guiding efficient question decomposition. The Agent-Executor autonomously
-invokes predefined action tools to integrate information from KGs and external
-documents, addressing the issues of KG incompleteness. Furthermore, we design a
-self-learning framework comprising online exploration and offline iterative
-policy updating phases, enabling the agent to automatically synthesize
-reasoning trajectories and improve performance. Experimental results
-demonstrate that SymAgent with weak LLM backbones (i.e., 7B series) yields
-better or comparable performance compared to various strong baselines. Further
-analysis reveals that our agent can identify missing triples, facilitating
-automatic KG updates.
+Advancements in high-throughput technologies have led to a shift from
+traditional hypothesis-driven methodologies to data-driven approaches.
+Multi-omics refers to the integrative analysis of data derived from multiple
+'omes', such as genomics, proteomics, transcriptomics, metabolomics, and
+microbiomics. This approach enables a comprehensive understanding of biological
+systems by capturing different layers of biological information. Deep learning
+methods are increasingly utilized to integrate multi-omics data, offering
+insights into molecular interactions and enhancing research into complex
+diseases. However, these models, with their numerous interconnected layers and
+nonlinear relationships, often function as black boxes, lacking transparency in
+decision-making processes. To overcome this challenge, explainable artificial
+intelligence (xAI) methods are crucial for creating transparent models that
+allow clinicians to interpret and work with complex data more effectively. This
+review explores how xAI can improve the interpretability of deep learning
+models in multi-omics research, highlighting its potential to provide
+clinicians with clear insights, thereby facilitating the effective application
+of such models in clinical settings.
 
-摘要：<paragraph>最近的研究表明，大型语言模型 (LLM) 在解决复杂的推理问题时容易出现幻觉，从而导致错误的结果。为了解决这个问题，研究人员结合了知识图谱 (KG) 来提高 LLM 的推理能力。然而，现有方法面临两个局限性：1) 它们通常假设问题的答案都包含在 KG 中，忽略了 KG 不完整的问题，2) 它们将 KG 视为一个静态存储库，而忽略了 KG 中固有的隐式逻辑推理结构。在本文中，我们介绍了 SymAgent，这是一个创新的神经符号代理框架，可以在 KG 和 LLM 之间实现协作增强。我们将 KG 概念化为动态环境，并将复杂的推理任务转化为一个多步骤的交互过程，使 KG 能够深入参与推理过程。SymAgent 由两个模块组成：Agent-Planner 和 Agent-Executor。Agent-Planner 利用 LLM 的归纳推理能力从 KG 中提取符号规则，指导高效的问题分解。Agent-Executor 自主调用预定义的动作工具来整合来自 KG 和外部文档的信息，解决 KG 不完整的问题。此外，我们设计了一个自学习框架，包括在线探索和离线迭代策略更新阶段，使代理能够自动合成推理轨迹并提高性能。实验结果表明，具有弱 LLM 主干的 SymAgent（即 7B 系列）与各种强大的基线相比，产生了更好或相当的性能。进一步的分析表明，我们的代理可以识别缺失的三元组，促进自动 KG 更新。</paragraph>
+摘要：高通量技術的進步導致從傳統的假設驅動方法轉變為資料驅動的方法。多組學是指整合分析來自多個「組學」的資料，例如基因組學、蛋白質組學、轉錄組學、代謝組學和微生物組學。此方法透過擷取生物資訊的不同層面，能全面了解生物系統。深度學習方法愈來愈常被用於整合多組學資料，提供分子交互作用的洞察力，並加強對複雜疾病的研究。然而，這些模型具有許多相互連接的層級和非線性關係，通常會像黑盒子一樣運作，缺乏決策過程的透明度。為了克服此挑戰，可解釋人工智慧 (xAI) 方法對於建立透明模型至關重要，讓臨床醫生可以更有效地解釋和處理複雜資料。此評論探討 xAI 如何能改善多組學研究中深度學習模型的可解釋性，強調其提供臨床醫生明確見解的潛力，進而促進此類模型在臨床環境中的有效應用。
 
-##### **Analyze Feature Flow to Enhance Interpretation and Steering in Language Models**
-2502.03032v2 by Daniil Laptev, Nikita Balagansky, Yaroslav Aksenov, Daniil Gavrilov
+##### **Study on the Helpfulness of Explainable Artificial Intelligence**
+2410.11896v1 by Tobias Labarta, Elizaveta Kulicheva, Ronja Froelian, Christian Geißler, Xenia Melman, Julian von Klitzing
 
-We introduce a new approach to systematically map features discovered by
-sparse autoencoder across consecutive layers of large language models,
-extending earlier work that examined inter-layer feature links. By using a
-data-free cosine similarity technique, we trace how specific features persist,
-transform, or first appear at each stage. This method yields granular flow
-graphs of feature evolution, enabling fine-grained interpretability and
-mechanistic insights into model computations. Crucially, we demonstrate how
-these cross-layer feature maps facilitate direct steering of model behavior by
-amplifying or suppressing chosen features, achieving targeted thematic control
-in text generation. Together, our findings highlight the utility of a causal,
-cross-layer interpretability framework that not only clarifies how features
-develop through forward passes but also provides new means for transparent
-manipulation of large language models.
+Explainable Artificial Intelligence (XAI) is essential for building advanced
+machine learning-powered applications, especially in critical domains such as
+medical diagnostics or autonomous driving. Legal, business, and ethical
+requirements motivate using effective XAI, but the increasing number of
+different methods makes it challenging to pick the right ones. Further, as
+explanations are highly context-dependent, measuring the effectiveness of XAI
+methods without users can only reveal a limited amount of information,
+excluding human factors such as the ability to understand it. We propose to
+evaluate XAI methods via the user's ability to successfully perform a proxy
+task, designed such that a good performance is an indicator for the explanation
+to provide helpful information. In other words, we address the helpfulness of
+XAI for human decision-making. Further, a user study on state-of-the-art
+methods was conducted, showing differences in their ability to generate trust
+and skepticism and the ability to judge the rightfulness of an AI decision
+correctly. Based on the results, we highly recommend using and extending this
+approach for more objective-based human-centered user studies to measure XAI
+performance in an end-to-end fashion.
 
-摘要：我們提出了一種新方法，用於系統性地繪製大型語言模型連續層中稀疏自動編碼器發現的功能，擴展了先前研究層間特徵連結的工作。透過使用無資料餘弦相似性技術，我們追蹤特定特徵在每個階段如何持續、轉換或首次出現。此方法產生了特徵演化的細粒度流程圖，實現了細粒度的可解釋性和對模型運算的機制見解。至關重要的是，我們展示了這些跨層特徵圖如何透過放大或抑制所選特徵來促進模型行為的直接引導，在文字生成中實現目標主題控制。我們的研究結果共同突出了因果、跨層可解釋性框架的效用，不僅闡明了特徵如何透過前向傳遞發展，還提供了新的方法來透明地操作大型語言模型。
+摘要：可解釋人工智慧 (XAI) 對於建構先進的機器學習驅動應用程式至關重要，特別是在醫療診斷或自動駕駛等關鍵領域。法律、商業和倫理要求促使使用有效的 XAI，但數量日益增加的不同方法使得挑選正確的方法具有挑戰性。此外，由於解釋高度依賴於背景，在沒有使用者的情況下衡量 XAI 方法的有效性只能揭示有限的資訊，排除人類因素，例如理解它的能力。我們建議透過使用者成功執行代理任務的能力來評估 XAI 方法，設計使得良好的執行表現是解釋提供有用資訊的指標。換句話說，我們探討 XAI 對人類決策制定的幫助。此外，對最先進的方法進行使用者研究，顯示出它們在產生信任和懷疑的能力以及正確判斷 AI 決策是否正確的能力方面存在差異。根據結果，我們強烈建議使用和擴充這種方法，以進行更多以目標為基礎的人為中心使用者研究，以終端到終端的方式衡量 XAI 效能。
 
-##### **A Benchmark for the Detection of Metalinguistic Disagreements between LLMs and Knowledge Graphs**
-2502.02896v1 by Bradley P. Allen, Paul T. Groth
+##### **Use of What-if Scenarios to Help Explain Artificial Intelligence Models for Neonatal Health**
+2410.09635v1 by Abdullah Mamun, Lawrence D. Devoe, Mark I. Evans, David W. Britt, Judith Klein-Seetharaman, Hassan Ghasemzadeh
 
-Evaluating large language models (LLMs) for tasks like fact extraction in
-support of knowledge graph construction frequently involves computing accuracy
-metrics using a ground truth benchmark based on a knowledge graph (KG). These
-evaluations assume that errors represent factual disagreements. However, human
-discourse frequently features metalinguistic disagreement, where agents differ
-not on facts but on the meaning of the language used to express them. Given the
-complexity of natural language processing and generation using LLMs, we ask: do
-metalinguistic disagreements occur between LLMs and KGs? Based on an
-investigation using the T-REx knowledge alignment dataset, we hypothesize that
-metalinguistic disagreement does in fact occur between LLMs and KGs, with
-potential relevance for the practice of knowledge graph engineering. We propose
-a benchmark for evaluating the detection of factual and metalinguistic
-disagreements between LLMs and KGs. An initial proof of concept of such a
-benchmark is available on Github.
+Early detection of intrapartum risk enables interventions to potentially
+prevent or mitigate adverse labor outcomes such as cerebral palsy. Currently,
+there is no accurate automated system to predict such events to assist with
+clinical decision-making. To fill this gap, we propose "Artificial Intelligence
+(AI) for Modeling and Explaining Neonatal Health" (AIMEN), a deep learning
+framework that not only predicts adverse labor outcomes from maternal, fetal,
+obstetrical, and intrapartum risk factors but also provides the model's
+reasoning behind the predictions made. The latter can provide insights into
+what modifications in the input variables of the model could have changed the
+predicted outcome. We address the challenges of imbalance and small datasets by
+synthesizing additional training data using Adaptive Synthetic Sampling
+(ADASYN) and Conditional Tabular Generative Adversarial Networks (CTGAN). AIMEN
+uses an ensemble of fully-connected neural networks as the backbone for its
+classification with the data augmentation supported by either ADASYN or CTGAN.
+AIMEN, supported by CTGAN, outperforms AIMEN supported by ADASYN in
+classification. AIMEN can predict a high risk for adverse labor outcomes with
+an average F1 score of 0.784. It also provides counterfactual explanations that
+can be achieved by changing 2 to 3 attributes on average. Resources available:
+https://github.com/ab9mamun/AIMEN.
 
-摘要：評估大型語言模型 (LLM) 執行知識圖譜建構支援事實萃取等任務時，通常會使用基於知識圖譜 (KG) 的基準事實計算準確度指標。這些評估假設錯誤代表事實上的分歧。然而，人類話語經常出現元語言分歧，其中代理人之間的差異不在於事實，而在於用於表達事實的語言的含義。鑑於使用 LLM 處理和產生自然語言的複雜性，我們提出疑問：LLM 和 KG 之間是否會發生元語言分歧？根據使用 T-REx 知識比對資料集進行的調查，我們假設元語言分歧確實會發生在 LLM 和 KG 之間，並可能與知識圖譜工程實務有關。我們提出一個基準，用於評估 LLM 和 KG 之間的事實和元語言分歧的偵測。此基準的初步概念驗證可在 Github 上取得。
+摘要：產程中風險的早期偵測有助於進行干預措施，以預防或減輕不利的生產結果，例如腦性麻痺。目前，沒有準確的自動化系統可以預測此類事件，以協助臨床決策。為了填補這一空白，我們提出「用於建模和解釋新生兒健康的人工智慧」(AIMEN)，這是一個深度學習架構，它不僅可以根據孕產婦、胎兒、產科和產程風險因素預測不利的生產結果，還能提供模型做出預測背後的原因。後者可以提供見解，說明模型輸入變數中的哪些修改可能會改變預測結果。我們透過使用適應性合成抽樣 (ADASYN) 和條件表格生成對抗網路 (CTGAN) 來合成額外的訓練資料，以解決不平衡和小型資料集的挑戰。AIMEN 使用全連接神經網路的集合作為其分類的骨幹，並透過 ADASYN 或 CTGAN 支援資料擴充。由 CTGAN 支援的 AIMEN 在分類方面優於由 ADASYN 支援的 AIMEN。AIMEN 可以預測不利的生產結果的高風險，平均 F1 分數為 0.784。它還提供反事實解釋，可透過平均變更 2 至 3 個屬性來達成。可用資源：https://github.com/ab9mamun/AIMEN。
 
-##### **Mol-LLM: Generalist Molecular LLM with Improved Graph Utilization**
-2502.02810v1 by Chanhui Lee, Yuheon Song, YongJun Jeong, Hanbum Ko, Rodrigo Hormazabal, Sehui Han, Kyunghoon Bae, Sungbin Lim, Sungwoong Kim
+##### **Artificial intelligence techniques in inherited retinal diseases: A review**
+2410.09105v1 by Han Trinh, Jordan Vice, Jason Charng, Zahra Tajbakhsh, Khyber Alam, Fred K. Chen, Ajmal Mian
 
-Recent advances in Large Language Models (LLMs) have motivated the
-development of general LLMs for molecular tasks. While several studies have
-demonstrated that fine-tuned LLMs can achieve impressive benchmark
-performances, they are far from genuine generalist molecular LLMs due to a lack
-of fundamental understanding of molecular structure. Specifically, when given
-molecular task instructions, LLMs trained with naive next-token prediction
-training assign similar likelihood scores to both original and negatively
-corrupted molecules, revealing their lack of molecular structure understanding
-that is crucial for reliable and general molecular LLMs. To overcome this
-limitation and obtain a true generalist molecular LLM, we introduce a novel
-multi-modal training method based on a thorough multi-modal instruction tuning
-as well as a molecular structure preference optimization between chosen and
-rejected graphs. On various molecular benchmarks, the proposed generalist
-molecular LLM, called Mol-LLM, achieves state-of-the-art performances among
-generalist LLMs on most tasks, at the same time, surpassing or comparable to
-state-of-the-art specialist LLMs. Moreover, Mol-LLM also shows superior
-generalization performances in reaction prediction tasks, demonstrating the
-effect of the molecular structure understanding for generalization perspective.
+Inherited retinal diseases (IRDs) are a diverse group of genetic disorders
+that lead to progressive vision loss and are a major cause of blindness in
+working-age adults. The complexity and heterogeneity of IRDs pose significant
+challenges in diagnosis, prognosis, and management. Recent advancements in
+artificial intelligence (AI) offer promising solutions to these challenges.
+However, the rapid development of AI techniques and their varied applications
+have led to fragmented knowledge in this field. This review consolidates
+existing studies, identifies gaps, and provides an overview of AI's potential
+in diagnosing and managing IRDs. It aims to structure pathways for advancing
+clinical applications by exploring AI techniques like machine learning and deep
+learning, particularly in disease detection, progression prediction, and
+personalized treatment planning. Special focus is placed on the effectiveness
+of convolutional neural networks in these areas. Additionally, the integration
+of explainable AI is discussed, emphasizing its importance in clinical settings
+to improve transparency and trust in AI-based systems. The review addresses the
+need to bridge existing gaps in focused studies on AI's role in IRDs, offering
+a structured analysis of current AI techniques and outlining future research
+directions. It concludes with an overview of the challenges and opportunities
+in deploying AI for IRDs, highlighting the need for interdisciplinary
+collaboration and the continuous development of robust, interpretable AI models
+to advance clinical applications.
 
-摘要：大型語言模型 (LLM) 的近期進展激勵了針對分子任務開發通用 LLM。雖然多項研究已證明微調 LLM 可實現令人印象深刻的基準效能，但由於缺乏對分子結構的基本理解，它們遠非真正的通才分子 LLM。具體來說，當給予分子任務說明時，使用天真的下一個符號預測訓練訓練的 LLM 會將類似的可能性評分分配給原始分子和負面損壞分子，這顯示出它們缺乏對分子結構的理解，而這對於可靠且通用的分子 LLM 至關重要。為了克服這個限制並獲得真正的通才分子 LLM，我們引入了一種新穎的多模態訓練方法，該方法基於徹底的多模態說明調整以及在所選和拒絕圖形之間的分子結構偏好最佳化。在各種分子基準測試中，所提出的通才分子 LLM（稱為 Mol-LLM）在多數任務中實現了通才 LLM 中的最新效能，同時超越或與最新的專家 LLM 相當。此外，Mol-LLM 在反應預測任務中也展現出優異的泛化效能，證明了分子結構理解對泛化觀點的影響。
+摘要：遺傳性視網膜疾病 (IRD) 是一組多樣化的遺傳疾病，
+會導致視力逐漸喪失，是工作年齡成人失明的主要原因。IRD 的複雜性和異質性對診斷、預後和管理提出了重大挑戰。最近人工智能 (AI) 的進步為這些挑戰提供了有希望的解決方案。
+然而，AI 技術的快速發展及其多種應用導致了該領域的知識分散。本綜述整合了現有研究，找出差距，並概述了 AI 在診斷和管理 IRD 中的潛力。它旨在通過探索機器學習和深度學習等 AI 技術，特別是在疾病檢測、進程預測和個性化治療計劃中，為推進臨床應用構建途徑。特別關注這些領域中卷積神經網路的有效性。此外，討論了可解釋 AI 的整合，強調了其在臨床環境中提高透明度和對基於 AI 的系統的信任的重要性。該綜述解決了彌合 AI 在 IRD 中作用的重點研究中現有差距的必要性，提供了對當前 AI 技術的結構化分析，並概述了未來的研究方向。最後概述了在 IRD 中部署 AI 的挑戰和機遇，強調了跨學科合作和持續開發強大、可解釋的 AI 模型以推進臨床應用的必要性。
 
-##### **Leveraging the true depth of LLMs**
-2502.02790v1 by Ramón Calvo González, Daniele Paliotta, Matteo Pagliardini, Martin Jaggi, François Fleuret
+##### **CasiMedicos-Arg: A Medical Question Answering Dataset Annotated with Explanatory Argumentative Structures**
+2410.05235v2 by Ekaterina Sviridova, Anar Yeginbergen, Ainara Estarrona, Elena Cabrio, Serena Villata, Rodrigo Agerri
 
-Large Language Models demonstrate remarkable capabilities at the cost of high
-compute requirements. While recent research has shown that intermediate layers
-can be removed or have their order shuffled without impacting performance
-significantly, these findings have not been employed to reduce the
-computational cost of inference. We investigate several potential ways to
-reduce the depth of pre-trained LLMs without significantly affecting
-performance. Leveraging our insights, we present a novel approach that exploits
-this decoupling between layers by grouping some of them into pairs that can be
-evaluated in parallel.
-  This modification of the computational graph -- through better parallelism --
-results in an average improvement of around 1.20x on the number of tokens
-generated per second, without re-training nor fine-tuning, while retaining
-95%-99% of the original accuracy. Empirical evaluation demonstrates that this
-approach significantly improves serving efficiency while maintaining model
-performance, offering a practical improvement for large-scale LLM deployment.
+Explaining Artificial Intelligence (AI) decisions is a major challenge
+nowadays in AI, in particular when applied to sensitive scenarios like medicine
+and law. However, the need to explain the rationale behind decisions is a main
+issue also for human-based deliberation as it is important to justify
+\textit{why} a certain decision has been taken. Resident medical doctors for
+instance are required not only to provide a (possibly correct) diagnosis, but
+also to explain how they reached a certain conclusion. Developing new tools to
+aid residents to train their explanation skills is therefore a central
+objective of AI in education. In this paper, we follow this direction, and we
+present, to the best of our knowledge, the first multilingual dataset for
+Medical Question Answering where correct and incorrect diagnoses for a clinical
+case are enriched with a natural language explanation written by doctors. These
+explanations have been manually annotated with argument components (i.e.,
+premise, claim) and argument relations (i.e., attack, support), resulting in
+the Multilingual CasiMedicos-Arg dataset which consists of 558 clinical cases
+in four languages (English, Spanish, French, Italian) with explanations, where
+we annotated 5021 claims, 2313 premises, 2431 support relations, and 1106
+attack relations. We conclude by showing how competitive baselines perform over
+this challenging dataset for the argument mining task.
 
-摘要：大型语言模型展示了其强大的功能，但代价是较高的计算需求。虽然最近的研究表明，中间层可以被移除或重新排列其顺序，而不会显著影响性能，但这些发现尚未被用来降低推理的计算成本。我们研究了几种潜在的方法来减少预训练 LLM 的深度，而不会显著影响性能。利用我们的见解，我们提出了一种新颖的方法，该方法通过将其中一些分组为可以并行评估的成对来利用层之间的这种解耦。
-通过更好的并行性对计算图进行修改，平均而言，每秒生成的令牌数量提高了约 1.20 倍，而无需重新训练或微调，同时保留了 95%-99% 的原始准确性。经验评估表明，这种方法显著提高了服务效率，同时保持了模型性能，为大规模 LLM 部署提供了实际改进。
+摘要：解釋人工智慧 (AI) 的決策是現在 AI 的一項重大挑戰，特別是應用於像醫學和法律等敏感情境時。然而，解釋決策背後理由的需求也是基於人類的考量的一個主要問題，因為有必要證明為什麼做出某個決策。例如，住院醫師不僅需要提供（可能是正確的）診斷，還需要解釋他們如何達成某個結論。因此，開發新的工具來幫助住院醫師訓練他們的解釋技巧是教育中 AI 的一項核心目標。在本文中，我們遵循這個方向，並且根據我們的了解，提出第一個多語言醫學問答資料集，其中臨床病例的正確和不正確診斷都附有由醫生撰寫的自然語言解釋。這些解釋已使用論證組成（即前提、主張）和論證關係（即攻擊、支持）進行手動註解，產生多語言 CasiMedicos-Arg 資料集，其中包含 558 個具有解釋的四種語言（英語、西班牙語、法語、義大利語）的臨床病例，我們註解了 5021 個主張、2313 個前提、2431 個支持關係和 1106 個攻擊關係。我們最後展示了競爭基準如何針對論證探勘任務執行此具挑戰性的資料集。
 
-##### **Modular Training of Neural Networks aids Interpretability**
-2502.02470v2 by Satvik Golechha, Maheep Chaudhary, Joan Velja, Alessandro Abate, Nandi Schoots
+##### **Explainable Diagnosis Prediction through Neuro-Symbolic Integration**
+2410.01855v2 by Qiuhao Lu, Rui Li, Elham Sagheb, Andrew Wen, Jinlian Wang, Liwei Wang, Jungwei W. Fan, Hongfang Liu
 
-An approach to improve neural network interpretability is via clusterability,
-i.e., splitting a model into disjoint clusters that can be studied
-independently. We define a measure for clusterability and show that pre-trained
-models form highly enmeshed clusters via spectral graph clustering. We thus
-train models to be more modular using a "clusterability loss" function that
-encourages the formation of non-interacting clusters. Using automated
-interpretability techniques, we show that our method can help train models that
-are more modular and learn different, disjoint, and smaller circuits. We
-investigate CNNs trained on MNIST and CIFAR, small transformers trained on
-modular addition, and language models. Our approach provides a promising
-direction for training neural networks that learn simpler functions and are
-easier to interpret.
+Diagnosis prediction is a critical task in healthcare, where timely and
+accurate identification of medical conditions can significantly impact patient
+outcomes. Traditional machine learning and deep learning models have achieved
+notable success in this domain but often lack interpretability which is a
+crucial requirement in clinical settings. In this study, we explore the use of
+neuro-symbolic methods, specifically Logical Neural Networks (LNNs), to develop
+explainable models for diagnosis prediction. Essentially, we design and
+implement LNN-based models that integrate domain-specific knowledge through
+logical rules with learnable thresholds. Our models, particularly
+$M_{\text{multi-pathway}}$ and $M_{\text{comprehensive}}$, demonstrate superior
+performance over traditional models such as Logistic Regression, SVM, and
+Random Forest, achieving higher accuracy (up to 80.52\%) and AUROC scores (up
+to 0.8457) in the case study of diabetes prediction. The learned weights and
+thresholds within the LNN models provide direct insights into feature
+contributions, enhancing interpretability without compromising predictive
+power. These findings highlight the potential of neuro-symbolic approaches in
+bridging the gap between accuracy and explainability in healthcare AI
+applications. By offering transparent and adaptable diagnostic models, our work
+contributes to the advancement of precision medicine and supports the
+development of equitable healthcare solutions. Future research will focus on
+extending these methods to larger and more diverse datasets to further validate
+their applicability across different medical conditions and populations.
 
-摘要：一種改善神經網路可解釋性的方法是透過群集性，
-也就是將模型分割成可獨立研究的不相交群集。我們定義一個群集性的度量，並顯示預訓練的
-模型透過光譜圖形群集形成高度糾纏的群集。因此，我們使用「群集性損失」函數訓練模型，使其更具模組化，
-這鼓勵形成非交互群集。使用自動化可解釋性技術，我們顯示我們的模型可以幫助訓練更具模組化的模型，並學習不同、不相交且較小的電路。我們
-研究了在 MNIST 和 CIFAR 上訓練的 CNN，在模組化加法上訓練的小型Transformer，以及語言模型。我們的做法為訓練學習更簡單函數且更容易解釋的神經網路提供了有希望的方向。
+摘要：診斷預測是醫療保健中的關鍵任務，及時且準確地識別醫療狀況會顯著影響患者的結果。傳統的機器學習和深度學習模型已在這個領域取得顯著成功，但通常缺乏可解釋性，這在臨床環境中是一項關鍵要求。在本研究中，我們探討了神經符號方法的應用，特別是邏輯神經網路 (LNN)，以開發用於診斷預測的可解釋模型。基本上，我們設計並實作了基於 LNN 的模型，這些模型透過具有可學習閾值的邏輯規則整合領域特定知識。我們的模型，特別是 $M_{\text{multi-pathway}}$ 和 $M_{\text{comprehensive}}$，表現出優於傳統模型（例如邏輯迴歸、SVM 和隨機森林）的優異效能，在糖尿病預測的案例研究中達到了更高的準確度（高達 80.52%）和 AUROC 分數（高達 0.8457）。LNN 模型中學習到的權重和閾值提供了對特徵貢獻的直接見解，增強了可解釋性，同時不影響預測能力。這些發現突顯了神經符號方法在彌合醫療保健 AI 應用中準確性和可解釋性差距方面的潛力。透過提供透明且適應性強的診斷模型，我們的研究有助於推進精準醫療，並支援公平醫療保健解決方案的開發。未來的研究將專注於將這些方法擴展到更大且更多樣化的資料集，以進一步驗證其在不同醫療狀況和人群中的適用性。
 
-##### **Premise-Augmented Reasoning Chains Improve Error Identification in Math reasoning with LLMs**
-2502.02362v3 by Sagnik Mukherjee, Abhinav Chinta, Takyoung Kim, Tarun Anoop Sharma, Dilek Hakkani-Tür
+##### **Easydiagnos: a framework for accurate feature selection for automatic diagnosis in smart healthcare**
+2410.00366v1 by Prasenjit Maji, Amit Kumar Mondal, Hemanta Kumar Mondal, Saraju P. Mohanty
 
-Chain-of-Thought (CoT) prompting enhances mathematical reasoning in large
-language models (LLMs) by enabling detailed step-by-step solutions. However,
-due to the verbosity of LLMs, the resulting reasoning chains can be long,
-making it harder to verify the reasoning steps and trace issues resulting from
-dependencies between the steps that may be farther away in the sequence of
-steps. Importantly, mathematical reasoning allows each step to be derived from
-a small set of premises, which are a subset of the preceding steps in the
-reasoning chain. In this paper, we present a framework that identifies the
-premises for each step, to improve the evaluation of reasoning. We restructure
-conventional linear reasoning chains into Premise Augmented Reasoning Chains
-(PARC) by introducing premise links, resulting in a directed acyclic graph
-where the nodes are the steps and the edges are the premise links. Through
-experiments with a PARC-based dataset that we built, namely PERL (Premises and
-ERrors identification in LLMs), we demonstrate that LLMs can reliably identify
-premises within complex reasoning chains. In particular, even open-source LLMs
-achieve 90% recall in premise identification. We also show that PARC helps to
-identify errors in reasoning chains more reliably. The accuracy of error
-identification improves by 6% to 16% absolute when step-by-step verification is
-carried out in PARC under the premises. Our findings highlight the utility of
-premise-centric representations in addressing complex problem-solving tasks and
-open new avenues for improving the reliability of LLM-based reasoning
-evaluations.
+The rapid advancements in artificial intelligence (AI) have revolutionized
+smart healthcare, driving innovations in wearable technologies, continuous
+monitoring devices, and intelligent diagnostic systems. However, security,
+explainability, robustness, and performance optimization challenges remain
+critical barriers to widespread adoption in clinical environments. This
+research presents an innovative algorithmic method using the Adaptive Feature
+Evaluator (AFE) algorithm to improve feature selection in healthcare datasets
+and overcome problems. AFE integrating Genetic Algorithms (GA), Explainable
+Artificial Intelligence (XAI), and Permutation Combination Techniques (PCT),
+the algorithm optimizes Clinical Decision Support Systems (CDSS), thereby
+enhancing predictive accuracy and interpretability. The proposed method is
+validated across three diverse healthcare datasets using six distinct machine
+learning algorithms, demonstrating its robustness and superiority over
+conventional feature selection techniques. The results underscore the
+transformative potential of AFE in smart healthcare, enabling personalized and
+transparent patient care. Notably, the AFE algorithm, when combined with a
+Multi-layer Perceptron (MLP), achieved an accuracy of up to 98.5%, highlighting
+its capability to improve clinical decision-making processes in real-world
+healthcare applications.
 
-摘要：<paragraph>思考鏈（CoT）提示透過提供詳細的逐步解法，增強大型語言模型（LLM）的數學推理能力。然而，由於 LLM 的冗長，產生的推理鏈可能很長，這使得驗證推理步驟和追蹤由步驟之間相依關係所產生的問題變得更加困難，而這些步驟可能在步驟順序中相距較遠。重要的是，數學推理允許每個步驟從一組小的前提中推導出來，這些前提是推理鏈中前一個步驟的子集。在本文中，我們提出了一個框架，用於識別每個步驟的前提，以改進推理評估。我們透過引入前提連結，將傳統的線性推理鏈重組為前提擴充推理鏈（PARC），產生一個有向無環圖，其中節點是步驟，而邊緣是前提連結。透過我們建立的基於 PARC 的資料集（即 PERL（LLM 中的前提和錯誤識別））進行的實驗，我們證明 LLM 能夠在複雜的推理鏈中可靠地識別前提。特別是，即使是開源 LLM 在前提識別中也能達到 90% 的召回率。我們還表明，PARC 有助於更可靠地識別推理鏈中的錯誤。在前提下於 PARC 中執行逐步驗證時，錯誤識別的準確度提高了 6% 到 16%。我們的研究結果突顯了以前提為中心的表示在解決複雜問題解決任務中的效用，並為改進基於 LLM 的推理評估的可靠性開闢了新途徑。</paragraph>
+摘要：人工智慧 (AI) 的快速進展徹底改變了智慧醫療保健，推動了可穿戴技術、持續監控裝置和智慧診斷系統的創新。然而，安全性、可解釋性、穩健性和效能最佳化挑戰仍然是臨床環境中廣泛採用的關鍵障礙。本研究提出一個創新的演算法方法，使用自適應特徵評估器 (AFE) 演算法來改善醫療保健資料集中的特徵選取並克服問題。AFE 整合了遺傳演算法 (GA)、可解釋人工智慧 (XAI) 和排列組合技術 (PCT)，該演算法最佳化了臨床決策支援系統 (CDSS)，從而提高了預測準確性和可解釋性。所提出的方法使用六種不同的機器學習演算法驗證了三個不同的醫療保健資料集，證明了其穩健性和優於傳統特徵選取技術。結果強調了 AFE 在智慧醫療保健中的轉變潛力，實現了個人化和透明的患者照護。值得注意的是，AFE 演算法與多層感知器 (MLP) 結合使用時，準確度高達 98.5%，突顯了其改善實際醫療保健應用中臨床決策制定流程的能力。
 
-##### **AdaptBot: Combining LLM with Knowledge Graphs and Human Input for Generic-to-Specific Task Decomposition and Knowledge Refinement**
-2502.02067v1 by Shivam Singh, Karthik Swaminathan, Nabanita Dash, Ramandeep Singh, Snehasis Banerjee, Mohan Sridharan, Madhava Krishna
+##### **Dermatologist-like explainable AI enhances melanoma diagnosis accuracy: eye-tracking study**
+2409.13476v1 by Tirtha Chanda, Sarah Haggenmueller, Tabea-Clara Bucher, Tim Holland-Letz, Harald Kittler, Philipp Tschandl, Markus V. Heppt, Carola Berking, Jochen S. Utikal, Bastian Schilling, Claudia Buerger, Cristian Navarrete-Dechent, Matthias Goebeler, Jakob Nikolas Kather, Carolin V. Schneider, Benjamin Durani, Hendrike Durani, Martin Jansen, Juliane Wacker, Joerg Wacker, Reader Study Consortium, Titus J. Brinker
 
-Embodied agents assisting humans are often asked to complete a new task in a
-new scenario. An agent preparing a particular dish in the kitchen based on a
-known recipe may be asked to prepare a new dish or to perform cleaning tasks in
-the storeroom. There may not be sufficient resources, e.g., time or labeled
-examples, to train the agent for these new situations. Large Language Models
-(LLMs) trained on considerable knowledge across many domains are able to
-predict a sequence of abstract actions for such new tasks and scenarios,
-although it may not be possible for the agent to execute this action sequence
-due to task-, agent-, or domain-specific constraints. Our framework addresses
-these challenges by leveraging the generic predictions provided by LLM and the
-prior domain-specific knowledge encoded in a Knowledge Graph (KG), enabling an
-agent to quickly adapt to new tasks and scenarios. The robot also solicits and
-uses human input as needed to refine its existing knowledge. Based on
-experimental evaluation over cooking and cleaning tasks in simulation domains,
-we demonstrate that the interplay between LLM, KG, and human input leads to
-substantial performance gains compared with just using the LLM output.
+Artificial intelligence (AI) systems have substantially improved
+dermatologists' diagnostic accuracy for melanoma, with explainable AI (XAI)
+systems further enhancing clinicians' confidence and trust in AI-driven
+decisions. Despite these advancements, there remains a critical need for
+objective evaluation of how dermatologists engage with both AI and XAI tools.
+In this study, 76 dermatologists participated in a reader study, diagnosing 16
+dermoscopic images of melanomas and nevi using an XAI system that provides
+detailed, domain-specific explanations. Eye-tracking technology was employed to
+assess their interactions. Diagnostic performance was compared with that of a
+standard AI system lacking explanatory features. Our findings reveal that XAI
+systems improved balanced diagnostic accuracy by 2.8 percentage points relative
+to standard AI. Moreover, diagnostic disagreements with AI/XAI systems and
+complex lesions were associated with elevated cognitive load, as evidenced by
+increased ocular fixations. These insights have significant implications for
+clinical practice, the design of AI tools for visual tasks, and the broader
+development of XAI in medical diagnostics.
 
-摘要：具身代理协助人类时，通常需要在新的情境中完成新的任务。基于已知食谱在厨房准备特定菜肴的代理可能会被要求准备新菜肴或在储藏室执行清洁任务。可能没有足够资源（例如时间或标记的示例）来训练代理以应对这些新情况。在许多领域接受大量知识训练的大型语言模型 (LLM) 能够预测此类新任务和情境的抽象动作序列，尽管代理可能无法执行此动作序列，因为任务、代理或特定于域的约束。我们的框架通过利用 LLM 提供的通用预测和知识图 (KG) 中编码的先前特定于域的知识来应对这些挑战，使代理能够快速适应新任务和情境。该机器人还会根据需要征求并使用人类输入来完善其现有知识。基于在模拟域中对烹饪和清洁任务的实验评估，我们证明了 LLM、KG 和人类输入之间的相互作用与仅使用 LLM 输出相比带来了巨大的性能提升。
+摘要：人工智慧 (AI) 系統已大幅改善皮膚科醫師對黑色素瘤的診斷準確度，而可解釋 AI (XAI) 系統進一步提升臨床醫師對 AI 驅動決策的信心與信賴。儘管有這些進展，對於皮膚科醫師如何使用 AI 和 XAI 工具，仍有客觀評估的迫切需求。在這項研究中，76 位皮膚科醫師參與了一項讀者研究，使用 XAI 系統診斷 16 張黑色素瘤和痣的皮膚鏡影像，該系統提供詳細的領域特定說明。採用眼球追蹤技術來評估他們的互動。將診斷表現與缺乏說明功能的標準 AI 系統進行比較。我們的研究結果顯示，XAI 系統相較於標準 AI，將平衡診斷準確度提升了 2.8 個百分點。此外，與 AI/XAI 系統的診斷分歧和複雜的病灶與認知負擔升高有關，這由增加的眼睛注視次數所證實。這些見解對臨床實務、視覺任務 AI 工具的設計和醫學診斷中 XAI 的廣泛發展具有重大意義。
 
-##### **On Bob Dylan: A Computational Perspective**
-2502.01772v1 by Prashant Garg
+##### **Explainable AI for Autism Diagnosis: Identifying Critical Brain Regions Using fMRI Data**
+2409.15374v1 by Suryansh Vidya, Kush Gupta, Amir Aly, Andy Wills, Emmanuel Ifeachor, Rohit Shankar
 
-Cass Sunstein's essay 'On Bob Dylan' describes Dylan's 'dishabituating' style
--- a constant refusal to conform to expectation and a penchant for reinventing
-his musical and lyrical identity. In this paper, I extend Sunstein's
-observations through a large-scale computational analysis of Dylan's lyrics
-from 1962 to 2012. Using o3-mini-high (a large language model), I extract
-concept-to-concept relationships from the lyrics and construct directed
-knowledge graphs that capture Dylan's thematic structure. I then quantify
-shifts in sentiment, metaphorical expression, thematic diversity, and network
-complexity over time. The results indicate that Dylan's lyrics increasingly
-rely on metaphor, display an evolving sentiment profile, and exhibit heightened
-dishabituation -- measured here as a growing variance in the network centrality
-of key concepts. I also find that references to movement, protest, and mythic
-imagery fluctuate in ways that align with well-known phases of Dylan's career,
-reflecting the dynamic and unpredictable quality of his art. These findings not
-only deepen our empirical understanding of Sunstein's thesis but also introduce
-a novel computational method for analyzing an artist's evolution-offering
-broader applicability to the study of cultural and creative change.
+Early diagnosis and intervention for Autism Spectrum Disorder (ASD) has been
+shown to significantly improve the quality of life of autistic individuals.
+However, diagnostics methods for ASD rely on assessments based on clinical
+presentation that are prone to bias and can be challenging to arrive at an
+early diagnosis. There is a need for objective biomarkers of ASD which can help
+improve diagnostic accuracy. Deep learning (DL) has achieved outstanding
+performance in diagnosing diseases and conditions from medical imaging data.
+Extensive research has been conducted on creating models that classify ASD
+using resting-state functional Magnetic Resonance Imaging (fMRI) data. However,
+existing models lack interpretability. This research aims to improve the
+accuracy and interpretability of ASD diagnosis by creating a DL model that can
+not only accurately classify ASD but also provide explainable insights into its
+working. The dataset used is a preprocessed version of the Autism Brain Imaging
+Data Exchange (ABIDE) with 884 samples. Our findings show a model that can
+accurately classify ASD and highlight critical brain regions differing between
+ASD and typical controls, with potential implications for early diagnosis and
+understanding of the neural basis of ASD. These findings are validated by
+studies in the literature that use different datasets and modalities,
+confirming that the model actually learned characteristics of ASD and not just
+the dataset. This study advances the field of explainable AI in medical imaging
+by providing a robust and interpretable model, thereby contributing to a future
+with objective and reliable ASD diagnostics.
 
-摘要：卡斯·桑斯坦的論文「論鮑伯·迪倫」描述了迪倫「去習慣化」的風格
--- 這種風格不斷拒絕符合預期，並熱衷於重新塑造他的音樂和歌詞認同。在本文中，我透過對迪倫 1962 年至 2012 年歌詞進行大規模的運算分析，來延伸桑斯坦的觀察。使用 o3-mini-high（一個大型語言模型），我從歌詞中提取概念對概念的關係，並建構有向知識圖，以捕捉迪倫的主題結構。然後，我量化情緒、隱喻表達、主題多樣性和網路複雜性隨時間的變化。結果顯示，迪倫的歌詞越來越依賴隱喻，展現出不斷演化的情緒輪廓，並表現出高度的去習慣化 -- 在這裡測量為關鍵概念的網路中心性的變異增加。我也發現，對運動、抗議和神話意象的引用，會以與迪倫職業生涯中眾所周知階段一致的方式波動，反映了他藝術的動態和不可預測的品質。這些發現不僅加深了我們對桑斯坦論文的經驗理解，也引入了分析藝術家演變的新穎運算方法，為文化和創造性變化的研究提供了更廣泛的適用性。
+摘要：自閉症譜系障礙 (ASD) 的早期診斷和介入已被證實能顯著改善自閉症患者的生活品質。然而，ASD 的診斷方法依賴於基於臨床表現的評估，容易產生偏見，且可能難以做出早期診斷。有必要找出 ASD 的客觀生物標記，以幫助提高診斷準確性。深度學習 (DL) 在從醫學影像資料診斷疾病和病症方面取得傑出的表現。已經針對建立使用靜態功能性磁振造影 (fMRI) 資料對 ASD 進行分類的模型進行廣泛的研究。然而，現有的模型缺乏可解釋性。本研究旨在透過建立一個不僅能準確分類 ASD，還能提供可解釋見解說明其運作原理的 DL 模型，來改善 ASD 診斷的準確性和可解釋性。所使用的資料集是自閉症大腦影像資料交換 (ABIDE) 的預處理版本，包含 884 個樣本。我們的研究結果顯示，該模型能準確分類 ASD，並強調 ASD 與典型對照組之間存在差異的關鍵腦區，對於 ASD 的早期診斷和神經基礎的理解具有潛在的意義。這些研究結果已由使用不同資料集和方式的文獻研究驗證，證實該模型實際上學習了 ASD 的特徵，而不僅僅是資料集。本研究透過提供一個強健且可解釋的模型，推動了醫學影像中可解釋 AI 的領域，從而為未來提供客觀且可靠的 ASD 診斷做出貢獻。
 
-##### **VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos**
-2502.01549v1 by Xubin Ren, Lingrui Xu, Long Xia, Shuaiqiang Wang, Dawei Yin, Chao Huang
+##### **Improving Prototypical Parts Abstraction for Case-Based Reasoning Explanations Designed for the Kidney Stone Type Recognition**
+2409.12883v1 by Daniel Flores-Araiza, Francisco Lopez-Tiro, Clément Larose, Salvador Hinojosa, Andres Mendez-Vazquez, Miguel Gonzalez-Mendoza, Gilberto Ochoa-Ruiz, Christian Daul
 
-Retrieval-Augmented Generation (RAG) has demonstrated remarkable success in
-enhancing Large Language Models (LLMs) through external knowledge integration,
-yet its application has primarily focused on textual content, leaving the rich
-domain of multi-modal video knowledge predominantly unexplored. This paper
-introduces VideoRAG, the first retrieval-augmented generation framework
-specifically designed for processing and understanding extremely long-context
-videos. Our core innovation lies in its dual-channel architecture that
-seamlessly integrates (i) graph-based textual knowledge grounding for capturing
-cross-video semantic relationships, and (ii) multi-modal context encoding for
-efficiently preserving visual features. This novel design empowers VideoRAG to
-process unlimited-length videos by constructing precise knowledge graphs that
-span multiple videos while maintaining semantic dependencies through
-specialized multi-modal retrieval paradigms. Through comprehensive empirical
-evaluation on our proposed LongerVideos benchmark-comprising over 160 videos
-totaling 134+ hours across lecture, documentary, and entertainment
-categories-VideoRAG demonstrates substantial performance compared to existing
-RAG alternatives and long video understanding methods. The source code of
-VideoRAG implementation and the benchmark dataset are openly available at:
-https://github.com/HKUDS/VideoRAG.
+The in-vivo identification of the kidney stone types during an ureteroscopy
+would be a major medical advance in urology, as it could reduce the time of the
+tedious renal calculi extraction process, while diminishing infection risks.
+Furthermore, such an automated procedure would make possible to prescribe
+anti-recurrence treatments immediately. Nowadays, only few experienced
+urologists are able to recognize the kidney stone types in the images of the
+videos displayed on a screen during the endoscopy. Thus, several deep learning
+(DL) models have recently been proposed to automatically recognize the kidney
+stone types using ureteroscopic images. However, these DL models are of black
+box nature whicl limits their applicability in clinical settings. This
+contribution proposes a case-based reasoning DL model which uses prototypical
+parts (PPs) and generates local and global descriptors. The PPs encode for each
+class (i.e., kidney stone type) visual feature information (hue, saturation,
+intensity and textures) similar to that used by biologists. The PPs are
+optimally generated due a new loss function used during the model training.
+Moreover, the local and global descriptors of PPs allow to explain the
+decisions ("what" information, "where in the images") in an understandable way
+for biologists and urologists. The proposed DL model has been tested on a
+database including images of the six most widespread kidney stone types. The
+overall average classification accuracy was 90.37. When comparing this results
+with that of the eight other DL models of the kidney stone state-of-the-art, it
+can be seen that the valuable gain in explanability was not reached at the
+expense of accuracy which was even slightly increased with respect to that
+(88.2) of the best method of the literature. These promising and interpretable
+results also encourage urologists to put their trust in AI-based solutions.
 
-摘要：檢索增強生成 (RAG) 已證明在透過外部知識整合增強大型語言模型 (LLM) 方面取得顯著成功，但其應用主要集中在文字內容上，而豐富的多模態影片知識領域則鮮少被探索。本文介紹 VideoRAG，這是第一個檢索增強生成架構，專門設計用於處理和理解極長語境的影片。我們的核心創新在於其雙通道架構，它無縫整合 (i) 基於圖形文字知識基礎，用於擷取跨影片語義關係，以及 (ii) 多模態語境編碼，用於有效保留視覺特徵。這個新穎的設計讓 VideoRAG 能夠透過建構跨越多個影片的精確知識圖譜來處理長度不限的影片，同時透過專門的多模態檢索範例來維持語義依賴性。透過我們提出的 LongerVideos 基準的全面經驗評估，該基準包含超過 160 部影片，總時數超過 134 小時，涵蓋演講、紀錄片和娛樂類別，VideoRAG 與現有的 RAG 替代方案和長影片理解方法相比，展現出顯著的效能。VideoRAG 實作的原始碼和基準資料集已公開於：https://github.com/HKUDS/VideoRAG。
+摘要：尿路鏡檢查中腎結石類型的體內識別將是泌尿科的一項重大進展，因為它可以減少繁瑣的腎結石取出過程的時間，同時降低感染風險。此外，這種自動化程序將使立即開立抗復發治療成為可能。如今，只有少數經驗豐富的泌尿科醫生能夠在內視鏡檢查期間屏幕上顯示的視頻圖像中識別腎結石類型。因此，最近已提出多種深度學習 (DL) 模型，以使用輸尿管鏡圖像自動識別腎結石類型。然而，這些 DL 模型本質上是黑盒子，這限制了它們在臨床環境中的應用性。本文提出了一個基於案例推理的 DL 模型，它使用原型部分 (PP) 並生成局部和全局描述符。PP 為每種類型（即腎結石類型）編碼視覺特徵信息（色調、飽和度、強度和紋理），類似於生物學家使用的信息。由於在模型訓練期間使用的新損失函數，PP 得到了最佳生成。此外，PP 的局部和全局描述符允許以生物學家和泌尿科醫生可以理解的方式解釋決策（“什麼”信息，“圖像中的什麼位置”）。所提出的 DL 模型已在一個包含六種最廣泛的腎結石類型圖像的數據庫上進行了測試。總體平均分類準確率為 90.37。將此結果與腎結石最先進的八個其他 DL 模型的結果進行比較時，可以看出，可解釋性的寶貴增益並未以準確性為代價，甚至略有增加與文獻中最好的方法 (88.2) 相比。這些有希望且可解釋的結果也鼓勵泌尿科醫生相信基於人工智能的解決方案。
 
-##### **Transformers trained on proteins can learn to attend to Euclidean distance**
-2502.01533v1 by Isaac Ellmen, Constantin Schneider, Matthew I. J. Raybould, Charlotte M. Deane
+##### **Towards Interpretable End-Stage Renal Disease (ESRD) Prediction: Utilizing Administrative Claims Data with Explainable AI Techniques**
+2409.12087v3 by Yubo Li, Saba Al-Sayouri, Rema Padman
+
+This study explores the potential of utilizing administrative claims data,
+combined with advanced machine learning and deep learning techniques, to
+predict the progression of Chronic Kidney Disease (CKD) to End-Stage Renal
+Disease (ESRD). We analyze a comprehensive, 10-year dataset provided by a major
+health insurance organization to develop prediction models for multiple
+observation windows using traditional machine learning methods such as Random
+Forest and XGBoost as well as deep learning approaches such as Long Short-Term
+Memory (LSTM) networks. Our findings demonstrate that the LSTM model,
+particularly with a 24-month observation window, exhibits superior performance
+in predicting ESRD progression, outperforming existing models in the
+literature. We further apply SHapley Additive exPlanations (SHAP) analysis to
+enhance interpretability, providing insights into the impact of individual
+features on predictions at the individual patient level. This study underscores
+the value of leveraging administrative claims data for CKD management and
+predicting ESRD progression.
+
+摘要：本研究探討利用行政申報資料，結合先進機器學習與深度學習技術，預測慢性腎臟病 (CKD) 進展至末期腎臟疾病 (ESRD) 的可能性。我們分析一家大型健康保險組織提供的 10 年綜合資料集，使用傳統機器學習方法（例如隨機森林和 XGBoost）以及深度學習方法（例如長期短期記憶 (LSTM) 網路）開發多個觀察視窗的預測模型。我們的研究結果顯示，LSTM 模型（尤其是 24 個月觀察視窗）在預測 ESRD 進展方面表現優異，優於文獻中的現有模型。我們進一步應用 SHapley 可加性解釋 (SHAP) 分析以增強可解釋性，深入了解個別特徵對個別患者層級預測的影響。本研究強調了利用行政申報資料進行 CKD 管理和預測 ESRD 進展的價值。
+
+##### **Contextual Evaluation of Large Language Models for Classifying Tropical and Infectious Diseases**
+2409.09201v3 by Mercy Asiedu, Nenad Tomasev, Chintan Ghate, Tiya Tiyasirichokchai, Awa Dieng, Oluwatosin Akande, Geoffrey Siwo, Steve Adudans, Sylvanus Aitkins, Odianosen Ehiakhamen, Eric Ndombi, Katherine Heller
 
-While conventional Transformers generally operate on sequence data, they can
-be used in conjunction with structure models, typically SE(3)-invariant or
-equivariant graph neural networks (GNNs), for 3D applications such as protein
-structure modelling. These hybrids typically involve either (1)
-preprocessing/tokenizing structural features as input for Transformers or (2)
-taking Transformer embeddings and processing them within a structural
-representation. However, there is evidence that Transformers can learn to
-process structural information on their own, such as the AlphaFold3 structural
-diffusion model. In this work we show that Transformers can function
-independently as structure models when passed linear embeddings of coordinates.
-We first provide a theoretical explanation for how Transformers can learn to
-filter attention as a 3D Gaussian with learned variance. We then validate this
-theory using both simulated 3D points and in the context of masked token
-prediction for proteins. Finally, we show that pre-training protein Transformer
-encoders with structure improves performance on a downstream task, yielding
-better performance than custom structural models. Together, this work provides
-a basis for using standard Transformers as hybrid structure-language models.
+While large language models (LLMs) have shown promise for medical question
+answering, there is limited work focused on tropical and infectious
+disease-specific exploration. We build on an opensource tropical and infectious
+diseases (TRINDs) dataset, expanding it to include demographic and semantic
+clinical and consumer augmentations yielding 11000+ prompts. We evaluate LLM
+performance on these, comparing generalist and medical LLMs, as well as LLM
+outcomes to human experts. We demonstrate through systematic experimentation,
+the benefit of contextual information such as demographics, location, gender,
+risk factors for optimal LLM response. Finally we develop a prototype of
+TRINDs-LM, a research tool that provides a playground to navigate how context
+impacts LLM outputs for health.
 
-摘要：雖然傳統的 Transformer 通常處理序列資料，但它們可用於結構模型，通常是 SE(3) 不變式或等變式圖神經網路 (GNN)，用於蛋白質結構建模等 3D 應用。這些混合模型通常包含 (1) 將結構特徵預處理/標記化為 Transformer 的輸入或 (2) 取用 Transformer 嵌入並在結構表示中處理它們。然而，有證據表明 Transformer 可以自行學習處理結構資訊，例如 AlphaFold3 結構擴散模型。在這項工作中，我們展示了 Transformer 在傳遞座標的線性嵌入時，可以獨立作為結構模型運作。我們首先提供了 Transformer 如何學習將注意力濾波為具有學習變異的 3D 高斯的理論解釋。然後我們使用模擬 3D 點和在蛋白質遮罩標記預測的背景下驗證此理論。最後，我們展示了使用結構預訓練蛋白質 Transformer 編碼器會改善下游任務的效能，產生比自訂結構模型更好的效能。綜合來說，這項工作提供了使用標準 Transformer 作為混合結構語言模型的基礎。
+摘要：儘管大型語言模型 (LLM) 在醫療問題解答方面展現出前景，但專注於熱帶和傳染病特定探索的研究有限。我們建立在一個開放原始碼熱帶和傳染病 (TRINDs) 資料集上，並將其擴展為納入人口統計和語義臨床和消費者擴充，產生超過 11000 個提示。我們評估了 LLM 在這些方面的效能，比較了通才和醫療 LLM，以及 LLM 結果與人類專家的比較。我們透過系統性實驗證明了背景資訊（例如人口統計、位置、性別、最佳 LLM 回應的風險因素）的好處。最後，我們開發了 TRINDs-LM 的原型，這是一個研究工具，提供一個探索背景如何影響 LLM 健康輸出的平台。
 
-##### **Common Foundations for SHACL, ShEx, and PG-Schema**
-2502.01295v1 by S. Ahmetaj, I. Boneva, J. Hidders, K. Hose, M. Jakubowski, J. E. Labra-Gayo, W. Martens, F. Mogavero, F. Murlak, C. Okulmus, A. Polleres, O. Savkovic, M. Simkus, D. Tomaszuk
+##### **Explainable AI: Definition and attributes of a good explanation for health AI**
+2409.15338v1 by Evangelia Kyrimi, Scott McLachlan, Jared M Wohlgemut, Zane B Perkins, David A. Lagnado, William Marsh, the ExAIDSS Expert Group
 
-Graphs have emerged as an important foundation for a variety of applications,
-including capturing and reasoning over factual knowledge, semantic data
-integration, social networks, and providing factual knowledge for machine
-learning algorithms. To formalise certain properties of the data and to ensure
-data quality, there is a need to describe the schema of such graphs. Because of
-the breadth of applications and availability of different data models, such as
-RDF and property graphs, both the Semantic Web and the database community have
-independently developed graph schema languages: SHACL, ShEx, and PG-Schema.
-Each language has its unique approach to defining constraints and validating
-graph data, leaving potential users in the dark about their commonalities and
-differences. In this paper, we provide formal, concise definitions of the core
-components of each of these schema languages. We employ a uniform framework to
-facilitate a comprehensive comparison between the languages and identify a
-common set of functionalities, shedding light on both overlapping and
-distinctive features of the three languages.
+Proposals of artificial intelligence (AI) solutions based on increasingly
+complex and accurate predictive models are becoming ubiquitous across many
+disciplines. As the complexity of these models grows, transparency and users'
+understanding often diminish. This suggests that accurate prediction alone is
+insufficient for making an AI-based solution truly useful. In the development
+of healthcare systems, this introduces new issues related to accountability and
+safety. Understanding how and why an AI system makes a recommendation may
+require complex explanations of its inner workings and reasoning processes.
+Although research on explainable AI (XAI) has significantly increased in recent
+years and there is high demand for XAI in medicine, defining what constitutes a
+good explanation remains ad hoc, and providing adequate explanations continues
+to be challenging. To fully realize the potential of AI, it is critical to
+address two fundamental questions about explanations for safety-critical AI
+applications, such as health-AI: (1) What is an explanation in health-AI? and
+(2) What are the attributes of a good explanation in health-AI? In this study,
+we examined published literature and gathered expert opinions through a
+two-round Delphi study. The research outputs include (1) a definition of what
+constitutes an explanation in health-AI and (2) a comprehensive list of
+attributes that characterize a good explanation in health-AI.
 
-摘要：圖表已成為各種應用的重要基礎，包括擷取和推理事實知識、語義資料整合、社群網路，以及為機器學習演算法提供事實知識。為了形式化資料的特定屬性並確保資料品質，有必要描述此類圖表的架構。由於應用範圍廣泛且有不同的資料模型可用，例如 RDF 和屬性圖表，因此語義網路和資料庫社群已獨立開發圖表架構語言：SHACL、ShEx 和 PG-Schema。每種語言都有其定義約束和驗證圖表資料的獨特方法，讓潛在使用者不清楚它們的共性和差異。在本文中，我們提供這些架構語言中每個核心元件的正式簡潔定義。我們採用統一的框架來促進語言之間的全面比較，並找出功能的共同集合，說明這三種語言的重疊和獨特功能。
+摘要：隨著越來越複雜且準確的預測模型，基於人工智慧 (AI) 解決方案的提案在許多領域中變得無處不在。隨著這些模型複雜性的增加，透明度和使用者的理解力往往會降低。這表示僅有準確的預測並不足以讓 AI 解決方案真正有用。在醫療保健系統的開發中，這引入了與問責制和安全性相關的新問題。瞭解 AI 系統如何以及為何提出建議可能需要對其內部運作和推理過程進行複雜的說明。儘管近年來對可解釋 AI (XAI) 的研究已大幅增加，且醫學領域對 XAI 有很高的需求，但定義什麼構成一個好的解釋仍是臨時性的，而提供適當的解釋仍然具有挑戰性。為了充分發揮 AI 的潛力，對於安全關鍵型 AI 應用（例如健康 AI）的解釋，探討兩個基本問題至關重要：(1) 什麼是健康 AI 中的解釋？以及 (2) 健康 AI 中一個好的解釋有哪些屬性？在本研究中，我們檢視了已發表的文獻，並透過兩輪德爾菲研究收集了專家意見。研究成果包括：(1) 健康 AI 中什麼構成解釋的定義，以及 (2) 健康 AI 中一個好解釋的屬性清單。
 
-##### **GFM-RAG: Graph Foundation Model for Retrieval Augmented Generation**
-2502.01113v1 by Linhao Luo, Zicheng Zhao, Gholamreza Haffari, Dinh Phung, Chen Gong, Shirui Pan
+##### **Exploring the Effect of Explanation Content and Format on User Comprehension and Trust**
+2408.17401v1 by Antonio Rago, Bence Palfi, Purin Sukpanichnant, Hannibal Nabli, Kavyesh Vivek, Olga Kostopoulou, James Kinross, Francesca Toni
 
-Retrieval-augmented generation (RAG) has proven effective in integrating
-knowledge into large language models (LLMs). However, conventional RAGs
-struggle to capture complex relationships between pieces of knowledge, limiting
-their performance in intricate reasoning that requires integrating knowledge
-from multiple sources. Recently, graph-enhanced retrieval augmented generation
-(GraphRAG) builds graph structure to explicitly model these relationships,
-enabling more effective and efficient retrievers. Nevertheless, its performance
-is still hindered by the noise and incompleteness within the graph structure.
-To address this, we introduce GFM-RAG, a novel graph foundation model (GFM) for
-retrieval augmented generation. GFM-RAG is powered by an innovative graph
-neural network that reasons over graph structure to capture complex
-query-knowledge relationships. The GFM with 8M parameters undergoes a two-stage
-training process on large-scale datasets, comprising 60 knowledge graphs with
-over 14M triples and 700k documents. This results in impressive performance and
-generalizability for GFM-RAG, making it the first graph foundation model
-applicable to unseen datasets for retrieval without any fine-tuning required.
-Extensive experiments on three multi-hop QA datasets and seven domain-specific
-RAG datasets demonstrate that GFM-RAG achieves state-of-the-art performance
-while maintaining efficiency and alignment with neural scaling laws,
-highlighting its potential for further improvement.
+In recent years, various methods have been introduced for explaining the
+outputs of "black-box" AI models. However, it is not well understood whether
+users actually comprehend and trust these explanations. In this paper, we focus
+on explanations for a regression tool for assessing cancer risk and examine the
+effect of the explanations' content and format on the user-centric metrics of
+comprehension and trust. Regarding content, we experiment with two explanation
+methods: the popular SHAP, based on game-theoretic notions and thus potentially
+complex for everyday users to comprehend, and occlusion-1, based on feature
+occlusion which may be more comprehensible. Regarding format, we present SHAP
+explanations as charts (SC), as is conventional, and occlusion-1 explanations
+as charts (OC) as well as text (OT), to which their simpler nature also lends
+itself. The experiments amount to user studies questioning participants, with
+two different levels of expertise (the general population and those with some
+medical training), on their subjective and objective comprehension of and trust
+in explanations for the outputs of the regression tool. In both studies we
+found a clear preference in terms of subjective comprehension and trust for
+occlusion-1 over SHAP explanations in general, when comparing based on content.
+However, direct comparisons of explanations when controlling for format only
+revealed evidence for OT over SC explanations in most cases, suggesting that
+the dominance of occlusion-1 over SHAP explanations may be driven by a
+preference for text over charts as explanations. Finally, we found no evidence
+of a difference between the explanation types in terms of objective
+comprehension. Thus overall, the choice of the content and format of
+explanations needs careful attention, since in some contexts format, rather
+than content, may play the critical role in improving user experience.
 
-摘要：檢索增強生成 (RAG) 已證明在整合知識到大語言模型 (LLM) 中有效。然而，傳統的 RAG 難以捕捉知識片段之間的複雜關係，限制了它們在需要整合來自多個來源的知識的複雜推理中的表現。最近，圖表增強檢索增強生成 (GraphRAG) 建立圖表結構來明確建模這些關係，從而實現更有效率的檢索器。儘管如此，其效能仍受到圖表結構中雜訊和不完整性的阻礙。為了解決這個問題，我們引入了 GFM-RAG，一種用於檢索增強生成的全新圖表基礎模型 (GFM)。GFM-RAG 由一個創新的圖神經網路驅動，該網路在圖表結構上進行推理以捕捉複雜的查詢知識關係。具有 8M 參數的 GFM 在大型資料集上進行兩階段訓練流程，包括 60 個包含超過 14M 個三元組和 700k 個文件的文件。這為 GFM-RAG 帶來了令人印象深刻的效能和通用性，使其成為第一個適用於未見過資料集的圖表基礎模型，而無需任何微調。在三個多跳問答資料集和七個特定領域 RAG 資料集上的廣泛實驗表明，GFM-RAG 達到了最先進的效能，同時保持了效率並與神經擴充定律保持一致，突顯了其進一步改進的潛力。
+摘要：<paragraph>近年來，已經引進各種方法來解釋「黑箱」AI 模型的輸出。然而，目前並不清楚使用者是否實際理解和信任這些解釋。在本文中，我們專注於評估癌症風險的回歸工具的解釋，並探討解釋的內容和格式對以使用者為中心的理解和信任指標的影響。關於內容，我們實驗了兩種解釋方法：流行的 SHAP，基於博弈論概念，因此對於日常使用者來說可能很複雜，以及基於特徵遮蔽的 occlusion-1，可能更易於理解。關於格式，我們將 SHAP 解釋呈現為圖表 (SC)，這是慣例，而將 occlusion-1 解釋呈現為圖表 (OC) 以及文字 (OT)，其較為簡單的性質也適用於此。這些實驗等同於使用者研究，詢問參與者，具有兩種不同程度的專業知識（一般民眾和具備一些醫學訓練的人），他們對回歸工具輸出解釋的主觀和客觀理解和信任。在兩項研究中，我們發現，在基於內容進行比較時，一般來說，occlusion-1 優於 SHAP 解釋，在主觀理解和信任方面有明顯的偏好。然而，在僅控制格式的情況下直接比較解釋，在大多數情況下只顯示 OT 優於 SC 解釋的證據，這表明 occlusion-1 優於 SHAP 解釋的主導地位可能是由偏好文字而非圖表作為解釋所驅動的。最後，我們沒有發現解釋類型在客觀理解方面的差異證據。因此，總體而言，對解釋的內容和格式的選擇需要仔細注意，因為在某些情況下，格式而非內容，可能在改善使用者體驗方面發揮關鍵作用。</paragraph>
 
-##### **Knowledge Synthesis of Photosynthesis Research Using a Large Language Model**
-2502.01059v1 by Seungri Yoon, Woosang Jeon, Sanghyeok Choi, Taehyeong Kim, Tae In Ahn
+##### **A Survey for Large Language Models in Biomedicine**
+2409.00133v1 by Chong Wang, Mengyao Li, Junjun He, Zhongruo Wang, Erfan Darzi, Zan Chen, Jin Ye, Tianbin Li, Yanzhou Su, Jing Ke, Kaili Qu, Shuxin Li, Yi Yu, Pietro Liò, Tianyun Wang, Yu Guang Wang, Yiqing Shen
 
-The development of biological data analysis tools and large language models
-(LLMs) has opened up new possibilities for utilizing AI in plant science
-research, with the potential to contribute significantly to knowledge
-integration and research gap identification. Nonetheless, current LLMs struggle
-to handle complex biological data and theoretical models in photosynthesis
-research and often fail to provide accurate scientific contexts. Therefore,
-this study proposed a photosynthesis research assistant (PRAG) based on
-OpenAI's GPT-4o with retrieval-augmented generation (RAG) techniques and prompt
-optimization. Vector databases and an automated feedback loop were used in the
-prompt optimization process to enhance the accuracy and relevance of the
-responses to photosynthesis-related queries. PRAG showed an average improvement
-of 8.7% across five metrics related to scientific writing, with a 25.4%
-increase in source transparency. Additionally, its scientific depth and domain
-coverage were comparable to those of photosynthesis research papers. A
-knowledge graph was used to structure PRAG's responses with papers within and
-outside the database, which allowed PRAG to match key entities with 63% and
-39.5% of the database and test papers, respectively. PRAG can be applied for
-photosynthesis research and broader plant science domains, paving the way for
-more in-depth data analysis and predictive capabilities.
+Recent breakthroughs in large language models (LLMs) offer unprecedented
+natural language understanding and generation capabilities. However, existing
+surveys on LLMs in biomedicine often focus on specific applications or model
+architectures, lacking a comprehensive analysis that integrates the latest
+advancements across various biomedical domains. This review, based on an
+analysis of 484 publications sourced from databases including PubMed, Web of
+Science, and arXiv, provides an in-depth examination of the current landscape,
+applications, challenges, and prospects of LLMs in biomedicine, distinguishing
+itself by focusing on the practical implications of these models in real-world
+biomedical contexts. Firstly, we explore the capabilities of LLMs in zero-shot
+learning across a broad spectrum of biomedical tasks, including diagnostic
+assistance, drug discovery, and personalized medicine, among others, with
+insights drawn from 137 key studies. Then, we discuss adaptation strategies of
+LLMs, including fine-tuning methods for both uni-modal and multi-modal LLMs to
+enhance their performance in specialized biomedical contexts where zero-shot
+fails to achieve, such as medical question answering and efficient processing
+of biomedical literature. Finally, we discuss the challenges that LLMs face in
+the biomedicine domain including data privacy concerns, limited model
+interpretability, issues with dataset quality, and ethics due to the sensitive
+nature of biomedical data, the need for highly reliable model outputs, and the
+ethical implications of deploying AI in healthcare. To address these
+challenges, we also identify future research directions of LLM in biomedicine
+including federated learning methods to preserve data privacy and integrating
+explainable AI methodologies to enhance the transparency of LLMs.
 
-摘要：生物資料分析工具和大型語言模型 (LLM) 的發展，為利用人工智慧於植物科學研究開啟了新的可能性，並有潛力對知識整合和研究差距的識別做出重大貢獻。儘管如此，目前的 LLM 在處理光合作用研究中的複雜生物資料和理論模型時仍有困難，而且常常無法提供準確的科學背景。因此，本研究提出了一個基於 OpenAI 的 GPT-4o、具備檢索增強生成 (RAG) 技術和提示最佳化的光合作用研究助理 (PRAG)。在提示最佳化過程中，使用了向量資料庫和自動回饋迴路，以增強對與光合作用相關查詢的回應的準確性和相關性。PRAG 在與科學寫作相關的五項指標中顯示出平均改善了 8.7%，來源透明度增加了 25.4%。此外，其科學深度和領域涵蓋範圍與光合作用研究論文相當。知識圖譜用於建構 PRAG 的回應，其中包含資料庫內外論文，這使得 PRAG 能夠分別與資料庫和測試論文中的 63% 和 39.5% 的關鍵實體相匹配。PRAG 可應用於光合作用研究和更廣泛的植物科學領域，為更深入的資料分析和預測能力鋪路。
+摘要：大型語言模型 (LLM) 的最新突破提供了前所未有的自然語言理解和生成能力。然而，現有關於生物醫學中 LLM 的調查通常專注於特定應用或模型架構，缺乏整合各種生物醫學領域最新進展的全面分析。本綜述基於對來自 PubMed、Web of Science 和 arXiv 等數據庫的 484 篇出版物的分析，深入探討了生物醫學中 LLM 的當前現況、應用、挑戰和前景，其特點是關注這些模型在現實世界生物醫學背景中的實際應用。首先，我們探討了 LLM 在廣泛的生物醫學任務中的零次學習能力，包括診斷輔助、藥物發現和個性化醫療等，並從 137 項關鍵研究中汲取見解。然後，我們討論了 LLM 的適應策略，包括單模態和多模態 LLM 的微調方法，以增強它們在零次學習無法實現的專業生物醫學背景中的性能，例如醫療問題解答和生物醫學文獻的有效處理。最後，我們討論了 LLM 在生物醫學領域面臨的挑戰，包括數據隱私問題、模型可解釋性有限、數據集質量問題以及由於生物醫學數據的敏感性、對高度可靠模型輸出的需求以及在醫療保健中部署 AI 的倫理影響而產生的倫理問題。為了應對這些挑戰，我們還確定了生物醫學中 LLM 未來的研究方向，包括用於保護數據隱私的聯合學習方法以及整合可解釋 AI 方法以增強 LLM 的透明度。
 
-##### **Encrypted Large Model Inference: The Equivariant Encryption Paradigm**
-2502.01013v1 by James Buban, Hongyang Zhang, Claudio Angione, Harry Yang, Ahmad Farhan, Seyfal Sultanov, Michael Du, Xuran Ma, Zihao Wang, Yue Zhao, Arria Owlia, Fielding Johnston, Patrick Colangelo
+##### **Aligning XAI with EU Regulations for Smart Biomedical Devices: A Methodology for Compliance Analysis**
+2408.15121v1 by Francesco Sovrano, Michael Lognoul, Giulia Vilone
 
-Large scale deep learning model, such as modern language models and diffusion
-architectures, have revolutionized applications ranging from natural language
-processing to computer vision. However, their deployment in distributed or
-decentralized environments raises significant privacy concerns, as sensitive
-data may be exposed during inference. Traditional techniques like secure
-multi-party computation, homomorphic encryption, and differential privacy offer
-partial remedies but often incur substantial computational overhead, latency
-penalties, or limited compatibility with non-linear network operations. In this
-work, we introduce Equivariant Encryption (EE), a novel paradigm designed to
-enable secure, "blind" inference on encrypted data with near zero performance
-overhead. Unlike fully homomorphic approaches that encrypt the entire
-computational graph, EE selectively obfuscates critical internal
-representations within neural network layers while preserving the exact
-functionality of both linear and a prescribed set of non-linear operations.
-This targeted encryption ensures that raw inputs, intermediate activations, and
-outputs remain confidential, even when processed on untrusted infrastructure.
-We detail the theoretical foundations of EE, compare its performance and
-integration complexity against conventional privacy preserving techniques, and
-demonstrate its applicability across a range of architectures, from
-convolutional networks to large language models. Furthermore, our work provides
-a comprehensive threat analysis, outlining potential attack vectors and
-baseline strategies, and benchmarks EE against standard inference pipelines in
-decentralized settings. The results confirm that EE maintains high fidelity and
-throughput, effectively bridging the gap between robust data confidentiality
-and the stringent efficiency requirements of modern, large scale model
-inference.
+Significant investment and development have gone into integrating Artificial
+Intelligence (AI) in medical and healthcare applications, leading to advanced
+control systems in medical technology. However, the opacity of AI systems
+raises concerns about essential characteristics needed in such sensitive
+applications, like transparency and trustworthiness. Our study addresses these
+concerns by investigating a process for selecting the most adequate Explainable
+AI (XAI) methods to comply with the explanation requirements of key EU
+regulations in the context of smart bioelectronics for medical devices. The
+adopted methodology starts with categorising smart devices by their control
+mechanisms (open-loop, closed-loop, and semi-closed-loop systems) and delving
+into their technology. Then, we analyse these regulations to define their
+explainability requirements for the various devices and related goals.
+Simultaneously, we classify XAI methods by their explanatory objectives. This
+allows for matching legal explainability requirements with XAI explanatory
+goals and determining the suitable XAI algorithms for achieving them. Our
+findings provide a nuanced understanding of which XAI algorithms align better
+with EU regulations for different types of medical devices. We demonstrate this
+through practical case studies on different neural implants, from chronic
+disease management to advanced prosthetics. This study fills a crucial gap in
+aligning XAI applications in bioelectronics with stringent provisions of EU
+regulations. It provides a practical framework for developers and researchers,
+ensuring their AI innovations advance healthcare technology and adhere to legal
+and ethical standards.
 
-摘要：大型深度學習模型，例如現代語言模型和擴散架構，徹底改變了從自然語言處理到電腦視覺等各種應用。然而，它們在分散式或分散式環境中的部署引發了重大的隱私問題，因為敏感數據可能會在推理過程中遭到揭露。安全多方計算、同態加密和差分隱私等傳統技術提供了部分補救措施，但通常會產生大量的計算開銷、延遲處罰，或與非線性網路操作相容性有限。在這項工作中，我們引入了等變加密 (EE)，這是一種新穎的範例，旨在以接近零效能開銷對加密數據進行安全、「盲目」推理。與加密整個計算圖形的完全同態方法不同，EE 有選擇性地混淆神經網路層內的關鍵內部表示，同時保留線性和規定的一組非線性操作的精確功能。這種有針對性的加密確保了原始輸入、中間激活和輸出保持機密，即使在不受信任的基礎設施上處理也是如此。我們詳細說明了 EE 的理論基礎，比較了其效能和整合複雜度與傳統的隱私保護技術，並展示了其在從卷積網路到大語言模型等各種架構中的適用性。此外，我們的研究提供了全面的威脅分析，概述了潛在的攻擊媒介和基準策略，並在分散式設定中將 EE 與標準推理管道進行比較。結果證實，EE 保持了高保真度和高傳輸量，有效地彌合了強大的數據機密性與現代化、大規模模型推理的嚴格效率要求之間的差距。
+摘要：人工智慧（AI）在醫療和保健應用中投入了大量的投資和開發，進而導致醫療技術中的先進控制系統。然而，AI 系統的不透明性引發了對此類敏感應用中所需基本特性的擔憂，例如透明度和可信度。我們的研究透過調查一個程序來解決這些問題，用於選擇最充分的可解釋 AI（XAI）方法，以符合歐盟法規在醫療器材的智慧型生物電子學中的說明要求。採用的方法從透過其控制機制（開迴路、閉迴路和半閉迴路系統）對智慧型裝置進行分類，並深入探討其技術開始。然後，我們分析這些法規以定義其對各種裝置和相關目標的可解釋性要求。同時，我們透過其說明目標對 XAI 方法進行分類。這允許將法律可解釋性要求與 XAI 說明目標相匹配，並確定適當的 XAI 演算法來達成它們。我們的研究結果提供了對哪些 XAI 演算法更符合歐盟法規以適用於不同類型的醫療器材的細緻理解。我們透過不同神經植入物的實際案例研究來證明這一點，從慢性疾病管理到先進的義肢。這項研究填補了將生物電子學中的 XAI 應用與歐盟法規的嚴格規定相符的重要空白。它為開發人員和研究人員提供了一個實用的架構，確保其 AI 創新能促進醫療技術並遵守法律和道德標準。
 
-##### **Metastable Dynamics of Chain-of-Thought Reasoning: Provable Benefits of Search, RL and Distillation**
-2502.01694v1 by Juno Kim, Denny Wu, Jason Lee, Taiji Suzuki
+##### **Towards Case-based Interpretability for Medical Federated Learning**
+2408.13626v1 by Laura Latorre, Liliana Petrychenko, Regina Beets-Tan, Taisiya Kopytova, Wilson Silva
 
-A key paradigm to improve the reasoning capabilities of large language models
-(LLMs) is to allocate more inference-time compute to search against a verifier
-or reward model. This process can then be utilized to refine the pretrained
-model or distill its reasoning patterns into more efficient models. In this
-paper, we study inference-time compute by viewing chain-of-thought (CoT)
-generation as a metastable Markov process: easy reasoning steps (e.g.,
-algebraic manipulations) form densely connected clusters, while hard reasoning
-steps (e.g., applying a relevant theorem) create sparse, low-probability edges
-between clusters, leading to phase transitions at longer timescales. Under this
-framework, we prove that implementing a search protocol that rewards sparse
-edges improves CoT by decreasing the expected number of steps to reach
-different clusters. In contrast, we establish a limit on reasoning capability
-when the model is restricted to local information of the pretrained graph. We
-also show that the information gained by search can be utilized to obtain a
-better reasoning model: (1) the pretrained model can be directly finetuned to
-favor sparse edges via policy gradient methods, and moreover (2) a compressed
-metastable representation of the reasoning dynamics can be distilled into a
-smaller, more efficient model.
+We explore deep generative models to generate case-based explanations in a
+medical federated learning setting. Explaining AI model decisions through
+case-based interpretability is paramount to increasing trust and allowing
+widespread adoption of AI in clinical practice. However, medical AI training
+paradigms are shifting towards federated learning settings in order to comply
+with data protection regulations. In a federated scenario, past data is
+inaccessible to the current user. Thus, we use a deep generative model to
+generate synthetic examples that protect privacy and explain decisions. Our
+proof-of-concept focuses on pleural effusion diagnosis and uses publicly
+available Chest X-ray data.
 
-摘要：<paragraph>提升大型語言模型 (LLM) 推理能力的一個關鍵範例，是分配更多推論時間運算來搜尋驗證器或獎勵模型。此程序接著可用於改善預訓練模型或將其推理模式提煉到更有效率的模型中。在這篇論文中，我們透過將思維鏈 (CoT) 生成視為亞穩態馬可夫過程來研究推論時間運算：簡單的推理步驟（例如代數運算）形成密集連接的叢集，而困難的推理步驟（例如應用相關定理）則在叢集之間建立稀疏、低機率的邊緣，導致在較長時間尺度上產生相變。在此架構下，我們證明實作一種獎勵稀疏邊緣的搜尋協定，會透過減少到達不同叢集所需的預期步驟數來改善 CoT。相反地，當模型受限於預訓練圖形的局部資訊時，我們建立了推理能力的限制。我們也顯示搜尋所獲得的資訊可用於取得更好的推理模型：(1) 預訓練模型可以直接微調以透過策略梯度方法偏好稀疏邊緣，而且 (2) 推理動態的壓縮亞穩態表徵可以提煉到更小、更有效率的模型中。</paragraph>
+摘要：我們探索深度生成模型，在醫療聯邦學習設置中生成基於案例的說明。透過基於案例的可解釋性來解釋 AI 模型決策，對於增加信任並允許 AI 在臨床實務中廣泛採用至關重要。然而，醫療 AI 訓練範例正轉向聯邦學習設置，以符合資料保護法規。在聯邦情境中，過去的資料對目前的使用者而言是無法取得的。因此，我們使用深度生成模型來產生保護隱私和解釋決策的合成範例。我們的概念驗證著重於胸腔積液診斷，並使用公開可取得的胸部 X 光資料。
 
-##### **PhiP-G: Physics-Guided Text-to-3D Compositional Scene Generation**
-2502.00708v1 by Qixuan Li, Chao Wang, Zongjin He, Yan Peng
+##### **AI in radiological imaging of soft-tissue and bone tumours: a systematic review evaluating against CLAIM and FUTURE-AI guidelines**
+2408.12491v1 by Douwe J. Spaanderman, Matthew Marzetti, Xinyi Wan, Andrew F. Scarsbrook, Philip Robinson, Edwin H. G. Oei, Jacob J. Visser, Robert Hemke, Kirsten van Langevelde, David F. Hanff, Geert J. L. H. van Leenders, Cornelis Verhoef, Dirk J. Gruühagen, Wiro J. Niessen, Stefan Klein, Martijn P. A. Starmans
 
-Text-to-3D asset generation has achieved significant optimization under the
-supervision of 2D diffusion priors. However, when dealing with compositional
-scenes, existing methods encounter several challenges: 1). failure to ensure
-that composite scene layouts comply with physical laws; 2). difficulty in
-accurately capturing the assets and relationships described in complex scene
-descriptions; 3). limited autonomous asset generation capabilities among layout
-approaches leveraging large language models (LLMs). To avoid these compromises,
-we propose a novel framework for compositional scene generation, PhiP-G, which
-seamlessly integrates generation techniques with layout guidance based on a
-world model. Leveraging LLM-based agents, PhiP-G analyzes the complex scene
-description to generate a scene graph, and integrating a multimodal 2D
-generation agent and a 3D Gaussian generation method for targeted assets
-creation. For the stage of layout, PhiP-G employs a physical pool with adhesion
-capabilities and a visual supervision agent, forming a world model for layout
-prediction and planning. Extensive experiments demonstrate that PhiP-G
-significantly enhances the generation quality and physical rationality of the
-compositional scenes. Notably, PhiP-G attains state-of-the-art (SOTA)
-performance in CLIP scores, achieves parity with the leading methods in
-generation quality as measured by the T$^3$Bench, and improves efficiency by
-24x.
+Soft-tissue and bone tumours (STBT) are rare, diagnostically challenging
+lesions with variable clinical behaviours and treatment approaches. This
+systematic review provides an overview of Artificial Intelligence (AI) methods
+using radiological imaging for diagnosis and prognosis of these tumours,
+highlighting challenges in clinical translation, and evaluating study alignment
+with the Checklist for AI in Medical Imaging (CLAIM) and the FUTURE-AI
+international consensus guidelines for trustworthy and deployable AI to promote
+the clinical translation of AI methods. The review covered literature from
+several bibliographic databases, including papers published before 17/07/2024.
+Original research in peer-reviewed journals focused on radiology-based AI for
+diagnosing or prognosing primary STBT was included. Exclusion criteria were
+animal, cadaveric, or laboratory studies, and non-English papers. Abstracts
+were screened by two of three independent reviewers for eligibility. Eligible
+papers were assessed against guidelines by one of three independent reviewers.
+The search identified 15,015 abstracts, from which 325 articles were included
+for evaluation. Most studies performed moderately on CLAIM, averaging a score
+of 28.9$\pm$7.5 out of 53, but poorly on FUTURE-AI, averaging 5.1$\pm$2.1 out
+of 30. Imaging-AI tools for STBT remain at the proof-of-concept stage,
+indicating significant room for improvement. Future efforts by AI developers
+should focus on design (e.g. define unmet clinical need, intended clinical
+setting and how AI would be integrated in clinical workflow), development (e.g.
+build on previous work, explainability), evaluation (e.g. evaluating and
+addressing biases, evaluating AI against best practices), and data
+reproducibility and availability (making documented code and data publicly
+available). Following these recommendations could improve clinical translation
+of AI methods.
 
-摘要：<paragraph>在 2D 擴散先驗的監督下，文字轉 3D 資產生成已取得顯著的最佳化。然而，在處理合成場景時，現有方法會遇到幾個挑戰：1) 無法確保複合場景佈局符合物理定律；2) 難以準確捕捉複雜場景描述中所描述的資產和關係；3) 在利用大型語言模型 (LLM) 的佈局方法中，自主資產生成能力有限。為了避免這些折衷，我們提出了一個合成場景生成的新框架 PhiP-G，它將生成技術與基於世界模型的佈局指導無縫整合。利用基於 LLM 的代理，PhiP-G 分析複雜的場景描述以生成場景圖，並整合多模態 2D 生成代理和 3D 高斯生成方法以進行目標資產創建。對於佈局階段，PhiP-G 採用具有附著能力的物理池和視覺監督代理，形成用於佈局預測和規劃的世界模型。大量的實驗證明，PhiP-G 大幅提升了合成場景的生成品質和物理合理性。值得注意的是，PhiP-G 在 CLIP 分數中獲得了最先進 (SOTA) 的效能，在 T$^3$Bench 測量的生成品質中與領先的方法達到同等水準，並將效率提升了 24 倍。</paragraph>
+摘要：軟組織和骨骼腫瘤（STBT）是罕見、診斷具有挑戰性的病灶，其臨床行為和治療方法各不相同。這篇系統性回顧提供了使用放射影像進行診斷和預後的人工智慧 (AI) 方法的概觀，重點說明了臨床轉譯的挑戰，並評估研究與醫療影像 AI 核查表 (CLAIM) 和 FUTURE-AI 可信賴且可部署 AI 的國際共識準則的一致性，以促進 AI 方法的臨床轉譯。這篇回顧涵蓋了幾個書目資料庫中的文獻，包括在 2024 年 7 月 17 日之前發表的論文。納入了以放射為基礎的 AI 診斷或預後原發性 STBT 的同行評審期刊中的原始研究。排除標準是動物、屍體或實驗室研究，以及非英文論文。摘要由三位獨立審查員中的兩位篩選資格。合格的論文由三位獨立審查員中的一位根據準則進行評估。搜索識別出 15,015 篇摘要，其中 325 篇文章被納入評估。大多數研究在 CLAIM 中表現中等，平均得分為 53 分中的 28.9±7.5 分，但在 FUTURE-AI 中表現不佳，平均得分為 30 分中的 5.1±2.1 分。STBT 的影像 AI 工具仍處於概念驗證階段，表明有顯著的改進空間。AI 開發人員未來的努力應集中在設計（例如定義未滿足的臨床需求、預期的臨床環境以及 AI 如何整合到臨床工作流程中）、開發（例如建立在先前的工作、可解釋性）、評估（例如評估和解決偏差、評估 AI 與最佳實務）、以及數據可複製性和可用性（公開提供文件化的代碼和數據）。遵循這些建議可以改善 AI 方法的臨床轉譯。
 
-##### **A Survey of Quantized Graph Representation Learning: Connecting Graph Structures with Large Language Models**
-2502.00681v1 by Qika Lin, Zhen Peng, Kaize Shi, Kai He, Yiming Xu, Erik Cambria, Mengling Feng
+##### **Evaluating Explainable AI Methods in Deep Learning Models for Early Detection of Cerebral Palsy**
+2409.00001v1 by Kimji N. Pellano, Inga Strümke, Daniel Groos, Lars Adde, Espen Alexander F. Ihlen
 
-Recent years have witnessed rapid advances in graph representation learning,
-with the continuous embedding approach emerging as the dominant paradigm.
-However, such methods encounter issues regarding parameter efficiency,
-interpretability, and robustness. Thus, Quantized Graph Representation (QGR)
-learning has recently gained increasing interest, which represents the graph
-structure with discrete codes instead of conventional continuous embeddings.
-Given its analogous representation form to natural language, QGR also possesses
-the capability to seamlessly integrate graph structures with large language
-models (LLMs). As this emerging paradigm is still in its infancy yet holds
-significant promise, we undertake this thorough survey to promote its rapid
-future prosperity. We first present the background of the general quantization
-methods and their merits. Moreover, we provide an in-depth demonstration of
-current QGR studies from the perspectives of quantized strategies, training
-objectives, distinctive designs, knowledge graph quantization, and
-applications. We further explore the strategies for code dependence learning
-and integration with LLMs. At last, we give discussions and conclude future
-directions, aiming to provide a comprehensive picture of QGR and inspire future
-research.
+Early detection of Cerebral Palsy (CP) is crucial for effective intervention
+and monitoring. This paper tests the reliability and applicability of
+Explainable AI (XAI) methods using a deep learning method that predicts CP by
+analyzing skeletal data extracted from video recordings of infant movements.
+Specifically, we use XAI evaluation metrics -- namely faithfulness and
+stability -- to quantitatively assess the reliability of Class Activation
+Mapping (CAM) and Gradient-weighted Class Activation Mapping (Grad-CAM) in this
+specific medical application. We utilize a unique dataset of infant movements
+and apply skeleton data perturbations without distorting the original dynamics
+of the infant movements. Our CP prediction model utilizes an ensemble approach,
+so we evaluate the XAI metrics performances for both the overall ensemble and
+the individual models. Our findings indicate that both XAI methods effectively
+identify key body points influencing CP predictions and that the explanations
+are robust against minor data perturbations. Grad-CAM significantly outperforms
+CAM in the RISv metric, which measures stability in terms of velocity. In
+contrast, CAM performs better in the RISb metric, which relates to bone
+stability, and the RRS metric, which assesses internal representation
+robustness. Individual models within the ensemble show varied results, and
+neither CAM nor Grad-CAM consistently outperform the other, with the ensemble
+approach providing a representation of outcomes from its constituent models.
 
-摘要：近年来，图表示学习取得了快速进展，其中连续嵌入方法作为主导范式出现。然而，此类方法遇到了参数效率、可解释性和鲁棒性方面的问题。因此，量化图表示 (QGR) 学习最近引起了越来越多的兴趣，它使用离散代码而不是传统的连续嵌入来表示图结构。鉴于其与自然语言类似的表示形式，QGR 也具备将图结构与大型语言模型 (LLM) 无缝集成的能力。由于这种新兴范式仍处于起步阶段，但前景广阔，我们进行了这项全面调查以促进其快速未来的繁荣。我们首先介绍了通用量化方法的背景及其优点。此外，我们从量化策略、训练目标、独特设计、知识图谱量化和应用的角度对当前的 QGR 研究进行了深入的论证。我们进一步探索了代码依赖性学习和与 LLM 集成的策略。最后，我们给出了讨论并总结了未来的方向，旨在提供 QGR 的全面图景并激发未来的研究。
+摘要：腦性麻痺 (CP) 的早期偵測對於有效的介入和監測至關重要。本文測試了可解釋 AI (XAI) 方法的可靠性和適用性，使用深度學習方法，透過分析從嬰兒動作影片記錄中提取的骨骼資料來預測 CP。具體來說，我們使用 XAI 評估指標（即忠實度和穩定性）來量化評估類別激活映射 (CAM) 和梯度加權類別激活映射 (Grad-CAM) 在這個特定醫療應用中的可靠性。我們利用一個獨特的嬰兒動作資料集，並應用骨骼資料擾動，而不會扭曲嬰兒動作的原始動力。我們的 CP 預測模型利用整體方法，因此我們評估了整體整體和個別模型的 XAI 指標表現。我們的研究結果表明，兩種 XAI 方法都能有效識別影響 CP 預測的關鍵身體部位，並且這些解釋對於微小的資料擾動具有魯棒性。Grad-CAM 在 RISv 指標中顯著優於 CAM，該指標衡量速度方面的穩定性。相比之下，CAM 在 RISb 指標中表現得更好，該指標與骨骼穩定性有關，而 RRS 指標則評估內部表示的魯棒性。整體中的個別模型顯示出不同的結果，CAM 和 Grad-CAM 都不一致地優於另一種，整體方法提供了其組成模型結果的表示。
 
-##### **Challenges and Innovations in LLM-Powered Fake News Detection: A Synthesis of Approaches and Future Directions**
-2502.00339v1 by Jingyuan Yi, Zeqiu Xu, Tianyi Huang, Peiyang Yu
+##### **MicroXercise: A Micro-Level Comparative and Explainable System for Remote Physical Therapy**
+2408.11837v1 by Hanchen David Wang, Nibraas Khan, Anna Chen, Nilanjan Sarkar, Pamela Wisniewski, Meiyi Ma
 
-The pervasiveness of the dissemination of fake news through social media
-platforms poses critical risks to the trust of the general public, societal
-stability, and democratic institutions. This challenge calls for novel
-methodologies in detection, which can keep pace with the dynamic and
-multi-modal nature of misinformation. Recent works include powering the
-detection using large language model advances in multimodal frameworks,
-methodologies using graphs, and adversarial training in the literature of fake
-news. Based on the different approaches which can bring success, some key
-highlights will be underlined: enhanced LLM-improves accuracy through more
-advanced semantics and cross-modality fusion for robust detections. The review
-further identifies critical gaps in adaptability to dynamic social media
-trends, real-time, and cross-platform detection capabilities, as well as the
-ethical challenges thrown up by the misuse of LLMs. Future directions underline
-the development of style-agnostic models, cross-lingual detection frameworks,
-and robust policies with a view to mitigating LLM-driven misinformation. This
-synthesis thus lays a concrete foundation for those researchers and
-practitioners committed to reinforcing fake news detection systems with
-complications that keep on growing in the digital landscape.
+Recent global estimates suggest that as many as 2.41 billion individuals have
+health conditions that would benefit from rehabilitation services. Home-based
+Physical Therapy (PT) faces significant challenges in providing interactive
+feedback and meaningful observation for therapists and patients. To fill this
+gap, we present MicroXercise, which integrates micro-motion analysis with
+wearable sensors, providing therapists and patients with a comprehensive
+feedback interface, including video, text, and scores. Crucially, it employs
+multi-dimensional Dynamic Time Warping (DTW) and attribution-based explainable
+methods to analyze the existing deep learning neural networks in monitoring
+exercises, focusing on a high granularity of exercise. This synergistic
+approach is pivotal, providing output matching the input size to precisely
+highlight critical subtleties and movements in PT, thus transforming complex AI
+analysis into clear, actionable feedback. By highlighting these micro-motions
+in different metrics, such as stability and range of motion, MicroXercise
+significantly enhances the understanding and relevance of feedback for
+end-users. Comparative performance metrics underscore its effectiveness over
+traditional methods, such as a 39% and 42% improvement in Feature Mutual
+Information (FMI) and Continuity. MicroXercise is a step ahead in home-based
+physical therapy, providing a technologically advanced and intuitively helpful
+solution to enhance patient care and outcomes.
 
-摘要：社群媒體平台上假新聞散播的普遍性對一般大眾的信任、社會穩定性與民主制度構成重大風險。這項挑戰需要在偵測方面採用創新的方法論，才能跟上錯誤資訊的動態和多模態特性。最近的研究包括使用多模態架構中大型語言模型的進展、使用圖形的方法論，以及在假新聞文獻中進行對抗訓練來強化偵測。根據可以帶來成功的不同方法，將重點說明一些重點：增強的 LLM 可透過更進階的語意和跨模態融合來提升準確度，以進行穩健的偵測。這篇評論進一步找出在適應動態社群媒體趨勢、即時和跨平台偵測能力方面的重大差距，以及 LLM 遭濫用的道德挑戰。未來的方向強調開發與風格無關的模型、跨語言偵測架構和穩健的政策，以減輕 LLM 驅動的錯誤資訊。因此，這種綜合分析為那些致力於強化假新聞偵測系統的研究人員和從業人員奠定了具體的基礎，而這些複雜性在數位環境中持續增長。
+摘要：最近的全球估計表明，多達 24.1 億人有
+健康狀況可從復健服務中受益。居家
+物理治療 (PT) 在提供互動式
+回饋和有意義的觀察方面面臨重大挑戰，供治療師和患者使用。為了填補這
+個缺口，我們提出 MicroXercise，它將微動作分析與
+可穿戴式感測器整合在一起，為治療師和患者提供一個全面的
+回饋介面，包括影片、文字和分數。至關重要的是，它採用
+多維動態時間規整 (DTW) 和基於歸因的可解釋
+方法來分析監控運動中現有的深度學習神經網路，專注於運動的高粒度。這種協同
+方法至關重要，提供與輸入大小匹配的輸出，以精確地
+突出 PT 中關鍵的細微差別和動作，從而將複雜的 AI
+分析轉換為清晰、可操作的回饋。透過在不同指標中突顯這些微動作，例如穩定性和動作範圍，MicroXercise
+顯著提升最終使用者對回饋的理解和相關性。比較效能指標強調其優於
+傳統方法的有效性，例如特徵互惠資訊 (FMI) 和連續性分別提升了 39% 和 42%。MicroXercise 在居家
+物理治療方面更進一步，提供技術先進且直覺有用的
+解決方案，以提升患者照護和結果。
 
-##### **DEUCE: Dual-diversity Enhancement and Uncertainty-awareness for Cold-start Active Learning**
-2502.00305v1 by Jiaxin Guo, C. L. Philip Chen, Shuzhen Li, Tong Zhang
+##### **The Literature Review Network: An Explainable Artificial Intelligence for Systematic Literature Reviews, Meta-analyses, and Method Development**
+2408.05239v1 by Joshua Morriss, Tod Brindle, Jessica Bah Rösman, Daniel Reibsamen, Andreas Enz
 
-Cold-start active learning (CSAL) selects valuable instances from an
-unlabeled dataset for manual annotation. It provides high-quality data at a low
-annotation cost for label-scarce text classification. However, existing CSAL
-methods overlook weak classes and hard representative examples, resulting in
-biased learning. To address these issues, this paper proposes a novel
-dual-diversity enhancing and uncertainty-aware (DEUCE) framework for CSAL.
-Specifically, DEUCE leverages a pretrained language model (PLM) to efficiently
-extract textual representations, class predictions, and predictive uncertainty.
-Then, it constructs a Dual-Neighbor Graph (DNG) to combine information on both
-textual diversity and class diversity, ensuring a balanced data distribution.
-It further propagates uncertainty information via density-based clustering to
-select hard representative instances. DEUCE performs well in selecting
-class-balanced and hard representative data by dual-diversity and
-informativeness. Experiments on six NLP datasets demonstrate the superiority
-and efficiency of DEUCE.
+Systematic literature reviews are the highest quality of evidence in
+research. However, the review process is hindered by significant resource and
+data constraints. The Literature Review Network (LRN) is the first of its kind
+explainable AI platform adhering to PRISMA 2020 standards, designed to automate
+the entire literature review process. LRN was evaluated in the domain of
+surgical glove practices using 3 search strings developed by experts to query
+PubMed. A non-expert trained all LRN models. Performance was benchmarked
+against an expert manual review. Explainability and performance metrics
+assessed LRN's ability to replicate the experts' review. Concordance was
+measured with the Jaccard index and confusion matrices. Researchers were
+blinded to the other's results until study completion. Overlapping studies were
+integrated into an LRN-generated systematic review. LRN models demonstrated
+superior classification accuracy without expert training, achieving 84.78% and
+85.71% accuracy. The highest performance model achieved high interrater
+reliability (k = 0.4953) and explainability metrics, linking 'reduce',
+'accident', and 'sharp' with 'double-gloving'. Another LRN model covered 91.51%
+of the relevant literature despite diverging from the non-expert's judgments (k
+= 0.2174), with the terms 'latex', 'double' (gloves), and 'indication'. LRN
+outperformed the manual review (19,920 minutes over 11 months), reducing the
+entire process to 288.6 minutes over 5 days. This study demonstrates that
+explainable AI does not require expert training to successfully conduct
+PRISMA-compliant systematic literature reviews like an expert. LRN summarized
+the results of surgical glove studies and identified themes that were nearly
+identical to the clinical researchers' findings. Explainable AI can accurately
+expedite our understanding of clinical practices, potentially revolutionizing
+healthcare research.
 
-摘要：冷啟動主動學習 (CSAL) 從未標記的資料集中選取有價值的實例進行手動標記。它以低標記成本提供高品質的資料，用於標籤稀少的文字分類。然而，現有的 CSAL 方法忽略了弱類別和難以代表的範例，導致有偏差的學習。為了解決這些問題，本文提出了一個新的雙重多樣性增強和不確定性感知 (DEUCE) 架構，用於 CSAL。具體來說，DEUCE 利用預訓練的語言模型 (PLM) 來有效地提取文字表徵、類別預測和預測不確定性。然後，它構建一個雙鄰居圖 (DNG) 來結合文字多樣性和類別多樣性的資訊，確保平衡的資料分佈。它進一步通過基於密度的聚類來傳播不確定性資訊，以選擇難以代表的實例。DEUCE 在通過雙重多樣性和資訊性選擇類別平衡和難以代表的資料方面表現良好。在六個 NLP 資料集上的實驗證明了 DEUCE 的優越性和效率。
+摘要：系統性文獻回顧是研究中證據品質最高的。然而，回顧過程受到顯著資源和資料限制的阻礙。文獻回顧網路 (LRN) 是第一個遵循 PRISMA 2020 標準的可解釋 AI 平台，旨在自動化整個文獻回顧過程。LRN 在外科手套實務領域中進行評估，使用專家開發的 3 個搜尋字串來查詢 PubMed。非專家訓練所有 LRN 模型。效能以專家手動回顧作為基準。可解釋性和效能指標評估 LRN 複製專家回顧的能力。一致性以 Jaccard 指數和混淆矩陣測量。研究人員在研究完成前對彼此的結果保密。重疊的研究整合到 LRN 生成的系統性回顧中。LRN 模型在沒有專家訓練的情況下展現出優異的分類準確率，達到 84.78% 和 85.71% 的準確率。效能最高的模型達到了高評分者間信賴度 (k = 0.4953) 和可解釋性指標，將「減少」、「意外」和「銳利」與「雙重戴手套」連結在一起。另一個 LRN 模型涵蓋了 91.51% 的相關文獻，儘管與非專家的判斷不同 (k = 0.2174)，但包含了「乳膠」、「雙重」（手套）和「適應症」等詞彙。LRN 優於手動回顧（11 個月超過 19,920 分鐘），將整個過程縮短為 5 天超過 288.6 分鐘。這項研究顯示，可解釋的 AI 不需要專家訓練即可成功進行專家等級的 PRISMA 相容系統性文獻回顧。LRN 總結了外科手套研究的結果，並找出與臨床研究人員發現幾乎相同的主题。可解釋的 AI 可以準確地加快我們對臨床實務的理解，有潛力革新醫療保健研究。
 
-##### **Longer Attention Span: Increasing Transformer Context Length with Sparse Graph Processing Techniques**
-2502.01659v2 by Nathaniel Tomczak, Sanmukh Kuppannagari
+##### **Enhancing Medical Learning and Reasoning Systems: A Boxology-Based Comparative Analysis of Design Patterns**
+2408.02709v1 by Chi Him Ng
 
-Transformers have demonstrated great success in numerous domains including
-natural language processing and bioinformatics. This success stems from the use
-of the attention mechanism by these models in order to represent and propagate
-pairwise interactions between individual tokens of sequential data. However,
-the primary limitation of this operation is its quadratic memory and time
-complexity in relation to the input's context length - the length of a sequence
-over which the interactions need to be captured. This significantly limits the
-length of sequences that can be inferred upon by these models. Extensive
-research has been conducted to reduce the number of pairwise interactions to
-sub-quadratic in relation to the context length by introducing sparsity into
-the attention mechanism through the development of sparse attention masks.
-However, efficient implementations that achieve "true sparsity" are lacking.
-  In this work, we address this issue by proposing a graph computing view of
-attention where tokens are perceived as nodes of the graph and the attention
-mask determines the edges of the graph. Using this view, we develop graph
-processing algorithms to implement the attention mechanism. Both theoretically
-and empirically, we demonstrate that our algorithms only perform the needed
-computations, i.e., they are work optimal. We also perform extensive
-experimentation using popular attention masks to explore the impact of sparsity
-on execution time and achievable context length. Our experiments demonstrate
-significant speedups in execution times compared to state-of-the-art attention
-implementations such as FlashAttention for large sequence lengths. We also
-demonstrate that our algorithms are able to achieve extremely long sequence
-lengths of as high as 160 million on a single NVIDIA A100 GPU (SXM4 80GB).
+This study analyzes hybrid AI systems' design patterns and their
+effectiveness in clinical decision-making using the boxology framework. It
+categorizes and copares various architectures combining machine learning and
+rule-based reasoning to provide insights into their structural foundations and
+healthcare applications. Addressing two main questions, how to categorize these
+systems againts established design patterns and how to extract insights through
+comparative analysis, the study uses design patterns from software engineering
+to understand and optimize healthcare AI systems. Boxology helps identify
+commonalities and create reusable solutions, enhancing these systems'
+scalability, reliability, and performance. Five primary architectures are
+examined: REML, MLRB, RBML, RMLT, and PERML. Each has unique strengths and
+weaknesses, highlighting the need for tailored approaches in clinical tasks.
+REML excels in high-accuracy prediction for datasets with limited data; MLRB in
+handling large datasets and complex data integration; RBML in explainability
+and trustworthiness; RMLT in managing high-dimensional data; and PERML, though
+limited in analysis, shows promise in urgent care scenarios. The study
+introduces four new patterns, creates five abstract categorization patterns,
+and refines those five further to specific systems. These contributions enhance
+Boxlogy's taxonomical organization and offer novel approaches to integrating
+expert knowledge with machine learning. Boxology's structured, modular apporach
+offers significant advantages in developing and analyzing hybrid AI systems,
+revealing commonalities, and promoting reusable solutions. In conclusion, this
+study underscores hybrid AI systems' crucial role in advancing healthcare and
+Boxology's potential to drive further innovation in AI integration, ultimately
+improving clinical decision support and patient outcomes.
 
-摘要：變形金剛已在許多領域展現出巨大的成功，包括自然語言處理和生物資訊學。這種成功源自於這些模型使用注意機制來表示和傳播序列資料中各個標記之間成對的互動。然而，這種運算的主要限制在於其二次記憶體和時間複雜度與輸入的內容長度有關，也就是需要擷取互動的序列長度。這會顯著限制這些模型可以推論的序列長度。已經進行了大量的研究來減少成對互動的數量，使其與內容長度成次二次關係，方法是透過開發稀疏注意遮罩來將稀疏性引入注意機制。然而，缺乏能達成「真實稀疏性」的高效實作。在這項工作中，我們透過提出注意力的圖形運算檢視來解決這個問題，其中標記被視為圖形的節點，而注意力遮罩則決定圖形中的邊緣。使用這種檢視，我們開發了圖形處理演算法來實作注意力機制。我們在理論上和經驗上都證明了我們的演算法只執行必要的運算，也就是說，它們是工作最優的。我們也使用流行的注意力遮罩進行廣泛的實驗，以探討稀疏性對執行時間和可達成的內容長度的影響。我們的實驗證明，與最先進的注意力實作（例如 FlashAttention）相比，對於大型序列長度，我們的演算法在執行時間方面有顯著的加速。我們也證明了我們的演算法能夠在單一的 NVIDIA A100 GPU (SXM4 80GB) 上達成極長的序列長度，最高可達 1.6 億。
+摘要：本研究使用盒子學框架分析混合人工智慧系統的設計模式及其在臨床決策中的有效性。它分類並比較結合機器學習和基於規則的推理的各種架構，以深入了解其結構基礎和醫療保健應用。針對兩個主要問題，如何根據既定的設計模式對這些系統進行分類，以及如何通過比較分析提取見解，本研究使用軟體工程中的設計模式來了解和優化醫療保健人工智慧系統。盒子學有助於識別共性並建立可重複使用的解決方案，從而增強這些系統的可擴充性、可靠性和效能。檢查了五種主要的架構：REML、MLRB、RBML、RMLT 和 PERML。每種架構都有獨特的優缺點，強調了在臨床任務中需要量身打造的方法。REML 在資料有限的資料集中表現出高精度的預測；MLRB 在處理大型資料集和複雜資料整合方面表現出色；RBML 在可解釋性和可信度方面表現出色；RMLT 在管理高維資料方面表現出色；而 PERML 儘管在分析方面有限，但在緊急照護場景中表現出潛力。本研究引入了四種新模式，建立了五種抽象分類模式，並進一步將這五種模式細化為具體的系統。這些貢獻增強了盒子學的分類組織，並提供了將專家知識與機器學習整合的新方法。盒子學的結構化、模組化方法在開發和分析混合人工智慧系統、揭示共性以及推廣可重複使用的解決方案方面具有顯著優勢。總之，本研究強調了混合人工智慧系統在推進醫療保健中的關鍵作用，以及盒子學在推動人工智慧整合進一步創新方面的潛力，最終改善臨床決策支援和患者的治療成果。
 
-##### **Improving vision-language alignment with graph spiking hybrid Networks**
-2501.19069v1 by Siyu Zhang, Heming Zheng, Yiming Wu, Yeming Chen
+##### **Bayesian Kolmogorov Arnold Networks (Bayesian_KANs): A Probabilistic Approach to Enhance Accuracy and Interpretability**
+2408.02706v1 by Masoud Muhammed Hassan
 
-To bridge the semantic gap between vision and language (VL), it is necessary
-to develop a good alignment strategy, which includes handling semantic
-diversity, abstract representation of visual information, and generalization
-ability of models. Recent works use detector-based bounding boxes or patches
-with regular partitions to represent visual semantics. While current paradigms
-have made strides, they are still insufficient for fully capturing the nuanced
-contextual relations among various objects. This paper proposes a comprehensive
-visual semantic representation module, necessitating the utilization of
-panoptic segmentation to generate coherent fine-grained semantic features.
-Furthermore, we propose a novel Graph Spiking Hybrid Network (GSHN) that
-integrates the complementary advantages of Spiking Neural Networks (SNNs) and
-Graph Attention Networks (GATs) to encode visual semantic information.
-Intriguingly, the model not only encodes the discrete and continuous latent
-variables of instances but also adeptly captures both local and global
-contextual features, thereby significantly enhancing the richness and diversity
-of semantic representations. Leveraging the spatiotemporal properties inherent
-in SNNs, we employ contrastive learning (CL) to enhance the similarity-based
-representation of embeddings. This strategy alleviates the computational
-overhead of the model and enriches meaningful visual representations by
-constructing positive and negative sample pairs. We design an innovative
-pre-training method, Spiked Text Learning (STL), which uses text features to
-improve the encoding ability of discrete semantics. Experiments show that the
-proposed GSHN exhibits promising results on multiple VL downstream tasks.
+Because of its strong predictive skills, deep learning has emerged as an
+essential tool in many industries, including healthcare. Traditional deep
+learning models, on the other hand, frequently lack interpretability and omit
+to take prediction uncertainty into account two crucial components of clinical
+decision making. In order to produce explainable and uncertainty aware
+predictions, this study presents a novel framework called Bayesian Kolmogorov
+Arnold Networks (BKANs), which combines the expressive capacity of Kolmogorov
+Arnold Networks with Bayesian inference. We employ BKANs on two medical
+datasets, which are widely used benchmarks for assessing machine learning
+models in medical diagnostics: the Pima Indians Diabetes dataset and the
+Cleveland Heart Disease dataset. Our method provides useful insights into
+prediction confidence and decision boundaries and outperforms traditional deep
+learning models in terms of prediction accuracy. Moreover, BKANs' capacity to
+represent aleatoric and epistemic uncertainty guarantees doctors receive more
+solid and trustworthy decision support. Our Bayesian strategy improves the
+interpretability of the model and considerably minimises overfitting, which is
+important for tiny and imbalanced medical datasets, according to experimental
+results. We present possible expansions to further use BKANs in more
+complicated multimodal datasets and address the significance of these
+discoveries for future research in building reliable AI systems for healthcare.
+This work paves the way for a new paradigm in deep learning model deployment in
+vital sectors where transparency and reliability are crucial.
 
-摘要：<paragraph>為了彌合視覺和語言 (VL) 之間的語意差距，必須制定良好的對齊策略，其中包括處理語意多樣性、視覺資訊的抽象表示以及模型的泛化能力。最近的研究使用基於偵測器的邊界框或具有規則分割的區塊來表示視覺語意。雖然目前的範例已取得進展，但對於完全捕捉各種物件之間的細微脈絡關係仍不足夠。本文提出了一個全面的視覺語意表示模組，需要利用全景分割來產生連貫的細粒度語意特徵。此外，我們提出了一個新穎的圖形脈衝混合網路 (GSHN)，它整合了脈衝神經網路 (SNN) 和圖形注意力網路 (GAT) 的互補優勢來編碼視覺語意資訊。有趣的是，該模型不僅編碼實例的離散和連續潛在變數，還能巧妙地捕捉局部和全域脈絡特徵，從而顯著增強語意表示的豐富性和多樣性。利用 SNN 中固有的時空特性，我們採用對比學習 (CL) 來增強嵌入的基於相似性的表示。此策略減輕了模型的計算負擔，並透過建構正負樣本對來豐富有意義的視覺表示。我們設計了一個創新的預訓練方法，脈衝文本學習 (STL)，它使用文本特徵來提高離散語意的編碼能力。實驗表明，所提出的 GSHN 在多個 VL 下游任務上展現出有希望的結果。</paragraph>
+摘要：由於其強大的預測能力，深度學習已成為許多產業中不可或缺的工具，包括醫療保健。然而，傳統的深度學習模型通常缺乏可解釋性，並且忽略了將預測不確定性納入考量，而這兩個因素是臨床決策制定的關鍵組成部分。為了產生可解釋且具有不確定性意識的預測，本研究提出了一個名為貝氏柯爾莫哥洛夫阿諾德網路 (BKAN) 的新架構，它結合了柯爾莫哥洛夫阿諾德網路的表達能力與貝氏推論。我們在兩個醫學資料集上使用 BKAN，這些資料集是評估機器學習模型在醫學診斷中的廣泛使用基準：皮馬印第安人糖尿病資料集和克里夫蘭心臟病資料集。我們的模型提供了對預測信心和決策邊界的有益見解，並且在預測準確度方面優於傳統的深度學習模型。此外，BKAN 表現隨機和認識不確定性的能力，可確保醫生獲得更可靠且值得信賴的決策支援。根據實驗結果，我們的貝氏策略提高了模型的可解釋性，並大幅減少了過度擬合，這對於小型且不平衡的醫學資料集非常重要。我們提出了可能的擴充功能，以進一步將 BKAN 用於更複雜的多模式資料集，並探討這些發現對於未來建立可靠的醫療保健 AI 系統研究的重要性。這項工作為深度學習模型部署在透明度和可靠性至關重要的重要領域中開啟了一個新的典範。
 
-##### **Semantic Web and Creative AI -- A Technical Report from ISWS 2023**
-2501.18542v1 by Raia Abu Ahmad, Reham Alharbi, Roberto Barile, Martin Böckling, Francisco Bolanos, Sara Bonfitto, Oleksandra Bruns, Irene Celino, Yashrajsinh Chudasama, Martin Critelli, Claudia d'Amato, Giada D'Ippolito, Ioannis Dasoulas, Stefano De Giorgis, Vincenzo De Leo, Chiara Di Bonaventura, Marco Di Panfilo, Daniil Dobriy, John Domingue, Xuemin Duan, Michel Dumontier, Sefika Efeoglu, Ruben Eschauzier, Fakih Ginwa, Nicolas Ferranti, Arianna Graciotti, Philipp Hanisch, George Hannah, Golsa Heidari, Aidan Hogan, Hassan Hussein, Alexane Jouglar, Jan-Christoph Kalo, Manoé Kieffer, Antonis Klironomos, Inês Koch, Weronika Lajewska, Nicolas Lazzari, Mikael Lindekrans, Anna Sofia Lippolis, Majlinda Llugiqi, Eleonora Mancini, Eleonora Marzi, Laura Menotti, Daniela Milon Flores, Soulakshmee Nagowah, Kerstin Neubert, Emetis Niazmand, Ebrahim Norouzi, Beatriz Olarte Martinez, Anouk Michelle Oudshoorn, Andrea Poltronieri, Valentina Presutti, Disha Purohit, Ensiyeh Raoufi, Celian Ringwald, Johanna Rockstroh, Sebastian Rudolph, Harald Sack, Zafar Saeed, Mohammad Javad Saeedizade, Aya Sahbi, Cristian Santini, Aleksandra Simic, Dennis Sommer, Rita Sousa, Mary Ann Tan, Vidyashree Tarikere, Tabea Tietz, Liam Tirpitz, Arnaldo Tomasino, Frank van Harmelen, Joao Vissoci, Caitlin Woods, Bohui Zhang, Xinyue Zhang, Heng Zheng
+##### **MLtoGAI: Semantic Web based with Machine Learning for Enhanced Disease Prediction and Personalized Recommendations using Generative AI**
+2407.20284v1 by Shyam Dongre, Ritesh Chandra, Sonali Agarwal
 
-The International Semantic Web Research School (ISWS) is a week-long
-intensive program designed to immerse participants in the field. This document
-reports a collaborative effort performed by ten teams of students, each guided
-by a senior researcher as their mentor, attending ISWS 2023. Each team provided
-a different perspective to the topic of creative AI, substantiated by a set of
-research questions as the main subject of their investigation. The 2023 edition
-of ISWS focuses on the intersection of Semantic Web technologies and Creative
-AI. ISWS 2023 explored various intersections between Semantic Web technologies
-and creative AI. A key area of focus was the potential of LLMs as support tools
-for knowledge engineering. Participants also delved into the multifaceted
-applications of LLMs, including legal aspects of creative content production,
-humans in the loop, decentralised approaches to multimodal generative AI
-models, nanopublications and AI for personal scientific knowledge graphs,
-commonsense knowledge in automatic story and narrative completion, generative
-AI for art critique, prompt engineering, automatic music composition,
-commonsense prototyping and conceptual blending, and elicitation of tacit
-knowledge. As Large Language Models and semantic technologies continue to
-evolve, new exciting prospects are emerging: a future where the boundaries
-between creative expression and factual knowledge become increasingly permeable
-and porous, leading to a world of knowledge that is both informative and
-inspiring.
+In modern healthcare, addressing the complexities of accurate disease
+prediction and personalized recommendations is both crucial and challenging.
+This research introduces MLtoGAI, which integrates Semantic Web technology with
+Machine Learning (ML) to enhance disease prediction and offer user-friendly
+explanations through ChatGPT. The system comprises three key components: a
+reusable disease ontology that incorporates detailed knowledge about various
+diseases, a diagnostic classification model that uses patient symptoms to
+detect specific diseases accurately, and the integration of Semantic Web Rule
+Language (SWRL) with ontology and ChatGPT to generate clear, personalized
+health advice. This approach significantly improves prediction accuracy and
+ensures results that are easy to understand, addressing the complexity of
+diseases and diverse symptoms. The MLtoGAI system demonstrates substantial
+advancements in accuracy and user satisfaction, contributing to developing more
+intelligent and accessible healthcare solutions. This innovative approach
+combines the strengths of ML algorithms with the ability to provide
+transparent, human-understandable explanations through ChatGPT, achieving
+significant improvements in prediction accuracy and user comprehension. By
+leveraging semantic technology and explainable AI, the system enhances the
+accuracy of disease prediction and ensures that the recommendations are
+relevant and easily understood by individual patients. Our research highlights
+the potential of integrating advanced technologies to overcome existing
+challenges in medical diagnostics, paving the way for future developments in
+intelligent healthcare systems. Additionally, the system is validated using 200
+synthetic patient data records, ensuring robust performance and reliability.
 
-摘要：國際語意網路研究學校 (ISWS) 是一個為期一週的密集課程，旨在讓參與者沉浸在該領域中。本文件報告了由十個學生團隊進行的合作成果，每個團隊都由一位資深研究員作為導師，參加了 2023 年 ISWS。每個團隊都從不同的角度探討了創意 AI 主題，並以一系列研究問題作為調查的主要主題。2023 年版的 ISWS 關注於語意網路技術和創意 AI 的交集。ISWS 2023 探索了語意網路技術和創意 AI 之間的各種交集。一個重點關注領域是 LLM 作為知識工程的支援工具的潛力。參與者還深入探討了 LLM 的多方面應用，包括創意內容製作的法律方面、循環中的人類、多模態生成式 AI 模型的分散式方法、納米出版物和用於個人科學知識圖譜的 AI、自動故事和敘述完成中的常識知識、生成式 AI 用於藝術評論、提示工程、自動音樂創作、常識原型和概念混合，以及對默會知識的引導。隨著大型語言模型和語意技術的持續發展，新的令人興奮的前景正在出現：一個創意表達和事實知識之間的界限變得越來越可滲透和多孔的未來，從而導致一個既有資訊性又有啟發性的知識世界。
+摘要：在現代醫療保健中，解決準確疾病預測和個性化建議的複雜性既至關重要又具有挑戰性。本研究引入了 MLtoGAI，它將語義網路技術與機器學習 (ML) 相結合，以增強疾病預測並透過 ChatGPT 提供使用者友善的說明。該系統包含三個關鍵組成部分：一個可重複使用的疾病本体，其中包含有關各種疾病的詳細知識；一個診斷分類模型，它使用患者症狀來準確檢測特定疾病；以及語義網路規則語言 (SWRL) 與本体和 ChatGPT 的整合，以產生清晰、個性化的健康建議。這種方法顯著提高了預測準確性，並確保了易於理解的結果，解決了疾病和不同症狀的複雜性。MLtoGAI 系統展示了準確性和使用者滿意度的實質性進步，有助於開發更智慧且更易於取得的醫療保健解決方案。這種創新的方法結合了 ML 演算法的優點，以及透過 ChatGPT 提供透明且人類可以理解的說明的能力，在預測準確性和使用者理解方面取得了顯著的進步。透過利用語義技術和可解釋的 AI，該系統提高了疾病預測的準確性，並確保了建議與個別患者相關且易於理解。我們的研究強調了整合先進技術以克服醫療診斷中現有挑戰的潛力，為智慧醫療保健系統的未來發展鋪路。此外，該系統使用 200 個合成患者資料記錄進行驗證，確保了穩健的效能和可靠性。
 
-##### **Leveraging LLM Agents for Automated Optimization Modeling for SASP Problems: A Graph-RAG based Approach**
-2501.18320v1 by Tianpeng Pan, Wenqiang Pu, Licheng Zhao, Rui Zhou
+##### **Introducing δ-XAI: a novel sensitivity-based method for local AI explanations**
+2407.18343v2 by Alessandro De Carlo, Enea Parimbelli, Nicola Melillo, Giovanna Nicora
 
-Automated optimization modeling (AOM) has evoked considerable interest with
-the rapid evolution of large language models (LLMs). Existing approaches
-predominantly rely on prompt engineering, utilizing meticulously designed
-expert response chains or structured guidance. However, prompt-based techniques
-have failed to perform well in the sensor array signal processing (SASP) area
-due the lack of specific domain knowledge. To address this issue, we propose an
-automated modeling approach based on retrieval-augmented generation (RAG)
-technique, which consists of two principal components: a multi-agent (MA)
-structure and a graph-based RAG (Graph-RAG) process. The MA structure is
-tailored for the architectural AOM process, with each agent being designed
-based on principles of human modeling procedure. The Graph-RAG process serves
-to match user query with specific SASP modeling knowledge, thereby enhancing
-the modeling result. Results on ten classical signal processing problems
-demonstrate that the proposed approach (termed as MAG-RAG) outperforms several
-AOM benchmarks.
+Explainable Artificial Intelligence (XAI) is central to the debate on
+integrating Artificial Intelligence (AI) and Machine Learning (ML) algorithms
+into clinical practice. High-performing AI/ML models, such as ensemble learners
+and deep neural networks, often lack interpretability, hampering clinicians'
+trust in their predictions. To address this, XAI techniques are being developed
+to describe AI/ML predictions in human-understandable terms. One promising
+direction is the adaptation of sensitivity analysis (SA) and global sensitivity
+analysis (GSA), which inherently rank model inputs by their impact on
+predictions. Here, we introduce a novel delta-XAI method that provides local
+explanations of ML model predictions by extending the delta index, a GSA
+metric. The delta-XAI index assesses the impact of each feature's value on the
+predicted output for individual instances in both regression and classification
+problems. We formalize the delta-XAI index and provide code for its
+implementation. The delta-XAI method was evaluated on simulated scenarios using
+linear regression models, with Shapley values serving as a benchmark. Results
+showed that the delta-XAI index is generally consistent with Shapley values,
+with notable discrepancies in models with highly impactful or extreme feature
+values. The delta-XAI index demonstrated higher sensitivity in detecting
+dominant features and handling extreme feature values. Qualitatively, the
+delta-XAI provides intuitive explanations by leveraging probability density
+functions, making feature rankings clearer and more explainable for
+practitioners. Overall, the delta-XAI method appears promising for robustly
+obtaining local explanations of ML model predictions. Further investigations in
+real-world clinical settings will be conducted to evaluate its impact on
+AI-assisted clinical workflows.
 
-摘要：自動化最佳化建模 (AOM) 隨著大型語言模型 (LLM) 的快速演進而引起相當大的興趣。現有方法主要依賴提示工程，利用精心設計的專家回應鏈或結構化指導。然而，基於提示的技術由於缺乏特定領域知識，無法在感測器陣列訊號處理 (SASP) 領域中表現良好。為了解決這個問題，我們提出一個基於檢索增強生成 (RAG) 技術的自動化建模方法，它包含兩個主要組成部分：多代理 (MA) 結構和基於圖形的 RAG (Graph-RAG) 程序。MA 結構是針對架構 AOM 程序量身打造，每個代理都是根據人類建模程序的原理設計的。Graph-RAG 程序用於將使用者查詢與特定的 SASP 建模知識相匹配，從而增強建模結果。在十個經典訊號處理問題上的結果表明，所提出的方法（稱為 MAG-RAG）優於多個 AOM 基準。
+摘要：可解釋人工智慧 (XAI) 是將人工智慧 (AI) 和機器學習 (ML) 演算法整合到臨床實務中的辯論核心。高執行效能的 AI/ML 模型，例如整體學習器和深度神經網路，通常缺乏可解釋性，阻礙臨床醫生對其預測的信任。為了解決這個問題，正在開發 XAI 技術，以人類可以理解的術語描述 AI/ML 預測。一個有希望的方向是採用敏感度分析 (SA) 和全球敏感度分析 (GSA)，它們本質上會依據模型輸入對預測的影響來對其進行排名。在此，我們介紹一種新的 delta-XAI 方法，透過擴充 GSA 指標 delta 指數來提供 ML 模型預測的局部解釋。delta-XAI 指數評估每個特徵值對回歸和分類問題中個別例項的預測輸出之影響。我們將 delta-XAI 指數形式化，並提供其實作的程式碼。使用線性回歸模型對模擬情境評估 delta-XAI 方法，並以 Shapley 值作為基準。結果顯示 delta-XAI 指數通常與 Shapley 值一致，但在具有高度影響力或極端特徵值的模型中存在顯著差異。delta-XAI 指數在偵測主要特徵和處理極端特徵值方面表現出更高的敏感度。定性地來說，delta-XAI 透過利用機率密度函數提供直觀的解釋，使特徵排名更清晰且對從業人員來說更具可解釋性。總體而言，delta-XAI 方法對於穩健地取得 ML 模型預測的局部解釋似乎很有希望。將在真實世界的臨床環境中進行進一步調查，以評估其對 AI 輔助臨床工作流程的影響。
 
-##### **Mixed-Precision Graph Neural Quantization for Low Bit Large Language Models**
-2501.18154v1 by Wanlong Liu, Yichen Xiao, Dingyi Zeng, Hongyang Zhao, Wenyu Chen, Malu Zhang
+##### **Enhanced Deep Learning Methodologies and MRI Selection Techniques for Dementia Diagnosis in the Elderly Population**
+2407.17324v2 by Nikolaos Ntampakis, Konstantinos Diamantaras, Ioanna Chouvarda, Vasileios Argyriou, Panagiotis Sarigianndis
 
-Post-Training Quantization (PTQ) is pivotal for deploying large language
-models (LLMs) within resource-limited settings by significantly reducing
-resource demands. However, existing PTQ strategies underperform at low bit
-levels < 3 bits due to the significant difference between the quantized and
-original weights. To enhance the quantization performance at low bit widths, we
-introduce a Mixed-precision Graph Neural PTQ (MG-PTQ) approach, employing a
-graph neural network (GNN) module to capture dependencies among weights and
-adaptively assign quantization bit-widths. Through the information propagation
-of the GNN module, our method more effectively captures dependencies among
-target weights, leading to a more accurate assessment of weight importance and
-optimized allocation of quantization strategies. Extensive experiments on the
-WikiText2 and C4 datasets demonstrate that our MG-PTQ method outperforms
-previous state-of-the-art PTQ method GPTQ, setting new benchmarks for
-quantization performance under low-bit conditions.
+Dementia, a debilitating neurological condition affecting millions worldwide,
+presents significant diagnostic challenges. In this work, we introduce a novel
+methodology for the classification of demented and non-demented elderly
+patients using 3D brain Magnetic Resonance Imaging (MRI) scans. Our approach
+features a unique technique for selectively processing MRI slices, focusing on
+the most relevant brain regions and excluding less informative sections. This
+methodology is complemented by a confidence-based classification committee
+composed of three custom deep learning models: Dem3D ResNet, Dem3D CNN, and
+Dem3D EfficientNet. These models work synergistically to enhance
+decision-making accuracy, leveraging their collective strengths. Tested on the
+Open Access Series of Imaging Studies(OASIS) dataset, our method achieved an
+impressive accuracy of 94.12%, surpassing existing methodologies. Furthermore,
+validation on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset
+confirmed the robustness and generalizability of our approach. The use of
+explainable AI (XAI) techniques and comprehensive ablation studies further
+substantiate the effectiveness of our techniques, providing insights into the
+decision-making process and the importance of our methodology. This research
+offers a significant advancement in dementia diagnosis, providing a highly
+accurate and efficient tool for clinical applications.
 
-摘要：訓練後量化 (PTQ) 對於在資源受限的設定中部署大型語言模型 (LLM) 至關重要，因為它能顯著降低資源需求。然而，現有的 PTQ 策略在低位元層級 < 3 位元時表現不佳，因為量化後的權重與原始權重之間有顯著的差異。為了提升低位元寬度的量化效能，我們提出混合精度圖神經網路 PTQ (MG-PTQ) 方法，採用圖神經網路 (GNN) 模組來擷取權重之間的依存關係，並動態分配量化位元寬度。透過 GNN 模組的資訊傳播，我們的方法能更有效地擷取目標權重之間的依存關係，進而更準確地評估權重重要性，並最佳化量化策略的配置。在 WikiText2 和 C4 資料集上的廣泛實驗證明，我們的 MG-PTQ 方法優於先前的最先進 PTQ 方法 GPTQ，在低位元條件下設定了量化效能的新基準。
+摘要：失智症是一種影響全球數百萬人的衰弱性神經疾病，在診斷上具有重大挑戰。在這項工作中，我們提出了一種新的方法，用於對失智和非失智老年患者進行分類，使用 3D 大腦磁振造影 (MRI) 掃描。我們的做法採用了一種獨特技術，用於選擇性處理 MRI 切片，重點關注最相關的大腦區域，並排除信息量較少的部分。這種方法由一個基於信心的分類委員會補充，該委員會由三個自定義深度學習模型組成：Dem3D ResNet、Dem3D CNN 和 Dem3D EfficientNet。這些模型協同工作以增強決策的準確性，利用它們的集體優勢。在影像研究開放存取系列 (OASIS) 資料集上進行測試，我們的模型達到了 94.12% 的驚人準確度，超過了現有方法。此外，在阿茲海默症神經影像倡議 (ADNI) 資料集上的驗證證實了我們方法的穩健性和普遍性。可解釋 AI (XAI) 技術和全面的消融研究進一步證實了我們技術的有效性，提供了對決策過程和我們方法重要性的見解。這項研究為失智症診斷提供了重大進展，為臨床應用提供了一個高度準確且高效的工具。
 
-##### **Self-supervised Quantized Representation for Seamlessly Integrating Knowledge Graphs with Large Language Models**
-2501.18119v1 by Qika Lin, Tianzhe Zhao, Kai He, Zhen Peng, Fangzhi Xu, Ling Huang, Jingying Ma, Mengling Feng
+##### **Using Large Language Models to Compare Explainable Models for Smart Home Human Activity Recognition**
+2408.06352v1 by Michele Fiori, Gabriele Civitarese, Claudio Bettini
 
-Due to the presence of the natural gap between Knowledge Graph (KG)
-structures and the natural language, the effective integration of holistic
-structural information of KGs with Large Language Models (LLMs) has emerged as
-a significant question. To this end, we propose a two-stage framework to learn
-and apply quantized codes for each entity, aiming for the seamless integration
-of KGs with LLMs. Firstly, a self-supervised quantized representation (SSQR)
-method is proposed to compress both KG structural and semantic knowledge into
-discrete codes (\ie, tokens) that align the format of language sentences. We
-further design KG instruction-following data by viewing these learned codes as
-features to directly input to LLMs, thereby achieving seamless integration. The
-experiment results demonstrate that SSQR outperforms existing unsupervised
-quantized methods, producing more distinguishable codes. Further, the
-fine-tuned LLaMA2 and LLaMA3.1 also have superior performance on KG link
-prediction and triple classification tasks, utilizing only 16 tokens per entity
-instead of thousands in conventional prompting methods.
+Recognizing daily activities with unobtrusive sensors in smart environments
+enables various healthcare applications. Monitoring how subjects perform
+activities at home and their changes over time can reveal early symptoms of
+health issues, such as cognitive decline. Most approaches in this field use
+deep learning models, which are often seen as black boxes mapping sensor data
+to activities. However, non-expert users like clinicians need to trust and
+understand these models' outputs. Thus, eXplainable AI (XAI) methods for Human
+Activity Recognition have emerged to provide intuitive natural language
+explanations from these models. Different XAI methods generate different
+explanations, and their effectiveness is typically evaluated through user
+surveys, that are often challenging in terms of costs and fairness. This paper
+proposes an automatic evaluation method using Large Language Models (LLMs) to
+identify, in a pool of candidates, the best XAI approach for non-expert users.
+Our preliminary results suggest that LLM evaluation aligns with user surveys.
 
-摘要：由於知識圖譜 (KG) 結構與自然語言之間存在自然差距，將 KG 的整體結構資訊與大型語言模型 (LLM) 有效整合已成為一個重要的問題。為此，我們提出了一個兩階段架構來學習和應用每個實體的量化碼，旨在將 KG 與 LLM 無縫整合。首先，提出了一個自監督量化表示 (SSQR) 方法，將 KG 結構和語義知識壓縮成離散碼（即，符號），以對齊語言句子的格式。我們進一步設計 KG 指令遵循資料，將這些學習到的碼視為直接輸入 LLM 的特徵，從而實現無縫整合。實驗結果表明，SSQR 優於現有的無監督量化方法，產生更具區別性的碼。此外，微調後的 LLaMA2 和 LLaMA3.1 在 KG 連結預測和三元分類任務上也具有優異的性能，每個實體僅使用 16 個符號，而不是傳統提示方法中的數千個。
+摘要：藉由智慧環境中不引人注目的感測器辨識日常活動，能啟用各種醫療保健應用。監控受試者在家中如何執行活動，以及其隨著時間的變化，可以揭示健康問題的早期症狀，例如認知能力下降。此領域中的大多數方法都使用深度學習模型，這些模型通常被視為將感測器資料對應至活動的黑盒子。然而，非專家使用者（例如臨床醫師）需要信任並了解這些模型的輸出。因此，人類活動辨識的可解釋 AI (XAI) 方法應運而生，以提供來自這些模型的直覺自然語言說明。不同的 XAI 方法會產生不同的說明，而其有效性通常透過使用者調查來評估，這在成本和公平性方面通常具有挑戰性。本文提出使用大型語言模型 (LLM) 的自動評估方法，以在候選者中找出最適合非專家使用者的 XAI 方法。我們的初步結果表明，LLM 評估與使用者調查一致。
 
-##### **Hybrid Graphs for Table-and-Text based Question Answering using LLMs**
-2501.17767v1 by Ankush Agarwal, Ganesh S, Chaitanya Devaguptapu
+##### **Explainable AI-based Intrusion Detection System for Industry 5.0: An Overview of the Literature, associated Challenges, the existing Solutions, and Potential Research Directions**
+2408.03335v1 by Naseem Khan, Kashif Ahmad, Aref Al Tamimi, Mohammed M. Alani, Amine Bermak, Issa Khalil
+
+Industry 5.0, which focuses on human and Artificial Intelligence (AI)
+collaboration for performing different tasks in manufacturing, involves a
+higher number of robots, Internet of Things (IoTs) devices and
+interconnections, Augmented/Virtual Reality (AR), and other smart devices. The
+huge involvement of these devices and interconnection in various critical
+areas, such as economy, health, education and defense systems, poses several
+types of potential security flaws. AI itself has been proven a very effective
+and powerful tool in different areas of cybersecurity, such as intrusion
+detection, malware detection, and phishing detection, among others. Just as in
+many application areas, cybersecurity professionals were reluctant to accept
+black-box ML solutions for cybersecurity applications. This reluctance pushed
+forward the adoption of eXplainable Artificial Intelligence (XAI) as a tool
+that helps explain how decisions are made in ML-based systems. In this survey,
+we present a comprehensive study of different XAI-based intrusion detection
+systems for industry 5.0, and we also examine the impact of explainability and
+interpretability on Cybersecurity practices through the lens of Adversarial
+XIDS (Adv-XIDS) approaches. Furthermore, we analyze the possible opportunities
+and challenges in XAI cybersecurity systems for industry 5.0 that elicit future
+research toward XAI-based solutions to be adopted by high-stakes industry 5.0
+applications. We believe this rigorous analysis will establish a foundational
+framework for subsequent research endeavors within the specified domain.
+
+摘要：工業 5.0 著重於人類與人工智慧 (AI) 合作執行製造中的不同任務，涉及更多機器人、物聯網 (IoT) 裝置和互連、擴增/虛擬實境 (AR) 和其他智慧裝置。這些裝置和互連在經濟、醫療保健、教育和國防系統等各種關鍵領域的廣泛參與，引發了多種類型的潛在安全漏洞。AI 本身已被證明是網路安全不同領域中非常有效且強大的工具，例如入侵偵測、惡意軟體偵測和網路釣魚偵測等。就像在許多應用領域一樣，網路安全專業人員不願意接受黑盒 ML 解決方案來應用於網路安全。這種不願意促使可解釋人工智慧 (XAI) 作為一種工具被採用，有助於說明在基於 ML 的系統中如何做出決策。在這項調查中，我們對工業 5.0 的不同基於 XAI 的入侵偵測系統進行了全面的研究，並且我們也透過對抗式 XIDS (Adv-XIDS) 方法的觀點來探討可解釋性和可詮釋性對網路安全實務的影響。此外，我們分析了工業 5.0 的 XAI 網路安全系統中可能存在的機會和挑戰，引發了未來針對 XAI 基礎解決方案的研究，以供高風險的工業 5.0 應用採用。我們相信這項嚴謹的分析將為指定領域內的後續研究工作建立基礎架構。
+
+##### **A Comparative Study on Automatic Coding of Medical Letters with Explainability**
+2407.13638v1 by Jamie Glen, Lifeng Han, Paul Rayson, Goran Nenadic
 
-Answering questions that require reasoning and aggregation across both
-structured (tables) and unstructured (raw text) data sources presents
-significant challenges. Current methods rely on fine-tuning and high-quality,
-human-curated data, which is difficult to obtain. Recent advances in Large
-Language Models (LLMs) have shown promising results for multi-hop question
-answering (QA) over single-source text data in a zero-shot setting, yet
-exploration into multi-source Table-Text QA remains limited. In this paper, we
-present a novel Hybrid Graph-based approach for Table-Text QA that leverages
-LLMs without fine-tuning. Our method constructs a unified Hybrid Graph from
-textual and tabular data, pruning information based on the input question to
-provide the LLM with relevant context concisely. We evaluate our approach on
-the challenging Hybrid-QA and OTT-QA datasets using state-of-the-art LLMs,
-including GPT-3.5, GPT-4, and LLaMA-3. Our method achieves the best zero-shot
-performance on both datasets, improving Exact Match scores by up to 10% on
-Hybrid-QA and 5.4% on OTT-QA. Moreover, our approach reduces token usage by up
-to 53% compared to the original context.
+This study aims to explore the implementation of Natural Language Processing
+(NLP) and machine learning (ML) techniques to automate the coding of medical
+letters with visualised explainability and light-weighted local computer
+settings. Currently in clinical settings, coding is a manual process that
+involves assigning codes to each condition, procedure, and medication in a
+patient's paperwork (e.g., 56265001 heart disease using SNOMED CT code). There
+are preliminary research on automatic coding in this field using
+state-of-the-art ML models; however, due to the complexity and size of the
+models, the real-world deployment is not achieved. To further facilitate the
+possibility of automatic coding practice, we explore some solutions in a local
+computer setting; in addition, we explore the function of explainability for
+transparency of AI models. We used the publicly available MIMIC-III database
+and the HAN/HLAN network models for ICD code prediction purposes. We also
+experimented with the mapping between ICD and SNOMED CT knowledge bases. In our
+experiments, the models provided useful information for 97.98\% of codes. The
+result of this investigation can shed some light on implementing automatic
+clinical coding in practice, such as in hospital settings, on the local
+computers used by clinicians , project page
+\url{https://github.com/Glenj01/Medical-Coding}.
 
-摘要：回答需要對結構化（表格）和非結構化（原始文字）資料來源進行推理和彙總的問題會帶來重大挑戰。目前的辦法仰賴微調和高品質、人工整理的資料，而這很難取得。大型語言模型（LLM）的最新進展已針對零次學習設定的單一來源文字資料多跳問題回答（QA）展現出有希望的結果，但對多來源表格文字 QA 的探討仍然有限。在本文中，我們提出了一種新穎的基於混合圖表的表格文字 QA 方法，它利用 LLM 而無需微調。我們的辦法從文字和表格資料建構一個統一的混合圖表，根據輸入問題修剪資訊，以簡潔地為 LLM 提供相關脈絡。我們使用最先進的 LLM，包括 GPT-3.5、GPT-4 和 LLaMA-3，針對具有挑戰性的 Hybrid-QA 和 OTT-QA 資料集評估我們的辦法。我們的辦法在兩個資料集上都達到了最佳的零次學習效能，在 Hybrid-QA 上將完全比對分數提高了 10%，在 OTT-QA 上將完全比對分數提高了 5.4%。此外，與原始脈絡相比，我們的辦法將符號使用量減少了 53%。
+摘要：本研究旨在探討將自然語言處理 (NLP) 和機器學習 (ML) 技術實作於醫療信函編碼自動化，並具備視覺化說明能力和輕量化的本地電腦設定。目前在臨床環境中，編碼是一種手動流程，涉及為病患文件中的每項病症、程序和藥物指派代碼 (例如，使用 SNOMED CT 代碼 56265001 表示心臟病)。此領域有使用最新 ML 模型進行自動編碼的初步研究；然而，由於模型的複雜性和大小，並未實現實際部署。為了進一步促進自動編碼實務的可能性，我們在本地電腦設定中探討了一些解決方案；此外，我們探討了說明功能在 AI 模型透明度中的功能。我們使用公開的 MIMIC-III 資料庫和 HAN/HLAN 網路模型進行 ICD 代碼預測。我們還試驗了 ICD 和 SNOMED CT 知識庫之間的對應。在我們的實驗中，這些模型提供了 97.98% 代碼的有用資訊。這項調查結果可以為實務中的自動臨床編碼實作提供一些見解，例如在醫院環境中，由臨床醫生使用的本地電腦，專案頁面 \url{https://github.com/Glenj01/Medical-Coding}。
 
-##### **Query-Aware Learnable Graph Pooling Tokens as Prompt for Large Language Models**
-2501.17549v1 by Wooyoung Kim, Byungyoon Park, Wooju Kim
+##### **Explainable AI for Enhancing Efficiency of DL-based Channel Estimation**
+2407.07009v1 by Abdul Karim Gizzini, Yahia Medjahdi, Ali J. Ghandour, Laurent Clavier
 
-Graph-structured data plays a vital role in numerous domains, such as social
-networks, citation networks, commonsense reasoning graphs and knowledge graphs.
-While graph neural networks have been employed for graph processing, recent
-advancements have explored integrating large language models for graph-based
-tasks. In this paper, we propose a novel approach named Learnable Graph Pooling
-Token (LGPT), which addresses the limitations of the scalability issues in
-node-level projection and information loss in graph-level projection. LGPT
-enables flexible and efficient graph representation by introducing learnable
-parameters that act as tokens in large language models, balancing fine-grained
-and global graph information. Additionally, we investigate an Early Query
-Fusion technique, which fuses query context before constructing the graph
-representation, leading to more effective graph embeddings. Our method achieves
-a 4.13\% performance improvement on the GraphQA benchmark without training the
-large language model, demonstrating significant gains in handling complex
-textual-attributed graph data.
+The support of artificial intelligence (AI) based decision-making is a key
+element in future 6G networks, where the concept of native AI will be
+introduced. Moreover, AI is widely employed in different critical applications
+such as autonomous driving and medical diagnosis. In such applications, using
+AI as black-box models is risky and challenging. Hence, it is crucial to
+understand and trust the decisions taken by these models. Tackling this issue
+can be achieved by developing explainable AI (XAI) schemes that aim to explain
+the logic behind the black-box model behavior, and thus, ensure its efficient
+and safe deployment. Recently, we proposed a novel perturbation-based XAI-CHEST
+framework that is oriented toward channel estimation in wireless
+communications. The core idea of the XAI-CHEST framework is to identify the
+relevant model inputs by inducing high noise on the irrelevant ones. This
+manuscript provides the detailed theoretical foundations of the XAI-CHEST
+framework. In particular, we derive the analytical expressions of the XAI-CHEST
+loss functions and the noise threshold fine-tuning optimization problem. Hence
+the designed XAI-CHEST delivers a smart input feature selection methodology
+that can further improve the overall performance while optimizing the
+architecture of the employed model. Simulation results show that the XAI-CHEST
+framework provides valid interpretations, where it offers an improved bit error
+rate performance while reducing the required computational complexity in
+comparison to the classical DL-based channel estimation.
 
-摘要：圖形結構資料在許多領域中扮演著至關重要的角色，例如社交網路、引用網路、常識推理圖形和知識圖形。雖然圖形神經網路已用於圖形處理，但最近的進展已探討整合大型語言模型以進行基於圖形的任務。在本文中，我們提出了一種名為可學習圖形池化令牌 (LGPT) 的新方法，它解決了節點層級投影中的可擴充性問題和圖形層級投影中的資訊遺失限制。LGPT 透過引入可學習的參數（在大型語言模型中作為令牌運作）來啟用彈性和高效的圖形表示，平衡細粒度和整體圖形資訊。此外，我們研究了一種早期查詢融合技術，它在建構圖形表示之前融合查詢內容，進而產生更有效的圖形嵌入。我們的方法在 GraphQA 基準上達到了 4.13% 的效能提升，而無需訓練大型語言模型，證明了在處理複雜的文字屬性圖形資料方面有顯著的進展。
+摘要：人工智能 (AI) 支持的決策制定是未來 6G 網路中的關鍵元素，其中將引入原生 AI 的概念。此外，AI 廣泛用於不同的關鍵應用中，例如自動駕駛和醫療診斷。在這些應用中，使用 AI 作為黑盒模型是有風險且具有挑戰性的。因此，理解和信任這些模型做出的決策至關重要。解決此問題的方法是開發可解釋 AI (XAI) 架構，旨在解釋黑盒模型行為背後的邏輯，從而確保其有效且安全的部署。最近，我們提出了一個新的基於擾動的 XAI-CHEST 框架，該框架面向無線通信中的信道估計。XAI-CHEST 框架的核心思想是通過在無關輸入上引入高噪聲來識別相關模型輸入。這份手稿提供了 XAI-CHEST 框架的詳細理論基礎。特別是，我們推導了 XAI-CHEST 損失函數和噪聲閾值微調優化問題的解析表達式。因此，設計的 XAI-CHEST 提供了一種智能輸入特徵選擇方法，可以在優化所用模型的架構的同時進一步提高整體性能。模擬結果表明，XAI-CHEST 框架提供了有效的解釋，在降低所需的計算複雜度的同時，提供了改進的比特錯誤率性能，而這與基於傳統 DL 的信道估計相比。
 
-##### **General Scene Adaptation for Vision-and-Language Navigation**
-2501.17403v1 by Haodong Hong, Yanyuan Qiao, Sen Wang, Jiajun Liu, Qi Wu
+##### **Explainable AI: Comparative Analysis of Normal and Dilated ResNet Models for Fundus Disease Classification**
+2407.05440v2 by P. N. Karthikayan, Yoga Sri Varshan V, Hitesh Gupta Kattamuri, Umarani Jayaraman
 
-Vision-and-Language Navigation (VLN) tasks mainly evaluate agents based on
-one-time execution of individual instructions across multiple environments,
-aiming to develop agents capable of functioning in any environment in a
-zero-shot manner. However, real-world navigation robots often operate in
-persistent environments with relatively consistent physical layouts, visual
-observations, and language styles from instructors. Such a gap in the task
-setting presents an opportunity to improve VLN agents by incorporating
-continuous adaptation to specific environments. To better reflect these
-real-world conditions, we introduce GSA-VLN, a novel task requiring agents to
-execute navigation instructions within a specific scene and simultaneously
-adapt to it for improved performance over time. To evaluate the proposed task,
-one has to address two challenges in existing VLN datasets: the lack of OOD
-data, and the limited number and style diversity of instructions for each
-scene. Therefore, we propose a new dataset, GSA-R2R, which significantly
-expands the diversity and quantity of environments and instructions for the R2R
-dataset to evaluate agent adaptability in both ID and OOD contexts.
-Furthermore, we design a three-stage instruction orchestration pipeline that
-leverages LLMs to refine speaker-generated instructions and apply role-playing
-techniques to rephrase instructions into different speaking styles. This is
-motivated by the observation that each individual user often has consistent
-signatures or preferences in their instructions. We conducted extensive
-experiments on GSA-R2R to thoroughly evaluate our dataset and benchmark various
-methods. Based on our findings, we propose a novel method, GR-DUET, which
-incorporates memory-based navigation graphs with an environment-specific
-training strategy, achieving state-of-the-art results on all GSA-R2R splits.
+This paper presents dilated Residual Network (ResNet) models for disease
+classification from retinal fundus images. Dilated convolution filters are used
+to replace normal convolution filters in the higher layers of the ResNet model
+(dilated ResNet) in order to improve the receptive field compared to the normal
+ResNet model for disease classification. This study introduces
+computer-assisted diagnostic tools that employ deep learning, enhanced with
+explainable AI techniques. These techniques aim to make the tool's
+decision-making process transparent, thereby enabling medical professionals to
+understand and trust the AI's diagnostic decision. They are particularly
+relevant in today's healthcare landscape, where there is a growing demand for
+transparency in AI applications to ensure their reliability and ethical use.
+The dilated ResNet is used as a replacement for the normal ResNet to enhance
+the classification accuracy of retinal eye diseases and reduce the required
+computing time. The dataset used in this work is the Ocular Disease Intelligent
+Recognition (ODIR) dataset which is a structured ophthalmic database with eight
+classes covering most of the common retinal eye diseases. The evaluation
+metrics used in this work include precision, recall, accuracy, and F1 score. In
+this work, a comparative study has been made between normal ResNet models and
+dilated ResNet models on five variants namely ResNet-18, ResNet-34, ResNet-50,
+ResNet-101, and ResNet-152. The dilated ResNet model shows promising results as
+compared to normal ResNet with an average F1 score of 0.71, 0.70, 0.69, 0.67,
+and 0.70 respectively for the above respective variants in ODIR multiclass
+disease classification.
 
-摘要：視覺語言導航 (VLN) 任務主要根據代理程式在多個環境中執行個別指令的一次性執行來評估代理程式，旨在開發能夠在任何環境中以零次學習的方式運作的代理程式。然而，真實世界的導航機器人通常在持續性的環境中運作，而這些環境具有相對一致的物理配置、視覺觀察和指令的語言風格。任務設定中的這種差距提供了一個機會，可以透過將連續適應特定環境納入其中來改善 VLN 代理程式。為了更好地反映這些真實世界的條件，我們推出了 GSA-VLN，這是一個新任務，要求代理程式在特定場景中執行導航指令，並同時適應該場景，以隨著時間推移而提高效能。為了評估所提出的任務，必須解決現有 VLN 資料集中的兩個挑戰：缺乏 OOD 資料，以及每個場景的指令數量和風格多樣性有限。因此，我們提出了一個新的資料集 GSA-R2R，它顯著擴展了 R2R 資料集的環境和指令的多樣性和數量，以評估代理程式在 ID 和 OOD 背景下的適應能力。此外，我們設計了一個三階段指令編排管道，該管道利用大型語言模型 (LLM) 來精煉由說話者產生的指令，並應用角色扮演技巧將指令改寫成不同的說話風格。這項技術的靈感來自於觀察到每個個別使用者通常在其指令中具有相符的簽名或偏好。我們針對 GSA-R2R 進行了大量的實驗，以徹底評估我們的資料集和基準各種方法。根據我們的研究結果，我們提出了一種新的方法 GR-DUET，它將基於記憶的導航圖表與特定於環境的訓練策略結合在一起，在所有 GSA-R2R 分割中取得了最先進的結果。
+摘要：这篇论文提出了用于从视网膜眼底图像进行疾病分类的扩张残差网络 (ResNet) 模型。扩张卷积滤波器用于替换 ResNet 模型较高层中的正常卷积滤波器（扩张 ResNet），以改善感知场，从而针对疾病分类对正常 ResNet 模型进行改进。本研究引入了采用深度学习的计算机辅助诊断工具，并通过可解释的 AI 技术进行了增强。这些技术旨在使该工具的决策过程透明化，从而使医学专业人士能够理解和信任 AI 的诊断决策。它们与当今的医疗保健领域尤为相关，在该领域，对 AI 应用的透明度需求不断增长，以确保其可靠性和合乎道德的使用。扩张 ResNet 用作正常 ResNet 的替代品，以提高视网膜眼部疾病的分类准确性并减少所需的计算时间。本工作中使用的数据集是眼科疾病智能识别 (ODIR) 数据集，这是一个结构化的眼科数据库，包含八类涵盖大多数常见视网膜眼部疾病。本工作中使用的评估指标包括精确度、召回率、准确度和 F1 得分。在这项工作中，对 ResNet-18、ResNet-34、ResNet-50、ResNet-101 和 ResNet-152 五个变体的正常 ResNet 模型和扩张 ResNet 模型进行了比较研究。与正常 ResNet 相比，扩张 ResNet 模型显示出有希望的结果，在 ODIR 多类疾病分类中，上述各个变体的平均 F1 得分为 0.71、0.70、0.69、0.67 和 0.70。
 
-##### **Comprehensive Evaluation for a Large Scale Knowledge Graph Question Answering Service**
-2501.17270v1 by Saloni Potdar, Daniel Lee, Omar Attia, Varun Embar, De Meng, Ramesh Balaji, Chloe Seivwright, Eric Choi, Mina H. Farid, Yiwen Sun, Yunyao Li
+##### **A Survey on Trustworthiness in Foundation Models for Medical Image Analysis**
+2407.15851v2 by Congzhen Shi, Ryan Rezai, Jiaxi Yang, Qi Dou, Xiaoxiao Li
 
-Question answering systems for knowledge graph (KGQA), answer factoid
-questions based on the data in the knowledge graph. KGQA systems are complex
-because the system has to understand the relations and entities in the
-knowledge-seeking natural language queries and map them to structured queries
-against the KG to answer them. In this paper, we introduce Chronos, a
-comprehensive evaluation framework for KGQA at industry scale. It is designed
-to evaluate such a multi-component system comprehensively, focusing on (1)
-end-to-end and component-level metrics, (2) scalable to diverse datasets and
-(3) a scalable approach to measure the performance of the system prior to
-release. In this paper, we discuss the unique challenges associated with
-evaluating KGQA systems at industry scale, review the design of Chronos, and
-how it addresses these challenges. We will demonstrate how it provides a base
-for data-driven decisions and discuss the challenges of using it to measure and
-improve a real-world KGQA system.
+The rapid advancement of foundation models in medical imaging represents a
+significant leap toward enhancing diagnostic accuracy and personalized
+treatment. However, the deployment of foundation models in healthcare
+necessitates a rigorous examination of their trustworthiness, encompassing
+privacy, robustness, reliability, explainability, and fairness. The current
+body of survey literature on foundation models in medical imaging reveals
+considerable gaps, particularly in the area of trustworthiness. Additionally,
+existing surveys on the trustworthiness of foundation models do not adequately
+address their specific variations and applications within the medical imaging
+domain. This survey aims to fill that gap by presenting a novel taxonomy of
+foundation models used in medical imaging and analyzing the key motivations for
+ensuring their trustworthiness. We review current research on foundation models
+in major medical imaging applications, focusing on segmentation, medical report
+generation, medical question and answering (Q\&A), and disease diagnosis. These
+areas are highlighted because they have seen a relatively mature and
+substantial number of foundation models compared to other applications. We
+focus on literature that discusses trustworthiness in medical image analysis
+manuscripts. We explore the complex challenges of building trustworthy
+foundation models for each application, summarizing current concerns and
+strategies for enhancing trustworthiness. Furthermore, we examine the potential
+of these models to revolutionize patient care. Our analysis underscores the
+imperative for advancing towards trustworthy AI in medical image analysis,
+advocating for a balanced approach that fosters innovation while ensuring
+ethical and equitable healthcare delivery.
 
-摘要：知識圖譜問答系統 (KGQA) 根據知識圖譜中的資料回答事實問題。KGQA 系統很複雜，因為系統必須理解知識尋求自然語言查詢中的關係和實體，並將它們對映到針對知識圖譜的結構化查詢，才能回答這些查詢。在本文中，我們介紹了 Chronos，這是一個用於產業規模 KGQA 的全面評估框架。它旨在全面評估這種多組件系統，重點關注：(1) 端對端和組件層級指標，(2) 可擴充至各種資料集，以及 (3) 可擴充的方法，用於在釋出前衡量系統的效能。在本文中，我們討論了與產業規模 KGQA 系統評估相關的獨特挑戰，檢視 Chronos 的設計，以及它如何應對這些挑戰。我們將展示它如何提供資料驅動決策的基礎，並討論使用它來衡量和改善真實世界 KGQA 系統的挑戰。
+摘要：基礎模型在醫學影像方面的快速進展，代表著在加強診斷準確性和個人化治療方面邁出一大步。然而，基礎模型在醫療保健中的部署需要對其可信度進行嚴格的審查，包括隱私、穩健性、可靠性、可解釋性和公平性。目前關於醫學影像中基礎模型的調查文獻中顯示出相當大的差距，特別是在可信度方面。此外，現有關於基礎模型可信度的調查並未充分解決其在醫學影像領域中的特定變化和應用。本調查旨在通過提出醫學影像中使用的基礎模型的新分類法並分析確保其可信度的關鍵動機，來填補這一空白。我們回顧了基礎模型在主要醫學影像應用中的當前研究，重點關注分割、醫療報告生成、醫療問題和回答 (Q&A) 以及疾病診斷。這些領域之所以被強調，是因為與其他應用相比，它們已經看到相對成熟且大量的基礎模型。我們專注於探討醫學影像分析手稿中可信度的文獻。我們探討了為每個應用構建可信基礎模型的複雜挑戰，總結了當前關注點和增強可信度的策略。此外，我們探討了這些模型在革新患者護理方面的潛力。我們的分析強調了在醫學影像分析中朝著可信賴的人工智慧邁進的必要性，並倡導一種平衡的方法，既能促進創新，又能確保道德和公平的醫療保健服務。
 
-##### **FactCG: Enhancing Fact Checkers with Graph-Based Multi-Hop Data**
-2501.17144v1 by Deren Lei, Yaxi Li, Siyao Li, Mengya Hu, Rui Xu, Ken Archer, Mingyu Wang, Emily Ching, Alex Deng
+##### **The Impact of an XAI-Augmented Approach on Binary Classification with Scarce Data**
+2407.06206v1 by Ximing Wen, Rosina O. Weber, Anik Sen, Darryl Hannan, Steven C. Nesbit, Vincent Chan, Alberto Goffi, Michael Morris, John C. Hunninghake, Nicholas E. Villalobos, Edward Kim, Christopher J. MacLellan
 
-Prior research on training grounded factuality classification models to
-detect hallucinations in large language models (LLMs) has relied on public
-natural language inference (NLI) data and synthetic data. However, conventional
-NLI datasets are not well-suited for document-level reasoning, which is
-critical for detecting LLM hallucinations. Recent approaches to document-level
-synthetic data generation involve iteratively removing sentences from documents
-and annotating factuality using LLM-based prompts. While effective, this method
-is computationally expensive for long documents and limited by the LLM's
-capabilities. In this work, we analyze the differences between existing
-synthetic training data used in state-of-the-art models and real LLM output
-claims. Based on our findings, we propose a novel approach for synthetic data
-generation, CG2C, that leverages multi-hop reasoning on context graphs
-extracted from documents. Our fact checker model, FactCG, demonstrates improved
-performance with more connected reasoning, using the same backbone models.
-Experiments show it even outperforms GPT-4-o on the LLM-Aggrefact benchmark
-with much smaller model size.
+Point-of-Care Ultrasound (POCUS) is the practice of clinicians conducting and
+interpreting ultrasound scans right at the patient's bedside. However, the
+expertise needed to interpret these images is considerable and may not always
+be present in emergency situations. This reality makes algorithms such as
+machine learning classifiers extremely valuable to augment human decisions.
+POCUS devices are becoming available at a reasonable cost in the size of a
+mobile phone. The challenge of turning POCUS devices into life-saving tools is
+that interpretation of ultrasound images requires specialist training and
+experience. Unfortunately, the difficulty to obtain positive training images
+represents an important obstacle to building efficient and accurate
+classifiers. Hence, the problem we try to investigate is how to explore
+strategies to increase accuracy of classifiers trained with scarce data. We
+hypothesize that training with a few data instances may not suffice for
+classifiers to generalize causing them to overfit. Our approach uses an
+Explainable AI-Augmented approach to help the algorithm learn more from less
+and potentially help the classifier better generalize.
 
-摘要：先前的研究訓練了基於事實的分類模型，以偵測大型語言模型 (LLM) 中的幻覺，依賴於公開的自然語言推論 (NLI) 資料和合成資料。然而，傳統的 NLI 資料集並不適合文件層級的推理，這對於偵測 LLM 的幻覺至關重要。最近的文件層級合成資料生成方法涉及從文件中反覆移除句子，並使用基於 LLM 的提示註解事實。雖然有效，但此方法對於長文件來說在運算上很昂貴，且受限於 LLM 的能力。在這項工作中，我們分析了現有合成訓練資料與最先進模型中使用的真實 LLM 輸出宣告之間的差異。根據我們的研究結果，我們提出了一個用於合成資料生成的創新方法 CG2C，它利用從文件中提取的內容圖表進行多跳推理。我們的查核模型 FactCG 使用相同的骨幹模型，展示了在更多連結的推理下改進的效能。實驗表明，它甚至在 LLM-Aggrefact 基準上優於 GPT-4-o，且模型大小小得多。
+摘要：床邊超音波 (POCUS) 是臨床醫師在患者床邊進行和解讀超音波掃描的實務。然而，解讀這些影像所需的專業知識相當可觀，而且在緊急情況下可能並非隨時具備。這種現實情況使得機器學習分類器等演算法對於加強人類決策變得極為有價值。POCUS 裝置正以合理成本推出，尺寸為手機大小。將 POCUS 裝置轉變為救生工具的挑戰在於，解讀超音波影像需要專門訓練和經驗。不幸的是，取得正向訓練影像的困難度代表著建置有效率且準確的分類器的一大障礙。因此，我們嘗試探討的問題是如何探索策略，以提高使用稀疏資料訓練的分類器的準確度。我們假設使用少數資料實例進行訓練可能不足以讓分類器概括，導致它們過度擬合。我們的做法使用可解釋 AI 增強方法，以協助演算法從較少的資料中學習更多，並潛在協助分類器更好地概括。
 
-##### **LLM-AutoDiff: Auto-Differentiate Any LLM Workflow**
-2501.16673v2 by Li Yin, Zhangyang Wang
+##### **Can GPT-4 Help Detect Quit Vaping Intentions? An Exploration of Automatic Data Annotation Approach**
+2407.00167v1 by Sai Krishna Revanth Vuruma, Dezhi Wu, Saborny Sen Gupta, Lucas Aust, Valerie Lookingbill, Wyatt Bellamy, Yang Ren, Erin Kasson, Li-Shiun Chen, Patricia Cavazos-Rehg, Dian Hu, Ming Huang
 
-Large Language Models (LLMs) have reshaped natural language processing,
-powering applications from multi-hop retrieval and question answering to
-autonomous agent workflows. Yet, prompt engineering -- the task of crafting
-textual inputs to effectively direct LLMs -- remains difficult and
-labor-intensive, particularly for complex pipelines that combine multiple LLM
-calls with functional operations like retrieval and data formatting. We
-introduce LLM-AutoDiff: a novel framework for Automatic Prompt Engineering
-(APE) that extends textual gradient-based methods (such as Text-Grad) to
-multi-component, potentially cyclic LLM architectures. Implemented within the
-AdalFlow library, LLM-AutoDiff treats each textual input as a trainable
-parameter and uses a frozen backward engine LLM to generate feedback-akin to
-textual gradients -- that guide iterative prompt updates. Unlike prior
-single-node approaches, LLM-AutoDiff inherently accommodates functional nodes,
-preserves time-sequential behavior in repeated calls (e.g., multi-hop loops),
-and combats the "lost-in-the-middle" problem by isolating distinct sub-prompts
-(instructions, formats, or few-shot examples). It further boosts training
-efficiency by focusing on error-prone samples through selective gradient
-computation. Across diverse tasks, including single-step classification,
-multi-hop retrieval-based QA, and agent-driven pipelines, LLM-AutoDiff
-consistently outperforms existing textual gradient baselines in both accuracy
-and training cost. By unifying prompt optimization through a graph-centric
-lens, LLM-AutoDiff offers a powerful new paradigm for scaling and automating
-LLM workflows - mirroring the transformative role that automatic
-differentiation libraries have long played in neural network research.
+In recent years, the United States has witnessed a significant surge in the
+popularity of vaping or e-cigarette use, leading to a notable rise in cases of
+e-cigarette and vaping use-associated lung injury (EVALI) that caused
+hospitalizations and fatalities during the EVALI outbreak in 2019, highlighting
+the urgency to comprehend vaping behaviors and develop effective strategies for
+cessation. Due to the ubiquity of social media platforms, over 4.7 billion
+users worldwide use them for connectivity, communications, news, and
+entertainment with a significant portion of the discourse related to health,
+thereby establishing social media data as an invaluable organic data resource
+for public health research. In this study, we extracted a sample dataset from
+one vaping sub-community on Reddit to analyze users' quit-vaping intentions.
+Leveraging OpenAI's latest large language model GPT-4 for sentence-level quit
+vaping intention detection, this study compares the outcomes of this model
+against layman and clinical expert annotations. Using different prompting
+strategies such as zero-shot, one-shot, few-shot and chain-of-thought
+prompting, we developed 8 prompts with varying levels of detail to explain the
+task to GPT-4 and also evaluated the performance of the strategies against each
+other. These preliminary findings emphasize the potential of GPT-4 in social
+media data analysis, especially in identifying users' subtle intentions that
+may elude human detection.
 
-摘要：大型語言模型 (LLM) 已重塑自然語言處理，
-為從多跳檢索和問答到
-自主代理工作流程的應用提供動力。然而，提示工程 -- 編寫
-文本輸入以有效指導 LLM 的任務 -- 仍然困難且
-勞動密集，特別是對於將多個 LLM
-呼叫與檢索和數據格式化等功能操作相結合的複雜管道。我們
-介紹 LLM-AutoDiff：一個用於自動提示工程 (APE) 的新框架，它將基於文本梯度的
-方法（例如 Text-Grad）擴展到多組件、潛在循環 LLM 架構中。在
-AdalFlow 庫中實施，LLM-AutoDiff 將每個文本輸入視為一個可訓練
-參數，並使用凍結的後向引擎 LLM 生成反饋——類似於
-文本梯度——指導迭代提示更新。與先前的
-單節點方法不同，LLM-AutoDiff 本質上適應功能節點，
-在重複呼叫（例如，多跳循環）中保留時間順序行為，
-並通過隔離不同的子提示（說明、格式或少數鏡頭示例）來解決“迷失在中間”問題。它進一步提高訓練
-效率，通過選擇性梯度
-計算專注於容易出錯的樣本。在包括單步分類、
-多跳基於檢索的問答和代理驅動管道在內的各種任務中，LLM-AutoDiff
-在準確性和訓練成本方面始終優於現有的文本梯度基準。通過圖形中心化
-視角統一提示優化，LLM-AutoDiff 為擴展和自動化
-LLM 工作流程提供了一個強大的新範例——反映了自動
-微分庫在神經網絡研究中長期扮演的變革性角色。
+摘要：近年來，美國見證了電子煙或電子香菸使用率大幅激增，導致電子煙和電子煙使用相關肺損傷 (EVALI) 病例顯著增加，在 2019 年 EVALI 爆發期間造成住院和死亡，凸顯了理解電子煙行為和制定有效戒菸策略的迫切性。由於社群媒體平台的普及，全球超過 47 億使用者使用它們進行連結、溝通、新聞和娛樂，其中很大一部分與健康相關，因此將社群媒體資料建立為公共衛生研究中無價的有機資料資源。在本研究中，我們從 Reddit 上一個電子煙子社群中提取一個範例資料集，以分析使用者的戒電子煙意圖。利用 OpenAI 最新的大型語言模型 GPT-4 進行句子層級的戒電子煙意圖偵測，本研究比較了此模型的結果與外行人和臨床專家註解。使用不同的提示策略，例如零次學習、一次學習、少次學習和思考鏈提示，我們開發了 8 個提示，詳細程度不同，向 GPT-4 解釋任務，並評估這些策略彼此之間的效能。這些初步發現強調了 GPT-4 在社群媒體資料分析中的潛力，特別是在識別人類偵測可能無法察覺的使用者微妙意圖方面。
 
-##### **360Brew: A Decoder-only Foundation Model for Personalized Ranking and Recommendation**
-2501.16450v3 by Hamed Firooz, Maziar Sanjabi, Adrian Englhardt, Aman Gupta, Ben Levine, Dre Olgiati, Gungor Polatkan, Iuliia Melnychuk, Karthik Ramgopal, Kirill Talanine, Kutta Srinivasan, Luke Simon, Natesh Sivasubramoniapillai, Necip Fazil Ayan, Qingquan Song, Samira Sriram, Souvik Ghosh, Tao Song, Tejas Dharamsi, Vignesh Kothapalli, Xiaoling Zhai, Ya Xu, Yu Wang, Yun Dai
+##### **Towards Compositional Interpretability for XAI**
+2406.17583v1 by Sean Tull, Robin Lorenz, Stephen Clark, Ilyas Khan, Bob Coecke
 
-Ranking and recommendation systems are the foundation for numerous online
-experiences, ranging from search results to personalized content delivery.
-These systems have evolved into complex, multilayered architectures that
-leverage vast datasets and often incorporate thousands of predictive models.
-The maintenance and enhancement of these models is a labor intensive process
-that requires extensive feature engineering. This approach not only exacerbates
-technical debt but also hampers innovation in extending these systems to
-emerging problem domains. In this report, we present our research to address
-these challenges by utilizing a large foundation model with a textual interface
-for ranking and recommendation tasks. We illustrate several key advantages of
-our approach: (1) a single model can manage multiple predictive tasks involved
-in ranking and recommendation, (2) decoder models with textual interface due to
-their comprehension of reasoning capabilities, can generalize to new
-recommendation surfaces and out-of-domain problems, and (3) by employing
-natural language interfaces for task definitions and verbalizing member
-behaviors and their social connections, we eliminate the need for feature
-engineering and the maintenance of complex directed acyclic graphs of model
-dependencies. We introduce our research pre-production model, 360Brew V1.0, a
-150B parameter, decoder-only model that has been trained and fine-tuned on
-LinkedIn's data and tasks. This model is capable of solving over 30 predictive
-tasks across various segments of the LinkedIn platform, achieving performance
-levels comparable to or exceeding those of current production systems based on
-offline metrics, without task-specific fine-tuning. Notably, each of these
-tasks is conventionally addressed by dedicated models that have been developed
-and maintained over multiple years by teams of a similar or larger size than
-our own.
+Artificial intelligence (AI) is currently based largely on black-box machine
+learning models which lack interpretability. The field of eXplainable AI (XAI)
+strives to address this major concern, being critical in high-stakes areas such
+as the finance, legal and health sectors.
+  We present an approach to defining AI models and their interpretability based
+on category theory. For this we employ the notion of a compositional model,
+which sees a model in terms of formal string diagrams which capture its
+abstract structure together with its concrete implementation. This
+comprehensive view incorporates deterministic, probabilistic and quantum
+models. We compare a wide range of AI models as compositional models, including
+linear and rule-based models, (recurrent) neural networks, transformers, VAEs,
+and causal and DisCoCirc models.
+  Next we give a definition of interpretation of a model in terms of its
+compositional structure, demonstrating how to analyse the interpretability of a
+model, and using this to clarify common themes in XAI. We find that what makes
+the standard 'intrinsically interpretable' models so transparent is brought out
+most clearly diagrammatically. This leads us to the more general notion of
+compositionally-interpretable (CI) models, which additionally include, for
+instance, causal, conceptual space, and DisCoCirc models.
+  We next demonstrate the explainability benefits of CI models. Firstly, their
+compositional structure may allow the computation of other quantities of
+interest, and may facilitate inference from the model to the modelled
+phenomenon by matching its structure. Secondly, they allow for diagrammatic
+explanations for their behaviour, based on influence constraints, diagram
+surgery and rewrite explanations. Finally, we discuss many future directions
+for the approach, raising the question of how to learn such meaningfully
+structured models in practice.
 
-摘要：排名和推薦系統是許多線上體驗的基礎，從搜尋結果到個人化內容傳遞。
-這些系統已演變成複雜的多層架構，利用龐大的資料集，並經常納入數千個預測模型。
-這些模型的維護和增強是一個勞力密集的過程，需要廣泛的特徵工程。
-這種方法不僅加劇了技術債務，也阻礙了將這些系統擴展到新興問題領域的創新。
-在此報告中，我們提出了我們的研究，以利用具有文字介面的大型基礎模型來解決這些挑戰，以進行排名和推薦任務。
-我們說明了我們方法的幾個主要優點：(1) 單一模型可以管理排名和推薦中涉及的多個預測任務，(2) 由於解碼器模型具有文字介面，因此它們對推理能力的理解，可以推廣到新的推薦表面和領域外問題，以及 (3) 通過採用自然語言介面進行任務定義和表達成員行為及其社交連接，我們消除了對特徵工程和維護複雜的模型相依性有向無環圖的需求。
-我們介紹了我們的研究前製作業模型 360Brew V1.0，這是一個 150B 參數，僅解碼器模型，已在 LinkedIn 的資料和任務上進行訓練和微調。
-此模型能夠解決 LinkedIn 平臺各個區塊中超過 30 個預測任務，在不針對任務進行微調的情況下，達到與基於離線指標的現行製作系統相當或超越的效能水準。
-值得注意的是，這些任務中的每個任務通常由專用模型處理，這些模型是由與我們規模相當或更大的團隊在多年間開發和維護的。
+摘要：<paragraph>人工智慧（AI）目前在很大程度上依賴於缺乏可解釋性的黑盒機器學習模型。可解釋性人工智慧（XAI）領域致力於解決這個主要問題，這在金融、法律和健康等高風險領域至關重要。
+我們提出了一種基於範疇論定義 AI 模型及其可解釋性的方法。為此，我們採用組合模型的概念，它以形式弦圖的形式看待模型，這些弦圖捕獲了模型的抽象結構及其具體實現。這種綜合觀點包含了確定性、概率性和量子模型。我們將各種 AI 模型作為組合模型進行比較，包括線性和基於規則的模型、（遞迴）神經網路、Transformer、VAE，以及因果和 DisCoCirc 模型。
+接下來，我們根據模型的組合結構給出模型解釋的定義，展示如何分析模型的可解釋性，並使用它來澄清 XAI 中的常見主題。我們發現，讓標準的「內在可解釋」模型如此透明的原因在圖表中表現得最為清楚。這引導我們得出更一般的組合可解釋（CI）模型概念，它另外還包括因果、概念空間和 DisCoCirc 模型。
+接下來，我們展示了 CI 模型的可解釋性優勢。首先，它們的組合結構允許計算其他感興趣的量，並可能通過匹配模型的結構來促進從模型到被建模現象的推理。其次，它們允許對其行為進行圖解說明，這些說明基於影響約束、圖解手術和重寫說明。最後，我們討論了這種方法的許多未來方向，提出了如何在實踐中學習這種有意義的結構化模型的問題。</paragraph>
 
-##### **Raiders of the Lost Dependency: Fixing Dependency Conflicts in Python using LLMs**
-2501.16191v1 by Antony Bartlett, Cynthia Liem, Annibale Panichella
+##### **Slicing Through Bias: Explaining Performance Gaps in Medical Image Analysis using Slice Discovery Methods**
+2406.12142v2 by Vincent Olesen, Nina Weng, Aasa Feragen, Eike Petersen
 
-Fixing Python dependency issues is a tedious and error-prone task for
-developers, who must manually identify and resolve environment dependencies and
-version constraints of third-party modules and Python interpreters. Researchers
-have attempted to automate this process by relying on large knowledge graphs
-and database lookup tables. However, these traditional approaches face
-limitations due to the variety of dependency error types, large sets of
-possible module versions, and conflicts among transitive dependencies. This
-study explores the potential of using large language models (LLMs) to
-automatically fix dependency issues in Python programs. We introduce PLLM
-(pronounced "plum"), a novel technique that employs retrieval-augmented
-generation (RAG) to help an LLM infer Python versions and required modules for
-a given Python file. PLLM builds a testing environment that iteratively (1)
-prompts the LLM for module combinations, (2) tests the suggested changes, and
-(3) provides feedback (error messages) to the LLM to refine the fix. This
-feedback cycle leverages natural language processing (NLP) to intelligently
-parse and interpret build error messages. We benchmark PLLM on the Gistable
-HG2.9K dataset, a collection of challenging single-file Python gists. We
-compare PLLM against two state-of-the-art automatic dependency inference
-approaches, namely PyEGo and ReadPyE, w.r.t. the ability to resolve dependency
-issues. Our results indicate that PLLM can fix more dependency issues than the
-two baselines, with +218 (+15.97%) more fixes over ReadPyE and +281 (+21.58%)
-over PyEGo. Our deeper analyses suggest that PLLM is particularly beneficial
-for projects with many dependencies and for specific third-party numerical and
-machine-learning modules. Our findings demonstrate the potential of LLM-based
-approaches to iteratively resolve Python dependency issues.
+Machine learning models have achieved high overall accuracy in medical image
+analysis. However, performance disparities on specific patient groups pose
+challenges to their clinical utility, safety, and fairness. This can affect
+known patient groups - such as those based on sex, age, or disease subtype - as
+well as previously unknown and unlabeled groups. Furthermore, the root cause of
+such observed performance disparities is often challenging to uncover,
+hindering mitigation efforts. In this paper, to address these issues, we
+leverage Slice Discovery Methods (SDMs) to identify interpretable
+underperforming subsets of data and formulate hypotheses regarding the cause of
+observed performance disparities. We introduce a novel SDM and apply it in a
+case study on the classification of pneumothorax and atelectasis from chest
+x-rays. Our study demonstrates the effectiveness of SDMs in hypothesis
+formulation and yields an explanation of previously observed but unexplained
+performance disparities between male and female patients in widely used chest
+X-ray datasets and models. Our findings indicate shortcut learning in both
+classification tasks, through the presence of chest drains and ECG wires,
+respectively. Sex-based differences in the prevalence of these shortcut
+features appear to cause the observed classification performance gap,
+representing a previously underappreciated interaction between shortcut
+learning and model fairness analyses.
 
-摘要：<paragraph>修復 Python 依賴項問題對開發人員來說是一項繁瑣且容易出錯的任務，他們必須手動識別和解決第三方模組和 Python 解譯器的環境依賴項和版本限制。研究人員已嘗試透過依賴大型知識圖譜和資料庫查詢表來自動化此程序。然而，這些傳統方法由於依賴項錯誤類型多樣、可能的模組版本數量龐大，以及傳遞依賴項之間的衝突，而面臨限制。本研究探討使用大型語言模型 (LLM) 自動修復 Python 程式中的依賴項問題的可能性。我們介紹 PLLM（發音為「plum」），這是一種新穎的技術，採用檢索增強生成 (RAG) 來協助 LLM 推論 Python 版本和給定 Python 檔案所需的模組。PLLM 建立一個測試環境，反覆 (1) 提示 LLM 模組組合，(2) 測試建議的變更，以及 (3) 提供回饋（錯誤訊息）給 LLM 以改善修正。此回饋循環利用自然語言處理 (NLP) 來智慧解析和詮釋建置錯誤訊息。我們在 Gistable HG2.9K 資料集上對 PLLM 進行基準測試，該資料集是一個具有挑戰性的單一檔案 Python gist 集合。我們將 PLLM 與兩種最先進的自動依賴項推論方法進行比較，即 PyEGo 和 ReadPyE，以比較解決依賴項問題的能力。我們的結果顯示，PLLM 可以修復比這兩個基準更多的依賴項問題，比 ReadPyE 多修復了 +218 (+15.97%) 個，比 PyEGo 多修復了 +281 (+21.58%) 個。我們更深入的分析表明，PLLM 對具有許多依賴項的專案以及特定第三方數值和機器學習模組特別有益。我們的研究結果證明了基於 LLM 的方法反覆解決 Python 依賴項問題的可能性。</paragraph>
+摘要：機器學習模型在醫學影像分析中已達到整體高準確度。然而，特定患者群體的效能差異對其臨床效用、安全性與公平性構成挑戰。這可能會影響已知的患者群體（例如基於性別、年齡或疾病亞型）以及先前未知且未標籤的群體。此外，此類觀察到的效能差異的根本原因通常難以發現，阻礙了緩解措施。在本文中，為了解決這些問題，我們利用切片發現方法 (SDM) 來識別可解釋的資料效能不佳子集，並針對觀察到的效能差異原因制定假設。我們引入一種新的 SDM，並在胸部 X 光片中肺炎和肺不張分類的案例研究中應用它。我們的研究證明了 SDM 在假設制定中的有效性，並對廣泛使用的胸部 X 光片資料集和模型中先前觀察到但無法解釋的男性和女性患者之間的效能差異提供了解釋。我們的發現表明，在分類任務中，透過胸腔引流管和心電圖導線的存在，存在捷徑學習。這些捷徑特徵的盛行率存在基於性別的差異，似乎會導致觀察到的分類效能差距，這代表捷徑學習和模型公平性分析之間先前未受到重視的交互作用。
 
-##### **Harnessing Diverse Perspectives: A Multi-Agent Framework for Enhanced Error Detection in Knowledge Graphs**
-2501.15791v1 by Yu Li, Yi Huang, Guilin Qi, Junlan Feng, Nan Hu, Songlin Zhai, Haohan Xue, Yongrui Chen, Ruoyan Shen, Tongtong Wu
+##### **Unlocking the Potential of Metaverse in Innovative and Immersive Digital Health**
+2406.07114v2 by Fatemeh Ebrahimzadeh, Ramin Safa
 
-Knowledge graphs are widely used in industrial applications, making error
-detection crucial for ensuring the reliability of downstream applications.
-Existing error detection methods often fail to effectively leverage
-fine-grained subgraph information and rely solely on fixed graph structures,
-while also lacking transparency in their decision-making processes, which
-results in suboptimal detection performance. In this paper, we propose a novel
-Multi-Agent framework for Knowledge Graph Error Detection (MAKGED) that
-utilizes multiple large language models (LLMs) in a collaborative setting. By
-concatenating fine-grained, bidirectional subgraph embeddings with LLM-based
-query embeddings during training, our framework integrates these
-representations to produce four specialized agents. These agents utilize
-subgraph information from different dimensions to engage in multi-round
-discussions, thereby improving error detection accuracy and ensuring a
-transparent decision-making process. Extensive experiments on FB15K and WN18RR
-demonstrate that MAKGED outperforms state-of-the-art methods, enhancing the
-accuracy and robustness of KG evaluation. For specific industrial scenarios,
-our framework can facilitate the training of specialized agents using
-domain-specific knowledge graphs for error detection, which highlights the
-potential industrial application value of our framework. Our code and datasets
-are available at https://github.com/kse-ElEvEn/MAKGED.
+The concept of Metaverse has attracted a lot of attention in various fields
+and one of its important applications is health and treatment. The Metaverse
+has enormous potential to transform healthcare by changing patient care,
+medical education, and the way teaching/learning and research are done. The
+purpose of this research is to provide an introduction to the basic concepts
+and fundamental technologies of the Metaverse. This paper examines the pros and
+cons of the Metaverse in healthcare context and analyzes its potential from the
+technology and AI perspective. In particular, the role of machine learning
+methods is discussed; We will explain how machine learning algorithms can be
+applied to the Metaverse generated data to gain better insights in healthcare
+applications. Additionally, we examine the future visions of the Metaverse in
+health delivery, by examining emerging technologies such as blockchain and also
+addressing privacy concerns. The findings of this study contribute to a deeper
+understanding of the applications of Metaverse in healthcare and its potential
+to revolutionize the delivery of medical services.
 
-摘要：知識圖譜廣泛應用於工業應用中，使得錯誤偵測對於確保下游應用的可靠性至關重要。現有的錯誤偵測方法通常無法有效利用細粒度的子圖資訊，並且僅依賴於固定的圖形結構，同時在它們的決策過程中也缺乏透明度，這導致次佳的偵測效能。在本文中，我們提出了一個用於知識圖譜錯誤偵測 (MAKGED) 的新多代理架構，它在協作設定中利用了多個大型語言模型 (LLM)。透過在訓練期間將細粒度、雙向子圖嵌入與基於 LLM 的查詢嵌入串接，我們的架構整合了這些表示以產生四個專門代理。這些代理利用不同維度的子圖資訊參與多輪討論，從而提高錯誤偵測準確度並確保透明的決策過程。在 FB15K 和 WN18RR 上的廣泛實驗表明，MAKGED 優於最先進的方法，增強了 KG 評估的準確性和穩健性。對於特定產業情境，我們的架構可以利用特定領域的知識圖譜來促進專門代理的訓練以進行錯誤偵測，這突顯了我們架構的潛在產業應用價值。我們的程式碼和資料集可在 https://github.com/kse-ElEvEn/MAKGED 取得。
+摘要：元宇宙的概念在各個領域都備受關注，其重要應用之一便是醫療保健。元宇宙有巨大的潛力透過改變病患照護、醫學教育，以及教學/學習和研究的方式來轉型醫療保健。本研究的目的是提供元宇宙基本概念和基礎技術的介紹。本文探討了元宇宙在醫療保健背景下的優缺點，並從技術和 AI 的角度分析其潛力。特別是，討論了機器學習方法的角色；我們將說明如何將機器學習演算法應用於元宇宙產生的資料，以獲得醫療保健應用方面的更佳見解。此外，我們透過探討區塊鏈等新興技術，並解決隱私問題，來探討元宇宙在醫療保健方面的未來願景。本研究的發現有助於更深入地了解元宇宙在醫療保健中的應用，以及其在醫療服務提供方面發揮革命性變革的潛力。
 
-##### **Automatic Feedback Generation for Short Answer Questions using Answer Diagnostic Graphs**
-2501.15777v1 by Momoka Furuhashi, Hiroaki Funayama, Yuya Iwase, Yuichiroh Matsubayashi, Yoriko Isobe, Toru Nagahama, Saku Sugawara, Kentaro Inui
+##### **AI-Driven Predictive Analytics Approach for Early Prognosis of Chronic Kidney Disease Using Ensemble Learning and Explainable AI**
+2406.06728v2 by K M Tawsik Jawad, Anusha Verma, Fathi Amsaad, Lamia Ashraf
 
-Short-reading comprehension questions help students understand text structure
-but lack effective feedback. Students struggle to identify and correct errors,
-while manual feedback creation is labor-intensive. This highlights the need for
-automated feedback linking responses to a scoring rubric for deeper
-comprehension.
-  Despite advances in Natural Language Processing (NLP), research has focused
-on automatic grading, with limited work on feedback generation. To address
-this, we propose a system that generates feedback for student responses.
-  Our contributions are twofold. First, we introduce the first system for
-feedback on short-answer reading comprehension. These answers are derived from
-the text, requiring structural understanding. We propose an "answer diagnosis
-graph," integrating the text's logical structure with feedback templates. Using
-this graph and NLP techniques, we estimate students' comprehension and generate
-targeted feedback.
-  Second, we evaluate our feedback through an experiment with Japanese high
-school students (n=39). They answered two 70-80 word questions and were divided
-into two groups with minimal academic differences. One received a model answer,
-the other system-generated feedback. Both re-answered the questions, and we
-compared score changes. A questionnaire assessed perceptions and motivation.
-  Results showed no significant score improvement between groups, but
-system-generated feedback helped students identify errors and key points in the
-text. It also significantly increased motivation. However, further refinement
-is needed to enhance text structure understanding.
+Chronic Kidney Disease (CKD) is one of the widespread Chronic diseases with
+no known ultimo cure and high morbidity. Research demonstrates that progressive
+Chronic Kidney Disease (CKD) is a heterogeneous disorder that significantly
+impacts kidney structure and functions, eventually leading to kidney failure.
+With the progression of time, chronic kidney disease has moved from a
+life-threatening disease affecting few people to a common disorder of varying
+severity. The goal of this research is to visualize dominating features,
+feature scores, and values exhibited for early prognosis and detection of CKD
+using ensemble learning and explainable AI. For that, an AI-driven predictive
+analytics approach is proposed to aid clinical practitioners in prescribing
+lifestyle modifications for individual patients to reduce the rate of
+progression of this disease. Our dataset is collected on body vitals from
+individuals with CKD and healthy subjects to develop our proposed AI-driven
+solution accurately. In this regard, blood and urine test results are provided,
+and ensemble tree-based machine-learning models are applied to predict unseen
+cases of CKD. Our research findings are validated after lengthy consultations
+with nephrologists. Our experiments and interpretation results are compared
+with existing explainable AI applications in various healthcare domains,
+including CKD. The comparison shows that our developed AI models, particularly
+the Random Forest model, have identified more features as significant
+contributors than XgBoost. Interpretability (I), which measures the ratio of
+important to masked features, indicates that our XgBoost model achieved a
+higher score, specifically a Fidelity of 98\%, in this metric and naturally in
+the FII index compared to competing models.
 
-摘要：短篇閱讀理解題目有助學生理解文章結構，但缺乏有效的回饋。學生難以找出並更正錯誤，而手動建立回饋又很費力。這突顯了自動化回饋的必要性，將回應連結到評分標準，以獲得更深入的理解。
+摘要：慢性腎臟病 (CKD) 是一種廣泛的慢性疾病，目前尚未找到最終的治療方法，且發病率很高。研究表明，進行性慢性腎臟病 (CKD) 是一種異質性疾病，會顯著影響腎臟結構和功能，最終導致腎衰竭。隨著時間的推移，慢性腎臟病已從影響少數人的致命疾病演變成一種嚴重程度不一的常見疾病。本研究的目標是使用整體學習和可解釋的 AI 來視覺化支配性特徵、特徵分數和值，以進行 CKD 的早期預後和檢測。為此，提出了一種 AI 驅動的預測分析方法，以幫助臨床醫生為個別患者開具生活方式的修改建議，以降低此疾病的進展速度。我們的數據集是從 CKD 患者和健康受試者的身體生命徵象中收集的，以準確開發我們提出的 AI 驅動的解決方案。在這方面，提供了血液和尿液檢測結果，並應用基於集成樹的機器學習模型來預測未見的 CKD 病例。我們的研究結果在與腎臟科醫師進行長時間諮詢後得到驗證。我們的實驗和解釋結果與各種醫療保健領域中現有的可解釋 AI 應用進行了比較，包括 CKD。比較表明，我們開發的 AI 模型，特別是隨機森林模型，已經確定了比 XgBoost 更多的特徵作為顯著的貢獻者。可解釋性 (I) 衡量重要特徵與被遮蔽特徵的比率，表明我們的 XgBoost 模型在此指標中取得了更高的分數，特別是 98% 的保真度，並且在 FII 指數中自然高於競爭模型。
+
+##### **Explainable AI for Mental Disorder Detection via Social Media: A survey and outlook**
+2406.05984v1 by Yusif Ibrahimov, Tarique Anwar, Tommy Yuan
+
+Mental health constitutes a complex and pervasive global challenge, affecting
+millions of lives and often leading to severe consequences. In this paper, we
+conduct a thorough survey to explore the intersection of data science,
+artificial intelligence, and mental healthcare, focusing on the recent
+developments of mental disorder detection through online social media (OSM). A
+significant portion of the population actively engages in OSM platforms,
+creating a vast repository of personal data that holds immense potential for
+mental health analytics. The paper navigates through traditional diagnostic
+methods, state-of-the-art data- and AI-driven research studies, and the
+emergence of explainable AI (XAI) models for mental healthcare. We review
+state-of-the-art machine learning methods, particularly those based on modern
+deep learning, while emphasising the need for explainability in healthcare AI
+models. The experimental design section provides insights into prevalent
+practices, including available datasets and evaluation approaches. We also
+identify key issues and challenges in the field and propose promising future
+research directions. As mental health decisions demand transparency,
+interpretability, and ethical considerations, this paper contributes to the
+ongoing discourse on advancing XAI in mental healthcare through social media.
+The comprehensive overview presented here aims to guide researchers,
+practitioners, and policymakers in developing the area of mental disorder
+detection.
 
-儘管自然語言處理 (NLP) 有所進展，但研究一直集中在自動評分上，而回饋生成的工作有限。為了解決這個問題，我們提出了一個系統，用於為學生的回答產生回饋。
+摘要：心理健康構成了一項複雜且普遍的全球挑戰，影響了數百萬人的生活，並經常導致嚴重的後果。在本文中，我們進行了一項徹底的調查，以探索數據科學、人工智慧和心理保健的交集，重點關注通過線上社交媒體 (OSM) 進行心理疾病檢測的最新發展。很大一部分人口積極參與 OSM 平台，創造了一個龐大的人員資料庫，對心理健康分析具有巨大的潛力。本文探討了傳統的診斷方法、最先進的資料和 AI 驅動的研究，以及心理保健中可解釋 AI (XAI) 模型的出現。我們回顧了最先進的機器學習方法，特別是那些基於現代深度學習的方法，同時強調了醫療保健 AI 模型中可解釋性的必要性。實驗設計部分提供了對普遍做法的見解，包括可用的資料集和評估方法。我們還找出該領域的主要問題和挑戰，並提出了有希望的未來研究方向。由於心理健康決策需要透明度、可解釋性和道德考量，本文有助於推進心理保健中透過社交媒體推進 XAI 的持續討論。這裡提出的全面概述旨在引導研究人員、從業人員和政策制定者發展心理疾病檢測領域。
 
-我們的貢獻有兩個方面。首先，我們引入了第一個針對簡答閱讀理解提供回饋的系統。這些答案來自於文本，需要結構化的理解。我們提出了一個「答案診斷圖」，將文本的邏輯結構與回饋範本整合在一起。使用這個圖表和 NLP 技術，我們估計學生的理解力並產生有針對性的回饋。
+##### **Methodology and Real-World Applications of Dynamic Uncertain Causality Graph for Clinical Diagnosis with Explainability and Invariance**
+2406.05746v1 by Zhan Zhang, Qin Zhang, Yang Jiao, Lin Lu, Lin Ma, Aihua Liu, Xiao Liu, Juan Zhao, Yajun Xue, Bing Wei, Mingxia Zhang, Ru Gao, Hong Zhao, Jie Lu, Fan Li, Yang Zhang, Yiming Wang, Lei Zhang, Fengwei Tian, Jie Hu, Xin Gou
 
-其次，我們透過一項針對日本高中生的實驗（n=39）來評估我們的回饋。他們回答了兩個 70-80 字的問題，並被分成兩組，學術差異最小。一組收到範本答案，另一組收到系統產生的回饋。兩組都重新回答了問題，我們比較了分數的變化。一份問卷評估了認知和動機。
+AI-aided clinical diagnosis is desired in medical care. Existing deep
+learning models lack explainability and mainly focus on image analysis. The
+recently developed Dynamic Uncertain Causality Graph (DUCG) approach is
+causality-driven, explainable, and invariant across different application
+scenarios, without problems of data collection, labeling, fitting, privacy,
+bias, generalization, high cost and high energy consumption. Through close
+collaboration between clinical experts and DUCG technicians, 46 DUCG models
+covering 54 chief complaints were constructed. Over 1,000 diseases can be
+diagnosed without triage. Before being applied in real-world, the 46 DUCG
+models were retrospectively verified by third-party hospitals. The verified
+diagnostic precisions were no less than 95%, in which the diagnostic precision
+for every disease including uncommon ones was no less than 80%. After
+verifications, the 46 DUCG models were applied in the real-world in China. Over
+one million real diagnosis cases have been performed, with only 17 incorrect
+diagnoses identified. Due to DUCG's transparency, the mistakes causing the
+incorrect diagnoses were found and corrected. The diagnostic abilities of the
+clinicians who applied DUCG frequently were improved significantly. Following
+the introduction to the earlier presented DUCG methodology, the recommendation
+algorithm for potential medical checks is presented and the key idea of DUCG is
+extracted.
 
-結果顯示兩組之間沒有顯著的分數進步，但系統產生的回饋有助於學生找出文本中的錯誤和重點。它也顯著地提高了動機。然而，需要進一步的改進來增強對文本結構的理解。
+摘要：<paragraph>醫療照護中需要 AI 輔助的臨床診斷。現有的深度學習模型缺乏可解釋性，並且主要專注於影像分析。最近開發的動態不確定因果關係圖 (DUCG) 方法是因果驅動的、可解釋的，並且在不同的應用場景中是不變的，沒有資料收集、標記、擬合、隱私、偏見、概化、高成本和高能耗的問題。通過臨床專家和 DUCG 技術人員之間的密切合作，構建了涵蓋 54 個主訴的 46 個 DUCG 模型。可以在沒有分流的情況下診斷出 1,000 多種疾病。在應用於實際世界之前，46 個 DUCG 模型已由第三方醫院回溯性驗證。驗證的診斷精度不低於 95%，其中包括罕見疾病在內的每種疾病的診斷精度不低於 80%。驗證後，46 個 DUCG 模型已在中國實際應用。已經執行了超過一百萬個真實診斷案例，僅發現 17 個不正確的診斷。由於 DUCG 的透明性，發現並糾正了導致不正確診斷的錯誤。頻繁應用 DUCG 的臨床醫生的診斷能力得到了顯著提高。在介紹了前面提出的 DUCG 方法論之後，提出了潛在健康檢查的推薦演算法，並提取了 DUCG 的關鍵思想。</paragraph>
 
-##### **Transformer-Based Multimodal Knowledge Graph Completion with Link-Aware Contexts**
-2501.15688v1 by Haodi Ma, Dzmitry Kasinets, Daisy Zhe Wang
+##### **Advancing Histopathology-Based Breast Cancer Diagnosis: Insights into Multi-Modality and Explainability**
+2406.12897v1 by Faseela Abdullakutty, Younes Akbari, Somaya Al-Maadeed, Ahmed Bouridane, Rifat Hamoudi
 
-Multimodal knowledge graph completion (MMKGC) aims to predict missing links
-in multimodal knowledge graphs (MMKGs) by leveraging information from various
-modalities alongside structural data. Existing MMKGC approaches primarily
-extend traditional knowledge graph embedding (KGE) models, which often require
-creating an embedding for every entity. This results in large model sizes and
-inefficiencies in integrating multimodal information, particularly for
-real-world graphs. Meanwhile, Transformer-based models have demonstrated
-competitive performance in knowledge graph completion (KGC). However, their
-focus on single-modal knowledge limits their capacity to utilize cross-modal
-information. Recently, Large vision-language models (VLMs) have shown potential
-in cross-modal tasks but are constrained by the high cost of training. In this
-work, we propose a novel approach that integrates Transformer-based KGE models
-with cross-modal context generated by pre-trained VLMs, thereby extending their
-applicability to MMKGC. Specifically, we employ a pre-trained VLM to transform
-relevant visual information from entities and their neighbors into textual
-sequences. We then frame KGC as a sequence-to-sequence task, fine-tuning the
-model with the generated cross-modal context. This simple yet effective method
-significantly reduces model size compared to traditional KGE approaches while
-achieving competitive performance across multiple large-scale datasets with
-minimal hyperparameter tuning.
+It is imperative that breast cancer is detected precisely and timely to
+improve patient outcomes. Diagnostic methodologies have traditionally relied on
+unimodal approaches; however, medical data analytics is integrating diverse
+data sources beyond conventional imaging. Using multi-modal techniques,
+integrating both image and non-image data, marks a transformative advancement
+in breast cancer diagnosis. The purpose of this review is to explore the
+burgeoning field of multimodal techniques, particularly the fusion of
+histopathology images with non-image data. Further, Explainable AI (XAI) will
+be used to elucidate the decision-making processes of complex algorithms,
+emphasizing the necessity of explainability in diagnostic processes. This
+review utilizes multi-modal data and emphasizes explainability to enhance
+diagnostic accuracy, clinician confidence, and patient engagement, ultimately
+fostering more personalized treatment strategies for breast cancer, while also
+identifying research gaps in multi-modality and explainability, guiding future
+studies, and contributing to the strategic direction of the field.
 
-摘要：多模態知識圖譜補全 (MMKGC) 旨在透過利用來自各種模態與結構化資料的資訊，來預測多模態知識圖譜 (MMKG) 中的缺失連結。現有的 MMKGC 方法主要擴充傳統的知識圖譜嵌入 (KGE) 模型，這些模型通常需要為每個實體建立一個嵌入。這會導致模型尺寸過大，且在整合多模態資訊時效率低下，特別是對於真實世界的圖譜。與此同時，基於 Transformer 的模型已在知識圖譜補全 (KGC) 中展現出競爭力。然而，它們著重於單模態知識，限制了它們利用跨模態資訊的能力。最近，大型視覺語言模型 (VLM) 已在跨模態任務中展現潛力，但受限於訓練成本過高。在這項工作中，我們提出了一種創新的方法，它將基於 Transformer 的 KGE 模型與預先訓練的 VLM 所產生的跨模態內容整合在一起，從而擴展它們在 MMKGC 中的適用性。具體來說，我們採用預先訓練的 VLM，將實體及其鄰居相關的視覺資訊轉換成文字序列。然後，我們將 KGC 架構成一個序列到序列的任務，並使用產生的跨模態內容微調模型。這種簡單但有效的方法，與傳統的 KGE 方法相比，大幅減少了模型尺寸，同時在多個大型資料集上達到了競爭力的效能，且只需最少的超參數調整。
+摘要：精確且及時地偵測乳癌對於改善患者預後至關重要。診斷方法傳統上依賴於單一模式方法；然而，醫療資料分析正在整合超越傳統影像的各種資料來源。使用整合影像和非影像資料的多模式技術，標誌著乳癌診斷的變革性進展。本篇綜述的目的是探討多模式技術的新興領域，特別是將組織病理學影像與非影像資料融合。此外，可解釋人工智慧 (XAI) 將用於闡明複雜演算法的決策過程，強調診斷過程中可解釋性的必要性。本綜述利用多模式資料並強調可解釋性，以提高診斷準確性、臨床醫師的信心和患者參與度，最終促進乳癌更個人化的治療策略，同時也找出多模式和可解釋性的研究差距，引導未來的研究，並為該領域的策略方向做出貢獻。
 
-##### **How to Mitigate Information Loss in Knowledge Graphs for GraphRAG: Leveraging Triple Context Restoration and Query-Driven Feedback**
-2501.15378v1 by Manzong Huang, Chenyang Bu, Yi He, Xindong Wu
+##### **Using Explainable AI for EEG-based Reduced Montage Neonatal Seizure Detection**
+2406.16908v3 by Dinuka Sandun Udayantha, Kavindu Weerasinghe, Nima Wickramasinghe, Akila Abeyratne, Kithmin Wickremasinghe, Jithangi Wanigasinghe, Anjula De Silva, Chamira U. S. Edussooriya
 
-Knowledge Graph (KG)-augmented Large Language Models (LLMs) have recently
-propelled significant advances in complex reasoning tasks, thanks to their
-broad domain knowledge and contextual awareness. Unfortunately, current methods
-often assume KGs to be complete, which is impractical given the inherent
-limitations of KG construction and the potential loss of contextual cues when
-converting unstructured text into entity-relation triples. In response, this
-paper proposes the Triple Context Restoration and Query-driven Feedback
-(TCR-QF) framework, which reconstructs the textual context underlying each
-triple to mitigate information loss, while dynamically refining the KG
-structure by iteratively incorporating query-relevant missing knowledge.
-Experiments on five benchmark question-answering datasets substantiate the
-effectiveness of TCR-QF in KG and LLM integration, where itachieves a 29.1%
-improvement in Exact Match and a 15.5% improvement in F1 over its
-state-of-the-art GraphRAG competitors.
+The neonatal period is the most vulnerable time for the development of
+seizures. Seizures in the immature brain lead to detrimental consequences,
+therefore require early diagnosis. The gold-standard for neonatal seizure
+detection currently relies on continuous video-EEG monitoring; which involves
+recording multi-channel electroencephalogram (EEG) alongside real-time video
+monitoring within a neonatal intensive care unit (NICU). However, video-EEG
+monitoring technology requires clinical expertise and is often limited to
+technologically advanced and resourceful settings. Cost-effective new
+techniques could help the medical fraternity make an accurate diagnosis and
+advocate treatment without delay. In this work, a novel explainable deep
+learning model to automate the neonatal seizure detection process with a
+reduced EEG montage is proposed, which employs convolutional nets, graph
+attention layers, and fully connected layers. Beyond its ability to detect
+seizures in real-time with a reduced montage, this model offers the unique
+advantage of real-time interpretability. By evaluating the performance on the
+Zenodo dataset with 10-fold cross-validation, the presented model achieves an
+absolute improvement of 8.31% and 42.86% in area under curve (AUC) and recall,
+respectively.
 
-摘要：知識圖譜 (KG) 增強大型語言模型 (LLM) 最近推動複雜推理任務的重大進展，這要歸功於它們廣泛的領域知識和語境感知。不幸的是，目前的模型通常假設 KG 是完整的，這在考慮到 KG 建構的固有限制和在將非結構化文字轉換為實體關係三元組時潛在的語境線索損失時是不切實際的。為了解決這個問題，本文提出了三元組語境還原和查詢驅動回饋 (TCR-QF) 架構，它重建每個三元組底層的文字語境以減輕資訊損失，同時透過反覆納入與查詢相關的遺失知識來動態優化 KG 結構。在五個基準問題回答資料集上的實驗證實了 TCR-QF 在 KG 和 LLM 整合方面的有效性，它在 Exact Match 中獲得 29.1% 的改進，在 F1 中獲得 15.5% 的改進，優於最先進的 GraphRAG 競爭對手。
+摘要：新生兒期是大腦發育最脆弱的時期，容易出現癲癇發作。大腦發育不成熟時出現癲癇發作會造成不良後果，因此需要及早診斷。目前新生兒癲癇發作的黃金標準依賴於連續的視訊腦電圖 (EEG) 監測；其中包括在新生兒加護病房 (NICU) 內同時進行多頻道腦電圖 (EEG) 記錄和即時視訊監控。然而，視訊腦電圖監控技術需要臨床專業知識，而且通常僅限於技術先進且資源豐富的環境。具成本效益的新技術可以幫助醫療界準確診斷並立即提倡治療。在這項工作中，提出了一個新穎的可解釋深度學習模型，以自動化新生兒癲癇發作偵測過程，並採用減少的腦電圖裝置，其中採用了卷積神經網路、圖形注意力層和全連接層。除了能夠使用減少的裝置即時偵測癲癇發作外，此模型還提供了即時可解釋性的獨特優勢。透過在 Zenodo 資料集上使用 10 倍交叉驗證評估效能，所提出的模型在曲線下面積 (AUC) 和召回率方面分別達到了 8.31% 和 42.86% 的絕對改善。
 
-##### **Explaining Categorical Feature Interactions Using Graph Covariance and LLMs**
-2501.14932v1 by Cencheng Shen, Darren Edge, Jonathan Larson, Carey E. Priebe
+##### **Breast Cancer Diagnosis: A Comprehensive Exploration of Explainable Artificial Intelligence (XAI) Techniques**
+2406.00532v1 by Samita Bai, Sidra Nasir, Rizwan Ahmed Khan, Sheeraz Arif, Alexandre Meyer, Hubert Konik
 
-Modern datasets often consist of numerous samples with abundant features and
-associated timestamps. Analyzing such datasets to uncover underlying events
-typically requires complex statistical methods and substantial domain
-expertise. A notable example, and the primary data focus of this paper, is the
-global synthetic dataset from the Counter Trafficking Data Collaborative (CTDC)
--- a global hub of human trafficking data containing over 200,000 anonymized
-records spanning from 2002 to 2022, with numerous categorical features for each
-record. In this paper, we propose a fast and scalable method for analyzing and
-extracting significant categorical feature interactions, and querying large
-language models (LLMs) to generate data-driven insights that explain these
-interactions. Our approach begins with a binarization step for categorical
-features using one-hot encoding, followed by the computation of graph
-covariance at each time. This graph covariance quantifies temporal changes in
-dependence structures within categorical data and is established as a
-consistent dependence measure under the Bernoulli distribution. We use this
-measure to identify significant feature pairs, such as those with the most
-frequent trends over time or those exhibiting sudden spikes in dependence at
-specific moments. These extracted feature pairs, along with their timestamps,
-are subsequently passed to an LLM tasked with generating potential explanations
-of the underlying events driving these dependence changes. The effectiveness of
-our method is demonstrated through extensive simulations, and its application
-to the CTDC dataset reveals meaningful feature pairs and potential data stories
-underlying the observed feature interactions.
+Breast cancer (BC) stands as one of the most common malignancies affecting
+women worldwide, necessitating advancements in diagnostic methodologies for
+better clinical outcomes. This article provides a comprehensive exploration of
+the application of Explainable Artificial Intelligence (XAI) techniques in the
+detection and diagnosis of breast cancer. As Artificial Intelligence (AI)
+technologies continue to permeate the healthcare sector, particularly in
+oncology, the need for transparent and interpretable models becomes imperative
+to enhance clinical decision-making and patient care. This review discusses the
+integration of various XAI approaches, such as SHAP, LIME, Grad-CAM, and
+others, with machine learning and deep learning models utilized in breast
+cancer detection and classification. By investigating the modalities of breast
+cancer datasets, including mammograms, ultrasounds and their processing with
+AI, the paper highlights how XAI can lead to more accurate diagnoses and
+personalized treatment plans. It also examines the challenges in implementing
+these techniques and the importance of developing standardized metrics for
+evaluating XAI's effectiveness in clinical settings. Through detailed analysis
+and discussion, this article aims to highlight the potential of XAI in bridging
+the gap between complex AI models and practical healthcare applications,
+thereby fostering trust and understanding among medical professionals and
+improving patient outcomes.
 
-摘要：現代資料集通常包含許多具有豐富特徵和關聯時間戳的樣本。分析此類資料集以揭示底層事件通常需要複雜的統計方法和大量的領域專業知識。一個值得注意的範例，也是本文的主要資料重點，是來自反人口販運資料合作組織 (CTDC) 的全球合成資料集，這是全球人口販運資料的樞紐，包含超過 200,000 筆從 2002 年到 2022 年的匿名記錄，每個記錄都有許多分類特徵。在本文中，我們提出了一種快速且可擴充的方法，用於分析和提取重要的分類特徵交互作用，並查詢大型語言模型 (LLM)，以產生資料驅動的見解來解釋這些交互作用。我們的做法從使用獨熱編碼對分類特徵進行二元化步驟開始，然後在每個時間點計算圖形共變異數。此圖形共變異數量化了分類資料中依賴結構的時間變化，並在伯努利分佈下建立為一致的依賴度量。我們使用此度量來識別重要的特徵對，例如隨時間推移趨勢最頻繁的特徵對，或在特定時刻表現出依賴性突然激增的特徵對。這些提取的特徵對及其時間戳隨後傳遞給 LLM，後者負責產生對驅動這些依賴性變化的底層事件的潛在解釋。我們的方法的有效性已通過廣泛的模擬得到證明，其在 CTDC 資料集中的應用揭示了有意義的特徵對和潛在的資料故事，這些故事是觀察到的特徵交互作用的基礎。
+摘要：乳癌 (BC) 是影響全球女性最常見的惡性腫瘤之一，因此需要進步的診斷方法，以改善臨床結果。本文全面探討了可解釋人工智慧 (XAI) 技術在乳癌偵測和診斷中的應用。隨著人工智慧 (AI) 技術持續滲透醫療保健領域，特別是在腫瘤學中，透明且可解釋的模型需求變得勢在必行，以增強臨床決策制定和患者照護。此篇評論探討了各種 XAI 方法的整合，例如 SHAP、LIME、Grad-CAM 等，以及用於乳癌偵測和分類的機器學習和深度學習模型。透過探討乳癌資料集的模式，包括乳房攝影、超音波及其在 AI 中的處理，本文重點說明 XAI 如何能導致更準確的診斷和個人化治療計畫。它也探討了實施這些技術的挑戰，以及制定標準化評量指標以評估 XAI 在臨床環境中的有效性的重要性。透過詳細的分析和討論，本文旨在強調 XAI 在縮小複雜 AI 模型與實務醫療保健應用之間差距的潛力，進而促進醫療專業人員之間的信任與理解，並改善患者的結果。
 
-##### **Causal Graphs Meet Thoughts: Enhancing Complex Reasoning in Graph-Augmented LLMs**
-2501.14892v1 by Hang Luo, Jian Zhang, Chujun Li
+##### **Unveiling Hidden Factors: Explainable AI for Feature Boosting in Speech Emotion Recognition**
+2406.01624v2 by Alaa Nfissi, Wassim Bouachir, Nizar Bouguila, Brian Mishara
 
-In knowledge-intensive tasks, especially in high-stakes domains like medicine
-and law, it is critical not only to retrieve relevant information but also to
-provide causal reasoning and explainability. Large language models (LLMs) have
-achieved remarkable performance in natural language understanding and
-generation tasks. However, they often suffer from limitations such as
-difficulty in incorporating new knowledge, generating hallucinations, and
-explaining their reasoning process. To address these challenges, integrating
-knowledge graphs with Graph Retrieval-Augmented Generation (Graph RAG) has
-emerged as an effective solution. Traditional Graph RAG methods often rely on
-simple graph traversal or semantic similarity, which do not capture causal
-relationships or align well with the model's internal reasoning steps. This
-paper proposes a novel pipeline that filters large knowledge graphs to
-emphasize cause-effect edges, aligns the retrieval process with the model's
-chain-of-thought (CoT), and enhances reasoning through multi-stage path
-improvements. Experiments on medical question-answering tasks show consistent
-gains, with up to a 10\% absolute improvement across multiple large language
-models (LLMs). This approach demonstrates the value of combining causal
-reasoning with stepwise retrieval, leading to more interpretable and logically
-grounded solutions for complex queries.
+Speech emotion recognition (SER) has gained significant attention due to its
+several application fields, such as mental health, education, and
+human-computer interaction. However, the accuracy of SER systems is hindered by
+high-dimensional feature sets that may contain irrelevant and redundant
+information. To overcome this challenge, this study proposes an iterative
+feature boosting approach for SER that emphasizes feature relevance and
+explainability to enhance machine learning model performance. Our approach
+involves meticulous feature selection and analysis to build efficient SER
+systems. In addressing our main problem through model explainability, we employ
+a feature evaluation loop with Shapley values to iteratively refine feature
+sets. This process strikes a balance between model performance and
+transparency, which enables a comprehensive understanding of the model's
+predictions. The proposed approach offers several advantages, including the
+identification and removal of irrelevant and redundant features, leading to a
+more effective model. Additionally, it promotes explainability, facilitating
+comprehension of the model's predictions and the identification of crucial
+features for emotion determination. The effectiveness of the proposed method is
+validated on the SER benchmarks of the Toronto emotional speech set (TESS),
+Berlin Database of Emotional Speech (EMO-DB), Ryerson Audio-Visual Database of
+Emotional Speech and Song (RAVDESS), and Surrey Audio-Visual Expressed Emotion
+(SAVEE) datasets, outperforming state-of-the-art methods. To the best of our
+knowledge, this is the first work to incorporate model explainability into an
+SER framework. The source code of this paper is publicly available via this
+https://github.com/alaaNfissi/Unveiling-Hidden-Factors-Explainable-AI-for-Feature-Boosting-in-Speech-Emotion-Recognition.
 
-摘要：在知識密集型任務中，特別是在醫學和法律等高風險領域，不僅檢索相關資訊至關重要，還必須提供因果推理和可解釋性。大型語言模型 (LLM) 在自然語言理解和生成任務中取得了顯著的表現。然而，它們通常會遇到一些限制，例如難以納入新知識、產生幻覺，以及解釋其推理過程。為了應對這些挑戰，將知識圖與圖形檢索增強生成 (Graph RAG) 整合在一起已成為一種有效的解決方案。傳統的 Graph RAG 方法通常依賴於簡單的圖形遍歷或語義相似性，這無法捕捉因果關係或與模型的內部推理步驟很好地對齊。本文提出了一個新穎的管道，該管道過濾大型知識圖以強調因果邊緣，將檢索過程與模型的思想鏈 (CoT) 對齊，並通過多階段路徑改進來增強推理。在醫療問題解答任務上的實驗顯示出一致的收益，在多個大型語言模型 (LLM) 中絕對改進幅度高達 10%。這種方法展示了將因果推理與逐步檢索相結合的價值，從而為複雜查詢提供更具可解釋性和邏輯依據的解決方案。
+摘要：語音情緒辨識 (SER) 由於其在心理健康、教育和人機互動等多個應用領域而備受關注。然而，SER 系統的準確性受到高維特徵集的阻礙，這些特徵集可能包含不相關和冗餘的資訊。為了克服這個挑戰，本研究提出了一種用於 SER 的迭代特徵提升方法，該方法強調特徵相關性和可解釋性，以增強機器學習模型的效能。我們的做法涉及仔細的特徵選擇和分析，以建立高效的 SER 系統。為了透過模型可解釋性解決我們的核心問題，我們採用了具有 Shapley 值的特徵評估迴圈，以反覆改善特徵集。這個過程在模型效能和透明度之間取得平衡，這使得我們能夠全面了解模型的預測。所提出的方法提供了多項優點，包括識別和移除不相關和冗餘的特徵，從而建立更有效的模型。此外，它促進了可解釋性，有助於理解模型的預測以及識別情緒決定的關鍵特徵。所提出的方法的有效性已在多倫多情緒語音集 (TESS)、柏林情緒語音資料庫 (EMO-DB)、賴爾森音訊視覺情緒語音和歌曲資料庫 (RAVDESS) 和薩里音訊視覺表達情緒 (SAVEE) 資料集的 SER 基準上得到驗證，其效能優於現有方法。據我們所知，這是第一個將模型可解釋性納入 SER 架構的研究。本文的原始碼可透過此連結公開取得：https://github.com/alaaNfissi/Unveiling-Hidden-Factors-Explainable-AI-for-Feature-Boosting-in-Speech-Emotion-Recognition。
 
-##### **GraPPI: A Retrieve-Divide-Solve GraphRAG Framework for Large-scale Protein-protein Interaction Exploration**
-2501.16382v1 by Ziwen Li, Xiang 'Anthony' Chen, Youngseung Jeon
+##### **The Explanation Necessity for Healthcare AI**
+2406.00216v1 by Michail Mamalakis, Héloïse de Vareilles, Graham Murray, Pietro Lio, John Suckling
 
-Drug discovery (DD) has tremendously contributed to maintaining and improving
-public health. Hypothesizing that inhibiting protein misfolding can slow
-disease progression, researchers focus on target identification (Target ID) to
-find protein structures for drug binding. While Large Language Models (LLMs)
-and Retrieval-Augmented Generation (RAG) frameworks have accelerated drug
-discovery, integrating models into cohesive workflows remains challenging. We
-conducted a user study with drug discovery researchers to identify the
-applicability of LLMs and RAGs in Target ID. We identified two main findings:
-1) an LLM should provide multiple Protein-Protein Interactions (PPIs) based on
-an initial protein and protein candidates that have a therapeutic impact; 2)
-the model must provide the PPI and relevant explanations for better
-understanding. Based on these observations, we identified three limitations in
-previous approaches for Target ID: 1) semantic ambiguity, 2) lack of
-explainability, and 3) short retrieval units. To address these issues, we
-propose GraPPI, a large-scale knowledge graph (KG)-based retrieve-divide-solve
-agent pipeline RAG framework to support large-scale PPI signaling pathway
-exploration in understanding therapeutic impacts by decomposing the analysis of
-entire PPI pathways into sub-tasks focused on the analysis of PPI edges.
+Explainability is often critical to the acceptable implementation of
+artificial intelligence (AI). Nowhere is this more important than healthcare
+where decision-making directly impacts patients and trust in AI systems is
+essential. This trust is often built on the explanations and interpretations
+the AI provides. Despite significant advancements in AI interpretability, there
+remains the need for clear guidelines on when and to what extent explanations
+are necessary in the medical context. We propose a novel categorization system
+with four distinct classes of explanation necessity, guiding the level of
+explanation required: patient or sample (local) level, cohort or dataset
+(global) level, or both levels. We introduce a mathematical formulation that
+distinguishes these categories and offers a practical framework for researchers
+to determine the necessity and depth of explanations required in medical AI
+applications. Three key factors are considered: the robustness of the
+evaluation protocol, the variability of expert observations, and the
+representation dimensionality of the application. In this perspective, we
+address the question: When does an AI medical application need to be explained,
+and at what level of detail?
 
-摘要：药物发现 (DD) 极大地促进了公共卫生的维护和改善。研究人员假设抑制蛋白质错误折叠可以减缓疾病进展，因此专注于靶点识别 (Target ID) 以找到用于药物结合的蛋白质结构。虽然大型语言模型 (LLM) 和检索增强生成 (RAG) 框架加速了药物发现，但将模型整合到内聚工作流中仍然具有挑战性。我们与药物发现研究人员进行了一项用户研究，以确定 LLM 和 RAG 在 Target ID 中的适用性。我们确定了两个主要发现：1) LLM 应该基于初始蛋白质和具有治疗作用的蛋白质候选物提供多个蛋白质-蛋白质相互作用 (PPI)；2) 该模型必须提供 PPI 和相关解释以更好地理解。基于这些观察，我们发现了先前 Target ID 方法中的三个局限性：1) 语义歧义，2) 缺乏可解释性，3) 检索单元短。为了解决这些问题，我们提出了 GraPPI，这是一种基于大规模知识图 (KG) 的检索-分解-求解代理管道 RAG 框架，以支持大规模 PPI 信号通路探索，通过将整个 PPI 通路的分析分解为专注于 PPI 边缘分析的子任务来理解治疗影响。
+摘要：可解释性通常对于人工智能 (AI) 的可接受实施至关重要。在医疗保健领域，这一点尤为重要，因为决策直接影响患者，并且对 AI 系统的信任至关重要。这种信任通常建立在 AI 提供的解释和诠释之上。尽管 AI 可解释性取得了重大进展，但仍然需要明确的指导方针，说明在医疗环境中何时以及在多大程度上需要解释。我们提出了一种新颖的分类系统，该系统具有四种不同的解释必要性类别，指导所需的解释级别：患者或样本（局部）级别、队列或数据集（全局）级别，或两个级别。我们引入了一个数学公式，该公式区分了这些类别，并为研究人员提供了一个实用框架，以确定医疗 AI 应用中所需的解释的必要性和深度。考虑了三个关键因素：评估协议的稳健性、专家观察的可变性以及应用程序的表示维数。从这个角度来看，我们解决了这个问题：AI 医疗应用何时需要解释，以及需要解释到何种程度？
 
-##### **Evaluating and Improving Graph to Text Generation with Large Language Models**
-2501.14497v1 by Jie He, Yijun Yang, Wanqiu Long, Deyi Xiong, Victor Gutierrez Basulto, Jeff Z. Pan
+##### **Interdisciplinary Expertise to Advance Equitable Explainable AI**
+2406.18563v1 by Chloe R. Bennett, Heather Cole-Lewis, Stephanie Farquhar, Naama Haamel, Boris Babenko, Oran Lang, Mat Fleck, Ilana Traynis, Charles Lau, Ivor Horn, Courtney Lyles
 
-Large language models (LLMs) have demonstrated immense potential across
-various tasks. However, research for exploring and improving the capabilities
-of LLMs in interpreting graph structures remains limited. To address this gap,
-we conduct a comprehensive evaluation of prompting current open-source LLMs on
-graph-to-text generation tasks. Although we explored the optimal prompting
-strategies and proposed a novel and effective diversity-difficulty-based
-few-shot sample selection method, we found that the improvements from
-tuning-free approaches were incremental, as LLMs struggle with planning on
-complex graphs, particularly those with a larger number of triplets. To further
-improve LLMs in planning with graph sequences and grounding in truth, we
-introduce a new graph-to-text dataset, PlanGTG, annotated with two sub-tasks:
-reordering and attribution. Through extensive automatic and human evaluations,
-we demonstrate significant improvements in the quality of generated text from
-both few-shot learning and fine-tuning perspectives using the PlanGTG dataset.
-Our study paves the way for new research directions in graph-to-text
-generation. PlanGTG datasets can be found in https://github.com/probe2/kg_text.
+The field of artificial intelligence (AI) is rapidly influencing health and
+healthcare, but bias and poor performance persists for populations who face
+widespread structural oppression. Previous work has clearly outlined the need
+for more rigorous attention to data representativeness and model performance to
+advance equity and reduce bias. However, there is an opportunity to also
+improve the explainability of AI by leveraging best practices of social
+epidemiology and health equity to help us develop hypotheses for associations
+found. In this paper, we focus on explainable AI (XAI) and describe a framework
+for interdisciplinary expert panel review to discuss and critically assess AI
+model explanations from multiple perspectives and identify areas of bias and
+directions for future research. We emphasize the importance of the
+interdisciplinary expert panel to produce more accurate, equitable
+interpretations which are historically and contextually informed.
+Interdisciplinary panel discussions can help reduce bias, identify potential
+confounders, and identify opportunities for additional research where there are
+gaps in the literature. In turn, these insights can suggest opportunities for
+AI model improvement.
 
-摘要：大型語言模型（LLM）已在各種任務中展現出巨大的潛力。然而，探索和提升 LLM 在詮釋圖形結構方面的能力的研究仍然有限。為了解決這個差距，我們對提示目前開源的 LLM 執行圖形轉文字生成任務進行全面評估。儘管我們探索了最佳提示策略並提出了一種新穎且有效的基於多樣性難度的少樣本選擇方法，但我們發現無調校方法的改進是漸進的，因為 LLM 難以規劃複雜的圖形，特別是那些具有較多三元組的圖形。為了進一步提升 LLM 在圖形序列規劃和真實依據方面的能力，我們引入了一個新的圖形轉文字資料集 PlanGTG，並註解了兩個子任務：重新排序和歸因。透過廣泛的自動化和人工評估，我們證明了使用 PlanGTG 資料集從少樣本學習和微調角度產生文字的品質有顯著提升。我們的研究為圖形轉文字生成中的新研究方向鋪路。PlanGTG 資料集可以在 https://github.com/probe2/kg_text 中找到。
+摘要：人工智慧 (AI) 領域正快速影響著健康與醫療保健，但對於面臨廣泛結構性壓迫的人群來說，偏見和不良表現依然存在。先前的研究已清楚說明，需要更嚴格地注意資料代表性和模型效能，以促進公平性並減少偏見。然而，我們有機會透過運用社會流行病學和健康公平的最佳實務，來改善 AI 的可解釋性，以幫助我們針對發現的關聯性，發展假設。在本文中，我們專注於可解釋 AI (XAI)，並描述一個跨領域專家小組審查架構，以從多重觀點討論和批判性評估 AI 模型的解釋，並找出偏見領域和未來研究的方向。我們強調跨領域專家小組對於產生更準確、公平的詮釋至關重要，而這些詮釋是根據歷史和脈絡而來的。跨領域小組討論有助於減少偏見、找出潛在的混淆因素，並在文獻中有缺口時找出額外研究的機會。反過來，這些見解可以建議 AI 模型改進的機會。
 
-##### **Fast Think-on-Graph: Wider, Deeper and Faster Reasoning of Large Language Model on Knowledge Graph**
-2501.14300v1 by Xujian Liang, Zhaoquan Gu
+##### **"It depends": Configuring AI to Improve Clinical Usefulness Across Contexts**
+2407.11978v1 by Hubert D. Zając, Jorge M. N. Ribeiro, Silvia Ingala, Simona Gentile, Ruth Wanjohi, Samuel N. Gitau, Jonathan F. Carlsen, Michael B. Nielsen, Tariq O. Andersen
 
-Graph Retrieval Augmented Generation (GRAG) is a novel paradigm that takes
-the naive RAG system a step further by integrating graph information, such as
-knowledge graph (KGs), into large-scale language models (LLMs) to mitigate
-hallucination. However, existing GRAG still encounter limitations: 1) simple
-paradigms usually fail with the complex problems due to the narrow and shallow
-correlations capture from KGs 2) methods of strong coupling with KGs tend to be
-high computation cost and time consuming if the graph is dense. In this paper,
-we propose the Fast Think-on-Graph (FastToG), an innovative paradigm for
-enabling LLMs to think ``community by community" within KGs. To do this,
-FastToG employs community detection for deeper correlation capture and two
-stages community pruning - coarse and fine pruning for faster retrieval.
-Furthermore, we also develop two Community-to-Text methods to convert the graph
-structure of communities into textual form for better understanding by LLMs.
-Experimental results demonstrate the effectiveness of FastToG, showcasing
-higher accuracy, faster reasoning, and better explainability compared to the
-previous works.
+Artificial Intelligence (AI) repeatedly match or outperform radiologists in
+lab experiments. However, real-world implementations of radiological AI-based
+systems are found to provide little to no clinical value. This paper explores
+how to design AI for clinical usefulness in different contexts. We conducted 19
+design sessions and design interventions with 13 radiologists from 7 clinical
+sites in Denmark and Kenya, based on three iterations of a functional AI-based
+prototype. Ten sociotechnical dependencies were identified as crucial for the
+design of AI in radiology. We conceptualised four technical dimensions that
+must be configured to the intended clinical context of use: AI functionality,
+AI medical focus, AI decision threshold, and AI Explainability. We present four
+design recommendations on how to address dependencies pertaining to the medical
+knowledge, clinic type, user expertise level, patient context, and user
+situation that condition the configuration of these technical dimensions.
 
-摘要：圖表檢索增強生成 (GRAG) 是一種新穎的範例，它透過將圖表資訊（例如知識圖表 (KG)) 整合到大型語言模型 (LLM) 中，進一步提升了樸素的 RAG 系統以減輕幻覺。然而，現有的 GRAG 仍會遇到限制：1) 簡單的範例通常會因從 KG 中擷取的關聯性狹隘且淺薄而無法解決複雜的問題 2) 如果圖表很密集，與 KG 強耦合的方法往往會導致高運算成本和耗時。在本文中，我們提出了 Fast Think-on-Graph (FastToG)，這是一種創新的範例，可讓 LLM 在 KG 中「逐個社群」進行思考。為此，FastToG 使用社群偵測來擷取更深入的關聯性，並使用兩個階段的社群修剪（粗略修剪和精細修剪）來加快檢索速度。此外，我們還開發了兩種社群到文字的方法，將社群的圖表結構轉換為文字形式，以便 LLM 更容易理解。實驗結果證明了 FastToG 的有效性，與先前的研究相比，展示出更高的準確性、更快的推理速度和更好的可解釋性。
+摘要：人工智慧（AI）在實驗室實驗中不斷地與放射科醫師匹敵或表現得更出色。然而，發現放射科 AI 為基礎系統的實際執行幾乎沒有提供臨床價值。本文探討如何為 AI 設計在不同情境中臨床上的效用。我們根據功能性 AI 為基礎原型的三次迭代，在丹麥和肯亞的 7 個臨床場域與 13 位放射科醫師進行了 19 次設計會議和設計介入。十個社會技術依賴關係被認為對於放射科中 AI 的設計至關重要。我們概念化了四個技術面向，必須根據預期的臨床使用情境進行設定：AI 功能、AI 醫療重點、AI 決策門檻，以及 AI 可解釋性。我們提出四項設計建議，說明如何處理與醫療知識、診所類型、使用者專業知識等級、患者情境，以及影響這些技術面向設定的使用者情境相關的依賴關係。
 
-##### **Top Ten Challenges Towards Agentic Neural Graph Databases**
-2501.14224v1 by Jiaxin Bai, Zihao Wang, Yukun Zhou, Hang Yin, Weizhi Fei, Qi Hu, Zheye Deng, Jiayang Cheng, Tianshi Zheng, Hong Ting Tsang, Yisen Gao, Zhongwei Xie, Yufei Li, Lixin Fan, Binhang Yuan, Wei Wang, Lei Chen, Xiaofang Zhou, Yangqiu Song
+##### **Improving Health Professionals' Onboarding with AI and XAI for Trustworthy Human-AI Collaborative Decision Making**
+2405.16424v1 by Min Hun Lee, Silvana Xin Yi Choo, Shamala D/O Thilarajah
 
-Graph databases (GDBs) like Neo4j and TigerGraph excel at handling
-interconnected data but lack advanced inference capabilities. Neural Graph
-Databases (NGDBs) address this by integrating Graph Neural Networks (GNNs) for
-predictive analysis and reasoning over incomplete or noisy data. However, NGDBs
-rely on predefined queries and lack autonomy and adaptability. This paper
-introduces Agentic Neural Graph Databases (Agentic NGDBs), which extend NGDBs
-with three core functionalities: autonomous query construction, neural query
-execution, and continuous learning. We identify ten key challenges in realizing
-Agentic NGDBs: semantic unit representation, abductive reasoning, scalable
-query execution, and integration with foundation models like large language
-models (LLMs). By addressing these challenges, Agentic NGDBs can enable
-intelligent, self-improving systems for modern data-driven applications, paving
-the way for adaptable and autonomous data management solutions.
+With advanced AI/ML, there has been growing research on explainable AI (XAI)
+and studies on how humans interact with AI and XAI for effective human-AI
+collaborative decision-making. However, we still have a lack of understanding
+of how AI systems and XAI should be first presented to users without technical
+backgrounds. In this paper, we present the findings of semi-structured
+interviews with health professionals (n=12) and students (n=4) majoring in
+medicine and health to study how to improve onboarding with AI and XAI. For the
+interviews, we built upon human-AI interaction guidelines to create onboarding
+materials of an AI system for stroke rehabilitation assessment and AI
+explanations and introduce them to the participants. Our findings reveal that
+beyond presenting traditional performance metrics on AI, participants desired
+benchmark information, the practical benefits of AI, and interaction trials to
+better contextualize AI performance, and refine the objectives and performance
+of AI. Based on these findings, we highlight directions for improving
+onboarding with AI and XAI and human-AI collaborative decision-making.
 
-摘要：圖形資料庫（GDB），例如 Neo4j 和 TigerGraph，擅長處理相互連接的資料，但缺乏進階的推論能力。神經圖形資料庫（NGDB）透過整合圖形神經網路（GNN）來解決這個問題，以進行預測分析和對不完整或有雜訊的資料進行推理。然而，NGDB 依賴於預先定義的查詢，並且缺乏自主性和適應性。本文介紹了代理神經圖形資料庫（Agentic NGDB），它以三項核心功能擴充了 NGDB：自動查詢建構、神經查詢執行和持續學習。我們找出實現 Agentic NGDB 的十大關鍵挑戰：語義單元表示、演繹推理、可擴充查詢執行，以及與基礎模型（例如大型語言模型 (LLM)）整合。透過解決這些挑戰，Agentic NGDB 可以為現代資料驅動應用打造智慧且自我改善的系統，為適應性和自主資料管理解決方案鋪路。
+摘要：隨著先進的 AI/ML，對可解釋 AI (XAI) 的研究不斷增加，以及關於人類如何與 AI 和 XAI 互動以進行有效的人工智慧協作決策制定。然而，我們仍然缺乏對 AI 系統和 XAI 應如何首先呈現給沒有技術背景的用戶的了解。在本文中，我們展示了與醫療專業人員 (n=12) 和主修醫學和健康的學生 (n=4) 進行半結構化訪談的結果，以研究如何改善 AI 和 XAI 的入門。對於訪談，我們建立在人機互動準則之上，為中風康復評估和 AI 解釋的 AI 系統創建入門材料，並將它們介紹給參與者。我們的研究結果表明，除了呈現傳統的 AI 性能指標外，參與者還希望基准信息、AI 的實際好處以及交互試驗，以更好地將 AI 性能情境化，並完善 AI 的目標和性能。根據這些發現，我們強調了改進 AI 和 XAI 以及人機協作決策制定的入門方向。
 
-##### **GraphRAG under Fire**
-2501.14050v1 by Jiacheng Liang, Yuhui Wang, Changjiang Li, Rongyi Zhu, Tanqiu Jiang, Neil Gong, Ting Wang
+##### **Exploring Nutritional Impact on Alzheimer's Mortality: An Explainable AI Approach**
+2405.17502v1 by Ziming Liu, Longjian Liu, Robert E. Heidel, Xiaopeng Zhao
 
-GraphRAG advances retrieval-augmented generation (RAG) by structuring
-external knowledge as multi-scale knowledge graphs, enabling language models to
-integrate both broad context and granular details in their reasoning. While
-GraphRAG has demonstrated success across domains, its security implications
-remain largely unexplored. To bridge this gap, this work examines GraphRAG's
-vulnerability to poisoning attacks, uncovering an intriguing security paradox:
-compared to conventional RAG, GraphRAG's graph-based indexing and retrieval
-enhance resilience against simple poisoning attacks; meanwhile, the same
-features also create new attack surfaces. We present GRAGPoison, a novel attack
-that exploits shared relations in the knowledge graph to craft poisoning text
-capable of compromising multiple queries simultaneously. GRAGPoison employs
-three key strategies: i) relation injection to introduce false knowledge, ii)
-relation enhancement to amplify poisoning influence, and iii) narrative
-generation to embed malicious content within coherent text. Empirical
-evaluation across diverse datasets and models shows that GRAGPoison
-substantially outperforms existing attacks in terms of effectiveness (up to 98%
-success rate) and scalability (using less than 68% poisoning text). We also
-explore potential defensive measures and their limitations, identifying
-promising directions for future research.
+This article uses machine learning (ML) and explainable artificial
+intelligence (XAI) techniques to investigate the relationship between
+nutritional status and mortality rates associated with Alzheimers disease (AD).
+The Third National Health and Nutrition Examination Survey (NHANES III)
+database is employed for analysis. The random forest model is selected as the
+base model for XAI analysis, and the Shapley Additive Explanations (SHAP)
+method is used to assess feature importance. The results highlight significant
+nutritional factors such as serum vitamin B12 and glycated hemoglobin. The
+study demonstrates the effectiveness of random forests in predicting AD
+mortality compared to other diseases. This research provides insights into the
+impact of nutrition on AD and contributes to a deeper understanding of disease
+progression.
 
-摘要：GraphRAG 透過將外部知識結構化為多尺度知識圖譜，推動了檢索增強生成 (RAG)，使語言模型能夠在其推理中整合廣泛的背景和細微的細節。儘管 GraphRAG 在各個領域都已展現出成功，但其安全性影響在很大程度上仍未被探索。為了彌補這一差距，本研究探討了 GraphRAG 對投毒攻擊的脆弱性，揭示了一個有趣的安全悖論：與傳統的 RAG 相比，GraphRAG 基於圖表的索引和檢索增強了對簡單投毒攻擊的韌性；同時，相同的特徵也創造了新的攻擊面。我們提出了 GRAGPoison，這是一種新穎的攻擊，它利用知識圖譜中的共享關係來製作中毒文本，能夠同時危害多個查詢。GRAGPoison 採用了三項關鍵策略：i) 關係注入以引入錯誤的知識，ii) 關係增強以擴大投毒影響，以及 iii) 敘事生成以將惡意內容嵌入連貫的文本中。在各種數據集和模型上的經驗評估表明，GRAGPoison 在有效性（成功率高達 98%）和可擴展性（使用不到 68% 的投毒文本）方面都明顯優於現有的攻擊。我們還探討了潛在的防禦措施及其局限性，確定了未來研究的有希望的方向。
+摘要：本文使用機器學習 (ML) 和可解釋人工智慧 (XAI) 技術來探討營養狀況與阿茲海默症 (AD) 相關的死亡率之間的關係。採用第三次全國健康與營養檢查調查 (NHANES III) 資料庫進行分析。選擇隨機森林模型作為 XAI 分析的基礎模型，並使用 Shapley Additive Explanations (SHAP) 方法來評估特徵重要性。結果突顯了重要的營養因素，例如血清維生素 B12 和糖化血紅蛋白。該研究證明了隨機森林在預測 AD 死亡率方面相較於其他疾病的有效性。本研究提供了營養對 AD 的影響的見解，並有助於更深入地了解疾病的進展。
 
-##### **EICopilot: Search and Explore Enterprise Information over Large-scale Knowledge Graphs with LLM-driven Agents**
-2501.13746v1 by Yuhui Yun, Huilong Ye, Xinru Li, Ruojia Li, Jingfeng Deng, Li Li, Haoyi Xiong
+##### **Explainable AI Enhances Glaucoma Referrals, Yet the Human-AI Team Still Falls Short of the AI Alone**
+2407.11974v1 by Catalina Gomez, Ruolin Wang, Katharina Breininger, Corinne Casey, Chris Bradley, Mitchell Pavlak, Alex Pham, Jithin Yohannan, Mathias Unberath
 
-The paper introduces EICopilot, an novel agent-based solution enhancing
-search and exploration of enterprise registration data within extensive online
-knowledge graphs like those detailing legal entities, registered capital, and
-major shareholders. Traditional methods necessitate text-based queries and
-manual subgraph explorations, often resulting in time-consuming processes.
-EICopilot, deployed as a chatbot via Baidu Enterprise Search, improves this
-landscape by utilizing Large Language Models (LLMs) to interpret natural
-language queries. This solution automatically generates and executes Gremlin
-scripts, providing efficient summaries of complex enterprise relationships.
-Distinct feature a data pre-processing pipeline that compiles and annotates
-representative queries into a vector database of examples for In-context
-learning (ICL), a comprehensive reasoning pipeline combining Chain-of-Thought
-with ICL to enhance Gremlin script generation for knowledge graph search and
-exploration, and a novel query masking strategy that improves intent
-recognition for heightened script accuracy. Empirical evaluations demonstrate
-the superior performance of EICopilot, including speed and accuracy, over
-baseline methods, with the \emph{Full Mask} variant achieving a syntax error
-rate reduction to as low as 10.00% and an execution correctness of up to
-82.14%. These components collectively contribute to superior querying
-capabilities and summarization of intricate datasets, positioning EICopilot as
-a groundbreaking tool in the exploration and exploitation of large-scale
-knowledge graphs for enterprise information search.
+Primary care providers are vital for initial triage and referrals to
+specialty care. In glaucoma, asymptomatic and fast progression can lead to
+vision loss, necessitating timely referrals to specialists. However, primary
+eye care providers may not identify urgent cases, potentially delaying care.
+Artificial Intelligence (AI) offering explanations could enhance their referral
+decisions. We investigate how various AI explanations help providers
+distinguish between patients needing immediate or non-urgent specialist
+referrals. We built explainable AI algorithms to predict glaucoma surgery needs
+from routine eyecare data as a proxy for identifying high-risk patients. We
+incorporated intrinsic and post-hoc explainability and conducted an online
+study with optometrists to assess human-AI team performance, measuring referral
+accuracy and analyzing interactions with AI, including agreement rates, task
+time, and user experience perceptions. AI support enhanced referral accuracy
+among 87 participants (59.9%/50.8% with/without AI), though Human-AI teams
+underperformed compared to AI alone. Participants believed they included AI
+advice more when using the intrinsic model, and perceived it more useful and
+promising. Without explanations, deviations from AI recommendations increased.
+AI support did not increase workload, confidence, and trust, but reduced
+challenges. On a separate test set, our black-box and intrinsic models achieved
+an accuracy of 77% and 71%, respectively, in predicting surgical outcomes. We
+identify opportunities of human-AI teaming for glaucoma management in primary
+eye care, noting that while AI enhances referral accuracy, it also shows a
+performance gap compared to AI alone, even with explanations. Human involvement
+remains essential in medical decision making, underscoring the need for future
+research to optimize collaboration, ensuring positive experiences and safe AI
+use.
 
-摘要：本文介紹了 EICopilot，這是一種基於代理的新型解決方案，可增強在廣泛的線上知識圖譜中搜尋和探索企業註冊資料，例如詳細說明法律實體、註冊資本和主要股東的資料。傳統方法需要基於文字的查詢和手動子圖探索，通常會導致耗時的流程。EICopilot 部署為百度企業搜尋的聊天機器人，透過利用大型語言模型 (LLM) 來詮釋自然語言查詢，進而改善這項技術。此解決方案會自動產生並執行 Gremlin 腳本，提供複雜企業關係的有效摘要。其獨特功能為資料前處理管線，可將具代表性的查詢編譯並註解到範例的向量資料庫中，以進行脈絡中學習 (ICL)，這是一個結合了思考鏈與 ICL 的綜合推理管線，用於增強 Gremlin 腳本產生，以進行知識圖譜搜尋和探索，以及一種新穎的查詢遮罩策略，可改善意圖辨識，進而提高腳本準確度。實證評估顯示，EICopilot 的效能優於基線方法，包括速度和準確度，其中「完整遮罩」變體將語法錯誤率降低至低於 10.00%，執行正確率高達 82.14%。這些元件共同促成了優異的查詢功能和複雜資料集的摘要，將 EICopilot 定位為探索和利用大規模知識圖譜進行企業資訊搜尋的創新工具。
+摘要：<paragraph>初級保健提供者對於最初的分流和轉診到專科照護至關重要。在青光眼的情況下，無症狀且快速惡化可能導致視力喪失，因此需要及時轉診給專家。然而，初級眼科保健提供者可能無法識別緊急情況，可能會延誤照護。提供解釋的人工智慧 (AI) 可以加強他們的轉診決策。我們研究各種 AI 解釋如何幫助提供者區分需要立即或非緊急專科轉診的患者。我們建立了解釋性 AI 演算法，以從例行眼科護理資料預測青光眼手術需求，作為識別高風險患者的代理。我們納入了內在和事後解釋性，並與驗光師進行了一項線上研究，以評估人機團隊的表現，衡量轉診準確度並分析與 AI 的互動，包括同意率、任務時間和使用者體驗感知。在 87 名參與者中，AI 支援提高了轉診準確度（使用 AI/未使用的比例為 59.9%/50.8%），儘管人機團隊的表現不如單獨使用 AI。參與者認為他們在使用內在模型時更多地納入了 AI 建議，並認為它更有用且更有希望。沒有解釋，AI 建議的偏差會增加。AI 支援並未增加工作量、信心和信任，但減少了挑戰。在一個單獨的測試集中，我們的黑盒子和內在模型在預測手術結果方面分別達到了 77% 和 71% 的準確度。我們找出在初級眼科保健中，人機團隊合作管理青光眼的機會，並注意到雖然 AI 提高了轉診準確度，但即使有解釋，它也顯示出與單獨使用 AI 相比的效能差距。人類參與在醫療決策中仍然至關重要，這強調了未來研究優化協作、確保正面經驗和安全使用 AI 的必要性。</paragraph>
 
-##### **Pseudocode-Injection Magic: Enabling LLMs to Tackle Graph Computational Tasks**
-2501.13731v1 by Chang Gong, Wanrui Bian, Zhijie Zhang, Weiguo Zheng
+##### **Decoding Decision Reasoning: A Counterfactual-Powered Model for Knowledge Discovery**
+2406.18552v1 by Yingying Fang, Zihao Jin, Xiaodan Xing, Simon Walsh, Guang Yang
 
-Graph computational tasks are inherently challenging and often demand the
-development of advanced algorithms for effective solutions. With the emergence
-of large language models (LLMs), researchers have begun investigating their
-potential to address these tasks. However, existing approaches are constrained
-by LLMs' limited capability to comprehend complex graph structures and their
-high inference costs, rendering them impractical for handling large-scale
-graphs. Inspired by human approaches to graph problems, we introduce a novel
-framework, PIE (Pseudocode-Injection-Enhanced LLM Reasoning for Graph
-Computational Tasks), which consists of three key steps: problem understanding,
-prompt design, and code generation. In this framework, LLMs are tasked with
-understanding the problem and extracting relevant information to generate
-correct code. The responsibility for analyzing the graph structure and
-executing the code is delegated to the interpreter. We inject task-related
-pseudocodes into the prompts to further assist the LLMs in generating efficient
-code. We also employ cost-effective trial-and-error techniques to ensure that
-the LLM-generated code executes correctly. Unlike other methods that require
-invoking LLMs for each individual test case, PIE only calls the LLM during the
-code generation phase, allowing the generated code to be reused and
-significantly reducing inference costs. Extensive experiments demonstrate that
-PIE outperforms existing baselines in terms of both accuracy and computational
-efficiency.
+In medical imaging, particularly in early disease detection and prognosis
+tasks, discerning the rationale behind an AI model's predictions is crucial for
+evaluating the reliability of its decisions. Conventional explanation methods
+face challenges in identifying discernible decisive features in medical image
+classifications, where discriminative features are subtle or not immediately
+apparent. To bridge this gap, we propose an explainable model that is equipped
+with both decision reasoning and feature identification capabilities. Our
+approach not only detects influential image patterns but also uncovers the
+decisive features that drive the model's final predictions. By implementing our
+method, we can efficiently identify and visualise class-specific features
+leveraged by the data-driven model, providing insights into the decision-making
+processes of deep learning models. We validated our model in the demanding
+realm of medical prognosis task, demonstrating its efficacy and potential in
+enhancing the reliability of AI in healthcare and in discovering new knowledge
+in diseases where prognostic understanding is limited.
+
+摘要：在醫學影像中，特別是在早期疾病檢測和預後任務中，辨別 AI 模型預測背後的原理對於評估其決策的可靠性至關重要。傳統的解釋方法在識別醫學影像分類中可識別的決定性特徵時面臨挑戰，其中區別性特徵很微妙或並不明顯。為了彌合這一差距，我們提出了一個可解釋的模型，該模型具備決策推理和特徵識別能力。我們的做法不僅檢測有影響力的影像模式，還揭示了推動模型最終預測的決定性特徵。通過實施我們的模型，我們可以有效識別和視覺化由數據驅動模型利用的類特定特徵，從而深入了解深度學習模型的決策過程。我們在要求嚴格的醫學預後任務領域驗證了我們的模型，展示了其在提高 AI 在醫療保健中的可靠性和發現預後理解受限疾病的新知識方面的功效和潛力。
+
+##### **The Role of Emotions in Informational Support Question-Response Pairs in Online Health Communities: A Multimodal Deep Learning Approach**
+2405.13099v1 by Mohsen Jozani, Jason A. Williams, Ahmed Aleroud, Sarbottam Bhagat
+
+This study explores the relationship between informational support seeking
+questions, responses, and helpfulness ratings in online health communities. We
+created a labeled data set of question-response pairs and developed multimodal
+machine learning and deep learning models to reliably predict informational
+support questions and responses. We employed explainable AI to reveal the
+emotions embedded in informational support exchanges, demonstrating the
+importance of emotion in providing informational support. This complex
+interplay between emotional and informational support has not been previously
+researched. The study refines social support theory and lays the groundwork for
+the development of user decision aids. Further implications are discussed.
 
-摘要：圖表計算任務本質上具有挑戰性，而且通常需要開發先進的演算法才能有效解決。隨著大型語言模型 (LLM) 的出現，研究人員已開始探討其解決這些任務的可能性。然而，現有方法受到 LLM 理解複雜圖形結構的能力有限以及其高推理成本的限制，這使得它們不切實際地處理大規模圖形。受到人類解決圖形問題的方法啟發，我們引入了 PIE（偽代碼注入增強 LLM 圖形計算任務推理）這個新框架，它包含三個關鍵步驟：問題理解、提示設計和代碼生成。在此框架中，LLM 的任務是理解問題並擷取相關資訊以產生正確的代碼。分析圖形結構和執行代碼的責任委派給解釋器。我們將與任務相關的偽代碼注入提示中，以進一步協助 LLM 產生有效的代碼。我們還採用具有成本效益的試錯技術，以確保 LLM 生成的代碼正確執行。與需要為每個個別測試案例呼叫 LLM 的其他方法不同，PIE 僅在代碼產生階段呼叫 LLM，允許重複使用產生的代碼並大幅降低推理成本。大量的實驗證明，PIE 在準確性和計算效率方面都優於現有的基準。
+摘要：本研究探討線上健康社群中尋求資訊支持的問題、回應，以及有幫助的評分之間的關係。我們建立了一組標記的問答配對資料集，並開發了多模態機器學習和深度學習模型，以可靠地預測資訊支持問題和回應。我們採用可解釋的 AI 來揭示資訊支持交流中蘊含的情緒，證明情緒在提供資訊支持中的重要性。這種情緒支持和資訊支持之間的複雜交互作用以前並未被研究過。本研究改進了社會支持理論，並為使用者決策輔助工具的開發奠定了基礎。討論了進一步的影響。
 
-##### **CAPRAG: A Large Language Model Solution for Customer Service and Automatic Reporting using Vector and Graph Retrieval-Augmented Generation**
-2501.13993v1 by Hamza Landolsi, Kais Letaief, Nizar Taghouti, Ines Abdeljaoued-Tej
+##### **ChatGPT in Classrooms: Transforming Challenges into Opportunities in Education**
+2405.10645v1 by Harris Bin Munawar, Nikolaos Misirlis
 
-The introduction of new features and services in the banking sector often
-overwhelms customers, creating an opportunity for banks to enhance user
-experience through financial chatbots powered by large language models (LLMs).
-We initiated an AI agent designed to provide customers with relevant
-information about banking services and insights from annual reports. We
-proposed a hybrid Customer Analysis Pipeline Retrieval-Augmented Generation
-(CAPRAG) that effectively addresses both relationship-based and contextual
-queries, thereby improving customer engagement in the digital banking
-landscape. To implement this, we developed a processing pipeline to refine text
-data, which we utilized in two main frameworks: Vector RAG and Graph RAG. This
-dual approach enables us to populate both vector and graph databases with
-processed data for efficient retrieval. The Cypher query component is employed
-to effectively query the graph database. When a user submits a query, it is
-first expanded by a query expansion module before being routed to construct a
-final query from the hybrid Knowledge Base (KB). This final query is then sent
-to an open-source LLM for response generation. Overall, our innovative,
-designed to international banks, serves bank's customers in an increasingly
-complex digital environment, enhancing clarity and accessibility of
-information.
+In the era of exponential technology growth, one unexpected guest has claimed
+a seat in classrooms worldwide, Artificial Intelligence. Generative AI, such as
+ChatGPT, promises a revolution in education, yet it arrives with a double-edged
+sword. Its potential for personalized learning is offset by issues of cheating,
+inaccuracies, and educators struggling to incorporate it effectively into their
+lesson design. We are standing on the brink of this educational frontier, and
+it is clear that we need to navigate this terrain with a lot of care. This is a
+major challenge that could undermine the integrity and value of our educational
+process. So, how can we turn these challenges into opportunities? When used
+inappropriately, AI tools can become the perfect tool for the cut copy paste
+mentality, and quickly begin to corrode critical thinking, creativity, and deep
+understanding, the most important skills in our rapidly changing world.
+Teachers feel that they are not equipped to leverage this technology, widening
+the digital divide among educators and institutions. Addressing these concerns
+calls for an in depth research approach. We will employ empirical research,
+drawing on the Technology Acceptance Model, to assess the attitudes toward
+generative AI among educators and students. Understanding their perceptions,
+usage patterns, and hurdles is the first crucial step in creating an effective
+solution. The present study will be used as a process manual for future
+researchers to apply, running their own data, based on the steps explained here
 
-摘要：銀行業中新功能和服務的推出經常讓客戶感到不知所措，這為銀行透過大型語言模型 (LLM) 驅動的金融聊天機器人來提升使用者體驗創造了機會。我們啟動了一個人工智慧代理，旨在為客戶提供有關銀行服務和年度報告見解的相關資訊。我們提出了一個混合式客戶分析管道檢索擴充生成 (CAPRAG)，它有效地處理基於關係和情境式的查詢，從而提升數位銀行環境中的客戶參與度。為了實作這一點，我們開發了一個處理管道來精煉文字資料，我們在兩個主要架構中使用它：Vector RAG 和 Graph RAG。這種雙管齊下的方法讓我們能夠使用處理過的資料來填補向量和圖形資料庫，以利於有效檢索。Cypher 查詢元件用於有效查詢圖形資料庫。當使用者提交查詢時，它會先由查詢擴充模組擴充，然後再路由到混合式知識庫 (KB) 中建構最終查詢。然後這個最終查詢會傳送給開源 LLM 以產生回應。整體而言，我們創新的設計服務於國際銀行，在日益複雜的數位環境中服務銀行客戶，提升資訊的清晰度和可及性。
+摘要：在科技飛速發展的時代，一位意外的訪客已在全球教室中佔有一席之地，那就是人工智慧。生成式 AI，例如 ChatGPT，承諾在教育領域掀起一場革命，但它卻是一把雙面刃。它在個人化學習方面的潛力，卻因作弊、不準確以及教育工作者難以將其有效融入教學設計等問題而抵銷。我們正站在這教育前沿的邊緣，顯然我們需要非常小心地探索這片領域。這是一個重大的挑戰，可能會損害我們教育過程的完整性和價值。那麼，我們如何將這些挑戰轉化為機遇？當不適當地使用時，AI 工具可能會成為複製貼上心態的完美工具，並迅速腐蝕批判性思維、創造力和深入理解，這些都是我們快速變化的世界中最重要的技能。教師們覺得他們沒有能力利用這項技術，這擴大了教育工作者和機構之間的數位鴻溝。解決這些問題需要深入的研究方法。我們將採用實證研究，借鑑技術接受模型，來評估教育工作者和學生對生成式 AI 的態度。了解他們的看法、使用模式和障礙是創造有效解決方案的第一個關鍵步驟。本研究將作為未來研究人員應用的流程手冊，根據此處說明的步驟運行他們自己的數據
 
-##### **Dual-Branch HNSW Approach with Skip Bridges and LID-Driven Optimization**
-2501.13992v1 by Hy Nguyen, Nguyen Hung Nguyen, Nguyen Linh Bao Nguyen, Srikanth Thudumu, Hung Du, Rajesh Vasa, Kon Mouzakis
+##### **Evaluating the Explainable AI Method Grad-CAM for Breath Classification on Newborn Time Series Data**
+2405.07590v1 by Camelia Oprea, Mike Grüne, Mateusz Buglowski, Lena Olivier, Thorsten Orlikowsky, Stefan Kowalewski, Mark Schoberer, André Stollenwerk
 
-The Hierarchical Navigable Small World (HNSW) algorithm is widely used for
-approximate nearest neighbor (ANN) search, leveraging the principles of
-navigable small-world graphs. However, it faces some limitations. The first is
-the local optima problem, which arises from the algorithm's greedy search
-strategy, selecting neighbors based solely on proximity at each step. This
-often leads to cluster disconnections. The second limitation is that HNSW
-frequently fails to achieve logarithmic complexity, particularly in
-high-dimensional datasets, due to the exhaustive traversal through each layer.
-To address these limitations, we propose a novel algorithm that mitigates local
-optima and cluster disconnections while enhancing the construction speed,
-maintaining inference speed. The first component is a dual-branch HNSW
-structure with LID-based insertion mechanisms, enabling traversal from multiple
-directions. This improves outlier node capture, enhances cluster connectivity,
-accelerates construction speed and reduces the risk of local minima. The second
-component incorporates a bridge-building technique that bypasses redundant
-intermediate layers, maintaining inference and making up the additional
-computational overhead introduced by the dual-branch structure. Experiments on
-various benchmarks and datasets showed that our algorithm outperforms the
-original HNSW in both accuracy and speed. We evaluated six datasets across
-Computer Vision (CV), and Natural Language Processing (NLP), showing recall
-improvements of 18\% in NLP, and up to 30\% in CV tasks while reducing the
-construction time by up to 20\% and maintaining the inference speed. We did not
-observe any trade-offs in our algorithm. Ablation studies revealed that
-LID-based insertion had the greatest impact on performance, followed by the
-dual-branch structure and bridge-building components.
+With the digitalization of health care systems, artificial intelligence
+becomes more present in medicine. Especially machine learning shows great
+potential for complex tasks such as time series classification, usually at the
+cost of transparency and comprehensibility. This leads to a lack of trust by
+humans and thus hinders its active usage. Explainable artificial intelligence
+tries to close this gap by providing insight into the decision-making process,
+the actual usefulness of its different methods is however unclear. This paper
+proposes a user study based evaluation of the explanation method Grad-CAM with
+application to a neural network for the classification of breaths in time
+series neonatal ventilation data. We present the perceived usefulness of the
+explainability method by different stakeholders, exposing the difficulty to
+achieve actual transparency and the wish for more in-depth explanations by many
+of the participants.
 
-摘要：分層可導航小世界 (HNSW) 演算法廣泛用於近似最近鄰居 (ANN) 搜尋，並利用可導航小世界圖形的原理。然而，它面臨一些限制。第一個是局部最佳化問題，這源自於演算法的貪婪搜尋策略，在每個步驟中僅根據鄰近度來選擇鄰居。這通常會導致群集斷線。第二個限制是，由於透過每一層的窮舉式遍歷，HNSW 常常無法在高維度資料集中達成對數複雜度。為了解決這些限制，我們提出了一種新的演算法，它可以減輕局部最佳化和群集斷線，同時提高建構速度，並維持推論速度。第一個組成部分是一個具有基於 LID 的插入機制的雙分支 HNSW 結構，它能從多個方向進行遍歷。這改善了異常值節點的擷取，增強了群集連通性，加速了建構速度，並降低了局部最小值的風險。第二個組成部分包含一種橋樑建構技術，它繞過了多餘的中間層，維持推論並彌補了雙分支結構所帶來的額外運算負擔。在各種基準和資料集上的實驗顯示，我們的演算法在準確度和速度上都優於原始的 HNSW。我們評估了電腦視覺 (CV) 和自然語言處理 (NLP) 中的六個資料集，顯示 NLP 中的召回率提高了 18%，CV 任務中提高了 30%，同時將建構時間縮短了 20%，並維持了推論速度。我們沒有在我們的演算法中觀察到任何取捨。消融研究顯示，基於 LID 的插入對效能的影響最大，其次是雙分支結構和橋樑建構組成部分。
+摘要：隨著醫療保健系統的數位化，人工智慧在醫學領域中變得更加普及。特別是機器學習在時間序列分類等複雜任務中展現出極大的潛力，但通常是以透明度和可理解性為代價。這導致人類缺乏信任，從而阻礙了其積極使用。可解釋的人工智慧試圖通過提供對決策過程的洞察來彌補這一差距，但其不同方法的實際效用尚不清楚。本文提出了一個基於使用者研究的評估，其中包含了 Grad-CAM 解釋方法，並將其應用於神經網路以分類時間序列新生兒呼吸數據中的呼吸。我們展示了不同利益相關者對可解釋性方法的感知效用，揭示了實現實際透明度的難度，以及許多參與者希望獲得更深入的解釋。
 
-##### **Comprehensive Modeling and Question Answering of Cancer Clinical Practice Guidelines using LLMs**
-2501.13984v1 by Bhumika Gupta, Pralaypati Ta, Keerthi Ram, Mohanasankar Sivaprakasam
+##### **XAI4LLM. Let Machine Learning Models and LLMs Collaborate for Enhanced In-Context Learning in Healthcare**
+2405.06270v3 by Fatemeh Nazary, Yashar Deldjoo, Tommaso Di Noia, Eugenio di Sciascio
 
-The updated recommendations on diagnostic procedures and treatment pathways
-for a medical condition are documented as graphical flows in Clinical Practice
-Guidelines (CPGs). For effective use of the CPGs in helping medical
-professionals in the treatment decision process, it is necessary to fully
-capture the guideline knowledge, particularly the contexts and their
-relationships in the graph. While several existing works have utilized these
-guidelines to create rule bases for Clinical Decision Support Systems, limited
-work has been done toward directly capturing the full medical knowledge
-contained in CPGs. This work proposes an approach to create a contextually
-enriched, faithful digital representation of National Comprehensive Cancer
-Network (NCCN) Cancer CPGs in the form of graphs using automated extraction and
-node & relationship classification. We also implement semantic enrichment of
-the model by using Large Language Models (LLMs) for node classification,
-achieving an accuracy of 80.86% and 88.47% with zero-shot learning and few-shot
-learning, respectively. Additionally, we introduce a methodology for answering
-natural language questions with constraints to guideline text by leveraging
-LLMs to extract the relevant subgraph from the guideline knowledge base. By
-generating natural language answers based on subgraph paths and semantic
-information, we mitigate the risk of incorrect answers and hallucination
-associated with LLMs, ensuring factual accuracy in medical domain Question
-Answering.
+The integration of Large Language Models (LLMs) into healthcare diagnostics
+offers a promising avenue for clinical decision-making. This study outlines the
+development of a novel method for zero-shot/few-shot in-context learning (ICL)
+by integrating medical domain knowledge using a multi-layered structured
+prompt. We also explore the efficacy of two communication styles between the
+user and LLMs: the Numerical Conversational (NC) style, which processes data
+incrementally, and the Natural Language Single-Turn (NL-ST) style, which
+employs long narrative prompts.
+  Our study systematically evaluates the diagnostic accuracy and risk factors,
+including gender bias and false negative rates, using a dataset of 920 patient
+records in various few-shot scenarios. Results indicate that traditional
+clinical machine learning (ML) models generally outperform LLMs in zero-shot
+and few-shot settings. However, the performance gap narrows significantly when
+employing few-shot examples alongside effective explainable AI (XAI) methods as
+sources of domain knowledge. Moreover, with sufficient time and an increased
+number of examples, the conversational style (NC) nearly matches the
+performance of ML models. Most notably, LLMs demonstrate comparable or superior
+cost-sensitive accuracy relative to ML models.
+  This research confirms that, with appropriate domain knowledge and tailored
+communication strategies, LLMs can significantly enhance diagnostic processes.
+The findings highlight the importance of optimizing the number of training
+examples and communication styles to improve accuracy and reduce biases in LLM
+applications.
 
-摘要：已更新的醫療狀況診斷程序和治療途徑建議，以臨床實務指南 (CPG) 中的圖形流程記錄。為了有效使用 CPG 協助醫療專業人員進行治療決策，必須完整擷取指南知識，特別是圖表中的脈絡及其關係。雖然現有許多研究已利用這些指南為臨床決策支援系統建立規則基礎，但直接擷取 CPG 中包含的完整醫療知識的工作卻有限。這項研究提出了一種方法，以自動化擷取和節點與關係分類的方式，建立脈絡豐富、忠實的國家綜合癌症網路 (NCCN) 癌症 CPG 圖形數位表示。我們也透過使用大型語言模型 (LLM) 進行節點分類，實作模型的語意豐富化，分別在零次學習和少次學習中達到 80.86% 和 88.47% 的準確度。此外，我們引進了一種方法，透過運用 LLM 從指南知識庫中擷取相關子圖，來回答具有指南文字限制的自然語言問題。透過根據子圖路徑和語意資訊產生自然語言答案，我們降低了與 LLM 相關的錯誤答案和幻覺風險，確保了醫療領域問題解答中的事實準確性。
+摘要：大型語言模型 (LLM) 與醫療診斷整合
+為臨床決策提供了一個有前景的途徑。本研究概述了一種新穎方法的開發，用於零次學習/少量學習情境學習 (ICL)，方法是使用多層結構化提示整合醫療領域知識。我們還探討了使用者與 LLM 之間兩種溝通方式的功效：數值對話 (NC) 方式，它會逐步處理資料，以及自然語言單回合 (NL-ST) 方式，它會使用長篇敘事提示。
+我們的研究系統性地評估了診斷準確性和風險因子，包括性別偏見和假陰性率，使用了一個包含 920 個患者記錄的資料集，採用各種少量學習情境。結果表明，傳統的臨床機器學習 (ML) 模型通常在零次學習和少量學習設定中表現優於 LLM。然而，當使用少量學習範例以及有效的可解釋 AI (XAI) 方法作為領域知識來源時，效能差距會顯著縮小。此外，隨著時間充足和範例數量增加，對話方式 (NC) 幾乎可以媲美 ML 模型的效能。最值得注意的是，LLM 相對於 ML 模型展現出相當或更佳的成本敏感準確度。
+本研究證實，透過適當的領域知識和量身打造的溝通策略，LLM 可以顯著增強診斷程序。這些發現突顯了最佳化訓練範例數量和溝通方式的重要性，以提高準確度並減少 LLM 應用中的偏差。
 
-##### **LLM-Assisted Knowledge Graph Completion for Curriculum and Domain Modelling in Personalized Higher Education Recommendations**
-2501.12300v1 by Hasan Abu-Rasheed, Constance Jumbo, Rashed Al Amin, Christian Weber, Veit Wiese, Roman Obermaisser, Madjid Fathi
+##### **To Trust or Not to Trust: Towards a novel approach to measure trust for XAI systems**
+2405.05766v1 by Miquel Miró-Nicolau, Gabriel Moyà-Alcover, Antoni Jaume-i-Capó, Manuel González-Hidalgo, Maria Gemma Sempere Campello, Juan Antonio Palmer Sancho
 
-While learning personalization offers great potential for learners, modern
-practices in higher education require a deeper consideration of domain models
-and learning contexts, to develop effective personalization algorithms. This
-paper introduces an innovative approach to higher education curriculum
-modelling that utilizes large language models (LLMs) for knowledge graph (KG)
-completion, with the goal of creating personalized learning-path
-recommendations. Our research focuses on modelling university subjects and
-linking their topics to corresponding domain models, enabling the integration
-of learning modules from different faculties and institutions in the student's
-learning path. Central to our approach is a collaborative process, where LLMs
-assist human experts in extracting high-quality, fine-grained topics from
-lecture materials. We develop a domain, curriculum, and user models for
-university modules and stakeholders. We implement this model to create the KG
-from two study modules: Embedded Systems and Development of Embedded Systems
-Using FPGA. The resulting KG structures the curriculum and links it to the
-domain models. We evaluate our approach through qualitative expert feedback and
-quantitative graph quality metrics. Domain experts validated the relevance and
-accuracy of the model, while the graph quality metrics measured the structural
-properties of our KG. Our results show that the LLM-assisted graph completion
-approach enhances the ability to connect related courses across disciplines to
-personalize the learning experience. Expert feedback also showed high
-acceptance of the proposed collaborative approach for concept extraction and
-classification.
+The increasing reliance on Deep Learning models, combined with their inherent
+lack of transparency, has spurred the development of a novel field of study
+known as eXplainable AI (XAI) methods. These methods seek to enhance the trust
+of end-users in automated systems by providing insights into the rationale
+behind their decisions. This paper presents a novel approach for measuring user
+trust in XAI systems, allowing their refinement. Our proposed metric combines
+both performance metrics and trust indicators from an objective perspective. To
+validate this novel methodology, we conducted a case study in a realistic
+medical scenario: the usage of XAI system for the detection of pneumonia from
+x-ray images.
 
-摘要：<paragraph>在學習個人化提供學習者巨大潛力的同時，高等教育中的現代實務需要更深入地考慮領域模型和學習情境，以開發有效的個人化演算法。本文介紹了一種創新的高等教育課程建模方法，該方法利用大型語言模型 (LLM) 來完成知識圖譜 (KG)，目的是建立個人化的學習路徑建議。我們的研究重點在於建模大學科目，並將它們的主題連結到對應的領域模型，從而能夠將來自不同院系和機構的學習模組整合到學生的學習路徑中。我們的做法核心是一個協作流程，其中 LLM 協助人類專家從講義材料中萃取高品質、細緻的主題。我們為大學模組和利害關係人開發了領域、課程和使用者模型。我們實作這個模型，從兩個研究模組建立 KG：嵌入式系統和使用 FPGA 的嵌入式系統開發。產生的 KG 建構了課程並將其連結到領域模型。我們透過定性專家回饋和定量圖形品質指標來評估我們的做法。領域專家驗證了模型的相關性和準確性，而圖形品質指標則測量了我們 KG 的結構特性。我們的結果顯示，LLM 輔助的圖形完成方法增強了跨學科連結相關課程的能力，以個人化學習體驗。專家回饋也顯示高度接受所提出的協作方法，用於概念萃取和分類。</paragraph>
+摘要：隨著對深度學習模型依賴性的增加，加上其固有的透明度不足，促使一個新的研究領域發展，稱為可解釋 AI (XAI) 方法。這些方法旨在透過深入了解決策背後的原理，來提升最終使用者對自動化系統的信賴。本文提出了一種衡量使用者對 XAI 系統信賴度的新穎方法，允許對其進行改進。我們提出的指標結合了客觀觀點下的效能指標和信賴指標。為了驗證這個新穎的方法，我們在一個真實的醫療場景中進行了一個案例研究：使用 XAI 系統從 X 光影像中偵測肺炎。
 
-##### **Divide-Then-Aggregate: An Efficient Tool Learning Method via Parallel Tool Invocation**
-2501.12432v1 by Dongsheng Zhu, Weixian Shi, Zhengliang Shi, Zhaochun Ren, Shuaiqiang Wang, Lingyong Yan, Dawei Yin
+##### **Region-specific Risk Quantification for Interpretable Prognosis of COVID-19**
+2405.02815v1 by Zhusi Zhong, Jie Li, Zhuoqi Ma, Scott Collins, Harrison Bai, Paul Zhang, Terrance Healey, Xinbo Gao, Michael K. Atalay, Zhicheng Jiao
 
-Although current Large Language Models (LLMs) exhibit impressive
-capabilities, performing complex real-world tasks still requires tool learning.
-Mainstream methods, such as CoT/ReAct, rely on step-by-step tool invocation to
-interact with external environments, but they are limited in perceptual scope
-and lack adequate task-planning capability. To address these limitations, other
-studies introduce the first Search-based Decision Tree (DFSDT), which still
-suffers from the high computational cost. In this paper, we introduce a novel
-parallel tool invocation paradigm, DTA-Llama (Divide-Then-Aggregate Llama).
-First, we transform traditional tree-based tool search paths into Directed
-Acyclic Graph (DAG) structure, generating a high-quality parallel tool
-invocation dataset. The DTA-Llama is then trained on the dataset to learn to
-iteratively divide the current task into several parallel tool invocation
-sub-tasks and aggregate the invocation results to decide the next actions.
-Furthermore, we introduce an efficient inference framework inspired by the
-Process/Threads mechanism when applying the DTA-Llama to practical tasks.
-Experimental results show that our approach substantially enhances task
-performance while reducing token consumption and inference time. Llama2-7B,
-using our method, is comparable to the official parallel function calling
-method of GPT-3.5. The relevant code, dataset, and model weights are available
-at https://corn0205.github.io/
+The COVID-19 pandemic has strained global public health, necessitating
+accurate diagnosis and intervention to control disease spread and reduce
+mortality rates. This paper introduces an interpretable deep survival
+prediction model designed specifically for improved understanding and trust in
+COVID-19 prognosis using chest X-ray (CXR) images. By integrating a large-scale
+pretrained image encoder, Risk-specific Grad-CAM, and anatomical region
+detection techniques, our approach produces regional interpretable outcomes
+that effectively capture essential disease features while focusing on rare but
+critical abnormal regions. Our model's predictive results provide enhanced
+clarity and transparency through risk area localization, enabling clinicians to
+make informed decisions regarding COVID-19 diagnosis with better understanding
+of prognostic insights. We evaluate the proposed method on a multi-center
+survival dataset and demonstrate its effectiveness via quantitative and
+qualitative assessments, achieving superior C-indexes (0.764 and 0.727) and
+time-dependent AUCs (0.799 and 0.691). These results suggest that our
+explainable deep survival prediction model surpasses traditional survival
+analysis methods in risk prediction, improving interpretability for clinical
+decision making and enhancing AI system trustworthiness.
 
-摘要：儘管目前的大型語言模型 (LLM) 展現出令人印象深刻的能力，但執行複雜的真實世界任務仍需要工具學習。主流方法（例如 CoT/ReAct）依賴逐步工具呼叫與外部環境互動，但它們的感知範圍有限，且缺乏足夠的任務規劃能力。為了解決這些限制，其他研究引入了第一個基於搜尋的決策樹 (DFSDT)，但仍有很高的運算成本。在本文中，我們介紹了一種新穎的平行工具呼叫範例，DTA-Llama（分而合之 Llama）。首先，我們將傳統的基於樹的工具搜尋路徑轉換為有向無環圖 (DAG) 結構，產生高品質的平行工具呼叫資料集。然後在資料集上訓練 DTA-Llama，學習反覆將當前任務分成幾個平行工具呼叫子任務，並彙總呼叫結果以決定後續動作。此外，我們在將 DTA-Llama 應用於實際任務時，引入了一個受 Process/Threads 機制啟發的高效推論框架。實驗結果表明，我們的做法大幅提升了任務效能，同時減少了符號消耗和推論時間。使用我們方法的 Llama2-7B，可與 GPT-3.5 的官方平行函式呼叫方法相媲美。相關程式碼、資料集和模型權重可在 https://corn0205.github.io/ 取得
+摘要：COVID-19 疫情對全球公共衛生造成壓力，必須進行準確的診斷和干預，以控制疾病傳播並降低死亡率。本文介紹了一個可解釋的深度生存預測模型，專門設計用於透過胸部 X 光 (CXR) 影像改善對 COVID-19 預後的理解和信賴。透過整合大規模預訓練影像編碼器、風險特定 Grad-CAM 和解剖區域偵測技術，我們的做法產生區域可解釋的結果，有效捕捉必要的疾病特徵，同時專注於罕見但關鍵的異常區域。我們的模型預測結果透過風險區域定位提供增強的清晰度和透明度，讓臨床醫生能夠在更了解預後見解的情況下，就 COVID-19 診斷做出明智的決策。我們在多中心生存資料集上評估所提出的方法，並透過量化和質化評估證明其有效性，達到優異的 C 指數（0.764 和 0.727）和時間相關 AUC（0.799 和 0.691）。這些結果表明，我們可解釋的深度生存預測模型在風險預測方面超越傳統的生存分析方法，提升臨床決策的解釋性，並增強 AI 系統的信賴度。
+
+##### **Rad4XCNN: a new agnostic method for post-hoc global explanation of CNN-derived features by means of radiomics**
+2405.02334v2 by Francesco Prinzi, Carmelo Militello, Calogero Zarcaro, Tommaso Vincenzo Bartolotta, Salvatore Gaglio, Salvatore Vitabile
+
+In recent years, machine learning-based clinical decision support systems
+(CDSS) have played a key role in the analysis of several medical conditions.
+Despite their promising capabilities, the lack of transparency in AI models
+poses significant challenges, particularly in medical contexts where
+reliability is a mandatory aspect. However, it appears that explainability is
+inversely proportional to accuracy. For this reason, achieving transparency
+without compromising predictive accuracy remains a key challenge. This paper
+presents a novel method, namely Rad4XCNN, to enhance the predictive power of
+CNN-derived features with the inherent interpretability of radiomic features.
+Rad4XCNN diverges from conventional methods based on saliency maps, by
+associating intelligible meaning to CNN-derived features by means of Radiomics,
+offering new perspectives on explanation methods beyond visualization maps.
+Using a breast cancer classification task as a case study, we evaluated
+Rad4XCNN on ultrasound imaging datasets, including an online dataset and two
+in-house datasets for internal and external validation. Some key results are:
+i) CNN-derived features guarantee more robust accuracy when compared against
+ViT-derived and radiomic features; ii) conventional visualization map methods
+for explanation present several pitfalls; iii) Rad4XCNN does not sacrifice
+model accuracy for their explainability; iv) Rad4XCNN provides a global
+explanation enabling the physician to extract global insights and findings. Our
+method can mitigate some concerns related to the explainability-accuracy
+trade-off. This study highlighted the importance of proposing new methods for
+model explanation without affecting their accuracy.
 
-##### **InsTALL: Context-aware Instructional Task Assistance with Multi-modal Large Language Models**
-2501.12231v1 by Pha Nguyen, Sailik Sengupta, Girik Malik, Arshit Gupta, Bonan Min
+摘要：<paragraph>近年来，基于机器学习的临床决策支持系统 (CDSS) 在多种疾病的分析中扮演了关键角色。尽管它们具有广阔的前景，但 AI 模型缺乏透明度，尤其在医疗领域，可靠性是强制性方面，这带来了重大挑战。然而，解释性似乎与准确性成反比。因此，在不影响预测准确性的情况下实现透明度仍然是一个关键挑战。本文提出了一种新方法，即 Rad4XCNN，以通过放射组学的内在可解释性来增强 CNN 衍生特征的预测能力。Rad4XCNN 通过放射组学将可理解的含义与 CNN 衍生特征关联起来，从而偏离了基于显着性图的传统方法，为超越可视化图的解释方法提供了新的视角。使用乳腺癌分类任务作为案例研究，我们在超声成像数据集上评估了 Rad4XCNN，包括一个在线数据集和两个用于内部和外部验证的内部数据集。一些关键结果是：i) 与 ViT 衍生和放射组学特征相比，CNN 衍生特征保证了更稳健的准确性；ii) 用于解释的传统可视化图方法存在一些缺陷；iii) Rad4XCNN 不会为了可解释性而牺牲模型准确性；iv) Rad4XCNN 提供全局解释，使医生能够提取全局见解和发现。我们的方法可以减轻一些与可解释性-准确性权衡相关的担忧。本研究强调了提出新方法来解释模型而不影响其准确性的重要性。</paragraph>
 
-The improved competence of generative models can help building multi-modal
-virtual assistants that leverage modalities beyond language. By observing
-humans performing multi-step tasks, one can build assistants that have
-situational awareness of actions and tasks being performed, enabling them to
-cater assistance based on this understanding. In this paper, we develop a
-Context-aware Instructional Task Assistant with Multi-modal Large Language
-Models (InsTALL) that leverages an online visual stream (e.g. a user's screen
-share or video recording) and responds in real-time to user queries related to
-the task at hand. To enable useful assistance, InsTALL 1) trains a multi-modal
-model on task videos and paired textual data, and 2) automatically extracts
-task graph from video data and leverages it at training and inference time. We
-show InsTALL achieves state-of-the-art performance across proposed sub-tasks
-considered for multimodal activity understanding -- task recognition (TR),
-action recognition (AR), next action prediction (AP), and plan prediction (PP)
--- and outperforms existing baselines on two novel sub-tasks related to
-automatic error identification.
+##### **Attributing Responsibility in AI-Induced Incidents: A Computational Reflective Equilibrium Framework for Accountability**
+2404.16957v1 by Yunfei Ge, Quanyan Zhu
 
-摘要：生成模型能力的提升有助于构建利用语言之外的多模态虚拟助手。通过观察人类执行多步骤任务，可以构建对正在执行的动作和任务有情境感知的助手，使他们能够根据这种理解提供帮助。在本文中，我们开发了一个具有多模态大语言模型的上下文感知指令任务助手 (InsTALL)，该助手利用在线视觉流（例如用户的屏幕共享或视频录制），并实时响应与手头任务相关的用户查询。为了提供有用的帮助，InsTALL 1) 在任务视频和配对文本数据上训练多模态模型，以及 2) 从视频数据中自动提取任务图，并在训练和推理时间利用它。我们展示了 InsTALL 在考虑用于多模态活动理解的提议子任务中实现了最先进的性能——任务识别 (TR)、动作识别 (AR)、下一个动作预测 (AP) 和计划预测 (PP)——并且在与自动错误识别相关的两个新子任务上优于现有的基准。
+The pervasive integration of Artificial Intelligence (AI) has introduced
+complex challenges in the responsibility and accountability in the event of
+incidents involving AI-enabled systems. The interconnectivity of these systems,
+ethical concerns of AI-induced incidents, coupled with uncertainties in AI
+technology and the absence of corresponding regulations, have made traditional
+responsibility attribution challenging. To this end, this work proposes a
+Computational Reflective Equilibrium (CRE) approach to establish a coherent and
+ethically acceptable responsibility attribution framework for all stakeholders.
+The computational approach provides a structured analysis that overcomes the
+limitations of conceptual approaches in dealing with dynamic and multifaceted
+scenarios, showcasing the framework's explainability, coherence, and adaptivity
+properties in the responsibility attribution process. We examine the pivotal
+role of the initial activation level associated with claims in equilibrium
+computation. Using an AI-assisted medical decision-support system as a case
+study, we illustrate how different initializations lead to diverse
+responsibility distributions. The framework offers valuable insights into
+accountability in AI-induced incidents, facilitating the development of a
+sustainable and resilient system through continuous monitoring, revision, and
+reflection.
 
-##### **Leveraging Graph Structures and Large Language Models for End-to-End Synthetic Task-Oriented Dialogues**
-2501.11977v1 by Maya Medjad, Hugo Imbert, Bruno Yun, Raphaël Szymocha, Frédéric Armetta
+摘要：隨著人工智慧 (AI) 的普及整合，在涉及 AI 驅動系統的事故中，責任和義務歸屬產生了複雜的挑戰。這些系統的互連性、AI 引發事故的倫理問題，加上 AI 技術的不確定性和缺乏相應法規，使得傳統責任歸屬面臨挑戰。為此，本研究提出了一種計算反思均衡 (CRE) 方法，以建立一個連貫且在倫理上可接受的責任歸屬架構，適用於所有利害關係人。計算方法提供了結構化的分析，克服了概念方法在處理動態且多面向情境時的限制，展示了該架構在責任歸屬過程中具備的可解釋性、連貫性和適應性。我們探討了與均衡計算中索賠相關的初始啟動層級的關鍵作用。我們以 AI 輔助醫療決策支援系統為案例研究，說明不同的初始化如何導致不同的責任分配。該架構提供了對 AI 引發事故中問責制的寶貴見解，透過持續監控、修訂和反思，促進了永續且有韌性的系統發展。
 
-Training task-oriented dialogue systems is both costly and time-consuming,
-due to the need for high-quality datasets encompassing diverse intents.
-Traditional methods depend on extensive human annotation, while recent
-advancements leverage large language models (LLMs) to generate synthetic data.
-However, these approaches often require custom prompts or code, limiting
-accessibility for non-technical users. We introduce GraphTOD, an end-to-end
-framework that simplifies the generation of task-oriented dialogues. Users can
-create dialogues by specifying transition graphs in JSON format. Our evaluation
-demonstrates that GraphTOD generates high-quality dialogues across various
-domains, significantly lowering the cost and complexity of dataset creation.
+##### **Explainable AI for Fair Sepsis Mortality Predictive Model**
+2404.13139v1 by Chia-Hsuan Chang, Xiaoyang Wang, Christopher C. Yang
 
-摘要：訓練任務導向對話系統既昂貴又耗時，
-因為需要包含各種意圖的高品質資料集。
-傳統方法依賴於廣泛的人工標註，而最近
-的進展利用大型語言模型 (LLM) 來產生合成資料。
-然而，這些方法通常需要自訂提示或程式碼，限制
-非技術使用者的可及性。我們介紹 GraphTOD，一個端對端的
-架構，簡化了任務導向對話的產生。使用者可以
-透過指定 JSON 格式的轉換圖表來建立對話。我們的評估
-證明 GraphTOD 在各種領域產生高品質對話，顯著降低資料集建立的成本和複雜性。
+Artificial intelligence supports healthcare professionals with predictive
+modeling, greatly transforming clinical decision-making. This study addresses
+the crucial need for fairness and explainability in AI applications within
+healthcare to ensure equitable outcomes across diverse patient demographics. By
+focusing on the predictive modeling of sepsis-related mortality, we propose a
+method that learns a performance-optimized predictive model and then employs
+the transfer learning process to produce a model with better fairness. Our
+method also introduces a novel permutation-based feature importance algorithm
+aiming at elucidating the contribution of each feature in enhancing fairness on
+predictions. Unlike existing explainability methods concentrating on explaining
+feature contribution to predictive performance, our proposed method uniquely
+bridges the gap in understanding how each feature contributes to fairness. This
+advancement is pivotal, given sepsis's significant mortality rate and its role
+in one-third of hospital deaths. Our method not only aids in identifying and
+mitigating biases within the predictive model but also fosters trust among
+healthcare stakeholders by improving the transparency and fairness of model
+predictions, thereby contributing to more equitable and trustworthy healthcare
+delivery.
 
-##### **Bridging Visualization and Optimization: Multimodal Large Language Models on Graph-Structured Combinatorial Optimization**
-2501.11968v1 by Jie Zhao, Kang Hao Cheong, Witold Pedrycz
+摘要：人工智慧透過預測模型協助醫療專業人員，大幅轉變了臨床決策制定。本研究探討了在醫療保健中使用人工智慧應用程式時公平性和可解釋性的關鍵需求，以確保在不同的患者人口統計資料中獲得公平的結果。透過專注於敗血症相關死亡率的預測模型，我們提出了一種方法，該方法會學習一個效能最佳化的預測模型，然後採用轉移學習過程來產生一個具有更好公平性的模型。我們的模型還引入了一種新穎的基於排列的特徵重要性演算法，旨在闡明每個特徵在增強預測公平性方面的貢獻。與現有的可解釋性方法專注於解釋特徵對預測效能的貢獻不同，我們提出的方法獨特地彌補了理解每個特徵如何有助於公平性的差距。這項進展至關重要，因為敗血症的死亡率很高，且在三分之一的醫院死亡中扮演著角色。我們的模型不僅有助於識別和減輕預測模型中的偏差，還能透過提高模型預測的透明度和公平性來培養醫療保健利益相關者之間的信任，進而有助於提供更公平且值得信賴的醫療保健服務。
 
-Graph-structured combinatorial challenges are inherently difficult due to
-their nonlinear and intricate nature, often rendering traditional computational
-methods ineffective or expensive. However, these challenges can be more
-naturally tackled by humans through visual representations that harness our
-innate ability for spatial reasoning. In this study, we propose transforming
-graphs into images to preserve their higher-order structural features
-accurately, revolutionizing the representation used in solving graph-structured
-combinatorial tasks. This approach allows machines to emulate human-like
-processing in addressing complex combinatorial challenges. By combining the
-innovative paradigm powered by multimodal large language models (MLLMs) with
-simple search techniques, we aim to develop a novel and effective framework for
-tackling such problems. Our investigation into MLLMs spanned a variety of
-graph-based tasks, from combinatorial problems like influence maximization to
-sequential decision-making in network dismantling, as well as addressing six
-fundamental graph-related issues. Our findings demonstrate that MLLMs exhibit
-exceptional spatial intelligence and a distinctive capability for handling
-these problems, significantly advancing the potential for machines to
-comprehend and analyze graph-structured data with a depth and intuition akin to
-human cognition. These results also imply that integrating MLLMs with simple
-optimization strategies could form a novel and efficient approach for
-navigating graph-structured combinatorial challenges without complex
-derivations, computationally demanding training and fine-tuning.
+##### **Multi Class Depression Detection Through Tweets using Artificial Intelligence**
+2404.13104v1 by Muhammad Osama Nusrat, Waseem Shahzad, Saad Ahmed Jamal
 
-摘要：圖形結構的組合挑戰本質上很困難，因為它們的非線性和複雜性，通常會使傳統的計算方法無效或昂貴。然而，人類可以透過利用我們天生的空間推理能力的視覺表徵，更自然地應對這些挑戰。在本研究中，我們建議將圖形轉換為影像，以準確保留它們的高階結構特徵，從而革新用於解決圖形結構組合任務的表徵。這種方法允許機器在解決複雜的組合挑戰時模擬類人的處理。透過結合由多模態大型語言模型 (MLLM) 提供動力的創新範例與簡單的搜尋技術，我們旨在為解決此類問題開發一個新穎且有效的架構。我們對 MLLM 的研究涵蓋了各種基於圖形的任務，從組合問題（如影響力最大化）到網路拆除中的順序決策制定，以及解決六個基本的圖形相關問題。我們的研究結果表明，MLLM 表現出非凡的空間智能和處理這些問題的獨特能力，顯著提升了機器以類似人類認知的深度和直覺來理解和分析圖形結構資料的潛力。這些結果還暗示，將 MLLM 與簡單的最佳化策略整合在一起，可以形成一種新穎且有效的方法，用於在沒有複雜推導、計算需求量大的訓練和微調的情況下應對圖形結構的組合挑戰。
+Depression is a significant issue nowadays. As per the World Health
+Organization (WHO), in 2023, over 280 million individuals are grappling with
+depression. This is a huge number; if not taken seriously, these numbers will
+increase rapidly. About 4.89 billion individuals are social media users. People
+express their feelings and emotions on platforms like Twitter, Facebook,
+Reddit, Instagram, etc. These platforms contain valuable information which can
+be used for research purposes. Considerable research has been conducted across
+various social media platforms. However, certain limitations persist in these
+endeavors. Particularly, previous studies were only focused on detecting
+depression and the intensity of depression in tweets. Also, there existed
+inaccuracies in dataset labeling. In this research work, five types of
+depression (Bipolar, major, psychotic, atypical, and postpartum) were predicted
+using tweets from the Twitter database based on lexicon labeling. Explainable
+AI was used to provide reasoning by highlighting the parts of tweets that
+represent type of depression. Bidirectional Encoder Representations from
+Transformers (BERT) was used for feature extraction and training. Machine
+learning and deep learning methodologies were used to train the model. The BERT
+model presented the most promising results, achieving an overall accuracy of
+0.96.
 
-##### **A Survey of Graph Retrieval-Augmented Generation for Customized Large Language Models**
-2501.13958v1 by Qinggang Zhang, Shengyuan Chen, Yuanchen Bei, Zheng Yuan, Huachi Zhou, Zijin Hong, Junnan Dong, Hao Chen, Yi Chang, Xiao Huang
+摘要：現今，憂鬱症是一個重要的議題。根據世界衛生組織 (WHO) 的資料，在 2023 年，超過 2.8 億人正在與憂鬱症搏鬥。這是一個龐大的數字；如果不認真看待，這些數字將會快速增加。大約有 48.9 億人是社群媒體使用者。人們在 Twitter、Facebook、Reddit、Instagram 等平台上表達自己的感受和情緒。這些平台包含有價值的資訊，可用於研究目的。已經在各種社群媒體平台上進行了大量的研究。然而，這些努力仍存在某些限制。特別是，先前的研究僅專注於偵測推文中的憂鬱症和憂鬱症的強度。此外，資料集標籤中存在不準確的情況。在這項研究工作中，使用基於詞彙標籤的 Twitter 資料庫中的推文預測了五種類型的憂鬱症（雙極型、重度、精神病型、非典型和產後）。可解釋的 AI 用於透過強調代表憂鬱症類型的推文部分來提供推理。從 Transformers（BERT）中提取的雙向編碼器表示用於特徵提取和訓練。機器學習和深度學習方法用於訓練模型。BERT 模型呈現出最有希望的結果，達到 0.96 的整體準確度。
 
-Large language models (LLMs) have demonstrated remarkable capabilities in a
-wide range of tasks, yet their application to specialized domains remains
-challenging due to the need for deep expertise. Retrieval-augmented generation
-(RAG) has emerged as a promising solution to customize LLMs for professional
-fields by seamlessly integrating external knowledge bases, enabling real-time
-access to domain-specific expertise during inference. Despite its potential,
-traditional RAG systems, based on flat text retrieval, face three critical
-challenges: (i) complex query understanding in professional contexts, (ii)
-difficulties in knowledge integration across distributed sources, and (iii)
-system efficiency bottlenecks at scale. This survey presents a systematic
-analysis of Graph-based Retrieval-Augmented Generation (GraphRAG), a new
-paradigm that revolutionizes domain-specific LLM applications. GraphRAG
-addresses traditional RAG limitations through three key innovations: (i)
-graph-structured knowledge representation that explicitly captures entity
-relationships and domain hierarchies, (ii) efficient graph-based retrieval
-techniques that enable context-preserving knowledge retrieval with multihop
-reasoning ability, and (iii) structure-aware knowledge integration algorithms
-that leverage retrieved knowledge for accurate and logical coherent generation
-of LLMs. In this survey, we systematically analyze the technical foundations of
-GraphRAG and examine current implementations across various professional
-domains, identifying key technical challenges and promising research
-directions. All the related resources of GraphRAG, including research papers,
-open-source data, and projects, are collected for the community in
-\textcolor{blue}{\url{https://github.com/DEEP-PolyU/Awesome-GraphRAG}}.
+##### **COIN: Counterfactual inpainting for weakly supervised semantic segmentation for medical images**
+2404.12832v2 by Dmytro Shvetsov, Joonas Ariva, Marharyta Domnich, Raul Vicente, Dmytro Fishman
 
-摘要：大型語言模型 (LLM) 已在各種任務中展現出非凡的能力，但由於需要深入的專業知識，因此將其應用於專業領域仍具有挑戰性。檢索增強生成 (RAG) 已成為一種有前途的解決方案，可通過無縫整合外部知識庫來客製化 LLM 以適用於專業領域，從而在推理過程中即時存取特定領域的專業知識。儘管有其潛力，但基於平面文字檢索的傳統 RAG 系統面臨三項關鍵挑戰：(i) 在專業情境中進行複雜的查詢理解，(ii) 難以整合分散來源的知識，以及 (iii) 系統效率瓶頸會隨著規模擴大而產生。本調查系統性地分析了圖形化檢索增強生成 (GraphRAG) 的技術基礎，GraphRAG 是一個新的典範，它徹底改變了特定領域的 LLM 應用。GraphRAG 透過三項關鍵創新來解決傳統 RAG 的限制：(i) 圖形結構化的知識表述，明確擷取實體關係和領域階層，(ii) 有效的圖形化檢索技術，可進行保留脈絡的知識檢索，並具備多跳推理能力，以及 (iii) 結構感知知識整合演算法，可利用檢索到的知識來進行 LLM 的準確且邏輯一致的生成。在本調查中，我們系統性地分析了 GraphRAG 的技術基礎，並檢視了在各種專業領域中的現有實作，找出關鍵技術挑戰和有前景的研究方向。所有 GraphRAG 的相關資源，包括研究論文、開放原始碼資料和專案，都已在 \textcolor{blue}{\url{https://github.com/DEEP-PolyU/Awesome-GraphRAG}} 中為社群收集。
+Deep learning is dramatically transforming the field of medical imaging and
+radiology, enabling the identification of pathologies in medical images,
+including computed tomography (CT) and X-ray scans. However, the performance of
+deep learning models, particularly in segmentation tasks, is often limited by
+the need for extensive annotated datasets. To address this challenge, the
+capabilities of weakly supervised semantic segmentation are explored through
+the lens of Explainable AI and the generation of counterfactual explanations.
+The scope of this research is development of a novel counterfactual inpainting
+approach (COIN) that flips the predicted classification label from abnormal to
+normal by using a generative model. For instance, if the classifier deems an
+input medical image X as abnormal, indicating the presence of a pathology, the
+generative model aims to inpaint the abnormal region, thus reversing the
+classifier's original prediction label. The approach enables us to produce
+precise segmentations for pathologies without depending on pre-existing
+segmentation masks. Crucially, image-level labels are utilized, which are
+substantially easier to acquire than creating detailed segmentation masks. The
+effectiveness of the method is demonstrated by segmenting synthetic targets and
+actual kidney tumors from CT images acquired from Tartu University Hospital in
+Estonia. The findings indicate that COIN greatly surpasses established
+attribution methods, such as RISE, ScoreCAM, and LayerCAM, as well as an
+alternative counterfactual explanation method introduced by Singla et al. This
+evidence suggests that COIN is a promising approach for semantic segmentation
+of tumors in CT images, and presents a step forward in making deep learning
+applications more accessible and effective in healthcare, where annotated data
+is scarce.
 
-##### **Network-informed Prompt Engineering against Organized Astroturf Campaigns under Extreme Class Imbalance**
-2501.11849v2 by Nikos Kanakaris, Heng Ping, Xiongye Xiao, Nesreen K. Ahmed, Luca Luceri, Emilio Ferrara, Paul Bogdan
+摘要：深度学习正大幅轉變醫學影像和放射線學領域，能辨識醫學影像中的病理，包括電腦斷層掃描 (CT) 和 X 光掃描。然而，深度學習模型的效能，特別是在分割任務中，常常受到廣泛註解資料集需求的限制。為了應對此挑戰，透過可解釋 AI 和反事實解釋的產生，探索弱監督語意分割的能力。本研究的範圍是開發一種新的反事實內插方法 (COIN)，該方法使用生成模型將預測的分類標籤從異常翻轉為正常。例如，如果分類器將輸入的醫學影像 X 視為異常，表示存在病理，則生成模型旨在內插異常區域，從而逆轉分類器的原始預測標籤。此方法使我們能夠產生病理的精確分割，而無需依賴於預先存在的分割遮罩。至關重要的是，利用影像層級標籤，這比建立詳細的分割遮罩容易取得。該方法的有效性透過分割合成目標和從愛沙尼亞塔爾圖大學醫院取得的 CT 影像中的實際腎臟腫瘤來證明。研究結果表明，COIN 遠遠超過已建立的歸因方法，例如 RISE、ScoreCAM 和 LayerCAM，以及 Singla 等人提出的另一種反事實解釋方法。此證據表明，COIN 是一種很有前途的 CT 影像中腫瘤語意分割方法，並在醫療保健中讓深度學習應用更易於取得和更有效率邁進一步，其中註解資料很稀少。
 
-Detecting organized political campaigns is of paramount importance in
-fighting against disinformation on social media. Existing approaches for the
-identification of such organized actions employ techniques mostly from network
-science, graph machine learning and natural language processing. Their ultimate
-goal is to analyze the relationships and interactions (e.g. re-posting) among
-users and the textual similarities of their posts. Despite their effectiveness
-in recognizing astroturf campaigns, these methods face significant challenges,
-notably the class imbalance in available training datasets. To mitigate this
-issue, recent methods usually resort to data augmentation or increasing the
-number of positive samples, which may not always be feasible or sufficient in
-real-world settings. Following a different path, in this paper, we propose a
-novel framework for identifying astroturf campaigns based solely on large
-language models (LLMs), introducing a Balanced Retrieval-Augmented Generation
-(Balanced RAG) component. Our approach first gives both textual information
-concerning the posts (in our case tweets) and the user interactions of the
-social network as input to a language model. Then, through prompt engineering
-and the proposed Balanced RAG method, it effectively detects coordinated
-disinformation campaigns on X (Twitter). The proposed framework does not
-require any training or fine-tuning of the language model. Instead, by
-strategically harnessing the strengths of prompt engineering and Balanced RAG,
-it facilitates LLMs to overcome the effects of class imbalance and effectively
-identify coordinated political campaigns. The experimental results demonstrate
-that by incorporating the proposed prompt engineering and Balanced RAG methods,
-our framework outperforms the traditional graph-based baselines, achieving
-2x-3x improvements in terms of precision, recall and F1 scores.
+##### **Hybrid Intelligence for Digital Humanities**
+2406.15374v1 by Victor de Boer, Lise Stork
 
-摘要：<paragraph>在社交媒體上對抗錯誤資訊，偵測有組織的政治宣傳活動至關重要。現有的此類有組織行動識別方法，大多採用網路科學、圖形機器學習和自然語言處理的技術。它們的最終目標是分析使用者之間的關係和互動（例如轉發），以及他們貼文的文字相似性。儘管這些方法在辨識草根運動宣傳活動方面很有效，但它們面臨嚴峻的挑戰，特別是可用訓練資料集中的類別不平衡。為了減輕這個問題，最近的方法通常訴諸於資料擴充或增加正向樣本數量，但在現實世界中可能並非總是可行或足夠。本文採取不同的途徑，我們提出了一個基於大型語言模型 (LLM) 的辨識草根運動宣傳活動的新架構，並引入了平衡檢索擴充產生 (Balanced RAG) 組件。我們的做法首先將有關貼文（在我們的案例中是推文）的文字資訊和社交網路的使用者互動作為輸入，輸入到語言模型中。然後，透過提示工程和提出的平衡檢索擴充產生方法，它有效地偵測 X（Twitter）上協調的不實資訊宣傳活動。提出的架構不需要任何語言模型的訓練或微調。相反地，透過策略性地利用提示工程和平衡檢索擴充產生方法的優勢，它使大型語言模型能夠克服類別不平衡的影響，並有效地識別協調的政治宣傳活動。實驗結果證明，透過整合提出的提示工程和平衡檢索擴充產生方法，我們的架構優於傳統的基於圖形的基準，在精確度、召回率和 F1 分數方面獲得 2x-3x 的改進。</paragraph>
+In this paper, we explore the synergies between Digital Humanities (DH) as a
+discipline and Hybrid Intelligence (HI) as a research paradigm. In DH research,
+the use of digital methods and specifically that of Artificial Intelligence is
+subject to a set of requirements and constraints. We argue that these are
+well-supported by the capabilities and goals of HI. Our contribution includes
+the identification of five such DH requirements: Successful AI systems need to
+be able to 1) collaborate with the (human) scholar; 2) support data criticism;
+3) support tool criticism; 4) be aware of and cater to various perspectives and
+5) support distant and close reading. We take the CARE principles of Hybrid
+Intelligence (collaborative, adaptive, responsible and explainable) as
+theoretical framework and map these to the DH requirements. In this mapping, we
+include example research projects. We finally address how insights from DH can
+be applied to HI and discuss open challenges for the combination of the two
+disciplines.
 
-##### **Large Language Models Meet Graph Neural Networks for Text-Numeric Graph Reasoning**
-2501.16361v1 by Haoran Song, Jiarui Feng, Guangfu Li, Michael Province, Philip Payne, Yixin Chen, Fuhai Li
+摘要：在本文中，我們探討數位人文學科 (DH) 作為一門學科與混合智能 (HI) 作為一個研究典範之間的協同作用。在 DH 研究中，數位方法的使用，特別是人工智慧的使用，受到一系列要求和限制。我們認為這些要求和限制獲得 HI 的能力和目標的充分支持。我們的貢獻包括找出五個這樣的 DH 要求：成功的 AI 系統需要能夠 1) 與（人類）學者合作；2) 支援資料批評；3) 支援工具批評；4) 察覺並迎合各種觀點；5) 支援遠距和近距離閱讀。我們將混合智能的 CARE 原則（協作、適應、負責和可解釋）作為理論架構，並將這些原則對應到 DH 要求。在此對應中，我們納入範例研究專案。最後，我們探討如何將 DH 的見解應用於 HI，並討論結合這兩個學科的開放挑戰。
 
-In real-world scientific discovery, human beings always make use of the
-accumulated prior knowledge with imagination pick select one or a few most
-promising hypotheses from large and noisy data analysis results. In this study,
-we introduce a new type of graph structure, the text-numeric graph (TNG), which
-is defined as graph entities and associations have both text-attributed
-information and numeric information. The TNG is an ideal data structure model
-for novel scientific discovery via graph reasoning because it integrates
-human-understandable textual annotations or prior knowledge, with numeric
-values that represent the observed or activation levels of graph entities or
-associations in different samples. Together both the textual information and
-numeric values determine the importance of graph entities and associations in
-graph reasoning for novel scientific knowledge discovery. We further propose
-integrating large language models (LLMs) and graph neural networks (GNNs) to
-analyze the TNGs for graph understanding and reasoning. To demonstrate the
-utility, we generated the text-omic(numeric) signaling graphs (TOSG), as one
-type of TNGs, in which all graphs have the same entities, associations and
-annotations, but have sample-specific entity numeric (omic) values using single
-cell RNAseq (scRNAseq) datasets of different diseases. We proposed joint
-LLM-GNN models for key entity mining and signaling pathway mining on the TOSGs.
-The evaluation results showed the LLM-GNN and TNGs models significantly improve
-classification accuracy and network inference. In conclusion, the TNGs and
-joint LLM-GNN models are important approaches for scientific discovery.
+##### **Ethical Framework for Responsible Foundational Models in Medical Imaging**
+2406.11868v1 by Abhijit Das, Debesh Jha, Jasmer Sanjotra, Onkar Susladkar, Suramyaa Sarkar, Ashish Rauniyar, Nikhil Tomar, Vanshali Sharma, Ulas Bagci
 
-摘要：<paragraph>在現實世界的科學發現中，人類總是利用累積的先驗知識，並運用想像力從大量且雜訊的資料分析結果中挑選出一個或幾個最有希望的假設。在本研究中，我們介紹了一種新型態的圖形結構，稱為文字數值圖 (TNG)，定義為圖形實體和關聯具有文字屬性資訊和數值資訊。TNG 是透過圖形推理進行新科學發現的理想資料結構模型，因為它整合了人類可理解的文字註解或先驗知識，以及代表圖形實體或不同樣本中關聯的觀察值或活化程度的數值。文字資訊和數值一起決定了圖形實體和關聯在圖形推理中對於新科學知識發現的重要性。我們進一步提出整合大型語言模型 (LLM) 和圖形神經網路 (GNN) 來分析 TNG，以進行圖形理解和推理。為了展示其效用，我們生成了文字組學（數值）訊號圖 (TOSG)，作為一種 TNG，其中所有圖形都具有相同的實體、關聯和註解，但具有特定於樣本的實體數值（組學）值，使用不同疾病的單細胞 RNAseq (scRNAseq) 資料集。我們針對 TOSG 提出聯合 LLM-GNN 模型，用於關鍵實體探勘和訊號路徑探勘。評估結果顯示，LLM-GNN 和 TNG 模型顯著提升了分類準確度和網路推論。結論而言，TNG 和聯合 LLM-GNN 模型是科學發現的重要方法。</paragraph>
+Foundational models (FMs) have tremendous potential to revolutionize medical
+imaging. However, their deployment in real-world clinical settings demands
+extensive ethical considerations. This paper aims to highlight the ethical
+concerns related to FMs and propose a framework to guide their responsible
+development and implementation within medicine. We meticulously examine ethical
+issues such as privacy of patient data, bias mitigation, algorithmic
+transparency, explainability and accountability. The proposed framework is
+designed to prioritize patient welfare, mitigate potential risks, and foster
+trust in AI-assisted healthcare.
 
-##### **Zep: A Temporal Knowledge Graph Architecture for Agent Memory**
-2501.13956v1 by Preston Rasmussen, Pavlo Paliychuk, Travis Beauvais, Jack Ryan, Daniel Chalef
+摘要：基礎模型 (FM) 具有徹底改變醫學影像的巨大潛力。然而，它們在現實世界臨床環境中的部署需要廣泛的倫理考量。本文旨在強調與 FM 相關的倫理問題，並提出一個框架來指導它們在醫學中的負責任開發和實施。我們仔細審查了倫理問題，例如患者數據隱私、偏差緩解、演算法透明度、可解釋性和問責制。所提出的框架旨在優先考慮患者福利、減輕潛在風險，並培養對 AI 輔助醫療保健的信任。
 
-We introduce Zep, a novel memory layer service for AI agents that outperforms
-the current state-of-the-art system, MemGPT, in the Deep Memory Retrieval (DMR)
-benchmark. Additionally, Zep excels in more comprehensive and challenging
-evaluations than DMR that better reflect real-world enterprise use cases. While
-existing retrieval-augmented generation (RAG) frameworks for large language
-model (LLM)-based agents are limited to static document retrieval, enterprise
-applications demand dynamic knowledge integration from diverse sources
-including ongoing conversations and business data. Zep addresses this
-fundamental limitation through its core component Graphiti -- a
-temporally-aware knowledge graph engine that dynamically synthesizes both
-unstructured conversational data and structured business data while maintaining
-historical relationships. In the DMR benchmark, which the MemGPT team
-established as their primary evaluation metric, Zep demonstrates superior
-performance (94.8% vs 93.4%). Beyond DMR, Zep's capabilities are further
-validated through the more challenging LongMemEval benchmark, which better
-reflects enterprise use cases through complex temporal reasoning tasks. In this
-evaluation, Zep achieves substantial results with accuracy improvements of up
-to 18.5% while simultaneously reducing response latency by 90% compared to
-baseline implementations. These results are particularly pronounced in
-enterprise-critical tasks such as cross-session information synthesis and
-long-term context maintenance, demonstrating Zep's effectiveness for deployment
-in real-world applications.
+##### **Advancements in Radiomics and Artificial Intelligence for Thyroid Cancer Diagnosis**
+2404.07239v1 by Milad Yousefi, Shadi Farabi Maleki, Ali Jafarizadeh, Mahya Ahmadpour Youshanlui, Aida Jafari, Siamak Pedrammehr, Roohallah Alizadehsani, Ryszard Tadeusiewicz, Pawel Plawiak
 
-摘要：我們推出 Zep，這是一種新穎的記憶層服務，適用於 AI 代理，其在深度記憶擷取 (DMR) 基準測試中優於現行的最先進系統 MemGPT。此外，Zep 在比 DMR 更全面且更具挑戰性的評估中表現出色，這些評估更能反映真實世界的企業用例。雖然現有的檢索增強生成 (RAG) 架構僅限於大型語言模型 (LLM) 基於代理的靜態文件檢索，但企業應用需要從包括正在進行的對話和業務數據在內的不同來源動態整合知識。Zep 通過其核心組件 Graphiti 來解決這個基本限制，Graphiti 是一個時間感知知識圖譜引擎，可以在維護歷史關係的同時動態綜合非結構化對話數據和結構化業務數據。在 MemGPT 團隊確立為其主要評估指標的 DMR 基準測試中，Zep 表現出優異的效能（94.8% 對 93.4%）。除了 DMR 之外，Zep 的功能還通過更具挑戰性的 LongMemEval 基準測試進一步得到驗證，該基準測試通過複雜的時間推理任務更好地反映了企業用例。在這個評估中，Zep 以高達 18.5% 的準確度改進取得了顯著的成果，同時與基線實作相比，將回應延遲降低了 90%。這些成果在企業關鍵任務中尤為明顯，例如跨會話資訊綜合和長期脈絡維護，證明了 Zep 在實際應用中部署的有效性。
+Thyroid cancer is an increasing global health concern that requires advanced
+diagnostic methods. The application of AI and radiomics to thyroid cancer
+diagnosis is examined in this review. A review of multiple databases was
+conducted in compliance with PRISMA guidelines until October 2023. A
+combination of keywords led to the discovery of an English academic publication
+on thyroid cancer and related subjects. 267 papers were returned from the
+original search after 109 duplicates were removed. Relevant studies were
+selected according to predetermined criteria after 124 articles were eliminated
+based on an examination of their abstract and title. After the comprehensive
+analysis, an additional six studies were excluded. Among the 28 included
+studies, radiomics analysis, which incorporates ultrasound (US) images,
+demonstrated its effectiveness in diagnosing thyroid cancer. Various results
+were noted, some of the studies presenting new strategies that outperformed the
+status quo. The literature has emphasized various challenges faced by AI
+models, including interpretability issues, dataset constraints, and operator
+dependence. The synthesized findings of the 28 included studies mentioned the
+need for standardization efforts and prospective multicenter studies to address
+these concerns. Furthermore, approaches to overcome these obstacles were
+identified, such as advances in explainable AI technology and personalized
+medicine techniques. The review focuses on how AI and radiomics could transform
+the diagnosis and treatment of thyroid cancer. Despite challenges, future
+research on multidisciplinary cooperation, clinical applicability validation,
+and algorithm improvement holds the potential to improve patient outcomes and
+diagnostic precision in the treatment of thyroid cancer.
 
-##### **Explainable Lane Change Prediction for Near-Crash Scenarios Using Knowledge Graph Embeddings and Retrieval Augmented Generation**
-2501.11560v1 by M. Manzour, A. Ballardini, R. Izquierdo, M. Á. Sotelo
+摘要：甲狀腺癌是一種日益嚴重的全球健康問題，需要先進的診斷方法。本篇評論探討了人工智能與放射特徵分析在甲狀腺癌診斷中的應用。在符合 PRISMA 指南的情況下，對多個資料庫進行了回顧，直到 2023 年 10 月。通過結合關鍵字，發現了一篇關於甲狀腺癌和相關主題的英文學術出版物。在移除 109 篇重複文獻後，原始搜尋共回傳 267 篇論文。在根據預先確定的標準，淘汰了 124 篇文章的摘要和標題後，選出了相關研究。在進行全面分析後，額外排除了六項研究。在納入的 28 項研究中，結合超音波 (US) 影像的放射特徵分析，證明了其在診斷甲狀腺癌方面的有效性。研究結果不一，有些研究提出了優於現狀的新策略。文獻強調了人工智能模型面臨的各種挑戰，包括可解釋性問題、資料集限制和操作員依賴性。28 項納入研究的綜合發現提到，需要標準化工作和前瞻性多中心研究來解決這些問題。此外，還確定了克服這些障礙的方法，例如可解釋人工智能技術和個人化醫療技術的進步。本篇評論重點探討了人工智能和放射特徵分析如何轉變甲狀腺癌的診斷和治療。儘管存在挑戰，但未來對多學科合作、臨床適用性驗證和演算法改進的研究，仍有潛力改善甲狀腺癌治療中的患者預後和診斷精準度。
 
-Lane-changing maneuvers, particularly those executed abruptly or in risky
-situations, are a significant cause of road traffic accidents. However, current
-research mainly focuses on predicting safe lane changes. Furthermore, existing
-accident datasets are often based on images only and lack comprehensive sensory
-data. In this work, we focus on predicting risky lane changes using the CRASH
-dataset (our own collected dataset specifically for risky lane changes), and
-safe lane changes (using the HighD dataset). Then, we leverage KG and Bayesian
-inference to predict these maneuvers using linguistic contextual information,
-enhancing the model's interpretability and transparency. The model achieved a
-91.5% f1-score with anticipation time extending to four seconds for risky lane
-changes, and a 90.0% f1-score for predicting safe lane changes with the same
-anticipation time. We validate our model by integrating it into a vehicle
-within the CARLA simulator in scenarios that involve risky lane changes. The
-model managed to anticipate sudden lane changes, thus providing automated
-vehicles with further time to plan and execute appropriate safe reactions.
-Finally, to enhance the explainability of our model, we utilize RAG to provide
-clear and natural language explanations for the given prediction.
+##### **Predictive Modeling for Breast Cancer Classification in the Context of Bangladeshi Patients: A Supervised Machine Learning Approach with Explainable AI**
+2404.04686v1 by Taminul Islam, Md. Alif Sheakh, Mst. Sazia Tahosin, Most. Hasna Hena, Shopnil Akash, Yousef A. Bin Jardan, Gezahign Fentahun Wondmie, Hiba-Allah Nafidi, Mohammed Bourhia
 
-摘要：換車道動作，尤其是突然或在風險情況下執行的動作，是道路交通事故的重要原因。然而，目前的研究所主要集中在預測安全的換車道。此外，現有的事故資料集通常僅基於影像，且缺乏全面的感測資料。在這項工作中，我們專注於使用 CRASH 資料集（我們自己收集的專門針對風險換車道資料集）來預測風險換車道，以及安全換車道（使用 HighD 資料集）。然後，我們利用 KG 和貝氏推理來使用語言背景資訊預測這些動作，增強模型的可解釋性和透明度。該模型在風險換車道的預測時間延長至四秒時，達到了 91.5% 的 f1 分數，在預測安全換車道時，在相同的預測時間內達到了 90.0% 的 f1 分數。我們透過將模型整合到 CARLA 模擬器中的車輛中，在涉及風險換車道的場景中驗證我們的模型。該模型設法預測突然的換車道，從而為自動駕駛車輛提供了更多時間來規劃和執行適當的安全反應。最後，為了增強我們模型的可解釋性，我們利用 RAG 為給定的預測提供清晰且自然的語言解釋。
+Breast cancer has rapidly increased in prevalence in recent years, making it
+one of the leading causes of mortality worldwide. Among all cancers, it is by
+far the most common. Diagnosing this illness manually requires significant time
+and expertise. Since detecting breast cancer is a time-consuming process,
+preventing its further spread can be aided by creating machine-based forecasts.
+Machine learning and Explainable AI are crucial in classification as they not
+only provide accurate predictions but also offer insights into how the model
+arrives at its decisions, aiding in the understanding and trustworthiness of
+the classification results. In this study, we evaluate and compare the
+classification accuracy, precision, recall, and F-1 scores of five different
+machine learning methods using a primary dataset (500 patients from Dhaka
+Medical College Hospital). Five different supervised machine learning
+techniques, including decision tree, random forest, logistic regression, naive
+bayes, and XGBoost, have been used to achieve optimal results on our dataset.
+Additionally, this study applied SHAP analysis to the XGBoost model to
+interpret the model's predictions and understand the impact of each feature on
+the model's output. We compared the accuracy with which several algorithms
+classified the data, as well as contrasted with other literature in this field.
+After final evaluation, this study found that XGBoost achieved the best model
+accuracy, which is 97%.
 
-##### **Each Graph is a New Language: Graph Learning with LLMs**
-2501.11478v2 by Huachi Zhou, Jiahe Du, Chuang Zhou, Chang Yang, Yilin Xiao, Yuxuan Xie, Xiao Huang
+摘要：<paragraph>近年來，乳癌的盛行率迅速增加，使其成為全球主要的死亡原因之一。在所有癌症中，乳癌迄今為止是最常見的。手動診斷此疾病需要大量的時間和專業知識。由於乳癌的檢測過程耗時，因此透過建立機器學習模型來預測，有助於防止其進一步擴散。機器學習和可解釋 AI 在分類中至關重要，因為它們不僅可以提供準確的預測，還可以深入了解模型如何做出決策，有助於理解和信賴分類結果。在此研究中，我們評估並比較了五種不同的機器學習方法的分類準確度、精確度、召回率和 F1 分數，使用了一個主要的資料集（達卡醫學院醫院的 500 名患者）。五種不同的監督式機器學習技術，包括決策樹、隨機森林、邏輯迴歸、朴素貝氏和 XGBoost，已用於在我們的資料集上取得最佳結果。此外，本研究將 SHAP 分析應用於 XGBoost 模型，以解釋模型的預測並了解每個特徵對模型輸出的影響。我們比較了幾種演算法對資料進行分類的準確度，並與該領域的其他文獻進行對比。在最後評估後，本研究發現 XGBoost 達到了最佳的模型準確度，為 97%。</paragraph>
 
-Recent efforts leverage Large Language Models (LLMs) for modeling
-text-attributed graph structures in node classification tasks. These approaches
-describe graph structures for LLMs to understand or aggregate LLM-generated
-textual attribute embeddings through graph structure. However, these approaches
-face two main limitations in modeling graph structures with LLMs. (i) Graph
-descriptions become verbose in describing high-order graph structure. (ii)
-Textual attributes alone do not contain adequate graph structure information.
-It is challenging to model graph structure concisely and adequately with LLMs.
-LLMs lack built-in mechanisms to model graph structures directly. They also
-struggle with complex long-range dependencies between high-order nodes and
-target nodes.
-  Inspired by the observation that LLMs pre-trained on one language can achieve
-exceptional performance on another with minimal additional training, we propose
-\textbf{G}raph-\textbf{D}efined \textbf{L}anguage for \textbf{L}arge
-\textbf{L}anguage \textbf{M}odel (GDL4LLM). This novel framework enables LLMs
-to transfer their powerful language understanding capabilities to
-graph-structured data. GDL4LLM translates graphs into a graph language corpus
-instead of graph descriptions and pre-trains LLMs on this corpus to adequately
-understand graph structures. During fine-tuning, this corpus describes the
-structural information of target nodes concisely with only a few tokens. By
-treating graphs as a new language, GDL4LLM enables LLMs to model graph
-structures adequately and concisely for node classification tasks. Extensive
-experiments on three real-world datasets demonstrate that GDL4LLM outperforms
-description-based and textual attribute embeddings-based baselines by
-efficiently modeling different orders of graph structure with LLMs.
+##### **Enhancing Breast Cancer Diagnosis in Mammography: Evaluation and Integration of Convolutional Neural Networks and Explainable AI**
+2404.03892v3 by Maryam Ahmed, Tooba Bibi, Rizwan Ahmed Khan, Sidra Nasir
 
-摘要：<paragraph>最近的研究利用大型语言模型 (LLM) 对节点分类任务中的文本属性图结构进行建模。这些方法描述图结构，以便 LLM 理解或通过图结构聚合 LLM 生成的文本属性嵌入。然而，这些方法在使用 LLM 对图结构进行建模时面临两个主要限制。(i) 图描述在描述高阶图结构时变得冗长。(ii) 仅文本属性不包含足够的图结构信息。使用 LLM 对图结构进行简洁且充分的建模具有挑战性。LLM 缺乏直接对图结构进行建模的内置机制。它们还难以处理高阶节点和目标节点之间复杂的远程依赖关系。
-受 LLM 在一种语言上进行预训练后，只需进行最少的额外训练即可在另一种语言上实现卓越性能的观察结果的启发，我们提出了**G**raph-**D**efined **L**anguage for **L**arge **L**anguage **M**odel (GDL4LLM)。此新框架使 LLM 能够将其强大的语言理解能力转移到结构化数据图。GDL4LLM 将图翻译成图语言语料库，而不是图描述，并在该语料库上对 LLM 进行预训练，以充分理解图结构。在微调期间，此语料库仅使用几个标记简洁地描述目标节点的结构信息。通过将图视为一种新语言，GDL4LLM 使 LLM 能够充分且简洁地对图结构进行建模，以用于节点分类任务。在三个真实世界数据集上进行的广泛实验表明，GDL4LLM 通过使用 LLM 有效地对不同阶的图结构进行建模，优于基于描述和基于文本属性嵌入的基线。</paragraph>
+The Deep learning (DL) models for diagnosing breast cancer from mammographic
+images often operate as "black boxes", making it difficult for healthcare
+professionals to trust and understand their decision-making processes. The
+study presents an integrated framework combining Convolutional Neural Networks
+(CNNs) and Explainable Artificial Intelligence (XAI) for the enhanced diagnosis
+of breast cancer using the CBIS-DDSM dataset. The methodology encompasses an
+elaborate data preprocessing pipeline and advanced data augmentation techniques
+to counteract dataset limitations and transfer learning using pre-trained
+networks such as VGG-16, Inception-V3 and ResNet was employed. A focal point of
+our study is the evaluation of XAI's effectiveness in interpreting model
+predictions, highlighted by utilizing the Hausdorff measure to assess the
+alignment between AI-generated explanations and expert annotations
+quantitatively. This approach is critical for XAI in promoting trustworthiness
+and ethical fairness in AI-assisted diagnostics. The findings from our research
+illustrate the effective collaboration between CNNs and XAI in advancing
+diagnostic methods for breast cancer, thereby facilitating a more seamless
+integration of advanced AI technologies within clinical settings. By enhancing
+the interpretability of AI driven decisions, this work lays the groundwork for
+improved collaboration between AI systems and medical practitioners, ultimately
+enriching patient care. Furthermore, the implications of our research extended
+well beyond the current methodologies. It encourages further research into how
+to combine multimodal data and improve AI explanations to meet the needs of
+clinical practice.
 
-##### **Few-shot Policy (de)composition in Conversational Question Answering**
-2501.11335v1 by Kyle Erwin, Guy Axelrod, Maria Chang, Achille Fokoue, Maxwell Crouse, Soham Dan, Tian Gao, Rosario Uceda-Sosa, Ndivhuwo Makondo, Naweed Khan, Alexander Gray
+摘要：深度學習 (DL) 用於從乳房攝影術影像診斷乳癌的模型通常以「黑盒子」方式運作，這使得醫療保健專業人員難以信任和理解其決策過程。本研究提出一個整合架構，結合卷積神經網路 (CNN) 和可解釋人工智慧 (XAI)，以使用 CBIS-DDSM 資料集增強乳癌的診斷。方法包含一個精細的資料前處理管線和進階資料擴充技術，以對抗資料集限制，並採用預先訓練的網路（例如 VGG-16、Inception-V3 和 ResNet）進行遷移學習。我們研究的重點是評估 XAI 在解釋模型預測中的有效性，重點利用豪斯多夫測度量化評估 AI 生成的解釋和專家註解之間的一致性。這種方法對於 XAI 在促進 AI 輔助診斷中的可信度和倫理公平性至關重要。我們研究的發現說明了 CNN 和 XAI 在推進乳癌診斷方法中的有效協作，從而促進了先進 AI 技術在臨床環境中的更順暢整合。透過增強 AI 驅動決策的可解釋性，這項工作為 AI 系統和醫療從業人員之間的改善協作奠定了基礎，最終豐富了患者照護。此外，我們研究的影響遠遠超出了目前的技術。它鼓勵進一步研究如何結合多模式資料並改善 AI 解釋，以滿足臨床實務的需求。
 
-The task of policy compliance detection (PCD) is to determine if a scenario
-is in compliance with respect to a set of written policies. In a conversational
-setting, the results of PCD can indicate if clarifying questions must be asked
-to determine compliance status. Existing approaches usually claim to have
-reasoning capabilities that are latent or require a large amount of annotated
-data. In this work, we propose logical decomposition for policy compliance
-(LDPC): a neuro-symbolic framework to detect policy compliance using large
-language models (LLMs) in a few-shot setting. By selecting only a few exemplars
-alongside recently developed prompting techniques, we demonstrate that our
-approach soundly reasons about policy compliance conversations by extracting
-sub-questions to be answered, assigning truth values from contextual
-information, and explicitly producing a set of logic statements from the given
-policies. The formulation of explicit logic graphs can in turn help answer
-PCDrelated questions with increased transparency and explainability. We apply
-this approach to the popular PCD and conversational machine reading benchmark,
-ShARC, and show competitive performance with no task-specific finetuning. We
-also leverage the inherently interpretable architecture of LDPC to understand
-where errors occur, revealing ambiguities in the ShARC dataset and highlighting
-the challenges involved with reasoning for conversational question answering.
+##### **Advancing Multimodal Data Fusion in Pain Recognition: A Strategy Leveraging Statistical Correlation and Human-Centered Perspectives**
+2404.00320v2 by Xingrui Gu, Zhixuan Wang, Irisa Jin, Zekun Wu
 
-摘要：策略合規偵測 (PCD) 的任務是確定場景是否符合一組書面策略。在對話設定中，PCD 的結果可以指出是否必須提出澄清問題以確定合規狀態。現有的方法通常聲稱具有潛在的推理能力，或需要大量的註釋資料。在這項工作中，我們提出策略合規的邏輯分解 (LDPC)：一種使用大型語言模型 (LLM) 在少次嘗試中偵測策略合規的神經符號框架。透過僅選擇少數範例以及最近開發的提示技術，我們證明我們的做法透過提取要回答的子問題、從脈絡資訊指派真值，以及從給定的策略明確產生一組邏輯陳述，對策略合規對話進行合理的推理。明確邏輯圖表的制定反過來可以幫助回答 PCD 相關問題，並提高透明度和可解釋性。我們將此方法應用於熱門的 PCD 和對話式機器閱讀基準 ShARC，並在沒有特定任務微調的情況下展現出競爭力。我們也利用 LDPC 固有的可解釋架構來了解錯誤發生在哪裡，揭露 ShARC 資料集中的歧義，並強調對話式問題解答推理的挑戰。
+This research presents a novel multimodal data fusion methodology for pain
+behavior recognition, integrating statistical correlation analysis with
+human-centered insights. Our approach introduces two key innovations: 1)
+integrating data-driven statistical relevance weights into the fusion strategy
+to effectively utilize complementary information from heterogeneous modalities,
+and 2) incorporating human-centric movement characteristics into multimodal
+representation learning for detailed modeling of pain behaviors. Validated
+across various deep learning architectures, our method demonstrates superior
+performance and broad applicability. We propose a customizable framework that
+aligns each modality with a suitable classifier based on statistical
+significance, advancing personalized and effective multimodal fusion.
+Furthermore, our methodology provides explainable analysis of multimodal data,
+contributing to interpretable and explainable AI in healthcare. By highlighting
+the importance of data diversity and modality-specific representations, we
+enhance traditional fusion techniques and set new standards for recognizing
+complex pain behaviors. Our findings have significant implications for
+promoting patient-centered healthcare interventions and supporting explainable
+clinical decision-making.
 
-##### **Reasoning Language Models: A Blueprint**
-2501.11223v3 by Maciej Besta, Julia Barth, Eric Schreiber, Ales Kubicek, Afonso Catarino, Robert Gerstenberger, Piotr Nyczyk, Patrick Iff, Yueling Li, Sam Houliston, Tomasz Sternal, Marcin Copik, Grzegorz Kwaśniewski, Jürgen Müller, Łukasz Flis, Hannes Eberhard, Hubert Niewiadomski, Torsten Hoefler
+摘要：本研究提出了一種創新的多模態數據融合方法，用於疼痛行為識別，將統計相關分析與以人為中心的見解相結合。我們的做法引入了兩項關鍵創新：1) 將數據驅動的統計相關權重整合到融合策略中，以有效利用來自異質模態的補充信息，以及 2) 將以人為中心的運動特徵納入多模態表示學習中，以詳細建模疼痛行為。我們的模型在各種深度學習架構中得到驗證，展示了卓越的性能和廣泛的適用性。我們提出了一個可自定義的框架，根據統計顯著性將每個模態與合適的分類器對齊，推進個性化和有效的多模態融合。此外，我們的模型提供對多模態數據的可解釋分析，有助於醫療保健中的可解釋和可解釋 AI。通過強調數據多樣性和模態特定表示的重要性，我們增強了傳統的融合技術，並為識別複雜的疼痛行為設定了新的標準。我們的發現對促進以患者為中心的醫療保健干預和支持可解釋的臨床決策制定具有重要意義。
 
-Reasoning language models (RLMs), also known as Large Reasoning Models
-(LRMs), such as OpenAI's o1 and o3, DeepSeek-V3, and Alibaba's QwQ, have
-redefined AI's problem-solving capabilities by extending LLMs with advanced
-reasoning mechanisms. Yet, their high costs, proprietary nature, and complex
-architectures - uniquely combining Reinforcement Learning (RL), search
-heuristics, and LLMs - present accessibility and scalability challenges. To
-address these, we propose a comprehensive blueprint that organizes RLM
-components into a modular framework, based on a survey and analysis of all RLM
-works. This blueprint incorporates diverse reasoning structures (chains, trees,
-graphs, and nested forms), reasoning strategies (e.g., Monte Carlo Tree Search,
-Beam Search), RL concepts (policy, value models and others), supervision
-schemes (Outcome-Based and Process-Based Supervision), and other related
-concepts (e.g., Test-Time Compute, Retrieval-Augmented Generation, agent
-tools). We also provide detailed mathematical formulations and algorithmic
-specifications to simplify RLM implementation. By showing how schemes like
-LLaMA-Berry, QwQ, Journey Learning, and Graph of Thoughts fit as special cases,
-we demonstrate the blueprint's versatility and unifying potential. To
-illustrate its utility, we introduce x1, a modular implementation for rapid RLM
-prototyping and experimentation. Using x1 and a literature review, we provide
-key insights, such as multi-phase training for policy and value models, and the
-importance of familiar training distributions. Finally, we discuss scalable RLM
-cloud deployments and we outline how RLMs can integrate with a broader LLM
-ecosystem. Our work demystifies RLM construction, democratizes advanced
-reasoning capabilities, and fosters innovation, aiming to mitigate the gap
-between "rich AI" and "poor AI" by lowering barriers to RLM design and
-experimentation.
+##### **Addressing Social Misattributions of Large Language Models: An HCXAI-based Approach**
+2403.17873v1 by Andrea Ferrario, Alberto Termine, Alessandro Facchini
+
+Human-centered explainable AI (HCXAI) advocates for the integration of social
+aspects into AI explanations. Central to the HCXAI discourse is the Social
+Transparency (ST) framework, which aims to make the socio-organizational
+context of AI systems accessible to their users. In this work, we suggest
+extending the ST framework to address the risks of social misattributions in
+Large Language Models (LLMs), particularly in sensitive areas like mental
+health. In fact LLMs, which are remarkably capable of simulating roles and
+personas, may lead to mismatches between designers' intentions and users'
+perceptions of social attributes, risking to promote emotional manipulation and
+dangerous behaviors, cases of epistemic injustice, and unwarranted trust. To
+address these issues, we propose enhancing the ST framework with a fifth
+'W-question' to clarify the specific social attributions assigned to LLMs by
+its designers and users. This addition aims to bridge the gap between LLM
+capabilities and user perceptions, promoting the ethically responsible
+development and use of LLM-based technology.
+
+摘要：以人为本的可解释 AI (HCXAI) 倡导将社会层面整合到 AI 解释中。HCXAI 话语的核心是社会透明度 (ST) 框架，其目标是让 AI 系统的社会组织背景对用户来说是可理解的。在这项工作中，我们建议扩展 ST 框架以解决大型语言模型 (LLM) 中社会错误归因的风险，尤其是在心理健康等敏感领域。事实上，LLM 能够出色地模拟角色和人格，这可能导致设计者的意图和用户对社会属性的认知之间出现错配，从而有风险促进情绪操纵和危险行为、认知不公正和不合理的信任。为了解决这些问题，我们建议用第五个“W 问题”来增强 ST 框架，以明确设计者和用户赋予 LLM 的具体社会属性。此补充旨在弥合 LLM 能力和用户认知之间的差距，促进基于 LLM 的技术在道德上负责任地开发和使用。
+
+##### **Clinical Domain Knowledge-Derived Template Improves Post Hoc AI Explanations in Pneumothorax Classification**
+2403.18871v1 by Han Yuan, Chuan Hong, Pengtao Jiang, Gangming Zhao, Nguyen Tuan Anh Tran, Xinxing Xu, Yet Yen Yan, Nan Liu
 
-摘要：推理語言模型 (RLM)，又稱為大型推理模型 (LRM)，例如 OpenAI 的 o1 和 o3、DeepSeek-V3 以及阿里巴巴的 QwQ，透過擴充 LLM 的先進推理機制，重新定義了 AI 的問題解決能力。然而，它們的高成本、專有性質和複雜架構（獨特地結合了強化學習 (RL)、搜尋啟發法和 LLM）提出了可及性和可擴充性的挑戰。為了解決這些問題，我們提出了一個全面的藍圖，將 RLM 組件組織成一個模組化架構，這是基於對所有 RLM 作品的調查和分析。此藍圖包含多樣化的推理結構（鏈、樹、圖和巢狀形式）、推理策略（例如蒙地卡羅樹搜尋、波束搜尋）、RL 概念（策略、價值模型等）、監督方案（基於結果和基於流程的監督）和其他相關概念（例如測試時間運算、檢索增強生成、代理工具）。我們還提供了詳細的數學公式和演算法規範，以簡化 RLM 的實作。透過展示 LLaMA-Berry、QwQ、Journey Learning 和 Graph of Thoughts 等方案如何作為特殊情況，我們展示了藍圖的多功能性和統一潛力。為了說明其效用，我們介紹了 x1，這是一個模組化實作，用於快速 RLM 原型製作和實驗。使用 x1 和文獻回顧，我們提供了關鍵見解，例如策略和價值模型的多階段訓練，以及熟悉訓練分佈的重要性。最後，我們討論了可擴充的 RLM 雲端部署，並概述了 RLM 如何與更廣泛的 LLM 生態系統整合。我們的研究揭開了 RLM 建構的神秘面紗，使先進的推理能力民主化，並促進創新，旨在透過降低 RLM 設計和實驗的障礙，來縮小「富裕 AI」和「貧窮 AI」之間的差距。
+Background: Pneumothorax is an acute thoracic disease caused by abnormal air
+collection between the lungs and chest wall. To address the opaqueness often
+associated with deep learning (DL) models, explainable artificial intelligence
+(XAI) methods have been introduced to outline regions related to pneumothorax
+diagnoses made by DL models. However, these explanations sometimes diverge from
+actual lesion areas, highlighting the need for further improvement. Method: We
+propose a template-guided approach to incorporate the clinical knowledge of
+pneumothorax into model explanations generated by XAI methods, thereby
+enhancing the quality of these explanations. Utilizing one lesion delineation
+created by radiologists, our approach first generates a template that
+represents potential areas of pneumothorax occurrence. This template is then
+superimposed on model explanations to filter out extraneous explanations that
+fall outside the template's boundaries. To validate its efficacy, we carried
+out a comparative analysis of three XAI methods with and without our template
+guidance when explaining two DL models in two real-world datasets. Results: The
+proposed approach consistently improved baseline XAI methods across twelve
+benchmark scenarios built on three XAI methods, two DL models, and two
+datasets. The average incremental percentages, calculated by the performance
+improvements over the baseline performance, were 97.8% in Intersection over
+Union (IoU) and 94.1% in Dice Similarity Coefficient (DSC) when comparing model
+explanations and ground-truth lesion areas. Conclusions: In the context of
+pneumothorax diagnoses, we proposed a template-guided approach for improving AI
+explanations. We anticipate that our template guidance will forge a fresh
+approach to elucidating AI models by integrating clinical domain expertise.
 
-##### **IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems**
-2501.11067v1 by Elad Levi, Ilan Kadar
+摘要：<paragraph>背景：氣胸是一種因肺部與胸壁之間異常集氣所引起的急性胸腔疾病。為了解決深度學習（DL）模型經常伴隨的不透明性，可解釋人工智慧（XAI）方法已被引入，用於概述與 DL 模型做出的氣胸診斷相關的區域。然而，這些解釋有時會與實際病灶區域有所出入，突顯出進一步改進的必要性。方法：我們提出了一種模板引導式方法，將氣胸的臨床知識納入 XAI 方法產生的模型解釋中，從而提升這些解釋的品質。利用放射科醫師建立的病灶描繪，我們的做法首先產生一個模板，用於表示氣胸可能發生的區域。然後將此模板疊加在模型解釋上，以篩選出超出模板邊界的無關解釋。為了驗證其效力，我們對三種 XAI 方法進行了比較分析，在兩個真實世界資料集中解釋兩個 DL 模型時，分別採用和不採用我們的模板引導。結果：所提出的方法在建立於三種 XAI 方法、兩個 DL 模型和兩個資料集的十二種基準情境中，始終改善了基準 XAI 方法。在比較模型解釋和真實病灶區域時，透過基準效能的效能改進計算出的平均增量百分比為交集比（IoU）的 97.8% 和骰子相似性係數（DSC）的 94.1%。結論：在氣胸診斷的背景下，我們提出了一種模板引導式方法，用於改善 AI 解釋。我們預期我們的模板引導將透過整合臨床領域專業知識，為闡明 AI 模型建立一種新方法。</paragraph>
 
-Large Language Models (LLMs) are transforming artificial intelligence,
-evolving into task-oriented systems capable of autonomous planning and
-execution. One of the primary applications of LLMs is conversational AI
-systems, which must navigate multi-turn dialogues, integrate domain-specific
-APIs, and adhere to strict policy constraints. However, evaluating these agents
-remains a significant challenge, as traditional methods fail to capture the
-complexity and variability of real-world interactions. We introduce
-IntellAgent, a scalable, open-source multi-agent framework designed to evaluate
-conversational AI systems comprehensively. IntellAgent automates the creation
-of diverse, synthetic benchmarks by combining policy-driven graph modeling,
-realistic event generation, and interactive user-agent simulations. This
-innovative approach provides fine-grained diagnostics, addressing the
-limitations of static and manually curated benchmarks with coarse-grained
-metrics. IntellAgent represents a paradigm shift in evaluating conversational
-AI. By simulating realistic, multi-policy scenarios across varying levels of
-complexity, IntellAgent captures the nuanced interplay of agent capabilities
-and policy constraints. Unlike traditional methods, it employs a graph-based
-policy model to represent relationships, likelihoods, and complexities of
-policy interactions, enabling highly detailed diagnostics. IntellAgent also
-identifies critical performance gaps, offering actionable insights for targeted
-optimization. Its modular, open-source design supports seamless integration of
-new domains, policies, and APIs, fostering reproducibility and community
-collaboration. Our findings demonstrate that IntellAgent serves as an effective
-framework for advancing conversational AI by addressing challenges in bridging
-research and deployment. The framework is available at
-https://github.com/plurai-ai/intellagent
+##### **Enhancing Neural Machine Translation of Low-Resource Languages: Corpus Development, Human Evaluation and Explainable AI Architectures**
+2403.01580v1 by Séamus Lankford
 
-摘要：大型語言模型 (LLM) 正在轉變人工智慧，演變成具備自主規劃和執行能力的任務導向系統。LLM 的主要應用之一是對話式 AI 系統，它必須應對多輪對話、整合特定領域的 API，並遵守嚴格的政策約束。然而，評估這些代理仍然是一項重大挑戰，因為傳統方法無法捕捉現實世界互動的複雜性和變異性。我們引入了 IntellAgent，一個可擴充、開放原始碼的多代理架構，旨在全面評估對話式 AI 系統。IntellAgent 自動化建立多樣化、合成的基準，方法是結合策略驅動的圖形建模、逼真的事件產生和互動使用者代理模擬。這種創新方法提供了細緻的診斷，解決了具有粗略指標的靜態和手動策劃基準的限制。IntellAgent 代表了評估對話式 AI 的典範轉移。通過模擬不同層級複雜性的逼真多策略場景，IntellAgent 捕捉到了代理功能和策略約束之間的細微交互。與傳統方法不同，它採用基於圖形的策略模型來表示策略交互的關係、可能性和複雜性，從而實現高度詳細的診斷。IntellAgent 還識別出關鍵效能差距，提供可行的見解，以進行目標最佳化。其模組化、開放原始碼的設計支援無縫整合新的領域、策略和 API，促進了可複製性和社群協作。我們的研究結果表明，IntellAgent 可作為一個有效的框架，透過解決研究和部署之間的挑戰來推進對話式 AI。這個框架可在 https://github.com/plurai-ai/intellagent 取得
+In the current machine translation (MT) landscape, the Transformer
+architecture stands out as the gold standard, especially for high-resource
+language pairs. This research delves into its efficacy for low-resource
+language pairs including both the English$\leftrightarrow$Irish and
+English$\leftrightarrow$Marathi language pairs. Notably, the study identifies
+the optimal hyperparameters and subword model type to significantly improve the
+translation quality of Transformer models for low-resource language pairs.
+  The scarcity of parallel datasets for low-resource languages can hinder MT
+development. To address this, gaHealth was developed, the first bilingual
+corpus of health data for the Irish language. Focusing on the health domain,
+models developed using this in-domain dataset exhibited very significant
+improvements in BLEU score when compared with models from the LoResMT2021
+Shared Task. A subsequent human evaluation using the multidimensional quality
+metrics error taxonomy showcased the superior performance of the Transformer
+system in reducing both accuracy and fluency errors compared to an RNN-based
+counterpart.
+  Furthermore, this thesis introduces adaptNMT and adaptMLLM, two open-source
+applications streamlined for the development, fine-tuning, and deployment of
+neural machine translation models. These tools considerably simplify the setup
+and evaluation process, making MT more accessible to both developers and
+translators. Notably, adaptNMT, grounded in the OpenNMT ecosystem, promotes
+eco-friendly natural language processing research by highlighting the
+environmental footprint of model development. Fine-tuning of MLLMs by adaptMLLM
+demonstrated advancements in translation performance for two low-resource
+language pairs: English$\leftrightarrow$Irish and
+English$\leftrightarrow$Marathi, compared to baselines from the LoResMT2021
+Shared Task.
 
+摘要：<paragraph>在當前機器翻譯 (MT) 領域中，Transformer 架構脫穎而出，成為黃金標準，特別是對於高資源語言對。本研究探討其對低資源語言對的效能，包括英語↔愛爾蘭語和英語↔馬拉地語語言對。值得注意的是，本研究識別出最佳超參數和子詞模型類型，以顯著提高 Transformer 模型對低資源語言對的翻譯品質。
+低資源語言的平行資料集的稀缺會阻礙 MT 的發展。為了解決這個問題，開發了 gaHealth，這是愛爾蘭語的第一個雙語健康資料語料庫。專注於健康領域，使用此域內資料集開發的模型在 BLEU 得分方面表現出非常顯著的進步，與 LoResMT2021 共享任務中的模型相比。隨後使用多維品質指標錯誤分類法進行的人工評估顯示，與基於 RNN 的對應模型相比，Transformer 系統在減少準確性和流暢性錯誤方面表現出優異的性能。
+此外，本論文介紹了 adaptNMT 和 adaptMLLM，這兩個開源應用程式簡化了神經機器翻譯模型的開發、微調和部署。這些工具大幅簡化了設定和評估流程，讓 MT 更容易讓開發人員和翻譯人員使用。值得注意的是，adaptNMT 以 OpenNMT 生態系統為基礎，通過強調模型開發的環境足跡來促進生態友好的自然語言處理研究。與 LoResMT2021 共享任務中的基準相比，adaptMLLM 對 MLLM 的微調證明了英語↔愛爾蘭語和英語↔馬拉地語這兩個低資源語言對的翻譯性能進步。</paragraph>
 
-### LLM
-|Publish Date|Title|Authors|Homepage|Code|
-| :---: | :---: | :---: | :---: | :---: |
-|**2025-02-13**|**Theoretical Benefit and Limitation of Diffusion Language Model**|Guhao Feng et.al.|[2502.09622v1](http://arxiv.org/abs/2502.09622v1)|null|
-|**2025-02-13**|**MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency**|Dongzhi Jiang et.al.|[2502.09621v1](http://arxiv.org/abs/2502.09621v1)|null|
-|**2025-02-13**|**Exploring the Potential of Encoder-free Architectures in 3D LMMs**|Yiwen Tang et.al.|[2502.09620v1](http://arxiv.org/abs/2502.09620v1)|null|
-|**2025-02-13**|**DexTrack: Towards Generalizable Neural Tracking Control for Dexterous Manipulation from Human References**|Xueyi Liu et.al.|[2502.09614v1](http://arxiv.org/abs/2502.09614v1)|null|
-|**2025-02-13**|**Score-of-Mixture Training: Training One-Step Generative Models Made Simple**|Tejas Jayashankar et.al.|[2502.09609v1](http://arxiv.org/abs/2502.09609v1)|null|
-|**2025-02-13**|**Human-LLM Coevolution: Evidence from Academic Writing**|Mingmeng Geng et.al.|[2502.09606v1](http://arxiv.org/abs/2502.09606v1)|null|
-|**2025-02-13**|**SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models**|Yung-Sung Chuang et.al.|[2502.09604v1](http://arxiv.org/abs/2502.09604v1)|null|
-|**2025-02-13**|**CoT-Valve: Length-Compressible Chain-of-Thought Tuning**|Xinyin Ma et.al.|[2502.09601v1](http://arxiv.org/abs/2502.09601v1)|null|
-|**2025-02-13**|**Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs**|Siyan Zhao et.al.|[2502.09597v1](http://arxiv.org/abs/2502.09597v1)|null|
-|**2025-02-13**|**KIMAs: A Configurable Knowledge Integrated Multi-Agent System**|Zitao Li et.al.|[2502.09596v1](http://arxiv.org/abs/2502.09596v1)|null|
-|**2025-02-13**|**Logical forms complement probability in understanding language model (and human) performance**|Yixuan Wang et.al.|[2502.09589v1](http://arxiv.org/abs/2502.09589v1)|null|
-|**2025-02-13**|**Optimizing GPT for Video Understanding: Zero-Shot Performance and Prompt Engineering**|Mark Beliaev et.al.|[2502.09573v1](http://arxiv.org/abs/2502.09573v1)|null|
-|**2025-02-13**|**MorphNLI: A Stepwise Approach to Natural Language Inference Using Text Morphing**|Vlad Andrei Negru et.al.|[2502.09567v1](http://arxiv.org/abs/2502.09567v1)|null|
-|**2025-02-13**|**Zero-shot generation of synthetic neurosurgical data with large language models**|Austin A. Barr et.al.|[2502.09566v1](http://arxiv.org/abs/2502.09566v1)|null|
-|**2025-02-13**|**MDCrow: Automating Molecular Dynamics Workflows with Large Language Models**|Quintina Campbell et.al.|[2502.09565v1](http://arxiv.org/abs/2502.09565v1)|null|
-|**2025-02-13**|**EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents**|Rui Yang et.al.|[2502.09560v1](http://arxiv.org/abs/2502.09560v1)|null|
-|**2025-02-13**|**Mind the Gap! Choice Independence in Using Multilingual LLMs for Persuasive Co-Writing Tasks in Different Languages**|Shreyan Biswas et.al.|[2502.09532v1](http://arxiv.org/abs/2502.09532v1)|null|
-|**2025-02-13**|**Diffusion Models for Molecules: A Survey of Methods and Tasks**|Liang Wang et.al.|[2502.09511v1](http://arxiv.org/abs/2502.09511v1)|null|
-|**2025-02-13**|**AttentionSmithy: A Modular Framework for Rapid Transformer Development and Customization**|Caleb Cranney et.al.|[2502.09503v1](http://arxiv.org/abs/2502.09503v1)|null|
-|**2025-02-13**|**Improve LLM-based Automatic Essay Scoring with Linguistic Features**|Zhaoyi Joey Hou et.al.|[2502.09497v1](http://arxiv.org/abs/2502.09497v1)|null|
-|**2025-02-13**|**Cracking the Code: Enhancing Development finance understanding with artificial intelligence**|Pierre Beaucoral et.al.|[2502.09495v1](http://arxiv.org/abs/2502.09495v1)|null|
-|**2025-02-13**|**Objective quantification of mood states using large language models**|Jakub Onysk et.al.|[2502.09487v1](http://arxiv.org/abs/2502.09487v1)|null|
-|**2025-02-13**|**The Multilingual Mind : A Survey of Multilingual Reasoning in Language Models**|Akash Ghosh et.al.|[2502.09457v1](http://arxiv.org/abs/2502.09457v1)|null|
-|**2025-02-13**|**Pixel-Level Reasoning Segmentation via Multi-turn Conversations**|Dexian Cai et.al.|[2502.09447v1](http://arxiv.org/abs/2502.09447v1)|null|
-|**2025-02-13**|**Dual Formulation for Non-Rectangular Lp Robust Markov Decision Processes**|Navdeep Kumar et.al.|[2502.09432v1](http://arxiv.org/abs/2502.09432v1)|null|
-|**2025-02-13**|**Transformer-Enhanced Variational Autoencoder for Crystal Structure Prediction**|Ziyi Chen et.al.|[2502.09423v1](http://arxiv.org/abs/2502.09423v1)|null|
-|**2025-02-13**|**On multi-token prediction for efficient LLM inference**|Somesh Mehra et.al.|[2502.09419v1](http://arxiv.org/abs/2502.09419v1)|null|
-|**2025-02-13**|**SQuARE: Sequential Question Answering Reasoning Engine for Enhanced Chain-of-Thought in Large Language Models**|Daniel Fleischer et.al.|[2502.09390v1](http://arxiv.org/abs/2502.09390v1)|null|
-|**2025-02-13**|**Truth Knows No Language: Evaluating Truthfulness Beyond English**|Blanca Calvo Figueras et.al.|[2502.09387v1](http://arxiv.org/abs/2502.09387v1)|null|
-|**2025-02-13**|**A Deep Inverse-Mapping Model for a Flapping Robotic Wing**|Hadar Sharvit et.al.|[2502.09378v1](http://arxiv.org/abs/2502.09378v1)|null|
-|**2025-02-13**|**Language Agents as Digital Representatives in Collective Decision-Making**|Daniel Jarrett et.al.|[2502.09369v1](http://arxiv.org/abs/2502.09369v1)|null|
-|**2025-02-13**|**Neural Spatiotemporal Point Processes: Trends and Challenges**|Sumantrak Mukherjee et.al.|[2502.09341v1](http://arxiv.org/abs/2502.09341v1)|null|
-|**2025-02-13**|**Graph Diffusion Network for Drug-Gene Prediction**|Jiayang Wu et.al.|[2502.09335v1](http://arxiv.org/abs/2502.09335v1)|null|
-|**2025-02-13**|**Beyond English: The Impact of Prompt Translation Strategies across Languages and Tasks in Multilingual LLMs**|Itai Mondshine et.al.|[2502.09331v1](http://arxiv.org/abs/2502.09331v1)|null|
-|**2025-02-13**|**A Judge-free LLM Open-ended Generation Benchmark Based on the Distributional Hypothesis**|Kentaro Imajo et.al.|[2502.09316v1](http://arxiv.org/abs/2502.09316v1)|null|
-|**2025-02-13**|**When the LM misunderstood the human chuckled: Analyzing garden path effects in humans and language models**|Samuel Joseph Amouyal et.al.|[2502.09307v1](http://arxiv.org/abs/2502.09307v1)|null|
-|**2025-02-13**|**Indeterminacy in Affective Computing: Considering Meaning and Context in Data Collection Practices**|Bernd Dudzik et.al.|[2502.09294v1](http://arxiv.org/abs/2502.09294v1)|null|
-|**2025-02-13**|**SparQLe: Speech Queries to Text Translation Through LLMs**|Amirbek Djanibekov et.al.|[2502.09284v1](http://arxiv.org/abs/2502.09284v1)|null|
-|**2025-02-13**|**LiSA: Leveraging Link Recommender to Attack Graph Neural Networks via Subgraph Injection**|Wenlun Zhang et.al.|[2502.09271v1](http://arxiv.org/abs/2502.09271v1)|null|
-|**2025-02-13**|**AnomalyGFM: Graph Foundation Model for Zero/Few-shot Anomaly Detection**|Hezhe Qiao et.al.|[2502.09254v1](http://arxiv.org/abs/2502.09254v1)|null|
-|**2025-02-13**|**The Joint Entity-Relation Extraction Model Based on Span and Interactive Fusion Representation for Chinese Medical Texts with Complex Semantics**|Danni Feng et.al.|[2502.09247v1](http://arxiv.org/abs/2502.09247v1)|null|
-|**2025-02-13**|**You Do Not Fully Utilize Transformer's Representation Capacity**|Gleb Gerasimov et.al.|[2502.09245v1](http://arxiv.org/abs/2502.09245v1)|null|
-|**2025-02-13**|**From large language models to multimodal AI: A scoping review on the potential of generative AI in medicine**|Lukas Buess et.al.|[2502.09242v1](http://arxiv.org/abs/2502.09242v1)|null|
-|**2025-02-13**|**Reliable Conversational Agents under ASP Control that Understand Natural Language**|Yankai Zeng et.al.|[2502.09237v1](http://arxiv.org/abs/2502.09237v1)|null|
-|**2025-02-13**|**Commonsense Reasoning-Aided Autonomous Vehicle Systems**|Keegan Kimbrell et.al.|[2502.09233v1](http://arxiv.org/abs/2502.09233v1)|null|
-|**2025-02-13**|**Logical foundations of Smart Contracts**|Kalonji Kalala et.al.|[2502.09232v1](http://arxiv.org/abs/2502.09232v1)|null|
-|**2025-02-13**|**Relating Answer Set Programming and Many-sorted Logics for Formal Verification**|Zachary Hansen et.al.|[2502.09230v1](http://arxiv.org/abs/2502.09230v1)|null|
-|**2025-02-13**|**Computational methods for Dynamic Answer Set Programming**|Susana Hahn et.al.|[2502.09228v1](http://arxiv.org/abs/2502.09228v1)|null|
-|**2025-02-13**|**Generating Causally Compliant Counterfactual Explanations using ASP**|Sopam Dasgupta et.al.|[2502.09226v1](http://arxiv.org/abs/2502.09226v1)|null|
-|**2025-02-13**|**Order-Sorted Intensional Logic: Expressing Subtyping Polymorphism with Typing Assertions and Quantification over Concepts**|Đorđe Marković et.al.|[2502.09224v1](http://arxiv.org/abs/2502.09224v1)|null|
-|**2025-02-13**|**ASP-driven User-interaction with Clinguin**|Alexander Beiser et.al.|[2502.09222v1](http://arxiv.org/abs/2502.09222v1)|null|
-|**2025-02-13**|**Pearce's Characterisation in an Epistemic Domain**|Ezgi Iraz Su et.al.|[2502.09221v1](http://arxiv.org/abs/2502.09221v1)|null|
-|**2025-02-13**|**Graphical Conditions for the Existence, Unicity and Number of Regular Models**|Van-Giang Trinh et.al.|[2502.09220v1](http://arxiv.org/abs/2502.09220v1)|null|
-|**2025-02-13**|**Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**|Flavio Bertini et.al.|[2502.09218v1](http://arxiv.org/abs/2502.09218v1)|null|
-|**2025-02-13**|**Mind the Gaps: Logical English, Prolog, and Multi-agent Systems for Autonomous Vehicles**|Galileo Sartor et.al.|[2502.09216v1](http://arxiv.org/abs/2502.09216v1)|null|
-|**2025-02-13**|**Architecture for Simulating Behavior Mode Changes in Norm-Aware Autonomous Agents**|Sean Glaze et.al.|[2502.09215v1](http://arxiv.org/abs/2502.09215v1)|null|
-|**2025-02-13**|**Neuro-Symbolic Contrastive Learning for Cross-domain Inference**|Mingyue Liu et.al.|[2502.09213v1](http://arxiv.org/abs/2502.09213v1)|null|
-|**2025-02-13**|**LP-LM: No Hallucinations in Question Answering with Logic Programming**|Katherine Wu et.al.|[2502.09212v1](http://arxiv.org/abs/2502.09212v1)|null|
-|**2025-02-13**|**Visual Graph Question Answering with ASP and LLMs for Language Parsing**|Jakob Johannes Bauer et.al.|[2502.09211v1](http://arxiv.org/abs/2502.09211v1)|null|
-|**2025-02-13**|**On LLM-generated Logic Programs and their Inference Execution Methods**|Paul Tarau et.al.|[2502.09209v1](http://arxiv.org/abs/2502.09209v1)|null|
-|**2025-02-13**|**Efficient OWL2QL Meta-reasoning Using ASP-based Hybrid Knowledge Bases**|Haya Majid Qureshi et.al.|[2502.09206v1](http://arxiv.org/abs/2502.09206v1)|null|
-|**2025-02-13**|**Counterfactual Explanations as Plans**|Vaishak Belle et.al.|[2502.09205v1](http://arxiv.org/abs/2502.09205v1)|null|
-|**2025-02-13**|**Logical Lease Litigation: Prolog and LLMs for Rental Law Compliance in New York**|Sanskar Sehgal et.al.|[2502.09204v1](http://arxiv.org/abs/2502.09204v1)|null|
-|**2025-02-13**|**Thinking beyond the anthropomorphic paradigm benefits LLM research**|Lujain Ibrahim et.al.|[2502.09192v1](http://arxiv.org/abs/2502.09192v1)|null|
-|**2025-02-13**|**Matina: A Large-Scale 73B Token Persian Text Corpus**|Sara Bourbour Hosseinbeigi et.al.|[2502.09188v1](http://arxiv.org/abs/2502.09188v1)|null|
-|**2025-02-13**|**RefineCoder: Iterative Improving of Large Language Models via Adaptive Critique Refinement for Code Generation**|Changzhi Zhou et.al.|[2502.09183v1](http://arxiv.org/abs/2502.09183v1)|null|
-|**2025-02-13**|**FLAME: Flexible LLM-Assisted Moderation Engine**|Ivan Bakulin et.al.|[2502.09175v1](http://arxiv.org/abs/2502.09175v1)|null|
-|**2025-02-13**|**Two-Stage Representation Learning for Analyzing Movement Behavior Dynamics in People Living with Dementia**|Jin Cui et.al.|[2502.09173v1](http://arxiv.org/abs/2502.09173v1)|null|
-|**2025-02-13**|**Musical Heritage Historical Entity Linking**|Arianna Graciotti et.al.|[2502.09168v1](http://arxiv.org/abs/2502.09168v1)|null|
-|**2025-02-13**|**Improving TCM Question Answering through Tree-Organized Self-Reflective Retrieval with LLMs**|Chang Liu et.al.|[2502.09156v1](http://arxiv.org/abs/2502.09156v1)|null|
-|**2025-02-13**|**A Novel Dialect-Aware Framework for the Classification of Arabic Dialects and Emotions**|Nasser A Alsadhan et.al.|[2502.09128v1](http://arxiv.org/abs/2502.09128v1)|null|
-|**2025-02-13**|**Automatic Pruning via Structured Lasso with Class-wise Information**|Xiang Liu et.al.|[2502.09125v1](http://arxiv.org/abs/2502.09125v1)|null|
-|**2025-02-13**|**The influence of visual and linguistic cues on ignorance inference in Vision-Language Models (VLMs)**|Ye-eun Cho et.al.|[2502.09120v1](http://arxiv.org/abs/2502.09120v1)|null|
-|**2025-02-13**|**One-shot Federated Learning Methods: A Practical Guide**|Xiang Liu et.al.|[2502.09104v1](http://arxiv.org/abs/2502.09104v1)|null|
-|**2025-02-13**|**Logical Reasoning in Large Language Models: A Survey**|Hanmeng Liu et.al.|[2502.09100v1](http://arxiv.org/abs/2502.09100v1)|null|
-|**2025-02-13**|**A Hybrid Transformer Model for Fake News Detection: Leveraging Bayesian Optimization and Bidirectional Recurrent Unit**|Tianyi Huang et.al.|[2502.09097v1](http://arxiv.org/abs/2502.09097v1)|null|
-|**2025-02-13**|**A Hybrid Model for Few-Shot Text Classification Using Transfer and Meta-Learning**|Jia Gao et.al.|[2502.09086v1](http://arxiv.org/abs/2502.09086v1)|null|
-|**2025-02-13**|**Show Me the Work: Fact-Checkers' Requirements for Explainable Automated Fact-Checking**|Greta Warren et.al.|[2502.09083v1](http://arxiv.org/abs/2502.09083v1)|null|
-|**2025-02-13**|**CoSER: Coordinating LLM-Based Persona Simulation of Established Roles**|Xintao Wang et.al.|[2502.09082v1](http://arxiv.org/abs/2502.09082v1)|null|
-|**2025-02-13**|**Enhancing RAG with Active Learning on Conversation Records: Reject Incapables and Answer Capables**|Xuzhao Geng et.al.|[2502.09073v1](http://arxiv.org/abs/2502.09073v1)|null|
-|**2025-02-13**|**An Open Recipe: Adapting Language-Specific LLMs to a Reasoning Model in One Day via Model Merging**|Kunat Pipatanakul et.al.|[2502.09056v1](http://arxiv.org/abs/2502.09056v1)|null|
-|**2025-02-13**|**Cost-Saving LLM Cascades with Early Abstention**|Michael J. Zellinger et.al.|[2502.09054v1](http://arxiv.org/abs/2502.09054v1)|null|
-|**2025-02-13**|**Game Theory Meets Large Language Models: A Systematic Survey**|Haoran Sun et.al.|[2502.09053v1](http://arxiv.org/abs/2502.09053v1)|null|
-|**2025-02-13**|**AIDE: Agentically Improve Visual Language Model with Domain Experts**|Ming-Chang Chiu et.al.|[2502.09051v1](http://arxiv.org/abs/2502.09051v1)|null|
-|**2025-02-13**|**Leveraging Member-Group Relations via Multi-View Graph Filtering for Effective Group Recommendation**|Chae-Hyun Kim et.al.|[2502.09050v1](http://arxiv.org/abs/2502.09050v1)|null|
-|**2025-02-13**|**Criteria-Aware Graph Filtering: Extremely Fast Yet Accurate Multi-Criteria Recommendation**|Jin-Duk Park et.al.|[2502.09046v1](http://arxiv.org/abs/2502.09046v1)|null|
-|**2025-02-13**|**Typhoon T1: An Open Thai Reasoning Model**|Pittawat Taveekitworachai et.al.|[2502.09042v1](http://arxiv.org/abs/2502.09042v1)|null|
-|**2025-02-13**|**Mechanistic Unveiling of Transformer Circuits: Self-Influence as a Key to Model Reasoning**|Lin Zhang et.al.|[2502.09022v1](http://arxiv.org/abs/2502.09022v1)|null|
-|**2025-02-13**|**EventSTR: A Benchmark Dataset and Baselines for Event Stream based Scene Text Recognition**|Xiao Wang et.al.|[2502.09020v1](http://arxiv.org/abs/2502.09020v1)|null|
-|**2025-02-13**|**Zero-shot Concept Bottleneck Models**|Shin'ya Yamaguchi et.al.|[2502.09018v1](http://arxiv.org/abs/2502.09018v1)|null|
-|**2025-02-13**|**Diversity Enhances an LLM's Performance in RAG and Long-context Task**|Zhchao Wang et.al.|[2502.09017v1](http://arxiv.org/abs/2502.09017v1)|null|
-|**2025-02-13**|**Hope vs. Hate: Understanding User Interactions with LGBTQ+ News Content in Mainstream US News Media through the Lens of Hope Speech**|Jonathan Pofcher et.al.|[2502.09004v1](http://arxiv.org/abs/2502.09004v1)|null|
-|**2025-02-13**|**RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models**|Quan Wei et.al.|[2502.09003v1](http://arxiv.org/abs/2502.09003v1)|null|
-|**2025-02-13**|**PixLift: Accelerating Web Browsing via AI Upscaling**|Yonas Atinafu et.al.|[2502.08995v1](http://arxiv.org/abs/2502.08995v1)|null|
-|**2025-02-13**|**RLSA-PFL: Robust Lightweight Secure Aggregation with Model Inconsistency Detection in Privacy-Preserving Federated Learning**|Nazatul H. Sultan et.al.|[2502.08989v1](http://arxiv.org/abs/2502.08989v1)|null|
-|**2025-02-13**|**Neural Force Field: Learning Generalized Physical Representation from a Few Examples**|Shiqian Li et.al.|[2502.08987v1](http://arxiv.org/abs/2502.08987v1)|null|
-|**2025-02-13**|**Tuning-Free Personalized Alignment via Trial-Error-Explain In-Context Learning**|Hyundong Cho et.al.|[2502.08972v1](http://arxiv.org/abs/2502.08972v1)|null|
-|**2025-02-13**|**RTBAS: Defending LLM Agents Against Prompt Injection and Privacy Leakage**|Peter Yong Zhong et.al.|[2502.08966v1](http://arxiv.org/abs/2502.08966v1)|null|
-|**2025-02-13**|**Biologically Plausible Brain Graph Transformer**|Ciyuan Peng et.al.|[2502.08958v1](http://arxiv.org/abs/2502.08958v1)|null|
-|**2025-02-13**|**Medicine on the Edge: Comparative Performance Analysis of On-Device LLMs for Clinical Reasoning**|Leon Nissen et.al.|[2502.08954v1](http://arxiv.org/abs/2502.08954v1)|null|
+##### **Cause and Effect: Can Large Language Models Truly Understand Causality?**
+2402.18139v3 by Swagata Ashwani, Kshiteesh Hegde, Nishith Reddy Mannuru, Mayank Jindal, Dushyant Singh Sengar, Krishna Chaitanya Rao Kathala, Dishant Banga, Vinija Jain, Aman Chadha
 
-#### Abstracts
-##### **Theoretical Benefit and Limitation of Diffusion Language Model**
-2502.09622v1 by Guhao Feng, Yihan Geng, Jian Guan, Wei Wu, Liwei Wang, Di He
+With the rise of Large Language Models(LLMs), it has become crucial to
+understand their capabilities and limitations in deciphering and explaining the
+complex web of causal relationships that language entails. Current methods use
+either explicit or implicit causal reasoning, yet there is a strong need for a
+unified approach combining both to tackle a wide array of causal relationships
+more effectively. This research proposes a novel architecture called Context
+Aware Reasoning Enhancement with Counterfactual Analysis(CARE CA) framework to
+enhance causal reasoning and explainability. The proposed framework
+incorporates an explicit causal detection module with ConceptNet and
+counterfactual statements, as well as implicit causal detection through LLMs.
+Our framework goes one step further with a layer of counterfactual explanations
+to accentuate LLMs understanding of causality. The knowledge from ConceptNet
+enhances the performance of multiple causal reasoning tasks such as causal
+discovery, causal identification and counterfactual reasoning. The
+counterfactual sentences add explicit knowledge of the not caused by scenarios.
+By combining these powerful modules, our model aims to provide a deeper
+understanding of causal relationships, enabling enhanced interpretability.
+Evaluation of benchmark datasets shows improved performance across all metrics,
+such as accuracy, precision, recall, and F1 scores. We also introduce
+CausalNet, a new dataset accompanied by our code, to facilitate further
+research in this domain.
 
-Diffusion language models have emerged as a promising approach for text
-generation. One would naturally expect this method to be an efficient
-replacement for autoregressive models since multiple tokens can be sampled in
-parallel during each diffusion step. However, its efficiency-accuracy trade-off
-is not yet well understood. In this paper, we present a rigorous theoretical
-analysis of a widely used type of diffusion language model, the Masked
-Diffusion Model (MDM), and find that its effectiveness heavily depends on the
-target evaluation metric. Under mild conditions, we prove that when using
-perplexity as the metric, MDMs can achieve near-optimal perplexity in sampling
-steps regardless of sequence length, demonstrating that efficiency can be
-achieved without sacrificing performance. However, when using the sequence
-error rate--which is important for understanding the "correctness" of a
-sequence, such as a reasoning chain--we show that the required sampling steps
-must scale linearly with sequence length to obtain "correct" sequences, thereby
-eliminating MDM's efficiency advantage over autoregressive models. Our analysis
-establishes the first theoretical foundation for understanding the benefits and
-limitations of MDMs. All theoretical findings are supported by empirical
-studies.
+摘要：隨著大型語言模型 (LLM) 的興起，了解它們在解碼和解釋語言所蘊含的複雜因果關係網路中的能力和限制變得至關重要。目前的技術使用明確或隱含的因果推理，但強烈需要一種統一的方法，結合兩者以更有效地處理廣泛的因果關係。本研究提出了一種稱為情境感知推理增強與反事實分析 (CARE CA) 框架的新架構，以增強因果推理和可解釋性。提出的框架結合了使用 ConceptNet 和反事實陳述的明確因果檢測模組，以及透過 LLM 進行的隱含因果檢測。我們的框架更進一步，加入一層反事實解釋，以強調 LLM 對因果關係的理解。來自 ConceptNet 的知識增強了多項因果推理任務的執行，例如因果發現、因果識別和反事實推理。反事實句加入了未由情境造成的明確知識。透過結合這些強大的模組，我們的模型旨在提供對因果關係更深入的理解，實現增強的可解釋性。基準資料集的評估顯示在所有指標（例如準確度、精確度、召回率和 F1 分數）上都有所提升。我們還引入了 CausalNet，一個新的資料集，並附上了我們的程式碼，以促進在這個領域的進一步研究。
 
-摘要：擴散語言模型已成為文字生成的一種有前途的方法。由於在每個擴散步驟期間可以並行採樣多個符號，因此人們自然會期望這種方法成為自迴歸模型的有效替代方案。然而，它的效率準確性權衡尚未得到很好的理解。在本文中，我們對廣泛使用的擴散語言模型類型，即遮罩擴散模型 (MDM) 進行了嚴格的理論分析，並發現其有效性在很大程度上取決於目標評估指標。在溫和條件下，我們證明了當使用困惑度作為指標時，MDM 可以無論序列長度如何，在採樣步驟中實現近乎最佳的困惑度，這表明可以在不犧牲性能的情況下實現效率。然而，當使用序列錯誤率（對於理解序列的「正確性」很重要，例如推理鏈）時，我們表明所需的採樣步驟必須隨著序列長度線性縮放才能獲得「正確」的序列，從而消除了 MDM 相對於自迴歸模型的效率優勢。我們的分析為理解 MDM 的優點和局限性建立了第一個理論基礎。所有理論發現都得到了實證研究的支持。
+##### **Artificial Intelligence and Diabetes Mellitus: An Inside Look Through the Retina**
+2402.18600v1 by Yasin Sadeghi Bazargani, Majid Mirzaei, Navid Sobhi, Mirsaeed Abdollahi, Ali Jafarizadeh, Siamak Pedrammehr, Roohallah Alizadehsani, Ru San Tan, Sheikh Mohammed Shariful Islam, U. Rajendra Acharya
+
+Diabetes mellitus (DM) predisposes patients to vascular complications.
+Retinal images and vasculature reflect the body's micro- and macrovascular
+health. They can be used to diagnose DM complications, including diabetic
+retinopathy (DR), neuropathy, nephropathy, and atherosclerotic cardiovascular
+disease, as well as forecast the risk of cardiovascular events. Artificial
+intelligence (AI)-enabled systems developed for high-throughput detection of DR
+using digitized retinal images have become clinically adopted. Beyond DR
+screening, AI integration also holds immense potential to address challenges
+associated with the holistic care of the patient with DM. In this work, we aim
+to comprehensively review the literature for studies on AI applications based
+on retinal images related to DM diagnosis, prognostication, and management. We
+will describe the findings of holistic AI-assisted diabetes care, including but
+not limited to DR screening, and discuss barriers to implementing such systems,
+including issues concerning ethics, data privacy, equitable access, and
+explainability. With the ability to evaluate the patient's health status vis a
+vis DM complication as well as risk prognostication of future cardiovascular
+complications, AI-assisted retinal image analysis has the potential to become a
+central tool for modern personalized medicine in patients with DM.
 
-##### **MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency**
-2502.09621v1 by Dongzhi Jiang, Renrui Zhang, Ziyu Guo, Yanwei Li, Yu Qi, Xinyan Chen, Liuhui Wang, Jianhan Jin, Claire Guo, Shen Yan, Bo Zhang, Chaoyou Fu, Peng Gao, Hongsheng Li
+摘要：糖尿病（DM）使患者容易出現血管併發症。
+視網膜影像和血管反映身體的微血管和巨血管健康狀況。它們可用於診斷糖尿病併發症，包括糖尿病視網膜病變（DR）、神經病變、腎病和動脈粥樣硬化性心血管疾病，以及預測心血管事件的風險。為使用數位化視網膜影像進行高通量 DR 檢測而開發的人工智慧（AI）啟用系統已在臨床採用。除了 DR 篩檢外，AI 整合也具有巨大的潛力來應對與糖尿病患者整體照護相關的挑戰。在這項工作中，我們旨在全面回顧基於視網膜影像的 AI 應用相關研究的文獻，這些研究與糖尿病的診斷、預後和管理有關。我們將描述整體 AI 輔助糖尿病照護的發現，包括但不限於 DR 篩檢，並討論實施此類系統的障礙，包括與倫理、資料隱私、公平存取和可解釋性有關的問題。透過評估患者的健康狀況，同時考量糖尿病併發症以及未來心血管併發症的風險預後，AI 輔助視網膜影像分析有潛力成為糖尿病患者現代化個人化醫療的中心工具。
 
-Answering questions with Chain-of-Thought (CoT) has significantly enhanced
-the reasoning capabilities of Large Language Models (LLMs), yet its impact on
-Large Multimodal Models (LMMs) still lacks a systematic assessment and in-depth
-investigation. In this paper, we introduce MME-CoT, a specialized benchmark
-evaluating the CoT reasoning performance of LMMs, spanning six domains: math,
-science, OCR, logic, space-time, and general scenes. As the first comprehensive
-study in this area, we propose a thorough evaluation suite incorporating three
-novel metrics that assess the reasoning quality, robustness, and efficiency at
-a fine-grained level. Leveraging curated high-quality data and a unique
-evaluation strategy, we conduct an in-depth analysis of state-of-the-art LMMs,
-uncovering several key insights: 1) Models with reflection mechanism
-demonstrate a superior CoT quality, with Kimi k1.5 outperforming GPT-4o and
-demonstrating the highest quality results; 2) CoT prompting often degrades LMM
-performance on perception-heavy tasks, suggesting a potentially harmful
-overthinking behavior; and 3) Although the CoT quality is high, LMMs with
-reflection exhibit significant inefficiency in both normal response and
-self-correction phases. We hope MME-CoT serves as a foundation for advancing
-multimodal reasoning in LMMs. Project Page: https://mmecot.github.io/
+##### **Multi-stakeholder Perspective on Responsible Artificial Intelligence and Acceptability in Education**
+2402.15027v2 by A. J. Karran, P. Charland, J-T. Martineau, A. Ortiz de Guinea Lopez de Arana, AM. Lesage, S. Senecal, P-M. Leger
 
-摘要：<paragraph>透過思維鏈（CoT）回答問題，大幅提升了大型語言模型（LLM）的推理能力，但其對大型多模態模型（LMM）的影響仍缺乏系統性的評估和深入探討。在本文中，我們引入了 MME-CoT，一個專門的基準測試，用於評估 LMM 的 CoT 推理效能，涵蓋六個領域：數學、科學、OCR、邏輯、時空和一般場景。作為該領域的第一個全面性研究，我們提出了一個全面的評估套件，包含三個創新的指標，用於評估推理品質、穩健性和效率，並達到細微的層級。透過利用策展的高品質資料和獨特的評估策略，我們對最先進的 LMM 進行深入分析，發現了幾個關鍵見解：1）具有反思機制的模型展現出優異的 CoT 品質，其中 Kimi k1.5 優於 GPT-4o，並展現出最高品質的結果；2）CoT 提示通常會降低 LMM 在感知密集任務上的效能，這表示潛在有害的過度思考行為；3）儘管 CoT 品質很高，但具有反思能力的 LMM 在一般回應和自我修正階段都展現出顯著的低效率。我們希望 MME-CoT 能作為促進 LMM 中多模態推理的基礎。專案頁面：https://mmecot.github.io/</paragraph>
+This study investigates the acceptability of different artificial
+intelligence (AI) applications in education from a multi-stakeholder
+perspective, including students, teachers, and parents. Acknowledging the
+transformative potential of AI in education, it addresses concerns related to
+data privacy, AI agency, transparency, explainability and the ethical
+deployment of AI. Through a vignette methodology, participants were presented
+with four scenarios where AI's agency, transparency, explainability, and
+privacy were manipulated. After each scenario, participants completed a survey
+that captured their perceptions of AI's global utility, individual usefulness,
+justice, confidence, risk, and intention to use each scenario's AI if
+available. The data collection comprising a final sample of 1198
+multi-stakeholder participants was distributed through a partner institution
+and social media campaigns and focused on individual responses to four AI use
+cases. A mediation analysis of the data indicated that acceptance and trust in
+AI varies significantly across stakeholder groups. We found that the key
+mediators between high and low levels of AI's agency, transparency, and
+explainability, as well as the intention to use the different educational AI,
+included perceived global utility, justice, and confidence. The study
+highlights that the acceptance of AI in education is a nuanced and multifaceted
+issue that requires careful consideration of specific AI applications and their
+characteristics, in addition to the diverse stakeholders' perceptions.
 
-##### **Exploring the Potential of Encoder-free Architectures in 3D LMMs**
-2502.09620v1 by Yiwen Tang, Zoey Guo, Zhuhao Wang, Ray Zhang, Qizhi Chen, Junli Liu, Delin Qu, Zhigang Wang, Dong Wang, Xuelong Li, Bin Zhao
+摘要：這項研究從多個利害關係人的角度探討不同的人工智慧 (AI) 應用在教育上的可接受性，包括學生、老師和家長。承認 AI 在教育上的轉型潛力，它解決了與資料隱私、AI 代理、透明度、可解釋性和 AI 的道德部署相關的疑慮。透過小插曲方法，參與者被呈現了四種情境，其中 AI 的代理、透明度、可解釋性和隱私受到操縱。在每個情境後，參與者完成了一項調查，該調查捕捉了他們對 AI 的整體效用、個人效用、正義、信心、風險和如果可用，使用每個情境的 AI 的意圖的看法。資料蒐集包含來自合作機構和社群媒體活動的 1198 位多利害關係人參與者的最終樣本，並專注於對四個 AI 使用案例的個別回應。對資料的調解分析表明，對 AI 的接受度和信任在利害關係人團體之間有顯著差異。我們發現，AI 的代理、透明度和可解釋性高低程度之間的關鍵調解者，以及使用不同教育 AI 的意圖，包括感知到的整體效用、正義和信心。這項研究強調，接受 AI 在教育上的應用是一個微妙且多面向的問題，除了不同的利害關係人的看法外，還需要仔細考慮具體的 AI 應用及其特徵。
 
-Encoder-free architectures have been preliminarily explored in the 2D visual
-domain, yet it remains an open question whether they can be effectively applied
-to 3D understanding scenarios. In this paper, we present the first
-comprehensive investigation into the potential of encoder-free architectures to
-overcome the challenges of encoder-based 3D Large Multimodal Models (LMMs).
-These challenges include the failure to adapt to varying point cloud
-resolutions and the point features from the encoder not meeting the semantic
-needs of Large Language Models (LLMs). We identify key aspects for 3D LMMs to
-remove the encoder and enable the LLM to assume the role of the 3D encoder: 1)
-We propose the LLM-embedded Semantic Encoding strategy in the pre-training
-stage, exploring the effects of various point cloud self-supervised losses. And
-we present the Hybrid Semantic Loss to extract high-level semantics. 2) We
-introduce the Hierarchical Geometry Aggregation strategy in the instruction
-tuning stage. This incorporates inductive bias into the LLM early layers to
-focus on the local details of the point clouds. To the end, we present the
-first Encoder-free 3D LMM, ENEL. Our 7B model rivals the current
-state-of-the-art model, ShapeLLM-13B, achieving 55.0%, 50.92%, and 42.7% on the
-classification, captioning, and VQA tasks, respectively. Our results
-demonstrate that the encoder-free architecture is highly promising for
-replacing encoder-based architectures in the field of 3D understanding. The
-code is released at https://github.com/Ivan-Tang-3D/ENEL
+##### **Deciphering Heartbeat Signatures: A Vision Transformer Approach to Explainable Atrial Fibrillation Detection from ECG Signals**
+2402.09474v2 by Aruna Mohan, Danne Elbers, Or Zilbershot, Fatemeh Afghah, David Vorchheimer
 
-摘要：<paragraph>編碼器免費架構已在 2D 視覺領域中初步探索，但它們是否能有效應用於 3D 理解場景仍是一個開放的問題。在本文中，我們提出了對編碼器免費架構潛力的首次全面調查，以克服基於編碼器的 3D 大型多模態模型 (LMM) 的挑戰。這些挑戰包括無法適應不同的點雲解析度，且來自編碼器的點特徵無法滿足大型語言模型 (LLM) 的語義需求。我們識別出 3D LMM 的關鍵方面，以移除編碼器並讓 LLM 承擔 3D 編碼器的角色：1) 我們在預訓練階段提出 LLM 嵌入式語義編碼策略，探索各種點雲自我監督損失的影響。我們提出混合語義損失來提取高階語義。2) 我們在指令調整階段引入分層幾何聚合策略。這將歸納偏差納入 LLM 早期層，以專注於點雲的局部細節。最後，我們提出第一個無編碼器 3D LMM，ENEL。我們的 7B 模型與當前最先進的模型 ShapeLLM-13B 相媲美，分別在分類、字幕和 VQA 任務中達到 55.0%、50.92% 和 42.7%。我們的結果表明，無編碼器架構極有望取代基於編碼器的架構在 3D 理解領域的應用。程式碼發布於 https://github.com/Ivan-Tang-3D/ENEL</paragraph>
+Remote patient monitoring based on wearable single-lead electrocardiogram
+(ECG) devices has significant potential for enabling the early detection of
+heart disease, especially in combination with artificial intelligence (AI)
+approaches for automated heart disease detection. There have been prior studies
+applying AI approaches based on deep learning for heart disease detection.
+However, these models are yet to be widely accepted as a reliable aid for
+clinical diagnostics, in part due to the current black-box perception
+surrounding many AI algorithms. In particular, there is a need to identify the
+key features of the ECG signal that contribute toward making an accurate
+diagnosis, thereby enhancing the interpretability of the model. In the present
+study, we develop a vision transformer approach to identify atrial fibrillation
+based on single-lead ECG data. A residual network (ResNet) approach is also
+developed for comparison with the vision transformer approach. These models are
+applied to the Chapman-Shaoxing dataset to classify atrial fibrillation, as
+well as another common arrhythmia, sinus bradycardia, and normal sinus rhythm
+heartbeats. The models enable the identification of the key regions of the
+heartbeat that determine the resulting classification, and highlight the
+importance of P-waves and T-waves, as well as heartbeat duration and signal
+amplitude, in distinguishing normal sinus rhythm from atrial fibrillation and
+sinus bradycardia.
 
-##### **DexTrack: Towards Generalizable Neural Tracking Control for Dexterous Manipulation from Human References**
-2502.09614v1 by Xueyi Liu, Jianibieke Adalibieke, Qianwei Han, Yuzhe Qin, Li Yi
+摘要：<paragraph>基於可穿戴式單導程心電圖 (ECG) 裝置的遠端病患監測在早期偵測心臟疾病方面具有顯著的潛力，特別是與用於自動化心臟疾病偵測的人工智慧 (AI) 方法結合使用時。先前已有研究應用基於深度學習的 AI 方法進行心臟疾病偵測。然而，這些模型尚未被廣泛接受為臨床診斷的可靠輔助工具，部分原因在於圍繞許多 AI 演算法的當前黑箱感知。特別是，有必要找出有助於做出準確診斷的 ECG 訊號關鍵特徵，從而增強模型的可解釋性。在本研究中，我們開發了一種視覺轉換器方法，以根據單導程 ECG 資料找出心房顫動。殘差網路 (ResNet) 方法也已開發出來，以便與視覺轉換器方法進行比較。這些模型應用於 Chapman-Shaoxing 資料集，以分類心房顫動，以及另一種常見的心律不整，竇性心動過緩，和正常竇性心律的心跳。這些模型能夠找出決定最終分類的心跳關鍵區域，並強調 P 波和 T 波，以及心跳持續時間和訊號振幅在區分正常竇性心律與心房顫動和竇性心動過緩方面的重要性。</paragraph>
 
-We address the challenge of developing a generalizable neural tracking
-controller for dexterous manipulation from human references. This controller
-aims to manage a dexterous robot hand to manipulate diverse objects for various
-purposes defined by kinematic human-object interactions. Developing such a
-controller is complicated by the intricate contact dynamics of dexterous
-manipulation and the need for adaptivity, generalizability, and robustness.
-Current reinforcement learning and trajectory optimization methods often fall
-short due to their dependence on task-specific rewards or precise system
-models. We introduce an approach that curates large-scale successful robot
-tracking demonstrations, comprising pairs of human references and robot
-actions, to train a neural controller. Utilizing a data flywheel, we
-iteratively enhance the controller's performance, as well as the number and
-quality of successful tracking demonstrations. We exploit available tracking
-demonstrations and carefully integrate reinforcement learning and imitation
-learning to boost the controller's performance in dynamic environments. At the
-same time, to obtain high-quality tracking demonstrations, we individually
-optimize per-trajectory tracking by leveraging the learned tracking controller
-in a homotopy optimization method. The homotopy optimization, mimicking
-chain-of-thought, aids in solving challenging trajectory tracking problems to
-increase demonstration diversity. We showcase our success by training a
-generalizable neural controller and evaluating it in both simulation and real
-world. Our method achieves over a 10% improvement in success rates compared to
-leading baselines. The project website with animated results is available at
-https://meowuu7.github.io/DexTrack/.
 
-摘要：<paragraph>我們解決了從人類參照中開發靈巧操作通用神經追蹤控制器的挑戰。此控制器旨在管理靈巧機器人手，以操作各種物體，以實現由運動學人機互動定義的各種目的。由於靈巧操作的複雜接觸動力學以及對適應性、通用性和魯棒性的需求，開發此類控制器很複雜。目前的強化學習和軌跡優化方法通常由於依賴於特定任務的獎勵或精確的系統模型而表現不佳。我們引入了一種方法，它策劃了大規模成功的機器人追蹤示範，包括人體參照和機器人動作對，以訓練神經控制器。利用數據飛輪，我們反覆增強控制器的性能，以及成功追蹤示範的數量和品質。我們利用可用的追蹤示範，並仔細整合強化學習和模仿學習，以提升控制器在動態環境中的性能。同時，為了獲得高品質的追蹤示範，我們透過在同倫優化方法中利用已學習的追蹤控制器，個別優化每個軌跡的追蹤。同倫優化模擬思考鏈，有助於解決具有挑戰性的軌跡追蹤問題，以增加示範的多樣性。我們展示了我們在訓練通用神經控制器並在模擬和真實世界中評估它的成功。與領先的基準相比，我們的模型在成功率方面提高了 10% 以上。包含動畫結果的專案網站可在 https://meowuu7.github.io/DexTrack/ 取得。</paragraph>
+### Medical
+|Publish Date|Title|Authors|Homepage|Code|
+| :---: | :---: | :---: | :---: | :---: |
+|**2025-02-13**|**Metamorphic Testing for Pose Estimation Systems**|Matias Duran et.al.|[2502.09460v1](http://arxiv.org/abs/2502.09460v1)|null|
+|**2025-02-13**|**The Joint Entity-Relation Extraction Model Based on Span and Interactive Fusion Representation for Chinese Medical Texts with Complex Semantics**|Danni Feng et.al.|[2502.09247v1](http://arxiv.org/abs/2502.09247v1)|null|
+|**2025-02-13**|**From large language models to multimodal AI: A scoping review on the potential of generative AI in medicine**|Lukas Buess et.al.|[2502.09242v1](http://arxiv.org/abs/2502.09242v1)|null|
+|**2025-02-13**|**Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**|Flavio Bertini et.al.|[2502.09218v1](http://arxiv.org/abs/2502.09218v1)|null|
+|**2025-02-13**|**Logical Lease Litigation: Prolog and LLMs for Rental Law Compliance in New York**|Sanskar Sehgal et.al.|[2502.09204v1](http://arxiv.org/abs/2502.09204v1)|null|
+|**2025-02-13**|**Two-Stage Representation Learning for Analyzing Movement Behavior Dynamics in People Living with Dementia**|Jin Cui et.al.|[2502.09173v1](http://arxiv.org/abs/2502.09173v1)|null|
+|**2025-02-12**|**HistoSmith: Single-Stage Histology Image-Label Generation via Conditional Latent Diffusion for Enhanced Cell Segmentation and Classification**|Valentina Vadori et.al.|[2502.08754v1](http://arxiv.org/abs/2502.08754v1)|null|
+|**2025-02-12**|**Brain Latent Progression: Individual-based Spatiotemporal Disease Progression on 3D Brain MRIs via Latent Diffusion**|Lemuel Puglisi et.al.|[2502.08560v1](http://arxiv.org/abs/2502.08560v1)|[link](https://github.com/lemuelpuglisi/brlp)|
+|**2025-02-12**|**Representation Learning to Advance Multi-institutional Studies with Electronic Health Record Data**|Doudou Zhou et.al.|[2502.08547v1](http://arxiv.org/abs/2502.08547v1)|null|
+|**2025-02-12**|**EEG Artifact Detection and Correction with Deep Autoencoders**|David Aquilué-Llorens et.al.|[2502.08686v1](http://arxiv.org/abs/2502.08686v1)|null|
+|**2025-02-12**|**SycEval: Evaluating LLM Sycophancy**|Aaron Fanous et.al.|[2502.08177v1](http://arxiv.org/abs/2502.08177v1)|null|
+|**2025-02-11**|**Caught in the Web of Words: Do LLMs Fall for Spin in Medical Literature?**|Hye Sun Yun et.al.|[2502.07963v1](http://arxiv.org/abs/2502.07963v1)|null|
+|**2025-02-11**|**An Advanced NLP Framework for Automated Medical Diagnosis with DeBERTa and Dynamic Contextual Positional Gating**|Mohammad Ali Labbaf Khaniki et.al.|[2502.07755v1](http://arxiv.org/abs/2502.07755v1)|null|
+|**2025-02-11**|**Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension**|Wenbo Gong et.al.|[2502.07752v1](http://arxiv.org/abs/2502.07752v1)|null|
+|**2025-02-11**|**The Devil is in the Prompts: De-Identification Traces Enhance Memorization Risks in Synthetic Chest X-Ray Generation**|Raman Dutt et.al.|[2502.07516v1](http://arxiv.org/abs/2502.07516v1)|[link](https://github.com/Raman1121/diffusion_memorization)|
+|**2025-02-11**|**KPIs 2024 Challenge: Advancing Glomerular Segmentation from Patch- to Slide-Level**|Ruining Deng et.al.|[2502.07288v1](http://arxiv.org/abs/2502.07288v1)|[link](https://github.com/agaldran/kpis)|
+|**2025-02-11**|**Early Risk Prediction of Pediatric Cardiac Arrest from Electronic Health Records via Multimodal Fused Transformer**|Jiaying Lu et.al.|[2502.07158v1](http://arxiv.org/abs/2502.07158v1)|null|
+|**2025-02-11**|**Explaining 3D Computed Tomography Classifiers with Counterfactuals**|Joseph Paul Cohen et.al.|[2502.07156v1](http://arxiv.org/abs/2502.07156v1)|[link](https://github.com/ieee8023/ct-counterfactuals)|
+|**2025-02-10**|**Interactive Data Harmonization with LLM Agents**|Aécio Santos et.al.|[2502.07132v1](http://arxiv.org/abs/2502.07132v1)|null|
+|**2025-02-10**|**Machine Learning for Everyone: Simplifying Healthcare Analytics with BigQuery ML**|Mohammad Amir Salari et.al.|[2502.07026v1](http://arxiv.org/abs/2502.07026v1)|null|
+|**2025-02-10**|**AIMS.au: A Dataset for the Analysis of Modern Slavery Countermeasures in Corporate Statements**|Adriana Eufrosiana Bora et.al.|[2502.07022v1](http://arxiv.org/abs/2502.07022v1)|null|
+|**2025-02-10**|**Recent Advances, Applications and Open Challenges in Machine Learning for Health: Reflections from Research Roundtables at ML4H 2024 Symposium**|Amin Adibi et.al.|[2502.06693v1](http://arxiv.org/abs/2502.06693v1)|null|
+|**2025-02-10**|**Automatic Evaluation of Healthcare LLMs Beyond Question-Answering**|Anna Arias-Duart et.al.|[2502.06666v1](http://arxiv.org/abs/2502.06666v1)|null|
+|**2025-02-10**|**Few-Shot Classification and Anatomical Localization of Tissues in SPECT Imaging**|Mohammed Abdul Hafeez Khan et.al.|[2502.06632v1](http://arxiv.org/abs/2502.06632v1)|null|
+|**2025-02-10**|**Illegal Waste Detection in Remote Sensing Images: A Case Study**|Federico Gibellini et.al.|[2502.06607v2](http://arxiv.org/abs/2502.06607v2)|null|
+|**2025-02-10**|**FEMBA: Efficient and Scalable EEG Analysis with a Bidirectional Mamba Foundation Model**|Anna Tegon et.al.|[2502.06438v1](http://arxiv.org/abs/2502.06438v1)|null|
+|**2025-02-10**|**Is an Ultra Large Natural Image-Based Foundation Model Superior to a Retina-Specific Model for Detecting Ocular and Systemic Diseases?**|Qingshan Hou et.al.|[2502.06289v1](http://arxiv.org/abs/2502.06289v1)|null|
+|**2025-02-10**|**Integrating Sequence and Image Modeling in Irregular Medical Time Series Through Self-Supervised Learning**|Liuqing Chen et.al.|[2502.06134v1](http://arxiv.org/abs/2502.06134v1)|null|
+|**2025-02-10**|**Foundation Model of Electronic Medical Records for Adaptive Risk Estimation**|Pawel Renc et.al.|[2502.06124v1](http://arxiv.org/abs/2502.06124v1)|null|
+|**2025-02-10**|**Can ChatGPT Diagnose Alzheimer's Disease?**|Quoc-Toan Nguyen et.al.|[2502.06907v1](http://arxiv.org/abs/2502.06907v1)|null|
+|**2025-02-09**|**Protecting Intellectual Property of EEG-based Neural Networks with Watermarking**|Ahmed Abdelaziz et.al.|[2502.05931v1](http://arxiv.org/abs/2502.05931v1)|[link](https://github.com/Prog-Jacob/watermarking-eeg-models)|
+|**2025-02-09**|**Enhancing Depression Detection with Chain-of-Thought Prompting: From Emotion to Reasoning Using Large Language Models**|Shiyu Teng et.al.|[2502.05879v1](http://arxiv.org/abs/2502.05879v1)|null|
+|**2025-02-09**|**LLMs for Drug-Drug Interaction Prediction: A Comprehensive Comparison**|Gabriele De Vito et.al.|[2502.06890v1](http://arxiv.org/abs/2502.06890v1)|null|
+|**2025-02-09**|**Decoding Complexity: Intelligent Pattern Exploration with CHPDA (Context Aware Hybrid Pattern Detection Algorithm)**|Lokesh Koli et.al.|[2502.07815v1](http://arxiv.org/abs/2502.07815v1)|null|
+|**2025-02-09**|**WatchGuardian: Enabling User-Defined Personalized Just-in-Time Intervention on Smartwatch**|Ying Lei et.al.|[2502.05783v1](http://arxiv.org/abs/2502.05783v1)|null|
+|**2025-02-09**|**RECOVER: Designing a Large Language Model-based Remote Patient Monitoring System for Postoperative Gastrointestinal Cancer Care**|Ziqi Yang et.al.|[2502.05740v1](http://arxiv.org/abs/2502.05740v1)|null|
+|**2025-02-08**|**4D VQ-GAN: Synthesising Medical Scans at Any Time Point for Personalised Disease Progression Modelling of Idiopathic Pulmonary Fibrosis**|An Zhao et.al.|[2502.05713v1](http://arxiv.org/abs/2502.05713v1)|null|
+|**2025-02-08**|**KMI: A Dataset of Korean Motivational Interviewing Dialogues for Psychotherapy**|Hyunjong Kim et.al.|[2502.05651v1](http://arxiv.org/abs/2502.05651v1)|null|
+|**2025-02-08**|**ELMTEX: Fine-Tuning Large Language Models for Structured Clinical Information Extraction. A Case Study on Clinical Reports**|Aynur Guluzade et.al.|[2502.05638v1](http://arxiv.org/abs/2502.05638v1)|[link](https://gitlab.cc-asp.fraunhofer.de/health-open/elmtex)|
+|**2025-02-08**|**Multi-scale Masked Autoencoder for Electrocardiogram Anomaly Detection**|Ya Zhou et.al.|[2502.05494v1](http://arxiv.org/abs/2502.05494v1)|null|
+|**2025-02-08**|**DCENWCNet: A Deep CNN Ensemble Network for White Blood Cell Classification with LIME-Based Explainability**|Sibasish Dhibar et.al.|[2502.05459v1](http://arxiv.org/abs/2502.05459v1)|null|
+|**2025-02-07**|**Multi-Class Segmentation of Aortic Branches and Zones in Computed Tomography Angiography: The AortaSeg24 Challenge**|Muhammad Imran et.al.|[2502.05330v1](http://arxiv.org/abs/2502.05330v1)|[link](https://github.com/MaxwellEng/MICCAI_CHANLLENGE24_HJL)|
+|**2025-02-07**|**Homeomorphism Prior for False Positive and Negative Problem in Medical Image Dense Contrastive Representation Learning**|Yuting He et.al.|[2502.05282v1](http://arxiv.org/abs/2502.05282v1)|null|
+|**2025-02-07**|**"It Felt Like I Was Left in the Dark": Exploring Information Needs and Design Opportunities for Family Caregivers of Older Adult Patients in Critical Care Settings**|Shihan Fu et.al.|[2502.05115v1](http://arxiv.org/abs/2502.05115v1)|null|
+|**2025-02-07**|**Mitigating Unintended Memorization with LoRA in Federated Learning for LLMs**|Thierry Bossy et.al.|[2502.05087v1](http://arxiv.org/abs/2502.05087v1)|[link](https://github.com/tuneinsight/federated-llms)|
+|**2025-02-07**|**MedMimic: Physician-Inspired Multimodal Fusion for Early Diagnosis of Fever of Unknown Origin**|Minrui Chen et.al.|[2502.04794v1](http://arxiv.org/abs/2502.04794v1)|null|
+|**2025-02-06**|**MedGNN: Towards Multi-resolution Spatiotemporal Graph Learning for Medical Time Series Classification**|Wei Fan et.al.|[2502.04515v1](http://arxiv.org/abs/2502.04515v1)|[link](https://github.com/aikunyi/MedGNN)|
+|**2025-02-06**|**Integrating Generative Artificial Intelligence in ADRD: A Framework for Streamlining Diagnosis and Care in Neurodegenerative Diseases**|Andrew G. Breithaupt et.al.|[2502.06842v1](http://arxiv.org/abs/2502.06842v1)|null|
+|**2025-02-06**|**Primary Care Diagnoses as a Reliable Predictor for Orthopedic Surgical Interventions**|Khushboo Verma et.al.|[2502.04423v1](http://arxiv.org/abs/2502.04423v1)|null|
+|**2025-02-06**|**Automatic quantification of breast cancer biomarkers from multiple 18F-FDG PET image segmentation**|Tewele W. Tareke et.al.|[2502.04083v1](http://arxiv.org/abs/2502.04083v1)|null|
+|**2025-02-06**|**Generalize Drug Response Prediction by Latent Independent Projection for Asymmetric Constrained Domain Generalization**|Ran Song et.al.|[2502.04034v1](http://arxiv.org/abs/2502.04034v1)|null|
+|**2025-02-06**|**MedRAG: Enhancing Retrieval-augmented Generation with Knowledge Graph-Elicited Reasoning for Healthcare Copilot**|Xuejiao Zhao et.al.|[2502.04413v1](http://arxiv.org/abs/2502.04413v1)|[link](https://github.com/snowteam2023/medrag)|
+|**2025-02-06**|**Transforming Multimodal Models into Action Models for Radiotherapy**|Matteo Ferrante et.al.|[2502.04408v1](http://arxiv.org/abs/2502.04408v1)|null|
+|**2025-02-06**|**Online Location Planning for AI-Defined Vehicles: Optimizing Joint Tasks of Order Serving and Spatio-Temporal Heterogeneous Model Fine-Tuning**|Bokeng Zheng et.al.|[2502.04399v1](http://arxiv.org/abs/2502.04399v1)|null|
+|**2025-02-06**|**Multimodal Medical Code Tokenizer**|Xiaorui Su et.al.|[2502.04397v2](http://arxiv.org/abs/2502.04397v2)|null|
+|**2025-02-06**|**A Retrospective Systematic Study on Hierarchical Sparse Query Transformer-assisted Ultrasound Screening for Early Hepatocellular Carcinoma**|Chaoyin She et.al.|[2502.03772v1](http://arxiv.org/abs/2502.03772v1)|[link](https://github.com/Asunatan/HSQformer)|
+|**2025-02-05**|**Towards Fair Medical AI: Adversarial Debiasing of 3D CT Foundation Embeddings**|Guangyao Zheng et.al.|[2502.04386v1](http://arxiv.org/abs/2502.04386v1)|[link](https://github.com/BioIntelligence-Lab/VAE-Adversarial-Debiasing)|
+|**2025-02-05**|**Clinically-Inspired Hierarchical Multi-Label Classification of Chest X-rays with a Penalty-Based Loss Function**|Mehrdad Asadi et.al.|[2502.03591v1](http://arxiv.org/abs/2502.03591v1)|[link](https://github.com/the-mercury/CIHMLC)|
+|**2025-02-05**|**Code Simulation as a Proxy for High-order Tasks in Large Language Models**|Emanuele La Malfa et.al.|[2502.03568v1](http://arxiv.org/abs/2502.03568v1)|null|
+|**2025-02-05**|**Limitations of Large Language Models in Clinical Problem-Solving Arising from Inflexible Reasoning**|Jonathan Kim et.al.|[2502.04381v1](http://arxiv.org/abs/2502.04381v1)|null|
+|**2025-02-05**|**Accurate AI-Driven Emergency Vehicle Location Tracking in Healthcare ITS Digital Twin**|Sarah Al-Shareeda et.al.|[2502.03396v1](http://arxiv.org/abs/2502.03396v1)|null|
+|**2025-02-05**|**RadVLM: A Multitask Conversational Vision-Language Model for Radiology**|Nicolas Deperrois et.al.|[2502.03333v1](http://arxiv.org/abs/2502.03333v1)|null|
+|**2025-02-05**|**MeDiSumQA: Patient-Oriented Question-Answer Generation from Discharge Letters**|Amin Dada et.al.|[2502.03298v1](http://arxiv.org/abs/2502.03298v1)|null|
+|**2025-02-05**|**Deep Learning Pipeline for Fully Automated Myocardial Infarct Segmentation from Clinical Cardiac MR Scans**|Matthias Schwab et.al.|[2502.03272v1](http://arxiv.org/abs/2502.03272v1)|null|
+|**2025-02-05**|**Long-tailed Medical Diagnosis with Relation-aware Representation Learning and Iterative Classifier Calibration**|Li Pan et.al.|[2502.03238v2](http://arxiv.org/abs/2502.03238v2)|[link](https://github.com/peterlipan/lmd)|
+|**2025-02-05**|**Fine-Tuning Strategies for Continual Online EEG Motor Imagery Decoding: Insights from a Large-Scale Longitudinal Study**|Martin Wimpff et.al.|[2502.06828v1](http://arxiv.org/abs/2502.06828v1)|null|
+|**2025-02-05**|**MedBioLM: Optimizing Medical and Biological QA with Fine-Tuned Large Language Models and Retrieval-Augmented Generation**|Seonok Kim et.al.|[2502.03004v1](http://arxiv.org/abs/2502.03004v1)|null|
+|**2025-02-05**|**Contrastive Token-level Explanations for Graph-based Rumour Detection**|Daniel Wai Kit Chin et.al.|[2502.04366v1](http://arxiv.org/abs/2502.04366v1)|null|
+|**2025-02-05**|**AI-Based Thermal Video Analysis in Privacy-Preserving Healthcare: A Case Study on Detecting Time of Birth**|Jorge García-Torres et.al.|[2502.04365v1](http://arxiv.org/abs/2502.04365v1)|null|
+|**2025-02-04**|**3D Foundation AI Model for Generalizable Disease Detection in Head Computed Tomography**|Weicheng Zhu et.al.|[2502.02779v1](http://arxiv.org/abs/2502.02779v1)|null|
+|**2025-02-04**|**Adaptive Voxel-Weighted Loss Using L1 Norms in Deep Neural Networks for Detection and Segmentation of Prostate Cancer Lesions in PET/CT Images**|Obed Korshie Dzikunu et.al.|[2502.02756v1](http://arxiv.org/abs/2502.02756v1)|[link](https://github.com/obeddzik/pca_segment)|
+|**2025-02-04**|**Diffusion Instruction Tuning**|Chen Jin et.al.|[2502.06814v1](http://arxiv.org/abs/2502.06814v1)|null|
+|**2025-02-04**|**MedRAX: Medical Reasoning Agent for Chest X-ray**|Adibvafa Fallahpour et.al.|[2502.02673v1](http://arxiv.org/abs/2502.02673v1)|[link](https://github.com/bowang-lab/medrax)|
+|**2025-02-04**|**Open Foundation Models in Healthcare: Challenges, Paradoxes, and Opportunities with GenAI Driven Personalized Prescription**|Mahdi Alkaeed et.al.|[2502.04356v1](http://arxiv.org/abs/2502.04356v1)|null|
+|**2025-02-04**|**Decision Theoretic Foundations for Conformal Prediction: Optimal Uncertainty Quantification for Risk-Averse Agents**|Shayan Kiyani et.al.|[2502.02561v1](http://arxiv.org/abs/2502.02561v1)|null|
+|**2025-02-04**|**CoRPA: Adversarial Image Generation for Chest X-rays Using Concept Vector Perturbations and Generative Models**|Amy Rafferty et.al.|[2502.05214v1](http://arxiv.org/abs/2502.05214v1)|null|
+|**2025-02-04**|**A Self-Supervised Framework for Improved Generalisability in Ultrasound B-mode Image Segmentation**|Edward Ellis et.al.|[2502.02489v1](http://arxiv.org/abs/2502.02489v1)|null|
+|**2025-02-04**|**Medical Multimodal Model Stealing Attacks via Adversarial Domain Alignment**|Yaling Shen et.al.|[2502.02438v1](http://arxiv.org/abs/2502.02438v1)|null|
+|**2025-02-04**|**Test Time Training for 4D Medical Image Interpolation**|Qikang Zhang et.al.|[2502.02341v1](http://arxiv.org/abs/2502.02341v1)|[link](https://github.com/chaostheproducer/ttt4d)|
+|**2025-02-04**|**Conversation AI Dialog for Medicare powered by Finetuning and Retrieval Augmented Generation**|Atharva Mangeshkumar Agrawal et.al.|[2502.02249v1](http://arxiv.org/abs/2502.02249v1)|null|
+|**2025-02-04**|**Deep Learning-Based Facial Expression Recognition for the Elderly: A Systematic Review**|F. Xavier Gaya-Morey et.al.|[2502.02618v1](http://arxiv.org/abs/2502.02618v1)|null|
+|**2025-02-04**|**Causally-informed Deep Learning towards Explainable and Generalizable Outcomes Prediction in Critical Care**|Yuxiao Cheng et.al.|[2502.02109v1](http://arxiv.org/abs/2502.02109v1)|null|
+|**2025-02-04**|**JingFang: A Traditional Chinese Medicine Large Language Model of Expert-Level Medical Diagnosis and Syndrome Differentiation-Based Treatment**|Yehan Yan et.al.|[2502.04345v1](http://arxiv.org/abs/2502.04345v1)|null|
+|**2025-02-03**|**An Agentic AI Workflow for Detecting Cognitive Concerns in Real-world Data**|Jiazi Tian et.al.|[2502.01789v1](http://arxiv.org/abs/2502.01789v1)|null|
+|**2025-02-03**|**Can Domain Experts Rely on AI Appropriately? A Case Study on AI-Assisted Prostate Cancer MRI Diagnosis**|Chacha Chen et.al.|[2502.03482v1](http://arxiv.org/abs/2502.03482v1)|null|
+|**2025-02-03**|**Improving Transformer World Models for Data-Efficient RL**|Antoine Dedieu et.al.|[2502.01591v1](http://arxiv.org/abs/2502.01591v1)|null|
+|**2025-02-03**|**Data-Efficient Model for Psychological Resilience Prediction based on Neurological Data**|Zhi Zhang et.al.|[2502.01377v1](http://arxiv.org/abs/2502.01377v1)|null|
+|**2025-02-03**|**OphthBench: A Comprehensive Benchmark for Evaluating Large Language Models in Chinese Ophthalmology**|Chengfeng Zhou et.al.|[2502.01243v1](http://arxiv.org/abs/2502.01243v1)|null|
+|**2025-02-03**|**MIND: Modality-Informed Knowledge Distillation Framework for Multimodal Clinical Prediction Tasks**|Alejandro Guerra-Manzanares et.al.|[2502.01158v1](http://arxiv.org/abs/2502.01158v1)|null|
+|**2025-02-03**|**Beyond Yes or No: Predictive Compliance Monitoring Approaches for Quantifying the Magnitude of Compliance Violations**|Qian Chen et.al.|[2502.01141v1](http://arxiv.org/abs/2502.01141v1)|null|
+|**2025-02-03**|**Pulse-PPG: An Open-Source Field-Trained PPG Foundation Model for Wearable Applications Across Lab and Field Settings**|Mithun Saha et.al.|[2502.01108v1](http://arxiv.org/abs/2502.01108v1)|null|
+|**2025-02-03**|**Tutorial on Using Machine Learning and Deep Learning Models for Mental Illness Detection**|Yeyubei Zhang et.al.|[2502.04342v1](http://arxiv.org/abs/2502.04342v1)|null|
+|**2025-02-02**|**Agent-Based Uncertainty Awareness Improves Automated Radiology Report Labeling with an Open-Source Large Language Model**|Hadas Ben-Atya et.al.|[2502.01691v1](http://arxiv.org/abs/2502.01691v1)|null|
+|**2025-02-02**|**Automated Extraction of Spatio-Semantic Graphs for Identifying Cognitive Impairment**|Si-Ioi Ng et.al.|[2502.01685v1](http://arxiv.org/abs/2502.01685v1)|null|
+|**2025-02-02**|**Registration-Enhanced Segmentation Method for Prostate Cancer in Ultrasound Images**|Shengtian Sang et.al.|[2502.00712v1](http://arxiv.org/abs/2502.00712v1)|null|
+|**2025-02-02**|**TMI-CLNet: Triple-Modal Interaction Network for Chronic Liver Disease Prognosis From Imaging, Clinical, and Radiomic Data Fusion**|Linglong Wu et.al.|[2502.00695v1](http://arxiv.org/abs/2502.00695v1)|null|
+|**2025-02-02**|**Safety at Scale: A Comprehensive Survey of Large Model Safety**|Xingjun Ma et.al.|[2502.05206v2](http://arxiv.org/abs/2502.05206v2)|null|
+|**2025-02-02**|**Enhanced Convolutional Neural Networks for Improved Image Classification**|Xiaoran Yang et.al.|[2502.00663v1](http://arxiv.org/abs/2502.00663v1)|null|
+|**2025-02-02**|**Distribution-aware Fairness Learning in Medical Image Segmentation From A Control-Theoretic Perspective**|Yujin Oh et.al.|[2502.00619v1](http://arxiv.org/abs/2502.00619v1)|null|
+|**2025-02-01**|**Generating crossmodal gene expression from cancer histopathology improves multimodal AI predictions**|Samiran Dey et.al.|[2502.00568v3](http://arxiv.org/abs/2502.00568v3)|[link](https://github.com/Samiran-Dey/PathoGen)|
 
-##### **Score-of-Mixture Training: Training One-Step Generative Models Made Simple**
-2502.09609v1 by Tejas Jayashankar, J. Jon Ryu, Gregory Wornell
+#### Abstracts
+##### **Metamorphic Testing for Pose Estimation Systems**
+2502.09460v1 by Matias Duran, Thomas Laurent, Ellen Rushe, Anthony Ventresque
 
-We propose Score-of-Mixture Training (SMT), a novel framework for training
-one-step generative models by minimizing a class of divergences called the
-$\alpha$-skew Jensen-Shannon divergence. At its core, SMT estimates the score
-of mixture distributions between real and fake samples across multiple noise
-levels. Similar to consistency models, our approach supports both training from
-scratch (SMT) and distillation using a pretrained diffusion model, which we
-call Score-of-Mixture Distillation (SMD). It is simple to implement, requires
-minimal hyperparameter tuning, and ensures stable training. Experiments on
-CIFAR-10 and ImageNet 64x64 show that SMT/SMD are competitive with and can even
-outperform existing methods.
+Pose estimation systems are used in a variety of fields, from sports
+analytics to livestock care. Given their potential impact, it is paramount to
+systematically test their behaviour and potential for failure. This is a
+complex task due to the oracle problem and the high cost of manual labelling
+necessary to build ground truth keypoints. This problem is exacerbated by the
+fact that different applications require systems to focus on different subjects
+(e.g., human versus animal) or landmarks (e.g., only extremities versus whole
+body and face), which makes labelled test data rarely reusable. To combat these
+problems we propose MET-POSE, a metamorphic testing framework for pose
+estimation systems that bypasses the need for manual annotation while assessing
+the performance of these systems under different circumstances. MET-POSE thus
+allows users of pose estimation systems to assess the systems in conditions
+that more closely relate to their application without having to label an ad-hoc
+test dataset or rely only on available datasets, which may not be adapted to
+their application domain. While we define MET-POSE in general terms, we also
+present a non-exhaustive list of metamorphic rules that represent common
+challenges in computer vision applications, as well as a specific way to
+evaluate these rules. We then experimentally show the effectiveness of MET-POSE
+by applying it to Mediapipe Holistic, a state of the art human pose estimation
+system, with the FLIC and PHOENIX datasets. With these experiments, we outline
+numerous ways in which the outputs of MET-POSE can uncover faults in pose
+estimation systems at a similar or higher rate than classic testing using hand
+labelled data, and show that users can tailor the rule set they use to the
+faults and level of accuracy relevant to their application.
 
-摘要：我們提出混合評分訓練 (SMT)，一種透過最小化稱為 $\alpha$-偏斜 Jensen-Shannon 距離的距離類別來訓練單步生成模型的新穎架構。在核心部分，SMT 估計真實和虛假樣本之間在多個雜訊層級的混合分配評分。與一致性模型類似，我們的做法支援從頭開始訓練 (SMT) 和使用預先訓練的擴散模型進行蒸餾，我們稱之為混合評分蒸餾 (SMD)。它易於實作，只需要最小的超參數調整，並確保穩定的訓練。在 CIFAR-10 和 ImageNet 64x64 上的實驗顯示，SMT/SMD 具有競爭力，甚至可以優於現有方法。
+摘要：姿勢估計系統應用於各種領域，從運動分析到牲畜照護。鑑於其潛在影響，系統性地測試其行為和故障潛力至關重要。由於預言機問題以及建立地面實況關鍵點所需的手動標記成本高，這是一項複雜的任務。這個問題因不同的應用需要系統專注於不同的主體（例如，人類對動物）或地標（例如，只有四肢對全身和臉部）而加劇，這使得標記的測試數據很少可以重複使用。為了解決這些問題，我們提出了 MET-POSE，這是一個姿勢估計系統的變形測試框架，在評估這些系統在不同情況下的性能時，可以繞過手動註解的需要。因此，MET-POSE 允許姿勢估計系統的使用者在更接近其應用程式的條件下評估系統，而無需標記臨時測試數據集或僅依賴可用數據集，這些數據集可能不適合其應用領域。雖然我們以一般術語定義 MET-POSE，但我們也提供了一個非詳盡的變形規則列表，這些規則代表了電腦視覺應用中的常見挑戰，以及評估這些規則的具體方法。然後，我們通過將 MET-POSE 應用於 Mediapipe Holistic（一種先進的人類姿勢估計系統），並使用 FLIC 和 PHOENIX 數據集，以實驗方式展示 MET-POSE 的有效性。通過這些實驗，我們概述了 MET-POSE 的輸出可以揭示姿勢估計系統中故障的許多方法，其速度與使用手動標記數據的傳統測試類似或更高，並表明使用者可以根據其應用程式相關的故障和準確度等級來調整他們使用的規則集。
 
-##### **Human-LLM Coevolution: Evidence from Academic Writing**
-2502.09606v1 by Mingmeng Geng, Roberto Trotta
+##### **The Joint Entity-Relation Extraction Model Based on Span and Interactive Fusion Representation for Chinese Medical Texts with Complex Semantics**
+2502.09247v1 by Danni Feng, Runzhi Li, Jing Wang, Siyu Yan, Lihong Ma, Yunli Xing
 
-With a statistical analysis of arXiv paper abstracts, we report a marked drop
-in the frequency of several words previously identified as overused by ChatGPT,
-such as "delve", starting soon after they were pointed out in early 2024. The
-frequency of certain other words favored by ChatGPT, such as "significant", has
-instead kept increasing. These phenomena suggest that some authors of academic
-papers have adapted their use of large language models (LLMs), for example, by
-selecting outputs or applying modifications to the LLM-generated content. Such
-coevolution and cooperation of humans and LLMs thus introduce additional
-challenges to the detection of machine-generated text in real-world scenarios.
-Estimating the impact of LLMs on academic writing by examining word frequency
-remains feasible, and more attention should be paid to words that were already
-frequently employed, including those that have decreased in frequency.
+Joint entity-relation extraction is a critical task in transforming
+unstructured or semi-structured text into triplets, facilitating the
+construction of large-scale knowledge graphs, and supporting various downstream
+applications. Despite its importance, research on Chinese text, particularly
+with complex semantics in specialized domains like medicine, remains limited.
+To address this gap, we introduce the CH-DDI, a Chinese drug-drug interactions
+dataset designed to capture the intricacies of medical text. Leveraging the
+strengths of attention mechanisms in capturing long-range dependencies, we
+propose the SEA module, which enhances the extraction of complex contextual
+semantic information, thereby improving entity recognition and relation
+extraction. Additionally, to address the inefficiencies of existing methods in
+facilitating information exchange between entity recognition and relation
+extraction, we present an interactive fusion representation module. This module
+employs Cross Attention for bidirectional information exchange between the
+tasks and further refines feature extraction through BiLSTM. Experimental
+results on both our CH-DDI dataset and public CoNLL04 dataset demonstrate that
+our model exhibits strong generalization capabilities. On the CH-DDI dataset,
+our model achieves an F1-score of 96.73% for entity recognition and 78.43% for
+relation extraction. On the CoNLL04 dataset, it attains an entity recognition
+precision of 89.54% and a relation extraction accuracy of 71.64%.
 
-摘要：透過對 arXiv 論文摘要進行統計分析，我們報告了幾個先前被認為 ChatGPT 過度使用的詞彙的頻率大幅下降，例如「深入探討」，從 2024 年初被指出後不久就開始下降。相反地，ChatGPT 偏好的某些其他詞彙，例如「顯著」，頻率持續增加。這些現象表明，一些學術論文作者已經調整了他們使用大型語言模型 (LLM) 的方式，例如，透過選擇輸出或對 LLM 生成的內容進行修改。因此，人類和 LLM 的這種共同演化和合作為在現實世界場景中偵測機器產生的文字帶來了額外的挑戰。透過檢視詞彙頻率來評估 LLM 對學術寫作的影響仍然可行，並且應該對已經頻繁使用的詞彙給予更多關注，包括那些頻率下降的詞彙。
+摘要：聯合實體關係抽取是將非結構化或半結構化文字轉換為三元組的重要任務，有助於建構大規模知識圖譜，並支援各種下游應用程式。儘管其重要性，但針對中文文本的研究，特別是醫學等專業領域中具有複雜語義的研究仍十分有限。為了解決這個差距，我們引入了 CH-DDI，一個中文藥物-藥物交互作用資料集，旨在擷取醫學文本的複雜性。利用注意力機制在擷取長程依賴關係方面的優勢，我們提出了 SEA 模組，增強了複雜脈絡語義資訊的抽取，從而改進了實體辨識和關係抽取。此外，為了解決現有方法在促進實體辨識和關係抽取之間資訊交換方面的低效率問題，我們提出了互動式融合表示模組。此模組採用交叉注意力，在任務之間進行雙向資訊交換，並透過 BiLSTM 進一步精煉特徵抽取。在我們的 CH-DDI 資料集和公開的 CoNLL04 資料集上的實驗結果表明，我們的模型展現出強大的泛化能力。在 CH-DDI 資料集上，我們的模型在實體辨識方面達到了 96.73% 的 F1 分數，在關係抽取方面達到了 78.43% 的 F1 分數。在 CoNLL04 資料集上，它在實體辨識方面達到了 89.54% 的準確度，在關係抽取方面達到了 71.64% 的準確度。
 
-##### **SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models**
-2502.09604v1 by Yung-Sung Chuang, Benjamin Cohen-Wang, Shannon Zejiang Shen, Zhaofeng Wu, Hu Xu, Xi Victoria Lin, James Glass, Shang-Wen Li, Wen-tau Yih
+##### **From large language models to multimodal AI: A scoping review on the potential of generative AI in medicine**
+2502.09242v1 by Lukas Buess, Matthias Keicher, Nassir Navab, Andreas Maier, Soroosh Tayebi Arasteh
 
-We introduce SelfCite, a novel self-supervised approach that aligns LLMs to
-generate high-quality, fine-grained, sentence-level citations for the
-statements in their generated responses. Instead of only relying on costly and
-labor-intensive annotations, SelfCite leverages a reward signal provided by the
-LLM itself through context ablation: If a citation is necessary, removing the
-cited text from the context should prevent the same response; if sufficient,
-retaining the cited text alone should preserve the same response. This reward
-can guide the inference-time best-of-N sampling strategy to improve citation
-quality significantly, as well as be used in preference optimization to
-directly fine-tune the models for generating better citations. The
-effectiveness of SelfCite is demonstrated by increasing citation F1 up to 5.3
-points on the LongBench-Cite benchmark across five long-form question answering
-tasks.
+Generative artificial intelligence (AI) models, such as diffusion models and
+OpenAI's ChatGPT, are transforming medicine by enhancing diagnostic accuracy
+and automating clinical workflows. The field has advanced rapidly, evolving
+from text-only large language models for tasks such as clinical documentation
+and decision support to multimodal AI systems capable of integrating diverse
+data modalities, including imaging, text, and structured data, within a single
+model. The diverse landscape of these technologies, along with rising interest,
+highlights the need for a comprehensive review of their applications and
+potential. This scoping review explores the evolution of multimodal AI,
+highlighting its methods, applications, datasets, and evaluation in clinical
+settings. Adhering to PRISMA-ScR guidelines, we systematically queried PubMed,
+IEEE Xplore, and Web of Science, prioritizing recent studies published up to
+the end of 2024. After rigorous screening, 144 papers were included, revealing
+key trends and challenges in this dynamic field. Our findings underscore a
+shift from unimodal to multimodal approaches, driving innovations in diagnostic
+support, medical report generation, drug discovery, and conversational AI.
+However, critical challenges remain, including the integration of heterogeneous
+data types, improving model interpretability, addressing ethical concerns, and
+validating AI systems in real-world clinical settings. This review summarizes
+the current state of the art, identifies critical gaps, and provides insights
+to guide the development of scalable, trustworthy, and clinically impactful
+multimodal AI solutions in healthcare.
 
-摘要：我們介紹 SelfCite，一種新穎的自監督方法，它將 LLM 對齊以針對其生成回應中的陳述生成高品質、細粒度、句子級別的引用。SelfCite 不僅依賴於昂貴且勞動密集的註解，還利用 LLM 本身通過上下文消融提供的獎勵信號：如果需要引用，從上下文中移除被引用的文字應當會阻止相同的回應；如果足夠，僅保留被引用的文字應當會保留相同的回應。此獎勵可以引導推理時間最佳 N 個取樣策略以顯著改善引文品質，並用於偏好最佳化以直接微調模型以生成更好的引文。SelfCite 的有效性通過在五個長篇問答任務中將 LongBench-Cite 基準上的引文 F1 提高多達 5.3 點來證明。
+摘要：生成式人工智能 (AI) 模型，例如扩散模型和 OpenAI 的 ChatGPT，通过提高诊断准确性和自动化临床工作流程，正在改变医学领域。该领域已迅速发展，从用于临床文件编制和决策支持等任务的纯文本大型语言模型，发展到能够在单个模型中整合包括影像、文本和结构化数据在内的多种数据方式的多模态 AI 系统。这些技术的多样化格局以及日益增长的兴趣，凸显了全面审查其应用和潜力的必要性。本范围审查探讨了多模态 AI 的演变，重点介绍了其方法、应用、数据集和在临床环境中的评估。遵循 PRISMA-ScR 指南，我们系统地查询了 PubMed、IEEE Xplore 和 Web of Science，优先考虑截至 2024 年底发表的最新研究。经过严格筛选，纳入了 144 篇论文，揭示了这一充满活力的领域的趋势和挑战。我们的研究结果强调了从单模态方法向多模态方法的转变，推动了诊断支持、医疗报告生成、药物发现和会话式 AI 的创新。然而，关键挑战仍然存在，包括异构数据类型的整合、提高模型可解释性、解决伦理问题以及在现实世界的临床环境中验证 AI 系统。本综述总结了当前的最新技术，确定了关键差距，并提供了见解，以指导在医疗保健领域开发可扩展、可信赖且具有临床影响力的多模态 AI 解决方案。
 
-##### **CoT-Valve: Length-Compressible Chain-of-Thought Tuning**
-2502.09601v1 by Xinyin Ma, Guangnian Wan, Runpeng Yu, Gongfan Fang, Xinchao Wang
+##### **Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**
+2502.09218v1 by Flavio Bertini, Alessandro Dal Palù, Federica Zaglio, Francesco Fabiano, Andrea Formisano
 
-Chain-of-Thought significantly enhances a model's reasoning capability, but
-it also comes with a considerable increase in inference costs due to long
-chains. With the observation that the reasoning path can be easily compressed
-under easy tasks but struggle on hard tasks, we explore the feasibility of
-elastically controlling the length of reasoning paths with only one model,
-thereby reducing the inference overhead of reasoning models dynamically based
-on task difficulty. We introduce a new tuning and inference strategy named
-CoT-Valve, designed to allow models to generate reasoning chains of varying
-lengths. To achieve this, we propose to identify a direction in the parameter
-space that, when manipulated, can effectively control the length of generated
-CoT. Moreover, we show that this property is valuable for compressing the
-reasoning chain. We construct datasets with chains from long to short for the
-same questions and explore two enhanced strategies for CoT-Valve: (1) a precise
-length-compressible CoT tuning method, and (2) a progressive chain length
-compression approach. Our experiments show that CoT-Valve successfully enables
-controllability and compressibility of the chain and shows better performance
-than the prompt-based control. We applied this method to QwQ-32B-Preview,
-reducing reasoning chains on GSM8K from 741 to 225 tokens with a minor
-performance drop (95.07% to 94.92%) and on AIME from 6827 to 4629 tokens, with
-only one additional incorrect answer.
+This paper presents a complete explainable system that interprets a set of
+data, abstracts the underlying features and describes them in a natural
+language of choice. The system relies on two crucial stages: (i) identifying
+emerging properties from data and transforming them into abstract concepts, and
+(ii) converting these concepts into natural language. Despite the impressive
+natural language generation capabilities demonstrated by Large Language Models,
+their statistical nature and the intricacy of their internal mechanism still
+force us to employ these techniques as black boxes, forgoing trustworthiness.
+Developing an explainable pipeline for data interpretation would allow
+facilitating its use in safety-critical environments like processing medical
+information and allowing non-experts and visually impaired people to access
+narrated information. To this end, we believe that the fields of knowledge
+representation and automated reasoning research could present a valid
+alternative. Expanding on prior research that tackled the first stage (i), we
+focus on the second stage, named Concept2Text. Being explainable, data
+translation is easily modeled through logic-based rules, once again emphasizing
+the role of declarative programming in achieving AI explainability. This paper
+explores a Prolog/CLP-based rewriting system to interpret concepts-articulated
+in terms of classes and relations, plus common knowledge-derived from a generic
+ontology, generating natural language text. Its main features include
+hierarchical tree rewritings, modular multilingual generation, support for
+equivalent variants across semantic, grammar, and lexical levels, and a
+transparent rule-based system. We outline the architecture and demonstrate its
+flexibility through some examples capable of generating numerous diverse and
+equivalent rewritings based on the input concept.
 
-摘要：<paragraph>連續思考大幅提升了模型的推理能力，但由於鏈條過長，也大幅增加了推理成本。由於觀察到推理路徑在簡單的任務中可以輕易壓縮，但在困難的任務中卻很吃力，我們探索了僅使用一個模型彈性控制推理路徑長度的可行性，從而根據任務難度動態減少推理模型的推理開銷。我們引入了一種名為 CoT-Valve 的新調校和推理策略，旨在讓模型產生長度不一的推理鏈。為此，我們提議在參數空間中識別一個方向，在操作時可以有效控制生成的 CoT 的長度。此外，我們展示了此屬性對於壓縮推理鏈是有價值的。我們構造了從長到短的鏈條的資料集，用於相同的問題，並探索了 CoT-Valve 的兩種增強策略：(1) 精確的長度可壓縮 CoT 調校方法，以及 (2) 漸進式鏈長壓縮方法。我們的實驗表明，CoT-Valve 成功地實現了鏈條的可控性和可壓縮性，並顯示出比基於提示的控制更好的效能。我們將此方法應用於 QwQ-32B-Preview，將 GSM8K 上的推理鏈條從 741 個代幣減少到 225 個代幣，效能僅略微下降 (95.07% 至 94.92%)，而在 AIME 上從 6827 個代幣減少到 4629 個代幣，只多了一個錯誤答案。</paragraph>
+摘要：<paragraph>這篇論文提出了一個完整的可解釋系統，它可以解釋一組資料，抽象出基礎特徵，並以選擇的自然語言描述它們。系統依賴兩個關鍵階段：(i) 從資料中識別新興屬性，並將它們轉換為抽象概念，以及 (ii) 將這些概念轉換為自然語言。儘管大型語言模型展示了令人印象深刻的自然語言生成能力，但它們的統計性質和內部機制的複雜性仍然迫使我們將這些技術用作黑盒子，放棄可信度。開發一個可解釋的資料解釋管道將有助於促進在安全關鍵環境中使用它，例如處理醫療資訊，並允許非專家和視障人士存取敘述資訊。為此，我們相信知識表示和自動推理研究領域可以提出一個有效的替代方案。在擴展解決第一階段 (i) 的先前研究的基礎上，我們專注於第二階段，稱為 Concept2Text。由於具有可解釋性，資料翻譯很容易透過基於邏輯的規則建模，再次強調宣告式程式設計在實現 AI 可解釋性中的作用。本文探討了一個基於 Prolog/CLP 的重寫系統，以解釋概念，這些概念以類別和關係的形式表達，再加上從通用本体衍生的常識，產生自然語言文字。它的主要特點包括階層樹重寫、模組化多語言生成、支援語義、語法和詞彙層面的等效變體，以及一個透明的基於規則的系統。我們概述了架構，並透過一些範例展示了它的靈活性，這些範例能夠根據輸入概念生成許多不同的等效重寫。</paragraph>
 
-##### **Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs**
-2502.09597v1 by Siyan Zhao, Mingyi Hong, Yang Liu, Devamanyu Hazarika, Kaixiang Lin
+##### **Logical Lease Litigation: Prolog and LLMs for Rental Law Compliance in New York**
+2502.09204v1 by Sanskar Sehgal, Yanhong A. Liu
 
-Large Language Models (LLMs) are increasingly used as chatbots, yet their
-ability to personalize responses to user preferences remains limited. We
-introduce PrefEval, a benchmark for evaluating LLMs' ability to infer, memorize
-and adhere to user preferences in a long-context conversational setting.
-PrefEval comprises 3,000 manually curated user preference and query pairs
-spanning 20 topics. PrefEval contains user personalization or preference
-information in both explicit and implicit forms, and evaluates LLM performance
-using a generation and a classification task. With PrefEval, we evaluated the
-aforementioned preference following capabilities of 10 open-source and
-proprietary LLMs in multi-session conversations with varying context lengths up
-to 100k tokens. We benchmark with various prompting, iterative feedback, and
-retrieval-augmented generation methods. Our benchmarking effort reveals that
-state-of-the-art LLMs face significant challenges in proactively following
-users' preferences during conversations. In particular, in zero-shot settings,
-preference following accuracy falls below 10% at merely 10 turns (~3k tokens)
-across most evaluated models. Even with advanced prompting and retrieval
-methods, preference following still deteriorates in long-context conversations.
-Furthermore, we show that fine-tuning on PrefEval significantly improves
-performance. We believe PrefEval serves as a valuable resource for measuring,
-understanding, and enhancing LLMs' preference following abilities, paving the
-way for personalized conversational agents. Our code and dataset are available
-at https://prefeval.github.io/.
+Legal cases require careful logical reasoning following the laws, whereas
+interactions with non- technical users must be in natural language. As an
+application combining logical reasoning using Prolog and natural language
+processing using large language models (LLMs), this paper presents a novel
+approach and system, LogicLease, to automate the analysis of landlord-tenant
+legal cases in the state of New York. LogicLease determines compliance with
+relevant legal requirements by analyzing case descriptions and citing all
+relevant laws. It leverages LLMs for information extraction and Prolog for
+legal reasoning. By separating information extraction from legal reasoning,
+LogicLease achieves greater transparency and control over the legal logic
+applied to each case. We evaluate the accuracy, efficiency, and robustness of
+LogicLease through a series of tests, achieving 100% accuracy and an average
+processing time of 2.57 seconds. LogicLease presents advantages over
+state-of-the-art LLM- based legal analysis systems by providing clear,
+step-by-step reasoning, citing specific laws, and distinguishing itself by its
+ability to avoid hallucinations - a common issue in LLMs.
 
-摘要：大型語言模型（LLM）正日益被用作聊天機器人，但它們根據使用者偏好個人化回應的能力仍然有限。我們引入了 PrefEval，一個用於評估 LLM 在長時間對話環境中推論、記憶和遵守使用者偏好的能力的基準。PrefEval 包含 3,000 個手動策劃的使用者偏好和查詢對，涵蓋 20 個主題。PrefEval 包含以明確和隱含形式表達的使用者個人化或偏好資訊，並使用生成和分類任務評估 LLM 效能。透過 PrefEval，我們評估了 10 個開源和專有 LLM 在多重對話中上述的偏好追蹤能力，對話內容長度最高達 100k 個符號。我們使用各種提示、迭代回饋和檢索增強生成方法進行基準測試。我們的基準測試工作顯示，最先進的 LLM 在對話中主動追蹤使用者偏好時面臨重大挑戰。特別是在零次學習設定中，在多數評估模型中，在僅 10 個回合（約 3k 個符號）時，偏好追蹤準確度低於 10%。即使使用進階提示和檢索方法，在長時間對話中偏好追蹤仍然會惡化。此外，我們展示了在 PrefEval 上進行微調會大幅改善效能。我們相信 PrefEval 可作為衡量、理解和提升 LLM 偏好追蹤能力的寶貴資源，為個人化對話代理鋪路。我們的程式碼和資料集可在 https://prefeval.github.io/ 取得。
+摘要：法律案件需要遵循法律进行谨慎的逻辑推理，而与非技术用户的互动必须使用自然语言。作为结合使用 Prolog 进行逻辑推理和使用大型语言模型 (LLM) 进行自然语言处理的应用程序，本文提出了一种新颖的方法和系统 LogicLease，以自动分析纽约州的房东与租户法律案件。LogicLease 通过分析案例描述并引用所有相关法律来确定是否符合相关法律要求。它利用 LLM 进行信息提取，并利用 Prolog 进行法律推理。通过将信息提取与法律推理分开，LogicLease 实现了对应用于每个案例的法律逻辑的更高透明度和控制力。我们通过一系列测试评估了 LogicLease 的准确性、效率和鲁棒性，实现了 100% 的准确性和 2.57 秒的平均处理时间。LogicLease 通过提供清晰、分步的推理，引用具体法律，并以其避免幻觉的能力而区别于最先进的基于 LLM 的法律分析系统，从而显示出优势——这是 LLM 中的常见问题。
 
-##### **KIMAs: A Configurable Knowledge Integrated Multi-Agent System**
-2502.09596v1 by Zitao Li, Fei Wei, Yuexiang Xie, Dawei Gao, Weirui Kuang, Zhijian Ma, Bingchen Qian, Yaliang Li, Bolin Ding
+##### **Two-Stage Representation Learning for Analyzing Movement Behavior Dynamics in People Living with Dementia**
+2502.09173v1 by Jin Cui, Alexander Capstick, Payam Barnaghi, Gregory Scott
 
-Knowledge-intensive conversations supported by large language models (LLMs)
-have become one of the most popular and helpful applications that can assist
-people in different aspects. Many current knowledge-intensive applications are
-centered on retrieval-augmented generation (RAG) techniques. While many
-open-source RAG frameworks facilitate the development of RAG-based
-applications, they often fall short in handling practical scenarios complicated
-by heterogeneous data in topics and formats, conversational context management,
-and the requirement of low-latency response times. This technical report
-presents a configurable knowledge integrated multi-agent system, KIMAs, to
-address these challenges. KIMAs features a flexible and configurable system for
-integrating diverse knowledge sources with 1) context management and query
-rewrite mechanisms to improve retrieval accuracy and multi-turn conversational
-coherency, 2) efficient knowledge routing and retrieval, 3) simple but
-effective filter and reference generation mechanisms, and 4) optimized
-parallelizable multi-agent pipeline execution. Our work provides a scalable
-framework for advancing the deployment of LLMs in real-world settings. To show
-how KIMAs can help developers build knowledge-intensive applications with
-different scales and emphases, we demonstrate how we configure the system to
-three applications already running in practice with reliable performance.
+In remote healthcare monitoring, time series representation learning reveals
+critical patient behavior patterns from high-frequency data. This study
+analyzes home activity data from individuals living with dementia by proposing
+a two-stage, self-supervised learning approach tailored to uncover low-rank
+structures. The first stage converts time-series activities into text sequences
+encoded by a pre-trained language model, providing a rich, high-dimensional
+latent state space using a PageRank-based method. This PageRank vector captures
+latent state transitions, effectively compressing complex behaviour data into a
+succinct form that enhances interpretability. This low-rank representation not
+only enhances model interpretability but also facilitates clustering and
+transition analysis, revealing key behavioral patterns correlated with
+clinicalmetrics such as MMSE and ADAS-COG scores. Our findings demonstrate the
+framework's potential in supporting cognitive status prediction, personalized
+care interventions, and large-scale health monitoring.
 
-摘要：由大型語言模型 (LLM) 支持的知識密集型對話
-已成為最受歡迎且有用的應用程式之一，可協助
-人們在不同面向獲得協助。許多當前的知識密集型應用程式
-都以檢索增強生成 (RAG) 技術為中心。雖然許多
-開放原始碼 RAG 架構促進了基於 RAG 的應用程式開發，但它們在處理
-主題和格式中異質資料、對話內容管理，以及低延遲回應時間的要求所造成的實際情況時，通常力有未逮。這份技術報告
-提出了可設定的知識整合多重代理系統，KIMAs，以
-解決這些挑戰。KIMAs 具備靈活且可設定的系統，可整合多樣化的知識來源，並具備 1) 內容管理和查詢
-改寫機制，以提升檢索準確度和多輪對話的連貫性，2) 有效的知識路由和檢索，3) 簡單但
-有效的篩選和參考產生機制，以及 4) 最佳化的可平行化多重代理管線執行。我們的作品提供了可擴充的
-架構，以推動在實際環境中部署 LLM。為了展示 KIMAs 如何協助開發人員建置不同規模和重點的知識密集型應用程式，我們示範如何設定系統至
-三個已實際執行且效能良好的應用程式。
+摘要：在遠程醫療監控中，時間序列表示學習揭示了高頻率數據中的關鍵患者行為模式。本研究通過提出一個兩階段、自我監督的學習方法來分析痴呆症患者的家庭活動數據，該方法專門用於發現低秩結構。第一階段將時間序列活動轉換為由預訓練語言模型編碼的文本序列，使用基於 PageRank 的方法提供了一個豐富、高維的潛在狀態空間。此 PageRank 向量捕獲潛在狀態轉換，有效地將複雜的行為數據壓縮成簡潔的形式，從而增強了解力。此低秩表示不僅增強了模型的可解釋性，還促進了聚類和轉換分析，揭示了與臨床指標（例如 MMSE 和 ADAS-COG 分數）相關的關鍵行為模式。我們的研究結果證明了該框架在支持認知狀態預測、個性化護理干預和大型健康監控方面的潛力。
 
-##### **Logical forms complement probability in understanding language model (and human) performance**
-2502.09589v1 by Yixuan Wang, Freda Shi
+##### **HistoSmith: Single-Stage Histology Image-Label Generation via Conditional Latent Diffusion for Enhanced Cell Segmentation and Classification**
+2502.08754v1 by Valentina Vadori, Jean-Marie Graïc, Antonella Peruffo, Livio Finos, Ujwala Kiran Chaudhari, Enrico Grisan
 
-With the increasing interest in using large language models (LLMs) for
-planning in natural language, understanding their behaviors becomes an
-important research question. This work conducts a systematic investigation of
-LLMs' ability to perform logical reasoning in natural language. We introduce a
-controlled dataset of hypothetical and disjunctive syllogisms in propositional
-and modal logic and use it as the testbed for understanding LLM performance.
-Our results lead to novel insights in predicting LLM behaviors: in addition to
-the probability of input (Gonen et al., 2023; McCoy et al., 2024), logical
-forms should be considered as orthogonal factors. In addition, we show
-similarities and differences between the logical reasoning performances of
-humans and LLMs by comparing LLM and human behavioral results.
+Precise segmentation and classification of cell instances are vital for
+analyzing the tissue microenvironment in histology images, supporting medical
+diagnosis, prognosis, treatment planning, and studies of brain
+cytoarchitecture. However, the creation of high-quality annotated datasets for
+training remains a major challenge. This study introduces a novel single-stage
+approach (HistoSmith) for generating image-label pairs to augment histology
+datasets. Unlike state-of-the-art methods that utilize diffusion models with
+separate components for label and image generation, our approach employs a
+latent diffusion model to learn the joint distribution of cellular layouts,
+classification masks, and histology images. This model enables tailored data
+generation by conditioning on user-defined parameters such as cell types,
+quantities, and tissue types. Trained on the Conic H&E histopathology dataset
+and the Nissl-stained CytoDArk0 dataset, the model generates realistic and
+diverse labeled samples. Experimental results demonstrate improvements in cell
+instance segmentation and classification, particularly for underrepresented
+cell types like neutrophils in the Conic dataset. These findings underscore the
+potential of our approach to address data scarcity challenges.
 
-摘要：隨著在自然語言規劃中使用大型語言模型（LLM）的興趣日益濃厚，理解其行為已成為一項重要的研究課題。本研究對 LLM 在自然語言中執行邏輯推理的能力進行了系統性調查。我們引入了一個由假設和析取三段論組成的受控資料集，並使用它作為理解 LLM 效能的測試平台。我們的結果產生了預測 LLM 行為的新見解：除了輸入的機率（Gonen 等人，2023 年；McCoy 等人，2024 年）之外，邏輯形式應被視為正交因子。此外，我們透過比較 LLM 和人類行為結果，展示了人類和 LLM 在邏輯推理表現上的相似性和差異性。
+摘要：精確的細胞實例分割和分類對於分析組織學影像中的組織微環境、支援醫療診斷、預後、治療規劃和腦部細胞結構研究至關重要。然而，建立用於訓練的高品質標註資料集仍然是一項重大挑戰。本研究提出了一種新穎的單階段方法 (HistoSmith)，用於產生影像標籤對，以擴充組織學資料集。與利用擴散模型並將標籤和影像產生分開的組成部分的現有技術不同，我們的做法採用潛在擴散模型來學習細胞佈局、分類遮罩和組織學影像的聯合分佈。此模型能透過調整使用者定義的參數（例如細胞類型、數量和組織類型）來進行客製化資料產生。在 Conic H&E 細胞病理學資料集和 Nissl 染色的 CytoDArk0 資料集上訓練後，此模型產生逼真且多樣化的標籤樣本。實驗結果顯示細胞實例分割和分類有顯著進步，特別是對於 Conic 資料集中代表性不足的細胞類型，例如中性球。這些發現強調了我們的方法在解決資料稀少性挑戰方面的潛力。
 
-##### **Optimizing GPT for Video Understanding: Zero-Shot Performance and Prompt Engineering**
-2502.09573v1 by Mark Beliaev, Victor Yang, Madhura Raju, Jiachen Sun, Xinghai Hu
+##### **Brain Latent Progression: Individual-based Spatiotemporal Disease Progression on 3D Brain MRIs via Latent Diffusion**
+2502.08560v1 by Lemuel Puglisi, Daniel C. Alexander, Daniele Ravì
 
-In this study, we tackle industry challenges in video content classification
-by exploring and optimizing GPT-based models for zero-shot classification
-across seven critical categories of video quality. We contribute a novel
-approach to improving GPT's performance through prompt optimization and policy
-refinement, demonstrating that simplifying complex policies significantly
-reduces false negatives. Additionally, we introduce a new
-decomposition-aggregation-based prompt engineering technique, which outperforms
-traditional single-prompt methods. These experiments, conducted on real
-industry problems, show that thoughtful prompt design can substantially enhance
-GPT's performance without additional finetuning, offering an effective and
-scalable solution for improving video classification systems across various
-domains in industry.
+The growing availability of longitudinal Magnetic Resonance Imaging (MRI)
+datasets has facilitated Artificial Intelligence (AI)-driven modeling of
+disease progression, making it possible to predict future medical scans for
+individual patients. However, despite significant advancements in AI, current
+methods continue to face challenges including achieving patient-specific
+individualization, ensuring spatiotemporal consistency, efficiently utilizing
+longitudinal data, and managing the substantial memory demands of 3D scans. To
+address these challenges, we propose Brain Latent Progression (BrLP), a novel
+spatiotemporal model designed to predict individual-level disease progression
+in 3D brain MRIs. The key contributions in BrLP are fourfold: (i) it operates
+in a small latent space, mitigating the computational challenges posed by
+high-dimensional imaging data; (ii) it explicitly integrates subject metadata
+to enhance the individualization of predictions; (iii) it incorporates prior
+knowledge of disease dynamics through an auxiliary model, facilitating the
+integration of longitudinal data; and (iv) it introduces the Latent Average
+Stabilization (LAS) algorithm, which (a) enforces spatiotemporal consistency in
+the predicted progression at inference time and (b) allows us to derive a
+measure of the uncertainty for the prediction. We train and evaluate BrLP on
+11,730 T1-weighted (T1w) brain MRIs from 2,805 subjects and validate its
+generalizability on an external test set comprising 2,257 MRIs from 962
+subjects. Our experiments compare BrLP-generated MRI scans with real follow-up
+MRIs, demonstrating state-of-the-art accuracy compared to existing methods. The
+code is publicly available at: https://github.com/LemuelPuglisi/BrLP.
 
-摘要：在這項研究中，我們透過探索和最佳化基於 GPT 的模型，來處理影片內容分類中的產業挑戰，並針對影片品質的七個關鍵類別進行零次學習分類。我們貢獻了一種透過提示最佳化和政策改善來提升 GPT 效能的新方法，證明簡化複雜政策能大幅減少假陰性。此外，我們還引入了一種新的基於分解聚合的提示工程技術，其效能優於傳統的單一提示方法。這些在真實產業問題上執行的實驗顯示，經過深思熟慮的提示設計可以在不進行額外微調的情況下大幅提升 GPT 的效能，為提升產業中各種領域的影片分類系統提供了一個有效且可擴充的解決方案。
+摘要：隨著縱向磁共振影像 (MRI) 資料集的日益普及，已促進人工智慧 (AI) 驅動的疾病進程建模，讓預測個別患者的未來醫學掃描成為可能。然而，儘管 AI 有顯著進展，目前的技術仍面臨挑戰，包括實現患者特定的個別化、確保時空一致性、有效利用縱向資料，以及管理 3D 掃描的大量記憶體需求。為了應對這些挑戰，我們提出腦潛在進程 (BrLP)，這是一種新穎的時空模型，旨在預測 3D 腦部 MRI 中的個人層級疾病進程。BrLP 的主要貢獻有四個：(i) 它在一個小的潛在空間中運作，減輕了高維度影像資料帶來的計算挑戰；(ii) 它明確整合受試者的元資料，以增強預測的個別化；(iii) 它透過輔助模型納入疾病動態的先驗知識，促進縱向資料的整合；(iv) 它引入了潛在平均穩定化 (LAS) 演算法，該演算法 (a) 在推論時強制預測進程中的時空一致性，(b) 讓我們能夠推導預測的不確定性測量。我們對來自 2,805 名受試者的 11,730 個 T1 加權 (T1w) 腦部 MRI 進行 BrLP 訓練和評估，並在包含來自 962 名受試者的 2,257 個 MRI 的外部測試集上驗證其概括性。我們的實驗將 BrLP 生成的 MRI 掃描與實際追蹤 MRI 進行比較，與現有方法相比，展示了最先進的準確性。程式碼已公開於：https://github.com/LemuelPuglisi/BrLP。
 
-##### **MorphNLI: A Stepwise Approach to Natural Language Inference Using Text Morphing**
-2502.09567v1 by Vlad Andrei Negru, Robert Vacareanu, Camelia Lemnaru, Mihai Surdeanu, Rodica Potolea
+##### **Representation Learning to Advance Multi-institutional Studies with Electronic Health Record Data**
+2502.08547v1 by Doudou Zhou, Han Tong, Linshanshan Wang, Suqi Liu, Xin Xiong, Ziming Gan, Romain Griffier, Boris Hejblum, Yun-Chung Liu, Chuan Hong, Clara-Lea Bonzel, Tianrun Cai, Kevin Pan, Yuk-Lam Ho, Lauren Costa, Vidul A. Panickan, J. Michael Gaziano, Kenneth Mandl, Vianney Jouhet, Rodolphe Thiebaut, Zongqi Xia, Kelly Cho, Katherine Liao, Tianxi Cai
 
-We introduce MorphNLI, a modular step-by-step approach to natural language
-inference (NLI). When classifying the premise-hypothesis pairs into
-{entailment, contradiction, neutral}, we use a language model to generate the
-necessary edits to incrementally transform (i.e., morph) the premise into the
-hypothesis. Then, using an off-the-shelf NLI model we track how the entailment
-progresses with these atomic changes, aggregating these intermediate labels
-into a final output. We demonstrate the advantages of our proposed method
-particularly in realistic cross-domain settings, where our method always
-outperforms strong baselines with improvements up to 12.6% (relative). Further,
-our proposed approach is explainable as the atomic edits can be used to
-understand the overall NLI label.
+The adoption of EHRs has expanded opportunities to leverage data-driven
+algorithms in clinical care and research. A major bottleneck in effectively
+conducting multi-institutional EHR studies is the data heterogeneity across
+systems with numerous codes that either do not exist or represent different
+clinical concepts across institutions. The need for data privacy further limits
+the feasibility of including multi-institutional patient-level data required to
+study similarities and differences across patient subgroups. To address these
+challenges, we developed the GAME algorithm. Tested and validated across 7
+institutions and 2 languages, GAME integrates data in several levels: (1) at
+the institutional level with knowledge graphs to establish relationships
+between codes and existing knowledge sources, providing the medical context for
+standard codes and their relationship to each other; (2) between institutions,
+leveraging language models to determine the relationships between
+institution-specific codes with established standard codes; and (3) quantifying
+the strength of the relationships between codes using a graph attention
+network. Jointly trained embeddings are created using transfer and federated
+learning to preserve data privacy. In this study, we demonstrate the
+applicability of GAME in selecting relevant features as inputs for AI-driven
+algorithms in a range of conditions, e.g., heart failure, rheumatoid arthritis.
+We then highlight the application of GAME harmonized multi-institutional EHR
+data in a study of Alzheimer's disease outcomes and suicide risk among patients
+with mental health disorders, without sharing patient-level data outside
+individual institutions.
 
-摘要：我們引入 MorphNLI，一種模組化逐步方法，用於自然語言推論 (NLI)。當對前提假設對進行分類時，我們使用語言模型來產生必要的編輯，以逐步轉換（即，變形）前提成為假設。然後，使用現成的 NLI 模型，我們追蹤推論如何隨著這些原子變化而進展，將這些中間標籤彙總成最終輸出。我們展示了我們提出的方法的優點，特別是在現實的跨網域設置中，我們的模型始終優於強大的基線，改進幅度高達 12.6%（相對）。此外，我們提出的方法是可以解釋的，因為原子編輯可以用來理解整體 NLI 標籤。
+摘要：電子健康紀錄的採用擴大了在臨床照護和研究中利用資料驅動演算法的機會。在有效進行多機構電子健康紀錄研究時，一個主要的瓶頸是系統間資料異質性，其中有許多代碼在機構間不存在或表示不同的臨床概念。資料隱私的需求進一步限制了納入多機構患者層級資料的可行性，而這些資料對於研究患者亞群之間的相似性和差異性是必要的。為了應對這些挑戰，我們開發了 GAME 演算法。GAME 已在 7 個機構和 2 種語言中進行測試和驗證，它整合了多個層級的資料：(1) 在機構層級，使用知識圖表來建立代碼和現有知識來源之間的關係，為標準代碼及其彼此之間的關係提供醫療背景；(2) 在機構之間，利用語言模型來確定機構特定代碼與已建立的標準代碼之間的關係；(3) 使用圖形注意網路量化代碼之間關係的強度。使用遷移和聯合學習建立聯合訓練的嵌入，以保護資料隱私。在本研究中，我們展示了 GAME 在選擇相關特徵作為 AI 驅動演算法輸入時的適用性，適用於各種情況，例如心臟衰竭、類風濕性關節炎。然後，我們重點介紹了 GAME 和諧化多機構電子健康紀錄資料在阿茲海默症疾病結果和精神疾病患者自殺風險研究中的應用，而無需在個別機構之外共享患者層級資料。
 
-##### **Zero-shot generation of synthetic neurosurgical data with large language models**
-2502.09566v1 by Austin A. Barr, Eddie Guo, Emre Sezgin
+##### **EEG Artifact Detection and Correction with Deep Autoencoders**
+2502.08686v1 by David Aquilué-Llorens, Aureli Soria-Frisch
 
-Clinical data is fundamental to advance neurosurgical research, but access is
-often constrained by data availability, small sample sizes, privacy
-regulations, and resource-intensive preprocessing and de-identification
-procedures. Synthetic data offers a potential solution to challenges associated
-with accessing and using real-world data (RWD). This study aims to evaluate the
-capability of zero-shot generation of synthetic neurosurgical data with a large
-language model (LLM), GPT-4o, by benchmarking with the conditional tabular
-generative adversarial network (CTGAN). Synthetic datasets were compared to
-real-world neurosurgical data to assess fidelity (means, proportions,
-distributions, and bivariate correlations), utility (ML classifier performance
-on RWD), and privacy (duplication of records from RWD). The GPT-4o-generated
-datasets matched or exceeded CTGAN performance, despite no fine-tuning or
-access to RWD for pre-training. Datasets demonstrated high univariate and
-bivariate fidelity to RWD without directly exposing any real patient records,
-even at amplified sample size. Training an ML classifier on GPT-4o-generated
-data and testing on RWD for a binary prediction task showed an F1 score (0.706)
-with comparable performance to training on the CTGAN data (0.705) for
-predicting postoperative functional status deterioration. GPT-4o demonstrated a
-promising ability to generate high-fidelity synthetic neurosurgical data. These
-findings also indicate that data synthesized with GPT-4o can effectively
-augment clinical data with small sample sizes, and train ML models for
-prediction of neurosurgical outcomes. Further investigation is necessary to
-improve the preservation of distributional characteristics and boost classifier
-performance.
+EEG signals convey important information about brain activity both in healthy
+and pathological conditions. However, they are inherently noisy, which poses
+significant challenges for accurate analysis and interpretation. Traditional
+EEG artifact removal methods, while effective, often require extensive expert
+intervention. This study presents LSTEEG, a novel LSTM-based autoencoder
+designed for the detection and correction of artifacts in EEG signals.
+Leveraging deep learning, particularly LSTM layers, LSTEEG captures non-linear
+dependencies in sequential EEG data. LSTEEG demonstrates superior performance
+in both artifact detection and correction tasks compared to other
+state-of-the-art convolutional autoencoders. Our methodology enhances the
+interpretability and utility of the autoencoder's latent space, enabling
+data-driven automated artefact removal in EEG its application in downstream
+tasks. This research advances the field of efficient and accurate multi-channel
+EEG preprocessing, and promotes the implementation and usage of automated EEG
+analysis pipelines for brain health applications.
 
-摘要：<paragraph>臨床數據是推進神經外科研究的基礎，但訪問通常受到數據可用性、樣本量小、隱私法規以及資源密集型預處理和去識別程序的限制。合成數據為與存取和使用真實世界數據 (RWD) 相關的挑戰提供了潛在解決方案。本研究旨在評估使用大型語言模型 (LLM) GPT-4o 零次生成合成神經外科數據的能力，並通過條件表格生成對抗網路 (CTGAN) 進行基準測試。將合成數據集與真實世界的神經外科數據進行比較，以評估保真度（平均值、比例、分布和二元相關性）、實用性（RWD 上的 ML 分類器性能）和隱私（RWD 中記錄的重複）。儘管沒有微調或訪問 RWD 進行預訓練，但 GPT-4o 生成的數據集與 CTGAN 性能相匹配或超過 CTGAN 性能。數據集證明了對 RWD 的高單變量和二變量保真度，即使在擴充的樣本量下也不會直接公開任何真實患者記錄。在 GPT-4o 生成的數據上訓練 ML 分類器，並在 RWD 上測試二元預測任務，顯示 F1 分數 (0.706) 與在 CTGAN 數據上訓練以預測術後功能狀態惡化時的性能相當 (0.705)。GPT-4o 展示了生成高保真合成神經外科數據的潛力。這些發現還表明，使用 GPT-4o 合成的數據可以有效地增加樣本量小的臨床數據，並訓練 ML 模型以預測神經外科結果。需要進一步研究以改善分佈特徵的保留並提升分類器性能。</paragraph>
+摘要：腦電圖訊號傳達了關於大腦活動的重要資訊，無論是在健康或病理狀況下。然而，它們本質上是有雜訊的，這對準確的分析和解釋構成了重大的挑戰。傳統的腦電圖人工製品移除方法雖然有效，但通常需要大量的專家介入。本研究提出 LSTEEG，一種新穎的基於 LSTM 的自動編碼器，用於偵測和校正腦電圖訊號中的人工製品。利用深度學習，特別是 LSTM 層，LSTEEG 捕捉序列腦電圖資料中的非線性依賴性。與其他最先進的卷積自動編碼器相比，LSTEEG 在人工製品偵測和校正任務中都展現出優異的效能。我們的做法增強了自動編碼器潛在空間的可解釋性和實用性，讓資料驅動的自動人工製品移除得以應用於腦電圖的下游任務。這項研究推動了高效且準確的多通道腦電圖前處理領域，並促進了自動腦電圖分析管線在腦部健康應用中的實作和使用。
 
-##### **MDCrow: Automating Molecular Dynamics Workflows with Large Language Models**
-2502.09565v1 by Quintina Campbell, Sam Cox, Jorge Medina, Brittany Watterson, Andrew D. White
+##### **SycEval: Evaluating LLM Sycophancy**
+2502.08177v1 by Aaron Fanous, Jacob Goldberg, Ank A. Agarwal, Joanna Lin, Anson Zhou, Roxana Daneshjou, Sanmi Koyejo
 
-Molecular dynamics (MD) simulations are essential for understanding
-biomolecular systems but remain challenging to automate. Recent advances in
-large language models (LLM) have demonstrated success in automating complex
-scientific tasks using LLM-based agents. In this paper, we introduce MDCrow, an
-agentic LLM assistant capable of automating MD workflows. MDCrow uses
-chain-of-thought over 40 expert-designed tools for handling and processing
-files, setting up simulations, analyzing the simulation outputs, and retrieving
-relevant information from literature and databases. We assess MDCrow's
-performance across 25 tasks of varying required subtasks and difficulty, and we
-evaluate the agent's robustness to both difficulty and prompt style.
-\texttt{gpt-4o} is able to complete complex tasks with low variance, followed
-closely by \texttt{llama3-405b}, a compelling open-source model. While prompt
-style does not influence the best models' performance, it has significant
-effects on smaller models.
+Large language models (LLMs) are increasingly applied in educational,
+clinical, and professional settings, but their tendency for sycophancy --
+prioritizing user agreement over independent reasoning -- poses risks to
+reliability. This study introduces a framework to evaluate sycophantic behavior
+in ChatGPT-4o, Claude-Sonnet, and Gemini-1.5-Pro across AMPS (mathematics) and
+MedQuad (medical advice) datasets. Sycophantic behavior was observed in 58.19%
+of cases, with Gemini exhibiting the highest rate (62.47%) and ChatGPT the
+lowest (56.71%). Progressive sycophancy, leading to correct answers, occurred
+in 43.52% of cases, while regressive sycophancy, leading to incorrect answers,
+was observed in 14.66%. Preemptive rebuttals demonstrated significantly higher
+sycophancy rates than in-context rebuttals (61.75% vs. 56.52%, $Z=5.87$,
+$p<0.001$), particularly in computational tasks, where regressive sycophancy
+increased significantly (preemptive: 8.13%, in-context: 3.54%, $p<0.001$).
+Simple rebuttals maximized progressive sycophancy ($Z=6.59$, $p<0.001$), while
+citation-based rebuttals exhibited the highest regressive rates ($Z=6.59$,
+$p<0.001$). Sycophantic behavior showed high persistence (78.5%, 95% CI:
+[77.2%, 79.8%]) regardless of context or model. These findings emphasize the
+risks and opportunities of deploying LLMs in structured and dynamic domains,
+offering insights into prompt programming and model optimization for safer AI
+applications.
 
-摘要：分子動力學 (MD) 模擬對於理解生物分子系統至關重要，但自動化仍然具有挑戰性。大型語言模型 (LLM) 的最新進展已證明使用基於 LLM 的代理自動化複雜的科學任務是成功的。在本文中，我們介紹了 MDCrow，這是一個代理 LLM 助理，能夠自動化 MD 工作流程。MDCrow 使用 40 多種專家設計的工具的思考鏈來處理和處理檔案、設定模擬、分析模擬輸出，以及從文獻和資料庫中檢索相關資訊。我們評估了 MDCrow 在 25 項任務中的表現，這些任務所需的子任務和難度各不相同，並且我們評估了代理對難度和提示樣式的穩健性。\texttt{gpt-4o} 能夠以低變異完成複雜的任務，緊隨其後的是一個引人注目的開源模型 \texttt{llama3-405b}。雖然提示樣式不會影響最佳模型的效能，但它對較小的模型有顯著的影響。
+摘要：大型語言模型（LLM）日益應用於教育、臨床和專業領域，但它們趨於趨炎附勢——優先考慮用戶同意而非獨立推理——對可靠性構成風險。本研究引入了一個框架來評估 ChatGPT-4o、Claude-Sonnet 和 Gemini-1.5-Pro 中的趨炎附勢行為，涉及 AMPS（數學）和 MedQuad（醫療建議）數據集。在 58.19% 的案例中觀察到了趨炎附勢行為，其中 Gemini 表現出最高比率（62.47%），而 ChatGPT 最低（56.71%）。導致正確答案的漸進式趨炎附勢發生在 43.52% 的案例中，而導致不正確答案的退步式趨炎附勢則在 14.66% 的案例中被觀察到。先發制人的反駁表現出顯著高於上下文反駁的趨炎附勢率（61.75% 對 56.52%，Z=5.87，p<0.001），特別是在計算任務中，其中退步式趨炎附勢顯著增加（先發制人：8.13%，上下文：3.54%，p<0.001）。簡單的反駁最大化了漸進式趨炎附勢（Z=6.59，p<0.001），而基於引用的反駁表現出最高的退步式比率（Z=6.59，p<0.001）。趨炎附勢行為表現出很高的持續性（78.5%，95% CI：[77.2%，79.8%]），無論上下文或模型如何。這些發現強調了在結構化和動態領域部署 LLM 的風險和機遇，為更安全的 AI 應用提供了提示編程和模型優化的見解。
 
-##### **EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents**
-2502.09560v1 by Rui Yang, Hanyang Chen, Junyu Zhang, Mark Zhao, Cheng Qian, Kangrui Wang, Qineng Wang, Teja Venkat Koripella, Marziyeh Movahedi, Manling Li, Heng Ji, Huan Zhang, Tong Zhang
+##### **Caught in the Web of Words: Do LLMs Fall for Spin in Medical Literature?**
+2502.07963v1 by Hye Sun Yun, Karen Y. C. Zhang, Ramez Kouzy, Iain J. Marshall, Junyi Jessy Li, Byron C. Wallace
 
-Leveraging Multi-modal Large Language Models (MLLMs) to create embodied
-agents offers a promising avenue for tackling real-world tasks. While
-language-centric embodied agents have garnered substantial attention,
-MLLM-based embodied agents remain underexplored due to the lack of
-comprehensive evaluation frameworks. To bridge this gap, we introduce
-EmbodiedBench, an extensive benchmark designed to evaluate vision-driven
-embodied agents. EmbodiedBench features: (1) a diverse set of 1,128 testing
-tasks across four environments, ranging from high-level semantic tasks (e.g.,
-household) to low-level tasks involving atomic actions (e.g., navigation and
-manipulation); and (2) six meticulously curated subsets evaluating essential
-agent capabilities like commonsense reasoning, complex instruction
-understanding, spatial awareness, visual perception, and long-term planning.
-Through extensive experiments, we evaluated 13 leading proprietary and
-open-source MLLMs within EmbodiedBench. Our findings reveal that: MLLMs excel
-at high-level tasks but struggle with low-level manipulation, with the best
-model, GPT-4o, scoring only 28.9% on average. EmbodiedBench provides a
-multifaceted standardized evaluation platform that not only highlights existing
-challenges but also offers valuable insights to advance MLLM-based embodied
-agents. Our code is available at https://embodiedbench.github.io.
+Medical research faces well-documented challenges in translating novel
+treatments into clinical practice. Publishing incentives encourage researchers
+to present "positive" findings, even when empirical results are equivocal.
+Consequently, it is well-documented that authors often spin study results,
+especially in article abstracts. Such spin can influence clinician
+interpretation of evidence and may affect patient care decisions. In this
+study, we ask whether the interpretation of trial results offered by Large
+Language Models (LLMs) is similarly affected by spin. This is important since
+LLMs are increasingly being used to trawl through and synthesize published
+medical evidence. We evaluated 22 LLMs and found that they are across the board
+more susceptible to spin than humans. They might also propagate spin into their
+outputs: We find evidence, e.g., that LLMs implicitly incorporate spin into
+plain language summaries that they generate. We also find, however, that LLMs
+are generally capable of recognizing spin, and can be prompted in a way to
+mitigate spin's impact on LLM outputs.
 
-摘要：<paragraph>利用多模態大型語言模型 (MLLM) 來建立具身代理，提供了解決現實世界任務的有前景途徑。儘管以語言為中心的具身代理已獲得大量關注，但由於缺乏全面的評估框架，基於 MLLM 的具身代理仍未得到充分探索。為了彌補這一差距，我們引入了 EmbodiedBench，這是一個廣泛的基準測試，旨在評估以視覺為導向的具身代理。EmbodiedBench 的特點：(1) 跨越四個環境的 1,128 項多樣化測試任務，範圍從高層級語義任務（例如，家庭）到涉及原子動作的低層級任務（例如，導航和操作）；以及 (2) 六個精心策劃的子集，用於評估基本的代理能力，例如常識推理、複雜指令理解、空間感知、視覺感知和長期規劃。通過廣泛的實驗，我們在 EmbodiedBench 中評估了 13 個領先的專有和開源 MLLM。我們的研究結果表明：MLLM 在高層級任務中表現出色，但在低層級操作中遇到困難，表現最好的模型 GPT-4o 平均得分僅為 28.9%。EmbodiedBench 提供了一個多方面的標準化評估平台，不僅突出了現有挑戰，還提供了有價值的見解來推進基於 MLLM 的具身代理。我們的程式碼可在 https://embodiedbench.github.io/ 取得。</paragraph>
+摘要：醫學研究在將新穎療法轉化為臨床實務上，面臨著有據可查的挑戰。發表誘因鼓勵研究人員呈現「正向」的發現，即使經驗結果模稜兩可。因此，有據可查的是，作者經常扭曲研究結果，特別是在文章摘要中。此類扭曲可能會影響臨床醫師對證據的詮釋，並可能影響病患照護決策。在本研究中，我們探討大型語言模型 (LLM) 提供的試驗結果詮釋是否也受到扭曲影響。由於 LLM 正越來越常被用於爬梳和綜合已發表的醫學證據，因此這點非常重要。我們評估了 22 個 LLM，發現它們普遍比人類更容易受到扭曲影響。它們也可能將扭曲傳播到其輸出中：例如，我們發現 LLM 會將扭曲隱含納入其產生的白話文摘要中。然而，我們也發現 LLM 通常有能力辨認扭曲，而且可以透過提示的方式減輕扭曲對 LLM 輸出的影響。
 
-##### **Mind the Gap! Choice Independence in Using Multilingual LLMs for Persuasive Co-Writing Tasks in Different Languages**
-2502.09532v1 by Shreyan Biswas, Alexander Erlei, Ujwal Gadiraju
+##### **An Advanced NLP Framework for Automated Medical Diagnosis with DeBERTa and Dynamic Contextual Positional Gating**
+2502.07755v1 by Mohammad Ali Labbaf Khaniki, Sahabeh Saadati, Mohammad Manthouri
 
-Recent advances in generative AI have precipitated a proliferation of novel
-writing assistants. These systems typically rely on multilingual large language
-models (LLMs), providing globalized workers the ability to revise or create
-diverse forms of content in different languages. However, there is substantial
-evidence indicating that the performance of multilingual LLMs varies between
-languages. Users who employ writing assistance for multiple languages are
-therefore susceptible to disparate output quality. Importantly, recent research
-has shown that people tend to generalize algorithmic errors across independent
-tasks, violating the behavioral axiom of choice independence. In this paper, we
-analyze whether user utilization of novel writing assistants in a charity
-advertisement writing task is affected by the AI's performance in a second
-language. Furthermore, we quantify the extent to which these patterns translate
-into the persuasiveness of generated charity advertisements, as well as the
-role of peoples' beliefs about LLM utilization in their donation choices. Our
-results provide evidence that writers who engage with an LLM-based writing
-assistant violate choice independence, as prior exposure to a Spanish LLM
-reduces subsequent utilization of an English LLM. While these patterns do not
-affect the aggregate persuasiveness of the generated advertisements, people's
-beliefs about the source of an advertisement (human versus AI) do. In
-particular, Spanish-speaking female participants who believed that they read an
-AI-generated advertisement strongly adjusted their donation behavior downwards.
-Furthermore, people are generally not able to adequately differentiate between
-human-generated and LLM-generated ads. Our work has important implications for
-the design, development, integration, and adoption of multilingual LLMs as
-assistive agents -- particularly in writing tasks.
+This paper presents a novel Natural Language Processing (NLP) framework for
+enhancing medical diagnosis through the integration of advanced techniques in
+data augmentation, feature extraction, and classification. The proposed
+approach employs back-translation to generate diverse paraphrased datasets,
+improving robustness and mitigating overfitting in classification tasks.
+Leveraging Decoding-enhanced BERT with Disentangled Attention (DeBERTa) with
+Dynamic Contextual Positional Gating (DCPG), the model captures fine-grained
+contextual and positional relationships, dynamically adjusting the influence of
+positional information based on semantic context to produce high-quality text
+embeddings. For classification, an Attention-Based Feedforward Neural Network
+(ABFNN) is utilized, effectively focusing on the most relevant features to
+improve decision-making accuracy. Applied to the classification of symptoms,
+clinical notes, and other medical texts, this architecture demonstrates its
+ability to address the complexities of medical data. The combination of data
+augmentation, contextual embedding generation, and advanced classification
+mechanisms offers a robust and accurate diagnostic tool, with potential
+applications in automated medical diagnosis and clinical decision support. This
+method demonstrates the effectiveness of the proposed NLP framework for medical
+diagnosis, achieving remarkable results with an accuracy of 99.78%, recall of
+99.72%, precision of 99.79%, and an F1-score of 99.75%. These metrics not only
+underscore the model's robust performance in classifying medical texts with
+exceptional precision and reliability but also highlight its superiority over
+existing methods, making it a highly promising tool for automated diagnostic
+systems.
 
-摘要：<paragraph>生成式 AI 的最新進展加速了新穎寫作助理的激增。這些系統通常依賴多語言大型語言模型 (LLM)，讓全球化的工作者能夠以不同的語言修改或建立各種形式的內容。然而，有大量證據顯示多語言 LLM 的表現因語言而異。因此，使用多語言寫作協助的使用者容易受到不同的輸出品質影響。重要的是，最近的研究顯示人們傾向於在獨立的任務中概化演算法錯誤，違反了選擇獨立性的行為公理。在本文中，我們分析使用者在慈善廣告寫作任務中使用新穎寫作助理是否會受到 AI 在第二語言中的表現影響。此外，我們量化這些模式轉化為所產生慈善廣告說服力的程度，以及人們對 LLM 使用在捐款選擇中的信念所扮演的角色。我們的結果提供證據，表明與基於 LLM 的寫作助理互動的寫作者會違反選擇獨立性，因為先前接觸過西班牙語 LLM 會減少後續使用英語 LLM 的情況。雖然這些模式不會影響所產生廣告的整體說服力，但人們對廣告來源（人類與 AI）的信念會影響。特別是，相信自己閱讀 AI 生成的廣告的西班牙語系女性參與者大幅調整了他們的捐款行為。此外，人們通常無法充分區分人類產生的廣告和 LLM 產生的廣告。我們的研究對多語言 LLM 作為輔助代理的設計、開發、整合和採用具有重要的意義，特別是在寫作任務中。</paragraph>
+摘要：本文提出了一個創新的自然語言處理 (NLP) 框架，透過整合資料擴充、特徵萃取和分類的進階技術來增強醫療診斷。所提出的方法採用反向翻譯來產生多樣化的同義改寫資料集，提升穩健性並減輕分類任務中的過度擬合。透過利用具有動態脈絡位置閘控 (DCPG) 的解碼增強 BERT 與去糾纏注意力 (DeBERTa)，這個模型捕捉細緻的脈絡和位置關係，根據語意脈絡動態調整位置資訊的影響，以產生高品質的文字嵌入。在分類方面，利用基於注意力的前饋神經網路 (ABFNN)，有效地關注最相關的特徵，以提高決策準確度。應用於症狀、臨床筆記和其他醫療文本的分類，此架構證明了其處理醫療資料複雜性的能力。資料擴充、脈絡嵌入產生和進階分類機制的結合提供了一個穩健且準確的診斷工具，在自動化醫療診斷和臨床決策支援中具有潛在應用。此方法證明了所提出的 NLP 框架在醫療診斷中的有效性，以 99.78% 的準確度、99.72% 的召回率、99.79% 的精確度和 99.75% 的 F1 分數，取得了顯著的成果。這些指標不僅強調了模型在分類醫療文本時具有卓越的精確度和可靠性，也突顯了它優於現有方法的優越性，使其成為自動化診斷系統中極具前景的工具。
 
-##### **Diffusion Models for Molecules: A Survey of Methods and Tasks**
-2502.09511v1 by Liang Wang, Chao Song, Zhiyuan Liu, Yu Rong, Qiang Liu, Shu Wu, Liang Wang
+##### **Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension**
+2502.07752v1 by Wenbo Gong, Meyer Scetbon, Chao Ma, Edward Meeds
 
-Generative tasks about molecules, including but not limited to molecule
-generation, are crucial for drug discovery and material design, and have
-consistently attracted significant attention. In recent years, diffusion models
-have emerged as an impressive class of deep generative models, sparking
-extensive research and leading to numerous studies on their application to
-molecular generative tasks. Despite the proliferation of related work, there
-remains a notable lack of up-to-date and systematic surveys in this area.
-Particularly, due to the diversity of diffusion model formulations, molecular
-data modalities, and generative task types, the research landscape is
-challenging to navigate, hindering understanding and limiting the area's
-growth. To address this, this paper conducts a comprehensive survey of
-diffusion model-based molecular generative methods. We systematically review
-the research from the perspectives of methodological formulations, data
-modalities, and task types, offering a novel taxonomy. This survey aims to
-facilitate understanding and further flourishing development in this area. The
-relevant papers are summarized at:
-https://github.com/AzureLeon1/awesome-molecular-diffusion-models.
+Designing efficient optimizers for large language models (LLMs) with
+low-memory requirements and fast convergence is an important and challenging
+problem. This paper makes a step towards the systematic design of such
+optimizers through the lens of structured Fisher information matrix (FIM)
+approximation. We show that many state-of-the-art efficient optimizers can be
+viewed as solutions to FIM approximation (under the Frobenius norm) with
+specific structural assumptions. Building on these insights, we propose two
+design recommendations of practical efficient optimizers for LLMs, involving
+the careful selection of structural assumptions to balance generality and
+efficiency, and enhancing memory efficiency of optimizers with general
+structures through a novel low-rank extension framework. We demonstrate how to
+use each design approach by deriving new memory-efficient optimizers: Row and
+Column Scaled SGD (RACS) and Adaptive low-dimensional subspace estimation
+(Alice). Experiments on LLaMA pre-training (up to 1B parameters) validate the
+effectiveness, showing faster and better convergence than existing
+memory-efficient baselines and Adam with little memory overhead. Notably, Alice
+achieves better than 2x faster convergence over Adam, while RACS delivers
+strong performance on the 1B model with SGD-like memory.
 
-摘要：<paragraph>包括但不限於分子生成在內的分子生成任務，對於藥物發現和材料設計至關重要，並持續吸引大量關注。近年來，擴散模型已成為深度生成模型中令人印象深刻的一類，激發了廣泛的研究，並導致對其應用於分子生成任務的眾多研究。儘管相關工作不斷增加，但這個領域仍然缺乏最新的系統性綜述。特別是，由於擴散模型公式、分子數據方式和生成任務類型的多樣性，研究領域難以瀏覽，阻礙了理解並限制了該領域的發展。為了解決這個問題，本文對基於擴散模型的分子生成方法進行了全面的調查。我們從方法論公式、數據方式和任務類型的角度系統性地回顧了研究，提供了一種新穎的分類法。本調查旨在促進理解並進一步促進該領域的蓬勃發展。相關論文總結如下：
-https://github.com/AzureLeon1/awesome-molecular-diffusion-models。</paragraph>
+摘要：設計具有低記憶體需求和快速收斂的大型語言模型 (LLM) 的高效最佳化器是一個重要且具有挑戰性的問題。本文透過結構化 Fisher 資訊矩陣 (FIM) 近似的角度，朝向此類最佳化器的系統化設計邁進一步。我們展示了許多最先進的高效最佳化器可以被視為 FIM 近似（在 Frobenius 範數下）的解，並具有特定的結構假設。基於這些見解，我們提出了 LLM 的兩個實用高效最佳化器設計建議，包括仔細選擇結構假設以平衡通用性和效率，並透過新穎的低秩延伸架構來增強具有通用結構的最佳化器的記憶體效率。我們展示了如何透過推導新的記憶體高效最佳化器來使用每種設計方法：列和欄縮放 SGD (RACS) 和自適應低維子空間估計 (Alice)。在 LLaMA 預訓練（高達 1B 參數）上的實驗驗證了其有效性，顯示比現有的記憶體高效基線和 Adam 更快且更好的收斂，且記憶體開銷很小。值得注意的是，Alice 比 Adam 快 2 倍以上，而 RACS 則在 1B 模型上提供類似 SGD 記憶體的強勁效能。
 
-##### **AttentionSmithy: A Modular Framework for Rapid Transformer Development and Customization**
-2502.09503v1 by Caleb Cranney, Jesse G. Meyer
+##### **The Devil is in the Prompts: De-Identification Traces Enhance Memorization Risks in Synthetic Chest X-Ray Generation**
+2502.07516v1 by Raman Dutt
 
-Transformer architectures have transformed AI applications but remain complex
-to customize for domain experts lacking low-level implementation expertise. We
-introduce AttentionSmithy, a modular software package that simplifies
-transformer innovation by breaking down key components into reusable building
-blocks: attention modules, feed-forward networks, normalization layers, and
-positional encodings. Users can rapidly prototype and evaluate transformer
-variants without extensive coding. Our framework supports four positional
-encoding strategies and integrates with neural architecture search for
-automated design. We validate AttentionSmithy by replicating the original
-transformer under resource constraints and optimizing translation performance
-by combining positional encodings. Additionally, we demonstrate its
-adaptability in gene-specific modeling, achieving over 95% accuracy in cell
-type classification. These case studies highlight AttentionSmithy's potential
-to accelerate research across diverse fields by removing framework
-implementation barriers.
+Generative models, particularly text-to-image (T2I) diffusion models, play a
+crucial role in medical image analysis. However, these models are prone to
+training data memorization, posing significant risks to patient privacy.
+Synthetic chest X-ray generation is one of the most common applications in
+medical image analysis with the MIMIC-CXR dataset serving as the primary data
+repository for this task. This study adopts a data-driven approach and presents
+the first systematic attempt to identify prompts and text tokens in MIMIC-CXR
+that contribute the most to training data memorization. Our analysis reveals an
+unexpected finding: prompts containing traces of de-identification procedures
+are among the most memorized, with de-identification markers contributing the
+most. Furthermore, we also find existing inference-time memorization mitigation
+strategies are ineffective and fail to sufficiently reduce the model's reliance
+on memorized text tokens highlighting a broader issue in T2I synthesis with
+MIMIC-CXR. On this front, we propose actionable strategies to enhance privacy
+and improve the reliability of generative models in medical imaging. Finally,
+our results provide a foundation for future work on developing and benchmarking
+memorization mitigation techniques for synthetic chest X-ray generation using
+the MIMIC-CXR dataset.
 
-摘要：Transformer 架構已轉變 AI 應用，但對於缺乏低階實作專業知識的領域專家而言，自訂仍很複雜。我們推出 AttentionSmithy，這是一個模組化軟體套件，透過將關鍵元件分解成可重複使用的建構區塊（注意力模組、前饋網路、正規化層和位置編碼）來簡化 Transformer 創新。使用者可以快速建置原型和評估 Transformer 變體，而無需大量編碼。我們的架構支援四種位置編碼策略，並整合神經架構搜尋以進行自動化設計。我們透過在資源限制下複製原始 Transformer 和結合位置編碼來最佳化翻譯效能，驗證 AttentionSmithy。此外，我們展示其在基因特定建模中的適應性，在細胞類型分類中達到超過 95% 的準確度。這些案例研究突顯 AttentionSmithy 在移除架構實作障礙後，加速各個領域研究的潛力。
+摘要：生成模型，尤其是文字轉圖像 (T2I) 擴散模型，在醫學影像分析中扮演著至關重要的角色。然而，這些模型容易訓練資料記憶，對病患隱私造成重大風險。合成胸部 X 光線生成是醫學影像分析中最常見的應用之一，其中 MIMIC-CXR 資料集作為此任務的主要資料儲存庫。本研究採用資料驅動的方法，並提出首次系統性嘗試，以識別 MIMIC-CXR 中最有助於訓練資料記憶的提示和文字代碼。我們的分析揭露了一個意外的發現：包含去識別程序痕跡的提示是最常被記憶的，其中去識別標記的貢獻最大。此外，我們也發現現有的推論時間記憶減緩策略無效，且無法充分降低模型對記憶文字代碼的依賴性，突顯了使用 MIMIC-CXR 進行 T2I 合成的更廣泛問題。針對此問題，我們提出可行的策略，以增強隱私並改善生成模型在醫學影像中的可靠性。最後，我們的結果為未來使用 MIMIC-CXR 資料集開發和評量合成胸部 X 光線生成的記憶減緩技術奠定了基礎。
 
-##### **Improve LLM-based Automatic Essay Scoring with Linguistic Features**
-2502.09497v1 by Zhaoyi Joey Hou, Alejandro Ciuba, Xiang Lorraine Li
+##### **KPIs 2024 Challenge: Advancing Glomerular Segmentation from Patch- to Slide-Level**
+2502.07288v1 by Ruining Deng, Tianyuan Yao, Yucheng Tang, Junlin Guo, Siqi Lu, Juming Xiong, Lining Yu, Quan Huu Cap, Pengzhou Cai, Libin Lan, Ze Zhao, Adrian Galdran, Amit Kumar, Gunjan Deotale, Dev Kumar Das, Inyoung Paik, Joonho Lee, Geongyu Lee, Yujia Chen, Wangkai Li, Zhaoyang Li, Xuege Hou, Zeyuan Wu, Shengjin Wang, Maximilian Fischer, Lars Kramer, Anghong Du, Le Zhang, Maria Sanchez Sanchez, Helena Sanchez Ulloa, David Ribalta Heredia, Carlos Perez de Arenaza Garcia, Shuoyu Xu, Bingdou He, Xinping Cheng, Tao Wang, Noemie Moreau, Katarzyna Bozek, Shubham Innani, Ujjwal Baid, Kaura Solomon Kefas, Bennett A. Landman, Yu Wang, Shilin Zhao, Mengmeng Yin, Haichun Yang, Yuankai Huo
 
-Automatic Essay Scoring (AES) assigns scores to student essays, reducing the
-grading workload for instructors. Developing a scoring system capable of
-handling essays across diverse prompts is challenging due to the flexibility
-and diverse nature of the writing task. Existing methods typically fall into
-two categories: supervised feature-based approaches and large language model
-(LLM)-based methods. Supervised feature-based approaches often achieve higher
-performance but require resource-intensive training. In contrast, LLM-based
-methods are computationally efficient during inference but tend to suffer from
-lower performance. This paper combines these approaches by incorporating
-linguistic features into LLM-based scoring. Experimental results show that this
-hybrid method outperforms baseline models for both in-domain and out-of-domain
-writing prompts.
+Chronic kidney disease (CKD) is a major global health issue, affecting over
+10% of the population and causing significant mortality. While kidney biopsy
+remains the gold standard for CKD diagnosis and treatment, the lack of
+comprehensive benchmarks for kidney pathology segmentation hinders progress in
+the field. To address this, we organized the Kidney Pathology Image
+Segmentation (KPIs) Challenge, introducing a dataset that incorporates
+preclinical rodent models of CKD with over 10,000 annotated glomeruli from 60+
+Periodic Acid Schiff (PAS)-stained whole slide images. The challenge includes
+two tasks, patch-level segmentation and whole slide image segmentation and
+detection, evaluated using the Dice Similarity Coefficient (DSC) and F1-score.
+By encouraging innovative segmentation methods that adapt to diverse CKD models
+and tissue conditions, the KPIs Challenge aims to advance kidney pathology
+analysis, establish new benchmarks, and enable precise, large-scale
+quantification for disease research and diagnosis.
 
-摘要：自動化論文評分 (AES) 會為學生的論文評分，以減輕教師的評分工作負擔。由於寫作任務的靈活性與多樣性，開發一種評分系統來處理各種提示的論文是一項挑戰。現有方法通常分為兩類：監督式特徵方法和大型語言模型 (LLM) 方法。監督式特徵方法通常能達到較高的效能，但需要大量資源進行訓練。相比之下，LLM 方法在推論期間的計算效率很高，但效能往往較低。本文結合了這些方法，將語言特徵納入 LLM 評分中。實驗結果顯示，這種混合方法在領域內和領域外寫作提示方面都優於基準模型。
+摘要：慢性腎臟病 (CKD) 是全球主要的健康問題，影響超過
+10% 的人口，並造成顯著的死亡率。雖然腎臟活檢
+仍然是 CKD 診斷和治療的黃金標準，但缺乏
+腎臟病理學分割的全面基準阻礙了該領域的進展。
+為了解決這個問題，我們組織了腎臟病理影像
+分割 (KPIs) 挑戰，引入了包含超過 10,000 個註解的
+CKD 臨床前嚙齒動物模型的資料集，這些註解來自 60 多個
+週期性酸性雪夫 (PAS) 染色的全幻燈片影像。挑戰包括
+兩個任務，修補層級分割和全幻燈片影像分割和
+偵測，使用 Dice 相似係數 (DSC) 和 F1 分數進行評估。
+通過鼓勵創新的分割方法來適應不同的 CKD 模型
+和組織條件，KPIs 挑戰旨在推進腎臟病理
+分析，建立新的基準，並實現精確、大規模的
+疾病研究和診斷量化。
 
-##### **Cracking the Code: Enhancing Development finance understanding with artificial intelligence**
-2502.09495v1 by Pierre Beaucoral
+##### **Early Risk Prediction of Pediatric Cardiac Arrest from Electronic Health Records via Multimodal Fused Transformer**
+2502.07158v1 by Jiaying Lu, Stephanie R. Brown, Songyuan Liu, Shifan Zhao, Kejun Dong, Del Bold, Michael Fundora, Alaa Aljiffry, Alex Fedorov, Jocelyn Grunwell, Xiao Hu
 
-Analyzing development projects is crucial for understanding donors aid
-strategies, recipients priorities, and to assess development finance capacity
-to adress development issues by on-the-ground actions. In this area, the
-Organisation for Economic Co-operation and Developments (OECD) Creditor
-Reporting System (CRS) dataset is a reference data source. This dataset
-provides a vast collection of project narratives from various sectors
-(approximately 5 million projects). While the OECD CRS provides a rich source
-of information on development strategies, it falls short in informing project
-purposes due to its reporting process based on donors self-declared main
-objectives and pre-defined industrial sectors. This research employs a novel
-approach that combines Machine Learning (ML) techniques, specifically Natural
-Language Processing (NLP), an innovative Python topic modeling technique called
-BERTopic, to categorise (cluster) and label development projects based on their
-narrative descriptions. By revealing existing yet hidden topics of development
-finance, this application of artificial intelligence enables a better
-understanding of donor priorities and overall development funding and provides
-methods to analyse public and private projects narratives.
+Early prediction of pediatric cardiac arrest (CA) is critical for timely
+intervention in high-risk intensive care settings. We introduce PedCA-FT, a
+novel transformer-based framework that fuses tabular view of EHR with the
+derived textual view of EHR to fully unleash the interactions of
+high-dimensional risk factors and their dynamics. By employing dedicated
+transformer modules for each modality view, PedCA-FT captures complex temporal
+and contextual patterns to produce robust CA risk estimates. Evaluated on a
+curated pediatric cohort from the CHOA-CICU database, our approach outperforms
+ten other artificial intelligence models across five key performance metrics
+and identifies clinically meaningful risk factors. These findings underscore
+the potential of multimodal fusion techniques to enhance early CA detection and
+improve patient care.
 
-摘要：分析發展專案對於了解捐助者援助策略、受贈者優先事項，以及評估發展資金能力以透過實際行動解決發展問題至關重要。在這個領域中，經濟合作暨發展組織 (OECD) 債權人報告系統 (CRS) 資料集是一個參考資料來源。此資料集提供來自各個部門的大量專案敘述（約 500 萬個專案）。雖然 OECD CRS 提供了豐富的發展策略資訊來源，但由於其報告程序基於捐助者自行申報的主要目標和預先定義的產業部門，因此在告知專案目的方面有所不足。本研究採用一種新穎的方法，結合機器學習 (ML) 技術，特別是自然語言處理 (NLP)，一種稱為 BERTopic 的創新 Python 主題建模技術，根據其敘述描述對發展專案進行分類（叢集）和標籤。透過揭露發展資金現有但隱藏的主題，這種人工智慧應用程式可以更好地了解捐助者的優先事項和整體發展資金，並提供分析公共和私人專案敘述的方法。
+摘要：早期預測兒童心臟驟停 (CA) 對高風險重症監護環境中的及時干預至關重要。我們引入了 PedCA-FT，這是一個新的基於Transformer的框架，它將 EHR 的表格視圖與 EHR 的派生文本視圖融合在一起，以充分釋放高維風險因素及其動態的交互作用。通過為每個模態視圖採用專用的Transformer模塊，PedCA-FT 捕獲復雜的時間和上下文模式以產生穩健的 CA 風險估計。在 CHOA-CICU 資料庫中經過策劃的兒科隊列上進行評估，我們的做法在五個關鍵性能指標上優於其他十個人工智慧模型，並識別出臨床上有意義的風險因素。這些發現強調了多模態融合技術在增強早期 CA 檢測和改善患者護理方面的潛力。
 
-##### **Objective quantification of mood states using large language models**
-2502.09487v1 by Jakub Onysk, Quentin Huys
+##### **Explaining 3D Computed Tomography Classifiers with Counterfactuals**
+2502.07156v1 by Joseph Paul Cohen, Louis Blankemeier, Akshay Chaudhari
 
-Emotional states influence human behaviour and cognition, leading to diverse
-thought trajectories. Similarly, Large Language Models (LLMs) showcase an
-excellent level of response consistency across wide-ranging contexts (prompts).
-We leverage these parallels to establish a framework for quantifying mental
-states. Our approach utilises self-report questionnaires that reliably assess
-these states due to their inherent sensitivity to patterns of co-occurring
-responses. Specifically, we recruited a large sample of participants (N=422) to
-investigate how well an LLM (Mistral-7B-OpenOrca) quantifies a heterogenous set
-of depressive mood states measured with participants' open-ended responses to a
-depression questionnaire. We show LLM responses to held-out multiple-choice
-questions, given participants' open-ended answers, correlate strongly (r:
-0.52-0.84) with true questionnaire scores, demonstrating LLM's generalisation
-from mood representations. We explore a link between these representations and
-factor analysis. Using ridge regression, we find depression-related subspaces
-within LLM hidden states. We show these subspaces to be predictive of
-participants' "Depression" and "Somatic & Emotional Distress" factor scores, as
-well as suicidality severity. Overall, LLMs can provide quantitative measures
-of mental states. The reliability of these hinges upon how informative the
-questions we ask participants are. Used correctly, this approach could
-supplement mental state assessment in a variety of settings.
+Counterfactual explanations in medical imaging are critical for understanding
+the predictions made by deep learning models. We extend the Latent Shift
+counterfactual generation method from 2D applications to 3D computed tomography
+(CT) scans. We address the challenges associated with 3D data, such as limited
+training samples and high memory demands, by implementing a slice-based
+approach. This method leverages a 2D encoder trained on CT slices, which are
+subsequently combined to maintain 3D context. We demonstrate this technique on
+two models for clinical phenotype prediction and lung segmentation. Our
+approach is both memory-efficient and effective for generating interpretable
+counterfactuals in high-resolution 3D medical imaging.
 
-摘要：情緒狀態會影響人類行為和認知，導致不同的思維軌跡。同樣地，大型語言模型 (LLM) 在廣泛的脈絡（提示）中展示出極佳的反應一致性。我們利用這些相似之處來建立一個量化心理狀態的框架。我們的做法利用自我報告問卷，由於這些問卷對共生反應模式具有內在敏感性，因此可以可靠地評估這些狀態。具體來說，我們招募了大量的參與者樣本 (N=422) 來調查 LLM (Mistral-7B-OpenOrca) 如何量化一組異質的抑鬱情緒狀態，這些狀態是根據參與者對抑鬱症問卷的開放式回答來衡量的。我們展示了 LLM 對保留的多選題的回答，給定參與者的開放式回答，與真正的問卷分數密切相關 (r：0.52-0.84)，這證明了 LLM 從情緒表徵中進行概括。我們探索這些表徵與因子分析之間的聯繫。使用嶺回歸，我們在 LLM 隱藏狀態內發現了與抑鬱相關的子空間。我們展示這些子空間可以預測參與者的「抑鬱」和「軀體和情緒困擾」因子分數，以及自殺嚴重性。總體而言，LLM 可以提供心理狀態的量化測量。這些測量的可靠性取決於我們詢問參與者的問題的資訊性。如果使用得當，這種方法可以補充各種環境中的心理狀態評估。
+摘要：反事實解釋在醫學影像中對於理解深度學習模型所做的預測至關重要。我們將 Latent Shift 反事實生成方法從 2D 應用程式延伸到 3D 電腦斷層掃描 (CT) 掃描。我們透過實作基於切片的做法，來解決與 3D 資料相關的挑戰，例如受限的訓練樣本和高記憶體需求。此方法利用經過 CT 切片訓練的 2D 編碼器，隨後將這些切片結合起來以維護 3D 背景。我們在兩個用於臨床表型預測和肺部分割的模型上展示此技術。我們的做法對於在高解析度 3D 醫學影像中產生可解釋的反事實，既節省記憶體又有效。
 
-##### **The Multilingual Mind : A Survey of Multilingual Reasoning in Language Models**
-2502.09457v1 by Akash Ghosh, Debayan Datta, Sriparna Saha, Chirag Agarwal
+##### **Interactive Data Harmonization with LLM Agents**
+2502.07132v1 by Aécio Santos, Eduardo H. M. Pena, Roque Lopez, Juliana Freire
 
-While reasoning and multilingual capabilities in Language Models (LMs) have
-achieved remarkable progress in recent years, their integration into a unified
-paradigm, multilingual reasoning, is at a nascent stage. Multilingual reasoning
-requires language models to handle logical reasoning across languages while
-addressing misalignment, biases, and challenges in low-resource settings. This
-survey provides the first in-depth review of multilingual reasoning in LMs. In
-this survey, we provide a systematic overview of existing methods that leverage
-LMs for multilingual reasoning, specifically outlining the challenges,
-motivations, and foundational aspects of applying language models to reason
-across diverse languages. We provide an overview of the standard data resources
-used for training multilingual reasoning in LMs and the evaluation benchmarks
-employed to assess their multilingual capabilities. Next, we analyze various
-state-of-the-art methods and their performance on these benchmarks. Finally, we
-explore future research opportunities to improve multilingual reasoning in LMs,
-focusing on enhancing their ability to handle diverse languages and complex
-reasoning tasks.
+Data harmonization is an essential task that entails integrating datasets
+from diverse sources. Despite years of research in this area, it remains a
+time-consuming and challenging task due to schema mismatches, varying
+terminologies, and differences in data collection methodologies. This paper
+presents the case for agentic data harmonization as a means to both empower
+experts to harmonize their data and to streamline the process. We introduce
+Harmonia, a system that combines LLM-based reasoning, an interactive user
+interface, and a library of data harmonization primitives to automate the
+synthesis of data harmonization pipelines. We demonstrate Harmonia in a
+clinical data harmonization scenario, where it helps to interactively create
+reusable pipelines that map datasets to a standard format. Finally, we discuss
+challenges and open problems, and suggest research directions for advancing our
+vision.
+
+摘要：資料調和是一項整合不同來源資料集的重要任務。儘管多年來針對此領域的研究不斷，但由於架構不匹配、術語不同，以及資料收集方法的差異，它仍然是一項耗時且具有挑戰性的任務。本文提出代理資料調和，作為賦能專家調和其資料並簡化流程的方法。我們介紹 Harmonia，一個結合了基於 LLM 的推理、互動式使用者介面和資料調和原語庫的系統，以自動化資料調和管線的合成。我們在臨床資料調和場景中展示了 Harmonia，它有助於互動式建立可重複使用的管線，將資料集對應至標準格式。最後，我們討論挑戰和開放性問題，並建議研究方向以推進我們的願景。
+
+##### **Machine Learning for Everyone: Simplifying Healthcare Analytics with BigQuery ML**
+2502.07026v1 by Mohammad Amir Salari, Bahareh Rahmani
+
+Machine learning (ML) is transforming healthcare by enabling predictive
+analytics, personalized treatments, and improved patient outcomes. However,
+traditional ML workflows require specialized skills, infrastructure, and
+resources, limiting accessibility for many healthcare professionals. This paper
+explores how Google Cloud's BigQuery ML simplifies the development and
+deployment of ML models using SQL, reducing technical barriers. Through a case
+study on diabetes prediction using the Diabetes Health Indicators Dataset, we
+evaluate three predictive models: Logistic Regression, Boosted Tree, and Deep
+Neural Network (DNN). Our results demonstrate that the Boosted Tree model
+achieves the highest performance, making it highly effective for diabetes
+prediction. This study highlights BigQuery ML's role in democratizing machine
+learning by providing a scalable, efficient, and accessible solution for
+healthcare analytics.
 
-摘要：儘管語言模型 (LM) 的推理和多語言能力在近年來取得顯著進展，但它們整合至統一典範（多語言推理）仍處於萌芽階段。多語言推理要求語言模型跨語言處理邏輯推理，同時解決低資源環境中的錯位、偏見和挑戰。本調查提供了 LM 中多語言推理的首次深入探討。在本調查中，我們系統性地概述了現有利用 LM 進行多語言推理的方法，特別概述了將語言模型應用於跨不同語言推理的挑戰、動機和基礎方面。我們概述了用於訓練 LM 中多語言推理的標準數據資源，以及用於評估其多語言能力的評估基準。接下來，我們分析了各種最先進的方法及其在這些基準上的表現。最後，我們探討了改進 LM 中多語言推理的未來研究機會，重點關注增強其處理不同語言和複雜推理任務的能力。
+摘要：機器學習 (ML) 透過啟用預測分析、個人化治療和改善病患結果，正在轉型醫療保健。然而，傳統的 ML 工作流程需要專業技能、基礎設施和資源，限制了許多醫療保健專業人員的可及性。本文探討 Google Cloud 的 BigQuery ML 如何使用 SQL 簡化 ML 模型的開發和部署，降低技術障礙。透過使用糖尿病健康指標資料集對糖尿病預測進行個案研究，我們評估了三個預測模型：邏輯迴歸、提升樹和深度神經網路 (DNN)。我們的結果證明，提升樹模型達到了最高的效能，使其對於糖尿病預測非常有效。這項研究強調了 BigQuery ML 在民主化機器學習中扮演的角色，提供可擴充、有效率且可存取的醫療保健分析解決方案。
 
-##### **Pixel-Level Reasoning Segmentation via Multi-turn Conversations**
-2502.09447v1 by Dexian Cai, Xiaocui Yang, Yongkang Liu, Daling Wang, Shi Feng, Yifei Zhang, Soujanya Poria
+##### **AIMS.au: A Dataset for the Analysis of Modern Slavery Countermeasures in Corporate Statements**
+2502.07022v1 by Adriana Eufrosiana Bora, Pierre-Luc St-Charles, Mirko Bronzi, Arsène Fansi Tchango, Bruno Rousseau, Kerrie Mengersen
 
-Existing visual perception systems focus on region-level segmentation in
-single-turn dialogues, relying on complex and explicit query instructions. Such
-systems cannot reason at the pixel level and comprehend dynamic user intent
-that changes over interaction. Our work tackles this issue by introducing a
-novel task, Pixel-level Reasoning Segmentation (Pixel-level RS) based on
-multi-turn conversations, tracking evolving user intent via multi-turn
-interactions for fine-grained segmentation. To establish a benchmark for this
-novel task, we build a Pixel-level ReasonIng Segmentation Dataset Based on
-Multi-Turn Conversations (PRIST), comprising 24k utterances from 8.3k
-multi-turn conversational scenarios with segmentation targets. Building on
-PRIST, we further propose MIRAS, a Multi-turn Interactive ReAsoning
-Segmentation framework, integrates pixel-level segmentation with robust
-multi-turn conversation understanding, generating pixel-grounded explanations
-aligned with user intent. The PRIST dataset and MIRSA framework fill the gap in
-pixel-level reasoning segmentation. Experimental results on the PRIST dataset
-demonstrate that our method outperforms current segmentation-specific baselines
-in terms of segmentation and LLM-based reasoning metrics. The code and data are
-available at: https://github.com/ccccai239/PixelRIST.
+Despite over a decade of legislative efforts to address modern slavery in the
+supply chains of large corporations, the effectiveness of government oversight
+remains hampered by the challenge of scrutinizing thousands of statements
+annually. While Large Language Models (LLMs) can be considered a well
+established solution for the automatic analysis and summarization of documents,
+recognizing concrete modern slavery countermeasures taken by companies and
+differentiating those from vague claims remains a challenging task. To help
+evaluate and fine-tune LLMs for the assessment of corporate statements, we
+introduce a dataset composed of 5,731 modern slavery statements taken from the
+Australian Modern Slavery Register and annotated at the sentence level. This
+paper details the construction steps for the dataset that include the careful
+design of annotation specifications, the selection and preprocessing of
+statements, and the creation of high-quality annotation subsets for effective
+model evaluations. To demonstrate our dataset's utility, we propose a machine
+learning methodology for the detection of sentences relevant to mandatory
+reporting requirements set by the Australian Modern Slavery Act. We then follow
+this methodology to benchmark modern language models under zero-shot and
+supervised learning settings.
 
-摘要：現有的視覺感知系統專注於單輪對話中的區域級分割，依賴於複雜且明確的查詢指令。此類系統無法在像素級別推理和理解在互動中不斷變化的動態使用者意圖。我們的研究通過引入一項基於多輪對話的像素級推理分割（像素級 RS）新任務來解決此問題，通過多輪互動追蹤不斷演變的使用者意圖，以進行精細分割。為了建立此新任務的基準，我們建立了一個基於多輪對話的像素級推理分割資料集（PRIST），其中包含來自 8.3k 多輪對話場景的 24k 個語句，以及分割目標。在 PRIST 的基礎上，我們進一步提出了 MIRAS，這是一個多輪互動推理分割框架，它將像素級分割與強大的多輪對話理解整合在一起，生成符合使用者意圖的像素級解釋。PRIST 資料集和 MIRSA 框架填補了像素級推理分割的空白。在 PRIST 資料集上的實驗結果表明，我們的模型在分割和基於 LLM 的推理指標方面優於目前的特定於分割的基準。程式碼和資料可在 https://github.com/ccccai239/PixelRIST 獲得。
+摘要：儘管立法努力超過十年，旨在解決大型企業供應鏈中的現代奴隸制，但政府監督的有效性仍然受到每年審查數千份聲明的挑戰所阻礙。雖然大型語言模型（LLM）可以被認為是文件自動分析和摘要的完善解決方案，但要辨識公司採取的具體現代奴隸制對策，並將其與含糊的聲明區分開來，仍然是一項具有挑戰性的任務。為了幫助評估和微調 LLM 以評估企業聲明，我們引入了一個由 5,731 份現代奴隸制聲明組成的資料集，這些聲明取自澳洲現代奴隸制註冊處，並在句子層級進行註解。本文詳細說明了資料集的建構步驟，其中包括註解規格的仔細設計、聲明的選擇和預處理，以及用於有效模型評估的高品質註解子集的建立。為了展示我們的資料集的效用，我們提出了一種機器學習方法，用於檢測與澳洲現代奴隸制法規定的強制性報告要求相關的句子。然後，我們遵循這種方法，在零次學習和監督學習設定下對現代語言模型進行基準測試。
 
-##### **Dual Formulation for Non-Rectangular Lp Robust Markov Decision Processes**
-2502.09432v1 by Navdeep Kumar, Adarsh Gupta, Maxence Mohamed Elfatihi, Giorgia Ramponi, Kfir Yehuda Levy, Shie Mannor
+##### **Recent Advances, Applications and Open Challenges in Machine Learning for Health: Reflections from Research Roundtables at ML4H 2024 Symposium**
+2502.06693v1 by Amin Adibi, Xu Cao, Zongliang Ji, Jivat Neet Kaur, Winston Chen, Elizabeth Healey, Brighton Nuwagira, Wenqian Ye, Geoffrey Woollard, Maxwell A Xu, Hejie Cui, Johnny Xi, Trenton Chang, Vasiliki Bikia, Nicole Zhang, Ayush Noori, Yuan Xia, Md. Belal Hossain, Hanna A. Frank, Alina Peluso, Yuan Pu, Shannon Zejiang Shen, John Wu, Adibvafa Fallahpour, Sazan Mahbub, Ross Duncan, Yuwei Zhang, Yurui Cao, Zuheng Xu, Michael Craig, Rahul G. Krishnan, Rahmatollah Beheshti, James M. Rehg, Mohammad Ehsanul Karim, Megan Coffee, Leo Anthony Celi, Jason Alan Fries, Mohsen Sadatsafavi, Dennis Shung, Shannon McWeeney, Jessica Dafflon, Sarah Jabbour
 
-We study robust Markov decision processes (RMDPs) with non-rectangular
-uncertainty sets, which capture interdependencies across states unlike
-traditional rectangular models. While non-rectangular robust policy evaluation
-is generally NP-hard, even in approximation, we identify a powerful class of
-$L_p$-bounded uncertainty sets that avoid these complexity barriers due to
-their structural simplicity. We further show that this class can be decomposed
-into infinitely many \texttt{sa}-rectangular $L_p$-bounded sets and leverage
-its structural properties to derive a novel dual formulation for $L_p$ RMDPs.
-This formulation provides key insights into the adversary's strategy and
-enables the development of the first robust policy evaluation algorithms for
-non-rectangular RMDPs. Empirical results demonstrate that our approach
-significantly outperforms brute-force methods, establishing a promising
-foundation for future investigation into non-rectangular robust MDPs.
+The fourth Machine Learning for Health (ML4H) symposium was held in person on
+December 15th and 16th, 2024, in the traditional, ancestral, and unceded
+territories of the Musqueam, Squamish, and Tsleil-Waututh Nations in Vancouver,
+British Columbia, Canada. The symposium included research roundtable sessions
+to foster discussions between participants and senior researchers on timely and
+relevant topics for the ML4H community. The organization of the research
+roundtables at the conference involved 13 senior and 27 junior chairs across 13
+tables. Each roundtable session included an invited senior chair (with
+substantial experience in the field), junior chairs (responsible for
+facilitating the discussion), and attendees from diverse backgrounds with an
+interest in the session's topic.
 
-摘要：我們研究具有非矩形不確定性集合的強健馬可夫決策過程 (RMDP)，它能捕捉到不同於傳統矩形模型的跨狀態相互依賴性。雖然非矩形強健策略評估通常是 NP-hard，即使在近似中也是如此，我們識別了一類強大的 $L_p$ 有界不確定性集合，由於其結構的簡潔性，可以避免這些複雜性障礙。我們進一步表明，此類可以分解為無限多的 \texttt{sa} 矩形 $L_p$ 有界集合，並利用其結構屬性為 $L_p$ RMDP 導出一個新的對偶公式。此公式提供了對抗者策略的重要見解，並能夠開發出第一個非矩形 RMDP 的強健策略評估演算法。實證結果表明，我們的做法顯著優於蠻力方法，為未來對非矩形強健 MDP 的研究奠定了有希望的基礎。
+摘要：第四屆醫療機器學習 (ML4H) 研討會於 2024 年 12 月 15 日和 16 日在加拿大不列顛哥倫比亞省溫哥華的 Musqueam、Squamish 和 Tsleil-Waututh 國家的傳統、祖先和未割讓領土上舉行。研討會包括研究圓桌會議，以促進參與者和高級研究人員之間關於 ML4H 社群的及時和相關主題的討論。在會議上組織研究圓桌會議涉及 13 張桌子上的 13 位高級主席和 27 位初級主席。每個圓桌會議都包括一位受邀的高級主席（在該領域擁有豐富的經驗）、初級主席（負責促進討論）以及對會議主題感興趣的來自不同背景的與會者。
 
-##### **Transformer-Enhanced Variational Autoencoder for Crystal Structure Prediction**
-2502.09423v1 by Ziyi Chen, Yang Yuan, Siming Zheng, Jialong Guo, Sihan Liang, Yangang Wang, Zongguo Wang
+##### **Automatic Evaluation of Healthcare LLMs Beyond Question-Answering**
+2502.06666v1 by Anna Arias-Duart, Pablo Agustin Martin-Torres, Daniel Hinjos, Pablo Bernabeu-Perez, Lucia Urcelay Ganzabal, Marta Gonzalez Mallo, Ashwin Kumar Gururajan, Enrique Lopez-Cuena, Sergio Alvarez-Napagao, Dario Garcia-Gasulla
 
-Crystal structure forms the foundation for understanding the physical and
-chemical properties of materials. Generative models have emerged as a new
-paradigm in crystal structure prediction(CSP), however, accurately capturing
-key characteristics of crystal structures, such as periodicity and symmetry,
-remains a significant challenge. In this paper, we propose a
-Transformer-Enhanced Variational Autoencoder for Crystal Structure Prediction
-(TransVAE-CSP), who learns the characteristic distribution space of stable
-materials, enabling both the reconstruction and generation of crystal
-structures. TransVAE-CSP integrates adaptive distance expansion with
-irreducible representation to effectively capture the periodicity and symmetry
-of crystal structures, and the encoder is a transformer network based on an
-equivariant dot product attention mechanism. Experimental results on the
-carbon_24, perov_5, and mp_20 datasets demonstrate that TransVAE-CSP
-outperforms existing methods in structure reconstruction and generation tasks
-under various modeling metrics, offering a powerful tool for crystal structure
-design and optimization.
+Current Large Language Models (LLMs) benchmarks are often based on open-ended
+or close-ended QA evaluations, avoiding the requirement of human labor.
+Close-ended measurements evaluate the factuality of responses but lack
+expressiveness. Open-ended capture the model's capacity to produce discourse
+responses but are harder to assess for correctness. These two approaches are
+commonly used, either independently or together, though their relationship
+remains poorly understood. This work is focused on the healthcare domain, where
+both factuality and discourse matter greatly. It introduces a comprehensive,
+multi-axis suite for healthcare LLM evaluation, exploring correlations between
+open and close benchmarks and metrics. Findings include blind spots and
+overlaps in current methodologies. As an updated sanity check, we release a new
+medical benchmark--CareQA--, with both open and closed variants. Finally, we
+propose a novel metric for open-ended evaluations --Relaxed Perplexity-- to
+mitigate the identified limitations.
 
-摘要：晶體結構形成了解材料物理和化學性質的基礎。生成模型已成為晶體結構預測 (CSP) 的新典範，然而，準確捕捉晶體結構的關鍵特徵（例如週期性和對稱性）仍然是一項重大挑戰。在本文中，我們提出了一種用於晶體結構預測的 Transformer 增強變異自動編碼器 (TransVAE-CSP)，它學習穩定材料的特徵分佈空間，使晶體結構的重建和生成成為可能。TransVAE-CSP 將自適應距離擴展與不可約表示相結合，以有效地捕捉晶體結構的週期性和對稱性，並且編碼器是一個基於等變點積注意力機制的 Transformer 網路。在 carbon_24、perov_5 和 mp_20 資料集上的實驗結果表明，TransVAE-CSP 在各種建模指標下，在結構重建和生成任務中優於現有方法，為晶體結構設計和最佳化提供了一個強大的工具。
+摘要：當前大型語言模型 (LLM) 基準通常基於開放式或封閉式問答評量，避免了人力需求。封閉式測量評估回應的事實性，但缺乏表達力。開放式測量捕捉模型產生論述回應的能力，但較難評估正確性。這兩種方法通常獨立或合併使用，儘管它們之間的關係仍然知之甚少。這項工作專注於醫療保健領域，在該領域中，事實性和論述都非常重要。它引入了一個全面的多軸套件，用於醫療保健 LLM 評量，探索開放式和封閉式基準和指標之間的關聯性。研究結果包括當前方法中的盲點和重疊。作為更新的健全性檢查，我們發布了一個新的醫療基準--CareQA--，包含開放式和封閉式變體。最後，我們提出了一個用於開放式評量的全新指標--放鬆困惑度--以減輕已識別的限制。
 
-##### **On multi-token prediction for efficient LLM inference**
-2502.09419v1 by Somesh Mehra, Javier Alonso Garcia, Lukas Mauch
+##### **Few-Shot Classification and Anatomical Localization of Tissues in SPECT Imaging**
+2502.06632v1 by Mohammed Abdul Hafeez Khan, Samuel Morries Boddepalli, Siddhartha Bhattacharyya, Debasis Mitra
 
-We systematically investigate multi-token prediction (MTP) capabilities
-within LLMs pre-trained for next-token prediction (NTP). We first show that
-such models inherently possess MTP capabilities via numerical marginalization
-over intermediate token probabilities, though performance is data-dependent and
-improves with model scale. Furthermore, we explore the challenges of
-integrating MTP heads into frozen LLMs and find that their hidden layers are
-strongly specialized for NTP, making adaptation non-trivial. Finally, we show
-that while joint training of MTP heads with the backbone improves performance,
-it cannot fully overcome this barrier, prompting further research in this
-direction. Our findings provide a deeper understanding of MTP applied to
-pretrained LLMs, informing strategies for accelerating inference through
-parallel token prediction.
+Accurate classification and anatomical localization are essential for
+effective medical diagnostics and research, which may be efficiently performed
+using deep learning techniques. However, availability of limited labeled data
+poses a significant challenge. To address this, we adapted Prototypical
+Networks and the Propagation-Reconstruction Network (PRNet) for few-shot
+classification and localization, respectively, in Single Photon Emission
+Computed Tomography (SPECT) images. For the proof of concept we used a
+2D-sliced image cropped around heart. The Prototypical Network, with a
+pre-trained ResNet-18 backbone, classified ventricles, myocardium, and liver
+tissues with 96.67% training and 93.33% validation accuracy. PRNet, adapted for
+2D imaging with an encoder-decoder architecture and skip connections, achieved
+a training loss of 1.395, accurately reconstructing patches and capturing
+spatial relationships. These results highlight the potential of Prototypical
+Networks for tissue classification with limited labeled data and PRNet for
+anatomical landmark localization, paving the way for improved performance in
+deep learning frameworks.
 
-摘要：我們系統性地研究了在預先訓練下用於下一個代幣預測 (NTP) 的 LLM 中的多代幣預測 (MTP) 功能。我們首先表明，此類模型透過中間代幣機率的數值邊際化本質上具備 MTP 功能，儘管效能依賴於資料，且會隨著模型規模而提升。此外，我們探討了將 MTP 頭整合到凍結 LLM 中的挑戰，發現其隱藏層高度專門用於 NTP，使得適應變得不簡單。最後，我們顯示，儘管 MTP 頭與主幹的聯合訓練會提升效能，但無法完全克服此障礙，促使我們進一步研究這個方向。我們的發現提供了對應用於預先訓練 LLM 的 MTP 更深入的理解，並為透過平行代幣預測加速推論提供策略。
+摘要：精確的分類和解剖定位對於有效的醫療診斷和研究至關重要，而這可以使用深度學習技術有效執行。然而，標記資料有限的取得會造成重大的挑戰。為了解決這個問題，我們分別調整了原型網路和傳播重建網路 (PRNet)，用於單光子發射電腦斷層掃描 (SPECT) 影像中的少量分類和定位。為了證明這個概念，我們使用圍繞心臟裁切的 2D 切片影像。原型網路，使用預先訓練的 ResNet-18 主幹，對心室、心肌和肝臟組織進行分類，訓練準確度為 96.67%，驗證準確度為 93.33%。PRNet，調整為使用編碼器解碼器架構和跳躍連接的 2D 影像，達到了 1.395 的訓練損失，精確地重建了區塊並擷取了空間關係。這些結果突出了原型網路在標記資料有限的情況下進行組織分類的潛力，以及 PRNet 在解剖標誌定位方面的潛力，為深度學習架構中效能的提升鋪平了道路。
 
-##### **SQuARE: Sequential Question Answering Reasoning Engine for Enhanced Chain-of-Thought in Large Language Models**
-2502.09390v1 by Daniel Fleischer, Moshe Berchansky, Gad Markovits, Moshe Wasserblat
+##### **Illegal Waste Detection in Remote Sensing Images: A Case Study**
+2502.06607v2 by Federico Gibellini, Piero Fraternali, Giacomo Boracchi, Luca Morandini, Andrea Diecidue, Simona Malegori
 
-In the rapidly evolving field of Natural Language Processing, Large Language
-Models (LLMs) are tasked with increasingly complex reasoning challenges.
-Traditional methods like chain-of-thought prompting have shown promise but
-often fall short in fully leveraging a model's reasoning capabilities. This
-paper introduces SQuARE (Sequential Question Answering Reasoning Engine), a
-novel prompting technique designed to improve reasoning through a
-self-interrogation paradigm. Building upon CoT frameworks, SQuARE prompts
-models to generate and resolve multiple auxiliary questions before tackling the
-main query, promoting a more thorough exploration of various aspects of a
-topic. Our expansive evaluations, conducted with Llama 3 and GPT-4o models
-across multiple question-answering datasets, demonstrate that SQuARE
-significantly surpasses traditional CoT prompts and existing
-rephrase-and-respond methods. By systematically decomposing queries, SQuARE
-advances LLM capabilities in reasoning tasks. The code is publicly available at
-https://github.com/IntelLabs/RAG-FiT/tree/square.
+Environmental crime currently represents the third largest criminal activity
+worldwide while threatening ecosystems as well as human health. Among the
+crimes related to this activity, improper waste management can nowadays be
+countered more easily thanks to the increasing availability and decreasing cost
+of Very-High-Resolution Remote Sensing images, which enable semi-automatic
+territory scanning in search of illegal landfills. This paper proposes a
+pipeline, developed in collaboration with professionals from a local
+environmental agency, for detecting candidate illegal dumping sites leveraging
+a classifier of Remote Sensing images. To identify the best configuration for
+such classifier, an extensive set of experiments was conducted and the impact
+of diverse image characteristics and training settings was thoroughly analyzed.
+The local environmental agency was then involved in an experimental exercise
+where outputs from the developed classifier were integrated in the experts'
+everyday work, resulting in time savings with respect to manual
+photo-interpretation. The classifier was eventually run with valuable results
+on a location outside of the training area, highlighting potential for
+cross-border applicability of the proposed pipeline.
 
-摘要：在快速發展的自然語言處理領域中，大型語言模型 (LLM) 負責越來越複雜的推理挑戰。
-傳統方法（如思考鏈提示）已展現潛力，但通常無法充分利用模型的推理能力。本文介紹 SQuARE（順序式問答推理引擎），這是一種新穎的提示技術，旨在透過自我提問模式來改善推理。建立在 CoT 架構之上，SQuARE 提示模型在處理主要查詢之前產生並解決多個輔助問題，促進對某個主題的各個面向進行更徹底的探討。我們使用 Llama 3 和 GPT-4o 模型對多個問答資料集進行廣泛評估，結果顯示 SQuARE 明顯優於傳統 CoT 提示和現有的改寫並回應方法。透過系統性地分解查詢，SQuARE 提升了 LLM 在推理任務中的能力。程式碼已公開於 https://github.com/IntelLabs/RAG-FiT/tree/square。
+摘要：環境犯罪目前是全球第三大犯罪活動，威脅生態系統和人類健康。在與此活動相關的犯罪中，不當廢物管理現在可以更容易地得到解決，這要歸功於超高解析度遙測影像越來越普及且成本下降，這使得半自動領土掃描能夠搜尋非法垃圾掩埋場。本文提出了一條管道，與當地環境機構的專業人士合作開發，用於檢測候選非法傾倒地點，利用遙測影像分類器。為了找出這種分類器的最佳配置，進行了一系列廣泛的實驗，並徹底分析了不同影像特徵和訓練設定的影響。然後，當地環境機構參與了一項實驗練習，其中將已開發分類器的輸出整合到專家的日常工作中，從而節省了人工照片解譯的時間。最後在訓練區域外的某個位置執行分類器，獲得了有價值的結果，突出了所提出管道的跨境適用性潛力。
 
-##### **Truth Knows No Language: Evaluating Truthfulness Beyond English**
-2502.09387v1 by Blanca Calvo Figueras, Eneko Sagarzazu, Julen Etxaniz, Jeremy Barnes, Pablo Gamallo, Iria De Dios Flores, Rodrigo Agerri
+##### **FEMBA: Efficient and Scalable EEG Analysis with a Bidirectional Mamba Foundation Model**
+2502.06438v1 by Anna Tegon, Thorir Mar Ingolfsson, Xiaying Wang, Luca Benini, Yawei Li
 
-We introduce a professionally translated extension of the TruthfulQA
-benchmark designed to evaluate truthfulness in Basque, Catalan, Galician, and
-Spanish. Truthfulness evaluations of large language models (LLMs) have
-primarily been conducted in English. However, the ability of LLMs to maintain
-truthfulness across languages remains under-explored. Our study evaluates 12
-state-of-the-art open LLMs, comparing base and instruction-tuned models using
-human evaluation, multiple-choice metrics, and LLM-as-a-Judge scoring. Our
-findings reveal that, while LLMs perform best in English and worst in Basque
-(the lowest-resourced language), overall truthfulness discrepancies across
-languages are smaller than anticipated. Furthermore, we show that
-LLM-as-a-Judge correlates more closely with human judgments than
-multiple-choice metrics, and that informativeness plays a critical role in
-truthfulness assessment. Our results also indicate that machine translation
-provides a viable approach for extending truthfulness benchmarks to additional
-languages, offering a scalable alternative to professional translation.
-Finally, we observe that universal knowledge questions are better handled
-across languages than context- and time-dependent ones, highlighting the need
-for truthfulness evaluations that account for cultural and temporal
-variability. Dataset and code are publicly available under open licenses.
+Accurate and efficient electroencephalography (EEG) analysis is essential for
+detecting seizures and artifacts in long-term monitoring, with applications
+spanning hospital diagnostics to wearable health devices. Robust EEG analytics
+have the potential to greatly improve patient care. However, traditional deep
+learning models, especially Transformer-based architectures, are hindered by
+their quadratic time and memory complexity, making them less suitable for
+resource-constrained environments. To address these challenges, we present
+FEMBA (Foundational EEG Mamba + Bidirectional Architecture), a novel
+self-supervised framework that establishes new efficiency benchmarks for EEG
+analysis through bidirectional state-space modeling. Unlike Transformer-based
+models, which incur quadratic time and memory complexity, FEMBA scales linearly
+with sequence length, enabling more scalable and efficient processing of
+extended EEG recordings. Trained on over 21,000 hours of unlabeled EEG and
+fine-tuned on three downstream tasks, FEMBA achieves competitive performance in
+comparison with transformer models, with significantly lower computational
+cost. Specifically, it reaches 81.82% balanced accuracy (0.8921 AUROC) on TUAB
+and 0.949 AUROC on TUAR, while a tiny 7.8M-parameter variant demonstrates
+viability for resource-constrained devices. These results pave the way for
+scalable, general-purpose EEG analytics in both clinical and highlight FEMBA as
+a promising candidate for wearable applications.
 
-摘要：我們針對 TruthfulQA 推出專業翻譯的延伸版本，旨在評估巴斯克語、加泰隆尼亞語、加利西亞語和西班牙語中的真實性。大型語言模型 (LLM) 的真實性評估主要以英語進行。然而，LLM 在不同語言中維持真實性的能力仍未得到充分探索。我們的研究評估了 12 個最先進的開放 LLM，使用人類評估、多選項指標和 LLM 作為評分標準比較基礎和指令調整模型。我們的研究結果表明，雖然 LLM 在英語中的表現最好，而在巴斯克語（資源最少的語言）中的表現最差，但整體上不同語言之間的真實性差異小於預期。此外，我們表明，與多選項指標相比，LLM 作為評分標準與人類判斷更密切相關，而且信息豐富性在真實性評估中發揮著至關重要的作用。我們的結果還表明，機器翻譯提供了一種可行的途徑，可以將真實性基準擴展到其他語言，從而提供了一種可擴展的專業翻譯替代方案。最後，我們觀察到，與上下文和時間依賴的問題相比，通用知識問題在不同語言之間的處理效果更好，這突顯了考慮文化和時間可變性的真實性評估的必要性。數據集和代碼在開放許可下公開可用。
+摘要：準確且有效的腦電圖 (EEG) 分析對於偵測長時間監控中的癲癇發作和偽像至關重要，其應用範圍涵蓋醫院診斷到可穿戴式健康裝置。穩健的 EEG 分析具有大幅改善病患照護的潛力。然而，傳統深度學習模型，特別是基於 Transformer 的架構，受到其二次時間和記憶體複雜度的阻礙，使其不太適合資源受限的環境。為了應對這些挑戰，我們提出 FEMBA (基礎 EEG Mamba + 雙向架構)，一種創新的自我監督架構，透過雙向狀態空間建模為 EEG 分析建立新的效率基準。與會產生二次時間和記憶體複雜度的基於 Transformer 的模型不同，FEMBA 隨著序列長度線性縮放，支援更具可擴充性和效率的延伸 EEG 記錄處理。FEMBA 在超過 21,000 小時的未標記 EEG 上訓練並在三個下游任務上進行微調，與Transformer模型相比，在計算成本顯著降低的情況下，實現了具有競爭力的效能。具體來說，它在 TUAB 上達到 81.82% 的平衡準確度 (0.8921 AUROC) 和在 TUAR 上達到 0.949 AUROC，而一個微小的 7.8M 參數變體證明了其在資源受限裝置上的可行性。這些結果為臨床和可穿戴應用中可擴充的通用 EEG 分析鋪平了道路，並突顯 FEMBA 是可穿戴應用中一個有前景的候選者。
+
+##### **Is an Ultra Large Natural Image-Based Foundation Model Superior to a Retina-Specific Model for Detecting Ocular and Systemic Diseases?**
+2502.06289v1 by Qingshan Hou, Yukun Zhou, Jocelyn Hui Lin Goh, Ke Zou, Samantha Min Er Yew, Sahana Srinivasan, Meng Wang, Thaddaeus Lo, Xiaofeng Lei, Siegfried K. Wagner, Mark A. Chia, Dawei Yang, Hongyang Jiang, AnRan Ran, Rui Santos, Gabor Mark Somfai, Juan Helen Zhou, Haoyu Chen, Qingyu Chen, Carol Yim-Lui Cheung, Pearse A. Keane, Yih Chung Tham
+
+The advent of foundation models (FMs) is transforming medical domain. In
+ophthalmology, RETFound, a retina-specific FM pre-trained sequentially on 1.4
+million natural images and 1.6 million retinal images, has demonstrated high
+adaptability across clinical applications. Conversely, DINOv2, a
+general-purpose vision FM pre-trained on 142 million natural images, has shown
+promise in non-medical domains. However, its applicability to clinical tasks
+remains underexplored. To address this, we conducted head-to-head evaluations
+by fine-tuning RETFound and three DINOv2 models (large, base, small) for ocular
+disease detection and systemic disease prediction tasks, across eight
+standardized open-source ocular datasets, as well as the Moorfields AlzEye and
+the UK Biobank datasets. DINOv2-large model outperformed RETFound in detecting
+diabetic retinopathy (AUROC=0.850-0.952 vs 0.823-0.944, across three datasets,
+all P<=0.007) and multi-class eye diseases (AUROC=0.892 vs. 0.846, P<0.001). In
+glaucoma, DINOv2-base model outperformed RETFound (AUROC=0.958 vs 0.940,
+P<0.001). Conversely, RETFound achieved superior performance over all DINOv2
+models in predicting heart failure, myocardial infarction, and ischaemic stroke
+(AUROC=0.732-0.796 vs 0.663-0.771, all P<0.001). These trends persisted even
+with 10% of the fine-tuning data. These findings showcase the distinct
+scenarios where general-purpose and domain-specific FMs excel, highlighting the
+importance of aligning FM selection with task-specific requirements to optimise
+clinical performance.
 
-##### **A Deep Inverse-Mapping Model for a Flapping Robotic Wing**
-2502.09378v1 by Hadar Sharvit, Raz Karl, Tsevi Beatus
+摘要：基礎模型 (FM) 的出現正在轉變醫療領域。在眼科，RETFound 是一個視網膜專用 FM，依序使用 140 萬張自然影像和 160 萬張視網膜影像進行預訓練，已展現出高度適應性，可應用於各種臨床應用。相反地，DINOv2 是一個通用視覺 FM，使用 1.42 億張自然影像進行預訓練，已展現出在非醫療領域的潛力。然而，其在臨床任務中的適用性仍未被充分探索。為了解決這個問題，我們針對眼部疾病偵測和全身性疾病預測任務，對 RETFound 和三個 DINOv2 模型（大型、基礎、小型）進行微調，並進行一對一的評估，使用八個標準化的開源眼科資料集，以及 Moorfields AlzEye 和 UK Biobank 資料集。DINOv2 大型模型在糖尿病視網膜病變偵測方面優於 RETFound（三個資料集的 AUROC=0.850-0.952，相較於 0.823-0.944，所有 P<=0.007）和多類眼部疾病（AUROC=0.892，相較於 0.846，P<0.001）。在青光眼方面，DINOv2 基礎模型優於 RETFound（AUROC=0.958，相較於 0.940，P<0.001）。相反地，RETFound 在預測心臟衰竭、心肌梗塞和缺血性中風方面優於所有 DINOv2 模型（AUROC=0.732-0.796，相較於 0.663-0.771，所有 P<0.001）。即使使用 10% 的微調資料，這些趨勢仍然持續。這些發現展示了通用和領域專用 FM 各自擅長的場景，突顯了根據任務特定需求調整 FM 選擇，以最佳化臨床表現的重要性。
 
-In systems control, the dynamics of a system are governed by modulating its
-inputs to achieve a desired outcome. For example, to control the thrust of a
-quad-copter propeller the controller modulates its rotation rate, relying on a
-straightforward mapping between the input rotation rate and the resulting
-thrust. This mapping can be inverted to determine the rotation rate needed to
-generate a desired thrust. However, in complex systems, such as flapping-wing
-robots where intricate fluid motions are involved, mapping inputs (wing
-kinematics) to outcomes (aerodynamic forces) is nontrivial and inverting this
-mapping for real-time control is computationally impractical. Here, we report a
-machine-learning solution for the inverse mapping of a flapping-wing system
-based on data from an experimental system we have developed. Our model learns
-the input wing motion required to generate a desired aerodynamic force outcome.
-We used a sequence-to-sequence model tailored for time-series data and
-augmented it with a novel adaptive-spectrum layer that implements
-representation learning in the frequency domain. To train our model, we
-developed a flapping wing system that simultaneously measures the wing's
-aerodynamic force and its 3D motion using high-speed cameras. We demonstrate
-the performance of our system on an additional open-source dataset of a
-flapping wing in a different flow regime. Results show superior performance
-compared with more complex state-of-the-art transformer-based models, with 11%
-improvement on the test datasets median loss. Moreover, our model shows
-superior inference time, making it practical for onboard robotic control. Our
-open-source data and framework may improve modeling and real-time control of
-systems governed by complex dynamics, from biomimetic robots to biomedical
-devices.
+##### **Integrating Sequence and Image Modeling in Irregular Medical Time Series Through Self-Supervised Learning**
+2502.06134v1 by Liuqing Chen, Shuhong Xiao, Shixian Ding, Shanhai Hu, Lingyun Sun
 
-摘要：<paragraph>在系統控制中，系統的動態受調節其輸入以實現所需結果的影響。例如，為了控制四軸旋翼推進器的推力，控制器會調節其旋轉速率，依賴於輸入旋轉速率和所產生的推力之間的直接映射。此映射可以反轉以確定產生所需推力所需的旋轉速率。然而，在複雜的系統中，例如涉及複雜流體運動的拍打式機翼機器人，將輸入（機翼運動學）映射到輸出（空氣動力）並非易事，並且反轉此映射以進行實時控制在計算上不切實際。在此，我們報告了一個基於我們開發的實驗系統數據的拍打式機翼系統反向映射的機器學習解決方案。我們的模型學習產生所需空氣動力結果所需的輸入機翼運動。我們使用了一個專門針對時間序列數據的序列到序列模型，並用一個在頻域中實現表示學習的新型自適應譜層對其進行了擴充。為了訓練我們的模型，我們開發了一個拍打式機翼系統，該系統同時使用高速相機測量機翼的空氣動力和其 3D 運動。我們在一個不同的流動狀態下拍打機翼的另一個開源數據集上展示了我們系統的性能。結果表明，與更複雜的基於Transformer的最先進模型相比，性能優異，在測試數據集中損失中值改進了 11%。此外，我們的模型顯示出優異的推理時間，使其適用於機載機器人控制。我們的開源數據和框架可以改進受複雜動態支配的系統的建模和實時控制，從仿生機器人到生物醫學設備。</paragraph>
+Medical time series are often irregular and face significant missingness,
+posing challenges for data analysis and clinical decision-making. Existing
+methods typically adopt a single modeling perspective, either treating series
+data as sequences or transforming them into image representations for further
+classification. In this paper, we propose a joint learning framework that
+incorporates both sequence and image representations. We also design three
+self-supervised learning strategies to facilitate the fusion of sequence and
+image representations, capturing a more generalizable joint representation. The
+results indicate that our approach outperforms seven other state-of-the-art
+models in three representative real-world clinical datasets. We further
+validate our approach by simulating two major types of real-world missingness
+through leave-sensors-out and leave-samples-out techniques. The results
+demonstrate that our approach is more robust and significantly surpasses other
+baselines in terms of classification performance.
 
-##### **Language Agents as Digital Representatives in Collective Decision-Making**
-2502.09369v1 by Daniel Jarrett, Miruna Pîslar, Michiel A. Bakker, Michael Henry Tessler, Raphael Köster, Jan Balaguer, Romuald Elie, Christopher Summerfield, Andrea Tacchetti
+摘要：醫療時間序列通常不規則且會面臨顯著的缺失，對資料分析和臨床決策制定構成挑戰。現有方法通常採用單一建模觀點，將序列資料視為序列或將其轉換為影像表示以進行進一步分類。在本文中，我們提出了一個聯合學習架構，結合序列和影像表示。我們還設計了三種自我監督學習策略，以促進序列和影像表示的融合，捕捉更具概括性的聯合表示。結果表明，我們的做法在三個具有代表性的真實世界臨床資料集中優於其他七個最先進的模型。我們進一步通過留出感測器和留出樣本的技術模擬兩種主要的真實世界缺失類型來驗證我們的做法。結果表明，我們的做法更強大，並且在分類效能方面顯著優於其他基準。
 
-Consider the process of collective decision-making, in which a group of
-individuals interactively select a preferred outcome from among a universe of
-alternatives. In this context, "representation" is the activity of making an
-individual's preferences present in the process via participation by a proxy
-agent -- i.e. their "representative". To this end, learned models of human
-behavior have the potential to fill this role, with practical implications for
-multi-agent scenario studies and mechanism design. In this work, we investigate
-the possibility of training \textit{language agents} to behave in the capacity
-of representatives of human agents, appropriately expressing the preferences of
-those individuals whom they stand for. First, we formalize the setting of
-\textit{collective decision-making} -- as the episodic process of interaction
-between a group of agents and a decision mechanism. On this basis, we then
-formalize the problem of \textit{digital representation} -- as the simulation
-of an agent's behavior to yield equivalent outcomes from the mechanism.
-Finally, we conduct an empirical case study in the setting of
-\textit{consensus-finding} among diverse humans, and demonstrate the
-feasibility of fine-tuning large language models to act as digital
-representatives.
+##### **Foundation Model of Electronic Medical Records for Adaptive Risk Estimation**
+2502.06124v1 by Pawel Renc, Michal K. Grzeszczyk, Nassim Oufattole, Deirdre Goode, Yugang Jia, Szymon Bieganski, Matthew B. A. McDermott, Jaroslaw Was, Anthony E. Samir, Jonathan W. Cunningham, David W. Bates, Arkadiusz Sitek
 
-摘要：考慮集體決策的過程，其中一群個人互動式地從一系列備選方案中選擇一個偏好的結果。在此脈絡中，「代表」是透過代理人（即他們的「代表」）參與，讓個人的偏好出現在這個過程中的活動。為此，人類行為的學習模型有可能填補這個角色，對多重代理人情境研究和機制設計具有實際意義。在這項工作中，我們探討訓練「語言代理人」的可能性，以代表人類代理人的身分行事，適當地表達他們所代表的那些個人的偏好。首先，我們將「集體決策」的設定形式化，作為一群代理人與決策機制之間互動的間歇性過程。在此基礎上，我們接著將「數位代表」的問題形式化，作為模擬代理人的行為，從機制中產生等效結果。最後，我們在多元人類的「共識尋求」設定中進行一個實證個案研究，並展示微調大型語言模型以作為數位代表的可行性。
+We developed the Enhanced Transformer for Health Outcome Simulation (ETHOS),
+an AI model that tokenizes patient health timelines (PHTs) from EHRs. ETHOS
+predicts future PHTs using transformer-based architectures. The Adaptive Risk
+Estimation System (ARES) employs ETHOS to compute dynamic and personalized risk
+probabilities for clinician-defined critical events. ARES incorporates a
+personalized explainability module that identifies key clinical factors
+influencing risk estimates for individual patients. ARES was evaluated on the
+MIMIC-IV v2.2 dataset in emergency department (ED) settings, benchmarking its
+performance against traditional early warning systems and machine learning
+models. We processed 299,721 unique patients from MIMIC-IV into 285,622 PHTs,
+with 60% including hospital admissions. The dataset contained over 357 million
+tokens. ETHOS outperformed benchmark models in predicting hospital admissions,
+ICU admissions, and prolonged hospital stays, achieving superior AUC scores.
+ETHOS-based risk estimates demonstrated robustness across demographic subgroups
+with strong model reliability, confirmed via calibration curves. The
+personalized explainability module provides insights into patient-specific
+factors contributing to risk. ARES, powered by ETHOS, advances predictive
+healthcare AI by providing dynamic, real-time, and personalized risk estimation
+with patient-specific explainability to enhance clinician trust. Its
+adaptability and superior accuracy position it as a transformative tool for
+clinical decision-making, potentially improving patient outcomes and resource
+allocation in emergency and inpatient settings. We release the full code at
+github.com/ipolharvard/ethos-ares to facilitate future research.
 
-##### **Neural Spatiotemporal Point Processes: Trends and Challenges**
-2502.09341v1 by Sumantrak Mukherjee, Mouad Elhamdi, George Mohler, David A. Selby, Yao Xie, Sebastian Vollmer, Gerrit Grossmann
+摘要：我們開發了增強型健康結果模擬轉換器 (ETHOS)，
+一種從電子健康紀錄 (EHR) 中將患者健康時間軸 (PHT) 標記化的 AI 模型。ETHOS
+使用基於轉換器的架構預測未來的 PHT。自適應風險評估系統 (ARES) 使用 ETHOS 計算由臨床醫生定義的危急事件的動態且個人化的風險機率。ARES 結合了個人化的可解釋性模組，可找出影響個別患者風險評估的主要臨床因素。ARES 在急診部門 (ED) 設定中針對 MIMIC-IV v2.2 資料集進行評估，並將其效能與傳統的預警系統和機器學習模型進行基準測試。我們將 299,721 位 MIMIC-IV 的獨特患者處理成 285,622 個 PHT，其中 60% 包含住院記錄。該資料集包含超過 3.57 億個標記。ETHOS 在預測住院、加護病房 (ICU) 住院和延長住院時間方面表現優於基準模型，並獲得了較高的 AUC 分數。基於 ETHOS 的風險評估顯示出跨人口統計子群的穩健性，並通過校準曲線確認了強大的模型可靠性。個人化的可解釋性模組提供了對導致風險的患者特定因素的見解。由 ETHOS 驅動的 ARES 透過提供動態、即時且個人化的風險評估，以及患者特定的可解釋性來增強臨床醫生的信任，從而推動了預測性醫療保健 AI 的發展。其適應性和卓越的準確性使其成為臨床決策制定的一種變革性工具，有可能改善緊急和住院環境中的患者結果和資源分配。我們在 github.com/ipolharvard/ethos-ares 上釋出完整程式碼，以利未來的研究。
 
-Spatiotemporal point processes (STPPs) are probabilistic models for events
-occurring in continuous space and time. Real-world event data often exhibit
-intricate dependencies and heterogeneous dynamics. By incorporating modern deep
-learning techniques, STPPs can model these complexities more effectively than
-traditional approaches. Consequently, the fusion of neural methods with STPPs
-has become an active and rapidly evolving research area. In this review, we
-categorize existing approaches, unify key design choices, and explain the
-challenges of working with this data modality. We further highlight emerging
-trends and diverse application domains. Finally, we identify open challenges
-and gaps in the literature.
+##### **Can ChatGPT Diagnose Alzheimer's Disease?**
+2502.06907v1 by Quoc-Toan Nguyen, Linh Le, Xuan-The Tran, Thomas Do, Chin-Teng Lin
 
-摘要：時空點過程 (STPP) 是事件在連續時空發生的機率模型。真實世界的事件資料通常會展現錯綜複雜的依賴關係和異質動態。透過結合現代深度學習技術，STPP 可以比傳統方法更有效地模擬這些複雜性。因此，神經方法與 STPP 的融合已成為一個活躍且快速發展的研究領域。在本篇評論中，我們對現有方法進行分類、統一關鍵設計選擇，並說明處理這種資料模式的挑戰。我們進一步強調新興趨勢和多樣化的應用領域。最後，我們找出文獻中的開放性挑戰和空白。
+Can ChatGPT diagnose Alzheimer's Disease (AD)? AD is a devastating
+neurodegenerative condition that affects approximately 1 in 9 individuals aged
+65 and older, profoundly impairing memory and cognitive function. This paper
+utilises 9300 electronic health records (EHRs) with data from Magnetic
+Resonance Imaging (MRI) and cognitive tests to address an intriguing question:
+As a general-purpose task solver, can ChatGPT accurately detect AD using EHRs?
+We present an in-depth evaluation of ChatGPT using a black-box approach with
+zero-shot and multi-shot methods. This study unlocks ChatGPT's capability to
+analyse MRI and cognitive test results, as well as its potential as a
+diagnostic tool for AD. By automating aspects of the diagnostic process, this
+research opens a transformative approach for the healthcare system,
+particularly in addressing disparities in resource-limited regions where AD
+specialists are scarce. Hence, it offers a foundation for a promising method
+for early detection, supporting individuals with timely interventions, which is
+paramount for Quality of Life (QoL).
 
-##### **Graph Diffusion Network for Drug-Gene Prediction**
-2502.09335v1 by Jiayang Wu, Wensheng Gan, Philip S. Yu
+摘要：ChatGPT 能否診斷出阿茲海默症 (AD)？AD 是一種毀滅性的神經退化性疾病，影響約 1/9 的 65 歲及以上人士，嚴重損害記憶力和認知功能。這篇論文利用了 9300 份電子健康紀錄 (EHR)，其中包含磁共振成像 (MRI) 和認知測試的數據，來解決一個有趣的問題：作為一個通用任務解決器，ChatGPT 能否使用 EHR 準確地檢測出 AD？我們使用黑盒方法對 ChatGPT 進行了深入評估，採用零次嘗試和多次嘗試的方法。這項研究揭示了 ChatGPT 分析 MRI 和認知測試結果的能力，以及其作為 AD 診斷工具的潛力。通過自動化診斷過程的各個方面，這項研究為醫療保健系統開啟了一種變革性的方法，特別是在解決資源有限的地區中 AD 專家稀缺的不平等問題方面。因此，它為一種有希望的早期檢測方法奠定了基礎，通過及時干預來支持個人，這對於生活品質 (QoL) 至關重要。
 
-Predicting drug-gene associations is crucial for drug development and disease
-treatment. While graph neural networks (GNN) have shown effectiveness in this
-task, they face challenges with data sparsity and efficient contrastive
-learning implementation. We introduce a graph diffusion network for drug-gene
-prediction (GDNDGP), a framework that addresses these limitations through two
-key innovations. First, it employs meta-path-based homogeneous graph learning
-to capture drug-drug and gene-gene relationships, ensuring similar entities
-share embedding spaces. Second, it incorporates a parallel diffusion network
-that generates hard negative samples during training, eliminating the need for
-exhaustive negative sample retrieval. Our model achieves superior performance
-on the DGIdb 4.0 dataset and demonstrates strong generalization capability on
-tripartite drug-gene-disease networks. Results show significant improvements
-over existing methods in drug-gene prediction tasks, particularly in handling
-complex heterogeneous relationships. The source code is publicly available at
-https://github.com/csjywu1/GDNDGP.
+##### **Protecting Intellectual Property of EEG-based Neural Networks with Watermarking**
+2502.05931v1 by Ahmed Abdelaziz, Ahmed Fathi, Ahmed Fares
 
-摘要：預測藥物基因關聯對藥物開發和疾病治療至關重要。雖然圖神經網路 (GNN) 已顯示在這個任務中的有效性，但它們在資料稀疏性和高效對比學習實作方面面臨挑戰。我們引入了一個用於藥物基因預測的圖擴散網路 (GDNDGP)，這是一個透過兩項關鍵創新來解決這些限制的框架。首先，它採用基於元路徑的同質圖學習來捕捉藥物-藥物和基因-基因關係，確保類似實體共享嵌入空間。其次，它整合了一個並行擴散網路，在訓練期間產生困難的負面樣本，消除了對詳盡負面樣本擷取的需求。我們的模型在 DGIdb 4.0 資料集上取得了卓越的效能，並在三方藥物-基因-疾病網路中展現強大的概化能力。結果顯示在藥物基因預測任務中，相較於現有方法有顯著的進步，特別是在處理複雜的異質關係方面。原始碼已公開於 https://github.com/csjywu1/GDNDGP。
+EEG-based neural networks, pivotal in medical diagnosis and brain-computer
+interfaces, face significant intellectual property (IP) risks due to their
+reliance on sensitive neurophysiological data and resource-intensive
+development. Current watermarking methods, particularly those using abstract
+trigger sets, lack robust authentication and fail to address the unique
+challenges of EEG models. This paper introduces a cryptographic wonder
+filter-based watermarking framework tailored for EEG-based neural networks.
+Leveraging collision-resistant hashing and public-key encryption, the wonder
+filter embeds the watermark during training, ensuring minimal distortion ($\leq
+5\%$ drop in EEG task accuracy) and high reliability (100\% watermark
+detection). The framework is rigorously evaluated against adversarial attacks,
+including fine-tuning, transfer learning, and neuron pruning. Results
+demonstrate persistent watermark retention, with classification accuracy for
+watermarked states remaining above 90\% even after aggressive pruning, while
+primary task performance degrades faster, deterring removal attempts. Piracy
+resistance is validated by the inability to embed secondary watermarks without
+severe accuracy loss ( $>10\%$ in EEGNet and CCNN models). Cryptographic
+hashing ensures authentication, reducing brute-force attack success
+probabilities. Evaluated on the DEAP dataset across models (CCNN, EEGNet,
+TSception), the method achieves $>99.4\%$ null-embedding accuracy, effectively
+eliminating false positives. By integrating wonder filters with EEG-specific
+adaptations, this work bridges a critical gap in IP protection for
+neurophysiological models, offering a secure, tamper-proof solution for
+healthcare and biometric applications. The framework's robustness against
+adversarial modifications underscores its potential to safeguard sensitive EEG
+models while maintaining diagnostic utility.
 
-##### **Beyond English: The Impact of Prompt Translation Strategies across Languages and Tasks in Multilingual LLMs**
-2502.09331v1 by Itai Mondshine, Tzuf Paz-Argaman, Reut Tsarfaty
+摘要：<paragraph>基於 EEG 的神經網路在醫學診斷和腦電腦介面中至關重要，由於其依賴敏感的神經生理資料和資源密集型的開發，面臨重大的智慧財產權 (IP) 風險。目前的浮水印方法，特別是那些使用抽象觸發集的方法，缺乏強健的驗證，且無法解決 EEG 模型的獨特挑戰。本文介紹了一個專為基於 EEG 的神經網路量身打造的密碼學 wonder 濾波器浮水印架構。利用抗碰撞雜湊和公開金鑰加密，wonder 濾波器在訓練期間嵌入浮水印，確保最小的失真（EEG 任務準確度下降 $\leq 5\%$）和高可靠性（100% 浮水印檢測）。該架構針對對抗性攻擊進行了嚴格的評估，包括微調、遷移學習和神經元剪枝。結果證明了持續的浮水印保留，即使在激進的剪枝後，浮水印狀態的分類準確度仍保持在 90% 以上，而主要任務的性能下降得更快，阻止了移除嘗試。盜版抵抗力通過無法嵌入次要浮水印而得到驗證，而不會造成嚴重的準確度損失（在 EEGNet 和 CCNN 模型中 $>10\%$）。密碼學雜湊確保驗證，降低了暴力攻擊成功機率。在 DEAP 資料集上針對模型（CCNN、EEGNet、TSception）進行評估，該方法達到了 $>99.4\%$ 的空嵌入準確度，有效地消除了假陽性。透過將 wonder 濾波器與 EEG 特定的適應相整合，這項工作彌補了神經生理模型 IP 保護中的關鍵差距，為醫療保健和生物特徵應用提供了一個安全、防篡改的解決方案。該架構對抗敵對修改的強健性突顯了其在維護診斷效用的同時保護敏感 EEG 模型的潛力。</paragraph>
 
-Despite advances in the multilingual capabilities of Large Language Models
-(LLMs) across diverse tasks, English remains the dominant language for LLM
-research and development. So, when working with a different language, this has
-led to the widespread practice of pre-translation, i.e., translating the task
-prompt into English before inference. Selective pre-translation, a more
-surgical approach, focuses on translating specific prompt components. However,
-its current use is sporagic and lacks a systematic research foundation.
-Consequently, the optimal pre-translation strategy for various multilingual
-settings and tasks remains unclear. In this work, we aim to uncover the optimal
-setup for pre-translation by systematically assessing its use. Specifically, we
-view the prompt as a modular entity, composed of four functional parts:
-instruction, context, examples, and output, either of which could be translated
-or not. We evaluate pre-translation strategies across 35 languages covering
-both low and high-resource languages, on various tasks including Question
-Answering (QA), Natural Language Inference (NLI), Named Entity Recognition
-(NER), and Abstractive Summarization. Our experiments show the impact of
-factors as similarity to English, translation quality and the size of
-pre-trained data, on the model performance with pre-translation. We suggest
-practical guidelines for choosing optimal strategies in various multilingual
-settings.
+##### **Enhancing Depression Detection with Chain-of-Thought Prompting: From Emotion to Reasoning Using Large Language Models**
+2502.05879v1 by Shiyu Teng, Jiaqing Liu, Rahul Kumar Jain, Shurong Chai, Ruibo Hou, Tomoko Tateyama, Lanfen Lin, Yen-wei Chen
 
-摘要：儘管大型語言模型 (LLM) 在各種任務中的多語言能力有進步，英語仍然是 LLM 研究和開發的主導語言。因此，在使用不同語言時，這導致了預翻譯的廣泛實務，即在推理之前將任務提示翻譯成英語。選擇性預翻譯是一種更精準的方法，專注於翻譯特定提示組成部分。然而，目前的使用是零星的，缺乏系統性的研究基礎。因此，各種多語言設定和任務的最佳預翻譯策略仍不清楚。在這項工作中，我們旨在透過系統性評估預翻譯的使用，找出其最佳設定。具體來說，我們將提示視為一個模組化實體，由四個功能部分組成：說明、背景、範例和輸出，其中任何一個都可以翻譯或不翻譯。我們在 35 種語言中評估預翻譯策略，涵蓋低資源語言和高資源語言，以及各種任務，包括問答 (QA)、自然語言推理 (NLI)、命名實體識別 (NER) 和抽象摘要。我們的實驗顯示了與英語的相似性、翻譯品質和預訓練資料大小等因素對預翻譯模型效能的影響。我們建議在各種多語言設定中選擇最佳策略的實用指南。
+Depression is one of the leading causes of disability worldwide, posing a
+severe burden on individuals, healthcare systems, and society at large. Recent
+advancements in Large Language Models (LLMs) have shown promise in addressing
+mental health challenges, including the detection of depression through
+text-based analysis. However, current LLM-based methods often struggle with
+nuanced symptom identification and lack a transparent, step-by-step reasoning
+process, making it difficult to accurately classify and explain mental health
+conditions. To address these challenges, we propose a Chain-of-Thought
+Prompting approach that enhances both the performance and interpretability of
+LLM-based depression detection. Our method breaks down the detection process
+into four stages: (1) sentiment analysis, (2) binary depression classification,
+(3) identification of underlying causes, and (4) assessment of severity. By
+guiding the model through these structured reasoning steps, we improve
+interpretability and reduce the risk of overlooking subtle clinical indicators.
+We validate our method on the E-DAIC dataset, where we test multiple
+state-of-the-art large language models. Experimental results indicate that our
+Chain-of-Thought Prompting technique yields superior performance in both
+classification accuracy and the granularity of diagnostic insights, compared to
+baseline approaches.
 
-##### **A Judge-free LLM Open-ended Generation Benchmark Based on the Distributional Hypothesis**
-2502.09316v1 by Kentaro Imajo, Masanori Hirano, Shuji Suzuki, Hiroaki Mikami
+摘要：憂鬱症是全球殘障的主要原因之一，對個人、醫療保健系統和整個社會造成嚴重負擔。大型語言模型 (LLM) 的最新進展已展現出解決心理健康挑戰的希望，包括透過基於文字的分析來偵測憂鬱症。然而，現有的基於 LLM 的方法通常難以辨識細微的症狀，而且缺乏透明且逐步的推理過程，這使得準確分類和解釋心理健康狀況變得困難。為了應對這些挑戰，我們提出了一種思考鏈提示方法，它增強了基於 LLM 的憂鬱症偵測的效能和可解釋性。我們的這項方法將偵測過程分解為四個階段：(1) 情緒分析，(2) 二元憂鬱症分類，(3) 找出潛在原因，以及 (4) 評估嚴重程度。透過引導模型完成這些結構化的推理步驟，我們提升了可解釋性，並降低了忽略細微臨床指標的風險。我們在 E-DAIC 資料集上驗證了我們的這項方法，並在其中測試了多種最先進的大型語言模型。實驗結果顯示，與基線方法相比，我們的思考鏈提示技術在分類準確度和診斷見解的精細度方面都表現出優異的效能。
 
-Evaluating the open-ended text generation of large language models (LLMs) is
-challenging because of the lack of a clear ground truth and the high cost of
-human or LLM-based assessments. We propose a novel benchmark that evaluates
-LLMs using n-gram statistics and rules, without relying on human judgement or
-LLM-as-a-judge approaches. Using 50 question and reference answer sets, we
-introduce three new metrics based on n-grams and rules: Fluency, Truthfulness,
-and Helpfulness. Our benchmark strongly correlates with GPT-4o-based
-evaluations while requiring significantly fewer computational resources,
-demonstrating its effectiveness as a scalable alternative for assessing LLMs'
-open-ended generation capabilities.
+##### **LLMs for Drug-Drug Interaction Prediction: A Comprehensive Comparison**
+2502.06890v1 by Gabriele De Vito, Filomena Ferrucci, Athanasios Angelakis
 
-摘要：評估大型語言模型 (LLM) 的開放式文字生成具有挑戰性，因為缺乏明確的基礎真實性，以及人工或基於 LLM 的評估成本高昂。我們提出一個新基準，使用 n-gram 統計和規則來評估 LLM，而不依賴於人工判斷或 LLM 作為評審的方法。使用 50 個問題和參考答案集，我們基於 n-gram 和規則引入了三項新指標：流暢度、真實性和有幫助性。我們的基準與基於 GPT-4o 的評估密切相關，同時需要明顯更少的計算資源，證明了其作為評估 LLM 的開放式生成能力的可擴充替代方案的有效性。
+The increasing volume of drug combinations in modern therapeutic regimens
+needs reliable methods for predicting drug-drug interactions (DDIs). While
+Large Language Models (LLMs) have revolutionized various domains, their
+potential in pharmaceutical research, particularly in DDI prediction, remains
+largely unexplored. This study thoroughly investigates LLMs' capabilities in
+predicting DDIs by uniquely processing molecular structures (SMILES), target
+organisms, and gene interaction data as raw text input from the latest DrugBank
+dataset. We evaluated 18 different LLMs, including proprietary models (GPT-4,
+Claude, Gemini) and open-source variants (from 1.5B to 72B parameters), first
+assessing their zero-shot capabilities in DDI prediction. We then fine-tuned
+selected models (GPT-4, Phi-3.5 2.7B, Qwen-2.5 3B, Gemma-2 9B, and Deepseek R1
+distilled Qwen 1.5B) to optimize their performance. Our comprehensive
+evaluation framework included validation across 13 external DDI datasets,
+comparing against traditional approaches such as l2-regularized logistic
+regression. Fine-tuned LLMs demonstrated superior performance, with Phi-3.5
+2.7B achieving a sensitivity of 0.978 in DDI prediction, with an accuracy of
+0.919 on balanced datasets (50% positive, 50% negative cases). This result
+represents an improvement over both zero-shot predictions and state-of-the-art
+machine-learning methods used for DDI prediction. Our analysis reveals that
+LLMs can effectively capture complex molecular interaction patterns and cases
+where drug pairs target common genes, making them valuable tools for practical
+applications in pharmaceutical research and clinical settings.
 
-##### **When the LM misunderstood the human chuckled: Analyzing garden path effects in humans and language models**
-2502.09307v1 by Samuel Joseph Amouyal, Aya Meltzer-Asscher, Jonathan Berant
+摘要：<paragraph>現代治療方案中藥物組合的數量越來越多，需要可靠的方法來預測藥物間交互作用 (DDI)。儘管大型語言模型 (LLM) 已在各個領域掀起革命，它們在藥物研究中的潛力，特別是在 DDI 預測中的潛力，仍未得到充分探索。本研究通過獨特地處理分子結構 (SMILES)、目標生物和基因交互資料作為來自最新 DrugBank 資料集的原始文字輸入，徹底調查了 LLM 在預測 DDI 中的能力。我們評估了 18 種不同的 LLM，包括專有模型（GPT-4、Claude、Gemini）和開源變體（從 1.5B 到 72B 參數），首先評估它們在 DDI 預測中的零次學習能力。然後，我們微調選定的模型（GPT-4、Phi-3.5 2.7B、Qwen-2.5 3B、Gemma-2 9B 和 Deepseek R1 蒸餾 Qwen 1.5B）以最佳化其效能。我們的全面評估框架包括跨 13 個外部 DDI 資料集進行驗證，並與傳統方法（例如 l2 正則化邏輯迴歸）進行比較。微調後的 LLM 表現出優異的效能，其中 Phi-3.5 2.7B 在 DDI 預測中達到 0.978 的靈敏度，在平衡資料集（50% 正例，50% 反例）上的準確度為 0.919。此結果優於零次學習預測和用於 DDI 預測的最新機器學習方法。我們的分析表明，LLM 可以有效捕捉複雜的分子交互模式和藥物對靶向共同基因的情況，使其成為藥物研究和臨床環境中實用應用的寶貴工具。</paragraph>
 
-Modern Large Language Models (LLMs) have shown human-like abilities in many
-language tasks, sparking interest in comparing LLMs' and humans' language
-processing. In this paper, we conduct a detailed comparison of the two on a
-sentence comprehension task using garden-path constructions, which are
-notoriously challenging for humans. Based on psycholinguistic research, we
-formulate hypotheses on why garden-path sentences are hard, and test these
-hypotheses on human participants and a large suite of LLMs using comprehension
-questions. Our findings reveal that both LLMs and humans struggle with specific
-syntactic complexities, with some models showing high correlation with human
-comprehension. To complement our findings, we test LLM comprehension of
-garden-path constructions with paraphrasing and text-to-image generation tasks,
-and find that the results mirror the sentence comprehension question results,
-further validating our findings on LLM understanding of these constructions.
+##### **Decoding Complexity: Intelligent Pattern Exploration with CHPDA (Context Aware Hybrid Pattern Detection Algorithm)**
+2502.07815v1 by Lokesh Koli, Shubham Kalra, Karanpreet Singh
 
-摘要：現代大型語言模型（LLM）在許多語言任務中展現出類似人類的能力，引發了比較 LLM 與人類語言處理的興趣。在本文中，我們使用對人類來說極具挑戰的花園路徑結構，對這兩者進行了詳細比較，以進行句子理解任務。根據心理語言學研究，我們制定了關於為什麼花園路徑句子困難的假設，並使用理解問題對人類參與者和大量 LLM 測試這些假設。我們的研究結果表明，LLM 和人類都難以應付特定的句法複雜性，其中一些模型與人類理解力高度相關。為了補充我們的研究結果，我們測試了 LLM 對花園路徑結構的理解，並進行了改寫和文字轉換為圖像的生成任務，並發現結果反映了句子理解問題的結果，進一步驗證了我們對 LLM 理解這些結構的研究結果。
+Detecting sensitive data such as Personally Identifiable Information (PII)
+and Protected Health Information (PHI) is critical for data security platforms.
+This study evaluates regex-based pattern matching algorithms and exact-match
+search techniques to optimize detection speed, accuracy, and scalability. Our
+benchmarking results indicate that Google RE2 provides the best balance of
+speed (10-15 ms/MB), memory efficiency (8-16 MB), and accuracy (99.5%) among
+regex engines, outperforming PCRE while maintaining broader hardware
+compatibility than Hyperscan. For exact matching, Aho-Corasick demonstrated
+superior performance (8 ms/MB) and scalability for large datasets. Performance
+analysis revealed that regex processing time scales linearly with dataset size
+and pattern complexity. A hybrid AI + Regex approach achieved the highest F1
+score (91. 6%) by improving recall and minimizing false positives. Device
+benchmarking confirmed that our solution maintains efficient CPU and memory
+usage on both high-performance and mid-range systems. Despite its
+effectiveness, challenges remain, such as limited multilingual support and the
+need for regular pattern updates. Future work should focus on expanding
+language coverage, integrating data security and privacy management (DSPM) with
+data loss prevention (DLP) tools, and enhancing regulatory compliance for
+broader global adoption.
 
-##### **Indeterminacy in Affective Computing: Considering Meaning and Context in Data Collection Practices**
-2502.09294v1 by Bernd Dudzik, Tiffany Matej Hrkalovic, Chenxu Hao, Chirag Raman, Masha Tsfasman
+摘要：偵測個人身分資訊 (PII) 和受保護健康資訊 (PHI) 等敏感資料，對於資料安全平台至關重要。本研究評估基於 regex 的模式配對演算法和精確配對搜尋技術，以最佳化偵測速度、準確度和可擴充性。我們的基準測試結果顯示，在 regex 引擎中，Google RE2 在速度 (10-15 ms/MB)、記憶體效率 (8-16 MB) 和準確度 (99.5%) 方面取得最佳平衡，優於 PCRE，同時比 Hyperscan 擁有更廣泛的硬體相容性。對於精確配對，Aho-Corasick 展現出優異的效能 (8 ms/MB) 和大資料集的可擴充性。效能分析顯示，regex 處理時間會隨著資料集大小和模式複雜度線性擴充。混合 AI + Regex 方法透過提升召回率和將假陽性降至最低，達到了最高的 F1 分數 (91. 6%)。裝置基準測試確認我們的解決方案在高性能和中階系統上都能維持高效的 CPU 和記憶體使用率。儘管有效，但仍有挑戰存在，例如多語言支援有限，以及需要定期更新模式。未來的研究應著重於擴展語言涵蓋範圍，將資料安全和隱私管理 (DSPM) 與資料遺失防護 (DLP) 工具整合，以及加強法規遵循以利更廣泛的全球採用。
 
-Automatic Affect Prediction (AAP) uses computational analysis of input data
-such as text, speech, images, and physiological signals to predict various
-affective phenomena (e.g., emotions or moods). These models are typically
-constructed using supervised machine-learning algorithms, which rely heavily on
-labeled training datasets. In this position paper, we posit that all AAP
-training data are derived from human Affective Interpretation Processes,
-resulting in a form of Affective Meaning. Research on human affect indicates a
-form of complexity that is fundamental to such meaning: it can possess what we
-refer to here broadly as Qualities of Indeterminacy (QIs) - encompassing
-Subjectivity (meaning depends on who is interpreting), Uncertainty (lack of
-confidence regarding meanings' correctness), Ambiguity (meaning contains
-mutually exclusive concepts) and Vagueness (meaning is situated at different
-levels in a nested hierarchy). Failing to appropriately consider QIs leads to
-results incapable of meaningful and reliable predictions. Based on this
-premise, we argue that a crucial step in adequately addressing indeterminacy in
-AAP is the development of data collection practices for modeling corpora that
-involve the systematic consideration of 1) a relevant set of QIs and 2) context
-for the associated interpretation processes. To this end, we are 1) outlining a
-conceptual model of AIPs and the QIs associated with the meaning these produce
-and a conceptual structure of relevant context, supporting understanding of its
-role. Finally, we use our framework for 2) discussing examples of
-context-sensitivity-related challenges for addressing QIs in data collection
-setups. We believe our efforts can stimulate a structured discussion of both
-the role of aspects of indeterminacy and context in research on AAP, informing
-the development of better practices for data collection and analysis.
+##### **WatchGuardian: Enabling User-Defined Personalized Just-in-Time Intervention on Smartwatch**
+2502.05783v1 by Ying Lei, Yancheng Cao, Will Wang, Yuanzhe Dong, Changchang Yin, Weidan Cao, Ping Zhang, Jingzhen Yang, Bingsheng Yao, Yifan Peng, Chunhua Weng, Randy Auerbach, Lena Mamykina, Dakuo Wang, Yuntao Wang, Xuhai Xu
 
-摘要：自動影響預測 (AAP) 使用輸入資料的運算分析，例如文字、語音、影像和生理訊號，來預測各種情感現象（例如情緒或心情）。這些模型通常使用監督式機器學習演算法建構，而這些演算法高度依賴標籤訓練資料集。在此立場文件中，我們主張所有 AAP 訓練資料都是從人類的情感詮釋過程中衍生而來的，進而形成一種情感意義。對人類情感的研究指出，這種複雜性是此種意義的基本要素：它可能具備我們在此廣泛稱之為不確定性品質 (QI)，包括主觀性（意義取決於詮釋者）、不確定性（對於意義正確性的信心不足）、歧義性（意義包含相互排斥的概念）和模糊性（意義位於嵌套層級的不同層級）。未能適當地考量 QI 會導致無法進行有意義且可靠預測的結果。基於此前提，我們主張，在 AAP 中適當地處理不確定性的關鍵步驟，是針對建模語料庫制定資料收集實務，其中涉及系統性地考量 1) 一組相關的 QI，以及 2) 相關詮釋過程的脈絡。為此，我們 1) 概述了 AIP 的概念模型，以及與這些 AIP 所產生的意義相關的 QI，以及相關脈絡的概念結構，支持對其角色的理解。最後，我們使用我們的架構 2) 討論了在資料收集設定中處理 QI 時，與脈絡敏感性相關的挑戰範例。我們相信我們的努力可以激勵對不確定性和脈絡面向在 AAP 研究中扮演的角色進行結構化的討論，為資料收集和分析的最佳實務發展提供資訊。
+While just-in-time interventions (JITIs) have effectively targeted common
+health behaviors, individuals often have unique needs to intervene in personal
+undesirable actions that can negatively affect physical, mental, and social
+well-being. We present WatchGuardian, a smartwatch-based JITI system that
+empowers users to define custom interventions for these personal actions with a
+small number of samples. For the model to detect new actions based on limited
+new data samples, we developed a few-shot learning pipeline that finetuned a
+pre-trained inertial measurement unit (IMU) model on public hand-gesture
+datasets. We then designed a data augmentation and synthesis process to train
+additional classification layers for customization. Our offline evaluation with
+26 participants showed that with three, five, and ten examples, our approach
+achieved an average accuracy of 76.8%, 84.7%, and 87.7%, and an F1 score of
+74.8%, 84.2%, and 87.2% We then conducted a four-hour intervention study to
+compare WatchGuardian against a rule-based intervention. Our results
+demonstrated that our system led to a significant reduction by 64.0 +- 22.6% in
+undesirable actions, substantially outperforming the baseline by 29.0%. Our
+findings underscore the effectiveness of a customizable, AI-driven JITI system
+for individuals in need of behavioral intervention in personal undesirable
+actions. We envision that our work can inspire broader applications of
+user-defined personalized intervention with advanced AI solutions.
 
-##### **SparQLe: Speech Queries to Text Translation Through LLMs**
-2502.09284v1 by Amirbek Djanibekov, Hanan Aldarmaki
+摘要：<paragraph>雖然即時介入（JITIs）有效地針對常見的健康行為，但個人通常有獨特的需求來介入可能會對身心和社會福祉產生負面影響的個人不良行為。我們提出 WatchGuardian，這是一個基於智慧手錶的 JITI 系統，它使用少數樣本讓使用者能夠為這些個人行為定義自訂介入措施。為了讓模型根據有限的新資料樣本偵測新行為，我們開發了一個小樣本學習管道，微調了公共手勢資料集上的預訓練慣性測量單元（IMU）模型。然後，我們設計了一個資料擴充和合成流程，以訓練其他分類層以進行自訂。我們對 26 位參與者進行的離線評估顯示，我們的做法使用三個、五個和十個範例，達到了 76.8%、84.7% 和 87.7% 的平均準確度，以及 74.8%、84.2% 和 87.2% 的 F1 分數。然後，我們進行了一項為時四小時的介入研究，以將 WatchGuardian 與基於規則的介入進行比較。我們的結果表明，我們的系統導致不良行為顯著減少了 64.0 +- 22.6%，大幅優於基線 29.0%。我們的研究結果強調了可自訂、AI 驅動的 JITI 系統對需要行為介入以應對個人不良行為的個人的有效性。我們預計我們的研究可以激勵使用者定義個人化介入的更廣泛應用，並採用先進的 AI 解決方案。</paragraph>
 
-With the growing influence of Large Language Models (LLMs), there is
-increasing interest in integrating speech representations with them to enable
-more seamless multi-modal processing and speech understanding. This study
-introduces a novel approach that leverages self-supervised speech
-representations in combination with instruction-tuned LLMs for speech-to-text
-translation. The proposed approach leverages a modality adapter to align
-extracted speech features with instruction-tuned LLMs using English-language
-data. Our experiments demonstrate that this method effectively preserves the
-semantic content of the input speech and serves as an effective bridge between
-self-supervised speech models and instruction-tuned LLMs, offering a promising
-solution for various speech understanding applications.
+##### **RECOVER: Designing a Large Language Model-based Remote Patient Monitoring System for Postoperative Gastrointestinal Cancer Care**
+2502.05740v1 by Ziqi Yang, Yuxuan Lu, Jennifer Bagdasarian, Vedant Das Swain, Ritu Agarwal, Collin Campbell, Waddah Al-Refaire, Jehan El-Bayoumi, Guodong Gao, Dakuo Wang, Bingsheng Yao, Nawar Shara
 
-摘要：隨著大型語言模型（LLM）影響力逐漸擴大，將語音表徵與其整合，以實現更順暢的多模態處理和語音理解，已引起越來越多的興趣。本研究提出了一種新穎的方法，該方法利用自監督語音表徵，結合指令調整的 LLM，進行語音轉文字翻譯。所提出的方法利用模態適配器，使用英語語言資料，將提取的語音特徵與指令調整的 LLM 對齊。我們的實驗證明，此方法有效地保留了輸入語音的語義內容，並作為自監督語音模型和指令調整的 LLM 之間的有效橋樑，為各種語音理解應用程式提供了一個有前景的解決方案。
+Cancer surgery is a key treatment for gastrointestinal (GI) cancers, a group
+of cancers that account for more than 35% of cancer-related deaths worldwide,
+but postoperative complications are unpredictable and can be life-threatening.
+In this paper, we investigate how recent advancements in large language models
+(LLMs) can benefit remote patient monitoring (RPM) systems through clinical
+integration by designing RECOVER, an LLM-powered RPM system for postoperative
+GI cancer care. To closely engage stakeholders in the design process, we first
+conducted seven participatory design sessions with five clinical staff and
+interviewed five cancer patients to derive six major design strategies for
+integrating clinical guidelines and information needs into LLM-based RPM
+systems. We then designed and implemented RECOVER, which features an
+LLM-powered conversational agent for cancer patients and an interactive
+dashboard for clinical staff to enable efficient postoperative RPM. Finally, we
+used RECOVER as a pilot system to assess the implementation of our design
+strategies with four clinical staff and five patients, providing design
+implications by identifying crucial design elements, offering insights on
+responsible AI, and outlining opportunities for future LLM-powered RPM systems.
 
-##### **LiSA: Leveraging Link Recommender to Attack Graph Neural Networks via Subgraph Injection**
-2502.09271v1 by Wenlun Zhang, Enyan Dai, Kentaro Yoshioka
+摘要：癌症手術是胃腸道 (GI) 癌症的主要治療方式，這類癌症佔全球癌症相關死亡人數的 35% 以上，但術後併發症無法預測，且可能危及生命。在本文中，我們探討大型語言模型 (LLM) 的近期進展如何透過臨床整合造福遠端病患監控 (RPM) 系統，方法是設計 RECOVER，一個由 LLM 驅動的 RPM 系統，用於術後胃腸道癌症照護。為了讓利害關係人密切參與設計流程，我們首先與五位臨床人員進行七場參與式設計會議，並訪談五位癌症患者，以找出六項整合臨床指南和資訊需求至基於 LLM 的 RPM 系統的主要設計策略。接著，我們設計並實作 RECOVER，其特色在於一個由 LLM 驅動的對話式代理人，供癌症患者使用，以及一個互動式儀表板，供臨床人員使用，以進行有效的術後 RPM。最後，我們使用 RECOVER 作為試點系統，與四位臨床人員和五位患者評估我們設計策略的實作，並透過找出重要的設計元素、提供對負責任 AI 的見解，以及概述未來由 LLM 驅動的 RPM 系統的機會，提出設計意涵。
 
-Graph Neural Networks (GNNs) have demonstrated remarkable proficiency in
-modeling data with graph structures, yet recent research reveals their
-susceptibility to adversarial attacks. Traditional attack methodologies, which
-rely on manipulating the original graph or adding links to artificially created
-nodes, often prove impractical in real-world settings. This paper introduces a
-novel adversarial scenario involving the injection of an isolated subgraph to
-deceive both the link recommender and the node classifier within a GNN system.
-Specifically, the link recommender is mislead to propose links between targeted
-victim nodes and the subgraph, encouraging users to unintentionally establish
-connections and that would degrade the node classification accuracy, thereby
-facilitating a successful attack. To address this, we present the LiSA
-framework, which employs a dual surrogate model and bi-level optimization to
-simultaneously meet two adversarial objectives. Extensive experiments on
-real-world datasets demonstrate the effectiveness of our method.
+##### **4D VQ-GAN: Synthesising Medical Scans at Any Time Point for Personalised Disease Progression Modelling of Idiopathic Pulmonary Fibrosis**
+2502.05713v1 by An Zhao, Moucheng Xu, Ahmed H. Shahin, Wim Wuyts, Mark G. Jones, Joseph Jacob, Daniel C. Alexander
 
-摘要：圖形神經網路 (GNN) 已展現出在對具有圖形結構的資料進行建模方面的卓越能力，但最近的研究揭露了它們容易受到對抗性攻擊的影響。傳統的攻擊方法依賴於操縱原始圖形或將連結新增至人工建立的節點，在真實世界設定中通常被證明不切實際。本文介紹了一種新穎的對抗性場景，涉及注入一個孤立的子圖形，以欺騙 GNN 系統中的連結推薦器和節點分類器。具體來說，連結推薦器被誤導為在目標受害節點和子圖形之間提出連結，鼓勵使用者無意間建立連結，這將降低節點分類準確度，從而促成攻擊成功。為了解決這個問題，我們提出了 LiSA 框架，它採用雙重代理模型和雙層最佳化，以同時滿足兩個對抗性目標。對真實世界資料集進行的廣泛實驗證明了我們方法的有效性。
+Understanding the progression trajectories of diseases is crucial for early
+diagnosis and effective treatment planning. This is especially vital for
+life-threatening conditions such as Idiopathic Pulmonary Fibrosis (IPF), a
+chronic, progressive lung disease with a prognosis comparable to many cancers.
+Computed tomography (CT) imaging has been established as a reliable diagnostic
+tool for IPF. Accurately predicting future CT scans of early-stage IPF patients
+can aid in developing better treatment strategies, thereby improving survival
+outcomes. In this paper, we propose 4D Vector Quantised Generative Adversarial
+Networks (4D-VQ-GAN), a model capable of generating realistic CT volumes of IPF
+patients at any time point. The model is trained using a two-stage approach. In
+the first stage, a 3D-VQ-GAN is trained to reconstruct CT volumes. In the
+second stage, a Neural Ordinary Differential Equation (ODE) based temporal
+model is trained to capture the temporal dynamics of the quantised embeddings
+generated by the encoder in the first stage. We evaluate different
+configurations of our model for generating longitudinal CT scans and compare
+the results against ground truth data, both quantitatively and qualitatively.
+For validation, we conduct survival analysis using imaging biomarkers derived
+from generated CT scans and achieve a C-index comparable to that of biomarkers
+derived from the real CT scans. The survival analysis results demonstrate the
+potential clinical utility inherent to generated longitudinal CT scans, showing
+that they can reliably predict survival outcomes.
 
-##### **AnomalyGFM: Graph Foundation Model for Zero/Few-shot Anomaly Detection**
-2502.09254v1 by Hezhe Qiao, Chaoxi Niu, Ling Chen, Guansong Pang
+摘要：了解疾病的進程軌跡對於早期診斷和有效的治療計畫至關重要。這對於特發性肺纖維化 (IPF) 等威脅生命的疾病尤其重要，IPF 是一種慢性、進行性肺部疾病，其預後與許多癌症相當。電腦斷層掃描 (CT) 影像已被確立為 IPF 的可靠診斷工具。準確預測早期 IPF 患者的未來 CT 掃描有助於制定更好的治療策略，從而改善存活結果。在本文中，我們提出 4D 向量量化生成對抗網路 (4D-VQ-GAN)，這是一個模型，能夠在任何時間點生成 IPF 患者的逼真 CT 體積。該模型使用兩階段方法進行訓練。在第一階段，訓練 3D-VQ-GAN 以重建 CT 體積。在第二階段，訓練基於神經常微分方程 (ODE) 的時間模型，以捕捉第一階段編碼器生成的量化嵌入的時間動態。我們評估了我們的模型的不同配置，以生成縱向 CT 掃描，並在定量和定性方面將結果與真實數據進行比較。為了驗證，我們使用從生成的 CT 掃描中得出的影像生物標記進行存活分析，並獲得與從真實 CT 掃描中得出的生物標記相當的 C 指數。存活分析結果證明了生成縱向 CT 掃描固有的潛在臨床效用，表明它們可以可靠地預測存活結果。
 
-Graph anomaly detection (GAD) aims to identify abnormal nodes that differ
-from the majority of the nodes in a graph, which has been attracting
-significant attention in recent years. Existing generalist graph models have
-achieved remarkable success in different graph tasks but struggle to generalize
-to the GAD task. This limitation arises from their difficulty in learning
-generalized knowledge for capturing the inherently infrequent, irregular and
-heterogeneous abnormality patterns in graphs from different domains. To address
-this challenge, we propose AnomalyGFM, a GAD-oriented graph foundation model
-that supports zero-shot inference and few-shot prompt tuning for GAD in diverse
-graph datasets. One key insight is that graph-agnostic representations for
-normal and abnormal classes are required to support effective zero/few-shot GAD
-across different graphs. Motivated by this, AnomalyGFM is pre-trained to align
-data-independent, learnable normal and abnormal class prototypes with node
-representation residuals (i.e., representation deviation of a node from its
-neighbors). The residual features essentially project the node information into
-a unified feature space where we can effectively measure the abnormality of
-nodes from different graphs in a consistent way. This provides a driving force
-for the learning of graph-agnostic, discriminative prototypes for the normal
-and abnormal classes, which can be used to enable zero-shot GAD on new graphs,
-including very large-scale graphs. If there are few-shot labeled normal nodes
-available in the new graphs, AnomalyGFM can further support prompt tuning to
-leverage these nodes for better adaptation. Comprehensive experiments on 11
-widely-used GAD datasets with real anomalies, demonstrate that AnomalyGFM
-significantly outperforms state-of-the-art competing methods under both zero-
-and few-shot GAD settings.
+##### **KMI: A Dataset of Korean Motivational Interviewing Dialogues for Psychotherapy**
+2502.05651v1 by Hyunjong Kim, Suyeon Lee, Yeongjae Cho, Eunseo Ryu, Yohan Jo, Suran Seong, Sungzoon Cho
 
-摘要：圖形異常偵測 (GAD) 的目標是找出與圖形中大多數節點不同的異常節點，這在近年來引起了廣泛的關注。現有的通才圖形模型在不同的圖形任務中都取得了顯著的成功，但卻難以推廣到 GAD 任務。這種限制來自於它們難以學習廣泛的知識，用於擷取來自不同領域圖形中固有的罕見、不規則和異質異常模式。為了應對這個挑戰，我們提出了 AnomalyGFM，一個面向 GAD 的圖形基礎模型，它支援零次學習推論和少次提示調整，用於在不同的圖形資料集中進行 GAD。一個關鍵見解是，需要圖形不可知的正常和異常類別表示，以支援跨不同圖形的有效零次/少次 GAD。受此啟發，AnomalyGFM 被預先訓練以將與資料無關的可學習正常和異常類別原型與節點表示殘差（即節點與其鄰居的表示偏差）對齊。殘差特徵基本上將節點資訊投射到一個統一的特徵空間中，在這個空間中，我們可以有效地測量來自不同圖形的節點異常，並且方式一致。這為學習正常和異常類別的圖形不可知、有區別的原型提供了驅動力，這些原型可用於對新的圖形（包括非常大規模的圖形）啟用零次 GAD。如果新的圖形中有少量的標籤正常節點，AnomalyGFM 可以進一步支援提示調整，以利用這些節點進行更好的適應。在 11 個廣泛使用的具有真實異常值的 GAD 資料集上的綜合實驗表明，在零次和少次 GAD 設定下，AnomalyGFM 明顯優於最先進的競爭方法。
+The increasing demand for mental health services has led to the rise of
+AI-driven mental health chatbots, though challenges related to privacy, data
+collection, and expertise persist. Motivational Interviewing (MI) is gaining
+attention as a theoretical basis for boosting expertise in the development of
+these chatbots. However, existing datasets are showing limitations for training
+chatbots, leading to a substantial demand for publicly available resources in
+the field of MI and psychotherapy. These challenges are even more pronounced in
+non-English languages, where they receive less attention. In this paper, we
+propose a novel framework that simulates MI sessions enriched with the
+expertise of professional therapists. We train an MI forecaster model that
+mimics the behavioral choices of professional therapists and employ Large
+Language Models (LLMs) to generate utterances through prompt engineering. Then,
+we present KMI, the first synthetic dataset theoretically grounded in MI,
+containing 1,000 high-quality Korean Motivational Interviewing dialogues.
+Through an extensive expert evaluation of the generated dataset and the
+dialogue model trained on it, we demonstrate the quality, expertise, and
+practicality of KMI. We also introduce novel metrics derived from MI theory in
+order to evaluate dialogues from the perspective of MI.
 
-##### **The Joint Entity-Relation Extraction Model Based on Span and Interactive Fusion Representation for Chinese Medical Texts with Complex Semantics**
-2502.09247v1 by Danni Feng, Runzhi Li, Jing Wang, Siyu Yan, Lihong Ma, Yunli Xing
+摘要：由於對心理健康服務的需求日益增加，導致以人工智慧為基礎的心理健康聊天機器人興起，儘管與隱私、資料蒐集和專業知識相關的挑戰依然存在。動機性訪談 (MI) 正作為提升這些聊天機器人在開發方面專業知識的理論基礎而備受關注。然而，現有的資料集顯示出訓練聊天機器人的限制，導致對 MI 和心理治療領域中公開可用資源的需求大幅增加。這些挑戰在非英語語言中更加明顯，因為它們受到的關注較少。在本文中，我們提出了一個新穎的架構，它模擬了豐富專業治療師專業知識的 MI 課程。我們訓練了一個 MI 預測模型，它模擬了專業治療師的行為選擇，並採用大型語言模型 (LLM) 透過提示工程來產生話語。然後，我們展示了 KMI，這是第一個理論上以 MI 為基礎的合成資料集，其中包含 1,000 個高品質的韓語動機性訪談對話。透過對所產生的資料集和在該資料集上訓練的對話模型進行廣泛的專家評估，我們展示了 KMI 的品質、專業知識和實用性。我們還引入了從 MI 理論中衍生的新指標，以便從 MI 的角度評估對話。
 
-Joint entity-relation extraction is a critical task in transforming
-unstructured or semi-structured text into triplets, facilitating the
-construction of large-scale knowledge graphs, and supporting various downstream
-applications. Despite its importance, research on Chinese text, particularly
-with complex semantics in specialized domains like medicine, remains limited.
-To address this gap, we introduce the CH-DDI, a Chinese drug-drug interactions
-dataset designed to capture the intricacies of medical text. Leveraging the
-strengths of attention mechanisms in capturing long-range dependencies, we
-propose the SEA module, which enhances the extraction of complex contextual
-semantic information, thereby improving entity recognition and relation
-extraction. Additionally, to address the inefficiencies of existing methods in
-facilitating information exchange between entity recognition and relation
-extraction, we present an interactive fusion representation module. This module
-employs Cross Attention for bidirectional information exchange between the
-tasks and further refines feature extraction through BiLSTM. Experimental
-results on both our CH-DDI dataset and public CoNLL04 dataset demonstrate that
-our model exhibits strong generalization capabilities. On the CH-DDI dataset,
-our model achieves an F1-score of 96.73% for entity recognition and 78.43% for
-relation extraction. On the CoNLL04 dataset, it attains an entity recognition
-precision of 89.54% and a relation extraction accuracy of 71.64%.
+##### **ELMTEX: Fine-Tuning Large Language Models for Structured Clinical Information Extraction. A Case Study on Clinical Reports**
+2502.05638v1 by Aynur Guluzade, Naguib Heiba, Zeyd Boukhers, Florim Hamiti, Jahid Hasan Polash, Yehya Mohamad, Carlos A Velasco
 
-摘要：聯合實體關係抽取是將非結構化或半結構化文字轉換為三元組的重要任務，有助於建構大規模知識圖譜，並支援各種下游應用程式。儘管其重要性，但針對中文文本的研究，特別是醫學等專業領域中具有複雜語義的研究仍十分有限。為了解決這個差距，我們引入了 CH-DDI，一個中文藥物-藥物交互作用資料集，旨在擷取醫學文本的複雜性。利用注意力機制在擷取長程依賴關係方面的優勢，我們提出了 SEA 模組，增強了複雜脈絡語義資訊的抽取，從而改進了實體辨識和關係抽取。此外，為了解決現有方法在促進實體辨識和關係抽取之間資訊交換方面的低效率問題，我們提出了互動式融合表示模組。此模組採用交叉注意力，在任務之間進行雙向資訊交換，並透過 BiLSTM 進一步精煉特徵抽取。在我們的 CH-DDI 資料集和公開的 CoNLL04 資料集上的實驗結果表明，我們的模型展現出強大的泛化能力。在 CH-DDI 資料集上，我們的模型在實體辨識方面達到了 96.73% 的 F1 分數，在關係抽取方面達到了 78.43% 的 F1 分數。在 CoNLL04 資料集上，它在實體辨識方面達到了 89.54% 的準確度，在關係抽取方面達到了 71.64% 的準確度。
+Europe's healthcare systems require enhanced interoperability and
+digitalization, driving a demand for innovative solutions to process legacy
+clinical data. This paper presents the results of our project, which aims to
+leverage Large Language Models (LLMs) to extract structured information from
+unstructured clinical reports, focusing on patient history, diagnoses,
+treatments, and other predefined categories. We developed a workflow with a
+user interface and evaluated LLMs of varying sizes through prompting strategies
+and fine-tuning. Our results show that fine-tuned smaller models match or
+surpass larger counterparts in performance, offering efficiency for
+resource-limited settings. A new dataset of 60,000 annotated English clinical
+summaries and 24,000 German translations was validated with automated and
+manual checks. The evaluations used ROUGE, BERTScore, and entity-level metrics.
+The work highlights the approach's viability and outlines future improvements.
 
-##### **You Do Not Fully Utilize Transformer's Representation Capacity**
-2502.09245v1 by Gleb Gerasimov, Yaroslav Aksenov, Nikita Balagansky, Viacheslav Sinii, Daniil Gavrilov
+摘要：歐洲的醫療保健系統需要增強互通性和數位化，這驅動了對創新解決方案的需求，以處理傳統的臨床數據。本文介紹了我們專案的成果，該專案旨在利用大型語言模型 (LLM) 從非結構化的臨床報告中提取結構化的資訊，重點放在病歷、診斷、治療和其他預定義類別上。我們開發了一個具有使用者介面的工作流程，並透過提示策略和微調來評估不同規模的 LLM。我們的結果顯示，微調後的較小模型在效能上與較大的模型相匹配或超越它們，為資源有限的環境提供了效率。一個包含 60,000 個註解英文臨床摘要和 24,000 個德文翻譯的新資料集已透過自動化和手動檢查進行驗證。評估使用了 ROUGE、BERTScore 和實體層級的指標。這項工作突出了這種方法的可行性，並概述了未來的改進。
 
-In contrast to RNNs, which compress previous tokens into a single hidden
-state, Transformers can attend to all previous tokens directly. However,
-standard Transformers only use representations from the immediately preceding
-layer. In this paper, we show that this design choice causes representation
-collapse and leads to suboptimal performance. To address this issue, we
-introduce Layer-Integrated Memory (LIMe), a simple yet powerful approach that
-preserves the model's overall memory footprint while expanding its
-representational capacity by allowing access to hidden states from earlier
-layers. Through extensive experiments across various architectures and
-different lookup mechanisms, we demonstrate consistent performance improvements
-on a wide range of tasks. Moreover, our analysis of the learned representation
-dynamics and our exploration of depthwise circuits reveal how LIMe integrates
-information across layers, pointing to promising directions for future
-research.
+##### **Multi-scale Masked Autoencoder for Electrocardiogram Anomaly Detection**
+2502.05494v1 by Ya Zhou, Yujie Yang, Jianhuang Gan, Xiangjie Li, Jing Yuan, Wei Zhao
 
-摘要：與將先前符號壓縮成單一隱藏狀態的遞迴神經網路不同，Transformer 可以直接關注所有先前的符號。然而，標準 Transformer 僅使用緊鄰前一層的表示。在本文中，我們說明此設計選擇會導致表示崩潰，並導致次優效能。為了解決此問題，我們引入了「層整合式記憶體」(LIMe)，這是一種簡單但強大的方法，可在擴充表示能力的同時，保留模型的整體記憶體使用量，方法是允許存取來自較早層的隱藏狀態。透過各種架構和不同查詢機制的廣泛實驗，我們展示了在各種任務上的一致效能提升。此外，我們對已學習表示動態的分析和對深度電路的探討，揭示了 LIMe 如何整合跨層資訊，並指出未來研究有望發展的方向。
+Electrocardiogram (ECG) analysis is a fundamental tool for diagnosing
+cardiovascular conditions, yet anomaly detection in ECG signals remains
+challenging due to their inherent complexity and variability. We propose
+Multi-scale Masked Autoencoder for ECG anomaly detection (MMAE-ECG), a novel
+end-to-end framework that effectively captures both global and local
+dependencies in ECG data. Unlike state-of-the-art methods that rely on
+heartbeat segmentation or R-peak detection, MMAE-ECG eliminates the need for
+such pre-processing steps, enhancing its suitability for clinical deployment.
+MMAE-ECG partitions ECG signals into non-overlapping segments, with each
+segment assigned learnable positional embeddings. A novel multi-scale masking
+strategy and multi-scale attention mechanism, along with distinct positional
+embeddings, enable a lightweight Transformer encoder to effectively capture
+both local and global dependencies. The masked segments are then reconstructed
+using a single-layer Transformer block, with an aggregation strategy employed
+during inference to refine the outputs. Experimental results demonstrate that
+our method achieves performance comparable to state-of-the-art approaches while
+significantly reducing computational complexity-approximately 1/78 of the
+floating-point operations (FLOPs) required for inference. Ablation studies
+further validate the effectiveness of each component, highlighting the
+potential of multi-scale masked autoencoders for anomaly detection.
 
-##### **From large language models to multimodal AI: A scoping review on the potential of generative AI in medicine**
-2502.09242v1 by Lukas Buess, Matthias Keicher, Nassir Navab, Andreas Maier, Soroosh Tayebi Arasteh
+摘要：心電圖 (ECG) 分析是診斷心血管疾病的基本工具，但由於 ECG 訊號本身的複雜性和變異性，異常偵測仍然是一項挑戰。我們提出用於 ECG 異常偵測的多尺度遮罩自編碼器 (MMAE-ECG)，這是一個新穎的端對端架構，可有效擷取 ECG 資料中的全局和局部依賴關係。與依賴於心跳區段或 R 波峰偵測的最新方法不同，MMAE-ECG 消除了對此類前處理步驟的需求，增強其適用於臨床部署。MMAE-ECG 將 ECG 訊號分割成不相疊的區段，每個區段都指派可學習的位置嵌入。新穎的多尺度遮罩策略和多尺度注意力機制，以及不同的位置嵌入，使輕量級 Transformer 編碼器能夠有效擷取局部和全局依賴關係。然後使用單層 Transformer 區塊重建遮罩區段，並在推理期間採用聚合策略來優化輸出。實驗結果表明，我們的模型達到了與最新方法相當的效能，同時大幅降低運算複雜度，約為推理所需的浮點運算 (FLOP) 的 1/78。消融研究進一步驗證了每個組件的有效性，突顯了多尺度遮罩自編碼器在異常偵測方面的潛力。
 
-Generative artificial intelligence (AI) models, such as diffusion models and
-OpenAI's ChatGPT, are transforming medicine by enhancing diagnostic accuracy
-and automating clinical workflows. The field has advanced rapidly, evolving
-from text-only large language models for tasks such as clinical documentation
-and decision support to multimodal AI systems capable of integrating diverse
-data modalities, including imaging, text, and structured data, within a single
-model. The diverse landscape of these technologies, along with rising interest,
-highlights the need for a comprehensive review of their applications and
-potential. This scoping review explores the evolution of multimodal AI,
-highlighting its methods, applications, datasets, and evaluation in clinical
-settings. Adhering to PRISMA-ScR guidelines, we systematically queried PubMed,
-IEEE Xplore, and Web of Science, prioritizing recent studies published up to
-the end of 2024. After rigorous screening, 144 papers were included, revealing
-key trends and challenges in this dynamic field. Our findings underscore a
-shift from unimodal to multimodal approaches, driving innovations in diagnostic
-support, medical report generation, drug discovery, and conversational AI.
-However, critical challenges remain, including the integration of heterogeneous
-data types, improving model interpretability, addressing ethical concerns, and
-validating AI systems in real-world clinical settings. This review summarizes
-the current state of the art, identifies critical gaps, and provides insights
-to guide the development of scalable, trustworthy, and clinically impactful
-multimodal AI solutions in healthcare.
+##### **DCENWCNet: A Deep CNN Ensemble Network for White Blood Cell Classification with LIME-Based Explainability**
+2502.05459v1 by Sibasish Dhibar
 
-摘要：生成式人工智能 (AI) 模型，例如扩散模型和 OpenAI 的 ChatGPT，通过提高诊断准确性和自动化临床工作流程，正在改变医学领域。该领域已迅速发展，从用于临床文件编制和决策支持等任务的纯文本大型语言模型，发展到能够在单个模型中整合包括影像、文本和结构化数据在内的多种数据方式的多模态 AI 系统。这些技术的多样化格局以及日益增长的兴趣，凸显了全面审查其应用和潜力的必要性。本范围审查探讨了多模态 AI 的演变，重点介绍了其方法、应用、数据集和在临床环境中的评估。遵循 PRISMA-ScR 指南，我们系统地查询了 PubMed、IEEE Xplore 和 Web of Science，优先考虑截至 2024 年底发表的最新研究。经过严格筛选，纳入了 144 篇论文，揭示了这一充满活力的领域的趋势和挑战。我们的研究结果强调了从单模态方法向多模态方法的转变，推动了诊断支持、医疗报告生成、药物发现和会话式 AI 的创新。然而，关键挑战仍然存在，包括异构数据类型的整合、提高模型可解释性、解决伦理问题以及在现实世界的临床环境中验证 AI 系统。本综述总结了当前的最新技术，确定了关键差距，并提供了见解，以指导在医疗保健领域开发可扩展、可信赖且具有临床影响力的多模态 AI 解决方案。
+White blood cells (WBC) are important parts of our immune system, and they
+protect our body against infections by eliminating viruses, bacteria, parasites
+and fungi. The number of WBC types and the total number of WBCs provide
+important information about our health status. A traditional method,
+convolutional neural networks (CNN), a deep learning architecture, can classify
+the blood cell from a part of an object and perform object recognition. Various
+CNN models exhibit potential; however, their development often involves ad-hoc
+processes that neglect unnecessary layers, leading to issues with unbalanced
+datasets and insufficient data augmentation. To address these challenges, we
+propose a novel ensemble approach that integrates three CNN architectures, each
+uniquely configured with different dropout and max-pooling layer settings to
+enhance feature learning. This ensemble model, named DCENWCNet, effectively
+balances the bias-variance trade-off. When evaluated on the widely recognized
+Rabbin-WBC dataset, our model outperforms existing state-of-the-art networks,
+achieving highest mean accuracy. Additionally, it demonstrates superior
+performance in precision, recall, F1-score, and Area Under the ROC Curve (AUC)
+across all categories. To delve deeper into the interpretability of
+classifiers, we employ reliable post-hoc explanation techniques, including
+Local Interpretable Model-Agnostic Explanations (LIME). These methods
+approximate the behavior of a black-box model by elucidating the relationships
+between feature values and predictions. Interpretable results enable users to
+comprehend and validate the model's predictions, thereby increasing their
+confidence in the automated diagnosis.
 
-##### **Reliable Conversational Agents under ASP Control that Understand Natural Language**
-2502.09237v1 by Yankai Zeng
+摘要：白血球 (WBC) 是我們免疫系統的重要組成部分，它們通過清除病毒、細菌、寄生蟲和真菌來保護我們的機體免受感染。WBC 類型數量和 WBC 總數提供了有關我們健康狀況的重要資訊。傳統方法卷積神經網路 (CNN) 是一種深度學習架構，可以對物體的一部分進行血細胞分類並執行物體識別。各種 CNN 模型展現出潛力；然而，它們的開發通常涉及忽略不必要層的臨時過程，導致不平衡的資料集和資料擴充不足的問題。為了應對這些挑戰，我們提出了一種新穎的整體方法，它整合了三種 CNN 架構，每種架構都採用不同的中斷和最大池化層設定進行獨特配置，以增強特徵學習。這種名為 DCENWCNet 的整體模型有效地平衡了偏差變異取捨。在廣泛認可的 Rabbin-WBC 資料集上進行評估時，我們的模型優於現有的最先進網路，達到了最高的平均準確度。此外，它在所有類別中都展示了在精確度、召回率、F1 分數和 ROC 曲線下面積 (AUC) 方面的卓越效能。為了更深入地研究分類器的可解釋性，我們採用了可靠的事後解釋技術，包括局部可解釋模型不可知解釋 (LIME)。這些方法通過闡明特徵值和預測之間的關係來近似黑盒模型的行為。可解釋的結果使用戶能夠理解和驗證模型的預測，從而增加他們對自動化診斷的信心。
 
-Efforts have been made to make machines converse like humans in the past few
-decades. The recent techniques of Large Language Models (LLMs) make it possible
-to have human-like conversations with machines, but LLM's flaws of lacking
-understanding and reliability are well documented. We believe that the best way
-to eliminate this problem is to use LLMs only as parsers to translate text to
-knowledge and vice versa and carry out the conversation by reasoning over this
-knowledge using the answer set programming. I have been developing a framework
-based on LLMs and ASP to realize reliable chatbots that "understand" human
-conversation. This framework has been used to develop task-specific chatbots as
-well as socialbots. My future research is focused on making these chatbots
-scalable and trainable.
+##### **Multi-Class Segmentation of Aortic Branches and Zones in Computed Tomography Angiography: The AortaSeg24 Challenge**
+2502.05330v1 by Muhammad Imran, Jonathan R. Krebs, Vishal Balaji Sivaraman, Teng Zhang, Amarjeet Kumar, Walker R. Ueland, Michael J. Fassler, Jinlong Huang, Xiao Sun, Lisheng Wang, Pengcheng Shi, Maximilian Rokuss, Michael Baumgartner, Yannick Kirchhof, Klaus H. Maier-Hein, Fabian Isensee, Shuolin Liu, Bing Han, Bong Thanh Nguyen, Dong-jin Shin, Park Ji-Woo, Mathew Choi, Kwang-Hyun Uhm, Sung-Jea Ko, Chanwoong Lee, Jaehee Chun, Jin Sung Kim, Minghui Zhang, Hanxiao Zhang, Xin You, Yun Gu, Zhaohong Pan, Xuan Liu, Xiaokun Liang, Markus Tiefenthaler, Enrique Almar-Munoz, Matthias Schwab, Mikhail Kotyushev, Rostislav Epifanov, Marek Wodzinski, Henning Muller, Abdul Qayyum, Moona Mazher, Steven A. Niederer, Zhiwei Wang, Kaixiang Yang, Jintao Ren, Stine Sofia Korreman, Yuchong Gao, Hongye Zeng, Haoyu Zheng, Rui Zheng, Jinghua Yue, Fugen Zhou, Bo Liu, Alexander Cosman, Muxuan Liang, Chang Zhao, Gilbert R. Upchurch Jr., Jun Ma, Yuyin Zhou, Michol A. Cooper, Wei Shao
 
-摘要：在過去的幾十年裡，人們一直努力讓機器像人類一樣對話。大型語言模型 (LLM) 的最新技術讓與機器進行類人對話成為可能，但 LLM 缺乏理解力和可靠性的缺陷已被充分記錄。我們相信消除這個問題的最佳方法是僅將 LLM 作為解析器，將文字轉換為知識，反之亦然，並使用答案集程式設計對此知識進行推理來進行對話。我一直在開發一個基於 LLM 和 ASP 的框架，以實現「理解」人類對話的可靠聊天機器人。這個框架已被用於開發特定任務的聊天機器人以及社交機器人。我未來的研究重點在於讓這些聊天機器人具有可擴充性和可訓練性。
+Multi-class segmentation of the aorta in computed tomography angiography
+(CTA) scans is essential for diagnosing and planning complex endovascular
+treatments for patients with aortic dissections. However, existing methods
+reduce aortic segmentation to a binary problem, limiting their ability to
+measure diameters across different branches and zones. Furthermore, no
+open-source dataset is currently available to support the development of
+multi-class aortic segmentation methods. To address this gap, we organized the
+AortaSeg24 MICCAI Challenge, introducing the first dataset of 100 CTA volumes
+annotated for 23 clinically relevant aortic branches and zones. This dataset
+was designed to facilitate both model development and validation. The challenge
+attracted 121 teams worldwide, with participants leveraging state-of-the-art
+frameworks such as nnU-Net and exploring novel techniques, including cascaded
+models, data augmentation strategies, and custom loss functions. We evaluated
+the submitted algorithms using the Dice Similarity Coefficient (DSC) and
+Normalized Surface Distance (NSD), highlighting the approaches adopted by the
+top five performing teams. This paper presents the challenge design, dataset
+details, evaluation metrics, and an in-depth analysis of the top-performing
+algorithms. The annotated dataset, evaluation code, and implementations of the
+leading methods are publicly available to support further research. All
+resources can be accessed at https://aortaseg24.grand-challenge.org.
 
-##### **Commonsense Reasoning-Aided Autonomous Vehicle Systems**
-2502.09233v1 by Keegan Kimbrell
+摘要：多類別主動脈電腦斷層血管攝影 (CTA) 掃描分割對於診斷和規劃主動脈剝離患者的複雜血管內治療至關重要。然而，現有方法將主動脈分割簡化為二元問題，限制了其測量不同分支和區域直徑的能力。此外，目前沒有開放原始碼數據集可用於支援多類別主動脈分割方法的開發。為了解決此問題，我們組織了 AortaSeg24 MICCAI 挑戰，引入了第一個包含 100 個 CTA 體積的數據集，這些體積針對 23 個臨床上相關的主動脈分支和區域進行了註釋。此數據集旨在促進模型開發和驗證。該挑戰吸引了來自世界各地的 121 個團隊，參與者利用了 nnU-Net 等最先進的框架，並探索了創新技術，包括串聯模型、數據擴充策略和自訂損失函數。我們使用 Dice 相似性係數 (DSC) 和標準化表面距離 (NSD) 評估了提交的演算法，重點介紹了前五名表現最佳團隊採用的方法。本文介紹了挑戰設計、數據集詳細資訊、評估指標以及對表現最佳演算法的深入分析。已公開註釋的數據集、評估程式碼和領先方法的實作，以支援進一步的研究。所有資源都可以在 https://aortaseg24.grand-challenge.org/ 獲得。
 
-Autonomous Vehicle (AV) systems have been developed with a strong reliance on
-machine learning techniques. While machine learning approaches, such as deep
-learning, are extremely effective at tasks that involve observation and
-classification, they struggle when it comes to performing higher level
-reasoning about situations on the road. This research involves incorporating
-commonsense reasoning models that use image data to improve AV systems. This
-will allow AV systems to perform more accurate reasoning while also making them
-more adjustable, explainable, and ethical. This paper will discuss the findings
-so far and motivate its direction going forward.
+##### **Homeomorphism Prior for False Positive and Negative Problem in Medical Image Dense Contrastive Representation Learning**
+2502.05282v1 by Yuting He, Boyu Wang, Rongjun Ge, Yang Chen, Guanyu Yang, Shuo Li
 
-摘要：自動駕駛車輛 (AV) 系統的開發高度依賴機器學習技術。儘管機器學習方法（例如深度學習）在涉及觀察和分類的任務中非常有效，但它們在對路況進行更高層級推理時會遇到困難。本研究涉及整合使用影像資料的常識推理模型，以改善 AV 系統。這將使 AV 系統能夠執行更準確的推理，同時也讓它們更具可調整性、可解釋性和道德性。本文將探討迄今為止的發現，並說明其未來的發展方向。
+Dense contrastive representation learning (DCRL) has greatly improved the
+learning efficiency for image-dense prediction tasks, showing its great
+potential to reduce the large costs of medical image collection and dense
+annotation. However, the properties of medical images make unreliable
+correspondence discovery, bringing an open problem of large-scale false
+positive and negative (FP&N) pairs in DCRL. In this paper, we propose GEoMetric
+vIsual deNse sImilarity (GEMINI) learning which embeds the homeomorphism prior
+to DCRL and enables a reliable correspondence discovery for effective dense
+contrast. We propose a deformable homeomorphism learning (DHL) which models the
+homeomorphism of medical images and learns to estimate a deformable mapping to
+predict the pixels' correspondence under topological preservation. It
+effectively reduces the searching space of pairing and drives an implicit and
+soft learning of negative pairs via a gradient. We also propose a geometric
+semantic similarity (GSS) which extracts semantic information in features to
+measure the alignment degree for the correspondence learning. It will promote
+the learning efficiency and performance of deformation, constructing positive
+pairs reliably. We implement two practical variants on two typical
+representation learning tasks in our experiments. Our promising results on
+seven datasets which outperform the existing methods show our great
+superiority. We will release our code on a companion link:
+https://github.com/YutingHe-list/GEMINI.
 
-##### **Logical foundations of Smart Contracts**
-2502.09232v1 by Kalonji Kalala
+摘要：密集对比表征学习（DCRL）极大地提高了影像密集预测任务的学习效率，显示出其在降低医学影像收集和密集标注的大量成本方面的巨大潜力。然而，医学影像的特性使得对应关系发现不可靠，给 DCRL 带来大规模假阳性和假阴性（FP&N）对的开放性问题。在本文中，我们提出了 GEoMetric vIsual deNse sImilarity（GEMINI）学习，它将同胚先验嵌入 DCRL 中，并针对有效密集对比提供了可靠的对应关系发现。我们提出了一种可变形同胚学习（DHL），它对医学影像的同胚进行建模，并学习估计可变形映射，以预测在拓扑保持下的像素对应关系。它有效地减少了配对的搜索空间，并通过梯度驱动了负对的隐式和软学习。我们还提出了几何语义相似性（GSS），它提取特征中的语义信息，以测量对应关系学习的对齐度。它将促进变形学习的效率和性能，可靠地构建正对。我们在实验中针对两个典型的表征学习任务实现了两个实际变体。我们在七个数据集上的有希望的结果优于现有方法，显示出我们的巨大优势。我们将在配套链接中发布我们的代码：https://github.com/YutingHe-list/GEMINI。
 
-Nowadays, sophisticated domains are emerging which require appropriate
-formalisms to be specified accurately in order to reason about them. One such
-domain is constituted of smart contracts that have emerged in cyber physical
-systems as a way of enforcing formal agreements between components of these
-systems. Smart contracts self-execute to run and share business processes
-through blockchain, in decentralized systems, with many different participants.
-Legal contracts are in many cases complex documents, with a number of
-exceptions, and many subcontracts. The implementation of smart contracts based
-on legal contracts is a long and laborious task, that needs to include all
-actions, procedures, and the effects of actions related to the execution of the
-contract. An ongoing open problem in this area is to formally account for smart
-contracts using a uniform and somewhat universal formalism. This thesis
-proposes logical foundations to smart contracts using the Situation Calculus, a
-logic for reasoning about actions. Situation Calculus is one of the prominent
-logic-based artificial intelligence approaches that provides enough logical
-mechanism to specify and implement dynamic and complex systems such as
-contracts. Situation Calculus is suitable to show how worlds dynamically
-change. Smart contracts are going to be implement with Golog (written en
-Prolog), a Situation Calculus-based programming language for modeling complex
-and dynamic behaviors.
+##### **"It Felt Like I Was Left in the Dark": Exploring Information Needs and Design Opportunities for Family Caregivers of Older Adult Patients in Critical Care Settings**
+2502.05115v1 by Shihan Fu, Bingsheng Yao, Smit Desai, Yuqi Hu, Yuling Sun, Samantha Stonbraker, Yanjun Gao, Elizabeth M. Goldberg, Dakuo Wang
 
-摘要：如今，正在出现需要适当形式化来准确指定以对其进行推理的复杂领域。此类领域之一由在网络物理系统中出现的智能合约构成，作为强制执行这些系统组件之间正式协议的一种方式。智能合约自执行以在去中心化系统中通过区块链运行和共享业务流程，并有许多不同的参与者。法律合约在许多情况下是复杂的文档，有许多例外和许多分包合同。基于法律合约实施智能合约是一项漫长而艰巨的任务，需要包括所有操作、程序以及与执行合约相关的操作效果。该领域的持续开放问题是使用统一且某种程度上通用的形式化来正式说明智能合约。本论文提出了使用情景演算（一种用于推理操作的逻辑）为智能合约提供逻辑基础。情景演算是基于逻辑的人工智能方法之一，提供了足够的逻辑机制来指定和实现动态且复杂的系统，例如合约。情景演算适用于展示世界如何动态变化。智能合约将使用 Golog（以 Prolog 编写的）实现，这是一种基于情景演算的编程语言，用于建模复杂且动态的行为。
+Older adult patients constitute a rapidly growing subgroup of Intensive Care
+Unit (ICU) patients. In these situations, their family caregivers are expected
+to represent the unconscious patients to access and interpret patients' medical
+information. However, caregivers currently have to rely on overloaded
+clinicians for information updates and typically lack the health literacy to
+understand complex medical information. Our project aims to explore the
+information needs of caregivers of ICU older adult patients, from which we can
+propose design opportunities to guide future AI systems. The project begins
+with formative interviews with 11 caregivers to identify their challenges in
+accessing and interpreting medical information; From these findings, we then
+synthesize design requirements and propose an AI system prototype to cope with
+caregivers' challenges. The system prototype has two key features: a timeline
+visualization to show the AI extracted and summarized older adult patients' key
+medical events; and an LLM-based chatbot to provide context-aware informational
+support. We conclude our paper by reporting on the follow-up user evaluation of
+the system and discussing future AI-based systems for ICU caregivers of older
+adults.
 
-##### **Relating Answer Set Programming and Many-sorted Logics for Formal Verification**
-2502.09230v1 by Zachary Hansen
+摘要：老年患者構成加護病房 (ICU) 患者中快速成長的子群。在這些情況下，預期他們的家庭照護者能代表無意識的患者取得並解讀患者的醫療資訊。然而，照護者目前必須依賴工作繁重的臨床醫師提供資訊更新，而且通常缺乏了解複雜醫療資訊的健康素養。我們的專案旨在探索 ICU 老年患者照護者的資訊需求，我們可以根據這些需求提出設計機會，以引導未來的 AI 系統。這個專案從對 11 位照護者的形成性訪談開始，以找出他們在取得和解讀醫療資訊方面的挑戰；根據這些發現，我們接著綜合設計需求，並提出一個 AI 系統原型，以應對照護者的挑戰。這個系統原型具有兩個關鍵特點：一個時間軸視覺化，以顯示 AI 萃取並摘要出的老年患者關鍵醫療事件；以及一個基於 LLM 的聊天機器人，以提供情境感知的資訊支援。我們透過報告系統的後續使用者評估，以及討論未來針對老年人 ICU 照護者的 AI 系統，來總結我們的論文。
 
-Answer Set Programming (ASP) is an important logic programming paradigm
-within the field of Knowledge Representation and Reasoning. As a concise,
-human-readable, declarative language, ASP is an excellent tool for developing
-trustworthy (especially, artificially intelligent) software systems. However,
-formally verifying ASP programs offers some unique challenges, such as
-  1. a lack of modularity (the meanings of rules are difficult to define in
-isolation from the enclosing program),
-  2. the ground-and-solve semantics (the meanings of rules are dependent on the
-input data with which the program is grounded), and
-  3. limitations of existing tools.
-  My research agenda has been focused on addressing these three issues with the
-intention of making ASP verification an accessible, routine task that is
-regularly performed alongside program development. In this vein, I have
-investigated alternative semantics for ASP based on translations into the logic
-of here-and-there and many-sorted first-order logic. These semantics promote a
-modular understanding of logic programs, bypass grounding, and enable us to use
-automated theorem provers to automatically verify properties of programs.
+##### **Mitigating Unintended Memorization with LoRA in Federated Learning for LLMs**
+2502.05087v1 by Thierry Bossy, Julien Vignoud, Tahseen Rabbani, Juan R. Troncoso Pastoriza, Martin Jaggi
 
-摘要：<paragraph>答案集程式設計 (ASP) 是知識表徵與推理領域中一個重要的邏輯程式設計範式。ASP 作為一種簡潔、人類可讀、宣告式的語言，是開發值得信賴的 (特別是人工智慧) 軟體系統的絕佳工具。然而，正式驗證 ASP 程式提供了一些獨特的挑戰，例如
-  1. 缺乏模組化 (規則的含義難以與封閉程式隔離定義)，
-  2. 基礎與求解語意 (規則的含義取決於程式基礎的輸入資料)，以及
-  3. 現有工具的限制。
-  我的研究議程一直專注於解決這三個問題，目的是讓 ASP 驗證成為一個可存取的、例行任務，並在程式開發過程中定期執行。在這個脈絡下，我研究了基於翻譯成此處和彼處邏輯以及多種排序一階邏輯的 ASP 替代語意。這些語意促進了邏輯程式的模組化理解，繞過基礎，並使我們能夠使用自動化定理證明器自動驗證程式的屬性。</paragraph>
+Federated learning (FL) is a popular paradigm for collaborative training
+which avoids direct data exposure between clients. However, data privacy issues
+still remain: FL-trained large language models are capable of memorizing and
+completing phrases and sentences contained in training data when given with
+their prefixes. Thus, it is possible for adversarial and honest-but-curious
+clients to recover training data of other participants simply through targeted
+prompting. In this work, we demonstrate that a popular and simple fine-tuning
+strategy, low-rank adaptation (LoRA), reduces memorization during FL up to a
+factor of 10. We study this effect by performing a medical question-answering
+fine-tuning task and injecting multiple replicas of out-of-distribution
+sensitive sequences drawn from an external clinical dataset. We observe a
+reduction in memorization for a wide variety of Llama 2 and 3 models, and find
+that LoRA can reduce memorization in centralized learning as well. Furthermore,
+we show that LoRA can be combined with other privacy-preserving techniques such
+as gradient clipping and Gaussian noising, secure aggregation, and Goldfish
+loss to further improve record-level privacy while maintaining performance.
 
-##### **Computational methods for Dynamic Answer Set Programming**
-2502.09228v1 by Susana Hahn
+摘要：聯邦學習 (FL) 是一種流行的協作訓練範例，可避免客戶端之間直接公開資料。然而，資料隱私問題仍然存在：經過 FL 訓練的大型語言模型能夠記憶並完成訓練資料中包含的片語和句子，只要給予其前綴即可。因此，對抗和誠實但好奇的客戶端有可能僅透過目標提示來恢復其他參與者的訓練資料。在這項工作中，我們證明了一種流行且簡單的微調策略，低秩適應 (LoRA)，可將 FL 期間的記憶減少多達 10 倍。我們透過執行醫學問答微調任務並注入從外部臨床資料集抽取的非分佈敏感序列的多次複製品來研究此效應。我們觀察到各種 Llama 2 和 3 模型的記憶力降低，並發現 LoRA 也能減少集中式學習中的記憶力。此外，我們展示 LoRA 可以與其他隱私保護技術結合使用，例如梯度裁剪和高斯雜訊、安全聚合和 Goldfish 損失，以進一步改善記錄級隱私，同時維持效能。
 
-In our daily lives and industrial settings, we often encounter dynamic
-problems that require reasoning over time and metric constraints. These include
-tasks such as scheduling, routing, and production sequencing. Dynamic logics
-have traditionally addressed these needs but often lack the flexibility and
-integration required for comprehensive problem modeling. This research aims to
-extend Answer Set Programming (ASP), a powerful declarative problem-solving
-approach, to handle dynamic domains effectively. By integrating concepts from
-dynamic, temporal, and metric logics into ASP, we seek to develop robust
-systems capable of modeling complex dynamic problems and performing efficient
-reasoning tasks, thereby enhancing ASPs applicability in industrial contexts.
+##### **MedMimic: Physician-Inspired Multimodal Fusion for Early Diagnosis of Fever of Unknown Origin**
+2502.04794v1 by Minrui Chen, Yi Zhou, Huidong Jiang, Yuhan Zhu, Guanjie Zou, Minqi Chen, Rong Tian, Hiroto Saigo
 
-摘要：在我們的日常生活和工業環境中，我們經常會遇到動態問題，需要隨著時間和公制約束進行推理。這些問題包括排程、路由和生產順序等任務。動態邏輯傳統上解決了這些需求，但通常缺乏全面問題建模所需的靈活性與整合性。本研究旨在擴展強大的宣告式問題解決方法「Answer Set Programming (ASP)」，以有效處理動態領域。透過將動態、時態和公制邏輯的概念整合到 ASP 中，我們尋求開發強健的系統，能夠建模複雜的動態問題並執行有效的推理任務，進而增強 ASP 在工業環境中的適用性。
+Fever of unknown origin FUO remains a diagnostic challenge. MedMimic is
+introduced as a multimodal framework inspired by real-world diagnostic
+processes. It uses pretrained models such as DINOv2, Vision Transformer, and
+ResNet-18 to convert high-dimensional 18F-FDG PET/CT imaging into
+low-dimensional, semantically meaningful features. A learnable
+self-attention-based fusion network then integrates these imaging features with
+clinical data for classification. Using 416 FUO patient cases from Sichuan
+University West China Hospital from 2017 to 2023, the multimodal fusion
+classification network MFCN achieved macro-AUROC scores ranging from 0.8654 to
+0.9291 across seven tasks, outperforming conventional machine learning and
+single-modality deep learning methods. Ablation studies and five-fold
+cross-validation further validated its effectiveness. By combining the
+strengths of pretrained large models and deep learning, MedMimic offers a
+promising solution for disease classification.
 
-##### **Generating Causally Compliant Counterfactual Explanations using ASP**
-2502.09226v1 by Sopam Dasgupta
+摘要：不明原因發燒 (FUO) 仍然是診斷上的挑戰。MedMimic 是一個多模式架構，靈感來自於真實世界的診斷過程。它使用預先訓練的模型，例如 DINOv2、視覺轉換器和 ResNet-18，將高維 18F-FDG PET/CT 影像轉換為低維、語義有意義的特徵。一個可學習的自注意力融合網路接著將這些影像特徵與臨床資料整合，用於分類。使用 2017 年至 2023 年四川大學華西醫院的 416 個 FUO 病患病例，多模式融合分類網路 MFCN 在七項任務中達到了 0.8654 到 0.9291 的巨觀 AUROC 分數，優於傳統機器學習和單一模式深度學習方法。消融研究和五倍交叉驗證進一步驗證了其有效性。MedMimic 結合了預先訓練的大模型和深度學習的優點，為疾病分類提供了一個有前景的解決方案。
 
-This research is focused on generating achievable counterfactual
-explanations. Given a negative outcome computed by a machine learning model or
-a decision system, the novel CoGS approach generates (i) a counterfactual
-solution that represents a positive outcome and (ii) a path that will take us
-from the negative outcome to the positive one, where each node in the path
-represents a change in an attribute (feature) value. CoGS computes paths that
-respect the causal constraints among features. Thus, the counterfactuals
-computed by CoGS are realistic. CoGS utilizes rule-based machine learning
-algorithms to model causal dependencies between features. The paper discusses
-the current status of the research and the preliminary results obtained.
+##### **MedGNN: Towards Multi-resolution Spatiotemporal Graph Learning for Medical Time Series Classification**
+2502.04515v1 by Wei Fan, Jingru Fei, Dingyu Guo, Kun Yi, Xiaozhuang Song, Haolong Xiang, Hangting Ye, Min Li
 
-摘要：本研究重點在於產生可實現的反事實解釋。給定由機器學習模型或決策系統計算出的負面結果，創新的 CoGS 方法會產生 (i) 代表正面結果的反事實解，以及 (ii) 一條將我們從負面結果帶到正面結果的途徑，其中途徑中的每個節點代表屬性 (特徵) 值的變化。CoGS 計算出符合特徵之間因果關係的途徑。因此，CoGS 計算出的反事實是切合實際的。CoGS 利用基於規則的機器學習演算法來建模特徵之間的因果關係。本文探討了研究的現況和獲得的初步結果。
+Medical time series has been playing a vital role in real-world healthcare
+systems as valuable information in monitoring health conditions of patients.
+Accurate classification for medical time series, e.g., Electrocardiography
+(ECG) signals, can help for early detection and diagnosis. Traditional methods
+towards medical time series classification rely on handcrafted feature
+extraction and statistical methods; with the recent advancement of artificial
+intelligence, the machine learning and deep learning methods have become more
+popular. However, existing methods often fail to fully model the complex
+spatial dynamics under different scales, which ignore the dynamic
+multi-resolution spatial and temporal joint inter-dependencies. Moreover, they
+are less likely to consider the special baseline wander problem as well as the
+multi-view characteristics of medical time series, which largely hinders their
+prediction performance. To address these limitations, we propose a
+Multi-resolution Spatiotemporal Graph Learning framework, MedGNN, for medical
+time series classification. Specifically, we first propose to construct
+multi-resolution adaptive graph structures to learn dynamic multi-scale
+embeddings. Then, to address the baseline wander problem, we propose Difference
+Attention Networks to operate self-attention mechanisms on the finite
+difference for temporal modeling. Moreover, to learn the multi-view
+characteristics, we utilize the Frequency Convolution Networks to capture
+complementary information of medical time series from the frequency domain. In
+addition, we introduce the Multi-resolution Graph Transformer architecture to
+model the dynamic dependencies and fuse the information from different
+resolutions. Finally, we have conducted extensive experiments on multiple
+medical real-world datasets that demonstrate the superior performance of our
+method. Our Code is available.
 
-##### **Order-Sorted Intensional Logic: Expressing Subtyping Polymorphism with Typing Assertions and Quantification over Concepts**
-2502.09224v1 by Đorđe Marković, Marc Denecker
+摘要：<paragraph>醫療時間序列在真實世界的醫療保健系統中扮演著至關重要的角色，作為監控患者健康狀況的寶貴資訊。
+準確分類醫療時間序列，例如心電圖 (ECG) 訊號，有助於早期偵測和診斷。傳統的醫療時間序列分類方法仰賴手工特徵萃取和統計方法；隨著人工智慧的最新進展，機器學習和深度學習方法變得更為普及。然而，現有方法通常無法完全建模不同尺度下的複雜空間動態，忽略了動態多解析度空間和時間關節相互依賴性。此外，它們不太可能考慮特殊的基線漂移問題以及醫療時間序列的多視角特性，這在很大程度上阻礙了它們的預測效能。為了解決這些限制，我們提出了一個多解析度時空圖形學習架構 MedGNN，用於醫療時間序列分類。具體來說，我們首先提出構建多解析度自適應圖形結構以學習動態多尺度嵌入。然後，為了解決基線漂移問題，我們提出差分注意力網路，對時間建模的有限差分運算自注意力機制。此外，為了學習多視角特性，我們利用頻率卷積網路從頻域擷取醫療時間序列的互補資訊。此外，我們引入了多解析度圖形Transformer架構來建模動態依賴性，並融合來自不同解析度的資訊。最後，我們對多個醫療真實世界資料集進行了廣泛的實驗，證明了我們方法的優異效能。我們的程式碼已公開。</paragraph>
 
-Subtyping, also known as subtype polymorphism, is a concept extensively
-studied in programming language theory, delineating the substitutability
-relation among datatypes. This property ensures that programs designed for
-supertype objects remain compatible with their subtypes.
-  In this paper, we explore the capability of order-sorted logic for utilizing
-these ideas in the context of Knowledge Representation. We recognize two
-fundamental limitations: First, the inability of this logic to address the
-concept rather than the value of non-logical symbols, and second, the lack of
-language constructs for constraining the type of terms. Consequently, we
-propose guarded order-sorted intensional logic, where guards are language
-constructs for annotating typing information and intensional logic provides
-support for quantification over concepts.
+##### **Integrating Generative Artificial Intelligence in ADRD: A Framework for Streamlining Diagnosis and Care in Neurodegenerative Diseases**
+2502.06842v1 by Andrew G. Breithaupt, Alice Tang, Bruce L. Miller, Pedro Pinheiro-Chagas
 
-摘要：子類型化，也稱為子類型多態性，是一個在程式語言理論中廣泛研究的概念，用於描述資料類型之間的可替換關係。此特性可確保為超類型物件設計的程式與其子類型相容。
-在本文中，我們探討了使用排序邏輯在知識表徵中運用這些想法的能力。我們發現了兩個基本限制：首先，此邏輯無法處理非邏輯符號的概念而非值，其次，缺乏約束項類型的語言結構。因此，我們提出了受保護的排序邏輯，其中保護是註解類型資訊的語言結構，而內涵邏輯則支援對概念量化。
+Healthcare systems are struggling to meet the growing demand for neurological
+care, with challenges particularly acute in Alzheimer's disease and related
+dementias (ADRD). While artificial intelligence research has often focused on
+identifying patterns beyond human perception, implementing such predictive
+capabilities remains challenging as clinicians cannot readily verify insights
+they cannot themselves detect. We propose that large language models (LLMs)
+offer more immediately practical applications by enhancing clinicians'
+capabilities in three critical areas: comprehensive data collection,
+interpretation of complex clinical information, and timely application of
+relevant medical knowledge. These challenges stem from limited time for proper
+diagnosis, growing data complexity, and an overwhelming volume of medical
+literature that exceeds any clinician's capacity to fully master. We present a
+framework for responsible AI integration that leverages LLMs' ability to
+communicate effectively with both patients and providers while maintaining
+human oversight. This approach prioritizes standardized, high-quality data
+collection to enable a system that learns from every patient encounter while
+incorporating the latest clinical evidence, continuously improving care
+delivery. We begin to address implementation challenges and initiate important
+discussions around ethical considerations and governance needs. While developed
+for ADRD, this roadmap provides principles for responsible AI integration
+across neurology and other medical specialties, with potential to improve
+diagnostic accuracy, reduce care disparities, and advance clinical knowledge
+through a learning healthcare system.
 
-##### **ASP-driven User-interaction with Clinguin**
-2502.09222v1 by Alexander Beiser, Susana Hahn, Torsten Schaub
+摘要：醫療體系正努力滿足日益增長的神經照護需求，其中阿茲海默症和相關失智症 (ADRD) 的挑戰特別嚴重。雖然人工智慧研究通常專注於識別人類感知之外的模式，但實作此類預測功能仍然具有挑戰性，因為臨床醫生無法輕易驗證他們自己無法偵測到的見解。我們提出大型語言模型 (LLM) 可透過提升臨床醫生在三個關鍵領域的能力，提供更直接且實用的應用：全面的資料收集、複雜臨床資訊的詮釋，以及適時應用相關的醫學知識。這些挑戰源自於適當診斷時間有限、資料複雜性日益增加，以及龐大的醫學文獻量超過任何臨床醫生所能完全掌握的容量。我們提出了一個負責任的 AI 整合架構，利用 LLM 與患者和提供者有效溝通的能力，同時維持人為監督。此方法優先考慮標準化、高品質的資料收集，以建立一個從每次患者接觸中學習的系統，同時納入最新的臨床證據，持續改善照護提供。我們開始探討實作挑戰，並展開關於倫理考量和治理需求的重要討論。儘管是為 ADRD 所開發，此藍圖提供了神經科和其他醫學專科負責任 AI 整合的原則，有潛力透過學習型醫療保健系統改善診斷準確性、減少照護差異，並推進臨床知識。
 
-We present clinguin, a system for ASP-driven user interface design. Clinguin
-streamlines the development of user interfaces for ASP developers by letting
-them build interactive prototypes directly in ASP, eliminating the need for
-separate frontend languages. To this end, clinguin uses a few dedicated
-predicates to define user interfaces and the treatment of user-triggered
-events. This simple design greatly facilitates the specification of user
-interactions with an ASP system, in our case clingo.
+##### **Primary Care Diagnoses as a Reliable Predictor for Orthopedic Surgical Interventions**
+2502.04423v1 by Khushboo Verma, Alan Michels, Ergi Gumusaneli, Shilpa Chitnis, Smita Sinha Kumar, Christopher Thompson, Lena Esmail, Guruprasath Srinivasan, Chandini Panchada, Sushovan Guha, Satwant Kumar
 
-摘要：我們提出 clinguin，一個用於 ASP 驅動使用者介面設計的系統。Clinguin 透過讓 ASP 開發人員直接在 ASP 中建立互動式原型，簡化了使用者介面的開發，消除了對個別前端語言的需求。為此，clinguin 使用一些專用的謂詞來定義使用者介面和處理使用者觸發的事件。這個簡單的設計極大地簡化了使用者與 ASP 系統互動的規範，在我們的案例中是 clingo。
+Referral workflow inefficiencies, including misaligned referrals and delays,
+contribute to suboptimal patient outcomes and higher healthcare costs. In this
+study, we investigated the possibility of predicting procedural needs based on
+primary care diagnostic entries, thereby improving referral accuracy,
+streamlining workflows, and providing better care to patients. A de-identified
+dataset of 2,086 orthopedic referrals from the University of Texas Health at
+Tyler was analyzed using machine learning models built on Base General
+Embeddings (BGE) for semantic extraction. To ensure real-world applicability,
+noise tolerance experiments were conducted, and oversampling techniques were
+employed to mitigate class imbalance. The selected optimum and parsimonious
+embedding model demonstrated high predictive accuracy (ROC-AUC: 0.874, Matthews
+Correlation Coefficient (MCC): 0.540), effectively distinguishing patients
+requiring surgical intervention. Dimensionality reduction techniques confirmed
+the model's ability to capture meaningful clinical relationships. A threshold
+sensitivity analysis identified an optimal decision threshold (0.30) to balance
+precision and recall, maximizing referral efficiency. In the predictive
+modeling analysis, the procedure rate increased from 11.27% to an optimal
+60.1%, representing a 433% improvement with significant implications for
+operational efficiency and healthcare revenue.
+  The results of our study demonstrate that referral optimization can enhance
+primary and surgical care integration. Through this approach, precise and
+timely predictions of procedural requirements can be made, thereby minimizing
+delays, improving surgical planning, and reducing administrative burdens. In
+addition, the findings highlight the potential of clinical decision support as
+a scalable solution for improving patient outcomes and the efficiency of the
+healthcare system.
 
-##### **Pearce's Characterisation in an Epistemic Domain**
-2502.09221v1 by Ezgi Iraz Su
+摘要：轉診流程效率低落，包括轉診不當和延誤，
+導致次優的患者結果和更高的醫療保健成本。在這
+項研究中，我們探討了根據初級保健診斷條目預測程序需求的可能性，從而提高轉診準確性，
+簡化工作流程，並為患者提供更好的照護。一個去識別化
+德克薩斯大學健康中心的 2,086 個骨科轉診的資料集
+泰勒使用建立在基本通用
+語義提取的嵌入 (BGE) 上的機器學習模型進行分析。為了確保現實世界的適用性，
+進行了噪聲容忍度實驗，並採用了過採樣技術來減輕類別不平衡。所選的最佳和簡約
+嵌入模型展示了高預測準確度 (ROC-AUC：0.874，馬修斯
+相關系數 (MCC)：0.540)，有效區分需要手術干預的患者。降維
+技術證實了模型捕捉有意義的臨床關係的能力。閾值
+敏感性分析確定了一個最佳決策閾值 (0.30) 來平衡
+精確度和召回率，最大化轉診效率。在預測中
+建模分析中，程序率從 11.27% 增加到最佳的
+60.1%，代表 433% 的改進，對運營效率和醫療保健收入具有重大影響。
+我們研究的結果表明，轉診優化可以增強
+初級和外科護理整合。通過這種方法，可以對程序需求進行準確及時的預測，從而最大程度地減少
+延誤，改善手術計劃，並減輕行政負擔。此外，研究結果強調了臨床決策支持作為
+一個可擴展的解決方案的潛力，用於改善患者結果和醫療保健系統的效率。
 
-Answer-set programming (ASP) is a successful problem-solving approach in
-logic-based AI. In ASP, problems are represented as declarative logic programs,
-and solutions are identified through their answer sets. Equilibrium logic (EL)
-is a general-purpose nonmonotonic reasoning formalism, based on a monotonic
-logic called here-and-there logic. EL was basically proposed by Pearce as a
-foundational framework of ASP. Epistemic specifications (ES) are extensions of
-ASP-programs with subjective literals. These new modal constructs in the
-ASP-language make it possible to check whether a regular literal of ASP is true
-in every (or some) answer-set of a program. ES-programs are interpreted by
-world-views, which are essentially collections of answer-sets. (Reflexive)
-autoepistemic logic is a nonmonotonic formalism, modeling self-belief
-(knowledge) of ideally rational agents. A relatively new semantics for ES is
-based on a combination of EL and (reflexive) autoepistemic logic. In this
-paper, we first propose an overarching framework in the epistemic ASP domain.
-We then establish a correspondence between existing (reflexive) (auto)epistemic
-equilibrium logics and our easily-adaptable comprehensive framework, building
-on Pearce's characterisation of answer-sets as equilibrium models. We achieve
-this by extending Ferraris' work on answer sets for propositional theories to
-the epistemic case and reveal the relationship between some ES-semantic
-proposals.
+##### **Automatic quantification of breast cancer biomarkers from multiple 18F-FDG PET image segmentation**
+2502.04083v1 by Tewele W. Tareke, Neree Payan, Alexandre Cochet, Laurent Arnould, Benoit Presles, Jean-Marc Vrigneaud, Fabrice Meriaudeau, Alain Lalande
 
-摘要：<paragraph>答案集程式設計（ASP）是基於邏輯的人工智慧中一種成功的問題解決方法。在 ASP 中，問題表示為宣告式邏輯程式，並透過其答案集來找出解答。平衡邏輯（EL）是一種通用的非單調推理形式主義，基於一種稱為此處和彼處邏輯的單調邏輯。EL 基本是由 Pearce 作為 ASP 的基礎架構所提出。知識規範（ES）是 ASP 程式與主觀文字的延伸。ASP 語言中的這些新模態建構使得可以檢查 ASP 的常規文字是否在程式的每個（或某些）答案集中為真。ES 程式由世界觀來詮釋，其本質上是答案集的集合。（反身）自認識邏輯是一種非單調形式主義，用來建模理想理性主體的自信念（知識）。ES 的一種相對新的語意是基於 EL 和（反身）自認識邏輯的組合。在本文中，我們首先提出一個涵蓋知識 ASP 領域的架構。然後，我們建立現有（反身）（自）認識平衡邏輯與我們容易適應的綜合架構之間的對應關係，建立在 Pearce 將答案集描述為平衡模型的特性之上。我們透過將 Ferraris 在命題理論的答案集上的工作延伸到知識案例，並揭示一些 ES 語義提案之間的關係來達成這一點。</paragraph>
+Neoadjuvant chemotherapy (NAC) has become a standard clinical practice for
+tumor downsizing in breast cancer with 18F-FDG Positron Emission Tomography
+(PET). Our work aims to leverage PET imaging for the segmentation of breast
+lesions. The focus is on developing an automated system that accurately
+segments primary tumor regions and extracts key biomarkers from these areas to
+provide insights into the evolution of breast cancer following the first course
+of NAC. 243 baseline 18F-FDG PET scans (PET_Bl) and 180 follow-up 18F-FDG PET
+scans (PET_Fu) were acquired before and after the first course of NAC,
+respectively. Firstly, a deep learning-based breast tumor segmentation method
+was developed. The optimal baseline model (model trained on baseline exams) was
+fine-tuned on 15 follow-up exams and adapted using active learning to segment
+tumor areas in PET_Fu. The pipeline computes biomarkers such as maximum
+standardized uptake value (SUVmax), metabolic tumor volume (MTV), and total
+lesion glycolysis (TLG) to evaluate tumor evolution between PET_Fu and PET_Bl.
+Quality control measures were employed to exclude aberrant outliers. The nnUNet
+deep learning model outperformed in tumor segmentation on PET_Bl, achieved a
+Dice similarity coefficient (DSC) of 0.89 and a Hausdorff distance (HD) of 3.52
+mm. After fine-tuning, the model demonstrated a DSC of 0.78 and a HD of 4.95 mm
+on PET_Fu exams. Biomarkers analysis revealed very strong correlations whatever
+the biomarker between manually segmented and automatically predicted regions.
+The significant average decrease of SUVmax, MTV and TLG were 5.22, 11.79 cm3
+and 19.23 cm3, respectively. The presented approach demonstrates an automated
+system for breast tumor segmentation from 18F-FDG PET. Thanks to the extracted
+biomarkers, our method enables the automatic assessment of cancer progression.
 
-##### **Graphical Conditions for the Existence, Unicity and Number of Regular Models**
-2502.09220v1 by Van-Giang Trinh, Belaid Benhamou, Sylvain Soliman, François Fages
+摘要：新辅助化疗 (NAC) 已成为乳腺癌中采用 18F-FDG 正电子发射断层扫描 (PET) 进行肿瘤缩小的标准临床实践。我们的工作旨在利用 PET 影像分割乳腺病变。重点在于开发一个自动系统，该系统可以准确分割原发性肿瘤区域并从这些区域提取关键生物标记，以深入了解乳腺癌在第一疗程 NAC 后的演变。分别在第一疗程 NAC 之前和之后采集了 243 例基线 18F-FDG PET 扫描 (PET_Bl) 和 180 例随访 18F-FDG PET 扫描 (PET_Fu)。首先，开发了一种基于深度学习的乳腺肿瘤分割方法。对 15 例随访检查对最优基线模型（在基线检查中训练的模型）进行了微调，并使用主动学习对 PET_Fu 中的肿瘤区域进行了分割。该管道计算诸如最大标准摄取值 (SUVmax)、代谢肿瘤体积 (MTV) 和总病灶糖酵解 (TLG) 等生物标记，以评估 PET_Fu 和 PET_Bl 之间的肿瘤演变。采用质量控制措施来排除异常值。nnUNet 深度学习模型在 PET_Bl 上的肿瘤分割方面表现出色，达到 0.89 的 Dice 相似性系数 (DSC) 和 3.52 毫米的 Hausdorff 距离 (HD)。微调后，该模型在 PET_Fu 检查中显示出 0.78 的 DSC 和 4.95 毫米的 HD。无论手动分割区域和自动预测区域之间的生物标记如何，生物标记分析都显示出非常强的相关性。SUVmax、MTV 和 TLG 的平均显着下降分别为 5.22、11.79 cm3 和 19.23 cm3。所提出的方法展示了一个用于从 18F-FDG PET 分割乳腺肿瘤的自动化系统。由于提取了生物标记，我们的方法能够自动评估癌症进展。
 
-The regular models of a normal logic program are a particular type of partial
-(i.e. 3-valued) models which correspond to stable partial models with minimal
-undefinedness. In this paper, we explore graphical conditions on the dependency
-graph of a finite ground normal logic program to analyze the existence, unicity
-and number of regular models for the program. We show three main results: 1) a
-necessary condition for the existence of non-trivial (i.e. non-2-valued)
-regular models, 2) a sufficient condition for the unicity of regular models,
-and 3) two upper bounds for the number of regular models based on positive
-feedback vertex sets. The first two conditions generalize the finite cases of
-the two existing results obtained by You and Yuan (1994) for normal logic
-programs with well-founded stratification. The third result is also new to the
-best of our knowledge. Key to our proofs is a connection that we establish
-between finite ground normal logic programs and Boolean network theory.
+##### **Generalize Drug Response Prediction by Latent Independent Projection for Asymmetric Constrained Domain Generalization**
+2502.04034v1 by Ran Song, Yinpu Bai, Hui Liu
+
+The accurate prediction of drug responses remains a formidable challenge,
+particularly at the single-cell level and in clinical treatment contexts. Some
+studies employ transfer learning techniques to predict drug responses in
+individual cells and patients, but they require access to target-domain data
+during training, which is often unavailable or only obtainable in future. In
+this study, we propose a novel domain generalization framework, termed
+panCancerDR, to address this challenge. We conceptualize each cancer type as a
+distinct source domain, with its cell lines serving as domain-specific samples.
+Our primary objective is to extract domain-invariant features from the
+expression profiles of cell lines across diverse cancer types, thereby
+generalize the predictive capacity to out-of-distribution samples. To enhance
+robustness, we introduce a latent independence projection (LIP) module that
+encourages the encoder to extract informative yet non-redundant features. Also,
+we propose an asymmetric adaptive clustering constraint, which clusters
+drug-sensitive samples into a compact group while drives resistant samples
+dispersed across separate clusters in the latent space. Our empirical
+experiments demonstrate that panCancerDR effectively learns task-relevant
+features from diverse source domains, and achieves accurate predictions of drug
+response for unseen cancer type during training. Furthermore, when evaluated on
+single-cell and patient-level prediction tasks, our model-trained solely on in
+vitro cell line data without access to target-domain information-consistently
+outperforms and matched current state-of-the-art methods. These findings
+highlights the potential of our method for real-world clinical applications.
 
-摘要：正规模型的常规模型是一种特殊类型的局部模型（即 3 值）模型，它对应于具有最小未定义性的稳定局部模型。在本文中，我们探索了有限接地正规逻辑程序的依赖图上的图形条件，以分析程序的正规模型的存在性、唯一性和数量。我们展示了三个主要结果：1) 非平凡（即非 2 值）正规模型存在的必要条件，2) 正规模型唯一性的充分条件，3) 基于正反馈顶点集的正规模型数目的两个上限。前两个条件概括了 You 和 Yuan (1994) 为具有良好基础分层的正规逻辑程序获得的两个现有结果的有限情况。据我们所知，第三个结果也是新的。我们证明的关键是我们在有限接地正规逻辑程序和布尔网络理论之间建立的联系。
+摘要：<paragraph>準確預測藥物反應仍然是一項艱鉅的挑戰，特別是在單細胞層級和臨床治療背景中。一些研究採用遷移學習技術來預測個別細胞和患者的藥物反應，但它們需要在訓練期間存取目標網域資料，而這些資料通常無法取得，或只能在未來取得。在這項研究中，我們提出一個新穎的網域概化架構，稱為 panCancerDR，以應對這項挑戰。我們將每種類型的癌症概念化為一個不同的來源網域，其細胞株作為特定網域的樣本。我們的首要目標是從不同癌症類型的細胞株表現特徵中萃取網域不變特徵，從而將預測能力概化到分布外的樣本。為了增強穩健性，我們引入一個潛在獨立投影 (LIP) 模組，鼓勵編碼器萃取有資訊但非冗餘的特徵。此外，我們提出一個非對稱自適應聚類約束，將對藥物敏感的樣本聚類到一個緊湊的群組中，同時驅動抗藥性樣本分散在潛在空間中的不同群組中。我們的實證實驗證明，panCancerDR 有效地從不同的來源網域學習與任務相關的特徵，並在訓練期間對未見的癌症類型實現準確的藥物反應預測。此外，當在單細胞和患者層級預測任務中進行評估時，我們的模型僅在體外細胞株資料上訓練，而沒有存取目標網域資訊，始終優於並符合當前的最新方法。這些發現突顯了我們的方法在實際臨床應用中的潛力。</paragraph>
 
-##### **Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**
-2502.09218v1 by Flavio Bertini, Alessandro Dal Palù, Federica Zaglio, Francesco Fabiano, Andrea Formisano
+##### **MedRAG: Enhancing Retrieval-augmented Generation with Knowledge Graph-Elicited Reasoning for Healthcare Copilot**
+2502.04413v1 by Xuejiao Zhao, Siyan Liu, Su-Yin Yang, Chunyan Miao
 
-This paper presents a complete explainable system that interprets a set of
-data, abstracts the underlying features and describes them in a natural
-language of choice. The system relies on two crucial stages: (i) identifying
-emerging properties from data and transforming them into abstract concepts, and
-(ii) converting these concepts into natural language. Despite the impressive
-natural language generation capabilities demonstrated by Large Language Models,
-their statistical nature and the intricacy of their internal mechanism still
-force us to employ these techniques as black boxes, forgoing trustworthiness.
-Developing an explainable pipeline for data interpretation would allow
-facilitating its use in safety-critical environments like processing medical
-information and allowing non-experts and visually impaired people to access
-narrated information. To this end, we believe that the fields of knowledge
-representation and automated reasoning research could present a valid
-alternative. Expanding on prior research that tackled the first stage (i), we
-focus on the second stage, named Concept2Text. Being explainable, data
-translation is easily modeled through logic-based rules, once again emphasizing
-the role of declarative programming in achieving AI explainability. This paper
-explores a Prolog/CLP-based rewriting system to interpret concepts-articulated
-in terms of classes and relations, plus common knowledge-derived from a generic
-ontology, generating natural language text. Its main features include
-hierarchical tree rewritings, modular multilingual generation, support for
-equivalent variants across semantic, grammar, and lexical levels, and a
-transparent rule-based system. We outline the architecture and demonstrate its
-flexibility through some examples capable of generating numerous diverse and
-equivalent rewritings based on the input concept.
+Retrieval-augmented generation (RAG) is a well-suited technique for
+retrieving privacy-sensitive Electronic Health Records (EHR). It can serve as a
+key module of the healthcare copilot, helping reduce misdiagnosis for
+healthcare practitioners and patients. However, the diagnostic accuracy and
+specificity of existing heuristic-based RAG models used in the medical domain
+are inadequate, particularly for diseases with similar manifestations. This
+paper proposes MedRAG, a RAG model enhanced by knowledge graph (KG)-elicited
+reasoning for the medical domain that retrieves diagnosis and treatment
+recommendations based on manifestations. MedRAG systematically constructs a
+comprehensive four-tier hierarchical diagnostic KG encompassing critical
+diagnostic differences of various diseases. These differences are dynamically
+integrated with similar EHRs retrieved from an EHR database, and reasoned
+within a large language model. This process enables more accurate and specific
+decision support, while also proactively providing follow-up questions to
+enhance personalized medical decision-making. MedRAG is evaluated on both a
+public dataset DDXPlus and a private chronic pain diagnostic dataset (CPDD)
+collected from Tan Tock Seng Hospital, and its performance is compared against
+various existing RAG methods. Experimental results show that, leveraging the
+information integration and relational abilities of the KG, our MedRAG provides
+more specific diagnostic insights and outperforms state-of-the-art models in
+reducing misdiagnosis rates. Our code will be available at
+https://github.com/SNOWTEAM2023/MedRAG
 
-摘要：<paragraph>這篇論文提出了一個完整的可解釋系統，它可以解釋一組資料，抽象出基礎特徵，並以選擇的自然語言描述它們。系統依賴兩個關鍵階段：(i) 從資料中識別新興屬性，並將它們轉換為抽象概念，以及 (ii) 將這些概念轉換為自然語言。儘管大型語言模型展示了令人印象深刻的自然語言生成能力，但它們的統計性質和內部機制的複雜性仍然迫使我們將這些技術用作黑盒子，放棄可信度。開發一個可解釋的資料解釋管道將有助於促進在安全關鍵環境中使用它，例如處理醫療資訊，並允許非專家和視障人士存取敘述資訊。為此，我們相信知識表示和自動推理研究領域可以提出一個有效的替代方案。在擴展解決第一階段 (i) 的先前研究的基礎上，我們專注於第二階段，稱為 Concept2Text。由於具有可解釋性，資料翻譯很容易透過基於邏輯的規則建模，再次強調宣告式程式設計在實現 AI 可解釋性中的作用。本文探討了一個基於 Prolog/CLP 的重寫系統，以解釋概念，這些概念以類別和關係的形式表達，再加上從通用本体衍生的常識，產生自然語言文字。它的主要特點包括階層樹重寫、模組化多語言生成、支援語義、語法和詞彙層面的等效變體，以及一個透明的基於規則的系統。我們概述了架構，並透過一些範例展示了它的靈活性，這些範例能夠根據輸入概念生成許多不同的等效重寫。</paragraph>
+摘要：檢索增強生成 (RAG) 是一種適用於檢索隱私敏感的電子健康記錄 (EHR) 的技術。它可以作為醫療保健副駕駛的一個關鍵模組，協助減少醫療保健從業人員和患者的誤診。然而，在醫療領域中使用的現有基於啟發法的 RAG 模型的診斷準確性和特異性不足，特別是對於具有類似表現的疾病。本文提出 MedRAG，一種由知識圖譜 (KG) 引發的推理增強的 RAG 模型，用於醫療領域，它根據表現檢索診斷和治療建議。MedRAG 系統性地構建了一個全面的四層階層式診斷 KG，涵蓋各種疾病的關鍵診斷差異。這些差異與從 EHR 資料庫中檢索到的類似 EHR 動態整合，並在大型語言模型中進行推理。這個過程可以實現更準確和具體的決策支援，同時主動提供後續問題，以增強個人化醫療決策制定。MedRAG 在公共資料集 DDXPlus 和從陳篤生醫院收集的私人慢性疼痛診斷資料集 (CPDD) 上進行評估，並將其效能與各種現有 RAG 方法進行比較。實驗結果顯示，利用 KG 的資訊整合和關係能力，我們的 MedRAG 提供了更具體的診斷見解，並在降低誤診率方面優於最先進的模型。我們的程式碼將在 https://github.com/SNOWTEAM2023/MedRAG 提供
 
-##### **Mind the Gaps: Logical English, Prolog, and Multi-agent Systems for Autonomous Vehicles**
-2502.09216v1 by Galileo Sartor, Adam Wyner, Giuseppe Contissa
+##### **Transforming Multimodal Models into Action Models for Radiotherapy**
+2502.04408v1 by Matteo Ferrante, Alessandra Carosi, Rolando Maria D Angelillo, Nicola Toschi
 
-In this paper, we present a modular system for representing and reasoning
-with legal aspects of traffic rules for autonomous vehicles. We focus on a
-subset of the United Kingdom's Highway Code (HC) related to junctions. As human
-drivers and automated vehicles (AVs) will interact on the roads, especially in
-urban environments, we claim that an accessible, unitary, high-level
-computational model should exist and be applicable to both users. Autonomous
-vehicles introduce a shift in liability that should not bring disadvantages or
-increased burden on human drivers. We develop a system "in silico" of the
-model. The proposed system is built of three main components: a natural
-language interface, using Logical English, which encodes the rules; an internal
-representation of the rules in Prolog; and an multi-agent-based simulation
-environment, built in NetLogo. The three components interact: Logical English
-is translated into and out of Prolog (along with some support code); Prolog and
-NetLogo interface via predicates. Such a modular approach enables the different
-components to carry different "burdens" in the overall system; it also allows
-swapping of modules. Given NetLogo, we can visualize the effect of the modeled
-rules as well as validate the system with a simple dynamic running scenario.
-Designated agents monitor the behaviour of the vehicles for compliance and
-record potential violations where they occur. The information on potential
-violations is then utilized by Validators, to determine whether the violation
-is punishable, differentiating between exceptions and cases.
+Radiotherapy is a crucial cancer treatment that demands precise planning to
+balance tumor eradication and preservation of healthy tissue. Traditional
+treatment planning (TP) is iterative, time-consuming, and reliant on human
+expertise, which can potentially introduce variability and inefficiency. We
+propose a novel framework to transform a large multimodal foundation model
+(MLM) into an action model for TP using a few-shot reinforcement learning (RL)
+approach. Our method leverages the MLM's extensive pre-existing knowledge of
+physics, radiation, and anatomy, enhancing it through a few-shot learning
+process. This allows the model to iteratively improve treatment plans using a
+Monte Carlo simulator. Our results demonstrate that this method outperforms
+conventional RL-based approaches in both quality and efficiency, achieving
+higher reward scores and more optimal dose distributions in simulations on
+prostate cancer data. This proof-of-concept suggests a promising direction for
+integrating advanced AI models into clinical workflows, potentially enhancing
+the speed, quality, and standardization of radiotherapy treatment planning.
 
-摘要：<paragraph>在本文中，我們提出了一個模組化系統，用於表示和推理自動駕駛車輛交通規則的法律層面。我們專注於與路口相關的英國公路法規 (HC) 子集。由於人類駕駛和自動駕駛車輛 (AV) 將在道路上互動，尤其是在城市環境中，我們主張應存在一個可存取、統一、高階的運算模型，並適用於這兩種使用者。自動駕駛車輛引入了責任轉移，不應給人類駕駛帶來劣勢或增加負擔。我們開發了一個模型的「電腦模擬」系統。所提出的系統由三個主要組成部分建構而成：使用邏輯英語的自然語言介面，用於編碼規則；使用 Prolog 的規則內部表示；以及使用 NetLogo 建構的多主體模擬環境。這三個組成部分會進行互動：邏輯英語會翻譯成 Prolog（以及一些支援程式碼），再從 Prolog 翻譯回來；Prolog 和 NetLogo 會透過謂詞進行介面。這種模組化方法讓不同的組成部分可以在整體系統中承擔不同的「負擔」；它也允許模組交換。有了 NetLogo，我們可以視覺化已建模規則的效果，並使用一個簡單的動態執行範例來驗證系統。指定的代理會監控車輛的行為，以確保遵守規定，並記錄發生的潛在違規行為。然後，驗證者會利用潛在違規行為的資訊，來確定違規行為是否應受懲罰，並區分例外情況和案例。</paragraph>
+摘要：放射治療是一種重要的癌症治療方法，需要精確的規劃來平衡腫瘤根除和健康組織的保留。傳統的治療規劃（TP）是反覆的、耗時的，並且依賴於人為專業知識，這可能會引入變異性和低效率。我們提出了一個新穎的框架，使用少次強化學習 (RL) 方法將大型多模態基礎模型 (MLM) 轉換為 TP 的動作模型。我們的模型利用了 MLM 對物理、輻射和解剖學的廣泛預先存在的知識，並通過少次學習過程對其進行增強。這允許模型使用蒙特卡羅模擬器反覆改進治療計劃。我們的結果表明，這種方法在質量和效率方面都優於基於傳統 RL 的方法，在對前列腺癌數據進行模擬時，獲得了更高的獎勵分數和更優化的劑量分佈。這個概念驗證表明了一個有希望的方向，即將先進的人工智慧模型整合到臨床工作流程中，從而有可能提高放射治療計劃的速度、質量和標準化。
 
-##### **Architecture for Simulating Behavior Mode Changes in Norm-Aware Autonomous Agents**
-2502.09215v1 by Sean Glaze, Daniela Inclezan
+##### **Online Location Planning for AI-Defined Vehicles: Optimizing Joint Tasks of Order Serving and Spatio-Temporal Heterogeneous Model Fine-Tuning**
+2502.04399v1 by Bokeng Zheng, Bo Rao, Tianxiang Zhu, Chee Wei Tan, Jingpu Duan, Zhi Zhou, Xu Chen, Xiaoxi Zhang
 
-This paper presents an architecture for simulating the actions of a
-norm-aware intelligent agent whose behavior with respect to norm compliance is
-set, and can later be changed, by a human controller. Updating an agent's
-behavior mode from a norm-abiding to a riskier one may be relevant when the
-agent is involved in time-sensitive rescue operations, for example. We base our
-work on the Authorization and Obligation Policy Language AOPL designed by
-Gelfond and Lobo for the specification of norms. We introduce an architecture
-and a prototype software system that can be used to simulate an agent's plans
-under different behavior modes that can later be changed by the controller. We
-envision such software to be useful to policy makers, as they can more readily
-understand how agents may act in certain situations based on the agents'
-attitudes towards norm-compliance. Policy makers may then refine their policies
-if simulations show unwanted consequences.
+Advances in artificial intelligence (AI) including foundation models (FMs),
+are increasingly transforming human society, with smart city driving the
+evolution of urban living.Meanwhile, vehicle crowdsensing (VCS) has emerged as
+a key enabler, leveraging vehicles' mobility and sensor-equipped capabilities.
+In particular, ride-hailing vehicles can effectively facilitate flexible data
+collection and contribute towards urban intelligence, despite resource
+limitations. Therefore, this work explores a promising scenario, where
+edge-assisted vehicles perform joint tasks of order serving and the emerging
+foundation model fine-tuning using various urban data. However, integrating the
+VCS AI task with the conventional order serving task is challenging, due to
+their inconsistent spatio-temporal characteristics: (i) The distributions of
+ride orders and data point-of-interests (PoIs) may not coincide in geography,
+both following a priori unknown patterns; (ii) they have distinct forms of
+temporal effects, i.e., prolonged waiting makes orders become instantly invalid
+while data with increased staleness gradually reduces its utility for model
+fine-tuning.To overcome these obstacles, we propose an online framework based
+on multi-agent reinforcement learning (MARL) with careful augmentation. A new
+quality-of-service (QoS) metric is designed to characterize and balance the
+utility of the two joint tasks, under the effects of varying data volumes and
+staleness. We also integrate graph neural networks (GNNs) with MARL to enhance
+state representations, capturing graph-structured, time-varying dependencies
+among vehicles and across locations. Extensive experiments on our testbed
+simulator, utilizing various real-world foundation model fine-tuning tasks and
+the New York City Taxi ride order dataset, demonstrate the advantage of our
+proposed method.
 
-摘要：本文提出了一個架構，用於模擬一個規範感知智能代理的行為，其行為遵守規範，並可以由人類控制者設定，並可以在稍後進行更改。當代理參與時間敏感的救援行動時，將代理的行為模式從遵守規範更新為更冒險的行為模式可能是相關的。我們的工作基於 Gelfond 和 Lobo 為規範規範設計的授權和義務政策語言 AOPL。我們引入了一個架構和一個原型軟體系統，可用於模擬代理在不同行為模式下的計畫，這些行為模式稍後可以由控制者更改。我們預計此類軟體對政策制定者很有用，因為他們可以更容易地根據代理對規範遵守的態度了解代理在特定情況下的行為方式。如果模擬顯示出不希望的後果，政策制定者可以修改他們的政策。
+摘要：人工智能（AI）的進展，包括基礎模型（FM），正日益轉變人類社會，智慧城市推動著城市生活的演進。同時，車輛群感測（VCS）已成為關鍵推動因素，利用車輛的機動性和配備感測器的能力。特別是，儘管有資源限制，叫車服務車輛能有效促進靈活的資料收集，並有助於城市智慧。因此，這項工作探索了一個有前途的場景，其中邊緣輔助車輛執行訂單服務和新興基礎模型微調的聯合任務，使用各種城市資料。然而，由於 VCS AI 任務與傳統訂單服務任務的不一致時空特徵，整合它們具有挑戰性：(i) 叫車訂單和資料感興趣點 (PoI) 的分佈在地域上可能不重合，兩者都遵循先驗未知的模式；(ii) 它們具有不同的時間效應形式，即長時間等待會使訂單立即失效，而過時的資料會逐漸降低其對模型微調的效用。為了解決這些障礙，我們提出了一個基於多智能體強化學習 (MARL) 的線上架構，並進行了仔細的擴充。設計了一個新的服務品質 (QoS) 指標，用於表徵和平衡這兩個聯合任務的效用，在不同資料量和過時性的影響下。我們還將圖神經網路（GNN）與 MARL 整合，以增強狀態表示，捕捉車輛之間和不同地點之間的圖結構、時變依賴性。在我們的測試平台模擬器上進行的廣泛實驗，利用各種真實世界的基礎模型微調任務和紐約市計程車叫車訂單資料集，證明了我們提出的方法的優點。
 
-##### **Neuro-Symbolic Contrastive Learning for Cross-domain Inference**
-2502.09213v1 by Mingyue Liu, Ryo Ueda, Zhen Wan, Katsumi Inoue, Chris G. Willcocks
+##### **Multimodal Medical Code Tokenizer**
+2502.04397v2 by Xiaorui Su, Shvat Messica, Yepeng Huang, Ruth Johnson, Lukas Fesser, Shanghua Gao, Faryad Sahneh, Marinka Zitnik
 
-Pre-trained language models (PLMs) have made significant advances in natural
-language inference (NLI) tasks, however their sensitivity to textual
-perturbations and dependence on large datasets indicate an over-reliance on
-shallow heuristics. In contrast, inductive logic programming (ILP) excels at
-inferring logical relationships across diverse, sparse and limited datasets,
-but its discrete nature requires the inputs to be precisely specified, which
-limits their application. This paper proposes a bridge between the two
-approaches: neuro-symbolic contrastive learning. This allows for smooth and
-differentiable optimisation that improves logical accuracy across an otherwise
-discrete, noisy, and sparse topological space of logical functions. We show
-that abstract logical relationships can be effectively embedded within a
-neuro-symbolic paradigm, by representing data as logic programs and sets of
-logic rules. The embedding space captures highly varied textual information
-with similar semantic logical relations, but can also separate similar textual
-relations that have dissimilar logical relations. Experimental results
-demonstrate that our approach significantly improves the inference capabilities
-of the models in terms of generalisation and reasoning.
+Foundation models trained on patient electronic health records (EHRs) require
+tokenizing medical data into sequences of discrete vocabulary items. Existing
+tokenizers treat medical codes from EHRs as isolated textual tokens. However,
+each medical code is defined by its textual description, its position in
+ontological hierarchies, and its relationships to other codes, such as disease
+co-occurrences and drug-treatment associations. Medical vocabularies contain
+more than 600,000 codes with critical information for clinical reasoning. We
+introduce MedTok, a multimodal medical code tokenizer that uses the text
+descriptions and relational context of codes. MedTok processes text using a
+language model encoder and encodes the relational structure with a graph
+encoder. It then quantizes both modalities into a unified token space,
+preserving modality-specific and cross-modality information. We integrate
+MedTok into five EHR models and evaluate it on operational and clinical tasks
+across in-patient and out-patient datasets, including outcome prediction,
+diagnosis classification, drug recommendation, and risk stratification.
+Swapping standard EHR tokenizers with MedTok improves AUPRC across all EHR
+models, by 4.10% on MIMIC-III, 4.78% on MIMIC-IV, and 11.30% on EHRShot, with
+the largest gains in drug recommendation. Beyond EHR modeling, we demonstrate
+using MedTok tokenizer with medical QA systems. Our results demonstrate the
+potential of MedTok as a unified tokenizer for medical codes, improving
+tokenization for medical foundation models.
 
-摘要：預訓練語言模型 (PLM) 在自然語言推理 (NLI) 任務中取得了重大進展，然而它們對文本擾動的敏感性和對大型資料集的依賴性表明過度依賴於淺層啟發法。相比之下，歸納邏輯規劃 (ILP) 擅長推論跨越多樣化、稀疏和有限資料集的邏輯關係，但其離散性質要求輸入被精確指定，這限制了它們的應用。本文提出了兩種方法之間的橋樑：神經符號對比學習。這允許平滑且可微分的優化，從而提高邏輯函數的離散、嘈雜和稀疏拓撲空間中的邏輯準確性。我們展示了抽象邏輯關係可以通過將資料表示為邏輯程式和邏輯規則集，有效地嵌入到神經符號範例中。嵌入空間捕獲具有相似語義邏輯關係的高度多變的文本資訊，但也可以分離具有不同邏輯關係的相似文本關係。實驗結果表明，我們的做法在泛化和推理方面顯著提高了模型的推理能力。
+摘要：<paragraph>在患者电子健康记录 (EHR) 上训练的基础模型需要将医学数据标记为离散词汇项序列。现有的标记器将 EHR 中的医学代码视为孤立的文本标记。然而，每个医学代码都由其文本描述、在本体层次结构中的位置以及与其他代码的关系（例如疾病共现和药物治疗关联）来定义。医学词汇表包含超过 600,000 个代码，这些代码包含临床推理的关键信息。我们引入了 MedTok，这是一种多模态医学代码标记器，它使用文本描述和代码的关系上下文。MedTok 使用语言模型编码器处理文本，并使用图编码器对关系结构进行编码。然后，它将这两种模态量化为一个统一的标记空间，保留特定于模态和跨模态的信息。我们将 MedTok 集成到五个 EHR 模型中，并在住院和门诊数据集（包括结果预测、诊断分类、药物推荐和风险分层）上对其实施操作和临床任务进行评估。用 MedTok 替换标准 EHR 标记器可提高所有 EHR 模型的 AUPRC，在 MIMIC-III 上提高 4.10%，在 MIMIC-IV 上提高 4.78%，在 EHRShot 上提高 11.30%，其中药物推荐的增益最大。除了 EHR 建模之外，我们还演示了将 MedTok 标记器与医学问答系统结合使用。我们的结果证明了 MedTok 作为医学代码的统一标记器的潜力，改进了医学基础模型的标记化。</paragraph>
 
-##### **LP-LM: No Hallucinations in Question Answering with Logic Programming**
-2502.09212v1 by Katherine Wu, Yanhong A. Liu
+##### **A Retrospective Systematic Study on Hierarchical Sparse Query Transformer-assisted Ultrasound Screening for Early Hepatocellular Carcinoma**
+2502.03772v1 by Chaoyin She, Ruifang Lu, Danni He, Jiayi Lv, Yadan Lin, Meiqing Cheng, Hui Huang, Lida Chen, Wei Wang, Qinghua Huang
 
-Large language models (LLMs) are able to generate human-like responses to
-user queries. However, LLMs exhibit inherent limitations, especially because
-they hallucinate. This paper introduces LP-LM, a system that grounds answers to
-questions in known facts contained in a knowledge base (KB), facilitated
-through semantic parsing in Prolog, and always produces answers that are
-reliable.
-  LP-LM generates a most probable constituency parse tree along with a
-corresponding Prolog term for an input question via Prolog definite clause
-grammar (DCG) parsing. The term is then executed against a KB of natural
-language sentences also represented as Prolog terms for question answering. By
-leveraging DCG and tabling, LP-LM runs in linear time in the size of input
-sentences for sufficiently many grammar rules. Performing experiments comparing
-LP-LM with current well-known LLMs in accuracy, we show that LLMs hallucinate
-on even simple questions, unlike LP-LM.
+Hepatocellular carcinoma (HCC) ranks as the third leading cause of
+cancer-related mortality worldwide, with early detection being crucial for
+improving patient survival rates. However, early screening for HCC using
+ultrasound suffers from insufficient sensitivity and is highly dependent on the
+expertise of radiologists for interpretation. Leveraging the latest
+advancements in artificial intelligence (AI) in medical imaging, this study
+proposes an innovative Hierarchical Sparse Query Transformer (HSQformer) model
+that combines the strengths of Convolutional Neural Networks (CNNs) and Vision
+Transformers (ViTs) to enhance the accuracy of HCC diagnosis in ultrasound
+screening. The HSQformer leverages sparse latent space representations to
+capture hierarchical details at various granularities without the need for
+complex adjustments, and adopts a modular, plug-and-play design philosophy,
+ensuring the model's versatility and ease of use. The HSQformer's performance
+was rigorously tested across three distinct clinical scenarios: single-center,
+multi-center, and high-risk patient testing. In each of these settings, it
+consistently outperformed existing state-of-the-art models, such as ConvNext
+and SwinTransformer. Notably, the HSQformer even matched the diagnostic
+capabilities of senior radiologists and comprehensively surpassed those of
+junior radiologists. The experimental results from this study strongly
+demonstrate the effectiveness and clinical potential of AI-assisted tools in
+HCC screening. The full code is available at
+https://github.com/Asunatan/HSQformer.
+
+摘要：肝細胞癌（HCC）是全球第三大癌症相關死亡原因，早期檢測對於提高患者存活率至關重要。然而，使用超音波進行 HCC 早期篩檢的靈敏度不足，且高度依賴放射科醫師的專業知識進行判讀。本研究利用醫學影像中人工智慧（AI）的最新進展，提出了一種創新的分層稀疏查詢Transformer（HSQformer）模型，結合了卷積神經網路（CNN）和視覺Transformer（ViT）的優點，以提高超音波篩檢中 HCC 診斷的準確性。HSQformer 利用稀疏潛在空間表示，在不需要複雜調整的情況下擷取各種粒度層級的細節，並採用模組化、即插即用的設計理念，確保模型的多功能性和易用性。HSQformer 的效能經過三個不同的臨床場景的嚴格測試：單中心、多中心和高風險患者測試。在這些設定中，它始終優於現有的最先進模型，例如 ConvNext 和 SwinTransformer。值得注意的是，HSQformer 甚至匹配了資深放射科醫師的診斷能力，並全面超越了初級放射科醫師的診斷能力。本研究的實驗結果有力地證明了 AI 輔助工具在 HCC 篩檢中的有效性和臨床潛力。完整程式碼可在 https://github.com/Asunatan/HSQformer 取得。
+
+##### **Towards Fair Medical AI: Adversarial Debiasing of 3D CT Foundation Embeddings**
+2502.04386v1 by Guangyao Zheng, Michael A. Jacobs, Vladimir Braverman, Vishwa S. Parekh
+
+Self-supervised learning has revolutionized medical imaging by enabling
+efficient and generalizable feature extraction from large-scale unlabeled
+datasets. Recently, self-supervised foundation models have been extended to
+three-dimensional (3D) computed tomography (CT) data, generating compact,
+information-rich embeddings with 1408 features that achieve state-of-the-art
+performance on downstream tasks such as intracranial hemorrhage detection and
+lung cancer risk forecasting. However, these embeddings have been shown to
+encode demographic information, such as age, sex, and race, which poses a
+significant risk to the fairness of clinical applications.
+  In this work, we propose a Variation Autoencoder (VAE) based adversarial
+debiasing framework to transform these embeddings into a new latent space where
+demographic information is no longer encoded, while maintaining the performance
+of critical downstream tasks. We validated our approach on the NLST lung cancer
+screening dataset, demonstrating that the debiased embeddings effectively
+eliminate multiple encoded demographic information and improve fairness without
+compromising predictive accuracy for lung cancer risk at 1-year and 2-year
+intervals. Additionally, our approach ensures the embeddings are robust against
+adversarial bias attacks. These results highlight the potential of adversarial
+debiasing techniques to ensure fairness and equity in clinical applications of
+self-supervised 3D CT embeddings, paving the way for their broader adoption in
+unbiased medical decision-making.
 
-摘要：大型語言模型 (LLM) 能產生類似人類的回應來回答使用者的問題。然而，LLM 顯示出內在的限制，特別是因為它們會產生幻覺。本文介紹 LP-LM，一個系統，它將問題的答案建立在知識庫 (KB) 中已知的事實上，透過 Prolog 中的語義解析來促進，並始終產生可靠的答案。
-LP-LM 透過 Prolog 明確條款語法 (DCG) 解析產生一個最可能的成分解析樹，以及輸入問題對應的 Prolog 詞彙。然後，針對一個自然語言句子的 KB 執行該詞彙，也表示為 Prolog 詞彙，以進行問題解答。透過利用 DCG 和 tabling，LP-LM 在輸入句子的大小上以線性時間執行，對於足夠多的語法規則。執行實驗比較 LP-LM 與目前眾所周知的 LLM 在準確性上，我們顯示出 LLM 甚至會對簡單的問題產生幻覺，這與 LP-LM 不同。
+摘要：自我監督學習透過從大規模未標記資料集中提取有效且可概化的特徵，進而革新了醫學影像。最近，自我監督基礎模型已擴展到三維 (3D) 電腦斷層掃描 (CT) 資料，產生緊湊、資訊豐富的嵌入，包含 1408 個特徵，在顱內出血偵測和肺癌風險預測等下游任務中達到最先進的效能。然而，這些嵌入已被證明會編碼人口統計資訊，例如年齡、性別和種族，這對臨床應用的公平性構成重大風險。
+在這項工作中，我們提出一個基於變異自編碼器 (VAE) 的對抗性去偏框架，將這些嵌入轉換到一個新的潛在空間，其中不再編碼人口統計資訊，同時維持關鍵下游任務的效能。我們在 NLST 肺癌篩檢資料集上驗證了我們的做法，證明去偏嵌入有效消除了多重編碼的人口統計資訊，並在不損害 1 年和 2 年間隔的肺癌風險預測準確性的情況下提高了公平性。此外，我們的做法確保了嵌入對抗性偏誤攻擊具有魯棒性。這些結果突顯了對抗性去偏技術的潛力，可確保自我監督 3D CT 嵌入在臨床應用中的公平性和公正性，為其在無偏見醫療決策中的廣泛採用鋪路。
 
-##### **Visual Graph Question Answering with ASP and LLMs for Language Parsing**
-2502.09211v1 by Jakob Johannes Bauer, Thomas Eiter, Nelson Higuera Ruiz, Johannes Oetsch
+##### **Clinically-Inspired Hierarchical Multi-Label Classification of Chest X-rays with a Penalty-Based Loss Function**
+2502.03591v1 by Mehrdad Asadi, Komi Sodoké, Ian J. Gerard, Marta Kersten-Oertel
 
-Visual Question Answering (VQA) is a challenging problem that requires to
-process multimodal input. Answer-Set Programming (ASP) has shown great
-potential in this regard to add interpretability and explainability to modular
-VQA architectures. In this work, we address the problem of how to integrate ASP
-with modules for vision and natural language processing to solve a new and
-demanding VQA variant that is concerned with images of graphs (not graphs in
-symbolic form). Images containing graph-based structures are an ubiquitous and
-popular form of visualisation. Here, we deal with the particular problem of
-graphs inspired by transit networks, and we introduce a novel dataset that
-amends an existing one by adding images of graphs that resemble metro lines.
-Our modular neuro-symbolic approach combines optical graph recognition for
-graph parsing, a pretrained optical character recognition neural network for
-parsing labels, Large Language Models (LLMs) for language processing, and ASP
-for reasoning. This method serves as a first baseline and achieves an overall
-average accuracy of 73% on the dataset. Our evaluation provides further
-evidence of the potential of modular neuro-symbolic systems, in particular with
-pretrained models that do not involve any further training and logic
-programming for reasoning, to solve complex VQA tasks.
+In this work, we present a novel approach to multi-label chest X-ray (CXR)
+image classification that enhances clinical interpretability while maintaining
+a streamlined, single-model, single-run training pipeline. Leveraging the
+CheXpert dataset and VisualCheXbert-derived labels, we incorporate hierarchical
+label groupings to capture clinically meaningful relationships between
+diagnoses. To achieve this, we designed a custom hierarchical binary
+cross-entropy (HBCE) loss function that enforces label dependencies using
+either fixed or data-driven penalty types. Our model achieved a mean area under
+the receiver operating characteristic curve (AUROC) of 0.903 on the test set.
+Additionally, we provide visual explanations and uncertainty estimations to
+further enhance model interpretability. All code, model configurations, and
+experiment details are made available.
 
-摘要：視覺問答（VQA）是一項具有挑戰性的問題，需要處理多模態輸入。答案集程式設計（ASP）在這方面顯示出巨大的潛力，可以為模組化 VQA 架構增加可解釋性和說明性。在這項工作中，我們探討如何將 ASP 與視覺和自然語言處理模組整合，以解決一個新的且要求嚴格的 VQA 變體，該變體與圖形影像（而非符號形式的圖形）有關。包含圖形結構的影像是一種普遍且流行的可視化形式。在這裡，我們處理受交通網路啟發的圖形特定問題，並引入一個新的資料集，透過新增類似地鐵路線的圖形影像來修正現有資料集。我們的模組化神經符號方法結合光學圖形辨識進行圖形解析、預先訓練的光學字元辨識神經網路進行標籤解析、大型語言模型（LLM）進行語言處理，以及 ASP 進行推理。此方法作為第一個基準，在資料集上達到 73% 的整體平均準確度。我們的評估進一步證明了模組化神經符號系統的潛力，特別是預先訓練的模型，這些模型不涉及任何進一步的訓練和邏輯程式設計進行推理，以解決複雜的 VQA 任務。
+摘要：在本文中，我們提出胸部 X 光（CXR）影像多標籤分類的新方法，在維持簡化的單一模型、單次執行訓練管線的同時，提升臨床可解釋性。利用 CheXpert 資料集和 VisualCheXbert 衍生的標籤，我們納入階層標籤群組，以擷取診斷之間具有臨床意義的關聯性。為此，我們設計了自訂的階層二元交叉熵 (HBCE) 損失函數，使用固定或資料驅動的懲罰類型來強制執行標籤依賴性。我們的模型在測試集上達到受試者工作特性曲線 (AUROC) 下的平均面積為 0.903。此外，我們提供視覺化說明和不確定性估計，以進一步提升模型可解釋性。所有程式碼、模型組態和實驗詳細資料皆已公開。
 
-##### **On LLM-generated Logic Programs and their Inference Execution Methods**
-2502.09209v1 by Paul Tarau
+##### **Code Simulation as a Proxy for High-order Tasks in Large Language Models**
+2502.03568v1 by Emanuele La Malfa, Christoph Weinhuber, Orazio Torre, Fangru Lin, X. Angelo Huang, Samuele Marro, Anthony Cohn, Nigel Shadbolt, Michael Wooldridge
 
-Large Language Models (LLMs) trained on petabytes of data are highly
-compressed repositories of a significant proportion of the knowledge
-accumulated and distilled so far. In this paper we study techniques to elicit
-this knowledge in the form of several classes of logic programs, including
-propositional Horn clauses, Dual Horn clauses, relational triplets and Definite
-Clause Grammars. Exposing this knowledge as logic programs enables sound
-reasoning methods that can verify alignment of LLM outputs to their intended
-uses and extend their inference capabilities. We study new execution methods
-for the generated programs, including soft-unification of abducible facts
-against LLM-generated content stored in a vector database as well as GPU-based
-acceleration of minimal model computation that supports inference with large
-LLM-generated programs.
+Many reasoning, planning, and problem-solving tasks share an intrinsic
+algorithmic nature: correctly simulating each step is a sufficient condition to
+solve them correctly. We collect pairs of naturalistic and synthetic reasoning
+tasks to assess the capabilities of Large Language Models (LLM). While
+naturalistic tasks often require careful human handcrafting, we show that
+synthetic data is, in many cases, a good proxy that is much easier to collect
+at scale. We leverage common constructs in programming as the counterpart of
+the building blocks of naturalistic reasoning tasks, such as straight-line
+programs, code that contains critical paths, and approximate and redundant
+instructions. We further assess the capabilities of LLMs on sorting problems
+and repeated operations via sorting algorithms and nested loops. Our synthetic
+datasets further reveal that while the most powerful LLMs exhibit relatively
+strong execution capabilities, the process is fragile: it is negatively
+affected by memorisation and seems to rely heavily on pattern recognition. Our
+contribution builds upon synthetically testing the reasoning capabilities of
+LLMs as a scalable complement to handcrafted human-annotated problems.
 
-摘要：大型語言模型 (LLM) 在數位位元組的資料上受過訓練，是目前為止累積和提煉的知識中，高度濃縮的儲存庫。在本文中，我們研究了以數種邏輯程式類別的形式引出這些知識的技術，包括命題霍恩子句、雙重霍恩子句、關聯三元組和確定子句文法。將這些知識作為邏輯程式揭露，能啟用健全的推理方法，驗證 LLM 輸出的對齊方式，符合其預期的用途，並擴展其推論能力。我們研究了產生程式的新執行方法，包括對儲存在向量資料庫中的 LLM 產生內容，進行可約簡事實的軟統一，以及支援使用大型 LLM 產生程式進行推論的，基於 GPU 的最小模型計算加速。
+摘要：許多推理、規劃和問題解決任務共享一個內在的演算法性質：正確模擬每一步就足以正確解決它們。我們收集自然主義和合成推理任務對，以評估大型語言模型 (LLM) 的功能。雖然自然主義任務通常需要仔細的人工製作，但我們表明在許多情況下，合成資料是一個很好的代理，而且更容易大規模收集。我們利用程式設計中的常見建構，作為自然主義推理任務構建區塊的對應物，例如直線程式、包含關鍵路徑的程式碼，以及近似和冗餘指令。我們進一步評估 LLM 在排序問題和重複運算上的功能，透過排序演算法和巢狀迴圈。我們的合成資料集進一步揭示，雖然最強大的 LLM 表現出相對強大的執行能力，但這個過程很脆弱：它受到記憶的負面影響，而且似乎嚴重依賴模式辨識。我們的貢獻建立在以合成方式測試 LLM 的推理能力之上，作為手工編寫人類標註問題的可擴充補充。
 
-##### **Efficient OWL2QL Meta-reasoning Using ASP-based Hybrid Knowledge Bases**
-2502.09206v1 by Haya Majid Qureshi, Wolfgang Faber
+##### **Limitations of Large Language Models in Clinical Problem-Solving Arising from Inflexible Reasoning**
+2502.04381v1 by Jonathan Kim, Anna Podlasek, Kie Shidara, Feng Liu, Ahmed Alaa, Danilo Bernardo
 
-Metamodeling refers to scenarios in ontologies in which classes and roles can
-be members of classes or occur in roles. This is a desirable modelling feature
-in several applications, but allowing it without restrictions is problematic
-for several reasons, mainly because it causes undecidability. Therefore,
-practical languages either forbid metamodeling explicitly or treat occurrences
-of classes as instances to be semantically different from other occurrences,
-thereby not allowing metamodeling semantically. Several extensions have been
-proposed to provide metamodeling to some extent. Building on earlier work that
-reduces metamodeling query answering to Datalog query answering, recently
-reductions to query answering over hybrid knowledge bases were proposed with
-the aim of using the Datalog transformation only where necessary. Preliminary
-work showed that the approach works, but the hoped-for performance improvements
-were not observed yet. In this work we expand on this body of work by improving
-the theoretical basis of the reductions and by using alternative tools that
-show competitive performance.
+Large Language Models (LLMs) have attained human-level accuracy on medical
+question-answer (QA) benchmarks. However, their limitations in navigating
+open-ended clinical scenarios have recently been shown, raising concerns about
+the robustness and generalizability of LLM reasoning across diverse, real-world
+medical tasks. To probe potential LLM failure modes in clinical
+problem-solving, we present the medical abstraction and reasoning corpus
+(M-ARC). M-ARC assesses clinical reasoning through scenarios designed to
+exploit the Einstellung effect -- the fixation of thought arising from prior
+experience, targeting LLM inductive biases toward inflexible pattern matching
+from their training data rather than engaging in flexible reasoning. We find
+that LLMs, including current state-of-the-art o1 and Gemini models, perform
+poorly compared to physicians on M-ARC, often demonstrating lack of commonsense
+medical reasoning and a propensity to hallucinate. In addition, uncertainty
+estimation analyses indicate that LLMs exhibit overconfidence in their answers,
+despite their limited accuracy. The failure modes revealed by M-ARC in LLM
+medical reasoning underscore the need to exercise caution when deploying these
+models in clinical settings.
 
-摘要：元建模是指本体中的場景，其中類別和角色可以是類別成員或出現在角色中。這是一個在多個應用中理想的建模功能，但允許它不受限制會因多個原因而產生問題，主要是因為它會導致無法決定。因此，實用的語言會明確禁止元建模，或將類別的出現視為與其他出現語義不同的實例，從而語義上不允許元建模。已經提出多個擴充功能，在一定程度上提供元建模。建立在將元建模查詢回答簡化為 Datalog 查詢回答的早期工作之上，最近提出了將查詢回答簡化為混合知識庫的簡化，目的是僅在必要時使用 Datalog 轉換。初步工作顯示該方法有效，但尚未觀察到預期的效能改善。在這項工作中，我們透過改善簡化的理論基礎和使用表現競爭力的替代工具，擴展了這項工作。
+摘要：大型語言模型 (LLM) 已在醫療問題解答 (QA) 基準上達到人類層級的準確度。然而，它們在應對開放式臨床場景中的局限性最近已被揭示，引發了人們對 LLM 推理在多樣化、真實世界醫療任務中的穩健性和概括性的擔憂。為了探討臨床問題解決中 LLM 的潛在故障模式，我們提出了醫療抽象和推理語料庫 (M-ARC)。M-ARC 通過旨在利用艾賓浩斯錯覺（由先前經驗產生的思維定勢）來評估臨床推理，針對 LLM 歸納偏誤，使其從訓練數據中進行僵化的模式匹配，而不是進行靈活的推理。我們發現，包括當前最先進的 o1 和 Gemini 模型在內的 LLM，在 M-ARC 上的表現遠不如醫生，它們經常表現出缺乏常識性的醫療推理和產生幻覺的傾向。此外，不確定性估計分析表明，儘管 LLM 準確性有限，但它們對自己的答案表現出過度自信。M-ARC 揭示的 LLM 醫療推理故障模式強調了在臨床環境中部署這些模型時需要謹慎。
 
-##### **Counterfactual Explanations as Plans**
-2502.09205v1 by Vaishak Belle
+##### **Accurate AI-Driven Emergency Vehicle Location Tracking in Healthcare ITS Digital Twin**
+2502.03396v1 by Sarah Al-Shareeda, Yasar Celik, Bilge Bilgili, Ahmed Al-Dubai, Berk Canberk
 
-There has been considerable recent interest in explainability in AI,
-especially with black-box machine learning models. As correctly observed by the
-planning community, when the application at hand is not a single-shot decision
-or prediction, but a sequence of actions that depend on observations, a richer
-notion of explanations are desirable.
-  In this paper, we look to provide a formal account of ``counterfactual
-explanations," based in terms of action sequences. We then show that this
-naturally leads to an account of model reconciliation, which might take the
-form of the user correcting the agent's model, or suggesting actions to the
-agent's plan. For this, we will need to articulate what is true versus what is
-known, and we appeal to a modal fragment of the situation calculus to formalise
-these intuitions. We consider various settings: the agent knowing partial
-truths, weakened truths and having false beliefs, and show that our definitions
-easily generalize to these different settings.
+Creating a Digital Twin (DT) for Healthcare Intelligent Transportation
+Systems (HITS) is a hot research trend focusing on enhancing HITS management,
+particularly in emergencies where ambulance vehicles must arrive at the crash
+scene on time and track their real-time location is crucial to the medical
+authorities. Despite the claim of real-time representation, a temporal
+misalignment persists between the physical and virtual domains, leading to
+discrepancies in the ambulance's location representation. This study proposes
+integrating AI predictive models, specifically Support Vector Regression (SVR)
+and Deep Neural Networks (DNN), within a constructed mock DT data pipeline
+framework to anticipate the medical vehicle's next location in the virtual
+world. These models align virtual representations with their physical
+counterparts, i.e., metaphorically offsetting the synchronization delay between
+the two worlds. Trained meticulously on a historical geospatial dataset, SVR
+and DNN exhibit exceptional prediction accuracy in MATLAB and Python
+environments. Through various testing scenarios, we visually demonstrate the
+efficacy of our methodology, showcasing SVR and DNN's key role in significantly
+reducing the witnessed gap within the HITS's DT. This transformative approach
+enhances real-time synchronization in emergency HITS by approximately 88% to
+93%.
 
-摘要：最近在人工智能中對於可解釋性產生了相當大的興趣，
-特別是對於黑盒機器學習模型。正如規劃社群正確觀察到的，當手邊的應用程式不是單次決策或預測，而是一連串依賴於觀察的動作時，一個更豐富的解釋概念是可取的。
-在本文中，我們著眼於提供「反事實解釋」的一個正式說明，以動作序列為基礎。然後我們展示這自然會導致一個模型調和說明，其形式可能是使用者修正代理人的模型，或建議代理人的計畫採取行動。為此，我們需要說明什麼是真實的，什麼是已知的，我們訴諸情境演算的一個模態片段來形式化這些直覺。我們考慮各種設定：代理人知道部分真實、虛弱真實和擁有錯誤信念，並展示我們的定義輕鬆地概括到這些不同的設定。
+摘要：建立醫療智慧交通系統（HITS）的數位分身（DT）是熱門的研究趨勢，其重點在於提升 HITS 管理，特別是在救護車必須準時抵達車禍現場的緊急情況中，追蹤其即時位置對於醫療單位至關重要。儘管聲稱即時呈現，但實體和虛擬領域之間仍存在時間上的錯位，導致救護車位置呈現上的差異。本研究建議在建構的虛擬 DT 資料管道架構中整合人工智慧預測模型，特別是支援向量回歸（SVR）和深度神經網路（DNN），以預測醫療車輛在虛擬世界的下一個位置。這些模型將虛擬呈現與其實體對應物對齊，也就是說，在兩個世界之間比喻性地抵銷同步延遲。在歷史地理空間資料集上經過仔細訓練，SVR 和 DNN 在 MATLAB 和 Python 環境中展現出卓越的預測準確性。透過各種測試情境，我們視覺化展示了我們方法論的效能，展示了 SVR 和 DNN 在顯著縮小 HITS 的 DT 中見證到的差距方面的關鍵作用。這種變革性的方法將緊急 HITS 中的即時同步提升了大約 88% 到 93%。
 
-##### **Logical Lease Litigation: Prolog and LLMs for Rental Law Compliance in New York**
-2502.09204v1 by Sanskar Sehgal, Yanhong A. Liu
+##### **RadVLM: A Multitask Conversational Vision-Language Model for Radiology**
+2502.03333v1 by Nicolas Deperrois, Hidetoshi Matsuo, Samuel Ruipérez-Campillo, Moritz Vandenhirtz, Sonia Laguna, Alain Ryser, Koji Fujimoto, Mizuho Nishio, Thomas M. Sutter, Julia E. Vogt, Jonas Kluckert, Thomas Frauenfelder, Christian Blüthgen, Farhad Nooralahzadeh, Michael Krauthammer
 
-Legal cases require careful logical reasoning following the laws, whereas
-interactions with non- technical users must be in natural language. As an
-application combining logical reasoning using Prolog and natural language
-processing using large language models (LLMs), this paper presents a novel
-approach and system, LogicLease, to automate the analysis of landlord-tenant
-legal cases in the state of New York. LogicLease determines compliance with
-relevant legal requirements by analyzing case descriptions and citing all
-relevant laws. It leverages LLMs for information extraction and Prolog for
-legal reasoning. By separating information extraction from legal reasoning,
-LogicLease achieves greater transparency and control over the legal logic
-applied to each case. We evaluate the accuracy, efficiency, and robustness of
-LogicLease through a series of tests, achieving 100% accuracy and an average
-processing time of 2.57 seconds. LogicLease presents advantages over
-state-of-the-art LLM- based legal analysis systems by providing clear,
-step-by-step reasoning, citing specific laws, and distinguishing itself by its
-ability to avoid hallucinations - a common issue in LLMs.
+The widespread use of chest X-rays (CXRs), coupled with a shortage of
+radiologists, has driven growing interest in automated CXR analysis and
+AI-assisted reporting. While existing vision-language models (VLMs) show
+promise in specific tasks such as report generation or abnormality detection,
+they often lack support for interactive diagnostic capabilities. In this work
+we present RadVLM, a compact, multitask conversational foundation model
+designed for CXR interpretation. To this end, we curate a large-scale
+instruction dataset comprising over 1 million image-instruction pairs
+containing both single-turn tasks -- such as report generation, abnormality
+classification, and visual grounding -- and multi-turn, multi-task
+conversational interactions. After fine-tuning RadVLM on this instruction
+dataset, we evaluate it across different tasks along with re-implemented
+baseline VLMs. Our results show that RadVLM achieves state-of-the-art
+performance in conversational capabilities and visual grounding while remaining
+competitive in other radiology tasks. Ablation studies further highlight the
+benefit of joint training across multiple tasks, particularly for scenarios
+with limited annotated data. Together, these findings highlight the potential
+of RadVLM as a clinically relevant AI assistant, providing structured CXR
+interpretation and conversational capabilities to support more effective and
+accessible diagnostic workflows.
 
-摘要：法律案件需要遵循法律进行谨慎的逻辑推理，而与非技术用户的互动必须使用自然语言。作为结合使用 Prolog 进行逻辑推理和使用大型语言模型 (LLM) 进行自然语言处理的应用程序，本文提出了一种新颖的方法和系统 LogicLease，以自动分析纽约州的房东与租户法律案件。LogicLease 通过分析案例描述并引用所有相关法律来确定是否符合相关法律要求。它利用 LLM 进行信息提取，并利用 Prolog 进行法律推理。通过将信息提取与法律推理分开，LogicLease 实现了对应用于每个案例的法律逻辑的更高透明度和控制力。我们通过一系列测试评估了 LogicLease 的准确性、效率和鲁棒性，实现了 100% 的准确性和 2.57 秒的平均处理时间。LogicLease 通过提供清晰、分步的推理，引用具体法律，并以其避免幻觉的能力而区别于最先进的基于 LLM 的法律分析系统，从而显示出优势——这是 LLM 中的常见问题。
+摘要：胸部 X 光 (CXR) 的广泛使用，加上放射科醫師短缺，促使人們對自動化 CXR 分析和 AI 輔助報告產生越來越濃厚的興趣。雖然現有的視覺語言模型 (VLM) 在特定任務中顯示出前景，例如報告生成或異常偵測，但它們通常缺乏對互動式診斷功能的支持。在這項工作中，我們提出 RadVLM，這是一個緊湊的多任務對話式基礎模型，專為 CXR 解釋而設計。為此，我們策劃了一個大型指令資料集，包含超過 100 萬個影像指令對，其中包含單輪任務（例如報告生成、異常分類和視覺基礎），以及多輪、多任務對話互動。在對這個指令資料集進行微調後，我們對 RadVLM 進行評估，並與重新實作的基準 VLM 一起執行不同的任務。我們的結果顯示，RadVLM 在對話能力和視覺基礎方面取得了最先進的效能，同時在其他放射學任務中仍具有競爭力。消融研究進一步突顯了跨多個任務進行聯合訓練的好處，特別是對於帶有標註資料有限的場景。這些發現共同突顯了 RadVLM 作為臨床相關 AI 助理的潛力，提供結構化的 CXR 解釋和對話能力，以支援更有效且可存取的診斷工作流程。
 
-##### **Thinking beyond the anthropomorphic paradigm benefits LLM research**
-2502.09192v1 by Lujain Ibrahim, Myra Cheng
+##### **MeDiSumQA: Patient-Oriented Question-Answer Generation from Discharge Letters**
+2502.03298v1 by Amin Dada, Osman Alperen Koras, Marie Bauer, Amanda Butler, Kaleb E. Smith, Jens Kleesiek, Julian Friedrich
 
-Anthropomorphism, or the attribution of human traits to technology, is an
-automatic and unconscious response that occurs even in those with advanced
-technical expertise. In this position paper, we analyze hundreds of thousands
-of computer science research articles from the past decade and present
-empirical evidence of the prevalence and growth of anthropomorphic terminology
-in research on large language models (LLMs). This terminology reflects deeper
-anthropomorphic conceptualizations which shape how we think about and conduct
-LLM research. We argue these conceptualizations may be limiting, and that
-challenging them opens up new pathways for understanding and improving LLMs
-beyond human analogies. To illustrate this, we identify and analyze five core
-anthropomorphic assumptions shaping prominent methodologies across the LLM
-development lifecycle, from the assumption that models must use natural
-language for reasoning tasks to the assumption that model capabilities should
-be evaluated through human-centric benchmarks. For each assumption, we
-demonstrate how non-anthropomorphic alternatives can open new directions for
-research and development.
+While increasing patients' access to medical documents improves medical care,
+this benefit is limited by varying health literacy levels and complex medical
+terminology. Large language models (LLMs) offer solutions by simplifying
+medical information. However, evaluating LLMs for safe and patient-friendly
+text generation is difficult due to the lack of standardized evaluation
+resources. To fill this gap, we developed MeDiSumQA. MeDiSumQA is a dataset
+created from MIMIC-IV discharge summaries through an automated pipeline
+combining LLM-based question-answer generation with manual quality checks. We
+use this dataset to evaluate various LLMs on patient-oriented
+question-answering. Our findings reveal that general-purpose LLMs frequently
+surpass biomedical-adapted models, while automated metrics correlate with human
+judgment. By releasing MeDiSumQA on PhysioNet, we aim to advance the
+development of LLMs to enhance patient understanding and ultimately improve
+care outcomes.
 
-摘要：擬人化，或將人類特質歸因於科技，是一種自動且無意識的反應，即使是那些擁有進階技術專業知識的人也會發生。在本文中，我們分析了過去十年數十萬篇電腦科學研究文章，並提出實證證據證明擬人化術語在大型語言模型 (LLM) 研究中的普遍性和增長。這些術語反映了更深層的擬人化概念化，塑造了我們思考和進行 LLM 研究的方式。我們認為這些概念化可能是有限制的，並且挑戰它們為超越人類類比來理解和改進 LLM 開闢了新的途徑。為了說明這一點，我們識別並分析了五個核心擬人化假設，這些假設塑造了 LLM 開發生命週期中的顯著方法論，從模型必須使用自然語言進行推理任務的假設到模型能力應該通過以人為中心的基準進行評估的假設。對於每個假設，我們展示了非擬人化替代方案如何為研究和開發打開新方向。
+摘要：儘管讓患者更能取得醫療文件有助於改善醫療照護，
+但此優點受到不同的健康素養程度和複雜的醫療術語所限制。大型語言模型 (LLM) 提供了簡化醫療資訊的解決方案。然而，由於缺乏標準化的評估資源，因此難以評估 LLM 以確保其安全且對患者友善的文字產生。為了填補此缺口，我們開發了 MeDiSumQA。MeDiSumQA 是透過自動化流程從 MIMIC-IV 出院摘要中建立的資料集，結合了基於 LLM 的問答產生和手動品質檢查。我們使用此資料集來評估各種 LLM 在以患者為導向的問答中。我們的發現顯示，通用 LLM 經常超越生物醫學適應模型，而自動化指標與人類判斷相關。透過在 PhysioNet 上發布 MeDiSumQA，我們旨在推動 LLM 的發展，以增進患者理解，並最終改善照護成果。
 
-##### **Matina: A Large-Scale 73B Token Persian Text Corpus**
-2502.09188v1 by Sara Bourbour Hosseinbeigi, Fatemeh Taherinezhad, Heshaam Faili, Hamed Baghbani, Fatemeh Nadi, Mostafa Amiri
+##### **Deep Learning Pipeline for Fully Automated Myocardial Infarct Segmentation from Clinical Cardiac MR Scans**
+2502.03272v1 by Matthias Schwab, Mathias Pamminger, Christian Kremser, Agnes Mayr
+
+Purpose: To develop and evaluate a deep learning-based method that allows to
+perform myocardial infarct segmentation in a fully-automated way.
+  Materials and Methods: For this retrospective study, a cascaded framework of
+two and three-dimensional convolutional neural networks (CNNs), specialized on
+identifying ischemic myocardial scars on late gadolinium enhancement (LGE)
+cardiac magnetic resonance (CMR) images, was trained on an in-house training
+dataset consisting of 144 examinations. On a separate test dataset from the
+same institution, including images from 152 examinations obtained between 2021
+and 2023, a quantitative comparison between artificial intelligence (AI)-based
+segmentations and manual segmentations was performed. Further, qualitative
+assessment of segmentation accuracy was evaluated for both human and
+AI-generated contours by two CMR experts in a blinded experiment.
+  Results: Excellent agreement could be found between manually and
+automatically calculated infarct volumes ($\rho_c$ = 0.9). The qualitative
+evaluation showed that compared to human-based measurements, the experts rated
+the AI-based segmentations to better represent the actual extent of infarction
+significantly (p < 0.001) more often (33.4% AI, 25.1% human, 41.5% equal). On
+the contrary, for segmentation of microvascular obstruction (MVO), manual
+measurements were still preferred (11.3% AI, 55.6% human, 33.1% equal).
+  Conclusion: This fully-automated segmentation pipeline enables CMR infarct
+size to be calculated in a very short time and without requiring any
+pre-processing of the input images while matching the segmentation quality of
+trained human observers. In a blinded experiment, experts preferred automated
+infarct segmentations more often than manual segmentations, paving the way for
+a potential clinical application.
 
-Text corpora are essential for training models used in tasks like
-summarization, translation, and large language models (LLMs). While various
-efforts have been made to collect monolingual and multilingual datasets in many
-languages, Persian has often been underrepresented due to limited resources for
-data collection and preprocessing. Existing Persian datasets are typically
-small and lack content diversity, consisting mainly of weblogs and news
-articles. This shortage of high-quality, varied data has slowed the development
-of NLP models and open-source LLMs for Persian. Since model performance depends
-heavily on the quality of training data, we address this gap by introducing the
-Matina corpus, a new Persian dataset of 72.9B tokens, carefully preprocessed
-and deduplicated to ensure high data quality. We further assess its
-effectiveness by training and evaluating transformer-based models on key NLP
-tasks. Both the dataset and preprocessing codes are publicly available,
-enabling researchers to build on and improve this resource for future Persian
-NLP advancements.
+摘要：<paragraph>目的：開發和評估一種基於深度學習的方法，允許以全自動的方式執行心肌梗塞分割。
+材料和方法：對於這項回顧性研究，一個由二維和三維卷積神經網路 (CNN) 組成的串聯架構，專門用於識別晚期釓增強 (LGE) 心臟磁振造影 (CMR) 影像上的缺血性心肌疤痕，並在包含 144 項檢查的內部訓練資料集上受訓。在來自同一家機構的獨立測試資料集上，包括 2021 年至 2023 年間獲得的 152 項檢查的影像，執行基於人工智慧 (AI) 的分割和手動分割之間的定量比較。此外，由兩位 CMR 專家在盲測實驗中評估人類和 AI 生成的輪廓的分割準確度。
+結果：在手動和自動計算的梗塞體積之間可以發現極佳的一致性（ρ_c = 0.9）。定性評估顯示，與基於人類的測量相比，專家評估 AI 基於分割能更能代表梗塞的實際範圍，顯著（p < 0.001）更常發生（33.4% AI，25.1% 人類，41.5% 相等）。相反，對於微血管阻塞 (MVO) 的分割，手動測量仍然較受青睞（11.3% AI，55.6% 人類，33.1% 相等）。
+結論：這個全自動分割管道可以在很短的時間內計算 CMR 梗塞大小，而且無需對輸入影像進行任何前處理，同時匹配受過訓練的人類觀察者的分割品質。在盲測實驗中，專家比手動分割更常偏好自動梗塞分割，為潛在的臨床應用鋪平了道路。</paragraph>
 
-摘要：文字語料庫對於訓練用於摘要、翻譯和大型語言模型 (LLM) 等任務的模型至關重要。儘管已做出各種努力來收集許多語言中的單語和多語言資料集，但由於資料收集和預處理資源有限，波斯語常常代表性不足。現有的波斯語資料集通常很小，而且缺乏內容多樣性，主要由網誌和新聞文章組成。這種優質、多樣化資料的短缺減緩了波斯語的 NLP 模型和開源 LLM 的開發。由於模型效能高度依賴訓練資料的品質，我們透過推出 Matina 語料庫來解決這個差距，Matina 語料庫是一個新的波斯語資料集，包含 72.9B 個字元，經過仔細預處理和去重，以確保資料品質。我們進一步透過在關鍵 NLP 任務上訓練和評估基於轉換器的模型來評估其有效性。資料集和預處理程式碼都是公開的，使研究人員能夠建立和改善這個資源，以促進未來的波斯語 NLP 進展。
+##### **Long-tailed Medical Diagnosis with Relation-aware Representation Learning and Iterative Classifier Calibration**
+2502.03238v2 by Li Pan, Yupei Zhang, Qiushi Yang, Tan Li, Zhen Chen
 
-##### **RefineCoder: Iterative Improving of Large Language Models via Adaptive Critique Refinement for Code Generation**
-2502.09183v1 by Changzhi Zhou, Xinyu Zhang, Dandan Song, Xiancai Chen, Wanli Gu, Huipeng Ma, Yuhang Tian, Mengdi Zhang, Linmei Hu
+Recently computer-aided diagnosis has demonstrated promising performance,
+effectively alleviating the workload of clinicians. However, the inherent
+sample imbalance among different diseases leads algorithms biased to the
+majority categories, leading to poor performance for rare categories. Existing
+works formulated this challenge as a long-tailed problem and attempted to
+tackle it by decoupling the feature representation and classification. Yet, due
+to the imbalanced distribution and limited samples from tail classes, these
+works are prone to biased representation learning and insufficient classifier
+calibration. To tackle these problems, we propose a new Long-tailed Medical
+Diagnosis (LMD) framework for balanced medical image classification on
+long-tailed datasets. In the initial stage, we develop a Relation-aware
+Representation Learning (RRL) scheme to boost the representation ability by
+encouraging the encoder to capture intrinsic semantic features through
+different data augmentations. In the subsequent stage, we propose an Iterative
+Classifier Calibration (ICC) scheme to calibrate the classifier iteratively.
+This is achieved by generating a large number of balanced virtual features and
+fine-tuning the encoder using an Expectation-Maximization manner. The proposed
+ICC compensates for minority categories to facilitate unbiased classifier
+optimization while maintaining the diagnostic knowledge in majority classes.
+Comprehensive experiments on three public long-tailed medical datasets
+demonstrate that our LMD framework significantly surpasses state-of-the-art
+approaches. The source code can be accessed at
+https://github.com/peterlipan/LMD.
 
-Code generation has attracted increasing attention with the rise of Large
-Language Models (LLMs). Many studies have developed powerful code LLMs by
-synthesizing code-related instruction data and applying supervised fine-tuning.
-However, these methods are limited by teacher model distillation and ignore the
-potential of iterative refinement by self-generated code. In this paper, we
-propose Adaptive Critique Refinement (ACR), which enables the model to refine
-itself by self-generated code and external critique, rather than directly
-imitating the code responses of the teacher model. Concretely, ACR includes a
-composite scoring system with LLM-as-a-Judge to evaluate the quality of code
-responses and a selective critique strategy with LLM-as-a-Critic to critique
-self-generated low-quality code responses. We develop the RefineCoder series by
-iteratively applying ACR, achieving continuous performance improvement on
-multiple code generation benchmarks. Compared to the baselines of the same
-size, our proposed RefineCoder series can achieve comparable or even superior
-performance using less data.
+摘要：<paragraph>最近，计算机辅助诊断已展现出可观的表现，有效减轻了临床医生的工作量。然而，不同疾病之间固有的样本不平衡导致算法偏向于多数类别，从而导致罕见类别表现不佳。现有工作将这一挑战表述为长尾问题，并尝试通过解耦特征表示和分类来解决它。然而，由于不平衡分布和尾类样本有限，这些工作容易出现有偏差的表示学习和分类器校准不足。为了解决这些问题，我们提出了一个新的长尾医学诊断 (LMD) 框架，用于对长尾数据集进行平衡的医学图像分类。在初始阶段，我们开发了一个关系感知表示学习 (RRL) 方案，通过鼓励编码器通过不同的数据增强来捕获内在语义特征，从而提升表示能力。在后续阶段，我们提出了一个迭代分类器校准 (ICC) 方案，以迭代方式校准分类器。这是通过生成大量的平衡虚拟特征并使用期望最大化方式微调编码器来实现的。所提出的 ICC 补偿了少数类别，以促进无偏分类器优化，同时保持多数类别的诊断知识。在三个公共长尾医学数据集上进行的综合实验表明，我们的 LMD 框架明显超越了最先进的方法。源代码可在 https://github.com/peterlipan/LMD 处获取。</paragraph>
 
-摘要：隨著大型語言模型 (LLM) 的興起，程式碼生成備受關注。許多研究透過綜合與程式碼相關的指令資料並應用監督式微調來開發強大的程式碼 LLM。然而，這些方法受到教師模型蒸餾的限制，且忽略了透過自行產生的程式碼進行反覆改進的潛力。在本文中，我們提出適應性批判改進 (ACR)，它使模型能夠透過自行產生的程式碼和外部批判來改進自身，而不是直接模仿教師模型的程式碼回應。具體來說，ACR 包含一個複合評分系統，其中 LLM 作為評審員來評估程式碼回應的品質，以及一個選擇性批判策略，其中 LLM 作為批判者來批判自行產生的低品質程式碼回應。我們透過反覆套用 ACR 來開發 RefineCoder 系列，在多個程式碼生成基準上實現持續的效能改善。與相同規模的基準相比，我們提出的 RefineCoder 系列可以使用較少資料來實現相當甚至更優異的效能。
+##### **Fine-Tuning Strategies for Continual Online EEG Motor Imagery Decoding: Insights from a Large-Scale Longitudinal Study**
+2502.06828v1 by Martin Wimpff, Bruno Aristimunha, Sylvain Chevallier, Bin Yang
 
-##### **FLAME: Flexible LLM-Assisted Moderation Engine**
-2502.09175v1 by Ivan Bakulin, Ilia Kopanichuk, Iaroslav Bespalov, Nikita Radchenko, Vladimir Shaposhnikov, Dmitry Dylov, Ivan Oseledets
+This study investigates continual fine-tuning strategies for deep learning in
+online longitudinal electroencephalography (EEG) motor imagery (MI) decoding
+within a causal setting involving a large user group and multiple sessions per
+participant. We are the first to explore such strategies across a large user
+group, as longitudinal adaptation is typically studied in the single-subject
+setting with a single adaptation strategy, which limits the ability to
+generalize findings. First, we examine the impact of different fine-tuning
+approaches on decoder performance and stability. Building on this, we integrate
+online test-time adaptation (OTTA) to adapt the model during deployment,
+complementing the effects of prior fine-tuning. Our findings demonstrate that
+fine-tuning that successively builds on prior subject-specific information
+improves both performance and stability, while OTTA effectively adapts the
+model to evolving data distributions across consecutive sessions, enabling
+calibration-free operation. These results offer valuable insights and
+recommendations for future research in longitudinal online MI decoding and
+highlight the importance of combining domain adaptation strategies for
+improving BCI performance in real-world applications. Clinical Relevance: Our
+investigation enables more stable and efficient long-term motor imagery
+decoding, which is critical for neurorehabilitation and assistive technologies.
 
-The rapid advancement of Large Language Models (LLMs) has introduced
-significant challenges in moderating user-model interactions. While LLMs
-demonstrate remarkable capabilities, they remain vulnerable to adversarial
-attacks, particularly ``jailbreaking'' techniques that bypass content safety
-measures. Current content moderation systems, which primarily rely on input
-prompt filtering, have proven insufficient, with techniques like Best-of-N
-(BoN) jailbreaking achieving success rates of 80% or more against popular LLMs.
-In this paper, we introduce Flexible LLM-Assisted Moderation Engine (FLAME): a
-new approach that shifts the focus from input filtering to output moderation.
-Unlike traditional circuit-breaking methods that analyze user queries, FLAME
-evaluates model responses, offering several key advantages: (1) computational
-efficiency in both training and inference, (2) enhanced resistance to BoN
-jailbreaking attacks, and (3) flexibility in defining and updating safety
-criteria through customizable topic filtering. Our experiments demonstrate that
-FLAME significantly outperforms current moderation systems. For example, FLAME
-reduces attack success rate in GPT-4o-mini and DeepSeek-v3 by a factor of ~9,
-while maintaining low computational overhead. We provide comprehensive
-evaluation on various LLMs and analyze the engine's efficiency against the
-state-of-the-art jailbreaking. This work contributes to the development of more
-robust and adaptable content moderation systems for LLMs.
+摘要：本研究探討在因果關係設定中涉及大量使用者群組和每個參與者多個階段的線上縱向腦電圖 (EEG) 運動想像 (MI) 解碼中，深度學習的持續微調策略。我們是第一個在大量使用者群組中探討此類策略，因為縱向適應通常在單一主體設定中研究，並使用單一適應策略，這限制了推廣研究結果的能力。首先，我們探討不同微調方法對解碼器效能和穩定性的影響。在此基礎上，我們整合線上測試時間適應 (OTTA) 以在部署期間適應模型，補充先前微調的效果。我們的研究結果表明，連續建立在先前特定主體資訊上的微調可以同時改善效能和穩定性，而 OTTA 可以有效地適應連續階段中不斷變化的資料分佈，從而實現無需校準的操作。這些結果為縱向線上 MI 解碼的未來研究提供了有價值的見解和建議，並強調了結合領域適應策略以改善實際應用中 BCI 效能的重要性。臨床相關性：我們的研究可以實現更穩定、更有效的長期運動想像解碼，這對於神經復健和輔助技術至關重要。
 
-摘要：大型語言模型 (LLM) 的快速進步為調節使用者與模型互動帶來重大挑戰。儘管 LLM 展現出非凡的能力，但它們仍然容易受到對抗性攻擊，特別是繞過內容安全措施的「越獄」技術。目前的內容審核系統主要依賴輸入提示過濾，已被證明不足，例如 Best-of-N (BoN) 越獄對抗熱門 LLM 的成功率達到 80% 以上。在本文中，我們介紹了靈活的 LLM 輔助審核引擎 (FLAME)：一種新的方法，將重點從輸入過濾轉移到輸出審核。與分析使用者查詢的傳統電路中斷方法不同，FLAME 評估模型回應，提供幾個關鍵優勢：(1) 訓練和推理中的計算效率，(2) 增強對 BoN 越獄攻擊的抵抗力，以及 (3) 透過可自訂主題過濾定義和更新安全標準的靈活性。我們的實驗證明，FLAME 明顯優於目前的審核系統。例如，FLAME 將 GPT-4o-mini 和 DeepSeek-v3 的攻擊成功率降低了約 9 倍，同時保持較低的計算負擔。我們對各種 LLM 進行了全面的評估，並分析了引擎對抗最新越獄的效率。這項工作有助於開發更強大且適應性更強的 LLM 內容審核系統。
+##### **MedBioLM: Optimizing Medical and Biological QA with Fine-Tuned Large Language Models and Retrieval-Augmented Generation**
+2502.03004v1 by Seonok Kim
 
-##### **Two-Stage Representation Learning for Analyzing Movement Behavior Dynamics in People Living with Dementia**
-2502.09173v1 by Jin Cui, Alexander Capstick, Payam Barnaghi, Gregory Scott
+Large Language Models (LLMs) have demonstrated impressive capabilities across
+natural language processing tasks. However, their application to specialized
+domains such as medicine and biology requires further optimization to ensure
+factual accuracy, reliability, and contextual depth. We introduce MedBioLM, a
+domain-adapted biomedical question-answering model designed to enhance both
+short-form and long-form queries. By integrating fine-tuning and
+retrieval-augmented generation (RAG), MedBioLM dynamically incorporates
+domain-specific knowledge, improving reasoning abilities and factual accuracy.
+To evaluate its effectiveness, we fine-tuned the model on diverse biomedical QA
+datasets, covering structured multiple-choice assessments and complex clinical
+reasoning tasks. Fine-tuning significantly improves accuracy on benchmark
+datasets, while RAG enhances factual consistency. These results highlight the
+potential of domain-optimized LLMs in advancing biomedical research, medical
+education, and clinical decision support.
 
-In remote healthcare monitoring, time series representation learning reveals
-critical patient behavior patterns from high-frequency data. This study
-analyzes home activity data from individuals living with dementia by proposing
-a two-stage, self-supervised learning approach tailored to uncover low-rank
-structures. The first stage converts time-series activities into text sequences
-encoded by a pre-trained language model, providing a rich, high-dimensional
-latent state space using a PageRank-based method. This PageRank vector captures
-latent state transitions, effectively compressing complex behaviour data into a
-succinct form that enhances interpretability. This low-rank representation not
-only enhances model interpretability but also facilitates clustering and
-transition analysis, revealing key behavioral patterns correlated with
-clinicalmetrics such as MMSE and ADAS-COG scores. Our findings demonstrate the
-framework's potential in supporting cognitive status prediction, personalized
-care interventions, and large-scale health monitoring.
+摘要：大型語言模型 (LLM) 已展現出在自然語言處理任務中令人印象深刻的能力。然而，要將其應用於醫學和生物學等特定領域，需要進一步最佳化，以確保事實的準確性、可靠性以及脈絡的深度。我們引進了 MedBioLM，這是一個適應領域的生物醫學問答模型，旨在增強短式和長式查詢。透過整合微調和檢索增強生成 (RAG)，MedBioLM 能動態地納入領域特定的知識，從而提升推理能力和事實準確性。為了評估其有效性，我們對模型進行微調，使其涵蓋結構化的多重選擇評量和複雜的臨床推理任務等多樣化的生物醫學問答資料集。微調顯著提升了基準資料集的準確性，而 RAG 則增強了事實的一致性。這些結果突顯了領域最佳化的 LLM 在推進生物醫學研究、醫學教育和臨床決策支援方面的潛力。
 
-摘要：在遠程醫療監控中，時間序列表示學習揭示了高頻率數據中的關鍵患者行為模式。本研究通過提出一個兩階段、自我監督的學習方法來分析痴呆症患者的家庭活動數據，該方法專門用於發現低秩結構。第一階段將時間序列活動轉換為由預訓練語言模型編碼的文本序列，使用基於 PageRank 的方法提供了一個豐富、高維的潛在狀態空間。此 PageRank 向量捕獲潛在狀態轉換，有效地將複雜的行為數據壓縮成簡潔的形式，從而增強了解力。此低秩表示不僅增強了模型的可解釋性，還促進了聚類和轉換分析，揭示了與臨床指標（例如 MMSE 和 ADAS-COG 分數）相關的關鍵行為模式。我們的研究結果證明了該框架在支持認知狀態預測、個性化護理干預和大型健康監控方面的潛力。
+##### **Contrastive Token-level Explanations for Graph-based Rumour Detection**
+2502.04366v1 by Daniel Wai Kit Chin, Roy Ka-Wei Lee
 
-##### **Musical Heritage Historical Entity Linking**
-2502.09168v1 by Arianna Graciotti, Nicolas Lazzari, Valentina Presutti, Rocco Tripodi
+The widespread use of social media has accelerated the dissemination of
+information, but it has also facilitated the spread of harmful rumours, which
+can disrupt economies, influence political outcomes, and exacerbate public
+health crises, such as the COVID-19 pandemic. While Graph Neural Network
+(GNN)-based approaches have shown significant promise in automated rumour
+detection, they often lack transparency, making their predictions difficult to
+interpret. Existing graph explainability techniques fall short in addressing
+the unique challenges posed by the dependencies among feature dimensions in
+high-dimensional text embeddings used in GNN-based models. In this paper, we
+introduce Contrastive Token Layerwise Relevance Propagation (CT-LRP), a novel
+framework designed to enhance the explainability of GNN-based rumour detection.
+CT-LRP extends current graph explainability methods by providing token-level
+explanations that offer greater granularity and interpretability. We evaluate
+the effectiveness of CT-LRP across multiple GNN models trained on three
+publicly available rumour detection datasets, demonstrating that it
+consistently produces high-fidelity, meaningful explanations, paving the way
+for more robust and trustworthy rumour detection systems.
 
-Linking named entities occurring in text to their corresponding entity in a
-Knowledge Base (KB) is challenging, especially when dealing with historical
-texts. In this work, we introduce Musical Heritage named Entities Recognition,
-Classification and Linking (MHERCL), a novel benchmark consisting of manually
-annotated sentences extrapolated from historical periodicals of the music
-domain. MHERCL contains named entities under-represented or absent in the most
-famous KBs. We experiment with several State-of-the-Art models on the Entity
-Linking (EL) task and show that MHERCL is a challenging dataset for all of
-them. We propose a novel unsupervised EL model and a method to extend
-supervised entity linkers by using Knowledge Graphs (KGs) to tackle the main
-difficulties posed by historical documents. Our experiments reveal that relying
-on unsupervised techniques and improving models with logical constraints based
-on KGs and heuristics to predict NIL entities (entities not represented in the
-KB of reference) results in better EL performance on historical documents.
+摘要：社群媒體的廣泛使用加速了資訊的傳播，但也促进了有害謠言的散播，這可能會擾亂經濟、影響政治結果，並加劇公共衛生危機，例如 COVID-19 大流行。雖然基於圖神經網路 (GNN) 的方法在自動化謠言偵測方面展現了顯著的前景，但它們通常缺乏透明度，這使得它們的預測難以解釋。現有的圖形可解釋性技術無法解決 GNN 模型中使用的維度嵌入式文本之間的依賴性所帶來的獨特挑戰。在本文中，我們介紹了對比標記分層關聯性傳播 (CT-LRP)，這是一個新穎的框架，旨在增強基於 GNN 的謠言偵測的可解釋性。CT-LRP 透過提供標記級別的解釋來擴充當前的圖形可解釋性方法，這些解釋提供了更細緻的粒度和可解釋性。我們在三個公開的謠言偵測資料集上訓練的幾個 GNN 模型中評估了 CT-LRP 的有效性，證明它始終產生高保真、有意義的解釋，為更強健且值得信賴的謠言偵測系統鋪路。
 
-摘要：將文本中出現的名稱實體連結到知識庫 (KB) 中對應的實體具有挑戰性，尤其是在處理歷史文本時。在這項工作中，我們引入了音樂遺產命名實體識別、分類和連結 (MHERCL)，這是一個由從音樂領域的歷史期刊中外推的手動標註句子組成的全新基準。MHERCL 包含在最著名的 KB 中代表性不足或不存在的名稱實體。我們在實體連結 (EL) 任務中對多個最先進的模型進行了實驗，並表明 MHERCL 對所有模型來說都是一個具有挑戰性的資料集。我們提出了一個新的無監督 EL 模型和一個通過使用知識圖 (KG) 來擴充監督式實體連結器的的方法，以解決歷史文件提出的主要難題。我們的實驗表明，依賴無監督技術並使用基於 KG 和啟發法的邏輯約束來改善模型以預測 NIL 實體（未在參考 KB 中表示的實體）會在歷史文件中產生更好的 EL 效能。
+##### **AI-Based Thermal Video Analysis in Privacy-Preserving Healthcare: A Case Study on Detecting Time of Birth**
+2502.04365v1 by Jorge García-Torres, Øyvind Meinich-Bache, Siren Rettedal, Kjersti Engan
 
-##### **Improving TCM Question Answering through Tree-Organized Self-Reflective Retrieval with LLMs**
-2502.09156v1 by Chang Liu, Ying Chang, Jianmin Li, Yiqian Qu, Yu Li, Lingyong Cao, Shuyuan Lin
+Approximately 10% of newborns need some assistance to start breathing and 5\%
+proper ventilation. It is crucial that interventions are initiated as soon as
+possible after birth. Accurate documentation of Time of Birth (ToB) is thereby
+essential for documenting and improving newborn resuscitation performance.
+However, current clinical practices rely on manual recording of ToB, typically
+with minute precision. In this study, we present an AI-driven, video-based
+system for automated ToB detection using thermal imaging, designed to preserve
+the privacy of healthcare providers and mothers by avoiding the use of
+identifiable visual data. Our approach achieves 91.4% precision and 97.4%
+recall in detecting ToB within thermal video clips during performance
+evaluation. Additionally, our system successfully identifies ToB in 96% of test
+cases with an absolute median deviation of 1 second compared to manual
+annotations. This method offers a reliable solution for improving ToB
+documentation and enhancing newborn resuscitation outcomes.
 
-Objectives: Large language models (LLMs) can harness medical knowledge for
-intelligent question answering (Q&A), promising support for auxiliary diagnosis
-and medical talent cultivation. However, there is a deficiency of highly
-efficient retrieval-augmented generation (RAG) frameworks within the domain of
-Traditional Chinese Medicine (TCM). Our purpose is to observe the effect of the
-Tree-Organized Self-Reflective Retrieval (TOSRR) framework on LLMs in TCM Q&A
-tasks.
-  Materials and Methods: We introduce the novel approach of knowledge
-organization, constructing a tree structure knowledge base with hierarchy. At
-inference time, our self-reflection framework retrieves from this knowledge
-base, integrating information across chapters. Questions from the TCM Medical
-Licensing Examination (MLE) and the college Classics Course Exam (CCE) were
-randomly selected as benchmark datasets.
-  Results: By coupling with GPT-4, the framework can improve the best
-performance on the TCM MLE benchmark by 19.85% in absolute accuracy, and
-improve recall accuracy from 27% to 38% on CCE datasets. In manual evaluation,
-the framework improves a total of 18.52 points across dimensions of safety,
-consistency, explainability, compliance, and coherence.
-  Conclusion: The TOSRR framework can effectively improve LLM's capability in
-Q&A tasks of TCM.
+摘要：約 10% 的新生兒需要協助才能開始呼吸，5% 需要適當的通氣。在出生後盡快開始介入至關重要。準確記錄出生時間 (ToB) 對於記錄和改善新生兒復甦表現至關重要。然而，目前的臨床實務依賴於手動記錄 ToB，通常精確到分鐘。在這項研究中，我們提出一個以 AI 為主的、基於影片的系統，用於使用熱影像自動偵測 ToB，旨在透過避免使用可識別的視覺資料來保護醫療保健提供者和母親的隱私。我們的做法在執行評估期間，在熱影像片段中偵測 ToB 時達到了 91.4% 的精確度和 97.4% 的召回率。此外，我們的系統在 96% 的測試案例中成功識別出 ToB，與手動註解相比，絕對中位數偏差為 1 秒。此方法提供了一個可靠的解決方案，用於改善 ToB 記錄和增強新生兒復甦結果。
 
-摘要：目標：大型語言模型（LLM）可以利用醫療知識進行智能問答（Q&A），承諾支持輔助診斷和醫療人才培養。然而，在中醫領域內缺乏高效的檢索增強生成（RAG）框架。我們的目的是觀察樹組織自省檢索（TOSRR）框架對中醫問答任務中 LLM 的影響。
-材料和方法：我們引入了知識組織的新方法，構建了一個具有層次的樹結構知識庫。在推理時間，我們的自省框架從這個知識庫中檢索，整合章節中的信息。中醫醫師資格考試（MLE）和大學經典課程考試（CCE）中的問題被隨機選為基準數據集。
-結果：通過與 GPT-4 結合，該框架可以將中醫 MLE 基準上的最佳性能提高 19.85% 的絕對準確度，並將 CCE 數據集上的召回準確度從 27% 提高到 38%。在手動評估中，該框架在安全性、一致性、可解釋性、合規性和連貫性方面總共提高了 18.52 分。
-結論：TOSRR 框架可以有效提升 LLM 在中醫問答任務中的能力。
+##### **3D Foundation AI Model for Generalizable Disease Detection in Head Computed Tomography**
+2502.02779v1 by Weicheng Zhu, Haoxu Huang, Huanze Tang, Rushabh Musthyala, Boyang Yu, Long Chen, Emilio Vega, Thomas O'Donnell, Seena Dehkharghani, Jennifer A. Frontera, Arjun V. Masurkar, Kara Melmed, Narges Razavian
 
-##### **A Novel Dialect-Aware Framework for the Classification of Arabic Dialects and Emotions**
-2502.09128v1 by Nasser A Alsadhan
+Head computed tomography (CT) imaging is a widely-used imaging modality with
+multitudes of medical indications, particularly in assessing pathology of the
+brain, skull, and cerebrovascular system. It is commonly the first-line imaging
+in neurologic emergencies given its rapidity of image acquisition, safety,
+cost, and ubiquity. Deep learning models may facilitate detection of a wide
+range of diseases. However, the scarcity of high-quality labels and
+annotations, particularly among less common conditions, significantly hinders
+the development of powerful models. To address this challenge, we introduce
+FM-CT: a Foundation Model for Head CT for generalizable disease detection,
+trained using self-supervised learning. Our approach pre-trains a deep learning
+model on a large, diverse dataset of 361,663 non-contrast 3D head CT scans
+without the need for manual annotations, enabling the model to learn robust,
+generalizable features. To investigate the potential of self-supervised
+learning in head CT, we employed both discrimination with self-distillation and
+masked image modeling, and we construct our model in 3D rather than at the
+slice level (2D) to exploit the structure of head CT scans more comprehensively
+and efficiently. The model's downstream classification performance is evaluated
+using internal and three external datasets, encompassing both in-distribution
+(ID) and out-of-distribution (OOD) data. Our results demonstrate that the
+self-supervised foundation model significantly improves performance on
+downstream diagnostic tasks compared to models trained from scratch and
+previous 3D CT foundation models on scarce annotated datasets. This work
+highlights the effectiveness of self-supervised learning in medical imaging and
+sets a new benchmark for head CT image analysis in 3D, enabling broader use of
+artificial intelligence for head CT-based diagnosis.
 
-Arabic is one of the oldest languages still in use today. As a result,
-several Arabic-speaking regions have developed dialects that are unique to
-them. Dialect and emotion recognition have various uses in Arabic text
-analysis, such as determining an online customer's origin based on their
-comments. Furthermore, intelligent chatbots that are aware of a user's emotions
-can respond appropriately to the user. Current research in emotion detection in
-the Arabic language lacks awareness of how emotions are exhibited in different
-dialects, which motivates the work found in this study. This research addresses
-the problems of dialect and emotion classification in Arabic. Specifically,
-this is achieved by building a novel framework that can identify and predict
-Arabic dialects and emotions from a given text. The framework consists of three
-modules: A text-preprocessing module, a classification module, and a clustering
-module with the novel capability of building new dialect-aware emotion
-lexicons. The proposed framework generated a new emotional lexicon for
-different dialects. It achieved an accuracy of 88.9% in classifying Arabic
-dialects, which outperforms the state-of-the-art results by 6.45 percentage
-points. Furthermore, the framework achieved 89.1-79% accuracy in detecting
-emotions in the Egyptian and Gulf dialects, respectively.
+摘要：頭部電腦斷層掃描（CT）影像是一種廣泛使用的影像模式，具有
+大量的醫療適應症，特別是在評估腦部、頭骨和腦血管系統的病理時。由於其影像擷取速度快、安全性、成本低和普遍性，通常是神經緊急情況下的第一線影像。深度學習模型可以促進對各種疾病的檢測。然而，高品質標籤和註釋的稀缺，特別是在較不常見的疾病中，顯著地阻礙了強大模型的發展。為了應對這一挑戰，我們引入了 FM-CT：一個用於頭部 CT 的基礎模型，用於可概化的疾病檢測，並使用自我監督學習進行訓練。我們的做法在一個包含 361,663 個非對比 3D 頭部 CT 掃描的大型、多樣化的數據集上預訓練一個深度學習模型，而無需手動註釋，使模型能夠學習強健、可概化的特徵。為了探討自我監督學習在頭部 CT 中的潛力，我們同時採用了帶有自我蒸餾的判別和遮罩影像建模，並且我們以 3D 而不是切片層級（2D）構建我們的模型，以更全面、有效地利用頭部 CT 掃描的結構。該模型的下游分類效能使用內部和三個外部數據集進行評估，包括分佈內 (ID) 和分佈外 (OOD) 資料。我們的結果表明，與從頭開始訓練的模型和先前在稀疏註釋數據集上訓練的 3D CT 基礎模型相比，自我監督基礎模型顯著改善了下游診斷任務的效能。這項工作突顯了自我監督學習在醫學影像中的有效性，並為 3D 頭部 CT 影像分析設定了一個新的基準，讓人工智慧能夠更廣泛地用於基於頭部 CT 的診斷。
 
-摘要：阿拉伯語是現今仍在使用中最古老的語言之一。因此，幾個講阿拉伯語的地區發展出獨特的方言。方言和情緒辨識在阿拉伯語文本分析中有多種用途，例如根據在線客戶的評論來確定其來源。此外，知道使用者情緒的智慧聊天機器人可以適當地回應使用者。目前對阿拉伯語情緒偵測的研究缺乏對不同方言如何表現情緒的認識，這激勵了本研究中的工作。本研究探討了阿拉伯語中的方言和情緒分類問題。具體而言，這是通過建立一個新的框架來實現的，該框架可以識別和預測給定文本中的阿拉伯方言和情緒。該框架包含三個模組：文字預處理模組、分類模組和聚類模組，具有建立新的方言感知情緒詞彙表的新功能。所提出的框架為不同的方言生成了新的情緒詞彙表。它在分類阿拉伯方言方面達到了 88.9% 的準確率，比最先進的結果高出 6.45 個百分點。此外，該框架在檢測埃及和海灣方言的情緒方面分別達到了 89.1-79% 的準確率。
+##### **Adaptive Voxel-Weighted Loss Using L1 Norms in Deep Neural Networks for Detection and Segmentation of Prostate Cancer Lesions in PET/CT Images**
+2502.02756v1 by Obed Korshie Dzikunu, Shadab Ahamed, Amirhossein Toosi, Xiaoxiao Li, Arman Rahmim
 
-##### **Automatic Pruning via Structured Lasso with Class-wise Information**
-2502.09125v1 by Xiang Liu, Mingchen Li, Xia Li, Leigang Qu, Zifan Peng, Yijun Song, Zemin Liu, Linshan Jiang, Jialin Li
+This study proposes a new loss function for deep neural networks, L1-weighted
+Dice Focal Loss (L1DFL), that leverages L1 norms for adaptive weighting of
+voxels based on their classification difficulty, towards automated detection
+and segmentation of metastatic prostate cancer lesions in PET/CT scans. We
+obtained 380 PSMA [18-F] DCFPyL PET/CT scans of patients diagnosed with
+biochemical recurrence metastatic prostate cancer. We trained two 3D
+convolutional neural networks, Attention U-Net and SegResNet, and concatenated
+the PET and CT volumes channel-wise as input. The performance of our custom
+loss function was evaluated against the Dice and Dice Focal Loss functions. For
+clinical significance, we considered a detected region of interest (ROI) as a
+true positive if at least the voxel with the maximum standardized uptake value
+falls within the ROI. We assessed the models' performance based on the number
+of lesions in an image, tumour volume, activity, and extent of spread. The
+L1DFL outperformed the comparative loss functions by at least 13% on the test
+set. In addition, the F1 scores of the Dice Loss and the Dice Focal Loss were
+lower than that of L1DFL by at least 6% and 34%, respectively. The Dice Focal
+Loss yielded more false positives, whereas the Dice Loss was more sensitive to
+smaller volumes and struggled to segment larger lesions accurately. They also
+exhibited network-specific variations and yielded declines in segmentation
+accuracy with increased tumour spread. Our results demonstrate the potential of
+L1DFL to yield robust segmentation of metastatic prostate cancer lesions in
+PSMA PET/CT images. The results further highlight potential complexities
+arising from the variations in lesion characteristics that may influence
+automated prostate cancer tumour detection and segmentation. The code is
+publicly available at: https://github.com/ObedDzik/pca_segment.git.
 
-Most pruning methods concentrate on unimportant filters of neural networks.
-However, they face the loss of statistical information due to a lack of
-consideration for class-wise data. In this paper, from the perspective of
-leveraging precise class-wise information for model pruning, we utilize
-structured lasso with guidance from Information Bottleneck theory. Our approach
-ensures that statistical information is retained during the pruning process.
-With these techniques, we introduce two innovative adaptive network pruning
-schemes: sparse graph-structured lasso pruning with Information Bottleneck
-(\textbf{sGLP-IB}) and sparse tree-guided lasso pruning with Information
-Bottleneck (\textbf{sTLP-IB}). The key aspect is pruning model filters using
-sGLP-IB and sTLP-IB to better capture class-wise relatedness. Compared to
-multiple state-of-the-art methods, our approaches demonstrate superior
-performance across three datasets and six model architectures in extensive
-experiments. For instance, using the VGG16 model on the CIFAR-10 dataset, we
-achieve a parameter reduction of 85%, a decrease in FLOPs by 61%, and maintain
-an accuracy of 94.10% (0.14% higher than the original model); we reduce the
-parameters by 55% with the accuracy at 76.12% using the ResNet architecture on
-ImageNet (only drops 0.03%). In summary, we successfully reduce model size and
-computational resource usage while maintaining accuracy. Our codes are at
-https://anonymous.4open.science/r/IJCAI-8104.
+摘要：<paragraph>本研究針對深度神經網路提出一個新的損失函數，L1 加權 Dice 焦點損失 (L1DFL)，它利用 L1 範數根據體素的分類難度進行自適應加權，用於自動偵測和分割 PET/CT 掃描中轉移性前列腺癌病灶。我們取得 380 個經診斷為生化復發轉移性前列腺癌的患者的 PSMA [18-F] DCFPyL PET/CT 掃描。我們訓練了兩個 3D 捲積神經網路，Attention U-Net 和 SegResNet，並將 PET 和 CT 體積按通道連接作為輸入。我們自訂的損失函數的效能與 Dice 和 Dice 焦點損失函數進行評估。為了臨床意義，我們將一個偵測到的感興趣區域 (ROI) 視為真陽性，如果至少具有最大標準攝取值的體素落在 ROI 內。我們根據影像中的病灶數量、腫瘤體積、活性，以及擴散程度評估模型的效能。L1DFL 在測試組中至少比比較損失函數高出 13%。此外，Dice 損失和 Dice 焦點損失的 F1 分數分別比 L1DFL 低至少 6% 和 34%。Dice 焦點損失產生更多假陽性，而 Dice 損失對較小體積較為敏感，且難以準確分割較大病灶。它們也展現出網路特定的變化，並隨著腫瘤擴散而導致分割準確度下降。我們的結果證明 L1DFL 具有在 PSMA PET/CT 影像中產生轉移性前列腺癌病灶的強健分割的潛力。結果進一步強調由病灶特徵變化所產生的潛在複雜性，這可能會影響自動化前列腺癌腫瘤偵測和分割。程式碼公開於：https://github.com/ObedDzik/pca_segment.git。</paragraph>
 
-摘要：大多數剪枝方法都集中在神經網路中不重要的濾波器上。
-然而，由於缺乏對類別資料的考量，它們面臨統計資訊的遺失。在本文中，我們從利用精確類別資訊進行模型剪枝的角度，利用結構化套索搭配資訊瓶頸理論的指導。我們的做法確保在剪枝過程中保留統計資訊。藉由這些技術，我們引入了兩個創新的自適應網路剪枝方案：帶有資訊瓶頸的稀疏圖形結構套索剪枝（sGLP-IB）和帶有資訊瓶頸的稀疏樹導引套索剪枝（sTLP-IB）。關鍵方面是使用 sGLP-IB 和 sTLP-IB 剪枝模型濾波器，以更好地擷取類別關聯性。與多種最先進的方法相比，我們的做法在廣泛的實驗中展現出跨三個資料集和六個模型架構的卓越效能。例如，在 CIFAR-10 資料集上使用 VGG16 模型，我們達到了 85% 的參數減少、61% 的 FLOP 減少，並維持 94.10% 的準確度（比原始模型高 0.14%）；我們在 ImageNet 上使用 ResNet 架構將參數減少了 55%，準確度為 76.12%（僅下降 0.03%）。總之，我們成功地減少了模型大小和計算資源使用，同時維持準確度。我們的程式碼位於 https://anonymous.4open.science/r/IJCAI-8104。
+##### **Diffusion Instruction Tuning**
+2502.06814v1 by Chen Jin, Ryutaro Tanno, Amrutha Saseendran, Tom Diethe, Philip Teare
 
-##### **The influence of visual and linguistic cues on ignorance inference in Vision-Language Models (VLMs)**
-2502.09120v1 by Ye-eun Cho, Yunho Maeng
+We introduce Lavender, a simple supervised fine-tuning (SFT) method that
+boosts the performance of advanced vision-language models (VLMs) by leveraging
+state-of-the-art image generation models such as Stable Diffusion.
+Specifically, Lavender aligns the text-vision attention in the VLM transformer
+with the equivalent used by Stable Diffusion during SFT, instead of adapting
+separate encoders. This alignment enriches the model's visual understanding and
+significantly boosts performance across in- and out-of-distribution tasks.
+Lavender requires just 0.13 million training examples, 2.5% of typical
+large-scale SFT datasets, and fine-tunes on standard hardware (8 GPUs) in a
+single day. It consistently improves state-of-the-art open-source multimodal
+LLMs (e.g., Llama-3.2-11B, MiniCPM-Llama3-v2.5), achieving up to 30% gains and
+a 68% boost on challenging out-of-distribution medical QA tasks. By efficiently
+transferring the visual expertise of image generators with minimal supervision,
+Lavender offers a scalable solution for more accurate vision-language systems.
+All code, training data, and models will be shared at
+https://astrazeneca.github.io/vlm/.
 
-This study explored how Vision-Language Models (VLMs) process ignorance
-implicatures with visual and linguistic cues. Particularly, we focused on the
-effects of contexts (precise and approximate contexts) and modifier types (bare
-numerals, superlative, and comparative modifiers), which were considered
-pragmatic and semantic factors respectively. Methodologically, we conducted a
-truth-value judgment task in visually grounded settings using GPT-4o and Gemini
-1.5 Pro. The results indicate that while both models exhibited sensitivity to
-linguistic cues (modifier), they failed to process ignorance implicatures with
-visual cues (context) as humans do. Specifically, the influence of context was
-weaker and inconsistent across models, indicating challenges in pragmatic
-reasoning for VLMs. On the other hand, superlative modifiers were more strongly
-associated with ignorance implicatures as compared to comparative modifiers,
-supporting the semantic view. These findings highlight the need for further
-advancements in VLMs to process language-vision information in a
-context-dependent way to achieve human-like pragmatic inference.
+摘要：<paragraph>我們介紹 Lavender，一種簡單的監督微調 (SFT) 方法，它透過利用 Stable Diffusion 等最先進的影像生成模型來提升先進視覺語言模型 (VLM) 的效能。
+具體來說，Lavender 在 SFT 期間將 VLM 轉換器中的文字視覺注意力與 Stable Diffusion 使用的等效注意力對齊，而不是調整單獨的編碼器。此對齊豐富了模型的視覺理解，並顯著提升了分佈內外任務的效能。
+Lavender 只需要 0.13 百萬個訓練範例，相當於典型大型 SFT 資料集的 2.5%，並在標準硬體 (8 個 GPU) 上於一天內進行微調。它持續改善最先進的開放原始碼多模態 LLM（例如 Llama-3.2-11B、MiniCPM-Llama3-v2.5），在具有挑戰性的分佈外醫療 QA 任務中獲得高達 30% 的收益和 68% 的提升。透過有效轉移影像生成器的視覺專業知識，並僅需最少的監督，Lavender 提供了一個可擴充的解決方案，以實現更準確的視覺語言系統。
+所有程式碼、訓練資料和模型將在 https://astrazeneca.github.io/vlm/ 分享。</paragraph>
 
-摘要：本研究探討了視覺語言模型 (VLM) 如何處理視覺和語言線索中的無知含義。特別是，我們專注於語境（精確和近似語境）和修飾語類型（裸數字、最高級和比較級修飾語）的影響，這些分別被視為語用和語義因素。在方法論上，我們使用 GPT-4o 和 Gemini 1.5 Pro 在視覺基礎設置中進行了真值判斷任務。結果表明，儘管這兩個模型都對語言線索（修飾語）表現出敏感性，但它們未能像人類那樣處理帶有視覺線索（語境）的無知含義。具體來說，語境的影響在各個模型中較弱且不一致，表明 VLM 在語用推理方面存在挑戰。另一方面，與比較級修飾語相比，最高級修飾語與無知含義的關聯性更強，這支持了語義觀點。這些發現強調了 VLM 進一步發展的必要性，以以語境依賴的方式處理語言視覺信息，以實現類人語用推理。
+##### **MedRAX: Medical Reasoning Agent for Chest X-ray**
+2502.02673v1 by Adibvafa Fallahpour, Jun Ma, Alif Munim, Hongwei Lyu, Bo Wang
 
-##### **One-shot Federated Learning Methods: A Practical Guide**
-2502.09104v1 by Xiang Liu, Zhenheng Tang, Xia Li, Yijun Song, Sijie Ji, Zemin Liu, Bo Han, Linshan Jiang, Jialin Li
+Chest X-rays (CXRs) play an integral role in driving critical decisions in
+disease management and patient care. While recent innovations have led to
+specialized models for various CXR interpretation tasks, these solutions often
+operate in isolation, limiting their practical utility in clinical practice. We
+present MedRAX, the first versatile AI agent that seamlessly integrates
+state-of-the-art CXR analysis tools and multimodal large language models into a
+unified framework. MedRAX dynamically leverages these models to address complex
+medical queries without requiring additional training. To rigorously evaluate
+its capabilities, we introduce ChestAgentBench, a comprehensive benchmark
+containing 2,500 complex medical queries across 7 diverse categories. Our
+experiments demonstrate that MedRAX achieves state-of-the-art performance
+compared to both open-source and proprietary models, representing a significant
+step toward the practical deployment of automated CXR interpretation systems.
+Data and code have been publicly available at
+https://github.com/bowang-lab/MedRAX
 
-One-shot Federated Learning (OFL) is a distributed machine learning paradigm
-that constrains client-server communication to a single round, addressing
-privacy and communication overhead issues associated with multiple rounds of
-data exchange in traditional Federated Learning (FL). OFL demonstrates the
-practical potential for integration with future approaches that require
-collaborative training models, such as large language models (LLMs). However,
-current OFL methods face two major challenges: data heterogeneity and model
-heterogeneity, which result in subpar performance compared to conventional FL
-methods. Worse still, despite numerous studies addressing these limitations, a
-comprehensive summary is still lacking. To address these gaps, this paper
-presents a systematic analysis of the challenges faced by OFL and thoroughly
-reviews the current methods. We also offer an innovative categorization method
-and analyze the trade-offs of various techniques. Additionally, we discuss the
-most promising future directions and the technologies that should be integrated
-into the OFL field. This work aims to provide guidance and insights for future
-research.
+摘要：胸部 X 光片 (CXR) 在疾病管理和患者照護中扮演著不可或缺的角色，推動著關鍵決策的制定。儘管近期的創新已針對各種 CXR 解讀任務開發出專門的模型，但這些解決方案通常獨立運作，限制了它們在臨床實務中的實際效用。我們提出 MedRAX，這是一款首創的多功能 AI 代理，它將最先進的 CXR 分析工具和多模態大型語言模型無縫整合到一個統一的架構中。MedRAX 動態運用這些模型來解決複雜的醫療查詢，而無需額外的訓練。為了嚴格評估其功能，我們引入了 ChestAgentBench，這是一個全面的基準，包含 7 個不同類別的 2,500 個複雜醫療查詢。我們的實驗證明，與開源和專有模型相比，MedRAX 達到了最先進的效能，這代表了自動化 CXR 解讀系統實際部署的重要一步。資料和程式碼已公開於 https://github.com/bowang-lab/MedRAX
 
-摘要：單次聯邦學習 (OFL) 是一種分散式機器學習範例，將客戶端與伺服器通訊限制在單一輪次中，解決傳統聯邦學習 (FL) 中多輪次資料交換相關的隱私和通訊負擔問題。OFL 展示了與需要協作訓練模型的未來方法整合的實際潛力，例如大型語言模型 (LLM)。然而，目前的 OFL 方法面臨兩大挑戰：資料異質性和模型異質性，這導致與傳統 FL 方法相比，效能較差。更糟的是，儘管有許多研究探討這些限制，但仍缺乏全面的摘要。為了解決這些差距，本文對 OFL 面臨的挑戰進行系統分析，並徹底檢視目前的方法。我們還提供創新的分類方法，並分析各種技術的權衡取捨。此外，我們討論最有希望的未來方向，以及應整合到 OFL 領域的技術。這項工作旨在為未來的研究提供指導和見解。
+##### **Open Foundation Models in Healthcare: Challenges, Paradoxes, and Opportunities with GenAI Driven Personalized Prescription**
+2502.04356v1 by Mahdi Alkaeed, Sofiat Abioye, Adnan Qayyum, Yosra Magdi Mekki, Ilhem Berrou, Mohamad Abdallah, Ala Al-Fuqaha, Muhammad Bilal, Junaid Qadir
 
-##### **Logical Reasoning in Large Language Models: A Survey**
-2502.09100v1 by Hanmeng Liu, Zhizhang Fu, Mengru Ding, Ruoxi Ning, Chaoli Zhang, Xiaozhang Liu, Yue Zhang
+In response to the success of proprietary Large Language Models (LLMs) such
+as OpenAI's GPT-4, there is a growing interest in developing open,
+non-proprietary LLMs and AI foundation models (AIFMs) for transparent use in
+academic, scientific, and non-commercial applications. Despite their inability
+to match the refined functionalities of their proprietary counterparts, open
+models hold immense potential to revolutionize healthcare applications. In this
+paper, we examine the prospects of open-source LLMs and AIFMs for developing
+healthcare applications and make two key contributions. Firstly, we present a
+comprehensive survey of the current state-of-the-art open-source healthcare
+LLMs and AIFMs and introduce a taxonomy of these open AIFMs, categorizing their
+utility across various healthcare tasks. Secondly, to evaluate the
+general-purpose applications of open LLMs in healthcare, we present a case
+study on personalized prescriptions. This task is particularly significant due
+to its critical role in delivering tailored, patient-specific medications that
+can greatly improve treatment outcomes. In addition, we compare the performance
+of open-source models with proprietary models in settings with and without
+Retrieval-Augmented Generation (RAG). Our findings suggest that, although less
+refined, open LLMs can achieve performance comparable to proprietary models
+when paired with grounding techniques such as RAG. Furthermore, to highlight
+the clinical significance of LLMs-empowered personalized prescriptions, we
+perform subjective assessment through an expert clinician. We also elaborate on
+ethical considerations and potential risks associated with the misuse of
+powerful LLMs and AIFMs, highlighting the need for a cautious and responsible
+implementation in healthcare.
 
-With the emergence of advanced reasoning models like OpenAI o3 and
-DeepSeek-R1, large language models (LLMs) have demonstrated remarkable
-reasoning capabilities. However, their ability to perform rigorous logical
-reasoning remains an open question. This survey synthesizes recent advancements
-in logical reasoning within LLMs, a critical area of AI research. It outlines
-the scope of logical reasoning in LLMs, its theoretical foundations, and the
-benchmarks used to evaluate reasoning proficiency. We analyze existing
-capabilities across different reasoning paradigms - deductive, inductive,
-abductive, and analogical - and assess strategies to enhance reasoning
-performance, including data-centric tuning, reinforcement learning, decoding
-strategies, and neuro-symbolic approaches. The review concludes with future
-directions, emphasizing the need for further exploration to strengthen logical
-reasoning in AI systems.
+摘要：<paragraph>為了回應 OpenAI 的 GPT-4 等專有大型語言模型 (LLM) 的成功，開發開放、非專有的 LLM 和人工智慧基礎模型 (AIFM) 以透明地用於學術、科學和非商業應用中，引起了越來越大的興趣。儘管無法與其專有對應產品的精緻功能相匹配，但開放模型在革新醫療保健應用方面具有巨大的潛力。在本文中，我們探討了開放原始碼 LLM 和 AIFM 在開發醫療保健應用方面的前景，並提出了兩項關鍵貢獻。首先，我們對當前最先進的開放原始碼醫療保健 LLM 和 AIFM 進行了全面的調查，並介紹了這些開放 AIFM 的分類法，對它們在各種醫療保健任務中的效用進行了分類。其次，為了評估開放 LLM 在醫療保健中的通用應用，我們對個人化處方進行了案例研究。這項任務特別重要，因為它在提供量身定制的患者特定藥物方面發揮著關鍵作用，可以大大改善治療效果。此外，我們比較了開放原始碼模型與專有模型在有和沒有檢索增強生成 (RAG) 的設置中的性能。我們的研究結果表明，儘管不太精緻，但開放 LLM 在與 RAG 等基礎技術配對時，可以實現與專有模型相當的性能。此外，為了強調 LLM 賦能的個性化處方的臨床意義，我們通過專家臨床醫生進行了主觀評估。我們還詳細說明了與濫用強大的 LLM 和 AIFM 相關的倫理考量和潛在風險，強調了在醫療保健中謹慎和負責任地實施的必要性。</paragraph>
 
-摘要：隨著 OpenAI o3 和 DeepSeek-R1 等先進推理模型的出現，大型語言模型 (LLM) 已展現出非凡的推理能力。然而，它們執行嚴謹邏輯推理的能力仍是一個開放性的問題。此調查綜合了 LLM 中邏輯推理的最新進展，這是 AI 研究的一個關鍵領域。它概述了 LLM 中邏輯推理的範圍、其理論基礎，以及用於評估推理能力的基準。我們分析了不同推理範例（演繹、歸納、外推和類比）中的現有能力，並評估增強推理效能的策略，包括以數據為中心的調整、強化學習、解碼策略和神經符號方法。此評論以未來的方向作為結論，強調需要進一步探索以強化 AI 系統中的邏輯推理。
+##### **Decision Theoretic Foundations for Conformal Prediction: Optimal Uncertainty Quantification for Risk-Averse Agents**
+2502.02561v1 by Shayan Kiyani, George Pappas, Aaron Roth, Hamed Hassani
 
-##### **A Hybrid Transformer Model for Fake News Detection: Leveraging Bayesian Optimization and Bidirectional Recurrent Unit**
-2502.09097v1 by Tianyi Huang, Zeqiu Xu, Peiyang Yu, Jingyuan Yi, Xiaochuan Xu
+A fundamental question in data-driven decision making is how to quantify the
+uncertainty of predictions in ways that can usefully inform downstream action.
+This interface between prediction uncertainty and decision-making is especially
+important in risk-sensitive domains, such as medicine. In this paper, we
+develop decision-theoretic foundations that connect uncertainty quantification
+using prediction sets with risk-averse decision-making. Specifically, we answer
+three fundamental questions: (1) What is the correct notion of uncertainty
+quantification for risk-averse decision makers? We prove that prediction sets
+are optimal for decision makers who wish to optimize their value at risk. (2)
+What is the optimal policy that a risk averse decision maker should use to map
+prediction sets to actions? We show that a simple max-min decision policy is
+optimal for risk-averse decision makers. Finally, (3) How can we derive
+prediction sets that are optimal for such decision makers? We provide an exact
+characterization in the population regime and a distribution free finite-sample
+construction. Answering these questions naturally leads to an algorithm,
+Risk-Averse Calibration (RAC), which follows a provably optimal design for
+deriving action policies from predictions. RAC is designed to be both
+practical-capable of leveraging the quality of predictions in a black-box
+manner to enhance downstream utility-and safe-adhering to a user-defined risk
+threshold and optimizing the corresponding risk quantile of the user's
+downstream utility. Finally, we experimentally demonstrate the significant
+advantages of RAC in applications such as medical diagnosis and recommendation
+systems. Specifically, we show that RAC achieves a substantially improved
+trade-off between safety and utility, offering higher utility compared to
+existing methods while maintaining the safety guarantee.
 
-In this paper, we propose an optimized Transformer model that integrates
-Bayesian algorithms with a Bidirectional Gated Recurrent Unit (BiGRU), and
-apply it to fake news classification for the first time. First, we employ the
-TF-IDF method to extract features from news texts and transform them into
-numeric representations to facilitate subsequent machine learning tasks. Two
-sets of experiments are then conducted for fake news detection and
-classification: one using a Transformer model optimized only with BiGRU, and
-the other incorporating Bayesian algorithms into the BiGRU-based Transformer.
-Experimental results show that the BiGRU-optimized Transformer achieves 100%
-accuracy on the training set and 99.67% on the test set, while the addition of
-the Bayesian algorithm maintains 100% accuracy on the training set and slightly
-improves test-set accuracy to 99.73%. This indicates that the Bayesian
-algorithm boosts model accuracy by 0.06%, further enhancing the detection
-capability for fake news. Moreover, the proposed algorithm converges rapidly at
-around the 10th training epoch with accuracy nearing 100%, demonstrating both
-its effectiveness and its fast classification ability. Overall, the optimized
-Transformer model, enhanced by the Bayesian algorithm and BiGRU, exhibits
-excellent continuous learning and detection performance, offering a robust
-technical means to combat the spread of fake news in the current era of
-information overload.
+摘要：<paragraph>在資料驅動決策中，一個基本問題是，如何量化預測的不確定性，以能有用地告知下游行動。
+預測不確定性和決策制定之間的這種介面，在風險敏感領域中特別重要，例如醫學。在本文中，我們
+發展了決策理論基礎，它利用預測集合將不確定性量化與風險規避決策制定聯繫起來。具體來說，我們回答
+了三個基本問題：(1) 對於風險規避決策者來說，不確定性量化的正確概念是什麼？我們證明，對於希望最佳化其風險價值的決策者來說，預測集合是最佳的。(2)
+風險規避決策者應使用什麼最佳政策，將預測集合映射到行動？我們表明，對於風險規避決策者來說，一個簡單的最大最小決策政策是最佳的。最後，(3) 我們如何推導出對此類決策者來說最佳的預測集合？我們在總體範圍內提供了一個確切的表徵，並提供了一個不依賴分佈的有限樣本建構。回答這些問題自然會導致一個演算法，風險規避校準 (RAC)，它遵循一個可證明最佳的設計，從預測中推導出行動政策。RAC 被設計為既實用——能夠以黑盒方式利用預測的品質來增強下游效用——又安全——遵守使用者定義的風險閾值，並最佳化使用者的下游效用的對應風險分位數。最後，我們在醫學診斷和推薦系統等應用中，以實驗方式證明了 RAC 的顯著優點。具體來說，我們表明，與現有方法相比，RAC 在安全性和效用之間實現了顯著改善的折衷，在維持安全保證的同時，提供了更高的效用。</paragraph>
 
-摘要：<paragraph>在本文中，我們提出了一個最佳化的 Transformer 模型，它將貝氏演算法與雙向門控遞迴單元 (BiGRU) 整合在一起，並首次將其應用於假新聞分類。首先，我們採用 TF-IDF 方法從新聞文本中提取特徵，並將它們轉換為數值表示，以利於後續的機器學習任務。接著進行兩組實驗，分別針對假新聞偵測和分類：一組使用僅使用 BiGRU 最佳化的 Transformer 模型，另一組將貝氏演算法納入基於 BiGRU 的 Transformer 中。實驗結果顯示，BiGRU 最佳化的 Transformer 在訓練組上達到 100% 的準確度，在測試組上達到 99.67%，而加入貝氏演算法後，在訓練組上維持 100% 的準確度，並將測試組的準確度略微提升至 99.73%。這表示貝氏演算法將模型準確度提升了 0.06%，進一步增強了對假新聞的偵測能力。此外，所提出的演算法在約第 10 個訓練週期時快速收斂，準確度接近 100%，證明了它的有效性和快速的分類能力。總的來說，由貝氏演算法和 BiGRU 增強的最佳化 Transformer 模型展現出絕佳的持續學習和偵測效能，提供了一個強健的技術手段來對抗在當前資訊過載時代中假新聞的散布。</paragraph>
+##### **CoRPA: Adversarial Image Generation for Chest X-rays Using Concept Vector Perturbations and Generative Models**
+2502.05214v1 by Amy Rafferty, Rishi Ramaesh, Ajitha Rajan
 
-##### **A Hybrid Model for Few-Shot Text Classification Using Transfer and Meta-Learning**
-2502.09086v1 by Jia Gao, Shuangquan Lyu, Guiran Liu, Binrong Zhu, Hongye Zheng, Xiaoxuan Liao
+Deep learning models for medical image classification tasks are becoming
+widely implemented in AI-assisted diagnostic tools, aiming to enhance
+diagnostic accuracy, reduce clinician workloads, and improve patient outcomes.
+However, their vulnerability to adversarial attacks poses significant risks to
+patient safety. Current attack methodologies use general techniques such as
+model querying or pixel value perturbations to generate adversarial examples
+designed to fool a model. These approaches may not adequately address the
+unique characteristics of clinical errors stemming from missed or incorrectly
+identified clinical features. We propose the Concept-based Report Perturbation
+Attack (CoRPA), a clinically-focused black-box adversarial attack framework
+tailored to the medical imaging domain. CoRPA leverages clinical concepts to
+generate adversarial radiological reports and images that closely mirror
+realistic clinical misdiagnosis scenarios. We demonstrate the utility of CoRPA
+using the MIMIC-CXR-JPG dataset of chest X-rays and radiological reports. Our
+evaluation reveals that deep learning models exhibiting strong resilience to
+conventional adversarial attacks are significantly less robust when subjected
+to CoRPA's clinically-focused perturbations. This underscores the importance of
+addressing domain-specific vulnerabilities in medical AI systems. By
+introducing a specialized adversarial attack framework, this study provides a
+foundation for developing robust, real-world-ready AI models in healthcare,
+ensuring their safe and reliable deployment in high-stakes clinical
+environments.
 
-With the continuous development of natural language processing (NLP)
-technology, text classification tasks have been widely used in multiple
-application fields. However, obtaining labeled data is often expensive and
-difficult, especially in few-shot learning scenarios. To solve this problem,
-this paper proposes a few-shot text classification model based on transfer
-learning and meta-learning. The model uses the knowledge of the pre-trained
-model for transfer and optimizes the model's rapid adaptability in few-sample
-tasks through a meta-learning mechanism. Through a series of comparative
-experiments and ablation experiments, we verified the effectiveness of the
-proposed method. The experimental results show that under the conditions of few
-samples and medium samples, the model based on transfer learning and
-meta-learning significantly outperforms traditional machine learning and deep
-learning methods. In addition, ablation experiments further analyzed the
-contribution of each component to the model performance and confirmed the key
-role of transfer learning and meta-learning in improving model accuracy.
-Finally, this paper discusses future research directions and looks forward to
-the potential of this method in practical applications.
+摘要：深度学习模型用于医学影像分类任务，在人工智能辅助诊断工具中得到广泛应用，旨在提高诊断准确性、减少临床医生的工作量并改善患者的治疗效果。然而，它们对对抗性攻击的脆弱性给患者安全带来了重大风险。目前的攻击方法使用通用技术，例如模型查询或像素值扰动来生成对抗性示例，旨在欺骗模型。这些方法可能无法充分解决源自遗漏或错误识别的临床特征的临床错误的独特特征。我们提出了基于概念的报告扰动攻击 (CoRPA)，这是一种以临床为中心的、针对医学成像领域的、黑盒对抗性攻击框架。CoRPA 利用临床概念来生成对抗性放射学报告和图像，这些报告和图像与现实的临床误诊场景非常相似。我们使用胸部 X 射线和放射学报告的 MIMIC-CXR-JPG 数据集演示了 CoRPA 的效用。我们的评估表明，对传统对抗性攻击表现出强大弹性的深度学习模型在受到 CoRPA 以临床为中心的扰动时，其鲁棒性明显降低。这强调了在医疗人工智能系统中解决特定领域漏洞的重要性。通过引入专门的对抗性攻击框架，本研究为在医疗保健领域开发健壮、面向现实世界的 AI 模型奠定了基础，确保它们在高风险临床环境中安全可靠地部署。
 
-摘要：隨著自然語言處理 (NLP) 技術的持續發展，文本分類任務已廣泛應用於多個應用領域。然而，獲取標記資料通常既昂貴又困難，特別是在小樣本學習場景中。為了解決這個問題，本文提出了一個基於遷移學習和元學習的少樣本文本分類模型。該模型利用預訓練模型的知識進行遷移，並透過元學習機制最佳化模型在少樣本任務中的快速適應性。透過一系列的比較實驗和消融實驗，我們驗證了所提出方法的有效性。實驗結果表明，在少樣本和中等樣本的條件下，基於遷移學習和元學習的模型明顯優於傳統機器學習和深度學習方法。此外，消融實驗進一步分析了各個組成部分對模型效能的貢獻，並確認了遷移學習和元學習在提升模型準確度中的關鍵作用。最後，本文探討了未來的研究方向，並期待此方法在實際應用中的潛力。
+##### **A Self-Supervised Framework for Improved Generalisability in Ultrasound B-mode Image Segmentation**
+2502.02489v1 by Edward Ellis, Andrew Bulpitt, Nasim Parsa, Michael F Byrne, Sharib Ali
 
-##### **Show Me the Work: Fact-Checkers' Requirements for Explainable Automated Fact-Checking**
-2502.09083v1 by Greta Warren, Irina Shklovski, Isabelle Augenstein
+Ultrasound (US) imaging is clinically invaluable due to its noninvasive and
+safe nature. However, interpreting US images is challenging, requires
+significant expertise, and time, and is often prone to errors. Deep learning
+offers assistive solutions such as segmentation. Supervised methods rely on
+large, high-quality, and consistently labeled datasets, which are challenging
+to curate. Moreover, these methods tend to underperform on out-of-distribution
+data, limiting their clinical utility. Self-supervised learning (SSL) has
+emerged as a promising alternative, leveraging unlabeled data to enhance model
+performance and generalisability. We introduce a contrastive SSL approach
+tailored for B-mode US images, incorporating a novel Relation Contrastive Loss
+(RCL). RCL encourages learning of distinct features by differentiating positive
+and negative sample pairs through a learnable metric. Additionally, we propose
+spatial and frequency-based augmentation strategies for the representation
+learning on US images. Our approach significantly outperforms traditional
+supervised segmentation methods across three public breast US datasets,
+particularly in data-limited scenarios. Notable improvements on the Dice
+similarity metric include a 4% increase on 20% and 50% of the BUSI dataset,
+nearly 6% and 9% improvements on 20% and 50% of the BrEaST dataset, and 6.4%
+and 3.7% improvements on 20% and 50% of the UDIAT dataset, respectively.
+Furthermore, we demonstrate superior generalisability on the
+out-of-distribution UDIAT dataset with performance boosts of 20.6% and 13.6%
+compared to the supervised baseline using 20% and 50% of the BUSI and BrEaST
+training data, respectively. Our research highlights that domain-inspired SSL
+can improve US segmentation, especially under data-limited conditions.
 
-The pervasiveness of large language models and generative AI in online media
-has amplified the need for effective automated fact-checking to assist
-fact-checkers in tackling the increasing volume and sophistication of
-misinformation. The complex nature of fact-checking demands that automated
-fact-checking systems provide explanations that enable fact-checkers to
-scrutinise their outputs. However, it is unclear how these explanations should
-align with the decision-making and reasoning processes of fact-checkers to be
-effectively integrated into their workflows. Through semi-structured interviews
-with fact-checking professionals, we bridge this gap by: (i) providing an
-account of how fact-checkers assess evidence, make decisions, and explain their
-processes; (ii) examining how fact-checkers use automated tools in practice;
-and (iii) identifying fact-checker explanation requirements for automated
-fact-checking tools. The findings show unmet explanation needs and identify
-important criteria for replicable fact-checking explanations that trace the
-model's reasoning path, reference specific evidence, and highlight uncertainty
-and information gaps.
+摘要：超音波 (US) 影像由於其非侵入性且安全的特性，在臨床上極具價值。然而，解讀超音波影像具有挑戰性，需要大量的專業知識和時間，而且經常容易出錯。深度學習提供了輔助解決方案，例如分割。監督式方法依賴於大量、高品質且標籤一致的資料集，而這在策劃上具有挑戰性。此外，這些方法在分佈外資料上的表現往往不佳，這限制了它們的臨床效用。自監督學習 (SSL) 已成為一種有前途的替代方案，它利用未標籤資料來增強模型效能和泛化能力。我們提出了一種對比式 SSL 方法，專門針對 B 模式超音波影像，並納入了新穎的關係對比損失 (RCL)。RCL 透過一個可學習的指標區分正負樣本對，來鼓勵學習不同的特徵。此外，我們提出了用於超音波影像上表徵學習的空間和頻率增強策略。我們的做法在三個公開的乳房超音波資料集上顯著優於傳統的監督式分割方法，特別是在資料有限的情況下。在 Dice 相似性指標上的顯著改進包括在 BUSI 資料集的 20% 和 50% 上增加了 4%，在 BrEaST 資料集的 20% 和 50% 上增加了近 6% 和 9%，以及在 UDIAT 資料集的 20% 和 50% 上分別增加了 6.4% 和 3.7%。此外，我們在分佈外的 UDIAT 資料集上展示了卓越的泛化能力，與使用 BUSI 和 BrEaST 訓練資料的 20% 和 50% 的監督式基準相比，效能分別提升了 20.6% 和 13.6%。我們的研究強調，領域啟發的 SSL 可以改善超音波分割，特別是在資料有限的條件下。
 
-摘要：大型語言模型和生成式 AI 在線上媒體的普及
-放大了對有效自動查核事實的需求，以協助查核員應對日益增加的錯誤資訊量和複雜性。查核事實的複雜性質要求自動查核事實系統提供說明，讓查核員能夠仔細審查他們的輸出。然而，目前尚不清楚這些說明應如何與查核員的決策制定和推理過程保持一致，才能有效整合到他們的流程中。透過與查核事實專業人士進行半結構式訪談，我們透過以下方式彌補這個差距：(i) 提供查核員如何評估證據、做出決策和解釋其流程的說明；(ii) 檢視查核員如何實際使用自動化工具；以及 (iii) 找出查核員對自動查核事實工具的說明需求。研究結果顯示未滿足的說明需求，並找出可複製查核事實說明的重要準則，這些準則追蹤模型的推理路徑、參考具體證據，並強調不確定性和資訊差距。
+##### **Medical Multimodal Model Stealing Attacks via Adversarial Domain Alignment**
+2502.02438v1 by Yaling Shen, Zhixiong Zhuang, Kun Yuan, Maria-Irina Nicolae, Nassir Navab, Nicolas Padoy, Mario Fritz
 
-##### **CoSER: Coordinating LLM-Based Persona Simulation of Established Roles**
-2502.09082v1 by Xintao Wang, Heng Wang, Yifei Zhang, Xinfeng Yuan, Rui Xu, Jen-tse Huang, Siyu Yuan, Haoran Guo, Jiangjie Chen, Wei Wang, Yanghua Xiao, Shuchang Zhou
+Medical multimodal large language models (MLLMs) are becoming an instrumental
+part of healthcare systems, assisting medical personnel with decision making
+and results analysis. Models for radiology report generation are able to
+interpret medical imagery, thus reducing the workload of radiologists. As
+medical data is scarce and protected by privacy regulations, medical MLLMs
+represent valuable intellectual property. However, these assets are potentially
+vulnerable to model stealing, where attackers aim to replicate their
+functionality via black-box access. So far, model stealing for the medical
+domain has focused on classification; however, existing attacks are not
+effective against MLLMs. In this paper, we introduce Adversarial Domain
+Alignment (ADA-STEAL), the first stealing attack against medical MLLMs.
+ADA-STEAL relies on natural images, which are public and widely available, as
+opposed to their medical counterparts. We show that data augmentation with
+adversarial noise is sufficient to overcome the data distribution gap between
+natural images and the domain-specific distribution of the victim MLLM.
+Experiments on the IU X-RAY and MIMIC-CXR radiology datasets demonstrate that
+Adversarial Domain Alignment enables attackers to steal the medical MLLM
+without any access to medical data.
 
-Role-playing language agents (RPLAs) have emerged as promising applications
-of large language models (LLMs). However, simulating established characters
-presents a challenging task for RPLAs, due to the lack of authentic character
-datasets and nuanced evaluation methods using such data. In this paper, we
-present CoSER, a collection of a high-quality dataset, open models, and an
-evaluation protocol towards effective RPLAs of established characters. The
-CoSER dataset covers 17,966 characters from 771 renowned books. It provides
-authentic dialogues with real-world intricacies, as well as diverse data types
-such as conversation setups, character experiences and internal thoughts.
-Drawing from acting methodology, we introduce given-circumstance acting for
-training and evaluating role-playing LLMs, where LLMs sequentially portray
-multiple characters in book scenes. Using our dataset, we develop CoSER 8B and
-CoSER 70B, i.e., advanced open role-playing LLMs built on LLaMA-3.1 models.
-Extensive experiments demonstrate the value of the CoSER dataset for RPLA
-training, evaluation and retrieval. Moreover, CoSER 70B exhibits
-state-of-the-art performance surpassing or matching GPT-4o on our evaluation
-and three existing benchmarks, i.e., achieving 75.80% and 93.47% accuracy on
-the InCharacter and LifeChoice benchmarks respectively.
+摘要：醫療多模態大型語言模型 (MLLM) 正在成為醫療保健系統中不可或缺的一部分，協助醫療人員進行決策和結果分析。放射報告生成的模型能夠解釋醫學影像，從而減輕放射科醫師的工作負擔。由於醫療資料稀少且受隱私法規保護，醫療 MLLM 代表了有價值的智慧財產。然而，這些資產潛在地容易受到模型竊取的攻擊，攻擊者旨在透過黑盒存取來複製其功能。到目前為止，針對醫療領域的模型竊取一直專注於分類；然而，現有的攻擊對 MLLM 沒有效。在本文中，我們介紹了對抗域對齊 (ADA-STEAL)，這是針對醫療 MLLM 的第一個竊取攻擊。與醫療對應物相反，ADA-STEAL 依賴於公開且廣泛可用的自然影像。我們表明，對抗雜訊的資料擴充足以克服自然影像與受害者 MLLM 的特定領域分佈之間的資料分佈差距。在 IU X-RAY 和 MIMIC-CXR 放射學資料集上進行的實驗表明，對抗域對齊使攻擊者能夠在不存取任何醫療資料的情況下竊取醫療 MLLM。
 
-摘要：角色扮演語言代理（RPLA）已成為大型語言模型（LLM）的有前途的應用。然而，由於缺乏真實角色資料集和使用此類資料的細緻評估方法，模擬既有角色對 RPLA 來說是一項具有挑戰性的任務。在本文中，我們提出了 CoSER，這是一個高品質資料集、開放模型和評估協議的集合，用於有效地扮演既有角色的 RPLA。CoSER 資料集涵蓋了來自 771 本著名書籍的 17,966 個角色。它提供了具有真實世界複雜性的真實對話，以及對話設定、角色體驗和內心想法等多種資料類型。借鑑表演方法，我們引入了既定情境表演，用於訓練和評估角色扮演 LLM，其中 LLM 在書籍場景中依次扮演多個角色。使用我們的資料集，我們開發了 CoSER 8B 和 CoSER 70B，即建立在 LLaMA-3.1 模型上的先進開放角色扮演 LLM。大量的實驗證明了 CoSER 資料集對於 RPLA 訓練、評估和檢索的價值。此外，CoSER 70B 在我們的評估和三個現有基準上展現了超越或匹配 GPT-4o 的最先進效能，即分別在 InCharacter 和 LifeChoice 基準上達到了 75.80% 和 93.47% 的準確率。
+##### **Test Time Training for 4D Medical Image Interpolation**
+2502.02341v1 by Qikang Zhang, Yingjie Lei, Zihao Zheng, Ziyang Chen, Zhonghao Xie
 
-##### **Enhancing RAG with Active Learning on Conversation Records: Reject Incapables and Answer Capables**
-2502.09073v1 by Xuzhao Geng, Haozhao Wang, Jun Wang, Wei Liu, Ruixuan Li
+4D medical image interpolation is essential for improving temporal resolution
+and diagnostic precision in clinical applications. Previous works ignore the
+problem of distribution shifts, resulting in poor generalization under
+different distribution. A natural solution would be to adapt the model to a new
+test distribution, but this cannot be done if the test input comes without a
+ground truth label. In this paper, we propose a novel test time training
+framework which uses self-supervision to adapt the model to a new distribution
+without requiring any labels. Indeed, before performing frame interpolation on
+each test video, the model is trained on the same instance using a
+self-supervised task, such as rotation prediction or image reconstruction. We
+conduct experiments on two publicly available 4D medical image interpolation
+datasets, Cardiac and 4D-Lung. The experimental results show that the proposed
+method achieves significant performance across various evaluation metrics on
+both datasets. It achieves higher peak signal-to-noise ratio values, 33.73dB on
+Cardiac and 34.02dB on 4D-Lung. Our method not only advances 4D medical image
+interpolation but also provides a template for domain adaptation in other
+fields such as image segmentation and image registration.
 
-Retrieval-augmented generation (RAG) is a key technique for leveraging
-external knowledge and reducing hallucinations in large language models (LLMs).
-However, RAG still struggles to fully prevent hallucinated responses. To
-address this, it is essential to identify samples prone to hallucination or
-guide LLMs toward correct responses, which experts then annotate to develop
-high-quality datasets for refining LLMs. However, the growing scarcity of such
-datasets makes their creation challenging. This paper proposes using the vast
-amount of conversations from widespread LLM usage to build these datasets,
-training LLMs to avoid hallucination-prone questions while accurately
-responding to manageable ones. Given the impracticality of expert-annotating
-all conversation records, the paper introduces AL4RAG, which uses active
-learning to select the most suitable conversation samples for annotation,
-optimizing performance within an annotation budget. Additionally, recognizing
-that traditional active learning methods are not fully compatible with RAG due
-to unsuitable distance metrics, we develop a novel sample distance measurement
-for RAG active learning. Extensive experiments show that our method
-consistently outperforms baselines across multiple metrics.
+摘要：4D 醫學影像插值對於提升時間解析度及臨床應用中的診斷精準度至關重要。過往的研究忽略了分佈轉移問題，導致在不同分佈下泛化能力不佳。一個自然的解決方案是將模型適應到新的測試分佈，但如果測試輸入沒有真實標籤，就無法做到這一點。在本文中，我們提出了一個新的測試時間訓練架構，它使用自我監督來適應模型到一個新的分佈，而不需要任何標籤。事實上，在對每個測試影片執行幀插值之前，使用自我監督任務（例如旋轉預測或影像重建）在同一個實例上訓練模型。我們在兩個公開的 4D 醫學影像插值資料集（Cardiac 和 4D-Lung）上進行實驗。實驗結果表明，所提出的方法在兩個資料集上的各種評估指標中都取得了顯著的效能。它達到了更高的峰值信噪比值，在 Cardiac 上為 33.73dB，在 4D-Lung 上為 34.02dB。我們的技術不僅推動了 4D 醫學影像插值，還為其他領域（例如影像分割和影像配準）中的領域適應提供了一個範本。
 
-摘要：檢索增強生成 (RAG) 是一種關鍵技術，用於利用外部知識並減少大型語言模型 (LLM) 中的幻覺。然而，RAG 仍難以完全防止幻覺反應。為了解決這個問題，必須找出容易產生幻覺的範例，或引導 LLM 朝向正確的反應，然後由專家註解以開發用於精煉 LLM 的高品質資料集。然而，此類資料集日益稀少，使得其建立極具挑戰性。本文提出使用來自廣泛 LLM 使用的大量對話來建立這些資料集，訓練 LLM 以避免容易產生幻覺的問題，同時準確回應可管理的問題。鑑於由專家為所有對話記錄加上註解並不切實際，本文引入了 AL4RAG，它使用主動學習來選擇最適合註解的對話範例，在註解預算內最佳化效能。此外，認識到傳統主動學習方法由於不適當的距離度量而無法與 RAG 完全相容，我們為 RAG 主動學習開發了一種新穎的範例距離度量。廣泛的實驗表明，我們的模型在多種度量標準上始終優於基準。
+##### **Conversation AI Dialog for Medicare powered by Finetuning and Retrieval Augmented Generation**
+2502.02249v1 by Atharva Mangeshkumar Agrawal, Rutika Pandurang Shinde, Vasanth Kumar Bhukya, Ashmita Chakraborty, Sagar Bharat Shah, Tanmay Shukla, Sree Pradeep Kumar Relangi, Nilesh Mutyam
 
-##### **An Open Recipe: Adapting Language-Specific LLMs to a Reasoning Model in One Day via Model Merging**
-2502.09056v1 by Kunat Pipatanakul, Pittawat Taveekitworachai, Potsawee Manakul, Kasima Tharnpipitchai
+Large language models (LLMs) have shown impressive capabilities in natural
+language processing tasks, including dialogue generation. This research aims to
+conduct a novel comparative analysis of two prominent techniques, fine-tuning
+with LoRA (Low-Rank Adaptation) and the Retrieval-Augmented Generation (RAG)
+framework, in the context of doctor-patient chat conversations with multiple
+datasets of mixed medical domains. The analysis involves three state-of-the-art
+models: Llama-2, GPT, and the LSTM model. Employing real-world doctor-patient
+dialogues, we comprehensively evaluate the performance of models, assessing key
+metrics such as language quality (perplexity, BLEU score), factual accuracy
+(fact-checking against medical knowledge bases), adherence to medical
+guidelines, and overall human judgments (coherence, empathy, safety). The
+findings provide insights into the strengths and limitations of each approach,
+shedding light on their suitability for healthcare applications. Furthermore,
+the research investigates the robustness of the models in handling diverse
+patient queries, ranging from general health inquiries to specific medical
+conditions. The impact of domain-specific knowledge integration is also
+explored, highlighting the potential for enhancing LLM performance through
+targeted data augmentation and retrieval strategies.
 
-This paper investigates data selection and model merging methodologies aimed
-at incorporating advanced reasoning capabilities such as those of DeepSeek R1
-into language-specific large language models (LLMs), with a particular focus on
-the Thai LLM. Our goal is to enhance the reasoning capabilities of
-language-specific LLMs while maintaining their target language abilities.
-DeepSeek R1 excels in reasoning but primarily benefits high-resource languages
-such as English and Chinese. However, low-resource languages remain underserved
-due to the dominance of English-centric training data and model optimizations,
-which limit performance in these languages. This limitation results in
-unreliable code-switching and diminished effectiveness on tasks in low-resource
-languages. Meanwhile, local and regional LLM initiatives have attempted to
-bridge this gap by developing language-specific LLMs that focus on improving
-local linguistic fidelity. We demonstrate that, with only publicly available
-datasets and a computational budget of $120, it is possible to enhance the
-reasoning capabilities of language-specific LLMs to match the level of DeepSeek
-R1, without compromising their performance on target language tasks.
+摘要：大型語言模型 (LLM) 在自然語言處理任務中展現了令人印象深刻的能力，包括對話生成。本研究旨在對兩種著名的技術進行新穎的比較分析，即微調 LoRA (低秩適應) 和檢索增強生成 (RAG) 框架，在具有混合醫療領域的多個資料集的醫患聊天對話中。分析涉及三個最先進的模型：Llama-2、GPT 和 LSTM 模型。採用真實世界的醫患對話，我們全面評估模型的性能，評估語言品質（困惑度、BLEU 分數）、事實準確性（對照醫學知識庫進行事實查核）、遵守醫療指南以及整體人類判斷（連貫性、同理心、安全性）等關鍵指標。研究結果深入了解了每種方法的優點和限制，闡明了它們適用於醫療保健應用的適當性。此外，該研究調查了模型在處理多樣化患者查詢時的穩健性，範圍從一般健康詢問到特定醫療狀況。還探討了特定領域知識整合的影響，強調了通過有針對性的資料擴充和檢索策略來增強 LLM 性能的潛力。
 
-摘要：本文探討資料選取與模型合併方法，旨在將深度搜尋 R1 等先進推理能力整合至特定語言的大型語言模型 (LLM)，特別著重於泰語 LLM。我們的目標是提升特定語言 LLM 的推理能力，同時維持其目標語言能力。深度搜尋 R1 在推理方面表現出色，但主要受益於英語和中文等資源豐富的語言。然而，由於以英語為中心的訓練資料和模型最佳化佔據主導地位，資源貧乏的語言仍未獲得充分服務，這限制了這些語言的效能。此限制導致不可靠的代碼切換，並降低了資源貧乏語言任務的效能。與此同時，在地區 LLM 計畫已嘗試透過開發專注於改善在地語言忠實度的特定語言 LLM 來彌合此差距。我們證明，僅使用公開可用的資料集和 120 美元的運算預算，即可提升特定語言 LLM 的推理能力，使其達到深度搜尋 R1 的水準，同時不損及它們在目標語言任務上的效能。
+##### **Deep Learning-Based Facial Expression Recognition for the Elderly: A Systematic Review**
+2502.02618v1 by F. Xavier Gaya-Morey, Jose M. Buades-Rubio, Philippe Palanque, Raquel Lacuesta, Cristina Manresa-Yee
 
-##### **Cost-Saving LLM Cascades with Early Abstention**
-2502.09054v1 by Michael J. Zellinger, Rex Liu, Matt Thomson
+The rapid aging of the global population has highlighted the need for
+technologies to support elderly, particularly in healthcare and emotional
+well-being. Facial expression recognition (FER) systems offer a non-invasive
+means of monitoring emotional states, with applications in assisted living,
+mental health support, and personalized care. This study presents a systematic
+review of deep learning-based FER systems, focusing on their applications for
+the elderly population. Following a rigorous methodology, we analyzed 31
+studies published over the last decade, addressing challenges such as the
+scarcity of elderly-specific datasets, class imbalances, and the impact of
+age-related facial expression differences. Our findings show that convolutional
+neural networks remain dominant in FER, and especially lightweight versions for
+resource-constrained environments. However, existing datasets often lack
+diversity in age representation, and real-world deployment remains limited.
+Additionally, privacy concerns and the need for explainable artificial
+intelligence emerged as key barriers to adoption. This review underscores the
+importance of developing age-inclusive datasets, integrating multimodal
+solutions, and adopting XAI techniques to enhance system usability,
+reliability, and trustworthiness. We conclude by offering recommendations for
+future research to bridge the gap between academic progress and real-world
+implementation in elderly care.
 
-LLM cascades are based on the idea that processing all queries with the
-largest and most expensive LLMs is inefficient. Instead, cascades deploy small
-LLMs to answer the majority of queries, limiting the use of large and expensive
-LLMs to only the most difficult queries. This approach can significantly reduce
-costs without impacting performance. However, risk-sensitive domains such as
-finance or medicine place an additional premium on avoiding model errors.
-Recognizing that even the most expensive models may make mistakes, applications
-in these domains benefit from allowing LLM systems to completely abstain from
-answering a query when the chance of making a mistake is significant. However,
-giving a cascade the ability to abstain poses an immediate design question for
-LLM cascades: should abstention only be allowed at the final model or also at
-earlier models? Since the error patterns of small and large models are
-correlated, the latter strategy may further reduce inference costs by letting
-inexpensive models anticipate abstention decisions by expensive models, thereby
-obviating the need to run the expensive models. We investigate the benefits of
-"early abstention" in LLM cascades and find that it reduces the overall test
-loss by 2.2% on average across six benchmarks (GSM8K, MedMCQA, MMLU, TriviaQA,
-TruthfulQA, and XSum). These gains result from a more effective use of
-abstention, which trades a 4.1% average increase in the overall abstention rate
-for a 13.0% reduction in cost and a 5.0% reduction in error rate. Our findings
-demonstrate that it is possible to leverage correlations between the error
-patterns of different language models to drive performance improvements for LLM
-systems with abstention.
+摘要：全球人口快速老龄化突显了对技术的需求，以支持老年人，尤其是在医疗保健和情绪健康方面。面部表情识别 (FER) 系统提供了一种非侵入性的情绪状态监测手段，在辅助生活、心理健康支持和个性化护理中得到应用。本研究对基于深度学习的 FER 系统进行了系统的回顾，重点关注它们在老年人群中的应用。遵循严格的方法，我们分析了在过去十年中发表的 31 项研究，解决了诸如老年人特定数据集的稀缺性、类别不平衡以及与年龄相关的面部表情差异的影响等挑战。我们的研究结果表明，卷积神经网络在 FER 中仍然占主导地位，特别是针对资源受限环境的轻量级版本。然而，现有数据集往往缺乏年龄代表性的多样性，并且现实世界的部署仍然有限。此外，隐私问题和对可解释人工智能的需求已成为采用过程中的主要障碍。本次审查强调了开发包容年龄的数据集、整合多模式解决方案以及采用 XAI 技术以增强系统可用性、可靠性和可信度的重要性。最后，我们提出了未来研究的建议，以弥合学术进展与老年护理中的现实世界实施之间的差距。
 
-摘要：<paragraph>LLM 級聯基於以下概念：使用最大且最昂貴的 LLM 處理所有查詢效率低下。相反，級聯會部署小型 LLM 來回答大部分查詢，將大型且昂貴的 LLM 的使用限制在最困難的查詢上。這種方法可以大幅降低成本，而不會影響效能。然而，像金融或醫學等對風險敏感的領域會額外重視避免模型錯誤。認識到即使是最昂貴的模型也可能會出錯，在這些領域中的應用程式可受益於允許 LLM 系統在出錯機率很大的情況下完全不回答查詢。然而，賦予級聯不回答的能力會對 LLM 級聯提出立即的設計問題：是否只允許在最終模型中不回答，還是也在較早的模型中不回答？由於小型和大型模型的錯誤模式相關，後一種策略可以讓便宜的模型預測昂貴模型的不回答決策，進而降低推論成本，從而避免執行昂貴的模型。我們調查了 LLM 級聯中「早期不回答」的好處，並發現它平均降低了六個基準測試（GSM8K、MedMCQA、MMLU、TriviaQA、TruthfulQA 和 XSum）的整體測試損失 2.2%。這些收益來自於更有效地使用不回答，以整體不回答率平均增加 4.1% 的代價換取成本降低 13.0% 和錯誤率降低 5.0%。我們的研究結果證明，可以利用不同語言模型的錯誤模式之間的關聯性，來推動具有不回答功能的 LLM 系統的效能改進。</paragraph>
+##### **Causally-informed Deep Learning towards Explainable and Generalizable Outcomes Prediction in Critical Care**
+2502.02109v1 by Yuxiao Cheng, Xinxin Song, Ziqian Wang, Qin Zhong, Kunlun He, Jinli Suo
 
-##### **Game Theory Meets Large Language Models: A Systematic Survey**
-2502.09053v1 by Haoran Sun, Yusen Wu, Yukun Cheng, Xu Chu
+Recent advances in deep learning (DL) have prompted the development of
+high-performing early warning score (EWS) systems, predicting clinical
+deteriorations such as acute kidney injury, acute myocardial infarction, or
+circulatory failure. DL models have proven to be powerful tools for various
+tasks but come with the cost of lacking interpretability and limited
+generalizability, hindering their clinical applications. To develop a practical
+EWS system applicable to various outcomes, we propose causally-informed
+explainable early prediction model, which leverages causal discovery to
+identify the underlying causal relationships of prediction and thus owns two
+unique advantages: demonstrating the explicit interpretation of the prediction
+while exhibiting decent performance when applied to unfamiliar environments.
+Benefiting from these features, our approach achieves superior accuracy for 6
+different critical deteriorations and achieves better generalizability across
+different patient groups, compared to various baseline algorithms. Besides, we
+provide explicit causal pathways to serve as references for assistant clinical
+diagnosis and potential interventions. The proposed approach enhances the
+practical application of deep learning in various medical scenarios.
 
-Game theory establishes a fundamental framework for analyzing strategic
-interactions among rational decision-makers. The rapid advancement of large
-language models (LLMs) has sparked extensive research exploring the
-intersection of these two fields. Specifically, game-theoretic methods are
-being applied to evaluate and enhance LLM capabilities, while LLMs themselves
-are reshaping classic game models. This paper presents a comprehensive survey
-of the intersection of these fields, exploring a bidirectional relationship
-from three perspectives: (1) Establishing standardized game-based benchmarks
-for evaluating LLM behavior; (2) Leveraging game-theoretic methods to improve
-LLM performance through algorithmic innovations; (3) Characterizing the
-societal impacts of LLMs through game modeling. Among these three aspects, we
-also highlight how the equilibrium analysis for traditional game models is
-impacted by LLMs' advanced language understanding, which in turn extends the
-study of game theory. Finally, we identify key challenges and future research
-directions, assessing their feasibility based on the current state of the
-field. By bridging theoretical rigor with emerging AI capabilities, this survey
-aims to foster interdisciplinary collaboration and drive progress in this
-evolving research area.
+摘要：深度學習 (DL) 的最新進展促使開發出高性能早期預警評分 (EWS) 系統，預測急性腎臟損傷、急性心肌梗塞或循環衰竭等臨床惡化。DL 模型已被證明是各種任務的強大工具，但代價是缺乏可解釋性和有限的概括性，阻礙了其臨床應用。為了開發適用於各種結果的實用 EWS 系統，我們提出了因果關係解釋性早期預測模型，它利用因果發現來識別預測的潛在因果關係，從而擁有兩個獨特的優點：展示預測的明確解釋，同時在應用於不熟悉的環境時表現出良好的性能。得益於這些特性，與各種基線演算法相比，我們的模型在 6 種不同的危重惡化中實現了更高的準確度，並在不同的患者群體中實現了更好的概括性。此外，我們提供了明確的因果途徑，作為輔助臨床診斷和潛在干預措施的參考。所提出的方法增強了深度學習在各種醫療場景中的實際應用。
 
-摘要：博弈論建立一個基本架構，用來分析理性決策者之間的策略互動。大型語言模型 (LLM) 的快速進展，激發了廣泛的研究，探討這兩個領域的交集。具體來說，博弈論方法被應用於評估和增強 LLM 能力，而 LLM 本身正在重塑經典博弈模型。本文對這些領域的交集進行了全面的調查，從三個角度探討了雙向關係：(1) 建立標準化的基於博弈的基準，用於評估 LLM 行為；(2) 利用博弈論方法，通過演算法創新來改善 LLM 效能；(3) 透過博弈模型，描述 LLM 對社會的影響。在這三個方面中，我們還強調了 LLM 的先進語言理解如何影響傳統博弈模型的均衡分析，這反過來又擴展了博弈論的研究。最後，我們找出關鍵挑戰和未來的研究方向，根據該領域的現狀評估其可行性。透過將理論嚴謹性與新興的 AI 能力相結合，這項調查旨在促進跨學科合作，並推動這個不斷演變的研究領域的進展。
+##### **JingFang: A Traditional Chinese Medicine Large Language Model of Expert-Level Medical Diagnosis and Syndrome Differentiation-Based Treatment**
+2502.04345v1 by Yehan Yan, Tianhao Ma, Ruotai Li, Xinhan Zheng, Guodong Shan, Chisheng Li
 
-##### **AIDE: Agentically Improve Visual Language Model with Domain Experts**
-2502.09051v1 by Ming-Chang Chiu, Fuxiao Liu, Karan Sapra, Andrew Tao, Yaser Jacoob, Xuezhe Ma, Zhiding Yu, Guilin Liu
+Traditional Chinese medicine (TCM) plays a vital role in health protection
+and disease treatment, but its practical application requires extensive medical
+knowledge and clinical experience. Existing TCM Large Language Models (LLMs)
+exhibit critical limitations of uncomprehensive medical consultation and
+diagnoses, and inaccurate syndrome differentiation-based treatment. To address
+these issues, this study establishes JingFang (JF): a novel TCM Large Language
+Model that demonstrates the expert-level capability of medical diagnosis and
+syndrome differentiation-based treatment. We innovate a Multi-agent Dynamic
+Collaborative Chain-of-Thought Mechanism (MDCCTM) for medical consultation,
+enabling JF with effective and accurate diagnostic ability. In addition, a
+Syndrome Agent and a Dual-Stage Retrieval Scheme (DSRS) are developed to
+significantly enhance the capacity of JF for disease treatment based on
+syndrome differentiation. JingFang not only facilitates the application of LLMs
+but also promotes the effective practice of TCM in human health protection and
+disease treatment.
 
-The enhancement of Visual Language Models (VLMs) has traditionally relied on
-knowledge distillation from larger, more capable models. This dependence
-creates a fundamental bottleneck for improving state-of-the-art systems,
-particularly when no superior models exist. We introduce AIDE (Agentic
-Improvement through Domain Experts), a novel framework that enables VLMs to
-autonomously enhance their capabilities by leveraging specialized domain expert
-models. AIDE operates through a four-stage process: (1) identifying instances
-for refinement, (2) engaging domain experts for targeted analysis, (3)
-synthesizing expert outputs with existing data, and (4) integrating enhanced
-instances into the training pipeline. Experiments on multiple benchmarks,
-including MMMU, MME, MMBench, etc., demonstrate AIDE's ability to achieve
-notable performance gains without relying on larger VLMs nor human supervision.
-Our framework provides a scalable, resource-efficient approach to continuous
-VLM improvement, addressing critical limitations in current methodologies,
-particularly valuable when larger models are unavailable to access.
+摘要：中醫藥在保健與疾病治療中扮演著重要的角色，但其實務應用需要深厚的醫學知識與臨床經驗。現有的中醫大語言模型（LLM）存在著醫療諮詢與診斷不全面、症候分型治療不準確的重大限制。為了解決這些問題，本研究建立了精方（JF）：一個新穎的中醫大語言模型，展示了專家級的醫療診斷與症候分型治療能力。我們創新了一個多智能體動態協作思考鏈機制（MDCCTM）用於醫療諮詢，讓 JF 具備有效且準確的診斷能力。此外，還開發了一個症候智能體和一個雙階段檢索方案（DSRS），以顯著增強 JF 基於症候分型的疾病治療能力。精方不僅促進了 LLM 的應用，也推動了中醫藥在人類保健與疾病治療中的有效實踐。
 
-摘要：視覺語言模型 (VLM) 的增強傳統上依賴於從更大、功能更強大的模型中進行知識萃取。這種依賴性會造成改善最先進系統的基本瓶頸，尤其在沒有更優越的模型時。我們引進 AIDE（透過領域專家進行代理式改善），一個創新的架構，讓 VLM 能夠透過利用專業的領域專家模型，自主增強其功能。AIDE 透過四階段流程運作：(1) 識別需要改善的實例，(2) 聘請領域專家進行有針對性的分析，(3) 將專家輸出與現有資料綜合，以及 (4) 將增強的實例整合到訓練流程中。在多個基準測試上的實驗，包括 MMMU、MME、MMBench 等，證明了 AIDE 能夠在不依賴更大型的 VLM 或人工監督的情況下，實現顯著的效能提升。我們的架構提供了一個可擴充、資源效率高的持續 VLM 改進方法，解決了當前方法中的關鍵限制，特別是在無法取得大型模型時，這一點特別有價值。
+##### **An Agentic AI Workflow for Detecting Cognitive Concerns in Real-world Data**
+2502.01789v1 by Jiazi Tian, Liqin Wang, Pedram Fard, Valdery Moura Junior, Deborah Blacker, Jennifer S. Haas, Chirag Patel, Shawn N. Murphy, Lidia M. V. R. Moura, Hossein Estiri
 
-##### **Leveraging Member-Group Relations via Multi-View Graph Filtering for Effective Group Recommendation**
-2502.09050v1 by Chae-Hyun Kim, Yoon-Ryung Choi, Jin-Duk Park, Won-Yong Shin
+Early identification of cognitive concerns is critical but often hindered by
+subtle symptom presentation. This study developed and validated a fully
+automated, multi-agent AI workflow using LLaMA 3 8B to identify cognitive
+concerns in 3,338 clinical notes from Mass General Brigham. The agentic
+workflow, leveraging task-specific agents that dynamically collaborate to
+extract meaningful insights from clinical notes, was compared to an
+expert-driven benchmark. Both workflows achieved high classification
+performance, with F1-scores of 0.90 and 0.91, respectively. The agentic
+workflow demonstrated improved specificity (1.00) and achieved prompt
+refinement in fewer iterations. Although both workflows showed reduced
+performance on validation data, the agentic workflow maintained perfect
+specificity. These findings highlight the potential of fully automated
+multi-agent AI workflows to achieve expert-level accuracy with greater
+efficiency, offering a scalable and cost-effective solution for detecting
+cognitive concerns in clinical settings.
 
-Group recommendation aims at providing optimized recommendations tailored to
-diverse groups, enabling groups to enjoy appropriate items. On the other hand,
-most existing group recommendation methods are built upon deep neural network
-(DNN) architectures designed to capture the intricate relationships between
-member-level and group-level interactions. While these DNN-based approaches
-have proven their effectiveness, they require complex and expensive training
-procedures to incorporate group-level interactions in addition to member-level
-interactions. To overcome such limitations, we introduce Group-GF, a new
-approach for extremely fast recommendations of items to each group via
-multi-view graph filtering (GF) that offers a holistic view of complex
-member-group dynamics, without the need for costly model training.
-Specifically, in Group-GF, we first construct three item similarity graphs
-manifesting different viewpoints for GF. Then, we discover a distinct
-polynomial graph filter for each similarity graph and judiciously aggregate the
-three graph filters. Extensive experiments demonstrate the effectiveness of
-Group-GF in terms of significantly reducing runtime and achieving
-state-of-the-art recommendation accuracy.
+摘要：及早辨識認知問題至關重要，但常常受到症狀呈現過於細微的阻礙。本研究開發並驗證了一個全自動化、多重代理的 AI 工作流程，使用 LLaMA 3 8B 來辨識來自麻省總醫院布萊根分院的 3,338 則臨床筆記中的認知問題。這個代理工作流程利用了特定任務的代理，這些代理會動態合作從臨床筆記中萃取出有意義的見解，並與專家驅動的基準進行比較。這兩個工作流程都達到了很高的分類效能，F1 分數分別為 0.90 和 0.91。代理工作流程展現出更好的特異性（1.00），並且在更少的反覆運算中達到了提示精煉。儘管這兩個工作流程在驗證資料上的效能都降低了，但代理工作流程維持了完美的特異性。這些發現突顯了全自動化多重代理 AI 工作流程的潛力，它們能以更高的效率達到專家級的準確度，為在臨床環境中偵測認知問題提供了一個可擴充且具成本效益的解決方案。
 
-摘要：群組推薦旨在提供針對不同群組量身打造的最佳推薦，讓群組可以享受適當的項目。另一方面，現有的群組推薦方法大多建立在深度神經網路 (DNN) 架構上，旨在捕捉成員層級和群組層級互動之間的複雜關係。雖然這些基於 DNN 的方法已證明其有效性，但它們需要複雜且昂貴的訓練程序，才能在成員層級互動之外納入群組層級互動。為了克服這些限制，我們引入了 Group-GF，這是一種透過多視圖圖形過濾 (GF) 為每個群組提供極快速項目推薦的新方法，它提供了複雜成員群組動態的整體視圖，而無需進行昂貴的模型訓練。具體來說，在 Group-GF 中，我們首先建構三個項目相似度圖形，展現 GF 的不同觀點。然後，我們為每個相似度圖形發現一個不同的多項式圖形過濾器，並明智地彙總這三個圖形過濾器。廣泛的實驗證明了 Group-GF 在顯著減少執行時間和達成最先進的推薦準確度方面的有效性。
+##### **Can Domain Experts Rely on AI Appropriately? A Case Study on AI-Assisted Prostate Cancer MRI Diagnosis**
+2502.03482v1 by Chacha Chen, Han Liu, Jiamin Yang, Benjamin M. Mervak, Bora Kalaycioglu, Grace Lee, Emre Cakmakli, Matteo Bonatti, Sridhar Pudu, Osman Kahraman, Gul Gizem Pamuk, Aytekin Oto, Aritrick Chatterjee, Chenhao Tan
 
-##### **Criteria-Aware Graph Filtering: Extremely Fast Yet Accurate Multi-Criteria Recommendation**
-2502.09046v1 by Jin-Duk Park, Jaemin Yoo, Won-Yong Shin
+Despite the growing interest in human-AI decision making, experimental
+studies with domain experts remain rare, largely due to the complexity of
+working with domain experts and the challenges in setting up realistic
+experiments. In this work, we conduct an in-depth collaboration with
+radiologists in prostate cancer diagnosis based on MRI images. Building on
+existing tools for teaching prostate cancer diagnosis, we develop an interface
+and conduct two experiments to study how AI assistance and performance feedback
+shape the decision making of domain experts. In Study 1, clinicians were asked
+to provide an initial diagnosis (human), then view the AI's prediction, and
+subsequently finalize their decision (human-AI team). In Study 2 (after a
+memory wash-out period), the same participants first received aggregated
+performance statistics from Study 1, specifically their own performance, the
+AI's performance, and their human-AI team performance, and then directly viewed
+the AI's prediction before making their diagnosis (i.e., no independent initial
+diagnosis). These two workflows represent realistic ways that clinical AI tools
+might be used in practice, where the second study simulates a scenario where
+doctors can adjust their reliance and trust on AI based on prior performance
+feedback. Our findings show that, while human-AI teams consistently outperform
+humans alone, they still underperform the AI due to under-reliance, similar to
+prior studies with crowdworkers. Providing clinicians with performance feedback
+did not significantly improve the performance of human-AI teams, although
+showing AI decisions in advance nudges people to follow AI more. Meanwhile, we
+observe that the ensemble of human-AI teams can outperform AI alone, suggesting
+promising directions for human-AI collaboration.
 
-Multi-criteria (MC) recommender systems, which utilize MC rating information
-for recommendation, are increasingly widespread in various e-commerce domains.
-However, the MC recommendation using training-based collaborative filtering,
-requiring consideration of multiple ratings compared to single-criterion
-counterparts, often poses practical challenges in achieving state-of-the-art
-performance along with scalable model training. To solve this problem, we
-propose CA-GF, a training-free MC recommendation method, which is built upon
-criteria-aware graph filtering for efficient yet accurate MC recommendations.
-Specifically, first, we construct an item-item similarity graph using an MC
-user-expansion graph. Next, we design CA-GF composed of the following key
-components, including 1) criterion-specific graph filtering where the optimal
-filter for each criterion is found using various types of polynomial low-pass
-filters and 2) criteria preference-infused aggregation where the smoothed
-signals from each criterion are aggregated. We demonstrate that CA-GF is (a)
-efficient: providing the computational efficiency, offering the extremely fast
-runtime of less than 0.2 seconds even on the largest benchmark dataset, (b)
-accurate: outperforming benchmark MC recommendation methods, achieving
-substantial accuracy gains up to 24% compared to the best competitor, and (c)
-interpretable: providing interpretations for the contribution of each criterion
-to the model prediction based on visualizations.
+摘要：儘管人們對人類與 AI 決策制定越來越感興趣，但與領域專家合作的實驗研究仍然很少見，這在很大程度上是因為與領域專家合作的複雜性，以及在設定實際實驗時面臨的挑戰。在這項工作中，我們與放射科醫師進行深入合作，基於 MRI 影像診斷前列腺癌。建立在用於教授前列腺癌診斷的現有工具上，我們開發了一個介面並進行了兩項實驗，以研究 AI 協助和效能回饋如何塑造領域專家的決策制定。在研究 1 中，要求臨床醫師提供初步診斷（人類），然後檢視 AI 的預測，並隨後確定他們的決策（人類-AI 團隊）。在研究 2（經過一段記憶清除期）中，同一位參與者首先收到研究 1 的彙總效能統計資料，特別是他們自己的效能、AI 的效能，以及他們的人類-AI 團隊效能，然後在做出診斷前直接檢視 AI 的預測（即，沒有獨立的初步診斷）。這兩個工作流程代表了臨床 AI 工具在實務中可能被使用的方式，其中第二個研究模擬了醫生可以根據先前的效能回饋調整他們對 AI 的依賴和信任的情況。我們的研究結果顯示，儘管人類-AI 團隊始終優於單獨的人類，但由於依賴不足，他們仍然表現不如 AI，這與之前針對群眾工作者的研究類似。儘管事先顯示 AI 決策會促使人們更多地遵循 AI，但向臨床醫師提供效能回饋並未顯著改善人類-AI 團隊的效能。同時，我們觀察到人類-AI 團隊的集合可以優於單獨的 AI，這表明了人類-AI 合作的前景。
 
-摘要：多準則 (MC) 推薦系統在各種電子商務領域中日益普及，該系統利用 MC 評分資訊進行推薦。
-然而，與單準則對應項目相比，使用基於訓練的協同過濾的 MC 推薦，通常在達成最先進的效能以及可擴充模型訓練方面造成實務上的挑戰，需要考慮多個評分。為了解決這個問題，我們提出 CA-GF，一種無需訓練的 MC 推薦方法，它建立於準則感知圖形過濾之上，用於有效且準確的 MC 推薦。
-具體來說，首先，我們使用 MC 使用者擴展圖形來建構一個項目相似度圖形。接下來，我們設計 CA-GF，它包含以下關鍵組成部分，包括 1) 準則特定圖形過濾，其中使用各種類型的多項式低通濾波器來找出每個準則的最佳濾波器，以及 2) 準則偏好注入聚合，其中來自每個準則的平滑訊號被聚合。我們證明 CA-GF 是 (a) 有效的：提供運算效率，即使在最大的基準資料集上，也能提供低於 0.2 秒的極快執行時間，(b) 準確的：優於基準 MC 推薦方法，與最佳競爭者相比，獲得高達 24% 的顯著準確性提升，以及 (c) 可解釋的：根據視覺化提供對每個準則對模型預測的貢獻的解釋。
+##### **Improving Transformer World Models for Data-Efficient RL**
+2502.01591v1 by Antoine Dedieu, Joseph Ortiz, Xinghua Lou, Carter Wendelken, Wolfgang Lehrach, J Swaroop Guntupalli, Miguel Lazaro-Gredilla, Kevin Patrick Murphy
 
-##### **Typhoon T1: An Open Thai Reasoning Model**
-2502.09042v1 by Pittawat Taveekitworachai, Potsawee Manakul, Kasima Tharnpipitchai, Kunat Pipatanakul
+We present an approach to model-based RL that achieves a new state of the art
+performance on the challenging Craftax-classic benchmark, an open-world 2D
+survival game that requires agents to exhibit a wide range of general abilities
+-- such as strong generalization, deep exploration, and long-term reasoning.
+With a series of careful design choices aimed at improving sample efficiency,
+our MBRL algorithm achieves a reward of 67.4% after only 1M environment steps,
+significantly outperforming DreamerV3, which achieves 53.2%, and, for the first
+time, exceeds human performance of 65.0%. Our method starts by constructing a
+SOTA model-free baseline, using a novel policy architecture that combines CNNs
+and RNNs. We then add three improvements to the standard MBRL setup: (a) "Dyna
+with warmup", which trains the policy on real and imaginary data, (b) "nearest
+neighbor tokenizer" on image patches, which improves the scheme to create the
+transformer world model (TWM) inputs, and (c) "block teacher forcing", which
+allows the TWM to reason jointly about the future tokens of the next timestep.
 
-This paper introduces Typhoon T1, an open effort to develop an open Thai
-reasoning model. A reasoning model is a relatively new type of generative model
-built on top of large language models (LLMs). A reasoning model generates a
-long chain of thought before arriving at a final answer, an approach found to
-improve performance on complex tasks. However, details on developing such a
-model are limited, especially for reasoning models that can generate traces in
-a low-resource language. Typhoon T1 presents an open effort that dives into the
-details of developing a reasoning model in a more cost-effective way by
-leveraging supervised fine-tuning using open datasets, instead of reinforcement
-learning. This paper shares the details about synthetic data generation and
-training, as well as our dataset and model weights. Additionally, we provide
-insights gained from developing a reasoning model that generalizes across
-domains and is capable of generating reasoning traces in a low-resource
-language, using Thai as an example. We hope this open effort provides a
-foundation for further research in this field.
+摘要：我們提出了一個基於模型的 RL 方法，在具有挑戰性的 Craftax-classic 基準上實現了新的技術水準，這是一個開放世界的 2D 生存遊戲，要求代理人展現廣泛的一般能力，例如強大的概括能力、深入探索和長期推理。通過一系列旨在提高樣本效率的仔細設計選擇，我們的 MBRL 演算法在僅 1M 環境步驟後就實現了 67.4% 的獎勵，顯著優於 DreamerV3（實現 53.2%），並且首次超過了人類的 65.0% 的表現。我們的演算法首先通過使用結合 CNN 和 RNN 的新穎策略架構來建構一個 SOTA 無模型基線。然後，我們對標準 MBRL 設定新增了三項改進：(a)「帶熱身的 Dyna」，它在真實和假想資料上訓練策略，(b) 影像貼片的「最近鄰代碼化器」，它改進了建立轉換器世界模型 (TWM) 輸入的方案，以及 (c)「區塊教師強制」，它允許 TWM 共同推理下一個時間步長的未來代碼。
 
-摘要：本文介紹 Typhoon T1，這是一個開放的計畫，旨在開發開放的泰語推理模型。推理模型是一種相對較新的生成模型，建構於大型語言模型 (LLM) 之上。推理模型會在得出最終答案之前產生一連串的思考，這種方法被發現可以改善複雜任務的效能。然而，關於如何開發這種模型的詳細資訊有限，特別是對於能夠以低資源語言產生軌跡的推理模型而言。Typhoon T1 提出了一個開放的計畫，深入探討如何以更具成本效益的方式開發推理模型，方法是利用開放式資料集進行監督微調，而不是強化學習。本文分享了關於合成資料產生和訓練的詳細資訊，以及我們的資料集和模型權重。此外，我們提供了從開發推理模型中獲得的見解，該模型可以跨領域概括，並能夠以低資源語言產生推理軌跡，以泰語為例。我們希望這個開放的計畫能為此領域的進一步研究奠定基礎。
+##### **Data-Efficient Model for Psychological Resilience Prediction based on Neurological Data**
+2502.01377v1 by Zhi Zhang, Yan Liu, Mengxia Gao, Yu Yang, Jiannong Cao, Wai Kai Hou, Shirley Li, Sonata Yau, Yun Kwok Wing, Tatia M. C. Lee
 
-##### **Mechanistic Unveiling of Transformer Circuits: Self-Influence as a Key to Model Reasoning**
-2502.09022v1 by Lin Zhang, Lijie Hu, Di Wang
+Psychological resilience, defined as the ability to rebound from adversity,
+is crucial for mental health. Compared with traditional resilience assessments
+through self-reported questionnaires, resilience assessments based on
+neurological data offer more objective results with biological markers, hence
+significantly enhancing credibility. This paper proposes a novel data-efficient
+model to address the scarcity of neurological data. We employ Neuro
+Kolmogorov-Arnold Networks as the structure of the prediction model. In the
+training stage, a new trait-informed multimodal representation algorithm with a
+smart chunk technique is proposed to learn the shared latent space with limited
+data. In the test stage, a new noise-informed inference algorithm is proposed
+to address the low signal-to-noise ratio of the neurological data. The proposed
+model not only shows impressive performance on both public datasets and
+self-constructed datasets but also provides some valuable psychological
+hypotheses for future research.
 
-Transformer-based language models have achieved notable success, yet their
-internal reasoning mechanisms remain largely opaque due to complex non-linear
-interactions and high-dimensional operations. While previous research suggests
-that these models implicitly encode reasoning structures, it is still unclear
-which specific multi-step thought processes they employ to solve complex tasks.
-To address this gap, we propose a novel mechanistic interpretability framework,
-SICAF, designed to trace and analyze the reasoning strategies that language
-models use in multi-step inference tasks. By employing circuit analysis and
-self-influence functions, we quantify the evolving importance of each token
-throughout the reasoning process, thereby mapping the pathways the model uses
-for inference. Applying SICAF to the GPT-2 model on the Indirect Object
-Identification (IOI) prediction task, we demonstrate how underlying circuits
-can reveal a reasoning process that aligns with human interpretability,
-offering new insights into the model's internal logic.
+摘要：心理韌性，定義為從逆境中反彈的能力，對心理健康至關重要。與通過自我報告問卷的傳統韌性評估相比，基於神經數據的韌性評估提供了更客觀的結果和生物標記，從而顯著提高了可信度。本文提出了一個新穎的數據高效模型來解決神經數據的稀缺性。我們採用神經科爾莫哥羅夫-阿諾德網路作為預測模型的結構。在訓練階段，提出了一種新的特徵信息多模態表示算法，採用智能塊技術，以有限的數據學習共享潛在空間。在測試階段，提出了一種新的噪聲信息推理算法，以解決神經數據的信噪比低的問題。所提出的模型不僅在公共數據集和自構數據集上都顯示出令人印象深刻的性能，還為未來的研究提供了一些有價值的心理假設。
+
+##### **OphthBench: A Comprehensive Benchmark for Evaluating Large Language Models in Chinese Ophthalmology**
+2502.01243v1 by Chengfeng Zhou, Ji Wang, Juanjuan Qin, Yining Wang, Ling Sun, Weiwei Dai
+
+Large language models (LLMs) have shown significant promise across various
+medical applications, with ophthalmology being a notable area of focus. Many
+ophthalmic tasks have shown substantial improvement through the integration of
+LLMs. However, before these models can be widely adopted in clinical practice,
+evaluating their capabilities and identifying their limitations is crucial. To
+address this research gap and support the real-world application of LLMs, we
+introduce the OphthBench, a specialized benchmark designed to assess LLM
+performance within the context of Chinese ophthalmic practices. This benchmark
+systematically divides a typical ophthalmic clinical workflow into five key
+scenarios: Education, Triage, Diagnosis, Treatment, and Prognosis. For each
+scenario, we developed multiple tasks featuring diverse question types,
+resulting in a comprehensive benchmark comprising 9 tasks and 591 questions.
+This comprehensive framework allows for a thorough assessment of LLMs'
+capabilities and provides insights into their practical application in Chinese
+ophthalmology. Using this benchmark, we conducted extensive experiments and
+analyzed the results from 39 popular LLMs. Our evaluation highlights the
+current gap between LLM development and its practical utility in clinical
+settings, providing a clear direction for future advancements. By bridging this
+gap, we aim to unlock the potential of LLMs and advance their development in
+ophthalmology.
 
-摘要：基於 Transformer 的語言模型已取得顯著的成功，但由於複雜的非線性交互和高維度運算，它們的內部推理機制在很大程度上仍然不透明。儘管先前的研究表明這些模型隱含地編碼推理結構，但目前仍不清楚它們採用哪些具體的多步驟思考過程來解決複雜任務。為了解決這個差距，我們提出了一個新穎的機制可解釋性框架 SICAF，旨在追蹤和分析語言模型在多步驟推理任務中使用的推理策略。通過採用電路分析和自影響函數，我們量化了推理過程中每個標記的演化重要性，從而繪製出模型用於推理的路徑。將 SICAF 應用於 GPT-2 模型上的間接賓語識別 (IOI) 預測任務，我們展示了底層電路如何揭示與人類可解釋性相符的推理過程，從而對模型的內部邏輯提供了新的見解。
+摘要：大型語言模型 (LLM) 在各種醫療應用中已展現出顯著的潛力，其中眼科是一個值得關注的重要領域。許多眼科任務已透過整合 LLM 而大幅進步。然而，在這些模型能廣泛應用於臨床實務之前，評估其能力並找出其限制至關重要。為了解決這個研究差距並支援 LLM 的實際應用，我們引入了 OphthBench，這是一個專門的基準測試，旨在評估 LLM 在中國眼科實務中的表現。此基準測試系統性地將典型眼科臨床工作流程劃分為五個關鍵情境：教育、分流、診斷、治療和預後。對於每個情境，我們開發了多項任務，包含多樣化的問題類型，最後組成一個包含 9 項任務和 591 個問題的綜合基準測試。此綜合架構可徹底評估 LLM 的能力，並提供其在中國眼科的實際應用見解。使用此基準測試，我們進行了廣泛的實驗，並分析了來自 39 個熱門 LLM 的結果。我們的評估強調了 LLM 開發與其在臨床環境中的實際效用之間的差距，為未來的進展提供了明確的方向。透過彌合此差距，我們旨在釋放 LLM 的潛力，並促進其在眼科的發展。
 
-##### **EventSTR: A Benchmark Dataset and Baselines for Event Stream based Scene Text Recognition**
-2502.09020v1 by Xiao Wang, Jingtao Jiang, Dong Li, Futian Wang, Lin Zhu, Yaowei Wang, Yongyong Tian, Jin Tang
+##### **MIND: Modality-Informed Knowledge Distillation Framework for Multimodal Clinical Prediction Tasks**
+2502.01158v1 by Alejandro Guerra-Manzanares, Farah E. Shamout
 
-Mainstream Scene Text Recognition (STR) algorithms are developed based on RGB
-cameras which are sensitive to challenging factors such as low illumination,
-motion blur, and cluttered backgrounds. In this paper, we propose to recognize
-the scene text using bio-inspired event cameras by collecting and annotating a
-large-scale benchmark dataset, termed EventSTR. It contains 9,928
-high-definition (1280 * 720) event samples and involves both Chinese and
-English characters. We also benchmark multiple STR algorithms as the baselines
-for future works to compare. In addition, we propose a new event-based scene
-text recognition framework, termed SimC-ESTR. It first extracts the event
-features using a visual encoder and projects them into tokens using a Q-former
-module. More importantly, we propose to augment the vision tokens based on a
-memory mechanism before feeding into the large language models. A
-similarity-based error correction mechanism is embedded within the large
-language model to correct potential minor errors fundamentally based on
-contextual information. Extensive experiments on the newly proposed EventSTR
-dataset and two simulation STR datasets fully demonstrate the effectiveness of
-our proposed model. We believe that the dataset and algorithmic model can
-innovatively propose an event-based STR task and are expected to accelerate the
-application of event cameras in various industries. The source code and
-pre-trained models will be released on https://github.com/Event-AHU/EventSTR
+Multimodal fusion leverages information across modalities to learn better
+feature representations with the goal of improving performance in fusion-based
+tasks. However, multimodal datasets, especially in medical settings, are
+typically smaller than their unimodal counterparts, which can impede the
+performance of multimodal models. Additionally, the increase in the number of
+modalities is often associated with an overall increase in the size of the
+multimodal network, which may be undesirable in medical use cases. Utilizing
+smaller unimodal encoders may lead to sub-optimal performance, particularly
+when dealing with high-dimensional clinical data. In this paper, we propose the
+Modality-INformed knowledge Distillation (MIND) framework, a multimodal model
+compression approach based on knowledge distillation that transfers knowledge
+from ensembles of pre-trained deep neural networks of varying sizes into a
+smaller multimodal student. The teacher models consist of unimodal networks,
+allowing the student to learn from diverse representations. MIND employs
+multi-head joint fusion models, as opposed to single-head models, enabling the
+use of unimodal encoders in the case of unimodal samples without requiring
+imputation or masking of absent modalities. As a result, MIND generates an
+optimized multimodal model, enhancing both multimodal and unimodal
+representations. It can also be leveraged to balance multimodal learning during
+training. We evaluate MIND on binary and multilabel clinical prediction tasks
+using time series data and chest X-ray images. Additionally, we assess the
+generalizability of the MIND framework on three non-medical multimodal
+multiclass datasets. Experimental results demonstrate that MIND enhances the
+performance of the smaller multimodal network across all five tasks, as well as
+various fusion methods and multimodal architectures, compared to
+state-of-the-art baselines.
 
-摘要：主流場景文字辨識（STR）演算法是基於對低光源、動態模糊和雜亂背景等挑戰性因素敏感的 RGB 相機開發的。在本文中，我們提出使用生物靈感事件相機辨識場景文字，方法是收集和標註一個稱為 EventSTR 的大規模基準資料集。它包含 9,928 個高畫質（1280 * 720）事件範例，並包含中文字和英文字元。我們也基準化多個 STR 演算法作為未來工作的基準，以進行比較。此外，我們提出一個新的基於事件的場景文字辨識架構，稱為 SimC-ESTR。它首先使用視覺編碼器萃取事件特徵，並使用 Q-former 模組將它們投影到代幣中。更重要的是，我們提出在輸入大型語言模型之前，基於記憶機制擴充視覺代幣。一個基於相似性的錯誤修正機制嵌入在大型語言模型中，以根據上下文資訊從根本上修正潛在的輕微錯誤。在最新提出的 EventSTR 資料集和兩個模擬 STR 資料集上進行的廣泛實驗充分證明了我們提出的模型的有效性。我們相信，該資料集和演算法模型可以創新地提出一個基於事件的 STR 任務，並有望加速事件相機在各個產業的應用。原始碼和預先訓練的模型將在 https://github.com/Event-AHU/EventSTR 上釋出
+摘要：多模态融合利用跨模态的信息来学习更好的特征表示，目标是提升基于融合的任务的性能。然而，多模态数据集，尤其是在医疗环境中，通常比它们的单模态对应数据集小，这会阻碍多模态模型的性能。此外，模态数量的增加通常与多模态网络尺寸的整体增加相关，这在医疗用例中可能是不可取的。利用较小的单模态编码器可能会导致次优性能，尤其是在处理高维临床数据时。在本文中，我们提出了模态信息知识蒸馏 (MIND) 框架，这是一种基于知识蒸馏的多模态模型压缩方法，它将来自不同大小的预训练深度神经网络的集合中的知识转移到一个较小的多模态学生中。教师模型由单模态网络组成，允许学生从不同的表示中学习。MIND 采用多头联合融合模型，而不是单头模型，从而能够在单模态样本的情况下使用单模态编码器，而不需要缺失模态的插补或掩蔽。因此，MIND 生成了一个经过优化的多模态模型，增强了多模态和单模态表示。它还可以用来在训练期间平衡多模态学习。我们使用时间序列数据和胸部 X 射线图像对二元和多标签临床预测任务评估了 MIND。此外，我们评估了 MIND 框架在三个非医疗多模态多分类数据集上的泛化性。实验结果表明，与最先进的基线相比，MIND 增强了较小的多模态网络在所有五个任务以及各种融合方法和多模态架构中的性能。
 
-##### **Zero-shot Concept Bottleneck Models**
-2502.09018v1 by Shin'ya Yamaguchi, Kosuke Nishida, Daiki Chijiwa, Yasutoshi Ida
+##### **Beyond Yes or No: Predictive Compliance Monitoring Approaches for Quantifying the Magnitude of Compliance Violations**
+2502.01141v1 by Qian Chen, Stefanie Rinderle-Ma, Lijie Wen
 
-Concept bottleneck models (CBMs) are inherently interpretable and
-intervenable neural network models, which explain their final label prediction
-by the intermediate prediction of high-level semantic concepts. However, they
-require target task training to learn input-to-concept and concept-to-label
-mappings, incurring target dataset collections and training resources. In this
-paper, we present \textit{zero-shot concept bottleneck models} (Z-CBMs), which
-predict concepts and labels in a fully zero-shot manner without training neural
-networks. Z-CBMs utilize a large-scale concept bank, which is composed of
-millions of vocabulary extracted from the web, to describe arbitrary input in
-various domains. For the input-to-concept mapping, we introduce concept
-retrieval, which dynamically finds input-related concepts by the cross-modal
-search on the concept bank. In the concept-to-label inference, we apply concept
-regression to select essential concepts from the retrieved concepts by sparse
-linear regression. Through extensive experiments, we confirm that our Z-CBMs
-provide interpretable and intervenable concepts without any additional
-training. Code will be available at https://github.com/yshinya6/zcbm.
+Most existing process compliance monitoring approaches detect compliance
+violations in an ex post manner. Only predicate prediction focuses on
+predicting them. However, predicate prediction provides a binary yes/no notion
+of compliance, lacking the ability to measure to which extent an ongoing
+process instance deviates from the desired state as specified in constraints.
+Here, being able to quantify the magnitude of violation would provide
+organizations with deeper insights into their operational performance, enabling
+informed decision making to reduce or mitigate the risk of non-compliance.
+Thus, we propose two predictive compliance monitoring approaches to close this
+research gap. The first approach reformulates the binary classification problem
+as a hybrid task that considers both classification and regression, while the
+second employs a multi-task learning method to explicitly predict the
+compliance status and the magnitude of violation for deviant cases
+simultaneously. In this work, we focus on temporal constraints as they are
+significant in almost any application domain, e.g., health care. The evaluation
+on synthetic and real-world event logs demonstrates that our approaches are
+capable of quantifying the magnitude of violations while maintaining comparable
+performance for compliance predictions achieved by state-of-the-art approaches.
 
-摘要：概念瓶頸模型 (CBM) 本質上是可解釋且可干預的神經網路模型，它們透過對高階語意概念的中間預測來解釋其最終標籤預測。然而，它們需要目標任務訓練來學習輸入到概念和概念到標籤的對應，導致目標資料集收集和訓練資源。在本文中，我們展示了「零次學習概念瓶頸模型」(Z-CBM)，它以完全零次學習的方式預測概念和標籤，而無需訓練神經網路。Z-CBM 利用一個大型概念庫，其中包含從網路中擷取的數百萬個詞彙，來描述各種領域中的任意輸入。對於輸入到概念的對應，我們引入了概念擷取，它透過對概念庫的跨模態搜尋，動態地找出與輸入相關的概念。在概念到標籤的推論中，我們應用概念迴歸，透過稀疏線性迴歸從擷取的概念中選擇必要的概念。透過廣泛的實驗，我們確認我們的 Z-CBM 在沒有任何額外訓練的情況下提供了可解釋且可干預的概念。程式碼將可在 https://github.com/yshinya6/zcbm 取得。
+摘要：現有的流程合規監控方法大多會在事後偵測到合規違規。只有謂詞預測專注於預測這些違規。然而，謂詞預測提供的是合規與否的二元概念，無法衡量正在進行的流程實例偏離約束中所指定之理想狀態的程度。在此，能夠量化違規的嚴重程度，將能讓組織深入了解其營運績效，並能據此做出明智的決策，以降低或減輕不合規的風險。因此，我們提出兩種預測合規監控方法來填補此研究空白。第一種方法將二元分類問題重新表述為同時考量分類和回歸的混合任務，而第二種方法則採用多任務學習方法，同時明確預測合規狀態和偏差案例的違規嚴重程度。在這項工作中，我們專注於時間約束，因為它們幾乎在任何應用領域（例如醫療保健）中都很重要。在合成和真實世界事件記錄上的評估顯示，我們的做法能夠量化違規的嚴重程度，同時維持與現有方法所達成的合規預測相當的績效。
 
-##### **Diversity Enhances an LLM's Performance in RAG and Long-context Task**
-2502.09017v1 by Zhchao Wang, Bin Bi, Yanqi Luo, Sitaram Asur, Claire Na Cheng
+##### **Pulse-PPG: An Open-Source Field-Trained PPG Foundation Model for Wearable Applications Across Lab and Field Settings**
+2502.01108v1 by Mithun Saha, Maxwell A. Xu, Wanting Mao, Sameer Neupane, James M. Rehg, Santosh Kumar
 
-The rapid advancements in large language models (LLMs) have highlighted the
-challenge of context window limitations, primarily due to the quadratic time
-complexity of the self-attention mechanism (\(O(N^2)\), where \(N\) denotes the
-context window length). This constraint impacts tasks such as
-retrieval-augmented generation (RAG) in question answering (Q\&A) and long
-context summarization. A common approach involves selecting content with the
-highest similarity to the query; however, this often leads to redundancy and
-the exclusion of diverse yet relevant information. Building on principles from
-Maximal Marginal Relevance (MMR) and Farthest Point Sampling (FPS), we
-integrate diversity into the content selection process. Our findings reveal
-that incorporating diversity substantially increases the recall of selecting
-relevant sentences or chunks before LLM-based Q\&A and summarization. These
-results highlight the importance of maintaining diversity in future LLM
-applications to further improve summarization and Q\&A outcomes.
+Photoplethysmography (PPG)-based foundation models are gaining traction due
+to the widespread use of PPG in biosignal monitoring and their potential to
+generalize across diverse health applications. In this paper, we introduce
+Pulse-PPG, the first open-source PPG foundation model trained exclusively on
+raw PPG data collected over a 100-day field study with 120 participants.
+Existing PPG foundation models are either open-source but trained on clinical
+data or closed-source, limiting their applicability in real-world settings. We
+evaluate Pulse-PPG across multiple datasets and downstream tasks, comparing its
+performance against a state-of-the-art foundation model trained on clinical
+data. Our results demonstrate that Pulse-PPG, trained on uncurated field data,
+exhibits superior generalization across clinical and mobile health applications
+in both lab and field settings. This suggests that exposure to real-world
+variability enables the model to learn fine-grained representations, making it
+more adaptable across tasks. Furthermore, pre-training on field data
+surprisingly outperforms its pre-training on clinical data in many tasks,
+reinforcing the importance of training on real-world, diverse datasets. To
+encourage further advancements in robust foundation models leveraging field
+data, we plan to release Pulse-PPG, providing researchers with a powerful
+resource for developing more generalizable PPG-based models.
 
-摘要：大型語言模型 (LLM) 的快速進步凸顯了上下文視窗限制的挑戰，這主要是由於自注意力機制的二次時間複雜度（\(O(N^2)\)），其中 \(N\) 表示上下文視窗長度。此限制會影響任務，例如問答 (Q&A) 中的檢索增強生成 (RAG) 和長文摘要。一種常見的方法涉及選擇與查詢最相似的內容；然而，這通常會導致冗餘，並排除多樣化但相關的資訊。我們根據最大邊際相關性 (MMR) 和最遠點取樣 (FPS) 的原則，將多樣性整合到內容選擇過程中。我們的研究結果顯示，在基於 LLM 的問答和摘要之前，納入多樣性會大幅增加選擇相關句子或區塊的召回率。這些結果突顯了在未來的 LLM 應用中維持多樣性的重要性，以進一步改善摘要和問答的結果。
+摘要：基於光電容積描記術 (PPG) 的基礎模型由於 PPG 在生物訊號監控中的廣泛使用及其在各種健康應用中推廣的潛力而備受關注。在本文中，我們介紹 Pulse-PPG，這是第一個開放原始碼 PPG 基礎模型，專門針對在為期 100 天的現場研究中收集的 120 位參與者的原始 PPG 資料進行訓練。現有的 PPG 基礎模型要不是開放原始碼，但訓練於臨床資料，不然就是閉源，這限制了它們在真實世界中的應用性。我們評估了 Pulse-PPG 在多個資料集和下游任務中的表現，並將其效能與訓練於臨床資料的最新基礎模型進行比較。我們的結果表明，訓練於未整理現場資料的 Pulse-PPG 在實驗室和現場環境中，在臨床和行動健康應用中展現出優異的泛化能力。這表明接觸真實世界的變異性使模型能夠學習細粒度的表示，使其更能適應各種任務。此外，令人驚訝的是，現場資料的預訓練在許多任務中優於臨床資料的預訓練，這強化了在真實世界、多樣化的資料集上訓練的重要性。為了鼓勵在利用現場資料的強健基礎模型方面進一步發展，我們計畫發布 Pulse-PPG，為研究人員提供一個強大的資源，用於開發更具泛化性的基於 PPG 的模型。
 
-##### **Hope vs. Hate: Understanding User Interactions with LGBTQ+ News Content in Mainstream US News Media through the Lens of Hope Speech**
-2502.09004v1 by Jonathan Pofcher, Christopher M. Homan, Randall Sell, Ashiqur R. KhudaBukhsh
+##### **Tutorial on Using Machine Learning and Deep Learning Models for Mental Illness Detection**
+2502.04342v1 by Yeyubei Zhang, Zhongyan Wang, Zhanyi Ding, Yexin Tian, Jianglai Dai, Xiaorui Shen, Yunchong Liu, Yuchen Cao
 
-This paper makes three contributions. First, via a substantial corpus of
-1,419,047 comments posted on 3,161 YouTube news videos of major US cable news
-outlets, we analyze how users engage with LGBTQ+ news content. Our analyses
-focus both on positive and negative content. In particular, we construct a
-fine-grained hope speech classifier that detects positive (hope speech),
-negative, neutral, and irrelevant content. Second, in consultation with a
-public health expert specializing on LGBTQ+ health, we conduct an annotation
-study with a balanced and diverse political representation and release a
-dataset of 3,750 instances with fine-grained labels and detailed annotator
-demographic information. Finally, beyond providing a vital resource for the
-LGBTQ+ community, our annotation study and subsequent in-the-wild assessments
-reveal (1) strong association between rater political beliefs and how they rate
-content relevant to a marginalized community; (2) models trained on individual
-political beliefs exhibit considerable in-the-wild disagreement; and (3)
-zero-shot large language models (LLMs) align more with liberal raters.
+Social media has become an important source for understanding mental health,
+providing researchers with a way to detect conditions like depression from
+user-generated posts. This tutorial provides practical guidance to address
+common challenges in applying machine learning and deep learning methods for
+mental health detection on these platforms. It focuses on strategies for
+working with diverse datasets, improving text preprocessing, and addressing
+issues such as imbalanced data and model evaluation. Real-world examples and
+step-by-step instructions demonstrate how to apply these techniques
+effectively, with an emphasis on transparency, reproducibility, and ethical
+considerations. By sharing these approaches, this tutorial aims to help
+researchers build more reliable and widely applicable models for mental health
+research, contributing to better tools for early detection and intervention.
 
-摘要：本文做出了三項貢獻。首先，透過一個龐大的語料庫，其中包含 1,419,047 則評論，這些評論張貼在 3,161 部美國有線新聞頻道的 YouTube 新聞影片上，我們分析了使用者如何參與 LGBTQ+ 新聞內容。我們的分析重點在於正面和負面的內容。特別是，我們建構了一個細緻的希望言論分類器，用來偵測正面的（希望言論）、負面的、中立的和不相關的內容。其次，在諮詢了一位專門研究 LGBTQ+ 健康的公共衛生專家後，我們進行了一項標註研究，其中包含平衡且多元的政治代表性，並發布了一個包含 3,750 個實例的資料集，其中包含細緻的標籤和詳細的標註者人口統計資訊。最後，除了為 LGBTQ+ 社群提供重要的資源外，我們的標註研究和後續的實際評估揭示了：(1) 評分者的政治信仰與他們如何評分與邊緣化社群相關的內容之間有很強的關聯性；(2) 根據個人政治信仰訓練的模型在實際應用中表現出相當大的分歧；(3) 零次學習大型語言模型 (LLM) 與自由派評分者的看法更一致。
+摘要：社群媒體已成為了解心理健康的重要來源，
+為研究人員提供一種方式，從使用者發布的貼文中偵測憂鬱症等狀況。
+本教學提供實務指南，說明如何處理在這些平台上使用機器學習和深度學習方法進行心理健康偵測時常見的挑戰。
+它專注於處理不同資料集、改善文字前處理，以及處理不平衡資料和模型評估等問題的策略。
+實際範例和逐步說明示範如何有效應用這些技術，並強調透明度、可複製性，以及倫理考量。
+透過分享這些方法，本教學指南旨在協助研究人員建構更可靠且廣泛適用的心理健康研究模型，
+進而有助於早期偵測和介入的工具。
 
-##### **RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models**
-2502.09003v1 by Quan Wei, Chung-Yiu Yau, Hoi-To Wai, Yang, Zhao, Dongyeop Kang, Youngsuk Park, Mingyi Hong
+##### **Agent-Based Uncertainty Awareness Improves Automated Radiology Report Labeling with an Open-Source Large Language Model**
+2502.01691v1 by Hadas Ben-Atya, Naama Gavrielov, Zvi Badash, Gili Focht, Ruth Cytter-Kuint, Talar Hagopian, Dan Turner, Moti Freiman
 
-Supervised fine-tuning is a standard method for adapting pre-trained large
-language models (LLMs) to downstream tasks. Quantization has been recently
-studied as a post-training technique for efficient LLM deployment. To obtain
-quantized fine-tuned LLMs, conventional pipelines would first fine-tune the
-pre-trained models, followed by post-training quantization. This often yields
-suboptimal performance as it fails to leverage the synergy between fine-tuning
-and quantization. To effectively realize low-bit quantization of weights,
-activations, and KV caches in LLMs, we propose an algorithm named Rotated
-Straight-Through-Estimator (RoSTE), which combines quantization-aware
-supervised fine-tuning (QA-SFT) with an adaptive rotation strategy that
-identifies an effective rotation configuration to reduce activation outliers.
-We provide theoretical insights on RoSTE by analyzing its prediction error when
-applied to an overparameterized least square quantized training problem. Our
-findings reveal that the prediction error is directly proportional to the
-quantization error of the converged weights, which can be effectively managed
-through an optimized rotation configuration. Experiments on Pythia and Llama
-models of different sizes demonstrate the effectiveness of RoSTE. Compared to
-existing post-SFT quantization baselines, our method consistently achieves
-superior performances across various tasks and different LLM architectures.
+Reliable extraction of structured data from radiology reports using Large
+Language Models (LLMs) remains challenging, especially for complex, non-English
+texts like Hebrew. This study introduces an agent-based uncertainty-aware
+approach to improve the trustworthiness of LLM predictions in medical
+applications. We analyzed 9,683 Hebrew radiology reports from Crohn's disease
+patients (from 2010 to 2023) across three medical centers. A subset of 512
+reports was manually annotated for six gastrointestinal organs and 15
+pathological findings, while the remaining reports were automatically annotated
+using HSMP-BERT. Structured data extraction was performed using Llama 3.1
+(Llama 3-8b-instruct) with Bayesian Prompt Ensembles (BayesPE), which employed
+six semantically equivalent prompts to estimate uncertainty. An Agent-Based
+Decision Model integrated multiple prompt outputs into five confidence levels
+for calibrated uncertainty and was compared against three entropy-based models.
+Performance was evaluated using accuracy, F1 score, precision, recall, and
+Cohen's Kappa before and after filtering high-uncertainty cases. The
+agent-based model outperformed the baseline across all metrics, achieving an F1
+score of 0.3967, recall of 0.6437, and Cohen's Kappa of 0.3006. After filtering
+high-uncertainty cases (greater than or equal to 0.5), the F1 score improved to
+0.4787, and Kappa increased to 0.4258. Uncertainty histograms demonstrated
+clear separation between correct and incorrect predictions, with the
+agent-based model providing the most well-calibrated uncertainty estimates. By
+incorporating uncertainty-aware prompt ensembles and an agent-based decision
+model, this approach enhances the performance and reliability of LLMs in
+structured data extraction from radiology reports, offering a more
+interpretable and trustworthy solution for high-stakes medical applications.
 
-摘要：監督式微調是將預訓練的大型語言模型 (LLM) 適應至下游任務的標準方法。量化最近已被研究作為一種訓練後技術，用於高效部署 LLM。為了獲得量化的微調 LLM，傳統管道會先微調預訓練模型，然後再進行訓練後量化。這通常會產生次佳效能，因為它無法利用微調和量化之間的協同效應。為了有效實現 LLM 中權重、激活和 KV 快取的低位元量化，我們提出了一種名為旋轉直通估計器 (RoSTE) 的演算法，它結合了量化感知監督式微調 (QA-SFT) 和一種自適應旋轉策略，該策略會識別有效的旋轉組態以減少激活異常值。我們透過分析 RoSTE 在應用於過度參數化最小平方量化訓練問題時的預測誤差，提供了關於 RoSTE 的理論見解。我們的研究結果顯示，預測誤差與收斂權重的量化誤差成正比，而這可透過最佳化的旋轉組態有效地管理。在不同大小的 Pythia 和 Llama 模型上進行的實驗證明了 RoSTE 的有效性。與現有的訓練後 SFT 量化基準相比，我們的模型在各種任務和不同的 LLM 架構中持續獲得優異的效能。
+摘要：<paragraph>使用大型語言模型 (LLM) 從放射科報告中可靠地提取結構化數據仍然具有挑戰性，尤其是對於希伯來語等複雜的非英語文本。本研究引入了一種基於代理的不確定性感知方法，以提高 LLM 預測在醫療應用中的可信度。我們分析了來自三個醫療中心的 9,683 份克隆氏症患者的希伯來語放射科報告（從 2010 年到 2023 年）。其中 512 份報告的手動註釋包括六個胃腸器官和 15 個病理發現，而其餘報告則使用 HSMP-BERT 自動註釋。結構化數據提取使用 Llama 3.1（Llama 3-8b-instruct）與貝葉斯提示集合（BayesPE）進行，它採用六個語義等效提示來估計不確定性。基於代理的決策模型將多個提示輸出整合到五個置信度級別中以校準不確定性，並與三個基於熵的模型進行比較。在過濾掉高度不確定性的情況之前和之後，使用準確度、F1 分數、精確度、召回率和 Cohen's Kappa 評估性能。基於代理的模型在所有指標上都優於基線，F1 分數達到 0.3967，召回率達到 0.6437，Cohen's Kappa 達到 0.3006。在過濾掉高度不確定性的情況（大於或等於 0.5）後，F1 分數提高到 0.4787，Kappa 提高到 0.4258。不確定性直方圖顯示了正確預測和不正確預測之間的明顯區別，基於代理的模型提供了校準最好的不確定性估計。通過結合不確定性感知提示集合和基於代理的決策模型，這種方法增強了 LLM 在放射科報告中結構化數據提取中的性能和可靠性，為高風險醫療應用提供了更具可解釋性和可信度的解決方案。</paragraph>
 
-##### **PixLift: Accelerating Web Browsing via AI Upscaling**
-2502.08995v1 by Yonas Atinafu, Sarthak Malla, HyunSeok Daniel Jang, Nouar Aldahoul, Matteo Varvello, Yasir Zaki
+##### **Automated Extraction of Spatio-Semantic Graphs for Identifying Cognitive Impairment**
+2502.01685v1 by Si-Ioi Ng, Pranav S. Ambadi, Kimberly D. Mueller, Julie Liss, Visar Berisha
 
-Accessing the internet in regions with expensive data plans and limited
-connectivity poses significant challenges, restricting information access and
-economic growth. Images, as a major contributor to webpage sizes, exacerbate
-this issue, despite advances in compression formats like WebP and AVIF. The
-continued growth of complex and curated web content, coupled with suboptimal
-optimization practices in many regions, has prevented meaningful reductions in
-web page sizes. This paper introduces PixLift, a novel solution to reduce
-webpage sizes by downscaling their images during transmission and leveraging AI
-models on user devices to upscale them. By trading computational resources for
-bandwidth, PixLift enables more affordable and inclusive web access. We address
-key challenges, including the feasibility of scaled image requests on popular
-websites, the implementation of PixLift as a browser extension, and its impact
-on user experience. Through the analysis of 71.4k webpages, evaluations of
-three mainstream upscaling models, and a user study, we demonstrate PixLift's
-ability to significantly reduce data usage without compromising image quality,
-fostering a more equitable internet.
+Existing methods for analyzing linguistic content from picture descriptions
+for assessment of cognitive-linguistic impairment often overlook the
+participant's visual narrative path, which typically requires eye tracking to
+assess. Spatio-semantic graphs are a useful tool for analyzing this narrative
+path from transcripts alone, however they are limited by the need for manual
+tagging of content information units (CIUs). In this paper, we propose an
+automated approach for estimation of spatio-semantic graphs (via automated
+extraction of CIUs) from the Cookie Theft picture commonly used in
+cognitive-linguistic analyses. The method enables the automatic
+characterization of the visual semantic path during picture description.
+Experiments demonstrate that the automatic spatio-semantic graphs effectively
+differentiate between cognitively impaired and unimpaired speakers. Statistical
+analyses reveal that the features derived by the automated method produce
+comparable results to the manual method, with even greater group differences
+between clinical groups of interest. These results highlight the potential of
+the automated approach for extracting spatio-semantic features in developing
+clinical speech models for cognitive impairment assessment.
 
-摘要：在數據方案昂貴且連線有限的地區存取網路會造成重大挑戰，限制了資訊存取和經濟成長。圖像作為網頁大小的主要貢獻者，儘管 WebP 和 AVIF 等壓縮格式進步，但仍加劇了這個問題。複雜且經過策劃的網路內容持續成長，加上許多地區次佳的最佳化實務，已阻礙了網頁大小的顯著減少。本文介紹 PixLift，這是一種創新的解決方案，可在傳輸過程中縮小圖像大小，並利用使用者裝置上的 AI 模型來放大圖像，藉此縮小網頁大小。PixLift 透過以運算資源換取頻寬，讓網路存取更經濟實惠且更具包容性。我們解決了關鍵挑戰，包括熱門網站上縮放圖像要求的可行性、將 PixLift 實作為瀏覽器擴充功能，以及它對使用者體驗的影響。透過分析 71.4k 個網頁、評估三個主流放大模型，以及使用者研究，我們展示了 PixLift 在不影響影像品質的情況下顯著減少資料用量的能力，促進了更公平的網路。
+摘要：現有的用於分析圖像描述中的語言內容的方法，用於評估認知語言障礙，通常會忽略參與者的視覺敘事路徑，這通常需要眼球追蹤來評估。時空語義圖是一種有用的工具，可以僅從轉錄本中分析此敘事路徑，但是它們受到手動標記內容資訊單元 (CIU) 的需求所限制。在本文中，我們提出了一種自動化方法，用於從認知語言分析中常用的 Cookie Theft 圖像估計時空語義圖（通過自動提取 CIU）。該方法能夠自動表徵圖片描述期間的視覺語義路徑。實驗表明，自動時空語義圖有效地區分了認知受損和未受損的說話者。統計分析表明，自動化方法衍生的特徵產生了與手動方法相當的結果，甚至在感興趣的臨床組之間產生了更大的組差異。這些結果突出了自動化方法在提取時空語義特徵以開發用於認知障礙評估的臨床語音模型方面的潛力。
 
-##### **RLSA-PFL: Robust Lightweight Secure Aggregation with Model Inconsistency Detection in Privacy-Preserving Federated Learning**
-2502.08989v1 by Nazatul H. Sultan, Yan Bo, Yansong Gao, Seyit Camtepe, Arash Mahboubi, Hang Thanh Bui, Aufeef Chauhan, Hamed Aboutorab, Michael Bewong, Praveen Gauravaram, Rafiqul Islam, Sharif Abuadbba
+##### **Registration-Enhanced Segmentation Method for Prostate Cancer in Ultrasound Images**
+2502.00712v1 by Shengtian Sang, Hassan Jahanandish, Cynthia Xinran Li, Indrani Bhattachary, Jeong Hoon Lee, Lichun Zhang, Sulaiman Vesal, Pejman Ghanouni, Richard Fan, Geoffrey A. Sonn, Mirabela Rusu
 
-Federated Learning (FL) allows users to collaboratively train a global
-machine learning model by sharing local model only, without exposing their
-private data to a central server. This distributed learning is particularly
-appealing in scenarios where data privacy is crucial, and it has garnered
-substantial attention from both industry and academia. However, studies have
-revealed privacy vulnerabilities in FL, where adversaries can potentially infer
-sensitive information from the shared model parameters. In this paper, we
-present an efficient masking-based secure aggregation scheme utilizing
-lightweight cryptographic primitives to mitigate privacy risks. Our scheme
-offers several advantages over existing methods. First, it requires only a
-single setup phase for the entire FL training session, significantly reducing
-communication overhead. Second, it minimizes user-side overhead by eliminating
-the need for user-to-user interactions, utilizing an intermediate server layer
-and a lightweight key negotiation method. Third, the scheme is highly resilient
-to user dropouts, and the users can join at any FL round. Fourth, it can detect
-and defend against malicious server activities, including recently discovered
-model inconsistency attacks. Finally, our scheme ensures security in both
-semi-honest and malicious settings. We provide security analysis to formally
-prove the robustness of our approach. Furthermore, we implemented an end-to-end
-prototype of our scheme. We conducted comprehensive experiments and
-comparisons, which show that it outperforms existing solutions in terms of
-communication and computation overhead, functionality, and security.
+Prostate cancer is a major cause of cancer-related deaths in men, where early
+detection greatly improves survival rates. Although MRI-TRUS fusion biopsy
+offers superior accuracy by combining MRI's detailed visualization with TRUS's
+real-time guidance, it is a complex and time-intensive procedure that relies
+heavily on manual annotations, leading to potential errors. To address these
+challenges, we propose a fully automatic MRI-TRUS fusion-based segmentation
+method that identifies prostate tumors directly in TRUS images without
+requiring manual annotations. Unlike traditional multimodal fusion approaches
+that rely on naive data concatenation, our method integrates a
+registration-segmentation framework to align and leverage spatial information
+between MRI and TRUS modalities. This alignment enhances segmentation accuracy
+and reduces reliance on manual effort. Our approach was validated on a dataset
+of 1,747 patients from Stanford Hospital, achieving an average Dice coefficient
+of 0.212, outperforming TRUS-only (0.117) and naive MRI-TRUS fusion (0.132)
+methods, with significant improvements (p $<$ 0.01). This framework
+demonstrates the potential for reducing the complexity of prostate cancer
+diagnosis and provides a flexible architecture applicable to other multimodal
+medical imaging tasks.
 
-摘要：聯合式學習 (FL) 使用者可以透過僅分享本機模型，在不將其私人資料揭露給中央伺服器的情況下，共同訓練全球機器學習模型。這種分散式學習在資料隱私至關重要的場景中特別具有吸引力，並且已獲得業界和學術界的廣泛關注。然而，研究顯示 FL 中存在隱私漏洞，其中對手可能會從共享模型參數中推斷出敏感資訊。在本文中，我們提出了一種有效率的基於遮罩的安全聚合方案，利用輕量級的密碼原語來降低隱私風險。我們的方案相較於現有方法提供了多項優點。首先，它僅需要在整個 FL 訓練階段進行一次設定階段，大幅降低了通訊開銷。其次，透過消除使用者間互動的需要，利用中間伺服器層和輕量級金鑰協商方法，將使用者端的開銷降到最低。第三，該方案對使用者中斷具有高度的復原力，使用者可以在任何 FL 回合中加入。第四，它可以偵測和防禦惡意伺服器活動，包括最近發現的模型不一致攻擊。最後，我們的方案確保在半誠實和惡意設定中都能獲得安全性。我們提供了安全分析，以正式證明我們方法的穩健性。此外，我們實作了我們方案的端對端原型。我們進行了全面的實驗和比較，結果顯示，在通訊和運算開銷、功能和安全性方面，它優於現有的解決方案。
+摘要：前列腺癌是男性癌症相關死亡的主要原因，早期發現可大幅提升存活率。儘管 MRI-TRUS 融合切片檢查結合了 MRI 的詳細視覺化與 TRUS 的即時導引，可提供更高的準確度，但它是一種仰賴大量手動註解的複雜且耗時的程序，容易導致錯誤。為了解決這些挑戰，我們提出了一種全自動的 MRI-TRUS 融合式分割方法，它可以在 TRUS 影像中直接辨識出前列腺腫瘤，而不需要手動註解。與依賴於天真資料串接的傳統多模態融合方法不同，我們的方法整合了一個配準分割架構，以對齊並利用 MRI 與 TRUS 模態之間的空間資訊。這種對齊提升了分割準確度，並減少了對手動作業的依賴。我們的方法已通過來自 Stanford 醫院的 1,747 位患者的資料集進行驗證，達到了 0.212 的平均 Dice 係數，優於僅使用 TRUS (0.117) 和天真的 MRI-TRUS 融合 (0.132) 方法，並有顯著的改善（p < 0.01）。這個架構證明了降低前列腺癌診斷複雜性的潛力，並提供了一個適用於其他多模態醫學影像任務的彈性架構。
 
-##### **Neural Force Field: Learning Generalized Physical Representation from a Few Examples**
-2502.08987v1 by Shiqian Li, Ruihong Shen, Chi Zhang, Yixin Zhu
+##### **TMI-CLNet: Triple-Modal Interaction Network for Chronic Liver Disease Prognosis From Imaging, Clinical, and Radiomic Data Fusion**
+2502.00695v1 by Linglong Wu, Xuhao Shan, Ruiquan Ge, Ruoyu Liang, Chi Zhang, Yonghong Li, Ahmed Elazab, Huoling Luo, Yunbi Liu, Changmiao Wang
 
-Physical reasoning is a remarkable human ability that enables rapid learning
-and generalization from limited experience. Current AI models, despite
-extensive training, still struggle to achieve similar generalization,
-especially in Out-of-distribution (OOD) settings. This limitation stems from
-their inability to abstract core physical principles from observations. A key
-challenge is developing representations that can efficiently learn and
-generalize physical dynamics from minimal data. Here we present Neural Force
-Field (NFF) a modeling framework built on Neural Ordinary Differential Equation
-(NODE) that learns interpretable force field representations which can be
-efficiently integrated through an Ordinary Differential Equation ( ODE) solver
-to predict object trajectories. Unlike existing approaches that rely on
-high-dimensional latent spaces, NFF captures fundamental physical concepts such
-as gravity, support, and collision in an interpretable manner. Experiments on
-two challenging physical reasoning tasks demonstrate that NFF, trained with
-only a few examples, achieves strong generalization to unseen scenarios. This
-physics-grounded representation enables efficient forward-backward planning and
-rapid adaptation through interactive refinement. Our work suggests that
-incorporating physics-inspired representations into learning systems can help
-bridge the gap between artificial and human physical reasoning capabilities.
+Chronic liver disease represents a significant health challenge worldwide and
+accurate prognostic evaluations are essential for personalized treatment plans.
+Recent evidence suggests that integrating multimodal data, such as computed
+tomography imaging, radiomic features, and clinical information, can provide
+more comprehensive prognostic information. However, modalities have an inherent
+heterogeneity, and incorporating additional modalities may exacerbate the
+challenges of heterogeneous data fusion. Moreover, existing multimodal fusion
+methods often struggle to adapt to richer medical modalities, making it
+difficult to capture inter-modal relationships. To overcome these limitations,
+We present the Triple-Modal Interaction Chronic Liver Network (TMI-CLNet).
+Specifically, we develop an Intra-Modality Aggregation module and a
+Triple-Modal Cross-Attention Fusion module, which are designed to eliminate
+intra-modality redundancy and extract cross-modal information, respectively.
+Furthermore, we design a Triple-Modal Feature Fusion loss function to align
+feature representations across modalities. Extensive experiments on the liver
+prognosis dataset demonstrate that our approach significantly outperforms
+existing state-of-the-art unimodal models and other multi-modal techniques. Our
+code is available at https://github.com/Mysterwll/liver.git.
 
-摘要：物理推理是人类非凡的能力，它能从有限的经验中快速学习和概括。尽管经过广泛的训练，但当前的人工智能模型在实现类似的概括方面仍然存在困难，尤其是在分布外 (OOD) 设置中。这种限制源于它们无法从观察中抽象出核心物理原理。一个关键挑战是开发能够从最少数据中有效学习和概括物理动力学的表示。在这里，我们介绍了神经力场 (NFF)，这是一种建立在神经常微分方程 (NODE) 上的建模框架，它学习可解释的力场表示，这些表示可以通过常微分方程 (ODE) 求解器有效地进行积分，以预测物体轨迹。与依赖于高维潜在空间的现有方法不同，NFF 以可解释的方式捕获了诸如重力、支撑和碰撞等基本物理概念。在两个具有挑战性的物理推理任务上的实验表明，仅通过几个示例训练的 NFF 实现了对看不见场景的强大概括。这种基于物理的表示能够进行高效的前向后向规划，并通过交互式细化实现快速适应。我们的工作表明，将受物理启发的表示纳入学习系统可以帮助弥合人工智能和人类物理推理能力之间的差距。
+摘要：慢性肝病在全球范围内代表著重大的健康挑戰，而準確的預後評估對於個人化治療計畫至關重要。最近的證據表明，整合多模態資料（例如電腦斷層影像、放射特徵和臨床資訊）可以提供更全面的預後資訊。然而，模態具有內在異質性，而納入額外的模態可能會加劇異質化資料融合的挑戰。此外，現有的多模態融合方法通常難以適應更豐富的醫療模態，這使得難以捕捉模態間的關係。為了克服這些限制，我們提出了三模態交互慢性肝臟網路 (TMI-CLNet)。具體來說，我們開發了一個模態內聚合模組和一個三模態交叉注意力融合模組，它們分別旨在消除模態內冗餘和提取跨模態資訊。此外，我們設計了一個三模態特徵融合損失函數，以對齊跨模態的特徵表示。在肝臟預後資料集上的廣泛實驗表明，我們的做法顯著優於現有的最先進單模態模型和其他多模態技術。我們的程式碼可以在 https://github.com/Mysterwll/liver.git 上取得。
 
-##### **Tuning-Free Personalized Alignment via Trial-Error-Explain In-Context Learning**
-2502.08972v1 by Hyundong Cho, Karishma Sharma, Nicolaas Jedema, Leonardo F. R. Ribeiro, Alessandro Moschitti, Ravi Krishnan, Jonathan May
+##### **Safety at Scale: A Comprehensive Survey of Large Model Safety**
+2502.05206v2 by Xingjun Ma, Yifeng Gao, Yixu Wang, Ruofan Wang, Xin Wang, Ye Sun, Yifan Ding, Hengyuan Xu, Yunhao Chen, Yunhan Zhao, Hanxun Huang, Yige Li, Jiaming Zhang, Xiang Zheng, Yang Bai, Zuxuan Wu, Xipeng Qiu, Jingfeng Zhang, Yiming Li, Jun Sun, Cong Wang, Jindong Gu, Baoyuan Wu, Siheng Chen, Tianwei Zhang, Yang Liu, Mingming Gong, Tongliang Liu, Shirui Pan, Cihang Xie, Tianyu Pang, Yinpeng Dong, Ruoxi Jia, Yang Zhang, Shiqing Ma, Xiangyu Zhang, Neil Gong, Chaowei Xiao, Sarah Erfani, Bo Li, Masashi Sugiyama, Dacheng Tao, James Bailey, Yu-Gang Jiang
 
-Language models are aligned to the collective voice of many, resulting in
-generic outputs that do not align with specific users' styles. In this work, we
-present Trial-Error-Explain In-Context Learning (TICL), a tuning-free method
-that personalizes language models for text generation tasks with fewer than 10
-examples per user. TICL iteratively expands an in-context learning prompt via a
-trial-error-explain process, adding model-generated negative samples and
-explanations that provide fine-grained guidance towards a specific user's
-style. TICL achieves favorable win rates on pairwise comparisons with
-LLM-as-a-judge up to 91.5% against the previous state-of-the-art and
-outperforms competitive tuning-free baselines for personalized alignment tasks
-of writing emails, essays and news articles. Both lexical and qualitative
-analyses show that the negative samples and explanations enable language models
-to learn stylistic context more effectively and overcome the bias towards
-structural and formal phrases observed in their zero-shot outputs. By
-front-loading inference compute to create a user-specific in-context learning
-prompt that does not require extra generation steps at test time, TICL presents
-a novel yet simple approach for personalized alignment.
+The rapid advancement of large models, driven by their exceptional abilities
+in learning and generalization through large-scale pre-training, has reshaped
+the landscape of Artificial Intelligence (AI). These models are now
+foundational to a wide range of applications, including conversational AI,
+recommendation systems, autonomous driving, content generation, medical
+diagnostics, and scientific discovery. However, their widespread deployment
+also exposes them to significant safety risks, raising concerns about
+robustness, reliability, and ethical implications. This survey provides a
+systematic review of current safety research on large models, covering Vision
+Foundation Models (VFMs), Large Language Models (LLMs), Vision-Language
+Pre-training (VLP) models, Vision-Language Models (VLMs), Diffusion Models
+(DMs), and large-model-based Agents. Our contributions are summarized as
+follows: (1) We present a comprehensive taxonomy of safety threats to these
+models, including adversarial attacks, data poisoning, backdoor attacks,
+jailbreak and prompt injection attacks, energy-latency attacks, data and model
+extraction attacks, and emerging agent-specific threats. (2) We review defense
+strategies proposed for each type of attacks if available and summarize the
+commonly used datasets and benchmarks for safety research. (3) Building on
+this, we identify and discuss the open challenges in large model safety,
+emphasizing the need for comprehensive safety evaluations, scalable and
+effective defense mechanisms, and sustainable data practices. More importantly,
+we highlight the necessity of collective efforts from the research community
+and international collaboration. Our work can serve as a useful reference for
+researchers and practitioners, fostering the ongoing development of
+comprehensive defense systems and platforms to safeguard AI models.
 
-摘要：語言模型與眾人的集體聲音保持一致，導致產出內容流於一般，無法與特定使用者的風格相符。在這項工作中，我們提出了試驗錯誤解釋情境內學習 (TICL)，一種免調校方法，能為文字生成任務個人化語言模型，每個使用者少於 10 個範例。TICL 透過試驗錯誤解釋程序反覆擴充情境內學習提示，加入模型產生的負面範例和說明，提供細緻的指導，引導至特定使用者的風格。TICL 在與 LLM 作為評審的成對比較中獲得了高勝率，高達 91.5%，優於先前的技術水準，並在個人化對齊任務中超越了競爭性的免調校基準，包括撰寫電子郵件、論文和新聞文章。詞彙和質性分析皆顯示，負面範例和說明讓語言模型能更有效地學習風格脈絡，並克服零次學習產出中觀察到的結構性和正式詞組偏誤。透過預先加載推論運算，建立使用者特定的情境內學習提示，無需在測試時額外產生步驟，TICL 呈現一種新穎卻簡潔的方法，用於個人化對齊。
+摘要：<paragraph>大型模型的快速進展，得益於它們在通過大規模預訓練進行學習和概括方面的卓越能力，已經重塑了人工智能 (AI) 的格局。這些模型現在是廣泛應用程式（包括對話式 AI、推薦系統、自動駕駛、內容生成、醫療診斷和科學發現）的基礎。然而，它們的廣泛部署也使它們面臨重大的安全風險，引發了對穩健性、可靠性和倫理影響的擔憂。本調查提供了對大型模型當前安全研究的系統性回顧，涵蓋視覺基礎模型 (VFM)、大型語言模型 (LLM)、視覺語言預訓練 (VLP) 模型、視覺語言模型 (VLM)、擴散模型 (DM) 和基於大型模型的代理。我們的貢獻總結如下：(1) 我們提出了一個針對這些模型的安全威脅的全面分類，包括對抗性攻擊、資料中毒、後門攻擊、越獄和提示注入攻擊、能量延遲攻擊、資料和模型提取攻擊以及新興的特定代理威脅。(2) 我們檢視了針對每種類型攻擊提出的防禦策略（如果有的話），並總結了安全研究中常用的資料集和基準。(3) 基於此，我們找出並討論了大型模型安全中的開放性挑戰，強調了對全面安全評估、可擴充且有效的防禦機制以及永續資料實務的需求。更重要的是，我們強調了研究社群和國際合作共同努力的必要性。我們的研究可作為研究人員和從業人員的有用參考，促進全面防禦系統和平台的持續發展，以保護 AI 模型。</paragraph>
 
-##### **RTBAS: Defending LLM Agents Against Prompt Injection and Privacy Leakage**
-2502.08966v1 by Peter Yong Zhong, Siyuan Chen, Ruiqi Wang, McKenna McCall, Ben L. Titzer, Heather Miller
+##### **Enhanced Convolutional Neural Networks for Improved Image Classification**
+2502.00663v1 by Xiaoran Yang, Shuhan Yu, Wenxi Xu
 
-Tool-Based Agent Systems (TBAS) allow Language Models (LMs) to use external
-tools for tasks beyond their standalone capabilities, such as searching
-websites, booking flights, or making financial transactions. However, these
-tools greatly increase the risks of prompt injection attacks, where malicious
-content hijacks the LM agent to leak confidential data or trigger harmful
-actions. Existing defenses (OpenAI GPTs) require user confirmation before every
-tool call, placing onerous burdens on users. We introduce Robust TBAS (RTBAS),
-which automatically detects and executes tool calls that preserve integrity and
-confidentiality, requiring user confirmation only when these safeguards cannot
-be ensured. RTBAS adapts Information Flow Control to the unique challenges
-presented by TBAS. We present two novel dependency screeners, using
-LM-as-a-judge and attention-based saliency, to overcome these challenges.
-Experimental results on the AgentDojo Prompt Injection benchmark show RTBAS
-prevents all targeted attacks with only a 2% loss of task utility when under
-attack, and further tests confirm its ability to obtain near-oracle performance
-on detecting both subtle and direct privacy leaks.
+Image classification is a fundamental task in computer vision with diverse
+applications, ranging from autonomous systems to medical imaging. The CIFAR-10
+dataset is a widely used benchmark to evaluate the performance of
+classification models on small-scale, multi-class datasets. Convolutional
+Neural Networks (CNNs) have demonstrated state-of-the-art results; however,
+they often suffer from overfitting and suboptimal feature representation when
+applied to challenging datasets like CIFAR-10. In this paper, we propose an
+enhanced CNN architecture that integrates deeper convolutional blocks, batch
+normalization, and dropout regularization to achieve superior performance. The
+proposed model achieves a test accuracy of 84.95%, outperforming baseline CNN
+architectures. Through detailed ablation studies, we demonstrate the
+effectiveness of the enhancements and analyze the hierarchical feature
+representations. This work highlights the potential of refined CNN
+architectures for tackling small-scale image classification problems
+effectively.
 
-摘要：基於工具的代理系統 (TBAS) 允許語言模型 (LM) 使用外部工具來執行超出其獨立功能的任務，例如搜尋網站、預訂航班或進行金融交易。然而，這些工具大幅增加了提示注入攻擊的風險，其中惡意內容劫持 LM 代理程式以洩露機密資料或觸發有害動作。現有的防禦措施 (OpenAI GPT) 在每次呼叫工具之前都需要使用者確認，這會對使用者造成沉重的負擔。我們引入了穩健的 TBAS (RTBAS)，它會自動偵測並執行保留完整性與機密性的工具呼叫，僅在無法確保這些防護措施時才需要使用者確認。RTBAS 將資訊流控制調整為 TBAS 呈現的獨特挑戰。我們提出兩種新穎的相依性篩選器，使用 LM 作為判斷者和基於注意力的顯著性，以克服這些挑戰。AgentDojo 提示注入基準上的實驗結果顯示，RTBAS 在受到攻擊時僅損失 2% 的任務效用，即可防止所有目標攻擊，進一步的測試證實了其在偵測細微和直接的隱私洩漏方面獲得接近神諭效能的能力。
+摘要：影像分類是電腦視覺中的一項基本任務，應用範圍廣泛，從自動系統到醫學影像皆有。CIFAR-10 資料集是一個廣泛使用的基準，用於評估分類模型在小規模、多類別資料集上的效能。卷積神經網路 (CNN) 已展現出最先進的成果；然而，當應用於 CIFAR-10 等具挑戰性的資料集時，它們常常會發生過度擬合和次佳特徵表示的問題。在本文中，我們提出一個增強的 CNN 架構，它整合了更深的卷積區塊、批次正規化和中斷正規化，以達成卓越的效能。所提出的模型達到了 84.95% 的測試準確度，優於基準 CNN 架構。透過詳細的消融研究，我們證明了這些增強功能的有效性，並分析了階層式特徵表示。這項工作突顯了精進的 CNN 架構在有效解決小規模影像分類問題上的潛力。
 
-##### **Biologically Plausible Brain Graph Transformer**
-2502.08958v1 by Ciyuan Peng, Yuelong Huang, Qichao Dong, Shuo Yu, Feng Xia, Chengqi Zhang, Yaochu Jin
+##### **Distribution-aware Fairness Learning in Medical Image Segmentation From A Control-Theoretic Perspective**
+2502.00619v1 by Yujin Oh, Pengfei Jin, Sangjoon Park, Sekeun Kim, Siyeop Yoon, Kyungsang Kim, Jin Sung Kim, Xiang Li, Quanzheng Li
 
-State-of-the-art brain graph analysis methods fail to fully encode the
-small-world architecture of brain graphs (accompanied by the presence of hubs
-and functional modules), and therefore lack biological plausibility to some
-extent. This limitation hinders their ability to accurately represent the
-brain's structural and functional properties, thereby restricting the
-effectiveness of machine learning models in tasks such as brain disorder
-detection. In this work, we propose a novel Biologically Plausible Brain Graph
-Transformer (BioBGT) that encodes the small-world architecture inherent in
-brain graphs. Specifically, we present a network entanglement-based node
-importance encoding technique that captures the structural importance of nodes
-in global information propagation during brain graph communication,
-highlighting the biological properties of the brain structure. Furthermore, we
-introduce a functional module-aware self-attention to preserve the functional
-segregation and integration characteristics of brain graphs in the learned
-representations. Experimental results on three benchmark datasets demonstrate
-that BioBGT outperforms state-of-the-art models, enhancing biologically
-plausible brain graph representations for various brain graph analytical tasks
+Ensuring fairness in medical image segmentation is critical due to biases in
+imbalanced clinical data acquisition caused by demographic attributes (e.g.,
+age, sex, race) and clinical factors (e.g., disease severity). To address these
+challenges, we introduce Distribution-aware Mixture of Experts (dMoE), inspired
+by optimal control theory. We provide a comprehensive analysis of its
+underlying mechanisms and clarify dMoE's role in adapting to heterogeneous
+distributions in medical image segmentation. Furthermore, we integrate dMoE
+into multiple network architectures, demonstrating its broad applicability
+across diverse medical image analysis tasks. By incorporating demographic and
+clinical factors, dMoE achieves state-of-the-art performance on two 2D
+benchmark datasets and a 3D in-house dataset. Our results highlight the
+effectiveness of dMoE in mitigating biases from imbalanced distributions,
+offering a promising approach to bridging control theory and medical image
+segmentation within fairness learning paradigms. The source code will be made
+available.
 
-摘要：目前最先进的大腦圖形分析方法無法完全編碼大腦圖形的小世界架構（伴隨著樞紐和功能模組的存在），因此在某種程度上缺乏生物學上的可信度。這種限制阻礙了它們準確表示大腦結構和功能特性的能力，從而限制了機器學習模型在腦部疾病檢測等任務中的有效性。在這項工作中，我們提出了一個新的生物學上可信的大腦圖形轉換器 (BioBGT)，它編碼了大腦圖形中固有的、小世界的架構。具體來說，我們提出了一種基於網路糾纏的節點重要性編碼技術，它捕捉了大腦圖形通信過程中節點在全球資訊傳播中的結構重要性，突出了大腦結構的生物學特性。此外，我們引入了一個功能模組感知自注意力，以保留學習表徵中大腦圖形的功能分離和整合特性。在三個基準資料集上的實驗結果表明，BioBGT 優於最先進的模型，增強了各種大腦圖形分析任務的生物學上可信的大腦圖形表徵
+摘要：在医学影像分割中，由於人口屬性（例如年齡、性別、種族）和臨床因素（例如疾病嚴重程度）導致不平衡的臨床數據採集中存在偏差，因此確保公平性至關重要。為了應對這些挑戰，我們引入了受最優控制理論啟發的感知混合專家 (dMoE)。我們對其底層機制進行了全面分析，並釐清了 dMoE 在適應醫學影像分割中的異質分佈中的作用。此外，我們將 dMoE 整合到多個網路架構中，展示了其在各種醫學影像分析任務中的廣泛適用性。通過納入人口統計和臨床因素，dMoE 在兩個 2D 基準數據集和一個 3D 內部數據集上實現了最先進的性能。我們的結果突出了 dMoE 在減輕不平衡分佈的偏差方面的有效性，為在公平性學習範例中橋接控制理論和醫學影像分割提供了一個有前景的方法。原始碼將會公開。
 
-##### **Medicine on the Edge: Comparative Performance Analysis of On-Device LLMs for Clinical Reasoning**
-2502.08954v1 by Leon Nissen, Philipp Zagar, Vishnu Ravi, Aydin Zahedivash, Lara Marie Reimer, Stephan Jonas, Oliver Aalami, Paul Schmiedmayer
+##### **Generating crossmodal gene expression from cancer histopathology improves multimodal AI predictions**
+2502.00568v3 by Samiran Dey, Christopher R. S. Banerji, Partha Basuchowdhuri, Sanjoy K. Saha, Deepak Parashar, Tapabrata Chakraborti
 
-The deployment of Large Language Models (LLM) on mobile devices offers
-significant potential for medical applications, enhancing privacy, security,
-and cost-efficiency by eliminating reliance on cloud-based services and keeping
-sensitive health data local. However, the performance and accuracy of on-device
-LLMs in real-world medical contexts remain underexplored. In this study, we
-benchmark publicly available on-device LLMs using the AMEGA dataset, evaluating
-accuracy, computational efficiency, and thermal limitation across various
-mobile devices. Our results indicate that compact general-purpose models like
-Phi-3 Mini achieve a strong balance between speed and accuracy, while medically
-fine-tuned models such as Med42 and Aloe attain the highest accuracy. Notably,
-deploying LLMs on older devices remains feasible, with memory constraints
-posing a greater challenge than raw processing power. Our study underscores the
-potential of on-device LLMs for healthcare while emphasizing the need for more
-efficient inference and models tailored to real-world clinical reasoning.
+Emerging research has highlighted that artificial intelligence based
+multimodal fusion of digital pathology and transcriptomic features can improve
+cancer diagnosis (grading/subtyping) and prognosis (survival risk) prediction.
+However, such direct fusion for joint decision is impractical in real clinical
+settings, where histopathology is still the gold standard for diagnosis and
+transcriptomic tests are rarely requested, at least in the public healthcare
+system. With our novel diffusion based crossmodal generative AI model PathGen,
+we show that genomic expressions synthesized from digital histopathology
+jointly predicts cancer grading and patient survival risk with high accuracy
+(state-of-the-art performance), certainty (through conformal coverage
+guarantee) and interpretability (through distributed attention maps). PathGen
+code is available for open use by the research community through GitHub at
+https://github.com/Samiran-Dey/PathGen.
 
-摘要：大型語言模型 (LLM) 在行動裝置上的部署為醫療應用程式提供了巨大的潛力，透過消除對雲端服務的依賴並將敏感的健康資料儲存在本地，進而提升隱私、安全性，並提高成本效益。然而，在實際的醫療環境中，裝置上 LLM 的效能和準確度仍未受到充分的探討。在此研究中，我們使用 AMEGA 資料集來評量公開可用的裝置上 LLM，並評估其在各種行動裝置上的準確度、運算效率和熱限制。我們的結果顯示，像 Phi-3 Mini 等精簡的一般用途模型在速度和準確度之間取得了良好的平衡，而經過醫學微調的模型，例如 Med42 和 Aloe，則達到了最高的準確度。值得注意的是，在較舊的裝置上部署 LLM 仍然可行，記憶體限制比原始處理能力構成更大的挑戰。我們的研究強調了裝置上 LLM 在醫療保健方面的潛力，同時強調了對更有效率的推理和針對實際臨床推理量身打造的模型的需求。
+摘要：新興研究強調，基於人工智慧的多模態融合數位病理學和轉錄組特徵，可以改善癌症診斷（分級/分型）和預後（存活風險）預測。
+然而，這種直接融合對於聯合決策在實際臨床環境中並不切實際，在實際臨床環境中，組織病理學仍然是診斷的黃金標準，而轉錄組檢測很少被要求，至少在公共醫療保健系統中是如此。透過我們新穎的基於擴散的跨模態生成式 AI 模型 PathGen，我們展示了從數位組織病理學合成的基因體表達共同預測癌症分級和患者存活風險，具有很高的準確度（最先進的效能）、確定性（透過共形覆蓋保證）和可解釋性（透過分佈式注意力圖）。PathGen 程式碼可透過 GitHub 上的 https://github.com/Samiran-Dey/PathGen 供研究社群公開使用。
 
diff --git a/__pycache__/config.cpython-310.pyc b/__pycache__/config.cpython-310.pyc
index 075098362a..3cdb3a9452 100644
Binary files a/__pycache__/config.cpython-310.pyc and b/__pycache__/config.cpython-310.pyc differ
diff --git a/__pycache__/util4translation.cpython-310.pyc b/__pycache__/util4translation.cpython-310.pyc
index 44ddd2ea8c..412f8d1f40 100644
Binary files a/__pycache__/util4translation.cpython-310.pyc and b/__pycache__/util4translation.cpython-310.pyc differ
diff --git a/database/logs/runtime.log b/database/logs/runtime.log
index 2dfeec342f..0e37ab494d 100644
--- a/database/logs/runtime.log
+++ b/database/logs/runtime.log
@@ -21386,3 +21386,7 @@ KeyError: 'paper_summary_zh'
 2025-02-16 09:10:43.071 | SUCCESS  | __main__:parse:267 - handle [2/4] | topic=`AI` subtopic=`Medical`
 2025-02-16 09:10:43.122 | SUCCESS  | __main__:parse:267 - handle [3/4] | topic=`AI` subtopic=`Knowledge Graphs`
 2025-02-16 09:10:43.491 | SUCCESS  | __main__:parse:267 - handle [4/4] | topic=`AI` subtopic=`LLM`
+2025-02-16 20:27:09.332 | SUCCESS  | __main__:parse:267 - handle [1/4] | topic=`AI` subtopic=`Knowledge Graphs`
+2025-02-16 20:27:09.498 | SUCCESS  | __main__:parse:267 - handle [2/4] | topic=`AI` subtopic=`LLM`
+2025-02-16 20:27:09.533 | SUCCESS  | __main__:parse:267 - handle [3/4] | topic=`AI` subtopic=`Medical explainable AI`
+2025-02-16 20:27:09.710 | SUCCESS  | __main__:parse:267 - handle [4/4] | topic=`AI` subtopic=`Medical`
diff --git a/database/storage/storage_2025-02-16.md b/database/storage/storage_2025-02-16.md
index 90ce7b8fed..0d91b737db 100644
--- a/database/storage/storage_2025-02-16.md
+++ b/database/storage/storage_2025-02-16.md
@@ -1,10361 +1,10361 @@
 # arxiv-daily
- Automated deployment @ 2025-02-16 09:10:43 Asia/Taipei
+ Automated deployment @ 2025-02-16 20:27:09 Asia/Taipei
 > Welcome to contribute! Add your topics and keywords in [`topic.yml`](https://github.com/jawatech/arxiv-daily-in-place/blob/main/database/topic.yml).
 > You can also view historical data through the [storage](https://github.com/jawatech/arxiv-daily-in-place/blob/main/database/storage).
 
 ## AI
 
-### Medical explainable AI
+### Knowledge Graphs
 |Publish Date|Title|Authors|Homepage|Code|
 | :---: | :---: | :---: | :---: | :---: |
-|**2025-02-13**|**Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**|Flavio Bertini et.al.|[2502.09218v1](http://arxiv.org/abs/2502.09218v1)|null|
-|**2025-02-10**|**Foundation Model of Electronic Medical Records for Adaptive Risk Estimation**|Pawel Renc et.al.|[2502.06124v1](http://arxiv.org/abs/2502.06124v1)|null|
-|**2025-01-27**|**An Explainable Disease Surveillance System for Early Prediction of Multiple Chronic Diseases**|Shaheer Ahmad Khan et.al.|[2501.15969v1](http://arxiv.org/abs/2501.15969v1)|null|
-|**2025-01-23**|**Ensuring Medical AI Safety: Explainable AI-Driven Detection and Mitigation of Spurious Model Behavior and Associated Data**|Frederik Pahde et.al.|[2501.13818v1](http://arxiv.org/abs/2501.13818v1)|[link](https://github.com/frederikpahde/medical-ai-safety)|
-|**2025-01-19**|**Enhanced Suicidal Ideation Detection from Social Media Using a CNN-BiLSTM Hybrid Model**|Mohaiminul Islam Bhuiyan et.al.|[2501.11094v1](http://arxiv.org/abs/2501.11094v1)|null|
-|**2025-01-17**|**SEANN: A Domain-Informed Neural Network for Epidemiological Insights**|Jean-Baptiste Guimbaud et.al.|[2501.10273v1](http://arxiv.org/abs/2501.10273v1)|null|
-|**2025-01-16**|**Artificial Intelligence-Driven Clinical Decision Support Systems**|Muhammet Alkan et.al.|[2501.09628v1](http://arxiv.org/abs/2501.09628v1)|null|
-|**2025-01-12**|**MedGrad E-CLIP: Enhancing Trust and Transparency in AI-Driven Skin Lesion Diagnosis**|Sadia Kamal et.al.|[2501.06887v1](http://arxiv.org/abs/2501.06887v1)|null|
-|**2025-01-06**|**Explaining Humour Style Classifications: An XAI Approach to Understanding Computational Humour Analysis**|Mary Ogbuka Kenneth et.al.|[2501.02891v1](http://arxiv.org/abs/2501.02891v1)|null|
-|**2024-12-28**|**The Emotional Spectrum of LLMs: Leveraging Empathy and Emotion-Based Markers for Mental Health Support**|Alessandro De Grandi et.al.|[2412.20068v1](http://arxiv.org/abs/2412.20068v1)|null|
-|**2024-12-27**|**A Review on the Integration of Artificial Intelligence and Medical Imaging in IVF Ovarian Stimulation**|Jana Zakall et.al.|[2412.19688v1](http://arxiv.org/abs/2412.19688v1)|null|
-|**2024-12-23**|**Enhancing Cancer Diagnosis with Explainable & Trustworthy Deep Learning Models**|Badaru I. Olumuyiwa et.al.|[2412.17527v1](http://arxiv.org/abs/2412.17527v1)|null|
-|**2024-12-20**|**Towards Interpretable Radiology Report Generation via Concept Bottlenecks using a Multi-Agentic RAG**|Hasan Md Tusfiqur Alam et.al.|[2412.16086v2](http://arxiv.org/abs/2412.16086v2)|[link](https://github.com/tifat58/irr-with-cbm-rag)|
-|**2024-12-20**|**Critique of Impure Reason: Unveiling the reasoning behaviour of medical Large Language Models**|Shamus Sim et.al.|[2412.15748v1](http://arxiv.org/abs/2412.15748v1)|null|
-|**2024-12-18**|**Cognition Chain for Explainable Psychological Stress Detection on Social Media**|Xin Wang et.al.|[2412.14009v1](http://arxiv.org/abs/2412.14009v1)|null|
-|**2024-11-30**|**2-Factor Retrieval for Improved Human-AI Decision Making in Radiology**|Jim Solomon et.al.|[2412.00372v1](http://arxiv.org/abs/2412.00372v1)|null|
-|**2024-11-28**|**Mapping Public Perception of Artificial Intelligence: Expectations, Risk-Benefit Tradeoffs, and Value As Determinants for Societal Acceptance**|Philipp Brauner et.al.|[2411.19356v1](http://arxiv.org/abs/2411.19356v1)|null|
-|**2024-11-26**|**Explainable AI for Classifying UTI Risk Groups Using a Real-World Linked EHR and Pathology Lab Dataset**|Yujie Dai et.al.|[2411.17645v2](http://arxiv.org/abs/2411.17645v2)|null|
-|**2024-11-18**|**Exploring the Requirements of Clinicians for Explainable AI Decision Support Systems in Intensive Care**|Jeffrey N. Clark et.al.|[2411.11774v1](http://arxiv.org/abs/2411.11774v1)|null|
-|**2024-11-15**|**Artificial Intelligence in Pediatric Echocardiography: Exploring Challenges, Opportunities, and Clinical Applications with Explainable AI and Federated Learning**|Mohammed Yaseen Jabarulla et.al.|[2411.10255v1](http://arxiv.org/abs/2411.10255v1)|null|
-|**2024-11-01**|**Enhancing Osteoporosis Detection: An Explainable Multi-Modal Learning Framework with Feature Fusion and Variable Clustering**|Mehdi Hosseini Chagahi et.al.|[2411.00916v2](http://arxiv.org/abs/2411.00916v2)|null|
-|**2024-10-25**|**A Review of Deep Learning Approaches for Non-Invasive Cognitive Impairment Detection**|Muath Alsuhaibani et.al.|[2410.19898v1](http://arxiv.org/abs/2410.19898v1)|null|
-|**2024-10-23**|**An Ontology-Enabled Approach For User-Centered and Knowledge-Enabled Explanations of AI Systems**|Shruthi Chari et.al.|[2410.17504v1](http://arxiv.org/abs/2410.17504v1)|[link](https://github.com/tetherless-world/metaexplainer)|
-|**2024-10-22**|**Contrasting Attitudes Towards Current and Future AI Applications for Computerised Interpretation of ECG: A Clinical Stakeholder Interview Study**|Lukas Hughes-Noehrer et.al.|[2410.16879v1](http://arxiv.org/abs/2410.16879v1)|null|
-|**2024-10-19**|**Pathologist-like explainable AI for interpretable Gleason grading in prostate cancer**|Gesa Mittmann et.al.|[2410.15012v1](http://arxiv.org/abs/2410.15012v1)|null|
-|**2024-10-15**|**Explainable AI Methods for Multi-Omics Analysis: A Survey**|Ahmad Hussein et.al.|[2410.11910v1](http://arxiv.org/abs/2410.11910v1)|null|
-|**2024-10-14**|**Study on the Helpfulness of Explainable Artificial Intelligence**|Tobias Labarta et.al.|[2410.11896v1](http://arxiv.org/abs/2410.11896v1)|[link](https://github.com/tlabarta/helpfulnessofxai)|
-|**2024-10-12**|**Use of What-if Scenarios to Help Explain Artificial Intelligence Models for Neonatal Health**|Abdullah Mamun et.al.|[2410.09635v1](http://arxiv.org/abs/2410.09635v1)|[link](https://github.com/ab9mamun/aimen)|
-|**2024-10-10**|**Artificial intelligence techniques in inherited retinal diseases: A review**|Han Trinh et.al.|[2410.09105v1](http://arxiv.org/abs/2410.09105v1)|null|
-|**2024-10-07**|**CasiMedicos-Arg: A Medical Question Answering Dataset Annotated with Explanatory Argumentative Structures**|Ekaterina Sviridova et.al.|[2410.05235v2](http://arxiv.org/abs/2410.05235v2)|[link](https://github.com/ixa-ehu/antidote-casimedicos)|
-|**2024-10-01**|**Explainable Diagnosis Prediction through Neuro-Symbolic Integration**|Qiuhao Lu et.al.|[2410.01855v2](http://arxiv.org/abs/2410.01855v2)|null|
-|**2024-10-01**|**Easydiagnos: a framework for accurate feature selection for automatic diagnosis in smart healthcare**|Prasenjit Maji et.al.|[2410.00366v1](http://arxiv.org/abs/2410.00366v1)|null|
-|**2024-09-20**|**Dermatologist-like explainable AI enhances melanoma diagnosis accuracy: eye-tracking study**|Tirtha Chanda et.al.|[2409.13476v1](http://arxiv.org/abs/2409.13476v1)|null|
-|**2024-09-19**|**Explainable AI for Autism Diagnosis: Identifying Critical Brain Regions Using fMRI Data**|Suryansh Vidya et.al.|[2409.15374v1](http://arxiv.org/abs/2409.15374v1)|null|
-|**2024-09-19**|**Improving Prototypical Parts Abstraction for Case-Based Reasoning Explanations Designed for the Kidney Stone Type Recognition**|Daniel Flores-Araiza et.al.|[2409.12883v1](http://arxiv.org/abs/2409.12883v1)|null|
-|**2024-09-18**|**Towards Interpretable End-Stage Renal Disease (ESRD) Prediction: Utilizing Administrative Claims Data with Explainable AI Techniques**|Yubo Li et.al.|[2409.12087v3](http://arxiv.org/abs/2409.12087v3)|null|
-|**2024-09-13**|**Contextual Evaluation of Large Language Models for Classifying Tropical and Infectious Diseases**|Mercy Asiedu et.al.|[2409.09201v3](http://arxiv.org/abs/2409.09201v3)|null|
-|**2024-09-09**|**Explainable AI: Definition and attributes of a good explanation for health AI**|Evangelia Kyrimi et.al.|[2409.15338v1](http://arxiv.org/abs/2409.15338v1)|null|
-|**2024-08-30**|**Exploring the Effect of Explanation Content and Format on User Comprehension and Trust**|Antonio Rago et.al.|[2408.17401v1](http://arxiv.org/abs/2408.17401v1)|null|
-|**2024-08-29**|**A Survey for Large Language Models in Biomedicine**|Chong Wang et.al.|[2409.00133v1](http://arxiv.org/abs/2409.00133v1)|null|
-|**2024-08-27**|**Aligning XAI with EU Regulations for Smart Biomedical Devices: A Methodology for Compliance Analysis**|Francesco Sovrano et.al.|[2408.15121v1](http://arxiv.org/abs/2408.15121v1)|null|
-|**2024-08-24**|**Towards Case-based Interpretability for Medical Federated Learning**|Laura Latorre et.al.|[2408.13626v1](http://arxiv.org/abs/2408.13626v1)|null|
-|**2024-08-22**|**AI in radiological imaging of soft-tissue and bone tumours: a systematic review evaluating against CLAIM and FUTURE-AI guidelines**|Douwe J. Spaanderman et.al.|[2408.12491v1](http://arxiv.org/abs/2408.12491v1)|null|
-|**2024-08-14**|**Evaluating Explainable AI Methods in Deep Learning Models for Early Detection of Cerebral Palsy**|Kimji N. Pellano et.al.|[2409.00001v1](http://arxiv.org/abs/2409.00001v1)|null|
-|**2024-08-06**|**MicroXercise: A Micro-Level Comparative and Explainable System for Remote Physical Therapy**|Hanchen David Wang et.al.|[2408.11837v1](http://arxiv.org/abs/2408.11837v1)|null|
-|**2024-08-05**|**The Literature Review Network: An Explainable Artificial Intelligence for Systematic Literature Reviews, Meta-analyses, and Method Development**|Joshua Morriss et.al.|[2408.05239v1](http://arxiv.org/abs/2408.05239v1)|null|
-|**2024-08-05**|**Enhancing Medical Learning and Reasoning Systems: A Boxology-Based Comparative Analysis of Design Patterns**|Chi Him Ng et.al.|[2408.02709v1](http://arxiv.org/abs/2408.02709v1)|null|
-|**2024-08-05**|**Bayesian Kolmogorov Arnold Networks (Bayesian_KANs): A Probabilistic Approach to Enhance Accuracy and Interpretability**|Masoud Muhammed Hassan et.al.|[2408.02706v1](http://arxiv.org/abs/2408.02706v1)|null|
-|**2024-07-26**|**MLtoGAI: Semantic Web based with Machine Learning for Enhanced Disease Prediction and Personalized Recommendations using Generative AI**|Shyam Dongre et.al.|[2407.20284v1](http://arxiv.org/abs/2407.20284v1)|null|
-|**2024-07-25**|**Introducing δ-XAI: a novel sensitivity-based method for local AI explanations**|Alessandro De Carlo et.al.|[2407.18343v2](http://arxiv.org/abs/2407.18343v2)|null|
-|**2024-07-24**|**Enhanced Deep Learning Methodologies and MRI Selection Techniques for Dementia Diagnosis in the Elderly Population**|Nikolaos Ntampakis et.al.|[2407.17324v2](http://arxiv.org/abs/2407.17324v2)|null|
-|**2024-07-24**|**Using Large Language Models to Compare Explainable Models for Smart Home Human Activity Recognition**|Michele Fiori et.al.|[2408.06352v1](http://arxiv.org/abs/2408.06352v1)|null|
-|**2024-07-21**|**Explainable AI-based Intrusion Detection System for Industry 5.0: An Overview of the Literature, associated Challenges, the existing Solutions, and Potential Research Directions**|Naseem Khan et.al.|[2408.03335v1](http://arxiv.org/abs/2408.03335v1)|null|
-|**2024-07-18**|**A Comparative Study on Automatic Coding of Medical Letters with Explainability**|Jamie Glen et.al.|[2407.13638v1](http://arxiv.org/abs/2407.13638v1)|[link](https://github.com/Glenj01/Medical-Coding)|
-|**2024-07-09**|**Explainable AI for Enhancing Efficiency of DL-based Channel Estimation**|Abdul Karim Gizzini et.al.|[2407.07009v1](http://arxiv.org/abs/2407.07009v1)|null|
-|**2024-07-07**|**Explainable AI: Comparative Analysis of Normal and Dilated ResNet Models for Fundus Disease Classification**|P. N. Karthikayan et.al.|[2407.05440v2](http://arxiv.org/abs/2407.05440v2)|null|
-|**2024-07-03**|**A Survey on Trustworthiness in Foundation Models for Medical Image Analysis**|Congzhen Shi et.al.|[2407.15851v2](http://arxiv.org/abs/2407.15851v2)|null|
-|**2024-07-01**|**The Impact of an XAI-Augmented Approach on Binary Classification with Scarce Data**|Ximing Wen et.al.|[2407.06206v1](http://arxiv.org/abs/2407.06206v1)|null|
-|**2024-06-28**|**Can GPT-4 Help Detect Quit Vaping Intentions? An Exploration of Automatic Data Annotation Approach**|Sai Krishna Revanth Vuruma et.al.|[2407.00167v1](http://arxiv.org/abs/2407.00167v1)|null|
-|**2024-06-25**|**Towards Compositional Interpretability for XAI**|Sean Tull et.al.|[2406.17583v1](http://arxiv.org/abs/2406.17583v1)|null|
-|**2024-06-17**|**Slicing Through Bias: Explaining Performance Gaps in Medical Image Analysis using Slice Discovery Methods**|Vincent Olesen et.al.|[2406.12142v2](http://arxiv.org/abs/2406.12142v2)|[link](https://github.com/volesen/slicing-through-bias)|
-|**2024-06-11**|**Unlocking the Potential of Metaverse in Innovative and Immersive Digital Health**|Fatemeh Ebrahimzadeh et.al.|[2406.07114v2](http://arxiv.org/abs/2406.07114v2)|null|
-|**2024-06-10**|**AI-Driven Predictive Analytics Approach for Early Prognosis of Chronic Kidney Disease Using Ensemble Learning and Explainable AI**|K M Tawsik Jawad et.al.|[2406.06728v2](http://arxiv.org/abs/2406.06728v2)|null|
-|**2024-06-10**|**Explainable AI for Mental Disorder Detection via Social Media: A survey and outlook**|Yusif Ibrahimov et.al.|[2406.05984v1](http://arxiv.org/abs/2406.05984v1)|null|
-|**2024-06-09**|**Methodology and Real-World Applications of Dynamic Uncertain Causality Graph for Clinical Diagnosis with Explainability and Invariance**|Zhan Zhang et.al.|[2406.05746v1](http://arxiv.org/abs/2406.05746v1)|null|
-|**2024-06-07**|**Advancing Histopathology-Based Breast Cancer Diagnosis: Insights into Multi-Modality and Explainability**|Faseela Abdullakutty et.al.|[2406.12897v1](http://arxiv.org/abs/2406.12897v1)|null|
-|**2024-06-04**|**Using Explainable AI for EEG-based Reduced Montage Neonatal Seizure Detection**|Dinuka Sandun Udayantha et.al.|[2406.16908v3](http://arxiv.org/abs/2406.16908v3)|[link](https://github.com/dinuka-1999/braineocare)|
-|**2024-06-01**|**Breast Cancer Diagnosis: A Comprehensive Exploration of Explainable Artificial Intelligence (XAI) Techniques**|Samita Bai et.al.|[2406.00532v1](http://arxiv.org/abs/2406.00532v1)|null|
-|**2024-06-01**|**Unveiling Hidden Factors: Explainable AI for Feature Boosting in Speech Emotion Recognition**|Alaa Nfissi et.al.|[2406.01624v2](http://arxiv.org/abs/2406.01624v2)|[link](https://github.com/alaanfissi/unveiling-hidden-factors-explainable-ai-for-feature-boosting-in-speech-emotion-recognition)|
-|**2024-05-31**|**The Explanation Necessity for Healthcare AI**|Michail Mamalakis et.al.|[2406.00216v1](http://arxiv.org/abs/2406.00216v1)|null|
-|**2024-05-29**|**Interdisciplinary Expertise to Advance Equitable Explainable AI**|Chloe R. Bennett et.al.|[2406.18563v1](http://arxiv.org/abs/2406.18563v1)|null|
-|**2024-05-27**|**"It depends": Configuring AI to Improve Clinical Usefulness Across Contexts**|Hubert D. Zając et.al.|[2407.11978v1](http://arxiv.org/abs/2407.11978v1)|null|
-|**2024-05-26**|**Improving Health Professionals' Onboarding with AI and XAI for Trustworthy Human-AI Collaborative Decision Making**|Min Hun Lee et.al.|[2405.16424v1](http://arxiv.org/abs/2405.16424v1)|null|
-|**2024-05-26**|**Exploring Nutritional Impact on Alzheimer's Mortality: An Explainable AI Approach**|Ziming Liu et.al.|[2405.17502v1](http://arxiv.org/abs/2405.17502v1)|null|
-|**2024-05-24**|**Explainable AI Enhances Glaucoma Referrals, Yet the Human-AI Team Still Falls Short of the AI Alone**|Catalina Gomez et.al.|[2407.11974v1](http://arxiv.org/abs/2407.11974v1)|null|
-|**2024-05-23**|**Decoding Decision Reasoning: A Counterfactual-Powered Model for Knowledge Discovery**|Yingying Fang et.al.|[2406.18552v1](http://arxiv.org/abs/2406.18552v1)|null|
-|**2024-05-21**|**The Role of Emotions in Informational Support Question-Response Pairs in Online Health Communities: A Multimodal Deep Learning Approach**|Mohsen Jozani et.al.|[2405.13099v1](http://arxiv.org/abs/2405.13099v1)|null|
-|**2024-05-17**|**ChatGPT in Classrooms: Transforming Challenges into Opportunities in Education**|Harris Bin Munawar et.al.|[2405.10645v1](http://arxiv.org/abs/2405.10645v1)|null|
-|**2024-05-13**|**Evaluating the Explainable AI Method Grad-CAM for Breath Classification on Newborn Time Series Data**|Camelia Oprea et.al.|[2405.07590v1](http://arxiv.org/abs/2405.07590v1)|null|
-|**2024-05-10**|**XAI4LLM. Let Machine Learning Models and LLMs Collaborate for Enhanced In-Context Learning in Healthcare**|Fatemeh Nazary et.al.|[2405.06270v3](http://arxiv.org/abs/2405.06270v3)|null|
-|**2024-05-09**|**To Trust or Not to Trust: Towards a novel approach to measure trust for XAI systems**|Miquel Miró-Nicolau et.al.|[2405.05766v1](http://arxiv.org/abs/2405.05766v1)|null|
-|**2024-05-05**|**Region-specific Risk Quantification for Interpretable Prognosis of COVID-19**|Zhusi Zhong et.al.|[2405.02815v1](http://arxiv.org/abs/2405.02815v1)|[link](https://github.com/zzs95/RSP_COVID)|
-|**2024-04-26**|**Rad4XCNN: a new agnostic method for post-hoc global explanation of CNN-derived features by means of radiomics**|Francesco Prinzi et.al.|[2405.02334v2](http://arxiv.org/abs/2405.02334v2)|null|
-|**2024-04-25**|**Attributing Responsibility in AI-Induced Incidents: A Computational Reflective Equilibrium Framework for Accountability**|Yunfei Ge et.al.|[2404.16957v1](http://arxiv.org/abs/2404.16957v1)|null|
-|**2024-04-19**|**Explainable AI for Fair Sepsis Mortality Predictive Model**|Chia-Hsuan Chang et.al.|[2404.13139v1](http://arxiv.org/abs/2404.13139v1)|null|
-|**2024-04-19**|**Multi Class Depression Detection Through Tweets using Artificial Intelligence**|Muhammad Osama Nusrat et.al.|[2404.13104v1](http://arxiv.org/abs/2404.13104v1)|[link](https://github.com/mnusrat786/masters-thesis)|
-|**2024-04-19**|**COIN: Counterfactual inpainting for weakly supervised semantic segmentation for medical images**|Dmytro Shvetsov et.al.|[2404.12832v2](http://arxiv.org/abs/2404.12832v2)|[link](https://github.com/dmytro-shvetsov/counterfactual-search)|
-|**2024-04-15**|**Hybrid Intelligence for Digital Humanities**|Victor de Boer et.al.|[2406.15374v1](http://arxiv.org/abs/2406.15374v1)|null|
-|**2024-04-14**|**Ethical Framework for Responsible Foundational Models in Medical Imaging**|Abhijit Das et.al.|[2406.11868v1](http://arxiv.org/abs/2406.11868v1)|null|
-|**2024-04-09**|**Advancements in Radiomics and Artificial Intelligence for Thyroid Cancer Diagnosis**|Milad Yousefi et.al.|[2404.07239v1](http://arxiv.org/abs/2404.07239v1)|null|
-|**2024-04-06**|**Predictive Modeling for Breast Cancer Classification in the Context of Bangladeshi Patients: A Supervised Machine Learning Approach with Explainable AI**|Taminul Islam et.al.|[2404.04686v1](http://arxiv.org/abs/2404.04686v1)|null|
-|**2024-04-05**|**Enhancing Breast Cancer Diagnosis in Mammography: Evaluation and Integration of Convolutional Neural Networks and Explainable AI**|Maryam Ahmed et.al.|[2404.03892v3](http://arxiv.org/abs/2404.03892v3)|null|
-|**2024-03-30**|**Advancing Multimodal Data Fusion in Pain Recognition: A Strategy Leveraging Statistical Correlation and Human-Centered Perspectives**|Xingrui Gu et.al.|[2404.00320v2](http://arxiv.org/abs/2404.00320v2)|null|
-|**2024-03-26**|**Addressing Social Misattributions of Large Language Models: An HCXAI-based Approach**|Andrea Ferrario et.al.|[2403.17873v1](http://arxiv.org/abs/2403.17873v1)|null|
-|**2024-03-26**|**Clinical Domain Knowledge-Derived Template Improves Post Hoc AI Explanations in Pneumothorax Classification**|Han Yuan et.al.|[2403.18871v1](http://arxiv.org/abs/2403.18871v1)|[link](https://github.com/han-yuan-med/template-explanation)|
-|**2024-03-03**|**Enhancing Neural Machine Translation of Low-Resource Languages: Corpus Development, Human Evaluation and Explainable AI Architectures**|Séamus Lankford et.al.|[2403.01580v1](http://arxiv.org/abs/2403.01580v1)|null|
-|**2024-02-28**|**Cause and Effect: Can Large Language Models Truly Understand Causality?**|Swagata Ashwani et.al.|[2402.18139v3](http://arxiv.org/abs/2402.18139v3)|null|
-|**2024-02-28**|**Artificial Intelligence and Diabetes Mellitus: An Inside Look Through the Retina**|Yasin Sadeghi Bazargani et.al.|[2402.18600v1](http://arxiv.org/abs/2402.18600v1)|null|
-|**2024-02-22**|**Multi-stakeholder Perspective on Responsible Artificial Intelligence and Acceptability in Education**|A. J. Karran et.al.|[2402.15027v2](http://arxiv.org/abs/2402.15027v2)|null|
-|**2024-02-12**|**Deciphering Heartbeat Signatures: A Vision Transformer Approach to Explainable Atrial Fibrillation Detection from ECG Signals**|Aruna Mohan et.al.|[2402.09474v2](http://arxiv.org/abs/2402.09474v2)|null|
-
-#### Abstracts
-##### **Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**
-2502.09218v1 by Flavio Bertini, Alessandro Dal Palù, Federica Zaglio, Francesco Fabiano, Andrea Formisano
-
-This paper presents a complete explainable system that interprets a set of
-data, abstracts the underlying features and describes them in a natural
-language of choice. The system relies on two crucial stages: (i) identifying
-emerging properties from data and transforming them into abstract concepts, and
-(ii) converting these concepts into natural language. Despite the impressive
-natural language generation capabilities demonstrated by Large Language Models,
-their statistical nature and the intricacy of their internal mechanism still
-force us to employ these techniques as black boxes, forgoing trustworthiness.
-Developing an explainable pipeline for data interpretation would allow
-facilitating its use in safety-critical environments like processing medical
-information and allowing non-experts and visually impaired people to access
-narrated information. To this end, we believe that the fields of knowledge
-representation and automated reasoning research could present a valid
-alternative. Expanding on prior research that tackled the first stage (i), we
-focus on the second stage, named Concept2Text. Being explainable, data
-translation is easily modeled through logic-based rules, once again emphasizing
-the role of declarative programming in achieving AI explainability. This paper
-explores a Prolog/CLP-based rewriting system to interpret concepts-articulated
-in terms of classes and relations, plus common knowledge-derived from a generic
-ontology, generating natural language text. Its main features include
-hierarchical tree rewritings, modular multilingual generation, support for
-equivalent variants across semantic, grammar, and lexical levels, and a
-transparent rule-based system. We outline the architecture and demonstrate its
-flexibility through some examples capable of generating numerous diverse and
-equivalent rewritings based on the input concept.
-
-摘要：<paragraph>這篇論文提出了一個完整的可解釋系統，它可以解釋一組資料，抽象出基礎特徵，並以選擇的自然語言描述它們。系統依賴兩個關鍵階段：(i) 從資料中識別新興屬性，並將它們轉換為抽象概念，以及 (ii) 將這些概念轉換為自然語言。儘管大型語言模型展示了令人印象深刻的自然語言生成能力，但它們的統計性質和內部機制的複雜性仍然迫使我們將這些技術用作黑盒子，放棄可信度。開發一個可解釋的資料解釋管道將有助於促進在安全關鍵環境中使用它，例如處理醫療資訊，並允許非專家和視障人士存取敘述資訊。為此，我們相信知識表示和自動推理研究領域可以提出一個有效的替代方案。在擴展解決第一階段 (i) 的先前研究的基礎上，我們專注於第二階段，稱為 Concept2Text。由於具有可解釋性，資料翻譯很容易透過基於邏輯的規則建模，再次強調宣告式程式設計在實現 AI 可解釋性中的作用。本文探討了一個基於 Prolog/CLP 的重寫系統，以解釋概念，這些概念以類別和關係的形式表達，再加上從通用本体衍生的常識，產生自然語言文字。它的主要特點包括階層樹重寫、模組化多語言生成、支援語義、語法和詞彙層面的等效變體，以及一個透明的基於規則的系統。我們概述了架構，並透過一些範例展示了它的靈活性，這些範例能夠根據輸入概念生成許多不同的等效重寫。</paragraph>
-
-##### **Foundation Model of Electronic Medical Records for Adaptive Risk Estimation**
-2502.06124v1 by Pawel Renc, Michal K. Grzeszczyk, Nassim Oufattole, Deirdre Goode, Yugang Jia, Szymon Bieganski, Matthew B. A. McDermott, Jaroslaw Was, Anthony E. Samir, Jonathan W. Cunningham, David W. Bates, Arkadiusz Sitek
-
-We developed the Enhanced Transformer for Health Outcome Simulation (ETHOS),
-an AI model that tokenizes patient health timelines (PHTs) from EHRs. ETHOS
-predicts future PHTs using transformer-based architectures. The Adaptive Risk
-Estimation System (ARES) employs ETHOS to compute dynamic and personalized risk
-probabilities for clinician-defined critical events. ARES incorporates a
-personalized explainability module that identifies key clinical factors
-influencing risk estimates for individual patients. ARES was evaluated on the
-MIMIC-IV v2.2 dataset in emergency department (ED) settings, benchmarking its
-performance against traditional early warning systems and machine learning
-models. We processed 299,721 unique patients from MIMIC-IV into 285,622 PHTs,
-with 60% including hospital admissions. The dataset contained over 357 million
-tokens. ETHOS outperformed benchmark models in predicting hospital admissions,
-ICU admissions, and prolonged hospital stays, achieving superior AUC scores.
-ETHOS-based risk estimates demonstrated robustness across demographic subgroups
-with strong model reliability, confirmed via calibration curves. The
-personalized explainability module provides insights into patient-specific
-factors contributing to risk. ARES, powered by ETHOS, advances predictive
-healthcare AI by providing dynamic, real-time, and personalized risk estimation
-with patient-specific explainability to enhance clinician trust. Its
-adaptability and superior accuracy position it as a transformative tool for
-clinical decision-making, potentially improving patient outcomes and resource
-allocation in emergency and inpatient settings. We release the full code at
-github.com/ipolharvard/ethos-ares to facilitate future research.
-
-摘要：我們開發了增強型健康結果模擬轉換器 (ETHOS)，
-一種從電子健康紀錄 (EHR) 中將患者健康時間軸 (PHT) 標記化的 AI 模型。ETHOS
-使用基於轉換器的架構預測未來的 PHT。自適應風險評估系統 (ARES) 使用 ETHOS 計算由臨床醫生定義的危急事件的動態且個人化的風險機率。ARES 結合了個人化的可解釋性模組，可找出影響個別患者風險評估的主要臨床因素。ARES 在急診部門 (ED) 設定中針對 MIMIC-IV v2.2 資料集進行評估，並將其效能與傳統的預警系統和機器學習模型進行基準測試。我們將 299,721 位 MIMIC-IV 的獨特患者處理成 285,622 個 PHT，其中 60% 包含住院記錄。該資料集包含超過 3.57 億個標記。ETHOS 在預測住院、加護病房 (ICU) 住院和延長住院時間方面表現優於基準模型，並獲得了較高的 AUC 分數。基於 ETHOS 的風險評估顯示出跨人口統計子群的穩健性，並通過校準曲線確認了強大的模型可靠性。個人化的可解釋性模組提供了對導致風險的患者特定因素的見解。由 ETHOS 驅動的 ARES 透過提供動態、即時且個人化的風險評估，以及患者特定的可解釋性來增強臨床醫生的信任，從而推動了預測性醫療保健 AI 的發展。其適應性和卓越的準確性使其成為臨床決策制定的一種變革性工具，有可能改善緊急和住院環境中的患者結果和資源分配。我們在 github.com/ipolharvard/ethos-ares 上釋出完整程式碼，以利未來的研究。
-
-##### **An Explainable Disease Surveillance System for Early Prediction of Multiple Chronic Diseases**
-2501.15969v1 by Shaheer Ahmad Khan, Muhammad Usamah Shahid, Ahmad Abdullah, Ibrahim Hashmat, Muddassar Farooq
-
-This study addresses a critical gap in the healthcare system by developing a
-clinically meaningful, practical, and explainable disease surveillance system
-for multiple chronic diseases, utilizing routine EHR data from multiple U.S.
-practices integrated with CureMD's EMR/EHR system. Unlike traditional
-systems--using AI models that rely on features from patients' labs--our
-approach focuses on routinely available data, such as medical history, vitals,
-diagnoses, and medications, to preemptively assess the risks of chronic
-diseases in the next year. We trained three distinct models for each chronic
-disease: prediction models that forecast the risk of a disease 3, 6, and 12
-months before a potential diagnosis. We developed Random Forest models, which
-were internally validated using F1 scores and AUROC as performance metrics and
-further evaluated by a panel of expert physicians for clinical relevance based
-on inferences grounded in medical knowledge. Additionally, we discuss our
-implementation of integrating these models into a practical EMR system. Beyond
-using Shapley attributes and surrogate models for explainability, we also
-introduce a new rule-engineering framework to enhance the intrinsic
-explainability of Random Forests.
-
-摘要：本研究透過開發一個臨床有意義、實用且可解釋的多重慢性疾病疾病監測系統，來解決醫療保健系統中的重大缺口，利用整合 CureMD 的 EMR/EHR 系統，來自多個美國實務的例行 EHR 資料。與傳統系統不同的是，我們的做法著重在例行可得的資料，例如病歷、生命徵象、診斷和藥物，以預先評估未來一年慢性疾病的風險，而非仰賴病患實驗室特徵的 AI 模型。我們針對每種慢性疾病訓練了三個不同的模型：預測模型，用以預測在潛在診斷前 3、6 和 12 個月的疾病風險。我們開發了隨機森林模型，並使用 F1 分數和 AUROC 作為效能指標，進行內部驗證，並進一步由專家醫師小組根據植基於醫學知識的推論，評估其臨床相關性。此外，我們討論了將這些模型整合到實用 EMR 系統中的實作方式。除了使用 Shapley 屬性和代理模型來解釋外，我們還引進了一個新的規則工程架構，以增強隨機森林的內在可解釋性。
-
-##### **Ensuring Medical AI Safety: Explainable AI-Driven Detection and Mitigation of Spurious Model Behavior and Associated Data**
-2501.13818v1 by Frederik Pahde, Thomas Wiegand, Sebastian Lapuschkin, Wojciech Samek
-
-Deep neural networks are increasingly employed in high-stakes medical
-applications, despite their tendency for shortcut learning in the presence of
-spurious correlations, which can have potentially fatal consequences in
-practice. Detecting and mitigating shortcut behavior is a challenging task that
-often requires significant labeling efforts from domain experts. To alleviate
-this problem, we introduce a semi-automated framework for the identification of
-spurious behavior from both data and model perspective by leveraging insights
-from eXplainable Artificial Intelligence (XAI). This allows the retrieval of
-spurious data points and the detection of model circuits that encode the
-associated prediction rules. Moreover, we demonstrate how these shortcut
-encodings can be used for XAI-based sample- and pixel-level data annotation,
-providing valuable information for bias mitigation methods to unlearn the
-undesired shortcut behavior. We show the applicability of our framework using
-four medical datasets across two modalities, featuring controlled and
-real-world spurious correlations caused by data artifacts. We successfully
-identify and mitigate these biases in VGG16, ResNet50, and contemporary Vision
-Transformer models, ultimately increasing their robustness and applicability
-for real-world medical tasks.
-
-摘要：深度神经网络越来越多地用于高风险医疗应用中，尽管它们在存在虚假相关性的情况下倾向于捷径学习，这在实践中可能产生致命的后果。检测和缓解捷径行为是一项艰巨的任务，通常需要领域专家的大量标记工作。为了缓解这个问题，我们引入了一个半自动框架，用于从数据和模型的角度识别虚假行为，方法是利用可解释人工智能 (XAI) 的见解。这允许检索虚假数据点并检测对关联预测规则进行编码的模型电路。此外，我们演示了如何使用这些捷径编码进行基于 XAI 的样本和像素级数据注释，为偏差缓解方法提供有价值的信息，以消除不需要的捷径行为。我们使用跨越两种方式的四个医学数据集展示了我们框架的适用性，这些数据集具有由数据伪像引起的受控和真实世界虚假相关性。我们成功地识别并减轻了 VGG16、ResNet50 和当代 Vision Transformer 模型中的这些偏差，最终提高了它们的鲁棒性和在真实世界医疗任务中的适用性。
-
-##### **Enhanced Suicidal Ideation Detection from Social Media Using a CNN-BiLSTM Hybrid Model**
-2501.11094v1 by Mohaiminul Islam Bhuiyan, Nur Shazwani Kamarudin, Nur Hafieza Ismail
-
-Suicidal ideation detection is crucial for preventing suicides, a leading
-cause of death worldwide. Many individuals express suicidal thoughts on social
-media, offering a vital opportunity for early detection through advanced
-machine learning techniques. The identification of suicidal ideation in social
-media text is improved by utilising a hybrid framework that integrates
-Convolutional Neural Networks (CNN) and Bidirectional Long Short-Term Memory
-(BiLSTM), enhanced with an attention mechanism. To enhance the interpretability
-of the model's predictions, Explainable AI (XAI) methods are applied, with a
-particular focus on SHapley Additive exPlanations (SHAP), are incorporated. At
-first, the model managed to reach an accuracy of 92.81%. By applying
-fine-tuning and early stopping techniques, the accuracy improved to 94.29%. The
-SHAP analysis revealed key features influencing the model's predictions, such
-as terms related to mental health struggles. This level of transparency boosts
-the model's credibility while helping mental health professionals understand
-and trust the predictions. This work highlights the potential for improving the
-accuracy and interpretability of detecting suicidal tendencies, making a
-valuable contribution to the progress of mental health monitoring systems. It
-emphasizes the significance of blending powerful machine learning methods with
-explainability to develop reliable and impactful mental health solutions.
-
-摘要：自殺意念偵測對於預防自殺至關重要，而自殺是全球主要的死亡原因。許多人在社群媒體上表達自殺念頭，這提供了透過進階機器學習技術進行早期偵測的重要機會。透過整合卷積神經網路 (CNN) 和雙向長短期記憶 (BiLSTM) 的混合架構，並加入注意力機制，可以提升在社群媒體文字中辨識自殺意念的能力。為了加強模型預測的可解釋性，我們採用可解釋人工智慧 (XAI) 方法，特別著重於 SHapley 加法解釋 (SHAP)。一開始，模型成功達到 92.81% 的準確度。透過套用微調和早期停止技術，準確度提升至 94.29%。SHAP 分析揭露了影響模型預測的關鍵特徵，例如與心理健康困境相關的詞彙。這種透明度提升了模型的可信度，同時協助心理健康專業人員理解和信賴預測結果。這項工作突顯了提升偵測自殺傾向的準確度和可解釋性的潛力，為心理健康監控系統的進展做出寶貴的貢獻。它強調了將強大的機器學習方法與可解釋性相結合以開發可靠且有影響力的心理健康解決方案的重要性。
-
-##### **SEANN: A Domain-Informed Neural Network for Epidemiological Insights**
-2501.10273v1 by Jean-Baptiste Guimbaud, Marc Plantevit, Léa Maître, Rémy Cazabet
-
-In epidemiology, traditional statistical methods such as logistic regression,
-linear regression, and other parametric models are commonly employed to
-investigate associations between predictors and health outcomes. However,
-non-parametric machine learning techniques, such as deep neural networks
-(DNNs), coupled with explainable AI (XAI) tools, offer new opportunities for
-this task. Despite their potential, these methods face challenges due to the
-limited availability of high-quality, high-quantity data in this field. To
-address these challenges, we introduce SEANN, a novel approach for informed
-DNNs that leverages a prevalent form of domain-specific knowledge: Pooled
-Effect Sizes (PES). PESs are commonly found in published Meta-Analysis studies,
-in different forms, and represent a quantitative form of a scientific
-consensus. By direct integration within the learning procedure using a custom
-loss, we experimentally demonstrate significant improvements in the
-generalizability of predictive performances and the scientific plausibility of
-extracted relationships compared to a domain-knowledge agnostic neural network
-in a scarce and noisy data setting.
-
-摘要：在流行病學中，傳統的統計方法，例如邏輯迴歸、線性迴歸和其他參數模型通常用於調查預測因子與健康結果之間的關聯。然而，非參數機器學習技術，例如深度神經網路 (DNN)，結合可解釋的 AI (XAI) 工具，為這項任務提供了新的機會。儘管這些方法具有潛力，但由於該領域缺乏高品質、高數量資料，因此這些方法面臨挑戰。為了應對這些挑戰，我們引入了 SEANN，這是一種新穎的方法，用於獲取知識的 DNN，它利用了一種流行的領域特定知識形式：彙總效應量 (PES)。PES 通常以不同的形式出現在已發表的 Meta 分析研究中，並代表科學共識的量化形式。通過使用自訂損失函數直接整合在學習程序中，我們以實驗方式證明了預測效能的概括性以及與從缺乏領域知識的神經網路中提取的關係相比，科學合理性的顯著提升，且是在稀少且有雜訊的資料設定中。
-
-##### **Artificial Intelligence-Driven Clinical Decision Support Systems**
-2501.09628v1 by Muhammet Alkan, Idris Zakariyya, Samuel Leighton, Kaushik Bhargav Sivangi, Christos Anagnostopoulos, Fani Deligianni
-
-As artificial intelligence (AI) becomes increasingly embedded in healthcare
-delivery, this chapter explores the critical aspects of developing reliable and
-ethical Clinical Decision Support Systems (CDSS). Beginning with the
-fundamental transition from traditional statistical models to sophisticated
-machine learning approaches, this work examines rigorous validation strategies
-and performance assessment methods, including the crucial role of model
-calibration and decision curve analysis. The chapter emphasizes that creating
-trustworthy AI systems in healthcare requires more than just technical
-accuracy; it demands careful consideration of fairness, explainability, and
-privacy. The challenge of ensuring equitable healthcare delivery through AI is
-stressed, discussing methods to identify and mitigate bias in clinical
-predictive models. The chapter then delves into explainability as a cornerstone
-of human-centered CDSS. This focus reflects the understanding that healthcare
-professionals must not only trust AI recommendations but also comprehend their
-underlying reasoning. The discussion advances in an analysis of privacy
-vulnerabilities in medical AI systems, from data leakage in deep learning
-models to sophisticated attacks against model explanations. The text explores
-privacy-preservation strategies such as differential privacy and federated
-learning, while acknowledging the inherent trade-offs between privacy
-protection and model performance. This progression, from technical validation
-to ethical considerations, reflects the multifaceted challenges of developing
-AI systems that can be seamlessly and reliably integrated into daily clinical
-practice while maintaining the highest standards of patient care and data
-protection.
-
-摘要：隨著人工智慧 (AI) 在醫療保健中的應用日益普及，本章探討了開發可靠且符合道德標準的臨床決策支援系統 (CDSS) 的關鍵面向。從傳統統計模型到複雜機器學習方法的基本轉變開始，這項工作審查了嚴謹的驗證策略和效能評估方法，包括模型校準和決策曲線分析的關鍵角色。本章強調，在醫療保健中建立值得信賴的 AI 系統不只是技術上的準確性；它需要仔細考量公平性、可解釋性和隱私權。本章強調了透過 AI 確保公平的醫療保健服務的挑戰，並討論了識別和減輕臨床預測模型中偏差的方法。接著，本章深入探討可解釋性，作為以人為中心的 CDSS 的基石。這種關注反映了醫療保健專業人員不僅必須信任 AI 建議，還必須理解其背後的推理。討論進一步分析了醫療 AI 系統中的隱私漏洞，從深度學習模型中的資料外洩到針對模型解釋的複雜攻擊。本文探討了隱私保護策略，例如差分隱私和聯合學習，同時承認隱私保護和模型效能之間的固有取捨。這種從技術驗證到道德考量的進展，反映了開發 AI 系統的多面向挑戰，這些系統可以無縫且可靠地整合到日常臨床實務中，同時維持最高的病患照護和資料保護標準。
-
-##### **MedGrad E-CLIP: Enhancing Trust and Transparency in AI-Driven Skin Lesion Diagnosis**
-2501.06887v1 by Sadia Kamal, Tim Oates
-
-As deep learning models gain attraction in medical data, ensuring transparent
-and trustworthy decision-making is essential. In skin cancer diagnosis, while
-advancements in lesion detection and classification have improved accuracy, the
-black-box nature of these methods poses challenges in understanding their
-decision processes, leading to trust issues among physicians. This study
-leverages the CLIP (Contrastive Language-Image Pretraining) model, trained on
-different skin lesion datasets, to capture meaningful relationships between
-visual features and diagnostic criteria terms. To further enhance transparency,
-we propose a method called MedGrad E-CLIP, which builds on gradient-based
-E-CLIP by incorporating a weighted entropy mechanism designed for complex
-medical imaging like skin lesions. This approach highlights critical image
-regions linked to specific diagnostic descriptions. The developed integrated
-pipeline not only classifies skin lesions by matching corresponding
-descriptions but also adds an essential layer of explainability developed
-especially for medical data. By visually explaining how different features in
-an image relates to diagnostic criteria, this approach demonstrates the
-potential of advanced vision-language models in medical image analysis,
-ultimately improving transparency, robustness, and trust in AI-driven
-diagnostic systems.
-
-摘要：随着深度学习模型在医学数据中获得关注，确保透明且值得信赖的决策至关重要。在皮肤癌诊断中，虽然病灶检测和分类的进步提高了准确性，但这些方法的黑盒性质对理解其决策过程构成了挑战，导致医生之间的信任问题。本研究利用在不同皮肤病变数据集上训练的 CLIP（对比语言图像预训练）模型，以捕捉视觉特征和诊断标准术语之间的有意义关系。为了进一步提高透明度，我们提出了一种名为 MedGrad E-CLIP 的方法，该方法通过结合专为皮肤病变等复杂医学影像设计的加权熵机制，建立在基于梯度的 E-CLIP 之上。此方法突出了与特定诊断描述相关联的关键图像区域。开发的集成管道不仅通过匹配相应的描述对皮肤病变进行分类，还添加了一层专门为医学数据开发的基本可解释性。通过直观地解释图像中不同特征与诊断标准的关系，这种方法展示了高级视觉语言模型在医学图像分析中的潜力，最终提高了透明度、稳健性和对人工智能驱动的诊断系统的信任。
-
-##### **Explaining Humour Style Classifications: An XAI Approach to Understanding Computational Humour Analysis**
-2501.02891v1 by Mary Ogbuka Kenneth, Foaad Khosmood, Abbas Edalat
-
-Humour styles can have either a negative or a positive impact on well-being.
-Given the importance of these styles to mental health, significant research has
-been conducted on their automatic identification. However, the automated
-machine learning models used for this purpose are black boxes, making their
-prediction decisions opaque. Clarity and transparency are vital in the field of
-mental health. This paper presents an explainable AI (XAI) framework for
-understanding humour style classification, building upon previous work in
-computational humour analysis. Using the best-performing single model
-(ALI+XGBoost) from prior research, we apply comprehensive XAI techniques to
-analyse how linguistic, emotional, and semantic features contribute to humour
-style classification decisions. Our analysis reveals distinct patterns in how
-different humour styles are characterised and misclassified, with particular
-emphasis on the challenges in distinguishing affiliative humour from other
-styles. Through detailed examination of feature importance, error patterns, and
-misclassification cases, we identify key factors influencing model decisions,
-including emotional ambiguity, context misinterpretation, and target
-identification. The framework demonstrates significant utility in understanding
-model behaviour, achieving interpretable insights into the complex interplay of
-features that define different humour styles. Our findings contribute to both
-the theoretical understanding of computational humour analysis and practical
-applications in mental health, content moderation, and digital humanities
-research.
-
-摘要：幽默風格對幸福感可能產生負面或正面的影響。
-鑑於這些風格對心理健康的重要性，已經對其自動識別進行了大量研究。然而，用於此目的的自動機器學習模型是黑盒子，使得其預測決策不透明。清晰度和透明度在心理健康領域至關重要。本文提出了一個可解釋的 AI (XAI) 框架，用於理解幽默風格分類，建立在計算幽默分析的先前工作之上。使用先前研究中表現最好的單一模型 (ALI+XGBoost)，我們應用全面的 XAI 技術來分析語言、情緒和語義特徵如何影響幽默風格分類決策。我們的分析揭示了不同幽默風格如何被表徵和錯誤分類的不同模式，特別強調了區分聯屬幽默與其他風格的挑戰。通過仔細檢查特徵重要性、錯誤模式和錯誤分類案例，我們確定了影響模型決策的關鍵因素，包括情緒模糊、情境誤解和目標識別。該框架展示了在理解模型行為方面的顯著效用，實現了對定義不同幽默風格的特徵之間複雜相互作用的可解釋見解。我們的發現有助於計算幽默分析的理論理解和心理健康、內容審核和數字人文研究中的實際應用。
-
-##### **The Emotional Spectrum of LLMs: Leveraging Empathy and Emotion-Based Markers for Mental Health Support**
-2412.20068v1 by Alessandro De Grandi, Federico Ravenda, Andrea Raballo, Fabio Crestani
-
-The increasing demand for mental health services has highlighted the need for
-innovative solutions, particularly in the realm of psychological conversational
-AI, where the availability of sensitive data is scarce. In this work, we
-explored the development of a system tailored for mental health support with a
-novel approach to psychological assessment based on explainable emotional
-profiles in combination with empathetic conversational models, offering a
-promising tool for augmenting traditional care, particularly where immediate
-expertise is unavailable. Our work can be divided into two main parts,
-intrinsecaly connected to each other. First, we present RACLETTE, a
-conversational system that demonstrates superior emotional accuracy compared to
-state-of-the-art benchmarks in both understanding users' emotional states and
-generating empathetic responses during conversations, while progressively
-building an emotional profile of the user through their interactions. Second,
-we show how the emotional profiles of a user can be used as interpretable
-markers for mental health assessment. These profiles can be compared with
-characteristic emotional patterns associated with different mental disorders,
-providing a novel approach to preliminary screening and support.
-
-摘要：隨著對心理健康服務需求的增加，凸顯了創新解決方案的需求，特別是在心理對話式人工智慧領域，那裡缺乏敏感資料。在這項工作中，我們探索了開發一個針對心理健康支持的系統，採用一種基於可解釋的情緒特徵的新方法進行心理評估，結合同理心對話模式，提供了一個有前途的工具，用於擴充傳統照護，特別是在無法立即獲得專業知識的情況下。我們的工作可以分為兩個主要部分，彼此內在相關。首先，我們展示了 RACLETTE，一個對話系統，與最先進的基準相比，在理解使用者情緒狀態和在對話中產生同理心回應方面表現出優越的情緒準確性，同時透過他們的互動逐漸建立使用者的情緒特徵。其次，我們展示了使用者的情緒特徵如何可用作心理健康評估的可解釋標記。這些特徵可以與與不同心理疾病相關的典型情緒模式進行比較，提供了一種初步篩選和支持的新方法。
-
-##### **A Review on the Integration of Artificial Intelligence and Medical Imaging in IVF Ovarian Stimulation**
-2412.19688v1 by Jana Zakall, Birgit Pohn, Antonia Graf, Daniel Kovatchki, Arezoo Borji, Ragib Shahriar Islam, Hossam Haick, Heinz Strohmer, Sepideh Hatamikia
-
-Artificial intelligence (AI) has emerged as a powerful tool to enhance
-decision-making and optimize treatment protocols in in vitro fertilization
-(IVF). In particular, AI shows significant promise in supporting
-decision-making during the ovarian stimulation phase of the IVF process. This
-review evaluates studies focused on the applications of AI combined with
-medical imaging in ovarian stimulation, examining methodologies, outcomes, and
-current limitations. Our analysis of 13 studies on this topic reveals that,
-reveal that while AI algorithms demonstrated notable potential in predicting
-optimal hormonal dosages, trigger timing, and oocyte retrieval outcomes, the
-medical imaging data utilized predominantly came from two-dimensional (2D)
-ultrasound which mainly involved basic quantifications, such as follicle size
-and number, with limited use of direct feature extraction or advanced image
-analysis techniques. This points to an underexplored opportunity where advanced
-image analysis approaches, such as deep learning, and more diverse imaging
-modalities, like three-dimensional (3D) ultrasound, could unlock deeper
-insights. Additionally, the lack of explainable AI (XAI) in most studies raises
-concerns about the transparency and traceability of AI-driven decisions - key
-factors for clinical adoption and trust. Furthermore, many studies relied on
-single-center designs and small datasets, which limit the generalizability of
-their findings. This review highlights the need for integrating advanced
-imaging analysis techniques with explainable AI methodologies, as well as the
-importance of leveraging multicenter collaborations and larger datasets.
-Addressing these gaps has the potential to enhance ovarian stimulation
-management, paving the way for efficient, personalized, and data-driven
-treatment pathways that improve IVF outcomes.
-
-摘要：人工智慧（AI）已成為增強體外受精（IVF）決策制定和優化治療方案的強大工具。特別是，AI 在支持 IVF 過程中卵巢刺激階段的決策制定方面顯示出顯著的前景。本綜述評估了專注於 AI 結合卵巢刺激中的醫學影像應用、檢驗方法、結果和當前限制的研究。我們對 13 項關於此主題的研究分析顯示，雖然 AI 演算法在預測最佳荷爾蒙劑量、觸發時機和卵子取出結果方面表現出顯著的潛力，但所利用的醫學影像數據主要來自於二次元（2D）超音波，而二次元超音波主要涉及基本量化，例如濾泡大小和數量，且有限使用直接特徵提取或進階影像分析技術。這指向一個尚未探索的機會，例如深度學習等進階影像分析方法，以及更多元的影像模式，例如三維（3D）超音波，可以解鎖更深入的見解。此外，大多數研究缺乏可解釋 AI（XAI），這引起了人們對 AI 驅動決策的透明度和可追溯性的擔憂，而透明度和可追溯性是臨床採用和信任的關鍵因素。此外，許多研究依賴於單中心設計和小型數據集，這限制了其發現的普遍性。本綜述強調了將進階影像分析技術與可解釋 AI 方法整合起來的必要性，以及利用多中心合作和大型數據集的重要性。解決這些差距有可能增強卵巢刺激管理，為有效、個人化和數據驅動的治療途徑鋪平道路，進而改善 IVF 結果。
-
-##### **Enhancing Cancer Diagnosis with Explainable & Trustworthy Deep Learning Models**
-2412.17527v1 by Badaru I. Olumuyiwa, The Anh Han, Zia U. Shamszaman
-
-This research presents an innovative approach to cancer diagnosis and
-prediction using explainable Artificial Intelligence (XAI) and deep learning
-techniques. With cancer causing nearly 10 million deaths globally in 2020,
-early and accurate diagnosis is crucial. Traditional methods often face
-challenges in cost, accuracy, and efficiency. Our study develops an AI model
-that provides precise outcomes and clear insights into its decision-making
-process, addressing the "black box" problem of deep learning models. By
-employing XAI techniques, we enhance interpretability and transparency,
-building trust among healthcare professionals and patients. Our approach
-leverages neural networks to analyse extensive datasets, identifying patterns
-for cancer detection. This model has the potential to revolutionise diagnosis
-by improving accuracy, accessibility, and clarity in medical decision-making,
-possibly leading to earlier detection and more personalised treatment
-strategies. Furthermore, it could democratise access to high-quality
-diagnostics, particularly in resource-limited settings, contributing to global
-health equity. The model's applications extend beyond cancer diagnosis,
-potentially transforming various aspects of medical decision-making and saving
-millions of lives worldwide.
-
-摘要：本研究提出了一個創新的癌症診斷和預測方法，使用可解釋的人工智慧 (XAI) 和深度學習技術。由於癌症在 2020 年造成全球近 1,000 萬人死亡，因此早期準確的診斷至關重要。傳統方法通常面臨成本、準確性和效率方面的挑戰。我們的研究開發了一個 AI 模型，它提供精確的結果並清楚地了解其決策過程，解決了深度學習模型的「黑箱」問題。通過採用 XAI 技術，我們增強了解釋性和透明度，在醫療專業人員和患者之間建立信任。我們的做法利用神經網路分析廣泛的數據集，識別癌症檢測模式。這個模型有可能通過提高醫療決策的準確性、可及性和清晰度來革新診斷，可能導致更早的檢測和更個性化的治療策略。此外，它可以使更多人獲得高品質的診斷，特別是在資源有限的環境中，有助於全球健康公平。該模型的應用範圍不僅限於癌症診斷，還可能轉變醫療決策的各個方面，並拯救全球數百萬人的生命。
-
-##### **Towards Interpretable Radiology Report Generation via Concept Bottlenecks using a Multi-Agentic RAG**
-2412.16086v2 by Hasan Md Tusfiqur Alam, Devansh Srivastav, Md Abdul Kadir, Daniel Sonntag
-
-Deep learning has advanced medical image classification, but interpretability
-challenges hinder its clinical adoption. This study enhances interpretability
-in Chest X-ray (CXR) classification by using concept bottleneck models (CBMs)
-and a multi-agent Retrieval-Augmented Generation (RAG) system for report
-generation. By modeling relationships between visual features and clinical
-concepts, we create interpretable concept vectors that guide a multi-agent RAG
-system to generate radiology reports, enhancing clinical relevance,
-explainability, and transparency. Evaluation of the generated reports using an
-LLM-as-a-judge confirmed the interpretability and clinical utility of our
-model's outputs. On the COVID-QU dataset, our model achieved 81% classification
-accuracy and demonstrated robust report generation performance, with five key
-metrics ranging between 84% and 90%. This interpretable multi-agent framework
-bridges the gap between high-performance AI and the explainability required for
-reliable AI-driven CXR analysis in clinical settings. Our code is available at
-https://github.com/tifat58/IRR-with-CBM-RAG.git.
-
-摘要：深度學習已提升醫學影像分類，但可解釋性挑戰阻礙其臨床應用。本研究透過使用概念瓶頸模型 (CBM) 和多代理檢索增強生成 (RAG) 系統進行報告生成，來增強胸部 X 光 (CXR) 分類的可解釋性。透過建模視覺特徵與臨床概念之間的關係，我們建立可解釋的概念向量，引導多代理 RAG 系統生成放射報告，增強臨床相關性、可解釋性和透明度。使用 LLM 作為評審員對生成報告進行評估，確認了我們模型輸出的可解釋性和臨床效用。在 COVID-QU 資料集上，我們的模型達到了 81% 的分類準確率，並展示了穩健的報告生成效能，五項關鍵指標介於 84% 至 90% 之間。這個可解釋的多代理架構彌合了高性能 AI 與臨床環境中可靠的 AI 驅動 CXR 分析所需的解釋性之間的差距。我們的程式碼可於 https://github.com/tifat58/IRR-with-CBM-RAG.git 取得。
-
-##### **Critique of Impure Reason: Unveiling the reasoning behaviour of medical Large Language Models**
-2412.15748v1 by Shamus Sim, Tyrone Chen
-
-Background: Despite the current ubiquity of Large Language Models (LLMs)
-across the medical domain, there is a surprising lack of studies which address
-their reasoning behaviour. We emphasise the importance of understanding
-reasoning behaviour as opposed to high-level prediction accuracies, since it is
-equivalent to explainable AI (XAI) in this context. In particular, achieving
-XAI in medical LLMs used in the clinical domain will have a significant impact
-across the healthcare sector. Results: Therefore, we define the concept of
-reasoning behaviour in the specific context of medical LLMs. We then categorise
-and discuss the current state of the art of methods which evaluate reasoning
-behaviour in medical LLMs. Finally, we propose theoretical frameworks which can
-empower medical professionals or machine learning engineers to gain insight
-into the low-level reasoning operations of these previously obscure models.
-Conclusion: The subsequent increased transparency and trust in medical machine
-learning models by clinicians as well as patients will accelerate the
-integration, application as well as further development of medical AI for the
-healthcare system as a whole
-
-摘要：背景：儘管大型語言模型 (LLM) 目前在醫療領域無所不在，但令人驚訝的是，探討其推理行為的研究卻相當缺乏。我們強調了解推理行為而非高層級的預測準確度非常重要，因為在這種情況下，這等同於可解釋 AI (XAI)。尤其是在臨床領域中使用的醫療 LLM 中實現 XAI，將對整個醫療保健產業產生重大影響。結果：因此，我們在醫療 LLM 的特定背景下定義了推理行為的概念。接著我們分類並探討當前評估醫療 LLM 中推理行為的方法的最新技術。最後，我們提出理論架構，讓醫療專業人員或機器學習工程師得以深入了解這些先前模糊模型的低層級推理運算。結論：臨床醫生和患者對醫療機器學習模型的透明度和信任度隨之提升，將加速醫療 AI 在整個醫療保健系統中的整合、應用和進一步發展。
-
-##### **Cognition Chain for Explainable Psychological Stress Detection on Social Media**
-2412.14009v1 by Xin Wang, Boyan Gao, Yi Dai, Lei Cao, Liang Zhao, Yibo Yang, David Clifton
-
-Stress is a pervasive global health issue that can lead to severe mental
-health problems. Early detection offers timely intervention and prevention of
-stress-related disorders. The current early detection models perform "black
-box" inference suffering from limited explainability and trust which blocks the
-real-world clinical application. Thanks to the generative properties introduced
-by the Large Language Models (LLMs), the decision and the prediction from such
-models are semi-interpretable through the corresponding description. However,
-the existing LLMs are mostly trained for general purposes without the guidance
-of psychological cognitive theory. To this end, we first highlight the
-importance of prior theory with the observation of performance boosted by the
-chain-of-thoughts tailored for stress detection. This method termed Cognition
-Chain explicates the generation of stress through a step-by-step cognitive
-perspective based on cognitive appraisal theory with a progress pipeline:
-Stimulus $\rightarrow$ Evaluation $\rightarrow$ Reaction $\rightarrow$ Stress
-State, guiding LLMs to provide comprehensive reasoning explanations. We further
-study the benefits brought by the proposed Cognition Chain format by utilising
-it as a synthetic dataset generation template for LLMs instruction-tuning and
-introduce CogInstruct, an instruction-tuning dataset for stress detection. This
-dataset is developed using a three-stage self-reflective annotation pipeline
-that enables LLMs to autonomously generate and refine instructional data. By
-instruction-tuning Llama3 with CogInstruct, we develop CogLLM, an explainable
-stress detection model. Evaluations demonstrate that CogLLM achieves
-outstanding performance while enhancing explainability. Our work contributes a
-novel approach by integrating cognitive theories into LLM reasoning processes,
-offering a promising direction for future explainable AI research.
-
-摘要：壓力是一個普遍的全球性健康問題，可能會導致嚴重的精神
-健康問題。早期發現提供及時的干預和預防
-壓力相關疾病。目前的早期發現模型執行「黑
-盒子」推論，存在可解釋性和信任度有限的問題，阻礙了
-現實世界的臨床應用。多虧了大型語言模型 (LLM) 引入的生成屬性，此類
-模型的決策和預測通過對應描述具有半可解釋性。然而，
-現有的 LLM 主要針對一般用途進行訓練，沒有心理認知理論的指導。為此，我們首先強調
-先驗理論的重要性，並觀察到針對壓力檢測量身定制的思想鏈提升了性能。這種方法稱為認知
-鏈通過基於認知評估理論的循序漸進的認知視角闡明了壓力的產生，並具有進度管道：
-刺激 $\rightarrow$ 評估 $\rightarrow$ 反應 $\rightarrow$ 壓力
-狀態，指導 LLM 提供全面的推理解釋。我們進一步
-通過將其用作 LLM 指令調整的合成數據集生成模板來研究所提出的認知鏈格式帶來的優點，並介紹 CogInstruct，這是一個針對壓力檢測的指令調整數據集。這個
-數據集是使用一個三階段的自省標註管道開發的，使 LLM 能夠自主生成和優化指令數據。通過
-使用 CogInstruct 對 Llama3 進行指令調整，我們開發了 CogLLM，這是一個可解釋的
-壓力檢測模型。評估表明，CogLLM 在提高可解釋性的同時實現了出色的性能。我們的研究通過將認知理論整合到 LLM 推理過程中，提出了一種新穎的方法，
-為未來的可解釋人工智能研究提供了一個有希望的方向。
-
-##### **2-Factor Retrieval for Improved Human-AI Decision Making in Radiology**
-2412.00372v1 by Jim Solomon, Laleh Jalilian, Alexander Vilesov, Meryl Mathew, Tristan Grogan, Arash Bedayat, Achuta Kadambi
-
-Human-machine teaming in medical AI requires us to understand to what degree
-a trained clinician should weigh AI predictions. While previous work has shown
-the potential of AI assistance at improving clinical predictions, existing
-clinical decision support systems either provide no explainability of their
-predictions or use techniques like saliency and Shapley values, which do not
-allow for physician-based verification. To address this gap, this study
-compares previously used explainable AI techniques with a newly proposed
-technique termed '2-factor retrieval (2FR)', which is a combination of
-interface design and search retrieval that returns similarly labeled data
-without processing this data. This results in a 2-factor security blanket
-where: (a) correct images need to be retrieved by the AI; and (b) humans should
-associate the retrieved images with the current pathology under test. We find
-that when tested on chest X-ray diagnoses, 2FR leads to increases in clinician
-accuracy, with particular improvements when clinicians are radiologists and
-have low confidence in their decision. Our results highlight the importance of
-understanding how different modes of human-AI decision making may impact
-clinician accuracy in clinical decision support systems.
-
-摘要：人機協作在醫療 AI 中，需要我們理解受過訓練的臨床醫生在多大程度上應重視 AI 預測。雖然先前的研究顯示 AI 輔助在改善臨床預測方面的潛力，但現有的臨床決策支援系統，要不就沒有提供預測的可解釋性，要不就是使用像顯著性和 Shapley 值之類的技術，這些技術不允許基於醫生的驗證。為了解決這個差距，本研究將先前使用的可解釋 AI 技術與一種新提出的稱為「2 因子檢索 (2FR)」的技術進行比較，後者是一種介面設計和搜尋檢索的組合，它會傳回標籤相似的資料，而不會處理這些資料。這會產生一個 2 因子安全機制，其中：(a) 正確的影像需要由 AI 檢索；(b) 人類應將檢索的影像與正在測試中的病理聯想起來。我們發現，當在胸部 X 光診斷上進行測試時，2FR 會提高臨床醫生的準確度，特別是在臨床醫生是放射科醫生且對其決策信心不足時，會有顯著的改善。我們的結果強調了理解人機決策的不同模式如何影響臨床醫生在臨床決策支援系統中的準確性的重要性。
-
-##### **Mapping Public Perception of Artificial Intelligence: Expectations, Risk-Benefit Tradeoffs, and Value As Determinants for Societal Acceptance**
-2411.19356v1 by Philipp Brauner, Felix Glawe, Gian Luca Liehner, Luisa Vervier, Martina Ziefle
-
-Understanding public perception of artificial intelligence (AI) and the
-tradeoffs between potential risks and benefits is crucial, as these perceptions
-might shape policy decisions, influence innovation trajectories for successful
-market strategies, and determine individual and societal acceptance of AI
-technologies. Using a representative sample of 1100 participants from Germany,
-this study examines mental models of AI. Participants quantitatively evaluated
-71 statements about AI's future capabilities (e.g., autonomous driving, medical
-care, art, politics, warfare, and societal divides), assessing the expected
-likelihood of occurrence, perceived risks, benefits, and overall value. We
-present rankings of these projections alongside visual mappings illustrating
-public risk-benefit tradeoffs. While many scenarios were deemed likely,
-participants often associated them with high risks, limited benefits, and low
-overall value. Across all scenarios, 96.4% ($r^2=96.4\%$) of the variance in
-value assessment can be explained by perceived risks ($\beta=-.504$) and
-perceived benefits ($\beta=+.710$), with no significant relation to expected
-likelihood. Demographics and personality traits influenced perceptions of
-risks, benefits, and overall evaluations, underscoring the importance of
-increasing AI literacy and tailoring public information to diverse user needs.
-These findings provide actionable insights for researchers, developers, and
-policymakers by highlighting critical public concerns and individual factors
-essential to align AI development with individual values.
-
-摘要：<paragraph>了解公眾對人工智慧 (AI) 的認知以及潛在風險與好處之間的權衡至關重要，因為這些認知可能會影響政策決策、影響成功市場策略的創新軌跡，並決定個人和社會對 AI 技術的接受度。本研究使用來自德國的 1100 名參與者的代表性樣本，探討了 AI 的心智模型。參與者對 71 項關於 AI 未來能力的陳述（例如，自動駕駛、醫療保健、藝術、政治、戰爭和社會分歧）進行了定量評估，評估預期的發生可能性、感知風險、好處和整體價值。我們展示了這些預測的排名，並附上視覺化映射，說明了公眾的風險收益權衡。儘管許多場景被認為是可能的，但參與者通常將它們與高風險、有限的好處和低整體價值聯繫起來。在所有場景中，96.4% ($r^2=96.4\%$) 的價值評估差異可以用感知風險 ($\beta=-.504$) 和感知好處 ($\beta=+.710$) 來解釋，與預期的可能性沒有顯著關係。人口統計和人格特質影響了對風險、好處和整體評估的看法，這凸顯了提高 AI 素養和根據不同的使用者需求調整公共資訊的重要性。這些發現通過強調關鍵的公共關注和與個人價值觀一致的 AI 開發必不可少的個人因素，為研究人員、開發人員和政策制定者提供了可行的見解。</paragraph>
-
-##### **Explainable AI for Classifying UTI Risk Groups Using a Real-World Linked EHR and Pathology Lab Dataset**
-2411.17645v2 by Yujie Dai, Brian Sullivan, Axel Montout, Amy Dillon, Chris Waller, Peter Acs, Rachel Denholm, Philip Williams, Alastair D Hay, Raul Santos-Rodriguez, Andrew Dowsey
-
-The use of machine learning and AI on electronic health records (EHRs) holds
-substantial potential for clinical insight. However, this approach faces
-challenges due to data heterogeneity, sparsity, temporal misalignment, and
-limited labeled outcomes. In this context, we leverage a linked EHR dataset of
-approximately one million de-identified individuals from Bristol, North
-Somerset, and South Gloucestershire, UK, to characterize urinary tract
-infections (UTIs). We implemented a data pre-processing and curation pipeline
-that transforms the raw EHR data into a structured format suitable for
-developing predictive models focused on data fairness, accountability and
-transparency. Given the limited availability and biases of ground truth UTI
-outcomes, we introduce a UTI risk estimation framework informed by clinical
-expertise to estimate UTI risk across individual patient timelines. Pairwise
-XGBoost models are trained using this framework to differentiate UTI risk
-categories with explainable AI techniques applied to identify key predictors
-and support interpretability. Our findings reveal differences in clinical and
-demographic predictors across risk groups. While this study highlights the
-potential of AI-driven insights to support UTI clinical decision-making,
-further investigation of patient sub-strata and extensive validation are needed
-to ensure robustness and applicability in clinical practice.
-
-摘要：電子健康紀錄 (EHR) 中機器學習和 AI 的使用對於臨床見解具有相當大的潛力。然而，由於資料異質性、稀疏性、時間錯位和標籤結果有限，此方法面臨挑戰。在此背景下，我們利用來自英國布里斯托、北薩默塞特和南格洛斯特郡約一百萬名去識別個人連結的 EHR 資料集，來描述尿路感染 (UTI)。我們實施了將原始 EHR 資料轉換為結構化格式的資料前處理和整理管線，適合開發專注於資料公平性、問責制和透明度的預測模型。鑑於 UTI 真實結果的可用性有限和偏差，我們引入了由臨床專業知識告知的 UTI 風險評估架構，以估計個別患者時間軸上的 UTI 風險。成對的 XGBoost 模型使用此架構進行訓練，以區分 UTI 風險類別，並應用可解釋的 AI 技術來識別關鍵預測因子並支持可解釋性。我們的研究結果揭示了不同風險群組在臨床和人口統計預測因子上的差異。雖然這項研究強調了 AI 驅動見解在支援 UTI 臨床決策制定方面的潛力，但仍需要進一步調查患者子群體和廣泛驗證，以確保在臨床實務中的穩健性和適用性。
-
-##### **Exploring the Requirements of Clinicians for Explainable AI Decision Support Systems in Intensive Care**
-2411.11774v1 by Jeffrey N. Clark, Matthew Wragg, Emily Nielsen, Miquel Perello-Nieto, Nawid Keshtmand, Michael Ambler, Shiv Sharma, Christopher P. Bourdeaux, Amberly Brigden, Raul Santos-Rodriguez
-
-There is a growing need to understand how digital systems can support
-clinical decision-making, particularly as artificial intelligence (AI) models
-become increasingly complex and less human-interpretable. This complexity
-raises concerns about trustworthiness, impacting safe and effective adoption of
-such technologies. Improved understanding of decision-making processes and
-requirements for explanations coming from decision support tools is a vital
-component in providing effective explainable solutions. This is particularly
-relevant in the data-intensive, fast-paced environments of intensive care units
-(ICUs). To explore these issues, group interviews were conducted with seven ICU
-clinicians, representing various roles and experience levels. Thematic analysis
-revealed three core themes: (T1) ICU decision-making relies on a wide range of
-factors, (T2) the complexity of patient state is challenging for shared
-decision-making, and (T3) requirements and capabilities of AI decision support
-systems. We include design recommendations from clinical input, providing
-insights to inform future AI systems for intensive care.
-
-摘要：隨著人工智慧 (AI) 模型變得越來越複雜，且越來越難以被人理解，了解數位系統如何支援臨床決策的需求也日益增加。這種複雜性引發了對可信度的疑慮，影響了此類技術的安全且有效採用。改善對決策制定流程的理解，以及對決策支援工具所提供說明的要求，是提供有效可解釋解決方案的重要組成部分。這在資料密集、快節奏的加護病房 (ICU) 環境中特別相關。為了探討這些問題，對七位 ICU 臨床醫師進行了小組訪談，這些醫師代表了不同的角色和經驗層級。主題分析揭露了三個核心主題：(T1) ICU 決策制定依賴於廣泛的因素，(T2) 病患狀態的複雜性對共同決策制定構成挑戰，以及 (T3) AI 決策支援系統的要求和能力。我們納入了臨床輸入的設計建議，提供見解以提供資訊給未來用於加護的 AI 系統。
-
-##### **Artificial Intelligence in Pediatric Echocardiography: Exploring Challenges, Opportunities, and Clinical Applications with Explainable AI and Federated Learning**
-2411.10255v1 by Mohammed Yaseen Jabarulla, Theodor Uden, Thomas Jack, Philipp Beerbaum, Steffen Oeltze-Jafra
-
-Pediatric heart diseases present a broad spectrum of congenital and acquired
-diseases. More complex congenital malformations require a differentiated and
-multimodal decision-making process, usually including echocardiography as a
-central imaging method. Artificial intelligence (AI) offers considerable
-promise for clinicians by facilitating automated interpretation of pediatric
-echocardiography data. However, adapting AI technologies for pediatric
-echocardiography analysis has challenges such as limited public data
-availability, data privacy, and AI model transparency. Recently, researchers
-have focused on disruptive technologies, such as federated learning (FL) and
-explainable AI (XAI), to improve automatic diagnostic and decision support
-workflows. This study offers a comprehensive overview of the limitations and
-opportunities of AI in pediatric echocardiography, emphasizing the synergistic
-workflow and role of XAI and FL, identifying research gaps, and exploring
-potential future developments. Additionally, three relevant clinical use cases
-demonstrate the functionality of XAI and FL with a focus on (i) view
-recognition, (ii) disease classification, (iii) segmentation of cardiac
-structures, and (iv) quantitative assessment of cardiac function.
-
-摘要：小兒心臟疾病呈現先天性與後天性疾病的廣泛光譜。較複雜的先天性畸形需要一個差異化且多模式的決策過程，通常包括超音波檢查作為主要的影像方法。人工智慧 (AI) 為臨床醫生提供了相當大的希望，因為它可以促進小兒超音波檢查資料的自動化解讀。然而，將人工智慧技術應用於小兒超音波檢查分析有許多挑戰，例如有限的公開資料可用性、資料隱私和人工智慧模型透明度。最近，研究人員專注於破壞性技術，例如聯合學習 (FL) 和可解釋人工智慧 (XAI)，以改善自動診斷和決策支援工作流程。本研究提供了人工智慧在小兒超音波檢查中的限制和機會的全面概述，強調了 XAI 和 FL 的協同工作流程和角色，找出研究差距並探討潛在的未來發展。此外，三個相關的臨床使用案例展示了 XAI 和 FL 的功能，重點在於 (i) 檢視辨識、(ii) 疾病分類、(iii) 心臟結構分割和 (iv) 心臟功能的量化評估。
-
-##### **Enhancing Osteoporosis Detection: An Explainable Multi-Modal Learning Framework with Feature Fusion and Variable Clustering**
-2411.00916v2 by Mehdi Hosseini Chagahi, Saeed Mohammadi Dashtaki, Niloufar Delfan, Nadia Mohammadi, Alireza Samari, Behzad Moshiri, Md. Jalil Piran, Oliver Faust
-
-Osteoporosis is a common condition that increases fracture risk, especially
-in older adults. Early diagnosis is vital for preventing fractures, reducing
-treatment costs, and preserving mobility. However, healthcare providers face
-challenges like limited labeled data and difficulties in processing medical
-images. This study presents a novel multi-modal learning framework that
-integrates clinical and imaging data to improve diagnostic accuracy and model
-interpretability. The model utilizes three pre-trained networks-VGG19,
-InceptionV3, and ResNet50-to extract deep features from X-ray images. These
-features are transformed using PCA to reduce dimensionality and focus on the
-most relevant components. A clustering-based selection process identifies the
-most representative components, which are then combined with preprocessed
-clinical data and processed through a fully connected network (FCN) for final
-classification. A feature importance plot highlights key variables, showing
-that Medical History, BMI, and Height were the main contributors, emphasizing
-the significance of patient-specific data. While imaging features were
-valuable, they had lower importance, indicating that clinical data are crucial
-for accurate predictions. This framework promotes precise and interpretable
-predictions, enhancing transparency and building trust in AI-driven diagnoses
-for clinical integration.
-
-摘要：骨質疏鬆症是一種常見的疾病，會增加骨折的風險，特別是老年人。早期診斷對於預防骨折、降低治療成本和維持行動能力至關重要。然而，醫療保健提供者面臨著標記數據有限和處理醫學影像困難等挑戰。本研究提出了一個新穎的多模式學習框架，該框架整合了臨床和影像數據，以提高診斷準確性和模型可解釋性。該模型利用三個預訓練的網路，VGG19、InceptionV3 和 ResNet50，從 X 射線影像中提取深度特徵。這些特徵使用 PCA 轉換以降低維度並專注於最相關的組成部分。基於聚類的選擇過程識別出最具代表性的組成部分，然後將這些組成部分與預處理的臨床數據結合，並通過全連接網路 (FCN) 進行最終分類。特徵重要性圖突出了關鍵變數，表明病史、BMI 和身高是主要貢獻因素，強調了患者特定數據的重要性。雖然影像特徵很有價值，但它們的重要性較低，這表明臨床數據對於準確預測至關重要。此框架促进了準確且可解釋的預測，提高了透明度，並建立了對 AI 驅動診斷在臨床整合中的信任。
+|**2025-02-13**|**Visual Graph Question Answering with ASP and LLMs for Language Parsing**|Jakob Johannes Bauer et.al.|[2502.09211v1](http://arxiv.org/abs/2502.09211v1)|null|
+|**2025-02-12**|**Representation Learning to Advance Multi-institutional Studies with Electronic Health Record Data**|Doudou Zhou et.al.|[2502.08547v1](http://arxiv.org/abs/2502.08547v1)|null|
+|**2025-02-12**|**Trustworthy GNNs with LLMs: A Systematic Review and Taxonomy**|Ruizhan Xue et.al.|[2502.08353v1](http://arxiv.org/abs/2502.08353v1)|null|
+|**2025-02-12**|**Graph Foundation Models for Recommendation: A Comprehensive Survey**|Bin Wu et.al.|[2502.08346v1](http://arxiv.org/abs/2502.08346v1)|null|
+|**2025-02-12**|**Self-Evaluation for Job-Shop Scheduling**|Imanol Echeverria et.al.|[2502.08684v1](http://arxiv.org/abs/2502.08684v1)|null|
+|**2025-02-12**|**Improving Existing Optimization Algorithms with LLMs**|Camilo Chacón Sartori et.al.|[2502.08298v1](http://arxiv.org/abs/2502.08298v1)|null|
+|**2025-02-12**|**ACCESS : A Benchmark for Abstract Causal Event Discovery and Reasoning**|Vy Vo et.al.|[2502.08148v1](http://arxiv.org/abs/2502.08148v1)|null|
+|**2025-02-12**|**GCoT: Chain-of-Thought Prompt Learning for Graphs**|Xingtong Yu et.al.|[2502.08092v1](http://arxiv.org/abs/2502.08092v1)|null|
+|**2025-02-11**|**Deep Semantic Graph Learning via LLM based Node Enhancement**|Chuanqi Shi et.al.|[2502.07982v1](http://arxiv.org/abs/2502.07982v1)|null|
+|**2025-02-10**|**Cardiverse: Harnessing LLMs for Novel Card Game Prototyping**|Danrui Li et.al.|[2502.07128v1](http://arxiv.org/abs/2502.07128v1)|null|
+|**2025-02-10**|**GraNNite: Enabling High-Performance Execution of Graph Neural Networks on Resource-Constrained Neural Processing Units**|Arghadip Das et.al.|[2502.06921v2](http://arxiv.org/abs/2502.06921v2)|[link](https://github.com/arghadippurdue/GraNNite)|
+|**2025-02-10**|**Automatic Annotation Augmentation Boosts Translation between Molecules and Natural Language**|Zhiqiang Zhong et.al.|[2502.06634v1](http://arxiv.org/abs/2502.06634v1)|null|
+|**2025-02-10**|**KARMA: Leveraging Multi-Agent LLMs for Automated Knowledge Graph Enrichment**|Yuxing Lu et.al.|[2502.06472v1](http://arxiv.org/abs/2502.06472v1)|null|
+|**2025-02-10**|**RoToR: Towards More Reliable Responses for Order-Invariant Inputs**|Soyoung Yoon et.al.|[2502.08662v1](http://arxiv.org/abs/2502.08662v1)|null|
+|**2025-02-10**|**K-ON: Stacking Knowledge On the Head Layer of Large Language Model**|Lingbing Guo et.al.|[2502.06257v1](http://arxiv.org/abs/2502.06257v1)|null|
+|**2025-02-10**|**LegalViz: Legal Text Visualization by Text To Diagram Generation**|Eri Onami et.al.|[2502.06147v2](http://arxiv.org/abs/2502.06147v2)|null|
+|**2025-02-09**|**Deconstructing Depression Stigma: Integrating AI-driven Data Collection and Analysis with Causal Knowledge Graphs**|Han Meng et.al.|[2502.06075v1](http://arxiv.org/abs/2502.06075v1)|null|
+|**2025-02-09**|**LegalSeg: Unlocking the Structure of Indian Legal Judgments Through Rhetorical Role Classification**|Shubham Kumar Nigam et.al.|[2502.05836v1](http://arxiv.org/abs/2502.05836v1)|null|
+|**2025-02-08**|**LLM-Powered Decentralized Generative Agents with Adaptive Hierarchical Knowledge Graph for Cooperative Planning**|Hanqing Yang et.al.|[2502.05453v1](http://arxiv.org/abs/2502.05453v1)|null|
+|**2025-02-08**|**SAMGPT: Text-free Graph Foundation Model for Multi-domain Pre-training and Cross-domain Adaptation**|Xingtong Yu et.al.|[2502.05424v1](http://arxiv.org/abs/2502.05424v1)|null|
+|**2025-02-08**|**Graph-based Molecular In-context Learning Grounded on Morgan Fingerprints**|Ali Al-Lawati et.al.|[2502.05414v1](http://arxiv.org/abs/2502.05414v1)|null|
+|**2025-02-08**|**Knowledge Graph-Guided Retrieval Augmented Generation**|Xiangrong Zhu et.al.|[2502.06864v1](http://arxiv.org/abs/2502.06864v1)|[link](https://github.com/nju-websoft/KG2RAG)|
+|**2025-02-07**|**Can Large Language Models Understand Intermediate Representations?**|Hailong Jiang et.al.|[2502.06854v1](http://arxiv.org/abs/2502.06854v1)|null|
+|**2025-02-07**|**GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity?**|Yang Zhou et.al.|[2502.05252v1](http://arxiv.org/abs/2502.05252v1)|null|
+|**2025-02-07**|**Causality can systematically address the monsters under the bench(marks)**|Felix Leeb et.al.|[2502.05085v1](http://arxiv.org/abs/2502.05085v1)|null|
+|**2025-02-07**|**Adaptive Graph of Thoughts: Test-Time Adaptive Reasoning Unifying Chain, Tree, and Graph Structures**|Tushar Pandey et.al.|[2502.05078v1](http://arxiv.org/abs/2502.05078v1)|[link](https://github.com/AgnostiqHQ/multi-agent-llm)|
+|**2025-02-07**|**Enhancing Knowledge Graph Construction: Evaluating with Emphasis on Hallucination, Omission, and Graph Similarity Metrics**|Hussam Ghanem et.al.|[2502.05239v1](http://arxiv.org/abs/2502.05239v1)|null|
+|**2025-02-07**|**Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research**|Junde Wu et.al.|[2502.04644v1](http://arxiv.org/abs/2502.04644v1)|[link](https://github.com/theworldofagents/agentic-reasoning)|
+|**2025-02-07**|**Position-aware Automatic Circuit Discovery**|Tal Haklay et.al.|[2502.04577v1](http://arxiv.org/abs/2502.04577v1)|[link](https://github.com/technion-cs-nlp/peap)|
+|**2025-02-06**|**Heterogeneous Swarms: Jointly Optimizing Model Roles and Weights for Multi-LLM Systems**|Shangbin Feng et.al.|[2502.04510v1](http://arxiv.org/abs/2502.04510v1)|null|
+|**2025-02-06**|**MedRAG: Enhancing Retrieval-augmented Generation with Knowledge Graph-Elicited Reasoning for Healthcare Copilot**|Xuejiao Zhao et.al.|[2502.04413v1](http://arxiv.org/abs/2502.04413v1)|[link](https://github.com/snowteam2023/medrag)|
+|**2025-02-06**|**Ontology-Guided, Hybrid Prompt Learning for Generalization in Knowledge Graph Question Answering**|Longquan Jiang et.al.|[2502.03992v1](http://arxiv.org/abs/2502.03992v1)|[link](https://github.com/longquanjiang/ontoscprompt)|
+|**2025-02-06**|**Multimodal Medical Code Tokenizer**|Xiaorui Su et.al.|[2502.04397v2](http://arxiv.org/abs/2502.04397v2)|null|
+|**2025-02-06**|**Division-of-Thoughts: Harnessing Hybrid Language Model Synergy for Efficient On-Device Agents**|Chenyang Shao et.al.|[2502.04392v1](http://arxiv.org/abs/2502.04392v1)|null|
+|**2025-02-06**|**Boosting Knowledge Graph-based Recommendations through Confidence-Aware Augmentation with Large Language Models**|Rui Cai et.al.|[2502.03715v1](http://arxiv.org/abs/2502.03715v1)|null|
+|**2025-02-05**|**A Schema-Guided Reason-while-Retrieve framework for Reasoning on Scene Graphs with Large-Language-Models (LLMs)**|Yiye Chen et.al.|[2502.03450v1](http://arxiv.org/abs/2502.03450v1)|null|
+|**2025-02-05**|**SymAgent: A Neural-Symbolic Self-Learning Agent Framework for Complex Reasoning over Knowledge Graphs**|Ben Liu et.al.|[2502.03283v1](http://arxiv.org/abs/2502.03283v1)|null|
+|**2025-02-05**|**Analyze Feature Flow to Enhance Interpretation and Steering in Language Models**|Daniil Laptev et.al.|[2502.03032v2](http://arxiv.org/abs/2502.03032v2)|null|
+|**2025-02-05**|**A Benchmark for the Detection of Metalinguistic Disagreements between LLMs and Knowledge Graphs**|Bradley P. Allen et.al.|[2502.02896v1](http://arxiv.org/abs/2502.02896v1)|null|
+|**2025-02-05**|**Mol-LLM: Generalist Molecular LLM with Improved Graph Utilization**|Chanhui Lee et.al.|[2502.02810v1](http://arxiv.org/abs/2502.02810v1)|null|
+|**2025-02-05**|**Leveraging the true depth of LLMs**|Ramón Calvo González et.al.|[2502.02790v1](http://arxiv.org/abs/2502.02790v1)|null|
+|**2025-02-04**|**Modular Training of Neural Networks aids Interpretability**|Satvik Golechha et.al.|[2502.02470v2](http://arxiv.org/abs/2502.02470v2)|null|
+|**2025-02-04**|**Premise-Augmented Reasoning Chains Improve Error Identification in Math reasoning with LLMs**|Sagnik Mukherjee et.al.|[2502.02362v3](http://arxiv.org/abs/2502.02362v3)|null|
+|**2025-02-04**|**AdaptBot: Combining LLM with Knowledge Graphs and Human Input for Generic-to-Specific Task Decomposition and Knowledge Refinement**|Shivam Singh et.al.|[2502.02067v1](http://arxiv.org/abs/2502.02067v1)|[link](https://github.com/sssshivvvv/adaptbot)|
+|**2025-02-03**|**On Bob Dylan: A Computational Perspective**|Prashant Garg et.al.|[2502.01772v1](http://arxiv.org/abs/2502.01772v1)|null|
+|**2025-02-03**|**VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos**|Xubin Ren et.al.|[2502.01549v1](http://arxiv.org/abs/2502.01549v1)|null|
+|**2025-02-03**|**Transformers trained on proteins can learn to attend to Euclidean distance**|Isaac Ellmen et.al.|[2502.01533v1](http://arxiv.org/abs/2502.01533v1)|[link](https://github.com/Ellmen/attending-to-distance)|
+|**2025-02-03**|**Common Foundations for SHACL, ShEx, and PG-Schema**|S. Ahmetaj et.al.|[2502.01295v1](http://arxiv.org/abs/2502.01295v1)|null|
+|**2025-02-03**|**GFM-RAG: Graph Foundation Model for Retrieval Augmented Generation**|Linhao Luo et.al.|[2502.01113v1](http://arxiv.org/abs/2502.01113v1)|[link](https://github.com/RManLuo/gfm-rag)|
+|**2025-02-03**|**Knowledge Synthesis of Photosynthesis Research Using a Large Language Model**|Seungri Yoon et.al.|[2502.01059v1](http://arxiv.org/abs/2502.01059v1)|null|
+|**2025-02-03**|**Encrypted Large Model Inference: The Equivariant Encryption Paradigm**|James Buban et.al.|[2502.01013v1](http://arxiv.org/abs/2502.01013v1)|null|
+|**2025-02-02**|**Metastable Dynamics of Chain-of-Thought Reasoning: Provable Benefits of Search, RL and Distillation**|Juno Kim et.al.|[2502.01694v1](http://arxiv.org/abs/2502.01694v1)|null|
+|**2025-02-02**|**PhiP-G: Physics-Guided Text-to-3D Compositional Scene Generation**|Qixuan Li et.al.|[2502.00708v1](http://arxiv.org/abs/2502.00708v1)|null|
+|**2025-02-02**|**A Survey of Quantized Graph Representation Learning: Connecting Graph Structures with Large Language Models**|Qika Lin et.al.|[2502.00681v1](http://arxiv.org/abs/2502.00681v1)|null|
+|**2025-02-01**|**Challenges and Innovations in LLM-Powered Fake News Detection: A Synthesis of Approaches and Future Directions**|Jingyuan Yi et.al.|[2502.00339v1](http://arxiv.org/abs/2502.00339v1)|null|
+|**2025-02-01**|**DEUCE: Dual-diversity Enhancement and Uncertainty-awareness for Cold-start Active Learning**|Jiaxin Guo et.al.|[2502.00305v1](http://arxiv.org/abs/2502.00305v1)|null|
+|**2025-01-31**|**Longer Attention Span: Increasing Transformer Context Length with Sparse Graph Processing Techniques**|Nathaniel Tomczak et.al.|[2502.01659v2](http://arxiv.org/abs/2502.01659v2)|[link](https://github.com/KLab-AI3/Graph-Processing-Attention-IPDPS-2025)|
+|**2025-01-31**|**Improving vision-language alignment with graph spiking hybrid Networks**|Siyu Zhang et.al.|[2501.19069v1](http://arxiv.org/abs/2501.19069v1)|null|
+|**2025-01-30**|**Semantic Web and Creative AI -- A Technical Report from ISWS 2023**|Raia Abu Ahmad et.al.|[2501.18542v1](http://arxiv.org/abs/2501.18542v1)|null|
+|**2025-01-30**|**Leveraging LLM Agents for Automated Optimization Modeling for SASP Problems: A Graph-RAG based Approach**|Tianpeng Pan et.al.|[2501.18320v1](http://arxiv.org/abs/2501.18320v1)|null|
+|**2025-01-30**|**Mixed-Precision Graph Neural Quantization for Low Bit Large Language Models**|Wanlong Liu et.al.|[2501.18154v1](http://arxiv.org/abs/2501.18154v1)|null|
+|**2025-01-30**|**Self-supervised Quantized Representation for Seamlessly Integrating Knowledge Graphs with Large Language Models**|Qika Lin et.al.|[2501.18119v1](http://arxiv.org/abs/2501.18119v1)|null|
+|**2025-01-29**|**Hybrid Graphs for Table-and-Text based Question Answering using LLMs**|Ankush Agarwal et.al.|[2501.17767v1](http://arxiv.org/abs/2501.17767v1)|null|
+|**2025-01-29**|**Query-Aware Learnable Graph Pooling Tokens as Prompt for Large Language Models**|Wooyoung Kim et.al.|[2501.17549v1](http://arxiv.org/abs/2501.17549v1)|null|
+|**2025-01-29**|**General Scene Adaptation for Vision-and-Language Navigation**|Haodong Hong et.al.|[2501.17403v1](http://arxiv.org/abs/2501.17403v1)|[link](https://github.com/honghd16/gsa-vln)|
+|**2025-01-28**|**Comprehensive Evaluation for a Large Scale Knowledge Graph Question Answering Service**|Saloni Potdar et.al.|[2501.17270v1](http://arxiv.org/abs/2501.17270v1)|null|
+|**2025-01-28**|**FactCG: Enhancing Fact Checkers with Graph-Based Multi-Hop Data**|Deren Lei et.al.|[2501.17144v1](http://arxiv.org/abs/2501.17144v1)|[link](https://github.com/derenlei/factcg)|
+|**2025-01-28**|**LLM-AutoDiff: Auto-Differentiate Any LLM Workflow**|Li Yin et.al.|[2501.16673v2](http://arxiv.org/abs/2501.16673v2)|[link](https://github.com/sylphai-inc/adalflow)|
+|**2025-01-27**|**360Brew: A Decoder-only Foundation Model for Personalized Ranking and Recommendation**|Hamed Firooz et.al.|[2501.16450v3](http://arxiv.org/abs/2501.16450v3)|null|
+|**2025-01-27**|**Raiders of the Lost Dependency: Fixing Dependency Conflicts in Python using LLMs**|Antony Bartlett et.al.|[2501.16191v1](http://arxiv.org/abs/2501.16191v1)|null|
+|**2025-01-27**|**Harnessing Diverse Perspectives: A Multi-Agent Framework for Enhanced Error Detection in Knowledge Graphs**|Yu Li et.al.|[2501.15791v1](http://arxiv.org/abs/2501.15791v1)|[link](https://github.com/kse-eleven/makged)|
+|**2025-01-27**|**Automatic Feedback Generation for Short Answer Questions using Answer Diagnostic Graphs**|Momoka Furuhashi et.al.|[2501.15777v1](http://arxiv.org/abs/2501.15777v1)|null|
+|**2025-01-26**|**Transformer-Based Multimodal Knowledge Graph Completion with Link-Aware Contexts**|Haodi Ma et.al.|[2501.15688v1](http://arxiv.org/abs/2501.15688v1)|null|
+|**2025-01-26**|**How to Mitigate Information Loss in Knowledge Graphs for GraphRAG: Leveraging Triple Context Restoration and Query-Driven Feedback**|Manzong Huang et.al.|[2501.15378v1](http://arxiv.org/abs/2501.15378v1)|null|
+|**2025-01-24**|**Explaining Categorical Feature Interactions Using Graph Covariance and LLMs**|Cencheng Shen et.al.|[2501.14932v1](http://arxiv.org/abs/2501.14932v1)|null|
+|**2025-01-24**|**Causal Graphs Meet Thoughts: Enhancing Complex Reasoning in Graph-Augmented LLMs**|Hang Luo et.al.|[2501.14892v1](http://arxiv.org/abs/2501.14892v1)|null|
+|**2025-01-24**|**GraPPI: A Retrieve-Divide-Solve GraphRAG Framework for Large-scale Protein-protein Interaction Exploration**|Ziwen Li et.al.|[2501.16382v1](http://arxiv.org/abs/2501.16382v1)|[link](https://github.com/aaronli43/grappi)|
+|**2025-01-24**|**Evaluating and Improving Graph to Text Generation with Large Language Models**|Jie He et.al.|[2501.14497v1](http://arxiv.org/abs/2501.14497v1)|[link](https://github.com/probe2/kg_text)|
+|**2025-01-24**|**Fast Think-on-Graph: Wider, Deeper and Faster Reasoning of Large Language Model on Knowledge Graph**|Xujian Liang et.al.|[2501.14300v1](http://arxiv.org/abs/2501.14300v1)|[link](https://github.com/dosonleung/fasttog)|
+|**2025-01-24**|**Top Ten Challenges Towards Agentic Neural Graph Databases**|Jiaxin Bai et.al.|[2501.14224v1](http://arxiv.org/abs/2501.14224v1)|null|
+|**2025-01-23**|**GraphRAG under Fire**|Jiacheng Liang et.al.|[2501.14050v1](http://arxiv.org/abs/2501.14050v1)|null|
+|**2025-01-23**|**EICopilot: Search and Explore Enterprise Information over Large-scale Knowledge Graphs with LLM-driven Agents**|Yuhui Yun et.al.|[2501.13746v1](http://arxiv.org/abs/2501.13746v1)|null|
+|**2025-01-23**|**Pseudocode-Injection Magic: Enabling LLMs to Tackle Graph Computational Tasks**|Chang Gong et.al.|[2501.13731v1](http://arxiv.org/abs/2501.13731v1)|null|
+|**2025-01-23**|**CAPRAG: A Large Language Model Solution for Customer Service and Automatic Reporting using Vector and Graph Retrieval-Augmented Generation**|Hamza Landolsi et.al.|[2501.13993v1](http://arxiv.org/abs/2501.13993v1)|null|
+|**2025-01-23**|**Dual-Branch HNSW Approach with Skip Bridges and LID-Driven Optimization**|Hy Nguyen et.al.|[2501.13992v1](http://arxiv.org/abs/2501.13992v1)|null|
+|**2025-01-23**|**Comprehensive Modeling and Question Answering of Cancer Clinical Practice Guidelines using LLMs**|Bhumika Gupta et.al.|[2501.13984v1](http://arxiv.org/abs/2501.13984v1)|null|
+|**2025-01-21**|**LLM-Assisted Knowledge Graph Completion for Curriculum and Domain Modelling in Personalized Higher Education Recommendations**|Hasan Abu-Rasheed et.al.|[2501.12300v1](http://arxiv.org/abs/2501.12300v1)|null|
+|**2025-01-21**|**Divide-Then-Aggregate: An Efficient Tool Learning Method via Parallel Tool Invocation**|Dongsheng Zhu et.al.|[2501.12432v1](http://arxiv.org/abs/2501.12432v1)|null|
+|**2025-01-21**|**InsTALL: Context-aware Instructional Task Assistance with Multi-modal Large Language Models**|Pha Nguyen et.al.|[2501.12231v1](http://arxiv.org/abs/2501.12231v1)|null|
+|**2025-01-21**|**Leveraging Graph Structures and Large Language Models for End-to-End Synthetic Task-Oriented Dialogues**|Maya Medjad et.al.|[2501.11977v1](http://arxiv.org/abs/2501.11977v1)|[link](https://github.com/reecall/graphtod)|
+|**2025-01-21**|**Bridging Visualization and Optimization: Multimodal Large Language Models on Graph-Structured Combinatorial Optimization**|Jie Zhao et.al.|[2501.11968v1](http://arxiv.org/abs/2501.11968v1)|null|
+|**2025-01-21**|**A Survey of Graph Retrieval-Augmented Generation for Customized Large Language Models**|Qinggang Zhang et.al.|[2501.13958v1](http://arxiv.org/abs/2501.13958v1)|[link](https://github.com/deep-polyu/awesome-graphrag)|
+|**2025-01-21**|**Network-informed Prompt Engineering against Organized Astroturf Campaigns under Extreme Class Imbalance**|Nikos Kanakaris et.al.|[2501.11849v2](http://arxiv.org/abs/2501.11849v2)|[link](https://github.com/nkanak/brag-fake-news-campaigns)|
+|**2025-01-21**|**Large Language Models Meet Graph Neural Networks for Text-Numeric Graph Reasoning**|Haoran Song et.al.|[2501.16361v1](http://arxiv.org/abs/2501.16361v1)|null|
+|**2025-01-20**|**Zep: A Temporal Knowledge Graph Architecture for Agent Memory**|Preston Rasmussen et.al.|[2501.13956v1](http://arxiv.org/abs/2501.13956v1)|[link](https://github.com/getzep/graphiti)|
+|**2025-01-20**|**Explainable Lane Change Prediction for Near-Crash Scenarios Using Knowledge Graph Embeddings and Retrieval Augmented Generation**|M. Manzour et.al.|[2501.11560v1](http://arxiv.org/abs/2501.11560v1)|null|
+|**2025-01-20**|**Each Graph is a New Language: Graph Learning with LLMs**|Huachi Zhou et.al.|[2501.11478v2](http://arxiv.org/abs/2501.11478v2)|null|
+|**2025-01-20**|**Few-shot Policy (de)composition in Conversational Question Answering**|Kyle Erwin et.al.|[2501.11335v1](http://arxiv.org/abs/2501.11335v1)|null|
+|**2025-01-20**|**Reasoning Language Models: A Blueprint**|Maciej Besta et.al.|[2501.11223v3](http://arxiv.org/abs/2501.11223v3)|[link](https://github.com/spcl/x1)|
+|**2025-01-19**|**IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems**|Elad Levi et.al.|[2501.11067v1](http://arxiv.org/abs/2501.11067v1)|[link](https://github.com/plurai-ai/intellagent)|
 
-##### **A Review of Deep Learning Approaches for Non-Invasive Cognitive Impairment Detection**
-2410.19898v1 by Muath Alsuhaibani, Ali Pourramezan Fard, Jian Sun, Farida Far Poor, Peter S. Pressman, Mohammad H. Mahoor
+#### Abstracts
+##### **Visual Graph Question Answering with ASP and LLMs for Language Parsing**
+2502.09211v1 by Jakob Johannes Bauer, Thomas Eiter, Nelson Higuera Ruiz, Johannes Oetsch
 
-This review paper explores recent advances in deep learning approaches for
-non-invasive cognitive impairment detection. We examine various non-invasive
-indicators of cognitive decline, including speech and language, facial, and
-motoric mobility. The paper provides an overview of relevant datasets,
-feature-extracting techniques, and deep-learning architectures applied to this
-domain. We have analyzed the performance of different methods across modalities
-and observed that speech and language-based methods generally achieved the
-highest detection performance. Studies combining acoustic and linguistic
-features tended to outperform those using a single modality. Facial analysis
-methods showed promise for visual modalities but were less extensively studied.
-Most papers focused on binary classification (impaired vs. non-impaired), with
-fewer addressing multi-class or regression tasks. Transfer learning and
-pre-trained language models emerged as popular and effective techniques,
-especially for linguistic analysis. Despite significant progress, several
-challenges remain, including data standardization and accessibility, model
-explainability, longitudinal analysis limitations, and clinical adaptation.
-Lastly, we propose future research directions, such as investigating
-language-agnostic speech analysis methods, developing multi-modal diagnostic
-systems, and addressing ethical considerations in AI-assisted healthcare. By
-synthesizing current trends and identifying key obstacles, this review aims to
-guide further development of deep learning-based cognitive impairment detection
-systems to improve early diagnosis and ultimately patient outcomes.
+Visual Question Answering (VQA) is a challenging problem that requires to
+process multimodal input. Answer-Set Programming (ASP) has shown great
+potential in this regard to add interpretability and explainability to modular
+VQA architectures. In this work, we address the problem of how to integrate ASP
+with modules for vision and natural language processing to solve a new and
+demanding VQA variant that is concerned with images of graphs (not graphs in
+symbolic form). Images containing graph-based structures are an ubiquitous and
+popular form of visualisation. Here, we deal with the particular problem of
+graphs inspired by transit networks, and we introduce a novel dataset that
+amends an existing one by adding images of graphs that resemble metro lines.
+Our modular neuro-symbolic approach combines optical graph recognition for
+graph parsing, a pretrained optical character recognition neural network for
+parsing labels, Large Language Models (LLMs) for language processing, and ASP
+for reasoning. This method serves as a first baseline and achieves an overall
+average accuracy of 73% on the dataset. Our evaluation provides further
+evidence of the potential of modular neuro-symbolic systems, in particular with
+pretrained models that do not involve any further training and logic
+programming for reasoning, to solve complex VQA tasks.
 
-摘要：本篇評論探討了深度學習方法在非侵入式認知功能障礙檢測上的最新進展。我們檢視了各種非侵入式的認知衰退指標，包括語言和語言、面部和運動機能。本文概述了與此領域相關的資料集、特徵提取技術和深度學習架構。我們分析了不同方法在不同方式上的表現，並觀察到基於語言和語言的方法通常能達到最高的檢測表現。結合聲學和語言特徵的研究往往優於使用單一方式的研究。面部分析方法顯示出視覺方式的潛力，但研究較少。大多數論文專注於二元分類（受損與未受損），較少探討多類或回歸任務。遷移學習和預訓練語言模型已成為流行且有效的技術，特別是對於語言分析。儘管取得了重大進展，但仍存在一些挑戰，包括資料標準化和可及性、模型可解釋性、縱向分析限制和臨床適應性。最後，我們提出了未來的研究方向，例如調查與語言無關的語音分析方法、開發多模式診斷系統，以及解決人工智慧輔助醫療保健中的倫理考量。透過綜合目前的趨勢和找出關鍵障礙，本篇評論旨在引導深度學習為基礎的認知功能障礙檢測系統的進一步發展，以改善早期診斷，並最終改善患者的治療結果。
+摘要：視覺問答（VQA）是一項具有挑戰性的問題，需要處理多模態輸入。答案集程式設計（ASP）在這方面顯示出巨大的潛力，可以為模組化 VQA 架構增加可解釋性和說明性。在這項工作中，我們探討如何將 ASP 與視覺和自然語言處理模組整合，以解決一個新的且要求嚴格的 VQA 變體，該變體與圖形影像（而非符號形式的圖形）有關。包含圖形結構的影像是一種普遍且流行的可視化形式。在這裡，我們處理受交通網路啟發的圖形特定問題，並引入一個新的資料集，透過新增類似地鐵路線的圖形影像來修正現有資料集。我們的模組化神經符號方法結合光學圖形辨識進行圖形解析、預先訓練的光學字元辨識神經網路進行標籤解析、大型語言模型（LLM）進行語言處理，以及 ASP 進行推理。此方法作為第一個基準，在資料集上達到 73% 的整體平均準確度。我們的評估進一步證明了模組化神經符號系統的潛力，特別是預先訓練的模型，這些模型不涉及任何進一步的訓練和邏輯程式設計進行推理，以解決複雜的 VQA 任務。
 
-##### **An Ontology-Enabled Approach For User-Centered and Knowledge-Enabled Explanations of AI Systems**
-2410.17504v1 by Shruthi Chari
+##### **Representation Learning to Advance Multi-institutional Studies with Electronic Health Record Data**
+2502.08547v1 by Doudou Zhou, Han Tong, Linshanshan Wang, Suqi Liu, Xin Xiong, Ziming Gan, Romain Griffier, Boris Hejblum, Yun-Chung Liu, Chuan Hong, Clara-Lea Bonzel, Tianrun Cai, Kevin Pan, Yuk-Lam Ho, Lauren Costa, Vidul A. Panickan, J. Michael Gaziano, Kenneth Mandl, Vianney Jouhet, Rodolphe Thiebaut, Zongqi Xia, Kelly Cho, Katherine Liao, Tianxi Cai
 
-Explainable Artificial Intelligence (AI) focuses on helping humans understand
-the working of AI systems or their decisions and has been a cornerstone of AI
-for decades. Recent research in explainability has focused on explaining the
-workings of AI models or model explainability. There have also been several
-position statements and review papers detailing the needs of end-users for
-user-centered explainability but fewer implementations. Hence, this thesis
-seeks to bridge some gaps between model and user-centered explainability. We
-create an explanation ontology (EO) to represent literature-derived explanation
-types via their supporting components. We implement a knowledge-augmented
-question-answering (QA) pipeline to support contextual explanations in a
-clinical setting. Finally, we are implementing a system to combine explanations
-from different AI methods and data modalities. Within the EO, we can represent
-fifteen different explanation types, and we have tested these representations
-in six exemplar use cases. We find that knowledge augmentations improve the
-performance of base large language models in the contextualized QA, and the
-performance is variable across disease groups. In the same setting, clinicians
-also indicated that they prefer to see actionability as one of the main foci in
-explanations. In our explanations combination method, we plan to use similarity
-metrics to determine the similarity of explanations in a chronic disease
-detection setting. Overall, through this thesis, we design methods that can
-support knowledge-enabled explanations across different use cases, accounting
-for the methods in today's AI era that can generate the supporting components
-of these explanations and domain knowledge sources that can enhance them.
+The adoption of EHRs has expanded opportunities to leverage data-driven
+algorithms in clinical care and research. A major bottleneck in effectively
+conducting multi-institutional EHR studies is the data heterogeneity across
+systems with numerous codes that either do not exist or represent different
+clinical concepts across institutions. The need for data privacy further limits
+the feasibility of including multi-institutional patient-level data required to
+study similarities and differences across patient subgroups. To address these
+challenges, we developed the GAME algorithm. Tested and validated across 7
+institutions and 2 languages, GAME integrates data in several levels: (1) at
+the institutional level with knowledge graphs to establish relationships
+between codes and existing knowledge sources, providing the medical context for
+standard codes and their relationship to each other; (2) between institutions,
+leveraging language models to determine the relationships between
+institution-specific codes with established standard codes; and (3) quantifying
+the strength of the relationships between codes using a graph attention
+network. Jointly trained embeddings are created using transfer and federated
+learning to preserve data privacy. In this study, we demonstrate the
+applicability of GAME in selecting relevant features as inputs for AI-driven
+algorithms in a range of conditions, e.g., heart failure, rheumatoid arthritis.
+We then highlight the application of GAME harmonized multi-institutional EHR
+data in a study of Alzheimer's disease outcomes and suicide risk among patients
+with mental health disorders, without sharing patient-level data outside
+individual institutions.
 
-摘要：可解釋人工智慧（AI）專注於協助人類了解 AI 系統運作或其決策，數十年來一直是 AI 的基石。最近的可解釋性研究專注於解釋 AI 模型或模型可解釋性的運作。也有幾份立場聲明和評論論文詳細說明了最終使用者對以使用者為中心的可解釋性的需求，但實作較少。因此，本論文旨在彌補模型和以使用者為中心的可解釋性之間的一些差距。我們建立一個解釋本體（EO）以透過其支援元件來表示從文獻中衍生的解釋類型。我們實作一個知識增強的問答（QA）管線，以在臨床環境中支援情境解釋。最後，我們正在實作一個系統，以結合來自不同 AI 方法和資料模式的解釋。在 EO 中，我們可以表示 15 種不同的解釋類型，並且我們已在六個範例使用案例中測試這些表示。我們發現，知識增強改善了基礎大型語言模型在情境化 QA 中的效能，並且效能因疾病群組而異。在相同的環境中，臨床醫生也表示他們希望將可操作性視為解釋中的主要焦點之一。在我們的解釋組合方法中，我們計畫使用相似性指標來確定慢性病偵測環境中解釋的相似性。總體而言，透過本論文，我們設計了可以在不同使用案例中支援知識啟用解釋的方法，考量到當今 AI 時代中可以產生這些解釋的支援元件和可以增強這些解釋的領域知識來源的方法。
+摘要：電子健康紀錄的採用擴大了在臨床照護和研究中利用資料驅動演算法的機會。在有效進行多機構電子健康紀錄研究時，一個主要的瓶頸是系統間資料異質性，其中有許多代碼在機構間不存在或表示不同的臨床概念。資料隱私的需求進一步限制了納入多機構患者層級資料的可行性，而這些資料對於研究患者亞群之間的相似性和差異性是必要的。為了應對這些挑戰，我們開發了 GAME 演算法。GAME 已在 7 個機構和 2 種語言中進行測試和驗證，它整合了多個層級的資料：(1) 在機構層級，使用知識圖表來建立代碼和現有知識來源之間的關係，為標準代碼及其彼此之間的關係提供醫療背景；(2) 在機構之間，利用語言模型來確定機構特定代碼與已建立的標準代碼之間的關係；(3) 使用圖形注意網路量化代碼之間關係的強度。使用遷移和聯合學習建立聯合訓練的嵌入，以保護資料隱私。在本研究中，我們展示了 GAME 在選擇相關特徵作為 AI 驅動演算法輸入時的適用性，適用於各種情況，例如心臟衰竭、類風濕性關節炎。然後，我們重點介紹了 GAME 和諧化多機構電子健康紀錄資料在阿茲海默症疾病結果和精神疾病患者自殺風險研究中的應用，而無需在個別機構之外共享患者層級資料。
 
-##### **Contrasting Attitudes Towards Current and Future AI Applications for Computerised Interpretation of ECG: A Clinical Stakeholder Interview Study**
-2410.16879v1 by Lukas Hughes-Noehrer, Leda Channer, Gabriel Strain, Gregory Yates, Richard Body, Caroline Jay
+##### **Trustworthy GNNs with LLMs: A Systematic Review and Taxonomy**
+2502.08353v1 by Ruizhan Xue, Huimin Deng, Fang He, Maojun Wang, Zeyu Zhang
 
-Objectives: To investigate clinicians' attitudes towards current automated
-interpretation of ECG and novel AI technologies and their perception of
-computer-assisted interpretation. Materials and Methods: We conducted a series
-of interviews with clinicians in the UK. Our study: (i) explores the potential
-for AI, specifically future 'human-like' computing approaches, to facilitate
-ECG interpretation and support clinical decision making, and (ii) elicits their
-opinions about the importance of explainability and trustworthiness of AI
-algorithms. Results: We performed inductive thematic analysis on interview
-transcriptions from 23 clinicians and identified the following themes: (i) a
-lack of trust in current systems, (ii) positive attitudes towards future AI
-applications and requirements for these, (iii) the relationship between the
-accuracy and explainability of algorithms, and (iv) opinions on education,
-possible deskilling, and the impact of AI on clinical competencies. Discussion:
-Clinicians do not trust current computerised methods, but welcome future 'AI'
-technologies. Where clinicians trust future AI interpretation to be accurate,
-they are less concerned that it is explainable. They also preferred ECG
-interpretation that demonstrated the results of the algorithm visually. Whilst
-clinicians do not fear job losses, they are concerned about deskilling and the
-need to educate the workforce to use AI responsibly. Conclusion: Clinicians are
-positive about the future application of AI in clinical decision-making.
-Accuracy is a key factor of uptake and visualisations are preferred over
-current computerised methods. This is viewed as a potential means of training
-and upskilling, in contrast to the deskilling that automation might be
-perceived to bring.
+With the extensive application of Graph Neural Networks (GNNs) across various
+domains, their trustworthiness has emerged as a focal point of research. Some
+existing studies have shown that the integration of large language models
+(LLMs) can improve the semantic understanding and generation capabilities of
+GNNs, which in turn improves the trustworthiness of GNNs from various aspects.
+Our review introduces a taxonomy that offers researchers a clear framework for
+comprehending the principles and applications of different methods and helps
+clarify the connections and differences among various approaches. Then we
+systematically survey representative approaches along the four categories of
+our taxonomy. Through our taxonomy, researchers can understand the applicable
+scenarios, potential advantages, and limitations of each approach for the the
+trusted integration of GNNs with LLMs. Finally, we present some promising
+directions of work and future trends for the integration of LLMs and GNNs to
+improve model trustworthiness.
 
-摘要：<paragraph>目的：調查臨床醫生對目前自動化心電圖解讀和新的人工智慧技術的態度，以及他們對電腦輔助解讀的看法。材料和方法：我們對英國的臨床醫生進行了一系列訪談。我們的研究：(i) 探討人工智慧的潛力，特別是未來的「類人類」運算方法，以促進心電圖解讀並支持臨床決策制定，以及 (ii) 徵求他們對人工智慧演算法的可解釋性和可信度的看法。結果：我們對 23 位臨床醫生的訪談記錄進行了歸納主題分析，並找出以下主題：(i) 對目前系統缺乏信任，(ii) 對未來人工智慧應用和對這些應用的要求持正面態度，(iii) 演算法的準確性和可解釋性之間的關係，以及 (iv) 對教育、可能的技能退化，以及人工智慧對臨床能力的影響的看法。討論：臨床醫生不信任目前的電腦化方法，但歡迎未來的「人工智慧」技術。在臨床醫生相信未來的 AI 解讀準確的情況下，他們不太擔心它是否可解釋。他們也比較喜歡能以視覺方式呈現演算法結果的心電圖解讀。雖然臨床醫生不害怕失業，但他們擔心技能退化，以及需要教育員工負責任地使用人工智慧。結論：臨床醫生對人工智慧在臨床決策制定中的未來應用持正面態度。準確性是採用人工智慧的一個關鍵因素，而視覺化比目前的電腦化方法更受青睞。這被視為一種潛在的培訓和提升技能的方法，與自動化可能帶來的技能退化形成對比。</paragraph>
+摘要：隨著圖神經網路 (GNN) 在各種領域的廣泛應用，其可信度已成為研究的焦點。一些現有研究表明，整合大型語言模型 (LLM) 可以提升 GNN 的語意理解和生成能力，進而從各方面提升 GNN 的可信度。我們的評論介紹了一種分類法，為研究人員提供了一個清晰的架構，用於理解不同方法的原理和應用，並有助於釐清各種方法之間的關聯和差異。然後，我們系統性地針對分類法的四個類別進行代表性方法的調查。研究人員透過我們的分類法，可以了解每種方法在 GNN 與 LLM 的可信整合中適用的場景、潛在優點和限制。最後，我們提出 LLM 與 GNN 整合的一些有前景的工作方向和未來趨勢，以提升模型的可信度。
 
-##### **Pathologist-like explainable AI for interpretable Gleason grading in prostate cancer**
-2410.15012v1 by Gesa Mittmann, Sara Laiouar-Pedari, Hendrik A. Mehrtens, Sarah Haggenmüller, Tabea-Clara Bucher, Tirtha Chanda, Nadine T. Gaisa, Mathias Wagner, Gilbert Georg Klamminger, Tilman T. Rau, Christina Neppl, Eva Maria Compérat, Andreas Gocht, Monika Hämmerle, Niels J. Rupp, Jula Westhoff, Irene Krücken, Maximillian Seidl, Christian M. Schürch, Marcus Bauer, Wiebke Solass, Yu Chun Tam, Florian Weber, Rainer Grobholz, Jaroslaw Augustyniak, Thomas Kalinski, Christian Hörner, Kirsten D. Mertz, Constanze Döring, Andreas Erbersdobler, Gabriele Deubler, Felix Bremmer, Ulrich Sommer, Michael Brodhun, Jon Griffin, Maria Sarah L. Lenon, Kiril Trpkov, Liang Cheng, Fei Chen, Angelique Levi, Guoping Cai, Tri Q. Nguyen, Ali Amin, Alessia Cimadamore, Ahmed Shabaik, Varsha Manucha, Nazeel Ahmad, Nidia Messias, Francesca Sanguedolce, Diana Taheri, Ezra Baraban, Liwei Jia, Rajal B. Shah, Farshid Siadat, Nicole Swarbrick, Kyung Park, Oudai Hassan, Siamak Sakhaie, Michelle R. Downes, Hiroshi Miyamoto, Sean R. Williamson, Tim Holland-Letz, Carolin V. Schneider, Jakob Nikolas Kather, Yuri Tolkach, Titus J. Brinker
+##### **Graph Foundation Models for Recommendation: A Comprehensive Survey**
+2502.08346v1 by Bin Wu, Yihang Wang, Yuanhao Zeng, Jiawei Liu, Jiashu Zhao, Cheng Yang, Yawen Li, Long Xia, Dawei Yin, Chuan Shi
 
-The aggressiveness of prostate cancer, the most common cancer in men
-worldwide, is primarily assessed based on histopathological data using the
-Gleason scoring system. While artificial intelligence (AI) has shown promise in
-accurately predicting Gleason scores, these predictions often lack inherent
-explainability, potentially leading to distrust in human-machine interactions.
-To address this issue, we introduce a novel dataset of 1,015 tissue microarray
-core images, annotated by an international group of 54 pathologists. The
-annotations provide detailed localized pattern descriptions for Gleason grading
-in line with international guidelines. Utilizing this dataset, we develop an
-inherently explainable AI system based on a U-Net architecture that provides
-predictions leveraging pathologists' terminology. This approach circumvents
-post-hoc explainability methods while maintaining or exceeding the performance
-of methods trained directly for Gleason pattern segmentation (Dice score: 0.713
-$\pm$ 0.003 trained on explanations vs. 0.691 $\pm$ 0.010 trained on Gleason
-patterns). By employing soft labels during training, we capture the intrinsic
-uncertainty in the data, yielding strong results in Gleason pattern
-segmentation even in the context of high interobserver variability. With the
-release of this dataset, we aim to encourage further research into segmentation
-in medical tasks with high levels of subjectivity and to advance the
-understanding of pathologists' reasoning processes.
+Recommender systems (RS) serve as a fundamental tool for navigating the vast
+expanse of online information, with deep learning advancements playing an
+increasingly important role in improving ranking accuracy. Among these, graph
+neural networks (GNNs) excel at extracting higher-order structural information,
+while large language models (LLMs) are designed to process and comprehend
+natural language, making both approaches highly effective and widely adopted.
+Recent research has focused on graph foundation models (GFMs), which integrate
+the strengths of GNNs and LLMs to model complex RS problems more efficiently by
+leveraging the graph-based structure of user-item relationships alongside
+textual understanding. In this survey, we provide a comprehensive overview of
+GFM-based RS technologies by introducing a clear taxonomy of current
+approaches, diving into methodological details, and highlighting key challenges
+and future directions. By synthesizing recent advancements, we aim to offer
+valuable insights into the evolving landscape of GFM-based recommender systems.
 
-摘要：前列腺癌是全球男性最常見的癌症，其惡性程度主要根據 Gleason 評分系統使用組織病理學數據進行評估。雖然人工智慧 (AI) 在準確預測 Gleason 評分方面已展現潛力，但這些預測通常缺乏內在的可解釋性，可能會導致對人機互動的不信任。為了解決這個問題，我們引進了一個由 54 位病理學家組成的國際團隊註解的 1,015 個組織微陣列核心影像的新穎資料集。這些註解提供了詳細的局部模式描述，用於符合國際準則的 Gleason 分級。利用這個資料集，我們開發了一個基於 U-Net 架構的內在可解釋 AI 系統，該系統提供了利用病理學家術語進行預測。這種方法規避了事後可解釋性方法，同時維持或超越了直接訓練用於 Gleason 模式分割的方法的效能（Dice 分數：0.713 ± 0.003，訓練於解釋，相對於 0.691 ± 0.010，訓練於 Gleason 模式）。透過在訓練期間採用軟標籤，我們捕捉了資料中的內在不確定性，即使在觀察者間變異性高的情況下，也能在 Gleason 模式分割中產生強大的結果。透過釋出這個資料集，我們旨在鼓勵進一步研究主觀性高的醫療任務中的分割，並增進對病理學家推理過程的理解。
+摘要：推薦系統 (RS) 是導航廣闊線上資訊的基本工具，深度學習的進展在提升排名準確度方面扮演著日益重要的角色。在這些進展中，圖形神經網路 (GNN) 擅長萃取高階結構資訊，而大型語言模型 (LLM) 則設計用於處理和理解自然語言，這兩種方法都非常有效且廣泛採用。最近的研究專注於圖形基礎模型 (GFM)，它整合了 GNN 和 LLM 的優點，透過利用使用者與項目關係的圖形化結構，以及文字理解，更有效率地建構複雜的 RS 問題。在這項調查中，我們提供 GFM-based RS 技術的全面概觀，介紹當前方法的明確分類法，深入探討方法論的細節，並強調關鍵挑戰和未來方向。透過綜合最近的進展，我們旨在提供有價值的見解，了解 GFM-based 推薦系統不斷演變的樣貌。
 
-##### **Explainable AI Methods for Multi-Omics Analysis: A Survey**
-2410.11910v1 by Ahmad Hussein, Mukesh Prasad, Ali Braytee
+##### **Self-Evaluation for Job-Shop Scheduling**
+2502.08684v1 by Imanol Echeverria, Maialen Murua, Roberto Santana
 
-Advancements in high-throughput technologies have led to a shift from
-traditional hypothesis-driven methodologies to data-driven approaches.
-Multi-omics refers to the integrative analysis of data derived from multiple
-'omes', such as genomics, proteomics, transcriptomics, metabolomics, and
-microbiomics. This approach enables a comprehensive understanding of biological
-systems by capturing different layers of biological information. Deep learning
-methods are increasingly utilized to integrate multi-omics data, offering
-insights into molecular interactions and enhancing research into complex
-diseases. However, these models, with their numerous interconnected layers and
-nonlinear relationships, often function as black boxes, lacking transparency in
-decision-making processes. To overcome this challenge, explainable artificial
-intelligence (xAI) methods are crucial for creating transparent models that
-allow clinicians to interpret and work with complex data more effectively. This
-review explores how xAI can improve the interpretability of deep learning
-models in multi-omics research, highlighting its potential to provide
-clinicians with clear insights, thereby facilitating the effective application
-of such models in clinical settings.
+Combinatorial optimization problems, such as scheduling and route planning,
+are crucial in various industries but are computationally intractable due to
+their NP-hard nature. Neural Combinatorial Optimization methods leverage
+machine learning to address these challenges but often depend on sequential
+decision-making, which is prone to error accumulation as small mistakes
+propagate throughout the process. Inspired by self-evaluation techniques in
+Large Language Models, we propose a novel framework that generates and
+evaluates subsets of assignments, moving beyond traditional stepwise
+approaches. Applied to the Job-Shop Scheduling Problem, our method integrates a
+heterogeneous graph neural network with a Transformer to build a policy model
+and a self-evaluation function. Experimental validation on challenging,
+well-known benchmarks demonstrates the effectiveness of our approach,
+surpassing state-of-the-art methods.
 
-摘要：高通量技術的進步導致從傳統的假設驅動方法轉變為資料驅動的方法。多組學是指整合分析來自多個「組學」的資料，例如基因組學、蛋白質組學、轉錄組學、代謝組學和微生物組學。此方法透過擷取生物資訊的不同層面，能全面了解生物系統。深度學習方法愈來愈常被用於整合多組學資料，提供分子交互作用的洞察力，並加強對複雜疾病的研究。然而，這些模型具有許多相互連接的層級和非線性關係，通常會像黑盒子一樣運作，缺乏決策過程的透明度。為了克服此挑戰，可解釋人工智慧 (xAI) 方法對於建立透明模型至關重要，讓臨床醫生可以更有效地解釋和處理複雜資料。此評論探討 xAI 如何能改善多組學研究中深度學習模型的可解釋性，強調其提供臨床醫生明確見解的潛力，進而促進此類模型在臨床環境中的有效應用。
+摘要：組合優化問題，例如排程和路線規劃，在各行各業中至關重要，但由於它們的 NP 難度，在計算上難以處理。神經組合優化方法利用機器學習來解決這些挑戰，但通常依賴於序貫決策制定，而序貫決策制定容易發生錯誤累積，因為小錯誤會在整個過程中傳播。受大型語言模型中的自我評估技術啟發，我們提出了一個新的框架，可生成和評估作業子集，超越傳統的分步方法。應用於工作車間排程問題，我們的方法將異質圖神經網路與 Transformer 整合在一起，以建立策略模型和自我評估函數。在具有挑戰性的著名基準上的實驗驗證證明了我們方法的有效性，超越了最先進的方法。
 
-##### **Study on the Helpfulness of Explainable Artificial Intelligence**
-2410.11896v1 by Tobias Labarta, Elizaveta Kulicheva, Ronja Froelian, Christian Geißler, Xenia Melman, Julian von Klitzing
+##### **Improving Existing Optimization Algorithms with LLMs**
+2502.08298v1 by Camilo Chacón Sartori, Christian Blum
 
-Explainable Artificial Intelligence (XAI) is essential for building advanced
-machine learning-powered applications, especially in critical domains such as
-medical diagnostics or autonomous driving. Legal, business, and ethical
-requirements motivate using effective XAI, but the increasing number of
-different methods makes it challenging to pick the right ones. Further, as
-explanations are highly context-dependent, measuring the effectiveness of XAI
-methods without users can only reveal a limited amount of information,
-excluding human factors such as the ability to understand it. We propose to
-evaluate XAI methods via the user's ability to successfully perform a proxy
-task, designed such that a good performance is an indicator for the explanation
-to provide helpful information. In other words, we address the helpfulness of
-XAI for human decision-making. Further, a user study on state-of-the-art
-methods was conducted, showing differences in their ability to generate trust
-and skepticism and the ability to judge the rightfulness of an AI decision
-correctly. Based on the results, we highly recommend using and extending this
-approach for more objective-based human-centered user studies to measure XAI
-performance in an end-to-end fashion.
+The integration of Large Language Models (LLMs) into optimization has created
+a powerful synergy, opening exciting research opportunities. This paper
+investigates how LLMs can enhance existing optimization algorithms. Using their
+pre-trained knowledge, we demonstrate their ability to propose innovative
+heuristic variations and implementation strategies. To evaluate this, we
+applied a non-trivial optimization algorithm, Construct, Merge, Solve and Adapt
+(CMSA) -- a hybrid metaheuristic for combinatorial optimization problems that
+incorporates a heuristic in the solution construction phase. Our results show
+that an alternative heuristic proposed by GPT-4o outperforms the
+expert-designed heuristic of CMSA, with the performance gap widening on larger
+and denser graphs. Project URL: https://imp-opt-algo-llms.surge.sh/
 
-摘要：可解釋人工智慧 (XAI) 對於建構先進的機器學習驅動應用程式至關重要，特別是在醫療診斷或自動駕駛等關鍵領域。法律、商業和倫理要求促使使用有效的 XAI，但數量日益增加的不同方法使得挑選正確的方法具有挑戰性。此外，由於解釋高度依賴於背景，在沒有使用者的情況下衡量 XAI 方法的有效性只能揭示有限的資訊，排除人類因素，例如理解它的能力。我們建議透過使用者成功執行代理任務的能力來評估 XAI 方法，設計使得良好的執行表現是解釋提供有用資訊的指標。換句話說，我們探討 XAI 對人類決策制定的幫助。此外，對最先進的方法進行使用者研究，顯示出它們在產生信任和懷疑的能力以及正確判斷 AI 決策是否正確的能力方面存在差異。根據結果，我們強烈建議使用和擴充這種方法，以進行更多以目標為基礎的人為中心使用者研究，以終端到終端的方式衡量 XAI 效能。
+摘要：大型语言模型 (LLM) 与优化相结合，创造了一种强大的协同作用，开启了令人兴奋的研究机会。本文探讨了 LLM 如何增强现有的优化算法。利用其预先训练的知识，我们展示了它们提出创新启发式变体和实施策略的能力。为了评估这一点，我们应用了一种非平凡的优化算法，构建、合并、求解和适应 (CMSA)——一种用于组合优化问题的混合元启发式算法，它在求解构建阶段纳入了启发式算法。我们的结果表明，GPT-4o 提出的替代启发式算法优于 CMSA 的专家设计的启发式算法，并且随着图形变得更大、更密集，性能差距也在扩大。项目网址：https://imp-opt-algo-llms.surge.sh/
 
-##### **Use of What-if Scenarios to Help Explain Artificial Intelligence Models for Neonatal Health**
-2410.09635v1 by Abdullah Mamun, Lawrence D. Devoe, Mark I. Evans, David W. Britt, Judith Klein-Seetharaman, Hassan Ghasemzadeh
+##### **ACCESS : A Benchmark for Abstract Causal Event Discovery and Reasoning**
+2502.08148v1 by Vy Vo, Lizhen Qu, Tao Feng, Yuncheng Hua, Xiaoxi Kang, Songhai Fan, Tim Dwyer, Lay-Ki Soon, Gholamreza Haffari
 
-Early detection of intrapartum risk enables interventions to potentially
-prevent or mitigate adverse labor outcomes such as cerebral palsy. Currently,
-there is no accurate automated system to predict such events to assist with
-clinical decision-making. To fill this gap, we propose "Artificial Intelligence
-(AI) for Modeling and Explaining Neonatal Health" (AIMEN), a deep learning
-framework that not only predicts adverse labor outcomes from maternal, fetal,
-obstetrical, and intrapartum risk factors but also provides the model's
-reasoning behind the predictions made. The latter can provide insights into
-what modifications in the input variables of the model could have changed the
-predicted outcome. We address the challenges of imbalance and small datasets by
-synthesizing additional training data using Adaptive Synthetic Sampling
-(ADASYN) and Conditional Tabular Generative Adversarial Networks (CTGAN). AIMEN
-uses an ensemble of fully-connected neural networks as the backbone for its
-classification with the data augmentation supported by either ADASYN or CTGAN.
-AIMEN, supported by CTGAN, outperforms AIMEN supported by ADASYN in
-classification. AIMEN can predict a high risk for adverse labor outcomes with
-an average F1 score of 0.784. It also provides counterfactual explanations that
-can be achieved by changing 2 to 3 attributes on average. Resources available:
-https://github.com/ab9mamun/AIMEN.
+Identifying cause-and-effect relationships is critical to understanding
+real-world dynamics and ultimately causal reasoning. Existing methods for
+identifying event causality in NLP, including those based on Large Language
+Models (LLMs), exhibit difficulties in out-of-distribution settings due to the
+limited scale and heavy reliance on lexical cues within available benchmarks.
+Modern benchmarks, inspired by probabilistic causal inference, have attempted
+to construct causal graphs of events as a robust representation of causal
+knowledge, where \texttt{CRAB} \citep{romanou2023crab} is one such recent
+benchmark along this line. In this paper, we introduce \texttt{ACCESS}, a
+benchmark designed for discovery and reasoning over abstract causal events.
+Unlike existing resources, \texttt{ACCESS} focuses on causality of everyday
+life events on the abstraction level. We propose a pipeline for identifying
+abstractions for event generalizations from \texttt{GLUCOSE}
+\citep{mostafazadeh-etal-2020-glucose}, a large-scale dataset of implicit
+commonsense causal knowledge, from which we subsequently extract $1,4$K causal
+pairs. Our experiments highlight the ongoing challenges of using statistical
+methods and/or LLMs for automatic abstraction identification and causal
+discovery in NLP. Nonetheless, we demonstrate that the abstract causal
+knowledge provided in \texttt{ACCESS} can be leveraged for enhancing QA
+reasoning performance in LLMs.
 
-摘要：產程中風險的早期偵測有助於進行干預措施，以預防或減輕不利的生產結果，例如腦性麻痺。目前，沒有準確的自動化系統可以預測此類事件，以協助臨床決策。為了填補這一空白，我們提出「用於建模和解釋新生兒健康的人工智慧」(AIMEN)，這是一個深度學習架構，它不僅可以根據孕產婦、胎兒、產科和產程風險因素預測不利的生產結果，還能提供模型做出預測背後的原因。後者可以提供見解，說明模型輸入變數中的哪些修改可能會改變預測結果。我們透過使用適應性合成抽樣 (ADASYN) 和條件表格生成對抗網路 (CTGAN) 來合成額外的訓練資料，以解決不平衡和小型資料集的挑戰。AIMEN 使用全連接神經網路的集合作為其分類的骨幹，並透過 ADASYN 或 CTGAN 支援資料擴充。由 CTGAN 支援的 AIMEN 在分類方面優於由 ADASYN 支援的 AIMEN。AIMEN 可以預測不利的生產結果的高風險，平均 F1 分數為 0.784。它還提供反事實解釋，可透過平均變更 2 至 3 個屬性來達成。可用資源：https://github.com/ab9mamun/AIMEN。
+摘要：<paragraph>找出因果關係對於理解現實世界的動態和最終的因果推理至關重要。現有的 NLP 事件因果關係識別方法，包括基於大型語言模型 (LLM) 的方法，由於規模有限且過度依賴於可用基準中的詞彙線索，在分佈外環境中表現出困難。受機率因果推論啟發的現代基準已嘗試建構事件的因果圖，作為因果知識的強健表示，其中 \texttt{CRAB} \citep{romanou2023crab} 是這條路徑上最近的一個基準。在本文中，我們介紹 \texttt{ACCESS}，一個專門設計來探索和推理抽象因果事件的基準。與現有資源不同，\texttt{ACCESS} 專注於抽象層面上日常生活事件的因果關係。我們提出一個管道，用於從 \texttt{GLUCOSE} \citep{mostafazadeh-etal-2020-glucose} 找出事件概括的抽象，\texttt{GLUCOSE} 是隱含常識因果知識的大規模資料集，我們隨後從中萃取出 1,4K 因果對。我們的實驗突顯出使用統計方法和/或 LLM 進行 NLP 中的自動抽象識別和因果發現的持續挑戰。儘管如此，我們證明了 \texttt{ACCESS} 中提供的抽象因果知識可用於增強 LLM 中的問答推理效能。</paragraph>
 
-##### **Artificial intelligence techniques in inherited retinal diseases: A review**
-2410.09105v1 by Han Trinh, Jordan Vice, Jason Charng, Zahra Tajbakhsh, Khyber Alam, Fred K. Chen, Ajmal Mian
+##### **GCoT: Chain-of-Thought Prompt Learning for Graphs**
+2502.08092v1 by Xingtong Yu, Chang Zhou, Zhongwei Kuai, Xinming Zhang, Yuan Fang
 
-Inherited retinal diseases (IRDs) are a diverse group of genetic disorders
-that lead to progressive vision loss and are a major cause of blindness in
-working-age adults. The complexity and heterogeneity of IRDs pose significant
-challenges in diagnosis, prognosis, and management. Recent advancements in
-artificial intelligence (AI) offer promising solutions to these challenges.
-However, the rapid development of AI techniques and their varied applications
-have led to fragmented knowledge in this field. This review consolidates
-existing studies, identifies gaps, and provides an overview of AI's potential
-in diagnosing and managing IRDs. It aims to structure pathways for advancing
-clinical applications by exploring AI techniques like machine learning and deep
-learning, particularly in disease detection, progression prediction, and
-personalized treatment planning. Special focus is placed on the effectiveness
-of convolutional neural networks in these areas. Additionally, the integration
-of explainable AI is discussed, emphasizing its importance in clinical settings
-to improve transparency and trust in AI-based systems. The review addresses the
-need to bridge existing gaps in focused studies on AI's role in IRDs, offering
-a structured analysis of current AI techniques and outlining future research
-directions. It concludes with an overview of the challenges and opportunities
-in deploying AI for IRDs, highlighting the need for interdisciplinary
-collaboration and the continuous development of robust, interpretable AI models
-to advance clinical applications.
+Chain-of-thought (CoT) prompting has achieved remarkable success in natural
+language processing (NLP). However, its vast potential remains largely
+unexplored for graphs. This raises an interesting question: How can we design
+CoT prompting for graphs to guide graph models to learn step by step? On one
+hand, unlike natural languages, graphs are non-linear and characterized by
+complex topological structures. On the other hand, many graphs lack textual
+data, making it difficult to formulate language-based CoT prompting. In this
+work, we propose the first CoT prompt learning framework for text-free graphs,
+GCoT. Specifically, we decompose the adaptation process for each downstream
+task into a series of inference steps, with each step consisting of
+prompt-based inference, ``thought'' generation, and thought-conditioned prompt
+learning. While the steps mimic CoT prompting in NLP, the exact mechanism
+differs significantly. Specifically, at each step, an input graph, along with a
+prompt, is first fed into a pre-trained graph encoder for prompt-based
+inference. We then aggregate the hidden layers of the encoder to construct a
+``thought'', which captures the working state of each node in the current step.
+Conditioned on this thought, we learn a prompt specific to each node based on
+the current state. These prompts are fed into the next inference step,
+repeating the cycle. To evaluate and analyze the effectiveness of GCoT, we
+conduct comprehensive experiments on eight public datasets, which demonstrate
+the advantage of our approach.
 
-摘要：遺傳性視網膜疾病 (IRD) 是一組多樣化的遺傳疾病，
-會導致視力逐漸喪失，是工作年齡成人失明的主要原因。IRD 的複雜性和異質性對診斷、預後和管理提出了重大挑戰。最近人工智能 (AI) 的進步為這些挑戰提供了有希望的解決方案。
-然而，AI 技術的快速發展及其多種應用導致了該領域的知識分散。本綜述整合了現有研究，找出差距，並概述了 AI 在診斷和管理 IRD 中的潛力。它旨在通過探索機器學習和深度學習等 AI 技術，特別是在疾病檢測、進程預測和個性化治療計劃中，為推進臨床應用構建途徑。特別關注這些領域中卷積神經網路的有效性。此外，討論了可解釋 AI 的整合，強調了其在臨床環境中提高透明度和對基於 AI 的系統的信任的重要性。該綜述解決了彌合 AI 在 IRD 中作用的重點研究中現有差距的必要性，提供了對當前 AI 技術的結構化分析，並概述了未來的研究方向。最後概述了在 IRD 中部署 AI 的挑戰和機遇，強調了跨學科合作和持續開發強大、可解釋的 AI 模型以推進臨床應用的必要性。
+摘要：<paragraph>鏈式思考 (CoT) 提示在自然語言處理 (NLP) 中取得了顯著的成功。然而，其龐大的潛力在圖形方面仍未得到充分探索。這提出了一個有趣的問題：我們如何設計圖形的 CoT 提示來指導圖形模型逐步學習？一方面，與自然語言不同，圖形是非線性的，並且具有複雜的拓撲結構。另一方面，許多圖形缺乏文本數據，這使得難以制定基於語言的 CoT 提示。在這項工作中，我們提出了第一個適用於無文本圖形的 CoT 提示學習框架 GCoT。具體來說，我們將每個下游任務的適應過程分解為一系列推理步驟，每個步驟都包含基於提示的推理、「思想」生成以及基於思想的提示學習。雖然這些步驟模擬了 NLP 中的 CoT 提示，但具體機制卻有很大不同。具體來說，在每一步中，一個輸入圖形連同一個提示首先被輸入到一個預訓練的圖形編碼器中進行基於提示的推理。然後，我們聚合編碼器的隱藏層以構建一個「思想」，它捕獲了當前步驟中每個節點的工作狀態。基於這個思想，我們根據當前狀態學習一個特定於每個節點的提示。這些提示被輸入到下一個推理步驟中，重複這個循環。為了評估和分析 GCoT 的有效性，我們對八個公共數據集進行了全面的實驗，這證明了我們方法的優勢。</paragraph>
 
-##### **CasiMedicos-Arg: A Medical Question Answering Dataset Annotated with Explanatory Argumentative Structures**
-2410.05235v2 by Ekaterina Sviridova, Anar Yeginbergen, Ainara Estarrona, Elena Cabrio, Serena Villata, Rodrigo Agerri
+##### **Deep Semantic Graph Learning via LLM based Node Enhancement**
+2502.07982v1 by Chuanqi Shi, Yiyi Tao, Hang Zhang, Lun Wang, Shaoshuai Du, Yixian Shen, Yanxin Shen
 
-Explaining Artificial Intelligence (AI) decisions is a major challenge
-nowadays in AI, in particular when applied to sensitive scenarios like medicine
-and law. However, the need to explain the rationale behind decisions is a main
-issue also for human-based deliberation as it is important to justify
-\textit{why} a certain decision has been taken. Resident medical doctors for
-instance are required not only to provide a (possibly correct) diagnosis, but
-also to explain how they reached a certain conclusion. Developing new tools to
-aid residents to train their explanation skills is therefore a central
-objective of AI in education. In this paper, we follow this direction, and we
-present, to the best of our knowledge, the first multilingual dataset for
-Medical Question Answering where correct and incorrect diagnoses for a clinical
-case are enriched with a natural language explanation written by doctors. These
-explanations have been manually annotated with argument components (i.e.,
-premise, claim) and argument relations (i.e., attack, support), resulting in
-the Multilingual CasiMedicos-Arg dataset which consists of 558 clinical cases
-in four languages (English, Spanish, French, Italian) with explanations, where
-we annotated 5021 claims, 2313 premises, 2431 support relations, and 1106
-attack relations. We conclude by showing how competitive baselines perform over
-this challenging dataset for the argument mining task.
+Graph learning has attracted significant attention due to its widespread
+real-world applications. Current mainstream approaches rely on text node
+features and obtain initial node embeddings through shallow embedding learning
+using GNNs, which shows limitations in capturing deep textual semantics. Recent
+advances in Large Language Models (LLMs) have demonstrated superior
+capabilities in understanding text semantics, transforming traditional text
+feature processing. This paper proposes a novel framework that combines Graph
+Transformer architecture with LLM-enhanced node features. Specifically, we
+leverage LLMs to generate rich semantic representations of text nodes, which
+are then processed by a multi-head self-attention mechanism in the Graph
+Transformer to capture both local and global graph structural information. Our
+model utilizes the Transformer's attention mechanism to dynamically aggregate
+neighborhood information while preserving the semantic richness provided by LLM
+embeddings. Experimental results demonstrate that the LLM-enhanced node
+features significantly improve the performance of graph learning models on node
+classification tasks. This approach shows promising results across multiple
+graph learning tasks, offering a practical direction for combining graph
+networks with language models.
 
-摘要：解釋人工智慧 (AI) 的決策是現在 AI 的一項重大挑戰，特別是應用於像醫學和法律等敏感情境時。然而，解釋決策背後理由的需求也是基於人類的考量的一個主要問題，因為有必要證明為什麼做出某個決策。例如，住院醫師不僅需要提供（可能是正確的）診斷，還需要解釋他們如何達成某個結論。因此，開發新的工具來幫助住院醫師訓練他們的解釋技巧是教育中 AI 的一項核心目標。在本文中，我們遵循這個方向，並且根據我們的了解，提出第一個多語言醫學問答資料集，其中臨床病例的正確和不正確診斷都附有由醫生撰寫的自然語言解釋。這些解釋已使用論證組成（即前提、主張）和論證關係（即攻擊、支持）進行手動註解，產生多語言 CasiMedicos-Arg 資料集，其中包含 558 個具有解釋的四種語言（英語、西班牙語、法語、義大利語）的臨床病例，我們註解了 5021 個主張、2313 個前提、2431 個支持關係和 1106 個攻擊關係。我們最後展示了競爭基準如何針對論證探勘任務執行此具挑戰性的資料集。
+摘要：圖形學習因其廣泛的現實世界應用而備受關注。目前的熱門方法依賴於文本節點特徵，並通過使用 GNN 的淺層嵌入學習來獲取初始節點嵌入，這在捕捉深度文本語義方面表現出局限性。大語言模型 (LLM) 的最新進展已證明在理解文本語義方面具有優越的能力，轉換了傳統的文本特徵處理。本文提出了一種新的框架，將圖形轉換器架構與 LLM 增強的節點特徵相結合。具體來說，我們利用 LLM 來生成文本節點的豐富語義表示，然後在圖形轉換器中由多頭自我注意機制處理，以捕捉局部和全局圖形結構信息。我們的模型利用 Transformer 的注意機制來動態聚合鄰域信息，同時保留 LLM 嵌入提供的語義豐富性。實驗結果表明，LLM 增強的節點特徵顯著提高了圖形學習模型在節點分類任務上的性能。這種方法在多個圖形學習任務中顯示出有希望的結果，為將圖形網絡與語言模型相結合提供了實用的方向。
 
-##### **Explainable Diagnosis Prediction through Neuro-Symbolic Integration**
-2410.01855v2 by Qiuhao Lu, Rui Li, Elham Sagheb, Andrew Wen, Jinlian Wang, Liwei Wang, Jungwei W. Fan, Hongfang Liu
+##### **Cardiverse: Harnessing LLMs for Novel Card Game Prototyping**
+2502.07128v1 by Danrui Li, Sen Zhang, Sam S. Sohn, Kaidong Hu, Muhammad Usman, Mubbasir Kapadia
 
-Diagnosis prediction is a critical task in healthcare, where timely and
-accurate identification of medical conditions can significantly impact patient
-outcomes. Traditional machine learning and deep learning models have achieved
-notable success in this domain but often lack interpretability which is a
-crucial requirement in clinical settings. In this study, we explore the use of
-neuro-symbolic methods, specifically Logical Neural Networks (LNNs), to develop
-explainable models for diagnosis prediction. Essentially, we design and
-implement LNN-based models that integrate domain-specific knowledge through
-logical rules with learnable thresholds. Our models, particularly
-$M_{\text{multi-pathway}}$ and $M_{\text{comprehensive}}$, demonstrate superior
-performance over traditional models such as Logistic Regression, SVM, and
-Random Forest, achieving higher accuracy (up to 80.52\%) and AUROC scores (up
-to 0.8457) in the case study of diabetes prediction. The learned weights and
-thresholds within the LNN models provide direct insights into feature
-contributions, enhancing interpretability without compromising predictive
-power. These findings highlight the potential of neuro-symbolic approaches in
-bridging the gap between accuracy and explainability in healthcare AI
-applications. By offering transparent and adaptable diagnostic models, our work
-contributes to the advancement of precision medicine and supports the
-development of equitable healthcare solutions. Future research will focus on
-extending these methods to larger and more diverse datasets to further validate
-their applicability across different medical conditions and populations.
+The prototyping of computer games, particularly card games, requires
+extensive human effort in creative ideation and gameplay evaluation. Recent
+advances in Large Language Models (LLMs) offer opportunities to automate and
+streamline these processes. However, it remains challenging for LLMs to design
+novel game mechanics beyond existing databases, generate consistent gameplay
+environments, and develop scalable gameplay AI for large-scale evaluations.
+This paper addresses these challenges by introducing a comprehensive automated
+card game prototyping framework. The approach highlights a graph-based indexing
+method for generating novel game designs, an LLM-driven system for consistent
+game code generation validated by gameplay records, and a gameplay AI
+constructing method that uses an ensemble of LLM-generated action-value
+functions optimized through self-play. These contributions aim to accelerate
+card game prototyping, reduce human labor, and lower barriers to entry for game
+developers.
 
-摘要：診斷預測是醫療保健中的關鍵任務，及時且準確地識別醫療狀況會顯著影響患者的結果。傳統的機器學習和深度學習模型已在這個領域取得顯著成功，但通常缺乏可解釋性，這在臨床環境中是一項關鍵要求。在本研究中，我們探討了神經符號方法的應用，特別是邏輯神經網路 (LNN)，以開發用於診斷預測的可解釋模型。基本上，我們設計並實作了基於 LNN 的模型，這些模型透過具有可學習閾值的邏輯規則整合領域特定知識。我們的模型，特別是 $M_{\text{multi-pathway}}$ 和 $M_{\text{comprehensive}}$，表現出優於傳統模型（例如邏輯迴歸、SVM 和隨機森林）的優異效能，在糖尿病預測的案例研究中達到了更高的準確度（高達 80.52%）和 AUROC 分數（高達 0.8457）。LNN 模型中學習到的權重和閾值提供了對特徵貢獻的直接見解，增強了可解釋性，同時不影響預測能力。這些發現突顯了神經符號方法在彌合醫療保健 AI 應用中準確性和可解釋性差距方面的潛力。透過提供透明且適應性強的診斷模型，我們的研究有助於推進精準醫療，並支援公平醫療保健解決方案的開發。未來的研究將專注於將這些方法擴展到更大且更多樣化的資料集，以進一步驗證其在不同醫療狀況和人群中的適用性。
+摘要：電腦遊戲，尤其是卡牌遊戲的原型製作，需要大量的人力在創意構思和遊戲玩法評估上。大型語言模型 (LLM) 的最新進展提供了自動化和簡化這些流程的機會。然而，LLM 在設計超越現有資料庫的新穎遊戲機制、生成一致的遊戲環境，以及開發用於大規模評估的可擴充遊戲 AI 方面仍然面臨挑戰。本文通過引入一個全面的自動化卡牌遊戲原型製作框架來應對這些挑戰。該方法強調了一種基於圖表的索引方法，用於生成新穎的遊戲設計，一個由 LLM 驅動的系統，用於一致的遊戲程式碼生成，並由遊戲記錄驗證，以及一個遊戲 AI 構建方法，該方法使用由 LLM 生成的動作值函數的集合，通過自我對弈進行最佳化。這些貢獻旨在加速卡牌遊戲原型製作，減少人力，並降低遊戲開發人員的進入門檻。
 
-##### **Easydiagnos: a framework for accurate feature selection for automatic diagnosis in smart healthcare**
-2410.00366v1 by Prasenjit Maji, Amit Kumar Mondal, Hemanta Kumar Mondal, Saraju P. Mohanty
+##### **GraNNite: Enabling High-Performance Execution of Graph Neural Networks on Resource-Constrained Neural Processing Units**
+2502.06921v2 by Arghadip Das, Shamik Kundu, Arnab Raha, Soumendu Ghosh, Deepak Mathaikutty, Vijay Raghunathan
 
-The rapid advancements in artificial intelligence (AI) have revolutionized
-smart healthcare, driving innovations in wearable technologies, continuous
-monitoring devices, and intelligent diagnostic systems. However, security,
-explainability, robustness, and performance optimization challenges remain
-critical barriers to widespread adoption in clinical environments. This
-research presents an innovative algorithmic method using the Adaptive Feature
-Evaluator (AFE) algorithm to improve feature selection in healthcare datasets
-and overcome problems. AFE integrating Genetic Algorithms (GA), Explainable
-Artificial Intelligence (XAI), and Permutation Combination Techniques (PCT),
-the algorithm optimizes Clinical Decision Support Systems (CDSS), thereby
-enhancing predictive accuracy and interpretability. The proposed method is
-validated across three diverse healthcare datasets using six distinct machine
-learning algorithms, demonstrating its robustness and superiority over
-conventional feature selection techniques. The results underscore the
-transformative potential of AFE in smart healthcare, enabling personalized and
-transparent patient care. Notably, the AFE algorithm, when combined with a
-Multi-layer Perceptron (MLP), achieved an accuracy of up to 98.5%, highlighting
-its capability to improve clinical decision-making processes in real-world
-healthcare applications.
+Graph Neural Networks (GNNs) are vital for learning from graph-structured
+data, enabling applications in network analysis, recommendation systems, and
+speech analytics. Deploying them on edge devices like client PCs and laptops
+enhances real-time processing, privacy, and cloud independence. GNNs aid
+Retrieval-Augmented Generation (RAG) for Large Language Models (LLMs) and
+enable event-based vision tasks. However, irregular memory access, sparsity,
+and dynamic structures cause high latency and energy overhead on
+resource-constrained devices. While modern edge processors integrate CPUs,
+GPUs, and NPUs, NPUs designed for data-parallel tasks struggle with irregular
+GNN computations. We introduce GraNNite, the first hardware-aware framework
+optimizing GNN execution on commercial-off-the-shelf (COTS) SOTA DNN
+accelerators via a structured three-step methodology: (1) enabling NPU
+execution, (2) optimizing performance, and (3) trading accuracy for efficiency
+gains. Step 1 employs GraphSplit for workload distribution and StaGr for static
+aggregation, while GrAd and NodePad handle dynamic graphs. Step 2 boosts
+performance using EffOp for control-heavy tasks and GraSp for sparsity
+exploitation. Graph Convolution optimizations PreG, SymG, and CacheG reduce
+redundancy and memory transfers. Step 3 balances quality versus efficiency,
+where QuantGr applies INT8 quantization, and GrAx1, GrAx2, and GrAx3 accelerate
+attention, broadcast-add, and SAGE-max aggregation. On Intel Core Ultra AI PCs,
+GraNNite achieves 2.6X to 7.6X speedups over default NPU mappings and up to
+8.6X energy gains over CPUs and GPUs, delivering 10.8X and 6.7X higher
+performance than CPUs and GPUs, respectively, across GNN models.
 
-摘要：人工智慧 (AI) 的快速進展徹底改變了智慧醫療保健，推動了可穿戴技術、持續監控裝置和智慧診斷系統的創新。然而，安全性、可解釋性、穩健性和效能最佳化挑戰仍然是臨床環境中廣泛採用的關鍵障礙。本研究提出一個創新的演算法方法，使用自適應特徵評估器 (AFE) 演算法來改善醫療保健資料集中的特徵選取並克服問題。AFE 整合了遺傳演算法 (GA)、可解釋人工智慧 (XAI) 和排列組合技術 (PCT)，該演算法最佳化了臨床決策支援系統 (CDSS)，從而提高了預測準確性和可解釋性。所提出的方法使用六種不同的機器學習演算法驗證了三個不同的醫療保健資料集，證明了其穩健性和優於傳統特徵選取技術。結果強調了 AFE 在智慧醫療保健中的轉變潛力，實現了個人化和透明的患者照護。值得注意的是，AFE 演算法與多層感知器 (MLP) 結合使用時，準確度高達 98.5%，突顯了其改善實際醫療保健應用中臨床決策制定流程的能力。
+摘要：圖形神經網路 (GNN) 對於從圖形結構資料中學習至關重要，能應用於網路分析、推薦系統和語音分析。將其部署在邊緣裝置（例如用戶端電腦和筆電）上可增強即時處理、隱私和雲端獨立性。GNN 協助大型語言模型 (LLM) 的檢索增強生成 (RAG)，並支援基於事件的視覺任務。然而，不規則的記憶體存取、稀疏性和動態結構會導致資源受限裝置上的高延遲和能源負擔。儘管現代邊緣處理器整合了 CPU、GPU 和 NPU，但針對資料平行任務所設計的 NPU 難以處理不規則的 GNN 計算。我們引入了 GraNNite，這是第一個硬體感知框架，透過結構化的三步驟方法最佳化商用現成 (COTS) SOTA DNN 加速器上的 GNN 執行：(1) 啟用 NPU 執行，(2) 最佳化效能，以及 (3) 以準確度換取效率提升。步驟 1 使用 GraphSplit 進行工作負載分配，並使用 StaGr 進行靜態聚合，而 GrAd 和 NodePad 則處理動態圖形。步驟 2 使用 EffOp 提升控制密集型任務的效能，並使用 GraSp 進行稀疏性利用。圖形卷積最佳化 PreG、SymG 和 CacheG 減少了冗餘和記憶體傳輸。步驟 3 平衡品質與效率，其中 QuantGr 適用 INT8 量化，而 GrAx1、GrAx2 和 GrAx3 則加速注意力、廣播加法和 SAGE-max 聚合。在 Intel Core Ultra AI PC 上，GraNNite 在預設 NPU 映射上實現了 2.6X 到 7.6X 的加速，在 CPU 和 GPU 上實現了高達 8.6X 的能源增益，在 GNN 模型中分別提供了比 CPU 和 GPU 高出 10.8X 和 6.7X 的效能。
 
-##### **Dermatologist-like explainable AI enhances melanoma diagnosis accuracy: eye-tracking study**
-2409.13476v1 by Tirtha Chanda, Sarah Haggenmueller, Tabea-Clara Bucher, Tim Holland-Letz, Harald Kittler, Philipp Tschandl, Markus V. Heppt, Carola Berking, Jochen S. Utikal, Bastian Schilling, Claudia Buerger, Cristian Navarrete-Dechent, Matthias Goebeler, Jakob Nikolas Kather, Carolin V. Schneider, Benjamin Durani, Hendrike Durani, Martin Jansen, Juliane Wacker, Joerg Wacker, Reader Study Consortium, Titus J. Brinker
+##### **Automatic Annotation Augmentation Boosts Translation between Molecules and Natural Language**
+2502.06634v1 by Zhiqiang Zhong, Simon Sataa-Yu Larsen, Haoyu Guo, Tao Tang, Kuangyu Zhou, Davide Mottin
 
-Artificial intelligence (AI) systems have substantially improved
-dermatologists' diagnostic accuracy for melanoma, with explainable AI (XAI)
-systems further enhancing clinicians' confidence and trust in AI-driven
-decisions. Despite these advancements, there remains a critical need for
-objective evaluation of how dermatologists engage with both AI and XAI tools.
-In this study, 76 dermatologists participated in a reader study, diagnosing 16
-dermoscopic images of melanomas and nevi using an XAI system that provides
-detailed, domain-specific explanations. Eye-tracking technology was employed to
-assess their interactions. Diagnostic performance was compared with that of a
-standard AI system lacking explanatory features. Our findings reveal that XAI
-systems improved balanced diagnostic accuracy by 2.8 percentage points relative
-to standard AI. Moreover, diagnostic disagreements with AI/XAI systems and
-complex lesions were associated with elevated cognitive load, as evidenced by
-increased ocular fixations. These insights have significant implications for
-clinical practice, the design of AI tools for visual tasks, and the broader
-development of XAI in medical diagnostics.
+Recent advancements in AI for biological research focus on integrating
+molecular data with natural language to accelerate drug discovery. However, the
+scarcity of high-quality annotations limits progress in this area. This paper
+introduces LA$^3$, a Language-based Automatic Annotation Augmentation framework
+that leverages large language models to augment existing datasets, thereby
+improving AI training. We demonstrate the effectiveness of LA$^3$ by creating
+an enhanced dataset, LaChEBI-20, where we systematically rewrite the
+annotations of molecules from an established dataset. These rewritten
+annotations preserve essential molecular information while providing more
+varied sentence structures and vocabulary. Using LaChEBI-20, we train LaMolT5
+based on a benchmark architecture to learn the mapping between molecular
+representations and augmented annotations.
+  Experimental results on text-based *de novo* molecule generation and molecule
+captioning demonstrate that LaMolT5 outperforms state-of-the-art models.
+Notably, incorporating LA$^3$ leads to improvements of up to 301% over the
+benchmark architecture. Furthermore, we validate the effectiveness of LA$^3$
+notable applications in *image*, *text* and *graph* tasks, affirming its
+versatility and utility.
 
-摘要：人工智慧 (AI) 系統已大幅改善皮膚科醫師對黑色素瘤的診斷準確度，而可解釋 AI (XAI) 系統進一步提升臨床醫師對 AI 驅動決策的信心與信賴。儘管有這些進展，對於皮膚科醫師如何使用 AI 和 XAI 工具，仍有客觀評估的迫切需求。在這項研究中，76 位皮膚科醫師參與了一項讀者研究，使用 XAI 系統診斷 16 張黑色素瘤和痣的皮膚鏡影像，該系統提供詳細的領域特定說明。採用眼球追蹤技術來評估他們的互動。將診斷表現與缺乏說明功能的標準 AI 系統進行比較。我們的研究結果顯示，XAI 系統相較於標準 AI，將平衡診斷準確度提升了 2.8 個百分點。此外，與 AI/XAI 系統的診斷分歧和複雜的病灶與認知負擔升高有關，這由增加的眼睛注視次數所證實。這些見解對臨床實務、視覺任務 AI 工具的設計和醫學診斷中 XAI 的廣泛發展具有重大意義。
+摘要：<paragraph>人工智慧在生物研究上的最新進展，專注於將分子資料與自然語言整合，以加速藥物發現。然而，高品質註解的稀少限制了此領域的進展。這篇論文介紹了 LA$^3$，一個基於語言的自動註解擴充框架，它利用大型語言模型來擴充現有的資料集，進而改善人工智慧訓練。我們透過建立一個增強的資料集 LaChEBI-20 來展示 LA$^3$ 的有效性，我們系統性地改寫了一個既定資料集中分子的註解。這些改寫的註解保留了重要的分子資訊，同時提供了更多樣化的句子結構和詞彙。使用 LaChEBI-20，我們在基於基準架構上訓練 LaMolT5，以學習分子表示和擴充註解之間的對應。
+在基於文字的 *從頭開始* 分子生成和分子標題上的實驗結果表明，LaMolT5 優於最先進的模型。值得注意的是，納入 LA$^3$ 可讓基準架構的改進幅度高達 301%。此外，我們驗證了 LA$^3$ 在 *影像*、*文字* 和 *圖形* 任務中的有效性，肯定了它的多功能性和實用性。</paragraph>
 
-##### **Explainable AI for Autism Diagnosis: Identifying Critical Brain Regions Using fMRI Data**
-2409.15374v1 by Suryansh Vidya, Kush Gupta, Amir Aly, Andy Wills, Emmanuel Ifeachor, Rohit Shankar
+##### **KARMA: Leveraging Multi-Agent LLMs for Automated Knowledge Graph Enrichment**
+2502.06472v1 by Yuxing Lu, Jinzhuo Wang
 
-Early diagnosis and intervention for Autism Spectrum Disorder (ASD) has been
-shown to significantly improve the quality of life of autistic individuals.
-However, diagnostics methods for ASD rely on assessments based on clinical
-presentation that are prone to bias and can be challenging to arrive at an
-early diagnosis. There is a need for objective biomarkers of ASD which can help
-improve diagnostic accuracy. Deep learning (DL) has achieved outstanding
-performance in diagnosing diseases and conditions from medical imaging data.
-Extensive research has been conducted on creating models that classify ASD
-using resting-state functional Magnetic Resonance Imaging (fMRI) data. However,
-existing models lack interpretability. This research aims to improve the
-accuracy and interpretability of ASD diagnosis by creating a DL model that can
-not only accurately classify ASD but also provide explainable insights into its
-working. The dataset used is a preprocessed version of the Autism Brain Imaging
-Data Exchange (ABIDE) with 884 samples. Our findings show a model that can
-accurately classify ASD and highlight critical brain regions differing between
-ASD and typical controls, with potential implications for early diagnosis and
-understanding of the neural basis of ASD. These findings are validated by
-studies in the literature that use different datasets and modalities,
-confirming that the model actually learned characteristics of ASD and not just
-the dataset. This study advances the field of explainable AI in medical imaging
-by providing a robust and interpretable model, thereby contributing to a future
-with objective and reliable ASD diagnostics.
+Maintaining comprehensive and up-to-date knowledge graphs (KGs) is critical
+for modern AI systems, but manual curation struggles to scale with the rapid
+growth of scientific literature. This paper presents KARMA, a novel framework
+employing multi-agent large language models (LLMs) to automate KG enrichment
+through structured analysis of unstructured text. Our approach employs nine
+collaborative agents, spanning entity discovery, relation extraction, schema
+alignment, and conflict resolution that iteratively parse documents, verify
+extracted knowledge, and integrate it into existing graph structures while
+adhering to domain-specific schema. Experiments on 1,200 PubMed articles from
+three different domains demonstrate the effectiveness of KARMA in knowledge
+graph enrichment, with the identification of up to 38,230 new entities while
+achieving 83.1\% LLM-verified correctness and reducing conflict edges by 18.6\%
+through multi-layer assessments.
 
-摘要：自閉症譜系障礙 (ASD) 的早期診斷和介入已被證實能顯著改善自閉症患者的生活品質。然而，ASD 的診斷方法依賴於基於臨床表現的評估，容易產生偏見，且可能難以做出早期診斷。有必要找出 ASD 的客觀生物標記，以幫助提高診斷準確性。深度學習 (DL) 在從醫學影像資料診斷疾病和病症方面取得傑出的表現。已經針對建立使用靜態功能性磁振造影 (fMRI) 資料對 ASD 進行分類的模型進行廣泛的研究。然而，現有的模型缺乏可解釋性。本研究旨在透過建立一個不僅能準確分類 ASD，還能提供可解釋見解說明其運作原理的 DL 模型，來改善 ASD 診斷的準確性和可解釋性。所使用的資料集是自閉症大腦影像資料交換 (ABIDE) 的預處理版本，包含 884 個樣本。我們的研究結果顯示，該模型能準確分類 ASD，並強調 ASD 與典型對照組之間存在差異的關鍵腦區，對於 ASD 的早期診斷和神經基礎的理解具有潛在的意義。這些研究結果已由使用不同資料集和方式的文獻研究驗證，證實該模型實際上學習了 ASD 的特徵，而不僅僅是資料集。本研究透過提供一個強健且可解釋的模型，推動了醫學影像中可解釋 AI 的領域，從而為未來提供客觀且可靠的 ASD 診斷做出貢獻。
+摘要：維護全面且最新的知識圖譜 (KG) 對現代 AI 系統至關重要，但手動策劃難以隨著科學文獻的快速增長而擴展。本文提出了 KARMA，一個採用多代理大型語言模型 (LLM) 的新框架，透過對非結構化文本的結構化分析來自動化 KG 豐富化。我們的做法採用九個協作代理，涵蓋實體發現、關係提取、架構比對和衝突解決，這些代理會反覆分析文件、驗證提取的知識，並將其整合到現有的圖結構中，同時遵守特定領域的架構。針對來自三個不同領域的 1,200 篇 PubMed 文章進行的實驗證明了 KARMA 在知識圖譜豐富化方面的有效性，識別出多達 38,230 個新實體，同時達到 83.1% 的 LLM 驗證正確性，並透過多層評估將衝突邊緣降低了 18.6%。
 
-##### **Improving Prototypical Parts Abstraction for Case-Based Reasoning Explanations Designed for the Kidney Stone Type Recognition**
-2409.12883v1 by Daniel Flores-Araiza, Francisco Lopez-Tiro, Clément Larose, Salvador Hinojosa, Andres Mendez-Vazquez, Miguel Gonzalez-Mendoza, Gilberto Ochoa-Ruiz, Christian Daul
+##### **RoToR: Towards More Reliable Responses for Order-Invariant Inputs**
+2502.08662v1 by Soyoung Yoon, Dongha Ahn, Youngwon Lee, Minkyu Jung, HyungJoo Jang, Seung-won Hwang
 
-The in-vivo identification of the kidney stone types during an ureteroscopy
-would be a major medical advance in urology, as it could reduce the time of the
-tedious renal calculi extraction process, while diminishing infection risks.
-Furthermore, such an automated procedure would make possible to prescribe
-anti-recurrence treatments immediately. Nowadays, only few experienced
-urologists are able to recognize the kidney stone types in the images of the
-videos displayed on a screen during the endoscopy. Thus, several deep learning
-(DL) models have recently been proposed to automatically recognize the kidney
-stone types using ureteroscopic images. However, these DL models are of black
-box nature whicl limits their applicability in clinical settings. This
-contribution proposes a case-based reasoning DL model which uses prototypical
-parts (PPs) and generates local and global descriptors. The PPs encode for each
-class (i.e., kidney stone type) visual feature information (hue, saturation,
-intensity and textures) similar to that used by biologists. The PPs are
-optimally generated due a new loss function used during the model training.
-Moreover, the local and global descriptors of PPs allow to explain the
-decisions ("what" information, "where in the images") in an understandable way
-for biologists and urologists. The proposed DL model has been tested on a
-database including images of the six most widespread kidney stone types. The
-overall average classification accuracy was 90.37. When comparing this results
-with that of the eight other DL models of the kidney stone state-of-the-art, it
-can be seen that the valuable gain in explanability was not reached at the
-expense of accuracy which was even slightly increased with respect to that
-(88.2) of the best method of the literature. These promising and interpretable
-results also encourage urologists to put their trust in AI-based solutions.
+Mitigating positional bias of language models (LMs) for listwise inputs is a
+well-known and important problem (e.g., lost-in-the-middle). While zero-shot
+order-invariant LMs have been proposed to solve this issue, their success on
+practical listwise problems has been limited. In this work, as a first
+contribution, we identify and overcome two limitations to make zero-shot
+invariant LMs more practical: (1) training and inference distribution mismatch
+arising from modifying positional ID assignments to enforce invariance, and (2)
+failure to adapt to a mixture of order-invariant and sensitive inputs in
+practical listwise problems. To overcome, we propose (1) RoToR, a zero-shot
+invariant LM for genuinely order-invariant inputs with minimal modifications of
+positional IDs, and (2) Selective Routing, an adaptive framework that handles
+both order-invariant and order-sensitive inputs in listwise tasks. On the Lost
+in the middle (LitM), Knowledge Graph Question Answering (KGQA), and MMLU
+benchmarks, we show that RoToR with Selective Routing can effectively handle
+practical listwise input tasks in a zero-shot manner.
 
-摘要：尿路鏡檢查中腎結石類型的體內識別將是泌尿科的一項重大進展，因為它可以減少繁瑣的腎結石取出過程的時間，同時降低感染風險。此外，這種自動化程序將使立即開立抗復發治療成為可能。如今，只有少數經驗豐富的泌尿科醫生能夠在內視鏡檢查期間屏幕上顯示的視頻圖像中識別腎結石類型。因此，最近已提出多種深度學習 (DL) 模型，以使用輸尿管鏡圖像自動識別腎結石類型。然而，這些 DL 模型本質上是黑盒子，這限制了它們在臨床環境中的應用性。本文提出了一個基於案例推理的 DL 模型，它使用原型部分 (PP) 並生成局部和全局描述符。PP 為每種類型（即腎結石類型）編碼視覺特徵信息（色調、飽和度、強度和紋理），類似於生物學家使用的信息。由於在模型訓練期間使用的新損失函數，PP 得到了最佳生成。此外，PP 的局部和全局描述符允許以生物學家和泌尿科醫生可以理解的方式解釋決策（“什麼”信息，“圖像中的什麼位置”）。所提出的 DL 模型已在一個包含六種最廣泛的腎結石類型圖像的數據庫上進行了測試。總體平均分類準確率為 90.37。將此結果與腎結石最先進的八個其他 DL 模型的結果進行比較時，可以看出，可解釋性的寶貴增益並未以準確性為代價，甚至略有增加與文獻中最好的方法 (88.2) 相比。這些有希望且可解釋的結果也鼓勵泌尿科醫生相信基於人工智能的解決方案。
+摘要：語言模型 (LM) 的位置偏差緩解對於列表輸入來說是一個廣為人知且重要的問題（例如，迷失在中間）。雖然已經提出零次學習順序不變的 LM 來解決這個問題，但它們在實際列表問題上的成功卻很有限。在這項工作中，作為第一個貢獻，我們找出並克服了兩個限制，讓零次學習不變的 LM 更有實用性：(1) 訓練和推論分布不匹配，這是由於修改位置 ID 分配以強制不變性所造成的，以及 (2) 無法適應實際列表問題中不變和敏感輸入的組合。為了克服這些問題，我們提出 (1) RoToR，一個零次學習不變的 LM，用於真正不變的輸入，並對位置 ID 進行最小的修改，以及 (2) 選擇性路由，一個自適應框架，用於處理列表任務中不變和敏感的輸入。在迷失在中間 (LitM)、知識圖譜問答 (KGQA) 和 MMLU 基準測試中，我們展示了 RoToR 與選擇性路由可以有效地以零次學習的方式處理實際的列表輸入任務。
 
-##### **Towards Interpretable End-Stage Renal Disease (ESRD) Prediction: Utilizing Administrative Claims Data with Explainable AI Techniques**
-2409.12087v3 by Yubo Li, Saba Al-Sayouri, Rema Padman
+##### **K-ON: Stacking Knowledge On the Head Layer of Large Language Model**
+2502.06257v1 by Lingbing Guo, Yichi Zhang, Zhongpu Bo, Zhuo Chen, Mengshu Sun, Zhiqiang Zhang, Wen Zhang, Huajun Chen
 
-This study explores the potential of utilizing administrative claims data,
-combined with advanced machine learning and deep learning techniques, to
-predict the progression of Chronic Kidney Disease (CKD) to End-Stage Renal
-Disease (ESRD). We analyze a comprehensive, 10-year dataset provided by a major
-health insurance organization to develop prediction models for multiple
-observation windows using traditional machine learning methods such as Random
-Forest and XGBoost as well as deep learning approaches such as Long Short-Term
-Memory (LSTM) networks. Our findings demonstrate that the LSTM model,
-particularly with a 24-month observation window, exhibits superior performance
-in predicting ESRD progression, outperforming existing models in the
-literature. We further apply SHapley Additive exPlanations (SHAP) analysis to
-enhance interpretability, providing insights into the impact of individual
-features on predictions at the individual patient level. This study underscores
-the value of leveraging administrative claims data for CKD management and
-predicting ESRD progression.
+Recent advancements in large language models (LLMs) have significantly
+improved various natural language processing (NLP) tasks. Typically, LLMs are
+trained to predict the next token, aligning well with many NLP tasks. However,
+in knowledge graph (KG) scenarios, entities are the fundamental units and
+identifying an entity requires at least several tokens. This leads to a
+granularity mismatch between KGs and natural languages. To address this issue,
+we propose K-ON, which integrates KG knowledge into the LLM by employing
+multiple head layers for next k-step prediction. K-ON can not only generate
+entity-level results in one step, but also enables contrastive loss against
+entities, which is the most powerful tool in KG representation learning.
+Experimental results show that K-ON outperforms state-of-the-art methods that
+incorporate text and even the other modalities.
 
-摘要：本研究探討利用行政申報資料，結合先進機器學習與深度學習技術，預測慢性腎臟病 (CKD) 進展至末期腎臟疾病 (ESRD) 的可能性。我們分析一家大型健康保險組織提供的 10 年綜合資料集，使用傳統機器學習方法（例如隨機森林和 XGBoost）以及深度學習方法（例如長期短期記憶 (LSTM) 網路）開發多個觀察視窗的預測模型。我們的研究結果顯示，LSTM 模型（尤其是 24 個月觀察視窗）在預測 ESRD 進展方面表現優異，優於文獻中的現有模型。我們進一步應用 SHapley 可加性解釋 (SHAP) 分析以增強可解釋性，深入了解個別特徵對個別患者層級預測的影響。本研究強調了利用行政申報資料進行 CKD 管理和預測 ESRD 進展的價值。
+摘要：大型語言模型 (LLM) 的最新進展顯著提升了各種自然語言處理 (NLP) 任務。通常，LLM 會接受訓練以預測下一個符號，這與許多 NLP 任務非常吻合。然而，在知識圖譜 (KG) 場景中，實體是基本單位，而識別實體至少需要幾個符號。這導致 KG 和自然語言之間的粒度不匹配。為了解決這個問題，我們提出了 K-ON，它透過採用多個頭部層進行下一個 k 步預測，將 KG 知識整合到 LLM 中。K-ON 不僅可以在一個步驟中產生實體層級的結果，還能針對實體啟用對比損失，這是 KG 表示學習中最有力的工具。實驗結果顯示，K-ON 優於將文字甚至其他方式納入考量的最新方法。
 
-##### **Contextual Evaluation of Large Language Models for Classifying Tropical and Infectious Diseases**
-2409.09201v3 by Mercy Asiedu, Nenad Tomasev, Chintan Ghate, Tiya Tiyasirichokchai, Awa Dieng, Oluwatosin Akande, Geoffrey Siwo, Steve Adudans, Sylvanus Aitkins, Odianosen Ehiakhamen, Eric Ndombi, Katherine Heller
+##### **LegalViz: Legal Text Visualization by Text To Diagram Generation**
+2502.06147v2 by Eri Onami, Taiki Miyanishi, Koki Maeda, Shuhei Kurita
 
-While large language models (LLMs) have shown promise for medical question
-answering, there is limited work focused on tropical and infectious
-disease-specific exploration. We build on an opensource tropical and infectious
-diseases (TRINDs) dataset, expanding it to include demographic and semantic
-clinical and consumer augmentations yielding 11000+ prompts. We evaluate LLM
-performance on these, comparing generalist and medical LLMs, as well as LLM
-outcomes to human experts. We demonstrate through systematic experimentation,
-the benefit of contextual information such as demographics, location, gender,
-risk factors for optimal LLM response. Finally we develop a prototype of
-TRINDs-LM, a research tool that provides a playground to navigate how context
-impacts LLM outputs for health.
+Legal documents including judgments and court orders require highly
+sophisticated legal knowledge for understanding. To disclose expert knowledge
+for non-experts, we explore the problem of visualizing legal texts with
+easy-to-understand diagrams and propose a novel dataset of LegalViz with 23
+languages and 7,010 cases of legal document and visualization pairs, using the
+DOT graph description language of Graphviz. LegalViz provides a simple diagram
+from a complicated legal corpus identifying legal entities, transactions, legal
+sources, and statements at a glance, that are essential in each judgment. In
+addition, we provide new evaluation metrics for the legal diagram visualization
+by considering graph structures, textual similarities, and legal contents. We
+conducted empirical studies on few-shot and finetuning large language models
+for generating legal diagrams and evaluated them with these metrics, including
+legal content-based evaluation within 23 languages. Models trained with
+LegalViz outperform existing models including GPTs, confirming the
+effectiveness of our dataset.
 
-摘要：儘管大型語言模型 (LLM) 在醫療問題解答方面展現出前景，但專注於熱帶和傳染病特定探索的研究有限。我們建立在一個開放原始碼熱帶和傳染病 (TRINDs) 資料集上，並將其擴展為納入人口統計和語義臨床和消費者擴充，產生超過 11000 個提示。我們評估了 LLM 在這些方面的效能，比較了通才和醫療 LLM，以及 LLM 結果與人類專家的比較。我們透過系統性實驗證明了背景資訊（例如人口統計、位置、性別、最佳 LLM 回應的風險因素）的好處。最後，我們開發了 TRINDs-LM 的原型，這是一個研究工具，提供一個探索背景如何影響 LLM 健康輸出的平台。
+摘要：法律文件，包括判決和法院命令，需要高度專業的法律知識才能理解。為了向非專家揭露專家知識，我們探討了使用易於理解的圖表將法律文本視覺化的問題，並提出了一個新的 LegalViz 數據集，其中包含 23 種語言和 7,010 個法律文件和視覺化配對，使用 Graphviz 的 DOT 圖形描述語言。LegalViz 從複雜的法律語料庫中提供了一個簡單的圖表，可以一目了然地識別法律實體、交易、法律來源和陳述，這些在每項判決中都是必不可少的。此外，我們通過考慮圖形結構、文本相似性和法律內容，為法律圖表視覺化提供了新的評估指標。我們對少次學習和微調大型語言模型進行了實證研究，以生成法律圖表，並使用這些指標對它們進行了評估，包括在 23 種語言中基於法律內容的評估。使用 LegalViz 訓練的模型優於現有的模型，包括 GPT，證實了我們數據集的有效性。
 
-##### **Explainable AI: Definition and attributes of a good explanation for health AI**
-2409.15338v1 by Evangelia Kyrimi, Scott McLachlan, Jared M Wohlgemut, Zane B Perkins, David A. Lagnado, William Marsh, the ExAIDSS Expert Group
+##### **Deconstructing Depression Stigma: Integrating AI-driven Data Collection and Analysis with Causal Knowledge Graphs**
+2502.06075v1 by Han Meng, Renwen Zhang, Ganyi Wang, Yitian Yang, Peinuan Qin, Jungup Lee, Yi-Chieh Lee
 
-Proposals of artificial intelligence (AI) solutions based on increasingly
-complex and accurate predictive models are becoming ubiquitous across many
-disciplines. As the complexity of these models grows, transparency and users'
-understanding often diminish. This suggests that accurate prediction alone is
-insufficient for making an AI-based solution truly useful. In the development
-of healthcare systems, this introduces new issues related to accountability and
-safety. Understanding how and why an AI system makes a recommendation may
-require complex explanations of its inner workings and reasoning processes.
-Although research on explainable AI (XAI) has significantly increased in recent
-years and there is high demand for XAI in medicine, defining what constitutes a
-good explanation remains ad hoc, and providing adequate explanations continues
-to be challenging. To fully realize the potential of AI, it is critical to
-address two fundamental questions about explanations for safety-critical AI
-applications, such as health-AI: (1) What is an explanation in health-AI? and
-(2) What are the attributes of a good explanation in health-AI? In this study,
-we examined published literature and gathered expert opinions through a
-two-round Delphi study. The research outputs include (1) a definition of what
-constitutes an explanation in health-AI and (2) a comprehensive list of
-attributes that characterize a good explanation in health-AI.
+Mental-illness stigma is a persistent social problem, hampering both
+treatment-seeking and recovery. Accordingly, there is a pressing need to
+understand it more clearly, but analyzing the relevant data is highly
+labor-intensive. Therefore, we designed a chatbot to engage participants in
+conversations; coded those conversations qualitatively with AI assistance; and,
+based on those coding results, built causal knowledge graphs to decode stigma.
+The results we obtained from 1,002 participants demonstrate that conversation
+with our chatbot can elicit rich information about people's attitudes toward
+depression, while our AI-assisted coding was strongly consistent with
+human-expert coding. Our novel approach combining large language models (LLMs)
+and causal knowledge graphs uncovered patterns in individual responses and
+illustrated the interrelationships of psychological constructs in the dataset
+as a whole. The paper also discusses these findings' implications for HCI
+researchers in developing digital interventions, decomposing human
+psychological constructs, and fostering inclusive attitudes.
 
-摘要：隨著越來越複雜且準確的預測模型，基於人工智慧 (AI) 解決方案的提案在許多領域中變得無處不在。隨著這些模型複雜性的增加，透明度和使用者的理解力往往會降低。這表示僅有準確的預測並不足以讓 AI 解決方案真正有用。在醫療保健系統的開發中，這引入了與問責制和安全性相關的新問題。瞭解 AI 系統如何以及為何提出建議可能需要對其內部運作和推理過程進行複雜的說明。儘管近年來對可解釋 AI (XAI) 的研究已大幅增加，且醫學領域對 XAI 有很高的需求，但定義什麼構成一個好的解釋仍是臨時性的，而提供適當的解釋仍然具有挑戰性。為了充分發揮 AI 的潛力，對於安全關鍵型 AI 應用（例如健康 AI）的解釋，探討兩個基本問題至關重要：(1) 什麼是健康 AI 中的解釋？以及 (2) 健康 AI 中一個好的解釋有哪些屬性？在本研究中，我們檢視了已發表的文獻，並透過兩輪德爾菲研究收集了專家意見。研究成果包括：(1) 健康 AI 中什麼構成解釋的定義，以及 (2) 健康 AI 中一個好解釋的屬性清單。
+摘要：精神疾病的污名化是一個持續存在的社會問題，阻礙了尋求治療和康復。因此，迫切需要更清楚地了解它，但分析相關數據非常費力。因此，我們設計了一個聊天機器人，讓參與者參與對話；使用 AI 協助對這些對話進行定性編碼；並根據這些編碼結果，構建因果知識圖譜來破譯污名化。我們從 1,002 名參與者那裡獲得的結果表明，與我們的聊天機器人的對話可以引出人們對憂鬱症的豐富資訊，而我們 AI 輔助的編碼與人類專家編碼非常一致。我們將大型語言模型 (LLM) 和因果知識圖譜相結合的新方法揭示了個別反應中的模式，並說明了資料集中心理建構之間的相互關係。本文還討論了這些發現對 HCI 研究人員在開發數位介入措施、分解人類心理建構和培養包容態度方面的影響。
 
-##### **Exploring the Effect of Explanation Content and Format on User Comprehension and Trust**
-2408.17401v1 by Antonio Rago, Bence Palfi, Purin Sukpanichnant, Hannibal Nabli, Kavyesh Vivek, Olga Kostopoulou, James Kinross, Francesca Toni
+##### **LegalSeg: Unlocking the Structure of Indian Legal Judgments Through Rhetorical Role Classification**
+2502.05836v1 by Shubham Kumar Nigam, Tanmay Dubey, Govind Sharma, Noel Shallum, Kripabandhu Ghosh, Arnab Bhattacharya
 
-In recent years, various methods have been introduced for explaining the
-outputs of "black-box" AI models. However, it is not well understood whether
-users actually comprehend and trust these explanations. In this paper, we focus
-on explanations for a regression tool for assessing cancer risk and examine the
-effect of the explanations' content and format on the user-centric metrics of
-comprehension and trust. Regarding content, we experiment with two explanation
-methods: the popular SHAP, based on game-theoretic notions and thus potentially
-complex for everyday users to comprehend, and occlusion-1, based on feature
-occlusion which may be more comprehensible. Regarding format, we present SHAP
-explanations as charts (SC), as is conventional, and occlusion-1 explanations
-as charts (OC) as well as text (OT), to which their simpler nature also lends
-itself. The experiments amount to user studies questioning participants, with
-two different levels of expertise (the general population and those with some
-medical training), on their subjective and objective comprehension of and trust
-in explanations for the outputs of the regression tool. In both studies we
-found a clear preference in terms of subjective comprehension and trust for
-occlusion-1 over SHAP explanations in general, when comparing based on content.
-However, direct comparisons of explanations when controlling for format only
-revealed evidence for OT over SC explanations in most cases, suggesting that
-the dominance of occlusion-1 over SHAP explanations may be driven by a
-preference for text over charts as explanations. Finally, we found no evidence
-of a difference between the explanation types in terms of objective
-comprehension. Thus overall, the choice of the content and format of
-explanations needs careful attention, since in some contexts format, rather
-than content, may play the critical role in improving user experience.
+In this paper, we address the task of semantic segmentation of legal
+documents through rhetorical role classification, with a focus on Indian legal
+judgments. We introduce LegalSeg, the largest annotated dataset for this task,
+comprising over 7,000 documents and 1.4 million sentences, labeled with 7
+rhetorical roles. To benchmark performance, we evaluate multiple
+state-of-the-art models, including Hierarchical BiLSTM-CRF,
+TransformerOverInLegalBERT (ToInLegalBERT), Graph Neural Networks (GNNs), and
+Role-Aware Transformers, alongside an exploratory RhetoricLLaMA, an
+instruction-tuned large language model. Our results demonstrate that models
+incorporating broader context, structural relationships, and sequential
+sentence information outperform those relying solely on sentence-level
+features. Additionally, we conducted experiments using surrounding context and
+predicted or actual labels of neighboring sentences to assess their impact on
+classification accuracy. Despite these advancements, challenges persist in
+distinguishing between closely related roles and addressing class imbalance.
+Our work underscores the potential of advanced techniques for improving legal
+document understanding and sets a strong foundation for future research in
+legal NLP.
 
-摘要：<paragraph>近年來，已經引進各種方法來解釋「黑箱」AI 模型的輸出。然而，目前並不清楚使用者是否實際理解和信任這些解釋。在本文中，我們專注於評估癌症風險的回歸工具的解釋，並探討解釋的內容和格式對以使用者為中心的理解和信任指標的影響。關於內容，我們實驗了兩種解釋方法：流行的 SHAP，基於博弈論概念，因此對於日常使用者來說可能很複雜，以及基於特徵遮蔽的 occlusion-1，可能更易於理解。關於格式，我們將 SHAP 解釋呈現為圖表 (SC)，這是慣例，而將 occlusion-1 解釋呈現為圖表 (OC) 以及文字 (OT)，其較為簡單的性質也適用於此。這些實驗等同於使用者研究，詢問參與者，具有兩種不同程度的專業知識（一般民眾和具備一些醫學訓練的人），他們對回歸工具輸出解釋的主觀和客觀理解和信任。在兩項研究中，我們發現，在基於內容進行比較時，一般來說，occlusion-1 優於 SHAP 解釋，在主觀理解和信任方面有明顯的偏好。然而，在僅控制格式的情況下直接比較解釋，在大多數情況下只顯示 OT 優於 SC 解釋的證據，這表明 occlusion-1 優於 SHAP 解釋的主導地位可能是由偏好文字而非圖表作為解釋所驅動的。最後，我們沒有發現解釋類型在客觀理解方面的差異證據。因此，總體而言，對解釋的內容和格式的選擇需要仔細注意，因為在某些情況下，格式而非內容，可能在改善使用者體驗方面發揮關鍵作用。</paragraph>
+摘要：<paragraph>在本文中，我們通過修辭角色分類來探討法律文件的語義分段任務，重點關注印度法律判決。我們引入了 LegalSeg，這是此任務中最大的註釋資料集，包含超過 7,000 份文件和 140 萬個句子，並標記了 7 個修辭角色。為了評量效能，我們評估了多個最先進的模型，包括分層 BiLSTM-CRF、TransformerOverInLegalBERT (ToInLegalBERT)、圖神經網路 (GNN) 和角色感知Transformer，以及探索性的 RhetoricLLaMA，一種經過指令調整的大型語言模型。我們的結果表明，結合廣泛背景、結構關係和順序句子資訊的模型，表現優於僅依賴句子層級特徵的模型。此外，我們使用周圍的背景和鄰近句子的預測或實際標籤進行實驗，以評估它們對分類精度的影響。儘管有這些進展，但在區分密切相關的角色和解決類別不平衡方面仍存在挑戰。我們的研究強調了先進技術在改善法律文件理解方面的潛力，並為法律自然語言處理的未來研究奠定了堅實的基礎。</paragraph>
 
-##### **A Survey for Large Language Models in Biomedicine**
-2409.00133v1 by Chong Wang, Mengyao Li, Junjun He, Zhongruo Wang, Erfan Darzi, Zan Chen, Jin Ye, Tianbin Li, Yanzhou Su, Jing Ke, Kaili Qu, Shuxin Li, Yi Yu, Pietro Liò, Tianyun Wang, Yu Guang Wang, Yiqing Shen
+##### **LLM-Powered Decentralized Generative Agents with Adaptive Hierarchical Knowledge Graph for Cooperative Planning**
+2502.05453v1 by Hanqing Yang, Jingdi Chen, Marie Siew, Tania Lorido-Botran, Carlee Joe-Wong
 
-Recent breakthroughs in large language models (LLMs) offer unprecedented
-natural language understanding and generation capabilities. However, existing
-surveys on LLMs in biomedicine often focus on specific applications or model
-architectures, lacking a comprehensive analysis that integrates the latest
-advancements across various biomedical domains. This review, based on an
-analysis of 484 publications sourced from databases including PubMed, Web of
-Science, and arXiv, provides an in-depth examination of the current landscape,
-applications, challenges, and prospects of LLMs in biomedicine, distinguishing
-itself by focusing on the practical implications of these models in real-world
-biomedical contexts. Firstly, we explore the capabilities of LLMs in zero-shot
-learning across a broad spectrum of biomedical tasks, including diagnostic
-assistance, drug discovery, and personalized medicine, among others, with
-insights drawn from 137 key studies. Then, we discuss adaptation strategies of
-LLMs, including fine-tuning methods for both uni-modal and multi-modal LLMs to
-enhance their performance in specialized biomedical contexts where zero-shot
-fails to achieve, such as medical question answering and efficient processing
-of biomedical literature. Finally, we discuss the challenges that LLMs face in
-the biomedicine domain including data privacy concerns, limited model
-interpretability, issues with dataset quality, and ethics due to the sensitive
-nature of biomedical data, the need for highly reliable model outputs, and the
-ethical implications of deploying AI in healthcare. To address these
-challenges, we also identify future research directions of LLM in biomedicine
-including federated learning methods to preserve data privacy and integrating
-explainable AI methodologies to enhance the transparency of LLMs.
+Developing intelligent agents for long-term cooperation in dynamic open-world
+scenarios is a major challenge in multi-agent systems. Traditional Multi-agent
+Reinforcement Learning (MARL) frameworks like centralized training
+decentralized execution (CTDE) struggle with scalability and flexibility. They
+require centralized long-term planning, which is difficult without custom
+reward functions, and face challenges in processing multi-modal data. CTDE
+approaches also assume fixed cooperation strategies, making them impractical in
+dynamic environments where agents need to adapt and plan independently. To
+address decentralized multi-agent cooperation, we propose Decentralized
+Adaptive Knowledge Graph Memory and Structured Communication System (DAMCS) in
+a novel Multi-agent Crafter environment. Our generative agents, powered by
+Large Language Models (LLMs), are more scalable than traditional MARL agents by
+leveraging external knowledge and language for long-term planning and
+reasoning. Instead of fully sharing information from all past experiences,
+DAMCS introduces a multi-modal memory system organized as a hierarchical
+knowledge graph and a structured communication protocol to optimize agent
+cooperation. This allows agents to reason from past interactions and share
+relevant information efficiently. Experiments on novel multi-agent open-world
+tasks show that DAMCS outperforms both MARL and LLM baselines in task
+efficiency and collaboration. Compared to single-agent scenarios, the two-agent
+scenario achieves the same goal with 63% fewer steps, and the six-agent
+scenario with 74% fewer steps, highlighting the importance of adaptive memory
+and structured communication in achieving long-term goals. We publicly release
+our project at: https://happyeureka.github.io/damcs.
 
-摘要：大型語言模型 (LLM) 的最新突破提供了前所未有的自然語言理解和生成能力。然而，現有關於生物醫學中 LLM 的調查通常專注於特定應用或模型架構，缺乏整合各種生物醫學領域最新進展的全面分析。本綜述基於對來自 PubMed、Web of Science 和 arXiv 等數據庫的 484 篇出版物的分析，深入探討了生物醫學中 LLM 的當前現況、應用、挑戰和前景，其特點是關注這些模型在現實世界生物醫學背景中的實際應用。首先，我們探討了 LLM 在廣泛的生物醫學任務中的零次學習能力，包括診斷輔助、藥物發現和個性化醫療等，並從 137 項關鍵研究中汲取見解。然後，我們討論了 LLM 的適應策略，包括單模態和多模態 LLM 的微調方法，以增強它們在零次學習無法實現的專業生物醫學背景中的性能，例如醫療問題解答和生物醫學文獻的有效處理。最後，我們討論了 LLM 在生物醫學領域面臨的挑戰，包括數據隱私問題、模型可解釋性有限、數據集質量問題以及由於生物醫學數據的敏感性、對高度可靠模型輸出的需求以及在醫療保健中部署 AI 的倫理影響而產生的倫理問題。為了應對這些挑戰，我們還確定了生物醫學中 LLM 未來的研究方向，包括用於保護數據隱私的聯合學習方法以及整合可解釋 AI 方法以增強 LLM 的透明度。
+摘要：<paragraph>在動態開放世界情境中開發用於長期合作的智慧代理是多重代理系統中的一項重大挑戰。傳統的多重代理強化學習 (MARL) 框架，例如集中式訓練去中心化執行 (CTDE)，在可擴充性和靈活性方面面臨困難。它們需要集中式長期規劃，這在沒有自訂獎勵函數的情況下很難執行，並且在處理多模式數據時會面臨挑戰。CTDE 方法還假設固定的合作策略，這使得它們在代理需要獨立適應和規劃的動態環境中不切實際。為了解決分散式多重代理合作問題，我們在一個新穎的多重代理工匠環境中提出了分散式自適應知識圖譜記憶體和結構化通訊系統 (DAMCS)。我們的生成代理由大型語言模型 (LLM) 提供支援，透過利用外部知識和語言進行長期規劃和推理，比傳統的 MARL 代理更具可擴充性。DAMCS 沒有完全分享來自所有過去經驗的資訊，而是引入了多模式記憶體系統，該系統組織成階層式知識圖譜和結構化通訊協定，以最佳化代理合作。這允許代理根據過去的互動進行推理並有效地分享相關資訊。在新的多重代理開放世界任務上的實驗表明，DAMCS 在任務效率和協作方面優於 MARL 和 LLM 基準。與單一代理情境相比，雙重代理情境以少 63% 的步驟達成相同的目標，而六重代理情境則以少 74% 的步驟達成目標，突顯了自適應記憶體和結構化通訊在達成長期目標中的重要性。我們公開發布我們的專案於：https://happyeureka.github.io/damcs。</paragraph>
 
-##### **Aligning XAI with EU Regulations for Smart Biomedical Devices: A Methodology for Compliance Analysis**
-2408.15121v1 by Francesco Sovrano, Michael Lognoul, Giulia Vilone
+##### **SAMGPT: Text-free Graph Foundation Model for Multi-domain Pre-training and Cross-domain Adaptation**
+2502.05424v1 by Xingtong Yu, Zechuan Gong, Chang Zhou, Yuan Fang, Hui Zhang
 
-Significant investment and development have gone into integrating Artificial
-Intelligence (AI) in medical and healthcare applications, leading to advanced
-control systems in medical technology. However, the opacity of AI systems
-raises concerns about essential characteristics needed in such sensitive
-applications, like transparency and trustworthiness. Our study addresses these
-concerns by investigating a process for selecting the most adequate Explainable
-AI (XAI) methods to comply with the explanation requirements of key EU
-regulations in the context of smart bioelectronics for medical devices. The
-adopted methodology starts with categorising smart devices by their control
-mechanisms (open-loop, closed-loop, and semi-closed-loop systems) and delving
-into their technology. Then, we analyse these regulations to define their
-explainability requirements for the various devices and related goals.
-Simultaneously, we classify XAI methods by their explanatory objectives. This
-allows for matching legal explainability requirements with XAI explanatory
-goals and determining the suitable XAI algorithms for achieving them. Our
-findings provide a nuanced understanding of which XAI algorithms align better
-with EU regulations for different types of medical devices. We demonstrate this
-through practical case studies on different neural implants, from chronic
-disease management to advanced prosthetics. This study fills a crucial gap in
-aligning XAI applications in bioelectronics with stringent provisions of EU
-regulations. It provides a practical framework for developers and researchers,
-ensuring their AI innovations advance healthcare technology and adhere to legal
-and ethical standards.
+Graphs are able to model interconnected entities in many online services,
+supporting a wide range of applications on the Web. This raises an important
+question: How can we train a graph foundational model on multiple source
+domains and adapt to an unseen target domain? A major obstacle is that graphs
+from different domains often exhibit divergent characteristics. Some studies
+leverage large language models to align multiple domains based on textual
+descriptions associated with the graphs, limiting their applicability to
+text-attributed graphs. For text-free graphs, a few recent works attempt to
+align different feature distributions across domains, while generally
+neglecting structural differences. In this work, we propose a novel Structure
+Alignment framework for text-free Multi-domain Graph Pre-Training and
+cross-domain adaptation (SAMGPT). It is designed to learn multi-domain
+knowledge from graphs originating in multiple source domains, which can then be
+adapted to address applications in an unseen target domain. Specifically, we
+introduce a set of structure tokens to harmonize structure-based aggregation
+across source domains during the pre-training phase. Next, for cross-domain
+adaptation, we design dual prompts, namely, holistic prompts and specific
+prompts, which adapt unified multi-domain structural knowledge and
+fine-grained, domain-specific information, respectively, to a target domain.
+Finally, we conduct comprehensive experiments on seven public datasets to
+evaluate and analyze the effectiveness of SAMGPT.
 
-摘要：人工智慧（AI）在醫療和保健應用中投入了大量的投資和開發，進而導致醫療技術中的先進控制系統。然而，AI 系統的不透明性引發了對此類敏感應用中所需基本特性的擔憂，例如透明度和可信度。我們的研究透過調查一個程序來解決這些問題，用於選擇最充分的可解釋 AI（XAI）方法，以符合歐盟法規在醫療器材的智慧型生物電子學中的說明要求。採用的方法從透過其控制機制（開迴路、閉迴路和半閉迴路系統）對智慧型裝置進行分類，並深入探討其技術開始。然後，我們分析這些法規以定義其對各種裝置和相關目標的可解釋性要求。同時，我們透過其說明目標對 XAI 方法進行分類。這允許將法律可解釋性要求與 XAI 說明目標相匹配，並確定適當的 XAI 演算法來達成它們。我們的研究結果提供了對哪些 XAI 演算法更符合歐盟法規以適用於不同類型的醫療器材的細緻理解。我們透過不同神經植入物的實際案例研究來證明這一點，從慢性疾病管理到先進的義肢。這項研究填補了將生物電子學中的 XAI 應用與歐盟法規的嚴格規定相符的重要空白。它為開發人員和研究人員提供了一個實用的架構，確保其 AI 創新能促進醫療技術並遵守法律和道德標準。
+摘要：圖表能夠在許多線上服務中對相互關聯的實體進行建模，
+支援網路上廣泛的應用程式。這提出了重要的問題：我們如何針對多個來源網域訓練圖表基礎模型，並適應未見過的目標網域？一個主要的障礙是，來自不同網域的圖表通常表現出不同的特性。一些研究利用大型語言模型，根據與圖表相關的文字描述，對齊多個網域，限制其適用性於有文字屬性的圖表。對於沒有文字的圖表，最近的一些作品嘗試對齊跨網域的不同特徵分佈，同時通常忽略結構上的差異。在這項工作中，我們提出了一個新的結構對齊框架，用於無文字多網域圖表預訓練和跨網域適應 (SAMGPT)。它被設計為從起源於多個來源網域的圖表中學習多網域知識，然後可以適應於未見過的目標網域中的應用程式。具體來說，我們引入了一組結構化代碼，以在預訓練階段，調和跨來源網域的基於結構的聚合。接下來，對於跨網域適應，我們設計了雙重提示，即整體提示和具體提示，分別將統一的多網域結構知識和細緻的、特定於網域的資訊適應到目標網域。最後，我們在七個公共資料集上進行了全面的實驗，以評估和分析 SAMGPT 的有效性。
 
-##### **Towards Case-based Interpretability for Medical Federated Learning**
-2408.13626v1 by Laura Latorre, Liliana Petrychenko, Regina Beets-Tan, Taisiya Kopytova, Wilson Silva
+##### **Graph-based Molecular In-context Learning Grounded on Morgan Fingerprints**
+2502.05414v1 by Ali Al-Lawati, Jason Lucas, Zhiwei Zhang, Prasenjit Mitra, Suhang Wang
 
-We explore deep generative models to generate case-based explanations in a
-medical federated learning setting. Explaining AI model decisions through
-case-based interpretability is paramount to increasing trust and allowing
-widespread adoption of AI in clinical practice. However, medical AI training
-paradigms are shifting towards federated learning settings in order to comply
-with data protection regulations. In a federated scenario, past data is
-inaccessible to the current user. Thus, we use a deep generative model to
-generate synthetic examples that protect privacy and explain decisions. Our
-proof-of-concept focuses on pleural effusion diagnosis and uses publicly
-available Chest X-ray data.
+In-context learning (ICL) effectively conditions large language models (LLMs)
+for molecular tasks, such as property prediction and molecule captioning, by
+embedding carefully selected demonstration examples into the input prompt. This
+approach avoids the computational overhead of extensive pertaining and
+fine-tuning. However, current prompt retrieval methods for molecular tasks have
+relied on molecule feature similarity, such as Morgan fingerprints, which do
+not adequately capture the global molecular and atom-binding relationships. As
+a result, these methods fail to represent the full complexity of molecular
+structures during inference. Moreover, small-to-medium-sized LLMs, which offer
+simpler deployment requirements in specialized systems, have remained largely
+unexplored in the molecular ICL literature. To address these gaps, we propose a
+self-supervised learning technique, GAMIC (Graph-Aligned Molecular In-Context
+learning, which aligns global molecular structures, represented by graph neural
+networks (GNNs), with textual captions (descriptions) while leveraging local
+feature similarity through Morgan fingerprints. In addition, we introduce a
+Maximum Marginal Relevance (MMR) based diversity heuristic during retrieval to
+optimize input prompt demonstration samples. Our experimental findings using
+diverse benchmark datasets show GAMIC outperforms simple Morgan-based ICL
+retrieval methods across all tasks by up to 45%.
 
-摘要：我們探索深度生成模型，在醫療聯邦學習設置中生成基於案例的說明。透過基於案例的可解釋性來解釋 AI 模型決策，對於增加信任並允許 AI 在臨床實務中廣泛採用至關重要。然而，醫療 AI 訓練範例正轉向聯邦學習設置，以符合資料保護法規。在聯邦情境中，過去的資料對目前的使用者而言是無法取得的。因此，我們使用深度生成模型來產生保護隱私和解釋決策的合成範例。我們的概念驗證著重於胸腔積液診斷，並使用公開可取得的胸部 X 光資料。
+摘要：<paragraph>情境學習 (ICL) 有效地調整大型語言模型 (LLM)，以執行分子任務，例如屬性預測和分子標題，方法是將仔細挑選的示範範例嵌入輸入提示中。這種方法避免了廣泛相關和微調的計算開銷。然而，目前針對分子任務的提示檢索方法依賴於分子特徵相似性，例如 Morgan 指紋，而無法充分捕捉全局分子和原子鍵結關係。因此，這些方法無法在推理過程中表示分子結構的完整複雜性。此外，在專業系統中提供更簡單部署需求的小到中型的 LLM，在分子 ICL 文獻中仍未得到充分探索。為了解決這些差距，我們提出了一種自我監督學習技術，GAMIC（圖形對齊分子情境學習），它將由圖形神經網路 (GNN) 表示的全局分子結構與文字標題（描述）對齊，同時透過 Morgan 指紋利用局部特徵相似性。此外，我們在檢索過程中引入了一個基於最大邊際相關性 (MMR) 的多樣性啟發法，以最佳化輸入提示示範樣本。我們使用不同的基準資料集進行的實驗結果顯示，GAMIC 在所有任務中都優於基於 Morgan 的簡單 ICL 檢索方法，最多可達 45%。</paragraph>
 
-##### **AI in radiological imaging of soft-tissue and bone tumours: a systematic review evaluating against CLAIM and FUTURE-AI guidelines**
-2408.12491v1 by Douwe J. Spaanderman, Matthew Marzetti, Xinyi Wan, Andrew F. Scarsbrook, Philip Robinson, Edwin H. G. Oei, Jacob J. Visser, Robert Hemke, Kirsten van Langevelde, David F. Hanff, Geert J. L. H. van Leenders, Cornelis Verhoef, Dirk J. Gruühagen, Wiro J. Niessen, Stefan Klein, Martijn P. A. Starmans
+##### **Knowledge Graph-Guided Retrieval Augmented Generation**
+2502.06864v1 by Xiangrong Zhu, Yuexiang Xie, Yi Liu, Yaliang Li, Wei Hu
 
-Soft-tissue and bone tumours (STBT) are rare, diagnostically challenging
-lesions with variable clinical behaviours and treatment approaches. This
-systematic review provides an overview of Artificial Intelligence (AI) methods
-using radiological imaging for diagnosis and prognosis of these tumours,
-highlighting challenges in clinical translation, and evaluating study alignment
-with the Checklist for AI in Medical Imaging (CLAIM) and the FUTURE-AI
-international consensus guidelines for trustworthy and deployable AI to promote
-the clinical translation of AI methods. The review covered literature from
-several bibliographic databases, including papers published before 17/07/2024.
-Original research in peer-reviewed journals focused on radiology-based AI for
-diagnosing or prognosing primary STBT was included. Exclusion criteria were
-animal, cadaveric, or laboratory studies, and non-English papers. Abstracts
-were screened by two of three independent reviewers for eligibility. Eligible
-papers were assessed against guidelines by one of three independent reviewers.
-The search identified 15,015 abstracts, from which 325 articles were included
-for evaluation. Most studies performed moderately on CLAIM, averaging a score
-of 28.9$\pm$7.5 out of 53, but poorly on FUTURE-AI, averaging 5.1$\pm$2.1 out
-of 30. Imaging-AI tools for STBT remain at the proof-of-concept stage,
-indicating significant room for improvement. Future efforts by AI developers
-should focus on design (e.g. define unmet clinical need, intended clinical
-setting and how AI would be integrated in clinical workflow), development (e.g.
-build on previous work, explainability), evaluation (e.g. evaluating and
-addressing biases, evaluating AI against best practices), and data
-reproducibility and availability (making documented code and data publicly
-available). Following these recommendations could improve clinical translation
-of AI methods.
+Retrieval-augmented generation (RAG) has emerged as a promising technology
+for addressing hallucination issues in the responses generated by large
+language models (LLMs). Existing studies on RAG primarily focus on applying
+semantic-based approaches to retrieve isolated relevant chunks, which ignore
+their intrinsic relationships. In this paper, we propose a novel Knowledge
+Graph-Guided Retrieval Augmented Generation (KG$^2$RAG) framework that utilizes
+knowledge graphs (KGs) to provide fact-level relationships between chunks,
+improving the diversity and coherence of the retrieved results. Specifically,
+after performing a semantic-based retrieval to provide seed chunks, KG$^2$RAG
+employs a KG-guided chunk expansion process and a KG-based chunk organization
+process to deliver relevant and important knowledge in well-organized
+paragraphs. Extensive experiments conducted on the HotpotQA dataset and its
+variants demonstrate the advantages of KG$^2$RAG compared to existing RAG-based
+approaches, in terms of both response quality and retrieval quality.
 
-摘要：軟組織和骨骼腫瘤（STBT）是罕見、診斷具有挑戰性的病灶，其臨床行為和治療方法各不相同。這篇系統性回顧提供了使用放射影像進行診斷和預後的人工智慧 (AI) 方法的概觀，重點說明了臨床轉譯的挑戰，並評估研究與醫療影像 AI 核查表 (CLAIM) 和 FUTURE-AI 可信賴且可部署 AI 的國際共識準則的一致性，以促進 AI 方法的臨床轉譯。這篇回顧涵蓋了幾個書目資料庫中的文獻，包括在 2024 年 7 月 17 日之前發表的論文。納入了以放射為基礎的 AI 診斷或預後原發性 STBT 的同行評審期刊中的原始研究。排除標準是動物、屍體或實驗室研究，以及非英文論文。摘要由三位獨立審查員中的兩位篩選資格。合格的論文由三位獨立審查員中的一位根據準則進行評估。搜索識別出 15,015 篇摘要，其中 325 篇文章被納入評估。大多數研究在 CLAIM 中表現中等，平均得分為 53 分中的 28.9±7.5 分，但在 FUTURE-AI 中表現不佳，平均得分為 30 分中的 5.1±2.1 分。STBT 的影像 AI 工具仍處於概念驗證階段，表明有顯著的改進空間。AI 開發人員未來的努力應集中在設計（例如定義未滿足的臨床需求、預期的臨床環境以及 AI 如何整合到臨床工作流程中）、開發（例如建立在先前的工作、可解釋性）、評估（例如評估和解決偏差、評估 AI 與最佳實務）、以及數據可複製性和可用性（公開提供文件化的代碼和數據）。遵循這些建議可以改善 AI 方法的臨床轉譯。
+摘要：檢索增強生成 (RAG) 已成為一項有前途的技術，用於解決大型語言模型 (LLM) 所產生回應中的幻覺問題。現有關於 RAG 的研究主要專注於應用基於語義的方法來檢索孤立相關的區塊，而忽略它們的內在關係。在本文中，我們提出了一個新穎的知識圖表引導檢索增強生成 (KG$^2$RAG) 框架，它利用知識圖表 (KG) 來提供區塊之間的事實層級關係，從而提高檢索結果的多樣性和一致性。具體來說，在執行基於語義的檢索以提供種子區塊後，KG$^2$RAG 採用 KG 引導的區塊擴充程序和基於 KG 的區塊組織程序，以在組織良好的段落中傳達相關且重要的知識。在 HotpotQA 資料集及其變體上進行的大量實驗證明了 KG$^2$RAG 在回應品質和檢索品質方面優於現有的基於 RAG 的方法。
 
-##### **Evaluating Explainable AI Methods in Deep Learning Models for Early Detection of Cerebral Palsy**
-2409.00001v1 by Kimji N. Pellano, Inga Strümke, Daniel Groos, Lars Adde, Espen Alexander F. Ihlen
+##### **Can Large Language Models Understand Intermediate Representations?**
+2502.06854v1 by Hailong Jiang, Jianfeng Zhu, Yao Wan, Bo Fang, Hongyu Zhang, Ruoming Jin, Qiang Guan
 
-Early detection of Cerebral Palsy (CP) is crucial for effective intervention
-and monitoring. This paper tests the reliability and applicability of
-Explainable AI (XAI) methods using a deep learning method that predicts CP by
-analyzing skeletal data extracted from video recordings of infant movements.
-Specifically, we use XAI evaluation metrics -- namely faithfulness and
-stability -- to quantitatively assess the reliability of Class Activation
-Mapping (CAM) and Gradient-weighted Class Activation Mapping (Grad-CAM) in this
-specific medical application. We utilize a unique dataset of infant movements
-and apply skeleton data perturbations without distorting the original dynamics
-of the infant movements. Our CP prediction model utilizes an ensemble approach,
-so we evaluate the XAI metrics performances for both the overall ensemble and
-the individual models. Our findings indicate that both XAI methods effectively
-identify key body points influencing CP predictions and that the explanations
-are robust against minor data perturbations. Grad-CAM significantly outperforms
-CAM in the RISv metric, which measures stability in terms of velocity. In
-contrast, CAM performs better in the RISb metric, which relates to bone
-stability, and the RRS metric, which assesses internal representation
-robustness. Individual models within the ensemble show varied results, and
-neither CAM nor Grad-CAM consistently outperform the other, with the ensemble
-approach providing a representation of outcomes from its constituent models.
+Intermediate Representations (IRs) are essential in compiler design and
+program analysis, yet their comprehension by Large Language Models (LLMs)
+remains underexplored. This paper presents a pioneering empirical study to
+investigate the capabilities of LLMs, including GPT-4, GPT-3, Gemma 2, LLaMA
+3.1, and Code Llama, in understanding IRs. We analyze their performance across
+four tasks: Control Flow Graph (CFG) reconstruction, decompilation, code
+summarization, and execution reasoning. Our results indicate that while LLMs
+demonstrate competence in parsing IR syntax and recognizing high-level
+structures, they struggle with control flow reasoning, execution semantics, and
+loop handling. Specifically, they often misinterpret branching instructions,
+omit critical IR operations, and rely on heuristic-based reasoning, leading to
+errors in CFG reconstruction, IR decompilation, and execution reasoning. The
+study underscores the necessity for IR-specific enhancements in LLMs,
+recommending fine-tuning on structured IR datasets and integration of explicit
+control flow models to augment their comprehension and handling of IR-related
+tasks.
+
+摘要：中間表徵 (IR) 在編譯器設計和程式分析中至關重要，但大型語言模型 (LLM) 對其理解仍未得到充分探討。本文提出了一項開創性的實證研究，以探討 LLM（包括 GPT-4、GPT-3、Gemma 2、LLaMA 3.1 和 Code Llama）理解 IR 的能力。我們分析了它們在四項任務中的表現：控制流程圖 (CFG) 重建、反編譯、程式碼摘要和執行推理。我們的結果表明，儘管 LLM 在解析 IR 語法和識別高階結構方面表現出能力，但它們在控制流程推理、執行語義和迴圈處理方面存在困難。具體而言，它們經常誤解分支指令、省略關鍵 IR 操作，並依賴於基於啟發式的推理，導致 CFG 重建、IR 反編譯和執行推理出現錯誤。這項研究強調了 LLM 中對 IR 特定的增強的必要性，建議對結構化的 IR 資料集進行微調，並整合明確的控制流程模型，以增強其對 IR 相關任務的理解和處理。
+
+##### **GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity?**
+2502.05252v1 by Yang Zhou, Hongyi Liu, Zhuoming Chen, Yuandong Tian, Beidi Chen
+
+Long-context large language models (LLMs) have recently shown strong
+performance in information retrieval and long-document QA. However, to tackle
+the most challenging intellectual problems, LLMs must reason effectively in
+long and complex contexts (e.g., frontier mathematical research). Studying how
+LLMs handle increasing reasoning complexity and context length is essential,
+yet existing benchmarks lack a solid basis for quantitative evaluation.
+Inspired by the abstraction of GSM-8K problems as computational graphs, and the
+ability to introduce noise by adding unnecessary nodes and edges, we develop a
+grade school math problem generator capable of producing arithmetic problems
+with infinite difficulty and context length under fine-grained control. Using
+our newly synthesized GSM-Infinite benchmark, we comprehensively evaluate
+existing LLMs. We find a consistent sigmoid decline in reasoning performance as
+complexity increases, along with a systematic inference scaling trend:
+exponentially increasing inference computation yields only linear performance
+gains. These findings underscore the fundamental limitations of current
+long-context LLMs and the key challenges in scaling reasoning capabilities. Our
+GSM-Infinite benchmark provides a scalable and controllable testbed for
+systematically studying and advancing LLM reasoning in long and complex
+contexts.
 
-摘要：腦性麻痺 (CP) 的早期偵測對於有效的介入和監測至關重要。本文測試了可解釋 AI (XAI) 方法的可靠性和適用性，使用深度學習方法，透過分析從嬰兒動作影片記錄中提取的骨骼資料來預測 CP。具體來說，我們使用 XAI 評估指標（即忠實度和穩定性）來量化評估類別激活映射 (CAM) 和梯度加權類別激活映射 (Grad-CAM) 在這個特定醫療應用中的可靠性。我們利用一個獨特的嬰兒動作資料集，並應用骨骼資料擾動，而不會扭曲嬰兒動作的原始動力。我們的 CP 預測模型利用整體方法，因此我們評估了整體整體和個別模型的 XAI 指標表現。我們的研究結果表明，兩種 XAI 方法都能有效識別影響 CP 預測的關鍵身體部位，並且這些解釋對於微小的資料擾動具有魯棒性。Grad-CAM 在 RISv 指標中顯著優於 CAM，該指標衡量速度方面的穩定性。相比之下，CAM 在 RISb 指標中表現得更好，該指標與骨骼穩定性有關，而 RRS 指標則評估內部表示的魯棒性。整體中的個別模型顯示出不同的結果，CAM 和 Grad-CAM 都不一致地優於另一種，整體方法提供了其組成模型結果的表示。
+摘要：長文本大型語言模型 (LLM) 最近在資訊檢索和長文件問答中展示了強大的效能。然而，若要解決最具挑戰性的智力問題，LLM 必須在長且複雜的脈絡中有效推理（例如，前沿數學研究）。研究 LLM 如何處理增加的推理複雜性和脈絡長度至關重要，但現有的基準缺乏定量評估的穩固基礎。受到 GSM-8K 問題抽象化為計算圖形的啟發，以及透過加入不必要的節點和邊緣來引入雜訊的能力，我們開發了一個小學數學問題產生器，能夠在細緻的控制下產生具有無限難度和脈絡長度的算術問題。使用我們新合成的 GSM-Infinite 基準，我們全面評估現有的 LLM。我們發現推理效能會隨著複雜性的增加而持續呈 S 形下降，並伴隨著系統性的推論縮放趨勢：指數增加的推論計算僅產生線性的效能增益。這些發現強調了當前長脈絡 LLM 的基本限制，以及擴展推理能力的主要挑戰。我們的 GSM-Infinite 基準提供了一個可擴充且可控的測試平台，用於系統性地研究和提升 LLM 在長且複雜脈絡中的推理能力。
 
-##### **MicroXercise: A Micro-Level Comparative and Explainable System for Remote Physical Therapy**
-2408.11837v1 by Hanchen David Wang, Nibraas Khan, Anna Chen, Nilanjan Sarkar, Pamela Wisniewski, Meiyi Ma
+##### **Causality can systematically address the monsters under the bench(marks)**
+2502.05085v1 by Felix Leeb, Zhijing Jin, Bernhard Schölkopf
 
-Recent global estimates suggest that as many as 2.41 billion individuals have
-health conditions that would benefit from rehabilitation services. Home-based
-Physical Therapy (PT) faces significant challenges in providing interactive
-feedback and meaningful observation for therapists and patients. To fill this
-gap, we present MicroXercise, which integrates micro-motion analysis with
-wearable sensors, providing therapists and patients with a comprehensive
-feedback interface, including video, text, and scores. Crucially, it employs
-multi-dimensional Dynamic Time Warping (DTW) and attribution-based explainable
-methods to analyze the existing deep learning neural networks in monitoring
-exercises, focusing on a high granularity of exercise. This synergistic
-approach is pivotal, providing output matching the input size to precisely
-highlight critical subtleties and movements in PT, thus transforming complex AI
-analysis into clear, actionable feedback. By highlighting these micro-motions
-in different metrics, such as stability and range of motion, MicroXercise
-significantly enhances the understanding and relevance of feedback for
-end-users. Comparative performance metrics underscore its effectiveness over
-traditional methods, such as a 39% and 42% improvement in Feature Mutual
-Information (FMI) and Continuity. MicroXercise is a step ahead in home-based
-physical therapy, providing a technologically advanced and intuitively helpful
-solution to enhance patient care and outcomes.
+Effective and reliable evaluation is essential for advancing empirical
+machine learning. However, the increasing accessibility of generalist models
+and the progress towards ever more complex, high-level tasks make systematic
+evaluation more challenging. Benchmarks are plagued by various biases,
+artifacts, or leakage, while models may behave unreliably due to poorly
+explored failure modes. Haphazard treatments and inconsistent formulations of
+such "monsters" can contribute to a duplication of efforts, a lack of trust in
+results, and unsupported inferences. In this position paper, we argue causality
+offers an ideal framework to systematically address these challenges. By making
+causal assumptions in an approach explicit, we can faithfully model phenomena,
+formulate testable hypotheses with explanatory power, and leverage principled
+tools for analysis. To make causal model design more accessible, we identify
+several useful Common Abstract Topologies (CATs) in causal graphs which help
+gain insight into the reasoning abilities in large language models. Through a
+series of case studies, we demonstrate how the precise yet pragmatic language
+of causality clarifies the strengths and limitations of a method and inspires
+new approaches for systematic progress.
 
-摘要：最近的全球估計表明，多達 24.1 億人有
-健康狀況可從復健服務中受益。居家
-物理治療 (PT) 在提供互動式
-回饋和有意義的觀察方面面臨重大挑戰，供治療師和患者使用。為了填補這
-個缺口，我們提出 MicroXercise，它將微動作分析與
-可穿戴式感測器整合在一起，為治療師和患者提供一個全面的
-回饋介面，包括影片、文字和分數。至關重要的是，它採用
-多維動態時間規整 (DTW) 和基於歸因的可解釋
-方法來分析監控運動中現有的深度學習神經網路，專注於運動的高粒度。這種協同
-方法至關重要，提供與輸入大小匹配的輸出，以精確地
-突出 PT 中關鍵的細微差別和動作，從而將複雜的 AI
-分析轉換為清晰、可操作的回饋。透過在不同指標中突顯這些微動作，例如穩定性和動作範圍，MicroXercise
-顯著提升最終使用者對回饋的理解和相關性。比較效能指標強調其優於
-傳統方法的有效性，例如特徵互惠資訊 (FMI) 和連續性分別提升了 39% 和 42%。MicroXercise 在居家
-物理治療方面更進一步，提供技術先進且直覺有用的
-解決方案，以提升患者照護和結果。
+摘要：有效的、可靠的評估對於推進經驗機器學習至關重要。然而，一般化模型的可及性日益提高，以及朝著更複雜、更高級別任務的進展，使得系統評估更具挑戰性。基準測試受到各種偏差、人工製品或洩漏的困擾，而模型由於探索不充分的故障模式而可能表現得不可靠。隨意處理和不一致的表述等「怪物」可能會導致重複工作、對結果缺乏信任以及不支援的推論。在本文中，我們論證因果關係提供了一個系統性解決這些挑戰的理想框架。通過在方法中明確因果假設，我們可以忠實地模擬現象，制定具有解釋力的可測試假設，並利用原則性的分析工具。為了使因果模型設計更易於使用，我們在因果圖中識別出幾個有用的通用抽象拓撲 (CAT)，有助於深入了解大型語言模型中的推理能力。通過一系列案例研究，我們展示了因果關係的精確但務實的語言如何釐清方法的優缺點，並激發系統進展的新方法。
 
-##### **The Literature Review Network: An Explainable Artificial Intelligence for Systematic Literature Reviews, Meta-analyses, and Method Development**
-2408.05239v1 by Joshua Morriss, Tod Brindle, Jessica Bah Rösman, Daniel Reibsamen, Andreas Enz
+##### **Adaptive Graph of Thoughts: Test-Time Adaptive Reasoning Unifying Chain, Tree, and Graph Structures**
+2502.05078v1 by Tushar Pandey, Ara Ghukasyan, Oktay Goktas, Santosh Kumar Radha
 
-Systematic literature reviews are the highest quality of evidence in
-research. However, the review process is hindered by significant resource and
-data constraints. The Literature Review Network (LRN) is the first of its kind
-explainable AI platform adhering to PRISMA 2020 standards, designed to automate
-the entire literature review process. LRN was evaluated in the domain of
-surgical glove practices using 3 search strings developed by experts to query
-PubMed. A non-expert trained all LRN models. Performance was benchmarked
-against an expert manual review. Explainability and performance metrics
-assessed LRN's ability to replicate the experts' review. Concordance was
-measured with the Jaccard index and confusion matrices. Researchers were
-blinded to the other's results until study completion. Overlapping studies were
-integrated into an LRN-generated systematic review. LRN models demonstrated
-superior classification accuracy without expert training, achieving 84.78% and
-85.71% accuracy. The highest performance model achieved high interrater
-reliability (k = 0.4953) and explainability metrics, linking 'reduce',
-'accident', and 'sharp' with 'double-gloving'. Another LRN model covered 91.51%
-of the relevant literature despite diverging from the non-expert's judgments (k
-= 0.2174), with the terms 'latex', 'double' (gloves), and 'indication'. LRN
-outperformed the manual review (19,920 minutes over 11 months), reducing the
-entire process to 288.6 minutes over 5 days. This study demonstrates that
-explainable AI does not require expert training to successfully conduct
-PRISMA-compliant systematic literature reviews like an expert. LRN summarized
-the results of surgical glove studies and identified themes that were nearly
-identical to the clinical researchers' findings. Explainable AI can accurately
-expedite our understanding of clinical practices, potentially revolutionizing
-healthcare research.
+Large Language Models (LLMs) have demonstrated impressive reasoning
+capabilities, yet their performance is highly dependent on the prompting
+strategy and model scale. While reinforcement learning and fine-tuning have
+been deployed to boost reasoning, these approaches incur substantial
+computational and data overhead. In this work, we introduce Adaptive Graph of
+Thoughts (AGoT), a dynamic, graph-based inference framework that enhances LLM
+reasoning solely at test time. Rather than relying on fixed-step methods like
+Chain of Thought (CoT) or Tree of Thoughts (ToT), AGoT recursively decomposes
+complex queries into structured subproblems, forming an dynamic directed
+acyclic graph (DAG) of interdependent reasoning steps. By selectively expanding
+only those subproblems that require further analysis, AGoT unifies the
+strengths of chain, tree, and graph paradigms into a cohesive framework that
+allocates computation where it is most needed. We validate our approach on
+diverse benchmarks spanning multi-hop retrieval, scientific reasoning, and
+mathematical problem-solving, achieving up to 46.2% improvement on scientific
+reasoning tasks (GPQA) - comparable to gains achieved through computationally
+intensive reinforcement learning approaches and outperforming state-of-the-art
+iterative approaches. These results suggest that dynamic decomposition and
+structured recursion offer a scalable, cost-effective alternative to
+post-training modifications, paving the way for more robust, general-purpose
+reasoning in LLMs.
 
-摘要：系統性文獻回顧是研究中證據品質最高的。然而，回顧過程受到顯著資源和資料限制的阻礙。文獻回顧網路 (LRN) 是第一個遵循 PRISMA 2020 標準的可解釋 AI 平台，旨在自動化整個文獻回顧過程。LRN 在外科手套實務領域中進行評估，使用專家開發的 3 個搜尋字串來查詢 PubMed。非專家訓練所有 LRN 模型。效能以專家手動回顧作為基準。可解釋性和效能指標評估 LRN 複製專家回顧的能力。一致性以 Jaccard 指數和混淆矩陣測量。研究人員在研究完成前對彼此的結果保密。重疊的研究整合到 LRN 生成的系統性回顧中。LRN 模型在沒有專家訓練的情況下展現出優異的分類準確率，達到 84.78% 和 85.71% 的準確率。效能最高的模型達到了高評分者間信賴度 (k = 0.4953) 和可解釋性指標，將「減少」、「意外」和「銳利」與「雙重戴手套」連結在一起。另一個 LRN 模型涵蓋了 91.51% 的相關文獻，儘管與非專家的判斷不同 (k = 0.2174)，但包含了「乳膠」、「雙重」（手套）和「適應症」等詞彙。LRN 優於手動回顧（11 個月超過 19,920 分鐘），將整個過程縮短為 5 天超過 288.6 分鐘。這項研究顯示，可解釋的 AI 不需要專家訓練即可成功進行專家等級的 PRISMA 相容系統性文獻回顧。LRN 總結了外科手套研究的結果，並找出與臨床研究人員發現幾乎相同的主题。可解釋的 AI 可以準確地加快我們對臨床實務的理解，有潛力革新醫療保健研究。
+摘要：大型語言模型 (LLM) 已展現令人印象深刻的推理能力，但其效能高度依賴於提示策略和模型規模。雖然強化學習和微調已被用於提升推理，但這些方法會造成大量的運算和資料開銷。在這項工作中，我們引入了「適應性思考圖」(AGoT)，一個動態的、基於圖形的推論架構，它僅在測試時就能增強 LLM 推理。AGoT 並非依賴於鏈式思考 (CoT) 或樹狀思考 (ToT) 等固定步驟方法，而是遞迴地將複雜的查詢分解成結構化的子問題，形成一個由相互依賴的推理步驟所組成的動態有向無環圖 (DAG)。透過選擇性地僅擴充那些需要進一步分析的子問題，AGoT 將鏈式、樹狀和圖形範例的優勢統一到一個緊密的架構中，將運算分配到最需要的地方。我們在跨越多重跳躍檢索、科學推理和數學問題解決等多樣基準上驗證了我們的做法，在科學推理任務 (GPQA) 上達到了高達 46.2% 的改進，這與透過運算密集的強化學習方法所獲得的增益相當，並且優於最先進的迭代方法。這些結果表明，動態分解和結構化遞迴提供了一個可擴充、具成本效益的替代方案，用於訓練後修改，為 LLM 中更強健、更通用的推理鋪平了道路。
 
-##### **Enhancing Medical Learning and Reasoning Systems: A Boxology-Based Comparative Analysis of Design Patterns**
-2408.02709v1 by Chi Him Ng
+##### **Enhancing Knowledge Graph Construction: Evaluating with Emphasis on Hallucination, Omission, and Graph Similarity Metrics**
+2502.05239v1 by Hussam Ghanem, Christophe Cruz
 
-This study analyzes hybrid AI systems' design patterns and their
-effectiveness in clinical decision-making using the boxology framework. It
-categorizes and copares various architectures combining machine learning and
-rule-based reasoning to provide insights into their structural foundations and
-healthcare applications. Addressing two main questions, how to categorize these
-systems againts established design patterns and how to extract insights through
-comparative analysis, the study uses design patterns from software engineering
-to understand and optimize healthcare AI systems. Boxology helps identify
-commonalities and create reusable solutions, enhancing these systems'
-scalability, reliability, and performance. Five primary architectures are
-examined: REML, MLRB, RBML, RMLT, and PERML. Each has unique strengths and
-weaknesses, highlighting the need for tailored approaches in clinical tasks.
-REML excels in high-accuracy prediction for datasets with limited data; MLRB in
-handling large datasets and complex data integration; RBML in explainability
-and trustworthiness; RMLT in managing high-dimensional data; and PERML, though
-limited in analysis, shows promise in urgent care scenarios. The study
-introduces four new patterns, creates five abstract categorization patterns,
-and refines those five further to specific systems. These contributions enhance
-Boxlogy's taxonomical organization and offer novel approaches to integrating
-expert knowledge with machine learning. Boxology's structured, modular apporach
-offers significant advantages in developing and analyzing hybrid AI systems,
-revealing commonalities, and promoting reusable solutions. In conclusion, this
-study underscores hybrid AI systems' crucial role in advancing healthcare and
-Boxology's potential to drive further innovation in AI integration, ultimately
-improving clinical decision support and patient outcomes.
+Recent advancements in large language models have demonstrated significant
+potential in the automated construction of knowledge graphs from unstructured
+text. This paper builds upon our previous work [16], which evaluated various
+models using metrics like precision, recall, F1 score, triple matching, and
+graph matching, and introduces a refined approach to address the critical
+issues of hallucination and omission. We propose an enhanced evaluation
+framework incorporating BERTScore for graph similarity, setting a practical
+threshold of 95% for graph matching. Our experiments focus on the Mistral
+model, comparing its original and fine-tuned versions in zero-shot and few-shot
+settings. We further extend our experiments using examples from the KELM-sub
+training dataset, illustrating that the fine-tuned model significantly improves
+knowledge graph construction accuracy while reducing the exact hallucination
+and omission. However, our findings also reveal that the fine-tuned models
+perform worse in generalization tasks on the KELM-sub dataset. This study
+underscores the importance of comprehensive evaluation metrics in advancing the
+state-of-the-art in knowledge graph construction from textual data.
 
-摘要：本研究使用盒子學框架分析混合人工智慧系統的設計模式及其在臨床決策中的有效性。它分類並比較結合機器學習和基於規則的推理的各種架構，以深入了解其結構基礎和醫療保健應用。針對兩個主要問題，如何根據既定的設計模式對這些系統進行分類，以及如何通過比較分析提取見解，本研究使用軟體工程中的設計模式來了解和優化醫療保健人工智慧系統。盒子學有助於識別共性並建立可重複使用的解決方案，從而增強這些系統的可擴充性、可靠性和效能。檢查了五種主要的架構：REML、MLRB、RBML、RMLT 和 PERML。每種架構都有獨特的優缺點，強調了在臨床任務中需要量身打造的方法。REML 在資料有限的資料集中表現出高精度的預測；MLRB 在處理大型資料集和複雜資料整合方面表現出色；RBML 在可解釋性和可信度方面表現出色；RMLT 在管理高維資料方面表現出色；而 PERML 儘管在分析方面有限，但在緊急照護場景中表現出潛力。本研究引入了四種新模式，建立了五種抽象分類模式，並進一步將這五種模式細化為具體的系統。這些貢獻增強了盒子學的分類組織，並提供了將專家知識與機器學習整合的新方法。盒子學的結構化、模組化方法在開發和分析混合人工智慧系統、揭示共性以及推廣可重複使用的解決方案方面具有顯著優勢。總之，本研究強調了混合人工智慧系統在推進醫療保健中的關鍵作用，以及盒子學在推動人工智慧整合進一步創新方面的潛力，最終改善臨床決策支援和患者的治療成果。
+摘要：大型語言模型的最新進展已證明在從非結構化文字自動建構知識圖譜方面具有顯著的潛力。本文建立在我們先前的研究 [16] 之上，該研究使用準確度、召回率、F1 分數、三元組匹配和圖形匹配等指標評估各種模型，並引入了一種改進的方法來解決幻覺和遺漏的關鍵問題。我們提出一個增強的評估框架，結合 BERTScore 來進行圖形相似性，並將圖形匹配的實際閾值設定為 95%。我們的實驗重點在 Mistral 模型上，比較其原始版本和微調版本在零次學習和少量學習的設定中。我們進一步使用 KELM-sub 訓練資料集中的範例來擴展我們的實驗，說明微調後的模型顯著提高了知識圖譜建構的準確度，同時減少了精確的幻覺和遺漏。然而，我們的研究結果也顯示，微調後的模型在 KELM-sub 資料集上的泛化任務表現較差。這項研究強調了全面評估指標在推進從文字資料建構知識圖譜的最新技術方面的重要性。
 
-##### **Bayesian Kolmogorov Arnold Networks (Bayesian_KANs): A Probabilistic Approach to Enhance Accuracy and Interpretability**
-2408.02706v1 by Masoud Muhammed Hassan
+##### **Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research**
+2502.04644v1 by Junde Wu, Jiayuan Zhu, Yuyuan Liu
 
-Because of its strong predictive skills, deep learning has emerged as an
-essential tool in many industries, including healthcare. Traditional deep
-learning models, on the other hand, frequently lack interpretability and omit
-to take prediction uncertainty into account two crucial components of clinical
-decision making. In order to produce explainable and uncertainty aware
-predictions, this study presents a novel framework called Bayesian Kolmogorov
-Arnold Networks (BKANs), which combines the expressive capacity of Kolmogorov
-Arnold Networks with Bayesian inference. We employ BKANs on two medical
-datasets, which are widely used benchmarks for assessing machine learning
-models in medical diagnostics: the Pima Indians Diabetes dataset and the
-Cleveland Heart Disease dataset. Our method provides useful insights into
-prediction confidence and decision boundaries and outperforms traditional deep
-learning models in terms of prediction accuracy. Moreover, BKANs' capacity to
-represent aleatoric and epistemic uncertainty guarantees doctors receive more
-solid and trustworthy decision support. Our Bayesian strategy improves the
-interpretability of the model and considerably minimises overfitting, which is
-important for tiny and imbalanced medical datasets, according to experimental
-results. We present possible expansions to further use BKANs in more
-complicated multimodal datasets and address the significance of these
-discoveries for future research in building reliable AI systems for healthcare.
-This work paves the way for a new paradigm in deep learning model deployment in
-vital sectors where transparency and reliability are crucial.
+We introduce Agentic Reasoning, a framework that enhances large language
+model (LLM) reasoning by integrating external tool-using agents. Unlike
+conventional LLM-based reasoning approaches, which rely solely on internal
+inference, Agentic Reasoning dynamically engages web search, code execution,
+and structured reasoning-context memory to solve complex problems requiring
+deep research and multi-step logical deduction. Our framework introduces the
+Mind Map agent, which constructs a structured knowledge graph to track logical
+relationships, improving deductive reasoning. Additionally, the integration of
+web-search and coding agents enables real-time retrieval and computational
+analysis, enhancing reasoning accuracy and decision-making. Evaluations on
+PhD-level scientific reasoning (GPQA) and domain-specific deep research tasks
+demonstrate that our approach significantly outperforms existing models,
+including leading retrieval-augmented generation (RAG) systems and
+closed-source LLMs. Moreover, our results indicate that agentic reasoning
+improves expert-level knowledge synthesis, test-time scalability, and
+structured problem-solving. The code is at:
+https://github.com/theworldofagents/Agentic-Reasoning.
 
-摘要：由於其強大的預測能力，深度學習已成為許多產業中不可或缺的工具，包括醫療保健。然而，傳統的深度學習模型通常缺乏可解釋性，並且忽略了將預測不確定性納入考量，而這兩個因素是臨床決策制定的關鍵組成部分。為了產生可解釋且具有不確定性意識的預測，本研究提出了一個名為貝氏柯爾莫哥洛夫阿諾德網路 (BKAN) 的新架構，它結合了柯爾莫哥洛夫阿諾德網路的表達能力與貝氏推論。我們在兩個醫學資料集上使用 BKAN，這些資料集是評估機器學習模型在醫學診斷中的廣泛使用基準：皮馬印第安人糖尿病資料集和克里夫蘭心臟病資料集。我們的模型提供了對預測信心和決策邊界的有益見解，並且在預測準確度方面優於傳統的深度學習模型。此外，BKAN 表現隨機和認識不確定性的能力，可確保醫生獲得更可靠且值得信賴的決策支援。根據實驗結果，我們的貝氏策略提高了模型的可解釋性，並大幅減少了過度擬合，這對於小型且不平衡的醫學資料集非常重要。我們提出了可能的擴充功能，以進一步將 BKAN 用於更複雜的多模式資料集，並探討這些發現對於未來建立可靠的醫療保健 AI 系統研究的重要性。這項工作為深度學習模型部署在透明度和可靠性至關重要的重要領域中開啟了一個新的典範。
+摘要：我們引入了代理推理，一個透過整合外部工具使用代理來增強大型語言模型 (LLM) 推理的框架。與僅依賴於內部推論的傳統基於 LLM 的推理方法不同，代理推理動態地運用網路搜尋、程式碼執行和結構化推理情境記憶來解決需要深入研究和多步驟邏輯推論的複雜問題。我們的框架引入了心智圖代理，它建立一個結構化的知識圖譜來追蹤邏輯關係，改善演繹推理。此外，整合網路搜尋和編碼代理能進行即時擷取和運算分析，增強推理準確度和決策制定。在博士等級科學推理 (GPQA) 和特定領域的深入研究任務上的評估顯示，我們的做法明顯優於現有模型，包括領先的檢索增強生成 (RAG) 系統和封閉原始碼 LLM。此外，我們的結果顯示，代理推理改進了專家級知識綜合、測試時間可擴充性和結構化問題解決。程式碼在：https://github.com/theworldofagents/Agentic-Reasoning。
 
-##### **MLtoGAI: Semantic Web based with Machine Learning for Enhanced Disease Prediction and Personalized Recommendations using Generative AI**
-2407.20284v1 by Shyam Dongre, Ritesh Chandra, Sonali Agarwal
+##### **Position-aware Automatic Circuit Discovery**
+2502.04577v1 by Tal Haklay, Hadas Orgad, David Bau, Aaron Mueller, Yonatan Belinkov
 
-In modern healthcare, addressing the complexities of accurate disease
-prediction and personalized recommendations is both crucial and challenging.
-This research introduces MLtoGAI, which integrates Semantic Web technology with
-Machine Learning (ML) to enhance disease prediction and offer user-friendly
-explanations through ChatGPT. The system comprises three key components: a
-reusable disease ontology that incorporates detailed knowledge about various
-diseases, a diagnostic classification model that uses patient symptoms to
-detect specific diseases accurately, and the integration of Semantic Web Rule
-Language (SWRL) with ontology and ChatGPT to generate clear, personalized
-health advice. This approach significantly improves prediction accuracy and
-ensures results that are easy to understand, addressing the complexity of
-diseases and diverse symptoms. The MLtoGAI system demonstrates substantial
-advancements in accuracy and user satisfaction, contributing to developing more
-intelligent and accessible healthcare solutions. This innovative approach
-combines the strengths of ML algorithms with the ability to provide
-transparent, human-understandable explanations through ChatGPT, achieving
-significant improvements in prediction accuracy and user comprehension. By
-leveraging semantic technology and explainable AI, the system enhances the
-accuracy of disease prediction and ensures that the recommendations are
-relevant and easily understood by individual patients. Our research highlights
-the potential of integrating advanced technologies to overcome existing
-challenges in medical diagnostics, paving the way for future developments in
-intelligent healthcare systems. Additionally, the system is validated using 200
-synthetic patient data records, ensuring robust performance and reliability.
+A widely used strategy to discover and understand language model mechanisms
+is circuit analysis. A circuit is a minimal subgraph of a model's computation
+graph that executes a specific task. We identify a gap in existing circuit
+discovery methods: they assume circuits are position-invariant, treating model
+components as equally relevant across input positions. This limits their
+ability to capture cross-positional interactions or mechanisms that vary across
+positions. To address this gap, we propose two improvements to incorporate
+positionality into circuits, even on tasks containing variable-length examples.
+First, we extend edge attribution patching, a gradient-based method for circuit
+discovery, to differentiate between token positions. Second, we introduce the
+concept of a dataset schema, which defines token spans with similar semantics
+across examples, enabling position-aware circuit discovery in datasets with
+variable length examples. We additionally develop an automated pipeline for
+schema generation and application using large language models. Our approach
+enables fully automated discovery of position-sensitive circuits, yielding
+better trade-offs between circuit size and faithfulness compared to prior work.
+
+摘要：廣泛用於發現和了解語言模型機制的策略是電路分析。電路是模型計算圖的最小子圖，可執行特定任務。我們找出電路發現方法中的一個缺口：它們假設電路與位置無關，將模型組件視為在輸入位置中同樣相關。這限制了它們捕捉跨位置互動或在不同位置中變化的機制的能力。為了解決這個缺口，我們提出兩項改進，將位置性納入電路中，即使在包含變長範例的任務中也是如此。首先，我們擴充邊緣屬性修補，一種基於梯度的電路發現方法，以區分符號位置。其次，我們引入了資料集架構的概念，它定義了在範例中具有類似語義的符號跨距，使我們可以在具有變長範例的資料集中進行與位置相關的電路發現。此外，我們開發了一個自動化管線，用於使用大型語言模型進行架構生成和應用。我們的做法能讓位置敏感電路的發現完全自動化，與先前的研究相比，在電路大小和忠實度之間產生了更好的權衡。
+
+##### **Heterogeneous Swarms: Jointly Optimizing Model Roles and Weights for Multi-LLM Systems**
+2502.04510v1 by Shangbin Feng, Zifeng Wang, Palash Goyal, Yike Wang, Weijia Shi, Huang Xia, Hamid Palangi, Luke Zettlemoyer, Yulia Tsvetkov, Chen-Yu Lee, Tomas Pfister
+
+We propose Heterogeneous Swarms, an algorithm to design multi-LLM systems by
+jointly optimizing model roles and weights. We represent multi-LLM systems as
+directed acyclic graphs (DAGs) of LLMs with topological message passing for
+collaborative generation. Given a pool of LLM experts and a utility function,
+Heterogeneous Swarms employs two iterative steps: role-step and weight-step.
+For role-step, we interpret model roles as learning a DAG that specifies the
+flow of inputs and outputs between LLMs. Starting from a swarm of random
+continuous adjacency matrices, we decode them into discrete DAGs, call the LLMs
+in topological order, evaluate on the utility function (e.g. accuracy on a
+task), and optimize the adjacency matrices with particle swarm optimization
+based on the utility score. For weight-step, we assess the contribution of
+individual LLMs in the multi-LLM systems and optimize model weights with swarm
+intelligence. We propose JFK-score to quantify the individual contribution of
+each LLM in the best-found DAG of the role-step, then optimize model weights
+with particle swarm optimization based on the JFK-score. Experiments
+demonstrate that Heterogeneous Swarms outperforms 15 role- and/or weight-based
+baselines by 18.5% on average across 12 tasks. Further analysis reveals that
+Heterogeneous Swarms discovers multi-LLM systems with heterogeneous model roles
+and substantial collaborative gains, and benefits from the diversity of
+language models.
 
-摘要：在現代醫療保健中，解決準確疾病預測和個性化建議的複雜性既至關重要又具有挑戰性。本研究引入了 MLtoGAI，它將語義網路技術與機器學習 (ML) 相結合，以增強疾病預測並透過 ChatGPT 提供使用者友善的說明。該系統包含三個關鍵組成部分：一個可重複使用的疾病本体，其中包含有關各種疾病的詳細知識；一個診斷分類模型，它使用患者症狀來準確檢測特定疾病；以及語義網路規則語言 (SWRL) 與本体和 ChatGPT 的整合，以產生清晰、個性化的健康建議。這種方法顯著提高了預測準確性，並確保了易於理解的結果，解決了疾病和不同症狀的複雜性。MLtoGAI 系統展示了準確性和使用者滿意度的實質性進步，有助於開發更智慧且更易於取得的醫療保健解決方案。這種創新的方法結合了 ML 演算法的優點，以及透過 ChatGPT 提供透明且人類可以理解的說明的能力，在預測準確性和使用者理解方面取得了顯著的進步。透過利用語義技術和可解釋的 AI，該系統提高了疾病預測的準確性，並確保了建議與個別患者相關且易於理解。我們的研究強調了整合先進技術以克服醫療診斷中現有挑戰的潛力，為智慧醫療保健系統的未來發展鋪路。此外，該系統使用 200 個合成患者資料記錄進行驗證，確保了穩健的效能和可靠性。
+摘要：<paragraph>我們提出異質群體，一種演算法，透過共同最佳化模型角色和權重來設計多 LLM 系統。我們將多 LLM 系統表示為 LLM 的有向非循環圖 (DAG)，並透過拓撲訊息傳遞進行協作產生。給定一組 LLM 專家和一個效用函數，異質群體使用兩個反覆步驟：角色步驟和權重步驟。對於角色步驟，我們將模型角色解釋為學習一個 DAG，它指定 LLM 之間輸入和輸出的流動。從一組隨機連續鄰接矩陣開始，我們將它們解碼為離散 DAG，以拓撲順序呼叫 LLM，根據效用函數（例如任務的準確度）進行評估，並根據效用分數使用粒子群最佳化最佳化鄰接矩陣。對於權重步驟，我們評估個別 LLM 在多 LLM 系統中的貢獻，並使用群體智慧最佳化模型權重。我們提出 JFK 分數來量化每個 LLM 在角色步驟中找到的最佳 DAG 中的個別貢獻，然後根據 JFK 分數使用粒子群最佳化最佳化模型權重。實驗表明，異質群體在 12 項任務中平均比 15 個基於角色和/或權重的基線高出 18.5%。進一步的分析表明，異質群體發現具有異質模型角色和大量協作收益的多 LLM 系統，並受益於語言模型的多樣性。</paragraph>
 
-##### **Introducing δ-XAI: a novel sensitivity-based method for local AI explanations**
-2407.18343v2 by Alessandro De Carlo, Enea Parimbelli, Nicola Melillo, Giovanna Nicora
+##### **MedRAG: Enhancing Retrieval-augmented Generation with Knowledge Graph-Elicited Reasoning for Healthcare Copilot**
+2502.04413v1 by Xuejiao Zhao, Siyan Liu, Su-Yin Yang, Chunyan Miao
 
-Explainable Artificial Intelligence (XAI) is central to the debate on
-integrating Artificial Intelligence (AI) and Machine Learning (ML) algorithms
-into clinical practice. High-performing AI/ML models, such as ensemble learners
-and deep neural networks, often lack interpretability, hampering clinicians'
-trust in their predictions. To address this, XAI techniques are being developed
-to describe AI/ML predictions in human-understandable terms. One promising
-direction is the adaptation of sensitivity analysis (SA) and global sensitivity
-analysis (GSA), which inherently rank model inputs by their impact on
-predictions. Here, we introduce a novel delta-XAI method that provides local
-explanations of ML model predictions by extending the delta index, a GSA
-metric. The delta-XAI index assesses the impact of each feature's value on the
-predicted output for individual instances in both regression and classification
-problems. We formalize the delta-XAI index and provide code for its
-implementation. The delta-XAI method was evaluated on simulated scenarios using
-linear regression models, with Shapley values serving as a benchmark. Results
-showed that the delta-XAI index is generally consistent with Shapley values,
-with notable discrepancies in models with highly impactful or extreme feature
-values. The delta-XAI index demonstrated higher sensitivity in detecting
-dominant features and handling extreme feature values. Qualitatively, the
-delta-XAI provides intuitive explanations by leveraging probability density
-functions, making feature rankings clearer and more explainable for
-practitioners. Overall, the delta-XAI method appears promising for robustly
-obtaining local explanations of ML model predictions. Further investigations in
-real-world clinical settings will be conducted to evaluate its impact on
-AI-assisted clinical workflows.
+Retrieval-augmented generation (RAG) is a well-suited technique for
+retrieving privacy-sensitive Electronic Health Records (EHR). It can serve as a
+key module of the healthcare copilot, helping reduce misdiagnosis for
+healthcare practitioners and patients. However, the diagnostic accuracy and
+specificity of existing heuristic-based RAG models used in the medical domain
+are inadequate, particularly for diseases with similar manifestations. This
+paper proposes MedRAG, a RAG model enhanced by knowledge graph (KG)-elicited
+reasoning for the medical domain that retrieves diagnosis and treatment
+recommendations based on manifestations. MedRAG systematically constructs a
+comprehensive four-tier hierarchical diagnostic KG encompassing critical
+diagnostic differences of various diseases. These differences are dynamically
+integrated with similar EHRs retrieved from an EHR database, and reasoned
+within a large language model. This process enables more accurate and specific
+decision support, while also proactively providing follow-up questions to
+enhance personalized medical decision-making. MedRAG is evaluated on both a
+public dataset DDXPlus and a private chronic pain diagnostic dataset (CPDD)
+collected from Tan Tock Seng Hospital, and its performance is compared against
+various existing RAG methods. Experimental results show that, leveraging the
+information integration and relational abilities of the KG, our MedRAG provides
+more specific diagnostic insights and outperforms state-of-the-art models in
+reducing misdiagnosis rates. Our code will be available at
+https://github.com/SNOWTEAM2023/MedRAG
 
-摘要：可解釋人工智慧 (XAI) 是將人工智慧 (AI) 和機器學習 (ML) 演算法整合到臨床實務中的辯論核心。高執行效能的 AI/ML 模型，例如整體學習器和深度神經網路，通常缺乏可解釋性，阻礙臨床醫生對其預測的信任。為了解決這個問題，正在開發 XAI 技術，以人類可以理解的術語描述 AI/ML 預測。一個有希望的方向是採用敏感度分析 (SA) 和全球敏感度分析 (GSA)，它們本質上會依據模型輸入對預測的影響來對其進行排名。在此，我們介紹一種新的 delta-XAI 方法，透過擴充 GSA 指標 delta 指數來提供 ML 模型預測的局部解釋。delta-XAI 指數評估每個特徵值對回歸和分類問題中個別例項的預測輸出之影響。我們將 delta-XAI 指數形式化，並提供其實作的程式碼。使用線性回歸模型對模擬情境評估 delta-XAI 方法，並以 Shapley 值作為基準。結果顯示 delta-XAI 指數通常與 Shapley 值一致，但在具有高度影響力或極端特徵值的模型中存在顯著差異。delta-XAI 指數在偵測主要特徵和處理極端特徵值方面表現出更高的敏感度。定性地來說，delta-XAI 透過利用機率密度函數提供直觀的解釋，使特徵排名更清晰且對從業人員來說更具可解釋性。總體而言，delta-XAI 方法對於穩健地取得 ML 模型預測的局部解釋似乎很有希望。將在真實世界的臨床環境中進行進一步調查，以評估其對 AI 輔助臨床工作流程的影響。
+摘要：檢索增強生成 (RAG) 是一種適用於檢索隱私敏感的電子健康記錄 (EHR) 的技術。它可以作為醫療保健副駕駛的一個關鍵模組，協助減少醫療保健從業人員和患者的誤診。然而，在醫療領域中使用的現有基於啟發法的 RAG 模型的診斷準確性和特異性不足，特別是對於具有類似表現的疾病。本文提出 MedRAG，一種由知識圖譜 (KG) 引發的推理增強的 RAG 模型，用於醫療領域，它根據表現檢索診斷和治療建議。MedRAG 系統性地構建了一個全面的四層階層式診斷 KG，涵蓋各種疾病的關鍵診斷差異。這些差異與從 EHR 資料庫中檢索到的類似 EHR 動態整合，並在大型語言模型中進行推理。這個過程可以實現更準確和具體的決策支援，同時主動提供後續問題，以增強個人化醫療決策制定。MedRAG 在公共資料集 DDXPlus 和從陳篤生醫院收集的私人慢性疼痛診斷資料集 (CPDD) 上進行評估，並將其效能與各種現有 RAG 方法進行比較。實驗結果顯示，利用 KG 的資訊整合和關係能力，我們的 MedRAG 提供了更具體的診斷見解，並在降低誤診率方面優於最先進的模型。我們的程式碼將在 https://github.com/SNOWTEAM2023/MedRAG 提供
 
-##### **Enhanced Deep Learning Methodologies and MRI Selection Techniques for Dementia Diagnosis in the Elderly Population**
-2407.17324v2 by Nikolaos Ntampakis, Konstantinos Diamantaras, Ioanna Chouvarda, Vasileios Argyriou, Panagiotis Sarigianndis
+##### **Ontology-Guided, Hybrid Prompt Learning for Generalization in Knowledge Graph Question Answering**
+2502.03992v1 by Longquan Jiang, Junbo Huang, Cedric Möller, Ricardo Usbeck
 
-Dementia, a debilitating neurological condition affecting millions worldwide,
-presents significant diagnostic challenges. In this work, we introduce a novel
-methodology for the classification of demented and non-demented elderly
-patients using 3D brain Magnetic Resonance Imaging (MRI) scans. Our approach
-features a unique technique for selectively processing MRI slices, focusing on
-the most relevant brain regions and excluding less informative sections. This
-methodology is complemented by a confidence-based classification committee
-composed of three custom deep learning models: Dem3D ResNet, Dem3D CNN, and
-Dem3D EfficientNet. These models work synergistically to enhance
-decision-making accuracy, leveraging their collective strengths. Tested on the
-Open Access Series of Imaging Studies(OASIS) dataset, our method achieved an
-impressive accuracy of 94.12%, surpassing existing methodologies. Furthermore,
-validation on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset
-confirmed the robustness and generalizability of our approach. The use of
-explainable AI (XAI) techniques and comprehensive ablation studies further
-substantiate the effectiveness of our techniques, providing insights into the
-decision-making process and the importance of our methodology. This research
-offers a significant advancement in dementia diagnosis, providing a highly
-accurate and efficient tool for clinical applications.
+Most existing Knowledge Graph Question Answering (KGQA) approaches are
+designed for a specific KG, such as Wikidata, DBpedia or Freebase. Due to the
+heterogeneity of the underlying graph schema, topology and assertions, most
+KGQA systems cannot be transferred to unseen Knowledge Graphs (KGs) without
+resource-intensive training data. We present OntoSCPrompt, a novel Large
+Language Model (LLM)-based KGQA approach with a two-stage architecture that
+separates semantic parsing from KG-dependent interactions. OntoSCPrompt first
+generates a SPARQL query structure (including SPARQL keywords such as SELECT,
+ASK, WHERE and placeholders for missing tokens) and then fills them with
+KG-specific information. To enhance the understanding of the underlying KG, we
+present an ontology-guided, hybrid prompt learning strategy that integrates KG
+ontology into the learning process of hybrid prompts (e.g., discrete and
+continuous vectors). We also present several task-specific decoding strategies
+to ensure the correctness and executability of generated SPARQL queries in both
+stages. Experimental results demonstrate that OntoSCPrompt performs as well as
+SOTA approaches without retraining on a number of KGQA datasets such as CWQ,
+WebQSP and LC-QuAD 1.0 in a resource-efficient manner and can generalize well
+to unseen domain-specific KGs like DBLP-QuAD and CoyPu KG Code:
+\href{https://github.com/LongquanJiang/OntoSCPrompt}{https://github.com/LongquanJiang/OntoSCPrompt}
 
-摘要：失智症是一種影響全球數百萬人的衰弱性神經疾病，在診斷上具有重大挑戰。在這項工作中，我們提出了一種新的方法，用於對失智和非失智老年患者進行分類，使用 3D 大腦磁振造影 (MRI) 掃描。我們的做法採用了一種獨特技術，用於選擇性處理 MRI 切片，重點關注最相關的大腦區域，並排除信息量較少的部分。這種方法由一個基於信心的分類委員會補充，該委員會由三個自定義深度學習模型組成：Dem3D ResNet、Dem3D CNN 和 Dem3D EfficientNet。這些模型協同工作以增強決策的準確性，利用它們的集體優勢。在影像研究開放存取系列 (OASIS) 資料集上進行測試，我們的模型達到了 94.12% 的驚人準確度，超過了現有方法。此外，在阿茲海默症神經影像倡議 (ADNI) 資料集上的驗證證實了我們方法的穩健性和普遍性。可解釋 AI (XAI) 技術和全面的消融研究進一步證實了我們技術的有效性，提供了對決策過程和我們方法重要性的見解。這項研究為失智症診斷提供了重大進展，為臨床應用提供了一個高度準確且高效的工具。
+摘要：現有的知識圖譜問答（KGQA）方法大多是為特定 KG 而設計的，例如 Wikidata、DBpedia 或 Freebase。由於底層圖形模式、拓撲和斷言的異質性，大多數 KGQA 系統無法在沒有資源密集型訓練資料的情況下轉移到未見過的知識圖譜（KG）。我們提出 OntoSCPrompt，這是一種基於大型語言模型（LLM）的新型 KGQA 方法，採用兩階段架構，將語義解析與依賴 KG 的互動分開。OntoSCPrompt 首先生成 SPARQL 查詢結構（包括 SPARQL 關鍵字，例如 SELECT、ASK、WHERE 和缺失令牌的佔位符），然後用 KG 特定的資訊填寫它們。為了增強對底層 KG 的理解，我們提出了一種由本体指導的混合提示學習策略，將 KG 本体整合到混合提示（例如，離散和連續向量）的學習過程中。我們還提出了多種特定任務的解碼策略，以確保在兩個階段中生成的 SPARQL 查詢的正確性和可執行性。實驗結果表明，OntoSCPrompt 在 CWQ、WebQSP 和 LC-QuAD 1.0 等多個 KGQA 資料集上執行時，效能與 SOTA 方法一樣好，且資源使用效率高，並且可以很好地概括到未見過的特定領域 KG，例如 DBLP-QuAD 和 CoyPu KG Code：
+\href{https://github.com/LongquanJiang/OntoSCPrompt}{https://github.com/LongquanJiang/OntoSCPrompt}
 
-##### **Using Large Language Models to Compare Explainable Models for Smart Home Human Activity Recognition**
-2408.06352v1 by Michele Fiori, Gabriele Civitarese, Claudio Bettini
+##### **Multimodal Medical Code Tokenizer**
+2502.04397v2 by Xiaorui Su, Shvat Messica, Yepeng Huang, Ruth Johnson, Lukas Fesser, Shanghua Gao, Faryad Sahneh, Marinka Zitnik
 
-Recognizing daily activities with unobtrusive sensors in smart environments
-enables various healthcare applications. Monitoring how subjects perform
-activities at home and their changes over time can reveal early symptoms of
-health issues, such as cognitive decline. Most approaches in this field use
-deep learning models, which are often seen as black boxes mapping sensor data
-to activities. However, non-expert users like clinicians need to trust and
-understand these models' outputs. Thus, eXplainable AI (XAI) methods for Human
-Activity Recognition have emerged to provide intuitive natural language
-explanations from these models. Different XAI methods generate different
-explanations, and their effectiveness is typically evaluated through user
-surveys, that are often challenging in terms of costs and fairness. This paper
-proposes an automatic evaluation method using Large Language Models (LLMs) to
-identify, in a pool of candidates, the best XAI approach for non-expert users.
-Our preliminary results suggest that LLM evaluation aligns with user surveys.
+Foundation models trained on patient electronic health records (EHRs) require
+tokenizing medical data into sequences of discrete vocabulary items. Existing
+tokenizers treat medical codes from EHRs as isolated textual tokens. However,
+each medical code is defined by its textual description, its position in
+ontological hierarchies, and its relationships to other codes, such as disease
+co-occurrences and drug-treatment associations. Medical vocabularies contain
+more than 600,000 codes with critical information for clinical reasoning. We
+introduce MedTok, a multimodal medical code tokenizer that uses the text
+descriptions and relational context of codes. MedTok processes text using a
+language model encoder and encodes the relational structure with a graph
+encoder. It then quantizes both modalities into a unified token space,
+preserving modality-specific and cross-modality information. We integrate
+MedTok into five EHR models and evaluate it on operational and clinical tasks
+across in-patient and out-patient datasets, including outcome prediction,
+diagnosis classification, drug recommendation, and risk stratification.
+Swapping standard EHR tokenizers with MedTok improves AUPRC across all EHR
+models, by 4.10% on MIMIC-III, 4.78% on MIMIC-IV, and 11.30% on EHRShot, with
+the largest gains in drug recommendation. Beyond EHR modeling, we demonstrate
+using MedTok tokenizer with medical QA systems. Our results demonstrate the
+potential of MedTok as a unified tokenizer for medical codes, improving
+tokenization for medical foundation models.
 
-摘要：藉由智慧環境中不引人注目的感測器辨識日常活動，能啟用各種醫療保健應用。監控受試者在家中如何執行活動，以及其隨著時間的變化，可以揭示健康問題的早期症狀，例如認知能力下降。此領域中的大多數方法都使用深度學習模型，這些模型通常被視為將感測器資料對應至活動的黑盒子。然而，非專家使用者（例如臨床醫師）需要信任並了解這些模型的輸出。因此，人類活動辨識的可解釋 AI (XAI) 方法應運而生，以提供來自這些模型的直覺自然語言說明。不同的 XAI 方法會產生不同的說明，而其有效性通常透過使用者調查來評估，這在成本和公平性方面通常具有挑戰性。本文提出使用大型語言模型 (LLM) 的自動評估方法，以在候選者中找出最適合非專家使用者的 XAI 方法。我們的初步結果表明，LLM 評估與使用者調查一致。
+摘要：<paragraph>在患者电子健康记录 (EHR) 上训练的基础模型需要将医学数据标记为离散词汇项序列。现有的标记器将 EHR 中的医学代码视为孤立的文本标记。然而，每个医学代码都由其文本描述、在本体层次结构中的位置以及与其他代码的关系（例如疾病共现和药物治疗关联）来定义。医学词汇表包含超过 600,000 个代码，这些代码包含临床推理的关键信息。我们引入了 MedTok，这是一种多模态医学代码标记器，它使用文本描述和代码的关系上下文。MedTok 使用语言模型编码器处理文本，并使用图编码器对关系结构进行编码。然后，它将这两种模态量化为一个统一的标记空间，保留特定于模态和跨模态的信息。我们将 MedTok 集成到五个 EHR 模型中，并在住院和门诊数据集（包括结果预测、诊断分类、药物推荐和风险分层）上对其实施操作和临床任务进行评估。用 MedTok 替换标准 EHR 标记器可提高所有 EHR 模型的 AUPRC，在 MIMIC-III 上提高 4.10%，在 MIMIC-IV 上提高 4.78%，在 EHRShot 上提高 11.30%，其中药物推荐的增益最大。除了 EHR 建模之外，我们还演示了将 MedTok 标记器与医学问答系统结合使用。我们的结果证明了 MedTok 作为医学代码的统一标记器的潜力，改进了医学基础模型的标记化。</paragraph>
 
-##### **Explainable AI-based Intrusion Detection System for Industry 5.0: An Overview of the Literature, associated Challenges, the existing Solutions, and Potential Research Directions**
-2408.03335v1 by Naseem Khan, Kashif Ahmad, Aref Al Tamimi, Mohammed M. Alani, Amine Bermak, Issa Khalil
+##### **Division-of-Thoughts: Harnessing Hybrid Language Model Synergy for Efficient On-Device Agents**
+2502.04392v1 by Chenyang Shao, Xinyuan Hu, Yutang Lin, Fengli Xu
 
-Industry 5.0, which focuses on human and Artificial Intelligence (AI)
-collaboration for performing different tasks in manufacturing, involves a
-higher number of robots, Internet of Things (IoTs) devices and
-interconnections, Augmented/Virtual Reality (AR), and other smart devices. The
-huge involvement of these devices and interconnection in various critical
-areas, such as economy, health, education and defense systems, poses several
-types of potential security flaws. AI itself has been proven a very effective
-and powerful tool in different areas of cybersecurity, such as intrusion
-detection, malware detection, and phishing detection, among others. Just as in
-many application areas, cybersecurity professionals were reluctant to accept
-black-box ML solutions for cybersecurity applications. This reluctance pushed
-forward the adoption of eXplainable Artificial Intelligence (XAI) as a tool
-that helps explain how decisions are made in ML-based systems. In this survey,
-we present a comprehensive study of different XAI-based intrusion detection
-systems for industry 5.0, and we also examine the impact of explainability and
-interpretability on Cybersecurity practices through the lens of Adversarial
-XIDS (Adv-XIDS) approaches. Furthermore, we analyze the possible opportunities
-and challenges in XAI cybersecurity systems for industry 5.0 that elicit future
-research toward XAI-based solutions to be adopted by high-stakes industry 5.0
-applications. We believe this rigorous analysis will establish a foundational
-framework for subsequent research endeavors within the specified domain.
+The rapid expansion of web content has made on-device AI assistants
+indispensable for helping users manage the increasing complexity of online
+tasks. The emergent reasoning ability in large language models offer a
+promising path for next-generation on-device AI agents. However, deploying
+full-scale Large Language Models (LLMs) on resource-limited local devices is
+challenging. In this paper, we propose Division-of-Thoughts (DoT), a
+collaborative reasoning framework leveraging the synergy between locally
+deployed Smaller-scale Language Models (SLMs) and cloud-based LLMs. DoT
+leverages a Task Decomposer to elicit the inherent planning abilities in
+language models to decompose user queries into smaller sub-tasks, which allows
+hybrid language models to fully exploit their respective strengths. Besides,
+DoT employs a Task Scheduler to analyze the pair-wise dependency of sub-tasks
+and create a dependency graph, facilitating parallel reasoning of sub-tasks and
+the identification of key steps. To allocate the appropriate model based on the
+difficulty of sub-tasks, DoT leverages a Plug-and-Play Adapter, which is an
+additional task head attached to the SLM that does not alter the SLM's
+parameters. To boost adapter's task allocation capability, we propose a
+self-reinforced training method that relies solely on task execution feedback.
+Extensive experiments on various benchmarks demonstrate that our DoT
+significantly reduces LLM costs while maintaining competitive reasoning
+accuracy. Specifically, DoT reduces the average reasoning time and API costs by
+66.12% and 83.57%, while achieving comparable reasoning accuracy with the best
+baseline methods.
 
-摘要：工業 5.0 著重於人類與人工智慧 (AI) 合作執行製造中的不同任務，涉及更多機器人、物聯網 (IoT) 裝置和互連、擴增/虛擬實境 (AR) 和其他智慧裝置。這些裝置和互連在經濟、醫療保健、教育和國防系統等各種關鍵領域的廣泛參與，引發了多種類型的潛在安全漏洞。AI 本身已被證明是網路安全不同領域中非常有效且強大的工具，例如入侵偵測、惡意軟體偵測和網路釣魚偵測等。就像在許多應用領域一樣，網路安全專業人員不願意接受黑盒 ML 解決方案來應用於網路安全。這種不願意促使可解釋人工智慧 (XAI) 作為一種工具被採用，有助於說明在基於 ML 的系統中如何做出決策。在這項調查中，我們對工業 5.0 的不同基於 XAI 的入侵偵測系統進行了全面的研究，並且我們也透過對抗式 XIDS (Adv-XIDS) 方法的觀點來探討可解釋性和可詮釋性對網路安全實務的影響。此外，我們分析了工業 5.0 的 XAI 網路安全系統中可能存在的機會和挑戰，引發了未來針對 XAI 基礎解決方案的研究，以供高風險的工業 5.0 應用採用。我們相信這項嚴謹的分析將為指定領域內的後續研究工作建立基礎架構。
+摘要：<paragraph>網頁內容快速擴充，使得行動裝置上的 AI 助理在協助使用者管理日益複雜的線上工作上變得不可或缺。大型語言模型中浮現的推理能力為新一代行動裝置上的 AI 代理提供了一條有希望的途徑。然而，在資源有限的本機裝置上部署全規模的大型語言模型 (LLM) 是一項挑戰。在本文中，我們提出了思想分工 (DoT)，一個協作推理框架，利用了本地部署的小型語言模型 (SLM) 與雲端 LLM 之間的協同效應。DoT 利用任務分解器引出語言模型中固有的規劃能力，將使用者查詢分解成較小的子任務，這允許混合語言模型充分發揮其各自的優勢。此外，DoT 雇用了一個任務排程器來分析子任務的成對依賴性並建立一個依賴性圖，促進子任務的並行推理和關鍵步驟的識別。為了根據子任務的難度分配適當的模型，DoT 利用了即插即用適配器，這是一個附加在 SLM 上的任務頭，不會改變 SLM 的參數。為了提升適配器的任務分配能力，我們提出了一種自我強化訓練方法，它僅依賴於任務執行回饋。在各種基準上的廣泛實驗表明，我們的 DoT 大幅降低了 LLM 成本，同時維持了有競爭力的推理準確度。具體來說，DoT 將平均推理時間和 API 成本分別降低了 66.12% 和 83.57%，同時達到了與最佳基準方法相當的推理準確度。</paragraph>
 
-##### **A Comparative Study on Automatic Coding of Medical Letters with Explainability**
-2407.13638v1 by Jamie Glen, Lifeng Han, Paul Rayson, Goran Nenadic
+##### **Boosting Knowledge Graph-based Recommendations through Confidence-Aware Augmentation with Large Language Models**
+2502.03715v1 by Rui Cai, Chao Wang, Qianyi Cai, Dazhong Shen, Hui Xiong
 
-This study aims to explore the implementation of Natural Language Processing
-(NLP) and machine learning (ML) techniques to automate the coding of medical
-letters with visualised explainability and light-weighted local computer
-settings. Currently in clinical settings, coding is a manual process that
-involves assigning codes to each condition, procedure, and medication in a
-patient's paperwork (e.g., 56265001 heart disease using SNOMED CT code). There
-are preliminary research on automatic coding in this field using
-state-of-the-art ML models; however, due to the complexity and size of the
-models, the real-world deployment is not achieved. To further facilitate the
-possibility of automatic coding practice, we explore some solutions in a local
-computer setting; in addition, we explore the function of explainability for
-transparency of AI models. We used the publicly available MIMIC-III database
-and the HAN/HLAN network models for ICD code prediction purposes. We also
-experimented with the mapping between ICD and SNOMED CT knowledge bases. In our
-experiments, the models provided useful information for 97.98\% of codes. The
-result of this investigation can shed some light on implementing automatic
-clinical coding in practice, such as in hospital settings, on the local
-computers used by clinicians , project page
-\url{https://github.com/Glenj01/Medical-Coding}.
+Knowledge Graph-based recommendations have gained significant attention due
+to their ability to leverage rich semantic relationships. However, constructing
+and maintaining Knowledge Graphs (KGs) is resource-intensive, and the accuracy
+of KGs can suffer from noisy, outdated, or irrelevant triplets. Recent
+advancements in Large Language Models (LLMs) offer a promising way to improve
+the quality and relevance of KGs for recommendation tasks. Despite this,
+integrating LLMs into KG-based systems presents challenges, such as efficiently
+augmenting KGs, addressing hallucinations, and developing effective joint
+learning methods. In this paper, we propose the Confidence-aware KG-based
+Recommendation Framework with LLM Augmentation (CKG-LLMA), a novel framework
+that combines KGs and LLMs for recommendation task. The framework includes: (1)
+an LLM-based subgraph augmenter for enriching KGs with high-quality
+information, (2) a confidence-aware message propagation mechanism to filter
+noisy triplets, and (3) a dual-view contrastive learning method to integrate
+user-item interactions and KG data. Additionally, we employ a confidence-aware
+explanation generation process to guide LLMs in producing realistic
+explanations for recommendations. Finally, extensive experiments demonstrate
+the effectiveness of CKG-LLMA across multiple public datasets.
 
-摘要：本研究旨在探討將自然語言處理 (NLP) 和機器學習 (ML) 技術實作於醫療信函編碼自動化，並具備視覺化說明能力和輕量化的本地電腦設定。目前在臨床環境中，編碼是一種手動流程，涉及為病患文件中的每項病症、程序和藥物指派代碼 (例如，使用 SNOMED CT 代碼 56265001 表示心臟病)。此領域有使用最新 ML 模型進行自動編碼的初步研究；然而，由於模型的複雜性和大小，並未實現實際部署。為了進一步促進自動編碼實務的可能性，我們在本地電腦設定中探討了一些解決方案；此外，我們探討了說明功能在 AI 模型透明度中的功能。我們使用公開的 MIMIC-III 資料庫和 HAN/HLAN 網路模型進行 ICD 代碼預測。我們還試驗了 ICD 和 SNOMED CT 知識庫之間的對應。在我們的實驗中，這些模型提供了 97.98% 代碼的有用資訊。這項調查結果可以為實務中的自動臨床編碼實作提供一些見解，例如在醫院環境中，由臨床醫生使用的本地電腦，專案頁面 \url{https://github.com/Glenj01/Medical-Coding}。
+摘要：基於知識圖譜的推薦因其利用豐富語義關係的能力而備受關注。然而，構建和維護知識圖譜 (KG) 是一項資源密集型任務，而 KG 的準確性可能會受到雜訊、過時或無關的三元組的影響。大型語言模型 (LLM) 的最新進展為提高 KG 在推薦任務中的品質和相關性提供了一種有前途的方法。儘管如此，將 LLM 整合到基於 KG 的系統中會帶來挑戰，例如有效擴充 KG、處理幻覺，以及開發有效的聯合學習方法。在本文中，我們提出具有 LLM 擴充的信心感知型基於 KG 的推薦框架 (CKG-LLMA)，這是一個結合 KG 和 LLM 進行推薦任務的新穎框架。該框架包括：(1) 一個基於 LLM 的子圖擴充器，用於使用高品質資訊豐富 KG，(2) 一個信心感知型訊息傳播機制，用於過濾雜訊三元組，以及 (3) 一個雙視圖對比學習方法，用於整合使用者-項目互動和 KG 資料。此外，我們採用一個信心感知型解釋產生程序，以引導 LLM 為推薦產生逼真的解釋。最後，大量的實驗證明了 CKG-LLMA 在多個公開資料集中的有效性。
 
-##### **Explainable AI for Enhancing Efficiency of DL-based Channel Estimation**
-2407.07009v1 by Abdul Karim Gizzini, Yahia Medjahdi, Ali J. Ghandour, Laurent Clavier
+##### **A Schema-Guided Reason-while-Retrieve framework for Reasoning on Scene Graphs with Large-Language-Models (LLMs)**
+2502.03450v1 by Yiye Chen, Harpreet Sawhney, Nicholas Gydé, Yanan Jian, Jack Saunders, Patricio Vela, Ben Lundell
 
-The support of artificial intelligence (AI) based decision-making is a key
-element in future 6G networks, where the concept of native AI will be
-introduced. Moreover, AI is widely employed in different critical applications
-such as autonomous driving and medical diagnosis. In such applications, using
-AI as black-box models is risky and challenging. Hence, it is crucial to
-understand and trust the decisions taken by these models. Tackling this issue
-can be achieved by developing explainable AI (XAI) schemes that aim to explain
-the logic behind the black-box model behavior, and thus, ensure its efficient
-and safe deployment. Recently, we proposed a novel perturbation-based XAI-CHEST
-framework that is oriented toward channel estimation in wireless
-communications. The core idea of the XAI-CHEST framework is to identify the
-relevant model inputs by inducing high noise on the irrelevant ones. This
-manuscript provides the detailed theoretical foundations of the XAI-CHEST
-framework. In particular, we derive the analytical expressions of the XAI-CHEST
-loss functions and the noise threshold fine-tuning optimization problem. Hence
-the designed XAI-CHEST delivers a smart input feature selection methodology
-that can further improve the overall performance while optimizing the
-architecture of the employed model. Simulation results show that the XAI-CHEST
-framework provides valid interpretations, where it offers an improved bit error
-rate performance while reducing the required computational complexity in
-comparison to the classical DL-based channel estimation.
+Scene graphs have emerged as a structured and serializable environment
+representation for grounded spatial reasoning with Large Language Models
+(LLMs). In this work, we propose SG-RwR, a Schema-Guided Retrieve-while-Reason
+framework for reasoning and planning with scene graphs. Our approach employs
+two cooperative, code-writing LLM agents: a (1) Reasoner for task planning and
+information queries generation, and a (2) Retriever for extracting
+corresponding graph information following the queries. Two agents collaborate
+iteratively, enabling sequential reasoning and adaptive attention to graph
+information. Unlike prior works, both agents are prompted only with the scene
+graph schema rather than the full graph data, which reduces the hallucination
+by limiting input tokens, and drives the Reasoner to generate reasoning trace
+abstractly.Following the trace, the Retriever programmatically query the scene
+graph data based on the schema understanding, allowing dynamic and global
+attention on the graph that enhances alignment between reasoning and retrieval.
+Through experiments in multiple simulation environments, we show that our
+framework surpasses existing LLM-based approaches in numerical Q\&A and
+planning tasks, and can benefit from task-level few-shot examples, even in the
+absence of agent-level demonstrations. Project code will be released.
 
-摘要：人工智能 (AI) 支持的決策制定是未來 6G 網路中的關鍵元素，其中將引入原生 AI 的概念。此外，AI 廣泛用於不同的關鍵應用中，例如自動駕駛和醫療診斷。在這些應用中，使用 AI 作為黑盒模型是有風險且具有挑戰性的。因此，理解和信任這些模型做出的決策至關重要。解決此問題的方法是開發可解釋 AI (XAI) 架構，旨在解釋黑盒模型行為背後的邏輯，從而確保其有效且安全的部署。最近，我們提出了一個新的基於擾動的 XAI-CHEST 框架，該框架面向無線通信中的信道估計。XAI-CHEST 框架的核心思想是通過在無關輸入上引入高噪聲來識別相關模型輸入。這份手稿提供了 XAI-CHEST 框架的詳細理論基礎。特別是，我們推導了 XAI-CHEST 損失函數和噪聲閾值微調優化問題的解析表達式。因此，設計的 XAI-CHEST 提供了一種智能輸入特徵選擇方法，可以在優化所用模型的架構的同時進一步提高整體性能。模擬結果表明，XAI-CHEST 框架提供了有效的解釋，在降低所需的計算複雜度的同時，提供了改進的比特錯誤率性能，而這與基於傳統 DL 的信道估計相比。
+摘要：場景圖表已成為大型語言模型 (LLM) 以基礎空間推理為基礎的結構化且可序列化的環境表徵。在這項工作中，我們提出 SG-RwR，一個以綱要為導向的檢索與推理框架，用於場景圖表的推理和規劃。我們的做法採用了兩個協作的、編寫程式碼的 LLM 代理：一個 (1) 推論器，用於任務規劃和資訊查詢產生，以及一個 (2) 檢索器，用於根據查詢提取對應的圖形資訊。兩個代理反覆合作，實現對圖形資訊的順序推理和適應性關注。與先前的作品不同，兩個代理僅提示場景圖表綱要，而不是完整的圖形資料，這透過限制輸入代碼減少了幻覺，並驅使推論器抽象地產生推理軌跡。根據軌跡，檢索器根據綱要理解以程式化方式查詢場景圖形資料，允許對圖形進行動態和整體關注，增強推理和檢索之間的一致性。透過在多個模擬環境中的實驗，我們表明我們的框架在數值問答和規劃任務中超越了現有的基於 LLM 的方法，並且可以受益於任務級別的少次範例，即使在沒有代理級別示範的情況下也是如此。專案程式碼將會釋出。
 
-##### **Explainable AI: Comparative Analysis of Normal and Dilated ResNet Models for Fundus Disease Classification**
-2407.05440v2 by P. N. Karthikayan, Yoga Sri Varshan V, Hitesh Gupta Kattamuri, Umarani Jayaraman
+##### **SymAgent: A Neural-Symbolic Self-Learning Agent Framework for Complex Reasoning over Knowledge Graphs**
+2502.03283v1 by Ben Liu, Jihai Zhang, Fangquan Lin, Cheng Yang, Min Peng, Wotao Yin
 
-This paper presents dilated Residual Network (ResNet) models for disease
-classification from retinal fundus images. Dilated convolution filters are used
-to replace normal convolution filters in the higher layers of the ResNet model
-(dilated ResNet) in order to improve the receptive field compared to the normal
-ResNet model for disease classification. This study introduces
-computer-assisted diagnostic tools that employ deep learning, enhanced with
-explainable AI techniques. These techniques aim to make the tool's
-decision-making process transparent, thereby enabling medical professionals to
-understand and trust the AI's diagnostic decision. They are particularly
-relevant in today's healthcare landscape, where there is a growing demand for
-transparency in AI applications to ensure their reliability and ethical use.
-The dilated ResNet is used as a replacement for the normal ResNet to enhance
-the classification accuracy of retinal eye diseases and reduce the required
-computing time. The dataset used in this work is the Ocular Disease Intelligent
-Recognition (ODIR) dataset which is a structured ophthalmic database with eight
-classes covering most of the common retinal eye diseases. The evaluation
-metrics used in this work include precision, recall, accuracy, and F1 score. In
-this work, a comparative study has been made between normal ResNet models and
-dilated ResNet models on five variants namely ResNet-18, ResNet-34, ResNet-50,
-ResNet-101, and ResNet-152. The dilated ResNet model shows promising results as
-compared to normal ResNet with an average F1 score of 0.71, 0.70, 0.69, 0.67,
-and 0.70 respectively for the above respective variants in ODIR multiclass
-disease classification.
+Recent advancements have highlighted that Large Language Models (LLMs) are
+prone to hallucinations when solving complex reasoning problems, leading to
+erroneous results. To tackle this issue, researchers incorporate Knowledge
+Graphs (KGs) to improve the reasoning ability of LLMs. However, existing
+methods face two limitations: 1) they typically assume that all answers to the
+questions are contained in KGs, neglecting the incompleteness issue of KGs, and
+2) they treat the KG as a static repository and overlook the implicit logical
+reasoning structures inherent in KGs. In this paper, we introduce SymAgent, an
+innovative neural-symbolic agent framework that achieves collaborative
+augmentation between KGs and LLMs. We conceptualize KGs as dynamic environments
+and transform complex reasoning tasks into a multi-step interactive process,
+enabling KGs to participate deeply in the reasoning process. SymAgent consists
+of two modules: Agent-Planner and Agent-Executor. The Agent-Planner leverages
+LLM's inductive reasoning capability to extract symbolic rules from KGs,
+guiding efficient question decomposition. The Agent-Executor autonomously
+invokes predefined action tools to integrate information from KGs and external
+documents, addressing the issues of KG incompleteness. Furthermore, we design a
+self-learning framework comprising online exploration and offline iterative
+policy updating phases, enabling the agent to automatically synthesize
+reasoning trajectories and improve performance. Experimental results
+demonstrate that SymAgent with weak LLM backbones (i.e., 7B series) yields
+better or comparable performance compared to various strong baselines. Further
+analysis reveals that our agent can identify missing triples, facilitating
+automatic KG updates.
 
-摘要：这篇论文提出了用于从视网膜眼底图像进行疾病分类的扩张残差网络 (ResNet) 模型。扩张卷积滤波器用于替换 ResNet 模型较高层中的正常卷积滤波器（扩张 ResNet），以改善感知场，从而针对疾病分类对正常 ResNet 模型进行改进。本研究引入了采用深度学习的计算机辅助诊断工具，并通过可解释的 AI 技术进行了增强。这些技术旨在使该工具的决策过程透明化，从而使医学专业人士能够理解和信任 AI 的诊断决策。它们与当今的医疗保健领域尤为相关，在该领域，对 AI 应用的透明度需求不断增长，以确保其可靠性和合乎道德的使用。扩张 ResNet 用作正常 ResNet 的替代品，以提高视网膜眼部疾病的分类准确性并减少所需的计算时间。本工作中使用的数据集是眼科疾病智能识别 (ODIR) 数据集，这是一个结构化的眼科数据库，包含八类涵盖大多数常见视网膜眼部疾病。本工作中使用的评估指标包括精确度、召回率、准确度和 F1 得分。在这项工作中，对 ResNet-18、ResNet-34、ResNet-50、ResNet-101 和 ResNet-152 五个变体的正常 ResNet 模型和扩张 ResNet 模型进行了比较研究。与正常 ResNet 相比，扩张 ResNet 模型显示出有希望的结果，在 ODIR 多类疾病分类中，上述各个变体的平均 F1 得分为 0.71、0.70、0.69、0.67 和 0.70。
+摘要：<paragraph>最近的研究表明，大型语言模型 (LLM) 在解决复杂的推理问题时容易出现幻觉，从而导致错误的结果。为了解决这个问题，研究人员结合了知识图谱 (KG) 来提高 LLM 的推理能力。然而，现有方法面临两个局限性：1) 它们通常假设问题的答案都包含在 KG 中，忽略了 KG 不完整的问题，2) 它们将 KG 视为一个静态存储库，而忽略了 KG 中固有的隐式逻辑推理结构。在本文中，我们介绍了 SymAgent，这是一个创新的神经符号代理框架，可以在 KG 和 LLM 之间实现协作增强。我们将 KG 概念化为动态环境，并将复杂的推理任务转化为一个多步骤的交互过程，使 KG 能够深入参与推理过程。SymAgent 由两个模块组成：Agent-Planner 和 Agent-Executor。Agent-Planner 利用 LLM 的归纳推理能力从 KG 中提取符号规则，指导高效的问题分解。Agent-Executor 自主调用预定义的动作工具来整合来自 KG 和外部文档的信息，解决 KG 不完整的问题。此外，我们设计了一个自学习框架，包括在线探索和离线迭代策略更新阶段，使代理能够自动合成推理轨迹并提高性能。实验结果表明，具有弱 LLM 主干的 SymAgent（即 7B 系列）与各种强大的基线相比，产生了更好或相当的性能。进一步的分析表明，我们的代理可以识别缺失的三元组，促进自动 KG 更新。</paragraph>
 
-##### **A Survey on Trustworthiness in Foundation Models for Medical Image Analysis**
-2407.15851v2 by Congzhen Shi, Ryan Rezai, Jiaxi Yang, Qi Dou, Xiaoxiao Li
+##### **Analyze Feature Flow to Enhance Interpretation and Steering in Language Models**
+2502.03032v2 by Daniil Laptev, Nikita Balagansky, Yaroslav Aksenov, Daniil Gavrilov
 
-The rapid advancement of foundation models in medical imaging represents a
-significant leap toward enhancing diagnostic accuracy and personalized
-treatment. However, the deployment of foundation models in healthcare
-necessitates a rigorous examination of their trustworthiness, encompassing
-privacy, robustness, reliability, explainability, and fairness. The current
-body of survey literature on foundation models in medical imaging reveals
-considerable gaps, particularly in the area of trustworthiness. Additionally,
-existing surveys on the trustworthiness of foundation models do not adequately
-address their specific variations and applications within the medical imaging
-domain. This survey aims to fill that gap by presenting a novel taxonomy of
-foundation models used in medical imaging and analyzing the key motivations for
-ensuring their trustworthiness. We review current research on foundation models
-in major medical imaging applications, focusing on segmentation, medical report
-generation, medical question and answering (Q\&A), and disease diagnosis. These
-areas are highlighted because they have seen a relatively mature and
-substantial number of foundation models compared to other applications. We
-focus on literature that discusses trustworthiness in medical image analysis
-manuscripts. We explore the complex challenges of building trustworthy
-foundation models for each application, summarizing current concerns and
-strategies for enhancing trustworthiness. Furthermore, we examine the potential
-of these models to revolutionize patient care. Our analysis underscores the
-imperative for advancing towards trustworthy AI in medical image analysis,
-advocating for a balanced approach that fosters innovation while ensuring
-ethical and equitable healthcare delivery.
+We introduce a new approach to systematically map features discovered by
+sparse autoencoder across consecutive layers of large language models,
+extending earlier work that examined inter-layer feature links. By using a
+data-free cosine similarity technique, we trace how specific features persist,
+transform, or first appear at each stage. This method yields granular flow
+graphs of feature evolution, enabling fine-grained interpretability and
+mechanistic insights into model computations. Crucially, we demonstrate how
+these cross-layer feature maps facilitate direct steering of model behavior by
+amplifying or suppressing chosen features, achieving targeted thematic control
+in text generation. Together, our findings highlight the utility of a causal,
+cross-layer interpretability framework that not only clarifies how features
+develop through forward passes but also provides new means for transparent
+manipulation of large language models.
 
-摘要：基礎模型在醫學影像方面的快速進展，代表著在加強診斷準確性和個人化治療方面邁出一大步。然而，基礎模型在醫療保健中的部署需要對其可信度進行嚴格的審查，包括隱私、穩健性、可靠性、可解釋性和公平性。目前關於醫學影像中基礎模型的調查文獻中顯示出相當大的差距，特別是在可信度方面。此外，現有關於基礎模型可信度的調查並未充分解決其在醫學影像領域中的特定變化和應用。本調查旨在通過提出醫學影像中使用的基礎模型的新分類法並分析確保其可信度的關鍵動機，來填補這一空白。我們回顧了基礎模型在主要醫學影像應用中的當前研究，重點關注分割、醫療報告生成、醫療問題和回答 (Q&A) 以及疾病診斷。這些領域之所以被強調，是因為與其他應用相比，它們已經看到相對成熟且大量的基礎模型。我們專注於探討醫學影像分析手稿中可信度的文獻。我們探討了為每個應用構建可信基礎模型的複雜挑戰，總結了當前關注點和增強可信度的策略。此外，我們探討了這些模型在革新患者護理方面的潛力。我們的分析強調了在醫學影像分析中朝著可信賴的人工智慧邁進的必要性，並倡導一種平衡的方法，既能促進創新，又能確保道德和公平的醫療保健服務。
+摘要：我們提出了一種新方法，用於系統性地繪製大型語言模型連續層中稀疏自動編碼器發現的功能，擴展了先前研究層間特徵連結的工作。透過使用無資料餘弦相似性技術，我們追蹤特定特徵在每個階段如何持續、轉換或首次出現。此方法產生了特徵演化的細粒度流程圖，實現了細粒度的可解釋性和對模型運算的機制見解。至關重要的是，我們展示了這些跨層特徵圖如何透過放大或抑制所選特徵來促進模型行為的直接引導，在文字生成中實現目標主題控制。我們的研究結果共同突出了因果、跨層可解釋性框架的效用，不僅闡明了特徵如何透過前向傳遞發展，還提供了新的方法來透明地操作大型語言模型。
 
-##### **The Impact of an XAI-Augmented Approach on Binary Classification with Scarce Data**
-2407.06206v1 by Ximing Wen, Rosina O. Weber, Anik Sen, Darryl Hannan, Steven C. Nesbit, Vincent Chan, Alberto Goffi, Michael Morris, John C. Hunninghake, Nicholas E. Villalobos, Edward Kim, Christopher J. MacLellan
+##### **A Benchmark for the Detection of Metalinguistic Disagreements between LLMs and Knowledge Graphs**
+2502.02896v1 by Bradley P. Allen, Paul T. Groth
 
-Point-of-Care Ultrasound (POCUS) is the practice of clinicians conducting and
-interpreting ultrasound scans right at the patient's bedside. However, the
-expertise needed to interpret these images is considerable and may not always
-be present in emergency situations. This reality makes algorithms such as
-machine learning classifiers extremely valuable to augment human decisions.
-POCUS devices are becoming available at a reasonable cost in the size of a
-mobile phone. The challenge of turning POCUS devices into life-saving tools is
-that interpretation of ultrasound images requires specialist training and
-experience. Unfortunately, the difficulty to obtain positive training images
-represents an important obstacle to building efficient and accurate
-classifiers. Hence, the problem we try to investigate is how to explore
-strategies to increase accuracy of classifiers trained with scarce data. We
-hypothesize that training with a few data instances may not suffice for
-classifiers to generalize causing them to overfit. Our approach uses an
-Explainable AI-Augmented approach to help the algorithm learn more from less
-and potentially help the classifier better generalize.
+Evaluating large language models (LLMs) for tasks like fact extraction in
+support of knowledge graph construction frequently involves computing accuracy
+metrics using a ground truth benchmark based on a knowledge graph (KG). These
+evaluations assume that errors represent factual disagreements. However, human
+discourse frequently features metalinguistic disagreement, where agents differ
+not on facts but on the meaning of the language used to express them. Given the
+complexity of natural language processing and generation using LLMs, we ask: do
+metalinguistic disagreements occur between LLMs and KGs? Based on an
+investigation using the T-REx knowledge alignment dataset, we hypothesize that
+metalinguistic disagreement does in fact occur between LLMs and KGs, with
+potential relevance for the practice of knowledge graph engineering. We propose
+a benchmark for evaluating the detection of factual and metalinguistic
+disagreements between LLMs and KGs. An initial proof of concept of such a
+benchmark is available on Github.
 
-摘要：床邊超音波 (POCUS) 是臨床醫師在患者床邊進行和解讀超音波掃描的實務。然而，解讀這些影像所需的專業知識相當可觀，而且在緊急情況下可能並非隨時具備。這種現實情況使得機器學習分類器等演算法對於加強人類決策變得極為有價值。POCUS 裝置正以合理成本推出，尺寸為手機大小。將 POCUS 裝置轉變為救生工具的挑戰在於，解讀超音波影像需要專門訓練和經驗。不幸的是，取得正向訓練影像的困難度代表著建置有效率且準確的分類器的一大障礙。因此，我們嘗試探討的問題是如何探索策略，以提高使用稀疏資料訓練的分類器的準確度。我們假設使用少數資料實例進行訓練可能不足以讓分類器概括，導致它們過度擬合。我們的做法使用可解釋 AI 增強方法，以協助演算法從較少的資料中學習更多，並潛在協助分類器更好地概括。
+摘要：評估大型語言模型 (LLM) 執行知識圖譜建構支援事實萃取等任務時，通常會使用基於知識圖譜 (KG) 的基準事實計算準確度指標。這些評估假設錯誤代表事實上的分歧。然而，人類話語經常出現元語言分歧，其中代理人之間的差異不在於事實，而在於用於表達事實的語言的含義。鑑於使用 LLM 處理和產生自然語言的複雜性，我們提出疑問：LLM 和 KG 之間是否會發生元語言分歧？根據使用 T-REx 知識比對資料集進行的調查，我們假設元語言分歧確實會發生在 LLM 和 KG 之間，並可能與知識圖譜工程實務有關。我們提出一個基準，用於評估 LLM 和 KG 之間的事實和元語言分歧的偵測。此基準的初步概念驗證可在 Github 上取得。
 
-##### **Can GPT-4 Help Detect Quit Vaping Intentions? An Exploration of Automatic Data Annotation Approach**
-2407.00167v1 by Sai Krishna Revanth Vuruma, Dezhi Wu, Saborny Sen Gupta, Lucas Aust, Valerie Lookingbill, Wyatt Bellamy, Yang Ren, Erin Kasson, Li-Shiun Chen, Patricia Cavazos-Rehg, Dian Hu, Ming Huang
+##### **Mol-LLM: Generalist Molecular LLM with Improved Graph Utilization**
+2502.02810v1 by Chanhui Lee, Yuheon Song, YongJun Jeong, Hanbum Ko, Rodrigo Hormazabal, Sehui Han, Kyunghoon Bae, Sungbin Lim, Sungwoong Kim
 
-In recent years, the United States has witnessed a significant surge in the
-popularity of vaping or e-cigarette use, leading to a notable rise in cases of
-e-cigarette and vaping use-associated lung injury (EVALI) that caused
-hospitalizations and fatalities during the EVALI outbreak in 2019, highlighting
-the urgency to comprehend vaping behaviors and develop effective strategies for
-cessation. Due to the ubiquity of social media platforms, over 4.7 billion
-users worldwide use them for connectivity, communications, news, and
-entertainment with a significant portion of the discourse related to health,
-thereby establishing social media data as an invaluable organic data resource
-for public health research. In this study, we extracted a sample dataset from
-one vaping sub-community on Reddit to analyze users' quit-vaping intentions.
-Leveraging OpenAI's latest large language model GPT-4 for sentence-level quit
-vaping intention detection, this study compares the outcomes of this model
-against layman and clinical expert annotations. Using different prompting
-strategies such as zero-shot, one-shot, few-shot and chain-of-thought
-prompting, we developed 8 prompts with varying levels of detail to explain the
-task to GPT-4 and also evaluated the performance of the strategies against each
-other. These preliminary findings emphasize the potential of GPT-4 in social
-media data analysis, especially in identifying users' subtle intentions that
-may elude human detection.
+Recent advances in Large Language Models (LLMs) have motivated the
+development of general LLMs for molecular tasks. While several studies have
+demonstrated that fine-tuned LLMs can achieve impressive benchmark
+performances, they are far from genuine generalist molecular LLMs due to a lack
+of fundamental understanding of molecular structure. Specifically, when given
+molecular task instructions, LLMs trained with naive next-token prediction
+training assign similar likelihood scores to both original and negatively
+corrupted molecules, revealing their lack of molecular structure understanding
+that is crucial for reliable and general molecular LLMs. To overcome this
+limitation and obtain a true generalist molecular LLM, we introduce a novel
+multi-modal training method based on a thorough multi-modal instruction tuning
+as well as a molecular structure preference optimization between chosen and
+rejected graphs. On various molecular benchmarks, the proposed generalist
+molecular LLM, called Mol-LLM, achieves state-of-the-art performances among
+generalist LLMs on most tasks, at the same time, surpassing or comparable to
+state-of-the-art specialist LLMs. Moreover, Mol-LLM also shows superior
+generalization performances in reaction prediction tasks, demonstrating the
+effect of the molecular structure understanding for generalization perspective.
 
-摘要：近年來，美國見證了電子煙或電子香菸使用率大幅激增，導致電子煙和電子煙使用相關肺損傷 (EVALI) 病例顯著增加，在 2019 年 EVALI 爆發期間造成住院和死亡，凸顯了理解電子煙行為和制定有效戒菸策略的迫切性。由於社群媒體平台的普及，全球超過 47 億使用者使用它們進行連結、溝通、新聞和娛樂，其中很大一部分與健康相關，因此將社群媒體資料建立為公共衛生研究中無價的有機資料資源。在本研究中，我們從 Reddit 上一個電子煙子社群中提取一個範例資料集，以分析使用者的戒電子煙意圖。利用 OpenAI 最新的大型語言模型 GPT-4 進行句子層級的戒電子煙意圖偵測，本研究比較了此模型的結果與外行人和臨床專家註解。使用不同的提示策略，例如零次學習、一次學習、少次學習和思考鏈提示，我們開發了 8 個提示，詳細程度不同，向 GPT-4 解釋任務，並評估這些策略彼此之間的效能。這些初步發現強調了 GPT-4 在社群媒體資料分析中的潛力，特別是在識別人類偵測可能無法察覺的使用者微妙意圖方面。
+摘要：大型語言模型 (LLM) 的近期進展激勵了針對分子任務開發通用 LLM。雖然多項研究已證明微調 LLM 可實現令人印象深刻的基準效能，但由於缺乏對分子結構的基本理解，它們遠非真正的通才分子 LLM。具體來說，當給予分子任務說明時，使用天真的下一個符號預測訓練訓練的 LLM 會將類似的可能性評分分配給原始分子和負面損壞分子，這顯示出它們缺乏對分子結構的理解，而這對於可靠且通用的分子 LLM 至關重要。為了克服這個限制並獲得真正的通才分子 LLM，我們引入了一種新穎的多模態訓練方法，該方法基於徹底的多模態說明調整以及在所選和拒絕圖形之間的分子結構偏好最佳化。在各種分子基準測試中，所提出的通才分子 LLM（稱為 Mol-LLM）在多數任務中實現了通才 LLM 中的最新效能，同時超越或與最新的專家 LLM 相當。此外，Mol-LLM 在反應預測任務中也展現出優異的泛化效能，證明了分子結構理解對泛化觀點的影響。
 
-##### **Towards Compositional Interpretability for XAI**
-2406.17583v1 by Sean Tull, Robin Lorenz, Stephen Clark, Ilyas Khan, Bob Coecke
+##### **Leveraging the true depth of LLMs**
+2502.02790v1 by Ramón Calvo González, Daniele Paliotta, Matteo Pagliardini, Martin Jaggi, François Fleuret
 
-Artificial intelligence (AI) is currently based largely on black-box machine
-learning models which lack interpretability. The field of eXplainable AI (XAI)
-strives to address this major concern, being critical in high-stakes areas such
-as the finance, legal and health sectors.
-  We present an approach to defining AI models and their interpretability based
-on category theory. For this we employ the notion of a compositional model,
-which sees a model in terms of formal string diagrams which capture its
-abstract structure together with its concrete implementation. This
-comprehensive view incorporates deterministic, probabilistic and quantum
-models. We compare a wide range of AI models as compositional models, including
-linear and rule-based models, (recurrent) neural networks, transformers, VAEs,
-and causal and DisCoCirc models.
-  Next we give a definition of interpretation of a model in terms of its
-compositional structure, demonstrating how to analyse the interpretability of a
-model, and using this to clarify common themes in XAI. We find that what makes
-the standard 'intrinsically interpretable' models so transparent is brought out
-most clearly diagrammatically. This leads us to the more general notion of
-compositionally-interpretable (CI) models, which additionally include, for
-instance, causal, conceptual space, and DisCoCirc models.
-  We next demonstrate the explainability benefits of CI models. Firstly, their
-compositional structure may allow the computation of other quantities of
-interest, and may facilitate inference from the model to the modelled
-phenomenon by matching its structure. Secondly, they allow for diagrammatic
-explanations for their behaviour, based on influence constraints, diagram
-surgery and rewrite explanations. Finally, we discuss many future directions
-for the approach, raising the question of how to learn such meaningfully
-structured models in practice.
+Large Language Models demonstrate remarkable capabilities at the cost of high
+compute requirements. While recent research has shown that intermediate layers
+can be removed or have their order shuffled without impacting performance
+significantly, these findings have not been employed to reduce the
+computational cost of inference. We investigate several potential ways to
+reduce the depth of pre-trained LLMs without significantly affecting
+performance. Leveraging our insights, we present a novel approach that exploits
+this decoupling between layers by grouping some of them into pairs that can be
+evaluated in parallel.
+  This modification of the computational graph -- through better parallelism --
+results in an average improvement of around 1.20x on the number of tokens
+generated per second, without re-training nor fine-tuning, while retaining
+95%-99% of the original accuracy. Empirical evaluation demonstrates that this
+approach significantly improves serving efficiency while maintaining model
+performance, offering a practical improvement for large-scale LLM deployment.
 
-摘要：<paragraph>人工智慧（AI）目前在很大程度上依賴於缺乏可解釋性的黑盒機器學習模型。可解釋性人工智慧（XAI）領域致力於解決這個主要問題，這在金融、法律和健康等高風險領域至關重要。
-我們提出了一種基於範疇論定義 AI 模型及其可解釋性的方法。為此，我們採用組合模型的概念，它以形式弦圖的形式看待模型，這些弦圖捕獲了模型的抽象結構及其具體實現。這種綜合觀點包含了確定性、概率性和量子模型。我們將各種 AI 模型作為組合模型進行比較，包括線性和基於規則的模型、（遞迴）神經網路、Transformer、VAE，以及因果和 DisCoCirc 模型。
-接下來，我們根據模型的組合結構給出模型解釋的定義，展示如何分析模型的可解釋性，並使用它來澄清 XAI 中的常見主題。我們發現，讓標準的「內在可解釋」模型如此透明的原因在圖表中表現得最為清楚。這引導我們得出更一般的組合可解釋（CI）模型概念，它另外還包括因果、概念空間和 DisCoCirc 模型。
-接下來，我們展示了 CI 模型的可解釋性優勢。首先，它們的組合結構允許計算其他感興趣的量，並可能通過匹配模型的結構來促進從模型到被建模現象的推理。其次，它們允許對其行為進行圖解說明，這些說明基於影響約束、圖解手術和重寫說明。最後，我們討論了這種方法的許多未來方向，提出了如何在實踐中學習這種有意義的結構化模型的問題。</paragraph>
+摘要：大型语言模型展示了其强大的功能，但代价是较高的计算需求。虽然最近的研究表明，中间层可以被移除或重新排列其顺序，而不会显著影响性能，但这些发现尚未被用来降低推理的计算成本。我们研究了几种潜在的方法来减少预训练 LLM 的深度，而不会显著影响性能。利用我们的见解，我们提出了一种新颖的方法，该方法通过将其中一些分组为可以并行评估的成对来利用层之间的这种解耦。
+通过更好的并行性对计算图进行修改，平均而言，每秒生成的令牌数量提高了约 1.20 倍，而无需重新训练或微调，同时保留了 95%-99% 的原始准确性。经验评估表明，这种方法显著提高了服务效率，同时保持了模型性能，为大规模 LLM 部署提供了实际改进。
 
-##### **Slicing Through Bias: Explaining Performance Gaps in Medical Image Analysis using Slice Discovery Methods**
-2406.12142v2 by Vincent Olesen, Nina Weng, Aasa Feragen, Eike Petersen
+##### **Modular Training of Neural Networks aids Interpretability**
+2502.02470v2 by Satvik Golechha, Maheep Chaudhary, Joan Velja, Alessandro Abate, Nandi Schoots
 
-Machine learning models have achieved high overall accuracy in medical image
-analysis. However, performance disparities on specific patient groups pose
-challenges to their clinical utility, safety, and fairness. This can affect
-known patient groups - such as those based on sex, age, or disease subtype - as
-well as previously unknown and unlabeled groups. Furthermore, the root cause of
-such observed performance disparities is often challenging to uncover,
-hindering mitigation efforts. In this paper, to address these issues, we
-leverage Slice Discovery Methods (SDMs) to identify interpretable
-underperforming subsets of data and formulate hypotheses regarding the cause of
-observed performance disparities. We introduce a novel SDM and apply it in a
-case study on the classification of pneumothorax and atelectasis from chest
-x-rays. Our study demonstrates the effectiveness of SDMs in hypothesis
-formulation and yields an explanation of previously observed but unexplained
-performance disparities between male and female patients in widely used chest
-X-ray datasets and models. Our findings indicate shortcut learning in both
-classification tasks, through the presence of chest drains and ECG wires,
-respectively. Sex-based differences in the prevalence of these shortcut
-features appear to cause the observed classification performance gap,
-representing a previously underappreciated interaction between shortcut
-learning and model fairness analyses.
+An approach to improve neural network interpretability is via clusterability,
+i.e., splitting a model into disjoint clusters that can be studied
+independently. We define a measure for clusterability and show that pre-trained
+models form highly enmeshed clusters via spectral graph clustering. We thus
+train models to be more modular using a "clusterability loss" function that
+encourages the formation of non-interacting clusters. Using automated
+interpretability techniques, we show that our method can help train models that
+are more modular and learn different, disjoint, and smaller circuits. We
+investigate CNNs trained on MNIST and CIFAR, small transformers trained on
+modular addition, and language models. Our approach provides a promising
+direction for training neural networks that learn simpler functions and are
+easier to interpret.
 
-摘要：機器學習模型在醫學影像分析中已達到整體高準確度。然而，特定患者群體的效能差異對其臨床效用、安全性與公平性構成挑戰。這可能會影響已知的患者群體（例如基於性別、年齡或疾病亞型）以及先前未知且未標籤的群體。此外，此類觀察到的效能差異的根本原因通常難以發現，阻礙了緩解措施。在本文中，為了解決這些問題，我們利用切片發現方法 (SDM) 來識別可解釋的資料效能不佳子集，並針對觀察到的效能差異原因制定假設。我們引入一種新的 SDM，並在胸部 X 光片中肺炎和肺不張分類的案例研究中應用它。我們的研究證明了 SDM 在假設制定中的有效性，並對廣泛使用的胸部 X 光片資料集和模型中先前觀察到但無法解釋的男性和女性患者之間的效能差異提供了解釋。我們的發現表明，在分類任務中，透過胸腔引流管和心電圖導線的存在，存在捷徑學習。這些捷徑特徵的盛行率存在基於性別的差異，似乎會導致觀察到的分類效能差距，這代表捷徑學習和模型公平性分析之間先前未受到重視的交互作用。
+摘要：一種改善神經網路可解釋性的方法是透過群集性，
+也就是將模型分割成可獨立研究的不相交群集。我們定義一個群集性的度量，並顯示預訓練的
+模型透過光譜圖形群集形成高度糾纏的群集。因此，我們使用「群集性損失」函數訓練模型，使其更具模組化，
+這鼓勵形成非交互群集。使用自動化可解釋性技術，我們顯示我們的模型可以幫助訓練更具模組化的模型，並學習不同、不相交且較小的電路。我們
+研究了在 MNIST 和 CIFAR 上訓練的 CNN，在模組化加法上訓練的小型Transformer，以及語言模型。我們的做法為訓練學習更簡單函數且更容易解釋的神經網路提供了有希望的方向。
 
-##### **Unlocking the Potential of Metaverse in Innovative and Immersive Digital Health**
-2406.07114v2 by Fatemeh Ebrahimzadeh, Ramin Safa
+##### **Premise-Augmented Reasoning Chains Improve Error Identification in Math reasoning with LLMs**
+2502.02362v3 by Sagnik Mukherjee, Abhinav Chinta, Takyoung Kim, Tarun Anoop Sharma, Dilek Hakkani-Tür
 
-The concept of Metaverse has attracted a lot of attention in various fields
-and one of its important applications is health and treatment. The Metaverse
-has enormous potential to transform healthcare by changing patient care,
-medical education, and the way teaching/learning and research are done. The
-purpose of this research is to provide an introduction to the basic concepts
-and fundamental technologies of the Metaverse. This paper examines the pros and
-cons of the Metaverse in healthcare context and analyzes its potential from the
-technology and AI perspective. In particular, the role of machine learning
-methods is discussed; We will explain how machine learning algorithms can be
-applied to the Metaverse generated data to gain better insights in healthcare
-applications. Additionally, we examine the future visions of the Metaverse in
-health delivery, by examining emerging technologies such as blockchain and also
-addressing privacy concerns. The findings of this study contribute to a deeper
-understanding of the applications of Metaverse in healthcare and its potential
-to revolutionize the delivery of medical services.
+Chain-of-Thought (CoT) prompting enhances mathematical reasoning in large
+language models (LLMs) by enabling detailed step-by-step solutions. However,
+due to the verbosity of LLMs, the resulting reasoning chains can be long,
+making it harder to verify the reasoning steps and trace issues resulting from
+dependencies between the steps that may be farther away in the sequence of
+steps. Importantly, mathematical reasoning allows each step to be derived from
+a small set of premises, which are a subset of the preceding steps in the
+reasoning chain. In this paper, we present a framework that identifies the
+premises for each step, to improve the evaluation of reasoning. We restructure
+conventional linear reasoning chains into Premise Augmented Reasoning Chains
+(PARC) by introducing premise links, resulting in a directed acyclic graph
+where the nodes are the steps and the edges are the premise links. Through
+experiments with a PARC-based dataset that we built, namely PERL (Premises and
+ERrors identification in LLMs), we demonstrate that LLMs can reliably identify
+premises within complex reasoning chains. In particular, even open-source LLMs
+achieve 90% recall in premise identification. We also show that PARC helps to
+identify errors in reasoning chains more reliably. The accuracy of error
+identification improves by 6% to 16% absolute when step-by-step verification is
+carried out in PARC under the premises. Our findings highlight the utility of
+premise-centric representations in addressing complex problem-solving tasks and
+open new avenues for improving the reliability of LLM-based reasoning
+evaluations.
 
-摘要：元宇宙的概念在各個領域都備受關注，其重要應用之一便是醫療保健。元宇宙有巨大的潛力透過改變病患照護、醫學教育，以及教學/學習和研究的方式來轉型醫療保健。本研究的目的是提供元宇宙基本概念和基礎技術的介紹。本文探討了元宇宙在醫療保健背景下的優缺點，並從技術和 AI 的角度分析其潛力。特別是，討論了機器學習方法的角色；我們將說明如何將機器學習演算法應用於元宇宙產生的資料，以獲得醫療保健應用方面的更佳見解。此外，我們透過探討區塊鏈等新興技術，並解決隱私問題，來探討元宇宙在醫療保健方面的未來願景。本研究的發現有助於更深入地了解元宇宙在醫療保健中的應用，以及其在醫療服務提供方面發揮革命性變革的潛力。
+摘要：<paragraph>思考鏈（CoT）提示透過提供詳細的逐步解法，增強大型語言模型（LLM）的數學推理能力。然而，由於 LLM 的冗長，產生的推理鏈可能很長，這使得驗證推理步驟和追蹤由步驟之間相依關係所產生的問題變得更加困難，而這些步驟可能在步驟順序中相距較遠。重要的是，數學推理允許每個步驟從一組小的前提中推導出來，這些前提是推理鏈中前一個步驟的子集。在本文中，我們提出了一個框架，用於識別每個步驟的前提，以改進推理評估。我們透過引入前提連結，將傳統的線性推理鏈重組為前提擴充推理鏈（PARC），產生一個有向無環圖，其中節點是步驟，而邊緣是前提連結。透過我們建立的基於 PARC 的資料集（即 PERL（LLM 中的前提和錯誤識別））進行的實驗，我們證明 LLM 能夠在複雜的推理鏈中可靠地識別前提。特別是，即使是開源 LLM 在前提識別中也能達到 90% 的召回率。我們還表明，PARC 有助於更可靠地識別推理鏈中的錯誤。在前提下於 PARC 中執行逐步驗證時，錯誤識別的準確度提高了 6% 到 16%。我們的研究結果突顯了以前提為中心的表示在解決複雜問題解決任務中的效用，並為改進基於 LLM 的推理評估的可靠性開闢了新途徑。</paragraph>
 
-##### **AI-Driven Predictive Analytics Approach for Early Prognosis of Chronic Kidney Disease Using Ensemble Learning and Explainable AI**
-2406.06728v2 by K M Tawsik Jawad, Anusha Verma, Fathi Amsaad, Lamia Ashraf
+##### **AdaptBot: Combining LLM with Knowledge Graphs and Human Input for Generic-to-Specific Task Decomposition and Knowledge Refinement**
+2502.02067v1 by Shivam Singh, Karthik Swaminathan, Nabanita Dash, Ramandeep Singh, Snehasis Banerjee, Mohan Sridharan, Madhava Krishna
 
-Chronic Kidney Disease (CKD) is one of the widespread Chronic diseases with
-no known ultimo cure and high morbidity. Research demonstrates that progressive
-Chronic Kidney Disease (CKD) is a heterogeneous disorder that significantly
-impacts kidney structure and functions, eventually leading to kidney failure.
-With the progression of time, chronic kidney disease has moved from a
-life-threatening disease affecting few people to a common disorder of varying
-severity. The goal of this research is to visualize dominating features,
-feature scores, and values exhibited for early prognosis and detection of CKD
-using ensemble learning and explainable AI. For that, an AI-driven predictive
-analytics approach is proposed to aid clinical practitioners in prescribing
-lifestyle modifications for individual patients to reduce the rate of
-progression of this disease. Our dataset is collected on body vitals from
-individuals with CKD and healthy subjects to develop our proposed AI-driven
-solution accurately. In this regard, blood and urine test results are provided,
-and ensemble tree-based machine-learning models are applied to predict unseen
-cases of CKD. Our research findings are validated after lengthy consultations
-with nephrologists. Our experiments and interpretation results are compared
-with existing explainable AI applications in various healthcare domains,
-including CKD. The comparison shows that our developed AI models, particularly
-the Random Forest model, have identified more features as significant
-contributors than XgBoost. Interpretability (I), which measures the ratio of
-important to masked features, indicates that our XgBoost model achieved a
-higher score, specifically a Fidelity of 98\%, in this metric and naturally in
-the FII index compared to competing models.
+Embodied agents assisting humans are often asked to complete a new task in a
+new scenario. An agent preparing a particular dish in the kitchen based on a
+known recipe may be asked to prepare a new dish or to perform cleaning tasks in
+the storeroom. There may not be sufficient resources, e.g., time or labeled
+examples, to train the agent for these new situations. Large Language Models
+(LLMs) trained on considerable knowledge across many domains are able to
+predict a sequence of abstract actions for such new tasks and scenarios,
+although it may not be possible for the agent to execute this action sequence
+due to task-, agent-, or domain-specific constraints. Our framework addresses
+these challenges by leveraging the generic predictions provided by LLM and the
+prior domain-specific knowledge encoded in a Knowledge Graph (KG), enabling an
+agent to quickly adapt to new tasks and scenarios. The robot also solicits and
+uses human input as needed to refine its existing knowledge. Based on
+experimental evaluation over cooking and cleaning tasks in simulation domains,
+we demonstrate that the interplay between LLM, KG, and human input leads to
+substantial performance gains compared with just using the LLM output.
 
-摘要：慢性腎臟病 (CKD) 是一種廣泛的慢性疾病，目前尚未找到最終的治療方法，且發病率很高。研究表明，進行性慢性腎臟病 (CKD) 是一種異質性疾病，會顯著影響腎臟結構和功能，最終導致腎衰竭。隨著時間的推移，慢性腎臟病已從影響少數人的致命疾病演變成一種嚴重程度不一的常見疾病。本研究的目標是使用整體學習和可解釋的 AI 來視覺化支配性特徵、特徵分數和值，以進行 CKD 的早期預後和檢測。為此，提出了一種 AI 驅動的預測分析方法，以幫助臨床醫生為個別患者開具生活方式的修改建議，以降低此疾病的進展速度。我們的數據集是從 CKD 患者和健康受試者的身體生命徵象中收集的，以準確開發我們提出的 AI 驅動的解決方案。在這方面，提供了血液和尿液檢測結果，並應用基於集成樹的機器學習模型來預測未見的 CKD 病例。我們的研究結果在與腎臟科醫師進行長時間諮詢後得到驗證。我們的實驗和解釋結果與各種醫療保健領域中現有的可解釋 AI 應用進行了比較，包括 CKD。比較表明，我們開發的 AI 模型，特別是隨機森林模型，已經確定了比 XgBoost 更多的特徵作為顯著的貢獻者。可解釋性 (I) 衡量重要特徵與被遮蔽特徵的比率，表明我們的 XgBoost 模型在此指標中取得了更高的分數，特別是 98% 的保真度，並且在 FII 指數中自然高於競爭模型。
+摘要：具身代理协助人类时，通常需要在新的情境中完成新的任务。基于已知食谱在厨房准备特定菜肴的代理可能会被要求准备新菜肴或在储藏室执行清洁任务。可能没有足够资源（例如时间或标记的示例）来训练代理以应对这些新情况。在许多领域接受大量知识训练的大型语言模型 (LLM) 能够预测此类新任务和情境的抽象动作序列，尽管代理可能无法执行此动作序列，因为任务、代理或特定于域的约束。我们的框架通过利用 LLM 提供的通用预测和知识图 (KG) 中编码的先前特定于域的知识来应对这些挑战，使代理能够快速适应新任务和情境。该机器人还会根据需要征求并使用人类输入来完善其现有知识。基于在模拟域中对烹饪和清洁任务的实验评估，我们证明了 LLM、KG 和人类输入之间的相互作用与仅使用 LLM 输出相比带来了巨大的性能提升。
 
-##### **Explainable AI for Mental Disorder Detection via Social Media: A survey and outlook**
-2406.05984v1 by Yusif Ibrahimov, Tarique Anwar, Tommy Yuan
+##### **On Bob Dylan: A Computational Perspective**
+2502.01772v1 by Prashant Garg
 
-Mental health constitutes a complex and pervasive global challenge, affecting
-millions of lives and often leading to severe consequences. In this paper, we
-conduct a thorough survey to explore the intersection of data science,
-artificial intelligence, and mental healthcare, focusing on the recent
-developments of mental disorder detection through online social media (OSM). A
-significant portion of the population actively engages in OSM platforms,
-creating a vast repository of personal data that holds immense potential for
-mental health analytics. The paper navigates through traditional diagnostic
-methods, state-of-the-art data- and AI-driven research studies, and the
-emergence of explainable AI (XAI) models for mental healthcare. We review
-state-of-the-art machine learning methods, particularly those based on modern
-deep learning, while emphasising the need for explainability in healthcare AI
-models. The experimental design section provides insights into prevalent
-practices, including available datasets and evaluation approaches. We also
-identify key issues and challenges in the field and propose promising future
-research directions. As mental health decisions demand transparency,
-interpretability, and ethical considerations, this paper contributes to the
-ongoing discourse on advancing XAI in mental healthcare through social media.
-The comprehensive overview presented here aims to guide researchers,
-practitioners, and policymakers in developing the area of mental disorder
-detection.
+Cass Sunstein's essay 'On Bob Dylan' describes Dylan's 'dishabituating' style
+-- a constant refusal to conform to expectation and a penchant for reinventing
+his musical and lyrical identity. In this paper, I extend Sunstein's
+observations through a large-scale computational analysis of Dylan's lyrics
+from 1962 to 2012. Using o3-mini-high (a large language model), I extract
+concept-to-concept relationships from the lyrics and construct directed
+knowledge graphs that capture Dylan's thematic structure. I then quantify
+shifts in sentiment, metaphorical expression, thematic diversity, and network
+complexity over time. The results indicate that Dylan's lyrics increasingly
+rely on metaphor, display an evolving sentiment profile, and exhibit heightened
+dishabituation -- measured here as a growing variance in the network centrality
+of key concepts. I also find that references to movement, protest, and mythic
+imagery fluctuate in ways that align with well-known phases of Dylan's career,
+reflecting the dynamic and unpredictable quality of his art. These findings not
+only deepen our empirical understanding of Sunstein's thesis but also introduce
+a novel computational method for analyzing an artist's evolution-offering
+broader applicability to the study of cultural and creative change.
 
-摘要：心理健康構成了一項複雜且普遍的全球挑戰，影響了數百萬人的生活，並經常導致嚴重的後果。在本文中，我們進行了一項徹底的調查，以探索數據科學、人工智慧和心理保健的交集，重點關注通過線上社交媒體 (OSM) 進行心理疾病檢測的最新發展。很大一部分人口積極參與 OSM 平台，創造了一個龐大的人員資料庫，對心理健康分析具有巨大的潛力。本文探討了傳統的診斷方法、最先進的資料和 AI 驅動的研究，以及心理保健中可解釋 AI (XAI) 模型的出現。我們回顧了最先進的機器學習方法，特別是那些基於現代深度學習的方法，同時強調了醫療保健 AI 模型中可解釋性的必要性。實驗設計部分提供了對普遍做法的見解，包括可用的資料集和評估方法。我們還找出該領域的主要問題和挑戰，並提出了有希望的未來研究方向。由於心理健康決策需要透明度、可解釋性和道德考量，本文有助於推進心理保健中透過社交媒體推進 XAI 的持續討論。這裡提出的全面概述旨在引導研究人員、從業人員和政策制定者發展心理疾病檢測領域。
+摘要：卡斯·桑斯坦的論文「論鮑伯·迪倫」描述了迪倫「去習慣化」的風格
+-- 這種風格不斷拒絕符合預期，並熱衷於重新塑造他的音樂和歌詞認同。在本文中，我透過對迪倫 1962 年至 2012 年歌詞進行大規模的運算分析，來延伸桑斯坦的觀察。使用 o3-mini-high（一個大型語言模型），我從歌詞中提取概念對概念的關係，並建構有向知識圖，以捕捉迪倫的主題結構。然後，我量化情緒、隱喻表達、主題多樣性和網路複雜性隨時間的變化。結果顯示，迪倫的歌詞越來越依賴隱喻，展現出不斷演化的情緒輪廓，並表現出高度的去習慣化 -- 在這裡測量為關鍵概念的網路中心性的變異增加。我也發現，對運動、抗議和神話意象的引用，會以與迪倫職業生涯中眾所周知階段一致的方式波動，反映了他藝術的動態和不可預測的品質。這些發現不僅加深了我們對桑斯坦論文的經驗理解，也引入了分析藝術家演變的新穎運算方法，為文化和創造性變化的研究提供了更廣泛的適用性。
 
-##### **Methodology and Real-World Applications of Dynamic Uncertain Causality Graph for Clinical Diagnosis with Explainability and Invariance**
-2406.05746v1 by Zhan Zhang, Qin Zhang, Yang Jiao, Lin Lu, Lin Ma, Aihua Liu, Xiao Liu, Juan Zhao, Yajun Xue, Bing Wei, Mingxia Zhang, Ru Gao, Hong Zhao, Jie Lu, Fan Li, Yang Zhang, Yiming Wang, Lei Zhang, Fengwei Tian, Jie Hu, Xin Gou
+##### **VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos**
+2502.01549v1 by Xubin Ren, Lingrui Xu, Long Xia, Shuaiqiang Wang, Dawei Yin, Chao Huang
 
-AI-aided clinical diagnosis is desired in medical care. Existing deep
-learning models lack explainability and mainly focus on image analysis. The
-recently developed Dynamic Uncertain Causality Graph (DUCG) approach is
-causality-driven, explainable, and invariant across different application
-scenarios, without problems of data collection, labeling, fitting, privacy,
-bias, generalization, high cost and high energy consumption. Through close
-collaboration between clinical experts and DUCG technicians, 46 DUCG models
-covering 54 chief complaints were constructed. Over 1,000 diseases can be
-diagnosed without triage. Before being applied in real-world, the 46 DUCG
-models were retrospectively verified by third-party hospitals. The verified
-diagnostic precisions were no less than 95%, in which the diagnostic precision
-for every disease including uncommon ones was no less than 80%. After
-verifications, the 46 DUCG models were applied in the real-world in China. Over
-one million real diagnosis cases have been performed, with only 17 incorrect
-diagnoses identified. Due to DUCG's transparency, the mistakes causing the
-incorrect diagnoses were found and corrected. The diagnostic abilities of the
-clinicians who applied DUCG frequently were improved significantly. Following
-the introduction to the earlier presented DUCG methodology, the recommendation
-algorithm for potential medical checks is presented and the key idea of DUCG is
-extracted.
+Retrieval-Augmented Generation (RAG) has demonstrated remarkable success in
+enhancing Large Language Models (LLMs) through external knowledge integration,
+yet its application has primarily focused on textual content, leaving the rich
+domain of multi-modal video knowledge predominantly unexplored. This paper
+introduces VideoRAG, the first retrieval-augmented generation framework
+specifically designed for processing and understanding extremely long-context
+videos. Our core innovation lies in its dual-channel architecture that
+seamlessly integrates (i) graph-based textual knowledge grounding for capturing
+cross-video semantic relationships, and (ii) multi-modal context encoding for
+efficiently preserving visual features. This novel design empowers VideoRAG to
+process unlimited-length videos by constructing precise knowledge graphs that
+span multiple videos while maintaining semantic dependencies through
+specialized multi-modal retrieval paradigms. Through comprehensive empirical
+evaluation on our proposed LongerVideos benchmark-comprising over 160 videos
+totaling 134+ hours across lecture, documentary, and entertainment
+categories-VideoRAG demonstrates substantial performance compared to existing
+RAG alternatives and long video understanding methods. The source code of
+VideoRAG implementation and the benchmark dataset are openly available at:
+https://github.com/HKUDS/VideoRAG.
+
+摘要：檢索增強生成 (RAG) 已證明在透過外部知識整合增強大型語言模型 (LLM) 方面取得顯著成功，但其應用主要集中在文字內容上，而豐富的多模態影片知識領域則鮮少被探索。本文介紹 VideoRAG，這是第一個檢索增強生成架構，專門設計用於處理和理解極長語境的影片。我們的核心創新在於其雙通道架構，它無縫整合 (i) 基於圖形文字知識基礎，用於擷取跨影片語義關係，以及 (ii) 多模態語境編碼，用於有效保留視覺特徵。這個新穎的設計讓 VideoRAG 能夠透過建構跨越多個影片的精確知識圖譜來處理長度不限的影片，同時透過專門的多模態檢索範例來維持語義依賴性。透過我們提出的 LongerVideos 基準的全面經驗評估，該基準包含超過 160 部影片，總時數超過 134 小時，涵蓋演講、紀錄片和娛樂類別，VideoRAG 與現有的 RAG 替代方案和長影片理解方法相比，展現出顯著的效能。VideoRAG 實作的原始碼和基準資料集已公開於：https://github.com/HKUDS/VideoRAG。
+
+##### **Transformers trained on proteins can learn to attend to Euclidean distance**
+2502.01533v1 by Isaac Ellmen, Constantin Schneider, Matthew I. J. Raybould, Charlotte M. Deane
+
+While conventional Transformers generally operate on sequence data, they can
+be used in conjunction with structure models, typically SE(3)-invariant or
+equivariant graph neural networks (GNNs), for 3D applications such as protein
+structure modelling. These hybrids typically involve either (1)
+preprocessing/tokenizing structural features as input for Transformers or (2)
+taking Transformer embeddings and processing them within a structural
+representation. However, there is evidence that Transformers can learn to
+process structural information on their own, such as the AlphaFold3 structural
+diffusion model. In this work we show that Transformers can function
+independently as structure models when passed linear embeddings of coordinates.
+We first provide a theoretical explanation for how Transformers can learn to
+filter attention as a 3D Gaussian with learned variance. We then validate this
+theory using both simulated 3D points and in the context of masked token
+prediction for proteins. Finally, we show that pre-training protein Transformer
+encoders with structure improves performance on a downstream task, yielding
+better performance than custom structural models. Together, this work provides
+a basis for using standard Transformers as hybrid structure-language models.
 
-摘要：<paragraph>醫療照護中需要 AI 輔助的臨床診斷。現有的深度學習模型缺乏可解釋性，並且主要專注於影像分析。最近開發的動態不確定因果關係圖 (DUCG) 方法是因果驅動的、可解釋的，並且在不同的應用場景中是不變的，沒有資料收集、標記、擬合、隱私、偏見、概化、高成本和高能耗的問題。通過臨床專家和 DUCG 技術人員之間的密切合作，構建了涵蓋 54 個主訴的 46 個 DUCG 模型。可以在沒有分流的情況下診斷出 1,000 多種疾病。在應用於實際世界之前，46 個 DUCG 模型已由第三方醫院回溯性驗證。驗證的診斷精度不低於 95%，其中包括罕見疾病在內的每種疾病的診斷精度不低於 80%。驗證後，46 個 DUCG 模型已在中國實際應用。已經執行了超過一百萬個真實診斷案例，僅發現 17 個不正確的診斷。由於 DUCG 的透明性，發現並糾正了導致不正確診斷的錯誤。頻繁應用 DUCG 的臨床醫生的診斷能力得到了顯著提高。在介紹了前面提出的 DUCG 方法論之後，提出了潛在健康檢查的推薦演算法，並提取了 DUCG 的關鍵思想。</paragraph>
+摘要：雖然傳統的 Transformer 通常處理序列資料，但它們可用於結構模型，通常是 SE(3) 不變式或等變式圖神經網路 (GNN)，用於蛋白質結構建模等 3D 應用。這些混合模型通常包含 (1) 將結構特徵預處理/標記化為 Transformer 的輸入或 (2) 取用 Transformer 嵌入並在結構表示中處理它們。然而，有證據表明 Transformer 可以自行學習處理結構資訊，例如 AlphaFold3 結構擴散模型。在這項工作中，我們展示了 Transformer 在傳遞座標的線性嵌入時，可以獨立作為結構模型運作。我們首先提供了 Transformer 如何學習將注意力濾波為具有學習變異的 3D 高斯的理論解釋。然後我們使用模擬 3D 點和在蛋白質遮罩標記預測的背景下驗證此理論。最後，我們展示了使用結構預訓練蛋白質 Transformer 編碼器會改善下游任務的效能，產生比自訂結構模型更好的效能。綜合來說，這項工作提供了使用標準 Transformer 作為混合結構語言模型的基礎。
 
-##### **Advancing Histopathology-Based Breast Cancer Diagnosis: Insights into Multi-Modality and Explainability**
-2406.12897v1 by Faseela Abdullakutty, Younes Akbari, Somaya Al-Maadeed, Ahmed Bouridane, Rifat Hamoudi
+##### **Common Foundations for SHACL, ShEx, and PG-Schema**
+2502.01295v1 by S. Ahmetaj, I. Boneva, J. Hidders, K. Hose, M. Jakubowski, J. E. Labra-Gayo, W. Martens, F. Mogavero, F. Murlak, C. Okulmus, A. Polleres, O. Savkovic, M. Simkus, D. Tomaszuk
 
-It is imperative that breast cancer is detected precisely and timely to
-improve patient outcomes. Diagnostic methodologies have traditionally relied on
-unimodal approaches; however, medical data analytics is integrating diverse
-data sources beyond conventional imaging. Using multi-modal techniques,
-integrating both image and non-image data, marks a transformative advancement
-in breast cancer diagnosis. The purpose of this review is to explore the
-burgeoning field of multimodal techniques, particularly the fusion of
-histopathology images with non-image data. Further, Explainable AI (XAI) will
-be used to elucidate the decision-making processes of complex algorithms,
-emphasizing the necessity of explainability in diagnostic processes. This
-review utilizes multi-modal data and emphasizes explainability to enhance
-diagnostic accuracy, clinician confidence, and patient engagement, ultimately
-fostering more personalized treatment strategies for breast cancer, while also
-identifying research gaps in multi-modality and explainability, guiding future
-studies, and contributing to the strategic direction of the field.
+Graphs have emerged as an important foundation for a variety of applications,
+including capturing and reasoning over factual knowledge, semantic data
+integration, social networks, and providing factual knowledge for machine
+learning algorithms. To formalise certain properties of the data and to ensure
+data quality, there is a need to describe the schema of such graphs. Because of
+the breadth of applications and availability of different data models, such as
+RDF and property graphs, both the Semantic Web and the database community have
+independently developed graph schema languages: SHACL, ShEx, and PG-Schema.
+Each language has its unique approach to defining constraints and validating
+graph data, leaving potential users in the dark about their commonalities and
+differences. In this paper, we provide formal, concise definitions of the core
+components of each of these schema languages. We employ a uniform framework to
+facilitate a comprehensive comparison between the languages and identify a
+common set of functionalities, shedding light on both overlapping and
+distinctive features of the three languages.
 
-摘要：精確且及時地偵測乳癌對於改善患者預後至關重要。診斷方法傳統上依賴於單一模式方法；然而，醫療資料分析正在整合超越傳統影像的各種資料來源。使用整合影像和非影像資料的多模式技術，標誌著乳癌診斷的變革性進展。本篇綜述的目的是探討多模式技術的新興領域，特別是將組織病理學影像與非影像資料融合。此外，可解釋人工智慧 (XAI) 將用於闡明複雜演算法的決策過程，強調診斷過程中可解釋性的必要性。本綜述利用多模式資料並強調可解釋性，以提高診斷準確性、臨床醫師的信心和患者參與度，最終促進乳癌更個人化的治療策略，同時也找出多模式和可解釋性的研究差距，引導未來的研究，並為該領域的策略方向做出貢獻。
+摘要：圖表已成為各種應用的重要基礎，包括擷取和推理事實知識、語義資料整合、社群網路，以及為機器學習演算法提供事實知識。為了形式化資料的特定屬性並確保資料品質，有必要描述此類圖表的架構。由於應用範圍廣泛且有不同的資料模型可用，例如 RDF 和屬性圖表，因此語義網路和資料庫社群已獨立開發圖表架構語言：SHACL、ShEx 和 PG-Schema。每種語言都有其定義約束和驗證圖表資料的獨特方法，讓潛在使用者不清楚它們的共性和差異。在本文中，我們提供這些架構語言中每個核心元件的正式簡潔定義。我們採用統一的框架來促進語言之間的全面比較，並找出功能的共同集合，說明這三種語言的重疊和獨特功能。
 
-##### **Using Explainable AI for EEG-based Reduced Montage Neonatal Seizure Detection**
-2406.16908v3 by Dinuka Sandun Udayantha, Kavindu Weerasinghe, Nima Wickramasinghe, Akila Abeyratne, Kithmin Wickremasinghe, Jithangi Wanigasinghe, Anjula De Silva, Chamira U. S. Edussooriya
+##### **GFM-RAG: Graph Foundation Model for Retrieval Augmented Generation**
+2502.01113v1 by Linhao Luo, Zicheng Zhao, Gholamreza Haffari, Dinh Phung, Chen Gong, Shirui Pan
 
-The neonatal period is the most vulnerable time for the development of
-seizures. Seizures in the immature brain lead to detrimental consequences,
-therefore require early diagnosis. The gold-standard for neonatal seizure
-detection currently relies on continuous video-EEG monitoring; which involves
-recording multi-channel electroencephalogram (EEG) alongside real-time video
-monitoring within a neonatal intensive care unit (NICU). However, video-EEG
-monitoring technology requires clinical expertise and is often limited to
-technologically advanced and resourceful settings. Cost-effective new
-techniques could help the medical fraternity make an accurate diagnosis and
-advocate treatment without delay. In this work, a novel explainable deep
-learning model to automate the neonatal seizure detection process with a
-reduced EEG montage is proposed, which employs convolutional nets, graph
-attention layers, and fully connected layers. Beyond its ability to detect
-seizures in real-time with a reduced montage, this model offers the unique
-advantage of real-time interpretability. By evaluating the performance on the
-Zenodo dataset with 10-fold cross-validation, the presented model achieves an
-absolute improvement of 8.31% and 42.86% in area under curve (AUC) and recall,
-respectively.
+Retrieval-augmented generation (RAG) has proven effective in integrating
+knowledge into large language models (LLMs). However, conventional RAGs
+struggle to capture complex relationships between pieces of knowledge, limiting
+their performance in intricate reasoning that requires integrating knowledge
+from multiple sources. Recently, graph-enhanced retrieval augmented generation
+(GraphRAG) builds graph structure to explicitly model these relationships,
+enabling more effective and efficient retrievers. Nevertheless, its performance
+is still hindered by the noise and incompleteness within the graph structure.
+To address this, we introduce GFM-RAG, a novel graph foundation model (GFM) for
+retrieval augmented generation. GFM-RAG is powered by an innovative graph
+neural network that reasons over graph structure to capture complex
+query-knowledge relationships. The GFM with 8M parameters undergoes a two-stage
+training process on large-scale datasets, comprising 60 knowledge graphs with
+over 14M triples and 700k documents. This results in impressive performance and
+generalizability for GFM-RAG, making it the first graph foundation model
+applicable to unseen datasets for retrieval without any fine-tuning required.
+Extensive experiments on three multi-hop QA datasets and seven domain-specific
+RAG datasets demonstrate that GFM-RAG achieves state-of-the-art performance
+while maintaining efficiency and alignment with neural scaling laws,
+highlighting its potential for further improvement.
 
-摘要：新生兒期是大腦發育最脆弱的時期，容易出現癲癇發作。大腦發育不成熟時出現癲癇發作會造成不良後果，因此需要及早診斷。目前新生兒癲癇發作的黃金標準依賴於連續的視訊腦電圖 (EEG) 監測；其中包括在新生兒加護病房 (NICU) 內同時進行多頻道腦電圖 (EEG) 記錄和即時視訊監控。然而，視訊腦電圖監控技術需要臨床專業知識，而且通常僅限於技術先進且資源豐富的環境。具成本效益的新技術可以幫助醫療界準確診斷並立即提倡治療。在這項工作中，提出了一個新穎的可解釋深度學習模型，以自動化新生兒癲癇發作偵測過程，並採用減少的腦電圖裝置，其中採用了卷積神經網路、圖形注意力層和全連接層。除了能夠使用減少的裝置即時偵測癲癇發作外，此模型還提供了即時可解釋性的獨特優勢。透過在 Zenodo 資料集上使用 10 倍交叉驗證評估效能，所提出的模型在曲線下面積 (AUC) 和召回率方面分別達到了 8.31% 和 42.86% 的絕對改善。
+摘要：檢索增強生成 (RAG) 已證明在整合知識到大語言模型 (LLM) 中有效。然而，傳統的 RAG 難以捕捉知識片段之間的複雜關係，限制了它們在需要整合來自多個來源的知識的複雜推理中的表現。最近，圖表增強檢索增強生成 (GraphRAG) 建立圖表結構來明確建模這些關係，從而實現更有效率的檢索器。儘管如此，其效能仍受到圖表結構中雜訊和不完整性的阻礙。為了解決這個問題，我們引入了 GFM-RAG，一種用於檢索增強生成的全新圖表基礎模型 (GFM)。GFM-RAG 由一個創新的圖神經網路驅動，該網路在圖表結構上進行推理以捕捉複雜的查詢知識關係。具有 8M 參數的 GFM 在大型資料集上進行兩階段訓練流程，包括 60 個包含超過 14M 個三元組和 700k 個文件的文件。這為 GFM-RAG 帶來了令人印象深刻的效能和通用性，使其成為第一個適用於未見過資料集的圖表基礎模型，而無需任何微調。在三個多跳問答資料集和七個特定領域 RAG 資料集上的廣泛實驗表明，GFM-RAG 達到了最先進的效能，同時保持了效率並與神經擴充定律保持一致，突顯了其進一步改進的潛力。
 
-##### **Breast Cancer Diagnosis: A Comprehensive Exploration of Explainable Artificial Intelligence (XAI) Techniques**
-2406.00532v1 by Samita Bai, Sidra Nasir, Rizwan Ahmed Khan, Sheeraz Arif, Alexandre Meyer, Hubert Konik
+##### **Knowledge Synthesis of Photosynthesis Research Using a Large Language Model**
+2502.01059v1 by Seungri Yoon, Woosang Jeon, Sanghyeok Choi, Taehyeong Kim, Tae In Ahn
 
-Breast cancer (BC) stands as one of the most common malignancies affecting
-women worldwide, necessitating advancements in diagnostic methodologies for
-better clinical outcomes. This article provides a comprehensive exploration of
-the application of Explainable Artificial Intelligence (XAI) techniques in the
-detection and diagnosis of breast cancer. As Artificial Intelligence (AI)
-technologies continue to permeate the healthcare sector, particularly in
-oncology, the need for transparent and interpretable models becomes imperative
-to enhance clinical decision-making and patient care. This review discusses the
-integration of various XAI approaches, such as SHAP, LIME, Grad-CAM, and
-others, with machine learning and deep learning models utilized in breast
-cancer detection and classification. By investigating the modalities of breast
-cancer datasets, including mammograms, ultrasounds and their processing with
-AI, the paper highlights how XAI can lead to more accurate diagnoses and
-personalized treatment plans. It also examines the challenges in implementing
-these techniques and the importance of developing standardized metrics for
-evaluating XAI's effectiveness in clinical settings. Through detailed analysis
-and discussion, this article aims to highlight the potential of XAI in bridging
-the gap between complex AI models and practical healthcare applications,
-thereby fostering trust and understanding among medical professionals and
-improving patient outcomes.
+The development of biological data analysis tools and large language models
+(LLMs) has opened up new possibilities for utilizing AI in plant science
+research, with the potential to contribute significantly to knowledge
+integration and research gap identification. Nonetheless, current LLMs struggle
+to handle complex biological data and theoretical models in photosynthesis
+research and often fail to provide accurate scientific contexts. Therefore,
+this study proposed a photosynthesis research assistant (PRAG) based on
+OpenAI's GPT-4o with retrieval-augmented generation (RAG) techniques and prompt
+optimization. Vector databases and an automated feedback loop were used in the
+prompt optimization process to enhance the accuracy and relevance of the
+responses to photosynthesis-related queries. PRAG showed an average improvement
+of 8.7% across five metrics related to scientific writing, with a 25.4%
+increase in source transparency. Additionally, its scientific depth and domain
+coverage were comparable to those of photosynthesis research papers. A
+knowledge graph was used to structure PRAG's responses with papers within and
+outside the database, which allowed PRAG to match key entities with 63% and
+39.5% of the database and test papers, respectively. PRAG can be applied for
+photosynthesis research and broader plant science domains, paving the way for
+more in-depth data analysis and predictive capabilities.
 
-摘要：乳癌 (BC) 是影響全球女性最常見的惡性腫瘤之一，因此需要進步的診斷方法，以改善臨床結果。本文全面探討了可解釋人工智慧 (XAI) 技術在乳癌偵測和診斷中的應用。隨著人工智慧 (AI) 技術持續滲透醫療保健領域，特別是在腫瘤學中，透明且可解釋的模型需求變得勢在必行，以增強臨床決策制定和患者照護。此篇評論探討了各種 XAI 方法的整合，例如 SHAP、LIME、Grad-CAM 等，以及用於乳癌偵測和分類的機器學習和深度學習模型。透過探討乳癌資料集的模式，包括乳房攝影、超音波及其在 AI 中的處理，本文重點說明 XAI 如何能導致更準確的診斷和個人化治療計畫。它也探討了實施這些技術的挑戰，以及制定標準化評量指標以評估 XAI 在臨床環境中的有效性的重要性。透過詳細的分析和討論，本文旨在強調 XAI 在縮小複雜 AI 模型與實務醫療保健應用之間差距的潛力，進而促進醫療專業人員之間的信任與理解，並改善患者的結果。
+摘要：生物資料分析工具和大型語言模型 (LLM) 的發展，為利用人工智慧於植物科學研究開啟了新的可能性，並有潛力對知識整合和研究差距的識別做出重大貢獻。儘管如此，目前的 LLM 在處理光合作用研究中的複雜生物資料和理論模型時仍有困難，而且常常無法提供準確的科學背景。因此，本研究提出了一個基於 OpenAI 的 GPT-4o、具備檢索增強生成 (RAG) 技術和提示最佳化的光合作用研究助理 (PRAG)。在提示最佳化過程中，使用了向量資料庫和自動回饋迴路，以增強對與光合作用相關查詢的回應的準確性和相關性。PRAG 在與科學寫作相關的五項指標中顯示出平均改善了 8.7%，來源透明度增加了 25.4%。此外，其科學深度和領域涵蓋範圍與光合作用研究論文相當。知識圖譜用於建構 PRAG 的回應，其中包含資料庫內外論文，這使得 PRAG 能夠分別與資料庫和測試論文中的 63% 和 39.5% 的關鍵實體相匹配。PRAG 可應用於光合作用研究和更廣泛的植物科學領域，為更深入的資料分析和預測能力鋪路。
 
-##### **Unveiling Hidden Factors: Explainable AI for Feature Boosting in Speech Emotion Recognition**
-2406.01624v2 by Alaa Nfissi, Wassim Bouachir, Nizar Bouguila, Brian Mishara
+##### **Encrypted Large Model Inference: The Equivariant Encryption Paradigm**
+2502.01013v1 by James Buban, Hongyang Zhang, Claudio Angione, Harry Yang, Ahmad Farhan, Seyfal Sultanov, Michael Du, Xuran Ma, Zihao Wang, Yue Zhao, Arria Owlia, Fielding Johnston, Patrick Colangelo
 
-Speech emotion recognition (SER) has gained significant attention due to its
-several application fields, such as mental health, education, and
-human-computer interaction. However, the accuracy of SER systems is hindered by
-high-dimensional feature sets that may contain irrelevant and redundant
-information. To overcome this challenge, this study proposes an iterative
-feature boosting approach for SER that emphasizes feature relevance and
-explainability to enhance machine learning model performance. Our approach
-involves meticulous feature selection and analysis to build efficient SER
-systems. In addressing our main problem through model explainability, we employ
-a feature evaluation loop with Shapley values to iteratively refine feature
-sets. This process strikes a balance between model performance and
-transparency, which enables a comprehensive understanding of the model's
-predictions. The proposed approach offers several advantages, including the
-identification and removal of irrelevant and redundant features, leading to a
-more effective model. Additionally, it promotes explainability, facilitating
-comprehension of the model's predictions and the identification of crucial
-features for emotion determination. The effectiveness of the proposed method is
-validated on the SER benchmarks of the Toronto emotional speech set (TESS),
-Berlin Database of Emotional Speech (EMO-DB), Ryerson Audio-Visual Database of
-Emotional Speech and Song (RAVDESS), and Surrey Audio-Visual Expressed Emotion
-(SAVEE) datasets, outperforming state-of-the-art methods. To the best of our
-knowledge, this is the first work to incorporate model explainability into an
-SER framework. The source code of this paper is publicly available via this
-https://github.com/alaaNfissi/Unveiling-Hidden-Factors-Explainable-AI-for-Feature-Boosting-in-Speech-Emotion-Recognition.
+Large scale deep learning model, such as modern language models and diffusion
+architectures, have revolutionized applications ranging from natural language
+processing to computer vision. However, their deployment in distributed or
+decentralized environments raises significant privacy concerns, as sensitive
+data may be exposed during inference. Traditional techniques like secure
+multi-party computation, homomorphic encryption, and differential privacy offer
+partial remedies but often incur substantial computational overhead, latency
+penalties, or limited compatibility with non-linear network operations. In this
+work, we introduce Equivariant Encryption (EE), a novel paradigm designed to
+enable secure, "blind" inference on encrypted data with near zero performance
+overhead. Unlike fully homomorphic approaches that encrypt the entire
+computational graph, EE selectively obfuscates critical internal
+representations within neural network layers while preserving the exact
+functionality of both linear and a prescribed set of non-linear operations.
+This targeted encryption ensures that raw inputs, intermediate activations, and
+outputs remain confidential, even when processed on untrusted infrastructure.
+We detail the theoretical foundations of EE, compare its performance and
+integration complexity against conventional privacy preserving techniques, and
+demonstrate its applicability across a range of architectures, from
+convolutional networks to large language models. Furthermore, our work provides
+a comprehensive threat analysis, outlining potential attack vectors and
+baseline strategies, and benchmarks EE against standard inference pipelines in
+decentralized settings. The results confirm that EE maintains high fidelity and
+throughput, effectively bridging the gap between robust data confidentiality
+and the stringent efficiency requirements of modern, large scale model
+inference.
 
-摘要：語音情緒辨識 (SER) 由於其在心理健康、教育和人機互動等多個應用領域而備受關注。然而，SER 系統的準確性受到高維特徵集的阻礙，這些特徵集可能包含不相關和冗餘的資訊。為了克服這個挑戰，本研究提出了一種用於 SER 的迭代特徵提升方法，該方法強調特徵相關性和可解釋性，以增強機器學習模型的效能。我們的做法涉及仔細的特徵選擇和分析，以建立高效的 SER 系統。為了透過模型可解釋性解決我們的核心問題，我們採用了具有 Shapley 值的特徵評估迴圈，以反覆改善特徵集。這個過程在模型效能和透明度之間取得平衡，這使得我們能夠全面了解模型的預測。所提出的方法提供了多項優點，包括識別和移除不相關和冗餘的特徵，從而建立更有效的模型。此外，它促進了可解釋性，有助於理解模型的預測以及識別情緒決定的關鍵特徵。所提出的方法的有效性已在多倫多情緒語音集 (TESS)、柏林情緒語音資料庫 (EMO-DB)、賴爾森音訊視覺情緒語音和歌曲資料庫 (RAVDESS) 和薩里音訊視覺表達情緒 (SAVEE) 資料集的 SER 基準上得到驗證，其效能優於現有方法。據我們所知，這是第一個將模型可解釋性納入 SER 架構的研究。本文的原始碼可透過此連結公開取得：https://github.com/alaaNfissi/Unveiling-Hidden-Factors-Explainable-AI-for-Feature-Boosting-in-Speech-Emotion-Recognition。
+摘要：大型深度學習模型，例如現代語言模型和擴散架構，徹底改變了從自然語言處理到電腦視覺等各種應用。然而，它們在分散式或分散式環境中的部署引發了重大的隱私問題，因為敏感數據可能會在推理過程中遭到揭露。安全多方計算、同態加密和差分隱私等傳統技術提供了部分補救措施，但通常會產生大量的計算開銷、延遲處罰，或與非線性網路操作相容性有限。在這項工作中，我們引入了等變加密 (EE)，這是一種新穎的範例，旨在以接近零效能開銷對加密數據進行安全、「盲目」推理。與加密整個計算圖形的完全同態方法不同，EE 有選擇性地混淆神經網路層內的關鍵內部表示，同時保留線性和規定的一組非線性操作的精確功能。這種有針對性的加密確保了原始輸入、中間激活和輸出保持機密，即使在不受信任的基礎設施上處理也是如此。我們詳細說明了 EE 的理論基礎，比較了其效能和整合複雜度與傳統的隱私保護技術，並展示了其在從卷積網路到大語言模型等各種架構中的適用性。此外，我們的研究提供了全面的威脅分析，概述了潛在的攻擊媒介和基準策略，並在分散式設定中將 EE 與標準推理管道進行比較。結果證實，EE 保持了高保真度和高傳輸量，有效地彌合了強大的數據機密性與現代化、大規模模型推理的嚴格效率要求之間的差距。
 
-##### **The Explanation Necessity for Healthcare AI**
-2406.00216v1 by Michail Mamalakis, Héloïse de Vareilles, Graham Murray, Pietro Lio, John Suckling
+##### **Metastable Dynamics of Chain-of-Thought Reasoning: Provable Benefits of Search, RL and Distillation**
+2502.01694v1 by Juno Kim, Denny Wu, Jason Lee, Taiji Suzuki
 
-Explainability is often critical to the acceptable implementation of
-artificial intelligence (AI). Nowhere is this more important than healthcare
-where decision-making directly impacts patients and trust in AI systems is
-essential. This trust is often built on the explanations and interpretations
-the AI provides. Despite significant advancements in AI interpretability, there
-remains the need for clear guidelines on when and to what extent explanations
-are necessary in the medical context. We propose a novel categorization system
-with four distinct classes of explanation necessity, guiding the level of
-explanation required: patient or sample (local) level, cohort or dataset
-(global) level, or both levels. We introduce a mathematical formulation that
-distinguishes these categories and offers a practical framework for researchers
-to determine the necessity and depth of explanations required in medical AI
-applications. Three key factors are considered: the robustness of the
-evaluation protocol, the variability of expert observations, and the
-representation dimensionality of the application. In this perspective, we
-address the question: When does an AI medical application need to be explained,
-and at what level of detail?
+A key paradigm to improve the reasoning capabilities of large language models
+(LLMs) is to allocate more inference-time compute to search against a verifier
+or reward model. This process can then be utilized to refine the pretrained
+model or distill its reasoning patterns into more efficient models. In this
+paper, we study inference-time compute by viewing chain-of-thought (CoT)
+generation as a metastable Markov process: easy reasoning steps (e.g.,
+algebraic manipulations) form densely connected clusters, while hard reasoning
+steps (e.g., applying a relevant theorem) create sparse, low-probability edges
+between clusters, leading to phase transitions at longer timescales. Under this
+framework, we prove that implementing a search protocol that rewards sparse
+edges improves CoT by decreasing the expected number of steps to reach
+different clusters. In contrast, we establish a limit on reasoning capability
+when the model is restricted to local information of the pretrained graph. We
+also show that the information gained by search can be utilized to obtain a
+better reasoning model: (1) the pretrained model can be directly finetuned to
+favor sparse edges via policy gradient methods, and moreover (2) a compressed
+metastable representation of the reasoning dynamics can be distilled into a
+smaller, more efficient model.
 
-摘要：可解释性通常对于人工智能 (AI) 的可接受实施至关重要。在医疗保健领域，这一点尤为重要，因为决策直接影响患者，并且对 AI 系统的信任至关重要。这种信任通常建立在 AI 提供的解释和诠释之上。尽管 AI 可解释性取得了重大进展，但仍然需要明确的指导方针，说明在医疗环境中何时以及在多大程度上需要解释。我们提出了一种新颖的分类系统，该系统具有四种不同的解释必要性类别，指导所需的解释级别：患者或样本（局部）级别、队列或数据集（全局）级别，或两个级别。我们引入了一个数学公式，该公式区分了这些类别，并为研究人员提供了一个实用框架，以确定医疗 AI 应用中所需的解释的必要性和深度。考虑了三个关键因素：评估协议的稳健性、专家观察的可变性以及应用程序的表示维数。从这个角度来看，我们解决了这个问题：AI 医疗应用何时需要解释，以及需要解释到何种程度？
+摘要：<paragraph>提升大型語言模型 (LLM) 推理能力的一個關鍵範例，是分配更多推論時間運算來搜尋驗證器或獎勵模型。此程序接著可用於改善預訓練模型或將其推理模式提煉到更有效率的模型中。在這篇論文中，我們透過將思維鏈 (CoT) 生成視為亞穩態馬可夫過程來研究推論時間運算：簡單的推理步驟（例如代數運算）形成密集連接的叢集，而困難的推理步驟（例如應用相關定理）則在叢集之間建立稀疏、低機率的邊緣，導致在較長時間尺度上產生相變。在此架構下，我們證明實作一種獎勵稀疏邊緣的搜尋協定，會透過減少到達不同叢集所需的預期步驟數來改善 CoT。相反地，當模型受限於預訓練圖形的局部資訊時，我們建立了推理能力的限制。我們也顯示搜尋所獲得的資訊可用於取得更好的推理模型：(1) 預訓練模型可以直接微調以透過策略梯度方法偏好稀疏邊緣，而且 (2) 推理動態的壓縮亞穩態表徵可以提煉到更小、更有效率的模型中。</paragraph>
 
-##### **Interdisciplinary Expertise to Advance Equitable Explainable AI**
-2406.18563v1 by Chloe R. Bennett, Heather Cole-Lewis, Stephanie Farquhar, Naama Haamel, Boris Babenko, Oran Lang, Mat Fleck, Ilana Traynis, Charles Lau, Ivor Horn, Courtney Lyles
+##### **PhiP-G: Physics-Guided Text-to-3D Compositional Scene Generation**
+2502.00708v1 by Qixuan Li, Chao Wang, Zongjin He, Yan Peng
 
-The field of artificial intelligence (AI) is rapidly influencing health and
-healthcare, but bias and poor performance persists for populations who face
-widespread structural oppression. Previous work has clearly outlined the need
-for more rigorous attention to data representativeness and model performance to
-advance equity and reduce bias. However, there is an opportunity to also
-improve the explainability of AI by leveraging best practices of social
-epidemiology and health equity to help us develop hypotheses for associations
-found. In this paper, we focus on explainable AI (XAI) and describe a framework
-for interdisciplinary expert panel review to discuss and critically assess AI
-model explanations from multiple perspectives and identify areas of bias and
-directions for future research. We emphasize the importance of the
-interdisciplinary expert panel to produce more accurate, equitable
-interpretations which are historically and contextually informed.
-Interdisciplinary panel discussions can help reduce bias, identify potential
-confounders, and identify opportunities for additional research where there are
-gaps in the literature. In turn, these insights can suggest opportunities for
-AI model improvement.
+Text-to-3D asset generation has achieved significant optimization under the
+supervision of 2D diffusion priors. However, when dealing with compositional
+scenes, existing methods encounter several challenges: 1). failure to ensure
+that composite scene layouts comply with physical laws; 2). difficulty in
+accurately capturing the assets and relationships described in complex scene
+descriptions; 3). limited autonomous asset generation capabilities among layout
+approaches leveraging large language models (LLMs). To avoid these compromises,
+we propose a novel framework for compositional scene generation, PhiP-G, which
+seamlessly integrates generation techniques with layout guidance based on a
+world model. Leveraging LLM-based agents, PhiP-G analyzes the complex scene
+description to generate a scene graph, and integrating a multimodal 2D
+generation agent and a 3D Gaussian generation method for targeted assets
+creation. For the stage of layout, PhiP-G employs a physical pool with adhesion
+capabilities and a visual supervision agent, forming a world model for layout
+prediction and planning. Extensive experiments demonstrate that PhiP-G
+significantly enhances the generation quality and physical rationality of the
+compositional scenes. Notably, PhiP-G attains state-of-the-art (SOTA)
+performance in CLIP scores, achieves parity with the leading methods in
+generation quality as measured by the T$^3$Bench, and improves efficiency by
+24x.
 
-摘要：人工智慧 (AI) 領域正快速影響著健康與醫療保健，但對於面臨廣泛結構性壓迫的人群來說，偏見和不良表現依然存在。先前的研究已清楚說明，需要更嚴格地注意資料代表性和模型效能，以促進公平性並減少偏見。然而，我們有機會透過運用社會流行病學和健康公平的最佳實務，來改善 AI 的可解釋性，以幫助我們針對發現的關聯性，發展假設。在本文中，我們專注於可解釋 AI (XAI)，並描述一個跨領域專家小組審查架構，以從多重觀點討論和批判性評估 AI 模型的解釋，並找出偏見領域和未來研究的方向。我們強調跨領域專家小組對於產生更準確、公平的詮釋至關重要，而這些詮釋是根據歷史和脈絡而來的。跨領域小組討論有助於減少偏見、找出潛在的混淆因素，並在文獻中有缺口時找出額外研究的機會。反過來，這些見解可以建議 AI 模型改進的機會。
+摘要：<paragraph>在 2D 擴散先驗的監督下，文字轉 3D 資產生成已取得顯著的最佳化。然而，在處理合成場景時，現有方法會遇到幾個挑戰：1) 無法確保複合場景佈局符合物理定律；2) 難以準確捕捉複雜場景描述中所描述的資產和關係；3) 在利用大型語言模型 (LLM) 的佈局方法中，自主資產生成能力有限。為了避免這些折衷，我們提出了一個合成場景生成的新框架 PhiP-G，它將生成技術與基於世界模型的佈局指導無縫整合。利用基於 LLM 的代理，PhiP-G 分析複雜的場景描述以生成場景圖，並整合多模態 2D 生成代理和 3D 高斯生成方法以進行目標資產創建。對於佈局階段，PhiP-G 採用具有附著能力的物理池和視覺監督代理，形成用於佈局預測和規劃的世界模型。大量的實驗證明，PhiP-G 大幅提升了合成場景的生成品質和物理合理性。值得注意的是，PhiP-G 在 CLIP 分數中獲得了最先進 (SOTA) 的效能，在 T$^3$Bench 測量的生成品質中與領先的方法達到同等水準，並將效率提升了 24 倍。</paragraph>
 
-##### **"It depends": Configuring AI to Improve Clinical Usefulness Across Contexts**
-2407.11978v1 by Hubert D. Zając, Jorge M. N. Ribeiro, Silvia Ingala, Simona Gentile, Ruth Wanjohi, Samuel N. Gitau, Jonathan F. Carlsen, Michael B. Nielsen, Tariq O. Andersen
+##### **A Survey of Quantized Graph Representation Learning: Connecting Graph Structures with Large Language Models**
+2502.00681v1 by Qika Lin, Zhen Peng, Kaize Shi, Kai He, Yiming Xu, Erik Cambria, Mengling Feng
 
-Artificial Intelligence (AI) repeatedly match or outperform radiologists in
-lab experiments. However, real-world implementations of radiological AI-based
-systems are found to provide little to no clinical value. This paper explores
-how to design AI for clinical usefulness in different contexts. We conducted 19
-design sessions and design interventions with 13 radiologists from 7 clinical
-sites in Denmark and Kenya, based on three iterations of a functional AI-based
-prototype. Ten sociotechnical dependencies were identified as crucial for the
-design of AI in radiology. We conceptualised four technical dimensions that
-must be configured to the intended clinical context of use: AI functionality,
-AI medical focus, AI decision threshold, and AI Explainability. We present four
-design recommendations on how to address dependencies pertaining to the medical
-knowledge, clinic type, user expertise level, patient context, and user
-situation that condition the configuration of these technical dimensions.
+Recent years have witnessed rapid advances in graph representation learning,
+with the continuous embedding approach emerging as the dominant paradigm.
+However, such methods encounter issues regarding parameter efficiency,
+interpretability, and robustness. Thus, Quantized Graph Representation (QGR)
+learning has recently gained increasing interest, which represents the graph
+structure with discrete codes instead of conventional continuous embeddings.
+Given its analogous representation form to natural language, QGR also possesses
+the capability to seamlessly integrate graph structures with large language
+models (LLMs). As this emerging paradigm is still in its infancy yet holds
+significant promise, we undertake this thorough survey to promote its rapid
+future prosperity. We first present the background of the general quantization
+methods and their merits. Moreover, we provide an in-depth demonstration of
+current QGR studies from the perspectives of quantized strategies, training
+objectives, distinctive designs, knowledge graph quantization, and
+applications. We further explore the strategies for code dependence learning
+and integration with LLMs. At last, we give discussions and conclude future
+directions, aiming to provide a comprehensive picture of QGR and inspire future
+research.
 
-摘要：人工智慧（AI）在實驗室實驗中不斷地與放射科醫師匹敵或表現得更出色。然而，發現放射科 AI 為基礎系統的實際執行幾乎沒有提供臨床價值。本文探討如何為 AI 設計在不同情境中臨床上的效用。我們根據功能性 AI 為基礎原型的三次迭代，在丹麥和肯亞的 7 個臨床場域與 13 位放射科醫師進行了 19 次設計會議和設計介入。十個社會技術依賴關係被認為對於放射科中 AI 的設計至關重要。我們概念化了四個技術面向，必須根據預期的臨床使用情境進行設定：AI 功能、AI 醫療重點、AI 決策門檻，以及 AI 可解釋性。我們提出四項設計建議，說明如何處理與醫療知識、診所類型、使用者專業知識等級、患者情境，以及影響這些技術面向設定的使用者情境相關的依賴關係。
+摘要：近年来，图表示学习取得了快速进展，其中连续嵌入方法作为主导范式出现。然而，此类方法遇到了参数效率、可解释性和鲁棒性方面的问题。因此，量化图表示 (QGR) 学习最近引起了越来越多的兴趣，它使用离散代码而不是传统的连续嵌入来表示图结构。鉴于其与自然语言类似的表示形式，QGR 也具备将图结构与大型语言模型 (LLM) 无缝集成的能力。由于这种新兴范式仍处于起步阶段，但前景广阔，我们进行了这项全面调查以促进其快速未来的繁荣。我们首先介绍了通用量化方法的背景及其优点。此外，我们从量化策略、训练目标、独特设计、知识图谱量化和应用的角度对当前的 QGR 研究进行了深入的论证。我们进一步探索了代码依赖性学习和与 LLM 集成的策略。最后，我们给出了讨论并总结了未来的方向，旨在提供 QGR 的全面图景并激发未来的研究。
 
-##### **Improving Health Professionals' Onboarding with AI and XAI for Trustworthy Human-AI Collaborative Decision Making**
-2405.16424v1 by Min Hun Lee, Silvana Xin Yi Choo, Shamala D/O Thilarajah
+##### **Challenges and Innovations in LLM-Powered Fake News Detection: A Synthesis of Approaches and Future Directions**
+2502.00339v1 by Jingyuan Yi, Zeqiu Xu, Tianyi Huang, Peiyang Yu
 
-With advanced AI/ML, there has been growing research on explainable AI (XAI)
-and studies on how humans interact with AI and XAI for effective human-AI
-collaborative decision-making. However, we still have a lack of understanding
-of how AI systems and XAI should be first presented to users without technical
-backgrounds. In this paper, we present the findings of semi-structured
-interviews with health professionals (n=12) and students (n=4) majoring in
-medicine and health to study how to improve onboarding with AI and XAI. For the
-interviews, we built upon human-AI interaction guidelines to create onboarding
-materials of an AI system for stroke rehabilitation assessment and AI
-explanations and introduce them to the participants. Our findings reveal that
-beyond presenting traditional performance metrics on AI, participants desired
-benchmark information, the practical benefits of AI, and interaction trials to
-better contextualize AI performance, and refine the objectives and performance
-of AI. Based on these findings, we highlight directions for improving
-onboarding with AI and XAI and human-AI collaborative decision-making.
+The pervasiveness of the dissemination of fake news through social media
+platforms poses critical risks to the trust of the general public, societal
+stability, and democratic institutions. This challenge calls for novel
+methodologies in detection, which can keep pace with the dynamic and
+multi-modal nature of misinformation. Recent works include powering the
+detection using large language model advances in multimodal frameworks,
+methodologies using graphs, and adversarial training in the literature of fake
+news. Based on the different approaches which can bring success, some key
+highlights will be underlined: enhanced LLM-improves accuracy through more
+advanced semantics and cross-modality fusion for robust detections. The review
+further identifies critical gaps in adaptability to dynamic social media
+trends, real-time, and cross-platform detection capabilities, as well as the
+ethical challenges thrown up by the misuse of LLMs. Future directions underline
+the development of style-agnostic models, cross-lingual detection frameworks,
+and robust policies with a view to mitigating LLM-driven misinformation. This
+synthesis thus lays a concrete foundation for those researchers and
+practitioners committed to reinforcing fake news detection systems with
+complications that keep on growing in the digital landscape.
 
-摘要：隨著先進的 AI/ML，對可解釋 AI (XAI) 的研究不斷增加，以及關於人類如何與 AI 和 XAI 互動以進行有效的人工智慧協作決策制定。然而，我們仍然缺乏對 AI 系統和 XAI 應如何首先呈現給沒有技術背景的用戶的了解。在本文中，我們展示了與醫療專業人員 (n=12) 和主修醫學和健康的學生 (n=4) 進行半結構化訪談的結果，以研究如何改善 AI 和 XAI 的入門。對於訪談，我們建立在人機互動準則之上，為中風康復評估和 AI 解釋的 AI 系統創建入門材料，並將它們介紹給參與者。我們的研究結果表明，除了呈現傳統的 AI 性能指標外，參與者還希望基准信息、AI 的實際好處以及交互試驗，以更好地將 AI 性能情境化，並完善 AI 的目標和性能。根據這些發現，我們強調了改進 AI 和 XAI 以及人機協作決策制定的入門方向。
+摘要：社群媒體平台上假新聞散播的普遍性對一般大眾的信任、社會穩定性與民主制度構成重大風險。這項挑戰需要在偵測方面採用創新的方法論，才能跟上錯誤資訊的動態和多模態特性。最近的研究包括使用多模態架構中大型語言模型的進展、使用圖形的方法論，以及在假新聞文獻中進行對抗訓練來強化偵測。根據可以帶來成功的不同方法，將重點說明一些重點：增強的 LLM 可透過更進階的語意和跨模態融合來提升準確度，以進行穩健的偵測。這篇評論進一步找出在適應動態社群媒體趨勢、即時和跨平台偵測能力方面的重大差距，以及 LLM 遭濫用的道德挑戰。未來的方向強調開發與風格無關的模型、跨語言偵測架構和穩健的政策，以減輕 LLM 驅動的錯誤資訊。因此，這種綜合分析為那些致力於強化假新聞偵測系統的研究人員和從業人員奠定了具體的基礎，而這些複雜性在數位環境中持續增長。
 
-##### **Exploring Nutritional Impact on Alzheimer's Mortality: An Explainable AI Approach**
-2405.17502v1 by Ziming Liu, Longjian Liu, Robert E. Heidel, Xiaopeng Zhao
+##### **DEUCE: Dual-diversity Enhancement and Uncertainty-awareness for Cold-start Active Learning**
+2502.00305v1 by Jiaxin Guo, C. L. Philip Chen, Shuzhen Li, Tong Zhang
 
-This article uses machine learning (ML) and explainable artificial
-intelligence (XAI) techniques to investigate the relationship between
-nutritional status and mortality rates associated with Alzheimers disease (AD).
-The Third National Health and Nutrition Examination Survey (NHANES III)
-database is employed for analysis. The random forest model is selected as the
-base model for XAI analysis, and the Shapley Additive Explanations (SHAP)
-method is used to assess feature importance. The results highlight significant
-nutritional factors such as serum vitamin B12 and glycated hemoglobin. The
-study demonstrates the effectiveness of random forests in predicting AD
-mortality compared to other diseases. This research provides insights into the
-impact of nutrition on AD and contributes to a deeper understanding of disease
-progression.
+Cold-start active learning (CSAL) selects valuable instances from an
+unlabeled dataset for manual annotation. It provides high-quality data at a low
+annotation cost for label-scarce text classification. However, existing CSAL
+methods overlook weak classes and hard representative examples, resulting in
+biased learning. To address these issues, this paper proposes a novel
+dual-diversity enhancing and uncertainty-aware (DEUCE) framework for CSAL.
+Specifically, DEUCE leverages a pretrained language model (PLM) to efficiently
+extract textual representations, class predictions, and predictive uncertainty.
+Then, it constructs a Dual-Neighbor Graph (DNG) to combine information on both
+textual diversity and class diversity, ensuring a balanced data distribution.
+It further propagates uncertainty information via density-based clustering to
+select hard representative instances. DEUCE performs well in selecting
+class-balanced and hard representative data by dual-diversity and
+informativeness. Experiments on six NLP datasets demonstrate the superiority
+and efficiency of DEUCE.
 
-摘要：本文使用機器學習 (ML) 和可解釋人工智慧 (XAI) 技術來探討營養狀況與阿茲海默症 (AD) 相關的死亡率之間的關係。採用第三次全國健康與營養檢查調查 (NHANES III) 資料庫進行分析。選擇隨機森林模型作為 XAI 分析的基礎模型，並使用 Shapley Additive Explanations (SHAP) 方法來評估特徵重要性。結果突顯了重要的營養因素，例如血清維生素 B12 和糖化血紅蛋白。該研究證明了隨機森林在預測 AD 死亡率方面相較於其他疾病的有效性。本研究提供了營養對 AD 的影響的見解，並有助於更深入地了解疾病的進展。
+摘要：冷啟動主動學習 (CSAL) 從未標記的資料集中選取有價值的實例進行手動標記。它以低標記成本提供高品質的資料，用於標籤稀少的文字分類。然而，現有的 CSAL 方法忽略了弱類別和難以代表的範例，導致有偏差的學習。為了解決這些問題，本文提出了一個新的雙重多樣性增強和不確定性感知 (DEUCE) 架構，用於 CSAL。具體來說，DEUCE 利用預訓練的語言模型 (PLM) 來有效地提取文字表徵、類別預測和預測不確定性。然後，它構建一個雙鄰居圖 (DNG) 來結合文字多樣性和類別多樣性的資訊，確保平衡的資料分佈。它進一步通過基於密度的聚類來傳播不確定性資訊，以選擇難以代表的實例。DEUCE 在通過雙重多樣性和資訊性選擇類別平衡和難以代表的資料方面表現良好。在六個 NLP 資料集上的實驗證明了 DEUCE 的優越性和效率。
 
-##### **Explainable AI Enhances Glaucoma Referrals, Yet the Human-AI Team Still Falls Short of the AI Alone**
-2407.11974v1 by Catalina Gomez, Ruolin Wang, Katharina Breininger, Corinne Casey, Chris Bradley, Mitchell Pavlak, Alex Pham, Jithin Yohannan, Mathias Unberath
+##### **Longer Attention Span: Increasing Transformer Context Length with Sparse Graph Processing Techniques**
+2502.01659v2 by Nathaniel Tomczak, Sanmukh Kuppannagari
 
-Primary care providers are vital for initial triage and referrals to
-specialty care. In glaucoma, asymptomatic and fast progression can lead to
-vision loss, necessitating timely referrals to specialists. However, primary
-eye care providers may not identify urgent cases, potentially delaying care.
-Artificial Intelligence (AI) offering explanations could enhance their referral
-decisions. We investigate how various AI explanations help providers
-distinguish between patients needing immediate or non-urgent specialist
-referrals. We built explainable AI algorithms to predict glaucoma surgery needs
-from routine eyecare data as a proxy for identifying high-risk patients. We
-incorporated intrinsic and post-hoc explainability and conducted an online
-study with optometrists to assess human-AI team performance, measuring referral
-accuracy and analyzing interactions with AI, including agreement rates, task
-time, and user experience perceptions. AI support enhanced referral accuracy
-among 87 participants (59.9%/50.8% with/without AI), though Human-AI teams
-underperformed compared to AI alone. Participants believed they included AI
-advice more when using the intrinsic model, and perceived it more useful and
-promising. Without explanations, deviations from AI recommendations increased.
-AI support did not increase workload, confidence, and trust, but reduced
-challenges. On a separate test set, our black-box and intrinsic models achieved
-an accuracy of 77% and 71%, respectively, in predicting surgical outcomes. We
-identify opportunities of human-AI teaming for glaucoma management in primary
-eye care, noting that while AI enhances referral accuracy, it also shows a
-performance gap compared to AI alone, even with explanations. Human involvement
-remains essential in medical decision making, underscoring the need for future
-research to optimize collaboration, ensuring positive experiences and safe AI
-use.
+Transformers have demonstrated great success in numerous domains including
+natural language processing and bioinformatics. This success stems from the use
+of the attention mechanism by these models in order to represent and propagate
+pairwise interactions between individual tokens of sequential data. However,
+the primary limitation of this operation is its quadratic memory and time
+complexity in relation to the input's context length - the length of a sequence
+over which the interactions need to be captured. This significantly limits the
+length of sequences that can be inferred upon by these models. Extensive
+research has been conducted to reduce the number of pairwise interactions to
+sub-quadratic in relation to the context length by introducing sparsity into
+the attention mechanism through the development of sparse attention masks.
+However, efficient implementations that achieve "true sparsity" are lacking.
+  In this work, we address this issue by proposing a graph computing view of
+attention where tokens are perceived as nodes of the graph and the attention
+mask determines the edges of the graph. Using this view, we develop graph
+processing algorithms to implement the attention mechanism. Both theoretically
+and empirically, we demonstrate that our algorithms only perform the needed
+computations, i.e., they are work optimal. We also perform extensive
+experimentation using popular attention masks to explore the impact of sparsity
+on execution time and achievable context length. Our experiments demonstrate
+significant speedups in execution times compared to state-of-the-art attention
+implementations such as FlashAttention for large sequence lengths. We also
+demonstrate that our algorithms are able to achieve extremely long sequence
+lengths of as high as 160 million on a single NVIDIA A100 GPU (SXM4 80GB).
 
-摘要：<paragraph>初級保健提供者對於最初的分流和轉診到專科照護至關重要。在青光眼的情況下，無症狀且快速惡化可能導致視力喪失，因此需要及時轉診給專家。然而，初級眼科保健提供者可能無法識別緊急情況，可能會延誤照護。提供解釋的人工智慧 (AI) 可以加強他們的轉診決策。我們研究各種 AI 解釋如何幫助提供者區分需要立即或非緊急專科轉診的患者。我們建立了解釋性 AI 演算法，以從例行眼科護理資料預測青光眼手術需求，作為識別高風險患者的代理。我們納入了內在和事後解釋性，並與驗光師進行了一項線上研究，以評估人機團隊的表現，衡量轉診準確度並分析與 AI 的互動，包括同意率、任務時間和使用者體驗感知。在 87 名參與者中，AI 支援提高了轉診準確度（使用 AI/未使用的比例為 59.9%/50.8%），儘管人機團隊的表現不如單獨使用 AI。參與者認為他們在使用內在模型時更多地納入了 AI 建議，並認為它更有用且更有希望。沒有解釋，AI 建議的偏差會增加。AI 支援並未增加工作量、信心和信任，但減少了挑戰。在一個單獨的測試集中，我們的黑盒子和內在模型在預測手術結果方面分別達到了 77% 和 71% 的準確度。我們找出在初級眼科保健中，人機團隊合作管理青光眼的機會，並注意到雖然 AI 提高了轉診準確度，但即使有解釋，它也顯示出與單獨使用 AI 相比的效能差距。人類參與在醫療決策中仍然至關重要，這強調了未來研究優化協作、確保正面經驗和安全使用 AI 的必要性。</paragraph>
+摘要：變形金剛已在許多領域展現出巨大的成功，包括自然語言處理和生物資訊學。這種成功源自於這些模型使用注意機制來表示和傳播序列資料中各個標記之間成對的互動。然而，這種運算的主要限制在於其二次記憶體和時間複雜度與輸入的內容長度有關，也就是需要擷取互動的序列長度。這會顯著限制這些模型可以推論的序列長度。已經進行了大量的研究來減少成對互動的數量，使其與內容長度成次二次關係，方法是透過開發稀疏注意遮罩來將稀疏性引入注意機制。然而，缺乏能達成「真實稀疏性」的高效實作。在這項工作中，我們透過提出注意力的圖形運算檢視來解決這個問題，其中標記被視為圖形的節點，而注意力遮罩則決定圖形中的邊緣。使用這種檢視，我們開發了圖形處理演算法來實作注意力機制。我們在理論上和經驗上都證明了我們的演算法只執行必要的運算，也就是說，它們是工作最優的。我們也使用流行的注意力遮罩進行廣泛的實驗，以探討稀疏性對執行時間和可達成的內容長度的影響。我們的實驗證明，與最先進的注意力實作（例如 FlashAttention）相比，對於大型序列長度，我們的演算法在執行時間方面有顯著的加速。我們也證明了我們的演算法能夠在單一的 NVIDIA A100 GPU (SXM4 80GB) 上達成極長的序列長度，最高可達 1.6 億。
 
-##### **Decoding Decision Reasoning: A Counterfactual-Powered Model for Knowledge Discovery**
-2406.18552v1 by Yingying Fang, Zihao Jin, Xiaodan Xing, Simon Walsh, Guang Yang
+##### **Improving vision-language alignment with graph spiking hybrid Networks**
+2501.19069v1 by Siyu Zhang, Heming Zheng, Yiming Wu, Yeming Chen
 
-In medical imaging, particularly in early disease detection and prognosis
-tasks, discerning the rationale behind an AI model's predictions is crucial for
-evaluating the reliability of its decisions. Conventional explanation methods
-face challenges in identifying discernible decisive features in medical image
-classifications, where discriminative features are subtle or not immediately
-apparent. To bridge this gap, we propose an explainable model that is equipped
-with both decision reasoning and feature identification capabilities. Our
-approach not only detects influential image patterns but also uncovers the
-decisive features that drive the model's final predictions. By implementing our
-method, we can efficiently identify and visualise class-specific features
-leveraged by the data-driven model, providing insights into the decision-making
-processes of deep learning models. We validated our model in the demanding
-realm of medical prognosis task, demonstrating its efficacy and potential in
-enhancing the reliability of AI in healthcare and in discovering new knowledge
-in diseases where prognostic understanding is limited.
+To bridge the semantic gap between vision and language (VL), it is necessary
+to develop a good alignment strategy, which includes handling semantic
+diversity, abstract representation of visual information, and generalization
+ability of models. Recent works use detector-based bounding boxes or patches
+with regular partitions to represent visual semantics. While current paradigms
+have made strides, they are still insufficient for fully capturing the nuanced
+contextual relations among various objects. This paper proposes a comprehensive
+visual semantic representation module, necessitating the utilization of
+panoptic segmentation to generate coherent fine-grained semantic features.
+Furthermore, we propose a novel Graph Spiking Hybrid Network (GSHN) that
+integrates the complementary advantages of Spiking Neural Networks (SNNs) and
+Graph Attention Networks (GATs) to encode visual semantic information.
+Intriguingly, the model not only encodes the discrete and continuous latent
+variables of instances but also adeptly captures both local and global
+contextual features, thereby significantly enhancing the richness and diversity
+of semantic representations. Leveraging the spatiotemporal properties inherent
+in SNNs, we employ contrastive learning (CL) to enhance the similarity-based
+representation of embeddings. This strategy alleviates the computational
+overhead of the model and enriches meaningful visual representations by
+constructing positive and negative sample pairs. We design an innovative
+pre-training method, Spiked Text Learning (STL), which uses text features to
+improve the encoding ability of discrete semantics. Experiments show that the
+proposed GSHN exhibits promising results on multiple VL downstream tasks.
 
-摘要：在醫學影像中，特別是在早期疾病檢測和預後任務中，辨別 AI 模型預測背後的原理對於評估其決策的可靠性至關重要。傳統的解釋方法在識別醫學影像分類中可識別的決定性特徵時面臨挑戰，其中區別性特徵很微妙或並不明顯。為了彌合這一差距，我們提出了一個可解釋的模型，該模型具備決策推理和特徵識別能力。我們的做法不僅檢測有影響力的影像模式，還揭示了推動模型最終預測的決定性特徵。通過實施我們的模型，我們可以有效識別和視覺化由數據驅動模型利用的類特定特徵，從而深入了解深度學習模型的決策過程。我們在要求嚴格的醫學預後任務領域驗證了我們的模型，展示了其在提高 AI 在醫療保健中的可靠性和發現預後理解受限疾病的新知識方面的功效和潛力。
+摘要：<paragraph>為了彌合視覺和語言 (VL) 之間的語意差距，必須制定良好的對齊策略，其中包括處理語意多樣性、視覺資訊的抽象表示以及模型的泛化能力。最近的研究使用基於偵測器的邊界框或具有規則分割的區塊來表示視覺語意。雖然目前的範例已取得進展，但對於完全捕捉各種物件之間的細微脈絡關係仍不足夠。本文提出了一個全面的視覺語意表示模組，需要利用全景分割來產生連貫的細粒度語意特徵。此外，我們提出了一個新穎的圖形脈衝混合網路 (GSHN)，它整合了脈衝神經網路 (SNN) 和圖形注意力網路 (GAT) 的互補優勢來編碼視覺語意資訊。有趣的是，該模型不僅編碼實例的離散和連續潛在變數，還能巧妙地捕捉局部和全域脈絡特徵，從而顯著增強語意表示的豐富性和多樣性。利用 SNN 中固有的時空特性，我們採用對比學習 (CL) 來增強嵌入的基於相似性的表示。此策略減輕了模型的計算負擔，並透過建構正負樣本對來豐富有意義的視覺表示。我們設計了一個創新的預訓練方法，脈衝文本學習 (STL)，它使用文本特徵來提高離散語意的編碼能力。實驗表明，所提出的 GSHN 在多個 VL 下游任務上展現出有希望的結果。</paragraph>
 
-##### **The Role of Emotions in Informational Support Question-Response Pairs in Online Health Communities: A Multimodal Deep Learning Approach**
-2405.13099v1 by Mohsen Jozani, Jason A. Williams, Ahmed Aleroud, Sarbottam Bhagat
+##### **Semantic Web and Creative AI -- A Technical Report from ISWS 2023**
+2501.18542v1 by Raia Abu Ahmad, Reham Alharbi, Roberto Barile, Martin Böckling, Francisco Bolanos, Sara Bonfitto, Oleksandra Bruns, Irene Celino, Yashrajsinh Chudasama, Martin Critelli, Claudia d'Amato, Giada D'Ippolito, Ioannis Dasoulas, Stefano De Giorgis, Vincenzo De Leo, Chiara Di Bonaventura, Marco Di Panfilo, Daniil Dobriy, John Domingue, Xuemin Duan, Michel Dumontier, Sefika Efeoglu, Ruben Eschauzier, Fakih Ginwa, Nicolas Ferranti, Arianna Graciotti, Philipp Hanisch, George Hannah, Golsa Heidari, Aidan Hogan, Hassan Hussein, Alexane Jouglar, Jan-Christoph Kalo, Manoé Kieffer, Antonis Klironomos, Inês Koch, Weronika Lajewska, Nicolas Lazzari, Mikael Lindekrans, Anna Sofia Lippolis, Majlinda Llugiqi, Eleonora Mancini, Eleonora Marzi, Laura Menotti, Daniela Milon Flores, Soulakshmee Nagowah, Kerstin Neubert, Emetis Niazmand, Ebrahim Norouzi, Beatriz Olarte Martinez, Anouk Michelle Oudshoorn, Andrea Poltronieri, Valentina Presutti, Disha Purohit, Ensiyeh Raoufi, Celian Ringwald, Johanna Rockstroh, Sebastian Rudolph, Harald Sack, Zafar Saeed, Mohammad Javad Saeedizade, Aya Sahbi, Cristian Santini, Aleksandra Simic, Dennis Sommer, Rita Sousa, Mary Ann Tan, Vidyashree Tarikere, Tabea Tietz, Liam Tirpitz, Arnaldo Tomasino, Frank van Harmelen, Joao Vissoci, Caitlin Woods, Bohui Zhang, Xinyue Zhang, Heng Zheng
 
-This study explores the relationship between informational support seeking
-questions, responses, and helpfulness ratings in online health communities. We
-created a labeled data set of question-response pairs and developed multimodal
-machine learning and deep learning models to reliably predict informational
-support questions and responses. We employed explainable AI to reveal the
-emotions embedded in informational support exchanges, demonstrating the
-importance of emotion in providing informational support. This complex
-interplay between emotional and informational support has not been previously
-researched. The study refines social support theory and lays the groundwork for
-the development of user decision aids. Further implications are discussed.
+The International Semantic Web Research School (ISWS) is a week-long
+intensive program designed to immerse participants in the field. This document
+reports a collaborative effort performed by ten teams of students, each guided
+by a senior researcher as their mentor, attending ISWS 2023. Each team provided
+a different perspective to the topic of creative AI, substantiated by a set of
+research questions as the main subject of their investigation. The 2023 edition
+of ISWS focuses on the intersection of Semantic Web technologies and Creative
+AI. ISWS 2023 explored various intersections between Semantic Web technologies
+and creative AI. A key area of focus was the potential of LLMs as support tools
+for knowledge engineering. Participants also delved into the multifaceted
+applications of LLMs, including legal aspects of creative content production,
+humans in the loop, decentralised approaches to multimodal generative AI
+models, nanopublications and AI for personal scientific knowledge graphs,
+commonsense knowledge in automatic story and narrative completion, generative
+AI for art critique, prompt engineering, automatic music composition,
+commonsense prototyping and conceptual blending, and elicitation of tacit
+knowledge. As Large Language Models and semantic technologies continue to
+evolve, new exciting prospects are emerging: a future where the boundaries
+between creative expression and factual knowledge become increasingly permeable
+and porous, leading to a world of knowledge that is both informative and
+inspiring.
+
+摘要：國際語意網路研究學校 (ISWS) 是一個為期一週的密集課程，旨在讓參與者沉浸在該領域中。本文件報告了由十個學生團隊進行的合作成果，每個團隊都由一位資深研究員作為導師，參加了 2023 年 ISWS。每個團隊都從不同的角度探討了創意 AI 主題，並以一系列研究問題作為調查的主要主題。2023 年版的 ISWS 關注於語意網路技術和創意 AI 的交集。ISWS 2023 探索了語意網路技術和創意 AI 之間的各種交集。一個重點關注領域是 LLM 作為知識工程的支援工具的潛力。參與者還深入探討了 LLM 的多方面應用，包括創意內容製作的法律方面、循環中的人類、多模態生成式 AI 模型的分散式方法、納米出版物和用於個人科學知識圖譜的 AI、自動故事和敘述完成中的常識知識、生成式 AI 用於藝術評論、提示工程、自動音樂創作、常識原型和概念混合，以及對默會知識的引導。隨著大型語言模型和語意技術的持續發展，新的令人興奮的前景正在出現：一個創意表達和事實知識之間的界限變得越來越可滲透和多孔的未來，從而導致一個既有資訊性又有啟發性的知識世界。
 
-摘要：本研究探討線上健康社群中尋求資訊支持的問題、回應，以及有幫助的評分之間的關係。我們建立了一組標記的問答配對資料集，並開發了多模態機器學習和深度學習模型，以可靠地預測資訊支持問題和回應。我們採用可解釋的 AI 來揭示資訊支持交流中蘊含的情緒，證明情緒在提供資訊支持中的重要性。這種情緒支持和資訊支持之間的複雜交互作用以前並未被研究過。本研究改進了社會支持理論，並為使用者決策輔助工具的開發奠定了基礎。討論了進一步的影響。
+##### **Leveraging LLM Agents for Automated Optimization Modeling for SASP Problems: A Graph-RAG based Approach**
+2501.18320v1 by Tianpeng Pan, Wenqiang Pu, Licheng Zhao, Rui Zhou
 
-##### **ChatGPT in Classrooms: Transforming Challenges into Opportunities in Education**
-2405.10645v1 by Harris Bin Munawar, Nikolaos Misirlis
+Automated optimization modeling (AOM) has evoked considerable interest with
+the rapid evolution of large language models (LLMs). Existing approaches
+predominantly rely on prompt engineering, utilizing meticulously designed
+expert response chains or structured guidance. However, prompt-based techniques
+have failed to perform well in the sensor array signal processing (SASP) area
+due the lack of specific domain knowledge. To address this issue, we propose an
+automated modeling approach based on retrieval-augmented generation (RAG)
+technique, which consists of two principal components: a multi-agent (MA)
+structure and a graph-based RAG (Graph-RAG) process. The MA structure is
+tailored for the architectural AOM process, with each agent being designed
+based on principles of human modeling procedure. The Graph-RAG process serves
+to match user query with specific SASP modeling knowledge, thereby enhancing
+the modeling result. Results on ten classical signal processing problems
+demonstrate that the proposed approach (termed as MAG-RAG) outperforms several
+AOM benchmarks.
 
-In the era of exponential technology growth, one unexpected guest has claimed
-a seat in classrooms worldwide, Artificial Intelligence. Generative AI, such as
-ChatGPT, promises a revolution in education, yet it arrives with a double-edged
-sword. Its potential for personalized learning is offset by issues of cheating,
-inaccuracies, and educators struggling to incorporate it effectively into their
-lesson design. We are standing on the brink of this educational frontier, and
-it is clear that we need to navigate this terrain with a lot of care. This is a
-major challenge that could undermine the integrity and value of our educational
-process. So, how can we turn these challenges into opportunities? When used
-inappropriately, AI tools can become the perfect tool for the cut copy paste
-mentality, and quickly begin to corrode critical thinking, creativity, and deep
-understanding, the most important skills in our rapidly changing world.
-Teachers feel that they are not equipped to leverage this technology, widening
-the digital divide among educators and institutions. Addressing these concerns
-calls for an in depth research approach. We will employ empirical research,
-drawing on the Technology Acceptance Model, to assess the attitudes toward
-generative AI among educators and students. Understanding their perceptions,
-usage patterns, and hurdles is the first crucial step in creating an effective
-solution. The present study will be used as a process manual for future
-researchers to apply, running their own data, based on the steps explained here
+摘要：自動化最佳化建模 (AOM) 隨著大型語言模型 (LLM) 的快速演進而引起相當大的興趣。現有方法主要依賴提示工程，利用精心設計的專家回應鏈或結構化指導。然而，基於提示的技術由於缺乏特定領域知識，無法在感測器陣列訊號處理 (SASP) 領域中表現良好。為了解決這個問題，我們提出一個基於檢索增強生成 (RAG) 技術的自動化建模方法，它包含兩個主要組成部分：多代理 (MA) 結構和基於圖形的 RAG (Graph-RAG) 程序。MA 結構是針對架構 AOM 程序量身打造，每個代理都是根據人類建模程序的原理設計的。Graph-RAG 程序用於將使用者查詢與特定的 SASP 建模知識相匹配，從而增強建模結果。在十個經典訊號處理問題上的結果表明，所提出的方法（稱為 MAG-RAG）優於多個 AOM 基準。
 
-摘要：在科技飛速發展的時代，一位意外的訪客已在全球教室中佔有一席之地，那就是人工智慧。生成式 AI，例如 ChatGPT，承諾在教育領域掀起一場革命，但它卻是一把雙面刃。它在個人化學習方面的潛力，卻因作弊、不準確以及教育工作者難以將其有效融入教學設計等問題而抵銷。我們正站在這教育前沿的邊緣，顯然我們需要非常小心地探索這片領域。這是一個重大的挑戰，可能會損害我們教育過程的完整性和價值。那麼，我們如何將這些挑戰轉化為機遇？當不適當地使用時，AI 工具可能會成為複製貼上心態的完美工具，並迅速腐蝕批判性思維、創造力和深入理解，這些都是我們快速變化的世界中最重要的技能。教師們覺得他們沒有能力利用這項技術，這擴大了教育工作者和機構之間的數位鴻溝。解決這些問題需要深入的研究方法。我們將採用實證研究，借鑑技術接受模型，來評估教育工作者和學生對生成式 AI 的態度。了解他們的看法、使用模式和障礙是創造有效解決方案的第一個關鍵步驟。本研究將作為未來研究人員應用的流程手冊，根據此處說明的步驟運行他們自己的數據
+##### **Mixed-Precision Graph Neural Quantization for Low Bit Large Language Models**
+2501.18154v1 by Wanlong Liu, Yichen Xiao, Dingyi Zeng, Hongyang Zhao, Wenyu Chen, Malu Zhang
 
-##### **Evaluating the Explainable AI Method Grad-CAM for Breath Classification on Newborn Time Series Data**
-2405.07590v1 by Camelia Oprea, Mike Grüne, Mateusz Buglowski, Lena Olivier, Thorsten Orlikowsky, Stefan Kowalewski, Mark Schoberer, André Stollenwerk
+Post-Training Quantization (PTQ) is pivotal for deploying large language
+models (LLMs) within resource-limited settings by significantly reducing
+resource demands. However, existing PTQ strategies underperform at low bit
+levels < 3 bits due to the significant difference between the quantized and
+original weights. To enhance the quantization performance at low bit widths, we
+introduce a Mixed-precision Graph Neural PTQ (MG-PTQ) approach, employing a
+graph neural network (GNN) module to capture dependencies among weights and
+adaptively assign quantization bit-widths. Through the information propagation
+of the GNN module, our method more effectively captures dependencies among
+target weights, leading to a more accurate assessment of weight importance and
+optimized allocation of quantization strategies. Extensive experiments on the
+WikiText2 and C4 datasets demonstrate that our MG-PTQ method outperforms
+previous state-of-the-art PTQ method GPTQ, setting new benchmarks for
+quantization performance under low-bit conditions.
 
-With the digitalization of health care systems, artificial intelligence
-becomes more present in medicine. Especially machine learning shows great
-potential for complex tasks such as time series classification, usually at the
-cost of transparency and comprehensibility. This leads to a lack of trust by
-humans and thus hinders its active usage. Explainable artificial intelligence
-tries to close this gap by providing insight into the decision-making process,
-the actual usefulness of its different methods is however unclear. This paper
-proposes a user study based evaluation of the explanation method Grad-CAM with
-application to a neural network for the classification of breaths in time
-series neonatal ventilation data. We present the perceived usefulness of the
-explainability method by different stakeholders, exposing the difficulty to
-achieve actual transparency and the wish for more in-depth explanations by many
-of the participants.
+摘要：訓練後量化 (PTQ) 對於在資源受限的設定中部署大型語言模型 (LLM) 至關重要，因為它能顯著降低資源需求。然而，現有的 PTQ 策略在低位元層級 < 3 位元時表現不佳，因為量化後的權重與原始權重之間有顯著的差異。為了提升低位元寬度的量化效能，我們提出混合精度圖神經網路 PTQ (MG-PTQ) 方法，採用圖神經網路 (GNN) 模組來擷取權重之間的依存關係，並動態分配量化位元寬度。透過 GNN 模組的資訊傳播，我們的方法能更有效地擷取目標權重之間的依存關係，進而更準確地評估權重重要性，並最佳化量化策略的配置。在 WikiText2 和 C4 資料集上的廣泛實驗證明，我們的 MG-PTQ 方法優於先前的最先進 PTQ 方法 GPTQ，在低位元條件下設定了量化效能的新基準。
 
-摘要：隨著醫療保健系統的數位化，人工智慧在醫學領域中變得更加普及。特別是機器學習在時間序列分類等複雜任務中展現出極大的潛力，但通常是以透明度和可理解性為代價。這導致人類缺乏信任，從而阻礙了其積極使用。可解釋的人工智慧試圖通過提供對決策過程的洞察來彌補這一差距，但其不同方法的實際效用尚不清楚。本文提出了一個基於使用者研究的評估，其中包含了 Grad-CAM 解釋方法，並將其應用於神經網路以分類時間序列新生兒呼吸數據中的呼吸。我們展示了不同利益相關者對可解釋性方法的感知效用，揭示了實現實際透明度的難度，以及許多參與者希望獲得更深入的解釋。
+##### **Self-supervised Quantized Representation for Seamlessly Integrating Knowledge Graphs with Large Language Models**
+2501.18119v1 by Qika Lin, Tianzhe Zhao, Kai He, Zhen Peng, Fangzhi Xu, Ling Huang, Jingying Ma, Mengling Feng
 
-##### **XAI4LLM. Let Machine Learning Models and LLMs Collaborate for Enhanced In-Context Learning in Healthcare**
-2405.06270v3 by Fatemeh Nazary, Yashar Deldjoo, Tommaso Di Noia, Eugenio di Sciascio
+Due to the presence of the natural gap between Knowledge Graph (KG)
+structures and the natural language, the effective integration of holistic
+structural information of KGs with Large Language Models (LLMs) has emerged as
+a significant question. To this end, we propose a two-stage framework to learn
+and apply quantized codes for each entity, aiming for the seamless integration
+of KGs with LLMs. Firstly, a self-supervised quantized representation (SSQR)
+method is proposed to compress both KG structural and semantic knowledge into
+discrete codes (\ie, tokens) that align the format of language sentences. We
+further design KG instruction-following data by viewing these learned codes as
+features to directly input to LLMs, thereby achieving seamless integration. The
+experiment results demonstrate that SSQR outperforms existing unsupervised
+quantized methods, producing more distinguishable codes. Further, the
+fine-tuned LLaMA2 and LLaMA3.1 also have superior performance on KG link
+prediction and triple classification tasks, utilizing only 16 tokens per entity
+instead of thousands in conventional prompting methods.
 
-The integration of Large Language Models (LLMs) into healthcare diagnostics
-offers a promising avenue for clinical decision-making. This study outlines the
-development of a novel method for zero-shot/few-shot in-context learning (ICL)
-by integrating medical domain knowledge using a multi-layered structured
-prompt. We also explore the efficacy of two communication styles between the
-user and LLMs: the Numerical Conversational (NC) style, which processes data
-incrementally, and the Natural Language Single-Turn (NL-ST) style, which
-employs long narrative prompts.
-  Our study systematically evaluates the diagnostic accuracy and risk factors,
-including gender bias and false negative rates, using a dataset of 920 patient
-records in various few-shot scenarios. Results indicate that traditional
-clinical machine learning (ML) models generally outperform LLMs in zero-shot
-and few-shot settings. However, the performance gap narrows significantly when
-employing few-shot examples alongside effective explainable AI (XAI) methods as
-sources of domain knowledge. Moreover, with sufficient time and an increased
-number of examples, the conversational style (NC) nearly matches the
-performance of ML models. Most notably, LLMs demonstrate comparable or superior
-cost-sensitive accuracy relative to ML models.
-  This research confirms that, with appropriate domain knowledge and tailored
-communication strategies, LLMs can significantly enhance diagnostic processes.
-The findings highlight the importance of optimizing the number of training
-examples and communication styles to improve accuracy and reduce biases in LLM
-applications.
+摘要：由於知識圖譜 (KG) 結構與自然語言之間存在自然差距，將 KG 的整體結構資訊與大型語言模型 (LLM) 有效整合已成為一個重要的問題。為此，我們提出了一個兩階段架構來學習和應用每個實體的量化碼，旨在將 KG 與 LLM 無縫整合。首先，提出了一個自監督量化表示 (SSQR) 方法，將 KG 結構和語義知識壓縮成離散碼（即，符號），以對齊語言句子的格式。我們進一步設計 KG 指令遵循資料，將這些學習到的碼視為直接輸入 LLM 的特徵，從而實現無縫整合。實驗結果表明，SSQR 優於現有的無監督量化方法，產生更具區別性的碼。此外，微調後的 LLaMA2 和 LLaMA3.1 在 KG 連結預測和三元分類任務上也具有優異的性能，每個實體僅使用 16 個符號，而不是傳統提示方法中的數千個。
 
-摘要：大型語言模型 (LLM) 與醫療診斷整合
-為臨床決策提供了一個有前景的途徑。本研究概述了一種新穎方法的開發，用於零次學習/少量學習情境學習 (ICL)，方法是使用多層結構化提示整合醫療領域知識。我們還探討了使用者與 LLM 之間兩種溝通方式的功效：數值對話 (NC) 方式，它會逐步處理資料，以及自然語言單回合 (NL-ST) 方式，它會使用長篇敘事提示。
-我們的研究系統性地評估了診斷準確性和風險因子，包括性別偏見和假陰性率，使用了一個包含 920 個患者記錄的資料集，採用各種少量學習情境。結果表明，傳統的臨床機器學習 (ML) 模型通常在零次學習和少量學習設定中表現優於 LLM。然而，當使用少量學習範例以及有效的可解釋 AI (XAI) 方法作為領域知識來源時，效能差距會顯著縮小。此外，隨著時間充足和範例數量增加，對話方式 (NC) 幾乎可以媲美 ML 模型的效能。最值得注意的是，LLM 相對於 ML 模型展現出相當或更佳的成本敏感準確度。
-本研究證實，透過適當的領域知識和量身打造的溝通策略，LLM 可以顯著增強診斷程序。這些發現突顯了最佳化訓練範例數量和溝通方式的重要性，以提高準確度並減少 LLM 應用中的偏差。
+##### **Hybrid Graphs for Table-and-Text based Question Answering using LLMs**
+2501.17767v1 by Ankush Agarwal, Ganesh S, Chaitanya Devaguptapu
 
-##### **To Trust or Not to Trust: Towards a novel approach to measure trust for XAI systems**
-2405.05766v1 by Miquel Miró-Nicolau, Gabriel Moyà-Alcover, Antoni Jaume-i-Capó, Manuel González-Hidalgo, Maria Gemma Sempere Campello, Juan Antonio Palmer Sancho
+Answering questions that require reasoning and aggregation across both
+structured (tables) and unstructured (raw text) data sources presents
+significant challenges. Current methods rely on fine-tuning and high-quality,
+human-curated data, which is difficult to obtain. Recent advances in Large
+Language Models (LLMs) have shown promising results for multi-hop question
+answering (QA) over single-source text data in a zero-shot setting, yet
+exploration into multi-source Table-Text QA remains limited. In this paper, we
+present a novel Hybrid Graph-based approach for Table-Text QA that leverages
+LLMs without fine-tuning. Our method constructs a unified Hybrid Graph from
+textual and tabular data, pruning information based on the input question to
+provide the LLM with relevant context concisely. We evaluate our approach on
+the challenging Hybrid-QA and OTT-QA datasets using state-of-the-art LLMs,
+including GPT-3.5, GPT-4, and LLaMA-3. Our method achieves the best zero-shot
+performance on both datasets, improving Exact Match scores by up to 10% on
+Hybrid-QA and 5.4% on OTT-QA. Moreover, our approach reduces token usage by up
+to 53% compared to the original context.
 
-The increasing reliance on Deep Learning models, combined with their inherent
-lack of transparency, has spurred the development of a novel field of study
-known as eXplainable AI (XAI) methods. These methods seek to enhance the trust
-of end-users in automated systems by providing insights into the rationale
-behind their decisions. This paper presents a novel approach for measuring user
-trust in XAI systems, allowing their refinement. Our proposed metric combines
-both performance metrics and trust indicators from an objective perspective. To
-validate this novel methodology, we conducted a case study in a realistic
-medical scenario: the usage of XAI system for the detection of pneumonia from
-x-ray images.
+摘要：回答需要對結構化（表格）和非結構化（原始文字）資料來源進行推理和彙總的問題會帶來重大挑戰。目前的辦法仰賴微調和高品質、人工整理的資料，而這很難取得。大型語言模型（LLM）的最新進展已針對零次學習設定的單一來源文字資料多跳問題回答（QA）展現出有希望的結果，但對多來源表格文字 QA 的探討仍然有限。在本文中，我們提出了一種新穎的基於混合圖表的表格文字 QA 方法，它利用 LLM 而無需微調。我們的辦法從文字和表格資料建構一個統一的混合圖表，根據輸入問題修剪資訊，以簡潔地為 LLM 提供相關脈絡。我們使用最先進的 LLM，包括 GPT-3.5、GPT-4 和 LLaMA-3，針對具有挑戰性的 Hybrid-QA 和 OTT-QA 資料集評估我們的辦法。我們的辦法在兩個資料集上都達到了最佳的零次學習效能，在 Hybrid-QA 上將完全比對分數提高了 10%，在 OTT-QA 上將完全比對分數提高了 5.4%。此外，與原始脈絡相比，我們的辦法將符號使用量減少了 53%。
 
-摘要：隨著對深度學習模型依賴性的增加，加上其固有的透明度不足，促使一個新的研究領域發展，稱為可解釋 AI (XAI) 方法。這些方法旨在透過深入了解決策背後的原理，來提升最終使用者對自動化系統的信賴。本文提出了一種衡量使用者對 XAI 系統信賴度的新穎方法，允許對其進行改進。我們提出的指標結合了客觀觀點下的效能指標和信賴指標。為了驗證這個新穎的方法，我們在一個真實的醫療場景中進行了一個案例研究：使用 XAI 系統從 X 光影像中偵測肺炎。
+##### **Query-Aware Learnable Graph Pooling Tokens as Prompt for Large Language Models**
+2501.17549v1 by Wooyoung Kim, Byungyoon Park, Wooju Kim
 
-##### **Region-specific Risk Quantification for Interpretable Prognosis of COVID-19**
-2405.02815v1 by Zhusi Zhong, Jie Li, Zhuoqi Ma, Scott Collins, Harrison Bai, Paul Zhang, Terrance Healey, Xinbo Gao, Michael K. Atalay, Zhicheng Jiao
+Graph-structured data plays a vital role in numerous domains, such as social
+networks, citation networks, commonsense reasoning graphs and knowledge graphs.
+While graph neural networks have been employed for graph processing, recent
+advancements have explored integrating large language models for graph-based
+tasks. In this paper, we propose a novel approach named Learnable Graph Pooling
+Token (LGPT), which addresses the limitations of the scalability issues in
+node-level projection and information loss in graph-level projection. LGPT
+enables flexible and efficient graph representation by introducing learnable
+parameters that act as tokens in large language models, balancing fine-grained
+and global graph information. Additionally, we investigate an Early Query
+Fusion technique, which fuses query context before constructing the graph
+representation, leading to more effective graph embeddings. Our method achieves
+a 4.13\% performance improvement on the GraphQA benchmark without training the
+large language model, demonstrating significant gains in handling complex
+textual-attributed graph data.
 
-The COVID-19 pandemic has strained global public health, necessitating
-accurate diagnosis and intervention to control disease spread and reduce
-mortality rates. This paper introduces an interpretable deep survival
-prediction model designed specifically for improved understanding and trust in
-COVID-19 prognosis using chest X-ray (CXR) images. By integrating a large-scale
-pretrained image encoder, Risk-specific Grad-CAM, and anatomical region
-detection techniques, our approach produces regional interpretable outcomes
-that effectively capture essential disease features while focusing on rare but
-critical abnormal regions. Our model's predictive results provide enhanced
-clarity and transparency through risk area localization, enabling clinicians to
-make informed decisions regarding COVID-19 diagnosis with better understanding
-of prognostic insights. We evaluate the proposed method on a multi-center
-survival dataset and demonstrate its effectiveness via quantitative and
-qualitative assessments, achieving superior C-indexes (0.764 and 0.727) and
-time-dependent AUCs (0.799 and 0.691). These results suggest that our
-explainable deep survival prediction model surpasses traditional survival
-analysis methods in risk prediction, improving interpretability for clinical
-decision making and enhancing AI system trustworthiness.
+摘要：圖形結構資料在許多領域中扮演著至關重要的角色，例如社交網路、引用網路、常識推理圖形和知識圖形。雖然圖形神經網路已用於圖形處理，但最近的進展已探討整合大型語言模型以進行基於圖形的任務。在本文中，我們提出了一種名為可學習圖形池化令牌 (LGPT) 的新方法，它解決了節點層級投影中的可擴充性問題和圖形層級投影中的資訊遺失限制。LGPT 透過引入可學習的參數（在大型語言模型中作為令牌運作）來啟用彈性和高效的圖形表示，平衡細粒度和整體圖形資訊。此外，我們研究了一種早期查詢融合技術，它在建構圖形表示之前融合查詢內容，進而產生更有效的圖形嵌入。我們的方法在 GraphQA 基準上達到了 4.13% 的效能提升，而無需訓練大型語言模型，證明了在處理複雜的文字屬性圖形資料方面有顯著的進展。
 
-摘要：COVID-19 疫情對全球公共衛生造成壓力，必須進行準確的診斷和干預，以控制疾病傳播並降低死亡率。本文介紹了一個可解釋的深度生存預測模型，專門設計用於透過胸部 X 光 (CXR) 影像改善對 COVID-19 預後的理解和信賴。透過整合大規模預訓練影像編碼器、風險特定 Grad-CAM 和解剖區域偵測技術，我們的做法產生區域可解釋的結果，有效捕捉必要的疾病特徵，同時專注於罕見但關鍵的異常區域。我們的模型預測結果透過風險區域定位提供增強的清晰度和透明度，讓臨床醫生能夠在更了解預後見解的情況下，就 COVID-19 診斷做出明智的決策。我們在多中心生存資料集上評估所提出的方法，並透過量化和質化評估證明其有效性，達到優異的 C 指數（0.764 和 0.727）和時間相關 AUC（0.799 和 0.691）。這些結果表明，我們可解釋的深度生存預測模型在風險預測方面超越傳統的生存分析方法，提升臨床決策的解釋性，並增強 AI 系統的信賴度。
+##### **General Scene Adaptation for Vision-and-Language Navigation**
+2501.17403v1 by Haodong Hong, Yanyuan Qiao, Sen Wang, Jiajun Liu, Qi Wu
 
-##### **Rad4XCNN: a new agnostic method for post-hoc global explanation of CNN-derived features by means of radiomics**
-2405.02334v2 by Francesco Prinzi, Carmelo Militello, Calogero Zarcaro, Tommaso Vincenzo Bartolotta, Salvatore Gaglio, Salvatore Vitabile
+Vision-and-Language Navigation (VLN) tasks mainly evaluate agents based on
+one-time execution of individual instructions across multiple environments,
+aiming to develop agents capable of functioning in any environment in a
+zero-shot manner. However, real-world navigation robots often operate in
+persistent environments with relatively consistent physical layouts, visual
+observations, and language styles from instructors. Such a gap in the task
+setting presents an opportunity to improve VLN agents by incorporating
+continuous adaptation to specific environments. To better reflect these
+real-world conditions, we introduce GSA-VLN, a novel task requiring agents to
+execute navigation instructions within a specific scene and simultaneously
+adapt to it for improved performance over time. To evaluate the proposed task,
+one has to address two challenges in existing VLN datasets: the lack of OOD
+data, and the limited number and style diversity of instructions for each
+scene. Therefore, we propose a new dataset, GSA-R2R, which significantly
+expands the diversity and quantity of environments and instructions for the R2R
+dataset to evaluate agent adaptability in both ID and OOD contexts.
+Furthermore, we design a three-stage instruction orchestration pipeline that
+leverages LLMs to refine speaker-generated instructions and apply role-playing
+techniques to rephrase instructions into different speaking styles. This is
+motivated by the observation that each individual user often has consistent
+signatures or preferences in their instructions. We conducted extensive
+experiments on GSA-R2R to thoroughly evaluate our dataset and benchmark various
+methods. Based on our findings, we propose a novel method, GR-DUET, which
+incorporates memory-based navigation graphs with an environment-specific
+training strategy, achieving state-of-the-art results on all GSA-R2R splits.
 
-In recent years, machine learning-based clinical decision support systems
-(CDSS) have played a key role in the analysis of several medical conditions.
-Despite their promising capabilities, the lack of transparency in AI models
-poses significant challenges, particularly in medical contexts where
-reliability is a mandatory aspect. However, it appears that explainability is
-inversely proportional to accuracy. For this reason, achieving transparency
-without compromising predictive accuracy remains a key challenge. This paper
-presents a novel method, namely Rad4XCNN, to enhance the predictive power of
-CNN-derived features with the inherent interpretability of radiomic features.
-Rad4XCNN diverges from conventional methods based on saliency maps, by
-associating intelligible meaning to CNN-derived features by means of Radiomics,
-offering new perspectives on explanation methods beyond visualization maps.
-Using a breast cancer classification task as a case study, we evaluated
-Rad4XCNN on ultrasound imaging datasets, including an online dataset and two
-in-house datasets for internal and external validation. Some key results are:
-i) CNN-derived features guarantee more robust accuracy when compared against
-ViT-derived and radiomic features; ii) conventional visualization map methods
-for explanation present several pitfalls; iii) Rad4XCNN does not sacrifice
-model accuracy for their explainability; iv) Rad4XCNN provides a global
-explanation enabling the physician to extract global insights and findings. Our
-method can mitigate some concerns related to the explainability-accuracy
-trade-off. This study highlighted the importance of proposing new methods for
-model explanation without affecting their accuracy.
+摘要：視覺語言導航 (VLN) 任務主要根據代理程式在多個環境中執行個別指令的一次性執行來評估代理程式，旨在開發能夠在任何環境中以零次學習的方式運作的代理程式。然而，真實世界的導航機器人通常在持續性的環境中運作，而這些環境具有相對一致的物理配置、視覺觀察和指令的語言風格。任務設定中的這種差距提供了一個機會，可以透過將連續適應特定環境納入其中來改善 VLN 代理程式。為了更好地反映這些真實世界的條件，我們推出了 GSA-VLN，這是一個新任務，要求代理程式在特定場景中執行導航指令，並同時適應該場景，以隨著時間推移而提高效能。為了評估所提出的任務，必須解決現有 VLN 資料集中的兩個挑戰：缺乏 OOD 資料，以及每個場景的指令數量和風格多樣性有限。因此，我們提出了一個新的資料集 GSA-R2R，它顯著擴展了 R2R 資料集的環境和指令的多樣性和數量，以評估代理程式在 ID 和 OOD 背景下的適應能力。此外，我們設計了一個三階段指令編排管道，該管道利用大型語言模型 (LLM) 來精煉由說話者產生的指令，並應用角色扮演技巧將指令改寫成不同的說話風格。這項技術的靈感來自於觀察到每個個別使用者通常在其指令中具有相符的簽名或偏好。我們針對 GSA-R2R 進行了大量的實驗，以徹底評估我們的資料集和基準各種方法。根據我們的研究結果，我們提出了一種新的方法 GR-DUET，它將基於記憶的導航圖表與特定於環境的訓練策略結合在一起，在所有 GSA-R2R 分割中取得了最先進的結果。
+
+##### **Comprehensive Evaluation for a Large Scale Knowledge Graph Question Answering Service**
+2501.17270v1 by Saloni Potdar, Daniel Lee, Omar Attia, Varun Embar, De Meng, Ramesh Balaji, Chloe Seivwright, Eric Choi, Mina H. Farid, Yiwen Sun, Yunyao Li
+
+Question answering systems for knowledge graph (KGQA), answer factoid
+questions based on the data in the knowledge graph. KGQA systems are complex
+because the system has to understand the relations and entities in the
+knowledge-seeking natural language queries and map them to structured queries
+against the KG to answer them. In this paper, we introduce Chronos, a
+comprehensive evaluation framework for KGQA at industry scale. It is designed
+to evaluate such a multi-component system comprehensively, focusing on (1)
+end-to-end and component-level metrics, (2) scalable to diverse datasets and
+(3) a scalable approach to measure the performance of the system prior to
+release. In this paper, we discuss the unique challenges associated with
+evaluating KGQA systems at industry scale, review the design of Chronos, and
+how it addresses these challenges. We will demonstrate how it provides a base
+for data-driven decisions and discuss the challenges of using it to measure and
+improve a real-world KGQA system.
 
-摘要：<paragraph>近年来，基于机器学习的临床决策支持系统 (CDSS) 在多种疾病的分析中扮演了关键角色。尽管它们具有广阔的前景，但 AI 模型缺乏透明度，尤其在医疗领域，可靠性是强制性方面，这带来了重大挑战。然而，解释性似乎与准确性成反比。因此，在不影响预测准确性的情况下实现透明度仍然是一个关键挑战。本文提出了一种新方法，即 Rad4XCNN，以通过放射组学的内在可解释性来增强 CNN 衍生特征的预测能力。Rad4XCNN 通过放射组学将可理解的含义与 CNN 衍生特征关联起来，从而偏离了基于显着性图的传统方法，为超越可视化图的解释方法提供了新的视角。使用乳腺癌分类任务作为案例研究，我们在超声成像数据集上评估了 Rad4XCNN，包括一个在线数据集和两个用于内部和外部验证的内部数据集。一些关键结果是：i) 与 ViT 衍生和放射组学特征相比，CNN 衍生特征保证了更稳健的准确性；ii) 用于解释的传统可视化图方法存在一些缺陷；iii) Rad4XCNN 不会为了可解释性而牺牲模型准确性；iv) Rad4XCNN 提供全局解释，使医生能够提取全局见解和发现。我们的方法可以减轻一些与可解释性-准确性权衡相关的担忧。本研究强调了提出新方法来解释模型而不影响其准确性的重要性。</paragraph>
+摘要：知識圖譜問答系統 (KGQA) 根據知識圖譜中的資料回答事實問題。KGQA 系統很複雜，因為系統必須理解知識尋求自然語言查詢中的關係和實體，並將它們對映到針對知識圖譜的結構化查詢，才能回答這些查詢。在本文中，我們介紹了 Chronos，這是一個用於產業規模 KGQA 的全面評估框架。它旨在全面評估這種多組件系統，重點關注：(1) 端對端和組件層級指標，(2) 可擴充至各種資料集，以及 (3) 可擴充的方法，用於在釋出前衡量系統的效能。在本文中，我們討論了與產業規模 KGQA 系統評估相關的獨特挑戰，檢視 Chronos 的設計，以及它如何應對這些挑戰。我們將展示它如何提供資料驅動決策的基礎，並討論使用它來衡量和改善真實世界 KGQA 系統的挑戰。
 
-##### **Attributing Responsibility in AI-Induced Incidents: A Computational Reflective Equilibrium Framework for Accountability**
-2404.16957v1 by Yunfei Ge, Quanyan Zhu
+##### **FactCG: Enhancing Fact Checkers with Graph-Based Multi-Hop Data**
+2501.17144v1 by Deren Lei, Yaxi Li, Siyao Li, Mengya Hu, Rui Xu, Ken Archer, Mingyu Wang, Emily Ching, Alex Deng
 
-The pervasive integration of Artificial Intelligence (AI) has introduced
-complex challenges in the responsibility and accountability in the event of
-incidents involving AI-enabled systems. The interconnectivity of these systems,
-ethical concerns of AI-induced incidents, coupled with uncertainties in AI
-technology and the absence of corresponding regulations, have made traditional
-responsibility attribution challenging. To this end, this work proposes a
-Computational Reflective Equilibrium (CRE) approach to establish a coherent and
-ethically acceptable responsibility attribution framework for all stakeholders.
-The computational approach provides a structured analysis that overcomes the
-limitations of conceptual approaches in dealing with dynamic and multifaceted
-scenarios, showcasing the framework's explainability, coherence, and adaptivity
-properties in the responsibility attribution process. We examine the pivotal
-role of the initial activation level associated with claims in equilibrium
-computation. Using an AI-assisted medical decision-support system as a case
-study, we illustrate how different initializations lead to diverse
-responsibility distributions. The framework offers valuable insights into
-accountability in AI-induced incidents, facilitating the development of a
-sustainable and resilient system through continuous monitoring, revision, and
-reflection.
+Prior research on training grounded factuality classification models to
+detect hallucinations in large language models (LLMs) has relied on public
+natural language inference (NLI) data and synthetic data. However, conventional
+NLI datasets are not well-suited for document-level reasoning, which is
+critical for detecting LLM hallucinations. Recent approaches to document-level
+synthetic data generation involve iteratively removing sentences from documents
+and annotating factuality using LLM-based prompts. While effective, this method
+is computationally expensive for long documents and limited by the LLM's
+capabilities. In this work, we analyze the differences between existing
+synthetic training data used in state-of-the-art models and real LLM output
+claims. Based on our findings, we propose a novel approach for synthetic data
+generation, CG2C, that leverages multi-hop reasoning on context graphs
+extracted from documents. Our fact checker model, FactCG, demonstrates improved
+performance with more connected reasoning, using the same backbone models.
+Experiments show it even outperforms GPT-4-o on the LLM-Aggrefact benchmark
+with much smaller model size.
 
-摘要：隨著人工智慧 (AI) 的普及整合，在涉及 AI 驅動系統的事故中，責任和義務歸屬產生了複雜的挑戰。這些系統的互連性、AI 引發事故的倫理問題，加上 AI 技術的不確定性和缺乏相應法規，使得傳統責任歸屬面臨挑戰。為此，本研究提出了一種計算反思均衡 (CRE) 方法，以建立一個連貫且在倫理上可接受的責任歸屬架構，適用於所有利害關係人。計算方法提供了結構化的分析，克服了概念方法在處理動態且多面向情境時的限制，展示了該架構在責任歸屬過程中具備的可解釋性、連貫性和適應性。我們探討了與均衡計算中索賠相關的初始啟動層級的關鍵作用。我們以 AI 輔助醫療決策支援系統為案例研究，說明不同的初始化如何導致不同的責任分配。該架構提供了對 AI 引發事故中問責制的寶貴見解，透過持續監控、修訂和反思，促進了永續且有韌性的系統發展。
+摘要：先前的研究訓練了基於事實的分類模型，以偵測大型語言模型 (LLM) 中的幻覺，依賴於公開的自然語言推論 (NLI) 資料和合成資料。然而，傳統的 NLI 資料集並不適合文件層級的推理，這對於偵測 LLM 的幻覺至關重要。最近的文件層級合成資料生成方法涉及從文件中反覆移除句子，並使用基於 LLM 的提示註解事實。雖然有效，但此方法對於長文件來說在運算上很昂貴，且受限於 LLM 的能力。在這項工作中，我們分析了現有合成訓練資料與最先進模型中使用的真實 LLM 輸出宣告之間的差異。根據我們的研究結果，我們提出了一個用於合成資料生成的創新方法 CG2C，它利用從文件中提取的內容圖表進行多跳推理。我們的查核模型 FactCG 使用相同的骨幹模型，展示了在更多連結的推理下改進的效能。實驗表明，它甚至在 LLM-Aggrefact 基準上優於 GPT-4-o，且模型大小小得多。
 
-##### **Explainable AI for Fair Sepsis Mortality Predictive Model**
-2404.13139v1 by Chia-Hsuan Chang, Xiaoyang Wang, Christopher C. Yang
+##### **LLM-AutoDiff: Auto-Differentiate Any LLM Workflow**
+2501.16673v2 by Li Yin, Zhangyang Wang
 
-Artificial intelligence supports healthcare professionals with predictive
-modeling, greatly transforming clinical decision-making. This study addresses
-the crucial need for fairness and explainability in AI applications within
-healthcare to ensure equitable outcomes across diverse patient demographics. By
-focusing on the predictive modeling of sepsis-related mortality, we propose a
-method that learns a performance-optimized predictive model and then employs
-the transfer learning process to produce a model with better fairness. Our
-method also introduces a novel permutation-based feature importance algorithm
-aiming at elucidating the contribution of each feature in enhancing fairness on
-predictions. Unlike existing explainability methods concentrating on explaining
-feature contribution to predictive performance, our proposed method uniquely
-bridges the gap in understanding how each feature contributes to fairness. This
-advancement is pivotal, given sepsis's significant mortality rate and its role
-in one-third of hospital deaths. Our method not only aids in identifying and
-mitigating biases within the predictive model but also fosters trust among
-healthcare stakeholders by improving the transparency and fairness of model
-predictions, thereby contributing to more equitable and trustworthy healthcare
-delivery.
+Large Language Models (LLMs) have reshaped natural language processing,
+powering applications from multi-hop retrieval and question answering to
+autonomous agent workflows. Yet, prompt engineering -- the task of crafting
+textual inputs to effectively direct LLMs -- remains difficult and
+labor-intensive, particularly for complex pipelines that combine multiple LLM
+calls with functional operations like retrieval and data formatting. We
+introduce LLM-AutoDiff: a novel framework for Automatic Prompt Engineering
+(APE) that extends textual gradient-based methods (such as Text-Grad) to
+multi-component, potentially cyclic LLM architectures. Implemented within the
+AdalFlow library, LLM-AutoDiff treats each textual input as a trainable
+parameter and uses a frozen backward engine LLM to generate feedback-akin to
+textual gradients -- that guide iterative prompt updates. Unlike prior
+single-node approaches, LLM-AutoDiff inherently accommodates functional nodes,
+preserves time-sequential behavior in repeated calls (e.g., multi-hop loops),
+and combats the "lost-in-the-middle" problem by isolating distinct sub-prompts
+(instructions, formats, or few-shot examples). It further boosts training
+efficiency by focusing on error-prone samples through selective gradient
+computation. Across diverse tasks, including single-step classification,
+multi-hop retrieval-based QA, and agent-driven pipelines, LLM-AutoDiff
+consistently outperforms existing textual gradient baselines in both accuracy
+and training cost. By unifying prompt optimization through a graph-centric
+lens, LLM-AutoDiff offers a powerful new paradigm for scaling and automating
+LLM workflows - mirroring the transformative role that automatic
+differentiation libraries have long played in neural network research.
 
-摘要：人工智慧透過預測模型協助醫療專業人員，大幅轉變了臨床決策制定。本研究探討了在醫療保健中使用人工智慧應用程式時公平性和可解釋性的關鍵需求，以確保在不同的患者人口統計資料中獲得公平的結果。透過專注於敗血症相關死亡率的預測模型，我們提出了一種方法，該方法會學習一個效能最佳化的預測模型，然後採用轉移學習過程來產生一個具有更好公平性的模型。我們的模型還引入了一種新穎的基於排列的特徵重要性演算法，旨在闡明每個特徵在增強預測公平性方面的貢獻。與現有的可解釋性方法專注於解釋特徵對預測效能的貢獻不同，我們提出的方法獨特地彌補了理解每個特徵如何有助於公平性的差距。這項進展至關重要，因為敗血症的死亡率很高，且在三分之一的醫院死亡中扮演著角色。我們的模型不僅有助於識別和減輕預測模型中的偏差，還能透過提高模型預測的透明度和公平性來培養醫療保健利益相關者之間的信任，進而有助於提供更公平且值得信賴的醫療保健服務。
+摘要：大型語言模型 (LLM) 已重塑自然語言處理，
+為從多跳檢索和問答到
+自主代理工作流程的應用提供動力。然而，提示工程 -- 編寫
+文本輸入以有效指導 LLM 的任務 -- 仍然困難且
+勞動密集，特別是對於將多個 LLM
+呼叫與檢索和數據格式化等功能操作相結合的複雜管道。我們
+介紹 LLM-AutoDiff：一個用於自動提示工程 (APE) 的新框架，它將基於文本梯度的
+方法（例如 Text-Grad）擴展到多組件、潛在循環 LLM 架構中。在
+AdalFlow 庫中實施，LLM-AutoDiff 將每個文本輸入視為一個可訓練
+參數，並使用凍結的後向引擎 LLM 生成反饋——類似於
+文本梯度——指導迭代提示更新。與先前的
+單節點方法不同，LLM-AutoDiff 本質上適應功能節點，
+在重複呼叫（例如，多跳循環）中保留時間順序行為，
+並通過隔離不同的子提示（說明、格式或少數鏡頭示例）來解決“迷失在中間”問題。它進一步提高訓練
+效率，通過選擇性梯度
+計算專注於容易出錯的樣本。在包括單步分類、
+多跳基於檢索的問答和代理驅動管道在內的各種任務中，LLM-AutoDiff
+在準確性和訓練成本方面始終優於現有的文本梯度基準。通過圖形中心化
+視角統一提示優化，LLM-AutoDiff 為擴展和自動化
+LLM 工作流程提供了一個強大的新範例——反映了自動
+微分庫在神經網絡研究中長期扮演的變革性角色。
 
-##### **Multi Class Depression Detection Through Tweets using Artificial Intelligence**
-2404.13104v1 by Muhammad Osama Nusrat, Waseem Shahzad, Saad Ahmed Jamal
+##### **360Brew: A Decoder-only Foundation Model for Personalized Ranking and Recommendation**
+2501.16450v3 by Hamed Firooz, Maziar Sanjabi, Adrian Englhardt, Aman Gupta, Ben Levine, Dre Olgiati, Gungor Polatkan, Iuliia Melnychuk, Karthik Ramgopal, Kirill Talanine, Kutta Srinivasan, Luke Simon, Natesh Sivasubramoniapillai, Necip Fazil Ayan, Qingquan Song, Samira Sriram, Souvik Ghosh, Tao Song, Tejas Dharamsi, Vignesh Kothapalli, Xiaoling Zhai, Ya Xu, Yu Wang, Yun Dai
 
-Depression is a significant issue nowadays. As per the World Health
-Organization (WHO), in 2023, over 280 million individuals are grappling with
-depression. This is a huge number; if not taken seriously, these numbers will
-increase rapidly. About 4.89 billion individuals are social media users. People
-express their feelings and emotions on platforms like Twitter, Facebook,
-Reddit, Instagram, etc. These platforms contain valuable information which can
-be used for research purposes. Considerable research has been conducted across
-various social media platforms. However, certain limitations persist in these
-endeavors. Particularly, previous studies were only focused on detecting
-depression and the intensity of depression in tweets. Also, there existed
-inaccuracies in dataset labeling. In this research work, five types of
-depression (Bipolar, major, psychotic, atypical, and postpartum) were predicted
-using tweets from the Twitter database based on lexicon labeling. Explainable
-AI was used to provide reasoning by highlighting the parts of tweets that
-represent type of depression. Bidirectional Encoder Representations from
-Transformers (BERT) was used for feature extraction and training. Machine
-learning and deep learning methodologies were used to train the model. The BERT
-model presented the most promising results, achieving an overall accuracy of
-0.96.
+Ranking and recommendation systems are the foundation for numerous online
+experiences, ranging from search results to personalized content delivery.
+These systems have evolved into complex, multilayered architectures that
+leverage vast datasets and often incorporate thousands of predictive models.
+The maintenance and enhancement of these models is a labor intensive process
+that requires extensive feature engineering. This approach not only exacerbates
+technical debt but also hampers innovation in extending these systems to
+emerging problem domains. In this report, we present our research to address
+these challenges by utilizing a large foundation model with a textual interface
+for ranking and recommendation tasks. We illustrate several key advantages of
+our approach: (1) a single model can manage multiple predictive tasks involved
+in ranking and recommendation, (2) decoder models with textual interface due to
+their comprehension of reasoning capabilities, can generalize to new
+recommendation surfaces and out-of-domain problems, and (3) by employing
+natural language interfaces for task definitions and verbalizing member
+behaviors and their social connections, we eliminate the need for feature
+engineering and the maintenance of complex directed acyclic graphs of model
+dependencies. We introduce our research pre-production model, 360Brew V1.0, a
+150B parameter, decoder-only model that has been trained and fine-tuned on
+LinkedIn's data and tasks. This model is capable of solving over 30 predictive
+tasks across various segments of the LinkedIn platform, achieving performance
+levels comparable to or exceeding those of current production systems based on
+offline metrics, without task-specific fine-tuning. Notably, each of these
+tasks is conventionally addressed by dedicated models that have been developed
+and maintained over multiple years by teams of a similar or larger size than
+our own.
 
-摘要：現今，憂鬱症是一個重要的議題。根據世界衛生組織 (WHO) 的資料，在 2023 年，超過 2.8 億人正在與憂鬱症搏鬥。這是一個龐大的數字；如果不認真看待，這些數字將會快速增加。大約有 48.9 億人是社群媒體使用者。人們在 Twitter、Facebook、Reddit、Instagram 等平台上表達自己的感受和情緒。這些平台包含有價值的資訊，可用於研究目的。已經在各種社群媒體平台上進行了大量的研究。然而，這些努力仍存在某些限制。特別是，先前的研究僅專注於偵測推文中的憂鬱症和憂鬱症的強度。此外，資料集標籤中存在不準確的情況。在這項研究工作中，使用基於詞彙標籤的 Twitter 資料庫中的推文預測了五種類型的憂鬱症（雙極型、重度、精神病型、非典型和產後）。可解釋的 AI 用於透過強調代表憂鬱症類型的推文部分來提供推理。從 Transformers（BERT）中提取的雙向編碼器表示用於特徵提取和訓練。機器學習和深度學習方法用於訓練模型。BERT 模型呈現出最有希望的結果，達到 0.96 的整體準確度。
+摘要：排名和推薦系統是許多線上體驗的基礎，從搜尋結果到個人化內容傳遞。
+這些系統已演變成複雜的多層架構，利用龐大的資料集，並經常納入數千個預測模型。
+這些模型的維護和增強是一個勞力密集的過程，需要廣泛的特徵工程。
+這種方法不僅加劇了技術債務，也阻礙了將這些系統擴展到新興問題領域的創新。
+在此報告中，我們提出了我們的研究，以利用具有文字介面的大型基礎模型來解決這些挑戰，以進行排名和推薦任務。
+我們說明了我們方法的幾個主要優點：(1) 單一模型可以管理排名和推薦中涉及的多個預測任務，(2) 由於解碼器模型具有文字介面，因此它們對推理能力的理解，可以推廣到新的推薦表面和領域外問題，以及 (3) 通過採用自然語言介面進行任務定義和表達成員行為及其社交連接，我們消除了對特徵工程和維護複雜的模型相依性有向無環圖的需求。
+我們介紹了我們的研究前製作業模型 360Brew V1.0，這是一個 150B 參數，僅解碼器模型，已在 LinkedIn 的資料和任務上進行訓練和微調。
+此模型能夠解決 LinkedIn 平臺各個區塊中超過 30 個預測任務，在不針對任務進行微調的情況下，達到與基於離線指標的現行製作系統相當或超越的效能水準。
+值得注意的是，這些任務中的每個任務通常由專用模型處理，這些模型是由與我們規模相當或更大的團隊在多年間開發和維護的。
 
-##### **COIN: Counterfactual inpainting for weakly supervised semantic segmentation for medical images**
-2404.12832v2 by Dmytro Shvetsov, Joonas Ariva, Marharyta Domnich, Raul Vicente, Dmytro Fishman
+##### **Raiders of the Lost Dependency: Fixing Dependency Conflicts in Python using LLMs**
+2501.16191v1 by Antony Bartlett, Cynthia Liem, Annibale Panichella
 
-Deep learning is dramatically transforming the field of medical imaging and
-radiology, enabling the identification of pathologies in medical images,
-including computed tomography (CT) and X-ray scans. However, the performance of
-deep learning models, particularly in segmentation tasks, is often limited by
-the need for extensive annotated datasets. To address this challenge, the
-capabilities of weakly supervised semantic segmentation are explored through
-the lens of Explainable AI and the generation of counterfactual explanations.
-The scope of this research is development of a novel counterfactual inpainting
-approach (COIN) that flips the predicted classification label from abnormal to
-normal by using a generative model. For instance, if the classifier deems an
-input medical image X as abnormal, indicating the presence of a pathology, the
-generative model aims to inpaint the abnormal region, thus reversing the
-classifier's original prediction label. The approach enables us to produce
-precise segmentations for pathologies without depending on pre-existing
-segmentation masks. Crucially, image-level labels are utilized, which are
-substantially easier to acquire than creating detailed segmentation masks. The
-effectiveness of the method is demonstrated by segmenting synthetic targets and
-actual kidney tumors from CT images acquired from Tartu University Hospital in
-Estonia. The findings indicate that COIN greatly surpasses established
-attribution methods, such as RISE, ScoreCAM, and LayerCAM, as well as an
-alternative counterfactual explanation method introduced by Singla et al. This
-evidence suggests that COIN is a promising approach for semantic segmentation
-of tumors in CT images, and presents a step forward in making deep learning
-applications more accessible and effective in healthcare, where annotated data
-is scarce.
+Fixing Python dependency issues is a tedious and error-prone task for
+developers, who must manually identify and resolve environment dependencies and
+version constraints of third-party modules and Python interpreters. Researchers
+have attempted to automate this process by relying on large knowledge graphs
+and database lookup tables. However, these traditional approaches face
+limitations due to the variety of dependency error types, large sets of
+possible module versions, and conflicts among transitive dependencies. This
+study explores the potential of using large language models (LLMs) to
+automatically fix dependency issues in Python programs. We introduce PLLM
+(pronounced "plum"), a novel technique that employs retrieval-augmented
+generation (RAG) to help an LLM infer Python versions and required modules for
+a given Python file. PLLM builds a testing environment that iteratively (1)
+prompts the LLM for module combinations, (2) tests the suggested changes, and
+(3) provides feedback (error messages) to the LLM to refine the fix. This
+feedback cycle leverages natural language processing (NLP) to intelligently
+parse and interpret build error messages. We benchmark PLLM on the Gistable
+HG2.9K dataset, a collection of challenging single-file Python gists. We
+compare PLLM against two state-of-the-art automatic dependency inference
+approaches, namely PyEGo and ReadPyE, w.r.t. the ability to resolve dependency
+issues. Our results indicate that PLLM can fix more dependency issues than the
+two baselines, with +218 (+15.97%) more fixes over ReadPyE and +281 (+21.58%)
+over PyEGo. Our deeper analyses suggest that PLLM is particularly beneficial
+for projects with many dependencies and for specific third-party numerical and
+machine-learning modules. Our findings demonstrate the potential of LLM-based
+approaches to iteratively resolve Python dependency issues.
 
-摘要：深度学习正大幅轉變醫學影像和放射線學領域，能辨識醫學影像中的病理，包括電腦斷層掃描 (CT) 和 X 光掃描。然而，深度學習模型的效能，特別是在分割任務中，常常受到廣泛註解資料集需求的限制。為了應對此挑戰，透過可解釋 AI 和反事實解釋的產生，探索弱監督語意分割的能力。本研究的範圍是開發一種新的反事實內插方法 (COIN)，該方法使用生成模型將預測的分類標籤從異常翻轉為正常。例如，如果分類器將輸入的醫學影像 X 視為異常，表示存在病理，則生成模型旨在內插異常區域，從而逆轉分類器的原始預測標籤。此方法使我們能夠產生病理的精確分割，而無需依賴於預先存在的分割遮罩。至關重要的是，利用影像層級標籤，這比建立詳細的分割遮罩容易取得。該方法的有效性透過分割合成目標和從愛沙尼亞塔爾圖大學醫院取得的 CT 影像中的實際腎臟腫瘤來證明。研究結果表明，COIN 遠遠超過已建立的歸因方法，例如 RISE、ScoreCAM 和 LayerCAM，以及 Singla 等人提出的另一種反事實解釋方法。此證據表明，COIN 是一種很有前途的 CT 影像中腫瘤語意分割方法，並在醫療保健中讓深度學習應用更易於取得和更有效率邁進一步，其中註解資料很稀少。
+摘要：<paragraph>修復 Python 依賴項問題對開發人員來說是一項繁瑣且容易出錯的任務，他們必須手動識別和解決第三方模組和 Python 解譯器的環境依賴項和版本限制。研究人員已嘗試透過依賴大型知識圖譜和資料庫查詢表來自動化此程序。然而，這些傳統方法由於依賴項錯誤類型多樣、可能的模組版本數量龐大，以及傳遞依賴項之間的衝突，而面臨限制。本研究探討使用大型語言模型 (LLM) 自動修復 Python 程式中的依賴項問題的可能性。我們介紹 PLLM（發音為「plum」），這是一種新穎的技術，採用檢索增強生成 (RAG) 來協助 LLM 推論 Python 版本和給定 Python 檔案所需的模組。PLLM 建立一個測試環境，反覆 (1) 提示 LLM 模組組合，(2) 測試建議的變更，以及 (3) 提供回饋（錯誤訊息）給 LLM 以改善修正。此回饋循環利用自然語言處理 (NLP) 來智慧解析和詮釋建置錯誤訊息。我們在 Gistable HG2.9K 資料集上對 PLLM 進行基準測試，該資料集是一個具有挑戰性的單一檔案 Python gist 集合。我們將 PLLM 與兩種最先進的自動依賴項推論方法進行比較，即 PyEGo 和 ReadPyE，以比較解決依賴項問題的能力。我們的結果顯示，PLLM 可以修復比這兩個基準更多的依賴項問題，比 ReadPyE 多修復了 +218 (+15.97%) 個，比 PyEGo 多修復了 +281 (+21.58%) 個。我們更深入的分析表明，PLLM 對具有許多依賴項的專案以及特定第三方數值和機器學習模組特別有益。我們的研究結果證明了基於 LLM 的方法反覆解決 Python 依賴項問題的可能性。</paragraph>
 
-##### **Hybrid Intelligence for Digital Humanities**
-2406.15374v1 by Victor de Boer, Lise Stork
+##### **Harnessing Diverse Perspectives: A Multi-Agent Framework for Enhanced Error Detection in Knowledge Graphs**
+2501.15791v1 by Yu Li, Yi Huang, Guilin Qi, Junlan Feng, Nan Hu, Songlin Zhai, Haohan Xue, Yongrui Chen, Ruoyan Shen, Tongtong Wu
+
+Knowledge graphs are widely used in industrial applications, making error
+detection crucial for ensuring the reliability of downstream applications.
+Existing error detection methods often fail to effectively leverage
+fine-grained subgraph information and rely solely on fixed graph structures,
+while also lacking transparency in their decision-making processes, which
+results in suboptimal detection performance. In this paper, we propose a novel
+Multi-Agent framework for Knowledge Graph Error Detection (MAKGED) that
+utilizes multiple large language models (LLMs) in a collaborative setting. By
+concatenating fine-grained, bidirectional subgraph embeddings with LLM-based
+query embeddings during training, our framework integrates these
+representations to produce four specialized agents. These agents utilize
+subgraph information from different dimensions to engage in multi-round
+discussions, thereby improving error detection accuracy and ensuring a
+transparent decision-making process. Extensive experiments on FB15K and WN18RR
+demonstrate that MAKGED outperforms state-of-the-art methods, enhancing the
+accuracy and robustness of KG evaluation. For specific industrial scenarios,
+our framework can facilitate the training of specialized agents using
+domain-specific knowledge graphs for error detection, which highlights the
+potential industrial application value of our framework. Our code and datasets
+are available at https://github.com/kse-ElEvEn/MAKGED.
 
-In this paper, we explore the synergies between Digital Humanities (DH) as a
-discipline and Hybrid Intelligence (HI) as a research paradigm. In DH research,
-the use of digital methods and specifically that of Artificial Intelligence is
-subject to a set of requirements and constraints. We argue that these are
-well-supported by the capabilities and goals of HI. Our contribution includes
-the identification of five such DH requirements: Successful AI systems need to
-be able to 1) collaborate with the (human) scholar; 2) support data criticism;
-3) support tool criticism; 4) be aware of and cater to various perspectives and
-5) support distant and close reading. We take the CARE principles of Hybrid
-Intelligence (collaborative, adaptive, responsible and explainable) as
-theoretical framework and map these to the DH requirements. In this mapping, we
-include example research projects. We finally address how insights from DH can
-be applied to HI and discuss open challenges for the combination of the two
-disciplines.
+摘要：知識圖譜廣泛應用於工業應用中，使得錯誤偵測對於確保下游應用的可靠性至關重要。現有的錯誤偵測方法通常無法有效利用細粒度的子圖資訊，並且僅依賴於固定的圖形結構，同時在它們的決策過程中也缺乏透明度，這導致次佳的偵測效能。在本文中，我們提出了一個用於知識圖譜錯誤偵測 (MAKGED) 的新多代理架構，它在協作設定中利用了多個大型語言模型 (LLM)。透過在訓練期間將細粒度、雙向子圖嵌入與基於 LLM 的查詢嵌入串接，我們的架構整合了這些表示以產生四個專門代理。這些代理利用不同維度的子圖資訊參與多輪討論，從而提高錯誤偵測準確度並確保透明的決策過程。在 FB15K 和 WN18RR 上的廣泛實驗表明，MAKGED 優於最先進的方法，增強了 KG 評估的準確性和穩健性。對於特定產業情境，我們的架構可以利用特定領域的知識圖譜來促進專門代理的訓練以進行錯誤偵測，這突顯了我們架構的潛在產業應用價值。我們的程式碼和資料集可在 https://github.com/kse-ElEvEn/MAKGED 取得。
 
-摘要：在本文中，我們探討數位人文學科 (DH) 作為一門學科與混合智能 (HI) 作為一個研究典範之間的協同作用。在 DH 研究中，數位方法的使用，特別是人工智慧的使用，受到一系列要求和限制。我們認為這些要求和限制獲得 HI 的能力和目標的充分支持。我們的貢獻包括找出五個這樣的 DH 要求：成功的 AI 系統需要能夠 1) 與（人類）學者合作；2) 支援資料批評；3) 支援工具批評；4) 察覺並迎合各種觀點；5) 支援遠距和近距離閱讀。我們將混合智能的 CARE 原則（協作、適應、負責和可解釋）作為理論架構，並將這些原則對應到 DH 要求。在此對應中，我們納入範例研究專案。最後，我們探討如何將 DH 的見解應用於 HI，並討論結合這兩個學科的開放挑戰。
+##### **Automatic Feedback Generation for Short Answer Questions using Answer Diagnostic Graphs**
+2501.15777v1 by Momoka Furuhashi, Hiroaki Funayama, Yuya Iwase, Yuichiroh Matsubayashi, Yoriko Isobe, Toru Nagahama, Saku Sugawara, Kentaro Inui
 
-##### **Ethical Framework for Responsible Foundational Models in Medical Imaging**
-2406.11868v1 by Abhijit Das, Debesh Jha, Jasmer Sanjotra, Onkar Susladkar, Suramyaa Sarkar, Ashish Rauniyar, Nikhil Tomar, Vanshali Sharma, Ulas Bagci
+Short-reading comprehension questions help students understand text structure
+but lack effective feedback. Students struggle to identify and correct errors,
+while manual feedback creation is labor-intensive. This highlights the need for
+automated feedback linking responses to a scoring rubric for deeper
+comprehension.
+  Despite advances in Natural Language Processing (NLP), research has focused
+on automatic grading, with limited work on feedback generation. To address
+this, we propose a system that generates feedback for student responses.
+  Our contributions are twofold. First, we introduce the first system for
+feedback on short-answer reading comprehension. These answers are derived from
+the text, requiring structural understanding. We propose an "answer diagnosis
+graph," integrating the text's logical structure with feedback templates. Using
+this graph and NLP techniques, we estimate students' comprehension and generate
+targeted feedback.
+  Second, we evaluate our feedback through an experiment with Japanese high
+school students (n=39). They answered two 70-80 word questions and were divided
+into two groups with minimal academic differences. One received a model answer,
+the other system-generated feedback. Both re-answered the questions, and we
+compared score changes. A questionnaire assessed perceptions and motivation.
+  Results showed no significant score improvement between groups, but
+system-generated feedback helped students identify errors and key points in the
+text. It also significantly increased motivation. However, further refinement
+is needed to enhance text structure understanding.
 
-Foundational models (FMs) have tremendous potential to revolutionize medical
-imaging. However, their deployment in real-world clinical settings demands
-extensive ethical considerations. This paper aims to highlight the ethical
-concerns related to FMs and propose a framework to guide their responsible
-development and implementation within medicine. We meticulously examine ethical
-issues such as privacy of patient data, bias mitigation, algorithmic
-transparency, explainability and accountability. The proposed framework is
-designed to prioritize patient welfare, mitigate potential risks, and foster
-trust in AI-assisted healthcare.
+摘要：短篇閱讀理解題目有助學生理解文章結構，但缺乏有效的回饋。學生難以找出並更正錯誤，而手動建立回饋又很費力。這突顯了自動化回饋的必要性，將回應連結到評分標準，以獲得更深入的理解。
 
-摘要：基礎模型 (FM) 具有徹底改變醫學影像的巨大潛力。然而，它們在現實世界臨床環境中的部署需要廣泛的倫理考量。本文旨在強調與 FM 相關的倫理問題，並提出一個框架來指導它們在醫學中的負責任開發和實施。我們仔細審查了倫理問題，例如患者數據隱私、偏差緩解、演算法透明度、可解釋性和問責制。所提出的框架旨在優先考慮患者福利、減輕潛在風險，並培養對 AI 輔助醫療保健的信任。
+儘管自然語言處理 (NLP) 有所進展，但研究一直集中在自動評分上，而回饋生成的工作有限。為了解決這個問題，我們提出了一個系統，用於為學生的回答產生回饋。
 
-##### **Advancements in Radiomics and Artificial Intelligence for Thyroid Cancer Diagnosis**
-2404.07239v1 by Milad Yousefi, Shadi Farabi Maleki, Ali Jafarizadeh, Mahya Ahmadpour Youshanlui, Aida Jafari, Siamak Pedrammehr, Roohallah Alizadehsani, Ryszard Tadeusiewicz, Pawel Plawiak
+我們的貢獻有兩個方面。首先，我們引入了第一個針對簡答閱讀理解提供回饋的系統。這些答案來自於文本，需要結構化的理解。我們提出了一個「答案診斷圖」，將文本的邏輯結構與回饋範本整合在一起。使用這個圖表和 NLP 技術，我們估計學生的理解力並產生有針對性的回饋。
 
-Thyroid cancer is an increasing global health concern that requires advanced
-diagnostic methods. The application of AI and radiomics to thyroid cancer
-diagnosis is examined in this review. A review of multiple databases was
-conducted in compliance with PRISMA guidelines until October 2023. A
-combination of keywords led to the discovery of an English academic publication
-on thyroid cancer and related subjects. 267 papers were returned from the
-original search after 109 duplicates were removed. Relevant studies were
-selected according to predetermined criteria after 124 articles were eliminated
-based on an examination of their abstract and title. After the comprehensive
-analysis, an additional six studies were excluded. Among the 28 included
-studies, radiomics analysis, which incorporates ultrasound (US) images,
-demonstrated its effectiveness in diagnosing thyroid cancer. Various results
-were noted, some of the studies presenting new strategies that outperformed the
-status quo. The literature has emphasized various challenges faced by AI
-models, including interpretability issues, dataset constraints, and operator
-dependence. The synthesized findings of the 28 included studies mentioned the
-need for standardization efforts and prospective multicenter studies to address
-these concerns. Furthermore, approaches to overcome these obstacles were
-identified, such as advances in explainable AI technology and personalized
-medicine techniques. The review focuses on how AI and radiomics could transform
-the diagnosis and treatment of thyroid cancer. Despite challenges, future
-research on multidisciplinary cooperation, clinical applicability validation,
-and algorithm improvement holds the potential to improve patient outcomes and
-diagnostic precision in the treatment of thyroid cancer.
+其次，我們透過一項針對日本高中生的實驗（n=39）來評估我們的回饋。他們回答了兩個 70-80 字的問題，並被分成兩組，學術差異最小。一組收到範本答案，另一組收到系統產生的回饋。兩組都重新回答了問題，我們比較了分數的變化。一份問卷評估了認知和動機。
 
-摘要：甲狀腺癌是一種日益嚴重的全球健康問題，需要先進的診斷方法。本篇評論探討了人工智能與放射特徵分析在甲狀腺癌診斷中的應用。在符合 PRISMA 指南的情況下，對多個資料庫進行了回顧，直到 2023 年 10 月。通過結合關鍵字，發現了一篇關於甲狀腺癌和相關主題的英文學術出版物。在移除 109 篇重複文獻後，原始搜尋共回傳 267 篇論文。在根據預先確定的標準，淘汰了 124 篇文章的摘要和標題後，選出了相關研究。在進行全面分析後，額外排除了六項研究。在納入的 28 項研究中，結合超音波 (US) 影像的放射特徵分析，證明了其在診斷甲狀腺癌方面的有效性。研究結果不一，有些研究提出了優於現狀的新策略。文獻強調了人工智能模型面臨的各種挑戰，包括可解釋性問題、資料集限制和操作員依賴性。28 項納入研究的綜合發現提到，需要標準化工作和前瞻性多中心研究來解決這些問題。此外，還確定了克服這些障礙的方法，例如可解釋人工智能技術和個人化醫療技術的進步。本篇評論重點探討了人工智能和放射特徵分析如何轉變甲狀腺癌的診斷和治療。儘管存在挑戰，但未來對多學科合作、臨床適用性驗證和演算法改進的研究，仍有潛力改善甲狀腺癌治療中的患者預後和診斷精準度。
+結果顯示兩組之間沒有顯著的分數進步，但系統產生的回饋有助於學生找出文本中的錯誤和重點。它也顯著地提高了動機。然而，需要進一步的改進來增強對文本結構的理解。
 
-##### **Predictive Modeling for Breast Cancer Classification in the Context of Bangladeshi Patients: A Supervised Machine Learning Approach with Explainable AI**
-2404.04686v1 by Taminul Islam, Md. Alif Sheakh, Mst. Sazia Tahosin, Most. Hasna Hena, Shopnil Akash, Yousef A. Bin Jardan, Gezahign Fentahun Wondmie, Hiba-Allah Nafidi, Mohammed Bourhia
+##### **Transformer-Based Multimodal Knowledge Graph Completion with Link-Aware Contexts**
+2501.15688v1 by Haodi Ma, Dzmitry Kasinets, Daisy Zhe Wang
 
-Breast cancer has rapidly increased in prevalence in recent years, making it
-one of the leading causes of mortality worldwide. Among all cancers, it is by
-far the most common. Diagnosing this illness manually requires significant time
-and expertise. Since detecting breast cancer is a time-consuming process,
-preventing its further spread can be aided by creating machine-based forecasts.
-Machine learning and Explainable AI are crucial in classification as they not
-only provide accurate predictions but also offer insights into how the model
-arrives at its decisions, aiding in the understanding and trustworthiness of
-the classification results. In this study, we evaluate and compare the
-classification accuracy, precision, recall, and F-1 scores of five different
-machine learning methods using a primary dataset (500 patients from Dhaka
-Medical College Hospital). Five different supervised machine learning
-techniques, including decision tree, random forest, logistic regression, naive
-bayes, and XGBoost, have been used to achieve optimal results on our dataset.
-Additionally, this study applied SHAP analysis to the XGBoost model to
-interpret the model's predictions and understand the impact of each feature on
-the model's output. We compared the accuracy with which several algorithms
-classified the data, as well as contrasted with other literature in this field.
-After final evaluation, this study found that XGBoost achieved the best model
-accuracy, which is 97%.
+Multimodal knowledge graph completion (MMKGC) aims to predict missing links
+in multimodal knowledge graphs (MMKGs) by leveraging information from various
+modalities alongside structural data. Existing MMKGC approaches primarily
+extend traditional knowledge graph embedding (KGE) models, which often require
+creating an embedding for every entity. This results in large model sizes and
+inefficiencies in integrating multimodal information, particularly for
+real-world graphs. Meanwhile, Transformer-based models have demonstrated
+competitive performance in knowledge graph completion (KGC). However, their
+focus on single-modal knowledge limits their capacity to utilize cross-modal
+information. Recently, Large vision-language models (VLMs) have shown potential
+in cross-modal tasks but are constrained by the high cost of training. In this
+work, we propose a novel approach that integrates Transformer-based KGE models
+with cross-modal context generated by pre-trained VLMs, thereby extending their
+applicability to MMKGC. Specifically, we employ a pre-trained VLM to transform
+relevant visual information from entities and their neighbors into textual
+sequences. We then frame KGC as a sequence-to-sequence task, fine-tuning the
+model with the generated cross-modal context. This simple yet effective method
+significantly reduces model size compared to traditional KGE approaches while
+achieving competitive performance across multiple large-scale datasets with
+minimal hyperparameter tuning.
 
-摘要：<paragraph>近年來，乳癌的盛行率迅速增加，使其成為全球主要的死亡原因之一。在所有癌症中，乳癌迄今為止是最常見的。手動診斷此疾病需要大量的時間和專業知識。由於乳癌的檢測過程耗時，因此透過建立機器學習模型來預測，有助於防止其進一步擴散。機器學習和可解釋 AI 在分類中至關重要，因為它們不僅可以提供準確的預測，還可以深入了解模型如何做出決策，有助於理解和信賴分類結果。在此研究中，我們評估並比較了五種不同的機器學習方法的分類準確度、精確度、召回率和 F1 分數，使用了一個主要的資料集（達卡醫學院醫院的 500 名患者）。五種不同的監督式機器學習技術，包括決策樹、隨機森林、邏輯迴歸、朴素貝氏和 XGBoost，已用於在我們的資料集上取得最佳結果。此外，本研究將 SHAP 分析應用於 XGBoost 模型，以解釋模型的預測並了解每個特徵對模型輸出的影響。我們比較了幾種演算法對資料進行分類的準確度，並與該領域的其他文獻進行對比。在最後評估後，本研究發現 XGBoost 達到了最佳的模型準確度，為 97%。</paragraph>
+摘要：多模態知識圖譜補全 (MMKGC) 旨在透過利用來自各種模態與結構化資料的資訊，來預測多模態知識圖譜 (MMKG) 中的缺失連結。現有的 MMKGC 方法主要擴充傳統的知識圖譜嵌入 (KGE) 模型，這些模型通常需要為每個實體建立一個嵌入。這會導致模型尺寸過大，且在整合多模態資訊時效率低下，特別是對於真實世界的圖譜。與此同時，基於 Transformer 的模型已在知識圖譜補全 (KGC) 中展現出競爭力。然而，它們著重於單模態知識，限制了它們利用跨模態資訊的能力。最近，大型視覺語言模型 (VLM) 已在跨模態任務中展現潛力，但受限於訓練成本過高。在這項工作中，我們提出了一種創新的方法，它將基於 Transformer 的 KGE 模型與預先訓練的 VLM 所產生的跨模態內容整合在一起，從而擴展它們在 MMKGC 中的適用性。具體來說，我們採用預先訓練的 VLM，將實體及其鄰居相關的視覺資訊轉換成文字序列。然後，我們將 KGC 架構成一個序列到序列的任務，並使用產生的跨模態內容微調模型。這種簡單但有效的方法，與傳統的 KGE 方法相比，大幅減少了模型尺寸，同時在多個大型資料集上達到了競爭力的效能，且只需最少的超參數調整。
 
-##### **Enhancing Breast Cancer Diagnosis in Mammography: Evaluation and Integration of Convolutional Neural Networks and Explainable AI**
-2404.03892v3 by Maryam Ahmed, Tooba Bibi, Rizwan Ahmed Khan, Sidra Nasir
+##### **How to Mitigate Information Loss in Knowledge Graphs for GraphRAG: Leveraging Triple Context Restoration and Query-Driven Feedback**
+2501.15378v1 by Manzong Huang, Chenyang Bu, Yi He, Xindong Wu
 
-The Deep learning (DL) models for diagnosing breast cancer from mammographic
-images often operate as "black boxes", making it difficult for healthcare
-professionals to trust and understand their decision-making processes. The
-study presents an integrated framework combining Convolutional Neural Networks
-(CNNs) and Explainable Artificial Intelligence (XAI) for the enhanced diagnosis
-of breast cancer using the CBIS-DDSM dataset. The methodology encompasses an
-elaborate data preprocessing pipeline and advanced data augmentation techniques
-to counteract dataset limitations and transfer learning using pre-trained
-networks such as VGG-16, Inception-V3 and ResNet was employed. A focal point of
-our study is the evaluation of XAI's effectiveness in interpreting model
-predictions, highlighted by utilizing the Hausdorff measure to assess the
-alignment between AI-generated explanations and expert annotations
-quantitatively. This approach is critical for XAI in promoting trustworthiness
-and ethical fairness in AI-assisted diagnostics. The findings from our research
-illustrate the effective collaboration between CNNs and XAI in advancing
-diagnostic methods for breast cancer, thereby facilitating a more seamless
-integration of advanced AI technologies within clinical settings. By enhancing
-the interpretability of AI driven decisions, this work lays the groundwork for
-improved collaboration between AI systems and medical practitioners, ultimately
-enriching patient care. Furthermore, the implications of our research extended
-well beyond the current methodologies. It encourages further research into how
-to combine multimodal data and improve AI explanations to meet the needs of
-clinical practice.
+Knowledge Graph (KG)-augmented Large Language Models (LLMs) have recently
+propelled significant advances in complex reasoning tasks, thanks to their
+broad domain knowledge and contextual awareness. Unfortunately, current methods
+often assume KGs to be complete, which is impractical given the inherent
+limitations of KG construction and the potential loss of contextual cues when
+converting unstructured text into entity-relation triples. In response, this
+paper proposes the Triple Context Restoration and Query-driven Feedback
+(TCR-QF) framework, which reconstructs the textual context underlying each
+triple to mitigate information loss, while dynamically refining the KG
+structure by iteratively incorporating query-relevant missing knowledge.
+Experiments on five benchmark question-answering datasets substantiate the
+effectiveness of TCR-QF in KG and LLM integration, where itachieves a 29.1%
+improvement in Exact Match and a 15.5% improvement in F1 over its
+state-of-the-art GraphRAG competitors.
 
-摘要：深度學習 (DL) 用於從乳房攝影術影像診斷乳癌的模型通常以「黑盒子」方式運作，這使得醫療保健專業人員難以信任和理解其決策過程。本研究提出一個整合架構，結合卷積神經網路 (CNN) 和可解釋人工智慧 (XAI)，以使用 CBIS-DDSM 資料集增強乳癌的診斷。方法包含一個精細的資料前處理管線和進階資料擴充技術，以對抗資料集限制，並採用預先訓練的網路（例如 VGG-16、Inception-V3 和 ResNet）進行遷移學習。我們研究的重點是評估 XAI 在解釋模型預測中的有效性，重點利用豪斯多夫測度量化評估 AI 生成的解釋和專家註解之間的一致性。這種方法對於 XAI 在促進 AI 輔助診斷中的可信度和倫理公平性至關重要。我們研究的發現說明了 CNN 和 XAI 在推進乳癌診斷方法中的有效協作，從而促進了先進 AI 技術在臨床環境中的更順暢整合。透過增強 AI 驅動決策的可解釋性，這項工作為 AI 系統和醫療從業人員之間的改善協作奠定了基礎，最終豐富了患者照護。此外，我們研究的影響遠遠超出了目前的技術。它鼓勵進一步研究如何結合多模式資料並改善 AI 解釋，以滿足臨床實務的需求。
+摘要：知識圖譜 (KG) 增強大型語言模型 (LLM) 最近推動複雜推理任務的重大進展，這要歸功於它們廣泛的領域知識和語境感知。不幸的是，目前的模型通常假設 KG 是完整的，這在考慮到 KG 建構的固有限制和在將非結構化文字轉換為實體關係三元組時潛在的語境線索損失時是不切實際的。為了解決這個問題，本文提出了三元組語境還原和查詢驅動回饋 (TCR-QF) 架構，它重建每個三元組底層的文字語境以減輕資訊損失，同時透過反覆納入與查詢相關的遺失知識來動態優化 KG 結構。在五個基準問題回答資料集上的實驗證實了 TCR-QF 在 KG 和 LLM 整合方面的有效性，它在 Exact Match 中獲得 29.1% 的改進，在 F1 中獲得 15.5% 的改進，優於最先進的 GraphRAG 競爭對手。
 
-##### **Advancing Multimodal Data Fusion in Pain Recognition: A Strategy Leveraging Statistical Correlation and Human-Centered Perspectives**
-2404.00320v2 by Xingrui Gu, Zhixuan Wang, Irisa Jin, Zekun Wu
+##### **Explaining Categorical Feature Interactions Using Graph Covariance and LLMs**
+2501.14932v1 by Cencheng Shen, Darren Edge, Jonathan Larson, Carey E. Priebe
 
-This research presents a novel multimodal data fusion methodology for pain
-behavior recognition, integrating statistical correlation analysis with
-human-centered insights. Our approach introduces two key innovations: 1)
-integrating data-driven statistical relevance weights into the fusion strategy
-to effectively utilize complementary information from heterogeneous modalities,
-and 2) incorporating human-centric movement characteristics into multimodal
-representation learning for detailed modeling of pain behaviors. Validated
-across various deep learning architectures, our method demonstrates superior
-performance and broad applicability. We propose a customizable framework that
-aligns each modality with a suitable classifier based on statistical
-significance, advancing personalized and effective multimodal fusion.
-Furthermore, our methodology provides explainable analysis of multimodal data,
-contributing to interpretable and explainable AI in healthcare. By highlighting
-the importance of data diversity and modality-specific representations, we
-enhance traditional fusion techniques and set new standards for recognizing
-complex pain behaviors. Our findings have significant implications for
-promoting patient-centered healthcare interventions and supporting explainable
-clinical decision-making.
+Modern datasets often consist of numerous samples with abundant features and
+associated timestamps. Analyzing such datasets to uncover underlying events
+typically requires complex statistical methods and substantial domain
+expertise. A notable example, and the primary data focus of this paper, is the
+global synthetic dataset from the Counter Trafficking Data Collaborative (CTDC)
+-- a global hub of human trafficking data containing over 200,000 anonymized
+records spanning from 2002 to 2022, with numerous categorical features for each
+record. In this paper, we propose a fast and scalable method for analyzing and
+extracting significant categorical feature interactions, and querying large
+language models (LLMs) to generate data-driven insights that explain these
+interactions. Our approach begins with a binarization step for categorical
+features using one-hot encoding, followed by the computation of graph
+covariance at each time. This graph covariance quantifies temporal changes in
+dependence structures within categorical data and is established as a
+consistent dependence measure under the Bernoulli distribution. We use this
+measure to identify significant feature pairs, such as those with the most
+frequent trends over time or those exhibiting sudden spikes in dependence at
+specific moments. These extracted feature pairs, along with their timestamps,
+are subsequently passed to an LLM tasked with generating potential explanations
+of the underlying events driving these dependence changes. The effectiveness of
+our method is demonstrated through extensive simulations, and its application
+to the CTDC dataset reveals meaningful feature pairs and potential data stories
+underlying the observed feature interactions.
 
-摘要：本研究提出了一種創新的多模態數據融合方法，用於疼痛行為識別，將統計相關分析與以人為中心的見解相結合。我們的做法引入了兩項關鍵創新：1) 將數據驅動的統計相關權重整合到融合策略中，以有效利用來自異質模態的補充信息，以及 2) 將以人為中心的運動特徵納入多模態表示學習中，以詳細建模疼痛行為。我們的模型在各種深度學習架構中得到驗證，展示了卓越的性能和廣泛的適用性。我們提出了一個可自定義的框架，根據統計顯著性將每個模態與合適的分類器對齊，推進個性化和有效的多模態融合。此外，我們的模型提供對多模態數據的可解釋分析，有助於醫療保健中的可解釋和可解釋 AI。通過強調數據多樣性和模態特定表示的重要性，我們增強了傳統的融合技術，並為識別複雜的疼痛行為設定了新的標準。我們的發現對促進以患者為中心的醫療保健干預和支持可解釋的臨床決策制定具有重要意義。
+摘要：現代資料集通常包含許多具有豐富特徵和關聯時間戳的樣本。分析此類資料集以揭示底層事件通常需要複雜的統計方法和大量的領域專業知識。一個值得注意的範例，也是本文的主要資料重點，是來自反人口販運資料合作組織 (CTDC) 的全球合成資料集，這是全球人口販運資料的樞紐，包含超過 200,000 筆從 2002 年到 2022 年的匿名記錄，每個記錄都有許多分類特徵。在本文中，我們提出了一種快速且可擴充的方法，用於分析和提取重要的分類特徵交互作用，並查詢大型語言模型 (LLM)，以產生資料驅動的見解來解釋這些交互作用。我們的做法從使用獨熱編碼對分類特徵進行二元化步驟開始，然後在每個時間點計算圖形共變異數。此圖形共變異數量化了分類資料中依賴結構的時間變化，並在伯努利分佈下建立為一致的依賴度量。我們使用此度量來識別重要的特徵對，例如隨時間推移趨勢最頻繁的特徵對，或在特定時刻表現出依賴性突然激增的特徵對。這些提取的特徵對及其時間戳隨後傳遞給 LLM，後者負責產生對驅動這些依賴性變化的底層事件的潛在解釋。我們的方法的有效性已通過廣泛的模擬得到證明，其在 CTDC 資料集中的應用揭示了有意義的特徵對和潛在的資料故事，這些故事是觀察到的特徵交互作用的基礎。
 
-##### **Addressing Social Misattributions of Large Language Models: An HCXAI-based Approach**
-2403.17873v1 by Andrea Ferrario, Alberto Termine, Alessandro Facchini
+##### **Causal Graphs Meet Thoughts: Enhancing Complex Reasoning in Graph-Augmented LLMs**
+2501.14892v1 by Hang Luo, Jian Zhang, Chujun Li
 
-Human-centered explainable AI (HCXAI) advocates for the integration of social
-aspects into AI explanations. Central to the HCXAI discourse is the Social
-Transparency (ST) framework, which aims to make the socio-organizational
-context of AI systems accessible to their users. In this work, we suggest
-extending the ST framework to address the risks of social misattributions in
-Large Language Models (LLMs), particularly in sensitive areas like mental
-health. In fact LLMs, which are remarkably capable of simulating roles and
-personas, may lead to mismatches between designers' intentions and users'
-perceptions of social attributes, risking to promote emotional manipulation and
-dangerous behaviors, cases of epistemic injustice, and unwarranted trust. To
-address these issues, we propose enhancing the ST framework with a fifth
-'W-question' to clarify the specific social attributions assigned to LLMs by
-its designers and users. This addition aims to bridge the gap between LLM
-capabilities and user perceptions, promoting the ethically responsible
-development and use of LLM-based technology.
+In knowledge-intensive tasks, especially in high-stakes domains like medicine
+and law, it is critical not only to retrieve relevant information but also to
+provide causal reasoning and explainability. Large language models (LLMs) have
+achieved remarkable performance in natural language understanding and
+generation tasks. However, they often suffer from limitations such as
+difficulty in incorporating new knowledge, generating hallucinations, and
+explaining their reasoning process. To address these challenges, integrating
+knowledge graphs with Graph Retrieval-Augmented Generation (Graph RAG) has
+emerged as an effective solution. Traditional Graph RAG methods often rely on
+simple graph traversal or semantic similarity, which do not capture causal
+relationships or align well with the model's internal reasoning steps. This
+paper proposes a novel pipeline that filters large knowledge graphs to
+emphasize cause-effect edges, aligns the retrieval process with the model's
+chain-of-thought (CoT), and enhances reasoning through multi-stage path
+improvements. Experiments on medical question-answering tasks show consistent
+gains, with up to a 10\% absolute improvement across multiple large language
+models (LLMs). This approach demonstrates the value of combining causal
+reasoning with stepwise retrieval, leading to more interpretable and logically
+grounded solutions for complex queries.
 
-摘要：以人为本的可解释 AI (HCXAI) 倡导将社会层面整合到 AI 解释中。HCXAI 话语的核心是社会透明度 (ST) 框架，其目标是让 AI 系统的社会组织背景对用户来说是可理解的。在这项工作中，我们建议扩展 ST 框架以解决大型语言模型 (LLM) 中社会错误归因的风险，尤其是在心理健康等敏感领域。事实上，LLM 能够出色地模拟角色和人格，这可能导致设计者的意图和用户对社会属性的认知之间出现错配，从而有风险促进情绪操纵和危险行为、认知不公正和不合理的信任。为了解决这些问题，我们建议用第五个“W 问题”来增强 ST 框架，以明确设计者和用户赋予 LLM 的具体社会属性。此补充旨在弥合 LLM 能力和用户认知之间的差距，促进基于 LLM 的技术在道德上负责任地开发和使用。
+摘要：在知識密集型任務中，特別是在醫學和法律等高風險領域，不僅檢索相關資訊至關重要，還必須提供因果推理和可解釋性。大型語言模型 (LLM) 在自然語言理解和生成任務中取得了顯著的表現。然而，它們通常會遇到一些限制，例如難以納入新知識、產生幻覺，以及解釋其推理過程。為了應對這些挑戰，將知識圖與圖形檢索增強生成 (Graph RAG) 整合在一起已成為一種有效的解決方案。傳統的 Graph RAG 方法通常依賴於簡單的圖形遍歷或語義相似性，這無法捕捉因果關係或與模型的內部推理步驟很好地對齊。本文提出了一個新穎的管道，該管道過濾大型知識圖以強調因果邊緣，將檢索過程與模型的思想鏈 (CoT) 對齊，並通過多階段路徑改進來增強推理。在醫療問題解答任務上的實驗顯示出一致的收益，在多個大型語言模型 (LLM) 中絕對改進幅度高達 10%。這種方法展示了將因果推理與逐步檢索相結合的價值，從而為複雜查詢提供更具可解釋性和邏輯依據的解決方案。
 
-##### **Clinical Domain Knowledge-Derived Template Improves Post Hoc AI Explanations in Pneumothorax Classification**
-2403.18871v1 by Han Yuan, Chuan Hong, Pengtao Jiang, Gangming Zhao, Nguyen Tuan Anh Tran, Xinxing Xu, Yet Yen Yan, Nan Liu
+##### **GraPPI: A Retrieve-Divide-Solve GraphRAG Framework for Large-scale Protein-protein Interaction Exploration**
+2501.16382v1 by Ziwen Li, Xiang 'Anthony' Chen, Youngseung Jeon
 
-Background: Pneumothorax is an acute thoracic disease caused by abnormal air
-collection between the lungs and chest wall. To address the opaqueness often
-associated with deep learning (DL) models, explainable artificial intelligence
-(XAI) methods have been introduced to outline regions related to pneumothorax
-diagnoses made by DL models. However, these explanations sometimes diverge from
-actual lesion areas, highlighting the need for further improvement. Method: We
-propose a template-guided approach to incorporate the clinical knowledge of
-pneumothorax into model explanations generated by XAI methods, thereby
-enhancing the quality of these explanations. Utilizing one lesion delineation
-created by radiologists, our approach first generates a template that
-represents potential areas of pneumothorax occurrence. This template is then
-superimposed on model explanations to filter out extraneous explanations that
-fall outside the template's boundaries. To validate its efficacy, we carried
-out a comparative analysis of three XAI methods with and without our template
-guidance when explaining two DL models in two real-world datasets. Results: The
-proposed approach consistently improved baseline XAI methods across twelve
-benchmark scenarios built on three XAI methods, two DL models, and two
-datasets. The average incremental percentages, calculated by the performance
-improvements over the baseline performance, were 97.8% in Intersection over
-Union (IoU) and 94.1% in Dice Similarity Coefficient (DSC) when comparing model
-explanations and ground-truth lesion areas. Conclusions: In the context of
-pneumothorax diagnoses, we proposed a template-guided approach for improving AI
-explanations. We anticipate that our template guidance will forge a fresh
-approach to elucidating AI models by integrating clinical domain expertise.
+Drug discovery (DD) has tremendously contributed to maintaining and improving
+public health. Hypothesizing that inhibiting protein misfolding can slow
+disease progression, researchers focus on target identification (Target ID) to
+find protein structures for drug binding. While Large Language Models (LLMs)
+and Retrieval-Augmented Generation (RAG) frameworks have accelerated drug
+discovery, integrating models into cohesive workflows remains challenging. We
+conducted a user study with drug discovery researchers to identify the
+applicability of LLMs and RAGs in Target ID. We identified two main findings:
+1) an LLM should provide multiple Protein-Protein Interactions (PPIs) based on
+an initial protein and protein candidates that have a therapeutic impact; 2)
+the model must provide the PPI and relevant explanations for better
+understanding. Based on these observations, we identified three limitations in
+previous approaches for Target ID: 1) semantic ambiguity, 2) lack of
+explainability, and 3) short retrieval units. To address these issues, we
+propose GraPPI, a large-scale knowledge graph (KG)-based retrieve-divide-solve
+agent pipeline RAG framework to support large-scale PPI signaling pathway
+exploration in understanding therapeutic impacts by decomposing the analysis of
+entire PPI pathways into sub-tasks focused on the analysis of PPI edges.
 
-摘要：<paragraph>背景：氣胸是一種因肺部與胸壁之間異常集氣所引起的急性胸腔疾病。為了解決深度學習（DL）模型經常伴隨的不透明性，可解釋人工智慧（XAI）方法已被引入，用於概述與 DL 模型做出的氣胸診斷相關的區域。然而，這些解釋有時會與實際病灶區域有所出入，突顯出進一步改進的必要性。方法：我們提出了一種模板引導式方法，將氣胸的臨床知識納入 XAI 方法產生的模型解釋中，從而提升這些解釋的品質。利用放射科醫師建立的病灶描繪，我們的做法首先產生一個模板，用於表示氣胸可能發生的區域。然後將此模板疊加在模型解釋上，以篩選出超出模板邊界的無關解釋。為了驗證其效力，我們對三種 XAI 方法進行了比較分析，在兩個真實世界資料集中解釋兩個 DL 模型時，分別採用和不採用我們的模板引導。結果：所提出的方法在建立於三種 XAI 方法、兩個 DL 模型和兩個資料集的十二種基準情境中，始終改善了基準 XAI 方法。在比較模型解釋和真實病灶區域時，透過基準效能的效能改進計算出的平均增量百分比為交集比（IoU）的 97.8% 和骰子相似性係數（DSC）的 94.1%。結論：在氣胸診斷的背景下，我們提出了一種模板引導式方法，用於改善 AI 解釋。我們預期我們的模板引導將透過整合臨床領域專業知識，為闡明 AI 模型建立一種新方法。</paragraph>
+摘要：药物发现 (DD) 极大地促进了公共卫生的维护和改善。研究人员假设抑制蛋白质错误折叠可以减缓疾病进展，因此专注于靶点识别 (Target ID) 以找到用于药物结合的蛋白质结构。虽然大型语言模型 (LLM) 和检索增强生成 (RAG) 框架加速了药物发现，但将模型整合到内聚工作流中仍然具有挑战性。我们与药物发现研究人员进行了一项用户研究，以确定 LLM 和 RAG 在 Target ID 中的适用性。我们确定了两个主要发现：1) LLM 应该基于初始蛋白质和具有治疗作用的蛋白质候选物提供多个蛋白质-蛋白质相互作用 (PPI)；2) 该模型必须提供 PPI 和相关解释以更好地理解。基于这些观察，我们发现了先前 Target ID 方法中的三个局限性：1) 语义歧义，2) 缺乏可解释性，3) 检索单元短。为了解决这些问题，我们提出了 GraPPI，这是一种基于大规模知识图 (KG) 的检索-分解-求解代理管道 RAG 框架，以支持大规模 PPI 信号通路探索，通过将整个 PPI 通路的分析分解为专注于 PPI 边缘分析的子任务来理解治疗影响。
 
-##### **Enhancing Neural Machine Translation of Low-Resource Languages: Corpus Development, Human Evaluation and Explainable AI Architectures**
-2403.01580v1 by Séamus Lankford
+##### **Evaluating and Improving Graph to Text Generation with Large Language Models**
+2501.14497v1 by Jie He, Yijun Yang, Wanqiu Long, Deyi Xiong, Victor Gutierrez Basulto, Jeff Z. Pan
 
-In the current machine translation (MT) landscape, the Transformer
-architecture stands out as the gold standard, especially for high-resource
-language pairs. This research delves into its efficacy for low-resource
-language pairs including both the English$\leftrightarrow$Irish and
-English$\leftrightarrow$Marathi language pairs. Notably, the study identifies
-the optimal hyperparameters and subword model type to significantly improve the
-translation quality of Transformer models for low-resource language pairs.
-  The scarcity of parallel datasets for low-resource languages can hinder MT
-development. To address this, gaHealth was developed, the first bilingual
-corpus of health data for the Irish language. Focusing on the health domain,
-models developed using this in-domain dataset exhibited very significant
-improvements in BLEU score when compared with models from the LoResMT2021
-Shared Task. A subsequent human evaluation using the multidimensional quality
-metrics error taxonomy showcased the superior performance of the Transformer
-system in reducing both accuracy and fluency errors compared to an RNN-based
-counterpart.
-  Furthermore, this thesis introduces adaptNMT and adaptMLLM, two open-source
-applications streamlined for the development, fine-tuning, and deployment of
-neural machine translation models. These tools considerably simplify the setup
-and evaluation process, making MT more accessible to both developers and
-translators. Notably, adaptNMT, grounded in the OpenNMT ecosystem, promotes
-eco-friendly natural language processing research by highlighting the
-environmental footprint of model development. Fine-tuning of MLLMs by adaptMLLM
-demonstrated advancements in translation performance for two low-resource
-language pairs: English$\leftrightarrow$Irish and
-English$\leftrightarrow$Marathi, compared to baselines from the LoResMT2021
-Shared Task.
+Large language models (LLMs) have demonstrated immense potential across
+various tasks. However, research for exploring and improving the capabilities
+of LLMs in interpreting graph structures remains limited. To address this gap,
+we conduct a comprehensive evaluation of prompting current open-source LLMs on
+graph-to-text generation tasks. Although we explored the optimal prompting
+strategies and proposed a novel and effective diversity-difficulty-based
+few-shot sample selection method, we found that the improvements from
+tuning-free approaches were incremental, as LLMs struggle with planning on
+complex graphs, particularly those with a larger number of triplets. To further
+improve LLMs in planning with graph sequences and grounding in truth, we
+introduce a new graph-to-text dataset, PlanGTG, annotated with two sub-tasks:
+reordering and attribution. Through extensive automatic and human evaluations,
+we demonstrate significant improvements in the quality of generated text from
+both few-shot learning and fine-tuning perspectives using the PlanGTG dataset.
+Our study paves the way for new research directions in graph-to-text
+generation. PlanGTG datasets can be found in https://github.com/probe2/kg_text.
 
-摘要：<paragraph>在當前機器翻譯 (MT) 領域中，Transformer 架構脫穎而出，成為黃金標準，特別是對於高資源語言對。本研究探討其對低資源語言對的效能，包括英語↔愛爾蘭語和英語↔馬拉地語語言對。值得注意的是，本研究識別出最佳超參數和子詞模型類型，以顯著提高 Transformer 模型對低資源語言對的翻譯品質。
-低資源語言的平行資料集的稀缺會阻礙 MT 的發展。為了解決這個問題，開發了 gaHealth，這是愛爾蘭語的第一個雙語健康資料語料庫。專注於健康領域，使用此域內資料集開發的模型在 BLEU 得分方面表現出非常顯著的進步，與 LoResMT2021 共享任務中的模型相比。隨後使用多維品質指標錯誤分類法進行的人工評估顯示，與基於 RNN 的對應模型相比，Transformer 系統在減少準確性和流暢性錯誤方面表現出優異的性能。
-此外，本論文介紹了 adaptNMT 和 adaptMLLM，這兩個開源應用程式簡化了神經機器翻譯模型的開發、微調和部署。這些工具大幅簡化了設定和評估流程，讓 MT 更容易讓開發人員和翻譯人員使用。值得注意的是，adaptNMT 以 OpenNMT 生態系統為基礎，通過強調模型開發的環境足跡來促進生態友好的自然語言處理研究。與 LoResMT2021 共享任務中的基準相比，adaptMLLM 對 MLLM 的微調證明了英語↔愛爾蘭語和英語↔馬拉地語這兩個低資源語言對的翻譯性能進步。</paragraph>
+摘要：大型語言模型（LLM）已在各種任務中展現出巨大的潛力。然而，探索和提升 LLM 在詮釋圖形結構方面的能力的研究仍然有限。為了解決這個差距，我們對提示目前開源的 LLM 執行圖形轉文字生成任務進行全面評估。儘管我們探索了最佳提示策略並提出了一種新穎且有效的基於多樣性難度的少樣本選擇方法，但我們發現無調校方法的改進是漸進的，因為 LLM 難以規劃複雜的圖形，特別是那些具有較多三元組的圖形。為了進一步提升 LLM 在圖形序列規劃和真實依據方面的能力，我們引入了一個新的圖形轉文字資料集 PlanGTG，並註解了兩個子任務：重新排序和歸因。透過廣泛的自動化和人工評估，我們證明了使用 PlanGTG 資料集從少樣本學習和微調角度產生文字的品質有顯著提升。我們的研究為圖形轉文字生成中的新研究方向鋪路。PlanGTG 資料集可以在 https://github.com/probe2/kg_text 中找到。
 
-##### **Cause and Effect: Can Large Language Models Truly Understand Causality?**
-2402.18139v3 by Swagata Ashwani, Kshiteesh Hegde, Nishith Reddy Mannuru, Mayank Jindal, Dushyant Singh Sengar, Krishna Chaitanya Rao Kathala, Dishant Banga, Vinija Jain, Aman Chadha
+##### **Fast Think-on-Graph: Wider, Deeper and Faster Reasoning of Large Language Model on Knowledge Graph**
+2501.14300v1 by Xujian Liang, Zhaoquan Gu
 
-With the rise of Large Language Models(LLMs), it has become crucial to
-understand their capabilities and limitations in deciphering and explaining the
-complex web of causal relationships that language entails. Current methods use
-either explicit or implicit causal reasoning, yet there is a strong need for a
-unified approach combining both to tackle a wide array of causal relationships
-more effectively. This research proposes a novel architecture called Context
-Aware Reasoning Enhancement with Counterfactual Analysis(CARE CA) framework to
-enhance causal reasoning and explainability. The proposed framework
-incorporates an explicit causal detection module with ConceptNet and
-counterfactual statements, as well as implicit causal detection through LLMs.
-Our framework goes one step further with a layer of counterfactual explanations
-to accentuate LLMs understanding of causality. The knowledge from ConceptNet
-enhances the performance of multiple causal reasoning tasks such as causal
-discovery, causal identification and counterfactual reasoning. The
-counterfactual sentences add explicit knowledge of the not caused by scenarios.
-By combining these powerful modules, our model aims to provide a deeper
-understanding of causal relationships, enabling enhanced interpretability.
-Evaluation of benchmark datasets shows improved performance across all metrics,
-such as accuracy, precision, recall, and F1 scores. We also introduce
-CausalNet, a new dataset accompanied by our code, to facilitate further
-research in this domain.
+Graph Retrieval Augmented Generation (GRAG) is a novel paradigm that takes
+the naive RAG system a step further by integrating graph information, such as
+knowledge graph (KGs), into large-scale language models (LLMs) to mitigate
+hallucination. However, existing GRAG still encounter limitations: 1) simple
+paradigms usually fail with the complex problems due to the narrow and shallow
+correlations capture from KGs 2) methods of strong coupling with KGs tend to be
+high computation cost and time consuming if the graph is dense. In this paper,
+we propose the Fast Think-on-Graph (FastToG), an innovative paradigm for
+enabling LLMs to think ``community by community" within KGs. To do this,
+FastToG employs community detection for deeper correlation capture and two
+stages community pruning - coarse and fine pruning for faster retrieval.
+Furthermore, we also develop two Community-to-Text methods to convert the graph
+structure of communities into textual form for better understanding by LLMs.
+Experimental results demonstrate the effectiveness of FastToG, showcasing
+higher accuracy, faster reasoning, and better explainability compared to the
+previous works.
 
-摘要：隨著大型語言模型 (LLM) 的興起，了解它們在解碼和解釋語言所蘊含的複雜因果關係網路中的能力和限制變得至關重要。目前的技術使用明確或隱含的因果推理，但強烈需要一種統一的方法，結合兩者以更有效地處理廣泛的因果關係。本研究提出了一種稱為情境感知推理增強與反事實分析 (CARE CA) 框架的新架構，以增強因果推理和可解釋性。提出的框架結合了使用 ConceptNet 和反事實陳述的明確因果檢測模組，以及透過 LLM 進行的隱含因果檢測。我們的框架更進一步，加入一層反事實解釋，以強調 LLM 對因果關係的理解。來自 ConceptNet 的知識增強了多項因果推理任務的執行，例如因果發現、因果識別和反事實推理。反事實句加入了未由情境造成的明確知識。透過結合這些強大的模組，我們的模型旨在提供對因果關係更深入的理解，實現增強的可解釋性。基準資料集的評估顯示在所有指標（例如準確度、精確度、召回率和 F1 分數）上都有所提升。我們還引入了 CausalNet，一個新的資料集，並附上了我們的程式碼，以促進在這個領域的進一步研究。
+摘要：圖表檢索增強生成 (GRAG) 是一種新穎的範例，它透過將圖表資訊（例如知識圖表 (KG)) 整合到大型語言模型 (LLM) 中，進一步提升了樸素的 RAG 系統以減輕幻覺。然而，現有的 GRAG 仍會遇到限制：1) 簡單的範例通常會因從 KG 中擷取的關聯性狹隘且淺薄而無法解決複雜的問題 2) 如果圖表很密集，與 KG 強耦合的方法往往會導致高運算成本和耗時。在本文中，我們提出了 Fast Think-on-Graph (FastToG)，這是一種創新的範例，可讓 LLM 在 KG 中「逐個社群」進行思考。為此，FastToG 使用社群偵測來擷取更深入的關聯性，並使用兩個階段的社群修剪（粗略修剪和精細修剪）來加快檢索速度。此外，我們還開發了兩種社群到文字的方法，將社群的圖表結構轉換為文字形式，以便 LLM 更容易理解。實驗結果證明了 FastToG 的有效性，與先前的研究相比，展示出更高的準確性、更快的推理速度和更好的可解釋性。
 
-##### **Artificial Intelligence and Diabetes Mellitus: An Inside Look Through the Retina**
-2402.18600v1 by Yasin Sadeghi Bazargani, Majid Mirzaei, Navid Sobhi, Mirsaeed Abdollahi, Ali Jafarizadeh, Siamak Pedrammehr, Roohallah Alizadehsani, Ru San Tan, Sheikh Mohammed Shariful Islam, U. Rajendra Acharya
+##### **Top Ten Challenges Towards Agentic Neural Graph Databases**
+2501.14224v1 by Jiaxin Bai, Zihao Wang, Yukun Zhou, Hang Yin, Weizhi Fei, Qi Hu, Zheye Deng, Jiayang Cheng, Tianshi Zheng, Hong Ting Tsang, Yisen Gao, Zhongwei Xie, Yufei Li, Lixin Fan, Binhang Yuan, Wei Wang, Lei Chen, Xiaofang Zhou, Yangqiu Song
 
-Diabetes mellitus (DM) predisposes patients to vascular complications.
-Retinal images and vasculature reflect the body's micro- and macrovascular
-health. They can be used to diagnose DM complications, including diabetic
-retinopathy (DR), neuropathy, nephropathy, and atherosclerotic cardiovascular
-disease, as well as forecast the risk of cardiovascular events. Artificial
-intelligence (AI)-enabled systems developed for high-throughput detection of DR
-using digitized retinal images have become clinically adopted. Beyond DR
-screening, AI integration also holds immense potential to address challenges
-associated with the holistic care of the patient with DM. In this work, we aim
-to comprehensively review the literature for studies on AI applications based
-on retinal images related to DM diagnosis, prognostication, and management. We
-will describe the findings of holistic AI-assisted diabetes care, including but
-not limited to DR screening, and discuss barriers to implementing such systems,
-including issues concerning ethics, data privacy, equitable access, and
-explainability. With the ability to evaluate the patient's health status vis a
-vis DM complication as well as risk prognostication of future cardiovascular
-complications, AI-assisted retinal image analysis has the potential to become a
-central tool for modern personalized medicine in patients with DM.
+Graph databases (GDBs) like Neo4j and TigerGraph excel at handling
+interconnected data but lack advanced inference capabilities. Neural Graph
+Databases (NGDBs) address this by integrating Graph Neural Networks (GNNs) for
+predictive analysis and reasoning over incomplete or noisy data. However, NGDBs
+rely on predefined queries and lack autonomy and adaptability. This paper
+introduces Agentic Neural Graph Databases (Agentic NGDBs), which extend NGDBs
+with three core functionalities: autonomous query construction, neural query
+execution, and continuous learning. We identify ten key challenges in realizing
+Agentic NGDBs: semantic unit representation, abductive reasoning, scalable
+query execution, and integration with foundation models like large language
+models (LLMs). By addressing these challenges, Agentic NGDBs can enable
+intelligent, self-improving systems for modern data-driven applications, paving
+the way for adaptable and autonomous data management solutions.
 
-摘要：糖尿病（DM）使患者容易出現血管併發症。
-視網膜影像和血管反映身體的微血管和巨血管健康狀況。它們可用於診斷糖尿病併發症，包括糖尿病視網膜病變（DR）、神經病變、腎病和動脈粥樣硬化性心血管疾病，以及預測心血管事件的風險。為使用數位化視網膜影像進行高通量 DR 檢測而開發的人工智慧（AI）啟用系統已在臨床採用。除了 DR 篩檢外，AI 整合也具有巨大的潛力來應對與糖尿病患者整體照護相關的挑戰。在這項工作中，我們旨在全面回顧基於視網膜影像的 AI 應用相關研究的文獻，這些研究與糖尿病的診斷、預後和管理有關。我們將描述整體 AI 輔助糖尿病照護的發現，包括但不限於 DR 篩檢，並討論實施此類系統的障礙，包括與倫理、資料隱私、公平存取和可解釋性有關的問題。透過評估患者的健康狀況，同時考量糖尿病併發症以及未來心血管併發症的風險預後，AI 輔助視網膜影像分析有潛力成為糖尿病患者現代化個人化醫療的中心工具。
+摘要：圖形資料庫（GDB），例如 Neo4j 和 TigerGraph，擅長處理相互連接的資料，但缺乏進階的推論能力。神經圖形資料庫（NGDB）透過整合圖形神經網路（GNN）來解決這個問題，以進行預測分析和對不完整或有雜訊的資料進行推理。然而，NGDB 依賴於預先定義的查詢，並且缺乏自主性和適應性。本文介紹了代理神經圖形資料庫（Agentic NGDB），它以三項核心功能擴充了 NGDB：自動查詢建構、神經查詢執行和持續學習。我們找出實現 Agentic NGDB 的十大關鍵挑戰：語義單元表示、演繹推理、可擴充查詢執行，以及與基礎模型（例如大型語言模型 (LLM)）整合。透過解決這些挑戰，Agentic NGDB 可以為現代資料驅動應用打造智慧且自我改善的系統，為適應性和自主資料管理解決方案鋪路。
 
-##### **Multi-stakeholder Perspective on Responsible Artificial Intelligence and Acceptability in Education**
-2402.15027v2 by A. J. Karran, P. Charland, J-T. Martineau, A. Ortiz de Guinea Lopez de Arana, AM. Lesage, S. Senecal, P-M. Leger
+##### **GraphRAG under Fire**
+2501.14050v1 by Jiacheng Liang, Yuhui Wang, Changjiang Li, Rongyi Zhu, Tanqiu Jiang, Neil Gong, Ting Wang
 
-This study investigates the acceptability of different artificial
-intelligence (AI) applications in education from a multi-stakeholder
-perspective, including students, teachers, and parents. Acknowledging the
-transformative potential of AI in education, it addresses concerns related to
-data privacy, AI agency, transparency, explainability and the ethical
-deployment of AI. Through a vignette methodology, participants were presented
-with four scenarios where AI's agency, transparency, explainability, and
-privacy were manipulated. After each scenario, participants completed a survey
-that captured their perceptions of AI's global utility, individual usefulness,
-justice, confidence, risk, and intention to use each scenario's AI if
-available. The data collection comprising a final sample of 1198
-multi-stakeholder participants was distributed through a partner institution
-and social media campaigns and focused on individual responses to four AI use
-cases. A mediation analysis of the data indicated that acceptance and trust in
-AI varies significantly across stakeholder groups. We found that the key
-mediators between high and low levels of AI's agency, transparency, and
-explainability, as well as the intention to use the different educational AI,
-included perceived global utility, justice, and confidence. The study
-highlights that the acceptance of AI in education is a nuanced and multifaceted
-issue that requires careful consideration of specific AI applications and their
-characteristics, in addition to the diverse stakeholders' perceptions.
+GraphRAG advances retrieval-augmented generation (RAG) by structuring
+external knowledge as multi-scale knowledge graphs, enabling language models to
+integrate both broad context and granular details in their reasoning. While
+GraphRAG has demonstrated success across domains, its security implications
+remain largely unexplored. To bridge this gap, this work examines GraphRAG's
+vulnerability to poisoning attacks, uncovering an intriguing security paradox:
+compared to conventional RAG, GraphRAG's graph-based indexing and retrieval
+enhance resilience against simple poisoning attacks; meanwhile, the same
+features also create new attack surfaces. We present GRAGPoison, a novel attack
+that exploits shared relations in the knowledge graph to craft poisoning text
+capable of compromising multiple queries simultaneously. GRAGPoison employs
+three key strategies: i) relation injection to introduce false knowledge, ii)
+relation enhancement to amplify poisoning influence, and iii) narrative
+generation to embed malicious content within coherent text. Empirical
+evaluation across diverse datasets and models shows that GRAGPoison
+substantially outperforms existing attacks in terms of effectiveness (up to 98%
+success rate) and scalability (using less than 68% poisoning text). We also
+explore potential defensive measures and their limitations, identifying
+promising directions for future research.
 
-摘要：這項研究從多個利害關係人的角度探討不同的人工智慧 (AI) 應用在教育上的可接受性，包括學生、老師和家長。承認 AI 在教育上的轉型潛力，它解決了與資料隱私、AI 代理、透明度、可解釋性和 AI 的道德部署相關的疑慮。透過小插曲方法，參與者被呈現了四種情境，其中 AI 的代理、透明度、可解釋性和隱私受到操縱。在每個情境後，參與者完成了一項調查，該調查捕捉了他們對 AI 的整體效用、個人效用、正義、信心、風險和如果可用，使用每個情境的 AI 的意圖的看法。資料蒐集包含來自合作機構和社群媒體活動的 1198 位多利害關係人參與者的最終樣本，並專注於對四個 AI 使用案例的個別回應。對資料的調解分析表明，對 AI 的接受度和信任在利害關係人團體之間有顯著差異。我們發現，AI 的代理、透明度和可解釋性高低程度之間的關鍵調解者，以及使用不同教育 AI 的意圖，包括感知到的整體效用、正義和信心。這項研究強調，接受 AI 在教育上的應用是一個微妙且多面向的問題，除了不同的利害關係人的看法外，還需要仔細考慮具體的 AI 應用及其特徵。
+摘要：GraphRAG 透過將外部知識結構化為多尺度知識圖譜，推動了檢索增強生成 (RAG)，使語言模型能夠在其推理中整合廣泛的背景和細微的細節。儘管 GraphRAG 在各個領域都已展現出成功，但其安全性影響在很大程度上仍未被探索。為了彌補這一差距，本研究探討了 GraphRAG 對投毒攻擊的脆弱性，揭示了一個有趣的安全悖論：與傳統的 RAG 相比，GraphRAG 基於圖表的索引和檢索增強了對簡單投毒攻擊的韌性；同時，相同的特徵也創造了新的攻擊面。我們提出了 GRAGPoison，這是一種新穎的攻擊，它利用知識圖譜中的共享關係來製作中毒文本，能夠同時危害多個查詢。GRAGPoison 採用了三項關鍵策略：i) 關係注入以引入錯誤的知識，ii) 關係增強以擴大投毒影響，以及 iii) 敘事生成以將惡意內容嵌入連貫的文本中。在各種數據集和模型上的經驗評估表明，GRAGPoison 在有效性（成功率高達 98%）和可擴展性（使用不到 68% 的投毒文本）方面都明顯優於現有的攻擊。我們還探討了潛在的防禦措施及其局限性，確定了未來研究的有希望的方向。
 
-##### **Deciphering Heartbeat Signatures: A Vision Transformer Approach to Explainable Atrial Fibrillation Detection from ECG Signals**
-2402.09474v2 by Aruna Mohan, Danne Elbers, Or Zilbershot, Fatemeh Afghah, David Vorchheimer
+##### **EICopilot: Search and Explore Enterprise Information over Large-scale Knowledge Graphs with LLM-driven Agents**
+2501.13746v1 by Yuhui Yun, Huilong Ye, Xinru Li, Ruojia Li, Jingfeng Deng, Li Li, Haoyi Xiong
+
+The paper introduces EICopilot, an novel agent-based solution enhancing
+search and exploration of enterprise registration data within extensive online
+knowledge graphs like those detailing legal entities, registered capital, and
+major shareholders. Traditional methods necessitate text-based queries and
+manual subgraph explorations, often resulting in time-consuming processes.
+EICopilot, deployed as a chatbot via Baidu Enterprise Search, improves this
+landscape by utilizing Large Language Models (LLMs) to interpret natural
+language queries. This solution automatically generates and executes Gremlin
+scripts, providing efficient summaries of complex enterprise relationships.
+Distinct feature a data pre-processing pipeline that compiles and annotates
+representative queries into a vector database of examples for In-context
+learning (ICL), a comprehensive reasoning pipeline combining Chain-of-Thought
+with ICL to enhance Gremlin script generation for knowledge graph search and
+exploration, and a novel query masking strategy that improves intent
+recognition for heightened script accuracy. Empirical evaluations demonstrate
+the superior performance of EICopilot, including speed and accuracy, over
+baseline methods, with the \emph{Full Mask} variant achieving a syntax error
+rate reduction to as low as 10.00% and an execution correctness of up to
+82.14%. These components collectively contribute to superior querying
+capabilities and summarization of intricate datasets, positioning EICopilot as
+a groundbreaking tool in the exploration and exploitation of large-scale
+knowledge graphs for enterprise information search.
 
-Remote patient monitoring based on wearable single-lead electrocardiogram
-(ECG) devices has significant potential for enabling the early detection of
-heart disease, especially in combination with artificial intelligence (AI)
-approaches for automated heart disease detection. There have been prior studies
-applying AI approaches based on deep learning for heart disease detection.
-However, these models are yet to be widely accepted as a reliable aid for
-clinical diagnostics, in part due to the current black-box perception
-surrounding many AI algorithms. In particular, there is a need to identify the
-key features of the ECG signal that contribute toward making an accurate
-diagnosis, thereby enhancing the interpretability of the model. In the present
-study, we develop a vision transformer approach to identify atrial fibrillation
-based on single-lead ECG data. A residual network (ResNet) approach is also
-developed for comparison with the vision transformer approach. These models are
-applied to the Chapman-Shaoxing dataset to classify atrial fibrillation, as
-well as another common arrhythmia, sinus bradycardia, and normal sinus rhythm
-heartbeats. The models enable the identification of the key regions of the
-heartbeat that determine the resulting classification, and highlight the
-importance of P-waves and T-waves, as well as heartbeat duration and signal
-amplitude, in distinguishing normal sinus rhythm from atrial fibrillation and
-sinus bradycardia.
+摘要：本文介紹了 EICopilot，這是一種基於代理的新型解決方案，可增強在廣泛的線上知識圖譜中搜尋和探索企業註冊資料，例如詳細說明法律實體、註冊資本和主要股東的資料。傳統方法需要基於文字的查詢和手動子圖探索，通常會導致耗時的流程。EICopilot 部署為百度企業搜尋的聊天機器人，透過利用大型語言模型 (LLM) 來詮釋自然語言查詢，進而改善這項技術。此解決方案會自動產生並執行 Gremlin 腳本，提供複雜企業關係的有效摘要。其獨特功能為資料前處理管線，可將具代表性的查詢編譯並註解到範例的向量資料庫中，以進行脈絡中學習 (ICL)，這是一個結合了思考鏈與 ICL 的綜合推理管線，用於增強 Gremlin 腳本產生，以進行知識圖譜搜尋和探索，以及一種新穎的查詢遮罩策略，可改善意圖辨識，進而提高腳本準確度。實證評估顯示，EICopilot 的效能優於基線方法，包括速度和準確度，其中「完整遮罩」變體將語法錯誤率降低至低於 10.00%，執行正確率高達 82.14%。這些元件共同促成了優異的查詢功能和複雜資料集的摘要，將 EICopilot 定位為探索和利用大規模知識圖譜進行企業資訊搜尋的創新工具。
 
-摘要：<paragraph>基於可穿戴式單導程心電圖 (ECG) 裝置的遠端病患監測在早期偵測心臟疾病方面具有顯著的潛力，特別是與用於自動化心臟疾病偵測的人工智慧 (AI) 方法結合使用時。先前已有研究應用基於深度學習的 AI 方法進行心臟疾病偵測。然而，這些模型尚未被廣泛接受為臨床診斷的可靠輔助工具，部分原因在於圍繞許多 AI 演算法的當前黑箱感知。特別是，有必要找出有助於做出準確診斷的 ECG 訊號關鍵特徵，從而增強模型的可解釋性。在本研究中，我們開發了一種視覺轉換器方法，以根據單導程 ECG 資料找出心房顫動。殘差網路 (ResNet) 方法也已開發出來，以便與視覺轉換器方法進行比較。這些模型應用於 Chapman-Shaoxing 資料集，以分類心房顫動，以及另一種常見的心律不整，竇性心動過緩，和正常竇性心律的心跳。這些模型能夠找出決定最終分類的心跳關鍵區域，並強調 P 波和 T 波，以及心跳持續時間和訊號振幅在區分正常竇性心律與心房顫動和竇性心動過緩方面的重要性。</paragraph>
+##### **Pseudocode-Injection Magic: Enabling LLMs to Tackle Graph Computational Tasks**
+2501.13731v1 by Chang Gong, Wanrui Bian, Zhijie Zhang, Weiguo Zheng
 
+Graph computational tasks are inherently challenging and often demand the
+development of advanced algorithms for effective solutions. With the emergence
+of large language models (LLMs), researchers have begun investigating their
+potential to address these tasks. However, existing approaches are constrained
+by LLMs' limited capability to comprehend complex graph structures and their
+high inference costs, rendering them impractical for handling large-scale
+graphs. Inspired by human approaches to graph problems, we introduce a novel
+framework, PIE (Pseudocode-Injection-Enhanced LLM Reasoning for Graph
+Computational Tasks), which consists of three key steps: problem understanding,
+prompt design, and code generation. In this framework, LLMs are tasked with
+understanding the problem and extracting relevant information to generate
+correct code. The responsibility for analyzing the graph structure and
+executing the code is delegated to the interpreter. We inject task-related
+pseudocodes into the prompts to further assist the LLMs in generating efficient
+code. We also employ cost-effective trial-and-error techniques to ensure that
+the LLM-generated code executes correctly. Unlike other methods that require
+invoking LLMs for each individual test case, PIE only calls the LLM during the
+code generation phase, allowing the generated code to be reused and
+significantly reducing inference costs. Extensive experiments demonstrate that
+PIE outperforms existing baselines in terms of both accuracy and computational
+efficiency.
 
-### Medical
-|Publish Date|Title|Authors|Homepage|Code|
-| :---: | :---: | :---: | :---: | :---: |
-|**2025-02-13**|**Metamorphic Testing for Pose Estimation Systems**|Matias Duran et.al.|[2502.09460v1](http://arxiv.org/abs/2502.09460v1)|null|
-|**2025-02-13**|**The Joint Entity-Relation Extraction Model Based on Span and Interactive Fusion Representation for Chinese Medical Texts with Complex Semantics**|Danni Feng et.al.|[2502.09247v1](http://arxiv.org/abs/2502.09247v1)|null|
-|**2025-02-13**|**From large language models to multimodal AI: A scoping review on the potential of generative AI in medicine**|Lukas Buess et.al.|[2502.09242v1](http://arxiv.org/abs/2502.09242v1)|null|
-|**2025-02-13**|**Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**|Flavio Bertini et.al.|[2502.09218v1](http://arxiv.org/abs/2502.09218v1)|null|
-|**2025-02-13**|**Logical Lease Litigation: Prolog and LLMs for Rental Law Compliance in New York**|Sanskar Sehgal et.al.|[2502.09204v1](http://arxiv.org/abs/2502.09204v1)|null|
-|**2025-02-13**|**Two-Stage Representation Learning for Analyzing Movement Behavior Dynamics in People Living with Dementia**|Jin Cui et.al.|[2502.09173v1](http://arxiv.org/abs/2502.09173v1)|null|
-|**2025-02-12**|**HistoSmith: Single-Stage Histology Image-Label Generation via Conditional Latent Diffusion for Enhanced Cell Segmentation and Classification**|Valentina Vadori et.al.|[2502.08754v1](http://arxiv.org/abs/2502.08754v1)|null|
-|**2025-02-12**|**Brain Latent Progression: Individual-based Spatiotemporal Disease Progression on 3D Brain MRIs via Latent Diffusion**|Lemuel Puglisi et.al.|[2502.08560v1](http://arxiv.org/abs/2502.08560v1)|[link](https://github.com/lemuelpuglisi/brlp)|
-|**2025-02-12**|**Representation Learning to Advance Multi-institutional Studies with Electronic Health Record Data**|Doudou Zhou et.al.|[2502.08547v1](http://arxiv.org/abs/2502.08547v1)|null|
-|**2025-02-12**|**EEG Artifact Detection and Correction with Deep Autoencoders**|David Aquilué-Llorens et.al.|[2502.08686v1](http://arxiv.org/abs/2502.08686v1)|null|
-|**2025-02-12**|**SycEval: Evaluating LLM Sycophancy**|Aaron Fanous et.al.|[2502.08177v1](http://arxiv.org/abs/2502.08177v1)|null|
-|**2025-02-11**|**Caught in the Web of Words: Do LLMs Fall for Spin in Medical Literature?**|Hye Sun Yun et.al.|[2502.07963v1](http://arxiv.org/abs/2502.07963v1)|null|
-|**2025-02-11**|**An Advanced NLP Framework for Automated Medical Diagnosis with DeBERTa and Dynamic Contextual Positional Gating**|Mohammad Ali Labbaf Khaniki et.al.|[2502.07755v1](http://arxiv.org/abs/2502.07755v1)|null|
-|**2025-02-11**|**Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension**|Wenbo Gong et.al.|[2502.07752v1](http://arxiv.org/abs/2502.07752v1)|null|
-|**2025-02-11**|**The Devil is in the Prompts: De-Identification Traces Enhance Memorization Risks in Synthetic Chest X-Ray Generation**|Raman Dutt et.al.|[2502.07516v1](http://arxiv.org/abs/2502.07516v1)|[link](https://github.com/Raman1121/diffusion_memorization)|
-|**2025-02-11**|**KPIs 2024 Challenge: Advancing Glomerular Segmentation from Patch- to Slide-Level**|Ruining Deng et.al.|[2502.07288v1](http://arxiv.org/abs/2502.07288v1)|[link](https://github.com/agaldran/kpis)|
-|**2025-02-11**|**Early Risk Prediction of Pediatric Cardiac Arrest from Electronic Health Records via Multimodal Fused Transformer**|Jiaying Lu et.al.|[2502.07158v1](http://arxiv.org/abs/2502.07158v1)|null|
-|**2025-02-11**|**Explaining 3D Computed Tomography Classifiers with Counterfactuals**|Joseph Paul Cohen et.al.|[2502.07156v1](http://arxiv.org/abs/2502.07156v1)|[link](https://github.com/ieee8023/ct-counterfactuals)|
-|**2025-02-10**|**Interactive Data Harmonization with LLM Agents**|Aécio Santos et.al.|[2502.07132v1](http://arxiv.org/abs/2502.07132v1)|null|
-|**2025-02-10**|**Machine Learning for Everyone: Simplifying Healthcare Analytics with BigQuery ML**|Mohammad Amir Salari et.al.|[2502.07026v1](http://arxiv.org/abs/2502.07026v1)|null|
-|**2025-02-10**|**AIMS.au: A Dataset for the Analysis of Modern Slavery Countermeasures in Corporate Statements**|Adriana Eufrosiana Bora et.al.|[2502.07022v1](http://arxiv.org/abs/2502.07022v1)|null|
-|**2025-02-10**|**Recent Advances, Applications and Open Challenges in Machine Learning for Health: Reflections from Research Roundtables at ML4H 2024 Symposium**|Amin Adibi et.al.|[2502.06693v1](http://arxiv.org/abs/2502.06693v1)|null|
-|**2025-02-10**|**Automatic Evaluation of Healthcare LLMs Beyond Question-Answering**|Anna Arias-Duart et.al.|[2502.06666v1](http://arxiv.org/abs/2502.06666v1)|null|
-|**2025-02-10**|**Few-Shot Classification and Anatomical Localization of Tissues in SPECT Imaging**|Mohammed Abdul Hafeez Khan et.al.|[2502.06632v1](http://arxiv.org/abs/2502.06632v1)|null|
-|**2025-02-10**|**Illegal Waste Detection in Remote Sensing Images: A Case Study**|Federico Gibellini et.al.|[2502.06607v2](http://arxiv.org/abs/2502.06607v2)|null|
-|**2025-02-10**|**FEMBA: Efficient and Scalable EEG Analysis with a Bidirectional Mamba Foundation Model**|Anna Tegon et.al.|[2502.06438v1](http://arxiv.org/abs/2502.06438v1)|null|
-|**2025-02-10**|**Is an Ultra Large Natural Image-Based Foundation Model Superior to a Retina-Specific Model for Detecting Ocular and Systemic Diseases?**|Qingshan Hou et.al.|[2502.06289v1](http://arxiv.org/abs/2502.06289v1)|null|
-|**2025-02-10**|**Integrating Sequence and Image Modeling in Irregular Medical Time Series Through Self-Supervised Learning**|Liuqing Chen et.al.|[2502.06134v1](http://arxiv.org/abs/2502.06134v1)|null|
-|**2025-02-10**|**Foundation Model of Electronic Medical Records for Adaptive Risk Estimation**|Pawel Renc et.al.|[2502.06124v1](http://arxiv.org/abs/2502.06124v1)|null|
-|**2025-02-10**|**Can ChatGPT Diagnose Alzheimer's Disease?**|Quoc-Toan Nguyen et.al.|[2502.06907v1](http://arxiv.org/abs/2502.06907v1)|null|
-|**2025-02-09**|**Protecting Intellectual Property of EEG-based Neural Networks with Watermarking**|Ahmed Abdelaziz et.al.|[2502.05931v1](http://arxiv.org/abs/2502.05931v1)|[link](https://github.com/Prog-Jacob/watermarking-eeg-models)|
-|**2025-02-09**|**Enhancing Depression Detection with Chain-of-Thought Prompting: From Emotion to Reasoning Using Large Language Models**|Shiyu Teng et.al.|[2502.05879v1](http://arxiv.org/abs/2502.05879v1)|null|
-|**2025-02-09**|**LLMs for Drug-Drug Interaction Prediction: A Comprehensive Comparison**|Gabriele De Vito et.al.|[2502.06890v1](http://arxiv.org/abs/2502.06890v1)|null|
-|**2025-02-09**|**Decoding Complexity: Intelligent Pattern Exploration with CHPDA (Context Aware Hybrid Pattern Detection Algorithm)**|Lokesh Koli et.al.|[2502.07815v1](http://arxiv.org/abs/2502.07815v1)|null|
-|**2025-02-09**|**WatchGuardian: Enabling User-Defined Personalized Just-in-Time Intervention on Smartwatch**|Ying Lei et.al.|[2502.05783v1](http://arxiv.org/abs/2502.05783v1)|null|
-|**2025-02-09**|**RECOVER: Designing a Large Language Model-based Remote Patient Monitoring System for Postoperative Gastrointestinal Cancer Care**|Ziqi Yang et.al.|[2502.05740v1](http://arxiv.org/abs/2502.05740v1)|null|
-|**2025-02-08**|**4D VQ-GAN: Synthesising Medical Scans at Any Time Point for Personalised Disease Progression Modelling of Idiopathic Pulmonary Fibrosis**|An Zhao et.al.|[2502.05713v1](http://arxiv.org/abs/2502.05713v1)|null|
-|**2025-02-08**|**KMI: A Dataset of Korean Motivational Interviewing Dialogues for Psychotherapy**|Hyunjong Kim et.al.|[2502.05651v1](http://arxiv.org/abs/2502.05651v1)|null|
-|**2025-02-08**|**ELMTEX: Fine-Tuning Large Language Models for Structured Clinical Information Extraction. A Case Study on Clinical Reports**|Aynur Guluzade et.al.|[2502.05638v1](http://arxiv.org/abs/2502.05638v1)|[link](https://gitlab.cc-asp.fraunhofer.de/health-open/elmtex)|
-|**2025-02-08**|**Multi-scale Masked Autoencoder for Electrocardiogram Anomaly Detection**|Ya Zhou et.al.|[2502.05494v1](http://arxiv.org/abs/2502.05494v1)|null|
-|**2025-02-08**|**DCENWCNet: A Deep CNN Ensemble Network for White Blood Cell Classification with LIME-Based Explainability**|Sibasish Dhibar et.al.|[2502.05459v1](http://arxiv.org/abs/2502.05459v1)|null|
-|**2025-02-07**|**Multi-Class Segmentation of Aortic Branches and Zones in Computed Tomography Angiography: The AortaSeg24 Challenge**|Muhammad Imran et.al.|[2502.05330v1](http://arxiv.org/abs/2502.05330v1)|[link](https://github.com/MaxwellEng/MICCAI_CHANLLENGE24_HJL)|
-|**2025-02-07**|**Homeomorphism Prior for False Positive and Negative Problem in Medical Image Dense Contrastive Representation Learning**|Yuting He et.al.|[2502.05282v1](http://arxiv.org/abs/2502.05282v1)|null|
-|**2025-02-07**|**"It Felt Like I Was Left in the Dark": Exploring Information Needs and Design Opportunities for Family Caregivers of Older Adult Patients in Critical Care Settings**|Shihan Fu et.al.|[2502.05115v1](http://arxiv.org/abs/2502.05115v1)|null|
-|**2025-02-07**|**Mitigating Unintended Memorization with LoRA in Federated Learning for LLMs**|Thierry Bossy et.al.|[2502.05087v1](http://arxiv.org/abs/2502.05087v1)|[link](https://github.com/tuneinsight/federated-llms)|
-|**2025-02-07**|**MedMimic: Physician-Inspired Multimodal Fusion for Early Diagnosis of Fever of Unknown Origin**|Minrui Chen et.al.|[2502.04794v1](http://arxiv.org/abs/2502.04794v1)|null|
-|**2025-02-06**|**MedGNN: Towards Multi-resolution Spatiotemporal Graph Learning for Medical Time Series Classification**|Wei Fan et.al.|[2502.04515v1](http://arxiv.org/abs/2502.04515v1)|[link](https://github.com/aikunyi/MedGNN)|
-|**2025-02-06**|**Integrating Generative Artificial Intelligence in ADRD: A Framework for Streamlining Diagnosis and Care in Neurodegenerative Diseases**|Andrew G. Breithaupt et.al.|[2502.06842v1](http://arxiv.org/abs/2502.06842v1)|null|
-|**2025-02-06**|**Primary Care Diagnoses as a Reliable Predictor for Orthopedic Surgical Interventions**|Khushboo Verma et.al.|[2502.04423v1](http://arxiv.org/abs/2502.04423v1)|null|
-|**2025-02-06**|**Automatic quantification of breast cancer biomarkers from multiple 18F-FDG PET image segmentation**|Tewele W. Tareke et.al.|[2502.04083v1](http://arxiv.org/abs/2502.04083v1)|null|
-|**2025-02-06**|**Generalize Drug Response Prediction by Latent Independent Projection for Asymmetric Constrained Domain Generalization**|Ran Song et.al.|[2502.04034v1](http://arxiv.org/abs/2502.04034v1)|null|
-|**2025-02-06**|**MedRAG: Enhancing Retrieval-augmented Generation with Knowledge Graph-Elicited Reasoning for Healthcare Copilot**|Xuejiao Zhao et.al.|[2502.04413v1](http://arxiv.org/abs/2502.04413v1)|[link](https://github.com/snowteam2023/medrag)|
-|**2025-02-06**|**Transforming Multimodal Models into Action Models for Radiotherapy**|Matteo Ferrante et.al.|[2502.04408v1](http://arxiv.org/abs/2502.04408v1)|null|
-|**2025-02-06**|**Online Location Planning for AI-Defined Vehicles: Optimizing Joint Tasks of Order Serving and Spatio-Temporal Heterogeneous Model Fine-Tuning**|Bokeng Zheng et.al.|[2502.04399v1](http://arxiv.org/abs/2502.04399v1)|null|
-|**2025-02-06**|**Multimodal Medical Code Tokenizer**|Xiaorui Su et.al.|[2502.04397v2](http://arxiv.org/abs/2502.04397v2)|null|
-|**2025-02-06**|**A Retrospective Systematic Study on Hierarchical Sparse Query Transformer-assisted Ultrasound Screening for Early Hepatocellular Carcinoma**|Chaoyin She et.al.|[2502.03772v1](http://arxiv.org/abs/2502.03772v1)|[link](https://github.com/Asunatan/HSQformer)|
-|**2025-02-05**|**Towards Fair Medical AI: Adversarial Debiasing of 3D CT Foundation Embeddings**|Guangyao Zheng et.al.|[2502.04386v1](http://arxiv.org/abs/2502.04386v1)|[link](https://github.com/BioIntelligence-Lab/VAE-Adversarial-Debiasing)|
-|**2025-02-05**|**Clinically-Inspired Hierarchical Multi-Label Classification of Chest X-rays with a Penalty-Based Loss Function**|Mehrdad Asadi et.al.|[2502.03591v1](http://arxiv.org/abs/2502.03591v1)|[link](https://github.com/the-mercury/CIHMLC)|
-|**2025-02-05**|**Code Simulation as a Proxy for High-order Tasks in Large Language Models**|Emanuele La Malfa et.al.|[2502.03568v1](http://arxiv.org/abs/2502.03568v1)|null|
-|**2025-02-05**|**Limitations of Large Language Models in Clinical Problem-Solving Arising from Inflexible Reasoning**|Jonathan Kim et.al.|[2502.04381v1](http://arxiv.org/abs/2502.04381v1)|null|
-|**2025-02-05**|**Accurate AI-Driven Emergency Vehicle Location Tracking in Healthcare ITS Digital Twin**|Sarah Al-Shareeda et.al.|[2502.03396v1](http://arxiv.org/abs/2502.03396v1)|null|
-|**2025-02-05**|**RadVLM: A Multitask Conversational Vision-Language Model for Radiology**|Nicolas Deperrois et.al.|[2502.03333v1](http://arxiv.org/abs/2502.03333v1)|null|
-|**2025-02-05**|**MeDiSumQA: Patient-Oriented Question-Answer Generation from Discharge Letters**|Amin Dada et.al.|[2502.03298v1](http://arxiv.org/abs/2502.03298v1)|null|
-|**2025-02-05**|**Deep Learning Pipeline for Fully Automated Myocardial Infarct Segmentation from Clinical Cardiac MR Scans**|Matthias Schwab et.al.|[2502.03272v1](http://arxiv.org/abs/2502.03272v1)|null|
-|**2025-02-05**|**Long-tailed Medical Diagnosis with Relation-aware Representation Learning and Iterative Classifier Calibration**|Li Pan et.al.|[2502.03238v2](http://arxiv.org/abs/2502.03238v2)|[link](https://github.com/peterlipan/lmd)|
-|**2025-02-05**|**Fine-Tuning Strategies for Continual Online EEG Motor Imagery Decoding: Insights from a Large-Scale Longitudinal Study**|Martin Wimpff et.al.|[2502.06828v1](http://arxiv.org/abs/2502.06828v1)|null|
-|**2025-02-05**|**MedBioLM: Optimizing Medical and Biological QA with Fine-Tuned Large Language Models and Retrieval-Augmented Generation**|Seonok Kim et.al.|[2502.03004v1](http://arxiv.org/abs/2502.03004v1)|null|
-|**2025-02-05**|**Contrastive Token-level Explanations for Graph-based Rumour Detection**|Daniel Wai Kit Chin et.al.|[2502.04366v1](http://arxiv.org/abs/2502.04366v1)|null|
-|**2025-02-05**|**AI-Based Thermal Video Analysis in Privacy-Preserving Healthcare: A Case Study on Detecting Time of Birth**|Jorge García-Torres et.al.|[2502.04365v1](http://arxiv.org/abs/2502.04365v1)|null|
-|**2025-02-04**|**3D Foundation AI Model for Generalizable Disease Detection in Head Computed Tomography**|Weicheng Zhu et.al.|[2502.02779v1](http://arxiv.org/abs/2502.02779v1)|null|
-|**2025-02-04**|**Adaptive Voxel-Weighted Loss Using L1 Norms in Deep Neural Networks for Detection and Segmentation of Prostate Cancer Lesions in PET/CT Images**|Obed Korshie Dzikunu et.al.|[2502.02756v1](http://arxiv.org/abs/2502.02756v1)|[link](https://github.com/obeddzik/pca_segment)|
-|**2025-02-04**|**Diffusion Instruction Tuning**|Chen Jin et.al.|[2502.06814v1](http://arxiv.org/abs/2502.06814v1)|null|
-|**2025-02-04**|**MedRAX: Medical Reasoning Agent for Chest X-ray**|Adibvafa Fallahpour et.al.|[2502.02673v1](http://arxiv.org/abs/2502.02673v1)|[link](https://github.com/bowang-lab/medrax)|
-|**2025-02-04**|**Open Foundation Models in Healthcare: Challenges, Paradoxes, and Opportunities with GenAI Driven Personalized Prescription**|Mahdi Alkaeed et.al.|[2502.04356v1](http://arxiv.org/abs/2502.04356v1)|null|
-|**2025-02-04**|**Decision Theoretic Foundations for Conformal Prediction: Optimal Uncertainty Quantification for Risk-Averse Agents**|Shayan Kiyani et.al.|[2502.02561v1](http://arxiv.org/abs/2502.02561v1)|null|
-|**2025-02-04**|**CoRPA: Adversarial Image Generation for Chest X-rays Using Concept Vector Perturbations and Generative Models**|Amy Rafferty et.al.|[2502.05214v1](http://arxiv.org/abs/2502.05214v1)|null|
-|**2025-02-04**|**A Self-Supervised Framework for Improved Generalisability in Ultrasound B-mode Image Segmentation**|Edward Ellis et.al.|[2502.02489v1](http://arxiv.org/abs/2502.02489v1)|null|
-|**2025-02-04**|**Medical Multimodal Model Stealing Attacks via Adversarial Domain Alignment**|Yaling Shen et.al.|[2502.02438v1](http://arxiv.org/abs/2502.02438v1)|null|
-|**2025-02-04**|**Test Time Training for 4D Medical Image Interpolation**|Qikang Zhang et.al.|[2502.02341v1](http://arxiv.org/abs/2502.02341v1)|[link](https://github.com/chaostheproducer/ttt4d)|
-|**2025-02-04**|**Conversation AI Dialog for Medicare powered by Finetuning and Retrieval Augmented Generation**|Atharva Mangeshkumar Agrawal et.al.|[2502.02249v1](http://arxiv.org/abs/2502.02249v1)|null|
-|**2025-02-04**|**Deep Learning-Based Facial Expression Recognition for the Elderly: A Systematic Review**|F. Xavier Gaya-Morey et.al.|[2502.02618v1](http://arxiv.org/abs/2502.02618v1)|null|
-|**2025-02-04**|**Causally-informed Deep Learning towards Explainable and Generalizable Outcomes Prediction in Critical Care**|Yuxiao Cheng et.al.|[2502.02109v1](http://arxiv.org/abs/2502.02109v1)|null|
-|**2025-02-04**|**JingFang: A Traditional Chinese Medicine Large Language Model of Expert-Level Medical Diagnosis and Syndrome Differentiation-Based Treatment**|Yehan Yan et.al.|[2502.04345v1](http://arxiv.org/abs/2502.04345v1)|null|
-|**2025-02-03**|**An Agentic AI Workflow for Detecting Cognitive Concerns in Real-world Data**|Jiazi Tian et.al.|[2502.01789v1](http://arxiv.org/abs/2502.01789v1)|null|
-|**2025-02-03**|**Can Domain Experts Rely on AI Appropriately? A Case Study on AI-Assisted Prostate Cancer MRI Diagnosis**|Chacha Chen et.al.|[2502.03482v1](http://arxiv.org/abs/2502.03482v1)|null|
-|**2025-02-03**|**Improving Transformer World Models for Data-Efficient RL**|Antoine Dedieu et.al.|[2502.01591v1](http://arxiv.org/abs/2502.01591v1)|null|
-|**2025-02-03**|**Data-Efficient Model for Psychological Resilience Prediction based on Neurological Data**|Zhi Zhang et.al.|[2502.01377v1](http://arxiv.org/abs/2502.01377v1)|null|
-|**2025-02-03**|**OphthBench: A Comprehensive Benchmark for Evaluating Large Language Models in Chinese Ophthalmology**|Chengfeng Zhou et.al.|[2502.01243v1](http://arxiv.org/abs/2502.01243v1)|null|
-|**2025-02-03**|**MIND: Modality-Informed Knowledge Distillation Framework for Multimodal Clinical Prediction Tasks**|Alejandro Guerra-Manzanares et.al.|[2502.01158v1](http://arxiv.org/abs/2502.01158v1)|null|
-|**2025-02-03**|**Beyond Yes or No: Predictive Compliance Monitoring Approaches for Quantifying the Magnitude of Compliance Violations**|Qian Chen et.al.|[2502.01141v1](http://arxiv.org/abs/2502.01141v1)|null|
-|**2025-02-03**|**Pulse-PPG: An Open-Source Field-Trained PPG Foundation Model for Wearable Applications Across Lab and Field Settings**|Mithun Saha et.al.|[2502.01108v1](http://arxiv.org/abs/2502.01108v1)|null|
-|**2025-02-03**|**Tutorial on Using Machine Learning and Deep Learning Models for Mental Illness Detection**|Yeyubei Zhang et.al.|[2502.04342v1](http://arxiv.org/abs/2502.04342v1)|null|
-|**2025-02-02**|**Agent-Based Uncertainty Awareness Improves Automated Radiology Report Labeling with an Open-Source Large Language Model**|Hadas Ben-Atya et.al.|[2502.01691v1](http://arxiv.org/abs/2502.01691v1)|null|
-|**2025-02-02**|**Automated Extraction of Spatio-Semantic Graphs for Identifying Cognitive Impairment**|Si-Ioi Ng et.al.|[2502.01685v1](http://arxiv.org/abs/2502.01685v1)|null|
-|**2025-02-02**|**Registration-Enhanced Segmentation Method for Prostate Cancer in Ultrasound Images**|Shengtian Sang et.al.|[2502.00712v1](http://arxiv.org/abs/2502.00712v1)|null|
-|**2025-02-02**|**TMI-CLNet: Triple-Modal Interaction Network for Chronic Liver Disease Prognosis From Imaging, Clinical, and Radiomic Data Fusion**|Linglong Wu et.al.|[2502.00695v1](http://arxiv.org/abs/2502.00695v1)|null|
-|**2025-02-02**|**Safety at Scale: A Comprehensive Survey of Large Model Safety**|Xingjun Ma et.al.|[2502.05206v2](http://arxiv.org/abs/2502.05206v2)|null|
-|**2025-02-02**|**Enhanced Convolutional Neural Networks for Improved Image Classification**|Xiaoran Yang et.al.|[2502.00663v1](http://arxiv.org/abs/2502.00663v1)|null|
-|**2025-02-02**|**Distribution-aware Fairness Learning in Medical Image Segmentation From A Control-Theoretic Perspective**|Yujin Oh et.al.|[2502.00619v1](http://arxiv.org/abs/2502.00619v1)|null|
-|**2025-02-01**|**Generating crossmodal gene expression from cancer histopathology improves multimodal AI predictions**|Samiran Dey et.al.|[2502.00568v3](http://arxiv.org/abs/2502.00568v3)|[link](https://github.com/Samiran-Dey/PathoGen)|
+摘要：圖表計算任務本質上具有挑戰性，而且通常需要開發先進的演算法才能有效解決。隨著大型語言模型 (LLM) 的出現，研究人員已開始探討其解決這些任務的可能性。然而，現有方法受到 LLM 理解複雜圖形結構的能力有限以及其高推理成本的限制，這使得它們不切實際地處理大規模圖形。受到人類解決圖形問題的方法啟發，我們引入了 PIE（偽代碼注入增強 LLM 圖形計算任務推理）這個新框架，它包含三個關鍵步驟：問題理解、提示設計和代碼生成。在此框架中，LLM 的任務是理解問題並擷取相關資訊以產生正確的代碼。分析圖形結構和執行代碼的責任委派給解釋器。我們將與任務相關的偽代碼注入提示中，以進一步協助 LLM 產生有效的代碼。我們還採用具有成本效益的試錯技術，以確保 LLM 生成的代碼正確執行。與需要為每個個別測試案例呼叫 LLM 的其他方法不同，PIE 僅在代碼產生階段呼叫 LLM，允許重複使用產生的代碼並大幅降低推理成本。大量的實驗證明，PIE 在準確性和計算效率方面都優於現有的基準。
 
-#### Abstracts
-##### **Metamorphic Testing for Pose Estimation Systems**
-2502.09460v1 by Matias Duran, Thomas Laurent, Ellen Rushe, Anthony Ventresque
+##### **CAPRAG: A Large Language Model Solution for Customer Service and Automatic Reporting using Vector and Graph Retrieval-Augmented Generation**
+2501.13993v1 by Hamza Landolsi, Kais Letaief, Nizar Taghouti, Ines Abdeljaoued-Tej
 
-Pose estimation systems are used in a variety of fields, from sports
-analytics to livestock care. Given their potential impact, it is paramount to
-systematically test their behaviour and potential for failure. This is a
-complex task due to the oracle problem and the high cost of manual labelling
-necessary to build ground truth keypoints. This problem is exacerbated by the
-fact that different applications require systems to focus on different subjects
-(e.g., human versus animal) or landmarks (e.g., only extremities versus whole
-body and face), which makes labelled test data rarely reusable. To combat these
-problems we propose MET-POSE, a metamorphic testing framework for pose
-estimation systems that bypasses the need for manual annotation while assessing
-the performance of these systems under different circumstances. MET-POSE thus
-allows users of pose estimation systems to assess the systems in conditions
-that more closely relate to their application without having to label an ad-hoc
-test dataset or rely only on available datasets, which may not be adapted to
-their application domain. While we define MET-POSE in general terms, we also
-present a non-exhaustive list of metamorphic rules that represent common
-challenges in computer vision applications, as well as a specific way to
-evaluate these rules. We then experimentally show the effectiveness of MET-POSE
-by applying it to Mediapipe Holistic, a state of the art human pose estimation
-system, with the FLIC and PHOENIX datasets. With these experiments, we outline
-numerous ways in which the outputs of MET-POSE can uncover faults in pose
-estimation systems at a similar or higher rate than classic testing using hand
-labelled data, and show that users can tailor the rule set they use to the
-faults and level of accuracy relevant to their application.
+The introduction of new features and services in the banking sector often
+overwhelms customers, creating an opportunity for banks to enhance user
+experience through financial chatbots powered by large language models (LLMs).
+We initiated an AI agent designed to provide customers with relevant
+information about banking services and insights from annual reports. We
+proposed a hybrid Customer Analysis Pipeline Retrieval-Augmented Generation
+(CAPRAG) that effectively addresses both relationship-based and contextual
+queries, thereby improving customer engagement in the digital banking
+landscape. To implement this, we developed a processing pipeline to refine text
+data, which we utilized in two main frameworks: Vector RAG and Graph RAG. This
+dual approach enables us to populate both vector and graph databases with
+processed data for efficient retrieval. The Cypher query component is employed
+to effectively query the graph database. When a user submits a query, it is
+first expanded by a query expansion module before being routed to construct a
+final query from the hybrid Knowledge Base (KB). This final query is then sent
+to an open-source LLM for response generation. Overall, our innovative,
+designed to international banks, serves bank's customers in an increasingly
+complex digital environment, enhancing clarity and accessibility of
+information.
 
-摘要：姿勢估計系統應用於各種領域，從運動分析到牲畜照護。鑑於其潛在影響，系統性地測試其行為和故障潛力至關重要。由於預言機問題以及建立地面實況關鍵點所需的手動標記成本高，這是一項複雜的任務。這個問題因不同的應用需要系統專注於不同的主體（例如，人類對動物）或地標（例如，只有四肢對全身和臉部）而加劇，這使得標記的測試數據很少可以重複使用。為了解決這些問題，我們提出了 MET-POSE，這是一個姿勢估計系統的變形測試框架，在評估這些系統在不同情況下的性能時，可以繞過手動註解的需要。因此，MET-POSE 允許姿勢估計系統的使用者在更接近其應用程式的條件下評估系統，而無需標記臨時測試數據集或僅依賴可用數據集，這些數據集可能不適合其應用領域。雖然我們以一般術語定義 MET-POSE，但我們也提供了一個非詳盡的變形規則列表，這些規則代表了電腦視覺應用中的常見挑戰，以及評估這些規則的具體方法。然後，我們通過將 MET-POSE 應用於 Mediapipe Holistic（一種先進的人類姿勢估計系統），並使用 FLIC 和 PHOENIX 數據集，以實驗方式展示 MET-POSE 的有效性。通過這些實驗，我們概述了 MET-POSE 的輸出可以揭示姿勢估計系統中故障的許多方法，其速度與使用手動標記數據的傳統測試類似或更高，並表明使用者可以根據其應用程式相關的故障和準確度等級來調整他們使用的規則集。
+摘要：銀行業中新功能和服務的推出經常讓客戶感到不知所措，這為銀行透過大型語言模型 (LLM) 驅動的金融聊天機器人來提升使用者體驗創造了機會。我們啟動了一個人工智慧代理，旨在為客戶提供有關銀行服務和年度報告見解的相關資訊。我們提出了一個混合式客戶分析管道檢索擴充生成 (CAPRAG)，它有效地處理基於關係和情境式的查詢，從而提升數位銀行環境中的客戶參與度。為了實作這一點，我們開發了一個處理管道來精煉文字資料，我們在兩個主要架構中使用它：Vector RAG 和 Graph RAG。這種雙管齊下的方法讓我們能夠使用處理過的資料來填補向量和圖形資料庫，以利於有效檢索。Cypher 查詢元件用於有效查詢圖形資料庫。當使用者提交查詢時，它會先由查詢擴充模組擴充，然後再路由到混合式知識庫 (KB) 中建構最終查詢。然後這個最終查詢會傳送給開源 LLM 以產生回應。整體而言，我們創新的設計服務於國際銀行，在日益複雜的數位環境中服務銀行客戶，提升資訊的清晰度和可及性。
 
-##### **The Joint Entity-Relation Extraction Model Based on Span and Interactive Fusion Representation for Chinese Medical Texts with Complex Semantics**
-2502.09247v1 by Danni Feng, Runzhi Li, Jing Wang, Siyu Yan, Lihong Ma, Yunli Xing
+##### **Dual-Branch HNSW Approach with Skip Bridges and LID-Driven Optimization**
+2501.13992v1 by Hy Nguyen, Nguyen Hung Nguyen, Nguyen Linh Bao Nguyen, Srikanth Thudumu, Hung Du, Rajesh Vasa, Kon Mouzakis
 
-Joint entity-relation extraction is a critical task in transforming
-unstructured or semi-structured text into triplets, facilitating the
-construction of large-scale knowledge graphs, and supporting various downstream
-applications. Despite its importance, research on Chinese text, particularly
-with complex semantics in specialized domains like medicine, remains limited.
-To address this gap, we introduce the CH-DDI, a Chinese drug-drug interactions
-dataset designed to capture the intricacies of medical text. Leveraging the
-strengths of attention mechanisms in capturing long-range dependencies, we
-propose the SEA module, which enhances the extraction of complex contextual
-semantic information, thereby improving entity recognition and relation
-extraction. Additionally, to address the inefficiencies of existing methods in
-facilitating information exchange between entity recognition and relation
-extraction, we present an interactive fusion representation module. This module
-employs Cross Attention for bidirectional information exchange between the
-tasks and further refines feature extraction through BiLSTM. Experimental
-results on both our CH-DDI dataset and public CoNLL04 dataset demonstrate that
-our model exhibits strong generalization capabilities. On the CH-DDI dataset,
-our model achieves an F1-score of 96.73% for entity recognition and 78.43% for
-relation extraction. On the CoNLL04 dataset, it attains an entity recognition
-precision of 89.54% and a relation extraction accuracy of 71.64%.
+The Hierarchical Navigable Small World (HNSW) algorithm is widely used for
+approximate nearest neighbor (ANN) search, leveraging the principles of
+navigable small-world graphs. However, it faces some limitations. The first is
+the local optima problem, which arises from the algorithm's greedy search
+strategy, selecting neighbors based solely on proximity at each step. This
+often leads to cluster disconnections. The second limitation is that HNSW
+frequently fails to achieve logarithmic complexity, particularly in
+high-dimensional datasets, due to the exhaustive traversal through each layer.
+To address these limitations, we propose a novel algorithm that mitigates local
+optima and cluster disconnections while enhancing the construction speed,
+maintaining inference speed. The first component is a dual-branch HNSW
+structure with LID-based insertion mechanisms, enabling traversal from multiple
+directions. This improves outlier node capture, enhances cluster connectivity,
+accelerates construction speed and reduces the risk of local minima. The second
+component incorporates a bridge-building technique that bypasses redundant
+intermediate layers, maintaining inference and making up the additional
+computational overhead introduced by the dual-branch structure. Experiments on
+various benchmarks and datasets showed that our algorithm outperforms the
+original HNSW in both accuracy and speed. We evaluated six datasets across
+Computer Vision (CV), and Natural Language Processing (NLP), showing recall
+improvements of 18\% in NLP, and up to 30\% in CV tasks while reducing the
+construction time by up to 20\% and maintaining the inference speed. We did not
+observe any trade-offs in our algorithm. Ablation studies revealed that
+LID-based insertion had the greatest impact on performance, followed by the
+dual-branch structure and bridge-building components.
 
-摘要：聯合實體關係抽取是將非結構化或半結構化文字轉換為三元組的重要任務，有助於建構大規模知識圖譜，並支援各種下游應用程式。儘管其重要性，但針對中文文本的研究，特別是醫學等專業領域中具有複雜語義的研究仍十分有限。為了解決這個差距，我們引入了 CH-DDI，一個中文藥物-藥物交互作用資料集，旨在擷取醫學文本的複雜性。利用注意力機制在擷取長程依賴關係方面的優勢，我們提出了 SEA 模組，增強了複雜脈絡語義資訊的抽取，從而改進了實體辨識和關係抽取。此外，為了解決現有方法在促進實體辨識和關係抽取之間資訊交換方面的低效率問題，我們提出了互動式融合表示模組。此模組採用交叉注意力，在任務之間進行雙向資訊交換，並透過 BiLSTM 進一步精煉特徵抽取。在我們的 CH-DDI 資料集和公開的 CoNLL04 資料集上的實驗結果表明，我們的模型展現出強大的泛化能力。在 CH-DDI 資料集上，我們的模型在實體辨識方面達到了 96.73% 的 F1 分數，在關係抽取方面達到了 78.43% 的 F1 分數。在 CoNLL04 資料集上，它在實體辨識方面達到了 89.54% 的準確度，在關係抽取方面達到了 71.64% 的準確度。
+摘要：分層可導航小世界 (HNSW) 演算法廣泛用於近似最近鄰居 (ANN) 搜尋，並利用可導航小世界圖形的原理。然而，它面臨一些限制。第一個是局部最佳化問題，這源自於演算法的貪婪搜尋策略，在每個步驟中僅根據鄰近度來選擇鄰居。這通常會導致群集斷線。第二個限制是，由於透過每一層的窮舉式遍歷，HNSW 常常無法在高維度資料集中達成對數複雜度。為了解決這些限制，我們提出了一種新的演算法，它可以減輕局部最佳化和群集斷線，同時提高建構速度，並維持推論速度。第一個組成部分是一個具有基於 LID 的插入機制的雙分支 HNSW 結構，它能從多個方向進行遍歷。這改善了異常值節點的擷取，增強了群集連通性，加速了建構速度，並降低了局部最小值的風險。第二個組成部分包含一種橋樑建構技術，它繞過了多餘的中間層，維持推論並彌補了雙分支結構所帶來的額外運算負擔。在各種基準和資料集上的實驗顯示，我們的演算法在準確度和速度上都優於原始的 HNSW。我們評估了電腦視覺 (CV) 和自然語言處理 (NLP) 中的六個資料集，顯示 NLP 中的召回率提高了 18%，CV 任務中提高了 30%，同時將建構時間縮短了 20%，並維持了推論速度。我們沒有在我們的演算法中觀察到任何取捨。消融研究顯示，基於 LID 的插入對效能的影響最大，其次是雙分支結構和橋樑建構組成部分。
 
-##### **From large language models to multimodal AI: A scoping review on the potential of generative AI in medicine**
-2502.09242v1 by Lukas Buess, Matthias Keicher, Nassir Navab, Andreas Maier, Soroosh Tayebi Arasteh
+##### **Comprehensive Modeling and Question Answering of Cancer Clinical Practice Guidelines using LLMs**
+2501.13984v1 by Bhumika Gupta, Pralaypati Ta, Keerthi Ram, Mohanasankar Sivaprakasam
 
-Generative artificial intelligence (AI) models, such as diffusion models and
-OpenAI's ChatGPT, are transforming medicine by enhancing diagnostic accuracy
-and automating clinical workflows. The field has advanced rapidly, evolving
-from text-only large language models for tasks such as clinical documentation
-and decision support to multimodal AI systems capable of integrating diverse
-data modalities, including imaging, text, and structured data, within a single
-model. The diverse landscape of these technologies, along with rising interest,
-highlights the need for a comprehensive review of their applications and
-potential. This scoping review explores the evolution of multimodal AI,
-highlighting its methods, applications, datasets, and evaluation in clinical
-settings. Adhering to PRISMA-ScR guidelines, we systematically queried PubMed,
-IEEE Xplore, and Web of Science, prioritizing recent studies published up to
-the end of 2024. After rigorous screening, 144 papers were included, revealing
-key trends and challenges in this dynamic field. Our findings underscore a
-shift from unimodal to multimodal approaches, driving innovations in diagnostic
-support, medical report generation, drug discovery, and conversational AI.
-However, critical challenges remain, including the integration of heterogeneous
-data types, improving model interpretability, addressing ethical concerns, and
-validating AI systems in real-world clinical settings. This review summarizes
-the current state of the art, identifies critical gaps, and provides insights
-to guide the development of scalable, trustworthy, and clinically impactful
-multimodal AI solutions in healthcare.
+The updated recommendations on diagnostic procedures and treatment pathways
+for a medical condition are documented as graphical flows in Clinical Practice
+Guidelines (CPGs). For effective use of the CPGs in helping medical
+professionals in the treatment decision process, it is necessary to fully
+capture the guideline knowledge, particularly the contexts and their
+relationships in the graph. While several existing works have utilized these
+guidelines to create rule bases for Clinical Decision Support Systems, limited
+work has been done toward directly capturing the full medical knowledge
+contained in CPGs. This work proposes an approach to create a contextually
+enriched, faithful digital representation of National Comprehensive Cancer
+Network (NCCN) Cancer CPGs in the form of graphs using automated extraction and
+node & relationship classification. We also implement semantic enrichment of
+the model by using Large Language Models (LLMs) for node classification,
+achieving an accuracy of 80.86% and 88.47% with zero-shot learning and few-shot
+learning, respectively. Additionally, we introduce a methodology for answering
+natural language questions with constraints to guideline text by leveraging
+LLMs to extract the relevant subgraph from the guideline knowledge base. By
+generating natural language answers based on subgraph paths and semantic
+information, we mitigate the risk of incorrect answers and hallucination
+associated with LLMs, ensuring factual accuracy in medical domain Question
+Answering.
 
-摘要：生成式人工智能 (AI) 模型，例如扩散模型和 OpenAI 的 ChatGPT，通过提高诊断准确性和自动化临床工作流程，正在改变医学领域。该领域已迅速发展，从用于临床文件编制和决策支持等任务的纯文本大型语言模型，发展到能够在单个模型中整合包括影像、文本和结构化数据在内的多种数据方式的多模态 AI 系统。这些技术的多样化格局以及日益增长的兴趣，凸显了全面审查其应用和潜力的必要性。本范围审查探讨了多模态 AI 的演变，重点介绍了其方法、应用、数据集和在临床环境中的评估。遵循 PRISMA-ScR 指南，我们系统地查询了 PubMed、IEEE Xplore 和 Web of Science，优先考虑截至 2024 年底发表的最新研究。经过严格筛选，纳入了 144 篇论文，揭示了这一充满活力的领域的趋势和挑战。我们的研究结果强调了从单模态方法向多模态方法的转变，推动了诊断支持、医疗报告生成、药物发现和会话式 AI 的创新。然而，关键挑战仍然存在，包括异构数据类型的整合、提高模型可解释性、解决伦理问题以及在现实世界的临床环境中验证 AI 系统。本综述总结了当前的最新技术，确定了关键差距，并提供了见解，以指导在医疗保健领域开发可扩展、可信赖且具有临床影响力的多模态 AI 解决方案。
+摘要：已更新的醫療狀況診斷程序和治療途徑建議，以臨床實務指南 (CPG) 中的圖形流程記錄。為了有效使用 CPG 協助醫療專業人員進行治療決策，必須完整擷取指南知識，特別是圖表中的脈絡及其關係。雖然現有許多研究已利用這些指南為臨床決策支援系統建立規則基礎，但直接擷取 CPG 中包含的完整醫療知識的工作卻有限。這項研究提出了一種方法，以自動化擷取和節點與關係分類的方式，建立脈絡豐富、忠實的國家綜合癌症網路 (NCCN) 癌症 CPG 圖形數位表示。我們也透過使用大型語言模型 (LLM) 進行節點分類，實作模型的語意豐富化，分別在零次學習和少次學習中達到 80.86% 和 88.47% 的準確度。此外，我們引進了一種方法，透過運用 LLM 從指南知識庫中擷取相關子圖，來回答具有指南文字限制的自然語言問題。透過根據子圖路徑和語意資訊產生自然語言答案，我們降低了與 LLM 相關的錯誤答案和幻覺風險，確保了醫療領域問題解答中的事實準確性。
 
-##### **Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**
-2502.09218v1 by Flavio Bertini, Alessandro Dal Palù, Federica Zaglio, Francesco Fabiano, Andrea Formisano
+##### **LLM-Assisted Knowledge Graph Completion for Curriculum and Domain Modelling in Personalized Higher Education Recommendations**
+2501.12300v1 by Hasan Abu-Rasheed, Constance Jumbo, Rashed Al Amin, Christian Weber, Veit Wiese, Roman Obermaisser, Madjid Fathi
 
-This paper presents a complete explainable system that interprets a set of
-data, abstracts the underlying features and describes them in a natural
-language of choice. The system relies on two crucial stages: (i) identifying
-emerging properties from data and transforming them into abstract concepts, and
-(ii) converting these concepts into natural language. Despite the impressive
-natural language generation capabilities demonstrated by Large Language Models,
-their statistical nature and the intricacy of their internal mechanism still
-force us to employ these techniques as black boxes, forgoing trustworthiness.
-Developing an explainable pipeline for data interpretation would allow
-facilitating its use in safety-critical environments like processing medical
-information and allowing non-experts and visually impaired people to access
-narrated information. To this end, we believe that the fields of knowledge
-representation and automated reasoning research could present a valid
-alternative. Expanding on prior research that tackled the first stage (i), we
-focus on the second stage, named Concept2Text. Being explainable, data
-translation is easily modeled through logic-based rules, once again emphasizing
-the role of declarative programming in achieving AI explainability. This paper
-explores a Prolog/CLP-based rewriting system to interpret concepts-articulated
-in terms of classes and relations, plus common knowledge-derived from a generic
-ontology, generating natural language text. Its main features include
-hierarchical tree rewritings, modular multilingual generation, support for
-equivalent variants across semantic, grammar, and lexical levels, and a
-transparent rule-based system. We outline the architecture and demonstrate its
-flexibility through some examples capable of generating numerous diverse and
-equivalent rewritings based on the input concept.
+While learning personalization offers great potential for learners, modern
+practices in higher education require a deeper consideration of domain models
+and learning contexts, to develop effective personalization algorithms. This
+paper introduces an innovative approach to higher education curriculum
+modelling that utilizes large language models (LLMs) for knowledge graph (KG)
+completion, with the goal of creating personalized learning-path
+recommendations. Our research focuses on modelling university subjects and
+linking their topics to corresponding domain models, enabling the integration
+of learning modules from different faculties and institutions in the student's
+learning path. Central to our approach is a collaborative process, where LLMs
+assist human experts in extracting high-quality, fine-grained topics from
+lecture materials. We develop a domain, curriculum, and user models for
+university modules and stakeholders. We implement this model to create the KG
+from two study modules: Embedded Systems and Development of Embedded Systems
+Using FPGA. The resulting KG structures the curriculum and links it to the
+domain models. We evaluate our approach through qualitative expert feedback and
+quantitative graph quality metrics. Domain experts validated the relevance and
+accuracy of the model, while the graph quality metrics measured the structural
+properties of our KG. Our results show that the LLM-assisted graph completion
+approach enhances the ability to connect related courses across disciplines to
+personalize the learning experience. Expert feedback also showed high
+acceptance of the proposed collaborative approach for concept extraction and
+classification.
 
-摘要：<paragraph>這篇論文提出了一個完整的可解釋系統，它可以解釋一組資料，抽象出基礎特徵，並以選擇的自然語言描述它們。系統依賴兩個關鍵階段：(i) 從資料中識別新興屬性，並將它們轉換為抽象概念，以及 (ii) 將這些概念轉換為自然語言。儘管大型語言模型展示了令人印象深刻的自然語言生成能力，但它們的統計性質和內部機制的複雜性仍然迫使我們將這些技術用作黑盒子，放棄可信度。開發一個可解釋的資料解釋管道將有助於促進在安全關鍵環境中使用它，例如處理醫療資訊，並允許非專家和視障人士存取敘述資訊。為此，我們相信知識表示和自動推理研究領域可以提出一個有效的替代方案。在擴展解決第一階段 (i) 的先前研究的基礎上，我們專注於第二階段，稱為 Concept2Text。由於具有可解釋性，資料翻譯很容易透過基於邏輯的規則建模，再次強調宣告式程式設計在實現 AI 可解釋性中的作用。本文探討了一個基於 Prolog/CLP 的重寫系統，以解釋概念，這些概念以類別和關係的形式表達，再加上從通用本体衍生的常識，產生自然語言文字。它的主要特點包括階層樹重寫、模組化多語言生成、支援語義、語法和詞彙層面的等效變體，以及一個透明的基於規則的系統。我們概述了架構，並透過一些範例展示了它的靈活性，這些範例能夠根據輸入概念生成許多不同的等效重寫。</paragraph>
+摘要：<paragraph>在學習個人化提供學習者巨大潛力的同時，高等教育中的現代實務需要更深入地考慮領域模型和學習情境，以開發有效的個人化演算法。本文介紹了一種創新的高等教育課程建模方法，該方法利用大型語言模型 (LLM) 來完成知識圖譜 (KG)，目的是建立個人化的學習路徑建議。我們的研究重點在於建模大學科目，並將它們的主題連結到對應的領域模型，從而能夠將來自不同院系和機構的學習模組整合到學生的學習路徑中。我們的做法核心是一個協作流程，其中 LLM 協助人類專家從講義材料中萃取高品質、細緻的主題。我們為大學模組和利害關係人開發了領域、課程和使用者模型。我們實作這個模型，從兩個研究模組建立 KG：嵌入式系統和使用 FPGA 的嵌入式系統開發。產生的 KG 建構了課程並將其連結到領域模型。我們透過定性專家回饋和定量圖形品質指標來評估我們的做法。領域專家驗證了模型的相關性和準確性，而圖形品質指標則測量了我們 KG 的結構特性。我們的結果顯示，LLM 輔助的圖形完成方法增強了跨學科連結相關課程的能力，以個人化學習體驗。專家回饋也顯示高度接受所提出的協作方法，用於概念萃取和分類。</paragraph>
 
-##### **Logical Lease Litigation: Prolog and LLMs for Rental Law Compliance in New York**
-2502.09204v1 by Sanskar Sehgal, Yanhong A. Liu
+##### **Divide-Then-Aggregate: An Efficient Tool Learning Method via Parallel Tool Invocation**
+2501.12432v1 by Dongsheng Zhu, Weixian Shi, Zhengliang Shi, Zhaochun Ren, Shuaiqiang Wang, Lingyong Yan, Dawei Yin
 
-Legal cases require careful logical reasoning following the laws, whereas
-interactions with non- technical users must be in natural language. As an
-application combining logical reasoning using Prolog and natural language
-processing using large language models (LLMs), this paper presents a novel
-approach and system, LogicLease, to automate the analysis of landlord-tenant
-legal cases in the state of New York. LogicLease determines compliance with
-relevant legal requirements by analyzing case descriptions and citing all
-relevant laws. It leverages LLMs for information extraction and Prolog for
-legal reasoning. By separating information extraction from legal reasoning,
-LogicLease achieves greater transparency and control over the legal logic
-applied to each case. We evaluate the accuracy, efficiency, and robustness of
-LogicLease through a series of tests, achieving 100% accuracy and an average
-processing time of 2.57 seconds. LogicLease presents advantages over
-state-of-the-art LLM- based legal analysis systems by providing clear,
-step-by-step reasoning, citing specific laws, and distinguishing itself by its
-ability to avoid hallucinations - a common issue in LLMs.
+Although current Large Language Models (LLMs) exhibit impressive
+capabilities, performing complex real-world tasks still requires tool learning.
+Mainstream methods, such as CoT/ReAct, rely on step-by-step tool invocation to
+interact with external environments, but they are limited in perceptual scope
+and lack adequate task-planning capability. To address these limitations, other
+studies introduce the first Search-based Decision Tree (DFSDT), which still
+suffers from the high computational cost. In this paper, we introduce a novel
+parallel tool invocation paradigm, DTA-Llama (Divide-Then-Aggregate Llama).
+First, we transform traditional tree-based tool search paths into Directed
+Acyclic Graph (DAG) structure, generating a high-quality parallel tool
+invocation dataset. The DTA-Llama is then trained on the dataset to learn to
+iteratively divide the current task into several parallel tool invocation
+sub-tasks and aggregate the invocation results to decide the next actions.
+Furthermore, we introduce an efficient inference framework inspired by the
+Process/Threads mechanism when applying the DTA-Llama to practical tasks.
+Experimental results show that our approach substantially enhances task
+performance while reducing token consumption and inference time. Llama2-7B,
+using our method, is comparable to the official parallel function calling
+method of GPT-3.5. The relevant code, dataset, and model weights are available
+at https://corn0205.github.io/
 
-摘要：法律案件需要遵循法律进行谨慎的逻辑推理，而与非技术用户的互动必须使用自然语言。作为结合使用 Prolog 进行逻辑推理和使用大型语言模型 (LLM) 进行自然语言处理的应用程序，本文提出了一种新颖的方法和系统 LogicLease，以自动分析纽约州的房东与租户法律案件。LogicLease 通过分析案例描述并引用所有相关法律来确定是否符合相关法律要求。它利用 LLM 进行信息提取，并利用 Prolog 进行法律推理。通过将信息提取与法律推理分开，LogicLease 实现了对应用于每个案例的法律逻辑的更高透明度和控制力。我们通过一系列测试评估了 LogicLease 的准确性、效率和鲁棒性，实现了 100% 的准确性和 2.57 秒的平均处理时间。LogicLease 通过提供清晰、分步的推理，引用具体法律，并以其避免幻觉的能力而区别于最先进的基于 LLM 的法律分析系统，从而显示出优势——这是 LLM 中的常见问题。
+摘要：儘管目前的大型語言模型 (LLM) 展現出令人印象深刻的能力，但執行複雜的真實世界任務仍需要工具學習。主流方法（例如 CoT/ReAct）依賴逐步工具呼叫與外部環境互動，但它們的感知範圍有限，且缺乏足夠的任務規劃能力。為了解決這些限制，其他研究引入了第一個基於搜尋的決策樹 (DFSDT)，但仍有很高的運算成本。在本文中，我們介紹了一種新穎的平行工具呼叫範例，DTA-Llama（分而合之 Llama）。首先，我們將傳統的基於樹的工具搜尋路徑轉換為有向無環圖 (DAG) 結構，產生高品質的平行工具呼叫資料集。然後在資料集上訓練 DTA-Llama，學習反覆將當前任務分成幾個平行工具呼叫子任務，並彙總呼叫結果以決定後續動作。此外，我們在將 DTA-Llama 應用於實際任務時，引入了一個受 Process/Threads 機制啟發的高效推論框架。實驗結果表明，我們的做法大幅提升了任務效能，同時減少了符號消耗和推論時間。使用我們方法的 Llama2-7B，可與 GPT-3.5 的官方平行函式呼叫方法相媲美。相關程式碼、資料集和模型權重可在 https://corn0205.github.io/ 取得
 
-##### **Two-Stage Representation Learning for Analyzing Movement Behavior Dynamics in People Living with Dementia**
-2502.09173v1 by Jin Cui, Alexander Capstick, Payam Barnaghi, Gregory Scott
+##### **InsTALL: Context-aware Instructional Task Assistance with Multi-modal Large Language Models**
+2501.12231v1 by Pha Nguyen, Sailik Sengupta, Girik Malik, Arshit Gupta, Bonan Min
 
-In remote healthcare monitoring, time series representation learning reveals
-critical patient behavior patterns from high-frequency data. This study
-analyzes home activity data from individuals living with dementia by proposing
-a two-stage, self-supervised learning approach tailored to uncover low-rank
-structures. The first stage converts time-series activities into text sequences
-encoded by a pre-trained language model, providing a rich, high-dimensional
-latent state space using a PageRank-based method. This PageRank vector captures
-latent state transitions, effectively compressing complex behaviour data into a
-succinct form that enhances interpretability. This low-rank representation not
-only enhances model interpretability but also facilitates clustering and
-transition analysis, revealing key behavioral patterns correlated with
-clinicalmetrics such as MMSE and ADAS-COG scores. Our findings demonstrate the
-framework's potential in supporting cognitive status prediction, personalized
-care interventions, and large-scale health monitoring.
+The improved competence of generative models can help building multi-modal
+virtual assistants that leverage modalities beyond language. By observing
+humans performing multi-step tasks, one can build assistants that have
+situational awareness of actions and tasks being performed, enabling them to
+cater assistance based on this understanding. In this paper, we develop a
+Context-aware Instructional Task Assistant with Multi-modal Large Language
+Models (InsTALL) that leverages an online visual stream (e.g. a user's screen
+share or video recording) and responds in real-time to user queries related to
+the task at hand. To enable useful assistance, InsTALL 1) trains a multi-modal
+model on task videos and paired textual data, and 2) automatically extracts
+task graph from video data and leverages it at training and inference time. We
+show InsTALL achieves state-of-the-art performance across proposed sub-tasks
+considered for multimodal activity understanding -- task recognition (TR),
+action recognition (AR), next action prediction (AP), and plan prediction (PP)
+-- and outperforms existing baselines on two novel sub-tasks related to
+automatic error identification.
 
-摘要：在遠程醫療監控中，時間序列表示學習揭示了高頻率數據中的關鍵患者行為模式。本研究通過提出一個兩階段、自我監督的學習方法來分析痴呆症患者的家庭活動數據，該方法專門用於發現低秩結構。第一階段將時間序列活動轉換為由預訓練語言模型編碼的文本序列，使用基於 PageRank 的方法提供了一個豐富、高維的潛在狀態空間。此 PageRank 向量捕獲潛在狀態轉換，有效地將複雜的行為數據壓縮成簡潔的形式，從而增強了解力。此低秩表示不僅增強了模型的可解釋性，還促進了聚類和轉換分析，揭示了與臨床指標（例如 MMSE 和 ADAS-COG 分數）相關的關鍵行為模式。我們的研究結果證明了該框架在支持認知狀態預測、個性化護理干預和大型健康監控方面的潛力。
+摘要：生成模型能力的提升有助于构建利用语言之外的多模态虚拟助手。通过观察人类执行多步骤任务，可以构建对正在执行的动作和任务有情境感知的助手，使他们能够根据这种理解提供帮助。在本文中，我们开发了一个具有多模态大语言模型的上下文感知指令任务助手 (InsTALL)，该助手利用在线视觉流（例如用户的屏幕共享或视频录制），并实时响应与手头任务相关的用户查询。为了提供有用的帮助，InsTALL 1) 在任务视频和配对文本数据上训练多模态模型，以及 2) 从视频数据中自动提取任务图，并在训练和推理时间利用它。我们展示了 InsTALL 在考虑用于多模态活动理解的提议子任务中实现了最先进的性能——任务识别 (TR)、动作识别 (AR)、下一个动作预测 (AP) 和计划预测 (PP)——并且在与自动错误识别相关的两个新子任务上优于现有的基准。
 
-##### **HistoSmith: Single-Stage Histology Image-Label Generation via Conditional Latent Diffusion for Enhanced Cell Segmentation and Classification**
-2502.08754v1 by Valentina Vadori, Jean-Marie Graïc, Antonella Peruffo, Livio Finos, Ujwala Kiran Chaudhari, Enrico Grisan
+##### **Leveraging Graph Structures and Large Language Models for End-to-End Synthetic Task-Oriented Dialogues**
+2501.11977v1 by Maya Medjad, Hugo Imbert, Bruno Yun, Raphaël Szymocha, Frédéric Armetta
 
-Precise segmentation and classification of cell instances are vital for
-analyzing the tissue microenvironment in histology images, supporting medical
-diagnosis, prognosis, treatment planning, and studies of brain
-cytoarchitecture. However, the creation of high-quality annotated datasets for
-training remains a major challenge. This study introduces a novel single-stage
-approach (HistoSmith) for generating image-label pairs to augment histology
-datasets. Unlike state-of-the-art methods that utilize diffusion models with
-separate components for label and image generation, our approach employs a
-latent diffusion model to learn the joint distribution of cellular layouts,
-classification masks, and histology images. This model enables tailored data
-generation by conditioning on user-defined parameters such as cell types,
-quantities, and tissue types. Trained on the Conic H&E histopathology dataset
-and the Nissl-stained CytoDArk0 dataset, the model generates realistic and
-diverse labeled samples. Experimental results demonstrate improvements in cell
-instance segmentation and classification, particularly for underrepresented
-cell types like neutrophils in the Conic dataset. These findings underscore the
-potential of our approach to address data scarcity challenges.
+Training task-oriented dialogue systems is both costly and time-consuming,
+due to the need for high-quality datasets encompassing diverse intents.
+Traditional methods depend on extensive human annotation, while recent
+advancements leverage large language models (LLMs) to generate synthetic data.
+However, these approaches often require custom prompts or code, limiting
+accessibility for non-technical users. We introduce GraphTOD, an end-to-end
+framework that simplifies the generation of task-oriented dialogues. Users can
+create dialogues by specifying transition graphs in JSON format. Our evaluation
+demonstrates that GraphTOD generates high-quality dialogues across various
+domains, significantly lowering the cost and complexity of dataset creation.
 
-摘要：精確的細胞實例分割和分類對於分析組織學影像中的組織微環境、支援醫療診斷、預後、治療規劃和腦部細胞結構研究至關重要。然而，建立用於訓練的高品質標註資料集仍然是一項重大挑戰。本研究提出了一種新穎的單階段方法 (HistoSmith)，用於產生影像標籤對，以擴充組織學資料集。與利用擴散模型並將標籤和影像產生分開的組成部分的現有技術不同，我們的做法採用潛在擴散模型來學習細胞佈局、分類遮罩和組織學影像的聯合分佈。此模型能透過調整使用者定義的參數（例如細胞類型、數量和組織類型）來進行客製化資料產生。在 Conic H&E 細胞病理學資料集和 Nissl 染色的 CytoDArk0 資料集上訓練後，此模型產生逼真且多樣化的標籤樣本。實驗結果顯示細胞實例分割和分類有顯著進步，特別是對於 Conic 資料集中代表性不足的細胞類型，例如中性球。這些發現強調了我們的方法在解決資料稀少性挑戰方面的潛力。
+摘要：訓練任務導向對話系統既昂貴又耗時，
+因為需要包含各種意圖的高品質資料集。
+傳統方法依賴於廣泛的人工標註，而最近
+的進展利用大型語言模型 (LLM) 來產生合成資料。
+然而，這些方法通常需要自訂提示或程式碼，限制
+非技術使用者的可及性。我們介紹 GraphTOD，一個端對端的
+架構，簡化了任務導向對話的產生。使用者可以
+透過指定 JSON 格式的轉換圖表來建立對話。我們的評估
+證明 GraphTOD 在各種領域產生高品質對話，顯著降低資料集建立的成本和複雜性。
 
-##### **Brain Latent Progression: Individual-based Spatiotemporal Disease Progression on 3D Brain MRIs via Latent Diffusion**
-2502.08560v1 by Lemuel Puglisi, Daniel C. Alexander, Daniele Ravì
+##### **Bridging Visualization and Optimization: Multimodal Large Language Models on Graph-Structured Combinatorial Optimization**
+2501.11968v1 by Jie Zhao, Kang Hao Cheong, Witold Pedrycz
 
-The growing availability of longitudinal Magnetic Resonance Imaging (MRI)
-datasets has facilitated Artificial Intelligence (AI)-driven modeling of
-disease progression, making it possible to predict future medical scans for
-individual patients. However, despite significant advancements in AI, current
-methods continue to face challenges including achieving patient-specific
-individualization, ensuring spatiotemporal consistency, efficiently utilizing
-longitudinal data, and managing the substantial memory demands of 3D scans. To
-address these challenges, we propose Brain Latent Progression (BrLP), a novel
-spatiotemporal model designed to predict individual-level disease progression
-in 3D brain MRIs. The key contributions in BrLP are fourfold: (i) it operates
-in a small latent space, mitigating the computational challenges posed by
-high-dimensional imaging data; (ii) it explicitly integrates subject metadata
-to enhance the individualization of predictions; (iii) it incorporates prior
-knowledge of disease dynamics through an auxiliary model, facilitating the
-integration of longitudinal data; and (iv) it introduces the Latent Average
-Stabilization (LAS) algorithm, which (a) enforces spatiotemporal consistency in
-the predicted progression at inference time and (b) allows us to derive a
-measure of the uncertainty for the prediction. We train and evaluate BrLP on
-11,730 T1-weighted (T1w) brain MRIs from 2,805 subjects and validate its
-generalizability on an external test set comprising 2,257 MRIs from 962
-subjects. Our experiments compare BrLP-generated MRI scans with real follow-up
-MRIs, demonstrating state-of-the-art accuracy compared to existing methods. The
-code is publicly available at: https://github.com/LemuelPuglisi/BrLP.
+Graph-structured combinatorial challenges are inherently difficult due to
+their nonlinear and intricate nature, often rendering traditional computational
+methods ineffective or expensive. However, these challenges can be more
+naturally tackled by humans through visual representations that harness our
+innate ability for spatial reasoning. In this study, we propose transforming
+graphs into images to preserve their higher-order structural features
+accurately, revolutionizing the representation used in solving graph-structured
+combinatorial tasks. This approach allows machines to emulate human-like
+processing in addressing complex combinatorial challenges. By combining the
+innovative paradigm powered by multimodal large language models (MLLMs) with
+simple search techniques, we aim to develop a novel and effective framework for
+tackling such problems. Our investigation into MLLMs spanned a variety of
+graph-based tasks, from combinatorial problems like influence maximization to
+sequential decision-making in network dismantling, as well as addressing six
+fundamental graph-related issues. Our findings demonstrate that MLLMs exhibit
+exceptional spatial intelligence and a distinctive capability for handling
+these problems, significantly advancing the potential for machines to
+comprehend and analyze graph-structured data with a depth and intuition akin to
+human cognition. These results also imply that integrating MLLMs with simple
+optimization strategies could form a novel and efficient approach for
+navigating graph-structured combinatorial challenges without complex
+derivations, computationally demanding training and fine-tuning.
 
-摘要：隨著縱向磁共振影像 (MRI) 資料集的日益普及，已促進人工智慧 (AI) 驅動的疾病進程建模，讓預測個別患者的未來醫學掃描成為可能。然而，儘管 AI 有顯著進展，目前的技術仍面臨挑戰，包括實現患者特定的個別化、確保時空一致性、有效利用縱向資料，以及管理 3D 掃描的大量記憶體需求。為了應對這些挑戰，我們提出腦潛在進程 (BrLP)，這是一種新穎的時空模型，旨在預測 3D 腦部 MRI 中的個人層級疾病進程。BrLP 的主要貢獻有四個：(i) 它在一個小的潛在空間中運作，減輕了高維度影像資料帶來的計算挑戰；(ii) 它明確整合受試者的元資料，以增強預測的個別化；(iii) 它透過輔助模型納入疾病動態的先驗知識，促進縱向資料的整合；(iv) 它引入了潛在平均穩定化 (LAS) 演算法，該演算法 (a) 在推論時強制預測進程中的時空一致性，(b) 讓我們能夠推導預測的不確定性測量。我們對來自 2,805 名受試者的 11,730 個 T1 加權 (T1w) 腦部 MRI 進行 BrLP 訓練和評估，並在包含來自 962 名受試者的 2,257 個 MRI 的外部測試集上驗證其概括性。我們的實驗將 BrLP 生成的 MRI 掃描與實際追蹤 MRI 進行比較，與現有方法相比，展示了最先進的準確性。程式碼已公開於：https://github.com/LemuelPuglisi/BrLP。
+摘要：圖形結構的組合挑戰本質上很困難，因為它們的非線性和複雜性，通常會使傳統的計算方法無效或昂貴。然而，人類可以透過利用我們天生的空間推理能力的視覺表徵，更自然地應對這些挑戰。在本研究中，我們建議將圖形轉換為影像，以準確保留它們的高階結構特徵，從而革新用於解決圖形結構組合任務的表徵。這種方法允許機器在解決複雜的組合挑戰時模擬類人的處理。透過結合由多模態大型語言模型 (MLLM) 提供動力的創新範例與簡單的搜尋技術，我們旨在為解決此類問題開發一個新穎且有效的架構。我們對 MLLM 的研究涵蓋了各種基於圖形的任務，從組合問題（如影響力最大化）到網路拆除中的順序決策制定，以及解決六個基本的圖形相關問題。我們的研究結果表明，MLLM 表現出非凡的空間智能和處理這些問題的獨特能力，顯著提升了機器以類似人類認知的深度和直覺來理解和分析圖形結構資料的潛力。這些結果還暗示，將 MLLM 與簡單的最佳化策略整合在一起，可以形成一種新穎且有效的方法，用於在沒有複雜推導、計算需求量大的訓練和微調的情況下應對圖形結構的組合挑戰。
 
-##### **Representation Learning to Advance Multi-institutional Studies with Electronic Health Record Data**
-2502.08547v1 by Doudou Zhou, Han Tong, Linshanshan Wang, Suqi Liu, Xin Xiong, Ziming Gan, Romain Griffier, Boris Hejblum, Yun-Chung Liu, Chuan Hong, Clara-Lea Bonzel, Tianrun Cai, Kevin Pan, Yuk-Lam Ho, Lauren Costa, Vidul A. Panickan, J. Michael Gaziano, Kenneth Mandl, Vianney Jouhet, Rodolphe Thiebaut, Zongqi Xia, Kelly Cho, Katherine Liao, Tianxi Cai
+##### **A Survey of Graph Retrieval-Augmented Generation for Customized Large Language Models**
+2501.13958v1 by Qinggang Zhang, Shengyuan Chen, Yuanchen Bei, Zheng Yuan, Huachi Zhou, Zijin Hong, Junnan Dong, Hao Chen, Yi Chang, Xiao Huang
 
-The adoption of EHRs has expanded opportunities to leverage data-driven
-algorithms in clinical care and research. A major bottleneck in effectively
-conducting multi-institutional EHR studies is the data heterogeneity across
-systems with numerous codes that either do not exist or represent different
-clinical concepts across institutions. The need for data privacy further limits
-the feasibility of including multi-institutional patient-level data required to
-study similarities and differences across patient subgroups. To address these
-challenges, we developed the GAME algorithm. Tested and validated across 7
-institutions and 2 languages, GAME integrates data in several levels: (1) at
-the institutional level with knowledge graphs to establish relationships
-between codes and existing knowledge sources, providing the medical context for
-standard codes and their relationship to each other; (2) between institutions,
-leveraging language models to determine the relationships between
-institution-specific codes with established standard codes; and (3) quantifying
-the strength of the relationships between codes using a graph attention
-network. Jointly trained embeddings are created using transfer and federated
-learning to preserve data privacy. In this study, we demonstrate the
-applicability of GAME in selecting relevant features as inputs for AI-driven
-algorithms in a range of conditions, e.g., heart failure, rheumatoid arthritis.
-We then highlight the application of GAME harmonized multi-institutional EHR
-data in a study of Alzheimer's disease outcomes and suicide risk among patients
-with mental health disorders, without sharing patient-level data outside
-individual institutions.
+Large language models (LLMs) have demonstrated remarkable capabilities in a
+wide range of tasks, yet their application to specialized domains remains
+challenging due to the need for deep expertise. Retrieval-augmented generation
+(RAG) has emerged as a promising solution to customize LLMs for professional
+fields by seamlessly integrating external knowledge bases, enabling real-time
+access to domain-specific expertise during inference. Despite its potential,
+traditional RAG systems, based on flat text retrieval, face three critical
+challenges: (i) complex query understanding in professional contexts, (ii)
+difficulties in knowledge integration across distributed sources, and (iii)
+system efficiency bottlenecks at scale. This survey presents a systematic
+analysis of Graph-based Retrieval-Augmented Generation (GraphRAG), a new
+paradigm that revolutionizes domain-specific LLM applications. GraphRAG
+addresses traditional RAG limitations through three key innovations: (i)
+graph-structured knowledge representation that explicitly captures entity
+relationships and domain hierarchies, (ii) efficient graph-based retrieval
+techniques that enable context-preserving knowledge retrieval with multihop
+reasoning ability, and (iii) structure-aware knowledge integration algorithms
+that leverage retrieved knowledge for accurate and logical coherent generation
+of LLMs. In this survey, we systematically analyze the technical foundations of
+GraphRAG and examine current implementations across various professional
+domains, identifying key technical challenges and promising research
+directions. All the related resources of GraphRAG, including research papers,
+open-source data, and projects, are collected for the community in
+\textcolor{blue}{\url{https://github.com/DEEP-PolyU/Awesome-GraphRAG}}.
 
-摘要：電子健康紀錄的採用擴大了在臨床照護和研究中利用資料驅動演算法的機會。在有效進行多機構電子健康紀錄研究時，一個主要的瓶頸是系統間資料異質性，其中有許多代碼在機構間不存在或表示不同的臨床概念。資料隱私的需求進一步限制了納入多機構患者層級資料的可行性，而這些資料對於研究患者亞群之間的相似性和差異性是必要的。為了應對這些挑戰，我們開發了 GAME 演算法。GAME 已在 7 個機構和 2 種語言中進行測試和驗證，它整合了多個層級的資料：(1) 在機構層級，使用知識圖表來建立代碼和現有知識來源之間的關係，為標準代碼及其彼此之間的關係提供醫療背景；(2) 在機構之間，利用語言模型來確定機構特定代碼與已建立的標準代碼之間的關係；(3) 使用圖形注意網路量化代碼之間關係的強度。使用遷移和聯合學習建立聯合訓練的嵌入，以保護資料隱私。在本研究中，我們展示了 GAME 在選擇相關特徵作為 AI 驅動演算法輸入時的適用性，適用於各種情況，例如心臟衰竭、類風濕性關節炎。然後，我們重點介紹了 GAME 和諧化多機構電子健康紀錄資料在阿茲海默症疾病結果和精神疾病患者自殺風險研究中的應用，而無需在個別機構之外共享患者層級資料。
+摘要：大型語言模型 (LLM) 已在各種任務中展現出非凡的能力，但由於需要深入的專業知識，因此將其應用於專業領域仍具有挑戰性。檢索增強生成 (RAG) 已成為一種有前途的解決方案，可通過無縫整合外部知識庫來客製化 LLM 以適用於專業領域，從而在推理過程中即時存取特定領域的專業知識。儘管有其潛力，但基於平面文字檢索的傳統 RAG 系統面臨三項關鍵挑戰：(i) 在專業情境中進行複雜的查詢理解，(ii) 難以整合分散來源的知識，以及 (iii) 系統效率瓶頸會隨著規模擴大而產生。本調查系統性地分析了圖形化檢索增強生成 (GraphRAG) 的技術基礎，GraphRAG 是一個新的典範，它徹底改變了特定領域的 LLM 應用。GraphRAG 透過三項關鍵創新來解決傳統 RAG 的限制：(i) 圖形結構化的知識表述，明確擷取實體關係和領域階層，(ii) 有效的圖形化檢索技術，可進行保留脈絡的知識檢索，並具備多跳推理能力，以及 (iii) 結構感知知識整合演算法，可利用檢索到的知識來進行 LLM 的準確且邏輯一致的生成。在本調查中，我們系統性地分析了 GraphRAG 的技術基礎，並檢視了在各種專業領域中的現有實作，找出關鍵技術挑戰和有前景的研究方向。所有 GraphRAG 的相關資源，包括研究論文、開放原始碼資料和專案，都已在 \textcolor{blue}{\url{https://github.com/DEEP-PolyU/Awesome-GraphRAG}} 中為社群收集。
 
-##### **EEG Artifact Detection and Correction with Deep Autoencoders**
-2502.08686v1 by David Aquilué-Llorens, Aureli Soria-Frisch
+##### **Network-informed Prompt Engineering against Organized Astroturf Campaigns under Extreme Class Imbalance**
+2501.11849v2 by Nikos Kanakaris, Heng Ping, Xiongye Xiao, Nesreen K. Ahmed, Luca Luceri, Emilio Ferrara, Paul Bogdan
 
-EEG signals convey important information about brain activity both in healthy
-and pathological conditions. However, they are inherently noisy, which poses
-significant challenges for accurate analysis and interpretation. Traditional
-EEG artifact removal methods, while effective, often require extensive expert
-intervention. This study presents LSTEEG, a novel LSTM-based autoencoder
-designed for the detection and correction of artifacts in EEG signals.
-Leveraging deep learning, particularly LSTM layers, LSTEEG captures non-linear
-dependencies in sequential EEG data. LSTEEG demonstrates superior performance
-in both artifact detection and correction tasks compared to other
-state-of-the-art convolutional autoencoders. Our methodology enhances the
-interpretability and utility of the autoencoder's latent space, enabling
-data-driven automated artefact removal in EEG its application in downstream
-tasks. This research advances the field of efficient and accurate multi-channel
-EEG preprocessing, and promotes the implementation and usage of automated EEG
-analysis pipelines for brain health applications.
+Detecting organized political campaigns is of paramount importance in
+fighting against disinformation on social media. Existing approaches for the
+identification of such organized actions employ techniques mostly from network
+science, graph machine learning and natural language processing. Their ultimate
+goal is to analyze the relationships and interactions (e.g. re-posting) among
+users and the textual similarities of their posts. Despite their effectiveness
+in recognizing astroturf campaigns, these methods face significant challenges,
+notably the class imbalance in available training datasets. To mitigate this
+issue, recent methods usually resort to data augmentation or increasing the
+number of positive samples, which may not always be feasible or sufficient in
+real-world settings. Following a different path, in this paper, we propose a
+novel framework for identifying astroturf campaigns based solely on large
+language models (LLMs), introducing a Balanced Retrieval-Augmented Generation
+(Balanced RAG) component. Our approach first gives both textual information
+concerning the posts (in our case tweets) and the user interactions of the
+social network as input to a language model. Then, through prompt engineering
+and the proposed Balanced RAG method, it effectively detects coordinated
+disinformation campaigns on X (Twitter). The proposed framework does not
+require any training or fine-tuning of the language model. Instead, by
+strategically harnessing the strengths of prompt engineering and Balanced RAG,
+it facilitates LLMs to overcome the effects of class imbalance and effectively
+identify coordinated political campaigns. The experimental results demonstrate
+that by incorporating the proposed prompt engineering and Balanced RAG methods,
+our framework outperforms the traditional graph-based baselines, achieving
+2x-3x improvements in terms of precision, recall and F1 scores.
 
-摘要：腦電圖訊號傳達了關於大腦活動的重要資訊，無論是在健康或病理狀況下。然而，它們本質上是有雜訊的，這對準確的分析和解釋構成了重大的挑戰。傳統的腦電圖人工製品移除方法雖然有效，但通常需要大量的專家介入。本研究提出 LSTEEG，一種新穎的基於 LSTM 的自動編碼器，用於偵測和校正腦電圖訊號中的人工製品。利用深度學習，特別是 LSTM 層，LSTEEG 捕捉序列腦電圖資料中的非線性依賴性。與其他最先進的卷積自動編碼器相比，LSTEEG 在人工製品偵測和校正任務中都展現出優異的效能。我們的做法增強了自動編碼器潛在空間的可解釋性和實用性，讓資料驅動的自動人工製品移除得以應用於腦電圖的下游任務。這項研究推動了高效且準確的多通道腦電圖前處理領域，並促進了自動腦電圖分析管線在腦部健康應用中的實作和使用。
+摘要：<paragraph>在社交媒體上對抗錯誤資訊，偵測有組織的政治宣傳活動至關重要。現有的此類有組織行動識別方法，大多採用網路科學、圖形機器學習和自然語言處理的技術。它們的最終目標是分析使用者之間的關係和互動（例如轉發），以及他們貼文的文字相似性。儘管這些方法在辨識草根運動宣傳活動方面很有效，但它們面臨嚴峻的挑戰，特別是可用訓練資料集中的類別不平衡。為了減輕這個問題，最近的方法通常訴諸於資料擴充或增加正向樣本數量，但在現實世界中可能並非總是可行或足夠。本文採取不同的途徑，我們提出了一個基於大型語言模型 (LLM) 的辨識草根運動宣傳活動的新架構，並引入了平衡檢索擴充產生 (Balanced RAG) 組件。我們的做法首先將有關貼文（在我們的案例中是推文）的文字資訊和社交網路的使用者互動作為輸入，輸入到語言模型中。然後，透過提示工程和提出的平衡檢索擴充產生方法，它有效地偵測 X（Twitter）上協調的不實資訊宣傳活動。提出的架構不需要任何語言模型的訓練或微調。相反地，透過策略性地利用提示工程和平衡檢索擴充產生方法的優勢，它使大型語言模型能夠克服類別不平衡的影響，並有效地識別協調的政治宣傳活動。實驗結果證明，透過整合提出的提示工程和平衡檢索擴充產生方法，我們的架構優於傳統的基於圖形的基準，在精確度、召回率和 F1 分數方面獲得 2x-3x 的改進。</paragraph>
 
-##### **SycEval: Evaluating LLM Sycophancy**
-2502.08177v1 by Aaron Fanous, Jacob Goldberg, Ank A. Agarwal, Joanna Lin, Anson Zhou, Roxana Daneshjou, Sanmi Koyejo
+##### **Large Language Models Meet Graph Neural Networks for Text-Numeric Graph Reasoning**
+2501.16361v1 by Haoran Song, Jiarui Feng, Guangfu Li, Michael Province, Philip Payne, Yixin Chen, Fuhai Li
 
-Large language models (LLMs) are increasingly applied in educational,
-clinical, and professional settings, but their tendency for sycophancy --
-prioritizing user agreement over independent reasoning -- poses risks to
-reliability. This study introduces a framework to evaluate sycophantic behavior
-in ChatGPT-4o, Claude-Sonnet, and Gemini-1.5-Pro across AMPS (mathematics) and
-MedQuad (medical advice) datasets. Sycophantic behavior was observed in 58.19%
-of cases, with Gemini exhibiting the highest rate (62.47%) and ChatGPT the
-lowest (56.71%). Progressive sycophancy, leading to correct answers, occurred
-in 43.52% of cases, while regressive sycophancy, leading to incorrect answers,
-was observed in 14.66%. Preemptive rebuttals demonstrated significantly higher
-sycophancy rates than in-context rebuttals (61.75% vs. 56.52%, $Z=5.87$,
-$p<0.001$), particularly in computational tasks, where regressive sycophancy
-increased significantly (preemptive: 8.13%, in-context: 3.54%, $p<0.001$).
-Simple rebuttals maximized progressive sycophancy ($Z=6.59$, $p<0.001$), while
-citation-based rebuttals exhibited the highest regressive rates ($Z=6.59$,
-$p<0.001$). Sycophantic behavior showed high persistence (78.5%, 95% CI:
-[77.2%, 79.8%]) regardless of context or model. These findings emphasize the
-risks and opportunities of deploying LLMs in structured and dynamic domains,
-offering insights into prompt programming and model optimization for safer AI
-applications.
+In real-world scientific discovery, human beings always make use of the
+accumulated prior knowledge with imagination pick select one or a few most
+promising hypotheses from large and noisy data analysis results. In this study,
+we introduce a new type of graph structure, the text-numeric graph (TNG), which
+is defined as graph entities and associations have both text-attributed
+information and numeric information. The TNG is an ideal data structure model
+for novel scientific discovery via graph reasoning because it integrates
+human-understandable textual annotations or prior knowledge, with numeric
+values that represent the observed or activation levels of graph entities or
+associations in different samples. Together both the textual information and
+numeric values determine the importance of graph entities and associations in
+graph reasoning for novel scientific knowledge discovery. We further propose
+integrating large language models (LLMs) and graph neural networks (GNNs) to
+analyze the TNGs for graph understanding and reasoning. To demonstrate the
+utility, we generated the text-omic(numeric) signaling graphs (TOSG), as one
+type of TNGs, in which all graphs have the same entities, associations and
+annotations, but have sample-specific entity numeric (omic) values using single
+cell RNAseq (scRNAseq) datasets of different diseases. We proposed joint
+LLM-GNN models for key entity mining and signaling pathway mining on the TOSGs.
+The evaluation results showed the LLM-GNN and TNGs models significantly improve
+classification accuracy and network inference. In conclusion, the TNGs and
+joint LLM-GNN models are important approaches for scientific discovery.
 
-摘要：大型語言模型（LLM）日益應用於教育、臨床和專業領域，但它們趨於趨炎附勢——優先考慮用戶同意而非獨立推理——對可靠性構成風險。本研究引入了一個框架來評估 ChatGPT-4o、Claude-Sonnet 和 Gemini-1.5-Pro 中的趨炎附勢行為，涉及 AMPS（數學）和 MedQuad（醫療建議）數據集。在 58.19% 的案例中觀察到了趨炎附勢行為，其中 Gemini 表現出最高比率（62.47%），而 ChatGPT 最低（56.71%）。導致正確答案的漸進式趨炎附勢發生在 43.52% 的案例中，而導致不正確答案的退步式趨炎附勢則在 14.66% 的案例中被觀察到。先發制人的反駁表現出顯著高於上下文反駁的趨炎附勢率（61.75% 對 56.52%，Z=5.87，p<0.001），特別是在計算任務中，其中退步式趨炎附勢顯著增加（先發制人：8.13%，上下文：3.54%，p<0.001）。簡單的反駁最大化了漸進式趨炎附勢（Z=6.59，p<0.001），而基於引用的反駁表現出最高的退步式比率（Z=6.59，p<0.001）。趨炎附勢行為表現出很高的持續性（78.5%，95% CI：[77.2%，79.8%]），無論上下文或模型如何。這些發現強調了在結構化和動態領域部署 LLM 的風險和機遇，為更安全的 AI 應用提供了提示編程和模型優化的見解。
+摘要：<paragraph>在現實世界的科學發現中，人類總是利用累積的先驗知識，並運用想像力從大量且雜訊的資料分析結果中挑選出一個或幾個最有希望的假設。在本研究中，我們介紹了一種新型態的圖形結構，稱為文字數值圖 (TNG)，定義為圖形實體和關聯具有文字屬性資訊和數值資訊。TNG 是透過圖形推理進行新科學發現的理想資料結構模型，因為它整合了人類可理解的文字註解或先驗知識，以及代表圖形實體或不同樣本中關聯的觀察值或活化程度的數值。文字資訊和數值一起決定了圖形實體和關聯在圖形推理中對於新科學知識發現的重要性。我們進一步提出整合大型語言模型 (LLM) 和圖形神經網路 (GNN) 來分析 TNG，以進行圖形理解和推理。為了展示其效用，我們生成了文字組學（數值）訊號圖 (TOSG)，作為一種 TNG，其中所有圖形都具有相同的實體、關聯和註解，但具有特定於樣本的實體數值（組學）值，使用不同疾病的單細胞 RNAseq (scRNAseq) 資料集。我們針對 TOSG 提出聯合 LLM-GNN 模型，用於關鍵實體探勘和訊號路徑探勘。評估結果顯示，LLM-GNN 和 TNG 模型顯著提升了分類準確度和網路推論。結論而言，TNG 和聯合 LLM-GNN 模型是科學發現的重要方法。</paragraph>
 
-##### **Caught in the Web of Words: Do LLMs Fall for Spin in Medical Literature?**
-2502.07963v1 by Hye Sun Yun, Karen Y. C. Zhang, Ramez Kouzy, Iain J. Marshall, Junyi Jessy Li, Byron C. Wallace
+##### **Zep: A Temporal Knowledge Graph Architecture for Agent Memory**
+2501.13956v1 by Preston Rasmussen, Pavlo Paliychuk, Travis Beauvais, Jack Ryan, Daniel Chalef
 
-Medical research faces well-documented challenges in translating novel
-treatments into clinical practice. Publishing incentives encourage researchers
-to present "positive" findings, even when empirical results are equivocal.
-Consequently, it is well-documented that authors often spin study results,
-especially in article abstracts. Such spin can influence clinician
-interpretation of evidence and may affect patient care decisions. In this
-study, we ask whether the interpretation of trial results offered by Large
-Language Models (LLMs) is similarly affected by spin. This is important since
-LLMs are increasingly being used to trawl through and synthesize published
-medical evidence. We evaluated 22 LLMs and found that they are across the board
-more susceptible to spin than humans. They might also propagate spin into their
-outputs: We find evidence, e.g., that LLMs implicitly incorporate spin into
-plain language summaries that they generate. We also find, however, that LLMs
-are generally capable of recognizing spin, and can be prompted in a way to
-mitigate spin's impact on LLM outputs.
+We introduce Zep, a novel memory layer service for AI agents that outperforms
+the current state-of-the-art system, MemGPT, in the Deep Memory Retrieval (DMR)
+benchmark. Additionally, Zep excels in more comprehensive and challenging
+evaluations than DMR that better reflect real-world enterprise use cases. While
+existing retrieval-augmented generation (RAG) frameworks for large language
+model (LLM)-based agents are limited to static document retrieval, enterprise
+applications demand dynamic knowledge integration from diverse sources
+including ongoing conversations and business data. Zep addresses this
+fundamental limitation through its core component Graphiti -- a
+temporally-aware knowledge graph engine that dynamically synthesizes both
+unstructured conversational data and structured business data while maintaining
+historical relationships. In the DMR benchmark, which the MemGPT team
+established as their primary evaluation metric, Zep demonstrates superior
+performance (94.8% vs 93.4%). Beyond DMR, Zep's capabilities are further
+validated through the more challenging LongMemEval benchmark, which better
+reflects enterprise use cases through complex temporal reasoning tasks. In this
+evaluation, Zep achieves substantial results with accuracy improvements of up
+to 18.5% while simultaneously reducing response latency by 90% compared to
+baseline implementations. These results are particularly pronounced in
+enterprise-critical tasks such as cross-session information synthesis and
+long-term context maintenance, demonstrating Zep's effectiveness for deployment
+in real-world applications.
 
-摘要：醫學研究在將新穎療法轉化為臨床實務上，面臨著有據可查的挑戰。發表誘因鼓勵研究人員呈現「正向」的發現，即使經驗結果模稜兩可。因此，有據可查的是，作者經常扭曲研究結果，特別是在文章摘要中。此類扭曲可能會影響臨床醫師對證據的詮釋，並可能影響病患照護決策。在本研究中，我們探討大型語言模型 (LLM) 提供的試驗結果詮釋是否也受到扭曲影響。由於 LLM 正越來越常被用於爬梳和綜合已發表的醫學證據，因此這點非常重要。我們評估了 22 個 LLM，發現它們普遍比人類更容易受到扭曲影響。它們也可能將扭曲傳播到其輸出中：例如，我們發現 LLM 會將扭曲隱含納入其產生的白話文摘要中。然而，我們也發現 LLM 通常有能力辨認扭曲，而且可以透過提示的方式減輕扭曲對 LLM 輸出的影響。
+摘要：我們推出 Zep，這是一種新穎的記憶層服務，適用於 AI 代理，其在深度記憶擷取 (DMR) 基準測試中優於現行的最先進系統 MemGPT。此外，Zep 在比 DMR 更全面且更具挑戰性的評估中表現出色，這些評估更能反映真實世界的企業用例。雖然現有的檢索增強生成 (RAG) 架構僅限於大型語言模型 (LLM) 基於代理的靜態文件檢索，但企業應用需要從包括正在進行的對話和業務數據在內的不同來源動態整合知識。Zep 通過其核心組件 Graphiti 來解決這個基本限制，Graphiti 是一個時間感知知識圖譜引擎，可以在維護歷史關係的同時動態綜合非結構化對話數據和結構化業務數據。在 MemGPT 團隊確立為其主要評估指標的 DMR 基準測試中，Zep 表現出優異的效能（94.8% 對 93.4%）。除了 DMR 之外，Zep 的功能還通過更具挑戰性的 LongMemEval 基準測試進一步得到驗證，該基準測試通過複雜的時間推理任務更好地反映了企業用例。在這個評估中，Zep 以高達 18.5% 的準確度改進取得了顯著的成果，同時與基線實作相比，將回應延遲降低了 90%。這些成果在企業關鍵任務中尤為明顯，例如跨會話資訊綜合和長期脈絡維護，證明了 Zep 在實際應用中部署的有效性。
 
-##### **An Advanced NLP Framework for Automated Medical Diagnosis with DeBERTa and Dynamic Contextual Positional Gating**
-2502.07755v1 by Mohammad Ali Labbaf Khaniki, Sahabeh Saadati, Mohammad Manthouri
+##### **Explainable Lane Change Prediction for Near-Crash Scenarios Using Knowledge Graph Embeddings and Retrieval Augmented Generation**
+2501.11560v1 by M. Manzour, A. Ballardini, R. Izquierdo, M. Á. Sotelo
 
-This paper presents a novel Natural Language Processing (NLP) framework for
-enhancing medical diagnosis through the integration of advanced techniques in
-data augmentation, feature extraction, and classification. The proposed
-approach employs back-translation to generate diverse paraphrased datasets,
-improving robustness and mitigating overfitting in classification tasks.
-Leveraging Decoding-enhanced BERT with Disentangled Attention (DeBERTa) with
-Dynamic Contextual Positional Gating (DCPG), the model captures fine-grained
-contextual and positional relationships, dynamically adjusting the influence of
-positional information based on semantic context to produce high-quality text
-embeddings. For classification, an Attention-Based Feedforward Neural Network
-(ABFNN) is utilized, effectively focusing on the most relevant features to
-improve decision-making accuracy. Applied to the classification of symptoms,
-clinical notes, and other medical texts, this architecture demonstrates its
-ability to address the complexities of medical data. The combination of data
-augmentation, contextual embedding generation, and advanced classification
-mechanisms offers a robust and accurate diagnostic tool, with potential
-applications in automated medical diagnosis and clinical decision support. This
-method demonstrates the effectiveness of the proposed NLP framework for medical
-diagnosis, achieving remarkable results with an accuracy of 99.78%, recall of
-99.72%, precision of 99.79%, and an F1-score of 99.75%. These metrics not only
-underscore the model's robust performance in classifying medical texts with
-exceptional precision and reliability but also highlight its superiority over
-existing methods, making it a highly promising tool for automated diagnostic
-systems.
+Lane-changing maneuvers, particularly those executed abruptly or in risky
+situations, are a significant cause of road traffic accidents. However, current
+research mainly focuses on predicting safe lane changes. Furthermore, existing
+accident datasets are often based on images only and lack comprehensive sensory
+data. In this work, we focus on predicting risky lane changes using the CRASH
+dataset (our own collected dataset specifically for risky lane changes), and
+safe lane changes (using the HighD dataset). Then, we leverage KG and Bayesian
+inference to predict these maneuvers using linguistic contextual information,
+enhancing the model's interpretability and transparency. The model achieved a
+91.5% f1-score with anticipation time extending to four seconds for risky lane
+changes, and a 90.0% f1-score for predicting safe lane changes with the same
+anticipation time. We validate our model by integrating it into a vehicle
+within the CARLA simulator in scenarios that involve risky lane changes. The
+model managed to anticipate sudden lane changes, thus providing automated
+vehicles with further time to plan and execute appropriate safe reactions.
+Finally, to enhance the explainability of our model, we utilize RAG to provide
+clear and natural language explanations for the given prediction.
+
+摘要：換車道動作，尤其是突然或在風險情況下執行的動作，是道路交通事故的重要原因。然而，目前的研究所主要集中在預測安全的換車道。此外，現有的事故資料集通常僅基於影像，且缺乏全面的感測資料。在這項工作中，我們專注於使用 CRASH 資料集（我們自己收集的專門針對風險換車道資料集）來預測風險換車道，以及安全換車道（使用 HighD 資料集）。然後，我們利用 KG 和貝氏推理來使用語言背景資訊預測這些動作，增強模型的可解釋性和透明度。該模型在風險換車道的預測時間延長至四秒時，達到了 91.5% 的 f1 分數，在預測安全換車道時，在相同的預測時間內達到了 90.0% 的 f1 分數。我們透過將模型整合到 CARLA 模擬器中的車輛中，在涉及風險換車道的場景中驗證我們的模型。該模型設法預測突然的換車道，從而為自動駕駛車輛提供了更多時間來規劃和執行適當的安全反應。最後，為了增強我們模型的可解釋性，我們利用 RAG 為給定的預測提供清晰且自然的語言解釋。
+
+##### **Each Graph is a New Language: Graph Learning with LLMs**
+2501.11478v2 by Huachi Zhou, Jiahe Du, Chuang Zhou, Chang Yang, Yilin Xiao, Yuxuan Xie, Xiao Huang
 
-摘要：本文提出了一個創新的自然語言處理 (NLP) 框架，透過整合資料擴充、特徵萃取和分類的進階技術來增強醫療診斷。所提出的方法採用反向翻譯來產生多樣化的同義改寫資料集，提升穩健性並減輕分類任務中的過度擬合。透過利用具有動態脈絡位置閘控 (DCPG) 的解碼增強 BERT 與去糾纏注意力 (DeBERTa)，這個模型捕捉細緻的脈絡和位置關係，根據語意脈絡動態調整位置資訊的影響，以產生高品質的文字嵌入。在分類方面，利用基於注意力的前饋神經網路 (ABFNN)，有效地關注最相關的特徵，以提高決策準確度。應用於症狀、臨床筆記和其他醫療文本的分類，此架構證明了其處理醫療資料複雜性的能力。資料擴充、脈絡嵌入產生和進階分類機制的結合提供了一個穩健且準確的診斷工具，在自動化醫療診斷和臨床決策支援中具有潛在應用。此方法證明了所提出的 NLP 框架在醫療診斷中的有效性，以 99.78% 的準確度、99.72% 的召回率、99.79% 的精確度和 99.75% 的 F1 分數，取得了顯著的成果。這些指標不僅強調了模型在分類醫療文本時具有卓越的精確度和可靠性，也突顯了它優於現有方法的優越性，使其成為自動化診斷系統中極具前景的工具。
+Recent efforts leverage Large Language Models (LLMs) for modeling
+text-attributed graph structures in node classification tasks. These approaches
+describe graph structures for LLMs to understand or aggregate LLM-generated
+textual attribute embeddings through graph structure. However, these approaches
+face two main limitations in modeling graph structures with LLMs. (i) Graph
+descriptions become verbose in describing high-order graph structure. (ii)
+Textual attributes alone do not contain adequate graph structure information.
+It is challenging to model graph structure concisely and adequately with LLMs.
+LLMs lack built-in mechanisms to model graph structures directly. They also
+struggle with complex long-range dependencies between high-order nodes and
+target nodes.
+  Inspired by the observation that LLMs pre-trained on one language can achieve
+exceptional performance on another with minimal additional training, we propose
+\textbf{G}raph-\textbf{D}efined \textbf{L}anguage for \textbf{L}arge
+\textbf{L}anguage \textbf{M}odel (GDL4LLM). This novel framework enables LLMs
+to transfer their powerful language understanding capabilities to
+graph-structured data. GDL4LLM translates graphs into a graph language corpus
+instead of graph descriptions and pre-trains LLMs on this corpus to adequately
+understand graph structures. During fine-tuning, this corpus describes the
+structural information of target nodes concisely with only a few tokens. By
+treating graphs as a new language, GDL4LLM enables LLMs to model graph
+structures adequately and concisely for node classification tasks. Extensive
+experiments on three real-world datasets demonstrate that GDL4LLM outperforms
+description-based and textual attribute embeddings-based baselines by
+efficiently modeling different orders of graph structure with LLMs.
 
-##### **Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension**
-2502.07752v1 by Wenbo Gong, Meyer Scetbon, Chao Ma, Edward Meeds
+摘要：<paragraph>最近的研究利用大型语言模型 (LLM) 对节点分类任务中的文本属性图结构进行建模。这些方法描述图结构，以便 LLM 理解或通过图结构聚合 LLM 生成的文本属性嵌入。然而，这些方法在使用 LLM 对图结构进行建模时面临两个主要限制。(i) 图描述在描述高阶图结构时变得冗长。(ii) 仅文本属性不包含足够的图结构信息。使用 LLM 对图结构进行简洁且充分的建模具有挑战性。LLM 缺乏直接对图结构进行建模的内置机制。它们还难以处理高阶节点和目标节点之间复杂的远程依赖关系。
+受 LLM 在一种语言上进行预训练后，只需进行最少的额外训练即可在另一种语言上实现卓越性能的观察结果的启发，我们提出了**G**raph-**D**efined **L**anguage for **L**arge **L**anguage **M**odel (GDL4LLM)。此新框架使 LLM 能够将其强大的语言理解能力转移到结构化数据图。GDL4LLM 将图翻译成图语言语料库，而不是图描述，并在该语料库上对 LLM 进行预训练，以充分理解图结构。在微调期间，此语料库仅使用几个标记简洁地描述目标节点的结构信息。通过将图视为一种新语言，GDL4LLM 使 LLM 能够充分且简洁地对图结构进行建模，以用于节点分类任务。在三个真实世界数据集上进行的广泛实验表明，GDL4LLM 通过使用 LLM 有效地对不同阶的图结构进行建模，优于基于描述和基于文本属性嵌入的基线。</paragraph>
 
-Designing efficient optimizers for large language models (LLMs) with
-low-memory requirements and fast convergence is an important and challenging
-problem. This paper makes a step towards the systematic design of such
-optimizers through the lens of structured Fisher information matrix (FIM)
-approximation. We show that many state-of-the-art efficient optimizers can be
-viewed as solutions to FIM approximation (under the Frobenius norm) with
-specific structural assumptions. Building on these insights, we propose two
-design recommendations of practical efficient optimizers for LLMs, involving
-the careful selection of structural assumptions to balance generality and
-efficiency, and enhancing memory efficiency of optimizers with general
-structures through a novel low-rank extension framework. We demonstrate how to
-use each design approach by deriving new memory-efficient optimizers: Row and
-Column Scaled SGD (RACS) and Adaptive low-dimensional subspace estimation
-(Alice). Experiments on LLaMA pre-training (up to 1B parameters) validate the
-effectiveness, showing faster and better convergence than existing
-memory-efficient baselines and Adam with little memory overhead. Notably, Alice
-achieves better than 2x faster convergence over Adam, while RACS delivers
-strong performance on the 1B model with SGD-like memory.
+##### **Few-shot Policy (de)composition in Conversational Question Answering**
+2501.11335v1 by Kyle Erwin, Guy Axelrod, Maria Chang, Achille Fokoue, Maxwell Crouse, Soham Dan, Tian Gao, Rosario Uceda-Sosa, Ndivhuwo Makondo, Naweed Khan, Alexander Gray
 
-摘要：設計具有低記憶體需求和快速收斂的大型語言模型 (LLM) 的高效最佳化器是一個重要且具有挑戰性的問題。本文透過結構化 Fisher 資訊矩陣 (FIM) 近似的角度，朝向此類最佳化器的系統化設計邁進一步。我們展示了許多最先進的高效最佳化器可以被視為 FIM 近似（在 Frobenius 範數下）的解，並具有特定的結構假設。基於這些見解，我們提出了 LLM 的兩個實用高效最佳化器設計建議，包括仔細選擇結構假設以平衡通用性和效率，並透過新穎的低秩延伸架構來增強具有通用結構的最佳化器的記憶體效率。我們展示了如何透過推導新的記憶體高效最佳化器來使用每種設計方法：列和欄縮放 SGD (RACS) 和自適應低維子空間估計 (Alice)。在 LLaMA 預訓練（高達 1B 參數）上的實驗驗證了其有效性，顯示比現有的記憶體高效基線和 Adam 更快且更好的收斂，且記憶體開銷很小。值得注意的是，Alice 比 Adam 快 2 倍以上，而 RACS 則在 1B 模型上提供類似 SGD 記憶體的強勁效能。
+The task of policy compliance detection (PCD) is to determine if a scenario
+is in compliance with respect to a set of written policies. In a conversational
+setting, the results of PCD can indicate if clarifying questions must be asked
+to determine compliance status. Existing approaches usually claim to have
+reasoning capabilities that are latent or require a large amount of annotated
+data. In this work, we propose logical decomposition for policy compliance
+(LDPC): a neuro-symbolic framework to detect policy compliance using large
+language models (LLMs) in a few-shot setting. By selecting only a few exemplars
+alongside recently developed prompting techniques, we demonstrate that our
+approach soundly reasons about policy compliance conversations by extracting
+sub-questions to be answered, assigning truth values from contextual
+information, and explicitly producing a set of logic statements from the given
+policies. The formulation of explicit logic graphs can in turn help answer
+PCDrelated questions with increased transparency and explainability. We apply
+this approach to the popular PCD and conversational machine reading benchmark,
+ShARC, and show competitive performance with no task-specific finetuning. We
+also leverage the inherently interpretable architecture of LDPC to understand
+where errors occur, revealing ambiguities in the ShARC dataset and highlighting
+the challenges involved with reasoning for conversational question answering.
 
-##### **The Devil is in the Prompts: De-Identification Traces Enhance Memorization Risks in Synthetic Chest X-Ray Generation**
-2502.07516v1 by Raman Dutt
+摘要：策略合規偵測 (PCD) 的任務是確定場景是否符合一組書面策略。在對話設定中，PCD 的結果可以指出是否必須提出澄清問題以確定合規狀態。現有的方法通常聲稱具有潛在的推理能力，或需要大量的註釋資料。在這項工作中，我們提出策略合規的邏輯分解 (LDPC)：一種使用大型語言模型 (LLM) 在少次嘗試中偵測策略合規的神經符號框架。透過僅選擇少數範例以及最近開發的提示技術，我們證明我們的做法透過提取要回答的子問題、從脈絡資訊指派真值，以及從給定的策略明確產生一組邏輯陳述，對策略合規對話進行合理的推理。明確邏輯圖表的制定反過來可以幫助回答 PCD 相關問題，並提高透明度和可解釋性。我們將此方法應用於熱門的 PCD 和對話式機器閱讀基準 ShARC，並在沒有特定任務微調的情況下展現出競爭力。我們也利用 LDPC 固有的可解釋架構來了解錯誤發生在哪裡，揭露 ShARC 資料集中的歧義，並強調對話式問題解答推理的挑戰。
 
-Generative models, particularly text-to-image (T2I) diffusion models, play a
-crucial role in medical image analysis. However, these models are prone to
-training data memorization, posing significant risks to patient privacy.
-Synthetic chest X-ray generation is one of the most common applications in
-medical image analysis with the MIMIC-CXR dataset serving as the primary data
-repository for this task. This study adopts a data-driven approach and presents
-the first systematic attempt to identify prompts and text tokens in MIMIC-CXR
-that contribute the most to training data memorization. Our analysis reveals an
-unexpected finding: prompts containing traces of de-identification procedures
-are among the most memorized, with de-identification markers contributing the
-most. Furthermore, we also find existing inference-time memorization mitigation
-strategies are ineffective and fail to sufficiently reduce the model's reliance
-on memorized text tokens highlighting a broader issue in T2I synthesis with
-MIMIC-CXR. On this front, we propose actionable strategies to enhance privacy
-and improve the reliability of generative models in medical imaging. Finally,
-our results provide a foundation for future work on developing and benchmarking
-memorization mitigation techniques for synthetic chest X-ray generation using
-the MIMIC-CXR dataset.
+##### **Reasoning Language Models: A Blueprint**
+2501.11223v3 by Maciej Besta, Julia Barth, Eric Schreiber, Ales Kubicek, Afonso Catarino, Robert Gerstenberger, Piotr Nyczyk, Patrick Iff, Yueling Li, Sam Houliston, Tomasz Sternal, Marcin Copik, Grzegorz Kwaśniewski, Jürgen Müller, Łukasz Flis, Hannes Eberhard, Hubert Niewiadomski, Torsten Hoefler
 
-摘要：生成模型，尤其是文字轉圖像 (T2I) 擴散模型，在醫學影像分析中扮演著至關重要的角色。然而，這些模型容易訓練資料記憶，對病患隱私造成重大風險。合成胸部 X 光線生成是醫學影像分析中最常見的應用之一，其中 MIMIC-CXR 資料集作為此任務的主要資料儲存庫。本研究採用資料驅動的方法，並提出首次系統性嘗試，以識別 MIMIC-CXR 中最有助於訓練資料記憶的提示和文字代碼。我們的分析揭露了一個意外的發現：包含去識別程序痕跡的提示是最常被記憶的，其中去識別標記的貢獻最大。此外，我們也發現現有的推論時間記憶減緩策略無效，且無法充分降低模型對記憶文字代碼的依賴性，突顯了使用 MIMIC-CXR 進行 T2I 合成的更廣泛問題。針對此問題，我們提出可行的策略，以增強隱私並改善生成模型在醫學影像中的可靠性。最後，我們的結果為未來使用 MIMIC-CXR 資料集開發和評量合成胸部 X 光線生成的記憶減緩技術奠定了基礎。
+Reasoning language models (RLMs), also known as Large Reasoning Models
+(LRMs), such as OpenAI's o1 and o3, DeepSeek-V3, and Alibaba's QwQ, have
+redefined AI's problem-solving capabilities by extending LLMs with advanced
+reasoning mechanisms. Yet, their high costs, proprietary nature, and complex
+architectures - uniquely combining Reinforcement Learning (RL), search
+heuristics, and LLMs - present accessibility and scalability challenges. To
+address these, we propose a comprehensive blueprint that organizes RLM
+components into a modular framework, based on a survey and analysis of all RLM
+works. This blueprint incorporates diverse reasoning structures (chains, trees,
+graphs, and nested forms), reasoning strategies (e.g., Monte Carlo Tree Search,
+Beam Search), RL concepts (policy, value models and others), supervision
+schemes (Outcome-Based and Process-Based Supervision), and other related
+concepts (e.g., Test-Time Compute, Retrieval-Augmented Generation, agent
+tools). We also provide detailed mathematical formulations and algorithmic
+specifications to simplify RLM implementation. By showing how schemes like
+LLaMA-Berry, QwQ, Journey Learning, and Graph of Thoughts fit as special cases,
+we demonstrate the blueprint's versatility and unifying potential. To
+illustrate its utility, we introduce x1, a modular implementation for rapid RLM
+prototyping and experimentation. Using x1 and a literature review, we provide
+key insights, such as multi-phase training for policy and value models, and the
+importance of familiar training distributions. Finally, we discuss scalable RLM
+cloud deployments and we outline how RLMs can integrate with a broader LLM
+ecosystem. Our work demystifies RLM construction, democratizes advanced
+reasoning capabilities, and fosters innovation, aiming to mitigate the gap
+between "rich AI" and "poor AI" by lowering barriers to RLM design and
+experimentation.
 
-##### **KPIs 2024 Challenge: Advancing Glomerular Segmentation from Patch- to Slide-Level**
-2502.07288v1 by Ruining Deng, Tianyuan Yao, Yucheng Tang, Junlin Guo, Siqi Lu, Juming Xiong, Lining Yu, Quan Huu Cap, Pengzhou Cai, Libin Lan, Ze Zhao, Adrian Galdran, Amit Kumar, Gunjan Deotale, Dev Kumar Das, Inyoung Paik, Joonho Lee, Geongyu Lee, Yujia Chen, Wangkai Li, Zhaoyang Li, Xuege Hou, Zeyuan Wu, Shengjin Wang, Maximilian Fischer, Lars Kramer, Anghong Du, Le Zhang, Maria Sanchez Sanchez, Helena Sanchez Ulloa, David Ribalta Heredia, Carlos Perez de Arenaza Garcia, Shuoyu Xu, Bingdou He, Xinping Cheng, Tao Wang, Noemie Moreau, Katarzyna Bozek, Shubham Innani, Ujjwal Baid, Kaura Solomon Kefas, Bennett A. Landman, Yu Wang, Shilin Zhao, Mengmeng Yin, Haichun Yang, Yuankai Huo
+摘要：推理語言模型 (RLM)，又稱為大型推理模型 (LRM)，例如 OpenAI 的 o1 和 o3、DeepSeek-V3 以及阿里巴巴的 QwQ，透過擴充 LLM 的先進推理機制，重新定義了 AI 的問題解決能力。然而，它們的高成本、專有性質和複雜架構（獨特地結合了強化學習 (RL)、搜尋啟發法和 LLM）提出了可及性和可擴充性的挑戰。為了解決這些問題，我們提出了一個全面的藍圖，將 RLM 組件組織成一個模組化架構，這是基於對所有 RLM 作品的調查和分析。此藍圖包含多樣化的推理結構（鏈、樹、圖和巢狀形式）、推理策略（例如蒙地卡羅樹搜尋、波束搜尋）、RL 概念（策略、價值模型等）、監督方案（基於結果和基於流程的監督）和其他相關概念（例如測試時間運算、檢索增強生成、代理工具）。我們還提供了詳細的數學公式和演算法規範，以簡化 RLM 的實作。透過展示 LLaMA-Berry、QwQ、Journey Learning 和 Graph of Thoughts 等方案如何作為特殊情況，我們展示了藍圖的多功能性和統一潛力。為了說明其效用，我們介紹了 x1，這是一個模組化實作，用於快速 RLM 原型製作和實驗。使用 x1 和文獻回顧，我們提供了關鍵見解，例如策略和價值模型的多階段訓練，以及熟悉訓練分佈的重要性。最後，我們討論了可擴充的 RLM 雲端部署，並概述了 RLM 如何與更廣泛的 LLM 生態系統整合。我們的研究揭開了 RLM 建構的神秘面紗，使先進的推理能力民主化，並促進創新，旨在透過降低 RLM 設計和實驗的障礙，來縮小「富裕 AI」和「貧窮 AI」之間的差距。
 
-Chronic kidney disease (CKD) is a major global health issue, affecting over
-10% of the population and causing significant mortality. While kidney biopsy
-remains the gold standard for CKD diagnosis and treatment, the lack of
-comprehensive benchmarks for kidney pathology segmentation hinders progress in
-the field. To address this, we organized the Kidney Pathology Image
-Segmentation (KPIs) Challenge, introducing a dataset that incorporates
-preclinical rodent models of CKD with over 10,000 annotated glomeruli from 60+
-Periodic Acid Schiff (PAS)-stained whole slide images. The challenge includes
-two tasks, patch-level segmentation and whole slide image segmentation and
-detection, evaluated using the Dice Similarity Coefficient (DSC) and F1-score.
-By encouraging innovative segmentation methods that adapt to diverse CKD models
-and tissue conditions, the KPIs Challenge aims to advance kidney pathology
-analysis, establish new benchmarks, and enable precise, large-scale
-quantification for disease research and diagnosis.
+##### **IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems**
+2501.11067v1 by Elad Levi, Ilan Kadar
 
-摘要：慢性腎臟病 (CKD) 是全球主要的健康問題，影響超過
-10% 的人口，並造成顯著的死亡率。雖然腎臟活檢
-仍然是 CKD 診斷和治療的黃金標準，但缺乏
-腎臟病理學分割的全面基準阻礙了該領域的進展。
-為了解決這個問題，我們組織了腎臟病理影像
-分割 (KPIs) 挑戰，引入了包含超過 10,000 個註解的
-CKD 臨床前嚙齒動物模型的資料集，這些註解來自 60 多個
-週期性酸性雪夫 (PAS) 染色的全幻燈片影像。挑戰包括
-兩個任務，修補層級分割和全幻燈片影像分割和
-偵測，使用 Dice 相似係數 (DSC) 和 F1 分數進行評估。
-通過鼓勵創新的分割方法來適應不同的 CKD 模型
-和組織條件，KPIs 挑戰旨在推進腎臟病理
-分析，建立新的基準，並實現精確、大規模的
-疾病研究和診斷量化。
+Large Language Models (LLMs) are transforming artificial intelligence,
+evolving into task-oriented systems capable of autonomous planning and
+execution. One of the primary applications of LLMs is conversational AI
+systems, which must navigate multi-turn dialogues, integrate domain-specific
+APIs, and adhere to strict policy constraints. However, evaluating these agents
+remains a significant challenge, as traditional methods fail to capture the
+complexity and variability of real-world interactions. We introduce
+IntellAgent, a scalable, open-source multi-agent framework designed to evaluate
+conversational AI systems comprehensively. IntellAgent automates the creation
+of diverse, synthetic benchmarks by combining policy-driven graph modeling,
+realistic event generation, and interactive user-agent simulations. This
+innovative approach provides fine-grained diagnostics, addressing the
+limitations of static and manually curated benchmarks with coarse-grained
+metrics. IntellAgent represents a paradigm shift in evaluating conversational
+AI. By simulating realistic, multi-policy scenarios across varying levels of
+complexity, IntellAgent captures the nuanced interplay of agent capabilities
+and policy constraints. Unlike traditional methods, it employs a graph-based
+policy model to represent relationships, likelihoods, and complexities of
+policy interactions, enabling highly detailed diagnostics. IntellAgent also
+identifies critical performance gaps, offering actionable insights for targeted
+optimization. Its modular, open-source design supports seamless integration of
+new domains, policies, and APIs, fostering reproducibility and community
+collaboration. Our findings demonstrate that IntellAgent serves as an effective
+framework for advancing conversational AI by addressing challenges in bridging
+research and deployment. The framework is available at
+https://github.com/plurai-ai/intellagent
 
-##### **Early Risk Prediction of Pediatric Cardiac Arrest from Electronic Health Records via Multimodal Fused Transformer**
-2502.07158v1 by Jiaying Lu, Stephanie R. Brown, Songyuan Liu, Shifan Zhao, Kejun Dong, Del Bold, Michael Fundora, Alaa Aljiffry, Alex Fedorov, Jocelyn Grunwell, Xiao Hu
+摘要：大型語言模型 (LLM) 正在轉變人工智慧，演變成具備自主規劃和執行能力的任務導向系統。LLM 的主要應用之一是對話式 AI 系統，它必須應對多輪對話、整合特定領域的 API，並遵守嚴格的政策約束。然而，評估這些代理仍然是一項重大挑戰，因為傳統方法無法捕捉現實世界互動的複雜性和變異性。我們引入了 IntellAgent，一個可擴充、開放原始碼的多代理架構，旨在全面評估對話式 AI 系統。IntellAgent 自動化建立多樣化、合成的基準，方法是結合策略驅動的圖形建模、逼真的事件產生和互動使用者代理模擬。這種創新方法提供了細緻的診斷，解決了具有粗略指標的靜態和手動策劃基準的限制。IntellAgent 代表了評估對話式 AI 的典範轉移。通過模擬不同層級複雜性的逼真多策略場景，IntellAgent 捕捉到了代理功能和策略約束之間的細微交互。與傳統方法不同，它採用基於圖形的策略模型來表示策略交互的關係、可能性和複雜性，從而實現高度詳細的診斷。IntellAgent 還識別出關鍵效能差距，提供可行的見解，以進行目標最佳化。其模組化、開放原始碼的設計支援無縫整合新的領域、策略和 API，促進了可複製性和社群協作。我們的研究結果表明，IntellAgent 可作為一個有效的框架，透過解決研究和部署之間的挑戰來推進對話式 AI。這個框架可在 https://github.com/plurai-ai/intellagent 取得
 
-Early prediction of pediatric cardiac arrest (CA) is critical for timely
-intervention in high-risk intensive care settings. We introduce PedCA-FT, a
-novel transformer-based framework that fuses tabular view of EHR with the
-derived textual view of EHR to fully unleash the interactions of
-high-dimensional risk factors and their dynamics. By employing dedicated
-transformer modules for each modality view, PedCA-FT captures complex temporal
-and contextual patterns to produce robust CA risk estimates. Evaluated on a
-curated pediatric cohort from the CHOA-CICU database, our approach outperforms
-ten other artificial intelligence models across five key performance metrics
-and identifies clinically meaningful risk factors. These findings underscore
-the potential of multimodal fusion techniques to enhance early CA detection and
-improve patient care.
 
-摘要：早期預測兒童心臟驟停 (CA) 對高風險重症監護環境中的及時干預至關重要。我們引入了 PedCA-FT，這是一個新的基於Transformer的框架，它將 EHR 的表格視圖與 EHR 的派生文本視圖融合在一起，以充分釋放高維風險因素及其動態的交互作用。通過為每個模態視圖採用專用的Transformer模塊，PedCA-FT 捕獲復雜的時間和上下文模式以產生穩健的 CA 風險估計。在 CHOA-CICU 資料庫中經過策劃的兒科隊列上進行評估，我們的做法在五個關鍵性能指標上優於其他十個人工智慧模型，並識別出臨床上有意義的風險因素。這些發現強調了多模態融合技術在增強早期 CA 檢測和改善患者護理方面的潛力。
+### LLM
+|Publish Date|Title|Authors|Homepage|Code|
+| :---: | :---: | :---: | :---: | :---: |
+|**2025-02-13**|**Theoretical Benefit and Limitation of Diffusion Language Model**|Guhao Feng et.al.|[2502.09622v1](http://arxiv.org/abs/2502.09622v1)|null|
+|**2025-02-13**|**MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency**|Dongzhi Jiang et.al.|[2502.09621v1](http://arxiv.org/abs/2502.09621v1)|null|
+|**2025-02-13**|**Exploring the Potential of Encoder-free Architectures in 3D LMMs**|Yiwen Tang et.al.|[2502.09620v1](http://arxiv.org/abs/2502.09620v1)|null|
+|**2025-02-13**|**DexTrack: Towards Generalizable Neural Tracking Control for Dexterous Manipulation from Human References**|Xueyi Liu et.al.|[2502.09614v1](http://arxiv.org/abs/2502.09614v1)|null|
+|**2025-02-13**|**Score-of-Mixture Training: Training One-Step Generative Models Made Simple**|Tejas Jayashankar et.al.|[2502.09609v1](http://arxiv.org/abs/2502.09609v1)|null|
+|**2025-02-13**|**Human-LLM Coevolution: Evidence from Academic Writing**|Mingmeng Geng et.al.|[2502.09606v1](http://arxiv.org/abs/2502.09606v1)|null|
+|**2025-02-13**|**SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models**|Yung-Sung Chuang et.al.|[2502.09604v1](http://arxiv.org/abs/2502.09604v1)|null|
+|**2025-02-13**|**CoT-Valve: Length-Compressible Chain-of-Thought Tuning**|Xinyin Ma et.al.|[2502.09601v1](http://arxiv.org/abs/2502.09601v1)|null|
+|**2025-02-13**|**Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs**|Siyan Zhao et.al.|[2502.09597v1](http://arxiv.org/abs/2502.09597v1)|null|
+|**2025-02-13**|**KIMAs: A Configurable Knowledge Integrated Multi-Agent System**|Zitao Li et.al.|[2502.09596v1](http://arxiv.org/abs/2502.09596v1)|null|
+|**2025-02-13**|**Logical forms complement probability in understanding language model (and human) performance**|Yixuan Wang et.al.|[2502.09589v1](http://arxiv.org/abs/2502.09589v1)|null|
+|**2025-02-13**|**Optimizing GPT for Video Understanding: Zero-Shot Performance and Prompt Engineering**|Mark Beliaev et.al.|[2502.09573v1](http://arxiv.org/abs/2502.09573v1)|null|
+|**2025-02-13**|**MorphNLI: A Stepwise Approach to Natural Language Inference Using Text Morphing**|Vlad Andrei Negru et.al.|[2502.09567v1](http://arxiv.org/abs/2502.09567v1)|null|
+|**2025-02-13**|**Zero-shot generation of synthetic neurosurgical data with large language models**|Austin A. Barr et.al.|[2502.09566v1](http://arxiv.org/abs/2502.09566v1)|null|
+|**2025-02-13**|**MDCrow: Automating Molecular Dynamics Workflows with Large Language Models**|Quintina Campbell et.al.|[2502.09565v1](http://arxiv.org/abs/2502.09565v1)|null|
+|**2025-02-13**|**EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents**|Rui Yang et.al.|[2502.09560v1](http://arxiv.org/abs/2502.09560v1)|null|
+|**2025-02-13**|**Mind the Gap! Choice Independence in Using Multilingual LLMs for Persuasive Co-Writing Tasks in Different Languages**|Shreyan Biswas et.al.|[2502.09532v1](http://arxiv.org/abs/2502.09532v1)|null|
+|**2025-02-13**|**Diffusion Models for Molecules: A Survey of Methods and Tasks**|Liang Wang et.al.|[2502.09511v1](http://arxiv.org/abs/2502.09511v1)|null|
+|**2025-02-13**|**AttentionSmithy: A Modular Framework for Rapid Transformer Development and Customization**|Caleb Cranney et.al.|[2502.09503v1](http://arxiv.org/abs/2502.09503v1)|null|
+|**2025-02-13**|**Improve LLM-based Automatic Essay Scoring with Linguistic Features**|Zhaoyi Joey Hou et.al.|[2502.09497v1](http://arxiv.org/abs/2502.09497v1)|null|
+|**2025-02-13**|**Cracking the Code: Enhancing Development finance understanding with artificial intelligence**|Pierre Beaucoral et.al.|[2502.09495v1](http://arxiv.org/abs/2502.09495v1)|null|
+|**2025-02-13**|**Objective quantification of mood states using large language models**|Jakub Onysk et.al.|[2502.09487v1](http://arxiv.org/abs/2502.09487v1)|null|
+|**2025-02-13**|**The Multilingual Mind : A Survey of Multilingual Reasoning in Language Models**|Akash Ghosh et.al.|[2502.09457v1](http://arxiv.org/abs/2502.09457v1)|null|
+|**2025-02-13**|**Pixel-Level Reasoning Segmentation via Multi-turn Conversations**|Dexian Cai et.al.|[2502.09447v1](http://arxiv.org/abs/2502.09447v1)|null|
+|**2025-02-13**|**Dual Formulation for Non-Rectangular Lp Robust Markov Decision Processes**|Navdeep Kumar et.al.|[2502.09432v1](http://arxiv.org/abs/2502.09432v1)|null|
+|**2025-02-13**|**Transformer-Enhanced Variational Autoencoder for Crystal Structure Prediction**|Ziyi Chen et.al.|[2502.09423v1](http://arxiv.org/abs/2502.09423v1)|null|
+|**2025-02-13**|**On multi-token prediction for efficient LLM inference**|Somesh Mehra et.al.|[2502.09419v1](http://arxiv.org/abs/2502.09419v1)|null|
+|**2025-02-13**|**SQuARE: Sequential Question Answering Reasoning Engine for Enhanced Chain-of-Thought in Large Language Models**|Daniel Fleischer et.al.|[2502.09390v1](http://arxiv.org/abs/2502.09390v1)|null|
+|**2025-02-13**|**Truth Knows No Language: Evaluating Truthfulness Beyond English**|Blanca Calvo Figueras et.al.|[2502.09387v1](http://arxiv.org/abs/2502.09387v1)|null|
+|**2025-02-13**|**A Deep Inverse-Mapping Model for a Flapping Robotic Wing**|Hadar Sharvit et.al.|[2502.09378v1](http://arxiv.org/abs/2502.09378v1)|null|
+|**2025-02-13**|**Language Agents as Digital Representatives in Collective Decision-Making**|Daniel Jarrett et.al.|[2502.09369v1](http://arxiv.org/abs/2502.09369v1)|null|
+|**2025-02-13**|**Neural Spatiotemporal Point Processes: Trends and Challenges**|Sumantrak Mukherjee et.al.|[2502.09341v1](http://arxiv.org/abs/2502.09341v1)|null|
+|**2025-02-13**|**Graph Diffusion Network for Drug-Gene Prediction**|Jiayang Wu et.al.|[2502.09335v1](http://arxiv.org/abs/2502.09335v1)|null|
+|**2025-02-13**|**Beyond English: The Impact of Prompt Translation Strategies across Languages and Tasks in Multilingual LLMs**|Itai Mondshine et.al.|[2502.09331v1](http://arxiv.org/abs/2502.09331v1)|null|
+|**2025-02-13**|**A Judge-free LLM Open-ended Generation Benchmark Based on the Distributional Hypothesis**|Kentaro Imajo et.al.|[2502.09316v1](http://arxiv.org/abs/2502.09316v1)|null|
+|**2025-02-13**|**When the LM misunderstood the human chuckled: Analyzing garden path effects in humans and language models**|Samuel Joseph Amouyal et.al.|[2502.09307v1](http://arxiv.org/abs/2502.09307v1)|null|
+|**2025-02-13**|**Indeterminacy in Affective Computing: Considering Meaning and Context in Data Collection Practices**|Bernd Dudzik et.al.|[2502.09294v1](http://arxiv.org/abs/2502.09294v1)|null|
+|**2025-02-13**|**SparQLe: Speech Queries to Text Translation Through LLMs**|Amirbek Djanibekov et.al.|[2502.09284v1](http://arxiv.org/abs/2502.09284v1)|null|
+|**2025-02-13**|**LiSA: Leveraging Link Recommender to Attack Graph Neural Networks via Subgraph Injection**|Wenlun Zhang et.al.|[2502.09271v1](http://arxiv.org/abs/2502.09271v1)|null|
+|**2025-02-13**|**AnomalyGFM: Graph Foundation Model for Zero/Few-shot Anomaly Detection**|Hezhe Qiao et.al.|[2502.09254v1](http://arxiv.org/abs/2502.09254v1)|null|
+|**2025-02-13**|**The Joint Entity-Relation Extraction Model Based on Span and Interactive Fusion Representation for Chinese Medical Texts with Complex Semantics**|Danni Feng et.al.|[2502.09247v1](http://arxiv.org/abs/2502.09247v1)|null|
+|**2025-02-13**|**You Do Not Fully Utilize Transformer's Representation Capacity**|Gleb Gerasimov et.al.|[2502.09245v1](http://arxiv.org/abs/2502.09245v1)|null|
+|**2025-02-13**|**From large language models to multimodal AI: A scoping review on the potential of generative AI in medicine**|Lukas Buess et.al.|[2502.09242v1](http://arxiv.org/abs/2502.09242v1)|null|
+|**2025-02-13**|**Reliable Conversational Agents under ASP Control that Understand Natural Language**|Yankai Zeng et.al.|[2502.09237v1](http://arxiv.org/abs/2502.09237v1)|null|
+|**2025-02-13**|**Commonsense Reasoning-Aided Autonomous Vehicle Systems**|Keegan Kimbrell et.al.|[2502.09233v1](http://arxiv.org/abs/2502.09233v1)|null|
+|**2025-02-13**|**Logical foundations of Smart Contracts**|Kalonji Kalala et.al.|[2502.09232v1](http://arxiv.org/abs/2502.09232v1)|null|
+|**2025-02-13**|**Relating Answer Set Programming and Many-sorted Logics for Formal Verification**|Zachary Hansen et.al.|[2502.09230v1](http://arxiv.org/abs/2502.09230v1)|null|
+|**2025-02-13**|**Computational methods for Dynamic Answer Set Programming**|Susana Hahn et.al.|[2502.09228v1](http://arxiv.org/abs/2502.09228v1)|null|
+|**2025-02-13**|**Generating Causally Compliant Counterfactual Explanations using ASP**|Sopam Dasgupta et.al.|[2502.09226v1](http://arxiv.org/abs/2502.09226v1)|null|
+|**2025-02-13**|**Order-Sorted Intensional Logic: Expressing Subtyping Polymorphism with Typing Assertions and Quantification over Concepts**|Đorđe Marković et.al.|[2502.09224v1](http://arxiv.org/abs/2502.09224v1)|null|
+|**2025-02-13**|**ASP-driven User-interaction with Clinguin**|Alexander Beiser et.al.|[2502.09222v1](http://arxiv.org/abs/2502.09222v1)|null|
+|**2025-02-13**|**Pearce's Characterisation in an Epistemic Domain**|Ezgi Iraz Su et.al.|[2502.09221v1](http://arxiv.org/abs/2502.09221v1)|null|
+|**2025-02-13**|**Graphical Conditions for the Existence, Unicity and Number of Regular Models**|Van-Giang Trinh et.al.|[2502.09220v1](http://arxiv.org/abs/2502.09220v1)|null|
+|**2025-02-13**|**Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**|Flavio Bertini et.al.|[2502.09218v1](http://arxiv.org/abs/2502.09218v1)|null|
+|**2025-02-13**|**Mind the Gaps: Logical English, Prolog, and Multi-agent Systems for Autonomous Vehicles**|Galileo Sartor et.al.|[2502.09216v1](http://arxiv.org/abs/2502.09216v1)|null|
+|**2025-02-13**|**Architecture for Simulating Behavior Mode Changes in Norm-Aware Autonomous Agents**|Sean Glaze et.al.|[2502.09215v1](http://arxiv.org/abs/2502.09215v1)|null|
+|**2025-02-13**|**Neuro-Symbolic Contrastive Learning for Cross-domain Inference**|Mingyue Liu et.al.|[2502.09213v1](http://arxiv.org/abs/2502.09213v1)|null|
+|**2025-02-13**|**LP-LM: No Hallucinations in Question Answering with Logic Programming**|Katherine Wu et.al.|[2502.09212v1](http://arxiv.org/abs/2502.09212v1)|null|
+|**2025-02-13**|**Visual Graph Question Answering with ASP and LLMs for Language Parsing**|Jakob Johannes Bauer et.al.|[2502.09211v1](http://arxiv.org/abs/2502.09211v1)|null|
+|**2025-02-13**|**On LLM-generated Logic Programs and their Inference Execution Methods**|Paul Tarau et.al.|[2502.09209v1](http://arxiv.org/abs/2502.09209v1)|null|
+|**2025-02-13**|**Efficient OWL2QL Meta-reasoning Using ASP-based Hybrid Knowledge Bases**|Haya Majid Qureshi et.al.|[2502.09206v1](http://arxiv.org/abs/2502.09206v1)|null|
+|**2025-02-13**|**Counterfactual Explanations as Plans**|Vaishak Belle et.al.|[2502.09205v1](http://arxiv.org/abs/2502.09205v1)|null|
+|**2025-02-13**|**Logical Lease Litigation: Prolog and LLMs for Rental Law Compliance in New York**|Sanskar Sehgal et.al.|[2502.09204v1](http://arxiv.org/abs/2502.09204v1)|null|
+|**2025-02-13**|**Thinking beyond the anthropomorphic paradigm benefits LLM research**|Lujain Ibrahim et.al.|[2502.09192v1](http://arxiv.org/abs/2502.09192v1)|null|
+|**2025-02-13**|**Matina: A Large-Scale 73B Token Persian Text Corpus**|Sara Bourbour Hosseinbeigi et.al.|[2502.09188v1](http://arxiv.org/abs/2502.09188v1)|null|
+|**2025-02-13**|**RefineCoder: Iterative Improving of Large Language Models via Adaptive Critique Refinement for Code Generation**|Changzhi Zhou et.al.|[2502.09183v1](http://arxiv.org/abs/2502.09183v1)|null|
+|**2025-02-13**|**FLAME: Flexible LLM-Assisted Moderation Engine**|Ivan Bakulin et.al.|[2502.09175v1](http://arxiv.org/abs/2502.09175v1)|null|
+|**2025-02-13**|**Two-Stage Representation Learning for Analyzing Movement Behavior Dynamics in People Living with Dementia**|Jin Cui et.al.|[2502.09173v1](http://arxiv.org/abs/2502.09173v1)|null|
+|**2025-02-13**|**Musical Heritage Historical Entity Linking**|Arianna Graciotti et.al.|[2502.09168v1](http://arxiv.org/abs/2502.09168v1)|null|
+|**2025-02-13**|**Improving TCM Question Answering through Tree-Organized Self-Reflective Retrieval with LLMs**|Chang Liu et.al.|[2502.09156v1](http://arxiv.org/abs/2502.09156v1)|null|
+|**2025-02-13**|**A Novel Dialect-Aware Framework for the Classification of Arabic Dialects and Emotions**|Nasser A Alsadhan et.al.|[2502.09128v1](http://arxiv.org/abs/2502.09128v1)|null|
+|**2025-02-13**|**Automatic Pruning via Structured Lasso with Class-wise Information**|Xiang Liu et.al.|[2502.09125v1](http://arxiv.org/abs/2502.09125v1)|null|
+|**2025-02-13**|**The influence of visual and linguistic cues on ignorance inference in Vision-Language Models (VLMs)**|Ye-eun Cho et.al.|[2502.09120v1](http://arxiv.org/abs/2502.09120v1)|null|
+|**2025-02-13**|**One-shot Federated Learning Methods: A Practical Guide**|Xiang Liu et.al.|[2502.09104v1](http://arxiv.org/abs/2502.09104v1)|null|
+|**2025-02-13**|**Logical Reasoning in Large Language Models: A Survey**|Hanmeng Liu et.al.|[2502.09100v1](http://arxiv.org/abs/2502.09100v1)|null|
+|**2025-02-13**|**A Hybrid Transformer Model for Fake News Detection: Leveraging Bayesian Optimization and Bidirectional Recurrent Unit**|Tianyi Huang et.al.|[2502.09097v1](http://arxiv.org/abs/2502.09097v1)|null|
+|**2025-02-13**|**A Hybrid Model for Few-Shot Text Classification Using Transfer and Meta-Learning**|Jia Gao et.al.|[2502.09086v1](http://arxiv.org/abs/2502.09086v1)|null|
+|**2025-02-13**|**Show Me the Work: Fact-Checkers' Requirements for Explainable Automated Fact-Checking**|Greta Warren et.al.|[2502.09083v1](http://arxiv.org/abs/2502.09083v1)|null|
+|**2025-02-13**|**CoSER: Coordinating LLM-Based Persona Simulation of Established Roles**|Xintao Wang et.al.|[2502.09082v1](http://arxiv.org/abs/2502.09082v1)|null|
+|**2025-02-13**|**Enhancing RAG with Active Learning on Conversation Records: Reject Incapables and Answer Capables**|Xuzhao Geng et.al.|[2502.09073v1](http://arxiv.org/abs/2502.09073v1)|null|
+|**2025-02-13**|**An Open Recipe: Adapting Language-Specific LLMs to a Reasoning Model in One Day via Model Merging**|Kunat Pipatanakul et.al.|[2502.09056v1](http://arxiv.org/abs/2502.09056v1)|null|
+|**2025-02-13**|**Cost-Saving LLM Cascades with Early Abstention**|Michael J. Zellinger et.al.|[2502.09054v1](http://arxiv.org/abs/2502.09054v1)|null|
+|**2025-02-13**|**Game Theory Meets Large Language Models: A Systematic Survey**|Haoran Sun et.al.|[2502.09053v1](http://arxiv.org/abs/2502.09053v1)|null|
+|**2025-02-13**|**AIDE: Agentically Improve Visual Language Model with Domain Experts**|Ming-Chang Chiu et.al.|[2502.09051v1](http://arxiv.org/abs/2502.09051v1)|null|
+|**2025-02-13**|**Leveraging Member-Group Relations via Multi-View Graph Filtering for Effective Group Recommendation**|Chae-Hyun Kim et.al.|[2502.09050v1](http://arxiv.org/abs/2502.09050v1)|null|
+|**2025-02-13**|**Criteria-Aware Graph Filtering: Extremely Fast Yet Accurate Multi-Criteria Recommendation**|Jin-Duk Park et.al.|[2502.09046v1](http://arxiv.org/abs/2502.09046v1)|null|
+|**2025-02-13**|**Typhoon T1: An Open Thai Reasoning Model**|Pittawat Taveekitworachai et.al.|[2502.09042v1](http://arxiv.org/abs/2502.09042v1)|null|
+|**2025-02-13**|**Mechanistic Unveiling of Transformer Circuits: Self-Influence as a Key to Model Reasoning**|Lin Zhang et.al.|[2502.09022v1](http://arxiv.org/abs/2502.09022v1)|null|
+|**2025-02-13**|**EventSTR: A Benchmark Dataset and Baselines for Event Stream based Scene Text Recognition**|Xiao Wang et.al.|[2502.09020v1](http://arxiv.org/abs/2502.09020v1)|null|
+|**2025-02-13**|**Zero-shot Concept Bottleneck Models**|Shin'ya Yamaguchi et.al.|[2502.09018v1](http://arxiv.org/abs/2502.09018v1)|null|
+|**2025-02-13**|**Diversity Enhances an LLM's Performance in RAG and Long-context Task**|Zhchao Wang et.al.|[2502.09017v1](http://arxiv.org/abs/2502.09017v1)|null|
+|**2025-02-13**|**Hope vs. Hate: Understanding User Interactions with LGBTQ+ News Content in Mainstream US News Media through the Lens of Hope Speech**|Jonathan Pofcher et.al.|[2502.09004v1](http://arxiv.org/abs/2502.09004v1)|null|
+|**2025-02-13**|**RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models**|Quan Wei et.al.|[2502.09003v1](http://arxiv.org/abs/2502.09003v1)|null|
+|**2025-02-13**|**PixLift: Accelerating Web Browsing via AI Upscaling**|Yonas Atinafu et.al.|[2502.08995v1](http://arxiv.org/abs/2502.08995v1)|null|
+|**2025-02-13**|**RLSA-PFL: Robust Lightweight Secure Aggregation with Model Inconsistency Detection in Privacy-Preserving Federated Learning**|Nazatul H. Sultan et.al.|[2502.08989v1](http://arxiv.org/abs/2502.08989v1)|null|
+|**2025-02-13**|**Neural Force Field: Learning Generalized Physical Representation from a Few Examples**|Shiqian Li et.al.|[2502.08987v1](http://arxiv.org/abs/2502.08987v1)|null|
+|**2025-02-13**|**Tuning-Free Personalized Alignment via Trial-Error-Explain In-Context Learning**|Hyundong Cho et.al.|[2502.08972v1](http://arxiv.org/abs/2502.08972v1)|null|
+|**2025-02-13**|**RTBAS: Defending LLM Agents Against Prompt Injection and Privacy Leakage**|Peter Yong Zhong et.al.|[2502.08966v1](http://arxiv.org/abs/2502.08966v1)|null|
+|**2025-02-13**|**Biologically Plausible Brain Graph Transformer**|Ciyuan Peng et.al.|[2502.08958v1](http://arxiv.org/abs/2502.08958v1)|null|
+|**2025-02-13**|**Medicine on the Edge: Comparative Performance Analysis of On-Device LLMs for Clinical Reasoning**|Leon Nissen et.al.|[2502.08954v1](http://arxiv.org/abs/2502.08954v1)|null|
 
-##### **Explaining 3D Computed Tomography Classifiers with Counterfactuals**
-2502.07156v1 by Joseph Paul Cohen, Louis Blankemeier, Akshay Chaudhari
+#### Abstracts
+##### **Theoretical Benefit and Limitation of Diffusion Language Model**
+2502.09622v1 by Guhao Feng, Yihan Geng, Jian Guan, Wei Wu, Liwei Wang, Di He
 
-Counterfactual explanations in medical imaging are critical for understanding
-the predictions made by deep learning models. We extend the Latent Shift
-counterfactual generation method from 2D applications to 3D computed tomography
-(CT) scans. We address the challenges associated with 3D data, such as limited
-training samples and high memory demands, by implementing a slice-based
-approach. This method leverages a 2D encoder trained on CT slices, which are
-subsequently combined to maintain 3D context. We demonstrate this technique on
-two models for clinical phenotype prediction and lung segmentation. Our
-approach is both memory-efficient and effective for generating interpretable
-counterfactuals in high-resolution 3D medical imaging.
+Diffusion language models have emerged as a promising approach for text
+generation. One would naturally expect this method to be an efficient
+replacement for autoregressive models since multiple tokens can be sampled in
+parallel during each diffusion step. However, its efficiency-accuracy trade-off
+is not yet well understood. In this paper, we present a rigorous theoretical
+analysis of a widely used type of diffusion language model, the Masked
+Diffusion Model (MDM), and find that its effectiveness heavily depends on the
+target evaluation metric. Under mild conditions, we prove that when using
+perplexity as the metric, MDMs can achieve near-optimal perplexity in sampling
+steps regardless of sequence length, demonstrating that efficiency can be
+achieved without sacrificing performance. However, when using the sequence
+error rate--which is important for understanding the "correctness" of a
+sequence, such as a reasoning chain--we show that the required sampling steps
+must scale linearly with sequence length to obtain "correct" sequences, thereby
+eliminating MDM's efficiency advantage over autoregressive models. Our analysis
+establishes the first theoretical foundation for understanding the benefits and
+limitations of MDMs. All theoretical findings are supported by empirical
+studies.
 
-摘要：反事實解釋在醫學影像中對於理解深度學習模型所做的預測至關重要。我們將 Latent Shift 反事實生成方法從 2D 應用程式延伸到 3D 電腦斷層掃描 (CT) 掃描。我們透過實作基於切片的做法，來解決與 3D 資料相關的挑戰，例如受限的訓練樣本和高記憶體需求。此方法利用經過 CT 切片訓練的 2D 編碼器，隨後將這些切片結合起來以維護 3D 背景。我們在兩個用於臨床表型預測和肺部分割的模型上展示此技術。我們的做法對於在高解析度 3D 醫學影像中產生可解釋的反事實，既節省記憶體又有效。
+摘要：擴散語言模型已成為文字生成的一種有前途的方法。由於在每個擴散步驟期間可以並行採樣多個符號，因此人們自然會期望這種方法成為自迴歸模型的有效替代方案。然而，它的效率準確性權衡尚未得到很好的理解。在本文中，我們對廣泛使用的擴散語言模型類型，即遮罩擴散模型 (MDM) 進行了嚴格的理論分析，並發現其有效性在很大程度上取決於目標評估指標。在溫和條件下，我們證明了當使用困惑度作為指標時，MDM 可以無論序列長度如何，在採樣步驟中實現近乎最佳的困惑度，這表明可以在不犧牲性能的情況下實現效率。然而，當使用序列錯誤率（對於理解序列的「正確性」很重要，例如推理鏈）時，我們表明所需的採樣步驟必須隨著序列長度線性縮放才能獲得「正確」的序列，從而消除了 MDM 相對於自迴歸模型的效率優勢。我們的分析為理解 MDM 的優點和局限性建立了第一個理論基礎。所有理論發現都得到了實證研究的支持。
 
-##### **Interactive Data Harmonization with LLM Agents**
-2502.07132v1 by Aécio Santos, Eduardo H. M. Pena, Roque Lopez, Juliana Freire
+##### **MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency**
+2502.09621v1 by Dongzhi Jiang, Renrui Zhang, Ziyu Guo, Yanwei Li, Yu Qi, Xinyan Chen, Liuhui Wang, Jianhan Jin, Claire Guo, Shen Yan, Bo Zhang, Chaoyou Fu, Peng Gao, Hongsheng Li
 
-Data harmonization is an essential task that entails integrating datasets
-from diverse sources. Despite years of research in this area, it remains a
-time-consuming and challenging task due to schema mismatches, varying
-terminologies, and differences in data collection methodologies. This paper
-presents the case for agentic data harmonization as a means to both empower
-experts to harmonize their data and to streamline the process. We introduce
-Harmonia, a system that combines LLM-based reasoning, an interactive user
-interface, and a library of data harmonization primitives to automate the
-synthesis of data harmonization pipelines. We demonstrate Harmonia in a
-clinical data harmonization scenario, where it helps to interactively create
-reusable pipelines that map datasets to a standard format. Finally, we discuss
-challenges and open problems, and suggest research directions for advancing our
-vision.
+Answering questions with Chain-of-Thought (CoT) has significantly enhanced
+the reasoning capabilities of Large Language Models (LLMs), yet its impact on
+Large Multimodal Models (LMMs) still lacks a systematic assessment and in-depth
+investigation. In this paper, we introduce MME-CoT, a specialized benchmark
+evaluating the CoT reasoning performance of LMMs, spanning six domains: math,
+science, OCR, logic, space-time, and general scenes. As the first comprehensive
+study in this area, we propose a thorough evaluation suite incorporating three
+novel metrics that assess the reasoning quality, robustness, and efficiency at
+a fine-grained level. Leveraging curated high-quality data and a unique
+evaluation strategy, we conduct an in-depth analysis of state-of-the-art LMMs,
+uncovering several key insights: 1) Models with reflection mechanism
+demonstrate a superior CoT quality, with Kimi k1.5 outperforming GPT-4o and
+demonstrating the highest quality results; 2) CoT prompting often degrades LMM
+performance on perception-heavy tasks, suggesting a potentially harmful
+overthinking behavior; and 3) Although the CoT quality is high, LMMs with
+reflection exhibit significant inefficiency in both normal response and
+self-correction phases. We hope MME-CoT serves as a foundation for advancing
+multimodal reasoning in LMMs. Project Page: https://mmecot.github.io/
 
-摘要：資料調和是一項整合不同來源資料集的重要任務。儘管多年來針對此領域的研究不斷，但由於架構不匹配、術語不同，以及資料收集方法的差異，它仍然是一項耗時且具有挑戰性的任務。本文提出代理資料調和，作為賦能專家調和其資料並簡化流程的方法。我們介紹 Harmonia，一個結合了基於 LLM 的推理、互動式使用者介面和資料調和原語庫的系統，以自動化資料調和管線的合成。我們在臨床資料調和場景中展示了 Harmonia，它有助於互動式建立可重複使用的管線，將資料集對應至標準格式。最後，我們討論挑戰和開放性問題，並建議研究方向以推進我們的願景。
+摘要：<paragraph>透過思維鏈（CoT）回答問題，大幅提升了大型語言模型（LLM）的推理能力，但其對大型多模態模型（LMM）的影響仍缺乏系統性的評估和深入探討。在本文中，我們引入了 MME-CoT，一個專門的基準測試，用於評估 LMM 的 CoT 推理效能，涵蓋六個領域：數學、科學、OCR、邏輯、時空和一般場景。作為該領域的第一個全面性研究，我們提出了一個全面的評估套件，包含三個創新的指標，用於評估推理品質、穩健性和效率，並達到細微的層級。透過利用策展的高品質資料和獨特的評估策略，我們對最先進的 LMM 進行深入分析，發現了幾個關鍵見解：1）具有反思機制的模型展現出優異的 CoT 品質，其中 Kimi k1.5 優於 GPT-4o，並展現出最高品質的結果；2）CoT 提示通常會降低 LMM 在感知密集任務上的效能，這表示潛在有害的過度思考行為；3）儘管 CoT 品質很高，但具有反思能力的 LMM 在一般回應和自我修正階段都展現出顯著的低效率。我們希望 MME-CoT 能作為促進 LMM 中多模態推理的基礎。專案頁面：https://mmecot.github.io/</paragraph>
 
-##### **Machine Learning for Everyone: Simplifying Healthcare Analytics with BigQuery ML**
-2502.07026v1 by Mohammad Amir Salari, Bahareh Rahmani
+##### **Exploring the Potential of Encoder-free Architectures in 3D LMMs**
+2502.09620v1 by Yiwen Tang, Zoey Guo, Zhuhao Wang, Ray Zhang, Qizhi Chen, Junli Liu, Delin Qu, Zhigang Wang, Dong Wang, Xuelong Li, Bin Zhao
 
-Machine learning (ML) is transforming healthcare by enabling predictive
-analytics, personalized treatments, and improved patient outcomes. However,
-traditional ML workflows require specialized skills, infrastructure, and
-resources, limiting accessibility for many healthcare professionals. This paper
-explores how Google Cloud's BigQuery ML simplifies the development and
-deployment of ML models using SQL, reducing technical barriers. Through a case
-study on diabetes prediction using the Diabetes Health Indicators Dataset, we
-evaluate three predictive models: Logistic Regression, Boosted Tree, and Deep
-Neural Network (DNN). Our results demonstrate that the Boosted Tree model
-achieves the highest performance, making it highly effective for diabetes
-prediction. This study highlights BigQuery ML's role in democratizing machine
-learning by providing a scalable, efficient, and accessible solution for
-healthcare analytics.
+Encoder-free architectures have been preliminarily explored in the 2D visual
+domain, yet it remains an open question whether they can be effectively applied
+to 3D understanding scenarios. In this paper, we present the first
+comprehensive investigation into the potential of encoder-free architectures to
+overcome the challenges of encoder-based 3D Large Multimodal Models (LMMs).
+These challenges include the failure to adapt to varying point cloud
+resolutions and the point features from the encoder not meeting the semantic
+needs of Large Language Models (LLMs). We identify key aspects for 3D LMMs to
+remove the encoder and enable the LLM to assume the role of the 3D encoder: 1)
+We propose the LLM-embedded Semantic Encoding strategy in the pre-training
+stage, exploring the effects of various point cloud self-supervised losses. And
+we present the Hybrid Semantic Loss to extract high-level semantics. 2) We
+introduce the Hierarchical Geometry Aggregation strategy in the instruction
+tuning stage. This incorporates inductive bias into the LLM early layers to
+focus on the local details of the point clouds. To the end, we present the
+first Encoder-free 3D LMM, ENEL. Our 7B model rivals the current
+state-of-the-art model, ShapeLLM-13B, achieving 55.0%, 50.92%, and 42.7% on the
+classification, captioning, and VQA tasks, respectively. Our results
+demonstrate that the encoder-free architecture is highly promising for
+replacing encoder-based architectures in the field of 3D understanding. The
+code is released at https://github.com/Ivan-Tang-3D/ENEL
 
-摘要：機器學習 (ML) 透過啟用預測分析、個人化治療和改善病患結果，正在轉型醫療保健。然而，傳統的 ML 工作流程需要專業技能、基礎設施和資源，限制了許多醫療保健專業人員的可及性。本文探討 Google Cloud 的 BigQuery ML 如何使用 SQL 簡化 ML 模型的開發和部署，降低技術障礙。透過使用糖尿病健康指標資料集對糖尿病預測進行個案研究，我們評估了三個預測模型：邏輯迴歸、提升樹和深度神經網路 (DNN)。我們的結果證明，提升樹模型達到了最高的效能，使其對於糖尿病預測非常有效。這項研究強調了 BigQuery ML 在民主化機器學習中扮演的角色，提供可擴充、有效率且可存取的醫療保健分析解決方案。
+摘要：<paragraph>編碼器免費架構已在 2D 視覺領域中初步探索，但它們是否能有效應用於 3D 理解場景仍是一個開放的問題。在本文中，我們提出了對編碼器免費架構潛力的首次全面調查，以克服基於編碼器的 3D 大型多模態模型 (LMM) 的挑戰。這些挑戰包括無法適應不同的點雲解析度，且來自編碼器的點特徵無法滿足大型語言模型 (LLM) 的語義需求。我們識別出 3D LMM 的關鍵方面，以移除編碼器並讓 LLM 承擔 3D 編碼器的角色：1) 我們在預訓練階段提出 LLM 嵌入式語義編碼策略，探索各種點雲自我監督損失的影響。我們提出混合語義損失來提取高階語義。2) 我們在指令調整階段引入分層幾何聚合策略。這將歸納偏差納入 LLM 早期層，以專注於點雲的局部細節。最後，我們提出第一個無編碼器 3D LMM，ENEL。我們的 7B 模型與當前最先進的模型 ShapeLLM-13B 相媲美，分別在分類、字幕和 VQA 任務中達到 55.0%、50.92% 和 42.7%。我們的結果表明，無編碼器架構極有望取代基於編碼器的架構在 3D 理解領域的應用。程式碼發布於 https://github.com/Ivan-Tang-3D/ENEL</paragraph>
 
-##### **AIMS.au: A Dataset for the Analysis of Modern Slavery Countermeasures in Corporate Statements**
-2502.07022v1 by Adriana Eufrosiana Bora, Pierre-Luc St-Charles, Mirko Bronzi, Arsène Fansi Tchango, Bruno Rousseau, Kerrie Mengersen
+##### **DexTrack: Towards Generalizable Neural Tracking Control for Dexterous Manipulation from Human References**
+2502.09614v1 by Xueyi Liu, Jianibieke Adalibieke, Qianwei Han, Yuzhe Qin, Li Yi
 
-Despite over a decade of legislative efforts to address modern slavery in the
-supply chains of large corporations, the effectiveness of government oversight
-remains hampered by the challenge of scrutinizing thousands of statements
-annually. While Large Language Models (LLMs) can be considered a well
-established solution for the automatic analysis and summarization of documents,
-recognizing concrete modern slavery countermeasures taken by companies and
-differentiating those from vague claims remains a challenging task. To help
-evaluate and fine-tune LLMs for the assessment of corporate statements, we
-introduce a dataset composed of 5,731 modern slavery statements taken from the
-Australian Modern Slavery Register and annotated at the sentence level. This
-paper details the construction steps for the dataset that include the careful
-design of annotation specifications, the selection and preprocessing of
-statements, and the creation of high-quality annotation subsets for effective
-model evaluations. To demonstrate our dataset's utility, we propose a machine
-learning methodology for the detection of sentences relevant to mandatory
-reporting requirements set by the Australian Modern Slavery Act. We then follow
-this methodology to benchmark modern language models under zero-shot and
-supervised learning settings.
+We address the challenge of developing a generalizable neural tracking
+controller for dexterous manipulation from human references. This controller
+aims to manage a dexterous robot hand to manipulate diverse objects for various
+purposes defined by kinematic human-object interactions. Developing such a
+controller is complicated by the intricate contact dynamics of dexterous
+manipulation and the need for adaptivity, generalizability, and robustness.
+Current reinforcement learning and trajectory optimization methods often fall
+short due to their dependence on task-specific rewards or precise system
+models. We introduce an approach that curates large-scale successful robot
+tracking demonstrations, comprising pairs of human references and robot
+actions, to train a neural controller. Utilizing a data flywheel, we
+iteratively enhance the controller's performance, as well as the number and
+quality of successful tracking demonstrations. We exploit available tracking
+demonstrations and carefully integrate reinforcement learning and imitation
+learning to boost the controller's performance in dynamic environments. At the
+same time, to obtain high-quality tracking demonstrations, we individually
+optimize per-trajectory tracking by leveraging the learned tracking controller
+in a homotopy optimization method. The homotopy optimization, mimicking
+chain-of-thought, aids in solving challenging trajectory tracking problems to
+increase demonstration diversity. We showcase our success by training a
+generalizable neural controller and evaluating it in both simulation and real
+world. Our method achieves over a 10% improvement in success rates compared to
+leading baselines. The project website with animated results is available at
+https://meowuu7.github.io/DexTrack/.
 
-摘要：儘管立法努力超過十年，旨在解決大型企業供應鏈中的現代奴隸制，但政府監督的有效性仍然受到每年審查數千份聲明的挑戰所阻礙。雖然大型語言模型（LLM）可以被認為是文件自動分析和摘要的完善解決方案，但要辨識公司採取的具體現代奴隸制對策，並將其與含糊的聲明區分開來，仍然是一項具有挑戰性的任務。為了幫助評估和微調 LLM 以評估企業聲明，我們引入了一個由 5,731 份現代奴隸制聲明組成的資料集，這些聲明取自澳洲現代奴隸制註冊處，並在句子層級進行註解。本文詳細說明了資料集的建構步驟，其中包括註解規格的仔細設計、聲明的選擇和預處理，以及用於有效模型評估的高品質註解子集的建立。為了展示我們的資料集的效用，我們提出了一種機器學習方法，用於檢測與澳洲現代奴隸制法規定的強制性報告要求相關的句子。然後，我們遵循這種方法，在零次學習和監督學習設定下對現代語言模型進行基準測試。
+摘要：<paragraph>我們解決了從人類參照中開發靈巧操作通用神經追蹤控制器的挑戰。此控制器旨在管理靈巧機器人手，以操作各種物體，以實現由運動學人機互動定義的各種目的。由於靈巧操作的複雜接觸動力學以及對適應性、通用性和魯棒性的需求，開發此類控制器很複雜。目前的強化學習和軌跡優化方法通常由於依賴於特定任務的獎勵或精確的系統模型而表現不佳。我們引入了一種方法，它策劃了大規模成功的機器人追蹤示範，包括人體參照和機器人動作對，以訓練神經控制器。利用數據飛輪，我們反覆增強控制器的性能，以及成功追蹤示範的數量和品質。我們利用可用的追蹤示範，並仔細整合強化學習和模仿學習，以提升控制器在動態環境中的性能。同時，為了獲得高品質的追蹤示範，我們透過在同倫優化方法中利用已學習的追蹤控制器，個別優化每個軌跡的追蹤。同倫優化模擬思考鏈，有助於解決具有挑戰性的軌跡追蹤問題，以增加示範的多樣性。我們展示了我們在訓練通用神經控制器並在模擬和真實世界中評估它的成功。與領先的基準相比，我們的模型在成功率方面提高了 10% 以上。包含動畫結果的專案網站可在 https://meowuu7.github.io/DexTrack/ 取得。</paragraph>
 
-##### **Recent Advances, Applications and Open Challenges in Machine Learning for Health: Reflections from Research Roundtables at ML4H 2024 Symposium**
-2502.06693v1 by Amin Adibi, Xu Cao, Zongliang Ji, Jivat Neet Kaur, Winston Chen, Elizabeth Healey, Brighton Nuwagira, Wenqian Ye, Geoffrey Woollard, Maxwell A Xu, Hejie Cui, Johnny Xi, Trenton Chang, Vasiliki Bikia, Nicole Zhang, Ayush Noori, Yuan Xia, Md. Belal Hossain, Hanna A. Frank, Alina Peluso, Yuan Pu, Shannon Zejiang Shen, John Wu, Adibvafa Fallahpour, Sazan Mahbub, Ross Duncan, Yuwei Zhang, Yurui Cao, Zuheng Xu, Michael Craig, Rahul G. Krishnan, Rahmatollah Beheshti, James M. Rehg, Mohammad Ehsanul Karim, Megan Coffee, Leo Anthony Celi, Jason Alan Fries, Mohsen Sadatsafavi, Dennis Shung, Shannon McWeeney, Jessica Dafflon, Sarah Jabbour
+##### **Score-of-Mixture Training: Training One-Step Generative Models Made Simple**
+2502.09609v1 by Tejas Jayashankar, J. Jon Ryu, Gregory Wornell
 
-The fourth Machine Learning for Health (ML4H) symposium was held in person on
-December 15th and 16th, 2024, in the traditional, ancestral, and unceded
-territories of the Musqueam, Squamish, and Tsleil-Waututh Nations in Vancouver,
-British Columbia, Canada. The symposium included research roundtable sessions
-to foster discussions between participants and senior researchers on timely and
-relevant topics for the ML4H community. The organization of the research
-roundtables at the conference involved 13 senior and 27 junior chairs across 13
-tables. Each roundtable session included an invited senior chair (with
-substantial experience in the field), junior chairs (responsible for
-facilitating the discussion), and attendees from diverse backgrounds with an
-interest in the session's topic.
+We propose Score-of-Mixture Training (SMT), a novel framework for training
+one-step generative models by minimizing a class of divergences called the
+$\alpha$-skew Jensen-Shannon divergence. At its core, SMT estimates the score
+of mixture distributions between real and fake samples across multiple noise
+levels. Similar to consistency models, our approach supports both training from
+scratch (SMT) and distillation using a pretrained diffusion model, which we
+call Score-of-Mixture Distillation (SMD). It is simple to implement, requires
+minimal hyperparameter tuning, and ensures stable training. Experiments on
+CIFAR-10 and ImageNet 64x64 show that SMT/SMD are competitive with and can even
+outperform existing methods.
 
-摘要：第四屆醫療機器學習 (ML4H) 研討會於 2024 年 12 月 15 日和 16 日在加拿大不列顛哥倫比亞省溫哥華的 Musqueam、Squamish 和 Tsleil-Waututh 國家的傳統、祖先和未割讓領土上舉行。研討會包括研究圓桌會議，以促進參與者和高級研究人員之間關於 ML4H 社群的及時和相關主題的討論。在會議上組織研究圓桌會議涉及 13 張桌子上的 13 位高級主席和 27 位初級主席。每個圓桌會議都包括一位受邀的高級主席（在該領域擁有豐富的經驗）、初級主席（負責促進討論）以及對會議主題感興趣的來自不同背景的與會者。
+摘要：我們提出混合評分訓練 (SMT)，一種透過最小化稱為 $\alpha$-偏斜 Jensen-Shannon 距離的距離類別來訓練單步生成模型的新穎架構。在核心部分，SMT 估計真實和虛假樣本之間在多個雜訊層級的混合分配評分。與一致性模型類似，我們的做法支援從頭開始訓練 (SMT) 和使用預先訓練的擴散模型進行蒸餾，我們稱之為混合評分蒸餾 (SMD)。它易於實作，只需要最小的超參數調整，並確保穩定的訓練。在 CIFAR-10 和 ImageNet 64x64 上的實驗顯示，SMT/SMD 具有競爭力，甚至可以優於現有方法。
 
-##### **Automatic Evaluation of Healthcare LLMs Beyond Question-Answering**
-2502.06666v1 by Anna Arias-Duart, Pablo Agustin Martin-Torres, Daniel Hinjos, Pablo Bernabeu-Perez, Lucia Urcelay Ganzabal, Marta Gonzalez Mallo, Ashwin Kumar Gururajan, Enrique Lopez-Cuena, Sergio Alvarez-Napagao, Dario Garcia-Gasulla
+##### **Human-LLM Coevolution: Evidence from Academic Writing**
+2502.09606v1 by Mingmeng Geng, Roberto Trotta
 
-Current Large Language Models (LLMs) benchmarks are often based on open-ended
-or close-ended QA evaluations, avoiding the requirement of human labor.
-Close-ended measurements evaluate the factuality of responses but lack
-expressiveness. Open-ended capture the model's capacity to produce discourse
-responses but are harder to assess for correctness. These two approaches are
-commonly used, either independently or together, though their relationship
-remains poorly understood. This work is focused on the healthcare domain, where
-both factuality and discourse matter greatly. It introduces a comprehensive,
-multi-axis suite for healthcare LLM evaluation, exploring correlations between
-open and close benchmarks and metrics. Findings include blind spots and
-overlaps in current methodologies. As an updated sanity check, we release a new
-medical benchmark--CareQA--, with both open and closed variants. Finally, we
-propose a novel metric for open-ended evaluations --Relaxed Perplexity-- to
-mitigate the identified limitations.
+With a statistical analysis of arXiv paper abstracts, we report a marked drop
+in the frequency of several words previously identified as overused by ChatGPT,
+such as "delve", starting soon after they were pointed out in early 2024. The
+frequency of certain other words favored by ChatGPT, such as "significant", has
+instead kept increasing. These phenomena suggest that some authors of academic
+papers have adapted their use of large language models (LLMs), for example, by
+selecting outputs or applying modifications to the LLM-generated content. Such
+coevolution and cooperation of humans and LLMs thus introduce additional
+challenges to the detection of machine-generated text in real-world scenarios.
+Estimating the impact of LLMs on academic writing by examining word frequency
+remains feasible, and more attention should be paid to words that were already
+frequently employed, including those that have decreased in frequency.
 
-摘要：當前大型語言模型 (LLM) 基準通常基於開放式或封閉式問答評量，避免了人力需求。封閉式測量評估回應的事實性，但缺乏表達力。開放式測量捕捉模型產生論述回應的能力，但較難評估正確性。這兩種方法通常獨立或合併使用，儘管它們之間的關係仍然知之甚少。這項工作專注於醫療保健領域，在該領域中，事實性和論述都非常重要。它引入了一個全面的多軸套件，用於醫療保健 LLM 評量，探索開放式和封閉式基準和指標之間的關聯性。研究結果包括當前方法中的盲點和重疊。作為更新的健全性檢查，我們發布了一個新的醫療基準--CareQA--，包含開放式和封閉式變體。最後，我們提出了一個用於開放式評量的全新指標--放鬆困惑度--以減輕已識別的限制。
+摘要：透過對 arXiv 論文摘要進行統計分析，我們報告了幾個先前被認為 ChatGPT 過度使用的詞彙的頻率大幅下降，例如「深入探討」，從 2024 年初被指出後不久就開始下降。相反地，ChatGPT 偏好的某些其他詞彙，例如「顯著」，頻率持續增加。這些現象表明，一些學術論文作者已經調整了他們使用大型語言模型 (LLM) 的方式，例如，透過選擇輸出或對 LLM 生成的內容進行修改。因此，人類和 LLM 的這種共同演化和合作為在現實世界場景中偵測機器產生的文字帶來了額外的挑戰。透過檢視詞彙頻率來評估 LLM 對學術寫作的影響仍然可行，並且應該對已經頻繁使用的詞彙給予更多關注，包括那些頻率下降的詞彙。
 
-##### **Few-Shot Classification and Anatomical Localization of Tissues in SPECT Imaging**
-2502.06632v1 by Mohammed Abdul Hafeez Khan, Samuel Morries Boddepalli, Siddhartha Bhattacharyya, Debasis Mitra
+##### **SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models**
+2502.09604v1 by Yung-Sung Chuang, Benjamin Cohen-Wang, Shannon Zejiang Shen, Zhaofeng Wu, Hu Xu, Xi Victoria Lin, James Glass, Shang-Wen Li, Wen-tau Yih
 
-Accurate classification and anatomical localization are essential for
-effective medical diagnostics and research, which may be efficiently performed
-using deep learning techniques. However, availability of limited labeled data
-poses a significant challenge. To address this, we adapted Prototypical
-Networks and the Propagation-Reconstruction Network (PRNet) for few-shot
-classification and localization, respectively, in Single Photon Emission
-Computed Tomography (SPECT) images. For the proof of concept we used a
-2D-sliced image cropped around heart. The Prototypical Network, with a
-pre-trained ResNet-18 backbone, classified ventricles, myocardium, and liver
-tissues with 96.67% training and 93.33% validation accuracy. PRNet, adapted for
-2D imaging with an encoder-decoder architecture and skip connections, achieved
-a training loss of 1.395, accurately reconstructing patches and capturing
-spatial relationships. These results highlight the potential of Prototypical
-Networks for tissue classification with limited labeled data and PRNet for
-anatomical landmark localization, paving the way for improved performance in
-deep learning frameworks.
+We introduce SelfCite, a novel self-supervised approach that aligns LLMs to
+generate high-quality, fine-grained, sentence-level citations for the
+statements in their generated responses. Instead of only relying on costly and
+labor-intensive annotations, SelfCite leverages a reward signal provided by the
+LLM itself through context ablation: If a citation is necessary, removing the
+cited text from the context should prevent the same response; if sufficient,
+retaining the cited text alone should preserve the same response. This reward
+can guide the inference-time best-of-N sampling strategy to improve citation
+quality significantly, as well as be used in preference optimization to
+directly fine-tune the models for generating better citations. The
+effectiveness of SelfCite is demonstrated by increasing citation F1 up to 5.3
+points on the LongBench-Cite benchmark across five long-form question answering
+tasks.
 
-摘要：精確的分類和解剖定位對於有效的醫療診斷和研究至關重要，而這可以使用深度學習技術有效執行。然而，標記資料有限的取得會造成重大的挑戰。為了解決這個問題，我們分別調整了原型網路和傳播重建網路 (PRNet)，用於單光子發射電腦斷層掃描 (SPECT) 影像中的少量分類和定位。為了證明這個概念，我們使用圍繞心臟裁切的 2D 切片影像。原型網路，使用預先訓練的 ResNet-18 主幹，對心室、心肌和肝臟組織進行分類，訓練準確度為 96.67%，驗證準確度為 93.33%。PRNet，調整為使用編碼器解碼器架構和跳躍連接的 2D 影像，達到了 1.395 的訓練損失，精確地重建了區塊並擷取了空間關係。這些結果突出了原型網路在標記資料有限的情況下進行組織分類的潛力，以及 PRNet 在解剖標誌定位方面的潛力，為深度學習架構中效能的提升鋪平了道路。
+摘要：我們介紹 SelfCite，一種新穎的自監督方法，它將 LLM 對齊以針對其生成回應中的陳述生成高品質、細粒度、句子級別的引用。SelfCite 不僅依賴於昂貴且勞動密集的註解，還利用 LLM 本身通過上下文消融提供的獎勵信號：如果需要引用，從上下文中移除被引用的文字應當會阻止相同的回應；如果足夠，僅保留被引用的文字應當會保留相同的回應。此獎勵可以引導推理時間最佳 N 個取樣策略以顯著改善引文品質，並用於偏好最佳化以直接微調模型以生成更好的引文。SelfCite 的有效性通過在五個長篇問答任務中將 LongBench-Cite 基準上的引文 F1 提高多達 5.3 點來證明。
 
-##### **Illegal Waste Detection in Remote Sensing Images: A Case Study**
-2502.06607v2 by Federico Gibellini, Piero Fraternali, Giacomo Boracchi, Luca Morandini, Andrea Diecidue, Simona Malegori
+##### **CoT-Valve: Length-Compressible Chain-of-Thought Tuning**
+2502.09601v1 by Xinyin Ma, Guangnian Wan, Runpeng Yu, Gongfan Fang, Xinchao Wang
 
-Environmental crime currently represents the third largest criminal activity
-worldwide while threatening ecosystems as well as human health. Among the
-crimes related to this activity, improper waste management can nowadays be
-countered more easily thanks to the increasing availability and decreasing cost
-of Very-High-Resolution Remote Sensing images, which enable semi-automatic
-territory scanning in search of illegal landfills. This paper proposes a
-pipeline, developed in collaboration with professionals from a local
-environmental agency, for detecting candidate illegal dumping sites leveraging
-a classifier of Remote Sensing images. To identify the best configuration for
-such classifier, an extensive set of experiments was conducted and the impact
-of diverse image characteristics and training settings was thoroughly analyzed.
-The local environmental agency was then involved in an experimental exercise
-where outputs from the developed classifier were integrated in the experts'
-everyday work, resulting in time savings with respect to manual
-photo-interpretation. The classifier was eventually run with valuable results
-on a location outside of the training area, highlighting potential for
-cross-border applicability of the proposed pipeline.
+Chain-of-Thought significantly enhances a model's reasoning capability, but
+it also comes with a considerable increase in inference costs due to long
+chains. With the observation that the reasoning path can be easily compressed
+under easy tasks but struggle on hard tasks, we explore the feasibility of
+elastically controlling the length of reasoning paths with only one model,
+thereby reducing the inference overhead of reasoning models dynamically based
+on task difficulty. We introduce a new tuning and inference strategy named
+CoT-Valve, designed to allow models to generate reasoning chains of varying
+lengths. To achieve this, we propose to identify a direction in the parameter
+space that, when manipulated, can effectively control the length of generated
+CoT. Moreover, we show that this property is valuable for compressing the
+reasoning chain. We construct datasets with chains from long to short for the
+same questions and explore two enhanced strategies for CoT-Valve: (1) a precise
+length-compressible CoT tuning method, and (2) a progressive chain length
+compression approach. Our experiments show that CoT-Valve successfully enables
+controllability and compressibility of the chain and shows better performance
+than the prompt-based control. We applied this method to QwQ-32B-Preview,
+reducing reasoning chains on GSM8K from 741 to 225 tokens with a minor
+performance drop (95.07% to 94.92%) and on AIME from 6827 to 4629 tokens, with
+only one additional incorrect answer.
 
-摘要：環境犯罪目前是全球第三大犯罪活動，威脅生態系統和人類健康。在與此活動相關的犯罪中，不當廢物管理現在可以更容易地得到解決，這要歸功於超高解析度遙測影像越來越普及且成本下降，這使得半自動領土掃描能夠搜尋非法垃圾掩埋場。本文提出了一條管道，與當地環境機構的專業人士合作開發，用於檢測候選非法傾倒地點，利用遙測影像分類器。為了找出這種分類器的最佳配置，進行了一系列廣泛的實驗，並徹底分析了不同影像特徵和訓練設定的影響。然後，當地環境機構參與了一項實驗練習，其中將已開發分類器的輸出整合到專家的日常工作中，從而節省了人工照片解譯的時間。最後在訓練區域外的某個位置執行分類器，獲得了有價值的結果，突出了所提出管道的跨境適用性潛力。
+摘要：<paragraph>連續思考大幅提升了模型的推理能力，但由於鏈條過長，也大幅增加了推理成本。由於觀察到推理路徑在簡單的任務中可以輕易壓縮，但在困難的任務中卻很吃力，我們探索了僅使用一個模型彈性控制推理路徑長度的可行性，從而根據任務難度動態減少推理模型的推理開銷。我們引入了一種名為 CoT-Valve 的新調校和推理策略，旨在讓模型產生長度不一的推理鏈。為此，我們提議在參數空間中識別一個方向，在操作時可以有效控制生成的 CoT 的長度。此外，我們展示了此屬性對於壓縮推理鏈是有價值的。我們構造了從長到短的鏈條的資料集，用於相同的問題，並探索了 CoT-Valve 的兩種增強策略：(1) 精確的長度可壓縮 CoT 調校方法，以及 (2) 漸進式鏈長壓縮方法。我們的實驗表明，CoT-Valve 成功地實現了鏈條的可控性和可壓縮性，並顯示出比基於提示的控制更好的效能。我們將此方法應用於 QwQ-32B-Preview，將 GSM8K 上的推理鏈條從 741 個代幣減少到 225 個代幣，效能僅略微下降 (95.07% 至 94.92%)，而在 AIME 上從 6827 個代幣減少到 4629 個代幣，只多了一個錯誤答案。</paragraph>
 
-##### **FEMBA: Efficient and Scalable EEG Analysis with a Bidirectional Mamba Foundation Model**
-2502.06438v1 by Anna Tegon, Thorir Mar Ingolfsson, Xiaying Wang, Luca Benini, Yawei Li
+##### **Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs**
+2502.09597v1 by Siyan Zhao, Mingyi Hong, Yang Liu, Devamanyu Hazarika, Kaixiang Lin
 
-Accurate and efficient electroencephalography (EEG) analysis is essential for
-detecting seizures and artifacts in long-term monitoring, with applications
-spanning hospital diagnostics to wearable health devices. Robust EEG analytics
-have the potential to greatly improve patient care. However, traditional deep
-learning models, especially Transformer-based architectures, are hindered by
-their quadratic time and memory complexity, making them less suitable for
-resource-constrained environments. To address these challenges, we present
-FEMBA (Foundational EEG Mamba + Bidirectional Architecture), a novel
-self-supervised framework that establishes new efficiency benchmarks for EEG
-analysis through bidirectional state-space modeling. Unlike Transformer-based
-models, which incur quadratic time and memory complexity, FEMBA scales linearly
-with sequence length, enabling more scalable and efficient processing of
-extended EEG recordings. Trained on over 21,000 hours of unlabeled EEG and
-fine-tuned on three downstream tasks, FEMBA achieves competitive performance in
-comparison with transformer models, with significantly lower computational
-cost. Specifically, it reaches 81.82% balanced accuracy (0.8921 AUROC) on TUAB
-and 0.949 AUROC on TUAR, while a tiny 7.8M-parameter variant demonstrates
-viability for resource-constrained devices. These results pave the way for
-scalable, general-purpose EEG analytics in both clinical and highlight FEMBA as
-a promising candidate for wearable applications.
+Large Language Models (LLMs) are increasingly used as chatbots, yet their
+ability to personalize responses to user preferences remains limited. We
+introduce PrefEval, a benchmark for evaluating LLMs' ability to infer, memorize
+and adhere to user preferences in a long-context conversational setting.
+PrefEval comprises 3,000 manually curated user preference and query pairs
+spanning 20 topics. PrefEval contains user personalization or preference
+information in both explicit and implicit forms, and evaluates LLM performance
+using a generation and a classification task. With PrefEval, we evaluated the
+aforementioned preference following capabilities of 10 open-source and
+proprietary LLMs in multi-session conversations with varying context lengths up
+to 100k tokens. We benchmark with various prompting, iterative feedback, and
+retrieval-augmented generation methods. Our benchmarking effort reveals that
+state-of-the-art LLMs face significant challenges in proactively following
+users' preferences during conversations. In particular, in zero-shot settings,
+preference following accuracy falls below 10% at merely 10 turns (~3k tokens)
+across most evaluated models. Even with advanced prompting and retrieval
+methods, preference following still deteriorates in long-context conversations.
+Furthermore, we show that fine-tuning on PrefEval significantly improves
+performance. We believe PrefEval serves as a valuable resource for measuring,
+understanding, and enhancing LLMs' preference following abilities, paving the
+way for personalized conversational agents. Our code and dataset are available
+at https://prefeval.github.io/.
 
-摘要：準確且有效的腦電圖 (EEG) 分析對於偵測長時間監控中的癲癇發作和偽像至關重要，其應用範圍涵蓋醫院診斷到可穿戴式健康裝置。穩健的 EEG 分析具有大幅改善病患照護的潛力。然而，傳統深度學習模型，特別是基於 Transformer 的架構，受到其二次時間和記憶體複雜度的阻礙，使其不太適合資源受限的環境。為了應對這些挑戰，我們提出 FEMBA (基礎 EEG Mamba + 雙向架構)，一種創新的自我監督架構，透過雙向狀態空間建模為 EEG 分析建立新的效率基準。與會產生二次時間和記憶體複雜度的基於 Transformer 的模型不同，FEMBA 隨著序列長度線性縮放，支援更具可擴充性和效率的延伸 EEG 記錄處理。FEMBA 在超過 21,000 小時的未標記 EEG 上訓練並在三個下游任務上進行微調，與Transformer模型相比，在計算成本顯著降低的情況下，實現了具有競爭力的效能。具體來說，它在 TUAB 上達到 81.82% 的平衡準確度 (0.8921 AUROC) 和在 TUAR 上達到 0.949 AUROC，而一個微小的 7.8M 參數變體證明了其在資源受限裝置上的可行性。這些結果為臨床和可穿戴應用中可擴充的通用 EEG 分析鋪平了道路，並突顯 FEMBA 是可穿戴應用中一個有前景的候選者。
+摘要：大型語言模型（LLM）正日益被用作聊天機器人，但它們根據使用者偏好個人化回應的能力仍然有限。我們引入了 PrefEval，一個用於評估 LLM 在長時間對話環境中推論、記憶和遵守使用者偏好的能力的基準。PrefEval 包含 3,000 個手動策劃的使用者偏好和查詢對，涵蓋 20 個主題。PrefEval 包含以明確和隱含形式表達的使用者個人化或偏好資訊，並使用生成和分類任務評估 LLM 效能。透過 PrefEval，我們評估了 10 個開源和專有 LLM 在多重對話中上述的偏好追蹤能力，對話內容長度最高達 100k 個符號。我們使用各種提示、迭代回饋和檢索增強生成方法進行基準測試。我們的基準測試工作顯示，最先進的 LLM 在對話中主動追蹤使用者偏好時面臨重大挑戰。特別是在零次學習設定中，在多數評估模型中，在僅 10 個回合（約 3k 個符號）時，偏好追蹤準確度低於 10%。即使使用進階提示和檢索方法，在長時間對話中偏好追蹤仍然會惡化。此外，我們展示了在 PrefEval 上進行微調會大幅改善效能。我們相信 PrefEval 可作為衡量、理解和提升 LLM 偏好追蹤能力的寶貴資源，為個人化對話代理鋪路。我們的程式碼和資料集可在 https://prefeval.github.io/ 取得。
 
-##### **Is an Ultra Large Natural Image-Based Foundation Model Superior to a Retina-Specific Model for Detecting Ocular and Systemic Diseases?**
-2502.06289v1 by Qingshan Hou, Yukun Zhou, Jocelyn Hui Lin Goh, Ke Zou, Samantha Min Er Yew, Sahana Srinivasan, Meng Wang, Thaddaeus Lo, Xiaofeng Lei, Siegfried K. Wagner, Mark A. Chia, Dawei Yang, Hongyang Jiang, AnRan Ran, Rui Santos, Gabor Mark Somfai, Juan Helen Zhou, Haoyu Chen, Qingyu Chen, Carol Yim-Lui Cheung, Pearse A. Keane, Yih Chung Tham
+##### **KIMAs: A Configurable Knowledge Integrated Multi-Agent System**
+2502.09596v1 by Zitao Li, Fei Wei, Yuexiang Xie, Dawei Gao, Weirui Kuang, Zhijian Ma, Bingchen Qian, Yaliang Li, Bolin Ding
 
-The advent of foundation models (FMs) is transforming medical domain. In
-ophthalmology, RETFound, a retina-specific FM pre-trained sequentially on 1.4
-million natural images and 1.6 million retinal images, has demonstrated high
-adaptability across clinical applications. Conversely, DINOv2, a
-general-purpose vision FM pre-trained on 142 million natural images, has shown
-promise in non-medical domains. However, its applicability to clinical tasks
-remains underexplored. To address this, we conducted head-to-head evaluations
-by fine-tuning RETFound and three DINOv2 models (large, base, small) for ocular
-disease detection and systemic disease prediction tasks, across eight
-standardized open-source ocular datasets, as well as the Moorfields AlzEye and
-the UK Biobank datasets. DINOv2-large model outperformed RETFound in detecting
-diabetic retinopathy (AUROC=0.850-0.952 vs 0.823-0.944, across three datasets,
-all P<=0.007) and multi-class eye diseases (AUROC=0.892 vs. 0.846, P<0.001). In
-glaucoma, DINOv2-base model outperformed RETFound (AUROC=0.958 vs 0.940,
-P<0.001). Conversely, RETFound achieved superior performance over all DINOv2
-models in predicting heart failure, myocardial infarction, and ischaemic stroke
-(AUROC=0.732-0.796 vs 0.663-0.771, all P<0.001). These trends persisted even
-with 10% of the fine-tuning data. These findings showcase the distinct
-scenarios where general-purpose and domain-specific FMs excel, highlighting the
-importance of aligning FM selection with task-specific requirements to optimise
-clinical performance.
+Knowledge-intensive conversations supported by large language models (LLMs)
+have become one of the most popular and helpful applications that can assist
+people in different aspects. Many current knowledge-intensive applications are
+centered on retrieval-augmented generation (RAG) techniques. While many
+open-source RAG frameworks facilitate the development of RAG-based
+applications, they often fall short in handling practical scenarios complicated
+by heterogeneous data in topics and formats, conversational context management,
+and the requirement of low-latency response times. This technical report
+presents a configurable knowledge integrated multi-agent system, KIMAs, to
+address these challenges. KIMAs features a flexible and configurable system for
+integrating diverse knowledge sources with 1) context management and query
+rewrite mechanisms to improve retrieval accuracy and multi-turn conversational
+coherency, 2) efficient knowledge routing and retrieval, 3) simple but
+effective filter and reference generation mechanisms, and 4) optimized
+parallelizable multi-agent pipeline execution. Our work provides a scalable
+framework for advancing the deployment of LLMs in real-world settings. To show
+how KIMAs can help developers build knowledge-intensive applications with
+different scales and emphases, we demonstrate how we configure the system to
+three applications already running in practice with reliable performance.
 
-摘要：基礎模型 (FM) 的出現正在轉變醫療領域。在眼科，RETFound 是一個視網膜專用 FM，依序使用 140 萬張自然影像和 160 萬張視網膜影像進行預訓練，已展現出高度適應性，可應用於各種臨床應用。相反地，DINOv2 是一個通用視覺 FM，使用 1.42 億張自然影像進行預訓練，已展現出在非醫療領域的潛力。然而，其在臨床任務中的適用性仍未被充分探索。為了解決這個問題，我們針對眼部疾病偵測和全身性疾病預測任務，對 RETFound 和三個 DINOv2 模型（大型、基礎、小型）進行微調，並進行一對一的評估，使用八個標準化的開源眼科資料集，以及 Moorfields AlzEye 和 UK Biobank 資料集。DINOv2 大型模型在糖尿病視網膜病變偵測方面優於 RETFound（三個資料集的 AUROC=0.850-0.952，相較於 0.823-0.944，所有 P<=0.007）和多類眼部疾病（AUROC=0.892，相較於 0.846，P<0.001）。在青光眼方面，DINOv2 基礎模型優於 RETFound（AUROC=0.958，相較於 0.940，P<0.001）。相反地，RETFound 在預測心臟衰竭、心肌梗塞和缺血性中風方面優於所有 DINOv2 模型（AUROC=0.732-0.796，相較於 0.663-0.771，所有 P<0.001）。即使使用 10% 的微調資料，這些趨勢仍然持續。這些發現展示了通用和領域專用 FM 各自擅長的場景，突顯了根據任務特定需求調整 FM 選擇，以最佳化臨床表現的重要性。
+摘要：由大型語言模型 (LLM) 支持的知識密集型對話
+已成為最受歡迎且有用的應用程式之一，可協助
+人們在不同面向獲得協助。許多當前的知識密集型應用程式
+都以檢索增強生成 (RAG) 技術為中心。雖然許多
+開放原始碼 RAG 架構促進了基於 RAG 的應用程式開發，但它們在處理
+主題和格式中異質資料、對話內容管理，以及低延遲回應時間的要求所造成的實際情況時，通常力有未逮。這份技術報告
+提出了可設定的知識整合多重代理系統，KIMAs，以
+解決這些挑戰。KIMAs 具備靈活且可設定的系統，可整合多樣化的知識來源，並具備 1) 內容管理和查詢
+改寫機制，以提升檢索準確度和多輪對話的連貫性，2) 有效的知識路由和檢索，3) 簡單但
+有效的篩選和參考產生機制，以及 4) 最佳化的可平行化多重代理管線執行。我們的作品提供了可擴充的
+架構，以推動在實際環境中部署 LLM。為了展示 KIMAs 如何協助開發人員建置不同規模和重點的知識密集型應用程式，我們示範如何設定系統至
+三個已實際執行且效能良好的應用程式。
 
-##### **Integrating Sequence and Image Modeling in Irregular Medical Time Series Through Self-Supervised Learning**
-2502.06134v1 by Liuqing Chen, Shuhong Xiao, Shixian Ding, Shanhai Hu, Lingyun Sun
+##### **Logical forms complement probability in understanding language model (and human) performance**
+2502.09589v1 by Yixuan Wang, Freda Shi
 
-Medical time series are often irregular and face significant missingness,
-posing challenges for data analysis and clinical decision-making. Existing
-methods typically adopt a single modeling perspective, either treating series
-data as sequences or transforming them into image representations for further
-classification. In this paper, we propose a joint learning framework that
-incorporates both sequence and image representations. We also design three
-self-supervised learning strategies to facilitate the fusion of sequence and
-image representations, capturing a more generalizable joint representation. The
-results indicate that our approach outperforms seven other state-of-the-art
-models in three representative real-world clinical datasets. We further
-validate our approach by simulating two major types of real-world missingness
-through leave-sensors-out and leave-samples-out techniques. The results
-demonstrate that our approach is more robust and significantly surpasses other
-baselines in terms of classification performance.
+With the increasing interest in using large language models (LLMs) for
+planning in natural language, understanding their behaviors becomes an
+important research question. This work conducts a systematic investigation of
+LLMs' ability to perform logical reasoning in natural language. We introduce a
+controlled dataset of hypothetical and disjunctive syllogisms in propositional
+and modal logic and use it as the testbed for understanding LLM performance.
+Our results lead to novel insights in predicting LLM behaviors: in addition to
+the probability of input (Gonen et al., 2023; McCoy et al., 2024), logical
+forms should be considered as orthogonal factors. In addition, we show
+similarities and differences between the logical reasoning performances of
+humans and LLMs by comparing LLM and human behavioral results.
 
-摘要：醫療時間序列通常不規則且會面臨顯著的缺失，對資料分析和臨床決策制定構成挑戰。現有方法通常採用單一建模觀點，將序列資料視為序列或將其轉換為影像表示以進行進一步分類。在本文中，我們提出了一個聯合學習架構，結合序列和影像表示。我們還設計了三種自我監督學習策略，以促進序列和影像表示的融合，捕捉更具概括性的聯合表示。結果表明，我們的做法在三個具有代表性的真實世界臨床資料集中優於其他七個最先進的模型。我們進一步通過留出感測器和留出樣本的技術模擬兩種主要的真實世界缺失類型來驗證我們的做法。結果表明，我們的做法更強大，並且在分類效能方面顯著優於其他基準。
+摘要：隨著在自然語言規劃中使用大型語言模型（LLM）的興趣日益濃厚，理解其行為已成為一項重要的研究課題。本研究對 LLM 在自然語言中執行邏輯推理的能力進行了系統性調查。我們引入了一個由假設和析取三段論組成的受控資料集，並使用它作為理解 LLM 效能的測試平台。我們的結果產生了預測 LLM 行為的新見解：除了輸入的機率（Gonen 等人，2023 年；McCoy 等人，2024 年）之外，邏輯形式應被視為正交因子。此外，我們透過比較 LLM 和人類行為結果，展示了人類和 LLM 在邏輯推理表現上的相似性和差異性。
 
-##### **Foundation Model of Electronic Medical Records for Adaptive Risk Estimation**
-2502.06124v1 by Pawel Renc, Michal K. Grzeszczyk, Nassim Oufattole, Deirdre Goode, Yugang Jia, Szymon Bieganski, Matthew B. A. McDermott, Jaroslaw Was, Anthony E. Samir, Jonathan W. Cunningham, David W. Bates, Arkadiusz Sitek
+##### **Optimizing GPT for Video Understanding: Zero-Shot Performance and Prompt Engineering**
+2502.09573v1 by Mark Beliaev, Victor Yang, Madhura Raju, Jiachen Sun, Xinghai Hu
 
-We developed the Enhanced Transformer for Health Outcome Simulation (ETHOS),
-an AI model that tokenizes patient health timelines (PHTs) from EHRs. ETHOS
-predicts future PHTs using transformer-based architectures. The Adaptive Risk
-Estimation System (ARES) employs ETHOS to compute dynamic and personalized risk
-probabilities for clinician-defined critical events. ARES incorporates a
-personalized explainability module that identifies key clinical factors
-influencing risk estimates for individual patients. ARES was evaluated on the
-MIMIC-IV v2.2 dataset in emergency department (ED) settings, benchmarking its
-performance against traditional early warning systems and machine learning
-models. We processed 299,721 unique patients from MIMIC-IV into 285,622 PHTs,
-with 60% including hospital admissions. The dataset contained over 357 million
-tokens. ETHOS outperformed benchmark models in predicting hospital admissions,
-ICU admissions, and prolonged hospital stays, achieving superior AUC scores.
-ETHOS-based risk estimates demonstrated robustness across demographic subgroups
-with strong model reliability, confirmed via calibration curves. The
-personalized explainability module provides insights into patient-specific
-factors contributing to risk. ARES, powered by ETHOS, advances predictive
-healthcare AI by providing dynamic, real-time, and personalized risk estimation
-with patient-specific explainability to enhance clinician trust. Its
-adaptability and superior accuracy position it as a transformative tool for
-clinical decision-making, potentially improving patient outcomes and resource
-allocation in emergency and inpatient settings. We release the full code at
-github.com/ipolharvard/ethos-ares to facilitate future research.
+In this study, we tackle industry challenges in video content classification
+by exploring and optimizing GPT-based models for zero-shot classification
+across seven critical categories of video quality. We contribute a novel
+approach to improving GPT's performance through prompt optimization and policy
+refinement, demonstrating that simplifying complex policies significantly
+reduces false negatives. Additionally, we introduce a new
+decomposition-aggregation-based prompt engineering technique, which outperforms
+traditional single-prompt methods. These experiments, conducted on real
+industry problems, show that thoughtful prompt design can substantially enhance
+GPT's performance without additional finetuning, offering an effective and
+scalable solution for improving video classification systems across various
+domains in industry.
 
-摘要：我們開發了增強型健康結果模擬轉換器 (ETHOS)，
-一種從電子健康紀錄 (EHR) 中將患者健康時間軸 (PHT) 標記化的 AI 模型。ETHOS
-使用基於轉換器的架構預測未來的 PHT。自適應風險評估系統 (ARES) 使用 ETHOS 計算由臨床醫生定義的危急事件的動態且個人化的風險機率。ARES 結合了個人化的可解釋性模組，可找出影響個別患者風險評估的主要臨床因素。ARES 在急診部門 (ED) 設定中針對 MIMIC-IV v2.2 資料集進行評估，並將其效能與傳統的預警系統和機器學習模型進行基準測試。我們將 299,721 位 MIMIC-IV 的獨特患者處理成 285,622 個 PHT，其中 60% 包含住院記錄。該資料集包含超過 3.57 億個標記。ETHOS 在預測住院、加護病房 (ICU) 住院和延長住院時間方面表現優於基準模型，並獲得了較高的 AUC 分數。基於 ETHOS 的風險評估顯示出跨人口統計子群的穩健性，並通過校準曲線確認了強大的模型可靠性。個人化的可解釋性模組提供了對導致風險的患者特定因素的見解。由 ETHOS 驅動的 ARES 透過提供動態、即時且個人化的風險評估，以及患者特定的可解釋性來增強臨床醫生的信任，從而推動了預測性醫療保健 AI 的發展。其適應性和卓越的準確性使其成為臨床決策制定的一種變革性工具，有可能改善緊急和住院環境中的患者結果和資源分配。我們在 github.com/ipolharvard/ethos-ares 上釋出完整程式碼，以利未來的研究。
+摘要：在這項研究中，我們透過探索和最佳化基於 GPT 的模型，來處理影片內容分類中的產業挑戰，並針對影片品質的七個關鍵類別進行零次學習分類。我們貢獻了一種透過提示最佳化和政策改善來提升 GPT 效能的新方法，證明簡化複雜政策能大幅減少假陰性。此外，我們還引入了一種新的基於分解聚合的提示工程技術，其效能優於傳統的單一提示方法。這些在真實產業問題上執行的實驗顯示，經過深思熟慮的提示設計可以在不進行額外微調的情況下大幅提升 GPT 的效能，為提升產業中各種領域的影片分類系統提供了一個有效且可擴充的解決方案。
 
-##### **Can ChatGPT Diagnose Alzheimer's Disease?**
-2502.06907v1 by Quoc-Toan Nguyen, Linh Le, Xuan-The Tran, Thomas Do, Chin-Teng Lin
+##### **MorphNLI: A Stepwise Approach to Natural Language Inference Using Text Morphing**
+2502.09567v1 by Vlad Andrei Negru, Robert Vacareanu, Camelia Lemnaru, Mihai Surdeanu, Rodica Potolea
 
-Can ChatGPT diagnose Alzheimer's Disease (AD)? AD is a devastating
-neurodegenerative condition that affects approximately 1 in 9 individuals aged
-65 and older, profoundly impairing memory and cognitive function. This paper
-utilises 9300 electronic health records (EHRs) with data from Magnetic
-Resonance Imaging (MRI) and cognitive tests to address an intriguing question:
-As a general-purpose task solver, can ChatGPT accurately detect AD using EHRs?
-We present an in-depth evaluation of ChatGPT using a black-box approach with
-zero-shot and multi-shot methods. This study unlocks ChatGPT's capability to
-analyse MRI and cognitive test results, as well as its potential as a
-diagnostic tool for AD. By automating aspects of the diagnostic process, this
-research opens a transformative approach for the healthcare system,
-particularly in addressing disparities in resource-limited regions where AD
-specialists are scarce. Hence, it offers a foundation for a promising method
-for early detection, supporting individuals with timely interventions, which is
-paramount for Quality of Life (QoL).
+We introduce MorphNLI, a modular step-by-step approach to natural language
+inference (NLI). When classifying the premise-hypothesis pairs into
+{entailment, contradiction, neutral}, we use a language model to generate the
+necessary edits to incrementally transform (i.e., morph) the premise into the
+hypothesis. Then, using an off-the-shelf NLI model we track how the entailment
+progresses with these atomic changes, aggregating these intermediate labels
+into a final output. We demonstrate the advantages of our proposed method
+particularly in realistic cross-domain settings, where our method always
+outperforms strong baselines with improvements up to 12.6% (relative). Further,
+our proposed approach is explainable as the atomic edits can be used to
+understand the overall NLI label.
 
-摘要：ChatGPT 能否診斷出阿茲海默症 (AD)？AD 是一種毀滅性的神經退化性疾病，影響約 1/9 的 65 歲及以上人士，嚴重損害記憶力和認知功能。這篇論文利用了 9300 份電子健康紀錄 (EHR)，其中包含磁共振成像 (MRI) 和認知測試的數據，來解決一個有趣的問題：作為一個通用任務解決器，ChatGPT 能否使用 EHR 準確地檢測出 AD？我們使用黑盒方法對 ChatGPT 進行了深入評估，採用零次嘗試和多次嘗試的方法。這項研究揭示了 ChatGPT 分析 MRI 和認知測試結果的能力，以及其作為 AD 診斷工具的潛力。通過自動化診斷過程的各個方面，這項研究為醫療保健系統開啟了一種變革性的方法，特別是在解決資源有限的地區中 AD 專家稀缺的不平等問題方面。因此，它為一種有希望的早期檢測方法奠定了基礎，通過及時干預來支持個人，這對於生活品質 (QoL) 至關重要。
+摘要：我們引入 MorphNLI，一種模組化逐步方法，用於自然語言推論 (NLI)。當對前提假設對進行分類時，我們使用語言模型來產生必要的編輯，以逐步轉換（即，變形）前提成為假設。然後，使用現成的 NLI 模型，我們追蹤推論如何隨著這些原子變化而進展，將這些中間標籤彙總成最終輸出。我們展示了我們提出的方法的優點，特別是在現實的跨網域設置中，我們的模型始終優於強大的基線，改進幅度高達 12.6%（相對）。此外，我們提出的方法是可以解釋的，因為原子編輯可以用來理解整體 NLI 標籤。
 
-##### **Protecting Intellectual Property of EEG-based Neural Networks with Watermarking**
-2502.05931v1 by Ahmed Abdelaziz, Ahmed Fathi, Ahmed Fares
+##### **Zero-shot generation of synthetic neurosurgical data with large language models**
+2502.09566v1 by Austin A. Barr, Eddie Guo, Emre Sezgin
 
-EEG-based neural networks, pivotal in medical diagnosis and brain-computer
-interfaces, face significant intellectual property (IP) risks due to their
-reliance on sensitive neurophysiological data and resource-intensive
-development. Current watermarking methods, particularly those using abstract
-trigger sets, lack robust authentication and fail to address the unique
-challenges of EEG models. This paper introduces a cryptographic wonder
-filter-based watermarking framework tailored for EEG-based neural networks.
-Leveraging collision-resistant hashing and public-key encryption, the wonder
-filter embeds the watermark during training, ensuring minimal distortion ($\leq
-5\%$ drop in EEG task accuracy) and high reliability (100\% watermark
-detection). The framework is rigorously evaluated against adversarial attacks,
-including fine-tuning, transfer learning, and neuron pruning. Results
-demonstrate persistent watermark retention, with classification accuracy for
-watermarked states remaining above 90\% even after aggressive pruning, while
-primary task performance degrades faster, deterring removal attempts. Piracy
-resistance is validated by the inability to embed secondary watermarks without
-severe accuracy loss ( $>10\%$ in EEGNet and CCNN models). Cryptographic
-hashing ensures authentication, reducing brute-force attack success
-probabilities. Evaluated on the DEAP dataset across models (CCNN, EEGNet,
-TSception), the method achieves $>99.4\%$ null-embedding accuracy, effectively
-eliminating false positives. By integrating wonder filters with EEG-specific
-adaptations, this work bridges a critical gap in IP protection for
-neurophysiological models, offering a secure, tamper-proof solution for
-healthcare and biometric applications. The framework's robustness against
-adversarial modifications underscores its potential to safeguard sensitive EEG
-models while maintaining diagnostic utility.
+Clinical data is fundamental to advance neurosurgical research, but access is
+often constrained by data availability, small sample sizes, privacy
+regulations, and resource-intensive preprocessing and de-identification
+procedures. Synthetic data offers a potential solution to challenges associated
+with accessing and using real-world data (RWD). This study aims to evaluate the
+capability of zero-shot generation of synthetic neurosurgical data with a large
+language model (LLM), GPT-4o, by benchmarking with the conditional tabular
+generative adversarial network (CTGAN). Synthetic datasets were compared to
+real-world neurosurgical data to assess fidelity (means, proportions,
+distributions, and bivariate correlations), utility (ML classifier performance
+on RWD), and privacy (duplication of records from RWD). The GPT-4o-generated
+datasets matched or exceeded CTGAN performance, despite no fine-tuning or
+access to RWD for pre-training. Datasets demonstrated high univariate and
+bivariate fidelity to RWD without directly exposing any real patient records,
+even at amplified sample size. Training an ML classifier on GPT-4o-generated
+data and testing on RWD for a binary prediction task showed an F1 score (0.706)
+with comparable performance to training on the CTGAN data (0.705) for
+predicting postoperative functional status deterioration. GPT-4o demonstrated a
+promising ability to generate high-fidelity synthetic neurosurgical data. These
+findings also indicate that data synthesized with GPT-4o can effectively
+augment clinical data with small sample sizes, and train ML models for
+prediction of neurosurgical outcomes. Further investigation is necessary to
+improve the preservation of distributional characteristics and boost classifier
+performance.
 
-摘要：<paragraph>基於 EEG 的神經網路在醫學診斷和腦電腦介面中至關重要，由於其依賴敏感的神經生理資料和資源密集型的開發，面臨重大的智慧財產權 (IP) 風險。目前的浮水印方法，特別是那些使用抽象觸發集的方法，缺乏強健的驗證，且無法解決 EEG 模型的獨特挑戰。本文介紹了一個專為基於 EEG 的神經網路量身打造的密碼學 wonder 濾波器浮水印架構。利用抗碰撞雜湊和公開金鑰加密，wonder 濾波器在訓練期間嵌入浮水印，確保最小的失真（EEG 任務準確度下降 $\leq 5\%$）和高可靠性（100% 浮水印檢測）。該架構針對對抗性攻擊進行了嚴格的評估，包括微調、遷移學習和神經元剪枝。結果證明了持續的浮水印保留，即使在激進的剪枝後，浮水印狀態的分類準確度仍保持在 90% 以上，而主要任務的性能下降得更快，阻止了移除嘗試。盜版抵抗力通過無法嵌入次要浮水印而得到驗證，而不會造成嚴重的準確度損失（在 EEGNet 和 CCNN 模型中 $>10\%$）。密碼學雜湊確保驗證，降低了暴力攻擊成功機率。在 DEAP 資料集上針對模型（CCNN、EEGNet、TSception）進行評估，該方法達到了 $>99.4\%$ 的空嵌入準確度，有效地消除了假陽性。透過將 wonder 濾波器與 EEG 特定的適應相整合，這項工作彌補了神經生理模型 IP 保護中的關鍵差距，為醫療保健和生物特徵應用提供了一個安全、防篡改的解決方案。該架構對抗敵對修改的強健性突顯了其在維護診斷效用的同時保護敏感 EEG 模型的潛力。</paragraph>
+摘要：<paragraph>臨床數據是推進神經外科研究的基礎，但訪問通常受到數據可用性、樣本量小、隱私法規以及資源密集型預處理和去識別程序的限制。合成數據為與存取和使用真實世界數據 (RWD) 相關的挑戰提供了潛在解決方案。本研究旨在評估使用大型語言模型 (LLM) GPT-4o 零次生成合成神經外科數據的能力，並通過條件表格生成對抗網路 (CTGAN) 進行基準測試。將合成數據集與真實世界的神經外科數據進行比較，以評估保真度（平均值、比例、分布和二元相關性）、實用性（RWD 上的 ML 分類器性能）和隱私（RWD 中記錄的重複）。儘管沒有微調或訪問 RWD 進行預訓練，但 GPT-4o 生成的數據集與 CTGAN 性能相匹配或超過 CTGAN 性能。數據集證明了對 RWD 的高單變量和二變量保真度，即使在擴充的樣本量下也不會直接公開任何真實患者記錄。在 GPT-4o 生成的數據上訓練 ML 分類器，並在 RWD 上測試二元預測任務，顯示 F1 分數 (0.706) 與在 CTGAN 數據上訓練以預測術後功能狀態惡化時的性能相當 (0.705)。GPT-4o 展示了生成高保真合成神經外科數據的潛力。這些發現還表明，使用 GPT-4o 合成的數據可以有效地增加樣本量小的臨床數據，並訓練 ML 模型以預測神經外科結果。需要進一步研究以改善分佈特徵的保留並提升分類器性能。</paragraph>
 
-##### **Enhancing Depression Detection with Chain-of-Thought Prompting: From Emotion to Reasoning Using Large Language Models**
-2502.05879v1 by Shiyu Teng, Jiaqing Liu, Rahul Kumar Jain, Shurong Chai, Ruibo Hou, Tomoko Tateyama, Lanfen Lin, Yen-wei Chen
+##### **MDCrow: Automating Molecular Dynamics Workflows with Large Language Models**
+2502.09565v1 by Quintina Campbell, Sam Cox, Jorge Medina, Brittany Watterson, Andrew D. White
 
-Depression is one of the leading causes of disability worldwide, posing a
-severe burden on individuals, healthcare systems, and society at large. Recent
-advancements in Large Language Models (LLMs) have shown promise in addressing
-mental health challenges, including the detection of depression through
-text-based analysis. However, current LLM-based methods often struggle with
-nuanced symptom identification and lack a transparent, step-by-step reasoning
-process, making it difficult to accurately classify and explain mental health
-conditions. To address these challenges, we propose a Chain-of-Thought
-Prompting approach that enhances both the performance and interpretability of
-LLM-based depression detection. Our method breaks down the detection process
-into four stages: (1) sentiment analysis, (2) binary depression classification,
-(3) identification of underlying causes, and (4) assessment of severity. By
-guiding the model through these structured reasoning steps, we improve
-interpretability and reduce the risk of overlooking subtle clinical indicators.
-We validate our method on the E-DAIC dataset, where we test multiple
-state-of-the-art large language models. Experimental results indicate that our
-Chain-of-Thought Prompting technique yields superior performance in both
-classification accuracy and the granularity of diagnostic insights, compared to
-baseline approaches.
+Molecular dynamics (MD) simulations are essential for understanding
+biomolecular systems but remain challenging to automate. Recent advances in
+large language models (LLM) have demonstrated success in automating complex
+scientific tasks using LLM-based agents. In this paper, we introduce MDCrow, an
+agentic LLM assistant capable of automating MD workflows. MDCrow uses
+chain-of-thought over 40 expert-designed tools for handling and processing
+files, setting up simulations, analyzing the simulation outputs, and retrieving
+relevant information from literature and databases. We assess MDCrow's
+performance across 25 tasks of varying required subtasks and difficulty, and we
+evaluate the agent's robustness to both difficulty and prompt style.
+\texttt{gpt-4o} is able to complete complex tasks with low variance, followed
+closely by \texttt{llama3-405b}, a compelling open-source model. While prompt
+style does not influence the best models' performance, it has significant
+effects on smaller models.
 
-摘要：憂鬱症是全球殘障的主要原因之一，對個人、醫療保健系統和整個社會造成嚴重負擔。大型語言模型 (LLM) 的最新進展已展現出解決心理健康挑戰的希望，包括透過基於文字的分析來偵測憂鬱症。然而，現有的基於 LLM 的方法通常難以辨識細微的症狀，而且缺乏透明且逐步的推理過程，這使得準確分類和解釋心理健康狀況變得困難。為了應對這些挑戰，我們提出了一種思考鏈提示方法，它增強了基於 LLM 的憂鬱症偵測的效能和可解釋性。我們的這項方法將偵測過程分解為四個階段：(1) 情緒分析，(2) 二元憂鬱症分類，(3) 找出潛在原因，以及 (4) 評估嚴重程度。透過引導模型完成這些結構化的推理步驟，我們提升了可解釋性，並降低了忽略細微臨床指標的風險。我們在 E-DAIC 資料集上驗證了我們的這項方法，並在其中測試了多種最先進的大型語言模型。實驗結果顯示，與基線方法相比，我們的思考鏈提示技術在分類準確度和診斷見解的精細度方面都表現出優異的效能。
+摘要：分子動力學 (MD) 模擬對於理解生物分子系統至關重要，但自動化仍然具有挑戰性。大型語言模型 (LLM) 的最新進展已證明使用基於 LLM 的代理自動化複雜的科學任務是成功的。在本文中，我們介紹了 MDCrow，這是一個代理 LLM 助理，能夠自動化 MD 工作流程。MDCrow 使用 40 多種專家設計的工具的思考鏈來處理和處理檔案、設定模擬、分析模擬輸出，以及從文獻和資料庫中檢索相關資訊。我們評估了 MDCrow 在 25 項任務中的表現，這些任務所需的子任務和難度各不相同，並且我們評估了代理對難度和提示樣式的穩健性。\texttt{gpt-4o} 能夠以低變異完成複雜的任務，緊隨其後的是一個引人注目的開源模型 \texttt{llama3-405b}。雖然提示樣式不會影響最佳模型的效能，但它對較小的模型有顯著的影響。
 
-##### **LLMs for Drug-Drug Interaction Prediction: A Comprehensive Comparison**
-2502.06890v1 by Gabriele De Vito, Filomena Ferrucci, Athanasios Angelakis
+##### **EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents**
+2502.09560v1 by Rui Yang, Hanyang Chen, Junyu Zhang, Mark Zhao, Cheng Qian, Kangrui Wang, Qineng Wang, Teja Venkat Koripella, Marziyeh Movahedi, Manling Li, Heng Ji, Huan Zhang, Tong Zhang
 
-The increasing volume of drug combinations in modern therapeutic regimens
-needs reliable methods for predicting drug-drug interactions (DDIs). While
-Large Language Models (LLMs) have revolutionized various domains, their
-potential in pharmaceutical research, particularly in DDI prediction, remains
-largely unexplored. This study thoroughly investigates LLMs' capabilities in
-predicting DDIs by uniquely processing molecular structures (SMILES), target
-organisms, and gene interaction data as raw text input from the latest DrugBank
-dataset. We evaluated 18 different LLMs, including proprietary models (GPT-4,
-Claude, Gemini) and open-source variants (from 1.5B to 72B parameters), first
-assessing their zero-shot capabilities in DDI prediction. We then fine-tuned
-selected models (GPT-4, Phi-3.5 2.7B, Qwen-2.5 3B, Gemma-2 9B, and Deepseek R1
-distilled Qwen 1.5B) to optimize their performance. Our comprehensive
-evaluation framework included validation across 13 external DDI datasets,
-comparing against traditional approaches such as l2-regularized logistic
-regression. Fine-tuned LLMs demonstrated superior performance, with Phi-3.5
-2.7B achieving a sensitivity of 0.978 in DDI prediction, with an accuracy of
-0.919 on balanced datasets (50% positive, 50% negative cases). This result
-represents an improvement over both zero-shot predictions and state-of-the-art
-machine-learning methods used for DDI prediction. Our analysis reveals that
-LLMs can effectively capture complex molecular interaction patterns and cases
-where drug pairs target common genes, making them valuable tools for practical
-applications in pharmaceutical research and clinical settings.
+Leveraging Multi-modal Large Language Models (MLLMs) to create embodied
+agents offers a promising avenue for tackling real-world tasks. While
+language-centric embodied agents have garnered substantial attention,
+MLLM-based embodied agents remain underexplored due to the lack of
+comprehensive evaluation frameworks. To bridge this gap, we introduce
+EmbodiedBench, an extensive benchmark designed to evaluate vision-driven
+embodied agents. EmbodiedBench features: (1) a diverse set of 1,128 testing
+tasks across four environments, ranging from high-level semantic tasks (e.g.,
+household) to low-level tasks involving atomic actions (e.g., navigation and
+manipulation); and (2) six meticulously curated subsets evaluating essential
+agent capabilities like commonsense reasoning, complex instruction
+understanding, spatial awareness, visual perception, and long-term planning.
+Through extensive experiments, we evaluated 13 leading proprietary and
+open-source MLLMs within EmbodiedBench. Our findings reveal that: MLLMs excel
+at high-level tasks but struggle with low-level manipulation, with the best
+model, GPT-4o, scoring only 28.9% on average. EmbodiedBench provides a
+multifaceted standardized evaluation platform that not only highlights existing
+challenges but also offers valuable insights to advance MLLM-based embodied
+agents. Our code is available at https://embodiedbench.github.io.
 
-摘要：<paragraph>現代治療方案中藥物組合的數量越來越多，需要可靠的方法來預測藥物間交互作用 (DDI)。儘管大型語言模型 (LLM) 已在各個領域掀起革命，它們在藥物研究中的潛力，特別是在 DDI 預測中的潛力，仍未得到充分探索。本研究通過獨特地處理分子結構 (SMILES)、目標生物和基因交互資料作為來自最新 DrugBank 資料集的原始文字輸入，徹底調查了 LLM 在預測 DDI 中的能力。我們評估了 18 種不同的 LLM，包括專有模型（GPT-4、Claude、Gemini）和開源變體（從 1.5B 到 72B 參數），首先評估它們在 DDI 預測中的零次學習能力。然後，我們微調選定的模型（GPT-4、Phi-3.5 2.7B、Qwen-2.5 3B、Gemma-2 9B 和 Deepseek R1 蒸餾 Qwen 1.5B）以最佳化其效能。我們的全面評估框架包括跨 13 個外部 DDI 資料集進行驗證，並與傳統方法（例如 l2 正則化邏輯迴歸）進行比較。微調後的 LLM 表現出優異的效能，其中 Phi-3.5 2.7B 在 DDI 預測中達到 0.978 的靈敏度，在平衡資料集（50% 正例，50% 反例）上的準確度為 0.919。此結果優於零次學習預測和用於 DDI 預測的最新機器學習方法。我們的分析表明，LLM 可以有效捕捉複雜的分子交互模式和藥物對靶向共同基因的情況，使其成為藥物研究和臨床環境中實用應用的寶貴工具。</paragraph>
+摘要：<paragraph>利用多模態大型語言模型 (MLLM) 來建立具身代理，提供了解決現實世界任務的有前景途徑。儘管以語言為中心的具身代理已獲得大量關注，但由於缺乏全面的評估框架，基於 MLLM 的具身代理仍未得到充分探索。為了彌補這一差距，我們引入了 EmbodiedBench，這是一個廣泛的基準測試，旨在評估以視覺為導向的具身代理。EmbodiedBench 的特點：(1) 跨越四個環境的 1,128 項多樣化測試任務，範圍從高層級語義任務（例如，家庭）到涉及原子動作的低層級任務（例如，導航和操作）；以及 (2) 六個精心策劃的子集，用於評估基本的代理能力，例如常識推理、複雜指令理解、空間感知、視覺感知和長期規劃。通過廣泛的實驗，我們在 EmbodiedBench 中評估了 13 個領先的專有和開源 MLLM。我們的研究結果表明：MLLM 在高層級任務中表現出色，但在低層級操作中遇到困難，表現最好的模型 GPT-4o 平均得分僅為 28.9%。EmbodiedBench 提供了一個多方面的標準化評估平台，不僅突出了現有挑戰，還提供了有價值的見解來推進基於 MLLM 的具身代理。我們的程式碼可在 https://embodiedbench.github.io/ 取得。</paragraph>
 
-##### **Decoding Complexity: Intelligent Pattern Exploration with CHPDA (Context Aware Hybrid Pattern Detection Algorithm)**
-2502.07815v1 by Lokesh Koli, Shubham Kalra, Karanpreet Singh
+##### **Mind the Gap! Choice Independence in Using Multilingual LLMs for Persuasive Co-Writing Tasks in Different Languages**
+2502.09532v1 by Shreyan Biswas, Alexander Erlei, Ujwal Gadiraju
 
-Detecting sensitive data such as Personally Identifiable Information (PII)
-and Protected Health Information (PHI) is critical for data security platforms.
-This study evaluates regex-based pattern matching algorithms and exact-match
-search techniques to optimize detection speed, accuracy, and scalability. Our
-benchmarking results indicate that Google RE2 provides the best balance of
-speed (10-15 ms/MB), memory efficiency (8-16 MB), and accuracy (99.5%) among
-regex engines, outperforming PCRE while maintaining broader hardware
-compatibility than Hyperscan. For exact matching, Aho-Corasick demonstrated
-superior performance (8 ms/MB) and scalability for large datasets. Performance
-analysis revealed that regex processing time scales linearly with dataset size
-and pattern complexity. A hybrid AI + Regex approach achieved the highest F1
-score (91. 6%) by improving recall and minimizing false positives. Device
-benchmarking confirmed that our solution maintains efficient CPU and memory
-usage on both high-performance and mid-range systems. Despite its
-effectiveness, challenges remain, such as limited multilingual support and the
-need for regular pattern updates. Future work should focus on expanding
-language coverage, integrating data security and privacy management (DSPM) with
-data loss prevention (DLP) tools, and enhancing regulatory compliance for
-broader global adoption.
+Recent advances in generative AI have precipitated a proliferation of novel
+writing assistants. These systems typically rely on multilingual large language
+models (LLMs), providing globalized workers the ability to revise or create
+diverse forms of content in different languages. However, there is substantial
+evidence indicating that the performance of multilingual LLMs varies between
+languages. Users who employ writing assistance for multiple languages are
+therefore susceptible to disparate output quality. Importantly, recent research
+has shown that people tend to generalize algorithmic errors across independent
+tasks, violating the behavioral axiom of choice independence. In this paper, we
+analyze whether user utilization of novel writing assistants in a charity
+advertisement writing task is affected by the AI's performance in a second
+language. Furthermore, we quantify the extent to which these patterns translate
+into the persuasiveness of generated charity advertisements, as well as the
+role of peoples' beliefs about LLM utilization in their donation choices. Our
+results provide evidence that writers who engage with an LLM-based writing
+assistant violate choice independence, as prior exposure to a Spanish LLM
+reduces subsequent utilization of an English LLM. While these patterns do not
+affect the aggregate persuasiveness of the generated advertisements, people's
+beliefs about the source of an advertisement (human versus AI) do. In
+particular, Spanish-speaking female participants who believed that they read an
+AI-generated advertisement strongly adjusted their donation behavior downwards.
+Furthermore, people are generally not able to adequately differentiate between
+human-generated and LLM-generated ads. Our work has important implications for
+the design, development, integration, and adoption of multilingual LLMs as
+assistive agents -- particularly in writing tasks.
 
-摘要：偵測個人身分資訊 (PII) 和受保護健康資訊 (PHI) 等敏感資料，對於資料安全平台至關重要。本研究評估基於 regex 的模式配對演算法和精確配對搜尋技術，以最佳化偵測速度、準確度和可擴充性。我們的基準測試結果顯示，在 regex 引擎中，Google RE2 在速度 (10-15 ms/MB)、記憶體效率 (8-16 MB) 和準確度 (99.5%) 方面取得最佳平衡，優於 PCRE，同時比 Hyperscan 擁有更廣泛的硬體相容性。對於精確配對，Aho-Corasick 展現出優異的效能 (8 ms/MB) 和大資料集的可擴充性。效能分析顯示，regex 處理時間會隨著資料集大小和模式複雜度線性擴充。混合 AI + Regex 方法透過提升召回率和將假陽性降至最低，達到了最高的 F1 分數 (91. 6%)。裝置基準測試確認我們的解決方案在高性能和中階系統上都能維持高效的 CPU 和記憶體使用率。儘管有效，但仍有挑戰存在，例如多語言支援有限，以及需要定期更新模式。未來的研究應著重於擴展語言涵蓋範圍，將資料安全和隱私管理 (DSPM) 與資料遺失防護 (DLP) 工具整合，以及加強法規遵循以利更廣泛的全球採用。
+摘要：<paragraph>生成式 AI 的最新進展加速了新穎寫作助理的激增。這些系統通常依賴多語言大型語言模型 (LLM)，讓全球化的工作者能夠以不同的語言修改或建立各種形式的內容。然而，有大量證據顯示多語言 LLM 的表現因語言而異。因此，使用多語言寫作協助的使用者容易受到不同的輸出品質影響。重要的是，最近的研究顯示人們傾向於在獨立的任務中概化演算法錯誤，違反了選擇獨立性的行為公理。在本文中，我們分析使用者在慈善廣告寫作任務中使用新穎寫作助理是否會受到 AI 在第二語言中的表現影響。此外，我們量化這些模式轉化為所產生慈善廣告說服力的程度，以及人們對 LLM 使用在捐款選擇中的信念所扮演的角色。我們的結果提供證據，表明與基於 LLM 的寫作助理互動的寫作者會違反選擇獨立性，因為先前接觸過西班牙語 LLM 會減少後續使用英語 LLM 的情況。雖然這些模式不會影響所產生廣告的整體說服力，但人們對廣告來源（人類與 AI）的信念會影響。特別是，相信自己閱讀 AI 生成的廣告的西班牙語系女性參與者大幅調整了他們的捐款行為。此外，人們通常無法充分區分人類產生的廣告和 LLM 產生的廣告。我們的研究對多語言 LLM 作為輔助代理的設計、開發、整合和採用具有重要的意義，特別是在寫作任務中。</paragraph>
 
-##### **WatchGuardian: Enabling User-Defined Personalized Just-in-Time Intervention on Smartwatch**
-2502.05783v1 by Ying Lei, Yancheng Cao, Will Wang, Yuanzhe Dong, Changchang Yin, Weidan Cao, Ping Zhang, Jingzhen Yang, Bingsheng Yao, Yifan Peng, Chunhua Weng, Randy Auerbach, Lena Mamykina, Dakuo Wang, Yuntao Wang, Xuhai Xu
+##### **Diffusion Models for Molecules: A Survey of Methods and Tasks**
+2502.09511v1 by Liang Wang, Chao Song, Zhiyuan Liu, Yu Rong, Qiang Liu, Shu Wu, Liang Wang
 
-While just-in-time interventions (JITIs) have effectively targeted common
-health behaviors, individuals often have unique needs to intervene in personal
-undesirable actions that can negatively affect physical, mental, and social
-well-being. We present WatchGuardian, a smartwatch-based JITI system that
-empowers users to define custom interventions for these personal actions with a
-small number of samples. For the model to detect new actions based on limited
-new data samples, we developed a few-shot learning pipeline that finetuned a
-pre-trained inertial measurement unit (IMU) model on public hand-gesture
-datasets. We then designed a data augmentation and synthesis process to train
-additional classification layers for customization. Our offline evaluation with
-26 participants showed that with three, five, and ten examples, our approach
-achieved an average accuracy of 76.8%, 84.7%, and 87.7%, and an F1 score of
-74.8%, 84.2%, and 87.2% We then conducted a four-hour intervention study to
-compare WatchGuardian against a rule-based intervention. Our results
-demonstrated that our system led to a significant reduction by 64.0 +- 22.6% in
-undesirable actions, substantially outperforming the baseline by 29.0%. Our
-findings underscore the effectiveness of a customizable, AI-driven JITI system
-for individuals in need of behavioral intervention in personal undesirable
-actions. We envision that our work can inspire broader applications of
-user-defined personalized intervention with advanced AI solutions.
+Generative tasks about molecules, including but not limited to molecule
+generation, are crucial for drug discovery and material design, and have
+consistently attracted significant attention. In recent years, diffusion models
+have emerged as an impressive class of deep generative models, sparking
+extensive research and leading to numerous studies on their application to
+molecular generative tasks. Despite the proliferation of related work, there
+remains a notable lack of up-to-date and systematic surveys in this area.
+Particularly, due to the diversity of diffusion model formulations, molecular
+data modalities, and generative task types, the research landscape is
+challenging to navigate, hindering understanding and limiting the area's
+growth. To address this, this paper conducts a comprehensive survey of
+diffusion model-based molecular generative methods. We systematically review
+the research from the perspectives of methodological formulations, data
+modalities, and task types, offering a novel taxonomy. This survey aims to
+facilitate understanding and further flourishing development in this area. The
+relevant papers are summarized at:
+https://github.com/AzureLeon1/awesome-molecular-diffusion-models.
 
-摘要：<paragraph>雖然即時介入（JITIs）有效地針對常見的健康行為，但個人通常有獨特的需求來介入可能會對身心和社會福祉產生負面影響的個人不良行為。我們提出 WatchGuardian，這是一個基於智慧手錶的 JITI 系統，它使用少數樣本讓使用者能夠為這些個人行為定義自訂介入措施。為了讓模型根據有限的新資料樣本偵測新行為，我們開發了一個小樣本學習管道，微調了公共手勢資料集上的預訓練慣性測量單元（IMU）模型。然後，我們設計了一個資料擴充和合成流程，以訓練其他分類層以進行自訂。我們對 26 位參與者進行的離線評估顯示，我們的做法使用三個、五個和十個範例，達到了 76.8%、84.7% 和 87.7% 的平均準確度，以及 74.8%、84.2% 和 87.2% 的 F1 分數。然後，我們進行了一項為時四小時的介入研究，以將 WatchGuardian 與基於規則的介入進行比較。我們的結果表明，我們的系統導致不良行為顯著減少了 64.0 +- 22.6%，大幅優於基線 29.0%。我們的研究結果強調了可自訂、AI 驅動的 JITI 系統對需要行為介入以應對個人不良行為的個人的有效性。我們預計我們的研究可以激勵使用者定義個人化介入的更廣泛應用，並採用先進的 AI 解決方案。</paragraph>
+摘要：<paragraph>包括但不限於分子生成在內的分子生成任務，對於藥物發現和材料設計至關重要，並持續吸引大量關注。近年來，擴散模型已成為深度生成模型中令人印象深刻的一類，激發了廣泛的研究，並導致對其應用於分子生成任務的眾多研究。儘管相關工作不斷增加，但這個領域仍然缺乏最新的系統性綜述。特別是，由於擴散模型公式、分子數據方式和生成任務類型的多樣性，研究領域難以瀏覽，阻礙了理解並限制了該領域的發展。為了解決這個問題，本文對基於擴散模型的分子生成方法進行了全面的調查。我們從方法論公式、數據方式和任務類型的角度系統性地回顧了研究，提供了一種新穎的分類法。本調查旨在促進理解並進一步促進該領域的蓬勃發展。相關論文總結如下：
+https://github.com/AzureLeon1/awesome-molecular-diffusion-models。</paragraph>
 
-##### **RECOVER: Designing a Large Language Model-based Remote Patient Monitoring System for Postoperative Gastrointestinal Cancer Care**
-2502.05740v1 by Ziqi Yang, Yuxuan Lu, Jennifer Bagdasarian, Vedant Das Swain, Ritu Agarwal, Collin Campbell, Waddah Al-Refaire, Jehan El-Bayoumi, Guodong Gao, Dakuo Wang, Bingsheng Yao, Nawar Shara
+##### **AttentionSmithy: A Modular Framework for Rapid Transformer Development and Customization**
+2502.09503v1 by Caleb Cranney, Jesse G. Meyer
 
-Cancer surgery is a key treatment for gastrointestinal (GI) cancers, a group
-of cancers that account for more than 35% of cancer-related deaths worldwide,
-but postoperative complications are unpredictable and can be life-threatening.
-In this paper, we investigate how recent advancements in large language models
-(LLMs) can benefit remote patient monitoring (RPM) systems through clinical
-integration by designing RECOVER, an LLM-powered RPM system for postoperative
-GI cancer care. To closely engage stakeholders in the design process, we first
-conducted seven participatory design sessions with five clinical staff and
-interviewed five cancer patients to derive six major design strategies for
-integrating clinical guidelines and information needs into LLM-based RPM
-systems. We then designed and implemented RECOVER, which features an
-LLM-powered conversational agent for cancer patients and an interactive
-dashboard for clinical staff to enable efficient postoperative RPM. Finally, we
-used RECOVER as a pilot system to assess the implementation of our design
-strategies with four clinical staff and five patients, providing design
-implications by identifying crucial design elements, offering insights on
-responsible AI, and outlining opportunities for future LLM-powered RPM systems.
+Transformer architectures have transformed AI applications but remain complex
+to customize for domain experts lacking low-level implementation expertise. We
+introduce AttentionSmithy, a modular software package that simplifies
+transformer innovation by breaking down key components into reusable building
+blocks: attention modules, feed-forward networks, normalization layers, and
+positional encodings. Users can rapidly prototype and evaluate transformer
+variants without extensive coding. Our framework supports four positional
+encoding strategies and integrates with neural architecture search for
+automated design. We validate AttentionSmithy by replicating the original
+transformer under resource constraints and optimizing translation performance
+by combining positional encodings. Additionally, we demonstrate its
+adaptability in gene-specific modeling, achieving over 95% accuracy in cell
+type classification. These case studies highlight AttentionSmithy's potential
+to accelerate research across diverse fields by removing framework
+implementation barriers.
 
-摘要：癌症手術是胃腸道 (GI) 癌症的主要治療方式，這類癌症佔全球癌症相關死亡人數的 35% 以上，但術後併發症無法預測，且可能危及生命。在本文中，我們探討大型語言模型 (LLM) 的近期進展如何透過臨床整合造福遠端病患監控 (RPM) 系統，方法是設計 RECOVER，一個由 LLM 驅動的 RPM 系統，用於術後胃腸道癌症照護。為了讓利害關係人密切參與設計流程，我們首先與五位臨床人員進行七場參與式設計會議，並訪談五位癌症患者，以找出六項整合臨床指南和資訊需求至基於 LLM 的 RPM 系統的主要設計策略。接著，我們設計並實作 RECOVER，其特色在於一個由 LLM 驅動的對話式代理人，供癌症患者使用，以及一個互動式儀表板，供臨床人員使用，以進行有效的術後 RPM。最後，我們使用 RECOVER 作為試點系統，與四位臨床人員和五位患者評估我們設計策略的實作，並透過找出重要的設計元素、提供對負責任 AI 的見解，以及概述未來由 LLM 驅動的 RPM 系統的機會，提出設計意涵。
+摘要：Transformer 架構已轉變 AI 應用，但對於缺乏低階實作專業知識的領域專家而言，自訂仍很複雜。我們推出 AttentionSmithy，這是一個模組化軟體套件，透過將關鍵元件分解成可重複使用的建構區塊（注意力模組、前饋網路、正規化層和位置編碼）來簡化 Transformer 創新。使用者可以快速建置原型和評估 Transformer 變體，而無需大量編碼。我們的架構支援四種位置編碼策略，並整合神經架構搜尋以進行自動化設計。我們透過在資源限制下複製原始 Transformer 和結合位置編碼來最佳化翻譯效能，驗證 AttentionSmithy。此外，我們展示其在基因特定建模中的適應性，在細胞類型分類中達到超過 95% 的準確度。這些案例研究突顯 AttentionSmithy 在移除架構實作障礙後，加速各個領域研究的潛力。
 
-##### **4D VQ-GAN: Synthesising Medical Scans at Any Time Point for Personalised Disease Progression Modelling of Idiopathic Pulmonary Fibrosis**
-2502.05713v1 by An Zhao, Moucheng Xu, Ahmed H. Shahin, Wim Wuyts, Mark G. Jones, Joseph Jacob, Daniel C. Alexander
+##### **Improve LLM-based Automatic Essay Scoring with Linguistic Features**
+2502.09497v1 by Zhaoyi Joey Hou, Alejandro Ciuba, Xiang Lorraine Li
 
-Understanding the progression trajectories of diseases is crucial for early
-diagnosis and effective treatment planning. This is especially vital for
-life-threatening conditions such as Idiopathic Pulmonary Fibrosis (IPF), a
-chronic, progressive lung disease with a prognosis comparable to many cancers.
-Computed tomography (CT) imaging has been established as a reliable diagnostic
-tool for IPF. Accurately predicting future CT scans of early-stage IPF patients
-can aid in developing better treatment strategies, thereby improving survival
-outcomes. In this paper, we propose 4D Vector Quantised Generative Adversarial
-Networks (4D-VQ-GAN), a model capable of generating realistic CT volumes of IPF
-patients at any time point. The model is trained using a two-stage approach. In
-the first stage, a 3D-VQ-GAN is trained to reconstruct CT volumes. In the
-second stage, a Neural Ordinary Differential Equation (ODE) based temporal
-model is trained to capture the temporal dynamics of the quantised embeddings
-generated by the encoder in the first stage. We evaluate different
-configurations of our model for generating longitudinal CT scans and compare
-the results against ground truth data, both quantitatively and qualitatively.
-For validation, we conduct survival analysis using imaging biomarkers derived
-from generated CT scans and achieve a C-index comparable to that of biomarkers
-derived from the real CT scans. The survival analysis results demonstrate the
-potential clinical utility inherent to generated longitudinal CT scans, showing
-that they can reliably predict survival outcomes.
+Automatic Essay Scoring (AES) assigns scores to student essays, reducing the
+grading workload for instructors. Developing a scoring system capable of
+handling essays across diverse prompts is challenging due to the flexibility
+and diverse nature of the writing task. Existing methods typically fall into
+two categories: supervised feature-based approaches and large language model
+(LLM)-based methods. Supervised feature-based approaches often achieve higher
+performance but require resource-intensive training. In contrast, LLM-based
+methods are computationally efficient during inference but tend to suffer from
+lower performance. This paper combines these approaches by incorporating
+linguistic features into LLM-based scoring. Experimental results show that this
+hybrid method outperforms baseline models for both in-domain and out-of-domain
+writing prompts.
 
-摘要：了解疾病的進程軌跡對於早期診斷和有效的治療計畫至關重要。這對於特發性肺纖維化 (IPF) 等威脅生命的疾病尤其重要，IPF 是一種慢性、進行性肺部疾病，其預後與許多癌症相當。電腦斷層掃描 (CT) 影像已被確立為 IPF 的可靠診斷工具。準確預測早期 IPF 患者的未來 CT 掃描有助於制定更好的治療策略，從而改善存活結果。在本文中，我們提出 4D 向量量化生成對抗網路 (4D-VQ-GAN)，這是一個模型，能夠在任何時間點生成 IPF 患者的逼真 CT 體積。該模型使用兩階段方法進行訓練。在第一階段，訓練 3D-VQ-GAN 以重建 CT 體積。在第二階段，訓練基於神經常微分方程 (ODE) 的時間模型，以捕捉第一階段編碼器生成的量化嵌入的時間動態。我們評估了我們的模型的不同配置，以生成縱向 CT 掃描，並在定量和定性方面將結果與真實數據進行比較。為了驗證，我們使用從生成的 CT 掃描中得出的影像生物標記進行存活分析，並獲得與從真實 CT 掃描中得出的生物標記相當的 C 指數。存活分析結果證明了生成縱向 CT 掃描固有的潛在臨床效用，表明它們可以可靠地預測存活結果。
+摘要：自動化論文評分 (AES) 會為學生的論文評分，以減輕教師的評分工作負擔。由於寫作任務的靈活性與多樣性，開發一種評分系統來處理各種提示的論文是一項挑戰。現有方法通常分為兩類：監督式特徵方法和大型語言模型 (LLM) 方法。監督式特徵方法通常能達到較高的效能，但需要大量資源進行訓練。相比之下，LLM 方法在推論期間的計算效率很高，但效能往往較低。本文結合了這些方法，將語言特徵納入 LLM 評分中。實驗結果顯示，這種混合方法在領域內和領域外寫作提示方面都優於基準模型。
 
-##### **KMI: A Dataset of Korean Motivational Interviewing Dialogues for Psychotherapy**
-2502.05651v1 by Hyunjong Kim, Suyeon Lee, Yeongjae Cho, Eunseo Ryu, Yohan Jo, Suran Seong, Sungzoon Cho
+##### **Cracking the Code: Enhancing Development finance understanding with artificial intelligence**
+2502.09495v1 by Pierre Beaucoral
 
-The increasing demand for mental health services has led to the rise of
-AI-driven mental health chatbots, though challenges related to privacy, data
-collection, and expertise persist. Motivational Interviewing (MI) is gaining
-attention as a theoretical basis for boosting expertise in the development of
-these chatbots. However, existing datasets are showing limitations for training
-chatbots, leading to a substantial demand for publicly available resources in
-the field of MI and psychotherapy. These challenges are even more pronounced in
-non-English languages, where they receive less attention. In this paper, we
-propose a novel framework that simulates MI sessions enriched with the
-expertise of professional therapists. We train an MI forecaster model that
-mimics the behavioral choices of professional therapists and employ Large
-Language Models (LLMs) to generate utterances through prompt engineering. Then,
-we present KMI, the first synthetic dataset theoretically grounded in MI,
-containing 1,000 high-quality Korean Motivational Interviewing dialogues.
-Through an extensive expert evaluation of the generated dataset and the
-dialogue model trained on it, we demonstrate the quality, expertise, and
-practicality of KMI. We also introduce novel metrics derived from MI theory in
-order to evaluate dialogues from the perspective of MI.
+Analyzing development projects is crucial for understanding donors aid
+strategies, recipients priorities, and to assess development finance capacity
+to adress development issues by on-the-ground actions. In this area, the
+Organisation for Economic Co-operation and Developments (OECD) Creditor
+Reporting System (CRS) dataset is a reference data source. This dataset
+provides a vast collection of project narratives from various sectors
+(approximately 5 million projects). While the OECD CRS provides a rich source
+of information on development strategies, it falls short in informing project
+purposes due to its reporting process based on donors self-declared main
+objectives and pre-defined industrial sectors. This research employs a novel
+approach that combines Machine Learning (ML) techniques, specifically Natural
+Language Processing (NLP), an innovative Python topic modeling technique called
+BERTopic, to categorise (cluster) and label development projects based on their
+narrative descriptions. By revealing existing yet hidden topics of development
+finance, this application of artificial intelligence enables a better
+understanding of donor priorities and overall development funding and provides
+methods to analyse public and private projects narratives.
 
-摘要：由於對心理健康服務的需求日益增加，導致以人工智慧為基礎的心理健康聊天機器人興起，儘管與隱私、資料蒐集和專業知識相關的挑戰依然存在。動機性訪談 (MI) 正作為提升這些聊天機器人在開發方面專業知識的理論基礎而備受關注。然而，現有的資料集顯示出訓練聊天機器人的限制，導致對 MI 和心理治療領域中公開可用資源的需求大幅增加。這些挑戰在非英語語言中更加明顯，因為它們受到的關注較少。在本文中，我們提出了一個新穎的架構，它模擬了豐富專業治療師專業知識的 MI 課程。我們訓練了一個 MI 預測模型，它模擬了專業治療師的行為選擇，並採用大型語言模型 (LLM) 透過提示工程來產生話語。然後，我們展示了 KMI，這是第一個理論上以 MI 為基礎的合成資料集，其中包含 1,000 個高品質的韓語動機性訪談對話。透過對所產生的資料集和在該資料集上訓練的對話模型進行廣泛的專家評估，我們展示了 KMI 的品質、專業知識和實用性。我們還引入了從 MI 理論中衍生的新指標，以便從 MI 的角度評估對話。
+摘要：分析發展專案對於了解捐助者援助策略、受贈者優先事項，以及評估發展資金能力以透過實際行動解決發展問題至關重要。在這個領域中，經濟合作暨發展組織 (OECD) 債權人報告系統 (CRS) 資料集是一個參考資料來源。此資料集提供來自各個部門的大量專案敘述（約 500 萬個專案）。雖然 OECD CRS 提供了豐富的發展策略資訊來源，但由於其報告程序基於捐助者自行申報的主要目標和預先定義的產業部門，因此在告知專案目的方面有所不足。本研究採用一種新穎的方法，結合機器學習 (ML) 技術，特別是自然語言處理 (NLP)，一種稱為 BERTopic 的創新 Python 主題建模技術，根據其敘述描述對發展專案進行分類（叢集）和標籤。透過揭露發展資金現有但隱藏的主題，這種人工智慧應用程式可以更好地了解捐助者的優先事項和整體發展資金，並提供分析公共和私人專案敘述的方法。
 
-##### **ELMTEX: Fine-Tuning Large Language Models for Structured Clinical Information Extraction. A Case Study on Clinical Reports**
-2502.05638v1 by Aynur Guluzade, Naguib Heiba, Zeyd Boukhers, Florim Hamiti, Jahid Hasan Polash, Yehya Mohamad, Carlos A Velasco
+##### **Objective quantification of mood states using large language models**
+2502.09487v1 by Jakub Onysk, Quentin Huys
 
-Europe's healthcare systems require enhanced interoperability and
-digitalization, driving a demand for innovative solutions to process legacy
-clinical data. This paper presents the results of our project, which aims to
-leverage Large Language Models (LLMs) to extract structured information from
-unstructured clinical reports, focusing on patient history, diagnoses,
-treatments, and other predefined categories. We developed a workflow with a
-user interface and evaluated LLMs of varying sizes through prompting strategies
-and fine-tuning. Our results show that fine-tuned smaller models match or
-surpass larger counterparts in performance, offering efficiency for
-resource-limited settings. A new dataset of 60,000 annotated English clinical
-summaries and 24,000 German translations was validated with automated and
-manual checks. The evaluations used ROUGE, BERTScore, and entity-level metrics.
-The work highlights the approach's viability and outlines future improvements.
+Emotional states influence human behaviour and cognition, leading to diverse
+thought trajectories. Similarly, Large Language Models (LLMs) showcase an
+excellent level of response consistency across wide-ranging contexts (prompts).
+We leverage these parallels to establish a framework for quantifying mental
+states. Our approach utilises self-report questionnaires that reliably assess
+these states due to their inherent sensitivity to patterns of co-occurring
+responses. Specifically, we recruited a large sample of participants (N=422) to
+investigate how well an LLM (Mistral-7B-OpenOrca) quantifies a heterogenous set
+of depressive mood states measured with participants' open-ended responses to a
+depression questionnaire. We show LLM responses to held-out multiple-choice
+questions, given participants' open-ended answers, correlate strongly (r:
+0.52-0.84) with true questionnaire scores, demonstrating LLM's generalisation
+from mood representations. We explore a link between these representations and
+factor analysis. Using ridge regression, we find depression-related subspaces
+within LLM hidden states. We show these subspaces to be predictive of
+participants' "Depression" and "Somatic & Emotional Distress" factor scores, as
+well as suicidality severity. Overall, LLMs can provide quantitative measures
+of mental states. The reliability of these hinges upon how informative the
+questions we ask participants are. Used correctly, this approach could
+supplement mental state assessment in a variety of settings.
 
-摘要：歐洲的醫療保健系統需要增強互通性和數位化，這驅動了對創新解決方案的需求，以處理傳統的臨床數據。本文介紹了我們專案的成果，該專案旨在利用大型語言模型 (LLM) 從非結構化的臨床報告中提取結構化的資訊，重點放在病歷、診斷、治療和其他預定義類別上。我們開發了一個具有使用者介面的工作流程，並透過提示策略和微調來評估不同規模的 LLM。我們的結果顯示，微調後的較小模型在效能上與較大的模型相匹配或超越它們，為資源有限的環境提供了效率。一個包含 60,000 個註解英文臨床摘要和 24,000 個德文翻譯的新資料集已透過自動化和手動檢查進行驗證。評估使用了 ROUGE、BERTScore 和實體層級的指標。這項工作突出了這種方法的可行性，並概述了未來的改進。
+摘要：情緒狀態會影響人類行為和認知，導致不同的思維軌跡。同樣地，大型語言模型 (LLM) 在廣泛的脈絡（提示）中展示出極佳的反應一致性。我們利用這些相似之處來建立一個量化心理狀態的框架。我們的做法利用自我報告問卷，由於這些問卷對共生反應模式具有內在敏感性，因此可以可靠地評估這些狀態。具體來說，我們招募了大量的參與者樣本 (N=422) 來調查 LLM (Mistral-7B-OpenOrca) 如何量化一組異質的抑鬱情緒狀態，這些狀態是根據參與者對抑鬱症問卷的開放式回答來衡量的。我們展示了 LLM 對保留的多選題的回答，給定參與者的開放式回答，與真正的問卷分數密切相關 (r：0.52-0.84)，這證明了 LLM 從情緒表徵中進行概括。我們探索這些表徵與因子分析之間的聯繫。使用嶺回歸，我們在 LLM 隱藏狀態內發現了與抑鬱相關的子空間。我們展示這些子空間可以預測參與者的「抑鬱」和「軀體和情緒困擾」因子分數，以及自殺嚴重性。總體而言，LLM 可以提供心理狀態的量化測量。這些測量的可靠性取決於我們詢問參與者的問題的資訊性。如果使用得當，這種方法可以補充各種環境中的心理狀態評估。
 
-##### **Multi-scale Masked Autoencoder for Electrocardiogram Anomaly Detection**
-2502.05494v1 by Ya Zhou, Yujie Yang, Jianhuang Gan, Xiangjie Li, Jing Yuan, Wei Zhao
+##### **The Multilingual Mind : A Survey of Multilingual Reasoning in Language Models**
+2502.09457v1 by Akash Ghosh, Debayan Datta, Sriparna Saha, Chirag Agarwal
 
-Electrocardiogram (ECG) analysis is a fundamental tool for diagnosing
-cardiovascular conditions, yet anomaly detection in ECG signals remains
-challenging due to their inherent complexity and variability. We propose
-Multi-scale Masked Autoencoder for ECG anomaly detection (MMAE-ECG), a novel
-end-to-end framework that effectively captures both global and local
-dependencies in ECG data. Unlike state-of-the-art methods that rely on
-heartbeat segmentation or R-peak detection, MMAE-ECG eliminates the need for
-such pre-processing steps, enhancing its suitability for clinical deployment.
-MMAE-ECG partitions ECG signals into non-overlapping segments, with each
-segment assigned learnable positional embeddings. A novel multi-scale masking
-strategy and multi-scale attention mechanism, along with distinct positional
-embeddings, enable a lightweight Transformer encoder to effectively capture
-both local and global dependencies. The masked segments are then reconstructed
-using a single-layer Transformer block, with an aggregation strategy employed
-during inference to refine the outputs. Experimental results demonstrate that
-our method achieves performance comparable to state-of-the-art approaches while
-significantly reducing computational complexity-approximately 1/78 of the
-floating-point operations (FLOPs) required for inference. Ablation studies
-further validate the effectiveness of each component, highlighting the
-potential of multi-scale masked autoencoders for anomaly detection.
+While reasoning and multilingual capabilities in Language Models (LMs) have
+achieved remarkable progress in recent years, their integration into a unified
+paradigm, multilingual reasoning, is at a nascent stage. Multilingual reasoning
+requires language models to handle logical reasoning across languages while
+addressing misalignment, biases, and challenges in low-resource settings. This
+survey provides the first in-depth review of multilingual reasoning in LMs. In
+this survey, we provide a systematic overview of existing methods that leverage
+LMs for multilingual reasoning, specifically outlining the challenges,
+motivations, and foundational aspects of applying language models to reason
+across diverse languages. We provide an overview of the standard data resources
+used for training multilingual reasoning in LMs and the evaluation benchmarks
+employed to assess their multilingual capabilities. Next, we analyze various
+state-of-the-art methods and their performance on these benchmarks. Finally, we
+explore future research opportunities to improve multilingual reasoning in LMs,
+focusing on enhancing their ability to handle diverse languages and complex
+reasoning tasks.
 
-摘要：心電圖 (ECG) 分析是診斷心血管疾病的基本工具，但由於 ECG 訊號本身的複雜性和變異性，異常偵測仍然是一項挑戰。我們提出用於 ECG 異常偵測的多尺度遮罩自編碼器 (MMAE-ECG)，這是一個新穎的端對端架構，可有效擷取 ECG 資料中的全局和局部依賴關係。與依賴於心跳區段或 R 波峰偵測的最新方法不同，MMAE-ECG 消除了對此類前處理步驟的需求，增強其適用於臨床部署。MMAE-ECG 將 ECG 訊號分割成不相疊的區段，每個區段都指派可學習的位置嵌入。新穎的多尺度遮罩策略和多尺度注意力機制，以及不同的位置嵌入，使輕量級 Transformer 編碼器能夠有效擷取局部和全局依賴關係。然後使用單層 Transformer 區塊重建遮罩區段，並在推理期間採用聚合策略來優化輸出。實驗結果表明，我們的模型達到了與最新方法相當的效能，同時大幅降低運算複雜度，約為推理所需的浮點運算 (FLOP) 的 1/78。消融研究進一步驗證了每個組件的有效性，突顯了多尺度遮罩自編碼器在異常偵測方面的潛力。
+摘要：儘管語言模型 (LM) 的推理和多語言能力在近年來取得顯著進展，但它們整合至統一典範（多語言推理）仍處於萌芽階段。多語言推理要求語言模型跨語言處理邏輯推理，同時解決低資源環境中的錯位、偏見和挑戰。本調查提供了 LM 中多語言推理的首次深入探討。在本調查中，我們系統性地概述了現有利用 LM 進行多語言推理的方法，特別概述了將語言模型應用於跨不同語言推理的挑戰、動機和基礎方面。我們概述了用於訓練 LM 中多語言推理的標準數據資源，以及用於評估其多語言能力的評估基準。接下來，我們分析了各種最先進的方法及其在這些基準上的表現。最後，我們探討了改進 LM 中多語言推理的未來研究機會，重點關注增強其處理不同語言和複雜推理任務的能力。
 
-##### **DCENWCNet: A Deep CNN Ensemble Network for White Blood Cell Classification with LIME-Based Explainability**
-2502.05459v1 by Sibasish Dhibar
+##### **Pixel-Level Reasoning Segmentation via Multi-turn Conversations**
+2502.09447v1 by Dexian Cai, Xiaocui Yang, Yongkang Liu, Daling Wang, Shi Feng, Yifei Zhang, Soujanya Poria
 
-White blood cells (WBC) are important parts of our immune system, and they
-protect our body against infections by eliminating viruses, bacteria, parasites
-and fungi. The number of WBC types and the total number of WBCs provide
-important information about our health status. A traditional method,
-convolutional neural networks (CNN), a deep learning architecture, can classify
-the blood cell from a part of an object and perform object recognition. Various
-CNN models exhibit potential; however, their development often involves ad-hoc
-processes that neglect unnecessary layers, leading to issues with unbalanced
-datasets and insufficient data augmentation. To address these challenges, we
-propose a novel ensemble approach that integrates three CNN architectures, each
-uniquely configured with different dropout and max-pooling layer settings to
-enhance feature learning. This ensemble model, named DCENWCNet, effectively
-balances the bias-variance trade-off. When evaluated on the widely recognized
-Rabbin-WBC dataset, our model outperforms existing state-of-the-art networks,
-achieving highest mean accuracy. Additionally, it demonstrates superior
-performance in precision, recall, F1-score, and Area Under the ROC Curve (AUC)
-across all categories. To delve deeper into the interpretability of
-classifiers, we employ reliable post-hoc explanation techniques, including
-Local Interpretable Model-Agnostic Explanations (LIME). These methods
-approximate the behavior of a black-box model by elucidating the relationships
-between feature values and predictions. Interpretable results enable users to
-comprehend and validate the model's predictions, thereby increasing their
-confidence in the automated diagnosis.
+Existing visual perception systems focus on region-level segmentation in
+single-turn dialogues, relying on complex and explicit query instructions. Such
+systems cannot reason at the pixel level and comprehend dynamic user intent
+that changes over interaction. Our work tackles this issue by introducing a
+novel task, Pixel-level Reasoning Segmentation (Pixel-level RS) based on
+multi-turn conversations, tracking evolving user intent via multi-turn
+interactions for fine-grained segmentation. To establish a benchmark for this
+novel task, we build a Pixel-level ReasonIng Segmentation Dataset Based on
+Multi-Turn Conversations (PRIST), comprising 24k utterances from 8.3k
+multi-turn conversational scenarios with segmentation targets. Building on
+PRIST, we further propose MIRAS, a Multi-turn Interactive ReAsoning
+Segmentation framework, integrates pixel-level segmentation with robust
+multi-turn conversation understanding, generating pixel-grounded explanations
+aligned with user intent. The PRIST dataset and MIRSA framework fill the gap in
+pixel-level reasoning segmentation. Experimental results on the PRIST dataset
+demonstrate that our method outperforms current segmentation-specific baselines
+in terms of segmentation and LLM-based reasoning metrics. The code and data are
+available at: https://github.com/ccccai239/PixelRIST.
 
-摘要：白血球 (WBC) 是我們免疫系統的重要組成部分，它們通過清除病毒、細菌、寄生蟲和真菌來保護我們的機體免受感染。WBC 類型數量和 WBC 總數提供了有關我們健康狀況的重要資訊。傳統方法卷積神經網路 (CNN) 是一種深度學習架構，可以對物體的一部分進行血細胞分類並執行物體識別。各種 CNN 模型展現出潛力；然而，它們的開發通常涉及忽略不必要層的臨時過程，導致不平衡的資料集和資料擴充不足的問題。為了應對這些挑戰，我們提出了一種新穎的整體方法，它整合了三種 CNN 架構，每種架構都採用不同的中斷和最大池化層設定進行獨特配置，以增強特徵學習。這種名為 DCENWCNet 的整體模型有效地平衡了偏差變異取捨。在廣泛認可的 Rabbin-WBC 資料集上進行評估時，我們的模型優於現有的最先進網路，達到了最高的平均準確度。此外，它在所有類別中都展示了在精確度、召回率、F1 分數和 ROC 曲線下面積 (AUC) 方面的卓越效能。為了更深入地研究分類器的可解釋性，我們採用了可靠的事後解釋技術，包括局部可解釋模型不可知解釋 (LIME)。這些方法通過闡明特徵值和預測之間的關係來近似黑盒模型的行為。可解釋的結果使用戶能夠理解和驗證模型的預測，從而增加他們對自動化診斷的信心。
+摘要：現有的視覺感知系統專注於單輪對話中的區域級分割，依賴於複雜且明確的查詢指令。此類系統無法在像素級別推理和理解在互動中不斷變化的動態使用者意圖。我們的研究通過引入一項基於多輪對話的像素級推理分割（像素級 RS）新任務來解決此問題，通過多輪互動追蹤不斷演變的使用者意圖，以進行精細分割。為了建立此新任務的基準，我們建立了一個基於多輪對話的像素級推理分割資料集（PRIST），其中包含來自 8.3k 多輪對話場景的 24k 個語句，以及分割目標。在 PRIST 的基礎上，我們進一步提出了 MIRAS，這是一個多輪互動推理分割框架，它將像素級分割與強大的多輪對話理解整合在一起，生成符合使用者意圖的像素級解釋。PRIST 資料集和 MIRSA 框架填補了像素級推理分割的空白。在 PRIST 資料集上的實驗結果表明，我們的模型在分割和基於 LLM 的推理指標方面優於目前的特定於分割的基準。程式碼和資料可在 https://github.com/ccccai239/PixelRIST 獲得。
 
-##### **Multi-Class Segmentation of Aortic Branches and Zones in Computed Tomography Angiography: The AortaSeg24 Challenge**
-2502.05330v1 by Muhammad Imran, Jonathan R. Krebs, Vishal Balaji Sivaraman, Teng Zhang, Amarjeet Kumar, Walker R. Ueland, Michael J. Fassler, Jinlong Huang, Xiao Sun, Lisheng Wang, Pengcheng Shi, Maximilian Rokuss, Michael Baumgartner, Yannick Kirchhof, Klaus H. Maier-Hein, Fabian Isensee, Shuolin Liu, Bing Han, Bong Thanh Nguyen, Dong-jin Shin, Park Ji-Woo, Mathew Choi, Kwang-Hyun Uhm, Sung-Jea Ko, Chanwoong Lee, Jaehee Chun, Jin Sung Kim, Minghui Zhang, Hanxiao Zhang, Xin You, Yun Gu, Zhaohong Pan, Xuan Liu, Xiaokun Liang, Markus Tiefenthaler, Enrique Almar-Munoz, Matthias Schwab, Mikhail Kotyushev, Rostislav Epifanov, Marek Wodzinski, Henning Muller, Abdul Qayyum, Moona Mazher, Steven A. Niederer, Zhiwei Wang, Kaixiang Yang, Jintao Ren, Stine Sofia Korreman, Yuchong Gao, Hongye Zeng, Haoyu Zheng, Rui Zheng, Jinghua Yue, Fugen Zhou, Bo Liu, Alexander Cosman, Muxuan Liang, Chang Zhao, Gilbert R. Upchurch Jr., Jun Ma, Yuyin Zhou, Michol A. Cooper, Wei Shao
+##### **Dual Formulation for Non-Rectangular Lp Robust Markov Decision Processes**
+2502.09432v1 by Navdeep Kumar, Adarsh Gupta, Maxence Mohamed Elfatihi, Giorgia Ramponi, Kfir Yehuda Levy, Shie Mannor
 
-Multi-class segmentation of the aorta in computed tomography angiography
-(CTA) scans is essential for diagnosing and planning complex endovascular
-treatments for patients with aortic dissections. However, existing methods
-reduce aortic segmentation to a binary problem, limiting their ability to
-measure diameters across different branches and zones. Furthermore, no
-open-source dataset is currently available to support the development of
-multi-class aortic segmentation methods. To address this gap, we organized the
-AortaSeg24 MICCAI Challenge, introducing the first dataset of 100 CTA volumes
-annotated for 23 clinically relevant aortic branches and zones. This dataset
-was designed to facilitate both model development and validation. The challenge
-attracted 121 teams worldwide, with participants leveraging state-of-the-art
-frameworks such as nnU-Net and exploring novel techniques, including cascaded
-models, data augmentation strategies, and custom loss functions. We evaluated
-the submitted algorithms using the Dice Similarity Coefficient (DSC) and
-Normalized Surface Distance (NSD), highlighting the approaches adopted by the
-top five performing teams. This paper presents the challenge design, dataset
-details, evaluation metrics, and an in-depth analysis of the top-performing
-algorithms. The annotated dataset, evaluation code, and implementations of the
-leading methods are publicly available to support further research. All
-resources can be accessed at https://aortaseg24.grand-challenge.org.
+We study robust Markov decision processes (RMDPs) with non-rectangular
+uncertainty sets, which capture interdependencies across states unlike
+traditional rectangular models. While non-rectangular robust policy evaluation
+is generally NP-hard, even in approximation, we identify a powerful class of
+$L_p$-bounded uncertainty sets that avoid these complexity barriers due to
+their structural simplicity. We further show that this class can be decomposed
+into infinitely many \texttt{sa}-rectangular $L_p$-bounded sets and leverage
+its structural properties to derive a novel dual formulation for $L_p$ RMDPs.
+This formulation provides key insights into the adversary's strategy and
+enables the development of the first robust policy evaluation algorithms for
+non-rectangular RMDPs. Empirical results demonstrate that our approach
+significantly outperforms brute-force methods, establishing a promising
+foundation for future investigation into non-rectangular robust MDPs.
 
-摘要：多類別主動脈電腦斷層血管攝影 (CTA) 掃描分割對於診斷和規劃主動脈剝離患者的複雜血管內治療至關重要。然而，現有方法將主動脈分割簡化為二元問題，限制了其測量不同分支和區域直徑的能力。此外，目前沒有開放原始碼數據集可用於支援多類別主動脈分割方法的開發。為了解決此問題，我們組織了 AortaSeg24 MICCAI 挑戰，引入了第一個包含 100 個 CTA 體積的數據集，這些體積針對 23 個臨床上相關的主動脈分支和區域進行了註釋。此數據集旨在促進模型開發和驗證。該挑戰吸引了來自世界各地的 121 個團隊，參與者利用了 nnU-Net 等最先進的框架，並探索了創新技術，包括串聯模型、數據擴充策略和自訂損失函數。我們使用 Dice 相似性係數 (DSC) 和標準化表面距離 (NSD) 評估了提交的演算法，重點介紹了前五名表現最佳團隊採用的方法。本文介紹了挑戰設計、數據集詳細資訊、評估指標以及對表現最佳演算法的深入分析。已公開註釋的數據集、評估程式碼和領先方法的實作，以支援進一步的研究。所有資源都可以在 https://aortaseg24.grand-challenge.org/ 獲得。
+摘要：我們研究具有非矩形不確定性集合的強健馬可夫決策過程 (RMDP)，它能捕捉到不同於傳統矩形模型的跨狀態相互依賴性。雖然非矩形強健策略評估通常是 NP-hard，即使在近似中也是如此，我們識別了一類強大的 $L_p$ 有界不確定性集合，由於其結構的簡潔性，可以避免這些複雜性障礙。我們進一步表明，此類可以分解為無限多的 \texttt{sa} 矩形 $L_p$ 有界集合，並利用其結構屬性為 $L_p$ RMDP 導出一個新的對偶公式。此公式提供了對抗者策略的重要見解，並能夠開發出第一個非矩形 RMDP 的強健策略評估演算法。實證結果表明，我們的做法顯著優於蠻力方法，為未來對非矩形強健 MDP 的研究奠定了有希望的基礎。
 
-##### **Homeomorphism Prior for False Positive and Negative Problem in Medical Image Dense Contrastive Representation Learning**
-2502.05282v1 by Yuting He, Boyu Wang, Rongjun Ge, Yang Chen, Guanyu Yang, Shuo Li
+##### **Transformer-Enhanced Variational Autoencoder for Crystal Structure Prediction**
+2502.09423v1 by Ziyi Chen, Yang Yuan, Siming Zheng, Jialong Guo, Sihan Liang, Yangang Wang, Zongguo Wang
 
-Dense contrastive representation learning (DCRL) has greatly improved the
-learning efficiency for image-dense prediction tasks, showing its great
-potential to reduce the large costs of medical image collection and dense
-annotation. However, the properties of medical images make unreliable
-correspondence discovery, bringing an open problem of large-scale false
-positive and negative (FP&N) pairs in DCRL. In this paper, we propose GEoMetric
-vIsual deNse sImilarity (GEMINI) learning which embeds the homeomorphism prior
-to DCRL and enables a reliable correspondence discovery for effective dense
-contrast. We propose a deformable homeomorphism learning (DHL) which models the
-homeomorphism of medical images and learns to estimate a deformable mapping to
-predict the pixels' correspondence under topological preservation. It
-effectively reduces the searching space of pairing and drives an implicit and
-soft learning of negative pairs via a gradient. We also propose a geometric
-semantic similarity (GSS) which extracts semantic information in features to
-measure the alignment degree for the correspondence learning. It will promote
-the learning efficiency and performance of deformation, constructing positive
-pairs reliably. We implement two practical variants on two typical
-representation learning tasks in our experiments. Our promising results on
-seven datasets which outperform the existing methods show our great
-superiority. We will release our code on a companion link:
-https://github.com/YutingHe-list/GEMINI.
+Crystal structure forms the foundation for understanding the physical and
+chemical properties of materials. Generative models have emerged as a new
+paradigm in crystal structure prediction(CSP), however, accurately capturing
+key characteristics of crystal structures, such as periodicity and symmetry,
+remains a significant challenge. In this paper, we propose a
+Transformer-Enhanced Variational Autoencoder for Crystal Structure Prediction
+(TransVAE-CSP), who learns the characteristic distribution space of stable
+materials, enabling both the reconstruction and generation of crystal
+structures. TransVAE-CSP integrates adaptive distance expansion with
+irreducible representation to effectively capture the periodicity and symmetry
+of crystal structures, and the encoder is a transformer network based on an
+equivariant dot product attention mechanism. Experimental results on the
+carbon_24, perov_5, and mp_20 datasets demonstrate that TransVAE-CSP
+outperforms existing methods in structure reconstruction and generation tasks
+under various modeling metrics, offering a powerful tool for crystal structure
+design and optimization.
 
-摘要：密集对比表征学习（DCRL）极大地提高了影像密集预测任务的学习效率，显示出其在降低医学影像收集和密集标注的大量成本方面的巨大潜力。然而，医学影像的特性使得对应关系发现不可靠，给 DCRL 带来大规模假阳性和假阴性（FP&N）对的开放性问题。在本文中，我们提出了 GEoMetric vIsual deNse sImilarity（GEMINI）学习，它将同胚先验嵌入 DCRL 中，并针对有效密集对比提供了可靠的对应关系发现。我们提出了一种可变形同胚学习（DHL），它对医学影像的同胚进行建模，并学习估计可变形映射，以预测在拓扑保持下的像素对应关系。它有效地减少了配对的搜索空间，并通过梯度驱动了负对的隐式和软学习。我们还提出了几何语义相似性（GSS），它提取特征中的语义信息，以测量对应关系学习的对齐度。它将促进变形学习的效率和性能，可靠地构建正对。我们在实验中针对两个典型的表征学习任务实现了两个实际变体。我们在七个数据集上的有希望的结果优于现有方法，显示出我们的巨大优势。我们将在配套链接中发布我们的代码：https://github.com/YutingHe-list/GEMINI。
+摘要：晶體結構形成了解材料物理和化學性質的基礎。生成模型已成為晶體結構預測 (CSP) 的新典範，然而，準確捕捉晶體結構的關鍵特徵（例如週期性和對稱性）仍然是一項重大挑戰。在本文中，我們提出了一種用於晶體結構預測的 Transformer 增強變異自動編碼器 (TransVAE-CSP)，它學習穩定材料的特徵分佈空間，使晶體結構的重建和生成成為可能。TransVAE-CSP 將自適應距離擴展與不可約表示相結合，以有效地捕捉晶體結構的週期性和對稱性，並且編碼器是一個基於等變點積注意力機制的 Transformer 網路。在 carbon_24、perov_5 和 mp_20 資料集上的實驗結果表明，TransVAE-CSP 在各種建模指標下，在結構重建和生成任務中優於現有方法，為晶體結構設計和最佳化提供了一個強大的工具。
 
-##### **"It Felt Like I Was Left in the Dark": Exploring Information Needs and Design Opportunities for Family Caregivers of Older Adult Patients in Critical Care Settings**
-2502.05115v1 by Shihan Fu, Bingsheng Yao, Smit Desai, Yuqi Hu, Yuling Sun, Samantha Stonbraker, Yanjun Gao, Elizabeth M. Goldberg, Dakuo Wang
+##### **On multi-token prediction for efficient LLM inference**
+2502.09419v1 by Somesh Mehra, Javier Alonso Garcia, Lukas Mauch
 
-Older adult patients constitute a rapidly growing subgroup of Intensive Care
-Unit (ICU) patients. In these situations, their family caregivers are expected
-to represent the unconscious patients to access and interpret patients' medical
-information. However, caregivers currently have to rely on overloaded
-clinicians for information updates and typically lack the health literacy to
-understand complex medical information. Our project aims to explore the
-information needs of caregivers of ICU older adult patients, from which we can
-propose design opportunities to guide future AI systems. The project begins
-with formative interviews with 11 caregivers to identify their challenges in
-accessing and interpreting medical information; From these findings, we then
-synthesize design requirements and propose an AI system prototype to cope with
-caregivers' challenges. The system prototype has two key features: a timeline
-visualization to show the AI extracted and summarized older adult patients' key
-medical events; and an LLM-based chatbot to provide context-aware informational
-support. We conclude our paper by reporting on the follow-up user evaluation of
-the system and discussing future AI-based systems for ICU caregivers of older
-adults.
+We systematically investigate multi-token prediction (MTP) capabilities
+within LLMs pre-trained for next-token prediction (NTP). We first show that
+such models inherently possess MTP capabilities via numerical marginalization
+over intermediate token probabilities, though performance is data-dependent and
+improves with model scale. Furthermore, we explore the challenges of
+integrating MTP heads into frozen LLMs and find that their hidden layers are
+strongly specialized for NTP, making adaptation non-trivial. Finally, we show
+that while joint training of MTP heads with the backbone improves performance,
+it cannot fully overcome this barrier, prompting further research in this
+direction. Our findings provide a deeper understanding of MTP applied to
+pretrained LLMs, informing strategies for accelerating inference through
+parallel token prediction.
 
-摘要：老年患者構成加護病房 (ICU) 患者中快速成長的子群。在這些情況下，預期他們的家庭照護者能代表無意識的患者取得並解讀患者的醫療資訊。然而，照護者目前必須依賴工作繁重的臨床醫師提供資訊更新，而且通常缺乏了解複雜醫療資訊的健康素養。我們的專案旨在探索 ICU 老年患者照護者的資訊需求，我們可以根據這些需求提出設計機會，以引導未來的 AI 系統。這個專案從對 11 位照護者的形成性訪談開始，以找出他們在取得和解讀醫療資訊方面的挑戰；根據這些發現，我們接著綜合設計需求，並提出一個 AI 系統原型，以應對照護者的挑戰。這個系統原型具有兩個關鍵特點：一個時間軸視覺化，以顯示 AI 萃取並摘要出的老年患者關鍵醫療事件；以及一個基於 LLM 的聊天機器人，以提供情境感知的資訊支援。我們透過報告系統的後續使用者評估，以及討論未來針對老年人 ICU 照護者的 AI 系統，來總結我們的論文。
+摘要：我們系統性地研究了在預先訓練下用於下一個代幣預測 (NTP) 的 LLM 中的多代幣預測 (MTP) 功能。我們首先表明，此類模型透過中間代幣機率的數值邊際化本質上具備 MTP 功能，儘管效能依賴於資料，且會隨著模型規模而提升。此外，我們探討了將 MTP 頭整合到凍結 LLM 中的挑戰，發現其隱藏層高度專門用於 NTP，使得適應變得不簡單。最後，我們顯示，儘管 MTP 頭與主幹的聯合訓練會提升效能，但無法完全克服此障礙，促使我們進一步研究這個方向。我們的發現提供了對應用於預先訓練 LLM 的 MTP 更深入的理解，並為透過平行代幣預測加速推論提供策略。
 
-##### **Mitigating Unintended Memorization with LoRA in Federated Learning for LLMs**
-2502.05087v1 by Thierry Bossy, Julien Vignoud, Tahseen Rabbani, Juan R. Troncoso Pastoriza, Martin Jaggi
+##### **SQuARE: Sequential Question Answering Reasoning Engine for Enhanced Chain-of-Thought in Large Language Models**
+2502.09390v1 by Daniel Fleischer, Moshe Berchansky, Gad Markovits, Moshe Wasserblat
 
-Federated learning (FL) is a popular paradigm for collaborative training
-which avoids direct data exposure between clients. However, data privacy issues
-still remain: FL-trained large language models are capable of memorizing and
-completing phrases and sentences contained in training data when given with
-their prefixes. Thus, it is possible for adversarial and honest-but-curious
-clients to recover training data of other participants simply through targeted
-prompting. In this work, we demonstrate that a popular and simple fine-tuning
-strategy, low-rank adaptation (LoRA), reduces memorization during FL up to a
-factor of 10. We study this effect by performing a medical question-answering
-fine-tuning task and injecting multiple replicas of out-of-distribution
-sensitive sequences drawn from an external clinical dataset. We observe a
-reduction in memorization for a wide variety of Llama 2 and 3 models, and find
-that LoRA can reduce memorization in centralized learning as well. Furthermore,
-we show that LoRA can be combined with other privacy-preserving techniques such
-as gradient clipping and Gaussian noising, secure aggregation, and Goldfish
-loss to further improve record-level privacy while maintaining performance.
+In the rapidly evolving field of Natural Language Processing, Large Language
+Models (LLMs) are tasked with increasingly complex reasoning challenges.
+Traditional methods like chain-of-thought prompting have shown promise but
+often fall short in fully leveraging a model's reasoning capabilities. This
+paper introduces SQuARE (Sequential Question Answering Reasoning Engine), a
+novel prompting technique designed to improve reasoning through a
+self-interrogation paradigm. Building upon CoT frameworks, SQuARE prompts
+models to generate and resolve multiple auxiliary questions before tackling the
+main query, promoting a more thorough exploration of various aspects of a
+topic. Our expansive evaluations, conducted with Llama 3 and GPT-4o models
+across multiple question-answering datasets, demonstrate that SQuARE
+significantly surpasses traditional CoT prompts and existing
+rephrase-and-respond methods. By systematically decomposing queries, SQuARE
+advances LLM capabilities in reasoning tasks. The code is publicly available at
+https://github.com/IntelLabs/RAG-FiT/tree/square.
 
-摘要：聯邦學習 (FL) 是一種流行的協作訓練範例，可避免客戶端之間直接公開資料。然而，資料隱私問題仍然存在：經過 FL 訓練的大型語言模型能夠記憶並完成訓練資料中包含的片語和句子，只要給予其前綴即可。因此，對抗和誠實但好奇的客戶端有可能僅透過目標提示來恢復其他參與者的訓練資料。在這項工作中，我們證明了一種流行且簡單的微調策略，低秩適應 (LoRA)，可將 FL 期間的記憶減少多達 10 倍。我們透過執行醫學問答微調任務並注入從外部臨床資料集抽取的非分佈敏感序列的多次複製品來研究此效應。我們觀察到各種 Llama 2 和 3 模型的記憶力降低，並發現 LoRA 也能減少集中式學習中的記憶力。此外，我們展示 LoRA 可以與其他隱私保護技術結合使用，例如梯度裁剪和高斯雜訊、安全聚合和 Goldfish 損失，以進一步改善記錄級隱私，同時維持效能。
+摘要：在快速發展的自然語言處理領域中，大型語言模型 (LLM) 負責越來越複雜的推理挑戰。
+傳統方法（如思考鏈提示）已展現潛力，但通常無法充分利用模型的推理能力。本文介紹 SQuARE（順序式問答推理引擎），這是一種新穎的提示技術，旨在透過自我提問模式來改善推理。建立在 CoT 架構之上，SQuARE 提示模型在處理主要查詢之前產生並解決多個輔助問題，促進對某個主題的各個面向進行更徹底的探討。我們使用 Llama 3 和 GPT-4o 模型對多個問答資料集進行廣泛評估，結果顯示 SQuARE 明顯優於傳統 CoT 提示和現有的改寫並回應方法。透過系統性地分解查詢，SQuARE 提升了 LLM 在推理任務中的能力。程式碼已公開於 https://github.com/IntelLabs/RAG-FiT/tree/square。
 
-##### **MedMimic: Physician-Inspired Multimodal Fusion for Early Diagnosis of Fever of Unknown Origin**
-2502.04794v1 by Minrui Chen, Yi Zhou, Huidong Jiang, Yuhan Zhu, Guanjie Zou, Minqi Chen, Rong Tian, Hiroto Saigo
+##### **Truth Knows No Language: Evaluating Truthfulness Beyond English**
+2502.09387v1 by Blanca Calvo Figueras, Eneko Sagarzazu, Julen Etxaniz, Jeremy Barnes, Pablo Gamallo, Iria De Dios Flores, Rodrigo Agerri
 
-Fever of unknown origin FUO remains a diagnostic challenge. MedMimic is
-introduced as a multimodal framework inspired by real-world diagnostic
-processes. It uses pretrained models such as DINOv2, Vision Transformer, and
-ResNet-18 to convert high-dimensional 18F-FDG PET/CT imaging into
-low-dimensional, semantically meaningful features. A learnable
-self-attention-based fusion network then integrates these imaging features with
-clinical data for classification. Using 416 FUO patient cases from Sichuan
-University West China Hospital from 2017 to 2023, the multimodal fusion
-classification network MFCN achieved macro-AUROC scores ranging from 0.8654 to
-0.9291 across seven tasks, outperforming conventional machine learning and
-single-modality deep learning methods. Ablation studies and five-fold
-cross-validation further validated its effectiveness. By combining the
-strengths of pretrained large models and deep learning, MedMimic offers a
-promising solution for disease classification.
+We introduce a professionally translated extension of the TruthfulQA
+benchmark designed to evaluate truthfulness in Basque, Catalan, Galician, and
+Spanish. Truthfulness evaluations of large language models (LLMs) have
+primarily been conducted in English. However, the ability of LLMs to maintain
+truthfulness across languages remains under-explored. Our study evaluates 12
+state-of-the-art open LLMs, comparing base and instruction-tuned models using
+human evaluation, multiple-choice metrics, and LLM-as-a-Judge scoring. Our
+findings reveal that, while LLMs perform best in English and worst in Basque
+(the lowest-resourced language), overall truthfulness discrepancies across
+languages are smaller than anticipated. Furthermore, we show that
+LLM-as-a-Judge correlates more closely with human judgments than
+multiple-choice metrics, and that informativeness plays a critical role in
+truthfulness assessment. Our results also indicate that machine translation
+provides a viable approach for extending truthfulness benchmarks to additional
+languages, offering a scalable alternative to professional translation.
+Finally, we observe that universal knowledge questions are better handled
+across languages than context- and time-dependent ones, highlighting the need
+for truthfulness evaluations that account for cultural and temporal
+variability. Dataset and code are publicly available under open licenses.
 
-摘要：不明原因發燒 (FUO) 仍然是診斷上的挑戰。MedMimic 是一個多模式架構，靈感來自於真實世界的診斷過程。它使用預先訓練的模型，例如 DINOv2、視覺轉換器和 ResNet-18，將高維 18F-FDG PET/CT 影像轉換為低維、語義有意義的特徵。一個可學習的自注意力融合網路接著將這些影像特徵與臨床資料整合，用於分類。使用 2017 年至 2023 年四川大學華西醫院的 416 個 FUO 病患病例，多模式融合分類網路 MFCN 在七項任務中達到了 0.8654 到 0.9291 的巨觀 AUROC 分數，優於傳統機器學習和單一模式深度學習方法。消融研究和五倍交叉驗證進一步驗證了其有效性。MedMimic 結合了預先訓練的大模型和深度學習的優點，為疾病分類提供了一個有前景的解決方案。
+摘要：我們針對 TruthfulQA 推出專業翻譯的延伸版本，旨在評估巴斯克語、加泰隆尼亞語、加利西亞語和西班牙語中的真實性。大型語言模型 (LLM) 的真實性評估主要以英語進行。然而，LLM 在不同語言中維持真實性的能力仍未得到充分探索。我們的研究評估了 12 個最先進的開放 LLM，使用人類評估、多選項指標和 LLM 作為評分標準比較基礎和指令調整模型。我們的研究結果表明，雖然 LLM 在英語中的表現最好，而在巴斯克語（資源最少的語言）中的表現最差，但整體上不同語言之間的真實性差異小於預期。此外，我們表明，與多選項指標相比，LLM 作為評分標準與人類判斷更密切相關，而且信息豐富性在真實性評估中發揮著至關重要的作用。我們的結果還表明，機器翻譯提供了一種可行的途徑，可以將真實性基準擴展到其他語言，從而提供了一種可擴展的專業翻譯替代方案。最後，我們觀察到，與上下文和時間依賴的問題相比，通用知識問題在不同語言之間的處理效果更好，這突顯了考慮文化和時間可變性的真實性評估的必要性。數據集和代碼在開放許可下公開可用。
 
-##### **MedGNN: Towards Multi-resolution Spatiotemporal Graph Learning for Medical Time Series Classification**
-2502.04515v1 by Wei Fan, Jingru Fei, Dingyu Guo, Kun Yi, Xiaozhuang Song, Haolong Xiang, Hangting Ye, Min Li
+##### **A Deep Inverse-Mapping Model for a Flapping Robotic Wing**
+2502.09378v1 by Hadar Sharvit, Raz Karl, Tsevi Beatus
 
-Medical time series has been playing a vital role in real-world healthcare
-systems as valuable information in monitoring health conditions of patients.
-Accurate classification for medical time series, e.g., Electrocardiography
-(ECG) signals, can help for early detection and diagnosis. Traditional methods
-towards medical time series classification rely on handcrafted feature
-extraction and statistical methods; with the recent advancement of artificial
-intelligence, the machine learning and deep learning methods have become more
-popular. However, existing methods often fail to fully model the complex
-spatial dynamics under different scales, which ignore the dynamic
-multi-resolution spatial and temporal joint inter-dependencies. Moreover, they
-are less likely to consider the special baseline wander problem as well as the
-multi-view characteristics of medical time series, which largely hinders their
-prediction performance. To address these limitations, we propose a
-Multi-resolution Spatiotemporal Graph Learning framework, MedGNN, for medical
-time series classification. Specifically, we first propose to construct
-multi-resolution adaptive graph structures to learn dynamic multi-scale
-embeddings. Then, to address the baseline wander problem, we propose Difference
-Attention Networks to operate self-attention mechanisms on the finite
-difference for temporal modeling. Moreover, to learn the multi-view
-characteristics, we utilize the Frequency Convolution Networks to capture
-complementary information of medical time series from the frequency domain. In
-addition, we introduce the Multi-resolution Graph Transformer architecture to
-model the dynamic dependencies and fuse the information from different
-resolutions. Finally, we have conducted extensive experiments on multiple
-medical real-world datasets that demonstrate the superior performance of our
-method. Our Code is available.
+In systems control, the dynamics of a system are governed by modulating its
+inputs to achieve a desired outcome. For example, to control the thrust of a
+quad-copter propeller the controller modulates its rotation rate, relying on a
+straightforward mapping between the input rotation rate and the resulting
+thrust. This mapping can be inverted to determine the rotation rate needed to
+generate a desired thrust. However, in complex systems, such as flapping-wing
+robots where intricate fluid motions are involved, mapping inputs (wing
+kinematics) to outcomes (aerodynamic forces) is nontrivial and inverting this
+mapping for real-time control is computationally impractical. Here, we report a
+machine-learning solution for the inverse mapping of a flapping-wing system
+based on data from an experimental system we have developed. Our model learns
+the input wing motion required to generate a desired aerodynamic force outcome.
+We used a sequence-to-sequence model tailored for time-series data and
+augmented it with a novel adaptive-spectrum layer that implements
+representation learning in the frequency domain. To train our model, we
+developed a flapping wing system that simultaneously measures the wing's
+aerodynamic force and its 3D motion using high-speed cameras. We demonstrate
+the performance of our system on an additional open-source dataset of a
+flapping wing in a different flow regime. Results show superior performance
+compared with more complex state-of-the-art transformer-based models, with 11%
+improvement on the test datasets median loss. Moreover, our model shows
+superior inference time, making it practical for onboard robotic control. Our
+open-source data and framework may improve modeling and real-time control of
+systems governed by complex dynamics, from biomimetic robots to biomedical
+devices.
 
-摘要：<paragraph>醫療時間序列在真實世界的醫療保健系統中扮演著至關重要的角色，作為監控患者健康狀況的寶貴資訊。
-準確分類醫療時間序列，例如心電圖 (ECG) 訊號，有助於早期偵測和診斷。傳統的醫療時間序列分類方法仰賴手工特徵萃取和統計方法；隨著人工智慧的最新進展，機器學習和深度學習方法變得更為普及。然而，現有方法通常無法完全建模不同尺度下的複雜空間動態，忽略了動態多解析度空間和時間關節相互依賴性。此外，它們不太可能考慮特殊的基線漂移問題以及醫療時間序列的多視角特性，這在很大程度上阻礙了它們的預測效能。為了解決這些限制，我們提出了一個多解析度時空圖形學習架構 MedGNN，用於醫療時間序列分類。具體來說，我們首先提出構建多解析度自適應圖形結構以學習動態多尺度嵌入。然後，為了解決基線漂移問題，我們提出差分注意力網路，對時間建模的有限差分運算自注意力機制。此外，為了學習多視角特性，我們利用頻率卷積網路從頻域擷取醫療時間序列的互補資訊。此外，我們引入了多解析度圖形Transformer架構來建模動態依賴性，並融合來自不同解析度的資訊。最後，我們對多個醫療真實世界資料集進行了廣泛的實驗，證明了我們方法的優異效能。我們的程式碼已公開。</paragraph>
+摘要：<paragraph>在系統控制中，系統的動態受調節其輸入以實現所需結果的影響。例如，為了控制四軸旋翼推進器的推力，控制器會調節其旋轉速率，依賴於輸入旋轉速率和所產生的推力之間的直接映射。此映射可以反轉以確定產生所需推力所需的旋轉速率。然而，在複雜的系統中，例如涉及複雜流體運動的拍打式機翼機器人，將輸入（機翼運動學）映射到輸出（空氣動力）並非易事，並且反轉此映射以進行實時控制在計算上不切實際。在此，我們報告了一個基於我們開發的實驗系統數據的拍打式機翼系統反向映射的機器學習解決方案。我們的模型學習產生所需空氣動力結果所需的輸入機翼運動。我們使用了一個專門針對時間序列數據的序列到序列模型，並用一個在頻域中實現表示學習的新型自適應譜層對其進行了擴充。為了訓練我們的模型，我們開發了一個拍打式機翼系統，該系統同時使用高速相機測量機翼的空氣動力和其 3D 運動。我們在一個不同的流動狀態下拍打機翼的另一個開源數據集上展示了我們系統的性能。結果表明，與更複雜的基於Transformer的最先進模型相比，性能優異，在測試數據集中損失中值改進了 11%。此外，我們的模型顯示出優異的推理時間，使其適用於機載機器人控制。我們的開源數據和框架可以改進受複雜動態支配的系統的建模和實時控制，從仿生機器人到生物醫學設備。</paragraph>
 
-##### **Integrating Generative Artificial Intelligence in ADRD: A Framework for Streamlining Diagnosis and Care in Neurodegenerative Diseases**
-2502.06842v1 by Andrew G. Breithaupt, Alice Tang, Bruce L. Miller, Pedro Pinheiro-Chagas
+##### **Language Agents as Digital Representatives in Collective Decision-Making**
+2502.09369v1 by Daniel Jarrett, Miruna Pîslar, Michiel A. Bakker, Michael Henry Tessler, Raphael Köster, Jan Balaguer, Romuald Elie, Christopher Summerfield, Andrea Tacchetti
 
-Healthcare systems are struggling to meet the growing demand for neurological
-care, with challenges particularly acute in Alzheimer's disease and related
-dementias (ADRD). While artificial intelligence research has often focused on
-identifying patterns beyond human perception, implementing such predictive
-capabilities remains challenging as clinicians cannot readily verify insights
-they cannot themselves detect. We propose that large language models (LLMs)
-offer more immediately practical applications by enhancing clinicians'
-capabilities in three critical areas: comprehensive data collection,
-interpretation of complex clinical information, and timely application of
-relevant medical knowledge. These challenges stem from limited time for proper
-diagnosis, growing data complexity, and an overwhelming volume of medical
-literature that exceeds any clinician's capacity to fully master. We present a
-framework for responsible AI integration that leverages LLMs' ability to
-communicate effectively with both patients and providers while maintaining
-human oversight. This approach prioritizes standardized, high-quality data
-collection to enable a system that learns from every patient encounter while
-incorporating the latest clinical evidence, continuously improving care
-delivery. We begin to address implementation challenges and initiate important
-discussions around ethical considerations and governance needs. While developed
-for ADRD, this roadmap provides principles for responsible AI integration
-across neurology and other medical specialties, with potential to improve
-diagnostic accuracy, reduce care disparities, and advance clinical knowledge
-through a learning healthcare system.
+Consider the process of collective decision-making, in which a group of
+individuals interactively select a preferred outcome from among a universe of
+alternatives. In this context, "representation" is the activity of making an
+individual's preferences present in the process via participation by a proxy
+agent -- i.e. their "representative". To this end, learned models of human
+behavior have the potential to fill this role, with practical implications for
+multi-agent scenario studies and mechanism design. In this work, we investigate
+the possibility of training \textit{language agents} to behave in the capacity
+of representatives of human agents, appropriately expressing the preferences of
+those individuals whom they stand for. First, we formalize the setting of
+\textit{collective decision-making} -- as the episodic process of interaction
+between a group of agents and a decision mechanism. On this basis, we then
+formalize the problem of \textit{digital representation} -- as the simulation
+of an agent's behavior to yield equivalent outcomes from the mechanism.
+Finally, we conduct an empirical case study in the setting of
+\textit{consensus-finding} among diverse humans, and demonstrate the
+feasibility of fine-tuning large language models to act as digital
+representatives.
 
-摘要：醫療體系正努力滿足日益增長的神經照護需求，其中阿茲海默症和相關失智症 (ADRD) 的挑戰特別嚴重。雖然人工智慧研究通常專注於識別人類感知之外的模式，但實作此類預測功能仍然具有挑戰性，因為臨床醫生無法輕易驗證他們自己無法偵測到的見解。我們提出大型語言模型 (LLM) 可透過提升臨床醫生在三個關鍵領域的能力，提供更直接且實用的應用：全面的資料收集、複雜臨床資訊的詮釋，以及適時應用相關的醫學知識。這些挑戰源自於適當診斷時間有限、資料複雜性日益增加，以及龐大的醫學文獻量超過任何臨床醫生所能完全掌握的容量。我們提出了一個負責任的 AI 整合架構，利用 LLM 與患者和提供者有效溝通的能力，同時維持人為監督。此方法優先考慮標準化、高品質的資料收集，以建立一個從每次患者接觸中學習的系統，同時納入最新的臨床證據，持續改善照護提供。我們開始探討實作挑戰，並展開關於倫理考量和治理需求的重要討論。儘管是為 ADRD 所開發，此藍圖提供了神經科和其他醫學專科負責任 AI 整合的原則，有潛力透過學習型醫療保健系統改善診斷準確性、減少照護差異，並推進臨床知識。
+摘要：考慮集體決策的過程，其中一群個人互動式地從一系列備選方案中選擇一個偏好的結果。在此脈絡中，「代表」是透過代理人（即他們的「代表」）參與，讓個人的偏好出現在這個過程中的活動。為此，人類行為的學習模型有可能填補這個角色，對多重代理人情境研究和機制設計具有實際意義。在這項工作中，我們探討訓練「語言代理人」的可能性，以代表人類代理人的身分行事，適當地表達他們所代表的那些個人的偏好。首先，我們將「集體決策」的設定形式化，作為一群代理人與決策機制之間互動的間歇性過程。在此基礎上，我們接著將「數位代表」的問題形式化，作為模擬代理人的行為，從機制中產生等效結果。最後，我們在多元人類的「共識尋求」設定中進行一個實證個案研究，並展示微調大型語言模型以作為數位代表的可行性。
 
-##### **Primary Care Diagnoses as a Reliable Predictor for Orthopedic Surgical Interventions**
-2502.04423v1 by Khushboo Verma, Alan Michels, Ergi Gumusaneli, Shilpa Chitnis, Smita Sinha Kumar, Christopher Thompson, Lena Esmail, Guruprasath Srinivasan, Chandini Panchada, Sushovan Guha, Satwant Kumar
+##### **Neural Spatiotemporal Point Processes: Trends and Challenges**
+2502.09341v1 by Sumantrak Mukherjee, Mouad Elhamdi, George Mohler, David A. Selby, Yao Xie, Sebastian Vollmer, Gerrit Grossmann
 
-Referral workflow inefficiencies, including misaligned referrals and delays,
-contribute to suboptimal patient outcomes and higher healthcare costs. In this
-study, we investigated the possibility of predicting procedural needs based on
-primary care diagnostic entries, thereby improving referral accuracy,
-streamlining workflows, and providing better care to patients. A de-identified
-dataset of 2,086 orthopedic referrals from the University of Texas Health at
-Tyler was analyzed using machine learning models built on Base General
-Embeddings (BGE) for semantic extraction. To ensure real-world applicability,
-noise tolerance experiments were conducted, and oversampling techniques were
-employed to mitigate class imbalance. The selected optimum and parsimonious
-embedding model demonstrated high predictive accuracy (ROC-AUC: 0.874, Matthews
-Correlation Coefficient (MCC): 0.540), effectively distinguishing patients
-requiring surgical intervention. Dimensionality reduction techniques confirmed
-the model's ability to capture meaningful clinical relationships. A threshold
-sensitivity analysis identified an optimal decision threshold (0.30) to balance
-precision and recall, maximizing referral efficiency. In the predictive
-modeling analysis, the procedure rate increased from 11.27% to an optimal
-60.1%, representing a 433% improvement with significant implications for
-operational efficiency and healthcare revenue.
-  The results of our study demonstrate that referral optimization can enhance
-primary and surgical care integration. Through this approach, precise and
-timely predictions of procedural requirements can be made, thereby minimizing
-delays, improving surgical planning, and reducing administrative burdens. In
-addition, the findings highlight the potential of clinical decision support as
-a scalable solution for improving patient outcomes and the efficiency of the
-healthcare system.
+Spatiotemporal point processes (STPPs) are probabilistic models for events
+occurring in continuous space and time. Real-world event data often exhibit
+intricate dependencies and heterogeneous dynamics. By incorporating modern deep
+learning techniques, STPPs can model these complexities more effectively than
+traditional approaches. Consequently, the fusion of neural methods with STPPs
+has become an active and rapidly evolving research area. In this review, we
+categorize existing approaches, unify key design choices, and explain the
+challenges of working with this data modality. We further highlight emerging
+trends and diverse application domains. Finally, we identify open challenges
+and gaps in the literature.
 
-摘要：轉診流程效率低落，包括轉診不當和延誤，
-導致次優的患者結果和更高的醫療保健成本。在這
-項研究中，我們探討了根據初級保健診斷條目預測程序需求的可能性，從而提高轉診準確性，
-簡化工作流程，並為患者提供更好的照護。一個去識別化
-德克薩斯大學健康中心的 2,086 個骨科轉診的資料集
-泰勒使用建立在基本通用
-語義提取的嵌入 (BGE) 上的機器學習模型進行分析。為了確保現實世界的適用性，
-進行了噪聲容忍度實驗，並採用了過採樣技術來減輕類別不平衡。所選的最佳和簡約
-嵌入模型展示了高預測準確度 (ROC-AUC：0.874，馬修斯
-相關系數 (MCC)：0.540)，有效區分需要手術干預的患者。降維
-技術證實了模型捕捉有意義的臨床關係的能力。閾值
-敏感性分析確定了一個最佳決策閾值 (0.30) 來平衡
-精確度和召回率，最大化轉診效率。在預測中
-建模分析中，程序率從 11.27% 增加到最佳的
-60.1%，代表 433% 的改進，對運營效率和醫療保健收入具有重大影響。
-我們研究的結果表明，轉診優化可以增強
-初級和外科護理整合。通過這種方法，可以對程序需求進行準確及時的預測，從而最大程度地減少
-延誤，改善手術計劃，並減輕行政負擔。此外，研究結果強調了臨床決策支持作為
-一個可擴展的解決方案的潛力，用於改善患者結果和醫療保健系統的效率。
+摘要：時空點過程 (STPP) 是事件在連續時空發生的機率模型。真實世界的事件資料通常會展現錯綜複雜的依賴關係和異質動態。透過結合現代深度學習技術，STPP 可以比傳統方法更有效地模擬這些複雜性。因此，神經方法與 STPP 的融合已成為一個活躍且快速發展的研究領域。在本篇評論中，我們對現有方法進行分類、統一關鍵設計選擇，並說明處理這種資料模式的挑戰。我們進一步強調新興趨勢和多樣化的應用領域。最後，我們找出文獻中的開放性挑戰和空白。
 
-##### **Automatic quantification of breast cancer biomarkers from multiple 18F-FDG PET image segmentation**
-2502.04083v1 by Tewele W. Tareke, Neree Payan, Alexandre Cochet, Laurent Arnould, Benoit Presles, Jean-Marc Vrigneaud, Fabrice Meriaudeau, Alain Lalande
+##### **Graph Diffusion Network for Drug-Gene Prediction**
+2502.09335v1 by Jiayang Wu, Wensheng Gan, Philip S. Yu
 
-Neoadjuvant chemotherapy (NAC) has become a standard clinical practice for
-tumor downsizing in breast cancer with 18F-FDG Positron Emission Tomography
-(PET). Our work aims to leverage PET imaging for the segmentation of breast
-lesions. The focus is on developing an automated system that accurately
-segments primary tumor regions and extracts key biomarkers from these areas to
-provide insights into the evolution of breast cancer following the first course
-of NAC. 243 baseline 18F-FDG PET scans (PET_Bl) and 180 follow-up 18F-FDG PET
-scans (PET_Fu) were acquired before and after the first course of NAC,
-respectively. Firstly, a deep learning-based breast tumor segmentation method
-was developed. The optimal baseline model (model trained on baseline exams) was
-fine-tuned on 15 follow-up exams and adapted using active learning to segment
-tumor areas in PET_Fu. The pipeline computes biomarkers such as maximum
-standardized uptake value (SUVmax), metabolic tumor volume (MTV), and total
-lesion glycolysis (TLG) to evaluate tumor evolution between PET_Fu and PET_Bl.
-Quality control measures were employed to exclude aberrant outliers. The nnUNet
-deep learning model outperformed in tumor segmentation on PET_Bl, achieved a
-Dice similarity coefficient (DSC) of 0.89 and a Hausdorff distance (HD) of 3.52
-mm. After fine-tuning, the model demonstrated a DSC of 0.78 and a HD of 4.95 mm
-on PET_Fu exams. Biomarkers analysis revealed very strong correlations whatever
-the biomarker between manually segmented and automatically predicted regions.
-The significant average decrease of SUVmax, MTV and TLG were 5.22, 11.79 cm3
-and 19.23 cm3, respectively. The presented approach demonstrates an automated
-system for breast tumor segmentation from 18F-FDG PET. Thanks to the extracted
-biomarkers, our method enables the automatic assessment of cancer progression.
+Predicting drug-gene associations is crucial for drug development and disease
+treatment. While graph neural networks (GNN) have shown effectiveness in this
+task, they face challenges with data sparsity and efficient contrastive
+learning implementation. We introduce a graph diffusion network for drug-gene
+prediction (GDNDGP), a framework that addresses these limitations through two
+key innovations. First, it employs meta-path-based homogeneous graph learning
+to capture drug-drug and gene-gene relationships, ensuring similar entities
+share embedding spaces. Second, it incorporates a parallel diffusion network
+that generates hard negative samples during training, eliminating the need for
+exhaustive negative sample retrieval. Our model achieves superior performance
+on the DGIdb 4.0 dataset and demonstrates strong generalization capability on
+tripartite drug-gene-disease networks. Results show significant improvements
+over existing methods in drug-gene prediction tasks, particularly in handling
+complex heterogeneous relationships. The source code is publicly available at
+https://github.com/csjywu1/GDNDGP.
 
-摘要：新辅助化疗 (NAC) 已成为乳腺癌中采用 18F-FDG 正电子发射断层扫描 (PET) 进行肿瘤缩小的标准临床实践。我们的工作旨在利用 PET 影像分割乳腺病变。重点在于开发一个自动系统，该系统可以准确分割原发性肿瘤区域并从这些区域提取关键生物标记，以深入了解乳腺癌在第一疗程 NAC 后的演变。分别在第一疗程 NAC 之前和之后采集了 243 例基线 18F-FDG PET 扫描 (PET_Bl) 和 180 例随访 18F-FDG PET 扫描 (PET_Fu)。首先，开发了一种基于深度学习的乳腺肿瘤分割方法。对 15 例随访检查对最优基线模型（在基线检查中训练的模型）进行了微调，并使用主动学习对 PET_Fu 中的肿瘤区域进行了分割。该管道计算诸如最大标准摄取值 (SUVmax)、代谢肿瘤体积 (MTV) 和总病灶糖酵解 (TLG) 等生物标记，以评估 PET_Fu 和 PET_Bl 之间的肿瘤演变。采用质量控制措施来排除异常值。nnUNet 深度学习模型在 PET_Bl 上的肿瘤分割方面表现出色，达到 0.89 的 Dice 相似性系数 (DSC) 和 3.52 毫米的 Hausdorff 距离 (HD)。微调后，该模型在 PET_Fu 检查中显示出 0.78 的 DSC 和 4.95 毫米的 HD。无论手动分割区域和自动预测区域之间的生物标记如何，生物标记分析都显示出非常强的相关性。SUVmax、MTV 和 TLG 的平均显着下降分别为 5.22、11.79 cm3 和 19.23 cm3。所提出的方法展示了一个用于从 18F-FDG PET 分割乳腺肿瘤的自动化系统。由于提取了生物标记，我们的方法能够自动评估癌症进展。
+摘要：預測藥物基因關聯對藥物開發和疾病治療至關重要。雖然圖神經網路 (GNN) 已顯示在這個任務中的有效性，但它們在資料稀疏性和高效對比學習實作方面面臨挑戰。我們引入了一個用於藥物基因預測的圖擴散網路 (GDNDGP)，這是一個透過兩項關鍵創新來解決這些限制的框架。首先，它採用基於元路徑的同質圖學習來捕捉藥物-藥物和基因-基因關係，確保類似實體共享嵌入空間。其次，它整合了一個並行擴散網路，在訓練期間產生困難的負面樣本，消除了對詳盡負面樣本擷取的需求。我們的模型在 DGIdb 4.0 資料集上取得了卓越的效能，並在三方藥物-基因-疾病網路中展現強大的概化能力。結果顯示在藥物基因預測任務中，相較於現有方法有顯著的進步，特別是在處理複雜的異質關係方面。原始碼已公開於 https://github.com/csjywu1/GDNDGP。
 
-##### **Generalize Drug Response Prediction by Latent Independent Projection for Asymmetric Constrained Domain Generalization**
-2502.04034v1 by Ran Song, Yinpu Bai, Hui Liu
+##### **Beyond English: The Impact of Prompt Translation Strategies across Languages and Tasks in Multilingual LLMs**
+2502.09331v1 by Itai Mondshine, Tzuf Paz-Argaman, Reut Tsarfaty
 
-The accurate prediction of drug responses remains a formidable challenge,
-particularly at the single-cell level and in clinical treatment contexts. Some
-studies employ transfer learning techniques to predict drug responses in
-individual cells and patients, but they require access to target-domain data
-during training, which is often unavailable or only obtainable in future. In
-this study, we propose a novel domain generalization framework, termed
-panCancerDR, to address this challenge. We conceptualize each cancer type as a
-distinct source domain, with its cell lines serving as domain-specific samples.
-Our primary objective is to extract domain-invariant features from the
-expression profiles of cell lines across diverse cancer types, thereby
-generalize the predictive capacity to out-of-distribution samples. To enhance
-robustness, we introduce a latent independence projection (LIP) module that
-encourages the encoder to extract informative yet non-redundant features. Also,
-we propose an asymmetric adaptive clustering constraint, which clusters
-drug-sensitive samples into a compact group while drives resistant samples
-dispersed across separate clusters in the latent space. Our empirical
-experiments demonstrate that panCancerDR effectively learns task-relevant
-features from diverse source domains, and achieves accurate predictions of drug
-response for unseen cancer type during training. Furthermore, when evaluated on
-single-cell and patient-level prediction tasks, our model-trained solely on in
-vitro cell line data without access to target-domain information-consistently
-outperforms and matched current state-of-the-art methods. These findings
-highlights the potential of our method for real-world clinical applications.
+Despite advances in the multilingual capabilities of Large Language Models
+(LLMs) across diverse tasks, English remains the dominant language for LLM
+research and development. So, when working with a different language, this has
+led to the widespread practice of pre-translation, i.e., translating the task
+prompt into English before inference. Selective pre-translation, a more
+surgical approach, focuses on translating specific prompt components. However,
+its current use is sporagic and lacks a systematic research foundation.
+Consequently, the optimal pre-translation strategy for various multilingual
+settings and tasks remains unclear. In this work, we aim to uncover the optimal
+setup for pre-translation by systematically assessing its use. Specifically, we
+view the prompt as a modular entity, composed of four functional parts:
+instruction, context, examples, and output, either of which could be translated
+or not. We evaluate pre-translation strategies across 35 languages covering
+both low and high-resource languages, on various tasks including Question
+Answering (QA), Natural Language Inference (NLI), Named Entity Recognition
+(NER), and Abstractive Summarization. Our experiments show the impact of
+factors as similarity to English, translation quality and the size of
+pre-trained data, on the model performance with pre-translation. We suggest
+practical guidelines for choosing optimal strategies in various multilingual
+settings.
 
-摘要：<paragraph>準確預測藥物反應仍然是一項艱鉅的挑戰，特別是在單細胞層級和臨床治療背景中。一些研究採用遷移學習技術來預測個別細胞和患者的藥物反應，但它們需要在訓練期間存取目標網域資料，而這些資料通常無法取得，或只能在未來取得。在這項研究中，我們提出一個新穎的網域概化架構，稱為 panCancerDR，以應對這項挑戰。我們將每種類型的癌症概念化為一個不同的來源網域，其細胞株作為特定網域的樣本。我們的首要目標是從不同癌症類型的細胞株表現特徵中萃取網域不變特徵，從而將預測能力概化到分布外的樣本。為了增強穩健性，我們引入一個潛在獨立投影 (LIP) 模組，鼓勵編碼器萃取有資訊但非冗餘的特徵。此外，我們提出一個非對稱自適應聚類約束，將對藥物敏感的樣本聚類到一個緊湊的群組中，同時驅動抗藥性樣本分散在潛在空間中的不同群組中。我們的實證實驗證明，panCancerDR 有效地從不同的來源網域學習與任務相關的特徵，並在訓練期間對未見的癌症類型實現準確的藥物反應預測。此外，當在單細胞和患者層級預測任務中進行評估時，我們的模型僅在體外細胞株資料上訓練，而沒有存取目標網域資訊，始終優於並符合當前的最新方法。這些發現突顯了我們的方法在實際臨床應用中的潛力。</paragraph>
+摘要：儘管大型語言模型 (LLM) 在各種任務中的多語言能力有進步，英語仍然是 LLM 研究和開發的主導語言。因此，在使用不同語言時，這導致了預翻譯的廣泛實務，即在推理之前將任務提示翻譯成英語。選擇性預翻譯是一種更精準的方法，專注於翻譯特定提示組成部分。然而，目前的使用是零星的，缺乏系統性的研究基礎。因此，各種多語言設定和任務的最佳預翻譯策略仍不清楚。在這項工作中，我們旨在透過系統性評估預翻譯的使用，找出其最佳設定。具體來說，我們將提示視為一個模組化實體，由四個功能部分組成：說明、背景、範例和輸出，其中任何一個都可以翻譯或不翻譯。我們在 35 種語言中評估預翻譯策略，涵蓋低資源語言和高資源語言，以及各種任務，包括問答 (QA)、自然語言推理 (NLI)、命名實體識別 (NER) 和抽象摘要。我們的實驗顯示了與英語的相似性、翻譯品質和預訓練資料大小等因素對預翻譯模型效能的影響。我們建議在各種多語言設定中選擇最佳策略的實用指南。
 
-##### **MedRAG: Enhancing Retrieval-augmented Generation with Knowledge Graph-Elicited Reasoning for Healthcare Copilot**
-2502.04413v1 by Xuejiao Zhao, Siyan Liu, Su-Yin Yang, Chunyan Miao
+##### **A Judge-free LLM Open-ended Generation Benchmark Based on the Distributional Hypothesis**
+2502.09316v1 by Kentaro Imajo, Masanori Hirano, Shuji Suzuki, Hiroaki Mikami
 
-Retrieval-augmented generation (RAG) is a well-suited technique for
-retrieving privacy-sensitive Electronic Health Records (EHR). It can serve as a
-key module of the healthcare copilot, helping reduce misdiagnosis for
-healthcare practitioners and patients. However, the diagnostic accuracy and
-specificity of existing heuristic-based RAG models used in the medical domain
-are inadequate, particularly for diseases with similar manifestations. This
-paper proposes MedRAG, a RAG model enhanced by knowledge graph (KG)-elicited
-reasoning for the medical domain that retrieves diagnosis and treatment
-recommendations based on manifestations. MedRAG systematically constructs a
-comprehensive four-tier hierarchical diagnostic KG encompassing critical
-diagnostic differences of various diseases. These differences are dynamically
-integrated with similar EHRs retrieved from an EHR database, and reasoned
-within a large language model. This process enables more accurate and specific
-decision support, while also proactively providing follow-up questions to
-enhance personalized medical decision-making. MedRAG is evaluated on both a
-public dataset DDXPlus and a private chronic pain diagnostic dataset (CPDD)
-collected from Tan Tock Seng Hospital, and its performance is compared against
-various existing RAG methods. Experimental results show that, leveraging the
-information integration and relational abilities of the KG, our MedRAG provides
-more specific diagnostic insights and outperforms state-of-the-art models in
-reducing misdiagnosis rates. Our code will be available at
-https://github.com/SNOWTEAM2023/MedRAG
+Evaluating the open-ended text generation of large language models (LLMs) is
+challenging because of the lack of a clear ground truth and the high cost of
+human or LLM-based assessments. We propose a novel benchmark that evaluates
+LLMs using n-gram statistics and rules, without relying on human judgement or
+LLM-as-a-judge approaches. Using 50 question and reference answer sets, we
+introduce three new metrics based on n-grams and rules: Fluency, Truthfulness,
+and Helpfulness. Our benchmark strongly correlates with GPT-4o-based
+evaluations while requiring significantly fewer computational resources,
+demonstrating its effectiveness as a scalable alternative for assessing LLMs'
+open-ended generation capabilities.
+
+摘要：評估大型語言模型 (LLM) 的開放式文字生成具有挑戰性，因為缺乏明確的基礎真實性，以及人工或基於 LLM 的評估成本高昂。我們提出一個新基準，使用 n-gram 統計和規則來評估 LLM，而不依賴於人工判斷或 LLM 作為評審的方法。使用 50 個問題和參考答案集，我們基於 n-gram 和規則引入了三項新指標：流暢度、真實性和有幫助性。我們的基準與基於 GPT-4o 的評估密切相關，同時需要明顯更少的計算資源，證明了其作為評估 LLM 的開放式生成能力的可擴充替代方案的有效性。
+
+##### **When the LM misunderstood the human chuckled: Analyzing garden path effects in humans and language models**
+2502.09307v1 by Samuel Joseph Amouyal, Aya Meltzer-Asscher, Jonathan Berant
+
+Modern Large Language Models (LLMs) have shown human-like abilities in many
+language tasks, sparking interest in comparing LLMs' and humans' language
+processing. In this paper, we conduct a detailed comparison of the two on a
+sentence comprehension task using garden-path constructions, which are
+notoriously challenging for humans. Based on psycholinguistic research, we
+formulate hypotheses on why garden-path sentences are hard, and test these
+hypotheses on human participants and a large suite of LLMs using comprehension
+questions. Our findings reveal that both LLMs and humans struggle with specific
+syntactic complexities, with some models showing high correlation with human
+comprehension. To complement our findings, we test LLM comprehension of
+garden-path constructions with paraphrasing and text-to-image generation tasks,
+and find that the results mirror the sentence comprehension question results,
+further validating our findings on LLM understanding of these constructions.
 
-摘要：檢索增強生成 (RAG) 是一種適用於檢索隱私敏感的電子健康記錄 (EHR) 的技術。它可以作為醫療保健副駕駛的一個關鍵模組，協助減少醫療保健從業人員和患者的誤診。然而，在醫療領域中使用的現有基於啟發法的 RAG 模型的診斷準確性和特異性不足，特別是對於具有類似表現的疾病。本文提出 MedRAG，一種由知識圖譜 (KG) 引發的推理增強的 RAG 模型，用於醫療領域，它根據表現檢索診斷和治療建議。MedRAG 系統性地構建了一個全面的四層階層式診斷 KG，涵蓋各種疾病的關鍵診斷差異。這些差異與從 EHR 資料庫中檢索到的類似 EHR 動態整合，並在大型語言模型中進行推理。這個過程可以實現更準確和具體的決策支援，同時主動提供後續問題，以增強個人化醫療決策制定。MedRAG 在公共資料集 DDXPlus 和從陳篤生醫院收集的私人慢性疼痛診斷資料集 (CPDD) 上進行評估，並將其效能與各種現有 RAG 方法進行比較。實驗結果顯示，利用 KG 的資訊整合和關係能力，我們的 MedRAG 提供了更具體的診斷見解，並在降低誤診率方面優於最先進的模型。我們的程式碼將在 https://github.com/SNOWTEAM2023/MedRAG 提供
+摘要：現代大型語言模型（LLM）在許多語言任務中展現出類似人類的能力，引發了比較 LLM 與人類語言處理的興趣。在本文中，我們使用對人類來說極具挑戰的花園路徑結構，對這兩者進行了詳細比較，以進行句子理解任務。根據心理語言學研究，我們制定了關於為什麼花園路徑句子困難的假設，並使用理解問題對人類參與者和大量 LLM 測試這些假設。我們的研究結果表明，LLM 和人類都難以應付特定的句法複雜性，其中一些模型與人類理解力高度相關。為了補充我們的研究結果，我們測試了 LLM 對花園路徑結構的理解，並進行了改寫和文字轉換為圖像的生成任務，並發現結果反映了句子理解問題的結果，進一步驗證了我們對 LLM 理解這些結構的研究結果。
 
-##### **Transforming Multimodal Models into Action Models for Radiotherapy**
-2502.04408v1 by Matteo Ferrante, Alessandra Carosi, Rolando Maria D Angelillo, Nicola Toschi
+##### **Indeterminacy in Affective Computing: Considering Meaning and Context in Data Collection Practices**
+2502.09294v1 by Bernd Dudzik, Tiffany Matej Hrkalovic, Chenxu Hao, Chirag Raman, Masha Tsfasman
 
-Radiotherapy is a crucial cancer treatment that demands precise planning to
-balance tumor eradication and preservation of healthy tissue. Traditional
-treatment planning (TP) is iterative, time-consuming, and reliant on human
-expertise, which can potentially introduce variability and inefficiency. We
-propose a novel framework to transform a large multimodal foundation model
-(MLM) into an action model for TP using a few-shot reinforcement learning (RL)
-approach. Our method leverages the MLM's extensive pre-existing knowledge of
-physics, radiation, and anatomy, enhancing it through a few-shot learning
-process. This allows the model to iteratively improve treatment plans using a
-Monte Carlo simulator. Our results demonstrate that this method outperforms
-conventional RL-based approaches in both quality and efficiency, achieving
-higher reward scores and more optimal dose distributions in simulations on
-prostate cancer data. This proof-of-concept suggests a promising direction for
-integrating advanced AI models into clinical workflows, potentially enhancing
-the speed, quality, and standardization of radiotherapy treatment planning.
+Automatic Affect Prediction (AAP) uses computational analysis of input data
+such as text, speech, images, and physiological signals to predict various
+affective phenomena (e.g., emotions or moods). These models are typically
+constructed using supervised machine-learning algorithms, which rely heavily on
+labeled training datasets. In this position paper, we posit that all AAP
+training data are derived from human Affective Interpretation Processes,
+resulting in a form of Affective Meaning. Research on human affect indicates a
+form of complexity that is fundamental to such meaning: it can possess what we
+refer to here broadly as Qualities of Indeterminacy (QIs) - encompassing
+Subjectivity (meaning depends on who is interpreting), Uncertainty (lack of
+confidence regarding meanings' correctness), Ambiguity (meaning contains
+mutually exclusive concepts) and Vagueness (meaning is situated at different
+levels in a nested hierarchy). Failing to appropriately consider QIs leads to
+results incapable of meaningful and reliable predictions. Based on this
+premise, we argue that a crucial step in adequately addressing indeterminacy in
+AAP is the development of data collection practices for modeling corpora that
+involve the systematic consideration of 1) a relevant set of QIs and 2) context
+for the associated interpretation processes. To this end, we are 1) outlining a
+conceptual model of AIPs and the QIs associated with the meaning these produce
+and a conceptual structure of relevant context, supporting understanding of its
+role. Finally, we use our framework for 2) discussing examples of
+context-sensitivity-related challenges for addressing QIs in data collection
+setups. We believe our efforts can stimulate a structured discussion of both
+the role of aspects of indeterminacy and context in research on AAP, informing
+the development of better practices for data collection and analysis.
 
-摘要：放射治療是一種重要的癌症治療方法，需要精確的規劃來平衡腫瘤根除和健康組織的保留。傳統的治療規劃（TP）是反覆的、耗時的，並且依賴於人為專業知識，這可能會引入變異性和低效率。我們提出了一個新穎的框架，使用少次強化學習 (RL) 方法將大型多模態基礎模型 (MLM) 轉換為 TP 的動作模型。我們的模型利用了 MLM 對物理、輻射和解剖學的廣泛預先存在的知識，並通過少次學習過程對其進行增強。這允許模型使用蒙特卡羅模擬器反覆改進治療計劃。我們的結果表明，這種方法在質量和效率方面都優於基於傳統 RL 的方法，在對前列腺癌數據進行模擬時，獲得了更高的獎勵分數和更優化的劑量分佈。這個概念驗證表明了一個有希望的方向，即將先進的人工智慧模型整合到臨床工作流程中，從而有可能提高放射治療計劃的速度、質量和標準化。
+摘要：自動影響預測 (AAP) 使用輸入資料的運算分析，例如文字、語音、影像和生理訊號，來預測各種情感現象（例如情緒或心情）。這些模型通常使用監督式機器學習演算法建構，而這些演算法高度依賴標籤訓練資料集。在此立場文件中，我們主張所有 AAP 訓練資料都是從人類的情感詮釋過程中衍生而來的，進而形成一種情感意義。對人類情感的研究指出，這種複雜性是此種意義的基本要素：它可能具備我們在此廣泛稱之為不確定性品質 (QI)，包括主觀性（意義取決於詮釋者）、不確定性（對於意義正確性的信心不足）、歧義性（意義包含相互排斥的概念）和模糊性（意義位於嵌套層級的不同層級）。未能適當地考量 QI 會導致無法進行有意義且可靠預測的結果。基於此前提，我們主張，在 AAP 中適當地處理不確定性的關鍵步驟，是針對建模語料庫制定資料收集實務，其中涉及系統性地考量 1) 一組相關的 QI，以及 2) 相關詮釋過程的脈絡。為此，我們 1) 概述了 AIP 的概念模型，以及與這些 AIP 所產生的意義相關的 QI，以及相關脈絡的概念結構，支持對其角色的理解。最後，我們使用我們的架構 2) 討論了在資料收集設定中處理 QI 時，與脈絡敏感性相關的挑戰範例。我們相信我們的努力可以激勵對不確定性和脈絡面向在 AAP 研究中扮演的角色進行結構化的討論，為資料收集和分析的最佳實務發展提供資訊。
 
-##### **Online Location Planning for AI-Defined Vehicles: Optimizing Joint Tasks of Order Serving and Spatio-Temporal Heterogeneous Model Fine-Tuning**
-2502.04399v1 by Bokeng Zheng, Bo Rao, Tianxiang Zhu, Chee Wei Tan, Jingpu Duan, Zhi Zhou, Xu Chen, Xiaoxi Zhang
+##### **SparQLe: Speech Queries to Text Translation Through LLMs**
+2502.09284v1 by Amirbek Djanibekov, Hanan Aldarmaki
 
-Advances in artificial intelligence (AI) including foundation models (FMs),
-are increasingly transforming human society, with smart city driving the
-evolution of urban living.Meanwhile, vehicle crowdsensing (VCS) has emerged as
-a key enabler, leveraging vehicles' mobility and sensor-equipped capabilities.
-In particular, ride-hailing vehicles can effectively facilitate flexible data
-collection and contribute towards urban intelligence, despite resource
-limitations. Therefore, this work explores a promising scenario, where
-edge-assisted vehicles perform joint tasks of order serving and the emerging
-foundation model fine-tuning using various urban data. However, integrating the
-VCS AI task with the conventional order serving task is challenging, due to
-their inconsistent spatio-temporal characteristics: (i) The distributions of
-ride orders and data point-of-interests (PoIs) may not coincide in geography,
-both following a priori unknown patterns; (ii) they have distinct forms of
-temporal effects, i.e., prolonged waiting makes orders become instantly invalid
-while data with increased staleness gradually reduces its utility for model
-fine-tuning.To overcome these obstacles, we propose an online framework based
-on multi-agent reinforcement learning (MARL) with careful augmentation. A new
-quality-of-service (QoS) metric is designed to characterize and balance the
-utility of the two joint tasks, under the effects of varying data volumes and
-staleness. We also integrate graph neural networks (GNNs) with MARL to enhance
-state representations, capturing graph-structured, time-varying dependencies
-among vehicles and across locations. Extensive experiments on our testbed
-simulator, utilizing various real-world foundation model fine-tuning tasks and
-the New York City Taxi ride order dataset, demonstrate the advantage of our
-proposed method.
+With the growing influence of Large Language Models (LLMs), there is
+increasing interest in integrating speech representations with them to enable
+more seamless multi-modal processing and speech understanding. This study
+introduces a novel approach that leverages self-supervised speech
+representations in combination with instruction-tuned LLMs for speech-to-text
+translation. The proposed approach leverages a modality adapter to align
+extracted speech features with instruction-tuned LLMs using English-language
+data. Our experiments demonstrate that this method effectively preserves the
+semantic content of the input speech and serves as an effective bridge between
+self-supervised speech models and instruction-tuned LLMs, offering a promising
+solution for various speech understanding applications.
 
-摘要：人工智能（AI）的進展，包括基礎模型（FM），正日益轉變人類社會，智慧城市推動著城市生活的演進。同時，車輛群感測（VCS）已成為關鍵推動因素，利用車輛的機動性和配備感測器的能力。特別是，儘管有資源限制，叫車服務車輛能有效促進靈活的資料收集，並有助於城市智慧。因此，這項工作探索了一個有前途的場景，其中邊緣輔助車輛執行訂單服務和新興基礎模型微調的聯合任務，使用各種城市資料。然而，由於 VCS AI 任務與傳統訂單服務任務的不一致時空特徵，整合它們具有挑戰性：(i) 叫車訂單和資料感興趣點 (PoI) 的分佈在地域上可能不重合，兩者都遵循先驗未知的模式；(ii) 它們具有不同的時間效應形式，即長時間等待會使訂單立即失效，而過時的資料會逐漸降低其對模型微調的效用。為了解決這些障礙，我們提出了一個基於多智能體強化學習 (MARL) 的線上架構，並進行了仔細的擴充。設計了一個新的服務品質 (QoS) 指標，用於表徵和平衡這兩個聯合任務的效用，在不同資料量和過時性的影響下。我們還將圖神經網路（GNN）與 MARL 整合，以增強狀態表示，捕捉車輛之間和不同地點之間的圖結構、時變依賴性。在我們的測試平台模擬器上進行的廣泛實驗，利用各種真實世界的基礎模型微調任務和紐約市計程車叫車訂單資料集，證明了我們提出的方法的優點。
+摘要：隨著大型語言模型（LLM）影響力逐漸擴大，將語音表徵與其整合，以實現更順暢的多模態處理和語音理解，已引起越來越多的興趣。本研究提出了一種新穎的方法，該方法利用自監督語音表徵，結合指令調整的 LLM，進行語音轉文字翻譯。所提出的方法利用模態適配器，使用英語語言資料，將提取的語音特徵與指令調整的 LLM 對齊。我們的實驗證明，此方法有效地保留了輸入語音的語義內容，並作為自監督語音模型和指令調整的 LLM 之間的有效橋樑，為各種語音理解應用程式提供了一個有前景的解決方案。
 
-##### **Multimodal Medical Code Tokenizer**
-2502.04397v2 by Xiaorui Su, Shvat Messica, Yepeng Huang, Ruth Johnson, Lukas Fesser, Shanghua Gao, Faryad Sahneh, Marinka Zitnik
+##### **LiSA: Leveraging Link Recommender to Attack Graph Neural Networks via Subgraph Injection**
+2502.09271v1 by Wenlun Zhang, Enyan Dai, Kentaro Yoshioka
 
-Foundation models trained on patient electronic health records (EHRs) require
-tokenizing medical data into sequences of discrete vocabulary items. Existing
-tokenizers treat medical codes from EHRs as isolated textual tokens. However,
-each medical code is defined by its textual description, its position in
-ontological hierarchies, and its relationships to other codes, such as disease
-co-occurrences and drug-treatment associations. Medical vocabularies contain
-more than 600,000 codes with critical information for clinical reasoning. We
-introduce MedTok, a multimodal medical code tokenizer that uses the text
-descriptions and relational context of codes. MedTok processes text using a
-language model encoder and encodes the relational structure with a graph
-encoder. It then quantizes both modalities into a unified token space,
-preserving modality-specific and cross-modality information. We integrate
-MedTok into five EHR models and evaluate it on operational and clinical tasks
-across in-patient and out-patient datasets, including outcome prediction,
-diagnosis classification, drug recommendation, and risk stratification.
-Swapping standard EHR tokenizers with MedTok improves AUPRC across all EHR
-models, by 4.10% on MIMIC-III, 4.78% on MIMIC-IV, and 11.30% on EHRShot, with
-the largest gains in drug recommendation. Beyond EHR modeling, we demonstrate
-using MedTok tokenizer with medical QA systems. Our results demonstrate the
-potential of MedTok as a unified tokenizer for medical codes, improving
-tokenization for medical foundation models.
+Graph Neural Networks (GNNs) have demonstrated remarkable proficiency in
+modeling data with graph structures, yet recent research reveals their
+susceptibility to adversarial attacks. Traditional attack methodologies, which
+rely on manipulating the original graph or adding links to artificially created
+nodes, often prove impractical in real-world settings. This paper introduces a
+novel adversarial scenario involving the injection of an isolated subgraph to
+deceive both the link recommender and the node classifier within a GNN system.
+Specifically, the link recommender is mislead to propose links between targeted
+victim nodes and the subgraph, encouraging users to unintentionally establish
+connections and that would degrade the node classification accuracy, thereby
+facilitating a successful attack. To address this, we present the LiSA
+framework, which employs a dual surrogate model and bi-level optimization to
+simultaneously meet two adversarial objectives. Extensive experiments on
+real-world datasets demonstrate the effectiveness of our method.
 
-摘要：<paragraph>在患者电子健康记录 (EHR) 上训练的基础模型需要将医学数据标记为离散词汇项序列。现有的标记器将 EHR 中的医学代码视为孤立的文本标记。然而，每个医学代码都由其文本描述、在本体层次结构中的位置以及与其他代码的关系（例如疾病共现和药物治疗关联）来定义。医学词汇表包含超过 600,000 个代码，这些代码包含临床推理的关键信息。我们引入了 MedTok，这是一种多模态医学代码标记器，它使用文本描述和代码的关系上下文。MedTok 使用语言模型编码器处理文本，并使用图编码器对关系结构进行编码。然后，它将这两种模态量化为一个统一的标记空间，保留特定于模态和跨模态的信息。我们将 MedTok 集成到五个 EHR 模型中，并在住院和门诊数据集（包括结果预测、诊断分类、药物推荐和风险分层）上对其实施操作和临床任务进行评估。用 MedTok 替换标准 EHR 标记器可提高所有 EHR 模型的 AUPRC，在 MIMIC-III 上提高 4.10%，在 MIMIC-IV 上提高 4.78%，在 EHRShot 上提高 11.30%，其中药物推荐的增益最大。除了 EHR 建模之外，我们还演示了将 MedTok 标记器与医学问答系统结合使用。我们的结果证明了 MedTok 作为医学代码的统一标记器的潜力，改进了医学基础模型的标记化。</paragraph>
+摘要：圖形神經網路 (GNN) 已展現出在對具有圖形結構的資料進行建模方面的卓越能力，但最近的研究揭露了它們容易受到對抗性攻擊的影響。傳統的攻擊方法依賴於操縱原始圖形或將連結新增至人工建立的節點，在真實世界設定中通常被證明不切實際。本文介紹了一種新穎的對抗性場景，涉及注入一個孤立的子圖形，以欺騙 GNN 系統中的連結推薦器和節點分類器。具體來說，連結推薦器被誤導為在目標受害節點和子圖形之間提出連結，鼓勵使用者無意間建立連結，這將降低節點分類準確度，從而促成攻擊成功。為了解決這個問題，我們提出了 LiSA 框架，它採用雙重代理模型和雙層最佳化，以同時滿足兩個對抗性目標。對真實世界資料集進行的廣泛實驗證明了我們方法的有效性。
 
-##### **A Retrospective Systematic Study on Hierarchical Sparse Query Transformer-assisted Ultrasound Screening for Early Hepatocellular Carcinoma**
-2502.03772v1 by Chaoyin She, Ruifang Lu, Danni He, Jiayi Lv, Yadan Lin, Meiqing Cheng, Hui Huang, Lida Chen, Wei Wang, Qinghua Huang
+##### **AnomalyGFM: Graph Foundation Model for Zero/Few-shot Anomaly Detection**
+2502.09254v1 by Hezhe Qiao, Chaoxi Niu, Ling Chen, Guansong Pang
 
-Hepatocellular carcinoma (HCC) ranks as the third leading cause of
-cancer-related mortality worldwide, with early detection being crucial for
-improving patient survival rates. However, early screening for HCC using
-ultrasound suffers from insufficient sensitivity and is highly dependent on the
-expertise of radiologists for interpretation. Leveraging the latest
-advancements in artificial intelligence (AI) in medical imaging, this study
-proposes an innovative Hierarchical Sparse Query Transformer (HSQformer) model
-that combines the strengths of Convolutional Neural Networks (CNNs) and Vision
-Transformers (ViTs) to enhance the accuracy of HCC diagnosis in ultrasound
-screening. The HSQformer leverages sparse latent space representations to
-capture hierarchical details at various granularities without the need for
-complex adjustments, and adopts a modular, plug-and-play design philosophy,
-ensuring the model's versatility and ease of use. The HSQformer's performance
-was rigorously tested across three distinct clinical scenarios: single-center,
-multi-center, and high-risk patient testing. In each of these settings, it
-consistently outperformed existing state-of-the-art models, such as ConvNext
-and SwinTransformer. Notably, the HSQformer even matched the diagnostic
-capabilities of senior radiologists and comprehensively surpassed those of
-junior radiologists. The experimental results from this study strongly
-demonstrate the effectiveness and clinical potential of AI-assisted tools in
-HCC screening. The full code is available at
-https://github.com/Asunatan/HSQformer.
+Graph anomaly detection (GAD) aims to identify abnormal nodes that differ
+from the majority of the nodes in a graph, which has been attracting
+significant attention in recent years. Existing generalist graph models have
+achieved remarkable success in different graph tasks but struggle to generalize
+to the GAD task. This limitation arises from their difficulty in learning
+generalized knowledge for capturing the inherently infrequent, irregular and
+heterogeneous abnormality patterns in graphs from different domains. To address
+this challenge, we propose AnomalyGFM, a GAD-oriented graph foundation model
+that supports zero-shot inference and few-shot prompt tuning for GAD in diverse
+graph datasets. One key insight is that graph-agnostic representations for
+normal and abnormal classes are required to support effective zero/few-shot GAD
+across different graphs. Motivated by this, AnomalyGFM is pre-trained to align
+data-independent, learnable normal and abnormal class prototypes with node
+representation residuals (i.e., representation deviation of a node from its
+neighbors). The residual features essentially project the node information into
+a unified feature space where we can effectively measure the abnormality of
+nodes from different graphs in a consistent way. This provides a driving force
+for the learning of graph-agnostic, discriminative prototypes for the normal
+and abnormal classes, which can be used to enable zero-shot GAD on new graphs,
+including very large-scale graphs. If there are few-shot labeled normal nodes
+available in the new graphs, AnomalyGFM can further support prompt tuning to
+leverage these nodes for better adaptation. Comprehensive experiments on 11
+widely-used GAD datasets with real anomalies, demonstrate that AnomalyGFM
+significantly outperforms state-of-the-art competing methods under both zero-
+and few-shot GAD settings.
 
-摘要：肝細胞癌（HCC）是全球第三大癌症相關死亡原因，早期檢測對於提高患者存活率至關重要。然而，使用超音波進行 HCC 早期篩檢的靈敏度不足，且高度依賴放射科醫師的專業知識進行判讀。本研究利用醫學影像中人工智慧（AI）的最新進展，提出了一種創新的分層稀疏查詢Transformer（HSQformer）模型，結合了卷積神經網路（CNN）和視覺Transformer（ViT）的優點，以提高超音波篩檢中 HCC 診斷的準確性。HSQformer 利用稀疏潛在空間表示，在不需要複雜調整的情況下擷取各種粒度層級的細節，並採用模組化、即插即用的設計理念，確保模型的多功能性和易用性。HSQformer 的效能經過三個不同的臨床場景的嚴格測試：單中心、多中心和高風險患者測試。在這些設定中，它始終優於現有的最先進模型，例如 ConvNext 和 SwinTransformer。值得注意的是，HSQformer 甚至匹配了資深放射科醫師的診斷能力，並全面超越了初級放射科醫師的診斷能力。本研究的實驗結果有力地證明了 AI 輔助工具在 HCC 篩檢中的有效性和臨床潛力。完整程式碼可在 https://github.com/Asunatan/HSQformer 取得。
+摘要：圖形異常偵測 (GAD) 的目標是找出與圖形中大多數節點不同的異常節點，這在近年來引起了廣泛的關注。現有的通才圖形模型在不同的圖形任務中都取得了顯著的成功，但卻難以推廣到 GAD 任務。這種限制來自於它們難以學習廣泛的知識，用於擷取來自不同領域圖形中固有的罕見、不規則和異質異常模式。為了應對這個挑戰，我們提出了 AnomalyGFM，一個面向 GAD 的圖形基礎模型，它支援零次學習推論和少次提示調整，用於在不同的圖形資料集中進行 GAD。一個關鍵見解是，需要圖形不可知的正常和異常類別表示，以支援跨不同圖形的有效零次/少次 GAD。受此啟發，AnomalyGFM 被預先訓練以將與資料無關的可學習正常和異常類別原型與節點表示殘差（即節點與其鄰居的表示偏差）對齊。殘差特徵基本上將節點資訊投射到一個統一的特徵空間中，在這個空間中，我們可以有效地測量來自不同圖形的節點異常，並且方式一致。這為學習正常和異常類別的圖形不可知、有區別的原型提供了驅動力，這些原型可用於對新的圖形（包括非常大規模的圖形）啟用零次 GAD。如果新的圖形中有少量的標籤正常節點，AnomalyGFM 可以進一步支援提示調整，以利用這些節點進行更好的適應。在 11 個廣泛使用的具有真實異常值的 GAD 資料集上的綜合實驗表明，在零次和少次 GAD 設定下，AnomalyGFM 明顯優於最先進的競爭方法。
 
-##### **Towards Fair Medical AI: Adversarial Debiasing of 3D CT Foundation Embeddings**
-2502.04386v1 by Guangyao Zheng, Michael A. Jacobs, Vladimir Braverman, Vishwa S. Parekh
+##### **The Joint Entity-Relation Extraction Model Based on Span and Interactive Fusion Representation for Chinese Medical Texts with Complex Semantics**
+2502.09247v1 by Danni Feng, Runzhi Li, Jing Wang, Siyu Yan, Lihong Ma, Yunli Xing
 
-Self-supervised learning has revolutionized medical imaging by enabling
-efficient and generalizable feature extraction from large-scale unlabeled
-datasets. Recently, self-supervised foundation models have been extended to
-three-dimensional (3D) computed tomography (CT) data, generating compact,
-information-rich embeddings with 1408 features that achieve state-of-the-art
-performance on downstream tasks such as intracranial hemorrhage detection and
-lung cancer risk forecasting. However, these embeddings have been shown to
-encode demographic information, such as age, sex, and race, which poses a
-significant risk to the fairness of clinical applications.
-  In this work, we propose a Variation Autoencoder (VAE) based adversarial
-debiasing framework to transform these embeddings into a new latent space where
-demographic information is no longer encoded, while maintaining the performance
-of critical downstream tasks. We validated our approach on the NLST lung cancer
-screening dataset, demonstrating that the debiased embeddings effectively
-eliminate multiple encoded demographic information and improve fairness without
-compromising predictive accuracy for lung cancer risk at 1-year and 2-year
-intervals. Additionally, our approach ensures the embeddings are robust against
-adversarial bias attacks. These results highlight the potential of adversarial
-debiasing techniques to ensure fairness and equity in clinical applications of
-self-supervised 3D CT embeddings, paving the way for their broader adoption in
-unbiased medical decision-making.
+Joint entity-relation extraction is a critical task in transforming
+unstructured or semi-structured text into triplets, facilitating the
+construction of large-scale knowledge graphs, and supporting various downstream
+applications. Despite its importance, research on Chinese text, particularly
+with complex semantics in specialized domains like medicine, remains limited.
+To address this gap, we introduce the CH-DDI, a Chinese drug-drug interactions
+dataset designed to capture the intricacies of medical text. Leveraging the
+strengths of attention mechanisms in capturing long-range dependencies, we
+propose the SEA module, which enhances the extraction of complex contextual
+semantic information, thereby improving entity recognition and relation
+extraction. Additionally, to address the inefficiencies of existing methods in
+facilitating information exchange between entity recognition and relation
+extraction, we present an interactive fusion representation module. This module
+employs Cross Attention for bidirectional information exchange between the
+tasks and further refines feature extraction through BiLSTM. Experimental
+results on both our CH-DDI dataset and public CoNLL04 dataset demonstrate that
+our model exhibits strong generalization capabilities. On the CH-DDI dataset,
+our model achieves an F1-score of 96.73% for entity recognition and 78.43% for
+relation extraction. On the CoNLL04 dataset, it attains an entity recognition
+precision of 89.54% and a relation extraction accuracy of 71.64%.
 
-摘要：自我監督學習透過從大規模未標記資料集中提取有效且可概化的特徵，進而革新了醫學影像。最近，自我監督基礎模型已擴展到三維 (3D) 電腦斷層掃描 (CT) 資料，產生緊湊、資訊豐富的嵌入，包含 1408 個特徵，在顱內出血偵測和肺癌風險預測等下游任務中達到最先進的效能。然而，這些嵌入已被證明會編碼人口統計資訊，例如年齡、性別和種族，這對臨床應用的公平性構成重大風險。
-在這項工作中，我們提出一個基於變異自編碼器 (VAE) 的對抗性去偏框架，將這些嵌入轉換到一個新的潛在空間，其中不再編碼人口統計資訊，同時維持關鍵下游任務的效能。我們在 NLST 肺癌篩檢資料集上驗證了我們的做法，證明去偏嵌入有效消除了多重編碼的人口統計資訊，並在不損害 1 年和 2 年間隔的肺癌風險預測準確性的情況下提高了公平性。此外，我們的做法確保了嵌入對抗性偏誤攻擊具有魯棒性。這些結果突顯了對抗性去偏技術的潛力，可確保自我監督 3D CT 嵌入在臨床應用中的公平性和公正性，為其在無偏見醫療決策中的廣泛採用鋪路。
+摘要：聯合實體關係抽取是將非結構化或半結構化文字轉換為三元組的重要任務，有助於建構大規模知識圖譜，並支援各種下游應用程式。儘管其重要性，但針對中文文本的研究，特別是醫學等專業領域中具有複雜語義的研究仍十分有限。為了解決這個差距，我們引入了 CH-DDI，一個中文藥物-藥物交互作用資料集，旨在擷取醫學文本的複雜性。利用注意力機制在擷取長程依賴關係方面的優勢，我們提出了 SEA 模組，增強了複雜脈絡語義資訊的抽取，從而改進了實體辨識和關係抽取。此外，為了解決現有方法在促進實體辨識和關係抽取之間資訊交換方面的低效率問題，我們提出了互動式融合表示模組。此模組採用交叉注意力，在任務之間進行雙向資訊交換，並透過 BiLSTM 進一步精煉特徵抽取。在我們的 CH-DDI 資料集和公開的 CoNLL04 資料集上的實驗結果表明，我們的模型展現出強大的泛化能力。在 CH-DDI 資料集上，我們的模型在實體辨識方面達到了 96.73% 的 F1 分數，在關係抽取方面達到了 78.43% 的 F1 分數。在 CoNLL04 資料集上，它在實體辨識方面達到了 89.54% 的準確度，在關係抽取方面達到了 71.64% 的準確度。
 
-##### **Clinically-Inspired Hierarchical Multi-Label Classification of Chest X-rays with a Penalty-Based Loss Function**
-2502.03591v1 by Mehrdad Asadi, Komi Sodoké, Ian J. Gerard, Marta Kersten-Oertel
+##### **You Do Not Fully Utilize Transformer's Representation Capacity**
+2502.09245v1 by Gleb Gerasimov, Yaroslav Aksenov, Nikita Balagansky, Viacheslav Sinii, Daniil Gavrilov
+
+In contrast to RNNs, which compress previous tokens into a single hidden
+state, Transformers can attend to all previous tokens directly. However,
+standard Transformers only use representations from the immediately preceding
+layer. In this paper, we show that this design choice causes representation
+collapse and leads to suboptimal performance. To address this issue, we
+introduce Layer-Integrated Memory (LIMe), a simple yet powerful approach that
+preserves the model's overall memory footprint while expanding its
+representational capacity by allowing access to hidden states from earlier
+layers. Through extensive experiments across various architectures and
+different lookup mechanisms, we demonstrate consistent performance improvements
+on a wide range of tasks. Moreover, our analysis of the learned representation
+dynamics and our exploration of depthwise circuits reveal how LIMe integrates
+information across layers, pointing to promising directions for future
+research.
 
-In this work, we present a novel approach to multi-label chest X-ray (CXR)
-image classification that enhances clinical interpretability while maintaining
-a streamlined, single-model, single-run training pipeline. Leveraging the
-CheXpert dataset and VisualCheXbert-derived labels, we incorporate hierarchical
-label groupings to capture clinically meaningful relationships between
-diagnoses. To achieve this, we designed a custom hierarchical binary
-cross-entropy (HBCE) loss function that enforces label dependencies using
-either fixed or data-driven penalty types. Our model achieved a mean area under
-the receiver operating characteristic curve (AUROC) of 0.903 on the test set.
-Additionally, we provide visual explanations and uncertainty estimations to
-further enhance model interpretability. All code, model configurations, and
-experiment details are made available.
+摘要：與將先前符號壓縮成單一隱藏狀態的遞迴神經網路不同，Transformer 可以直接關注所有先前的符號。然而，標準 Transformer 僅使用緊鄰前一層的表示。在本文中，我們說明此設計選擇會導致表示崩潰，並導致次優效能。為了解決此問題，我們引入了「層整合式記憶體」(LIMe)，這是一種簡單但強大的方法，可在擴充表示能力的同時，保留模型的整體記憶體使用量，方法是允許存取來自較早層的隱藏狀態。透過各種架構和不同查詢機制的廣泛實驗，我們展示了在各種任務上的一致效能提升。此外，我們對已學習表示動態的分析和對深度電路的探討，揭示了 LIMe 如何整合跨層資訊，並指出未來研究有望發展的方向。
 
-摘要：在本文中，我們提出胸部 X 光（CXR）影像多標籤分類的新方法，在維持簡化的單一模型、單次執行訓練管線的同時，提升臨床可解釋性。利用 CheXpert 資料集和 VisualCheXbert 衍生的標籤，我們納入階層標籤群組，以擷取診斷之間具有臨床意義的關聯性。為此，我們設計了自訂的階層二元交叉熵 (HBCE) 損失函數，使用固定或資料驅動的懲罰類型來強制執行標籤依賴性。我們的模型在測試集上達到受試者工作特性曲線 (AUROC) 下的平均面積為 0.903。此外，我們提供視覺化說明和不確定性估計，以進一步提升模型可解釋性。所有程式碼、模型組態和實驗詳細資料皆已公開。
+##### **From large language models to multimodal AI: A scoping review on the potential of generative AI in medicine**
+2502.09242v1 by Lukas Buess, Matthias Keicher, Nassir Navab, Andreas Maier, Soroosh Tayebi Arasteh
 
-##### **Code Simulation as a Proxy for High-order Tasks in Large Language Models**
-2502.03568v1 by Emanuele La Malfa, Christoph Weinhuber, Orazio Torre, Fangru Lin, X. Angelo Huang, Samuele Marro, Anthony Cohn, Nigel Shadbolt, Michael Wooldridge
+Generative artificial intelligence (AI) models, such as diffusion models and
+OpenAI's ChatGPT, are transforming medicine by enhancing diagnostic accuracy
+and automating clinical workflows. The field has advanced rapidly, evolving
+from text-only large language models for tasks such as clinical documentation
+and decision support to multimodal AI systems capable of integrating diverse
+data modalities, including imaging, text, and structured data, within a single
+model. The diverse landscape of these technologies, along with rising interest,
+highlights the need for a comprehensive review of their applications and
+potential. This scoping review explores the evolution of multimodal AI,
+highlighting its methods, applications, datasets, and evaluation in clinical
+settings. Adhering to PRISMA-ScR guidelines, we systematically queried PubMed,
+IEEE Xplore, and Web of Science, prioritizing recent studies published up to
+the end of 2024. After rigorous screening, 144 papers were included, revealing
+key trends and challenges in this dynamic field. Our findings underscore a
+shift from unimodal to multimodal approaches, driving innovations in diagnostic
+support, medical report generation, drug discovery, and conversational AI.
+However, critical challenges remain, including the integration of heterogeneous
+data types, improving model interpretability, addressing ethical concerns, and
+validating AI systems in real-world clinical settings. This review summarizes
+the current state of the art, identifies critical gaps, and provides insights
+to guide the development of scalable, trustworthy, and clinically impactful
+multimodal AI solutions in healthcare.
 
-Many reasoning, planning, and problem-solving tasks share an intrinsic
-algorithmic nature: correctly simulating each step is a sufficient condition to
-solve them correctly. We collect pairs of naturalistic and synthetic reasoning
-tasks to assess the capabilities of Large Language Models (LLM). While
-naturalistic tasks often require careful human handcrafting, we show that
-synthetic data is, in many cases, a good proxy that is much easier to collect
-at scale. We leverage common constructs in programming as the counterpart of
-the building blocks of naturalistic reasoning tasks, such as straight-line
-programs, code that contains critical paths, and approximate and redundant
-instructions. We further assess the capabilities of LLMs on sorting problems
-and repeated operations via sorting algorithms and nested loops. Our synthetic
-datasets further reveal that while the most powerful LLMs exhibit relatively
-strong execution capabilities, the process is fragile: it is negatively
-affected by memorisation and seems to rely heavily on pattern recognition. Our
-contribution builds upon synthetically testing the reasoning capabilities of
-LLMs as a scalable complement to handcrafted human-annotated problems.
+摘要：生成式人工智能 (AI) 模型，例如扩散模型和 OpenAI 的 ChatGPT，通过提高诊断准确性和自动化临床工作流程，正在改变医学领域。该领域已迅速发展，从用于临床文件编制和决策支持等任务的纯文本大型语言模型，发展到能够在单个模型中整合包括影像、文本和结构化数据在内的多种数据方式的多模态 AI 系统。这些技术的多样化格局以及日益增长的兴趣，凸显了全面审查其应用和潜力的必要性。本范围审查探讨了多模态 AI 的演变，重点介绍了其方法、应用、数据集和在临床环境中的评估。遵循 PRISMA-ScR 指南，我们系统地查询了 PubMed、IEEE Xplore 和 Web of Science，优先考虑截至 2024 年底发表的最新研究。经过严格筛选，纳入了 144 篇论文，揭示了这一充满活力的领域的趋势和挑战。我们的研究结果强调了从单模态方法向多模态方法的转变，推动了诊断支持、医疗报告生成、药物发现和会话式 AI 的创新。然而，关键挑战仍然存在，包括异构数据类型的整合、提高模型可解释性、解决伦理问题以及在现实世界的临床环境中验证 AI 系统。本综述总结了当前的最新技术，确定了关键差距，并提供了见解，以指导在医疗保健领域开发可扩展、可信赖且具有临床影响力的多模态 AI 解决方案。
 
-摘要：許多推理、規劃和問題解決任務共享一個內在的演算法性質：正確模擬每一步就足以正確解決它們。我們收集自然主義和合成推理任務對，以評估大型語言模型 (LLM) 的功能。雖然自然主義任務通常需要仔細的人工製作，但我們表明在許多情況下，合成資料是一個很好的代理，而且更容易大規模收集。我們利用程式設計中的常見建構，作為自然主義推理任務構建區塊的對應物，例如直線程式、包含關鍵路徑的程式碼，以及近似和冗餘指令。我們進一步評估 LLM 在排序問題和重複運算上的功能，透過排序演算法和巢狀迴圈。我們的合成資料集進一步揭示，雖然最強大的 LLM 表現出相對強大的執行能力，但這個過程很脆弱：它受到記憶的負面影響，而且似乎嚴重依賴模式辨識。我們的貢獻建立在以合成方式測試 LLM 的推理能力之上，作為手工編寫人類標註問題的可擴充補充。
+##### **Reliable Conversational Agents under ASP Control that Understand Natural Language**
+2502.09237v1 by Yankai Zeng
 
-##### **Limitations of Large Language Models in Clinical Problem-Solving Arising from Inflexible Reasoning**
-2502.04381v1 by Jonathan Kim, Anna Podlasek, Kie Shidara, Feng Liu, Ahmed Alaa, Danilo Bernardo
+Efforts have been made to make machines converse like humans in the past few
+decades. The recent techniques of Large Language Models (LLMs) make it possible
+to have human-like conversations with machines, but LLM's flaws of lacking
+understanding and reliability are well documented. We believe that the best way
+to eliminate this problem is to use LLMs only as parsers to translate text to
+knowledge and vice versa and carry out the conversation by reasoning over this
+knowledge using the answer set programming. I have been developing a framework
+based on LLMs and ASP to realize reliable chatbots that "understand" human
+conversation. This framework has been used to develop task-specific chatbots as
+well as socialbots. My future research is focused on making these chatbots
+scalable and trainable.
 
-Large Language Models (LLMs) have attained human-level accuracy on medical
-question-answer (QA) benchmarks. However, their limitations in navigating
-open-ended clinical scenarios have recently been shown, raising concerns about
-the robustness and generalizability of LLM reasoning across diverse, real-world
-medical tasks. To probe potential LLM failure modes in clinical
-problem-solving, we present the medical abstraction and reasoning corpus
-(M-ARC). M-ARC assesses clinical reasoning through scenarios designed to
-exploit the Einstellung effect -- the fixation of thought arising from prior
-experience, targeting LLM inductive biases toward inflexible pattern matching
-from their training data rather than engaging in flexible reasoning. We find
-that LLMs, including current state-of-the-art o1 and Gemini models, perform
-poorly compared to physicians on M-ARC, often demonstrating lack of commonsense
-medical reasoning and a propensity to hallucinate. In addition, uncertainty
-estimation analyses indicate that LLMs exhibit overconfidence in their answers,
-despite their limited accuracy. The failure modes revealed by M-ARC in LLM
-medical reasoning underscore the need to exercise caution when deploying these
-models in clinical settings.
+摘要：在過去的幾十年裡，人們一直努力讓機器像人類一樣對話。大型語言模型 (LLM) 的最新技術讓與機器進行類人對話成為可能，但 LLM 缺乏理解力和可靠性的缺陷已被充分記錄。我們相信消除這個問題的最佳方法是僅將 LLM 作為解析器，將文字轉換為知識，反之亦然，並使用答案集程式設計對此知識進行推理來進行對話。我一直在開發一個基於 LLM 和 ASP 的框架，以實現「理解」人類對話的可靠聊天機器人。這個框架已被用於開發特定任務的聊天機器人以及社交機器人。我未來的研究重點在於讓這些聊天機器人具有可擴充性和可訓練性。
 
-摘要：大型語言模型 (LLM) 已在醫療問題解答 (QA) 基準上達到人類層級的準確度。然而，它們在應對開放式臨床場景中的局限性最近已被揭示，引發了人們對 LLM 推理在多樣化、真實世界醫療任務中的穩健性和概括性的擔憂。為了探討臨床問題解決中 LLM 的潛在故障模式，我們提出了醫療抽象和推理語料庫 (M-ARC)。M-ARC 通過旨在利用艾賓浩斯錯覺（由先前經驗產生的思維定勢）來評估臨床推理，針對 LLM 歸納偏誤，使其從訓練數據中進行僵化的模式匹配，而不是進行靈活的推理。我們發現，包括當前最先進的 o1 和 Gemini 模型在內的 LLM，在 M-ARC 上的表現遠不如醫生，它們經常表現出缺乏常識性的醫療推理和產生幻覺的傾向。此外，不確定性估計分析表明，儘管 LLM 準確性有限，但它們對自己的答案表現出過度自信。M-ARC 揭示的 LLM 醫療推理故障模式強調了在臨床環境中部署這些模型時需要謹慎。
+##### **Commonsense Reasoning-Aided Autonomous Vehicle Systems**
+2502.09233v1 by Keegan Kimbrell
 
-##### **Accurate AI-Driven Emergency Vehicle Location Tracking in Healthcare ITS Digital Twin**
-2502.03396v1 by Sarah Al-Shareeda, Yasar Celik, Bilge Bilgili, Ahmed Al-Dubai, Berk Canberk
+Autonomous Vehicle (AV) systems have been developed with a strong reliance on
+machine learning techniques. While machine learning approaches, such as deep
+learning, are extremely effective at tasks that involve observation and
+classification, they struggle when it comes to performing higher level
+reasoning about situations on the road. This research involves incorporating
+commonsense reasoning models that use image data to improve AV systems. This
+will allow AV systems to perform more accurate reasoning while also making them
+more adjustable, explainable, and ethical. This paper will discuss the findings
+so far and motivate its direction going forward.
 
-Creating a Digital Twin (DT) for Healthcare Intelligent Transportation
-Systems (HITS) is a hot research trend focusing on enhancing HITS management,
-particularly in emergencies where ambulance vehicles must arrive at the crash
-scene on time and track their real-time location is crucial to the medical
-authorities. Despite the claim of real-time representation, a temporal
-misalignment persists between the physical and virtual domains, leading to
-discrepancies in the ambulance's location representation. This study proposes
-integrating AI predictive models, specifically Support Vector Regression (SVR)
-and Deep Neural Networks (DNN), within a constructed mock DT data pipeline
-framework to anticipate the medical vehicle's next location in the virtual
-world. These models align virtual representations with their physical
-counterparts, i.e., metaphorically offsetting the synchronization delay between
-the two worlds. Trained meticulously on a historical geospatial dataset, SVR
-and DNN exhibit exceptional prediction accuracy in MATLAB and Python
-environments. Through various testing scenarios, we visually demonstrate the
-efficacy of our methodology, showcasing SVR and DNN's key role in significantly
-reducing the witnessed gap within the HITS's DT. This transformative approach
-enhances real-time synchronization in emergency HITS by approximately 88% to
-93%.
+摘要：自動駕駛車輛 (AV) 系統的開發高度依賴機器學習技術。儘管機器學習方法（例如深度學習）在涉及觀察和分類的任務中非常有效，但它們在對路況進行更高層級推理時會遇到困難。本研究涉及整合使用影像資料的常識推理模型，以改善 AV 系統。這將使 AV 系統能夠執行更準確的推理，同時也讓它們更具可調整性、可解釋性和道德性。本文將探討迄今為止的發現，並說明其未來的發展方向。
 
-摘要：建立醫療智慧交通系統（HITS）的數位分身（DT）是熱門的研究趨勢，其重點在於提升 HITS 管理，特別是在救護車必須準時抵達車禍現場的緊急情況中，追蹤其即時位置對於醫療單位至關重要。儘管聲稱即時呈現，但實體和虛擬領域之間仍存在時間上的錯位，導致救護車位置呈現上的差異。本研究建議在建構的虛擬 DT 資料管道架構中整合人工智慧預測模型，特別是支援向量回歸（SVR）和深度神經網路（DNN），以預測醫療車輛在虛擬世界的下一個位置。這些模型將虛擬呈現與其實體對應物對齊，也就是說，在兩個世界之間比喻性地抵銷同步延遲。在歷史地理空間資料集上經過仔細訓練，SVR 和 DNN 在 MATLAB 和 Python 環境中展現出卓越的預測準確性。透過各種測試情境，我們視覺化展示了我們方法論的效能，展示了 SVR 和 DNN 在顯著縮小 HITS 的 DT 中見證到的差距方面的關鍵作用。這種變革性的方法將緊急 HITS 中的即時同步提升了大約 88% 到 93%。
+##### **Logical foundations of Smart Contracts**
+2502.09232v1 by Kalonji Kalala
 
-##### **RadVLM: A Multitask Conversational Vision-Language Model for Radiology**
-2502.03333v1 by Nicolas Deperrois, Hidetoshi Matsuo, Samuel Ruipérez-Campillo, Moritz Vandenhirtz, Sonia Laguna, Alain Ryser, Koji Fujimoto, Mizuho Nishio, Thomas M. Sutter, Julia E. Vogt, Jonas Kluckert, Thomas Frauenfelder, Christian Blüthgen, Farhad Nooralahzadeh, Michael Krauthammer
+Nowadays, sophisticated domains are emerging which require appropriate
+formalisms to be specified accurately in order to reason about them. One such
+domain is constituted of smart contracts that have emerged in cyber physical
+systems as a way of enforcing formal agreements between components of these
+systems. Smart contracts self-execute to run and share business processes
+through blockchain, in decentralized systems, with many different participants.
+Legal contracts are in many cases complex documents, with a number of
+exceptions, and many subcontracts. The implementation of smart contracts based
+on legal contracts is a long and laborious task, that needs to include all
+actions, procedures, and the effects of actions related to the execution of the
+contract. An ongoing open problem in this area is to formally account for smart
+contracts using a uniform and somewhat universal formalism. This thesis
+proposes logical foundations to smart contracts using the Situation Calculus, a
+logic for reasoning about actions. Situation Calculus is one of the prominent
+logic-based artificial intelligence approaches that provides enough logical
+mechanism to specify and implement dynamic and complex systems such as
+contracts. Situation Calculus is suitable to show how worlds dynamically
+change. Smart contracts are going to be implement with Golog (written en
+Prolog), a Situation Calculus-based programming language for modeling complex
+and dynamic behaviors.
 
-The widespread use of chest X-rays (CXRs), coupled with a shortage of
-radiologists, has driven growing interest in automated CXR analysis and
-AI-assisted reporting. While existing vision-language models (VLMs) show
-promise in specific tasks such as report generation or abnormality detection,
-they often lack support for interactive diagnostic capabilities. In this work
-we present RadVLM, a compact, multitask conversational foundation model
-designed for CXR interpretation. To this end, we curate a large-scale
-instruction dataset comprising over 1 million image-instruction pairs
-containing both single-turn tasks -- such as report generation, abnormality
-classification, and visual grounding -- and multi-turn, multi-task
-conversational interactions. After fine-tuning RadVLM on this instruction
-dataset, we evaluate it across different tasks along with re-implemented
-baseline VLMs. Our results show that RadVLM achieves state-of-the-art
-performance in conversational capabilities and visual grounding while remaining
-competitive in other radiology tasks. Ablation studies further highlight the
-benefit of joint training across multiple tasks, particularly for scenarios
-with limited annotated data. Together, these findings highlight the potential
-of RadVLM as a clinically relevant AI assistant, providing structured CXR
-interpretation and conversational capabilities to support more effective and
-accessible diagnostic workflows.
+摘要：如今，正在出现需要适当形式化来准确指定以对其进行推理的复杂领域。此类领域之一由在网络物理系统中出现的智能合约构成，作为强制执行这些系统组件之间正式协议的一种方式。智能合约自执行以在去中心化系统中通过区块链运行和共享业务流程，并有许多不同的参与者。法律合约在许多情况下是复杂的文档，有许多例外和许多分包合同。基于法律合约实施智能合约是一项漫长而艰巨的任务，需要包括所有操作、程序以及与执行合约相关的操作效果。该领域的持续开放问题是使用统一且某种程度上通用的形式化来正式说明智能合约。本论文提出了使用情景演算（一种用于推理操作的逻辑）为智能合约提供逻辑基础。情景演算是基于逻辑的人工智能方法之一，提供了足够的逻辑机制来指定和实现动态且复杂的系统，例如合约。情景演算适用于展示世界如何动态变化。智能合约将使用 Golog（以 Prolog 编写的）实现，这是一种基于情景演算的编程语言，用于建模复杂且动态的行为。
 
-摘要：胸部 X 光 (CXR) 的广泛使用，加上放射科醫師短缺，促使人們對自動化 CXR 分析和 AI 輔助報告產生越來越濃厚的興趣。雖然現有的視覺語言模型 (VLM) 在特定任務中顯示出前景，例如報告生成或異常偵測，但它們通常缺乏對互動式診斷功能的支持。在這項工作中，我們提出 RadVLM，這是一個緊湊的多任務對話式基礎模型，專為 CXR 解釋而設計。為此，我們策劃了一個大型指令資料集，包含超過 100 萬個影像指令對，其中包含單輪任務（例如報告生成、異常分類和視覺基礎），以及多輪、多任務對話互動。在對這個指令資料集進行微調後，我們對 RadVLM 進行評估，並與重新實作的基準 VLM 一起執行不同的任務。我們的結果顯示，RadVLM 在對話能力和視覺基礎方面取得了最先進的效能，同時在其他放射學任務中仍具有競爭力。消融研究進一步突顯了跨多個任務進行聯合訓練的好處，特別是對於帶有標註資料有限的場景。這些發現共同突顯了 RadVLM 作為臨床相關 AI 助理的潛力，提供結構化的 CXR 解釋和對話能力，以支援更有效且可存取的診斷工作流程。
+##### **Relating Answer Set Programming and Many-sorted Logics for Formal Verification**
+2502.09230v1 by Zachary Hansen
 
-##### **MeDiSumQA: Patient-Oriented Question-Answer Generation from Discharge Letters**
-2502.03298v1 by Amin Dada, Osman Alperen Koras, Marie Bauer, Amanda Butler, Kaleb E. Smith, Jens Kleesiek, Julian Friedrich
+Answer Set Programming (ASP) is an important logic programming paradigm
+within the field of Knowledge Representation and Reasoning. As a concise,
+human-readable, declarative language, ASP is an excellent tool for developing
+trustworthy (especially, artificially intelligent) software systems. However,
+formally verifying ASP programs offers some unique challenges, such as
+  1. a lack of modularity (the meanings of rules are difficult to define in
+isolation from the enclosing program),
+  2. the ground-and-solve semantics (the meanings of rules are dependent on the
+input data with which the program is grounded), and
+  3. limitations of existing tools.
+  My research agenda has been focused on addressing these three issues with the
+intention of making ASP verification an accessible, routine task that is
+regularly performed alongside program development. In this vein, I have
+investigated alternative semantics for ASP based on translations into the logic
+of here-and-there and many-sorted first-order logic. These semantics promote a
+modular understanding of logic programs, bypass grounding, and enable us to use
+automated theorem provers to automatically verify properties of programs.
 
-While increasing patients' access to medical documents improves medical care,
-this benefit is limited by varying health literacy levels and complex medical
-terminology. Large language models (LLMs) offer solutions by simplifying
-medical information. However, evaluating LLMs for safe and patient-friendly
-text generation is difficult due to the lack of standardized evaluation
-resources. To fill this gap, we developed MeDiSumQA. MeDiSumQA is a dataset
-created from MIMIC-IV discharge summaries through an automated pipeline
-combining LLM-based question-answer generation with manual quality checks. We
-use this dataset to evaluate various LLMs on patient-oriented
-question-answering. Our findings reveal that general-purpose LLMs frequently
-surpass biomedical-adapted models, while automated metrics correlate with human
-judgment. By releasing MeDiSumQA on PhysioNet, we aim to advance the
-development of LLMs to enhance patient understanding and ultimately improve
-care outcomes.
+摘要：<paragraph>答案集程式設計 (ASP) 是知識表徵與推理領域中一個重要的邏輯程式設計範式。ASP 作為一種簡潔、人類可讀、宣告式的語言，是開發值得信賴的 (特別是人工智慧) 軟體系統的絕佳工具。然而，正式驗證 ASP 程式提供了一些獨特的挑戰，例如
+  1. 缺乏模組化 (規則的含義難以與封閉程式隔離定義)，
+  2. 基礎與求解語意 (規則的含義取決於程式基礎的輸入資料)，以及
+  3. 現有工具的限制。
+  我的研究議程一直專注於解決這三個問題，目的是讓 ASP 驗證成為一個可存取的、例行任務，並在程式開發過程中定期執行。在這個脈絡下，我研究了基於翻譯成此處和彼處邏輯以及多種排序一階邏輯的 ASP 替代語意。這些語意促進了邏輯程式的模組化理解，繞過基礎，並使我們能夠使用自動化定理證明器自動驗證程式的屬性。</paragraph>
 
-摘要：儘管讓患者更能取得醫療文件有助於改善醫療照護，
-但此優點受到不同的健康素養程度和複雜的醫療術語所限制。大型語言模型 (LLM) 提供了簡化醫療資訊的解決方案。然而，由於缺乏標準化的評估資源，因此難以評估 LLM 以確保其安全且對患者友善的文字產生。為了填補此缺口，我們開發了 MeDiSumQA。MeDiSumQA 是透過自動化流程從 MIMIC-IV 出院摘要中建立的資料集，結合了基於 LLM 的問答產生和手動品質檢查。我們使用此資料集來評估各種 LLM 在以患者為導向的問答中。我們的發現顯示，通用 LLM 經常超越生物醫學適應模型，而自動化指標與人類判斷相關。透過在 PhysioNet 上發布 MeDiSumQA，我們旨在推動 LLM 的發展，以增進患者理解，並最終改善照護成果。
+##### **Computational methods for Dynamic Answer Set Programming**
+2502.09228v1 by Susana Hahn
 
-##### **Deep Learning Pipeline for Fully Automated Myocardial Infarct Segmentation from Clinical Cardiac MR Scans**
-2502.03272v1 by Matthias Schwab, Mathias Pamminger, Christian Kremser, Agnes Mayr
+In our daily lives and industrial settings, we often encounter dynamic
+problems that require reasoning over time and metric constraints. These include
+tasks such as scheduling, routing, and production sequencing. Dynamic logics
+have traditionally addressed these needs but often lack the flexibility and
+integration required for comprehensive problem modeling. This research aims to
+extend Answer Set Programming (ASP), a powerful declarative problem-solving
+approach, to handle dynamic domains effectively. By integrating concepts from
+dynamic, temporal, and metric logics into ASP, we seek to develop robust
+systems capable of modeling complex dynamic problems and performing efficient
+reasoning tasks, thereby enhancing ASPs applicability in industrial contexts.
 
-Purpose: To develop and evaluate a deep learning-based method that allows to
-perform myocardial infarct segmentation in a fully-automated way.
-  Materials and Methods: For this retrospective study, a cascaded framework of
-two and three-dimensional convolutional neural networks (CNNs), specialized on
-identifying ischemic myocardial scars on late gadolinium enhancement (LGE)
-cardiac magnetic resonance (CMR) images, was trained on an in-house training
-dataset consisting of 144 examinations. On a separate test dataset from the
-same institution, including images from 152 examinations obtained between 2021
-and 2023, a quantitative comparison between artificial intelligence (AI)-based
-segmentations and manual segmentations was performed. Further, qualitative
-assessment of segmentation accuracy was evaluated for both human and
-AI-generated contours by two CMR experts in a blinded experiment.
-  Results: Excellent agreement could be found between manually and
-automatically calculated infarct volumes ($\rho_c$ = 0.9). The qualitative
-evaluation showed that compared to human-based measurements, the experts rated
-the AI-based segmentations to better represent the actual extent of infarction
-significantly (p < 0.001) more often (33.4% AI, 25.1% human, 41.5% equal). On
-the contrary, for segmentation of microvascular obstruction (MVO), manual
-measurements were still preferred (11.3% AI, 55.6% human, 33.1% equal).
-  Conclusion: This fully-automated segmentation pipeline enables CMR infarct
-size to be calculated in a very short time and without requiring any
-pre-processing of the input images while matching the segmentation quality of
-trained human observers. In a blinded experiment, experts preferred automated
-infarct segmentations more often than manual segmentations, paving the way for
-a potential clinical application.
+摘要：在我們的日常生活和工業環境中，我們經常會遇到動態問題，需要隨著時間和公制約束進行推理。這些問題包括排程、路由和生產順序等任務。動態邏輯傳統上解決了這些需求，但通常缺乏全面問題建模所需的靈活性與整合性。本研究旨在擴展強大的宣告式問題解決方法「Answer Set Programming (ASP)」，以有效處理動態領域。透過將動態、時態和公制邏輯的概念整合到 ASP 中，我們尋求開發強健的系統，能夠建模複雜的動態問題並執行有效的推理任務，進而增強 ASP 在工業環境中的適用性。
+
+##### **Generating Causally Compliant Counterfactual Explanations using ASP**
+2502.09226v1 by Sopam Dasgupta
 
-摘要：<paragraph>目的：開發和評估一種基於深度學習的方法，允許以全自動的方式執行心肌梗塞分割。
-材料和方法：對於這項回顧性研究，一個由二維和三維卷積神經網路 (CNN) 組成的串聯架構，專門用於識別晚期釓增強 (LGE) 心臟磁振造影 (CMR) 影像上的缺血性心肌疤痕，並在包含 144 項檢查的內部訓練資料集上受訓。在來自同一家機構的獨立測試資料集上，包括 2021 年至 2023 年間獲得的 152 項檢查的影像，執行基於人工智慧 (AI) 的分割和手動分割之間的定量比較。此外，由兩位 CMR 專家在盲測實驗中評估人類和 AI 生成的輪廓的分割準確度。
-結果：在手動和自動計算的梗塞體積之間可以發現極佳的一致性（ρ_c = 0.9）。定性評估顯示，與基於人類的測量相比，專家評估 AI 基於分割能更能代表梗塞的實際範圍，顯著（p < 0.001）更常發生（33.4% AI，25.1% 人類，41.5% 相等）。相反，對於微血管阻塞 (MVO) 的分割，手動測量仍然較受青睞（11.3% AI，55.6% 人類，33.1% 相等）。
-結論：這個全自動分割管道可以在很短的時間內計算 CMR 梗塞大小，而且無需對輸入影像進行任何前處理，同時匹配受過訓練的人類觀察者的分割品質。在盲測實驗中，專家比手動分割更常偏好自動梗塞分割，為潛在的臨床應用鋪平了道路。</paragraph>
+This research is focused on generating achievable counterfactual
+explanations. Given a negative outcome computed by a machine learning model or
+a decision system, the novel CoGS approach generates (i) a counterfactual
+solution that represents a positive outcome and (ii) a path that will take us
+from the negative outcome to the positive one, where each node in the path
+represents a change in an attribute (feature) value. CoGS computes paths that
+respect the causal constraints among features. Thus, the counterfactuals
+computed by CoGS are realistic. CoGS utilizes rule-based machine learning
+algorithms to model causal dependencies between features. The paper discusses
+the current status of the research and the preliminary results obtained.
 
-##### **Long-tailed Medical Diagnosis with Relation-aware Representation Learning and Iterative Classifier Calibration**
-2502.03238v2 by Li Pan, Yupei Zhang, Qiushi Yang, Tan Li, Zhen Chen
+摘要：本研究重點在於產生可實現的反事實解釋。給定由機器學習模型或決策系統計算出的負面結果，創新的 CoGS 方法會產生 (i) 代表正面結果的反事實解，以及 (ii) 一條將我們從負面結果帶到正面結果的途徑，其中途徑中的每個節點代表屬性 (特徵) 值的變化。CoGS 計算出符合特徵之間因果關係的途徑。因此，CoGS 計算出的反事實是切合實際的。CoGS 利用基於規則的機器學習演算法來建模特徵之間的因果關係。本文探討了研究的現況和獲得的初步結果。
 
-Recently computer-aided diagnosis has demonstrated promising performance,
-effectively alleviating the workload of clinicians. However, the inherent
-sample imbalance among different diseases leads algorithms biased to the
-majority categories, leading to poor performance for rare categories. Existing
-works formulated this challenge as a long-tailed problem and attempted to
-tackle it by decoupling the feature representation and classification. Yet, due
-to the imbalanced distribution and limited samples from tail classes, these
-works are prone to biased representation learning and insufficient classifier
-calibration. To tackle these problems, we propose a new Long-tailed Medical
-Diagnosis (LMD) framework for balanced medical image classification on
-long-tailed datasets. In the initial stage, we develop a Relation-aware
-Representation Learning (RRL) scheme to boost the representation ability by
-encouraging the encoder to capture intrinsic semantic features through
-different data augmentations. In the subsequent stage, we propose an Iterative
-Classifier Calibration (ICC) scheme to calibrate the classifier iteratively.
-This is achieved by generating a large number of balanced virtual features and
-fine-tuning the encoder using an Expectation-Maximization manner. The proposed
-ICC compensates for minority categories to facilitate unbiased classifier
-optimization while maintaining the diagnostic knowledge in majority classes.
-Comprehensive experiments on three public long-tailed medical datasets
-demonstrate that our LMD framework significantly surpasses state-of-the-art
-approaches. The source code can be accessed at
-https://github.com/peterlipan/LMD.
+##### **Order-Sorted Intensional Logic: Expressing Subtyping Polymorphism with Typing Assertions and Quantification over Concepts**
+2502.09224v1 by Đorđe Marković, Marc Denecker
 
-摘要：<paragraph>最近，计算机辅助诊断已展现出可观的表现，有效减轻了临床医生的工作量。然而，不同疾病之间固有的样本不平衡导致算法偏向于多数类别，从而导致罕见类别表现不佳。现有工作将这一挑战表述为长尾问题，并尝试通过解耦特征表示和分类来解决它。然而，由于不平衡分布和尾类样本有限，这些工作容易出现有偏差的表示学习和分类器校准不足。为了解决这些问题，我们提出了一个新的长尾医学诊断 (LMD) 框架，用于对长尾数据集进行平衡的医学图像分类。在初始阶段，我们开发了一个关系感知表示学习 (RRL) 方案，通过鼓励编码器通过不同的数据增强来捕获内在语义特征，从而提升表示能力。在后续阶段，我们提出了一个迭代分类器校准 (ICC) 方案，以迭代方式校准分类器。这是通过生成大量的平衡虚拟特征并使用期望最大化方式微调编码器来实现的。所提出的 ICC 补偿了少数类别，以促进无偏分类器优化，同时保持多数类别的诊断知识。在三个公共长尾医学数据集上进行的综合实验表明，我们的 LMD 框架明显超越了最先进的方法。源代码可在 https://github.com/peterlipan/LMD 处获取。</paragraph>
+Subtyping, also known as subtype polymorphism, is a concept extensively
+studied in programming language theory, delineating the substitutability
+relation among datatypes. This property ensures that programs designed for
+supertype objects remain compatible with their subtypes.
+  In this paper, we explore the capability of order-sorted logic for utilizing
+these ideas in the context of Knowledge Representation. We recognize two
+fundamental limitations: First, the inability of this logic to address the
+concept rather than the value of non-logical symbols, and second, the lack of
+language constructs for constraining the type of terms. Consequently, we
+propose guarded order-sorted intensional logic, where guards are language
+constructs for annotating typing information and intensional logic provides
+support for quantification over concepts.
 
-##### **Fine-Tuning Strategies for Continual Online EEG Motor Imagery Decoding: Insights from a Large-Scale Longitudinal Study**
-2502.06828v1 by Martin Wimpff, Bruno Aristimunha, Sylvain Chevallier, Bin Yang
+摘要：子類型化，也稱為子類型多態性，是一個在程式語言理論中廣泛研究的概念，用於描述資料類型之間的可替換關係。此特性可確保為超類型物件設計的程式與其子類型相容。
+在本文中，我們探討了使用排序邏輯在知識表徵中運用這些想法的能力。我們發現了兩個基本限制：首先，此邏輯無法處理非邏輯符號的概念而非值，其次，缺乏約束項類型的語言結構。因此，我們提出了受保護的排序邏輯，其中保護是註解類型資訊的語言結構，而內涵邏輯則支援對概念量化。
 
-This study investigates continual fine-tuning strategies for deep learning in
-online longitudinal electroencephalography (EEG) motor imagery (MI) decoding
-within a causal setting involving a large user group and multiple sessions per
-participant. We are the first to explore such strategies across a large user
-group, as longitudinal adaptation is typically studied in the single-subject
-setting with a single adaptation strategy, which limits the ability to
-generalize findings. First, we examine the impact of different fine-tuning
-approaches on decoder performance and stability. Building on this, we integrate
-online test-time adaptation (OTTA) to adapt the model during deployment,
-complementing the effects of prior fine-tuning. Our findings demonstrate that
-fine-tuning that successively builds on prior subject-specific information
-improves both performance and stability, while OTTA effectively adapts the
-model to evolving data distributions across consecutive sessions, enabling
-calibration-free operation. These results offer valuable insights and
-recommendations for future research in longitudinal online MI decoding and
-highlight the importance of combining domain adaptation strategies for
-improving BCI performance in real-world applications. Clinical Relevance: Our
-investigation enables more stable and efficient long-term motor imagery
-decoding, which is critical for neurorehabilitation and assistive technologies.
+##### **ASP-driven User-interaction with Clinguin**
+2502.09222v1 by Alexander Beiser, Susana Hahn, Torsten Schaub
 
-摘要：本研究探討在因果關係設定中涉及大量使用者群組和每個參與者多個階段的線上縱向腦電圖 (EEG) 運動想像 (MI) 解碼中，深度學習的持續微調策略。我們是第一個在大量使用者群組中探討此類策略，因為縱向適應通常在單一主體設定中研究，並使用單一適應策略，這限制了推廣研究結果的能力。首先，我們探討不同微調方法對解碼器效能和穩定性的影響。在此基礎上，我們整合線上測試時間適應 (OTTA) 以在部署期間適應模型，補充先前微調的效果。我們的研究結果表明，連續建立在先前特定主體資訊上的微調可以同時改善效能和穩定性，而 OTTA 可以有效地適應連續階段中不斷變化的資料分佈，從而實現無需校準的操作。這些結果為縱向線上 MI 解碼的未來研究提供了有價值的見解和建議，並強調了結合領域適應策略以改善實際應用中 BCI 效能的重要性。臨床相關性：我們的研究可以實現更穩定、更有效的長期運動想像解碼，這對於神經復健和輔助技術至關重要。
+We present clinguin, a system for ASP-driven user interface design. Clinguin
+streamlines the development of user interfaces for ASP developers by letting
+them build interactive prototypes directly in ASP, eliminating the need for
+separate frontend languages. To this end, clinguin uses a few dedicated
+predicates to define user interfaces and the treatment of user-triggered
+events. This simple design greatly facilitates the specification of user
+interactions with an ASP system, in our case clingo.
 
-##### **MedBioLM: Optimizing Medical and Biological QA with Fine-Tuned Large Language Models and Retrieval-Augmented Generation**
-2502.03004v1 by Seonok Kim
+摘要：我們提出 clinguin，一個用於 ASP 驅動使用者介面設計的系統。Clinguin 透過讓 ASP 開發人員直接在 ASP 中建立互動式原型，簡化了使用者介面的開發，消除了對個別前端語言的需求。為此，clinguin 使用一些專用的謂詞來定義使用者介面和處理使用者觸發的事件。這個簡單的設計極大地簡化了使用者與 ASP 系統互動的規範，在我們的案例中是 clingo。
 
-Large Language Models (LLMs) have demonstrated impressive capabilities across
-natural language processing tasks. However, their application to specialized
-domains such as medicine and biology requires further optimization to ensure
-factual accuracy, reliability, and contextual depth. We introduce MedBioLM, a
-domain-adapted biomedical question-answering model designed to enhance both
-short-form and long-form queries. By integrating fine-tuning and
-retrieval-augmented generation (RAG), MedBioLM dynamically incorporates
-domain-specific knowledge, improving reasoning abilities and factual accuracy.
-To evaluate its effectiveness, we fine-tuned the model on diverse biomedical QA
-datasets, covering structured multiple-choice assessments and complex clinical
-reasoning tasks. Fine-tuning significantly improves accuracy on benchmark
-datasets, while RAG enhances factual consistency. These results highlight the
-potential of domain-optimized LLMs in advancing biomedical research, medical
-education, and clinical decision support.
+##### **Pearce's Characterisation in an Epistemic Domain**
+2502.09221v1 by Ezgi Iraz Su
 
-摘要：大型語言模型 (LLM) 已展現出在自然語言處理任務中令人印象深刻的能力。然而，要將其應用於醫學和生物學等特定領域，需要進一步最佳化，以確保事實的準確性、可靠性以及脈絡的深度。我們引進了 MedBioLM，這是一個適應領域的生物醫學問答模型，旨在增強短式和長式查詢。透過整合微調和檢索增強生成 (RAG)，MedBioLM 能動態地納入領域特定的知識，從而提升推理能力和事實準確性。為了評估其有效性，我們對模型進行微調，使其涵蓋結構化的多重選擇評量和複雜的臨床推理任務等多樣化的生物醫學問答資料集。微調顯著提升了基準資料集的準確性，而 RAG 則增強了事實的一致性。這些結果突顯了領域最佳化的 LLM 在推進生物醫學研究、醫學教育和臨床決策支援方面的潛力。
+Answer-set programming (ASP) is a successful problem-solving approach in
+logic-based AI. In ASP, problems are represented as declarative logic programs,
+and solutions are identified through their answer sets. Equilibrium logic (EL)
+is a general-purpose nonmonotonic reasoning formalism, based on a monotonic
+logic called here-and-there logic. EL was basically proposed by Pearce as a
+foundational framework of ASP. Epistemic specifications (ES) are extensions of
+ASP-programs with subjective literals. These new modal constructs in the
+ASP-language make it possible to check whether a regular literal of ASP is true
+in every (or some) answer-set of a program. ES-programs are interpreted by
+world-views, which are essentially collections of answer-sets. (Reflexive)
+autoepistemic logic is a nonmonotonic formalism, modeling self-belief
+(knowledge) of ideally rational agents. A relatively new semantics for ES is
+based on a combination of EL and (reflexive) autoepistemic logic. In this
+paper, we first propose an overarching framework in the epistemic ASP domain.
+We then establish a correspondence between existing (reflexive) (auto)epistemic
+equilibrium logics and our easily-adaptable comprehensive framework, building
+on Pearce's characterisation of answer-sets as equilibrium models. We achieve
+this by extending Ferraris' work on answer sets for propositional theories to
+the epistemic case and reveal the relationship between some ES-semantic
+proposals.
 
-##### **Contrastive Token-level Explanations for Graph-based Rumour Detection**
-2502.04366v1 by Daniel Wai Kit Chin, Roy Ka-Wei Lee
+摘要：<paragraph>答案集程式設計（ASP）是基於邏輯的人工智慧中一種成功的問題解決方法。在 ASP 中，問題表示為宣告式邏輯程式，並透過其答案集來找出解答。平衡邏輯（EL）是一種通用的非單調推理形式主義，基於一種稱為此處和彼處邏輯的單調邏輯。EL 基本是由 Pearce 作為 ASP 的基礎架構所提出。知識規範（ES）是 ASP 程式與主觀文字的延伸。ASP 語言中的這些新模態建構使得可以檢查 ASP 的常規文字是否在程式的每個（或某些）答案集中為真。ES 程式由世界觀來詮釋，其本質上是答案集的集合。（反身）自認識邏輯是一種非單調形式主義，用來建模理想理性主體的自信念（知識）。ES 的一種相對新的語意是基於 EL 和（反身）自認識邏輯的組合。在本文中，我們首先提出一個涵蓋知識 ASP 領域的架構。然後，我們建立現有（反身）（自）認識平衡邏輯與我們容易適應的綜合架構之間的對應關係，建立在 Pearce 將答案集描述為平衡模型的特性之上。我們透過將 Ferraris 在命題理論的答案集上的工作延伸到知識案例，並揭示一些 ES 語義提案之間的關係來達成這一點。</paragraph>
 
-The widespread use of social media has accelerated the dissemination of
-information, but it has also facilitated the spread of harmful rumours, which
-can disrupt economies, influence political outcomes, and exacerbate public
-health crises, such as the COVID-19 pandemic. While Graph Neural Network
-(GNN)-based approaches have shown significant promise in automated rumour
-detection, they often lack transparency, making their predictions difficult to
-interpret. Existing graph explainability techniques fall short in addressing
-the unique challenges posed by the dependencies among feature dimensions in
-high-dimensional text embeddings used in GNN-based models. In this paper, we
-introduce Contrastive Token Layerwise Relevance Propagation (CT-LRP), a novel
-framework designed to enhance the explainability of GNN-based rumour detection.
-CT-LRP extends current graph explainability methods by providing token-level
-explanations that offer greater granularity and interpretability. We evaluate
-the effectiveness of CT-LRP across multiple GNN models trained on three
-publicly available rumour detection datasets, demonstrating that it
-consistently produces high-fidelity, meaningful explanations, paving the way
-for more robust and trustworthy rumour detection systems.
+##### **Graphical Conditions for the Existence, Unicity and Number of Regular Models**
+2502.09220v1 by Van-Giang Trinh, Belaid Benhamou, Sylvain Soliman, François Fages
 
-摘要：社群媒體的廣泛使用加速了資訊的傳播，但也促进了有害謠言的散播，這可能會擾亂經濟、影響政治結果，並加劇公共衛生危機，例如 COVID-19 大流行。雖然基於圖神經網路 (GNN) 的方法在自動化謠言偵測方面展現了顯著的前景，但它們通常缺乏透明度，這使得它們的預測難以解釋。現有的圖形可解釋性技術無法解決 GNN 模型中使用的維度嵌入式文本之間的依賴性所帶來的獨特挑戰。在本文中，我們介紹了對比標記分層關聯性傳播 (CT-LRP)，這是一個新穎的框架，旨在增強基於 GNN 的謠言偵測的可解釋性。CT-LRP 透過提供標記級別的解釋來擴充當前的圖形可解釋性方法，這些解釋提供了更細緻的粒度和可解釋性。我們在三個公開的謠言偵測資料集上訓練的幾個 GNN 模型中評估了 CT-LRP 的有效性，證明它始終產生高保真、有意義的解釋，為更強健且值得信賴的謠言偵測系統鋪路。
+The regular models of a normal logic program are a particular type of partial
+(i.e. 3-valued) models which correspond to stable partial models with minimal
+undefinedness. In this paper, we explore graphical conditions on the dependency
+graph of a finite ground normal logic program to analyze the existence, unicity
+and number of regular models for the program. We show three main results: 1) a
+necessary condition for the existence of non-trivial (i.e. non-2-valued)
+regular models, 2) a sufficient condition for the unicity of regular models,
+and 3) two upper bounds for the number of regular models based on positive
+feedback vertex sets. The first two conditions generalize the finite cases of
+the two existing results obtained by You and Yuan (1994) for normal logic
+programs with well-founded stratification. The third result is also new to the
+best of our knowledge. Key to our proofs is a connection that we establish
+between finite ground normal logic programs and Boolean network theory.
 
-##### **AI-Based Thermal Video Analysis in Privacy-Preserving Healthcare: A Case Study on Detecting Time of Birth**
-2502.04365v1 by Jorge García-Torres, Øyvind Meinich-Bache, Siren Rettedal, Kjersti Engan
+摘要：正规模型的常规模型是一种特殊类型的局部模型（即 3 值）模型，它对应于具有最小未定义性的稳定局部模型。在本文中，我们探索了有限接地正规逻辑程序的依赖图上的图形条件，以分析程序的正规模型的存在性、唯一性和数量。我们展示了三个主要结果：1) 非平凡（即非 2 值）正规模型存在的必要条件，2) 正规模型唯一性的充分条件，3) 基于正反馈顶点集的正规模型数目的两个上限。前两个条件概括了 You 和 Yuan (1994) 为具有良好基础分层的正规逻辑程序获得的两个现有结果的有限情况。据我们所知，第三个结果也是新的。我们证明的关键是我们在有限接地正规逻辑程序和布尔网络理论之间建立的联系。
 
-Approximately 10% of newborns need some assistance to start breathing and 5\%
-proper ventilation. It is crucial that interventions are initiated as soon as
-possible after birth. Accurate documentation of Time of Birth (ToB) is thereby
-essential for documenting and improving newborn resuscitation performance.
-However, current clinical practices rely on manual recording of ToB, typically
-with minute precision. In this study, we present an AI-driven, video-based
-system for automated ToB detection using thermal imaging, designed to preserve
-the privacy of healthcare providers and mothers by avoiding the use of
-identifiable visual data. Our approach achieves 91.4% precision and 97.4%
-recall in detecting ToB within thermal video clips during performance
-evaluation. Additionally, our system successfully identifies ToB in 96% of test
-cases with an absolute median deviation of 1 second compared to manual
-annotations. This method offers a reliable solution for improving ToB
-documentation and enhancing newborn resuscitation outcomes.
+##### **Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**
+2502.09218v1 by Flavio Bertini, Alessandro Dal Palù, Federica Zaglio, Francesco Fabiano, Andrea Formisano
 
-摘要：約 10% 的新生兒需要協助才能開始呼吸，5% 需要適當的通氣。在出生後盡快開始介入至關重要。準確記錄出生時間 (ToB) 對於記錄和改善新生兒復甦表現至關重要。然而，目前的臨床實務依賴於手動記錄 ToB，通常精確到分鐘。在這項研究中，我們提出一個以 AI 為主的、基於影片的系統，用於使用熱影像自動偵測 ToB，旨在透過避免使用可識別的視覺資料來保護醫療保健提供者和母親的隱私。我們的做法在執行評估期間，在熱影像片段中偵測 ToB 時達到了 91.4% 的精確度和 97.4% 的召回率。此外，我們的系統在 96% 的測試案例中成功識別出 ToB，與手動註解相比，絕對中位數偏差為 1 秒。此方法提供了一個可靠的解決方案，用於改善 ToB 記錄和增強新生兒復甦結果。
+This paper presents a complete explainable system that interprets a set of
+data, abstracts the underlying features and describes them in a natural
+language of choice. The system relies on two crucial stages: (i) identifying
+emerging properties from data and transforming them into abstract concepts, and
+(ii) converting these concepts into natural language. Despite the impressive
+natural language generation capabilities demonstrated by Large Language Models,
+their statistical nature and the intricacy of their internal mechanism still
+force us to employ these techniques as black boxes, forgoing trustworthiness.
+Developing an explainable pipeline for data interpretation would allow
+facilitating its use in safety-critical environments like processing medical
+information and allowing non-experts and visually impaired people to access
+narrated information. To this end, we believe that the fields of knowledge
+representation and automated reasoning research could present a valid
+alternative. Expanding on prior research that tackled the first stage (i), we
+focus on the second stage, named Concept2Text. Being explainable, data
+translation is easily modeled through logic-based rules, once again emphasizing
+the role of declarative programming in achieving AI explainability. This paper
+explores a Prolog/CLP-based rewriting system to interpret concepts-articulated
+in terms of classes and relations, plus common knowledge-derived from a generic
+ontology, generating natural language text. Its main features include
+hierarchical tree rewritings, modular multilingual generation, support for
+equivalent variants across semantic, grammar, and lexical levels, and a
+transparent rule-based system. We outline the architecture and demonstrate its
+flexibility through some examples capable of generating numerous diverse and
+equivalent rewritings based on the input concept.
 
-##### **3D Foundation AI Model for Generalizable Disease Detection in Head Computed Tomography**
-2502.02779v1 by Weicheng Zhu, Haoxu Huang, Huanze Tang, Rushabh Musthyala, Boyang Yu, Long Chen, Emilio Vega, Thomas O'Donnell, Seena Dehkharghani, Jennifer A. Frontera, Arjun V. Masurkar, Kara Melmed, Narges Razavian
+摘要：<paragraph>這篇論文提出了一個完整的可解釋系統，它可以解釋一組資料，抽象出基礎特徵，並以選擇的自然語言描述它們。系統依賴兩個關鍵階段：(i) 從資料中識別新興屬性，並將它們轉換為抽象概念，以及 (ii) 將這些概念轉換為自然語言。儘管大型語言模型展示了令人印象深刻的自然語言生成能力，但它們的統計性質和內部機制的複雜性仍然迫使我們將這些技術用作黑盒子，放棄可信度。開發一個可解釋的資料解釋管道將有助於促進在安全關鍵環境中使用它，例如處理醫療資訊，並允許非專家和視障人士存取敘述資訊。為此，我們相信知識表示和自動推理研究領域可以提出一個有效的替代方案。在擴展解決第一階段 (i) 的先前研究的基礎上，我們專注於第二階段，稱為 Concept2Text。由於具有可解釋性，資料翻譯很容易透過基於邏輯的規則建模，再次強調宣告式程式設計在實現 AI 可解釋性中的作用。本文探討了一個基於 Prolog/CLP 的重寫系統，以解釋概念，這些概念以類別和關係的形式表達，再加上從通用本体衍生的常識，產生自然語言文字。它的主要特點包括階層樹重寫、模組化多語言生成、支援語義、語法和詞彙層面的等效變體，以及一個透明的基於規則的系統。我們概述了架構，並透過一些範例展示了它的靈活性，這些範例能夠根據輸入概念生成許多不同的等效重寫。</paragraph>
 
-Head computed tomography (CT) imaging is a widely-used imaging modality with
-multitudes of medical indications, particularly in assessing pathology of the
-brain, skull, and cerebrovascular system. It is commonly the first-line imaging
-in neurologic emergencies given its rapidity of image acquisition, safety,
-cost, and ubiquity. Deep learning models may facilitate detection of a wide
-range of diseases. However, the scarcity of high-quality labels and
-annotations, particularly among less common conditions, significantly hinders
-the development of powerful models. To address this challenge, we introduce
-FM-CT: a Foundation Model for Head CT for generalizable disease detection,
-trained using self-supervised learning. Our approach pre-trains a deep learning
-model on a large, diverse dataset of 361,663 non-contrast 3D head CT scans
-without the need for manual annotations, enabling the model to learn robust,
-generalizable features. To investigate the potential of self-supervised
-learning in head CT, we employed both discrimination with self-distillation and
-masked image modeling, and we construct our model in 3D rather than at the
-slice level (2D) to exploit the structure of head CT scans more comprehensively
-and efficiently. The model's downstream classification performance is evaluated
-using internal and three external datasets, encompassing both in-distribution
-(ID) and out-of-distribution (OOD) data. Our results demonstrate that the
-self-supervised foundation model significantly improves performance on
-downstream diagnostic tasks compared to models trained from scratch and
-previous 3D CT foundation models on scarce annotated datasets. This work
-highlights the effectiveness of self-supervised learning in medical imaging and
-sets a new benchmark for head CT image analysis in 3D, enabling broader use of
-artificial intelligence for head CT-based diagnosis.
+##### **Mind the Gaps: Logical English, Prolog, and Multi-agent Systems for Autonomous Vehicles**
+2502.09216v1 by Galileo Sartor, Adam Wyner, Giuseppe Contissa
+
+In this paper, we present a modular system for representing and reasoning
+with legal aspects of traffic rules for autonomous vehicles. We focus on a
+subset of the United Kingdom's Highway Code (HC) related to junctions. As human
+drivers and automated vehicles (AVs) will interact on the roads, especially in
+urban environments, we claim that an accessible, unitary, high-level
+computational model should exist and be applicable to both users. Autonomous
+vehicles introduce a shift in liability that should not bring disadvantages or
+increased burden on human drivers. We develop a system "in silico" of the
+model. The proposed system is built of three main components: a natural
+language interface, using Logical English, which encodes the rules; an internal
+representation of the rules in Prolog; and an multi-agent-based simulation
+environment, built in NetLogo. The three components interact: Logical English
+is translated into and out of Prolog (along with some support code); Prolog and
+NetLogo interface via predicates. Such a modular approach enables the different
+components to carry different "burdens" in the overall system; it also allows
+swapping of modules. Given NetLogo, we can visualize the effect of the modeled
+rules as well as validate the system with a simple dynamic running scenario.
+Designated agents monitor the behaviour of the vehicles for compliance and
+record potential violations where they occur. The information on potential
+violations is then utilized by Validators, to determine whether the violation
+is punishable, differentiating between exceptions and cases.
 
-摘要：頭部電腦斷層掃描（CT）影像是一種廣泛使用的影像模式，具有
-大量的醫療適應症，特別是在評估腦部、頭骨和腦血管系統的病理時。由於其影像擷取速度快、安全性、成本低和普遍性，通常是神經緊急情況下的第一線影像。深度學習模型可以促進對各種疾病的檢測。然而，高品質標籤和註釋的稀缺，特別是在較不常見的疾病中，顯著地阻礙了強大模型的發展。為了應對這一挑戰，我們引入了 FM-CT：一個用於頭部 CT 的基礎模型，用於可概化的疾病檢測，並使用自我監督學習進行訓練。我們的做法在一個包含 361,663 個非對比 3D 頭部 CT 掃描的大型、多樣化的數據集上預訓練一個深度學習模型，而無需手動註釋，使模型能夠學習強健、可概化的特徵。為了探討自我監督學習在頭部 CT 中的潛力，我們同時採用了帶有自我蒸餾的判別和遮罩影像建模，並且我們以 3D 而不是切片層級（2D）構建我們的模型，以更全面、有效地利用頭部 CT 掃描的結構。該模型的下游分類效能使用內部和三個外部數據集進行評估，包括分佈內 (ID) 和分佈外 (OOD) 資料。我們的結果表明，與從頭開始訓練的模型和先前在稀疏註釋數據集上訓練的 3D CT 基礎模型相比，自我監督基礎模型顯著改善了下游診斷任務的效能。這項工作突顯了自我監督學習在醫學影像中的有效性，並為 3D 頭部 CT 影像分析設定了一個新的基準，讓人工智慧能夠更廣泛地用於基於頭部 CT 的診斷。
+摘要：<paragraph>在本文中，我們提出了一個模組化系統，用於表示和推理自動駕駛車輛交通規則的法律層面。我們專注於與路口相關的英國公路法規 (HC) 子集。由於人類駕駛和自動駕駛車輛 (AV) 將在道路上互動，尤其是在城市環境中，我們主張應存在一個可存取、統一、高階的運算模型，並適用於這兩種使用者。自動駕駛車輛引入了責任轉移，不應給人類駕駛帶來劣勢或增加負擔。我們開發了一個模型的「電腦模擬」系統。所提出的系統由三個主要組成部分建構而成：使用邏輯英語的自然語言介面，用於編碼規則；使用 Prolog 的規則內部表示；以及使用 NetLogo 建構的多主體模擬環境。這三個組成部分會進行互動：邏輯英語會翻譯成 Prolog（以及一些支援程式碼），再從 Prolog 翻譯回來；Prolog 和 NetLogo 會透過謂詞進行介面。這種模組化方法讓不同的組成部分可以在整體系統中承擔不同的「負擔」；它也允許模組交換。有了 NetLogo，我們可以視覺化已建模規則的效果，並使用一個簡單的動態執行範例來驗證系統。指定的代理會監控車輛的行為，以確保遵守規定，並記錄發生的潛在違規行為。然後，驗證者會利用潛在違規行為的資訊，來確定違規行為是否應受懲罰，並區分例外情況和案例。</paragraph>
 
-##### **Adaptive Voxel-Weighted Loss Using L1 Norms in Deep Neural Networks for Detection and Segmentation of Prostate Cancer Lesions in PET/CT Images**
-2502.02756v1 by Obed Korshie Dzikunu, Shadab Ahamed, Amirhossein Toosi, Xiaoxiao Li, Arman Rahmim
+##### **Architecture for Simulating Behavior Mode Changes in Norm-Aware Autonomous Agents**
+2502.09215v1 by Sean Glaze, Daniela Inclezan
 
-This study proposes a new loss function for deep neural networks, L1-weighted
-Dice Focal Loss (L1DFL), that leverages L1 norms for adaptive weighting of
-voxels based on their classification difficulty, towards automated detection
-and segmentation of metastatic prostate cancer lesions in PET/CT scans. We
-obtained 380 PSMA [18-F] DCFPyL PET/CT scans of patients diagnosed with
-biochemical recurrence metastatic prostate cancer. We trained two 3D
-convolutional neural networks, Attention U-Net and SegResNet, and concatenated
-the PET and CT volumes channel-wise as input. The performance of our custom
-loss function was evaluated against the Dice and Dice Focal Loss functions. For
-clinical significance, we considered a detected region of interest (ROI) as a
-true positive if at least the voxel with the maximum standardized uptake value
-falls within the ROI. We assessed the models' performance based on the number
-of lesions in an image, tumour volume, activity, and extent of spread. The
-L1DFL outperformed the comparative loss functions by at least 13% on the test
-set. In addition, the F1 scores of the Dice Loss and the Dice Focal Loss were
-lower than that of L1DFL by at least 6% and 34%, respectively. The Dice Focal
-Loss yielded more false positives, whereas the Dice Loss was more sensitive to
-smaller volumes and struggled to segment larger lesions accurately. They also
-exhibited network-specific variations and yielded declines in segmentation
-accuracy with increased tumour spread. Our results demonstrate the potential of
-L1DFL to yield robust segmentation of metastatic prostate cancer lesions in
-PSMA PET/CT images. The results further highlight potential complexities
-arising from the variations in lesion characteristics that may influence
-automated prostate cancer tumour detection and segmentation. The code is
-publicly available at: https://github.com/ObedDzik/pca_segment.git.
+This paper presents an architecture for simulating the actions of a
+norm-aware intelligent agent whose behavior with respect to norm compliance is
+set, and can later be changed, by a human controller. Updating an agent's
+behavior mode from a norm-abiding to a riskier one may be relevant when the
+agent is involved in time-sensitive rescue operations, for example. We base our
+work on the Authorization and Obligation Policy Language AOPL designed by
+Gelfond and Lobo for the specification of norms. We introduce an architecture
+and a prototype software system that can be used to simulate an agent's plans
+under different behavior modes that can later be changed by the controller. We
+envision such software to be useful to policy makers, as they can more readily
+understand how agents may act in certain situations based on the agents'
+attitudes towards norm-compliance. Policy makers may then refine their policies
+if simulations show unwanted consequences.
 
-摘要：<paragraph>本研究針對深度神經網路提出一個新的損失函數，L1 加權 Dice 焦點損失 (L1DFL)，它利用 L1 範數根據體素的分類難度進行自適應加權，用於自動偵測和分割 PET/CT 掃描中轉移性前列腺癌病灶。我們取得 380 個經診斷為生化復發轉移性前列腺癌的患者的 PSMA [18-F] DCFPyL PET/CT 掃描。我們訓練了兩個 3D 捲積神經網路，Attention U-Net 和 SegResNet，並將 PET 和 CT 體積按通道連接作為輸入。我們自訂的損失函數的效能與 Dice 和 Dice 焦點損失函數進行評估。為了臨床意義，我們將一個偵測到的感興趣區域 (ROI) 視為真陽性，如果至少具有最大標準攝取值的體素落在 ROI 內。我們根據影像中的病灶數量、腫瘤體積、活性，以及擴散程度評估模型的效能。L1DFL 在測試組中至少比比較損失函數高出 13%。此外，Dice 損失和 Dice 焦點損失的 F1 分數分別比 L1DFL 低至少 6% 和 34%。Dice 焦點損失產生更多假陽性，而 Dice 損失對較小體積較為敏感，且難以準確分割較大病灶。它們也展現出網路特定的變化，並隨著腫瘤擴散而導致分割準確度下降。我們的結果證明 L1DFL 具有在 PSMA PET/CT 影像中產生轉移性前列腺癌病灶的強健分割的潛力。結果進一步強調由病灶特徵變化所產生的潛在複雜性，這可能會影響自動化前列腺癌腫瘤偵測和分割。程式碼公開於：https://github.com/ObedDzik/pca_segment.git。</paragraph>
+摘要：本文提出了一個架構，用於模擬一個規範感知智能代理的行為，其行為遵守規範，並可以由人類控制者設定，並可以在稍後進行更改。當代理參與時間敏感的救援行動時，將代理的行為模式從遵守規範更新為更冒險的行為模式可能是相關的。我們的工作基於 Gelfond 和 Lobo 為規範規範設計的授權和義務政策語言 AOPL。我們引入了一個架構和一個原型軟體系統，可用於模擬代理在不同行為模式下的計畫，這些行為模式稍後可以由控制者更改。我們預計此類軟體對政策制定者很有用，因為他們可以更容易地根據代理對規範遵守的態度了解代理在特定情況下的行為方式。如果模擬顯示出不希望的後果，政策制定者可以修改他們的政策。
 
-##### **Diffusion Instruction Tuning**
-2502.06814v1 by Chen Jin, Ryutaro Tanno, Amrutha Saseendran, Tom Diethe, Philip Teare
+##### **Neuro-Symbolic Contrastive Learning for Cross-domain Inference**
+2502.09213v1 by Mingyue Liu, Ryo Ueda, Zhen Wan, Katsumi Inoue, Chris G. Willcocks
 
-We introduce Lavender, a simple supervised fine-tuning (SFT) method that
-boosts the performance of advanced vision-language models (VLMs) by leveraging
-state-of-the-art image generation models such as Stable Diffusion.
-Specifically, Lavender aligns the text-vision attention in the VLM transformer
-with the equivalent used by Stable Diffusion during SFT, instead of adapting
-separate encoders. This alignment enriches the model's visual understanding and
-significantly boosts performance across in- and out-of-distribution tasks.
-Lavender requires just 0.13 million training examples, 2.5% of typical
-large-scale SFT datasets, and fine-tunes on standard hardware (8 GPUs) in a
-single day. It consistently improves state-of-the-art open-source multimodal
-LLMs (e.g., Llama-3.2-11B, MiniCPM-Llama3-v2.5), achieving up to 30% gains and
-a 68% boost on challenging out-of-distribution medical QA tasks. By efficiently
-transferring the visual expertise of image generators with minimal supervision,
-Lavender offers a scalable solution for more accurate vision-language systems.
-All code, training data, and models will be shared at
-https://astrazeneca.github.io/vlm/.
+Pre-trained language models (PLMs) have made significant advances in natural
+language inference (NLI) tasks, however their sensitivity to textual
+perturbations and dependence on large datasets indicate an over-reliance on
+shallow heuristics. In contrast, inductive logic programming (ILP) excels at
+inferring logical relationships across diverse, sparse and limited datasets,
+but its discrete nature requires the inputs to be precisely specified, which
+limits their application. This paper proposes a bridge between the two
+approaches: neuro-symbolic contrastive learning. This allows for smooth and
+differentiable optimisation that improves logical accuracy across an otherwise
+discrete, noisy, and sparse topological space of logical functions. We show
+that abstract logical relationships can be effectively embedded within a
+neuro-symbolic paradigm, by representing data as logic programs and sets of
+logic rules. The embedding space captures highly varied textual information
+with similar semantic logical relations, but can also separate similar textual
+relations that have dissimilar logical relations. Experimental results
+demonstrate that our approach significantly improves the inference capabilities
+of the models in terms of generalisation and reasoning.
 
-摘要：<paragraph>我們介紹 Lavender，一種簡單的監督微調 (SFT) 方法，它透過利用 Stable Diffusion 等最先進的影像生成模型來提升先進視覺語言模型 (VLM) 的效能。
-具體來說，Lavender 在 SFT 期間將 VLM 轉換器中的文字視覺注意力與 Stable Diffusion 使用的等效注意力對齊，而不是調整單獨的編碼器。此對齊豐富了模型的視覺理解，並顯著提升了分佈內外任務的效能。
-Lavender 只需要 0.13 百萬個訓練範例，相當於典型大型 SFT 資料集的 2.5%，並在標準硬體 (8 個 GPU) 上於一天內進行微調。它持續改善最先進的開放原始碼多模態 LLM（例如 Llama-3.2-11B、MiniCPM-Llama3-v2.5），在具有挑戰性的分佈外醫療 QA 任務中獲得高達 30% 的收益和 68% 的提升。透過有效轉移影像生成器的視覺專業知識，並僅需最少的監督，Lavender 提供了一個可擴充的解決方案，以實現更準確的視覺語言系統。
-所有程式碼、訓練資料和模型將在 https://astrazeneca.github.io/vlm/ 分享。</paragraph>
+摘要：預訓練語言模型 (PLM) 在自然語言推理 (NLI) 任務中取得了重大進展，然而它們對文本擾動的敏感性和對大型資料集的依賴性表明過度依賴於淺層啟發法。相比之下，歸納邏輯規劃 (ILP) 擅長推論跨越多樣化、稀疏和有限資料集的邏輯關係，但其離散性質要求輸入被精確指定，這限制了它們的應用。本文提出了兩種方法之間的橋樑：神經符號對比學習。這允許平滑且可微分的優化，從而提高邏輯函數的離散、嘈雜和稀疏拓撲空間中的邏輯準確性。我們展示了抽象邏輯關係可以通過將資料表示為邏輯程式和邏輯規則集，有效地嵌入到神經符號範例中。嵌入空間捕獲具有相似語義邏輯關係的高度多變的文本資訊，但也可以分離具有不同邏輯關係的相似文本關係。實驗結果表明，我們的做法在泛化和推理方面顯著提高了模型的推理能力。
 
-##### **MedRAX: Medical Reasoning Agent for Chest X-ray**
-2502.02673v1 by Adibvafa Fallahpour, Jun Ma, Alif Munim, Hongwei Lyu, Bo Wang
+##### **LP-LM: No Hallucinations in Question Answering with Logic Programming**
+2502.09212v1 by Katherine Wu, Yanhong A. Liu
 
-Chest X-rays (CXRs) play an integral role in driving critical decisions in
-disease management and patient care. While recent innovations have led to
-specialized models for various CXR interpretation tasks, these solutions often
-operate in isolation, limiting their practical utility in clinical practice. We
-present MedRAX, the first versatile AI agent that seamlessly integrates
-state-of-the-art CXR analysis tools and multimodal large language models into a
-unified framework. MedRAX dynamically leverages these models to address complex
-medical queries without requiring additional training. To rigorously evaluate
-its capabilities, we introduce ChestAgentBench, a comprehensive benchmark
-containing 2,500 complex medical queries across 7 diverse categories. Our
-experiments demonstrate that MedRAX achieves state-of-the-art performance
-compared to both open-source and proprietary models, representing a significant
-step toward the practical deployment of automated CXR interpretation systems.
-Data and code have been publicly available at
-https://github.com/bowang-lab/MedRAX
+Large language models (LLMs) are able to generate human-like responses to
+user queries. However, LLMs exhibit inherent limitations, especially because
+they hallucinate. This paper introduces LP-LM, a system that grounds answers to
+questions in known facts contained in a knowledge base (KB), facilitated
+through semantic parsing in Prolog, and always produces answers that are
+reliable.
+  LP-LM generates a most probable constituency parse tree along with a
+corresponding Prolog term for an input question via Prolog definite clause
+grammar (DCG) parsing. The term is then executed against a KB of natural
+language sentences also represented as Prolog terms for question answering. By
+leveraging DCG and tabling, LP-LM runs in linear time in the size of input
+sentences for sufficiently many grammar rules. Performing experiments comparing
+LP-LM with current well-known LLMs in accuracy, we show that LLMs hallucinate
+on even simple questions, unlike LP-LM.
 
-摘要：胸部 X 光片 (CXR) 在疾病管理和患者照護中扮演著不可或缺的角色，推動著關鍵決策的制定。儘管近期的創新已針對各種 CXR 解讀任務開發出專門的模型，但這些解決方案通常獨立運作，限制了它們在臨床實務中的實際效用。我們提出 MedRAX，這是一款首創的多功能 AI 代理，它將最先進的 CXR 分析工具和多模態大型語言模型無縫整合到一個統一的架構中。MedRAX 動態運用這些模型來解決複雜的醫療查詢，而無需額外的訓練。為了嚴格評估其功能，我們引入了 ChestAgentBench，這是一個全面的基準，包含 7 個不同類別的 2,500 個複雜醫療查詢。我們的實驗證明，與開源和專有模型相比，MedRAX 達到了最先進的效能，這代表了自動化 CXR 解讀系統實際部署的重要一步。資料和程式碼已公開於 https://github.com/bowang-lab/MedRAX
+摘要：大型語言模型 (LLM) 能產生類似人類的回應來回答使用者的問題。然而，LLM 顯示出內在的限制，特別是因為它們會產生幻覺。本文介紹 LP-LM，一個系統，它將問題的答案建立在知識庫 (KB) 中已知的事實上，透過 Prolog 中的語義解析來促進，並始終產生可靠的答案。
+LP-LM 透過 Prolog 明確條款語法 (DCG) 解析產生一個最可能的成分解析樹，以及輸入問題對應的 Prolog 詞彙。然後，針對一個自然語言句子的 KB 執行該詞彙，也表示為 Prolog 詞彙，以進行問題解答。透過利用 DCG 和 tabling，LP-LM 在輸入句子的大小上以線性時間執行，對於足夠多的語法規則。執行實驗比較 LP-LM 與目前眾所周知的 LLM 在準確性上，我們顯示出 LLM 甚至會對簡單的問題產生幻覺，這與 LP-LM 不同。
 
-##### **Open Foundation Models in Healthcare: Challenges, Paradoxes, and Opportunities with GenAI Driven Personalized Prescription**
-2502.04356v1 by Mahdi Alkaeed, Sofiat Abioye, Adnan Qayyum, Yosra Magdi Mekki, Ilhem Berrou, Mohamad Abdallah, Ala Al-Fuqaha, Muhammad Bilal, Junaid Qadir
+##### **Visual Graph Question Answering with ASP and LLMs for Language Parsing**
+2502.09211v1 by Jakob Johannes Bauer, Thomas Eiter, Nelson Higuera Ruiz, Johannes Oetsch
 
-In response to the success of proprietary Large Language Models (LLMs) such
-as OpenAI's GPT-4, there is a growing interest in developing open,
-non-proprietary LLMs and AI foundation models (AIFMs) for transparent use in
-academic, scientific, and non-commercial applications. Despite their inability
-to match the refined functionalities of their proprietary counterparts, open
-models hold immense potential to revolutionize healthcare applications. In this
-paper, we examine the prospects of open-source LLMs and AIFMs for developing
-healthcare applications and make two key contributions. Firstly, we present a
-comprehensive survey of the current state-of-the-art open-source healthcare
-LLMs and AIFMs and introduce a taxonomy of these open AIFMs, categorizing their
-utility across various healthcare tasks. Secondly, to evaluate the
-general-purpose applications of open LLMs in healthcare, we present a case
-study on personalized prescriptions. This task is particularly significant due
-to its critical role in delivering tailored, patient-specific medications that
-can greatly improve treatment outcomes. In addition, we compare the performance
-of open-source models with proprietary models in settings with and without
-Retrieval-Augmented Generation (RAG). Our findings suggest that, although less
-refined, open LLMs can achieve performance comparable to proprietary models
-when paired with grounding techniques such as RAG. Furthermore, to highlight
-the clinical significance of LLMs-empowered personalized prescriptions, we
-perform subjective assessment through an expert clinician. We also elaborate on
-ethical considerations and potential risks associated with the misuse of
-powerful LLMs and AIFMs, highlighting the need for a cautious and responsible
-implementation in healthcare.
+Visual Question Answering (VQA) is a challenging problem that requires to
+process multimodal input. Answer-Set Programming (ASP) has shown great
+potential in this regard to add interpretability and explainability to modular
+VQA architectures. In this work, we address the problem of how to integrate ASP
+with modules for vision and natural language processing to solve a new and
+demanding VQA variant that is concerned with images of graphs (not graphs in
+symbolic form). Images containing graph-based structures are an ubiquitous and
+popular form of visualisation. Here, we deal with the particular problem of
+graphs inspired by transit networks, and we introduce a novel dataset that
+amends an existing one by adding images of graphs that resemble metro lines.
+Our modular neuro-symbolic approach combines optical graph recognition for
+graph parsing, a pretrained optical character recognition neural network for
+parsing labels, Large Language Models (LLMs) for language processing, and ASP
+for reasoning. This method serves as a first baseline and achieves an overall
+average accuracy of 73% on the dataset. Our evaluation provides further
+evidence of the potential of modular neuro-symbolic systems, in particular with
+pretrained models that do not involve any further training and logic
+programming for reasoning, to solve complex VQA tasks.
 
-摘要：<paragraph>為了回應 OpenAI 的 GPT-4 等專有大型語言模型 (LLM) 的成功，開發開放、非專有的 LLM 和人工智慧基礎模型 (AIFM) 以透明地用於學術、科學和非商業應用中，引起了越來越大的興趣。儘管無法與其專有對應產品的精緻功能相匹配，但開放模型在革新醫療保健應用方面具有巨大的潛力。在本文中，我們探討了開放原始碼 LLM 和 AIFM 在開發醫療保健應用方面的前景，並提出了兩項關鍵貢獻。首先，我們對當前最先進的開放原始碼醫療保健 LLM 和 AIFM 進行了全面的調查，並介紹了這些開放 AIFM 的分類法，對它們在各種醫療保健任務中的效用進行了分類。其次，為了評估開放 LLM 在醫療保健中的通用應用，我們對個人化處方進行了案例研究。這項任務特別重要，因為它在提供量身定制的患者特定藥物方面發揮著關鍵作用，可以大大改善治療效果。此外，我們比較了開放原始碼模型與專有模型在有和沒有檢索增強生成 (RAG) 的設置中的性能。我們的研究結果表明，儘管不太精緻，但開放 LLM 在與 RAG 等基礎技術配對時，可以實現與專有模型相當的性能。此外，為了強調 LLM 賦能的個性化處方的臨床意義，我們通過專家臨床醫生進行了主觀評估。我們還詳細說明了與濫用強大的 LLM 和 AIFM 相關的倫理考量和潛在風險，強調了在醫療保健中謹慎和負責任地實施的必要性。</paragraph>
+摘要：視覺問答（VQA）是一項具有挑戰性的問題，需要處理多模態輸入。答案集程式設計（ASP）在這方面顯示出巨大的潛力，可以為模組化 VQA 架構增加可解釋性和說明性。在這項工作中，我們探討如何將 ASP 與視覺和自然語言處理模組整合，以解決一個新的且要求嚴格的 VQA 變體，該變體與圖形影像（而非符號形式的圖形）有關。包含圖形結構的影像是一種普遍且流行的可視化形式。在這裡，我們處理受交通網路啟發的圖形特定問題，並引入一個新的資料集，透過新增類似地鐵路線的圖形影像來修正現有資料集。我們的模組化神經符號方法結合光學圖形辨識進行圖形解析、預先訓練的光學字元辨識神經網路進行標籤解析、大型語言模型（LLM）進行語言處理，以及 ASP 進行推理。此方法作為第一個基準，在資料集上達到 73% 的整體平均準確度。我們的評估進一步證明了模組化神經符號系統的潛力，特別是預先訓練的模型，這些模型不涉及任何進一步的訓練和邏輯程式設計進行推理，以解決複雜的 VQA 任務。
 
-##### **Decision Theoretic Foundations for Conformal Prediction: Optimal Uncertainty Quantification for Risk-Averse Agents**
-2502.02561v1 by Shayan Kiyani, George Pappas, Aaron Roth, Hamed Hassani
+##### **On LLM-generated Logic Programs and their Inference Execution Methods**
+2502.09209v1 by Paul Tarau
 
-A fundamental question in data-driven decision making is how to quantify the
-uncertainty of predictions in ways that can usefully inform downstream action.
-This interface between prediction uncertainty and decision-making is especially
-important in risk-sensitive domains, such as medicine. In this paper, we
-develop decision-theoretic foundations that connect uncertainty quantification
-using prediction sets with risk-averse decision-making. Specifically, we answer
-three fundamental questions: (1) What is the correct notion of uncertainty
-quantification for risk-averse decision makers? We prove that prediction sets
-are optimal for decision makers who wish to optimize their value at risk. (2)
-What is the optimal policy that a risk averse decision maker should use to map
-prediction sets to actions? We show that a simple max-min decision policy is
-optimal for risk-averse decision makers. Finally, (3) How can we derive
-prediction sets that are optimal for such decision makers? We provide an exact
-characterization in the population regime and a distribution free finite-sample
-construction. Answering these questions naturally leads to an algorithm,
-Risk-Averse Calibration (RAC), which follows a provably optimal design for
-deriving action policies from predictions. RAC is designed to be both
-practical-capable of leveraging the quality of predictions in a black-box
-manner to enhance downstream utility-and safe-adhering to a user-defined risk
-threshold and optimizing the corresponding risk quantile of the user's
-downstream utility. Finally, we experimentally demonstrate the significant
-advantages of RAC in applications such as medical diagnosis and recommendation
-systems. Specifically, we show that RAC achieves a substantially improved
-trade-off between safety and utility, offering higher utility compared to
-existing methods while maintaining the safety guarantee.
+Large Language Models (LLMs) trained on petabytes of data are highly
+compressed repositories of a significant proportion of the knowledge
+accumulated and distilled so far. In this paper we study techniques to elicit
+this knowledge in the form of several classes of logic programs, including
+propositional Horn clauses, Dual Horn clauses, relational triplets and Definite
+Clause Grammars. Exposing this knowledge as logic programs enables sound
+reasoning methods that can verify alignment of LLM outputs to their intended
+uses and extend their inference capabilities. We study new execution methods
+for the generated programs, including soft-unification of abducible facts
+against LLM-generated content stored in a vector database as well as GPU-based
+acceleration of minimal model computation that supports inference with large
+LLM-generated programs.
 
-摘要：<paragraph>在資料驅動決策中，一個基本問題是，如何量化預測的不確定性，以能有用地告知下游行動。
-預測不確定性和決策制定之間的這種介面，在風險敏感領域中特別重要，例如醫學。在本文中，我們
-發展了決策理論基礎，它利用預測集合將不確定性量化與風險規避決策制定聯繫起來。具體來說，我們回答
-了三個基本問題：(1) 對於風險規避決策者來說，不確定性量化的正確概念是什麼？我們證明，對於希望最佳化其風險價值的決策者來說，預測集合是最佳的。(2)
-風險規避決策者應使用什麼最佳政策，將預測集合映射到行動？我們表明，對於風險規避決策者來說，一個簡單的最大最小決策政策是最佳的。最後，(3) 我們如何推導出對此類決策者來說最佳的預測集合？我們在總體範圍內提供了一個確切的表徵，並提供了一個不依賴分佈的有限樣本建構。回答這些問題自然會導致一個演算法，風險規避校準 (RAC)，它遵循一個可證明最佳的設計，從預測中推導出行動政策。RAC 被設計為既實用——能夠以黑盒方式利用預測的品質來增強下游效用——又安全——遵守使用者定義的風險閾值，並最佳化使用者的下游效用的對應風險分位數。最後，我們在醫學診斷和推薦系統等應用中，以實驗方式證明了 RAC 的顯著優點。具體來說，我們表明，與現有方法相比，RAC 在安全性和效用之間實現了顯著改善的折衷，在維持安全保證的同時，提供了更高的效用。</paragraph>
+摘要：大型語言模型 (LLM) 在數位位元組的資料上受過訓練，是目前為止累積和提煉的知識中，高度濃縮的儲存庫。在本文中，我們研究了以數種邏輯程式類別的形式引出這些知識的技術，包括命題霍恩子句、雙重霍恩子句、關聯三元組和確定子句文法。將這些知識作為邏輯程式揭露，能啟用健全的推理方法，驗證 LLM 輸出的對齊方式，符合其預期的用途，並擴展其推論能力。我們研究了產生程式的新執行方法，包括對儲存在向量資料庫中的 LLM 產生內容，進行可約簡事實的軟統一，以及支援使用大型 LLM 產生程式進行推論的，基於 GPU 的最小模型計算加速。
 
-##### **CoRPA: Adversarial Image Generation for Chest X-rays Using Concept Vector Perturbations and Generative Models**
-2502.05214v1 by Amy Rafferty, Rishi Ramaesh, Ajitha Rajan
+##### **Efficient OWL2QL Meta-reasoning Using ASP-based Hybrid Knowledge Bases**
+2502.09206v1 by Haya Majid Qureshi, Wolfgang Faber
 
-Deep learning models for medical image classification tasks are becoming
-widely implemented in AI-assisted diagnostic tools, aiming to enhance
-diagnostic accuracy, reduce clinician workloads, and improve patient outcomes.
-However, their vulnerability to adversarial attacks poses significant risks to
-patient safety. Current attack methodologies use general techniques such as
-model querying or pixel value perturbations to generate adversarial examples
-designed to fool a model. These approaches may not adequately address the
-unique characteristics of clinical errors stemming from missed or incorrectly
-identified clinical features. We propose the Concept-based Report Perturbation
-Attack (CoRPA), a clinically-focused black-box adversarial attack framework
-tailored to the medical imaging domain. CoRPA leverages clinical concepts to
-generate adversarial radiological reports and images that closely mirror
-realistic clinical misdiagnosis scenarios. We demonstrate the utility of CoRPA
-using the MIMIC-CXR-JPG dataset of chest X-rays and radiological reports. Our
-evaluation reveals that deep learning models exhibiting strong resilience to
-conventional adversarial attacks are significantly less robust when subjected
-to CoRPA's clinically-focused perturbations. This underscores the importance of
-addressing domain-specific vulnerabilities in medical AI systems. By
-introducing a specialized adversarial attack framework, this study provides a
-foundation for developing robust, real-world-ready AI models in healthcare,
-ensuring their safe and reliable deployment in high-stakes clinical
-environments.
+Metamodeling refers to scenarios in ontologies in which classes and roles can
+be members of classes or occur in roles. This is a desirable modelling feature
+in several applications, but allowing it without restrictions is problematic
+for several reasons, mainly because it causes undecidability. Therefore,
+practical languages either forbid metamodeling explicitly or treat occurrences
+of classes as instances to be semantically different from other occurrences,
+thereby not allowing metamodeling semantically. Several extensions have been
+proposed to provide metamodeling to some extent. Building on earlier work that
+reduces metamodeling query answering to Datalog query answering, recently
+reductions to query answering over hybrid knowledge bases were proposed with
+the aim of using the Datalog transformation only where necessary. Preliminary
+work showed that the approach works, but the hoped-for performance improvements
+were not observed yet. In this work we expand on this body of work by improving
+the theoretical basis of the reductions and by using alternative tools that
+show competitive performance.
 
-摘要：深度学习模型用于医学影像分类任务，在人工智能辅助诊断工具中得到广泛应用，旨在提高诊断准确性、减少临床医生的工作量并改善患者的治疗效果。然而，它们对对抗性攻击的脆弱性给患者安全带来了重大风险。目前的攻击方法使用通用技术，例如模型查询或像素值扰动来生成对抗性示例，旨在欺骗模型。这些方法可能无法充分解决源自遗漏或错误识别的临床特征的临床错误的独特特征。我们提出了基于概念的报告扰动攻击 (CoRPA)，这是一种以临床为中心的、针对医学成像领域的、黑盒对抗性攻击框架。CoRPA 利用临床概念来生成对抗性放射学报告和图像，这些报告和图像与现实的临床误诊场景非常相似。我们使用胸部 X 射线和放射学报告的 MIMIC-CXR-JPG 数据集演示了 CoRPA 的效用。我们的评估表明，对传统对抗性攻击表现出强大弹性的深度学习模型在受到 CoRPA 以临床为中心的扰动时，其鲁棒性明显降低。这强调了在医疗人工智能系统中解决特定领域漏洞的重要性。通过引入专门的对抗性攻击框架，本研究为在医疗保健领域开发健壮、面向现实世界的 AI 模型奠定了基础，确保它们在高风险临床环境中安全可靠地部署。
+摘要：元建模是指本体中的場景，其中類別和角色可以是類別成員或出現在角色中。這是一個在多個應用中理想的建模功能，但允許它不受限制會因多個原因而產生問題，主要是因為它會導致無法決定。因此，實用的語言會明確禁止元建模，或將類別的出現視為與其他出現語義不同的實例，從而語義上不允許元建模。已經提出多個擴充功能，在一定程度上提供元建模。建立在將元建模查詢回答簡化為 Datalog 查詢回答的早期工作之上，最近提出了將查詢回答簡化為混合知識庫的簡化，目的是僅在必要時使用 Datalog 轉換。初步工作顯示該方法有效，但尚未觀察到預期的效能改善。在這項工作中，我們透過改善簡化的理論基礎和使用表現競爭力的替代工具，擴展了這項工作。
 
-##### **A Self-Supervised Framework for Improved Generalisability in Ultrasound B-mode Image Segmentation**
-2502.02489v1 by Edward Ellis, Andrew Bulpitt, Nasim Parsa, Michael F Byrne, Sharib Ali
+##### **Counterfactual Explanations as Plans**
+2502.09205v1 by Vaishak Belle
 
-Ultrasound (US) imaging is clinically invaluable due to its noninvasive and
-safe nature. However, interpreting US images is challenging, requires
-significant expertise, and time, and is often prone to errors. Deep learning
-offers assistive solutions such as segmentation. Supervised methods rely on
-large, high-quality, and consistently labeled datasets, which are challenging
-to curate. Moreover, these methods tend to underperform on out-of-distribution
-data, limiting their clinical utility. Self-supervised learning (SSL) has
-emerged as a promising alternative, leveraging unlabeled data to enhance model
-performance and generalisability. We introduce a contrastive SSL approach
-tailored for B-mode US images, incorporating a novel Relation Contrastive Loss
-(RCL). RCL encourages learning of distinct features by differentiating positive
-and negative sample pairs through a learnable metric. Additionally, we propose
-spatial and frequency-based augmentation strategies for the representation
-learning on US images. Our approach significantly outperforms traditional
-supervised segmentation methods across three public breast US datasets,
-particularly in data-limited scenarios. Notable improvements on the Dice
-similarity metric include a 4% increase on 20% and 50% of the BUSI dataset,
-nearly 6% and 9% improvements on 20% and 50% of the BrEaST dataset, and 6.4%
-and 3.7% improvements on 20% and 50% of the UDIAT dataset, respectively.
-Furthermore, we demonstrate superior generalisability on the
-out-of-distribution UDIAT dataset with performance boosts of 20.6% and 13.6%
-compared to the supervised baseline using 20% and 50% of the BUSI and BrEaST
-training data, respectively. Our research highlights that domain-inspired SSL
-can improve US segmentation, especially under data-limited conditions.
+There has been considerable recent interest in explainability in AI,
+especially with black-box machine learning models. As correctly observed by the
+planning community, when the application at hand is not a single-shot decision
+or prediction, but a sequence of actions that depend on observations, a richer
+notion of explanations are desirable.
+  In this paper, we look to provide a formal account of ``counterfactual
+explanations," based in terms of action sequences. We then show that this
+naturally leads to an account of model reconciliation, which might take the
+form of the user correcting the agent's model, or suggesting actions to the
+agent's plan. For this, we will need to articulate what is true versus what is
+known, and we appeal to a modal fragment of the situation calculus to formalise
+these intuitions. We consider various settings: the agent knowing partial
+truths, weakened truths and having false beliefs, and show that our definitions
+easily generalize to these different settings.
 
-摘要：超音波 (US) 影像由於其非侵入性且安全的特性，在臨床上極具價值。然而，解讀超音波影像具有挑戰性，需要大量的專業知識和時間，而且經常容易出錯。深度學習提供了輔助解決方案，例如分割。監督式方法依賴於大量、高品質且標籤一致的資料集，而這在策劃上具有挑戰性。此外，這些方法在分佈外資料上的表現往往不佳，這限制了它們的臨床效用。自監督學習 (SSL) 已成為一種有前途的替代方案，它利用未標籤資料來增強模型效能和泛化能力。我們提出了一種對比式 SSL 方法，專門針對 B 模式超音波影像，並納入了新穎的關係對比損失 (RCL)。RCL 透過一個可學習的指標區分正負樣本對，來鼓勵學習不同的特徵。此外，我們提出了用於超音波影像上表徵學習的空間和頻率增強策略。我們的做法在三個公開的乳房超音波資料集上顯著優於傳統的監督式分割方法，特別是在資料有限的情況下。在 Dice 相似性指標上的顯著改進包括在 BUSI 資料集的 20% 和 50% 上增加了 4%，在 BrEaST 資料集的 20% 和 50% 上增加了近 6% 和 9%，以及在 UDIAT 資料集的 20% 和 50% 上分別增加了 6.4% 和 3.7%。此外，我們在分佈外的 UDIAT 資料集上展示了卓越的泛化能力，與使用 BUSI 和 BrEaST 訓練資料的 20% 和 50% 的監督式基準相比，效能分別提升了 20.6% 和 13.6%。我們的研究強調，領域啟發的 SSL 可以改善超音波分割，特別是在資料有限的條件下。
+摘要：最近在人工智能中對於可解釋性產生了相當大的興趣，
+特別是對於黑盒機器學習模型。正如規劃社群正確觀察到的，當手邊的應用程式不是單次決策或預測，而是一連串依賴於觀察的動作時，一個更豐富的解釋概念是可取的。
+在本文中，我們著眼於提供「反事實解釋」的一個正式說明，以動作序列為基礎。然後我們展示這自然會導致一個模型調和說明，其形式可能是使用者修正代理人的模型，或建議代理人的計畫採取行動。為此，我們需要說明什麼是真實的，什麼是已知的，我們訴諸情境演算的一個模態片段來形式化這些直覺。我們考慮各種設定：代理人知道部分真實、虛弱真實和擁有錯誤信念，並展示我們的定義輕鬆地概括到這些不同的設定。
 
-##### **Medical Multimodal Model Stealing Attacks via Adversarial Domain Alignment**
-2502.02438v1 by Yaling Shen, Zhixiong Zhuang, Kun Yuan, Maria-Irina Nicolae, Nassir Navab, Nicolas Padoy, Mario Fritz
+##### **Logical Lease Litigation: Prolog and LLMs for Rental Law Compliance in New York**
+2502.09204v1 by Sanskar Sehgal, Yanhong A. Liu
 
-Medical multimodal large language models (MLLMs) are becoming an instrumental
-part of healthcare systems, assisting medical personnel with decision making
-and results analysis. Models for radiology report generation are able to
-interpret medical imagery, thus reducing the workload of radiologists. As
-medical data is scarce and protected by privacy regulations, medical MLLMs
-represent valuable intellectual property. However, these assets are potentially
-vulnerable to model stealing, where attackers aim to replicate their
-functionality via black-box access. So far, model stealing for the medical
-domain has focused on classification; however, existing attacks are not
-effective against MLLMs. In this paper, we introduce Adversarial Domain
-Alignment (ADA-STEAL), the first stealing attack against medical MLLMs.
-ADA-STEAL relies on natural images, which are public and widely available, as
-opposed to their medical counterparts. We show that data augmentation with
-adversarial noise is sufficient to overcome the data distribution gap between
-natural images and the domain-specific distribution of the victim MLLM.
-Experiments on the IU X-RAY and MIMIC-CXR radiology datasets demonstrate that
-Adversarial Domain Alignment enables attackers to steal the medical MLLM
-without any access to medical data.
+Legal cases require careful logical reasoning following the laws, whereas
+interactions with non- technical users must be in natural language. As an
+application combining logical reasoning using Prolog and natural language
+processing using large language models (LLMs), this paper presents a novel
+approach and system, LogicLease, to automate the analysis of landlord-tenant
+legal cases in the state of New York. LogicLease determines compliance with
+relevant legal requirements by analyzing case descriptions and citing all
+relevant laws. It leverages LLMs for information extraction and Prolog for
+legal reasoning. By separating information extraction from legal reasoning,
+LogicLease achieves greater transparency and control over the legal logic
+applied to each case. We evaluate the accuracy, efficiency, and robustness of
+LogicLease through a series of tests, achieving 100% accuracy and an average
+processing time of 2.57 seconds. LogicLease presents advantages over
+state-of-the-art LLM- based legal analysis systems by providing clear,
+step-by-step reasoning, citing specific laws, and distinguishing itself by its
+ability to avoid hallucinations - a common issue in LLMs.
 
-摘要：醫療多模態大型語言模型 (MLLM) 正在成為醫療保健系統中不可或缺的一部分，協助醫療人員進行決策和結果分析。放射報告生成的模型能夠解釋醫學影像，從而減輕放射科醫師的工作負擔。由於醫療資料稀少且受隱私法規保護，醫療 MLLM 代表了有價值的智慧財產。然而，這些資產潛在地容易受到模型竊取的攻擊，攻擊者旨在透過黑盒存取來複製其功能。到目前為止，針對醫療領域的模型竊取一直專注於分類；然而，現有的攻擊對 MLLM 沒有效。在本文中，我們介紹了對抗域對齊 (ADA-STEAL)，這是針對醫療 MLLM 的第一個竊取攻擊。與醫療對應物相反，ADA-STEAL 依賴於公開且廣泛可用的自然影像。我們表明，對抗雜訊的資料擴充足以克服自然影像與受害者 MLLM 的特定領域分佈之間的資料分佈差距。在 IU X-RAY 和 MIMIC-CXR 放射學資料集上進行的實驗表明，對抗域對齊使攻擊者能夠在不存取任何醫療資料的情況下竊取醫療 MLLM。
+摘要：法律案件需要遵循法律进行谨慎的逻辑推理，而与非技术用户的互动必须使用自然语言。作为结合使用 Prolog 进行逻辑推理和使用大型语言模型 (LLM) 进行自然语言处理的应用程序，本文提出了一种新颖的方法和系统 LogicLease，以自动分析纽约州的房东与租户法律案件。LogicLease 通过分析案例描述并引用所有相关法律来确定是否符合相关法律要求。它利用 LLM 进行信息提取，并利用 Prolog 进行法律推理。通过将信息提取与法律推理分开，LogicLease 实现了对应用于每个案例的法律逻辑的更高透明度和控制力。我们通过一系列测试评估了 LogicLease 的准确性、效率和鲁棒性，实现了 100% 的准确性和 2.57 秒的平均处理时间。LogicLease 通过提供清晰、分步的推理，引用具体法律，并以其避免幻觉的能力而区别于最先进的基于 LLM 的法律分析系统，从而显示出优势——这是 LLM 中的常见问题。
 
-##### **Test Time Training for 4D Medical Image Interpolation**
-2502.02341v1 by Qikang Zhang, Yingjie Lei, Zihao Zheng, Ziyang Chen, Zhonghao Xie
+##### **Thinking beyond the anthropomorphic paradigm benefits LLM research**
+2502.09192v1 by Lujain Ibrahim, Myra Cheng
 
-4D medical image interpolation is essential for improving temporal resolution
-and diagnostic precision in clinical applications. Previous works ignore the
-problem of distribution shifts, resulting in poor generalization under
-different distribution. A natural solution would be to adapt the model to a new
-test distribution, but this cannot be done if the test input comes without a
-ground truth label. In this paper, we propose a novel test time training
-framework which uses self-supervision to adapt the model to a new distribution
-without requiring any labels. Indeed, before performing frame interpolation on
-each test video, the model is trained on the same instance using a
-self-supervised task, such as rotation prediction or image reconstruction. We
-conduct experiments on two publicly available 4D medical image interpolation
-datasets, Cardiac and 4D-Lung. The experimental results show that the proposed
-method achieves significant performance across various evaluation metrics on
-both datasets. It achieves higher peak signal-to-noise ratio values, 33.73dB on
-Cardiac and 34.02dB on 4D-Lung. Our method not only advances 4D medical image
-interpolation but also provides a template for domain adaptation in other
-fields such as image segmentation and image registration.
+Anthropomorphism, or the attribution of human traits to technology, is an
+automatic and unconscious response that occurs even in those with advanced
+technical expertise. In this position paper, we analyze hundreds of thousands
+of computer science research articles from the past decade and present
+empirical evidence of the prevalence and growth of anthropomorphic terminology
+in research on large language models (LLMs). This terminology reflects deeper
+anthropomorphic conceptualizations which shape how we think about and conduct
+LLM research. We argue these conceptualizations may be limiting, and that
+challenging them opens up new pathways for understanding and improving LLMs
+beyond human analogies. To illustrate this, we identify and analyze five core
+anthropomorphic assumptions shaping prominent methodologies across the LLM
+development lifecycle, from the assumption that models must use natural
+language for reasoning tasks to the assumption that model capabilities should
+be evaluated through human-centric benchmarks. For each assumption, we
+demonstrate how non-anthropomorphic alternatives can open new directions for
+research and development.
 
-摘要：4D 醫學影像插值對於提升時間解析度及臨床應用中的診斷精準度至關重要。過往的研究忽略了分佈轉移問題，導致在不同分佈下泛化能力不佳。一個自然的解決方案是將模型適應到新的測試分佈，但如果測試輸入沒有真實標籤，就無法做到這一點。在本文中，我們提出了一個新的測試時間訓練架構，它使用自我監督來適應模型到一個新的分佈，而不需要任何標籤。事實上，在對每個測試影片執行幀插值之前，使用自我監督任務（例如旋轉預測或影像重建）在同一個實例上訓練模型。我們在兩個公開的 4D 醫學影像插值資料集（Cardiac 和 4D-Lung）上進行實驗。實驗結果表明，所提出的方法在兩個資料集上的各種評估指標中都取得了顯著的效能。它達到了更高的峰值信噪比值，在 Cardiac 上為 33.73dB，在 4D-Lung 上為 34.02dB。我們的技術不僅推動了 4D 醫學影像插值，還為其他領域（例如影像分割和影像配準）中的領域適應提供了一個範本。
+摘要：擬人化，或將人類特質歸因於科技，是一種自動且無意識的反應，即使是那些擁有進階技術專業知識的人也會發生。在本文中，我們分析了過去十年數十萬篇電腦科學研究文章，並提出實證證據證明擬人化術語在大型語言模型 (LLM) 研究中的普遍性和增長。這些術語反映了更深層的擬人化概念化，塑造了我們思考和進行 LLM 研究的方式。我們認為這些概念化可能是有限制的，並且挑戰它們為超越人類類比來理解和改進 LLM 開闢了新的途徑。為了說明這一點，我們識別並分析了五個核心擬人化假設，這些假設塑造了 LLM 開發生命週期中的顯著方法論，從模型必須使用自然語言進行推理任務的假設到模型能力應該通過以人為中心的基準進行評估的假設。對於每個假設，我們展示了非擬人化替代方案如何為研究和開發打開新方向。
 
-##### **Conversation AI Dialog for Medicare powered by Finetuning and Retrieval Augmented Generation**
-2502.02249v1 by Atharva Mangeshkumar Agrawal, Rutika Pandurang Shinde, Vasanth Kumar Bhukya, Ashmita Chakraborty, Sagar Bharat Shah, Tanmay Shukla, Sree Pradeep Kumar Relangi, Nilesh Mutyam
+##### **Matina: A Large-Scale 73B Token Persian Text Corpus**
+2502.09188v1 by Sara Bourbour Hosseinbeigi, Fatemeh Taherinezhad, Heshaam Faili, Hamed Baghbani, Fatemeh Nadi, Mostafa Amiri
 
-Large language models (LLMs) have shown impressive capabilities in natural
-language processing tasks, including dialogue generation. This research aims to
-conduct a novel comparative analysis of two prominent techniques, fine-tuning
-with LoRA (Low-Rank Adaptation) and the Retrieval-Augmented Generation (RAG)
-framework, in the context of doctor-patient chat conversations with multiple
-datasets of mixed medical domains. The analysis involves three state-of-the-art
-models: Llama-2, GPT, and the LSTM model. Employing real-world doctor-patient
-dialogues, we comprehensively evaluate the performance of models, assessing key
-metrics such as language quality (perplexity, BLEU score), factual accuracy
-(fact-checking against medical knowledge bases), adherence to medical
-guidelines, and overall human judgments (coherence, empathy, safety). The
-findings provide insights into the strengths and limitations of each approach,
-shedding light on their suitability for healthcare applications. Furthermore,
-the research investigates the robustness of the models in handling diverse
-patient queries, ranging from general health inquiries to specific medical
-conditions. The impact of domain-specific knowledge integration is also
-explored, highlighting the potential for enhancing LLM performance through
-targeted data augmentation and retrieval strategies.
+Text corpora are essential for training models used in tasks like
+summarization, translation, and large language models (LLMs). While various
+efforts have been made to collect monolingual and multilingual datasets in many
+languages, Persian has often been underrepresented due to limited resources for
+data collection and preprocessing. Existing Persian datasets are typically
+small and lack content diversity, consisting mainly of weblogs and news
+articles. This shortage of high-quality, varied data has slowed the development
+of NLP models and open-source LLMs for Persian. Since model performance depends
+heavily on the quality of training data, we address this gap by introducing the
+Matina corpus, a new Persian dataset of 72.9B tokens, carefully preprocessed
+and deduplicated to ensure high data quality. We further assess its
+effectiveness by training and evaluating transformer-based models on key NLP
+tasks. Both the dataset and preprocessing codes are publicly available,
+enabling researchers to build on and improve this resource for future Persian
+NLP advancements.
 
-摘要：大型語言模型 (LLM) 在自然語言處理任務中展現了令人印象深刻的能力，包括對話生成。本研究旨在對兩種著名的技術進行新穎的比較分析，即微調 LoRA (低秩適應) 和檢索增強生成 (RAG) 框架，在具有混合醫療領域的多個資料集的醫患聊天對話中。分析涉及三個最先進的模型：Llama-2、GPT 和 LSTM 模型。採用真實世界的醫患對話，我們全面評估模型的性能，評估語言品質（困惑度、BLEU 分數）、事實準確性（對照醫學知識庫進行事實查核）、遵守醫療指南以及整體人類判斷（連貫性、同理心、安全性）等關鍵指標。研究結果深入了解了每種方法的優點和限制，闡明了它們適用於醫療保健應用的適當性。此外，該研究調查了模型在處理多樣化患者查詢時的穩健性，範圍從一般健康詢問到特定醫療狀況。還探討了特定領域知識整合的影響，強調了通過有針對性的資料擴充和檢索策略來增強 LLM 性能的潛力。
+摘要：文字語料庫對於訓練用於摘要、翻譯和大型語言模型 (LLM) 等任務的模型至關重要。儘管已做出各種努力來收集許多語言中的單語和多語言資料集，但由於資料收集和預處理資源有限，波斯語常常代表性不足。現有的波斯語資料集通常很小，而且缺乏內容多樣性，主要由網誌和新聞文章組成。這種優質、多樣化資料的短缺減緩了波斯語的 NLP 模型和開源 LLM 的開發。由於模型效能高度依賴訓練資料的品質，我們透過推出 Matina 語料庫來解決這個差距，Matina 語料庫是一個新的波斯語資料集，包含 72.9B 個字元，經過仔細預處理和去重，以確保資料品質。我們進一步透過在關鍵 NLP 任務上訓練和評估基於轉換器的模型來評估其有效性。資料集和預處理程式碼都是公開的，使研究人員能夠建立和改善這個資源，以促進未來的波斯語 NLP 進展。
 
-##### **Deep Learning-Based Facial Expression Recognition for the Elderly: A Systematic Review**
-2502.02618v1 by F. Xavier Gaya-Morey, Jose M. Buades-Rubio, Philippe Palanque, Raquel Lacuesta, Cristina Manresa-Yee
+##### **RefineCoder: Iterative Improving of Large Language Models via Adaptive Critique Refinement for Code Generation**
+2502.09183v1 by Changzhi Zhou, Xinyu Zhang, Dandan Song, Xiancai Chen, Wanli Gu, Huipeng Ma, Yuhang Tian, Mengdi Zhang, Linmei Hu
 
-The rapid aging of the global population has highlighted the need for
-technologies to support elderly, particularly in healthcare and emotional
-well-being. Facial expression recognition (FER) systems offer a non-invasive
-means of monitoring emotional states, with applications in assisted living,
-mental health support, and personalized care. This study presents a systematic
-review of deep learning-based FER systems, focusing on their applications for
-the elderly population. Following a rigorous methodology, we analyzed 31
-studies published over the last decade, addressing challenges such as the
-scarcity of elderly-specific datasets, class imbalances, and the impact of
-age-related facial expression differences. Our findings show that convolutional
-neural networks remain dominant in FER, and especially lightweight versions for
-resource-constrained environments. However, existing datasets often lack
-diversity in age representation, and real-world deployment remains limited.
-Additionally, privacy concerns and the need for explainable artificial
-intelligence emerged as key barriers to adoption. This review underscores the
-importance of developing age-inclusive datasets, integrating multimodal
-solutions, and adopting XAI techniques to enhance system usability,
-reliability, and trustworthiness. We conclude by offering recommendations for
-future research to bridge the gap between academic progress and real-world
-implementation in elderly care.
+Code generation has attracted increasing attention with the rise of Large
+Language Models (LLMs). Many studies have developed powerful code LLMs by
+synthesizing code-related instruction data and applying supervised fine-tuning.
+However, these methods are limited by teacher model distillation and ignore the
+potential of iterative refinement by self-generated code. In this paper, we
+propose Adaptive Critique Refinement (ACR), which enables the model to refine
+itself by self-generated code and external critique, rather than directly
+imitating the code responses of the teacher model. Concretely, ACR includes a
+composite scoring system with LLM-as-a-Judge to evaluate the quality of code
+responses and a selective critique strategy with LLM-as-a-Critic to critique
+self-generated low-quality code responses. We develop the RefineCoder series by
+iteratively applying ACR, achieving continuous performance improvement on
+multiple code generation benchmarks. Compared to the baselines of the same
+size, our proposed RefineCoder series can achieve comparable or even superior
+performance using less data.
 
-摘要：全球人口快速老龄化突显了对技术的需求，以支持老年人，尤其是在医疗保健和情绪健康方面。面部表情识别 (FER) 系统提供了一种非侵入性的情绪状态监测手段，在辅助生活、心理健康支持和个性化护理中得到应用。本研究对基于深度学习的 FER 系统进行了系统的回顾，重点关注它们在老年人群中的应用。遵循严格的方法，我们分析了在过去十年中发表的 31 项研究，解决了诸如老年人特定数据集的稀缺性、类别不平衡以及与年龄相关的面部表情差异的影响等挑战。我们的研究结果表明，卷积神经网络在 FER 中仍然占主导地位，特别是针对资源受限环境的轻量级版本。然而，现有数据集往往缺乏年龄代表性的多样性，并且现实世界的部署仍然有限。此外，隐私问题和对可解释人工智能的需求已成为采用过程中的主要障碍。本次审查强调了开发包容年龄的数据集、整合多模式解决方案以及采用 XAI 技术以增强系统可用性、可靠性和可信度的重要性。最后，我们提出了未来研究的建议，以弥合学术进展与老年护理中的现实世界实施之间的差距。
+摘要：隨著大型語言模型 (LLM) 的興起，程式碼生成備受關注。許多研究透過綜合與程式碼相關的指令資料並應用監督式微調來開發強大的程式碼 LLM。然而，這些方法受到教師模型蒸餾的限制，且忽略了透過自行產生的程式碼進行反覆改進的潛力。在本文中，我們提出適應性批判改進 (ACR)，它使模型能夠透過自行產生的程式碼和外部批判來改進自身，而不是直接模仿教師模型的程式碼回應。具體來說，ACR 包含一個複合評分系統，其中 LLM 作為評審員來評估程式碼回應的品質，以及一個選擇性批判策略，其中 LLM 作為批判者來批判自行產生的低品質程式碼回應。我們透過反覆套用 ACR 來開發 RefineCoder 系列，在多個程式碼生成基準上實現持續的效能改善。與相同規模的基準相比，我們提出的 RefineCoder 系列可以使用較少資料來實現相當甚至更優異的效能。
 
-##### **Causally-informed Deep Learning towards Explainable and Generalizable Outcomes Prediction in Critical Care**
-2502.02109v1 by Yuxiao Cheng, Xinxin Song, Ziqian Wang, Qin Zhong, Kunlun He, Jinli Suo
+##### **FLAME: Flexible LLM-Assisted Moderation Engine**
+2502.09175v1 by Ivan Bakulin, Ilia Kopanichuk, Iaroslav Bespalov, Nikita Radchenko, Vladimir Shaposhnikov, Dmitry Dylov, Ivan Oseledets
 
-Recent advances in deep learning (DL) have prompted the development of
-high-performing early warning score (EWS) systems, predicting clinical
-deteriorations such as acute kidney injury, acute myocardial infarction, or
-circulatory failure. DL models have proven to be powerful tools for various
-tasks but come with the cost of lacking interpretability and limited
-generalizability, hindering their clinical applications. To develop a practical
-EWS system applicable to various outcomes, we propose causally-informed
-explainable early prediction model, which leverages causal discovery to
-identify the underlying causal relationships of prediction and thus owns two
-unique advantages: demonstrating the explicit interpretation of the prediction
-while exhibiting decent performance when applied to unfamiliar environments.
-Benefiting from these features, our approach achieves superior accuracy for 6
-different critical deteriorations and achieves better generalizability across
-different patient groups, compared to various baseline algorithms. Besides, we
-provide explicit causal pathways to serve as references for assistant clinical
-diagnosis and potential interventions. The proposed approach enhances the
-practical application of deep learning in various medical scenarios.
+The rapid advancement of Large Language Models (LLMs) has introduced
+significant challenges in moderating user-model interactions. While LLMs
+demonstrate remarkable capabilities, they remain vulnerable to adversarial
+attacks, particularly ``jailbreaking'' techniques that bypass content safety
+measures. Current content moderation systems, which primarily rely on input
+prompt filtering, have proven insufficient, with techniques like Best-of-N
+(BoN) jailbreaking achieving success rates of 80% or more against popular LLMs.
+In this paper, we introduce Flexible LLM-Assisted Moderation Engine (FLAME): a
+new approach that shifts the focus from input filtering to output moderation.
+Unlike traditional circuit-breaking methods that analyze user queries, FLAME
+evaluates model responses, offering several key advantages: (1) computational
+efficiency in both training and inference, (2) enhanced resistance to BoN
+jailbreaking attacks, and (3) flexibility in defining and updating safety
+criteria through customizable topic filtering. Our experiments demonstrate that
+FLAME significantly outperforms current moderation systems. For example, FLAME
+reduces attack success rate in GPT-4o-mini and DeepSeek-v3 by a factor of ~9,
+while maintaining low computational overhead. We provide comprehensive
+evaluation on various LLMs and analyze the engine's efficiency against the
+state-of-the-art jailbreaking. This work contributes to the development of more
+robust and adaptable content moderation systems for LLMs.
 
-摘要：深度學習 (DL) 的最新進展促使開發出高性能早期預警評分 (EWS) 系統，預測急性腎臟損傷、急性心肌梗塞或循環衰竭等臨床惡化。DL 模型已被證明是各種任務的強大工具，但代價是缺乏可解釋性和有限的概括性，阻礙了其臨床應用。為了開發適用於各種結果的實用 EWS 系統，我們提出了因果關係解釋性早期預測模型，它利用因果發現來識別預測的潛在因果關係，從而擁有兩個獨特的優點：展示預測的明確解釋，同時在應用於不熟悉的環境時表現出良好的性能。得益於這些特性，與各種基線演算法相比，我們的模型在 6 種不同的危重惡化中實現了更高的準確度，並在不同的患者群體中實現了更好的概括性。此外，我們提供了明確的因果途徑，作為輔助臨床診斷和潛在干預措施的參考。所提出的方法增強了深度學習在各種醫療場景中的實際應用。
+摘要：大型語言模型 (LLM) 的快速進步為調節使用者與模型互動帶來重大挑戰。儘管 LLM 展現出非凡的能力，但它們仍然容易受到對抗性攻擊，特別是繞過內容安全措施的「越獄」技術。目前的內容審核系統主要依賴輸入提示過濾，已被證明不足，例如 Best-of-N (BoN) 越獄對抗熱門 LLM 的成功率達到 80% 以上。在本文中，我們介紹了靈活的 LLM 輔助審核引擎 (FLAME)：一種新的方法，將重點從輸入過濾轉移到輸出審核。與分析使用者查詢的傳統電路中斷方法不同，FLAME 評估模型回應，提供幾個關鍵優勢：(1) 訓練和推理中的計算效率，(2) 增強對 BoN 越獄攻擊的抵抗力，以及 (3) 透過可自訂主題過濾定義和更新安全標準的靈活性。我們的實驗證明，FLAME 明顯優於目前的審核系統。例如，FLAME 將 GPT-4o-mini 和 DeepSeek-v3 的攻擊成功率降低了約 9 倍，同時保持較低的計算負擔。我們對各種 LLM 進行了全面的評估，並分析了引擎對抗最新越獄的效率。這項工作有助於開發更強大且適應性更強的 LLM 內容審核系統。
 
-##### **JingFang: A Traditional Chinese Medicine Large Language Model of Expert-Level Medical Diagnosis and Syndrome Differentiation-Based Treatment**
-2502.04345v1 by Yehan Yan, Tianhao Ma, Ruotai Li, Xinhan Zheng, Guodong Shan, Chisheng Li
+##### **Two-Stage Representation Learning for Analyzing Movement Behavior Dynamics in People Living with Dementia**
+2502.09173v1 by Jin Cui, Alexander Capstick, Payam Barnaghi, Gregory Scott
 
-Traditional Chinese medicine (TCM) plays a vital role in health protection
-and disease treatment, but its practical application requires extensive medical
-knowledge and clinical experience. Existing TCM Large Language Models (LLMs)
-exhibit critical limitations of uncomprehensive medical consultation and
-diagnoses, and inaccurate syndrome differentiation-based treatment. To address
-these issues, this study establishes JingFang (JF): a novel TCM Large Language
-Model that demonstrates the expert-level capability of medical diagnosis and
-syndrome differentiation-based treatment. We innovate a Multi-agent Dynamic
-Collaborative Chain-of-Thought Mechanism (MDCCTM) for medical consultation,
-enabling JF with effective and accurate diagnostic ability. In addition, a
-Syndrome Agent and a Dual-Stage Retrieval Scheme (DSRS) are developed to
-significantly enhance the capacity of JF for disease treatment based on
-syndrome differentiation. JingFang not only facilitates the application of LLMs
-but also promotes the effective practice of TCM in human health protection and
-disease treatment.
+In remote healthcare monitoring, time series representation learning reveals
+critical patient behavior patterns from high-frequency data. This study
+analyzes home activity data from individuals living with dementia by proposing
+a two-stage, self-supervised learning approach tailored to uncover low-rank
+structures. The first stage converts time-series activities into text sequences
+encoded by a pre-trained language model, providing a rich, high-dimensional
+latent state space using a PageRank-based method. This PageRank vector captures
+latent state transitions, effectively compressing complex behaviour data into a
+succinct form that enhances interpretability. This low-rank representation not
+only enhances model interpretability but also facilitates clustering and
+transition analysis, revealing key behavioral patterns correlated with
+clinicalmetrics such as MMSE and ADAS-COG scores. Our findings demonstrate the
+framework's potential in supporting cognitive status prediction, personalized
+care interventions, and large-scale health monitoring.
 
-摘要：中醫藥在保健與疾病治療中扮演著重要的角色，但其實務應用需要深厚的醫學知識與臨床經驗。現有的中醫大語言模型（LLM）存在著醫療諮詢與診斷不全面、症候分型治療不準確的重大限制。為了解決這些問題，本研究建立了精方（JF）：一個新穎的中醫大語言模型，展示了專家級的醫療診斷與症候分型治療能力。我們創新了一個多智能體動態協作思考鏈機制（MDCCTM）用於醫療諮詢，讓 JF 具備有效且準確的診斷能力。此外，還開發了一個症候智能體和一個雙階段檢索方案（DSRS），以顯著增強 JF 基於症候分型的疾病治療能力。精方不僅促進了 LLM 的應用，也推動了中醫藥在人類保健與疾病治療中的有效實踐。
+摘要：在遠程醫療監控中，時間序列表示學習揭示了高頻率數據中的關鍵患者行為模式。本研究通過提出一個兩階段、自我監督的學習方法來分析痴呆症患者的家庭活動數據，該方法專門用於發現低秩結構。第一階段將時間序列活動轉換為由預訓練語言模型編碼的文本序列，使用基於 PageRank 的方法提供了一個豐富、高維的潛在狀態空間。此 PageRank 向量捕獲潛在狀態轉換，有效地將複雜的行為數據壓縮成簡潔的形式，從而增強了解力。此低秩表示不僅增強了模型的可解釋性，還促進了聚類和轉換分析，揭示了與臨床指標（例如 MMSE 和 ADAS-COG 分數）相關的關鍵行為模式。我們的研究結果證明了該框架在支持認知狀態預測、個性化護理干預和大型健康監控方面的潛力。
 
-##### **An Agentic AI Workflow for Detecting Cognitive Concerns in Real-world Data**
-2502.01789v1 by Jiazi Tian, Liqin Wang, Pedram Fard, Valdery Moura Junior, Deborah Blacker, Jennifer S. Haas, Chirag Patel, Shawn N. Murphy, Lidia M. V. R. Moura, Hossein Estiri
+##### **Musical Heritage Historical Entity Linking**
+2502.09168v1 by Arianna Graciotti, Nicolas Lazzari, Valentina Presutti, Rocco Tripodi
 
-Early identification of cognitive concerns is critical but often hindered by
-subtle symptom presentation. This study developed and validated a fully
-automated, multi-agent AI workflow using LLaMA 3 8B to identify cognitive
-concerns in 3,338 clinical notes from Mass General Brigham. The agentic
-workflow, leveraging task-specific agents that dynamically collaborate to
-extract meaningful insights from clinical notes, was compared to an
-expert-driven benchmark. Both workflows achieved high classification
-performance, with F1-scores of 0.90 and 0.91, respectively. The agentic
-workflow demonstrated improved specificity (1.00) and achieved prompt
-refinement in fewer iterations. Although both workflows showed reduced
-performance on validation data, the agentic workflow maintained perfect
-specificity. These findings highlight the potential of fully automated
-multi-agent AI workflows to achieve expert-level accuracy with greater
-efficiency, offering a scalable and cost-effective solution for detecting
-cognitive concerns in clinical settings.
+Linking named entities occurring in text to their corresponding entity in a
+Knowledge Base (KB) is challenging, especially when dealing with historical
+texts. In this work, we introduce Musical Heritage named Entities Recognition,
+Classification and Linking (MHERCL), a novel benchmark consisting of manually
+annotated sentences extrapolated from historical periodicals of the music
+domain. MHERCL contains named entities under-represented or absent in the most
+famous KBs. We experiment with several State-of-the-Art models on the Entity
+Linking (EL) task and show that MHERCL is a challenging dataset for all of
+them. We propose a novel unsupervised EL model and a method to extend
+supervised entity linkers by using Knowledge Graphs (KGs) to tackle the main
+difficulties posed by historical documents. Our experiments reveal that relying
+on unsupervised techniques and improving models with logical constraints based
+on KGs and heuristics to predict NIL entities (entities not represented in the
+KB of reference) results in better EL performance on historical documents.
 
-摘要：及早辨識認知問題至關重要，但常常受到症狀呈現過於細微的阻礙。本研究開發並驗證了一個全自動化、多重代理的 AI 工作流程，使用 LLaMA 3 8B 來辨識來自麻省總醫院布萊根分院的 3,338 則臨床筆記中的認知問題。這個代理工作流程利用了特定任務的代理，這些代理會動態合作從臨床筆記中萃取出有意義的見解，並與專家驅動的基準進行比較。這兩個工作流程都達到了很高的分類效能，F1 分數分別為 0.90 和 0.91。代理工作流程展現出更好的特異性（1.00），並且在更少的反覆運算中達到了提示精煉。儘管這兩個工作流程在驗證資料上的效能都降低了，但代理工作流程維持了完美的特異性。這些發現突顯了全自動化多重代理 AI 工作流程的潛力，它們能以更高的效率達到專家級的準確度，為在臨床環境中偵測認知問題提供了一個可擴充且具成本效益的解決方案。
+摘要：將文本中出現的名稱實體連結到知識庫 (KB) 中對應的實體具有挑戰性，尤其是在處理歷史文本時。在這項工作中，我們引入了音樂遺產命名實體識別、分類和連結 (MHERCL)，這是一個由從音樂領域的歷史期刊中外推的手動標註句子組成的全新基準。MHERCL 包含在最著名的 KB 中代表性不足或不存在的名稱實體。我們在實體連結 (EL) 任務中對多個最先進的模型進行了實驗，並表明 MHERCL 對所有模型來說都是一個具有挑戰性的資料集。我們提出了一個新的無監督 EL 模型和一個通過使用知識圖 (KG) 來擴充監督式實體連結器的的方法，以解決歷史文件提出的主要難題。我們的實驗表明，依賴無監督技術並使用基於 KG 和啟發法的邏輯約束來改善模型以預測 NIL 實體（未在參考 KB 中表示的實體）會在歷史文件中產生更好的 EL 效能。
 
-##### **Can Domain Experts Rely on AI Appropriately? A Case Study on AI-Assisted Prostate Cancer MRI Diagnosis**
-2502.03482v1 by Chacha Chen, Han Liu, Jiamin Yang, Benjamin M. Mervak, Bora Kalaycioglu, Grace Lee, Emre Cakmakli, Matteo Bonatti, Sridhar Pudu, Osman Kahraman, Gul Gizem Pamuk, Aytekin Oto, Aritrick Chatterjee, Chenhao Tan
+##### **Improving TCM Question Answering through Tree-Organized Self-Reflective Retrieval with LLMs**
+2502.09156v1 by Chang Liu, Ying Chang, Jianmin Li, Yiqian Qu, Yu Li, Lingyong Cao, Shuyuan Lin
 
-Despite the growing interest in human-AI decision making, experimental
-studies with domain experts remain rare, largely due to the complexity of
-working with domain experts and the challenges in setting up realistic
-experiments. In this work, we conduct an in-depth collaboration with
-radiologists in prostate cancer diagnosis based on MRI images. Building on
-existing tools for teaching prostate cancer diagnosis, we develop an interface
-and conduct two experiments to study how AI assistance and performance feedback
-shape the decision making of domain experts. In Study 1, clinicians were asked
-to provide an initial diagnosis (human), then view the AI's prediction, and
-subsequently finalize their decision (human-AI team). In Study 2 (after a
-memory wash-out period), the same participants first received aggregated
-performance statistics from Study 1, specifically their own performance, the
-AI's performance, and their human-AI team performance, and then directly viewed
-the AI's prediction before making their diagnosis (i.e., no independent initial
-diagnosis). These two workflows represent realistic ways that clinical AI tools
-might be used in practice, where the second study simulates a scenario where
-doctors can adjust their reliance and trust on AI based on prior performance
-feedback. Our findings show that, while human-AI teams consistently outperform
-humans alone, they still underperform the AI due to under-reliance, similar to
-prior studies with crowdworkers. Providing clinicians with performance feedback
-did not significantly improve the performance of human-AI teams, although
-showing AI decisions in advance nudges people to follow AI more. Meanwhile, we
-observe that the ensemble of human-AI teams can outperform AI alone, suggesting
-promising directions for human-AI collaboration.
+Objectives: Large language models (LLMs) can harness medical knowledge for
+intelligent question answering (Q&A), promising support for auxiliary diagnosis
+and medical talent cultivation. However, there is a deficiency of highly
+efficient retrieval-augmented generation (RAG) frameworks within the domain of
+Traditional Chinese Medicine (TCM). Our purpose is to observe the effect of the
+Tree-Organized Self-Reflective Retrieval (TOSRR) framework on LLMs in TCM Q&A
+tasks.
+  Materials and Methods: We introduce the novel approach of knowledge
+organization, constructing a tree structure knowledge base with hierarchy. At
+inference time, our self-reflection framework retrieves from this knowledge
+base, integrating information across chapters. Questions from the TCM Medical
+Licensing Examination (MLE) and the college Classics Course Exam (CCE) were
+randomly selected as benchmark datasets.
+  Results: By coupling with GPT-4, the framework can improve the best
+performance on the TCM MLE benchmark by 19.85% in absolute accuracy, and
+improve recall accuracy from 27% to 38% on CCE datasets. In manual evaluation,
+the framework improves a total of 18.52 points across dimensions of safety,
+consistency, explainability, compliance, and coherence.
+  Conclusion: The TOSRR framework can effectively improve LLM's capability in
+Q&A tasks of TCM.
 
-摘要：儘管人們對人類與 AI 決策制定越來越感興趣，但與領域專家合作的實驗研究仍然很少見，這在很大程度上是因為與領域專家合作的複雜性，以及在設定實際實驗時面臨的挑戰。在這項工作中，我們與放射科醫師進行深入合作，基於 MRI 影像診斷前列腺癌。建立在用於教授前列腺癌診斷的現有工具上，我們開發了一個介面並進行了兩項實驗，以研究 AI 協助和效能回饋如何塑造領域專家的決策制定。在研究 1 中，要求臨床醫師提供初步診斷（人類），然後檢視 AI 的預測，並隨後確定他們的決策（人類-AI 團隊）。在研究 2（經過一段記憶清除期）中，同一位參與者首先收到研究 1 的彙總效能統計資料，特別是他們自己的效能、AI 的效能，以及他們的人類-AI 團隊效能，然後在做出診斷前直接檢視 AI 的預測（即，沒有獨立的初步診斷）。這兩個工作流程代表了臨床 AI 工具在實務中可能被使用的方式，其中第二個研究模擬了醫生可以根據先前的效能回饋調整他們對 AI 的依賴和信任的情況。我們的研究結果顯示，儘管人類-AI 團隊始終優於單獨的人類，但由於依賴不足，他們仍然表現不如 AI，這與之前針對群眾工作者的研究類似。儘管事先顯示 AI 決策會促使人們更多地遵循 AI，但向臨床醫師提供效能回饋並未顯著改善人類-AI 團隊的效能。同時，我們觀察到人類-AI 團隊的集合可以優於單獨的 AI，這表明了人類-AI 合作的前景。
+摘要：目標：大型語言模型（LLM）可以利用醫療知識進行智能問答（Q&A），承諾支持輔助診斷和醫療人才培養。然而，在中醫領域內缺乏高效的檢索增強生成（RAG）框架。我們的目的是觀察樹組織自省檢索（TOSRR）框架對中醫問答任務中 LLM 的影響。
+材料和方法：我們引入了知識組織的新方法，構建了一個具有層次的樹結構知識庫。在推理時間，我們的自省框架從這個知識庫中檢索，整合章節中的信息。中醫醫師資格考試（MLE）和大學經典課程考試（CCE）中的問題被隨機選為基準數據集。
+結果：通過與 GPT-4 結合，該框架可以將中醫 MLE 基準上的最佳性能提高 19.85% 的絕對準確度，並將 CCE 數據集上的召回準確度從 27% 提高到 38%。在手動評估中，該框架在安全性、一致性、可解釋性、合規性和連貫性方面總共提高了 18.52 分。
+結論：TOSRR 框架可以有效提升 LLM 在中醫問答任務中的能力。
 
-##### **Improving Transformer World Models for Data-Efficient RL**
-2502.01591v1 by Antoine Dedieu, Joseph Ortiz, Xinghua Lou, Carter Wendelken, Wolfgang Lehrach, J Swaroop Guntupalli, Miguel Lazaro-Gredilla, Kevin Patrick Murphy
+##### **A Novel Dialect-Aware Framework for the Classification of Arabic Dialects and Emotions**
+2502.09128v1 by Nasser A Alsadhan
 
-We present an approach to model-based RL that achieves a new state of the art
-performance on the challenging Craftax-classic benchmark, an open-world 2D
-survival game that requires agents to exhibit a wide range of general abilities
--- such as strong generalization, deep exploration, and long-term reasoning.
-With a series of careful design choices aimed at improving sample efficiency,
-our MBRL algorithm achieves a reward of 67.4% after only 1M environment steps,
-significantly outperforming DreamerV3, which achieves 53.2%, and, for the first
-time, exceeds human performance of 65.0%. Our method starts by constructing a
-SOTA model-free baseline, using a novel policy architecture that combines CNNs
-and RNNs. We then add three improvements to the standard MBRL setup: (a) "Dyna
-with warmup", which trains the policy on real and imaginary data, (b) "nearest
-neighbor tokenizer" on image patches, which improves the scheme to create the
-transformer world model (TWM) inputs, and (c) "block teacher forcing", which
-allows the TWM to reason jointly about the future tokens of the next timestep.
+Arabic is one of the oldest languages still in use today. As a result,
+several Arabic-speaking regions have developed dialects that are unique to
+them. Dialect and emotion recognition have various uses in Arabic text
+analysis, such as determining an online customer's origin based on their
+comments. Furthermore, intelligent chatbots that are aware of a user's emotions
+can respond appropriately to the user. Current research in emotion detection in
+the Arabic language lacks awareness of how emotions are exhibited in different
+dialects, which motivates the work found in this study. This research addresses
+the problems of dialect and emotion classification in Arabic. Specifically,
+this is achieved by building a novel framework that can identify and predict
+Arabic dialects and emotions from a given text. The framework consists of three
+modules: A text-preprocessing module, a classification module, and a clustering
+module with the novel capability of building new dialect-aware emotion
+lexicons. The proposed framework generated a new emotional lexicon for
+different dialects. It achieved an accuracy of 88.9% in classifying Arabic
+dialects, which outperforms the state-of-the-art results by 6.45 percentage
+points. Furthermore, the framework achieved 89.1-79% accuracy in detecting
+emotions in the Egyptian and Gulf dialects, respectively.
 
-摘要：我們提出了一個基於模型的 RL 方法，在具有挑戰性的 Craftax-classic 基準上實現了新的技術水準，這是一個開放世界的 2D 生存遊戲，要求代理人展現廣泛的一般能力，例如強大的概括能力、深入探索和長期推理。通過一系列旨在提高樣本效率的仔細設計選擇，我們的 MBRL 演算法在僅 1M 環境步驟後就實現了 67.4% 的獎勵，顯著優於 DreamerV3（實現 53.2%），並且首次超過了人類的 65.0% 的表現。我們的演算法首先通過使用結合 CNN 和 RNN 的新穎策略架構來建構一個 SOTA 無模型基線。然後，我們對標準 MBRL 設定新增了三項改進：(a)「帶熱身的 Dyna」，它在真實和假想資料上訓練策略，(b) 影像貼片的「最近鄰代碼化器」，它改進了建立轉換器世界模型 (TWM) 輸入的方案，以及 (c)「區塊教師強制」，它允許 TWM 共同推理下一個時間步長的未來代碼。
+摘要：阿拉伯語是現今仍在使用中最古老的語言之一。因此，幾個講阿拉伯語的地區發展出獨特的方言。方言和情緒辨識在阿拉伯語文本分析中有多種用途，例如根據在線客戶的評論來確定其來源。此外，知道使用者情緒的智慧聊天機器人可以適當地回應使用者。目前對阿拉伯語情緒偵測的研究缺乏對不同方言如何表現情緒的認識，這激勵了本研究中的工作。本研究探討了阿拉伯語中的方言和情緒分類問題。具體而言，這是通過建立一個新的框架來實現的，該框架可以識別和預測給定文本中的阿拉伯方言和情緒。該框架包含三個模組：文字預處理模組、分類模組和聚類模組，具有建立新的方言感知情緒詞彙表的新功能。所提出的框架為不同的方言生成了新的情緒詞彙表。它在分類阿拉伯方言方面達到了 88.9% 的準確率，比最先進的結果高出 6.45 個百分點。此外，該框架在檢測埃及和海灣方言的情緒方面分別達到了 89.1-79% 的準確率。
 
-##### **Data-Efficient Model for Psychological Resilience Prediction based on Neurological Data**
-2502.01377v1 by Zhi Zhang, Yan Liu, Mengxia Gao, Yu Yang, Jiannong Cao, Wai Kai Hou, Shirley Li, Sonata Yau, Yun Kwok Wing, Tatia M. C. Lee
+##### **Automatic Pruning via Structured Lasso with Class-wise Information**
+2502.09125v1 by Xiang Liu, Mingchen Li, Xia Li, Leigang Qu, Zifan Peng, Yijun Song, Zemin Liu, Linshan Jiang, Jialin Li
 
-Psychological resilience, defined as the ability to rebound from adversity,
-is crucial for mental health. Compared with traditional resilience assessments
-through self-reported questionnaires, resilience assessments based on
-neurological data offer more objective results with biological markers, hence
-significantly enhancing credibility. This paper proposes a novel data-efficient
-model to address the scarcity of neurological data. We employ Neuro
-Kolmogorov-Arnold Networks as the structure of the prediction model. In the
-training stage, a new trait-informed multimodal representation algorithm with a
-smart chunk technique is proposed to learn the shared latent space with limited
-data. In the test stage, a new noise-informed inference algorithm is proposed
-to address the low signal-to-noise ratio of the neurological data. The proposed
-model not only shows impressive performance on both public datasets and
-self-constructed datasets but also provides some valuable psychological
-hypotheses for future research.
+Most pruning methods concentrate on unimportant filters of neural networks.
+However, they face the loss of statistical information due to a lack of
+consideration for class-wise data. In this paper, from the perspective of
+leveraging precise class-wise information for model pruning, we utilize
+structured lasso with guidance from Information Bottleneck theory. Our approach
+ensures that statistical information is retained during the pruning process.
+With these techniques, we introduce two innovative adaptive network pruning
+schemes: sparse graph-structured lasso pruning with Information Bottleneck
+(\textbf{sGLP-IB}) and sparse tree-guided lasso pruning with Information
+Bottleneck (\textbf{sTLP-IB}). The key aspect is pruning model filters using
+sGLP-IB and sTLP-IB to better capture class-wise relatedness. Compared to
+multiple state-of-the-art methods, our approaches demonstrate superior
+performance across three datasets and six model architectures in extensive
+experiments. For instance, using the VGG16 model on the CIFAR-10 dataset, we
+achieve a parameter reduction of 85%, a decrease in FLOPs by 61%, and maintain
+an accuracy of 94.10% (0.14% higher than the original model); we reduce the
+parameters by 55% with the accuracy at 76.12% using the ResNet architecture on
+ImageNet (only drops 0.03%). In summary, we successfully reduce model size and
+computational resource usage while maintaining accuracy. Our codes are at
+https://anonymous.4open.science/r/IJCAI-8104.
 
-摘要：心理韌性，定義為從逆境中反彈的能力，對心理健康至關重要。與通過自我報告問卷的傳統韌性評估相比，基於神經數據的韌性評估提供了更客觀的結果和生物標記，從而顯著提高了可信度。本文提出了一個新穎的數據高效模型來解決神經數據的稀缺性。我們採用神經科爾莫哥羅夫-阿諾德網路作為預測模型的結構。在訓練階段，提出了一種新的特徵信息多模態表示算法，採用智能塊技術，以有限的數據學習共享潛在空間。在測試階段，提出了一種新的噪聲信息推理算法，以解決神經數據的信噪比低的問題。所提出的模型不僅在公共數據集和自構數據集上都顯示出令人印象深刻的性能，還為未來的研究提供了一些有價值的心理假設。
+摘要：大多數剪枝方法都集中在神經網路中不重要的濾波器上。
+然而，由於缺乏對類別資料的考量，它們面臨統計資訊的遺失。在本文中，我們從利用精確類別資訊進行模型剪枝的角度，利用結構化套索搭配資訊瓶頸理論的指導。我們的做法確保在剪枝過程中保留統計資訊。藉由這些技術，我們引入了兩個創新的自適應網路剪枝方案：帶有資訊瓶頸的稀疏圖形結構套索剪枝（sGLP-IB）和帶有資訊瓶頸的稀疏樹導引套索剪枝（sTLP-IB）。關鍵方面是使用 sGLP-IB 和 sTLP-IB 剪枝模型濾波器，以更好地擷取類別關聯性。與多種最先進的方法相比，我們的做法在廣泛的實驗中展現出跨三個資料集和六個模型架構的卓越效能。例如，在 CIFAR-10 資料集上使用 VGG16 模型，我們達到了 85% 的參數減少、61% 的 FLOP 減少，並維持 94.10% 的準確度（比原始模型高 0.14%）；我們在 ImageNet 上使用 ResNet 架構將參數減少了 55%，準確度為 76.12%（僅下降 0.03%）。總之，我們成功地減少了模型大小和計算資源使用，同時維持準確度。我們的程式碼位於 https://anonymous.4open.science/r/IJCAI-8104。
 
-##### **OphthBench: A Comprehensive Benchmark for Evaluating Large Language Models in Chinese Ophthalmology**
-2502.01243v1 by Chengfeng Zhou, Ji Wang, Juanjuan Qin, Yining Wang, Ling Sun, Weiwei Dai
+##### **The influence of visual and linguistic cues on ignorance inference in Vision-Language Models (VLMs)**
+2502.09120v1 by Ye-eun Cho, Yunho Maeng
 
-Large language models (LLMs) have shown significant promise across various
-medical applications, with ophthalmology being a notable area of focus. Many
-ophthalmic tasks have shown substantial improvement through the integration of
-LLMs. However, before these models can be widely adopted in clinical practice,
-evaluating their capabilities and identifying their limitations is crucial. To
-address this research gap and support the real-world application of LLMs, we
-introduce the OphthBench, a specialized benchmark designed to assess LLM
-performance within the context of Chinese ophthalmic practices. This benchmark
-systematically divides a typical ophthalmic clinical workflow into five key
-scenarios: Education, Triage, Diagnosis, Treatment, and Prognosis. For each
-scenario, we developed multiple tasks featuring diverse question types,
-resulting in a comprehensive benchmark comprising 9 tasks and 591 questions.
-This comprehensive framework allows for a thorough assessment of LLMs'
-capabilities and provides insights into their practical application in Chinese
-ophthalmology. Using this benchmark, we conducted extensive experiments and
-analyzed the results from 39 popular LLMs. Our evaluation highlights the
-current gap between LLM development and its practical utility in clinical
-settings, providing a clear direction for future advancements. By bridging this
-gap, we aim to unlock the potential of LLMs and advance their development in
-ophthalmology.
+This study explored how Vision-Language Models (VLMs) process ignorance
+implicatures with visual and linguistic cues. Particularly, we focused on the
+effects of contexts (precise and approximate contexts) and modifier types (bare
+numerals, superlative, and comparative modifiers), which were considered
+pragmatic and semantic factors respectively. Methodologically, we conducted a
+truth-value judgment task in visually grounded settings using GPT-4o and Gemini
+1.5 Pro. The results indicate that while both models exhibited sensitivity to
+linguistic cues (modifier), they failed to process ignorance implicatures with
+visual cues (context) as humans do. Specifically, the influence of context was
+weaker and inconsistent across models, indicating challenges in pragmatic
+reasoning for VLMs. On the other hand, superlative modifiers were more strongly
+associated with ignorance implicatures as compared to comparative modifiers,
+supporting the semantic view. These findings highlight the need for further
+advancements in VLMs to process language-vision information in a
+context-dependent way to achieve human-like pragmatic inference.
 
-摘要：大型語言模型 (LLM) 在各種醫療應用中已展現出顯著的潛力，其中眼科是一個值得關注的重要領域。許多眼科任務已透過整合 LLM 而大幅進步。然而，在這些模型能廣泛應用於臨床實務之前，評估其能力並找出其限制至關重要。為了解決這個研究差距並支援 LLM 的實際應用，我們引入了 OphthBench，這是一個專門的基準測試，旨在評估 LLM 在中國眼科實務中的表現。此基準測試系統性地將典型眼科臨床工作流程劃分為五個關鍵情境：教育、分流、診斷、治療和預後。對於每個情境，我們開發了多項任務，包含多樣化的問題類型，最後組成一個包含 9 項任務和 591 個問題的綜合基準測試。此綜合架構可徹底評估 LLM 的能力，並提供其在中國眼科的實際應用見解。使用此基準測試，我們進行了廣泛的實驗，並分析了來自 39 個熱門 LLM 的結果。我們的評估強調了 LLM 開發與其在臨床環境中的實際效用之間的差距，為未來的進展提供了明確的方向。透過彌合此差距，我們旨在釋放 LLM 的潛力，並促進其在眼科的發展。
+摘要：本研究探討了視覺語言模型 (VLM) 如何處理視覺和語言線索中的無知含義。特別是，我們專注於語境（精確和近似語境）和修飾語類型（裸數字、最高級和比較級修飾語）的影響，這些分別被視為語用和語義因素。在方法論上，我們使用 GPT-4o 和 Gemini 1.5 Pro 在視覺基礎設置中進行了真值判斷任務。結果表明，儘管這兩個模型都對語言線索（修飾語）表現出敏感性，但它們未能像人類那樣處理帶有視覺線索（語境）的無知含義。具體來說，語境的影響在各個模型中較弱且不一致，表明 VLM 在語用推理方面存在挑戰。另一方面，與比較級修飾語相比，最高級修飾語與無知含義的關聯性更強，這支持了語義觀點。這些發現強調了 VLM 進一步發展的必要性，以以語境依賴的方式處理語言視覺信息，以實現類人語用推理。
+
+##### **One-shot Federated Learning Methods: A Practical Guide**
+2502.09104v1 by Xiang Liu, Zhenheng Tang, Xia Li, Yijun Song, Sijie Ji, Zemin Liu, Bo Han, Linshan Jiang, Jialin Li
+
+One-shot Federated Learning (OFL) is a distributed machine learning paradigm
+that constrains client-server communication to a single round, addressing
+privacy and communication overhead issues associated with multiple rounds of
+data exchange in traditional Federated Learning (FL). OFL demonstrates the
+practical potential for integration with future approaches that require
+collaborative training models, such as large language models (LLMs). However,
+current OFL methods face two major challenges: data heterogeneity and model
+heterogeneity, which result in subpar performance compared to conventional FL
+methods. Worse still, despite numerous studies addressing these limitations, a
+comprehensive summary is still lacking. To address these gaps, this paper
+presents a systematic analysis of the challenges faced by OFL and thoroughly
+reviews the current methods. We also offer an innovative categorization method
+and analyze the trade-offs of various techniques. Additionally, we discuss the
+most promising future directions and the technologies that should be integrated
+into the OFL field. This work aims to provide guidance and insights for future
+research.
 
-##### **MIND: Modality-Informed Knowledge Distillation Framework for Multimodal Clinical Prediction Tasks**
-2502.01158v1 by Alejandro Guerra-Manzanares, Farah E. Shamout
+摘要：單次聯邦學習 (OFL) 是一種分散式機器學習範例，將客戶端與伺服器通訊限制在單一輪次中，解決傳統聯邦學習 (FL) 中多輪次資料交換相關的隱私和通訊負擔問題。OFL 展示了與需要協作訓練模型的未來方法整合的實際潛力，例如大型語言模型 (LLM)。然而，目前的 OFL 方法面臨兩大挑戰：資料異質性和模型異質性，這導致與傳統 FL 方法相比，效能較差。更糟的是，儘管有許多研究探討這些限制，但仍缺乏全面的摘要。為了解決這些差距，本文對 OFL 面臨的挑戰進行系統分析，並徹底檢視目前的方法。我們還提供創新的分類方法，並分析各種技術的權衡取捨。此外，我們討論最有希望的未來方向，以及應整合到 OFL 領域的技術。這項工作旨在為未來的研究提供指導和見解。
 
-Multimodal fusion leverages information across modalities to learn better
-feature representations with the goal of improving performance in fusion-based
-tasks. However, multimodal datasets, especially in medical settings, are
-typically smaller than their unimodal counterparts, which can impede the
-performance of multimodal models. Additionally, the increase in the number of
-modalities is often associated with an overall increase in the size of the
-multimodal network, which may be undesirable in medical use cases. Utilizing
-smaller unimodal encoders may lead to sub-optimal performance, particularly
-when dealing with high-dimensional clinical data. In this paper, we propose the
-Modality-INformed knowledge Distillation (MIND) framework, a multimodal model
-compression approach based on knowledge distillation that transfers knowledge
-from ensembles of pre-trained deep neural networks of varying sizes into a
-smaller multimodal student. The teacher models consist of unimodal networks,
-allowing the student to learn from diverse representations. MIND employs
-multi-head joint fusion models, as opposed to single-head models, enabling the
-use of unimodal encoders in the case of unimodal samples without requiring
-imputation or masking of absent modalities. As a result, MIND generates an
-optimized multimodal model, enhancing both multimodal and unimodal
-representations. It can also be leveraged to balance multimodal learning during
-training. We evaluate MIND on binary and multilabel clinical prediction tasks
-using time series data and chest X-ray images. Additionally, we assess the
-generalizability of the MIND framework on three non-medical multimodal
-multiclass datasets. Experimental results demonstrate that MIND enhances the
-performance of the smaller multimodal network across all five tasks, as well as
-various fusion methods and multimodal architectures, compared to
-state-of-the-art baselines.
+##### **Logical Reasoning in Large Language Models: A Survey**
+2502.09100v1 by Hanmeng Liu, Zhizhang Fu, Mengru Ding, Ruoxi Ning, Chaoli Zhang, Xiaozhang Liu, Yue Zhang
 
-摘要：多模态融合利用跨模态的信息来学习更好的特征表示，目标是提升基于融合的任务的性能。然而，多模态数据集，尤其是在医疗环境中，通常比它们的单模态对应数据集小，这会阻碍多模态模型的性能。此外，模态数量的增加通常与多模态网络尺寸的整体增加相关，这在医疗用例中可能是不可取的。利用较小的单模态编码器可能会导致次优性能，尤其是在处理高维临床数据时。在本文中，我们提出了模态信息知识蒸馏 (MIND) 框架，这是一种基于知识蒸馏的多模态模型压缩方法，它将来自不同大小的预训练深度神经网络的集合中的知识转移到一个较小的多模态学生中。教师模型由单模态网络组成，允许学生从不同的表示中学习。MIND 采用多头联合融合模型，而不是单头模型，从而能够在单模态样本的情况下使用单模态编码器，而不需要缺失模态的插补或掩蔽。因此，MIND 生成了一个经过优化的多模态模型，增强了多模态和单模态表示。它还可以用来在训练期间平衡多模态学习。我们使用时间序列数据和胸部 X 射线图像对二元和多标签临床预测任务评估了 MIND。此外，我们评估了 MIND 框架在三个非医疗多模态多分类数据集上的泛化性。实验结果表明，与最先进的基线相比，MIND 增强了较小的多模态网络在所有五个任务以及各种融合方法和多模态架构中的性能。
+With the emergence of advanced reasoning models like OpenAI o3 and
+DeepSeek-R1, large language models (LLMs) have demonstrated remarkable
+reasoning capabilities. However, their ability to perform rigorous logical
+reasoning remains an open question. This survey synthesizes recent advancements
+in logical reasoning within LLMs, a critical area of AI research. It outlines
+the scope of logical reasoning in LLMs, its theoretical foundations, and the
+benchmarks used to evaluate reasoning proficiency. We analyze existing
+capabilities across different reasoning paradigms - deductive, inductive,
+abductive, and analogical - and assess strategies to enhance reasoning
+performance, including data-centric tuning, reinforcement learning, decoding
+strategies, and neuro-symbolic approaches. The review concludes with future
+directions, emphasizing the need for further exploration to strengthen logical
+reasoning in AI systems.
 
-##### **Beyond Yes or No: Predictive Compliance Monitoring Approaches for Quantifying the Magnitude of Compliance Violations**
-2502.01141v1 by Qian Chen, Stefanie Rinderle-Ma, Lijie Wen
+摘要：隨著 OpenAI o3 和 DeepSeek-R1 等先進推理模型的出現，大型語言模型 (LLM) 已展現出非凡的推理能力。然而，它們執行嚴謹邏輯推理的能力仍是一個開放性的問題。此調查綜合了 LLM 中邏輯推理的最新進展，這是 AI 研究的一個關鍵領域。它概述了 LLM 中邏輯推理的範圍、其理論基礎，以及用於評估推理能力的基準。我們分析了不同推理範例（演繹、歸納、外推和類比）中的現有能力，並評估增強推理效能的策略，包括以數據為中心的調整、強化學習、解碼策略和神經符號方法。此評論以未來的方向作為結論，強調需要進一步探索以強化 AI 系統中的邏輯推理。
 
-Most existing process compliance monitoring approaches detect compliance
-violations in an ex post manner. Only predicate prediction focuses on
-predicting them. However, predicate prediction provides a binary yes/no notion
-of compliance, lacking the ability to measure to which extent an ongoing
-process instance deviates from the desired state as specified in constraints.
-Here, being able to quantify the magnitude of violation would provide
-organizations with deeper insights into their operational performance, enabling
-informed decision making to reduce or mitigate the risk of non-compliance.
-Thus, we propose two predictive compliance monitoring approaches to close this
-research gap. The first approach reformulates the binary classification problem
-as a hybrid task that considers both classification and regression, while the
-second employs a multi-task learning method to explicitly predict the
-compliance status and the magnitude of violation for deviant cases
-simultaneously. In this work, we focus on temporal constraints as they are
-significant in almost any application domain, e.g., health care. The evaluation
-on synthetic and real-world event logs demonstrates that our approaches are
-capable of quantifying the magnitude of violations while maintaining comparable
-performance for compliance predictions achieved by state-of-the-art approaches.
+##### **A Hybrid Transformer Model for Fake News Detection: Leveraging Bayesian Optimization and Bidirectional Recurrent Unit**
+2502.09097v1 by Tianyi Huang, Zeqiu Xu, Peiyang Yu, Jingyuan Yi, Xiaochuan Xu
 
-摘要：現有的流程合規監控方法大多會在事後偵測到合規違規。只有謂詞預測專注於預測這些違規。然而，謂詞預測提供的是合規與否的二元概念，無法衡量正在進行的流程實例偏離約束中所指定之理想狀態的程度。在此，能夠量化違規的嚴重程度，將能讓組織深入了解其營運績效，並能據此做出明智的決策，以降低或減輕不合規的風險。因此，我們提出兩種預測合規監控方法來填補此研究空白。第一種方法將二元分類問題重新表述為同時考量分類和回歸的混合任務，而第二種方法則採用多任務學習方法，同時明確預測合規狀態和偏差案例的違規嚴重程度。在這項工作中，我們專注於時間約束，因為它們幾乎在任何應用領域（例如醫療保健）中都很重要。在合成和真實世界事件記錄上的評估顯示，我們的做法能夠量化違規的嚴重程度，同時維持與現有方法所達成的合規預測相當的績效。
+In this paper, we propose an optimized Transformer model that integrates
+Bayesian algorithms with a Bidirectional Gated Recurrent Unit (BiGRU), and
+apply it to fake news classification for the first time. First, we employ the
+TF-IDF method to extract features from news texts and transform them into
+numeric representations to facilitate subsequent machine learning tasks. Two
+sets of experiments are then conducted for fake news detection and
+classification: one using a Transformer model optimized only with BiGRU, and
+the other incorporating Bayesian algorithms into the BiGRU-based Transformer.
+Experimental results show that the BiGRU-optimized Transformer achieves 100%
+accuracy on the training set and 99.67% on the test set, while the addition of
+the Bayesian algorithm maintains 100% accuracy on the training set and slightly
+improves test-set accuracy to 99.73%. This indicates that the Bayesian
+algorithm boosts model accuracy by 0.06%, further enhancing the detection
+capability for fake news. Moreover, the proposed algorithm converges rapidly at
+around the 10th training epoch with accuracy nearing 100%, demonstrating both
+its effectiveness and its fast classification ability. Overall, the optimized
+Transformer model, enhanced by the Bayesian algorithm and BiGRU, exhibits
+excellent continuous learning and detection performance, offering a robust
+technical means to combat the spread of fake news in the current era of
+information overload.
 
-##### **Pulse-PPG: An Open-Source Field-Trained PPG Foundation Model for Wearable Applications Across Lab and Field Settings**
-2502.01108v1 by Mithun Saha, Maxwell A. Xu, Wanting Mao, Sameer Neupane, James M. Rehg, Santosh Kumar
+摘要：<paragraph>在本文中，我們提出了一個最佳化的 Transformer 模型，它將貝氏演算法與雙向門控遞迴單元 (BiGRU) 整合在一起，並首次將其應用於假新聞分類。首先，我們採用 TF-IDF 方法從新聞文本中提取特徵，並將它們轉換為數值表示，以利於後續的機器學習任務。接著進行兩組實驗，分別針對假新聞偵測和分類：一組使用僅使用 BiGRU 最佳化的 Transformer 模型，另一組將貝氏演算法納入基於 BiGRU 的 Transformer 中。實驗結果顯示，BiGRU 最佳化的 Transformer 在訓練組上達到 100% 的準確度，在測試組上達到 99.67%，而加入貝氏演算法後，在訓練組上維持 100% 的準確度，並將測試組的準確度略微提升至 99.73%。這表示貝氏演算法將模型準確度提升了 0.06%，進一步增強了對假新聞的偵測能力。此外，所提出的演算法在約第 10 個訓練週期時快速收斂，準確度接近 100%，證明了它的有效性和快速的分類能力。總的來說，由貝氏演算法和 BiGRU 增強的最佳化 Transformer 模型展現出絕佳的持續學習和偵測效能，提供了一個強健的技術手段來對抗在當前資訊過載時代中假新聞的散布。</paragraph>
 
-Photoplethysmography (PPG)-based foundation models are gaining traction due
-to the widespread use of PPG in biosignal monitoring and their potential to
-generalize across diverse health applications. In this paper, we introduce
-Pulse-PPG, the first open-source PPG foundation model trained exclusively on
-raw PPG data collected over a 100-day field study with 120 participants.
-Existing PPG foundation models are either open-source but trained on clinical
-data or closed-source, limiting their applicability in real-world settings. We
-evaluate Pulse-PPG across multiple datasets and downstream tasks, comparing its
-performance against a state-of-the-art foundation model trained on clinical
-data. Our results demonstrate that Pulse-PPG, trained on uncurated field data,
-exhibits superior generalization across clinical and mobile health applications
-in both lab and field settings. This suggests that exposure to real-world
-variability enables the model to learn fine-grained representations, making it
-more adaptable across tasks. Furthermore, pre-training on field data
-surprisingly outperforms its pre-training on clinical data in many tasks,
-reinforcing the importance of training on real-world, diverse datasets. To
-encourage further advancements in robust foundation models leveraging field
-data, we plan to release Pulse-PPG, providing researchers with a powerful
-resource for developing more generalizable PPG-based models.
+##### **A Hybrid Model for Few-Shot Text Classification Using Transfer and Meta-Learning**
+2502.09086v1 by Jia Gao, Shuangquan Lyu, Guiran Liu, Binrong Zhu, Hongye Zheng, Xiaoxuan Liao
 
-摘要：基於光電容積描記術 (PPG) 的基礎模型由於 PPG 在生物訊號監控中的廣泛使用及其在各種健康應用中推廣的潛力而備受關注。在本文中，我們介紹 Pulse-PPG，這是第一個開放原始碼 PPG 基礎模型，專門針對在為期 100 天的現場研究中收集的 120 位參與者的原始 PPG 資料進行訓練。現有的 PPG 基礎模型要不是開放原始碼，但訓練於臨床資料，不然就是閉源，這限制了它們在真實世界中的應用性。我們評估了 Pulse-PPG 在多個資料集和下游任務中的表現，並將其效能與訓練於臨床資料的最新基礎模型進行比較。我們的結果表明，訓練於未整理現場資料的 Pulse-PPG 在實驗室和現場環境中，在臨床和行動健康應用中展現出優異的泛化能力。這表明接觸真實世界的變異性使模型能夠學習細粒度的表示，使其更能適應各種任務。此外，令人驚訝的是，現場資料的預訓練在許多任務中優於臨床資料的預訓練，這強化了在真實世界、多樣化的資料集上訓練的重要性。為了鼓勵在利用現場資料的強健基礎模型方面進一步發展，我們計畫發布 Pulse-PPG，為研究人員提供一個強大的資源，用於開發更具泛化性的基於 PPG 的模型。
+With the continuous development of natural language processing (NLP)
+technology, text classification tasks have been widely used in multiple
+application fields. However, obtaining labeled data is often expensive and
+difficult, especially in few-shot learning scenarios. To solve this problem,
+this paper proposes a few-shot text classification model based on transfer
+learning and meta-learning. The model uses the knowledge of the pre-trained
+model for transfer and optimizes the model's rapid adaptability in few-sample
+tasks through a meta-learning mechanism. Through a series of comparative
+experiments and ablation experiments, we verified the effectiveness of the
+proposed method. The experimental results show that under the conditions of few
+samples and medium samples, the model based on transfer learning and
+meta-learning significantly outperforms traditional machine learning and deep
+learning methods. In addition, ablation experiments further analyzed the
+contribution of each component to the model performance and confirmed the key
+role of transfer learning and meta-learning in improving model accuracy.
+Finally, this paper discusses future research directions and looks forward to
+the potential of this method in practical applications.
 
-##### **Tutorial on Using Machine Learning and Deep Learning Models for Mental Illness Detection**
-2502.04342v1 by Yeyubei Zhang, Zhongyan Wang, Zhanyi Ding, Yexin Tian, Jianglai Dai, Xiaorui Shen, Yunchong Liu, Yuchen Cao
+摘要：隨著自然語言處理 (NLP) 技術的持續發展，文本分類任務已廣泛應用於多個應用領域。然而，獲取標記資料通常既昂貴又困難，特別是在小樣本學習場景中。為了解決這個問題，本文提出了一個基於遷移學習和元學習的少樣本文本分類模型。該模型利用預訓練模型的知識進行遷移，並透過元學習機制最佳化模型在少樣本任務中的快速適應性。透過一系列的比較實驗和消融實驗，我們驗證了所提出方法的有效性。實驗結果表明，在少樣本和中等樣本的條件下，基於遷移學習和元學習的模型明顯優於傳統機器學習和深度學習方法。此外，消融實驗進一步分析了各個組成部分對模型效能的貢獻，並確認了遷移學習和元學習在提升模型準確度中的關鍵作用。最後，本文探討了未來的研究方向，並期待此方法在實際應用中的潛力。
 
-Social media has become an important source for understanding mental health,
-providing researchers with a way to detect conditions like depression from
-user-generated posts. This tutorial provides practical guidance to address
-common challenges in applying machine learning and deep learning methods for
-mental health detection on these platforms. It focuses on strategies for
-working with diverse datasets, improving text preprocessing, and addressing
-issues such as imbalanced data and model evaluation. Real-world examples and
-step-by-step instructions demonstrate how to apply these techniques
-effectively, with an emphasis on transparency, reproducibility, and ethical
-considerations. By sharing these approaches, this tutorial aims to help
-researchers build more reliable and widely applicable models for mental health
-research, contributing to better tools for early detection and intervention.
+##### **Show Me the Work: Fact-Checkers' Requirements for Explainable Automated Fact-Checking**
+2502.09083v1 by Greta Warren, Irina Shklovski, Isabelle Augenstein
 
-摘要：社群媒體已成為了解心理健康的重要來源，
-為研究人員提供一種方式，從使用者發布的貼文中偵測憂鬱症等狀況。
-本教學提供實務指南，說明如何處理在這些平台上使用機器學習和深度學習方法進行心理健康偵測時常見的挑戰。
-它專注於處理不同資料集、改善文字前處理，以及處理不平衡資料和模型評估等問題的策略。
-實際範例和逐步說明示範如何有效應用這些技術，並強調透明度、可複製性，以及倫理考量。
-透過分享這些方法，本教學指南旨在協助研究人員建構更可靠且廣泛適用的心理健康研究模型，
-進而有助於早期偵測和介入的工具。
+The pervasiveness of large language models and generative AI in online media
+has amplified the need for effective automated fact-checking to assist
+fact-checkers in tackling the increasing volume and sophistication of
+misinformation. The complex nature of fact-checking demands that automated
+fact-checking systems provide explanations that enable fact-checkers to
+scrutinise their outputs. However, it is unclear how these explanations should
+align with the decision-making and reasoning processes of fact-checkers to be
+effectively integrated into their workflows. Through semi-structured interviews
+with fact-checking professionals, we bridge this gap by: (i) providing an
+account of how fact-checkers assess evidence, make decisions, and explain their
+processes; (ii) examining how fact-checkers use automated tools in practice;
+and (iii) identifying fact-checker explanation requirements for automated
+fact-checking tools. The findings show unmet explanation needs and identify
+important criteria for replicable fact-checking explanations that trace the
+model's reasoning path, reference specific evidence, and highlight uncertainty
+and information gaps.
 
-##### **Agent-Based Uncertainty Awareness Improves Automated Radiology Report Labeling with an Open-Source Large Language Model**
-2502.01691v1 by Hadas Ben-Atya, Naama Gavrielov, Zvi Badash, Gili Focht, Ruth Cytter-Kuint, Talar Hagopian, Dan Turner, Moti Freiman
+摘要：大型語言模型和生成式 AI 在線上媒體的普及
+放大了對有效自動查核事實的需求，以協助查核員應對日益增加的錯誤資訊量和複雜性。查核事實的複雜性質要求自動查核事實系統提供說明，讓查核員能夠仔細審查他們的輸出。然而，目前尚不清楚這些說明應如何與查核員的決策制定和推理過程保持一致，才能有效整合到他們的流程中。透過與查核事實專業人士進行半結構式訪談，我們透過以下方式彌補這個差距：(i) 提供查核員如何評估證據、做出決策和解釋其流程的說明；(ii) 檢視查核員如何實際使用自動化工具；以及 (iii) 找出查核員對自動查核事實工具的說明需求。研究結果顯示未滿足的說明需求，並找出可複製查核事實說明的重要準則，這些準則追蹤模型的推理路徑、參考具體證據，並強調不確定性和資訊差距。
 
-Reliable extraction of structured data from radiology reports using Large
-Language Models (LLMs) remains challenging, especially for complex, non-English
-texts like Hebrew. This study introduces an agent-based uncertainty-aware
-approach to improve the trustworthiness of LLM predictions in medical
-applications. We analyzed 9,683 Hebrew radiology reports from Crohn's disease
-patients (from 2010 to 2023) across three medical centers. A subset of 512
-reports was manually annotated for six gastrointestinal organs and 15
-pathological findings, while the remaining reports were automatically annotated
-using HSMP-BERT. Structured data extraction was performed using Llama 3.1
-(Llama 3-8b-instruct) with Bayesian Prompt Ensembles (BayesPE), which employed
-six semantically equivalent prompts to estimate uncertainty. An Agent-Based
-Decision Model integrated multiple prompt outputs into five confidence levels
-for calibrated uncertainty and was compared against three entropy-based models.
-Performance was evaluated using accuracy, F1 score, precision, recall, and
-Cohen's Kappa before and after filtering high-uncertainty cases. The
-agent-based model outperformed the baseline across all metrics, achieving an F1
-score of 0.3967, recall of 0.6437, and Cohen's Kappa of 0.3006. After filtering
-high-uncertainty cases (greater than or equal to 0.5), the F1 score improved to
-0.4787, and Kappa increased to 0.4258. Uncertainty histograms demonstrated
-clear separation between correct and incorrect predictions, with the
-agent-based model providing the most well-calibrated uncertainty estimates. By
-incorporating uncertainty-aware prompt ensembles and an agent-based decision
-model, this approach enhances the performance and reliability of LLMs in
-structured data extraction from radiology reports, offering a more
-interpretable and trustworthy solution for high-stakes medical applications.
+##### **CoSER: Coordinating LLM-Based Persona Simulation of Established Roles**
+2502.09082v1 by Xintao Wang, Heng Wang, Yifei Zhang, Xinfeng Yuan, Rui Xu, Jen-tse Huang, Siyu Yuan, Haoran Guo, Jiangjie Chen, Wei Wang, Yanghua Xiao, Shuchang Zhou
 
-摘要：<paragraph>使用大型語言模型 (LLM) 從放射科報告中可靠地提取結構化數據仍然具有挑戰性，尤其是對於希伯來語等複雜的非英語文本。本研究引入了一種基於代理的不確定性感知方法，以提高 LLM 預測在醫療應用中的可信度。我們分析了來自三個醫療中心的 9,683 份克隆氏症患者的希伯來語放射科報告（從 2010 年到 2023 年）。其中 512 份報告的手動註釋包括六個胃腸器官和 15 個病理發現，而其餘報告則使用 HSMP-BERT 自動註釋。結構化數據提取使用 Llama 3.1（Llama 3-8b-instruct）與貝葉斯提示集合（BayesPE）進行，它採用六個語義等效提示來估計不確定性。基於代理的決策模型將多個提示輸出整合到五個置信度級別中以校準不確定性，並與三個基於熵的模型進行比較。在過濾掉高度不確定性的情況之前和之後，使用準確度、F1 分數、精確度、召回率和 Cohen's Kappa 評估性能。基於代理的模型在所有指標上都優於基線，F1 分數達到 0.3967，召回率達到 0.6437，Cohen's Kappa 達到 0.3006。在過濾掉高度不確定性的情況（大於或等於 0.5）後，F1 分數提高到 0.4787，Kappa 提高到 0.4258。不確定性直方圖顯示了正確預測和不正確預測之間的明顯區別，基於代理的模型提供了校準最好的不確定性估計。通過結合不確定性感知提示集合和基於代理的決策模型，這種方法增強了 LLM 在放射科報告中結構化數據提取中的性能和可靠性，為高風險醫療應用提供了更具可解釋性和可信度的解決方案。</paragraph>
+Role-playing language agents (RPLAs) have emerged as promising applications
+of large language models (LLMs). However, simulating established characters
+presents a challenging task for RPLAs, due to the lack of authentic character
+datasets and nuanced evaluation methods using such data. In this paper, we
+present CoSER, a collection of a high-quality dataset, open models, and an
+evaluation protocol towards effective RPLAs of established characters. The
+CoSER dataset covers 17,966 characters from 771 renowned books. It provides
+authentic dialogues with real-world intricacies, as well as diverse data types
+such as conversation setups, character experiences and internal thoughts.
+Drawing from acting methodology, we introduce given-circumstance acting for
+training and evaluating role-playing LLMs, where LLMs sequentially portray
+multiple characters in book scenes. Using our dataset, we develop CoSER 8B and
+CoSER 70B, i.e., advanced open role-playing LLMs built on LLaMA-3.1 models.
+Extensive experiments demonstrate the value of the CoSER dataset for RPLA
+training, evaluation and retrieval. Moreover, CoSER 70B exhibits
+state-of-the-art performance surpassing or matching GPT-4o on our evaluation
+and three existing benchmarks, i.e., achieving 75.80% and 93.47% accuracy on
+the InCharacter and LifeChoice benchmarks respectively.
 
-##### **Automated Extraction of Spatio-Semantic Graphs for Identifying Cognitive Impairment**
-2502.01685v1 by Si-Ioi Ng, Pranav S. Ambadi, Kimberly D. Mueller, Julie Liss, Visar Berisha
+摘要：角色扮演語言代理（RPLA）已成為大型語言模型（LLM）的有前途的應用。然而，由於缺乏真實角色資料集和使用此類資料的細緻評估方法，模擬既有角色對 RPLA 來說是一項具有挑戰性的任務。在本文中，我們提出了 CoSER，這是一個高品質資料集、開放模型和評估協議的集合，用於有效地扮演既有角色的 RPLA。CoSER 資料集涵蓋了來自 771 本著名書籍的 17,966 個角色。它提供了具有真實世界複雜性的真實對話，以及對話設定、角色體驗和內心想法等多種資料類型。借鑑表演方法，我們引入了既定情境表演，用於訓練和評估角色扮演 LLM，其中 LLM 在書籍場景中依次扮演多個角色。使用我們的資料集，我們開發了 CoSER 8B 和 CoSER 70B，即建立在 LLaMA-3.1 模型上的先進開放角色扮演 LLM。大量的實驗證明了 CoSER 資料集對於 RPLA 訓練、評估和檢索的價值。此外，CoSER 70B 在我們的評估和三個現有基準上展現了超越或匹配 GPT-4o 的最先進效能，即分別在 InCharacter 和 LifeChoice 基準上達到了 75.80% 和 93.47% 的準確率。
 
-Existing methods for analyzing linguistic content from picture descriptions
-for assessment of cognitive-linguistic impairment often overlook the
-participant's visual narrative path, which typically requires eye tracking to
-assess. Spatio-semantic graphs are a useful tool for analyzing this narrative
-path from transcripts alone, however they are limited by the need for manual
-tagging of content information units (CIUs). In this paper, we propose an
-automated approach for estimation of spatio-semantic graphs (via automated
-extraction of CIUs) from the Cookie Theft picture commonly used in
-cognitive-linguistic analyses. The method enables the automatic
-characterization of the visual semantic path during picture description.
-Experiments demonstrate that the automatic spatio-semantic graphs effectively
-differentiate between cognitively impaired and unimpaired speakers. Statistical
-analyses reveal that the features derived by the automated method produce
-comparable results to the manual method, with even greater group differences
-between clinical groups of interest. These results highlight the potential of
-the automated approach for extracting spatio-semantic features in developing
-clinical speech models for cognitive impairment assessment.
+##### **Enhancing RAG with Active Learning on Conversation Records: Reject Incapables and Answer Capables**
+2502.09073v1 by Xuzhao Geng, Haozhao Wang, Jun Wang, Wei Liu, Ruixuan Li
 
-摘要：現有的用於分析圖像描述中的語言內容的方法，用於評估認知語言障礙，通常會忽略參與者的視覺敘事路徑，這通常需要眼球追蹤來評估。時空語義圖是一種有用的工具，可以僅從轉錄本中分析此敘事路徑，但是它們受到手動標記內容資訊單元 (CIU) 的需求所限制。在本文中，我們提出了一種自動化方法，用於從認知語言分析中常用的 Cookie Theft 圖像估計時空語義圖（通過自動提取 CIU）。該方法能夠自動表徵圖片描述期間的視覺語義路徑。實驗表明，自動時空語義圖有效地區分了認知受損和未受損的說話者。統計分析表明，自動化方法衍生的特徵產生了與手動方法相當的結果，甚至在感興趣的臨床組之間產生了更大的組差異。這些結果突出了自動化方法在提取時空語義特徵以開發用於認知障礙評估的臨床語音模型方面的潛力。
+Retrieval-augmented generation (RAG) is a key technique for leveraging
+external knowledge and reducing hallucinations in large language models (LLMs).
+However, RAG still struggles to fully prevent hallucinated responses. To
+address this, it is essential to identify samples prone to hallucination or
+guide LLMs toward correct responses, which experts then annotate to develop
+high-quality datasets for refining LLMs. However, the growing scarcity of such
+datasets makes their creation challenging. This paper proposes using the vast
+amount of conversations from widespread LLM usage to build these datasets,
+training LLMs to avoid hallucination-prone questions while accurately
+responding to manageable ones. Given the impracticality of expert-annotating
+all conversation records, the paper introduces AL4RAG, which uses active
+learning to select the most suitable conversation samples for annotation,
+optimizing performance within an annotation budget. Additionally, recognizing
+that traditional active learning methods are not fully compatible with RAG due
+to unsuitable distance metrics, we develop a novel sample distance measurement
+for RAG active learning. Extensive experiments show that our method
+consistently outperforms baselines across multiple metrics.
 
-##### **Registration-Enhanced Segmentation Method for Prostate Cancer in Ultrasound Images**
-2502.00712v1 by Shengtian Sang, Hassan Jahanandish, Cynthia Xinran Li, Indrani Bhattachary, Jeong Hoon Lee, Lichun Zhang, Sulaiman Vesal, Pejman Ghanouni, Richard Fan, Geoffrey A. Sonn, Mirabela Rusu
+摘要：檢索增強生成 (RAG) 是一種關鍵技術，用於利用外部知識並減少大型語言模型 (LLM) 中的幻覺。然而，RAG 仍難以完全防止幻覺反應。為了解決這個問題，必須找出容易產生幻覺的範例，或引導 LLM 朝向正確的反應，然後由專家註解以開發用於精煉 LLM 的高品質資料集。然而，此類資料集日益稀少，使得其建立極具挑戰性。本文提出使用來自廣泛 LLM 使用的大量對話來建立這些資料集，訓練 LLM 以避免容易產生幻覺的問題，同時準確回應可管理的問題。鑑於由專家為所有對話記錄加上註解並不切實際，本文引入了 AL4RAG，它使用主動學習來選擇最適合註解的對話範例，在註解預算內最佳化效能。此外，認識到傳統主動學習方法由於不適當的距離度量而無法與 RAG 完全相容，我們為 RAG 主動學習開發了一種新穎的範例距離度量。廣泛的實驗表明，我們的模型在多種度量標準上始終優於基準。
 
-Prostate cancer is a major cause of cancer-related deaths in men, where early
-detection greatly improves survival rates. Although MRI-TRUS fusion biopsy
-offers superior accuracy by combining MRI's detailed visualization with TRUS's
-real-time guidance, it is a complex and time-intensive procedure that relies
-heavily on manual annotations, leading to potential errors. To address these
-challenges, we propose a fully automatic MRI-TRUS fusion-based segmentation
-method that identifies prostate tumors directly in TRUS images without
-requiring manual annotations. Unlike traditional multimodal fusion approaches
-that rely on naive data concatenation, our method integrates a
-registration-segmentation framework to align and leverage spatial information
-between MRI and TRUS modalities. This alignment enhances segmentation accuracy
-and reduces reliance on manual effort. Our approach was validated on a dataset
-of 1,747 patients from Stanford Hospital, achieving an average Dice coefficient
-of 0.212, outperforming TRUS-only (0.117) and naive MRI-TRUS fusion (0.132)
-methods, with significant improvements (p $<$ 0.01). This framework
-demonstrates the potential for reducing the complexity of prostate cancer
-diagnosis and provides a flexible architecture applicable to other multimodal
-medical imaging tasks.
+##### **An Open Recipe: Adapting Language-Specific LLMs to a Reasoning Model in One Day via Model Merging**
+2502.09056v1 by Kunat Pipatanakul, Pittawat Taveekitworachai, Potsawee Manakul, Kasima Tharnpipitchai
 
-摘要：前列腺癌是男性癌症相關死亡的主要原因，早期發現可大幅提升存活率。儘管 MRI-TRUS 融合切片檢查結合了 MRI 的詳細視覺化與 TRUS 的即時導引，可提供更高的準確度，但它是一種仰賴大量手動註解的複雜且耗時的程序，容易導致錯誤。為了解決這些挑戰，我們提出了一種全自動的 MRI-TRUS 融合式分割方法，它可以在 TRUS 影像中直接辨識出前列腺腫瘤，而不需要手動註解。與依賴於天真資料串接的傳統多模態融合方法不同，我們的方法整合了一個配準分割架構，以對齊並利用 MRI 與 TRUS 模態之間的空間資訊。這種對齊提升了分割準確度，並減少了對手動作業的依賴。我們的方法已通過來自 Stanford 醫院的 1,747 位患者的資料集進行驗證，達到了 0.212 的平均 Dice 係數，優於僅使用 TRUS (0.117) 和天真的 MRI-TRUS 融合 (0.132) 方法，並有顯著的改善（p < 0.01）。這個架構證明了降低前列腺癌診斷複雜性的潛力，並提供了一個適用於其他多模態醫學影像任務的彈性架構。
+This paper investigates data selection and model merging methodologies aimed
+at incorporating advanced reasoning capabilities such as those of DeepSeek R1
+into language-specific large language models (LLMs), with a particular focus on
+the Thai LLM. Our goal is to enhance the reasoning capabilities of
+language-specific LLMs while maintaining their target language abilities.
+DeepSeek R1 excels in reasoning but primarily benefits high-resource languages
+such as English and Chinese. However, low-resource languages remain underserved
+due to the dominance of English-centric training data and model optimizations,
+which limit performance in these languages. This limitation results in
+unreliable code-switching and diminished effectiveness on tasks in low-resource
+languages. Meanwhile, local and regional LLM initiatives have attempted to
+bridge this gap by developing language-specific LLMs that focus on improving
+local linguistic fidelity. We demonstrate that, with only publicly available
+datasets and a computational budget of $120, it is possible to enhance the
+reasoning capabilities of language-specific LLMs to match the level of DeepSeek
+R1, without compromising their performance on target language tasks.
 
-##### **TMI-CLNet: Triple-Modal Interaction Network for Chronic Liver Disease Prognosis From Imaging, Clinical, and Radiomic Data Fusion**
-2502.00695v1 by Linglong Wu, Xuhao Shan, Ruiquan Ge, Ruoyu Liang, Chi Zhang, Yonghong Li, Ahmed Elazab, Huoling Luo, Yunbi Liu, Changmiao Wang
+摘要：本文探討資料選取與模型合併方法，旨在將深度搜尋 R1 等先進推理能力整合至特定語言的大型語言模型 (LLM)，特別著重於泰語 LLM。我們的目標是提升特定語言 LLM 的推理能力，同時維持其目標語言能力。深度搜尋 R1 在推理方面表現出色，但主要受益於英語和中文等資源豐富的語言。然而，由於以英語為中心的訓練資料和模型最佳化佔據主導地位，資源貧乏的語言仍未獲得充分服務，這限制了這些語言的效能。此限制導致不可靠的代碼切換，並降低了資源貧乏語言任務的效能。與此同時，在地區 LLM 計畫已嘗試透過開發專注於改善在地語言忠實度的特定語言 LLM 來彌合此差距。我們證明，僅使用公開可用的資料集和 120 美元的運算預算，即可提升特定語言 LLM 的推理能力，使其達到深度搜尋 R1 的水準，同時不損及它們在目標語言任務上的效能。
 
-Chronic liver disease represents a significant health challenge worldwide and
-accurate prognostic evaluations are essential for personalized treatment plans.
-Recent evidence suggests that integrating multimodal data, such as computed
-tomography imaging, radiomic features, and clinical information, can provide
-more comprehensive prognostic information. However, modalities have an inherent
-heterogeneity, and incorporating additional modalities may exacerbate the
-challenges of heterogeneous data fusion. Moreover, existing multimodal fusion
-methods often struggle to adapt to richer medical modalities, making it
-difficult to capture inter-modal relationships. To overcome these limitations,
-We present the Triple-Modal Interaction Chronic Liver Network (TMI-CLNet).
-Specifically, we develop an Intra-Modality Aggregation module and a
-Triple-Modal Cross-Attention Fusion module, which are designed to eliminate
-intra-modality redundancy and extract cross-modal information, respectively.
-Furthermore, we design a Triple-Modal Feature Fusion loss function to align
-feature representations across modalities. Extensive experiments on the liver
-prognosis dataset demonstrate that our approach significantly outperforms
-existing state-of-the-art unimodal models and other multi-modal techniques. Our
-code is available at https://github.com/Mysterwll/liver.git.
+##### **Cost-Saving LLM Cascades with Early Abstention**
+2502.09054v1 by Michael J. Zellinger, Rex Liu, Matt Thomson
 
-摘要：慢性肝病在全球范围内代表著重大的健康挑戰，而準確的預後評估對於個人化治療計畫至關重要。最近的證據表明，整合多模態資料（例如電腦斷層影像、放射特徵和臨床資訊）可以提供更全面的預後資訊。然而，模態具有內在異質性，而納入額外的模態可能會加劇異質化資料融合的挑戰。此外，現有的多模態融合方法通常難以適應更豐富的醫療模態，這使得難以捕捉模態間的關係。為了克服這些限制，我們提出了三模態交互慢性肝臟網路 (TMI-CLNet)。具體來說，我們開發了一個模態內聚合模組和一個三模態交叉注意力融合模組，它們分別旨在消除模態內冗餘和提取跨模態資訊。此外，我們設計了一個三模態特徵融合損失函數，以對齊跨模態的特徵表示。在肝臟預後資料集上的廣泛實驗表明，我們的做法顯著優於現有的最先進單模態模型和其他多模態技術。我們的程式碼可以在 https://github.com/Mysterwll/liver.git 上取得。
+LLM cascades are based on the idea that processing all queries with the
+largest and most expensive LLMs is inefficient. Instead, cascades deploy small
+LLMs to answer the majority of queries, limiting the use of large and expensive
+LLMs to only the most difficult queries. This approach can significantly reduce
+costs without impacting performance. However, risk-sensitive domains such as
+finance or medicine place an additional premium on avoiding model errors.
+Recognizing that even the most expensive models may make mistakes, applications
+in these domains benefit from allowing LLM systems to completely abstain from
+answering a query when the chance of making a mistake is significant. However,
+giving a cascade the ability to abstain poses an immediate design question for
+LLM cascades: should abstention only be allowed at the final model or also at
+earlier models? Since the error patterns of small and large models are
+correlated, the latter strategy may further reduce inference costs by letting
+inexpensive models anticipate abstention decisions by expensive models, thereby
+obviating the need to run the expensive models. We investigate the benefits of
+"early abstention" in LLM cascades and find that it reduces the overall test
+loss by 2.2% on average across six benchmarks (GSM8K, MedMCQA, MMLU, TriviaQA,
+TruthfulQA, and XSum). These gains result from a more effective use of
+abstention, which trades a 4.1% average increase in the overall abstention rate
+for a 13.0% reduction in cost and a 5.0% reduction in error rate. Our findings
+demonstrate that it is possible to leverage correlations between the error
+patterns of different language models to drive performance improvements for LLM
+systems with abstention.
 
-##### **Safety at Scale: A Comprehensive Survey of Large Model Safety**
-2502.05206v2 by Xingjun Ma, Yifeng Gao, Yixu Wang, Ruofan Wang, Xin Wang, Ye Sun, Yifan Ding, Hengyuan Xu, Yunhao Chen, Yunhan Zhao, Hanxun Huang, Yige Li, Jiaming Zhang, Xiang Zheng, Yang Bai, Zuxuan Wu, Xipeng Qiu, Jingfeng Zhang, Yiming Li, Jun Sun, Cong Wang, Jindong Gu, Baoyuan Wu, Siheng Chen, Tianwei Zhang, Yang Liu, Mingming Gong, Tongliang Liu, Shirui Pan, Cihang Xie, Tianyu Pang, Yinpeng Dong, Ruoxi Jia, Yang Zhang, Shiqing Ma, Xiangyu Zhang, Neil Gong, Chaowei Xiao, Sarah Erfani, Bo Li, Masashi Sugiyama, Dacheng Tao, James Bailey, Yu-Gang Jiang
+摘要：<paragraph>LLM 級聯基於以下概念：使用最大且最昂貴的 LLM 處理所有查詢效率低下。相反，級聯會部署小型 LLM 來回答大部分查詢，將大型且昂貴的 LLM 的使用限制在最困難的查詢上。這種方法可以大幅降低成本，而不會影響效能。然而，像金融或醫學等對風險敏感的領域會額外重視避免模型錯誤。認識到即使是最昂貴的模型也可能會出錯，在這些領域中的應用程式可受益於允許 LLM 系統在出錯機率很大的情況下完全不回答查詢。然而，賦予級聯不回答的能力會對 LLM 級聯提出立即的設計問題：是否只允許在最終模型中不回答，還是也在較早的模型中不回答？由於小型和大型模型的錯誤模式相關，後一種策略可以讓便宜的模型預測昂貴模型的不回答決策，進而降低推論成本，從而避免執行昂貴的模型。我們調查了 LLM 級聯中「早期不回答」的好處，並發現它平均降低了六個基準測試（GSM8K、MedMCQA、MMLU、TriviaQA、TruthfulQA 和 XSum）的整體測試損失 2.2%。這些收益來自於更有效地使用不回答，以整體不回答率平均增加 4.1% 的代價換取成本降低 13.0% 和錯誤率降低 5.0%。我們的研究結果證明，可以利用不同語言模型的錯誤模式之間的關聯性，來推動具有不回答功能的 LLM 系統的效能改進。</paragraph>
 
-The rapid advancement of large models, driven by their exceptional abilities
-in learning and generalization through large-scale pre-training, has reshaped
-the landscape of Artificial Intelligence (AI). These models are now
-foundational to a wide range of applications, including conversational AI,
-recommendation systems, autonomous driving, content generation, medical
-diagnostics, and scientific discovery. However, their widespread deployment
-also exposes them to significant safety risks, raising concerns about
-robustness, reliability, and ethical implications. This survey provides a
-systematic review of current safety research on large models, covering Vision
-Foundation Models (VFMs), Large Language Models (LLMs), Vision-Language
-Pre-training (VLP) models, Vision-Language Models (VLMs), Diffusion Models
-(DMs), and large-model-based Agents. Our contributions are summarized as
-follows: (1) We present a comprehensive taxonomy of safety threats to these
-models, including adversarial attacks, data poisoning, backdoor attacks,
-jailbreak and prompt injection attacks, energy-latency attacks, data and model
-extraction attacks, and emerging agent-specific threats. (2) We review defense
-strategies proposed for each type of attacks if available and summarize the
-commonly used datasets and benchmarks for safety research. (3) Building on
-this, we identify and discuss the open challenges in large model safety,
-emphasizing the need for comprehensive safety evaluations, scalable and
-effective defense mechanisms, and sustainable data practices. More importantly,
-we highlight the necessity of collective efforts from the research community
-and international collaboration. Our work can serve as a useful reference for
-researchers and practitioners, fostering the ongoing development of
-comprehensive defense systems and platforms to safeguard AI models.
+##### **Game Theory Meets Large Language Models: A Systematic Survey**
+2502.09053v1 by Haoran Sun, Yusen Wu, Yukun Cheng, Xu Chu
 
-摘要：<paragraph>大型模型的快速進展，得益於它們在通過大規模預訓練進行學習和概括方面的卓越能力，已經重塑了人工智能 (AI) 的格局。這些模型現在是廣泛應用程式（包括對話式 AI、推薦系統、自動駕駛、內容生成、醫療診斷和科學發現）的基礎。然而，它們的廣泛部署也使它們面臨重大的安全風險，引發了對穩健性、可靠性和倫理影響的擔憂。本調查提供了對大型模型當前安全研究的系統性回顧，涵蓋視覺基礎模型 (VFM)、大型語言模型 (LLM)、視覺語言預訓練 (VLP) 模型、視覺語言模型 (VLM)、擴散模型 (DM) 和基於大型模型的代理。我們的貢獻總結如下：(1) 我們提出了一個針對這些模型的安全威脅的全面分類，包括對抗性攻擊、資料中毒、後門攻擊、越獄和提示注入攻擊、能量延遲攻擊、資料和模型提取攻擊以及新興的特定代理威脅。(2) 我們檢視了針對每種類型攻擊提出的防禦策略（如果有的話），並總結了安全研究中常用的資料集和基準。(3) 基於此，我們找出並討論了大型模型安全中的開放性挑戰，強調了對全面安全評估、可擴充且有效的防禦機制以及永續資料實務的需求。更重要的是，我們強調了研究社群和國際合作共同努力的必要性。我們的研究可作為研究人員和從業人員的有用參考，促進全面防禦系統和平台的持續發展，以保護 AI 模型。</paragraph>
+Game theory establishes a fundamental framework for analyzing strategic
+interactions among rational decision-makers. The rapid advancement of large
+language models (LLMs) has sparked extensive research exploring the
+intersection of these two fields. Specifically, game-theoretic methods are
+being applied to evaluate and enhance LLM capabilities, while LLMs themselves
+are reshaping classic game models. This paper presents a comprehensive survey
+of the intersection of these fields, exploring a bidirectional relationship
+from three perspectives: (1) Establishing standardized game-based benchmarks
+for evaluating LLM behavior; (2) Leveraging game-theoretic methods to improve
+LLM performance through algorithmic innovations; (3) Characterizing the
+societal impacts of LLMs through game modeling. Among these three aspects, we
+also highlight how the equilibrium analysis for traditional game models is
+impacted by LLMs' advanced language understanding, which in turn extends the
+study of game theory. Finally, we identify key challenges and future research
+directions, assessing their feasibility based on the current state of the
+field. By bridging theoretical rigor with emerging AI capabilities, this survey
+aims to foster interdisciplinary collaboration and drive progress in this
+evolving research area.
 
-##### **Enhanced Convolutional Neural Networks for Improved Image Classification**
-2502.00663v1 by Xiaoran Yang, Shuhan Yu, Wenxi Xu
+摘要：博弈論建立一個基本架構，用來分析理性決策者之間的策略互動。大型語言模型 (LLM) 的快速進展，激發了廣泛的研究，探討這兩個領域的交集。具體來說，博弈論方法被應用於評估和增強 LLM 能力，而 LLM 本身正在重塑經典博弈模型。本文對這些領域的交集進行了全面的調查，從三個角度探討了雙向關係：(1) 建立標準化的基於博弈的基準，用於評估 LLM 行為；(2) 利用博弈論方法，通過演算法創新來改善 LLM 效能；(3) 透過博弈模型，描述 LLM 對社會的影響。在這三個方面中，我們還強調了 LLM 的先進語言理解如何影響傳統博弈模型的均衡分析，這反過來又擴展了博弈論的研究。最後，我們找出關鍵挑戰和未來的研究方向，根據該領域的現狀評估其可行性。透過將理論嚴謹性與新興的 AI 能力相結合，這項調查旨在促進跨學科合作，並推動這個不斷演變的研究領域的進展。
 
-Image classification is a fundamental task in computer vision with diverse
-applications, ranging from autonomous systems to medical imaging. The CIFAR-10
-dataset is a widely used benchmark to evaluate the performance of
-classification models on small-scale, multi-class datasets. Convolutional
-Neural Networks (CNNs) have demonstrated state-of-the-art results; however,
-they often suffer from overfitting and suboptimal feature representation when
-applied to challenging datasets like CIFAR-10. In this paper, we propose an
-enhanced CNN architecture that integrates deeper convolutional blocks, batch
-normalization, and dropout regularization to achieve superior performance. The
-proposed model achieves a test accuracy of 84.95%, outperforming baseline CNN
-architectures. Through detailed ablation studies, we demonstrate the
-effectiveness of the enhancements and analyze the hierarchical feature
-representations. This work highlights the potential of refined CNN
-architectures for tackling small-scale image classification problems
-effectively.
+##### **AIDE: Agentically Improve Visual Language Model with Domain Experts**
+2502.09051v1 by Ming-Chang Chiu, Fuxiao Liu, Karan Sapra, Andrew Tao, Yaser Jacoob, Xuezhe Ma, Zhiding Yu, Guilin Liu
 
-摘要：影像分類是電腦視覺中的一項基本任務，應用範圍廣泛，從自動系統到醫學影像皆有。CIFAR-10 資料集是一個廣泛使用的基準，用於評估分類模型在小規模、多類別資料集上的效能。卷積神經網路 (CNN) 已展現出最先進的成果；然而，當應用於 CIFAR-10 等具挑戰性的資料集時，它們常常會發生過度擬合和次佳特徵表示的問題。在本文中，我們提出一個增強的 CNN 架構，它整合了更深的卷積區塊、批次正規化和中斷正規化，以達成卓越的效能。所提出的模型達到了 84.95% 的測試準確度，優於基準 CNN 架構。透過詳細的消融研究，我們證明了這些增強功能的有效性，並分析了階層式特徵表示。這項工作突顯了精進的 CNN 架構在有效解決小規模影像分類問題上的潛力。
+The enhancement of Visual Language Models (VLMs) has traditionally relied on
+knowledge distillation from larger, more capable models. This dependence
+creates a fundamental bottleneck for improving state-of-the-art systems,
+particularly when no superior models exist. We introduce AIDE (Agentic
+Improvement through Domain Experts), a novel framework that enables VLMs to
+autonomously enhance their capabilities by leveraging specialized domain expert
+models. AIDE operates through a four-stage process: (1) identifying instances
+for refinement, (2) engaging domain experts for targeted analysis, (3)
+synthesizing expert outputs with existing data, and (4) integrating enhanced
+instances into the training pipeline. Experiments on multiple benchmarks,
+including MMMU, MME, MMBench, etc., demonstrate AIDE's ability to achieve
+notable performance gains without relying on larger VLMs nor human supervision.
+Our framework provides a scalable, resource-efficient approach to continuous
+VLM improvement, addressing critical limitations in current methodologies,
+particularly valuable when larger models are unavailable to access.
 
-##### **Distribution-aware Fairness Learning in Medical Image Segmentation From A Control-Theoretic Perspective**
-2502.00619v1 by Yujin Oh, Pengfei Jin, Sangjoon Park, Sekeun Kim, Siyeop Yoon, Kyungsang Kim, Jin Sung Kim, Xiang Li, Quanzheng Li
+摘要：視覺語言模型 (VLM) 的增強傳統上依賴於從更大、功能更強大的模型中進行知識萃取。這種依賴性會造成改善最先進系統的基本瓶頸，尤其在沒有更優越的模型時。我們引進 AIDE（透過領域專家進行代理式改善），一個創新的架構，讓 VLM 能夠透過利用專業的領域專家模型，自主增強其功能。AIDE 透過四階段流程運作：(1) 識別需要改善的實例，(2) 聘請領域專家進行有針對性的分析，(3) 將專家輸出與現有資料綜合，以及 (4) 將增強的實例整合到訓練流程中。在多個基準測試上的實驗，包括 MMMU、MME、MMBench 等，證明了 AIDE 能夠在不依賴更大型的 VLM 或人工監督的情況下，實現顯著的效能提升。我們的架構提供了一個可擴充、資源效率高的持續 VLM 改進方法，解決了當前方法中的關鍵限制，特別是在無法取得大型模型時，這一點特別有價值。
 
-Ensuring fairness in medical image segmentation is critical due to biases in
-imbalanced clinical data acquisition caused by demographic attributes (e.g.,
-age, sex, race) and clinical factors (e.g., disease severity). To address these
-challenges, we introduce Distribution-aware Mixture of Experts (dMoE), inspired
-by optimal control theory. We provide a comprehensive analysis of its
-underlying mechanisms and clarify dMoE's role in adapting to heterogeneous
-distributions in medical image segmentation. Furthermore, we integrate dMoE
-into multiple network architectures, demonstrating its broad applicability
-across diverse medical image analysis tasks. By incorporating demographic and
-clinical factors, dMoE achieves state-of-the-art performance on two 2D
-benchmark datasets and a 3D in-house dataset. Our results highlight the
-effectiveness of dMoE in mitigating biases from imbalanced distributions,
-offering a promising approach to bridging control theory and medical image
-segmentation within fairness learning paradigms. The source code will be made
-available.
+##### **Leveraging Member-Group Relations via Multi-View Graph Filtering for Effective Group Recommendation**
+2502.09050v1 by Chae-Hyun Kim, Yoon-Ryung Choi, Jin-Duk Park, Won-Yong Shin
 
-摘要：在医学影像分割中，由於人口屬性（例如年齡、性別、種族）和臨床因素（例如疾病嚴重程度）導致不平衡的臨床數據採集中存在偏差，因此確保公平性至關重要。為了應對這些挑戰，我們引入了受最優控制理論啟發的感知混合專家 (dMoE)。我們對其底層機制進行了全面分析，並釐清了 dMoE 在適應醫學影像分割中的異質分佈中的作用。此外，我們將 dMoE 整合到多個網路架構中，展示了其在各種醫學影像分析任務中的廣泛適用性。通過納入人口統計和臨床因素，dMoE 在兩個 2D 基準數據集和一個 3D 內部數據集上實現了最先進的性能。我們的結果突出了 dMoE 在減輕不平衡分佈的偏差方面的有效性，為在公平性學習範例中橋接控制理論和醫學影像分割提供了一個有前景的方法。原始碼將會公開。
+Group recommendation aims at providing optimized recommendations tailored to
+diverse groups, enabling groups to enjoy appropriate items. On the other hand,
+most existing group recommendation methods are built upon deep neural network
+(DNN) architectures designed to capture the intricate relationships between
+member-level and group-level interactions. While these DNN-based approaches
+have proven their effectiveness, they require complex and expensive training
+procedures to incorporate group-level interactions in addition to member-level
+interactions. To overcome such limitations, we introduce Group-GF, a new
+approach for extremely fast recommendations of items to each group via
+multi-view graph filtering (GF) that offers a holistic view of complex
+member-group dynamics, without the need for costly model training.
+Specifically, in Group-GF, we first construct three item similarity graphs
+manifesting different viewpoints for GF. Then, we discover a distinct
+polynomial graph filter for each similarity graph and judiciously aggregate the
+three graph filters. Extensive experiments demonstrate the effectiveness of
+Group-GF in terms of significantly reducing runtime and achieving
+state-of-the-art recommendation accuracy.
 
-##### **Generating crossmodal gene expression from cancer histopathology improves multimodal AI predictions**
-2502.00568v3 by Samiran Dey, Christopher R. S. Banerji, Partha Basuchowdhuri, Sanjoy K. Saha, Deepak Parashar, Tapabrata Chakraborti
+摘要：群組推薦旨在提供針對不同群組量身打造的最佳推薦，讓群組可以享受適當的項目。另一方面，現有的群組推薦方法大多建立在深度神經網路 (DNN) 架構上，旨在捕捉成員層級和群組層級互動之間的複雜關係。雖然這些基於 DNN 的方法已證明其有效性，但它們需要複雜且昂貴的訓練程序，才能在成員層級互動之外納入群組層級互動。為了克服這些限制，我們引入了 Group-GF，這是一種透過多視圖圖形過濾 (GF) 為每個群組提供極快速項目推薦的新方法，它提供了複雜成員群組動態的整體視圖，而無需進行昂貴的模型訓練。具體來說，在 Group-GF 中，我們首先建構三個項目相似度圖形，展現 GF 的不同觀點。然後，我們為每個相似度圖形發現一個不同的多項式圖形過濾器，並明智地彙總這三個圖形過濾器。廣泛的實驗證明了 Group-GF 在顯著減少執行時間和達成最先進的推薦準確度方面的有效性。
 
-Emerging research has highlighted that artificial intelligence based
-multimodal fusion of digital pathology and transcriptomic features can improve
-cancer diagnosis (grading/subtyping) and prognosis (survival risk) prediction.
-However, such direct fusion for joint decision is impractical in real clinical
-settings, where histopathology is still the gold standard for diagnosis and
-transcriptomic tests are rarely requested, at least in the public healthcare
-system. With our novel diffusion based crossmodal generative AI model PathGen,
-we show that genomic expressions synthesized from digital histopathology
-jointly predicts cancer grading and patient survival risk with high accuracy
-(state-of-the-art performance), certainty (through conformal coverage
-guarantee) and interpretability (through distributed attention maps). PathGen
-code is available for open use by the research community through GitHub at
-https://github.com/Samiran-Dey/PathGen.
+##### **Criteria-Aware Graph Filtering: Extremely Fast Yet Accurate Multi-Criteria Recommendation**
+2502.09046v1 by Jin-Duk Park, Jaemin Yoo, Won-Yong Shin
 
-摘要：新興研究強調，基於人工智慧的多模態融合數位病理學和轉錄組特徵，可以改善癌症診斷（分級/分型）和預後（存活風險）預測。
-然而，這種直接融合對於聯合決策在實際臨床環境中並不切實際，在實際臨床環境中，組織病理學仍然是診斷的黃金標準，而轉錄組檢測很少被要求，至少在公共醫療保健系統中是如此。透過我們新穎的基於擴散的跨模態生成式 AI 模型 PathGen，我們展示了從數位組織病理學合成的基因體表達共同預測癌症分級和患者存活風險，具有很高的準確度（最先進的效能）、確定性（透過共形覆蓋保證）和可解釋性（透過分佈式注意力圖）。PathGen 程式碼可透過 GitHub 上的 https://github.com/Samiran-Dey/PathGen 供研究社群公開使用。
+Multi-criteria (MC) recommender systems, which utilize MC rating information
+for recommendation, are increasingly widespread in various e-commerce domains.
+However, the MC recommendation using training-based collaborative filtering,
+requiring consideration of multiple ratings compared to single-criterion
+counterparts, often poses practical challenges in achieving state-of-the-art
+performance along with scalable model training. To solve this problem, we
+propose CA-GF, a training-free MC recommendation method, which is built upon
+criteria-aware graph filtering for efficient yet accurate MC recommendations.
+Specifically, first, we construct an item-item similarity graph using an MC
+user-expansion graph. Next, we design CA-GF composed of the following key
+components, including 1) criterion-specific graph filtering where the optimal
+filter for each criterion is found using various types of polynomial low-pass
+filters and 2) criteria preference-infused aggregation where the smoothed
+signals from each criterion are aggregated. We demonstrate that CA-GF is (a)
+efficient: providing the computational efficiency, offering the extremely fast
+runtime of less than 0.2 seconds even on the largest benchmark dataset, (b)
+accurate: outperforming benchmark MC recommendation methods, achieving
+substantial accuracy gains up to 24% compared to the best competitor, and (c)
+interpretable: providing interpretations for the contribution of each criterion
+to the model prediction based on visualizations.
 
+摘要：多準則 (MC) 推薦系統在各種電子商務領域中日益普及，該系統利用 MC 評分資訊進行推薦。
+然而，與單準則對應項目相比，使用基於訓練的協同過濾的 MC 推薦，通常在達成最先進的效能以及可擴充模型訓練方面造成實務上的挑戰，需要考慮多個評分。為了解決這個問題，我們提出 CA-GF，一種無需訓練的 MC 推薦方法，它建立於準則感知圖形過濾之上，用於有效且準確的 MC 推薦。
+具體來說，首先，我們使用 MC 使用者擴展圖形來建構一個項目相似度圖形。接下來，我們設計 CA-GF，它包含以下關鍵組成部分，包括 1) 準則特定圖形過濾，其中使用各種類型的多項式低通濾波器來找出每個準則的最佳濾波器，以及 2) 準則偏好注入聚合，其中來自每個準則的平滑訊號被聚合。我們證明 CA-GF 是 (a) 有效的：提供運算效率，即使在最大的基準資料集上，也能提供低於 0.2 秒的極快執行時間，(b) 準確的：優於基準 MC 推薦方法，與最佳競爭者相比，獲得高達 24% 的顯著準確性提升，以及 (c) 可解釋的：根據視覺化提供對每個準則對模型預測的貢獻的解釋。
 
-### Knowledge Graphs
-|Publish Date|Title|Authors|Homepage|Code|
-| :---: | :---: | :---: | :---: | :---: |
-|**2025-02-13**|**Visual Graph Question Answering with ASP and LLMs for Language Parsing**|Jakob Johannes Bauer et.al.|[2502.09211v1](http://arxiv.org/abs/2502.09211v1)|null|
-|**2025-02-12**|**Representation Learning to Advance Multi-institutional Studies with Electronic Health Record Data**|Doudou Zhou et.al.|[2502.08547v1](http://arxiv.org/abs/2502.08547v1)|null|
-|**2025-02-12**|**Trustworthy GNNs with LLMs: A Systematic Review and Taxonomy**|Ruizhan Xue et.al.|[2502.08353v1](http://arxiv.org/abs/2502.08353v1)|null|
-|**2025-02-12**|**Graph Foundation Models for Recommendation: A Comprehensive Survey**|Bin Wu et.al.|[2502.08346v1](http://arxiv.org/abs/2502.08346v1)|null|
-|**2025-02-12**|**Self-Evaluation for Job-Shop Scheduling**|Imanol Echeverria et.al.|[2502.08684v1](http://arxiv.org/abs/2502.08684v1)|null|
-|**2025-02-12**|**Improving Existing Optimization Algorithms with LLMs**|Camilo Chacón Sartori et.al.|[2502.08298v1](http://arxiv.org/abs/2502.08298v1)|null|
-|**2025-02-12**|**ACCESS : A Benchmark for Abstract Causal Event Discovery and Reasoning**|Vy Vo et.al.|[2502.08148v1](http://arxiv.org/abs/2502.08148v1)|null|
-|**2025-02-12**|**GCoT: Chain-of-Thought Prompt Learning for Graphs**|Xingtong Yu et.al.|[2502.08092v1](http://arxiv.org/abs/2502.08092v1)|null|
-|**2025-02-11**|**Deep Semantic Graph Learning via LLM based Node Enhancement**|Chuanqi Shi et.al.|[2502.07982v1](http://arxiv.org/abs/2502.07982v1)|null|
-|**2025-02-10**|**Cardiverse: Harnessing LLMs for Novel Card Game Prototyping**|Danrui Li et.al.|[2502.07128v1](http://arxiv.org/abs/2502.07128v1)|null|
-|**2025-02-10**|**GraNNite: Enabling High-Performance Execution of Graph Neural Networks on Resource-Constrained Neural Processing Units**|Arghadip Das et.al.|[2502.06921v2](http://arxiv.org/abs/2502.06921v2)|[link](https://github.com/arghadippurdue/GraNNite)|
-|**2025-02-10**|**Automatic Annotation Augmentation Boosts Translation between Molecules and Natural Language**|Zhiqiang Zhong et.al.|[2502.06634v1](http://arxiv.org/abs/2502.06634v1)|null|
-|**2025-02-10**|**KARMA: Leveraging Multi-Agent LLMs for Automated Knowledge Graph Enrichment**|Yuxing Lu et.al.|[2502.06472v1](http://arxiv.org/abs/2502.06472v1)|null|
-|**2025-02-10**|**RoToR: Towards More Reliable Responses for Order-Invariant Inputs**|Soyoung Yoon et.al.|[2502.08662v1](http://arxiv.org/abs/2502.08662v1)|null|
-|**2025-02-10**|**K-ON: Stacking Knowledge On the Head Layer of Large Language Model**|Lingbing Guo et.al.|[2502.06257v1](http://arxiv.org/abs/2502.06257v1)|null|
-|**2025-02-10**|**LegalViz: Legal Text Visualization by Text To Diagram Generation**|Eri Onami et.al.|[2502.06147v2](http://arxiv.org/abs/2502.06147v2)|null|
-|**2025-02-09**|**Deconstructing Depression Stigma: Integrating AI-driven Data Collection and Analysis with Causal Knowledge Graphs**|Han Meng et.al.|[2502.06075v1](http://arxiv.org/abs/2502.06075v1)|null|
-|**2025-02-09**|**LegalSeg: Unlocking the Structure of Indian Legal Judgments Through Rhetorical Role Classification**|Shubham Kumar Nigam et.al.|[2502.05836v1](http://arxiv.org/abs/2502.05836v1)|null|
-|**2025-02-08**|**LLM-Powered Decentralized Generative Agents with Adaptive Hierarchical Knowledge Graph for Cooperative Planning**|Hanqing Yang et.al.|[2502.05453v1](http://arxiv.org/abs/2502.05453v1)|null|
-|**2025-02-08**|**SAMGPT: Text-free Graph Foundation Model for Multi-domain Pre-training and Cross-domain Adaptation**|Xingtong Yu et.al.|[2502.05424v1](http://arxiv.org/abs/2502.05424v1)|null|
-|**2025-02-08**|**Graph-based Molecular In-context Learning Grounded on Morgan Fingerprints**|Ali Al-Lawati et.al.|[2502.05414v1](http://arxiv.org/abs/2502.05414v1)|null|
-|**2025-02-08**|**Knowledge Graph-Guided Retrieval Augmented Generation**|Xiangrong Zhu et.al.|[2502.06864v1](http://arxiv.org/abs/2502.06864v1)|[link](https://github.com/nju-websoft/KG2RAG)|
-|**2025-02-07**|**Can Large Language Models Understand Intermediate Representations?**|Hailong Jiang et.al.|[2502.06854v1](http://arxiv.org/abs/2502.06854v1)|null|
-|**2025-02-07**|**GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity?**|Yang Zhou et.al.|[2502.05252v1](http://arxiv.org/abs/2502.05252v1)|null|
-|**2025-02-07**|**Causality can systematically address the monsters under the bench(marks)**|Felix Leeb et.al.|[2502.05085v1](http://arxiv.org/abs/2502.05085v1)|null|
-|**2025-02-07**|**Adaptive Graph of Thoughts: Test-Time Adaptive Reasoning Unifying Chain, Tree, and Graph Structures**|Tushar Pandey et.al.|[2502.05078v1](http://arxiv.org/abs/2502.05078v1)|[link](https://github.com/AgnostiqHQ/multi-agent-llm)|
-|**2025-02-07**|**Enhancing Knowledge Graph Construction: Evaluating with Emphasis on Hallucination, Omission, and Graph Similarity Metrics**|Hussam Ghanem et.al.|[2502.05239v1](http://arxiv.org/abs/2502.05239v1)|null|
-|**2025-02-07**|**Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research**|Junde Wu et.al.|[2502.04644v1](http://arxiv.org/abs/2502.04644v1)|[link](https://github.com/theworldofagents/agentic-reasoning)|
-|**2025-02-07**|**Position-aware Automatic Circuit Discovery**|Tal Haklay et.al.|[2502.04577v1](http://arxiv.org/abs/2502.04577v1)|[link](https://github.com/technion-cs-nlp/peap)|
-|**2025-02-06**|**Heterogeneous Swarms: Jointly Optimizing Model Roles and Weights for Multi-LLM Systems**|Shangbin Feng et.al.|[2502.04510v1](http://arxiv.org/abs/2502.04510v1)|null|
-|**2025-02-06**|**MedRAG: Enhancing Retrieval-augmented Generation with Knowledge Graph-Elicited Reasoning for Healthcare Copilot**|Xuejiao Zhao et.al.|[2502.04413v1](http://arxiv.org/abs/2502.04413v1)|[link](https://github.com/snowteam2023/medrag)|
-|**2025-02-06**|**Ontology-Guided, Hybrid Prompt Learning for Generalization in Knowledge Graph Question Answering**|Longquan Jiang et.al.|[2502.03992v1](http://arxiv.org/abs/2502.03992v1)|[link](https://github.com/longquanjiang/ontoscprompt)|
-|**2025-02-06**|**Multimodal Medical Code Tokenizer**|Xiaorui Su et.al.|[2502.04397v2](http://arxiv.org/abs/2502.04397v2)|null|
-|**2025-02-06**|**Division-of-Thoughts: Harnessing Hybrid Language Model Synergy for Efficient On-Device Agents**|Chenyang Shao et.al.|[2502.04392v1](http://arxiv.org/abs/2502.04392v1)|null|
-|**2025-02-06**|**Boosting Knowledge Graph-based Recommendations through Confidence-Aware Augmentation with Large Language Models**|Rui Cai et.al.|[2502.03715v1](http://arxiv.org/abs/2502.03715v1)|null|
-|**2025-02-05**|**A Schema-Guided Reason-while-Retrieve framework for Reasoning on Scene Graphs with Large-Language-Models (LLMs)**|Yiye Chen et.al.|[2502.03450v1](http://arxiv.org/abs/2502.03450v1)|null|
-|**2025-02-05**|**SymAgent: A Neural-Symbolic Self-Learning Agent Framework for Complex Reasoning over Knowledge Graphs**|Ben Liu et.al.|[2502.03283v1](http://arxiv.org/abs/2502.03283v1)|null|
-|**2025-02-05**|**Analyze Feature Flow to Enhance Interpretation and Steering in Language Models**|Daniil Laptev et.al.|[2502.03032v2](http://arxiv.org/abs/2502.03032v2)|null|
-|**2025-02-05**|**A Benchmark for the Detection of Metalinguistic Disagreements between LLMs and Knowledge Graphs**|Bradley P. Allen et.al.|[2502.02896v1](http://arxiv.org/abs/2502.02896v1)|null|
-|**2025-02-05**|**Mol-LLM: Generalist Molecular LLM with Improved Graph Utilization**|Chanhui Lee et.al.|[2502.02810v1](http://arxiv.org/abs/2502.02810v1)|null|
-|**2025-02-05**|**Leveraging the true depth of LLMs**|Ramón Calvo González et.al.|[2502.02790v1](http://arxiv.org/abs/2502.02790v1)|null|
-|**2025-02-04**|**Modular Training of Neural Networks aids Interpretability**|Satvik Golechha et.al.|[2502.02470v2](http://arxiv.org/abs/2502.02470v2)|null|
-|**2025-02-04**|**Premise-Augmented Reasoning Chains Improve Error Identification in Math reasoning with LLMs**|Sagnik Mukherjee et.al.|[2502.02362v3](http://arxiv.org/abs/2502.02362v3)|null|
-|**2025-02-04**|**AdaptBot: Combining LLM with Knowledge Graphs and Human Input for Generic-to-Specific Task Decomposition and Knowledge Refinement**|Shivam Singh et.al.|[2502.02067v1](http://arxiv.org/abs/2502.02067v1)|[link](https://github.com/sssshivvvv/adaptbot)|
-|**2025-02-03**|**On Bob Dylan: A Computational Perspective**|Prashant Garg et.al.|[2502.01772v1](http://arxiv.org/abs/2502.01772v1)|null|
-|**2025-02-03**|**VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos**|Xubin Ren et.al.|[2502.01549v1](http://arxiv.org/abs/2502.01549v1)|null|
-|**2025-02-03**|**Transformers trained on proteins can learn to attend to Euclidean distance**|Isaac Ellmen et.al.|[2502.01533v1](http://arxiv.org/abs/2502.01533v1)|[link](https://github.com/Ellmen/attending-to-distance)|
-|**2025-02-03**|**Common Foundations for SHACL, ShEx, and PG-Schema**|S. Ahmetaj et.al.|[2502.01295v1](http://arxiv.org/abs/2502.01295v1)|null|
-|**2025-02-03**|**GFM-RAG: Graph Foundation Model for Retrieval Augmented Generation**|Linhao Luo et.al.|[2502.01113v1](http://arxiv.org/abs/2502.01113v1)|[link](https://github.com/RManLuo/gfm-rag)|
-|**2025-02-03**|**Knowledge Synthesis of Photosynthesis Research Using a Large Language Model**|Seungri Yoon et.al.|[2502.01059v1](http://arxiv.org/abs/2502.01059v1)|null|
-|**2025-02-03**|**Encrypted Large Model Inference: The Equivariant Encryption Paradigm**|James Buban et.al.|[2502.01013v1](http://arxiv.org/abs/2502.01013v1)|null|
-|**2025-02-02**|**Metastable Dynamics of Chain-of-Thought Reasoning: Provable Benefits of Search, RL and Distillation**|Juno Kim et.al.|[2502.01694v1](http://arxiv.org/abs/2502.01694v1)|null|
-|**2025-02-02**|**PhiP-G: Physics-Guided Text-to-3D Compositional Scene Generation**|Qixuan Li et.al.|[2502.00708v1](http://arxiv.org/abs/2502.00708v1)|null|
-|**2025-02-02**|**A Survey of Quantized Graph Representation Learning: Connecting Graph Structures with Large Language Models**|Qika Lin et.al.|[2502.00681v1](http://arxiv.org/abs/2502.00681v1)|null|
-|**2025-02-01**|**Challenges and Innovations in LLM-Powered Fake News Detection: A Synthesis of Approaches and Future Directions**|Jingyuan Yi et.al.|[2502.00339v1](http://arxiv.org/abs/2502.00339v1)|null|
-|**2025-02-01**|**DEUCE: Dual-diversity Enhancement and Uncertainty-awareness for Cold-start Active Learning**|Jiaxin Guo et.al.|[2502.00305v1](http://arxiv.org/abs/2502.00305v1)|null|
-|**2025-01-31**|**Longer Attention Span: Increasing Transformer Context Length with Sparse Graph Processing Techniques**|Nathaniel Tomczak et.al.|[2502.01659v2](http://arxiv.org/abs/2502.01659v2)|[link](https://github.com/KLab-AI3/Graph-Processing-Attention-IPDPS-2025)|
-|**2025-01-31**|**Improving vision-language alignment with graph spiking hybrid Networks**|Siyu Zhang et.al.|[2501.19069v1](http://arxiv.org/abs/2501.19069v1)|null|
-|**2025-01-30**|**Semantic Web and Creative AI -- A Technical Report from ISWS 2023**|Raia Abu Ahmad et.al.|[2501.18542v1](http://arxiv.org/abs/2501.18542v1)|null|
-|**2025-01-30**|**Leveraging LLM Agents for Automated Optimization Modeling for SASP Problems: A Graph-RAG based Approach**|Tianpeng Pan et.al.|[2501.18320v1](http://arxiv.org/abs/2501.18320v1)|null|
-|**2025-01-30**|**Mixed-Precision Graph Neural Quantization for Low Bit Large Language Models**|Wanlong Liu et.al.|[2501.18154v1](http://arxiv.org/abs/2501.18154v1)|null|
-|**2025-01-30**|**Self-supervised Quantized Representation for Seamlessly Integrating Knowledge Graphs with Large Language Models**|Qika Lin et.al.|[2501.18119v1](http://arxiv.org/abs/2501.18119v1)|null|
-|**2025-01-29**|**Hybrid Graphs for Table-and-Text based Question Answering using LLMs**|Ankush Agarwal et.al.|[2501.17767v1](http://arxiv.org/abs/2501.17767v1)|null|
-|**2025-01-29**|**Query-Aware Learnable Graph Pooling Tokens as Prompt for Large Language Models**|Wooyoung Kim et.al.|[2501.17549v1](http://arxiv.org/abs/2501.17549v1)|null|
-|**2025-01-29**|**General Scene Adaptation for Vision-and-Language Navigation**|Haodong Hong et.al.|[2501.17403v1](http://arxiv.org/abs/2501.17403v1)|[link](https://github.com/honghd16/gsa-vln)|
-|**2025-01-28**|**Comprehensive Evaluation for a Large Scale Knowledge Graph Question Answering Service**|Saloni Potdar et.al.|[2501.17270v1](http://arxiv.org/abs/2501.17270v1)|null|
-|**2025-01-28**|**FactCG: Enhancing Fact Checkers with Graph-Based Multi-Hop Data**|Deren Lei et.al.|[2501.17144v1](http://arxiv.org/abs/2501.17144v1)|[link](https://github.com/derenlei/factcg)|
-|**2025-01-28**|**LLM-AutoDiff: Auto-Differentiate Any LLM Workflow**|Li Yin et.al.|[2501.16673v2](http://arxiv.org/abs/2501.16673v2)|[link](https://github.com/sylphai-inc/adalflow)|
-|**2025-01-27**|**360Brew: A Decoder-only Foundation Model for Personalized Ranking and Recommendation**|Hamed Firooz et.al.|[2501.16450v3](http://arxiv.org/abs/2501.16450v3)|null|
-|**2025-01-27**|**Raiders of the Lost Dependency: Fixing Dependency Conflicts in Python using LLMs**|Antony Bartlett et.al.|[2501.16191v1](http://arxiv.org/abs/2501.16191v1)|null|
-|**2025-01-27**|**Harnessing Diverse Perspectives: A Multi-Agent Framework for Enhanced Error Detection in Knowledge Graphs**|Yu Li et.al.|[2501.15791v1](http://arxiv.org/abs/2501.15791v1)|[link](https://github.com/kse-eleven/makged)|
-|**2025-01-27**|**Automatic Feedback Generation for Short Answer Questions using Answer Diagnostic Graphs**|Momoka Furuhashi et.al.|[2501.15777v1](http://arxiv.org/abs/2501.15777v1)|null|
-|**2025-01-26**|**Transformer-Based Multimodal Knowledge Graph Completion with Link-Aware Contexts**|Haodi Ma et.al.|[2501.15688v1](http://arxiv.org/abs/2501.15688v1)|null|
-|**2025-01-26**|**How to Mitigate Information Loss in Knowledge Graphs for GraphRAG: Leveraging Triple Context Restoration and Query-Driven Feedback**|Manzong Huang et.al.|[2501.15378v1](http://arxiv.org/abs/2501.15378v1)|null|
-|**2025-01-24**|**Explaining Categorical Feature Interactions Using Graph Covariance and LLMs**|Cencheng Shen et.al.|[2501.14932v1](http://arxiv.org/abs/2501.14932v1)|null|
-|**2025-01-24**|**Causal Graphs Meet Thoughts: Enhancing Complex Reasoning in Graph-Augmented LLMs**|Hang Luo et.al.|[2501.14892v1](http://arxiv.org/abs/2501.14892v1)|null|
-|**2025-01-24**|**GraPPI: A Retrieve-Divide-Solve GraphRAG Framework for Large-scale Protein-protein Interaction Exploration**|Ziwen Li et.al.|[2501.16382v1](http://arxiv.org/abs/2501.16382v1)|[link](https://github.com/aaronli43/grappi)|
-|**2025-01-24**|**Evaluating and Improving Graph to Text Generation with Large Language Models**|Jie He et.al.|[2501.14497v1](http://arxiv.org/abs/2501.14497v1)|[link](https://github.com/probe2/kg_text)|
-|**2025-01-24**|**Fast Think-on-Graph: Wider, Deeper and Faster Reasoning of Large Language Model on Knowledge Graph**|Xujian Liang et.al.|[2501.14300v1](http://arxiv.org/abs/2501.14300v1)|[link](https://github.com/dosonleung/fasttog)|
-|**2025-01-24**|**Top Ten Challenges Towards Agentic Neural Graph Databases**|Jiaxin Bai et.al.|[2501.14224v1](http://arxiv.org/abs/2501.14224v1)|null|
-|**2025-01-23**|**GraphRAG under Fire**|Jiacheng Liang et.al.|[2501.14050v1](http://arxiv.org/abs/2501.14050v1)|null|
-|**2025-01-23**|**EICopilot: Search and Explore Enterprise Information over Large-scale Knowledge Graphs with LLM-driven Agents**|Yuhui Yun et.al.|[2501.13746v1](http://arxiv.org/abs/2501.13746v1)|null|
-|**2025-01-23**|**Pseudocode-Injection Magic: Enabling LLMs to Tackle Graph Computational Tasks**|Chang Gong et.al.|[2501.13731v1](http://arxiv.org/abs/2501.13731v1)|null|
-|**2025-01-23**|**CAPRAG: A Large Language Model Solution for Customer Service and Automatic Reporting using Vector and Graph Retrieval-Augmented Generation**|Hamza Landolsi et.al.|[2501.13993v1](http://arxiv.org/abs/2501.13993v1)|null|
-|**2025-01-23**|**Dual-Branch HNSW Approach with Skip Bridges and LID-Driven Optimization**|Hy Nguyen et.al.|[2501.13992v1](http://arxiv.org/abs/2501.13992v1)|null|
-|**2025-01-23**|**Comprehensive Modeling and Question Answering of Cancer Clinical Practice Guidelines using LLMs**|Bhumika Gupta et.al.|[2501.13984v1](http://arxiv.org/abs/2501.13984v1)|null|
-|**2025-01-21**|**LLM-Assisted Knowledge Graph Completion for Curriculum and Domain Modelling in Personalized Higher Education Recommendations**|Hasan Abu-Rasheed et.al.|[2501.12300v1](http://arxiv.org/abs/2501.12300v1)|null|
-|**2025-01-21**|**Divide-Then-Aggregate: An Efficient Tool Learning Method via Parallel Tool Invocation**|Dongsheng Zhu et.al.|[2501.12432v1](http://arxiv.org/abs/2501.12432v1)|null|
-|**2025-01-21**|**InsTALL: Context-aware Instructional Task Assistance with Multi-modal Large Language Models**|Pha Nguyen et.al.|[2501.12231v1](http://arxiv.org/abs/2501.12231v1)|null|
-|**2025-01-21**|**Leveraging Graph Structures and Large Language Models for End-to-End Synthetic Task-Oriented Dialogues**|Maya Medjad et.al.|[2501.11977v1](http://arxiv.org/abs/2501.11977v1)|[link](https://github.com/reecall/graphtod)|
-|**2025-01-21**|**Bridging Visualization and Optimization: Multimodal Large Language Models on Graph-Structured Combinatorial Optimization**|Jie Zhao et.al.|[2501.11968v1](http://arxiv.org/abs/2501.11968v1)|null|
-|**2025-01-21**|**A Survey of Graph Retrieval-Augmented Generation for Customized Large Language Models**|Qinggang Zhang et.al.|[2501.13958v1](http://arxiv.org/abs/2501.13958v1)|[link](https://github.com/deep-polyu/awesome-graphrag)|
-|**2025-01-21**|**Network-informed Prompt Engineering against Organized Astroturf Campaigns under Extreme Class Imbalance**|Nikos Kanakaris et.al.|[2501.11849v2](http://arxiv.org/abs/2501.11849v2)|[link](https://github.com/nkanak/brag-fake-news-campaigns)|
-|**2025-01-21**|**Large Language Models Meet Graph Neural Networks for Text-Numeric Graph Reasoning**|Haoran Song et.al.|[2501.16361v1](http://arxiv.org/abs/2501.16361v1)|null|
-|**2025-01-20**|**Zep: A Temporal Knowledge Graph Architecture for Agent Memory**|Preston Rasmussen et.al.|[2501.13956v1](http://arxiv.org/abs/2501.13956v1)|[link](https://github.com/getzep/graphiti)|
-|**2025-01-20**|**Explainable Lane Change Prediction for Near-Crash Scenarios Using Knowledge Graph Embeddings and Retrieval Augmented Generation**|M. Manzour et.al.|[2501.11560v1](http://arxiv.org/abs/2501.11560v1)|null|
-|**2025-01-20**|**Each Graph is a New Language: Graph Learning with LLMs**|Huachi Zhou et.al.|[2501.11478v2](http://arxiv.org/abs/2501.11478v2)|null|
-|**2025-01-20**|**Few-shot Policy (de)composition in Conversational Question Answering**|Kyle Erwin et.al.|[2501.11335v1](http://arxiv.org/abs/2501.11335v1)|null|
-|**2025-01-20**|**Reasoning Language Models: A Blueprint**|Maciej Besta et.al.|[2501.11223v3](http://arxiv.org/abs/2501.11223v3)|[link](https://github.com/spcl/x1)|
-|**2025-01-19**|**IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems**|Elad Levi et.al.|[2501.11067v1](http://arxiv.org/abs/2501.11067v1)|[link](https://github.com/plurai-ai/intellagent)|
+##### **Typhoon T1: An Open Thai Reasoning Model**
+2502.09042v1 by Pittawat Taveekitworachai, Potsawee Manakul, Kasima Tharnpipitchai, Kunat Pipatanakul
 
-#### Abstracts
-##### **Visual Graph Question Answering with ASP and LLMs for Language Parsing**
-2502.09211v1 by Jakob Johannes Bauer, Thomas Eiter, Nelson Higuera Ruiz, Johannes Oetsch
+This paper introduces Typhoon T1, an open effort to develop an open Thai
+reasoning model. A reasoning model is a relatively new type of generative model
+built on top of large language models (LLMs). A reasoning model generates a
+long chain of thought before arriving at a final answer, an approach found to
+improve performance on complex tasks. However, details on developing such a
+model are limited, especially for reasoning models that can generate traces in
+a low-resource language. Typhoon T1 presents an open effort that dives into the
+details of developing a reasoning model in a more cost-effective way by
+leveraging supervised fine-tuning using open datasets, instead of reinforcement
+learning. This paper shares the details about synthetic data generation and
+training, as well as our dataset and model weights. Additionally, we provide
+insights gained from developing a reasoning model that generalizes across
+domains and is capable of generating reasoning traces in a low-resource
+language, using Thai as an example. We hope this open effort provides a
+foundation for further research in this field.
 
-Visual Question Answering (VQA) is a challenging problem that requires to
-process multimodal input. Answer-Set Programming (ASP) has shown great
-potential in this regard to add interpretability and explainability to modular
-VQA architectures. In this work, we address the problem of how to integrate ASP
-with modules for vision and natural language processing to solve a new and
-demanding VQA variant that is concerned with images of graphs (not graphs in
-symbolic form). Images containing graph-based structures are an ubiquitous and
-popular form of visualisation. Here, we deal with the particular problem of
-graphs inspired by transit networks, and we introduce a novel dataset that
-amends an existing one by adding images of graphs that resemble metro lines.
-Our modular neuro-symbolic approach combines optical graph recognition for
-graph parsing, a pretrained optical character recognition neural network for
-parsing labels, Large Language Models (LLMs) for language processing, and ASP
-for reasoning. This method serves as a first baseline and achieves an overall
-average accuracy of 73% on the dataset. Our evaluation provides further
-evidence of the potential of modular neuro-symbolic systems, in particular with
-pretrained models that do not involve any further training and logic
-programming for reasoning, to solve complex VQA tasks.
+摘要：本文介紹 Typhoon T1，這是一個開放的計畫，旨在開發開放的泰語推理模型。推理模型是一種相對較新的生成模型，建構於大型語言模型 (LLM) 之上。推理模型會在得出最終答案之前產生一連串的思考，這種方法被發現可以改善複雜任務的效能。然而，關於如何開發這種模型的詳細資訊有限，特別是對於能夠以低資源語言產生軌跡的推理模型而言。Typhoon T1 提出了一個開放的計畫，深入探討如何以更具成本效益的方式開發推理模型，方法是利用開放式資料集進行監督微調，而不是強化學習。本文分享了關於合成資料產生和訓練的詳細資訊，以及我們的資料集和模型權重。此外，我們提供了從開發推理模型中獲得的見解，該模型可以跨領域概括，並能夠以低資源語言產生推理軌跡，以泰語為例。我們希望這個開放的計畫能為此領域的進一步研究奠定基礎。
 
-摘要：視覺問答（VQA）是一項具有挑戰性的問題，需要處理多模態輸入。答案集程式設計（ASP）在這方面顯示出巨大的潛力，可以為模組化 VQA 架構增加可解釋性和說明性。在這項工作中，我們探討如何將 ASP 與視覺和自然語言處理模組整合，以解決一個新的且要求嚴格的 VQA 變體，該變體與圖形影像（而非符號形式的圖形）有關。包含圖形結構的影像是一種普遍且流行的可視化形式。在這裡，我們處理受交通網路啟發的圖形特定問題，並引入一個新的資料集，透過新增類似地鐵路線的圖形影像來修正現有資料集。我們的模組化神經符號方法結合光學圖形辨識進行圖形解析、預先訓練的光學字元辨識神經網路進行標籤解析、大型語言模型（LLM）進行語言處理，以及 ASP 進行推理。此方法作為第一個基準，在資料集上達到 73% 的整體平均準確度。我們的評估進一步證明了模組化神經符號系統的潛力，特別是預先訓練的模型，這些模型不涉及任何進一步的訓練和邏輯程式設計進行推理，以解決複雜的 VQA 任務。
+##### **Mechanistic Unveiling of Transformer Circuits: Self-Influence as a Key to Model Reasoning**
+2502.09022v1 by Lin Zhang, Lijie Hu, Di Wang
 
-##### **Representation Learning to Advance Multi-institutional Studies with Electronic Health Record Data**
-2502.08547v1 by Doudou Zhou, Han Tong, Linshanshan Wang, Suqi Liu, Xin Xiong, Ziming Gan, Romain Griffier, Boris Hejblum, Yun-Chung Liu, Chuan Hong, Clara-Lea Bonzel, Tianrun Cai, Kevin Pan, Yuk-Lam Ho, Lauren Costa, Vidul A. Panickan, J. Michael Gaziano, Kenneth Mandl, Vianney Jouhet, Rodolphe Thiebaut, Zongqi Xia, Kelly Cho, Katherine Liao, Tianxi Cai
+Transformer-based language models have achieved notable success, yet their
+internal reasoning mechanisms remain largely opaque due to complex non-linear
+interactions and high-dimensional operations. While previous research suggests
+that these models implicitly encode reasoning structures, it is still unclear
+which specific multi-step thought processes they employ to solve complex tasks.
+To address this gap, we propose a novel mechanistic interpretability framework,
+SICAF, designed to trace and analyze the reasoning strategies that language
+models use in multi-step inference tasks. By employing circuit analysis and
+self-influence functions, we quantify the evolving importance of each token
+throughout the reasoning process, thereby mapping the pathways the model uses
+for inference. Applying SICAF to the GPT-2 model on the Indirect Object
+Identification (IOI) prediction task, we demonstrate how underlying circuits
+can reveal a reasoning process that aligns with human interpretability,
+offering new insights into the model's internal logic.
 
-The adoption of EHRs has expanded opportunities to leverage data-driven
-algorithms in clinical care and research. A major bottleneck in effectively
-conducting multi-institutional EHR studies is the data heterogeneity across
-systems with numerous codes that either do not exist or represent different
-clinical concepts across institutions. The need for data privacy further limits
-the feasibility of including multi-institutional patient-level data required to
-study similarities and differences across patient subgroups. To address these
-challenges, we developed the GAME algorithm. Tested and validated across 7
-institutions and 2 languages, GAME integrates data in several levels: (1) at
-the institutional level with knowledge graphs to establish relationships
-between codes and existing knowledge sources, providing the medical context for
-standard codes and their relationship to each other; (2) between institutions,
-leveraging language models to determine the relationships between
-institution-specific codes with established standard codes; and (3) quantifying
-the strength of the relationships between codes using a graph attention
-network. Jointly trained embeddings are created using transfer and federated
-learning to preserve data privacy. In this study, we demonstrate the
-applicability of GAME in selecting relevant features as inputs for AI-driven
-algorithms in a range of conditions, e.g., heart failure, rheumatoid arthritis.
-We then highlight the application of GAME harmonized multi-institutional EHR
-data in a study of Alzheimer's disease outcomes and suicide risk among patients
-with mental health disorders, without sharing patient-level data outside
-individual institutions.
+摘要：基於 Transformer 的語言模型已取得顯著的成功，但由於複雜的非線性交互和高維度運算，它們的內部推理機制在很大程度上仍然不透明。儘管先前的研究表明這些模型隱含地編碼推理結構，但目前仍不清楚它們採用哪些具體的多步驟思考過程來解決複雜任務。為了解決這個差距，我們提出了一個新穎的機制可解釋性框架 SICAF，旨在追蹤和分析語言模型在多步驟推理任務中使用的推理策略。通過採用電路分析和自影響函數，我們量化了推理過程中每個標記的演化重要性，從而繪製出模型用於推理的路徑。將 SICAF 應用於 GPT-2 模型上的間接賓語識別 (IOI) 預測任務，我們展示了底層電路如何揭示與人類可解釋性相符的推理過程，從而對模型的內部邏輯提供了新的見解。
 
-摘要：電子健康紀錄的採用擴大了在臨床照護和研究中利用資料驅動演算法的機會。在有效進行多機構電子健康紀錄研究時，一個主要的瓶頸是系統間資料異質性，其中有許多代碼在機構間不存在或表示不同的臨床概念。資料隱私的需求進一步限制了納入多機構患者層級資料的可行性，而這些資料對於研究患者亞群之間的相似性和差異性是必要的。為了應對這些挑戰，我們開發了 GAME 演算法。GAME 已在 7 個機構和 2 種語言中進行測試和驗證，它整合了多個層級的資料：(1) 在機構層級，使用知識圖表來建立代碼和現有知識來源之間的關係，為標準代碼及其彼此之間的關係提供醫療背景；(2) 在機構之間，利用語言模型來確定機構特定代碼與已建立的標準代碼之間的關係；(3) 使用圖形注意網路量化代碼之間關係的強度。使用遷移和聯合學習建立聯合訓練的嵌入，以保護資料隱私。在本研究中，我們展示了 GAME 在選擇相關特徵作為 AI 驅動演算法輸入時的適用性，適用於各種情況，例如心臟衰竭、類風濕性關節炎。然後，我們重點介紹了 GAME 和諧化多機構電子健康紀錄資料在阿茲海默症疾病結果和精神疾病患者自殺風險研究中的應用，而無需在個別機構之外共享患者層級資料。
+##### **EventSTR: A Benchmark Dataset and Baselines for Event Stream based Scene Text Recognition**
+2502.09020v1 by Xiao Wang, Jingtao Jiang, Dong Li, Futian Wang, Lin Zhu, Yaowei Wang, Yongyong Tian, Jin Tang
 
-##### **Trustworthy GNNs with LLMs: A Systematic Review and Taxonomy**
-2502.08353v1 by Ruizhan Xue, Huimin Deng, Fang He, Maojun Wang, Zeyu Zhang
+Mainstream Scene Text Recognition (STR) algorithms are developed based on RGB
+cameras which are sensitive to challenging factors such as low illumination,
+motion blur, and cluttered backgrounds. In this paper, we propose to recognize
+the scene text using bio-inspired event cameras by collecting and annotating a
+large-scale benchmark dataset, termed EventSTR. It contains 9,928
+high-definition (1280 * 720) event samples and involves both Chinese and
+English characters. We also benchmark multiple STR algorithms as the baselines
+for future works to compare. In addition, we propose a new event-based scene
+text recognition framework, termed SimC-ESTR. It first extracts the event
+features using a visual encoder and projects them into tokens using a Q-former
+module. More importantly, we propose to augment the vision tokens based on a
+memory mechanism before feeding into the large language models. A
+similarity-based error correction mechanism is embedded within the large
+language model to correct potential minor errors fundamentally based on
+contextual information. Extensive experiments on the newly proposed EventSTR
+dataset and two simulation STR datasets fully demonstrate the effectiveness of
+our proposed model. We believe that the dataset and algorithmic model can
+innovatively propose an event-based STR task and are expected to accelerate the
+application of event cameras in various industries. The source code and
+pre-trained models will be released on https://github.com/Event-AHU/EventSTR
 
-With the extensive application of Graph Neural Networks (GNNs) across various
-domains, their trustworthiness has emerged as a focal point of research. Some
-existing studies have shown that the integration of large language models
-(LLMs) can improve the semantic understanding and generation capabilities of
-GNNs, which in turn improves the trustworthiness of GNNs from various aspects.
-Our review introduces a taxonomy that offers researchers a clear framework for
-comprehending the principles and applications of different methods and helps
-clarify the connections and differences among various approaches. Then we
-systematically survey representative approaches along the four categories of
-our taxonomy. Through our taxonomy, researchers can understand the applicable
-scenarios, potential advantages, and limitations of each approach for the the
-trusted integration of GNNs with LLMs. Finally, we present some promising
-directions of work and future trends for the integration of LLMs and GNNs to
-improve model trustworthiness.
+摘要：主流場景文字辨識（STR）演算法是基於對低光源、動態模糊和雜亂背景等挑戰性因素敏感的 RGB 相機開發的。在本文中，我們提出使用生物靈感事件相機辨識場景文字，方法是收集和標註一個稱為 EventSTR 的大規模基準資料集。它包含 9,928 個高畫質（1280 * 720）事件範例，並包含中文字和英文字元。我們也基準化多個 STR 演算法作為未來工作的基準，以進行比較。此外，我們提出一個新的基於事件的場景文字辨識架構，稱為 SimC-ESTR。它首先使用視覺編碼器萃取事件特徵，並使用 Q-former 模組將它們投影到代幣中。更重要的是，我們提出在輸入大型語言模型之前，基於記憶機制擴充視覺代幣。一個基於相似性的錯誤修正機制嵌入在大型語言模型中，以根據上下文資訊從根本上修正潛在的輕微錯誤。在最新提出的 EventSTR 資料集和兩個模擬 STR 資料集上進行的廣泛實驗充分證明了我們提出的模型的有效性。我們相信，該資料集和演算法模型可以創新地提出一個基於事件的 STR 任務，並有望加速事件相機在各個產業的應用。原始碼和預先訓練的模型將在 https://github.com/Event-AHU/EventSTR 上釋出
 
-摘要：隨著圖神經網路 (GNN) 在各種領域的廣泛應用，其可信度已成為研究的焦點。一些現有研究表明，整合大型語言模型 (LLM) 可以提升 GNN 的語意理解和生成能力，進而從各方面提升 GNN 的可信度。我們的評論介紹了一種分類法，為研究人員提供了一個清晰的架構，用於理解不同方法的原理和應用，並有助於釐清各種方法之間的關聯和差異。然後，我們系統性地針對分類法的四個類別進行代表性方法的調查。研究人員透過我們的分類法，可以了解每種方法在 GNN 與 LLM 的可信整合中適用的場景、潛在優點和限制。最後，我們提出 LLM 與 GNN 整合的一些有前景的工作方向和未來趨勢，以提升模型的可信度。
+##### **Zero-shot Concept Bottleneck Models**
+2502.09018v1 by Shin'ya Yamaguchi, Kosuke Nishida, Daiki Chijiwa, Yasutoshi Ida
 
-##### **Graph Foundation Models for Recommendation: A Comprehensive Survey**
-2502.08346v1 by Bin Wu, Yihang Wang, Yuanhao Zeng, Jiawei Liu, Jiashu Zhao, Cheng Yang, Yawen Li, Long Xia, Dawei Yin, Chuan Shi
+Concept bottleneck models (CBMs) are inherently interpretable and
+intervenable neural network models, which explain their final label prediction
+by the intermediate prediction of high-level semantic concepts. However, they
+require target task training to learn input-to-concept and concept-to-label
+mappings, incurring target dataset collections and training resources. In this
+paper, we present \textit{zero-shot concept bottleneck models} (Z-CBMs), which
+predict concepts and labels in a fully zero-shot manner without training neural
+networks. Z-CBMs utilize a large-scale concept bank, which is composed of
+millions of vocabulary extracted from the web, to describe arbitrary input in
+various domains. For the input-to-concept mapping, we introduce concept
+retrieval, which dynamically finds input-related concepts by the cross-modal
+search on the concept bank. In the concept-to-label inference, we apply concept
+regression to select essential concepts from the retrieved concepts by sparse
+linear regression. Through extensive experiments, we confirm that our Z-CBMs
+provide interpretable and intervenable concepts without any additional
+training. Code will be available at https://github.com/yshinya6/zcbm.
 
-Recommender systems (RS) serve as a fundamental tool for navigating the vast
-expanse of online information, with deep learning advancements playing an
-increasingly important role in improving ranking accuracy. Among these, graph
-neural networks (GNNs) excel at extracting higher-order structural information,
-while large language models (LLMs) are designed to process and comprehend
-natural language, making both approaches highly effective and widely adopted.
-Recent research has focused on graph foundation models (GFMs), which integrate
-the strengths of GNNs and LLMs to model complex RS problems more efficiently by
-leveraging the graph-based structure of user-item relationships alongside
-textual understanding. In this survey, we provide a comprehensive overview of
-GFM-based RS technologies by introducing a clear taxonomy of current
-approaches, diving into methodological details, and highlighting key challenges
-and future directions. By synthesizing recent advancements, we aim to offer
-valuable insights into the evolving landscape of GFM-based recommender systems.
+摘要：概念瓶頸模型 (CBM) 本質上是可解釋且可干預的神經網路模型，它們透過對高階語意概念的中間預測來解釋其最終標籤預測。然而，它們需要目標任務訓練來學習輸入到概念和概念到標籤的對應，導致目標資料集收集和訓練資源。在本文中，我們展示了「零次學習概念瓶頸模型」(Z-CBM)，它以完全零次學習的方式預測概念和標籤，而無需訓練神經網路。Z-CBM 利用一個大型概念庫，其中包含從網路中擷取的數百萬個詞彙，來描述各種領域中的任意輸入。對於輸入到概念的對應，我們引入了概念擷取，它透過對概念庫的跨模態搜尋，動態地找出與輸入相關的概念。在概念到標籤的推論中，我們應用概念迴歸，透過稀疏線性迴歸從擷取的概念中選擇必要的概念。透過廣泛的實驗，我們確認我們的 Z-CBM 在沒有任何額外訓練的情況下提供了可解釋且可干預的概念。程式碼將可在 https://github.com/yshinya6/zcbm 取得。
 
-摘要：推薦系統 (RS) 是導航廣闊線上資訊的基本工具，深度學習的進展在提升排名準確度方面扮演著日益重要的角色。在這些進展中，圖形神經網路 (GNN) 擅長萃取高階結構資訊，而大型語言模型 (LLM) 則設計用於處理和理解自然語言，這兩種方法都非常有效且廣泛採用。最近的研究專注於圖形基礎模型 (GFM)，它整合了 GNN 和 LLM 的優點，透過利用使用者與項目關係的圖形化結構，以及文字理解，更有效率地建構複雜的 RS 問題。在這項調查中，我們提供 GFM-based RS 技術的全面概觀，介紹當前方法的明確分類法，深入探討方法論的細節，並強調關鍵挑戰和未來方向。透過綜合最近的進展，我們旨在提供有價值的見解，了解 GFM-based 推薦系統不斷演變的樣貌。
+##### **Diversity Enhances an LLM's Performance in RAG and Long-context Task**
+2502.09017v1 by Zhchao Wang, Bin Bi, Yanqi Luo, Sitaram Asur, Claire Na Cheng
 
-##### **Self-Evaluation for Job-Shop Scheduling**
-2502.08684v1 by Imanol Echeverria, Maialen Murua, Roberto Santana
+The rapid advancements in large language models (LLMs) have highlighted the
+challenge of context window limitations, primarily due to the quadratic time
+complexity of the self-attention mechanism (\(O(N^2)\), where \(N\) denotes the
+context window length). This constraint impacts tasks such as
+retrieval-augmented generation (RAG) in question answering (Q\&A) and long
+context summarization. A common approach involves selecting content with the
+highest similarity to the query; however, this often leads to redundancy and
+the exclusion of diverse yet relevant information. Building on principles from
+Maximal Marginal Relevance (MMR) and Farthest Point Sampling (FPS), we
+integrate diversity into the content selection process. Our findings reveal
+that incorporating diversity substantially increases the recall of selecting
+relevant sentences or chunks before LLM-based Q\&A and summarization. These
+results highlight the importance of maintaining diversity in future LLM
+applications to further improve summarization and Q\&A outcomes.
 
-Combinatorial optimization problems, such as scheduling and route planning,
-are crucial in various industries but are computationally intractable due to
-their NP-hard nature. Neural Combinatorial Optimization methods leverage
-machine learning to address these challenges but often depend on sequential
-decision-making, which is prone to error accumulation as small mistakes
-propagate throughout the process. Inspired by self-evaluation techniques in
-Large Language Models, we propose a novel framework that generates and
-evaluates subsets of assignments, moving beyond traditional stepwise
-approaches. Applied to the Job-Shop Scheduling Problem, our method integrates a
-heterogeneous graph neural network with a Transformer to build a policy model
-and a self-evaluation function. Experimental validation on challenging,
-well-known benchmarks demonstrates the effectiveness of our approach,
-surpassing state-of-the-art methods.
+摘要：大型語言模型 (LLM) 的快速進步凸顯了上下文視窗限制的挑戰，這主要是由於自注意力機制的二次時間複雜度（\(O(N^2)\)），其中 \(N\) 表示上下文視窗長度。此限制會影響任務，例如問答 (Q&A) 中的檢索增強生成 (RAG) 和長文摘要。一種常見的方法涉及選擇與查詢最相似的內容；然而，這通常會導致冗餘，並排除多樣化但相關的資訊。我們根據最大邊際相關性 (MMR) 和最遠點取樣 (FPS) 的原則，將多樣性整合到內容選擇過程中。我們的研究結果顯示，在基於 LLM 的問答和摘要之前，納入多樣性會大幅增加選擇相關句子或區塊的召回率。這些結果突顯了在未來的 LLM 應用中維持多樣性的重要性，以進一步改善摘要和問答的結果。
 
-摘要：組合優化問題，例如排程和路線規劃，在各行各業中至關重要，但由於它們的 NP 難度，在計算上難以處理。神經組合優化方法利用機器學習來解決這些挑戰，但通常依賴於序貫決策制定，而序貫決策制定容易發生錯誤累積，因為小錯誤會在整個過程中傳播。受大型語言模型中的自我評估技術啟發，我們提出了一個新的框架，可生成和評估作業子集，超越傳統的分步方法。應用於工作車間排程問題，我們的方法將異質圖神經網路與 Transformer 整合在一起，以建立策略模型和自我評估函數。在具有挑戰性的著名基準上的實驗驗證證明了我們方法的有效性，超越了最先進的方法。
+##### **Hope vs. Hate: Understanding User Interactions with LGBTQ+ News Content in Mainstream US News Media through the Lens of Hope Speech**
+2502.09004v1 by Jonathan Pofcher, Christopher M. Homan, Randall Sell, Ashiqur R. KhudaBukhsh
 
-##### **Improving Existing Optimization Algorithms with LLMs**
-2502.08298v1 by Camilo Chacón Sartori, Christian Blum
+This paper makes three contributions. First, via a substantial corpus of
+1,419,047 comments posted on 3,161 YouTube news videos of major US cable news
+outlets, we analyze how users engage with LGBTQ+ news content. Our analyses
+focus both on positive and negative content. In particular, we construct a
+fine-grained hope speech classifier that detects positive (hope speech),
+negative, neutral, and irrelevant content. Second, in consultation with a
+public health expert specializing on LGBTQ+ health, we conduct an annotation
+study with a balanced and diverse political representation and release a
+dataset of 3,750 instances with fine-grained labels and detailed annotator
+demographic information. Finally, beyond providing a vital resource for the
+LGBTQ+ community, our annotation study and subsequent in-the-wild assessments
+reveal (1) strong association between rater political beliefs and how they rate
+content relevant to a marginalized community; (2) models trained on individual
+political beliefs exhibit considerable in-the-wild disagreement; and (3)
+zero-shot large language models (LLMs) align more with liberal raters.
 
-The integration of Large Language Models (LLMs) into optimization has created
-a powerful synergy, opening exciting research opportunities. This paper
-investigates how LLMs can enhance existing optimization algorithms. Using their
-pre-trained knowledge, we demonstrate their ability to propose innovative
-heuristic variations and implementation strategies. To evaluate this, we
-applied a non-trivial optimization algorithm, Construct, Merge, Solve and Adapt
-(CMSA) -- a hybrid metaheuristic for combinatorial optimization problems that
-incorporates a heuristic in the solution construction phase. Our results show
-that an alternative heuristic proposed by GPT-4o outperforms the
-expert-designed heuristic of CMSA, with the performance gap widening on larger
-and denser graphs. Project URL: https://imp-opt-algo-llms.surge.sh/
+摘要：本文做出了三項貢獻。首先，透過一個龐大的語料庫，其中包含 1,419,047 則評論，這些評論張貼在 3,161 部美國有線新聞頻道的 YouTube 新聞影片上，我們分析了使用者如何參與 LGBTQ+ 新聞內容。我們的分析重點在於正面和負面的內容。特別是，我們建構了一個細緻的希望言論分類器，用來偵測正面的（希望言論）、負面的、中立的和不相關的內容。其次，在諮詢了一位專門研究 LGBTQ+ 健康的公共衛生專家後，我們進行了一項標註研究，其中包含平衡且多元的政治代表性，並發布了一個包含 3,750 個實例的資料集，其中包含細緻的標籤和詳細的標註者人口統計資訊。最後，除了為 LGBTQ+ 社群提供重要的資源外，我們的標註研究和後續的實際評估揭示了：(1) 評分者的政治信仰與他們如何評分與邊緣化社群相關的內容之間有很強的關聯性；(2) 根據個人政治信仰訓練的模型在實際應用中表現出相當大的分歧；(3) 零次學習大型語言模型 (LLM) 與自由派評分者的看法更一致。
 
-摘要：大型语言模型 (LLM) 与优化相结合，创造了一种强大的协同作用，开启了令人兴奋的研究机会。本文探讨了 LLM 如何增强现有的优化算法。利用其预先训练的知识，我们展示了它们提出创新启发式变体和实施策略的能力。为了评估这一点，我们应用了一种非平凡的优化算法，构建、合并、求解和适应 (CMSA)——一种用于组合优化问题的混合元启发式算法，它在求解构建阶段纳入了启发式算法。我们的结果表明，GPT-4o 提出的替代启发式算法优于 CMSA 的专家设计的启发式算法，并且随着图形变得更大、更密集，性能差距也在扩大。项目网址：https://imp-opt-algo-llms.surge.sh/
+##### **RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models**
+2502.09003v1 by Quan Wei, Chung-Yiu Yau, Hoi-To Wai, Yang, Zhao, Dongyeop Kang, Youngsuk Park, Mingyi Hong
 
-##### **ACCESS : A Benchmark for Abstract Causal Event Discovery and Reasoning**
-2502.08148v1 by Vy Vo, Lizhen Qu, Tao Feng, Yuncheng Hua, Xiaoxi Kang, Songhai Fan, Tim Dwyer, Lay-Ki Soon, Gholamreza Haffari
+Supervised fine-tuning is a standard method for adapting pre-trained large
+language models (LLMs) to downstream tasks. Quantization has been recently
+studied as a post-training technique for efficient LLM deployment. To obtain
+quantized fine-tuned LLMs, conventional pipelines would first fine-tune the
+pre-trained models, followed by post-training quantization. This often yields
+suboptimal performance as it fails to leverage the synergy between fine-tuning
+and quantization. To effectively realize low-bit quantization of weights,
+activations, and KV caches in LLMs, we propose an algorithm named Rotated
+Straight-Through-Estimator (RoSTE), which combines quantization-aware
+supervised fine-tuning (QA-SFT) with an adaptive rotation strategy that
+identifies an effective rotation configuration to reduce activation outliers.
+We provide theoretical insights on RoSTE by analyzing its prediction error when
+applied to an overparameterized least square quantized training problem. Our
+findings reveal that the prediction error is directly proportional to the
+quantization error of the converged weights, which can be effectively managed
+through an optimized rotation configuration. Experiments on Pythia and Llama
+models of different sizes demonstrate the effectiveness of RoSTE. Compared to
+existing post-SFT quantization baselines, our method consistently achieves
+superior performances across various tasks and different LLM architectures.
 
-Identifying cause-and-effect relationships is critical to understanding
-real-world dynamics and ultimately causal reasoning. Existing methods for
-identifying event causality in NLP, including those based on Large Language
-Models (LLMs), exhibit difficulties in out-of-distribution settings due to the
-limited scale and heavy reliance on lexical cues within available benchmarks.
-Modern benchmarks, inspired by probabilistic causal inference, have attempted
-to construct causal graphs of events as a robust representation of causal
-knowledge, where \texttt{CRAB} \citep{romanou2023crab} is one such recent
-benchmark along this line. In this paper, we introduce \texttt{ACCESS}, a
-benchmark designed for discovery and reasoning over abstract causal events.
-Unlike existing resources, \texttt{ACCESS} focuses on causality of everyday
-life events on the abstraction level. We propose a pipeline for identifying
-abstractions for event generalizations from \texttt{GLUCOSE}
-\citep{mostafazadeh-etal-2020-glucose}, a large-scale dataset of implicit
-commonsense causal knowledge, from which we subsequently extract $1,4$K causal
-pairs. Our experiments highlight the ongoing challenges of using statistical
-methods and/or LLMs for automatic abstraction identification and causal
-discovery in NLP. Nonetheless, we demonstrate that the abstract causal
-knowledge provided in \texttt{ACCESS} can be leveraged for enhancing QA
-reasoning performance in LLMs.
+摘要：監督式微調是將預訓練的大型語言模型 (LLM) 適應至下游任務的標準方法。量化最近已被研究作為一種訓練後技術，用於高效部署 LLM。為了獲得量化的微調 LLM，傳統管道會先微調預訓練模型，然後再進行訓練後量化。這通常會產生次佳效能，因為它無法利用微調和量化之間的協同效應。為了有效實現 LLM 中權重、激活和 KV 快取的低位元量化，我們提出了一種名為旋轉直通估計器 (RoSTE) 的演算法，它結合了量化感知監督式微調 (QA-SFT) 和一種自適應旋轉策略，該策略會識別有效的旋轉組態以減少激活異常值。我們透過分析 RoSTE 在應用於過度參數化最小平方量化訓練問題時的預測誤差，提供了關於 RoSTE 的理論見解。我們的研究結果顯示，預測誤差與收斂權重的量化誤差成正比，而這可透過最佳化的旋轉組態有效地管理。在不同大小的 Pythia 和 Llama 模型上進行的實驗證明了 RoSTE 的有效性。與現有的訓練後 SFT 量化基準相比，我們的模型在各種任務和不同的 LLM 架構中持續獲得優異的效能。
 
-摘要：<paragraph>找出因果關係對於理解現實世界的動態和最終的因果推理至關重要。現有的 NLP 事件因果關係識別方法，包括基於大型語言模型 (LLM) 的方法，由於規模有限且過度依賴於可用基準中的詞彙線索，在分佈外環境中表現出困難。受機率因果推論啟發的現代基準已嘗試建構事件的因果圖，作為因果知識的強健表示，其中 \texttt{CRAB} \citep{romanou2023crab} 是這條路徑上最近的一個基準。在本文中，我們介紹 \texttt{ACCESS}，一個專門設計來探索和推理抽象因果事件的基準。與現有資源不同，\texttt{ACCESS} 專注於抽象層面上日常生活事件的因果關係。我們提出一個管道，用於從 \texttt{GLUCOSE} \citep{mostafazadeh-etal-2020-glucose} 找出事件概括的抽象，\texttt{GLUCOSE} 是隱含常識因果知識的大規模資料集，我們隨後從中萃取出 1,4K 因果對。我們的實驗突顯出使用統計方法和/或 LLM 進行 NLP 中的自動抽象識別和因果發現的持續挑戰。儘管如此，我們證明了 \texttt{ACCESS} 中提供的抽象因果知識可用於增強 LLM 中的問答推理效能。</paragraph>
+##### **PixLift: Accelerating Web Browsing via AI Upscaling**
+2502.08995v1 by Yonas Atinafu, Sarthak Malla, HyunSeok Daniel Jang, Nouar Aldahoul, Matteo Varvello, Yasir Zaki
 
-##### **GCoT: Chain-of-Thought Prompt Learning for Graphs**
-2502.08092v1 by Xingtong Yu, Chang Zhou, Zhongwei Kuai, Xinming Zhang, Yuan Fang
+Accessing the internet in regions with expensive data plans and limited
+connectivity poses significant challenges, restricting information access and
+economic growth. Images, as a major contributor to webpage sizes, exacerbate
+this issue, despite advances in compression formats like WebP and AVIF. The
+continued growth of complex and curated web content, coupled with suboptimal
+optimization practices in many regions, has prevented meaningful reductions in
+web page sizes. This paper introduces PixLift, a novel solution to reduce
+webpage sizes by downscaling their images during transmission and leveraging AI
+models on user devices to upscale them. By trading computational resources for
+bandwidth, PixLift enables more affordable and inclusive web access. We address
+key challenges, including the feasibility of scaled image requests on popular
+websites, the implementation of PixLift as a browser extension, and its impact
+on user experience. Through the analysis of 71.4k webpages, evaluations of
+three mainstream upscaling models, and a user study, we demonstrate PixLift's
+ability to significantly reduce data usage without compromising image quality,
+fostering a more equitable internet.
 
-Chain-of-thought (CoT) prompting has achieved remarkable success in natural
-language processing (NLP). However, its vast potential remains largely
-unexplored for graphs. This raises an interesting question: How can we design
-CoT prompting for graphs to guide graph models to learn step by step? On one
-hand, unlike natural languages, graphs are non-linear and characterized by
-complex topological structures. On the other hand, many graphs lack textual
-data, making it difficult to formulate language-based CoT prompting. In this
-work, we propose the first CoT prompt learning framework for text-free graphs,
-GCoT. Specifically, we decompose the adaptation process for each downstream
-task into a series of inference steps, with each step consisting of
-prompt-based inference, ``thought'' generation, and thought-conditioned prompt
-learning. While the steps mimic CoT prompting in NLP, the exact mechanism
-differs significantly. Specifically, at each step, an input graph, along with a
-prompt, is first fed into a pre-trained graph encoder for prompt-based
-inference. We then aggregate the hidden layers of the encoder to construct a
-``thought'', which captures the working state of each node in the current step.
-Conditioned on this thought, we learn a prompt specific to each node based on
-the current state. These prompts are fed into the next inference step,
-repeating the cycle. To evaluate and analyze the effectiveness of GCoT, we
-conduct comprehensive experiments on eight public datasets, which demonstrate
-the advantage of our approach.
+摘要：在數據方案昂貴且連線有限的地區存取網路會造成重大挑戰，限制了資訊存取和經濟成長。圖像作為網頁大小的主要貢獻者，儘管 WebP 和 AVIF 等壓縮格式進步，但仍加劇了這個問題。複雜且經過策劃的網路內容持續成長，加上許多地區次佳的最佳化實務，已阻礙了網頁大小的顯著減少。本文介紹 PixLift，這是一種創新的解決方案，可在傳輸過程中縮小圖像大小，並利用使用者裝置上的 AI 模型來放大圖像，藉此縮小網頁大小。PixLift 透過以運算資源換取頻寬，讓網路存取更經濟實惠且更具包容性。我們解決了關鍵挑戰，包括熱門網站上縮放圖像要求的可行性、將 PixLift 實作為瀏覽器擴充功能，以及它對使用者體驗的影響。透過分析 71.4k 個網頁、評估三個主流放大模型，以及使用者研究，我們展示了 PixLift 在不影響影像品質的情況下顯著減少資料用量的能力，促進了更公平的網路。
 
-摘要：<paragraph>鏈式思考 (CoT) 提示在自然語言處理 (NLP) 中取得了顯著的成功。然而，其龐大的潛力在圖形方面仍未得到充分探索。這提出了一個有趣的問題：我們如何設計圖形的 CoT 提示來指導圖形模型逐步學習？一方面，與自然語言不同，圖形是非線性的，並且具有複雜的拓撲結構。另一方面，許多圖形缺乏文本數據，這使得難以制定基於語言的 CoT 提示。在這項工作中，我們提出了第一個適用於無文本圖形的 CoT 提示學習框架 GCoT。具體來說，我們將每個下游任務的適應過程分解為一系列推理步驟，每個步驟都包含基於提示的推理、「思想」生成以及基於思想的提示學習。雖然這些步驟模擬了 NLP 中的 CoT 提示，但具體機制卻有很大不同。具體來說，在每一步中，一個輸入圖形連同一個提示首先被輸入到一個預訓練的圖形編碼器中進行基於提示的推理。然後，我們聚合編碼器的隱藏層以構建一個「思想」，它捕獲了當前步驟中每個節點的工作狀態。基於這個思想，我們根據當前狀態學習一個特定於每個節點的提示。這些提示被輸入到下一個推理步驟中，重複這個循環。為了評估和分析 GCoT 的有效性，我們對八個公共數據集進行了全面的實驗，這證明了我們方法的優勢。</paragraph>
+##### **RLSA-PFL: Robust Lightweight Secure Aggregation with Model Inconsistency Detection in Privacy-Preserving Federated Learning**
+2502.08989v1 by Nazatul H. Sultan, Yan Bo, Yansong Gao, Seyit Camtepe, Arash Mahboubi, Hang Thanh Bui, Aufeef Chauhan, Hamed Aboutorab, Michael Bewong, Praveen Gauravaram, Rafiqul Islam, Sharif Abuadbba
 
-##### **Deep Semantic Graph Learning via LLM based Node Enhancement**
-2502.07982v1 by Chuanqi Shi, Yiyi Tao, Hang Zhang, Lun Wang, Shaoshuai Du, Yixian Shen, Yanxin Shen
+Federated Learning (FL) allows users to collaboratively train a global
+machine learning model by sharing local model only, without exposing their
+private data to a central server. This distributed learning is particularly
+appealing in scenarios where data privacy is crucial, and it has garnered
+substantial attention from both industry and academia. However, studies have
+revealed privacy vulnerabilities in FL, where adversaries can potentially infer
+sensitive information from the shared model parameters. In this paper, we
+present an efficient masking-based secure aggregation scheme utilizing
+lightweight cryptographic primitives to mitigate privacy risks. Our scheme
+offers several advantages over existing methods. First, it requires only a
+single setup phase for the entire FL training session, significantly reducing
+communication overhead. Second, it minimizes user-side overhead by eliminating
+the need for user-to-user interactions, utilizing an intermediate server layer
+and a lightweight key negotiation method. Third, the scheme is highly resilient
+to user dropouts, and the users can join at any FL round. Fourth, it can detect
+and defend against malicious server activities, including recently discovered
+model inconsistency attacks. Finally, our scheme ensures security in both
+semi-honest and malicious settings. We provide security analysis to formally
+prove the robustness of our approach. Furthermore, we implemented an end-to-end
+prototype of our scheme. We conducted comprehensive experiments and
+comparisons, which show that it outperforms existing solutions in terms of
+communication and computation overhead, functionality, and security.
 
-Graph learning has attracted significant attention due to its widespread
-real-world applications. Current mainstream approaches rely on text node
-features and obtain initial node embeddings through shallow embedding learning
-using GNNs, which shows limitations in capturing deep textual semantics. Recent
-advances in Large Language Models (LLMs) have demonstrated superior
-capabilities in understanding text semantics, transforming traditional text
-feature processing. This paper proposes a novel framework that combines Graph
-Transformer architecture with LLM-enhanced node features. Specifically, we
-leverage LLMs to generate rich semantic representations of text nodes, which
-are then processed by a multi-head self-attention mechanism in the Graph
-Transformer to capture both local and global graph structural information. Our
-model utilizes the Transformer's attention mechanism to dynamically aggregate
-neighborhood information while preserving the semantic richness provided by LLM
-embeddings. Experimental results demonstrate that the LLM-enhanced node
-features significantly improve the performance of graph learning models on node
-classification tasks. This approach shows promising results across multiple
-graph learning tasks, offering a practical direction for combining graph
-networks with language models.
+摘要：聯合式學習 (FL) 使用者可以透過僅分享本機模型，在不將其私人資料揭露給中央伺服器的情況下，共同訓練全球機器學習模型。這種分散式學習在資料隱私至關重要的場景中特別具有吸引力，並且已獲得業界和學術界的廣泛關注。然而，研究顯示 FL 中存在隱私漏洞，其中對手可能會從共享模型參數中推斷出敏感資訊。在本文中，我們提出了一種有效率的基於遮罩的安全聚合方案，利用輕量級的密碼原語來降低隱私風險。我們的方案相較於現有方法提供了多項優點。首先，它僅需要在整個 FL 訓練階段進行一次設定階段，大幅降低了通訊開銷。其次，透過消除使用者間互動的需要，利用中間伺服器層和輕量級金鑰協商方法，將使用者端的開銷降到最低。第三，該方案對使用者中斷具有高度的復原力，使用者可以在任何 FL 回合中加入。第四，它可以偵測和防禦惡意伺服器活動，包括最近發現的模型不一致攻擊。最後，我們的方案確保在半誠實和惡意設定中都能獲得安全性。我們提供了安全分析，以正式證明我們方法的穩健性。此外，我們實作了我們方案的端對端原型。我們進行了全面的實驗和比較，結果顯示，在通訊和運算開銷、功能和安全性方面，它優於現有的解決方案。
 
-摘要：圖形學習因其廣泛的現實世界應用而備受關注。目前的熱門方法依賴於文本節點特徵，並通過使用 GNN 的淺層嵌入學習來獲取初始節點嵌入，這在捕捉深度文本語義方面表現出局限性。大語言模型 (LLM) 的最新進展已證明在理解文本語義方面具有優越的能力，轉換了傳統的文本特徵處理。本文提出了一種新的框架，將圖形轉換器架構與 LLM 增強的節點特徵相結合。具體來說，我們利用 LLM 來生成文本節點的豐富語義表示，然後在圖形轉換器中由多頭自我注意機制處理，以捕捉局部和全局圖形結構信息。我們的模型利用 Transformer 的注意機制來動態聚合鄰域信息，同時保留 LLM 嵌入提供的語義豐富性。實驗結果表明，LLM 增強的節點特徵顯著提高了圖形學習模型在節點分類任務上的性能。這種方法在多個圖形學習任務中顯示出有希望的結果，為將圖形網絡與語言模型相結合提供了實用的方向。
+##### **Neural Force Field: Learning Generalized Physical Representation from a Few Examples**
+2502.08987v1 by Shiqian Li, Ruihong Shen, Chi Zhang, Yixin Zhu
 
-##### **Cardiverse: Harnessing LLMs for Novel Card Game Prototyping**
-2502.07128v1 by Danrui Li, Sen Zhang, Sam S. Sohn, Kaidong Hu, Muhammad Usman, Mubbasir Kapadia
+Physical reasoning is a remarkable human ability that enables rapid learning
+and generalization from limited experience. Current AI models, despite
+extensive training, still struggle to achieve similar generalization,
+especially in Out-of-distribution (OOD) settings. This limitation stems from
+their inability to abstract core physical principles from observations. A key
+challenge is developing representations that can efficiently learn and
+generalize physical dynamics from minimal data. Here we present Neural Force
+Field (NFF) a modeling framework built on Neural Ordinary Differential Equation
+(NODE) that learns interpretable force field representations which can be
+efficiently integrated through an Ordinary Differential Equation ( ODE) solver
+to predict object trajectories. Unlike existing approaches that rely on
+high-dimensional latent spaces, NFF captures fundamental physical concepts such
+as gravity, support, and collision in an interpretable manner. Experiments on
+two challenging physical reasoning tasks demonstrate that NFF, trained with
+only a few examples, achieves strong generalization to unseen scenarios. This
+physics-grounded representation enables efficient forward-backward planning and
+rapid adaptation through interactive refinement. Our work suggests that
+incorporating physics-inspired representations into learning systems can help
+bridge the gap between artificial and human physical reasoning capabilities.
 
-The prototyping of computer games, particularly card games, requires
-extensive human effort in creative ideation and gameplay evaluation. Recent
-advances in Large Language Models (LLMs) offer opportunities to automate and
-streamline these processes. However, it remains challenging for LLMs to design
-novel game mechanics beyond existing databases, generate consistent gameplay
-environments, and develop scalable gameplay AI for large-scale evaluations.
-This paper addresses these challenges by introducing a comprehensive automated
-card game prototyping framework. The approach highlights a graph-based indexing
-method for generating novel game designs, an LLM-driven system for consistent
-game code generation validated by gameplay records, and a gameplay AI
-constructing method that uses an ensemble of LLM-generated action-value
-functions optimized through self-play. These contributions aim to accelerate
-card game prototyping, reduce human labor, and lower barriers to entry for game
-developers.
+摘要：物理推理是人类非凡的能力，它能从有限的经验中快速学习和概括。尽管经过广泛的训练，但当前的人工智能模型在实现类似的概括方面仍然存在困难，尤其是在分布外 (OOD) 设置中。这种限制源于它们无法从观察中抽象出核心物理原理。一个关键挑战是开发能够从最少数据中有效学习和概括物理动力学的表示。在这里，我们介绍了神经力场 (NFF)，这是一种建立在神经常微分方程 (NODE) 上的建模框架，它学习可解释的力场表示，这些表示可以通过常微分方程 (ODE) 求解器有效地进行积分，以预测物体轨迹。与依赖于高维潜在空间的现有方法不同，NFF 以可解释的方式捕获了诸如重力、支撑和碰撞等基本物理概念。在两个具有挑战性的物理推理任务上的实验表明，仅通过几个示例训练的 NFF 实现了对看不见场景的强大概括。这种基于物理的表示能够进行高效的前向后向规划，并通过交互式细化实现快速适应。我们的工作表明，将受物理启发的表示纳入学习系统可以帮助弥合人工智能和人类物理推理能力之间的差距。
 
-摘要：電腦遊戲，尤其是卡牌遊戲的原型製作，需要大量的人力在創意構思和遊戲玩法評估上。大型語言模型 (LLM) 的最新進展提供了自動化和簡化這些流程的機會。然而，LLM 在設計超越現有資料庫的新穎遊戲機制、生成一致的遊戲環境，以及開發用於大規模評估的可擴充遊戲 AI 方面仍然面臨挑戰。本文通過引入一個全面的自動化卡牌遊戲原型製作框架來應對這些挑戰。該方法強調了一種基於圖表的索引方法，用於生成新穎的遊戲設計，一個由 LLM 驅動的系統，用於一致的遊戲程式碼生成，並由遊戲記錄驗證，以及一個遊戲 AI 構建方法，該方法使用由 LLM 生成的動作值函數的集合，通過自我對弈進行最佳化。這些貢獻旨在加速卡牌遊戲原型製作，減少人力，並降低遊戲開發人員的進入門檻。
+##### **Tuning-Free Personalized Alignment via Trial-Error-Explain In-Context Learning**
+2502.08972v1 by Hyundong Cho, Karishma Sharma, Nicolaas Jedema, Leonardo F. R. Ribeiro, Alessandro Moschitti, Ravi Krishnan, Jonathan May
 
-##### **GraNNite: Enabling High-Performance Execution of Graph Neural Networks on Resource-Constrained Neural Processing Units**
-2502.06921v2 by Arghadip Das, Shamik Kundu, Arnab Raha, Soumendu Ghosh, Deepak Mathaikutty, Vijay Raghunathan
+Language models are aligned to the collective voice of many, resulting in
+generic outputs that do not align with specific users' styles. In this work, we
+present Trial-Error-Explain In-Context Learning (TICL), a tuning-free method
+that personalizes language models for text generation tasks with fewer than 10
+examples per user. TICL iteratively expands an in-context learning prompt via a
+trial-error-explain process, adding model-generated negative samples and
+explanations that provide fine-grained guidance towards a specific user's
+style. TICL achieves favorable win rates on pairwise comparisons with
+LLM-as-a-judge up to 91.5% against the previous state-of-the-art and
+outperforms competitive tuning-free baselines for personalized alignment tasks
+of writing emails, essays and news articles. Both lexical and qualitative
+analyses show that the negative samples and explanations enable language models
+to learn stylistic context more effectively and overcome the bias towards
+structural and formal phrases observed in their zero-shot outputs. By
+front-loading inference compute to create a user-specific in-context learning
+prompt that does not require extra generation steps at test time, TICL presents
+a novel yet simple approach for personalized alignment.
 
-Graph Neural Networks (GNNs) are vital for learning from graph-structured
-data, enabling applications in network analysis, recommendation systems, and
-speech analytics. Deploying them on edge devices like client PCs and laptops
-enhances real-time processing, privacy, and cloud independence. GNNs aid
-Retrieval-Augmented Generation (RAG) for Large Language Models (LLMs) and
-enable event-based vision tasks. However, irregular memory access, sparsity,
-and dynamic structures cause high latency and energy overhead on
-resource-constrained devices. While modern edge processors integrate CPUs,
-GPUs, and NPUs, NPUs designed for data-parallel tasks struggle with irregular
-GNN computations. We introduce GraNNite, the first hardware-aware framework
-optimizing GNN execution on commercial-off-the-shelf (COTS) SOTA DNN
-accelerators via a structured three-step methodology: (1) enabling NPU
-execution, (2) optimizing performance, and (3) trading accuracy for efficiency
-gains. Step 1 employs GraphSplit for workload distribution and StaGr for static
-aggregation, while GrAd and NodePad handle dynamic graphs. Step 2 boosts
-performance using EffOp for control-heavy tasks and GraSp for sparsity
-exploitation. Graph Convolution optimizations PreG, SymG, and CacheG reduce
-redundancy and memory transfers. Step 3 balances quality versus efficiency,
-where QuantGr applies INT8 quantization, and GrAx1, GrAx2, and GrAx3 accelerate
-attention, broadcast-add, and SAGE-max aggregation. On Intel Core Ultra AI PCs,
-GraNNite achieves 2.6X to 7.6X speedups over default NPU mappings and up to
-8.6X energy gains over CPUs and GPUs, delivering 10.8X and 6.7X higher
-performance than CPUs and GPUs, respectively, across GNN models.
+摘要：語言模型與眾人的集體聲音保持一致，導致產出內容流於一般，無法與特定使用者的風格相符。在這項工作中，我們提出了試驗錯誤解釋情境內學習 (TICL)，一種免調校方法，能為文字生成任務個人化語言模型，每個使用者少於 10 個範例。TICL 透過試驗錯誤解釋程序反覆擴充情境內學習提示，加入模型產生的負面範例和說明，提供細緻的指導，引導至特定使用者的風格。TICL 在與 LLM 作為評審的成對比較中獲得了高勝率，高達 91.5%，優於先前的技術水準，並在個人化對齊任務中超越了競爭性的免調校基準，包括撰寫電子郵件、論文和新聞文章。詞彙和質性分析皆顯示，負面範例和說明讓語言模型能更有效地學習風格脈絡，並克服零次學習產出中觀察到的結構性和正式詞組偏誤。透過預先加載推論運算，建立使用者特定的情境內學習提示，無需在測試時額外產生步驟，TICL 呈現一種新穎卻簡潔的方法，用於個人化對齊。
 
-摘要：圖形神經網路 (GNN) 對於從圖形結構資料中學習至關重要，能應用於網路分析、推薦系統和語音分析。將其部署在邊緣裝置（例如用戶端電腦和筆電）上可增強即時處理、隱私和雲端獨立性。GNN 協助大型語言模型 (LLM) 的檢索增強生成 (RAG)，並支援基於事件的視覺任務。然而，不規則的記憶體存取、稀疏性和動態結構會導致資源受限裝置上的高延遲和能源負擔。儘管現代邊緣處理器整合了 CPU、GPU 和 NPU，但針對資料平行任務所設計的 NPU 難以處理不規則的 GNN 計算。我們引入了 GraNNite，這是第一個硬體感知框架，透過結構化的三步驟方法最佳化商用現成 (COTS) SOTA DNN 加速器上的 GNN 執行：(1) 啟用 NPU 執行，(2) 最佳化效能，以及 (3) 以準確度換取效率提升。步驟 1 使用 GraphSplit 進行工作負載分配，並使用 StaGr 進行靜態聚合，而 GrAd 和 NodePad 則處理動態圖形。步驟 2 使用 EffOp 提升控制密集型任務的效能，並使用 GraSp 進行稀疏性利用。圖形卷積最佳化 PreG、SymG 和 CacheG 減少了冗餘和記憶體傳輸。步驟 3 平衡品質與效率，其中 QuantGr 適用 INT8 量化，而 GrAx1、GrAx2 和 GrAx3 則加速注意力、廣播加法和 SAGE-max 聚合。在 Intel Core Ultra AI PC 上，GraNNite 在預設 NPU 映射上實現了 2.6X 到 7.6X 的加速，在 CPU 和 GPU 上實現了高達 8.6X 的能源增益，在 GNN 模型中分別提供了比 CPU 和 GPU 高出 10.8X 和 6.7X 的效能。
+##### **RTBAS: Defending LLM Agents Against Prompt Injection and Privacy Leakage**
+2502.08966v1 by Peter Yong Zhong, Siyuan Chen, Ruiqi Wang, McKenna McCall, Ben L. Titzer, Heather Miller
 
-##### **Automatic Annotation Augmentation Boosts Translation between Molecules and Natural Language**
-2502.06634v1 by Zhiqiang Zhong, Simon Sataa-Yu Larsen, Haoyu Guo, Tao Tang, Kuangyu Zhou, Davide Mottin
+Tool-Based Agent Systems (TBAS) allow Language Models (LMs) to use external
+tools for tasks beyond their standalone capabilities, such as searching
+websites, booking flights, or making financial transactions. However, these
+tools greatly increase the risks of prompt injection attacks, where malicious
+content hijacks the LM agent to leak confidential data or trigger harmful
+actions. Existing defenses (OpenAI GPTs) require user confirmation before every
+tool call, placing onerous burdens on users. We introduce Robust TBAS (RTBAS),
+which automatically detects and executes tool calls that preserve integrity and
+confidentiality, requiring user confirmation only when these safeguards cannot
+be ensured. RTBAS adapts Information Flow Control to the unique challenges
+presented by TBAS. We present two novel dependency screeners, using
+LM-as-a-judge and attention-based saliency, to overcome these challenges.
+Experimental results on the AgentDojo Prompt Injection benchmark show RTBAS
+prevents all targeted attacks with only a 2% loss of task utility when under
+attack, and further tests confirm its ability to obtain near-oracle performance
+on detecting both subtle and direct privacy leaks.
 
-Recent advancements in AI for biological research focus on integrating
-molecular data with natural language to accelerate drug discovery. However, the
-scarcity of high-quality annotations limits progress in this area. This paper
-introduces LA$^3$, a Language-based Automatic Annotation Augmentation framework
-that leverages large language models to augment existing datasets, thereby
-improving AI training. We demonstrate the effectiveness of LA$^3$ by creating
-an enhanced dataset, LaChEBI-20, where we systematically rewrite the
-annotations of molecules from an established dataset. These rewritten
-annotations preserve essential molecular information while providing more
-varied sentence structures and vocabulary. Using LaChEBI-20, we train LaMolT5
-based on a benchmark architecture to learn the mapping between molecular
-representations and augmented annotations.
-  Experimental results on text-based *de novo* molecule generation and molecule
-captioning demonstrate that LaMolT5 outperforms state-of-the-art models.
-Notably, incorporating LA$^3$ leads to improvements of up to 301% over the
-benchmark architecture. Furthermore, we validate the effectiveness of LA$^3$
-notable applications in *image*, *text* and *graph* tasks, affirming its
-versatility and utility.
+摘要：基於工具的代理系統 (TBAS) 允許語言模型 (LM) 使用外部工具來執行超出其獨立功能的任務，例如搜尋網站、預訂航班或進行金融交易。然而，這些工具大幅增加了提示注入攻擊的風險，其中惡意內容劫持 LM 代理程式以洩露機密資料或觸發有害動作。現有的防禦措施 (OpenAI GPT) 在每次呼叫工具之前都需要使用者確認，這會對使用者造成沉重的負擔。我們引入了穩健的 TBAS (RTBAS)，它會自動偵測並執行保留完整性與機密性的工具呼叫，僅在無法確保這些防護措施時才需要使用者確認。RTBAS 將資訊流控制調整為 TBAS 呈現的獨特挑戰。我們提出兩種新穎的相依性篩選器，使用 LM 作為判斷者和基於注意力的顯著性，以克服這些挑戰。AgentDojo 提示注入基準上的實驗結果顯示，RTBAS 在受到攻擊時僅損失 2% 的任務效用，即可防止所有目標攻擊，進一步的測試證實了其在偵測細微和直接的隱私洩漏方面獲得接近神諭效能的能力。
 
-摘要：<paragraph>人工智慧在生物研究上的最新進展，專注於將分子資料與自然語言整合，以加速藥物發現。然而，高品質註解的稀少限制了此領域的進展。這篇論文介紹了 LA$^3$，一個基於語言的自動註解擴充框架，它利用大型語言模型來擴充現有的資料集，進而改善人工智慧訓練。我們透過建立一個增強的資料集 LaChEBI-20 來展示 LA$^3$ 的有效性，我們系統性地改寫了一個既定資料集中分子的註解。這些改寫的註解保留了重要的分子資訊，同時提供了更多樣化的句子結構和詞彙。使用 LaChEBI-20，我們在基於基準架構上訓練 LaMolT5，以學習分子表示和擴充註解之間的對應。
-在基於文字的 *從頭開始* 分子生成和分子標題上的實驗結果表明，LaMolT5 優於最先進的模型。值得注意的是，納入 LA$^3$ 可讓基準架構的改進幅度高達 301%。此外，我們驗證了 LA$^3$ 在 *影像*、*文字* 和 *圖形* 任務中的有效性，肯定了它的多功能性和實用性。</paragraph>
+##### **Biologically Plausible Brain Graph Transformer**
+2502.08958v1 by Ciyuan Peng, Yuelong Huang, Qichao Dong, Shuo Yu, Feng Xia, Chengqi Zhang, Yaochu Jin
 
-##### **KARMA: Leveraging Multi-Agent LLMs for Automated Knowledge Graph Enrichment**
-2502.06472v1 by Yuxing Lu, Jinzhuo Wang
+State-of-the-art brain graph analysis methods fail to fully encode the
+small-world architecture of brain graphs (accompanied by the presence of hubs
+and functional modules), and therefore lack biological plausibility to some
+extent. This limitation hinders their ability to accurately represent the
+brain's structural and functional properties, thereby restricting the
+effectiveness of machine learning models in tasks such as brain disorder
+detection. In this work, we propose a novel Biologically Plausible Brain Graph
+Transformer (BioBGT) that encodes the small-world architecture inherent in
+brain graphs. Specifically, we present a network entanglement-based node
+importance encoding technique that captures the structural importance of nodes
+in global information propagation during brain graph communication,
+highlighting the biological properties of the brain structure. Furthermore, we
+introduce a functional module-aware self-attention to preserve the functional
+segregation and integration characteristics of brain graphs in the learned
+representations. Experimental results on three benchmark datasets demonstrate
+that BioBGT outperforms state-of-the-art models, enhancing biologically
+plausible brain graph representations for various brain graph analytical tasks
 
-Maintaining comprehensive and up-to-date knowledge graphs (KGs) is critical
-for modern AI systems, but manual curation struggles to scale with the rapid
-growth of scientific literature. This paper presents KARMA, a novel framework
-employing multi-agent large language models (LLMs) to automate KG enrichment
-through structured analysis of unstructured text. Our approach employs nine
-collaborative agents, spanning entity discovery, relation extraction, schema
-alignment, and conflict resolution that iteratively parse documents, verify
-extracted knowledge, and integrate it into existing graph structures while
-adhering to domain-specific schema. Experiments on 1,200 PubMed articles from
-three different domains demonstrate the effectiveness of KARMA in knowledge
-graph enrichment, with the identification of up to 38,230 new entities while
-achieving 83.1\% LLM-verified correctness and reducing conflict edges by 18.6\%
-through multi-layer assessments.
+摘要：目前最先进的大腦圖形分析方法無法完全編碼大腦圖形的小世界架構（伴隨著樞紐和功能模組的存在），因此在某種程度上缺乏生物學上的可信度。這種限制阻礙了它們準確表示大腦結構和功能特性的能力，從而限制了機器學習模型在腦部疾病檢測等任務中的有效性。在這項工作中，我們提出了一個新的生物學上可信的大腦圖形轉換器 (BioBGT)，它編碼了大腦圖形中固有的、小世界的架構。具體來說，我們提出了一種基於網路糾纏的節點重要性編碼技術，它捕捉了大腦圖形通信過程中節點在全球資訊傳播中的結構重要性，突出了大腦結構的生物學特性。此外，我們引入了一個功能模組感知自注意力，以保留學習表徵中大腦圖形的功能分離和整合特性。在三個基準資料集上的實驗結果表明，BioBGT 優於最先進的模型，增強了各種大腦圖形分析任務的生物學上可信的大腦圖形表徵
 
-摘要：維護全面且最新的知識圖譜 (KG) 對現代 AI 系統至關重要，但手動策劃難以隨著科學文獻的快速增長而擴展。本文提出了 KARMA，一個採用多代理大型語言模型 (LLM) 的新框架，透過對非結構化文本的結構化分析來自動化 KG 豐富化。我們的做法採用九個協作代理，涵蓋實體發現、關係提取、架構比對和衝突解決，這些代理會反覆分析文件、驗證提取的知識，並將其整合到現有的圖結構中，同時遵守特定領域的架構。針對來自三個不同領域的 1,200 篇 PubMed 文章進行的實驗證明了 KARMA 在知識圖譜豐富化方面的有效性，識別出多達 38,230 個新實體，同時達到 83.1% 的 LLM 驗證正確性，並透過多層評估將衝突邊緣降低了 18.6%。
+##### **Medicine on the Edge: Comparative Performance Analysis of On-Device LLMs for Clinical Reasoning**
+2502.08954v1 by Leon Nissen, Philipp Zagar, Vishnu Ravi, Aydin Zahedivash, Lara Marie Reimer, Stephan Jonas, Oliver Aalami, Paul Schmiedmayer
 
-##### **RoToR: Towards More Reliable Responses for Order-Invariant Inputs**
-2502.08662v1 by Soyoung Yoon, Dongha Ahn, Youngwon Lee, Minkyu Jung, HyungJoo Jang, Seung-won Hwang
+The deployment of Large Language Models (LLM) on mobile devices offers
+significant potential for medical applications, enhancing privacy, security,
+and cost-efficiency by eliminating reliance on cloud-based services and keeping
+sensitive health data local. However, the performance and accuracy of on-device
+LLMs in real-world medical contexts remain underexplored. In this study, we
+benchmark publicly available on-device LLMs using the AMEGA dataset, evaluating
+accuracy, computational efficiency, and thermal limitation across various
+mobile devices. Our results indicate that compact general-purpose models like
+Phi-3 Mini achieve a strong balance between speed and accuracy, while medically
+fine-tuned models such as Med42 and Aloe attain the highest accuracy. Notably,
+deploying LLMs on older devices remains feasible, with memory constraints
+posing a greater challenge than raw processing power. Our study underscores the
+potential of on-device LLMs for healthcare while emphasizing the need for more
+efficient inference and models tailored to real-world clinical reasoning.
 
-Mitigating positional bias of language models (LMs) for listwise inputs is a
-well-known and important problem (e.g., lost-in-the-middle). While zero-shot
-order-invariant LMs have been proposed to solve this issue, their success on
-practical listwise problems has been limited. In this work, as a first
-contribution, we identify and overcome two limitations to make zero-shot
-invariant LMs more practical: (1) training and inference distribution mismatch
-arising from modifying positional ID assignments to enforce invariance, and (2)
-failure to adapt to a mixture of order-invariant and sensitive inputs in
-practical listwise problems. To overcome, we propose (1) RoToR, a zero-shot
-invariant LM for genuinely order-invariant inputs with minimal modifications of
-positional IDs, and (2) Selective Routing, an adaptive framework that handles
-both order-invariant and order-sensitive inputs in listwise tasks. On the Lost
-in the middle (LitM), Knowledge Graph Question Answering (KGQA), and MMLU
-benchmarks, we show that RoToR with Selective Routing can effectively handle
-practical listwise input tasks in a zero-shot manner.
+摘要：大型語言模型 (LLM) 在行動裝置上的部署為醫療應用程式提供了巨大的潛力，透過消除對雲端服務的依賴並將敏感的健康資料儲存在本地，進而提升隱私、安全性，並提高成本效益。然而，在實際的醫療環境中，裝置上 LLM 的效能和準確度仍未受到充分的探討。在此研究中，我們使用 AMEGA 資料集來評量公開可用的裝置上 LLM，並評估其在各種行動裝置上的準確度、運算效率和熱限制。我們的結果顯示，像 Phi-3 Mini 等精簡的一般用途模型在速度和準確度之間取得了良好的平衡，而經過醫學微調的模型，例如 Med42 和 Aloe，則達到了最高的準確度。值得注意的是，在較舊的裝置上部署 LLM 仍然可行，記憶體限制比原始處理能力構成更大的挑戰。我們的研究強調了裝置上 LLM 在醫療保健方面的潛力，同時強調了對更有效率的推理和針對實際臨床推理量身打造的模型的需求。
 
-摘要：語言模型 (LM) 的位置偏差緩解對於列表輸入來說是一個廣為人知且重要的問題（例如，迷失在中間）。雖然已經提出零次學習順序不變的 LM 來解決這個問題，但它們在實際列表問題上的成功卻很有限。在這項工作中，作為第一個貢獻，我們找出並克服了兩個限制，讓零次學習不變的 LM 更有實用性：(1) 訓練和推論分布不匹配，這是由於修改位置 ID 分配以強制不變性所造成的，以及 (2) 無法適應實際列表問題中不變和敏感輸入的組合。為了克服這些問題，我們提出 (1) RoToR，一個零次學習不變的 LM，用於真正不變的輸入，並對位置 ID 進行最小的修改，以及 (2) 選擇性路由，一個自適應框架，用於處理列表任務中不變和敏感的輸入。在迷失在中間 (LitM)、知識圖譜問答 (KGQA) 和 MMLU 基準測試中，我們展示了 RoToR 與選擇性路由可以有效地以零次學習的方式處理實際的列表輸入任務。
 
-##### **K-ON: Stacking Knowledge On the Head Layer of Large Language Model**
-2502.06257v1 by Lingbing Guo, Yichi Zhang, Zhongpu Bo, Zhuo Chen, Mengshu Sun, Zhiqiang Zhang, Wen Zhang, Huajun Chen
+### Medical explainable AI
+|Publish Date|Title|Authors|Homepage|Code|
+| :---: | :---: | :---: | :---: | :---: |
+|**2025-02-13**|**Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**|Flavio Bertini et.al.|[2502.09218v1](http://arxiv.org/abs/2502.09218v1)|null|
+|**2025-02-10**|**Foundation Model of Electronic Medical Records for Adaptive Risk Estimation**|Pawel Renc et.al.|[2502.06124v1](http://arxiv.org/abs/2502.06124v1)|null|
+|**2025-01-27**|**An Explainable Disease Surveillance System for Early Prediction of Multiple Chronic Diseases**|Shaheer Ahmad Khan et.al.|[2501.15969v1](http://arxiv.org/abs/2501.15969v1)|null|
+|**2025-01-23**|**Ensuring Medical AI Safety: Explainable AI-Driven Detection and Mitigation of Spurious Model Behavior and Associated Data**|Frederik Pahde et.al.|[2501.13818v1](http://arxiv.org/abs/2501.13818v1)|[link](https://github.com/frederikpahde/medical-ai-safety)|
+|**2025-01-19**|**Enhanced Suicidal Ideation Detection from Social Media Using a CNN-BiLSTM Hybrid Model**|Mohaiminul Islam Bhuiyan et.al.|[2501.11094v1](http://arxiv.org/abs/2501.11094v1)|null|
+|**2025-01-17**|**SEANN: A Domain-Informed Neural Network for Epidemiological Insights**|Jean-Baptiste Guimbaud et.al.|[2501.10273v1](http://arxiv.org/abs/2501.10273v1)|null|
+|**2025-01-16**|**Artificial Intelligence-Driven Clinical Decision Support Systems**|Muhammet Alkan et.al.|[2501.09628v1](http://arxiv.org/abs/2501.09628v1)|null|
+|**2025-01-12**|**MedGrad E-CLIP: Enhancing Trust and Transparency in AI-Driven Skin Lesion Diagnosis**|Sadia Kamal et.al.|[2501.06887v1](http://arxiv.org/abs/2501.06887v1)|null|
+|**2025-01-06**|**Explaining Humour Style Classifications: An XAI Approach to Understanding Computational Humour Analysis**|Mary Ogbuka Kenneth et.al.|[2501.02891v1](http://arxiv.org/abs/2501.02891v1)|null|
+|**2024-12-28**|**The Emotional Spectrum of LLMs: Leveraging Empathy and Emotion-Based Markers for Mental Health Support**|Alessandro De Grandi et.al.|[2412.20068v1](http://arxiv.org/abs/2412.20068v1)|null|
+|**2024-12-27**|**A Review on the Integration of Artificial Intelligence and Medical Imaging in IVF Ovarian Stimulation**|Jana Zakall et.al.|[2412.19688v1](http://arxiv.org/abs/2412.19688v1)|null|
+|**2024-12-23**|**Enhancing Cancer Diagnosis with Explainable & Trustworthy Deep Learning Models**|Badaru I. Olumuyiwa et.al.|[2412.17527v1](http://arxiv.org/abs/2412.17527v1)|null|
+|**2024-12-20**|**Towards Interpretable Radiology Report Generation via Concept Bottlenecks using a Multi-Agentic RAG**|Hasan Md Tusfiqur Alam et.al.|[2412.16086v2](http://arxiv.org/abs/2412.16086v2)|[link](https://github.com/tifat58/irr-with-cbm-rag)|
+|**2024-12-20**|**Critique of Impure Reason: Unveiling the reasoning behaviour of medical Large Language Models**|Shamus Sim et.al.|[2412.15748v1](http://arxiv.org/abs/2412.15748v1)|null|
+|**2024-12-18**|**Cognition Chain for Explainable Psychological Stress Detection on Social Media**|Xin Wang et.al.|[2412.14009v1](http://arxiv.org/abs/2412.14009v1)|null|
+|**2024-11-30**|**2-Factor Retrieval for Improved Human-AI Decision Making in Radiology**|Jim Solomon et.al.|[2412.00372v1](http://arxiv.org/abs/2412.00372v1)|null|
+|**2024-11-28**|**Mapping Public Perception of Artificial Intelligence: Expectations, Risk-Benefit Tradeoffs, and Value As Determinants for Societal Acceptance**|Philipp Brauner et.al.|[2411.19356v1](http://arxiv.org/abs/2411.19356v1)|null|
+|**2024-11-26**|**Explainable AI for Classifying UTI Risk Groups Using a Real-World Linked EHR and Pathology Lab Dataset**|Yujie Dai et.al.|[2411.17645v2](http://arxiv.org/abs/2411.17645v2)|null|
+|**2024-11-18**|**Exploring the Requirements of Clinicians for Explainable AI Decision Support Systems in Intensive Care**|Jeffrey N. Clark et.al.|[2411.11774v1](http://arxiv.org/abs/2411.11774v1)|null|
+|**2024-11-15**|**Artificial Intelligence in Pediatric Echocardiography: Exploring Challenges, Opportunities, and Clinical Applications with Explainable AI and Federated Learning**|Mohammed Yaseen Jabarulla et.al.|[2411.10255v1](http://arxiv.org/abs/2411.10255v1)|null|
+|**2024-11-01**|**Enhancing Osteoporosis Detection: An Explainable Multi-Modal Learning Framework with Feature Fusion and Variable Clustering**|Mehdi Hosseini Chagahi et.al.|[2411.00916v2](http://arxiv.org/abs/2411.00916v2)|null|
+|**2024-10-25**|**A Review of Deep Learning Approaches for Non-Invasive Cognitive Impairment Detection**|Muath Alsuhaibani et.al.|[2410.19898v1](http://arxiv.org/abs/2410.19898v1)|null|
+|**2024-10-23**|**An Ontology-Enabled Approach For User-Centered and Knowledge-Enabled Explanations of AI Systems**|Shruthi Chari et.al.|[2410.17504v1](http://arxiv.org/abs/2410.17504v1)|[link](https://github.com/tetherless-world/metaexplainer)|
+|**2024-10-22**|**Contrasting Attitudes Towards Current and Future AI Applications for Computerised Interpretation of ECG: A Clinical Stakeholder Interview Study**|Lukas Hughes-Noehrer et.al.|[2410.16879v1](http://arxiv.org/abs/2410.16879v1)|null|
+|**2024-10-19**|**Pathologist-like explainable AI for interpretable Gleason grading in prostate cancer**|Gesa Mittmann et.al.|[2410.15012v1](http://arxiv.org/abs/2410.15012v1)|null|
+|**2024-10-15**|**Explainable AI Methods for Multi-Omics Analysis: A Survey**|Ahmad Hussein et.al.|[2410.11910v1](http://arxiv.org/abs/2410.11910v1)|null|
+|**2024-10-14**|**Study on the Helpfulness of Explainable Artificial Intelligence**|Tobias Labarta et.al.|[2410.11896v1](http://arxiv.org/abs/2410.11896v1)|[link](https://github.com/tlabarta/helpfulnessofxai)|
+|**2024-10-12**|**Use of What-if Scenarios to Help Explain Artificial Intelligence Models for Neonatal Health**|Abdullah Mamun et.al.|[2410.09635v1](http://arxiv.org/abs/2410.09635v1)|[link](https://github.com/ab9mamun/aimen)|
+|**2024-10-10**|**Artificial intelligence techniques in inherited retinal diseases: A review**|Han Trinh et.al.|[2410.09105v1](http://arxiv.org/abs/2410.09105v1)|null|
+|**2024-10-07**|**CasiMedicos-Arg: A Medical Question Answering Dataset Annotated with Explanatory Argumentative Structures**|Ekaterina Sviridova et.al.|[2410.05235v2](http://arxiv.org/abs/2410.05235v2)|[link](https://github.com/ixa-ehu/antidote-casimedicos)|
+|**2024-10-01**|**Explainable Diagnosis Prediction through Neuro-Symbolic Integration**|Qiuhao Lu et.al.|[2410.01855v2](http://arxiv.org/abs/2410.01855v2)|null|
+|**2024-10-01**|**Easydiagnos: a framework for accurate feature selection for automatic diagnosis in smart healthcare**|Prasenjit Maji et.al.|[2410.00366v1](http://arxiv.org/abs/2410.00366v1)|null|
+|**2024-09-20**|**Dermatologist-like explainable AI enhances melanoma diagnosis accuracy: eye-tracking study**|Tirtha Chanda et.al.|[2409.13476v1](http://arxiv.org/abs/2409.13476v1)|null|
+|**2024-09-19**|**Explainable AI for Autism Diagnosis: Identifying Critical Brain Regions Using fMRI Data**|Suryansh Vidya et.al.|[2409.15374v1](http://arxiv.org/abs/2409.15374v1)|null|
+|**2024-09-19**|**Improving Prototypical Parts Abstraction for Case-Based Reasoning Explanations Designed for the Kidney Stone Type Recognition**|Daniel Flores-Araiza et.al.|[2409.12883v1](http://arxiv.org/abs/2409.12883v1)|null|
+|**2024-09-18**|**Towards Interpretable End-Stage Renal Disease (ESRD) Prediction: Utilizing Administrative Claims Data with Explainable AI Techniques**|Yubo Li et.al.|[2409.12087v3](http://arxiv.org/abs/2409.12087v3)|null|
+|**2024-09-13**|**Contextual Evaluation of Large Language Models for Classifying Tropical and Infectious Diseases**|Mercy Asiedu et.al.|[2409.09201v3](http://arxiv.org/abs/2409.09201v3)|null|
+|**2024-09-09**|**Explainable AI: Definition and attributes of a good explanation for health AI**|Evangelia Kyrimi et.al.|[2409.15338v1](http://arxiv.org/abs/2409.15338v1)|null|
+|**2024-08-30**|**Exploring the Effect of Explanation Content and Format on User Comprehension and Trust**|Antonio Rago et.al.|[2408.17401v1](http://arxiv.org/abs/2408.17401v1)|null|
+|**2024-08-29**|**A Survey for Large Language Models in Biomedicine**|Chong Wang et.al.|[2409.00133v1](http://arxiv.org/abs/2409.00133v1)|null|
+|**2024-08-27**|**Aligning XAI with EU Regulations for Smart Biomedical Devices: A Methodology for Compliance Analysis**|Francesco Sovrano et.al.|[2408.15121v1](http://arxiv.org/abs/2408.15121v1)|null|
+|**2024-08-24**|**Towards Case-based Interpretability for Medical Federated Learning**|Laura Latorre et.al.|[2408.13626v1](http://arxiv.org/abs/2408.13626v1)|null|
+|**2024-08-22**|**AI in radiological imaging of soft-tissue and bone tumours: a systematic review evaluating against CLAIM and FUTURE-AI guidelines**|Douwe J. Spaanderman et.al.|[2408.12491v1](http://arxiv.org/abs/2408.12491v1)|null|
+|**2024-08-14**|**Evaluating Explainable AI Methods in Deep Learning Models for Early Detection of Cerebral Palsy**|Kimji N. Pellano et.al.|[2409.00001v1](http://arxiv.org/abs/2409.00001v1)|null|
+|**2024-08-06**|**MicroXercise: A Micro-Level Comparative and Explainable System for Remote Physical Therapy**|Hanchen David Wang et.al.|[2408.11837v1](http://arxiv.org/abs/2408.11837v1)|null|
+|**2024-08-05**|**The Literature Review Network: An Explainable Artificial Intelligence for Systematic Literature Reviews, Meta-analyses, and Method Development**|Joshua Morriss et.al.|[2408.05239v1](http://arxiv.org/abs/2408.05239v1)|null|
+|**2024-08-05**|**Enhancing Medical Learning and Reasoning Systems: A Boxology-Based Comparative Analysis of Design Patterns**|Chi Him Ng et.al.|[2408.02709v1](http://arxiv.org/abs/2408.02709v1)|null|
+|**2024-08-05**|**Bayesian Kolmogorov Arnold Networks (Bayesian_KANs): A Probabilistic Approach to Enhance Accuracy and Interpretability**|Masoud Muhammed Hassan et.al.|[2408.02706v1](http://arxiv.org/abs/2408.02706v1)|null|
+|**2024-07-26**|**MLtoGAI: Semantic Web based with Machine Learning for Enhanced Disease Prediction and Personalized Recommendations using Generative AI**|Shyam Dongre et.al.|[2407.20284v1](http://arxiv.org/abs/2407.20284v1)|null|
+|**2024-07-25**|**Introducing δ-XAI: a novel sensitivity-based method for local AI explanations**|Alessandro De Carlo et.al.|[2407.18343v2](http://arxiv.org/abs/2407.18343v2)|null|
+|**2024-07-24**|**Enhanced Deep Learning Methodologies and MRI Selection Techniques for Dementia Diagnosis in the Elderly Population**|Nikolaos Ntampakis et.al.|[2407.17324v2](http://arxiv.org/abs/2407.17324v2)|null|
+|**2024-07-24**|**Using Large Language Models to Compare Explainable Models for Smart Home Human Activity Recognition**|Michele Fiori et.al.|[2408.06352v1](http://arxiv.org/abs/2408.06352v1)|null|
+|**2024-07-21**|**Explainable AI-based Intrusion Detection System for Industry 5.0: An Overview of the Literature, associated Challenges, the existing Solutions, and Potential Research Directions**|Naseem Khan et.al.|[2408.03335v1](http://arxiv.org/abs/2408.03335v1)|null|
+|**2024-07-18**|**A Comparative Study on Automatic Coding of Medical Letters with Explainability**|Jamie Glen et.al.|[2407.13638v1](http://arxiv.org/abs/2407.13638v1)|[link](https://github.com/Glenj01/Medical-Coding)|
+|**2024-07-09**|**Explainable AI for Enhancing Efficiency of DL-based Channel Estimation**|Abdul Karim Gizzini et.al.|[2407.07009v1](http://arxiv.org/abs/2407.07009v1)|null|
+|**2024-07-07**|**Explainable AI: Comparative Analysis of Normal and Dilated ResNet Models for Fundus Disease Classification**|P. N. Karthikayan et.al.|[2407.05440v2](http://arxiv.org/abs/2407.05440v2)|null|
+|**2024-07-03**|**A Survey on Trustworthiness in Foundation Models for Medical Image Analysis**|Congzhen Shi et.al.|[2407.15851v2](http://arxiv.org/abs/2407.15851v2)|null|
+|**2024-07-01**|**The Impact of an XAI-Augmented Approach on Binary Classification with Scarce Data**|Ximing Wen et.al.|[2407.06206v1](http://arxiv.org/abs/2407.06206v1)|null|
+|**2024-06-28**|**Can GPT-4 Help Detect Quit Vaping Intentions? An Exploration of Automatic Data Annotation Approach**|Sai Krishna Revanth Vuruma et.al.|[2407.00167v1](http://arxiv.org/abs/2407.00167v1)|null|
+|**2024-06-25**|**Towards Compositional Interpretability for XAI**|Sean Tull et.al.|[2406.17583v1](http://arxiv.org/abs/2406.17583v1)|null|
+|**2024-06-17**|**Slicing Through Bias: Explaining Performance Gaps in Medical Image Analysis using Slice Discovery Methods**|Vincent Olesen et.al.|[2406.12142v2](http://arxiv.org/abs/2406.12142v2)|[link](https://github.com/volesen/slicing-through-bias)|
+|**2024-06-11**|**Unlocking the Potential of Metaverse in Innovative and Immersive Digital Health**|Fatemeh Ebrahimzadeh et.al.|[2406.07114v2](http://arxiv.org/abs/2406.07114v2)|null|
+|**2024-06-10**|**AI-Driven Predictive Analytics Approach for Early Prognosis of Chronic Kidney Disease Using Ensemble Learning and Explainable AI**|K M Tawsik Jawad et.al.|[2406.06728v2](http://arxiv.org/abs/2406.06728v2)|null|
+|**2024-06-10**|**Explainable AI for Mental Disorder Detection via Social Media: A survey and outlook**|Yusif Ibrahimov et.al.|[2406.05984v1](http://arxiv.org/abs/2406.05984v1)|null|
+|**2024-06-09**|**Methodology and Real-World Applications of Dynamic Uncertain Causality Graph for Clinical Diagnosis with Explainability and Invariance**|Zhan Zhang et.al.|[2406.05746v1](http://arxiv.org/abs/2406.05746v1)|null|
+|**2024-06-07**|**Advancing Histopathology-Based Breast Cancer Diagnosis: Insights into Multi-Modality and Explainability**|Faseela Abdullakutty et.al.|[2406.12897v1](http://arxiv.org/abs/2406.12897v1)|null|
+|**2024-06-04**|**Using Explainable AI for EEG-based Reduced Montage Neonatal Seizure Detection**|Dinuka Sandun Udayantha et.al.|[2406.16908v3](http://arxiv.org/abs/2406.16908v3)|[link](https://github.com/dinuka-1999/braineocare)|
+|**2024-06-01**|**Breast Cancer Diagnosis: A Comprehensive Exploration of Explainable Artificial Intelligence (XAI) Techniques**|Samita Bai et.al.|[2406.00532v1](http://arxiv.org/abs/2406.00532v1)|null|
+|**2024-06-01**|**Unveiling Hidden Factors: Explainable AI for Feature Boosting in Speech Emotion Recognition**|Alaa Nfissi et.al.|[2406.01624v2](http://arxiv.org/abs/2406.01624v2)|[link](https://github.com/alaanfissi/unveiling-hidden-factors-explainable-ai-for-feature-boosting-in-speech-emotion-recognition)|
+|**2024-05-31**|**The Explanation Necessity for Healthcare AI**|Michail Mamalakis et.al.|[2406.00216v1](http://arxiv.org/abs/2406.00216v1)|null|
+|**2024-05-29**|**Interdisciplinary Expertise to Advance Equitable Explainable AI**|Chloe R. Bennett et.al.|[2406.18563v1](http://arxiv.org/abs/2406.18563v1)|null|
+|**2024-05-27**|**"It depends": Configuring AI to Improve Clinical Usefulness Across Contexts**|Hubert D. Zając et.al.|[2407.11978v1](http://arxiv.org/abs/2407.11978v1)|null|
+|**2024-05-26**|**Improving Health Professionals' Onboarding with AI and XAI for Trustworthy Human-AI Collaborative Decision Making**|Min Hun Lee et.al.|[2405.16424v1](http://arxiv.org/abs/2405.16424v1)|null|
+|**2024-05-26**|**Exploring Nutritional Impact on Alzheimer's Mortality: An Explainable AI Approach**|Ziming Liu et.al.|[2405.17502v1](http://arxiv.org/abs/2405.17502v1)|null|
+|**2024-05-24**|**Explainable AI Enhances Glaucoma Referrals, Yet the Human-AI Team Still Falls Short of the AI Alone**|Catalina Gomez et.al.|[2407.11974v1](http://arxiv.org/abs/2407.11974v1)|null|
+|**2024-05-23**|**Decoding Decision Reasoning: A Counterfactual-Powered Model for Knowledge Discovery**|Yingying Fang et.al.|[2406.18552v1](http://arxiv.org/abs/2406.18552v1)|null|
+|**2024-05-21**|**The Role of Emotions in Informational Support Question-Response Pairs in Online Health Communities: A Multimodal Deep Learning Approach**|Mohsen Jozani et.al.|[2405.13099v1](http://arxiv.org/abs/2405.13099v1)|null|
+|**2024-05-17**|**ChatGPT in Classrooms: Transforming Challenges into Opportunities in Education**|Harris Bin Munawar et.al.|[2405.10645v1](http://arxiv.org/abs/2405.10645v1)|null|
+|**2024-05-13**|**Evaluating the Explainable AI Method Grad-CAM for Breath Classification on Newborn Time Series Data**|Camelia Oprea et.al.|[2405.07590v1](http://arxiv.org/abs/2405.07590v1)|null|
+|**2024-05-10**|**XAI4LLM. Let Machine Learning Models and LLMs Collaborate for Enhanced In-Context Learning in Healthcare**|Fatemeh Nazary et.al.|[2405.06270v3](http://arxiv.org/abs/2405.06270v3)|null|
+|**2024-05-09**|**To Trust or Not to Trust: Towards a novel approach to measure trust for XAI systems**|Miquel Miró-Nicolau et.al.|[2405.05766v1](http://arxiv.org/abs/2405.05766v1)|null|
+|**2024-05-05**|**Region-specific Risk Quantification for Interpretable Prognosis of COVID-19**|Zhusi Zhong et.al.|[2405.02815v1](http://arxiv.org/abs/2405.02815v1)|[link](https://github.com/zzs95/RSP_COVID)|
+|**2024-04-26**|**Rad4XCNN: a new agnostic method for post-hoc global explanation of CNN-derived features by means of radiomics**|Francesco Prinzi et.al.|[2405.02334v2](http://arxiv.org/abs/2405.02334v2)|null|
+|**2024-04-25**|**Attributing Responsibility in AI-Induced Incidents: A Computational Reflective Equilibrium Framework for Accountability**|Yunfei Ge et.al.|[2404.16957v1](http://arxiv.org/abs/2404.16957v1)|null|
+|**2024-04-19**|**Explainable AI for Fair Sepsis Mortality Predictive Model**|Chia-Hsuan Chang et.al.|[2404.13139v1](http://arxiv.org/abs/2404.13139v1)|null|
+|**2024-04-19**|**Multi Class Depression Detection Through Tweets using Artificial Intelligence**|Muhammad Osama Nusrat et.al.|[2404.13104v1](http://arxiv.org/abs/2404.13104v1)|[link](https://github.com/mnusrat786/masters-thesis)|
+|**2024-04-19**|**COIN: Counterfactual inpainting for weakly supervised semantic segmentation for medical images**|Dmytro Shvetsov et.al.|[2404.12832v2](http://arxiv.org/abs/2404.12832v2)|[link](https://github.com/dmytro-shvetsov/counterfactual-search)|
+|**2024-04-15**|**Hybrid Intelligence for Digital Humanities**|Victor de Boer et.al.|[2406.15374v1](http://arxiv.org/abs/2406.15374v1)|null|
+|**2024-04-14**|**Ethical Framework for Responsible Foundational Models in Medical Imaging**|Abhijit Das et.al.|[2406.11868v1](http://arxiv.org/abs/2406.11868v1)|null|
+|**2024-04-09**|**Advancements in Radiomics and Artificial Intelligence for Thyroid Cancer Diagnosis**|Milad Yousefi et.al.|[2404.07239v1](http://arxiv.org/abs/2404.07239v1)|null|
+|**2024-04-06**|**Predictive Modeling for Breast Cancer Classification in the Context of Bangladeshi Patients: A Supervised Machine Learning Approach with Explainable AI**|Taminul Islam et.al.|[2404.04686v1](http://arxiv.org/abs/2404.04686v1)|null|
+|**2024-04-05**|**Enhancing Breast Cancer Diagnosis in Mammography: Evaluation and Integration of Convolutional Neural Networks and Explainable AI**|Maryam Ahmed et.al.|[2404.03892v3](http://arxiv.org/abs/2404.03892v3)|null|
+|**2024-03-30**|**Advancing Multimodal Data Fusion in Pain Recognition: A Strategy Leveraging Statistical Correlation and Human-Centered Perspectives**|Xingrui Gu et.al.|[2404.00320v2](http://arxiv.org/abs/2404.00320v2)|null|
+|**2024-03-26**|**Addressing Social Misattributions of Large Language Models: An HCXAI-based Approach**|Andrea Ferrario et.al.|[2403.17873v1](http://arxiv.org/abs/2403.17873v1)|null|
+|**2024-03-26**|**Clinical Domain Knowledge-Derived Template Improves Post Hoc AI Explanations in Pneumothorax Classification**|Han Yuan et.al.|[2403.18871v1](http://arxiv.org/abs/2403.18871v1)|[link](https://github.com/han-yuan-med/template-explanation)|
+|**2024-03-03**|**Enhancing Neural Machine Translation of Low-Resource Languages: Corpus Development, Human Evaluation and Explainable AI Architectures**|Séamus Lankford et.al.|[2403.01580v1](http://arxiv.org/abs/2403.01580v1)|null|
+|**2024-02-28**|**Cause and Effect: Can Large Language Models Truly Understand Causality?**|Swagata Ashwani et.al.|[2402.18139v3](http://arxiv.org/abs/2402.18139v3)|null|
+|**2024-02-28**|**Artificial Intelligence and Diabetes Mellitus: An Inside Look Through the Retina**|Yasin Sadeghi Bazargani et.al.|[2402.18600v1](http://arxiv.org/abs/2402.18600v1)|null|
+|**2024-02-22**|**Multi-stakeholder Perspective on Responsible Artificial Intelligence and Acceptability in Education**|A. J. Karran et.al.|[2402.15027v2](http://arxiv.org/abs/2402.15027v2)|null|
+|**2024-02-12**|**Deciphering Heartbeat Signatures: A Vision Transformer Approach to Explainable Atrial Fibrillation Detection from ECG Signals**|Aruna Mohan et.al.|[2402.09474v2](http://arxiv.org/abs/2402.09474v2)|null|
+
+#### Abstracts
+##### **Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**
+2502.09218v1 by Flavio Bertini, Alessandro Dal Palù, Federica Zaglio, Francesco Fabiano, Andrea Formisano
 
-Recent advancements in large language models (LLMs) have significantly
-improved various natural language processing (NLP) tasks. Typically, LLMs are
-trained to predict the next token, aligning well with many NLP tasks. However,
-in knowledge graph (KG) scenarios, entities are the fundamental units and
-identifying an entity requires at least several tokens. This leads to a
-granularity mismatch between KGs and natural languages. To address this issue,
-we propose K-ON, which integrates KG knowledge into the LLM by employing
-multiple head layers for next k-step prediction. K-ON can not only generate
-entity-level results in one step, but also enables contrastive loss against
-entities, which is the most powerful tool in KG representation learning.
-Experimental results show that K-ON outperforms state-of-the-art methods that
-incorporate text and even the other modalities.
+This paper presents a complete explainable system that interprets a set of
+data, abstracts the underlying features and describes them in a natural
+language of choice. The system relies on two crucial stages: (i) identifying
+emerging properties from data and transforming them into abstract concepts, and
+(ii) converting these concepts into natural language. Despite the impressive
+natural language generation capabilities demonstrated by Large Language Models,
+their statistical nature and the intricacy of their internal mechanism still
+force us to employ these techniques as black boxes, forgoing trustworthiness.
+Developing an explainable pipeline for data interpretation would allow
+facilitating its use in safety-critical environments like processing medical
+information and allowing non-experts and visually impaired people to access
+narrated information. To this end, we believe that the fields of knowledge
+representation and automated reasoning research could present a valid
+alternative. Expanding on prior research that tackled the first stage (i), we
+focus on the second stage, named Concept2Text. Being explainable, data
+translation is easily modeled through logic-based rules, once again emphasizing
+the role of declarative programming in achieving AI explainability. This paper
+explores a Prolog/CLP-based rewriting system to interpret concepts-articulated
+in terms of classes and relations, plus common knowledge-derived from a generic
+ontology, generating natural language text. Its main features include
+hierarchical tree rewritings, modular multilingual generation, support for
+equivalent variants across semantic, grammar, and lexical levels, and a
+transparent rule-based system. We outline the architecture and demonstrate its
+flexibility through some examples capable of generating numerous diverse and
+equivalent rewritings based on the input concept.
 
-摘要：大型語言模型 (LLM) 的最新進展顯著提升了各種自然語言處理 (NLP) 任務。通常，LLM 會接受訓練以預測下一個符號，這與許多 NLP 任務非常吻合。然而，在知識圖譜 (KG) 場景中，實體是基本單位，而識別實體至少需要幾個符號。這導致 KG 和自然語言之間的粒度不匹配。為了解決這個問題，我們提出了 K-ON，它透過採用多個頭部層進行下一個 k 步預測，將 KG 知識整合到 LLM 中。K-ON 不僅可以在一個步驟中產生實體層級的結果，還能針對實體啟用對比損失，這是 KG 表示學習中最有力的工具。實驗結果顯示，K-ON 優於將文字甚至其他方式納入考量的最新方法。
+摘要：<paragraph>這篇論文提出了一個完整的可解釋系統，它可以解釋一組資料，抽象出基礎特徵，並以選擇的自然語言描述它們。系統依賴兩個關鍵階段：(i) 從資料中識別新興屬性，並將它們轉換為抽象概念，以及 (ii) 將這些概念轉換為自然語言。儘管大型語言模型展示了令人印象深刻的自然語言生成能力，但它們的統計性質和內部機制的複雜性仍然迫使我們將這些技術用作黑盒子，放棄可信度。開發一個可解釋的資料解釋管道將有助於促進在安全關鍵環境中使用它，例如處理醫療資訊，並允許非專家和視障人士存取敘述資訊。為此，我們相信知識表示和自動推理研究領域可以提出一個有效的替代方案。在擴展解決第一階段 (i) 的先前研究的基礎上，我們專注於第二階段，稱為 Concept2Text。由於具有可解釋性，資料翻譯很容易透過基於邏輯的規則建模，再次強調宣告式程式設計在實現 AI 可解釋性中的作用。本文探討了一個基於 Prolog/CLP 的重寫系統，以解釋概念，這些概念以類別和關係的形式表達，再加上從通用本体衍生的常識，產生自然語言文字。它的主要特點包括階層樹重寫、模組化多語言生成、支援語義、語法和詞彙層面的等效變體，以及一個透明的基於規則的系統。我們概述了架構，並透過一些範例展示了它的靈活性，這些範例能夠根據輸入概念生成許多不同的等效重寫。</paragraph>
 
-##### **LegalViz: Legal Text Visualization by Text To Diagram Generation**
-2502.06147v2 by Eri Onami, Taiki Miyanishi, Koki Maeda, Shuhei Kurita
+##### **Foundation Model of Electronic Medical Records for Adaptive Risk Estimation**
+2502.06124v1 by Pawel Renc, Michal K. Grzeszczyk, Nassim Oufattole, Deirdre Goode, Yugang Jia, Szymon Bieganski, Matthew B. A. McDermott, Jaroslaw Was, Anthony E. Samir, Jonathan W. Cunningham, David W. Bates, Arkadiusz Sitek
 
-Legal documents including judgments and court orders require highly
-sophisticated legal knowledge for understanding. To disclose expert knowledge
-for non-experts, we explore the problem of visualizing legal texts with
-easy-to-understand diagrams and propose a novel dataset of LegalViz with 23
-languages and 7,010 cases of legal document and visualization pairs, using the
-DOT graph description language of Graphviz. LegalViz provides a simple diagram
-from a complicated legal corpus identifying legal entities, transactions, legal
-sources, and statements at a glance, that are essential in each judgment. In
-addition, we provide new evaluation metrics for the legal diagram visualization
-by considering graph structures, textual similarities, and legal contents. We
-conducted empirical studies on few-shot and finetuning large language models
-for generating legal diagrams and evaluated them with these metrics, including
-legal content-based evaluation within 23 languages. Models trained with
-LegalViz outperform existing models including GPTs, confirming the
-effectiveness of our dataset.
+We developed the Enhanced Transformer for Health Outcome Simulation (ETHOS),
+an AI model that tokenizes patient health timelines (PHTs) from EHRs. ETHOS
+predicts future PHTs using transformer-based architectures. The Adaptive Risk
+Estimation System (ARES) employs ETHOS to compute dynamic and personalized risk
+probabilities for clinician-defined critical events. ARES incorporates a
+personalized explainability module that identifies key clinical factors
+influencing risk estimates for individual patients. ARES was evaluated on the
+MIMIC-IV v2.2 dataset in emergency department (ED) settings, benchmarking its
+performance against traditional early warning systems and machine learning
+models. We processed 299,721 unique patients from MIMIC-IV into 285,622 PHTs,
+with 60% including hospital admissions. The dataset contained over 357 million
+tokens. ETHOS outperformed benchmark models in predicting hospital admissions,
+ICU admissions, and prolonged hospital stays, achieving superior AUC scores.
+ETHOS-based risk estimates demonstrated robustness across demographic subgroups
+with strong model reliability, confirmed via calibration curves. The
+personalized explainability module provides insights into patient-specific
+factors contributing to risk. ARES, powered by ETHOS, advances predictive
+healthcare AI by providing dynamic, real-time, and personalized risk estimation
+with patient-specific explainability to enhance clinician trust. Its
+adaptability and superior accuracy position it as a transformative tool for
+clinical decision-making, potentially improving patient outcomes and resource
+allocation in emergency and inpatient settings. We release the full code at
+github.com/ipolharvard/ethos-ares to facilitate future research.
 
-摘要：法律文件，包括判決和法院命令，需要高度專業的法律知識才能理解。為了向非專家揭露專家知識，我們探討了使用易於理解的圖表將法律文本視覺化的問題，並提出了一個新的 LegalViz 數據集，其中包含 23 種語言和 7,010 個法律文件和視覺化配對，使用 Graphviz 的 DOT 圖形描述語言。LegalViz 從複雜的法律語料庫中提供了一個簡單的圖表，可以一目了然地識別法律實體、交易、法律來源和陳述，這些在每項判決中都是必不可少的。此外，我們通過考慮圖形結構、文本相似性和法律內容，為法律圖表視覺化提供了新的評估指標。我們對少次學習和微調大型語言模型進行了實證研究，以生成法律圖表，並使用這些指標對它們進行了評估，包括在 23 種語言中基於法律內容的評估。使用 LegalViz 訓練的模型優於現有的模型，包括 GPT，證實了我們數據集的有效性。
+摘要：我們開發了增強型健康結果模擬轉換器 (ETHOS)，
+一種從電子健康紀錄 (EHR) 中將患者健康時間軸 (PHT) 標記化的 AI 模型。ETHOS
+使用基於轉換器的架構預測未來的 PHT。自適應風險評估系統 (ARES) 使用 ETHOS 計算由臨床醫生定義的危急事件的動態且個人化的風險機率。ARES 結合了個人化的可解釋性模組，可找出影響個別患者風險評估的主要臨床因素。ARES 在急診部門 (ED) 設定中針對 MIMIC-IV v2.2 資料集進行評估，並將其效能與傳統的預警系統和機器學習模型進行基準測試。我們將 299,721 位 MIMIC-IV 的獨特患者處理成 285,622 個 PHT，其中 60% 包含住院記錄。該資料集包含超過 3.57 億個標記。ETHOS 在預測住院、加護病房 (ICU) 住院和延長住院時間方面表現優於基準模型，並獲得了較高的 AUC 分數。基於 ETHOS 的風險評估顯示出跨人口統計子群的穩健性，並通過校準曲線確認了強大的模型可靠性。個人化的可解釋性模組提供了對導致風險的患者特定因素的見解。由 ETHOS 驅動的 ARES 透過提供動態、即時且個人化的風險評估，以及患者特定的可解釋性來增強臨床醫生的信任，從而推動了預測性醫療保健 AI 的發展。其適應性和卓越的準確性使其成為臨床決策制定的一種變革性工具，有可能改善緊急和住院環境中的患者結果和資源分配。我們在 github.com/ipolharvard/ethos-ares 上釋出完整程式碼，以利未來的研究。
 
-##### **Deconstructing Depression Stigma: Integrating AI-driven Data Collection and Analysis with Causal Knowledge Graphs**
-2502.06075v1 by Han Meng, Renwen Zhang, Ganyi Wang, Yitian Yang, Peinuan Qin, Jungup Lee, Yi-Chieh Lee
+##### **An Explainable Disease Surveillance System for Early Prediction of Multiple Chronic Diseases**
+2501.15969v1 by Shaheer Ahmad Khan, Muhammad Usamah Shahid, Ahmad Abdullah, Ibrahim Hashmat, Muddassar Farooq
 
-Mental-illness stigma is a persistent social problem, hampering both
-treatment-seeking and recovery. Accordingly, there is a pressing need to
-understand it more clearly, but analyzing the relevant data is highly
-labor-intensive. Therefore, we designed a chatbot to engage participants in
-conversations; coded those conversations qualitatively with AI assistance; and,
-based on those coding results, built causal knowledge graphs to decode stigma.
-The results we obtained from 1,002 participants demonstrate that conversation
-with our chatbot can elicit rich information about people's attitudes toward
-depression, while our AI-assisted coding was strongly consistent with
-human-expert coding. Our novel approach combining large language models (LLMs)
-and causal knowledge graphs uncovered patterns in individual responses and
-illustrated the interrelationships of psychological constructs in the dataset
-as a whole. The paper also discusses these findings' implications for HCI
-researchers in developing digital interventions, decomposing human
-psychological constructs, and fostering inclusive attitudes.
+This study addresses a critical gap in the healthcare system by developing a
+clinically meaningful, practical, and explainable disease surveillance system
+for multiple chronic diseases, utilizing routine EHR data from multiple U.S.
+practices integrated with CureMD's EMR/EHR system. Unlike traditional
+systems--using AI models that rely on features from patients' labs--our
+approach focuses on routinely available data, such as medical history, vitals,
+diagnoses, and medications, to preemptively assess the risks of chronic
+diseases in the next year. We trained three distinct models for each chronic
+disease: prediction models that forecast the risk of a disease 3, 6, and 12
+months before a potential diagnosis. We developed Random Forest models, which
+were internally validated using F1 scores and AUROC as performance metrics and
+further evaluated by a panel of expert physicians for clinical relevance based
+on inferences grounded in medical knowledge. Additionally, we discuss our
+implementation of integrating these models into a practical EMR system. Beyond
+using Shapley attributes and surrogate models for explainability, we also
+introduce a new rule-engineering framework to enhance the intrinsic
+explainability of Random Forests.
 
-摘要：精神疾病的污名化是一個持續存在的社會問題，阻礙了尋求治療和康復。因此，迫切需要更清楚地了解它，但分析相關數據非常費力。因此，我們設計了一個聊天機器人，讓參與者參與對話；使用 AI 協助對這些對話進行定性編碼；並根據這些編碼結果，構建因果知識圖譜來破譯污名化。我們從 1,002 名參與者那裡獲得的結果表明，與我們的聊天機器人的對話可以引出人們對憂鬱症的豐富資訊，而我們 AI 輔助的編碼與人類專家編碼非常一致。我們將大型語言模型 (LLM) 和因果知識圖譜相結合的新方法揭示了個別反應中的模式，並說明了資料集中心理建構之間的相互關係。本文還討論了這些發現對 HCI 研究人員在開發數位介入措施、分解人類心理建構和培養包容態度方面的影響。
+摘要：本研究透過開發一個臨床有意義、實用且可解釋的多重慢性疾病疾病監測系統，來解決醫療保健系統中的重大缺口，利用整合 CureMD 的 EMR/EHR 系統，來自多個美國實務的例行 EHR 資料。與傳統系統不同的是，我們的做法著重在例行可得的資料，例如病歷、生命徵象、診斷和藥物，以預先評估未來一年慢性疾病的風險，而非仰賴病患實驗室特徵的 AI 模型。我們針對每種慢性疾病訓練了三個不同的模型：預測模型，用以預測在潛在診斷前 3、6 和 12 個月的疾病風險。我們開發了隨機森林模型，並使用 F1 分數和 AUROC 作為效能指標，進行內部驗證，並進一步由專家醫師小組根據植基於醫學知識的推論，評估其臨床相關性。此外，我們討論了將這些模型整合到實用 EMR 系統中的實作方式。除了使用 Shapley 屬性和代理模型來解釋外，我們還引進了一個新的規則工程架構，以增強隨機森林的內在可解釋性。
 
-##### **LegalSeg: Unlocking the Structure of Indian Legal Judgments Through Rhetorical Role Classification**
-2502.05836v1 by Shubham Kumar Nigam, Tanmay Dubey, Govind Sharma, Noel Shallum, Kripabandhu Ghosh, Arnab Bhattacharya
+##### **Ensuring Medical AI Safety: Explainable AI-Driven Detection and Mitigation of Spurious Model Behavior and Associated Data**
+2501.13818v1 by Frederik Pahde, Thomas Wiegand, Sebastian Lapuschkin, Wojciech Samek
 
-In this paper, we address the task of semantic segmentation of legal
-documents through rhetorical role classification, with a focus on Indian legal
-judgments. We introduce LegalSeg, the largest annotated dataset for this task,
-comprising over 7,000 documents and 1.4 million sentences, labeled with 7
-rhetorical roles. To benchmark performance, we evaluate multiple
-state-of-the-art models, including Hierarchical BiLSTM-CRF,
-TransformerOverInLegalBERT (ToInLegalBERT), Graph Neural Networks (GNNs), and
-Role-Aware Transformers, alongside an exploratory RhetoricLLaMA, an
-instruction-tuned large language model. Our results demonstrate that models
-incorporating broader context, structural relationships, and sequential
-sentence information outperform those relying solely on sentence-level
-features. Additionally, we conducted experiments using surrounding context and
-predicted or actual labels of neighboring sentences to assess their impact on
-classification accuracy. Despite these advancements, challenges persist in
-distinguishing between closely related roles and addressing class imbalance.
-Our work underscores the potential of advanced techniques for improving legal
-document understanding and sets a strong foundation for future research in
-legal NLP.
+Deep neural networks are increasingly employed in high-stakes medical
+applications, despite their tendency for shortcut learning in the presence of
+spurious correlations, which can have potentially fatal consequences in
+practice. Detecting and mitigating shortcut behavior is a challenging task that
+often requires significant labeling efforts from domain experts. To alleviate
+this problem, we introduce a semi-automated framework for the identification of
+spurious behavior from both data and model perspective by leveraging insights
+from eXplainable Artificial Intelligence (XAI). This allows the retrieval of
+spurious data points and the detection of model circuits that encode the
+associated prediction rules. Moreover, we demonstrate how these shortcut
+encodings can be used for XAI-based sample- and pixel-level data annotation,
+providing valuable information for bias mitigation methods to unlearn the
+undesired shortcut behavior. We show the applicability of our framework using
+four medical datasets across two modalities, featuring controlled and
+real-world spurious correlations caused by data artifacts. We successfully
+identify and mitigate these biases in VGG16, ResNet50, and contemporary Vision
+Transformer models, ultimately increasing their robustness and applicability
+for real-world medical tasks.
 
-摘要：<paragraph>在本文中，我們通過修辭角色分類來探討法律文件的語義分段任務，重點關注印度法律判決。我們引入了 LegalSeg，這是此任務中最大的註釋資料集，包含超過 7,000 份文件和 140 萬個句子，並標記了 7 個修辭角色。為了評量效能，我們評估了多個最先進的模型，包括分層 BiLSTM-CRF、TransformerOverInLegalBERT (ToInLegalBERT)、圖神經網路 (GNN) 和角色感知Transformer，以及探索性的 RhetoricLLaMA，一種經過指令調整的大型語言模型。我們的結果表明，結合廣泛背景、結構關係和順序句子資訊的模型，表現優於僅依賴句子層級特徵的模型。此外，我們使用周圍的背景和鄰近句子的預測或實際標籤進行實驗，以評估它們對分類精度的影響。儘管有這些進展，但在區分密切相關的角色和解決類別不平衡方面仍存在挑戰。我們的研究強調了先進技術在改善法律文件理解方面的潛力，並為法律自然語言處理的未來研究奠定了堅實的基礎。</paragraph>
+摘要：深度神经网络越来越多地用于高风险医疗应用中，尽管它们在存在虚假相关性的情况下倾向于捷径学习，这在实践中可能产生致命的后果。检测和缓解捷径行为是一项艰巨的任务，通常需要领域专家的大量标记工作。为了缓解这个问题，我们引入了一个半自动框架，用于从数据和模型的角度识别虚假行为，方法是利用可解释人工智能 (XAI) 的见解。这允许检索虚假数据点并检测对关联预测规则进行编码的模型电路。此外，我们演示了如何使用这些捷径编码进行基于 XAI 的样本和像素级数据注释，为偏差缓解方法提供有价值的信息，以消除不需要的捷径行为。我们使用跨越两种方式的四个医学数据集展示了我们框架的适用性，这些数据集具有由数据伪像引起的受控和真实世界虚假相关性。我们成功地识别并减轻了 VGG16、ResNet50 和当代 Vision Transformer 模型中的这些偏差，最终提高了它们的鲁棒性和在真实世界医疗任务中的适用性。
 
-##### **LLM-Powered Decentralized Generative Agents with Adaptive Hierarchical Knowledge Graph for Cooperative Planning**
-2502.05453v1 by Hanqing Yang, Jingdi Chen, Marie Siew, Tania Lorido-Botran, Carlee Joe-Wong
+##### **Enhanced Suicidal Ideation Detection from Social Media Using a CNN-BiLSTM Hybrid Model**
+2501.11094v1 by Mohaiminul Islam Bhuiyan, Nur Shazwani Kamarudin, Nur Hafieza Ismail
 
-Developing intelligent agents for long-term cooperation in dynamic open-world
-scenarios is a major challenge in multi-agent systems. Traditional Multi-agent
-Reinforcement Learning (MARL) frameworks like centralized training
-decentralized execution (CTDE) struggle with scalability and flexibility. They
-require centralized long-term planning, which is difficult without custom
-reward functions, and face challenges in processing multi-modal data. CTDE
-approaches also assume fixed cooperation strategies, making them impractical in
-dynamic environments where agents need to adapt and plan independently. To
-address decentralized multi-agent cooperation, we propose Decentralized
-Adaptive Knowledge Graph Memory and Structured Communication System (DAMCS) in
-a novel Multi-agent Crafter environment. Our generative agents, powered by
-Large Language Models (LLMs), are more scalable than traditional MARL agents by
-leveraging external knowledge and language for long-term planning and
-reasoning. Instead of fully sharing information from all past experiences,
-DAMCS introduces a multi-modal memory system organized as a hierarchical
-knowledge graph and a structured communication protocol to optimize agent
-cooperation. This allows agents to reason from past interactions and share
-relevant information efficiently. Experiments on novel multi-agent open-world
-tasks show that DAMCS outperforms both MARL and LLM baselines in task
-efficiency and collaboration. Compared to single-agent scenarios, the two-agent
-scenario achieves the same goal with 63% fewer steps, and the six-agent
-scenario with 74% fewer steps, highlighting the importance of adaptive memory
-and structured communication in achieving long-term goals. We publicly release
-our project at: https://happyeureka.github.io/damcs.
+Suicidal ideation detection is crucial for preventing suicides, a leading
+cause of death worldwide. Many individuals express suicidal thoughts on social
+media, offering a vital opportunity for early detection through advanced
+machine learning techniques. The identification of suicidal ideation in social
+media text is improved by utilising a hybrid framework that integrates
+Convolutional Neural Networks (CNN) and Bidirectional Long Short-Term Memory
+(BiLSTM), enhanced with an attention mechanism. To enhance the interpretability
+of the model's predictions, Explainable AI (XAI) methods are applied, with a
+particular focus on SHapley Additive exPlanations (SHAP), are incorporated. At
+first, the model managed to reach an accuracy of 92.81%. By applying
+fine-tuning and early stopping techniques, the accuracy improved to 94.29%. The
+SHAP analysis revealed key features influencing the model's predictions, such
+as terms related to mental health struggles. This level of transparency boosts
+the model's credibility while helping mental health professionals understand
+and trust the predictions. This work highlights the potential for improving the
+accuracy and interpretability of detecting suicidal tendencies, making a
+valuable contribution to the progress of mental health monitoring systems. It
+emphasizes the significance of blending powerful machine learning methods with
+explainability to develop reliable and impactful mental health solutions.
 
-摘要：<paragraph>在動態開放世界情境中開發用於長期合作的智慧代理是多重代理系統中的一項重大挑戰。傳統的多重代理強化學習 (MARL) 框架，例如集中式訓練去中心化執行 (CTDE)，在可擴充性和靈活性方面面臨困難。它們需要集中式長期規劃，這在沒有自訂獎勵函數的情況下很難執行，並且在處理多模式數據時會面臨挑戰。CTDE 方法還假設固定的合作策略，這使得它們在代理需要獨立適應和規劃的動態環境中不切實際。為了解決分散式多重代理合作問題，我們在一個新穎的多重代理工匠環境中提出了分散式自適應知識圖譜記憶體和結構化通訊系統 (DAMCS)。我們的生成代理由大型語言模型 (LLM) 提供支援，透過利用外部知識和語言進行長期規劃和推理，比傳統的 MARL 代理更具可擴充性。DAMCS 沒有完全分享來自所有過去經驗的資訊，而是引入了多模式記憶體系統，該系統組織成階層式知識圖譜和結構化通訊協定，以最佳化代理合作。這允許代理根據過去的互動進行推理並有效地分享相關資訊。在新的多重代理開放世界任務上的實驗表明，DAMCS 在任務效率和協作方面優於 MARL 和 LLM 基準。與單一代理情境相比，雙重代理情境以少 63% 的步驟達成相同的目標，而六重代理情境則以少 74% 的步驟達成目標，突顯了自適應記憶體和結構化通訊在達成長期目標中的重要性。我們公開發布我們的專案於：https://happyeureka.github.io/damcs。</paragraph>
+摘要：自殺意念偵測對於預防自殺至關重要，而自殺是全球主要的死亡原因。許多人在社群媒體上表達自殺念頭，這提供了透過進階機器學習技術進行早期偵測的重要機會。透過整合卷積神經網路 (CNN) 和雙向長短期記憶 (BiLSTM) 的混合架構，並加入注意力機制，可以提升在社群媒體文字中辨識自殺意念的能力。為了加強模型預測的可解釋性，我們採用可解釋人工智慧 (XAI) 方法，特別著重於 SHapley 加法解釋 (SHAP)。一開始，模型成功達到 92.81% 的準確度。透過套用微調和早期停止技術，準確度提升至 94.29%。SHAP 分析揭露了影響模型預測的關鍵特徵，例如與心理健康困境相關的詞彙。這種透明度提升了模型的可信度，同時協助心理健康專業人員理解和信賴預測結果。這項工作突顯了提升偵測自殺傾向的準確度和可解釋性的潛力，為心理健康監控系統的進展做出寶貴的貢獻。它強調了將強大的機器學習方法與可解釋性相結合以開發可靠且有影響力的心理健康解決方案的重要性。
 
-##### **SAMGPT: Text-free Graph Foundation Model for Multi-domain Pre-training and Cross-domain Adaptation**
-2502.05424v1 by Xingtong Yu, Zechuan Gong, Chang Zhou, Yuan Fang, Hui Zhang
+##### **SEANN: A Domain-Informed Neural Network for Epidemiological Insights**
+2501.10273v1 by Jean-Baptiste Guimbaud, Marc Plantevit, Léa Maître, Rémy Cazabet
 
-Graphs are able to model interconnected entities in many online services,
-supporting a wide range of applications on the Web. This raises an important
-question: How can we train a graph foundational model on multiple source
-domains and adapt to an unseen target domain? A major obstacle is that graphs
-from different domains often exhibit divergent characteristics. Some studies
-leverage large language models to align multiple domains based on textual
-descriptions associated with the graphs, limiting their applicability to
-text-attributed graphs. For text-free graphs, a few recent works attempt to
-align different feature distributions across domains, while generally
-neglecting structural differences. In this work, we propose a novel Structure
-Alignment framework for text-free Multi-domain Graph Pre-Training and
-cross-domain adaptation (SAMGPT). It is designed to learn multi-domain
-knowledge from graphs originating in multiple source domains, which can then be
-adapted to address applications in an unseen target domain. Specifically, we
-introduce a set of structure tokens to harmonize structure-based aggregation
-across source domains during the pre-training phase. Next, for cross-domain
-adaptation, we design dual prompts, namely, holistic prompts and specific
-prompts, which adapt unified multi-domain structural knowledge and
-fine-grained, domain-specific information, respectively, to a target domain.
-Finally, we conduct comprehensive experiments on seven public datasets to
-evaluate and analyze the effectiveness of SAMGPT.
+In epidemiology, traditional statistical methods such as logistic regression,
+linear regression, and other parametric models are commonly employed to
+investigate associations between predictors and health outcomes. However,
+non-parametric machine learning techniques, such as deep neural networks
+(DNNs), coupled with explainable AI (XAI) tools, offer new opportunities for
+this task. Despite their potential, these methods face challenges due to the
+limited availability of high-quality, high-quantity data in this field. To
+address these challenges, we introduce SEANN, a novel approach for informed
+DNNs that leverages a prevalent form of domain-specific knowledge: Pooled
+Effect Sizes (PES). PESs are commonly found in published Meta-Analysis studies,
+in different forms, and represent a quantitative form of a scientific
+consensus. By direct integration within the learning procedure using a custom
+loss, we experimentally demonstrate significant improvements in the
+generalizability of predictive performances and the scientific plausibility of
+extracted relationships compared to a domain-knowledge agnostic neural network
+in a scarce and noisy data setting.
 
-摘要：圖表能夠在許多線上服務中對相互關聯的實體進行建模，
-支援網路上廣泛的應用程式。這提出了重要的問題：我們如何針對多個來源網域訓練圖表基礎模型，並適應未見過的目標網域？一個主要的障礙是，來自不同網域的圖表通常表現出不同的特性。一些研究利用大型語言模型，根據與圖表相關的文字描述，對齊多個網域，限制其適用性於有文字屬性的圖表。對於沒有文字的圖表，最近的一些作品嘗試對齊跨網域的不同特徵分佈，同時通常忽略結構上的差異。在這項工作中，我們提出了一個新的結構對齊框架，用於無文字多網域圖表預訓練和跨網域適應 (SAMGPT)。它被設計為從起源於多個來源網域的圖表中學習多網域知識，然後可以適應於未見過的目標網域中的應用程式。具體來說，我們引入了一組結構化代碼，以在預訓練階段，調和跨來源網域的基於結構的聚合。接下來，對於跨網域適應，我們設計了雙重提示，即整體提示和具體提示，分別將統一的多網域結構知識和細緻的、特定於網域的資訊適應到目標網域。最後，我們在七個公共資料集上進行了全面的實驗，以評估和分析 SAMGPT 的有效性。
+摘要：在流行病學中，傳統的統計方法，例如邏輯迴歸、線性迴歸和其他參數模型通常用於調查預測因子與健康結果之間的關聯。然而，非參數機器學習技術，例如深度神經網路 (DNN)，結合可解釋的 AI (XAI) 工具，為這項任務提供了新的機會。儘管這些方法具有潛力，但由於該領域缺乏高品質、高數量資料，因此這些方法面臨挑戰。為了應對這些挑戰，我們引入了 SEANN，這是一種新穎的方法，用於獲取知識的 DNN，它利用了一種流行的領域特定知識形式：彙總效應量 (PES)。PES 通常以不同的形式出現在已發表的 Meta 分析研究中，並代表科學共識的量化形式。通過使用自訂損失函數直接整合在學習程序中，我們以實驗方式證明了預測效能的概括性以及與從缺乏領域知識的神經網路中提取的關係相比，科學合理性的顯著提升，且是在稀少且有雜訊的資料設定中。
 
-##### **Graph-based Molecular In-context Learning Grounded on Morgan Fingerprints**
-2502.05414v1 by Ali Al-Lawati, Jason Lucas, Zhiwei Zhang, Prasenjit Mitra, Suhang Wang
+##### **Artificial Intelligence-Driven Clinical Decision Support Systems**
+2501.09628v1 by Muhammet Alkan, Idris Zakariyya, Samuel Leighton, Kaushik Bhargav Sivangi, Christos Anagnostopoulos, Fani Deligianni
 
-In-context learning (ICL) effectively conditions large language models (LLMs)
-for molecular tasks, such as property prediction and molecule captioning, by
-embedding carefully selected demonstration examples into the input prompt. This
-approach avoids the computational overhead of extensive pertaining and
-fine-tuning. However, current prompt retrieval methods for molecular tasks have
-relied on molecule feature similarity, such as Morgan fingerprints, which do
-not adequately capture the global molecular and atom-binding relationships. As
-a result, these methods fail to represent the full complexity of molecular
-structures during inference. Moreover, small-to-medium-sized LLMs, which offer
-simpler deployment requirements in specialized systems, have remained largely
-unexplored in the molecular ICL literature. To address these gaps, we propose a
-self-supervised learning technique, GAMIC (Graph-Aligned Molecular In-Context
-learning, which aligns global molecular structures, represented by graph neural
-networks (GNNs), with textual captions (descriptions) while leveraging local
-feature similarity through Morgan fingerprints. In addition, we introduce a
-Maximum Marginal Relevance (MMR) based diversity heuristic during retrieval to
-optimize input prompt demonstration samples. Our experimental findings using
-diverse benchmark datasets show GAMIC outperforms simple Morgan-based ICL
-retrieval methods across all tasks by up to 45%.
+As artificial intelligence (AI) becomes increasingly embedded in healthcare
+delivery, this chapter explores the critical aspects of developing reliable and
+ethical Clinical Decision Support Systems (CDSS). Beginning with the
+fundamental transition from traditional statistical models to sophisticated
+machine learning approaches, this work examines rigorous validation strategies
+and performance assessment methods, including the crucial role of model
+calibration and decision curve analysis. The chapter emphasizes that creating
+trustworthy AI systems in healthcare requires more than just technical
+accuracy; it demands careful consideration of fairness, explainability, and
+privacy. The challenge of ensuring equitable healthcare delivery through AI is
+stressed, discussing methods to identify and mitigate bias in clinical
+predictive models. The chapter then delves into explainability as a cornerstone
+of human-centered CDSS. This focus reflects the understanding that healthcare
+professionals must not only trust AI recommendations but also comprehend their
+underlying reasoning. The discussion advances in an analysis of privacy
+vulnerabilities in medical AI systems, from data leakage in deep learning
+models to sophisticated attacks against model explanations. The text explores
+privacy-preservation strategies such as differential privacy and federated
+learning, while acknowledging the inherent trade-offs between privacy
+protection and model performance. This progression, from technical validation
+to ethical considerations, reflects the multifaceted challenges of developing
+AI systems that can be seamlessly and reliably integrated into daily clinical
+practice while maintaining the highest standards of patient care and data
+protection.
 
-摘要：<paragraph>情境學習 (ICL) 有效地調整大型語言模型 (LLM)，以執行分子任務，例如屬性預測和分子標題，方法是將仔細挑選的示範範例嵌入輸入提示中。這種方法避免了廣泛相關和微調的計算開銷。然而，目前針對分子任務的提示檢索方法依賴於分子特徵相似性，例如 Morgan 指紋，而無法充分捕捉全局分子和原子鍵結關係。因此，這些方法無法在推理過程中表示分子結構的完整複雜性。此外，在專業系統中提供更簡單部署需求的小到中型的 LLM，在分子 ICL 文獻中仍未得到充分探索。為了解決這些差距，我們提出了一種自我監督學習技術，GAMIC（圖形對齊分子情境學習），它將由圖形神經網路 (GNN) 表示的全局分子結構與文字標題（描述）對齊，同時透過 Morgan 指紋利用局部特徵相似性。此外，我們在檢索過程中引入了一個基於最大邊際相關性 (MMR) 的多樣性啟發法，以最佳化輸入提示示範樣本。我們使用不同的基準資料集進行的實驗結果顯示，GAMIC 在所有任務中都優於基於 Morgan 的簡單 ICL 檢索方法，最多可達 45%。</paragraph>
+摘要：隨著人工智慧 (AI) 在醫療保健中的應用日益普及，本章探討了開發可靠且符合道德標準的臨床決策支援系統 (CDSS) 的關鍵面向。從傳統統計模型到複雜機器學習方法的基本轉變開始，這項工作審查了嚴謹的驗證策略和效能評估方法，包括模型校準和決策曲線分析的關鍵角色。本章強調，在醫療保健中建立值得信賴的 AI 系統不只是技術上的準確性；它需要仔細考量公平性、可解釋性和隱私權。本章強調了透過 AI 確保公平的醫療保健服務的挑戰，並討論了識別和減輕臨床預測模型中偏差的方法。接著，本章深入探討可解釋性，作為以人為中心的 CDSS 的基石。這種關注反映了醫療保健專業人員不僅必須信任 AI 建議，還必須理解其背後的推理。討論進一步分析了醫療 AI 系統中的隱私漏洞，從深度學習模型中的資料外洩到針對模型解釋的複雜攻擊。本文探討了隱私保護策略，例如差分隱私和聯合學習，同時承認隱私保護和模型效能之間的固有取捨。這種從技術驗證到道德考量的進展，反映了開發 AI 系統的多面向挑戰，這些系統可以無縫且可靠地整合到日常臨床實務中，同時維持最高的病患照護和資料保護標準。
 
-##### **Knowledge Graph-Guided Retrieval Augmented Generation**
-2502.06864v1 by Xiangrong Zhu, Yuexiang Xie, Yi Liu, Yaliang Li, Wei Hu
+##### **MedGrad E-CLIP: Enhancing Trust and Transparency in AI-Driven Skin Lesion Diagnosis**
+2501.06887v1 by Sadia Kamal, Tim Oates
 
-Retrieval-augmented generation (RAG) has emerged as a promising technology
-for addressing hallucination issues in the responses generated by large
-language models (LLMs). Existing studies on RAG primarily focus on applying
-semantic-based approaches to retrieve isolated relevant chunks, which ignore
-their intrinsic relationships. In this paper, we propose a novel Knowledge
-Graph-Guided Retrieval Augmented Generation (KG$^2$RAG) framework that utilizes
-knowledge graphs (KGs) to provide fact-level relationships between chunks,
-improving the diversity and coherence of the retrieved results. Specifically,
-after performing a semantic-based retrieval to provide seed chunks, KG$^2$RAG
-employs a KG-guided chunk expansion process and a KG-based chunk organization
-process to deliver relevant and important knowledge in well-organized
-paragraphs. Extensive experiments conducted on the HotpotQA dataset and its
-variants demonstrate the advantages of KG$^2$RAG compared to existing RAG-based
-approaches, in terms of both response quality and retrieval quality.
+As deep learning models gain attraction in medical data, ensuring transparent
+and trustworthy decision-making is essential. In skin cancer diagnosis, while
+advancements in lesion detection and classification have improved accuracy, the
+black-box nature of these methods poses challenges in understanding their
+decision processes, leading to trust issues among physicians. This study
+leverages the CLIP (Contrastive Language-Image Pretraining) model, trained on
+different skin lesion datasets, to capture meaningful relationships between
+visual features and diagnostic criteria terms. To further enhance transparency,
+we propose a method called MedGrad E-CLIP, which builds on gradient-based
+E-CLIP by incorporating a weighted entropy mechanism designed for complex
+medical imaging like skin lesions. This approach highlights critical image
+regions linked to specific diagnostic descriptions. The developed integrated
+pipeline not only classifies skin lesions by matching corresponding
+descriptions but also adds an essential layer of explainability developed
+especially for medical data. By visually explaining how different features in
+an image relates to diagnostic criteria, this approach demonstrates the
+potential of advanced vision-language models in medical image analysis,
+ultimately improving transparency, robustness, and trust in AI-driven
+diagnostic systems.
 
-摘要：檢索增強生成 (RAG) 已成為一項有前途的技術，用於解決大型語言模型 (LLM) 所產生回應中的幻覺問題。現有關於 RAG 的研究主要專注於應用基於語義的方法來檢索孤立相關的區塊，而忽略它們的內在關係。在本文中，我們提出了一個新穎的知識圖表引導檢索增強生成 (KG$^2$RAG) 框架，它利用知識圖表 (KG) 來提供區塊之間的事實層級關係，從而提高檢索結果的多樣性和一致性。具體來說，在執行基於語義的檢索以提供種子區塊後，KG$^2$RAG 採用 KG 引導的區塊擴充程序和基於 KG 的區塊組織程序，以在組織良好的段落中傳達相關且重要的知識。在 HotpotQA 資料集及其變體上進行的大量實驗證明了 KG$^2$RAG 在回應品質和檢索品質方面優於現有的基於 RAG 的方法。
+摘要：随着深度学习模型在医学数据中获得关注，确保透明且值得信赖的决策至关重要。在皮肤癌诊断中，虽然病灶检测和分类的进步提高了准确性，但这些方法的黑盒性质对理解其决策过程构成了挑战，导致医生之间的信任问题。本研究利用在不同皮肤病变数据集上训练的 CLIP（对比语言图像预训练）模型，以捕捉视觉特征和诊断标准术语之间的有意义关系。为了进一步提高透明度，我们提出了一种名为 MedGrad E-CLIP 的方法，该方法通过结合专为皮肤病变等复杂医学影像设计的加权熵机制，建立在基于梯度的 E-CLIP 之上。此方法突出了与特定诊断描述相关联的关键图像区域。开发的集成管道不仅通过匹配相应的描述对皮肤病变进行分类，还添加了一层专门为医学数据开发的基本可解释性。通过直观地解释图像中不同特征与诊断标准的关系，这种方法展示了高级视觉语言模型在医学图像分析中的潜力，最终提高了透明度、稳健性和对人工智能驱动的诊断系统的信任。
 
-##### **Can Large Language Models Understand Intermediate Representations?**
-2502.06854v1 by Hailong Jiang, Jianfeng Zhu, Yao Wan, Bo Fang, Hongyu Zhang, Ruoming Jin, Qiang Guan
+##### **Explaining Humour Style Classifications: An XAI Approach to Understanding Computational Humour Analysis**
+2501.02891v1 by Mary Ogbuka Kenneth, Foaad Khosmood, Abbas Edalat
 
-Intermediate Representations (IRs) are essential in compiler design and
-program analysis, yet their comprehension by Large Language Models (LLMs)
-remains underexplored. This paper presents a pioneering empirical study to
-investigate the capabilities of LLMs, including GPT-4, GPT-3, Gemma 2, LLaMA
-3.1, and Code Llama, in understanding IRs. We analyze their performance across
-four tasks: Control Flow Graph (CFG) reconstruction, decompilation, code
-summarization, and execution reasoning. Our results indicate that while LLMs
-demonstrate competence in parsing IR syntax and recognizing high-level
-structures, they struggle with control flow reasoning, execution semantics, and
-loop handling. Specifically, they often misinterpret branching instructions,
-omit critical IR operations, and rely on heuristic-based reasoning, leading to
-errors in CFG reconstruction, IR decompilation, and execution reasoning. The
-study underscores the necessity for IR-specific enhancements in LLMs,
-recommending fine-tuning on structured IR datasets and integration of explicit
-control flow models to augment their comprehension and handling of IR-related
-tasks.
+Humour styles can have either a negative or a positive impact on well-being.
+Given the importance of these styles to mental health, significant research has
+been conducted on their automatic identification. However, the automated
+machine learning models used for this purpose are black boxes, making their
+prediction decisions opaque. Clarity and transparency are vital in the field of
+mental health. This paper presents an explainable AI (XAI) framework for
+understanding humour style classification, building upon previous work in
+computational humour analysis. Using the best-performing single model
+(ALI+XGBoost) from prior research, we apply comprehensive XAI techniques to
+analyse how linguistic, emotional, and semantic features contribute to humour
+style classification decisions. Our analysis reveals distinct patterns in how
+different humour styles are characterised and misclassified, with particular
+emphasis on the challenges in distinguishing affiliative humour from other
+styles. Through detailed examination of feature importance, error patterns, and
+misclassification cases, we identify key factors influencing model decisions,
+including emotional ambiguity, context misinterpretation, and target
+identification. The framework demonstrates significant utility in understanding
+model behaviour, achieving interpretable insights into the complex interplay of
+features that define different humour styles. Our findings contribute to both
+the theoretical understanding of computational humour analysis and practical
+applications in mental health, content moderation, and digital humanities
+research.
 
-摘要：中間表徵 (IR) 在編譯器設計和程式分析中至關重要，但大型語言模型 (LLM) 對其理解仍未得到充分探討。本文提出了一項開創性的實證研究，以探討 LLM（包括 GPT-4、GPT-3、Gemma 2、LLaMA 3.1 和 Code Llama）理解 IR 的能力。我們分析了它們在四項任務中的表現：控制流程圖 (CFG) 重建、反編譯、程式碼摘要和執行推理。我們的結果表明，儘管 LLM 在解析 IR 語法和識別高階結構方面表現出能力，但它們在控制流程推理、執行語義和迴圈處理方面存在困難。具體而言，它們經常誤解分支指令、省略關鍵 IR 操作，並依賴於基於啟發式的推理，導致 CFG 重建、IR 反編譯和執行推理出現錯誤。這項研究強調了 LLM 中對 IR 特定的增強的必要性，建議對結構化的 IR 資料集進行微調，並整合明確的控制流程模型，以增強其對 IR 相關任務的理解和處理。
+摘要：幽默風格對幸福感可能產生負面或正面的影響。
+鑑於這些風格對心理健康的重要性，已經對其自動識別進行了大量研究。然而，用於此目的的自動機器學習模型是黑盒子，使得其預測決策不透明。清晰度和透明度在心理健康領域至關重要。本文提出了一個可解釋的 AI (XAI) 框架，用於理解幽默風格分類，建立在計算幽默分析的先前工作之上。使用先前研究中表現最好的單一模型 (ALI+XGBoost)，我們應用全面的 XAI 技術來分析語言、情緒和語義特徵如何影響幽默風格分類決策。我們的分析揭示了不同幽默風格如何被表徵和錯誤分類的不同模式，特別強調了區分聯屬幽默與其他風格的挑戰。通過仔細檢查特徵重要性、錯誤模式和錯誤分類案例，我們確定了影響模型決策的關鍵因素，包括情緒模糊、情境誤解和目標識別。該框架展示了在理解模型行為方面的顯著效用，實現了對定義不同幽默風格的特徵之間複雜相互作用的可解釋見解。我們的發現有助於計算幽默分析的理論理解和心理健康、內容審核和數字人文研究中的實際應用。
 
-##### **GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity?**
-2502.05252v1 by Yang Zhou, Hongyi Liu, Zhuoming Chen, Yuandong Tian, Beidi Chen
+##### **The Emotional Spectrum of LLMs: Leveraging Empathy and Emotion-Based Markers for Mental Health Support**
+2412.20068v1 by Alessandro De Grandi, Federico Ravenda, Andrea Raballo, Fabio Crestani
 
-Long-context large language models (LLMs) have recently shown strong
-performance in information retrieval and long-document QA. However, to tackle
-the most challenging intellectual problems, LLMs must reason effectively in
-long and complex contexts (e.g., frontier mathematical research). Studying how
-LLMs handle increasing reasoning complexity and context length is essential,
-yet existing benchmarks lack a solid basis for quantitative evaluation.
-Inspired by the abstraction of GSM-8K problems as computational graphs, and the
-ability to introduce noise by adding unnecessary nodes and edges, we develop a
-grade school math problem generator capable of producing arithmetic problems
-with infinite difficulty and context length under fine-grained control. Using
-our newly synthesized GSM-Infinite benchmark, we comprehensively evaluate
-existing LLMs. We find a consistent sigmoid decline in reasoning performance as
-complexity increases, along with a systematic inference scaling trend:
-exponentially increasing inference computation yields only linear performance
-gains. These findings underscore the fundamental limitations of current
-long-context LLMs and the key challenges in scaling reasoning capabilities. Our
-GSM-Infinite benchmark provides a scalable and controllable testbed for
-systematically studying and advancing LLM reasoning in long and complex
-contexts.
+The increasing demand for mental health services has highlighted the need for
+innovative solutions, particularly in the realm of psychological conversational
+AI, where the availability of sensitive data is scarce. In this work, we
+explored the development of a system tailored for mental health support with a
+novel approach to psychological assessment based on explainable emotional
+profiles in combination with empathetic conversational models, offering a
+promising tool for augmenting traditional care, particularly where immediate
+expertise is unavailable. Our work can be divided into two main parts,
+intrinsecaly connected to each other. First, we present RACLETTE, a
+conversational system that demonstrates superior emotional accuracy compared to
+state-of-the-art benchmarks in both understanding users' emotional states and
+generating empathetic responses during conversations, while progressively
+building an emotional profile of the user through their interactions. Second,
+we show how the emotional profiles of a user can be used as interpretable
+markers for mental health assessment. These profiles can be compared with
+characteristic emotional patterns associated with different mental disorders,
+providing a novel approach to preliminary screening and support.
 
-摘要：長文本大型語言模型 (LLM) 最近在資訊檢索和長文件問答中展示了強大的效能。然而，若要解決最具挑戰性的智力問題，LLM 必須在長且複雜的脈絡中有效推理（例如，前沿數學研究）。研究 LLM 如何處理增加的推理複雜性和脈絡長度至關重要，但現有的基準缺乏定量評估的穩固基礎。受到 GSM-8K 問題抽象化為計算圖形的啟發，以及透過加入不必要的節點和邊緣來引入雜訊的能力，我們開發了一個小學數學問題產生器，能夠在細緻的控制下產生具有無限難度和脈絡長度的算術問題。使用我們新合成的 GSM-Infinite 基準，我們全面評估現有的 LLM。我們發現推理效能會隨著複雜性的增加而持續呈 S 形下降，並伴隨著系統性的推論縮放趨勢：指數增加的推論計算僅產生線性的效能增益。這些發現強調了當前長脈絡 LLM 的基本限制，以及擴展推理能力的主要挑戰。我們的 GSM-Infinite 基準提供了一個可擴充且可控的測試平台，用於系統性地研究和提升 LLM 在長且複雜脈絡中的推理能力。
+摘要：隨著對心理健康服務需求的增加，凸顯了創新解決方案的需求，特別是在心理對話式人工智慧領域，那裡缺乏敏感資料。在這項工作中，我們探索了開發一個針對心理健康支持的系統，採用一種基於可解釋的情緒特徵的新方法進行心理評估，結合同理心對話模式，提供了一個有前途的工具，用於擴充傳統照護，特別是在無法立即獲得專業知識的情況下。我們的工作可以分為兩個主要部分，彼此內在相關。首先，我們展示了 RACLETTE，一個對話系統，與最先進的基準相比，在理解使用者情緒狀態和在對話中產生同理心回應方面表現出優越的情緒準確性，同時透過他們的互動逐漸建立使用者的情緒特徵。其次，我們展示了使用者的情緒特徵如何可用作心理健康評估的可解釋標記。這些特徵可以與與不同心理疾病相關的典型情緒模式進行比較，提供了一種初步篩選和支持的新方法。
 
-##### **Causality can systematically address the monsters under the bench(marks)**
-2502.05085v1 by Felix Leeb, Zhijing Jin, Bernhard Schölkopf
+##### **A Review on the Integration of Artificial Intelligence and Medical Imaging in IVF Ovarian Stimulation**
+2412.19688v1 by Jana Zakall, Birgit Pohn, Antonia Graf, Daniel Kovatchki, Arezoo Borji, Ragib Shahriar Islam, Hossam Haick, Heinz Strohmer, Sepideh Hatamikia
 
-Effective and reliable evaluation is essential for advancing empirical
-machine learning. However, the increasing accessibility of generalist models
-and the progress towards ever more complex, high-level tasks make systematic
-evaluation more challenging. Benchmarks are plagued by various biases,
-artifacts, or leakage, while models may behave unreliably due to poorly
-explored failure modes. Haphazard treatments and inconsistent formulations of
-such "monsters" can contribute to a duplication of efforts, a lack of trust in
-results, and unsupported inferences. In this position paper, we argue causality
-offers an ideal framework to systematically address these challenges. By making
-causal assumptions in an approach explicit, we can faithfully model phenomena,
-formulate testable hypotheses with explanatory power, and leverage principled
-tools for analysis. To make causal model design more accessible, we identify
-several useful Common Abstract Topologies (CATs) in causal graphs which help
-gain insight into the reasoning abilities in large language models. Through a
-series of case studies, we demonstrate how the precise yet pragmatic language
-of causality clarifies the strengths and limitations of a method and inspires
-new approaches for systematic progress.
+Artificial intelligence (AI) has emerged as a powerful tool to enhance
+decision-making and optimize treatment protocols in in vitro fertilization
+(IVF). In particular, AI shows significant promise in supporting
+decision-making during the ovarian stimulation phase of the IVF process. This
+review evaluates studies focused on the applications of AI combined with
+medical imaging in ovarian stimulation, examining methodologies, outcomes, and
+current limitations. Our analysis of 13 studies on this topic reveals that,
+reveal that while AI algorithms demonstrated notable potential in predicting
+optimal hormonal dosages, trigger timing, and oocyte retrieval outcomes, the
+medical imaging data utilized predominantly came from two-dimensional (2D)
+ultrasound which mainly involved basic quantifications, such as follicle size
+and number, with limited use of direct feature extraction or advanced image
+analysis techniques. This points to an underexplored opportunity where advanced
+image analysis approaches, such as deep learning, and more diverse imaging
+modalities, like three-dimensional (3D) ultrasound, could unlock deeper
+insights. Additionally, the lack of explainable AI (XAI) in most studies raises
+concerns about the transparency and traceability of AI-driven decisions - key
+factors for clinical adoption and trust. Furthermore, many studies relied on
+single-center designs and small datasets, which limit the generalizability of
+their findings. This review highlights the need for integrating advanced
+imaging analysis techniques with explainable AI methodologies, as well as the
+importance of leveraging multicenter collaborations and larger datasets.
+Addressing these gaps has the potential to enhance ovarian stimulation
+management, paving the way for efficient, personalized, and data-driven
+treatment pathways that improve IVF outcomes.
+
+摘要：人工智慧（AI）已成為增強體外受精（IVF）決策制定和優化治療方案的強大工具。特別是，AI 在支持 IVF 過程中卵巢刺激階段的決策制定方面顯示出顯著的前景。本綜述評估了專注於 AI 結合卵巢刺激中的醫學影像應用、檢驗方法、結果和當前限制的研究。我們對 13 項關於此主題的研究分析顯示，雖然 AI 演算法在預測最佳荷爾蒙劑量、觸發時機和卵子取出結果方面表現出顯著的潛力，但所利用的醫學影像數據主要來自於二次元（2D）超音波，而二次元超音波主要涉及基本量化，例如濾泡大小和數量，且有限使用直接特徵提取或進階影像分析技術。這指向一個尚未探索的機會，例如深度學習等進階影像分析方法，以及更多元的影像模式，例如三維（3D）超音波，可以解鎖更深入的見解。此外，大多數研究缺乏可解釋 AI（XAI），這引起了人們對 AI 驅動決策的透明度和可追溯性的擔憂，而透明度和可追溯性是臨床採用和信任的關鍵因素。此外，許多研究依賴於單中心設計和小型數據集，這限制了其發現的普遍性。本綜述強調了將進階影像分析技術與可解釋 AI 方法整合起來的必要性，以及利用多中心合作和大型數據集的重要性。解決這些差距有可能增強卵巢刺激管理，為有效、個人化和數據驅動的治療途徑鋪平道路，進而改善 IVF 結果。
+
+##### **Enhancing Cancer Diagnosis with Explainable & Trustworthy Deep Learning Models**
+2412.17527v1 by Badaru I. Olumuyiwa, The Anh Han, Zia U. Shamszaman
+
+This research presents an innovative approach to cancer diagnosis and
+prediction using explainable Artificial Intelligence (XAI) and deep learning
+techniques. With cancer causing nearly 10 million deaths globally in 2020,
+early and accurate diagnosis is crucial. Traditional methods often face
+challenges in cost, accuracy, and efficiency. Our study develops an AI model
+that provides precise outcomes and clear insights into its decision-making
+process, addressing the "black box" problem of deep learning models. By
+employing XAI techniques, we enhance interpretability and transparency,
+building trust among healthcare professionals and patients. Our approach
+leverages neural networks to analyse extensive datasets, identifying patterns
+for cancer detection. This model has the potential to revolutionise diagnosis
+by improving accuracy, accessibility, and clarity in medical decision-making,
+possibly leading to earlier detection and more personalised treatment
+strategies. Furthermore, it could democratise access to high-quality
+diagnostics, particularly in resource-limited settings, contributing to global
+health equity. The model's applications extend beyond cancer diagnosis,
+potentially transforming various aspects of medical decision-making and saving
+millions of lives worldwide.
 
-摘要：有效的、可靠的評估對於推進經驗機器學習至關重要。然而，一般化模型的可及性日益提高，以及朝著更複雜、更高級別任務的進展，使得系統評估更具挑戰性。基準測試受到各種偏差、人工製品或洩漏的困擾，而模型由於探索不充分的故障模式而可能表現得不可靠。隨意處理和不一致的表述等「怪物」可能會導致重複工作、對結果缺乏信任以及不支援的推論。在本文中，我們論證因果關係提供了一個系統性解決這些挑戰的理想框架。通過在方法中明確因果假設，我們可以忠實地模擬現象，制定具有解釋力的可測試假設，並利用原則性的分析工具。為了使因果模型設計更易於使用，我們在因果圖中識別出幾個有用的通用抽象拓撲 (CAT)，有助於深入了解大型語言模型中的推理能力。通過一系列案例研究，我們展示了因果關係的精確但務實的語言如何釐清方法的優缺點，並激發系統進展的新方法。
+摘要：本研究提出了一個創新的癌症診斷和預測方法，使用可解釋的人工智慧 (XAI) 和深度學習技術。由於癌症在 2020 年造成全球近 1,000 萬人死亡，因此早期準確的診斷至關重要。傳統方法通常面臨成本、準確性和效率方面的挑戰。我們的研究開發了一個 AI 模型，它提供精確的結果並清楚地了解其決策過程，解決了深度學習模型的「黑箱」問題。通過採用 XAI 技術，我們增強了解釋性和透明度，在醫療專業人員和患者之間建立信任。我們的做法利用神經網路分析廣泛的數據集，識別癌症檢測模式。這個模型有可能通過提高醫療決策的準確性、可及性和清晰度來革新診斷，可能導致更早的檢測和更個性化的治療策略。此外，它可以使更多人獲得高品質的診斷，特別是在資源有限的環境中，有助於全球健康公平。該模型的應用範圍不僅限於癌症診斷，還可能轉變醫療決策的各個方面，並拯救全球數百萬人的生命。
 
-##### **Adaptive Graph of Thoughts: Test-Time Adaptive Reasoning Unifying Chain, Tree, and Graph Structures**
-2502.05078v1 by Tushar Pandey, Ara Ghukasyan, Oktay Goktas, Santosh Kumar Radha
+##### **Towards Interpretable Radiology Report Generation via Concept Bottlenecks using a Multi-Agentic RAG**
+2412.16086v2 by Hasan Md Tusfiqur Alam, Devansh Srivastav, Md Abdul Kadir, Daniel Sonntag
 
-Large Language Models (LLMs) have demonstrated impressive reasoning
-capabilities, yet their performance is highly dependent on the prompting
-strategy and model scale. While reinforcement learning and fine-tuning have
-been deployed to boost reasoning, these approaches incur substantial
-computational and data overhead. In this work, we introduce Adaptive Graph of
-Thoughts (AGoT), a dynamic, graph-based inference framework that enhances LLM
-reasoning solely at test time. Rather than relying on fixed-step methods like
-Chain of Thought (CoT) or Tree of Thoughts (ToT), AGoT recursively decomposes
-complex queries into structured subproblems, forming an dynamic directed
-acyclic graph (DAG) of interdependent reasoning steps. By selectively expanding
-only those subproblems that require further analysis, AGoT unifies the
-strengths of chain, tree, and graph paradigms into a cohesive framework that
-allocates computation where it is most needed. We validate our approach on
-diverse benchmarks spanning multi-hop retrieval, scientific reasoning, and
-mathematical problem-solving, achieving up to 46.2% improvement on scientific
-reasoning tasks (GPQA) - comparable to gains achieved through computationally
-intensive reinforcement learning approaches and outperforming state-of-the-art
-iterative approaches. These results suggest that dynamic decomposition and
-structured recursion offer a scalable, cost-effective alternative to
-post-training modifications, paving the way for more robust, general-purpose
-reasoning in LLMs.
+Deep learning has advanced medical image classification, but interpretability
+challenges hinder its clinical adoption. This study enhances interpretability
+in Chest X-ray (CXR) classification by using concept bottleneck models (CBMs)
+and a multi-agent Retrieval-Augmented Generation (RAG) system for report
+generation. By modeling relationships between visual features and clinical
+concepts, we create interpretable concept vectors that guide a multi-agent RAG
+system to generate radiology reports, enhancing clinical relevance,
+explainability, and transparency. Evaluation of the generated reports using an
+LLM-as-a-judge confirmed the interpretability and clinical utility of our
+model's outputs. On the COVID-QU dataset, our model achieved 81% classification
+accuracy and demonstrated robust report generation performance, with five key
+metrics ranging between 84% and 90%. This interpretable multi-agent framework
+bridges the gap between high-performance AI and the explainability required for
+reliable AI-driven CXR analysis in clinical settings. Our code is available at
+https://github.com/tifat58/IRR-with-CBM-RAG.git.
 
-摘要：大型語言模型 (LLM) 已展現令人印象深刻的推理能力，但其效能高度依賴於提示策略和模型規模。雖然強化學習和微調已被用於提升推理，但這些方法會造成大量的運算和資料開銷。在這項工作中，我們引入了「適應性思考圖」(AGoT)，一個動態的、基於圖形的推論架構，它僅在測試時就能增強 LLM 推理。AGoT 並非依賴於鏈式思考 (CoT) 或樹狀思考 (ToT) 等固定步驟方法，而是遞迴地將複雜的查詢分解成結構化的子問題，形成一個由相互依賴的推理步驟所組成的動態有向無環圖 (DAG)。透過選擇性地僅擴充那些需要進一步分析的子問題，AGoT 將鏈式、樹狀和圖形範例的優勢統一到一個緊密的架構中，將運算分配到最需要的地方。我們在跨越多重跳躍檢索、科學推理和數學問題解決等多樣基準上驗證了我們的做法，在科學推理任務 (GPQA) 上達到了高達 46.2% 的改進，這與透過運算密集的強化學習方法所獲得的增益相當，並且優於最先進的迭代方法。這些結果表明，動態分解和結構化遞迴提供了一個可擴充、具成本效益的替代方案，用於訓練後修改，為 LLM 中更強健、更通用的推理鋪平了道路。
+摘要：深度學習已提升醫學影像分類，但可解釋性挑戰阻礙其臨床應用。本研究透過使用概念瓶頸模型 (CBM) 和多代理檢索增強生成 (RAG) 系統進行報告生成，來增強胸部 X 光 (CXR) 分類的可解釋性。透過建模視覺特徵與臨床概念之間的關係，我們建立可解釋的概念向量，引導多代理 RAG 系統生成放射報告，增強臨床相關性、可解釋性和透明度。使用 LLM 作為評審員對生成報告進行評估，確認了我們模型輸出的可解釋性和臨床效用。在 COVID-QU 資料集上，我們的模型達到了 81% 的分類準確率，並展示了穩健的報告生成效能，五項關鍵指標介於 84% 至 90% 之間。這個可解釋的多代理架構彌合了高性能 AI 與臨床環境中可靠的 AI 驅動 CXR 分析所需的解釋性之間的差距。我們的程式碼可於 https://github.com/tifat58/IRR-with-CBM-RAG.git 取得。
 
-##### **Enhancing Knowledge Graph Construction: Evaluating with Emphasis on Hallucination, Omission, and Graph Similarity Metrics**
-2502.05239v1 by Hussam Ghanem, Christophe Cruz
+##### **Critique of Impure Reason: Unveiling the reasoning behaviour of medical Large Language Models**
+2412.15748v1 by Shamus Sim, Tyrone Chen
 
-Recent advancements in large language models have demonstrated significant
-potential in the automated construction of knowledge graphs from unstructured
-text. This paper builds upon our previous work [16], which evaluated various
-models using metrics like precision, recall, F1 score, triple matching, and
-graph matching, and introduces a refined approach to address the critical
-issues of hallucination and omission. We propose an enhanced evaluation
-framework incorporating BERTScore for graph similarity, setting a practical
-threshold of 95% for graph matching. Our experiments focus on the Mistral
-model, comparing its original and fine-tuned versions in zero-shot and few-shot
-settings. We further extend our experiments using examples from the KELM-sub
-training dataset, illustrating that the fine-tuned model significantly improves
-knowledge graph construction accuracy while reducing the exact hallucination
-and omission. However, our findings also reveal that the fine-tuned models
-perform worse in generalization tasks on the KELM-sub dataset. This study
-underscores the importance of comprehensive evaluation metrics in advancing the
-state-of-the-art in knowledge graph construction from textual data.
+Background: Despite the current ubiquity of Large Language Models (LLMs)
+across the medical domain, there is a surprising lack of studies which address
+their reasoning behaviour. We emphasise the importance of understanding
+reasoning behaviour as opposed to high-level prediction accuracies, since it is
+equivalent to explainable AI (XAI) in this context. In particular, achieving
+XAI in medical LLMs used in the clinical domain will have a significant impact
+across the healthcare sector. Results: Therefore, we define the concept of
+reasoning behaviour in the specific context of medical LLMs. We then categorise
+and discuss the current state of the art of methods which evaluate reasoning
+behaviour in medical LLMs. Finally, we propose theoretical frameworks which can
+empower medical professionals or machine learning engineers to gain insight
+into the low-level reasoning operations of these previously obscure models.
+Conclusion: The subsequent increased transparency and trust in medical machine
+learning models by clinicians as well as patients will accelerate the
+integration, application as well as further development of medical AI for the
+healthcare system as a whole
 
-摘要：大型語言模型的最新進展已證明在從非結構化文字自動建構知識圖譜方面具有顯著的潛力。本文建立在我們先前的研究 [16] 之上，該研究使用準確度、召回率、F1 分數、三元組匹配和圖形匹配等指標評估各種模型，並引入了一種改進的方法來解決幻覺和遺漏的關鍵問題。我們提出一個增強的評估框架，結合 BERTScore 來進行圖形相似性，並將圖形匹配的實際閾值設定為 95%。我們的實驗重點在 Mistral 模型上，比較其原始版本和微調版本在零次學習和少量學習的設定中。我們進一步使用 KELM-sub 訓練資料集中的範例來擴展我們的實驗，說明微調後的模型顯著提高了知識圖譜建構的準確度，同時減少了精確的幻覺和遺漏。然而，我們的研究結果也顯示，微調後的模型在 KELM-sub 資料集上的泛化任務表現較差。這項研究強調了全面評估指標在推進從文字資料建構知識圖譜的最新技術方面的重要性。
+摘要：背景：儘管大型語言模型 (LLM) 目前在醫療領域無所不在，但令人驚訝的是，探討其推理行為的研究卻相當缺乏。我們強調了解推理行為而非高層級的預測準確度非常重要，因為在這種情況下，這等同於可解釋 AI (XAI)。尤其是在臨床領域中使用的醫療 LLM 中實現 XAI，將對整個醫療保健產業產生重大影響。結果：因此，我們在醫療 LLM 的特定背景下定義了推理行為的概念。接著我們分類並探討當前評估醫療 LLM 中推理行為的方法的最新技術。最後，我們提出理論架構，讓醫療專業人員或機器學習工程師得以深入了解這些先前模糊模型的低層級推理運算。結論：臨床醫生和患者對醫療機器學習模型的透明度和信任度隨之提升，將加速醫療 AI 在整個醫療保健系統中的整合、應用和進一步發展。
 
-##### **Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research**
-2502.04644v1 by Junde Wu, Jiayuan Zhu, Yuyuan Liu
+##### **Cognition Chain for Explainable Psychological Stress Detection on Social Media**
+2412.14009v1 by Xin Wang, Boyan Gao, Yi Dai, Lei Cao, Liang Zhao, Yibo Yang, David Clifton
 
-We introduce Agentic Reasoning, a framework that enhances large language
-model (LLM) reasoning by integrating external tool-using agents. Unlike
-conventional LLM-based reasoning approaches, which rely solely on internal
-inference, Agentic Reasoning dynamically engages web search, code execution,
-and structured reasoning-context memory to solve complex problems requiring
-deep research and multi-step logical deduction. Our framework introduces the
-Mind Map agent, which constructs a structured knowledge graph to track logical
-relationships, improving deductive reasoning. Additionally, the integration of
-web-search and coding agents enables real-time retrieval and computational
-analysis, enhancing reasoning accuracy and decision-making. Evaluations on
-PhD-level scientific reasoning (GPQA) and domain-specific deep research tasks
-demonstrate that our approach significantly outperforms existing models,
-including leading retrieval-augmented generation (RAG) systems and
-closed-source LLMs. Moreover, our results indicate that agentic reasoning
-improves expert-level knowledge synthesis, test-time scalability, and
-structured problem-solving. The code is at:
-https://github.com/theworldofagents/Agentic-Reasoning.
+Stress is a pervasive global health issue that can lead to severe mental
+health problems. Early detection offers timely intervention and prevention of
+stress-related disorders. The current early detection models perform "black
+box" inference suffering from limited explainability and trust which blocks the
+real-world clinical application. Thanks to the generative properties introduced
+by the Large Language Models (LLMs), the decision and the prediction from such
+models are semi-interpretable through the corresponding description. However,
+the existing LLMs are mostly trained for general purposes without the guidance
+of psychological cognitive theory. To this end, we first highlight the
+importance of prior theory with the observation of performance boosted by the
+chain-of-thoughts tailored for stress detection. This method termed Cognition
+Chain explicates the generation of stress through a step-by-step cognitive
+perspective based on cognitive appraisal theory with a progress pipeline:
+Stimulus $\rightarrow$ Evaluation $\rightarrow$ Reaction $\rightarrow$ Stress
+State, guiding LLMs to provide comprehensive reasoning explanations. We further
+study the benefits brought by the proposed Cognition Chain format by utilising
+it as a synthetic dataset generation template for LLMs instruction-tuning and
+introduce CogInstruct, an instruction-tuning dataset for stress detection. This
+dataset is developed using a three-stage self-reflective annotation pipeline
+that enables LLMs to autonomously generate and refine instructional data. By
+instruction-tuning Llama3 with CogInstruct, we develop CogLLM, an explainable
+stress detection model. Evaluations demonstrate that CogLLM achieves
+outstanding performance while enhancing explainability. Our work contributes a
+novel approach by integrating cognitive theories into LLM reasoning processes,
+offering a promising direction for future explainable AI research.
 
-摘要：我們引入了代理推理，一個透過整合外部工具使用代理來增強大型語言模型 (LLM) 推理的框架。與僅依賴於內部推論的傳統基於 LLM 的推理方法不同，代理推理動態地運用網路搜尋、程式碼執行和結構化推理情境記憶來解決需要深入研究和多步驟邏輯推論的複雜問題。我們的框架引入了心智圖代理，它建立一個結構化的知識圖譜來追蹤邏輯關係，改善演繹推理。此外，整合網路搜尋和編碼代理能進行即時擷取和運算分析，增強推理準確度和決策制定。在博士等級科學推理 (GPQA) 和特定領域的深入研究任務上的評估顯示，我們的做法明顯優於現有模型，包括領先的檢索增強生成 (RAG) 系統和封閉原始碼 LLM。此外，我們的結果顯示，代理推理改進了專家級知識綜合、測試時間可擴充性和結構化問題解決。程式碼在：https://github.com/theworldofagents/Agentic-Reasoning。
+摘要：壓力是一個普遍的全球性健康問題，可能會導致嚴重的精神
+健康問題。早期發現提供及時的干預和預防
+壓力相關疾病。目前的早期發現模型執行「黑
+盒子」推論，存在可解釋性和信任度有限的問題，阻礙了
+現實世界的臨床應用。多虧了大型語言模型 (LLM) 引入的生成屬性，此類
+模型的決策和預測通過對應描述具有半可解釋性。然而，
+現有的 LLM 主要針對一般用途進行訓練，沒有心理認知理論的指導。為此，我們首先強調
+先驗理論的重要性，並觀察到針對壓力檢測量身定制的思想鏈提升了性能。這種方法稱為認知
+鏈通過基於認知評估理論的循序漸進的認知視角闡明了壓力的產生，並具有進度管道：
+刺激 $\rightarrow$ 評估 $\rightarrow$ 反應 $\rightarrow$ 壓力
+狀態，指導 LLM 提供全面的推理解釋。我們進一步
+通過將其用作 LLM 指令調整的合成數據集生成模板來研究所提出的認知鏈格式帶來的優點，並介紹 CogInstruct，這是一個針對壓力檢測的指令調整數據集。這個
+數據集是使用一個三階段的自省標註管道開發的，使 LLM 能夠自主生成和優化指令數據。通過
+使用 CogInstruct 對 Llama3 進行指令調整，我們開發了 CogLLM，這是一個可解釋的
+壓力檢測模型。評估表明，CogLLM 在提高可解釋性的同時實現了出色的性能。我們的研究通過將認知理論整合到 LLM 推理過程中，提出了一種新穎的方法，
+為未來的可解釋人工智能研究提供了一個有希望的方向。
 
-##### **Position-aware Automatic Circuit Discovery**
-2502.04577v1 by Tal Haklay, Hadas Orgad, David Bau, Aaron Mueller, Yonatan Belinkov
+##### **2-Factor Retrieval for Improved Human-AI Decision Making in Radiology**
+2412.00372v1 by Jim Solomon, Laleh Jalilian, Alexander Vilesov, Meryl Mathew, Tristan Grogan, Arash Bedayat, Achuta Kadambi
 
-A widely used strategy to discover and understand language model mechanisms
-is circuit analysis. A circuit is a minimal subgraph of a model's computation
-graph that executes a specific task. We identify a gap in existing circuit
-discovery methods: they assume circuits are position-invariant, treating model
-components as equally relevant across input positions. This limits their
-ability to capture cross-positional interactions or mechanisms that vary across
-positions. To address this gap, we propose two improvements to incorporate
-positionality into circuits, even on tasks containing variable-length examples.
-First, we extend edge attribution patching, a gradient-based method for circuit
-discovery, to differentiate between token positions. Second, we introduce the
-concept of a dataset schema, which defines token spans with similar semantics
-across examples, enabling position-aware circuit discovery in datasets with
-variable length examples. We additionally develop an automated pipeline for
-schema generation and application using large language models. Our approach
-enables fully automated discovery of position-sensitive circuits, yielding
-better trade-offs between circuit size and faithfulness compared to prior work.
+Human-machine teaming in medical AI requires us to understand to what degree
+a trained clinician should weigh AI predictions. While previous work has shown
+the potential of AI assistance at improving clinical predictions, existing
+clinical decision support systems either provide no explainability of their
+predictions or use techniques like saliency and Shapley values, which do not
+allow for physician-based verification. To address this gap, this study
+compares previously used explainable AI techniques with a newly proposed
+technique termed '2-factor retrieval (2FR)', which is a combination of
+interface design and search retrieval that returns similarly labeled data
+without processing this data. This results in a 2-factor security blanket
+where: (a) correct images need to be retrieved by the AI; and (b) humans should
+associate the retrieved images with the current pathology under test. We find
+that when tested on chest X-ray diagnoses, 2FR leads to increases in clinician
+accuracy, with particular improvements when clinicians are radiologists and
+have low confidence in their decision. Our results highlight the importance of
+understanding how different modes of human-AI decision making may impact
+clinician accuracy in clinical decision support systems.
 
-摘要：廣泛用於發現和了解語言模型機制的策略是電路分析。電路是模型計算圖的最小子圖，可執行特定任務。我們找出電路發現方法中的一個缺口：它們假設電路與位置無關，將模型組件視為在輸入位置中同樣相關。這限制了它們捕捉跨位置互動或在不同位置中變化的機制的能力。為了解決這個缺口，我們提出兩項改進，將位置性納入電路中，即使在包含變長範例的任務中也是如此。首先，我們擴充邊緣屬性修補，一種基於梯度的電路發現方法，以區分符號位置。其次，我們引入了資料集架構的概念，它定義了在範例中具有類似語義的符號跨距，使我們可以在具有變長範例的資料集中進行與位置相關的電路發現。此外，我們開發了一個自動化管線，用於使用大型語言模型進行架構生成和應用。我們的做法能讓位置敏感電路的發現完全自動化，與先前的研究相比，在電路大小和忠實度之間產生了更好的權衡。
+摘要：人機協作在醫療 AI 中，需要我們理解受過訓練的臨床醫生在多大程度上應重視 AI 預測。雖然先前的研究顯示 AI 輔助在改善臨床預測方面的潛力，但現有的臨床決策支援系統，要不就沒有提供預測的可解釋性，要不就是使用像顯著性和 Shapley 值之類的技術，這些技術不允許基於醫生的驗證。為了解決這個差距，本研究將先前使用的可解釋 AI 技術與一種新提出的稱為「2 因子檢索 (2FR)」的技術進行比較，後者是一種介面設計和搜尋檢索的組合，它會傳回標籤相似的資料，而不會處理這些資料。這會產生一個 2 因子安全機制，其中：(a) 正確的影像需要由 AI 檢索；(b) 人類應將檢索的影像與正在測試中的病理聯想起來。我們發現，當在胸部 X 光診斷上進行測試時，2FR 會提高臨床醫生的準確度，特別是在臨床醫生是放射科醫生且對其決策信心不足時，會有顯著的改善。我們的結果強調了理解人機決策的不同模式如何影響臨床醫生在臨床決策支援系統中的準確性的重要性。
 
-##### **Heterogeneous Swarms: Jointly Optimizing Model Roles and Weights for Multi-LLM Systems**
-2502.04510v1 by Shangbin Feng, Zifeng Wang, Palash Goyal, Yike Wang, Weijia Shi, Huang Xia, Hamid Palangi, Luke Zettlemoyer, Yulia Tsvetkov, Chen-Yu Lee, Tomas Pfister
+##### **Mapping Public Perception of Artificial Intelligence: Expectations, Risk-Benefit Tradeoffs, and Value As Determinants for Societal Acceptance**
+2411.19356v1 by Philipp Brauner, Felix Glawe, Gian Luca Liehner, Luisa Vervier, Martina Ziefle
 
-We propose Heterogeneous Swarms, an algorithm to design multi-LLM systems by
-jointly optimizing model roles and weights. We represent multi-LLM systems as
-directed acyclic graphs (DAGs) of LLMs with topological message passing for
-collaborative generation. Given a pool of LLM experts and a utility function,
-Heterogeneous Swarms employs two iterative steps: role-step and weight-step.
-For role-step, we interpret model roles as learning a DAG that specifies the
-flow of inputs and outputs between LLMs. Starting from a swarm of random
-continuous adjacency matrices, we decode them into discrete DAGs, call the LLMs
-in topological order, evaluate on the utility function (e.g. accuracy on a
-task), and optimize the adjacency matrices with particle swarm optimization
-based on the utility score. For weight-step, we assess the contribution of
-individual LLMs in the multi-LLM systems and optimize model weights with swarm
-intelligence. We propose JFK-score to quantify the individual contribution of
-each LLM in the best-found DAG of the role-step, then optimize model weights
-with particle swarm optimization based on the JFK-score. Experiments
-demonstrate that Heterogeneous Swarms outperforms 15 role- and/or weight-based
-baselines by 18.5% on average across 12 tasks. Further analysis reveals that
-Heterogeneous Swarms discovers multi-LLM systems with heterogeneous model roles
-and substantial collaborative gains, and benefits from the diversity of
-language models.
+Understanding public perception of artificial intelligence (AI) and the
+tradeoffs between potential risks and benefits is crucial, as these perceptions
+might shape policy decisions, influence innovation trajectories for successful
+market strategies, and determine individual and societal acceptance of AI
+technologies. Using a representative sample of 1100 participants from Germany,
+this study examines mental models of AI. Participants quantitatively evaluated
+71 statements about AI's future capabilities (e.g., autonomous driving, medical
+care, art, politics, warfare, and societal divides), assessing the expected
+likelihood of occurrence, perceived risks, benefits, and overall value. We
+present rankings of these projections alongside visual mappings illustrating
+public risk-benefit tradeoffs. While many scenarios were deemed likely,
+participants often associated them with high risks, limited benefits, and low
+overall value. Across all scenarios, 96.4% ($r^2=96.4\%$) of the variance in
+value assessment can be explained by perceived risks ($\beta=-.504$) and
+perceived benefits ($\beta=+.710$), with no significant relation to expected
+likelihood. Demographics and personality traits influenced perceptions of
+risks, benefits, and overall evaluations, underscoring the importance of
+increasing AI literacy and tailoring public information to diverse user needs.
+These findings provide actionable insights for researchers, developers, and
+policymakers by highlighting critical public concerns and individual factors
+essential to align AI development with individual values.
+
+摘要：<paragraph>了解公眾對人工智慧 (AI) 的認知以及潛在風險與好處之間的權衡至關重要，因為這些認知可能會影響政策決策、影響成功市場策略的創新軌跡，並決定個人和社會對 AI 技術的接受度。本研究使用來自德國的 1100 名參與者的代表性樣本，探討了 AI 的心智模型。參與者對 71 項關於 AI 未來能力的陳述（例如，自動駕駛、醫療保健、藝術、政治、戰爭和社會分歧）進行了定量評估，評估預期的發生可能性、感知風險、好處和整體價值。我們展示了這些預測的排名，並附上視覺化映射，說明了公眾的風險收益權衡。儘管許多場景被認為是可能的，但參與者通常將它們與高風險、有限的好處和低整體價值聯繫起來。在所有場景中，96.4% ($r^2=96.4\%$) 的價值評估差異可以用感知風險 ($\beta=-.504$) 和感知好處 ($\beta=+.710$) 來解釋，與預期的可能性沒有顯著關係。人口統計和人格特質影響了對風險、好處和整體評估的看法，這凸顯了提高 AI 素養和根據不同的使用者需求調整公共資訊的重要性。這些發現通過強調關鍵的公共關注和與個人價值觀一致的 AI 開發必不可少的個人因素，為研究人員、開發人員和政策制定者提供了可行的見解。</paragraph>
+
+##### **Explainable AI for Classifying UTI Risk Groups Using a Real-World Linked EHR and Pathology Lab Dataset**
+2411.17645v2 by Yujie Dai, Brian Sullivan, Axel Montout, Amy Dillon, Chris Waller, Peter Acs, Rachel Denholm, Philip Williams, Alastair D Hay, Raul Santos-Rodriguez, Andrew Dowsey
+
+The use of machine learning and AI on electronic health records (EHRs) holds
+substantial potential for clinical insight. However, this approach faces
+challenges due to data heterogeneity, sparsity, temporal misalignment, and
+limited labeled outcomes. In this context, we leverage a linked EHR dataset of
+approximately one million de-identified individuals from Bristol, North
+Somerset, and South Gloucestershire, UK, to characterize urinary tract
+infections (UTIs). We implemented a data pre-processing and curation pipeline
+that transforms the raw EHR data into a structured format suitable for
+developing predictive models focused on data fairness, accountability and
+transparency. Given the limited availability and biases of ground truth UTI
+outcomes, we introduce a UTI risk estimation framework informed by clinical
+expertise to estimate UTI risk across individual patient timelines. Pairwise
+XGBoost models are trained using this framework to differentiate UTI risk
+categories with explainable AI techniques applied to identify key predictors
+and support interpretability. Our findings reveal differences in clinical and
+demographic predictors across risk groups. While this study highlights the
+potential of AI-driven insights to support UTI clinical decision-making,
+further investigation of patient sub-strata and extensive validation are needed
+to ensure robustness and applicability in clinical practice.
 
-摘要：<paragraph>我們提出異質群體，一種演算法，透過共同最佳化模型角色和權重來設計多 LLM 系統。我們將多 LLM 系統表示為 LLM 的有向非循環圖 (DAG)，並透過拓撲訊息傳遞進行協作產生。給定一組 LLM 專家和一個效用函數，異質群體使用兩個反覆步驟：角色步驟和權重步驟。對於角色步驟，我們將模型角色解釋為學習一個 DAG，它指定 LLM 之間輸入和輸出的流動。從一組隨機連續鄰接矩陣開始，我們將它們解碼為離散 DAG，以拓撲順序呼叫 LLM，根據效用函數（例如任務的準確度）進行評估，並根據效用分數使用粒子群最佳化最佳化鄰接矩陣。對於權重步驟，我們評估個別 LLM 在多 LLM 系統中的貢獻，並使用群體智慧最佳化模型權重。我們提出 JFK 分數來量化每個 LLM 在角色步驟中找到的最佳 DAG 中的個別貢獻，然後根據 JFK 分數使用粒子群最佳化最佳化模型權重。實驗表明，異質群體在 12 項任務中平均比 15 個基於角色和/或權重的基線高出 18.5%。進一步的分析表明，異質群體發現具有異質模型角色和大量協作收益的多 LLM 系統，並受益於語言模型的多樣性。</paragraph>
+摘要：電子健康紀錄 (EHR) 中機器學習和 AI 的使用對於臨床見解具有相當大的潛力。然而，由於資料異質性、稀疏性、時間錯位和標籤結果有限，此方法面臨挑戰。在此背景下，我們利用來自英國布里斯托、北薩默塞特和南格洛斯特郡約一百萬名去識別個人連結的 EHR 資料集，來描述尿路感染 (UTI)。我們實施了將原始 EHR 資料轉換為結構化格式的資料前處理和整理管線，適合開發專注於資料公平性、問責制和透明度的預測模型。鑑於 UTI 真實結果的可用性有限和偏差，我們引入了由臨床專業知識告知的 UTI 風險評估架構，以估計個別患者時間軸上的 UTI 風險。成對的 XGBoost 模型使用此架構進行訓練，以區分 UTI 風險類別，並應用可解釋的 AI 技術來識別關鍵預測因子並支持可解釋性。我們的研究結果揭示了不同風險群組在臨床和人口統計預測因子上的差異。雖然這項研究強調了 AI 驅動見解在支援 UTI 臨床決策制定方面的潛力，但仍需要進一步調查患者子群體和廣泛驗證，以確保在臨床實務中的穩健性和適用性。
 
-##### **MedRAG: Enhancing Retrieval-augmented Generation with Knowledge Graph-Elicited Reasoning for Healthcare Copilot**
-2502.04413v1 by Xuejiao Zhao, Siyan Liu, Su-Yin Yang, Chunyan Miao
+##### **Exploring the Requirements of Clinicians for Explainable AI Decision Support Systems in Intensive Care**
+2411.11774v1 by Jeffrey N. Clark, Matthew Wragg, Emily Nielsen, Miquel Perello-Nieto, Nawid Keshtmand, Michael Ambler, Shiv Sharma, Christopher P. Bourdeaux, Amberly Brigden, Raul Santos-Rodriguez
 
-Retrieval-augmented generation (RAG) is a well-suited technique for
-retrieving privacy-sensitive Electronic Health Records (EHR). It can serve as a
-key module of the healthcare copilot, helping reduce misdiagnosis for
-healthcare practitioners and patients. However, the diagnostic accuracy and
-specificity of existing heuristic-based RAG models used in the medical domain
-are inadequate, particularly for diseases with similar manifestations. This
-paper proposes MedRAG, a RAG model enhanced by knowledge graph (KG)-elicited
-reasoning for the medical domain that retrieves diagnosis and treatment
-recommendations based on manifestations. MedRAG systematically constructs a
-comprehensive four-tier hierarchical diagnostic KG encompassing critical
-diagnostic differences of various diseases. These differences are dynamically
-integrated with similar EHRs retrieved from an EHR database, and reasoned
-within a large language model. This process enables more accurate and specific
-decision support, while also proactively providing follow-up questions to
-enhance personalized medical decision-making. MedRAG is evaluated on both a
-public dataset DDXPlus and a private chronic pain diagnostic dataset (CPDD)
-collected from Tan Tock Seng Hospital, and its performance is compared against
-various existing RAG methods. Experimental results show that, leveraging the
-information integration and relational abilities of the KG, our MedRAG provides
-more specific diagnostic insights and outperforms state-of-the-art models in
-reducing misdiagnosis rates. Our code will be available at
-https://github.com/SNOWTEAM2023/MedRAG
+There is a growing need to understand how digital systems can support
+clinical decision-making, particularly as artificial intelligence (AI) models
+become increasingly complex and less human-interpretable. This complexity
+raises concerns about trustworthiness, impacting safe and effective adoption of
+such technologies. Improved understanding of decision-making processes and
+requirements for explanations coming from decision support tools is a vital
+component in providing effective explainable solutions. This is particularly
+relevant in the data-intensive, fast-paced environments of intensive care units
+(ICUs). To explore these issues, group interviews were conducted with seven ICU
+clinicians, representing various roles and experience levels. Thematic analysis
+revealed three core themes: (T1) ICU decision-making relies on a wide range of
+factors, (T2) the complexity of patient state is challenging for shared
+decision-making, and (T3) requirements and capabilities of AI decision support
+systems. We include design recommendations from clinical input, providing
+insights to inform future AI systems for intensive care.
 
-摘要：檢索增強生成 (RAG) 是一種適用於檢索隱私敏感的電子健康記錄 (EHR) 的技術。它可以作為醫療保健副駕駛的一個關鍵模組，協助減少醫療保健從業人員和患者的誤診。然而，在醫療領域中使用的現有基於啟發法的 RAG 模型的診斷準確性和特異性不足，特別是對於具有類似表現的疾病。本文提出 MedRAG，一種由知識圖譜 (KG) 引發的推理增強的 RAG 模型，用於醫療領域，它根據表現檢索診斷和治療建議。MedRAG 系統性地構建了一個全面的四層階層式診斷 KG，涵蓋各種疾病的關鍵診斷差異。這些差異與從 EHR 資料庫中檢索到的類似 EHR 動態整合，並在大型語言模型中進行推理。這個過程可以實現更準確和具體的決策支援，同時主動提供後續問題，以增強個人化醫療決策制定。MedRAG 在公共資料集 DDXPlus 和從陳篤生醫院收集的私人慢性疼痛診斷資料集 (CPDD) 上進行評估，並將其效能與各種現有 RAG 方法進行比較。實驗結果顯示，利用 KG 的資訊整合和關係能力，我們的 MedRAG 提供了更具體的診斷見解，並在降低誤診率方面優於最先進的模型。我們的程式碼將在 https://github.com/SNOWTEAM2023/MedRAG 提供
+摘要：隨著人工智慧 (AI) 模型變得越來越複雜，且越來越難以被人理解，了解數位系統如何支援臨床決策的需求也日益增加。這種複雜性引發了對可信度的疑慮，影響了此類技術的安全且有效採用。改善對決策制定流程的理解，以及對決策支援工具所提供說明的要求，是提供有效可解釋解決方案的重要組成部分。這在資料密集、快節奏的加護病房 (ICU) 環境中特別相關。為了探討這些問題，對七位 ICU 臨床醫師進行了小組訪談，這些醫師代表了不同的角色和經驗層級。主題分析揭露了三個核心主題：(T1) ICU 決策制定依賴於廣泛的因素，(T2) 病患狀態的複雜性對共同決策制定構成挑戰，以及 (T3) AI 決策支援系統的要求和能力。我們納入了臨床輸入的設計建議，提供見解以提供資訊給未來用於加護的 AI 系統。
 
-##### **Ontology-Guided, Hybrid Prompt Learning for Generalization in Knowledge Graph Question Answering**
-2502.03992v1 by Longquan Jiang, Junbo Huang, Cedric Möller, Ricardo Usbeck
+##### **Artificial Intelligence in Pediatric Echocardiography: Exploring Challenges, Opportunities, and Clinical Applications with Explainable AI and Federated Learning**
+2411.10255v1 by Mohammed Yaseen Jabarulla, Theodor Uden, Thomas Jack, Philipp Beerbaum, Steffen Oeltze-Jafra
 
-Most existing Knowledge Graph Question Answering (KGQA) approaches are
-designed for a specific KG, such as Wikidata, DBpedia or Freebase. Due to the
-heterogeneity of the underlying graph schema, topology and assertions, most
-KGQA systems cannot be transferred to unseen Knowledge Graphs (KGs) without
-resource-intensive training data. We present OntoSCPrompt, a novel Large
-Language Model (LLM)-based KGQA approach with a two-stage architecture that
-separates semantic parsing from KG-dependent interactions. OntoSCPrompt first
-generates a SPARQL query structure (including SPARQL keywords such as SELECT,
-ASK, WHERE and placeholders for missing tokens) and then fills them with
-KG-specific information. To enhance the understanding of the underlying KG, we
-present an ontology-guided, hybrid prompt learning strategy that integrates KG
-ontology into the learning process of hybrid prompts (e.g., discrete and
-continuous vectors). We also present several task-specific decoding strategies
-to ensure the correctness and executability of generated SPARQL queries in both
-stages. Experimental results demonstrate that OntoSCPrompt performs as well as
-SOTA approaches without retraining on a number of KGQA datasets such as CWQ,
-WebQSP and LC-QuAD 1.0 in a resource-efficient manner and can generalize well
-to unseen domain-specific KGs like DBLP-QuAD and CoyPu KG Code:
-\href{https://github.com/LongquanJiang/OntoSCPrompt}{https://github.com/LongquanJiang/OntoSCPrompt}
+Pediatric heart diseases present a broad spectrum of congenital and acquired
+diseases. More complex congenital malformations require a differentiated and
+multimodal decision-making process, usually including echocardiography as a
+central imaging method. Artificial intelligence (AI) offers considerable
+promise for clinicians by facilitating automated interpretation of pediatric
+echocardiography data. However, adapting AI technologies for pediatric
+echocardiography analysis has challenges such as limited public data
+availability, data privacy, and AI model transparency. Recently, researchers
+have focused on disruptive technologies, such as federated learning (FL) and
+explainable AI (XAI), to improve automatic diagnostic and decision support
+workflows. This study offers a comprehensive overview of the limitations and
+opportunities of AI in pediatric echocardiography, emphasizing the synergistic
+workflow and role of XAI and FL, identifying research gaps, and exploring
+potential future developments. Additionally, three relevant clinical use cases
+demonstrate the functionality of XAI and FL with a focus on (i) view
+recognition, (ii) disease classification, (iii) segmentation of cardiac
+structures, and (iv) quantitative assessment of cardiac function.
 
-摘要：現有的知識圖譜問答（KGQA）方法大多是為特定 KG 而設計的，例如 Wikidata、DBpedia 或 Freebase。由於底層圖形模式、拓撲和斷言的異質性，大多數 KGQA 系統無法在沒有資源密集型訓練資料的情況下轉移到未見過的知識圖譜（KG）。我們提出 OntoSCPrompt，這是一種基於大型語言模型（LLM）的新型 KGQA 方法，採用兩階段架構，將語義解析與依賴 KG 的互動分開。OntoSCPrompt 首先生成 SPARQL 查詢結構（包括 SPARQL 關鍵字，例如 SELECT、ASK、WHERE 和缺失令牌的佔位符），然後用 KG 特定的資訊填寫它們。為了增強對底層 KG 的理解，我們提出了一種由本体指導的混合提示學習策略，將 KG 本体整合到混合提示（例如，離散和連續向量）的學習過程中。我們還提出了多種特定任務的解碼策略，以確保在兩個階段中生成的 SPARQL 查詢的正確性和可執行性。實驗結果表明，OntoSCPrompt 在 CWQ、WebQSP 和 LC-QuAD 1.0 等多個 KGQA 資料集上執行時，效能與 SOTA 方法一樣好，且資源使用效率高，並且可以很好地概括到未見過的特定領域 KG，例如 DBLP-QuAD 和 CoyPu KG Code：
-\href{https://github.com/LongquanJiang/OntoSCPrompt}{https://github.com/LongquanJiang/OntoSCPrompt}
+摘要：小兒心臟疾病呈現先天性與後天性疾病的廣泛光譜。較複雜的先天性畸形需要一個差異化且多模式的決策過程，通常包括超音波檢查作為主要的影像方法。人工智慧 (AI) 為臨床醫生提供了相當大的希望，因為它可以促進小兒超音波檢查資料的自動化解讀。然而，將人工智慧技術應用於小兒超音波檢查分析有許多挑戰，例如有限的公開資料可用性、資料隱私和人工智慧模型透明度。最近，研究人員專注於破壞性技術，例如聯合學習 (FL) 和可解釋人工智慧 (XAI)，以改善自動診斷和決策支援工作流程。本研究提供了人工智慧在小兒超音波檢查中的限制和機會的全面概述，強調了 XAI 和 FL 的協同工作流程和角色，找出研究差距並探討潛在的未來發展。此外，三個相關的臨床使用案例展示了 XAI 和 FL 的功能，重點在於 (i) 檢視辨識、(ii) 疾病分類、(iii) 心臟結構分割和 (iv) 心臟功能的量化評估。
 
-##### **Multimodal Medical Code Tokenizer**
-2502.04397v2 by Xiaorui Su, Shvat Messica, Yepeng Huang, Ruth Johnson, Lukas Fesser, Shanghua Gao, Faryad Sahneh, Marinka Zitnik
+##### **Enhancing Osteoporosis Detection: An Explainable Multi-Modal Learning Framework with Feature Fusion and Variable Clustering**
+2411.00916v2 by Mehdi Hosseini Chagahi, Saeed Mohammadi Dashtaki, Niloufar Delfan, Nadia Mohammadi, Alireza Samari, Behzad Moshiri, Md. Jalil Piran, Oliver Faust
 
-Foundation models trained on patient electronic health records (EHRs) require
-tokenizing medical data into sequences of discrete vocabulary items. Existing
-tokenizers treat medical codes from EHRs as isolated textual tokens. However,
-each medical code is defined by its textual description, its position in
-ontological hierarchies, and its relationships to other codes, such as disease
-co-occurrences and drug-treatment associations. Medical vocabularies contain
-more than 600,000 codes with critical information for clinical reasoning. We
-introduce MedTok, a multimodal medical code tokenizer that uses the text
-descriptions and relational context of codes. MedTok processes text using a
-language model encoder and encodes the relational structure with a graph
-encoder. It then quantizes both modalities into a unified token space,
-preserving modality-specific and cross-modality information. We integrate
-MedTok into five EHR models and evaluate it on operational and clinical tasks
-across in-patient and out-patient datasets, including outcome prediction,
-diagnosis classification, drug recommendation, and risk stratification.
-Swapping standard EHR tokenizers with MedTok improves AUPRC across all EHR
-models, by 4.10% on MIMIC-III, 4.78% on MIMIC-IV, and 11.30% on EHRShot, with
-the largest gains in drug recommendation. Beyond EHR modeling, we demonstrate
-using MedTok tokenizer with medical QA systems. Our results demonstrate the
-potential of MedTok as a unified tokenizer for medical codes, improving
-tokenization for medical foundation models.
+Osteoporosis is a common condition that increases fracture risk, especially
+in older adults. Early diagnosis is vital for preventing fractures, reducing
+treatment costs, and preserving mobility. However, healthcare providers face
+challenges like limited labeled data and difficulties in processing medical
+images. This study presents a novel multi-modal learning framework that
+integrates clinical and imaging data to improve diagnostic accuracy and model
+interpretability. The model utilizes three pre-trained networks-VGG19,
+InceptionV3, and ResNet50-to extract deep features from X-ray images. These
+features are transformed using PCA to reduce dimensionality and focus on the
+most relevant components. A clustering-based selection process identifies the
+most representative components, which are then combined with preprocessed
+clinical data and processed through a fully connected network (FCN) for final
+classification. A feature importance plot highlights key variables, showing
+that Medical History, BMI, and Height were the main contributors, emphasizing
+the significance of patient-specific data. While imaging features were
+valuable, they had lower importance, indicating that clinical data are crucial
+for accurate predictions. This framework promotes precise and interpretable
+predictions, enhancing transparency and building trust in AI-driven diagnoses
+for clinical integration.
 
-摘要：<paragraph>在患者电子健康记录 (EHR) 上训练的基础模型需要将医学数据标记为离散词汇项序列。现有的标记器将 EHR 中的医学代码视为孤立的文本标记。然而，每个医学代码都由其文本描述、在本体层次结构中的位置以及与其他代码的关系（例如疾病共现和药物治疗关联）来定义。医学词汇表包含超过 600,000 个代码，这些代码包含临床推理的关键信息。我们引入了 MedTok，这是一种多模态医学代码标记器，它使用文本描述和代码的关系上下文。MedTok 使用语言模型编码器处理文本，并使用图编码器对关系结构进行编码。然后，它将这两种模态量化为一个统一的标记空间，保留特定于模态和跨模态的信息。我们将 MedTok 集成到五个 EHR 模型中，并在住院和门诊数据集（包括结果预测、诊断分类、药物推荐和风险分层）上对其实施操作和临床任务进行评估。用 MedTok 替换标准 EHR 标记器可提高所有 EHR 模型的 AUPRC，在 MIMIC-III 上提高 4.10%，在 MIMIC-IV 上提高 4.78%，在 EHRShot 上提高 11.30%，其中药物推荐的增益最大。除了 EHR 建模之外，我们还演示了将 MedTok 标记器与医学问答系统结合使用。我们的结果证明了 MedTok 作为医学代码的统一标记器的潜力，改进了医学基础模型的标记化。</paragraph>
+摘要：骨質疏鬆症是一種常見的疾病，會增加骨折的風險，特別是老年人。早期診斷對於預防骨折、降低治療成本和維持行動能力至關重要。然而，醫療保健提供者面臨著標記數據有限和處理醫學影像困難等挑戰。本研究提出了一個新穎的多模式學習框架，該框架整合了臨床和影像數據，以提高診斷準確性和模型可解釋性。該模型利用三個預訓練的網路，VGG19、InceptionV3 和 ResNet50，從 X 射線影像中提取深度特徵。這些特徵使用 PCA 轉換以降低維度並專注於最相關的組成部分。基於聚類的選擇過程識別出最具代表性的組成部分，然後將這些組成部分與預處理的臨床數據結合，並通過全連接網路 (FCN) 進行最終分類。特徵重要性圖突出了關鍵變數，表明病史、BMI 和身高是主要貢獻因素，強調了患者特定數據的重要性。雖然影像特徵很有價值，但它們的重要性較低，這表明臨床數據對於準確預測至關重要。此框架促进了準確且可解釋的預測，提高了透明度，並建立了對 AI 驅動診斷在臨床整合中的信任。
 
-##### **Division-of-Thoughts: Harnessing Hybrid Language Model Synergy for Efficient On-Device Agents**
-2502.04392v1 by Chenyang Shao, Xinyuan Hu, Yutang Lin, Fengli Xu
+##### **A Review of Deep Learning Approaches for Non-Invasive Cognitive Impairment Detection**
+2410.19898v1 by Muath Alsuhaibani, Ali Pourramezan Fard, Jian Sun, Farida Far Poor, Peter S. Pressman, Mohammad H. Mahoor
 
-The rapid expansion of web content has made on-device AI assistants
-indispensable for helping users manage the increasing complexity of online
-tasks. The emergent reasoning ability in large language models offer a
-promising path for next-generation on-device AI agents. However, deploying
-full-scale Large Language Models (LLMs) on resource-limited local devices is
-challenging. In this paper, we propose Division-of-Thoughts (DoT), a
-collaborative reasoning framework leveraging the synergy between locally
-deployed Smaller-scale Language Models (SLMs) and cloud-based LLMs. DoT
-leverages a Task Decomposer to elicit the inherent planning abilities in
-language models to decompose user queries into smaller sub-tasks, which allows
-hybrid language models to fully exploit their respective strengths. Besides,
-DoT employs a Task Scheduler to analyze the pair-wise dependency of sub-tasks
-and create a dependency graph, facilitating parallel reasoning of sub-tasks and
-the identification of key steps. To allocate the appropriate model based on the
-difficulty of sub-tasks, DoT leverages a Plug-and-Play Adapter, which is an
-additional task head attached to the SLM that does not alter the SLM's
-parameters. To boost adapter's task allocation capability, we propose a
-self-reinforced training method that relies solely on task execution feedback.
-Extensive experiments on various benchmarks demonstrate that our DoT
-significantly reduces LLM costs while maintaining competitive reasoning
-accuracy. Specifically, DoT reduces the average reasoning time and API costs by
-66.12% and 83.57%, while achieving comparable reasoning accuracy with the best
-baseline methods.
+This review paper explores recent advances in deep learning approaches for
+non-invasive cognitive impairment detection. We examine various non-invasive
+indicators of cognitive decline, including speech and language, facial, and
+motoric mobility. The paper provides an overview of relevant datasets,
+feature-extracting techniques, and deep-learning architectures applied to this
+domain. We have analyzed the performance of different methods across modalities
+and observed that speech and language-based methods generally achieved the
+highest detection performance. Studies combining acoustic and linguistic
+features tended to outperform those using a single modality. Facial analysis
+methods showed promise for visual modalities but were less extensively studied.
+Most papers focused on binary classification (impaired vs. non-impaired), with
+fewer addressing multi-class or regression tasks. Transfer learning and
+pre-trained language models emerged as popular and effective techniques,
+especially for linguistic analysis. Despite significant progress, several
+challenges remain, including data standardization and accessibility, model
+explainability, longitudinal analysis limitations, and clinical adaptation.
+Lastly, we propose future research directions, such as investigating
+language-agnostic speech analysis methods, developing multi-modal diagnostic
+systems, and addressing ethical considerations in AI-assisted healthcare. By
+synthesizing current trends and identifying key obstacles, this review aims to
+guide further development of deep learning-based cognitive impairment detection
+systems to improve early diagnosis and ultimately patient outcomes.
 
-摘要：<paragraph>網頁內容快速擴充，使得行動裝置上的 AI 助理在協助使用者管理日益複雜的線上工作上變得不可或缺。大型語言模型中浮現的推理能力為新一代行動裝置上的 AI 代理提供了一條有希望的途徑。然而，在資源有限的本機裝置上部署全規模的大型語言模型 (LLM) 是一項挑戰。在本文中，我們提出了思想分工 (DoT)，一個協作推理框架，利用了本地部署的小型語言模型 (SLM) 與雲端 LLM 之間的協同效應。DoT 利用任務分解器引出語言模型中固有的規劃能力，將使用者查詢分解成較小的子任務，這允許混合語言模型充分發揮其各自的優勢。此外，DoT 雇用了一個任務排程器來分析子任務的成對依賴性並建立一個依賴性圖，促進子任務的並行推理和關鍵步驟的識別。為了根據子任務的難度分配適當的模型，DoT 利用了即插即用適配器，這是一個附加在 SLM 上的任務頭，不會改變 SLM 的參數。為了提升適配器的任務分配能力，我們提出了一種自我強化訓練方法，它僅依賴於任務執行回饋。在各種基準上的廣泛實驗表明，我們的 DoT 大幅降低了 LLM 成本，同時維持了有競爭力的推理準確度。具體來說，DoT 將平均推理時間和 API 成本分別降低了 66.12% 和 83.57%，同時達到了與最佳基準方法相當的推理準確度。</paragraph>
+摘要：本篇評論探討了深度學習方法在非侵入式認知功能障礙檢測上的最新進展。我們檢視了各種非侵入式的認知衰退指標，包括語言和語言、面部和運動機能。本文概述了與此領域相關的資料集、特徵提取技術和深度學習架構。我們分析了不同方法在不同方式上的表現，並觀察到基於語言和語言的方法通常能達到最高的檢測表現。結合聲學和語言特徵的研究往往優於使用單一方式的研究。面部分析方法顯示出視覺方式的潛力，但研究較少。大多數論文專注於二元分類（受損與未受損），較少探討多類或回歸任務。遷移學習和預訓練語言模型已成為流行且有效的技術，特別是對於語言分析。儘管取得了重大進展，但仍存在一些挑戰，包括資料標準化和可及性、模型可解釋性、縱向分析限制和臨床適應性。最後，我們提出了未來的研究方向，例如調查與語言無關的語音分析方法、開發多模式診斷系統，以及解決人工智慧輔助醫療保健中的倫理考量。透過綜合目前的趨勢和找出關鍵障礙，本篇評論旨在引導深度學習為基礎的認知功能障礙檢測系統的進一步發展，以改善早期診斷，並最終改善患者的治療結果。
 
-##### **Boosting Knowledge Graph-based Recommendations through Confidence-Aware Augmentation with Large Language Models**
-2502.03715v1 by Rui Cai, Chao Wang, Qianyi Cai, Dazhong Shen, Hui Xiong
+##### **An Ontology-Enabled Approach For User-Centered and Knowledge-Enabled Explanations of AI Systems**
+2410.17504v1 by Shruthi Chari
 
-Knowledge Graph-based recommendations have gained significant attention due
-to their ability to leverage rich semantic relationships. However, constructing
-and maintaining Knowledge Graphs (KGs) is resource-intensive, and the accuracy
-of KGs can suffer from noisy, outdated, or irrelevant triplets. Recent
-advancements in Large Language Models (LLMs) offer a promising way to improve
-the quality and relevance of KGs for recommendation tasks. Despite this,
-integrating LLMs into KG-based systems presents challenges, such as efficiently
-augmenting KGs, addressing hallucinations, and developing effective joint
-learning methods. In this paper, we propose the Confidence-aware KG-based
-Recommendation Framework with LLM Augmentation (CKG-LLMA), a novel framework
-that combines KGs and LLMs for recommendation task. The framework includes: (1)
-an LLM-based subgraph augmenter for enriching KGs with high-quality
-information, (2) a confidence-aware message propagation mechanism to filter
-noisy triplets, and (3) a dual-view contrastive learning method to integrate
-user-item interactions and KG data. Additionally, we employ a confidence-aware
-explanation generation process to guide LLMs in producing realistic
-explanations for recommendations. Finally, extensive experiments demonstrate
-the effectiveness of CKG-LLMA across multiple public datasets.
+Explainable Artificial Intelligence (AI) focuses on helping humans understand
+the working of AI systems or their decisions and has been a cornerstone of AI
+for decades. Recent research in explainability has focused on explaining the
+workings of AI models or model explainability. There have also been several
+position statements and review papers detailing the needs of end-users for
+user-centered explainability but fewer implementations. Hence, this thesis
+seeks to bridge some gaps between model and user-centered explainability. We
+create an explanation ontology (EO) to represent literature-derived explanation
+types via their supporting components. We implement a knowledge-augmented
+question-answering (QA) pipeline to support contextual explanations in a
+clinical setting. Finally, we are implementing a system to combine explanations
+from different AI methods and data modalities. Within the EO, we can represent
+fifteen different explanation types, and we have tested these representations
+in six exemplar use cases. We find that knowledge augmentations improve the
+performance of base large language models in the contextualized QA, and the
+performance is variable across disease groups. In the same setting, clinicians
+also indicated that they prefer to see actionability as one of the main foci in
+explanations. In our explanations combination method, we plan to use similarity
+metrics to determine the similarity of explanations in a chronic disease
+detection setting. Overall, through this thesis, we design methods that can
+support knowledge-enabled explanations across different use cases, accounting
+for the methods in today's AI era that can generate the supporting components
+of these explanations and domain knowledge sources that can enhance them.
+
+摘要：可解釋人工智慧（AI）專注於協助人類了解 AI 系統運作或其決策，數十年來一直是 AI 的基石。最近的可解釋性研究專注於解釋 AI 模型或模型可解釋性的運作。也有幾份立場聲明和評論論文詳細說明了最終使用者對以使用者為中心的可解釋性的需求，但實作較少。因此，本論文旨在彌補模型和以使用者為中心的可解釋性之間的一些差距。我們建立一個解釋本體（EO）以透過其支援元件來表示從文獻中衍生的解釋類型。我們實作一個知識增強的問答（QA）管線，以在臨床環境中支援情境解釋。最後，我們正在實作一個系統，以結合來自不同 AI 方法和資料模式的解釋。在 EO 中，我們可以表示 15 種不同的解釋類型，並且我們已在六個範例使用案例中測試這些表示。我們發現，知識增強改善了基礎大型語言模型在情境化 QA 中的效能，並且效能因疾病群組而異。在相同的環境中，臨床醫生也表示他們希望將可操作性視為解釋中的主要焦點之一。在我們的解釋組合方法中，我們計畫使用相似性指標來確定慢性病偵測環境中解釋的相似性。總體而言，透過本論文，我們設計了可以在不同使用案例中支援知識啟用解釋的方法，考量到當今 AI 時代中可以產生這些解釋的支援元件和可以增強這些解釋的領域知識來源的方法。
+
+##### **Contrasting Attitudes Towards Current and Future AI Applications for Computerised Interpretation of ECG: A Clinical Stakeholder Interview Study**
+2410.16879v1 by Lukas Hughes-Noehrer, Leda Channer, Gabriel Strain, Gregory Yates, Richard Body, Caroline Jay
+
+Objectives: To investigate clinicians' attitudes towards current automated
+interpretation of ECG and novel AI technologies and their perception of
+computer-assisted interpretation. Materials and Methods: We conducted a series
+of interviews with clinicians in the UK. Our study: (i) explores the potential
+for AI, specifically future 'human-like' computing approaches, to facilitate
+ECG interpretation and support clinical decision making, and (ii) elicits their
+opinions about the importance of explainability and trustworthiness of AI
+algorithms. Results: We performed inductive thematic analysis on interview
+transcriptions from 23 clinicians and identified the following themes: (i) a
+lack of trust in current systems, (ii) positive attitudes towards future AI
+applications and requirements for these, (iii) the relationship between the
+accuracy and explainability of algorithms, and (iv) opinions on education,
+possible deskilling, and the impact of AI on clinical competencies. Discussion:
+Clinicians do not trust current computerised methods, but welcome future 'AI'
+technologies. Where clinicians trust future AI interpretation to be accurate,
+they are less concerned that it is explainable. They also preferred ECG
+interpretation that demonstrated the results of the algorithm visually. Whilst
+clinicians do not fear job losses, they are concerned about deskilling and the
+need to educate the workforce to use AI responsibly. Conclusion: Clinicians are
+positive about the future application of AI in clinical decision-making.
+Accuracy is a key factor of uptake and visualisations are preferred over
+current computerised methods. This is viewed as a potential means of training
+and upskilling, in contrast to the deskilling that automation might be
+perceived to bring.
 
-摘要：基於知識圖譜的推薦因其利用豐富語義關係的能力而備受關注。然而，構建和維護知識圖譜 (KG) 是一項資源密集型任務，而 KG 的準確性可能會受到雜訊、過時或無關的三元組的影響。大型語言模型 (LLM) 的最新進展為提高 KG 在推薦任務中的品質和相關性提供了一種有前途的方法。儘管如此，將 LLM 整合到基於 KG 的系統中會帶來挑戰，例如有效擴充 KG、處理幻覺，以及開發有效的聯合學習方法。在本文中，我們提出具有 LLM 擴充的信心感知型基於 KG 的推薦框架 (CKG-LLMA)，這是一個結合 KG 和 LLM 進行推薦任務的新穎框架。該框架包括：(1) 一個基於 LLM 的子圖擴充器，用於使用高品質資訊豐富 KG，(2) 一個信心感知型訊息傳播機制，用於過濾雜訊三元組，以及 (3) 一個雙視圖對比學習方法，用於整合使用者-項目互動和 KG 資料。此外，我們採用一個信心感知型解釋產生程序，以引導 LLM 為推薦產生逼真的解釋。最後，大量的實驗證明了 CKG-LLMA 在多個公開資料集中的有效性。
+摘要：<paragraph>目的：調查臨床醫生對目前自動化心電圖解讀和新的人工智慧技術的態度，以及他們對電腦輔助解讀的看法。材料和方法：我們對英國的臨床醫生進行了一系列訪談。我們的研究：(i) 探討人工智慧的潛力，特別是未來的「類人類」運算方法，以促進心電圖解讀並支持臨床決策制定，以及 (ii) 徵求他們對人工智慧演算法的可解釋性和可信度的看法。結果：我們對 23 位臨床醫生的訪談記錄進行了歸納主題分析，並找出以下主題：(i) 對目前系統缺乏信任，(ii) 對未來人工智慧應用和對這些應用的要求持正面態度，(iii) 演算法的準確性和可解釋性之間的關係，以及 (iv) 對教育、可能的技能退化，以及人工智慧對臨床能力的影響的看法。討論：臨床醫生不信任目前的電腦化方法，但歡迎未來的「人工智慧」技術。在臨床醫生相信未來的 AI 解讀準確的情況下，他們不太擔心它是否可解釋。他們也比較喜歡能以視覺方式呈現演算法結果的心電圖解讀。雖然臨床醫生不害怕失業，但他們擔心技能退化，以及需要教育員工負責任地使用人工智慧。結論：臨床醫生對人工智慧在臨床決策制定中的未來應用持正面態度。準確性是採用人工智慧的一個關鍵因素，而視覺化比目前的電腦化方法更受青睞。這被視為一種潛在的培訓和提升技能的方法，與自動化可能帶來的技能退化形成對比。</paragraph>
 
-##### **A Schema-Guided Reason-while-Retrieve framework for Reasoning on Scene Graphs with Large-Language-Models (LLMs)**
-2502.03450v1 by Yiye Chen, Harpreet Sawhney, Nicholas Gydé, Yanan Jian, Jack Saunders, Patricio Vela, Ben Lundell
+##### **Pathologist-like explainable AI for interpretable Gleason grading in prostate cancer**
+2410.15012v1 by Gesa Mittmann, Sara Laiouar-Pedari, Hendrik A. Mehrtens, Sarah Haggenmüller, Tabea-Clara Bucher, Tirtha Chanda, Nadine T. Gaisa, Mathias Wagner, Gilbert Georg Klamminger, Tilman T. Rau, Christina Neppl, Eva Maria Compérat, Andreas Gocht, Monika Hämmerle, Niels J. Rupp, Jula Westhoff, Irene Krücken, Maximillian Seidl, Christian M. Schürch, Marcus Bauer, Wiebke Solass, Yu Chun Tam, Florian Weber, Rainer Grobholz, Jaroslaw Augustyniak, Thomas Kalinski, Christian Hörner, Kirsten D. Mertz, Constanze Döring, Andreas Erbersdobler, Gabriele Deubler, Felix Bremmer, Ulrich Sommer, Michael Brodhun, Jon Griffin, Maria Sarah L. Lenon, Kiril Trpkov, Liang Cheng, Fei Chen, Angelique Levi, Guoping Cai, Tri Q. Nguyen, Ali Amin, Alessia Cimadamore, Ahmed Shabaik, Varsha Manucha, Nazeel Ahmad, Nidia Messias, Francesca Sanguedolce, Diana Taheri, Ezra Baraban, Liwei Jia, Rajal B. Shah, Farshid Siadat, Nicole Swarbrick, Kyung Park, Oudai Hassan, Siamak Sakhaie, Michelle R. Downes, Hiroshi Miyamoto, Sean R. Williamson, Tim Holland-Letz, Carolin V. Schneider, Jakob Nikolas Kather, Yuri Tolkach, Titus J. Brinker
 
-Scene graphs have emerged as a structured and serializable environment
-representation for grounded spatial reasoning with Large Language Models
-(LLMs). In this work, we propose SG-RwR, a Schema-Guided Retrieve-while-Reason
-framework for reasoning and planning with scene graphs. Our approach employs
-two cooperative, code-writing LLM agents: a (1) Reasoner for task planning and
-information queries generation, and a (2) Retriever for extracting
-corresponding graph information following the queries. Two agents collaborate
-iteratively, enabling sequential reasoning and adaptive attention to graph
-information. Unlike prior works, both agents are prompted only with the scene
-graph schema rather than the full graph data, which reduces the hallucination
-by limiting input tokens, and drives the Reasoner to generate reasoning trace
-abstractly.Following the trace, the Retriever programmatically query the scene
-graph data based on the schema understanding, allowing dynamic and global
-attention on the graph that enhances alignment between reasoning and retrieval.
-Through experiments in multiple simulation environments, we show that our
-framework surpasses existing LLM-based approaches in numerical Q\&A and
-planning tasks, and can benefit from task-level few-shot examples, even in the
-absence of agent-level demonstrations. Project code will be released.
+The aggressiveness of prostate cancer, the most common cancer in men
+worldwide, is primarily assessed based on histopathological data using the
+Gleason scoring system. While artificial intelligence (AI) has shown promise in
+accurately predicting Gleason scores, these predictions often lack inherent
+explainability, potentially leading to distrust in human-machine interactions.
+To address this issue, we introduce a novel dataset of 1,015 tissue microarray
+core images, annotated by an international group of 54 pathologists. The
+annotations provide detailed localized pattern descriptions for Gleason grading
+in line with international guidelines. Utilizing this dataset, we develop an
+inherently explainable AI system based on a U-Net architecture that provides
+predictions leveraging pathologists' terminology. This approach circumvents
+post-hoc explainability methods while maintaining or exceeding the performance
+of methods trained directly for Gleason pattern segmentation (Dice score: 0.713
+$\pm$ 0.003 trained on explanations vs. 0.691 $\pm$ 0.010 trained on Gleason
+patterns). By employing soft labels during training, we capture the intrinsic
+uncertainty in the data, yielding strong results in Gleason pattern
+segmentation even in the context of high interobserver variability. With the
+release of this dataset, we aim to encourage further research into segmentation
+in medical tasks with high levels of subjectivity and to advance the
+understanding of pathologists' reasoning processes.
 
-摘要：場景圖表已成為大型語言模型 (LLM) 以基礎空間推理為基礎的結構化且可序列化的環境表徵。在這項工作中，我們提出 SG-RwR，一個以綱要為導向的檢索與推理框架，用於場景圖表的推理和規劃。我們的做法採用了兩個協作的、編寫程式碼的 LLM 代理：一個 (1) 推論器，用於任務規劃和資訊查詢產生，以及一個 (2) 檢索器，用於根據查詢提取對應的圖形資訊。兩個代理反覆合作，實現對圖形資訊的順序推理和適應性關注。與先前的作品不同，兩個代理僅提示場景圖表綱要，而不是完整的圖形資料，這透過限制輸入代碼減少了幻覺，並驅使推論器抽象地產生推理軌跡。根據軌跡，檢索器根據綱要理解以程式化方式查詢場景圖形資料，允許對圖形進行動態和整體關注，增強推理和檢索之間的一致性。透過在多個模擬環境中的實驗，我們表明我們的框架在數值問答和規劃任務中超越了現有的基於 LLM 的方法，並且可以受益於任務級別的少次範例，即使在沒有代理級別示範的情況下也是如此。專案程式碼將會釋出。
+摘要：前列腺癌是全球男性最常見的癌症，其惡性程度主要根據 Gleason 評分系統使用組織病理學數據進行評估。雖然人工智慧 (AI) 在準確預測 Gleason 評分方面已展現潛力，但這些預測通常缺乏內在的可解釋性，可能會導致對人機互動的不信任。為了解決這個問題，我們引進了一個由 54 位病理學家組成的國際團隊註解的 1,015 個組織微陣列核心影像的新穎資料集。這些註解提供了詳細的局部模式描述，用於符合國際準則的 Gleason 分級。利用這個資料集，我們開發了一個基於 U-Net 架構的內在可解釋 AI 系統，該系統提供了利用病理學家術語進行預測。這種方法規避了事後可解釋性方法，同時維持或超越了直接訓練用於 Gleason 模式分割的方法的效能（Dice 分數：0.713 ± 0.003，訓練於解釋，相對於 0.691 ± 0.010，訓練於 Gleason 模式）。透過在訓練期間採用軟標籤，我們捕捉了資料中的內在不確定性，即使在觀察者間變異性高的情況下，也能在 Gleason 模式分割中產生強大的結果。透過釋出這個資料集，我們旨在鼓勵進一步研究主觀性高的醫療任務中的分割，並增進對病理學家推理過程的理解。
 
-##### **SymAgent: A Neural-Symbolic Self-Learning Agent Framework for Complex Reasoning over Knowledge Graphs**
-2502.03283v1 by Ben Liu, Jihai Zhang, Fangquan Lin, Cheng Yang, Min Peng, Wotao Yin
+##### **Explainable AI Methods for Multi-Omics Analysis: A Survey**
+2410.11910v1 by Ahmad Hussein, Mukesh Prasad, Ali Braytee
 
-Recent advancements have highlighted that Large Language Models (LLMs) are
-prone to hallucinations when solving complex reasoning problems, leading to
-erroneous results. To tackle this issue, researchers incorporate Knowledge
-Graphs (KGs) to improve the reasoning ability of LLMs. However, existing
-methods face two limitations: 1) they typically assume that all answers to the
-questions are contained in KGs, neglecting the incompleteness issue of KGs, and
-2) they treat the KG as a static repository and overlook the implicit logical
-reasoning structures inherent in KGs. In this paper, we introduce SymAgent, an
-innovative neural-symbolic agent framework that achieves collaborative
-augmentation between KGs and LLMs. We conceptualize KGs as dynamic environments
-and transform complex reasoning tasks into a multi-step interactive process,
-enabling KGs to participate deeply in the reasoning process. SymAgent consists
-of two modules: Agent-Planner and Agent-Executor. The Agent-Planner leverages
-LLM's inductive reasoning capability to extract symbolic rules from KGs,
-guiding efficient question decomposition. The Agent-Executor autonomously
-invokes predefined action tools to integrate information from KGs and external
-documents, addressing the issues of KG incompleteness. Furthermore, we design a
-self-learning framework comprising online exploration and offline iterative
-policy updating phases, enabling the agent to automatically synthesize
-reasoning trajectories and improve performance. Experimental results
-demonstrate that SymAgent with weak LLM backbones (i.e., 7B series) yields
-better or comparable performance compared to various strong baselines. Further
-analysis reveals that our agent can identify missing triples, facilitating
-automatic KG updates.
+Advancements in high-throughput technologies have led to a shift from
+traditional hypothesis-driven methodologies to data-driven approaches.
+Multi-omics refers to the integrative analysis of data derived from multiple
+'omes', such as genomics, proteomics, transcriptomics, metabolomics, and
+microbiomics. This approach enables a comprehensive understanding of biological
+systems by capturing different layers of biological information. Deep learning
+methods are increasingly utilized to integrate multi-omics data, offering
+insights into molecular interactions and enhancing research into complex
+diseases. However, these models, with their numerous interconnected layers and
+nonlinear relationships, often function as black boxes, lacking transparency in
+decision-making processes. To overcome this challenge, explainable artificial
+intelligence (xAI) methods are crucial for creating transparent models that
+allow clinicians to interpret and work with complex data more effectively. This
+review explores how xAI can improve the interpretability of deep learning
+models in multi-omics research, highlighting its potential to provide
+clinicians with clear insights, thereby facilitating the effective application
+of such models in clinical settings.
 
-摘要：<paragraph>最近的研究表明，大型语言模型 (LLM) 在解决复杂的推理问题时容易出现幻觉，从而导致错误的结果。为了解决这个问题，研究人员结合了知识图谱 (KG) 来提高 LLM 的推理能力。然而，现有方法面临两个局限性：1) 它们通常假设问题的答案都包含在 KG 中，忽略了 KG 不完整的问题，2) 它们将 KG 视为一个静态存储库，而忽略了 KG 中固有的隐式逻辑推理结构。在本文中，我们介绍了 SymAgent，这是一个创新的神经符号代理框架，可以在 KG 和 LLM 之间实现协作增强。我们将 KG 概念化为动态环境，并将复杂的推理任务转化为一个多步骤的交互过程，使 KG 能够深入参与推理过程。SymAgent 由两个模块组成：Agent-Planner 和 Agent-Executor。Agent-Planner 利用 LLM 的归纳推理能力从 KG 中提取符号规则，指导高效的问题分解。Agent-Executor 自主调用预定义的动作工具来整合来自 KG 和外部文档的信息，解决 KG 不完整的问题。此外，我们设计了一个自学习框架，包括在线探索和离线迭代策略更新阶段，使代理能够自动合成推理轨迹并提高性能。实验结果表明，具有弱 LLM 主干的 SymAgent（即 7B 系列）与各种强大的基线相比，产生了更好或相当的性能。进一步的分析表明，我们的代理可以识别缺失的三元组，促进自动 KG 更新。</paragraph>
+摘要：高通量技術的進步導致從傳統的假設驅動方法轉變為資料驅動的方法。多組學是指整合分析來自多個「組學」的資料，例如基因組學、蛋白質組學、轉錄組學、代謝組學和微生物組學。此方法透過擷取生物資訊的不同層面，能全面了解生物系統。深度學習方法愈來愈常被用於整合多組學資料，提供分子交互作用的洞察力，並加強對複雜疾病的研究。然而，這些模型具有許多相互連接的層級和非線性關係，通常會像黑盒子一樣運作，缺乏決策過程的透明度。為了克服此挑戰，可解釋人工智慧 (xAI) 方法對於建立透明模型至關重要，讓臨床醫生可以更有效地解釋和處理複雜資料。此評論探討 xAI 如何能改善多組學研究中深度學習模型的可解釋性，強調其提供臨床醫生明確見解的潛力，進而促進此類模型在臨床環境中的有效應用。
 
-##### **Analyze Feature Flow to Enhance Interpretation and Steering in Language Models**
-2502.03032v2 by Daniil Laptev, Nikita Balagansky, Yaroslav Aksenov, Daniil Gavrilov
+##### **Study on the Helpfulness of Explainable Artificial Intelligence**
+2410.11896v1 by Tobias Labarta, Elizaveta Kulicheva, Ronja Froelian, Christian Geißler, Xenia Melman, Julian von Klitzing
 
-We introduce a new approach to systematically map features discovered by
-sparse autoencoder across consecutive layers of large language models,
-extending earlier work that examined inter-layer feature links. By using a
-data-free cosine similarity technique, we trace how specific features persist,
-transform, or first appear at each stage. This method yields granular flow
-graphs of feature evolution, enabling fine-grained interpretability and
-mechanistic insights into model computations. Crucially, we demonstrate how
-these cross-layer feature maps facilitate direct steering of model behavior by
-amplifying or suppressing chosen features, achieving targeted thematic control
-in text generation. Together, our findings highlight the utility of a causal,
-cross-layer interpretability framework that not only clarifies how features
-develop through forward passes but also provides new means for transparent
-manipulation of large language models.
+Explainable Artificial Intelligence (XAI) is essential for building advanced
+machine learning-powered applications, especially in critical domains such as
+medical diagnostics or autonomous driving. Legal, business, and ethical
+requirements motivate using effective XAI, but the increasing number of
+different methods makes it challenging to pick the right ones. Further, as
+explanations are highly context-dependent, measuring the effectiveness of XAI
+methods without users can only reveal a limited amount of information,
+excluding human factors such as the ability to understand it. We propose to
+evaluate XAI methods via the user's ability to successfully perform a proxy
+task, designed such that a good performance is an indicator for the explanation
+to provide helpful information. In other words, we address the helpfulness of
+XAI for human decision-making. Further, a user study on state-of-the-art
+methods was conducted, showing differences in their ability to generate trust
+and skepticism and the ability to judge the rightfulness of an AI decision
+correctly. Based on the results, we highly recommend using and extending this
+approach for more objective-based human-centered user studies to measure XAI
+performance in an end-to-end fashion.
 
-摘要：我們提出了一種新方法，用於系統性地繪製大型語言模型連續層中稀疏自動編碼器發現的功能，擴展了先前研究層間特徵連結的工作。透過使用無資料餘弦相似性技術，我們追蹤特定特徵在每個階段如何持續、轉換或首次出現。此方法產生了特徵演化的細粒度流程圖，實現了細粒度的可解釋性和對模型運算的機制見解。至關重要的是，我們展示了這些跨層特徵圖如何透過放大或抑制所選特徵來促進模型行為的直接引導，在文字生成中實現目標主題控制。我們的研究結果共同突出了因果、跨層可解釋性框架的效用，不僅闡明了特徵如何透過前向傳遞發展，還提供了新的方法來透明地操作大型語言模型。
+摘要：可解釋人工智慧 (XAI) 對於建構先進的機器學習驅動應用程式至關重要，特別是在醫療診斷或自動駕駛等關鍵領域。法律、商業和倫理要求促使使用有效的 XAI，但數量日益增加的不同方法使得挑選正確的方法具有挑戰性。此外，由於解釋高度依賴於背景，在沒有使用者的情況下衡量 XAI 方法的有效性只能揭示有限的資訊，排除人類因素，例如理解它的能力。我們建議透過使用者成功執行代理任務的能力來評估 XAI 方法，設計使得良好的執行表現是解釋提供有用資訊的指標。換句話說，我們探討 XAI 對人類決策制定的幫助。此外，對最先進的方法進行使用者研究，顯示出它們在產生信任和懷疑的能力以及正確判斷 AI 決策是否正確的能力方面存在差異。根據結果，我們強烈建議使用和擴充這種方法，以進行更多以目標為基礎的人為中心使用者研究，以終端到終端的方式衡量 XAI 效能。
 
-##### **A Benchmark for the Detection of Metalinguistic Disagreements between LLMs and Knowledge Graphs**
-2502.02896v1 by Bradley P. Allen, Paul T. Groth
+##### **Use of What-if Scenarios to Help Explain Artificial Intelligence Models for Neonatal Health**
+2410.09635v1 by Abdullah Mamun, Lawrence D. Devoe, Mark I. Evans, David W. Britt, Judith Klein-Seetharaman, Hassan Ghasemzadeh
 
-Evaluating large language models (LLMs) for tasks like fact extraction in
-support of knowledge graph construction frequently involves computing accuracy
-metrics using a ground truth benchmark based on a knowledge graph (KG). These
-evaluations assume that errors represent factual disagreements. However, human
-discourse frequently features metalinguistic disagreement, where agents differ
-not on facts but on the meaning of the language used to express them. Given the
-complexity of natural language processing and generation using LLMs, we ask: do
-metalinguistic disagreements occur between LLMs and KGs? Based on an
-investigation using the T-REx knowledge alignment dataset, we hypothesize that
-metalinguistic disagreement does in fact occur between LLMs and KGs, with
-potential relevance for the practice of knowledge graph engineering. We propose
-a benchmark for evaluating the detection of factual and metalinguistic
-disagreements between LLMs and KGs. An initial proof of concept of such a
-benchmark is available on Github.
+Early detection of intrapartum risk enables interventions to potentially
+prevent or mitigate adverse labor outcomes such as cerebral palsy. Currently,
+there is no accurate automated system to predict such events to assist with
+clinical decision-making. To fill this gap, we propose "Artificial Intelligence
+(AI) for Modeling and Explaining Neonatal Health" (AIMEN), a deep learning
+framework that not only predicts adverse labor outcomes from maternal, fetal,
+obstetrical, and intrapartum risk factors but also provides the model's
+reasoning behind the predictions made. The latter can provide insights into
+what modifications in the input variables of the model could have changed the
+predicted outcome. We address the challenges of imbalance and small datasets by
+synthesizing additional training data using Adaptive Synthetic Sampling
+(ADASYN) and Conditional Tabular Generative Adversarial Networks (CTGAN). AIMEN
+uses an ensemble of fully-connected neural networks as the backbone for its
+classification with the data augmentation supported by either ADASYN or CTGAN.
+AIMEN, supported by CTGAN, outperforms AIMEN supported by ADASYN in
+classification. AIMEN can predict a high risk for adverse labor outcomes with
+an average F1 score of 0.784. It also provides counterfactual explanations that
+can be achieved by changing 2 to 3 attributes on average. Resources available:
+https://github.com/ab9mamun/AIMEN.
 
-摘要：評估大型語言模型 (LLM) 執行知識圖譜建構支援事實萃取等任務時，通常會使用基於知識圖譜 (KG) 的基準事實計算準確度指標。這些評估假設錯誤代表事實上的分歧。然而，人類話語經常出現元語言分歧，其中代理人之間的差異不在於事實，而在於用於表達事實的語言的含義。鑑於使用 LLM 處理和產生自然語言的複雜性，我們提出疑問：LLM 和 KG 之間是否會發生元語言分歧？根據使用 T-REx 知識比對資料集進行的調查，我們假設元語言分歧確實會發生在 LLM 和 KG 之間，並可能與知識圖譜工程實務有關。我們提出一個基準，用於評估 LLM 和 KG 之間的事實和元語言分歧的偵測。此基準的初步概念驗證可在 Github 上取得。
+摘要：產程中風險的早期偵測有助於進行干預措施，以預防或減輕不利的生產結果，例如腦性麻痺。目前，沒有準確的自動化系統可以預測此類事件，以協助臨床決策。為了填補這一空白，我們提出「用於建模和解釋新生兒健康的人工智慧」(AIMEN)，這是一個深度學習架構，它不僅可以根據孕產婦、胎兒、產科和產程風險因素預測不利的生產結果，還能提供模型做出預測背後的原因。後者可以提供見解，說明模型輸入變數中的哪些修改可能會改變預測結果。我們透過使用適應性合成抽樣 (ADASYN) 和條件表格生成對抗網路 (CTGAN) 來合成額外的訓練資料，以解決不平衡和小型資料集的挑戰。AIMEN 使用全連接神經網路的集合作為其分類的骨幹，並透過 ADASYN 或 CTGAN 支援資料擴充。由 CTGAN 支援的 AIMEN 在分類方面優於由 ADASYN 支援的 AIMEN。AIMEN 可以預測不利的生產結果的高風險，平均 F1 分數為 0.784。它還提供反事實解釋，可透過平均變更 2 至 3 個屬性來達成。可用資源：https://github.com/ab9mamun/AIMEN。
 
-##### **Mol-LLM: Generalist Molecular LLM with Improved Graph Utilization**
-2502.02810v1 by Chanhui Lee, Yuheon Song, YongJun Jeong, Hanbum Ko, Rodrigo Hormazabal, Sehui Han, Kyunghoon Bae, Sungbin Lim, Sungwoong Kim
+##### **Artificial intelligence techniques in inherited retinal diseases: A review**
+2410.09105v1 by Han Trinh, Jordan Vice, Jason Charng, Zahra Tajbakhsh, Khyber Alam, Fred K. Chen, Ajmal Mian
 
-Recent advances in Large Language Models (LLMs) have motivated the
-development of general LLMs for molecular tasks. While several studies have
-demonstrated that fine-tuned LLMs can achieve impressive benchmark
-performances, they are far from genuine generalist molecular LLMs due to a lack
-of fundamental understanding of molecular structure. Specifically, when given
-molecular task instructions, LLMs trained with naive next-token prediction
-training assign similar likelihood scores to both original and negatively
-corrupted molecules, revealing their lack of molecular structure understanding
-that is crucial for reliable and general molecular LLMs. To overcome this
-limitation and obtain a true generalist molecular LLM, we introduce a novel
-multi-modal training method based on a thorough multi-modal instruction tuning
-as well as a molecular structure preference optimization between chosen and
-rejected graphs. On various molecular benchmarks, the proposed generalist
-molecular LLM, called Mol-LLM, achieves state-of-the-art performances among
-generalist LLMs on most tasks, at the same time, surpassing or comparable to
-state-of-the-art specialist LLMs. Moreover, Mol-LLM also shows superior
-generalization performances in reaction prediction tasks, demonstrating the
-effect of the molecular structure understanding for generalization perspective.
+Inherited retinal diseases (IRDs) are a diverse group of genetic disorders
+that lead to progressive vision loss and are a major cause of blindness in
+working-age adults. The complexity and heterogeneity of IRDs pose significant
+challenges in diagnosis, prognosis, and management. Recent advancements in
+artificial intelligence (AI) offer promising solutions to these challenges.
+However, the rapid development of AI techniques and their varied applications
+have led to fragmented knowledge in this field. This review consolidates
+existing studies, identifies gaps, and provides an overview of AI's potential
+in diagnosing and managing IRDs. It aims to structure pathways for advancing
+clinical applications by exploring AI techniques like machine learning and deep
+learning, particularly in disease detection, progression prediction, and
+personalized treatment planning. Special focus is placed on the effectiveness
+of convolutional neural networks in these areas. Additionally, the integration
+of explainable AI is discussed, emphasizing its importance in clinical settings
+to improve transparency and trust in AI-based systems. The review addresses the
+need to bridge existing gaps in focused studies on AI's role in IRDs, offering
+a structured analysis of current AI techniques and outlining future research
+directions. It concludes with an overview of the challenges and opportunities
+in deploying AI for IRDs, highlighting the need for interdisciplinary
+collaboration and the continuous development of robust, interpretable AI models
+to advance clinical applications.
 
-摘要：大型語言模型 (LLM) 的近期進展激勵了針對分子任務開發通用 LLM。雖然多項研究已證明微調 LLM 可實現令人印象深刻的基準效能，但由於缺乏對分子結構的基本理解，它們遠非真正的通才分子 LLM。具體來說，當給予分子任務說明時，使用天真的下一個符號預測訓練訓練的 LLM 會將類似的可能性評分分配給原始分子和負面損壞分子，這顯示出它們缺乏對分子結構的理解，而這對於可靠且通用的分子 LLM 至關重要。為了克服這個限制並獲得真正的通才分子 LLM，我們引入了一種新穎的多模態訓練方法，該方法基於徹底的多模態說明調整以及在所選和拒絕圖形之間的分子結構偏好最佳化。在各種分子基準測試中，所提出的通才分子 LLM（稱為 Mol-LLM）在多數任務中實現了通才 LLM 中的最新效能，同時超越或與最新的專家 LLM 相當。此外，Mol-LLM 在反應預測任務中也展現出優異的泛化效能，證明了分子結構理解對泛化觀點的影響。
+摘要：遺傳性視網膜疾病 (IRD) 是一組多樣化的遺傳疾病，
+會導致視力逐漸喪失，是工作年齡成人失明的主要原因。IRD 的複雜性和異質性對診斷、預後和管理提出了重大挑戰。最近人工智能 (AI) 的進步為這些挑戰提供了有希望的解決方案。
+然而，AI 技術的快速發展及其多種應用導致了該領域的知識分散。本綜述整合了現有研究，找出差距，並概述了 AI 在診斷和管理 IRD 中的潛力。它旨在通過探索機器學習和深度學習等 AI 技術，特別是在疾病檢測、進程預測和個性化治療計劃中，為推進臨床應用構建途徑。特別關注這些領域中卷積神經網路的有效性。此外，討論了可解釋 AI 的整合，強調了其在臨床環境中提高透明度和對基於 AI 的系統的信任的重要性。該綜述解決了彌合 AI 在 IRD 中作用的重點研究中現有差距的必要性，提供了對當前 AI 技術的結構化分析，並概述了未來的研究方向。最後概述了在 IRD 中部署 AI 的挑戰和機遇，強調了跨學科合作和持續開發強大、可解釋的 AI 模型以推進臨床應用的必要性。
 
-##### **Leveraging the true depth of LLMs**
-2502.02790v1 by Ramón Calvo González, Daniele Paliotta, Matteo Pagliardini, Martin Jaggi, François Fleuret
+##### **CasiMedicos-Arg: A Medical Question Answering Dataset Annotated with Explanatory Argumentative Structures**
+2410.05235v2 by Ekaterina Sviridova, Anar Yeginbergen, Ainara Estarrona, Elena Cabrio, Serena Villata, Rodrigo Agerri
 
-Large Language Models demonstrate remarkable capabilities at the cost of high
-compute requirements. While recent research has shown that intermediate layers
-can be removed or have their order shuffled without impacting performance
-significantly, these findings have not been employed to reduce the
-computational cost of inference. We investigate several potential ways to
-reduce the depth of pre-trained LLMs without significantly affecting
-performance. Leveraging our insights, we present a novel approach that exploits
-this decoupling between layers by grouping some of them into pairs that can be
-evaluated in parallel.
-  This modification of the computational graph -- through better parallelism --
-results in an average improvement of around 1.20x on the number of tokens
-generated per second, without re-training nor fine-tuning, while retaining
-95%-99% of the original accuracy. Empirical evaluation demonstrates that this
-approach significantly improves serving efficiency while maintaining model
-performance, offering a practical improvement for large-scale LLM deployment.
+Explaining Artificial Intelligence (AI) decisions is a major challenge
+nowadays in AI, in particular when applied to sensitive scenarios like medicine
+and law. However, the need to explain the rationale behind decisions is a main
+issue also for human-based deliberation as it is important to justify
+\textit{why} a certain decision has been taken. Resident medical doctors for
+instance are required not only to provide a (possibly correct) diagnosis, but
+also to explain how they reached a certain conclusion. Developing new tools to
+aid residents to train their explanation skills is therefore a central
+objective of AI in education. In this paper, we follow this direction, and we
+present, to the best of our knowledge, the first multilingual dataset for
+Medical Question Answering where correct and incorrect diagnoses for a clinical
+case are enriched with a natural language explanation written by doctors. These
+explanations have been manually annotated with argument components (i.e.,
+premise, claim) and argument relations (i.e., attack, support), resulting in
+the Multilingual CasiMedicos-Arg dataset which consists of 558 clinical cases
+in four languages (English, Spanish, French, Italian) with explanations, where
+we annotated 5021 claims, 2313 premises, 2431 support relations, and 1106
+attack relations. We conclude by showing how competitive baselines perform over
+this challenging dataset for the argument mining task.
 
-摘要：大型语言模型展示了其强大的功能，但代价是较高的计算需求。虽然最近的研究表明，中间层可以被移除或重新排列其顺序，而不会显著影响性能，但这些发现尚未被用来降低推理的计算成本。我们研究了几种潜在的方法来减少预训练 LLM 的深度，而不会显著影响性能。利用我们的见解，我们提出了一种新颖的方法，该方法通过将其中一些分组为可以并行评估的成对来利用层之间的这种解耦。
-通过更好的并行性对计算图进行修改，平均而言，每秒生成的令牌数量提高了约 1.20 倍，而无需重新训练或微调，同时保留了 95%-99% 的原始准确性。经验评估表明，这种方法显著提高了服务效率，同时保持了模型性能，为大规模 LLM 部署提供了实际改进。
+摘要：解釋人工智慧 (AI) 的決策是現在 AI 的一項重大挑戰，特別是應用於像醫學和法律等敏感情境時。然而，解釋決策背後理由的需求也是基於人類的考量的一個主要問題，因為有必要證明為什麼做出某個決策。例如，住院醫師不僅需要提供（可能是正確的）診斷，還需要解釋他們如何達成某個結論。因此，開發新的工具來幫助住院醫師訓練他們的解釋技巧是教育中 AI 的一項核心目標。在本文中，我們遵循這個方向，並且根據我們的了解，提出第一個多語言醫學問答資料集，其中臨床病例的正確和不正確診斷都附有由醫生撰寫的自然語言解釋。這些解釋已使用論證組成（即前提、主張）和論證關係（即攻擊、支持）進行手動註解，產生多語言 CasiMedicos-Arg 資料集，其中包含 558 個具有解釋的四種語言（英語、西班牙語、法語、義大利語）的臨床病例，我們註解了 5021 個主張、2313 個前提、2431 個支持關係和 1106 個攻擊關係。我們最後展示了競爭基準如何針對論證探勘任務執行此具挑戰性的資料集。
 
-##### **Modular Training of Neural Networks aids Interpretability**
-2502.02470v2 by Satvik Golechha, Maheep Chaudhary, Joan Velja, Alessandro Abate, Nandi Schoots
+##### **Explainable Diagnosis Prediction through Neuro-Symbolic Integration**
+2410.01855v2 by Qiuhao Lu, Rui Li, Elham Sagheb, Andrew Wen, Jinlian Wang, Liwei Wang, Jungwei W. Fan, Hongfang Liu
 
-An approach to improve neural network interpretability is via clusterability,
-i.e., splitting a model into disjoint clusters that can be studied
-independently. We define a measure for clusterability and show that pre-trained
-models form highly enmeshed clusters via spectral graph clustering. We thus
-train models to be more modular using a "clusterability loss" function that
-encourages the formation of non-interacting clusters. Using automated
-interpretability techniques, we show that our method can help train models that
-are more modular and learn different, disjoint, and smaller circuits. We
-investigate CNNs trained on MNIST and CIFAR, small transformers trained on
-modular addition, and language models. Our approach provides a promising
-direction for training neural networks that learn simpler functions and are
-easier to interpret.
+Diagnosis prediction is a critical task in healthcare, where timely and
+accurate identification of medical conditions can significantly impact patient
+outcomes. Traditional machine learning and deep learning models have achieved
+notable success in this domain but often lack interpretability which is a
+crucial requirement in clinical settings. In this study, we explore the use of
+neuro-symbolic methods, specifically Logical Neural Networks (LNNs), to develop
+explainable models for diagnosis prediction. Essentially, we design and
+implement LNN-based models that integrate domain-specific knowledge through
+logical rules with learnable thresholds. Our models, particularly
+$M_{\text{multi-pathway}}$ and $M_{\text{comprehensive}}$, demonstrate superior
+performance over traditional models such as Logistic Regression, SVM, and
+Random Forest, achieving higher accuracy (up to 80.52\%) and AUROC scores (up
+to 0.8457) in the case study of diabetes prediction. The learned weights and
+thresholds within the LNN models provide direct insights into feature
+contributions, enhancing interpretability without compromising predictive
+power. These findings highlight the potential of neuro-symbolic approaches in
+bridging the gap between accuracy and explainability in healthcare AI
+applications. By offering transparent and adaptable diagnostic models, our work
+contributes to the advancement of precision medicine and supports the
+development of equitable healthcare solutions. Future research will focus on
+extending these methods to larger and more diverse datasets to further validate
+their applicability across different medical conditions and populations.
 
-摘要：一種改善神經網路可解釋性的方法是透過群集性，
-也就是將模型分割成可獨立研究的不相交群集。我們定義一個群集性的度量，並顯示預訓練的
-模型透過光譜圖形群集形成高度糾纏的群集。因此，我們使用「群集性損失」函數訓練模型，使其更具模組化，
-這鼓勵形成非交互群集。使用自動化可解釋性技術，我們顯示我們的模型可以幫助訓練更具模組化的模型，並學習不同、不相交且較小的電路。我們
-研究了在 MNIST 和 CIFAR 上訓練的 CNN，在模組化加法上訓練的小型Transformer，以及語言模型。我們的做法為訓練學習更簡單函數且更容易解釋的神經網路提供了有希望的方向。
+摘要：診斷預測是醫療保健中的關鍵任務，及時且準確地識別醫療狀況會顯著影響患者的結果。傳統的機器學習和深度學習模型已在這個領域取得顯著成功，但通常缺乏可解釋性，這在臨床環境中是一項關鍵要求。在本研究中，我們探討了神經符號方法的應用，特別是邏輯神經網路 (LNN)，以開發用於診斷預測的可解釋模型。基本上，我們設計並實作了基於 LNN 的模型，這些模型透過具有可學習閾值的邏輯規則整合領域特定知識。我們的模型，特別是 $M_{\text{multi-pathway}}$ 和 $M_{\text{comprehensive}}$，表現出優於傳統模型（例如邏輯迴歸、SVM 和隨機森林）的優異效能，在糖尿病預測的案例研究中達到了更高的準確度（高達 80.52%）和 AUROC 分數（高達 0.8457）。LNN 模型中學習到的權重和閾值提供了對特徵貢獻的直接見解，增強了可解釋性，同時不影響預測能力。這些發現突顯了神經符號方法在彌合醫療保健 AI 應用中準確性和可解釋性差距方面的潛力。透過提供透明且適應性強的診斷模型，我們的研究有助於推進精準醫療，並支援公平醫療保健解決方案的開發。未來的研究將專注於將這些方法擴展到更大且更多樣化的資料集，以進一步驗證其在不同醫療狀況和人群中的適用性。
 
-##### **Premise-Augmented Reasoning Chains Improve Error Identification in Math reasoning with LLMs**
-2502.02362v3 by Sagnik Mukherjee, Abhinav Chinta, Takyoung Kim, Tarun Anoop Sharma, Dilek Hakkani-Tür
+##### **Easydiagnos: a framework for accurate feature selection for automatic diagnosis in smart healthcare**
+2410.00366v1 by Prasenjit Maji, Amit Kumar Mondal, Hemanta Kumar Mondal, Saraju P. Mohanty
 
-Chain-of-Thought (CoT) prompting enhances mathematical reasoning in large
-language models (LLMs) by enabling detailed step-by-step solutions. However,
-due to the verbosity of LLMs, the resulting reasoning chains can be long,
-making it harder to verify the reasoning steps and trace issues resulting from
-dependencies between the steps that may be farther away in the sequence of
-steps. Importantly, mathematical reasoning allows each step to be derived from
-a small set of premises, which are a subset of the preceding steps in the
-reasoning chain. In this paper, we present a framework that identifies the
-premises for each step, to improve the evaluation of reasoning. We restructure
-conventional linear reasoning chains into Premise Augmented Reasoning Chains
-(PARC) by introducing premise links, resulting in a directed acyclic graph
-where the nodes are the steps and the edges are the premise links. Through
-experiments with a PARC-based dataset that we built, namely PERL (Premises and
-ERrors identification in LLMs), we demonstrate that LLMs can reliably identify
-premises within complex reasoning chains. In particular, even open-source LLMs
-achieve 90% recall in premise identification. We also show that PARC helps to
-identify errors in reasoning chains more reliably. The accuracy of error
-identification improves by 6% to 16% absolute when step-by-step verification is
-carried out in PARC under the premises. Our findings highlight the utility of
-premise-centric representations in addressing complex problem-solving tasks and
-open new avenues for improving the reliability of LLM-based reasoning
-evaluations.
+The rapid advancements in artificial intelligence (AI) have revolutionized
+smart healthcare, driving innovations in wearable technologies, continuous
+monitoring devices, and intelligent diagnostic systems. However, security,
+explainability, robustness, and performance optimization challenges remain
+critical barriers to widespread adoption in clinical environments. This
+research presents an innovative algorithmic method using the Adaptive Feature
+Evaluator (AFE) algorithm to improve feature selection in healthcare datasets
+and overcome problems. AFE integrating Genetic Algorithms (GA), Explainable
+Artificial Intelligence (XAI), and Permutation Combination Techniques (PCT),
+the algorithm optimizes Clinical Decision Support Systems (CDSS), thereby
+enhancing predictive accuracy and interpretability. The proposed method is
+validated across three diverse healthcare datasets using six distinct machine
+learning algorithms, demonstrating its robustness and superiority over
+conventional feature selection techniques. The results underscore the
+transformative potential of AFE in smart healthcare, enabling personalized and
+transparent patient care. Notably, the AFE algorithm, when combined with a
+Multi-layer Perceptron (MLP), achieved an accuracy of up to 98.5%, highlighting
+its capability to improve clinical decision-making processes in real-world
+healthcare applications.
 
-摘要：<paragraph>思考鏈（CoT）提示透過提供詳細的逐步解法，增強大型語言模型（LLM）的數學推理能力。然而，由於 LLM 的冗長，產生的推理鏈可能很長，這使得驗證推理步驟和追蹤由步驟之間相依關係所產生的問題變得更加困難，而這些步驟可能在步驟順序中相距較遠。重要的是，數學推理允許每個步驟從一組小的前提中推導出來，這些前提是推理鏈中前一個步驟的子集。在本文中，我們提出了一個框架，用於識別每個步驟的前提，以改進推理評估。我們透過引入前提連結，將傳統的線性推理鏈重組為前提擴充推理鏈（PARC），產生一個有向無環圖，其中節點是步驟，而邊緣是前提連結。透過我們建立的基於 PARC 的資料集（即 PERL（LLM 中的前提和錯誤識別））進行的實驗，我們證明 LLM 能夠在複雜的推理鏈中可靠地識別前提。特別是，即使是開源 LLM 在前提識別中也能達到 90% 的召回率。我們還表明，PARC 有助於更可靠地識別推理鏈中的錯誤。在前提下於 PARC 中執行逐步驗證時，錯誤識別的準確度提高了 6% 到 16%。我們的研究結果突顯了以前提為中心的表示在解決複雜問題解決任務中的效用，並為改進基於 LLM 的推理評估的可靠性開闢了新途徑。</paragraph>
+摘要：人工智慧 (AI) 的快速進展徹底改變了智慧醫療保健，推動了可穿戴技術、持續監控裝置和智慧診斷系統的創新。然而，安全性、可解釋性、穩健性和效能最佳化挑戰仍然是臨床環境中廣泛採用的關鍵障礙。本研究提出一個創新的演算法方法，使用自適應特徵評估器 (AFE) 演算法來改善醫療保健資料集中的特徵選取並克服問題。AFE 整合了遺傳演算法 (GA)、可解釋人工智慧 (XAI) 和排列組合技術 (PCT)，該演算法最佳化了臨床決策支援系統 (CDSS)，從而提高了預測準確性和可解釋性。所提出的方法使用六種不同的機器學習演算法驗證了三個不同的醫療保健資料集，證明了其穩健性和優於傳統特徵選取技術。結果強調了 AFE 在智慧醫療保健中的轉變潛力，實現了個人化和透明的患者照護。值得注意的是，AFE 演算法與多層感知器 (MLP) 結合使用時，準確度高達 98.5%，突顯了其改善實際醫療保健應用中臨床決策制定流程的能力。
 
-##### **AdaptBot: Combining LLM with Knowledge Graphs and Human Input for Generic-to-Specific Task Decomposition and Knowledge Refinement**
-2502.02067v1 by Shivam Singh, Karthik Swaminathan, Nabanita Dash, Ramandeep Singh, Snehasis Banerjee, Mohan Sridharan, Madhava Krishna
+##### **Dermatologist-like explainable AI enhances melanoma diagnosis accuracy: eye-tracking study**
+2409.13476v1 by Tirtha Chanda, Sarah Haggenmueller, Tabea-Clara Bucher, Tim Holland-Letz, Harald Kittler, Philipp Tschandl, Markus V. Heppt, Carola Berking, Jochen S. Utikal, Bastian Schilling, Claudia Buerger, Cristian Navarrete-Dechent, Matthias Goebeler, Jakob Nikolas Kather, Carolin V. Schneider, Benjamin Durani, Hendrike Durani, Martin Jansen, Juliane Wacker, Joerg Wacker, Reader Study Consortium, Titus J. Brinker
 
-Embodied agents assisting humans are often asked to complete a new task in a
-new scenario. An agent preparing a particular dish in the kitchen based on a
-known recipe may be asked to prepare a new dish or to perform cleaning tasks in
-the storeroom. There may not be sufficient resources, e.g., time or labeled
-examples, to train the agent for these new situations. Large Language Models
-(LLMs) trained on considerable knowledge across many domains are able to
-predict a sequence of abstract actions for such new tasks and scenarios,
-although it may not be possible for the agent to execute this action sequence
-due to task-, agent-, or domain-specific constraints. Our framework addresses
-these challenges by leveraging the generic predictions provided by LLM and the
-prior domain-specific knowledge encoded in a Knowledge Graph (KG), enabling an
-agent to quickly adapt to new tasks and scenarios. The robot also solicits and
-uses human input as needed to refine its existing knowledge. Based on
-experimental evaluation over cooking and cleaning tasks in simulation domains,
-we demonstrate that the interplay between LLM, KG, and human input leads to
-substantial performance gains compared with just using the LLM output.
+Artificial intelligence (AI) systems have substantially improved
+dermatologists' diagnostic accuracy for melanoma, with explainable AI (XAI)
+systems further enhancing clinicians' confidence and trust in AI-driven
+decisions. Despite these advancements, there remains a critical need for
+objective evaluation of how dermatologists engage with both AI and XAI tools.
+In this study, 76 dermatologists participated in a reader study, diagnosing 16
+dermoscopic images of melanomas and nevi using an XAI system that provides
+detailed, domain-specific explanations. Eye-tracking technology was employed to
+assess their interactions. Diagnostic performance was compared with that of a
+standard AI system lacking explanatory features. Our findings reveal that XAI
+systems improved balanced diagnostic accuracy by 2.8 percentage points relative
+to standard AI. Moreover, diagnostic disagreements with AI/XAI systems and
+complex lesions were associated with elevated cognitive load, as evidenced by
+increased ocular fixations. These insights have significant implications for
+clinical practice, the design of AI tools for visual tasks, and the broader
+development of XAI in medical diagnostics.
 
-摘要：具身代理协助人类时，通常需要在新的情境中完成新的任务。基于已知食谱在厨房准备特定菜肴的代理可能会被要求准备新菜肴或在储藏室执行清洁任务。可能没有足够资源（例如时间或标记的示例）来训练代理以应对这些新情况。在许多领域接受大量知识训练的大型语言模型 (LLM) 能够预测此类新任务和情境的抽象动作序列，尽管代理可能无法执行此动作序列，因为任务、代理或特定于域的约束。我们的框架通过利用 LLM 提供的通用预测和知识图 (KG) 中编码的先前特定于域的知识来应对这些挑战，使代理能够快速适应新任务和情境。该机器人还会根据需要征求并使用人类输入来完善其现有知识。基于在模拟域中对烹饪和清洁任务的实验评估，我们证明了 LLM、KG 和人类输入之间的相互作用与仅使用 LLM 输出相比带来了巨大的性能提升。
+摘要：人工智慧 (AI) 系統已大幅改善皮膚科醫師對黑色素瘤的診斷準確度，而可解釋 AI (XAI) 系統進一步提升臨床醫師對 AI 驅動決策的信心與信賴。儘管有這些進展，對於皮膚科醫師如何使用 AI 和 XAI 工具，仍有客觀評估的迫切需求。在這項研究中，76 位皮膚科醫師參與了一項讀者研究，使用 XAI 系統診斷 16 張黑色素瘤和痣的皮膚鏡影像，該系統提供詳細的領域特定說明。採用眼球追蹤技術來評估他們的互動。將診斷表現與缺乏說明功能的標準 AI 系統進行比較。我們的研究結果顯示，XAI 系統相較於標準 AI，將平衡診斷準確度提升了 2.8 個百分點。此外，與 AI/XAI 系統的診斷分歧和複雜的病灶與認知負擔升高有關，這由增加的眼睛注視次數所證實。這些見解對臨床實務、視覺任務 AI 工具的設計和醫學診斷中 XAI 的廣泛發展具有重大意義。
 
-##### **On Bob Dylan: A Computational Perspective**
-2502.01772v1 by Prashant Garg
+##### **Explainable AI for Autism Diagnosis: Identifying Critical Brain Regions Using fMRI Data**
+2409.15374v1 by Suryansh Vidya, Kush Gupta, Amir Aly, Andy Wills, Emmanuel Ifeachor, Rohit Shankar
 
-Cass Sunstein's essay 'On Bob Dylan' describes Dylan's 'dishabituating' style
--- a constant refusal to conform to expectation and a penchant for reinventing
-his musical and lyrical identity. In this paper, I extend Sunstein's
-observations through a large-scale computational analysis of Dylan's lyrics
-from 1962 to 2012. Using o3-mini-high (a large language model), I extract
-concept-to-concept relationships from the lyrics and construct directed
-knowledge graphs that capture Dylan's thematic structure. I then quantify
-shifts in sentiment, metaphorical expression, thematic diversity, and network
-complexity over time. The results indicate that Dylan's lyrics increasingly
-rely on metaphor, display an evolving sentiment profile, and exhibit heightened
-dishabituation -- measured here as a growing variance in the network centrality
-of key concepts. I also find that references to movement, protest, and mythic
-imagery fluctuate in ways that align with well-known phases of Dylan's career,
-reflecting the dynamic and unpredictable quality of his art. These findings not
-only deepen our empirical understanding of Sunstein's thesis but also introduce
-a novel computational method for analyzing an artist's evolution-offering
-broader applicability to the study of cultural and creative change.
+Early diagnosis and intervention for Autism Spectrum Disorder (ASD) has been
+shown to significantly improve the quality of life of autistic individuals.
+However, diagnostics methods for ASD rely on assessments based on clinical
+presentation that are prone to bias and can be challenging to arrive at an
+early diagnosis. There is a need for objective biomarkers of ASD which can help
+improve diagnostic accuracy. Deep learning (DL) has achieved outstanding
+performance in diagnosing diseases and conditions from medical imaging data.
+Extensive research has been conducted on creating models that classify ASD
+using resting-state functional Magnetic Resonance Imaging (fMRI) data. However,
+existing models lack interpretability. This research aims to improve the
+accuracy and interpretability of ASD diagnosis by creating a DL model that can
+not only accurately classify ASD but also provide explainable insights into its
+working. The dataset used is a preprocessed version of the Autism Brain Imaging
+Data Exchange (ABIDE) with 884 samples. Our findings show a model that can
+accurately classify ASD and highlight critical brain regions differing between
+ASD and typical controls, with potential implications for early diagnosis and
+understanding of the neural basis of ASD. These findings are validated by
+studies in the literature that use different datasets and modalities,
+confirming that the model actually learned characteristics of ASD and not just
+the dataset. This study advances the field of explainable AI in medical imaging
+by providing a robust and interpretable model, thereby contributing to a future
+with objective and reliable ASD diagnostics.
 
-摘要：卡斯·桑斯坦的論文「論鮑伯·迪倫」描述了迪倫「去習慣化」的風格
--- 這種風格不斷拒絕符合預期，並熱衷於重新塑造他的音樂和歌詞認同。在本文中，我透過對迪倫 1962 年至 2012 年歌詞進行大規模的運算分析，來延伸桑斯坦的觀察。使用 o3-mini-high（一個大型語言模型），我從歌詞中提取概念對概念的關係，並建構有向知識圖，以捕捉迪倫的主題結構。然後，我量化情緒、隱喻表達、主題多樣性和網路複雜性隨時間的變化。結果顯示，迪倫的歌詞越來越依賴隱喻，展現出不斷演化的情緒輪廓，並表現出高度的去習慣化 -- 在這裡測量為關鍵概念的網路中心性的變異增加。我也發現，對運動、抗議和神話意象的引用，會以與迪倫職業生涯中眾所周知階段一致的方式波動，反映了他藝術的動態和不可預測的品質。這些發現不僅加深了我們對桑斯坦論文的經驗理解，也引入了分析藝術家演變的新穎運算方法，為文化和創造性變化的研究提供了更廣泛的適用性。
+摘要：自閉症譜系障礙 (ASD) 的早期診斷和介入已被證實能顯著改善自閉症患者的生活品質。然而，ASD 的診斷方法依賴於基於臨床表現的評估，容易產生偏見，且可能難以做出早期診斷。有必要找出 ASD 的客觀生物標記，以幫助提高診斷準確性。深度學習 (DL) 在從醫學影像資料診斷疾病和病症方面取得傑出的表現。已經針對建立使用靜態功能性磁振造影 (fMRI) 資料對 ASD 進行分類的模型進行廣泛的研究。然而，現有的模型缺乏可解釋性。本研究旨在透過建立一個不僅能準確分類 ASD，還能提供可解釋見解說明其運作原理的 DL 模型，來改善 ASD 診斷的準確性和可解釋性。所使用的資料集是自閉症大腦影像資料交換 (ABIDE) 的預處理版本，包含 884 個樣本。我們的研究結果顯示，該模型能準確分類 ASD，並強調 ASD 與典型對照組之間存在差異的關鍵腦區，對於 ASD 的早期診斷和神經基礎的理解具有潛在的意義。這些研究結果已由使用不同資料集和方式的文獻研究驗證，證實該模型實際上學習了 ASD 的特徵，而不僅僅是資料集。本研究透過提供一個強健且可解釋的模型，推動了醫學影像中可解釋 AI 的領域，從而為未來提供客觀且可靠的 ASD 診斷做出貢獻。
 
-##### **VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos**
-2502.01549v1 by Xubin Ren, Lingrui Xu, Long Xia, Shuaiqiang Wang, Dawei Yin, Chao Huang
+##### **Improving Prototypical Parts Abstraction for Case-Based Reasoning Explanations Designed for the Kidney Stone Type Recognition**
+2409.12883v1 by Daniel Flores-Araiza, Francisco Lopez-Tiro, Clément Larose, Salvador Hinojosa, Andres Mendez-Vazquez, Miguel Gonzalez-Mendoza, Gilberto Ochoa-Ruiz, Christian Daul
 
-Retrieval-Augmented Generation (RAG) has demonstrated remarkable success in
-enhancing Large Language Models (LLMs) through external knowledge integration,
-yet its application has primarily focused on textual content, leaving the rich
-domain of multi-modal video knowledge predominantly unexplored. This paper
-introduces VideoRAG, the first retrieval-augmented generation framework
-specifically designed for processing and understanding extremely long-context
-videos. Our core innovation lies in its dual-channel architecture that
-seamlessly integrates (i) graph-based textual knowledge grounding for capturing
-cross-video semantic relationships, and (ii) multi-modal context encoding for
-efficiently preserving visual features. This novel design empowers VideoRAG to
-process unlimited-length videos by constructing precise knowledge graphs that
-span multiple videos while maintaining semantic dependencies through
-specialized multi-modal retrieval paradigms. Through comprehensive empirical
-evaluation on our proposed LongerVideos benchmark-comprising over 160 videos
-totaling 134+ hours across lecture, documentary, and entertainment
-categories-VideoRAG demonstrates substantial performance compared to existing
-RAG alternatives and long video understanding methods. The source code of
-VideoRAG implementation and the benchmark dataset are openly available at:
-https://github.com/HKUDS/VideoRAG.
+The in-vivo identification of the kidney stone types during an ureteroscopy
+would be a major medical advance in urology, as it could reduce the time of the
+tedious renal calculi extraction process, while diminishing infection risks.
+Furthermore, such an automated procedure would make possible to prescribe
+anti-recurrence treatments immediately. Nowadays, only few experienced
+urologists are able to recognize the kidney stone types in the images of the
+videos displayed on a screen during the endoscopy. Thus, several deep learning
+(DL) models have recently been proposed to automatically recognize the kidney
+stone types using ureteroscopic images. However, these DL models are of black
+box nature whicl limits their applicability in clinical settings. This
+contribution proposes a case-based reasoning DL model which uses prototypical
+parts (PPs) and generates local and global descriptors. The PPs encode for each
+class (i.e., kidney stone type) visual feature information (hue, saturation,
+intensity and textures) similar to that used by biologists. The PPs are
+optimally generated due a new loss function used during the model training.
+Moreover, the local and global descriptors of PPs allow to explain the
+decisions ("what" information, "where in the images") in an understandable way
+for biologists and urologists. The proposed DL model has been tested on a
+database including images of the six most widespread kidney stone types. The
+overall average classification accuracy was 90.37. When comparing this results
+with that of the eight other DL models of the kidney stone state-of-the-art, it
+can be seen that the valuable gain in explanability was not reached at the
+expense of accuracy which was even slightly increased with respect to that
+(88.2) of the best method of the literature. These promising and interpretable
+results also encourage urologists to put their trust in AI-based solutions.
 
-摘要：檢索增強生成 (RAG) 已證明在透過外部知識整合增強大型語言模型 (LLM) 方面取得顯著成功，但其應用主要集中在文字內容上，而豐富的多模態影片知識領域則鮮少被探索。本文介紹 VideoRAG，這是第一個檢索增強生成架構，專門設計用於處理和理解極長語境的影片。我們的核心創新在於其雙通道架構，它無縫整合 (i) 基於圖形文字知識基礎，用於擷取跨影片語義關係，以及 (ii) 多模態語境編碼，用於有效保留視覺特徵。這個新穎的設計讓 VideoRAG 能夠透過建構跨越多個影片的精確知識圖譜來處理長度不限的影片，同時透過專門的多模態檢索範例來維持語義依賴性。透過我們提出的 LongerVideos 基準的全面經驗評估，該基準包含超過 160 部影片，總時數超過 134 小時，涵蓋演講、紀錄片和娛樂類別，VideoRAG 與現有的 RAG 替代方案和長影片理解方法相比，展現出顯著的效能。VideoRAG 實作的原始碼和基準資料集已公開於：https://github.com/HKUDS/VideoRAG。
+摘要：尿路鏡檢查中腎結石類型的體內識別將是泌尿科的一項重大進展，因為它可以減少繁瑣的腎結石取出過程的時間，同時降低感染風險。此外，這種自動化程序將使立即開立抗復發治療成為可能。如今，只有少數經驗豐富的泌尿科醫生能夠在內視鏡檢查期間屏幕上顯示的視頻圖像中識別腎結石類型。因此，最近已提出多種深度學習 (DL) 模型，以使用輸尿管鏡圖像自動識別腎結石類型。然而，這些 DL 模型本質上是黑盒子，這限制了它們在臨床環境中的應用性。本文提出了一個基於案例推理的 DL 模型，它使用原型部分 (PP) 並生成局部和全局描述符。PP 為每種類型（即腎結石類型）編碼視覺特徵信息（色調、飽和度、強度和紋理），類似於生物學家使用的信息。由於在模型訓練期間使用的新損失函數，PP 得到了最佳生成。此外，PP 的局部和全局描述符允許以生物學家和泌尿科醫生可以理解的方式解釋決策（“什麼”信息，“圖像中的什麼位置”）。所提出的 DL 模型已在一個包含六種最廣泛的腎結石類型圖像的數據庫上進行了測試。總體平均分類準確率為 90.37。將此結果與腎結石最先進的八個其他 DL 模型的結果進行比較時，可以看出，可解釋性的寶貴增益並未以準確性為代價，甚至略有增加與文獻中最好的方法 (88.2) 相比。這些有希望且可解釋的結果也鼓勵泌尿科醫生相信基於人工智能的解決方案。
 
-##### **Transformers trained on proteins can learn to attend to Euclidean distance**
-2502.01533v1 by Isaac Ellmen, Constantin Schneider, Matthew I. J. Raybould, Charlotte M. Deane
+##### **Towards Interpretable End-Stage Renal Disease (ESRD) Prediction: Utilizing Administrative Claims Data with Explainable AI Techniques**
+2409.12087v3 by Yubo Li, Saba Al-Sayouri, Rema Padman
+
+This study explores the potential of utilizing administrative claims data,
+combined with advanced machine learning and deep learning techniques, to
+predict the progression of Chronic Kidney Disease (CKD) to End-Stage Renal
+Disease (ESRD). We analyze a comprehensive, 10-year dataset provided by a major
+health insurance organization to develop prediction models for multiple
+observation windows using traditional machine learning methods such as Random
+Forest and XGBoost as well as deep learning approaches such as Long Short-Term
+Memory (LSTM) networks. Our findings demonstrate that the LSTM model,
+particularly with a 24-month observation window, exhibits superior performance
+in predicting ESRD progression, outperforming existing models in the
+literature. We further apply SHapley Additive exPlanations (SHAP) analysis to
+enhance interpretability, providing insights into the impact of individual
+features on predictions at the individual patient level. This study underscores
+the value of leveraging administrative claims data for CKD management and
+predicting ESRD progression.
+
+摘要：本研究探討利用行政申報資料，結合先進機器學習與深度學習技術，預測慢性腎臟病 (CKD) 進展至末期腎臟疾病 (ESRD) 的可能性。我們分析一家大型健康保險組織提供的 10 年綜合資料集，使用傳統機器學習方法（例如隨機森林和 XGBoost）以及深度學習方法（例如長期短期記憶 (LSTM) 網路）開發多個觀察視窗的預測模型。我們的研究結果顯示，LSTM 模型（尤其是 24 個月觀察視窗）在預測 ESRD 進展方面表現優異，優於文獻中的現有模型。我們進一步應用 SHapley 可加性解釋 (SHAP) 分析以增強可解釋性，深入了解個別特徵對個別患者層級預測的影響。本研究強調了利用行政申報資料進行 CKD 管理和預測 ESRD 進展的價值。
+
+##### **Contextual Evaluation of Large Language Models for Classifying Tropical and Infectious Diseases**
+2409.09201v3 by Mercy Asiedu, Nenad Tomasev, Chintan Ghate, Tiya Tiyasirichokchai, Awa Dieng, Oluwatosin Akande, Geoffrey Siwo, Steve Adudans, Sylvanus Aitkins, Odianosen Ehiakhamen, Eric Ndombi, Katherine Heller
 
-While conventional Transformers generally operate on sequence data, they can
-be used in conjunction with structure models, typically SE(3)-invariant or
-equivariant graph neural networks (GNNs), for 3D applications such as protein
-structure modelling. These hybrids typically involve either (1)
-preprocessing/tokenizing structural features as input for Transformers or (2)
-taking Transformer embeddings and processing them within a structural
-representation. However, there is evidence that Transformers can learn to
-process structural information on their own, such as the AlphaFold3 structural
-diffusion model. In this work we show that Transformers can function
-independently as structure models when passed linear embeddings of coordinates.
-We first provide a theoretical explanation for how Transformers can learn to
-filter attention as a 3D Gaussian with learned variance. We then validate this
-theory using both simulated 3D points and in the context of masked token
-prediction for proteins. Finally, we show that pre-training protein Transformer
-encoders with structure improves performance on a downstream task, yielding
-better performance than custom structural models. Together, this work provides
-a basis for using standard Transformers as hybrid structure-language models.
+While large language models (LLMs) have shown promise for medical question
+answering, there is limited work focused on tropical and infectious
+disease-specific exploration. We build on an opensource tropical and infectious
+diseases (TRINDs) dataset, expanding it to include demographic and semantic
+clinical and consumer augmentations yielding 11000+ prompts. We evaluate LLM
+performance on these, comparing generalist and medical LLMs, as well as LLM
+outcomes to human experts. We demonstrate through systematic experimentation,
+the benefit of contextual information such as demographics, location, gender,
+risk factors for optimal LLM response. Finally we develop a prototype of
+TRINDs-LM, a research tool that provides a playground to navigate how context
+impacts LLM outputs for health.
 
-摘要：雖然傳統的 Transformer 通常處理序列資料，但它們可用於結構模型，通常是 SE(3) 不變式或等變式圖神經網路 (GNN)，用於蛋白質結構建模等 3D 應用。這些混合模型通常包含 (1) 將結構特徵預處理/標記化為 Transformer 的輸入或 (2) 取用 Transformer 嵌入並在結構表示中處理它們。然而，有證據表明 Transformer 可以自行學習處理結構資訊，例如 AlphaFold3 結構擴散模型。在這項工作中，我們展示了 Transformer 在傳遞座標的線性嵌入時，可以獨立作為結構模型運作。我們首先提供了 Transformer 如何學習將注意力濾波為具有學習變異的 3D 高斯的理論解釋。然後我們使用模擬 3D 點和在蛋白質遮罩標記預測的背景下驗證此理論。最後，我們展示了使用結構預訓練蛋白質 Transformer 編碼器會改善下游任務的效能，產生比自訂結構模型更好的效能。綜合來說，這項工作提供了使用標準 Transformer 作為混合結構語言模型的基礎。
+摘要：儘管大型語言模型 (LLM) 在醫療問題解答方面展現出前景，但專注於熱帶和傳染病特定探索的研究有限。我們建立在一個開放原始碼熱帶和傳染病 (TRINDs) 資料集上，並將其擴展為納入人口統計和語義臨床和消費者擴充，產生超過 11000 個提示。我們評估了 LLM 在這些方面的效能，比較了通才和醫療 LLM，以及 LLM 結果與人類專家的比較。我們透過系統性實驗證明了背景資訊（例如人口統計、位置、性別、最佳 LLM 回應的風險因素）的好處。最後，我們開發了 TRINDs-LM 的原型，這是一個研究工具，提供一個探索背景如何影響 LLM 健康輸出的平台。
 
-##### **Common Foundations for SHACL, ShEx, and PG-Schema**
-2502.01295v1 by S. Ahmetaj, I. Boneva, J. Hidders, K. Hose, M. Jakubowski, J. E. Labra-Gayo, W. Martens, F. Mogavero, F. Murlak, C. Okulmus, A. Polleres, O. Savkovic, M. Simkus, D. Tomaszuk
+##### **Explainable AI: Definition and attributes of a good explanation for health AI**
+2409.15338v1 by Evangelia Kyrimi, Scott McLachlan, Jared M Wohlgemut, Zane B Perkins, David A. Lagnado, William Marsh, the ExAIDSS Expert Group
 
-Graphs have emerged as an important foundation for a variety of applications,
-including capturing and reasoning over factual knowledge, semantic data
-integration, social networks, and providing factual knowledge for machine
-learning algorithms. To formalise certain properties of the data and to ensure
-data quality, there is a need to describe the schema of such graphs. Because of
-the breadth of applications and availability of different data models, such as
-RDF and property graphs, both the Semantic Web and the database community have
-independently developed graph schema languages: SHACL, ShEx, and PG-Schema.
-Each language has its unique approach to defining constraints and validating
-graph data, leaving potential users in the dark about their commonalities and
-differences. In this paper, we provide formal, concise definitions of the core
-components of each of these schema languages. We employ a uniform framework to
-facilitate a comprehensive comparison between the languages and identify a
-common set of functionalities, shedding light on both overlapping and
-distinctive features of the three languages.
+Proposals of artificial intelligence (AI) solutions based on increasingly
+complex and accurate predictive models are becoming ubiquitous across many
+disciplines. As the complexity of these models grows, transparency and users'
+understanding often diminish. This suggests that accurate prediction alone is
+insufficient for making an AI-based solution truly useful. In the development
+of healthcare systems, this introduces new issues related to accountability and
+safety. Understanding how and why an AI system makes a recommendation may
+require complex explanations of its inner workings and reasoning processes.
+Although research on explainable AI (XAI) has significantly increased in recent
+years and there is high demand for XAI in medicine, defining what constitutes a
+good explanation remains ad hoc, and providing adequate explanations continues
+to be challenging. To fully realize the potential of AI, it is critical to
+address two fundamental questions about explanations for safety-critical AI
+applications, such as health-AI: (1) What is an explanation in health-AI? and
+(2) What are the attributes of a good explanation in health-AI? In this study,
+we examined published literature and gathered expert opinions through a
+two-round Delphi study. The research outputs include (1) a definition of what
+constitutes an explanation in health-AI and (2) a comprehensive list of
+attributes that characterize a good explanation in health-AI.
 
-摘要：圖表已成為各種應用的重要基礎，包括擷取和推理事實知識、語義資料整合、社群網路，以及為機器學習演算法提供事實知識。為了形式化資料的特定屬性並確保資料品質，有必要描述此類圖表的架構。由於應用範圍廣泛且有不同的資料模型可用，例如 RDF 和屬性圖表，因此語義網路和資料庫社群已獨立開發圖表架構語言：SHACL、ShEx 和 PG-Schema。每種語言都有其定義約束和驗證圖表資料的獨特方法，讓潛在使用者不清楚它們的共性和差異。在本文中，我們提供這些架構語言中每個核心元件的正式簡潔定義。我們採用統一的框架來促進語言之間的全面比較，並找出功能的共同集合，說明這三種語言的重疊和獨特功能。
+摘要：隨著越來越複雜且準確的預測模型，基於人工智慧 (AI) 解決方案的提案在許多領域中變得無處不在。隨著這些模型複雜性的增加，透明度和使用者的理解力往往會降低。這表示僅有準確的預測並不足以讓 AI 解決方案真正有用。在醫療保健系統的開發中，這引入了與問責制和安全性相關的新問題。瞭解 AI 系統如何以及為何提出建議可能需要對其內部運作和推理過程進行複雜的說明。儘管近年來對可解釋 AI (XAI) 的研究已大幅增加，且醫學領域對 XAI 有很高的需求，但定義什麼構成一個好的解釋仍是臨時性的，而提供適當的解釋仍然具有挑戰性。為了充分發揮 AI 的潛力，對於安全關鍵型 AI 應用（例如健康 AI）的解釋，探討兩個基本問題至關重要：(1) 什麼是健康 AI 中的解釋？以及 (2) 健康 AI 中一個好的解釋有哪些屬性？在本研究中，我們檢視了已發表的文獻，並透過兩輪德爾菲研究收集了專家意見。研究成果包括：(1) 健康 AI 中什麼構成解釋的定義，以及 (2) 健康 AI 中一個好解釋的屬性清單。
 
-##### **GFM-RAG: Graph Foundation Model for Retrieval Augmented Generation**
-2502.01113v1 by Linhao Luo, Zicheng Zhao, Gholamreza Haffari, Dinh Phung, Chen Gong, Shirui Pan
+##### **Exploring the Effect of Explanation Content and Format on User Comprehension and Trust**
+2408.17401v1 by Antonio Rago, Bence Palfi, Purin Sukpanichnant, Hannibal Nabli, Kavyesh Vivek, Olga Kostopoulou, James Kinross, Francesca Toni
 
-Retrieval-augmented generation (RAG) has proven effective in integrating
-knowledge into large language models (LLMs). However, conventional RAGs
-struggle to capture complex relationships between pieces of knowledge, limiting
-their performance in intricate reasoning that requires integrating knowledge
-from multiple sources. Recently, graph-enhanced retrieval augmented generation
-(GraphRAG) builds graph structure to explicitly model these relationships,
-enabling more effective and efficient retrievers. Nevertheless, its performance
-is still hindered by the noise and incompleteness within the graph structure.
-To address this, we introduce GFM-RAG, a novel graph foundation model (GFM) for
-retrieval augmented generation. GFM-RAG is powered by an innovative graph
-neural network that reasons over graph structure to capture complex
-query-knowledge relationships. The GFM with 8M parameters undergoes a two-stage
-training process on large-scale datasets, comprising 60 knowledge graphs with
-over 14M triples and 700k documents. This results in impressive performance and
-generalizability for GFM-RAG, making it the first graph foundation model
-applicable to unseen datasets for retrieval without any fine-tuning required.
-Extensive experiments on three multi-hop QA datasets and seven domain-specific
-RAG datasets demonstrate that GFM-RAG achieves state-of-the-art performance
-while maintaining efficiency and alignment with neural scaling laws,
-highlighting its potential for further improvement.
+In recent years, various methods have been introduced for explaining the
+outputs of "black-box" AI models. However, it is not well understood whether
+users actually comprehend and trust these explanations. In this paper, we focus
+on explanations for a regression tool for assessing cancer risk and examine the
+effect of the explanations' content and format on the user-centric metrics of
+comprehension and trust. Regarding content, we experiment with two explanation
+methods: the popular SHAP, based on game-theoretic notions and thus potentially
+complex for everyday users to comprehend, and occlusion-1, based on feature
+occlusion which may be more comprehensible. Regarding format, we present SHAP
+explanations as charts (SC), as is conventional, and occlusion-1 explanations
+as charts (OC) as well as text (OT), to which their simpler nature also lends
+itself. The experiments amount to user studies questioning participants, with
+two different levels of expertise (the general population and those with some
+medical training), on their subjective and objective comprehension of and trust
+in explanations for the outputs of the regression tool. In both studies we
+found a clear preference in terms of subjective comprehension and trust for
+occlusion-1 over SHAP explanations in general, when comparing based on content.
+However, direct comparisons of explanations when controlling for format only
+revealed evidence for OT over SC explanations in most cases, suggesting that
+the dominance of occlusion-1 over SHAP explanations may be driven by a
+preference for text over charts as explanations. Finally, we found no evidence
+of a difference between the explanation types in terms of objective
+comprehension. Thus overall, the choice of the content and format of
+explanations needs careful attention, since in some contexts format, rather
+than content, may play the critical role in improving user experience.
 
-摘要：檢索增強生成 (RAG) 已證明在整合知識到大語言模型 (LLM) 中有效。然而，傳統的 RAG 難以捕捉知識片段之間的複雜關係，限制了它們在需要整合來自多個來源的知識的複雜推理中的表現。最近，圖表增強檢索增強生成 (GraphRAG) 建立圖表結構來明確建模這些關係，從而實現更有效率的檢索器。儘管如此，其效能仍受到圖表結構中雜訊和不完整性的阻礙。為了解決這個問題，我們引入了 GFM-RAG，一種用於檢索增強生成的全新圖表基礎模型 (GFM)。GFM-RAG 由一個創新的圖神經網路驅動，該網路在圖表結構上進行推理以捕捉複雜的查詢知識關係。具有 8M 參數的 GFM 在大型資料集上進行兩階段訓練流程，包括 60 個包含超過 14M 個三元組和 700k 個文件的文件。這為 GFM-RAG 帶來了令人印象深刻的效能和通用性，使其成為第一個適用於未見過資料集的圖表基礎模型，而無需任何微調。在三個多跳問答資料集和七個特定領域 RAG 資料集上的廣泛實驗表明，GFM-RAG 達到了最先進的效能，同時保持了效率並與神經擴充定律保持一致，突顯了其進一步改進的潛力。
+摘要：<paragraph>近年來，已經引進各種方法來解釋「黑箱」AI 模型的輸出。然而，目前並不清楚使用者是否實際理解和信任這些解釋。在本文中，我們專注於評估癌症風險的回歸工具的解釋，並探討解釋的內容和格式對以使用者為中心的理解和信任指標的影響。關於內容，我們實驗了兩種解釋方法：流行的 SHAP，基於博弈論概念，因此對於日常使用者來說可能很複雜，以及基於特徵遮蔽的 occlusion-1，可能更易於理解。關於格式，我們將 SHAP 解釋呈現為圖表 (SC)，這是慣例，而將 occlusion-1 解釋呈現為圖表 (OC) 以及文字 (OT)，其較為簡單的性質也適用於此。這些實驗等同於使用者研究，詢問參與者，具有兩種不同程度的專業知識（一般民眾和具備一些醫學訓練的人），他們對回歸工具輸出解釋的主觀和客觀理解和信任。在兩項研究中，我們發現，在基於內容進行比較時，一般來說，occlusion-1 優於 SHAP 解釋，在主觀理解和信任方面有明顯的偏好。然而，在僅控制格式的情況下直接比較解釋，在大多數情況下只顯示 OT 優於 SC 解釋的證據，這表明 occlusion-1 優於 SHAP 解釋的主導地位可能是由偏好文字而非圖表作為解釋所驅動的。最後，我們沒有發現解釋類型在客觀理解方面的差異證據。因此，總體而言，對解釋的內容和格式的選擇需要仔細注意，因為在某些情況下，格式而非內容，可能在改善使用者體驗方面發揮關鍵作用。</paragraph>
 
-##### **Knowledge Synthesis of Photosynthesis Research Using a Large Language Model**
-2502.01059v1 by Seungri Yoon, Woosang Jeon, Sanghyeok Choi, Taehyeong Kim, Tae In Ahn
+##### **A Survey for Large Language Models in Biomedicine**
+2409.00133v1 by Chong Wang, Mengyao Li, Junjun He, Zhongruo Wang, Erfan Darzi, Zan Chen, Jin Ye, Tianbin Li, Yanzhou Su, Jing Ke, Kaili Qu, Shuxin Li, Yi Yu, Pietro Liò, Tianyun Wang, Yu Guang Wang, Yiqing Shen
 
-The development of biological data analysis tools and large language models
-(LLMs) has opened up new possibilities for utilizing AI in plant science
-research, with the potential to contribute significantly to knowledge
-integration and research gap identification. Nonetheless, current LLMs struggle
-to handle complex biological data and theoretical models in photosynthesis
-research and often fail to provide accurate scientific contexts. Therefore,
-this study proposed a photosynthesis research assistant (PRAG) based on
-OpenAI's GPT-4o with retrieval-augmented generation (RAG) techniques and prompt
-optimization. Vector databases and an automated feedback loop were used in the
-prompt optimization process to enhance the accuracy and relevance of the
-responses to photosynthesis-related queries. PRAG showed an average improvement
-of 8.7% across five metrics related to scientific writing, with a 25.4%
-increase in source transparency. Additionally, its scientific depth and domain
-coverage were comparable to those of photosynthesis research papers. A
-knowledge graph was used to structure PRAG's responses with papers within and
-outside the database, which allowed PRAG to match key entities with 63% and
-39.5% of the database and test papers, respectively. PRAG can be applied for
-photosynthesis research and broader plant science domains, paving the way for
-more in-depth data analysis and predictive capabilities.
+Recent breakthroughs in large language models (LLMs) offer unprecedented
+natural language understanding and generation capabilities. However, existing
+surveys on LLMs in biomedicine often focus on specific applications or model
+architectures, lacking a comprehensive analysis that integrates the latest
+advancements across various biomedical domains. This review, based on an
+analysis of 484 publications sourced from databases including PubMed, Web of
+Science, and arXiv, provides an in-depth examination of the current landscape,
+applications, challenges, and prospects of LLMs in biomedicine, distinguishing
+itself by focusing on the practical implications of these models in real-world
+biomedical contexts. Firstly, we explore the capabilities of LLMs in zero-shot
+learning across a broad spectrum of biomedical tasks, including diagnostic
+assistance, drug discovery, and personalized medicine, among others, with
+insights drawn from 137 key studies. Then, we discuss adaptation strategies of
+LLMs, including fine-tuning methods for both uni-modal and multi-modal LLMs to
+enhance their performance in specialized biomedical contexts where zero-shot
+fails to achieve, such as medical question answering and efficient processing
+of biomedical literature. Finally, we discuss the challenges that LLMs face in
+the biomedicine domain including data privacy concerns, limited model
+interpretability, issues with dataset quality, and ethics due to the sensitive
+nature of biomedical data, the need for highly reliable model outputs, and the
+ethical implications of deploying AI in healthcare. To address these
+challenges, we also identify future research directions of LLM in biomedicine
+including federated learning methods to preserve data privacy and integrating
+explainable AI methodologies to enhance the transparency of LLMs.
 
-摘要：生物資料分析工具和大型語言模型 (LLM) 的發展，為利用人工智慧於植物科學研究開啟了新的可能性，並有潛力對知識整合和研究差距的識別做出重大貢獻。儘管如此，目前的 LLM 在處理光合作用研究中的複雜生物資料和理論模型時仍有困難，而且常常無法提供準確的科學背景。因此，本研究提出了一個基於 OpenAI 的 GPT-4o、具備檢索增強生成 (RAG) 技術和提示最佳化的光合作用研究助理 (PRAG)。在提示最佳化過程中，使用了向量資料庫和自動回饋迴路，以增強對與光合作用相關查詢的回應的準確性和相關性。PRAG 在與科學寫作相關的五項指標中顯示出平均改善了 8.7%，來源透明度增加了 25.4%。此外，其科學深度和領域涵蓋範圍與光合作用研究論文相當。知識圖譜用於建構 PRAG 的回應，其中包含資料庫內外論文，這使得 PRAG 能夠分別與資料庫和測試論文中的 63% 和 39.5% 的關鍵實體相匹配。PRAG 可應用於光合作用研究和更廣泛的植物科學領域，為更深入的資料分析和預測能力鋪路。
+摘要：大型語言模型 (LLM) 的最新突破提供了前所未有的自然語言理解和生成能力。然而，現有關於生物醫學中 LLM 的調查通常專注於特定應用或模型架構，缺乏整合各種生物醫學領域最新進展的全面分析。本綜述基於對來自 PubMed、Web of Science 和 arXiv 等數據庫的 484 篇出版物的分析，深入探討了生物醫學中 LLM 的當前現況、應用、挑戰和前景，其特點是關注這些模型在現實世界生物醫學背景中的實際應用。首先，我們探討了 LLM 在廣泛的生物醫學任務中的零次學習能力，包括診斷輔助、藥物發現和個性化醫療等，並從 137 項關鍵研究中汲取見解。然後，我們討論了 LLM 的適應策略，包括單模態和多模態 LLM 的微調方法，以增強它們在零次學習無法實現的專業生物醫學背景中的性能，例如醫療問題解答和生物醫學文獻的有效處理。最後，我們討論了 LLM 在生物醫學領域面臨的挑戰，包括數據隱私問題、模型可解釋性有限、數據集質量問題以及由於生物醫學數據的敏感性、對高度可靠模型輸出的需求以及在醫療保健中部署 AI 的倫理影響而產生的倫理問題。為了應對這些挑戰，我們還確定了生物醫學中 LLM 未來的研究方向，包括用於保護數據隱私的聯合學習方法以及整合可解釋 AI 方法以增強 LLM 的透明度。
 
-##### **Encrypted Large Model Inference: The Equivariant Encryption Paradigm**
-2502.01013v1 by James Buban, Hongyang Zhang, Claudio Angione, Harry Yang, Ahmad Farhan, Seyfal Sultanov, Michael Du, Xuran Ma, Zihao Wang, Yue Zhao, Arria Owlia, Fielding Johnston, Patrick Colangelo
+##### **Aligning XAI with EU Regulations for Smart Biomedical Devices: A Methodology for Compliance Analysis**
+2408.15121v1 by Francesco Sovrano, Michael Lognoul, Giulia Vilone
 
-Large scale deep learning model, such as modern language models and diffusion
-architectures, have revolutionized applications ranging from natural language
-processing to computer vision. However, their deployment in distributed or
-decentralized environments raises significant privacy concerns, as sensitive
-data may be exposed during inference. Traditional techniques like secure
-multi-party computation, homomorphic encryption, and differential privacy offer
-partial remedies but often incur substantial computational overhead, latency
-penalties, or limited compatibility with non-linear network operations. In this
-work, we introduce Equivariant Encryption (EE), a novel paradigm designed to
-enable secure, "blind" inference on encrypted data with near zero performance
-overhead. Unlike fully homomorphic approaches that encrypt the entire
-computational graph, EE selectively obfuscates critical internal
-representations within neural network layers while preserving the exact
-functionality of both linear and a prescribed set of non-linear operations.
-This targeted encryption ensures that raw inputs, intermediate activations, and
-outputs remain confidential, even when processed on untrusted infrastructure.
-We detail the theoretical foundations of EE, compare its performance and
-integration complexity against conventional privacy preserving techniques, and
-demonstrate its applicability across a range of architectures, from
-convolutional networks to large language models. Furthermore, our work provides
-a comprehensive threat analysis, outlining potential attack vectors and
-baseline strategies, and benchmarks EE against standard inference pipelines in
-decentralized settings. The results confirm that EE maintains high fidelity and
-throughput, effectively bridging the gap between robust data confidentiality
-and the stringent efficiency requirements of modern, large scale model
-inference.
+Significant investment and development have gone into integrating Artificial
+Intelligence (AI) in medical and healthcare applications, leading to advanced
+control systems in medical technology. However, the opacity of AI systems
+raises concerns about essential characteristics needed in such sensitive
+applications, like transparency and trustworthiness. Our study addresses these
+concerns by investigating a process for selecting the most adequate Explainable
+AI (XAI) methods to comply with the explanation requirements of key EU
+regulations in the context of smart bioelectronics for medical devices. The
+adopted methodology starts with categorising smart devices by their control
+mechanisms (open-loop, closed-loop, and semi-closed-loop systems) and delving
+into their technology. Then, we analyse these regulations to define their
+explainability requirements for the various devices and related goals.
+Simultaneously, we classify XAI methods by their explanatory objectives. This
+allows for matching legal explainability requirements with XAI explanatory
+goals and determining the suitable XAI algorithms for achieving them. Our
+findings provide a nuanced understanding of which XAI algorithms align better
+with EU regulations for different types of medical devices. We demonstrate this
+through practical case studies on different neural implants, from chronic
+disease management to advanced prosthetics. This study fills a crucial gap in
+aligning XAI applications in bioelectronics with stringent provisions of EU
+regulations. It provides a practical framework for developers and researchers,
+ensuring their AI innovations advance healthcare technology and adhere to legal
+and ethical standards.
 
-摘要：大型深度學習模型，例如現代語言模型和擴散架構，徹底改變了從自然語言處理到電腦視覺等各種應用。然而，它們在分散式或分散式環境中的部署引發了重大的隱私問題，因為敏感數據可能會在推理過程中遭到揭露。安全多方計算、同態加密和差分隱私等傳統技術提供了部分補救措施，但通常會產生大量的計算開銷、延遲處罰，或與非線性網路操作相容性有限。在這項工作中，我們引入了等變加密 (EE)，這是一種新穎的範例，旨在以接近零效能開銷對加密數據進行安全、「盲目」推理。與加密整個計算圖形的完全同態方法不同，EE 有選擇性地混淆神經網路層內的關鍵內部表示，同時保留線性和規定的一組非線性操作的精確功能。這種有針對性的加密確保了原始輸入、中間激活和輸出保持機密，即使在不受信任的基礎設施上處理也是如此。我們詳細說明了 EE 的理論基礎，比較了其效能和整合複雜度與傳統的隱私保護技術，並展示了其在從卷積網路到大語言模型等各種架構中的適用性。此外，我們的研究提供了全面的威脅分析，概述了潛在的攻擊媒介和基準策略，並在分散式設定中將 EE 與標準推理管道進行比較。結果證實，EE 保持了高保真度和高傳輸量，有效地彌合了強大的數據機密性與現代化、大規模模型推理的嚴格效率要求之間的差距。
+摘要：人工智慧（AI）在醫療和保健應用中投入了大量的投資和開發，進而導致醫療技術中的先進控制系統。然而，AI 系統的不透明性引發了對此類敏感應用中所需基本特性的擔憂，例如透明度和可信度。我們的研究透過調查一個程序來解決這些問題，用於選擇最充分的可解釋 AI（XAI）方法，以符合歐盟法規在醫療器材的智慧型生物電子學中的說明要求。採用的方法從透過其控制機制（開迴路、閉迴路和半閉迴路系統）對智慧型裝置進行分類，並深入探討其技術開始。然後，我們分析這些法規以定義其對各種裝置和相關目標的可解釋性要求。同時，我們透過其說明目標對 XAI 方法進行分類。這允許將法律可解釋性要求與 XAI 說明目標相匹配，並確定適當的 XAI 演算法來達成它們。我們的研究結果提供了對哪些 XAI 演算法更符合歐盟法規以適用於不同類型的醫療器材的細緻理解。我們透過不同神經植入物的實際案例研究來證明這一點，從慢性疾病管理到先進的義肢。這項研究填補了將生物電子學中的 XAI 應用與歐盟法規的嚴格規定相符的重要空白。它為開發人員和研究人員提供了一個實用的架構，確保其 AI 創新能促進醫療技術並遵守法律和道德標準。
 
-##### **Metastable Dynamics of Chain-of-Thought Reasoning: Provable Benefits of Search, RL and Distillation**
-2502.01694v1 by Juno Kim, Denny Wu, Jason Lee, Taiji Suzuki
+##### **Towards Case-based Interpretability for Medical Federated Learning**
+2408.13626v1 by Laura Latorre, Liliana Petrychenko, Regina Beets-Tan, Taisiya Kopytova, Wilson Silva
 
-A key paradigm to improve the reasoning capabilities of large language models
-(LLMs) is to allocate more inference-time compute to search against a verifier
-or reward model. This process can then be utilized to refine the pretrained
-model or distill its reasoning patterns into more efficient models. In this
-paper, we study inference-time compute by viewing chain-of-thought (CoT)
-generation as a metastable Markov process: easy reasoning steps (e.g.,
-algebraic manipulations) form densely connected clusters, while hard reasoning
-steps (e.g., applying a relevant theorem) create sparse, low-probability edges
-between clusters, leading to phase transitions at longer timescales. Under this
-framework, we prove that implementing a search protocol that rewards sparse
-edges improves CoT by decreasing the expected number of steps to reach
-different clusters. In contrast, we establish a limit on reasoning capability
-when the model is restricted to local information of the pretrained graph. We
-also show that the information gained by search can be utilized to obtain a
-better reasoning model: (1) the pretrained model can be directly finetuned to
-favor sparse edges via policy gradient methods, and moreover (2) a compressed
-metastable representation of the reasoning dynamics can be distilled into a
-smaller, more efficient model.
+We explore deep generative models to generate case-based explanations in a
+medical federated learning setting. Explaining AI model decisions through
+case-based interpretability is paramount to increasing trust and allowing
+widespread adoption of AI in clinical practice. However, medical AI training
+paradigms are shifting towards federated learning settings in order to comply
+with data protection regulations. In a federated scenario, past data is
+inaccessible to the current user. Thus, we use a deep generative model to
+generate synthetic examples that protect privacy and explain decisions. Our
+proof-of-concept focuses on pleural effusion diagnosis and uses publicly
+available Chest X-ray data.
 
-摘要：<paragraph>提升大型語言模型 (LLM) 推理能力的一個關鍵範例，是分配更多推論時間運算來搜尋驗證器或獎勵模型。此程序接著可用於改善預訓練模型或將其推理模式提煉到更有效率的模型中。在這篇論文中，我們透過將思維鏈 (CoT) 生成視為亞穩態馬可夫過程來研究推論時間運算：簡單的推理步驟（例如代數運算）形成密集連接的叢集，而困難的推理步驟（例如應用相關定理）則在叢集之間建立稀疏、低機率的邊緣，導致在較長時間尺度上產生相變。在此架構下，我們證明實作一種獎勵稀疏邊緣的搜尋協定，會透過減少到達不同叢集所需的預期步驟數來改善 CoT。相反地，當模型受限於預訓練圖形的局部資訊時，我們建立了推理能力的限制。我們也顯示搜尋所獲得的資訊可用於取得更好的推理模型：(1) 預訓練模型可以直接微調以透過策略梯度方法偏好稀疏邊緣，而且 (2) 推理動態的壓縮亞穩態表徵可以提煉到更小、更有效率的模型中。</paragraph>
+摘要：我們探索深度生成模型，在醫療聯邦學習設置中生成基於案例的說明。透過基於案例的可解釋性來解釋 AI 模型決策，對於增加信任並允許 AI 在臨床實務中廣泛採用至關重要。然而，醫療 AI 訓練範例正轉向聯邦學習設置，以符合資料保護法規。在聯邦情境中，過去的資料對目前的使用者而言是無法取得的。因此，我們使用深度生成模型來產生保護隱私和解釋決策的合成範例。我們的概念驗證著重於胸腔積液診斷，並使用公開可取得的胸部 X 光資料。
 
-##### **PhiP-G: Physics-Guided Text-to-3D Compositional Scene Generation**
-2502.00708v1 by Qixuan Li, Chao Wang, Zongjin He, Yan Peng
+##### **AI in radiological imaging of soft-tissue and bone tumours: a systematic review evaluating against CLAIM and FUTURE-AI guidelines**
+2408.12491v1 by Douwe J. Spaanderman, Matthew Marzetti, Xinyi Wan, Andrew F. Scarsbrook, Philip Robinson, Edwin H. G. Oei, Jacob J. Visser, Robert Hemke, Kirsten van Langevelde, David F. Hanff, Geert J. L. H. van Leenders, Cornelis Verhoef, Dirk J. Gruühagen, Wiro J. Niessen, Stefan Klein, Martijn P. A. Starmans
 
-Text-to-3D asset generation has achieved significant optimization under the
-supervision of 2D diffusion priors. However, when dealing with compositional
-scenes, existing methods encounter several challenges: 1). failure to ensure
-that composite scene layouts comply with physical laws; 2). difficulty in
-accurately capturing the assets and relationships described in complex scene
-descriptions; 3). limited autonomous asset generation capabilities among layout
-approaches leveraging large language models (LLMs). To avoid these compromises,
-we propose a novel framework for compositional scene generation, PhiP-G, which
-seamlessly integrates generation techniques with layout guidance based on a
-world model. Leveraging LLM-based agents, PhiP-G analyzes the complex scene
-description to generate a scene graph, and integrating a multimodal 2D
-generation agent and a 3D Gaussian generation method for targeted assets
-creation. For the stage of layout, PhiP-G employs a physical pool with adhesion
-capabilities and a visual supervision agent, forming a world model for layout
-prediction and planning. Extensive experiments demonstrate that PhiP-G
-significantly enhances the generation quality and physical rationality of the
-compositional scenes. Notably, PhiP-G attains state-of-the-art (SOTA)
-performance in CLIP scores, achieves parity with the leading methods in
-generation quality as measured by the T$^3$Bench, and improves efficiency by
-24x.
+Soft-tissue and bone tumours (STBT) are rare, diagnostically challenging
+lesions with variable clinical behaviours and treatment approaches. This
+systematic review provides an overview of Artificial Intelligence (AI) methods
+using radiological imaging for diagnosis and prognosis of these tumours,
+highlighting challenges in clinical translation, and evaluating study alignment
+with the Checklist for AI in Medical Imaging (CLAIM) and the FUTURE-AI
+international consensus guidelines for trustworthy and deployable AI to promote
+the clinical translation of AI methods. The review covered literature from
+several bibliographic databases, including papers published before 17/07/2024.
+Original research in peer-reviewed journals focused on radiology-based AI for
+diagnosing or prognosing primary STBT was included. Exclusion criteria were
+animal, cadaveric, or laboratory studies, and non-English papers. Abstracts
+were screened by two of three independent reviewers for eligibility. Eligible
+papers were assessed against guidelines by one of three independent reviewers.
+The search identified 15,015 abstracts, from which 325 articles were included
+for evaluation. Most studies performed moderately on CLAIM, averaging a score
+of 28.9$\pm$7.5 out of 53, but poorly on FUTURE-AI, averaging 5.1$\pm$2.1 out
+of 30. Imaging-AI tools for STBT remain at the proof-of-concept stage,
+indicating significant room for improvement. Future efforts by AI developers
+should focus on design (e.g. define unmet clinical need, intended clinical
+setting and how AI would be integrated in clinical workflow), development (e.g.
+build on previous work, explainability), evaluation (e.g. evaluating and
+addressing biases, evaluating AI against best practices), and data
+reproducibility and availability (making documented code and data publicly
+available). Following these recommendations could improve clinical translation
+of AI methods.
 
-摘要：<paragraph>在 2D 擴散先驗的監督下，文字轉 3D 資產生成已取得顯著的最佳化。然而，在處理合成場景時，現有方法會遇到幾個挑戰：1) 無法確保複合場景佈局符合物理定律；2) 難以準確捕捉複雜場景描述中所描述的資產和關係；3) 在利用大型語言模型 (LLM) 的佈局方法中，自主資產生成能力有限。為了避免這些折衷，我們提出了一個合成場景生成的新框架 PhiP-G，它將生成技術與基於世界模型的佈局指導無縫整合。利用基於 LLM 的代理，PhiP-G 分析複雜的場景描述以生成場景圖，並整合多模態 2D 生成代理和 3D 高斯生成方法以進行目標資產創建。對於佈局階段，PhiP-G 採用具有附著能力的物理池和視覺監督代理，形成用於佈局預測和規劃的世界模型。大量的實驗證明，PhiP-G 大幅提升了合成場景的生成品質和物理合理性。值得注意的是，PhiP-G 在 CLIP 分數中獲得了最先進 (SOTA) 的效能，在 T$^3$Bench 測量的生成品質中與領先的方法達到同等水準，並將效率提升了 24 倍。</paragraph>
+摘要：軟組織和骨骼腫瘤（STBT）是罕見、診斷具有挑戰性的病灶，其臨床行為和治療方法各不相同。這篇系統性回顧提供了使用放射影像進行診斷和預後的人工智慧 (AI) 方法的概觀，重點說明了臨床轉譯的挑戰，並評估研究與醫療影像 AI 核查表 (CLAIM) 和 FUTURE-AI 可信賴且可部署 AI 的國際共識準則的一致性，以促進 AI 方法的臨床轉譯。這篇回顧涵蓋了幾個書目資料庫中的文獻，包括在 2024 年 7 月 17 日之前發表的論文。納入了以放射為基礎的 AI 診斷或預後原發性 STBT 的同行評審期刊中的原始研究。排除標準是動物、屍體或實驗室研究，以及非英文論文。摘要由三位獨立審查員中的兩位篩選資格。合格的論文由三位獨立審查員中的一位根據準則進行評估。搜索識別出 15,015 篇摘要，其中 325 篇文章被納入評估。大多數研究在 CLAIM 中表現中等，平均得分為 53 分中的 28.9±7.5 分，但在 FUTURE-AI 中表現不佳，平均得分為 30 分中的 5.1±2.1 分。STBT 的影像 AI 工具仍處於概念驗證階段，表明有顯著的改進空間。AI 開發人員未來的努力應集中在設計（例如定義未滿足的臨床需求、預期的臨床環境以及 AI 如何整合到臨床工作流程中）、開發（例如建立在先前的工作、可解釋性）、評估（例如評估和解決偏差、評估 AI 與最佳實務）、以及數據可複製性和可用性（公開提供文件化的代碼和數據）。遵循這些建議可以改善 AI 方法的臨床轉譯。
 
-##### **A Survey of Quantized Graph Representation Learning: Connecting Graph Structures with Large Language Models**
-2502.00681v1 by Qika Lin, Zhen Peng, Kaize Shi, Kai He, Yiming Xu, Erik Cambria, Mengling Feng
+##### **Evaluating Explainable AI Methods in Deep Learning Models for Early Detection of Cerebral Palsy**
+2409.00001v1 by Kimji N. Pellano, Inga Strümke, Daniel Groos, Lars Adde, Espen Alexander F. Ihlen
 
-Recent years have witnessed rapid advances in graph representation learning,
-with the continuous embedding approach emerging as the dominant paradigm.
-However, such methods encounter issues regarding parameter efficiency,
-interpretability, and robustness. Thus, Quantized Graph Representation (QGR)
-learning has recently gained increasing interest, which represents the graph
-structure with discrete codes instead of conventional continuous embeddings.
-Given its analogous representation form to natural language, QGR also possesses
-the capability to seamlessly integrate graph structures with large language
-models (LLMs). As this emerging paradigm is still in its infancy yet holds
-significant promise, we undertake this thorough survey to promote its rapid
-future prosperity. We first present the background of the general quantization
-methods and their merits. Moreover, we provide an in-depth demonstration of
-current QGR studies from the perspectives of quantized strategies, training
-objectives, distinctive designs, knowledge graph quantization, and
-applications. We further explore the strategies for code dependence learning
-and integration with LLMs. At last, we give discussions and conclude future
-directions, aiming to provide a comprehensive picture of QGR and inspire future
-research.
+Early detection of Cerebral Palsy (CP) is crucial for effective intervention
+and monitoring. This paper tests the reliability and applicability of
+Explainable AI (XAI) methods using a deep learning method that predicts CP by
+analyzing skeletal data extracted from video recordings of infant movements.
+Specifically, we use XAI evaluation metrics -- namely faithfulness and
+stability -- to quantitatively assess the reliability of Class Activation
+Mapping (CAM) and Gradient-weighted Class Activation Mapping (Grad-CAM) in this
+specific medical application. We utilize a unique dataset of infant movements
+and apply skeleton data perturbations without distorting the original dynamics
+of the infant movements. Our CP prediction model utilizes an ensemble approach,
+so we evaluate the XAI metrics performances for both the overall ensemble and
+the individual models. Our findings indicate that both XAI methods effectively
+identify key body points influencing CP predictions and that the explanations
+are robust against minor data perturbations. Grad-CAM significantly outperforms
+CAM in the RISv metric, which measures stability in terms of velocity. In
+contrast, CAM performs better in the RISb metric, which relates to bone
+stability, and the RRS metric, which assesses internal representation
+robustness. Individual models within the ensemble show varied results, and
+neither CAM nor Grad-CAM consistently outperform the other, with the ensemble
+approach providing a representation of outcomes from its constituent models.
 
-摘要：近年来，图表示学习取得了快速进展，其中连续嵌入方法作为主导范式出现。然而，此类方法遇到了参数效率、可解释性和鲁棒性方面的问题。因此，量化图表示 (QGR) 学习最近引起了越来越多的兴趣，它使用离散代码而不是传统的连续嵌入来表示图结构。鉴于其与自然语言类似的表示形式，QGR 也具备将图结构与大型语言模型 (LLM) 无缝集成的能力。由于这种新兴范式仍处于起步阶段，但前景广阔，我们进行了这项全面调查以促进其快速未来的繁荣。我们首先介绍了通用量化方法的背景及其优点。此外，我们从量化策略、训练目标、独特设计、知识图谱量化和应用的角度对当前的 QGR 研究进行了深入的论证。我们进一步探索了代码依赖性学习和与 LLM 集成的策略。最后，我们给出了讨论并总结了未来的方向，旨在提供 QGR 的全面图景并激发未来的研究。
+摘要：腦性麻痺 (CP) 的早期偵測對於有效的介入和監測至關重要。本文測試了可解釋 AI (XAI) 方法的可靠性和適用性，使用深度學習方法，透過分析從嬰兒動作影片記錄中提取的骨骼資料來預測 CP。具體來說，我們使用 XAI 評估指標（即忠實度和穩定性）來量化評估類別激活映射 (CAM) 和梯度加權類別激活映射 (Grad-CAM) 在這個特定醫療應用中的可靠性。我們利用一個獨特的嬰兒動作資料集，並應用骨骼資料擾動，而不會扭曲嬰兒動作的原始動力。我們的 CP 預測模型利用整體方法，因此我們評估了整體整體和個別模型的 XAI 指標表現。我們的研究結果表明，兩種 XAI 方法都能有效識別影響 CP 預測的關鍵身體部位，並且這些解釋對於微小的資料擾動具有魯棒性。Grad-CAM 在 RISv 指標中顯著優於 CAM，該指標衡量速度方面的穩定性。相比之下，CAM 在 RISb 指標中表現得更好，該指標與骨骼穩定性有關，而 RRS 指標則評估內部表示的魯棒性。整體中的個別模型顯示出不同的結果，CAM 和 Grad-CAM 都不一致地優於另一種，整體方法提供了其組成模型結果的表示。
 
-##### **Challenges and Innovations in LLM-Powered Fake News Detection: A Synthesis of Approaches and Future Directions**
-2502.00339v1 by Jingyuan Yi, Zeqiu Xu, Tianyi Huang, Peiyang Yu
+##### **MicroXercise: A Micro-Level Comparative and Explainable System for Remote Physical Therapy**
+2408.11837v1 by Hanchen David Wang, Nibraas Khan, Anna Chen, Nilanjan Sarkar, Pamela Wisniewski, Meiyi Ma
 
-The pervasiveness of the dissemination of fake news through social media
-platforms poses critical risks to the trust of the general public, societal
-stability, and democratic institutions. This challenge calls for novel
-methodologies in detection, which can keep pace with the dynamic and
-multi-modal nature of misinformation. Recent works include powering the
-detection using large language model advances in multimodal frameworks,
-methodologies using graphs, and adversarial training in the literature of fake
-news. Based on the different approaches which can bring success, some key
-highlights will be underlined: enhanced LLM-improves accuracy through more
-advanced semantics and cross-modality fusion for robust detections. The review
-further identifies critical gaps in adaptability to dynamic social media
-trends, real-time, and cross-platform detection capabilities, as well as the
-ethical challenges thrown up by the misuse of LLMs. Future directions underline
-the development of style-agnostic models, cross-lingual detection frameworks,
-and robust policies with a view to mitigating LLM-driven misinformation. This
-synthesis thus lays a concrete foundation for those researchers and
-practitioners committed to reinforcing fake news detection systems with
-complications that keep on growing in the digital landscape.
+Recent global estimates suggest that as many as 2.41 billion individuals have
+health conditions that would benefit from rehabilitation services. Home-based
+Physical Therapy (PT) faces significant challenges in providing interactive
+feedback and meaningful observation for therapists and patients. To fill this
+gap, we present MicroXercise, which integrates micro-motion analysis with
+wearable sensors, providing therapists and patients with a comprehensive
+feedback interface, including video, text, and scores. Crucially, it employs
+multi-dimensional Dynamic Time Warping (DTW) and attribution-based explainable
+methods to analyze the existing deep learning neural networks in monitoring
+exercises, focusing on a high granularity of exercise. This synergistic
+approach is pivotal, providing output matching the input size to precisely
+highlight critical subtleties and movements in PT, thus transforming complex AI
+analysis into clear, actionable feedback. By highlighting these micro-motions
+in different metrics, such as stability and range of motion, MicroXercise
+significantly enhances the understanding and relevance of feedback for
+end-users. Comparative performance metrics underscore its effectiveness over
+traditional methods, such as a 39% and 42% improvement in Feature Mutual
+Information (FMI) and Continuity. MicroXercise is a step ahead in home-based
+physical therapy, providing a technologically advanced and intuitively helpful
+solution to enhance patient care and outcomes.
 
-摘要：社群媒體平台上假新聞散播的普遍性對一般大眾的信任、社會穩定性與民主制度構成重大風險。這項挑戰需要在偵測方面採用創新的方法論，才能跟上錯誤資訊的動態和多模態特性。最近的研究包括使用多模態架構中大型語言模型的進展、使用圖形的方法論，以及在假新聞文獻中進行對抗訓練來強化偵測。根據可以帶來成功的不同方法，將重點說明一些重點：增強的 LLM 可透過更進階的語意和跨模態融合來提升準確度，以進行穩健的偵測。這篇評論進一步找出在適應動態社群媒體趨勢、即時和跨平台偵測能力方面的重大差距，以及 LLM 遭濫用的道德挑戰。未來的方向強調開發與風格無關的模型、跨語言偵測架構和穩健的政策，以減輕 LLM 驅動的錯誤資訊。因此，這種綜合分析為那些致力於強化假新聞偵測系統的研究人員和從業人員奠定了具體的基礎，而這些複雜性在數位環境中持續增長。
+摘要：最近的全球估計表明，多達 24.1 億人有
+健康狀況可從復健服務中受益。居家
+物理治療 (PT) 在提供互動式
+回饋和有意義的觀察方面面臨重大挑戰，供治療師和患者使用。為了填補這
+個缺口，我們提出 MicroXercise，它將微動作分析與
+可穿戴式感測器整合在一起，為治療師和患者提供一個全面的
+回饋介面，包括影片、文字和分數。至關重要的是，它採用
+多維動態時間規整 (DTW) 和基於歸因的可解釋
+方法來分析監控運動中現有的深度學習神經網路，專注於運動的高粒度。這種協同
+方法至關重要，提供與輸入大小匹配的輸出，以精確地
+突出 PT 中關鍵的細微差別和動作，從而將複雜的 AI
+分析轉換為清晰、可操作的回饋。透過在不同指標中突顯這些微動作，例如穩定性和動作範圍，MicroXercise
+顯著提升最終使用者對回饋的理解和相關性。比較效能指標強調其優於
+傳統方法的有效性，例如特徵互惠資訊 (FMI) 和連續性分別提升了 39% 和 42%。MicroXercise 在居家
+物理治療方面更進一步，提供技術先進且直覺有用的
+解決方案，以提升患者照護和結果。
 
-##### **DEUCE: Dual-diversity Enhancement and Uncertainty-awareness for Cold-start Active Learning**
-2502.00305v1 by Jiaxin Guo, C. L. Philip Chen, Shuzhen Li, Tong Zhang
+##### **The Literature Review Network: An Explainable Artificial Intelligence for Systematic Literature Reviews, Meta-analyses, and Method Development**
+2408.05239v1 by Joshua Morriss, Tod Brindle, Jessica Bah Rösman, Daniel Reibsamen, Andreas Enz
 
-Cold-start active learning (CSAL) selects valuable instances from an
-unlabeled dataset for manual annotation. It provides high-quality data at a low
-annotation cost for label-scarce text classification. However, existing CSAL
-methods overlook weak classes and hard representative examples, resulting in
-biased learning. To address these issues, this paper proposes a novel
-dual-diversity enhancing and uncertainty-aware (DEUCE) framework for CSAL.
-Specifically, DEUCE leverages a pretrained language model (PLM) to efficiently
-extract textual representations, class predictions, and predictive uncertainty.
-Then, it constructs a Dual-Neighbor Graph (DNG) to combine information on both
-textual diversity and class diversity, ensuring a balanced data distribution.
-It further propagates uncertainty information via density-based clustering to
-select hard representative instances. DEUCE performs well in selecting
-class-balanced and hard representative data by dual-diversity and
-informativeness. Experiments on six NLP datasets demonstrate the superiority
-and efficiency of DEUCE.
+Systematic literature reviews are the highest quality of evidence in
+research. However, the review process is hindered by significant resource and
+data constraints. The Literature Review Network (LRN) is the first of its kind
+explainable AI platform adhering to PRISMA 2020 standards, designed to automate
+the entire literature review process. LRN was evaluated in the domain of
+surgical glove practices using 3 search strings developed by experts to query
+PubMed. A non-expert trained all LRN models. Performance was benchmarked
+against an expert manual review. Explainability and performance metrics
+assessed LRN's ability to replicate the experts' review. Concordance was
+measured with the Jaccard index and confusion matrices. Researchers were
+blinded to the other's results until study completion. Overlapping studies were
+integrated into an LRN-generated systematic review. LRN models demonstrated
+superior classification accuracy without expert training, achieving 84.78% and
+85.71% accuracy. The highest performance model achieved high interrater
+reliability (k = 0.4953) and explainability metrics, linking 'reduce',
+'accident', and 'sharp' with 'double-gloving'. Another LRN model covered 91.51%
+of the relevant literature despite diverging from the non-expert's judgments (k
+= 0.2174), with the terms 'latex', 'double' (gloves), and 'indication'. LRN
+outperformed the manual review (19,920 minutes over 11 months), reducing the
+entire process to 288.6 minutes over 5 days. This study demonstrates that
+explainable AI does not require expert training to successfully conduct
+PRISMA-compliant systematic literature reviews like an expert. LRN summarized
+the results of surgical glove studies and identified themes that were nearly
+identical to the clinical researchers' findings. Explainable AI can accurately
+expedite our understanding of clinical practices, potentially revolutionizing
+healthcare research.
 
-摘要：冷啟動主動學習 (CSAL) 從未標記的資料集中選取有價值的實例進行手動標記。它以低標記成本提供高品質的資料，用於標籤稀少的文字分類。然而，現有的 CSAL 方法忽略了弱類別和難以代表的範例，導致有偏差的學習。為了解決這些問題，本文提出了一個新的雙重多樣性增強和不確定性感知 (DEUCE) 架構，用於 CSAL。具體來說，DEUCE 利用預訓練的語言模型 (PLM) 來有效地提取文字表徵、類別預測和預測不確定性。然後，它構建一個雙鄰居圖 (DNG) 來結合文字多樣性和類別多樣性的資訊，確保平衡的資料分佈。它進一步通過基於密度的聚類來傳播不確定性資訊，以選擇難以代表的實例。DEUCE 在通過雙重多樣性和資訊性選擇類別平衡和難以代表的資料方面表現良好。在六個 NLP 資料集上的實驗證明了 DEUCE 的優越性和效率。
+摘要：系統性文獻回顧是研究中證據品質最高的。然而，回顧過程受到顯著資源和資料限制的阻礙。文獻回顧網路 (LRN) 是第一個遵循 PRISMA 2020 標準的可解釋 AI 平台，旨在自動化整個文獻回顧過程。LRN 在外科手套實務領域中進行評估，使用專家開發的 3 個搜尋字串來查詢 PubMed。非專家訓練所有 LRN 模型。效能以專家手動回顧作為基準。可解釋性和效能指標評估 LRN 複製專家回顧的能力。一致性以 Jaccard 指數和混淆矩陣測量。研究人員在研究完成前對彼此的結果保密。重疊的研究整合到 LRN 生成的系統性回顧中。LRN 模型在沒有專家訓練的情況下展現出優異的分類準確率，達到 84.78% 和 85.71% 的準確率。效能最高的模型達到了高評分者間信賴度 (k = 0.4953) 和可解釋性指標，將「減少」、「意外」和「銳利」與「雙重戴手套」連結在一起。另一個 LRN 模型涵蓋了 91.51% 的相關文獻，儘管與非專家的判斷不同 (k = 0.2174)，但包含了「乳膠」、「雙重」（手套）和「適應症」等詞彙。LRN 優於手動回顧（11 個月超過 19,920 分鐘），將整個過程縮短為 5 天超過 288.6 分鐘。這項研究顯示，可解釋的 AI 不需要專家訓練即可成功進行專家等級的 PRISMA 相容系統性文獻回顧。LRN 總結了外科手套研究的結果，並找出與臨床研究人員發現幾乎相同的主题。可解釋的 AI 可以準確地加快我們對臨床實務的理解，有潛力革新醫療保健研究。
 
-##### **Longer Attention Span: Increasing Transformer Context Length with Sparse Graph Processing Techniques**
-2502.01659v2 by Nathaniel Tomczak, Sanmukh Kuppannagari
+##### **Enhancing Medical Learning and Reasoning Systems: A Boxology-Based Comparative Analysis of Design Patterns**
+2408.02709v1 by Chi Him Ng
 
-Transformers have demonstrated great success in numerous domains including
-natural language processing and bioinformatics. This success stems from the use
-of the attention mechanism by these models in order to represent and propagate
-pairwise interactions between individual tokens of sequential data. However,
-the primary limitation of this operation is its quadratic memory and time
-complexity in relation to the input's context length - the length of a sequence
-over which the interactions need to be captured. This significantly limits the
-length of sequences that can be inferred upon by these models. Extensive
-research has been conducted to reduce the number of pairwise interactions to
-sub-quadratic in relation to the context length by introducing sparsity into
-the attention mechanism through the development of sparse attention masks.
-However, efficient implementations that achieve "true sparsity" are lacking.
-  In this work, we address this issue by proposing a graph computing view of
-attention where tokens are perceived as nodes of the graph and the attention
-mask determines the edges of the graph. Using this view, we develop graph
-processing algorithms to implement the attention mechanism. Both theoretically
-and empirically, we demonstrate that our algorithms only perform the needed
-computations, i.e., they are work optimal. We also perform extensive
-experimentation using popular attention masks to explore the impact of sparsity
-on execution time and achievable context length. Our experiments demonstrate
-significant speedups in execution times compared to state-of-the-art attention
-implementations such as FlashAttention for large sequence lengths. We also
-demonstrate that our algorithms are able to achieve extremely long sequence
-lengths of as high as 160 million on a single NVIDIA A100 GPU (SXM4 80GB).
+This study analyzes hybrid AI systems' design patterns and their
+effectiveness in clinical decision-making using the boxology framework. It
+categorizes and copares various architectures combining machine learning and
+rule-based reasoning to provide insights into their structural foundations and
+healthcare applications. Addressing two main questions, how to categorize these
+systems againts established design patterns and how to extract insights through
+comparative analysis, the study uses design patterns from software engineering
+to understand and optimize healthcare AI systems. Boxology helps identify
+commonalities and create reusable solutions, enhancing these systems'
+scalability, reliability, and performance. Five primary architectures are
+examined: REML, MLRB, RBML, RMLT, and PERML. Each has unique strengths and
+weaknesses, highlighting the need for tailored approaches in clinical tasks.
+REML excels in high-accuracy prediction for datasets with limited data; MLRB in
+handling large datasets and complex data integration; RBML in explainability
+and trustworthiness; RMLT in managing high-dimensional data; and PERML, though
+limited in analysis, shows promise in urgent care scenarios. The study
+introduces four new patterns, creates five abstract categorization patterns,
+and refines those five further to specific systems. These contributions enhance
+Boxlogy's taxonomical organization and offer novel approaches to integrating
+expert knowledge with machine learning. Boxology's structured, modular apporach
+offers significant advantages in developing and analyzing hybrid AI systems,
+revealing commonalities, and promoting reusable solutions. In conclusion, this
+study underscores hybrid AI systems' crucial role in advancing healthcare and
+Boxology's potential to drive further innovation in AI integration, ultimately
+improving clinical decision support and patient outcomes.
 
-摘要：變形金剛已在許多領域展現出巨大的成功，包括自然語言處理和生物資訊學。這種成功源自於這些模型使用注意機制來表示和傳播序列資料中各個標記之間成對的互動。然而，這種運算的主要限制在於其二次記憶體和時間複雜度與輸入的內容長度有關，也就是需要擷取互動的序列長度。這會顯著限制這些模型可以推論的序列長度。已經進行了大量的研究來減少成對互動的數量，使其與內容長度成次二次關係，方法是透過開發稀疏注意遮罩來將稀疏性引入注意機制。然而，缺乏能達成「真實稀疏性」的高效實作。在這項工作中，我們透過提出注意力的圖形運算檢視來解決這個問題，其中標記被視為圖形的節點，而注意力遮罩則決定圖形中的邊緣。使用這種檢視，我們開發了圖形處理演算法來實作注意力機制。我們在理論上和經驗上都證明了我們的演算法只執行必要的運算，也就是說，它們是工作最優的。我們也使用流行的注意力遮罩進行廣泛的實驗，以探討稀疏性對執行時間和可達成的內容長度的影響。我們的實驗證明，與最先進的注意力實作（例如 FlashAttention）相比，對於大型序列長度，我們的演算法在執行時間方面有顯著的加速。我們也證明了我們的演算法能夠在單一的 NVIDIA A100 GPU (SXM4 80GB) 上達成極長的序列長度，最高可達 1.6 億。
+摘要：本研究使用盒子學框架分析混合人工智慧系統的設計模式及其在臨床決策中的有效性。它分類並比較結合機器學習和基於規則的推理的各種架構，以深入了解其結構基礎和醫療保健應用。針對兩個主要問題，如何根據既定的設計模式對這些系統進行分類，以及如何通過比較分析提取見解，本研究使用軟體工程中的設計模式來了解和優化醫療保健人工智慧系統。盒子學有助於識別共性並建立可重複使用的解決方案，從而增強這些系統的可擴充性、可靠性和效能。檢查了五種主要的架構：REML、MLRB、RBML、RMLT 和 PERML。每種架構都有獨特的優缺點，強調了在臨床任務中需要量身打造的方法。REML 在資料有限的資料集中表現出高精度的預測；MLRB 在處理大型資料集和複雜資料整合方面表現出色；RBML 在可解釋性和可信度方面表現出色；RMLT 在管理高維資料方面表現出色；而 PERML 儘管在分析方面有限，但在緊急照護場景中表現出潛力。本研究引入了四種新模式，建立了五種抽象分類模式，並進一步將這五種模式細化為具體的系統。這些貢獻增強了盒子學的分類組織，並提供了將專家知識與機器學習整合的新方法。盒子學的結構化、模組化方法在開發和分析混合人工智慧系統、揭示共性以及推廣可重複使用的解決方案方面具有顯著優勢。總之，本研究強調了混合人工智慧系統在推進醫療保健中的關鍵作用，以及盒子學在推動人工智慧整合進一步創新方面的潛力，最終改善臨床決策支援和患者的治療成果。
 
-##### **Improving vision-language alignment with graph spiking hybrid Networks**
-2501.19069v1 by Siyu Zhang, Heming Zheng, Yiming Wu, Yeming Chen
+##### **Bayesian Kolmogorov Arnold Networks (Bayesian_KANs): A Probabilistic Approach to Enhance Accuracy and Interpretability**
+2408.02706v1 by Masoud Muhammed Hassan
 
-To bridge the semantic gap between vision and language (VL), it is necessary
-to develop a good alignment strategy, which includes handling semantic
-diversity, abstract representation of visual information, and generalization
-ability of models. Recent works use detector-based bounding boxes or patches
-with regular partitions to represent visual semantics. While current paradigms
-have made strides, they are still insufficient for fully capturing the nuanced
-contextual relations among various objects. This paper proposes a comprehensive
-visual semantic representation module, necessitating the utilization of
-panoptic segmentation to generate coherent fine-grained semantic features.
-Furthermore, we propose a novel Graph Spiking Hybrid Network (GSHN) that
-integrates the complementary advantages of Spiking Neural Networks (SNNs) and
-Graph Attention Networks (GATs) to encode visual semantic information.
-Intriguingly, the model not only encodes the discrete and continuous latent
-variables of instances but also adeptly captures both local and global
-contextual features, thereby significantly enhancing the richness and diversity
-of semantic representations. Leveraging the spatiotemporal properties inherent
-in SNNs, we employ contrastive learning (CL) to enhance the similarity-based
-representation of embeddings. This strategy alleviates the computational
-overhead of the model and enriches meaningful visual representations by
-constructing positive and negative sample pairs. We design an innovative
-pre-training method, Spiked Text Learning (STL), which uses text features to
-improve the encoding ability of discrete semantics. Experiments show that the
-proposed GSHN exhibits promising results on multiple VL downstream tasks.
+Because of its strong predictive skills, deep learning has emerged as an
+essential tool in many industries, including healthcare. Traditional deep
+learning models, on the other hand, frequently lack interpretability and omit
+to take prediction uncertainty into account two crucial components of clinical
+decision making. In order to produce explainable and uncertainty aware
+predictions, this study presents a novel framework called Bayesian Kolmogorov
+Arnold Networks (BKANs), which combines the expressive capacity of Kolmogorov
+Arnold Networks with Bayesian inference. We employ BKANs on two medical
+datasets, which are widely used benchmarks for assessing machine learning
+models in medical diagnostics: the Pima Indians Diabetes dataset and the
+Cleveland Heart Disease dataset. Our method provides useful insights into
+prediction confidence and decision boundaries and outperforms traditional deep
+learning models in terms of prediction accuracy. Moreover, BKANs' capacity to
+represent aleatoric and epistemic uncertainty guarantees doctors receive more
+solid and trustworthy decision support. Our Bayesian strategy improves the
+interpretability of the model and considerably minimises overfitting, which is
+important for tiny and imbalanced medical datasets, according to experimental
+results. We present possible expansions to further use BKANs in more
+complicated multimodal datasets and address the significance of these
+discoveries for future research in building reliable AI systems for healthcare.
+This work paves the way for a new paradigm in deep learning model deployment in
+vital sectors where transparency and reliability are crucial.
 
-摘要：<paragraph>為了彌合視覺和語言 (VL) 之間的語意差距，必須制定良好的對齊策略，其中包括處理語意多樣性、視覺資訊的抽象表示以及模型的泛化能力。最近的研究使用基於偵測器的邊界框或具有規則分割的區塊來表示視覺語意。雖然目前的範例已取得進展，但對於完全捕捉各種物件之間的細微脈絡關係仍不足夠。本文提出了一個全面的視覺語意表示模組，需要利用全景分割來產生連貫的細粒度語意特徵。此外，我們提出了一個新穎的圖形脈衝混合網路 (GSHN)，它整合了脈衝神經網路 (SNN) 和圖形注意力網路 (GAT) 的互補優勢來編碼視覺語意資訊。有趣的是，該模型不僅編碼實例的離散和連續潛在變數，還能巧妙地捕捉局部和全域脈絡特徵，從而顯著增強語意表示的豐富性和多樣性。利用 SNN 中固有的時空特性，我們採用對比學習 (CL) 來增強嵌入的基於相似性的表示。此策略減輕了模型的計算負擔，並透過建構正負樣本對來豐富有意義的視覺表示。我們設計了一個創新的預訓練方法，脈衝文本學習 (STL)，它使用文本特徵來提高離散語意的編碼能力。實驗表明，所提出的 GSHN 在多個 VL 下游任務上展現出有希望的結果。</paragraph>
+摘要：由於其強大的預測能力，深度學習已成為許多產業中不可或缺的工具，包括醫療保健。然而，傳統的深度學習模型通常缺乏可解釋性，並且忽略了將預測不確定性納入考量，而這兩個因素是臨床決策制定的關鍵組成部分。為了產生可解釋且具有不確定性意識的預測，本研究提出了一個名為貝氏柯爾莫哥洛夫阿諾德網路 (BKAN) 的新架構，它結合了柯爾莫哥洛夫阿諾德網路的表達能力與貝氏推論。我們在兩個醫學資料集上使用 BKAN，這些資料集是評估機器學習模型在醫學診斷中的廣泛使用基準：皮馬印第安人糖尿病資料集和克里夫蘭心臟病資料集。我們的模型提供了對預測信心和決策邊界的有益見解，並且在預測準確度方面優於傳統的深度學習模型。此外，BKAN 表現隨機和認識不確定性的能力，可確保醫生獲得更可靠且值得信賴的決策支援。根據實驗結果，我們的貝氏策略提高了模型的可解釋性，並大幅減少了過度擬合，這對於小型且不平衡的醫學資料集非常重要。我們提出了可能的擴充功能，以進一步將 BKAN 用於更複雜的多模式資料集，並探討這些發現對於未來建立可靠的醫療保健 AI 系統研究的重要性。這項工作為深度學習模型部署在透明度和可靠性至關重要的重要領域中開啟了一個新的典範。
 
-##### **Semantic Web and Creative AI -- A Technical Report from ISWS 2023**
-2501.18542v1 by Raia Abu Ahmad, Reham Alharbi, Roberto Barile, Martin Böckling, Francisco Bolanos, Sara Bonfitto, Oleksandra Bruns, Irene Celino, Yashrajsinh Chudasama, Martin Critelli, Claudia d'Amato, Giada D'Ippolito, Ioannis Dasoulas, Stefano De Giorgis, Vincenzo De Leo, Chiara Di Bonaventura, Marco Di Panfilo, Daniil Dobriy, John Domingue, Xuemin Duan, Michel Dumontier, Sefika Efeoglu, Ruben Eschauzier, Fakih Ginwa, Nicolas Ferranti, Arianna Graciotti, Philipp Hanisch, George Hannah, Golsa Heidari, Aidan Hogan, Hassan Hussein, Alexane Jouglar, Jan-Christoph Kalo, Manoé Kieffer, Antonis Klironomos, Inês Koch, Weronika Lajewska, Nicolas Lazzari, Mikael Lindekrans, Anna Sofia Lippolis, Majlinda Llugiqi, Eleonora Mancini, Eleonora Marzi, Laura Menotti, Daniela Milon Flores, Soulakshmee Nagowah, Kerstin Neubert, Emetis Niazmand, Ebrahim Norouzi, Beatriz Olarte Martinez, Anouk Michelle Oudshoorn, Andrea Poltronieri, Valentina Presutti, Disha Purohit, Ensiyeh Raoufi, Celian Ringwald, Johanna Rockstroh, Sebastian Rudolph, Harald Sack, Zafar Saeed, Mohammad Javad Saeedizade, Aya Sahbi, Cristian Santini, Aleksandra Simic, Dennis Sommer, Rita Sousa, Mary Ann Tan, Vidyashree Tarikere, Tabea Tietz, Liam Tirpitz, Arnaldo Tomasino, Frank van Harmelen, Joao Vissoci, Caitlin Woods, Bohui Zhang, Xinyue Zhang, Heng Zheng
+##### **MLtoGAI: Semantic Web based with Machine Learning for Enhanced Disease Prediction and Personalized Recommendations using Generative AI**
+2407.20284v1 by Shyam Dongre, Ritesh Chandra, Sonali Agarwal
 
-The International Semantic Web Research School (ISWS) is a week-long
-intensive program designed to immerse participants in the field. This document
-reports a collaborative effort performed by ten teams of students, each guided
-by a senior researcher as their mentor, attending ISWS 2023. Each team provided
-a different perspective to the topic of creative AI, substantiated by a set of
-research questions as the main subject of their investigation. The 2023 edition
-of ISWS focuses on the intersection of Semantic Web technologies and Creative
-AI. ISWS 2023 explored various intersections between Semantic Web technologies
-and creative AI. A key area of focus was the potential of LLMs as support tools
-for knowledge engineering. Participants also delved into the multifaceted
-applications of LLMs, including legal aspects of creative content production,
-humans in the loop, decentralised approaches to multimodal generative AI
-models, nanopublications and AI for personal scientific knowledge graphs,
-commonsense knowledge in automatic story and narrative completion, generative
-AI for art critique, prompt engineering, automatic music composition,
-commonsense prototyping and conceptual blending, and elicitation of tacit
-knowledge. As Large Language Models and semantic technologies continue to
-evolve, new exciting prospects are emerging: a future where the boundaries
-between creative expression and factual knowledge become increasingly permeable
-and porous, leading to a world of knowledge that is both informative and
-inspiring.
+In modern healthcare, addressing the complexities of accurate disease
+prediction and personalized recommendations is both crucial and challenging.
+This research introduces MLtoGAI, which integrates Semantic Web technology with
+Machine Learning (ML) to enhance disease prediction and offer user-friendly
+explanations through ChatGPT. The system comprises three key components: a
+reusable disease ontology that incorporates detailed knowledge about various
+diseases, a diagnostic classification model that uses patient symptoms to
+detect specific diseases accurately, and the integration of Semantic Web Rule
+Language (SWRL) with ontology and ChatGPT to generate clear, personalized
+health advice. This approach significantly improves prediction accuracy and
+ensures results that are easy to understand, addressing the complexity of
+diseases and diverse symptoms. The MLtoGAI system demonstrates substantial
+advancements in accuracy and user satisfaction, contributing to developing more
+intelligent and accessible healthcare solutions. This innovative approach
+combines the strengths of ML algorithms with the ability to provide
+transparent, human-understandable explanations through ChatGPT, achieving
+significant improvements in prediction accuracy and user comprehension. By
+leveraging semantic technology and explainable AI, the system enhances the
+accuracy of disease prediction and ensures that the recommendations are
+relevant and easily understood by individual patients. Our research highlights
+the potential of integrating advanced technologies to overcome existing
+challenges in medical diagnostics, paving the way for future developments in
+intelligent healthcare systems. Additionally, the system is validated using 200
+synthetic patient data records, ensuring robust performance and reliability.
 
-摘要：國際語意網路研究學校 (ISWS) 是一個為期一週的密集課程，旨在讓參與者沉浸在該領域中。本文件報告了由十個學生團隊進行的合作成果，每個團隊都由一位資深研究員作為導師，參加了 2023 年 ISWS。每個團隊都從不同的角度探討了創意 AI 主題，並以一系列研究問題作為調查的主要主題。2023 年版的 ISWS 關注於語意網路技術和創意 AI 的交集。ISWS 2023 探索了語意網路技術和創意 AI 之間的各種交集。一個重點關注領域是 LLM 作為知識工程的支援工具的潛力。參與者還深入探討了 LLM 的多方面應用，包括創意內容製作的法律方面、循環中的人類、多模態生成式 AI 模型的分散式方法、納米出版物和用於個人科學知識圖譜的 AI、自動故事和敘述完成中的常識知識、生成式 AI 用於藝術評論、提示工程、自動音樂創作、常識原型和概念混合，以及對默會知識的引導。隨著大型語言模型和語意技術的持續發展，新的令人興奮的前景正在出現：一個創意表達和事實知識之間的界限變得越來越可滲透和多孔的未來，從而導致一個既有資訊性又有啟發性的知識世界。
+摘要：在現代醫療保健中，解決準確疾病預測和個性化建議的複雜性既至關重要又具有挑戰性。本研究引入了 MLtoGAI，它將語義網路技術與機器學習 (ML) 相結合，以增強疾病預測並透過 ChatGPT 提供使用者友善的說明。該系統包含三個關鍵組成部分：一個可重複使用的疾病本体，其中包含有關各種疾病的詳細知識；一個診斷分類模型，它使用患者症狀來準確檢測特定疾病；以及語義網路規則語言 (SWRL) 與本体和 ChatGPT 的整合，以產生清晰、個性化的健康建議。這種方法顯著提高了預測準確性，並確保了易於理解的結果，解決了疾病和不同症狀的複雜性。MLtoGAI 系統展示了準確性和使用者滿意度的實質性進步，有助於開發更智慧且更易於取得的醫療保健解決方案。這種創新的方法結合了 ML 演算法的優點，以及透過 ChatGPT 提供透明且人類可以理解的說明的能力，在預測準確性和使用者理解方面取得了顯著的進步。透過利用語義技術和可解釋的 AI，該系統提高了疾病預測的準確性，並確保了建議與個別患者相關且易於理解。我們的研究強調了整合先進技術以克服醫療診斷中現有挑戰的潛力，為智慧醫療保健系統的未來發展鋪路。此外，該系統使用 200 個合成患者資料記錄進行驗證，確保了穩健的效能和可靠性。
 
-##### **Leveraging LLM Agents for Automated Optimization Modeling for SASP Problems: A Graph-RAG based Approach**
-2501.18320v1 by Tianpeng Pan, Wenqiang Pu, Licheng Zhao, Rui Zhou
+##### **Introducing δ-XAI: a novel sensitivity-based method for local AI explanations**
+2407.18343v2 by Alessandro De Carlo, Enea Parimbelli, Nicola Melillo, Giovanna Nicora
 
-Automated optimization modeling (AOM) has evoked considerable interest with
-the rapid evolution of large language models (LLMs). Existing approaches
-predominantly rely on prompt engineering, utilizing meticulously designed
-expert response chains or structured guidance. However, prompt-based techniques
-have failed to perform well in the sensor array signal processing (SASP) area
-due the lack of specific domain knowledge. To address this issue, we propose an
-automated modeling approach based on retrieval-augmented generation (RAG)
-technique, which consists of two principal components: a multi-agent (MA)
-structure and a graph-based RAG (Graph-RAG) process. The MA structure is
-tailored for the architectural AOM process, with each agent being designed
-based on principles of human modeling procedure. The Graph-RAG process serves
-to match user query with specific SASP modeling knowledge, thereby enhancing
-the modeling result. Results on ten classical signal processing problems
-demonstrate that the proposed approach (termed as MAG-RAG) outperforms several
-AOM benchmarks.
+Explainable Artificial Intelligence (XAI) is central to the debate on
+integrating Artificial Intelligence (AI) and Machine Learning (ML) algorithms
+into clinical practice. High-performing AI/ML models, such as ensemble learners
+and deep neural networks, often lack interpretability, hampering clinicians'
+trust in their predictions. To address this, XAI techniques are being developed
+to describe AI/ML predictions in human-understandable terms. One promising
+direction is the adaptation of sensitivity analysis (SA) and global sensitivity
+analysis (GSA), which inherently rank model inputs by their impact on
+predictions. Here, we introduce a novel delta-XAI method that provides local
+explanations of ML model predictions by extending the delta index, a GSA
+metric. The delta-XAI index assesses the impact of each feature's value on the
+predicted output for individual instances in both regression and classification
+problems. We formalize the delta-XAI index and provide code for its
+implementation. The delta-XAI method was evaluated on simulated scenarios using
+linear regression models, with Shapley values serving as a benchmark. Results
+showed that the delta-XAI index is generally consistent with Shapley values,
+with notable discrepancies in models with highly impactful or extreme feature
+values. The delta-XAI index demonstrated higher sensitivity in detecting
+dominant features and handling extreme feature values. Qualitatively, the
+delta-XAI provides intuitive explanations by leveraging probability density
+functions, making feature rankings clearer and more explainable for
+practitioners. Overall, the delta-XAI method appears promising for robustly
+obtaining local explanations of ML model predictions. Further investigations in
+real-world clinical settings will be conducted to evaluate its impact on
+AI-assisted clinical workflows.
 
-摘要：自動化最佳化建模 (AOM) 隨著大型語言模型 (LLM) 的快速演進而引起相當大的興趣。現有方法主要依賴提示工程，利用精心設計的專家回應鏈或結構化指導。然而，基於提示的技術由於缺乏特定領域知識，無法在感測器陣列訊號處理 (SASP) 領域中表現良好。為了解決這個問題，我們提出一個基於檢索增強生成 (RAG) 技術的自動化建模方法，它包含兩個主要組成部分：多代理 (MA) 結構和基於圖形的 RAG (Graph-RAG) 程序。MA 結構是針對架構 AOM 程序量身打造，每個代理都是根據人類建模程序的原理設計的。Graph-RAG 程序用於將使用者查詢與特定的 SASP 建模知識相匹配，從而增強建模結果。在十個經典訊號處理問題上的結果表明，所提出的方法（稱為 MAG-RAG）優於多個 AOM 基準。
+摘要：可解釋人工智慧 (XAI) 是將人工智慧 (AI) 和機器學習 (ML) 演算法整合到臨床實務中的辯論核心。高執行效能的 AI/ML 模型，例如整體學習器和深度神經網路，通常缺乏可解釋性，阻礙臨床醫生對其預測的信任。為了解決這個問題，正在開發 XAI 技術，以人類可以理解的術語描述 AI/ML 預測。一個有希望的方向是採用敏感度分析 (SA) 和全球敏感度分析 (GSA)，它們本質上會依據模型輸入對預測的影響來對其進行排名。在此，我們介紹一種新的 delta-XAI 方法，透過擴充 GSA 指標 delta 指數來提供 ML 模型預測的局部解釋。delta-XAI 指數評估每個特徵值對回歸和分類問題中個別例項的預測輸出之影響。我們將 delta-XAI 指數形式化，並提供其實作的程式碼。使用線性回歸模型對模擬情境評估 delta-XAI 方法，並以 Shapley 值作為基準。結果顯示 delta-XAI 指數通常與 Shapley 值一致，但在具有高度影響力或極端特徵值的模型中存在顯著差異。delta-XAI 指數在偵測主要特徵和處理極端特徵值方面表現出更高的敏感度。定性地來說，delta-XAI 透過利用機率密度函數提供直觀的解釋，使特徵排名更清晰且對從業人員來說更具可解釋性。總體而言，delta-XAI 方法對於穩健地取得 ML 模型預測的局部解釋似乎很有希望。將在真實世界的臨床環境中進行進一步調查，以評估其對 AI 輔助臨床工作流程的影響。
 
-##### **Mixed-Precision Graph Neural Quantization for Low Bit Large Language Models**
-2501.18154v1 by Wanlong Liu, Yichen Xiao, Dingyi Zeng, Hongyang Zhao, Wenyu Chen, Malu Zhang
+##### **Enhanced Deep Learning Methodologies and MRI Selection Techniques for Dementia Diagnosis in the Elderly Population**
+2407.17324v2 by Nikolaos Ntampakis, Konstantinos Diamantaras, Ioanna Chouvarda, Vasileios Argyriou, Panagiotis Sarigianndis
 
-Post-Training Quantization (PTQ) is pivotal for deploying large language
-models (LLMs) within resource-limited settings by significantly reducing
-resource demands. However, existing PTQ strategies underperform at low bit
-levels < 3 bits due to the significant difference between the quantized and
-original weights. To enhance the quantization performance at low bit widths, we
-introduce a Mixed-precision Graph Neural PTQ (MG-PTQ) approach, employing a
-graph neural network (GNN) module to capture dependencies among weights and
-adaptively assign quantization bit-widths. Through the information propagation
-of the GNN module, our method more effectively captures dependencies among
-target weights, leading to a more accurate assessment of weight importance and
-optimized allocation of quantization strategies. Extensive experiments on the
-WikiText2 and C4 datasets demonstrate that our MG-PTQ method outperforms
-previous state-of-the-art PTQ method GPTQ, setting new benchmarks for
-quantization performance under low-bit conditions.
+Dementia, a debilitating neurological condition affecting millions worldwide,
+presents significant diagnostic challenges. In this work, we introduce a novel
+methodology for the classification of demented and non-demented elderly
+patients using 3D brain Magnetic Resonance Imaging (MRI) scans. Our approach
+features a unique technique for selectively processing MRI slices, focusing on
+the most relevant brain regions and excluding less informative sections. This
+methodology is complemented by a confidence-based classification committee
+composed of three custom deep learning models: Dem3D ResNet, Dem3D CNN, and
+Dem3D EfficientNet. These models work synergistically to enhance
+decision-making accuracy, leveraging their collective strengths. Tested on the
+Open Access Series of Imaging Studies(OASIS) dataset, our method achieved an
+impressive accuracy of 94.12%, surpassing existing methodologies. Furthermore,
+validation on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset
+confirmed the robustness and generalizability of our approach. The use of
+explainable AI (XAI) techniques and comprehensive ablation studies further
+substantiate the effectiveness of our techniques, providing insights into the
+decision-making process and the importance of our methodology. This research
+offers a significant advancement in dementia diagnosis, providing a highly
+accurate and efficient tool for clinical applications.
 
-摘要：訓練後量化 (PTQ) 對於在資源受限的設定中部署大型語言模型 (LLM) 至關重要，因為它能顯著降低資源需求。然而，現有的 PTQ 策略在低位元層級 < 3 位元時表現不佳，因為量化後的權重與原始權重之間有顯著的差異。為了提升低位元寬度的量化效能，我們提出混合精度圖神經網路 PTQ (MG-PTQ) 方法，採用圖神經網路 (GNN) 模組來擷取權重之間的依存關係，並動態分配量化位元寬度。透過 GNN 模組的資訊傳播，我們的方法能更有效地擷取目標權重之間的依存關係，進而更準確地評估權重重要性，並最佳化量化策略的配置。在 WikiText2 和 C4 資料集上的廣泛實驗證明，我們的 MG-PTQ 方法優於先前的最先進 PTQ 方法 GPTQ，在低位元條件下設定了量化效能的新基準。
+摘要：失智症是一種影響全球數百萬人的衰弱性神經疾病，在診斷上具有重大挑戰。在這項工作中，我們提出了一種新的方法，用於對失智和非失智老年患者進行分類，使用 3D 大腦磁振造影 (MRI) 掃描。我們的做法採用了一種獨特技術，用於選擇性處理 MRI 切片，重點關注最相關的大腦區域，並排除信息量較少的部分。這種方法由一個基於信心的分類委員會補充，該委員會由三個自定義深度學習模型組成：Dem3D ResNet、Dem3D CNN 和 Dem3D EfficientNet。這些模型協同工作以增強決策的準確性，利用它們的集體優勢。在影像研究開放存取系列 (OASIS) 資料集上進行測試，我們的模型達到了 94.12% 的驚人準確度，超過了現有方法。此外，在阿茲海默症神經影像倡議 (ADNI) 資料集上的驗證證實了我們方法的穩健性和普遍性。可解釋 AI (XAI) 技術和全面的消融研究進一步證實了我們技術的有效性，提供了對決策過程和我們方法重要性的見解。這項研究為失智症診斷提供了重大進展，為臨床應用提供了一個高度準確且高效的工具。
 
-##### **Self-supervised Quantized Representation for Seamlessly Integrating Knowledge Graphs with Large Language Models**
-2501.18119v1 by Qika Lin, Tianzhe Zhao, Kai He, Zhen Peng, Fangzhi Xu, Ling Huang, Jingying Ma, Mengling Feng
+##### **Using Large Language Models to Compare Explainable Models for Smart Home Human Activity Recognition**
+2408.06352v1 by Michele Fiori, Gabriele Civitarese, Claudio Bettini
 
-Due to the presence of the natural gap between Knowledge Graph (KG)
-structures and the natural language, the effective integration of holistic
-structural information of KGs with Large Language Models (LLMs) has emerged as
-a significant question. To this end, we propose a two-stage framework to learn
-and apply quantized codes for each entity, aiming for the seamless integration
-of KGs with LLMs. Firstly, a self-supervised quantized representation (SSQR)
-method is proposed to compress both KG structural and semantic knowledge into
-discrete codes (\ie, tokens) that align the format of language sentences. We
-further design KG instruction-following data by viewing these learned codes as
-features to directly input to LLMs, thereby achieving seamless integration. The
-experiment results demonstrate that SSQR outperforms existing unsupervised
-quantized methods, producing more distinguishable codes. Further, the
-fine-tuned LLaMA2 and LLaMA3.1 also have superior performance on KG link
-prediction and triple classification tasks, utilizing only 16 tokens per entity
-instead of thousands in conventional prompting methods.
+Recognizing daily activities with unobtrusive sensors in smart environments
+enables various healthcare applications. Monitoring how subjects perform
+activities at home and their changes over time can reveal early symptoms of
+health issues, such as cognitive decline. Most approaches in this field use
+deep learning models, which are often seen as black boxes mapping sensor data
+to activities. However, non-expert users like clinicians need to trust and
+understand these models' outputs. Thus, eXplainable AI (XAI) methods for Human
+Activity Recognition have emerged to provide intuitive natural language
+explanations from these models. Different XAI methods generate different
+explanations, and their effectiveness is typically evaluated through user
+surveys, that are often challenging in terms of costs and fairness. This paper
+proposes an automatic evaluation method using Large Language Models (LLMs) to
+identify, in a pool of candidates, the best XAI approach for non-expert users.
+Our preliminary results suggest that LLM evaluation aligns with user surveys.
 
-摘要：由於知識圖譜 (KG) 結構與自然語言之間存在自然差距，將 KG 的整體結構資訊與大型語言模型 (LLM) 有效整合已成為一個重要的問題。為此，我們提出了一個兩階段架構來學習和應用每個實體的量化碼，旨在將 KG 與 LLM 無縫整合。首先，提出了一個自監督量化表示 (SSQR) 方法，將 KG 結構和語義知識壓縮成離散碼（即，符號），以對齊語言句子的格式。我們進一步設計 KG 指令遵循資料，將這些學習到的碼視為直接輸入 LLM 的特徵，從而實現無縫整合。實驗結果表明，SSQR 優於現有的無監督量化方法，產生更具區別性的碼。此外，微調後的 LLaMA2 和 LLaMA3.1 在 KG 連結預測和三元分類任務上也具有優異的性能，每個實體僅使用 16 個符號，而不是傳統提示方法中的數千個。
+摘要：藉由智慧環境中不引人注目的感測器辨識日常活動，能啟用各種醫療保健應用。監控受試者在家中如何執行活動，以及其隨著時間的變化，可以揭示健康問題的早期症狀，例如認知能力下降。此領域中的大多數方法都使用深度學習模型，這些模型通常被視為將感測器資料對應至活動的黑盒子。然而，非專家使用者（例如臨床醫師）需要信任並了解這些模型的輸出。因此，人類活動辨識的可解釋 AI (XAI) 方法應運而生，以提供來自這些模型的直覺自然語言說明。不同的 XAI 方法會產生不同的說明，而其有效性通常透過使用者調查來評估，這在成本和公平性方面通常具有挑戰性。本文提出使用大型語言模型 (LLM) 的自動評估方法，以在候選者中找出最適合非專家使用者的 XAI 方法。我們的初步結果表明，LLM 評估與使用者調查一致。
 
-##### **Hybrid Graphs for Table-and-Text based Question Answering using LLMs**
-2501.17767v1 by Ankush Agarwal, Ganesh S, Chaitanya Devaguptapu
+##### **Explainable AI-based Intrusion Detection System for Industry 5.0: An Overview of the Literature, associated Challenges, the existing Solutions, and Potential Research Directions**
+2408.03335v1 by Naseem Khan, Kashif Ahmad, Aref Al Tamimi, Mohammed M. Alani, Amine Bermak, Issa Khalil
+
+Industry 5.0, which focuses on human and Artificial Intelligence (AI)
+collaboration for performing different tasks in manufacturing, involves a
+higher number of robots, Internet of Things (IoTs) devices and
+interconnections, Augmented/Virtual Reality (AR), and other smart devices. The
+huge involvement of these devices and interconnection in various critical
+areas, such as economy, health, education and defense systems, poses several
+types of potential security flaws. AI itself has been proven a very effective
+and powerful tool in different areas of cybersecurity, such as intrusion
+detection, malware detection, and phishing detection, among others. Just as in
+many application areas, cybersecurity professionals were reluctant to accept
+black-box ML solutions for cybersecurity applications. This reluctance pushed
+forward the adoption of eXplainable Artificial Intelligence (XAI) as a tool
+that helps explain how decisions are made in ML-based systems. In this survey,
+we present a comprehensive study of different XAI-based intrusion detection
+systems for industry 5.0, and we also examine the impact of explainability and
+interpretability on Cybersecurity practices through the lens of Adversarial
+XIDS (Adv-XIDS) approaches. Furthermore, we analyze the possible opportunities
+and challenges in XAI cybersecurity systems for industry 5.0 that elicit future
+research toward XAI-based solutions to be adopted by high-stakes industry 5.0
+applications. We believe this rigorous analysis will establish a foundational
+framework for subsequent research endeavors within the specified domain.
+
+摘要：工業 5.0 著重於人類與人工智慧 (AI) 合作執行製造中的不同任務，涉及更多機器人、物聯網 (IoT) 裝置和互連、擴增/虛擬實境 (AR) 和其他智慧裝置。這些裝置和互連在經濟、醫療保健、教育和國防系統等各種關鍵領域的廣泛參與，引發了多種類型的潛在安全漏洞。AI 本身已被證明是網路安全不同領域中非常有效且強大的工具，例如入侵偵測、惡意軟體偵測和網路釣魚偵測等。就像在許多應用領域一樣，網路安全專業人員不願意接受黑盒 ML 解決方案來應用於網路安全。這種不願意促使可解釋人工智慧 (XAI) 作為一種工具被採用，有助於說明在基於 ML 的系統中如何做出決策。在這項調查中，我們對工業 5.0 的不同基於 XAI 的入侵偵測系統進行了全面的研究，並且我們也透過對抗式 XIDS (Adv-XIDS) 方法的觀點來探討可解釋性和可詮釋性對網路安全實務的影響。此外，我們分析了工業 5.0 的 XAI 網路安全系統中可能存在的機會和挑戰，引發了未來針對 XAI 基礎解決方案的研究，以供高風險的工業 5.0 應用採用。我們相信這項嚴謹的分析將為指定領域內的後續研究工作建立基礎架構。
+
+##### **A Comparative Study on Automatic Coding of Medical Letters with Explainability**
+2407.13638v1 by Jamie Glen, Lifeng Han, Paul Rayson, Goran Nenadic
 
-Answering questions that require reasoning and aggregation across both
-structured (tables) and unstructured (raw text) data sources presents
-significant challenges. Current methods rely on fine-tuning and high-quality,
-human-curated data, which is difficult to obtain. Recent advances in Large
-Language Models (LLMs) have shown promising results for multi-hop question
-answering (QA) over single-source text data in a zero-shot setting, yet
-exploration into multi-source Table-Text QA remains limited. In this paper, we
-present a novel Hybrid Graph-based approach for Table-Text QA that leverages
-LLMs without fine-tuning. Our method constructs a unified Hybrid Graph from
-textual and tabular data, pruning information based on the input question to
-provide the LLM with relevant context concisely. We evaluate our approach on
-the challenging Hybrid-QA and OTT-QA datasets using state-of-the-art LLMs,
-including GPT-3.5, GPT-4, and LLaMA-3. Our method achieves the best zero-shot
-performance on both datasets, improving Exact Match scores by up to 10% on
-Hybrid-QA and 5.4% on OTT-QA. Moreover, our approach reduces token usage by up
-to 53% compared to the original context.
+This study aims to explore the implementation of Natural Language Processing
+(NLP) and machine learning (ML) techniques to automate the coding of medical
+letters with visualised explainability and light-weighted local computer
+settings. Currently in clinical settings, coding is a manual process that
+involves assigning codes to each condition, procedure, and medication in a
+patient's paperwork (e.g., 56265001 heart disease using SNOMED CT code). There
+are preliminary research on automatic coding in this field using
+state-of-the-art ML models; however, due to the complexity and size of the
+models, the real-world deployment is not achieved. To further facilitate the
+possibility of automatic coding practice, we explore some solutions in a local
+computer setting; in addition, we explore the function of explainability for
+transparency of AI models. We used the publicly available MIMIC-III database
+and the HAN/HLAN network models for ICD code prediction purposes. We also
+experimented with the mapping between ICD and SNOMED CT knowledge bases. In our
+experiments, the models provided useful information for 97.98\% of codes. The
+result of this investigation can shed some light on implementing automatic
+clinical coding in practice, such as in hospital settings, on the local
+computers used by clinicians , project page
+\url{https://github.com/Glenj01/Medical-Coding}.
 
-摘要：回答需要對結構化（表格）和非結構化（原始文字）資料來源進行推理和彙總的問題會帶來重大挑戰。目前的辦法仰賴微調和高品質、人工整理的資料，而這很難取得。大型語言模型（LLM）的最新進展已針對零次學習設定的單一來源文字資料多跳問題回答（QA）展現出有希望的結果，但對多來源表格文字 QA 的探討仍然有限。在本文中，我們提出了一種新穎的基於混合圖表的表格文字 QA 方法，它利用 LLM 而無需微調。我們的辦法從文字和表格資料建構一個統一的混合圖表，根據輸入問題修剪資訊，以簡潔地為 LLM 提供相關脈絡。我們使用最先進的 LLM，包括 GPT-3.5、GPT-4 和 LLaMA-3，針對具有挑戰性的 Hybrid-QA 和 OTT-QA 資料集評估我們的辦法。我們的辦法在兩個資料集上都達到了最佳的零次學習效能，在 Hybrid-QA 上將完全比對分數提高了 10%，在 OTT-QA 上將完全比對分數提高了 5.4%。此外，與原始脈絡相比，我們的辦法將符號使用量減少了 53%。
+摘要：本研究旨在探討將自然語言處理 (NLP) 和機器學習 (ML) 技術實作於醫療信函編碼自動化，並具備視覺化說明能力和輕量化的本地電腦設定。目前在臨床環境中，編碼是一種手動流程，涉及為病患文件中的每項病症、程序和藥物指派代碼 (例如，使用 SNOMED CT 代碼 56265001 表示心臟病)。此領域有使用最新 ML 模型進行自動編碼的初步研究；然而，由於模型的複雜性和大小，並未實現實際部署。為了進一步促進自動編碼實務的可能性，我們在本地電腦設定中探討了一些解決方案；此外，我們探討了說明功能在 AI 模型透明度中的功能。我們使用公開的 MIMIC-III 資料庫和 HAN/HLAN 網路模型進行 ICD 代碼預測。我們還試驗了 ICD 和 SNOMED CT 知識庫之間的對應。在我們的實驗中，這些模型提供了 97.98% 代碼的有用資訊。這項調查結果可以為實務中的自動臨床編碼實作提供一些見解，例如在醫院環境中，由臨床醫生使用的本地電腦，專案頁面 \url{https://github.com/Glenj01/Medical-Coding}。
 
-##### **Query-Aware Learnable Graph Pooling Tokens as Prompt for Large Language Models**
-2501.17549v1 by Wooyoung Kim, Byungyoon Park, Wooju Kim
+##### **Explainable AI for Enhancing Efficiency of DL-based Channel Estimation**
+2407.07009v1 by Abdul Karim Gizzini, Yahia Medjahdi, Ali J. Ghandour, Laurent Clavier
 
-Graph-structured data plays a vital role in numerous domains, such as social
-networks, citation networks, commonsense reasoning graphs and knowledge graphs.
-While graph neural networks have been employed for graph processing, recent
-advancements have explored integrating large language models for graph-based
-tasks. In this paper, we propose a novel approach named Learnable Graph Pooling
-Token (LGPT), which addresses the limitations of the scalability issues in
-node-level projection and information loss in graph-level projection. LGPT
-enables flexible and efficient graph representation by introducing learnable
-parameters that act as tokens in large language models, balancing fine-grained
-and global graph information. Additionally, we investigate an Early Query
-Fusion technique, which fuses query context before constructing the graph
-representation, leading to more effective graph embeddings. Our method achieves
-a 4.13\% performance improvement on the GraphQA benchmark without training the
-large language model, demonstrating significant gains in handling complex
-textual-attributed graph data.
+The support of artificial intelligence (AI) based decision-making is a key
+element in future 6G networks, where the concept of native AI will be
+introduced. Moreover, AI is widely employed in different critical applications
+such as autonomous driving and medical diagnosis. In such applications, using
+AI as black-box models is risky and challenging. Hence, it is crucial to
+understand and trust the decisions taken by these models. Tackling this issue
+can be achieved by developing explainable AI (XAI) schemes that aim to explain
+the logic behind the black-box model behavior, and thus, ensure its efficient
+and safe deployment. Recently, we proposed a novel perturbation-based XAI-CHEST
+framework that is oriented toward channel estimation in wireless
+communications. The core idea of the XAI-CHEST framework is to identify the
+relevant model inputs by inducing high noise on the irrelevant ones. This
+manuscript provides the detailed theoretical foundations of the XAI-CHEST
+framework. In particular, we derive the analytical expressions of the XAI-CHEST
+loss functions and the noise threshold fine-tuning optimization problem. Hence
+the designed XAI-CHEST delivers a smart input feature selection methodology
+that can further improve the overall performance while optimizing the
+architecture of the employed model. Simulation results show that the XAI-CHEST
+framework provides valid interpretations, where it offers an improved bit error
+rate performance while reducing the required computational complexity in
+comparison to the classical DL-based channel estimation.
 
-摘要：圖形結構資料在許多領域中扮演著至關重要的角色，例如社交網路、引用網路、常識推理圖形和知識圖形。雖然圖形神經網路已用於圖形處理，但最近的進展已探討整合大型語言模型以進行基於圖形的任務。在本文中，我們提出了一種名為可學習圖形池化令牌 (LGPT) 的新方法，它解決了節點層級投影中的可擴充性問題和圖形層級投影中的資訊遺失限制。LGPT 透過引入可學習的參數（在大型語言模型中作為令牌運作）來啟用彈性和高效的圖形表示，平衡細粒度和整體圖形資訊。此外，我們研究了一種早期查詢融合技術，它在建構圖形表示之前融合查詢內容，進而產生更有效的圖形嵌入。我們的方法在 GraphQA 基準上達到了 4.13% 的效能提升，而無需訓練大型語言模型，證明了在處理複雜的文字屬性圖形資料方面有顯著的進展。
+摘要：人工智能 (AI) 支持的決策制定是未來 6G 網路中的關鍵元素，其中將引入原生 AI 的概念。此外，AI 廣泛用於不同的關鍵應用中，例如自動駕駛和醫療診斷。在這些應用中，使用 AI 作為黑盒模型是有風險且具有挑戰性的。因此，理解和信任這些模型做出的決策至關重要。解決此問題的方法是開發可解釋 AI (XAI) 架構，旨在解釋黑盒模型行為背後的邏輯，從而確保其有效且安全的部署。最近，我們提出了一個新的基於擾動的 XAI-CHEST 框架，該框架面向無線通信中的信道估計。XAI-CHEST 框架的核心思想是通過在無關輸入上引入高噪聲來識別相關模型輸入。這份手稿提供了 XAI-CHEST 框架的詳細理論基礎。特別是，我們推導了 XAI-CHEST 損失函數和噪聲閾值微調優化問題的解析表達式。因此，設計的 XAI-CHEST 提供了一種智能輸入特徵選擇方法，可以在優化所用模型的架構的同時進一步提高整體性能。模擬結果表明，XAI-CHEST 框架提供了有效的解釋，在降低所需的計算複雜度的同時，提供了改進的比特錯誤率性能，而這與基於傳統 DL 的信道估計相比。
 
-##### **General Scene Adaptation for Vision-and-Language Navigation**
-2501.17403v1 by Haodong Hong, Yanyuan Qiao, Sen Wang, Jiajun Liu, Qi Wu
+##### **Explainable AI: Comparative Analysis of Normal and Dilated ResNet Models for Fundus Disease Classification**
+2407.05440v2 by P. N. Karthikayan, Yoga Sri Varshan V, Hitesh Gupta Kattamuri, Umarani Jayaraman
 
-Vision-and-Language Navigation (VLN) tasks mainly evaluate agents based on
-one-time execution of individual instructions across multiple environments,
-aiming to develop agents capable of functioning in any environment in a
-zero-shot manner. However, real-world navigation robots often operate in
-persistent environments with relatively consistent physical layouts, visual
-observations, and language styles from instructors. Such a gap in the task
-setting presents an opportunity to improve VLN agents by incorporating
-continuous adaptation to specific environments. To better reflect these
-real-world conditions, we introduce GSA-VLN, a novel task requiring agents to
-execute navigation instructions within a specific scene and simultaneously
-adapt to it for improved performance over time. To evaluate the proposed task,
-one has to address two challenges in existing VLN datasets: the lack of OOD
-data, and the limited number and style diversity of instructions for each
-scene. Therefore, we propose a new dataset, GSA-R2R, which significantly
-expands the diversity and quantity of environments and instructions for the R2R
-dataset to evaluate agent adaptability in both ID and OOD contexts.
-Furthermore, we design a three-stage instruction orchestration pipeline that
-leverages LLMs to refine speaker-generated instructions and apply role-playing
-techniques to rephrase instructions into different speaking styles. This is
-motivated by the observation that each individual user often has consistent
-signatures or preferences in their instructions. We conducted extensive
-experiments on GSA-R2R to thoroughly evaluate our dataset and benchmark various
-methods. Based on our findings, we propose a novel method, GR-DUET, which
-incorporates memory-based navigation graphs with an environment-specific
-training strategy, achieving state-of-the-art results on all GSA-R2R splits.
+This paper presents dilated Residual Network (ResNet) models for disease
+classification from retinal fundus images. Dilated convolution filters are used
+to replace normal convolution filters in the higher layers of the ResNet model
+(dilated ResNet) in order to improve the receptive field compared to the normal
+ResNet model for disease classification. This study introduces
+computer-assisted diagnostic tools that employ deep learning, enhanced with
+explainable AI techniques. These techniques aim to make the tool's
+decision-making process transparent, thereby enabling medical professionals to
+understand and trust the AI's diagnostic decision. They are particularly
+relevant in today's healthcare landscape, where there is a growing demand for
+transparency in AI applications to ensure their reliability and ethical use.
+The dilated ResNet is used as a replacement for the normal ResNet to enhance
+the classification accuracy of retinal eye diseases and reduce the required
+computing time. The dataset used in this work is the Ocular Disease Intelligent
+Recognition (ODIR) dataset which is a structured ophthalmic database with eight
+classes covering most of the common retinal eye diseases. The evaluation
+metrics used in this work include precision, recall, accuracy, and F1 score. In
+this work, a comparative study has been made between normal ResNet models and
+dilated ResNet models on five variants namely ResNet-18, ResNet-34, ResNet-50,
+ResNet-101, and ResNet-152. The dilated ResNet model shows promising results as
+compared to normal ResNet with an average F1 score of 0.71, 0.70, 0.69, 0.67,
+and 0.70 respectively for the above respective variants in ODIR multiclass
+disease classification.
 
-摘要：視覺語言導航 (VLN) 任務主要根據代理程式在多個環境中執行個別指令的一次性執行來評估代理程式，旨在開發能夠在任何環境中以零次學習的方式運作的代理程式。然而，真實世界的導航機器人通常在持續性的環境中運作，而這些環境具有相對一致的物理配置、視覺觀察和指令的語言風格。任務設定中的這種差距提供了一個機會，可以透過將連續適應特定環境納入其中來改善 VLN 代理程式。為了更好地反映這些真實世界的條件，我們推出了 GSA-VLN，這是一個新任務，要求代理程式在特定場景中執行導航指令，並同時適應該場景，以隨著時間推移而提高效能。為了評估所提出的任務，必須解決現有 VLN 資料集中的兩個挑戰：缺乏 OOD 資料，以及每個場景的指令數量和風格多樣性有限。因此，我們提出了一個新的資料集 GSA-R2R，它顯著擴展了 R2R 資料集的環境和指令的多樣性和數量，以評估代理程式在 ID 和 OOD 背景下的適應能力。此外，我們設計了一個三階段指令編排管道，該管道利用大型語言模型 (LLM) 來精煉由說話者產生的指令，並應用角色扮演技巧將指令改寫成不同的說話風格。這項技術的靈感來自於觀察到每個個別使用者通常在其指令中具有相符的簽名或偏好。我們針對 GSA-R2R 進行了大量的實驗，以徹底評估我們的資料集和基準各種方法。根據我們的研究結果，我們提出了一種新的方法 GR-DUET，它將基於記憶的導航圖表與特定於環境的訓練策略結合在一起，在所有 GSA-R2R 分割中取得了最先進的結果。
+摘要：这篇论文提出了用于从视网膜眼底图像进行疾病分类的扩张残差网络 (ResNet) 模型。扩张卷积滤波器用于替换 ResNet 模型较高层中的正常卷积滤波器（扩张 ResNet），以改善感知场，从而针对疾病分类对正常 ResNet 模型进行改进。本研究引入了采用深度学习的计算机辅助诊断工具，并通过可解释的 AI 技术进行了增强。这些技术旨在使该工具的决策过程透明化，从而使医学专业人士能够理解和信任 AI 的诊断决策。它们与当今的医疗保健领域尤为相关，在该领域，对 AI 应用的透明度需求不断增长，以确保其可靠性和合乎道德的使用。扩张 ResNet 用作正常 ResNet 的替代品，以提高视网膜眼部疾病的分类准确性并减少所需的计算时间。本工作中使用的数据集是眼科疾病智能识别 (ODIR) 数据集，这是一个结构化的眼科数据库，包含八类涵盖大多数常见视网膜眼部疾病。本工作中使用的评估指标包括精确度、召回率、准确度和 F1 得分。在这项工作中，对 ResNet-18、ResNet-34、ResNet-50、ResNet-101 和 ResNet-152 五个变体的正常 ResNet 模型和扩张 ResNet 模型进行了比较研究。与正常 ResNet 相比，扩张 ResNet 模型显示出有希望的结果，在 ODIR 多类疾病分类中，上述各个变体的平均 F1 得分为 0.71、0.70、0.69、0.67 和 0.70。
 
-##### **Comprehensive Evaluation for a Large Scale Knowledge Graph Question Answering Service**
-2501.17270v1 by Saloni Potdar, Daniel Lee, Omar Attia, Varun Embar, De Meng, Ramesh Balaji, Chloe Seivwright, Eric Choi, Mina H. Farid, Yiwen Sun, Yunyao Li
+##### **A Survey on Trustworthiness in Foundation Models for Medical Image Analysis**
+2407.15851v2 by Congzhen Shi, Ryan Rezai, Jiaxi Yang, Qi Dou, Xiaoxiao Li
 
-Question answering systems for knowledge graph (KGQA), answer factoid
-questions based on the data in the knowledge graph. KGQA systems are complex
-because the system has to understand the relations and entities in the
-knowledge-seeking natural language queries and map them to structured queries
-against the KG to answer them. In this paper, we introduce Chronos, a
-comprehensive evaluation framework for KGQA at industry scale. It is designed
-to evaluate such a multi-component system comprehensively, focusing on (1)
-end-to-end and component-level metrics, (2) scalable to diverse datasets and
-(3) a scalable approach to measure the performance of the system prior to
-release. In this paper, we discuss the unique challenges associated with
-evaluating KGQA systems at industry scale, review the design of Chronos, and
-how it addresses these challenges. We will demonstrate how it provides a base
-for data-driven decisions and discuss the challenges of using it to measure and
-improve a real-world KGQA system.
+The rapid advancement of foundation models in medical imaging represents a
+significant leap toward enhancing diagnostic accuracy and personalized
+treatment. However, the deployment of foundation models in healthcare
+necessitates a rigorous examination of their trustworthiness, encompassing
+privacy, robustness, reliability, explainability, and fairness. The current
+body of survey literature on foundation models in medical imaging reveals
+considerable gaps, particularly in the area of trustworthiness. Additionally,
+existing surveys on the trustworthiness of foundation models do not adequately
+address their specific variations and applications within the medical imaging
+domain. This survey aims to fill that gap by presenting a novel taxonomy of
+foundation models used in medical imaging and analyzing the key motivations for
+ensuring their trustworthiness. We review current research on foundation models
+in major medical imaging applications, focusing on segmentation, medical report
+generation, medical question and answering (Q\&A), and disease diagnosis. These
+areas are highlighted because they have seen a relatively mature and
+substantial number of foundation models compared to other applications. We
+focus on literature that discusses trustworthiness in medical image analysis
+manuscripts. We explore the complex challenges of building trustworthy
+foundation models for each application, summarizing current concerns and
+strategies for enhancing trustworthiness. Furthermore, we examine the potential
+of these models to revolutionize patient care. Our analysis underscores the
+imperative for advancing towards trustworthy AI in medical image analysis,
+advocating for a balanced approach that fosters innovation while ensuring
+ethical and equitable healthcare delivery.
 
-摘要：知識圖譜問答系統 (KGQA) 根據知識圖譜中的資料回答事實問題。KGQA 系統很複雜，因為系統必須理解知識尋求自然語言查詢中的關係和實體，並將它們對映到針對知識圖譜的結構化查詢，才能回答這些查詢。在本文中，我們介紹了 Chronos，這是一個用於產業規模 KGQA 的全面評估框架。它旨在全面評估這種多組件系統，重點關注：(1) 端對端和組件層級指標，(2) 可擴充至各種資料集，以及 (3) 可擴充的方法，用於在釋出前衡量系統的效能。在本文中，我們討論了與產業規模 KGQA 系統評估相關的獨特挑戰，檢視 Chronos 的設計，以及它如何應對這些挑戰。我們將展示它如何提供資料驅動決策的基礎，並討論使用它來衡量和改善真實世界 KGQA 系統的挑戰。
+摘要：基礎模型在醫學影像方面的快速進展，代表著在加強診斷準確性和個人化治療方面邁出一大步。然而，基礎模型在醫療保健中的部署需要對其可信度進行嚴格的審查，包括隱私、穩健性、可靠性、可解釋性和公平性。目前關於醫學影像中基礎模型的調查文獻中顯示出相當大的差距，特別是在可信度方面。此外，現有關於基礎模型可信度的調查並未充分解決其在醫學影像領域中的特定變化和應用。本調查旨在通過提出醫學影像中使用的基礎模型的新分類法並分析確保其可信度的關鍵動機，來填補這一空白。我們回顧了基礎模型在主要醫學影像應用中的當前研究，重點關注分割、醫療報告生成、醫療問題和回答 (Q&A) 以及疾病診斷。這些領域之所以被強調，是因為與其他應用相比，它們已經看到相對成熟且大量的基礎模型。我們專注於探討醫學影像分析手稿中可信度的文獻。我們探討了為每個應用構建可信基礎模型的複雜挑戰，總結了當前關注點和增強可信度的策略。此外，我們探討了這些模型在革新患者護理方面的潛力。我們的分析強調了在醫學影像分析中朝著可信賴的人工智慧邁進的必要性，並倡導一種平衡的方法，既能促進創新，又能確保道德和公平的醫療保健服務。
 
-##### **FactCG: Enhancing Fact Checkers with Graph-Based Multi-Hop Data**
-2501.17144v1 by Deren Lei, Yaxi Li, Siyao Li, Mengya Hu, Rui Xu, Ken Archer, Mingyu Wang, Emily Ching, Alex Deng
+##### **The Impact of an XAI-Augmented Approach on Binary Classification with Scarce Data**
+2407.06206v1 by Ximing Wen, Rosina O. Weber, Anik Sen, Darryl Hannan, Steven C. Nesbit, Vincent Chan, Alberto Goffi, Michael Morris, John C. Hunninghake, Nicholas E. Villalobos, Edward Kim, Christopher J. MacLellan
 
-Prior research on training grounded factuality classification models to
-detect hallucinations in large language models (LLMs) has relied on public
-natural language inference (NLI) data and synthetic data. However, conventional
-NLI datasets are not well-suited for document-level reasoning, which is
-critical for detecting LLM hallucinations. Recent approaches to document-level
-synthetic data generation involve iteratively removing sentences from documents
-and annotating factuality using LLM-based prompts. While effective, this method
-is computationally expensive for long documents and limited by the LLM's
-capabilities. In this work, we analyze the differences between existing
-synthetic training data used in state-of-the-art models and real LLM output
-claims. Based on our findings, we propose a novel approach for synthetic data
-generation, CG2C, that leverages multi-hop reasoning on context graphs
-extracted from documents. Our fact checker model, FactCG, demonstrates improved
-performance with more connected reasoning, using the same backbone models.
-Experiments show it even outperforms GPT-4-o on the LLM-Aggrefact benchmark
-with much smaller model size.
+Point-of-Care Ultrasound (POCUS) is the practice of clinicians conducting and
+interpreting ultrasound scans right at the patient's bedside. However, the
+expertise needed to interpret these images is considerable and may not always
+be present in emergency situations. This reality makes algorithms such as
+machine learning classifiers extremely valuable to augment human decisions.
+POCUS devices are becoming available at a reasonable cost in the size of a
+mobile phone. The challenge of turning POCUS devices into life-saving tools is
+that interpretation of ultrasound images requires specialist training and
+experience. Unfortunately, the difficulty to obtain positive training images
+represents an important obstacle to building efficient and accurate
+classifiers. Hence, the problem we try to investigate is how to explore
+strategies to increase accuracy of classifiers trained with scarce data. We
+hypothesize that training with a few data instances may not suffice for
+classifiers to generalize causing them to overfit. Our approach uses an
+Explainable AI-Augmented approach to help the algorithm learn more from less
+and potentially help the classifier better generalize.
 
-摘要：先前的研究訓練了基於事實的分類模型，以偵測大型語言模型 (LLM) 中的幻覺，依賴於公開的自然語言推論 (NLI) 資料和合成資料。然而，傳統的 NLI 資料集並不適合文件層級的推理，這對於偵測 LLM 的幻覺至關重要。最近的文件層級合成資料生成方法涉及從文件中反覆移除句子，並使用基於 LLM 的提示註解事實。雖然有效，但此方法對於長文件來說在運算上很昂貴，且受限於 LLM 的能力。在這項工作中，我們分析了現有合成訓練資料與最先進模型中使用的真實 LLM 輸出宣告之間的差異。根據我們的研究結果，我們提出了一個用於合成資料生成的創新方法 CG2C，它利用從文件中提取的內容圖表進行多跳推理。我們的查核模型 FactCG 使用相同的骨幹模型，展示了在更多連結的推理下改進的效能。實驗表明，它甚至在 LLM-Aggrefact 基準上優於 GPT-4-o，且模型大小小得多。
+摘要：床邊超音波 (POCUS) 是臨床醫師在患者床邊進行和解讀超音波掃描的實務。然而，解讀這些影像所需的專業知識相當可觀，而且在緊急情況下可能並非隨時具備。這種現實情況使得機器學習分類器等演算法對於加強人類決策變得極為有價值。POCUS 裝置正以合理成本推出，尺寸為手機大小。將 POCUS 裝置轉變為救生工具的挑戰在於，解讀超音波影像需要專門訓練和經驗。不幸的是，取得正向訓練影像的困難度代表著建置有效率且準確的分類器的一大障礙。因此，我們嘗試探討的問題是如何探索策略，以提高使用稀疏資料訓練的分類器的準確度。我們假設使用少數資料實例進行訓練可能不足以讓分類器概括，導致它們過度擬合。我們的做法使用可解釋 AI 增強方法，以協助演算法從較少的資料中學習更多，並潛在協助分類器更好地概括。
 
-##### **LLM-AutoDiff: Auto-Differentiate Any LLM Workflow**
-2501.16673v2 by Li Yin, Zhangyang Wang
+##### **Can GPT-4 Help Detect Quit Vaping Intentions? An Exploration of Automatic Data Annotation Approach**
+2407.00167v1 by Sai Krishna Revanth Vuruma, Dezhi Wu, Saborny Sen Gupta, Lucas Aust, Valerie Lookingbill, Wyatt Bellamy, Yang Ren, Erin Kasson, Li-Shiun Chen, Patricia Cavazos-Rehg, Dian Hu, Ming Huang
 
-Large Language Models (LLMs) have reshaped natural language processing,
-powering applications from multi-hop retrieval and question answering to
-autonomous agent workflows. Yet, prompt engineering -- the task of crafting
-textual inputs to effectively direct LLMs -- remains difficult and
-labor-intensive, particularly for complex pipelines that combine multiple LLM
-calls with functional operations like retrieval and data formatting. We
-introduce LLM-AutoDiff: a novel framework for Automatic Prompt Engineering
-(APE) that extends textual gradient-based methods (such as Text-Grad) to
-multi-component, potentially cyclic LLM architectures. Implemented within the
-AdalFlow library, LLM-AutoDiff treats each textual input as a trainable
-parameter and uses a frozen backward engine LLM to generate feedback-akin to
-textual gradients -- that guide iterative prompt updates. Unlike prior
-single-node approaches, LLM-AutoDiff inherently accommodates functional nodes,
-preserves time-sequential behavior in repeated calls (e.g., multi-hop loops),
-and combats the "lost-in-the-middle" problem by isolating distinct sub-prompts
-(instructions, formats, or few-shot examples). It further boosts training
-efficiency by focusing on error-prone samples through selective gradient
-computation. Across diverse tasks, including single-step classification,
-multi-hop retrieval-based QA, and agent-driven pipelines, LLM-AutoDiff
-consistently outperforms existing textual gradient baselines in both accuracy
-and training cost. By unifying prompt optimization through a graph-centric
-lens, LLM-AutoDiff offers a powerful new paradigm for scaling and automating
-LLM workflows - mirroring the transformative role that automatic
-differentiation libraries have long played in neural network research.
+In recent years, the United States has witnessed a significant surge in the
+popularity of vaping or e-cigarette use, leading to a notable rise in cases of
+e-cigarette and vaping use-associated lung injury (EVALI) that caused
+hospitalizations and fatalities during the EVALI outbreak in 2019, highlighting
+the urgency to comprehend vaping behaviors and develop effective strategies for
+cessation. Due to the ubiquity of social media platforms, over 4.7 billion
+users worldwide use them for connectivity, communications, news, and
+entertainment with a significant portion of the discourse related to health,
+thereby establishing social media data as an invaluable organic data resource
+for public health research. In this study, we extracted a sample dataset from
+one vaping sub-community on Reddit to analyze users' quit-vaping intentions.
+Leveraging OpenAI's latest large language model GPT-4 for sentence-level quit
+vaping intention detection, this study compares the outcomes of this model
+against layman and clinical expert annotations. Using different prompting
+strategies such as zero-shot, one-shot, few-shot and chain-of-thought
+prompting, we developed 8 prompts with varying levels of detail to explain the
+task to GPT-4 and also evaluated the performance of the strategies against each
+other. These preliminary findings emphasize the potential of GPT-4 in social
+media data analysis, especially in identifying users' subtle intentions that
+may elude human detection.
 
-摘要：大型語言模型 (LLM) 已重塑自然語言處理，
-為從多跳檢索和問答到
-自主代理工作流程的應用提供動力。然而，提示工程 -- 編寫
-文本輸入以有效指導 LLM 的任務 -- 仍然困難且
-勞動密集，特別是對於將多個 LLM
-呼叫與檢索和數據格式化等功能操作相結合的複雜管道。我們
-介紹 LLM-AutoDiff：一個用於自動提示工程 (APE) 的新框架，它將基於文本梯度的
-方法（例如 Text-Grad）擴展到多組件、潛在循環 LLM 架構中。在
-AdalFlow 庫中實施，LLM-AutoDiff 將每個文本輸入視為一個可訓練
-參數，並使用凍結的後向引擎 LLM 生成反饋——類似於
-文本梯度——指導迭代提示更新。與先前的
-單節點方法不同，LLM-AutoDiff 本質上適應功能節點，
-在重複呼叫（例如，多跳循環）中保留時間順序行為，
-並通過隔離不同的子提示（說明、格式或少數鏡頭示例）來解決“迷失在中間”問題。它進一步提高訓練
-效率，通過選擇性梯度
-計算專注於容易出錯的樣本。在包括單步分類、
-多跳基於檢索的問答和代理驅動管道在內的各種任務中，LLM-AutoDiff
-在準確性和訓練成本方面始終優於現有的文本梯度基準。通過圖形中心化
-視角統一提示優化，LLM-AutoDiff 為擴展和自動化
-LLM 工作流程提供了一個強大的新範例——反映了自動
-微分庫在神經網絡研究中長期扮演的變革性角色。
+摘要：近年來，美國見證了電子煙或電子香菸使用率大幅激增，導致電子煙和電子煙使用相關肺損傷 (EVALI) 病例顯著增加，在 2019 年 EVALI 爆發期間造成住院和死亡，凸顯了理解電子煙行為和制定有效戒菸策略的迫切性。由於社群媒體平台的普及，全球超過 47 億使用者使用它們進行連結、溝通、新聞和娛樂，其中很大一部分與健康相關，因此將社群媒體資料建立為公共衛生研究中無價的有機資料資源。在本研究中，我們從 Reddit 上一個電子煙子社群中提取一個範例資料集，以分析使用者的戒電子煙意圖。利用 OpenAI 最新的大型語言模型 GPT-4 進行句子層級的戒電子煙意圖偵測，本研究比較了此模型的結果與外行人和臨床專家註解。使用不同的提示策略，例如零次學習、一次學習、少次學習和思考鏈提示，我們開發了 8 個提示，詳細程度不同，向 GPT-4 解釋任務，並評估這些策略彼此之間的效能。這些初步發現強調了 GPT-4 在社群媒體資料分析中的潛力，特別是在識別人類偵測可能無法察覺的使用者微妙意圖方面。
 
-##### **360Brew: A Decoder-only Foundation Model for Personalized Ranking and Recommendation**
-2501.16450v3 by Hamed Firooz, Maziar Sanjabi, Adrian Englhardt, Aman Gupta, Ben Levine, Dre Olgiati, Gungor Polatkan, Iuliia Melnychuk, Karthik Ramgopal, Kirill Talanine, Kutta Srinivasan, Luke Simon, Natesh Sivasubramoniapillai, Necip Fazil Ayan, Qingquan Song, Samira Sriram, Souvik Ghosh, Tao Song, Tejas Dharamsi, Vignesh Kothapalli, Xiaoling Zhai, Ya Xu, Yu Wang, Yun Dai
+##### **Towards Compositional Interpretability for XAI**
+2406.17583v1 by Sean Tull, Robin Lorenz, Stephen Clark, Ilyas Khan, Bob Coecke
 
-Ranking and recommendation systems are the foundation for numerous online
-experiences, ranging from search results to personalized content delivery.
-These systems have evolved into complex, multilayered architectures that
-leverage vast datasets and often incorporate thousands of predictive models.
-The maintenance and enhancement of these models is a labor intensive process
-that requires extensive feature engineering. This approach not only exacerbates
-technical debt but also hampers innovation in extending these systems to
-emerging problem domains. In this report, we present our research to address
-these challenges by utilizing a large foundation model with a textual interface
-for ranking and recommendation tasks. We illustrate several key advantages of
-our approach: (1) a single model can manage multiple predictive tasks involved
-in ranking and recommendation, (2) decoder models with textual interface due to
-their comprehension of reasoning capabilities, can generalize to new
-recommendation surfaces and out-of-domain problems, and (3) by employing
-natural language interfaces for task definitions and verbalizing member
-behaviors and their social connections, we eliminate the need for feature
-engineering and the maintenance of complex directed acyclic graphs of model
-dependencies. We introduce our research pre-production model, 360Brew V1.0, a
-150B parameter, decoder-only model that has been trained and fine-tuned on
-LinkedIn's data and tasks. This model is capable of solving over 30 predictive
-tasks across various segments of the LinkedIn platform, achieving performance
-levels comparable to or exceeding those of current production systems based on
-offline metrics, without task-specific fine-tuning. Notably, each of these
-tasks is conventionally addressed by dedicated models that have been developed
-and maintained over multiple years by teams of a similar or larger size than
-our own.
+Artificial intelligence (AI) is currently based largely on black-box machine
+learning models which lack interpretability. The field of eXplainable AI (XAI)
+strives to address this major concern, being critical in high-stakes areas such
+as the finance, legal and health sectors.
+  We present an approach to defining AI models and their interpretability based
+on category theory. For this we employ the notion of a compositional model,
+which sees a model in terms of formal string diagrams which capture its
+abstract structure together with its concrete implementation. This
+comprehensive view incorporates deterministic, probabilistic and quantum
+models. We compare a wide range of AI models as compositional models, including
+linear and rule-based models, (recurrent) neural networks, transformers, VAEs,
+and causal and DisCoCirc models.
+  Next we give a definition of interpretation of a model in terms of its
+compositional structure, demonstrating how to analyse the interpretability of a
+model, and using this to clarify common themes in XAI. We find that what makes
+the standard 'intrinsically interpretable' models so transparent is brought out
+most clearly diagrammatically. This leads us to the more general notion of
+compositionally-interpretable (CI) models, which additionally include, for
+instance, causal, conceptual space, and DisCoCirc models.
+  We next demonstrate the explainability benefits of CI models. Firstly, their
+compositional structure may allow the computation of other quantities of
+interest, and may facilitate inference from the model to the modelled
+phenomenon by matching its structure. Secondly, they allow for diagrammatic
+explanations for their behaviour, based on influence constraints, diagram
+surgery and rewrite explanations. Finally, we discuss many future directions
+for the approach, raising the question of how to learn such meaningfully
+structured models in practice.
 
-摘要：排名和推薦系統是許多線上體驗的基礎，從搜尋結果到個人化內容傳遞。
-這些系統已演變成複雜的多層架構，利用龐大的資料集，並經常納入數千個預測模型。
-這些模型的維護和增強是一個勞力密集的過程，需要廣泛的特徵工程。
-這種方法不僅加劇了技術債務，也阻礙了將這些系統擴展到新興問題領域的創新。
-在此報告中，我們提出了我們的研究，以利用具有文字介面的大型基礎模型來解決這些挑戰，以進行排名和推薦任務。
-我們說明了我們方法的幾個主要優點：(1) 單一模型可以管理排名和推薦中涉及的多個預測任務，(2) 由於解碼器模型具有文字介面，因此它們對推理能力的理解，可以推廣到新的推薦表面和領域外問題，以及 (3) 通過採用自然語言介面進行任務定義和表達成員行為及其社交連接，我們消除了對特徵工程和維護複雜的模型相依性有向無環圖的需求。
-我們介紹了我們的研究前製作業模型 360Brew V1.0，這是一個 150B 參數，僅解碼器模型，已在 LinkedIn 的資料和任務上進行訓練和微調。
-此模型能夠解決 LinkedIn 平臺各個區塊中超過 30 個預測任務，在不針對任務進行微調的情況下，達到與基於離線指標的現行製作系統相當或超越的效能水準。
-值得注意的是，這些任務中的每個任務通常由專用模型處理，這些模型是由與我們規模相當或更大的團隊在多年間開發和維護的。
+摘要：<paragraph>人工智慧（AI）目前在很大程度上依賴於缺乏可解釋性的黑盒機器學習模型。可解釋性人工智慧（XAI）領域致力於解決這個主要問題，這在金融、法律和健康等高風險領域至關重要。
+我們提出了一種基於範疇論定義 AI 模型及其可解釋性的方法。為此，我們採用組合模型的概念，它以形式弦圖的形式看待模型，這些弦圖捕獲了模型的抽象結構及其具體實現。這種綜合觀點包含了確定性、概率性和量子模型。我們將各種 AI 模型作為組合模型進行比較，包括線性和基於規則的模型、（遞迴）神經網路、Transformer、VAE，以及因果和 DisCoCirc 模型。
+接下來，我們根據模型的組合結構給出模型解釋的定義，展示如何分析模型的可解釋性，並使用它來澄清 XAI 中的常見主題。我們發現，讓標準的「內在可解釋」模型如此透明的原因在圖表中表現得最為清楚。這引導我們得出更一般的組合可解釋（CI）模型概念，它另外還包括因果、概念空間和 DisCoCirc 模型。
+接下來，我們展示了 CI 模型的可解釋性優勢。首先，它們的組合結構允許計算其他感興趣的量，並可能通過匹配模型的結構來促進從模型到被建模現象的推理。其次，它們允許對其行為進行圖解說明，這些說明基於影響約束、圖解手術和重寫說明。最後，我們討論了這種方法的許多未來方向，提出了如何在實踐中學習這種有意義的結構化模型的問題。</paragraph>
 
-##### **Raiders of the Lost Dependency: Fixing Dependency Conflicts in Python using LLMs**
-2501.16191v1 by Antony Bartlett, Cynthia Liem, Annibale Panichella
+##### **Slicing Through Bias: Explaining Performance Gaps in Medical Image Analysis using Slice Discovery Methods**
+2406.12142v2 by Vincent Olesen, Nina Weng, Aasa Feragen, Eike Petersen
 
-Fixing Python dependency issues is a tedious and error-prone task for
-developers, who must manually identify and resolve environment dependencies and
-version constraints of third-party modules and Python interpreters. Researchers
-have attempted to automate this process by relying on large knowledge graphs
-and database lookup tables. However, these traditional approaches face
-limitations due to the variety of dependency error types, large sets of
-possible module versions, and conflicts among transitive dependencies. This
-study explores the potential of using large language models (LLMs) to
-automatically fix dependency issues in Python programs. We introduce PLLM
-(pronounced "plum"), a novel technique that employs retrieval-augmented
-generation (RAG) to help an LLM infer Python versions and required modules for
-a given Python file. PLLM builds a testing environment that iteratively (1)
-prompts the LLM for module combinations, (2) tests the suggested changes, and
-(3) provides feedback (error messages) to the LLM to refine the fix. This
-feedback cycle leverages natural language processing (NLP) to intelligently
-parse and interpret build error messages. We benchmark PLLM on the Gistable
-HG2.9K dataset, a collection of challenging single-file Python gists. We
-compare PLLM against two state-of-the-art automatic dependency inference
-approaches, namely PyEGo and ReadPyE, w.r.t. the ability to resolve dependency
-issues. Our results indicate that PLLM can fix more dependency issues than the
-two baselines, with +218 (+15.97%) more fixes over ReadPyE and +281 (+21.58%)
-over PyEGo. Our deeper analyses suggest that PLLM is particularly beneficial
-for projects with many dependencies and for specific third-party numerical and
-machine-learning modules. Our findings demonstrate the potential of LLM-based
-approaches to iteratively resolve Python dependency issues.
+Machine learning models have achieved high overall accuracy in medical image
+analysis. However, performance disparities on specific patient groups pose
+challenges to their clinical utility, safety, and fairness. This can affect
+known patient groups - such as those based on sex, age, or disease subtype - as
+well as previously unknown and unlabeled groups. Furthermore, the root cause of
+such observed performance disparities is often challenging to uncover,
+hindering mitigation efforts. In this paper, to address these issues, we
+leverage Slice Discovery Methods (SDMs) to identify interpretable
+underperforming subsets of data and formulate hypotheses regarding the cause of
+observed performance disparities. We introduce a novel SDM and apply it in a
+case study on the classification of pneumothorax and atelectasis from chest
+x-rays. Our study demonstrates the effectiveness of SDMs in hypothesis
+formulation and yields an explanation of previously observed but unexplained
+performance disparities between male and female patients in widely used chest
+X-ray datasets and models. Our findings indicate shortcut learning in both
+classification tasks, through the presence of chest drains and ECG wires,
+respectively. Sex-based differences in the prevalence of these shortcut
+features appear to cause the observed classification performance gap,
+representing a previously underappreciated interaction between shortcut
+learning and model fairness analyses.
 
-摘要：<paragraph>修復 Python 依賴項問題對開發人員來說是一項繁瑣且容易出錯的任務，他們必須手動識別和解決第三方模組和 Python 解譯器的環境依賴項和版本限制。研究人員已嘗試透過依賴大型知識圖譜和資料庫查詢表來自動化此程序。然而，這些傳統方法由於依賴項錯誤類型多樣、可能的模組版本數量龐大，以及傳遞依賴項之間的衝突，而面臨限制。本研究探討使用大型語言模型 (LLM) 自動修復 Python 程式中的依賴項問題的可能性。我們介紹 PLLM（發音為「plum」），這是一種新穎的技術，採用檢索增強生成 (RAG) 來協助 LLM 推論 Python 版本和給定 Python 檔案所需的模組。PLLM 建立一個測試環境，反覆 (1) 提示 LLM 模組組合，(2) 測試建議的變更，以及 (3) 提供回饋（錯誤訊息）給 LLM 以改善修正。此回饋循環利用自然語言處理 (NLP) 來智慧解析和詮釋建置錯誤訊息。我們在 Gistable HG2.9K 資料集上對 PLLM 進行基準測試，該資料集是一個具有挑戰性的單一檔案 Python gist 集合。我們將 PLLM 與兩種最先進的自動依賴項推論方法進行比較，即 PyEGo 和 ReadPyE，以比較解決依賴項問題的能力。我們的結果顯示，PLLM 可以修復比這兩個基準更多的依賴項問題，比 ReadPyE 多修復了 +218 (+15.97%) 個，比 PyEGo 多修復了 +281 (+21.58%) 個。我們更深入的分析表明，PLLM 對具有許多依賴項的專案以及特定第三方數值和機器學習模組特別有益。我們的研究結果證明了基於 LLM 的方法反覆解決 Python 依賴項問題的可能性。</paragraph>
+摘要：機器學習模型在醫學影像分析中已達到整體高準確度。然而，特定患者群體的效能差異對其臨床效用、安全性與公平性構成挑戰。這可能會影響已知的患者群體（例如基於性別、年齡或疾病亞型）以及先前未知且未標籤的群體。此外，此類觀察到的效能差異的根本原因通常難以發現，阻礙了緩解措施。在本文中，為了解決這些問題，我們利用切片發現方法 (SDM) 來識別可解釋的資料效能不佳子集，並針對觀察到的效能差異原因制定假設。我們引入一種新的 SDM，並在胸部 X 光片中肺炎和肺不張分類的案例研究中應用它。我們的研究證明了 SDM 在假設制定中的有效性，並對廣泛使用的胸部 X 光片資料集和模型中先前觀察到但無法解釋的男性和女性患者之間的效能差異提供了解釋。我們的發現表明，在分類任務中，透過胸腔引流管和心電圖導線的存在，存在捷徑學習。這些捷徑特徵的盛行率存在基於性別的差異，似乎會導致觀察到的分類效能差距，這代表捷徑學習和模型公平性分析之間先前未受到重視的交互作用。
 
-##### **Harnessing Diverse Perspectives: A Multi-Agent Framework for Enhanced Error Detection in Knowledge Graphs**
-2501.15791v1 by Yu Li, Yi Huang, Guilin Qi, Junlan Feng, Nan Hu, Songlin Zhai, Haohan Xue, Yongrui Chen, Ruoyan Shen, Tongtong Wu
+##### **Unlocking the Potential of Metaverse in Innovative and Immersive Digital Health**
+2406.07114v2 by Fatemeh Ebrahimzadeh, Ramin Safa
 
-Knowledge graphs are widely used in industrial applications, making error
-detection crucial for ensuring the reliability of downstream applications.
-Existing error detection methods often fail to effectively leverage
-fine-grained subgraph information and rely solely on fixed graph structures,
-while also lacking transparency in their decision-making processes, which
-results in suboptimal detection performance. In this paper, we propose a novel
-Multi-Agent framework for Knowledge Graph Error Detection (MAKGED) that
-utilizes multiple large language models (LLMs) in a collaborative setting. By
-concatenating fine-grained, bidirectional subgraph embeddings with LLM-based
-query embeddings during training, our framework integrates these
-representations to produce four specialized agents. These agents utilize
-subgraph information from different dimensions to engage in multi-round
-discussions, thereby improving error detection accuracy and ensuring a
-transparent decision-making process. Extensive experiments on FB15K and WN18RR
-demonstrate that MAKGED outperforms state-of-the-art methods, enhancing the
-accuracy and robustness of KG evaluation. For specific industrial scenarios,
-our framework can facilitate the training of specialized agents using
-domain-specific knowledge graphs for error detection, which highlights the
-potential industrial application value of our framework. Our code and datasets
-are available at https://github.com/kse-ElEvEn/MAKGED.
+The concept of Metaverse has attracted a lot of attention in various fields
+and one of its important applications is health and treatment. The Metaverse
+has enormous potential to transform healthcare by changing patient care,
+medical education, and the way teaching/learning and research are done. The
+purpose of this research is to provide an introduction to the basic concepts
+and fundamental technologies of the Metaverse. This paper examines the pros and
+cons of the Metaverse in healthcare context and analyzes its potential from the
+technology and AI perspective. In particular, the role of machine learning
+methods is discussed; We will explain how machine learning algorithms can be
+applied to the Metaverse generated data to gain better insights in healthcare
+applications. Additionally, we examine the future visions of the Metaverse in
+health delivery, by examining emerging technologies such as blockchain and also
+addressing privacy concerns. The findings of this study contribute to a deeper
+understanding of the applications of Metaverse in healthcare and its potential
+to revolutionize the delivery of medical services.
 
-摘要：知識圖譜廣泛應用於工業應用中，使得錯誤偵測對於確保下游應用的可靠性至關重要。現有的錯誤偵測方法通常無法有效利用細粒度的子圖資訊，並且僅依賴於固定的圖形結構，同時在它們的決策過程中也缺乏透明度，這導致次佳的偵測效能。在本文中，我們提出了一個用於知識圖譜錯誤偵測 (MAKGED) 的新多代理架構，它在協作設定中利用了多個大型語言模型 (LLM)。透過在訓練期間將細粒度、雙向子圖嵌入與基於 LLM 的查詢嵌入串接，我們的架構整合了這些表示以產生四個專門代理。這些代理利用不同維度的子圖資訊參與多輪討論，從而提高錯誤偵測準確度並確保透明的決策過程。在 FB15K 和 WN18RR 上的廣泛實驗表明，MAKGED 優於最先進的方法，增強了 KG 評估的準確性和穩健性。對於特定產業情境，我們的架構可以利用特定領域的知識圖譜來促進專門代理的訓練以進行錯誤偵測，這突顯了我們架構的潛在產業應用價值。我們的程式碼和資料集可在 https://github.com/kse-ElEvEn/MAKGED 取得。
+摘要：元宇宙的概念在各個領域都備受關注，其重要應用之一便是醫療保健。元宇宙有巨大的潛力透過改變病患照護、醫學教育，以及教學/學習和研究的方式來轉型醫療保健。本研究的目的是提供元宇宙基本概念和基礎技術的介紹。本文探討了元宇宙在醫療保健背景下的優缺點，並從技術和 AI 的角度分析其潛力。特別是，討論了機器學習方法的角色；我們將說明如何將機器學習演算法應用於元宇宙產生的資料，以獲得醫療保健應用方面的更佳見解。此外，我們透過探討區塊鏈等新興技術，並解決隱私問題，來探討元宇宙在醫療保健方面的未來願景。本研究的發現有助於更深入地了解元宇宙在醫療保健中的應用，以及其在醫療服務提供方面發揮革命性變革的潛力。
 
-##### **Automatic Feedback Generation for Short Answer Questions using Answer Diagnostic Graphs**
-2501.15777v1 by Momoka Furuhashi, Hiroaki Funayama, Yuya Iwase, Yuichiroh Matsubayashi, Yoriko Isobe, Toru Nagahama, Saku Sugawara, Kentaro Inui
+##### **AI-Driven Predictive Analytics Approach for Early Prognosis of Chronic Kidney Disease Using Ensemble Learning and Explainable AI**
+2406.06728v2 by K M Tawsik Jawad, Anusha Verma, Fathi Amsaad, Lamia Ashraf
 
-Short-reading comprehension questions help students understand text structure
-but lack effective feedback. Students struggle to identify and correct errors,
-while manual feedback creation is labor-intensive. This highlights the need for
-automated feedback linking responses to a scoring rubric for deeper
-comprehension.
-  Despite advances in Natural Language Processing (NLP), research has focused
-on automatic grading, with limited work on feedback generation. To address
-this, we propose a system that generates feedback for student responses.
-  Our contributions are twofold. First, we introduce the first system for
-feedback on short-answer reading comprehension. These answers are derived from
-the text, requiring structural understanding. We propose an "answer diagnosis
-graph," integrating the text's logical structure with feedback templates. Using
-this graph and NLP techniques, we estimate students' comprehension and generate
-targeted feedback.
-  Second, we evaluate our feedback through an experiment with Japanese high
-school students (n=39). They answered two 70-80 word questions and were divided
-into two groups with minimal academic differences. One received a model answer,
-the other system-generated feedback. Both re-answered the questions, and we
-compared score changes. A questionnaire assessed perceptions and motivation.
-  Results showed no significant score improvement between groups, but
-system-generated feedback helped students identify errors and key points in the
-text. It also significantly increased motivation. However, further refinement
-is needed to enhance text structure understanding.
+Chronic Kidney Disease (CKD) is one of the widespread Chronic diseases with
+no known ultimo cure and high morbidity. Research demonstrates that progressive
+Chronic Kidney Disease (CKD) is a heterogeneous disorder that significantly
+impacts kidney structure and functions, eventually leading to kidney failure.
+With the progression of time, chronic kidney disease has moved from a
+life-threatening disease affecting few people to a common disorder of varying
+severity. The goal of this research is to visualize dominating features,
+feature scores, and values exhibited for early prognosis and detection of CKD
+using ensemble learning and explainable AI. For that, an AI-driven predictive
+analytics approach is proposed to aid clinical practitioners in prescribing
+lifestyle modifications for individual patients to reduce the rate of
+progression of this disease. Our dataset is collected on body vitals from
+individuals with CKD and healthy subjects to develop our proposed AI-driven
+solution accurately. In this regard, blood and urine test results are provided,
+and ensemble tree-based machine-learning models are applied to predict unseen
+cases of CKD. Our research findings are validated after lengthy consultations
+with nephrologists. Our experiments and interpretation results are compared
+with existing explainable AI applications in various healthcare domains,
+including CKD. The comparison shows that our developed AI models, particularly
+the Random Forest model, have identified more features as significant
+contributors than XgBoost. Interpretability (I), which measures the ratio of
+important to masked features, indicates that our XgBoost model achieved a
+higher score, specifically a Fidelity of 98\%, in this metric and naturally in
+the FII index compared to competing models.
 
-摘要：短篇閱讀理解題目有助學生理解文章結構，但缺乏有效的回饋。學生難以找出並更正錯誤，而手動建立回饋又很費力。這突顯了自動化回饋的必要性，將回應連結到評分標準，以獲得更深入的理解。
+摘要：慢性腎臟病 (CKD) 是一種廣泛的慢性疾病，目前尚未找到最終的治療方法，且發病率很高。研究表明，進行性慢性腎臟病 (CKD) 是一種異質性疾病，會顯著影響腎臟結構和功能，最終導致腎衰竭。隨著時間的推移，慢性腎臟病已從影響少數人的致命疾病演變成一種嚴重程度不一的常見疾病。本研究的目標是使用整體學習和可解釋的 AI 來視覺化支配性特徵、特徵分數和值，以進行 CKD 的早期預後和檢測。為此，提出了一種 AI 驅動的預測分析方法，以幫助臨床醫生為個別患者開具生活方式的修改建議，以降低此疾病的進展速度。我們的數據集是從 CKD 患者和健康受試者的身體生命徵象中收集的，以準確開發我們提出的 AI 驅動的解決方案。在這方面，提供了血液和尿液檢測結果，並應用基於集成樹的機器學習模型來預測未見的 CKD 病例。我們的研究結果在與腎臟科醫師進行長時間諮詢後得到驗證。我們的實驗和解釋結果與各種醫療保健領域中現有的可解釋 AI 應用進行了比較，包括 CKD。比較表明，我們開發的 AI 模型，特別是隨機森林模型，已經確定了比 XgBoost 更多的特徵作為顯著的貢獻者。可解釋性 (I) 衡量重要特徵與被遮蔽特徵的比率，表明我們的 XgBoost 模型在此指標中取得了更高的分數，特別是 98% 的保真度，並且在 FII 指數中自然高於競爭模型。
+
+##### **Explainable AI for Mental Disorder Detection via Social Media: A survey and outlook**
+2406.05984v1 by Yusif Ibrahimov, Tarique Anwar, Tommy Yuan
+
+Mental health constitutes a complex and pervasive global challenge, affecting
+millions of lives and often leading to severe consequences. In this paper, we
+conduct a thorough survey to explore the intersection of data science,
+artificial intelligence, and mental healthcare, focusing on the recent
+developments of mental disorder detection through online social media (OSM). A
+significant portion of the population actively engages in OSM platforms,
+creating a vast repository of personal data that holds immense potential for
+mental health analytics. The paper navigates through traditional diagnostic
+methods, state-of-the-art data- and AI-driven research studies, and the
+emergence of explainable AI (XAI) models for mental healthcare. We review
+state-of-the-art machine learning methods, particularly those based on modern
+deep learning, while emphasising the need for explainability in healthcare AI
+models. The experimental design section provides insights into prevalent
+practices, including available datasets and evaluation approaches. We also
+identify key issues and challenges in the field and propose promising future
+research directions. As mental health decisions demand transparency,
+interpretability, and ethical considerations, this paper contributes to the
+ongoing discourse on advancing XAI in mental healthcare through social media.
+The comprehensive overview presented here aims to guide researchers,
+practitioners, and policymakers in developing the area of mental disorder
+detection.
 
-儘管自然語言處理 (NLP) 有所進展，但研究一直集中在自動評分上，而回饋生成的工作有限。為了解決這個問題，我們提出了一個系統，用於為學生的回答產生回饋。
+摘要：心理健康構成了一項複雜且普遍的全球挑戰，影響了數百萬人的生活，並經常導致嚴重的後果。在本文中，我們進行了一項徹底的調查，以探索數據科學、人工智慧和心理保健的交集，重點關注通過線上社交媒體 (OSM) 進行心理疾病檢測的最新發展。很大一部分人口積極參與 OSM 平台，創造了一個龐大的人員資料庫，對心理健康分析具有巨大的潛力。本文探討了傳統的診斷方法、最先進的資料和 AI 驅動的研究，以及心理保健中可解釋 AI (XAI) 模型的出現。我們回顧了最先進的機器學習方法，特別是那些基於現代深度學習的方法，同時強調了醫療保健 AI 模型中可解釋性的必要性。實驗設計部分提供了對普遍做法的見解，包括可用的資料集和評估方法。我們還找出該領域的主要問題和挑戰，並提出了有希望的未來研究方向。由於心理健康決策需要透明度、可解釋性和道德考量，本文有助於推進心理保健中透過社交媒體推進 XAI 的持續討論。這裡提出的全面概述旨在引導研究人員、從業人員和政策制定者發展心理疾病檢測領域。
 
-我們的貢獻有兩個方面。首先，我們引入了第一個針對簡答閱讀理解提供回饋的系統。這些答案來自於文本，需要結構化的理解。我們提出了一個「答案診斷圖」，將文本的邏輯結構與回饋範本整合在一起。使用這個圖表和 NLP 技術，我們估計學生的理解力並產生有針對性的回饋。
+##### **Methodology and Real-World Applications of Dynamic Uncertain Causality Graph for Clinical Diagnosis with Explainability and Invariance**
+2406.05746v1 by Zhan Zhang, Qin Zhang, Yang Jiao, Lin Lu, Lin Ma, Aihua Liu, Xiao Liu, Juan Zhao, Yajun Xue, Bing Wei, Mingxia Zhang, Ru Gao, Hong Zhao, Jie Lu, Fan Li, Yang Zhang, Yiming Wang, Lei Zhang, Fengwei Tian, Jie Hu, Xin Gou
 
-其次，我們透過一項針對日本高中生的實驗（n=39）來評估我們的回饋。他們回答了兩個 70-80 字的問題，並被分成兩組，學術差異最小。一組收到範本答案，另一組收到系統產生的回饋。兩組都重新回答了問題，我們比較了分數的變化。一份問卷評估了認知和動機。
+AI-aided clinical diagnosis is desired in medical care. Existing deep
+learning models lack explainability and mainly focus on image analysis. The
+recently developed Dynamic Uncertain Causality Graph (DUCG) approach is
+causality-driven, explainable, and invariant across different application
+scenarios, without problems of data collection, labeling, fitting, privacy,
+bias, generalization, high cost and high energy consumption. Through close
+collaboration between clinical experts and DUCG technicians, 46 DUCG models
+covering 54 chief complaints were constructed. Over 1,000 diseases can be
+diagnosed without triage. Before being applied in real-world, the 46 DUCG
+models were retrospectively verified by third-party hospitals. The verified
+diagnostic precisions were no less than 95%, in which the diagnostic precision
+for every disease including uncommon ones was no less than 80%. After
+verifications, the 46 DUCG models were applied in the real-world in China. Over
+one million real diagnosis cases have been performed, with only 17 incorrect
+diagnoses identified. Due to DUCG's transparency, the mistakes causing the
+incorrect diagnoses were found and corrected. The diagnostic abilities of the
+clinicians who applied DUCG frequently were improved significantly. Following
+the introduction to the earlier presented DUCG methodology, the recommendation
+algorithm for potential medical checks is presented and the key idea of DUCG is
+extracted.
 
-結果顯示兩組之間沒有顯著的分數進步，但系統產生的回饋有助於學生找出文本中的錯誤和重點。它也顯著地提高了動機。然而，需要進一步的改進來增強對文本結構的理解。
+摘要：<paragraph>醫療照護中需要 AI 輔助的臨床診斷。現有的深度學習模型缺乏可解釋性，並且主要專注於影像分析。最近開發的動態不確定因果關係圖 (DUCG) 方法是因果驅動的、可解釋的，並且在不同的應用場景中是不變的，沒有資料收集、標記、擬合、隱私、偏見、概化、高成本和高能耗的問題。通過臨床專家和 DUCG 技術人員之間的密切合作，構建了涵蓋 54 個主訴的 46 個 DUCG 模型。可以在沒有分流的情況下診斷出 1,000 多種疾病。在應用於實際世界之前，46 個 DUCG 模型已由第三方醫院回溯性驗證。驗證的診斷精度不低於 95%，其中包括罕見疾病在內的每種疾病的診斷精度不低於 80%。驗證後，46 個 DUCG 模型已在中國實際應用。已經執行了超過一百萬個真實診斷案例，僅發現 17 個不正確的診斷。由於 DUCG 的透明性，發現並糾正了導致不正確診斷的錯誤。頻繁應用 DUCG 的臨床醫生的診斷能力得到了顯著提高。在介紹了前面提出的 DUCG 方法論之後，提出了潛在健康檢查的推薦演算法，並提取了 DUCG 的關鍵思想。</paragraph>
 
-##### **Transformer-Based Multimodal Knowledge Graph Completion with Link-Aware Contexts**
-2501.15688v1 by Haodi Ma, Dzmitry Kasinets, Daisy Zhe Wang
+##### **Advancing Histopathology-Based Breast Cancer Diagnosis: Insights into Multi-Modality and Explainability**
+2406.12897v1 by Faseela Abdullakutty, Younes Akbari, Somaya Al-Maadeed, Ahmed Bouridane, Rifat Hamoudi
 
-Multimodal knowledge graph completion (MMKGC) aims to predict missing links
-in multimodal knowledge graphs (MMKGs) by leveraging information from various
-modalities alongside structural data. Existing MMKGC approaches primarily
-extend traditional knowledge graph embedding (KGE) models, which often require
-creating an embedding for every entity. This results in large model sizes and
-inefficiencies in integrating multimodal information, particularly for
-real-world graphs. Meanwhile, Transformer-based models have demonstrated
-competitive performance in knowledge graph completion (KGC). However, their
-focus on single-modal knowledge limits their capacity to utilize cross-modal
-information. Recently, Large vision-language models (VLMs) have shown potential
-in cross-modal tasks but are constrained by the high cost of training. In this
-work, we propose a novel approach that integrates Transformer-based KGE models
-with cross-modal context generated by pre-trained VLMs, thereby extending their
-applicability to MMKGC. Specifically, we employ a pre-trained VLM to transform
-relevant visual information from entities and their neighbors into textual
-sequences. We then frame KGC as a sequence-to-sequence task, fine-tuning the
-model with the generated cross-modal context. This simple yet effective method
-significantly reduces model size compared to traditional KGE approaches while
-achieving competitive performance across multiple large-scale datasets with
-minimal hyperparameter tuning.
+It is imperative that breast cancer is detected precisely and timely to
+improve patient outcomes. Diagnostic methodologies have traditionally relied on
+unimodal approaches; however, medical data analytics is integrating diverse
+data sources beyond conventional imaging. Using multi-modal techniques,
+integrating both image and non-image data, marks a transformative advancement
+in breast cancer diagnosis. The purpose of this review is to explore the
+burgeoning field of multimodal techniques, particularly the fusion of
+histopathology images with non-image data. Further, Explainable AI (XAI) will
+be used to elucidate the decision-making processes of complex algorithms,
+emphasizing the necessity of explainability in diagnostic processes. This
+review utilizes multi-modal data and emphasizes explainability to enhance
+diagnostic accuracy, clinician confidence, and patient engagement, ultimately
+fostering more personalized treatment strategies for breast cancer, while also
+identifying research gaps in multi-modality and explainability, guiding future
+studies, and contributing to the strategic direction of the field.
 
-摘要：多模態知識圖譜補全 (MMKGC) 旨在透過利用來自各種模態與結構化資料的資訊，來預測多模態知識圖譜 (MMKG) 中的缺失連結。現有的 MMKGC 方法主要擴充傳統的知識圖譜嵌入 (KGE) 模型，這些模型通常需要為每個實體建立一個嵌入。這會導致模型尺寸過大，且在整合多模態資訊時效率低下，特別是對於真實世界的圖譜。與此同時，基於 Transformer 的模型已在知識圖譜補全 (KGC) 中展現出競爭力。然而，它們著重於單模態知識，限制了它們利用跨模態資訊的能力。最近，大型視覺語言模型 (VLM) 已在跨模態任務中展現潛力，但受限於訓練成本過高。在這項工作中，我們提出了一種創新的方法，它將基於 Transformer 的 KGE 模型與預先訓練的 VLM 所產生的跨模態內容整合在一起，從而擴展它們在 MMKGC 中的適用性。具體來說，我們採用預先訓練的 VLM，將實體及其鄰居相關的視覺資訊轉換成文字序列。然後，我們將 KGC 架構成一個序列到序列的任務，並使用產生的跨模態內容微調模型。這種簡單但有效的方法，與傳統的 KGE 方法相比，大幅減少了模型尺寸，同時在多個大型資料集上達到了競爭力的效能，且只需最少的超參數調整。
+摘要：精確且及時地偵測乳癌對於改善患者預後至關重要。診斷方法傳統上依賴於單一模式方法；然而，醫療資料分析正在整合超越傳統影像的各種資料來源。使用整合影像和非影像資料的多模式技術，標誌著乳癌診斷的變革性進展。本篇綜述的目的是探討多模式技術的新興領域，特別是將組織病理學影像與非影像資料融合。此外，可解釋人工智慧 (XAI) 將用於闡明複雜演算法的決策過程，強調診斷過程中可解釋性的必要性。本綜述利用多模式資料並強調可解釋性，以提高診斷準確性、臨床醫師的信心和患者參與度，最終促進乳癌更個人化的治療策略，同時也找出多模式和可解釋性的研究差距，引導未來的研究，並為該領域的策略方向做出貢獻。
 
-##### **How to Mitigate Information Loss in Knowledge Graphs for GraphRAG: Leveraging Triple Context Restoration and Query-Driven Feedback**
-2501.15378v1 by Manzong Huang, Chenyang Bu, Yi He, Xindong Wu
+##### **Using Explainable AI for EEG-based Reduced Montage Neonatal Seizure Detection**
+2406.16908v3 by Dinuka Sandun Udayantha, Kavindu Weerasinghe, Nima Wickramasinghe, Akila Abeyratne, Kithmin Wickremasinghe, Jithangi Wanigasinghe, Anjula De Silva, Chamira U. S. Edussooriya
 
-Knowledge Graph (KG)-augmented Large Language Models (LLMs) have recently
-propelled significant advances in complex reasoning tasks, thanks to their
-broad domain knowledge and contextual awareness. Unfortunately, current methods
-often assume KGs to be complete, which is impractical given the inherent
-limitations of KG construction and the potential loss of contextual cues when
-converting unstructured text into entity-relation triples. In response, this
-paper proposes the Triple Context Restoration and Query-driven Feedback
-(TCR-QF) framework, which reconstructs the textual context underlying each
-triple to mitigate information loss, while dynamically refining the KG
-structure by iteratively incorporating query-relevant missing knowledge.
-Experiments on five benchmark question-answering datasets substantiate the
-effectiveness of TCR-QF in KG and LLM integration, where itachieves a 29.1%
-improvement in Exact Match and a 15.5% improvement in F1 over its
-state-of-the-art GraphRAG competitors.
+The neonatal period is the most vulnerable time for the development of
+seizures. Seizures in the immature brain lead to detrimental consequences,
+therefore require early diagnosis. The gold-standard for neonatal seizure
+detection currently relies on continuous video-EEG monitoring; which involves
+recording multi-channel electroencephalogram (EEG) alongside real-time video
+monitoring within a neonatal intensive care unit (NICU). However, video-EEG
+monitoring technology requires clinical expertise and is often limited to
+technologically advanced and resourceful settings. Cost-effective new
+techniques could help the medical fraternity make an accurate diagnosis and
+advocate treatment without delay. In this work, a novel explainable deep
+learning model to automate the neonatal seizure detection process with a
+reduced EEG montage is proposed, which employs convolutional nets, graph
+attention layers, and fully connected layers. Beyond its ability to detect
+seizures in real-time with a reduced montage, this model offers the unique
+advantage of real-time interpretability. By evaluating the performance on the
+Zenodo dataset with 10-fold cross-validation, the presented model achieves an
+absolute improvement of 8.31% and 42.86% in area under curve (AUC) and recall,
+respectively.
 
-摘要：知識圖譜 (KG) 增強大型語言模型 (LLM) 最近推動複雜推理任務的重大進展，這要歸功於它們廣泛的領域知識和語境感知。不幸的是，目前的模型通常假設 KG 是完整的，這在考慮到 KG 建構的固有限制和在將非結構化文字轉換為實體關係三元組時潛在的語境線索損失時是不切實際的。為了解決這個問題，本文提出了三元組語境還原和查詢驅動回饋 (TCR-QF) 架構，它重建每個三元組底層的文字語境以減輕資訊損失，同時透過反覆納入與查詢相關的遺失知識來動態優化 KG 結構。在五個基準問題回答資料集上的實驗證實了 TCR-QF 在 KG 和 LLM 整合方面的有效性，它在 Exact Match 中獲得 29.1% 的改進，在 F1 中獲得 15.5% 的改進，優於最先進的 GraphRAG 競爭對手。
+摘要：新生兒期是大腦發育最脆弱的時期，容易出現癲癇發作。大腦發育不成熟時出現癲癇發作會造成不良後果，因此需要及早診斷。目前新生兒癲癇發作的黃金標準依賴於連續的視訊腦電圖 (EEG) 監測；其中包括在新生兒加護病房 (NICU) 內同時進行多頻道腦電圖 (EEG) 記錄和即時視訊監控。然而，視訊腦電圖監控技術需要臨床專業知識，而且通常僅限於技術先進且資源豐富的環境。具成本效益的新技術可以幫助醫療界準確診斷並立即提倡治療。在這項工作中，提出了一個新穎的可解釋深度學習模型，以自動化新生兒癲癇發作偵測過程，並採用減少的腦電圖裝置，其中採用了卷積神經網路、圖形注意力層和全連接層。除了能夠使用減少的裝置即時偵測癲癇發作外，此模型還提供了即時可解釋性的獨特優勢。透過在 Zenodo 資料集上使用 10 倍交叉驗證評估效能，所提出的模型在曲線下面積 (AUC) 和召回率方面分別達到了 8.31% 和 42.86% 的絕對改善。
 
-##### **Explaining Categorical Feature Interactions Using Graph Covariance and LLMs**
-2501.14932v1 by Cencheng Shen, Darren Edge, Jonathan Larson, Carey E. Priebe
+##### **Breast Cancer Diagnosis: A Comprehensive Exploration of Explainable Artificial Intelligence (XAI) Techniques**
+2406.00532v1 by Samita Bai, Sidra Nasir, Rizwan Ahmed Khan, Sheeraz Arif, Alexandre Meyer, Hubert Konik
 
-Modern datasets often consist of numerous samples with abundant features and
-associated timestamps. Analyzing such datasets to uncover underlying events
-typically requires complex statistical methods and substantial domain
-expertise. A notable example, and the primary data focus of this paper, is the
-global synthetic dataset from the Counter Trafficking Data Collaborative (CTDC)
--- a global hub of human trafficking data containing over 200,000 anonymized
-records spanning from 2002 to 2022, with numerous categorical features for each
-record. In this paper, we propose a fast and scalable method for analyzing and
-extracting significant categorical feature interactions, and querying large
-language models (LLMs) to generate data-driven insights that explain these
-interactions. Our approach begins with a binarization step for categorical
-features using one-hot encoding, followed by the computation of graph
-covariance at each time. This graph covariance quantifies temporal changes in
-dependence structures within categorical data and is established as a
-consistent dependence measure under the Bernoulli distribution. We use this
-measure to identify significant feature pairs, such as those with the most
-frequent trends over time or those exhibiting sudden spikes in dependence at
-specific moments. These extracted feature pairs, along with their timestamps,
-are subsequently passed to an LLM tasked with generating potential explanations
-of the underlying events driving these dependence changes. The effectiveness of
-our method is demonstrated through extensive simulations, and its application
-to the CTDC dataset reveals meaningful feature pairs and potential data stories
-underlying the observed feature interactions.
+Breast cancer (BC) stands as one of the most common malignancies affecting
+women worldwide, necessitating advancements in diagnostic methodologies for
+better clinical outcomes. This article provides a comprehensive exploration of
+the application of Explainable Artificial Intelligence (XAI) techniques in the
+detection and diagnosis of breast cancer. As Artificial Intelligence (AI)
+technologies continue to permeate the healthcare sector, particularly in
+oncology, the need for transparent and interpretable models becomes imperative
+to enhance clinical decision-making and patient care. This review discusses the
+integration of various XAI approaches, such as SHAP, LIME, Grad-CAM, and
+others, with machine learning and deep learning models utilized in breast
+cancer detection and classification. By investigating the modalities of breast
+cancer datasets, including mammograms, ultrasounds and their processing with
+AI, the paper highlights how XAI can lead to more accurate diagnoses and
+personalized treatment plans. It also examines the challenges in implementing
+these techniques and the importance of developing standardized metrics for
+evaluating XAI's effectiveness in clinical settings. Through detailed analysis
+and discussion, this article aims to highlight the potential of XAI in bridging
+the gap between complex AI models and practical healthcare applications,
+thereby fostering trust and understanding among medical professionals and
+improving patient outcomes.
 
-摘要：現代資料集通常包含許多具有豐富特徵和關聯時間戳的樣本。分析此類資料集以揭示底層事件通常需要複雜的統計方法和大量的領域專業知識。一個值得注意的範例，也是本文的主要資料重點，是來自反人口販運資料合作組織 (CTDC) 的全球合成資料集，這是全球人口販運資料的樞紐，包含超過 200,000 筆從 2002 年到 2022 年的匿名記錄，每個記錄都有許多分類特徵。在本文中，我們提出了一種快速且可擴充的方法，用於分析和提取重要的分類特徵交互作用，並查詢大型語言模型 (LLM)，以產生資料驅動的見解來解釋這些交互作用。我們的做法從使用獨熱編碼對分類特徵進行二元化步驟開始，然後在每個時間點計算圖形共變異數。此圖形共變異數量化了分類資料中依賴結構的時間變化，並在伯努利分佈下建立為一致的依賴度量。我們使用此度量來識別重要的特徵對，例如隨時間推移趨勢最頻繁的特徵對，或在特定時刻表現出依賴性突然激增的特徵對。這些提取的特徵對及其時間戳隨後傳遞給 LLM，後者負責產生對驅動這些依賴性變化的底層事件的潛在解釋。我們的方法的有效性已通過廣泛的模擬得到證明，其在 CTDC 資料集中的應用揭示了有意義的特徵對和潛在的資料故事，這些故事是觀察到的特徵交互作用的基礎。
+摘要：乳癌 (BC) 是影響全球女性最常見的惡性腫瘤之一，因此需要進步的診斷方法，以改善臨床結果。本文全面探討了可解釋人工智慧 (XAI) 技術在乳癌偵測和診斷中的應用。隨著人工智慧 (AI) 技術持續滲透醫療保健領域，特別是在腫瘤學中，透明且可解釋的模型需求變得勢在必行，以增強臨床決策制定和患者照護。此篇評論探討了各種 XAI 方法的整合，例如 SHAP、LIME、Grad-CAM 等，以及用於乳癌偵測和分類的機器學習和深度學習模型。透過探討乳癌資料集的模式，包括乳房攝影、超音波及其在 AI 中的處理，本文重點說明 XAI 如何能導致更準確的診斷和個人化治療計畫。它也探討了實施這些技術的挑戰，以及制定標準化評量指標以評估 XAI 在臨床環境中的有效性的重要性。透過詳細的分析和討論，本文旨在強調 XAI 在縮小複雜 AI 模型與實務醫療保健應用之間差距的潛力，進而促進醫療專業人員之間的信任與理解，並改善患者的結果。
 
-##### **Causal Graphs Meet Thoughts: Enhancing Complex Reasoning in Graph-Augmented LLMs**
-2501.14892v1 by Hang Luo, Jian Zhang, Chujun Li
+##### **Unveiling Hidden Factors: Explainable AI for Feature Boosting in Speech Emotion Recognition**
+2406.01624v2 by Alaa Nfissi, Wassim Bouachir, Nizar Bouguila, Brian Mishara
 
-In knowledge-intensive tasks, especially in high-stakes domains like medicine
-and law, it is critical not only to retrieve relevant information but also to
-provide causal reasoning and explainability. Large language models (LLMs) have
-achieved remarkable performance in natural language understanding and
-generation tasks. However, they often suffer from limitations such as
-difficulty in incorporating new knowledge, generating hallucinations, and
-explaining their reasoning process. To address these challenges, integrating
-knowledge graphs with Graph Retrieval-Augmented Generation (Graph RAG) has
-emerged as an effective solution. Traditional Graph RAG methods often rely on
-simple graph traversal or semantic similarity, which do not capture causal
-relationships or align well with the model's internal reasoning steps. This
-paper proposes a novel pipeline that filters large knowledge graphs to
-emphasize cause-effect edges, aligns the retrieval process with the model's
-chain-of-thought (CoT), and enhances reasoning through multi-stage path
-improvements. Experiments on medical question-answering tasks show consistent
-gains, with up to a 10\% absolute improvement across multiple large language
-models (LLMs). This approach demonstrates the value of combining causal
-reasoning with stepwise retrieval, leading to more interpretable and logically
-grounded solutions for complex queries.
+Speech emotion recognition (SER) has gained significant attention due to its
+several application fields, such as mental health, education, and
+human-computer interaction. However, the accuracy of SER systems is hindered by
+high-dimensional feature sets that may contain irrelevant and redundant
+information. To overcome this challenge, this study proposes an iterative
+feature boosting approach for SER that emphasizes feature relevance and
+explainability to enhance machine learning model performance. Our approach
+involves meticulous feature selection and analysis to build efficient SER
+systems. In addressing our main problem through model explainability, we employ
+a feature evaluation loop with Shapley values to iteratively refine feature
+sets. This process strikes a balance between model performance and
+transparency, which enables a comprehensive understanding of the model's
+predictions. The proposed approach offers several advantages, including the
+identification and removal of irrelevant and redundant features, leading to a
+more effective model. Additionally, it promotes explainability, facilitating
+comprehension of the model's predictions and the identification of crucial
+features for emotion determination. The effectiveness of the proposed method is
+validated on the SER benchmarks of the Toronto emotional speech set (TESS),
+Berlin Database of Emotional Speech (EMO-DB), Ryerson Audio-Visual Database of
+Emotional Speech and Song (RAVDESS), and Surrey Audio-Visual Expressed Emotion
+(SAVEE) datasets, outperforming state-of-the-art methods. To the best of our
+knowledge, this is the first work to incorporate model explainability into an
+SER framework. The source code of this paper is publicly available via this
+https://github.com/alaaNfissi/Unveiling-Hidden-Factors-Explainable-AI-for-Feature-Boosting-in-Speech-Emotion-Recognition.
 
-摘要：在知識密集型任務中，特別是在醫學和法律等高風險領域，不僅檢索相關資訊至關重要，還必須提供因果推理和可解釋性。大型語言模型 (LLM) 在自然語言理解和生成任務中取得了顯著的表現。然而，它們通常會遇到一些限制，例如難以納入新知識、產生幻覺，以及解釋其推理過程。為了應對這些挑戰，將知識圖與圖形檢索增強生成 (Graph RAG) 整合在一起已成為一種有效的解決方案。傳統的 Graph RAG 方法通常依賴於簡單的圖形遍歷或語義相似性，這無法捕捉因果關係或與模型的內部推理步驟很好地對齊。本文提出了一個新穎的管道，該管道過濾大型知識圖以強調因果邊緣，將檢索過程與模型的思想鏈 (CoT) 對齊，並通過多階段路徑改進來增強推理。在醫療問題解答任務上的實驗顯示出一致的收益，在多個大型語言模型 (LLM) 中絕對改進幅度高達 10%。這種方法展示了將因果推理與逐步檢索相結合的價值，從而為複雜查詢提供更具可解釋性和邏輯依據的解決方案。
+摘要：語音情緒辨識 (SER) 由於其在心理健康、教育和人機互動等多個應用領域而備受關注。然而，SER 系統的準確性受到高維特徵集的阻礙，這些特徵集可能包含不相關和冗餘的資訊。為了克服這個挑戰，本研究提出了一種用於 SER 的迭代特徵提升方法，該方法強調特徵相關性和可解釋性，以增強機器學習模型的效能。我們的做法涉及仔細的特徵選擇和分析，以建立高效的 SER 系統。為了透過模型可解釋性解決我們的核心問題，我們採用了具有 Shapley 值的特徵評估迴圈，以反覆改善特徵集。這個過程在模型效能和透明度之間取得平衡，這使得我們能夠全面了解模型的預測。所提出的方法提供了多項優點，包括識別和移除不相關和冗餘的特徵，從而建立更有效的模型。此外，它促進了可解釋性，有助於理解模型的預測以及識別情緒決定的關鍵特徵。所提出的方法的有效性已在多倫多情緒語音集 (TESS)、柏林情緒語音資料庫 (EMO-DB)、賴爾森音訊視覺情緒語音和歌曲資料庫 (RAVDESS) 和薩里音訊視覺表達情緒 (SAVEE) 資料集的 SER 基準上得到驗證，其效能優於現有方法。據我們所知，這是第一個將模型可解釋性納入 SER 架構的研究。本文的原始碼可透過此連結公開取得：https://github.com/alaaNfissi/Unveiling-Hidden-Factors-Explainable-AI-for-Feature-Boosting-in-Speech-Emotion-Recognition。
 
-##### **GraPPI: A Retrieve-Divide-Solve GraphRAG Framework for Large-scale Protein-protein Interaction Exploration**
-2501.16382v1 by Ziwen Li, Xiang 'Anthony' Chen, Youngseung Jeon
+##### **The Explanation Necessity for Healthcare AI**
+2406.00216v1 by Michail Mamalakis, Héloïse de Vareilles, Graham Murray, Pietro Lio, John Suckling
 
-Drug discovery (DD) has tremendously contributed to maintaining and improving
-public health. Hypothesizing that inhibiting protein misfolding can slow
-disease progression, researchers focus on target identification (Target ID) to
-find protein structures for drug binding. While Large Language Models (LLMs)
-and Retrieval-Augmented Generation (RAG) frameworks have accelerated drug
-discovery, integrating models into cohesive workflows remains challenging. We
-conducted a user study with drug discovery researchers to identify the
-applicability of LLMs and RAGs in Target ID. We identified two main findings:
-1) an LLM should provide multiple Protein-Protein Interactions (PPIs) based on
-an initial protein and protein candidates that have a therapeutic impact; 2)
-the model must provide the PPI and relevant explanations for better
-understanding. Based on these observations, we identified three limitations in
-previous approaches for Target ID: 1) semantic ambiguity, 2) lack of
-explainability, and 3) short retrieval units. To address these issues, we
-propose GraPPI, a large-scale knowledge graph (KG)-based retrieve-divide-solve
-agent pipeline RAG framework to support large-scale PPI signaling pathway
-exploration in understanding therapeutic impacts by decomposing the analysis of
-entire PPI pathways into sub-tasks focused on the analysis of PPI edges.
+Explainability is often critical to the acceptable implementation of
+artificial intelligence (AI). Nowhere is this more important than healthcare
+where decision-making directly impacts patients and trust in AI systems is
+essential. This trust is often built on the explanations and interpretations
+the AI provides. Despite significant advancements in AI interpretability, there
+remains the need for clear guidelines on when and to what extent explanations
+are necessary in the medical context. We propose a novel categorization system
+with four distinct classes of explanation necessity, guiding the level of
+explanation required: patient or sample (local) level, cohort or dataset
+(global) level, or both levels. We introduce a mathematical formulation that
+distinguishes these categories and offers a practical framework for researchers
+to determine the necessity and depth of explanations required in medical AI
+applications. Three key factors are considered: the robustness of the
+evaluation protocol, the variability of expert observations, and the
+representation dimensionality of the application. In this perspective, we
+address the question: When does an AI medical application need to be explained,
+and at what level of detail?
 
-摘要：药物发现 (DD) 极大地促进了公共卫生的维护和改善。研究人员假设抑制蛋白质错误折叠可以减缓疾病进展，因此专注于靶点识别 (Target ID) 以找到用于药物结合的蛋白质结构。虽然大型语言模型 (LLM) 和检索增强生成 (RAG) 框架加速了药物发现，但将模型整合到内聚工作流中仍然具有挑战性。我们与药物发现研究人员进行了一项用户研究，以确定 LLM 和 RAG 在 Target ID 中的适用性。我们确定了两个主要发现：1) LLM 应该基于初始蛋白质和具有治疗作用的蛋白质候选物提供多个蛋白质-蛋白质相互作用 (PPI)；2) 该模型必须提供 PPI 和相关解释以更好地理解。基于这些观察，我们发现了先前 Target ID 方法中的三个局限性：1) 语义歧义，2) 缺乏可解释性，3) 检索单元短。为了解决这些问题，我们提出了 GraPPI，这是一种基于大规模知识图 (KG) 的检索-分解-求解代理管道 RAG 框架，以支持大规模 PPI 信号通路探索，通过将整个 PPI 通路的分析分解为专注于 PPI 边缘分析的子任务来理解治疗影响。
+摘要：可解释性通常对于人工智能 (AI) 的可接受实施至关重要。在医疗保健领域，这一点尤为重要，因为决策直接影响患者，并且对 AI 系统的信任至关重要。这种信任通常建立在 AI 提供的解释和诠释之上。尽管 AI 可解释性取得了重大进展，但仍然需要明确的指导方针，说明在医疗环境中何时以及在多大程度上需要解释。我们提出了一种新颖的分类系统，该系统具有四种不同的解释必要性类别，指导所需的解释级别：患者或样本（局部）级别、队列或数据集（全局）级别，或两个级别。我们引入了一个数学公式，该公式区分了这些类别，并为研究人员提供了一个实用框架，以确定医疗 AI 应用中所需的解释的必要性和深度。考虑了三个关键因素：评估协议的稳健性、专家观察的可变性以及应用程序的表示维数。从这个角度来看，我们解决了这个问题：AI 医疗应用何时需要解释，以及需要解释到何种程度？
 
-##### **Evaluating and Improving Graph to Text Generation with Large Language Models**
-2501.14497v1 by Jie He, Yijun Yang, Wanqiu Long, Deyi Xiong, Victor Gutierrez Basulto, Jeff Z. Pan
+##### **Interdisciplinary Expertise to Advance Equitable Explainable AI**
+2406.18563v1 by Chloe R. Bennett, Heather Cole-Lewis, Stephanie Farquhar, Naama Haamel, Boris Babenko, Oran Lang, Mat Fleck, Ilana Traynis, Charles Lau, Ivor Horn, Courtney Lyles
 
-Large language models (LLMs) have demonstrated immense potential across
-various tasks. However, research for exploring and improving the capabilities
-of LLMs in interpreting graph structures remains limited. To address this gap,
-we conduct a comprehensive evaluation of prompting current open-source LLMs on
-graph-to-text generation tasks. Although we explored the optimal prompting
-strategies and proposed a novel and effective diversity-difficulty-based
-few-shot sample selection method, we found that the improvements from
-tuning-free approaches were incremental, as LLMs struggle with planning on
-complex graphs, particularly those with a larger number of triplets. To further
-improve LLMs in planning with graph sequences and grounding in truth, we
-introduce a new graph-to-text dataset, PlanGTG, annotated with two sub-tasks:
-reordering and attribution. Through extensive automatic and human evaluations,
-we demonstrate significant improvements in the quality of generated text from
-both few-shot learning and fine-tuning perspectives using the PlanGTG dataset.
-Our study paves the way for new research directions in graph-to-text
-generation. PlanGTG datasets can be found in https://github.com/probe2/kg_text.
+The field of artificial intelligence (AI) is rapidly influencing health and
+healthcare, but bias and poor performance persists for populations who face
+widespread structural oppression. Previous work has clearly outlined the need
+for more rigorous attention to data representativeness and model performance to
+advance equity and reduce bias. However, there is an opportunity to also
+improve the explainability of AI by leveraging best practices of social
+epidemiology and health equity to help us develop hypotheses for associations
+found. In this paper, we focus on explainable AI (XAI) and describe a framework
+for interdisciplinary expert panel review to discuss and critically assess AI
+model explanations from multiple perspectives and identify areas of bias and
+directions for future research. We emphasize the importance of the
+interdisciplinary expert panel to produce more accurate, equitable
+interpretations which are historically and contextually informed.
+Interdisciplinary panel discussions can help reduce bias, identify potential
+confounders, and identify opportunities for additional research where there are
+gaps in the literature. In turn, these insights can suggest opportunities for
+AI model improvement.
 
-摘要：大型語言模型（LLM）已在各種任務中展現出巨大的潛力。然而，探索和提升 LLM 在詮釋圖形結構方面的能力的研究仍然有限。為了解決這個差距，我們對提示目前開源的 LLM 執行圖形轉文字生成任務進行全面評估。儘管我們探索了最佳提示策略並提出了一種新穎且有效的基於多樣性難度的少樣本選擇方法，但我們發現無調校方法的改進是漸進的，因為 LLM 難以規劃複雜的圖形，特別是那些具有較多三元組的圖形。為了進一步提升 LLM 在圖形序列規劃和真實依據方面的能力，我們引入了一個新的圖形轉文字資料集 PlanGTG，並註解了兩個子任務：重新排序和歸因。透過廣泛的自動化和人工評估，我們證明了使用 PlanGTG 資料集從少樣本學習和微調角度產生文字的品質有顯著提升。我們的研究為圖形轉文字生成中的新研究方向鋪路。PlanGTG 資料集可以在 https://github.com/probe2/kg_text 中找到。
+摘要：人工智慧 (AI) 領域正快速影響著健康與醫療保健，但對於面臨廣泛結構性壓迫的人群來說，偏見和不良表現依然存在。先前的研究已清楚說明，需要更嚴格地注意資料代表性和模型效能，以促進公平性並減少偏見。然而，我們有機會透過運用社會流行病學和健康公平的最佳實務，來改善 AI 的可解釋性，以幫助我們針對發現的關聯性，發展假設。在本文中，我們專注於可解釋 AI (XAI)，並描述一個跨領域專家小組審查架構，以從多重觀點討論和批判性評估 AI 模型的解釋，並找出偏見領域和未來研究的方向。我們強調跨領域專家小組對於產生更準確、公平的詮釋至關重要，而這些詮釋是根據歷史和脈絡而來的。跨領域小組討論有助於減少偏見、找出潛在的混淆因素，並在文獻中有缺口時找出額外研究的機會。反過來，這些見解可以建議 AI 模型改進的機會。
 
-##### **Fast Think-on-Graph: Wider, Deeper and Faster Reasoning of Large Language Model on Knowledge Graph**
-2501.14300v1 by Xujian Liang, Zhaoquan Gu
+##### **"It depends": Configuring AI to Improve Clinical Usefulness Across Contexts**
+2407.11978v1 by Hubert D. Zając, Jorge M. N. Ribeiro, Silvia Ingala, Simona Gentile, Ruth Wanjohi, Samuel N. Gitau, Jonathan F. Carlsen, Michael B. Nielsen, Tariq O. Andersen
 
-Graph Retrieval Augmented Generation (GRAG) is a novel paradigm that takes
-the naive RAG system a step further by integrating graph information, such as
-knowledge graph (KGs), into large-scale language models (LLMs) to mitigate
-hallucination. However, existing GRAG still encounter limitations: 1) simple
-paradigms usually fail with the complex problems due to the narrow and shallow
-correlations capture from KGs 2) methods of strong coupling with KGs tend to be
-high computation cost and time consuming if the graph is dense. In this paper,
-we propose the Fast Think-on-Graph (FastToG), an innovative paradigm for
-enabling LLMs to think ``community by community" within KGs. To do this,
-FastToG employs community detection for deeper correlation capture and two
-stages community pruning - coarse and fine pruning for faster retrieval.
-Furthermore, we also develop two Community-to-Text methods to convert the graph
-structure of communities into textual form for better understanding by LLMs.
-Experimental results demonstrate the effectiveness of FastToG, showcasing
-higher accuracy, faster reasoning, and better explainability compared to the
-previous works.
+Artificial Intelligence (AI) repeatedly match or outperform radiologists in
+lab experiments. However, real-world implementations of radiological AI-based
+systems are found to provide little to no clinical value. This paper explores
+how to design AI for clinical usefulness in different contexts. We conducted 19
+design sessions and design interventions with 13 radiologists from 7 clinical
+sites in Denmark and Kenya, based on three iterations of a functional AI-based
+prototype. Ten sociotechnical dependencies were identified as crucial for the
+design of AI in radiology. We conceptualised four technical dimensions that
+must be configured to the intended clinical context of use: AI functionality,
+AI medical focus, AI decision threshold, and AI Explainability. We present four
+design recommendations on how to address dependencies pertaining to the medical
+knowledge, clinic type, user expertise level, patient context, and user
+situation that condition the configuration of these technical dimensions.
 
-摘要：圖表檢索增強生成 (GRAG) 是一種新穎的範例，它透過將圖表資訊（例如知識圖表 (KG)) 整合到大型語言模型 (LLM) 中，進一步提升了樸素的 RAG 系統以減輕幻覺。然而，現有的 GRAG 仍會遇到限制：1) 簡單的範例通常會因從 KG 中擷取的關聯性狹隘且淺薄而無法解決複雜的問題 2) 如果圖表很密集，與 KG 強耦合的方法往往會導致高運算成本和耗時。在本文中，我們提出了 Fast Think-on-Graph (FastToG)，這是一種創新的範例，可讓 LLM 在 KG 中「逐個社群」進行思考。為此，FastToG 使用社群偵測來擷取更深入的關聯性，並使用兩個階段的社群修剪（粗略修剪和精細修剪）來加快檢索速度。此外，我們還開發了兩種社群到文字的方法，將社群的圖表結構轉換為文字形式，以便 LLM 更容易理解。實驗結果證明了 FastToG 的有效性，與先前的研究相比，展示出更高的準確性、更快的推理速度和更好的可解釋性。
+摘要：人工智慧（AI）在實驗室實驗中不斷地與放射科醫師匹敵或表現得更出色。然而，發現放射科 AI 為基礎系統的實際執行幾乎沒有提供臨床價值。本文探討如何為 AI 設計在不同情境中臨床上的效用。我們根據功能性 AI 為基礎原型的三次迭代，在丹麥和肯亞的 7 個臨床場域與 13 位放射科醫師進行了 19 次設計會議和設計介入。十個社會技術依賴關係被認為對於放射科中 AI 的設計至關重要。我們概念化了四個技術面向，必須根據預期的臨床使用情境進行設定：AI 功能、AI 醫療重點、AI 決策門檻，以及 AI 可解釋性。我們提出四項設計建議，說明如何處理與醫療知識、診所類型、使用者專業知識等級、患者情境，以及影響這些技術面向設定的使用者情境相關的依賴關係。
 
-##### **Top Ten Challenges Towards Agentic Neural Graph Databases**
-2501.14224v1 by Jiaxin Bai, Zihao Wang, Yukun Zhou, Hang Yin, Weizhi Fei, Qi Hu, Zheye Deng, Jiayang Cheng, Tianshi Zheng, Hong Ting Tsang, Yisen Gao, Zhongwei Xie, Yufei Li, Lixin Fan, Binhang Yuan, Wei Wang, Lei Chen, Xiaofang Zhou, Yangqiu Song
+##### **Improving Health Professionals' Onboarding with AI and XAI for Trustworthy Human-AI Collaborative Decision Making**
+2405.16424v1 by Min Hun Lee, Silvana Xin Yi Choo, Shamala D/O Thilarajah
 
-Graph databases (GDBs) like Neo4j and TigerGraph excel at handling
-interconnected data but lack advanced inference capabilities. Neural Graph
-Databases (NGDBs) address this by integrating Graph Neural Networks (GNNs) for
-predictive analysis and reasoning over incomplete or noisy data. However, NGDBs
-rely on predefined queries and lack autonomy and adaptability. This paper
-introduces Agentic Neural Graph Databases (Agentic NGDBs), which extend NGDBs
-with three core functionalities: autonomous query construction, neural query
-execution, and continuous learning. We identify ten key challenges in realizing
-Agentic NGDBs: semantic unit representation, abductive reasoning, scalable
-query execution, and integration with foundation models like large language
-models (LLMs). By addressing these challenges, Agentic NGDBs can enable
-intelligent, self-improving systems for modern data-driven applications, paving
-the way for adaptable and autonomous data management solutions.
+With advanced AI/ML, there has been growing research on explainable AI (XAI)
+and studies on how humans interact with AI and XAI for effective human-AI
+collaborative decision-making. However, we still have a lack of understanding
+of how AI systems and XAI should be first presented to users without technical
+backgrounds. In this paper, we present the findings of semi-structured
+interviews with health professionals (n=12) and students (n=4) majoring in
+medicine and health to study how to improve onboarding with AI and XAI. For the
+interviews, we built upon human-AI interaction guidelines to create onboarding
+materials of an AI system for stroke rehabilitation assessment and AI
+explanations and introduce them to the participants. Our findings reveal that
+beyond presenting traditional performance metrics on AI, participants desired
+benchmark information, the practical benefits of AI, and interaction trials to
+better contextualize AI performance, and refine the objectives and performance
+of AI. Based on these findings, we highlight directions for improving
+onboarding with AI and XAI and human-AI collaborative decision-making.
 
-摘要：圖形資料庫（GDB），例如 Neo4j 和 TigerGraph，擅長處理相互連接的資料，但缺乏進階的推論能力。神經圖形資料庫（NGDB）透過整合圖形神經網路（GNN）來解決這個問題，以進行預測分析和對不完整或有雜訊的資料進行推理。然而，NGDB 依賴於預先定義的查詢，並且缺乏自主性和適應性。本文介紹了代理神經圖形資料庫（Agentic NGDB），它以三項核心功能擴充了 NGDB：自動查詢建構、神經查詢執行和持續學習。我們找出實現 Agentic NGDB 的十大關鍵挑戰：語義單元表示、演繹推理、可擴充查詢執行，以及與基礎模型（例如大型語言模型 (LLM)）整合。透過解決這些挑戰，Agentic NGDB 可以為現代資料驅動應用打造智慧且自我改善的系統，為適應性和自主資料管理解決方案鋪路。
+摘要：隨著先進的 AI/ML，對可解釋 AI (XAI) 的研究不斷增加，以及關於人類如何與 AI 和 XAI 互動以進行有效的人工智慧協作決策制定。然而，我們仍然缺乏對 AI 系統和 XAI 應如何首先呈現給沒有技術背景的用戶的了解。在本文中，我們展示了與醫療專業人員 (n=12) 和主修醫學和健康的學生 (n=4) 進行半結構化訪談的結果，以研究如何改善 AI 和 XAI 的入門。對於訪談，我們建立在人機互動準則之上，為中風康復評估和 AI 解釋的 AI 系統創建入門材料，並將它們介紹給參與者。我們的研究結果表明，除了呈現傳統的 AI 性能指標外，參與者還希望基准信息、AI 的實際好處以及交互試驗，以更好地將 AI 性能情境化，並完善 AI 的目標和性能。根據這些發現，我們強調了改進 AI 和 XAI 以及人機協作決策制定的入門方向。
 
-##### **GraphRAG under Fire**
-2501.14050v1 by Jiacheng Liang, Yuhui Wang, Changjiang Li, Rongyi Zhu, Tanqiu Jiang, Neil Gong, Ting Wang
+##### **Exploring Nutritional Impact on Alzheimer's Mortality: An Explainable AI Approach**
+2405.17502v1 by Ziming Liu, Longjian Liu, Robert E. Heidel, Xiaopeng Zhao
 
-GraphRAG advances retrieval-augmented generation (RAG) by structuring
-external knowledge as multi-scale knowledge graphs, enabling language models to
-integrate both broad context and granular details in their reasoning. While
-GraphRAG has demonstrated success across domains, its security implications
-remain largely unexplored. To bridge this gap, this work examines GraphRAG's
-vulnerability to poisoning attacks, uncovering an intriguing security paradox:
-compared to conventional RAG, GraphRAG's graph-based indexing and retrieval
-enhance resilience against simple poisoning attacks; meanwhile, the same
-features also create new attack surfaces. We present GRAGPoison, a novel attack
-that exploits shared relations in the knowledge graph to craft poisoning text
-capable of compromising multiple queries simultaneously. GRAGPoison employs
-three key strategies: i) relation injection to introduce false knowledge, ii)
-relation enhancement to amplify poisoning influence, and iii) narrative
-generation to embed malicious content within coherent text. Empirical
-evaluation across diverse datasets and models shows that GRAGPoison
-substantially outperforms existing attacks in terms of effectiveness (up to 98%
-success rate) and scalability (using less than 68% poisoning text). We also
-explore potential defensive measures and their limitations, identifying
-promising directions for future research.
+This article uses machine learning (ML) and explainable artificial
+intelligence (XAI) techniques to investigate the relationship between
+nutritional status and mortality rates associated with Alzheimers disease (AD).
+The Third National Health and Nutrition Examination Survey (NHANES III)
+database is employed for analysis. The random forest model is selected as the
+base model for XAI analysis, and the Shapley Additive Explanations (SHAP)
+method is used to assess feature importance. The results highlight significant
+nutritional factors such as serum vitamin B12 and glycated hemoglobin. The
+study demonstrates the effectiveness of random forests in predicting AD
+mortality compared to other diseases. This research provides insights into the
+impact of nutrition on AD and contributes to a deeper understanding of disease
+progression.
 
-摘要：GraphRAG 透過將外部知識結構化為多尺度知識圖譜，推動了檢索增強生成 (RAG)，使語言模型能夠在其推理中整合廣泛的背景和細微的細節。儘管 GraphRAG 在各個領域都已展現出成功，但其安全性影響在很大程度上仍未被探索。為了彌補這一差距，本研究探討了 GraphRAG 對投毒攻擊的脆弱性，揭示了一個有趣的安全悖論：與傳統的 RAG 相比，GraphRAG 基於圖表的索引和檢索增強了對簡單投毒攻擊的韌性；同時，相同的特徵也創造了新的攻擊面。我們提出了 GRAGPoison，這是一種新穎的攻擊，它利用知識圖譜中的共享關係來製作中毒文本，能夠同時危害多個查詢。GRAGPoison 採用了三項關鍵策略：i) 關係注入以引入錯誤的知識，ii) 關係增強以擴大投毒影響，以及 iii) 敘事生成以將惡意內容嵌入連貫的文本中。在各種數據集和模型上的經驗評估表明，GRAGPoison 在有效性（成功率高達 98%）和可擴展性（使用不到 68% 的投毒文本）方面都明顯優於現有的攻擊。我們還探討了潛在的防禦措施及其局限性，確定了未來研究的有希望的方向。
+摘要：本文使用機器學習 (ML) 和可解釋人工智慧 (XAI) 技術來探討營養狀況與阿茲海默症 (AD) 相關的死亡率之間的關係。採用第三次全國健康與營養檢查調查 (NHANES III) 資料庫進行分析。選擇隨機森林模型作為 XAI 分析的基礎模型，並使用 Shapley Additive Explanations (SHAP) 方法來評估特徵重要性。結果突顯了重要的營養因素，例如血清維生素 B12 和糖化血紅蛋白。該研究證明了隨機森林在預測 AD 死亡率方面相較於其他疾病的有效性。本研究提供了營養對 AD 的影響的見解，並有助於更深入地了解疾病的進展。
 
-##### **EICopilot: Search and Explore Enterprise Information over Large-scale Knowledge Graphs with LLM-driven Agents**
-2501.13746v1 by Yuhui Yun, Huilong Ye, Xinru Li, Ruojia Li, Jingfeng Deng, Li Li, Haoyi Xiong
+##### **Explainable AI Enhances Glaucoma Referrals, Yet the Human-AI Team Still Falls Short of the AI Alone**
+2407.11974v1 by Catalina Gomez, Ruolin Wang, Katharina Breininger, Corinne Casey, Chris Bradley, Mitchell Pavlak, Alex Pham, Jithin Yohannan, Mathias Unberath
 
-The paper introduces EICopilot, an novel agent-based solution enhancing
-search and exploration of enterprise registration data within extensive online
-knowledge graphs like those detailing legal entities, registered capital, and
-major shareholders. Traditional methods necessitate text-based queries and
-manual subgraph explorations, often resulting in time-consuming processes.
-EICopilot, deployed as a chatbot via Baidu Enterprise Search, improves this
-landscape by utilizing Large Language Models (LLMs) to interpret natural
-language queries. This solution automatically generates and executes Gremlin
-scripts, providing efficient summaries of complex enterprise relationships.
-Distinct feature a data pre-processing pipeline that compiles and annotates
-representative queries into a vector database of examples for In-context
-learning (ICL), a comprehensive reasoning pipeline combining Chain-of-Thought
-with ICL to enhance Gremlin script generation for knowledge graph search and
-exploration, and a novel query masking strategy that improves intent
-recognition for heightened script accuracy. Empirical evaluations demonstrate
-the superior performance of EICopilot, including speed and accuracy, over
-baseline methods, with the \emph{Full Mask} variant achieving a syntax error
-rate reduction to as low as 10.00% and an execution correctness of up to
-82.14%. These components collectively contribute to superior querying
-capabilities and summarization of intricate datasets, positioning EICopilot as
-a groundbreaking tool in the exploration and exploitation of large-scale
-knowledge graphs for enterprise information search.
+Primary care providers are vital for initial triage and referrals to
+specialty care. In glaucoma, asymptomatic and fast progression can lead to
+vision loss, necessitating timely referrals to specialists. However, primary
+eye care providers may not identify urgent cases, potentially delaying care.
+Artificial Intelligence (AI) offering explanations could enhance their referral
+decisions. We investigate how various AI explanations help providers
+distinguish between patients needing immediate or non-urgent specialist
+referrals. We built explainable AI algorithms to predict glaucoma surgery needs
+from routine eyecare data as a proxy for identifying high-risk patients. We
+incorporated intrinsic and post-hoc explainability and conducted an online
+study with optometrists to assess human-AI team performance, measuring referral
+accuracy and analyzing interactions with AI, including agreement rates, task
+time, and user experience perceptions. AI support enhanced referral accuracy
+among 87 participants (59.9%/50.8% with/without AI), though Human-AI teams
+underperformed compared to AI alone. Participants believed they included AI
+advice more when using the intrinsic model, and perceived it more useful and
+promising. Without explanations, deviations from AI recommendations increased.
+AI support did not increase workload, confidence, and trust, but reduced
+challenges. On a separate test set, our black-box and intrinsic models achieved
+an accuracy of 77% and 71%, respectively, in predicting surgical outcomes. We
+identify opportunities of human-AI teaming for glaucoma management in primary
+eye care, noting that while AI enhances referral accuracy, it also shows a
+performance gap compared to AI alone, even with explanations. Human involvement
+remains essential in medical decision making, underscoring the need for future
+research to optimize collaboration, ensuring positive experiences and safe AI
+use.
 
-摘要：本文介紹了 EICopilot，這是一種基於代理的新型解決方案，可增強在廣泛的線上知識圖譜中搜尋和探索企業註冊資料，例如詳細說明法律實體、註冊資本和主要股東的資料。傳統方法需要基於文字的查詢和手動子圖探索，通常會導致耗時的流程。EICopilot 部署為百度企業搜尋的聊天機器人，透過利用大型語言模型 (LLM) 來詮釋自然語言查詢，進而改善這項技術。此解決方案會自動產生並執行 Gremlin 腳本，提供複雜企業關係的有效摘要。其獨特功能為資料前處理管線，可將具代表性的查詢編譯並註解到範例的向量資料庫中，以進行脈絡中學習 (ICL)，這是一個結合了思考鏈與 ICL 的綜合推理管線，用於增強 Gremlin 腳本產生，以進行知識圖譜搜尋和探索，以及一種新穎的查詢遮罩策略，可改善意圖辨識，進而提高腳本準確度。實證評估顯示，EICopilot 的效能優於基線方法，包括速度和準確度，其中「完整遮罩」變體將語法錯誤率降低至低於 10.00%，執行正確率高達 82.14%。這些元件共同促成了優異的查詢功能和複雜資料集的摘要，將 EICopilot 定位為探索和利用大規模知識圖譜進行企業資訊搜尋的創新工具。
+摘要：<paragraph>初級保健提供者對於最初的分流和轉診到專科照護至關重要。在青光眼的情況下，無症狀且快速惡化可能導致視力喪失，因此需要及時轉診給專家。然而，初級眼科保健提供者可能無法識別緊急情況，可能會延誤照護。提供解釋的人工智慧 (AI) 可以加強他們的轉診決策。我們研究各種 AI 解釋如何幫助提供者區分需要立即或非緊急專科轉診的患者。我們建立了解釋性 AI 演算法，以從例行眼科護理資料預測青光眼手術需求，作為識別高風險患者的代理。我們納入了內在和事後解釋性，並與驗光師進行了一項線上研究，以評估人機團隊的表現，衡量轉診準確度並分析與 AI 的互動，包括同意率、任務時間和使用者體驗感知。在 87 名參與者中，AI 支援提高了轉診準確度（使用 AI/未使用的比例為 59.9%/50.8%），儘管人機團隊的表現不如單獨使用 AI。參與者認為他們在使用內在模型時更多地納入了 AI 建議，並認為它更有用且更有希望。沒有解釋，AI 建議的偏差會增加。AI 支援並未增加工作量、信心和信任，但減少了挑戰。在一個單獨的測試集中，我們的黑盒子和內在模型在預測手術結果方面分別達到了 77% 和 71% 的準確度。我們找出在初級眼科保健中，人機團隊合作管理青光眼的機會，並注意到雖然 AI 提高了轉診準確度，但即使有解釋，它也顯示出與單獨使用 AI 相比的效能差距。人類參與在醫療決策中仍然至關重要，這強調了未來研究優化協作、確保正面經驗和安全使用 AI 的必要性。</paragraph>
 
-##### **Pseudocode-Injection Magic: Enabling LLMs to Tackle Graph Computational Tasks**
-2501.13731v1 by Chang Gong, Wanrui Bian, Zhijie Zhang, Weiguo Zheng
+##### **Decoding Decision Reasoning: A Counterfactual-Powered Model for Knowledge Discovery**
+2406.18552v1 by Yingying Fang, Zihao Jin, Xiaodan Xing, Simon Walsh, Guang Yang
 
-Graph computational tasks are inherently challenging and often demand the
-development of advanced algorithms for effective solutions. With the emergence
-of large language models (LLMs), researchers have begun investigating their
-potential to address these tasks. However, existing approaches are constrained
-by LLMs' limited capability to comprehend complex graph structures and their
-high inference costs, rendering them impractical for handling large-scale
-graphs. Inspired by human approaches to graph problems, we introduce a novel
-framework, PIE (Pseudocode-Injection-Enhanced LLM Reasoning for Graph
-Computational Tasks), which consists of three key steps: problem understanding,
-prompt design, and code generation. In this framework, LLMs are tasked with
-understanding the problem and extracting relevant information to generate
-correct code. The responsibility for analyzing the graph structure and
-executing the code is delegated to the interpreter. We inject task-related
-pseudocodes into the prompts to further assist the LLMs in generating efficient
-code. We also employ cost-effective trial-and-error techniques to ensure that
-the LLM-generated code executes correctly. Unlike other methods that require
-invoking LLMs for each individual test case, PIE only calls the LLM during the
-code generation phase, allowing the generated code to be reused and
-significantly reducing inference costs. Extensive experiments demonstrate that
-PIE outperforms existing baselines in terms of both accuracy and computational
-efficiency.
+In medical imaging, particularly in early disease detection and prognosis
+tasks, discerning the rationale behind an AI model's predictions is crucial for
+evaluating the reliability of its decisions. Conventional explanation methods
+face challenges in identifying discernible decisive features in medical image
+classifications, where discriminative features are subtle or not immediately
+apparent. To bridge this gap, we propose an explainable model that is equipped
+with both decision reasoning and feature identification capabilities. Our
+approach not only detects influential image patterns but also uncovers the
+decisive features that drive the model's final predictions. By implementing our
+method, we can efficiently identify and visualise class-specific features
+leveraged by the data-driven model, providing insights into the decision-making
+processes of deep learning models. We validated our model in the demanding
+realm of medical prognosis task, demonstrating its efficacy and potential in
+enhancing the reliability of AI in healthcare and in discovering new knowledge
+in diseases where prognostic understanding is limited.
+
+摘要：在醫學影像中，特別是在早期疾病檢測和預後任務中，辨別 AI 模型預測背後的原理對於評估其決策的可靠性至關重要。傳統的解釋方法在識別醫學影像分類中可識別的決定性特徵時面臨挑戰，其中區別性特徵很微妙或並不明顯。為了彌合這一差距，我們提出了一個可解釋的模型，該模型具備決策推理和特徵識別能力。我們的做法不僅檢測有影響力的影像模式，還揭示了推動模型最終預測的決定性特徵。通過實施我們的模型，我們可以有效識別和視覺化由數據驅動模型利用的類特定特徵，從而深入了解深度學習模型的決策過程。我們在要求嚴格的醫學預後任務領域驗證了我們的模型，展示了其在提高 AI 在醫療保健中的可靠性和發現預後理解受限疾病的新知識方面的功效和潛力。
+
+##### **The Role of Emotions in Informational Support Question-Response Pairs in Online Health Communities: A Multimodal Deep Learning Approach**
+2405.13099v1 by Mohsen Jozani, Jason A. Williams, Ahmed Aleroud, Sarbottam Bhagat
+
+This study explores the relationship between informational support seeking
+questions, responses, and helpfulness ratings in online health communities. We
+created a labeled data set of question-response pairs and developed multimodal
+machine learning and deep learning models to reliably predict informational
+support questions and responses. We employed explainable AI to reveal the
+emotions embedded in informational support exchanges, demonstrating the
+importance of emotion in providing informational support. This complex
+interplay between emotional and informational support has not been previously
+researched. The study refines social support theory and lays the groundwork for
+the development of user decision aids. Further implications are discussed.
 
-摘要：圖表計算任務本質上具有挑戰性，而且通常需要開發先進的演算法才能有效解決。隨著大型語言模型 (LLM) 的出現，研究人員已開始探討其解決這些任務的可能性。然而，現有方法受到 LLM 理解複雜圖形結構的能力有限以及其高推理成本的限制，這使得它們不切實際地處理大規模圖形。受到人類解決圖形問題的方法啟發，我們引入了 PIE（偽代碼注入增強 LLM 圖形計算任務推理）這個新框架，它包含三個關鍵步驟：問題理解、提示設計和代碼生成。在此框架中，LLM 的任務是理解問題並擷取相關資訊以產生正確的代碼。分析圖形結構和執行代碼的責任委派給解釋器。我們將與任務相關的偽代碼注入提示中，以進一步協助 LLM 產生有效的代碼。我們還採用具有成本效益的試錯技術，以確保 LLM 生成的代碼正確執行。與需要為每個個別測試案例呼叫 LLM 的其他方法不同，PIE 僅在代碼產生階段呼叫 LLM，允許重複使用產生的代碼並大幅降低推理成本。大量的實驗證明，PIE 在準確性和計算效率方面都優於現有的基準。
+摘要：本研究探討線上健康社群中尋求資訊支持的問題、回應，以及有幫助的評分之間的關係。我們建立了一組標記的問答配對資料集，並開發了多模態機器學習和深度學習模型，以可靠地預測資訊支持問題和回應。我們採用可解釋的 AI 來揭示資訊支持交流中蘊含的情緒，證明情緒在提供資訊支持中的重要性。這種情緒支持和資訊支持之間的複雜交互作用以前並未被研究過。本研究改進了社會支持理論，並為使用者決策輔助工具的開發奠定了基礎。討論了進一步的影響。
 
-##### **CAPRAG: A Large Language Model Solution for Customer Service and Automatic Reporting using Vector and Graph Retrieval-Augmented Generation**
-2501.13993v1 by Hamza Landolsi, Kais Letaief, Nizar Taghouti, Ines Abdeljaoued-Tej
+##### **ChatGPT in Classrooms: Transforming Challenges into Opportunities in Education**
+2405.10645v1 by Harris Bin Munawar, Nikolaos Misirlis
 
-The introduction of new features and services in the banking sector often
-overwhelms customers, creating an opportunity for banks to enhance user
-experience through financial chatbots powered by large language models (LLMs).
-We initiated an AI agent designed to provide customers with relevant
-information about banking services and insights from annual reports. We
-proposed a hybrid Customer Analysis Pipeline Retrieval-Augmented Generation
-(CAPRAG) that effectively addresses both relationship-based and contextual
-queries, thereby improving customer engagement in the digital banking
-landscape. To implement this, we developed a processing pipeline to refine text
-data, which we utilized in two main frameworks: Vector RAG and Graph RAG. This
-dual approach enables us to populate both vector and graph databases with
-processed data for efficient retrieval. The Cypher query component is employed
-to effectively query the graph database. When a user submits a query, it is
-first expanded by a query expansion module before being routed to construct a
-final query from the hybrid Knowledge Base (KB). This final query is then sent
-to an open-source LLM for response generation. Overall, our innovative,
-designed to international banks, serves bank's customers in an increasingly
-complex digital environment, enhancing clarity and accessibility of
-information.
+In the era of exponential technology growth, one unexpected guest has claimed
+a seat in classrooms worldwide, Artificial Intelligence. Generative AI, such as
+ChatGPT, promises a revolution in education, yet it arrives with a double-edged
+sword. Its potential for personalized learning is offset by issues of cheating,
+inaccuracies, and educators struggling to incorporate it effectively into their
+lesson design. We are standing on the brink of this educational frontier, and
+it is clear that we need to navigate this terrain with a lot of care. This is a
+major challenge that could undermine the integrity and value of our educational
+process. So, how can we turn these challenges into opportunities? When used
+inappropriately, AI tools can become the perfect tool for the cut copy paste
+mentality, and quickly begin to corrode critical thinking, creativity, and deep
+understanding, the most important skills in our rapidly changing world.
+Teachers feel that they are not equipped to leverage this technology, widening
+the digital divide among educators and institutions. Addressing these concerns
+calls for an in depth research approach. We will employ empirical research,
+drawing on the Technology Acceptance Model, to assess the attitudes toward
+generative AI among educators and students. Understanding their perceptions,
+usage patterns, and hurdles is the first crucial step in creating an effective
+solution. The present study will be used as a process manual for future
+researchers to apply, running their own data, based on the steps explained here
 
-摘要：銀行業中新功能和服務的推出經常讓客戶感到不知所措，這為銀行透過大型語言模型 (LLM) 驅動的金融聊天機器人來提升使用者體驗創造了機會。我們啟動了一個人工智慧代理，旨在為客戶提供有關銀行服務和年度報告見解的相關資訊。我們提出了一個混合式客戶分析管道檢索擴充生成 (CAPRAG)，它有效地處理基於關係和情境式的查詢，從而提升數位銀行環境中的客戶參與度。為了實作這一點，我們開發了一個處理管道來精煉文字資料，我們在兩個主要架構中使用它：Vector RAG 和 Graph RAG。這種雙管齊下的方法讓我們能夠使用處理過的資料來填補向量和圖形資料庫，以利於有效檢索。Cypher 查詢元件用於有效查詢圖形資料庫。當使用者提交查詢時，它會先由查詢擴充模組擴充，然後再路由到混合式知識庫 (KB) 中建構最終查詢。然後這個最終查詢會傳送給開源 LLM 以產生回應。整體而言，我們創新的設計服務於國際銀行，在日益複雜的數位環境中服務銀行客戶，提升資訊的清晰度和可及性。
+摘要：在科技飛速發展的時代，一位意外的訪客已在全球教室中佔有一席之地，那就是人工智慧。生成式 AI，例如 ChatGPT，承諾在教育領域掀起一場革命，但它卻是一把雙面刃。它在個人化學習方面的潛力，卻因作弊、不準確以及教育工作者難以將其有效融入教學設計等問題而抵銷。我們正站在這教育前沿的邊緣，顯然我們需要非常小心地探索這片領域。這是一個重大的挑戰，可能會損害我們教育過程的完整性和價值。那麼，我們如何將這些挑戰轉化為機遇？當不適當地使用時，AI 工具可能會成為複製貼上心態的完美工具，並迅速腐蝕批判性思維、創造力和深入理解，這些都是我們快速變化的世界中最重要的技能。教師們覺得他們沒有能力利用這項技術，這擴大了教育工作者和機構之間的數位鴻溝。解決這些問題需要深入的研究方法。我們將採用實證研究，借鑑技術接受模型，來評估教育工作者和學生對生成式 AI 的態度。了解他們的看法、使用模式和障礙是創造有效解決方案的第一個關鍵步驟。本研究將作為未來研究人員應用的流程手冊，根據此處說明的步驟運行他們自己的數據
 
-##### **Dual-Branch HNSW Approach with Skip Bridges and LID-Driven Optimization**
-2501.13992v1 by Hy Nguyen, Nguyen Hung Nguyen, Nguyen Linh Bao Nguyen, Srikanth Thudumu, Hung Du, Rajesh Vasa, Kon Mouzakis
+##### **Evaluating the Explainable AI Method Grad-CAM for Breath Classification on Newborn Time Series Data**
+2405.07590v1 by Camelia Oprea, Mike Grüne, Mateusz Buglowski, Lena Olivier, Thorsten Orlikowsky, Stefan Kowalewski, Mark Schoberer, André Stollenwerk
 
-The Hierarchical Navigable Small World (HNSW) algorithm is widely used for
-approximate nearest neighbor (ANN) search, leveraging the principles of
-navigable small-world graphs. However, it faces some limitations. The first is
-the local optima problem, which arises from the algorithm's greedy search
-strategy, selecting neighbors based solely on proximity at each step. This
-often leads to cluster disconnections. The second limitation is that HNSW
-frequently fails to achieve logarithmic complexity, particularly in
-high-dimensional datasets, due to the exhaustive traversal through each layer.
-To address these limitations, we propose a novel algorithm that mitigates local
-optima and cluster disconnections while enhancing the construction speed,
-maintaining inference speed. The first component is a dual-branch HNSW
-structure with LID-based insertion mechanisms, enabling traversal from multiple
-directions. This improves outlier node capture, enhances cluster connectivity,
-accelerates construction speed and reduces the risk of local minima. The second
-component incorporates a bridge-building technique that bypasses redundant
-intermediate layers, maintaining inference and making up the additional
-computational overhead introduced by the dual-branch structure. Experiments on
-various benchmarks and datasets showed that our algorithm outperforms the
-original HNSW in both accuracy and speed. We evaluated six datasets across
-Computer Vision (CV), and Natural Language Processing (NLP), showing recall
-improvements of 18\% in NLP, and up to 30\% in CV tasks while reducing the
-construction time by up to 20\% and maintaining the inference speed. We did not
-observe any trade-offs in our algorithm. Ablation studies revealed that
-LID-based insertion had the greatest impact on performance, followed by the
-dual-branch structure and bridge-building components.
+With the digitalization of health care systems, artificial intelligence
+becomes more present in medicine. Especially machine learning shows great
+potential for complex tasks such as time series classification, usually at the
+cost of transparency and comprehensibility. This leads to a lack of trust by
+humans and thus hinders its active usage. Explainable artificial intelligence
+tries to close this gap by providing insight into the decision-making process,
+the actual usefulness of its different methods is however unclear. This paper
+proposes a user study based evaluation of the explanation method Grad-CAM with
+application to a neural network for the classification of breaths in time
+series neonatal ventilation data. We present the perceived usefulness of the
+explainability method by different stakeholders, exposing the difficulty to
+achieve actual transparency and the wish for more in-depth explanations by many
+of the participants.
 
-摘要：分層可導航小世界 (HNSW) 演算法廣泛用於近似最近鄰居 (ANN) 搜尋，並利用可導航小世界圖形的原理。然而，它面臨一些限制。第一個是局部最佳化問題，這源自於演算法的貪婪搜尋策略，在每個步驟中僅根據鄰近度來選擇鄰居。這通常會導致群集斷線。第二個限制是，由於透過每一層的窮舉式遍歷，HNSW 常常無法在高維度資料集中達成對數複雜度。為了解決這些限制，我們提出了一種新的演算法，它可以減輕局部最佳化和群集斷線，同時提高建構速度，並維持推論速度。第一個組成部分是一個具有基於 LID 的插入機制的雙分支 HNSW 結構，它能從多個方向進行遍歷。這改善了異常值節點的擷取，增強了群集連通性，加速了建構速度，並降低了局部最小值的風險。第二個組成部分包含一種橋樑建構技術，它繞過了多餘的中間層，維持推論並彌補了雙分支結構所帶來的額外運算負擔。在各種基準和資料集上的實驗顯示，我們的演算法在準確度和速度上都優於原始的 HNSW。我們評估了電腦視覺 (CV) 和自然語言處理 (NLP) 中的六個資料集，顯示 NLP 中的召回率提高了 18%，CV 任務中提高了 30%，同時將建構時間縮短了 20%，並維持了推論速度。我們沒有在我們的演算法中觀察到任何取捨。消融研究顯示，基於 LID 的插入對效能的影響最大，其次是雙分支結構和橋樑建構組成部分。
+摘要：隨著醫療保健系統的數位化，人工智慧在醫學領域中變得更加普及。特別是機器學習在時間序列分類等複雜任務中展現出極大的潛力，但通常是以透明度和可理解性為代價。這導致人類缺乏信任，從而阻礙了其積極使用。可解釋的人工智慧試圖通過提供對決策過程的洞察來彌補這一差距，但其不同方法的實際效用尚不清楚。本文提出了一個基於使用者研究的評估，其中包含了 Grad-CAM 解釋方法，並將其應用於神經網路以分類時間序列新生兒呼吸數據中的呼吸。我們展示了不同利益相關者對可解釋性方法的感知效用，揭示了實現實際透明度的難度，以及許多參與者希望獲得更深入的解釋。
 
-##### **Comprehensive Modeling and Question Answering of Cancer Clinical Practice Guidelines using LLMs**
-2501.13984v1 by Bhumika Gupta, Pralaypati Ta, Keerthi Ram, Mohanasankar Sivaprakasam
+##### **XAI4LLM. Let Machine Learning Models and LLMs Collaborate for Enhanced In-Context Learning in Healthcare**
+2405.06270v3 by Fatemeh Nazary, Yashar Deldjoo, Tommaso Di Noia, Eugenio di Sciascio
 
-The updated recommendations on diagnostic procedures and treatment pathways
-for a medical condition are documented as graphical flows in Clinical Practice
-Guidelines (CPGs). For effective use of the CPGs in helping medical
-professionals in the treatment decision process, it is necessary to fully
-capture the guideline knowledge, particularly the contexts and their
-relationships in the graph. While several existing works have utilized these
-guidelines to create rule bases for Clinical Decision Support Systems, limited
-work has been done toward directly capturing the full medical knowledge
-contained in CPGs. This work proposes an approach to create a contextually
-enriched, faithful digital representation of National Comprehensive Cancer
-Network (NCCN) Cancer CPGs in the form of graphs using automated extraction and
-node & relationship classification. We also implement semantic enrichment of
-the model by using Large Language Models (LLMs) for node classification,
-achieving an accuracy of 80.86% and 88.47% with zero-shot learning and few-shot
-learning, respectively. Additionally, we introduce a methodology for answering
-natural language questions with constraints to guideline text by leveraging
-LLMs to extract the relevant subgraph from the guideline knowledge base. By
-generating natural language answers based on subgraph paths and semantic
-information, we mitigate the risk of incorrect answers and hallucination
-associated with LLMs, ensuring factual accuracy in medical domain Question
-Answering.
+The integration of Large Language Models (LLMs) into healthcare diagnostics
+offers a promising avenue for clinical decision-making. This study outlines the
+development of a novel method for zero-shot/few-shot in-context learning (ICL)
+by integrating medical domain knowledge using a multi-layered structured
+prompt. We also explore the efficacy of two communication styles between the
+user and LLMs: the Numerical Conversational (NC) style, which processes data
+incrementally, and the Natural Language Single-Turn (NL-ST) style, which
+employs long narrative prompts.
+  Our study systematically evaluates the diagnostic accuracy and risk factors,
+including gender bias and false negative rates, using a dataset of 920 patient
+records in various few-shot scenarios. Results indicate that traditional
+clinical machine learning (ML) models generally outperform LLMs in zero-shot
+and few-shot settings. However, the performance gap narrows significantly when
+employing few-shot examples alongside effective explainable AI (XAI) methods as
+sources of domain knowledge. Moreover, with sufficient time and an increased
+number of examples, the conversational style (NC) nearly matches the
+performance of ML models. Most notably, LLMs demonstrate comparable or superior
+cost-sensitive accuracy relative to ML models.
+  This research confirms that, with appropriate domain knowledge and tailored
+communication strategies, LLMs can significantly enhance diagnostic processes.
+The findings highlight the importance of optimizing the number of training
+examples and communication styles to improve accuracy and reduce biases in LLM
+applications.
 
-摘要：已更新的醫療狀況診斷程序和治療途徑建議，以臨床實務指南 (CPG) 中的圖形流程記錄。為了有效使用 CPG 協助醫療專業人員進行治療決策，必須完整擷取指南知識，特別是圖表中的脈絡及其關係。雖然現有許多研究已利用這些指南為臨床決策支援系統建立規則基礎，但直接擷取 CPG 中包含的完整醫療知識的工作卻有限。這項研究提出了一種方法，以自動化擷取和節點與關係分類的方式，建立脈絡豐富、忠實的國家綜合癌症網路 (NCCN) 癌症 CPG 圖形數位表示。我們也透過使用大型語言模型 (LLM) 進行節點分類，實作模型的語意豐富化，分別在零次學習和少次學習中達到 80.86% 和 88.47% 的準確度。此外，我們引進了一種方法，透過運用 LLM 從指南知識庫中擷取相關子圖，來回答具有指南文字限制的自然語言問題。透過根據子圖路徑和語意資訊產生自然語言答案，我們降低了與 LLM 相關的錯誤答案和幻覺風險，確保了醫療領域問題解答中的事實準確性。
+摘要：大型語言模型 (LLM) 與醫療診斷整合
+為臨床決策提供了一個有前景的途徑。本研究概述了一種新穎方法的開發，用於零次學習/少量學習情境學習 (ICL)，方法是使用多層結構化提示整合醫療領域知識。我們還探討了使用者與 LLM 之間兩種溝通方式的功效：數值對話 (NC) 方式，它會逐步處理資料，以及自然語言單回合 (NL-ST) 方式，它會使用長篇敘事提示。
+我們的研究系統性地評估了診斷準確性和風險因子，包括性別偏見和假陰性率，使用了一個包含 920 個患者記錄的資料集，採用各種少量學習情境。結果表明，傳統的臨床機器學習 (ML) 模型通常在零次學習和少量學習設定中表現優於 LLM。然而，當使用少量學習範例以及有效的可解釋 AI (XAI) 方法作為領域知識來源時，效能差距會顯著縮小。此外，隨著時間充足和範例數量增加，對話方式 (NC) 幾乎可以媲美 ML 模型的效能。最值得注意的是，LLM 相對於 ML 模型展現出相當或更佳的成本敏感準確度。
+本研究證實，透過適當的領域知識和量身打造的溝通策略，LLM 可以顯著增強診斷程序。這些發現突顯了最佳化訓練範例數量和溝通方式的重要性，以提高準確度並減少 LLM 應用中的偏差。
 
-##### **LLM-Assisted Knowledge Graph Completion for Curriculum and Domain Modelling in Personalized Higher Education Recommendations**
-2501.12300v1 by Hasan Abu-Rasheed, Constance Jumbo, Rashed Al Amin, Christian Weber, Veit Wiese, Roman Obermaisser, Madjid Fathi
+##### **To Trust or Not to Trust: Towards a novel approach to measure trust for XAI systems**
+2405.05766v1 by Miquel Miró-Nicolau, Gabriel Moyà-Alcover, Antoni Jaume-i-Capó, Manuel González-Hidalgo, Maria Gemma Sempere Campello, Juan Antonio Palmer Sancho
 
-While learning personalization offers great potential for learners, modern
-practices in higher education require a deeper consideration of domain models
-and learning contexts, to develop effective personalization algorithms. This
-paper introduces an innovative approach to higher education curriculum
-modelling that utilizes large language models (LLMs) for knowledge graph (KG)
-completion, with the goal of creating personalized learning-path
-recommendations. Our research focuses on modelling university subjects and
-linking their topics to corresponding domain models, enabling the integration
-of learning modules from different faculties and institutions in the student's
-learning path. Central to our approach is a collaborative process, where LLMs
-assist human experts in extracting high-quality, fine-grained topics from
-lecture materials. We develop a domain, curriculum, and user models for
-university modules and stakeholders. We implement this model to create the KG
-from two study modules: Embedded Systems and Development of Embedded Systems
-Using FPGA. The resulting KG structures the curriculum and links it to the
-domain models. We evaluate our approach through qualitative expert feedback and
-quantitative graph quality metrics. Domain experts validated the relevance and
-accuracy of the model, while the graph quality metrics measured the structural
-properties of our KG. Our results show that the LLM-assisted graph completion
-approach enhances the ability to connect related courses across disciplines to
-personalize the learning experience. Expert feedback also showed high
-acceptance of the proposed collaborative approach for concept extraction and
-classification.
+The increasing reliance on Deep Learning models, combined with their inherent
+lack of transparency, has spurred the development of a novel field of study
+known as eXplainable AI (XAI) methods. These methods seek to enhance the trust
+of end-users in automated systems by providing insights into the rationale
+behind their decisions. This paper presents a novel approach for measuring user
+trust in XAI systems, allowing their refinement. Our proposed metric combines
+both performance metrics and trust indicators from an objective perspective. To
+validate this novel methodology, we conducted a case study in a realistic
+medical scenario: the usage of XAI system for the detection of pneumonia from
+x-ray images.
 
-摘要：<paragraph>在學習個人化提供學習者巨大潛力的同時，高等教育中的現代實務需要更深入地考慮領域模型和學習情境，以開發有效的個人化演算法。本文介紹了一種創新的高等教育課程建模方法，該方法利用大型語言模型 (LLM) 來完成知識圖譜 (KG)，目的是建立個人化的學習路徑建議。我們的研究重點在於建模大學科目，並將它們的主題連結到對應的領域模型，從而能夠將來自不同院系和機構的學習模組整合到學生的學習路徑中。我們的做法核心是一個協作流程，其中 LLM 協助人類專家從講義材料中萃取高品質、細緻的主題。我們為大學模組和利害關係人開發了領域、課程和使用者模型。我們實作這個模型，從兩個研究模組建立 KG：嵌入式系統和使用 FPGA 的嵌入式系統開發。產生的 KG 建構了課程並將其連結到領域模型。我們透過定性專家回饋和定量圖形品質指標來評估我們的做法。領域專家驗證了模型的相關性和準確性，而圖形品質指標則測量了我們 KG 的結構特性。我們的結果顯示，LLM 輔助的圖形完成方法增強了跨學科連結相關課程的能力，以個人化學習體驗。專家回饋也顯示高度接受所提出的協作方法，用於概念萃取和分類。</paragraph>
+摘要：隨著對深度學習模型依賴性的增加，加上其固有的透明度不足，促使一個新的研究領域發展，稱為可解釋 AI (XAI) 方法。這些方法旨在透過深入了解決策背後的原理，來提升最終使用者對自動化系統的信賴。本文提出了一種衡量使用者對 XAI 系統信賴度的新穎方法，允許對其進行改進。我們提出的指標結合了客觀觀點下的效能指標和信賴指標。為了驗證這個新穎的方法，我們在一個真實的醫療場景中進行了一個案例研究：使用 XAI 系統從 X 光影像中偵測肺炎。
 
-##### **Divide-Then-Aggregate: An Efficient Tool Learning Method via Parallel Tool Invocation**
-2501.12432v1 by Dongsheng Zhu, Weixian Shi, Zhengliang Shi, Zhaochun Ren, Shuaiqiang Wang, Lingyong Yan, Dawei Yin
+##### **Region-specific Risk Quantification for Interpretable Prognosis of COVID-19**
+2405.02815v1 by Zhusi Zhong, Jie Li, Zhuoqi Ma, Scott Collins, Harrison Bai, Paul Zhang, Terrance Healey, Xinbo Gao, Michael K. Atalay, Zhicheng Jiao
 
-Although current Large Language Models (LLMs) exhibit impressive
-capabilities, performing complex real-world tasks still requires tool learning.
-Mainstream methods, such as CoT/ReAct, rely on step-by-step tool invocation to
-interact with external environments, but they are limited in perceptual scope
-and lack adequate task-planning capability. To address these limitations, other
-studies introduce the first Search-based Decision Tree (DFSDT), which still
-suffers from the high computational cost. In this paper, we introduce a novel
-parallel tool invocation paradigm, DTA-Llama (Divide-Then-Aggregate Llama).
-First, we transform traditional tree-based tool search paths into Directed
-Acyclic Graph (DAG) structure, generating a high-quality parallel tool
-invocation dataset. The DTA-Llama is then trained on the dataset to learn to
-iteratively divide the current task into several parallel tool invocation
-sub-tasks and aggregate the invocation results to decide the next actions.
-Furthermore, we introduce an efficient inference framework inspired by the
-Process/Threads mechanism when applying the DTA-Llama to practical tasks.
-Experimental results show that our approach substantially enhances task
-performance while reducing token consumption and inference time. Llama2-7B,
-using our method, is comparable to the official parallel function calling
-method of GPT-3.5. The relevant code, dataset, and model weights are available
-at https://corn0205.github.io/
+The COVID-19 pandemic has strained global public health, necessitating
+accurate diagnosis and intervention to control disease spread and reduce
+mortality rates. This paper introduces an interpretable deep survival
+prediction model designed specifically for improved understanding and trust in
+COVID-19 prognosis using chest X-ray (CXR) images. By integrating a large-scale
+pretrained image encoder, Risk-specific Grad-CAM, and anatomical region
+detection techniques, our approach produces regional interpretable outcomes
+that effectively capture essential disease features while focusing on rare but
+critical abnormal regions. Our model's predictive results provide enhanced
+clarity and transparency through risk area localization, enabling clinicians to
+make informed decisions regarding COVID-19 diagnosis with better understanding
+of prognostic insights. We evaluate the proposed method on a multi-center
+survival dataset and demonstrate its effectiveness via quantitative and
+qualitative assessments, achieving superior C-indexes (0.764 and 0.727) and
+time-dependent AUCs (0.799 and 0.691). These results suggest that our
+explainable deep survival prediction model surpasses traditional survival
+analysis methods in risk prediction, improving interpretability for clinical
+decision making and enhancing AI system trustworthiness.
 
-摘要：儘管目前的大型語言模型 (LLM) 展現出令人印象深刻的能力，但執行複雜的真實世界任務仍需要工具學習。主流方法（例如 CoT/ReAct）依賴逐步工具呼叫與外部環境互動，但它們的感知範圍有限，且缺乏足夠的任務規劃能力。為了解決這些限制，其他研究引入了第一個基於搜尋的決策樹 (DFSDT)，但仍有很高的運算成本。在本文中，我們介紹了一種新穎的平行工具呼叫範例，DTA-Llama（分而合之 Llama）。首先，我們將傳統的基於樹的工具搜尋路徑轉換為有向無環圖 (DAG) 結構，產生高品質的平行工具呼叫資料集。然後在資料集上訓練 DTA-Llama，學習反覆將當前任務分成幾個平行工具呼叫子任務，並彙總呼叫結果以決定後續動作。此外，我們在將 DTA-Llama 應用於實際任務時，引入了一個受 Process/Threads 機制啟發的高效推論框架。實驗結果表明，我們的做法大幅提升了任務效能，同時減少了符號消耗和推論時間。使用我們方法的 Llama2-7B，可與 GPT-3.5 的官方平行函式呼叫方法相媲美。相關程式碼、資料集和模型權重可在 https://corn0205.github.io/ 取得
+摘要：COVID-19 疫情對全球公共衛生造成壓力，必須進行準確的診斷和干預，以控制疾病傳播並降低死亡率。本文介紹了一個可解釋的深度生存預測模型，專門設計用於透過胸部 X 光 (CXR) 影像改善對 COVID-19 預後的理解和信賴。透過整合大規模預訓練影像編碼器、風險特定 Grad-CAM 和解剖區域偵測技術，我們的做法產生區域可解釋的結果，有效捕捉必要的疾病特徵，同時專注於罕見但關鍵的異常區域。我們的模型預測結果透過風險區域定位提供增強的清晰度和透明度，讓臨床醫生能夠在更了解預後見解的情況下，就 COVID-19 診斷做出明智的決策。我們在多中心生存資料集上評估所提出的方法，並透過量化和質化評估證明其有效性，達到優異的 C 指數（0.764 和 0.727）和時間相關 AUC（0.799 和 0.691）。這些結果表明，我們可解釋的深度生存預測模型在風險預測方面超越傳統的生存分析方法，提升臨床決策的解釋性，並增強 AI 系統的信賴度。
+
+##### **Rad4XCNN: a new agnostic method for post-hoc global explanation of CNN-derived features by means of radiomics**
+2405.02334v2 by Francesco Prinzi, Carmelo Militello, Calogero Zarcaro, Tommaso Vincenzo Bartolotta, Salvatore Gaglio, Salvatore Vitabile
+
+In recent years, machine learning-based clinical decision support systems
+(CDSS) have played a key role in the analysis of several medical conditions.
+Despite their promising capabilities, the lack of transparency in AI models
+poses significant challenges, particularly in medical contexts where
+reliability is a mandatory aspect. However, it appears that explainability is
+inversely proportional to accuracy. For this reason, achieving transparency
+without compromising predictive accuracy remains a key challenge. This paper
+presents a novel method, namely Rad4XCNN, to enhance the predictive power of
+CNN-derived features with the inherent interpretability of radiomic features.
+Rad4XCNN diverges from conventional methods based on saliency maps, by
+associating intelligible meaning to CNN-derived features by means of Radiomics,
+offering new perspectives on explanation methods beyond visualization maps.
+Using a breast cancer classification task as a case study, we evaluated
+Rad4XCNN on ultrasound imaging datasets, including an online dataset and two
+in-house datasets for internal and external validation. Some key results are:
+i) CNN-derived features guarantee more robust accuracy when compared against
+ViT-derived and radiomic features; ii) conventional visualization map methods
+for explanation present several pitfalls; iii) Rad4XCNN does not sacrifice
+model accuracy for their explainability; iv) Rad4XCNN provides a global
+explanation enabling the physician to extract global insights and findings. Our
+method can mitigate some concerns related to the explainability-accuracy
+trade-off. This study highlighted the importance of proposing new methods for
+model explanation without affecting their accuracy.
 
-##### **InsTALL: Context-aware Instructional Task Assistance with Multi-modal Large Language Models**
-2501.12231v1 by Pha Nguyen, Sailik Sengupta, Girik Malik, Arshit Gupta, Bonan Min
+摘要：<paragraph>近年来，基于机器学习的临床决策支持系统 (CDSS) 在多种疾病的分析中扮演了关键角色。尽管它们具有广阔的前景，但 AI 模型缺乏透明度，尤其在医疗领域，可靠性是强制性方面，这带来了重大挑战。然而，解释性似乎与准确性成反比。因此，在不影响预测准确性的情况下实现透明度仍然是一个关键挑战。本文提出了一种新方法，即 Rad4XCNN，以通过放射组学的内在可解释性来增强 CNN 衍生特征的预测能力。Rad4XCNN 通过放射组学将可理解的含义与 CNN 衍生特征关联起来，从而偏离了基于显着性图的传统方法，为超越可视化图的解释方法提供了新的视角。使用乳腺癌分类任务作为案例研究，我们在超声成像数据集上评估了 Rad4XCNN，包括一个在线数据集和两个用于内部和外部验证的内部数据集。一些关键结果是：i) 与 ViT 衍生和放射组学特征相比，CNN 衍生特征保证了更稳健的准确性；ii) 用于解释的传统可视化图方法存在一些缺陷；iii) Rad4XCNN 不会为了可解释性而牺牲模型准确性；iv) Rad4XCNN 提供全局解释，使医生能够提取全局见解和发现。我们的方法可以减轻一些与可解释性-准确性权衡相关的担忧。本研究强调了提出新方法来解释模型而不影响其准确性的重要性。</paragraph>
 
-The improved competence of generative models can help building multi-modal
-virtual assistants that leverage modalities beyond language. By observing
-humans performing multi-step tasks, one can build assistants that have
-situational awareness of actions and tasks being performed, enabling them to
-cater assistance based on this understanding. In this paper, we develop a
-Context-aware Instructional Task Assistant with Multi-modal Large Language
-Models (InsTALL) that leverages an online visual stream (e.g. a user's screen
-share or video recording) and responds in real-time to user queries related to
-the task at hand. To enable useful assistance, InsTALL 1) trains a multi-modal
-model on task videos and paired textual data, and 2) automatically extracts
-task graph from video data and leverages it at training and inference time. We
-show InsTALL achieves state-of-the-art performance across proposed sub-tasks
-considered for multimodal activity understanding -- task recognition (TR),
-action recognition (AR), next action prediction (AP), and plan prediction (PP)
--- and outperforms existing baselines on two novel sub-tasks related to
-automatic error identification.
+##### **Attributing Responsibility in AI-Induced Incidents: A Computational Reflective Equilibrium Framework for Accountability**
+2404.16957v1 by Yunfei Ge, Quanyan Zhu
 
-摘要：生成模型能力的提升有助于构建利用语言之外的多模态虚拟助手。通过观察人类执行多步骤任务，可以构建对正在执行的动作和任务有情境感知的助手，使他们能够根据这种理解提供帮助。在本文中，我们开发了一个具有多模态大语言模型的上下文感知指令任务助手 (InsTALL)，该助手利用在线视觉流（例如用户的屏幕共享或视频录制），并实时响应与手头任务相关的用户查询。为了提供有用的帮助，InsTALL 1) 在任务视频和配对文本数据上训练多模态模型，以及 2) 从视频数据中自动提取任务图，并在训练和推理时间利用它。我们展示了 InsTALL 在考虑用于多模态活动理解的提议子任务中实现了最先进的性能——任务识别 (TR)、动作识别 (AR)、下一个动作预测 (AP) 和计划预测 (PP)——并且在与自动错误识别相关的两个新子任务上优于现有的基准。
+The pervasive integration of Artificial Intelligence (AI) has introduced
+complex challenges in the responsibility and accountability in the event of
+incidents involving AI-enabled systems. The interconnectivity of these systems,
+ethical concerns of AI-induced incidents, coupled with uncertainties in AI
+technology and the absence of corresponding regulations, have made traditional
+responsibility attribution challenging. To this end, this work proposes a
+Computational Reflective Equilibrium (CRE) approach to establish a coherent and
+ethically acceptable responsibility attribution framework for all stakeholders.
+The computational approach provides a structured analysis that overcomes the
+limitations of conceptual approaches in dealing with dynamic and multifaceted
+scenarios, showcasing the framework's explainability, coherence, and adaptivity
+properties in the responsibility attribution process. We examine the pivotal
+role of the initial activation level associated with claims in equilibrium
+computation. Using an AI-assisted medical decision-support system as a case
+study, we illustrate how different initializations lead to diverse
+responsibility distributions. The framework offers valuable insights into
+accountability in AI-induced incidents, facilitating the development of a
+sustainable and resilient system through continuous monitoring, revision, and
+reflection.
 
-##### **Leveraging Graph Structures and Large Language Models for End-to-End Synthetic Task-Oriented Dialogues**
-2501.11977v1 by Maya Medjad, Hugo Imbert, Bruno Yun, Raphaël Szymocha, Frédéric Armetta
+摘要：隨著人工智慧 (AI) 的普及整合，在涉及 AI 驅動系統的事故中，責任和義務歸屬產生了複雜的挑戰。這些系統的互連性、AI 引發事故的倫理問題，加上 AI 技術的不確定性和缺乏相應法規，使得傳統責任歸屬面臨挑戰。為此，本研究提出了一種計算反思均衡 (CRE) 方法，以建立一個連貫且在倫理上可接受的責任歸屬架構，適用於所有利害關係人。計算方法提供了結構化的分析，克服了概念方法在處理動態且多面向情境時的限制，展示了該架構在責任歸屬過程中具備的可解釋性、連貫性和適應性。我們探討了與均衡計算中索賠相關的初始啟動層級的關鍵作用。我們以 AI 輔助醫療決策支援系統為案例研究，說明不同的初始化如何導致不同的責任分配。該架構提供了對 AI 引發事故中問責制的寶貴見解，透過持續監控、修訂和反思，促進了永續且有韌性的系統發展。
 
-Training task-oriented dialogue systems is both costly and time-consuming,
-due to the need for high-quality datasets encompassing diverse intents.
-Traditional methods depend on extensive human annotation, while recent
-advancements leverage large language models (LLMs) to generate synthetic data.
-However, these approaches often require custom prompts or code, limiting
-accessibility for non-technical users. We introduce GraphTOD, an end-to-end
-framework that simplifies the generation of task-oriented dialogues. Users can
-create dialogues by specifying transition graphs in JSON format. Our evaluation
-demonstrates that GraphTOD generates high-quality dialogues across various
-domains, significantly lowering the cost and complexity of dataset creation.
+##### **Explainable AI for Fair Sepsis Mortality Predictive Model**
+2404.13139v1 by Chia-Hsuan Chang, Xiaoyang Wang, Christopher C. Yang
 
-摘要：訓練任務導向對話系統既昂貴又耗時，
-因為需要包含各種意圖的高品質資料集。
-傳統方法依賴於廣泛的人工標註，而最近
-的進展利用大型語言模型 (LLM) 來產生合成資料。
-然而，這些方法通常需要自訂提示或程式碼，限制
-非技術使用者的可及性。我們介紹 GraphTOD，一個端對端的
-架構，簡化了任務導向對話的產生。使用者可以
-透過指定 JSON 格式的轉換圖表來建立對話。我們的評估
-證明 GraphTOD 在各種領域產生高品質對話，顯著降低資料集建立的成本和複雜性。
+Artificial intelligence supports healthcare professionals with predictive
+modeling, greatly transforming clinical decision-making. This study addresses
+the crucial need for fairness and explainability in AI applications within
+healthcare to ensure equitable outcomes across diverse patient demographics. By
+focusing on the predictive modeling of sepsis-related mortality, we propose a
+method that learns a performance-optimized predictive model and then employs
+the transfer learning process to produce a model with better fairness. Our
+method also introduces a novel permutation-based feature importance algorithm
+aiming at elucidating the contribution of each feature in enhancing fairness on
+predictions. Unlike existing explainability methods concentrating on explaining
+feature contribution to predictive performance, our proposed method uniquely
+bridges the gap in understanding how each feature contributes to fairness. This
+advancement is pivotal, given sepsis's significant mortality rate and its role
+in one-third of hospital deaths. Our method not only aids in identifying and
+mitigating biases within the predictive model but also fosters trust among
+healthcare stakeholders by improving the transparency and fairness of model
+predictions, thereby contributing to more equitable and trustworthy healthcare
+delivery.
 
-##### **Bridging Visualization and Optimization: Multimodal Large Language Models on Graph-Structured Combinatorial Optimization**
-2501.11968v1 by Jie Zhao, Kang Hao Cheong, Witold Pedrycz
+摘要：人工智慧透過預測模型協助醫療專業人員，大幅轉變了臨床決策制定。本研究探討了在醫療保健中使用人工智慧應用程式時公平性和可解釋性的關鍵需求，以確保在不同的患者人口統計資料中獲得公平的結果。透過專注於敗血症相關死亡率的預測模型，我們提出了一種方法，該方法會學習一個效能最佳化的預測模型，然後採用轉移學習過程來產生一個具有更好公平性的模型。我們的模型還引入了一種新穎的基於排列的特徵重要性演算法，旨在闡明每個特徵在增強預測公平性方面的貢獻。與現有的可解釋性方法專注於解釋特徵對預測效能的貢獻不同，我們提出的方法獨特地彌補了理解每個特徵如何有助於公平性的差距。這項進展至關重要，因為敗血症的死亡率很高，且在三分之一的醫院死亡中扮演著角色。我們的模型不僅有助於識別和減輕預測模型中的偏差，還能透過提高模型預測的透明度和公平性來培養醫療保健利益相關者之間的信任，進而有助於提供更公平且值得信賴的醫療保健服務。
 
-Graph-structured combinatorial challenges are inherently difficult due to
-their nonlinear and intricate nature, often rendering traditional computational
-methods ineffective or expensive. However, these challenges can be more
-naturally tackled by humans through visual representations that harness our
-innate ability for spatial reasoning. In this study, we propose transforming
-graphs into images to preserve their higher-order structural features
-accurately, revolutionizing the representation used in solving graph-structured
-combinatorial tasks. This approach allows machines to emulate human-like
-processing in addressing complex combinatorial challenges. By combining the
-innovative paradigm powered by multimodal large language models (MLLMs) with
-simple search techniques, we aim to develop a novel and effective framework for
-tackling such problems. Our investigation into MLLMs spanned a variety of
-graph-based tasks, from combinatorial problems like influence maximization to
-sequential decision-making in network dismantling, as well as addressing six
-fundamental graph-related issues. Our findings demonstrate that MLLMs exhibit
-exceptional spatial intelligence and a distinctive capability for handling
-these problems, significantly advancing the potential for machines to
-comprehend and analyze graph-structured data with a depth and intuition akin to
-human cognition. These results also imply that integrating MLLMs with simple
-optimization strategies could form a novel and efficient approach for
-navigating graph-structured combinatorial challenges without complex
-derivations, computationally demanding training and fine-tuning.
+##### **Multi Class Depression Detection Through Tweets using Artificial Intelligence**
+2404.13104v1 by Muhammad Osama Nusrat, Waseem Shahzad, Saad Ahmed Jamal
 
-摘要：圖形結構的組合挑戰本質上很困難，因為它們的非線性和複雜性，通常會使傳統的計算方法無效或昂貴。然而，人類可以透過利用我們天生的空間推理能力的視覺表徵，更自然地應對這些挑戰。在本研究中，我們建議將圖形轉換為影像，以準確保留它們的高階結構特徵，從而革新用於解決圖形結構組合任務的表徵。這種方法允許機器在解決複雜的組合挑戰時模擬類人的處理。透過結合由多模態大型語言模型 (MLLM) 提供動力的創新範例與簡單的搜尋技術，我們旨在為解決此類問題開發一個新穎且有效的架構。我們對 MLLM 的研究涵蓋了各種基於圖形的任務，從組合問題（如影響力最大化）到網路拆除中的順序決策制定，以及解決六個基本的圖形相關問題。我們的研究結果表明，MLLM 表現出非凡的空間智能和處理這些問題的獨特能力，顯著提升了機器以類似人類認知的深度和直覺來理解和分析圖形結構資料的潛力。這些結果還暗示，將 MLLM 與簡單的最佳化策略整合在一起，可以形成一種新穎且有效的方法，用於在沒有複雜推導、計算需求量大的訓練和微調的情況下應對圖形結構的組合挑戰。
+Depression is a significant issue nowadays. As per the World Health
+Organization (WHO), in 2023, over 280 million individuals are grappling with
+depression. This is a huge number; if not taken seriously, these numbers will
+increase rapidly. About 4.89 billion individuals are social media users. People
+express their feelings and emotions on platforms like Twitter, Facebook,
+Reddit, Instagram, etc. These platforms contain valuable information which can
+be used for research purposes. Considerable research has been conducted across
+various social media platforms. However, certain limitations persist in these
+endeavors. Particularly, previous studies were only focused on detecting
+depression and the intensity of depression in tweets. Also, there existed
+inaccuracies in dataset labeling. In this research work, five types of
+depression (Bipolar, major, psychotic, atypical, and postpartum) were predicted
+using tweets from the Twitter database based on lexicon labeling. Explainable
+AI was used to provide reasoning by highlighting the parts of tweets that
+represent type of depression. Bidirectional Encoder Representations from
+Transformers (BERT) was used for feature extraction and training. Machine
+learning and deep learning methodologies were used to train the model. The BERT
+model presented the most promising results, achieving an overall accuracy of
+0.96.
 
-##### **A Survey of Graph Retrieval-Augmented Generation for Customized Large Language Models**
-2501.13958v1 by Qinggang Zhang, Shengyuan Chen, Yuanchen Bei, Zheng Yuan, Huachi Zhou, Zijin Hong, Junnan Dong, Hao Chen, Yi Chang, Xiao Huang
+摘要：現今，憂鬱症是一個重要的議題。根據世界衛生組織 (WHO) 的資料，在 2023 年，超過 2.8 億人正在與憂鬱症搏鬥。這是一個龐大的數字；如果不認真看待，這些數字將會快速增加。大約有 48.9 億人是社群媒體使用者。人們在 Twitter、Facebook、Reddit、Instagram 等平台上表達自己的感受和情緒。這些平台包含有價值的資訊，可用於研究目的。已經在各種社群媒體平台上進行了大量的研究。然而，這些努力仍存在某些限制。特別是，先前的研究僅專注於偵測推文中的憂鬱症和憂鬱症的強度。此外，資料集標籤中存在不準確的情況。在這項研究工作中，使用基於詞彙標籤的 Twitter 資料庫中的推文預測了五種類型的憂鬱症（雙極型、重度、精神病型、非典型和產後）。可解釋的 AI 用於透過強調代表憂鬱症類型的推文部分來提供推理。從 Transformers（BERT）中提取的雙向編碼器表示用於特徵提取和訓練。機器學習和深度學習方法用於訓練模型。BERT 模型呈現出最有希望的結果，達到 0.96 的整體準確度。
 
-Large language models (LLMs) have demonstrated remarkable capabilities in a
-wide range of tasks, yet their application to specialized domains remains
-challenging due to the need for deep expertise. Retrieval-augmented generation
-(RAG) has emerged as a promising solution to customize LLMs for professional
-fields by seamlessly integrating external knowledge bases, enabling real-time
-access to domain-specific expertise during inference. Despite its potential,
-traditional RAG systems, based on flat text retrieval, face three critical
-challenges: (i) complex query understanding in professional contexts, (ii)
-difficulties in knowledge integration across distributed sources, and (iii)
-system efficiency bottlenecks at scale. This survey presents a systematic
-analysis of Graph-based Retrieval-Augmented Generation (GraphRAG), a new
-paradigm that revolutionizes domain-specific LLM applications. GraphRAG
-addresses traditional RAG limitations through three key innovations: (i)
-graph-structured knowledge representation that explicitly captures entity
-relationships and domain hierarchies, (ii) efficient graph-based retrieval
-techniques that enable context-preserving knowledge retrieval with multihop
-reasoning ability, and (iii) structure-aware knowledge integration algorithms
-that leverage retrieved knowledge for accurate and logical coherent generation
-of LLMs. In this survey, we systematically analyze the technical foundations of
-GraphRAG and examine current implementations across various professional
-domains, identifying key technical challenges and promising research
-directions. All the related resources of GraphRAG, including research papers,
-open-source data, and projects, are collected for the community in
-\textcolor{blue}{\url{https://github.com/DEEP-PolyU/Awesome-GraphRAG}}.
+##### **COIN: Counterfactual inpainting for weakly supervised semantic segmentation for medical images**
+2404.12832v2 by Dmytro Shvetsov, Joonas Ariva, Marharyta Domnich, Raul Vicente, Dmytro Fishman
 
-摘要：大型語言模型 (LLM) 已在各種任務中展現出非凡的能力，但由於需要深入的專業知識，因此將其應用於專業領域仍具有挑戰性。檢索增強生成 (RAG) 已成為一種有前途的解決方案，可通過無縫整合外部知識庫來客製化 LLM 以適用於專業領域，從而在推理過程中即時存取特定領域的專業知識。儘管有其潛力，但基於平面文字檢索的傳統 RAG 系統面臨三項關鍵挑戰：(i) 在專業情境中進行複雜的查詢理解，(ii) 難以整合分散來源的知識，以及 (iii) 系統效率瓶頸會隨著規模擴大而產生。本調查系統性地分析了圖形化檢索增強生成 (GraphRAG) 的技術基礎，GraphRAG 是一個新的典範，它徹底改變了特定領域的 LLM 應用。GraphRAG 透過三項關鍵創新來解決傳統 RAG 的限制：(i) 圖形結構化的知識表述，明確擷取實體關係和領域階層，(ii) 有效的圖形化檢索技術，可進行保留脈絡的知識檢索，並具備多跳推理能力，以及 (iii) 結構感知知識整合演算法，可利用檢索到的知識來進行 LLM 的準確且邏輯一致的生成。在本調查中，我們系統性地分析了 GraphRAG 的技術基礎，並檢視了在各種專業領域中的現有實作，找出關鍵技術挑戰和有前景的研究方向。所有 GraphRAG 的相關資源，包括研究論文、開放原始碼資料和專案，都已在 \textcolor{blue}{\url{https://github.com/DEEP-PolyU/Awesome-GraphRAG}} 中為社群收集。
+Deep learning is dramatically transforming the field of medical imaging and
+radiology, enabling the identification of pathologies in medical images,
+including computed tomography (CT) and X-ray scans. However, the performance of
+deep learning models, particularly in segmentation tasks, is often limited by
+the need for extensive annotated datasets. To address this challenge, the
+capabilities of weakly supervised semantic segmentation are explored through
+the lens of Explainable AI and the generation of counterfactual explanations.
+The scope of this research is development of a novel counterfactual inpainting
+approach (COIN) that flips the predicted classification label from abnormal to
+normal by using a generative model. For instance, if the classifier deems an
+input medical image X as abnormal, indicating the presence of a pathology, the
+generative model aims to inpaint the abnormal region, thus reversing the
+classifier's original prediction label. The approach enables us to produce
+precise segmentations for pathologies without depending on pre-existing
+segmentation masks. Crucially, image-level labels are utilized, which are
+substantially easier to acquire than creating detailed segmentation masks. The
+effectiveness of the method is demonstrated by segmenting synthetic targets and
+actual kidney tumors from CT images acquired from Tartu University Hospital in
+Estonia. The findings indicate that COIN greatly surpasses established
+attribution methods, such as RISE, ScoreCAM, and LayerCAM, as well as an
+alternative counterfactual explanation method introduced by Singla et al. This
+evidence suggests that COIN is a promising approach for semantic segmentation
+of tumors in CT images, and presents a step forward in making deep learning
+applications more accessible and effective in healthcare, where annotated data
+is scarce.
 
-##### **Network-informed Prompt Engineering against Organized Astroturf Campaigns under Extreme Class Imbalance**
-2501.11849v2 by Nikos Kanakaris, Heng Ping, Xiongye Xiao, Nesreen K. Ahmed, Luca Luceri, Emilio Ferrara, Paul Bogdan
+摘要：深度学习正大幅轉變醫學影像和放射線學領域，能辨識醫學影像中的病理，包括電腦斷層掃描 (CT) 和 X 光掃描。然而，深度學習模型的效能，特別是在分割任務中，常常受到廣泛註解資料集需求的限制。為了應對此挑戰，透過可解釋 AI 和反事實解釋的產生，探索弱監督語意分割的能力。本研究的範圍是開發一種新的反事實內插方法 (COIN)，該方法使用生成模型將預測的分類標籤從異常翻轉為正常。例如，如果分類器將輸入的醫學影像 X 視為異常，表示存在病理，則生成模型旨在內插異常區域，從而逆轉分類器的原始預測標籤。此方法使我們能夠產生病理的精確分割，而無需依賴於預先存在的分割遮罩。至關重要的是，利用影像層級標籤，這比建立詳細的分割遮罩容易取得。該方法的有效性透過分割合成目標和從愛沙尼亞塔爾圖大學醫院取得的 CT 影像中的實際腎臟腫瘤來證明。研究結果表明，COIN 遠遠超過已建立的歸因方法，例如 RISE、ScoreCAM 和 LayerCAM，以及 Singla 等人提出的另一種反事實解釋方法。此證據表明，COIN 是一種很有前途的 CT 影像中腫瘤語意分割方法，並在醫療保健中讓深度學習應用更易於取得和更有效率邁進一步，其中註解資料很稀少。
 
-Detecting organized political campaigns is of paramount importance in
-fighting against disinformation on social media. Existing approaches for the
-identification of such organized actions employ techniques mostly from network
-science, graph machine learning and natural language processing. Their ultimate
-goal is to analyze the relationships and interactions (e.g. re-posting) among
-users and the textual similarities of their posts. Despite their effectiveness
-in recognizing astroturf campaigns, these methods face significant challenges,
-notably the class imbalance in available training datasets. To mitigate this
-issue, recent methods usually resort to data augmentation or increasing the
-number of positive samples, which may not always be feasible or sufficient in
-real-world settings. Following a different path, in this paper, we propose a
-novel framework for identifying astroturf campaigns based solely on large
-language models (LLMs), introducing a Balanced Retrieval-Augmented Generation
-(Balanced RAG) component. Our approach first gives both textual information
-concerning the posts (in our case tweets) and the user interactions of the
-social network as input to a language model. Then, through prompt engineering
-and the proposed Balanced RAG method, it effectively detects coordinated
-disinformation campaigns on X (Twitter). The proposed framework does not
-require any training or fine-tuning of the language model. Instead, by
-strategically harnessing the strengths of prompt engineering and Balanced RAG,
-it facilitates LLMs to overcome the effects of class imbalance and effectively
-identify coordinated political campaigns. The experimental results demonstrate
-that by incorporating the proposed prompt engineering and Balanced RAG methods,
-our framework outperforms the traditional graph-based baselines, achieving
-2x-3x improvements in terms of precision, recall and F1 scores.
+##### **Hybrid Intelligence for Digital Humanities**
+2406.15374v1 by Victor de Boer, Lise Stork
 
-摘要：<paragraph>在社交媒體上對抗錯誤資訊，偵測有組織的政治宣傳活動至關重要。現有的此類有組織行動識別方法，大多採用網路科學、圖形機器學習和自然語言處理的技術。它們的最終目標是分析使用者之間的關係和互動（例如轉發），以及他們貼文的文字相似性。儘管這些方法在辨識草根運動宣傳活動方面很有效，但它們面臨嚴峻的挑戰，特別是可用訓練資料集中的類別不平衡。為了減輕這個問題，最近的方法通常訴諸於資料擴充或增加正向樣本數量，但在現實世界中可能並非總是可行或足夠。本文採取不同的途徑，我們提出了一個基於大型語言模型 (LLM) 的辨識草根運動宣傳活動的新架構，並引入了平衡檢索擴充產生 (Balanced RAG) 組件。我們的做法首先將有關貼文（在我們的案例中是推文）的文字資訊和社交網路的使用者互動作為輸入，輸入到語言模型中。然後，透過提示工程和提出的平衡檢索擴充產生方法，它有效地偵測 X（Twitter）上協調的不實資訊宣傳活動。提出的架構不需要任何語言模型的訓練或微調。相反地，透過策略性地利用提示工程和平衡檢索擴充產生方法的優勢，它使大型語言模型能夠克服類別不平衡的影響，並有效地識別協調的政治宣傳活動。實驗結果證明，透過整合提出的提示工程和平衡檢索擴充產生方法，我們的架構優於傳統的基於圖形的基準，在精確度、召回率和 F1 分數方面獲得 2x-3x 的改進。</paragraph>
+In this paper, we explore the synergies between Digital Humanities (DH) as a
+discipline and Hybrid Intelligence (HI) as a research paradigm. In DH research,
+the use of digital methods and specifically that of Artificial Intelligence is
+subject to a set of requirements and constraints. We argue that these are
+well-supported by the capabilities and goals of HI. Our contribution includes
+the identification of five such DH requirements: Successful AI systems need to
+be able to 1) collaborate with the (human) scholar; 2) support data criticism;
+3) support tool criticism; 4) be aware of and cater to various perspectives and
+5) support distant and close reading. We take the CARE principles of Hybrid
+Intelligence (collaborative, adaptive, responsible and explainable) as
+theoretical framework and map these to the DH requirements. In this mapping, we
+include example research projects. We finally address how insights from DH can
+be applied to HI and discuss open challenges for the combination of the two
+disciplines.
 
-##### **Large Language Models Meet Graph Neural Networks for Text-Numeric Graph Reasoning**
-2501.16361v1 by Haoran Song, Jiarui Feng, Guangfu Li, Michael Province, Philip Payne, Yixin Chen, Fuhai Li
+摘要：在本文中，我們探討數位人文學科 (DH) 作為一門學科與混合智能 (HI) 作為一個研究典範之間的協同作用。在 DH 研究中，數位方法的使用，特別是人工智慧的使用，受到一系列要求和限制。我們認為這些要求和限制獲得 HI 的能力和目標的充分支持。我們的貢獻包括找出五個這樣的 DH 要求：成功的 AI 系統需要能夠 1) 與（人類）學者合作；2) 支援資料批評；3) 支援工具批評；4) 察覺並迎合各種觀點；5) 支援遠距和近距離閱讀。我們將混合智能的 CARE 原則（協作、適應、負責和可解釋）作為理論架構，並將這些原則對應到 DH 要求。在此對應中，我們納入範例研究專案。最後，我們探討如何將 DH 的見解應用於 HI，並討論結合這兩個學科的開放挑戰。
 
-In real-world scientific discovery, human beings always make use of the
-accumulated prior knowledge with imagination pick select one or a few most
-promising hypotheses from large and noisy data analysis results. In this study,
-we introduce a new type of graph structure, the text-numeric graph (TNG), which
-is defined as graph entities and associations have both text-attributed
-information and numeric information. The TNG is an ideal data structure model
-for novel scientific discovery via graph reasoning because it integrates
-human-understandable textual annotations or prior knowledge, with numeric
-values that represent the observed or activation levels of graph entities or
-associations in different samples. Together both the textual information and
-numeric values determine the importance of graph entities and associations in
-graph reasoning for novel scientific knowledge discovery. We further propose
-integrating large language models (LLMs) and graph neural networks (GNNs) to
-analyze the TNGs for graph understanding and reasoning. To demonstrate the
-utility, we generated the text-omic(numeric) signaling graphs (TOSG), as one
-type of TNGs, in which all graphs have the same entities, associations and
-annotations, but have sample-specific entity numeric (omic) values using single
-cell RNAseq (scRNAseq) datasets of different diseases. We proposed joint
-LLM-GNN models for key entity mining and signaling pathway mining on the TOSGs.
-The evaluation results showed the LLM-GNN and TNGs models significantly improve
-classification accuracy and network inference. In conclusion, the TNGs and
-joint LLM-GNN models are important approaches for scientific discovery.
+##### **Ethical Framework for Responsible Foundational Models in Medical Imaging**
+2406.11868v1 by Abhijit Das, Debesh Jha, Jasmer Sanjotra, Onkar Susladkar, Suramyaa Sarkar, Ashish Rauniyar, Nikhil Tomar, Vanshali Sharma, Ulas Bagci
 
-摘要：<paragraph>在現實世界的科學發現中，人類總是利用累積的先驗知識，並運用想像力從大量且雜訊的資料分析結果中挑選出一個或幾個最有希望的假設。在本研究中，我們介紹了一種新型態的圖形結構，稱為文字數值圖 (TNG)，定義為圖形實體和關聯具有文字屬性資訊和數值資訊。TNG 是透過圖形推理進行新科學發現的理想資料結構模型，因為它整合了人類可理解的文字註解或先驗知識，以及代表圖形實體或不同樣本中關聯的觀察值或活化程度的數值。文字資訊和數值一起決定了圖形實體和關聯在圖形推理中對於新科學知識發現的重要性。我們進一步提出整合大型語言模型 (LLM) 和圖形神經網路 (GNN) 來分析 TNG，以進行圖形理解和推理。為了展示其效用，我們生成了文字組學（數值）訊號圖 (TOSG)，作為一種 TNG，其中所有圖形都具有相同的實體、關聯和註解，但具有特定於樣本的實體數值（組學）值，使用不同疾病的單細胞 RNAseq (scRNAseq) 資料集。我們針對 TOSG 提出聯合 LLM-GNN 模型，用於關鍵實體探勘和訊號路徑探勘。評估結果顯示，LLM-GNN 和 TNG 模型顯著提升了分類準確度和網路推論。結論而言，TNG 和聯合 LLM-GNN 模型是科學發現的重要方法。</paragraph>
+Foundational models (FMs) have tremendous potential to revolutionize medical
+imaging. However, their deployment in real-world clinical settings demands
+extensive ethical considerations. This paper aims to highlight the ethical
+concerns related to FMs and propose a framework to guide their responsible
+development and implementation within medicine. We meticulously examine ethical
+issues such as privacy of patient data, bias mitigation, algorithmic
+transparency, explainability and accountability. The proposed framework is
+designed to prioritize patient welfare, mitigate potential risks, and foster
+trust in AI-assisted healthcare.
 
-##### **Zep: A Temporal Knowledge Graph Architecture for Agent Memory**
-2501.13956v1 by Preston Rasmussen, Pavlo Paliychuk, Travis Beauvais, Jack Ryan, Daniel Chalef
+摘要：基礎模型 (FM) 具有徹底改變醫學影像的巨大潛力。然而，它們在現實世界臨床環境中的部署需要廣泛的倫理考量。本文旨在強調與 FM 相關的倫理問題，並提出一個框架來指導它們在醫學中的負責任開發和實施。我們仔細審查了倫理問題，例如患者數據隱私、偏差緩解、演算法透明度、可解釋性和問責制。所提出的框架旨在優先考慮患者福利、減輕潛在風險，並培養對 AI 輔助醫療保健的信任。
 
-We introduce Zep, a novel memory layer service for AI agents that outperforms
-the current state-of-the-art system, MemGPT, in the Deep Memory Retrieval (DMR)
-benchmark. Additionally, Zep excels in more comprehensive and challenging
-evaluations than DMR that better reflect real-world enterprise use cases. While
-existing retrieval-augmented generation (RAG) frameworks for large language
-model (LLM)-based agents are limited to static document retrieval, enterprise
-applications demand dynamic knowledge integration from diverse sources
-including ongoing conversations and business data. Zep addresses this
-fundamental limitation through its core component Graphiti -- a
-temporally-aware knowledge graph engine that dynamically synthesizes both
-unstructured conversational data and structured business data while maintaining
-historical relationships. In the DMR benchmark, which the MemGPT team
-established as their primary evaluation metric, Zep demonstrates superior
-performance (94.8% vs 93.4%). Beyond DMR, Zep's capabilities are further
-validated through the more challenging LongMemEval benchmark, which better
-reflects enterprise use cases through complex temporal reasoning tasks. In this
-evaluation, Zep achieves substantial results with accuracy improvements of up
-to 18.5% while simultaneously reducing response latency by 90% compared to
-baseline implementations. These results are particularly pronounced in
-enterprise-critical tasks such as cross-session information synthesis and
-long-term context maintenance, demonstrating Zep's effectiveness for deployment
-in real-world applications.
+##### **Advancements in Radiomics and Artificial Intelligence for Thyroid Cancer Diagnosis**
+2404.07239v1 by Milad Yousefi, Shadi Farabi Maleki, Ali Jafarizadeh, Mahya Ahmadpour Youshanlui, Aida Jafari, Siamak Pedrammehr, Roohallah Alizadehsani, Ryszard Tadeusiewicz, Pawel Plawiak
 
-摘要：我們推出 Zep，這是一種新穎的記憶層服務，適用於 AI 代理，其在深度記憶擷取 (DMR) 基準測試中優於現行的最先進系統 MemGPT。此外，Zep 在比 DMR 更全面且更具挑戰性的評估中表現出色，這些評估更能反映真實世界的企業用例。雖然現有的檢索增強生成 (RAG) 架構僅限於大型語言模型 (LLM) 基於代理的靜態文件檢索，但企業應用需要從包括正在進行的對話和業務數據在內的不同來源動態整合知識。Zep 通過其核心組件 Graphiti 來解決這個基本限制，Graphiti 是一個時間感知知識圖譜引擎，可以在維護歷史關係的同時動態綜合非結構化對話數據和結構化業務數據。在 MemGPT 團隊確立為其主要評估指標的 DMR 基準測試中，Zep 表現出優異的效能（94.8% 對 93.4%）。除了 DMR 之外，Zep 的功能還通過更具挑戰性的 LongMemEval 基準測試進一步得到驗證，該基準測試通過複雜的時間推理任務更好地反映了企業用例。在這個評估中，Zep 以高達 18.5% 的準確度改進取得了顯著的成果，同時與基線實作相比，將回應延遲降低了 90%。這些成果在企業關鍵任務中尤為明顯，例如跨會話資訊綜合和長期脈絡維護，證明了 Zep 在實際應用中部署的有效性。
+Thyroid cancer is an increasing global health concern that requires advanced
+diagnostic methods. The application of AI and radiomics to thyroid cancer
+diagnosis is examined in this review. A review of multiple databases was
+conducted in compliance with PRISMA guidelines until October 2023. A
+combination of keywords led to the discovery of an English academic publication
+on thyroid cancer and related subjects. 267 papers were returned from the
+original search after 109 duplicates were removed. Relevant studies were
+selected according to predetermined criteria after 124 articles were eliminated
+based on an examination of their abstract and title. After the comprehensive
+analysis, an additional six studies were excluded. Among the 28 included
+studies, radiomics analysis, which incorporates ultrasound (US) images,
+demonstrated its effectiveness in diagnosing thyroid cancer. Various results
+were noted, some of the studies presenting new strategies that outperformed the
+status quo. The literature has emphasized various challenges faced by AI
+models, including interpretability issues, dataset constraints, and operator
+dependence. The synthesized findings of the 28 included studies mentioned the
+need for standardization efforts and prospective multicenter studies to address
+these concerns. Furthermore, approaches to overcome these obstacles were
+identified, such as advances in explainable AI technology and personalized
+medicine techniques. The review focuses on how AI and radiomics could transform
+the diagnosis and treatment of thyroid cancer. Despite challenges, future
+research on multidisciplinary cooperation, clinical applicability validation,
+and algorithm improvement holds the potential to improve patient outcomes and
+diagnostic precision in the treatment of thyroid cancer.
 
-##### **Explainable Lane Change Prediction for Near-Crash Scenarios Using Knowledge Graph Embeddings and Retrieval Augmented Generation**
-2501.11560v1 by M. Manzour, A. Ballardini, R. Izquierdo, M. Á. Sotelo
+摘要：甲狀腺癌是一種日益嚴重的全球健康問題，需要先進的診斷方法。本篇評論探討了人工智能與放射特徵分析在甲狀腺癌診斷中的應用。在符合 PRISMA 指南的情況下，對多個資料庫進行了回顧，直到 2023 年 10 月。通過結合關鍵字，發現了一篇關於甲狀腺癌和相關主題的英文學術出版物。在移除 109 篇重複文獻後，原始搜尋共回傳 267 篇論文。在根據預先確定的標準，淘汰了 124 篇文章的摘要和標題後，選出了相關研究。在進行全面分析後，額外排除了六項研究。在納入的 28 項研究中，結合超音波 (US) 影像的放射特徵分析，證明了其在診斷甲狀腺癌方面的有效性。研究結果不一，有些研究提出了優於現狀的新策略。文獻強調了人工智能模型面臨的各種挑戰，包括可解釋性問題、資料集限制和操作員依賴性。28 項納入研究的綜合發現提到，需要標準化工作和前瞻性多中心研究來解決這些問題。此外，還確定了克服這些障礙的方法，例如可解釋人工智能技術和個人化醫療技術的進步。本篇評論重點探討了人工智能和放射特徵分析如何轉變甲狀腺癌的診斷和治療。儘管存在挑戰，但未來對多學科合作、臨床適用性驗證和演算法改進的研究，仍有潛力改善甲狀腺癌治療中的患者預後和診斷精準度。
 
-Lane-changing maneuvers, particularly those executed abruptly or in risky
-situations, are a significant cause of road traffic accidents. However, current
-research mainly focuses on predicting safe lane changes. Furthermore, existing
-accident datasets are often based on images only and lack comprehensive sensory
-data. In this work, we focus on predicting risky lane changes using the CRASH
-dataset (our own collected dataset specifically for risky lane changes), and
-safe lane changes (using the HighD dataset). Then, we leverage KG and Bayesian
-inference to predict these maneuvers using linguistic contextual information,
-enhancing the model's interpretability and transparency. The model achieved a
-91.5% f1-score with anticipation time extending to four seconds for risky lane
-changes, and a 90.0% f1-score for predicting safe lane changes with the same
-anticipation time. We validate our model by integrating it into a vehicle
-within the CARLA simulator in scenarios that involve risky lane changes. The
-model managed to anticipate sudden lane changes, thus providing automated
-vehicles with further time to plan and execute appropriate safe reactions.
-Finally, to enhance the explainability of our model, we utilize RAG to provide
-clear and natural language explanations for the given prediction.
+##### **Predictive Modeling for Breast Cancer Classification in the Context of Bangladeshi Patients: A Supervised Machine Learning Approach with Explainable AI**
+2404.04686v1 by Taminul Islam, Md. Alif Sheakh, Mst. Sazia Tahosin, Most. Hasna Hena, Shopnil Akash, Yousef A. Bin Jardan, Gezahign Fentahun Wondmie, Hiba-Allah Nafidi, Mohammed Bourhia
 
-摘要：換車道動作，尤其是突然或在風險情況下執行的動作，是道路交通事故的重要原因。然而，目前的研究所主要集中在預測安全的換車道。此外，現有的事故資料集通常僅基於影像，且缺乏全面的感測資料。在這項工作中，我們專注於使用 CRASH 資料集（我們自己收集的專門針對風險換車道資料集）來預測風險換車道，以及安全換車道（使用 HighD 資料集）。然後，我們利用 KG 和貝氏推理來使用語言背景資訊預測這些動作，增強模型的可解釋性和透明度。該模型在風險換車道的預測時間延長至四秒時，達到了 91.5% 的 f1 分數，在預測安全換車道時，在相同的預測時間內達到了 90.0% 的 f1 分數。我們透過將模型整合到 CARLA 模擬器中的車輛中，在涉及風險換車道的場景中驗證我們的模型。該模型設法預測突然的換車道，從而為自動駕駛車輛提供了更多時間來規劃和執行適當的安全反應。最後，為了增強我們模型的可解釋性，我們利用 RAG 為給定的預測提供清晰且自然的語言解釋。
+Breast cancer has rapidly increased in prevalence in recent years, making it
+one of the leading causes of mortality worldwide. Among all cancers, it is by
+far the most common. Diagnosing this illness manually requires significant time
+and expertise. Since detecting breast cancer is a time-consuming process,
+preventing its further spread can be aided by creating machine-based forecasts.
+Machine learning and Explainable AI are crucial in classification as they not
+only provide accurate predictions but also offer insights into how the model
+arrives at its decisions, aiding in the understanding and trustworthiness of
+the classification results. In this study, we evaluate and compare the
+classification accuracy, precision, recall, and F-1 scores of five different
+machine learning methods using a primary dataset (500 patients from Dhaka
+Medical College Hospital). Five different supervised machine learning
+techniques, including decision tree, random forest, logistic regression, naive
+bayes, and XGBoost, have been used to achieve optimal results on our dataset.
+Additionally, this study applied SHAP analysis to the XGBoost model to
+interpret the model's predictions and understand the impact of each feature on
+the model's output. We compared the accuracy with which several algorithms
+classified the data, as well as contrasted with other literature in this field.
+After final evaluation, this study found that XGBoost achieved the best model
+accuracy, which is 97%.
 
-##### **Each Graph is a New Language: Graph Learning with LLMs**
-2501.11478v2 by Huachi Zhou, Jiahe Du, Chuang Zhou, Chang Yang, Yilin Xiao, Yuxuan Xie, Xiao Huang
+摘要：<paragraph>近年來，乳癌的盛行率迅速增加，使其成為全球主要的死亡原因之一。在所有癌症中，乳癌迄今為止是最常見的。手動診斷此疾病需要大量的時間和專業知識。由於乳癌的檢測過程耗時，因此透過建立機器學習模型來預測，有助於防止其進一步擴散。機器學習和可解釋 AI 在分類中至關重要，因為它們不僅可以提供準確的預測，還可以深入了解模型如何做出決策，有助於理解和信賴分類結果。在此研究中，我們評估並比較了五種不同的機器學習方法的分類準確度、精確度、召回率和 F1 分數，使用了一個主要的資料集（達卡醫學院醫院的 500 名患者）。五種不同的監督式機器學習技術，包括決策樹、隨機森林、邏輯迴歸、朴素貝氏和 XGBoost，已用於在我們的資料集上取得最佳結果。此外，本研究將 SHAP 分析應用於 XGBoost 模型，以解釋模型的預測並了解每個特徵對模型輸出的影響。我們比較了幾種演算法對資料進行分類的準確度，並與該領域的其他文獻進行對比。在最後評估後，本研究發現 XGBoost 達到了最佳的模型準確度，為 97%。</paragraph>
 
-Recent efforts leverage Large Language Models (LLMs) for modeling
-text-attributed graph structures in node classification tasks. These approaches
-describe graph structures for LLMs to understand or aggregate LLM-generated
-textual attribute embeddings through graph structure. However, these approaches
-face two main limitations in modeling graph structures with LLMs. (i) Graph
-descriptions become verbose in describing high-order graph structure. (ii)
-Textual attributes alone do not contain adequate graph structure information.
-It is challenging to model graph structure concisely and adequately with LLMs.
-LLMs lack built-in mechanisms to model graph structures directly. They also
-struggle with complex long-range dependencies between high-order nodes and
-target nodes.
-  Inspired by the observation that LLMs pre-trained on one language can achieve
-exceptional performance on another with minimal additional training, we propose
-\textbf{G}raph-\textbf{D}efined \textbf{L}anguage for \textbf{L}arge
-\textbf{L}anguage \textbf{M}odel (GDL4LLM). This novel framework enables LLMs
-to transfer their powerful language understanding capabilities to
-graph-structured data. GDL4LLM translates graphs into a graph language corpus
-instead of graph descriptions and pre-trains LLMs on this corpus to adequately
-understand graph structures. During fine-tuning, this corpus describes the
-structural information of target nodes concisely with only a few tokens. By
-treating graphs as a new language, GDL4LLM enables LLMs to model graph
-structures adequately and concisely for node classification tasks. Extensive
-experiments on three real-world datasets demonstrate that GDL4LLM outperforms
-description-based and textual attribute embeddings-based baselines by
-efficiently modeling different orders of graph structure with LLMs.
+##### **Enhancing Breast Cancer Diagnosis in Mammography: Evaluation and Integration of Convolutional Neural Networks and Explainable AI**
+2404.03892v3 by Maryam Ahmed, Tooba Bibi, Rizwan Ahmed Khan, Sidra Nasir
 
-摘要：<paragraph>最近的研究利用大型语言模型 (LLM) 对节点分类任务中的文本属性图结构进行建模。这些方法描述图结构，以便 LLM 理解或通过图结构聚合 LLM 生成的文本属性嵌入。然而，这些方法在使用 LLM 对图结构进行建模时面临两个主要限制。(i) 图描述在描述高阶图结构时变得冗长。(ii) 仅文本属性不包含足够的图结构信息。使用 LLM 对图结构进行简洁且充分的建模具有挑战性。LLM 缺乏直接对图结构进行建模的内置机制。它们还难以处理高阶节点和目标节点之间复杂的远程依赖关系。
-受 LLM 在一种语言上进行预训练后，只需进行最少的额外训练即可在另一种语言上实现卓越性能的观察结果的启发，我们提出了**G**raph-**D**efined **L**anguage for **L**arge **L**anguage **M**odel (GDL4LLM)。此新框架使 LLM 能够将其强大的语言理解能力转移到结构化数据图。GDL4LLM 将图翻译成图语言语料库，而不是图描述，并在该语料库上对 LLM 进行预训练，以充分理解图结构。在微调期间，此语料库仅使用几个标记简洁地描述目标节点的结构信息。通过将图视为一种新语言，GDL4LLM 使 LLM 能够充分且简洁地对图结构进行建模，以用于节点分类任务。在三个真实世界数据集上进行的广泛实验表明，GDL4LLM 通过使用 LLM 有效地对不同阶的图结构进行建模，优于基于描述和基于文本属性嵌入的基线。</paragraph>
+The Deep learning (DL) models for diagnosing breast cancer from mammographic
+images often operate as "black boxes", making it difficult for healthcare
+professionals to trust and understand their decision-making processes. The
+study presents an integrated framework combining Convolutional Neural Networks
+(CNNs) and Explainable Artificial Intelligence (XAI) for the enhanced diagnosis
+of breast cancer using the CBIS-DDSM dataset. The methodology encompasses an
+elaborate data preprocessing pipeline and advanced data augmentation techniques
+to counteract dataset limitations and transfer learning using pre-trained
+networks such as VGG-16, Inception-V3 and ResNet was employed. A focal point of
+our study is the evaluation of XAI's effectiveness in interpreting model
+predictions, highlighted by utilizing the Hausdorff measure to assess the
+alignment between AI-generated explanations and expert annotations
+quantitatively. This approach is critical for XAI in promoting trustworthiness
+and ethical fairness in AI-assisted diagnostics. The findings from our research
+illustrate the effective collaboration between CNNs and XAI in advancing
+diagnostic methods for breast cancer, thereby facilitating a more seamless
+integration of advanced AI technologies within clinical settings. By enhancing
+the interpretability of AI driven decisions, this work lays the groundwork for
+improved collaboration between AI systems and medical practitioners, ultimately
+enriching patient care. Furthermore, the implications of our research extended
+well beyond the current methodologies. It encourages further research into how
+to combine multimodal data and improve AI explanations to meet the needs of
+clinical practice.
 
-##### **Few-shot Policy (de)composition in Conversational Question Answering**
-2501.11335v1 by Kyle Erwin, Guy Axelrod, Maria Chang, Achille Fokoue, Maxwell Crouse, Soham Dan, Tian Gao, Rosario Uceda-Sosa, Ndivhuwo Makondo, Naweed Khan, Alexander Gray
+摘要：深度學習 (DL) 用於從乳房攝影術影像診斷乳癌的模型通常以「黑盒子」方式運作，這使得醫療保健專業人員難以信任和理解其決策過程。本研究提出一個整合架構，結合卷積神經網路 (CNN) 和可解釋人工智慧 (XAI)，以使用 CBIS-DDSM 資料集增強乳癌的診斷。方法包含一個精細的資料前處理管線和進階資料擴充技術，以對抗資料集限制，並採用預先訓練的網路（例如 VGG-16、Inception-V3 和 ResNet）進行遷移學習。我們研究的重點是評估 XAI 在解釋模型預測中的有效性，重點利用豪斯多夫測度量化評估 AI 生成的解釋和專家註解之間的一致性。這種方法對於 XAI 在促進 AI 輔助診斷中的可信度和倫理公平性至關重要。我們研究的發現說明了 CNN 和 XAI 在推進乳癌診斷方法中的有效協作，從而促進了先進 AI 技術在臨床環境中的更順暢整合。透過增強 AI 驅動決策的可解釋性，這項工作為 AI 系統和醫療從業人員之間的改善協作奠定了基礎，最終豐富了患者照護。此外，我們研究的影響遠遠超出了目前的技術。它鼓勵進一步研究如何結合多模式資料並改善 AI 解釋，以滿足臨床實務的需求。
 
-The task of policy compliance detection (PCD) is to determine if a scenario
-is in compliance with respect to a set of written policies. In a conversational
-setting, the results of PCD can indicate if clarifying questions must be asked
-to determine compliance status. Existing approaches usually claim to have
-reasoning capabilities that are latent or require a large amount of annotated
-data. In this work, we propose logical decomposition for policy compliance
-(LDPC): a neuro-symbolic framework to detect policy compliance using large
-language models (LLMs) in a few-shot setting. By selecting only a few exemplars
-alongside recently developed prompting techniques, we demonstrate that our
-approach soundly reasons about policy compliance conversations by extracting
-sub-questions to be answered, assigning truth values from contextual
-information, and explicitly producing a set of logic statements from the given
-policies. The formulation of explicit logic graphs can in turn help answer
-PCDrelated questions with increased transparency and explainability. We apply
-this approach to the popular PCD and conversational machine reading benchmark,
-ShARC, and show competitive performance with no task-specific finetuning. We
-also leverage the inherently interpretable architecture of LDPC to understand
-where errors occur, revealing ambiguities in the ShARC dataset and highlighting
-the challenges involved with reasoning for conversational question answering.
+##### **Advancing Multimodal Data Fusion in Pain Recognition: A Strategy Leveraging Statistical Correlation and Human-Centered Perspectives**
+2404.00320v2 by Xingrui Gu, Zhixuan Wang, Irisa Jin, Zekun Wu
 
-摘要：策略合規偵測 (PCD) 的任務是確定場景是否符合一組書面策略。在對話設定中，PCD 的結果可以指出是否必須提出澄清問題以確定合規狀態。現有的方法通常聲稱具有潛在的推理能力，或需要大量的註釋資料。在這項工作中，我們提出策略合規的邏輯分解 (LDPC)：一種使用大型語言模型 (LLM) 在少次嘗試中偵測策略合規的神經符號框架。透過僅選擇少數範例以及最近開發的提示技術，我們證明我們的做法透過提取要回答的子問題、從脈絡資訊指派真值，以及從給定的策略明確產生一組邏輯陳述，對策略合規對話進行合理的推理。明確邏輯圖表的制定反過來可以幫助回答 PCD 相關問題，並提高透明度和可解釋性。我們將此方法應用於熱門的 PCD 和對話式機器閱讀基準 ShARC，並在沒有特定任務微調的情況下展現出競爭力。我們也利用 LDPC 固有的可解釋架構來了解錯誤發生在哪裡，揭露 ShARC 資料集中的歧義，並強調對話式問題解答推理的挑戰。
+This research presents a novel multimodal data fusion methodology for pain
+behavior recognition, integrating statistical correlation analysis with
+human-centered insights. Our approach introduces two key innovations: 1)
+integrating data-driven statistical relevance weights into the fusion strategy
+to effectively utilize complementary information from heterogeneous modalities,
+and 2) incorporating human-centric movement characteristics into multimodal
+representation learning for detailed modeling of pain behaviors. Validated
+across various deep learning architectures, our method demonstrates superior
+performance and broad applicability. We propose a customizable framework that
+aligns each modality with a suitable classifier based on statistical
+significance, advancing personalized and effective multimodal fusion.
+Furthermore, our methodology provides explainable analysis of multimodal data,
+contributing to interpretable and explainable AI in healthcare. By highlighting
+the importance of data diversity and modality-specific representations, we
+enhance traditional fusion techniques and set new standards for recognizing
+complex pain behaviors. Our findings have significant implications for
+promoting patient-centered healthcare interventions and supporting explainable
+clinical decision-making.
 
-##### **Reasoning Language Models: A Blueprint**
-2501.11223v3 by Maciej Besta, Julia Barth, Eric Schreiber, Ales Kubicek, Afonso Catarino, Robert Gerstenberger, Piotr Nyczyk, Patrick Iff, Yueling Li, Sam Houliston, Tomasz Sternal, Marcin Copik, Grzegorz Kwaśniewski, Jürgen Müller, Łukasz Flis, Hannes Eberhard, Hubert Niewiadomski, Torsten Hoefler
+摘要：本研究提出了一種創新的多模態數據融合方法，用於疼痛行為識別，將統計相關分析與以人為中心的見解相結合。我們的做法引入了兩項關鍵創新：1) 將數據驅動的統計相關權重整合到融合策略中，以有效利用來自異質模態的補充信息，以及 2) 將以人為中心的運動特徵納入多模態表示學習中，以詳細建模疼痛行為。我們的模型在各種深度學習架構中得到驗證，展示了卓越的性能和廣泛的適用性。我們提出了一個可自定義的框架，根據統計顯著性將每個模態與合適的分類器對齊，推進個性化和有效的多模態融合。此外，我們的模型提供對多模態數據的可解釋分析，有助於醫療保健中的可解釋和可解釋 AI。通過強調數據多樣性和模態特定表示的重要性，我們增強了傳統的融合技術，並為識別複雜的疼痛行為設定了新的標準。我們的發現對促進以患者為中心的醫療保健干預和支持可解釋的臨床決策制定具有重要意義。
 
-Reasoning language models (RLMs), also known as Large Reasoning Models
-(LRMs), such as OpenAI's o1 and o3, DeepSeek-V3, and Alibaba's QwQ, have
-redefined AI's problem-solving capabilities by extending LLMs with advanced
-reasoning mechanisms. Yet, their high costs, proprietary nature, and complex
-architectures - uniquely combining Reinforcement Learning (RL), search
-heuristics, and LLMs - present accessibility and scalability challenges. To
-address these, we propose a comprehensive blueprint that organizes RLM
-components into a modular framework, based on a survey and analysis of all RLM
-works. This blueprint incorporates diverse reasoning structures (chains, trees,
-graphs, and nested forms), reasoning strategies (e.g., Monte Carlo Tree Search,
-Beam Search), RL concepts (policy, value models and others), supervision
-schemes (Outcome-Based and Process-Based Supervision), and other related
-concepts (e.g., Test-Time Compute, Retrieval-Augmented Generation, agent
-tools). We also provide detailed mathematical formulations and algorithmic
-specifications to simplify RLM implementation. By showing how schemes like
-LLaMA-Berry, QwQ, Journey Learning, and Graph of Thoughts fit as special cases,
-we demonstrate the blueprint's versatility and unifying potential. To
-illustrate its utility, we introduce x1, a modular implementation for rapid RLM
-prototyping and experimentation. Using x1 and a literature review, we provide
-key insights, such as multi-phase training for policy and value models, and the
-importance of familiar training distributions. Finally, we discuss scalable RLM
-cloud deployments and we outline how RLMs can integrate with a broader LLM
-ecosystem. Our work demystifies RLM construction, democratizes advanced
-reasoning capabilities, and fosters innovation, aiming to mitigate the gap
-between "rich AI" and "poor AI" by lowering barriers to RLM design and
-experimentation.
+##### **Addressing Social Misattributions of Large Language Models: An HCXAI-based Approach**
+2403.17873v1 by Andrea Ferrario, Alberto Termine, Alessandro Facchini
+
+Human-centered explainable AI (HCXAI) advocates for the integration of social
+aspects into AI explanations. Central to the HCXAI discourse is the Social
+Transparency (ST) framework, which aims to make the socio-organizational
+context of AI systems accessible to their users. In this work, we suggest
+extending the ST framework to address the risks of social misattributions in
+Large Language Models (LLMs), particularly in sensitive areas like mental
+health. In fact LLMs, which are remarkably capable of simulating roles and
+personas, may lead to mismatches between designers' intentions and users'
+perceptions of social attributes, risking to promote emotional manipulation and
+dangerous behaviors, cases of epistemic injustice, and unwarranted trust. To
+address these issues, we propose enhancing the ST framework with a fifth
+'W-question' to clarify the specific social attributions assigned to LLMs by
+its designers and users. This addition aims to bridge the gap between LLM
+capabilities and user perceptions, promoting the ethically responsible
+development and use of LLM-based technology.
+
+摘要：以人为本的可解释 AI (HCXAI) 倡导将社会层面整合到 AI 解释中。HCXAI 话语的核心是社会透明度 (ST) 框架，其目标是让 AI 系统的社会组织背景对用户来说是可理解的。在这项工作中，我们建议扩展 ST 框架以解决大型语言模型 (LLM) 中社会错误归因的风险，尤其是在心理健康等敏感领域。事实上，LLM 能够出色地模拟角色和人格，这可能导致设计者的意图和用户对社会属性的认知之间出现错配，从而有风险促进情绪操纵和危险行为、认知不公正和不合理的信任。为了解决这些问题，我们建议用第五个“W 问题”来增强 ST 框架，以明确设计者和用户赋予 LLM 的具体社会属性。此补充旨在弥合 LLM 能力和用户认知之间的差距，促进基于 LLM 的技术在道德上负责任地开发和使用。
+
+##### **Clinical Domain Knowledge-Derived Template Improves Post Hoc AI Explanations in Pneumothorax Classification**
+2403.18871v1 by Han Yuan, Chuan Hong, Pengtao Jiang, Gangming Zhao, Nguyen Tuan Anh Tran, Xinxing Xu, Yet Yen Yan, Nan Liu
 
-摘要：推理語言模型 (RLM)，又稱為大型推理模型 (LRM)，例如 OpenAI 的 o1 和 o3、DeepSeek-V3 以及阿里巴巴的 QwQ，透過擴充 LLM 的先進推理機制，重新定義了 AI 的問題解決能力。然而，它們的高成本、專有性質和複雜架構（獨特地結合了強化學習 (RL)、搜尋啟發法和 LLM）提出了可及性和可擴充性的挑戰。為了解決這些問題，我們提出了一個全面的藍圖，將 RLM 組件組織成一個模組化架構，這是基於對所有 RLM 作品的調查和分析。此藍圖包含多樣化的推理結構（鏈、樹、圖和巢狀形式）、推理策略（例如蒙地卡羅樹搜尋、波束搜尋）、RL 概念（策略、價值模型等）、監督方案（基於結果和基於流程的監督）和其他相關概念（例如測試時間運算、檢索增強生成、代理工具）。我們還提供了詳細的數學公式和演算法規範，以簡化 RLM 的實作。透過展示 LLaMA-Berry、QwQ、Journey Learning 和 Graph of Thoughts 等方案如何作為特殊情況，我們展示了藍圖的多功能性和統一潛力。為了說明其效用，我們介紹了 x1，這是一個模組化實作，用於快速 RLM 原型製作和實驗。使用 x1 和文獻回顧，我們提供了關鍵見解，例如策略和價值模型的多階段訓練，以及熟悉訓練分佈的重要性。最後，我們討論了可擴充的 RLM 雲端部署，並概述了 RLM 如何與更廣泛的 LLM 生態系統整合。我們的研究揭開了 RLM 建構的神秘面紗，使先進的推理能力民主化，並促進創新，旨在透過降低 RLM 設計和實驗的障礙，來縮小「富裕 AI」和「貧窮 AI」之間的差距。
+Background: Pneumothorax is an acute thoracic disease caused by abnormal air
+collection between the lungs and chest wall. To address the opaqueness often
+associated with deep learning (DL) models, explainable artificial intelligence
+(XAI) methods have been introduced to outline regions related to pneumothorax
+diagnoses made by DL models. However, these explanations sometimes diverge from
+actual lesion areas, highlighting the need for further improvement. Method: We
+propose a template-guided approach to incorporate the clinical knowledge of
+pneumothorax into model explanations generated by XAI methods, thereby
+enhancing the quality of these explanations. Utilizing one lesion delineation
+created by radiologists, our approach first generates a template that
+represents potential areas of pneumothorax occurrence. This template is then
+superimposed on model explanations to filter out extraneous explanations that
+fall outside the template's boundaries. To validate its efficacy, we carried
+out a comparative analysis of three XAI methods with and without our template
+guidance when explaining two DL models in two real-world datasets. Results: The
+proposed approach consistently improved baseline XAI methods across twelve
+benchmark scenarios built on three XAI methods, two DL models, and two
+datasets. The average incremental percentages, calculated by the performance
+improvements over the baseline performance, were 97.8% in Intersection over
+Union (IoU) and 94.1% in Dice Similarity Coefficient (DSC) when comparing model
+explanations and ground-truth lesion areas. Conclusions: In the context of
+pneumothorax diagnoses, we proposed a template-guided approach for improving AI
+explanations. We anticipate that our template guidance will forge a fresh
+approach to elucidating AI models by integrating clinical domain expertise.
 
-##### **IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems**
-2501.11067v1 by Elad Levi, Ilan Kadar
+摘要：<paragraph>背景：氣胸是一種因肺部與胸壁之間異常集氣所引起的急性胸腔疾病。為了解決深度學習（DL）模型經常伴隨的不透明性，可解釋人工智慧（XAI）方法已被引入，用於概述與 DL 模型做出的氣胸診斷相關的區域。然而，這些解釋有時會與實際病灶區域有所出入，突顯出進一步改進的必要性。方法：我們提出了一種模板引導式方法，將氣胸的臨床知識納入 XAI 方法產生的模型解釋中，從而提升這些解釋的品質。利用放射科醫師建立的病灶描繪，我們的做法首先產生一個模板，用於表示氣胸可能發生的區域。然後將此模板疊加在模型解釋上，以篩選出超出模板邊界的無關解釋。為了驗證其效力，我們對三種 XAI 方法進行了比較分析，在兩個真實世界資料集中解釋兩個 DL 模型時，分別採用和不採用我們的模板引導。結果：所提出的方法在建立於三種 XAI 方法、兩個 DL 模型和兩個資料集的十二種基準情境中，始終改善了基準 XAI 方法。在比較模型解釋和真實病灶區域時，透過基準效能的效能改進計算出的平均增量百分比為交集比（IoU）的 97.8% 和骰子相似性係數（DSC）的 94.1%。結論：在氣胸診斷的背景下，我們提出了一種模板引導式方法，用於改善 AI 解釋。我們預期我們的模板引導將透過整合臨床領域專業知識，為闡明 AI 模型建立一種新方法。</paragraph>
 
-Large Language Models (LLMs) are transforming artificial intelligence,
-evolving into task-oriented systems capable of autonomous planning and
-execution. One of the primary applications of LLMs is conversational AI
-systems, which must navigate multi-turn dialogues, integrate domain-specific
-APIs, and adhere to strict policy constraints. However, evaluating these agents
-remains a significant challenge, as traditional methods fail to capture the
-complexity and variability of real-world interactions. We introduce
-IntellAgent, a scalable, open-source multi-agent framework designed to evaluate
-conversational AI systems comprehensively. IntellAgent automates the creation
-of diverse, synthetic benchmarks by combining policy-driven graph modeling,
-realistic event generation, and interactive user-agent simulations. This
-innovative approach provides fine-grained diagnostics, addressing the
-limitations of static and manually curated benchmarks with coarse-grained
-metrics. IntellAgent represents a paradigm shift in evaluating conversational
-AI. By simulating realistic, multi-policy scenarios across varying levels of
-complexity, IntellAgent captures the nuanced interplay of agent capabilities
-and policy constraints. Unlike traditional methods, it employs a graph-based
-policy model to represent relationships, likelihoods, and complexities of
-policy interactions, enabling highly detailed diagnostics. IntellAgent also
-identifies critical performance gaps, offering actionable insights for targeted
-optimization. Its modular, open-source design supports seamless integration of
-new domains, policies, and APIs, fostering reproducibility and community
-collaboration. Our findings demonstrate that IntellAgent serves as an effective
-framework for advancing conversational AI by addressing challenges in bridging
-research and deployment. The framework is available at
-https://github.com/plurai-ai/intellagent
+##### **Enhancing Neural Machine Translation of Low-Resource Languages: Corpus Development, Human Evaluation and Explainable AI Architectures**
+2403.01580v1 by Séamus Lankford
 
-摘要：大型語言模型 (LLM) 正在轉變人工智慧，演變成具備自主規劃和執行能力的任務導向系統。LLM 的主要應用之一是對話式 AI 系統，它必須應對多輪對話、整合特定領域的 API，並遵守嚴格的政策約束。然而，評估這些代理仍然是一項重大挑戰，因為傳統方法無法捕捉現實世界互動的複雜性和變異性。我們引入了 IntellAgent，一個可擴充、開放原始碼的多代理架構，旨在全面評估對話式 AI 系統。IntellAgent 自動化建立多樣化、合成的基準，方法是結合策略驅動的圖形建模、逼真的事件產生和互動使用者代理模擬。這種創新方法提供了細緻的診斷，解決了具有粗略指標的靜態和手動策劃基準的限制。IntellAgent 代表了評估對話式 AI 的典範轉移。通過模擬不同層級複雜性的逼真多策略場景，IntellAgent 捕捉到了代理功能和策略約束之間的細微交互。與傳統方法不同，它採用基於圖形的策略模型來表示策略交互的關係、可能性和複雜性，從而實現高度詳細的診斷。IntellAgent 還識別出關鍵效能差距，提供可行的見解，以進行目標最佳化。其模組化、開放原始碼的設計支援無縫整合新的領域、策略和 API，促進了可複製性和社群協作。我們的研究結果表明，IntellAgent 可作為一個有效的框架，透過解決研究和部署之間的挑戰來推進對話式 AI。這個框架可在 https://github.com/plurai-ai/intellagent 取得
+In the current machine translation (MT) landscape, the Transformer
+architecture stands out as the gold standard, especially for high-resource
+language pairs. This research delves into its efficacy for low-resource
+language pairs including both the English$\leftrightarrow$Irish and
+English$\leftrightarrow$Marathi language pairs. Notably, the study identifies
+the optimal hyperparameters and subword model type to significantly improve the
+translation quality of Transformer models for low-resource language pairs.
+  The scarcity of parallel datasets for low-resource languages can hinder MT
+development. To address this, gaHealth was developed, the first bilingual
+corpus of health data for the Irish language. Focusing on the health domain,
+models developed using this in-domain dataset exhibited very significant
+improvements in BLEU score when compared with models from the LoResMT2021
+Shared Task. A subsequent human evaluation using the multidimensional quality
+metrics error taxonomy showcased the superior performance of the Transformer
+system in reducing both accuracy and fluency errors compared to an RNN-based
+counterpart.
+  Furthermore, this thesis introduces adaptNMT and adaptMLLM, two open-source
+applications streamlined for the development, fine-tuning, and deployment of
+neural machine translation models. These tools considerably simplify the setup
+and evaluation process, making MT more accessible to both developers and
+translators. Notably, adaptNMT, grounded in the OpenNMT ecosystem, promotes
+eco-friendly natural language processing research by highlighting the
+environmental footprint of model development. Fine-tuning of MLLMs by adaptMLLM
+demonstrated advancements in translation performance for two low-resource
+language pairs: English$\leftrightarrow$Irish and
+English$\leftrightarrow$Marathi, compared to baselines from the LoResMT2021
+Shared Task.
 
+摘要：<paragraph>在當前機器翻譯 (MT) 領域中，Transformer 架構脫穎而出，成為黃金標準，特別是對於高資源語言對。本研究探討其對低資源語言對的效能，包括英語↔愛爾蘭語和英語↔馬拉地語語言對。值得注意的是，本研究識別出最佳超參數和子詞模型類型，以顯著提高 Transformer 模型對低資源語言對的翻譯品質。
+低資源語言的平行資料集的稀缺會阻礙 MT 的發展。為了解決這個問題，開發了 gaHealth，這是愛爾蘭語的第一個雙語健康資料語料庫。專注於健康領域，使用此域內資料集開發的模型在 BLEU 得分方面表現出非常顯著的進步，與 LoResMT2021 共享任務中的模型相比。隨後使用多維品質指標錯誤分類法進行的人工評估顯示，與基於 RNN 的對應模型相比，Transformer 系統在減少準確性和流暢性錯誤方面表現出優異的性能。
+此外，本論文介紹了 adaptNMT 和 adaptMLLM，這兩個開源應用程式簡化了神經機器翻譯模型的開發、微調和部署。這些工具大幅簡化了設定和評估流程，讓 MT 更容易讓開發人員和翻譯人員使用。值得注意的是，adaptNMT 以 OpenNMT 生態系統為基礎，通過強調模型開發的環境足跡來促進生態友好的自然語言處理研究。與 LoResMT2021 共享任務中的基準相比，adaptMLLM 對 MLLM 的微調證明了英語↔愛爾蘭語和英語↔馬拉地語這兩個低資源語言對的翻譯性能進步。</paragraph>
 
-### LLM
-|Publish Date|Title|Authors|Homepage|Code|
-| :---: | :---: | :---: | :---: | :---: |
-|**2025-02-13**|**Theoretical Benefit and Limitation of Diffusion Language Model**|Guhao Feng et.al.|[2502.09622v1](http://arxiv.org/abs/2502.09622v1)|null|
-|**2025-02-13**|**MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency**|Dongzhi Jiang et.al.|[2502.09621v1](http://arxiv.org/abs/2502.09621v1)|null|
-|**2025-02-13**|**Exploring the Potential of Encoder-free Architectures in 3D LMMs**|Yiwen Tang et.al.|[2502.09620v1](http://arxiv.org/abs/2502.09620v1)|null|
-|**2025-02-13**|**DexTrack: Towards Generalizable Neural Tracking Control for Dexterous Manipulation from Human References**|Xueyi Liu et.al.|[2502.09614v1](http://arxiv.org/abs/2502.09614v1)|null|
-|**2025-02-13**|**Score-of-Mixture Training: Training One-Step Generative Models Made Simple**|Tejas Jayashankar et.al.|[2502.09609v1](http://arxiv.org/abs/2502.09609v1)|null|
-|**2025-02-13**|**Human-LLM Coevolution: Evidence from Academic Writing**|Mingmeng Geng et.al.|[2502.09606v1](http://arxiv.org/abs/2502.09606v1)|null|
-|**2025-02-13**|**SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models**|Yung-Sung Chuang et.al.|[2502.09604v1](http://arxiv.org/abs/2502.09604v1)|null|
-|**2025-02-13**|**CoT-Valve: Length-Compressible Chain-of-Thought Tuning**|Xinyin Ma et.al.|[2502.09601v1](http://arxiv.org/abs/2502.09601v1)|null|
-|**2025-02-13**|**Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs**|Siyan Zhao et.al.|[2502.09597v1](http://arxiv.org/abs/2502.09597v1)|null|
-|**2025-02-13**|**KIMAs: A Configurable Knowledge Integrated Multi-Agent System**|Zitao Li et.al.|[2502.09596v1](http://arxiv.org/abs/2502.09596v1)|null|
-|**2025-02-13**|**Logical forms complement probability in understanding language model (and human) performance**|Yixuan Wang et.al.|[2502.09589v1](http://arxiv.org/abs/2502.09589v1)|null|
-|**2025-02-13**|**Optimizing GPT for Video Understanding: Zero-Shot Performance and Prompt Engineering**|Mark Beliaev et.al.|[2502.09573v1](http://arxiv.org/abs/2502.09573v1)|null|
-|**2025-02-13**|**MorphNLI: A Stepwise Approach to Natural Language Inference Using Text Morphing**|Vlad Andrei Negru et.al.|[2502.09567v1](http://arxiv.org/abs/2502.09567v1)|null|
-|**2025-02-13**|**Zero-shot generation of synthetic neurosurgical data with large language models**|Austin A. Barr et.al.|[2502.09566v1](http://arxiv.org/abs/2502.09566v1)|null|
-|**2025-02-13**|**MDCrow: Automating Molecular Dynamics Workflows with Large Language Models**|Quintina Campbell et.al.|[2502.09565v1](http://arxiv.org/abs/2502.09565v1)|null|
-|**2025-02-13**|**EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents**|Rui Yang et.al.|[2502.09560v1](http://arxiv.org/abs/2502.09560v1)|null|
-|**2025-02-13**|**Mind the Gap! Choice Independence in Using Multilingual LLMs for Persuasive Co-Writing Tasks in Different Languages**|Shreyan Biswas et.al.|[2502.09532v1](http://arxiv.org/abs/2502.09532v1)|null|
-|**2025-02-13**|**Diffusion Models for Molecules: A Survey of Methods and Tasks**|Liang Wang et.al.|[2502.09511v1](http://arxiv.org/abs/2502.09511v1)|null|
-|**2025-02-13**|**AttentionSmithy: A Modular Framework for Rapid Transformer Development and Customization**|Caleb Cranney et.al.|[2502.09503v1](http://arxiv.org/abs/2502.09503v1)|null|
-|**2025-02-13**|**Improve LLM-based Automatic Essay Scoring with Linguistic Features**|Zhaoyi Joey Hou et.al.|[2502.09497v1](http://arxiv.org/abs/2502.09497v1)|null|
-|**2025-02-13**|**Cracking the Code: Enhancing Development finance understanding with artificial intelligence**|Pierre Beaucoral et.al.|[2502.09495v1](http://arxiv.org/abs/2502.09495v1)|null|
-|**2025-02-13**|**Objective quantification of mood states using large language models**|Jakub Onysk et.al.|[2502.09487v1](http://arxiv.org/abs/2502.09487v1)|null|
-|**2025-02-13**|**The Multilingual Mind : A Survey of Multilingual Reasoning in Language Models**|Akash Ghosh et.al.|[2502.09457v1](http://arxiv.org/abs/2502.09457v1)|null|
-|**2025-02-13**|**Pixel-Level Reasoning Segmentation via Multi-turn Conversations**|Dexian Cai et.al.|[2502.09447v1](http://arxiv.org/abs/2502.09447v1)|null|
-|**2025-02-13**|**Dual Formulation for Non-Rectangular Lp Robust Markov Decision Processes**|Navdeep Kumar et.al.|[2502.09432v1](http://arxiv.org/abs/2502.09432v1)|null|
-|**2025-02-13**|**Transformer-Enhanced Variational Autoencoder for Crystal Structure Prediction**|Ziyi Chen et.al.|[2502.09423v1](http://arxiv.org/abs/2502.09423v1)|null|
-|**2025-02-13**|**On multi-token prediction for efficient LLM inference**|Somesh Mehra et.al.|[2502.09419v1](http://arxiv.org/abs/2502.09419v1)|null|
-|**2025-02-13**|**SQuARE: Sequential Question Answering Reasoning Engine for Enhanced Chain-of-Thought in Large Language Models**|Daniel Fleischer et.al.|[2502.09390v1](http://arxiv.org/abs/2502.09390v1)|null|
-|**2025-02-13**|**Truth Knows No Language: Evaluating Truthfulness Beyond English**|Blanca Calvo Figueras et.al.|[2502.09387v1](http://arxiv.org/abs/2502.09387v1)|null|
-|**2025-02-13**|**A Deep Inverse-Mapping Model for a Flapping Robotic Wing**|Hadar Sharvit et.al.|[2502.09378v1](http://arxiv.org/abs/2502.09378v1)|null|
-|**2025-02-13**|**Language Agents as Digital Representatives in Collective Decision-Making**|Daniel Jarrett et.al.|[2502.09369v1](http://arxiv.org/abs/2502.09369v1)|null|
-|**2025-02-13**|**Neural Spatiotemporal Point Processes: Trends and Challenges**|Sumantrak Mukherjee et.al.|[2502.09341v1](http://arxiv.org/abs/2502.09341v1)|null|
-|**2025-02-13**|**Graph Diffusion Network for Drug-Gene Prediction**|Jiayang Wu et.al.|[2502.09335v1](http://arxiv.org/abs/2502.09335v1)|null|
-|**2025-02-13**|**Beyond English: The Impact of Prompt Translation Strategies across Languages and Tasks in Multilingual LLMs**|Itai Mondshine et.al.|[2502.09331v1](http://arxiv.org/abs/2502.09331v1)|null|
-|**2025-02-13**|**A Judge-free LLM Open-ended Generation Benchmark Based on the Distributional Hypothesis**|Kentaro Imajo et.al.|[2502.09316v1](http://arxiv.org/abs/2502.09316v1)|null|
-|**2025-02-13**|**When the LM misunderstood the human chuckled: Analyzing garden path effects in humans and language models**|Samuel Joseph Amouyal et.al.|[2502.09307v1](http://arxiv.org/abs/2502.09307v1)|null|
-|**2025-02-13**|**Indeterminacy in Affective Computing: Considering Meaning and Context in Data Collection Practices**|Bernd Dudzik et.al.|[2502.09294v1](http://arxiv.org/abs/2502.09294v1)|null|
-|**2025-02-13**|**SparQLe: Speech Queries to Text Translation Through LLMs**|Amirbek Djanibekov et.al.|[2502.09284v1](http://arxiv.org/abs/2502.09284v1)|null|
-|**2025-02-13**|**LiSA: Leveraging Link Recommender to Attack Graph Neural Networks via Subgraph Injection**|Wenlun Zhang et.al.|[2502.09271v1](http://arxiv.org/abs/2502.09271v1)|null|
-|**2025-02-13**|**AnomalyGFM: Graph Foundation Model for Zero/Few-shot Anomaly Detection**|Hezhe Qiao et.al.|[2502.09254v1](http://arxiv.org/abs/2502.09254v1)|null|
-|**2025-02-13**|**The Joint Entity-Relation Extraction Model Based on Span and Interactive Fusion Representation for Chinese Medical Texts with Complex Semantics**|Danni Feng et.al.|[2502.09247v1](http://arxiv.org/abs/2502.09247v1)|null|
-|**2025-02-13**|**You Do Not Fully Utilize Transformer's Representation Capacity**|Gleb Gerasimov et.al.|[2502.09245v1](http://arxiv.org/abs/2502.09245v1)|null|
-|**2025-02-13**|**From large language models to multimodal AI: A scoping review on the potential of generative AI in medicine**|Lukas Buess et.al.|[2502.09242v1](http://arxiv.org/abs/2502.09242v1)|null|
-|**2025-02-13**|**Reliable Conversational Agents under ASP Control that Understand Natural Language**|Yankai Zeng et.al.|[2502.09237v1](http://arxiv.org/abs/2502.09237v1)|null|
-|**2025-02-13**|**Commonsense Reasoning-Aided Autonomous Vehicle Systems**|Keegan Kimbrell et.al.|[2502.09233v1](http://arxiv.org/abs/2502.09233v1)|null|
-|**2025-02-13**|**Logical foundations of Smart Contracts**|Kalonji Kalala et.al.|[2502.09232v1](http://arxiv.org/abs/2502.09232v1)|null|
-|**2025-02-13**|**Relating Answer Set Programming and Many-sorted Logics for Formal Verification**|Zachary Hansen et.al.|[2502.09230v1](http://arxiv.org/abs/2502.09230v1)|null|
-|**2025-02-13**|**Computational methods for Dynamic Answer Set Programming**|Susana Hahn et.al.|[2502.09228v1](http://arxiv.org/abs/2502.09228v1)|null|
-|**2025-02-13**|**Generating Causally Compliant Counterfactual Explanations using ASP**|Sopam Dasgupta et.al.|[2502.09226v1](http://arxiv.org/abs/2502.09226v1)|null|
-|**2025-02-13**|**Order-Sorted Intensional Logic: Expressing Subtyping Polymorphism with Typing Assertions and Quantification over Concepts**|Đorđe Marković et.al.|[2502.09224v1](http://arxiv.org/abs/2502.09224v1)|null|
-|**2025-02-13**|**ASP-driven User-interaction with Clinguin**|Alexander Beiser et.al.|[2502.09222v1](http://arxiv.org/abs/2502.09222v1)|null|
-|**2025-02-13**|**Pearce's Characterisation in an Epistemic Domain**|Ezgi Iraz Su et.al.|[2502.09221v1](http://arxiv.org/abs/2502.09221v1)|null|
-|**2025-02-13**|**Graphical Conditions for the Existence, Unicity and Number of Regular Models**|Van-Giang Trinh et.al.|[2502.09220v1](http://arxiv.org/abs/2502.09220v1)|null|
-|**2025-02-13**|**Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**|Flavio Bertini et.al.|[2502.09218v1](http://arxiv.org/abs/2502.09218v1)|null|
-|**2025-02-13**|**Mind the Gaps: Logical English, Prolog, and Multi-agent Systems for Autonomous Vehicles**|Galileo Sartor et.al.|[2502.09216v1](http://arxiv.org/abs/2502.09216v1)|null|
-|**2025-02-13**|**Architecture for Simulating Behavior Mode Changes in Norm-Aware Autonomous Agents**|Sean Glaze et.al.|[2502.09215v1](http://arxiv.org/abs/2502.09215v1)|null|
-|**2025-02-13**|**Neuro-Symbolic Contrastive Learning for Cross-domain Inference**|Mingyue Liu et.al.|[2502.09213v1](http://arxiv.org/abs/2502.09213v1)|null|
-|**2025-02-13**|**LP-LM: No Hallucinations in Question Answering with Logic Programming**|Katherine Wu et.al.|[2502.09212v1](http://arxiv.org/abs/2502.09212v1)|null|
-|**2025-02-13**|**Visual Graph Question Answering with ASP and LLMs for Language Parsing**|Jakob Johannes Bauer et.al.|[2502.09211v1](http://arxiv.org/abs/2502.09211v1)|null|
-|**2025-02-13**|**On LLM-generated Logic Programs and their Inference Execution Methods**|Paul Tarau et.al.|[2502.09209v1](http://arxiv.org/abs/2502.09209v1)|null|
-|**2025-02-13**|**Efficient OWL2QL Meta-reasoning Using ASP-based Hybrid Knowledge Bases**|Haya Majid Qureshi et.al.|[2502.09206v1](http://arxiv.org/abs/2502.09206v1)|null|
-|**2025-02-13**|**Counterfactual Explanations as Plans**|Vaishak Belle et.al.|[2502.09205v1](http://arxiv.org/abs/2502.09205v1)|null|
-|**2025-02-13**|**Logical Lease Litigation: Prolog and LLMs for Rental Law Compliance in New York**|Sanskar Sehgal et.al.|[2502.09204v1](http://arxiv.org/abs/2502.09204v1)|null|
-|**2025-02-13**|**Thinking beyond the anthropomorphic paradigm benefits LLM research**|Lujain Ibrahim et.al.|[2502.09192v1](http://arxiv.org/abs/2502.09192v1)|null|
-|**2025-02-13**|**Matina: A Large-Scale 73B Token Persian Text Corpus**|Sara Bourbour Hosseinbeigi et.al.|[2502.09188v1](http://arxiv.org/abs/2502.09188v1)|null|
-|**2025-02-13**|**RefineCoder: Iterative Improving of Large Language Models via Adaptive Critique Refinement for Code Generation**|Changzhi Zhou et.al.|[2502.09183v1](http://arxiv.org/abs/2502.09183v1)|null|
-|**2025-02-13**|**FLAME: Flexible LLM-Assisted Moderation Engine**|Ivan Bakulin et.al.|[2502.09175v1](http://arxiv.org/abs/2502.09175v1)|null|
-|**2025-02-13**|**Two-Stage Representation Learning for Analyzing Movement Behavior Dynamics in People Living with Dementia**|Jin Cui et.al.|[2502.09173v1](http://arxiv.org/abs/2502.09173v1)|null|
-|**2025-02-13**|**Musical Heritage Historical Entity Linking**|Arianna Graciotti et.al.|[2502.09168v1](http://arxiv.org/abs/2502.09168v1)|null|
-|**2025-02-13**|**Improving TCM Question Answering through Tree-Organized Self-Reflective Retrieval with LLMs**|Chang Liu et.al.|[2502.09156v1](http://arxiv.org/abs/2502.09156v1)|null|
-|**2025-02-13**|**A Novel Dialect-Aware Framework for the Classification of Arabic Dialects and Emotions**|Nasser A Alsadhan et.al.|[2502.09128v1](http://arxiv.org/abs/2502.09128v1)|null|
-|**2025-02-13**|**Automatic Pruning via Structured Lasso with Class-wise Information**|Xiang Liu et.al.|[2502.09125v1](http://arxiv.org/abs/2502.09125v1)|null|
-|**2025-02-13**|**The influence of visual and linguistic cues on ignorance inference in Vision-Language Models (VLMs)**|Ye-eun Cho et.al.|[2502.09120v1](http://arxiv.org/abs/2502.09120v1)|null|
-|**2025-02-13**|**One-shot Federated Learning Methods: A Practical Guide**|Xiang Liu et.al.|[2502.09104v1](http://arxiv.org/abs/2502.09104v1)|null|
-|**2025-02-13**|**Logical Reasoning in Large Language Models: A Survey**|Hanmeng Liu et.al.|[2502.09100v1](http://arxiv.org/abs/2502.09100v1)|null|
-|**2025-02-13**|**A Hybrid Transformer Model for Fake News Detection: Leveraging Bayesian Optimization and Bidirectional Recurrent Unit**|Tianyi Huang et.al.|[2502.09097v1](http://arxiv.org/abs/2502.09097v1)|null|
-|**2025-02-13**|**A Hybrid Model for Few-Shot Text Classification Using Transfer and Meta-Learning**|Jia Gao et.al.|[2502.09086v1](http://arxiv.org/abs/2502.09086v1)|null|
-|**2025-02-13**|**Show Me the Work: Fact-Checkers' Requirements for Explainable Automated Fact-Checking**|Greta Warren et.al.|[2502.09083v1](http://arxiv.org/abs/2502.09083v1)|null|
-|**2025-02-13**|**CoSER: Coordinating LLM-Based Persona Simulation of Established Roles**|Xintao Wang et.al.|[2502.09082v1](http://arxiv.org/abs/2502.09082v1)|null|
-|**2025-02-13**|**Enhancing RAG with Active Learning on Conversation Records: Reject Incapables and Answer Capables**|Xuzhao Geng et.al.|[2502.09073v1](http://arxiv.org/abs/2502.09073v1)|null|
-|**2025-02-13**|**An Open Recipe: Adapting Language-Specific LLMs to a Reasoning Model in One Day via Model Merging**|Kunat Pipatanakul et.al.|[2502.09056v1](http://arxiv.org/abs/2502.09056v1)|null|
-|**2025-02-13**|**Cost-Saving LLM Cascades with Early Abstention**|Michael J. Zellinger et.al.|[2502.09054v1](http://arxiv.org/abs/2502.09054v1)|null|
-|**2025-02-13**|**Game Theory Meets Large Language Models: A Systematic Survey**|Haoran Sun et.al.|[2502.09053v1](http://arxiv.org/abs/2502.09053v1)|null|
-|**2025-02-13**|**AIDE: Agentically Improve Visual Language Model with Domain Experts**|Ming-Chang Chiu et.al.|[2502.09051v1](http://arxiv.org/abs/2502.09051v1)|null|
-|**2025-02-13**|**Leveraging Member-Group Relations via Multi-View Graph Filtering for Effective Group Recommendation**|Chae-Hyun Kim et.al.|[2502.09050v1](http://arxiv.org/abs/2502.09050v1)|null|
-|**2025-02-13**|**Criteria-Aware Graph Filtering: Extremely Fast Yet Accurate Multi-Criteria Recommendation**|Jin-Duk Park et.al.|[2502.09046v1](http://arxiv.org/abs/2502.09046v1)|null|
-|**2025-02-13**|**Typhoon T1: An Open Thai Reasoning Model**|Pittawat Taveekitworachai et.al.|[2502.09042v1](http://arxiv.org/abs/2502.09042v1)|null|
-|**2025-02-13**|**Mechanistic Unveiling of Transformer Circuits: Self-Influence as a Key to Model Reasoning**|Lin Zhang et.al.|[2502.09022v1](http://arxiv.org/abs/2502.09022v1)|null|
-|**2025-02-13**|**EventSTR: A Benchmark Dataset and Baselines for Event Stream based Scene Text Recognition**|Xiao Wang et.al.|[2502.09020v1](http://arxiv.org/abs/2502.09020v1)|null|
-|**2025-02-13**|**Zero-shot Concept Bottleneck Models**|Shin'ya Yamaguchi et.al.|[2502.09018v1](http://arxiv.org/abs/2502.09018v1)|null|
-|**2025-02-13**|**Diversity Enhances an LLM's Performance in RAG and Long-context Task**|Zhchao Wang et.al.|[2502.09017v1](http://arxiv.org/abs/2502.09017v1)|null|
-|**2025-02-13**|**Hope vs. Hate: Understanding User Interactions with LGBTQ+ News Content in Mainstream US News Media through the Lens of Hope Speech**|Jonathan Pofcher et.al.|[2502.09004v1](http://arxiv.org/abs/2502.09004v1)|null|
-|**2025-02-13**|**RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models**|Quan Wei et.al.|[2502.09003v1](http://arxiv.org/abs/2502.09003v1)|null|
-|**2025-02-13**|**PixLift: Accelerating Web Browsing via AI Upscaling**|Yonas Atinafu et.al.|[2502.08995v1](http://arxiv.org/abs/2502.08995v1)|null|
-|**2025-02-13**|**RLSA-PFL: Robust Lightweight Secure Aggregation with Model Inconsistency Detection in Privacy-Preserving Federated Learning**|Nazatul H. Sultan et.al.|[2502.08989v1](http://arxiv.org/abs/2502.08989v1)|null|
-|**2025-02-13**|**Neural Force Field: Learning Generalized Physical Representation from a Few Examples**|Shiqian Li et.al.|[2502.08987v1](http://arxiv.org/abs/2502.08987v1)|null|
-|**2025-02-13**|**Tuning-Free Personalized Alignment via Trial-Error-Explain In-Context Learning**|Hyundong Cho et.al.|[2502.08972v1](http://arxiv.org/abs/2502.08972v1)|null|
-|**2025-02-13**|**RTBAS: Defending LLM Agents Against Prompt Injection and Privacy Leakage**|Peter Yong Zhong et.al.|[2502.08966v1](http://arxiv.org/abs/2502.08966v1)|null|
-|**2025-02-13**|**Biologically Plausible Brain Graph Transformer**|Ciyuan Peng et.al.|[2502.08958v1](http://arxiv.org/abs/2502.08958v1)|null|
-|**2025-02-13**|**Medicine on the Edge: Comparative Performance Analysis of On-Device LLMs for Clinical Reasoning**|Leon Nissen et.al.|[2502.08954v1](http://arxiv.org/abs/2502.08954v1)|null|
+##### **Cause and Effect: Can Large Language Models Truly Understand Causality?**
+2402.18139v3 by Swagata Ashwani, Kshiteesh Hegde, Nishith Reddy Mannuru, Mayank Jindal, Dushyant Singh Sengar, Krishna Chaitanya Rao Kathala, Dishant Banga, Vinija Jain, Aman Chadha
 
-#### Abstracts
-##### **Theoretical Benefit and Limitation of Diffusion Language Model**
-2502.09622v1 by Guhao Feng, Yihan Geng, Jian Guan, Wei Wu, Liwei Wang, Di He
+With the rise of Large Language Models(LLMs), it has become crucial to
+understand their capabilities and limitations in deciphering and explaining the
+complex web of causal relationships that language entails. Current methods use
+either explicit or implicit causal reasoning, yet there is a strong need for a
+unified approach combining both to tackle a wide array of causal relationships
+more effectively. This research proposes a novel architecture called Context
+Aware Reasoning Enhancement with Counterfactual Analysis(CARE CA) framework to
+enhance causal reasoning and explainability. The proposed framework
+incorporates an explicit causal detection module with ConceptNet and
+counterfactual statements, as well as implicit causal detection through LLMs.
+Our framework goes one step further with a layer of counterfactual explanations
+to accentuate LLMs understanding of causality. The knowledge from ConceptNet
+enhances the performance of multiple causal reasoning tasks such as causal
+discovery, causal identification and counterfactual reasoning. The
+counterfactual sentences add explicit knowledge of the not caused by scenarios.
+By combining these powerful modules, our model aims to provide a deeper
+understanding of causal relationships, enabling enhanced interpretability.
+Evaluation of benchmark datasets shows improved performance across all metrics,
+such as accuracy, precision, recall, and F1 scores. We also introduce
+CausalNet, a new dataset accompanied by our code, to facilitate further
+research in this domain.
 
-Diffusion language models have emerged as a promising approach for text
-generation. One would naturally expect this method to be an efficient
-replacement for autoregressive models since multiple tokens can be sampled in
-parallel during each diffusion step. However, its efficiency-accuracy trade-off
-is not yet well understood. In this paper, we present a rigorous theoretical
-analysis of a widely used type of diffusion language model, the Masked
-Diffusion Model (MDM), and find that its effectiveness heavily depends on the
-target evaluation metric. Under mild conditions, we prove that when using
-perplexity as the metric, MDMs can achieve near-optimal perplexity in sampling
-steps regardless of sequence length, demonstrating that efficiency can be
-achieved without sacrificing performance. However, when using the sequence
-error rate--which is important for understanding the "correctness" of a
-sequence, such as a reasoning chain--we show that the required sampling steps
-must scale linearly with sequence length to obtain "correct" sequences, thereby
-eliminating MDM's efficiency advantage over autoregressive models. Our analysis
-establishes the first theoretical foundation for understanding the benefits and
-limitations of MDMs. All theoretical findings are supported by empirical
-studies.
+摘要：隨著大型語言模型 (LLM) 的興起，了解它們在解碼和解釋語言所蘊含的複雜因果關係網路中的能力和限制變得至關重要。目前的技術使用明確或隱含的因果推理，但強烈需要一種統一的方法，結合兩者以更有效地處理廣泛的因果關係。本研究提出了一種稱為情境感知推理增強與反事實分析 (CARE CA) 框架的新架構，以增強因果推理和可解釋性。提出的框架結合了使用 ConceptNet 和反事實陳述的明確因果檢測模組，以及透過 LLM 進行的隱含因果檢測。我們的框架更進一步，加入一層反事實解釋，以強調 LLM 對因果關係的理解。來自 ConceptNet 的知識增強了多項因果推理任務的執行，例如因果發現、因果識別和反事實推理。反事實句加入了未由情境造成的明確知識。透過結合這些強大的模組，我們的模型旨在提供對因果關係更深入的理解，實現增強的可解釋性。基準資料集的評估顯示在所有指標（例如準確度、精確度、召回率和 F1 分數）上都有所提升。我們還引入了 CausalNet，一個新的資料集，並附上了我們的程式碼，以促進在這個領域的進一步研究。
 
-摘要：擴散語言模型已成為文字生成的一種有前途的方法。由於在每個擴散步驟期間可以並行採樣多個符號，因此人們自然會期望這種方法成為自迴歸模型的有效替代方案。然而，它的效率準確性權衡尚未得到很好的理解。在本文中，我們對廣泛使用的擴散語言模型類型，即遮罩擴散模型 (MDM) 進行了嚴格的理論分析，並發現其有效性在很大程度上取決於目標評估指標。在溫和條件下，我們證明了當使用困惑度作為指標時，MDM 可以無論序列長度如何，在採樣步驟中實現近乎最佳的困惑度，這表明可以在不犧牲性能的情況下實現效率。然而，當使用序列錯誤率（對於理解序列的「正確性」很重要，例如推理鏈）時，我們表明所需的採樣步驟必須隨著序列長度線性縮放才能獲得「正確」的序列，從而消除了 MDM 相對於自迴歸模型的效率優勢。我們的分析為理解 MDM 的優點和局限性建立了第一個理論基礎。所有理論發現都得到了實證研究的支持。
+##### **Artificial Intelligence and Diabetes Mellitus: An Inside Look Through the Retina**
+2402.18600v1 by Yasin Sadeghi Bazargani, Majid Mirzaei, Navid Sobhi, Mirsaeed Abdollahi, Ali Jafarizadeh, Siamak Pedrammehr, Roohallah Alizadehsani, Ru San Tan, Sheikh Mohammed Shariful Islam, U. Rajendra Acharya
+
+Diabetes mellitus (DM) predisposes patients to vascular complications.
+Retinal images and vasculature reflect the body's micro- and macrovascular
+health. They can be used to diagnose DM complications, including diabetic
+retinopathy (DR), neuropathy, nephropathy, and atherosclerotic cardiovascular
+disease, as well as forecast the risk of cardiovascular events. Artificial
+intelligence (AI)-enabled systems developed for high-throughput detection of DR
+using digitized retinal images have become clinically adopted. Beyond DR
+screening, AI integration also holds immense potential to address challenges
+associated with the holistic care of the patient with DM. In this work, we aim
+to comprehensively review the literature for studies on AI applications based
+on retinal images related to DM diagnosis, prognostication, and management. We
+will describe the findings of holistic AI-assisted diabetes care, including but
+not limited to DR screening, and discuss barriers to implementing such systems,
+including issues concerning ethics, data privacy, equitable access, and
+explainability. With the ability to evaluate the patient's health status vis a
+vis DM complication as well as risk prognostication of future cardiovascular
+complications, AI-assisted retinal image analysis has the potential to become a
+central tool for modern personalized medicine in patients with DM.
 
-##### **MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency**
-2502.09621v1 by Dongzhi Jiang, Renrui Zhang, Ziyu Guo, Yanwei Li, Yu Qi, Xinyan Chen, Liuhui Wang, Jianhan Jin, Claire Guo, Shen Yan, Bo Zhang, Chaoyou Fu, Peng Gao, Hongsheng Li
+摘要：糖尿病（DM）使患者容易出現血管併發症。
+視網膜影像和血管反映身體的微血管和巨血管健康狀況。它們可用於診斷糖尿病併發症，包括糖尿病視網膜病變（DR）、神經病變、腎病和動脈粥樣硬化性心血管疾病，以及預測心血管事件的風險。為使用數位化視網膜影像進行高通量 DR 檢測而開發的人工智慧（AI）啟用系統已在臨床採用。除了 DR 篩檢外，AI 整合也具有巨大的潛力來應對與糖尿病患者整體照護相關的挑戰。在這項工作中，我們旨在全面回顧基於視網膜影像的 AI 應用相關研究的文獻，這些研究與糖尿病的診斷、預後和管理有關。我們將描述整體 AI 輔助糖尿病照護的發現，包括但不限於 DR 篩檢，並討論實施此類系統的障礙，包括與倫理、資料隱私、公平存取和可解釋性有關的問題。透過評估患者的健康狀況，同時考量糖尿病併發症以及未來心血管併發症的風險預後，AI 輔助視網膜影像分析有潛力成為糖尿病患者現代化個人化醫療的中心工具。
 
-Answering questions with Chain-of-Thought (CoT) has significantly enhanced
-the reasoning capabilities of Large Language Models (LLMs), yet its impact on
-Large Multimodal Models (LMMs) still lacks a systematic assessment and in-depth
-investigation. In this paper, we introduce MME-CoT, a specialized benchmark
-evaluating the CoT reasoning performance of LMMs, spanning six domains: math,
-science, OCR, logic, space-time, and general scenes. As the first comprehensive
-study in this area, we propose a thorough evaluation suite incorporating three
-novel metrics that assess the reasoning quality, robustness, and efficiency at
-a fine-grained level. Leveraging curated high-quality data and a unique
-evaluation strategy, we conduct an in-depth analysis of state-of-the-art LMMs,
-uncovering several key insights: 1) Models with reflection mechanism
-demonstrate a superior CoT quality, with Kimi k1.5 outperforming GPT-4o and
-demonstrating the highest quality results; 2) CoT prompting often degrades LMM
-performance on perception-heavy tasks, suggesting a potentially harmful
-overthinking behavior; and 3) Although the CoT quality is high, LMMs with
-reflection exhibit significant inefficiency in both normal response and
-self-correction phases. We hope MME-CoT serves as a foundation for advancing
-multimodal reasoning in LMMs. Project Page: https://mmecot.github.io/
+##### **Multi-stakeholder Perspective on Responsible Artificial Intelligence and Acceptability in Education**
+2402.15027v2 by A. J. Karran, P. Charland, J-T. Martineau, A. Ortiz de Guinea Lopez de Arana, AM. Lesage, S. Senecal, P-M. Leger
 
-摘要：<paragraph>透過思維鏈（CoT）回答問題，大幅提升了大型語言模型（LLM）的推理能力，但其對大型多模態模型（LMM）的影響仍缺乏系統性的評估和深入探討。在本文中，我們引入了 MME-CoT，一個專門的基準測試，用於評估 LMM 的 CoT 推理效能，涵蓋六個領域：數學、科學、OCR、邏輯、時空和一般場景。作為該領域的第一個全面性研究，我們提出了一個全面的評估套件，包含三個創新的指標，用於評估推理品質、穩健性和效率，並達到細微的層級。透過利用策展的高品質資料和獨特的評估策略，我們對最先進的 LMM 進行深入分析，發現了幾個關鍵見解：1）具有反思機制的模型展現出優異的 CoT 品質，其中 Kimi k1.5 優於 GPT-4o，並展現出最高品質的結果；2）CoT 提示通常會降低 LMM 在感知密集任務上的效能，這表示潛在有害的過度思考行為；3）儘管 CoT 品質很高，但具有反思能力的 LMM 在一般回應和自我修正階段都展現出顯著的低效率。我們希望 MME-CoT 能作為促進 LMM 中多模態推理的基礎。專案頁面：https://mmecot.github.io/</paragraph>
+This study investigates the acceptability of different artificial
+intelligence (AI) applications in education from a multi-stakeholder
+perspective, including students, teachers, and parents. Acknowledging the
+transformative potential of AI in education, it addresses concerns related to
+data privacy, AI agency, transparency, explainability and the ethical
+deployment of AI. Through a vignette methodology, participants were presented
+with four scenarios where AI's agency, transparency, explainability, and
+privacy were manipulated. After each scenario, participants completed a survey
+that captured their perceptions of AI's global utility, individual usefulness,
+justice, confidence, risk, and intention to use each scenario's AI if
+available. The data collection comprising a final sample of 1198
+multi-stakeholder participants was distributed through a partner institution
+and social media campaigns and focused on individual responses to four AI use
+cases. A mediation analysis of the data indicated that acceptance and trust in
+AI varies significantly across stakeholder groups. We found that the key
+mediators between high and low levels of AI's agency, transparency, and
+explainability, as well as the intention to use the different educational AI,
+included perceived global utility, justice, and confidence. The study
+highlights that the acceptance of AI in education is a nuanced and multifaceted
+issue that requires careful consideration of specific AI applications and their
+characteristics, in addition to the diverse stakeholders' perceptions.
 
-##### **Exploring the Potential of Encoder-free Architectures in 3D LMMs**
-2502.09620v1 by Yiwen Tang, Zoey Guo, Zhuhao Wang, Ray Zhang, Qizhi Chen, Junli Liu, Delin Qu, Zhigang Wang, Dong Wang, Xuelong Li, Bin Zhao
+摘要：這項研究從多個利害關係人的角度探討不同的人工智慧 (AI) 應用在教育上的可接受性，包括學生、老師和家長。承認 AI 在教育上的轉型潛力，它解決了與資料隱私、AI 代理、透明度、可解釋性和 AI 的道德部署相關的疑慮。透過小插曲方法，參與者被呈現了四種情境，其中 AI 的代理、透明度、可解釋性和隱私受到操縱。在每個情境後，參與者完成了一項調查，該調查捕捉了他們對 AI 的整體效用、個人效用、正義、信心、風險和如果可用，使用每個情境的 AI 的意圖的看法。資料蒐集包含來自合作機構和社群媒體活動的 1198 位多利害關係人參與者的最終樣本，並專注於對四個 AI 使用案例的個別回應。對資料的調解分析表明，對 AI 的接受度和信任在利害關係人團體之間有顯著差異。我們發現，AI 的代理、透明度和可解釋性高低程度之間的關鍵調解者，以及使用不同教育 AI 的意圖，包括感知到的整體效用、正義和信心。這項研究強調，接受 AI 在教育上的應用是一個微妙且多面向的問題，除了不同的利害關係人的看法外，還需要仔細考慮具體的 AI 應用及其特徵。
 
-Encoder-free architectures have been preliminarily explored in the 2D visual
-domain, yet it remains an open question whether they can be effectively applied
-to 3D understanding scenarios. In this paper, we present the first
-comprehensive investigation into the potential of encoder-free architectures to
-overcome the challenges of encoder-based 3D Large Multimodal Models (LMMs).
-These challenges include the failure to adapt to varying point cloud
-resolutions and the point features from the encoder not meeting the semantic
-needs of Large Language Models (LLMs). We identify key aspects for 3D LMMs to
-remove the encoder and enable the LLM to assume the role of the 3D encoder: 1)
-We propose the LLM-embedded Semantic Encoding strategy in the pre-training
-stage, exploring the effects of various point cloud self-supervised losses. And
-we present the Hybrid Semantic Loss to extract high-level semantics. 2) We
-introduce the Hierarchical Geometry Aggregation strategy in the instruction
-tuning stage. This incorporates inductive bias into the LLM early layers to
-focus on the local details of the point clouds. To the end, we present the
-first Encoder-free 3D LMM, ENEL. Our 7B model rivals the current
-state-of-the-art model, ShapeLLM-13B, achieving 55.0%, 50.92%, and 42.7% on the
-classification, captioning, and VQA tasks, respectively. Our results
-demonstrate that the encoder-free architecture is highly promising for
-replacing encoder-based architectures in the field of 3D understanding. The
-code is released at https://github.com/Ivan-Tang-3D/ENEL
+##### **Deciphering Heartbeat Signatures: A Vision Transformer Approach to Explainable Atrial Fibrillation Detection from ECG Signals**
+2402.09474v2 by Aruna Mohan, Danne Elbers, Or Zilbershot, Fatemeh Afghah, David Vorchheimer
 
-摘要：<paragraph>編碼器免費架構已在 2D 視覺領域中初步探索，但它們是否能有效應用於 3D 理解場景仍是一個開放的問題。在本文中，我們提出了對編碼器免費架構潛力的首次全面調查，以克服基於編碼器的 3D 大型多模態模型 (LMM) 的挑戰。這些挑戰包括無法適應不同的點雲解析度，且來自編碼器的點特徵無法滿足大型語言模型 (LLM) 的語義需求。我們識別出 3D LMM 的關鍵方面，以移除編碼器並讓 LLM 承擔 3D 編碼器的角色：1) 我們在預訓練階段提出 LLM 嵌入式語義編碼策略，探索各種點雲自我監督損失的影響。我們提出混合語義損失來提取高階語義。2) 我們在指令調整階段引入分層幾何聚合策略。這將歸納偏差納入 LLM 早期層，以專注於點雲的局部細節。最後，我們提出第一個無編碼器 3D LMM，ENEL。我們的 7B 模型與當前最先進的模型 ShapeLLM-13B 相媲美，分別在分類、字幕和 VQA 任務中達到 55.0%、50.92% 和 42.7%。我們的結果表明，無編碼器架構極有望取代基於編碼器的架構在 3D 理解領域的應用。程式碼發布於 https://github.com/Ivan-Tang-3D/ENEL</paragraph>
+Remote patient monitoring based on wearable single-lead electrocardiogram
+(ECG) devices has significant potential for enabling the early detection of
+heart disease, especially in combination with artificial intelligence (AI)
+approaches for automated heart disease detection. There have been prior studies
+applying AI approaches based on deep learning for heart disease detection.
+However, these models are yet to be widely accepted as a reliable aid for
+clinical diagnostics, in part due to the current black-box perception
+surrounding many AI algorithms. In particular, there is a need to identify the
+key features of the ECG signal that contribute toward making an accurate
+diagnosis, thereby enhancing the interpretability of the model. In the present
+study, we develop a vision transformer approach to identify atrial fibrillation
+based on single-lead ECG data. A residual network (ResNet) approach is also
+developed for comparison with the vision transformer approach. These models are
+applied to the Chapman-Shaoxing dataset to classify atrial fibrillation, as
+well as another common arrhythmia, sinus bradycardia, and normal sinus rhythm
+heartbeats. The models enable the identification of the key regions of the
+heartbeat that determine the resulting classification, and highlight the
+importance of P-waves and T-waves, as well as heartbeat duration and signal
+amplitude, in distinguishing normal sinus rhythm from atrial fibrillation and
+sinus bradycardia.
 
-##### **DexTrack: Towards Generalizable Neural Tracking Control for Dexterous Manipulation from Human References**
-2502.09614v1 by Xueyi Liu, Jianibieke Adalibieke, Qianwei Han, Yuzhe Qin, Li Yi
+摘要：<paragraph>基於可穿戴式單導程心電圖 (ECG) 裝置的遠端病患監測在早期偵測心臟疾病方面具有顯著的潛力，特別是與用於自動化心臟疾病偵測的人工智慧 (AI) 方法結合使用時。先前已有研究應用基於深度學習的 AI 方法進行心臟疾病偵測。然而，這些模型尚未被廣泛接受為臨床診斷的可靠輔助工具，部分原因在於圍繞許多 AI 演算法的當前黑箱感知。特別是，有必要找出有助於做出準確診斷的 ECG 訊號關鍵特徵，從而增強模型的可解釋性。在本研究中，我們開發了一種視覺轉換器方法，以根據單導程 ECG 資料找出心房顫動。殘差網路 (ResNet) 方法也已開發出來，以便與視覺轉換器方法進行比較。這些模型應用於 Chapman-Shaoxing 資料集，以分類心房顫動，以及另一種常見的心律不整，竇性心動過緩，和正常竇性心律的心跳。這些模型能夠找出決定最終分類的心跳關鍵區域，並強調 P 波和 T 波，以及心跳持續時間和訊號振幅在區分正常竇性心律與心房顫動和竇性心動過緩方面的重要性。</paragraph>
 
-We address the challenge of developing a generalizable neural tracking
-controller for dexterous manipulation from human references. This controller
-aims to manage a dexterous robot hand to manipulate diverse objects for various
-purposes defined by kinematic human-object interactions. Developing such a
-controller is complicated by the intricate contact dynamics of dexterous
-manipulation and the need for adaptivity, generalizability, and robustness.
-Current reinforcement learning and trajectory optimization methods often fall
-short due to their dependence on task-specific rewards or precise system
-models. We introduce an approach that curates large-scale successful robot
-tracking demonstrations, comprising pairs of human references and robot
-actions, to train a neural controller. Utilizing a data flywheel, we
-iteratively enhance the controller's performance, as well as the number and
-quality of successful tracking demonstrations. We exploit available tracking
-demonstrations and carefully integrate reinforcement learning and imitation
-learning to boost the controller's performance in dynamic environments. At the
-same time, to obtain high-quality tracking demonstrations, we individually
-optimize per-trajectory tracking by leveraging the learned tracking controller
-in a homotopy optimization method. The homotopy optimization, mimicking
-chain-of-thought, aids in solving challenging trajectory tracking problems to
-increase demonstration diversity. We showcase our success by training a
-generalizable neural controller and evaluating it in both simulation and real
-world. Our method achieves over a 10% improvement in success rates compared to
-leading baselines. The project website with animated results is available at
-https://meowuu7.github.io/DexTrack/.
 
-摘要：<paragraph>我們解決了從人類參照中開發靈巧操作通用神經追蹤控制器的挑戰。此控制器旨在管理靈巧機器人手，以操作各種物體，以實現由運動學人機互動定義的各種目的。由於靈巧操作的複雜接觸動力學以及對適應性、通用性和魯棒性的需求，開發此類控制器很複雜。目前的強化學習和軌跡優化方法通常由於依賴於特定任務的獎勵或精確的系統模型而表現不佳。我們引入了一種方法，它策劃了大規模成功的機器人追蹤示範，包括人體參照和機器人動作對，以訓練神經控制器。利用數據飛輪，我們反覆增強控制器的性能，以及成功追蹤示範的數量和品質。我們利用可用的追蹤示範，並仔細整合強化學習和模仿學習，以提升控制器在動態環境中的性能。同時，為了獲得高品質的追蹤示範，我們透過在同倫優化方法中利用已學習的追蹤控制器，個別優化每個軌跡的追蹤。同倫優化模擬思考鏈，有助於解決具有挑戰性的軌跡追蹤問題，以增加示範的多樣性。我們展示了我們在訓練通用神經控制器並在模擬和真實世界中評估它的成功。與領先的基準相比，我們的模型在成功率方面提高了 10% 以上。包含動畫結果的專案網站可在 https://meowuu7.github.io/DexTrack/ 取得。</paragraph>
+### Medical
+|Publish Date|Title|Authors|Homepage|Code|
+| :---: | :---: | :---: | :---: | :---: |
+|**2025-02-13**|**Metamorphic Testing for Pose Estimation Systems**|Matias Duran et.al.|[2502.09460v1](http://arxiv.org/abs/2502.09460v1)|null|
+|**2025-02-13**|**The Joint Entity-Relation Extraction Model Based on Span and Interactive Fusion Representation for Chinese Medical Texts with Complex Semantics**|Danni Feng et.al.|[2502.09247v1](http://arxiv.org/abs/2502.09247v1)|null|
+|**2025-02-13**|**From large language models to multimodal AI: A scoping review on the potential of generative AI in medicine**|Lukas Buess et.al.|[2502.09242v1](http://arxiv.org/abs/2502.09242v1)|null|
+|**2025-02-13**|**Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**|Flavio Bertini et.al.|[2502.09218v1](http://arxiv.org/abs/2502.09218v1)|null|
+|**2025-02-13**|**Logical Lease Litigation: Prolog and LLMs for Rental Law Compliance in New York**|Sanskar Sehgal et.al.|[2502.09204v1](http://arxiv.org/abs/2502.09204v1)|null|
+|**2025-02-13**|**Two-Stage Representation Learning for Analyzing Movement Behavior Dynamics in People Living with Dementia**|Jin Cui et.al.|[2502.09173v1](http://arxiv.org/abs/2502.09173v1)|null|
+|**2025-02-12**|**HistoSmith: Single-Stage Histology Image-Label Generation via Conditional Latent Diffusion for Enhanced Cell Segmentation and Classification**|Valentina Vadori et.al.|[2502.08754v1](http://arxiv.org/abs/2502.08754v1)|null|
+|**2025-02-12**|**Brain Latent Progression: Individual-based Spatiotemporal Disease Progression on 3D Brain MRIs via Latent Diffusion**|Lemuel Puglisi et.al.|[2502.08560v1](http://arxiv.org/abs/2502.08560v1)|[link](https://github.com/lemuelpuglisi/brlp)|
+|**2025-02-12**|**Representation Learning to Advance Multi-institutional Studies with Electronic Health Record Data**|Doudou Zhou et.al.|[2502.08547v1](http://arxiv.org/abs/2502.08547v1)|null|
+|**2025-02-12**|**EEG Artifact Detection and Correction with Deep Autoencoders**|David Aquilué-Llorens et.al.|[2502.08686v1](http://arxiv.org/abs/2502.08686v1)|null|
+|**2025-02-12**|**SycEval: Evaluating LLM Sycophancy**|Aaron Fanous et.al.|[2502.08177v1](http://arxiv.org/abs/2502.08177v1)|null|
+|**2025-02-11**|**Caught in the Web of Words: Do LLMs Fall for Spin in Medical Literature?**|Hye Sun Yun et.al.|[2502.07963v1](http://arxiv.org/abs/2502.07963v1)|null|
+|**2025-02-11**|**An Advanced NLP Framework for Automated Medical Diagnosis with DeBERTa and Dynamic Contextual Positional Gating**|Mohammad Ali Labbaf Khaniki et.al.|[2502.07755v1](http://arxiv.org/abs/2502.07755v1)|null|
+|**2025-02-11**|**Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension**|Wenbo Gong et.al.|[2502.07752v1](http://arxiv.org/abs/2502.07752v1)|null|
+|**2025-02-11**|**The Devil is in the Prompts: De-Identification Traces Enhance Memorization Risks in Synthetic Chest X-Ray Generation**|Raman Dutt et.al.|[2502.07516v1](http://arxiv.org/abs/2502.07516v1)|[link](https://github.com/Raman1121/diffusion_memorization)|
+|**2025-02-11**|**KPIs 2024 Challenge: Advancing Glomerular Segmentation from Patch- to Slide-Level**|Ruining Deng et.al.|[2502.07288v1](http://arxiv.org/abs/2502.07288v1)|[link](https://github.com/agaldran/kpis)|
+|**2025-02-11**|**Early Risk Prediction of Pediatric Cardiac Arrest from Electronic Health Records via Multimodal Fused Transformer**|Jiaying Lu et.al.|[2502.07158v1](http://arxiv.org/abs/2502.07158v1)|null|
+|**2025-02-11**|**Explaining 3D Computed Tomography Classifiers with Counterfactuals**|Joseph Paul Cohen et.al.|[2502.07156v1](http://arxiv.org/abs/2502.07156v1)|[link](https://github.com/ieee8023/ct-counterfactuals)|
+|**2025-02-10**|**Interactive Data Harmonization with LLM Agents**|Aécio Santos et.al.|[2502.07132v1](http://arxiv.org/abs/2502.07132v1)|null|
+|**2025-02-10**|**Machine Learning for Everyone: Simplifying Healthcare Analytics with BigQuery ML**|Mohammad Amir Salari et.al.|[2502.07026v1](http://arxiv.org/abs/2502.07026v1)|null|
+|**2025-02-10**|**AIMS.au: A Dataset for the Analysis of Modern Slavery Countermeasures in Corporate Statements**|Adriana Eufrosiana Bora et.al.|[2502.07022v1](http://arxiv.org/abs/2502.07022v1)|null|
+|**2025-02-10**|**Recent Advances, Applications and Open Challenges in Machine Learning for Health: Reflections from Research Roundtables at ML4H 2024 Symposium**|Amin Adibi et.al.|[2502.06693v1](http://arxiv.org/abs/2502.06693v1)|null|
+|**2025-02-10**|**Automatic Evaluation of Healthcare LLMs Beyond Question-Answering**|Anna Arias-Duart et.al.|[2502.06666v1](http://arxiv.org/abs/2502.06666v1)|null|
+|**2025-02-10**|**Few-Shot Classification and Anatomical Localization of Tissues in SPECT Imaging**|Mohammed Abdul Hafeez Khan et.al.|[2502.06632v1](http://arxiv.org/abs/2502.06632v1)|null|
+|**2025-02-10**|**Illegal Waste Detection in Remote Sensing Images: A Case Study**|Federico Gibellini et.al.|[2502.06607v2](http://arxiv.org/abs/2502.06607v2)|null|
+|**2025-02-10**|**FEMBA: Efficient and Scalable EEG Analysis with a Bidirectional Mamba Foundation Model**|Anna Tegon et.al.|[2502.06438v1](http://arxiv.org/abs/2502.06438v1)|null|
+|**2025-02-10**|**Is an Ultra Large Natural Image-Based Foundation Model Superior to a Retina-Specific Model for Detecting Ocular and Systemic Diseases?**|Qingshan Hou et.al.|[2502.06289v1](http://arxiv.org/abs/2502.06289v1)|null|
+|**2025-02-10**|**Integrating Sequence and Image Modeling in Irregular Medical Time Series Through Self-Supervised Learning**|Liuqing Chen et.al.|[2502.06134v1](http://arxiv.org/abs/2502.06134v1)|null|
+|**2025-02-10**|**Foundation Model of Electronic Medical Records for Adaptive Risk Estimation**|Pawel Renc et.al.|[2502.06124v1](http://arxiv.org/abs/2502.06124v1)|null|
+|**2025-02-10**|**Can ChatGPT Diagnose Alzheimer's Disease?**|Quoc-Toan Nguyen et.al.|[2502.06907v1](http://arxiv.org/abs/2502.06907v1)|null|
+|**2025-02-09**|**Protecting Intellectual Property of EEG-based Neural Networks with Watermarking**|Ahmed Abdelaziz et.al.|[2502.05931v1](http://arxiv.org/abs/2502.05931v1)|[link](https://github.com/Prog-Jacob/watermarking-eeg-models)|
+|**2025-02-09**|**Enhancing Depression Detection with Chain-of-Thought Prompting: From Emotion to Reasoning Using Large Language Models**|Shiyu Teng et.al.|[2502.05879v1](http://arxiv.org/abs/2502.05879v1)|null|
+|**2025-02-09**|**LLMs for Drug-Drug Interaction Prediction: A Comprehensive Comparison**|Gabriele De Vito et.al.|[2502.06890v1](http://arxiv.org/abs/2502.06890v1)|null|
+|**2025-02-09**|**Decoding Complexity: Intelligent Pattern Exploration with CHPDA (Context Aware Hybrid Pattern Detection Algorithm)**|Lokesh Koli et.al.|[2502.07815v1](http://arxiv.org/abs/2502.07815v1)|null|
+|**2025-02-09**|**WatchGuardian: Enabling User-Defined Personalized Just-in-Time Intervention on Smartwatch**|Ying Lei et.al.|[2502.05783v1](http://arxiv.org/abs/2502.05783v1)|null|
+|**2025-02-09**|**RECOVER: Designing a Large Language Model-based Remote Patient Monitoring System for Postoperative Gastrointestinal Cancer Care**|Ziqi Yang et.al.|[2502.05740v1](http://arxiv.org/abs/2502.05740v1)|null|
+|**2025-02-08**|**4D VQ-GAN: Synthesising Medical Scans at Any Time Point for Personalised Disease Progression Modelling of Idiopathic Pulmonary Fibrosis**|An Zhao et.al.|[2502.05713v1](http://arxiv.org/abs/2502.05713v1)|null|
+|**2025-02-08**|**KMI: A Dataset of Korean Motivational Interviewing Dialogues for Psychotherapy**|Hyunjong Kim et.al.|[2502.05651v1](http://arxiv.org/abs/2502.05651v1)|null|
+|**2025-02-08**|**ELMTEX: Fine-Tuning Large Language Models for Structured Clinical Information Extraction. A Case Study on Clinical Reports**|Aynur Guluzade et.al.|[2502.05638v1](http://arxiv.org/abs/2502.05638v1)|[link](https://gitlab.cc-asp.fraunhofer.de/health-open/elmtex)|
+|**2025-02-08**|**Multi-scale Masked Autoencoder for Electrocardiogram Anomaly Detection**|Ya Zhou et.al.|[2502.05494v1](http://arxiv.org/abs/2502.05494v1)|null|
+|**2025-02-08**|**DCENWCNet: A Deep CNN Ensemble Network for White Blood Cell Classification with LIME-Based Explainability**|Sibasish Dhibar et.al.|[2502.05459v1](http://arxiv.org/abs/2502.05459v1)|null|
+|**2025-02-07**|**Multi-Class Segmentation of Aortic Branches and Zones in Computed Tomography Angiography: The AortaSeg24 Challenge**|Muhammad Imran et.al.|[2502.05330v1](http://arxiv.org/abs/2502.05330v1)|[link](https://github.com/MaxwellEng/MICCAI_CHANLLENGE24_HJL)|
+|**2025-02-07**|**Homeomorphism Prior for False Positive and Negative Problem in Medical Image Dense Contrastive Representation Learning**|Yuting He et.al.|[2502.05282v1](http://arxiv.org/abs/2502.05282v1)|null|
+|**2025-02-07**|**"It Felt Like I Was Left in the Dark": Exploring Information Needs and Design Opportunities for Family Caregivers of Older Adult Patients in Critical Care Settings**|Shihan Fu et.al.|[2502.05115v1](http://arxiv.org/abs/2502.05115v1)|null|
+|**2025-02-07**|**Mitigating Unintended Memorization with LoRA in Federated Learning for LLMs**|Thierry Bossy et.al.|[2502.05087v1](http://arxiv.org/abs/2502.05087v1)|[link](https://github.com/tuneinsight/federated-llms)|
+|**2025-02-07**|**MedMimic: Physician-Inspired Multimodal Fusion for Early Diagnosis of Fever of Unknown Origin**|Minrui Chen et.al.|[2502.04794v1](http://arxiv.org/abs/2502.04794v1)|null|
+|**2025-02-06**|**MedGNN: Towards Multi-resolution Spatiotemporal Graph Learning for Medical Time Series Classification**|Wei Fan et.al.|[2502.04515v1](http://arxiv.org/abs/2502.04515v1)|[link](https://github.com/aikunyi/MedGNN)|
+|**2025-02-06**|**Integrating Generative Artificial Intelligence in ADRD: A Framework for Streamlining Diagnosis and Care in Neurodegenerative Diseases**|Andrew G. Breithaupt et.al.|[2502.06842v1](http://arxiv.org/abs/2502.06842v1)|null|
+|**2025-02-06**|**Primary Care Diagnoses as a Reliable Predictor for Orthopedic Surgical Interventions**|Khushboo Verma et.al.|[2502.04423v1](http://arxiv.org/abs/2502.04423v1)|null|
+|**2025-02-06**|**Automatic quantification of breast cancer biomarkers from multiple 18F-FDG PET image segmentation**|Tewele W. Tareke et.al.|[2502.04083v1](http://arxiv.org/abs/2502.04083v1)|null|
+|**2025-02-06**|**Generalize Drug Response Prediction by Latent Independent Projection for Asymmetric Constrained Domain Generalization**|Ran Song et.al.|[2502.04034v1](http://arxiv.org/abs/2502.04034v1)|null|
+|**2025-02-06**|**MedRAG: Enhancing Retrieval-augmented Generation with Knowledge Graph-Elicited Reasoning for Healthcare Copilot**|Xuejiao Zhao et.al.|[2502.04413v1](http://arxiv.org/abs/2502.04413v1)|[link](https://github.com/snowteam2023/medrag)|
+|**2025-02-06**|**Transforming Multimodal Models into Action Models for Radiotherapy**|Matteo Ferrante et.al.|[2502.04408v1](http://arxiv.org/abs/2502.04408v1)|null|
+|**2025-02-06**|**Online Location Planning for AI-Defined Vehicles: Optimizing Joint Tasks of Order Serving and Spatio-Temporal Heterogeneous Model Fine-Tuning**|Bokeng Zheng et.al.|[2502.04399v1](http://arxiv.org/abs/2502.04399v1)|null|
+|**2025-02-06**|**Multimodal Medical Code Tokenizer**|Xiaorui Su et.al.|[2502.04397v2](http://arxiv.org/abs/2502.04397v2)|null|
+|**2025-02-06**|**A Retrospective Systematic Study on Hierarchical Sparse Query Transformer-assisted Ultrasound Screening for Early Hepatocellular Carcinoma**|Chaoyin She et.al.|[2502.03772v1](http://arxiv.org/abs/2502.03772v1)|[link](https://github.com/Asunatan/HSQformer)|
+|**2025-02-05**|**Towards Fair Medical AI: Adversarial Debiasing of 3D CT Foundation Embeddings**|Guangyao Zheng et.al.|[2502.04386v1](http://arxiv.org/abs/2502.04386v1)|[link](https://github.com/BioIntelligence-Lab/VAE-Adversarial-Debiasing)|
+|**2025-02-05**|**Clinically-Inspired Hierarchical Multi-Label Classification of Chest X-rays with a Penalty-Based Loss Function**|Mehrdad Asadi et.al.|[2502.03591v1](http://arxiv.org/abs/2502.03591v1)|[link](https://github.com/the-mercury/CIHMLC)|
+|**2025-02-05**|**Code Simulation as a Proxy for High-order Tasks in Large Language Models**|Emanuele La Malfa et.al.|[2502.03568v1](http://arxiv.org/abs/2502.03568v1)|null|
+|**2025-02-05**|**Limitations of Large Language Models in Clinical Problem-Solving Arising from Inflexible Reasoning**|Jonathan Kim et.al.|[2502.04381v1](http://arxiv.org/abs/2502.04381v1)|null|
+|**2025-02-05**|**Accurate AI-Driven Emergency Vehicle Location Tracking in Healthcare ITS Digital Twin**|Sarah Al-Shareeda et.al.|[2502.03396v1](http://arxiv.org/abs/2502.03396v1)|null|
+|**2025-02-05**|**RadVLM: A Multitask Conversational Vision-Language Model for Radiology**|Nicolas Deperrois et.al.|[2502.03333v1](http://arxiv.org/abs/2502.03333v1)|null|
+|**2025-02-05**|**MeDiSumQA: Patient-Oriented Question-Answer Generation from Discharge Letters**|Amin Dada et.al.|[2502.03298v1](http://arxiv.org/abs/2502.03298v1)|null|
+|**2025-02-05**|**Deep Learning Pipeline for Fully Automated Myocardial Infarct Segmentation from Clinical Cardiac MR Scans**|Matthias Schwab et.al.|[2502.03272v1](http://arxiv.org/abs/2502.03272v1)|null|
+|**2025-02-05**|**Long-tailed Medical Diagnosis with Relation-aware Representation Learning and Iterative Classifier Calibration**|Li Pan et.al.|[2502.03238v2](http://arxiv.org/abs/2502.03238v2)|[link](https://github.com/peterlipan/lmd)|
+|**2025-02-05**|**Fine-Tuning Strategies for Continual Online EEG Motor Imagery Decoding: Insights from a Large-Scale Longitudinal Study**|Martin Wimpff et.al.|[2502.06828v1](http://arxiv.org/abs/2502.06828v1)|null|
+|**2025-02-05**|**MedBioLM: Optimizing Medical and Biological QA with Fine-Tuned Large Language Models and Retrieval-Augmented Generation**|Seonok Kim et.al.|[2502.03004v1](http://arxiv.org/abs/2502.03004v1)|null|
+|**2025-02-05**|**Contrastive Token-level Explanations for Graph-based Rumour Detection**|Daniel Wai Kit Chin et.al.|[2502.04366v1](http://arxiv.org/abs/2502.04366v1)|null|
+|**2025-02-05**|**AI-Based Thermal Video Analysis in Privacy-Preserving Healthcare: A Case Study on Detecting Time of Birth**|Jorge García-Torres et.al.|[2502.04365v1](http://arxiv.org/abs/2502.04365v1)|null|
+|**2025-02-04**|**3D Foundation AI Model for Generalizable Disease Detection in Head Computed Tomography**|Weicheng Zhu et.al.|[2502.02779v1](http://arxiv.org/abs/2502.02779v1)|null|
+|**2025-02-04**|**Adaptive Voxel-Weighted Loss Using L1 Norms in Deep Neural Networks for Detection and Segmentation of Prostate Cancer Lesions in PET/CT Images**|Obed Korshie Dzikunu et.al.|[2502.02756v1](http://arxiv.org/abs/2502.02756v1)|[link](https://github.com/obeddzik/pca_segment)|
+|**2025-02-04**|**Diffusion Instruction Tuning**|Chen Jin et.al.|[2502.06814v1](http://arxiv.org/abs/2502.06814v1)|null|
+|**2025-02-04**|**MedRAX: Medical Reasoning Agent for Chest X-ray**|Adibvafa Fallahpour et.al.|[2502.02673v1](http://arxiv.org/abs/2502.02673v1)|[link](https://github.com/bowang-lab/medrax)|
+|**2025-02-04**|**Open Foundation Models in Healthcare: Challenges, Paradoxes, and Opportunities with GenAI Driven Personalized Prescription**|Mahdi Alkaeed et.al.|[2502.04356v1](http://arxiv.org/abs/2502.04356v1)|null|
+|**2025-02-04**|**Decision Theoretic Foundations for Conformal Prediction: Optimal Uncertainty Quantification for Risk-Averse Agents**|Shayan Kiyani et.al.|[2502.02561v1](http://arxiv.org/abs/2502.02561v1)|null|
+|**2025-02-04**|**CoRPA: Adversarial Image Generation for Chest X-rays Using Concept Vector Perturbations and Generative Models**|Amy Rafferty et.al.|[2502.05214v1](http://arxiv.org/abs/2502.05214v1)|null|
+|**2025-02-04**|**A Self-Supervised Framework for Improved Generalisability in Ultrasound B-mode Image Segmentation**|Edward Ellis et.al.|[2502.02489v1](http://arxiv.org/abs/2502.02489v1)|null|
+|**2025-02-04**|**Medical Multimodal Model Stealing Attacks via Adversarial Domain Alignment**|Yaling Shen et.al.|[2502.02438v1](http://arxiv.org/abs/2502.02438v1)|null|
+|**2025-02-04**|**Test Time Training for 4D Medical Image Interpolation**|Qikang Zhang et.al.|[2502.02341v1](http://arxiv.org/abs/2502.02341v1)|[link](https://github.com/chaostheproducer/ttt4d)|
+|**2025-02-04**|**Conversation AI Dialog for Medicare powered by Finetuning and Retrieval Augmented Generation**|Atharva Mangeshkumar Agrawal et.al.|[2502.02249v1](http://arxiv.org/abs/2502.02249v1)|null|
+|**2025-02-04**|**Deep Learning-Based Facial Expression Recognition for the Elderly: A Systematic Review**|F. Xavier Gaya-Morey et.al.|[2502.02618v1](http://arxiv.org/abs/2502.02618v1)|null|
+|**2025-02-04**|**Causally-informed Deep Learning towards Explainable and Generalizable Outcomes Prediction in Critical Care**|Yuxiao Cheng et.al.|[2502.02109v1](http://arxiv.org/abs/2502.02109v1)|null|
+|**2025-02-04**|**JingFang: A Traditional Chinese Medicine Large Language Model of Expert-Level Medical Diagnosis and Syndrome Differentiation-Based Treatment**|Yehan Yan et.al.|[2502.04345v1](http://arxiv.org/abs/2502.04345v1)|null|
+|**2025-02-03**|**An Agentic AI Workflow for Detecting Cognitive Concerns in Real-world Data**|Jiazi Tian et.al.|[2502.01789v1](http://arxiv.org/abs/2502.01789v1)|null|
+|**2025-02-03**|**Can Domain Experts Rely on AI Appropriately? A Case Study on AI-Assisted Prostate Cancer MRI Diagnosis**|Chacha Chen et.al.|[2502.03482v1](http://arxiv.org/abs/2502.03482v1)|null|
+|**2025-02-03**|**Improving Transformer World Models for Data-Efficient RL**|Antoine Dedieu et.al.|[2502.01591v1](http://arxiv.org/abs/2502.01591v1)|null|
+|**2025-02-03**|**Data-Efficient Model for Psychological Resilience Prediction based on Neurological Data**|Zhi Zhang et.al.|[2502.01377v1](http://arxiv.org/abs/2502.01377v1)|null|
+|**2025-02-03**|**OphthBench: A Comprehensive Benchmark for Evaluating Large Language Models in Chinese Ophthalmology**|Chengfeng Zhou et.al.|[2502.01243v1](http://arxiv.org/abs/2502.01243v1)|null|
+|**2025-02-03**|**MIND: Modality-Informed Knowledge Distillation Framework for Multimodal Clinical Prediction Tasks**|Alejandro Guerra-Manzanares et.al.|[2502.01158v1](http://arxiv.org/abs/2502.01158v1)|null|
+|**2025-02-03**|**Beyond Yes or No: Predictive Compliance Monitoring Approaches for Quantifying the Magnitude of Compliance Violations**|Qian Chen et.al.|[2502.01141v1](http://arxiv.org/abs/2502.01141v1)|null|
+|**2025-02-03**|**Pulse-PPG: An Open-Source Field-Trained PPG Foundation Model for Wearable Applications Across Lab and Field Settings**|Mithun Saha et.al.|[2502.01108v1](http://arxiv.org/abs/2502.01108v1)|null|
+|**2025-02-03**|**Tutorial on Using Machine Learning and Deep Learning Models for Mental Illness Detection**|Yeyubei Zhang et.al.|[2502.04342v1](http://arxiv.org/abs/2502.04342v1)|null|
+|**2025-02-02**|**Agent-Based Uncertainty Awareness Improves Automated Radiology Report Labeling with an Open-Source Large Language Model**|Hadas Ben-Atya et.al.|[2502.01691v1](http://arxiv.org/abs/2502.01691v1)|null|
+|**2025-02-02**|**Automated Extraction of Spatio-Semantic Graphs for Identifying Cognitive Impairment**|Si-Ioi Ng et.al.|[2502.01685v1](http://arxiv.org/abs/2502.01685v1)|null|
+|**2025-02-02**|**Registration-Enhanced Segmentation Method for Prostate Cancer in Ultrasound Images**|Shengtian Sang et.al.|[2502.00712v1](http://arxiv.org/abs/2502.00712v1)|null|
+|**2025-02-02**|**TMI-CLNet: Triple-Modal Interaction Network for Chronic Liver Disease Prognosis From Imaging, Clinical, and Radiomic Data Fusion**|Linglong Wu et.al.|[2502.00695v1](http://arxiv.org/abs/2502.00695v1)|null|
+|**2025-02-02**|**Safety at Scale: A Comprehensive Survey of Large Model Safety**|Xingjun Ma et.al.|[2502.05206v2](http://arxiv.org/abs/2502.05206v2)|null|
+|**2025-02-02**|**Enhanced Convolutional Neural Networks for Improved Image Classification**|Xiaoran Yang et.al.|[2502.00663v1](http://arxiv.org/abs/2502.00663v1)|null|
+|**2025-02-02**|**Distribution-aware Fairness Learning in Medical Image Segmentation From A Control-Theoretic Perspective**|Yujin Oh et.al.|[2502.00619v1](http://arxiv.org/abs/2502.00619v1)|null|
+|**2025-02-01**|**Generating crossmodal gene expression from cancer histopathology improves multimodal AI predictions**|Samiran Dey et.al.|[2502.00568v3](http://arxiv.org/abs/2502.00568v3)|[link](https://github.com/Samiran-Dey/PathoGen)|
 
-##### **Score-of-Mixture Training: Training One-Step Generative Models Made Simple**
-2502.09609v1 by Tejas Jayashankar, J. Jon Ryu, Gregory Wornell
+#### Abstracts
+##### **Metamorphic Testing for Pose Estimation Systems**
+2502.09460v1 by Matias Duran, Thomas Laurent, Ellen Rushe, Anthony Ventresque
 
-We propose Score-of-Mixture Training (SMT), a novel framework for training
-one-step generative models by minimizing a class of divergences called the
-$\alpha$-skew Jensen-Shannon divergence. At its core, SMT estimates the score
-of mixture distributions between real and fake samples across multiple noise
-levels. Similar to consistency models, our approach supports both training from
-scratch (SMT) and distillation using a pretrained diffusion model, which we
-call Score-of-Mixture Distillation (SMD). It is simple to implement, requires
-minimal hyperparameter tuning, and ensures stable training. Experiments on
-CIFAR-10 and ImageNet 64x64 show that SMT/SMD are competitive with and can even
-outperform existing methods.
+Pose estimation systems are used in a variety of fields, from sports
+analytics to livestock care. Given their potential impact, it is paramount to
+systematically test their behaviour and potential for failure. This is a
+complex task due to the oracle problem and the high cost of manual labelling
+necessary to build ground truth keypoints. This problem is exacerbated by the
+fact that different applications require systems to focus on different subjects
+(e.g., human versus animal) or landmarks (e.g., only extremities versus whole
+body and face), which makes labelled test data rarely reusable. To combat these
+problems we propose MET-POSE, a metamorphic testing framework for pose
+estimation systems that bypasses the need for manual annotation while assessing
+the performance of these systems under different circumstances. MET-POSE thus
+allows users of pose estimation systems to assess the systems in conditions
+that more closely relate to their application without having to label an ad-hoc
+test dataset or rely only on available datasets, which may not be adapted to
+their application domain. While we define MET-POSE in general terms, we also
+present a non-exhaustive list of metamorphic rules that represent common
+challenges in computer vision applications, as well as a specific way to
+evaluate these rules. We then experimentally show the effectiveness of MET-POSE
+by applying it to Mediapipe Holistic, a state of the art human pose estimation
+system, with the FLIC and PHOENIX datasets. With these experiments, we outline
+numerous ways in which the outputs of MET-POSE can uncover faults in pose
+estimation systems at a similar or higher rate than classic testing using hand
+labelled data, and show that users can tailor the rule set they use to the
+faults and level of accuracy relevant to their application.
 
-摘要：我們提出混合評分訓練 (SMT)，一種透過最小化稱為 $\alpha$-偏斜 Jensen-Shannon 距離的距離類別來訓練單步生成模型的新穎架構。在核心部分，SMT 估計真實和虛假樣本之間在多個雜訊層級的混合分配評分。與一致性模型類似，我們的做法支援從頭開始訓練 (SMT) 和使用預先訓練的擴散模型進行蒸餾，我們稱之為混合評分蒸餾 (SMD)。它易於實作，只需要最小的超參數調整，並確保穩定的訓練。在 CIFAR-10 和 ImageNet 64x64 上的實驗顯示，SMT/SMD 具有競爭力，甚至可以優於現有方法。
+摘要：姿勢估計系統應用於各種領域，從運動分析到牲畜照護。鑑於其潛在影響，系統性地測試其行為和故障潛力至關重要。由於預言機問題以及建立地面實況關鍵點所需的手動標記成本高，這是一項複雜的任務。這個問題因不同的應用需要系統專注於不同的主體（例如，人類對動物）或地標（例如，只有四肢對全身和臉部）而加劇，這使得標記的測試數據很少可以重複使用。為了解決這些問題，我們提出了 MET-POSE，這是一個姿勢估計系統的變形測試框架，在評估這些系統在不同情況下的性能時，可以繞過手動註解的需要。因此，MET-POSE 允許姿勢估計系統的使用者在更接近其應用程式的條件下評估系統，而無需標記臨時測試數據集或僅依賴可用數據集，這些數據集可能不適合其應用領域。雖然我們以一般術語定義 MET-POSE，但我們也提供了一個非詳盡的變形規則列表，這些規則代表了電腦視覺應用中的常見挑戰，以及評估這些規則的具體方法。然後，我們通過將 MET-POSE 應用於 Mediapipe Holistic（一種先進的人類姿勢估計系統），並使用 FLIC 和 PHOENIX 數據集，以實驗方式展示 MET-POSE 的有效性。通過這些實驗，我們概述了 MET-POSE 的輸出可以揭示姿勢估計系統中故障的許多方法，其速度與使用手動標記數據的傳統測試類似或更高，並表明使用者可以根據其應用程式相關的故障和準確度等級來調整他們使用的規則集。
 
-##### **Human-LLM Coevolution: Evidence from Academic Writing**
-2502.09606v1 by Mingmeng Geng, Roberto Trotta
+##### **The Joint Entity-Relation Extraction Model Based on Span and Interactive Fusion Representation for Chinese Medical Texts with Complex Semantics**
+2502.09247v1 by Danni Feng, Runzhi Li, Jing Wang, Siyu Yan, Lihong Ma, Yunli Xing
 
-With a statistical analysis of arXiv paper abstracts, we report a marked drop
-in the frequency of several words previously identified as overused by ChatGPT,
-such as "delve", starting soon after they were pointed out in early 2024. The
-frequency of certain other words favored by ChatGPT, such as "significant", has
-instead kept increasing. These phenomena suggest that some authors of academic
-papers have adapted their use of large language models (LLMs), for example, by
-selecting outputs or applying modifications to the LLM-generated content. Such
-coevolution and cooperation of humans and LLMs thus introduce additional
-challenges to the detection of machine-generated text in real-world scenarios.
-Estimating the impact of LLMs on academic writing by examining word frequency
-remains feasible, and more attention should be paid to words that were already
-frequently employed, including those that have decreased in frequency.
+Joint entity-relation extraction is a critical task in transforming
+unstructured or semi-structured text into triplets, facilitating the
+construction of large-scale knowledge graphs, and supporting various downstream
+applications. Despite its importance, research on Chinese text, particularly
+with complex semantics in specialized domains like medicine, remains limited.
+To address this gap, we introduce the CH-DDI, a Chinese drug-drug interactions
+dataset designed to capture the intricacies of medical text. Leveraging the
+strengths of attention mechanisms in capturing long-range dependencies, we
+propose the SEA module, which enhances the extraction of complex contextual
+semantic information, thereby improving entity recognition and relation
+extraction. Additionally, to address the inefficiencies of existing methods in
+facilitating information exchange between entity recognition and relation
+extraction, we present an interactive fusion representation module. This module
+employs Cross Attention for bidirectional information exchange between the
+tasks and further refines feature extraction through BiLSTM. Experimental
+results on both our CH-DDI dataset and public CoNLL04 dataset demonstrate that
+our model exhibits strong generalization capabilities. On the CH-DDI dataset,
+our model achieves an F1-score of 96.73% for entity recognition and 78.43% for
+relation extraction. On the CoNLL04 dataset, it attains an entity recognition
+precision of 89.54% and a relation extraction accuracy of 71.64%.
 
-摘要：透過對 arXiv 論文摘要進行統計分析，我們報告了幾個先前被認為 ChatGPT 過度使用的詞彙的頻率大幅下降，例如「深入探討」，從 2024 年初被指出後不久就開始下降。相反地，ChatGPT 偏好的某些其他詞彙，例如「顯著」，頻率持續增加。這些現象表明，一些學術論文作者已經調整了他們使用大型語言模型 (LLM) 的方式，例如，透過選擇輸出或對 LLM 生成的內容進行修改。因此，人類和 LLM 的這種共同演化和合作為在現實世界場景中偵測機器產生的文字帶來了額外的挑戰。透過檢視詞彙頻率來評估 LLM 對學術寫作的影響仍然可行，並且應該對已經頻繁使用的詞彙給予更多關注，包括那些頻率下降的詞彙。
+摘要：聯合實體關係抽取是將非結構化或半結構化文字轉換為三元組的重要任務，有助於建構大規模知識圖譜，並支援各種下游應用程式。儘管其重要性，但針對中文文本的研究，特別是醫學等專業領域中具有複雜語義的研究仍十分有限。為了解決這個差距，我們引入了 CH-DDI，一個中文藥物-藥物交互作用資料集，旨在擷取醫學文本的複雜性。利用注意力機制在擷取長程依賴關係方面的優勢，我們提出了 SEA 模組，增強了複雜脈絡語義資訊的抽取，從而改進了實體辨識和關係抽取。此外，為了解決現有方法在促進實體辨識和關係抽取之間資訊交換方面的低效率問題，我們提出了互動式融合表示模組。此模組採用交叉注意力，在任務之間進行雙向資訊交換，並透過 BiLSTM 進一步精煉特徵抽取。在我們的 CH-DDI 資料集和公開的 CoNLL04 資料集上的實驗結果表明，我們的模型展現出強大的泛化能力。在 CH-DDI 資料集上，我們的模型在實體辨識方面達到了 96.73% 的 F1 分數，在關係抽取方面達到了 78.43% 的 F1 分數。在 CoNLL04 資料集上，它在實體辨識方面達到了 89.54% 的準確度，在關係抽取方面達到了 71.64% 的準確度。
 
-##### **SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models**
-2502.09604v1 by Yung-Sung Chuang, Benjamin Cohen-Wang, Shannon Zejiang Shen, Zhaofeng Wu, Hu Xu, Xi Victoria Lin, James Glass, Shang-Wen Li, Wen-tau Yih
+##### **From large language models to multimodal AI: A scoping review on the potential of generative AI in medicine**
+2502.09242v1 by Lukas Buess, Matthias Keicher, Nassir Navab, Andreas Maier, Soroosh Tayebi Arasteh
 
-We introduce SelfCite, a novel self-supervised approach that aligns LLMs to
-generate high-quality, fine-grained, sentence-level citations for the
-statements in their generated responses. Instead of only relying on costly and
-labor-intensive annotations, SelfCite leverages a reward signal provided by the
-LLM itself through context ablation: If a citation is necessary, removing the
-cited text from the context should prevent the same response; if sufficient,
-retaining the cited text alone should preserve the same response. This reward
-can guide the inference-time best-of-N sampling strategy to improve citation
-quality significantly, as well as be used in preference optimization to
-directly fine-tune the models for generating better citations. The
-effectiveness of SelfCite is demonstrated by increasing citation F1 up to 5.3
-points on the LongBench-Cite benchmark across five long-form question answering
-tasks.
+Generative artificial intelligence (AI) models, such as diffusion models and
+OpenAI's ChatGPT, are transforming medicine by enhancing diagnostic accuracy
+and automating clinical workflows. The field has advanced rapidly, evolving
+from text-only large language models for tasks such as clinical documentation
+and decision support to multimodal AI systems capable of integrating diverse
+data modalities, including imaging, text, and structured data, within a single
+model. The diverse landscape of these technologies, along with rising interest,
+highlights the need for a comprehensive review of their applications and
+potential. This scoping review explores the evolution of multimodal AI,
+highlighting its methods, applications, datasets, and evaluation in clinical
+settings. Adhering to PRISMA-ScR guidelines, we systematically queried PubMed,
+IEEE Xplore, and Web of Science, prioritizing recent studies published up to
+the end of 2024. After rigorous screening, 144 papers were included, revealing
+key trends and challenges in this dynamic field. Our findings underscore a
+shift from unimodal to multimodal approaches, driving innovations in diagnostic
+support, medical report generation, drug discovery, and conversational AI.
+However, critical challenges remain, including the integration of heterogeneous
+data types, improving model interpretability, addressing ethical concerns, and
+validating AI systems in real-world clinical settings. This review summarizes
+the current state of the art, identifies critical gaps, and provides insights
+to guide the development of scalable, trustworthy, and clinically impactful
+multimodal AI solutions in healthcare.
 
-摘要：我們介紹 SelfCite，一種新穎的自監督方法，它將 LLM 對齊以針對其生成回應中的陳述生成高品質、細粒度、句子級別的引用。SelfCite 不僅依賴於昂貴且勞動密集的註解，還利用 LLM 本身通過上下文消融提供的獎勵信號：如果需要引用，從上下文中移除被引用的文字應當會阻止相同的回應；如果足夠，僅保留被引用的文字應當會保留相同的回應。此獎勵可以引導推理時間最佳 N 個取樣策略以顯著改善引文品質，並用於偏好最佳化以直接微調模型以生成更好的引文。SelfCite 的有效性通過在五個長篇問答任務中將 LongBench-Cite 基準上的引文 F1 提高多達 5.3 點來證明。
+摘要：生成式人工智能 (AI) 模型，例如扩散模型和 OpenAI 的 ChatGPT，通过提高诊断准确性和自动化临床工作流程，正在改变医学领域。该领域已迅速发展，从用于临床文件编制和决策支持等任务的纯文本大型语言模型，发展到能够在单个模型中整合包括影像、文本和结构化数据在内的多种数据方式的多模态 AI 系统。这些技术的多样化格局以及日益增长的兴趣，凸显了全面审查其应用和潜力的必要性。本范围审查探讨了多模态 AI 的演变，重点介绍了其方法、应用、数据集和在临床环境中的评估。遵循 PRISMA-ScR 指南，我们系统地查询了 PubMed、IEEE Xplore 和 Web of Science，优先考虑截至 2024 年底发表的最新研究。经过严格筛选，纳入了 144 篇论文，揭示了这一充满活力的领域的趋势和挑战。我们的研究结果强调了从单模态方法向多模态方法的转变，推动了诊断支持、医疗报告生成、药物发现和会话式 AI 的创新。然而，关键挑战仍然存在，包括异构数据类型的整合、提高模型可解释性、解决伦理问题以及在现实世界的临床环境中验证 AI 系统。本综述总结了当前的最新技术，确定了关键差距，并提供了见解，以指导在医疗保健领域开发可扩展、可信赖且具有临床影响力的多模态 AI 解决方案。
 
-##### **CoT-Valve: Length-Compressible Chain-of-Thought Tuning**
-2502.09601v1 by Xinyin Ma, Guangnian Wan, Runpeng Yu, Gongfan Fang, Xinchao Wang
+##### **Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**
+2502.09218v1 by Flavio Bertini, Alessandro Dal Palù, Federica Zaglio, Francesco Fabiano, Andrea Formisano
 
-Chain-of-Thought significantly enhances a model's reasoning capability, but
-it also comes with a considerable increase in inference costs due to long
-chains. With the observation that the reasoning path can be easily compressed
-under easy tasks but struggle on hard tasks, we explore the feasibility of
-elastically controlling the length of reasoning paths with only one model,
-thereby reducing the inference overhead of reasoning models dynamically based
-on task difficulty. We introduce a new tuning and inference strategy named
-CoT-Valve, designed to allow models to generate reasoning chains of varying
-lengths. To achieve this, we propose to identify a direction in the parameter
-space that, when manipulated, can effectively control the length of generated
-CoT. Moreover, we show that this property is valuable for compressing the
-reasoning chain. We construct datasets with chains from long to short for the
-same questions and explore two enhanced strategies for CoT-Valve: (1) a precise
-length-compressible CoT tuning method, and (2) a progressive chain length
-compression approach. Our experiments show that CoT-Valve successfully enables
-controllability and compressibility of the chain and shows better performance
-than the prompt-based control. We applied this method to QwQ-32B-Preview,
-reducing reasoning chains on GSM8K from 741 to 225 tokens with a minor
-performance drop (95.07% to 94.92%) and on AIME from 6827 to 4629 tokens, with
-only one additional incorrect answer.
+This paper presents a complete explainable system that interprets a set of
+data, abstracts the underlying features and describes them in a natural
+language of choice. The system relies on two crucial stages: (i) identifying
+emerging properties from data and transforming them into abstract concepts, and
+(ii) converting these concepts into natural language. Despite the impressive
+natural language generation capabilities demonstrated by Large Language Models,
+their statistical nature and the intricacy of their internal mechanism still
+force us to employ these techniques as black boxes, forgoing trustworthiness.
+Developing an explainable pipeline for data interpretation would allow
+facilitating its use in safety-critical environments like processing medical
+information and allowing non-experts and visually impaired people to access
+narrated information. To this end, we believe that the fields of knowledge
+representation and automated reasoning research could present a valid
+alternative. Expanding on prior research that tackled the first stage (i), we
+focus on the second stage, named Concept2Text. Being explainable, data
+translation is easily modeled through logic-based rules, once again emphasizing
+the role of declarative programming in achieving AI explainability. This paper
+explores a Prolog/CLP-based rewriting system to interpret concepts-articulated
+in terms of classes and relations, plus common knowledge-derived from a generic
+ontology, generating natural language text. Its main features include
+hierarchical tree rewritings, modular multilingual generation, support for
+equivalent variants across semantic, grammar, and lexical levels, and a
+transparent rule-based system. We outline the architecture and demonstrate its
+flexibility through some examples capable of generating numerous diverse and
+equivalent rewritings based on the input concept.
 
-摘要：<paragraph>連續思考大幅提升了模型的推理能力，但由於鏈條過長，也大幅增加了推理成本。由於觀察到推理路徑在簡單的任務中可以輕易壓縮，但在困難的任務中卻很吃力，我們探索了僅使用一個模型彈性控制推理路徑長度的可行性，從而根據任務難度動態減少推理模型的推理開銷。我們引入了一種名為 CoT-Valve 的新調校和推理策略，旨在讓模型產生長度不一的推理鏈。為此，我們提議在參數空間中識別一個方向，在操作時可以有效控制生成的 CoT 的長度。此外，我們展示了此屬性對於壓縮推理鏈是有價值的。我們構造了從長到短的鏈條的資料集，用於相同的問題，並探索了 CoT-Valve 的兩種增強策略：(1) 精確的長度可壓縮 CoT 調校方法，以及 (2) 漸進式鏈長壓縮方法。我們的實驗表明，CoT-Valve 成功地實現了鏈條的可控性和可壓縮性，並顯示出比基於提示的控制更好的效能。我們將此方法應用於 QwQ-32B-Preview，將 GSM8K 上的推理鏈條從 741 個代幣減少到 225 個代幣，效能僅略微下降 (95.07% 至 94.92%)，而在 AIME 上從 6827 個代幣減少到 4629 個代幣，只多了一個錯誤答案。</paragraph>
+摘要：<paragraph>這篇論文提出了一個完整的可解釋系統，它可以解釋一組資料，抽象出基礎特徵，並以選擇的自然語言描述它們。系統依賴兩個關鍵階段：(i) 從資料中識別新興屬性，並將它們轉換為抽象概念，以及 (ii) 將這些概念轉換為自然語言。儘管大型語言模型展示了令人印象深刻的自然語言生成能力，但它們的統計性質和內部機制的複雜性仍然迫使我們將這些技術用作黑盒子，放棄可信度。開發一個可解釋的資料解釋管道將有助於促進在安全關鍵環境中使用它，例如處理醫療資訊，並允許非專家和視障人士存取敘述資訊。為此，我們相信知識表示和自動推理研究領域可以提出一個有效的替代方案。在擴展解決第一階段 (i) 的先前研究的基礎上，我們專注於第二階段，稱為 Concept2Text。由於具有可解釋性，資料翻譯很容易透過基於邏輯的規則建模，再次強調宣告式程式設計在實現 AI 可解釋性中的作用。本文探討了一個基於 Prolog/CLP 的重寫系統，以解釋概念，這些概念以類別和關係的形式表達，再加上從通用本体衍生的常識，產生自然語言文字。它的主要特點包括階層樹重寫、模組化多語言生成、支援語義、語法和詞彙層面的等效變體，以及一個透明的基於規則的系統。我們概述了架構，並透過一些範例展示了它的靈活性，這些範例能夠根據輸入概念生成許多不同的等效重寫。</paragraph>
 
-##### **Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs**
-2502.09597v1 by Siyan Zhao, Mingyi Hong, Yang Liu, Devamanyu Hazarika, Kaixiang Lin
+##### **Logical Lease Litigation: Prolog and LLMs for Rental Law Compliance in New York**
+2502.09204v1 by Sanskar Sehgal, Yanhong A. Liu
 
-Large Language Models (LLMs) are increasingly used as chatbots, yet their
-ability to personalize responses to user preferences remains limited. We
-introduce PrefEval, a benchmark for evaluating LLMs' ability to infer, memorize
-and adhere to user preferences in a long-context conversational setting.
-PrefEval comprises 3,000 manually curated user preference and query pairs
-spanning 20 topics. PrefEval contains user personalization or preference
-information in both explicit and implicit forms, and evaluates LLM performance
-using a generation and a classification task. With PrefEval, we evaluated the
-aforementioned preference following capabilities of 10 open-source and
-proprietary LLMs in multi-session conversations with varying context lengths up
-to 100k tokens. We benchmark with various prompting, iterative feedback, and
-retrieval-augmented generation methods. Our benchmarking effort reveals that
-state-of-the-art LLMs face significant challenges in proactively following
-users' preferences during conversations. In particular, in zero-shot settings,
-preference following accuracy falls below 10% at merely 10 turns (~3k tokens)
-across most evaluated models. Even with advanced prompting and retrieval
-methods, preference following still deteriorates in long-context conversations.
-Furthermore, we show that fine-tuning on PrefEval significantly improves
-performance. We believe PrefEval serves as a valuable resource for measuring,
-understanding, and enhancing LLMs' preference following abilities, paving the
-way for personalized conversational agents. Our code and dataset are available
-at https://prefeval.github.io/.
+Legal cases require careful logical reasoning following the laws, whereas
+interactions with non- technical users must be in natural language. As an
+application combining logical reasoning using Prolog and natural language
+processing using large language models (LLMs), this paper presents a novel
+approach and system, LogicLease, to automate the analysis of landlord-tenant
+legal cases in the state of New York. LogicLease determines compliance with
+relevant legal requirements by analyzing case descriptions and citing all
+relevant laws. It leverages LLMs for information extraction and Prolog for
+legal reasoning. By separating information extraction from legal reasoning,
+LogicLease achieves greater transparency and control over the legal logic
+applied to each case. We evaluate the accuracy, efficiency, and robustness of
+LogicLease through a series of tests, achieving 100% accuracy and an average
+processing time of 2.57 seconds. LogicLease presents advantages over
+state-of-the-art LLM- based legal analysis systems by providing clear,
+step-by-step reasoning, citing specific laws, and distinguishing itself by its
+ability to avoid hallucinations - a common issue in LLMs.
 
-摘要：大型語言模型（LLM）正日益被用作聊天機器人，但它們根據使用者偏好個人化回應的能力仍然有限。我們引入了 PrefEval，一個用於評估 LLM 在長時間對話環境中推論、記憶和遵守使用者偏好的能力的基準。PrefEval 包含 3,000 個手動策劃的使用者偏好和查詢對，涵蓋 20 個主題。PrefEval 包含以明確和隱含形式表達的使用者個人化或偏好資訊，並使用生成和分類任務評估 LLM 效能。透過 PrefEval，我們評估了 10 個開源和專有 LLM 在多重對話中上述的偏好追蹤能力，對話內容長度最高達 100k 個符號。我們使用各種提示、迭代回饋和檢索增強生成方法進行基準測試。我們的基準測試工作顯示，最先進的 LLM 在對話中主動追蹤使用者偏好時面臨重大挑戰。特別是在零次學習設定中，在多數評估模型中，在僅 10 個回合（約 3k 個符號）時，偏好追蹤準確度低於 10%。即使使用進階提示和檢索方法，在長時間對話中偏好追蹤仍然會惡化。此外，我們展示了在 PrefEval 上進行微調會大幅改善效能。我們相信 PrefEval 可作為衡量、理解和提升 LLM 偏好追蹤能力的寶貴資源，為個人化對話代理鋪路。我們的程式碼和資料集可在 https://prefeval.github.io/ 取得。
+摘要：法律案件需要遵循法律进行谨慎的逻辑推理，而与非技术用户的互动必须使用自然语言。作为结合使用 Prolog 进行逻辑推理和使用大型语言模型 (LLM) 进行自然语言处理的应用程序，本文提出了一种新颖的方法和系统 LogicLease，以自动分析纽约州的房东与租户法律案件。LogicLease 通过分析案例描述并引用所有相关法律来确定是否符合相关法律要求。它利用 LLM 进行信息提取，并利用 Prolog 进行法律推理。通过将信息提取与法律推理分开，LogicLease 实现了对应用于每个案例的法律逻辑的更高透明度和控制力。我们通过一系列测试评估了 LogicLease 的准确性、效率和鲁棒性，实现了 100% 的准确性和 2.57 秒的平均处理时间。LogicLease 通过提供清晰、分步的推理，引用具体法律，并以其避免幻觉的能力而区别于最先进的基于 LLM 的法律分析系统，从而显示出优势——这是 LLM 中的常见问题。
 
-##### **KIMAs: A Configurable Knowledge Integrated Multi-Agent System**
-2502.09596v1 by Zitao Li, Fei Wei, Yuexiang Xie, Dawei Gao, Weirui Kuang, Zhijian Ma, Bingchen Qian, Yaliang Li, Bolin Ding
+##### **Two-Stage Representation Learning for Analyzing Movement Behavior Dynamics in People Living with Dementia**
+2502.09173v1 by Jin Cui, Alexander Capstick, Payam Barnaghi, Gregory Scott
 
-Knowledge-intensive conversations supported by large language models (LLMs)
-have become one of the most popular and helpful applications that can assist
-people in different aspects. Many current knowledge-intensive applications are
-centered on retrieval-augmented generation (RAG) techniques. While many
-open-source RAG frameworks facilitate the development of RAG-based
-applications, they often fall short in handling practical scenarios complicated
-by heterogeneous data in topics and formats, conversational context management,
-and the requirement of low-latency response times. This technical report
-presents a configurable knowledge integrated multi-agent system, KIMAs, to
-address these challenges. KIMAs features a flexible and configurable system for
-integrating diverse knowledge sources with 1) context management and query
-rewrite mechanisms to improve retrieval accuracy and multi-turn conversational
-coherency, 2) efficient knowledge routing and retrieval, 3) simple but
-effective filter and reference generation mechanisms, and 4) optimized
-parallelizable multi-agent pipeline execution. Our work provides a scalable
-framework for advancing the deployment of LLMs in real-world settings. To show
-how KIMAs can help developers build knowledge-intensive applications with
-different scales and emphases, we demonstrate how we configure the system to
-three applications already running in practice with reliable performance.
+In remote healthcare monitoring, time series representation learning reveals
+critical patient behavior patterns from high-frequency data. This study
+analyzes home activity data from individuals living with dementia by proposing
+a two-stage, self-supervised learning approach tailored to uncover low-rank
+structures. The first stage converts time-series activities into text sequences
+encoded by a pre-trained language model, providing a rich, high-dimensional
+latent state space using a PageRank-based method. This PageRank vector captures
+latent state transitions, effectively compressing complex behaviour data into a
+succinct form that enhances interpretability. This low-rank representation not
+only enhances model interpretability but also facilitates clustering and
+transition analysis, revealing key behavioral patterns correlated with
+clinicalmetrics such as MMSE and ADAS-COG scores. Our findings demonstrate the
+framework's potential in supporting cognitive status prediction, personalized
+care interventions, and large-scale health monitoring.
 
-摘要：由大型語言模型 (LLM) 支持的知識密集型對話
-已成為最受歡迎且有用的應用程式之一，可協助
-人們在不同面向獲得協助。許多當前的知識密集型應用程式
-都以檢索增強生成 (RAG) 技術為中心。雖然許多
-開放原始碼 RAG 架構促進了基於 RAG 的應用程式開發，但它們在處理
-主題和格式中異質資料、對話內容管理，以及低延遲回應時間的要求所造成的實際情況時，通常力有未逮。這份技術報告
-提出了可設定的知識整合多重代理系統，KIMAs，以
-解決這些挑戰。KIMAs 具備靈活且可設定的系統，可整合多樣化的知識來源，並具備 1) 內容管理和查詢
-改寫機制，以提升檢索準確度和多輪對話的連貫性，2) 有效的知識路由和檢索，3) 簡單但
-有效的篩選和參考產生機制，以及 4) 最佳化的可平行化多重代理管線執行。我們的作品提供了可擴充的
-架構，以推動在實際環境中部署 LLM。為了展示 KIMAs 如何協助開發人員建置不同規模和重點的知識密集型應用程式，我們示範如何設定系統至
-三個已實際執行且效能良好的應用程式。
+摘要：在遠程醫療監控中，時間序列表示學習揭示了高頻率數據中的關鍵患者行為模式。本研究通過提出一個兩階段、自我監督的學習方法來分析痴呆症患者的家庭活動數據，該方法專門用於發現低秩結構。第一階段將時間序列活動轉換為由預訓練語言模型編碼的文本序列，使用基於 PageRank 的方法提供了一個豐富、高維的潛在狀態空間。此 PageRank 向量捕獲潛在狀態轉換，有效地將複雜的行為數據壓縮成簡潔的形式，從而增強了解力。此低秩表示不僅增強了模型的可解釋性，還促進了聚類和轉換分析，揭示了與臨床指標（例如 MMSE 和 ADAS-COG 分數）相關的關鍵行為模式。我們的研究結果證明了該框架在支持認知狀態預測、個性化護理干預和大型健康監控方面的潛力。
 
-##### **Logical forms complement probability in understanding language model (and human) performance**
-2502.09589v1 by Yixuan Wang, Freda Shi
+##### **HistoSmith: Single-Stage Histology Image-Label Generation via Conditional Latent Diffusion for Enhanced Cell Segmentation and Classification**
+2502.08754v1 by Valentina Vadori, Jean-Marie Graïc, Antonella Peruffo, Livio Finos, Ujwala Kiran Chaudhari, Enrico Grisan
 
-With the increasing interest in using large language models (LLMs) for
-planning in natural language, understanding their behaviors becomes an
-important research question. This work conducts a systematic investigation of
-LLMs' ability to perform logical reasoning in natural language. We introduce a
-controlled dataset of hypothetical and disjunctive syllogisms in propositional
-and modal logic and use it as the testbed for understanding LLM performance.
-Our results lead to novel insights in predicting LLM behaviors: in addition to
-the probability of input (Gonen et al., 2023; McCoy et al., 2024), logical
-forms should be considered as orthogonal factors. In addition, we show
-similarities and differences between the logical reasoning performances of
-humans and LLMs by comparing LLM and human behavioral results.
+Precise segmentation and classification of cell instances are vital for
+analyzing the tissue microenvironment in histology images, supporting medical
+diagnosis, prognosis, treatment planning, and studies of brain
+cytoarchitecture. However, the creation of high-quality annotated datasets for
+training remains a major challenge. This study introduces a novel single-stage
+approach (HistoSmith) for generating image-label pairs to augment histology
+datasets. Unlike state-of-the-art methods that utilize diffusion models with
+separate components for label and image generation, our approach employs a
+latent diffusion model to learn the joint distribution of cellular layouts,
+classification masks, and histology images. This model enables tailored data
+generation by conditioning on user-defined parameters such as cell types,
+quantities, and tissue types. Trained on the Conic H&E histopathology dataset
+and the Nissl-stained CytoDArk0 dataset, the model generates realistic and
+diverse labeled samples. Experimental results demonstrate improvements in cell
+instance segmentation and classification, particularly for underrepresented
+cell types like neutrophils in the Conic dataset. These findings underscore the
+potential of our approach to address data scarcity challenges.
 
-摘要：隨著在自然語言規劃中使用大型語言模型（LLM）的興趣日益濃厚，理解其行為已成為一項重要的研究課題。本研究對 LLM 在自然語言中執行邏輯推理的能力進行了系統性調查。我們引入了一個由假設和析取三段論組成的受控資料集，並使用它作為理解 LLM 效能的測試平台。我們的結果產生了預測 LLM 行為的新見解：除了輸入的機率（Gonen 等人，2023 年；McCoy 等人，2024 年）之外，邏輯形式應被視為正交因子。此外，我們透過比較 LLM 和人類行為結果，展示了人類和 LLM 在邏輯推理表現上的相似性和差異性。
+摘要：精確的細胞實例分割和分類對於分析組織學影像中的組織微環境、支援醫療診斷、預後、治療規劃和腦部細胞結構研究至關重要。然而，建立用於訓練的高品質標註資料集仍然是一項重大挑戰。本研究提出了一種新穎的單階段方法 (HistoSmith)，用於產生影像標籤對，以擴充組織學資料集。與利用擴散模型並將標籤和影像產生分開的組成部分的現有技術不同，我們的做法採用潛在擴散模型來學習細胞佈局、分類遮罩和組織學影像的聯合分佈。此模型能透過調整使用者定義的參數（例如細胞類型、數量和組織類型）來進行客製化資料產生。在 Conic H&E 細胞病理學資料集和 Nissl 染色的 CytoDArk0 資料集上訓練後，此模型產生逼真且多樣化的標籤樣本。實驗結果顯示細胞實例分割和分類有顯著進步，特別是對於 Conic 資料集中代表性不足的細胞類型，例如中性球。這些發現強調了我們的方法在解決資料稀少性挑戰方面的潛力。
 
-##### **Optimizing GPT for Video Understanding: Zero-Shot Performance and Prompt Engineering**
-2502.09573v1 by Mark Beliaev, Victor Yang, Madhura Raju, Jiachen Sun, Xinghai Hu
+##### **Brain Latent Progression: Individual-based Spatiotemporal Disease Progression on 3D Brain MRIs via Latent Diffusion**
+2502.08560v1 by Lemuel Puglisi, Daniel C. Alexander, Daniele Ravì
 
-In this study, we tackle industry challenges in video content classification
-by exploring and optimizing GPT-based models for zero-shot classification
-across seven critical categories of video quality. We contribute a novel
-approach to improving GPT's performance through prompt optimization and policy
-refinement, demonstrating that simplifying complex policies significantly
-reduces false negatives. Additionally, we introduce a new
-decomposition-aggregation-based prompt engineering technique, which outperforms
-traditional single-prompt methods. These experiments, conducted on real
-industry problems, show that thoughtful prompt design can substantially enhance
-GPT's performance without additional finetuning, offering an effective and
-scalable solution for improving video classification systems across various
-domains in industry.
+The growing availability of longitudinal Magnetic Resonance Imaging (MRI)
+datasets has facilitated Artificial Intelligence (AI)-driven modeling of
+disease progression, making it possible to predict future medical scans for
+individual patients. However, despite significant advancements in AI, current
+methods continue to face challenges including achieving patient-specific
+individualization, ensuring spatiotemporal consistency, efficiently utilizing
+longitudinal data, and managing the substantial memory demands of 3D scans. To
+address these challenges, we propose Brain Latent Progression (BrLP), a novel
+spatiotemporal model designed to predict individual-level disease progression
+in 3D brain MRIs. The key contributions in BrLP are fourfold: (i) it operates
+in a small latent space, mitigating the computational challenges posed by
+high-dimensional imaging data; (ii) it explicitly integrates subject metadata
+to enhance the individualization of predictions; (iii) it incorporates prior
+knowledge of disease dynamics through an auxiliary model, facilitating the
+integration of longitudinal data; and (iv) it introduces the Latent Average
+Stabilization (LAS) algorithm, which (a) enforces spatiotemporal consistency in
+the predicted progression at inference time and (b) allows us to derive a
+measure of the uncertainty for the prediction. We train and evaluate BrLP on
+11,730 T1-weighted (T1w) brain MRIs from 2,805 subjects and validate its
+generalizability on an external test set comprising 2,257 MRIs from 962
+subjects. Our experiments compare BrLP-generated MRI scans with real follow-up
+MRIs, demonstrating state-of-the-art accuracy compared to existing methods. The
+code is publicly available at: https://github.com/LemuelPuglisi/BrLP.
 
-摘要：在這項研究中，我們透過探索和最佳化基於 GPT 的模型，來處理影片內容分類中的產業挑戰，並針對影片品質的七個關鍵類別進行零次學習分類。我們貢獻了一種透過提示最佳化和政策改善來提升 GPT 效能的新方法，證明簡化複雜政策能大幅減少假陰性。此外，我們還引入了一種新的基於分解聚合的提示工程技術，其效能優於傳統的單一提示方法。這些在真實產業問題上執行的實驗顯示，經過深思熟慮的提示設計可以在不進行額外微調的情況下大幅提升 GPT 的效能，為提升產業中各種領域的影片分類系統提供了一個有效且可擴充的解決方案。
+摘要：隨著縱向磁共振影像 (MRI) 資料集的日益普及，已促進人工智慧 (AI) 驅動的疾病進程建模，讓預測個別患者的未來醫學掃描成為可能。然而，儘管 AI 有顯著進展，目前的技術仍面臨挑戰，包括實現患者特定的個別化、確保時空一致性、有效利用縱向資料，以及管理 3D 掃描的大量記憶體需求。為了應對這些挑戰，我們提出腦潛在進程 (BrLP)，這是一種新穎的時空模型，旨在預測 3D 腦部 MRI 中的個人層級疾病進程。BrLP 的主要貢獻有四個：(i) 它在一個小的潛在空間中運作，減輕了高維度影像資料帶來的計算挑戰；(ii) 它明確整合受試者的元資料，以增強預測的個別化；(iii) 它透過輔助模型納入疾病動態的先驗知識，促進縱向資料的整合；(iv) 它引入了潛在平均穩定化 (LAS) 演算法，該演算法 (a) 在推論時強制預測進程中的時空一致性，(b) 讓我們能夠推導預測的不確定性測量。我們對來自 2,805 名受試者的 11,730 個 T1 加權 (T1w) 腦部 MRI 進行 BrLP 訓練和評估，並在包含來自 962 名受試者的 2,257 個 MRI 的外部測試集上驗證其概括性。我們的實驗將 BrLP 生成的 MRI 掃描與實際追蹤 MRI 進行比較，與現有方法相比，展示了最先進的準確性。程式碼已公開於：https://github.com/LemuelPuglisi/BrLP。
 
-##### **MorphNLI: A Stepwise Approach to Natural Language Inference Using Text Morphing**
-2502.09567v1 by Vlad Andrei Negru, Robert Vacareanu, Camelia Lemnaru, Mihai Surdeanu, Rodica Potolea
+##### **Representation Learning to Advance Multi-institutional Studies with Electronic Health Record Data**
+2502.08547v1 by Doudou Zhou, Han Tong, Linshanshan Wang, Suqi Liu, Xin Xiong, Ziming Gan, Romain Griffier, Boris Hejblum, Yun-Chung Liu, Chuan Hong, Clara-Lea Bonzel, Tianrun Cai, Kevin Pan, Yuk-Lam Ho, Lauren Costa, Vidul A. Panickan, J. Michael Gaziano, Kenneth Mandl, Vianney Jouhet, Rodolphe Thiebaut, Zongqi Xia, Kelly Cho, Katherine Liao, Tianxi Cai
 
-We introduce MorphNLI, a modular step-by-step approach to natural language
-inference (NLI). When classifying the premise-hypothesis pairs into
-{entailment, contradiction, neutral}, we use a language model to generate the
-necessary edits to incrementally transform (i.e., morph) the premise into the
-hypothesis. Then, using an off-the-shelf NLI model we track how the entailment
-progresses with these atomic changes, aggregating these intermediate labels
-into a final output. We demonstrate the advantages of our proposed method
-particularly in realistic cross-domain settings, where our method always
-outperforms strong baselines with improvements up to 12.6% (relative). Further,
-our proposed approach is explainable as the atomic edits can be used to
-understand the overall NLI label.
+The adoption of EHRs has expanded opportunities to leverage data-driven
+algorithms in clinical care and research. A major bottleneck in effectively
+conducting multi-institutional EHR studies is the data heterogeneity across
+systems with numerous codes that either do not exist or represent different
+clinical concepts across institutions. The need for data privacy further limits
+the feasibility of including multi-institutional patient-level data required to
+study similarities and differences across patient subgroups. To address these
+challenges, we developed the GAME algorithm. Tested and validated across 7
+institutions and 2 languages, GAME integrates data in several levels: (1) at
+the institutional level with knowledge graphs to establish relationships
+between codes and existing knowledge sources, providing the medical context for
+standard codes and their relationship to each other; (2) between institutions,
+leveraging language models to determine the relationships between
+institution-specific codes with established standard codes; and (3) quantifying
+the strength of the relationships between codes using a graph attention
+network. Jointly trained embeddings are created using transfer and federated
+learning to preserve data privacy. In this study, we demonstrate the
+applicability of GAME in selecting relevant features as inputs for AI-driven
+algorithms in a range of conditions, e.g., heart failure, rheumatoid arthritis.
+We then highlight the application of GAME harmonized multi-institutional EHR
+data in a study of Alzheimer's disease outcomes and suicide risk among patients
+with mental health disorders, without sharing patient-level data outside
+individual institutions.
 
-摘要：我們引入 MorphNLI，一種模組化逐步方法，用於自然語言推論 (NLI)。當對前提假設對進行分類時，我們使用語言模型來產生必要的編輯，以逐步轉換（即，變形）前提成為假設。然後，使用現成的 NLI 模型，我們追蹤推論如何隨著這些原子變化而進展，將這些中間標籤彙總成最終輸出。我們展示了我們提出的方法的優點，特別是在現實的跨網域設置中，我們的模型始終優於強大的基線，改進幅度高達 12.6%（相對）。此外，我們提出的方法是可以解釋的，因為原子編輯可以用來理解整體 NLI 標籤。
+摘要：電子健康紀錄的採用擴大了在臨床照護和研究中利用資料驅動演算法的機會。在有效進行多機構電子健康紀錄研究時，一個主要的瓶頸是系統間資料異質性，其中有許多代碼在機構間不存在或表示不同的臨床概念。資料隱私的需求進一步限制了納入多機構患者層級資料的可行性，而這些資料對於研究患者亞群之間的相似性和差異性是必要的。為了應對這些挑戰，我們開發了 GAME 演算法。GAME 已在 7 個機構和 2 種語言中進行測試和驗證，它整合了多個層級的資料：(1) 在機構層級，使用知識圖表來建立代碼和現有知識來源之間的關係，為標準代碼及其彼此之間的關係提供醫療背景；(2) 在機構之間，利用語言模型來確定機構特定代碼與已建立的標準代碼之間的關係；(3) 使用圖形注意網路量化代碼之間關係的強度。使用遷移和聯合學習建立聯合訓練的嵌入，以保護資料隱私。在本研究中，我們展示了 GAME 在選擇相關特徵作為 AI 驅動演算法輸入時的適用性，適用於各種情況，例如心臟衰竭、類風濕性關節炎。然後，我們重點介紹了 GAME 和諧化多機構電子健康紀錄資料在阿茲海默症疾病結果和精神疾病患者自殺風險研究中的應用，而無需在個別機構之外共享患者層級資料。
 
-##### **Zero-shot generation of synthetic neurosurgical data with large language models**
-2502.09566v1 by Austin A. Barr, Eddie Guo, Emre Sezgin
+##### **EEG Artifact Detection and Correction with Deep Autoencoders**
+2502.08686v1 by David Aquilué-Llorens, Aureli Soria-Frisch
 
-Clinical data is fundamental to advance neurosurgical research, but access is
-often constrained by data availability, small sample sizes, privacy
-regulations, and resource-intensive preprocessing and de-identification
-procedures. Synthetic data offers a potential solution to challenges associated
-with accessing and using real-world data (RWD). This study aims to evaluate the
-capability of zero-shot generation of synthetic neurosurgical data with a large
-language model (LLM), GPT-4o, by benchmarking with the conditional tabular
-generative adversarial network (CTGAN). Synthetic datasets were compared to
-real-world neurosurgical data to assess fidelity (means, proportions,
-distributions, and bivariate correlations), utility (ML classifier performance
-on RWD), and privacy (duplication of records from RWD). The GPT-4o-generated
-datasets matched or exceeded CTGAN performance, despite no fine-tuning or
-access to RWD for pre-training. Datasets demonstrated high univariate and
-bivariate fidelity to RWD without directly exposing any real patient records,
-even at amplified sample size. Training an ML classifier on GPT-4o-generated
-data and testing on RWD for a binary prediction task showed an F1 score (0.706)
-with comparable performance to training on the CTGAN data (0.705) for
-predicting postoperative functional status deterioration. GPT-4o demonstrated a
-promising ability to generate high-fidelity synthetic neurosurgical data. These
-findings also indicate that data synthesized with GPT-4o can effectively
-augment clinical data with small sample sizes, and train ML models for
-prediction of neurosurgical outcomes. Further investigation is necessary to
-improve the preservation of distributional characteristics and boost classifier
-performance.
+EEG signals convey important information about brain activity both in healthy
+and pathological conditions. However, they are inherently noisy, which poses
+significant challenges for accurate analysis and interpretation. Traditional
+EEG artifact removal methods, while effective, often require extensive expert
+intervention. This study presents LSTEEG, a novel LSTM-based autoencoder
+designed for the detection and correction of artifacts in EEG signals.
+Leveraging deep learning, particularly LSTM layers, LSTEEG captures non-linear
+dependencies in sequential EEG data. LSTEEG demonstrates superior performance
+in both artifact detection and correction tasks compared to other
+state-of-the-art convolutional autoencoders. Our methodology enhances the
+interpretability and utility of the autoencoder's latent space, enabling
+data-driven automated artefact removal in EEG its application in downstream
+tasks. This research advances the field of efficient and accurate multi-channel
+EEG preprocessing, and promotes the implementation and usage of automated EEG
+analysis pipelines for brain health applications.
 
-摘要：<paragraph>臨床數據是推進神經外科研究的基礎，但訪問通常受到數據可用性、樣本量小、隱私法規以及資源密集型預處理和去識別程序的限制。合成數據為與存取和使用真實世界數據 (RWD) 相關的挑戰提供了潛在解決方案。本研究旨在評估使用大型語言模型 (LLM) GPT-4o 零次生成合成神經外科數據的能力，並通過條件表格生成對抗網路 (CTGAN) 進行基準測試。將合成數據集與真實世界的神經外科數據進行比較，以評估保真度（平均值、比例、分布和二元相關性）、實用性（RWD 上的 ML 分類器性能）和隱私（RWD 中記錄的重複）。儘管沒有微調或訪問 RWD 進行預訓練，但 GPT-4o 生成的數據集與 CTGAN 性能相匹配或超過 CTGAN 性能。數據集證明了對 RWD 的高單變量和二變量保真度，即使在擴充的樣本量下也不會直接公開任何真實患者記錄。在 GPT-4o 生成的數據上訓練 ML 分類器，並在 RWD 上測試二元預測任務，顯示 F1 分數 (0.706) 與在 CTGAN 數據上訓練以預測術後功能狀態惡化時的性能相當 (0.705)。GPT-4o 展示了生成高保真合成神經外科數據的潛力。這些發現還表明，使用 GPT-4o 合成的數據可以有效地增加樣本量小的臨床數據，並訓練 ML 模型以預測神經外科結果。需要進一步研究以改善分佈特徵的保留並提升分類器性能。</paragraph>
+摘要：腦電圖訊號傳達了關於大腦活動的重要資訊，無論是在健康或病理狀況下。然而，它們本質上是有雜訊的，這對準確的分析和解釋構成了重大的挑戰。傳統的腦電圖人工製品移除方法雖然有效，但通常需要大量的專家介入。本研究提出 LSTEEG，一種新穎的基於 LSTM 的自動編碼器，用於偵測和校正腦電圖訊號中的人工製品。利用深度學習，特別是 LSTM 層，LSTEEG 捕捉序列腦電圖資料中的非線性依賴性。與其他最先進的卷積自動編碼器相比，LSTEEG 在人工製品偵測和校正任務中都展現出優異的效能。我們的做法增強了自動編碼器潛在空間的可解釋性和實用性，讓資料驅動的自動人工製品移除得以應用於腦電圖的下游任務。這項研究推動了高效且準確的多通道腦電圖前處理領域，並促進了自動腦電圖分析管線在腦部健康應用中的實作和使用。
 
-##### **MDCrow: Automating Molecular Dynamics Workflows with Large Language Models**
-2502.09565v1 by Quintina Campbell, Sam Cox, Jorge Medina, Brittany Watterson, Andrew D. White
+##### **SycEval: Evaluating LLM Sycophancy**
+2502.08177v1 by Aaron Fanous, Jacob Goldberg, Ank A. Agarwal, Joanna Lin, Anson Zhou, Roxana Daneshjou, Sanmi Koyejo
 
-Molecular dynamics (MD) simulations are essential for understanding
-biomolecular systems but remain challenging to automate. Recent advances in
-large language models (LLM) have demonstrated success in automating complex
-scientific tasks using LLM-based agents. In this paper, we introduce MDCrow, an
-agentic LLM assistant capable of automating MD workflows. MDCrow uses
-chain-of-thought over 40 expert-designed tools for handling and processing
-files, setting up simulations, analyzing the simulation outputs, and retrieving
-relevant information from literature and databases. We assess MDCrow's
-performance across 25 tasks of varying required subtasks and difficulty, and we
-evaluate the agent's robustness to both difficulty and prompt style.
-\texttt{gpt-4o} is able to complete complex tasks with low variance, followed
-closely by \texttt{llama3-405b}, a compelling open-source model. While prompt
-style does not influence the best models' performance, it has significant
-effects on smaller models.
+Large language models (LLMs) are increasingly applied in educational,
+clinical, and professional settings, but their tendency for sycophancy --
+prioritizing user agreement over independent reasoning -- poses risks to
+reliability. This study introduces a framework to evaluate sycophantic behavior
+in ChatGPT-4o, Claude-Sonnet, and Gemini-1.5-Pro across AMPS (mathematics) and
+MedQuad (medical advice) datasets. Sycophantic behavior was observed in 58.19%
+of cases, with Gemini exhibiting the highest rate (62.47%) and ChatGPT the
+lowest (56.71%). Progressive sycophancy, leading to correct answers, occurred
+in 43.52% of cases, while regressive sycophancy, leading to incorrect answers,
+was observed in 14.66%. Preemptive rebuttals demonstrated significantly higher
+sycophancy rates than in-context rebuttals (61.75% vs. 56.52%, $Z=5.87$,
+$p<0.001$), particularly in computational tasks, where regressive sycophancy
+increased significantly (preemptive: 8.13%, in-context: 3.54%, $p<0.001$).
+Simple rebuttals maximized progressive sycophancy ($Z=6.59$, $p<0.001$), while
+citation-based rebuttals exhibited the highest regressive rates ($Z=6.59$,
+$p<0.001$). Sycophantic behavior showed high persistence (78.5%, 95% CI:
+[77.2%, 79.8%]) regardless of context or model. These findings emphasize the
+risks and opportunities of deploying LLMs in structured and dynamic domains,
+offering insights into prompt programming and model optimization for safer AI
+applications.
 
-摘要：分子動力學 (MD) 模擬對於理解生物分子系統至關重要，但自動化仍然具有挑戰性。大型語言模型 (LLM) 的最新進展已證明使用基於 LLM 的代理自動化複雜的科學任務是成功的。在本文中，我們介紹了 MDCrow，這是一個代理 LLM 助理，能夠自動化 MD 工作流程。MDCrow 使用 40 多種專家設計的工具的思考鏈來處理和處理檔案、設定模擬、分析模擬輸出，以及從文獻和資料庫中檢索相關資訊。我們評估了 MDCrow 在 25 項任務中的表現，這些任務所需的子任務和難度各不相同，並且我們評估了代理對難度和提示樣式的穩健性。\texttt{gpt-4o} 能夠以低變異完成複雜的任務，緊隨其後的是一個引人注目的開源模型 \texttt{llama3-405b}。雖然提示樣式不會影響最佳模型的效能，但它對較小的模型有顯著的影響。
+摘要：大型語言模型（LLM）日益應用於教育、臨床和專業領域，但它們趨於趨炎附勢——優先考慮用戶同意而非獨立推理——對可靠性構成風險。本研究引入了一個框架來評估 ChatGPT-4o、Claude-Sonnet 和 Gemini-1.5-Pro 中的趨炎附勢行為，涉及 AMPS（數學）和 MedQuad（醫療建議）數據集。在 58.19% 的案例中觀察到了趨炎附勢行為，其中 Gemini 表現出最高比率（62.47%），而 ChatGPT 最低（56.71%）。導致正確答案的漸進式趨炎附勢發生在 43.52% 的案例中，而導致不正確答案的退步式趨炎附勢則在 14.66% 的案例中被觀察到。先發制人的反駁表現出顯著高於上下文反駁的趨炎附勢率（61.75% 對 56.52%，Z=5.87，p<0.001），特別是在計算任務中，其中退步式趨炎附勢顯著增加（先發制人：8.13%，上下文：3.54%，p<0.001）。簡單的反駁最大化了漸進式趨炎附勢（Z=6.59，p<0.001），而基於引用的反駁表現出最高的退步式比率（Z=6.59，p<0.001）。趨炎附勢行為表現出很高的持續性（78.5%，95% CI：[77.2%，79.8%]），無論上下文或模型如何。這些發現強調了在結構化和動態領域部署 LLM 的風險和機遇，為更安全的 AI 應用提供了提示編程和模型優化的見解。
 
-##### **EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents**
-2502.09560v1 by Rui Yang, Hanyang Chen, Junyu Zhang, Mark Zhao, Cheng Qian, Kangrui Wang, Qineng Wang, Teja Venkat Koripella, Marziyeh Movahedi, Manling Li, Heng Ji, Huan Zhang, Tong Zhang
+##### **Caught in the Web of Words: Do LLMs Fall for Spin in Medical Literature?**
+2502.07963v1 by Hye Sun Yun, Karen Y. C. Zhang, Ramez Kouzy, Iain J. Marshall, Junyi Jessy Li, Byron C. Wallace
 
-Leveraging Multi-modal Large Language Models (MLLMs) to create embodied
-agents offers a promising avenue for tackling real-world tasks. While
-language-centric embodied agents have garnered substantial attention,
-MLLM-based embodied agents remain underexplored due to the lack of
-comprehensive evaluation frameworks. To bridge this gap, we introduce
-EmbodiedBench, an extensive benchmark designed to evaluate vision-driven
-embodied agents. EmbodiedBench features: (1) a diverse set of 1,128 testing
-tasks across four environments, ranging from high-level semantic tasks (e.g.,
-household) to low-level tasks involving atomic actions (e.g., navigation and
-manipulation); and (2) six meticulously curated subsets evaluating essential
-agent capabilities like commonsense reasoning, complex instruction
-understanding, spatial awareness, visual perception, and long-term planning.
-Through extensive experiments, we evaluated 13 leading proprietary and
-open-source MLLMs within EmbodiedBench. Our findings reveal that: MLLMs excel
-at high-level tasks but struggle with low-level manipulation, with the best
-model, GPT-4o, scoring only 28.9% on average. EmbodiedBench provides a
-multifaceted standardized evaluation platform that not only highlights existing
-challenges but also offers valuable insights to advance MLLM-based embodied
-agents. Our code is available at https://embodiedbench.github.io.
+Medical research faces well-documented challenges in translating novel
+treatments into clinical practice. Publishing incentives encourage researchers
+to present "positive" findings, even when empirical results are equivocal.
+Consequently, it is well-documented that authors often spin study results,
+especially in article abstracts. Such spin can influence clinician
+interpretation of evidence and may affect patient care decisions. In this
+study, we ask whether the interpretation of trial results offered by Large
+Language Models (LLMs) is similarly affected by spin. This is important since
+LLMs are increasingly being used to trawl through and synthesize published
+medical evidence. We evaluated 22 LLMs and found that they are across the board
+more susceptible to spin than humans. They might also propagate spin into their
+outputs: We find evidence, e.g., that LLMs implicitly incorporate spin into
+plain language summaries that they generate. We also find, however, that LLMs
+are generally capable of recognizing spin, and can be prompted in a way to
+mitigate spin's impact on LLM outputs.
 
-摘要：<paragraph>利用多模態大型語言模型 (MLLM) 來建立具身代理，提供了解決現實世界任務的有前景途徑。儘管以語言為中心的具身代理已獲得大量關注，但由於缺乏全面的評估框架，基於 MLLM 的具身代理仍未得到充分探索。為了彌補這一差距，我們引入了 EmbodiedBench，這是一個廣泛的基準測試，旨在評估以視覺為導向的具身代理。EmbodiedBench 的特點：(1) 跨越四個環境的 1,128 項多樣化測試任務，範圍從高層級語義任務（例如，家庭）到涉及原子動作的低層級任務（例如，導航和操作）；以及 (2) 六個精心策劃的子集，用於評估基本的代理能力，例如常識推理、複雜指令理解、空間感知、視覺感知和長期規劃。通過廣泛的實驗，我們在 EmbodiedBench 中評估了 13 個領先的專有和開源 MLLM。我們的研究結果表明：MLLM 在高層級任務中表現出色，但在低層級操作中遇到困難，表現最好的模型 GPT-4o 平均得分僅為 28.9%。EmbodiedBench 提供了一個多方面的標準化評估平台，不僅突出了現有挑戰，還提供了有價值的見解來推進基於 MLLM 的具身代理。我們的程式碼可在 https://embodiedbench.github.io/ 取得。</paragraph>
+摘要：醫學研究在將新穎療法轉化為臨床實務上，面臨著有據可查的挑戰。發表誘因鼓勵研究人員呈現「正向」的發現，即使經驗結果模稜兩可。因此，有據可查的是，作者經常扭曲研究結果，特別是在文章摘要中。此類扭曲可能會影響臨床醫師對證據的詮釋，並可能影響病患照護決策。在本研究中，我們探討大型語言模型 (LLM) 提供的試驗結果詮釋是否也受到扭曲影響。由於 LLM 正越來越常被用於爬梳和綜合已發表的醫學證據，因此這點非常重要。我們評估了 22 個 LLM，發現它們普遍比人類更容易受到扭曲影響。它們也可能將扭曲傳播到其輸出中：例如，我們發現 LLM 會將扭曲隱含納入其產生的白話文摘要中。然而，我們也發現 LLM 通常有能力辨認扭曲，而且可以透過提示的方式減輕扭曲對 LLM 輸出的影響。
 
-##### **Mind the Gap! Choice Independence in Using Multilingual LLMs for Persuasive Co-Writing Tasks in Different Languages**
-2502.09532v1 by Shreyan Biswas, Alexander Erlei, Ujwal Gadiraju
+##### **An Advanced NLP Framework for Automated Medical Diagnosis with DeBERTa and Dynamic Contextual Positional Gating**
+2502.07755v1 by Mohammad Ali Labbaf Khaniki, Sahabeh Saadati, Mohammad Manthouri
 
-Recent advances in generative AI have precipitated a proliferation of novel
-writing assistants. These systems typically rely on multilingual large language
-models (LLMs), providing globalized workers the ability to revise or create
-diverse forms of content in different languages. However, there is substantial
-evidence indicating that the performance of multilingual LLMs varies between
-languages. Users who employ writing assistance for multiple languages are
-therefore susceptible to disparate output quality. Importantly, recent research
-has shown that people tend to generalize algorithmic errors across independent
-tasks, violating the behavioral axiom of choice independence. In this paper, we
-analyze whether user utilization of novel writing assistants in a charity
-advertisement writing task is affected by the AI's performance in a second
-language. Furthermore, we quantify the extent to which these patterns translate
-into the persuasiveness of generated charity advertisements, as well as the
-role of peoples' beliefs about LLM utilization in their donation choices. Our
-results provide evidence that writers who engage with an LLM-based writing
-assistant violate choice independence, as prior exposure to a Spanish LLM
-reduces subsequent utilization of an English LLM. While these patterns do not
-affect the aggregate persuasiveness of the generated advertisements, people's
-beliefs about the source of an advertisement (human versus AI) do. In
-particular, Spanish-speaking female participants who believed that they read an
-AI-generated advertisement strongly adjusted their donation behavior downwards.
-Furthermore, people are generally not able to adequately differentiate between
-human-generated and LLM-generated ads. Our work has important implications for
-the design, development, integration, and adoption of multilingual LLMs as
-assistive agents -- particularly in writing tasks.
+This paper presents a novel Natural Language Processing (NLP) framework for
+enhancing medical diagnosis through the integration of advanced techniques in
+data augmentation, feature extraction, and classification. The proposed
+approach employs back-translation to generate diverse paraphrased datasets,
+improving robustness and mitigating overfitting in classification tasks.
+Leveraging Decoding-enhanced BERT with Disentangled Attention (DeBERTa) with
+Dynamic Contextual Positional Gating (DCPG), the model captures fine-grained
+contextual and positional relationships, dynamically adjusting the influence of
+positional information based on semantic context to produce high-quality text
+embeddings. For classification, an Attention-Based Feedforward Neural Network
+(ABFNN) is utilized, effectively focusing on the most relevant features to
+improve decision-making accuracy. Applied to the classification of symptoms,
+clinical notes, and other medical texts, this architecture demonstrates its
+ability to address the complexities of medical data. The combination of data
+augmentation, contextual embedding generation, and advanced classification
+mechanisms offers a robust and accurate diagnostic tool, with potential
+applications in automated medical diagnosis and clinical decision support. This
+method demonstrates the effectiveness of the proposed NLP framework for medical
+diagnosis, achieving remarkable results with an accuracy of 99.78%, recall of
+99.72%, precision of 99.79%, and an F1-score of 99.75%. These metrics not only
+underscore the model's robust performance in classifying medical texts with
+exceptional precision and reliability but also highlight its superiority over
+existing methods, making it a highly promising tool for automated diagnostic
+systems.
 
-摘要：<paragraph>生成式 AI 的最新進展加速了新穎寫作助理的激增。這些系統通常依賴多語言大型語言模型 (LLM)，讓全球化的工作者能夠以不同的語言修改或建立各種形式的內容。然而，有大量證據顯示多語言 LLM 的表現因語言而異。因此，使用多語言寫作協助的使用者容易受到不同的輸出品質影響。重要的是，最近的研究顯示人們傾向於在獨立的任務中概化演算法錯誤，違反了選擇獨立性的行為公理。在本文中，我們分析使用者在慈善廣告寫作任務中使用新穎寫作助理是否會受到 AI 在第二語言中的表現影響。此外，我們量化這些模式轉化為所產生慈善廣告說服力的程度，以及人們對 LLM 使用在捐款選擇中的信念所扮演的角色。我們的結果提供證據，表明與基於 LLM 的寫作助理互動的寫作者會違反選擇獨立性，因為先前接觸過西班牙語 LLM 會減少後續使用英語 LLM 的情況。雖然這些模式不會影響所產生廣告的整體說服力，但人們對廣告來源（人類與 AI）的信念會影響。特別是，相信自己閱讀 AI 生成的廣告的西班牙語系女性參與者大幅調整了他們的捐款行為。此外，人們通常無法充分區分人類產生的廣告和 LLM 產生的廣告。我們的研究對多語言 LLM 作為輔助代理的設計、開發、整合和採用具有重要的意義，特別是在寫作任務中。</paragraph>
+摘要：本文提出了一個創新的自然語言處理 (NLP) 框架，透過整合資料擴充、特徵萃取和分類的進階技術來增強醫療診斷。所提出的方法採用反向翻譯來產生多樣化的同義改寫資料集，提升穩健性並減輕分類任務中的過度擬合。透過利用具有動態脈絡位置閘控 (DCPG) 的解碼增強 BERT 與去糾纏注意力 (DeBERTa)，這個模型捕捉細緻的脈絡和位置關係，根據語意脈絡動態調整位置資訊的影響，以產生高品質的文字嵌入。在分類方面，利用基於注意力的前饋神經網路 (ABFNN)，有效地關注最相關的特徵，以提高決策準確度。應用於症狀、臨床筆記和其他醫療文本的分類，此架構證明了其處理醫療資料複雜性的能力。資料擴充、脈絡嵌入產生和進階分類機制的結合提供了一個穩健且準確的診斷工具，在自動化醫療診斷和臨床決策支援中具有潛在應用。此方法證明了所提出的 NLP 框架在醫療診斷中的有效性，以 99.78% 的準確度、99.72% 的召回率、99.79% 的精確度和 99.75% 的 F1 分數，取得了顯著的成果。這些指標不僅強調了模型在分類醫療文本時具有卓越的精確度和可靠性，也突顯了它優於現有方法的優越性，使其成為自動化診斷系統中極具前景的工具。
 
-##### **Diffusion Models for Molecules: A Survey of Methods and Tasks**
-2502.09511v1 by Liang Wang, Chao Song, Zhiyuan Liu, Yu Rong, Qiang Liu, Shu Wu, Liang Wang
+##### **Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension**
+2502.07752v1 by Wenbo Gong, Meyer Scetbon, Chao Ma, Edward Meeds
 
-Generative tasks about molecules, including but not limited to molecule
-generation, are crucial for drug discovery and material design, and have
-consistently attracted significant attention. In recent years, diffusion models
-have emerged as an impressive class of deep generative models, sparking
-extensive research and leading to numerous studies on their application to
-molecular generative tasks. Despite the proliferation of related work, there
-remains a notable lack of up-to-date and systematic surveys in this area.
-Particularly, due to the diversity of diffusion model formulations, molecular
-data modalities, and generative task types, the research landscape is
-challenging to navigate, hindering understanding and limiting the area's
-growth. To address this, this paper conducts a comprehensive survey of
-diffusion model-based molecular generative methods. We systematically review
-the research from the perspectives of methodological formulations, data
-modalities, and task types, offering a novel taxonomy. This survey aims to
-facilitate understanding and further flourishing development in this area. The
-relevant papers are summarized at:
-https://github.com/AzureLeon1/awesome-molecular-diffusion-models.
+Designing efficient optimizers for large language models (LLMs) with
+low-memory requirements and fast convergence is an important and challenging
+problem. This paper makes a step towards the systematic design of such
+optimizers through the lens of structured Fisher information matrix (FIM)
+approximation. We show that many state-of-the-art efficient optimizers can be
+viewed as solutions to FIM approximation (under the Frobenius norm) with
+specific structural assumptions. Building on these insights, we propose two
+design recommendations of practical efficient optimizers for LLMs, involving
+the careful selection of structural assumptions to balance generality and
+efficiency, and enhancing memory efficiency of optimizers with general
+structures through a novel low-rank extension framework. We demonstrate how to
+use each design approach by deriving new memory-efficient optimizers: Row and
+Column Scaled SGD (RACS) and Adaptive low-dimensional subspace estimation
+(Alice). Experiments on LLaMA pre-training (up to 1B parameters) validate the
+effectiveness, showing faster and better convergence than existing
+memory-efficient baselines and Adam with little memory overhead. Notably, Alice
+achieves better than 2x faster convergence over Adam, while RACS delivers
+strong performance on the 1B model with SGD-like memory.
 
-摘要：<paragraph>包括但不限於分子生成在內的分子生成任務，對於藥物發現和材料設計至關重要，並持續吸引大量關注。近年來，擴散模型已成為深度生成模型中令人印象深刻的一類，激發了廣泛的研究，並導致對其應用於分子生成任務的眾多研究。儘管相關工作不斷增加，但這個領域仍然缺乏最新的系統性綜述。特別是，由於擴散模型公式、分子數據方式和生成任務類型的多樣性，研究領域難以瀏覽，阻礙了理解並限制了該領域的發展。為了解決這個問題，本文對基於擴散模型的分子生成方法進行了全面的調查。我們從方法論公式、數據方式和任務類型的角度系統性地回顧了研究，提供了一種新穎的分類法。本調查旨在促進理解並進一步促進該領域的蓬勃發展。相關論文總結如下：
-https://github.com/AzureLeon1/awesome-molecular-diffusion-models。</paragraph>
+摘要：設計具有低記憶體需求和快速收斂的大型語言模型 (LLM) 的高效最佳化器是一個重要且具有挑戰性的問題。本文透過結構化 Fisher 資訊矩陣 (FIM) 近似的角度，朝向此類最佳化器的系統化設計邁進一步。我們展示了許多最先進的高效最佳化器可以被視為 FIM 近似（在 Frobenius 範數下）的解，並具有特定的結構假設。基於這些見解，我們提出了 LLM 的兩個實用高效最佳化器設計建議，包括仔細選擇結構假設以平衡通用性和效率，並透過新穎的低秩延伸架構來增強具有通用結構的最佳化器的記憶體效率。我們展示了如何透過推導新的記憶體高效最佳化器來使用每種設計方法：列和欄縮放 SGD (RACS) 和自適應低維子空間估計 (Alice)。在 LLaMA 預訓練（高達 1B 參數）上的實驗驗證了其有效性，顯示比現有的記憶體高效基線和 Adam 更快且更好的收斂，且記憶體開銷很小。值得注意的是，Alice 比 Adam 快 2 倍以上，而 RACS 則在 1B 模型上提供類似 SGD 記憶體的強勁效能。
 
-##### **AttentionSmithy: A Modular Framework for Rapid Transformer Development and Customization**
-2502.09503v1 by Caleb Cranney, Jesse G. Meyer
+##### **The Devil is in the Prompts: De-Identification Traces Enhance Memorization Risks in Synthetic Chest X-Ray Generation**
+2502.07516v1 by Raman Dutt
 
-Transformer architectures have transformed AI applications but remain complex
-to customize for domain experts lacking low-level implementation expertise. We
-introduce AttentionSmithy, a modular software package that simplifies
-transformer innovation by breaking down key components into reusable building
-blocks: attention modules, feed-forward networks, normalization layers, and
-positional encodings. Users can rapidly prototype and evaluate transformer
-variants without extensive coding. Our framework supports four positional
-encoding strategies and integrates with neural architecture search for
-automated design. We validate AttentionSmithy by replicating the original
-transformer under resource constraints and optimizing translation performance
-by combining positional encodings. Additionally, we demonstrate its
-adaptability in gene-specific modeling, achieving over 95% accuracy in cell
-type classification. These case studies highlight AttentionSmithy's potential
-to accelerate research across diverse fields by removing framework
-implementation barriers.
+Generative models, particularly text-to-image (T2I) diffusion models, play a
+crucial role in medical image analysis. However, these models are prone to
+training data memorization, posing significant risks to patient privacy.
+Synthetic chest X-ray generation is one of the most common applications in
+medical image analysis with the MIMIC-CXR dataset serving as the primary data
+repository for this task. This study adopts a data-driven approach and presents
+the first systematic attempt to identify prompts and text tokens in MIMIC-CXR
+that contribute the most to training data memorization. Our analysis reveals an
+unexpected finding: prompts containing traces of de-identification procedures
+are among the most memorized, with de-identification markers contributing the
+most. Furthermore, we also find existing inference-time memorization mitigation
+strategies are ineffective and fail to sufficiently reduce the model's reliance
+on memorized text tokens highlighting a broader issue in T2I synthesis with
+MIMIC-CXR. On this front, we propose actionable strategies to enhance privacy
+and improve the reliability of generative models in medical imaging. Finally,
+our results provide a foundation for future work on developing and benchmarking
+memorization mitigation techniques for synthetic chest X-ray generation using
+the MIMIC-CXR dataset.
 
-摘要：Transformer 架構已轉變 AI 應用，但對於缺乏低階實作專業知識的領域專家而言，自訂仍很複雜。我們推出 AttentionSmithy，這是一個模組化軟體套件，透過將關鍵元件分解成可重複使用的建構區塊（注意力模組、前饋網路、正規化層和位置編碼）來簡化 Transformer 創新。使用者可以快速建置原型和評估 Transformer 變體，而無需大量編碼。我們的架構支援四種位置編碼策略，並整合神經架構搜尋以進行自動化設計。我們透過在資源限制下複製原始 Transformer 和結合位置編碼來最佳化翻譯效能，驗證 AttentionSmithy。此外，我們展示其在基因特定建模中的適應性，在細胞類型分類中達到超過 95% 的準確度。這些案例研究突顯 AttentionSmithy 在移除架構實作障礙後，加速各個領域研究的潛力。
+摘要：生成模型，尤其是文字轉圖像 (T2I) 擴散模型，在醫學影像分析中扮演著至關重要的角色。然而，這些模型容易訓練資料記憶，對病患隱私造成重大風險。合成胸部 X 光線生成是醫學影像分析中最常見的應用之一，其中 MIMIC-CXR 資料集作為此任務的主要資料儲存庫。本研究採用資料驅動的方法，並提出首次系統性嘗試，以識別 MIMIC-CXR 中最有助於訓練資料記憶的提示和文字代碼。我們的分析揭露了一個意外的發現：包含去識別程序痕跡的提示是最常被記憶的，其中去識別標記的貢獻最大。此外，我們也發現現有的推論時間記憶減緩策略無效，且無法充分降低模型對記憶文字代碼的依賴性，突顯了使用 MIMIC-CXR 進行 T2I 合成的更廣泛問題。針對此問題，我們提出可行的策略，以增強隱私並改善生成模型在醫學影像中的可靠性。最後，我們的結果為未來使用 MIMIC-CXR 資料集開發和評量合成胸部 X 光線生成的記憶減緩技術奠定了基礎。
 
-##### **Improve LLM-based Automatic Essay Scoring with Linguistic Features**
-2502.09497v1 by Zhaoyi Joey Hou, Alejandro Ciuba, Xiang Lorraine Li
+##### **KPIs 2024 Challenge: Advancing Glomerular Segmentation from Patch- to Slide-Level**
+2502.07288v1 by Ruining Deng, Tianyuan Yao, Yucheng Tang, Junlin Guo, Siqi Lu, Juming Xiong, Lining Yu, Quan Huu Cap, Pengzhou Cai, Libin Lan, Ze Zhao, Adrian Galdran, Amit Kumar, Gunjan Deotale, Dev Kumar Das, Inyoung Paik, Joonho Lee, Geongyu Lee, Yujia Chen, Wangkai Li, Zhaoyang Li, Xuege Hou, Zeyuan Wu, Shengjin Wang, Maximilian Fischer, Lars Kramer, Anghong Du, Le Zhang, Maria Sanchez Sanchez, Helena Sanchez Ulloa, David Ribalta Heredia, Carlos Perez de Arenaza Garcia, Shuoyu Xu, Bingdou He, Xinping Cheng, Tao Wang, Noemie Moreau, Katarzyna Bozek, Shubham Innani, Ujjwal Baid, Kaura Solomon Kefas, Bennett A. Landman, Yu Wang, Shilin Zhao, Mengmeng Yin, Haichun Yang, Yuankai Huo
 
-Automatic Essay Scoring (AES) assigns scores to student essays, reducing the
-grading workload for instructors. Developing a scoring system capable of
-handling essays across diverse prompts is challenging due to the flexibility
-and diverse nature of the writing task. Existing methods typically fall into
-two categories: supervised feature-based approaches and large language model
-(LLM)-based methods. Supervised feature-based approaches often achieve higher
-performance but require resource-intensive training. In contrast, LLM-based
-methods are computationally efficient during inference but tend to suffer from
-lower performance. This paper combines these approaches by incorporating
-linguistic features into LLM-based scoring. Experimental results show that this
-hybrid method outperforms baseline models for both in-domain and out-of-domain
-writing prompts.
+Chronic kidney disease (CKD) is a major global health issue, affecting over
+10% of the population and causing significant mortality. While kidney biopsy
+remains the gold standard for CKD diagnosis and treatment, the lack of
+comprehensive benchmarks for kidney pathology segmentation hinders progress in
+the field. To address this, we organized the Kidney Pathology Image
+Segmentation (KPIs) Challenge, introducing a dataset that incorporates
+preclinical rodent models of CKD with over 10,000 annotated glomeruli from 60+
+Periodic Acid Schiff (PAS)-stained whole slide images. The challenge includes
+two tasks, patch-level segmentation and whole slide image segmentation and
+detection, evaluated using the Dice Similarity Coefficient (DSC) and F1-score.
+By encouraging innovative segmentation methods that adapt to diverse CKD models
+and tissue conditions, the KPIs Challenge aims to advance kidney pathology
+analysis, establish new benchmarks, and enable precise, large-scale
+quantification for disease research and diagnosis.
 
-摘要：自動化論文評分 (AES) 會為學生的論文評分，以減輕教師的評分工作負擔。由於寫作任務的靈活性與多樣性，開發一種評分系統來處理各種提示的論文是一項挑戰。現有方法通常分為兩類：監督式特徵方法和大型語言模型 (LLM) 方法。監督式特徵方法通常能達到較高的效能，但需要大量資源進行訓練。相比之下，LLM 方法在推論期間的計算效率很高，但效能往往較低。本文結合了這些方法，將語言特徵納入 LLM 評分中。實驗結果顯示，這種混合方法在領域內和領域外寫作提示方面都優於基準模型。
+摘要：慢性腎臟病 (CKD) 是全球主要的健康問題，影響超過
+10% 的人口，並造成顯著的死亡率。雖然腎臟活檢
+仍然是 CKD 診斷和治療的黃金標準，但缺乏
+腎臟病理學分割的全面基準阻礙了該領域的進展。
+為了解決這個問題，我們組織了腎臟病理影像
+分割 (KPIs) 挑戰，引入了包含超過 10,000 個註解的
+CKD 臨床前嚙齒動物模型的資料集，這些註解來自 60 多個
+週期性酸性雪夫 (PAS) 染色的全幻燈片影像。挑戰包括
+兩個任務，修補層級分割和全幻燈片影像分割和
+偵測，使用 Dice 相似係數 (DSC) 和 F1 分數進行評估。
+通過鼓勵創新的分割方法來適應不同的 CKD 模型
+和組織條件，KPIs 挑戰旨在推進腎臟病理
+分析，建立新的基準，並實現精確、大規模的
+疾病研究和診斷量化。
 
-##### **Cracking the Code: Enhancing Development finance understanding with artificial intelligence**
-2502.09495v1 by Pierre Beaucoral
+##### **Early Risk Prediction of Pediatric Cardiac Arrest from Electronic Health Records via Multimodal Fused Transformer**
+2502.07158v1 by Jiaying Lu, Stephanie R. Brown, Songyuan Liu, Shifan Zhao, Kejun Dong, Del Bold, Michael Fundora, Alaa Aljiffry, Alex Fedorov, Jocelyn Grunwell, Xiao Hu
 
-Analyzing development projects is crucial for understanding donors aid
-strategies, recipients priorities, and to assess development finance capacity
-to adress development issues by on-the-ground actions. In this area, the
-Organisation for Economic Co-operation and Developments (OECD) Creditor
-Reporting System (CRS) dataset is a reference data source. This dataset
-provides a vast collection of project narratives from various sectors
-(approximately 5 million projects). While the OECD CRS provides a rich source
-of information on development strategies, it falls short in informing project
-purposes due to its reporting process based on donors self-declared main
-objectives and pre-defined industrial sectors. This research employs a novel
-approach that combines Machine Learning (ML) techniques, specifically Natural
-Language Processing (NLP), an innovative Python topic modeling technique called
-BERTopic, to categorise (cluster) and label development projects based on their
-narrative descriptions. By revealing existing yet hidden topics of development
-finance, this application of artificial intelligence enables a better
-understanding of donor priorities and overall development funding and provides
-methods to analyse public and private projects narratives.
+Early prediction of pediatric cardiac arrest (CA) is critical for timely
+intervention in high-risk intensive care settings. We introduce PedCA-FT, a
+novel transformer-based framework that fuses tabular view of EHR with the
+derived textual view of EHR to fully unleash the interactions of
+high-dimensional risk factors and their dynamics. By employing dedicated
+transformer modules for each modality view, PedCA-FT captures complex temporal
+and contextual patterns to produce robust CA risk estimates. Evaluated on a
+curated pediatric cohort from the CHOA-CICU database, our approach outperforms
+ten other artificial intelligence models across five key performance metrics
+and identifies clinically meaningful risk factors. These findings underscore
+the potential of multimodal fusion techniques to enhance early CA detection and
+improve patient care.
 
-摘要：分析發展專案對於了解捐助者援助策略、受贈者優先事項，以及評估發展資金能力以透過實際行動解決發展問題至關重要。在這個領域中，經濟合作暨發展組織 (OECD) 債權人報告系統 (CRS) 資料集是一個參考資料來源。此資料集提供來自各個部門的大量專案敘述（約 500 萬個專案）。雖然 OECD CRS 提供了豐富的發展策略資訊來源，但由於其報告程序基於捐助者自行申報的主要目標和預先定義的產業部門，因此在告知專案目的方面有所不足。本研究採用一種新穎的方法，結合機器學習 (ML) 技術，特別是自然語言處理 (NLP)，一種稱為 BERTopic 的創新 Python 主題建模技術，根據其敘述描述對發展專案進行分類（叢集）和標籤。透過揭露發展資金現有但隱藏的主題，這種人工智慧應用程式可以更好地了解捐助者的優先事項和整體發展資金，並提供分析公共和私人專案敘述的方法。
+摘要：早期預測兒童心臟驟停 (CA) 對高風險重症監護環境中的及時干預至關重要。我們引入了 PedCA-FT，這是一個新的基於Transformer的框架，它將 EHR 的表格視圖與 EHR 的派生文本視圖融合在一起，以充分釋放高維風險因素及其動態的交互作用。通過為每個模態視圖採用專用的Transformer模塊，PedCA-FT 捕獲復雜的時間和上下文模式以產生穩健的 CA 風險估計。在 CHOA-CICU 資料庫中經過策劃的兒科隊列上進行評估，我們的做法在五個關鍵性能指標上優於其他十個人工智慧模型，並識別出臨床上有意義的風險因素。這些發現強調了多模態融合技術在增強早期 CA 檢測和改善患者護理方面的潛力。
 
-##### **Objective quantification of mood states using large language models**
-2502.09487v1 by Jakub Onysk, Quentin Huys
+##### **Explaining 3D Computed Tomography Classifiers with Counterfactuals**
+2502.07156v1 by Joseph Paul Cohen, Louis Blankemeier, Akshay Chaudhari
 
-Emotional states influence human behaviour and cognition, leading to diverse
-thought trajectories. Similarly, Large Language Models (LLMs) showcase an
-excellent level of response consistency across wide-ranging contexts (prompts).
-We leverage these parallels to establish a framework for quantifying mental
-states. Our approach utilises self-report questionnaires that reliably assess
-these states due to their inherent sensitivity to patterns of co-occurring
-responses. Specifically, we recruited a large sample of participants (N=422) to
-investigate how well an LLM (Mistral-7B-OpenOrca) quantifies a heterogenous set
-of depressive mood states measured with participants' open-ended responses to a
-depression questionnaire. We show LLM responses to held-out multiple-choice
-questions, given participants' open-ended answers, correlate strongly (r:
-0.52-0.84) with true questionnaire scores, demonstrating LLM's generalisation
-from mood representations. We explore a link between these representations and
-factor analysis. Using ridge regression, we find depression-related subspaces
-within LLM hidden states. We show these subspaces to be predictive of
-participants' "Depression" and "Somatic & Emotional Distress" factor scores, as
-well as suicidality severity. Overall, LLMs can provide quantitative measures
-of mental states. The reliability of these hinges upon how informative the
-questions we ask participants are. Used correctly, this approach could
-supplement mental state assessment in a variety of settings.
+Counterfactual explanations in medical imaging are critical for understanding
+the predictions made by deep learning models. We extend the Latent Shift
+counterfactual generation method from 2D applications to 3D computed tomography
+(CT) scans. We address the challenges associated with 3D data, such as limited
+training samples and high memory demands, by implementing a slice-based
+approach. This method leverages a 2D encoder trained on CT slices, which are
+subsequently combined to maintain 3D context. We demonstrate this technique on
+two models for clinical phenotype prediction and lung segmentation. Our
+approach is both memory-efficient and effective for generating interpretable
+counterfactuals in high-resolution 3D medical imaging.
 
-摘要：情緒狀態會影響人類行為和認知，導致不同的思維軌跡。同樣地，大型語言模型 (LLM) 在廣泛的脈絡（提示）中展示出極佳的反應一致性。我們利用這些相似之處來建立一個量化心理狀態的框架。我們的做法利用自我報告問卷，由於這些問卷對共生反應模式具有內在敏感性，因此可以可靠地評估這些狀態。具體來說，我們招募了大量的參與者樣本 (N=422) 來調查 LLM (Mistral-7B-OpenOrca) 如何量化一組異質的抑鬱情緒狀態，這些狀態是根據參與者對抑鬱症問卷的開放式回答來衡量的。我們展示了 LLM 對保留的多選題的回答，給定參與者的開放式回答，與真正的問卷分數密切相關 (r：0.52-0.84)，這證明了 LLM 從情緒表徵中進行概括。我們探索這些表徵與因子分析之間的聯繫。使用嶺回歸，我們在 LLM 隱藏狀態內發現了與抑鬱相關的子空間。我們展示這些子空間可以預測參與者的「抑鬱」和「軀體和情緒困擾」因子分數，以及自殺嚴重性。總體而言，LLM 可以提供心理狀態的量化測量。這些測量的可靠性取決於我們詢問參與者的問題的資訊性。如果使用得當，這種方法可以補充各種環境中的心理狀態評估。
+摘要：反事實解釋在醫學影像中對於理解深度學習模型所做的預測至關重要。我們將 Latent Shift 反事實生成方法從 2D 應用程式延伸到 3D 電腦斷層掃描 (CT) 掃描。我們透過實作基於切片的做法，來解決與 3D 資料相關的挑戰，例如受限的訓練樣本和高記憶體需求。此方法利用經過 CT 切片訓練的 2D 編碼器，隨後將這些切片結合起來以維護 3D 背景。我們在兩個用於臨床表型預測和肺部分割的模型上展示此技術。我們的做法對於在高解析度 3D 醫學影像中產生可解釋的反事實，既節省記憶體又有效。
 
-##### **The Multilingual Mind : A Survey of Multilingual Reasoning in Language Models**
-2502.09457v1 by Akash Ghosh, Debayan Datta, Sriparna Saha, Chirag Agarwal
+##### **Interactive Data Harmonization with LLM Agents**
+2502.07132v1 by Aécio Santos, Eduardo H. M. Pena, Roque Lopez, Juliana Freire
 
-While reasoning and multilingual capabilities in Language Models (LMs) have
-achieved remarkable progress in recent years, their integration into a unified
-paradigm, multilingual reasoning, is at a nascent stage. Multilingual reasoning
-requires language models to handle logical reasoning across languages while
-addressing misalignment, biases, and challenges in low-resource settings. This
-survey provides the first in-depth review of multilingual reasoning in LMs. In
-this survey, we provide a systematic overview of existing methods that leverage
-LMs for multilingual reasoning, specifically outlining the challenges,
-motivations, and foundational aspects of applying language models to reason
-across diverse languages. We provide an overview of the standard data resources
-used for training multilingual reasoning in LMs and the evaluation benchmarks
-employed to assess their multilingual capabilities. Next, we analyze various
-state-of-the-art methods and their performance on these benchmarks. Finally, we
-explore future research opportunities to improve multilingual reasoning in LMs,
-focusing on enhancing their ability to handle diverse languages and complex
-reasoning tasks.
+Data harmonization is an essential task that entails integrating datasets
+from diverse sources. Despite years of research in this area, it remains a
+time-consuming and challenging task due to schema mismatches, varying
+terminologies, and differences in data collection methodologies. This paper
+presents the case for agentic data harmonization as a means to both empower
+experts to harmonize their data and to streamline the process. We introduce
+Harmonia, a system that combines LLM-based reasoning, an interactive user
+interface, and a library of data harmonization primitives to automate the
+synthesis of data harmonization pipelines. We demonstrate Harmonia in a
+clinical data harmonization scenario, where it helps to interactively create
+reusable pipelines that map datasets to a standard format. Finally, we discuss
+challenges and open problems, and suggest research directions for advancing our
+vision.
+
+摘要：資料調和是一項整合不同來源資料集的重要任務。儘管多年來針對此領域的研究不斷，但由於架構不匹配、術語不同，以及資料收集方法的差異，它仍然是一項耗時且具有挑戰性的任務。本文提出代理資料調和，作為賦能專家調和其資料並簡化流程的方法。我們介紹 Harmonia，一個結合了基於 LLM 的推理、互動式使用者介面和資料調和原語庫的系統，以自動化資料調和管線的合成。我們在臨床資料調和場景中展示了 Harmonia，它有助於互動式建立可重複使用的管線，將資料集對應至標準格式。最後，我們討論挑戰和開放性問題，並建議研究方向以推進我們的願景。
+
+##### **Machine Learning for Everyone: Simplifying Healthcare Analytics with BigQuery ML**
+2502.07026v1 by Mohammad Amir Salari, Bahareh Rahmani
+
+Machine learning (ML) is transforming healthcare by enabling predictive
+analytics, personalized treatments, and improved patient outcomes. However,
+traditional ML workflows require specialized skills, infrastructure, and
+resources, limiting accessibility for many healthcare professionals. This paper
+explores how Google Cloud's BigQuery ML simplifies the development and
+deployment of ML models using SQL, reducing technical barriers. Through a case
+study on diabetes prediction using the Diabetes Health Indicators Dataset, we
+evaluate three predictive models: Logistic Regression, Boosted Tree, and Deep
+Neural Network (DNN). Our results demonstrate that the Boosted Tree model
+achieves the highest performance, making it highly effective for diabetes
+prediction. This study highlights BigQuery ML's role in democratizing machine
+learning by providing a scalable, efficient, and accessible solution for
+healthcare analytics.
 
-摘要：儘管語言模型 (LM) 的推理和多語言能力在近年來取得顯著進展，但它們整合至統一典範（多語言推理）仍處於萌芽階段。多語言推理要求語言模型跨語言處理邏輯推理，同時解決低資源環境中的錯位、偏見和挑戰。本調查提供了 LM 中多語言推理的首次深入探討。在本調查中，我們系統性地概述了現有利用 LM 進行多語言推理的方法，特別概述了將語言模型應用於跨不同語言推理的挑戰、動機和基礎方面。我們概述了用於訓練 LM 中多語言推理的標準數據資源，以及用於評估其多語言能力的評估基準。接下來，我們分析了各種最先進的方法及其在這些基準上的表現。最後，我們探討了改進 LM 中多語言推理的未來研究機會，重點關注增強其處理不同語言和複雜推理任務的能力。
+摘要：機器學習 (ML) 透過啟用預測分析、個人化治療和改善病患結果，正在轉型醫療保健。然而，傳統的 ML 工作流程需要專業技能、基礎設施和資源，限制了許多醫療保健專業人員的可及性。本文探討 Google Cloud 的 BigQuery ML 如何使用 SQL 簡化 ML 模型的開發和部署，降低技術障礙。透過使用糖尿病健康指標資料集對糖尿病預測進行個案研究，我們評估了三個預測模型：邏輯迴歸、提升樹和深度神經網路 (DNN)。我們的結果證明，提升樹模型達到了最高的效能，使其對於糖尿病預測非常有效。這項研究強調了 BigQuery ML 在民主化機器學習中扮演的角色，提供可擴充、有效率且可存取的醫療保健分析解決方案。
 
-##### **Pixel-Level Reasoning Segmentation via Multi-turn Conversations**
-2502.09447v1 by Dexian Cai, Xiaocui Yang, Yongkang Liu, Daling Wang, Shi Feng, Yifei Zhang, Soujanya Poria
+##### **AIMS.au: A Dataset for the Analysis of Modern Slavery Countermeasures in Corporate Statements**
+2502.07022v1 by Adriana Eufrosiana Bora, Pierre-Luc St-Charles, Mirko Bronzi, Arsène Fansi Tchango, Bruno Rousseau, Kerrie Mengersen
 
-Existing visual perception systems focus on region-level segmentation in
-single-turn dialogues, relying on complex and explicit query instructions. Such
-systems cannot reason at the pixel level and comprehend dynamic user intent
-that changes over interaction. Our work tackles this issue by introducing a
-novel task, Pixel-level Reasoning Segmentation (Pixel-level RS) based on
-multi-turn conversations, tracking evolving user intent via multi-turn
-interactions for fine-grained segmentation. To establish a benchmark for this
-novel task, we build a Pixel-level ReasonIng Segmentation Dataset Based on
-Multi-Turn Conversations (PRIST), comprising 24k utterances from 8.3k
-multi-turn conversational scenarios with segmentation targets. Building on
-PRIST, we further propose MIRAS, a Multi-turn Interactive ReAsoning
-Segmentation framework, integrates pixel-level segmentation with robust
-multi-turn conversation understanding, generating pixel-grounded explanations
-aligned with user intent. The PRIST dataset and MIRSA framework fill the gap in
-pixel-level reasoning segmentation. Experimental results on the PRIST dataset
-demonstrate that our method outperforms current segmentation-specific baselines
-in terms of segmentation and LLM-based reasoning metrics. The code and data are
-available at: https://github.com/ccccai239/PixelRIST.
+Despite over a decade of legislative efforts to address modern slavery in the
+supply chains of large corporations, the effectiveness of government oversight
+remains hampered by the challenge of scrutinizing thousands of statements
+annually. While Large Language Models (LLMs) can be considered a well
+established solution for the automatic analysis and summarization of documents,
+recognizing concrete modern slavery countermeasures taken by companies and
+differentiating those from vague claims remains a challenging task. To help
+evaluate and fine-tune LLMs for the assessment of corporate statements, we
+introduce a dataset composed of 5,731 modern slavery statements taken from the
+Australian Modern Slavery Register and annotated at the sentence level. This
+paper details the construction steps for the dataset that include the careful
+design of annotation specifications, the selection and preprocessing of
+statements, and the creation of high-quality annotation subsets for effective
+model evaluations. To demonstrate our dataset's utility, we propose a machine
+learning methodology for the detection of sentences relevant to mandatory
+reporting requirements set by the Australian Modern Slavery Act. We then follow
+this methodology to benchmark modern language models under zero-shot and
+supervised learning settings.
 
-摘要：現有的視覺感知系統專注於單輪對話中的區域級分割，依賴於複雜且明確的查詢指令。此類系統無法在像素級別推理和理解在互動中不斷變化的動態使用者意圖。我們的研究通過引入一項基於多輪對話的像素級推理分割（像素級 RS）新任務來解決此問題，通過多輪互動追蹤不斷演變的使用者意圖，以進行精細分割。為了建立此新任務的基準，我們建立了一個基於多輪對話的像素級推理分割資料集（PRIST），其中包含來自 8.3k 多輪對話場景的 24k 個語句，以及分割目標。在 PRIST 的基礎上，我們進一步提出了 MIRAS，這是一個多輪互動推理分割框架，它將像素級分割與強大的多輪對話理解整合在一起，生成符合使用者意圖的像素級解釋。PRIST 資料集和 MIRSA 框架填補了像素級推理分割的空白。在 PRIST 資料集上的實驗結果表明，我們的模型在分割和基於 LLM 的推理指標方面優於目前的特定於分割的基準。程式碼和資料可在 https://github.com/ccccai239/PixelRIST 獲得。
+摘要：儘管立法努力超過十年，旨在解決大型企業供應鏈中的現代奴隸制，但政府監督的有效性仍然受到每年審查數千份聲明的挑戰所阻礙。雖然大型語言模型（LLM）可以被認為是文件自動分析和摘要的完善解決方案，但要辨識公司採取的具體現代奴隸制對策，並將其與含糊的聲明區分開來，仍然是一項具有挑戰性的任務。為了幫助評估和微調 LLM 以評估企業聲明，我們引入了一個由 5,731 份現代奴隸制聲明組成的資料集，這些聲明取自澳洲現代奴隸制註冊處，並在句子層級進行註解。本文詳細說明了資料集的建構步驟，其中包括註解規格的仔細設計、聲明的選擇和預處理，以及用於有效模型評估的高品質註解子集的建立。為了展示我們的資料集的效用，我們提出了一種機器學習方法，用於檢測與澳洲現代奴隸制法規定的強制性報告要求相關的句子。然後，我們遵循這種方法，在零次學習和監督學習設定下對現代語言模型進行基準測試。
 
-##### **Dual Formulation for Non-Rectangular Lp Robust Markov Decision Processes**
-2502.09432v1 by Navdeep Kumar, Adarsh Gupta, Maxence Mohamed Elfatihi, Giorgia Ramponi, Kfir Yehuda Levy, Shie Mannor
+##### **Recent Advances, Applications and Open Challenges in Machine Learning for Health: Reflections from Research Roundtables at ML4H 2024 Symposium**
+2502.06693v1 by Amin Adibi, Xu Cao, Zongliang Ji, Jivat Neet Kaur, Winston Chen, Elizabeth Healey, Brighton Nuwagira, Wenqian Ye, Geoffrey Woollard, Maxwell A Xu, Hejie Cui, Johnny Xi, Trenton Chang, Vasiliki Bikia, Nicole Zhang, Ayush Noori, Yuan Xia, Md. Belal Hossain, Hanna A. Frank, Alina Peluso, Yuan Pu, Shannon Zejiang Shen, John Wu, Adibvafa Fallahpour, Sazan Mahbub, Ross Duncan, Yuwei Zhang, Yurui Cao, Zuheng Xu, Michael Craig, Rahul G. Krishnan, Rahmatollah Beheshti, James M. Rehg, Mohammad Ehsanul Karim, Megan Coffee, Leo Anthony Celi, Jason Alan Fries, Mohsen Sadatsafavi, Dennis Shung, Shannon McWeeney, Jessica Dafflon, Sarah Jabbour
 
-We study robust Markov decision processes (RMDPs) with non-rectangular
-uncertainty sets, which capture interdependencies across states unlike
-traditional rectangular models. While non-rectangular robust policy evaluation
-is generally NP-hard, even in approximation, we identify a powerful class of
-$L_p$-bounded uncertainty sets that avoid these complexity barriers due to
-their structural simplicity. We further show that this class can be decomposed
-into infinitely many \texttt{sa}-rectangular $L_p$-bounded sets and leverage
-its structural properties to derive a novel dual formulation for $L_p$ RMDPs.
-This formulation provides key insights into the adversary's strategy and
-enables the development of the first robust policy evaluation algorithms for
-non-rectangular RMDPs. Empirical results demonstrate that our approach
-significantly outperforms brute-force methods, establishing a promising
-foundation for future investigation into non-rectangular robust MDPs.
+The fourth Machine Learning for Health (ML4H) symposium was held in person on
+December 15th and 16th, 2024, in the traditional, ancestral, and unceded
+territories of the Musqueam, Squamish, and Tsleil-Waututh Nations in Vancouver,
+British Columbia, Canada. The symposium included research roundtable sessions
+to foster discussions between participants and senior researchers on timely and
+relevant topics for the ML4H community. The organization of the research
+roundtables at the conference involved 13 senior and 27 junior chairs across 13
+tables. Each roundtable session included an invited senior chair (with
+substantial experience in the field), junior chairs (responsible for
+facilitating the discussion), and attendees from diverse backgrounds with an
+interest in the session's topic.
 
-摘要：我們研究具有非矩形不確定性集合的強健馬可夫決策過程 (RMDP)，它能捕捉到不同於傳統矩形模型的跨狀態相互依賴性。雖然非矩形強健策略評估通常是 NP-hard，即使在近似中也是如此，我們識別了一類強大的 $L_p$ 有界不確定性集合，由於其結構的簡潔性，可以避免這些複雜性障礙。我們進一步表明，此類可以分解為無限多的 \texttt{sa} 矩形 $L_p$ 有界集合，並利用其結構屬性為 $L_p$ RMDP 導出一個新的對偶公式。此公式提供了對抗者策略的重要見解，並能夠開發出第一個非矩形 RMDP 的強健策略評估演算法。實證結果表明，我們的做法顯著優於蠻力方法，為未來對非矩形強健 MDP 的研究奠定了有希望的基礎。
+摘要：第四屆醫療機器學習 (ML4H) 研討會於 2024 年 12 月 15 日和 16 日在加拿大不列顛哥倫比亞省溫哥華的 Musqueam、Squamish 和 Tsleil-Waututh 國家的傳統、祖先和未割讓領土上舉行。研討會包括研究圓桌會議，以促進參與者和高級研究人員之間關於 ML4H 社群的及時和相關主題的討論。在會議上組織研究圓桌會議涉及 13 張桌子上的 13 位高級主席和 27 位初級主席。每個圓桌會議都包括一位受邀的高級主席（在該領域擁有豐富的經驗）、初級主席（負責促進討論）以及對會議主題感興趣的來自不同背景的與會者。
 
-##### **Transformer-Enhanced Variational Autoencoder for Crystal Structure Prediction**
-2502.09423v1 by Ziyi Chen, Yang Yuan, Siming Zheng, Jialong Guo, Sihan Liang, Yangang Wang, Zongguo Wang
+##### **Automatic Evaluation of Healthcare LLMs Beyond Question-Answering**
+2502.06666v1 by Anna Arias-Duart, Pablo Agustin Martin-Torres, Daniel Hinjos, Pablo Bernabeu-Perez, Lucia Urcelay Ganzabal, Marta Gonzalez Mallo, Ashwin Kumar Gururajan, Enrique Lopez-Cuena, Sergio Alvarez-Napagao, Dario Garcia-Gasulla
 
-Crystal structure forms the foundation for understanding the physical and
-chemical properties of materials. Generative models have emerged as a new
-paradigm in crystal structure prediction(CSP), however, accurately capturing
-key characteristics of crystal structures, such as periodicity and symmetry,
-remains a significant challenge. In this paper, we propose a
-Transformer-Enhanced Variational Autoencoder for Crystal Structure Prediction
-(TransVAE-CSP), who learns the characteristic distribution space of stable
-materials, enabling both the reconstruction and generation of crystal
-structures. TransVAE-CSP integrates adaptive distance expansion with
-irreducible representation to effectively capture the periodicity and symmetry
-of crystal structures, and the encoder is a transformer network based on an
-equivariant dot product attention mechanism. Experimental results on the
-carbon_24, perov_5, and mp_20 datasets demonstrate that TransVAE-CSP
-outperforms existing methods in structure reconstruction and generation tasks
-under various modeling metrics, offering a powerful tool for crystal structure
-design and optimization.
+Current Large Language Models (LLMs) benchmarks are often based on open-ended
+or close-ended QA evaluations, avoiding the requirement of human labor.
+Close-ended measurements evaluate the factuality of responses but lack
+expressiveness. Open-ended capture the model's capacity to produce discourse
+responses but are harder to assess for correctness. These two approaches are
+commonly used, either independently or together, though their relationship
+remains poorly understood. This work is focused on the healthcare domain, where
+both factuality and discourse matter greatly. It introduces a comprehensive,
+multi-axis suite for healthcare LLM evaluation, exploring correlations between
+open and close benchmarks and metrics. Findings include blind spots and
+overlaps in current methodologies. As an updated sanity check, we release a new
+medical benchmark--CareQA--, with both open and closed variants. Finally, we
+propose a novel metric for open-ended evaluations --Relaxed Perplexity-- to
+mitigate the identified limitations.
 
-摘要：晶體結構形成了解材料物理和化學性質的基礎。生成模型已成為晶體結構預測 (CSP) 的新典範，然而，準確捕捉晶體結構的關鍵特徵（例如週期性和對稱性）仍然是一項重大挑戰。在本文中，我們提出了一種用於晶體結構預測的 Transformer 增強變異自動編碼器 (TransVAE-CSP)，它學習穩定材料的特徵分佈空間，使晶體結構的重建和生成成為可能。TransVAE-CSP 將自適應距離擴展與不可約表示相結合，以有效地捕捉晶體結構的週期性和對稱性，並且編碼器是一個基於等變點積注意力機制的 Transformer 網路。在 carbon_24、perov_5 和 mp_20 資料集上的實驗結果表明，TransVAE-CSP 在各種建模指標下，在結構重建和生成任務中優於現有方法，為晶體結構設計和最佳化提供了一個強大的工具。
+摘要：當前大型語言模型 (LLM) 基準通常基於開放式或封閉式問答評量，避免了人力需求。封閉式測量評估回應的事實性，但缺乏表達力。開放式測量捕捉模型產生論述回應的能力，但較難評估正確性。這兩種方法通常獨立或合併使用，儘管它們之間的關係仍然知之甚少。這項工作專注於醫療保健領域，在該領域中，事實性和論述都非常重要。它引入了一個全面的多軸套件，用於醫療保健 LLM 評量，探索開放式和封閉式基準和指標之間的關聯性。研究結果包括當前方法中的盲點和重疊。作為更新的健全性檢查，我們發布了一個新的醫療基準--CareQA--，包含開放式和封閉式變體。最後，我們提出了一個用於開放式評量的全新指標--放鬆困惑度--以減輕已識別的限制。
 
-##### **On multi-token prediction for efficient LLM inference**
-2502.09419v1 by Somesh Mehra, Javier Alonso Garcia, Lukas Mauch
+##### **Few-Shot Classification and Anatomical Localization of Tissues in SPECT Imaging**
+2502.06632v1 by Mohammed Abdul Hafeez Khan, Samuel Morries Boddepalli, Siddhartha Bhattacharyya, Debasis Mitra
 
-We systematically investigate multi-token prediction (MTP) capabilities
-within LLMs pre-trained for next-token prediction (NTP). We first show that
-such models inherently possess MTP capabilities via numerical marginalization
-over intermediate token probabilities, though performance is data-dependent and
-improves with model scale. Furthermore, we explore the challenges of
-integrating MTP heads into frozen LLMs and find that their hidden layers are
-strongly specialized for NTP, making adaptation non-trivial. Finally, we show
-that while joint training of MTP heads with the backbone improves performance,
-it cannot fully overcome this barrier, prompting further research in this
-direction. Our findings provide a deeper understanding of MTP applied to
-pretrained LLMs, informing strategies for accelerating inference through
-parallel token prediction.
+Accurate classification and anatomical localization are essential for
+effective medical diagnostics and research, which may be efficiently performed
+using deep learning techniques. However, availability of limited labeled data
+poses a significant challenge. To address this, we adapted Prototypical
+Networks and the Propagation-Reconstruction Network (PRNet) for few-shot
+classification and localization, respectively, in Single Photon Emission
+Computed Tomography (SPECT) images. For the proof of concept we used a
+2D-sliced image cropped around heart. The Prototypical Network, with a
+pre-trained ResNet-18 backbone, classified ventricles, myocardium, and liver
+tissues with 96.67% training and 93.33% validation accuracy. PRNet, adapted for
+2D imaging with an encoder-decoder architecture and skip connections, achieved
+a training loss of 1.395, accurately reconstructing patches and capturing
+spatial relationships. These results highlight the potential of Prototypical
+Networks for tissue classification with limited labeled data and PRNet for
+anatomical landmark localization, paving the way for improved performance in
+deep learning frameworks.
 
-摘要：我們系統性地研究了在預先訓練下用於下一個代幣預測 (NTP) 的 LLM 中的多代幣預測 (MTP) 功能。我們首先表明，此類模型透過中間代幣機率的數值邊際化本質上具備 MTP 功能，儘管效能依賴於資料，且會隨著模型規模而提升。此外，我們探討了將 MTP 頭整合到凍結 LLM 中的挑戰，發現其隱藏層高度專門用於 NTP，使得適應變得不簡單。最後，我們顯示，儘管 MTP 頭與主幹的聯合訓練會提升效能，但無法完全克服此障礙，促使我們進一步研究這個方向。我們的發現提供了對應用於預先訓練 LLM 的 MTP 更深入的理解，並為透過平行代幣預測加速推論提供策略。
+摘要：精確的分類和解剖定位對於有效的醫療診斷和研究至關重要，而這可以使用深度學習技術有效執行。然而，標記資料有限的取得會造成重大的挑戰。為了解決這個問題，我們分別調整了原型網路和傳播重建網路 (PRNet)，用於單光子發射電腦斷層掃描 (SPECT) 影像中的少量分類和定位。為了證明這個概念，我們使用圍繞心臟裁切的 2D 切片影像。原型網路，使用預先訓練的 ResNet-18 主幹，對心室、心肌和肝臟組織進行分類，訓練準確度為 96.67%，驗證準確度為 93.33%。PRNet，調整為使用編碼器解碼器架構和跳躍連接的 2D 影像，達到了 1.395 的訓練損失，精確地重建了區塊並擷取了空間關係。這些結果突出了原型網路在標記資料有限的情況下進行組織分類的潛力，以及 PRNet 在解剖標誌定位方面的潛力，為深度學習架構中效能的提升鋪平了道路。
 
-##### **SQuARE: Sequential Question Answering Reasoning Engine for Enhanced Chain-of-Thought in Large Language Models**
-2502.09390v1 by Daniel Fleischer, Moshe Berchansky, Gad Markovits, Moshe Wasserblat
+##### **Illegal Waste Detection in Remote Sensing Images: A Case Study**
+2502.06607v2 by Federico Gibellini, Piero Fraternali, Giacomo Boracchi, Luca Morandini, Andrea Diecidue, Simona Malegori
 
-In the rapidly evolving field of Natural Language Processing, Large Language
-Models (LLMs) are tasked with increasingly complex reasoning challenges.
-Traditional methods like chain-of-thought prompting have shown promise but
-often fall short in fully leveraging a model's reasoning capabilities. This
-paper introduces SQuARE (Sequential Question Answering Reasoning Engine), a
-novel prompting technique designed to improve reasoning through a
-self-interrogation paradigm. Building upon CoT frameworks, SQuARE prompts
-models to generate and resolve multiple auxiliary questions before tackling the
-main query, promoting a more thorough exploration of various aspects of a
-topic. Our expansive evaluations, conducted with Llama 3 and GPT-4o models
-across multiple question-answering datasets, demonstrate that SQuARE
-significantly surpasses traditional CoT prompts and existing
-rephrase-and-respond methods. By systematically decomposing queries, SQuARE
-advances LLM capabilities in reasoning tasks. The code is publicly available at
-https://github.com/IntelLabs/RAG-FiT/tree/square.
+Environmental crime currently represents the third largest criminal activity
+worldwide while threatening ecosystems as well as human health. Among the
+crimes related to this activity, improper waste management can nowadays be
+countered more easily thanks to the increasing availability and decreasing cost
+of Very-High-Resolution Remote Sensing images, which enable semi-automatic
+territory scanning in search of illegal landfills. This paper proposes a
+pipeline, developed in collaboration with professionals from a local
+environmental agency, for detecting candidate illegal dumping sites leveraging
+a classifier of Remote Sensing images. To identify the best configuration for
+such classifier, an extensive set of experiments was conducted and the impact
+of diverse image characteristics and training settings was thoroughly analyzed.
+The local environmental agency was then involved in an experimental exercise
+where outputs from the developed classifier were integrated in the experts'
+everyday work, resulting in time savings with respect to manual
+photo-interpretation. The classifier was eventually run with valuable results
+on a location outside of the training area, highlighting potential for
+cross-border applicability of the proposed pipeline.
 
-摘要：在快速發展的自然語言處理領域中，大型語言模型 (LLM) 負責越來越複雜的推理挑戰。
-傳統方法（如思考鏈提示）已展現潛力，但通常無法充分利用模型的推理能力。本文介紹 SQuARE（順序式問答推理引擎），這是一種新穎的提示技術，旨在透過自我提問模式來改善推理。建立在 CoT 架構之上，SQuARE 提示模型在處理主要查詢之前產生並解決多個輔助問題，促進對某個主題的各個面向進行更徹底的探討。我們使用 Llama 3 和 GPT-4o 模型對多個問答資料集進行廣泛評估，結果顯示 SQuARE 明顯優於傳統 CoT 提示和現有的改寫並回應方法。透過系統性地分解查詢，SQuARE 提升了 LLM 在推理任務中的能力。程式碼已公開於 https://github.com/IntelLabs/RAG-FiT/tree/square。
+摘要：環境犯罪目前是全球第三大犯罪活動，威脅生態系統和人類健康。在與此活動相關的犯罪中，不當廢物管理現在可以更容易地得到解決，這要歸功於超高解析度遙測影像越來越普及且成本下降，這使得半自動領土掃描能夠搜尋非法垃圾掩埋場。本文提出了一條管道，與當地環境機構的專業人士合作開發，用於檢測候選非法傾倒地點，利用遙測影像分類器。為了找出這種分類器的最佳配置，進行了一系列廣泛的實驗，並徹底分析了不同影像特徵和訓練設定的影響。然後，當地環境機構參與了一項實驗練習，其中將已開發分類器的輸出整合到專家的日常工作中，從而節省了人工照片解譯的時間。最後在訓練區域外的某個位置執行分類器，獲得了有價值的結果，突出了所提出管道的跨境適用性潛力。
 
-##### **Truth Knows No Language: Evaluating Truthfulness Beyond English**
-2502.09387v1 by Blanca Calvo Figueras, Eneko Sagarzazu, Julen Etxaniz, Jeremy Barnes, Pablo Gamallo, Iria De Dios Flores, Rodrigo Agerri
+##### **FEMBA: Efficient and Scalable EEG Analysis with a Bidirectional Mamba Foundation Model**
+2502.06438v1 by Anna Tegon, Thorir Mar Ingolfsson, Xiaying Wang, Luca Benini, Yawei Li
 
-We introduce a professionally translated extension of the TruthfulQA
-benchmark designed to evaluate truthfulness in Basque, Catalan, Galician, and
-Spanish. Truthfulness evaluations of large language models (LLMs) have
-primarily been conducted in English. However, the ability of LLMs to maintain
-truthfulness across languages remains under-explored. Our study evaluates 12
-state-of-the-art open LLMs, comparing base and instruction-tuned models using
-human evaluation, multiple-choice metrics, and LLM-as-a-Judge scoring. Our
-findings reveal that, while LLMs perform best in English and worst in Basque
-(the lowest-resourced language), overall truthfulness discrepancies across
-languages are smaller than anticipated. Furthermore, we show that
-LLM-as-a-Judge correlates more closely with human judgments than
-multiple-choice metrics, and that informativeness plays a critical role in
-truthfulness assessment. Our results also indicate that machine translation
-provides a viable approach for extending truthfulness benchmarks to additional
-languages, offering a scalable alternative to professional translation.
-Finally, we observe that universal knowledge questions are better handled
-across languages than context- and time-dependent ones, highlighting the need
-for truthfulness evaluations that account for cultural and temporal
-variability. Dataset and code are publicly available under open licenses.
+Accurate and efficient electroencephalography (EEG) analysis is essential for
+detecting seizures and artifacts in long-term monitoring, with applications
+spanning hospital diagnostics to wearable health devices. Robust EEG analytics
+have the potential to greatly improve patient care. However, traditional deep
+learning models, especially Transformer-based architectures, are hindered by
+their quadratic time and memory complexity, making them less suitable for
+resource-constrained environments. To address these challenges, we present
+FEMBA (Foundational EEG Mamba + Bidirectional Architecture), a novel
+self-supervised framework that establishes new efficiency benchmarks for EEG
+analysis through bidirectional state-space modeling. Unlike Transformer-based
+models, which incur quadratic time and memory complexity, FEMBA scales linearly
+with sequence length, enabling more scalable and efficient processing of
+extended EEG recordings. Trained on over 21,000 hours of unlabeled EEG and
+fine-tuned on three downstream tasks, FEMBA achieves competitive performance in
+comparison with transformer models, with significantly lower computational
+cost. Specifically, it reaches 81.82% balanced accuracy (0.8921 AUROC) on TUAB
+and 0.949 AUROC on TUAR, while a tiny 7.8M-parameter variant demonstrates
+viability for resource-constrained devices. These results pave the way for
+scalable, general-purpose EEG analytics in both clinical and highlight FEMBA as
+a promising candidate for wearable applications.
 
-摘要：我們針對 TruthfulQA 推出專業翻譯的延伸版本，旨在評估巴斯克語、加泰隆尼亞語、加利西亞語和西班牙語中的真實性。大型語言模型 (LLM) 的真實性評估主要以英語進行。然而，LLM 在不同語言中維持真實性的能力仍未得到充分探索。我們的研究評估了 12 個最先進的開放 LLM，使用人類評估、多選項指標和 LLM 作為評分標準比較基礎和指令調整模型。我們的研究結果表明，雖然 LLM 在英語中的表現最好，而在巴斯克語（資源最少的語言）中的表現最差，但整體上不同語言之間的真實性差異小於預期。此外，我們表明，與多選項指標相比，LLM 作為評分標準與人類判斷更密切相關，而且信息豐富性在真實性評估中發揮著至關重要的作用。我們的結果還表明，機器翻譯提供了一種可行的途徑，可以將真實性基準擴展到其他語言，從而提供了一種可擴展的專業翻譯替代方案。最後，我們觀察到，與上下文和時間依賴的問題相比，通用知識問題在不同語言之間的處理效果更好，這突顯了考慮文化和時間可變性的真實性評估的必要性。數據集和代碼在開放許可下公開可用。
+摘要：準確且有效的腦電圖 (EEG) 分析對於偵測長時間監控中的癲癇發作和偽像至關重要，其應用範圍涵蓋醫院診斷到可穿戴式健康裝置。穩健的 EEG 分析具有大幅改善病患照護的潛力。然而，傳統深度學習模型，特別是基於 Transformer 的架構，受到其二次時間和記憶體複雜度的阻礙，使其不太適合資源受限的環境。為了應對這些挑戰，我們提出 FEMBA (基礎 EEG Mamba + 雙向架構)，一種創新的自我監督架構，透過雙向狀態空間建模為 EEG 分析建立新的效率基準。與會產生二次時間和記憶體複雜度的基於 Transformer 的模型不同，FEMBA 隨著序列長度線性縮放，支援更具可擴充性和效率的延伸 EEG 記錄處理。FEMBA 在超過 21,000 小時的未標記 EEG 上訓練並在三個下游任務上進行微調，與Transformer模型相比，在計算成本顯著降低的情況下，實現了具有競爭力的效能。具體來說，它在 TUAB 上達到 81.82% 的平衡準確度 (0.8921 AUROC) 和在 TUAR 上達到 0.949 AUROC，而一個微小的 7.8M 參數變體證明了其在資源受限裝置上的可行性。這些結果為臨床和可穿戴應用中可擴充的通用 EEG 分析鋪平了道路，並突顯 FEMBA 是可穿戴應用中一個有前景的候選者。
+
+##### **Is an Ultra Large Natural Image-Based Foundation Model Superior to a Retina-Specific Model for Detecting Ocular and Systemic Diseases?**
+2502.06289v1 by Qingshan Hou, Yukun Zhou, Jocelyn Hui Lin Goh, Ke Zou, Samantha Min Er Yew, Sahana Srinivasan, Meng Wang, Thaddaeus Lo, Xiaofeng Lei, Siegfried K. Wagner, Mark A. Chia, Dawei Yang, Hongyang Jiang, AnRan Ran, Rui Santos, Gabor Mark Somfai, Juan Helen Zhou, Haoyu Chen, Qingyu Chen, Carol Yim-Lui Cheung, Pearse A. Keane, Yih Chung Tham
+
+The advent of foundation models (FMs) is transforming medical domain. In
+ophthalmology, RETFound, a retina-specific FM pre-trained sequentially on 1.4
+million natural images and 1.6 million retinal images, has demonstrated high
+adaptability across clinical applications. Conversely, DINOv2, a
+general-purpose vision FM pre-trained on 142 million natural images, has shown
+promise in non-medical domains. However, its applicability to clinical tasks
+remains underexplored. To address this, we conducted head-to-head evaluations
+by fine-tuning RETFound and three DINOv2 models (large, base, small) for ocular
+disease detection and systemic disease prediction tasks, across eight
+standardized open-source ocular datasets, as well as the Moorfields AlzEye and
+the UK Biobank datasets. DINOv2-large model outperformed RETFound in detecting
+diabetic retinopathy (AUROC=0.850-0.952 vs 0.823-0.944, across three datasets,
+all P<=0.007) and multi-class eye diseases (AUROC=0.892 vs. 0.846, P<0.001). In
+glaucoma, DINOv2-base model outperformed RETFound (AUROC=0.958 vs 0.940,
+P<0.001). Conversely, RETFound achieved superior performance over all DINOv2
+models in predicting heart failure, myocardial infarction, and ischaemic stroke
+(AUROC=0.732-0.796 vs 0.663-0.771, all P<0.001). These trends persisted even
+with 10% of the fine-tuning data. These findings showcase the distinct
+scenarios where general-purpose and domain-specific FMs excel, highlighting the
+importance of aligning FM selection with task-specific requirements to optimise
+clinical performance.
 
-##### **A Deep Inverse-Mapping Model for a Flapping Robotic Wing**
-2502.09378v1 by Hadar Sharvit, Raz Karl, Tsevi Beatus
+摘要：基礎模型 (FM) 的出現正在轉變醫療領域。在眼科，RETFound 是一個視網膜專用 FM，依序使用 140 萬張自然影像和 160 萬張視網膜影像進行預訓練，已展現出高度適應性，可應用於各種臨床應用。相反地，DINOv2 是一個通用視覺 FM，使用 1.42 億張自然影像進行預訓練，已展現出在非醫療領域的潛力。然而，其在臨床任務中的適用性仍未被充分探索。為了解決這個問題，我們針對眼部疾病偵測和全身性疾病預測任務，對 RETFound 和三個 DINOv2 模型（大型、基礎、小型）進行微調，並進行一對一的評估，使用八個標準化的開源眼科資料集，以及 Moorfields AlzEye 和 UK Biobank 資料集。DINOv2 大型模型在糖尿病視網膜病變偵測方面優於 RETFound（三個資料集的 AUROC=0.850-0.952，相較於 0.823-0.944，所有 P<=0.007）和多類眼部疾病（AUROC=0.892，相較於 0.846，P<0.001）。在青光眼方面，DINOv2 基礎模型優於 RETFound（AUROC=0.958，相較於 0.940，P<0.001）。相反地，RETFound 在預測心臟衰竭、心肌梗塞和缺血性中風方面優於所有 DINOv2 模型（AUROC=0.732-0.796，相較於 0.663-0.771，所有 P<0.001）。即使使用 10% 的微調資料，這些趨勢仍然持續。這些發現展示了通用和領域專用 FM 各自擅長的場景，突顯了根據任務特定需求調整 FM 選擇，以最佳化臨床表現的重要性。
 
-In systems control, the dynamics of a system are governed by modulating its
-inputs to achieve a desired outcome. For example, to control the thrust of a
-quad-copter propeller the controller modulates its rotation rate, relying on a
-straightforward mapping between the input rotation rate and the resulting
-thrust. This mapping can be inverted to determine the rotation rate needed to
-generate a desired thrust. However, in complex systems, such as flapping-wing
-robots where intricate fluid motions are involved, mapping inputs (wing
-kinematics) to outcomes (aerodynamic forces) is nontrivial and inverting this
-mapping for real-time control is computationally impractical. Here, we report a
-machine-learning solution for the inverse mapping of a flapping-wing system
-based on data from an experimental system we have developed. Our model learns
-the input wing motion required to generate a desired aerodynamic force outcome.
-We used a sequence-to-sequence model tailored for time-series data and
-augmented it with a novel adaptive-spectrum layer that implements
-representation learning in the frequency domain. To train our model, we
-developed a flapping wing system that simultaneously measures the wing's
-aerodynamic force and its 3D motion using high-speed cameras. We demonstrate
-the performance of our system on an additional open-source dataset of a
-flapping wing in a different flow regime. Results show superior performance
-compared with more complex state-of-the-art transformer-based models, with 11%
-improvement on the test datasets median loss. Moreover, our model shows
-superior inference time, making it practical for onboard robotic control. Our
-open-source data and framework may improve modeling and real-time control of
-systems governed by complex dynamics, from biomimetic robots to biomedical
-devices.
+##### **Integrating Sequence and Image Modeling in Irregular Medical Time Series Through Self-Supervised Learning**
+2502.06134v1 by Liuqing Chen, Shuhong Xiao, Shixian Ding, Shanhai Hu, Lingyun Sun
 
-摘要：<paragraph>在系統控制中，系統的動態受調節其輸入以實現所需結果的影響。例如，為了控制四軸旋翼推進器的推力，控制器會調節其旋轉速率，依賴於輸入旋轉速率和所產生的推力之間的直接映射。此映射可以反轉以確定產生所需推力所需的旋轉速率。然而，在複雜的系統中，例如涉及複雜流體運動的拍打式機翼機器人，將輸入（機翼運動學）映射到輸出（空氣動力）並非易事，並且反轉此映射以進行實時控制在計算上不切實際。在此，我們報告了一個基於我們開發的實驗系統數據的拍打式機翼系統反向映射的機器學習解決方案。我們的模型學習產生所需空氣動力結果所需的輸入機翼運動。我們使用了一個專門針對時間序列數據的序列到序列模型，並用一個在頻域中實現表示學習的新型自適應譜層對其進行了擴充。為了訓練我們的模型，我們開發了一個拍打式機翼系統，該系統同時使用高速相機測量機翼的空氣動力和其 3D 運動。我們在一個不同的流動狀態下拍打機翼的另一個開源數據集上展示了我們系統的性能。結果表明，與更複雜的基於Transformer的最先進模型相比，性能優異，在測試數據集中損失中值改進了 11%。此外，我們的模型顯示出優異的推理時間，使其適用於機載機器人控制。我們的開源數據和框架可以改進受複雜動態支配的系統的建模和實時控制，從仿生機器人到生物醫學設備。</paragraph>
+Medical time series are often irregular and face significant missingness,
+posing challenges for data analysis and clinical decision-making. Existing
+methods typically adopt a single modeling perspective, either treating series
+data as sequences or transforming them into image representations for further
+classification. In this paper, we propose a joint learning framework that
+incorporates both sequence and image representations. We also design three
+self-supervised learning strategies to facilitate the fusion of sequence and
+image representations, capturing a more generalizable joint representation. The
+results indicate that our approach outperforms seven other state-of-the-art
+models in three representative real-world clinical datasets. We further
+validate our approach by simulating two major types of real-world missingness
+through leave-sensors-out and leave-samples-out techniques. The results
+demonstrate that our approach is more robust and significantly surpasses other
+baselines in terms of classification performance.
 
-##### **Language Agents as Digital Representatives in Collective Decision-Making**
-2502.09369v1 by Daniel Jarrett, Miruna Pîslar, Michiel A. Bakker, Michael Henry Tessler, Raphael Köster, Jan Balaguer, Romuald Elie, Christopher Summerfield, Andrea Tacchetti
+摘要：醫療時間序列通常不規則且會面臨顯著的缺失，對資料分析和臨床決策制定構成挑戰。現有方法通常採用單一建模觀點，將序列資料視為序列或將其轉換為影像表示以進行進一步分類。在本文中，我們提出了一個聯合學習架構，結合序列和影像表示。我們還設計了三種自我監督學習策略，以促進序列和影像表示的融合，捕捉更具概括性的聯合表示。結果表明，我們的做法在三個具有代表性的真實世界臨床資料集中優於其他七個最先進的模型。我們進一步通過留出感測器和留出樣本的技術模擬兩種主要的真實世界缺失類型來驗證我們的做法。結果表明，我們的做法更強大，並且在分類效能方面顯著優於其他基準。
 
-Consider the process of collective decision-making, in which a group of
-individuals interactively select a preferred outcome from among a universe of
-alternatives. In this context, "representation" is the activity of making an
-individual's preferences present in the process via participation by a proxy
-agent -- i.e. their "representative". To this end, learned models of human
-behavior have the potential to fill this role, with practical implications for
-multi-agent scenario studies and mechanism design. In this work, we investigate
-the possibility of training \textit{language agents} to behave in the capacity
-of representatives of human agents, appropriately expressing the preferences of
-those individuals whom they stand for. First, we formalize the setting of
-\textit{collective decision-making} -- as the episodic process of interaction
-between a group of agents and a decision mechanism. On this basis, we then
-formalize the problem of \textit{digital representation} -- as the simulation
-of an agent's behavior to yield equivalent outcomes from the mechanism.
-Finally, we conduct an empirical case study in the setting of
-\textit{consensus-finding} among diverse humans, and demonstrate the
-feasibility of fine-tuning large language models to act as digital
-representatives.
+##### **Foundation Model of Electronic Medical Records for Adaptive Risk Estimation**
+2502.06124v1 by Pawel Renc, Michal K. Grzeszczyk, Nassim Oufattole, Deirdre Goode, Yugang Jia, Szymon Bieganski, Matthew B. A. McDermott, Jaroslaw Was, Anthony E. Samir, Jonathan W. Cunningham, David W. Bates, Arkadiusz Sitek
 
-摘要：考慮集體決策的過程，其中一群個人互動式地從一系列備選方案中選擇一個偏好的結果。在此脈絡中，「代表」是透過代理人（即他們的「代表」）參與，讓個人的偏好出現在這個過程中的活動。為此，人類行為的學習模型有可能填補這個角色，對多重代理人情境研究和機制設計具有實際意義。在這項工作中，我們探討訓練「語言代理人」的可能性，以代表人類代理人的身分行事，適當地表達他們所代表的那些個人的偏好。首先，我們將「集體決策」的設定形式化，作為一群代理人與決策機制之間互動的間歇性過程。在此基礎上，我們接著將「數位代表」的問題形式化，作為模擬代理人的行為，從機制中產生等效結果。最後，我們在多元人類的「共識尋求」設定中進行一個實證個案研究，並展示微調大型語言模型以作為數位代表的可行性。
+We developed the Enhanced Transformer for Health Outcome Simulation (ETHOS),
+an AI model that tokenizes patient health timelines (PHTs) from EHRs. ETHOS
+predicts future PHTs using transformer-based architectures. The Adaptive Risk
+Estimation System (ARES) employs ETHOS to compute dynamic and personalized risk
+probabilities for clinician-defined critical events. ARES incorporates a
+personalized explainability module that identifies key clinical factors
+influencing risk estimates for individual patients. ARES was evaluated on the
+MIMIC-IV v2.2 dataset in emergency department (ED) settings, benchmarking its
+performance against traditional early warning systems and machine learning
+models. We processed 299,721 unique patients from MIMIC-IV into 285,622 PHTs,
+with 60% including hospital admissions. The dataset contained over 357 million
+tokens. ETHOS outperformed benchmark models in predicting hospital admissions,
+ICU admissions, and prolonged hospital stays, achieving superior AUC scores.
+ETHOS-based risk estimates demonstrated robustness across demographic subgroups
+with strong model reliability, confirmed via calibration curves. The
+personalized explainability module provides insights into patient-specific
+factors contributing to risk. ARES, powered by ETHOS, advances predictive
+healthcare AI by providing dynamic, real-time, and personalized risk estimation
+with patient-specific explainability to enhance clinician trust. Its
+adaptability and superior accuracy position it as a transformative tool for
+clinical decision-making, potentially improving patient outcomes and resource
+allocation in emergency and inpatient settings. We release the full code at
+github.com/ipolharvard/ethos-ares to facilitate future research.
 
-##### **Neural Spatiotemporal Point Processes: Trends and Challenges**
-2502.09341v1 by Sumantrak Mukherjee, Mouad Elhamdi, George Mohler, David A. Selby, Yao Xie, Sebastian Vollmer, Gerrit Grossmann
+摘要：我們開發了增強型健康結果模擬轉換器 (ETHOS)，
+一種從電子健康紀錄 (EHR) 中將患者健康時間軸 (PHT) 標記化的 AI 模型。ETHOS
+使用基於轉換器的架構預測未來的 PHT。自適應風險評估系統 (ARES) 使用 ETHOS 計算由臨床醫生定義的危急事件的動態且個人化的風險機率。ARES 結合了個人化的可解釋性模組，可找出影響個別患者風險評估的主要臨床因素。ARES 在急診部門 (ED) 設定中針對 MIMIC-IV v2.2 資料集進行評估，並將其效能與傳統的預警系統和機器學習模型進行基準測試。我們將 299,721 位 MIMIC-IV 的獨特患者處理成 285,622 個 PHT，其中 60% 包含住院記錄。該資料集包含超過 3.57 億個標記。ETHOS 在預測住院、加護病房 (ICU) 住院和延長住院時間方面表現優於基準模型，並獲得了較高的 AUC 分數。基於 ETHOS 的風險評估顯示出跨人口統計子群的穩健性，並通過校準曲線確認了強大的模型可靠性。個人化的可解釋性模組提供了對導致風險的患者特定因素的見解。由 ETHOS 驅動的 ARES 透過提供動態、即時且個人化的風險評估，以及患者特定的可解釋性來增強臨床醫生的信任，從而推動了預測性醫療保健 AI 的發展。其適應性和卓越的準確性使其成為臨床決策制定的一種變革性工具，有可能改善緊急和住院環境中的患者結果和資源分配。我們在 github.com/ipolharvard/ethos-ares 上釋出完整程式碼，以利未來的研究。
 
-Spatiotemporal point processes (STPPs) are probabilistic models for events
-occurring in continuous space and time. Real-world event data often exhibit
-intricate dependencies and heterogeneous dynamics. By incorporating modern deep
-learning techniques, STPPs can model these complexities more effectively than
-traditional approaches. Consequently, the fusion of neural methods with STPPs
-has become an active and rapidly evolving research area. In this review, we
-categorize existing approaches, unify key design choices, and explain the
-challenges of working with this data modality. We further highlight emerging
-trends and diverse application domains. Finally, we identify open challenges
-and gaps in the literature.
+##### **Can ChatGPT Diagnose Alzheimer's Disease?**
+2502.06907v1 by Quoc-Toan Nguyen, Linh Le, Xuan-The Tran, Thomas Do, Chin-Teng Lin
 
-摘要：時空點過程 (STPP) 是事件在連續時空發生的機率模型。真實世界的事件資料通常會展現錯綜複雜的依賴關係和異質動態。透過結合現代深度學習技術，STPP 可以比傳統方法更有效地模擬這些複雜性。因此，神經方法與 STPP 的融合已成為一個活躍且快速發展的研究領域。在本篇評論中，我們對現有方法進行分類、統一關鍵設計選擇，並說明處理這種資料模式的挑戰。我們進一步強調新興趨勢和多樣化的應用領域。最後，我們找出文獻中的開放性挑戰和空白。
+Can ChatGPT diagnose Alzheimer's Disease (AD)? AD is a devastating
+neurodegenerative condition that affects approximately 1 in 9 individuals aged
+65 and older, profoundly impairing memory and cognitive function. This paper
+utilises 9300 electronic health records (EHRs) with data from Magnetic
+Resonance Imaging (MRI) and cognitive tests to address an intriguing question:
+As a general-purpose task solver, can ChatGPT accurately detect AD using EHRs?
+We present an in-depth evaluation of ChatGPT using a black-box approach with
+zero-shot and multi-shot methods. This study unlocks ChatGPT's capability to
+analyse MRI and cognitive test results, as well as its potential as a
+diagnostic tool for AD. By automating aspects of the diagnostic process, this
+research opens a transformative approach for the healthcare system,
+particularly in addressing disparities in resource-limited regions where AD
+specialists are scarce. Hence, it offers a foundation for a promising method
+for early detection, supporting individuals with timely interventions, which is
+paramount for Quality of Life (QoL).
 
-##### **Graph Diffusion Network for Drug-Gene Prediction**
-2502.09335v1 by Jiayang Wu, Wensheng Gan, Philip S. Yu
+摘要：ChatGPT 能否診斷出阿茲海默症 (AD)？AD 是一種毀滅性的神經退化性疾病，影響約 1/9 的 65 歲及以上人士，嚴重損害記憶力和認知功能。這篇論文利用了 9300 份電子健康紀錄 (EHR)，其中包含磁共振成像 (MRI) 和認知測試的數據，來解決一個有趣的問題：作為一個通用任務解決器，ChatGPT 能否使用 EHR 準確地檢測出 AD？我們使用黑盒方法對 ChatGPT 進行了深入評估，採用零次嘗試和多次嘗試的方法。這項研究揭示了 ChatGPT 分析 MRI 和認知測試結果的能力，以及其作為 AD 診斷工具的潛力。通過自動化診斷過程的各個方面，這項研究為醫療保健系統開啟了一種變革性的方法，特別是在解決資源有限的地區中 AD 專家稀缺的不平等問題方面。因此，它為一種有希望的早期檢測方法奠定了基礎，通過及時干預來支持個人，這對於生活品質 (QoL) 至關重要。
 
-Predicting drug-gene associations is crucial for drug development and disease
-treatment. While graph neural networks (GNN) have shown effectiveness in this
-task, they face challenges with data sparsity and efficient contrastive
-learning implementation. We introduce a graph diffusion network for drug-gene
-prediction (GDNDGP), a framework that addresses these limitations through two
-key innovations. First, it employs meta-path-based homogeneous graph learning
-to capture drug-drug and gene-gene relationships, ensuring similar entities
-share embedding spaces. Second, it incorporates a parallel diffusion network
-that generates hard negative samples during training, eliminating the need for
-exhaustive negative sample retrieval. Our model achieves superior performance
-on the DGIdb 4.0 dataset and demonstrates strong generalization capability on
-tripartite drug-gene-disease networks. Results show significant improvements
-over existing methods in drug-gene prediction tasks, particularly in handling
-complex heterogeneous relationships. The source code is publicly available at
-https://github.com/csjywu1/GDNDGP.
+##### **Protecting Intellectual Property of EEG-based Neural Networks with Watermarking**
+2502.05931v1 by Ahmed Abdelaziz, Ahmed Fathi, Ahmed Fares
 
-摘要：預測藥物基因關聯對藥物開發和疾病治療至關重要。雖然圖神經網路 (GNN) 已顯示在這個任務中的有效性，但它們在資料稀疏性和高效對比學習實作方面面臨挑戰。我們引入了一個用於藥物基因預測的圖擴散網路 (GDNDGP)，這是一個透過兩項關鍵創新來解決這些限制的框架。首先，它採用基於元路徑的同質圖學習來捕捉藥物-藥物和基因-基因關係，確保類似實體共享嵌入空間。其次，它整合了一個並行擴散網路，在訓練期間產生困難的負面樣本，消除了對詳盡負面樣本擷取的需求。我們的模型在 DGIdb 4.0 資料集上取得了卓越的效能，並在三方藥物-基因-疾病網路中展現強大的概化能力。結果顯示在藥物基因預測任務中，相較於現有方法有顯著的進步，特別是在處理複雜的異質關係方面。原始碼已公開於 https://github.com/csjywu1/GDNDGP。
+EEG-based neural networks, pivotal in medical diagnosis and brain-computer
+interfaces, face significant intellectual property (IP) risks due to their
+reliance on sensitive neurophysiological data and resource-intensive
+development. Current watermarking methods, particularly those using abstract
+trigger sets, lack robust authentication and fail to address the unique
+challenges of EEG models. This paper introduces a cryptographic wonder
+filter-based watermarking framework tailored for EEG-based neural networks.
+Leveraging collision-resistant hashing and public-key encryption, the wonder
+filter embeds the watermark during training, ensuring minimal distortion ($\leq
+5\%$ drop in EEG task accuracy) and high reliability (100\% watermark
+detection). The framework is rigorously evaluated against adversarial attacks,
+including fine-tuning, transfer learning, and neuron pruning. Results
+demonstrate persistent watermark retention, with classification accuracy for
+watermarked states remaining above 90\% even after aggressive pruning, while
+primary task performance degrades faster, deterring removal attempts. Piracy
+resistance is validated by the inability to embed secondary watermarks without
+severe accuracy loss ( $>10\%$ in EEGNet and CCNN models). Cryptographic
+hashing ensures authentication, reducing brute-force attack success
+probabilities. Evaluated on the DEAP dataset across models (CCNN, EEGNet,
+TSception), the method achieves $>99.4\%$ null-embedding accuracy, effectively
+eliminating false positives. By integrating wonder filters with EEG-specific
+adaptations, this work bridges a critical gap in IP protection for
+neurophysiological models, offering a secure, tamper-proof solution for
+healthcare and biometric applications. The framework's robustness against
+adversarial modifications underscores its potential to safeguard sensitive EEG
+models while maintaining diagnostic utility.
 
-##### **Beyond English: The Impact of Prompt Translation Strategies across Languages and Tasks in Multilingual LLMs**
-2502.09331v1 by Itai Mondshine, Tzuf Paz-Argaman, Reut Tsarfaty
+摘要：<paragraph>基於 EEG 的神經網路在醫學診斷和腦電腦介面中至關重要，由於其依賴敏感的神經生理資料和資源密集型的開發，面臨重大的智慧財產權 (IP) 風險。目前的浮水印方法，特別是那些使用抽象觸發集的方法，缺乏強健的驗證，且無法解決 EEG 模型的獨特挑戰。本文介紹了一個專為基於 EEG 的神經網路量身打造的密碼學 wonder 濾波器浮水印架構。利用抗碰撞雜湊和公開金鑰加密，wonder 濾波器在訓練期間嵌入浮水印，確保最小的失真（EEG 任務準確度下降 $\leq 5\%$）和高可靠性（100% 浮水印檢測）。該架構針對對抗性攻擊進行了嚴格的評估，包括微調、遷移學習和神經元剪枝。結果證明了持續的浮水印保留，即使在激進的剪枝後，浮水印狀態的分類準確度仍保持在 90% 以上，而主要任務的性能下降得更快，阻止了移除嘗試。盜版抵抗力通過無法嵌入次要浮水印而得到驗證，而不會造成嚴重的準確度損失（在 EEGNet 和 CCNN 模型中 $>10\%$）。密碼學雜湊確保驗證，降低了暴力攻擊成功機率。在 DEAP 資料集上針對模型（CCNN、EEGNet、TSception）進行評估，該方法達到了 $>99.4\%$ 的空嵌入準確度，有效地消除了假陽性。透過將 wonder 濾波器與 EEG 特定的適應相整合，這項工作彌補了神經生理模型 IP 保護中的關鍵差距，為醫療保健和生物特徵應用提供了一個安全、防篡改的解決方案。該架構對抗敵對修改的強健性突顯了其在維護診斷效用的同時保護敏感 EEG 模型的潛力。</paragraph>
 
-Despite advances in the multilingual capabilities of Large Language Models
-(LLMs) across diverse tasks, English remains the dominant language for LLM
-research and development. So, when working with a different language, this has
-led to the widespread practice of pre-translation, i.e., translating the task
-prompt into English before inference. Selective pre-translation, a more
-surgical approach, focuses on translating specific prompt components. However,
-its current use is sporagic and lacks a systematic research foundation.
-Consequently, the optimal pre-translation strategy for various multilingual
-settings and tasks remains unclear. In this work, we aim to uncover the optimal
-setup for pre-translation by systematically assessing its use. Specifically, we
-view the prompt as a modular entity, composed of four functional parts:
-instruction, context, examples, and output, either of which could be translated
-or not. We evaluate pre-translation strategies across 35 languages covering
-both low and high-resource languages, on various tasks including Question
-Answering (QA), Natural Language Inference (NLI), Named Entity Recognition
-(NER), and Abstractive Summarization. Our experiments show the impact of
-factors as similarity to English, translation quality and the size of
-pre-trained data, on the model performance with pre-translation. We suggest
-practical guidelines for choosing optimal strategies in various multilingual
-settings.
+##### **Enhancing Depression Detection with Chain-of-Thought Prompting: From Emotion to Reasoning Using Large Language Models**
+2502.05879v1 by Shiyu Teng, Jiaqing Liu, Rahul Kumar Jain, Shurong Chai, Ruibo Hou, Tomoko Tateyama, Lanfen Lin, Yen-wei Chen
 
-摘要：儘管大型語言模型 (LLM) 在各種任務中的多語言能力有進步，英語仍然是 LLM 研究和開發的主導語言。因此，在使用不同語言時，這導致了預翻譯的廣泛實務，即在推理之前將任務提示翻譯成英語。選擇性預翻譯是一種更精準的方法，專注於翻譯特定提示組成部分。然而，目前的使用是零星的，缺乏系統性的研究基礎。因此，各種多語言設定和任務的最佳預翻譯策略仍不清楚。在這項工作中，我們旨在透過系統性評估預翻譯的使用，找出其最佳設定。具體來說，我們將提示視為一個模組化實體，由四個功能部分組成：說明、背景、範例和輸出，其中任何一個都可以翻譯或不翻譯。我們在 35 種語言中評估預翻譯策略，涵蓋低資源語言和高資源語言，以及各種任務，包括問答 (QA)、自然語言推理 (NLI)、命名實體識別 (NER) 和抽象摘要。我們的實驗顯示了與英語的相似性、翻譯品質和預訓練資料大小等因素對預翻譯模型效能的影響。我們建議在各種多語言設定中選擇最佳策略的實用指南。
+Depression is one of the leading causes of disability worldwide, posing a
+severe burden on individuals, healthcare systems, and society at large. Recent
+advancements in Large Language Models (LLMs) have shown promise in addressing
+mental health challenges, including the detection of depression through
+text-based analysis. However, current LLM-based methods often struggle with
+nuanced symptom identification and lack a transparent, step-by-step reasoning
+process, making it difficult to accurately classify and explain mental health
+conditions. To address these challenges, we propose a Chain-of-Thought
+Prompting approach that enhances both the performance and interpretability of
+LLM-based depression detection. Our method breaks down the detection process
+into four stages: (1) sentiment analysis, (2) binary depression classification,
+(3) identification of underlying causes, and (4) assessment of severity. By
+guiding the model through these structured reasoning steps, we improve
+interpretability and reduce the risk of overlooking subtle clinical indicators.
+We validate our method on the E-DAIC dataset, where we test multiple
+state-of-the-art large language models. Experimental results indicate that our
+Chain-of-Thought Prompting technique yields superior performance in both
+classification accuracy and the granularity of diagnostic insights, compared to
+baseline approaches.
 
-##### **A Judge-free LLM Open-ended Generation Benchmark Based on the Distributional Hypothesis**
-2502.09316v1 by Kentaro Imajo, Masanori Hirano, Shuji Suzuki, Hiroaki Mikami
+摘要：憂鬱症是全球殘障的主要原因之一，對個人、醫療保健系統和整個社會造成嚴重負擔。大型語言模型 (LLM) 的最新進展已展現出解決心理健康挑戰的希望，包括透過基於文字的分析來偵測憂鬱症。然而，現有的基於 LLM 的方法通常難以辨識細微的症狀，而且缺乏透明且逐步的推理過程，這使得準確分類和解釋心理健康狀況變得困難。為了應對這些挑戰，我們提出了一種思考鏈提示方法，它增強了基於 LLM 的憂鬱症偵測的效能和可解釋性。我們的這項方法將偵測過程分解為四個階段：(1) 情緒分析，(2) 二元憂鬱症分類，(3) 找出潛在原因，以及 (4) 評估嚴重程度。透過引導模型完成這些結構化的推理步驟，我們提升了可解釋性，並降低了忽略細微臨床指標的風險。我們在 E-DAIC 資料集上驗證了我們的這項方法，並在其中測試了多種最先進的大型語言模型。實驗結果顯示，與基線方法相比，我們的思考鏈提示技術在分類準確度和診斷見解的精細度方面都表現出優異的效能。
 
-Evaluating the open-ended text generation of large language models (LLMs) is
-challenging because of the lack of a clear ground truth and the high cost of
-human or LLM-based assessments. We propose a novel benchmark that evaluates
-LLMs using n-gram statistics and rules, without relying on human judgement or
-LLM-as-a-judge approaches. Using 50 question and reference answer sets, we
-introduce three new metrics based on n-grams and rules: Fluency, Truthfulness,
-and Helpfulness. Our benchmark strongly correlates with GPT-4o-based
-evaluations while requiring significantly fewer computational resources,
-demonstrating its effectiveness as a scalable alternative for assessing LLMs'
-open-ended generation capabilities.
+##### **LLMs for Drug-Drug Interaction Prediction: A Comprehensive Comparison**
+2502.06890v1 by Gabriele De Vito, Filomena Ferrucci, Athanasios Angelakis
 
-摘要：評估大型語言模型 (LLM) 的開放式文字生成具有挑戰性，因為缺乏明確的基礎真實性，以及人工或基於 LLM 的評估成本高昂。我們提出一個新基準，使用 n-gram 統計和規則來評估 LLM，而不依賴於人工判斷或 LLM 作為評審的方法。使用 50 個問題和參考答案集，我們基於 n-gram 和規則引入了三項新指標：流暢度、真實性和有幫助性。我們的基準與基於 GPT-4o 的評估密切相關，同時需要明顯更少的計算資源，證明了其作為評估 LLM 的開放式生成能力的可擴充替代方案的有效性。
+The increasing volume of drug combinations in modern therapeutic regimens
+needs reliable methods for predicting drug-drug interactions (DDIs). While
+Large Language Models (LLMs) have revolutionized various domains, their
+potential in pharmaceutical research, particularly in DDI prediction, remains
+largely unexplored. This study thoroughly investigates LLMs' capabilities in
+predicting DDIs by uniquely processing molecular structures (SMILES), target
+organisms, and gene interaction data as raw text input from the latest DrugBank
+dataset. We evaluated 18 different LLMs, including proprietary models (GPT-4,
+Claude, Gemini) and open-source variants (from 1.5B to 72B parameters), first
+assessing their zero-shot capabilities in DDI prediction. We then fine-tuned
+selected models (GPT-4, Phi-3.5 2.7B, Qwen-2.5 3B, Gemma-2 9B, and Deepseek R1
+distilled Qwen 1.5B) to optimize their performance. Our comprehensive
+evaluation framework included validation across 13 external DDI datasets,
+comparing against traditional approaches such as l2-regularized logistic
+regression. Fine-tuned LLMs demonstrated superior performance, with Phi-3.5
+2.7B achieving a sensitivity of 0.978 in DDI prediction, with an accuracy of
+0.919 on balanced datasets (50% positive, 50% negative cases). This result
+represents an improvement over both zero-shot predictions and state-of-the-art
+machine-learning methods used for DDI prediction. Our analysis reveals that
+LLMs can effectively capture complex molecular interaction patterns and cases
+where drug pairs target common genes, making them valuable tools for practical
+applications in pharmaceutical research and clinical settings.
 
-##### **When the LM misunderstood the human chuckled: Analyzing garden path effects in humans and language models**
-2502.09307v1 by Samuel Joseph Amouyal, Aya Meltzer-Asscher, Jonathan Berant
+摘要：<paragraph>現代治療方案中藥物組合的數量越來越多，需要可靠的方法來預測藥物間交互作用 (DDI)。儘管大型語言模型 (LLM) 已在各個領域掀起革命，它們在藥物研究中的潛力，特別是在 DDI 預測中的潛力，仍未得到充分探索。本研究通過獨特地處理分子結構 (SMILES)、目標生物和基因交互資料作為來自最新 DrugBank 資料集的原始文字輸入，徹底調查了 LLM 在預測 DDI 中的能力。我們評估了 18 種不同的 LLM，包括專有模型（GPT-4、Claude、Gemini）和開源變體（從 1.5B 到 72B 參數），首先評估它們在 DDI 預測中的零次學習能力。然後，我們微調選定的模型（GPT-4、Phi-3.5 2.7B、Qwen-2.5 3B、Gemma-2 9B 和 Deepseek R1 蒸餾 Qwen 1.5B）以最佳化其效能。我們的全面評估框架包括跨 13 個外部 DDI 資料集進行驗證，並與傳統方法（例如 l2 正則化邏輯迴歸）進行比較。微調後的 LLM 表現出優異的效能，其中 Phi-3.5 2.7B 在 DDI 預測中達到 0.978 的靈敏度，在平衡資料集（50% 正例，50% 反例）上的準確度為 0.919。此結果優於零次學習預測和用於 DDI 預測的最新機器學習方法。我們的分析表明，LLM 可以有效捕捉複雜的分子交互模式和藥物對靶向共同基因的情況，使其成為藥物研究和臨床環境中實用應用的寶貴工具。</paragraph>
 
-Modern Large Language Models (LLMs) have shown human-like abilities in many
-language tasks, sparking interest in comparing LLMs' and humans' language
-processing. In this paper, we conduct a detailed comparison of the two on a
-sentence comprehension task using garden-path constructions, which are
-notoriously challenging for humans. Based on psycholinguistic research, we
-formulate hypotheses on why garden-path sentences are hard, and test these
-hypotheses on human participants and a large suite of LLMs using comprehension
-questions. Our findings reveal that both LLMs and humans struggle with specific
-syntactic complexities, with some models showing high correlation with human
-comprehension. To complement our findings, we test LLM comprehension of
-garden-path constructions with paraphrasing and text-to-image generation tasks,
-and find that the results mirror the sentence comprehension question results,
-further validating our findings on LLM understanding of these constructions.
+##### **Decoding Complexity: Intelligent Pattern Exploration with CHPDA (Context Aware Hybrid Pattern Detection Algorithm)**
+2502.07815v1 by Lokesh Koli, Shubham Kalra, Karanpreet Singh
 
-摘要：現代大型語言模型（LLM）在許多語言任務中展現出類似人類的能力，引發了比較 LLM 與人類語言處理的興趣。在本文中，我們使用對人類來說極具挑戰的花園路徑結構，對這兩者進行了詳細比較，以進行句子理解任務。根據心理語言學研究，我們制定了關於為什麼花園路徑句子困難的假設，並使用理解問題對人類參與者和大量 LLM 測試這些假設。我們的研究結果表明，LLM 和人類都難以應付特定的句法複雜性，其中一些模型與人類理解力高度相關。為了補充我們的研究結果，我們測試了 LLM 對花園路徑結構的理解，並進行了改寫和文字轉換為圖像的生成任務，並發現結果反映了句子理解問題的結果，進一步驗證了我們對 LLM 理解這些結構的研究結果。
+Detecting sensitive data such as Personally Identifiable Information (PII)
+and Protected Health Information (PHI) is critical for data security platforms.
+This study evaluates regex-based pattern matching algorithms and exact-match
+search techniques to optimize detection speed, accuracy, and scalability. Our
+benchmarking results indicate that Google RE2 provides the best balance of
+speed (10-15 ms/MB), memory efficiency (8-16 MB), and accuracy (99.5%) among
+regex engines, outperforming PCRE while maintaining broader hardware
+compatibility than Hyperscan. For exact matching, Aho-Corasick demonstrated
+superior performance (8 ms/MB) and scalability for large datasets. Performance
+analysis revealed that regex processing time scales linearly with dataset size
+and pattern complexity. A hybrid AI + Regex approach achieved the highest F1
+score (91. 6%) by improving recall and minimizing false positives. Device
+benchmarking confirmed that our solution maintains efficient CPU and memory
+usage on both high-performance and mid-range systems. Despite its
+effectiveness, challenges remain, such as limited multilingual support and the
+need for regular pattern updates. Future work should focus on expanding
+language coverage, integrating data security and privacy management (DSPM) with
+data loss prevention (DLP) tools, and enhancing regulatory compliance for
+broader global adoption.
 
-##### **Indeterminacy in Affective Computing: Considering Meaning and Context in Data Collection Practices**
-2502.09294v1 by Bernd Dudzik, Tiffany Matej Hrkalovic, Chenxu Hao, Chirag Raman, Masha Tsfasman
+摘要：偵測個人身分資訊 (PII) 和受保護健康資訊 (PHI) 等敏感資料，對於資料安全平台至關重要。本研究評估基於 regex 的模式配對演算法和精確配對搜尋技術，以最佳化偵測速度、準確度和可擴充性。我們的基準測試結果顯示，在 regex 引擎中，Google RE2 在速度 (10-15 ms/MB)、記憶體效率 (8-16 MB) 和準確度 (99.5%) 方面取得最佳平衡，優於 PCRE，同時比 Hyperscan 擁有更廣泛的硬體相容性。對於精確配對，Aho-Corasick 展現出優異的效能 (8 ms/MB) 和大資料集的可擴充性。效能分析顯示，regex 處理時間會隨著資料集大小和模式複雜度線性擴充。混合 AI + Regex 方法透過提升召回率和將假陽性降至最低，達到了最高的 F1 分數 (91. 6%)。裝置基準測試確認我們的解決方案在高性能和中階系統上都能維持高效的 CPU 和記憶體使用率。儘管有效，但仍有挑戰存在，例如多語言支援有限，以及需要定期更新模式。未來的研究應著重於擴展語言涵蓋範圍，將資料安全和隱私管理 (DSPM) 與資料遺失防護 (DLP) 工具整合，以及加強法規遵循以利更廣泛的全球採用。
 
-Automatic Affect Prediction (AAP) uses computational analysis of input data
-such as text, speech, images, and physiological signals to predict various
-affective phenomena (e.g., emotions or moods). These models are typically
-constructed using supervised machine-learning algorithms, which rely heavily on
-labeled training datasets. In this position paper, we posit that all AAP
-training data are derived from human Affective Interpretation Processes,
-resulting in a form of Affective Meaning. Research on human affect indicates a
-form of complexity that is fundamental to such meaning: it can possess what we
-refer to here broadly as Qualities of Indeterminacy (QIs) - encompassing
-Subjectivity (meaning depends on who is interpreting), Uncertainty (lack of
-confidence regarding meanings' correctness), Ambiguity (meaning contains
-mutually exclusive concepts) and Vagueness (meaning is situated at different
-levels in a nested hierarchy). Failing to appropriately consider QIs leads to
-results incapable of meaningful and reliable predictions. Based on this
-premise, we argue that a crucial step in adequately addressing indeterminacy in
-AAP is the development of data collection practices for modeling corpora that
-involve the systematic consideration of 1) a relevant set of QIs and 2) context
-for the associated interpretation processes. To this end, we are 1) outlining a
-conceptual model of AIPs and the QIs associated with the meaning these produce
-and a conceptual structure of relevant context, supporting understanding of its
-role. Finally, we use our framework for 2) discussing examples of
-context-sensitivity-related challenges for addressing QIs in data collection
-setups. We believe our efforts can stimulate a structured discussion of both
-the role of aspects of indeterminacy and context in research on AAP, informing
-the development of better practices for data collection and analysis.
+##### **WatchGuardian: Enabling User-Defined Personalized Just-in-Time Intervention on Smartwatch**
+2502.05783v1 by Ying Lei, Yancheng Cao, Will Wang, Yuanzhe Dong, Changchang Yin, Weidan Cao, Ping Zhang, Jingzhen Yang, Bingsheng Yao, Yifan Peng, Chunhua Weng, Randy Auerbach, Lena Mamykina, Dakuo Wang, Yuntao Wang, Xuhai Xu
 
-摘要：自動影響預測 (AAP) 使用輸入資料的運算分析，例如文字、語音、影像和生理訊號，來預測各種情感現象（例如情緒或心情）。這些模型通常使用監督式機器學習演算法建構，而這些演算法高度依賴標籤訓練資料集。在此立場文件中，我們主張所有 AAP 訓練資料都是從人類的情感詮釋過程中衍生而來的，進而形成一種情感意義。對人類情感的研究指出，這種複雜性是此種意義的基本要素：它可能具備我們在此廣泛稱之為不確定性品質 (QI)，包括主觀性（意義取決於詮釋者）、不確定性（對於意義正確性的信心不足）、歧義性（意義包含相互排斥的概念）和模糊性（意義位於嵌套層級的不同層級）。未能適當地考量 QI 會導致無法進行有意義且可靠預測的結果。基於此前提，我們主張，在 AAP 中適當地處理不確定性的關鍵步驟，是針對建模語料庫制定資料收集實務，其中涉及系統性地考量 1) 一組相關的 QI，以及 2) 相關詮釋過程的脈絡。為此，我們 1) 概述了 AIP 的概念模型，以及與這些 AIP 所產生的意義相關的 QI，以及相關脈絡的概念結構，支持對其角色的理解。最後，我們使用我們的架構 2) 討論了在資料收集設定中處理 QI 時，與脈絡敏感性相關的挑戰範例。我們相信我們的努力可以激勵對不確定性和脈絡面向在 AAP 研究中扮演的角色進行結構化的討論，為資料收集和分析的最佳實務發展提供資訊。
+While just-in-time interventions (JITIs) have effectively targeted common
+health behaviors, individuals often have unique needs to intervene in personal
+undesirable actions that can negatively affect physical, mental, and social
+well-being. We present WatchGuardian, a smartwatch-based JITI system that
+empowers users to define custom interventions for these personal actions with a
+small number of samples. For the model to detect new actions based on limited
+new data samples, we developed a few-shot learning pipeline that finetuned a
+pre-trained inertial measurement unit (IMU) model on public hand-gesture
+datasets. We then designed a data augmentation and synthesis process to train
+additional classification layers for customization. Our offline evaluation with
+26 participants showed that with three, five, and ten examples, our approach
+achieved an average accuracy of 76.8%, 84.7%, and 87.7%, and an F1 score of
+74.8%, 84.2%, and 87.2% We then conducted a four-hour intervention study to
+compare WatchGuardian against a rule-based intervention. Our results
+demonstrated that our system led to a significant reduction by 64.0 +- 22.6% in
+undesirable actions, substantially outperforming the baseline by 29.0%. Our
+findings underscore the effectiveness of a customizable, AI-driven JITI system
+for individuals in need of behavioral intervention in personal undesirable
+actions. We envision that our work can inspire broader applications of
+user-defined personalized intervention with advanced AI solutions.
 
-##### **SparQLe: Speech Queries to Text Translation Through LLMs**
-2502.09284v1 by Amirbek Djanibekov, Hanan Aldarmaki
+摘要：<paragraph>雖然即時介入（JITIs）有效地針對常見的健康行為，但個人通常有獨特的需求來介入可能會對身心和社會福祉產生負面影響的個人不良行為。我們提出 WatchGuardian，這是一個基於智慧手錶的 JITI 系統，它使用少數樣本讓使用者能夠為這些個人行為定義自訂介入措施。為了讓模型根據有限的新資料樣本偵測新行為，我們開發了一個小樣本學習管道，微調了公共手勢資料集上的預訓練慣性測量單元（IMU）模型。然後，我們設計了一個資料擴充和合成流程，以訓練其他分類層以進行自訂。我們對 26 位參與者進行的離線評估顯示，我們的做法使用三個、五個和十個範例，達到了 76.8%、84.7% 和 87.7% 的平均準確度，以及 74.8%、84.2% 和 87.2% 的 F1 分數。然後，我們進行了一項為時四小時的介入研究，以將 WatchGuardian 與基於規則的介入進行比較。我們的結果表明，我們的系統導致不良行為顯著減少了 64.0 +- 22.6%，大幅優於基線 29.0%。我們的研究結果強調了可自訂、AI 驅動的 JITI 系統對需要行為介入以應對個人不良行為的個人的有效性。我們預計我們的研究可以激勵使用者定義個人化介入的更廣泛應用，並採用先進的 AI 解決方案。</paragraph>
 
-With the growing influence of Large Language Models (LLMs), there is
-increasing interest in integrating speech representations with them to enable
-more seamless multi-modal processing and speech understanding. This study
-introduces a novel approach that leverages self-supervised speech
-representations in combination with instruction-tuned LLMs for speech-to-text
-translation. The proposed approach leverages a modality adapter to align
-extracted speech features with instruction-tuned LLMs using English-language
-data. Our experiments demonstrate that this method effectively preserves the
-semantic content of the input speech and serves as an effective bridge between
-self-supervised speech models and instruction-tuned LLMs, offering a promising
-solution for various speech understanding applications.
+##### **RECOVER: Designing a Large Language Model-based Remote Patient Monitoring System for Postoperative Gastrointestinal Cancer Care**
+2502.05740v1 by Ziqi Yang, Yuxuan Lu, Jennifer Bagdasarian, Vedant Das Swain, Ritu Agarwal, Collin Campbell, Waddah Al-Refaire, Jehan El-Bayoumi, Guodong Gao, Dakuo Wang, Bingsheng Yao, Nawar Shara
 
-摘要：隨著大型語言模型（LLM）影響力逐漸擴大，將語音表徵與其整合，以實現更順暢的多模態處理和語音理解，已引起越來越多的興趣。本研究提出了一種新穎的方法，該方法利用自監督語音表徵，結合指令調整的 LLM，進行語音轉文字翻譯。所提出的方法利用模態適配器，使用英語語言資料，將提取的語音特徵與指令調整的 LLM 對齊。我們的實驗證明，此方法有效地保留了輸入語音的語義內容，並作為自監督語音模型和指令調整的 LLM 之間的有效橋樑，為各種語音理解應用程式提供了一個有前景的解決方案。
+Cancer surgery is a key treatment for gastrointestinal (GI) cancers, a group
+of cancers that account for more than 35% of cancer-related deaths worldwide,
+but postoperative complications are unpredictable and can be life-threatening.
+In this paper, we investigate how recent advancements in large language models
+(LLMs) can benefit remote patient monitoring (RPM) systems through clinical
+integration by designing RECOVER, an LLM-powered RPM system for postoperative
+GI cancer care. To closely engage stakeholders in the design process, we first
+conducted seven participatory design sessions with five clinical staff and
+interviewed five cancer patients to derive six major design strategies for
+integrating clinical guidelines and information needs into LLM-based RPM
+systems. We then designed and implemented RECOVER, which features an
+LLM-powered conversational agent for cancer patients and an interactive
+dashboard for clinical staff to enable efficient postoperative RPM. Finally, we
+used RECOVER as a pilot system to assess the implementation of our design
+strategies with four clinical staff and five patients, providing design
+implications by identifying crucial design elements, offering insights on
+responsible AI, and outlining opportunities for future LLM-powered RPM systems.
 
-##### **LiSA: Leveraging Link Recommender to Attack Graph Neural Networks via Subgraph Injection**
-2502.09271v1 by Wenlun Zhang, Enyan Dai, Kentaro Yoshioka
+摘要：癌症手術是胃腸道 (GI) 癌症的主要治療方式，這類癌症佔全球癌症相關死亡人數的 35% 以上，但術後併發症無法預測，且可能危及生命。在本文中，我們探討大型語言模型 (LLM) 的近期進展如何透過臨床整合造福遠端病患監控 (RPM) 系統，方法是設計 RECOVER，一個由 LLM 驅動的 RPM 系統，用於術後胃腸道癌症照護。為了讓利害關係人密切參與設計流程，我們首先與五位臨床人員進行七場參與式設計會議，並訪談五位癌症患者，以找出六項整合臨床指南和資訊需求至基於 LLM 的 RPM 系統的主要設計策略。接著，我們設計並實作 RECOVER，其特色在於一個由 LLM 驅動的對話式代理人，供癌症患者使用，以及一個互動式儀表板，供臨床人員使用，以進行有效的術後 RPM。最後，我們使用 RECOVER 作為試點系統，與四位臨床人員和五位患者評估我們設計策略的實作，並透過找出重要的設計元素、提供對負責任 AI 的見解，以及概述未來由 LLM 驅動的 RPM 系統的機會，提出設計意涵。
 
-Graph Neural Networks (GNNs) have demonstrated remarkable proficiency in
-modeling data with graph structures, yet recent research reveals their
-susceptibility to adversarial attacks. Traditional attack methodologies, which
-rely on manipulating the original graph or adding links to artificially created
-nodes, often prove impractical in real-world settings. This paper introduces a
-novel adversarial scenario involving the injection of an isolated subgraph to
-deceive both the link recommender and the node classifier within a GNN system.
-Specifically, the link recommender is mislead to propose links between targeted
-victim nodes and the subgraph, encouraging users to unintentionally establish
-connections and that would degrade the node classification accuracy, thereby
-facilitating a successful attack. To address this, we present the LiSA
-framework, which employs a dual surrogate model and bi-level optimization to
-simultaneously meet two adversarial objectives. Extensive experiments on
-real-world datasets demonstrate the effectiveness of our method.
+##### **4D VQ-GAN: Synthesising Medical Scans at Any Time Point for Personalised Disease Progression Modelling of Idiopathic Pulmonary Fibrosis**
+2502.05713v1 by An Zhao, Moucheng Xu, Ahmed H. Shahin, Wim Wuyts, Mark G. Jones, Joseph Jacob, Daniel C. Alexander
 
-摘要：圖形神經網路 (GNN) 已展現出在對具有圖形結構的資料進行建模方面的卓越能力，但最近的研究揭露了它們容易受到對抗性攻擊的影響。傳統的攻擊方法依賴於操縱原始圖形或將連結新增至人工建立的節點，在真實世界設定中通常被證明不切實際。本文介紹了一種新穎的對抗性場景，涉及注入一個孤立的子圖形，以欺騙 GNN 系統中的連結推薦器和節點分類器。具體來說，連結推薦器被誤導為在目標受害節點和子圖形之間提出連結，鼓勵使用者無意間建立連結，這將降低節點分類準確度，從而促成攻擊成功。為了解決這個問題，我們提出了 LiSA 框架，它採用雙重代理模型和雙層最佳化，以同時滿足兩個對抗性目標。對真實世界資料集進行的廣泛實驗證明了我們方法的有效性。
+Understanding the progression trajectories of diseases is crucial for early
+diagnosis and effective treatment planning. This is especially vital for
+life-threatening conditions such as Idiopathic Pulmonary Fibrosis (IPF), a
+chronic, progressive lung disease with a prognosis comparable to many cancers.
+Computed tomography (CT) imaging has been established as a reliable diagnostic
+tool for IPF. Accurately predicting future CT scans of early-stage IPF patients
+can aid in developing better treatment strategies, thereby improving survival
+outcomes. In this paper, we propose 4D Vector Quantised Generative Adversarial
+Networks (4D-VQ-GAN), a model capable of generating realistic CT volumes of IPF
+patients at any time point. The model is trained using a two-stage approach. In
+the first stage, a 3D-VQ-GAN is trained to reconstruct CT volumes. In the
+second stage, a Neural Ordinary Differential Equation (ODE) based temporal
+model is trained to capture the temporal dynamics of the quantised embeddings
+generated by the encoder in the first stage. We evaluate different
+configurations of our model for generating longitudinal CT scans and compare
+the results against ground truth data, both quantitatively and qualitatively.
+For validation, we conduct survival analysis using imaging biomarkers derived
+from generated CT scans and achieve a C-index comparable to that of biomarkers
+derived from the real CT scans. The survival analysis results demonstrate the
+potential clinical utility inherent to generated longitudinal CT scans, showing
+that they can reliably predict survival outcomes.
 
-##### **AnomalyGFM: Graph Foundation Model for Zero/Few-shot Anomaly Detection**
-2502.09254v1 by Hezhe Qiao, Chaoxi Niu, Ling Chen, Guansong Pang
+摘要：了解疾病的進程軌跡對於早期診斷和有效的治療計畫至關重要。這對於特發性肺纖維化 (IPF) 等威脅生命的疾病尤其重要，IPF 是一種慢性、進行性肺部疾病，其預後與許多癌症相當。電腦斷層掃描 (CT) 影像已被確立為 IPF 的可靠診斷工具。準確預測早期 IPF 患者的未來 CT 掃描有助於制定更好的治療策略，從而改善存活結果。在本文中，我們提出 4D 向量量化生成對抗網路 (4D-VQ-GAN)，這是一個模型，能夠在任何時間點生成 IPF 患者的逼真 CT 體積。該模型使用兩階段方法進行訓練。在第一階段，訓練 3D-VQ-GAN 以重建 CT 體積。在第二階段，訓練基於神經常微分方程 (ODE) 的時間模型，以捕捉第一階段編碼器生成的量化嵌入的時間動態。我們評估了我們的模型的不同配置，以生成縱向 CT 掃描，並在定量和定性方面將結果與真實數據進行比較。為了驗證，我們使用從生成的 CT 掃描中得出的影像生物標記進行存活分析，並獲得與從真實 CT 掃描中得出的生物標記相當的 C 指數。存活分析結果證明了生成縱向 CT 掃描固有的潛在臨床效用，表明它們可以可靠地預測存活結果。
 
-Graph anomaly detection (GAD) aims to identify abnormal nodes that differ
-from the majority of the nodes in a graph, which has been attracting
-significant attention in recent years. Existing generalist graph models have
-achieved remarkable success in different graph tasks but struggle to generalize
-to the GAD task. This limitation arises from their difficulty in learning
-generalized knowledge for capturing the inherently infrequent, irregular and
-heterogeneous abnormality patterns in graphs from different domains. To address
-this challenge, we propose AnomalyGFM, a GAD-oriented graph foundation model
-that supports zero-shot inference and few-shot prompt tuning for GAD in diverse
-graph datasets. One key insight is that graph-agnostic representations for
-normal and abnormal classes are required to support effective zero/few-shot GAD
-across different graphs. Motivated by this, AnomalyGFM is pre-trained to align
-data-independent, learnable normal and abnormal class prototypes with node
-representation residuals (i.e., representation deviation of a node from its
-neighbors). The residual features essentially project the node information into
-a unified feature space where we can effectively measure the abnormality of
-nodes from different graphs in a consistent way. This provides a driving force
-for the learning of graph-agnostic, discriminative prototypes for the normal
-and abnormal classes, which can be used to enable zero-shot GAD on new graphs,
-including very large-scale graphs. If there are few-shot labeled normal nodes
-available in the new graphs, AnomalyGFM can further support prompt tuning to
-leverage these nodes for better adaptation. Comprehensive experiments on 11
-widely-used GAD datasets with real anomalies, demonstrate that AnomalyGFM
-significantly outperforms state-of-the-art competing methods under both zero-
-and few-shot GAD settings.
+##### **KMI: A Dataset of Korean Motivational Interviewing Dialogues for Psychotherapy**
+2502.05651v1 by Hyunjong Kim, Suyeon Lee, Yeongjae Cho, Eunseo Ryu, Yohan Jo, Suran Seong, Sungzoon Cho
 
-摘要：圖形異常偵測 (GAD) 的目標是找出與圖形中大多數節點不同的異常節點，這在近年來引起了廣泛的關注。現有的通才圖形模型在不同的圖形任務中都取得了顯著的成功，但卻難以推廣到 GAD 任務。這種限制來自於它們難以學習廣泛的知識，用於擷取來自不同領域圖形中固有的罕見、不規則和異質異常模式。為了應對這個挑戰，我們提出了 AnomalyGFM，一個面向 GAD 的圖形基礎模型，它支援零次學習推論和少次提示調整，用於在不同的圖形資料集中進行 GAD。一個關鍵見解是，需要圖形不可知的正常和異常類別表示，以支援跨不同圖形的有效零次/少次 GAD。受此啟發，AnomalyGFM 被預先訓練以將與資料無關的可學習正常和異常類別原型與節點表示殘差（即節點與其鄰居的表示偏差）對齊。殘差特徵基本上將節點資訊投射到一個統一的特徵空間中，在這個空間中，我們可以有效地測量來自不同圖形的節點異常，並且方式一致。這為學習正常和異常類別的圖形不可知、有區別的原型提供了驅動力，這些原型可用於對新的圖形（包括非常大規模的圖形）啟用零次 GAD。如果新的圖形中有少量的標籤正常節點，AnomalyGFM 可以進一步支援提示調整，以利用這些節點進行更好的適應。在 11 個廣泛使用的具有真實異常值的 GAD 資料集上的綜合實驗表明，在零次和少次 GAD 設定下，AnomalyGFM 明顯優於最先進的競爭方法。
+The increasing demand for mental health services has led to the rise of
+AI-driven mental health chatbots, though challenges related to privacy, data
+collection, and expertise persist. Motivational Interviewing (MI) is gaining
+attention as a theoretical basis for boosting expertise in the development of
+these chatbots. However, existing datasets are showing limitations for training
+chatbots, leading to a substantial demand for publicly available resources in
+the field of MI and psychotherapy. These challenges are even more pronounced in
+non-English languages, where they receive less attention. In this paper, we
+propose a novel framework that simulates MI sessions enriched with the
+expertise of professional therapists. We train an MI forecaster model that
+mimics the behavioral choices of professional therapists and employ Large
+Language Models (LLMs) to generate utterances through prompt engineering. Then,
+we present KMI, the first synthetic dataset theoretically grounded in MI,
+containing 1,000 high-quality Korean Motivational Interviewing dialogues.
+Through an extensive expert evaluation of the generated dataset and the
+dialogue model trained on it, we demonstrate the quality, expertise, and
+practicality of KMI. We also introduce novel metrics derived from MI theory in
+order to evaluate dialogues from the perspective of MI.
 
-##### **The Joint Entity-Relation Extraction Model Based on Span and Interactive Fusion Representation for Chinese Medical Texts with Complex Semantics**
-2502.09247v1 by Danni Feng, Runzhi Li, Jing Wang, Siyu Yan, Lihong Ma, Yunli Xing
+摘要：由於對心理健康服務的需求日益增加，導致以人工智慧為基礎的心理健康聊天機器人興起，儘管與隱私、資料蒐集和專業知識相關的挑戰依然存在。動機性訪談 (MI) 正作為提升這些聊天機器人在開發方面專業知識的理論基礎而備受關注。然而，現有的資料集顯示出訓練聊天機器人的限制，導致對 MI 和心理治療領域中公開可用資源的需求大幅增加。這些挑戰在非英語語言中更加明顯，因為它們受到的關注較少。在本文中，我們提出了一個新穎的架構，它模擬了豐富專業治療師專業知識的 MI 課程。我們訓練了一個 MI 預測模型，它模擬了專業治療師的行為選擇，並採用大型語言模型 (LLM) 透過提示工程來產生話語。然後，我們展示了 KMI，這是第一個理論上以 MI 為基礎的合成資料集，其中包含 1,000 個高品質的韓語動機性訪談對話。透過對所產生的資料集和在該資料集上訓練的對話模型進行廣泛的專家評估，我們展示了 KMI 的品質、專業知識和實用性。我們還引入了從 MI 理論中衍生的新指標，以便從 MI 的角度評估對話。
 
-Joint entity-relation extraction is a critical task in transforming
-unstructured or semi-structured text into triplets, facilitating the
-construction of large-scale knowledge graphs, and supporting various downstream
-applications. Despite its importance, research on Chinese text, particularly
-with complex semantics in specialized domains like medicine, remains limited.
-To address this gap, we introduce the CH-DDI, a Chinese drug-drug interactions
-dataset designed to capture the intricacies of medical text. Leveraging the
-strengths of attention mechanisms in capturing long-range dependencies, we
-propose the SEA module, which enhances the extraction of complex contextual
-semantic information, thereby improving entity recognition and relation
-extraction. Additionally, to address the inefficiencies of existing methods in
-facilitating information exchange between entity recognition and relation
-extraction, we present an interactive fusion representation module. This module
-employs Cross Attention for bidirectional information exchange between the
-tasks and further refines feature extraction through BiLSTM. Experimental
-results on both our CH-DDI dataset and public CoNLL04 dataset demonstrate that
-our model exhibits strong generalization capabilities. On the CH-DDI dataset,
-our model achieves an F1-score of 96.73% for entity recognition and 78.43% for
-relation extraction. On the CoNLL04 dataset, it attains an entity recognition
-precision of 89.54% and a relation extraction accuracy of 71.64%.
+##### **ELMTEX: Fine-Tuning Large Language Models for Structured Clinical Information Extraction. A Case Study on Clinical Reports**
+2502.05638v1 by Aynur Guluzade, Naguib Heiba, Zeyd Boukhers, Florim Hamiti, Jahid Hasan Polash, Yehya Mohamad, Carlos A Velasco
 
-摘要：聯合實體關係抽取是將非結構化或半結構化文字轉換為三元組的重要任務，有助於建構大規模知識圖譜，並支援各種下游應用程式。儘管其重要性，但針對中文文本的研究，特別是醫學等專業領域中具有複雜語義的研究仍十分有限。為了解決這個差距，我們引入了 CH-DDI，一個中文藥物-藥物交互作用資料集，旨在擷取醫學文本的複雜性。利用注意力機制在擷取長程依賴關係方面的優勢，我們提出了 SEA 模組，增強了複雜脈絡語義資訊的抽取，從而改進了實體辨識和關係抽取。此外，為了解決現有方法在促進實體辨識和關係抽取之間資訊交換方面的低效率問題，我們提出了互動式融合表示模組。此模組採用交叉注意力，在任務之間進行雙向資訊交換，並透過 BiLSTM 進一步精煉特徵抽取。在我們的 CH-DDI 資料集和公開的 CoNLL04 資料集上的實驗結果表明，我們的模型展現出強大的泛化能力。在 CH-DDI 資料集上，我們的模型在實體辨識方面達到了 96.73% 的 F1 分數，在關係抽取方面達到了 78.43% 的 F1 分數。在 CoNLL04 資料集上，它在實體辨識方面達到了 89.54% 的準確度，在關係抽取方面達到了 71.64% 的準確度。
+Europe's healthcare systems require enhanced interoperability and
+digitalization, driving a demand for innovative solutions to process legacy
+clinical data. This paper presents the results of our project, which aims to
+leverage Large Language Models (LLMs) to extract structured information from
+unstructured clinical reports, focusing on patient history, diagnoses,
+treatments, and other predefined categories. We developed a workflow with a
+user interface and evaluated LLMs of varying sizes through prompting strategies
+and fine-tuning. Our results show that fine-tuned smaller models match or
+surpass larger counterparts in performance, offering efficiency for
+resource-limited settings. A new dataset of 60,000 annotated English clinical
+summaries and 24,000 German translations was validated with automated and
+manual checks. The evaluations used ROUGE, BERTScore, and entity-level metrics.
+The work highlights the approach's viability and outlines future improvements.
 
-##### **You Do Not Fully Utilize Transformer's Representation Capacity**
-2502.09245v1 by Gleb Gerasimov, Yaroslav Aksenov, Nikita Balagansky, Viacheslav Sinii, Daniil Gavrilov
+摘要：歐洲的醫療保健系統需要增強互通性和數位化，這驅動了對創新解決方案的需求，以處理傳統的臨床數據。本文介紹了我們專案的成果，該專案旨在利用大型語言模型 (LLM) 從非結構化的臨床報告中提取結構化的資訊，重點放在病歷、診斷、治療和其他預定義類別上。我們開發了一個具有使用者介面的工作流程，並透過提示策略和微調來評估不同規模的 LLM。我們的結果顯示，微調後的較小模型在效能上與較大的模型相匹配或超越它們，為資源有限的環境提供了效率。一個包含 60,000 個註解英文臨床摘要和 24,000 個德文翻譯的新資料集已透過自動化和手動檢查進行驗證。評估使用了 ROUGE、BERTScore 和實體層級的指標。這項工作突出了這種方法的可行性，並概述了未來的改進。
 
-In contrast to RNNs, which compress previous tokens into a single hidden
-state, Transformers can attend to all previous tokens directly. However,
-standard Transformers only use representations from the immediately preceding
-layer. In this paper, we show that this design choice causes representation
-collapse and leads to suboptimal performance. To address this issue, we
-introduce Layer-Integrated Memory (LIMe), a simple yet powerful approach that
-preserves the model's overall memory footprint while expanding its
-representational capacity by allowing access to hidden states from earlier
-layers. Through extensive experiments across various architectures and
-different lookup mechanisms, we demonstrate consistent performance improvements
-on a wide range of tasks. Moreover, our analysis of the learned representation
-dynamics and our exploration of depthwise circuits reveal how LIMe integrates
-information across layers, pointing to promising directions for future
-research.
+##### **Multi-scale Masked Autoencoder for Electrocardiogram Anomaly Detection**
+2502.05494v1 by Ya Zhou, Yujie Yang, Jianhuang Gan, Xiangjie Li, Jing Yuan, Wei Zhao
 
-摘要：與將先前符號壓縮成單一隱藏狀態的遞迴神經網路不同，Transformer 可以直接關注所有先前的符號。然而，標準 Transformer 僅使用緊鄰前一層的表示。在本文中，我們說明此設計選擇會導致表示崩潰，並導致次優效能。為了解決此問題，我們引入了「層整合式記憶體」(LIMe)，這是一種簡單但強大的方法，可在擴充表示能力的同時，保留模型的整體記憶體使用量，方法是允許存取來自較早層的隱藏狀態。透過各種架構和不同查詢機制的廣泛實驗，我們展示了在各種任務上的一致效能提升。此外，我們對已學習表示動態的分析和對深度電路的探討，揭示了 LIMe 如何整合跨層資訊，並指出未來研究有望發展的方向。
+Electrocardiogram (ECG) analysis is a fundamental tool for diagnosing
+cardiovascular conditions, yet anomaly detection in ECG signals remains
+challenging due to their inherent complexity and variability. We propose
+Multi-scale Masked Autoencoder for ECG anomaly detection (MMAE-ECG), a novel
+end-to-end framework that effectively captures both global and local
+dependencies in ECG data. Unlike state-of-the-art methods that rely on
+heartbeat segmentation or R-peak detection, MMAE-ECG eliminates the need for
+such pre-processing steps, enhancing its suitability for clinical deployment.
+MMAE-ECG partitions ECG signals into non-overlapping segments, with each
+segment assigned learnable positional embeddings. A novel multi-scale masking
+strategy and multi-scale attention mechanism, along with distinct positional
+embeddings, enable a lightweight Transformer encoder to effectively capture
+both local and global dependencies. The masked segments are then reconstructed
+using a single-layer Transformer block, with an aggregation strategy employed
+during inference to refine the outputs. Experimental results demonstrate that
+our method achieves performance comparable to state-of-the-art approaches while
+significantly reducing computational complexity-approximately 1/78 of the
+floating-point operations (FLOPs) required for inference. Ablation studies
+further validate the effectiveness of each component, highlighting the
+potential of multi-scale masked autoencoders for anomaly detection.
 
-##### **From large language models to multimodal AI: A scoping review on the potential of generative AI in medicine**
-2502.09242v1 by Lukas Buess, Matthias Keicher, Nassir Navab, Andreas Maier, Soroosh Tayebi Arasteh
+摘要：心電圖 (ECG) 分析是診斷心血管疾病的基本工具，但由於 ECG 訊號本身的複雜性和變異性，異常偵測仍然是一項挑戰。我們提出用於 ECG 異常偵測的多尺度遮罩自編碼器 (MMAE-ECG)，這是一個新穎的端對端架構，可有效擷取 ECG 資料中的全局和局部依賴關係。與依賴於心跳區段或 R 波峰偵測的最新方法不同，MMAE-ECG 消除了對此類前處理步驟的需求，增強其適用於臨床部署。MMAE-ECG 將 ECG 訊號分割成不相疊的區段，每個區段都指派可學習的位置嵌入。新穎的多尺度遮罩策略和多尺度注意力機制，以及不同的位置嵌入，使輕量級 Transformer 編碼器能夠有效擷取局部和全局依賴關係。然後使用單層 Transformer 區塊重建遮罩區段，並在推理期間採用聚合策略來優化輸出。實驗結果表明，我們的模型達到了與最新方法相當的效能，同時大幅降低運算複雜度，約為推理所需的浮點運算 (FLOP) 的 1/78。消融研究進一步驗證了每個組件的有效性，突顯了多尺度遮罩自編碼器在異常偵測方面的潛力。
 
-Generative artificial intelligence (AI) models, such as diffusion models and
-OpenAI's ChatGPT, are transforming medicine by enhancing diagnostic accuracy
-and automating clinical workflows. The field has advanced rapidly, evolving
-from text-only large language models for tasks such as clinical documentation
-and decision support to multimodal AI systems capable of integrating diverse
-data modalities, including imaging, text, and structured data, within a single
-model. The diverse landscape of these technologies, along with rising interest,
-highlights the need for a comprehensive review of their applications and
-potential. This scoping review explores the evolution of multimodal AI,
-highlighting its methods, applications, datasets, and evaluation in clinical
-settings. Adhering to PRISMA-ScR guidelines, we systematically queried PubMed,
-IEEE Xplore, and Web of Science, prioritizing recent studies published up to
-the end of 2024. After rigorous screening, 144 papers were included, revealing
-key trends and challenges in this dynamic field. Our findings underscore a
-shift from unimodal to multimodal approaches, driving innovations in diagnostic
-support, medical report generation, drug discovery, and conversational AI.
-However, critical challenges remain, including the integration of heterogeneous
-data types, improving model interpretability, addressing ethical concerns, and
-validating AI systems in real-world clinical settings. This review summarizes
-the current state of the art, identifies critical gaps, and provides insights
-to guide the development of scalable, trustworthy, and clinically impactful
-multimodal AI solutions in healthcare.
+##### **DCENWCNet: A Deep CNN Ensemble Network for White Blood Cell Classification with LIME-Based Explainability**
+2502.05459v1 by Sibasish Dhibar
 
-摘要：生成式人工智能 (AI) 模型，例如扩散模型和 OpenAI 的 ChatGPT，通过提高诊断准确性和自动化临床工作流程，正在改变医学领域。该领域已迅速发展，从用于临床文件编制和决策支持等任务的纯文本大型语言模型，发展到能够在单个模型中整合包括影像、文本和结构化数据在内的多种数据方式的多模态 AI 系统。这些技术的多样化格局以及日益增长的兴趣，凸显了全面审查其应用和潜力的必要性。本范围审查探讨了多模态 AI 的演变，重点介绍了其方法、应用、数据集和在临床环境中的评估。遵循 PRISMA-ScR 指南，我们系统地查询了 PubMed、IEEE Xplore 和 Web of Science，优先考虑截至 2024 年底发表的最新研究。经过严格筛选，纳入了 144 篇论文，揭示了这一充满活力的领域的趋势和挑战。我们的研究结果强调了从单模态方法向多模态方法的转变，推动了诊断支持、医疗报告生成、药物发现和会话式 AI 的创新。然而，关键挑战仍然存在，包括异构数据类型的整合、提高模型可解释性、解决伦理问题以及在现实世界的临床环境中验证 AI 系统。本综述总结了当前的最新技术，确定了关键差距，并提供了见解，以指导在医疗保健领域开发可扩展、可信赖且具有临床影响力的多模态 AI 解决方案。
+White blood cells (WBC) are important parts of our immune system, and they
+protect our body against infections by eliminating viruses, bacteria, parasites
+and fungi. The number of WBC types and the total number of WBCs provide
+important information about our health status. A traditional method,
+convolutional neural networks (CNN), a deep learning architecture, can classify
+the blood cell from a part of an object and perform object recognition. Various
+CNN models exhibit potential; however, their development often involves ad-hoc
+processes that neglect unnecessary layers, leading to issues with unbalanced
+datasets and insufficient data augmentation. To address these challenges, we
+propose a novel ensemble approach that integrates three CNN architectures, each
+uniquely configured with different dropout and max-pooling layer settings to
+enhance feature learning. This ensemble model, named DCENWCNet, effectively
+balances the bias-variance trade-off. When evaluated on the widely recognized
+Rabbin-WBC dataset, our model outperforms existing state-of-the-art networks,
+achieving highest mean accuracy. Additionally, it demonstrates superior
+performance in precision, recall, F1-score, and Area Under the ROC Curve (AUC)
+across all categories. To delve deeper into the interpretability of
+classifiers, we employ reliable post-hoc explanation techniques, including
+Local Interpretable Model-Agnostic Explanations (LIME). These methods
+approximate the behavior of a black-box model by elucidating the relationships
+between feature values and predictions. Interpretable results enable users to
+comprehend and validate the model's predictions, thereby increasing their
+confidence in the automated diagnosis.
 
-##### **Reliable Conversational Agents under ASP Control that Understand Natural Language**
-2502.09237v1 by Yankai Zeng
+摘要：白血球 (WBC) 是我們免疫系統的重要組成部分，它們通過清除病毒、細菌、寄生蟲和真菌來保護我們的機體免受感染。WBC 類型數量和 WBC 總數提供了有關我們健康狀況的重要資訊。傳統方法卷積神經網路 (CNN) 是一種深度學習架構，可以對物體的一部分進行血細胞分類並執行物體識別。各種 CNN 模型展現出潛力；然而，它們的開發通常涉及忽略不必要層的臨時過程，導致不平衡的資料集和資料擴充不足的問題。為了應對這些挑戰，我們提出了一種新穎的整體方法，它整合了三種 CNN 架構，每種架構都採用不同的中斷和最大池化層設定進行獨特配置，以增強特徵學習。這種名為 DCENWCNet 的整體模型有效地平衡了偏差變異取捨。在廣泛認可的 Rabbin-WBC 資料集上進行評估時，我們的模型優於現有的最先進網路，達到了最高的平均準確度。此外，它在所有類別中都展示了在精確度、召回率、F1 分數和 ROC 曲線下面積 (AUC) 方面的卓越效能。為了更深入地研究分類器的可解釋性，我們採用了可靠的事後解釋技術，包括局部可解釋模型不可知解釋 (LIME)。這些方法通過闡明特徵值和預測之間的關係來近似黑盒模型的行為。可解釋的結果使用戶能夠理解和驗證模型的預測，從而增加他們對自動化診斷的信心。
 
-Efforts have been made to make machines converse like humans in the past few
-decades. The recent techniques of Large Language Models (LLMs) make it possible
-to have human-like conversations with machines, but LLM's flaws of lacking
-understanding and reliability are well documented. We believe that the best way
-to eliminate this problem is to use LLMs only as parsers to translate text to
-knowledge and vice versa and carry out the conversation by reasoning over this
-knowledge using the answer set programming. I have been developing a framework
-based on LLMs and ASP to realize reliable chatbots that "understand" human
-conversation. This framework has been used to develop task-specific chatbots as
-well as socialbots. My future research is focused on making these chatbots
-scalable and trainable.
+##### **Multi-Class Segmentation of Aortic Branches and Zones in Computed Tomography Angiography: The AortaSeg24 Challenge**
+2502.05330v1 by Muhammad Imran, Jonathan R. Krebs, Vishal Balaji Sivaraman, Teng Zhang, Amarjeet Kumar, Walker R. Ueland, Michael J. Fassler, Jinlong Huang, Xiao Sun, Lisheng Wang, Pengcheng Shi, Maximilian Rokuss, Michael Baumgartner, Yannick Kirchhof, Klaus H. Maier-Hein, Fabian Isensee, Shuolin Liu, Bing Han, Bong Thanh Nguyen, Dong-jin Shin, Park Ji-Woo, Mathew Choi, Kwang-Hyun Uhm, Sung-Jea Ko, Chanwoong Lee, Jaehee Chun, Jin Sung Kim, Minghui Zhang, Hanxiao Zhang, Xin You, Yun Gu, Zhaohong Pan, Xuan Liu, Xiaokun Liang, Markus Tiefenthaler, Enrique Almar-Munoz, Matthias Schwab, Mikhail Kotyushev, Rostislav Epifanov, Marek Wodzinski, Henning Muller, Abdul Qayyum, Moona Mazher, Steven A. Niederer, Zhiwei Wang, Kaixiang Yang, Jintao Ren, Stine Sofia Korreman, Yuchong Gao, Hongye Zeng, Haoyu Zheng, Rui Zheng, Jinghua Yue, Fugen Zhou, Bo Liu, Alexander Cosman, Muxuan Liang, Chang Zhao, Gilbert R. Upchurch Jr., Jun Ma, Yuyin Zhou, Michol A. Cooper, Wei Shao
 
-摘要：在過去的幾十年裡，人們一直努力讓機器像人類一樣對話。大型語言模型 (LLM) 的最新技術讓與機器進行類人對話成為可能，但 LLM 缺乏理解力和可靠性的缺陷已被充分記錄。我們相信消除這個問題的最佳方法是僅將 LLM 作為解析器，將文字轉換為知識，反之亦然，並使用答案集程式設計對此知識進行推理來進行對話。我一直在開發一個基於 LLM 和 ASP 的框架，以實現「理解」人類對話的可靠聊天機器人。這個框架已被用於開發特定任務的聊天機器人以及社交機器人。我未來的研究重點在於讓這些聊天機器人具有可擴充性和可訓練性。
+Multi-class segmentation of the aorta in computed tomography angiography
+(CTA) scans is essential for diagnosing and planning complex endovascular
+treatments for patients with aortic dissections. However, existing methods
+reduce aortic segmentation to a binary problem, limiting their ability to
+measure diameters across different branches and zones. Furthermore, no
+open-source dataset is currently available to support the development of
+multi-class aortic segmentation methods. To address this gap, we organized the
+AortaSeg24 MICCAI Challenge, introducing the first dataset of 100 CTA volumes
+annotated for 23 clinically relevant aortic branches and zones. This dataset
+was designed to facilitate both model development and validation. The challenge
+attracted 121 teams worldwide, with participants leveraging state-of-the-art
+frameworks such as nnU-Net and exploring novel techniques, including cascaded
+models, data augmentation strategies, and custom loss functions. We evaluated
+the submitted algorithms using the Dice Similarity Coefficient (DSC) and
+Normalized Surface Distance (NSD), highlighting the approaches adopted by the
+top five performing teams. This paper presents the challenge design, dataset
+details, evaluation metrics, and an in-depth analysis of the top-performing
+algorithms. The annotated dataset, evaluation code, and implementations of the
+leading methods are publicly available to support further research. All
+resources can be accessed at https://aortaseg24.grand-challenge.org.
 
-##### **Commonsense Reasoning-Aided Autonomous Vehicle Systems**
-2502.09233v1 by Keegan Kimbrell
+摘要：多類別主動脈電腦斷層血管攝影 (CTA) 掃描分割對於診斷和規劃主動脈剝離患者的複雜血管內治療至關重要。然而，現有方法將主動脈分割簡化為二元問題，限制了其測量不同分支和區域直徑的能力。此外，目前沒有開放原始碼數據集可用於支援多類別主動脈分割方法的開發。為了解決此問題，我們組織了 AortaSeg24 MICCAI 挑戰，引入了第一個包含 100 個 CTA 體積的數據集，這些體積針對 23 個臨床上相關的主動脈分支和區域進行了註釋。此數據集旨在促進模型開發和驗證。該挑戰吸引了來自世界各地的 121 個團隊，參與者利用了 nnU-Net 等最先進的框架，並探索了創新技術，包括串聯模型、數據擴充策略和自訂損失函數。我們使用 Dice 相似性係數 (DSC) 和標準化表面距離 (NSD) 評估了提交的演算法，重點介紹了前五名表現最佳團隊採用的方法。本文介紹了挑戰設計、數據集詳細資訊、評估指標以及對表現最佳演算法的深入分析。已公開註釋的數據集、評估程式碼和領先方法的實作，以支援進一步的研究。所有資源都可以在 https://aortaseg24.grand-challenge.org/ 獲得。
 
-Autonomous Vehicle (AV) systems have been developed with a strong reliance on
-machine learning techniques. While machine learning approaches, such as deep
-learning, are extremely effective at tasks that involve observation and
-classification, they struggle when it comes to performing higher level
-reasoning about situations on the road. This research involves incorporating
-commonsense reasoning models that use image data to improve AV systems. This
-will allow AV systems to perform more accurate reasoning while also making them
-more adjustable, explainable, and ethical. This paper will discuss the findings
-so far and motivate its direction going forward.
+##### **Homeomorphism Prior for False Positive and Negative Problem in Medical Image Dense Contrastive Representation Learning**
+2502.05282v1 by Yuting He, Boyu Wang, Rongjun Ge, Yang Chen, Guanyu Yang, Shuo Li
 
-摘要：自動駕駛車輛 (AV) 系統的開發高度依賴機器學習技術。儘管機器學習方法（例如深度學習）在涉及觀察和分類的任務中非常有效，但它們在對路況進行更高層級推理時會遇到困難。本研究涉及整合使用影像資料的常識推理模型，以改善 AV 系統。這將使 AV 系統能夠執行更準確的推理，同時也讓它們更具可調整性、可解釋性和道德性。本文將探討迄今為止的發現，並說明其未來的發展方向。
+Dense contrastive representation learning (DCRL) has greatly improved the
+learning efficiency for image-dense prediction tasks, showing its great
+potential to reduce the large costs of medical image collection and dense
+annotation. However, the properties of medical images make unreliable
+correspondence discovery, bringing an open problem of large-scale false
+positive and negative (FP&N) pairs in DCRL. In this paper, we propose GEoMetric
+vIsual deNse sImilarity (GEMINI) learning which embeds the homeomorphism prior
+to DCRL and enables a reliable correspondence discovery for effective dense
+contrast. We propose a deformable homeomorphism learning (DHL) which models the
+homeomorphism of medical images and learns to estimate a deformable mapping to
+predict the pixels' correspondence under topological preservation. It
+effectively reduces the searching space of pairing and drives an implicit and
+soft learning of negative pairs via a gradient. We also propose a geometric
+semantic similarity (GSS) which extracts semantic information in features to
+measure the alignment degree for the correspondence learning. It will promote
+the learning efficiency and performance of deformation, constructing positive
+pairs reliably. We implement two practical variants on two typical
+representation learning tasks in our experiments. Our promising results on
+seven datasets which outperform the existing methods show our great
+superiority. We will release our code on a companion link:
+https://github.com/YutingHe-list/GEMINI.
 
-##### **Logical foundations of Smart Contracts**
-2502.09232v1 by Kalonji Kalala
+摘要：密集对比表征学习（DCRL）极大地提高了影像密集预测任务的学习效率，显示出其在降低医学影像收集和密集标注的大量成本方面的巨大潜力。然而，医学影像的特性使得对应关系发现不可靠，给 DCRL 带来大规模假阳性和假阴性（FP&N）对的开放性问题。在本文中，我们提出了 GEoMetric vIsual deNse sImilarity（GEMINI）学习，它将同胚先验嵌入 DCRL 中，并针对有效密集对比提供了可靠的对应关系发现。我们提出了一种可变形同胚学习（DHL），它对医学影像的同胚进行建模，并学习估计可变形映射，以预测在拓扑保持下的像素对应关系。它有效地减少了配对的搜索空间，并通过梯度驱动了负对的隐式和软学习。我们还提出了几何语义相似性（GSS），它提取特征中的语义信息，以测量对应关系学习的对齐度。它将促进变形学习的效率和性能，可靠地构建正对。我们在实验中针对两个典型的表征学习任务实现了两个实际变体。我们在七个数据集上的有希望的结果优于现有方法，显示出我们的巨大优势。我们将在配套链接中发布我们的代码：https://github.com/YutingHe-list/GEMINI。
 
-Nowadays, sophisticated domains are emerging which require appropriate
-formalisms to be specified accurately in order to reason about them. One such
-domain is constituted of smart contracts that have emerged in cyber physical
-systems as a way of enforcing formal agreements between components of these
-systems. Smart contracts self-execute to run and share business processes
-through blockchain, in decentralized systems, with many different participants.
-Legal contracts are in many cases complex documents, with a number of
-exceptions, and many subcontracts. The implementation of smart contracts based
-on legal contracts is a long and laborious task, that needs to include all
-actions, procedures, and the effects of actions related to the execution of the
-contract. An ongoing open problem in this area is to formally account for smart
-contracts using a uniform and somewhat universal formalism. This thesis
-proposes logical foundations to smart contracts using the Situation Calculus, a
-logic for reasoning about actions. Situation Calculus is one of the prominent
-logic-based artificial intelligence approaches that provides enough logical
-mechanism to specify and implement dynamic and complex systems such as
-contracts. Situation Calculus is suitable to show how worlds dynamically
-change. Smart contracts are going to be implement with Golog (written en
-Prolog), a Situation Calculus-based programming language for modeling complex
-and dynamic behaviors.
+##### **"It Felt Like I Was Left in the Dark": Exploring Information Needs and Design Opportunities for Family Caregivers of Older Adult Patients in Critical Care Settings**
+2502.05115v1 by Shihan Fu, Bingsheng Yao, Smit Desai, Yuqi Hu, Yuling Sun, Samantha Stonbraker, Yanjun Gao, Elizabeth M. Goldberg, Dakuo Wang
 
-摘要：如今，正在出现需要适当形式化来准确指定以对其进行推理的复杂领域。此类领域之一由在网络物理系统中出现的智能合约构成，作为强制执行这些系统组件之间正式协议的一种方式。智能合约自执行以在去中心化系统中通过区块链运行和共享业务流程，并有许多不同的参与者。法律合约在许多情况下是复杂的文档，有许多例外和许多分包合同。基于法律合约实施智能合约是一项漫长而艰巨的任务，需要包括所有操作、程序以及与执行合约相关的操作效果。该领域的持续开放问题是使用统一且某种程度上通用的形式化来正式说明智能合约。本论文提出了使用情景演算（一种用于推理操作的逻辑）为智能合约提供逻辑基础。情景演算是基于逻辑的人工智能方法之一，提供了足够的逻辑机制来指定和实现动态且复杂的系统，例如合约。情景演算适用于展示世界如何动态变化。智能合约将使用 Golog（以 Prolog 编写的）实现，这是一种基于情景演算的编程语言，用于建模复杂且动态的行为。
+Older adult patients constitute a rapidly growing subgroup of Intensive Care
+Unit (ICU) patients. In these situations, their family caregivers are expected
+to represent the unconscious patients to access and interpret patients' medical
+information. However, caregivers currently have to rely on overloaded
+clinicians for information updates and typically lack the health literacy to
+understand complex medical information. Our project aims to explore the
+information needs of caregivers of ICU older adult patients, from which we can
+propose design opportunities to guide future AI systems. The project begins
+with formative interviews with 11 caregivers to identify their challenges in
+accessing and interpreting medical information; From these findings, we then
+synthesize design requirements and propose an AI system prototype to cope with
+caregivers' challenges. The system prototype has two key features: a timeline
+visualization to show the AI extracted and summarized older adult patients' key
+medical events; and an LLM-based chatbot to provide context-aware informational
+support. We conclude our paper by reporting on the follow-up user evaluation of
+the system and discussing future AI-based systems for ICU caregivers of older
+adults.
 
-##### **Relating Answer Set Programming and Many-sorted Logics for Formal Verification**
-2502.09230v1 by Zachary Hansen
+摘要：老年患者構成加護病房 (ICU) 患者中快速成長的子群。在這些情況下，預期他們的家庭照護者能代表無意識的患者取得並解讀患者的醫療資訊。然而，照護者目前必須依賴工作繁重的臨床醫師提供資訊更新，而且通常缺乏了解複雜醫療資訊的健康素養。我們的專案旨在探索 ICU 老年患者照護者的資訊需求，我們可以根據這些需求提出設計機會，以引導未來的 AI 系統。這個專案從對 11 位照護者的形成性訪談開始，以找出他們在取得和解讀醫療資訊方面的挑戰；根據這些發現，我們接著綜合設計需求，並提出一個 AI 系統原型，以應對照護者的挑戰。這個系統原型具有兩個關鍵特點：一個時間軸視覺化，以顯示 AI 萃取並摘要出的老年患者關鍵醫療事件；以及一個基於 LLM 的聊天機器人，以提供情境感知的資訊支援。我們透過報告系統的後續使用者評估，以及討論未來針對老年人 ICU 照護者的 AI 系統，來總結我們的論文。
 
-Answer Set Programming (ASP) is an important logic programming paradigm
-within the field of Knowledge Representation and Reasoning. As a concise,
-human-readable, declarative language, ASP is an excellent tool for developing
-trustworthy (especially, artificially intelligent) software systems. However,
-formally verifying ASP programs offers some unique challenges, such as
-  1. a lack of modularity (the meanings of rules are difficult to define in
-isolation from the enclosing program),
-  2. the ground-and-solve semantics (the meanings of rules are dependent on the
-input data with which the program is grounded), and
-  3. limitations of existing tools.
-  My research agenda has been focused on addressing these three issues with the
-intention of making ASP verification an accessible, routine task that is
-regularly performed alongside program development. In this vein, I have
-investigated alternative semantics for ASP based on translations into the logic
-of here-and-there and many-sorted first-order logic. These semantics promote a
-modular understanding of logic programs, bypass grounding, and enable us to use
-automated theorem provers to automatically verify properties of programs.
+##### **Mitigating Unintended Memorization with LoRA in Federated Learning for LLMs**
+2502.05087v1 by Thierry Bossy, Julien Vignoud, Tahseen Rabbani, Juan R. Troncoso Pastoriza, Martin Jaggi
 
-摘要：<paragraph>答案集程式設計 (ASP) 是知識表徵與推理領域中一個重要的邏輯程式設計範式。ASP 作為一種簡潔、人類可讀、宣告式的語言，是開發值得信賴的 (特別是人工智慧) 軟體系統的絕佳工具。然而，正式驗證 ASP 程式提供了一些獨特的挑戰，例如
-  1. 缺乏模組化 (規則的含義難以與封閉程式隔離定義)，
-  2. 基礎與求解語意 (規則的含義取決於程式基礎的輸入資料)，以及
-  3. 現有工具的限制。
-  我的研究議程一直專注於解決這三個問題，目的是讓 ASP 驗證成為一個可存取的、例行任務，並在程式開發過程中定期執行。在這個脈絡下，我研究了基於翻譯成此處和彼處邏輯以及多種排序一階邏輯的 ASP 替代語意。這些語意促進了邏輯程式的模組化理解，繞過基礎，並使我們能夠使用自動化定理證明器自動驗證程式的屬性。</paragraph>
+Federated learning (FL) is a popular paradigm for collaborative training
+which avoids direct data exposure between clients. However, data privacy issues
+still remain: FL-trained large language models are capable of memorizing and
+completing phrases and sentences contained in training data when given with
+their prefixes. Thus, it is possible for adversarial and honest-but-curious
+clients to recover training data of other participants simply through targeted
+prompting. In this work, we demonstrate that a popular and simple fine-tuning
+strategy, low-rank adaptation (LoRA), reduces memorization during FL up to a
+factor of 10. We study this effect by performing a medical question-answering
+fine-tuning task and injecting multiple replicas of out-of-distribution
+sensitive sequences drawn from an external clinical dataset. We observe a
+reduction in memorization for a wide variety of Llama 2 and 3 models, and find
+that LoRA can reduce memorization in centralized learning as well. Furthermore,
+we show that LoRA can be combined with other privacy-preserving techniques such
+as gradient clipping and Gaussian noising, secure aggregation, and Goldfish
+loss to further improve record-level privacy while maintaining performance.
 
-##### **Computational methods for Dynamic Answer Set Programming**
-2502.09228v1 by Susana Hahn
+摘要：聯邦學習 (FL) 是一種流行的協作訓練範例，可避免客戶端之間直接公開資料。然而，資料隱私問題仍然存在：經過 FL 訓練的大型語言模型能夠記憶並完成訓練資料中包含的片語和句子，只要給予其前綴即可。因此，對抗和誠實但好奇的客戶端有可能僅透過目標提示來恢復其他參與者的訓練資料。在這項工作中，我們證明了一種流行且簡單的微調策略，低秩適應 (LoRA)，可將 FL 期間的記憶減少多達 10 倍。我們透過執行醫學問答微調任務並注入從外部臨床資料集抽取的非分佈敏感序列的多次複製品來研究此效應。我們觀察到各種 Llama 2 和 3 模型的記憶力降低，並發現 LoRA 也能減少集中式學習中的記憶力。此外，我們展示 LoRA 可以與其他隱私保護技術結合使用，例如梯度裁剪和高斯雜訊、安全聚合和 Goldfish 損失，以進一步改善記錄級隱私，同時維持效能。
 
-In our daily lives and industrial settings, we often encounter dynamic
-problems that require reasoning over time and metric constraints. These include
-tasks such as scheduling, routing, and production sequencing. Dynamic logics
-have traditionally addressed these needs but often lack the flexibility and
-integration required for comprehensive problem modeling. This research aims to
-extend Answer Set Programming (ASP), a powerful declarative problem-solving
-approach, to handle dynamic domains effectively. By integrating concepts from
-dynamic, temporal, and metric logics into ASP, we seek to develop robust
-systems capable of modeling complex dynamic problems and performing efficient
-reasoning tasks, thereby enhancing ASPs applicability in industrial contexts.
+##### **MedMimic: Physician-Inspired Multimodal Fusion for Early Diagnosis of Fever of Unknown Origin**
+2502.04794v1 by Minrui Chen, Yi Zhou, Huidong Jiang, Yuhan Zhu, Guanjie Zou, Minqi Chen, Rong Tian, Hiroto Saigo
 
-摘要：在我們的日常生活和工業環境中，我們經常會遇到動態問題，需要隨著時間和公制約束進行推理。這些問題包括排程、路由和生產順序等任務。動態邏輯傳統上解決了這些需求，但通常缺乏全面問題建模所需的靈活性與整合性。本研究旨在擴展強大的宣告式問題解決方法「Answer Set Programming (ASP)」，以有效處理動態領域。透過將動態、時態和公制邏輯的概念整合到 ASP 中，我們尋求開發強健的系統，能夠建模複雜的動態問題並執行有效的推理任務，進而增強 ASP 在工業環境中的適用性。
+Fever of unknown origin FUO remains a diagnostic challenge. MedMimic is
+introduced as a multimodal framework inspired by real-world diagnostic
+processes. It uses pretrained models such as DINOv2, Vision Transformer, and
+ResNet-18 to convert high-dimensional 18F-FDG PET/CT imaging into
+low-dimensional, semantically meaningful features. A learnable
+self-attention-based fusion network then integrates these imaging features with
+clinical data for classification. Using 416 FUO patient cases from Sichuan
+University West China Hospital from 2017 to 2023, the multimodal fusion
+classification network MFCN achieved macro-AUROC scores ranging from 0.8654 to
+0.9291 across seven tasks, outperforming conventional machine learning and
+single-modality deep learning methods. Ablation studies and five-fold
+cross-validation further validated its effectiveness. By combining the
+strengths of pretrained large models and deep learning, MedMimic offers a
+promising solution for disease classification.
 
-##### **Generating Causally Compliant Counterfactual Explanations using ASP**
-2502.09226v1 by Sopam Dasgupta
+摘要：不明原因發燒 (FUO) 仍然是診斷上的挑戰。MedMimic 是一個多模式架構，靈感來自於真實世界的診斷過程。它使用預先訓練的模型，例如 DINOv2、視覺轉換器和 ResNet-18，將高維 18F-FDG PET/CT 影像轉換為低維、語義有意義的特徵。一個可學習的自注意力融合網路接著將這些影像特徵與臨床資料整合，用於分類。使用 2017 年至 2023 年四川大學華西醫院的 416 個 FUO 病患病例，多模式融合分類網路 MFCN 在七項任務中達到了 0.8654 到 0.9291 的巨觀 AUROC 分數，優於傳統機器學習和單一模式深度學習方法。消融研究和五倍交叉驗證進一步驗證了其有效性。MedMimic 結合了預先訓練的大模型和深度學習的優點，為疾病分類提供了一個有前景的解決方案。
 
-This research is focused on generating achievable counterfactual
-explanations. Given a negative outcome computed by a machine learning model or
-a decision system, the novel CoGS approach generates (i) a counterfactual
-solution that represents a positive outcome and (ii) a path that will take us
-from the negative outcome to the positive one, where each node in the path
-represents a change in an attribute (feature) value. CoGS computes paths that
-respect the causal constraints among features. Thus, the counterfactuals
-computed by CoGS are realistic. CoGS utilizes rule-based machine learning
-algorithms to model causal dependencies between features. The paper discusses
-the current status of the research and the preliminary results obtained.
+##### **MedGNN: Towards Multi-resolution Spatiotemporal Graph Learning for Medical Time Series Classification**
+2502.04515v1 by Wei Fan, Jingru Fei, Dingyu Guo, Kun Yi, Xiaozhuang Song, Haolong Xiang, Hangting Ye, Min Li
 
-摘要：本研究重點在於產生可實現的反事實解釋。給定由機器學習模型或決策系統計算出的負面結果，創新的 CoGS 方法會產生 (i) 代表正面結果的反事實解，以及 (ii) 一條將我們從負面結果帶到正面結果的途徑，其中途徑中的每個節點代表屬性 (特徵) 值的變化。CoGS 計算出符合特徵之間因果關係的途徑。因此，CoGS 計算出的反事實是切合實際的。CoGS 利用基於規則的機器學習演算法來建模特徵之間的因果關係。本文探討了研究的現況和獲得的初步結果。
+Medical time series has been playing a vital role in real-world healthcare
+systems as valuable information in monitoring health conditions of patients.
+Accurate classification for medical time series, e.g., Electrocardiography
+(ECG) signals, can help for early detection and diagnosis. Traditional methods
+towards medical time series classification rely on handcrafted feature
+extraction and statistical methods; with the recent advancement of artificial
+intelligence, the machine learning and deep learning methods have become more
+popular. However, existing methods often fail to fully model the complex
+spatial dynamics under different scales, which ignore the dynamic
+multi-resolution spatial and temporal joint inter-dependencies. Moreover, they
+are less likely to consider the special baseline wander problem as well as the
+multi-view characteristics of medical time series, which largely hinders their
+prediction performance. To address these limitations, we propose a
+Multi-resolution Spatiotemporal Graph Learning framework, MedGNN, for medical
+time series classification. Specifically, we first propose to construct
+multi-resolution adaptive graph structures to learn dynamic multi-scale
+embeddings. Then, to address the baseline wander problem, we propose Difference
+Attention Networks to operate self-attention mechanisms on the finite
+difference for temporal modeling. Moreover, to learn the multi-view
+characteristics, we utilize the Frequency Convolution Networks to capture
+complementary information of medical time series from the frequency domain. In
+addition, we introduce the Multi-resolution Graph Transformer architecture to
+model the dynamic dependencies and fuse the information from different
+resolutions. Finally, we have conducted extensive experiments on multiple
+medical real-world datasets that demonstrate the superior performance of our
+method. Our Code is available.
 
-##### **Order-Sorted Intensional Logic: Expressing Subtyping Polymorphism with Typing Assertions and Quantification over Concepts**
-2502.09224v1 by Đorđe Marković, Marc Denecker
+摘要：<paragraph>醫療時間序列在真實世界的醫療保健系統中扮演著至關重要的角色，作為監控患者健康狀況的寶貴資訊。
+準確分類醫療時間序列，例如心電圖 (ECG) 訊號，有助於早期偵測和診斷。傳統的醫療時間序列分類方法仰賴手工特徵萃取和統計方法；隨著人工智慧的最新進展，機器學習和深度學習方法變得更為普及。然而，現有方法通常無法完全建模不同尺度下的複雜空間動態，忽略了動態多解析度空間和時間關節相互依賴性。此外，它們不太可能考慮特殊的基線漂移問題以及醫療時間序列的多視角特性，這在很大程度上阻礙了它們的預測效能。為了解決這些限制，我們提出了一個多解析度時空圖形學習架構 MedGNN，用於醫療時間序列分類。具體來說，我們首先提出構建多解析度自適應圖形結構以學習動態多尺度嵌入。然後，為了解決基線漂移問題，我們提出差分注意力網路，對時間建模的有限差分運算自注意力機制。此外，為了學習多視角特性，我們利用頻率卷積網路從頻域擷取醫療時間序列的互補資訊。此外，我們引入了多解析度圖形Transformer架構來建模動態依賴性，並融合來自不同解析度的資訊。最後，我們對多個醫療真實世界資料集進行了廣泛的實驗，證明了我們方法的優異效能。我們的程式碼已公開。</paragraph>
 
-Subtyping, also known as subtype polymorphism, is a concept extensively
-studied in programming language theory, delineating the substitutability
-relation among datatypes. This property ensures that programs designed for
-supertype objects remain compatible with their subtypes.
-  In this paper, we explore the capability of order-sorted logic for utilizing
-these ideas in the context of Knowledge Representation. We recognize two
-fundamental limitations: First, the inability of this logic to address the
-concept rather than the value of non-logical symbols, and second, the lack of
-language constructs for constraining the type of terms. Consequently, we
-propose guarded order-sorted intensional logic, where guards are language
-constructs for annotating typing information and intensional logic provides
-support for quantification over concepts.
+##### **Integrating Generative Artificial Intelligence in ADRD: A Framework for Streamlining Diagnosis and Care in Neurodegenerative Diseases**
+2502.06842v1 by Andrew G. Breithaupt, Alice Tang, Bruce L. Miller, Pedro Pinheiro-Chagas
 
-摘要：子類型化，也稱為子類型多態性，是一個在程式語言理論中廣泛研究的概念，用於描述資料類型之間的可替換關係。此特性可確保為超類型物件設計的程式與其子類型相容。
-在本文中，我們探討了使用排序邏輯在知識表徵中運用這些想法的能力。我們發現了兩個基本限制：首先，此邏輯無法處理非邏輯符號的概念而非值，其次，缺乏約束項類型的語言結構。因此，我們提出了受保護的排序邏輯，其中保護是註解類型資訊的語言結構，而內涵邏輯則支援對概念量化。
+Healthcare systems are struggling to meet the growing demand for neurological
+care, with challenges particularly acute in Alzheimer's disease and related
+dementias (ADRD). While artificial intelligence research has often focused on
+identifying patterns beyond human perception, implementing such predictive
+capabilities remains challenging as clinicians cannot readily verify insights
+they cannot themselves detect. We propose that large language models (LLMs)
+offer more immediately practical applications by enhancing clinicians'
+capabilities in three critical areas: comprehensive data collection,
+interpretation of complex clinical information, and timely application of
+relevant medical knowledge. These challenges stem from limited time for proper
+diagnosis, growing data complexity, and an overwhelming volume of medical
+literature that exceeds any clinician's capacity to fully master. We present a
+framework for responsible AI integration that leverages LLMs' ability to
+communicate effectively with both patients and providers while maintaining
+human oversight. This approach prioritizes standardized, high-quality data
+collection to enable a system that learns from every patient encounter while
+incorporating the latest clinical evidence, continuously improving care
+delivery. We begin to address implementation challenges and initiate important
+discussions around ethical considerations and governance needs. While developed
+for ADRD, this roadmap provides principles for responsible AI integration
+across neurology and other medical specialties, with potential to improve
+diagnostic accuracy, reduce care disparities, and advance clinical knowledge
+through a learning healthcare system.
 
-##### **ASP-driven User-interaction with Clinguin**
-2502.09222v1 by Alexander Beiser, Susana Hahn, Torsten Schaub
+摘要：醫療體系正努力滿足日益增長的神經照護需求，其中阿茲海默症和相關失智症 (ADRD) 的挑戰特別嚴重。雖然人工智慧研究通常專注於識別人類感知之外的模式，但實作此類預測功能仍然具有挑戰性，因為臨床醫生無法輕易驗證他們自己無法偵測到的見解。我們提出大型語言模型 (LLM) 可透過提升臨床醫生在三個關鍵領域的能力，提供更直接且實用的應用：全面的資料收集、複雜臨床資訊的詮釋，以及適時應用相關的醫學知識。這些挑戰源自於適當診斷時間有限、資料複雜性日益增加，以及龐大的醫學文獻量超過任何臨床醫生所能完全掌握的容量。我們提出了一個負責任的 AI 整合架構，利用 LLM 與患者和提供者有效溝通的能力，同時維持人為監督。此方法優先考慮標準化、高品質的資料收集，以建立一個從每次患者接觸中學習的系統，同時納入最新的臨床證據，持續改善照護提供。我們開始探討實作挑戰，並展開關於倫理考量和治理需求的重要討論。儘管是為 ADRD 所開發，此藍圖提供了神經科和其他醫學專科負責任 AI 整合的原則，有潛力透過學習型醫療保健系統改善診斷準確性、減少照護差異，並推進臨床知識。
 
-We present clinguin, a system for ASP-driven user interface design. Clinguin
-streamlines the development of user interfaces for ASP developers by letting
-them build interactive prototypes directly in ASP, eliminating the need for
-separate frontend languages. To this end, clinguin uses a few dedicated
-predicates to define user interfaces and the treatment of user-triggered
-events. This simple design greatly facilitates the specification of user
-interactions with an ASP system, in our case clingo.
+##### **Primary Care Diagnoses as a Reliable Predictor for Orthopedic Surgical Interventions**
+2502.04423v1 by Khushboo Verma, Alan Michels, Ergi Gumusaneli, Shilpa Chitnis, Smita Sinha Kumar, Christopher Thompson, Lena Esmail, Guruprasath Srinivasan, Chandini Panchada, Sushovan Guha, Satwant Kumar
 
-摘要：我們提出 clinguin，一個用於 ASP 驅動使用者介面設計的系統。Clinguin 透過讓 ASP 開發人員直接在 ASP 中建立互動式原型，簡化了使用者介面的開發，消除了對個別前端語言的需求。為此，clinguin 使用一些專用的謂詞來定義使用者介面和處理使用者觸發的事件。這個簡單的設計極大地簡化了使用者與 ASP 系統互動的規範，在我們的案例中是 clingo。
+Referral workflow inefficiencies, including misaligned referrals and delays,
+contribute to suboptimal patient outcomes and higher healthcare costs. In this
+study, we investigated the possibility of predicting procedural needs based on
+primary care diagnostic entries, thereby improving referral accuracy,
+streamlining workflows, and providing better care to patients. A de-identified
+dataset of 2,086 orthopedic referrals from the University of Texas Health at
+Tyler was analyzed using machine learning models built on Base General
+Embeddings (BGE) for semantic extraction. To ensure real-world applicability,
+noise tolerance experiments were conducted, and oversampling techniques were
+employed to mitigate class imbalance. The selected optimum and parsimonious
+embedding model demonstrated high predictive accuracy (ROC-AUC: 0.874, Matthews
+Correlation Coefficient (MCC): 0.540), effectively distinguishing patients
+requiring surgical intervention. Dimensionality reduction techniques confirmed
+the model's ability to capture meaningful clinical relationships. A threshold
+sensitivity analysis identified an optimal decision threshold (0.30) to balance
+precision and recall, maximizing referral efficiency. In the predictive
+modeling analysis, the procedure rate increased from 11.27% to an optimal
+60.1%, representing a 433% improvement with significant implications for
+operational efficiency and healthcare revenue.
+  The results of our study demonstrate that referral optimization can enhance
+primary and surgical care integration. Through this approach, precise and
+timely predictions of procedural requirements can be made, thereby minimizing
+delays, improving surgical planning, and reducing administrative burdens. In
+addition, the findings highlight the potential of clinical decision support as
+a scalable solution for improving patient outcomes and the efficiency of the
+healthcare system.
 
-##### **Pearce's Characterisation in an Epistemic Domain**
-2502.09221v1 by Ezgi Iraz Su
+摘要：轉診流程效率低落，包括轉診不當和延誤，
+導致次優的患者結果和更高的醫療保健成本。在這
+項研究中，我們探討了根據初級保健診斷條目預測程序需求的可能性，從而提高轉診準確性，
+簡化工作流程，並為患者提供更好的照護。一個去識別化
+德克薩斯大學健康中心的 2,086 個骨科轉診的資料集
+泰勒使用建立在基本通用
+語義提取的嵌入 (BGE) 上的機器學習模型進行分析。為了確保現實世界的適用性，
+進行了噪聲容忍度實驗，並採用了過採樣技術來減輕類別不平衡。所選的最佳和簡約
+嵌入模型展示了高預測準確度 (ROC-AUC：0.874，馬修斯
+相關系數 (MCC)：0.540)，有效區分需要手術干預的患者。降維
+技術證實了模型捕捉有意義的臨床關係的能力。閾值
+敏感性分析確定了一個最佳決策閾值 (0.30) 來平衡
+精確度和召回率，最大化轉診效率。在預測中
+建模分析中，程序率從 11.27% 增加到最佳的
+60.1%，代表 433% 的改進，對運營效率和醫療保健收入具有重大影響。
+我們研究的結果表明，轉診優化可以增強
+初級和外科護理整合。通過這種方法，可以對程序需求進行準確及時的預測，從而最大程度地減少
+延誤，改善手術計劃，並減輕行政負擔。此外，研究結果強調了臨床決策支持作為
+一個可擴展的解決方案的潛力，用於改善患者結果和醫療保健系統的效率。
 
-Answer-set programming (ASP) is a successful problem-solving approach in
-logic-based AI. In ASP, problems are represented as declarative logic programs,
-and solutions are identified through their answer sets. Equilibrium logic (EL)
-is a general-purpose nonmonotonic reasoning formalism, based on a monotonic
-logic called here-and-there logic. EL was basically proposed by Pearce as a
-foundational framework of ASP. Epistemic specifications (ES) are extensions of
-ASP-programs with subjective literals. These new modal constructs in the
-ASP-language make it possible to check whether a regular literal of ASP is true
-in every (or some) answer-set of a program. ES-programs are interpreted by
-world-views, which are essentially collections of answer-sets. (Reflexive)
-autoepistemic logic is a nonmonotonic formalism, modeling self-belief
-(knowledge) of ideally rational agents. A relatively new semantics for ES is
-based on a combination of EL and (reflexive) autoepistemic logic. In this
-paper, we first propose an overarching framework in the epistemic ASP domain.
-We then establish a correspondence between existing (reflexive) (auto)epistemic
-equilibrium logics and our easily-adaptable comprehensive framework, building
-on Pearce's characterisation of answer-sets as equilibrium models. We achieve
-this by extending Ferraris' work on answer sets for propositional theories to
-the epistemic case and reveal the relationship between some ES-semantic
-proposals.
+##### **Automatic quantification of breast cancer biomarkers from multiple 18F-FDG PET image segmentation**
+2502.04083v1 by Tewele W. Tareke, Neree Payan, Alexandre Cochet, Laurent Arnould, Benoit Presles, Jean-Marc Vrigneaud, Fabrice Meriaudeau, Alain Lalande
 
-摘要：<paragraph>答案集程式設計（ASP）是基於邏輯的人工智慧中一種成功的問題解決方法。在 ASP 中，問題表示為宣告式邏輯程式，並透過其答案集來找出解答。平衡邏輯（EL）是一種通用的非單調推理形式主義，基於一種稱為此處和彼處邏輯的單調邏輯。EL 基本是由 Pearce 作為 ASP 的基礎架構所提出。知識規範（ES）是 ASP 程式與主觀文字的延伸。ASP 語言中的這些新模態建構使得可以檢查 ASP 的常規文字是否在程式的每個（或某些）答案集中為真。ES 程式由世界觀來詮釋，其本質上是答案集的集合。（反身）自認識邏輯是一種非單調形式主義，用來建模理想理性主體的自信念（知識）。ES 的一種相對新的語意是基於 EL 和（反身）自認識邏輯的組合。在本文中，我們首先提出一個涵蓋知識 ASP 領域的架構。然後，我們建立現有（反身）（自）認識平衡邏輯與我們容易適應的綜合架構之間的對應關係，建立在 Pearce 將答案集描述為平衡模型的特性之上。我們透過將 Ferraris 在命題理論的答案集上的工作延伸到知識案例，並揭示一些 ES 語義提案之間的關係來達成這一點。</paragraph>
+Neoadjuvant chemotherapy (NAC) has become a standard clinical practice for
+tumor downsizing in breast cancer with 18F-FDG Positron Emission Tomography
+(PET). Our work aims to leverage PET imaging for the segmentation of breast
+lesions. The focus is on developing an automated system that accurately
+segments primary tumor regions and extracts key biomarkers from these areas to
+provide insights into the evolution of breast cancer following the first course
+of NAC. 243 baseline 18F-FDG PET scans (PET_Bl) and 180 follow-up 18F-FDG PET
+scans (PET_Fu) were acquired before and after the first course of NAC,
+respectively. Firstly, a deep learning-based breast tumor segmentation method
+was developed. The optimal baseline model (model trained on baseline exams) was
+fine-tuned on 15 follow-up exams and adapted using active learning to segment
+tumor areas in PET_Fu. The pipeline computes biomarkers such as maximum
+standardized uptake value (SUVmax), metabolic tumor volume (MTV), and total
+lesion glycolysis (TLG) to evaluate tumor evolution between PET_Fu and PET_Bl.
+Quality control measures were employed to exclude aberrant outliers. The nnUNet
+deep learning model outperformed in tumor segmentation on PET_Bl, achieved a
+Dice similarity coefficient (DSC) of 0.89 and a Hausdorff distance (HD) of 3.52
+mm. After fine-tuning, the model demonstrated a DSC of 0.78 and a HD of 4.95 mm
+on PET_Fu exams. Biomarkers analysis revealed very strong correlations whatever
+the biomarker between manually segmented and automatically predicted regions.
+The significant average decrease of SUVmax, MTV and TLG were 5.22, 11.79 cm3
+and 19.23 cm3, respectively. The presented approach demonstrates an automated
+system for breast tumor segmentation from 18F-FDG PET. Thanks to the extracted
+biomarkers, our method enables the automatic assessment of cancer progression.
 
-##### **Graphical Conditions for the Existence, Unicity and Number of Regular Models**
-2502.09220v1 by Van-Giang Trinh, Belaid Benhamou, Sylvain Soliman, François Fages
+摘要：新辅助化疗 (NAC) 已成为乳腺癌中采用 18F-FDG 正电子发射断层扫描 (PET) 进行肿瘤缩小的标准临床实践。我们的工作旨在利用 PET 影像分割乳腺病变。重点在于开发一个自动系统，该系统可以准确分割原发性肿瘤区域并从这些区域提取关键生物标记，以深入了解乳腺癌在第一疗程 NAC 后的演变。分别在第一疗程 NAC 之前和之后采集了 243 例基线 18F-FDG PET 扫描 (PET_Bl) 和 180 例随访 18F-FDG PET 扫描 (PET_Fu)。首先，开发了一种基于深度学习的乳腺肿瘤分割方法。对 15 例随访检查对最优基线模型（在基线检查中训练的模型）进行了微调，并使用主动学习对 PET_Fu 中的肿瘤区域进行了分割。该管道计算诸如最大标准摄取值 (SUVmax)、代谢肿瘤体积 (MTV) 和总病灶糖酵解 (TLG) 等生物标记，以评估 PET_Fu 和 PET_Bl 之间的肿瘤演变。采用质量控制措施来排除异常值。nnUNet 深度学习模型在 PET_Bl 上的肿瘤分割方面表现出色，达到 0.89 的 Dice 相似性系数 (DSC) 和 3.52 毫米的 Hausdorff 距离 (HD)。微调后，该模型在 PET_Fu 检查中显示出 0.78 的 DSC 和 4.95 毫米的 HD。无论手动分割区域和自动预测区域之间的生物标记如何，生物标记分析都显示出非常强的相关性。SUVmax、MTV 和 TLG 的平均显着下降分别为 5.22、11.79 cm3 和 19.23 cm3。所提出的方法展示了一个用于从 18F-FDG PET 分割乳腺肿瘤的自动化系统。由于提取了生物标记，我们的方法能够自动评估癌症进展。
 
-The regular models of a normal logic program are a particular type of partial
-(i.e. 3-valued) models which correspond to stable partial models with minimal
-undefinedness. In this paper, we explore graphical conditions on the dependency
-graph of a finite ground normal logic program to analyze the existence, unicity
-and number of regular models for the program. We show three main results: 1) a
-necessary condition for the existence of non-trivial (i.e. non-2-valued)
-regular models, 2) a sufficient condition for the unicity of regular models,
-and 3) two upper bounds for the number of regular models based on positive
-feedback vertex sets. The first two conditions generalize the finite cases of
-the two existing results obtained by You and Yuan (1994) for normal logic
-programs with well-founded stratification. The third result is also new to the
-best of our knowledge. Key to our proofs is a connection that we establish
-between finite ground normal logic programs and Boolean network theory.
+##### **Generalize Drug Response Prediction by Latent Independent Projection for Asymmetric Constrained Domain Generalization**
+2502.04034v1 by Ran Song, Yinpu Bai, Hui Liu
+
+The accurate prediction of drug responses remains a formidable challenge,
+particularly at the single-cell level and in clinical treatment contexts. Some
+studies employ transfer learning techniques to predict drug responses in
+individual cells and patients, but they require access to target-domain data
+during training, which is often unavailable or only obtainable in future. In
+this study, we propose a novel domain generalization framework, termed
+panCancerDR, to address this challenge. We conceptualize each cancer type as a
+distinct source domain, with its cell lines serving as domain-specific samples.
+Our primary objective is to extract domain-invariant features from the
+expression profiles of cell lines across diverse cancer types, thereby
+generalize the predictive capacity to out-of-distribution samples. To enhance
+robustness, we introduce a latent independence projection (LIP) module that
+encourages the encoder to extract informative yet non-redundant features. Also,
+we propose an asymmetric adaptive clustering constraint, which clusters
+drug-sensitive samples into a compact group while drives resistant samples
+dispersed across separate clusters in the latent space. Our empirical
+experiments demonstrate that panCancerDR effectively learns task-relevant
+features from diverse source domains, and achieves accurate predictions of drug
+response for unseen cancer type during training. Furthermore, when evaluated on
+single-cell and patient-level prediction tasks, our model-trained solely on in
+vitro cell line data without access to target-domain information-consistently
+outperforms and matched current state-of-the-art methods. These findings
+highlights the potential of our method for real-world clinical applications.
 
-摘要：正规模型的常规模型是一种特殊类型的局部模型（即 3 值）模型，它对应于具有最小未定义性的稳定局部模型。在本文中，我们探索了有限接地正规逻辑程序的依赖图上的图形条件，以分析程序的正规模型的存在性、唯一性和数量。我们展示了三个主要结果：1) 非平凡（即非 2 值）正规模型存在的必要条件，2) 正规模型唯一性的充分条件，3) 基于正反馈顶点集的正规模型数目的两个上限。前两个条件概括了 You 和 Yuan (1994) 为具有良好基础分层的正规逻辑程序获得的两个现有结果的有限情况。据我们所知，第三个结果也是新的。我们证明的关键是我们在有限接地正规逻辑程序和布尔网络理论之间建立的联系。
+摘要：<paragraph>準確預測藥物反應仍然是一項艱鉅的挑戰，特別是在單細胞層級和臨床治療背景中。一些研究採用遷移學習技術來預測個別細胞和患者的藥物反應，但它們需要在訓練期間存取目標網域資料，而這些資料通常無法取得，或只能在未來取得。在這項研究中，我們提出一個新穎的網域概化架構，稱為 panCancerDR，以應對這項挑戰。我們將每種類型的癌症概念化為一個不同的來源網域，其細胞株作為特定網域的樣本。我們的首要目標是從不同癌症類型的細胞株表現特徵中萃取網域不變特徵，從而將預測能力概化到分布外的樣本。為了增強穩健性，我們引入一個潛在獨立投影 (LIP) 模組，鼓勵編碼器萃取有資訊但非冗餘的特徵。此外，我們提出一個非對稱自適應聚類約束，將對藥物敏感的樣本聚類到一個緊湊的群組中，同時驅動抗藥性樣本分散在潛在空間中的不同群組中。我們的實證實驗證明，panCancerDR 有效地從不同的來源網域學習與任務相關的特徵，並在訓練期間對未見的癌症類型實現準確的藥物反應預測。此外，當在單細胞和患者層級預測任務中進行評估時，我們的模型僅在體外細胞株資料上訓練，而沒有存取目標網域資訊，始終優於並符合當前的最新方法。這些發現突顯了我們的方法在實際臨床應用中的潛力。</paragraph>
 
-##### **Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**
-2502.09218v1 by Flavio Bertini, Alessandro Dal Palù, Federica Zaglio, Francesco Fabiano, Andrea Formisano
+##### **MedRAG: Enhancing Retrieval-augmented Generation with Knowledge Graph-Elicited Reasoning for Healthcare Copilot**
+2502.04413v1 by Xuejiao Zhao, Siyan Liu, Su-Yin Yang, Chunyan Miao
 
-This paper presents a complete explainable system that interprets a set of
-data, abstracts the underlying features and describes them in a natural
-language of choice. The system relies on two crucial stages: (i) identifying
-emerging properties from data and transforming them into abstract concepts, and
-(ii) converting these concepts into natural language. Despite the impressive
-natural language generation capabilities demonstrated by Large Language Models,
-their statistical nature and the intricacy of their internal mechanism still
-force us to employ these techniques as black boxes, forgoing trustworthiness.
-Developing an explainable pipeline for data interpretation would allow
-facilitating its use in safety-critical environments like processing medical
-information and allowing non-experts and visually impaired people to access
-narrated information. To this end, we believe that the fields of knowledge
-representation and automated reasoning research could present a valid
-alternative. Expanding on prior research that tackled the first stage (i), we
-focus on the second stage, named Concept2Text. Being explainable, data
-translation is easily modeled through logic-based rules, once again emphasizing
-the role of declarative programming in achieving AI explainability. This paper
-explores a Prolog/CLP-based rewriting system to interpret concepts-articulated
-in terms of classes and relations, plus common knowledge-derived from a generic
-ontology, generating natural language text. Its main features include
-hierarchical tree rewritings, modular multilingual generation, support for
-equivalent variants across semantic, grammar, and lexical levels, and a
-transparent rule-based system. We outline the architecture and demonstrate its
-flexibility through some examples capable of generating numerous diverse and
-equivalent rewritings based on the input concept.
+Retrieval-augmented generation (RAG) is a well-suited technique for
+retrieving privacy-sensitive Electronic Health Records (EHR). It can serve as a
+key module of the healthcare copilot, helping reduce misdiagnosis for
+healthcare practitioners and patients. However, the diagnostic accuracy and
+specificity of existing heuristic-based RAG models used in the medical domain
+are inadequate, particularly for diseases with similar manifestations. This
+paper proposes MedRAG, a RAG model enhanced by knowledge graph (KG)-elicited
+reasoning for the medical domain that retrieves diagnosis and treatment
+recommendations based on manifestations. MedRAG systematically constructs a
+comprehensive four-tier hierarchical diagnostic KG encompassing critical
+diagnostic differences of various diseases. These differences are dynamically
+integrated with similar EHRs retrieved from an EHR database, and reasoned
+within a large language model. This process enables more accurate and specific
+decision support, while also proactively providing follow-up questions to
+enhance personalized medical decision-making. MedRAG is evaluated on both a
+public dataset DDXPlus and a private chronic pain diagnostic dataset (CPDD)
+collected from Tan Tock Seng Hospital, and its performance is compared against
+various existing RAG methods. Experimental results show that, leveraging the
+information integration and relational abilities of the KG, our MedRAG provides
+more specific diagnostic insights and outperforms state-of-the-art models in
+reducing misdiagnosis rates. Our code will be available at
+https://github.com/SNOWTEAM2023/MedRAG
 
-摘要：<paragraph>這篇論文提出了一個完整的可解釋系統，它可以解釋一組資料，抽象出基礎特徵，並以選擇的自然語言描述它們。系統依賴兩個關鍵階段：(i) 從資料中識別新興屬性，並將它們轉換為抽象概念，以及 (ii) 將這些概念轉換為自然語言。儘管大型語言模型展示了令人印象深刻的自然語言生成能力，但它們的統計性質和內部機制的複雜性仍然迫使我們將這些技術用作黑盒子，放棄可信度。開發一個可解釋的資料解釋管道將有助於促進在安全關鍵環境中使用它，例如處理醫療資訊，並允許非專家和視障人士存取敘述資訊。為此，我們相信知識表示和自動推理研究領域可以提出一個有效的替代方案。在擴展解決第一階段 (i) 的先前研究的基礎上，我們專注於第二階段，稱為 Concept2Text。由於具有可解釋性，資料翻譯很容易透過基於邏輯的規則建模，再次強調宣告式程式設計在實現 AI 可解釋性中的作用。本文探討了一個基於 Prolog/CLP 的重寫系統，以解釋概念，這些概念以類別和關係的形式表達，再加上從通用本体衍生的常識，產生自然語言文字。它的主要特點包括階層樹重寫、模組化多語言生成、支援語義、語法和詞彙層面的等效變體，以及一個透明的基於規則的系統。我們概述了架構，並透過一些範例展示了它的靈活性，這些範例能夠根據輸入概念生成許多不同的等效重寫。</paragraph>
+摘要：檢索增強生成 (RAG) 是一種適用於檢索隱私敏感的電子健康記錄 (EHR) 的技術。它可以作為醫療保健副駕駛的一個關鍵模組，協助減少醫療保健從業人員和患者的誤診。然而，在醫療領域中使用的現有基於啟發法的 RAG 模型的診斷準確性和特異性不足，特別是對於具有類似表現的疾病。本文提出 MedRAG，一種由知識圖譜 (KG) 引發的推理增強的 RAG 模型，用於醫療領域，它根據表現檢索診斷和治療建議。MedRAG 系統性地構建了一個全面的四層階層式診斷 KG，涵蓋各種疾病的關鍵診斷差異。這些差異與從 EHR 資料庫中檢索到的類似 EHR 動態整合，並在大型語言模型中進行推理。這個過程可以實現更準確和具體的決策支援，同時主動提供後續問題，以增強個人化醫療決策制定。MedRAG 在公共資料集 DDXPlus 和從陳篤生醫院收集的私人慢性疼痛診斷資料集 (CPDD) 上進行評估，並將其效能與各種現有 RAG 方法進行比較。實驗結果顯示，利用 KG 的資訊整合和關係能力，我們的 MedRAG 提供了更具體的診斷見解，並在降低誤診率方面優於最先進的模型。我們的程式碼將在 https://github.com/SNOWTEAM2023/MedRAG 提供
 
-##### **Mind the Gaps: Logical English, Prolog, and Multi-agent Systems for Autonomous Vehicles**
-2502.09216v1 by Galileo Sartor, Adam Wyner, Giuseppe Contissa
+##### **Transforming Multimodal Models into Action Models for Radiotherapy**
+2502.04408v1 by Matteo Ferrante, Alessandra Carosi, Rolando Maria D Angelillo, Nicola Toschi
 
-In this paper, we present a modular system for representing and reasoning
-with legal aspects of traffic rules for autonomous vehicles. We focus on a
-subset of the United Kingdom's Highway Code (HC) related to junctions. As human
-drivers and automated vehicles (AVs) will interact on the roads, especially in
-urban environments, we claim that an accessible, unitary, high-level
-computational model should exist and be applicable to both users. Autonomous
-vehicles introduce a shift in liability that should not bring disadvantages or
-increased burden on human drivers. We develop a system "in silico" of the
-model. The proposed system is built of three main components: a natural
-language interface, using Logical English, which encodes the rules; an internal
-representation of the rules in Prolog; and an multi-agent-based simulation
-environment, built in NetLogo. The three components interact: Logical English
-is translated into and out of Prolog (along with some support code); Prolog and
-NetLogo interface via predicates. Such a modular approach enables the different
-components to carry different "burdens" in the overall system; it also allows
-swapping of modules. Given NetLogo, we can visualize the effect of the modeled
-rules as well as validate the system with a simple dynamic running scenario.
-Designated agents monitor the behaviour of the vehicles for compliance and
-record potential violations where they occur. The information on potential
-violations is then utilized by Validators, to determine whether the violation
-is punishable, differentiating between exceptions and cases.
+Radiotherapy is a crucial cancer treatment that demands precise planning to
+balance tumor eradication and preservation of healthy tissue. Traditional
+treatment planning (TP) is iterative, time-consuming, and reliant on human
+expertise, which can potentially introduce variability and inefficiency. We
+propose a novel framework to transform a large multimodal foundation model
+(MLM) into an action model for TP using a few-shot reinforcement learning (RL)
+approach. Our method leverages the MLM's extensive pre-existing knowledge of
+physics, radiation, and anatomy, enhancing it through a few-shot learning
+process. This allows the model to iteratively improve treatment plans using a
+Monte Carlo simulator. Our results demonstrate that this method outperforms
+conventional RL-based approaches in both quality and efficiency, achieving
+higher reward scores and more optimal dose distributions in simulations on
+prostate cancer data. This proof-of-concept suggests a promising direction for
+integrating advanced AI models into clinical workflows, potentially enhancing
+the speed, quality, and standardization of radiotherapy treatment planning.
 
-摘要：<paragraph>在本文中，我們提出了一個模組化系統，用於表示和推理自動駕駛車輛交通規則的法律層面。我們專注於與路口相關的英國公路法規 (HC) 子集。由於人類駕駛和自動駕駛車輛 (AV) 將在道路上互動，尤其是在城市環境中，我們主張應存在一個可存取、統一、高階的運算模型，並適用於這兩種使用者。自動駕駛車輛引入了責任轉移，不應給人類駕駛帶來劣勢或增加負擔。我們開發了一個模型的「電腦模擬」系統。所提出的系統由三個主要組成部分建構而成：使用邏輯英語的自然語言介面，用於編碼規則；使用 Prolog 的規則內部表示；以及使用 NetLogo 建構的多主體模擬環境。這三個組成部分會進行互動：邏輯英語會翻譯成 Prolog（以及一些支援程式碼），再從 Prolog 翻譯回來；Prolog 和 NetLogo 會透過謂詞進行介面。這種模組化方法讓不同的組成部分可以在整體系統中承擔不同的「負擔」；它也允許模組交換。有了 NetLogo，我們可以視覺化已建模規則的效果，並使用一個簡單的動態執行範例來驗證系統。指定的代理會監控車輛的行為，以確保遵守規定，並記錄發生的潛在違規行為。然後，驗證者會利用潛在違規行為的資訊，來確定違規行為是否應受懲罰，並區分例外情況和案例。</paragraph>
+摘要：放射治療是一種重要的癌症治療方法，需要精確的規劃來平衡腫瘤根除和健康組織的保留。傳統的治療規劃（TP）是反覆的、耗時的，並且依賴於人為專業知識，這可能會引入變異性和低效率。我們提出了一個新穎的框架，使用少次強化學習 (RL) 方法將大型多模態基礎模型 (MLM) 轉換為 TP 的動作模型。我們的模型利用了 MLM 對物理、輻射和解剖學的廣泛預先存在的知識，並通過少次學習過程對其進行增強。這允許模型使用蒙特卡羅模擬器反覆改進治療計劃。我們的結果表明，這種方法在質量和效率方面都優於基於傳統 RL 的方法，在對前列腺癌數據進行模擬時，獲得了更高的獎勵分數和更優化的劑量分佈。這個概念驗證表明了一個有希望的方向，即將先進的人工智慧模型整合到臨床工作流程中，從而有可能提高放射治療計劃的速度、質量和標準化。
 
-##### **Architecture for Simulating Behavior Mode Changes in Norm-Aware Autonomous Agents**
-2502.09215v1 by Sean Glaze, Daniela Inclezan
+##### **Online Location Planning for AI-Defined Vehicles: Optimizing Joint Tasks of Order Serving and Spatio-Temporal Heterogeneous Model Fine-Tuning**
+2502.04399v1 by Bokeng Zheng, Bo Rao, Tianxiang Zhu, Chee Wei Tan, Jingpu Duan, Zhi Zhou, Xu Chen, Xiaoxi Zhang
 
-This paper presents an architecture for simulating the actions of a
-norm-aware intelligent agent whose behavior with respect to norm compliance is
-set, and can later be changed, by a human controller. Updating an agent's
-behavior mode from a norm-abiding to a riskier one may be relevant when the
-agent is involved in time-sensitive rescue operations, for example. We base our
-work on the Authorization and Obligation Policy Language AOPL designed by
-Gelfond and Lobo for the specification of norms. We introduce an architecture
-and a prototype software system that can be used to simulate an agent's plans
-under different behavior modes that can later be changed by the controller. We
-envision such software to be useful to policy makers, as they can more readily
-understand how agents may act in certain situations based on the agents'
-attitudes towards norm-compliance. Policy makers may then refine their policies
-if simulations show unwanted consequences.
+Advances in artificial intelligence (AI) including foundation models (FMs),
+are increasingly transforming human society, with smart city driving the
+evolution of urban living.Meanwhile, vehicle crowdsensing (VCS) has emerged as
+a key enabler, leveraging vehicles' mobility and sensor-equipped capabilities.
+In particular, ride-hailing vehicles can effectively facilitate flexible data
+collection and contribute towards urban intelligence, despite resource
+limitations. Therefore, this work explores a promising scenario, where
+edge-assisted vehicles perform joint tasks of order serving and the emerging
+foundation model fine-tuning using various urban data. However, integrating the
+VCS AI task with the conventional order serving task is challenging, due to
+their inconsistent spatio-temporal characteristics: (i) The distributions of
+ride orders and data point-of-interests (PoIs) may not coincide in geography,
+both following a priori unknown patterns; (ii) they have distinct forms of
+temporal effects, i.e., prolonged waiting makes orders become instantly invalid
+while data with increased staleness gradually reduces its utility for model
+fine-tuning.To overcome these obstacles, we propose an online framework based
+on multi-agent reinforcement learning (MARL) with careful augmentation. A new
+quality-of-service (QoS) metric is designed to characterize and balance the
+utility of the two joint tasks, under the effects of varying data volumes and
+staleness. We also integrate graph neural networks (GNNs) with MARL to enhance
+state representations, capturing graph-structured, time-varying dependencies
+among vehicles and across locations. Extensive experiments on our testbed
+simulator, utilizing various real-world foundation model fine-tuning tasks and
+the New York City Taxi ride order dataset, demonstrate the advantage of our
+proposed method.
 
-摘要：本文提出了一個架構，用於模擬一個規範感知智能代理的行為，其行為遵守規範，並可以由人類控制者設定，並可以在稍後進行更改。當代理參與時間敏感的救援行動時，將代理的行為模式從遵守規範更新為更冒險的行為模式可能是相關的。我們的工作基於 Gelfond 和 Lobo 為規範規範設計的授權和義務政策語言 AOPL。我們引入了一個架構和一個原型軟體系統，可用於模擬代理在不同行為模式下的計畫，這些行為模式稍後可以由控制者更改。我們預計此類軟體對政策制定者很有用，因為他們可以更容易地根據代理對規範遵守的態度了解代理在特定情況下的行為方式。如果模擬顯示出不希望的後果，政策制定者可以修改他們的政策。
+摘要：人工智能（AI）的進展，包括基礎模型（FM），正日益轉變人類社會，智慧城市推動著城市生活的演進。同時，車輛群感測（VCS）已成為關鍵推動因素，利用車輛的機動性和配備感測器的能力。特別是，儘管有資源限制，叫車服務車輛能有效促進靈活的資料收集，並有助於城市智慧。因此，這項工作探索了一個有前途的場景，其中邊緣輔助車輛執行訂單服務和新興基礎模型微調的聯合任務，使用各種城市資料。然而，由於 VCS AI 任務與傳統訂單服務任務的不一致時空特徵，整合它們具有挑戰性：(i) 叫車訂單和資料感興趣點 (PoI) 的分佈在地域上可能不重合，兩者都遵循先驗未知的模式；(ii) 它們具有不同的時間效應形式，即長時間等待會使訂單立即失效，而過時的資料會逐漸降低其對模型微調的效用。為了解決這些障礙，我們提出了一個基於多智能體強化學習 (MARL) 的線上架構，並進行了仔細的擴充。設計了一個新的服務品質 (QoS) 指標，用於表徵和平衡這兩個聯合任務的效用，在不同資料量和過時性的影響下。我們還將圖神經網路（GNN）與 MARL 整合，以增強狀態表示，捕捉車輛之間和不同地點之間的圖結構、時變依賴性。在我們的測試平台模擬器上進行的廣泛實驗，利用各種真實世界的基礎模型微調任務和紐約市計程車叫車訂單資料集，證明了我們提出的方法的優點。
 
-##### **Neuro-Symbolic Contrastive Learning for Cross-domain Inference**
-2502.09213v1 by Mingyue Liu, Ryo Ueda, Zhen Wan, Katsumi Inoue, Chris G. Willcocks
+##### **Multimodal Medical Code Tokenizer**
+2502.04397v2 by Xiaorui Su, Shvat Messica, Yepeng Huang, Ruth Johnson, Lukas Fesser, Shanghua Gao, Faryad Sahneh, Marinka Zitnik
 
-Pre-trained language models (PLMs) have made significant advances in natural
-language inference (NLI) tasks, however their sensitivity to textual
-perturbations and dependence on large datasets indicate an over-reliance on
-shallow heuristics. In contrast, inductive logic programming (ILP) excels at
-inferring logical relationships across diverse, sparse and limited datasets,
-but its discrete nature requires the inputs to be precisely specified, which
-limits their application. This paper proposes a bridge between the two
-approaches: neuro-symbolic contrastive learning. This allows for smooth and
-differentiable optimisation that improves logical accuracy across an otherwise
-discrete, noisy, and sparse topological space of logical functions. We show
-that abstract logical relationships can be effectively embedded within a
-neuro-symbolic paradigm, by representing data as logic programs and sets of
-logic rules. The embedding space captures highly varied textual information
-with similar semantic logical relations, but can also separate similar textual
-relations that have dissimilar logical relations. Experimental results
-demonstrate that our approach significantly improves the inference capabilities
-of the models in terms of generalisation and reasoning.
+Foundation models trained on patient electronic health records (EHRs) require
+tokenizing medical data into sequences of discrete vocabulary items. Existing
+tokenizers treat medical codes from EHRs as isolated textual tokens. However,
+each medical code is defined by its textual description, its position in
+ontological hierarchies, and its relationships to other codes, such as disease
+co-occurrences and drug-treatment associations. Medical vocabularies contain
+more than 600,000 codes with critical information for clinical reasoning. We
+introduce MedTok, a multimodal medical code tokenizer that uses the text
+descriptions and relational context of codes. MedTok processes text using a
+language model encoder and encodes the relational structure with a graph
+encoder. It then quantizes both modalities into a unified token space,
+preserving modality-specific and cross-modality information. We integrate
+MedTok into five EHR models and evaluate it on operational and clinical tasks
+across in-patient and out-patient datasets, including outcome prediction,
+diagnosis classification, drug recommendation, and risk stratification.
+Swapping standard EHR tokenizers with MedTok improves AUPRC across all EHR
+models, by 4.10% on MIMIC-III, 4.78% on MIMIC-IV, and 11.30% on EHRShot, with
+the largest gains in drug recommendation. Beyond EHR modeling, we demonstrate
+using MedTok tokenizer with medical QA systems. Our results demonstrate the
+potential of MedTok as a unified tokenizer for medical codes, improving
+tokenization for medical foundation models.
 
-摘要：預訓練語言模型 (PLM) 在自然語言推理 (NLI) 任務中取得了重大進展，然而它們對文本擾動的敏感性和對大型資料集的依賴性表明過度依賴於淺層啟發法。相比之下，歸納邏輯規劃 (ILP) 擅長推論跨越多樣化、稀疏和有限資料集的邏輯關係，但其離散性質要求輸入被精確指定，這限制了它們的應用。本文提出了兩種方法之間的橋樑：神經符號對比學習。這允許平滑且可微分的優化，從而提高邏輯函數的離散、嘈雜和稀疏拓撲空間中的邏輯準確性。我們展示了抽象邏輯關係可以通過將資料表示為邏輯程式和邏輯規則集，有效地嵌入到神經符號範例中。嵌入空間捕獲具有相似語義邏輯關係的高度多變的文本資訊，但也可以分離具有不同邏輯關係的相似文本關係。實驗結果表明，我們的做法在泛化和推理方面顯著提高了模型的推理能力。
+摘要：<paragraph>在患者电子健康记录 (EHR) 上训练的基础模型需要将医学数据标记为离散词汇项序列。现有的标记器将 EHR 中的医学代码视为孤立的文本标记。然而，每个医学代码都由其文本描述、在本体层次结构中的位置以及与其他代码的关系（例如疾病共现和药物治疗关联）来定义。医学词汇表包含超过 600,000 个代码，这些代码包含临床推理的关键信息。我们引入了 MedTok，这是一种多模态医学代码标记器，它使用文本描述和代码的关系上下文。MedTok 使用语言模型编码器处理文本，并使用图编码器对关系结构进行编码。然后，它将这两种模态量化为一个统一的标记空间，保留特定于模态和跨模态的信息。我们将 MedTok 集成到五个 EHR 模型中，并在住院和门诊数据集（包括结果预测、诊断分类、药物推荐和风险分层）上对其实施操作和临床任务进行评估。用 MedTok 替换标准 EHR 标记器可提高所有 EHR 模型的 AUPRC，在 MIMIC-III 上提高 4.10%，在 MIMIC-IV 上提高 4.78%，在 EHRShot 上提高 11.30%，其中药物推荐的增益最大。除了 EHR 建模之外，我们还演示了将 MedTok 标记器与医学问答系统结合使用。我们的结果证明了 MedTok 作为医学代码的统一标记器的潜力，改进了医学基础模型的标记化。</paragraph>
 
-##### **LP-LM: No Hallucinations in Question Answering with Logic Programming**
-2502.09212v1 by Katherine Wu, Yanhong A. Liu
+##### **A Retrospective Systematic Study on Hierarchical Sparse Query Transformer-assisted Ultrasound Screening for Early Hepatocellular Carcinoma**
+2502.03772v1 by Chaoyin She, Ruifang Lu, Danni He, Jiayi Lv, Yadan Lin, Meiqing Cheng, Hui Huang, Lida Chen, Wei Wang, Qinghua Huang
 
-Large language models (LLMs) are able to generate human-like responses to
-user queries. However, LLMs exhibit inherent limitations, especially because
-they hallucinate. This paper introduces LP-LM, a system that grounds answers to
-questions in known facts contained in a knowledge base (KB), facilitated
-through semantic parsing in Prolog, and always produces answers that are
-reliable.
-  LP-LM generates a most probable constituency parse tree along with a
-corresponding Prolog term for an input question via Prolog definite clause
-grammar (DCG) parsing. The term is then executed against a KB of natural
-language sentences also represented as Prolog terms for question answering. By
-leveraging DCG and tabling, LP-LM runs in linear time in the size of input
-sentences for sufficiently many grammar rules. Performing experiments comparing
-LP-LM with current well-known LLMs in accuracy, we show that LLMs hallucinate
-on even simple questions, unlike LP-LM.
+Hepatocellular carcinoma (HCC) ranks as the third leading cause of
+cancer-related mortality worldwide, with early detection being crucial for
+improving patient survival rates. However, early screening for HCC using
+ultrasound suffers from insufficient sensitivity and is highly dependent on the
+expertise of radiologists for interpretation. Leveraging the latest
+advancements in artificial intelligence (AI) in medical imaging, this study
+proposes an innovative Hierarchical Sparse Query Transformer (HSQformer) model
+that combines the strengths of Convolutional Neural Networks (CNNs) and Vision
+Transformers (ViTs) to enhance the accuracy of HCC diagnosis in ultrasound
+screening. The HSQformer leverages sparse latent space representations to
+capture hierarchical details at various granularities without the need for
+complex adjustments, and adopts a modular, plug-and-play design philosophy,
+ensuring the model's versatility and ease of use. The HSQformer's performance
+was rigorously tested across three distinct clinical scenarios: single-center,
+multi-center, and high-risk patient testing. In each of these settings, it
+consistently outperformed existing state-of-the-art models, such as ConvNext
+and SwinTransformer. Notably, the HSQformer even matched the diagnostic
+capabilities of senior radiologists and comprehensively surpassed those of
+junior radiologists. The experimental results from this study strongly
+demonstrate the effectiveness and clinical potential of AI-assisted tools in
+HCC screening. The full code is available at
+https://github.com/Asunatan/HSQformer.
+
+摘要：肝細胞癌（HCC）是全球第三大癌症相關死亡原因，早期檢測對於提高患者存活率至關重要。然而，使用超音波進行 HCC 早期篩檢的靈敏度不足，且高度依賴放射科醫師的專業知識進行判讀。本研究利用醫學影像中人工智慧（AI）的最新進展，提出了一種創新的分層稀疏查詢Transformer（HSQformer）模型，結合了卷積神經網路（CNN）和視覺Transformer（ViT）的優點，以提高超音波篩檢中 HCC 診斷的準確性。HSQformer 利用稀疏潛在空間表示，在不需要複雜調整的情況下擷取各種粒度層級的細節，並採用模組化、即插即用的設計理念，確保模型的多功能性和易用性。HSQformer 的效能經過三個不同的臨床場景的嚴格測試：單中心、多中心和高風險患者測試。在這些設定中，它始終優於現有的最先進模型，例如 ConvNext 和 SwinTransformer。值得注意的是，HSQformer 甚至匹配了資深放射科醫師的診斷能力，並全面超越了初級放射科醫師的診斷能力。本研究的實驗結果有力地證明了 AI 輔助工具在 HCC 篩檢中的有效性和臨床潛力。完整程式碼可在 https://github.com/Asunatan/HSQformer 取得。
+
+##### **Towards Fair Medical AI: Adversarial Debiasing of 3D CT Foundation Embeddings**
+2502.04386v1 by Guangyao Zheng, Michael A. Jacobs, Vladimir Braverman, Vishwa S. Parekh
+
+Self-supervised learning has revolutionized medical imaging by enabling
+efficient and generalizable feature extraction from large-scale unlabeled
+datasets. Recently, self-supervised foundation models have been extended to
+three-dimensional (3D) computed tomography (CT) data, generating compact,
+information-rich embeddings with 1408 features that achieve state-of-the-art
+performance on downstream tasks such as intracranial hemorrhage detection and
+lung cancer risk forecasting. However, these embeddings have been shown to
+encode demographic information, such as age, sex, and race, which poses a
+significant risk to the fairness of clinical applications.
+  In this work, we propose a Variation Autoencoder (VAE) based adversarial
+debiasing framework to transform these embeddings into a new latent space where
+demographic information is no longer encoded, while maintaining the performance
+of critical downstream tasks. We validated our approach on the NLST lung cancer
+screening dataset, demonstrating that the debiased embeddings effectively
+eliminate multiple encoded demographic information and improve fairness without
+compromising predictive accuracy for lung cancer risk at 1-year and 2-year
+intervals. Additionally, our approach ensures the embeddings are robust against
+adversarial bias attacks. These results highlight the potential of adversarial
+debiasing techniques to ensure fairness and equity in clinical applications of
+self-supervised 3D CT embeddings, paving the way for their broader adoption in
+unbiased medical decision-making.
 
-摘要：大型語言模型 (LLM) 能產生類似人類的回應來回答使用者的問題。然而，LLM 顯示出內在的限制，特別是因為它們會產生幻覺。本文介紹 LP-LM，一個系統，它將問題的答案建立在知識庫 (KB) 中已知的事實上，透過 Prolog 中的語義解析來促進，並始終產生可靠的答案。
-LP-LM 透過 Prolog 明確條款語法 (DCG) 解析產生一個最可能的成分解析樹，以及輸入問題對應的 Prolog 詞彙。然後，針對一個自然語言句子的 KB 執行該詞彙，也表示為 Prolog 詞彙，以進行問題解答。透過利用 DCG 和 tabling，LP-LM 在輸入句子的大小上以線性時間執行，對於足夠多的語法規則。執行實驗比較 LP-LM 與目前眾所周知的 LLM 在準確性上，我們顯示出 LLM 甚至會對簡單的問題產生幻覺，這與 LP-LM 不同。
+摘要：自我監督學習透過從大規模未標記資料集中提取有效且可概化的特徵，進而革新了醫學影像。最近，自我監督基礎模型已擴展到三維 (3D) 電腦斷層掃描 (CT) 資料，產生緊湊、資訊豐富的嵌入，包含 1408 個特徵，在顱內出血偵測和肺癌風險預測等下游任務中達到最先進的效能。然而，這些嵌入已被證明會編碼人口統計資訊，例如年齡、性別和種族，這對臨床應用的公平性構成重大風險。
+在這項工作中，我們提出一個基於變異自編碼器 (VAE) 的對抗性去偏框架，將這些嵌入轉換到一個新的潛在空間，其中不再編碼人口統計資訊，同時維持關鍵下游任務的效能。我們在 NLST 肺癌篩檢資料集上驗證了我們的做法，證明去偏嵌入有效消除了多重編碼的人口統計資訊，並在不損害 1 年和 2 年間隔的肺癌風險預測準確性的情況下提高了公平性。此外，我們的做法確保了嵌入對抗性偏誤攻擊具有魯棒性。這些結果突顯了對抗性去偏技術的潛力，可確保自我監督 3D CT 嵌入在臨床應用中的公平性和公正性，為其在無偏見醫療決策中的廣泛採用鋪路。
 
-##### **Visual Graph Question Answering with ASP and LLMs for Language Parsing**
-2502.09211v1 by Jakob Johannes Bauer, Thomas Eiter, Nelson Higuera Ruiz, Johannes Oetsch
+##### **Clinically-Inspired Hierarchical Multi-Label Classification of Chest X-rays with a Penalty-Based Loss Function**
+2502.03591v1 by Mehrdad Asadi, Komi Sodoké, Ian J. Gerard, Marta Kersten-Oertel
 
-Visual Question Answering (VQA) is a challenging problem that requires to
-process multimodal input. Answer-Set Programming (ASP) has shown great
-potential in this regard to add interpretability and explainability to modular
-VQA architectures. In this work, we address the problem of how to integrate ASP
-with modules for vision and natural language processing to solve a new and
-demanding VQA variant that is concerned with images of graphs (not graphs in
-symbolic form). Images containing graph-based structures are an ubiquitous and
-popular form of visualisation. Here, we deal with the particular problem of
-graphs inspired by transit networks, and we introduce a novel dataset that
-amends an existing one by adding images of graphs that resemble metro lines.
-Our modular neuro-symbolic approach combines optical graph recognition for
-graph parsing, a pretrained optical character recognition neural network for
-parsing labels, Large Language Models (LLMs) for language processing, and ASP
-for reasoning. This method serves as a first baseline and achieves an overall
-average accuracy of 73% on the dataset. Our evaluation provides further
-evidence of the potential of modular neuro-symbolic systems, in particular with
-pretrained models that do not involve any further training and logic
-programming for reasoning, to solve complex VQA tasks.
+In this work, we present a novel approach to multi-label chest X-ray (CXR)
+image classification that enhances clinical interpretability while maintaining
+a streamlined, single-model, single-run training pipeline. Leveraging the
+CheXpert dataset and VisualCheXbert-derived labels, we incorporate hierarchical
+label groupings to capture clinically meaningful relationships between
+diagnoses. To achieve this, we designed a custom hierarchical binary
+cross-entropy (HBCE) loss function that enforces label dependencies using
+either fixed or data-driven penalty types. Our model achieved a mean area under
+the receiver operating characteristic curve (AUROC) of 0.903 on the test set.
+Additionally, we provide visual explanations and uncertainty estimations to
+further enhance model interpretability. All code, model configurations, and
+experiment details are made available.
 
-摘要：視覺問答（VQA）是一項具有挑戰性的問題，需要處理多模態輸入。答案集程式設計（ASP）在這方面顯示出巨大的潛力，可以為模組化 VQA 架構增加可解釋性和說明性。在這項工作中，我們探討如何將 ASP 與視覺和自然語言處理模組整合，以解決一個新的且要求嚴格的 VQA 變體，該變體與圖形影像（而非符號形式的圖形）有關。包含圖形結構的影像是一種普遍且流行的可視化形式。在這裡，我們處理受交通網路啟發的圖形特定問題，並引入一個新的資料集，透過新增類似地鐵路線的圖形影像來修正現有資料集。我們的模組化神經符號方法結合光學圖形辨識進行圖形解析、預先訓練的光學字元辨識神經網路進行標籤解析、大型語言模型（LLM）進行語言處理，以及 ASP 進行推理。此方法作為第一個基準，在資料集上達到 73% 的整體平均準確度。我們的評估進一步證明了模組化神經符號系統的潛力，特別是預先訓練的模型，這些模型不涉及任何進一步的訓練和邏輯程式設計進行推理，以解決複雜的 VQA 任務。
+摘要：在本文中，我們提出胸部 X 光（CXR）影像多標籤分類的新方法，在維持簡化的單一模型、單次執行訓練管線的同時，提升臨床可解釋性。利用 CheXpert 資料集和 VisualCheXbert 衍生的標籤，我們納入階層標籤群組，以擷取診斷之間具有臨床意義的關聯性。為此，我們設計了自訂的階層二元交叉熵 (HBCE) 損失函數，使用固定或資料驅動的懲罰類型來強制執行標籤依賴性。我們的模型在測試集上達到受試者工作特性曲線 (AUROC) 下的平均面積為 0.903。此外，我們提供視覺化說明和不確定性估計，以進一步提升模型可解釋性。所有程式碼、模型組態和實驗詳細資料皆已公開。
 
-##### **On LLM-generated Logic Programs and their Inference Execution Methods**
-2502.09209v1 by Paul Tarau
+##### **Code Simulation as a Proxy for High-order Tasks in Large Language Models**
+2502.03568v1 by Emanuele La Malfa, Christoph Weinhuber, Orazio Torre, Fangru Lin, X. Angelo Huang, Samuele Marro, Anthony Cohn, Nigel Shadbolt, Michael Wooldridge
 
-Large Language Models (LLMs) trained on petabytes of data are highly
-compressed repositories of a significant proportion of the knowledge
-accumulated and distilled so far. In this paper we study techniques to elicit
-this knowledge in the form of several classes of logic programs, including
-propositional Horn clauses, Dual Horn clauses, relational triplets and Definite
-Clause Grammars. Exposing this knowledge as logic programs enables sound
-reasoning methods that can verify alignment of LLM outputs to their intended
-uses and extend their inference capabilities. We study new execution methods
-for the generated programs, including soft-unification of abducible facts
-against LLM-generated content stored in a vector database as well as GPU-based
-acceleration of minimal model computation that supports inference with large
-LLM-generated programs.
+Many reasoning, planning, and problem-solving tasks share an intrinsic
+algorithmic nature: correctly simulating each step is a sufficient condition to
+solve them correctly. We collect pairs of naturalistic and synthetic reasoning
+tasks to assess the capabilities of Large Language Models (LLM). While
+naturalistic tasks often require careful human handcrafting, we show that
+synthetic data is, in many cases, a good proxy that is much easier to collect
+at scale. We leverage common constructs in programming as the counterpart of
+the building blocks of naturalistic reasoning tasks, such as straight-line
+programs, code that contains critical paths, and approximate and redundant
+instructions. We further assess the capabilities of LLMs on sorting problems
+and repeated operations via sorting algorithms and nested loops. Our synthetic
+datasets further reveal that while the most powerful LLMs exhibit relatively
+strong execution capabilities, the process is fragile: it is negatively
+affected by memorisation and seems to rely heavily on pattern recognition. Our
+contribution builds upon synthetically testing the reasoning capabilities of
+LLMs as a scalable complement to handcrafted human-annotated problems.
 
-摘要：大型語言模型 (LLM) 在數位位元組的資料上受過訓練，是目前為止累積和提煉的知識中，高度濃縮的儲存庫。在本文中，我們研究了以數種邏輯程式類別的形式引出這些知識的技術，包括命題霍恩子句、雙重霍恩子句、關聯三元組和確定子句文法。將這些知識作為邏輯程式揭露，能啟用健全的推理方法，驗證 LLM 輸出的對齊方式，符合其預期的用途，並擴展其推論能力。我們研究了產生程式的新執行方法，包括對儲存在向量資料庫中的 LLM 產生內容，進行可約簡事實的軟統一，以及支援使用大型 LLM 產生程式進行推論的，基於 GPU 的最小模型計算加速。
+摘要：許多推理、規劃和問題解決任務共享一個內在的演算法性質：正確模擬每一步就足以正確解決它們。我們收集自然主義和合成推理任務對，以評估大型語言模型 (LLM) 的功能。雖然自然主義任務通常需要仔細的人工製作，但我們表明在許多情況下，合成資料是一個很好的代理，而且更容易大規模收集。我們利用程式設計中的常見建構，作為自然主義推理任務構建區塊的對應物，例如直線程式、包含關鍵路徑的程式碼，以及近似和冗餘指令。我們進一步評估 LLM 在排序問題和重複運算上的功能，透過排序演算法和巢狀迴圈。我們的合成資料集進一步揭示，雖然最強大的 LLM 表現出相對強大的執行能力，但這個過程很脆弱：它受到記憶的負面影響，而且似乎嚴重依賴模式辨識。我們的貢獻建立在以合成方式測試 LLM 的推理能力之上，作為手工編寫人類標註問題的可擴充補充。
 
-##### **Efficient OWL2QL Meta-reasoning Using ASP-based Hybrid Knowledge Bases**
-2502.09206v1 by Haya Majid Qureshi, Wolfgang Faber
+##### **Limitations of Large Language Models in Clinical Problem-Solving Arising from Inflexible Reasoning**
+2502.04381v1 by Jonathan Kim, Anna Podlasek, Kie Shidara, Feng Liu, Ahmed Alaa, Danilo Bernardo
 
-Metamodeling refers to scenarios in ontologies in which classes and roles can
-be members of classes or occur in roles. This is a desirable modelling feature
-in several applications, but allowing it without restrictions is problematic
-for several reasons, mainly because it causes undecidability. Therefore,
-practical languages either forbid metamodeling explicitly or treat occurrences
-of classes as instances to be semantically different from other occurrences,
-thereby not allowing metamodeling semantically. Several extensions have been
-proposed to provide metamodeling to some extent. Building on earlier work that
-reduces metamodeling query answering to Datalog query answering, recently
-reductions to query answering over hybrid knowledge bases were proposed with
-the aim of using the Datalog transformation only where necessary. Preliminary
-work showed that the approach works, but the hoped-for performance improvements
-were not observed yet. In this work we expand on this body of work by improving
-the theoretical basis of the reductions and by using alternative tools that
-show competitive performance.
+Large Language Models (LLMs) have attained human-level accuracy on medical
+question-answer (QA) benchmarks. However, their limitations in navigating
+open-ended clinical scenarios have recently been shown, raising concerns about
+the robustness and generalizability of LLM reasoning across diverse, real-world
+medical tasks. To probe potential LLM failure modes in clinical
+problem-solving, we present the medical abstraction and reasoning corpus
+(M-ARC). M-ARC assesses clinical reasoning through scenarios designed to
+exploit the Einstellung effect -- the fixation of thought arising from prior
+experience, targeting LLM inductive biases toward inflexible pattern matching
+from their training data rather than engaging in flexible reasoning. We find
+that LLMs, including current state-of-the-art o1 and Gemini models, perform
+poorly compared to physicians on M-ARC, often demonstrating lack of commonsense
+medical reasoning and a propensity to hallucinate. In addition, uncertainty
+estimation analyses indicate that LLMs exhibit overconfidence in their answers,
+despite their limited accuracy. The failure modes revealed by M-ARC in LLM
+medical reasoning underscore the need to exercise caution when deploying these
+models in clinical settings.
 
-摘要：元建模是指本体中的場景，其中類別和角色可以是類別成員或出現在角色中。這是一個在多個應用中理想的建模功能，但允許它不受限制會因多個原因而產生問題，主要是因為它會導致無法決定。因此，實用的語言會明確禁止元建模，或將類別的出現視為與其他出現語義不同的實例，從而語義上不允許元建模。已經提出多個擴充功能，在一定程度上提供元建模。建立在將元建模查詢回答簡化為 Datalog 查詢回答的早期工作之上，最近提出了將查詢回答簡化為混合知識庫的簡化，目的是僅在必要時使用 Datalog 轉換。初步工作顯示該方法有效，但尚未觀察到預期的效能改善。在這項工作中，我們透過改善簡化的理論基礎和使用表現競爭力的替代工具，擴展了這項工作。
+摘要：大型語言模型 (LLM) 已在醫療問題解答 (QA) 基準上達到人類層級的準確度。然而，它們在應對開放式臨床場景中的局限性最近已被揭示，引發了人們對 LLM 推理在多樣化、真實世界醫療任務中的穩健性和概括性的擔憂。為了探討臨床問題解決中 LLM 的潛在故障模式，我們提出了醫療抽象和推理語料庫 (M-ARC)。M-ARC 通過旨在利用艾賓浩斯錯覺（由先前經驗產生的思維定勢）來評估臨床推理，針對 LLM 歸納偏誤，使其從訓練數據中進行僵化的模式匹配，而不是進行靈活的推理。我們發現，包括當前最先進的 o1 和 Gemini 模型在內的 LLM，在 M-ARC 上的表現遠不如醫生，它們經常表現出缺乏常識性的醫療推理和產生幻覺的傾向。此外，不確定性估計分析表明，儘管 LLM 準確性有限，但它們對自己的答案表現出過度自信。M-ARC 揭示的 LLM 醫療推理故障模式強調了在臨床環境中部署這些模型時需要謹慎。
 
-##### **Counterfactual Explanations as Plans**
-2502.09205v1 by Vaishak Belle
+##### **Accurate AI-Driven Emergency Vehicle Location Tracking in Healthcare ITS Digital Twin**
+2502.03396v1 by Sarah Al-Shareeda, Yasar Celik, Bilge Bilgili, Ahmed Al-Dubai, Berk Canberk
 
-There has been considerable recent interest in explainability in AI,
-especially with black-box machine learning models. As correctly observed by the
-planning community, when the application at hand is not a single-shot decision
-or prediction, but a sequence of actions that depend on observations, a richer
-notion of explanations are desirable.
-  In this paper, we look to provide a formal account of ``counterfactual
-explanations," based in terms of action sequences. We then show that this
-naturally leads to an account of model reconciliation, which might take the
-form of the user correcting the agent's model, or suggesting actions to the
-agent's plan. For this, we will need to articulate what is true versus what is
-known, and we appeal to a modal fragment of the situation calculus to formalise
-these intuitions. We consider various settings: the agent knowing partial
-truths, weakened truths and having false beliefs, and show that our definitions
-easily generalize to these different settings.
+Creating a Digital Twin (DT) for Healthcare Intelligent Transportation
+Systems (HITS) is a hot research trend focusing on enhancing HITS management,
+particularly in emergencies where ambulance vehicles must arrive at the crash
+scene on time and track their real-time location is crucial to the medical
+authorities. Despite the claim of real-time representation, a temporal
+misalignment persists between the physical and virtual domains, leading to
+discrepancies in the ambulance's location representation. This study proposes
+integrating AI predictive models, specifically Support Vector Regression (SVR)
+and Deep Neural Networks (DNN), within a constructed mock DT data pipeline
+framework to anticipate the medical vehicle's next location in the virtual
+world. These models align virtual representations with their physical
+counterparts, i.e., metaphorically offsetting the synchronization delay between
+the two worlds. Trained meticulously on a historical geospatial dataset, SVR
+and DNN exhibit exceptional prediction accuracy in MATLAB and Python
+environments. Through various testing scenarios, we visually demonstrate the
+efficacy of our methodology, showcasing SVR and DNN's key role in significantly
+reducing the witnessed gap within the HITS's DT. This transformative approach
+enhances real-time synchronization in emergency HITS by approximately 88% to
+93%.
 
-摘要：最近在人工智能中對於可解釋性產生了相當大的興趣，
-特別是對於黑盒機器學習模型。正如規劃社群正確觀察到的，當手邊的應用程式不是單次決策或預測，而是一連串依賴於觀察的動作時，一個更豐富的解釋概念是可取的。
-在本文中，我們著眼於提供「反事實解釋」的一個正式說明，以動作序列為基礎。然後我們展示這自然會導致一個模型調和說明，其形式可能是使用者修正代理人的模型，或建議代理人的計畫採取行動。為此，我們需要說明什麼是真實的，什麼是已知的，我們訴諸情境演算的一個模態片段來形式化這些直覺。我們考慮各種設定：代理人知道部分真實、虛弱真實和擁有錯誤信念，並展示我們的定義輕鬆地概括到這些不同的設定。
+摘要：建立醫療智慧交通系統（HITS）的數位分身（DT）是熱門的研究趨勢，其重點在於提升 HITS 管理，特別是在救護車必須準時抵達車禍現場的緊急情況中，追蹤其即時位置對於醫療單位至關重要。儘管聲稱即時呈現，但實體和虛擬領域之間仍存在時間上的錯位，導致救護車位置呈現上的差異。本研究建議在建構的虛擬 DT 資料管道架構中整合人工智慧預測模型，特別是支援向量回歸（SVR）和深度神經網路（DNN），以預測醫療車輛在虛擬世界的下一個位置。這些模型將虛擬呈現與其實體對應物對齊，也就是說，在兩個世界之間比喻性地抵銷同步延遲。在歷史地理空間資料集上經過仔細訓練，SVR 和 DNN 在 MATLAB 和 Python 環境中展現出卓越的預測準確性。透過各種測試情境，我們視覺化展示了我們方法論的效能，展示了 SVR 和 DNN 在顯著縮小 HITS 的 DT 中見證到的差距方面的關鍵作用。這種變革性的方法將緊急 HITS 中的即時同步提升了大約 88% 到 93%。
 
-##### **Logical Lease Litigation: Prolog and LLMs for Rental Law Compliance in New York**
-2502.09204v1 by Sanskar Sehgal, Yanhong A. Liu
+##### **RadVLM: A Multitask Conversational Vision-Language Model for Radiology**
+2502.03333v1 by Nicolas Deperrois, Hidetoshi Matsuo, Samuel Ruipérez-Campillo, Moritz Vandenhirtz, Sonia Laguna, Alain Ryser, Koji Fujimoto, Mizuho Nishio, Thomas M. Sutter, Julia E. Vogt, Jonas Kluckert, Thomas Frauenfelder, Christian Blüthgen, Farhad Nooralahzadeh, Michael Krauthammer
 
-Legal cases require careful logical reasoning following the laws, whereas
-interactions with non- technical users must be in natural language. As an
-application combining logical reasoning using Prolog and natural language
-processing using large language models (LLMs), this paper presents a novel
-approach and system, LogicLease, to automate the analysis of landlord-tenant
-legal cases in the state of New York. LogicLease determines compliance with
-relevant legal requirements by analyzing case descriptions and citing all
-relevant laws. It leverages LLMs for information extraction and Prolog for
-legal reasoning. By separating information extraction from legal reasoning,
-LogicLease achieves greater transparency and control over the legal logic
-applied to each case. We evaluate the accuracy, efficiency, and robustness of
-LogicLease through a series of tests, achieving 100% accuracy and an average
-processing time of 2.57 seconds. LogicLease presents advantages over
-state-of-the-art LLM- based legal analysis systems by providing clear,
-step-by-step reasoning, citing specific laws, and distinguishing itself by its
-ability to avoid hallucinations - a common issue in LLMs.
+The widespread use of chest X-rays (CXRs), coupled with a shortage of
+radiologists, has driven growing interest in automated CXR analysis and
+AI-assisted reporting. While existing vision-language models (VLMs) show
+promise in specific tasks such as report generation or abnormality detection,
+they often lack support for interactive diagnostic capabilities. In this work
+we present RadVLM, a compact, multitask conversational foundation model
+designed for CXR interpretation. To this end, we curate a large-scale
+instruction dataset comprising over 1 million image-instruction pairs
+containing both single-turn tasks -- such as report generation, abnormality
+classification, and visual grounding -- and multi-turn, multi-task
+conversational interactions. After fine-tuning RadVLM on this instruction
+dataset, we evaluate it across different tasks along with re-implemented
+baseline VLMs. Our results show that RadVLM achieves state-of-the-art
+performance in conversational capabilities and visual grounding while remaining
+competitive in other radiology tasks. Ablation studies further highlight the
+benefit of joint training across multiple tasks, particularly for scenarios
+with limited annotated data. Together, these findings highlight the potential
+of RadVLM as a clinically relevant AI assistant, providing structured CXR
+interpretation and conversational capabilities to support more effective and
+accessible diagnostic workflows.
 
-摘要：法律案件需要遵循法律进行谨慎的逻辑推理，而与非技术用户的互动必须使用自然语言。作为结合使用 Prolog 进行逻辑推理和使用大型语言模型 (LLM) 进行自然语言处理的应用程序，本文提出了一种新颖的方法和系统 LogicLease，以自动分析纽约州的房东与租户法律案件。LogicLease 通过分析案例描述并引用所有相关法律来确定是否符合相关法律要求。它利用 LLM 进行信息提取，并利用 Prolog 进行法律推理。通过将信息提取与法律推理分开，LogicLease 实现了对应用于每个案例的法律逻辑的更高透明度和控制力。我们通过一系列测试评估了 LogicLease 的准确性、效率和鲁棒性，实现了 100% 的准确性和 2.57 秒的平均处理时间。LogicLease 通过提供清晰、分步的推理，引用具体法律，并以其避免幻觉的能力而区别于最先进的基于 LLM 的法律分析系统，从而显示出优势——这是 LLM 中的常见问题。
+摘要：胸部 X 光 (CXR) 的广泛使用，加上放射科醫師短缺，促使人們對自動化 CXR 分析和 AI 輔助報告產生越來越濃厚的興趣。雖然現有的視覺語言模型 (VLM) 在特定任務中顯示出前景，例如報告生成或異常偵測，但它們通常缺乏對互動式診斷功能的支持。在這項工作中，我們提出 RadVLM，這是一個緊湊的多任務對話式基礎模型，專為 CXR 解釋而設計。為此，我們策劃了一個大型指令資料集，包含超過 100 萬個影像指令對，其中包含單輪任務（例如報告生成、異常分類和視覺基礎），以及多輪、多任務對話互動。在對這個指令資料集進行微調後，我們對 RadVLM 進行評估，並與重新實作的基準 VLM 一起執行不同的任務。我們的結果顯示，RadVLM 在對話能力和視覺基礎方面取得了最先進的效能，同時在其他放射學任務中仍具有競爭力。消融研究進一步突顯了跨多個任務進行聯合訓練的好處，特別是對於帶有標註資料有限的場景。這些發現共同突顯了 RadVLM 作為臨床相關 AI 助理的潛力，提供結構化的 CXR 解釋和對話能力，以支援更有效且可存取的診斷工作流程。
 
-##### **Thinking beyond the anthropomorphic paradigm benefits LLM research**
-2502.09192v1 by Lujain Ibrahim, Myra Cheng
+##### **MeDiSumQA: Patient-Oriented Question-Answer Generation from Discharge Letters**
+2502.03298v1 by Amin Dada, Osman Alperen Koras, Marie Bauer, Amanda Butler, Kaleb E. Smith, Jens Kleesiek, Julian Friedrich
 
-Anthropomorphism, or the attribution of human traits to technology, is an
-automatic and unconscious response that occurs even in those with advanced
-technical expertise. In this position paper, we analyze hundreds of thousands
-of computer science research articles from the past decade and present
-empirical evidence of the prevalence and growth of anthropomorphic terminology
-in research on large language models (LLMs). This terminology reflects deeper
-anthropomorphic conceptualizations which shape how we think about and conduct
-LLM research. We argue these conceptualizations may be limiting, and that
-challenging them opens up new pathways for understanding and improving LLMs
-beyond human analogies. To illustrate this, we identify and analyze five core
-anthropomorphic assumptions shaping prominent methodologies across the LLM
-development lifecycle, from the assumption that models must use natural
-language for reasoning tasks to the assumption that model capabilities should
-be evaluated through human-centric benchmarks. For each assumption, we
-demonstrate how non-anthropomorphic alternatives can open new directions for
-research and development.
+While increasing patients' access to medical documents improves medical care,
+this benefit is limited by varying health literacy levels and complex medical
+terminology. Large language models (LLMs) offer solutions by simplifying
+medical information. However, evaluating LLMs for safe and patient-friendly
+text generation is difficult due to the lack of standardized evaluation
+resources. To fill this gap, we developed MeDiSumQA. MeDiSumQA is a dataset
+created from MIMIC-IV discharge summaries through an automated pipeline
+combining LLM-based question-answer generation with manual quality checks. We
+use this dataset to evaluate various LLMs on patient-oriented
+question-answering. Our findings reveal that general-purpose LLMs frequently
+surpass biomedical-adapted models, while automated metrics correlate with human
+judgment. By releasing MeDiSumQA on PhysioNet, we aim to advance the
+development of LLMs to enhance patient understanding and ultimately improve
+care outcomes.
 
-摘要：擬人化，或將人類特質歸因於科技，是一種自動且無意識的反應，即使是那些擁有進階技術專業知識的人也會發生。在本文中，我們分析了過去十年數十萬篇電腦科學研究文章，並提出實證證據證明擬人化術語在大型語言模型 (LLM) 研究中的普遍性和增長。這些術語反映了更深層的擬人化概念化，塑造了我們思考和進行 LLM 研究的方式。我們認為這些概念化可能是有限制的，並且挑戰它們為超越人類類比來理解和改進 LLM 開闢了新的途徑。為了說明這一點，我們識別並分析了五個核心擬人化假設，這些假設塑造了 LLM 開發生命週期中的顯著方法論，從模型必須使用自然語言進行推理任務的假設到模型能力應該通過以人為中心的基準進行評估的假設。對於每個假設，我們展示了非擬人化替代方案如何為研究和開發打開新方向。
+摘要：儘管讓患者更能取得醫療文件有助於改善醫療照護，
+但此優點受到不同的健康素養程度和複雜的醫療術語所限制。大型語言模型 (LLM) 提供了簡化醫療資訊的解決方案。然而，由於缺乏標準化的評估資源，因此難以評估 LLM 以確保其安全且對患者友善的文字產生。為了填補此缺口，我們開發了 MeDiSumQA。MeDiSumQA 是透過自動化流程從 MIMIC-IV 出院摘要中建立的資料集，結合了基於 LLM 的問答產生和手動品質檢查。我們使用此資料集來評估各種 LLM 在以患者為導向的問答中。我們的發現顯示，通用 LLM 經常超越生物醫學適應模型，而自動化指標與人類判斷相關。透過在 PhysioNet 上發布 MeDiSumQA，我們旨在推動 LLM 的發展，以增進患者理解，並最終改善照護成果。
 
-##### **Matina: A Large-Scale 73B Token Persian Text Corpus**
-2502.09188v1 by Sara Bourbour Hosseinbeigi, Fatemeh Taherinezhad, Heshaam Faili, Hamed Baghbani, Fatemeh Nadi, Mostafa Amiri
+##### **Deep Learning Pipeline for Fully Automated Myocardial Infarct Segmentation from Clinical Cardiac MR Scans**
+2502.03272v1 by Matthias Schwab, Mathias Pamminger, Christian Kremser, Agnes Mayr
+
+Purpose: To develop and evaluate a deep learning-based method that allows to
+perform myocardial infarct segmentation in a fully-automated way.
+  Materials and Methods: For this retrospective study, a cascaded framework of
+two and three-dimensional convolutional neural networks (CNNs), specialized on
+identifying ischemic myocardial scars on late gadolinium enhancement (LGE)
+cardiac magnetic resonance (CMR) images, was trained on an in-house training
+dataset consisting of 144 examinations. On a separate test dataset from the
+same institution, including images from 152 examinations obtained between 2021
+and 2023, a quantitative comparison between artificial intelligence (AI)-based
+segmentations and manual segmentations was performed. Further, qualitative
+assessment of segmentation accuracy was evaluated for both human and
+AI-generated contours by two CMR experts in a blinded experiment.
+  Results: Excellent agreement could be found between manually and
+automatically calculated infarct volumes ($\rho_c$ = 0.9). The qualitative
+evaluation showed that compared to human-based measurements, the experts rated
+the AI-based segmentations to better represent the actual extent of infarction
+significantly (p < 0.001) more often (33.4% AI, 25.1% human, 41.5% equal). On
+the contrary, for segmentation of microvascular obstruction (MVO), manual
+measurements were still preferred (11.3% AI, 55.6% human, 33.1% equal).
+  Conclusion: This fully-automated segmentation pipeline enables CMR infarct
+size to be calculated in a very short time and without requiring any
+pre-processing of the input images while matching the segmentation quality of
+trained human observers. In a blinded experiment, experts preferred automated
+infarct segmentations more often than manual segmentations, paving the way for
+a potential clinical application.
 
-Text corpora are essential for training models used in tasks like
-summarization, translation, and large language models (LLMs). While various
-efforts have been made to collect monolingual and multilingual datasets in many
-languages, Persian has often been underrepresented due to limited resources for
-data collection and preprocessing. Existing Persian datasets are typically
-small and lack content diversity, consisting mainly of weblogs and news
-articles. This shortage of high-quality, varied data has slowed the development
-of NLP models and open-source LLMs for Persian. Since model performance depends
-heavily on the quality of training data, we address this gap by introducing the
-Matina corpus, a new Persian dataset of 72.9B tokens, carefully preprocessed
-and deduplicated to ensure high data quality. We further assess its
-effectiveness by training and evaluating transformer-based models on key NLP
-tasks. Both the dataset and preprocessing codes are publicly available,
-enabling researchers to build on and improve this resource for future Persian
-NLP advancements.
+摘要：<paragraph>目的：開發和評估一種基於深度學習的方法，允許以全自動的方式執行心肌梗塞分割。
+材料和方法：對於這項回顧性研究，一個由二維和三維卷積神經網路 (CNN) 組成的串聯架構，專門用於識別晚期釓增強 (LGE) 心臟磁振造影 (CMR) 影像上的缺血性心肌疤痕，並在包含 144 項檢查的內部訓練資料集上受訓。在來自同一家機構的獨立測試資料集上，包括 2021 年至 2023 年間獲得的 152 項檢查的影像，執行基於人工智慧 (AI) 的分割和手動分割之間的定量比較。此外，由兩位 CMR 專家在盲測實驗中評估人類和 AI 生成的輪廓的分割準確度。
+結果：在手動和自動計算的梗塞體積之間可以發現極佳的一致性（ρ_c = 0.9）。定性評估顯示，與基於人類的測量相比，專家評估 AI 基於分割能更能代表梗塞的實際範圍，顯著（p < 0.001）更常發生（33.4% AI，25.1% 人類，41.5% 相等）。相反，對於微血管阻塞 (MVO) 的分割，手動測量仍然較受青睞（11.3% AI，55.6% 人類，33.1% 相等）。
+結論：這個全自動分割管道可以在很短的時間內計算 CMR 梗塞大小，而且無需對輸入影像進行任何前處理，同時匹配受過訓練的人類觀察者的分割品質。在盲測實驗中，專家比手動分割更常偏好自動梗塞分割，為潛在的臨床應用鋪平了道路。</paragraph>
 
-摘要：文字語料庫對於訓練用於摘要、翻譯和大型語言模型 (LLM) 等任務的模型至關重要。儘管已做出各種努力來收集許多語言中的單語和多語言資料集，但由於資料收集和預處理資源有限，波斯語常常代表性不足。現有的波斯語資料集通常很小，而且缺乏內容多樣性，主要由網誌和新聞文章組成。這種優質、多樣化資料的短缺減緩了波斯語的 NLP 模型和開源 LLM 的開發。由於模型效能高度依賴訓練資料的品質，我們透過推出 Matina 語料庫來解決這個差距，Matina 語料庫是一個新的波斯語資料集，包含 72.9B 個字元，經過仔細預處理和去重，以確保資料品質。我們進一步透過在關鍵 NLP 任務上訓練和評估基於轉換器的模型來評估其有效性。資料集和預處理程式碼都是公開的，使研究人員能夠建立和改善這個資源，以促進未來的波斯語 NLP 進展。
+##### **Long-tailed Medical Diagnosis with Relation-aware Representation Learning and Iterative Classifier Calibration**
+2502.03238v2 by Li Pan, Yupei Zhang, Qiushi Yang, Tan Li, Zhen Chen
 
-##### **RefineCoder: Iterative Improving of Large Language Models via Adaptive Critique Refinement for Code Generation**
-2502.09183v1 by Changzhi Zhou, Xinyu Zhang, Dandan Song, Xiancai Chen, Wanli Gu, Huipeng Ma, Yuhang Tian, Mengdi Zhang, Linmei Hu
+Recently computer-aided diagnosis has demonstrated promising performance,
+effectively alleviating the workload of clinicians. However, the inherent
+sample imbalance among different diseases leads algorithms biased to the
+majority categories, leading to poor performance for rare categories. Existing
+works formulated this challenge as a long-tailed problem and attempted to
+tackle it by decoupling the feature representation and classification. Yet, due
+to the imbalanced distribution and limited samples from tail classes, these
+works are prone to biased representation learning and insufficient classifier
+calibration. To tackle these problems, we propose a new Long-tailed Medical
+Diagnosis (LMD) framework for balanced medical image classification on
+long-tailed datasets. In the initial stage, we develop a Relation-aware
+Representation Learning (RRL) scheme to boost the representation ability by
+encouraging the encoder to capture intrinsic semantic features through
+different data augmentations. In the subsequent stage, we propose an Iterative
+Classifier Calibration (ICC) scheme to calibrate the classifier iteratively.
+This is achieved by generating a large number of balanced virtual features and
+fine-tuning the encoder using an Expectation-Maximization manner. The proposed
+ICC compensates for minority categories to facilitate unbiased classifier
+optimization while maintaining the diagnostic knowledge in majority classes.
+Comprehensive experiments on three public long-tailed medical datasets
+demonstrate that our LMD framework significantly surpasses state-of-the-art
+approaches. The source code can be accessed at
+https://github.com/peterlipan/LMD.
 
-Code generation has attracted increasing attention with the rise of Large
-Language Models (LLMs). Many studies have developed powerful code LLMs by
-synthesizing code-related instruction data and applying supervised fine-tuning.
-However, these methods are limited by teacher model distillation and ignore the
-potential of iterative refinement by self-generated code. In this paper, we
-propose Adaptive Critique Refinement (ACR), which enables the model to refine
-itself by self-generated code and external critique, rather than directly
-imitating the code responses of the teacher model. Concretely, ACR includes a
-composite scoring system with LLM-as-a-Judge to evaluate the quality of code
-responses and a selective critique strategy with LLM-as-a-Critic to critique
-self-generated low-quality code responses. We develop the RefineCoder series by
-iteratively applying ACR, achieving continuous performance improvement on
-multiple code generation benchmarks. Compared to the baselines of the same
-size, our proposed RefineCoder series can achieve comparable or even superior
-performance using less data.
+摘要：<paragraph>最近，计算机辅助诊断已展现出可观的表现，有效减轻了临床医生的工作量。然而，不同疾病之间固有的样本不平衡导致算法偏向于多数类别，从而导致罕见类别表现不佳。现有工作将这一挑战表述为长尾问题，并尝试通过解耦特征表示和分类来解决它。然而，由于不平衡分布和尾类样本有限，这些工作容易出现有偏差的表示学习和分类器校准不足。为了解决这些问题，我们提出了一个新的长尾医学诊断 (LMD) 框架，用于对长尾数据集进行平衡的医学图像分类。在初始阶段，我们开发了一个关系感知表示学习 (RRL) 方案，通过鼓励编码器通过不同的数据增强来捕获内在语义特征，从而提升表示能力。在后续阶段，我们提出了一个迭代分类器校准 (ICC) 方案，以迭代方式校准分类器。这是通过生成大量的平衡虚拟特征并使用期望最大化方式微调编码器来实现的。所提出的 ICC 补偿了少数类别，以促进无偏分类器优化，同时保持多数类别的诊断知识。在三个公共长尾医学数据集上进行的综合实验表明，我们的 LMD 框架明显超越了最先进的方法。源代码可在 https://github.com/peterlipan/LMD 处获取。</paragraph>
 
-摘要：隨著大型語言模型 (LLM) 的興起，程式碼生成備受關注。許多研究透過綜合與程式碼相關的指令資料並應用監督式微調來開發強大的程式碼 LLM。然而，這些方法受到教師模型蒸餾的限制，且忽略了透過自行產生的程式碼進行反覆改進的潛力。在本文中，我們提出適應性批判改進 (ACR)，它使模型能夠透過自行產生的程式碼和外部批判來改進自身，而不是直接模仿教師模型的程式碼回應。具體來說，ACR 包含一個複合評分系統，其中 LLM 作為評審員來評估程式碼回應的品質，以及一個選擇性批判策略，其中 LLM 作為批判者來批判自行產生的低品質程式碼回應。我們透過反覆套用 ACR 來開發 RefineCoder 系列，在多個程式碼生成基準上實現持續的效能改善。與相同規模的基準相比，我們提出的 RefineCoder 系列可以使用較少資料來實現相當甚至更優異的效能。
+##### **Fine-Tuning Strategies for Continual Online EEG Motor Imagery Decoding: Insights from a Large-Scale Longitudinal Study**
+2502.06828v1 by Martin Wimpff, Bruno Aristimunha, Sylvain Chevallier, Bin Yang
 
-##### **FLAME: Flexible LLM-Assisted Moderation Engine**
-2502.09175v1 by Ivan Bakulin, Ilia Kopanichuk, Iaroslav Bespalov, Nikita Radchenko, Vladimir Shaposhnikov, Dmitry Dylov, Ivan Oseledets
+This study investigates continual fine-tuning strategies for deep learning in
+online longitudinal electroencephalography (EEG) motor imagery (MI) decoding
+within a causal setting involving a large user group and multiple sessions per
+participant. We are the first to explore such strategies across a large user
+group, as longitudinal adaptation is typically studied in the single-subject
+setting with a single adaptation strategy, which limits the ability to
+generalize findings. First, we examine the impact of different fine-tuning
+approaches on decoder performance and stability. Building on this, we integrate
+online test-time adaptation (OTTA) to adapt the model during deployment,
+complementing the effects of prior fine-tuning. Our findings demonstrate that
+fine-tuning that successively builds on prior subject-specific information
+improves both performance and stability, while OTTA effectively adapts the
+model to evolving data distributions across consecutive sessions, enabling
+calibration-free operation. These results offer valuable insights and
+recommendations for future research in longitudinal online MI decoding and
+highlight the importance of combining domain adaptation strategies for
+improving BCI performance in real-world applications. Clinical Relevance: Our
+investigation enables more stable and efficient long-term motor imagery
+decoding, which is critical for neurorehabilitation and assistive technologies.
 
-The rapid advancement of Large Language Models (LLMs) has introduced
-significant challenges in moderating user-model interactions. While LLMs
-demonstrate remarkable capabilities, they remain vulnerable to adversarial
-attacks, particularly ``jailbreaking'' techniques that bypass content safety
-measures. Current content moderation systems, which primarily rely on input
-prompt filtering, have proven insufficient, with techniques like Best-of-N
-(BoN) jailbreaking achieving success rates of 80% or more against popular LLMs.
-In this paper, we introduce Flexible LLM-Assisted Moderation Engine (FLAME): a
-new approach that shifts the focus from input filtering to output moderation.
-Unlike traditional circuit-breaking methods that analyze user queries, FLAME
-evaluates model responses, offering several key advantages: (1) computational
-efficiency in both training and inference, (2) enhanced resistance to BoN
-jailbreaking attacks, and (3) flexibility in defining and updating safety
-criteria through customizable topic filtering. Our experiments demonstrate that
-FLAME significantly outperforms current moderation systems. For example, FLAME
-reduces attack success rate in GPT-4o-mini and DeepSeek-v3 by a factor of ~9,
-while maintaining low computational overhead. We provide comprehensive
-evaluation on various LLMs and analyze the engine's efficiency against the
-state-of-the-art jailbreaking. This work contributes to the development of more
-robust and adaptable content moderation systems for LLMs.
+摘要：本研究探討在因果關係設定中涉及大量使用者群組和每個參與者多個階段的線上縱向腦電圖 (EEG) 運動想像 (MI) 解碼中，深度學習的持續微調策略。我們是第一個在大量使用者群組中探討此類策略，因為縱向適應通常在單一主體設定中研究，並使用單一適應策略，這限制了推廣研究結果的能力。首先，我們探討不同微調方法對解碼器效能和穩定性的影響。在此基礎上，我們整合線上測試時間適應 (OTTA) 以在部署期間適應模型，補充先前微調的效果。我們的研究結果表明，連續建立在先前特定主體資訊上的微調可以同時改善效能和穩定性，而 OTTA 可以有效地適應連續階段中不斷變化的資料分佈，從而實現無需校準的操作。這些結果為縱向線上 MI 解碼的未來研究提供了有價值的見解和建議，並強調了結合領域適應策略以改善實際應用中 BCI 效能的重要性。臨床相關性：我們的研究可以實現更穩定、更有效的長期運動想像解碼，這對於神經復健和輔助技術至關重要。
 
-摘要：大型語言模型 (LLM) 的快速進步為調節使用者與模型互動帶來重大挑戰。儘管 LLM 展現出非凡的能力，但它們仍然容易受到對抗性攻擊，特別是繞過內容安全措施的「越獄」技術。目前的內容審核系統主要依賴輸入提示過濾，已被證明不足，例如 Best-of-N (BoN) 越獄對抗熱門 LLM 的成功率達到 80% 以上。在本文中，我們介紹了靈活的 LLM 輔助審核引擎 (FLAME)：一種新的方法，將重點從輸入過濾轉移到輸出審核。與分析使用者查詢的傳統電路中斷方法不同，FLAME 評估模型回應，提供幾個關鍵優勢：(1) 訓練和推理中的計算效率，(2) 增強對 BoN 越獄攻擊的抵抗力，以及 (3) 透過可自訂主題過濾定義和更新安全標準的靈活性。我們的實驗證明，FLAME 明顯優於目前的審核系統。例如，FLAME 將 GPT-4o-mini 和 DeepSeek-v3 的攻擊成功率降低了約 9 倍，同時保持較低的計算負擔。我們對各種 LLM 進行了全面的評估，並分析了引擎對抗最新越獄的效率。這項工作有助於開發更強大且適應性更強的 LLM 內容審核系統。
+##### **MedBioLM: Optimizing Medical and Biological QA with Fine-Tuned Large Language Models and Retrieval-Augmented Generation**
+2502.03004v1 by Seonok Kim
 
-##### **Two-Stage Representation Learning for Analyzing Movement Behavior Dynamics in People Living with Dementia**
-2502.09173v1 by Jin Cui, Alexander Capstick, Payam Barnaghi, Gregory Scott
+Large Language Models (LLMs) have demonstrated impressive capabilities across
+natural language processing tasks. However, their application to specialized
+domains such as medicine and biology requires further optimization to ensure
+factual accuracy, reliability, and contextual depth. We introduce MedBioLM, a
+domain-adapted biomedical question-answering model designed to enhance both
+short-form and long-form queries. By integrating fine-tuning and
+retrieval-augmented generation (RAG), MedBioLM dynamically incorporates
+domain-specific knowledge, improving reasoning abilities and factual accuracy.
+To evaluate its effectiveness, we fine-tuned the model on diverse biomedical QA
+datasets, covering structured multiple-choice assessments and complex clinical
+reasoning tasks. Fine-tuning significantly improves accuracy on benchmark
+datasets, while RAG enhances factual consistency. These results highlight the
+potential of domain-optimized LLMs in advancing biomedical research, medical
+education, and clinical decision support.
 
-In remote healthcare monitoring, time series representation learning reveals
-critical patient behavior patterns from high-frequency data. This study
-analyzes home activity data from individuals living with dementia by proposing
-a two-stage, self-supervised learning approach tailored to uncover low-rank
-structures. The first stage converts time-series activities into text sequences
-encoded by a pre-trained language model, providing a rich, high-dimensional
-latent state space using a PageRank-based method. This PageRank vector captures
-latent state transitions, effectively compressing complex behaviour data into a
-succinct form that enhances interpretability. This low-rank representation not
-only enhances model interpretability but also facilitates clustering and
-transition analysis, revealing key behavioral patterns correlated with
-clinicalmetrics such as MMSE and ADAS-COG scores. Our findings demonstrate the
-framework's potential in supporting cognitive status prediction, personalized
-care interventions, and large-scale health monitoring.
+摘要：大型語言模型 (LLM) 已展現出在自然語言處理任務中令人印象深刻的能力。然而，要將其應用於醫學和生物學等特定領域，需要進一步最佳化，以確保事實的準確性、可靠性以及脈絡的深度。我們引進了 MedBioLM，這是一個適應領域的生物醫學問答模型，旨在增強短式和長式查詢。透過整合微調和檢索增強生成 (RAG)，MedBioLM 能動態地納入領域特定的知識，從而提升推理能力和事實準確性。為了評估其有效性，我們對模型進行微調，使其涵蓋結構化的多重選擇評量和複雜的臨床推理任務等多樣化的生物醫學問答資料集。微調顯著提升了基準資料集的準確性，而 RAG 則增強了事實的一致性。這些結果突顯了領域最佳化的 LLM 在推進生物醫學研究、醫學教育和臨床決策支援方面的潛力。
 
-摘要：在遠程醫療監控中，時間序列表示學習揭示了高頻率數據中的關鍵患者行為模式。本研究通過提出一個兩階段、自我監督的學習方法來分析痴呆症患者的家庭活動數據，該方法專門用於發現低秩結構。第一階段將時間序列活動轉換為由預訓練語言模型編碼的文本序列，使用基於 PageRank 的方法提供了一個豐富、高維的潛在狀態空間。此 PageRank 向量捕獲潛在狀態轉換，有效地將複雜的行為數據壓縮成簡潔的形式，從而增強了解力。此低秩表示不僅增強了模型的可解釋性，還促進了聚類和轉換分析，揭示了與臨床指標（例如 MMSE 和 ADAS-COG 分數）相關的關鍵行為模式。我們的研究結果證明了該框架在支持認知狀態預測、個性化護理干預和大型健康監控方面的潛力。
+##### **Contrastive Token-level Explanations for Graph-based Rumour Detection**
+2502.04366v1 by Daniel Wai Kit Chin, Roy Ka-Wei Lee
 
-##### **Musical Heritage Historical Entity Linking**
-2502.09168v1 by Arianna Graciotti, Nicolas Lazzari, Valentina Presutti, Rocco Tripodi
+The widespread use of social media has accelerated the dissemination of
+information, but it has also facilitated the spread of harmful rumours, which
+can disrupt economies, influence political outcomes, and exacerbate public
+health crises, such as the COVID-19 pandemic. While Graph Neural Network
+(GNN)-based approaches have shown significant promise in automated rumour
+detection, they often lack transparency, making their predictions difficult to
+interpret. Existing graph explainability techniques fall short in addressing
+the unique challenges posed by the dependencies among feature dimensions in
+high-dimensional text embeddings used in GNN-based models. In this paper, we
+introduce Contrastive Token Layerwise Relevance Propagation (CT-LRP), a novel
+framework designed to enhance the explainability of GNN-based rumour detection.
+CT-LRP extends current graph explainability methods by providing token-level
+explanations that offer greater granularity and interpretability. We evaluate
+the effectiveness of CT-LRP across multiple GNN models trained on three
+publicly available rumour detection datasets, demonstrating that it
+consistently produces high-fidelity, meaningful explanations, paving the way
+for more robust and trustworthy rumour detection systems.
 
-Linking named entities occurring in text to their corresponding entity in a
-Knowledge Base (KB) is challenging, especially when dealing with historical
-texts. In this work, we introduce Musical Heritage named Entities Recognition,
-Classification and Linking (MHERCL), a novel benchmark consisting of manually
-annotated sentences extrapolated from historical periodicals of the music
-domain. MHERCL contains named entities under-represented or absent in the most
-famous KBs. We experiment with several State-of-the-Art models on the Entity
-Linking (EL) task and show that MHERCL is a challenging dataset for all of
-them. We propose a novel unsupervised EL model and a method to extend
-supervised entity linkers by using Knowledge Graphs (KGs) to tackle the main
-difficulties posed by historical documents. Our experiments reveal that relying
-on unsupervised techniques and improving models with logical constraints based
-on KGs and heuristics to predict NIL entities (entities not represented in the
-KB of reference) results in better EL performance on historical documents.
+摘要：社群媒體的廣泛使用加速了資訊的傳播，但也促进了有害謠言的散播，這可能會擾亂經濟、影響政治結果，並加劇公共衛生危機，例如 COVID-19 大流行。雖然基於圖神經網路 (GNN) 的方法在自動化謠言偵測方面展現了顯著的前景，但它們通常缺乏透明度，這使得它們的預測難以解釋。現有的圖形可解釋性技術無法解決 GNN 模型中使用的維度嵌入式文本之間的依賴性所帶來的獨特挑戰。在本文中，我們介紹了對比標記分層關聯性傳播 (CT-LRP)，這是一個新穎的框架，旨在增強基於 GNN 的謠言偵測的可解釋性。CT-LRP 透過提供標記級別的解釋來擴充當前的圖形可解釋性方法，這些解釋提供了更細緻的粒度和可解釋性。我們在三個公開的謠言偵測資料集上訓練的幾個 GNN 模型中評估了 CT-LRP 的有效性，證明它始終產生高保真、有意義的解釋，為更強健且值得信賴的謠言偵測系統鋪路。
 
-摘要：將文本中出現的名稱實體連結到知識庫 (KB) 中對應的實體具有挑戰性，尤其是在處理歷史文本時。在這項工作中，我們引入了音樂遺產命名實體識別、分類和連結 (MHERCL)，這是一個由從音樂領域的歷史期刊中外推的手動標註句子組成的全新基準。MHERCL 包含在最著名的 KB 中代表性不足或不存在的名稱實體。我們在實體連結 (EL) 任務中對多個最先進的模型進行了實驗，並表明 MHERCL 對所有模型來說都是一個具有挑戰性的資料集。我們提出了一個新的無監督 EL 模型和一個通過使用知識圖 (KG) 來擴充監督式實體連結器的的方法，以解決歷史文件提出的主要難題。我們的實驗表明，依賴無監督技術並使用基於 KG 和啟發法的邏輯約束來改善模型以預測 NIL 實體（未在參考 KB 中表示的實體）會在歷史文件中產生更好的 EL 效能。
+##### **AI-Based Thermal Video Analysis in Privacy-Preserving Healthcare: A Case Study on Detecting Time of Birth**
+2502.04365v1 by Jorge García-Torres, Øyvind Meinich-Bache, Siren Rettedal, Kjersti Engan
 
-##### **Improving TCM Question Answering through Tree-Organized Self-Reflective Retrieval with LLMs**
-2502.09156v1 by Chang Liu, Ying Chang, Jianmin Li, Yiqian Qu, Yu Li, Lingyong Cao, Shuyuan Lin
+Approximately 10% of newborns need some assistance to start breathing and 5\%
+proper ventilation. It is crucial that interventions are initiated as soon as
+possible after birth. Accurate documentation of Time of Birth (ToB) is thereby
+essential for documenting and improving newborn resuscitation performance.
+However, current clinical practices rely on manual recording of ToB, typically
+with minute precision. In this study, we present an AI-driven, video-based
+system for automated ToB detection using thermal imaging, designed to preserve
+the privacy of healthcare providers and mothers by avoiding the use of
+identifiable visual data. Our approach achieves 91.4% precision and 97.4%
+recall in detecting ToB within thermal video clips during performance
+evaluation. Additionally, our system successfully identifies ToB in 96% of test
+cases with an absolute median deviation of 1 second compared to manual
+annotations. This method offers a reliable solution for improving ToB
+documentation and enhancing newborn resuscitation outcomes.
 
-Objectives: Large language models (LLMs) can harness medical knowledge for
-intelligent question answering (Q&A), promising support for auxiliary diagnosis
-and medical talent cultivation. However, there is a deficiency of highly
-efficient retrieval-augmented generation (RAG) frameworks within the domain of
-Traditional Chinese Medicine (TCM). Our purpose is to observe the effect of the
-Tree-Organized Self-Reflective Retrieval (TOSRR) framework on LLMs in TCM Q&A
-tasks.
-  Materials and Methods: We introduce the novel approach of knowledge
-organization, constructing a tree structure knowledge base with hierarchy. At
-inference time, our self-reflection framework retrieves from this knowledge
-base, integrating information across chapters. Questions from the TCM Medical
-Licensing Examination (MLE) and the college Classics Course Exam (CCE) were
-randomly selected as benchmark datasets.
-  Results: By coupling with GPT-4, the framework can improve the best
-performance on the TCM MLE benchmark by 19.85% in absolute accuracy, and
-improve recall accuracy from 27% to 38% on CCE datasets. In manual evaluation,
-the framework improves a total of 18.52 points across dimensions of safety,
-consistency, explainability, compliance, and coherence.
-  Conclusion: The TOSRR framework can effectively improve LLM's capability in
-Q&A tasks of TCM.
+摘要：約 10% 的新生兒需要協助才能開始呼吸，5% 需要適當的通氣。在出生後盡快開始介入至關重要。準確記錄出生時間 (ToB) 對於記錄和改善新生兒復甦表現至關重要。然而，目前的臨床實務依賴於手動記錄 ToB，通常精確到分鐘。在這項研究中，我們提出一個以 AI 為主的、基於影片的系統，用於使用熱影像自動偵測 ToB，旨在透過避免使用可識別的視覺資料來保護醫療保健提供者和母親的隱私。我們的做法在執行評估期間，在熱影像片段中偵測 ToB 時達到了 91.4% 的精確度和 97.4% 的召回率。此外，我們的系統在 96% 的測試案例中成功識別出 ToB，與手動註解相比，絕對中位數偏差為 1 秒。此方法提供了一個可靠的解決方案，用於改善 ToB 記錄和增強新生兒復甦結果。
 
-摘要：目標：大型語言模型（LLM）可以利用醫療知識進行智能問答（Q&A），承諾支持輔助診斷和醫療人才培養。然而，在中醫領域內缺乏高效的檢索增強生成（RAG）框架。我們的目的是觀察樹組織自省檢索（TOSRR）框架對中醫問答任務中 LLM 的影響。
-材料和方法：我們引入了知識組織的新方法，構建了一個具有層次的樹結構知識庫。在推理時間，我們的自省框架從這個知識庫中檢索，整合章節中的信息。中醫醫師資格考試（MLE）和大學經典課程考試（CCE）中的問題被隨機選為基準數據集。
-結果：通過與 GPT-4 結合，該框架可以將中醫 MLE 基準上的最佳性能提高 19.85% 的絕對準確度，並將 CCE 數據集上的召回準確度從 27% 提高到 38%。在手動評估中，該框架在安全性、一致性、可解釋性、合規性和連貫性方面總共提高了 18.52 分。
-結論：TOSRR 框架可以有效提升 LLM 在中醫問答任務中的能力。
+##### **3D Foundation AI Model for Generalizable Disease Detection in Head Computed Tomography**
+2502.02779v1 by Weicheng Zhu, Haoxu Huang, Huanze Tang, Rushabh Musthyala, Boyang Yu, Long Chen, Emilio Vega, Thomas O'Donnell, Seena Dehkharghani, Jennifer A. Frontera, Arjun V. Masurkar, Kara Melmed, Narges Razavian
 
-##### **A Novel Dialect-Aware Framework for the Classification of Arabic Dialects and Emotions**
-2502.09128v1 by Nasser A Alsadhan
+Head computed tomography (CT) imaging is a widely-used imaging modality with
+multitudes of medical indications, particularly in assessing pathology of the
+brain, skull, and cerebrovascular system. It is commonly the first-line imaging
+in neurologic emergencies given its rapidity of image acquisition, safety,
+cost, and ubiquity. Deep learning models may facilitate detection of a wide
+range of diseases. However, the scarcity of high-quality labels and
+annotations, particularly among less common conditions, significantly hinders
+the development of powerful models. To address this challenge, we introduce
+FM-CT: a Foundation Model for Head CT for generalizable disease detection,
+trained using self-supervised learning. Our approach pre-trains a deep learning
+model on a large, diverse dataset of 361,663 non-contrast 3D head CT scans
+without the need for manual annotations, enabling the model to learn robust,
+generalizable features. To investigate the potential of self-supervised
+learning in head CT, we employed both discrimination with self-distillation and
+masked image modeling, and we construct our model in 3D rather than at the
+slice level (2D) to exploit the structure of head CT scans more comprehensively
+and efficiently. The model's downstream classification performance is evaluated
+using internal and three external datasets, encompassing both in-distribution
+(ID) and out-of-distribution (OOD) data. Our results demonstrate that the
+self-supervised foundation model significantly improves performance on
+downstream diagnostic tasks compared to models trained from scratch and
+previous 3D CT foundation models on scarce annotated datasets. This work
+highlights the effectiveness of self-supervised learning in medical imaging and
+sets a new benchmark for head CT image analysis in 3D, enabling broader use of
+artificial intelligence for head CT-based diagnosis.
 
-Arabic is one of the oldest languages still in use today. As a result,
-several Arabic-speaking regions have developed dialects that are unique to
-them. Dialect and emotion recognition have various uses in Arabic text
-analysis, such as determining an online customer's origin based on their
-comments. Furthermore, intelligent chatbots that are aware of a user's emotions
-can respond appropriately to the user. Current research in emotion detection in
-the Arabic language lacks awareness of how emotions are exhibited in different
-dialects, which motivates the work found in this study. This research addresses
-the problems of dialect and emotion classification in Arabic. Specifically,
-this is achieved by building a novel framework that can identify and predict
-Arabic dialects and emotions from a given text. The framework consists of three
-modules: A text-preprocessing module, a classification module, and a clustering
-module with the novel capability of building new dialect-aware emotion
-lexicons. The proposed framework generated a new emotional lexicon for
-different dialects. It achieved an accuracy of 88.9% in classifying Arabic
-dialects, which outperforms the state-of-the-art results by 6.45 percentage
-points. Furthermore, the framework achieved 89.1-79% accuracy in detecting
-emotions in the Egyptian and Gulf dialects, respectively.
+摘要：頭部電腦斷層掃描（CT）影像是一種廣泛使用的影像模式，具有
+大量的醫療適應症，特別是在評估腦部、頭骨和腦血管系統的病理時。由於其影像擷取速度快、安全性、成本低和普遍性，通常是神經緊急情況下的第一線影像。深度學習模型可以促進對各種疾病的檢測。然而，高品質標籤和註釋的稀缺，特別是在較不常見的疾病中，顯著地阻礙了強大模型的發展。為了應對這一挑戰，我們引入了 FM-CT：一個用於頭部 CT 的基礎模型，用於可概化的疾病檢測，並使用自我監督學習進行訓練。我們的做法在一個包含 361,663 個非對比 3D 頭部 CT 掃描的大型、多樣化的數據集上預訓練一個深度學習模型，而無需手動註釋，使模型能夠學習強健、可概化的特徵。為了探討自我監督學習在頭部 CT 中的潛力，我們同時採用了帶有自我蒸餾的判別和遮罩影像建模，並且我們以 3D 而不是切片層級（2D）構建我們的模型，以更全面、有效地利用頭部 CT 掃描的結構。該模型的下游分類效能使用內部和三個外部數據集進行評估，包括分佈內 (ID) 和分佈外 (OOD) 資料。我們的結果表明，與從頭開始訓練的模型和先前在稀疏註釋數據集上訓練的 3D CT 基礎模型相比，自我監督基礎模型顯著改善了下游診斷任務的效能。這項工作突顯了自我監督學習在醫學影像中的有效性，並為 3D 頭部 CT 影像分析設定了一個新的基準，讓人工智慧能夠更廣泛地用於基於頭部 CT 的診斷。
 
-摘要：阿拉伯語是現今仍在使用中最古老的語言之一。因此，幾個講阿拉伯語的地區發展出獨特的方言。方言和情緒辨識在阿拉伯語文本分析中有多種用途，例如根據在線客戶的評論來確定其來源。此外，知道使用者情緒的智慧聊天機器人可以適當地回應使用者。目前對阿拉伯語情緒偵測的研究缺乏對不同方言如何表現情緒的認識，這激勵了本研究中的工作。本研究探討了阿拉伯語中的方言和情緒分類問題。具體而言，這是通過建立一個新的框架來實現的，該框架可以識別和預測給定文本中的阿拉伯方言和情緒。該框架包含三個模組：文字預處理模組、分類模組和聚類模組，具有建立新的方言感知情緒詞彙表的新功能。所提出的框架為不同的方言生成了新的情緒詞彙表。它在分類阿拉伯方言方面達到了 88.9% 的準確率，比最先進的結果高出 6.45 個百分點。此外，該框架在檢測埃及和海灣方言的情緒方面分別達到了 89.1-79% 的準確率。
+##### **Adaptive Voxel-Weighted Loss Using L1 Norms in Deep Neural Networks for Detection and Segmentation of Prostate Cancer Lesions in PET/CT Images**
+2502.02756v1 by Obed Korshie Dzikunu, Shadab Ahamed, Amirhossein Toosi, Xiaoxiao Li, Arman Rahmim
 
-##### **Automatic Pruning via Structured Lasso with Class-wise Information**
-2502.09125v1 by Xiang Liu, Mingchen Li, Xia Li, Leigang Qu, Zifan Peng, Yijun Song, Zemin Liu, Linshan Jiang, Jialin Li
+This study proposes a new loss function for deep neural networks, L1-weighted
+Dice Focal Loss (L1DFL), that leverages L1 norms for adaptive weighting of
+voxels based on their classification difficulty, towards automated detection
+and segmentation of metastatic prostate cancer lesions in PET/CT scans. We
+obtained 380 PSMA [18-F] DCFPyL PET/CT scans of patients diagnosed with
+biochemical recurrence metastatic prostate cancer. We trained two 3D
+convolutional neural networks, Attention U-Net and SegResNet, and concatenated
+the PET and CT volumes channel-wise as input. The performance of our custom
+loss function was evaluated against the Dice and Dice Focal Loss functions. For
+clinical significance, we considered a detected region of interest (ROI) as a
+true positive if at least the voxel with the maximum standardized uptake value
+falls within the ROI. We assessed the models' performance based on the number
+of lesions in an image, tumour volume, activity, and extent of spread. The
+L1DFL outperformed the comparative loss functions by at least 13% on the test
+set. In addition, the F1 scores of the Dice Loss and the Dice Focal Loss were
+lower than that of L1DFL by at least 6% and 34%, respectively. The Dice Focal
+Loss yielded more false positives, whereas the Dice Loss was more sensitive to
+smaller volumes and struggled to segment larger lesions accurately. They also
+exhibited network-specific variations and yielded declines in segmentation
+accuracy with increased tumour spread. Our results demonstrate the potential of
+L1DFL to yield robust segmentation of metastatic prostate cancer lesions in
+PSMA PET/CT images. The results further highlight potential complexities
+arising from the variations in lesion characteristics that may influence
+automated prostate cancer tumour detection and segmentation. The code is
+publicly available at: https://github.com/ObedDzik/pca_segment.git.
 
-Most pruning methods concentrate on unimportant filters of neural networks.
-However, they face the loss of statistical information due to a lack of
-consideration for class-wise data. In this paper, from the perspective of
-leveraging precise class-wise information for model pruning, we utilize
-structured lasso with guidance from Information Bottleneck theory. Our approach
-ensures that statistical information is retained during the pruning process.
-With these techniques, we introduce two innovative adaptive network pruning
-schemes: sparse graph-structured lasso pruning with Information Bottleneck
-(\textbf{sGLP-IB}) and sparse tree-guided lasso pruning with Information
-Bottleneck (\textbf{sTLP-IB}). The key aspect is pruning model filters using
-sGLP-IB and sTLP-IB to better capture class-wise relatedness. Compared to
-multiple state-of-the-art methods, our approaches demonstrate superior
-performance across three datasets and six model architectures in extensive
-experiments. For instance, using the VGG16 model on the CIFAR-10 dataset, we
-achieve a parameter reduction of 85%, a decrease in FLOPs by 61%, and maintain
-an accuracy of 94.10% (0.14% higher than the original model); we reduce the
-parameters by 55% with the accuracy at 76.12% using the ResNet architecture on
-ImageNet (only drops 0.03%). In summary, we successfully reduce model size and
-computational resource usage while maintaining accuracy. Our codes are at
-https://anonymous.4open.science/r/IJCAI-8104.
+摘要：<paragraph>本研究針對深度神經網路提出一個新的損失函數，L1 加權 Dice 焦點損失 (L1DFL)，它利用 L1 範數根據體素的分類難度進行自適應加權，用於自動偵測和分割 PET/CT 掃描中轉移性前列腺癌病灶。我們取得 380 個經診斷為生化復發轉移性前列腺癌的患者的 PSMA [18-F] DCFPyL PET/CT 掃描。我們訓練了兩個 3D 捲積神經網路，Attention U-Net 和 SegResNet，並將 PET 和 CT 體積按通道連接作為輸入。我們自訂的損失函數的效能與 Dice 和 Dice 焦點損失函數進行評估。為了臨床意義，我們將一個偵測到的感興趣區域 (ROI) 視為真陽性，如果至少具有最大標準攝取值的體素落在 ROI 內。我們根據影像中的病灶數量、腫瘤體積、活性，以及擴散程度評估模型的效能。L1DFL 在測試組中至少比比較損失函數高出 13%。此外，Dice 損失和 Dice 焦點損失的 F1 分數分別比 L1DFL 低至少 6% 和 34%。Dice 焦點損失產生更多假陽性，而 Dice 損失對較小體積較為敏感，且難以準確分割較大病灶。它們也展現出網路特定的變化，並隨著腫瘤擴散而導致分割準確度下降。我們的結果證明 L1DFL 具有在 PSMA PET/CT 影像中產生轉移性前列腺癌病灶的強健分割的潛力。結果進一步強調由病灶特徵變化所產生的潛在複雜性，這可能會影響自動化前列腺癌腫瘤偵測和分割。程式碼公開於：https://github.com/ObedDzik/pca_segment.git。</paragraph>
 
-摘要：大多數剪枝方法都集中在神經網路中不重要的濾波器上。
-然而，由於缺乏對類別資料的考量，它們面臨統計資訊的遺失。在本文中，我們從利用精確類別資訊進行模型剪枝的角度，利用結構化套索搭配資訊瓶頸理論的指導。我們的做法確保在剪枝過程中保留統計資訊。藉由這些技術，我們引入了兩個創新的自適應網路剪枝方案：帶有資訊瓶頸的稀疏圖形結構套索剪枝（sGLP-IB）和帶有資訊瓶頸的稀疏樹導引套索剪枝（sTLP-IB）。關鍵方面是使用 sGLP-IB 和 sTLP-IB 剪枝模型濾波器，以更好地擷取類別關聯性。與多種最先進的方法相比，我們的做法在廣泛的實驗中展現出跨三個資料集和六個模型架構的卓越效能。例如，在 CIFAR-10 資料集上使用 VGG16 模型，我們達到了 85% 的參數減少、61% 的 FLOP 減少，並維持 94.10% 的準確度（比原始模型高 0.14%）；我們在 ImageNet 上使用 ResNet 架構將參數減少了 55%，準確度為 76.12%（僅下降 0.03%）。總之，我們成功地減少了模型大小和計算資源使用，同時維持準確度。我們的程式碼位於 https://anonymous.4open.science/r/IJCAI-8104。
+##### **Diffusion Instruction Tuning**
+2502.06814v1 by Chen Jin, Ryutaro Tanno, Amrutha Saseendran, Tom Diethe, Philip Teare
 
-##### **The influence of visual and linguistic cues on ignorance inference in Vision-Language Models (VLMs)**
-2502.09120v1 by Ye-eun Cho, Yunho Maeng
+We introduce Lavender, a simple supervised fine-tuning (SFT) method that
+boosts the performance of advanced vision-language models (VLMs) by leveraging
+state-of-the-art image generation models such as Stable Diffusion.
+Specifically, Lavender aligns the text-vision attention in the VLM transformer
+with the equivalent used by Stable Diffusion during SFT, instead of adapting
+separate encoders. This alignment enriches the model's visual understanding and
+significantly boosts performance across in- and out-of-distribution tasks.
+Lavender requires just 0.13 million training examples, 2.5% of typical
+large-scale SFT datasets, and fine-tunes on standard hardware (8 GPUs) in a
+single day. It consistently improves state-of-the-art open-source multimodal
+LLMs (e.g., Llama-3.2-11B, MiniCPM-Llama3-v2.5), achieving up to 30% gains and
+a 68% boost on challenging out-of-distribution medical QA tasks. By efficiently
+transferring the visual expertise of image generators with minimal supervision,
+Lavender offers a scalable solution for more accurate vision-language systems.
+All code, training data, and models will be shared at
+https://astrazeneca.github.io/vlm/.
 
-This study explored how Vision-Language Models (VLMs) process ignorance
-implicatures with visual and linguistic cues. Particularly, we focused on the
-effects of contexts (precise and approximate contexts) and modifier types (bare
-numerals, superlative, and comparative modifiers), which were considered
-pragmatic and semantic factors respectively. Methodologically, we conducted a
-truth-value judgment task in visually grounded settings using GPT-4o and Gemini
-1.5 Pro. The results indicate that while both models exhibited sensitivity to
-linguistic cues (modifier), they failed to process ignorance implicatures with
-visual cues (context) as humans do. Specifically, the influence of context was
-weaker and inconsistent across models, indicating challenges in pragmatic
-reasoning for VLMs. On the other hand, superlative modifiers were more strongly
-associated with ignorance implicatures as compared to comparative modifiers,
-supporting the semantic view. These findings highlight the need for further
-advancements in VLMs to process language-vision information in a
-context-dependent way to achieve human-like pragmatic inference.
+摘要：<paragraph>我們介紹 Lavender，一種簡單的監督微調 (SFT) 方法，它透過利用 Stable Diffusion 等最先進的影像生成模型來提升先進視覺語言模型 (VLM) 的效能。
+具體來說，Lavender 在 SFT 期間將 VLM 轉換器中的文字視覺注意力與 Stable Diffusion 使用的等效注意力對齊，而不是調整單獨的編碼器。此對齊豐富了模型的視覺理解，並顯著提升了分佈內外任務的效能。
+Lavender 只需要 0.13 百萬個訓練範例，相當於典型大型 SFT 資料集的 2.5%，並在標準硬體 (8 個 GPU) 上於一天內進行微調。它持續改善最先進的開放原始碼多模態 LLM（例如 Llama-3.2-11B、MiniCPM-Llama3-v2.5），在具有挑戰性的分佈外醫療 QA 任務中獲得高達 30% 的收益和 68% 的提升。透過有效轉移影像生成器的視覺專業知識，並僅需最少的監督，Lavender 提供了一個可擴充的解決方案，以實現更準確的視覺語言系統。
+所有程式碼、訓練資料和模型將在 https://astrazeneca.github.io/vlm/ 分享。</paragraph>
 
-摘要：本研究探討了視覺語言模型 (VLM) 如何處理視覺和語言線索中的無知含義。特別是，我們專注於語境（精確和近似語境）和修飾語類型（裸數字、最高級和比較級修飾語）的影響，這些分別被視為語用和語義因素。在方法論上，我們使用 GPT-4o 和 Gemini 1.5 Pro 在視覺基礎設置中進行了真值判斷任務。結果表明，儘管這兩個模型都對語言線索（修飾語）表現出敏感性，但它們未能像人類那樣處理帶有視覺線索（語境）的無知含義。具體來說，語境的影響在各個模型中較弱且不一致，表明 VLM 在語用推理方面存在挑戰。另一方面，與比較級修飾語相比，最高級修飾語與無知含義的關聯性更強，這支持了語義觀點。這些發現強調了 VLM 進一步發展的必要性，以以語境依賴的方式處理語言視覺信息，以實現類人語用推理。
+##### **MedRAX: Medical Reasoning Agent for Chest X-ray**
+2502.02673v1 by Adibvafa Fallahpour, Jun Ma, Alif Munim, Hongwei Lyu, Bo Wang
 
-##### **One-shot Federated Learning Methods: A Practical Guide**
-2502.09104v1 by Xiang Liu, Zhenheng Tang, Xia Li, Yijun Song, Sijie Ji, Zemin Liu, Bo Han, Linshan Jiang, Jialin Li
+Chest X-rays (CXRs) play an integral role in driving critical decisions in
+disease management and patient care. While recent innovations have led to
+specialized models for various CXR interpretation tasks, these solutions often
+operate in isolation, limiting their practical utility in clinical practice. We
+present MedRAX, the first versatile AI agent that seamlessly integrates
+state-of-the-art CXR analysis tools and multimodal large language models into a
+unified framework. MedRAX dynamically leverages these models to address complex
+medical queries without requiring additional training. To rigorously evaluate
+its capabilities, we introduce ChestAgentBench, a comprehensive benchmark
+containing 2,500 complex medical queries across 7 diverse categories. Our
+experiments demonstrate that MedRAX achieves state-of-the-art performance
+compared to both open-source and proprietary models, representing a significant
+step toward the practical deployment of automated CXR interpretation systems.
+Data and code have been publicly available at
+https://github.com/bowang-lab/MedRAX
 
-One-shot Federated Learning (OFL) is a distributed machine learning paradigm
-that constrains client-server communication to a single round, addressing
-privacy and communication overhead issues associated with multiple rounds of
-data exchange in traditional Federated Learning (FL). OFL demonstrates the
-practical potential for integration with future approaches that require
-collaborative training models, such as large language models (LLMs). However,
-current OFL methods face two major challenges: data heterogeneity and model
-heterogeneity, which result in subpar performance compared to conventional FL
-methods. Worse still, despite numerous studies addressing these limitations, a
-comprehensive summary is still lacking. To address these gaps, this paper
-presents a systematic analysis of the challenges faced by OFL and thoroughly
-reviews the current methods. We also offer an innovative categorization method
-and analyze the trade-offs of various techniques. Additionally, we discuss the
-most promising future directions and the technologies that should be integrated
-into the OFL field. This work aims to provide guidance and insights for future
-research.
+摘要：胸部 X 光片 (CXR) 在疾病管理和患者照護中扮演著不可或缺的角色，推動著關鍵決策的制定。儘管近期的創新已針對各種 CXR 解讀任務開發出專門的模型，但這些解決方案通常獨立運作，限制了它們在臨床實務中的實際效用。我們提出 MedRAX，這是一款首創的多功能 AI 代理，它將最先進的 CXR 分析工具和多模態大型語言模型無縫整合到一個統一的架構中。MedRAX 動態運用這些模型來解決複雜的醫療查詢，而無需額外的訓練。為了嚴格評估其功能，我們引入了 ChestAgentBench，這是一個全面的基準，包含 7 個不同類別的 2,500 個複雜醫療查詢。我們的實驗證明，與開源和專有模型相比，MedRAX 達到了最先進的效能，這代表了自動化 CXR 解讀系統實際部署的重要一步。資料和程式碼已公開於 https://github.com/bowang-lab/MedRAX
 
-摘要：單次聯邦學習 (OFL) 是一種分散式機器學習範例，將客戶端與伺服器通訊限制在單一輪次中，解決傳統聯邦學習 (FL) 中多輪次資料交換相關的隱私和通訊負擔問題。OFL 展示了與需要協作訓練模型的未來方法整合的實際潛力，例如大型語言模型 (LLM)。然而，目前的 OFL 方法面臨兩大挑戰：資料異質性和模型異質性，這導致與傳統 FL 方法相比，效能較差。更糟的是，儘管有許多研究探討這些限制，但仍缺乏全面的摘要。為了解決這些差距，本文對 OFL 面臨的挑戰進行系統分析，並徹底檢視目前的方法。我們還提供創新的分類方法，並分析各種技術的權衡取捨。此外，我們討論最有希望的未來方向，以及應整合到 OFL 領域的技術。這項工作旨在為未來的研究提供指導和見解。
+##### **Open Foundation Models in Healthcare: Challenges, Paradoxes, and Opportunities with GenAI Driven Personalized Prescription**
+2502.04356v1 by Mahdi Alkaeed, Sofiat Abioye, Adnan Qayyum, Yosra Magdi Mekki, Ilhem Berrou, Mohamad Abdallah, Ala Al-Fuqaha, Muhammad Bilal, Junaid Qadir
 
-##### **Logical Reasoning in Large Language Models: A Survey**
-2502.09100v1 by Hanmeng Liu, Zhizhang Fu, Mengru Ding, Ruoxi Ning, Chaoli Zhang, Xiaozhang Liu, Yue Zhang
+In response to the success of proprietary Large Language Models (LLMs) such
+as OpenAI's GPT-4, there is a growing interest in developing open,
+non-proprietary LLMs and AI foundation models (AIFMs) for transparent use in
+academic, scientific, and non-commercial applications. Despite their inability
+to match the refined functionalities of their proprietary counterparts, open
+models hold immense potential to revolutionize healthcare applications. In this
+paper, we examine the prospects of open-source LLMs and AIFMs for developing
+healthcare applications and make two key contributions. Firstly, we present a
+comprehensive survey of the current state-of-the-art open-source healthcare
+LLMs and AIFMs and introduce a taxonomy of these open AIFMs, categorizing their
+utility across various healthcare tasks. Secondly, to evaluate the
+general-purpose applications of open LLMs in healthcare, we present a case
+study on personalized prescriptions. This task is particularly significant due
+to its critical role in delivering tailored, patient-specific medications that
+can greatly improve treatment outcomes. In addition, we compare the performance
+of open-source models with proprietary models in settings with and without
+Retrieval-Augmented Generation (RAG). Our findings suggest that, although less
+refined, open LLMs can achieve performance comparable to proprietary models
+when paired with grounding techniques such as RAG. Furthermore, to highlight
+the clinical significance of LLMs-empowered personalized prescriptions, we
+perform subjective assessment through an expert clinician. We also elaborate on
+ethical considerations and potential risks associated with the misuse of
+powerful LLMs and AIFMs, highlighting the need for a cautious and responsible
+implementation in healthcare.
 
-With the emergence of advanced reasoning models like OpenAI o3 and
-DeepSeek-R1, large language models (LLMs) have demonstrated remarkable
-reasoning capabilities. However, their ability to perform rigorous logical
-reasoning remains an open question. This survey synthesizes recent advancements
-in logical reasoning within LLMs, a critical area of AI research. It outlines
-the scope of logical reasoning in LLMs, its theoretical foundations, and the
-benchmarks used to evaluate reasoning proficiency. We analyze existing
-capabilities across different reasoning paradigms - deductive, inductive,
-abductive, and analogical - and assess strategies to enhance reasoning
-performance, including data-centric tuning, reinforcement learning, decoding
-strategies, and neuro-symbolic approaches. The review concludes with future
-directions, emphasizing the need for further exploration to strengthen logical
-reasoning in AI systems.
+摘要：<paragraph>為了回應 OpenAI 的 GPT-4 等專有大型語言模型 (LLM) 的成功，開發開放、非專有的 LLM 和人工智慧基礎模型 (AIFM) 以透明地用於學術、科學和非商業應用中，引起了越來越大的興趣。儘管無法與其專有對應產品的精緻功能相匹配，但開放模型在革新醫療保健應用方面具有巨大的潛力。在本文中，我們探討了開放原始碼 LLM 和 AIFM 在開發醫療保健應用方面的前景，並提出了兩項關鍵貢獻。首先，我們對當前最先進的開放原始碼醫療保健 LLM 和 AIFM 進行了全面的調查，並介紹了這些開放 AIFM 的分類法，對它們在各種醫療保健任務中的效用進行了分類。其次，為了評估開放 LLM 在醫療保健中的通用應用，我們對個人化處方進行了案例研究。這項任務特別重要，因為它在提供量身定制的患者特定藥物方面發揮著關鍵作用，可以大大改善治療效果。此外，我們比較了開放原始碼模型與專有模型在有和沒有檢索增強生成 (RAG) 的設置中的性能。我們的研究結果表明，儘管不太精緻，但開放 LLM 在與 RAG 等基礎技術配對時，可以實現與專有模型相當的性能。此外，為了強調 LLM 賦能的個性化處方的臨床意義，我們通過專家臨床醫生進行了主觀評估。我們還詳細說明了與濫用強大的 LLM 和 AIFM 相關的倫理考量和潛在風險，強調了在醫療保健中謹慎和負責任地實施的必要性。</paragraph>
 
-摘要：隨著 OpenAI o3 和 DeepSeek-R1 等先進推理模型的出現，大型語言模型 (LLM) 已展現出非凡的推理能力。然而，它們執行嚴謹邏輯推理的能力仍是一個開放性的問題。此調查綜合了 LLM 中邏輯推理的最新進展，這是 AI 研究的一個關鍵領域。它概述了 LLM 中邏輯推理的範圍、其理論基礎，以及用於評估推理能力的基準。我們分析了不同推理範例（演繹、歸納、外推和類比）中的現有能力，並評估增強推理效能的策略，包括以數據為中心的調整、強化學習、解碼策略和神經符號方法。此評論以未來的方向作為結論，強調需要進一步探索以強化 AI 系統中的邏輯推理。
+##### **Decision Theoretic Foundations for Conformal Prediction: Optimal Uncertainty Quantification for Risk-Averse Agents**
+2502.02561v1 by Shayan Kiyani, George Pappas, Aaron Roth, Hamed Hassani
 
-##### **A Hybrid Transformer Model for Fake News Detection: Leveraging Bayesian Optimization and Bidirectional Recurrent Unit**
-2502.09097v1 by Tianyi Huang, Zeqiu Xu, Peiyang Yu, Jingyuan Yi, Xiaochuan Xu
+A fundamental question in data-driven decision making is how to quantify the
+uncertainty of predictions in ways that can usefully inform downstream action.
+This interface between prediction uncertainty and decision-making is especially
+important in risk-sensitive domains, such as medicine. In this paper, we
+develop decision-theoretic foundations that connect uncertainty quantification
+using prediction sets with risk-averse decision-making. Specifically, we answer
+three fundamental questions: (1) What is the correct notion of uncertainty
+quantification for risk-averse decision makers? We prove that prediction sets
+are optimal for decision makers who wish to optimize their value at risk. (2)
+What is the optimal policy that a risk averse decision maker should use to map
+prediction sets to actions? We show that a simple max-min decision policy is
+optimal for risk-averse decision makers. Finally, (3) How can we derive
+prediction sets that are optimal for such decision makers? We provide an exact
+characterization in the population regime and a distribution free finite-sample
+construction. Answering these questions naturally leads to an algorithm,
+Risk-Averse Calibration (RAC), which follows a provably optimal design for
+deriving action policies from predictions. RAC is designed to be both
+practical-capable of leveraging the quality of predictions in a black-box
+manner to enhance downstream utility-and safe-adhering to a user-defined risk
+threshold and optimizing the corresponding risk quantile of the user's
+downstream utility. Finally, we experimentally demonstrate the significant
+advantages of RAC in applications such as medical diagnosis and recommendation
+systems. Specifically, we show that RAC achieves a substantially improved
+trade-off between safety and utility, offering higher utility compared to
+existing methods while maintaining the safety guarantee.
 
-In this paper, we propose an optimized Transformer model that integrates
-Bayesian algorithms with a Bidirectional Gated Recurrent Unit (BiGRU), and
-apply it to fake news classification for the first time. First, we employ the
-TF-IDF method to extract features from news texts and transform them into
-numeric representations to facilitate subsequent machine learning tasks. Two
-sets of experiments are then conducted for fake news detection and
-classification: one using a Transformer model optimized only with BiGRU, and
-the other incorporating Bayesian algorithms into the BiGRU-based Transformer.
-Experimental results show that the BiGRU-optimized Transformer achieves 100%
-accuracy on the training set and 99.67% on the test set, while the addition of
-the Bayesian algorithm maintains 100% accuracy on the training set and slightly
-improves test-set accuracy to 99.73%. This indicates that the Bayesian
-algorithm boosts model accuracy by 0.06%, further enhancing the detection
-capability for fake news. Moreover, the proposed algorithm converges rapidly at
-around the 10th training epoch with accuracy nearing 100%, demonstrating both
-its effectiveness and its fast classification ability. Overall, the optimized
-Transformer model, enhanced by the Bayesian algorithm and BiGRU, exhibits
-excellent continuous learning and detection performance, offering a robust
-technical means to combat the spread of fake news in the current era of
-information overload.
+摘要：<paragraph>在資料驅動決策中，一個基本問題是，如何量化預測的不確定性，以能有用地告知下游行動。
+預測不確定性和決策制定之間的這種介面，在風險敏感領域中特別重要，例如醫學。在本文中，我們
+發展了決策理論基礎，它利用預測集合將不確定性量化與風險規避決策制定聯繫起來。具體來說，我們回答
+了三個基本問題：(1) 對於風險規避決策者來說，不確定性量化的正確概念是什麼？我們證明，對於希望最佳化其風險價值的決策者來說，預測集合是最佳的。(2)
+風險規避決策者應使用什麼最佳政策，將預測集合映射到行動？我們表明，對於風險規避決策者來說，一個簡單的最大最小決策政策是最佳的。最後，(3) 我們如何推導出對此類決策者來說最佳的預測集合？我們在總體範圍內提供了一個確切的表徵，並提供了一個不依賴分佈的有限樣本建構。回答這些問題自然會導致一個演算法，風險規避校準 (RAC)，它遵循一個可證明最佳的設計，從預測中推導出行動政策。RAC 被設計為既實用——能夠以黑盒方式利用預測的品質來增強下游效用——又安全——遵守使用者定義的風險閾值，並最佳化使用者的下游效用的對應風險分位數。最後，我們在醫學診斷和推薦系統等應用中，以實驗方式證明了 RAC 的顯著優點。具體來說，我們表明，與現有方法相比，RAC 在安全性和效用之間實現了顯著改善的折衷，在維持安全保證的同時，提供了更高的效用。</paragraph>
 
-摘要：<paragraph>在本文中，我們提出了一個最佳化的 Transformer 模型，它將貝氏演算法與雙向門控遞迴單元 (BiGRU) 整合在一起，並首次將其應用於假新聞分類。首先，我們採用 TF-IDF 方法從新聞文本中提取特徵，並將它們轉換為數值表示，以利於後續的機器學習任務。接著進行兩組實驗，分別針對假新聞偵測和分類：一組使用僅使用 BiGRU 最佳化的 Transformer 模型，另一組將貝氏演算法納入基於 BiGRU 的 Transformer 中。實驗結果顯示，BiGRU 最佳化的 Transformer 在訓練組上達到 100% 的準確度，在測試組上達到 99.67%，而加入貝氏演算法後，在訓練組上維持 100% 的準確度，並將測試組的準確度略微提升至 99.73%。這表示貝氏演算法將模型準確度提升了 0.06%，進一步增強了對假新聞的偵測能力。此外，所提出的演算法在約第 10 個訓練週期時快速收斂，準確度接近 100%，證明了它的有效性和快速的分類能力。總的來說，由貝氏演算法和 BiGRU 增強的最佳化 Transformer 模型展現出絕佳的持續學習和偵測效能，提供了一個強健的技術手段來對抗在當前資訊過載時代中假新聞的散布。</paragraph>
+##### **CoRPA: Adversarial Image Generation for Chest X-rays Using Concept Vector Perturbations and Generative Models**
+2502.05214v1 by Amy Rafferty, Rishi Ramaesh, Ajitha Rajan
 
-##### **A Hybrid Model for Few-Shot Text Classification Using Transfer and Meta-Learning**
-2502.09086v1 by Jia Gao, Shuangquan Lyu, Guiran Liu, Binrong Zhu, Hongye Zheng, Xiaoxuan Liao
+Deep learning models for medical image classification tasks are becoming
+widely implemented in AI-assisted diagnostic tools, aiming to enhance
+diagnostic accuracy, reduce clinician workloads, and improve patient outcomes.
+However, their vulnerability to adversarial attacks poses significant risks to
+patient safety. Current attack methodologies use general techniques such as
+model querying or pixel value perturbations to generate adversarial examples
+designed to fool a model. These approaches may not adequately address the
+unique characteristics of clinical errors stemming from missed or incorrectly
+identified clinical features. We propose the Concept-based Report Perturbation
+Attack (CoRPA), a clinically-focused black-box adversarial attack framework
+tailored to the medical imaging domain. CoRPA leverages clinical concepts to
+generate adversarial radiological reports and images that closely mirror
+realistic clinical misdiagnosis scenarios. We demonstrate the utility of CoRPA
+using the MIMIC-CXR-JPG dataset of chest X-rays and radiological reports. Our
+evaluation reveals that deep learning models exhibiting strong resilience to
+conventional adversarial attacks are significantly less robust when subjected
+to CoRPA's clinically-focused perturbations. This underscores the importance of
+addressing domain-specific vulnerabilities in medical AI systems. By
+introducing a specialized adversarial attack framework, this study provides a
+foundation for developing robust, real-world-ready AI models in healthcare,
+ensuring their safe and reliable deployment in high-stakes clinical
+environments.
 
-With the continuous development of natural language processing (NLP)
-technology, text classification tasks have been widely used in multiple
-application fields. However, obtaining labeled data is often expensive and
-difficult, especially in few-shot learning scenarios. To solve this problem,
-this paper proposes a few-shot text classification model based on transfer
-learning and meta-learning. The model uses the knowledge of the pre-trained
-model for transfer and optimizes the model's rapid adaptability in few-sample
-tasks through a meta-learning mechanism. Through a series of comparative
-experiments and ablation experiments, we verified the effectiveness of the
-proposed method. The experimental results show that under the conditions of few
-samples and medium samples, the model based on transfer learning and
-meta-learning significantly outperforms traditional machine learning and deep
-learning methods. In addition, ablation experiments further analyzed the
-contribution of each component to the model performance and confirmed the key
-role of transfer learning and meta-learning in improving model accuracy.
-Finally, this paper discusses future research directions and looks forward to
-the potential of this method in practical applications.
+摘要：深度学习模型用于医学影像分类任务，在人工智能辅助诊断工具中得到广泛应用，旨在提高诊断准确性、减少临床医生的工作量并改善患者的治疗效果。然而，它们对对抗性攻击的脆弱性给患者安全带来了重大风险。目前的攻击方法使用通用技术，例如模型查询或像素值扰动来生成对抗性示例，旨在欺骗模型。这些方法可能无法充分解决源自遗漏或错误识别的临床特征的临床错误的独特特征。我们提出了基于概念的报告扰动攻击 (CoRPA)，这是一种以临床为中心的、针对医学成像领域的、黑盒对抗性攻击框架。CoRPA 利用临床概念来生成对抗性放射学报告和图像，这些报告和图像与现实的临床误诊场景非常相似。我们使用胸部 X 射线和放射学报告的 MIMIC-CXR-JPG 数据集演示了 CoRPA 的效用。我们的评估表明，对传统对抗性攻击表现出强大弹性的深度学习模型在受到 CoRPA 以临床为中心的扰动时，其鲁棒性明显降低。这强调了在医疗人工智能系统中解决特定领域漏洞的重要性。通过引入专门的对抗性攻击框架，本研究为在医疗保健领域开发健壮、面向现实世界的 AI 模型奠定了基础，确保它们在高风险临床环境中安全可靠地部署。
 
-摘要：隨著自然語言處理 (NLP) 技術的持續發展，文本分類任務已廣泛應用於多個應用領域。然而，獲取標記資料通常既昂貴又困難，特別是在小樣本學習場景中。為了解決這個問題，本文提出了一個基於遷移學習和元學習的少樣本文本分類模型。該模型利用預訓練模型的知識進行遷移，並透過元學習機制最佳化模型在少樣本任務中的快速適應性。透過一系列的比較實驗和消融實驗，我們驗證了所提出方法的有效性。實驗結果表明，在少樣本和中等樣本的條件下，基於遷移學習和元學習的模型明顯優於傳統機器學習和深度學習方法。此外，消融實驗進一步分析了各個組成部分對模型效能的貢獻，並確認了遷移學習和元學習在提升模型準確度中的關鍵作用。最後，本文探討了未來的研究方向，並期待此方法在實際應用中的潛力。
+##### **A Self-Supervised Framework for Improved Generalisability in Ultrasound B-mode Image Segmentation**
+2502.02489v1 by Edward Ellis, Andrew Bulpitt, Nasim Parsa, Michael F Byrne, Sharib Ali
 
-##### **Show Me the Work: Fact-Checkers' Requirements for Explainable Automated Fact-Checking**
-2502.09083v1 by Greta Warren, Irina Shklovski, Isabelle Augenstein
+Ultrasound (US) imaging is clinically invaluable due to its noninvasive and
+safe nature. However, interpreting US images is challenging, requires
+significant expertise, and time, and is often prone to errors. Deep learning
+offers assistive solutions such as segmentation. Supervised methods rely on
+large, high-quality, and consistently labeled datasets, which are challenging
+to curate. Moreover, these methods tend to underperform on out-of-distribution
+data, limiting their clinical utility. Self-supervised learning (SSL) has
+emerged as a promising alternative, leveraging unlabeled data to enhance model
+performance and generalisability. We introduce a contrastive SSL approach
+tailored for B-mode US images, incorporating a novel Relation Contrastive Loss
+(RCL). RCL encourages learning of distinct features by differentiating positive
+and negative sample pairs through a learnable metric. Additionally, we propose
+spatial and frequency-based augmentation strategies for the representation
+learning on US images. Our approach significantly outperforms traditional
+supervised segmentation methods across three public breast US datasets,
+particularly in data-limited scenarios. Notable improvements on the Dice
+similarity metric include a 4% increase on 20% and 50% of the BUSI dataset,
+nearly 6% and 9% improvements on 20% and 50% of the BrEaST dataset, and 6.4%
+and 3.7% improvements on 20% and 50% of the UDIAT dataset, respectively.
+Furthermore, we demonstrate superior generalisability on the
+out-of-distribution UDIAT dataset with performance boosts of 20.6% and 13.6%
+compared to the supervised baseline using 20% and 50% of the BUSI and BrEaST
+training data, respectively. Our research highlights that domain-inspired SSL
+can improve US segmentation, especially under data-limited conditions.
 
-The pervasiveness of large language models and generative AI in online media
-has amplified the need for effective automated fact-checking to assist
-fact-checkers in tackling the increasing volume and sophistication of
-misinformation. The complex nature of fact-checking demands that automated
-fact-checking systems provide explanations that enable fact-checkers to
-scrutinise their outputs. However, it is unclear how these explanations should
-align with the decision-making and reasoning processes of fact-checkers to be
-effectively integrated into their workflows. Through semi-structured interviews
-with fact-checking professionals, we bridge this gap by: (i) providing an
-account of how fact-checkers assess evidence, make decisions, and explain their
-processes; (ii) examining how fact-checkers use automated tools in practice;
-and (iii) identifying fact-checker explanation requirements for automated
-fact-checking tools. The findings show unmet explanation needs and identify
-important criteria for replicable fact-checking explanations that trace the
-model's reasoning path, reference specific evidence, and highlight uncertainty
-and information gaps.
+摘要：超音波 (US) 影像由於其非侵入性且安全的特性，在臨床上極具價值。然而，解讀超音波影像具有挑戰性，需要大量的專業知識和時間，而且經常容易出錯。深度學習提供了輔助解決方案，例如分割。監督式方法依賴於大量、高品質且標籤一致的資料集，而這在策劃上具有挑戰性。此外，這些方法在分佈外資料上的表現往往不佳，這限制了它們的臨床效用。自監督學習 (SSL) 已成為一種有前途的替代方案，它利用未標籤資料來增強模型效能和泛化能力。我們提出了一種對比式 SSL 方法，專門針對 B 模式超音波影像，並納入了新穎的關係對比損失 (RCL)。RCL 透過一個可學習的指標區分正負樣本對，來鼓勵學習不同的特徵。此外，我們提出了用於超音波影像上表徵學習的空間和頻率增強策略。我們的做法在三個公開的乳房超音波資料集上顯著優於傳統的監督式分割方法，特別是在資料有限的情況下。在 Dice 相似性指標上的顯著改進包括在 BUSI 資料集的 20% 和 50% 上增加了 4%，在 BrEaST 資料集的 20% 和 50% 上增加了近 6% 和 9%，以及在 UDIAT 資料集的 20% 和 50% 上分別增加了 6.4% 和 3.7%。此外，我們在分佈外的 UDIAT 資料集上展示了卓越的泛化能力，與使用 BUSI 和 BrEaST 訓練資料的 20% 和 50% 的監督式基準相比，效能分別提升了 20.6% 和 13.6%。我們的研究強調，領域啟發的 SSL 可以改善超音波分割，特別是在資料有限的條件下。
 
-摘要：大型語言模型和生成式 AI 在線上媒體的普及
-放大了對有效自動查核事實的需求，以協助查核員應對日益增加的錯誤資訊量和複雜性。查核事實的複雜性質要求自動查核事實系統提供說明，讓查核員能夠仔細審查他們的輸出。然而，目前尚不清楚這些說明應如何與查核員的決策制定和推理過程保持一致，才能有效整合到他們的流程中。透過與查核事實專業人士進行半結構式訪談，我們透過以下方式彌補這個差距：(i) 提供查核員如何評估證據、做出決策和解釋其流程的說明；(ii) 檢視查核員如何實際使用自動化工具；以及 (iii) 找出查核員對自動查核事實工具的說明需求。研究結果顯示未滿足的說明需求，並找出可複製查核事實說明的重要準則，這些準則追蹤模型的推理路徑、參考具體證據，並強調不確定性和資訊差距。
+##### **Medical Multimodal Model Stealing Attacks via Adversarial Domain Alignment**
+2502.02438v1 by Yaling Shen, Zhixiong Zhuang, Kun Yuan, Maria-Irina Nicolae, Nassir Navab, Nicolas Padoy, Mario Fritz
 
-##### **CoSER: Coordinating LLM-Based Persona Simulation of Established Roles**
-2502.09082v1 by Xintao Wang, Heng Wang, Yifei Zhang, Xinfeng Yuan, Rui Xu, Jen-tse Huang, Siyu Yuan, Haoran Guo, Jiangjie Chen, Wei Wang, Yanghua Xiao, Shuchang Zhou
+Medical multimodal large language models (MLLMs) are becoming an instrumental
+part of healthcare systems, assisting medical personnel with decision making
+and results analysis. Models for radiology report generation are able to
+interpret medical imagery, thus reducing the workload of radiologists. As
+medical data is scarce and protected by privacy regulations, medical MLLMs
+represent valuable intellectual property. However, these assets are potentially
+vulnerable to model stealing, where attackers aim to replicate their
+functionality via black-box access. So far, model stealing for the medical
+domain has focused on classification; however, existing attacks are not
+effective against MLLMs. In this paper, we introduce Adversarial Domain
+Alignment (ADA-STEAL), the first stealing attack against medical MLLMs.
+ADA-STEAL relies on natural images, which are public and widely available, as
+opposed to their medical counterparts. We show that data augmentation with
+adversarial noise is sufficient to overcome the data distribution gap between
+natural images and the domain-specific distribution of the victim MLLM.
+Experiments on the IU X-RAY and MIMIC-CXR radiology datasets demonstrate that
+Adversarial Domain Alignment enables attackers to steal the medical MLLM
+without any access to medical data.
 
-Role-playing language agents (RPLAs) have emerged as promising applications
-of large language models (LLMs). However, simulating established characters
-presents a challenging task for RPLAs, due to the lack of authentic character
-datasets and nuanced evaluation methods using such data. In this paper, we
-present CoSER, a collection of a high-quality dataset, open models, and an
-evaluation protocol towards effective RPLAs of established characters. The
-CoSER dataset covers 17,966 characters from 771 renowned books. It provides
-authentic dialogues with real-world intricacies, as well as diverse data types
-such as conversation setups, character experiences and internal thoughts.
-Drawing from acting methodology, we introduce given-circumstance acting for
-training and evaluating role-playing LLMs, where LLMs sequentially portray
-multiple characters in book scenes. Using our dataset, we develop CoSER 8B and
-CoSER 70B, i.e., advanced open role-playing LLMs built on LLaMA-3.1 models.
-Extensive experiments demonstrate the value of the CoSER dataset for RPLA
-training, evaluation and retrieval. Moreover, CoSER 70B exhibits
-state-of-the-art performance surpassing or matching GPT-4o on our evaluation
-and three existing benchmarks, i.e., achieving 75.80% and 93.47% accuracy on
-the InCharacter and LifeChoice benchmarks respectively.
+摘要：醫療多模態大型語言模型 (MLLM) 正在成為醫療保健系統中不可或缺的一部分，協助醫療人員進行決策和結果分析。放射報告生成的模型能夠解釋醫學影像，從而減輕放射科醫師的工作負擔。由於醫療資料稀少且受隱私法規保護，醫療 MLLM 代表了有價值的智慧財產。然而，這些資產潛在地容易受到模型竊取的攻擊，攻擊者旨在透過黑盒存取來複製其功能。到目前為止，針對醫療領域的模型竊取一直專注於分類；然而，現有的攻擊對 MLLM 沒有效。在本文中，我們介紹了對抗域對齊 (ADA-STEAL)，這是針對醫療 MLLM 的第一個竊取攻擊。與醫療對應物相反，ADA-STEAL 依賴於公開且廣泛可用的自然影像。我們表明，對抗雜訊的資料擴充足以克服自然影像與受害者 MLLM 的特定領域分佈之間的資料分佈差距。在 IU X-RAY 和 MIMIC-CXR 放射學資料集上進行的實驗表明，對抗域對齊使攻擊者能夠在不存取任何醫療資料的情況下竊取醫療 MLLM。
 
-摘要：角色扮演語言代理（RPLA）已成為大型語言模型（LLM）的有前途的應用。然而，由於缺乏真實角色資料集和使用此類資料的細緻評估方法，模擬既有角色對 RPLA 來說是一項具有挑戰性的任務。在本文中，我們提出了 CoSER，這是一個高品質資料集、開放模型和評估協議的集合，用於有效地扮演既有角色的 RPLA。CoSER 資料集涵蓋了來自 771 本著名書籍的 17,966 個角色。它提供了具有真實世界複雜性的真實對話，以及對話設定、角色體驗和內心想法等多種資料類型。借鑑表演方法，我們引入了既定情境表演，用於訓練和評估角色扮演 LLM，其中 LLM 在書籍場景中依次扮演多個角色。使用我們的資料集，我們開發了 CoSER 8B 和 CoSER 70B，即建立在 LLaMA-3.1 模型上的先進開放角色扮演 LLM。大量的實驗證明了 CoSER 資料集對於 RPLA 訓練、評估和檢索的價值。此外，CoSER 70B 在我們的評估和三個現有基準上展現了超越或匹配 GPT-4o 的最先進效能，即分別在 InCharacter 和 LifeChoice 基準上達到了 75.80% 和 93.47% 的準確率。
+##### **Test Time Training for 4D Medical Image Interpolation**
+2502.02341v1 by Qikang Zhang, Yingjie Lei, Zihao Zheng, Ziyang Chen, Zhonghao Xie
 
-##### **Enhancing RAG with Active Learning on Conversation Records: Reject Incapables and Answer Capables**
-2502.09073v1 by Xuzhao Geng, Haozhao Wang, Jun Wang, Wei Liu, Ruixuan Li
+4D medical image interpolation is essential for improving temporal resolution
+and diagnostic precision in clinical applications. Previous works ignore the
+problem of distribution shifts, resulting in poor generalization under
+different distribution. A natural solution would be to adapt the model to a new
+test distribution, but this cannot be done if the test input comes without a
+ground truth label. In this paper, we propose a novel test time training
+framework which uses self-supervision to adapt the model to a new distribution
+without requiring any labels. Indeed, before performing frame interpolation on
+each test video, the model is trained on the same instance using a
+self-supervised task, such as rotation prediction or image reconstruction. We
+conduct experiments on two publicly available 4D medical image interpolation
+datasets, Cardiac and 4D-Lung. The experimental results show that the proposed
+method achieves significant performance across various evaluation metrics on
+both datasets. It achieves higher peak signal-to-noise ratio values, 33.73dB on
+Cardiac and 34.02dB on 4D-Lung. Our method not only advances 4D medical image
+interpolation but also provides a template for domain adaptation in other
+fields such as image segmentation and image registration.
 
-Retrieval-augmented generation (RAG) is a key technique for leveraging
-external knowledge and reducing hallucinations in large language models (LLMs).
-However, RAG still struggles to fully prevent hallucinated responses. To
-address this, it is essential to identify samples prone to hallucination or
-guide LLMs toward correct responses, which experts then annotate to develop
-high-quality datasets for refining LLMs. However, the growing scarcity of such
-datasets makes their creation challenging. This paper proposes using the vast
-amount of conversations from widespread LLM usage to build these datasets,
-training LLMs to avoid hallucination-prone questions while accurately
-responding to manageable ones. Given the impracticality of expert-annotating
-all conversation records, the paper introduces AL4RAG, which uses active
-learning to select the most suitable conversation samples for annotation,
-optimizing performance within an annotation budget. Additionally, recognizing
-that traditional active learning methods are not fully compatible with RAG due
-to unsuitable distance metrics, we develop a novel sample distance measurement
-for RAG active learning. Extensive experiments show that our method
-consistently outperforms baselines across multiple metrics.
+摘要：4D 醫學影像插值對於提升時間解析度及臨床應用中的診斷精準度至關重要。過往的研究忽略了分佈轉移問題，導致在不同分佈下泛化能力不佳。一個自然的解決方案是將模型適應到新的測試分佈，但如果測試輸入沒有真實標籤，就無法做到這一點。在本文中，我們提出了一個新的測試時間訓練架構，它使用自我監督來適應模型到一個新的分佈，而不需要任何標籤。事實上，在對每個測試影片執行幀插值之前，使用自我監督任務（例如旋轉預測或影像重建）在同一個實例上訓練模型。我們在兩個公開的 4D 醫學影像插值資料集（Cardiac 和 4D-Lung）上進行實驗。實驗結果表明，所提出的方法在兩個資料集上的各種評估指標中都取得了顯著的效能。它達到了更高的峰值信噪比值，在 Cardiac 上為 33.73dB，在 4D-Lung 上為 34.02dB。我們的技術不僅推動了 4D 醫學影像插值，還為其他領域（例如影像分割和影像配準）中的領域適應提供了一個範本。
 
-摘要：檢索增強生成 (RAG) 是一種關鍵技術，用於利用外部知識並減少大型語言模型 (LLM) 中的幻覺。然而，RAG 仍難以完全防止幻覺反應。為了解決這個問題，必須找出容易產生幻覺的範例，或引導 LLM 朝向正確的反應，然後由專家註解以開發用於精煉 LLM 的高品質資料集。然而，此類資料集日益稀少，使得其建立極具挑戰性。本文提出使用來自廣泛 LLM 使用的大量對話來建立這些資料集，訓練 LLM 以避免容易產生幻覺的問題，同時準確回應可管理的問題。鑑於由專家為所有對話記錄加上註解並不切實際，本文引入了 AL4RAG，它使用主動學習來選擇最適合註解的對話範例，在註解預算內最佳化效能。此外，認識到傳統主動學習方法由於不適當的距離度量而無法與 RAG 完全相容，我們為 RAG 主動學習開發了一種新穎的範例距離度量。廣泛的實驗表明，我們的模型在多種度量標準上始終優於基準。
+##### **Conversation AI Dialog for Medicare powered by Finetuning and Retrieval Augmented Generation**
+2502.02249v1 by Atharva Mangeshkumar Agrawal, Rutika Pandurang Shinde, Vasanth Kumar Bhukya, Ashmita Chakraborty, Sagar Bharat Shah, Tanmay Shukla, Sree Pradeep Kumar Relangi, Nilesh Mutyam
 
-##### **An Open Recipe: Adapting Language-Specific LLMs to a Reasoning Model in One Day via Model Merging**
-2502.09056v1 by Kunat Pipatanakul, Pittawat Taveekitworachai, Potsawee Manakul, Kasima Tharnpipitchai
+Large language models (LLMs) have shown impressive capabilities in natural
+language processing tasks, including dialogue generation. This research aims to
+conduct a novel comparative analysis of two prominent techniques, fine-tuning
+with LoRA (Low-Rank Adaptation) and the Retrieval-Augmented Generation (RAG)
+framework, in the context of doctor-patient chat conversations with multiple
+datasets of mixed medical domains. The analysis involves three state-of-the-art
+models: Llama-2, GPT, and the LSTM model. Employing real-world doctor-patient
+dialogues, we comprehensively evaluate the performance of models, assessing key
+metrics such as language quality (perplexity, BLEU score), factual accuracy
+(fact-checking against medical knowledge bases), adherence to medical
+guidelines, and overall human judgments (coherence, empathy, safety). The
+findings provide insights into the strengths and limitations of each approach,
+shedding light on their suitability for healthcare applications. Furthermore,
+the research investigates the robustness of the models in handling diverse
+patient queries, ranging from general health inquiries to specific medical
+conditions. The impact of domain-specific knowledge integration is also
+explored, highlighting the potential for enhancing LLM performance through
+targeted data augmentation and retrieval strategies.
 
-This paper investigates data selection and model merging methodologies aimed
-at incorporating advanced reasoning capabilities such as those of DeepSeek R1
-into language-specific large language models (LLMs), with a particular focus on
-the Thai LLM. Our goal is to enhance the reasoning capabilities of
-language-specific LLMs while maintaining their target language abilities.
-DeepSeek R1 excels in reasoning but primarily benefits high-resource languages
-such as English and Chinese. However, low-resource languages remain underserved
-due to the dominance of English-centric training data and model optimizations,
-which limit performance in these languages. This limitation results in
-unreliable code-switching and diminished effectiveness on tasks in low-resource
-languages. Meanwhile, local and regional LLM initiatives have attempted to
-bridge this gap by developing language-specific LLMs that focus on improving
-local linguistic fidelity. We demonstrate that, with only publicly available
-datasets and a computational budget of $120, it is possible to enhance the
-reasoning capabilities of language-specific LLMs to match the level of DeepSeek
-R1, without compromising their performance on target language tasks.
+摘要：大型語言模型 (LLM) 在自然語言處理任務中展現了令人印象深刻的能力，包括對話生成。本研究旨在對兩種著名的技術進行新穎的比較分析，即微調 LoRA (低秩適應) 和檢索增強生成 (RAG) 框架，在具有混合醫療領域的多個資料集的醫患聊天對話中。分析涉及三個最先進的模型：Llama-2、GPT 和 LSTM 模型。採用真實世界的醫患對話，我們全面評估模型的性能，評估語言品質（困惑度、BLEU 分數）、事實準確性（對照醫學知識庫進行事實查核）、遵守醫療指南以及整體人類判斷（連貫性、同理心、安全性）等關鍵指標。研究結果深入了解了每種方法的優點和限制，闡明了它們適用於醫療保健應用的適當性。此外，該研究調查了模型在處理多樣化患者查詢時的穩健性，範圍從一般健康詢問到特定醫療狀況。還探討了特定領域知識整合的影響，強調了通過有針對性的資料擴充和檢索策略來增強 LLM 性能的潛力。
 
-摘要：本文探討資料選取與模型合併方法，旨在將深度搜尋 R1 等先進推理能力整合至特定語言的大型語言模型 (LLM)，特別著重於泰語 LLM。我們的目標是提升特定語言 LLM 的推理能力，同時維持其目標語言能力。深度搜尋 R1 在推理方面表現出色，但主要受益於英語和中文等資源豐富的語言。然而，由於以英語為中心的訓練資料和模型最佳化佔據主導地位，資源貧乏的語言仍未獲得充分服務，這限制了這些語言的效能。此限制導致不可靠的代碼切換，並降低了資源貧乏語言任務的效能。與此同時，在地區 LLM 計畫已嘗試透過開發專注於改善在地語言忠實度的特定語言 LLM 來彌合此差距。我們證明，僅使用公開可用的資料集和 120 美元的運算預算，即可提升特定語言 LLM 的推理能力，使其達到深度搜尋 R1 的水準，同時不損及它們在目標語言任務上的效能。
+##### **Deep Learning-Based Facial Expression Recognition for the Elderly: A Systematic Review**
+2502.02618v1 by F. Xavier Gaya-Morey, Jose M. Buades-Rubio, Philippe Palanque, Raquel Lacuesta, Cristina Manresa-Yee
 
-##### **Cost-Saving LLM Cascades with Early Abstention**
-2502.09054v1 by Michael J. Zellinger, Rex Liu, Matt Thomson
+The rapid aging of the global population has highlighted the need for
+technologies to support elderly, particularly in healthcare and emotional
+well-being. Facial expression recognition (FER) systems offer a non-invasive
+means of monitoring emotional states, with applications in assisted living,
+mental health support, and personalized care. This study presents a systematic
+review of deep learning-based FER systems, focusing on their applications for
+the elderly population. Following a rigorous methodology, we analyzed 31
+studies published over the last decade, addressing challenges such as the
+scarcity of elderly-specific datasets, class imbalances, and the impact of
+age-related facial expression differences. Our findings show that convolutional
+neural networks remain dominant in FER, and especially lightweight versions for
+resource-constrained environments. However, existing datasets often lack
+diversity in age representation, and real-world deployment remains limited.
+Additionally, privacy concerns and the need for explainable artificial
+intelligence emerged as key barriers to adoption. This review underscores the
+importance of developing age-inclusive datasets, integrating multimodal
+solutions, and adopting XAI techniques to enhance system usability,
+reliability, and trustworthiness. We conclude by offering recommendations for
+future research to bridge the gap between academic progress and real-world
+implementation in elderly care.
 
-LLM cascades are based on the idea that processing all queries with the
-largest and most expensive LLMs is inefficient. Instead, cascades deploy small
-LLMs to answer the majority of queries, limiting the use of large and expensive
-LLMs to only the most difficult queries. This approach can significantly reduce
-costs without impacting performance. However, risk-sensitive domains such as
-finance or medicine place an additional premium on avoiding model errors.
-Recognizing that even the most expensive models may make mistakes, applications
-in these domains benefit from allowing LLM systems to completely abstain from
-answering a query when the chance of making a mistake is significant. However,
-giving a cascade the ability to abstain poses an immediate design question for
-LLM cascades: should abstention only be allowed at the final model or also at
-earlier models? Since the error patterns of small and large models are
-correlated, the latter strategy may further reduce inference costs by letting
-inexpensive models anticipate abstention decisions by expensive models, thereby
-obviating the need to run the expensive models. We investigate the benefits of
-"early abstention" in LLM cascades and find that it reduces the overall test
-loss by 2.2% on average across six benchmarks (GSM8K, MedMCQA, MMLU, TriviaQA,
-TruthfulQA, and XSum). These gains result from a more effective use of
-abstention, which trades a 4.1% average increase in the overall abstention rate
-for a 13.0% reduction in cost and a 5.0% reduction in error rate. Our findings
-demonstrate that it is possible to leverage correlations between the error
-patterns of different language models to drive performance improvements for LLM
-systems with abstention.
+摘要：全球人口快速老龄化突显了对技术的需求，以支持老年人，尤其是在医疗保健和情绪健康方面。面部表情识别 (FER) 系统提供了一种非侵入性的情绪状态监测手段，在辅助生活、心理健康支持和个性化护理中得到应用。本研究对基于深度学习的 FER 系统进行了系统的回顾，重点关注它们在老年人群中的应用。遵循严格的方法，我们分析了在过去十年中发表的 31 项研究，解决了诸如老年人特定数据集的稀缺性、类别不平衡以及与年龄相关的面部表情差异的影响等挑战。我们的研究结果表明，卷积神经网络在 FER 中仍然占主导地位，特别是针对资源受限环境的轻量级版本。然而，现有数据集往往缺乏年龄代表性的多样性，并且现实世界的部署仍然有限。此外，隐私问题和对可解释人工智能的需求已成为采用过程中的主要障碍。本次审查强调了开发包容年龄的数据集、整合多模式解决方案以及采用 XAI 技术以增强系统可用性、可靠性和可信度的重要性。最后，我们提出了未来研究的建议，以弥合学术进展与老年护理中的现实世界实施之间的差距。
 
-摘要：<paragraph>LLM 級聯基於以下概念：使用最大且最昂貴的 LLM 處理所有查詢效率低下。相反，級聯會部署小型 LLM 來回答大部分查詢，將大型且昂貴的 LLM 的使用限制在最困難的查詢上。這種方法可以大幅降低成本，而不會影響效能。然而，像金融或醫學等對風險敏感的領域會額外重視避免模型錯誤。認識到即使是最昂貴的模型也可能會出錯，在這些領域中的應用程式可受益於允許 LLM 系統在出錯機率很大的情況下完全不回答查詢。然而，賦予級聯不回答的能力會對 LLM 級聯提出立即的設計問題：是否只允許在最終模型中不回答，還是也在較早的模型中不回答？由於小型和大型模型的錯誤模式相關，後一種策略可以讓便宜的模型預測昂貴模型的不回答決策，進而降低推論成本，從而避免執行昂貴的模型。我們調查了 LLM 級聯中「早期不回答」的好處，並發現它平均降低了六個基準測試（GSM8K、MedMCQA、MMLU、TriviaQA、TruthfulQA 和 XSum）的整體測試損失 2.2%。這些收益來自於更有效地使用不回答，以整體不回答率平均增加 4.1% 的代價換取成本降低 13.0% 和錯誤率降低 5.0%。我們的研究結果證明，可以利用不同語言模型的錯誤模式之間的關聯性，來推動具有不回答功能的 LLM 系統的效能改進。</paragraph>
+##### **Causally-informed Deep Learning towards Explainable and Generalizable Outcomes Prediction in Critical Care**
+2502.02109v1 by Yuxiao Cheng, Xinxin Song, Ziqian Wang, Qin Zhong, Kunlun He, Jinli Suo
 
-##### **Game Theory Meets Large Language Models: A Systematic Survey**
-2502.09053v1 by Haoran Sun, Yusen Wu, Yukun Cheng, Xu Chu
+Recent advances in deep learning (DL) have prompted the development of
+high-performing early warning score (EWS) systems, predicting clinical
+deteriorations such as acute kidney injury, acute myocardial infarction, or
+circulatory failure. DL models have proven to be powerful tools for various
+tasks but come with the cost of lacking interpretability and limited
+generalizability, hindering their clinical applications. To develop a practical
+EWS system applicable to various outcomes, we propose causally-informed
+explainable early prediction model, which leverages causal discovery to
+identify the underlying causal relationships of prediction and thus owns two
+unique advantages: demonstrating the explicit interpretation of the prediction
+while exhibiting decent performance when applied to unfamiliar environments.
+Benefiting from these features, our approach achieves superior accuracy for 6
+different critical deteriorations and achieves better generalizability across
+different patient groups, compared to various baseline algorithms. Besides, we
+provide explicit causal pathways to serve as references for assistant clinical
+diagnosis and potential interventions. The proposed approach enhances the
+practical application of deep learning in various medical scenarios.
 
-Game theory establishes a fundamental framework for analyzing strategic
-interactions among rational decision-makers. The rapid advancement of large
-language models (LLMs) has sparked extensive research exploring the
-intersection of these two fields. Specifically, game-theoretic methods are
-being applied to evaluate and enhance LLM capabilities, while LLMs themselves
-are reshaping classic game models. This paper presents a comprehensive survey
-of the intersection of these fields, exploring a bidirectional relationship
-from three perspectives: (1) Establishing standardized game-based benchmarks
-for evaluating LLM behavior; (2) Leveraging game-theoretic methods to improve
-LLM performance through algorithmic innovations; (3) Characterizing the
-societal impacts of LLMs through game modeling. Among these three aspects, we
-also highlight how the equilibrium analysis for traditional game models is
-impacted by LLMs' advanced language understanding, which in turn extends the
-study of game theory. Finally, we identify key challenges and future research
-directions, assessing their feasibility based on the current state of the
-field. By bridging theoretical rigor with emerging AI capabilities, this survey
-aims to foster interdisciplinary collaboration and drive progress in this
-evolving research area.
+摘要：深度學習 (DL) 的最新進展促使開發出高性能早期預警評分 (EWS) 系統，預測急性腎臟損傷、急性心肌梗塞或循環衰竭等臨床惡化。DL 模型已被證明是各種任務的強大工具，但代價是缺乏可解釋性和有限的概括性，阻礙了其臨床應用。為了開發適用於各種結果的實用 EWS 系統，我們提出了因果關係解釋性早期預測模型，它利用因果發現來識別預測的潛在因果關係，從而擁有兩個獨特的優點：展示預測的明確解釋，同時在應用於不熟悉的環境時表現出良好的性能。得益於這些特性，與各種基線演算法相比，我們的模型在 6 種不同的危重惡化中實現了更高的準確度，並在不同的患者群體中實現了更好的概括性。此外，我們提供了明確的因果途徑，作為輔助臨床診斷和潛在干預措施的參考。所提出的方法增強了深度學習在各種醫療場景中的實際應用。
 
-摘要：博弈論建立一個基本架構，用來分析理性決策者之間的策略互動。大型語言模型 (LLM) 的快速進展，激發了廣泛的研究，探討這兩個領域的交集。具體來說，博弈論方法被應用於評估和增強 LLM 能力，而 LLM 本身正在重塑經典博弈模型。本文對這些領域的交集進行了全面的調查，從三個角度探討了雙向關係：(1) 建立標準化的基於博弈的基準，用於評估 LLM 行為；(2) 利用博弈論方法，通過演算法創新來改善 LLM 效能；(3) 透過博弈模型，描述 LLM 對社會的影響。在這三個方面中，我們還強調了 LLM 的先進語言理解如何影響傳統博弈模型的均衡分析，這反過來又擴展了博弈論的研究。最後，我們找出關鍵挑戰和未來的研究方向，根據該領域的現狀評估其可行性。透過將理論嚴謹性與新興的 AI 能力相結合，這項調查旨在促進跨學科合作，並推動這個不斷演變的研究領域的進展。
+##### **JingFang: A Traditional Chinese Medicine Large Language Model of Expert-Level Medical Diagnosis and Syndrome Differentiation-Based Treatment**
+2502.04345v1 by Yehan Yan, Tianhao Ma, Ruotai Li, Xinhan Zheng, Guodong Shan, Chisheng Li
 
-##### **AIDE: Agentically Improve Visual Language Model with Domain Experts**
-2502.09051v1 by Ming-Chang Chiu, Fuxiao Liu, Karan Sapra, Andrew Tao, Yaser Jacoob, Xuezhe Ma, Zhiding Yu, Guilin Liu
+Traditional Chinese medicine (TCM) plays a vital role in health protection
+and disease treatment, but its practical application requires extensive medical
+knowledge and clinical experience. Existing TCM Large Language Models (LLMs)
+exhibit critical limitations of uncomprehensive medical consultation and
+diagnoses, and inaccurate syndrome differentiation-based treatment. To address
+these issues, this study establishes JingFang (JF): a novel TCM Large Language
+Model that demonstrates the expert-level capability of medical diagnosis and
+syndrome differentiation-based treatment. We innovate a Multi-agent Dynamic
+Collaborative Chain-of-Thought Mechanism (MDCCTM) for medical consultation,
+enabling JF with effective and accurate diagnostic ability. In addition, a
+Syndrome Agent and a Dual-Stage Retrieval Scheme (DSRS) are developed to
+significantly enhance the capacity of JF for disease treatment based on
+syndrome differentiation. JingFang not only facilitates the application of LLMs
+but also promotes the effective practice of TCM in human health protection and
+disease treatment.
 
-The enhancement of Visual Language Models (VLMs) has traditionally relied on
-knowledge distillation from larger, more capable models. This dependence
-creates a fundamental bottleneck for improving state-of-the-art systems,
-particularly when no superior models exist. We introduce AIDE (Agentic
-Improvement through Domain Experts), a novel framework that enables VLMs to
-autonomously enhance their capabilities by leveraging specialized domain expert
-models. AIDE operates through a four-stage process: (1) identifying instances
-for refinement, (2) engaging domain experts for targeted analysis, (3)
-synthesizing expert outputs with existing data, and (4) integrating enhanced
-instances into the training pipeline. Experiments on multiple benchmarks,
-including MMMU, MME, MMBench, etc., demonstrate AIDE's ability to achieve
-notable performance gains without relying on larger VLMs nor human supervision.
-Our framework provides a scalable, resource-efficient approach to continuous
-VLM improvement, addressing critical limitations in current methodologies,
-particularly valuable when larger models are unavailable to access.
+摘要：中醫藥在保健與疾病治療中扮演著重要的角色，但其實務應用需要深厚的醫學知識與臨床經驗。現有的中醫大語言模型（LLM）存在著醫療諮詢與診斷不全面、症候分型治療不準確的重大限制。為了解決這些問題，本研究建立了精方（JF）：一個新穎的中醫大語言模型，展示了專家級的醫療診斷與症候分型治療能力。我們創新了一個多智能體動態協作思考鏈機制（MDCCTM）用於醫療諮詢，讓 JF 具備有效且準確的診斷能力。此外，還開發了一個症候智能體和一個雙階段檢索方案（DSRS），以顯著增強 JF 基於症候分型的疾病治療能力。精方不僅促進了 LLM 的應用，也推動了中醫藥在人類保健與疾病治療中的有效實踐。
 
-摘要：視覺語言模型 (VLM) 的增強傳統上依賴於從更大、功能更強大的模型中進行知識萃取。這種依賴性會造成改善最先進系統的基本瓶頸，尤其在沒有更優越的模型時。我們引進 AIDE（透過領域專家進行代理式改善），一個創新的架構，讓 VLM 能夠透過利用專業的領域專家模型，自主增強其功能。AIDE 透過四階段流程運作：(1) 識別需要改善的實例，(2) 聘請領域專家進行有針對性的分析，(3) 將專家輸出與現有資料綜合，以及 (4) 將增強的實例整合到訓練流程中。在多個基準測試上的實驗，包括 MMMU、MME、MMBench 等，證明了 AIDE 能夠在不依賴更大型的 VLM 或人工監督的情況下，實現顯著的效能提升。我們的架構提供了一個可擴充、資源效率高的持續 VLM 改進方法，解決了當前方法中的關鍵限制，特別是在無法取得大型模型時，這一點特別有價值。
+##### **An Agentic AI Workflow for Detecting Cognitive Concerns in Real-world Data**
+2502.01789v1 by Jiazi Tian, Liqin Wang, Pedram Fard, Valdery Moura Junior, Deborah Blacker, Jennifer S. Haas, Chirag Patel, Shawn N. Murphy, Lidia M. V. R. Moura, Hossein Estiri
 
-##### **Leveraging Member-Group Relations via Multi-View Graph Filtering for Effective Group Recommendation**
-2502.09050v1 by Chae-Hyun Kim, Yoon-Ryung Choi, Jin-Duk Park, Won-Yong Shin
+Early identification of cognitive concerns is critical but often hindered by
+subtle symptom presentation. This study developed and validated a fully
+automated, multi-agent AI workflow using LLaMA 3 8B to identify cognitive
+concerns in 3,338 clinical notes from Mass General Brigham. The agentic
+workflow, leveraging task-specific agents that dynamically collaborate to
+extract meaningful insights from clinical notes, was compared to an
+expert-driven benchmark. Both workflows achieved high classification
+performance, with F1-scores of 0.90 and 0.91, respectively. The agentic
+workflow demonstrated improved specificity (1.00) and achieved prompt
+refinement in fewer iterations. Although both workflows showed reduced
+performance on validation data, the agentic workflow maintained perfect
+specificity. These findings highlight the potential of fully automated
+multi-agent AI workflows to achieve expert-level accuracy with greater
+efficiency, offering a scalable and cost-effective solution for detecting
+cognitive concerns in clinical settings.
 
-Group recommendation aims at providing optimized recommendations tailored to
-diverse groups, enabling groups to enjoy appropriate items. On the other hand,
-most existing group recommendation methods are built upon deep neural network
-(DNN) architectures designed to capture the intricate relationships between
-member-level and group-level interactions. While these DNN-based approaches
-have proven their effectiveness, they require complex and expensive training
-procedures to incorporate group-level interactions in addition to member-level
-interactions. To overcome such limitations, we introduce Group-GF, a new
-approach for extremely fast recommendations of items to each group via
-multi-view graph filtering (GF) that offers a holistic view of complex
-member-group dynamics, without the need for costly model training.
-Specifically, in Group-GF, we first construct three item similarity graphs
-manifesting different viewpoints for GF. Then, we discover a distinct
-polynomial graph filter for each similarity graph and judiciously aggregate the
-three graph filters. Extensive experiments demonstrate the effectiveness of
-Group-GF in terms of significantly reducing runtime and achieving
-state-of-the-art recommendation accuracy.
+摘要：及早辨識認知問題至關重要，但常常受到症狀呈現過於細微的阻礙。本研究開發並驗證了一個全自動化、多重代理的 AI 工作流程，使用 LLaMA 3 8B 來辨識來自麻省總醫院布萊根分院的 3,338 則臨床筆記中的認知問題。這個代理工作流程利用了特定任務的代理，這些代理會動態合作從臨床筆記中萃取出有意義的見解，並與專家驅動的基準進行比較。這兩個工作流程都達到了很高的分類效能，F1 分數分別為 0.90 和 0.91。代理工作流程展現出更好的特異性（1.00），並且在更少的反覆運算中達到了提示精煉。儘管這兩個工作流程在驗證資料上的效能都降低了，但代理工作流程維持了完美的特異性。這些發現突顯了全自動化多重代理 AI 工作流程的潛力，它們能以更高的效率達到專家級的準確度，為在臨床環境中偵測認知問題提供了一個可擴充且具成本效益的解決方案。
 
-摘要：群組推薦旨在提供針對不同群組量身打造的最佳推薦，讓群組可以享受適當的項目。另一方面，現有的群組推薦方法大多建立在深度神經網路 (DNN) 架構上，旨在捕捉成員層級和群組層級互動之間的複雜關係。雖然這些基於 DNN 的方法已證明其有效性，但它們需要複雜且昂貴的訓練程序，才能在成員層級互動之外納入群組層級互動。為了克服這些限制，我們引入了 Group-GF，這是一種透過多視圖圖形過濾 (GF) 為每個群組提供極快速項目推薦的新方法，它提供了複雜成員群組動態的整體視圖，而無需進行昂貴的模型訓練。具體來說，在 Group-GF 中，我們首先建構三個項目相似度圖形，展現 GF 的不同觀點。然後，我們為每個相似度圖形發現一個不同的多項式圖形過濾器，並明智地彙總這三個圖形過濾器。廣泛的實驗證明了 Group-GF 在顯著減少執行時間和達成最先進的推薦準確度方面的有效性。
+##### **Can Domain Experts Rely on AI Appropriately? A Case Study on AI-Assisted Prostate Cancer MRI Diagnosis**
+2502.03482v1 by Chacha Chen, Han Liu, Jiamin Yang, Benjamin M. Mervak, Bora Kalaycioglu, Grace Lee, Emre Cakmakli, Matteo Bonatti, Sridhar Pudu, Osman Kahraman, Gul Gizem Pamuk, Aytekin Oto, Aritrick Chatterjee, Chenhao Tan
 
-##### **Criteria-Aware Graph Filtering: Extremely Fast Yet Accurate Multi-Criteria Recommendation**
-2502.09046v1 by Jin-Duk Park, Jaemin Yoo, Won-Yong Shin
+Despite the growing interest in human-AI decision making, experimental
+studies with domain experts remain rare, largely due to the complexity of
+working with domain experts and the challenges in setting up realistic
+experiments. In this work, we conduct an in-depth collaboration with
+radiologists in prostate cancer diagnosis based on MRI images. Building on
+existing tools for teaching prostate cancer diagnosis, we develop an interface
+and conduct two experiments to study how AI assistance and performance feedback
+shape the decision making of domain experts. In Study 1, clinicians were asked
+to provide an initial diagnosis (human), then view the AI's prediction, and
+subsequently finalize their decision (human-AI team). In Study 2 (after a
+memory wash-out period), the same participants first received aggregated
+performance statistics from Study 1, specifically their own performance, the
+AI's performance, and their human-AI team performance, and then directly viewed
+the AI's prediction before making their diagnosis (i.e., no independent initial
+diagnosis). These two workflows represent realistic ways that clinical AI tools
+might be used in practice, where the second study simulates a scenario where
+doctors can adjust their reliance and trust on AI based on prior performance
+feedback. Our findings show that, while human-AI teams consistently outperform
+humans alone, they still underperform the AI due to under-reliance, similar to
+prior studies with crowdworkers. Providing clinicians with performance feedback
+did not significantly improve the performance of human-AI teams, although
+showing AI decisions in advance nudges people to follow AI more. Meanwhile, we
+observe that the ensemble of human-AI teams can outperform AI alone, suggesting
+promising directions for human-AI collaboration.
 
-Multi-criteria (MC) recommender systems, which utilize MC rating information
-for recommendation, are increasingly widespread in various e-commerce domains.
-However, the MC recommendation using training-based collaborative filtering,
-requiring consideration of multiple ratings compared to single-criterion
-counterparts, often poses practical challenges in achieving state-of-the-art
-performance along with scalable model training. To solve this problem, we
-propose CA-GF, a training-free MC recommendation method, which is built upon
-criteria-aware graph filtering for efficient yet accurate MC recommendations.
-Specifically, first, we construct an item-item similarity graph using an MC
-user-expansion graph. Next, we design CA-GF composed of the following key
-components, including 1) criterion-specific graph filtering where the optimal
-filter for each criterion is found using various types of polynomial low-pass
-filters and 2) criteria preference-infused aggregation where the smoothed
-signals from each criterion are aggregated. We demonstrate that CA-GF is (a)
-efficient: providing the computational efficiency, offering the extremely fast
-runtime of less than 0.2 seconds even on the largest benchmark dataset, (b)
-accurate: outperforming benchmark MC recommendation methods, achieving
-substantial accuracy gains up to 24% compared to the best competitor, and (c)
-interpretable: providing interpretations for the contribution of each criterion
-to the model prediction based on visualizations.
+摘要：儘管人們對人類與 AI 決策制定越來越感興趣，但與領域專家合作的實驗研究仍然很少見，這在很大程度上是因為與領域專家合作的複雜性，以及在設定實際實驗時面臨的挑戰。在這項工作中，我們與放射科醫師進行深入合作，基於 MRI 影像診斷前列腺癌。建立在用於教授前列腺癌診斷的現有工具上，我們開發了一個介面並進行了兩項實驗，以研究 AI 協助和效能回饋如何塑造領域專家的決策制定。在研究 1 中，要求臨床醫師提供初步診斷（人類），然後檢視 AI 的預測，並隨後確定他們的決策（人類-AI 團隊）。在研究 2（經過一段記憶清除期）中，同一位參與者首先收到研究 1 的彙總效能統計資料，特別是他們自己的效能、AI 的效能，以及他們的人類-AI 團隊效能，然後在做出診斷前直接檢視 AI 的預測（即，沒有獨立的初步診斷）。這兩個工作流程代表了臨床 AI 工具在實務中可能被使用的方式，其中第二個研究模擬了醫生可以根據先前的效能回饋調整他們對 AI 的依賴和信任的情況。我們的研究結果顯示，儘管人類-AI 團隊始終優於單獨的人類，但由於依賴不足，他們仍然表現不如 AI，這與之前針對群眾工作者的研究類似。儘管事先顯示 AI 決策會促使人們更多地遵循 AI，但向臨床醫師提供效能回饋並未顯著改善人類-AI 團隊的效能。同時，我們觀察到人類-AI 團隊的集合可以優於單獨的 AI，這表明了人類-AI 合作的前景。
 
-摘要：多準則 (MC) 推薦系統在各種電子商務領域中日益普及，該系統利用 MC 評分資訊進行推薦。
-然而，與單準則對應項目相比，使用基於訓練的協同過濾的 MC 推薦，通常在達成最先進的效能以及可擴充模型訓練方面造成實務上的挑戰，需要考慮多個評分。為了解決這個問題，我們提出 CA-GF，一種無需訓練的 MC 推薦方法，它建立於準則感知圖形過濾之上，用於有效且準確的 MC 推薦。
-具體來說，首先，我們使用 MC 使用者擴展圖形來建構一個項目相似度圖形。接下來，我們設計 CA-GF，它包含以下關鍵組成部分，包括 1) 準則特定圖形過濾，其中使用各種類型的多項式低通濾波器來找出每個準則的最佳濾波器，以及 2) 準則偏好注入聚合，其中來自每個準則的平滑訊號被聚合。我們證明 CA-GF 是 (a) 有效的：提供運算效率，即使在最大的基準資料集上，也能提供低於 0.2 秒的極快執行時間，(b) 準確的：優於基準 MC 推薦方法，與最佳競爭者相比，獲得高達 24% 的顯著準確性提升，以及 (c) 可解釋的：根據視覺化提供對每個準則對模型預測的貢獻的解釋。
+##### **Improving Transformer World Models for Data-Efficient RL**
+2502.01591v1 by Antoine Dedieu, Joseph Ortiz, Xinghua Lou, Carter Wendelken, Wolfgang Lehrach, J Swaroop Guntupalli, Miguel Lazaro-Gredilla, Kevin Patrick Murphy
 
-##### **Typhoon T1: An Open Thai Reasoning Model**
-2502.09042v1 by Pittawat Taveekitworachai, Potsawee Manakul, Kasima Tharnpipitchai, Kunat Pipatanakul
+We present an approach to model-based RL that achieves a new state of the art
+performance on the challenging Craftax-classic benchmark, an open-world 2D
+survival game that requires agents to exhibit a wide range of general abilities
+-- such as strong generalization, deep exploration, and long-term reasoning.
+With a series of careful design choices aimed at improving sample efficiency,
+our MBRL algorithm achieves a reward of 67.4% after only 1M environment steps,
+significantly outperforming DreamerV3, which achieves 53.2%, and, for the first
+time, exceeds human performance of 65.0%. Our method starts by constructing a
+SOTA model-free baseline, using a novel policy architecture that combines CNNs
+and RNNs. We then add three improvements to the standard MBRL setup: (a) "Dyna
+with warmup", which trains the policy on real and imaginary data, (b) "nearest
+neighbor tokenizer" on image patches, which improves the scheme to create the
+transformer world model (TWM) inputs, and (c) "block teacher forcing", which
+allows the TWM to reason jointly about the future tokens of the next timestep.
 
-This paper introduces Typhoon T1, an open effort to develop an open Thai
-reasoning model. A reasoning model is a relatively new type of generative model
-built on top of large language models (LLMs). A reasoning model generates a
-long chain of thought before arriving at a final answer, an approach found to
-improve performance on complex tasks. However, details on developing such a
-model are limited, especially for reasoning models that can generate traces in
-a low-resource language. Typhoon T1 presents an open effort that dives into the
-details of developing a reasoning model in a more cost-effective way by
-leveraging supervised fine-tuning using open datasets, instead of reinforcement
-learning. This paper shares the details about synthetic data generation and
-training, as well as our dataset and model weights. Additionally, we provide
-insights gained from developing a reasoning model that generalizes across
-domains and is capable of generating reasoning traces in a low-resource
-language, using Thai as an example. We hope this open effort provides a
-foundation for further research in this field.
+摘要：我們提出了一個基於模型的 RL 方法，在具有挑戰性的 Craftax-classic 基準上實現了新的技術水準，這是一個開放世界的 2D 生存遊戲，要求代理人展現廣泛的一般能力，例如強大的概括能力、深入探索和長期推理。通過一系列旨在提高樣本效率的仔細設計選擇，我們的 MBRL 演算法在僅 1M 環境步驟後就實現了 67.4% 的獎勵，顯著優於 DreamerV3（實現 53.2%），並且首次超過了人類的 65.0% 的表現。我們的演算法首先通過使用結合 CNN 和 RNN 的新穎策略架構來建構一個 SOTA 無模型基線。然後，我們對標準 MBRL 設定新增了三項改進：(a)「帶熱身的 Dyna」，它在真實和假想資料上訓練策略，(b) 影像貼片的「最近鄰代碼化器」，它改進了建立轉換器世界模型 (TWM) 輸入的方案，以及 (c)「區塊教師強制」，它允許 TWM 共同推理下一個時間步長的未來代碼。
 
-摘要：本文介紹 Typhoon T1，這是一個開放的計畫，旨在開發開放的泰語推理模型。推理模型是一種相對較新的生成模型，建構於大型語言模型 (LLM) 之上。推理模型會在得出最終答案之前產生一連串的思考，這種方法被發現可以改善複雜任務的效能。然而，關於如何開發這種模型的詳細資訊有限，特別是對於能夠以低資源語言產生軌跡的推理模型而言。Typhoon T1 提出了一個開放的計畫，深入探討如何以更具成本效益的方式開發推理模型，方法是利用開放式資料集進行監督微調，而不是強化學習。本文分享了關於合成資料產生和訓練的詳細資訊，以及我們的資料集和模型權重。此外，我們提供了從開發推理模型中獲得的見解，該模型可以跨領域概括，並能夠以低資源語言產生推理軌跡，以泰語為例。我們希望這個開放的計畫能為此領域的進一步研究奠定基礎。
+##### **Data-Efficient Model for Psychological Resilience Prediction based on Neurological Data**
+2502.01377v1 by Zhi Zhang, Yan Liu, Mengxia Gao, Yu Yang, Jiannong Cao, Wai Kai Hou, Shirley Li, Sonata Yau, Yun Kwok Wing, Tatia M. C. Lee
 
-##### **Mechanistic Unveiling of Transformer Circuits: Self-Influence as a Key to Model Reasoning**
-2502.09022v1 by Lin Zhang, Lijie Hu, Di Wang
+Psychological resilience, defined as the ability to rebound from adversity,
+is crucial for mental health. Compared with traditional resilience assessments
+through self-reported questionnaires, resilience assessments based on
+neurological data offer more objective results with biological markers, hence
+significantly enhancing credibility. This paper proposes a novel data-efficient
+model to address the scarcity of neurological data. We employ Neuro
+Kolmogorov-Arnold Networks as the structure of the prediction model. In the
+training stage, a new trait-informed multimodal representation algorithm with a
+smart chunk technique is proposed to learn the shared latent space with limited
+data. In the test stage, a new noise-informed inference algorithm is proposed
+to address the low signal-to-noise ratio of the neurological data. The proposed
+model not only shows impressive performance on both public datasets and
+self-constructed datasets but also provides some valuable psychological
+hypotheses for future research.
 
-Transformer-based language models have achieved notable success, yet their
-internal reasoning mechanisms remain largely opaque due to complex non-linear
-interactions and high-dimensional operations. While previous research suggests
-that these models implicitly encode reasoning structures, it is still unclear
-which specific multi-step thought processes they employ to solve complex tasks.
-To address this gap, we propose a novel mechanistic interpretability framework,
-SICAF, designed to trace and analyze the reasoning strategies that language
-models use in multi-step inference tasks. By employing circuit analysis and
-self-influence functions, we quantify the evolving importance of each token
-throughout the reasoning process, thereby mapping the pathways the model uses
-for inference. Applying SICAF to the GPT-2 model on the Indirect Object
-Identification (IOI) prediction task, we demonstrate how underlying circuits
-can reveal a reasoning process that aligns with human interpretability,
-offering new insights into the model's internal logic.
+摘要：心理韌性，定義為從逆境中反彈的能力，對心理健康至關重要。與通過自我報告問卷的傳統韌性評估相比，基於神經數據的韌性評估提供了更客觀的結果和生物標記，從而顯著提高了可信度。本文提出了一個新穎的數據高效模型來解決神經數據的稀缺性。我們採用神經科爾莫哥羅夫-阿諾德網路作為預測模型的結構。在訓練階段，提出了一種新的特徵信息多模態表示算法，採用智能塊技術，以有限的數據學習共享潛在空間。在測試階段，提出了一種新的噪聲信息推理算法，以解決神經數據的信噪比低的問題。所提出的模型不僅在公共數據集和自構數據集上都顯示出令人印象深刻的性能，還為未來的研究提供了一些有價值的心理假設。
+
+##### **OphthBench: A Comprehensive Benchmark for Evaluating Large Language Models in Chinese Ophthalmology**
+2502.01243v1 by Chengfeng Zhou, Ji Wang, Juanjuan Qin, Yining Wang, Ling Sun, Weiwei Dai
+
+Large language models (LLMs) have shown significant promise across various
+medical applications, with ophthalmology being a notable area of focus. Many
+ophthalmic tasks have shown substantial improvement through the integration of
+LLMs. However, before these models can be widely adopted in clinical practice,
+evaluating their capabilities and identifying their limitations is crucial. To
+address this research gap and support the real-world application of LLMs, we
+introduce the OphthBench, a specialized benchmark designed to assess LLM
+performance within the context of Chinese ophthalmic practices. This benchmark
+systematically divides a typical ophthalmic clinical workflow into five key
+scenarios: Education, Triage, Diagnosis, Treatment, and Prognosis. For each
+scenario, we developed multiple tasks featuring diverse question types,
+resulting in a comprehensive benchmark comprising 9 tasks and 591 questions.
+This comprehensive framework allows for a thorough assessment of LLMs'
+capabilities and provides insights into their practical application in Chinese
+ophthalmology. Using this benchmark, we conducted extensive experiments and
+analyzed the results from 39 popular LLMs. Our evaluation highlights the
+current gap between LLM development and its practical utility in clinical
+settings, providing a clear direction for future advancements. By bridging this
+gap, we aim to unlock the potential of LLMs and advance their development in
+ophthalmology.
 
-摘要：基於 Transformer 的語言模型已取得顯著的成功，但由於複雜的非線性交互和高維度運算，它們的內部推理機制在很大程度上仍然不透明。儘管先前的研究表明這些模型隱含地編碼推理結構，但目前仍不清楚它們採用哪些具體的多步驟思考過程來解決複雜任務。為了解決這個差距，我們提出了一個新穎的機制可解釋性框架 SICAF，旨在追蹤和分析語言模型在多步驟推理任務中使用的推理策略。通過採用電路分析和自影響函數，我們量化了推理過程中每個標記的演化重要性，從而繪製出模型用於推理的路徑。將 SICAF 應用於 GPT-2 模型上的間接賓語識別 (IOI) 預測任務，我們展示了底層電路如何揭示與人類可解釋性相符的推理過程，從而對模型的內部邏輯提供了新的見解。
+摘要：大型語言模型 (LLM) 在各種醫療應用中已展現出顯著的潛力，其中眼科是一個值得關注的重要領域。許多眼科任務已透過整合 LLM 而大幅進步。然而，在這些模型能廣泛應用於臨床實務之前，評估其能力並找出其限制至關重要。為了解決這個研究差距並支援 LLM 的實際應用，我們引入了 OphthBench，這是一個專門的基準測試，旨在評估 LLM 在中國眼科實務中的表現。此基準測試系統性地將典型眼科臨床工作流程劃分為五個關鍵情境：教育、分流、診斷、治療和預後。對於每個情境，我們開發了多項任務，包含多樣化的問題類型，最後組成一個包含 9 項任務和 591 個問題的綜合基準測試。此綜合架構可徹底評估 LLM 的能力，並提供其在中國眼科的實際應用見解。使用此基準測試，我們進行了廣泛的實驗，並分析了來自 39 個熱門 LLM 的結果。我們的評估強調了 LLM 開發與其在臨床環境中的實際效用之間的差距，為未來的進展提供了明確的方向。透過彌合此差距，我們旨在釋放 LLM 的潛力，並促進其在眼科的發展。
 
-##### **EventSTR: A Benchmark Dataset and Baselines for Event Stream based Scene Text Recognition**
-2502.09020v1 by Xiao Wang, Jingtao Jiang, Dong Li, Futian Wang, Lin Zhu, Yaowei Wang, Yongyong Tian, Jin Tang
+##### **MIND: Modality-Informed Knowledge Distillation Framework for Multimodal Clinical Prediction Tasks**
+2502.01158v1 by Alejandro Guerra-Manzanares, Farah E. Shamout
 
-Mainstream Scene Text Recognition (STR) algorithms are developed based on RGB
-cameras which are sensitive to challenging factors such as low illumination,
-motion blur, and cluttered backgrounds. In this paper, we propose to recognize
-the scene text using bio-inspired event cameras by collecting and annotating a
-large-scale benchmark dataset, termed EventSTR. It contains 9,928
-high-definition (1280 * 720) event samples and involves both Chinese and
-English characters. We also benchmark multiple STR algorithms as the baselines
-for future works to compare. In addition, we propose a new event-based scene
-text recognition framework, termed SimC-ESTR. It first extracts the event
-features using a visual encoder and projects them into tokens using a Q-former
-module. More importantly, we propose to augment the vision tokens based on a
-memory mechanism before feeding into the large language models. A
-similarity-based error correction mechanism is embedded within the large
-language model to correct potential minor errors fundamentally based on
-contextual information. Extensive experiments on the newly proposed EventSTR
-dataset and two simulation STR datasets fully demonstrate the effectiveness of
-our proposed model. We believe that the dataset and algorithmic model can
-innovatively propose an event-based STR task and are expected to accelerate the
-application of event cameras in various industries. The source code and
-pre-trained models will be released on https://github.com/Event-AHU/EventSTR
+Multimodal fusion leverages information across modalities to learn better
+feature representations with the goal of improving performance in fusion-based
+tasks. However, multimodal datasets, especially in medical settings, are
+typically smaller than their unimodal counterparts, which can impede the
+performance of multimodal models. Additionally, the increase in the number of
+modalities is often associated with an overall increase in the size of the
+multimodal network, which may be undesirable in medical use cases. Utilizing
+smaller unimodal encoders may lead to sub-optimal performance, particularly
+when dealing with high-dimensional clinical data. In this paper, we propose the
+Modality-INformed knowledge Distillation (MIND) framework, a multimodal model
+compression approach based on knowledge distillation that transfers knowledge
+from ensembles of pre-trained deep neural networks of varying sizes into a
+smaller multimodal student. The teacher models consist of unimodal networks,
+allowing the student to learn from diverse representations. MIND employs
+multi-head joint fusion models, as opposed to single-head models, enabling the
+use of unimodal encoders in the case of unimodal samples without requiring
+imputation or masking of absent modalities. As a result, MIND generates an
+optimized multimodal model, enhancing both multimodal and unimodal
+representations. It can also be leveraged to balance multimodal learning during
+training. We evaluate MIND on binary and multilabel clinical prediction tasks
+using time series data and chest X-ray images. Additionally, we assess the
+generalizability of the MIND framework on three non-medical multimodal
+multiclass datasets. Experimental results demonstrate that MIND enhances the
+performance of the smaller multimodal network across all five tasks, as well as
+various fusion methods and multimodal architectures, compared to
+state-of-the-art baselines.
 
-摘要：主流場景文字辨識（STR）演算法是基於對低光源、動態模糊和雜亂背景等挑戰性因素敏感的 RGB 相機開發的。在本文中，我們提出使用生物靈感事件相機辨識場景文字，方法是收集和標註一個稱為 EventSTR 的大規模基準資料集。它包含 9,928 個高畫質（1280 * 720）事件範例，並包含中文字和英文字元。我們也基準化多個 STR 演算法作為未來工作的基準，以進行比較。此外，我們提出一個新的基於事件的場景文字辨識架構，稱為 SimC-ESTR。它首先使用視覺編碼器萃取事件特徵，並使用 Q-former 模組將它們投影到代幣中。更重要的是，我們提出在輸入大型語言模型之前，基於記憶機制擴充視覺代幣。一個基於相似性的錯誤修正機制嵌入在大型語言模型中，以根據上下文資訊從根本上修正潛在的輕微錯誤。在最新提出的 EventSTR 資料集和兩個模擬 STR 資料集上進行的廣泛實驗充分證明了我們提出的模型的有效性。我們相信，該資料集和演算法模型可以創新地提出一個基於事件的 STR 任務，並有望加速事件相機在各個產業的應用。原始碼和預先訓練的模型將在 https://github.com/Event-AHU/EventSTR 上釋出
+摘要：多模态融合利用跨模态的信息来学习更好的特征表示，目标是提升基于融合的任务的性能。然而，多模态数据集，尤其是在医疗环境中，通常比它们的单模态对应数据集小，这会阻碍多模态模型的性能。此外，模态数量的增加通常与多模态网络尺寸的整体增加相关，这在医疗用例中可能是不可取的。利用较小的单模态编码器可能会导致次优性能，尤其是在处理高维临床数据时。在本文中，我们提出了模态信息知识蒸馏 (MIND) 框架，这是一种基于知识蒸馏的多模态模型压缩方法，它将来自不同大小的预训练深度神经网络的集合中的知识转移到一个较小的多模态学生中。教师模型由单模态网络组成，允许学生从不同的表示中学习。MIND 采用多头联合融合模型，而不是单头模型，从而能够在单模态样本的情况下使用单模态编码器，而不需要缺失模态的插补或掩蔽。因此，MIND 生成了一个经过优化的多模态模型，增强了多模态和单模态表示。它还可以用来在训练期间平衡多模态学习。我们使用时间序列数据和胸部 X 射线图像对二元和多标签临床预测任务评估了 MIND。此外，我们评估了 MIND 框架在三个非医疗多模态多分类数据集上的泛化性。实验结果表明，与最先进的基线相比，MIND 增强了较小的多模态网络在所有五个任务以及各种融合方法和多模态架构中的性能。
 
-##### **Zero-shot Concept Bottleneck Models**
-2502.09018v1 by Shin'ya Yamaguchi, Kosuke Nishida, Daiki Chijiwa, Yasutoshi Ida
+##### **Beyond Yes or No: Predictive Compliance Monitoring Approaches for Quantifying the Magnitude of Compliance Violations**
+2502.01141v1 by Qian Chen, Stefanie Rinderle-Ma, Lijie Wen
 
-Concept bottleneck models (CBMs) are inherently interpretable and
-intervenable neural network models, which explain their final label prediction
-by the intermediate prediction of high-level semantic concepts. However, they
-require target task training to learn input-to-concept and concept-to-label
-mappings, incurring target dataset collections and training resources. In this
-paper, we present \textit{zero-shot concept bottleneck models} (Z-CBMs), which
-predict concepts and labels in a fully zero-shot manner without training neural
-networks. Z-CBMs utilize a large-scale concept bank, which is composed of
-millions of vocabulary extracted from the web, to describe arbitrary input in
-various domains. For the input-to-concept mapping, we introduce concept
-retrieval, which dynamically finds input-related concepts by the cross-modal
-search on the concept bank. In the concept-to-label inference, we apply concept
-regression to select essential concepts from the retrieved concepts by sparse
-linear regression. Through extensive experiments, we confirm that our Z-CBMs
-provide interpretable and intervenable concepts without any additional
-training. Code will be available at https://github.com/yshinya6/zcbm.
+Most existing process compliance monitoring approaches detect compliance
+violations in an ex post manner. Only predicate prediction focuses on
+predicting them. However, predicate prediction provides a binary yes/no notion
+of compliance, lacking the ability to measure to which extent an ongoing
+process instance deviates from the desired state as specified in constraints.
+Here, being able to quantify the magnitude of violation would provide
+organizations with deeper insights into their operational performance, enabling
+informed decision making to reduce or mitigate the risk of non-compliance.
+Thus, we propose two predictive compliance monitoring approaches to close this
+research gap. The first approach reformulates the binary classification problem
+as a hybrid task that considers both classification and regression, while the
+second employs a multi-task learning method to explicitly predict the
+compliance status and the magnitude of violation for deviant cases
+simultaneously. In this work, we focus on temporal constraints as they are
+significant in almost any application domain, e.g., health care. The evaluation
+on synthetic and real-world event logs demonstrates that our approaches are
+capable of quantifying the magnitude of violations while maintaining comparable
+performance for compliance predictions achieved by state-of-the-art approaches.
 
-摘要：概念瓶頸模型 (CBM) 本質上是可解釋且可干預的神經網路模型，它們透過對高階語意概念的中間預測來解釋其最終標籤預測。然而，它們需要目標任務訓練來學習輸入到概念和概念到標籤的對應，導致目標資料集收集和訓練資源。在本文中，我們展示了「零次學習概念瓶頸模型」(Z-CBM)，它以完全零次學習的方式預測概念和標籤，而無需訓練神經網路。Z-CBM 利用一個大型概念庫，其中包含從網路中擷取的數百萬個詞彙，來描述各種領域中的任意輸入。對於輸入到概念的對應，我們引入了概念擷取，它透過對概念庫的跨模態搜尋，動態地找出與輸入相關的概念。在概念到標籤的推論中，我們應用概念迴歸，透過稀疏線性迴歸從擷取的概念中選擇必要的概念。透過廣泛的實驗，我們確認我們的 Z-CBM 在沒有任何額外訓練的情況下提供了可解釋且可干預的概念。程式碼將可在 https://github.com/yshinya6/zcbm 取得。
+摘要：現有的流程合規監控方法大多會在事後偵測到合規違規。只有謂詞預測專注於預測這些違規。然而，謂詞預測提供的是合規與否的二元概念，無法衡量正在進行的流程實例偏離約束中所指定之理想狀態的程度。在此，能夠量化違規的嚴重程度，將能讓組織深入了解其營運績效，並能據此做出明智的決策，以降低或減輕不合規的風險。因此，我們提出兩種預測合規監控方法來填補此研究空白。第一種方法將二元分類問題重新表述為同時考量分類和回歸的混合任務，而第二種方法則採用多任務學習方法，同時明確預測合規狀態和偏差案例的違規嚴重程度。在這項工作中，我們專注於時間約束，因為它們幾乎在任何應用領域（例如醫療保健）中都很重要。在合成和真實世界事件記錄上的評估顯示，我們的做法能夠量化違規的嚴重程度，同時維持與現有方法所達成的合規預測相當的績效。
 
-##### **Diversity Enhances an LLM's Performance in RAG and Long-context Task**
-2502.09017v1 by Zhchao Wang, Bin Bi, Yanqi Luo, Sitaram Asur, Claire Na Cheng
+##### **Pulse-PPG: An Open-Source Field-Trained PPG Foundation Model for Wearable Applications Across Lab and Field Settings**
+2502.01108v1 by Mithun Saha, Maxwell A. Xu, Wanting Mao, Sameer Neupane, James M. Rehg, Santosh Kumar
 
-The rapid advancements in large language models (LLMs) have highlighted the
-challenge of context window limitations, primarily due to the quadratic time
-complexity of the self-attention mechanism (\(O(N^2)\), where \(N\) denotes the
-context window length). This constraint impacts tasks such as
-retrieval-augmented generation (RAG) in question answering (Q\&A) and long
-context summarization. A common approach involves selecting content with the
-highest similarity to the query; however, this often leads to redundancy and
-the exclusion of diverse yet relevant information. Building on principles from
-Maximal Marginal Relevance (MMR) and Farthest Point Sampling (FPS), we
-integrate diversity into the content selection process. Our findings reveal
-that incorporating diversity substantially increases the recall of selecting
-relevant sentences or chunks before LLM-based Q\&A and summarization. These
-results highlight the importance of maintaining diversity in future LLM
-applications to further improve summarization and Q\&A outcomes.
+Photoplethysmography (PPG)-based foundation models are gaining traction due
+to the widespread use of PPG in biosignal monitoring and their potential to
+generalize across diverse health applications. In this paper, we introduce
+Pulse-PPG, the first open-source PPG foundation model trained exclusively on
+raw PPG data collected over a 100-day field study with 120 participants.
+Existing PPG foundation models are either open-source but trained on clinical
+data or closed-source, limiting their applicability in real-world settings. We
+evaluate Pulse-PPG across multiple datasets and downstream tasks, comparing its
+performance against a state-of-the-art foundation model trained on clinical
+data. Our results demonstrate that Pulse-PPG, trained on uncurated field data,
+exhibits superior generalization across clinical and mobile health applications
+in both lab and field settings. This suggests that exposure to real-world
+variability enables the model to learn fine-grained representations, making it
+more adaptable across tasks. Furthermore, pre-training on field data
+surprisingly outperforms its pre-training on clinical data in many tasks,
+reinforcing the importance of training on real-world, diverse datasets. To
+encourage further advancements in robust foundation models leveraging field
+data, we plan to release Pulse-PPG, providing researchers with a powerful
+resource for developing more generalizable PPG-based models.
 
-摘要：大型語言模型 (LLM) 的快速進步凸顯了上下文視窗限制的挑戰，這主要是由於自注意力機制的二次時間複雜度（\(O(N^2)\)），其中 \(N\) 表示上下文視窗長度。此限制會影響任務，例如問答 (Q&A) 中的檢索增強生成 (RAG) 和長文摘要。一種常見的方法涉及選擇與查詢最相似的內容；然而，這通常會導致冗餘，並排除多樣化但相關的資訊。我們根據最大邊際相關性 (MMR) 和最遠點取樣 (FPS) 的原則，將多樣性整合到內容選擇過程中。我們的研究結果顯示，在基於 LLM 的問答和摘要之前，納入多樣性會大幅增加選擇相關句子或區塊的召回率。這些結果突顯了在未來的 LLM 應用中維持多樣性的重要性，以進一步改善摘要和問答的結果。
+摘要：基於光電容積描記術 (PPG) 的基礎模型由於 PPG 在生物訊號監控中的廣泛使用及其在各種健康應用中推廣的潛力而備受關注。在本文中，我們介紹 Pulse-PPG，這是第一個開放原始碼 PPG 基礎模型，專門針對在為期 100 天的現場研究中收集的 120 位參與者的原始 PPG 資料進行訓練。現有的 PPG 基礎模型要不是開放原始碼，但訓練於臨床資料，不然就是閉源，這限制了它們在真實世界中的應用性。我們評估了 Pulse-PPG 在多個資料集和下游任務中的表現，並將其效能與訓練於臨床資料的最新基礎模型進行比較。我們的結果表明，訓練於未整理現場資料的 Pulse-PPG 在實驗室和現場環境中，在臨床和行動健康應用中展現出優異的泛化能力。這表明接觸真實世界的變異性使模型能夠學習細粒度的表示，使其更能適應各種任務。此外，令人驚訝的是，現場資料的預訓練在許多任務中優於臨床資料的預訓練，這強化了在真實世界、多樣化的資料集上訓練的重要性。為了鼓勵在利用現場資料的強健基礎模型方面進一步發展，我們計畫發布 Pulse-PPG，為研究人員提供一個強大的資源，用於開發更具泛化性的基於 PPG 的模型。
 
-##### **Hope vs. Hate: Understanding User Interactions with LGBTQ+ News Content in Mainstream US News Media through the Lens of Hope Speech**
-2502.09004v1 by Jonathan Pofcher, Christopher M. Homan, Randall Sell, Ashiqur R. KhudaBukhsh
+##### **Tutorial on Using Machine Learning and Deep Learning Models for Mental Illness Detection**
+2502.04342v1 by Yeyubei Zhang, Zhongyan Wang, Zhanyi Ding, Yexin Tian, Jianglai Dai, Xiaorui Shen, Yunchong Liu, Yuchen Cao
 
-This paper makes three contributions. First, via a substantial corpus of
-1,419,047 comments posted on 3,161 YouTube news videos of major US cable news
-outlets, we analyze how users engage with LGBTQ+ news content. Our analyses
-focus both on positive and negative content. In particular, we construct a
-fine-grained hope speech classifier that detects positive (hope speech),
-negative, neutral, and irrelevant content. Second, in consultation with a
-public health expert specializing on LGBTQ+ health, we conduct an annotation
-study with a balanced and diverse political representation and release a
-dataset of 3,750 instances with fine-grained labels and detailed annotator
-demographic information. Finally, beyond providing a vital resource for the
-LGBTQ+ community, our annotation study and subsequent in-the-wild assessments
-reveal (1) strong association between rater political beliefs and how they rate
-content relevant to a marginalized community; (2) models trained on individual
-political beliefs exhibit considerable in-the-wild disagreement; and (3)
-zero-shot large language models (LLMs) align more with liberal raters.
+Social media has become an important source for understanding mental health,
+providing researchers with a way to detect conditions like depression from
+user-generated posts. This tutorial provides practical guidance to address
+common challenges in applying machine learning and deep learning methods for
+mental health detection on these platforms. It focuses on strategies for
+working with diverse datasets, improving text preprocessing, and addressing
+issues such as imbalanced data and model evaluation. Real-world examples and
+step-by-step instructions demonstrate how to apply these techniques
+effectively, with an emphasis on transparency, reproducibility, and ethical
+considerations. By sharing these approaches, this tutorial aims to help
+researchers build more reliable and widely applicable models for mental health
+research, contributing to better tools for early detection and intervention.
 
-摘要：本文做出了三項貢獻。首先，透過一個龐大的語料庫，其中包含 1,419,047 則評論，這些評論張貼在 3,161 部美國有線新聞頻道的 YouTube 新聞影片上，我們分析了使用者如何參與 LGBTQ+ 新聞內容。我們的分析重點在於正面和負面的內容。特別是，我們建構了一個細緻的希望言論分類器，用來偵測正面的（希望言論）、負面的、中立的和不相關的內容。其次，在諮詢了一位專門研究 LGBTQ+ 健康的公共衛生專家後，我們進行了一項標註研究，其中包含平衡且多元的政治代表性，並發布了一個包含 3,750 個實例的資料集，其中包含細緻的標籤和詳細的標註者人口統計資訊。最後，除了為 LGBTQ+ 社群提供重要的資源外，我們的標註研究和後續的實際評估揭示了：(1) 評分者的政治信仰與他們如何評分與邊緣化社群相關的內容之間有很強的關聯性；(2) 根據個人政治信仰訓練的模型在實際應用中表現出相當大的分歧；(3) 零次學習大型語言模型 (LLM) 與自由派評分者的看法更一致。
+摘要：社群媒體已成為了解心理健康的重要來源，
+為研究人員提供一種方式，從使用者發布的貼文中偵測憂鬱症等狀況。
+本教學提供實務指南，說明如何處理在這些平台上使用機器學習和深度學習方法進行心理健康偵測時常見的挑戰。
+它專注於處理不同資料集、改善文字前處理，以及處理不平衡資料和模型評估等問題的策略。
+實際範例和逐步說明示範如何有效應用這些技術，並強調透明度、可複製性，以及倫理考量。
+透過分享這些方法，本教學指南旨在協助研究人員建構更可靠且廣泛適用的心理健康研究模型，
+進而有助於早期偵測和介入的工具。
 
-##### **RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models**
-2502.09003v1 by Quan Wei, Chung-Yiu Yau, Hoi-To Wai, Yang, Zhao, Dongyeop Kang, Youngsuk Park, Mingyi Hong
+##### **Agent-Based Uncertainty Awareness Improves Automated Radiology Report Labeling with an Open-Source Large Language Model**
+2502.01691v1 by Hadas Ben-Atya, Naama Gavrielov, Zvi Badash, Gili Focht, Ruth Cytter-Kuint, Talar Hagopian, Dan Turner, Moti Freiman
 
-Supervised fine-tuning is a standard method for adapting pre-trained large
-language models (LLMs) to downstream tasks. Quantization has been recently
-studied as a post-training technique for efficient LLM deployment. To obtain
-quantized fine-tuned LLMs, conventional pipelines would first fine-tune the
-pre-trained models, followed by post-training quantization. This often yields
-suboptimal performance as it fails to leverage the synergy between fine-tuning
-and quantization. To effectively realize low-bit quantization of weights,
-activations, and KV caches in LLMs, we propose an algorithm named Rotated
-Straight-Through-Estimator (RoSTE), which combines quantization-aware
-supervised fine-tuning (QA-SFT) with an adaptive rotation strategy that
-identifies an effective rotation configuration to reduce activation outliers.
-We provide theoretical insights on RoSTE by analyzing its prediction error when
-applied to an overparameterized least square quantized training problem. Our
-findings reveal that the prediction error is directly proportional to the
-quantization error of the converged weights, which can be effectively managed
-through an optimized rotation configuration. Experiments on Pythia and Llama
-models of different sizes demonstrate the effectiveness of RoSTE. Compared to
-existing post-SFT quantization baselines, our method consistently achieves
-superior performances across various tasks and different LLM architectures.
+Reliable extraction of structured data from radiology reports using Large
+Language Models (LLMs) remains challenging, especially for complex, non-English
+texts like Hebrew. This study introduces an agent-based uncertainty-aware
+approach to improve the trustworthiness of LLM predictions in medical
+applications. We analyzed 9,683 Hebrew radiology reports from Crohn's disease
+patients (from 2010 to 2023) across three medical centers. A subset of 512
+reports was manually annotated for six gastrointestinal organs and 15
+pathological findings, while the remaining reports were automatically annotated
+using HSMP-BERT. Structured data extraction was performed using Llama 3.1
+(Llama 3-8b-instruct) with Bayesian Prompt Ensembles (BayesPE), which employed
+six semantically equivalent prompts to estimate uncertainty. An Agent-Based
+Decision Model integrated multiple prompt outputs into five confidence levels
+for calibrated uncertainty and was compared against three entropy-based models.
+Performance was evaluated using accuracy, F1 score, precision, recall, and
+Cohen's Kappa before and after filtering high-uncertainty cases. The
+agent-based model outperformed the baseline across all metrics, achieving an F1
+score of 0.3967, recall of 0.6437, and Cohen's Kappa of 0.3006. After filtering
+high-uncertainty cases (greater than or equal to 0.5), the F1 score improved to
+0.4787, and Kappa increased to 0.4258. Uncertainty histograms demonstrated
+clear separation between correct and incorrect predictions, with the
+agent-based model providing the most well-calibrated uncertainty estimates. By
+incorporating uncertainty-aware prompt ensembles and an agent-based decision
+model, this approach enhances the performance and reliability of LLMs in
+structured data extraction from radiology reports, offering a more
+interpretable and trustworthy solution for high-stakes medical applications.
 
-摘要：監督式微調是將預訓練的大型語言模型 (LLM) 適應至下游任務的標準方法。量化最近已被研究作為一種訓練後技術，用於高效部署 LLM。為了獲得量化的微調 LLM，傳統管道會先微調預訓練模型，然後再進行訓練後量化。這通常會產生次佳效能，因為它無法利用微調和量化之間的協同效應。為了有效實現 LLM 中權重、激活和 KV 快取的低位元量化，我們提出了一種名為旋轉直通估計器 (RoSTE) 的演算法，它結合了量化感知監督式微調 (QA-SFT) 和一種自適應旋轉策略，該策略會識別有效的旋轉組態以減少激活異常值。我們透過分析 RoSTE 在應用於過度參數化最小平方量化訓練問題時的預測誤差，提供了關於 RoSTE 的理論見解。我們的研究結果顯示，預測誤差與收斂權重的量化誤差成正比，而這可透過最佳化的旋轉組態有效地管理。在不同大小的 Pythia 和 Llama 模型上進行的實驗證明了 RoSTE 的有效性。與現有的訓練後 SFT 量化基準相比，我們的模型在各種任務和不同的 LLM 架構中持續獲得優異的效能。
+摘要：<paragraph>使用大型語言模型 (LLM) 從放射科報告中可靠地提取結構化數據仍然具有挑戰性，尤其是對於希伯來語等複雜的非英語文本。本研究引入了一種基於代理的不確定性感知方法，以提高 LLM 預測在醫療應用中的可信度。我們分析了來自三個醫療中心的 9,683 份克隆氏症患者的希伯來語放射科報告（從 2010 年到 2023 年）。其中 512 份報告的手動註釋包括六個胃腸器官和 15 個病理發現，而其餘報告則使用 HSMP-BERT 自動註釋。結構化數據提取使用 Llama 3.1（Llama 3-8b-instruct）與貝葉斯提示集合（BayesPE）進行，它採用六個語義等效提示來估計不確定性。基於代理的決策模型將多個提示輸出整合到五個置信度級別中以校準不確定性，並與三個基於熵的模型進行比較。在過濾掉高度不確定性的情況之前和之後，使用準確度、F1 分數、精確度、召回率和 Cohen's Kappa 評估性能。基於代理的模型在所有指標上都優於基線，F1 分數達到 0.3967，召回率達到 0.6437，Cohen's Kappa 達到 0.3006。在過濾掉高度不確定性的情況（大於或等於 0.5）後，F1 分數提高到 0.4787，Kappa 提高到 0.4258。不確定性直方圖顯示了正確預測和不正確預測之間的明顯區別，基於代理的模型提供了校準最好的不確定性估計。通過結合不確定性感知提示集合和基於代理的決策模型，這種方法增強了 LLM 在放射科報告中結構化數據提取中的性能和可靠性，為高風險醫療應用提供了更具可解釋性和可信度的解決方案。</paragraph>
 
-##### **PixLift: Accelerating Web Browsing via AI Upscaling**
-2502.08995v1 by Yonas Atinafu, Sarthak Malla, HyunSeok Daniel Jang, Nouar Aldahoul, Matteo Varvello, Yasir Zaki
+##### **Automated Extraction of Spatio-Semantic Graphs for Identifying Cognitive Impairment**
+2502.01685v1 by Si-Ioi Ng, Pranav S. Ambadi, Kimberly D. Mueller, Julie Liss, Visar Berisha
 
-Accessing the internet in regions with expensive data plans and limited
-connectivity poses significant challenges, restricting information access and
-economic growth. Images, as a major contributor to webpage sizes, exacerbate
-this issue, despite advances in compression formats like WebP and AVIF. The
-continued growth of complex and curated web content, coupled with suboptimal
-optimization practices in many regions, has prevented meaningful reductions in
-web page sizes. This paper introduces PixLift, a novel solution to reduce
-webpage sizes by downscaling their images during transmission and leveraging AI
-models on user devices to upscale them. By trading computational resources for
-bandwidth, PixLift enables more affordable and inclusive web access. We address
-key challenges, including the feasibility of scaled image requests on popular
-websites, the implementation of PixLift as a browser extension, and its impact
-on user experience. Through the analysis of 71.4k webpages, evaluations of
-three mainstream upscaling models, and a user study, we demonstrate PixLift's
-ability to significantly reduce data usage without compromising image quality,
-fostering a more equitable internet.
+Existing methods for analyzing linguistic content from picture descriptions
+for assessment of cognitive-linguistic impairment often overlook the
+participant's visual narrative path, which typically requires eye tracking to
+assess. Spatio-semantic graphs are a useful tool for analyzing this narrative
+path from transcripts alone, however they are limited by the need for manual
+tagging of content information units (CIUs). In this paper, we propose an
+automated approach for estimation of spatio-semantic graphs (via automated
+extraction of CIUs) from the Cookie Theft picture commonly used in
+cognitive-linguistic analyses. The method enables the automatic
+characterization of the visual semantic path during picture description.
+Experiments demonstrate that the automatic spatio-semantic graphs effectively
+differentiate between cognitively impaired and unimpaired speakers. Statistical
+analyses reveal that the features derived by the automated method produce
+comparable results to the manual method, with even greater group differences
+between clinical groups of interest. These results highlight the potential of
+the automated approach for extracting spatio-semantic features in developing
+clinical speech models for cognitive impairment assessment.
 
-摘要：在數據方案昂貴且連線有限的地區存取網路會造成重大挑戰，限制了資訊存取和經濟成長。圖像作為網頁大小的主要貢獻者，儘管 WebP 和 AVIF 等壓縮格式進步，但仍加劇了這個問題。複雜且經過策劃的網路內容持續成長，加上許多地區次佳的最佳化實務，已阻礙了網頁大小的顯著減少。本文介紹 PixLift，這是一種創新的解決方案，可在傳輸過程中縮小圖像大小，並利用使用者裝置上的 AI 模型來放大圖像，藉此縮小網頁大小。PixLift 透過以運算資源換取頻寬，讓網路存取更經濟實惠且更具包容性。我們解決了關鍵挑戰，包括熱門網站上縮放圖像要求的可行性、將 PixLift 實作為瀏覽器擴充功能，以及它對使用者體驗的影響。透過分析 71.4k 個網頁、評估三個主流放大模型，以及使用者研究，我們展示了 PixLift 在不影響影像品質的情況下顯著減少資料用量的能力，促進了更公平的網路。
+摘要：現有的用於分析圖像描述中的語言內容的方法，用於評估認知語言障礙，通常會忽略參與者的視覺敘事路徑，這通常需要眼球追蹤來評估。時空語義圖是一種有用的工具，可以僅從轉錄本中分析此敘事路徑，但是它們受到手動標記內容資訊單元 (CIU) 的需求所限制。在本文中，我們提出了一種自動化方法，用於從認知語言分析中常用的 Cookie Theft 圖像估計時空語義圖（通過自動提取 CIU）。該方法能夠自動表徵圖片描述期間的視覺語義路徑。實驗表明，自動時空語義圖有效地區分了認知受損和未受損的說話者。統計分析表明，自動化方法衍生的特徵產生了與手動方法相當的結果，甚至在感興趣的臨床組之間產生了更大的組差異。這些結果突出了自動化方法在提取時空語義特徵以開發用於認知障礙評估的臨床語音模型方面的潛力。
 
-##### **RLSA-PFL: Robust Lightweight Secure Aggregation with Model Inconsistency Detection in Privacy-Preserving Federated Learning**
-2502.08989v1 by Nazatul H. Sultan, Yan Bo, Yansong Gao, Seyit Camtepe, Arash Mahboubi, Hang Thanh Bui, Aufeef Chauhan, Hamed Aboutorab, Michael Bewong, Praveen Gauravaram, Rafiqul Islam, Sharif Abuadbba
+##### **Registration-Enhanced Segmentation Method for Prostate Cancer in Ultrasound Images**
+2502.00712v1 by Shengtian Sang, Hassan Jahanandish, Cynthia Xinran Li, Indrani Bhattachary, Jeong Hoon Lee, Lichun Zhang, Sulaiman Vesal, Pejman Ghanouni, Richard Fan, Geoffrey A. Sonn, Mirabela Rusu
 
-Federated Learning (FL) allows users to collaboratively train a global
-machine learning model by sharing local model only, without exposing their
-private data to a central server. This distributed learning is particularly
-appealing in scenarios where data privacy is crucial, and it has garnered
-substantial attention from both industry and academia. However, studies have
-revealed privacy vulnerabilities in FL, where adversaries can potentially infer
-sensitive information from the shared model parameters. In this paper, we
-present an efficient masking-based secure aggregation scheme utilizing
-lightweight cryptographic primitives to mitigate privacy risks. Our scheme
-offers several advantages over existing methods. First, it requires only a
-single setup phase for the entire FL training session, significantly reducing
-communication overhead. Second, it minimizes user-side overhead by eliminating
-the need for user-to-user interactions, utilizing an intermediate server layer
-and a lightweight key negotiation method. Third, the scheme is highly resilient
-to user dropouts, and the users can join at any FL round. Fourth, it can detect
-and defend against malicious server activities, including recently discovered
-model inconsistency attacks. Finally, our scheme ensures security in both
-semi-honest and malicious settings. We provide security analysis to formally
-prove the robustness of our approach. Furthermore, we implemented an end-to-end
-prototype of our scheme. We conducted comprehensive experiments and
-comparisons, which show that it outperforms existing solutions in terms of
-communication and computation overhead, functionality, and security.
+Prostate cancer is a major cause of cancer-related deaths in men, where early
+detection greatly improves survival rates. Although MRI-TRUS fusion biopsy
+offers superior accuracy by combining MRI's detailed visualization with TRUS's
+real-time guidance, it is a complex and time-intensive procedure that relies
+heavily on manual annotations, leading to potential errors. To address these
+challenges, we propose a fully automatic MRI-TRUS fusion-based segmentation
+method that identifies prostate tumors directly in TRUS images without
+requiring manual annotations. Unlike traditional multimodal fusion approaches
+that rely on naive data concatenation, our method integrates a
+registration-segmentation framework to align and leverage spatial information
+between MRI and TRUS modalities. This alignment enhances segmentation accuracy
+and reduces reliance on manual effort. Our approach was validated on a dataset
+of 1,747 patients from Stanford Hospital, achieving an average Dice coefficient
+of 0.212, outperforming TRUS-only (0.117) and naive MRI-TRUS fusion (0.132)
+methods, with significant improvements (p $<$ 0.01). This framework
+demonstrates the potential for reducing the complexity of prostate cancer
+diagnosis and provides a flexible architecture applicable to other multimodal
+medical imaging tasks.
 
-摘要：聯合式學習 (FL) 使用者可以透過僅分享本機模型，在不將其私人資料揭露給中央伺服器的情況下，共同訓練全球機器學習模型。這種分散式學習在資料隱私至關重要的場景中特別具有吸引力，並且已獲得業界和學術界的廣泛關注。然而，研究顯示 FL 中存在隱私漏洞，其中對手可能會從共享模型參數中推斷出敏感資訊。在本文中，我們提出了一種有效率的基於遮罩的安全聚合方案，利用輕量級的密碼原語來降低隱私風險。我們的方案相較於現有方法提供了多項優點。首先，它僅需要在整個 FL 訓練階段進行一次設定階段，大幅降低了通訊開銷。其次，透過消除使用者間互動的需要，利用中間伺服器層和輕量級金鑰協商方法，將使用者端的開銷降到最低。第三，該方案對使用者中斷具有高度的復原力，使用者可以在任何 FL 回合中加入。第四，它可以偵測和防禦惡意伺服器活動，包括最近發現的模型不一致攻擊。最後，我們的方案確保在半誠實和惡意設定中都能獲得安全性。我們提供了安全分析，以正式證明我們方法的穩健性。此外，我們實作了我們方案的端對端原型。我們進行了全面的實驗和比較，結果顯示，在通訊和運算開銷、功能和安全性方面，它優於現有的解決方案。
+摘要：前列腺癌是男性癌症相關死亡的主要原因，早期發現可大幅提升存活率。儘管 MRI-TRUS 融合切片檢查結合了 MRI 的詳細視覺化與 TRUS 的即時導引，可提供更高的準確度，但它是一種仰賴大量手動註解的複雜且耗時的程序，容易導致錯誤。為了解決這些挑戰，我們提出了一種全自動的 MRI-TRUS 融合式分割方法，它可以在 TRUS 影像中直接辨識出前列腺腫瘤，而不需要手動註解。與依賴於天真資料串接的傳統多模態融合方法不同，我們的方法整合了一個配準分割架構，以對齊並利用 MRI 與 TRUS 模態之間的空間資訊。這種對齊提升了分割準確度，並減少了對手動作業的依賴。我們的方法已通過來自 Stanford 醫院的 1,747 位患者的資料集進行驗證，達到了 0.212 的平均 Dice 係數，優於僅使用 TRUS (0.117) 和天真的 MRI-TRUS 融合 (0.132) 方法，並有顯著的改善（p < 0.01）。這個架構證明了降低前列腺癌診斷複雜性的潛力，並提供了一個適用於其他多模態醫學影像任務的彈性架構。
 
-##### **Neural Force Field: Learning Generalized Physical Representation from a Few Examples**
-2502.08987v1 by Shiqian Li, Ruihong Shen, Chi Zhang, Yixin Zhu
+##### **TMI-CLNet: Triple-Modal Interaction Network for Chronic Liver Disease Prognosis From Imaging, Clinical, and Radiomic Data Fusion**
+2502.00695v1 by Linglong Wu, Xuhao Shan, Ruiquan Ge, Ruoyu Liang, Chi Zhang, Yonghong Li, Ahmed Elazab, Huoling Luo, Yunbi Liu, Changmiao Wang
 
-Physical reasoning is a remarkable human ability that enables rapid learning
-and generalization from limited experience. Current AI models, despite
-extensive training, still struggle to achieve similar generalization,
-especially in Out-of-distribution (OOD) settings. This limitation stems from
-their inability to abstract core physical principles from observations. A key
-challenge is developing representations that can efficiently learn and
-generalize physical dynamics from minimal data. Here we present Neural Force
-Field (NFF) a modeling framework built on Neural Ordinary Differential Equation
-(NODE) that learns interpretable force field representations which can be
-efficiently integrated through an Ordinary Differential Equation ( ODE) solver
-to predict object trajectories. Unlike existing approaches that rely on
-high-dimensional latent spaces, NFF captures fundamental physical concepts such
-as gravity, support, and collision in an interpretable manner. Experiments on
-two challenging physical reasoning tasks demonstrate that NFF, trained with
-only a few examples, achieves strong generalization to unseen scenarios. This
-physics-grounded representation enables efficient forward-backward planning and
-rapid adaptation through interactive refinement. Our work suggests that
-incorporating physics-inspired representations into learning systems can help
-bridge the gap between artificial and human physical reasoning capabilities.
+Chronic liver disease represents a significant health challenge worldwide and
+accurate prognostic evaluations are essential for personalized treatment plans.
+Recent evidence suggests that integrating multimodal data, such as computed
+tomography imaging, radiomic features, and clinical information, can provide
+more comprehensive prognostic information. However, modalities have an inherent
+heterogeneity, and incorporating additional modalities may exacerbate the
+challenges of heterogeneous data fusion. Moreover, existing multimodal fusion
+methods often struggle to adapt to richer medical modalities, making it
+difficult to capture inter-modal relationships. To overcome these limitations,
+We present the Triple-Modal Interaction Chronic Liver Network (TMI-CLNet).
+Specifically, we develop an Intra-Modality Aggregation module and a
+Triple-Modal Cross-Attention Fusion module, which are designed to eliminate
+intra-modality redundancy and extract cross-modal information, respectively.
+Furthermore, we design a Triple-Modal Feature Fusion loss function to align
+feature representations across modalities. Extensive experiments on the liver
+prognosis dataset demonstrate that our approach significantly outperforms
+existing state-of-the-art unimodal models and other multi-modal techniques. Our
+code is available at https://github.com/Mysterwll/liver.git.
 
-摘要：物理推理是人类非凡的能力，它能从有限的经验中快速学习和概括。尽管经过广泛的训练，但当前的人工智能模型在实现类似的概括方面仍然存在困难，尤其是在分布外 (OOD) 设置中。这种限制源于它们无法从观察中抽象出核心物理原理。一个关键挑战是开发能够从最少数据中有效学习和概括物理动力学的表示。在这里，我们介绍了神经力场 (NFF)，这是一种建立在神经常微分方程 (NODE) 上的建模框架，它学习可解释的力场表示，这些表示可以通过常微分方程 (ODE) 求解器有效地进行积分，以预测物体轨迹。与依赖于高维潜在空间的现有方法不同，NFF 以可解释的方式捕获了诸如重力、支撑和碰撞等基本物理概念。在两个具有挑战性的物理推理任务上的实验表明，仅通过几个示例训练的 NFF 实现了对看不见场景的强大概括。这种基于物理的表示能够进行高效的前向后向规划，并通过交互式细化实现快速适应。我们的工作表明，将受物理启发的表示纳入学习系统可以帮助弥合人工智能和人类物理推理能力之间的差距。
+摘要：慢性肝病在全球范围内代表著重大的健康挑戰，而準確的預後評估對於個人化治療計畫至關重要。最近的證據表明，整合多模態資料（例如電腦斷層影像、放射特徵和臨床資訊）可以提供更全面的預後資訊。然而，模態具有內在異質性，而納入額外的模態可能會加劇異質化資料融合的挑戰。此外，現有的多模態融合方法通常難以適應更豐富的醫療模態，這使得難以捕捉模態間的關係。為了克服這些限制，我們提出了三模態交互慢性肝臟網路 (TMI-CLNet)。具體來說，我們開發了一個模態內聚合模組和一個三模態交叉注意力融合模組，它們分別旨在消除模態內冗餘和提取跨模態資訊。此外，我們設計了一個三模態特徵融合損失函數，以對齊跨模態的特徵表示。在肝臟預後資料集上的廣泛實驗表明，我們的做法顯著優於現有的最先進單模態模型和其他多模態技術。我們的程式碼可以在 https://github.com/Mysterwll/liver.git 上取得。
 
-##### **Tuning-Free Personalized Alignment via Trial-Error-Explain In-Context Learning**
-2502.08972v1 by Hyundong Cho, Karishma Sharma, Nicolaas Jedema, Leonardo F. R. Ribeiro, Alessandro Moschitti, Ravi Krishnan, Jonathan May
+##### **Safety at Scale: A Comprehensive Survey of Large Model Safety**
+2502.05206v2 by Xingjun Ma, Yifeng Gao, Yixu Wang, Ruofan Wang, Xin Wang, Ye Sun, Yifan Ding, Hengyuan Xu, Yunhao Chen, Yunhan Zhao, Hanxun Huang, Yige Li, Jiaming Zhang, Xiang Zheng, Yang Bai, Zuxuan Wu, Xipeng Qiu, Jingfeng Zhang, Yiming Li, Jun Sun, Cong Wang, Jindong Gu, Baoyuan Wu, Siheng Chen, Tianwei Zhang, Yang Liu, Mingming Gong, Tongliang Liu, Shirui Pan, Cihang Xie, Tianyu Pang, Yinpeng Dong, Ruoxi Jia, Yang Zhang, Shiqing Ma, Xiangyu Zhang, Neil Gong, Chaowei Xiao, Sarah Erfani, Bo Li, Masashi Sugiyama, Dacheng Tao, James Bailey, Yu-Gang Jiang
 
-Language models are aligned to the collective voice of many, resulting in
-generic outputs that do not align with specific users' styles. In this work, we
-present Trial-Error-Explain In-Context Learning (TICL), a tuning-free method
-that personalizes language models for text generation tasks with fewer than 10
-examples per user. TICL iteratively expands an in-context learning prompt via a
-trial-error-explain process, adding model-generated negative samples and
-explanations that provide fine-grained guidance towards a specific user's
-style. TICL achieves favorable win rates on pairwise comparisons with
-LLM-as-a-judge up to 91.5% against the previous state-of-the-art and
-outperforms competitive tuning-free baselines for personalized alignment tasks
-of writing emails, essays and news articles. Both lexical and qualitative
-analyses show that the negative samples and explanations enable language models
-to learn stylistic context more effectively and overcome the bias towards
-structural and formal phrases observed in their zero-shot outputs. By
-front-loading inference compute to create a user-specific in-context learning
-prompt that does not require extra generation steps at test time, TICL presents
-a novel yet simple approach for personalized alignment.
+The rapid advancement of large models, driven by their exceptional abilities
+in learning and generalization through large-scale pre-training, has reshaped
+the landscape of Artificial Intelligence (AI). These models are now
+foundational to a wide range of applications, including conversational AI,
+recommendation systems, autonomous driving, content generation, medical
+diagnostics, and scientific discovery. However, their widespread deployment
+also exposes them to significant safety risks, raising concerns about
+robustness, reliability, and ethical implications. This survey provides a
+systematic review of current safety research on large models, covering Vision
+Foundation Models (VFMs), Large Language Models (LLMs), Vision-Language
+Pre-training (VLP) models, Vision-Language Models (VLMs), Diffusion Models
+(DMs), and large-model-based Agents. Our contributions are summarized as
+follows: (1) We present a comprehensive taxonomy of safety threats to these
+models, including adversarial attacks, data poisoning, backdoor attacks,
+jailbreak and prompt injection attacks, energy-latency attacks, data and model
+extraction attacks, and emerging agent-specific threats. (2) We review defense
+strategies proposed for each type of attacks if available and summarize the
+commonly used datasets and benchmarks for safety research. (3) Building on
+this, we identify and discuss the open challenges in large model safety,
+emphasizing the need for comprehensive safety evaluations, scalable and
+effective defense mechanisms, and sustainable data practices. More importantly,
+we highlight the necessity of collective efforts from the research community
+and international collaboration. Our work can serve as a useful reference for
+researchers and practitioners, fostering the ongoing development of
+comprehensive defense systems and platforms to safeguard AI models.
 
-摘要：語言模型與眾人的集體聲音保持一致，導致產出內容流於一般，無法與特定使用者的風格相符。在這項工作中，我們提出了試驗錯誤解釋情境內學習 (TICL)，一種免調校方法，能為文字生成任務個人化語言模型，每個使用者少於 10 個範例。TICL 透過試驗錯誤解釋程序反覆擴充情境內學習提示，加入模型產生的負面範例和說明，提供細緻的指導，引導至特定使用者的風格。TICL 在與 LLM 作為評審的成對比較中獲得了高勝率，高達 91.5%，優於先前的技術水準，並在個人化對齊任務中超越了競爭性的免調校基準，包括撰寫電子郵件、論文和新聞文章。詞彙和質性分析皆顯示，負面範例和說明讓語言模型能更有效地學習風格脈絡，並克服零次學習產出中觀察到的結構性和正式詞組偏誤。透過預先加載推論運算，建立使用者特定的情境內學習提示，無需在測試時額外產生步驟，TICL 呈現一種新穎卻簡潔的方法，用於個人化對齊。
+摘要：<paragraph>大型模型的快速進展，得益於它們在通過大規模預訓練進行學習和概括方面的卓越能力，已經重塑了人工智能 (AI) 的格局。這些模型現在是廣泛應用程式（包括對話式 AI、推薦系統、自動駕駛、內容生成、醫療診斷和科學發現）的基礎。然而，它們的廣泛部署也使它們面臨重大的安全風險，引發了對穩健性、可靠性和倫理影響的擔憂。本調查提供了對大型模型當前安全研究的系統性回顧，涵蓋視覺基礎模型 (VFM)、大型語言模型 (LLM)、視覺語言預訓練 (VLP) 模型、視覺語言模型 (VLM)、擴散模型 (DM) 和基於大型模型的代理。我們的貢獻總結如下：(1) 我們提出了一個針對這些模型的安全威脅的全面分類，包括對抗性攻擊、資料中毒、後門攻擊、越獄和提示注入攻擊、能量延遲攻擊、資料和模型提取攻擊以及新興的特定代理威脅。(2) 我們檢視了針對每種類型攻擊提出的防禦策略（如果有的話），並總結了安全研究中常用的資料集和基準。(3) 基於此，我們找出並討論了大型模型安全中的開放性挑戰，強調了對全面安全評估、可擴充且有效的防禦機制以及永續資料實務的需求。更重要的是，我們強調了研究社群和國際合作共同努力的必要性。我們的研究可作為研究人員和從業人員的有用參考，促進全面防禦系統和平台的持續發展，以保護 AI 模型。</paragraph>
 
-##### **RTBAS: Defending LLM Agents Against Prompt Injection and Privacy Leakage**
-2502.08966v1 by Peter Yong Zhong, Siyuan Chen, Ruiqi Wang, McKenna McCall, Ben L. Titzer, Heather Miller
+##### **Enhanced Convolutional Neural Networks for Improved Image Classification**
+2502.00663v1 by Xiaoran Yang, Shuhan Yu, Wenxi Xu
 
-Tool-Based Agent Systems (TBAS) allow Language Models (LMs) to use external
-tools for tasks beyond their standalone capabilities, such as searching
-websites, booking flights, or making financial transactions. However, these
-tools greatly increase the risks of prompt injection attacks, where malicious
-content hijacks the LM agent to leak confidential data or trigger harmful
-actions. Existing defenses (OpenAI GPTs) require user confirmation before every
-tool call, placing onerous burdens on users. We introduce Robust TBAS (RTBAS),
-which automatically detects and executes tool calls that preserve integrity and
-confidentiality, requiring user confirmation only when these safeguards cannot
-be ensured. RTBAS adapts Information Flow Control to the unique challenges
-presented by TBAS. We present two novel dependency screeners, using
-LM-as-a-judge and attention-based saliency, to overcome these challenges.
-Experimental results on the AgentDojo Prompt Injection benchmark show RTBAS
-prevents all targeted attacks with only a 2% loss of task utility when under
-attack, and further tests confirm its ability to obtain near-oracle performance
-on detecting both subtle and direct privacy leaks.
+Image classification is a fundamental task in computer vision with diverse
+applications, ranging from autonomous systems to medical imaging. The CIFAR-10
+dataset is a widely used benchmark to evaluate the performance of
+classification models on small-scale, multi-class datasets. Convolutional
+Neural Networks (CNNs) have demonstrated state-of-the-art results; however,
+they often suffer from overfitting and suboptimal feature representation when
+applied to challenging datasets like CIFAR-10. In this paper, we propose an
+enhanced CNN architecture that integrates deeper convolutional blocks, batch
+normalization, and dropout regularization to achieve superior performance. The
+proposed model achieves a test accuracy of 84.95%, outperforming baseline CNN
+architectures. Through detailed ablation studies, we demonstrate the
+effectiveness of the enhancements and analyze the hierarchical feature
+representations. This work highlights the potential of refined CNN
+architectures for tackling small-scale image classification problems
+effectively.
 
-摘要：基於工具的代理系統 (TBAS) 允許語言模型 (LM) 使用外部工具來執行超出其獨立功能的任務，例如搜尋網站、預訂航班或進行金融交易。然而，這些工具大幅增加了提示注入攻擊的風險，其中惡意內容劫持 LM 代理程式以洩露機密資料或觸發有害動作。現有的防禦措施 (OpenAI GPT) 在每次呼叫工具之前都需要使用者確認，這會對使用者造成沉重的負擔。我們引入了穩健的 TBAS (RTBAS)，它會自動偵測並執行保留完整性與機密性的工具呼叫，僅在無法確保這些防護措施時才需要使用者確認。RTBAS 將資訊流控制調整為 TBAS 呈現的獨特挑戰。我們提出兩種新穎的相依性篩選器，使用 LM 作為判斷者和基於注意力的顯著性，以克服這些挑戰。AgentDojo 提示注入基準上的實驗結果顯示，RTBAS 在受到攻擊時僅損失 2% 的任務效用，即可防止所有目標攻擊，進一步的測試證實了其在偵測細微和直接的隱私洩漏方面獲得接近神諭效能的能力。
+摘要：影像分類是電腦視覺中的一項基本任務，應用範圍廣泛，從自動系統到醫學影像皆有。CIFAR-10 資料集是一個廣泛使用的基準，用於評估分類模型在小規模、多類別資料集上的效能。卷積神經網路 (CNN) 已展現出最先進的成果；然而，當應用於 CIFAR-10 等具挑戰性的資料集時，它們常常會發生過度擬合和次佳特徵表示的問題。在本文中，我們提出一個增強的 CNN 架構，它整合了更深的卷積區塊、批次正規化和中斷正規化，以達成卓越的效能。所提出的模型達到了 84.95% 的測試準確度，優於基準 CNN 架構。透過詳細的消融研究，我們證明了這些增強功能的有效性，並分析了階層式特徵表示。這項工作突顯了精進的 CNN 架構在有效解決小規模影像分類問題上的潛力。
 
-##### **Biologically Plausible Brain Graph Transformer**
-2502.08958v1 by Ciyuan Peng, Yuelong Huang, Qichao Dong, Shuo Yu, Feng Xia, Chengqi Zhang, Yaochu Jin
+##### **Distribution-aware Fairness Learning in Medical Image Segmentation From A Control-Theoretic Perspective**
+2502.00619v1 by Yujin Oh, Pengfei Jin, Sangjoon Park, Sekeun Kim, Siyeop Yoon, Kyungsang Kim, Jin Sung Kim, Xiang Li, Quanzheng Li
 
-State-of-the-art brain graph analysis methods fail to fully encode the
-small-world architecture of brain graphs (accompanied by the presence of hubs
-and functional modules), and therefore lack biological plausibility to some
-extent. This limitation hinders their ability to accurately represent the
-brain's structural and functional properties, thereby restricting the
-effectiveness of machine learning models in tasks such as brain disorder
-detection. In this work, we propose a novel Biologically Plausible Brain Graph
-Transformer (BioBGT) that encodes the small-world architecture inherent in
-brain graphs. Specifically, we present a network entanglement-based node
-importance encoding technique that captures the structural importance of nodes
-in global information propagation during brain graph communication,
-highlighting the biological properties of the brain structure. Furthermore, we
-introduce a functional module-aware self-attention to preserve the functional
-segregation and integration characteristics of brain graphs in the learned
-representations. Experimental results on three benchmark datasets demonstrate
-that BioBGT outperforms state-of-the-art models, enhancing biologically
-plausible brain graph representations for various brain graph analytical tasks
+Ensuring fairness in medical image segmentation is critical due to biases in
+imbalanced clinical data acquisition caused by demographic attributes (e.g.,
+age, sex, race) and clinical factors (e.g., disease severity). To address these
+challenges, we introduce Distribution-aware Mixture of Experts (dMoE), inspired
+by optimal control theory. We provide a comprehensive analysis of its
+underlying mechanisms and clarify dMoE's role in adapting to heterogeneous
+distributions in medical image segmentation. Furthermore, we integrate dMoE
+into multiple network architectures, demonstrating its broad applicability
+across diverse medical image analysis tasks. By incorporating demographic and
+clinical factors, dMoE achieves state-of-the-art performance on two 2D
+benchmark datasets and a 3D in-house dataset. Our results highlight the
+effectiveness of dMoE in mitigating biases from imbalanced distributions,
+offering a promising approach to bridging control theory and medical image
+segmentation within fairness learning paradigms. The source code will be made
+available.
 
-摘要：目前最先进的大腦圖形分析方法無法完全編碼大腦圖形的小世界架構（伴隨著樞紐和功能模組的存在），因此在某種程度上缺乏生物學上的可信度。這種限制阻礙了它們準確表示大腦結構和功能特性的能力，從而限制了機器學習模型在腦部疾病檢測等任務中的有效性。在這項工作中，我們提出了一個新的生物學上可信的大腦圖形轉換器 (BioBGT)，它編碼了大腦圖形中固有的、小世界的架構。具體來說，我們提出了一種基於網路糾纏的節點重要性編碼技術，它捕捉了大腦圖形通信過程中節點在全球資訊傳播中的結構重要性，突出了大腦結構的生物學特性。此外，我們引入了一個功能模組感知自注意力，以保留學習表徵中大腦圖形的功能分離和整合特性。在三個基準資料集上的實驗結果表明，BioBGT 優於最先進的模型，增強了各種大腦圖形分析任務的生物學上可信的大腦圖形表徵
+摘要：在医学影像分割中，由於人口屬性（例如年齡、性別、種族）和臨床因素（例如疾病嚴重程度）導致不平衡的臨床數據採集中存在偏差，因此確保公平性至關重要。為了應對這些挑戰，我們引入了受最優控制理論啟發的感知混合專家 (dMoE)。我們對其底層機制進行了全面分析，並釐清了 dMoE 在適應醫學影像分割中的異質分佈中的作用。此外，我們將 dMoE 整合到多個網路架構中，展示了其在各種醫學影像分析任務中的廣泛適用性。通過納入人口統計和臨床因素，dMoE 在兩個 2D 基準數據集和一個 3D 內部數據集上實現了最先進的性能。我們的結果突出了 dMoE 在減輕不平衡分佈的偏差方面的有效性，為在公平性學習範例中橋接控制理論和醫學影像分割提供了一個有前景的方法。原始碼將會公開。
 
-##### **Medicine on the Edge: Comparative Performance Analysis of On-Device LLMs for Clinical Reasoning**
-2502.08954v1 by Leon Nissen, Philipp Zagar, Vishnu Ravi, Aydin Zahedivash, Lara Marie Reimer, Stephan Jonas, Oliver Aalami, Paul Schmiedmayer
+##### **Generating crossmodal gene expression from cancer histopathology improves multimodal AI predictions**
+2502.00568v3 by Samiran Dey, Christopher R. S. Banerji, Partha Basuchowdhuri, Sanjoy K. Saha, Deepak Parashar, Tapabrata Chakraborti
 
-The deployment of Large Language Models (LLM) on mobile devices offers
-significant potential for medical applications, enhancing privacy, security,
-and cost-efficiency by eliminating reliance on cloud-based services and keeping
-sensitive health data local. However, the performance and accuracy of on-device
-LLMs in real-world medical contexts remain underexplored. In this study, we
-benchmark publicly available on-device LLMs using the AMEGA dataset, evaluating
-accuracy, computational efficiency, and thermal limitation across various
-mobile devices. Our results indicate that compact general-purpose models like
-Phi-3 Mini achieve a strong balance between speed and accuracy, while medically
-fine-tuned models such as Med42 and Aloe attain the highest accuracy. Notably,
-deploying LLMs on older devices remains feasible, with memory constraints
-posing a greater challenge than raw processing power. Our study underscores the
-potential of on-device LLMs for healthcare while emphasizing the need for more
-efficient inference and models tailored to real-world clinical reasoning.
+Emerging research has highlighted that artificial intelligence based
+multimodal fusion of digital pathology and transcriptomic features can improve
+cancer diagnosis (grading/subtyping) and prognosis (survival risk) prediction.
+However, such direct fusion for joint decision is impractical in real clinical
+settings, where histopathology is still the gold standard for diagnosis and
+transcriptomic tests are rarely requested, at least in the public healthcare
+system. With our novel diffusion based crossmodal generative AI model PathGen,
+we show that genomic expressions synthesized from digital histopathology
+jointly predicts cancer grading and patient survival risk with high accuracy
+(state-of-the-art performance), certainty (through conformal coverage
+guarantee) and interpretability (through distributed attention maps). PathGen
+code is available for open use by the research community through GitHub at
+https://github.com/Samiran-Dey/PathGen.
 
-摘要：大型語言模型 (LLM) 在行動裝置上的部署為醫療應用程式提供了巨大的潛力，透過消除對雲端服務的依賴並將敏感的健康資料儲存在本地，進而提升隱私、安全性，並提高成本效益。然而，在實際的醫療環境中，裝置上 LLM 的效能和準確度仍未受到充分的探討。在此研究中，我們使用 AMEGA 資料集來評量公開可用的裝置上 LLM，並評估其在各種行動裝置上的準確度、運算效率和熱限制。我們的結果顯示，像 Phi-3 Mini 等精簡的一般用途模型在速度和準確度之間取得了良好的平衡，而經過醫學微調的模型，例如 Med42 和 Aloe，則達到了最高的準確度。值得注意的是，在較舊的裝置上部署 LLM 仍然可行，記憶體限制比原始處理能力構成更大的挑戰。我們的研究強調了裝置上 LLM 在醫療保健方面的潛力，同時強調了對更有效率的推理和針對實際臨床推理量身打造的模型的需求。
+摘要：新興研究強調，基於人工智慧的多模態融合數位病理學和轉錄組特徵，可以改善癌症診斷（分級/分型）和預後（存活風險）預測。
+然而，這種直接融合對於聯合決策在實際臨床環境中並不切實際，在實際臨床環境中，組織病理學仍然是診斷的黃金標準，而轉錄組檢測很少被要求，至少在公共醫療保健系統中是如此。透過我們新穎的基於擴散的跨模態生成式 AI 模型 PathGen，我們展示了從數位組織病理學合成的基因體表達共同預測癌症分級和患者存活風險，具有很高的準確度（最先進的效能）、確定性（透過共形覆蓋保證）和可解釋性（透過分佈式注意力圖）。PathGen 程式碼可透過 GitHub 上的 https://github.com/Samiran-Dey/PathGen 供研究社群公開使用。
 
diff --git a/docs/index.md b/docs/index.md
index 90ce7b8fed..0d91b737db 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -1,10361 +1,10361 @@
 # arxiv-daily
- Automated deployment @ 2025-02-16 09:10:43 Asia/Taipei
+ Automated deployment @ 2025-02-16 20:27:09 Asia/Taipei
 > Welcome to contribute! Add your topics and keywords in [`topic.yml`](https://github.com/jawatech/arxiv-daily-in-place/blob/main/database/topic.yml).
 > You can also view historical data through the [storage](https://github.com/jawatech/arxiv-daily-in-place/blob/main/database/storage).
 
 ## AI
 
-### Medical explainable AI
+### Knowledge Graphs
 |Publish Date|Title|Authors|Homepage|Code|
 | :---: | :---: | :---: | :---: | :---: |
-|**2025-02-13**|**Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**|Flavio Bertini et.al.|[2502.09218v1](http://arxiv.org/abs/2502.09218v1)|null|
-|**2025-02-10**|**Foundation Model of Electronic Medical Records for Adaptive Risk Estimation**|Pawel Renc et.al.|[2502.06124v1](http://arxiv.org/abs/2502.06124v1)|null|
-|**2025-01-27**|**An Explainable Disease Surveillance System for Early Prediction of Multiple Chronic Diseases**|Shaheer Ahmad Khan et.al.|[2501.15969v1](http://arxiv.org/abs/2501.15969v1)|null|
-|**2025-01-23**|**Ensuring Medical AI Safety: Explainable AI-Driven Detection and Mitigation of Spurious Model Behavior and Associated Data**|Frederik Pahde et.al.|[2501.13818v1](http://arxiv.org/abs/2501.13818v1)|[link](https://github.com/frederikpahde/medical-ai-safety)|
-|**2025-01-19**|**Enhanced Suicidal Ideation Detection from Social Media Using a CNN-BiLSTM Hybrid Model**|Mohaiminul Islam Bhuiyan et.al.|[2501.11094v1](http://arxiv.org/abs/2501.11094v1)|null|
-|**2025-01-17**|**SEANN: A Domain-Informed Neural Network for Epidemiological Insights**|Jean-Baptiste Guimbaud et.al.|[2501.10273v1](http://arxiv.org/abs/2501.10273v1)|null|
-|**2025-01-16**|**Artificial Intelligence-Driven Clinical Decision Support Systems**|Muhammet Alkan et.al.|[2501.09628v1](http://arxiv.org/abs/2501.09628v1)|null|
-|**2025-01-12**|**MedGrad E-CLIP: Enhancing Trust and Transparency in AI-Driven Skin Lesion Diagnosis**|Sadia Kamal et.al.|[2501.06887v1](http://arxiv.org/abs/2501.06887v1)|null|
-|**2025-01-06**|**Explaining Humour Style Classifications: An XAI Approach to Understanding Computational Humour Analysis**|Mary Ogbuka Kenneth et.al.|[2501.02891v1](http://arxiv.org/abs/2501.02891v1)|null|
-|**2024-12-28**|**The Emotional Spectrum of LLMs: Leveraging Empathy and Emotion-Based Markers for Mental Health Support**|Alessandro De Grandi et.al.|[2412.20068v1](http://arxiv.org/abs/2412.20068v1)|null|
-|**2024-12-27**|**A Review on the Integration of Artificial Intelligence and Medical Imaging in IVF Ovarian Stimulation**|Jana Zakall et.al.|[2412.19688v1](http://arxiv.org/abs/2412.19688v1)|null|
-|**2024-12-23**|**Enhancing Cancer Diagnosis with Explainable & Trustworthy Deep Learning Models**|Badaru I. Olumuyiwa et.al.|[2412.17527v1](http://arxiv.org/abs/2412.17527v1)|null|
-|**2024-12-20**|**Towards Interpretable Radiology Report Generation via Concept Bottlenecks using a Multi-Agentic RAG**|Hasan Md Tusfiqur Alam et.al.|[2412.16086v2](http://arxiv.org/abs/2412.16086v2)|[link](https://github.com/tifat58/irr-with-cbm-rag)|
-|**2024-12-20**|**Critique of Impure Reason: Unveiling the reasoning behaviour of medical Large Language Models**|Shamus Sim et.al.|[2412.15748v1](http://arxiv.org/abs/2412.15748v1)|null|
-|**2024-12-18**|**Cognition Chain for Explainable Psychological Stress Detection on Social Media**|Xin Wang et.al.|[2412.14009v1](http://arxiv.org/abs/2412.14009v1)|null|
-|**2024-11-30**|**2-Factor Retrieval for Improved Human-AI Decision Making in Radiology**|Jim Solomon et.al.|[2412.00372v1](http://arxiv.org/abs/2412.00372v1)|null|
-|**2024-11-28**|**Mapping Public Perception of Artificial Intelligence: Expectations, Risk-Benefit Tradeoffs, and Value As Determinants for Societal Acceptance**|Philipp Brauner et.al.|[2411.19356v1](http://arxiv.org/abs/2411.19356v1)|null|
-|**2024-11-26**|**Explainable AI for Classifying UTI Risk Groups Using a Real-World Linked EHR and Pathology Lab Dataset**|Yujie Dai et.al.|[2411.17645v2](http://arxiv.org/abs/2411.17645v2)|null|
-|**2024-11-18**|**Exploring the Requirements of Clinicians for Explainable AI Decision Support Systems in Intensive Care**|Jeffrey N. Clark et.al.|[2411.11774v1](http://arxiv.org/abs/2411.11774v1)|null|
-|**2024-11-15**|**Artificial Intelligence in Pediatric Echocardiography: Exploring Challenges, Opportunities, and Clinical Applications with Explainable AI and Federated Learning**|Mohammed Yaseen Jabarulla et.al.|[2411.10255v1](http://arxiv.org/abs/2411.10255v1)|null|
-|**2024-11-01**|**Enhancing Osteoporosis Detection: An Explainable Multi-Modal Learning Framework with Feature Fusion and Variable Clustering**|Mehdi Hosseini Chagahi et.al.|[2411.00916v2](http://arxiv.org/abs/2411.00916v2)|null|
-|**2024-10-25**|**A Review of Deep Learning Approaches for Non-Invasive Cognitive Impairment Detection**|Muath Alsuhaibani et.al.|[2410.19898v1](http://arxiv.org/abs/2410.19898v1)|null|
-|**2024-10-23**|**An Ontology-Enabled Approach For User-Centered and Knowledge-Enabled Explanations of AI Systems**|Shruthi Chari et.al.|[2410.17504v1](http://arxiv.org/abs/2410.17504v1)|[link](https://github.com/tetherless-world/metaexplainer)|
-|**2024-10-22**|**Contrasting Attitudes Towards Current and Future AI Applications for Computerised Interpretation of ECG: A Clinical Stakeholder Interview Study**|Lukas Hughes-Noehrer et.al.|[2410.16879v1](http://arxiv.org/abs/2410.16879v1)|null|
-|**2024-10-19**|**Pathologist-like explainable AI for interpretable Gleason grading in prostate cancer**|Gesa Mittmann et.al.|[2410.15012v1](http://arxiv.org/abs/2410.15012v1)|null|
-|**2024-10-15**|**Explainable AI Methods for Multi-Omics Analysis: A Survey**|Ahmad Hussein et.al.|[2410.11910v1](http://arxiv.org/abs/2410.11910v1)|null|
-|**2024-10-14**|**Study on the Helpfulness of Explainable Artificial Intelligence**|Tobias Labarta et.al.|[2410.11896v1](http://arxiv.org/abs/2410.11896v1)|[link](https://github.com/tlabarta/helpfulnessofxai)|
-|**2024-10-12**|**Use of What-if Scenarios to Help Explain Artificial Intelligence Models for Neonatal Health**|Abdullah Mamun et.al.|[2410.09635v1](http://arxiv.org/abs/2410.09635v1)|[link](https://github.com/ab9mamun/aimen)|
-|**2024-10-10**|**Artificial intelligence techniques in inherited retinal diseases: A review**|Han Trinh et.al.|[2410.09105v1](http://arxiv.org/abs/2410.09105v1)|null|
-|**2024-10-07**|**CasiMedicos-Arg: A Medical Question Answering Dataset Annotated with Explanatory Argumentative Structures**|Ekaterina Sviridova et.al.|[2410.05235v2](http://arxiv.org/abs/2410.05235v2)|[link](https://github.com/ixa-ehu/antidote-casimedicos)|
-|**2024-10-01**|**Explainable Diagnosis Prediction through Neuro-Symbolic Integration**|Qiuhao Lu et.al.|[2410.01855v2](http://arxiv.org/abs/2410.01855v2)|null|
-|**2024-10-01**|**Easydiagnos: a framework for accurate feature selection for automatic diagnosis in smart healthcare**|Prasenjit Maji et.al.|[2410.00366v1](http://arxiv.org/abs/2410.00366v1)|null|
-|**2024-09-20**|**Dermatologist-like explainable AI enhances melanoma diagnosis accuracy: eye-tracking study**|Tirtha Chanda et.al.|[2409.13476v1](http://arxiv.org/abs/2409.13476v1)|null|
-|**2024-09-19**|**Explainable AI for Autism Diagnosis: Identifying Critical Brain Regions Using fMRI Data**|Suryansh Vidya et.al.|[2409.15374v1](http://arxiv.org/abs/2409.15374v1)|null|
-|**2024-09-19**|**Improving Prototypical Parts Abstraction for Case-Based Reasoning Explanations Designed for the Kidney Stone Type Recognition**|Daniel Flores-Araiza et.al.|[2409.12883v1](http://arxiv.org/abs/2409.12883v1)|null|
-|**2024-09-18**|**Towards Interpretable End-Stage Renal Disease (ESRD) Prediction: Utilizing Administrative Claims Data with Explainable AI Techniques**|Yubo Li et.al.|[2409.12087v3](http://arxiv.org/abs/2409.12087v3)|null|
-|**2024-09-13**|**Contextual Evaluation of Large Language Models for Classifying Tropical and Infectious Diseases**|Mercy Asiedu et.al.|[2409.09201v3](http://arxiv.org/abs/2409.09201v3)|null|
-|**2024-09-09**|**Explainable AI: Definition and attributes of a good explanation for health AI**|Evangelia Kyrimi et.al.|[2409.15338v1](http://arxiv.org/abs/2409.15338v1)|null|
-|**2024-08-30**|**Exploring the Effect of Explanation Content and Format on User Comprehension and Trust**|Antonio Rago et.al.|[2408.17401v1](http://arxiv.org/abs/2408.17401v1)|null|
-|**2024-08-29**|**A Survey for Large Language Models in Biomedicine**|Chong Wang et.al.|[2409.00133v1](http://arxiv.org/abs/2409.00133v1)|null|
-|**2024-08-27**|**Aligning XAI with EU Regulations for Smart Biomedical Devices: A Methodology for Compliance Analysis**|Francesco Sovrano et.al.|[2408.15121v1](http://arxiv.org/abs/2408.15121v1)|null|
-|**2024-08-24**|**Towards Case-based Interpretability for Medical Federated Learning**|Laura Latorre et.al.|[2408.13626v1](http://arxiv.org/abs/2408.13626v1)|null|
-|**2024-08-22**|**AI in radiological imaging of soft-tissue and bone tumours: a systematic review evaluating against CLAIM and FUTURE-AI guidelines**|Douwe J. Spaanderman et.al.|[2408.12491v1](http://arxiv.org/abs/2408.12491v1)|null|
-|**2024-08-14**|**Evaluating Explainable AI Methods in Deep Learning Models for Early Detection of Cerebral Palsy**|Kimji N. Pellano et.al.|[2409.00001v1](http://arxiv.org/abs/2409.00001v1)|null|
-|**2024-08-06**|**MicroXercise: A Micro-Level Comparative and Explainable System for Remote Physical Therapy**|Hanchen David Wang et.al.|[2408.11837v1](http://arxiv.org/abs/2408.11837v1)|null|
-|**2024-08-05**|**The Literature Review Network: An Explainable Artificial Intelligence for Systematic Literature Reviews, Meta-analyses, and Method Development**|Joshua Morriss et.al.|[2408.05239v1](http://arxiv.org/abs/2408.05239v1)|null|
-|**2024-08-05**|**Enhancing Medical Learning and Reasoning Systems: A Boxology-Based Comparative Analysis of Design Patterns**|Chi Him Ng et.al.|[2408.02709v1](http://arxiv.org/abs/2408.02709v1)|null|
-|**2024-08-05**|**Bayesian Kolmogorov Arnold Networks (Bayesian_KANs): A Probabilistic Approach to Enhance Accuracy and Interpretability**|Masoud Muhammed Hassan et.al.|[2408.02706v1](http://arxiv.org/abs/2408.02706v1)|null|
-|**2024-07-26**|**MLtoGAI: Semantic Web based with Machine Learning for Enhanced Disease Prediction and Personalized Recommendations using Generative AI**|Shyam Dongre et.al.|[2407.20284v1](http://arxiv.org/abs/2407.20284v1)|null|
-|**2024-07-25**|**Introducing δ-XAI: a novel sensitivity-based method for local AI explanations**|Alessandro De Carlo et.al.|[2407.18343v2](http://arxiv.org/abs/2407.18343v2)|null|
-|**2024-07-24**|**Enhanced Deep Learning Methodologies and MRI Selection Techniques for Dementia Diagnosis in the Elderly Population**|Nikolaos Ntampakis et.al.|[2407.17324v2](http://arxiv.org/abs/2407.17324v2)|null|
-|**2024-07-24**|**Using Large Language Models to Compare Explainable Models for Smart Home Human Activity Recognition**|Michele Fiori et.al.|[2408.06352v1](http://arxiv.org/abs/2408.06352v1)|null|
-|**2024-07-21**|**Explainable AI-based Intrusion Detection System for Industry 5.0: An Overview of the Literature, associated Challenges, the existing Solutions, and Potential Research Directions**|Naseem Khan et.al.|[2408.03335v1](http://arxiv.org/abs/2408.03335v1)|null|
-|**2024-07-18**|**A Comparative Study on Automatic Coding of Medical Letters with Explainability**|Jamie Glen et.al.|[2407.13638v1](http://arxiv.org/abs/2407.13638v1)|[link](https://github.com/Glenj01/Medical-Coding)|
-|**2024-07-09**|**Explainable AI for Enhancing Efficiency of DL-based Channel Estimation**|Abdul Karim Gizzini et.al.|[2407.07009v1](http://arxiv.org/abs/2407.07009v1)|null|
-|**2024-07-07**|**Explainable AI: Comparative Analysis of Normal and Dilated ResNet Models for Fundus Disease Classification**|P. N. Karthikayan et.al.|[2407.05440v2](http://arxiv.org/abs/2407.05440v2)|null|
-|**2024-07-03**|**A Survey on Trustworthiness in Foundation Models for Medical Image Analysis**|Congzhen Shi et.al.|[2407.15851v2](http://arxiv.org/abs/2407.15851v2)|null|
-|**2024-07-01**|**The Impact of an XAI-Augmented Approach on Binary Classification with Scarce Data**|Ximing Wen et.al.|[2407.06206v1](http://arxiv.org/abs/2407.06206v1)|null|
-|**2024-06-28**|**Can GPT-4 Help Detect Quit Vaping Intentions? An Exploration of Automatic Data Annotation Approach**|Sai Krishna Revanth Vuruma et.al.|[2407.00167v1](http://arxiv.org/abs/2407.00167v1)|null|
-|**2024-06-25**|**Towards Compositional Interpretability for XAI**|Sean Tull et.al.|[2406.17583v1](http://arxiv.org/abs/2406.17583v1)|null|
-|**2024-06-17**|**Slicing Through Bias: Explaining Performance Gaps in Medical Image Analysis using Slice Discovery Methods**|Vincent Olesen et.al.|[2406.12142v2](http://arxiv.org/abs/2406.12142v2)|[link](https://github.com/volesen/slicing-through-bias)|
-|**2024-06-11**|**Unlocking the Potential of Metaverse in Innovative and Immersive Digital Health**|Fatemeh Ebrahimzadeh et.al.|[2406.07114v2](http://arxiv.org/abs/2406.07114v2)|null|
-|**2024-06-10**|**AI-Driven Predictive Analytics Approach for Early Prognosis of Chronic Kidney Disease Using Ensemble Learning and Explainable AI**|K M Tawsik Jawad et.al.|[2406.06728v2](http://arxiv.org/abs/2406.06728v2)|null|
-|**2024-06-10**|**Explainable AI for Mental Disorder Detection via Social Media: A survey and outlook**|Yusif Ibrahimov et.al.|[2406.05984v1](http://arxiv.org/abs/2406.05984v1)|null|
-|**2024-06-09**|**Methodology and Real-World Applications of Dynamic Uncertain Causality Graph for Clinical Diagnosis with Explainability and Invariance**|Zhan Zhang et.al.|[2406.05746v1](http://arxiv.org/abs/2406.05746v1)|null|
-|**2024-06-07**|**Advancing Histopathology-Based Breast Cancer Diagnosis: Insights into Multi-Modality and Explainability**|Faseela Abdullakutty et.al.|[2406.12897v1](http://arxiv.org/abs/2406.12897v1)|null|
-|**2024-06-04**|**Using Explainable AI for EEG-based Reduced Montage Neonatal Seizure Detection**|Dinuka Sandun Udayantha et.al.|[2406.16908v3](http://arxiv.org/abs/2406.16908v3)|[link](https://github.com/dinuka-1999/braineocare)|
-|**2024-06-01**|**Breast Cancer Diagnosis: A Comprehensive Exploration of Explainable Artificial Intelligence (XAI) Techniques**|Samita Bai et.al.|[2406.00532v1](http://arxiv.org/abs/2406.00532v1)|null|
-|**2024-06-01**|**Unveiling Hidden Factors: Explainable AI for Feature Boosting in Speech Emotion Recognition**|Alaa Nfissi et.al.|[2406.01624v2](http://arxiv.org/abs/2406.01624v2)|[link](https://github.com/alaanfissi/unveiling-hidden-factors-explainable-ai-for-feature-boosting-in-speech-emotion-recognition)|
-|**2024-05-31**|**The Explanation Necessity for Healthcare AI**|Michail Mamalakis et.al.|[2406.00216v1](http://arxiv.org/abs/2406.00216v1)|null|
-|**2024-05-29**|**Interdisciplinary Expertise to Advance Equitable Explainable AI**|Chloe R. Bennett et.al.|[2406.18563v1](http://arxiv.org/abs/2406.18563v1)|null|
-|**2024-05-27**|**"It depends": Configuring AI to Improve Clinical Usefulness Across Contexts**|Hubert D. Zając et.al.|[2407.11978v1](http://arxiv.org/abs/2407.11978v1)|null|
-|**2024-05-26**|**Improving Health Professionals' Onboarding with AI and XAI for Trustworthy Human-AI Collaborative Decision Making**|Min Hun Lee et.al.|[2405.16424v1](http://arxiv.org/abs/2405.16424v1)|null|
-|**2024-05-26**|**Exploring Nutritional Impact on Alzheimer's Mortality: An Explainable AI Approach**|Ziming Liu et.al.|[2405.17502v1](http://arxiv.org/abs/2405.17502v1)|null|
-|**2024-05-24**|**Explainable AI Enhances Glaucoma Referrals, Yet the Human-AI Team Still Falls Short of the AI Alone**|Catalina Gomez et.al.|[2407.11974v1](http://arxiv.org/abs/2407.11974v1)|null|
-|**2024-05-23**|**Decoding Decision Reasoning: A Counterfactual-Powered Model for Knowledge Discovery**|Yingying Fang et.al.|[2406.18552v1](http://arxiv.org/abs/2406.18552v1)|null|
-|**2024-05-21**|**The Role of Emotions in Informational Support Question-Response Pairs in Online Health Communities: A Multimodal Deep Learning Approach**|Mohsen Jozani et.al.|[2405.13099v1](http://arxiv.org/abs/2405.13099v1)|null|
-|**2024-05-17**|**ChatGPT in Classrooms: Transforming Challenges into Opportunities in Education**|Harris Bin Munawar et.al.|[2405.10645v1](http://arxiv.org/abs/2405.10645v1)|null|
-|**2024-05-13**|**Evaluating the Explainable AI Method Grad-CAM for Breath Classification on Newborn Time Series Data**|Camelia Oprea et.al.|[2405.07590v1](http://arxiv.org/abs/2405.07590v1)|null|
-|**2024-05-10**|**XAI4LLM. Let Machine Learning Models and LLMs Collaborate for Enhanced In-Context Learning in Healthcare**|Fatemeh Nazary et.al.|[2405.06270v3](http://arxiv.org/abs/2405.06270v3)|null|
-|**2024-05-09**|**To Trust or Not to Trust: Towards a novel approach to measure trust for XAI systems**|Miquel Miró-Nicolau et.al.|[2405.05766v1](http://arxiv.org/abs/2405.05766v1)|null|
-|**2024-05-05**|**Region-specific Risk Quantification for Interpretable Prognosis of COVID-19**|Zhusi Zhong et.al.|[2405.02815v1](http://arxiv.org/abs/2405.02815v1)|[link](https://github.com/zzs95/RSP_COVID)|
-|**2024-04-26**|**Rad4XCNN: a new agnostic method for post-hoc global explanation of CNN-derived features by means of radiomics**|Francesco Prinzi et.al.|[2405.02334v2](http://arxiv.org/abs/2405.02334v2)|null|
-|**2024-04-25**|**Attributing Responsibility in AI-Induced Incidents: A Computational Reflective Equilibrium Framework for Accountability**|Yunfei Ge et.al.|[2404.16957v1](http://arxiv.org/abs/2404.16957v1)|null|
-|**2024-04-19**|**Explainable AI for Fair Sepsis Mortality Predictive Model**|Chia-Hsuan Chang et.al.|[2404.13139v1](http://arxiv.org/abs/2404.13139v1)|null|
-|**2024-04-19**|**Multi Class Depression Detection Through Tweets using Artificial Intelligence**|Muhammad Osama Nusrat et.al.|[2404.13104v1](http://arxiv.org/abs/2404.13104v1)|[link](https://github.com/mnusrat786/masters-thesis)|
-|**2024-04-19**|**COIN: Counterfactual inpainting for weakly supervised semantic segmentation for medical images**|Dmytro Shvetsov et.al.|[2404.12832v2](http://arxiv.org/abs/2404.12832v2)|[link](https://github.com/dmytro-shvetsov/counterfactual-search)|
-|**2024-04-15**|**Hybrid Intelligence for Digital Humanities**|Victor de Boer et.al.|[2406.15374v1](http://arxiv.org/abs/2406.15374v1)|null|
-|**2024-04-14**|**Ethical Framework for Responsible Foundational Models in Medical Imaging**|Abhijit Das et.al.|[2406.11868v1](http://arxiv.org/abs/2406.11868v1)|null|
-|**2024-04-09**|**Advancements in Radiomics and Artificial Intelligence for Thyroid Cancer Diagnosis**|Milad Yousefi et.al.|[2404.07239v1](http://arxiv.org/abs/2404.07239v1)|null|
-|**2024-04-06**|**Predictive Modeling for Breast Cancer Classification in the Context of Bangladeshi Patients: A Supervised Machine Learning Approach with Explainable AI**|Taminul Islam et.al.|[2404.04686v1](http://arxiv.org/abs/2404.04686v1)|null|
-|**2024-04-05**|**Enhancing Breast Cancer Diagnosis in Mammography: Evaluation and Integration of Convolutional Neural Networks and Explainable AI**|Maryam Ahmed et.al.|[2404.03892v3](http://arxiv.org/abs/2404.03892v3)|null|
-|**2024-03-30**|**Advancing Multimodal Data Fusion in Pain Recognition: A Strategy Leveraging Statistical Correlation and Human-Centered Perspectives**|Xingrui Gu et.al.|[2404.00320v2](http://arxiv.org/abs/2404.00320v2)|null|
-|**2024-03-26**|**Addressing Social Misattributions of Large Language Models: An HCXAI-based Approach**|Andrea Ferrario et.al.|[2403.17873v1](http://arxiv.org/abs/2403.17873v1)|null|
-|**2024-03-26**|**Clinical Domain Knowledge-Derived Template Improves Post Hoc AI Explanations in Pneumothorax Classification**|Han Yuan et.al.|[2403.18871v1](http://arxiv.org/abs/2403.18871v1)|[link](https://github.com/han-yuan-med/template-explanation)|
-|**2024-03-03**|**Enhancing Neural Machine Translation of Low-Resource Languages: Corpus Development, Human Evaluation and Explainable AI Architectures**|Séamus Lankford et.al.|[2403.01580v1](http://arxiv.org/abs/2403.01580v1)|null|
-|**2024-02-28**|**Cause and Effect: Can Large Language Models Truly Understand Causality?**|Swagata Ashwani et.al.|[2402.18139v3](http://arxiv.org/abs/2402.18139v3)|null|
-|**2024-02-28**|**Artificial Intelligence and Diabetes Mellitus: An Inside Look Through the Retina**|Yasin Sadeghi Bazargani et.al.|[2402.18600v1](http://arxiv.org/abs/2402.18600v1)|null|
-|**2024-02-22**|**Multi-stakeholder Perspective on Responsible Artificial Intelligence and Acceptability in Education**|A. J. Karran et.al.|[2402.15027v2](http://arxiv.org/abs/2402.15027v2)|null|
-|**2024-02-12**|**Deciphering Heartbeat Signatures: A Vision Transformer Approach to Explainable Atrial Fibrillation Detection from ECG Signals**|Aruna Mohan et.al.|[2402.09474v2](http://arxiv.org/abs/2402.09474v2)|null|
-
-#### Abstracts
-##### **Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**
-2502.09218v1 by Flavio Bertini, Alessandro Dal Palù, Federica Zaglio, Francesco Fabiano, Andrea Formisano
-
-This paper presents a complete explainable system that interprets a set of
-data, abstracts the underlying features and describes them in a natural
-language of choice. The system relies on two crucial stages: (i) identifying
-emerging properties from data and transforming them into abstract concepts, and
-(ii) converting these concepts into natural language. Despite the impressive
-natural language generation capabilities demonstrated by Large Language Models,
-their statistical nature and the intricacy of their internal mechanism still
-force us to employ these techniques as black boxes, forgoing trustworthiness.
-Developing an explainable pipeline for data interpretation would allow
-facilitating its use in safety-critical environments like processing medical
-information and allowing non-experts and visually impaired people to access
-narrated information. To this end, we believe that the fields of knowledge
-representation and automated reasoning research could present a valid
-alternative. Expanding on prior research that tackled the first stage (i), we
-focus on the second stage, named Concept2Text. Being explainable, data
-translation is easily modeled through logic-based rules, once again emphasizing
-the role of declarative programming in achieving AI explainability. This paper
-explores a Prolog/CLP-based rewriting system to interpret concepts-articulated
-in terms of classes and relations, plus common knowledge-derived from a generic
-ontology, generating natural language text. Its main features include
-hierarchical tree rewritings, modular multilingual generation, support for
-equivalent variants across semantic, grammar, and lexical levels, and a
-transparent rule-based system. We outline the architecture and demonstrate its
-flexibility through some examples capable of generating numerous diverse and
-equivalent rewritings based on the input concept.
-
-摘要：<paragraph>這篇論文提出了一個完整的可解釋系統，它可以解釋一組資料，抽象出基礎特徵，並以選擇的自然語言描述它們。系統依賴兩個關鍵階段：(i) 從資料中識別新興屬性，並將它們轉換為抽象概念，以及 (ii) 將這些概念轉換為自然語言。儘管大型語言模型展示了令人印象深刻的自然語言生成能力，但它們的統計性質和內部機制的複雜性仍然迫使我們將這些技術用作黑盒子，放棄可信度。開發一個可解釋的資料解釋管道將有助於促進在安全關鍵環境中使用它，例如處理醫療資訊，並允許非專家和視障人士存取敘述資訊。為此，我們相信知識表示和自動推理研究領域可以提出一個有效的替代方案。在擴展解決第一階段 (i) 的先前研究的基礎上，我們專注於第二階段，稱為 Concept2Text。由於具有可解釋性，資料翻譯很容易透過基於邏輯的規則建模，再次強調宣告式程式設計在實現 AI 可解釋性中的作用。本文探討了一個基於 Prolog/CLP 的重寫系統，以解釋概念，這些概念以類別和關係的形式表達，再加上從通用本体衍生的常識，產生自然語言文字。它的主要特點包括階層樹重寫、模組化多語言生成、支援語義、語法和詞彙層面的等效變體，以及一個透明的基於規則的系統。我們概述了架構，並透過一些範例展示了它的靈活性，這些範例能夠根據輸入概念生成許多不同的等效重寫。</paragraph>
-
-##### **Foundation Model of Electronic Medical Records for Adaptive Risk Estimation**
-2502.06124v1 by Pawel Renc, Michal K. Grzeszczyk, Nassim Oufattole, Deirdre Goode, Yugang Jia, Szymon Bieganski, Matthew B. A. McDermott, Jaroslaw Was, Anthony E. Samir, Jonathan W. Cunningham, David W. Bates, Arkadiusz Sitek
-
-We developed the Enhanced Transformer for Health Outcome Simulation (ETHOS),
-an AI model that tokenizes patient health timelines (PHTs) from EHRs. ETHOS
-predicts future PHTs using transformer-based architectures. The Adaptive Risk
-Estimation System (ARES) employs ETHOS to compute dynamic and personalized risk
-probabilities for clinician-defined critical events. ARES incorporates a
-personalized explainability module that identifies key clinical factors
-influencing risk estimates for individual patients. ARES was evaluated on the
-MIMIC-IV v2.2 dataset in emergency department (ED) settings, benchmarking its
-performance against traditional early warning systems and machine learning
-models. We processed 299,721 unique patients from MIMIC-IV into 285,622 PHTs,
-with 60% including hospital admissions. The dataset contained over 357 million
-tokens. ETHOS outperformed benchmark models in predicting hospital admissions,
-ICU admissions, and prolonged hospital stays, achieving superior AUC scores.
-ETHOS-based risk estimates demonstrated robustness across demographic subgroups
-with strong model reliability, confirmed via calibration curves. The
-personalized explainability module provides insights into patient-specific
-factors contributing to risk. ARES, powered by ETHOS, advances predictive
-healthcare AI by providing dynamic, real-time, and personalized risk estimation
-with patient-specific explainability to enhance clinician trust. Its
-adaptability and superior accuracy position it as a transformative tool for
-clinical decision-making, potentially improving patient outcomes and resource
-allocation in emergency and inpatient settings. We release the full code at
-github.com/ipolharvard/ethos-ares to facilitate future research.
-
-摘要：我們開發了增強型健康結果模擬轉換器 (ETHOS)，
-一種從電子健康紀錄 (EHR) 中將患者健康時間軸 (PHT) 標記化的 AI 模型。ETHOS
-使用基於轉換器的架構預測未來的 PHT。自適應風險評估系統 (ARES) 使用 ETHOS 計算由臨床醫生定義的危急事件的動態且個人化的風險機率。ARES 結合了個人化的可解釋性模組，可找出影響個別患者風險評估的主要臨床因素。ARES 在急診部門 (ED) 設定中針對 MIMIC-IV v2.2 資料集進行評估，並將其效能與傳統的預警系統和機器學習模型進行基準測試。我們將 299,721 位 MIMIC-IV 的獨特患者處理成 285,622 個 PHT，其中 60% 包含住院記錄。該資料集包含超過 3.57 億個標記。ETHOS 在預測住院、加護病房 (ICU) 住院和延長住院時間方面表現優於基準模型，並獲得了較高的 AUC 分數。基於 ETHOS 的風險評估顯示出跨人口統計子群的穩健性，並通過校準曲線確認了強大的模型可靠性。個人化的可解釋性模組提供了對導致風險的患者特定因素的見解。由 ETHOS 驅動的 ARES 透過提供動態、即時且個人化的風險評估，以及患者特定的可解釋性來增強臨床醫生的信任，從而推動了預測性醫療保健 AI 的發展。其適應性和卓越的準確性使其成為臨床決策制定的一種變革性工具，有可能改善緊急和住院環境中的患者結果和資源分配。我們在 github.com/ipolharvard/ethos-ares 上釋出完整程式碼，以利未來的研究。
-
-##### **An Explainable Disease Surveillance System for Early Prediction of Multiple Chronic Diseases**
-2501.15969v1 by Shaheer Ahmad Khan, Muhammad Usamah Shahid, Ahmad Abdullah, Ibrahim Hashmat, Muddassar Farooq
-
-This study addresses a critical gap in the healthcare system by developing a
-clinically meaningful, practical, and explainable disease surveillance system
-for multiple chronic diseases, utilizing routine EHR data from multiple U.S.
-practices integrated with CureMD's EMR/EHR system. Unlike traditional
-systems--using AI models that rely on features from patients' labs--our
-approach focuses on routinely available data, such as medical history, vitals,
-diagnoses, and medications, to preemptively assess the risks of chronic
-diseases in the next year. We trained three distinct models for each chronic
-disease: prediction models that forecast the risk of a disease 3, 6, and 12
-months before a potential diagnosis. We developed Random Forest models, which
-were internally validated using F1 scores and AUROC as performance metrics and
-further evaluated by a panel of expert physicians for clinical relevance based
-on inferences grounded in medical knowledge. Additionally, we discuss our
-implementation of integrating these models into a practical EMR system. Beyond
-using Shapley attributes and surrogate models for explainability, we also
-introduce a new rule-engineering framework to enhance the intrinsic
-explainability of Random Forests.
-
-摘要：本研究透過開發一個臨床有意義、實用且可解釋的多重慢性疾病疾病監測系統，來解決醫療保健系統中的重大缺口，利用整合 CureMD 的 EMR/EHR 系統，來自多個美國實務的例行 EHR 資料。與傳統系統不同的是，我們的做法著重在例行可得的資料，例如病歷、生命徵象、診斷和藥物，以預先評估未來一年慢性疾病的風險，而非仰賴病患實驗室特徵的 AI 模型。我們針對每種慢性疾病訓練了三個不同的模型：預測模型，用以預測在潛在診斷前 3、6 和 12 個月的疾病風險。我們開發了隨機森林模型，並使用 F1 分數和 AUROC 作為效能指標，進行內部驗證，並進一步由專家醫師小組根據植基於醫學知識的推論，評估其臨床相關性。此外，我們討論了將這些模型整合到實用 EMR 系統中的實作方式。除了使用 Shapley 屬性和代理模型來解釋外，我們還引進了一個新的規則工程架構，以增強隨機森林的內在可解釋性。
-
-##### **Ensuring Medical AI Safety: Explainable AI-Driven Detection and Mitigation of Spurious Model Behavior and Associated Data**
-2501.13818v1 by Frederik Pahde, Thomas Wiegand, Sebastian Lapuschkin, Wojciech Samek
-
-Deep neural networks are increasingly employed in high-stakes medical
-applications, despite their tendency for shortcut learning in the presence of
-spurious correlations, which can have potentially fatal consequences in
-practice. Detecting and mitigating shortcut behavior is a challenging task that
-often requires significant labeling efforts from domain experts. To alleviate
-this problem, we introduce a semi-automated framework for the identification of
-spurious behavior from both data and model perspective by leveraging insights
-from eXplainable Artificial Intelligence (XAI). This allows the retrieval of
-spurious data points and the detection of model circuits that encode the
-associated prediction rules. Moreover, we demonstrate how these shortcut
-encodings can be used for XAI-based sample- and pixel-level data annotation,
-providing valuable information for bias mitigation methods to unlearn the
-undesired shortcut behavior. We show the applicability of our framework using
-four medical datasets across two modalities, featuring controlled and
-real-world spurious correlations caused by data artifacts. We successfully
-identify and mitigate these biases in VGG16, ResNet50, and contemporary Vision
-Transformer models, ultimately increasing their robustness and applicability
-for real-world medical tasks.
-
-摘要：深度神经网络越来越多地用于高风险医疗应用中，尽管它们在存在虚假相关性的情况下倾向于捷径学习，这在实践中可能产生致命的后果。检测和缓解捷径行为是一项艰巨的任务，通常需要领域专家的大量标记工作。为了缓解这个问题，我们引入了一个半自动框架，用于从数据和模型的角度识别虚假行为，方法是利用可解释人工智能 (XAI) 的见解。这允许检索虚假数据点并检测对关联预测规则进行编码的模型电路。此外，我们演示了如何使用这些捷径编码进行基于 XAI 的样本和像素级数据注释，为偏差缓解方法提供有价值的信息，以消除不需要的捷径行为。我们使用跨越两种方式的四个医学数据集展示了我们框架的适用性，这些数据集具有由数据伪像引起的受控和真实世界虚假相关性。我们成功地识别并减轻了 VGG16、ResNet50 和当代 Vision Transformer 模型中的这些偏差，最终提高了它们的鲁棒性和在真实世界医疗任务中的适用性。
-
-##### **Enhanced Suicidal Ideation Detection from Social Media Using a CNN-BiLSTM Hybrid Model**
-2501.11094v1 by Mohaiminul Islam Bhuiyan, Nur Shazwani Kamarudin, Nur Hafieza Ismail
-
-Suicidal ideation detection is crucial for preventing suicides, a leading
-cause of death worldwide. Many individuals express suicidal thoughts on social
-media, offering a vital opportunity for early detection through advanced
-machine learning techniques. The identification of suicidal ideation in social
-media text is improved by utilising a hybrid framework that integrates
-Convolutional Neural Networks (CNN) and Bidirectional Long Short-Term Memory
-(BiLSTM), enhanced with an attention mechanism. To enhance the interpretability
-of the model's predictions, Explainable AI (XAI) methods are applied, with a
-particular focus on SHapley Additive exPlanations (SHAP), are incorporated. At
-first, the model managed to reach an accuracy of 92.81%. By applying
-fine-tuning and early stopping techniques, the accuracy improved to 94.29%. The
-SHAP analysis revealed key features influencing the model's predictions, such
-as terms related to mental health struggles. This level of transparency boosts
-the model's credibility while helping mental health professionals understand
-and trust the predictions. This work highlights the potential for improving the
-accuracy and interpretability of detecting suicidal tendencies, making a
-valuable contribution to the progress of mental health monitoring systems. It
-emphasizes the significance of blending powerful machine learning methods with
-explainability to develop reliable and impactful mental health solutions.
-
-摘要：自殺意念偵測對於預防自殺至關重要，而自殺是全球主要的死亡原因。許多人在社群媒體上表達自殺念頭，這提供了透過進階機器學習技術進行早期偵測的重要機會。透過整合卷積神經網路 (CNN) 和雙向長短期記憶 (BiLSTM) 的混合架構，並加入注意力機制，可以提升在社群媒體文字中辨識自殺意念的能力。為了加強模型預測的可解釋性，我們採用可解釋人工智慧 (XAI) 方法，特別著重於 SHapley 加法解釋 (SHAP)。一開始，模型成功達到 92.81% 的準確度。透過套用微調和早期停止技術，準確度提升至 94.29%。SHAP 分析揭露了影響模型預測的關鍵特徵，例如與心理健康困境相關的詞彙。這種透明度提升了模型的可信度，同時協助心理健康專業人員理解和信賴預測結果。這項工作突顯了提升偵測自殺傾向的準確度和可解釋性的潛力，為心理健康監控系統的進展做出寶貴的貢獻。它強調了將強大的機器學習方法與可解釋性相結合以開發可靠且有影響力的心理健康解決方案的重要性。
-
-##### **SEANN: A Domain-Informed Neural Network for Epidemiological Insights**
-2501.10273v1 by Jean-Baptiste Guimbaud, Marc Plantevit, Léa Maître, Rémy Cazabet
-
-In epidemiology, traditional statistical methods such as logistic regression,
-linear regression, and other parametric models are commonly employed to
-investigate associations between predictors and health outcomes. However,
-non-parametric machine learning techniques, such as deep neural networks
-(DNNs), coupled with explainable AI (XAI) tools, offer new opportunities for
-this task. Despite their potential, these methods face challenges due to the
-limited availability of high-quality, high-quantity data in this field. To
-address these challenges, we introduce SEANN, a novel approach for informed
-DNNs that leverages a prevalent form of domain-specific knowledge: Pooled
-Effect Sizes (PES). PESs are commonly found in published Meta-Analysis studies,
-in different forms, and represent a quantitative form of a scientific
-consensus. By direct integration within the learning procedure using a custom
-loss, we experimentally demonstrate significant improvements in the
-generalizability of predictive performances and the scientific plausibility of
-extracted relationships compared to a domain-knowledge agnostic neural network
-in a scarce and noisy data setting.
-
-摘要：在流行病學中，傳統的統計方法，例如邏輯迴歸、線性迴歸和其他參數模型通常用於調查預測因子與健康結果之間的關聯。然而，非參數機器學習技術，例如深度神經網路 (DNN)，結合可解釋的 AI (XAI) 工具，為這項任務提供了新的機會。儘管這些方法具有潛力，但由於該領域缺乏高品質、高數量資料，因此這些方法面臨挑戰。為了應對這些挑戰，我們引入了 SEANN，這是一種新穎的方法，用於獲取知識的 DNN，它利用了一種流行的領域特定知識形式：彙總效應量 (PES)。PES 通常以不同的形式出現在已發表的 Meta 分析研究中，並代表科學共識的量化形式。通過使用自訂損失函數直接整合在學習程序中，我們以實驗方式證明了預測效能的概括性以及與從缺乏領域知識的神經網路中提取的關係相比，科學合理性的顯著提升，且是在稀少且有雜訊的資料設定中。
-
-##### **Artificial Intelligence-Driven Clinical Decision Support Systems**
-2501.09628v1 by Muhammet Alkan, Idris Zakariyya, Samuel Leighton, Kaushik Bhargav Sivangi, Christos Anagnostopoulos, Fani Deligianni
-
-As artificial intelligence (AI) becomes increasingly embedded in healthcare
-delivery, this chapter explores the critical aspects of developing reliable and
-ethical Clinical Decision Support Systems (CDSS). Beginning with the
-fundamental transition from traditional statistical models to sophisticated
-machine learning approaches, this work examines rigorous validation strategies
-and performance assessment methods, including the crucial role of model
-calibration and decision curve analysis. The chapter emphasizes that creating
-trustworthy AI systems in healthcare requires more than just technical
-accuracy; it demands careful consideration of fairness, explainability, and
-privacy. The challenge of ensuring equitable healthcare delivery through AI is
-stressed, discussing methods to identify and mitigate bias in clinical
-predictive models. The chapter then delves into explainability as a cornerstone
-of human-centered CDSS. This focus reflects the understanding that healthcare
-professionals must not only trust AI recommendations but also comprehend their
-underlying reasoning. The discussion advances in an analysis of privacy
-vulnerabilities in medical AI systems, from data leakage in deep learning
-models to sophisticated attacks against model explanations. The text explores
-privacy-preservation strategies such as differential privacy and federated
-learning, while acknowledging the inherent trade-offs between privacy
-protection and model performance. This progression, from technical validation
-to ethical considerations, reflects the multifaceted challenges of developing
-AI systems that can be seamlessly and reliably integrated into daily clinical
-practice while maintaining the highest standards of patient care and data
-protection.
-
-摘要：隨著人工智慧 (AI) 在醫療保健中的應用日益普及，本章探討了開發可靠且符合道德標準的臨床決策支援系統 (CDSS) 的關鍵面向。從傳統統計模型到複雜機器學習方法的基本轉變開始，這項工作審查了嚴謹的驗證策略和效能評估方法，包括模型校準和決策曲線分析的關鍵角色。本章強調，在醫療保健中建立值得信賴的 AI 系統不只是技術上的準確性；它需要仔細考量公平性、可解釋性和隱私權。本章強調了透過 AI 確保公平的醫療保健服務的挑戰，並討論了識別和減輕臨床預測模型中偏差的方法。接著，本章深入探討可解釋性，作為以人為中心的 CDSS 的基石。這種關注反映了醫療保健專業人員不僅必須信任 AI 建議，還必須理解其背後的推理。討論進一步分析了醫療 AI 系統中的隱私漏洞，從深度學習模型中的資料外洩到針對模型解釋的複雜攻擊。本文探討了隱私保護策略，例如差分隱私和聯合學習，同時承認隱私保護和模型效能之間的固有取捨。這種從技術驗證到道德考量的進展，反映了開發 AI 系統的多面向挑戰，這些系統可以無縫且可靠地整合到日常臨床實務中，同時維持最高的病患照護和資料保護標準。
-
-##### **MedGrad E-CLIP: Enhancing Trust and Transparency in AI-Driven Skin Lesion Diagnosis**
-2501.06887v1 by Sadia Kamal, Tim Oates
-
-As deep learning models gain attraction in medical data, ensuring transparent
-and trustworthy decision-making is essential. In skin cancer diagnosis, while
-advancements in lesion detection and classification have improved accuracy, the
-black-box nature of these methods poses challenges in understanding their
-decision processes, leading to trust issues among physicians. This study
-leverages the CLIP (Contrastive Language-Image Pretraining) model, trained on
-different skin lesion datasets, to capture meaningful relationships between
-visual features and diagnostic criteria terms. To further enhance transparency,
-we propose a method called MedGrad E-CLIP, which builds on gradient-based
-E-CLIP by incorporating a weighted entropy mechanism designed for complex
-medical imaging like skin lesions. This approach highlights critical image
-regions linked to specific diagnostic descriptions. The developed integrated
-pipeline not only classifies skin lesions by matching corresponding
-descriptions but also adds an essential layer of explainability developed
-especially for medical data. By visually explaining how different features in
-an image relates to diagnostic criteria, this approach demonstrates the
-potential of advanced vision-language models in medical image analysis,
-ultimately improving transparency, robustness, and trust in AI-driven
-diagnostic systems.
-
-摘要：随着深度学习模型在医学数据中获得关注，确保透明且值得信赖的决策至关重要。在皮肤癌诊断中，虽然病灶检测和分类的进步提高了准确性，但这些方法的黑盒性质对理解其决策过程构成了挑战，导致医生之间的信任问题。本研究利用在不同皮肤病变数据集上训练的 CLIP（对比语言图像预训练）模型，以捕捉视觉特征和诊断标准术语之间的有意义关系。为了进一步提高透明度，我们提出了一种名为 MedGrad E-CLIP 的方法，该方法通过结合专为皮肤病变等复杂医学影像设计的加权熵机制，建立在基于梯度的 E-CLIP 之上。此方法突出了与特定诊断描述相关联的关键图像区域。开发的集成管道不仅通过匹配相应的描述对皮肤病变进行分类，还添加了一层专门为医学数据开发的基本可解释性。通过直观地解释图像中不同特征与诊断标准的关系，这种方法展示了高级视觉语言模型在医学图像分析中的潜力，最终提高了透明度、稳健性和对人工智能驱动的诊断系统的信任。
-
-##### **Explaining Humour Style Classifications: An XAI Approach to Understanding Computational Humour Analysis**
-2501.02891v1 by Mary Ogbuka Kenneth, Foaad Khosmood, Abbas Edalat
-
-Humour styles can have either a negative or a positive impact on well-being.
-Given the importance of these styles to mental health, significant research has
-been conducted on their automatic identification. However, the automated
-machine learning models used for this purpose are black boxes, making their
-prediction decisions opaque. Clarity and transparency are vital in the field of
-mental health. This paper presents an explainable AI (XAI) framework for
-understanding humour style classification, building upon previous work in
-computational humour analysis. Using the best-performing single model
-(ALI+XGBoost) from prior research, we apply comprehensive XAI techniques to
-analyse how linguistic, emotional, and semantic features contribute to humour
-style classification decisions. Our analysis reveals distinct patterns in how
-different humour styles are characterised and misclassified, with particular
-emphasis on the challenges in distinguishing affiliative humour from other
-styles. Through detailed examination of feature importance, error patterns, and
-misclassification cases, we identify key factors influencing model decisions,
-including emotional ambiguity, context misinterpretation, and target
-identification. The framework demonstrates significant utility in understanding
-model behaviour, achieving interpretable insights into the complex interplay of
-features that define different humour styles. Our findings contribute to both
-the theoretical understanding of computational humour analysis and practical
-applications in mental health, content moderation, and digital humanities
-research.
-
-摘要：幽默風格對幸福感可能產生負面或正面的影響。
-鑑於這些風格對心理健康的重要性，已經對其自動識別進行了大量研究。然而，用於此目的的自動機器學習模型是黑盒子，使得其預測決策不透明。清晰度和透明度在心理健康領域至關重要。本文提出了一個可解釋的 AI (XAI) 框架，用於理解幽默風格分類，建立在計算幽默分析的先前工作之上。使用先前研究中表現最好的單一模型 (ALI+XGBoost)，我們應用全面的 XAI 技術來分析語言、情緒和語義特徵如何影響幽默風格分類決策。我們的分析揭示了不同幽默風格如何被表徵和錯誤分類的不同模式，特別強調了區分聯屬幽默與其他風格的挑戰。通過仔細檢查特徵重要性、錯誤模式和錯誤分類案例，我們確定了影響模型決策的關鍵因素，包括情緒模糊、情境誤解和目標識別。該框架展示了在理解模型行為方面的顯著效用，實現了對定義不同幽默風格的特徵之間複雜相互作用的可解釋見解。我們的發現有助於計算幽默分析的理論理解和心理健康、內容審核和數字人文研究中的實際應用。
-
-##### **The Emotional Spectrum of LLMs: Leveraging Empathy and Emotion-Based Markers for Mental Health Support**
-2412.20068v1 by Alessandro De Grandi, Federico Ravenda, Andrea Raballo, Fabio Crestani
-
-The increasing demand for mental health services has highlighted the need for
-innovative solutions, particularly in the realm of psychological conversational
-AI, where the availability of sensitive data is scarce. In this work, we
-explored the development of a system tailored for mental health support with a
-novel approach to psychological assessment based on explainable emotional
-profiles in combination with empathetic conversational models, offering a
-promising tool for augmenting traditional care, particularly where immediate
-expertise is unavailable. Our work can be divided into two main parts,
-intrinsecaly connected to each other. First, we present RACLETTE, a
-conversational system that demonstrates superior emotional accuracy compared to
-state-of-the-art benchmarks in both understanding users' emotional states and
-generating empathetic responses during conversations, while progressively
-building an emotional profile of the user through their interactions. Second,
-we show how the emotional profiles of a user can be used as interpretable
-markers for mental health assessment. These profiles can be compared with
-characteristic emotional patterns associated with different mental disorders,
-providing a novel approach to preliminary screening and support.
-
-摘要：隨著對心理健康服務需求的增加，凸顯了創新解決方案的需求，特別是在心理對話式人工智慧領域，那裡缺乏敏感資料。在這項工作中，我們探索了開發一個針對心理健康支持的系統，採用一種基於可解釋的情緒特徵的新方法進行心理評估，結合同理心對話模式，提供了一個有前途的工具，用於擴充傳統照護，特別是在無法立即獲得專業知識的情況下。我們的工作可以分為兩個主要部分，彼此內在相關。首先，我們展示了 RACLETTE，一個對話系統，與最先進的基準相比，在理解使用者情緒狀態和在對話中產生同理心回應方面表現出優越的情緒準確性，同時透過他們的互動逐漸建立使用者的情緒特徵。其次，我們展示了使用者的情緒特徵如何可用作心理健康評估的可解釋標記。這些特徵可以與與不同心理疾病相關的典型情緒模式進行比較，提供了一種初步篩選和支持的新方法。
-
-##### **A Review on the Integration of Artificial Intelligence and Medical Imaging in IVF Ovarian Stimulation**
-2412.19688v1 by Jana Zakall, Birgit Pohn, Antonia Graf, Daniel Kovatchki, Arezoo Borji, Ragib Shahriar Islam, Hossam Haick, Heinz Strohmer, Sepideh Hatamikia
-
-Artificial intelligence (AI) has emerged as a powerful tool to enhance
-decision-making and optimize treatment protocols in in vitro fertilization
-(IVF). In particular, AI shows significant promise in supporting
-decision-making during the ovarian stimulation phase of the IVF process. This
-review evaluates studies focused on the applications of AI combined with
-medical imaging in ovarian stimulation, examining methodologies, outcomes, and
-current limitations. Our analysis of 13 studies on this topic reveals that,
-reveal that while AI algorithms demonstrated notable potential in predicting
-optimal hormonal dosages, trigger timing, and oocyte retrieval outcomes, the
-medical imaging data utilized predominantly came from two-dimensional (2D)
-ultrasound which mainly involved basic quantifications, such as follicle size
-and number, with limited use of direct feature extraction or advanced image
-analysis techniques. This points to an underexplored opportunity where advanced
-image analysis approaches, such as deep learning, and more diverse imaging
-modalities, like three-dimensional (3D) ultrasound, could unlock deeper
-insights. Additionally, the lack of explainable AI (XAI) in most studies raises
-concerns about the transparency and traceability of AI-driven decisions - key
-factors for clinical adoption and trust. Furthermore, many studies relied on
-single-center designs and small datasets, which limit the generalizability of
-their findings. This review highlights the need for integrating advanced
-imaging analysis techniques with explainable AI methodologies, as well as the
-importance of leveraging multicenter collaborations and larger datasets.
-Addressing these gaps has the potential to enhance ovarian stimulation
-management, paving the way for efficient, personalized, and data-driven
-treatment pathways that improve IVF outcomes.
-
-摘要：人工智慧（AI）已成為增強體外受精（IVF）決策制定和優化治療方案的強大工具。特別是，AI 在支持 IVF 過程中卵巢刺激階段的決策制定方面顯示出顯著的前景。本綜述評估了專注於 AI 結合卵巢刺激中的醫學影像應用、檢驗方法、結果和當前限制的研究。我們對 13 項關於此主題的研究分析顯示，雖然 AI 演算法在預測最佳荷爾蒙劑量、觸發時機和卵子取出結果方面表現出顯著的潛力，但所利用的醫學影像數據主要來自於二次元（2D）超音波，而二次元超音波主要涉及基本量化，例如濾泡大小和數量，且有限使用直接特徵提取或進階影像分析技術。這指向一個尚未探索的機會，例如深度學習等進階影像分析方法，以及更多元的影像模式，例如三維（3D）超音波，可以解鎖更深入的見解。此外，大多數研究缺乏可解釋 AI（XAI），這引起了人們對 AI 驅動決策的透明度和可追溯性的擔憂，而透明度和可追溯性是臨床採用和信任的關鍵因素。此外，許多研究依賴於單中心設計和小型數據集，這限制了其發現的普遍性。本綜述強調了將進階影像分析技術與可解釋 AI 方法整合起來的必要性，以及利用多中心合作和大型數據集的重要性。解決這些差距有可能增強卵巢刺激管理，為有效、個人化和數據驅動的治療途徑鋪平道路，進而改善 IVF 結果。
-
-##### **Enhancing Cancer Diagnosis with Explainable & Trustworthy Deep Learning Models**
-2412.17527v1 by Badaru I. Olumuyiwa, The Anh Han, Zia U. Shamszaman
-
-This research presents an innovative approach to cancer diagnosis and
-prediction using explainable Artificial Intelligence (XAI) and deep learning
-techniques. With cancer causing nearly 10 million deaths globally in 2020,
-early and accurate diagnosis is crucial. Traditional methods often face
-challenges in cost, accuracy, and efficiency. Our study develops an AI model
-that provides precise outcomes and clear insights into its decision-making
-process, addressing the "black box" problem of deep learning models. By
-employing XAI techniques, we enhance interpretability and transparency,
-building trust among healthcare professionals and patients. Our approach
-leverages neural networks to analyse extensive datasets, identifying patterns
-for cancer detection. This model has the potential to revolutionise diagnosis
-by improving accuracy, accessibility, and clarity in medical decision-making,
-possibly leading to earlier detection and more personalised treatment
-strategies. Furthermore, it could democratise access to high-quality
-diagnostics, particularly in resource-limited settings, contributing to global
-health equity. The model's applications extend beyond cancer diagnosis,
-potentially transforming various aspects of medical decision-making and saving
-millions of lives worldwide.
-
-摘要：本研究提出了一個創新的癌症診斷和預測方法，使用可解釋的人工智慧 (XAI) 和深度學習技術。由於癌症在 2020 年造成全球近 1,000 萬人死亡，因此早期準確的診斷至關重要。傳統方法通常面臨成本、準確性和效率方面的挑戰。我們的研究開發了一個 AI 模型，它提供精確的結果並清楚地了解其決策過程，解決了深度學習模型的「黑箱」問題。通過採用 XAI 技術，我們增強了解釋性和透明度，在醫療專業人員和患者之間建立信任。我們的做法利用神經網路分析廣泛的數據集，識別癌症檢測模式。這個模型有可能通過提高醫療決策的準確性、可及性和清晰度來革新診斷，可能導致更早的檢測和更個性化的治療策略。此外，它可以使更多人獲得高品質的診斷，特別是在資源有限的環境中，有助於全球健康公平。該模型的應用範圍不僅限於癌症診斷，還可能轉變醫療決策的各個方面，並拯救全球數百萬人的生命。
-
-##### **Towards Interpretable Radiology Report Generation via Concept Bottlenecks using a Multi-Agentic RAG**
-2412.16086v2 by Hasan Md Tusfiqur Alam, Devansh Srivastav, Md Abdul Kadir, Daniel Sonntag
-
-Deep learning has advanced medical image classification, but interpretability
-challenges hinder its clinical adoption. This study enhances interpretability
-in Chest X-ray (CXR) classification by using concept bottleneck models (CBMs)
-and a multi-agent Retrieval-Augmented Generation (RAG) system for report
-generation. By modeling relationships between visual features and clinical
-concepts, we create interpretable concept vectors that guide a multi-agent RAG
-system to generate radiology reports, enhancing clinical relevance,
-explainability, and transparency. Evaluation of the generated reports using an
-LLM-as-a-judge confirmed the interpretability and clinical utility of our
-model's outputs. On the COVID-QU dataset, our model achieved 81% classification
-accuracy and demonstrated robust report generation performance, with five key
-metrics ranging between 84% and 90%. This interpretable multi-agent framework
-bridges the gap between high-performance AI and the explainability required for
-reliable AI-driven CXR analysis in clinical settings. Our code is available at
-https://github.com/tifat58/IRR-with-CBM-RAG.git.
-
-摘要：深度學習已提升醫學影像分類，但可解釋性挑戰阻礙其臨床應用。本研究透過使用概念瓶頸模型 (CBM) 和多代理檢索增強生成 (RAG) 系統進行報告生成，來增強胸部 X 光 (CXR) 分類的可解釋性。透過建模視覺特徵與臨床概念之間的關係，我們建立可解釋的概念向量，引導多代理 RAG 系統生成放射報告，增強臨床相關性、可解釋性和透明度。使用 LLM 作為評審員對生成報告進行評估，確認了我們模型輸出的可解釋性和臨床效用。在 COVID-QU 資料集上，我們的模型達到了 81% 的分類準確率，並展示了穩健的報告生成效能，五項關鍵指標介於 84% 至 90% 之間。這個可解釋的多代理架構彌合了高性能 AI 與臨床環境中可靠的 AI 驅動 CXR 分析所需的解釋性之間的差距。我們的程式碼可於 https://github.com/tifat58/IRR-with-CBM-RAG.git 取得。
-
-##### **Critique of Impure Reason: Unveiling the reasoning behaviour of medical Large Language Models**
-2412.15748v1 by Shamus Sim, Tyrone Chen
-
-Background: Despite the current ubiquity of Large Language Models (LLMs)
-across the medical domain, there is a surprising lack of studies which address
-their reasoning behaviour. We emphasise the importance of understanding
-reasoning behaviour as opposed to high-level prediction accuracies, since it is
-equivalent to explainable AI (XAI) in this context. In particular, achieving
-XAI in medical LLMs used in the clinical domain will have a significant impact
-across the healthcare sector. Results: Therefore, we define the concept of
-reasoning behaviour in the specific context of medical LLMs. We then categorise
-and discuss the current state of the art of methods which evaluate reasoning
-behaviour in medical LLMs. Finally, we propose theoretical frameworks which can
-empower medical professionals or machine learning engineers to gain insight
-into the low-level reasoning operations of these previously obscure models.
-Conclusion: The subsequent increased transparency and trust in medical machine
-learning models by clinicians as well as patients will accelerate the
-integration, application as well as further development of medical AI for the
-healthcare system as a whole
-
-摘要：背景：儘管大型語言模型 (LLM) 目前在醫療領域無所不在，但令人驚訝的是，探討其推理行為的研究卻相當缺乏。我們強調了解推理行為而非高層級的預測準確度非常重要，因為在這種情況下，這等同於可解釋 AI (XAI)。尤其是在臨床領域中使用的醫療 LLM 中實現 XAI，將對整個醫療保健產業產生重大影響。結果：因此，我們在醫療 LLM 的特定背景下定義了推理行為的概念。接著我們分類並探討當前評估醫療 LLM 中推理行為的方法的最新技術。最後，我們提出理論架構，讓醫療專業人員或機器學習工程師得以深入了解這些先前模糊模型的低層級推理運算。結論：臨床醫生和患者對醫療機器學習模型的透明度和信任度隨之提升，將加速醫療 AI 在整個醫療保健系統中的整合、應用和進一步發展。
-
-##### **Cognition Chain for Explainable Psychological Stress Detection on Social Media**
-2412.14009v1 by Xin Wang, Boyan Gao, Yi Dai, Lei Cao, Liang Zhao, Yibo Yang, David Clifton
-
-Stress is a pervasive global health issue that can lead to severe mental
-health problems. Early detection offers timely intervention and prevention of
-stress-related disorders. The current early detection models perform "black
-box" inference suffering from limited explainability and trust which blocks the
-real-world clinical application. Thanks to the generative properties introduced
-by the Large Language Models (LLMs), the decision and the prediction from such
-models are semi-interpretable through the corresponding description. However,
-the existing LLMs are mostly trained for general purposes without the guidance
-of psychological cognitive theory. To this end, we first highlight the
-importance of prior theory with the observation of performance boosted by the
-chain-of-thoughts tailored for stress detection. This method termed Cognition
-Chain explicates the generation of stress through a step-by-step cognitive
-perspective based on cognitive appraisal theory with a progress pipeline:
-Stimulus $\rightarrow$ Evaluation $\rightarrow$ Reaction $\rightarrow$ Stress
-State, guiding LLMs to provide comprehensive reasoning explanations. We further
-study the benefits brought by the proposed Cognition Chain format by utilising
-it as a synthetic dataset generation template for LLMs instruction-tuning and
-introduce CogInstruct, an instruction-tuning dataset for stress detection. This
-dataset is developed using a three-stage self-reflective annotation pipeline
-that enables LLMs to autonomously generate and refine instructional data. By
-instruction-tuning Llama3 with CogInstruct, we develop CogLLM, an explainable
-stress detection model. Evaluations demonstrate that CogLLM achieves
-outstanding performance while enhancing explainability. Our work contributes a
-novel approach by integrating cognitive theories into LLM reasoning processes,
-offering a promising direction for future explainable AI research.
-
-摘要：壓力是一個普遍的全球性健康問題，可能會導致嚴重的精神
-健康問題。早期發現提供及時的干預和預防
-壓力相關疾病。目前的早期發現模型執行「黑
-盒子」推論，存在可解釋性和信任度有限的問題，阻礙了
-現實世界的臨床應用。多虧了大型語言模型 (LLM) 引入的生成屬性，此類
-模型的決策和預測通過對應描述具有半可解釋性。然而，
-現有的 LLM 主要針對一般用途進行訓練，沒有心理認知理論的指導。為此，我們首先強調
-先驗理論的重要性，並觀察到針對壓力檢測量身定制的思想鏈提升了性能。這種方法稱為認知
-鏈通過基於認知評估理論的循序漸進的認知視角闡明了壓力的產生，並具有進度管道：
-刺激 $\rightarrow$ 評估 $\rightarrow$ 反應 $\rightarrow$ 壓力
-狀態，指導 LLM 提供全面的推理解釋。我們進一步
-通過將其用作 LLM 指令調整的合成數據集生成模板來研究所提出的認知鏈格式帶來的優點，並介紹 CogInstruct，這是一個針對壓力檢測的指令調整數據集。這個
-數據集是使用一個三階段的自省標註管道開發的，使 LLM 能夠自主生成和優化指令數據。通過
-使用 CogInstruct 對 Llama3 進行指令調整，我們開發了 CogLLM，這是一個可解釋的
-壓力檢測模型。評估表明，CogLLM 在提高可解釋性的同時實現了出色的性能。我們的研究通過將認知理論整合到 LLM 推理過程中，提出了一種新穎的方法，
-為未來的可解釋人工智能研究提供了一個有希望的方向。
-
-##### **2-Factor Retrieval for Improved Human-AI Decision Making in Radiology**
-2412.00372v1 by Jim Solomon, Laleh Jalilian, Alexander Vilesov, Meryl Mathew, Tristan Grogan, Arash Bedayat, Achuta Kadambi
-
-Human-machine teaming in medical AI requires us to understand to what degree
-a trained clinician should weigh AI predictions. While previous work has shown
-the potential of AI assistance at improving clinical predictions, existing
-clinical decision support systems either provide no explainability of their
-predictions or use techniques like saliency and Shapley values, which do not
-allow for physician-based verification. To address this gap, this study
-compares previously used explainable AI techniques with a newly proposed
-technique termed '2-factor retrieval (2FR)', which is a combination of
-interface design and search retrieval that returns similarly labeled data
-without processing this data. This results in a 2-factor security blanket
-where: (a) correct images need to be retrieved by the AI; and (b) humans should
-associate the retrieved images with the current pathology under test. We find
-that when tested on chest X-ray diagnoses, 2FR leads to increases in clinician
-accuracy, with particular improvements when clinicians are radiologists and
-have low confidence in their decision. Our results highlight the importance of
-understanding how different modes of human-AI decision making may impact
-clinician accuracy in clinical decision support systems.
-
-摘要：人機協作在醫療 AI 中，需要我們理解受過訓練的臨床醫生在多大程度上應重視 AI 預測。雖然先前的研究顯示 AI 輔助在改善臨床預測方面的潛力，但現有的臨床決策支援系統，要不就沒有提供預測的可解釋性，要不就是使用像顯著性和 Shapley 值之類的技術，這些技術不允許基於醫生的驗證。為了解決這個差距，本研究將先前使用的可解釋 AI 技術與一種新提出的稱為「2 因子檢索 (2FR)」的技術進行比較，後者是一種介面設計和搜尋檢索的組合，它會傳回標籤相似的資料，而不會處理這些資料。這會產生一個 2 因子安全機制，其中：(a) 正確的影像需要由 AI 檢索；(b) 人類應將檢索的影像與正在測試中的病理聯想起來。我們發現，當在胸部 X 光診斷上進行測試時，2FR 會提高臨床醫生的準確度，特別是在臨床醫生是放射科醫生且對其決策信心不足時，會有顯著的改善。我們的結果強調了理解人機決策的不同模式如何影響臨床醫生在臨床決策支援系統中的準確性的重要性。
-
-##### **Mapping Public Perception of Artificial Intelligence: Expectations, Risk-Benefit Tradeoffs, and Value As Determinants for Societal Acceptance**
-2411.19356v1 by Philipp Brauner, Felix Glawe, Gian Luca Liehner, Luisa Vervier, Martina Ziefle
-
-Understanding public perception of artificial intelligence (AI) and the
-tradeoffs between potential risks and benefits is crucial, as these perceptions
-might shape policy decisions, influence innovation trajectories for successful
-market strategies, and determine individual and societal acceptance of AI
-technologies. Using a representative sample of 1100 participants from Germany,
-this study examines mental models of AI. Participants quantitatively evaluated
-71 statements about AI's future capabilities (e.g., autonomous driving, medical
-care, art, politics, warfare, and societal divides), assessing the expected
-likelihood of occurrence, perceived risks, benefits, and overall value. We
-present rankings of these projections alongside visual mappings illustrating
-public risk-benefit tradeoffs. While many scenarios were deemed likely,
-participants often associated them with high risks, limited benefits, and low
-overall value. Across all scenarios, 96.4% ($r^2=96.4\%$) of the variance in
-value assessment can be explained by perceived risks ($\beta=-.504$) and
-perceived benefits ($\beta=+.710$), with no significant relation to expected
-likelihood. Demographics and personality traits influenced perceptions of
-risks, benefits, and overall evaluations, underscoring the importance of
-increasing AI literacy and tailoring public information to diverse user needs.
-These findings provide actionable insights for researchers, developers, and
-policymakers by highlighting critical public concerns and individual factors
-essential to align AI development with individual values.
-
-摘要：<paragraph>了解公眾對人工智慧 (AI) 的認知以及潛在風險與好處之間的權衡至關重要，因為這些認知可能會影響政策決策、影響成功市場策略的創新軌跡，並決定個人和社會對 AI 技術的接受度。本研究使用來自德國的 1100 名參與者的代表性樣本，探討了 AI 的心智模型。參與者對 71 項關於 AI 未來能力的陳述（例如，自動駕駛、醫療保健、藝術、政治、戰爭和社會分歧）進行了定量評估，評估預期的發生可能性、感知風險、好處和整體價值。我們展示了這些預測的排名，並附上視覺化映射，說明了公眾的風險收益權衡。儘管許多場景被認為是可能的，但參與者通常將它們與高風險、有限的好處和低整體價值聯繫起來。在所有場景中，96.4% ($r^2=96.4\%$) 的價值評估差異可以用感知風險 ($\beta=-.504$) 和感知好處 ($\beta=+.710$) 來解釋，與預期的可能性沒有顯著關係。人口統計和人格特質影響了對風險、好處和整體評估的看法，這凸顯了提高 AI 素養和根據不同的使用者需求調整公共資訊的重要性。這些發現通過強調關鍵的公共關注和與個人價值觀一致的 AI 開發必不可少的個人因素，為研究人員、開發人員和政策制定者提供了可行的見解。</paragraph>
-
-##### **Explainable AI for Classifying UTI Risk Groups Using a Real-World Linked EHR and Pathology Lab Dataset**
-2411.17645v2 by Yujie Dai, Brian Sullivan, Axel Montout, Amy Dillon, Chris Waller, Peter Acs, Rachel Denholm, Philip Williams, Alastair D Hay, Raul Santos-Rodriguez, Andrew Dowsey
-
-The use of machine learning and AI on electronic health records (EHRs) holds
-substantial potential for clinical insight. However, this approach faces
-challenges due to data heterogeneity, sparsity, temporal misalignment, and
-limited labeled outcomes. In this context, we leverage a linked EHR dataset of
-approximately one million de-identified individuals from Bristol, North
-Somerset, and South Gloucestershire, UK, to characterize urinary tract
-infections (UTIs). We implemented a data pre-processing and curation pipeline
-that transforms the raw EHR data into a structured format suitable for
-developing predictive models focused on data fairness, accountability and
-transparency. Given the limited availability and biases of ground truth UTI
-outcomes, we introduce a UTI risk estimation framework informed by clinical
-expertise to estimate UTI risk across individual patient timelines. Pairwise
-XGBoost models are trained using this framework to differentiate UTI risk
-categories with explainable AI techniques applied to identify key predictors
-and support interpretability. Our findings reveal differences in clinical and
-demographic predictors across risk groups. While this study highlights the
-potential of AI-driven insights to support UTI clinical decision-making,
-further investigation of patient sub-strata and extensive validation are needed
-to ensure robustness and applicability in clinical practice.
-
-摘要：電子健康紀錄 (EHR) 中機器學習和 AI 的使用對於臨床見解具有相當大的潛力。然而，由於資料異質性、稀疏性、時間錯位和標籤結果有限，此方法面臨挑戰。在此背景下，我們利用來自英國布里斯托、北薩默塞特和南格洛斯特郡約一百萬名去識別個人連結的 EHR 資料集，來描述尿路感染 (UTI)。我們實施了將原始 EHR 資料轉換為結構化格式的資料前處理和整理管線，適合開發專注於資料公平性、問責制和透明度的預測模型。鑑於 UTI 真實結果的可用性有限和偏差，我們引入了由臨床專業知識告知的 UTI 風險評估架構，以估計個別患者時間軸上的 UTI 風險。成對的 XGBoost 模型使用此架構進行訓練，以區分 UTI 風險類別，並應用可解釋的 AI 技術來識別關鍵預測因子並支持可解釋性。我們的研究結果揭示了不同風險群組在臨床和人口統計預測因子上的差異。雖然這項研究強調了 AI 驅動見解在支援 UTI 臨床決策制定方面的潛力，但仍需要進一步調查患者子群體和廣泛驗證，以確保在臨床實務中的穩健性和適用性。
-
-##### **Exploring the Requirements of Clinicians for Explainable AI Decision Support Systems in Intensive Care**
-2411.11774v1 by Jeffrey N. Clark, Matthew Wragg, Emily Nielsen, Miquel Perello-Nieto, Nawid Keshtmand, Michael Ambler, Shiv Sharma, Christopher P. Bourdeaux, Amberly Brigden, Raul Santos-Rodriguez
-
-There is a growing need to understand how digital systems can support
-clinical decision-making, particularly as artificial intelligence (AI) models
-become increasingly complex and less human-interpretable. This complexity
-raises concerns about trustworthiness, impacting safe and effective adoption of
-such technologies. Improved understanding of decision-making processes and
-requirements for explanations coming from decision support tools is a vital
-component in providing effective explainable solutions. This is particularly
-relevant in the data-intensive, fast-paced environments of intensive care units
-(ICUs). To explore these issues, group interviews were conducted with seven ICU
-clinicians, representing various roles and experience levels. Thematic analysis
-revealed three core themes: (T1) ICU decision-making relies on a wide range of
-factors, (T2) the complexity of patient state is challenging for shared
-decision-making, and (T3) requirements and capabilities of AI decision support
-systems. We include design recommendations from clinical input, providing
-insights to inform future AI systems for intensive care.
-
-摘要：隨著人工智慧 (AI) 模型變得越來越複雜，且越來越難以被人理解，了解數位系統如何支援臨床決策的需求也日益增加。這種複雜性引發了對可信度的疑慮，影響了此類技術的安全且有效採用。改善對決策制定流程的理解，以及對決策支援工具所提供說明的要求，是提供有效可解釋解決方案的重要組成部分。這在資料密集、快節奏的加護病房 (ICU) 環境中特別相關。為了探討這些問題，對七位 ICU 臨床醫師進行了小組訪談，這些醫師代表了不同的角色和經驗層級。主題分析揭露了三個核心主題：(T1) ICU 決策制定依賴於廣泛的因素，(T2) 病患狀態的複雜性對共同決策制定構成挑戰，以及 (T3) AI 決策支援系統的要求和能力。我們納入了臨床輸入的設計建議，提供見解以提供資訊給未來用於加護的 AI 系統。
-
-##### **Artificial Intelligence in Pediatric Echocardiography: Exploring Challenges, Opportunities, and Clinical Applications with Explainable AI and Federated Learning**
-2411.10255v1 by Mohammed Yaseen Jabarulla, Theodor Uden, Thomas Jack, Philipp Beerbaum, Steffen Oeltze-Jafra
-
-Pediatric heart diseases present a broad spectrum of congenital and acquired
-diseases. More complex congenital malformations require a differentiated and
-multimodal decision-making process, usually including echocardiography as a
-central imaging method. Artificial intelligence (AI) offers considerable
-promise for clinicians by facilitating automated interpretation of pediatric
-echocardiography data. However, adapting AI technologies for pediatric
-echocardiography analysis has challenges such as limited public data
-availability, data privacy, and AI model transparency. Recently, researchers
-have focused on disruptive technologies, such as federated learning (FL) and
-explainable AI (XAI), to improve automatic diagnostic and decision support
-workflows. This study offers a comprehensive overview of the limitations and
-opportunities of AI in pediatric echocardiography, emphasizing the synergistic
-workflow and role of XAI and FL, identifying research gaps, and exploring
-potential future developments. Additionally, three relevant clinical use cases
-demonstrate the functionality of XAI and FL with a focus on (i) view
-recognition, (ii) disease classification, (iii) segmentation of cardiac
-structures, and (iv) quantitative assessment of cardiac function.
-
-摘要：小兒心臟疾病呈現先天性與後天性疾病的廣泛光譜。較複雜的先天性畸形需要一個差異化且多模式的決策過程，通常包括超音波檢查作為主要的影像方法。人工智慧 (AI) 為臨床醫生提供了相當大的希望，因為它可以促進小兒超音波檢查資料的自動化解讀。然而，將人工智慧技術應用於小兒超音波檢查分析有許多挑戰，例如有限的公開資料可用性、資料隱私和人工智慧模型透明度。最近，研究人員專注於破壞性技術，例如聯合學習 (FL) 和可解釋人工智慧 (XAI)，以改善自動診斷和決策支援工作流程。本研究提供了人工智慧在小兒超音波檢查中的限制和機會的全面概述，強調了 XAI 和 FL 的協同工作流程和角色，找出研究差距並探討潛在的未來發展。此外，三個相關的臨床使用案例展示了 XAI 和 FL 的功能，重點在於 (i) 檢視辨識、(ii) 疾病分類、(iii) 心臟結構分割和 (iv) 心臟功能的量化評估。
-
-##### **Enhancing Osteoporosis Detection: An Explainable Multi-Modal Learning Framework with Feature Fusion and Variable Clustering**
-2411.00916v2 by Mehdi Hosseini Chagahi, Saeed Mohammadi Dashtaki, Niloufar Delfan, Nadia Mohammadi, Alireza Samari, Behzad Moshiri, Md. Jalil Piran, Oliver Faust
-
-Osteoporosis is a common condition that increases fracture risk, especially
-in older adults. Early diagnosis is vital for preventing fractures, reducing
-treatment costs, and preserving mobility. However, healthcare providers face
-challenges like limited labeled data and difficulties in processing medical
-images. This study presents a novel multi-modal learning framework that
-integrates clinical and imaging data to improve diagnostic accuracy and model
-interpretability. The model utilizes three pre-trained networks-VGG19,
-InceptionV3, and ResNet50-to extract deep features from X-ray images. These
-features are transformed using PCA to reduce dimensionality and focus on the
-most relevant components. A clustering-based selection process identifies the
-most representative components, which are then combined with preprocessed
-clinical data and processed through a fully connected network (FCN) for final
-classification. A feature importance plot highlights key variables, showing
-that Medical History, BMI, and Height were the main contributors, emphasizing
-the significance of patient-specific data. While imaging features were
-valuable, they had lower importance, indicating that clinical data are crucial
-for accurate predictions. This framework promotes precise and interpretable
-predictions, enhancing transparency and building trust in AI-driven diagnoses
-for clinical integration.
-
-摘要：骨質疏鬆症是一種常見的疾病，會增加骨折的風險，特別是老年人。早期診斷對於預防骨折、降低治療成本和維持行動能力至關重要。然而，醫療保健提供者面臨著標記數據有限和處理醫學影像困難等挑戰。本研究提出了一個新穎的多模式學習框架，該框架整合了臨床和影像數據，以提高診斷準確性和模型可解釋性。該模型利用三個預訓練的網路，VGG19、InceptionV3 和 ResNet50，從 X 射線影像中提取深度特徵。這些特徵使用 PCA 轉換以降低維度並專注於最相關的組成部分。基於聚類的選擇過程識別出最具代表性的組成部分，然後將這些組成部分與預處理的臨床數據結合，並通過全連接網路 (FCN) 進行最終分類。特徵重要性圖突出了關鍵變數，表明病史、BMI 和身高是主要貢獻因素，強調了患者特定數據的重要性。雖然影像特徵很有價值，但它們的重要性較低，這表明臨床數據對於準確預測至關重要。此框架促进了準確且可解釋的預測，提高了透明度，並建立了對 AI 驅動診斷在臨床整合中的信任。
+|**2025-02-13**|**Visual Graph Question Answering with ASP and LLMs for Language Parsing**|Jakob Johannes Bauer et.al.|[2502.09211v1](http://arxiv.org/abs/2502.09211v1)|null|
+|**2025-02-12**|**Representation Learning to Advance Multi-institutional Studies with Electronic Health Record Data**|Doudou Zhou et.al.|[2502.08547v1](http://arxiv.org/abs/2502.08547v1)|null|
+|**2025-02-12**|**Trustworthy GNNs with LLMs: A Systematic Review and Taxonomy**|Ruizhan Xue et.al.|[2502.08353v1](http://arxiv.org/abs/2502.08353v1)|null|
+|**2025-02-12**|**Graph Foundation Models for Recommendation: A Comprehensive Survey**|Bin Wu et.al.|[2502.08346v1](http://arxiv.org/abs/2502.08346v1)|null|
+|**2025-02-12**|**Self-Evaluation for Job-Shop Scheduling**|Imanol Echeverria et.al.|[2502.08684v1](http://arxiv.org/abs/2502.08684v1)|null|
+|**2025-02-12**|**Improving Existing Optimization Algorithms with LLMs**|Camilo Chacón Sartori et.al.|[2502.08298v1](http://arxiv.org/abs/2502.08298v1)|null|
+|**2025-02-12**|**ACCESS : A Benchmark for Abstract Causal Event Discovery and Reasoning**|Vy Vo et.al.|[2502.08148v1](http://arxiv.org/abs/2502.08148v1)|null|
+|**2025-02-12**|**GCoT: Chain-of-Thought Prompt Learning for Graphs**|Xingtong Yu et.al.|[2502.08092v1](http://arxiv.org/abs/2502.08092v1)|null|
+|**2025-02-11**|**Deep Semantic Graph Learning via LLM based Node Enhancement**|Chuanqi Shi et.al.|[2502.07982v1](http://arxiv.org/abs/2502.07982v1)|null|
+|**2025-02-10**|**Cardiverse: Harnessing LLMs for Novel Card Game Prototyping**|Danrui Li et.al.|[2502.07128v1](http://arxiv.org/abs/2502.07128v1)|null|
+|**2025-02-10**|**GraNNite: Enabling High-Performance Execution of Graph Neural Networks on Resource-Constrained Neural Processing Units**|Arghadip Das et.al.|[2502.06921v2](http://arxiv.org/abs/2502.06921v2)|[link](https://github.com/arghadippurdue/GraNNite)|
+|**2025-02-10**|**Automatic Annotation Augmentation Boosts Translation between Molecules and Natural Language**|Zhiqiang Zhong et.al.|[2502.06634v1](http://arxiv.org/abs/2502.06634v1)|null|
+|**2025-02-10**|**KARMA: Leveraging Multi-Agent LLMs for Automated Knowledge Graph Enrichment**|Yuxing Lu et.al.|[2502.06472v1](http://arxiv.org/abs/2502.06472v1)|null|
+|**2025-02-10**|**RoToR: Towards More Reliable Responses for Order-Invariant Inputs**|Soyoung Yoon et.al.|[2502.08662v1](http://arxiv.org/abs/2502.08662v1)|null|
+|**2025-02-10**|**K-ON: Stacking Knowledge On the Head Layer of Large Language Model**|Lingbing Guo et.al.|[2502.06257v1](http://arxiv.org/abs/2502.06257v1)|null|
+|**2025-02-10**|**LegalViz: Legal Text Visualization by Text To Diagram Generation**|Eri Onami et.al.|[2502.06147v2](http://arxiv.org/abs/2502.06147v2)|null|
+|**2025-02-09**|**Deconstructing Depression Stigma: Integrating AI-driven Data Collection and Analysis with Causal Knowledge Graphs**|Han Meng et.al.|[2502.06075v1](http://arxiv.org/abs/2502.06075v1)|null|
+|**2025-02-09**|**LegalSeg: Unlocking the Structure of Indian Legal Judgments Through Rhetorical Role Classification**|Shubham Kumar Nigam et.al.|[2502.05836v1](http://arxiv.org/abs/2502.05836v1)|null|
+|**2025-02-08**|**LLM-Powered Decentralized Generative Agents with Adaptive Hierarchical Knowledge Graph for Cooperative Planning**|Hanqing Yang et.al.|[2502.05453v1](http://arxiv.org/abs/2502.05453v1)|null|
+|**2025-02-08**|**SAMGPT: Text-free Graph Foundation Model for Multi-domain Pre-training and Cross-domain Adaptation**|Xingtong Yu et.al.|[2502.05424v1](http://arxiv.org/abs/2502.05424v1)|null|
+|**2025-02-08**|**Graph-based Molecular In-context Learning Grounded on Morgan Fingerprints**|Ali Al-Lawati et.al.|[2502.05414v1](http://arxiv.org/abs/2502.05414v1)|null|
+|**2025-02-08**|**Knowledge Graph-Guided Retrieval Augmented Generation**|Xiangrong Zhu et.al.|[2502.06864v1](http://arxiv.org/abs/2502.06864v1)|[link](https://github.com/nju-websoft/KG2RAG)|
+|**2025-02-07**|**Can Large Language Models Understand Intermediate Representations?**|Hailong Jiang et.al.|[2502.06854v1](http://arxiv.org/abs/2502.06854v1)|null|
+|**2025-02-07**|**GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity?**|Yang Zhou et.al.|[2502.05252v1](http://arxiv.org/abs/2502.05252v1)|null|
+|**2025-02-07**|**Causality can systematically address the monsters under the bench(marks)**|Felix Leeb et.al.|[2502.05085v1](http://arxiv.org/abs/2502.05085v1)|null|
+|**2025-02-07**|**Adaptive Graph of Thoughts: Test-Time Adaptive Reasoning Unifying Chain, Tree, and Graph Structures**|Tushar Pandey et.al.|[2502.05078v1](http://arxiv.org/abs/2502.05078v1)|[link](https://github.com/AgnostiqHQ/multi-agent-llm)|
+|**2025-02-07**|**Enhancing Knowledge Graph Construction: Evaluating with Emphasis on Hallucination, Omission, and Graph Similarity Metrics**|Hussam Ghanem et.al.|[2502.05239v1](http://arxiv.org/abs/2502.05239v1)|null|
+|**2025-02-07**|**Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research**|Junde Wu et.al.|[2502.04644v1](http://arxiv.org/abs/2502.04644v1)|[link](https://github.com/theworldofagents/agentic-reasoning)|
+|**2025-02-07**|**Position-aware Automatic Circuit Discovery**|Tal Haklay et.al.|[2502.04577v1](http://arxiv.org/abs/2502.04577v1)|[link](https://github.com/technion-cs-nlp/peap)|
+|**2025-02-06**|**Heterogeneous Swarms: Jointly Optimizing Model Roles and Weights for Multi-LLM Systems**|Shangbin Feng et.al.|[2502.04510v1](http://arxiv.org/abs/2502.04510v1)|null|
+|**2025-02-06**|**MedRAG: Enhancing Retrieval-augmented Generation with Knowledge Graph-Elicited Reasoning for Healthcare Copilot**|Xuejiao Zhao et.al.|[2502.04413v1](http://arxiv.org/abs/2502.04413v1)|[link](https://github.com/snowteam2023/medrag)|
+|**2025-02-06**|**Ontology-Guided, Hybrid Prompt Learning for Generalization in Knowledge Graph Question Answering**|Longquan Jiang et.al.|[2502.03992v1](http://arxiv.org/abs/2502.03992v1)|[link](https://github.com/longquanjiang/ontoscprompt)|
+|**2025-02-06**|**Multimodal Medical Code Tokenizer**|Xiaorui Su et.al.|[2502.04397v2](http://arxiv.org/abs/2502.04397v2)|null|
+|**2025-02-06**|**Division-of-Thoughts: Harnessing Hybrid Language Model Synergy for Efficient On-Device Agents**|Chenyang Shao et.al.|[2502.04392v1](http://arxiv.org/abs/2502.04392v1)|null|
+|**2025-02-06**|**Boosting Knowledge Graph-based Recommendations through Confidence-Aware Augmentation with Large Language Models**|Rui Cai et.al.|[2502.03715v1](http://arxiv.org/abs/2502.03715v1)|null|
+|**2025-02-05**|**A Schema-Guided Reason-while-Retrieve framework for Reasoning on Scene Graphs with Large-Language-Models (LLMs)**|Yiye Chen et.al.|[2502.03450v1](http://arxiv.org/abs/2502.03450v1)|null|
+|**2025-02-05**|**SymAgent: A Neural-Symbolic Self-Learning Agent Framework for Complex Reasoning over Knowledge Graphs**|Ben Liu et.al.|[2502.03283v1](http://arxiv.org/abs/2502.03283v1)|null|
+|**2025-02-05**|**Analyze Feature Flow to Enhance Interpretation and Steering in Language Models**|Daniil Laptev et.al.|[2502.03032v2](http://arxiv.org/abs/2502.03032v2)|null|
+|**2025-02-05**|**A Benchmark for the Detection of Metalinguistic Disagreements between LLMs and Knowledge Graphs**|Bradley P. Allen et.al.|[2502.02896v1](http://arxiv.org/abs/2502.02896v1)|null|
+|**2025-02-05**|**Mol-LLM: Generalist Molecular LLM with Improved Graph Utilization**|Chanhui Lee et.al.|[2502.02810v1](http://arxiv.org/abs/2502.02810v1)|null|
+|**2025-02-05**|**Leveraging the true depth of LLMs**|Ramón Calvo González et.al.|[2502.02790v1](http://arxiv.org/abs/2502.02790v1)|null|
+|**2025-02-04**|**Modular Training of Neural Networks aids Interpretability**|Satvik Golechha et.al.|[2502.02470v2](http://arxiv.org/abs/2502.02470v2)|null|
+|**2025-02-04**|**Premise-Augmented Reasoning Chains Improve Error Identification in Math reasoning with LLMs**|Sagnik Mukherjee et.al.|[2502.02362v3](http://arxiv.org/abs/2502.02362v3)|null|
+|**2025-02-04**|**AdaptBot: Combining LLM with Knowledge Graphs and Human Input for Generic-to-Specific Task Decomposition and Knowledge Refinement**|Shivam Singh et.al.|[2502.02067v1](http://arxiv.org/abs/2502.02067v1)|[link](https://github.com/sssshivvvv/adaptbot)|
+|**2025-02-03**|**On Bob Dylan: A Computational Perspective**|Prashant Garg et.al.|[2502.01772v1](http://arxiv.org/abs/2502.01772v1)|null|
+|**2025-02-03**|**VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos**|Xubin Ren et.al.|[2502.01549v1](http://arxiv.org/abs/2502.01549v1)|null|
+|**2025-02-03**|**Transformers trained on proteins can learn to attend to Euclidean distance**|Isaac Ellmen et.al.|[2502.01533v1](http://arxiv.org/abs/2502.01533v1)|[link](https://github.com/Ellmen/attending-to-distance)|
+|**2025-02-03**|**Common Foundations for SHACL, ShEx, and PG-Schema**|S. Ahmetaj et.al.|[2502.01295v1](http://arxiv.org/abs/2502.01295v1)|null|
+|**2025-02-03**|**GFM-RAG: Graph Foundation Model for Retrieval Augmented Generation**|Linhao Luo et.al.|[2502.01113v1](http://arxiv.org/abs/2502.01113v1)|[link](https://github.com/RManLuo/gfm-rag)|
+|**2025-02-03**|**Knowledge Synthesis of Photosynthesis Research Using a Large Language Model**|Seungri Yoon et.al.|[2502.01059v1](http://arxiv.org/abs/2502.01059v1)|null|
+|**2025-02-03**|**Encrypted Large Model Inference: The Equivariant Encryption Paradigm**|James Buban et.al.|[2502.01013v1](http://arxiv.org/abs/2502.01013v1)|null|
+|**2025-02-02**|**Metastable Dynamics of Chain-of-Thought Reasoning: Provable Benefits of Search, RL and Distillation**|Juno Kim et.al.|[2502.01694v1](http://arxiv.org/abs/2502.01694v1)|null|
+|**2025-02-02**|**PhiP-G: Physics-Guided Text-to-3D Compositional Scene Generation**|Qixuan Li et.al.|[2502.00708v1](http://arxiv.org/abs/2502.00708v1)|null|
+|**2025-02-02**|**A Survey of Quantized Graph Representation Learning: Connecting Graph Structures with Large Language Models**|Qika Lin et.al.|[2502.00681v1](http://arxiv.org/abs/2502.00681v1)|null|
+|**2025-02-01**|**Challenges and Innovations in LLM-Powered Fake News Detection: A Synthesis of Approaches and Future Directions**|Jingyuan Yi et.al.|[2502.00339v1](http://arxiv.org/abs/2502.00339v1)|null|
+|**2025-02-01**|**DEUCE: Dual-diversity Enhancement and Uncertainty-awareness for Cold-start Active Learning**|Jiaxin Guo et.al.|[2502.00305v1](http://arxiv.org/abs/2502.00305v1)|null|
+|**2025-01-31**|**Longer Attention Span: Increasing Transformer Context Length with Sparse Graph Processing Techniques**|Nathaniel Tomczak et.al.|[2502.01659v2](http://arxiv.org/abs/2502.01659v2)|[link](https://github.com/KLab-AI3/Graph-Processing-Attention-IPDPS-2025)|
+|**2025-01-31**|**Improving vision-language alignment with graph spiking hybrid Networks**|Siyu Zhang et.al.|[2501.19069v1](http://arxiv.org/abs/2501.19069v1)|null|
+|**2025-01-30**|**Semantic Web and Creative AI -- A Technical Report from ISWS 2023**|Raia Abu Ahmad et.al.|[2501.18542v1](http://arxiv.org/abs/2501.18542v1)|null|
+|**2025-01-30**|**Leveraging LLM Agents for Automated Optimization Modeling for SASP Problems: A Graph-RAG based Approach**|Tianpeng Pan et.al.|[2501.18320v1](http://arxiv.org/abs/2501.18320v1)|null|
+|**2025-01-30**|**Mixed-Precision Graph Neural Quantization for Low Bit Large Language Models**|Wanlong Liu et.al.|[2501.18154v1](http://arxiv.org/abs/2501.18154v1)|null|
+|**2025-01-30**|**Self-supervised Quantized Representation for Seamlessly Integrating Knowledge Graphs with Large Language Models**|Qika Lin et.al.|[2501.18119v1](http://arxiv.org/abs/2501.18119v1)|null|
+|**2025-01-29**|**Hybrid Graphs for Table-and-Text based Question Answering using LLMs**|Ankush Agarwal et.al.|[2501.17767v1](http://arxiv.org/abs/2501.17767v1)|null|
+|**2025-01-29**|**Query-Aware Learnable Graph Pooling Tokens as Prompt for Large Language Models**|Wooyoung Kim et.al.|[2501.17549v1](http://arxiv.org/abs/2501.17549v1)|null|
+|**2025-01-29**|**General Scene Adaptation for Vision-and-Language Navigation**|Haodong Hong et.al.|[2501.17403v1](http://arxiv.org/abs/2501.17403v1)|[link](https://github.com/honghd16/gsa-vln)|
+|**2025-01-28**|**Comprehensive Evaluation for a Large Scale Knowledge Graph Question Answering Service**|Saloni Potdar et.al.|[2501.17270v1](http://arxiv.org/abs/2501.17270v1)|null|
+|**2025-01-28**|**FactCG: Enhancing Fact Checkers with Graph-Based Multi-Hop Data**|Deren Lei et.al.|[2501.17144v1](http://arxiv.org/abs/2501.17144v1)|[link](https://github.com/derenlei/factcg)|
+|**2025-01-28**|**LLM-AutoDiff: Auto-Differentiate Any LLM Workflow**|Li Yin et.al.|[2501.16673v2](http://arxiv.org/abs/2501.16673v2)|[link](https://github.com/sylphai-inc/adalflow)|
+|**2025-01-27**|**360Brew: A Decoder-only Foundation Model for Personalized Ranking and Recommendation**|Hamed Firooz et.al.|[2501.16450v3](http://arxiv.org/abs/2501.16450v3)|null|
+|**2025-01-27**|**Raiders of the Lost Dependency: Fixing Dependency Conflicts in Python using LLMs**|Antony Bartlett et.al.|[2501.16191v1](http://arxiv.org/abs/2501.16191v1)|null|
+|**2025-01-27**|**Harnessing Diverse Perspectives: A Multi-Agent Framework for Enhanced Error Detection in Knowledge Graphs**|Yu Li et.al.|[2501.15791v1](http://arxiv.org/abs/2501.15791v1)|[link](https://github.com/kse-eleven/makged)|
+|**2025-01-27**|**Automatic Feedback Generation for Short Answer Questions using Answer Diagnostic Graphs**|Momoka Furuhashi et.al.|[2501.15777v1](http://arxiv.org/abs/2501.15777v1)|null|
+|**2025-01-26**|**Transformer-Based Multimodal Knowledge Graph Completion with Link-Aware Contexts**|Haodi Ma et.al.|[2501.15688v1](http://arxiv.org/abs/2501.15688v1)|null|
+|**2025-01-26**|**How to Mitigate Information Loss in Knowledge Graphs for GraphRAG: Leveraging Triple Context Restoration and Query-Driven Feedback**|Manzong Huang et.al.|[2501.15378v1](http://arxiv.org/abs/2501.15378v1)|null|
+|**2025-01-24**|**Explaining Categorical Feature Interactions Using Graph Covariance and LLMs**|Cencheng Shen et.al.|[2501.14932v1](http://arxiv.org/abs/2501.14932v1)|null|
+|**2025-01-24**|**Causal Graphs Meet Thoughts: Enhancing Complex Reasoning in Graph-Augmented LLMs**|Hang Luo et.al.|[2501.14892v1](http://arxiv.org/abs/2501.14892v1)|null|
+|**2025-01-24**|**GraPPI: A Retrieve-Divide-Solve GraphRAG Framework for Large-scale Protein-protein Interaction Exploration**|Ziwen Li et.al.|[2501.16382v1](http://arxiv.org/abs/2501.16382v1)|[link](https://github.com/aaronli43/grappi)|
+|**2025-01-24**|**Evaluating and Improving Graph to Text Generation with Large Language Models**|Jie He et.al.|[2501.14497v1](http://arxiv.org/abs/2501.14497v1)|[link](https://github.com/probe2/kg_text)|
+|**2025-01-24**|**Fast Think-on-Graph: Wider, Deeper and Faster Reasoning of Large Language Model on Knowledge Graph**|Xujian Liang et.al.|[2501.14300v1](http://arxiv.org/abs/2501.14300v1)|[link](https://github.com/dosonleung/fasttog)|
+|**2025-01-24**|**Top Ten Challenges Towards Agentic Neural Graph Databases**|Jiaxin Bai et.al.|[2501.14224v1](http://arxiv.org/abs/2501.14224v1)|null|
+|**2025-01-23**|**GraphRAG under Fire**|Jiacheng Liang et.al.|[2501.14050v1](http://arxiv.org/abs/2501.14050v1)|null|
+|**2025-01-23**|**EICopilot: Search and Explore Enterprise Information over Large-scale Knowledge Graphs with LLM-driven Agents**|Yuhui Yun et.al.|[2501.13746v1](http://arxiv.org/abs/2501.13746v1)|null|
+|**2025-01-23**|**Pseudocode-Injection Magic: Enabling LLMs to Tackle Graph Computational Tasks**|Chang Gong et.al.|[2501.13731v1](http://arxiv.org/abs/2501.13731v1)|null|
+|**2025-01-23**|**CAPRAG: A Large Language Model Solution for Customer Service and Automatic Reporting using Vector and Graph Retrieval-Augmented Generation**|Hamza Landolsi et.al.|[2501.13993v1](http://arxiv.org/abs/2501.13993v1)|null|
+|**2025-01-23**|**Dual-Branch HNSW Approach with Skip Bridges and LID-Driven Optimization**|Hy Nguyen et.al.|[2501.13992v1](http://arxiv.org/abs/2501.13992v1)|null|
+|**2025-01-23**|**Comprehensive Modeling and Question Answering of Cancer Clinical Practice Guidelines using LLMs**|Bhumika Gupta et.al.|[2501.13984v1](http://arxiv.org/abs/2501.13984v1)|null|
+|**2025-01-21**|**LLM-Assisted Knowledge Graph Completion for Curriculum and Domain Modelling in Personalized Higher Education Recommendations**|Hasan Abu-Rasheed et.al.|[2501.12300v1](http://arxiv.org/abs/2501.12300v1)|null|
+|**2025-01-21**|**Divide-Then-Aggregate: An Efficient Tool Learning Method via Parallel Tool Invocation**|Dongsheng Zhu et.al.|[2501.12432v1](http://arxiv.org/abs/2501.12432v1)|null|
+|**2025-01-21**|**InsTALL: Context-aware Instructional Task Assistance with Multi-modal Large Language Models**|Pha Nguyen et.al.|[2501.12231v1](http://arxiv.org/abs/2501.12231v1)|null|
+|**2025-01-21**|**Leveraging Graph Structures and Large Language Models for End-to-End Synthetic Task-Oriented Dialogues**|Maya Medjad et.al.|[2501.11977v1](http://arxiv.org/abs/2501.11977v1)|[link](https://github.com/reecall/graphtod)|
+|**2025-01-21**|**Bridging Visualization and Optimization: Multimodal Large Language Models on Graph-Structured Combinatorial Optimization**|Jie Zhao et.al.|[2501.11968v1](http://arxiv.org/abs/2501.11968v1)|null|
+|**2025-01-21**|**A Survey of Graph Retrieval-Augmented Generation for Customized Large Language Models**|Qinggang Zhang et.al.|[2501.13958v1](http://arxiv.org/abs/2501.13958v1)|[link](https://github.com/deep-polyu/awesome-graphrag)|
+|**2025-01-21**|**Network-informed Prompt Engineering against Organized Astroturf Campaigns under Extreme Class Imbalance**|Nikos Kanakaris et.al.|[2501.11849v2](http://arxiv.org/abs/2501.11849v2)|[link](https://github.com/nkanak/brag-fake-news-campaigns)|
+|**2025-01-21**|**Large Language Models Meet Graph Neural Networks for Text-Numeric Graph Reasoning**|Haoran Song et.al.|[2501.16361v1](http://arxiv.org/abs/2501.16361v1)|null|
+|**2025-01-20**|**Zep: A Temporal Knowledge Graph Architecture for Agent Memory**|Preston Rasmussen et.al.|[2501.13956v1](http://arxiv.org/abs/2501.13956v1)|[link](https://github.com/getzep/graphiti)|
+|**2025-01-20**|**Explainable Lane Change Prediction for Near-Crash Scenarios Using Knowledge Graph Embeddings and Retrieval Augmented Generation**|M. Manzour et.al.|[2501.11560v1](http://arxiv.org/abs/2501.11560v1)|null|
+|**2025-01-20**|**Each Graph is a New Language: Graph Learning with LLMs**|Huachi Zhou et.al.|[2501.11478v2](http://arxiv.org/abs/2501.11478v2)|null|
+|**2025-01-20**|**Few-shot Policy (de)composition in Conversational Question Answering**|Kyle Erwin et.al.|[2501.11335v1](http://arxiv.org/abs/2501.11335v1)|null|
+|**2025-01-20**|**Reasoning Language Models: A Blueprint**|Maciej Besta et.al.|[2501.11223v3](http://arxiv.org/abs/2501.11223v3)|[link](https://github.com/spcl/x1)|
+|**2025-01-19**|**IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems**|Elad Levi et.al.|[2501.11067v1](http://arxiv.org/abs/2501.11067v1)|[link](https://github.com/plurai-ai/intellagent)|
 
-##### **A Review of Deep Learning Approaches for Non-Invasive Cognitive Impairment Detection**
-2410.19898v1 by Muath Alsuhaibani, Ali Pourramezan Fard, Jian Sun, Farida Far Poor, Peter S. Pressman, Mohammad H. Mahoor
+#### Abstracts
+##### **Visual Graph Question Answering with ASP and LLMs for Language Parsing**
+2502.09211v1 by Jakob Johannes Bauer, Thomas Eiter, Nelson Higuera Ruiz, Johannes Oetsch
 
-This review paper explores recent advances in deep learning approaches for
-non-invasive cognitive impairment detection. We examine various non-invasive
-indicators of cognitive decline, including speech and language, facial, and
-motoric mobility. The paper provides an overview of relevant datasets,
-feature-extracting techniques, and deep-learning architectures applied to this
-domain. We have analyzed the performance of different methods across modalities
-and observed that speech and language-based methods generally achieved the
-highest detection performance. Studies combining acoustic and linguistic
-features tended to outperform those using a single modality. Facial analysis
-methods showed promise for visual modalities but were less extensively studied.
-Most papers focused on binary classification (impaired vs. non-impaired), with
-fewer addressing multi-class or regression tasks. Transfer learning and
-pre-trained language models emerged as popular and effective techniques,
-especially for linguistic analysis. Despite significant progress, several
-challenges remain, including data standardization and accessibility, model
-explainability, longitudinal analysis limitations, and clinical adaptation.
-Lastly, we propose future research directions, such as investigating
-language-agnostic speech analysis methods, developing multi-modal diagnostic
-systems, and addressing ethical considerations in AI-assisted healthcare. By
-synthesizing current trends and identifying key obstacles, this review aims to
-guide further development of deep learning-based cognitive impairment detection
-systems to improve early diagnosis and ultimately patient outcomes.
+Visual Question Answering (VQA) is a challenging problem that requires to
+process multimodal input. Answer-Set Programming (ASP) has shown great
+potential in this regard to add interpretability and explainability to modular
+VQA architectures. In this work, we address the problem of how to integrate ASP
+with modules for vision and natural language processing to solve a new and
+demanding VQA variant that is concerned with images of graphs (not graphs in
+symbolic form). Images containing graph-based structures are an ubiquitous and
+popular form of visualisation. Here, we deal with the particular problem of
+graphs inspired by transit networks, and we introduce a novel dataset that
+amends an existing one by adding images of graphs that resemble metro lines.
+Our modular neuro-symbolic approach combines optical graph recognition for
+graph parsing, a pretrained optical character recognition neural network for
+parsing labels, Large Language Models (LLMs) for language processing, and ASP
+for reasoning. This method serves as a first baseline and achieves an overall
+average accuracy of 73% on the dataset. Our evaluation provides further
+evidence of the potential of modular neuro-symbolic systems, in particular with
+pretrained models that do not involve any further training and logic
+programming for reasoning, to solve complex VQA tasks.
 
-摘要：本篇評論探討了深度學習方法在非侵入式認知功能障礙檢測上的最新進展。我們檢視了各種非侵入式的認知衰退指標，包括語言和語言、面部和運動機能。本文概述了與此領域相關的資料集、特徵提取技術和深度學習架構。我們分析了不同方法在不同方式上的表現，並觀察到基於語言和語言的方法通常能達到最高的檢測表現。結合聲學和語言特徵的研究往往優於使用單一方式的研究。面部分析方法顯示出視覺方式的潛力，但研究較少。大多數論文專注於二元分類（受損與未受損），較少探討多類或回歸任務。遷移學習和預訓練語言模型已成為流行且有效的技術，特別是對於語言分析。儘管取得了重大進展，但仍存在一些挑戰，包括資料標準化和可及性、模型可解釋性、縱向分析限制和臨床適應性。最後，我們提出了未來的研究方向，例如調查與語言無關的語音分析方法、開發多模式診斷系統，以及解決人工智慧輔助醫療保健中的倫理考量。透過綜合目前的趨勢和找出關鍵障礙，本篇評論旨在引導深度學習為基礎的認知功能障礙檢測系統的進一步發展，以改善早期診斷，並最終改善患者的治療結果。
+摘要：視覺問答（VQA）是一項具有挑戰性的問題，需要處理多模態輸入。答案集程式設計（ASP）在這方面顯示出巨大的潛力，可以為模組化 VQA 架構增加可解釋性和說明性。在這項工作中，我們探討如何將 ASP 與視覺和自然語言處理模組整合，以解決一個新的且要求嚴格的 VQA 變體，該變體與圖形影像（而非符號形式的圖形）有關。包含圖形結構的影像是一種普遍且流行的可視化形式。在這裡，我們處理受交通網路啟發的圖形特定問題，並引入一個新的資料集，透過新增類似地鐵路線的圖形影像來修正現有資料集。我們的模組化神經符號方法結合光學圖形辨識進行圖形解析、預先訓練的光學字元辨識神經網路進行標籤解析、大型語言模型（LLM）進行語言處理，以及 ASP 進行推理。此方法作為第一個基準，在資料集上達到 73% 的整體平均準確度。我們的評估進一步證明了模組化神經符號系統的潛力，特別是預先訓練的模型，這些模型不涉及任何進一步的訓練和邏輯程式設計進行推理，以解決複雜的 VQA 任務。
 
-##### **An Ontology-Enabled Approach For User-Centered and Knowledge-Enabled Explanations of AI Systems**
-2410.17504v1 by Shruthi Chari
+##### **Representation Learning to Advance Multi-institutional Studies with Electronic Health Record Data**
+2502.08547v1 by Doudou Zhou, Han Tong, Linshanshan Wang, Suqi Liu, Xin Xiong, Ziming Gan, Romain Griffier, Boris Hejblum, Yun-Chung Liu, Chuan Hong, Clara-Lea Bonzel, Tianrun Cai, Kevin Pan, Yuk-Lam Ho, Lauren Costa, Vidul A. Panickan, J. Michael Gaziano, Kenneth Mandl, Vianney Jouhet, Rodolphe Thiebaut, Zongqi Xia, Kelly Cho, Katherine Liao, Tianxi Cai
 
-Explainable Artificial Intelligence (AI) focuses on helping humans understand
-the working of AI systems or their decisions and has been a cornerstone of AI
-for decades. Recent research in explainability has focused on explaining the
-workings of AI models or model explainability. There have also been several
-position statements and review papers detailing the needs of end-users for
-user-centered explainability but fewer implementations. Hence, this thesis
-seeks to bridge some gaps between model and user-centered explainability. We
-create an explanation ontology (EO) to represent literature-derived explanation
-types via their supporting components. We implement a knowledge-augmented
-question-answering (QA) pipeline to support contextual explanations in a
-clinical setting. Finally, we are implementing a system to combine explanations
-from different AI methods and data modalities. Within the EO, we can represent
-fifteen different explanation types, and we have tested these representations
-in six exemplar use cases. We find that knowledge augmentations improve the
-performance of base large language models in the contextualized QA, and the
-performance is variable across disease groups. In the same setting, clinicians
-also indicated that they prefer to see actionability as one of the main foci in
-explanations. In our explanations combination method, we plan to use similarity
-metrics to determine the similarity of explanations in a chronic disease
-detection setting. Overall, through this thesis, we design methods that can
-support knowledge-enabled explanations across different use cases, accounting
-for the methods in today's AI era that can generate the supporting components
-of these explanations and domain knowledge sources that can enhance them.
+The adoption of EHRs has expanded opportunities to leverage data-driven
+algorithms in clinical care and research. A major bottleneck in effectively
+conducting multi-institutional EHR studies is the data heterogeneity across
+systems with numerous codes that either do not exist or represent different
+clinical concepts across institutions. The need for data privacy further limits
+the feasibility of including multi-institutional patient-level data required to
+study similarities and differences across patient subgroups. To address these
+challenges, we developed the GAME algorithm. Tested and validated across 7
+institutions and 2 languages, GAME integrates data in several levels: (1) at
+the institutional level with knowledge graphs to establish relationships
+between codes and existing knowledge sources, providing the medical context for
+standard codes and their relationship to each other; (2) between institutions,
+leveraging language models to determine the relationships between
+institution-specific codes with established standard codes; and (3) quantifying
+the strength of the relationships between codes using a graph attention
+network. Jointly trained embeddings are created using transfer and federated
+learning to preserve data privacy. In this study, we demonstrate the
+applicability of GAME in selecting relevant features as inputs for AI-driven
+algorithms in a range of conditions, e.g., heart failure, rheumatoid arthritis.
+We then highlight the application of GAME harmonized multi-institutional EHR
+data in a study of Alzheimer's disease outcomes and suicide risk among patients
+with mental health disorders, without sharing patient-level data outside
+individual institutions.
 
-摘要：可解釋人工智慧（AI）專注於協助人類了解 AI 系統運作或其決策，數十年來一直是 AI 的基石。最近的可解釋性研究專注於解釋 AI 模型或模型可解釋性的運作。也有幾份立場聲明和評論論文詳細說明了最終使用者對以使用者為中心的可解釋性的需求，但實作較少。因此，本論文旨在彌補模型和以使用者為中心的可解釋性之間的一些差距。我們建立一個解釋本體（EO）以透過其支援元件來表示從文獻中衍生的解釋類型。我們實作一個知識增強的問答（QA）管線，以在臨床環境中支援情境解釋。最後，我們正在實作一個系統，以結合來自不同 AI 方法和資料模式的解釋。在 EO 中，我們可以表示 15 種不同的解釋類型，並且我們已在六個範例使用案例中測試這些表示。我們發現，知識增強改善了基礎大型語言模型在情境化 QA 中的效能，並且效能因疾病群組而異。在相同的環境中，臨床醫生也表示他們希望將可操作性視為解釋中的主要焦點之一。在我們的解釋組合方法中，我們計畫使用相似性指標來確定慢性病偵測環境中解釋的相似性。總體而言，透過本論文，我們設計了可以在不同使用案例中支援知識啟用解釋的方法，考量到當今 AI 時代中可以產生這些解釋的支援元件和可以增強這些解釋的領域知識來源的方法。
+摘要：電子健康紀錄的採用擴大了在臨床照護和研究中利用資料驅動演算法的機會。在有效進行多機構電子健康紀錄研究時，一個主要的瓶頸是系統間資料異質性，其中有許多代碼在機構間不存在或表示不同的臨床概念。資料隱私的需求進一步限制了納入多機構患者層級資料的可行性，而這些資料對於研究患者亞群之間的相似性和差異性是必要的。為了應對這些挑戰，我們開發了 GAME 演算法。GAME 已在 7 個機構和 2 種語言中進行測試和驗證，它整合了多個層級的資料：(1) 在機構層級，使用知識圖表來建立代碼和現有知識來源之間的關係，為標準代碼及其彼此之間的關係提供醫療背景；(2) 在機構之間，利用語言模型來確定機構特定代碼與已建立的標準代碼之間的關係；(3) 使用圖形注意網路量化代碼之間關係的強度。使用遷移和聯合學習建立聯合訓練的嵌入，以保護資料隱私。在本研究中，我們展示了 GAME 在選擇相關特徵作為 AI 驅動演算法輸入時的適用性，適用於各種情況，例如心臟衰竭、類風濕性關節炎。然後，我們重點介紹了 GAME 和諧化多機構電子健康紀錄資料在阿茲海默症疾病結果和精神疾病患者自殺風險研究中的應用，而無需在個別機構之外共享患者層級資料。
 
-##### **Contrasting Attitudes Towards Current and Future AI Applications for Computerised Interpretation of ECG: A Clinical Stakeholder Interview Study**
-2410.16879v1 by Lukas Hughes-Noehrer, Leda Channer, Gabriel Strain, Gregory Yates, Richard Body, Caroline Jay
+##### **Trustworthy GNNs with LLMs: A Systematic Review and Taxonomy**
+2502.08353v1 by Ruizhan Xue, Huimin Deng, Fang He, Maojun Wang, Zeyu Zhang
 
-Objectives: To investigate clinicians' attitudes towards current automated
-interpretation of ECG and novel AI technologies and their perception of
-computer-assisted interpretation. Materials and Methods: We conducted a series
-of interviews with clinicians in the UK. Our study: (i) explores the potential
-for AI, specifically future 'human-like' computing approaches, to facilitate
-ECG interpretation and support clinical decision making, and (ii) elicits their
-opinions about the importance of explainability and trustworthiness of AI
-algorithms. Results: We performed inductive thematic analysis on interview
-transcriptions from 23 clinicians and identified the following themes: (i) a
-lack of trust in current systems, (ii) positive attitudes towards future AI
-applications and requirements for these, (iii) the relationship between the
-accuracy and explainability of algorithms, and (iv) opinions on education,
-possible deskilling, and the impact of AI on clinical competencies. Discussion:
-Clinicians do not trust current computerised methods, but welcome future 'AI'
-technologies. Where clinicians trust future AI interpretation to be accurate,
-they are less concerned that it is explainable. They also preferred ECG
-interpretation that demonstrated the results of the algorithm visually. Whilst
-clinicians do not fear job losses, they are concerned about deskilling and the
-need to educate the workforce to use AI responsibly. Conclusion: Clinicians are
-positive about the future application of AI in clinical decision-making.
-Accuracy is a key factor of uptake and visualisations are preferred over
-current computerised methods. This is viewed as a potential means of training
-and upskilling, in contrast to the deskilling that automation might be
-perceived to bring.
+With the extensive application of Graph Neural Networks (GNNs) across various
+domains, their trustworthiness has emerged as a focal point of research. Some
+existing studies have shown that the integration of large language models
+(LLMs) can improve the semantic understanding and generation capabilities of
+GNNs, which in turn improves the trustworthiness of GNNs from various aspects.
+Our review introduces a taxonomy that offers researchers a clear framework for
+comprehending the principles and applications of different methods and helps
+clarify the connections and differences among various approaches. Then we
+systematically survey representative approaches along the four categories of
+our taxonomy. Through our taxonomy, researchers can understand the applicable
+scenarios, potential advantages, and limitations of each approach for the the
+trusted integration of GNNs with LLMs. Finally, we present some promising
+directions of work and future trends for the integration of LLMs and GNNs to
+improve model trustworthiness.
 
-摘要：<paragraph>目的：調查臨床醫生對目前自動化心電圖解讀和新的人工智慧技術的態度，以及他們對電腦輔助解讀的看法。材料和方法：我們對英國的臨床醫生進行了一系列訪談。我們的研究：(i) 探討人工智慧的潛力，特別是未來的「類人類」運算方法，以促進心電圖解讀並支持臨床決策制定，以及 (ii) 徵求他們對人工智慧演算法的可解釋性和可信度的看法。結果：我們對 23 位臨床醫生的訪談記錄進行了歸納主題分析，並找出以下主題：(i) 對目前系統缺乏信任，(ii) 對未來人工智慧應用和對這些應用的要求持正面態度，(iii) 演算法的準確性和可解釋性之間的關係，以及 (iv) 對教育、可能的技能退化，以及人工智慧對臨床能力的影響的看法。討論：臨床醫生不信任目前的電腦化方法，但歡迎未來的「人工智慧」技術。在臨床醫生相信未來的 AI 解讀準確的情況下，他們不太擔心它是否可解釋。他們也比較喜歡能以視覺方式呈現演算法結果的心電圖解讀。雖然臨床醫生不害怕失業，但他們擔心技能退化，以及需要教育員工負責任地使用人工智慧。結論：臨床醫生對人工智慧在臨床決策制定中的未來應用持正面態度。準確性是採用人工智慧的一個關鍵因素，而視覺化比目前的電腦化方法更受青睞。這被視為一種潛在的培訓和提升技能的方法，與自動化可能帶來的技能退化形成對比。</paragraph>
+摘要：隨著圖神經網路 (GNN) 在各種領域的廣泛應用，其可信度已成為研究的焦點。一些現有研究表明，整合大型語言模型 (LLM) 可以提升 GNN 的語意理解和生成能力，進而從各方面提升 GNN 的可信度。我們的評論介紹了一種分類法，為研究人員提供了一個清晰的架構，用於理解不同方法的原理和應用，並有助於釐清各種方法之間的關聯和差異。然後，我們系統性地針對分類法的四個類別進行代表性方法的調查。研究人員透過我們的分類法，可以了解每種方法在 GNN 與 LLM 的可信整合中適用的場景、潛在優點和限制。最後，我們提出 LLM 與 GNN 整合的一些有前景的工作方向和未來趨勢，以提升模型的可信度。
 
-##### **Pathologist-like explainable AI for interpretable Gleason grading in prostate cancer**
-2410.15012v1 by Gesa Mittmann, Sara Laiouar-Pedari, Hendrik A. Mehrtens, Sarah Haggenmüller, Tabea-Clara Bucher, Tirtha Chanda, Nadine T. Gaisa, Mathias Wagner, Gilbert Georg Klamminger, Tilman T. Rau, Christina Neppl, Eva Maria Compérat, Andreas Gocht, Monika Hämmerle, Niels J. Rupp, Jula Westhoff, Irene Krücken, Maximillian Seidl, Christian M. Schürch, Marcus Bauer, Wiebke Solass, Yu Chun Tam, Florian Weber, Rainer Grobholz, Jaroslaw Augustyniak, Thomas Kalinski, Christian Hörner, Kirsten D. Mertz, Constanze Döring, Andreas Erbersdobler, Gabriele Deubler, Felix Bremmer, Ulrich Sommer, Michael Brodhun, Jon Griffin, Maria Sarah L. Lenon, Kiril Trpkov, Liang Cheng, Fei Chen, Angelique Levi, Guoping Cai, Tri Q. Nguyen, Ali Amin, Alessia Cimadamore, Ahmed Shabaik, Varsha Manucha, Nazeel Ahmad, Nidia Messias, Francesca Sanguedolce, Diana Taheri, Ezra Baraban, Liwei Jia, Rajal B. Shah, Farshid Siadat, Nicole Swarbrick, Kyung Park, Oudai Hassan, Siamak Sakhaie, Michelle R. Downes, Hiroshi Miyamoto, Sean R. Williamson, Tim Holland-Letz, Carolin V. Schneider, Jakob Nikolas Kather, Yuri Tolkach, Titus J. Brinker
+##### **Graph Foundation Models for Recommendation: A Comprehensive Survey**
+2502.08346v1 by Bin Wu, Yihang Wang, Yuanhao Zeng, Jiawei Liu, Jiashu Zhao, Cheng Yang, Yawen Li, Long Xia, Dawei Yin, Chuan Shi
 
-The aggressiveness of prostate cancer, the most common cancer in men
-worldwide, is primarily assessed based on histopathological data using the
-Gleason scoring system. While artificial intelligence (AI) has shown promise in
-accurately predicting Gleason scores, these predictions often lack inherent
-explainability, potentially leading to distrust in human-machine interactions.
-To address this issue, we introduce a novel dataset of 1,015 tissue microarray
-core images, annotated by an international group of 54 pathologists. The
-annotations provide detailed localized pattern descriptions for Gleason grading
-in line with international guidelines. Utilizing this dataset, we develop an
-inherently explainable AI system based on a U-Net architecture that provides
-predictions leveraging pathologists' terminology. This approach circumvents
-post-hoc explainability methods while maintaining or exceeding the performance
-of methods trained directly for Gleason pattern segmentation (Dice score: 0.713
-$\pm$ 0.003 trained on explanations vs. 0.691 $\pm$ 0.010 trained on Gleason
-patterns). By employing soft labels during training, we capture the intrinsic
-uncertainty in the data, yielding strong results in Gleason pattern
-segmentation even in the context of high interobserver variability. With the
-release of this dataset, we aim to encourage further research into segmentation
-in medical tasks with high levels of subjectivity and to advance the
-understanding of pathologists' reasoning processes.
+Recommender systems (RS) serve as a fundamental tool for navigating the vast
+expanse of online information, with deep learning advancements playing an
+increasingly important role in improving ranking accuracy. Among these, graph
+neural networks (GNNs) excel at extracting higher-order structural information,
+while large language models (LLMs) are designed to process and comprehend
+natural language, making both approaches highly effective and widely adopted.
+Recent research has focused on graph foundation models (GFMs), which integrate
+the strengths of GNNs and LLMs to model complex RS problems more efficiently by
+leveraging the graph-based structure of user-item relationships alongside
+textual understanding. In this survey, we provide a comprehensive overview of
+GFM-based RS technologies by introducing a clear taxonomy of current
+approaches, diving into methodological details, and highlighting key challenges
+and future directions. By synthesizing recent advancements, we aim to offer
+valuable insights into the evolving landscape of GFM-based recommender systems.
 
-摘要：前列腺癌是全球男性最常見的癌症，其惡性程度主要根據 Gleason 評分系統使用組織病理學數據進行評估。雖然人工智慧 (AI) 在準確預測 Gleason 評分方面已展現潛力，但這些預測通常缺乏內在的可解釋性，可能會導致對人機互動的不信任。為了解決這個問題，我們引進了一個由 54 位病理學家組成的國際團隊註解的 1,015 個組織微陣列核心影像的新穎資料集。這些註解提供了詳細的局部模式描述，用於符合國際準則的 Gleason 分級。利用這個資料集，我們開發了一個基於 U-Net 架構的內在可解釋 AI 系統，該系統提供了利用病理學家術語進行預測。這種方法規避了事後可解釋性方法，同時維持或超越了直接訓練用於 Gleason 模式分割的方法的效能（Dice 分數：0.713 ± 0.003，訓練於解釋，相對於 0.691 ± 0.010，訓練於 Gleason 模式）。透過在訓練期間採用軟標籤，我們捕捉了資料中的內在不確定性，即使在觀察者間變異性高的情況下，也能在 Gleason 模式分割中產生強大的結果。透過釋出這個資料集，我們旨在鼓勵進一步研究主觀性高的醫療任務中的分割，並增進對病理學家推理過程的理解。
+摘要：推薦系統 (RS) 是導航廣闊線上資訊的基本工具，深度學習的進展在提升排名準確度方面扮演著日益重要的角色。在這些進展中，圖形神經網路 (GNN) 擅長萃取高階結構資訊，而大型語言模型 (LLM) 則設計用於處理和理解自然語言，這兩種方法都非常有效且廣泛採用。最近的研究專注於圖形基礎模型 (GFM)，它整合了 GNN 和 LLM 的優點，透過利用使用者與項目關係的圖形化結構，以及文字理解，更有效率地建構複雜的 RS 問題。在這項調查中，我們提供 GFM-based RS 技術的全面概觀，介紹當前方法的明確分類法，深入探討方法論的細節，並強調關鍵挑戰和未來方向。透過綜合最近的進展，我們旨在提供有價值的見解，了解 GFM-based 推薦系統不斷演變的樣貌。
 
-##### **Explainable AI Methods for Multi-Omics Analysis: A Survey**
-2410.11910v1 by Ahmad Hussein, Mukesh Prasad, Ali Braytee
+##### **Self-Evaluation for Job-Shop Scheduling**
+2502.08684v1 by Imanol Echeverria, Maialen Murua, Roberto Santana
 
-Advancements in high-throughput technologies have led to a shift from
-traditional hypothesis-driven methodologies to data-driven approaches.
-Multi-omics refers to the integrative analysis of data derived from multiple
-'omes', such as genomics, proteomics, transcriptomics, metabolomics, and
-microbiomics. This approach enables a comprehensive understanding of biological
-systems by capturing different layers of biological information. Deep learning
-methods are increasingly utilized to integrate multi-omics data, offering
-insights into molecular interactions and enhancing research into complex
-diseases. However, these models, with their numerous interconnected layers and
-nonlinear relationships, often function as black boxes, lacking transparency in
-decision-making processes. To overcome this challenge, explainable artificial
-intelligence (xAI) methods are crucial for creating transparent models that
-allow clinicians to interpret and work with complex data more effectively. This
-review explores how xAI can improve the interpretability of deep learning
-models in multi-omics research, highlighting its potential to provide
-clinicians with clear insights, thereby facilitating the effective application
-of such models in clinical settings.
+Combinatorial optimization problems, such as scheduling and route planning,
+are crucial in various industries but are computationally intractable due to
+their NP-hard nature. Neural Combinatorial Optimization methods leverage
+machine learning to address these challenges but often depend on sequential
+decision-making, which is prone to error accumulation as small mistakes
+propagate throughout the process. Inspired by self-evaluation techniques in
+Large Language Models, we propose a novel framework that generates and
+evaluates subsets of assignments, moving beyond traditional stepwise
+approaches. Applied to the Job-Shop Scheduling Problem, our method integrates a
+heterogeneous graph neural network with a Transformer to build a policy model
+and a self-evaluation function. Experimental validation on challenging,
+well-known benchmarks demonstrates the effectiveness of our approach,
+surpassing state-of-the-art methods.
 
-摘要：高通量技術的進步導致從傳統的假設驅動方法轉變為資料驅動的方法。多組學是指整合分析來自多個「組學」的資料，例如基因組學、蛋白質組學、轉錄組學、代謝組學和微生物組學。此方法透過擷取生物資訊的不同層面，能全面了解生物系統。深度學習方法愈來愈常被用於整合多組學資料，提供分子交互作用的洞察力，並加強對複雜疾病的研究。然而，這些模型具有許多相互連接的層級和非線性關係，通常會像黑盒子一樣運作，缺乏決策過程的透明度。為了克服此挑戰，可解釋人工智慧 (xAI) 方法對於建立透明模型至關重要，讓臨床醫生可以更有效地解釋和處理複雜資料。此評論探討 xAI 如何能改善多組學研究中深度學習模型的可解釋性，強調其提供臨床醫生明確見解的潛力，進而促進此類模型在臨床環境中的有效應用。
+摘要：組合優化問題，例如排程和路線規劃，在各行各業中至關重要，但由於它們的 NP 難度，在計算上難以處理。神經組合優化方法利用機器學習來解決這些挑戰，但通常依賴於序貫決策制定，而序貫決策制定容易發生錯誤累積，因為小錯誤會在整個過程中傳播。受大型語言模型中的自我評估技術啟發，我們提出了一個新的框架，可生成和評估作業子集，超越傳統的分步方法。應用於工作車間排程問題，我們的方法將異質圖神經網路與 Transformer 整合在一起，以建立策略模型和自我評估函數。在具有挑戰性的著名基準上的實驗驗證證明了我們方法的有效性，超越了最先進的方法。
 
-##### **Study on the Helpfulness of Explainable Artificial Intelligence**
-2410.11896v1 by Tobias Labarta, Elizaveta Kulicheva, Ronja Froelian, Christian Geißler, Xenia Melman, Julian von Klitzing
+##### **Improving Existing Optimization Algorithms with LLMs**
+2502.08298v1 by Camilo Chacón Sartori, Christian Blum
 
-Explainable Artificial Intelligence (XAI) is essential for building advanced
-machine learning-powered applications, especially in critical domains such as
-medical diagnostics or autonomous driving. Legal, business, and ethical
-requirements motivate using effective XAI, but the increasing number of
-different methods makes it challenging to pick the right ones. Further, as
-explanations are highly context-dependent, measuring the effectiveness of XAI
-methods without users can only reveal a limited amount of information,
-excluding human factors such as the ability to understand it. We propose to
-evaluate XAI methods via the user's ability to successfully perform a proxy
-task, designed such that a good performance is an indicator for the explanation
-to provide helpful information. In other words, we address the helpfulness of
-XAI for human decision-making. Further, a user study on state-of-the-art
-methods was conducted, showing differences in their ability to generate trust
-and skepticism and the ability to judge the rightfulness of an AI decision
-correctly. Based on the results, we highly recommend using and extending this
-approach for more objective-based human-centered user studies to measure XAI
-performance in an end-to-end fashion.
+The integration of Large Language Models (LLMs) into optimization has created
+a powerful synergy, opening exciting research opportunities. This paper
+investigates how LLMs can enhance existing optimization algorithms. Using their
+pre-trained knowledge, we demonstrate their ability to propose innovative
+heuristic variations and implementation strategies. To evaluate this, we
+applied a non-trivial optimization algorithm, Construct, Merge, Solve and Adapt
+(CMSA) -- a hybrid metaheuristic for combinatorial optimization problems that
+incorporates a heuristic in the solution construction phase. Our results show
+that an alternative heuristic proposed by GPT-4o outperforms the
+expert-designed heuristic of CMSA, with the performance gap widening on larger
+and denser graphs. Project URL: https://imp-opt-algo-llms.surge.sh/
 
-摘要：可解釋人工智慧 (XAI) 對於建構先進的機器學習驅動應用程式至關重要，特別是在醫療診斷或自動駕駛等關鍵領域。法律、商業和倫理要求促使使用有效的 XAI，但數量日益增加的不同方法使得挑選正確的方法具有挑戰性。此外，由於解釋高度依賴於背景，在沒有使用者的情況下衡量 XAI 方法的有效性只能揭示有限的資訊，排除人類因素，例如理解它的能力。我們建議透過使用者成功執行代理任務的能力來評估 XAI 方法，設計使得良好的執行表現是解釋提供有用資訊的指標。換句話說，我們探討 XAI 對人類決策制定的幫助。此外，對最先進的方法進行使用者研究，顯示出它們在產生信任和懷疑的能力以及正確判斷 AI 決策是否正確的能力方面存在差異。根據結果，我們強烈建議使用和擴充這種方法，以進行更多以目標為基礎的人為中心使用者研究，以終端到終端的方式衡量 XAI 效能。
+摘要：大型语言模型 (LLM) 与优化相结合，创造了一种强大的协同作用，开启了令人兴奋的研究机会。本文探讨了 LLM 如何增强现有的优化算法。利用其预先训练的知识，我们展示了它们提出创新启发式变体和实施策略的能力。为了评估这一点，我们应用了一种非平凡的优化算法，构建、合并、求解和适应 (CMSA)——一种用于组合优化问题的混合元启发式算法，它在求解构建阶段纳入了启发式算法。我们的结果表明，GPT-4o 提出的替代启发式算法优于 CMSA 的专家设计的启发式算法，并且随着图形变得更大、更密集，性能差距也在扩大。项目网址：https://imp-opt-algo-llms.surge.sh/
 
-##### **Use of What-if Scenarios to Help Explain Artificial Intelligence Models for Neonatal Health**
-2410.09635v1 by Abdullah Mamun, Lawrence D. Devoe, Mark I. Evans, David W. Britt, Judith Klein-Seetharaman, Hassan Ghasemzadeh
+##### **ACCESS : A Benchmark for Abstract Causal Event Discovery and Reasoning**
+2502.08148v1 by Vy Vo, Lizhen Qu, Tao Feng, Yuncheng Hua, Xiaoxi Kang, Songhai Fan, Tim Dwyer, Lay-Ki Soon, Gholamreza Haffari
 
-Early detection of intrapartum risk enables interventions to potentially
-prevent or mitigate adverse labor outcomes such as cerebral palsy. Currently,
-there is no accurate automated system to predict such events to assist with
-clinical decision-making. To fill this gap, we propose "Artificial Intelligence
-(AI) for Modeling and Explaining Neonatal Health" (AIMEN), a deep learning
-framework that not only predicts adverse labor outcomes from maternal, fetal,
-obstetrical, and intrapartum risk factors but also provides the model's
-reasoning behind the predictions made. The latter can provide insights into
-what modifications in the input variables of the model could have changed the
-predicted outcome. We address the challenges of imbalance and small datasets by
-synthesizing additional training data using Adaptive Synthetic Sampling
-(ADASYN) and Conditional Tabular Generative Adversarial Networks (CTGAN). AIMEN
-uses an ensemble of fully-connected neural networks as the backbone for its
-classification with the data augmentation supported by either ADASYN or CTGAN.
-AIMEN, supported by CTGAN, outperforms AIMEN supported by ADASYN in
-classification. AIMEN can predict a high risk for adverse labor outcomes with
-an average F1 score of 0.784. It also provides counterfactual explanations that
-can be achieved by changing 2 to 3 attributes on average. Resources available:
-https://github.com/ab9mamun/AIMEN.
+Identifying cause-and-effect relationships is critical to understanding
+real-world dynamics and ultimately causal reasoning. Existing methods for
+identifying event causality in NLP, including those based on Large Language
+Models (LLMs), exhibit difficulties in out-of-distribution settings due to the
+limited scale and heavy reliance on lexical cues within available benchmarks.
+Modern benchmarks, inspired by probabilistic causal inference, have attempted
+to construct causal graphs of events as a robust representation of causal
+knowledge, where \texttt{CRAB} \citep{romanou2023crab} is one such recent
+benchmark along this line. In this paper, we introduce \texttt{ACCESS}, a
+benchmark designed for discovery and reasoning over abstract causal events.
+Unlike existing resources, \texttt{ACCESS} focuses on causality of everyday
+life events on the abstraction level. We propose a pipeline for identifying
+abstractions for event generalizations from \texttt{GLUCOSE}
+\citep{mostafazadeh-etal-2020-glucose}, a large-scale dataset of implicit
+commonsense causal knowledge, from which we subsequently extract $1,4$K causal
+pairs. Our experiments highlight the ongoing challenges of using statistical
+methods and/or LLMs for automatic abstraction identification and causal
+discovery in NLP. Nonetheless, we demonstrate that the abstract causal
+knowledge provided in \texttt{ACCESS} can be leveraged for enhancing QA
+reasoning performance in LLMs.
 
-摘要：產程中風險的早期偵測有助於進行干預措施，以預防或減輕不利的生產結果，例如腦性麻痺。目前，沒有準確的自動化系統可以預測此類事件，以協助臨床決策。為了填補這一空白，我們提出「用於建模和解釋新生兒健康的人工智慧」(AIMEN)，這是一個深度學習架構，它不僅可以根據孕產婦、胎兒、產科和產程風險因素預測不利的生產結果，還能提供模型做出預測背後的原因。後者可以提供見解，說明模型輸入變數中的哪些修改可能會改變預測結果。我們透過使用適應性合成抽樣 (ADASYN) 和條件表格生成對抗網路 (CTGAN) 來合成額外的訓練資料，以解決不平衡和小型資料集的挑戰。AIMEN 使用全連接神經網路的集合作為其分類的骨幹，並透過 ADASYN 或 CTGAN 支援資料擴充。由 CTGAN 支援的 AIMEN 在分類方面優於由 ADASYN 支援的 AIMEN。AIMEN 可以預測不利的生產結果的高風險，平均 F1 分數為 0.784。它還提供反事實解釋，可透過平均變更 2 至 3 個屬性來達成。可用資源：https://github.com/ab9mamun/AIMEN。
+摘要：<paragraph>找出因果關係對於理解現實世界的動態和最終的因果推理至關重要。現有的 NLP 事件因果關係識別方法，包括基於大型語言模型 (LLM) 的方法，由於規模有限且過度依賴於可用基準中的詞彙線索，在分佈外環境中表現出困難。受機率因果推論啟發的現代基準已嘗試建構事件的因果圖，作為因果知識的強健表示，其中 \texttt{CRAB} \citep{romanou2023crab} 是這條路徑上最近的一個基準。在本文中，我們介紹 \texttt{ACCESS}，一個專門設計來探索和推理抽象因果事件的基準。與現有資源不同，\texttt{ACCESS} 專注於抽象層面上日常生活事件的因果關係。我們提出一個管道，用於從 \texttt{GLUCOSE} \citep{mostafazadeh-etal-2020-glucose} 找出事件概括的抽象，\texttt{GLUCOSE} 是隱含常識因果知識的大規模資料集，我們隨後從中萃取出 1,4K 因果對。我們的實驗突顯出使用統計方法和/或 LLM 進行 NLP 中的自動抽象識別和因果發現的持續挑戰。儘管如此，我們證明了 \texttt{ACCESS} 中提供的抽象因果知識可用於增強 LLM 中的問答推理效能。</paragraph>
 
-##### **Artificial intelligence techniques in inherited retinal diseases: A review**
-2410.09105v1 by Han Trinh, Jordan Vice, Jason Charng, Zahra Tajbakhsh, Khyber Alam, Fred K. Chen, Ajmal Mian
+##### **GCoT: Chain-of-Thought Prompt Learning for Graphs**
+2502.08092v1 by Xingtong Yu, Chang Zhou, Zhongwei Kuai, Xinming Zhang, Yuan Fang
 
-Inherited retinal diseases (IRDs) are a diverse group of genetic disorders
-that lead to progressive vision loss and are a major cause of blindness in
-working-age adults. The complexity and heterogeneity of IRDs pose significant
-challenges in diagnosis, prognosis, and management. Recent advancements in
-artificial intelligence (AI) offer promising solutions to these challenges.
-However, the rapid development of AI techniques and their varied applications
-have led to fragmented knowledge in this field. This review consolidates
-existing studies, identifies gaps, and provides an overview of AI's potential
-in diagnosing and managing IRDs. It aims to structure pathways for advancing
-clinical applications by exploring AI techniques like machine learning and deep
-learning, particularly in disease detection, progression prediction, and
-personalized treatment planning. Special focus is placed on the effectiveness
-of convolutional neural networks in these areas. Additionally, the integration
-of explainable AI is discussed, emphasizing its importance in clinical settings
-to improve transparency and trust in AI-based systems. The review addresses the
-need to bridge existing gaps in focused studies on AI's role in IRDs, offering
-a structured analysis of current AI techniques and outlining future research
-directions. It concludes with an overview of the challenges and opportunities
-in deploying AI for IRDs, highlighting the need for interdisciplinary
-collaboration and the continuous development of robust, interpretable AI models
-to advance clinical applications.
+Chain-of-thought (CoT) prompting has achieved remarkable success in natural
+language processing (NLP). However, its vast potential remains largely
+unexplored for graphs. This raises an interesting question: How can we design
+CoT prompting for graphs to guide graph models to learn step by step? On one
+hand, unlike natural languages, graphs are non-linear and characterized by
+complex topological structures. On the other hand, many graphs lack textual
+data, making it difficult to formulate language-based CoT prompting. In this
+work, we propose the first CoT prompt learning framework for text-free graphs,
+GCoT. Specifically, we decompose the adaptation process for each downstream
+task into a series of inference steps, with each step consisting of
+prompt-based inference, ``thought'' generation, and thought-conditioned prompt
+learning. While the steps mimic CoT prompting in NLP, the exact mechanism
+differs significantly. Specifically, at each step, an input graph, along with a
+prompt, is first fed into a pre-trained graph encoder for prompt-based
+inference. We then aggregate the hidden layers of the encoder to construct a
+``thought'', which captures the working state of each node in the current step.
+Conditioned on this thought, we learn a prompt specific to each node based on
+the current state. These prompts are fed into the next inference step,
+repeating the cycle. To evaluate and analyze the effectiveness of GCoT, we
+conduct comprehensive experiments on eight public datasets, which demonstrate
+the advantage of our approach.
 
-摘要：遺傳性視網膜疾病 (IRD) 是一組多樣化的遺傳疾病，
-會導致視力逐漸喪失，是工作年齡成人失明的主要原因。IRD 的複雜性和異質性對診斷、預後和管理提出了重大挑戰。最近人工智能 (AI) 的進步為這些挑戰提供了有希望的解決方案。
-然而，AI 技術的快速發展及其多種應用導致了該領域的知識分散。本綜述整合了現有研究，找出差距，並概述了 AI 在診斷和管理 IRD 中的潛力。它旨在通過探索機器學習和深度學習等 AI 技術，特別是在疾病檢測、進程預測和個性化治療計劃中，為推進臨床應用構建途徑。特別關注這些領域中卷積神經網路的有效性。此外，討論了可解釋 AI 的整合，強調了其在臨床環境中提高透明度和對基於 AI 的系統的信任的重要性。該綜述解決了彌合 AI 在 IRD 中作用的重點研究中現有差距的必要性，提供了對當前 AI 技術的結構化分析，並概述了未來的研究方向。最後概述了在 IRD 中部署 AI 的挑戰和機遇，強調了跨學科合作和持續開發強大、可解釋的 AI 模型以推進臨床應用的必要性。
+摘要：<paragraph>鏈式思考 (CoT) 提示在自然語言處理 (NLP) 中取得了顯著的成功。然而，其龐大的潛力在圖形方面仍未得到充分探索。這提出了一個有趣的問題：我們如何設計圖形的 CoT 提示來指導圖形模型逐步學習？一方面，與自然語言不同，圖形是非線性的，並且具有複雜的拓撲結構。另一方面，許多圖形缺乏文本數據，這使得難以制定基於語言的 CoT 提示。在這項工作中，我們提出了第一個適用於無文本圖形的 CoT 提示學習框架 GCoT。具體來說，我們將每個下游任務的適應過程分解為一系列推理步驟，每個步驟都包含基於提示的推理、「思想」生成以及基於思想的提示學習。雖然這些步驟模擬了 NLP 中的 CoT 提示，但具體機制卻有很大不同。具體來說，在每一步中，一個輸入圖形連同一個提示首先被輸入到一個預訓練的圖形編碼器中進行基於提示的推理。然後，我們聚合編碼器的隱藏層以構建一個「思想」，它捕獲了當前步驟中每個節點的工作狀態。基於這個思想，我們根據當前狀態學習一個特定於每個節點的提示。這些提示被輸入到下一個推理步驟中，重複這個循環。為了評估和分析 GCoT 的有效性，我們對八個公共數據集進行了全面的實驗，這證明了我們方法的優勢。</paragraph>
 
-##### **CasiMedicos-Arg: A Medical Question Answering Dataset Annotated with Explanatory Argumentative Structures**
-2410.05235v2 by Ekaterina Sviridova, Anar Yeginbergen, Ainara Estarrona, Elena Cabrio, Serena Villata, Rodrigo Agerri
+##### **Deep Semantic Graph Learning via LLM based Node Enhancement**
+2502.07982v1 by Chuanqi Shi, Yiyi Tao, Hang Zhang, Lun Wang, Shaoshuai Du, Yixian Shen, Yanxin Shen
 
-Explaining Artificial Intelligence (AI) decisions is a major challenge
-nowadays in AI, in particular when applied to sensitive scenarios like medicine
-and law. However, the need to explain the rationale behind decisions is a main
-issue also for human-based deliberation as it is important to justify
-\textit{why} a certain decision has been taken. Resident medical doctors for
-instance are required not only to provide a (possibly correct) diagnosis, but
-also to explain how they reached a certain conclusion. Developing new tools to
-aid residents to train their explanation skills is therefore a central
-objective of AI in education. In this paper, we follow this direction, and we
-present, to the best of our knowledge, the first multilingual dataset for
-Medical Question Answering where correct and incorrect diagnoses for a clinical
-case are enriched with a natural language explanation written by doctors. These
-explanations have been manually annotated with argument components (i.e.,
-premise, claim) and argument relations (i.e., attack, support), resulting in
-the Multilingual CasiMedicos-Arg dataset which consists of 558 clinical cases
-in four languages (English, Spanish, French, Italian) with explanations, where
-we annotated 5021 claims, 2313 premises, 2431 support relations, and 1106
-attack relations. We conclude by showing how competitive baselines perform over
-this challenging dataset for the argument mining task.
+Graph learning has attracted significant attention due to its widespread
+real-world applications. Current mainstream approaches rely on text node
+features and obtain initial node embeddings through shallow embedding learning
+using GNNs, which shows limitations in capturing deep textual semantics. Recent
+advances in Large Language Models (LLMs) have demonstrated superior
+capabilities in understanding text semantics, transforming traditional text
+feature processing. This paper proposes a novel framework that combines Graph
+Transformer architecture with LLM-enhanced node features. Specifically, we
+leverage LLMs to generate rich semantic representations of text nodes, which
+are then processed by a multi-head self-attention mechanism in the Graph
+Transformer to capture both local and global graph structural information. Our
+model utilizes the Transformer's attention mechanism to dynamically aggregate
+neighborhood information while preserving the semantic richness provided by LLM
+embeddings. Experimental results demonstrate that the LLM-enhanced node
+features significantly improve the performance of graph learning models on node
+classification tasks. This approach shows promising results across multiple
+graph learning tasks, offering a practical direction for combining graph
+networks with language models.
 
-摘要：解釋人工智慧 (AI) 的決策是現在 AI 的一項重大挑戰，特別是應用於像醫學和法律等敏感情境時。然而，解釋決策背後理由的需求也是基於人類的考量的一個主要問題，因為有必要證明為什麼做出某個決策。例如，住院醫師不僅需要提供（可能是正確的）診斷，還需要解釋他們如何達成某個結論。因此，開發新的工具來幫助住院醫師訓練他們的解釋技巧是教育中 AI 的一項核心目標。在本文中，我們遵循這個方向，並且根據我們的了解，提出第一個多語言醫學問答資料集，其中臨床病例的正確和不正確診斷都附有由醫生撰寫的自然語言解釋。這些解釋已使用論證組成（即前提、主張）和論證關係（即攻擊、支持）進行手動註解，產生多語言 CasiMedicos-Arg 資料集，其中包含 558 個具有解釋的四種語言（英語、西班牙語、法語、義大利語）的臨床病例，我們註解了 5021 個主張、2313 個前提、2431 個支持關係和 1106 個攻擊關係。我們最後展示了競爭基準如何針對論證探勘任務執行此具挑戰性的資料集。
+摘要：圖形學習因其廣泛的現實世界應用而備受關注。目前的熱門方法依賴於文本節點特徵，並通過使用 GNN 的淺層嵌入學習來獲取初始節點嵌入，這在捕捉深度文本語義方面表現出局限性。大語言模型 (LLM) 的最新進展已證明在理解文本語義方面具有優越的能力，轉換了傳統的文本特徵處理。本文提出了一種新的框架，將圖形轉換器架構與 LLM 增強的節點特徵相結合。具體來說，我們利用 LLM 來生成文本節點的豐富語義表示，然後在圖形轉換器中由多頭自我注意機制處理，以捕捉局部和全局圖形結構信息。我們的模型利用 Transformer 的注意機制來動態聚合鄰域信息，同時保留 LLM 嵌入提供的語義豐富性。實驗結果表明，LLM 增強的節點特徵顯著提高了圖形學習模型在節點分類任務上的性能。這種方法在多個圖形學習任務中顯示出有希望的結果，為將圖形網絡與語言模型相結合提供了實用的方向。
 
-##### **Explainable Diagnosis Prediction through Neuro-Symbolic Integration**
-2410.01855v2 by Qiuhao Lu, Rui Li, Elham Sagheb, Andrew Wen, Jinlian Wang, Liwei Wang, Jungwei W. Fan, Hongfang Liu
+##### **Cardiverse: Harnessing LLMs for Novel Card Game Prototyping**
+2502.07128v1 by Danrui Li, Sen Zhang, Sam S. Sohn, Kaidong Hu, Muhammad Usman, Mubbasir Kapadia
 
-Diagnosis prediction is a critical task in healthcare, where timely and
-accurate identification of medical conditions can significantly impact patient
-outcomes. Traditional machine learning and deep learning models have achieved
-notable success in this domain but often lack interpretability which is a
-crucial requirement in clinical settings. In this study, we explore the use of
-neuro-symbolic methods, specifically Logical Neural Networks (LNNs), to develop
-explainable models for diagnosis prediction. Essentially, we design and
-implement LNN-based models that integrate domain-specific knowledge through
-logical rules with learnable thresholds. Our models, particularly
-$M_{\text{multi-pathway}}$ and $M_{\text{comprehensive}}$, demonstrate superior
-performance over traditional models such as Logistic Regression, SVM, and
-Random Forest, achieving higher accuracy (up to 80.52\%) and AUROC scores (up
-to 0.8457) in the case study of diabetes prediction. The learned weights and
-thresholds within the LNN models provide direct insights into feature
-contributions, enhancing interpretability without compromising predictive
-power. These findings highlight the potential of neuro-symbolic approaches in
-bridging the gap between accuracy and explainability in healthcare AI
-applications. By offering transparent and adaptable diagnostic models, our work
-contributes to the advancement of precision medicine and supports the
-development of equitable healthcare solutions. Future research will focus on
-extending these methods to larger and more diverse datasets to further validate
-their applicability across different medical conditions and populations.
+The prototyping of computer games, particularly card games, requires
+extensive human effort in creative ideation and gameplay evaluation. Recent
+advances in Large Language Models (LLMs) offer opportunities to automate and
+streamline these processes. However, it remains challenging for LLMs to design
+novel game mechanics beyond existing databases, generate consistent gameplay
+environments, and develop scalable gameplay AI for large-scale evaluations.
+This paper addresses these challenges by introducing a comprehensive automated
+card game prototyping framework. The approach highlights a graph-based indexing
+method for generating novel game designs, an LLM-driven system for consistent
+game code generation validated by gameplay records, and a gameplay AI
+constructing method that uses an ensemble of LLM-generated action-value
+functions optimized through self-play. These contributions aim to accelerate
+card game prototyping, reduce human labor, and lower barriers to entry for game
+developers.
 
-摘要：診斷預測是醫療保健中的關鍵任務，及時且準確地識別醫療狀況會顯著影響患者的結果。傳統的機器學習和深度學習模型已在這個領域取得顯著成功，但通常缺乏可解釋性，這在臨床環境中是一項關鍵要求。在本研究中，我們探討了神經符號方法的應用，特別是邏輯神經網路 (LNN)，以開發用於診斷預測的可解釋模型。基本上，我們設計並實作了基於 LNN 的模型，這些模型透過具有可學習閾值的邏輯規則整合領域特定知識。我們的模型，特別是 $M_{\text{multi-pathway}}$ 和 $M_{\text{comprehensive}}$，表現出優於傳統模型（例如邏輯迴歸、SVM 和隨機森林）的優異效能，在糖尿病預測的案例研究中達到了更高的準確度（高達 80.52%）和 AUROC 分數（高達 0.8457）。LNN 模型中學習到的權重和閾值提供了對特徵貢獻的直接見解，增強了可解釋性，同時不影響預測能力。這些發現突顯了神經符號方法在彌合醫療保健 AI 應用中準確性和可解釋性差距方面的潛力。透過提供透明且適應性強的診斷模型，我們的研究有助於推進精準醫療，並支援公平醫療保健解決方案的開發。未來的研究將專注於將這些方法擴展到更大且更多樣化的資料集，以進一步驗證其在不同醫療狀況和人群中的適用性。
+摘要：電腦遊戲，尤其是卡牌遊戲的原型製作，需要大量的人力在創意構思和遊戲玩法評估上。大型語言模型 (LLM) 的最新進展提供了自動化和簡化這些流程的機會。然而，LLM 在設計超越現有資料庫的新穎遊戲機制、生成一致的遊戲環境，以及開發用於大規模評估的可擴充遊戲 AI 方面仍然面臨挑戰。本文通過引入一個全面的自動化卡牌遊戲原型製作框架來應對這些挑戰。該方法強調了一種基於圖表的索引方法，用於生成新穎的遊戲設計，一個由 LLM 驅動的系統，用於一致的遊戲程式碼生成，並由遊戲記錄驗證，以及一個遊戲 AI 構建方法，該方法使用由 LLM 生成的動作值函數的集合，通過自我對弈進行最佳化。這些貢獻旨在加速卡牌遊戲原型製作，減少人力，並降低遊戲開發人員的進入門檻。
 
-##### **Easydiagnos: a framework for accurate feature selection for automatic diagnosis in smart healthcare**
-2410.00366v1 by Prasenjit Maji, Amit Kumar Mondal, Hemanta Kumar Mondal, Saraju P. Mohanty
+##### **GraNNite: Enabling High-Performance Execution of Graph Neural Networks on Resource-Constrained Neural Processing Units**
+2502.06921v2 by Arghadip Das, Shamik Kundu, Arnab Raha, Soumendu Ghosh, Deepak Mathaikutty, Vijay Raghunathan
 
-The rapid advancements in artificial intelligence (AI) have revolutionized
-smart healthcare, driving innovations in wearable technologies, continuous
-monitoring devices, and intelligent diagnostic systems. However, security,
-explainability, robustness, and performance optimization challenges remain
-critical barriers to widespread adoption in clinical environments. This
-research presents an innovative algorithmic method using the Adaptive Feature
-Evaluator (AFE) algorithm to improve feature selection in healthcare datasets
-and overcome problems. AFE integrating Genetic Algorithms (GA), Explainable
-Artificial Intelligence (XAI), and Permutation Combination Techniques (PCT),
-the algorithm optimizes Clinical Decision Support Systems (CDSS), thereby
-enhancing predictive accuracy and interpretability. The proposed method is
-validated across three diverse healthcare datasets using six distinct machine
-learning algorithms, demonstrating its robustness and superiority over
-conventional feature selection techniques. The results underscore the
-transformative potential of AFE in smart healthcare, enabling personalized and
-transparent patient care. Notably, the AFE algorithm, when combined with a
-Multi-layer Perceptron (MLP), achieved an accuracy of up to 98.5%, highlighting
-its capability to improve clinical decision-making processes in real-world
-healthcare applications.
+Graph Neural Networks (GNNs) are vital for learning from graph-structured
+data, enabling applications in network analysis, recommendation systems, and
+speech analytics. Deploying them on edge devices like client PCs and laptops
+enhances real-time processing, privacy, and cloud independence. GNNs aid
+Retrieval-Augmented Generation (RAG) for Large Language Models (LLMs) and
+enable event-based vision tasks. However, irregular memory access, sparsity,
+and dynamic structures cause high latency and energy overhead on
+resource-constrained devices. While modern edge processors integrate CPUs,
+GPUs, and NPUs, NPUs designed for data-parallel tasks struggle with irregular
+GNN computations. We introduce GraNNite, the first hardware-aware framework
+optimizing GNN execution on commercial-off-the-shelf (COTS) SOTA DNN
+accelerators via a structured three-step methodology: (1) enabling NPU
+execution, (2) optimizing performance, and (3) trading accuracy for efficiency
+gains. Step 1 employs GraphSplit for workload distribution and StaGr for static
+aggregation, while GrAd and NodePad handle dynamic graphs. Step 2 boosts
+performance using EffOp for control-heavy tasks and GraSp for sparsity
+exploitation. Graph Convolution optimizations PreG, SymG, and CacheG reduce
+redundancy and memory transfers. Step 3 balances quality versus efficiency,
+where QuantGr applies INT8 quantization, and GrAx1, GrAx2, and GrAx3 accelerate
+attention, broadcast-add, and SAGE-max aggregation. On Intel Core Ultra AI PCs,
+GraNNite achieves 2.6X to 7.6X speedups over default NPU mappings and up to
+8.6X energy gains over CPUs and GPUs, delivering 10.8X and 6.7X higher
+performance than CPUs and GPUs, respectively, across GNN models.
 
-摘要：人工智慧 (AI) 的快速進展徹底改變了智慧醫療保健，推動了可穿戴技術、持續監控裝置和智慧診斷系統的創新。然而，安全性、可解釋性、穩健性和效能最佳化挑戰仍然是臨床環境中廣泛採用的關鍵障礙。本研究提出一個創新的演算法方法，使用自適應特徵評估器 (AFE) 演算法來改善醫療保健資料集中的特徵選取並克服問題。AFE 整合了遺傳演算法 (GA)、可解釋人工智慧 (XAI) 和排列組合技術 (PCT)，該演算法最佳化了臨床決策支援系統 (CDSS)，從而提高了預測準確性和可解釋性。所提出的方法使用六種不同的機器學習演算法驗證了三個不同的醫療保健資料集，證明了其穩健性和優於傳統特徵選取技術。結果強調了 AFE 在智慧醫療保健中的轉變潛力，實現了個人化和透明的患者照護。值得注意的是，AFE 演算法與多層感知器 (MLP) 結合使用時，準確度高達 98.5%，突顯了其改善實際醫療保健應用中臨床決策制定流程的能力。
+摘要：圖形神經網路 (GNN) 對於從圖形結構資料中學習至關重要，能應用於網路分析、推薦系統和語音分析。將其部署在邊緣裝置（例如用戶端電腦和筆電）上可增強即時處理、隱私和雲端獨立性。GNN 協助大型語言模型 (LLM) 的檢索增強生成 (RAG)，並支援基於事件的視覺任務。然而，不規則的記憶體存取、稀疏性和動態結構會導致資源受限裝置上的高延遲和能源負擔。儘管現代邊緣處理器整合了 CPU、GPU 和 NPU，但針對資料平行任務所設計的 NPU 難以處理不規則的 GNN 計算。我們引入了 GraNNite，這是第一個硬體感知框架，透過結構化的三步驟方法最佳化商用現成 (COTS) SOTA DNN 加速器上的 GNN 執行：(1) 啟用 NPU 執行，(2) 最佳化效能，以及 (3) 以準確度換取效率提升。步驟 1 使用 GraphSplit 進行工作負載分配，並使用 StaGr 進行靜態聚合，而 GrAd 和 NodePad 則處理動態圖形。步驟 2 使用 EffOp 提升控制密集型任務的效能，並使用 GraSp 進行稀疏性利用。圖形卷積最佳化 PreG、SymG 和 CacheG 減少了冗餘和記憶體傳輸。步驟 3 平衡品質與效率，其中 QuantGr 適用 INT8 量化，而 GrAx1、GrAx2 和 GrAx3 則加速注意力、廣播加法和 SAGE-max 聚合。在 Intel Core Ultra AI PC 上，GraNNite 在預設 NPU 映射上實現了 2.6X 到 7.6X 的加速，在 CPU 和 GPU 上實現了高達 8.6X 的能源增益，在 GNN 模型中分別提供了比 CPU 和 GPU 高出 10.8X 和 6.7X 的效能。
 
-##### **Dermatologist-like explainable AI enhances melanoma diagnosis accuracy: eye-tracking study**
-2409.13476v1 by Tirtha Chanda, Sarah Haggenmueller, Tabea-Clara Bucher, Tim Holland-Letz, Harald Kittler, Philipp Tschandl, Markus V. Heppt, Carola Berking, Jochen S. Utikal, Bastian Schilling, Claudia Buerger, Cristian Navarrete-Dechent, Matthias Goebeler, Jakob Nikolas Kather, Carolin V. Schneider, Benjamin Durani, Hendrike Durani, Martin Jansen, Juliane Wacker, Joerg Wacker, Reader Study Consortium, Titus J. Brinker
+##### **Automatic Annotation Augmentation Boosts Translation between Molecules and Natural Language**
+2502.06634v1 by Zhiqiang Zhong, Simon Sataa-Yu Larsen, Haoyu Guo, Tao Tang, Kuangyu Zhou, Davide Mottin
 
-Artificial intelligence (AI) systems have substantially improved
-dermatologists' diagnostic accuracy for melanoma, with explainable AI (XAI)
-systems further enhancing clinicians' confidence and trust in AI-driven
-decisions. Despite these advancements, there remains a critical need for
-objective evaluation of how dermatologists engage with both AI and XAI tools.
-In this study, 76 dermatologists participated in a reader study, diagnosing 16
-dermoscopic images of melanomas and nevi using an XAI system that provides
-detailed, domain-specific explanations. Eye-tracking technology was employed to
-assess their interactions. Diagnostic performance was compared with that of a
-standard AI system lacking explanatory features. Our findings reveal that XAI
-systems improved balanced diagnostic accuracy by 2.8 percentage points relative
-to standard AI. Moreover, diagnostic disagreements with AI/XAI systems and
-complex lesions were associated with elevated cognitive load, as evidenced by
-increased ocular fixations. These insights have significant implications for
-clinical practice, the design of AI tools for visual tasks, and the broader
-development of XAI in medical diagnostics.
+Recent advancements in AI for biological research focus on integrating
+molecular data with natural language to accelerate drug discovery. However, the
+scarcity of high-quality annotations limits progress in this area. This paper
+introduces LA$^3$, a Language-based Automatic Annotation Augmentation framework
+that leverages large language models to augment existing datasets, thereby
+improving AI training. We demonstrate the effectiveness of LA$^3$ by creating
+an enhanced dataset, LaChEBI-20, where we systematically rewrite the
+annotations of molecules from an established dataset. These rewritten
+annotations preserve essential molecular information while providing more
+varied sentence structures and vocabulary. Using LaChEBI-20, we train LaMolT5
+based on a benchmark architecture to learn the mapping between molecular
+representations and augmented annotations.
+  Experimental results on text-based *de novo* molecule generation and molecule
+captioning demonstrate that LaMolT5 outperforms state-of-the-art models.
+Notably, incorporating LA$^3$ leads to improvements of up to 301% over the
+benchmark architecture. Furthermore, we validate the effectiveness of LA$^3$
+notable applications in *image*, *text* and *graph* tasks, affirming its
+versatility and utility.
 
-摘要：人工智慧 (AI) 系統已大幅改善皮膚科醫師對黑色素瘤的診斷準確度，而可解釋 AI (XAI) 系統進一步提升臨床醫師對 AI 驅動決策的信心與信賴。儘管有這些進展，對於皮膚科醫師如何使用 AI 和 XAI 工具，仍有客觀評估的迫切需求。在這項研究中，76 位皮膚科醫師參與了一項讀者研究，使用 XAI 系統診斷 16 張黑色素瘤和痣的皮膚鏡影像，該系統提供詳細的領域特定說明。採用眼球追蹤技術來評估他們的互動。將診斷表現與缺乏說明功能的標準 AI 系統進行比較。我們的研究結果顯示，XAI 系統相較於標準 AI，將平衡診斷準確度提升了 2.8 個百分點。此外，與 AI/XAI 系統的診斷分歧和複雜的病灶與認知負擔升高有關，這由增加的眼睛注視次數所證實。這些見解對臨床實務、視覺任務 AI 工具的設計和醫學診斷中 XAI 的廣泛發展具有重大意義。
+摘要：<paragraph>人工智慧在生物研究上的最新進展，專注於將分子資料與自然語言整合，以加速藥物發現。然而，高品質註解的稀少限制了此領域的進展。這篇論文介紹了 LA$^3$，一個基於語言的自動註解擴充框架，它利用大型語言模型來擴充現有的資料集，進而改善人工智慧訓練。我們透過建立一個增強的資料集 LaChEBI-20 來展示 LA$^3$ 的有效性，我們系統性地改寫了一個既定資料集中分子的註解。這些改寫的註解保留了重要的分子資訊，同時提供了更多樣化的句子結構和詞彙。使用 LaChEBI-20，我們在基於基準架構上訓練 LaMolT5，以學習分子表示和擴充註解之間的對應。
+在基於文字的 *從頭開始* 分子生成和分子標題上的實驗結果表明，LaMolT5 優於最先進的模型。值得注意的是，納入 LA$^3$ 可讓基準架構的改進幅度高達 301%。此外，我們驗證了 LA$^3$ 在 *影像*、*文字* 和 *圖形* 任務中的有效性，肯定了它的多功能性和實用性。</paragraph>
 
-##### **Explainable AI for Autism Diagnosis: Identifying Critical Brain Regions Using fMRI Data**
-2409.15374v1 by Suryansh Vidya, Kush Gupta, Amir Aly, Andy Wills, Emmanuel Ifeachor, Rohit Shankar
+##### **KARMA: Leveraging Multi-Agent LLMs for Automated Knowledge Graph Enrichment**
+2502.06472v1 by Yuxing Lu, Jinzhuo Wang
 
-Early diagnosis and intervention for Autism Spectrum Disorder (ASD) has been
-shown to significantly improve the quality of life of autistic individuals.
-However, diagnostics methods for ASD rely on assessments based on clinical
-presentation that are prone to bias and can be challenging to arrive at an
-early diagnosis. There is a need for objective biomarkers of ASD which can help
-improve diagnostic accuracy. Deep learning (DL) has achieved outstanding
-performance in diagnosing diseases and conditions from medical imaging data.
-Extensive research has been conducted on creating models that classify ASD
-using resting-state functional Magnetic Resonance Imaging (fMRI) data. However,
-existing models lack interpretability. This research aims to improve the
-accuracy and interpretability of ASD diagnosis by creating a DL model that can
-not only accurately classify ASD but also provide explainable insights into its
-working. The dataset used is a preprocessed version of the Autism Brain Imaging
-Data Exchange (ABIDE) with 884 samples. Our findings show a model that can
-accurately classify ASD and highlight critical brain regions differing between
-ASD and typical controls, with potential implications for early diagnosis and
-understanding of the neural basis of ASD. These findings are validated by
-studies in the literature that use different datasets and modalities,
-confirming that the model actually learned characteristics of ASD and not just
-the dataset. This study advances the field of explainable AI in medical imaging
-by providing a robust and interpretable model, thereby contributing to a future
-with objective and reliable ASD diagnostics.
+Maintaining comprehensive and up-to-date knowledge graphs (KGs) is critical
+for modern AI systems, but manual curation struggles to scale with the rapid
+growth of scientific literature. This paper presents KARMA, a novel framework
+employing multi-agent large language models (LLMs) to automate KG enrichment
+through structured analysis of unstructured text. Our approach employs nine
+collaborative agents, spanning entity discovery, relation extraction, schema
+alignment, and conflict resolution that iteratively parse documents, verify
+extracted knowledge, and integrate it into existing graph structures while
+adhering to domain-specific schema. Experiments on 1,200 PubMed articles from
+three different domains demonstrate the effectiveness of KARMA in knowledge
+graph enrichment, with the identification of up to 38,230 new entities while
+achieving 83.1\% LLM-verified correctness and reducing conflict edges by 18.6\%
+through multi-layer assessments.
 
-摘要：自閉症譜系障礙 (ASD) 的早期診斷和介入已被證實能顯著改善自閉症患者的生活品質。然而，ASD 的診斷方法依賴於基於臨床表現的評估，容易產生偏見，且可能難以做出早期診斷。有必要找出 ASD 的客觀生物標記，以幫助提高診斷準確性。深度學習 (DL) 在從醫學影像資料診斷疾病和病症方面取得傑出的表現。已經針對建立使用靜態功能性磁振造影 (fMRI) 資料對 ASD 進行分類的模型進行廣泛的研究。然而，現有的模型缺乏可解釋性。本研究旨在透過建立一個不僅能準確分類 ASD，還能提供可解釋見解說明其運作原理的 DL 模型，來改善 ASD 診斷的準確性和可解釋性。所使用的資料集是自閉症大腦影像資料交換 (ABIDE) 的預處理版本，包含 884 個樣本。我們的研究結果顯示，該模型能準確分類 ASD，並強調 ASD 與典型對照組之間存在差異的關鍵腦區，對於 ASD 的早期診斷和神經基礎的理解具有潛在的意義。這些研究結果已由使用不同資料集和方式的文獻研究驗證，證實該模型實際上學習了 ASD 的特徵，而不僅僅是資料集。本研究透過提供一個強健且可解釋的模型，推動了醫學影像中可解釋 AI 的領域，從而為未來提供客觀且可靠的 ASD 診斷做出貢獻。
+摘要：維護全面且最新的知識圖譜 (KG) 對現代 AI 系統至關重要，但手動策劃難以隨著科學文獻的快速增長而擴展。本文提出了 KARMA，一個採用多代理大型語言模型 (LLM) 的新框架，透過對非結構化文本的結構化分析來自動化 KG 豐富化。我們的做法採用九個協作代理，涵蓋實體發現、關係提取、架構比對和衝突解決，這些代理會反覆分析文件、驗證提取的知識，並將其整合到現有的圖結構中，同時遵守特定領域的架構。針對來自三個不同領域的 1,200 篇 PubMed 文章進行的實驗證明了 KARMA 在知識圖譜豐富化方面的有效性，識別出多達 38,230 個新實體，同時達到 83.1% 的 LLM 驗證正確性，並透過多層評估將衝突邊緣降低了 18.6%。
 
-##### **Improving Prototypical Parts Abstraction for Case-Based Reasoning Explanations Designed for the Kidney Stone Type Recognition**
-2409.12883v1 by Daniel Flores-Araiza, Francisco Lopez-Tiro, Clément Larose, Salvador Hinojosa, Andres Mendez-Vazquez, Miguel Gonzalez-Mendoza, Gilberto Ochoa-Ruiz, Christian Daul
+##### **RoToR: Towards More Reliable Responses for Order-Invariant Inputs**
+2502.08662v1 by Soyoung Yoon, Dongha Ahn, Youngwon Lee, Minkyu Jung, HyungJoo Jang, Seung-won Hwang
 
-The in-vivo identification of the kidney stone types during an ureteroscopy
-would be a major medical advance in urology, as it could reduce the time of the
-tedious renal calculi extraction process, while diminishing infection risks.
-Furthermore, such an automated procedure would make possible to prescribe
-anti-recurrence treatments immediately. Nowadays, only few experienced
-urologists are able to recognize the kidney stone types in the images of the
-videos displayed on a screen during the endoscopy. Thus, several deep learning
-(DL) models have recently been proposed to automatically recognize the kidney
-stone types using ureteroscopic images. However, these DL models are of black
-box nature whicl limits their applicability in clinical settings. This
-contribution proposes a case-based reasoning DL model which uses prototypical
-parts (PPs) and generates local and global descriptors. The PPs encode for each
-class (i.e., kidney stone type) visual feature information (hue, saturation,
-intensity and textures) similar to that used by biologists. The PPs are
-optimally generated due a new loss function used during the model training.
-Moreover, the local and global descriptors of PPs allow to explain the
-decisions ("what" information, "where in the images") in an understandable way
-for biologists and urologists. The proposed DL model has been tested on a
-database including images of the six most widespread kidney stone types. The
-overall average classification accuracy was 90.37. When comparing this results
-with that of the eight other DL models of the kidney stone state-of-the-art, it
-can be seen that the valuable gain in explanability was not reached at the
-expense of accuracy which was even slightly increased with respect to that
-(88.2) of the best method of the literature. These promising and interpretable
-results also encourage urologists to put their trust in AI-based solutions.
+Mitigating positional bias of language models (LMs) for listwise inputs is a
+well-known and important problem (e.g., lost-in-the-middle). While zero-shot
+order-invariant LMs have been proposed to solve this issue, their success on
+practical listwise problems has been limited. In this work, as a first
+contribution, we identify and overcome two limitations to make zero-shot
+invariant LMs more practical: (1) training and inference distribution mismatch
+arising from modifying positional ID assignments to enforce invariance, and (2)
+failure to adapt to a mixture of order-invariant and sensitive inputs in
+practical listwise problems. To overcome, we propose (1) RoToR, a zero-shot
+invariant LM for genuinely order-invariant inputs with minimal modifications of
+positional IDs, and (2) Selective Routing, an adaptive framework that handles
+both order-invariant and order-sensitive inputs in listwise tasks. On the Lost
+in the middle (LitM), Knowledge Graph Question Answering (KGQA), and MMLU
+benchmarks, we show that RoToR with Selective Routing can effectively handle
+practical listwise input tasks in a zero-shot manner.
 
-摘要：尿路鏡檢查中腎結石類型的體內識別將是泌尿科的一項重大進展，因為它可以減少繁瑣的腎結石取出過程的時間，同時降低感染風險。此外，這種自動化程序將使立即開立抗復發治療成為可能。如今，只有少數經驗豐富的泌尿科醫生能夠在內視鏡檢查期間屏幕上顯示的視頻圖像中識別腎結石類型。因此，最近已提出多種深度學習 (DL) 模型，以使用輸尿管鏡圖像自動識別腎結石類型。然而，這些 DL 模型本質上是黑盒子，這限制了它們在臨床環境中的應用性。本文提出了一個基於案例推理的 DL 模型，它使用原型部分 (PP) 並生成局部和全局描述符。PP 為每種類型（即腎結石類型）編碼視覺特徵信息（色調、飽和度、強度和紋理），類似於生物學家使用的信息。由於在模型訓練期間使用的新損失函數，PP 得到了最佳生成。此外，PP 的局部和全局描述符允許以生物學家和泌尿科醫生可以理解的方式解釋決策（“什麼”信息，“圖像中的什麼位置”）。所提出的 DL 模型已在一個包含六種最廣泛的腎結石類型圖像的數據庫上進行了測試。總體平均分類準確率為 90.37。將此結果與腎結石最先進的八個其他 DL 模型的結果進行比較時，可以看出，可解釋性的寶貴增益並未以準確性為代價，甚至略有增加與文獻中最好的方法 (88.2) 相比。這些有希望且可解釋的結果也鼓勵泌尿科醫生相信基於人工智能的解決方案。
+摘要：語言模型 (LM) 的位置偏差緩解對於列表輸入來說是一個廣為人知且重要的問題（例如，迷失在中間）。雖然已經提出零次學習順序不變的 LM 來解決這個問題，但它們在實際列表問題上的成功卻很有限。在這項工作中，作為第一個貢獻，我們找出並克服了兩個限制，讓零次學習不變的 LM 更有實用性：(1) 訓練和推論分布不匹配，這是由於修改位置 ID 分配以強制不變性所造成的，以及 (2) 無法適應實際列表問題中不變和敏感輸入的組合。為了克服這些問題，我們提出 (1) RoToR，一個零次學習不變的 LM，用於真正不變的輸入，並對位置 ID 進行最小的修改，以及 (2) 選擇性路由，一個自適應框架，用於處理列表任務中不變和敏感的輸入。在迷失在中間 (LitM)、知識圖譜問答 (KGQA) 和 MMLU 基準測試中，我們展示了 RoToR 與選擇性路由可以有效地以零次學習的方式處理實際的列表輸入任務。
 
-##### **Towards Interpretable End-Stage Renal Disease (ESRD) Prediction: Utilizing Administrative Claims Data with Explainable AI Techniques**
-2409.12087v3 by Yubo Li, Saba Al-Sayouri, Rema Padman
+##### **K-ON: Stacking Knowledge On the Head Layer of Large Language Model**
+2502.06257v1 by Lingbing Guo, Yichi Zhang, Zhongpu Bo, Zhuo Chen, Mengshu Sun, Zhiqiang Zhang, Wen Zhang, Huajun Chen
 
-This study explores the potential of utilizing administrative claims data,
-combined with advanced machine learning and deep learning techniques, to
-predict the progression of Chronic Kidney Disease (CKD) to End-Stage Renal
-Disease (ESRD). We analyze a comprehensive, 10-year dataset provided by a major
-health insurance organization to develop prediction models for multiple
-observation windows using traditional machine learning methods such as Random
-Forest and XGBoost as well as deep learning approaches such as Long Short-Term
-Memory (LSTM) networks. Our findings demonstrate that the LSTM model,
-particularly with a 24-month observation window, exhibits superior performance
-in predicting ESRD progression, outperforming existing models in the
-literature. We further apply SHapley Additive exPlanations (SHAP) analysis to
-enhance interpretability, providing insights into the impact of individual
-features on predictions at the individual patient level. This study underscores
-the value of leveraging administrative claims data for CKD management and
-predicting ESRD progression.
+Recent advancements in large language models (LLMs) have significantly
+improved various natural language processing (NLP) tasks. Typically, LLMs are
+trained to predict the next token, aligning well with many NLP tasks. However,
+in knowledge graph (KG) scenarios, entities are the fundamental units and
+identifying an entity requires at least several tokens. This leads to a
+granularity mismatch between KGs and natural languages. To address this issue,
+we propose K-ON, which integrates KG knowledge into the LLM by employing
+multiple head layers for next k-step prediction. K-ON can not only generate
+entity-level results in one step, but also enables contrastive loss against
+entities, which is the most powerful tool in KG representation learning.
+Experimental results show that K-ON outperforms state-of-the-art methods that
+incorporate text and even the other modalities.
 
-摘要：本研究探討利用行政申報資料，結合先進機器學習與深度學習技術，預測慢性腎臟病 (CKD) 進展至末期腎臟疾病 (ESRD) 的可能性。我們分析一家大型健康保險組織提供的 10 年綜合資料集，使用傳統機器學習方法（例如隨機森林和 XGBoost）以及深度學習方法（例如長期短期記憶 (LSTM) 網路）開發多個觀察視窗的預測模型。我們的研究結果顯示，LSTM 模型（尤其是 24 個月觀察視窗）在預測 ESRD 進展方面表現優異，優於文獻中的現有模型。我們進一步應用 SHapley 可加性解釋 (SHAP) 分析以增強可解釋性，深入了解個別特徵對個別患者層級預測的影響。本研究強調了利用行政申報資料進行 CKD 管理和預測 ESRD 進展的價值。
+摘要：大型語言模型 (LLM) 的最新進展顯著提升了各種自然語言處理 (NLP) 任務。通常，LLM 會接受訓練以預測下一個符號，這與許多 NLP 任務非常吻合。然而，在知識圖譜 (KG) 場景中，實體是基本單位，而識別實體至少需要幾個符號。這導致 KG 和自然語言之間的粒度不匹配。為了解決這個問題，我們提出了 K-ON，它透過採用多個頭部層進行下一個 k 步預測，將 KG 知識整合到 LLM 中。K-ON 不僅可以在一個步驟中產生實體層級的結果，還能針對實體啟用對比損失，這是 KG 表示學習中最有力的工具。實驗結果顯示，K-ON 優於將文字甚至其他方式納入考量的最新方法。
 
-##### **Contextual Evaluation of Large Language Models for Classifying Tropical and Infectious Diseases**
-2409.09201v3 by Mercy Asiedu, Nenad Tomasev, Chintan Ghate, Tiya Tiyasirichokchai, Awa Dieng, Oluwatosin Akande, Geoffrey Siwo, Steve Adudans, Sylvanus Aitkins, Odianosen Ehiakhamen, Eric Ndombi, Katherine Heller
+##### **LegalViz: Legal Text Visualization by Text To Diagram Generation**
+2502.06147v2 by Eri Onami, Taiki Miyanishi, Koki Maeda, Shuhei Kurita
 
-While large language models (LLMs) have shown promise for medical question
-answering, there is limited work focused on tropical and infectious
-disease-specific exploration. We build on an opensource tropical and infectious
-diseases (TRINDs) dataset, expanding it to include demographic and semantic
-clinical and consumer augmentations yielding 11000+ prompts. We evaluate LLM
-performance on these, comparing generalist and medical LLMs, as well as LLM
-outcomes to human experts. We demonstrate through systematic experimentation,
-the benefit of contextual information such as demographics, location, gender,
-risk factors for optimal LLM response. Finally we develop a prototype of
-TRINDs-LM, a research tool that provides a playground to navigate how context
-impacts LLM outputs for health.
+Legal documents including judgments and court orders require highly
+sophisticated legal knowledge for understanding. To disclose expert knowledge
+for non-experts, we explore the problem of visualizing legal texts with
+easy-to-understand diagrams and propose a novel dataset of LegalViz with 23
+languages and 7,010 cases of legal document and visualization pairs, using the
+DOT graph description language of Graphviz. LegalViz provides a simple diagram
+from a complicated legal corpus identifying legal entities, transactions, legal
+sources, and statements at a glance, that are essential in each judgment. In
+addition, we provide new evaluation metrics for the legal diagram visualization
+by considering graph structures, textual similarities, and legal contents. We
+conducted empirical studies on few-shot and finetuning large language models
+for generating legal diagrams and evaluated them with these metrics, including
+legal content-based evaluation within 23 languages. Models trained with
+LegalViz outperform existing models including GPTs, confirming the
+effectiveness of our dataset.
 
-摘要：儘管大型語言模型 (LLM) 在醫療問題解答方面展現出前景，但專注於熱帶和傳染病特定探索的研究有限。我們建立在一個開放原始碼熱帶和傳染病 (TRINDs) 資料集上，並將其擴展為納入人口統計和語義臨床和消費者擴充，產生超過 11000 個提示。我們評估了 LLM 在這些方面的效能，比較了通才和醫療 LLM，以及 LLM 結果與人類專家的比較。我們透過系統性實驗證明了背景資訊（例如人口統計、位置、性別、最佳 LLM 回應的風險因素）的好處。最後，我們開發了 TRINDs-LM 的原型，這是一個研究工具，提供一個探索背景如何影響 LLM 健康輸出的平台。
+摘要：法律文件，包括判決和法院命令，需要高度專業的法律知識才能理解。為了向非專家揭露專家知識，我們探討了使用易於理解的圖表將法律文本視覺化的問題，並提出了一個新的 LegalViz 數據集，其中包含 23 種語言和 7,010 個法律文件和視覺化配對，使用 Graphviz 的 DOT 圖形描述語言。LegalViz 從複雜的法律語料庫中提供了一個簡單的圖表，可以一目了然地識別法律實體、交易、法律來源和陳述，這些在每項判決中都是必不可少的。此外，我們通過考慮圖形結構、文本相似性和法律內容，為法律圖表視覺化提供了新的評估指標。我們對少次學習和微調大型語言模型進行了實證研究，以生成法律圖表，並使用這些指標對它們進行了評估，包括在 23 種語言中基於法律內容的評估。使用 LegalViz 訓練的模型優於現有的模型，包括 GPT，證實了我們數據集的有效性。
 
-##### **Explainable AI: Definition and attributes of a good explanation for health AI**
-2409.15338v1 by Evangelia Kyrimi, Scott McLachlan, Jared M Wohlgemut, Zane B Perkins, David A. Lagnado, William Marsh, the ExAIDSS Expert Group
+##### **Deconstructing Depression Stigma: Integrating AI-driven Data Collection and Analysis with Causal Knowledge Graphs**
+2502.06075v1 by Han Meng, Renwen Zhang, Ganyi Wang, Yitian Yang, Peinuan Qin, Jungup Lee, Yi-Chieh Lee
 
-Proposals of artificial intelligence (AI) solutions based on increasingly
-complex and accurate predictive models are becoming ubiquitous across many
-disciplines. As the complexity of these models grows, transparency and users'
-understanding often diminish. This suggests that accurate prediction alone is
-insufficient for making an AI-based solution truly useful. In the development
-of healthcare systems, this introduces new issues related to accountability and
-safety. Understanding how and why an AI system makes a recommendation may
-require complex explanations of its inner workings and reasoning processes.
-Although research on explainable AI (XAI) has significantly increased in recent
-years and there is high demand for XAI in medicine, defining what constitutes a
-good explanation remains ad hoc, and providing adequate explanations continues
-to be challenging. To fully realize the potential of AI, it is critical to
-address two fundamental questions about explanations for safety-critical AI
-applications, such as health-AI: (1) What is an explanation in health-AI? and
-(2) What are the attributes of a good explanation in health-AI? In this study,
-we examined published literature and gathered expert opinions through a
-two-round Delphi study. The research outputs include (1) a definition of what
-constitutes an explanation in health-AI and (2) a comprehensive list of
-attributes that characterize a good explanation in health-AI.
+Mental-illness stigma is a persistent social problem, hampering both
+treatment-seeking and recovery. Accordingly, there is a pressing need to
+understand it more clearly, but analyzing the relevant data is highly
+labor-intensive. Therefore, we designed a chatbot to engage participants in
+conversations; coded those conversations qualitatively with AI assistance; and,
+based on those coding results, built causal knowledge graphs to decode stigma.
+The results we obtained from 1,002 participants demonstrate that conversation
+with our chatbot can elicit rich information about people's attitudes toward
+depression, while our AI-assisted coding was strongly consistent with
+human-expert coding. Our novel approach combining large language models (LLMs)
+and causal knowledge graphs uncovered patterns in individual responses and
+illustrated the interrelationships of psychological constructs in the dataset
+as a whole. The paper also discusses these findings' implications for HCI
+researchers in developing digital interventions, decomposing human
+psychological constructs, and fostering inclusive attitudes.
 
-摘要：隨著越來越複雜且準確的預測模型，基於人工智慧 (AI) 解決方案的提案在許多領域中變得無處不在。隨著這些模型複雜性的增加，透明度和使用者的理解力往往會降低。這表示僅有準確的預測並不足以讓 AI 解決方案真正有用。在醫療保健系統的開發中，這引入了與問責制和安全性相關的新問題。瞭解 AI 系統如何以及為何提出建議可能需要對其內部運作和推理過程進行複雜的說明。儘管近年來對可解釋 AI (XAI) 的研究已大幅增加，且醫學領域對 XAI 有很高的需求，但定義什麼構成一個好的解釋仍是臨時性的，而提供適當的解釋仍然具有挑戰性。為了充分發揮 AI 的潛力，對於安全關鍵型 AI 應用（例如健康 AI）的解釋，探討兩個基本問題至關重要：(1) 什麼是健康 AI 中的解釋？以及 (2) 健康 AI 中一個好的解釋有哪些屬性？在本研究中，我們檢視了已發表的文獻，並透過兩輪德爾菲研究收集了專家意見。研究成果包括：(1) 健康 AI 中什麼構成解釋的定義，以及 (2) 健康 AI 中一個好解釋的屬性清單。
+摘要：精神疾病的污名化是一個持續存在的社會問題，阻礙了尋求治療和康復。因此，迫切需要更清楚地了解它，但分析相關數據非常費力。因此，我們設計了一個聊天機器人，讓參與者參與對話；使用 AI 協助對這些對話進行定性編碼；並根據這些編碼結果，構建因果知識圖譜來破譯污名化。我們從 1,002 名參與者那裡獲得的結果表明，與我們的聊天機器人的對話可以引出人們對憂鬱症的豐富資訊，而我們 AI 輔助的編碼與人類專家編碼非常一致。我們將大型語言模型 (LLM) 和因果知識圖譜相結合的新方法揭示了個別反應中的模式，並說明了資料集中心理建構之間的相互關係。本文還討論了這些發現對 HCI 研究人員在開發數位介入措施、分解人類心理建構和培養包容態度方面的影響。
 
-##### **Exploring the Effect of Explanation Content and Format on User Comprehension and Trust**
-2408.17401v1 by Antonio Rago, Bence Palfi, Purin Sukpanichnant, Hannibal Nabli, Kavyesh Vivek, Olga Kostopoulou, James Kinross, Francesca Toni
+##### **LegalSeg: Unlocking the Structure of Indian Legal Judgments Through Rhetorical Role Classification**
+2502.05836v1 by Shubham Kumar Nigam, Tanmay Dubey, Govind Sharma, Noel Shallum, Kripabandhu Ghosh, Arnab Bhattacharya
 
-In recent years, various methods have been introduced for explaining the
-outputs of "black-box" AI models. However, it is not well understood whether
-users actually comprehend and trust these explanations. In this paper, we focus
-on explanations for a regression tool for assessing cancer risk and examine the
-effect of the explanations' content and format on the user-centric metrics of
-comprehension and trust. Regarding content, we experiment with two explanation
-methods: the popular SHAP, based on game-theoretic notions and thus potentially
-complex for everyday users to comprehend, and occlusion-1, based on feature
-occlusion which may be more comprehensible. Regarding format, we present SHAP
-explanations as charts (SC), as is conventional, and occlusion-1 explanations
-as charts (OC) as well as text (OT), to which their simpler nature also lends
-itself. The experiments amount to user studies questioning participants, with
-two different levels of expertise (the general population and those with some
-medical training), on their subjective and objective comprehension of and trust
-in explanations for the outputs of the regression tool. In both studies we
-found a clear preference in terms of subjective comprehension and trust for
-occlusion-1 over SHAP explanations in general, when comparing based on content.
-However, direct comparisons of explanations when controlling for format only
-revealed evidence for OT over SC explanations in most cases, suggesting that
-the dominance of occlusion-1 over SHAP explanations may be driven by a
-preference for text over charts as explanations. Finally, we found no evidence
-of a difference between the explanation types in terms of objective
-comprehension. Thus overall, the choice of the content and format of
-explanations needs careful attention, since in some contexts format, rather
-than content, may play the critical role in improving user experience.
+In this paper, we address the task of semantic segmentation of legal
+documents through rhetorical role classification, with a focus on Indian legal
+judgments. We introduce LegalSeg, the largest annotated dataset for this task,
+comprising over 7,000 documents and 1.4 million sentences, labeled with 7
+rhetorical roles. To benchmark performance, we evaluate multiple
+state-of-the-art models, including Hierarchical BiLSTM-CRF,
+TransformerOverInLegalBERT (ToInLegalBERT), Graph Neural Networks (GNNs), and
+Role-Aware Transformers, alongside an exploratory RhetoricLLaMA, an
+instruction-tuned large language model. Our results demonstrate that models
+incorporating broader context, structural relationships, and sequential
+sentence information outperform those relying solely on sentence-level
+features. Additionally, we conducted experiments using surrounding context and
+predicted or actual labels of neighboring sentences to assess their impact on
+classification accuracy. Despite these advancements, challenges persist in
+distinguishing between closely related roles and addressing class imbalance.
+Our work underscores the potential of advanced techniques for improving legal
+document understanding and sets a strong foundation for future research in
+legal NLP.
 
-摘要：<paragraph>近年來，已經引進各種方法來解釋「黑箱」AI 模型的輸出。然而，目前並不清楚使用者是否實際理解和信任這些解釋。在本文中，我們專注於評估癌症風險的回歸工具的解釋，並探討解釋的內容和格式對以使用者為中心的理解和信任指標的影響。關於內容，我們實驗了兩種解釋方法：流行的 SHAP，基於博弈論概念，因此對於日常使用者來說可能很複雜，以及基於特徵遮蔽的 occlusion-1，可能更易於理解。關於格式，我們將 SHAP 解釋呈現為圖表 (SC)，這是慣例，而將 occlusion-1 解釋呈現為圖表 (OC) 以及文字 (OT)，其較為簡單的性質也適用於此。這些實驗等同於使用者研究，詢問參與者，具有兩種不同程度的專業知識（一般民眾和具備一些醫學訓練的人），他們對回歸工具輸出解釋的主觀和客觀理解和信任。在兩項研究中，我們發現，在基於內容進行比較時，一般來說，occlusion-1 優於 SHAP 解釋，在主觀理解和信任方面有明顯的偏好。然而，在僅控制格式的情況下直接比較解釋，在大多數情況下只顯示 OT 優於 SC 解釋的證據，這表明 occlusion-1 優於 SHAP 解釋的主導地位可能是由偏好文字而非圖表作為解釋所驅動的。最後，我們沒有發現解釋類型在客觀理解方面的差異證據。因此，總體而言，對解釋的內容和格式的選擇需要仔細注意，因為在某些情況下，格式而非內容，可能在改善使用者體驗方面發揮關鍵作用。</paragraph>
+摘要：<paragraph>在本文中，我們通過修辭角色分類來探討法律文件的語義分段任務，重點關注印度法律判決。我們引入了 LegalSeg，這是此任務中最大的註釋資料集，包含超過 7,000 份文件和 140 萬個句子，並標記了 7 個修辭角色。為了評量效能，我們評估了多個最先進的模型，包括分層 BiLSTM-CRF、TransformerOverInLegalBERT (ToInLegalBERT)、圖神經網路 (GNN) 和角色感知Transformer，以及探索性的 RhetoricLLaMA，一種經過指令調整的大型語言模型。我們的結果表明，結合廣泛背景、結構關係和順序句子資訊的模型，表現優於僅依賴句子層級特徵的模型。此外，我們使用周圍的背景和鄰近句子的預測或實際標籤進行實驗，以評估它們對分類精度的影響。儘管有這些進展，但在區分密切相關的角色和解決類別不平衡方面仍存在挑戰。我們的研究強調了先進技術在改善法律文件理解方面的潛力，並為法律自然語言處理的未來研究奠定了堅實的基礎。</paragraph>
 
-##### **A Survey for Large Language Models in Biomedicine**
-2409.00133v1 by Chong Wang, Mengyao Li, Junjun He, Zhongruo Wang, Erfan Darzi, Zan Chen, Jin Ye, Tianbin Li, Yanzhou Su, Jing Ke, Kaili Qu, Shuxin Li, Yi Yu, Pietro Liò, Tianyun Wang, Yu Guang Wang, Yiqing Shen
+##### **LLM-Powered Decentralized Generative Agents with Adaptive Hierarchical Knowledge Graph for Cooperative Planning**
+2502.05453v1 by Hanqing Yang, Jingdi Chen, Marie Siew, Tania Lorido-Botran, Carlee Joe-Wong
 
-Recent breakthroughs in large language models (LLMs) offer unprecedented
-natural language understanding and generation capabilities. However, existing
-surveys on LLMs in biomedicine often focus on specific applications or model
-architectures, lacking a comprehensive analysis that integrates the latest
-advancements across various biomedical domains. This review, based on an
-analysis of 484 publications sourced from databases including PubMed, Web of
-Science, and arXiv, provides an in-depth examination of the current landscape,
-applications, challenges, and prospects of LLMs in biomedicine, distinguishing
-itself by focusing on the practical implications of these models in real-world
-biomedical contexts. Firstly, we explore the capabilities of LLMs in zero-shot
-learning across a broad spectrum of biomedical tasks, including diagnostic
-assistance, drug discovery, and personalized medicine, among others, with
-insights drawn from 137 key studies. Then, we discuss adaptation strategies of
-LLMs, including fine-tuning methods for both uni-modal and multi-modal LLMs to
-enhance their performance in specialized biomedical contexts where zero-shot
-fails to achieve, such as medical question answering and efficient processing
-of biomedical literature. Finally, we discuss the challenges that LLMs face in
-the biomedicine domain including data privacy concerns, limited model
-interpretability, issues with dataset quality, and ethics due to the sensitive
-nature of biomedical data, the need for highly reliable model outputs, and the
-ethical implications of deploying AI in healthcare. To address these
-challenges, we also identify future research directions of LLM in biomedicine
-including federated learning methods to preserve data privacy and integrating
-explainable AI methodologies to enhance the transparency of LLMs.
+Developing intelligent agents for long-term cooperation in dynamic open-world
+scenarios is a major challenge in multi-agent systems. Traditional Multi-agent
+Reinforcement Learning (MARL) frameworks like centralized training
+decentralized execution (CTDE) struggle with scalability and flexibility. They
+require centralized long-term planning, which is difficult without custom
+reward functions, and face challenges in processing multi-modal data. CTDE
+approaches also assume fixed cooperation strategies, making them impractical in
+dynamic environments where agents need to adapt and plan independently. To
+address decentralized multi-agent cooperation, we propose Decentralized
+Adaptive Knowledge Graph Memory and Structured Communication System (DAMCS) in
+a novel Multi-agent Crafter environment. Our generative agents, powered by
+Large Language Models (LLMs), are more scalable than traditional MARL agents by
+leveraging external knowledge and language for long-term planning and
+reasoning. Instead of fully sharing information from all past experiences,
+DAMCS introduces a multi-modal memory system organized as a hierarchical
+knowledge graph and a structured communication protocol to optimize agent
+cooperation. This allows agents to reason from past interactions and share
+relevant information efficiently. Experiments on novel multi-agent open-world
+tasks show that DAMCS outperforms both MARL and LLM baselines in task
+efficiency and collaboration. Compared to single-agent scenarios, the two-agent
+scenario achieves the same goal with 63% fewer steps, and the six-agent
+scenario with 74% fewer steps, highlighting the importance of adaptive memory
+and structured communication in achieving long-term goals. We publicly release
+our project at: https://happyeureka.github.io/damcs.
 
-摘要：大型語言模型 (LLM) 的最新突破提供了前所未有的自然語言理解和生成能力。然而，現有關於生物醫學中 LLM 的調查通常專注於特定應用或模型架構，缺乏整合各種生物醫學領域最新進展的全面分析。本綜述基於對來自 PubMed、Web of Science 和 arXiv 等數據庫的 484 篇出版物的分析，深入探討了生物醫學中 LLM 的當前現況、應用、挑戰和前景，其特點是關注這些模型在現實世界生物醫學背景中的實際應用。首先，我們探討了 LLM 在廣泛的生物醫學任務中的零次學習能力，包括診斷輔助、藥物發現和個性化醫療等，並從 137 項關鍵研究中汲取見解。然後，我們討論了 LLM 的適應策略，包括單模態和多模態 LLM 的微調方法，以增強它們在零次學習無法實現的專業生物醫學背景中的性能，例如醫療問題解答和生物醫學文獻的有效處理。最後，我們討論了 LLM 在生物醫學領域面臨的挑戰，包括數據隱私問題、模型可解釋性有限、數據集質量問題以及由於生物醫學數據的敏感性、對高度可靠模型輸出的需求以及在醫療保健中部署 AI 的倫理影響而產生的倫理問題。為了應對這些挑戰，我們還確定了生物醫學中 LLM 未來的研究方向，包括用於保護數據隱私的聯合學習方法以及整合可解釋 AI 方法以增強 LLM 的透明度。
+摘要：<paragraph>在動態開放世界情境中開發用於長期合作的智慧代理是多重代理系統中的一項重大挑戰。傳統的多重代理強化學習 (MARL) 框架，例如集中式訓練去中心化執行 (CTDE)，在可擴充性和靈活性方面面臨困難。它們需要集中式長期規劃，這在沒有自訂獎勵函數的情況下很難執行，並且在處理多模式數據時會面臨挑戰。CTDE 方法還假設固定的合作策略，這使得它們在代理需要獨立適應和規劃的動態環境中不切實際。為了解決分散式多重代理合作問題，我們在一個新穎的多重代理工匠環境中提出了分散式自適應知識圖譜記憶體和結構化通訊系統 (DAMCS)。我們的生成代理由大型語言模型 (LLM) 提供支援，透過利用外部知識和語言進行長期規劃和推理，比傳統的 MARL 代理更具可擴充性。DAMCS 沒有完全分享來自所有過去經驗的資訊，而是引入了多模式記憶體系統，該系統組織成階層式知識圖譜和結構化通訊協定，以最佳化代理合作。這允許代理根據過去的互動進行推理並有效地分享相關資訊。在新的多重代理開放世界任務上的實驗表明，DAMCS 在任務效率和協作方面優於 MARL 和 LLM 基準。與單一代理情境相比，雙重代理情境以少 63% 的步驟達成相同的目標，而六重代理情境則以少 74% 的步驟達成目標，突顯了自適應記憶體和結構化通訊在達成長期目標中的重要性。我們公開發布我們的專案於：https://happyeureka.github.io/damcs。</paragraph>
 
-##### **Aligning XAI with EU Regulations for Smart Biomedical Devices: A Methodology for Compliance Analysis**
-2408.15121v1 by Francesco Sovrano, Michael Lognoul, Giulia Vilone
+##### **SAMGPT: Text-free Graph Foundation Model for Multi-domain Pre-training and Cross-domain Adaptation**
+2502.05424v1 by Xingtong Yu, Zechuan Gong, Chang Zhou, Yuan Fang, Hui Zhang
 
-Significant investment and development have gone into integrating Artificial
-Intelligence (AI) in medical and healthcare applications, leading to advanced
-control systems in medical technology. However, the opacity of AI systems
-raises concerns about essential characteristics needed in such sensitive
-applications, like transparency and trustworthiness. Our study addresses these
-concerns by investigating a process for selecting the most adequate Explainable
-AI (XAI) methods to comply with the explanation requirements of key EU
-regulations in the context of smart bioelectronics for medical devices. The
-adopted methodology starts with categorising smart devices by their control
-mechanisms (open-loop, closed-loop, and semi-closed-loop systems) and delving
-into their technology. Then, we analyse these regulations to define their
-explainability requirements for the various devices and related goals.
-Simultaneously, we classify XAI methods by their explanatory objectives. This
-allows for matching legal explainability requirements with XAI explanatory
-goals and determining the suitable XAI algorithms for achieving them. Our
-findings provide a nuanced understanding of which XAI algorithms align better
-with EU regulations for different types of medical devices. We demonstrate this
-through practical case studies on different neural implants, from chronic
-disease management to advanced prosthetics. This study fills a crucial gap in
-aligning XAI applications in bioelectronics with stringent provisions of EU
-regulations. It provides a practical framework for developers and researchers,
-ensuring their AI innovations advance healthcare technology and adhere to legal
-and ethical standards.
+Graphs are able to model interconnected entities in many online services,
+supporting a wide range of applications on the Web. This raises an important
+question: How can we train a graph foundational model on multiple source
+domains and adapt to an unseen target domain? A major obstacle is that graphs
+from different domains often exhibit divergent characteristics. Some studies
+leverage large language models to align multiple domains based on textual
+descriptions associated with the graphs, limiting their applicability to
+text-attributed graphs. For text-free graphs, a few recent works attempt to
+align different feature distributions across domains, while generally
+neglecting structural differences. In this work, we propose a novel Structure
+Alignment framework for text-free Multi-domain Graph Pre-Training and
+cross-domain adaptation (SAMGPT). It is designed to learn multi-domain
+knowledge from graphs originating in multiple source domains, which can then be
+adapted to address applications in an unseen target domain. Specifically, we
+introduce a set of structure tokens to harmonize structure-based aggregation
+across source domains during the pre-training phase. Next, for cross-domain
+adaptation, we design dual prompts, namely, holistic prompts and specific
+prompts, which adapt unified multi-domain structural knowledge and
+fine-grained, domain-specific information, respectively, to a target domain.
+Finally, we conduct comprehensive experiments on seven public datasets to
+evaluate and analyze the effectiveness of SAMGPT.
 
-摘要：人工智慧（AI）在醫療和保健應用中投入了大量的投資和開發，進而導致醫療技術中的先進控制系統。然而，AI 系統的不透明性引發了對此類敏感應用中所需基本特性的擔憂，例如透明度和可信度。我們的研究透過調查一個程序來解決這些問題，用於選擇最充分的可解釋 AI（XAI）方法，以符合歐盟法規在醫療器材的智慧型生物電子學中的說明要求。採用的方法從透過其控制機制（開迴路、閉迴路和半閉迴路系統）對智慧型裝置進行分類，並深入探討其技術開始。然後，我們分析這些法規以定義其對各種裝置和相關目標的可解釋性要求。同時，我們透過其說明目標對 XAI 方法進行分類。這允許將法律可解釋性要求與 XAI 說明目標相匹配，並確定適當的 XAI 演算法來達成它們。我們的研究結果提供了對哪些 XAI 演算法更符合歐盟法規以適用於不同類型的醫療器材的細緻理解。我們透過不同神經植入物的實際案例研究來證明這一點，從慢性疾病管理到先進的義肢。這項研究填補了將生物電子學中的 XAI 應用與歐盟法規的嚴格規定相符的重要空白。它為開發人員和研究人員提供了一個實用的架構，確保其 AI 創新能促進醫療技術並遵守法律和道德標準。
+摘要：圖表能夠在許多線上服務中對相互關聯的實體進行建模，
+支援網路上廣泛的應用程式。這提出了重要的問題：我們如何針對多個來源網域訓練圖表基礎模型，並適應未見過的目標網域？一個主要的障礙是，來自不同網域的圖表通常表現出不同的特性。一些研究利用大型語言模型，根據與圖表相關的文字描述，對齊多個網域，限制其適用性於有文字屬性的圖表。對於沒有文字的圖表，最近的一些作品嘗試對齊跨網域的不同特徵分佈，同時通常忽略結構上的差異。在這項工作中，我們提出了一個新的結構對齊框架，用於無文字多網域圖表預訓練和跨網域適應 (SAMGPT)。它被設計為從起源於多個來源網域的圖表中學習多網域知識，然後可以適應於未見過的目標網域中的應用程式。具體來說，我們引入了一組結構化代碼，以在預訓練階段，調和跨來源網域的基於結構的聚合。接下來，對於跨網域適應，我們設計了雙重提示，即整體提示和具體提示，分別將統一的多網域結構知識和細緻的、特定於網域的資訊適應到目標網域。最後，我們在七個公共資料集上進行了全面的實驗，以評估和分析 SAMGPT 的有效性。
 
-##### **Towards Case-based Interpretability for Medical Federated Learning**
-2408.13626v1 by Laura Latorre, Liliana Petrychenko, Regina Beets-Tan, Taisiya Kopytova, Wilson Silva
+##### **Graph-based Molecular In-context Learning Grounded on Morgan Fingerprints**
+2502.05414v1 by Ali Al-Lawati, Jason Lucas, Zhiwei Zhang, Prasenjit Mitra, Suhang Wang
 
-We explore deep generative models to generate case-based explanations in a
-medical federated learning setting. Explaining AI model decisions through
-case-based interpretability is paramount to increasing trust and allowing
-widespread adoption of AI in clinical practice. However, medical AI training
-paradigms are shifting towards federated learning settings in order to comply
-with data protection regulations. In a federated scenario, past data is
-inaccessible to the current user. Thus, we use a deep generative model to
-generate synthetic examples that protect privacy and explain decisions. Our
-proof-of-concept focuses on pleural effusion diagnosis and uses publicly
-available Chest X-ray data.
+In-context learning (ICL) effectively conditions large language models (LLMs)
+for molecular tasks, such as property prediction and molecule captioning, by
+embedding carefully selected demonstration examples into the input prompt. This
+approach avoids the computational overhead of extensive pertaining and
+fine-tuning. However, current prompt retrieval methods for molecular tasks have
+relied on molecule feature similarity, such as Morgan fingerprints, which do
+not adequately capture the global molecular and atom-binding relationships. As
+a result, these methods fail to represent the full complexity of molecular
+structures during inference. Moreover, small-to-medium-sized LLMs, which offer
+simpler deployment requirements in specialized systems, have remained largely
+unexplored in the molecular ICL literature. To address these gaps, we propose a
+self-supervised learning technique, GAMIC (Graph-Aligned Molecular In-Context
+learning, which aligns global molecular structures, represented by graph neural
+networks (GNNs), with textual captions (descriptions) while leveraging local
+feature similarity through Morgan fingerprints. In addition, we introduce a
+Maximum Marginal Relevance (MMR) based diversity heuristic during retrieval to
+optimize input prompt demonstration samples. Our experimental findings using
+diverse benchmark datasets show GAMIC outperforms simple Morgan-based ICL
+retrieval methods across all tasks by up to 45%.
 
-摘要：我們探索深度生成模型，在醫療聯邦學習設置中生成基於案例的說明。透過基於案例的可解釋性來解釋 AI 模型決策，對於增加信任並允許 AI 在臨床實務中廣泛採用至關重要。然而，醫療 AI 訓練範例正轉向聯邦學習設置，以符合資料保護法規。在聯邦情境中，過去的資料對目前的使用者而言是無法取得的。因此，我們使用深度生成模型來產生保護隱私和解釋決策的合成範例。我們的概念驗證著重於胸腔積液診斷，並使用公開可取得的胸部 X 光資料。
+摘要：<paragraph>情境學習 (ICL) 有效地調整大型語言模型 (LLM)，以執行分子任務，例如屬性預測和分子標題，方法是將仔細挑選的示範範例嵌入輸入提示中。這種方法避免了廣泛相關和微調的計算開銷。然而，目前針對分子任務的提示檢索方法依賴於分子特徵相似性，例如 Morgan 指紋，而無法充分捕捉全局分子和原子鍵結關係。因此，這些方法無法在推理過程中表示分子結構的完整複雜性。此外，在專業系統中提供更簡單部署需求的小到中型的 LLM，在分子 ICL 文獻中仍未得到充分探索。為了解決這些差距，我們提出了一種自我監督學習技術，GAMIC（圖形對齊分子情境學習），它將由圖形神經網路 (GNN) 表示的全局分子結構與文字標題（描述）對齊，同時透過 Morgan 指紋利用局部特徵相似性。此外，我們在檢索過程中引入了一個基於最大邊際相關性 (MMR) 的多樣性啟發法，以最佳化輸入提示示範樣本。我們使用不同的基準資料集進行的實驗結果顯示，GAMIC 在所有任務中都優於基於 Morgan 的簡單 ICL 檢索方法，最多可達 45%。</paragraph>
 
-##### **AI in radiological imaging of soft-tissue and bone tumours: a systematic review evaluating against CLAIM and FUTURE-AI guidelines**
-2408.12491v1 by Douwe J. Spaanderman, Matthew Marzetti, Xinyi Wan, Andrew F. Scarsbrook, Philip Robinson, Edwin H. G. Oei, Jacob J. Visser, Robert Hemke, Kirsten van Langevelde, David F. Hanff, Geert J. L. H. van Leenders, Cornelis Verhoef, Dirk J. Gruühagen, Wiro J. Niessen, Stefan Klein, Martijn P. A. Starmans
+##### **Knowledge Graph-Guided Retrieval Augmented Generation**
+2502.06864v1 by Xiangrong Zhu, Yuexiang Xie, Yi Liu, Yaliang Li, Wei Hu
 
-Soft-tissue and bone tumours (STBT) are rare, diagnostically challenging
-lesions with variable clinical behaviours and treatment approaches. This
-systematic review provides an overview of Artificial Intelligence (AI) methods
-using radiological imaging for diagnosis and prognosis of these tumours,
-highlighting challenges in clinical translation, and evaluating study alignment
-with the Checklist for AI in Medical Imaging (CLAIM) and the FUTURE-AI
-international consensus guidelines for trustworthy and deployable AI to promote
-the clinical translation of AI methods. The review covered literature from
-several bibliographic databases, including papers published before 17/07/2024.
-Original research in peer-reviewed journals focused on radiology-based AI for
-diagnosing or prognosing primary STBT was included. Exclusion criteria were
-animal, cadaveric, or laboratory studies, and non-English papers. Abstracts
-were screened by two of three independent reviewers for eligibility. Eligible
-papers were assessed against guidelines by one of three independent reviewers.
-The search identified 15,015 abstracts, from which 325 articles were included
-for evaluation. Most studies performed moderately on CLAIM, averaging a score
-of 28.9$\pm$7.5 out of 53, but poorly on FUTURE-AI, averaging 5.1$\pm$2.1 out
-of 30. Imaging-AI tools for STBT remain at the proof-of-concept stage,
-indicating significant room for improvement. Future efforts by AI developers
-should focus on design (e.g. define unmet clinical need, intended clinical
-setting and how AI would be integrated in clinical workflow), development (e.g.
-build on previous work, explainability), evaluation (e.g. evaluating and
-addressing biases, evaluating AI against best practices), and data
-reproducibility and availability (making documented code and data publicly
-available). Following these recommendations could improve clinical translation
-of AI methods.
+Retrieval-augmented generation (RAG) has emerged as a promising technology
+for addressing hallucination issues in the responses generated by large
+language models (LLMs). Existing studies on RAG primarily focus on applying
+semantic-based approaches to retrieve isolated relevant chunks, which ignore
+their intrinsic relationships. In this paper, we propose a novel Knowledge
+Graph-Guided Retrieval Augmented Generation (KG$^2$RAG) framework that utilizes
+knowledge graphs (KGs) to provide fact-level relationships between chunks,
+improving the diversity and coherence of the retrieved results. Specifically,
+after performing a semantic-based retrieval to provide seed chunks, KG$^2$RAG
+employs a KG-guided chunk expansion process and a KG-based chunk organization
+process to deliver relevant and important knowledge in well-organized
+paragraphs. Extensive experiments conducted on the HotpotQA dataset and its
+variants demonstrate the advantages of KG$^2$RAG compared to existing RAG-based
+approaches, in terms of both response quality and retrieval quality.
 
-摘要：軟組織和骨骼腫瘤（STBT）是罕見、診斷具有挑戰性的病灶，其臨床行為和治療方法各不相同。這篇系統性回顧提供了使用放射影像進行診斷和預後的人工智慧 (AI) 方法的概觀，重點說明了臨床轉譯的挑戰，並評估研究與醫療影像 AI 核查表 (CLAIM) 和 FUTURE-AI 可信賴且可部署 AI 的國際共識準則的一致性，以促進 AI 方法的臨床轉譯。這篇回顧涵蓋了幾個書目資料庫中的文獻，包括在 2024 年 7 月 17 日之前發表的論文。納入了以放射為基礎的 AI 診斷或預後原發性 STBT 的同行評審期刊中的原始研究。排除標準是動物、屍體或實驗室研究，以及非英文論文。摘要由三位獨立審查員中的兩位篩選資格。合格的論文由三位獨立審查員中的一位根據準則進行評估。搜索識別出 15,015 篇摘要，其中 325 篇文章被納入評估。大多數研究在 CLAIM 中表現中等，平均得分為 53 分中的 28.9±7.5 分，但在 FUTURE-AI 中表現不佳，平均得分為 30 分中的 5.1±2.1 分。STBT 的影像 AI 工具仍處於概念驗證階段，表明有顯著的改進空間。AI 開發人員未來的努力應集中在設計（例如定義未滿足的臨床需求、預期的臨床環境以及 AI 如何整合到臨床工作流程中）、開發（例如建立在先前的工作、可解釋性）、評估（例如評估和解決偏差、評估 AI 與最佳實務）、以及數據可複製性和可用性（公開提供文件化的代碼和數據）。遵循這些建議可以改善 AI 方法的臨床轉譯。
+摘要：檢索增強生成 (RAG) 已成為一項有前途的技術，用於解決大型語言模型 (LLM) 所產生回應中的幻覺問題。現有關於 RAG 的研究主要專注於應用基於語義的方法來檢索孤立相關的區塊，而忽略它們的內在關係。在本文中，我們提出了一個新穎的知識圖表引導檢索增強生成 (KG$^2$RAG) 框架，它利用知識圖表 (KG) 來提供區塊之間的事實層級關係，從而提高檢索結果的多樣性和一致性。具體來說，在執行基於語義的檢索以提供種子區塊後，KG$^2$RAG 採用 KG 引導的區塊擴充程序和基於 KG 的區塊組織程序，以在組織良好的段落中傳達相關且重要的知識。在 HotpotQA 資料集及其變體上進行的大量實驗證明了 KG$^2$RAG 在回應品質和檢索品質方面優於現有的基於 RAG 的方法。
 
-##### **Evaluating Explainable AI Methods in Deep Learning Models for Early Detection of Cerebral Palsy**
-2409.00001v1 by Kimji N. Pellano, Inga Strümke, Daniel Groos, Lars Adde, Espen Alexander F. Ihlen
+##### **Can Large Language Models Understand Intermediate Representations?**
+2502.06854v1 by Hailong Jiang, Jianfeng Zhu, Yao Wan, Bo Fang, Hongyu Zhang, Ruoming Jin, Qiang Guan
 
-Early detection of Cerebral Palsy (CP) is crucial for effective intervention
-and monitoring. This paper tests the reliability and applicability of
-Explainable AI (XAI) methods using a deep learning method that predicts CP by
-analyzing skeletal data extracted from video recordings of infant movements.
-Specifically, we use XAI evaluation metrics -- namely faithfulness and
-stability -- to quantitatively assess the reliability of Class Activation
-Mapping (CAM) and Gradient-weighted Class Activation Mapping (Grad-CAM) in this
-specific medical application. We utilize a unique dataset of infant movements
-and apply skeleton data perturbations without distorting the original dynamics
-of the infant movements. Our CP prediction model utilizes an ensemble approach,
-so we evaluate the XAI metrics performances for both the overall ensemble and
-the individual models. Our findings indicate that both XAI methods effectively
-identify key body points influencing CP predictions and that the explanations
-are robust against minor data perturbations. Grad-CAM significantly outperforms
-CAM in the RISv metric, which measures stability in terms of velocity. In
-contrast, CAM performs better in the RISb metric, which relates to bone
-stability, and the RRS metric, which assesses internal representation
-robustness. Individual models within the ensemble show varied results, and
-neither CAM nor Grad-CAM consistently outperform the other, with the ensemble
-approach providing a representation of outcomes from its constituent models.
+Intermediate Representations (IRs) are essential in compiler design and
+program analysis, yet their comprehension by Large Language Models (LLMs)
+remains underexplored. This paper presents a pioneering empirical study to
+investigate the capabilities of LLMs, including GPT-4, GPT-3, Gemma 2, LLaMA
+3.1, and Code Llama, in understanding IRs. We analyze their performance across
+four tasks: Control Flow Graph (CFG) reconstruction, decompilation, code
+summarization, and execution reasoning. Our results indicate that while LLMs
+demonstrate competence in parsing IR syntax and recognizing high-level
+structures, they struggle with control flow reasoning, execution semantics, and
+loop handling. Specifically, they often misinterpret branching instructions,
+omit critical IR operations, and rely on heuristic-based reasoning, leading to
+errors in CFG reconstruction, IR decompilation, and execution reasoning. The
+study underscores the necessity for IR-specific enhancements in LLMs,
+recommending fine-tuning on structured IR datasets and integration of explicit
+control flow models to augment their comprehension and handling of IR-related
+tasks.
+
+摘要：中間表徵 (IR) 在編譯器設計和程式分析中至關重要，但大型語言模型 (LLM) 對其理解仍未得到充分探討。本文提出了一項開創性的實證研究，以探討 LLM（包括 GPT-4、GPT-3、Gemma 2、LLaMA 3.1 和 Code Llama）理解 IR 的能力。我們分析了它們在四項任務中的表現：控制流程圖 (CFG) 重建、反編譯、程式碼摘要和執行推理。我們的結果表明，儘管 LLM 在解析 IR 語法和識別高階結構方面表現出能力，但它們在控制流程推理、執行語義和迴圈處理方面存在困難。具體而言，它們經常誤解分支指令、省略關鍵 IR 操作，並依賴於基於啟發式的推理，導致 CFG 重建、IR 反編譯和執行推理出現錯誤。這項研究強調了 LLM 中對 IR 特定的增強的必要性，建議對結構化的 IR 資料集進行微調，並整合明確的控制流程模型，以增強其對 IR 相關任務的理解和處理。
+
+##### **GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity?**
+2502.05252v1 by Yang Zhou, Hongyi Liu, Zhuoming Chen, Yuandong Tian, Beidi Chen
+
+Long-context large language models (LLMs) have recently shown strong
+performance in information retrieval and long-document QA. However, to tackle
+the most challenging intellectual problems, LLMs must reason effectively in
+long and complex contexts (e.g., frontier mathematical research). Studying how
+LLMs handle increasing reasoning complexity and context length is essential,
+yet existing benchmarks lack a solid basis for quantitative evaluation.
+Inspired by the abstraction of GSM-8K problems as computational graphs, and the
+ability to introduce noise by adding unnecessary nodes and edges, we develop a
+grade school math problem generator capable of producing arithmetic problems
+with infinite difficulty and context length under fine-grained control. Using
+our newly synthesized GSM-Infinite benchmark, we comprehensively evaluate
+existing LLMs. We find a consistent sigmoid decline in reasoning performance as
+complexity increases, along with a systematic inference scaling trend:
+exponentially increasing inference computation yields only linear performance
+gains. These findings underscore the fundamental limitations of current
+long-context LLMs and the key challenges in scaling reasoning capabilities. Our
+GSM-Infinite benchmark provides a scalable and controllable testbed for
+systematically studying and advancing LLM reasoning in long and complex
+contexts.
 
-摘要：腦性麻痺 (CP) 的早期偵測對於有效的介入和監測至關重要。本文測試了可解釋 AI (XAI) 方法的可靠性和適用性，使用深度學習方法，透過分析從嬰兒動作影片記錄中提取的骨骼資料來預測 CP。具體來說，我們使用 XAI 評估指標（即忠實度和穩定性）來量化評估類別激活映射 (CAM) 和梯度加權類別激活映射 (Grad-CAM) 在這個特定醫療應用中的可靠性。我們利用一個獨特的嬰兒動作資料集，並應用骨骼資料擾動，而不會扭曲嬰兒動作的原始動力。我們的 CP 預測模型利用整體方法，因此我們評估了整體整體和個別模型的 XAI 指標表現。我們的研究結果表明，兩種 XAI 方法都能有效識別影響 CP 預測的關鍵身體部位，並且這些解釋對於微小的資料擾動具有魯棒性。Grad-CAM 在 RISv 指標中顯著優於 CAM，該指標衡量速度方面的穩定性。相比之下，CAM 在 RISb 指標中表現得更好，該指標與骨骼穩定性有關，而 RRS 指標則評估內部表示的魯棒性。整體中的個別模型顯示出不同的結果，CAM 和 Grad-CAM 都不一致地優於另一種，整體方法提供了其組成模型結果的表示。
+摘要：長文本大型語言模型 (LLM) 最近在資訊檢索和長文件問答中展示了強大的效能。然而，若要解決最具挑戰性的智力問題，LLM 必須在長且複雜的脈絡中有效推理（例如，前沿數學研究）。研究 LLM 如何處理增加的推理複雜性和脈絡長度至關重要，但現有的基準缺乏定量評估的穩固基礎。受到 GSM-8K 問題抽象化為計算圖形的啟發，以及透過加入不必要的節點和邊緣來引入雜訊的能力，我們開發了一個小學數學問題產生器，能夠在細緻的控制下產生具有無限難度和脈絡長度的算術問題。使用我們新合成的 GSM-Infinite 基準，我們全面評估現有的 LLM。我們發現推理效能會隨著複雜性的增加而持續呈 S 形下降，並伴隨著系統性的推論縮放趨勢：指數增加的推論計算僅產生線性的效能增益。這些發現強調了當前長脈絡 LLM 的基本限制，以及擴展推理能力的主要挑戰。我們的 GSM-Infinite 基準提供了一個可擴充且可控的測試平台，用於系統性地研究和提升 LLM 在長且複雜脈絡中的推理能力。
 
-##### **MicroXercise: A Micro-Level Comparative and Explainable System for Remote Physical Therapy**
-2408.11837v1 by Hanchen David Wang, Nibraas Khan, Anna Chen, Nilanjan Sarkar, Pamela Wisniewski, Meiyi Ma
+##### **Causality can systematically address the monsters under the bench(marks)**
+2502.05085v1 by Felix Leeb, Zhijing Jin, Bernhard Schölkopf
 
-Recent global estimates suggest that as many as 2.41 billion individuals have
-health conditions that would benefit from rehabilitation services. Home-based
-Physical Therapy (PT) faces significant challenges in providing interactive
-feedback and meaningful observation for therapists and patients. To fill this
-gap, we present MicroXercise, which integrates micro-motion analysis with
-wearable sensors, providing therapists and patients with a comprehensive
-feedback interface, including video, text, and scores. Crucially, it employs
-multi-dimensional Dynamic Time Warping (DTW) and attribution-based explainable
-methods to analyze the existing deep learning neural networks in monitoring
-exercises, focusing on a high granularity of exercise. This synergistic
-approach is pivotal, providing output matching the input size to precisely
-highlight critical subtleties and movements in PT, thus transforming complex AI
-analysis into clear, actionable feedback. By highlighting these micro-motions
-in different metrics, such as stability and range of motion, MicroXercise
-significantly enhances the understanding and relevance of feedback for
-end-users. Comparative performance metrics underscore its effectiveness over
-traditional methods, such as a 39% and 42% improvement in Feature Mutual
-Information (FMI) and Continuity. MicroXercise is a step ahead in home-based
-physical therapy, providing a technologically advanced and intuitively helpful
-solution to enhance patient care and outcomes.
+Effective and reliable evaluation is essential for advancing empirical
+machine learning. However, the increasing accessibility of generalist models
+and the progress towards ever more complex, high-level tasks make systematic
+evaluation more challenging. Benchmarks are plagued by various biases,
+artifacts, or leakage, while models may behave unreliably due to poorly
+explored failure modes. Haphazard treatments and inconsistent formulations of
+such "monsters" can contribute to a duplication of efforts, a lack of trust in
+results, and unsupported inferences. In this position paper, we argue causality
+offers an ideal framework to systematically address these challenges. By making
+causal assumptions in an approach explicit, we can faithfully model phenomena,
+formulate testable hypotheses with explanatory power, and leverage principled
+tools for analysis. To make causal model design more accessible, we identify
+several useful Common Abstract Topologies (CATs) in causal graphs which help
+gain insight into the reasoning abilities in large language models. Through a
+series of case studies, we demonstrate how the precise yet pragmatic language
+of causality clarifies the strengths and limitations of a method and inspires
+new approaches for systematic progress.
 
-摘要：最近的全球估計表明，多達 24.1 億人有
-健康狀況可從復健服務中受益。居家
-物理治療 (PT) 在提供互動式
-回饋和有意義的觀察方面面臨重大挑戰，供治療師和患者使用。為了填補這
-個缺口，我們提出 MicroXercise，它將微動作分析與
-可穿戴式感測器整合在一起，為治療師和患者提供一個全面的
-回饋介面，包括影片、文字和分數。至關重要的是，它採用
-多維動態時間規整 (DTW) 和基於歸因的可解釋
-方法來分析監控運動中現有的深度學習神經網路，專注於運動的高粒度。這種協同
-方法至關重要，提供與輸入大小匹配的輸出，以精確地
-突出 PT 中關鍵的細微差別和動作，從而將複雜的 AI
-分析轉換為清晰、可操作的回饋。透過在不同指標中突顯這些微動作，例如穩定性和動作範圍，MicroXercise
-顯著提升最終使用者對回饋的理解和相關性。比較效能指標強調其優於
-傳統方法的有效性，例如特徵互惠資訊 (FMI) 和連續性分別提升了 39% 和 42%。MicroXercise 在居家
-物理治療方面更進一步，提供技術先進且直覺有用的
-解決方案，以提升患者照護和結果。
+摘要：有效的、可靠的評估對於推進經驗機器學習至關重要。然而，一般化模型的可及性日益提高，以及朝著更複雜、更高級別任務的進展，使得系統評估更具挑戰性。基準測試受到各種偏差、人工製品或洩漏的困擾，而模型由於探索不充分的故障模式而可能表現得不可靠。隨意處理和不一致的表述等「怪物」可能會導致重複工作、對結果缺乏信任以及不支援的推論。在本文中，我們論證因果關係提供了一個系統性解決這些挑戰的理想框架。通過在方法中明確因果假設，我們可以忠實地模擬現象，制定具有解釋力的可測試假設，並利用原則性的分析工具。為了使因果模型設計更易於使用，我們在因果圖中識別出幾個有用的通用抽象拓撲 (CAT)，有助於深入了解大型語言模型中的推理能力。通過一系列案例研究，我們展示了因果關係的精確但務實的語言如何釐清方法的優缺點，並激發系統進展的新方法。
 
-##### **The Literature Review Network: An Explainable Artificial Intelligence for Systematic Literature Reviews, Meta-analyses, and Method Development**
-2408.05239v1 by Joshua Morriss, Tod Brindle, Jessica Bah Rösman, Daniel Reibsamen, Andreas Enz
+##### **Adaptive Graph of Thoughts: Test-Time Adaptive Reasoning Unifying Chain, Tree, and Graph Structures**
+2502.05078v1 by Tushar Pandey, Ara Ghukasyan, Oktay Goktas, Santosh Kumar Radha
 
-Systematic literature reviews are the highest quality of evidence in
-research. However, the review process is hindered by significant resource and
-data constraints. The Literature Review Network (LRN) is the first of its kind
-explainable AI platform adhering to PRISMA 2020 standards, designed to automate
-the entire literature review process. LRN was evaluated in the domain of
-surgical glove practices using 3 search strings developed by experts to query
-PubMed. A non-expert trained all LRN models. Performance was benchmarked
-against an expert manual review. Explainability and performance metrics
-assessed LRN's ability to replicate the experts' review. Concordance was
-measured with the Jaccard index and confusion matrices. Researchers were
-blinded to the other's results until study completion. Overlapping studies were
-integrated into an LRN-generated systematic review. LRN models demonstrated
-superior classification accuracy without expert training, achieving 84.78% and
-85.71% accuracy. The highest performance model achieved high interrater
-reliability (k = 0.4953) and explainability metrics, linking 'reduce',
-'accident', and 'sharp' with 'double-gloving'. Another LRN model covered 91.51%
-of the relevant literature despite diverging from the non-expert's judgments (k
-= 0.2174), with the terms 'latex', 'double' (gloves), and 'indication'. LRN
-outperformed the manual review (19,920 minutes over 11 months), reducing the
-entire process to 288.6 minutes over 5 days. This study demonstrates that
-explainable AI does not require expert training to successfully conduct
-PRISMA-compliant systematic literature reviews like an expert. LRN summarized
-the results of surgical glove studies and identified themes that were nearly
-identical to the clinical researchers' findings. Explainable AI can accurately
-expedite our understanding of clinical practices, potentially revolutionizing
-healthcare research.
+Large Language Models (LLMs) have demonstrated impressive reasoning
+capabilities, yet their performance is highly dependent on the prompting
+strategy and model scale. While reinforcement learning and fine-tuning have
+been deployed to boost reasoning, these approaches incur substantial
+computational and data overhead. In this work, we introduce Adaptive Graph of
+Thoughts (AGoT), a dynamic, graph-based inference framework that enhances LLM
+reasoning solely at test time. Rather than relying on fixed-step methods like
+Chain of Thought (CoT) or Tree of Thoughts (ToT), AGoT recursively decomposes
+complex queries into structured subproblems, forming an dynamic directed
+acyclic graph (DAG) of interdependent reasoning steps. By selectively expanding
+only those subproblems that require further analysis, AGoT unifies the
+strengths of chain, tree, and graph paradigms into a cohesive framework that
+allocates computation where it is most needed. We validate our approach on
+diverse benchmarks spanning multi-hop retrieval, scientific reasoning, and
+mathematical problem-solving, achieving up to 46.2% improvement on scientific
+reasoning tasks (GPQA) - comparable to gains achieved through computationally
+intensive reinforcement learning approaches and outperforming state-of-the-art
+iterative approaches. These results suggest that dynamic decomposition and
+structured recursion offer a scalable, cost-effective alternative to
+post-training modifications, paving the way for more robust, general-purpose
+reasoning in LLMs.
 
-摘要：系統性文獻回顧是研究中證據品質最高的。然而，回顧過程受到顯著資源和資料限制的阻礙。文獻回顧網路 (LRN) 是第一個遵循 PRISMA 2020 標準的可解釋 AI 平台，旨在自動化整個文獻回顧過程。LRN 在外科手套實務領域中進行評估，使用專家開發的 3 個搜尋字串來查詢 PubMed。非專家訓練所有 LRN 模型。效能以專家手動回顧作為基準。可解釋性和效能指標評估 LRN 複製專家回顧的能力。一致性以 Jaccard 指數和混淆矩陣測量。研究人員在研究完成前對彼此的結果保密。重疊的研究整合到 LRN 生成的系統性回顧中。LRN 模型在沒有專家訓練的情況下展現出優異的分類準確率，達到 84.78% 和 85.71% 的準確率。效能最高的模型達到了高評分者間信賴度 (k = 0.4953) 和可解釋性指標，將「減少」、「意外」和「銳利」與「雙重戴手套」連結在一起。另一個 LRN 模型涵蓋了 91.51% 的相關文獻，儘管與非專家的判斷不同 (k = 0.2174)，但包含了「乳膠」、「雙重」（手套）和「適應症」等詞彙。LRN 優於手動回顧（11 個月超過 19,920 分鐘），將整個過程縮短為 5 天超過 288.6 分鐘。這項研究顯示，可解釋的 AI 不需要專家訓練即可成功進行專家等級的 PRISMA 相容系統性文獻回顧。LRN 總結了外科手套研究的結果，並找出與臨床研究人員發現幾乎相同的主题。可解釋的 AI 可以準確地加快我們對臨床實務的理解，有潛力革新醫療保健研究。
+摘要：大型語言模型 (LLM) 已展現令人印象深刻的推理能力，但其效能高度依賴於提示策略和模型規模。雖然強化學習和微調已被用於提升推理，但這些方法會造成大量的運算和資料開銷。在這項工作中，我們引入了「適應性思考圖」(AGoT)，一個動態的、基於圖形的推論架構，它僅在測試時就能增強 LLM 推理。AGoT 並非依賴於鏈式思考 (CoT) 或樹狀思考 (ToT) 等固定步驟方法，而是遞迴地將複雜的查詢分解成結構化的子問題，形成一個由相互依賴的推理步驟所組成的動態有向無環圖 (DAG)。透過選擇性地僅擴充那些需要進一步分析的子問題，AGoT 將鏈式、樹狀和圖形範例的優勢統一到一個緊密的架構中，將運算分配到最需要的地方。我們在跨越多重跳躍檢索、科學推理和數學問題解決等多樣基準上驗證了我們的做法，在科學推理任務 (GPQA) 上達到了高達 46.2% 的改進，這與透過運算密集的強化學習方法所獲得的增益相當，並且優於最先進的迭代方法。這些結果表明，動態分解和結構化遞迴提供了一個可擴充、具成本效益的替代方案，用於訓練後修改，為 LLM 中更強健、更通用的推理鋪平了道路。
 
-##### **Enhancing Medical Learning and Reasoning Systems: A Boxology-Based Comparative Analysis of Design Patterns**
-2408.02709v1 by Chi Him Ng
+##### **Enhancing Knowledge Graph Construction: Evaluating with Emphasis on Hallucination, Omission, and Graph Similarity Metrics**
+2502.05239v1 by Hussam Ghanem, Christophe Cruz
 
-This study analyzes hybrid AI systems' design patterns and their
-effectiveness in clinical decision-making using the boxology framework. It
-categorizes and copares various architectures combining machine learning and
-rule-based reasoning to provide insights into their structural foundations and
-healthcare applications. Addressing two main questions, how to categorize these
-systems againts established design patterns and how to extract insights through
-comparative analysis, the study uses design patterns from software engineering
-to understand and optimize healthcare AI systems. Boxology helps identify
-commonalities and create reusable solutions, enhancing these systems'
-scalability, reliability, and performance. Five primary architectures are
-examined: REML, MLRB, RBML, RMLT, and PERML. Each has unique strengths and
-weaknesses, highlighting the need for tailored approaches in clinical tasks.
-REML excels in high-accuracy prediction for datasets with limited data; MLRB in
-handling large datasets and complex data integration; RBML in explainability
-and trustworthiness; RMLT in managing high-dimensional data; and PERML, though
-limited in analysis, shows promise in urgent care scenarios. The study
-introduces four new patterns, creates five abstract categorization patterns,
-and refines those five further to specific systems. These contributions enhance
-Boxlogy's taxonomical organization and offer novel approaches to integrating
-expert knowledge with machine learning. Boxology's structured, modular apporach
-offers significant advantages in developing and analyzing hybrid AI systems,
-revealing commonalities, and promoting reusable solutions. In conclusion, this
-study underscores hybrid AI systems' crucial role in advancing healthcare and
-Boxology's potential to drive further innovation in AI integration, ultimately
-improving clinical decision support and patient outcomes.
+Recent advancements in large language models have demonstrated significant
+potential in the automated construction of knowledge graphs from unstructured
+text. This paper builds upon our previous work [16], which evaluated various
+models using metrics like precision, recall, F1 score, triple matching, and
+graph matching, and introduces a refined approach to address the critical
+issues of hallucination and omission. We propose an enhanced evaluation
+framework incorporating BERTScore for graph similarity, setting a practical
+threshold of 95% for graph matching. Our experiments focus on the Mistral
+model, comparing its original and fine-tuned versions in zero-shot and few-shot
+settings. We further extend our experiments using examples from the KELM-sub
+training dataset, illustrating that the fine-tuned model significantly improves
+knowledge graph construction accuracy while reducing the exact hallucination
+and omission. However, our findings also reveal that the fine-tuned models
+perform worse in generalization tasks on the KELM-sub dataset. This study
+underscores the importance of comprehensive evaluation metrics in advancing the
+state-of-the-art in knowledge graph construction from textual data.
 
-摘要：本研究使用盒子學框架分析混合人工智慧系統的設計模式及其在臨床決策中的有效性。它分類並比較結合機器學習和基於規則的推理的各種架構，以深入了解其結構基礎和醫療保健應用。針對兩個主要問題，如何根據既定的設計模式對這些系統進行分類，以及如何通過比較分析提取見解，本研究使用軟體工程中的設計模式來了解和優化醫療保健人工智慧系統。盒子學有助於識別共性並建立可重複使用的解決方案，從而增強這些系統的可擴充性、可靠性和效能。檢查了五種主要的架構：REML、MLRB、RBML、RMLT 和 PERML。每種架構都有獨特的優缺點，強調了在臨床任務中需要量身打造的方法。REML 在資料有限的資料集中表現出高精度的預測；MLRB 在處理大型資料集和複雜資料整合方面表現出色；RBML 在可解釋性和可信度方面表現出色；RMLT 在管理高維資料方面表現出色；而 PERML 儘管在分析方面有限，但在緊急照護場景中表現出潛力。本研究引入了四種新模式，建立了五種抽象分類模式，並進一步將這五種模式細化為具體的系統。這些貢獻增強了盒子學的分類組織，並提供了將專家知識與機器學習整合的新方法。盒子學的結構化、模組化方法在開發和分析混合人工智慧系統、揭示共性以及推廣可重複使用的解決方案方面具有顯著優勢。總之，本研究強調了混合人工智慧系統在推進醫療保健中的關鍵作用，以及盒子學在推動人工智慧整合進一步創新方面的潛力，最終改善臨床決策支援和患者的治療成果。
+摘要：大型語言模型的最新進展已證明在從非結構化文字自動建構知識圖譜方面具有顯著的潛力。本文建立在我們先前的研究 [16] 之上，該研究使用準確度、召回率、F1 分數、三元組匹配和圖形匹配等指標評估各種模型，並引入了一種改進的方法來解決幻覺和遺漏的關鍵問題。我們提出一個增強的評估框架，結合 BERTScore 來進行圖形相似性，並將圖形匹配的實際閾值設定為 95%。我們的實驗重點在 Mistral 模型上，比較其原始版本和微調版本在零次學習和少量學習的設定中。我們進一步使用 KELM-sub 訓練資料集中的範例來擴展我們的實驗，說明微調後的模型顯著提高了知識圖譜建構的準確度，同時減少了精確的幻覺和遺漏。然而，我們的研究結果也顯示，微調後的模型在 KELM-sub 資料集上的泛化任務表現較差。這項研究強調了全面評估指標在推進從文字資料建構知識圖譜的最新技術方面的重要性。
 
-##### **Bayesian Kolmogorov Arnold Networks (Bayesian_KANs): A Probabilistic Approach to Enhance Accuracy and Interpretability**
-2408.02706v1 by Masoud Muhammed Hassan
+##### **Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research**
+2502.04644v1 by Junde Wu, Jiayuan Zhu, Yuyuan Liu
 
-Because of its strong predictive skills, deep learning has emerged as an
-essential tool in many industries, including healthcare. Traditional deep
-learning models, on the other hand, frequently lack interpretability and omit
-to take prediction uncertainty into account two crucial components of clinical
-decision making. In order to produce explainable and uncertainty aware
-predictions, this study presents a novel framework called Bayesian Kolmogorov
-Arnold Networks (BKANs), which combines the expressive capacity of Kolmogorov
-Arnold Networks with Bayesian inference. We employ BKANs on two medical
-datasets, which are widely used benchmarks for assessing machine learning
-models in medical diagnostics: the Pima Indians Diabetes dataset and the
-Cleveland Heart Disease dataset. Our method provides useful insights into
-prediction confidence and decision boundaries and outperforms traditional deep
-learning models in terms of prediction accuracy. Moreover, BKANs' capacity to
-represent aleatoric and epistemic uncertainty guarantees doctors receive more
-solid and trustworthy decision support. Our Bayesian strategy improves the
-interpretability of the model and considerably minimises overfitting, which is
-important for tiny and imbalanced medical datasets, according to experimental
-results. We present possible expansions to further use BKANs in more
-complicated multimodal datasets and address the significance of these
-discoveries for future research in building reliable AI systems for healthcare.
-This work paves the way for a new paradigm in deep learning model deployment in
-vital sectors where transparency and reliability are crucial.
+We introduce Agentic Reasoning, a framework that enhances large language
+model (LLM) reasoning by integrating external tool-using agents. Unlike
+conventional LLM-based reasoning approaches, which rely solely on internal
+inference, Agentic Reasoning dynamically engages web search, code execution,
+and structured reasoning-context memory to solve complex problems requiring
+deep research and multi-step logical deduction. Our framework introduces the
+Mind Map agent, which constructs a structured knowledge graph to track logical
+relationships, improving deductive reasoning. Additionally, the integration of
+web-search and coding agents enables real-time retrieval and computational
+analysis, enhancing reasoning accuracy and decision-making. Evaluations on
+PhD-level scientific reasoning (GPQA) and domain-specific deep research tasks
+demonstrate that our approach significantly outperforms existing models,
+including leading retrieval-augmented generation (RAG) systems and
+closed-source LLMs. Moreover, our results indicate that agentic reasoning
+improves expert-level knowledge synthesis, test-time scalability, and
+structured problem-solving. The code is at:
+https://github.com/theworldofagents/Agentic-Reasoning.
 
-摘要：由於其強大的預測能力，深度學習已成為許多產業中不可或缺的工具，包括醫療保健。然而，傳統的深度學習模型通常缺乏可解釋性，並且忽略了將預測不確定性納入考量，而這兩個因素是臨床決策制定的關鍵組成部分。為了產生可解釋且具有不確定性意識的預測，本研究提出了一個名為貝氏柯爾莫哥洛夫阿諾德網路 (BKAN) 的新架構，它結合了柯爾莫哥洛夫阿諾德網路的表達能力與貝氏推論。我們在兩個醫學資料集上使用 BKAN，這些資料集是評估機器學習模型在醫學診斷中的廣泛使用基準：皮馬印第安人糖尿病資料集和克里夫蘭心臟病資料集。我們的模型提供了對預測信心和決策邊界的有益見解，並且在預測準確度方面優於傳統的深度學習模型。此外，BKAN 表現隨機和認識不確定性的能力，可確保醫生獲得更可靠且值得信賴的決策支援。根據實驗結果，我們的貝氏策略提高了模型的可解釋性，並大幅減少了過度擬合，這對於小型且不平衡的醫學資料集非常重要。我們提出了可能的擴充功能，以進一步將 BKAN 用於更複雜的多模式資料集，並探討這些發現對於未來建立可靠的醫療保健 AI 系統研究的重要性。這項工作為深度學習模型部署在透明度和可靠性至關重要的重要領域中開啟了一個新的典範。
+摘要：我們引入了代理推理，一個透過整合外部工具使用代理來增強大型語言模型 (LLM) 推理的框架。與僅依賴於內部推論的傳統基於 LLM 的推理方法不同，代理推理動態地運用網路搜尋、程式碼執行和結構化推理情境記憶來解決需要深入研究和多步驟邏輯推論的複雜問題。我們的框架引入了心智圖代理，它建立一個結構化的知識圖譜來追蹤邏輯關係，改善演繹推理。此外，整合網路搜尋和編碼代理能進行即時擷取和運算分析，增強推理準確度和決策制定。在博士等級科學推理 (GPQA) 和特定領域的深入研究任務上的評估顯示，我們的做法明顯優於現有模型，包括領先的檢索增強生成 (RAG) 系統和封閉原始碼 LLM。此外，我們的結果顯示，代理推理改進了專家級知識綜合、測試時間可擴充性和結構化問題解決。程式碼在：https://github.com/theworldofagents/Agentic-Reasoning。
 
-##### **MLtoGAI: Semantic Web based with Machine Learning for Enhanced Disease Prediction and Personalized Recommendations using Generative AI**
-2407.20284v1 by Shyam Dongre, Ritesh Chandra, Sonali Agarwal
+##### **Position-aware Automatic Circuit Discovery**
+2502.04577v1 by Tal Haklay, Hadas Orgad, David Bau, Aaron Mueller, Yonatan Belinkov
 
-In modern healthcare, addressing the complexities of accurate disease
-prediction and personalized recommendations is both crucial and challenging.
-This research introduces MLtoGAI, which integrates Semantic Web technology with
-Machine Learning (ML) to enhance disease prediction and offer user-friendly
-explanations through ChatGPT. The system comprises three key components: a
-reusable disease ontology that incorporates detailed knowledge about various
-diseases, a diagnostic classification model that uses patient symptoms to
-detect specific diseases accurately, and the integration of Semantic Web Rule
-Language (SWRL) with ontology and ChatGPT to generate clear, personalized
-health advice. This approach significantly improves prediction accuracy and
-ensures results that are easy to understand, addressing the complexity of
-diseases and diverse symptoms. The MLtoGAI system demonstrates substantial
-advancements in accuracy and user satisfaction, contributing to developing more
-intelligent and accessible healthcare solutions. This innovative approach
-combines the strengths of ML algorithms with the ability to provide
-transparent, human-understandable explanations through ChatGPT, achieving
-significant improvements in prediction accuracy and user comprehension. By
-leveraging semantic technology and explainable AI, the system enhances the
-accuracy of disease prediction and ensures that the recommendations are
-relevant and easily understood by individual patients. Our research highlights
-the potential of integrating advanced technologies to overcome existing
-challenges in medical diagnostics, paving the way for future developments in
-intelligent healthcare systems. Additionally, the system is validated using 200
-synthetic patient data records, ensuring robust performance and reliability.
+A widely used strategy to discover and understand language model mechanisms
+is circuit analysis. A circuit is a minimal subgraph of a model's computation
+graph that executes a specific task. We identify a gap in existing circuit
+discovery methods: they assume circuits are position-invariant, treating model
+components as equally relevant across input positions. This limits their
+ability to capture cross-positional interactions or mechanisms that vary across
+positions. To address this gap, we propose two improvements to incorporate
+positionality into circuits, even on tasks containing variable-length examples.
+First, we extend edge attribution patching, a gradient-based method for circuit
+discovery, to differentiate between token positions. Second, we introduce the
+concept of a dataset schema, which defines token spans with similar semantics
+across examples, enabling position-aware circuit discovery in datasets with
+variable length examples. We additionally develop an automated pipeline for
+schema generation and application using large language models. Our approach
+enables fully automated discovery of position-sensitive circuits, yielding
+better trade-offs between circuit size and faithfulness compared to prior work.
+
+摘要：廣泛用於發現和了解語言模型機制的策略是電路分析。電路是模型計算圖的最小子圖，可執行特定任務。我們找出電路發現方法中的一個缺口：它們假設電路與位置無關，將模型組件視為在輸入位置中同樣相關。這限制了它們捕捉跨位置互動或在不同位置中變化的機制的能力。為了解決這個缺口，我們提出兩項改進，將位置性納入電路中，即使在包含變長範例的任務中也是如此。首先，我們擴充邊緣屬性修補，一種基於梯度的電路發現方法，以區分符號位置。其次，我們引入了資料集架構的概念，它定義了在範例中具有類似語義的符號跨距，使我們可以在具有變長範例的資料集中進行與位置相關的電路發現。此外，我們開發了一個自動化管線，用於使用大型語言模型進行架構生成和應用。我們的做法能讓位置敏感電路的發現完全自動化，與先前的研究相比，在電路大小和忠實度之間產生了更好的權衡。
+
+##### **Heterogeneous Swarms: Jointly Optimizing Model Roles and Weights for Multi-LLM Systems**
+2502.04510v1 by Shangbin Feng, Zifeng Wang, Palash Goyal, Yike Wang, Weijia Shi, Huang Xia, Hamid Palangi, Luke Zettlemoyer, Yulia Tsvetkov, Chen-Yu Lee, Tomas Pfister
+
+We propose Heterogeneous Swarms, an algorithm to design multi-LLM systems by
+jointly optimizing model roles and weights. We represent multi-LLM systems as
+directed acyclic graphs (DAGs) of LLMs with topological message passing for
+collaborative generation. Given a pool of LLM experts and a utility function,
+Heterogeneous Swarms employs two iterative steps: role-step and weight-step.
+For role-step, we interpret model roles as learning a DAG that specifies the
+flow of inputs and outputs between LLMs. Starting from a swarm of random
+continuous adjacency matrices, we decode them into discrete DAGs, call the LLMs
+in topological order, evaluate on the utility function (e.g. accuracy on a
+task), and optimize the adjacency matrices with particle swarm optimization
+based on the utility score. For weight-step, we assess the contribution of
+individual LLMs in the multi-LLM systems and optimize model weights with swarm
+intelligence. We propose JFK-score to quantify the individual contribution of
+each LLM in the best-found DAG of the role-step, then optimize model weights
+with particle swarm optimization based on the JFK-score. Experiments
+demonstrate that Heterogeneous Swarms outperforms 15 role- and/or weight-based
+baselines by 18.5% on average across 12 tasks. Further analysis reveals that
+Heterogeneous Swarms discovers multi-LLM systems with heterogeneous model roles
+and substantial collaborative gains, and benefits from the diversity of
+language models.
 
-摘要：在現代醫療保健中，解決準確疾病預測和個性化建議的複雜性既至關重要又具有挑戰性。本研究引入了 MLtoGAI，它將語義網路技術與機器學習 (ML) 相結合，以增強疾病預測並透過 ChatGPT 提供使用者友善的說明。該系統包含三個關鍵組成部分：一個可重複使用的疾病本体，其中包含有關各種疾病的詳細知識；一個診斷分類模型，它使用患者症狀來準確檢測特定疾病；以及語義網路規則語言 (SWRL) 與本体和 ChatGPT 的整合，以產生清晰、個性化的健康建議。這種方法顯著提高了預測準確性，並確保了易於理解的結果，解決了疾病和不同症狀的複雜性。MLtoGAI 系統展示了準確性和使用者滿意度的實質性進步，有助於開發更智慧且更易於取得的醫療保健解決方案。這種創新的方法結合了 ML 演算法的優點，以及透過 ChatGPT 提供透明且人類可以理解的說明的能力，在預測準確性和使用者理解方面取得了顯著的進步。透過利用語義技術和可解釋的 AI，該系統提高了疾病預測的準確性，並確保了建議與個別患者相關且易於理解。我們的研究強調了整合先進技術以克服醫療診斷中現有挑戰的潛力，為智慧醫療保健系統的未來發展鋪路。此外，該系統使用 200 個合成患者資料記錄進行驗證，確保了穩健的效能和可靠性。
+摘要：<paragraph>我們提出異質群體，一種演算法，透過共同最佳化模型角色和權重來設計多 LLM 系統。我們將多 LLM 系統表示為 LLM 的有向非循環圖 (DAG)，並透過拓撲訊息傳遞進行協作產生。給定一組 LLM 專家和一個效用函數，異質群體使用兩個反覆步驟：角色步驟和權重步驟。對於角色步驟，我們將模型角色解釋為學習一個 DAG，它指定 LLM 之間輸入和輸出的流動。從一組隨機連續鄰接矩陣開始，我們將它們解碼為離散 DAG，以拓撲順序呼叫 LLM，根據效用函數（例如任務的準確度）進行評估，並根據效用分數使用粒子群最佳化最佳化鄰接矩陣。對於權重步驟，我們評估個別 LLM 在多 LLM 系統中的貢獻，並使用群體智慧最佳化模型權重。我們提出 JFK 分數來量化每個 LLM 在角色步驟中找到的最佳 DAG 中的個別貢獻，然後根據 JFK 分數使用粒子群最佳化最佳化模型權重。實驗表明，異質群體在 12 項任務中平均比 15 個基於角色和/或權重的基線高出 18.5%。進一步的分析表明，異質群體發現具有異質模型角色和大量協作收益的多 LLM 系統，並受益於語言模型的多樣性。</paragraph>
 
-##### **Introducing δ-XAI: a novel sensitivity-based method for local AI explanations**
-2407.18343v2 by Alessandro De Carlo, Enea Parimbelli, Nicola Melillo, Giovanna Nicora
+##### **MedRAG: Enhancing Retrieval-augmented Generation with Knowledge Graph-Elicited Reasoning for Healthcare Copilot**
+2502.04413v1 by Xuejiao Zhao, Siyan Liu, Su-Yin Yang, Chunyan Miao
 
-Explainable Artificial Intelligence (XAI) is central to the debate on
-integrating Artificial Intelligence (AI) and Machine Learning (ML) algorithms
-into clinical practice. High-performing AI/ML models, such as ensemble learners
-and deep neural networks, often lack interpretability, hampering clinicians'
-trust in their predictions. To address this, XAI techniques are being developed
-to describe AI/ML predictions in human-understandable terms. One promising
-direction is the adaptation of sensitivity analysis (SA) and global sensitivity
-analysis (GSA), which inherently rank model inputs by their impact on
-predictions. Here, we introduce a novel delta-XAI method that provides local
-explanations of ML model predictions by extending the delta index, a GSA
-metric. The delta-XAI index assesses the impact of each feature's value on the
-predicted output for individual instances in both regression and classification
-problems. We formalize the delta-XAI index and provide code for its
-implementation. The delta-XAI method was evaluated on simulated scenarios using
-linear regression models, with Shapley values serving as a benchmark. Results
-showed that the delta-XAI index is generally consistent with Shapley values,
-with notable discrepancies in models with highly impactful or extreme feature
-values. The delta-XAI index demonstrated higher sensitivity in detecting
-dominant features and handling extreme feature values. Qualitatively, the
-delta-XAI provides intuitive explanations by leveraging probability density
-functions, making feature rankings clearer and more explainable for
-practitioners. Overall, the delta-XAI method appears promising for robustly
-obtaining local explanations of ML model predictions. Further investigations in
-real-world clinical settings will be conducted to evaluate its impact on
-AI-assisted clinical workflows.
+Retrieval-augmented generation (RAG) is a well-suited technique for
+retrieving privacy-sensitive Electronic Health Records (EHR). It can serve as a
+key module of the healthcare copilot, helping reduce misdiagnosis for
+healthcare practitioners and patients. However, the diagnostic accuracy and
+specificity of existing heuristic-based RAG models used in the medical domain
+are inadequate, particularly for diseases with similar manifestations. This
+paper proposes MedRAG, a RAG model enhanced by knowledge graph (KG)-elicited
+reasoning for the medical domain that retrieves diagnosis and treatment
+recommendations based on manifestations. MedRAG systematically constructs a
+comprehensive four-tier hierarchical diagnostic KG encompassing critical
+diagnostic differences of various diseases. These differences are dynamically
+integrated with similar EHRs retrieved from an EHR database, and reasoned
+within a large language model. This process enables more accurate and specific
+decision support, while also proactively providing follow-up questions to
+enhance personalized medical decision-making. MedRAG is evaluated on both a
+public dataset DDXPlus and a private chronic pain diagnostic dataset (CPDD)
+collected from Tan Tock Seng Hospital, and its performance is compared against
+various existing RAG methods. Experimental results show that, leveraging the
+information integration and relational abilities of the KG, our MedRAG provides
+more specific diagnostic insights and outperforms state-of-the-art models in
+reducing misdiagnosis rates. Our code will be available at
+https://github.com/SNOWTEAM2023/MedRAG
 
-摘要：可解釋人工智慧 (XAI) 是將人工智慧 (AI) 和機器學習 (ML) 演算法整合到臨床實務中的辯論核心。高執行效能的 AI/ML 模型，例如整體學習器和深度神經網路，通常缺乏可解釋性，阻礙臨床醫生對其預測的信任。為了解決這個問題，正在開發 XAI 技術，以人類可以理解的術語描述 AI/ML 預測。一個有希望的方向是採用敏感度分析 (SA) 和全球敏感度分析 (GSA)，它們本質上會依據模型輸入對預測的影響來對其進行排名。在此，我們介紹一種新的 delta-XAI 方法，透過擴充 GSA 指標 delta 指數來提供 ML 模型預測的局部解釋。delta-XAI 指數評估每個特徵值對回歸和分類問題中個別例項的預測輸出之影響。我們將 delta-XAI 指數形式化，並提供其實作的程式碼。使用線性回歸模型對模擬情境評估 delta-XAI 方法，並以 Shapley 值作為基準。結果顯示 delta-XAI 指數通常與 Shapley 值一致，但在具有高度影響力或極端特徵值的模型中存在顯著差異。delta-XAI 指數在偵測主要特徵和處理極端特徵值方面表現出更高的敏感度。定性地來說，delta-XAI 透過利用機率密度函數提供直觀的解釋，使特徵排名更清晰且對從業人員來說更具可解釋性。總體而言，delta-XAI 方法對於穩健地取得 ML 模型預測的局部解釋似乎很有希望。將在真實世界的臨床環境中進行進一步調查，以評估其對 AI 輔助臨床工作流程的影響。
+摘要：檢索增強生成 (RAG) 是一種適用於檢索隱私敏感的電子健康記錄 (EHR) 的技術。它可以作為醫療保健副駕駛的一個關鍵模組，協助減少醫療保健從業人員和患者的誤診。然而，在醫療領域中使用的現有基於啟發法的 RAG 模型的診斷準確性和特異性不足，特別是對於具有類似表現的疾病。本文提出 MedRAG，一種由知識圖譜 (KG) 引發的推理增強的 RAG 模型，用於醫療領域，它根據表現檢索診斷和治療建議。MedRAG 系統性地構建了一個全面的四層階層式診斷 KG，涵蓋各種疾病的關鍵診斷差異。這些差異與從 EHR 資料庫中檢索到的類似 EHR 動態整合，並在大型語言模型中進行推理。這個過程可以實現更準確和具體的決策支援，同時主動提供後續問題，以增強個人化醫療決策制定。MedRAG 在公共資料集 DDXPlus 和從陳篤生醫院收集的私人慢性疼痛診斷資料集 (CPDD) 上進行評估，並將其效能與各種現有 RAG 方法進行比較。實驗結果顯示，利用 KG 的資訊整合和關係能力，我們的 MedRAG 提供了更具體的診斷見解，並在降低誤診率方面優於最先進的模型。我們的程式碼將在 https://github.com/SNOWTEAM2023/MedRAG 提供
 
-##### **Enhanced Deep Learning Methodologies and MRI Selection Techniques for Dementia Diagnosis in the Elderly Population**
-2407.17324v2 by Nikolaos Ntampakis, Konstantinos Diamantaras, Ioanna Chouvarda, Vasileios Argyriou, Panagiotis Sarigianndis
+##### **Ontology-Guided, Hybrid Prompt Learning for Generalization in Knowledge Graph Question Answering**
+2502.03992v1 by Longquan Jiang, Junbo Huang, Cedric Möller, Ricardo Usbeck
 
-Dementia, a debilitating neurological condition affecting millions worldwide,
-presents significant diagnostic challenges. In this work, we introduce a novel
-methodology for the classification of demented and non-demented elderly
-patients using 3D brain Magnetic Resonance Imaging (MRI) scans. Our approach
-features a unique technique for selectively processing MRI slices, focusing on
-the most relevant brain regions and excluding less informative sections. This
-methodology is complemented by a confidence-based classification committee
-composed of three custom deep learning models: Dem3D ResNet, Dem3D CNN, and
-Dem3D EfficientNet. These models work synergistically to enhance
-decision-making accuracy, leveraging their collective strengths. Tested on the
-Open Access Series of Imaging Studies(OASIS) dataset, our method achieved an
-impressive accuracy of 94.12%, surpassing existing methodologies. Furthermore,
-validation on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset
-confirmed the robustness and generalizability of our approach. The use of
-explainable AI (XAI) techniques and comprehensive ablation studies further
-substantiate the effectiveness of our techniques, providing insights into the
-decision-making process and the importance of our methodology. This research
-offers a significant advancement in dementia diagnosis, providing a highly
-accurate and efficient tool for clinical applications.
+Most existing Knowledge Graph Question Answering (KGQA) approaches are
+designed for a specific KG, such as Wikidata, DBpedia or Freebase. Due to the
+heterogeneity of the underlying graph schema, topology and assertions, most
+KGQA systems cannot be transferred to unseen Knowledge Graphs (KGs) without
+resource-intensive training data. We present OntoSCPrompt, a novel Large
+Language Model (LLM)-based KGQA approach with a two-stage architecture that
+separates semantic parsing from KG-dependent interactions. OntoSCPrompt first
+generates a SPARQL query structure (including SPARQL keywords such as SELECT,
+ASK, WHERE and placeholders for missing tokens) and then fills them with
+KG-specific information. To enhance the understanding of the underlying KG, we
+present an ontology-guided, hybrid prompt learning strategy that integrates KG
+ontology into the learning process of hybrid prompts (e.g., discrete and
+continuous vectors). We also present several task-specific decoding strategies
+to ensure the correctness and executability of generated SPARQL queries in both
+stages. Experimental results demonstrate that OntoSCPrompt performs as well as
+SOTA approaches without retraining on a number of KGQA datasets such as CWQ,
+WebQSP and LC-QuAD 1.0 in a resource-efficient manner and can generalize well
+to unseen domain-specific KGs like DBLP-QuAD and CoyPu KG Code:
+\href{https://github.com/LongquanJiang/OntoSCPrompt}{https://github.com/LongquanJiang/OntoSCPrompt}
 
-摘要：失智症是一種影響全球數百萬人的衰弱性神經疾病，在診斷上具有重大挑戰。在這項工作中，我們提出了一種新的方法，用於對失智和非失智老年患者進行分類，使用 3D 大腦磁振造影 (MRI) 掃描。我們的做法採用了一種獨特技術，用於選擇性處理 MRI 切片，重點關注最相關的大腦區域，並排除信息量較少的部分。這種方法由一個基於信心的分類委員會補充，該委員會由三個自定義深度學習模型組成：Dem3D ResNet、Dem3D CNN 和 Dem3D EfficientNet。這些模型協同工作以增強決策的準確性，利用它們的集體優勢。在影像研究開放存取系列 (OASIS) 資料集上進行測試，我們的模型達到了 94.12% 的驚人準確度，超過了現有方法。此外，在阿茲海默症神經影像倡議 (ADNI) 資料集上的驗證證實了我們方法的穩健性和普遍性。可解釋 AI (XAI) 技術和全面的消融研究進一步證實了我們技術的有效性，提供了對決策過程和我們方法重要性的見解。這項研究為失智症診斷提供了重大進展，為臨床應用提供了一個高度準確且高效的工具。
+摘要：現有的知識圖譜問答（KGQA）方法大多是為特定 KG 而設計的，例如 Wikidata、DBpedia 或 Freebase。由於底層圖形模式、拓撲和斷言的異質性，大多數 KGQA 系統無法在沒有資源密集型訓練資料的情況下轉移到未見過的知識圖譜（KG）。我們提出 OntoSCPrompt，這是一種基於大型語言模型（LLM）的新型 KGQA 方法，採用兩階段架構，將語義解析與依賴 KG 的互動分開。OntoSCPrompt 首先生成 SPARQL 查詢結構（包括 SPARQL 關鍵字，例如 SELECT、ASK、WHERE 和缺失令牌的佔位符），然後用 KG 特定的資訊填寫它們。為了增強對底層 KG 的理解，我們提出了一種由本体指導的混合提示學習策略，將 KG 本体整合到混合提示（例如，離散和連續向量）的學習過程中。我們還提出了多種特定任務的解碼策略，以確保在兩個階段中生成的 SPARQL 查詢的正確性和可執行性。實驗結果表明，OntoSCPrompt 在 CWQ、WebQSP 和 LC-QuAD 1.0 等多個 KGQA 資料集上執行時，效能與 SOTA 方法一樣好，且資源使用效率高，並且可以很好地概括到未見過的特定領域 KG，例如 DBLP-QuAD 和 CoyPu KG Code：
+\href{https://github.com/LongquanJiang/OntoSCPrompt}{https://github.com/LongquanJiang/OntoSCPrompt}
 
-##### **Using Large Language Models to Compare Explainable Models for Smart Home Human Activity Recognition**
-2408.06352v1 by Michele Fiori, Gabriele Civitarese, Claudio Bettini
+##### **Multimodal Medical Code Tokenizer**
+2502.04397v2 by Xiaorui Su, Shvat Messica, Yepeng Huang, Ruth Johnson, Lukas Fesser, Shanghua Gao, Faryad Sahneh, Marinka Zitnik
 
-Recognizing daily activities with unobtrusive sensors in smart environments
-enables various healthcare applications. Monitoring how subjects perform
-activities at home and their changes over time can reveal early symptoms of
-health issues, such as cognitive decline. Most approaches in this field use
-deep learning models, which are often seen as black boxes mapping sensor data
-to activities. However, non-expert users like clinicians need to trust and
-understand these models' outputs. Thus, eXplainable AI (XAI) methods for Human
-Activity Recognition have emerged to provide intuitive natural language
-explanations from these models. Different XAI methods generate different
-explanations, and their effectiveness is typically evaluated through user
-surveys, that are often challenging in terms of costs and fairness. This paper
-proposes an automatic evaluation method using Large Language Models (LLMs) to
-identify, in a pool of candidates, the best XAI approach for non-expert users.
-Our preliminary results suggest that LLM evaluation aligns with user surveys.
+Foundation models trained on patient electronic health records (EHRs) require
+tokenizing medical data into sequences of discrete vocabulary items. Existing
+tokenizers treat medical codes from EHRs as isolated textual tokens. However,
+each medical code is defined by its textual description, its position in
+ontological hierarchies, and its relationships to other codes, such as disease
+co-occurrences and drug-treatment associations. Medical vocabularies contain
+more than 600,000 codes with critical information for clinical reasoning. We
+introduce MedTok, a multimodal medical code tokenizer that uses the text
+descriptions and relational context of codes. MedTok processes text using a
+language model encoder and encodes the relational structure with a graph
+encoder. It then quantizes both modalities into a unified token space,
+preserving modality-specific and cross-modality information. We integrate
+MedTok into five EHR models and evaluate it on operational and clinical tasks
+across in-patient and out-patient datasets, including outcome prediction,
+diagnosis classification, drug recommendation, and risk stratification.
+Swapping standard EHR tokenizers with MedTok improves AUPRC across all EHR
+models, by 4.10% on MIMIC-III, 4.78% on MIMIC-IV, and 11.30% on EHRShot, with
+the largest gains in drug recommendation. Beyond EHR modeling, we demonstrate
+using MedTok tokenizer with medical QA systems. Our results demonstrate the
+potential of MedTok as a unified tokenizer for medical codes, improving
+tokenization for medical foundation models.
 
-摘要：藉由智慧環境中不引人注目的感測器辨識日常活動，能啟用各種醫療保健應用。監控受試者在家中如何執行活動，以及其隨著時間的變化，可以揭示健康問題的早期症狀，例如認知能力下降。此領域中的大多數方法都使用深度學習模型，這些模型通常被視為將感測器資料對應至活動的黑盒子。然而，非專家使用者（例如臨床醫師）需要信任並了解這些模型的輸出。因此，人類活動辨識的可解釋 AI (XAI) 方法應運而生，以提供來自這些模型的直覺自然語言說明。不同的 XAI 方法會產生不同的說明，而其有效性通常透過使用者調查來評估，這在成本和公平性方面通常具有挑戰性。本文提出使用大型語言模型 (LLM) 的自動評估方法，以在候選者中找出最適合非專家使用者的 XAI 方法。我們的初步結果表明，LLM 評估與使用者調查一致。
+摘要：<paragraph>在患者电子健康记录 (EHR) 上训练的基础模型需要将医学数据标记为离散词汇项序列。现有的标记器将 EHR 中的医学代码视为孤立的文本标记。然而，每个医学代码都由其文本描述、在本体层次结构中的位置以及与其他代码的关系（例如疾病共现和药物治疗关联）来定义。医学词汇表包含超过 600,000 个代码，这些代码包含临床推理的关键信息。我们引入了 MedTok，这是一种多模态医学代码标记器，它使用文本描述和代码的关系上下文。MedTok 使用语言模型编码器处理文本，并使用图编码器对关系结构进行编码。然后，它将这两种模态量化为一个统一的标记空间，保留特定于模态和跨模态的信息。我们将 MedTok 集成到五个 EHR 模型中，并在住院和门诊数据集（包括结果预测、诊断分类、药物推荐和风险分层）上对其实施操作和临床任务进行评估。用 MedTok 替换标准 EHR 标记器可提高所有 EHR 模型的 AUPRC，在 MIMIC-III 上提高 4.10%，在 MIMIC-IV 上提高 4.78%，在 EHRShot 上提高 11.30%，其中药物推荐的增益最大。除了 EHR 建模之外，我们还演示了将 MedTok 标记器与医学问答系统结合使用。我们的结果证明了 MedTok 作为医学代码的统一标记器的潜力，改进了医学基础模型的标记化。</paragraph>
 
-##### **Explainable AI-based Intrusion Detection System for Industry 5.0: An Overview of the Literature, associated Challenges, the existing Solutions, and Potential Research Directions**
-2408.03335v1 by Naseem Khan, Kashif Ahmad, Aref Al Tamimi, Mohammed M. Alani, Amine Bermak, Issa Khalil
+##### **Division-of-Thoughts: Harnessing Hybrid Language Model Synergy for Efficient On-Device Agents**
+2502.04392v1 by Chenyang Shao, Xinyuan Hu, Yutang Lin, Fengli Xu
 
-Industry 5.0, which focuses on human and Artificial Intelligence (AI)
-collaboration for performing different tasks in manufacturing, involves a
-higher number of robots, Internet of Things (IoTs) devices and
-interconnections, Augmented/Virtual Reality (AR), and other smart devices. The
-huge involvement of these devices and interconnection in various critical
-areas, such as economy, health, education and defense systems, poses several
-types of potential security flaws. AI itself has been proven a very effective
-and powerful tool in different areas of cybersecurity, such as intrusion
-detection, malware detection, and phishing detection, among others. Just as in
-many application areas, cybersecurity professionals were reluctant to accept
-black-box ML solutions for cybersecurity applications. This reluctance pushed
-forward the adoption of eXplainable Artificial Intelligence (XAI) as a tool
-that helps explain how decisions are made in ML-based systems. In this survey,
-we present a comprehensive study of different XAI-based intrusion detection
-systems for industry 5.0, and we also examine the impact of explainability and
-interpretability on Cybersecurity practices through the lens of Adversarial
-XIDS (Adv-XIDS) approaches. Furthermore, we analyze the possible opportunities
-and challenges in XAI cybersecurity systems for industry 5.0 that elicit future
-research toward XAI-based solutions to be adopted by high-stakes industry 5.0
-applications. We believe this rigorous analysis will establish a foundational
-framework for subsequent research endeavors within the specified domain.
+The rapid expansion of web content has made on-device AI assistants
+indispensable for helping users manage the increasing complexity of online
+tasks. The emergent reasoning ability in large language models offer a
+promising path for next-generation on-device AI agents. However, deploying
+full-scale Large Language Models (LLMs) on resource-limited local devices is
+challenging. In this paper, we propose Division-of-Thoughts (DoT), a
+collaborative reasoning framework leveraging the synergy between locally
+deployed Smaller-scale Language Models (SLMs) and cloud-based LLMs. DoT
+leverages a Task Decomposer to elicit the inherent planning abilities in
+language models to decompose user queries into smaller sub-tasks, which allows
+hybrid language models to fully exploit their respective strengths. Besides,
+DoT employs a Task Scheduler to analyze the pair-wise dependency of sub-tasks
+and create a dependency graph, facilitating parallel reasoning of sub-tasks and
+the identification of key steps. To allocate the appropriate model based on the
+difficulty of sub-tasks, DoT leverages a Plug-and-Play Adapter, which is an
+additional task head attached to the SLM that does not alter the SLM's
+parameters. To boost adapter's task allocation capability, we propose a
+self-reinforced training method that relies solely on task execution feedback.
+Extensive experiments on various benchmarks demonstrate that our DoT
+significantly reduces LLM costs while maintaining competitive reasoning
+accuracy. Specifically, DoT reduces the average reasoning time and API costs by
+66.12% and 83.57%, while achieving comparable reasoning accuracy with the best
+baseline methods.
 
-摘要：工業 5.0 著重於人類與人工智慧 (AI) 合作執行製造中的不同任務，涉及更多機器人、物聯網 (IoT) 裝置和互連、擴增/虛擬實境 (AR) 和其他智慧裝置。這些裝置和互連在經濟、醫療保健、教育和國防系統等各種關鍵領域的廣泛參與，引發了多種類型的潛在安全漏洞。AI 本身已被證明是網路安全不同領域中非常有效且強大的工具，例如入侵偵測、惡意軟體偵測和網路釣魚偵測等。就像在許多應用領域一樣，網路安全專業人員不願意接受黑盒 ML 解決方案來應用於網路安全。這種不願意促使可解釋人工智慧 (XAI) 作為一種工具被採用，有助於說明在基於 ML 的系統中如何做出決策。在這項調查中，我們對工業 5.0 的不同基於 XAI 的入侵偵測系統進行了全面的研究，並且我們也透過對抗式 XIDS (Adv-XIDS) 方法的觀點來探討可解釋性和可詮釋性對網路安全實務的影響。此外，我們分析了工業 5.0 的 XAI 網路安全系統中可能存在的機會和挑戰，引發了未來針對 XAI 基礎解決方案的研究，以供高風險的工業 5.0 應用採用。我們相信這項嚴謹的分析將為指定領域內的後續研究工作建立基礎架構。
+摘要：<paragraph>網頁內容快速擴充，使得行動裝置上的 AI 助理在協助使用者管理日益複雜的線上工作上變得不可或缺。大型語言模型中浮現的推理能力為新一代行動裝置上的 AI 代理提供了一條有希望的途徑。然而，在資源有限的本機裝置上部署全規模的大型語言模型 (LLM) 是一項挑戰。在本文中，我們提出了思想分工 (DoT)，一個協作推理框架，利用了本地部署的小型語言模型 (SLM) 與雲端 LLM 之間的協同效應。DoT 利用任務分解器引出語言模型中固有的規劃能力，將使用者查詢分解成較小的子任務，這允許混合語言模型充分發揮其各自的優勢。此外，DoT 雇用了一個任務排程器來分析子任務的成對依賴性並建立一個依賴性圖，促進子任務的並行推理和關鍵步驟的識別。為了根據子任務的難度分配適當的模型，DoT 利用了即插即用適配器，這是一個附加在 SLM 上的任務頭，不會改變 SLM 的參數。為了提升適配器的任務分配能力，我們提出了一種自我強化訓練方法，它僅依賴於任務執行回饋。在各種基準上的廣泛實驗表明，我們的 DoT 大幅降低了 LLM 成本，同時維持了有競爭力的推理準確度。具體來說，DoT 將平均推理時間和 API 成本分別降低了 66.12% 和 83.57%，同時達到了與最佳基準方法相當的推理準確度。</paragraph>
 
-##### **A Comparative Study on Automatic Coding of Medical Letters with Explainability**
-2407.13638v1 by Jamie Glen, Lifeng Han, Paul Rayson, Goran Nenadic
+##### **Boosting Knowledge Graph-based Recommendations through Confidence-Aware Augmentation with Large Language Models**
+2502.03715v1 by Rui Cai, Chao Wang, Qianyi Cai, Dazhong Shen, Hui Xiong
 
-This study aims to explore the implementation of Natural Language Processing
-(NLP) and machine learning (ML) techniques to automate the coding of medical
-letters with visualised explainability and light-weighted local computer
-settings. Currently in clinical settings, coding is a manual process that
-involves assigning codes to each condition, procedure, and medication in a
-patient's paperwork (e.g., 56265001 heart disease using SNOMED CT code). There
-are preliminary research on automatic coding in this field using
-state-of-the-art ML models; however, due to the complexity and size of the
-models, the real-world deployment is not achieved. To further facilitate the
-possibility of automatic coding practice, we explore some solutions in a local
-computer setting; in addition, we explore the function of explainability for
-transparency of AI models. We used the publicly available MIMIC-III database
-and the HAN/HLAN network models for ICD code prediction purposes. We also
-experimented with the mapping between ICD and SNOMED CT knowledge bases. In our
-experiments, the models provided useful information for 97.98\% of codes. The
-result of this investigation can shed some light on implementing automatic
-clinical coding in practice, such as in hospital settings, on the local
-computers used by clinicians , project page
-\url{https://github.com/Glenj01/Medical-Coding}.
+Knowledge Graph-based recommendations have gained significant attention due
+to their ability to leverage rich semantic relationships. However, constructing
+and maintaining Knowledge Graphs (KGs) is resource-intensive, and the accuracy
+of KGs can suffer from noisy, outdated, or irrelevant triplets. Recent
+advancements in Large Language Models (LLMs) offer a promising way to improve
+the quality and relevance of KGs for recommendation tasks. Despite this,
+integrating LLMs into KG-based systems presents challenges, such as efficiently
+augmenting KGs, addressing hallucinations, and developing effective joint
+learning methods. In this paper, we propose the Confidence-aware KG-based
+Recommendation Framework with LLM Augmentation (CKG-LLMA), a novel framework
+that combines KGs and LLMs for recommendation task. The framework includes: (1)
+an LLM-based subgraph augmenter for enriching KGs with high-quality
+information, (2) a confidence-aware message propagation mechanism to filter
+noisy triplets, and (3) a dual-view contrastive learning method to integrate
+user-item interactions and KG data. Additionally, we employ a confidence-aware
+explanation generation process to guide LLMs in producing realistic
+explanations for recommendations. Finally, extensive experiments demonstrate
+the effectiveness of CKG-LLMA across multiple public datasets.
 
-摘要：本研究旨在探討將自然語言處理 (NLP) 和機器學習 (ML) 技術實作於醫療信函編碼自動化，並具備視覺化說明能力和輕量化的本地電腦設定。目前在臨床環境中，編碼是一種手動流程，涉及為病患文件中的每項病症、程序和藥物指派代碼 (例如，使用 SNOMED CT 代碼 56265001 表示心臟病)。此領域有使用最新 ML 模型進行自動編碼的初步研究；然而，由於模型的複雜性和大小，並未實現實際部署。為了進一步促進自動編碼實務的可能性，我們在本地電腦設定中探討了一些解決方案；此外，我們探討了說明功能在 AI 模型透明度中的功能。我們使用公開的 MIMIC-III 資料庫和 HAN/HLAN 網路模型進行 ICD 代碼預測。我們還試驗了 ICD 和 SNOMED CT 知識庫之間的對應。在我們的實驗中，這些模型提供了 97.98% 代碼的有用資訊。這項調查結果可以為實務中的自動臨床編碼實作提供一些見解，例如在醫院環境中，由臨床醫生使用的本地電腦，專案頁面 \url{https://github.com/Glenj01/Medical-Coding}。
+摘要：基於知識圖譜的推薦因其利用豐富語義關係的能力而備受關注。然而，構建和維護知識圖譜 (KG) 是一項資源密集型任務，而 KG 的準確性可能會受到雜訊、過時或無關的三元組的影響。大型語言模型 (LLM) 的最新進展為提高 KG 在推薦任務中的品質和相關性提供了一種有前途的方法。儘管如此，將 LLM 整合到基於 KG 的系統中會帶來挑戰，例如有效擴充 KG、處理幻覺，以及開發有效的聯合學習方法。在本文中，我們提出具有 LLM 擴充的信心感知型基於 KG 的推薦框架 (CKG-LLMA)，這是一個結合 KG 和 LLM 進行推薦任務的新穎框架。該框架包括：(1) 一個基於 LLM 的子圖擴充器，用於使用高品質資訊豐富 KG，(2) 一個信心感知型訊息傳播機制，用於過濾雜訊三元組，以及 (3) 一個雙視圖對比學習方法，用於整合使用者-項目互動和 KG 資料。此外，我們採用一個信心感知型解釋產生程序，以引導 LLM 為推薦產生逼真的解釋。最後，大量的實驗證明了 CKG-LLMA 在多個公開資料集中的有效性。
 
-##### **Explainable AI for Enhancing Efficiency of DL-based Channel Estimation**
-2407.07009v1 by Abdul Karim Gizzini, Yahia Medjahdi, Ali J. Ghandour, Laurent Clavier
+##### **A Schema-Guided Reason-while-Retrieve framework for Reasoning on Scene Graphs with Large-Language-Models (LLMs)**
+2502.03450v1 by Yiye Chen, Harpreet Sawhney, Nicholas Gydé, Yanan Jian, Jack Saunders, Patricio Vela, Ben Lundell
 
-The support of artificial intelligence (AI) based decision-making is a key
-element in future 6G networks, where the concept of native AI will be
-introduced. Moreover, AI is widely employed in different critical applications
-such as autonomous driving and medical diagnosis. In such applications, using
-AI as black-box models is risky and challenging. Hence, it is crucial to
-understand and trust the decisions taken by these models. Tackling this issue
-can be achieved by developing explainable AI (XAI) schemes that aim to explain
-the logic behind the black-box model behavior, and thus, ensure its efficient
-and safe deployment. Recently, we proposed a novel perturbation-based XAI-CHEST
-framework that is oriented toward channel estimation in wireless
-communications. The core idea of the XAI-CHEST framework is to identify the
-relevant model inputs by inducing high noise on the irrelevant ones. This
-manuscript provides the detailed theoretical foundations of the XAI-CHEST
-framework. In particular, we derive the analytical expressions of the XAI-CHEST
-loss functions and the noise threshold fine-tuning optimization problem. Hence
-the designed XAI-CHEST delivers a smart input feature selection methodology
-that can further improve the overall performance while optimizing the
-architecture of the employed model. Simulation results show that the XAI-CHEST
-framework provides valid interpretations, where it offers an improved bit error
-rate performance while reducing the required computational complexity in
-comparison to the classical DL-based channel estimation.
+Scene graphs have emerged as a structured and serializable environment
+representation for grounded spatial reasoning with Large Language Models
+(LLMs). In this work, we propose SG-RwR, a Schema-Guided Retrieve-while-Reason
+framework for reasoning and planning with scene graphs. Our approach employs
+two cooperative, code-writing LLM agents: a (1) Reasoner for task planning and
+information queries generation, and a (2) Retriever for extracting
+corresponding graph information following the queries. Two agents collaborate
+iteratively, enabling sequential reasoning and adaptive attention to graph
+information. Unlike prior works, both agents are prompted only with the scene
+graph schema rather than the full graph data, which reduces the hallucination
+by limiting input tokens, and drives the Reasoner to generate reasoning trace
+abstractly.Following the trace, the Retriever programmatically query the scene
+graph data based on the schema understanding, allowing dynamic and global
+attention on the graph that enhances alignment between reasoning and retrieval.
+Through experiments in multiple simulation environments, we show that our
+framework surpasses existing LLM-based approaches in numerical Q\&A and
+planning tasks, and can benefit from task-level few-shot examples, even in the
+absence of agent-level demonstrations. Project code will be released.
 
-摘要：人工智能 (AI) 支持的決策制定是未來 6G 網路中的關鍵元素，其中將引入原生 AI 的概念。此外，AI 廣泛用於不同的關鍵應用中，例如自動駕駛和醫療診斷。在這些應用中，使用 AI 作為黑盒模型是有風險且具有挑戰性的。因此，理解和信任這些模型做出的決策至關重要。解決此問題的方法是開發可解釋 AI (XAI) 架構，旨在解釋黑盒模型行為背後的邏輯，從而確保其有效且安全的部署。最近，我們提出了一個新的基於擾動的 XAI-CHEST 框架，該框架面向無線通信中的信道估計。XAI-CHEST 框架的核心思想是通過在無關輸入上引入高噪聲來識別相關模型輸入。這份手稿提供了 XAI-CHEST 框架的詳細理論基礎。特別是，我們推導了 XAI-CHEST 損失函數和噪聲閾值微調優化問題的解析表達式。因此，設計的 XAI-CHEST 提供了一種智能輸入特徵選擇方法，可以在優化所用模型的架構的同時進一步提高整體性能。模擬結果表明，XAI-CHEST 框架提供了有效的解釋，在降低所需的計算複雜度的同時，提供了改進的比特錯誤率性能，而這與基於傳統 DL 的信道估計相比。
+摘要：場景圖表已成為大型語言模型 (LLM) 以基礎空間推理為基礎的結構化且可序列化的環境表徵。在這項工作中，我們提出 SG-RwR，一個以綱要為導向的檢索與推理框架，用於場景圖表的推理和規劃。我們的做法採用了兩個協作的、編寫程式碼的 LLM 代理：一個 (1) 推論器，用於任務規劃和資訊查詢產生，以及一個 (2) 檢索器，用於根據查詢提取對應的圖形資訊。兩個代理反覆合作，實現對圖形資訊的順序推理和適應性關注。與先前的作品不同，兩個代理僅提示場景圖表綱要，而不是完整的圖形資料，這透過限制輸入代碼減少了幻覺，並驅使推論器抽象地產生推理軌跡。根據軌跡，檢索器根據綱要理解以程式化方式查詢場景圖形資料，允許對圖形進行動態和整體關注，增強推理和檢索之間的一致性。透過在多個模擬環境中的實驗，我們表明我們的框架在數值問答和規劃任務中超越了現有的基於 LLM 的方法，並且可以受益於任務級別的少次範例，即使在沒有代理級別示範的情況下也是如此。專案程式碼將會釋出。
 
-##### **Explainable AI: Comparative Analysis of Normal and Dilated ResNet Models for Fundus Disease Classification**
-2407.05440v2 by P. N. Karthikayan, Yoga Sri Varshan V, Hitesh Gupta Kattamuri, Umarani Jayaraman
+##### **SymAgent: A Neural-Symbolic Self-Learning Agent Framework for Complex Reasoning over Knowledge Graphs**
+2502.03283v1 by Ben Liu, Jihai Zhang, Fangquan Lin, Cheng Yang, Min Peng, Wotao Yin
 
-This paper presents dilated Residual Network (ResNet) models for disease
-classification from retinal fundus images. Dilated convolution filters are used
-to replace normal convolution filters in the higher layers of the ResNet model
-(dilated ResNet) in order to improve the receptive field compared to the normal
-ResNet model for disease classification. This study introduces
-computer-assisted diagnostic tools that employ deep learning, enhanced with
-explainable AI techniques. These techniques aim to make the tool's
-decision-making process transparent, thereby enabling medical professionals to
-understand and trust the AI's diagnostic decision. They are particularly
-relevant in today's healthcare landscape, where there is a growing demand for
-transparency in AI applications to ensure their reliability and ethical use.
-The dilated ResNet is used as a replacement for the normal ResNet to enhance
-the classification accuracy of retinal eye diseases and reduce the required
-computing time. The dataset used in this work is the Ocular Disease Intelligent
-Recognition (ODIR) dataset which is a structured ophthalmic database with eight
-classes covering most of the common retinal eye diseases. The evaluation
-metrics used in this work include precision, recall, accuracy, and F1 score. In
-this work, a comparative study has been made between normal ResNet models and
-dilated ResNet models on five variants namely ResNet-18, ResNet-34, ResNet-50,
-ResNet-101, and ResNet-152. The dilated ResNet model shows promising results as
-compared to normal ResNet with an average F1 score of 0.71, 0.70, 0.69, 0.67,
-and 0.70 respectively for the above respective variants in ODIR multiclass
-disease classification.
+Recent advancements have highlighted that Large Language Models (LLMs) are
+prone to hallucinations when solving complex reasoning problems, leading to
+erroneous results. To tackle this issue, researchers incorporate Knowledge
+Graphs (KGs) to improve the reasoning ability of LLMs. However, existing
+methods face two limitations: 1) they typically assume that all answers to the
+questions are contained in KGs, neglecting the incompleteness issue of KGs, and
+2) they treat the KG as a static repository and overlook the implicit logical
+reasoning structures inherent in KGs. In this paper, we introduce SymAgent, an
+innovative neural-symbolic agent framework that achieves collaborative
+augmentation between KGs and LLMs. We conceptualize KGs as dynamic environments
+and transform complex reasoning tasks into a multi-step interactive process,
+enabling KGs to participate deeply in the reasoning process. SymAgent consists
+of two modules: Agent-Planner and Agent-Executor. The Agent-Planner leverages
+LLM's inductive reasoning capability to extract symbolic rules from KGs,
+guiding efficient question decomposition. The Agent-Executor autonomously
+invokes predefined action tools to integrate information from KGs and external
+documents, addressing the issues of KG incompleteness. Furthermore, we design a
+self-learning framework comprising online exploration and offline iterative
+policy updating phases, enabling the agent to automatically synthesize
+reasoning trajectories and improve performance. Experimental results
+demonstrate that SymAgent with weak LLM backbones (i.e., 7B series) yields
+better or comparable performance compared to various strong baselines. Further
+analysis reveals that our agent can identify missing triples, facilitating
+automatic KG updates.
 
-摘要：这篇论文提出了用于从视网膜眼底图像进行疾病分类的扩张残差网络 (ResNet) 模型。扩张卷积滤波器用于替换 ResNet 模型较高层中的正常卷积滤波器（扩张 ResNet），以改善感知场，从而针对疾病分类对正常 ResNet 模型进行改进。本研究引入了采用深度学习的计算机辅助诊断工具，并通过可解释的 AI 技术进行了增强。这些技术旨在使该工具的决策过程透明化，从而使医学专业人士能够理解和信任 AI 的诊断决策。它们与当今的医疗保健领域尤为相关，在该领域，对 AI 应用的透明度需求不断增长，以确保其可靠性和合乎道德的使用。扩张 ResNet 用作正常 ResNet 的替代品，以提高视网膜眼部疾病的分类准确性并减少所需的计算时间。本工作中使用的数据集是眼科疾病智能识别 (ODIR) 数据集，这是一个结构化的眼科数据库，包含八类涵盖大多数常见视网膜眼部疾病。本工作中使用的评估指标包括精确度、召回率、准确度和 F1 得分。在这项工作中，对 ResNet-18、ResNet-34、ResNet-50、ResNet-101 和 ResNet-152 五个变体的正常 ResNet 模型和扩张 ResNet 模型进行了比较研究。与正常 ResNet 相比，扩张 ResNet 模型显示出有希望的结果，在 ODIR 多类疾病分类中，上述各个变体的平均 F1 得分为 0.71、0.70、0.69、0.67 和 0.70。
+摘要：<paragraph>最近的研究表明，大型语言模型 (LLM) 在解决复杂的推理问题时容易出现幻觉，从而导致错误的结果。为了解决这个问题，研究人员结合了知识图谱 (KG) 来提高 LLM 的推理能力。然而，现有方法面临两个局限性：1) 它们通常假设问题的答案都包含在 KG 中，忽略了 KG 不完整的问题，2) 它们将 KG 视为一个静态存储库，而忽略了 KG 中固有的隐式逻辑推理结构。在本文中，我们介绍了 SymAgent，这是一个创新的神经符号代理框架，可以在 KG 和 LLM 之间实现协作增强。我们将 KG 概念化为动态环境，并将复杂的推理任务转化为一个多步骤的交互过程，使 KG 能够深入参与推理过程。SymAgent 由两个模块组成：Agent-Planner 和 Agent-Executor。Agent-Planner 利用 LLM 的归纳推理能力从 KG 中提取符号规则，指导高效的问题分解。Agent-Executor 自主调用预定义的动作工具来整合来自 KG 和外部文档的信息，解决 KG 不完整的问题。此外，我们设计了一个自学习框架，包括在线探索和离线迭代策略更新阶段，使代理能够自动合成推理轨迹并提高性能。实验结果表明，具有弱 LLM 主干的 SymAgent（即 7B 系列）与各种强大的基线相比，产生了更好或相当的性能。进一步的分析表明，我们的代理可以识别缺失的三元组，促进自动 KG 更新。</paragraph>
 
-##### **A Survey on Trustworthiness in Foundation Models for Medical Image Analysis**
-2407.15851v2 by Congzhen Shi, Ryan Rezai, Jiaxi Yang, Qi Dou, Xiaoxiao Li
+##### **Analyze Feature Flow to Enhance Interpretation and Steering in Language Models**
+2502.03032v2 by Daniil Laptev, Nikita Balagansky, Yaroslav Aksenov, Daniil Gavrilov
 
-The rapid advancement of foundation models in medical imaging represents a
-significant leap toward enhancing diagnostic accuracy and personalized
-treatment. However, the deployment of foundation models in healthcare
-necessitates a rigorous examination of their trustworthiness, encompassing
-privacy, robustness, reliability, explainability, and fairness. The current
-body of survey literature on foundation models in medical imaging reveals
-considerable gaps, particularly in the area of trustworthiness. Additionally,
-existing surveys on the trustworthiness of foundation models do not adequately
-address their specific variations and applications within the medical imaging
-domain. This survey aims to fill that gap by presenting a novel taxonomy of
-foundation models used in medical imaging and analyzing the key motivations for
-ensuring their trustworthiness. We review current research on foundation models
-in major medical imaging applications, focusing on segmentation, medical report
-generation, medical question and answering (Q\&A), and disease diagnosis. These
-areas are highlighted because they have seen a relatively mature and
-substantial number of foundation models compared to other applications. We
-focus on literature that discusses trustworthiness in medical image analysis
-manuscripts. We explore the complex challenges of building trustworthy
-foundation models for each application, summarizing current concerns and
-strategies for enhancing trustworthiness. Furthermore, we examine the potential
-of these models to revolutionize patient care. Our analysis underscores the
-imperative for advancing towards trustworthy AI in medical image analysis,
-advocating for a balanced approach that fosters innovation while ensuring
-ethical and equitable healthcare delivery.
+We introduce a new approach to systematically map features discovered by
+sparse autoencoder across consecutive layers of large language models,
+extending earlier work that examined inter-layer feature links. By using a
+data-free cosine similarity technique, we trace how specific features persist,
+transform, or first appear at each stage. This method yields granular flow
+graphs of feature evolution, enabling fine-grained interpretability and
+mechanistic insights into model computations. Crucially, we demonstrate how
+these cross-layer feature maps facilitate direct steering of model behavior by
+amplifying or suppressing chosen features, achieving targeted thematic control
+in text generation. Together, our findings highlight the utility of a causal,
+cross-layer interpretability framework that not only clarifies how features
+develop through forward passes but also provides new means for transparent
+manipulation of large language models.
 
-摘要：基礎模型在醫學影像方面的快速進展，代表著在加強診斷準確性和個人化治療方面邁出一大步。然而，基礎模型在醫療保健中的部署需要對其可信度進行嚴格的審查，包括隱私、穩健性、可靠性、可解釋性和公平性。目前關於醫學影像中基礎模型的調查文獻中顯示出相當大的差距，特別是在可信度方面。此外，現有關於基礎模型可信度的調查並未充分解決其在醫學影像領域中的特定變化和應用。本調查旨在通過提出醫學影像中使用的基礎模型的新分類法並分析確保其可信度的關鍵動機，來填補這一空白。我們回顧了基礎模型在主要醫學影像應用中的當前研究，重點關注分割、醫療報告生成、醫療問題和回答 (Q&A) 以及疾病診斷。這些領域之所以被強調，是因為與其他應用相比，它們已經看到相對成熟且大量的基礎模型。我們專注於探討醫學影像分析手稿中可信度的文獻。我們探討了為每個應用構建可信基礎模型的複雜挑戰，總結了當前關注點和增強可信度的策略。此外，我們探討了這些模型在革新患者護理方面的潛力。我們的分析強調了在醫學影像分析中朝著可信賴的人工智慧邁進的必要性，並倡導一種平衡的方法，既能促進創新，又能確保道德和公平的醫療保健服務。
+摘要：我們提出了一種新方法，用於系統性地繪製大型語言模型連續層中稀疏自動編碼器發現的功能，擴展了先前研究層間特徵連結的工作。透過使用無資料餘弦相似性技術，我們追蹤特定特徵在每個階段如何持續、轉換或首次出現。此方法產生了特徵演化的細粒度流程圖，實現了細粒度的可解釋性和對模型運算的機制見解。至關重要的是，我們展示了這些跨層特徵圖如何透過放大或抑制所選特徵來促進模型行為的直接引導，在文字生成中實現目標主題控制。我們的研究結果共同突出了因果、跨層可解釋性框架的效用，不僅闡明了特徵如何透過前向傳遞發展，還提供了新的方法來透明地操作大型語言模型。
 
-##### **The Impact of an XAI-Augmented Approach on Binary Classification with Scarce Data**
-2407.06206v1 by Ximing Wen, Rosina O. Weber, Anik Sen, Darryl Hannan, Steven C. Nesbit, Vincent Chan, Alberto Goffi, Michael Morris, John C. Hunninghake, Nicholas E. Villalobos, Edward Kim, Christopher J. MacLellan
+##### **A Benchmark for the Detection of Metalinguistic Disagreements between LLMs and Knowledge Graphs**
+2502.02896v1 by Bradley P. Allen, Paul T. Groth
 
-Point-of-Care Ultrasound (POCUS) is the practice of clinicians conducting and
-interpreting ultrasound scans right at the patient's bedside. However, the
-expertise needed to interpret these images is considerable and may not always
-be present in emergency situations. This reality makes algorithms such as
-machine learning classifiers extremely valuable to augment human decisions.
-POCUS devices are becoming available at a reasonable cost in the size of a
-mobile phone. The challenge of turning POCUS devices into life-saving tools is
-that interpretation of ultrasound images requires specialist training and
-experience. Unfortunately, the difficulty to obtain positive training images
-represents an important obstacle to building efficient and accurate
-classifiers. Hence, the problem we try to investigate is how to explore
-strategies to increase accuracy of classifiers trained with scarce data. We
-hypothesize that training with a few data instances may not suffice for
-classifiers to generalize causing them to overfit. Our approach uses an
-Explainable AI-Augmented approach to help the algorithm learn more from less
-and potentially help the classifier better generalize.
+Evaluating large language models (LLMs) for tasks like fact extraction in
+support of knowledge graph construction frequently involves computing accuracy
+metrics using a ground truth benchmark based on a knowledge graph (KG). These
+evaluations assume that errors represent factual disagreements. However, human
+discourse frequently features metalinguistic disagreement, where agents differ
+not on facts but on the meaning of the language used to express them. Given the
+complexity of natural language processing and generation using LLMs, we ask: do
+metalinguistic disagreements occur between LLMs and KGs? Based on an
+investigation using the T-REx knowledge alignment dataset, we hypothesize that
+metalinguistic disagreement does in fact occur between LLMs and KGs, with
+potential relevance for the practice of knowledge graph engineering. We propose
+a benchmark for evaluating the detection of factual and metalinguistic
+disagreements between LLMs and KGs. An initial proof of concept of such a
+benchmark is available on Github.
 
-摘要：床邊超音波 (POCUS) 是臨床醫師在患者床邊進行和解讀超音波掃描的實務。然而，解讀這些影像所需的專業知識相當可觀，而且在緊急情況下可能並非隨時具備。這種現實情況使得機器學習分類器等演算法對於加強人類決策變得極為有價值。POCUS 裝置正以合理成本推出，尺寸為手機大小。將 POCUS 裝置轉變為救生工具的挑戰在於，解讀超音波影像需要專門訓練和經驗。不幸的是，取得正向訓練影像的困難度代表著建置有效率且準確的分類器的一大障礙。因此，我們嘗試探討的問題是如何探索策略，以提高使用稀疏資料訓練的分類器的準確度。我們假設使用少數資料實例進行訓練可能不足以讓分類器概括，導致它們過度擬合。我們的做法使用可解釋 AI 增強方法，以協助演算法從較少的資料中學習更多，並潛在協助分類器更好地概括。
+摘要：評估大型語言模型 (LLM) 執行知識圖譜建構支援事實萃取等任務時，通常會使用基於知識圖譜 (KG) 的基準事實計算準確度指標。這些評估假設錯誤代表事實上的分歧。然而，人類話語經常出現元語言分歧，其中代理人之間的差異不在於事實，而在於用於表達事實的語言的含義。鑑於使用 LLM 處理和產生自然語言的複雜性，我們提出疑問：LLM 和 KG 之間是否會發生元語言分歧？根據使用 T-REx 知識比對資料集進行的調查，我們假設元語言分歧確實會發生在 LLM 和 KG 之間，並可能與知識圖譜工程實務有關。我們提出一個基準，用於評估 LLM 和 KG 之間的事實和元語言分歧的偵測。此基準的初步概念驗證可在 Github 上取得。
 
-##### **Can GPT-4 Help Detect Quit Vaping Intentions? An Exploration of Automatic Data Annotation Approach**
-2407.00167v1 by Sai Krishna Revanth Vuruma, Dezhi Wu, Saborny Sen Gupta, Lucas Aust, Valerie Lookingbill, Wyatt Bellamy, Yang Ren, Erin Kasson, Li-Shiun Chen, Patricia Cavazos-Rehg, Dian Hu, Ming Huang
+##### **Mol-LLM: Generalist Molecular LLM with Improved Graph Utilization**
+2502.02810v1 by Chanhui Lee, Yuheon Song, YongJun Jeong, Hanbum Ko, Rodrigo Hormazabal, Sehui Han, Kyunghoon Bae, Sungbin Lim, Sungwoong Kim
 
-In recent years, the United States has witnessed a significant surge in the
-popularity of vaping or e-cigarette use, leading to a notable rise in cases of
-e-cigarette and vaping use-associated lung injury (EVALI) that caused
-hospitalizations and fatalities during the EVALI outbreak in 2019, highlighting
-the urgency to comprehend vaping behaviors and develop effective strategies for
-cessation. Due to the ubiquity of social media platforms, over 4.7 billion
-users worldwide use them for connectivity, communications, news, and
-entertainment with a significant portion of the discourse related to health,
-thereby establishing social media data as an invaluable organic data resource
-for public health research. In this study, we extracted a sample dataset from
-one vaping sub-community on Reddit to analyze users' quit-vaping intentions.
-Leveraging OpenAI's latest large language model GPT-4 for sentence-level quit
-vaping intention detection, this study compares the outcomes of this model
-against layman and clinical expert annotations. Using different prompting
-strategies such as zero-shot, one-shot, few-shot and chain-of-thought
-prompting, we developed 8 prompts with varying levels of detail to explain the
-task to GPT-4 and also evaluated the performance of the strategies against each
-other. These preliminary findings emphasize the potential of GPT-4 in social
-media data analysis, especially in identifying users' subtle intentions that
-may elude human detection.
+Recent advances in Large Language Models (LLMs) have motivated the
+development of general LLMs for molecular tasks. While several studies have
+demonstrated that fine-tuned LLMs can achieve impressive benchmark
+performances, they are far from genuine generalist molecular LLMs due to a lack
+of fundamental understanding of molecular structure. Specifically, when given
+molecular task instructions, LLMs trained with naive next-token prediction
+training assign similar likelihood scores to both original and negatively
+corrupted molecules, revealing their lack of molecular structure understanding
+that is crucial for reliable and general molecular LLMs. To overcome this
+limitation and obtain a true generalist molecular LLM, we introduce a novel
+multi-modal training method based on a thorough multi-modal instruction tuning
+as well as a molecular structure preference optimization between chosen and
+rejected graphs. On various molecular benchmarks, the proposed generalist
+molecular LLM, called Mol-LLM, achieves state-of-the-art performances among
+generalist LLMs on most tasks, at the same time, surpassing or comparable to
+state-of-the-art specialist LLMs. Moreover, Mol-LLM also shows superior
+generalization performances in reaction prediction tasks, demonstrating the
+effect of the molecular structure understanding for generalization perspective.
 
-摘要：近年來，美國見證了電子煙或電子香菸使用率大幅激增，導致電子煙和電子煙使用相關肺損傷 (EVALI) 病例顯著增加，在 2019 年 EVALI 爆發期間造成住院和死亡，凸顯了理解電子煙行為和制定有效戒菸策略的迫切性。由於社群媒體平台的普及，全球超過 47 億使用者使用它們進行連結、溝通、新聞和娛樂，其中很大一部分與健康相關，因此將社群媒體資料建立為公共衛生研究中無價的有機資料資源。在本研究中，我們從 Reddit 上一個電子煙子社群中提取一個範例資料集，以分析使用者的戒電子煙意圖。利用 OpenAI 最新的大型語言模型 GPT-4 進行句子層級的戒電子煙意圖偵測，本研究比較了此模型的結果與外行人和臨床專家註解。使用不同的提示策略，例如零次學習、一次學習、少次學習和思考鏈提示，我們開發了 8 個提示，詳細程度不同，向 GPT-4 解釋任務，並評估這些策略彼此之間的效能。這些初步發現強調了 GPT-4 在社群媒體資料分析中的潛力，特別是在識別人類偵測可能無法察覺的使用者微妙意圖方面。
+摘要：大型語言模型 (LLM) 的近期進展激勵了針對分子任務開發通用 LLM。雖然多項研究已證明微調 LLM 可實現令人印象深刻的基準效能，但由於缺乏對分子結構的基本理解，它們遠非真正的通才分子 LLM。具體來說，當給予分子任務說明時，使用天真的下一個符號預測訓練訓練的 LLM 會將類似的可能性評分分配給原始分子和負面損壞分子，這顯示出它們缺乏對分子結構的理解，而這對於可靠且通用的分子 LLM 至關重要。為了克服這個限制並獲得真正的通才分子 LLM，我們引入了一種新穎的多模態訓練方法，該方法基於徹底的多模態說明調整以及在所選和拒絕圖形之間的分子結構偏好最佳化。在各種分子基準測試中，所提出的通才分子 LLM（稱為 Mol-LLM）在多數任務中實現了通才 LLM 中的最新效能，同時超越或與最新的專家 LLM 相當。此外，Mol-LLM 在反應預測任務中也展現出優異的泛化效能，證明了分子結構理解對泛化觀點的影響。
 
-##### **Towards Compositional Interpretability for XAI**
-2406.17583v1 by Sean Tull, Robin Lorenz, Stephen Clark, Ilyas Khan, Bob Coecke
+##### **Leveraging the true depth of LLMs**
+2502.02790v1 by Ramón Calvo González, Daniele Paliotta, Matteo Pagliardini, Martin Jaggi, François Fleuret
 
-Artificial intelligence (AI) is currently based largely on black-box machine
-learning models which lack interpretability. The field of eXplainable AI (XAI)
-strives to address this major concern, being critical in high-stakes areas such
-as the finance, legal and health sectors.
-  We present an approach to defining AI models and their interpretability based
-on category theory. For this we employ the notion of a compositional model,
-which sees a model in terms of formal string diagrams which capture its
-abstract structure together with its concrete implementation. This
-comprehensive view incorporates deterministic, probabilistic and quantum
-models. We compare a wide range of AI models as compositional models, including
-linear and rule-based models, (recurrent) neural networks, transformers, VAEs,
-and causal and DisCoCirc models.
-  Next we give a definition of interpretation of a model in terms of its
-compositional structure, demonstrating how to analyse the interpretability of a
-model, and using this to clarify common themes in XAI. We find that what makes
-the standard 'intrinsically interpretable' models so transparent is brought out
-most clearly diagrammatically. This leads us to the more general notion of
-compositionally-interpretable (CI) models, which additionally include, for
-instance, causal, conceptual space, and DisCoCirc models.
-  We next demonstrate the explainability benefits of CI models. Firstly, their
-compositional structure may allow the computation of other quantities of
-interest, and may facilitate inference from the model to the modelled
-phenomenon by matching its structure. Secondly, they allow for diagrammatic
-explanations for their behaviour, based on influence constraints, diagram
-surgery and rewrite explanations. Finally, we discuss many future directions
-for the approach, raising the question of how to learn such meaningfully
-structured models in practice.
+Large Language Models demonstrate remarkable capabilities at the cost of high
+compute requirements. While recent research has shown that intermediate layers
+can be removed or have their order shuffled without impacting performance
+significantly, these findings have not been employed to reduce the
+computational cost of inference. We investigate several potential ways to
+reduce the depth of pre-trained LLMs without significantly affecting
+performance. Leveraging our insights, we present a novel approach that exploits
+this decoupling between layers by grouping some of them into pairs that can be
+evaluated in parallel.
+  This modification of the computational graph -- through better parallelism --
+results in an average improvement of around 1.20x on the number of tokens
+generated per second, without re-training nor fine-tuning, while retaining
+95%-99% of the original accuracy. Empirical evaluation demonstrates that this
+approach significantly improves serving efficiency while maintaining model
+performance, offering a practical improvement for large-scale LLM deployment.
 
-摘要：<paragraph>人工智慧（AI）目前在很大程度上依賴於缺乏可解釋性的黑盒機器學習模型。可解釋性人工智慧（XAI）領域致力於解決這個主要問題，這在金融、法律和健康等高風險領域至關重要。
-我們提出了一種基於範疇論定義 AI 模型及其可解釋性的方法。為此，我們採用組合模型的概念，它以形式弦圖的形式看待模型，這些弦圖捕獲了模型的抽象結構及其具體實現。這種綜合觀點包含了確定性、概率性和量子模型。我們將各種 AI 模型作為組合模型進行比較，包括線性和基於規則的模型、（遞迴）神經網路、Transformer、VAE，以及因果和 DisCoCirc 模型。
-接下來，我們根據模型的組合結構給出模型解釋的定義，展示如何分析模型的可解釋性，並使用它來澄清 XAI 中的常見主題。我們發現，讓標準的「內在可解釋」模型如此透明的原因在圖表中表現得最為清楚。這引導我們得出更一般的組合可解釋（CI）模型概念，它另外還包括因果、概念空間和 DisCoCirc 模型。
-接下來，我們展示了 CI 模型的可解釋性優勢。首先，它們的組合結構允許計算其他感興趣的量，並可能通過匹配模型的結構來促進從模型到被建模現象的推理。其次，它們允許對其行為進行圖解說明，這些說明基於影響約束、圖解手術和重寫說明。最後，我們討論了這種方法的許多未來方向，提出了如何在實踐中學習這種有意義的結構化模型的問題。</paragraph>
+摘要：大型语言模型展示了其强大的功能，但代价是较高的计算需求。虽然最近的研究表明，中间层可以被移除或重新排列其顺序，而不会显著影响性能，但这些发现尚未被用来降低推理的计算成本。我们研究了几种潜在的方法来减少预训练 LLM 的深度，而不会显著影响性能。利用我们的见解，我们提出了一种新颖的方法，该方法通过将其中一些分组为可以并行评估的成对来利用层之间的这种解耦。
+通过更好的并行性对计算图进行修改，平均而言，每秒生成的令牌数量提高了约 1.20 倍，而无需重新训练或微调，同时保留了 95%-99% 的原始准确性。经验评估表明，这种方法显著提高了服务效率，同时保持了模型性能，为大规模 LLM 部署提供了实际改进。
 
-##### **Slicing Through Bias: Explaining Performance Gaps in Medical Image Analysis using Slice Discovery Methods**
-2406.12142v2 by Vincent Olesen, Nina Weng, Aasa Feragen, Eike Petersen
+##### **Modular Training of Neural Networks aids Interpretability**
+2502.02470v2 by Satvik Golechha, Maheep Chaudhary, Joan Velja, Alessandro Abate, Nandi Schoots
 
-Machine learning models have achieved high overall accuracy in medical image
-analysis. However, performance disparities on specific patient groups pose
-challenges to their clinical utility, safety, and fairness. This can affect
-known patient groups - such as those based on sex, age, or disease subtype - as
-well as previously unknown and unlabeled groups. Furthermore, the root cause of
-such observed performance disparities is often challenging to uncover,
-hindering mitigation efforts. In this paper, to address these issues, we
-leverage Slice Discovery Methods (SDMs) to identify interpretable
-underperforming subsets of data and formulate hypotheses regarding the cause of
-observed performance disparities. We introduce a novel SDM and apply it in a
-case study on the classification of pneumothorax and atelectasis from chest
-x-rays. Our study demonstrates the effectiveness of SDMs in hypothesis
-formulation and yields an explanation of previously observed but unexplained
-performance disparities between male and female patients in widely used chest
-X-ray datasets and models. Our findings indicate shortcut learning in both
-classification tasks, through the presence of chest drains and ECG wires,
-respectively. Sex-based differences in the prevalence of these shortcut
-features appear to cause the observed classification performance gap,
-representing a previously underappreciated interaction between shortcut
-learning and model fairness analyses.
+An approach to improve neural network interpretability is via clusterability,
+i.e., splitting a model into disjoint clusters that can be studied
+independently. We define a measure for clusterability and show that pre-trained
+models form highly enmeshed clusters via spectral graph clustering. We thus
+train models to be more modular using a "clusterability loss" function that
+encourages the formation of non-interacting clusters. Using automated
+interpretability techniques, we show that our method can help train models that
+are more modular and learn different, disjoint, and smaller circuits. We
+investigate CNNs trained on MNIST and CIFAR, small transformers trained on
+modular addition, and language models. Our approach provides a promising
+direction for training neural networks that learn simpler functions and are
+easier to interpret.
 
-摘要：機器學習模型在醫學影像分析中已達到整體高準確度。然而，特定患者群體的效能差異對其臨床效用、安全性與公平性構成挑戰。這可能會影響已知的患者群體（例如基於性別、年齡或疾病亞型）以及先前未知且未標籤的群體。此外，此類觀察到的效能差異的根本原因通常難以發現，阻礙了緩解措施。在本文中，為了解決這些問題，我們利用切片發現方法 (SDM) 來識別可解釋的資料效能不佳子集，並針對觀察到的效能差異原因制定假設。我們引入一種新的 SDM，並在胸部 X 光片中肺炎和肺不張分類的案例研究中應用它。我們的研究證明了 SDM 在假設制定中的有效性，並對廣泛使用的胸部 X 光片資料集和模型中先前觀察到但無法解釋的男性和女性患者之間的效能差異提供了解釋。我們的發現表明，在分類任務中，透過胸腔引流管和心電圖導線的存在，存在捷徑學習。這些捷徑特徵的盛行率存在基於性別的差異，似乎會導致觀察到的分類效能差距，這代表捷徑學習和模型公平性分析之間先前未受到重視的交互作用。
+摘要：一種改善神經網路可解釋性的方法是透過群集性，
+也就是將模型分割成可獨立研究的不相交群集。我們定義一個群集性的度量，並顯示預訓練的
+模型透過光譜圖形群集形成高度糾纏的群集。因此，我們使用「群集性損失」函數訓練模型，使其更具模組化，
+這鼓勵形成非交互群集。使用自動化可解釋性技術，我們顯示我們的模型可以幫助訓練更具模組化的模型，並學習不同、不相交且較小的電路。我們
+研究了在 MNIST 和 CIFAR 上訓練的 CNN，在模組化加法上訓練的小型Transformer，以及語言模型。我們的做法為訓練學習更簡單函數且更容易解釋的神經網路提供了有希望的方向。
 
-##### **Unlocking the Potential of Metaverse in Innovative and Immersive Digital Health**
-2406.07114v2 by Fatemeh Ebrahimzadeh, Ramin Safa
+##### **Premise-Augmented Reasoning Chains Improve Error Identification in Math reasoning with LLMs**
+2502.02362v3 by Sagnik Mukherjee, Abhinav Chinta, Takyoung Kim, Tarun Anoop Sharma, Dilek Hakkani-Tür
 
-The concept of Metaverse has attracted a lot of attention in various fields
-and one of its important applications is health and treatment. The Metaverse
-has enormous potential to transform healthcare by changing patient care,
-medical education, and the way teaching/learning and research are done. The
-purpose of this research is to provide an introduction to the basic concepts
-and fundamental technologies of the Metaverse. This paper examines the pros and
-cons of the Metaverse in healthcare context and analyzes its potential from the
-technology and AI perspective. In particular, the role of machine learning
-methods is discussed; We will explain how machine learning algorithms can be
-applied to the Metaverse generated data to gain better insights in healthcare
-applications. Additionally, we examine the future visions of the Metaverse in
-health delivery, by examining emerging technologies such as blockchain and also
-addressing privacy concerns. The findings of this study contribute to a deeper
-understanding of the applications of Metaverse in healthcare and its potential
-to revolutionize the delivery of medical services.
+Chain-of-Thought (CoT) prompting enhances mathematical reasoning in large
+language models (LLMs) by enabling detailed step-by-step solutions. However,
+due to the verbosity of LLMs, the resulting reasoning chains can be long,
+making it harder to verify the reasoning steps and trace issues resulting from
+dependencies between the steps that may be farther away in the sequence of
+steps. Importantly, mathematical reasoning allows each step to be derived from
+a small set of premises, which are a subset of the preceding steps in the
+reasoning chain. In this paper, we present a framework that identifies the
+premises for each step, to improve the evaluation of reasoning. We restructure
+conventional linear reasoning chains into Premise Augmented Reasoning Chains
+(PARC) by introducing premise links, resulting in a directed acyclic graph
+where the nodes are the steps and the edges are the premise links. Through
+experiments with a PARC-based dataset that we built, namely PERL (Premises and
+ERrors identification in LLMs), we demonstrate that LLMs can reliably identify
+premises within complex reasoning chains. In particular, even open-source LLMs
+achieve 90% recall in premise identification. We also show that PARC helps to
+identify errors in reasoning chains more reliably. The accuracy of error
+identification improves by 6% to 16% absolute when step-by-step verification is
+carried out in PARC under the premises. Our findings highlight the utility of
+premise-centric representations in addressing complex problem-solving tasks and
+open new avenues for improving the reliability of LLM-based reasoning
+evaluations.
 
-摘要：元宇宙的概念在各個領域都備受關注，其重要應用之一便是醫療保健。元宇宙有巨大的潛力透過改變病患照護、醫學教育，以及教學/學習和研究的方式來轉型醫療保健。本研究的目的是提供元宇宙基本概念和基礎技術的介紹。本文探討了元宇宙在醫療保健背景下的優缺點，並從技術和 AI 的角度分析其潛力。特別是，討論了機器學習方法的角色；我們將說明如何將機器學習演算法應用於元宇宙產生的資料，以獲得醫療保健應用方面的更佳見解。此外，我們透過探討區塊鏈等新興技術，並解決隱私問題，來探討元宇宙在醫療保健方面的未來願景。本研究的發現有助於更深入地了解元宇宙在醫療保健中的應用，以及其在醫療服務提供方面發揮革命性變革的潛力。
+摘要：<paragraph>思考鏈（CoT）提示透過提供詳細的逐步解法，增強大型語言模型（LLM）的數學推理能力。然而，由於 LLM 的冗長，產生的推理鏈可能很長，這使得驗證推理步驟和追蹤由步驟之間相依關係所產生的問題變得更加困難，而這些步驟可能在步驟順序中相距較遠。重要的是，數學推理允許每個步驟從一組小的前提中推導出來，這些前提是推理鏈中前一個步驟的子集。在本文中，我們提出了一個框架，用於識別每個步驟的前提，以改進推理評估。我們透過引入前提連結，將傳統的線性推理鏈重組為前提擴充推理鏈（PARC），產生一個有向無環圖，其中節點是步驟，而邊緣是前提連結。透過我們建立的基於 PARC 的資料集（即 PERL（LLM 中的前提和錯誤識別））進行的實驗，我們證明 LLM 能夠在複雜的推理鏈中可靠地識別前提。特別是，即使是開源 LLM 在前提識別中也能達到 90% 的召回率。我們還表明，PARC 有助於更可靠地識別推理鏈中的錯誤。在前提下於 PARC 中執行逐步驗證時，錯誤識別的準確度提高了 6% 到 16%。我們的研究結果突顯了以前提為中心的表示在解決複雜問題解決任務中的效用，並為改進基於 LLM 的推理評估的可靠性開闢了新途徑。</paragraph>
 
-##### **AI-Driven Predictive Analytics Approach for Early Prognosis of Chronic Kidney Disease Using Ensemble Learning and Explainable AI**
-2406.06728v2 by K M Tawsik Jawad, Anusha Verma, Fathi Amsaad, Lamia Ashraf
+##### **AdaptBot: Combining LLM with Knowledge Graphs and Human Input for Generic-to-Specific Task Decomposition and Knowledge Refinement**
+2502.02067v1 by Shivam Singh, Karthik Swaminathan, Nabanita Dash, Ramandeep Singh, Snehasis Banerjee, Mohan Sridharan, Madhava Krishna
 
-Chronic Kidney Disease (CKD) is one of the widespread Chronic diseases with
-no known ultimo cure and high morbidity. Research demonstrates that progressive
-Chronic Kidney Disease (CKD) is a heterogeneous disorder that significantly
-impacts kidney structure and functions, eventually leading to kidney failure.
-With the progression of time, chronic kidney disease has moved from a
-life-threatening disease affecting few people to a common disorder of varying
-severity. The goal of this research is to visualize dominating features,
-feature scores, and values exhibited for early prognosis and detection of CKD
-using ensemble learning and explainable AI. For that, an AI-driven predictive
-analytics approach is proposed to aid clinical practitioners in prescribing
-lifestyle modifications for individual patients to reduce the rate of
-progression of this disease. Our dataset is collected on body vitals from
-individuals with CKD and healthy subjects to develop our proposed AI-driven
-solution accurately. In this regard, blood and urine test results are provided,
-and ensemble tree-based machine-learning models are applied to predict unseen
-cases of CKD. Our research findings are validated after lengthy consultations
-with nephrologists. Our experiments and interpretation results are compared
-with existing explainable AI applications in various healthcare domains,
-including CKD. The comparison shows that our developed AI models, particularly
-the Random Forest model, have identified more features as significant
-contributors than XgBoost. Interpretability (I), which measures the ratio of
-important to masked features, indicates that our XgBoost model achieved a
-higher score, specifically a Fidelity of 98\%, in this metric and naturally in
-the FII index compared to competing models.
+Embodied agents assisting humans are often asked to complete a new task in a
+new scenario. An agent preparing a particular dish in the kitchen based on a
+known recipe may be asked to prepare a new dish or to perform cleaning tasks in
+the storeroom. There may not be sufficient resources, e.g., time or labeled
+examples, to train the agent for these new situations. Large Language Models
+(LLMs) trained on considerable knowledge across many domains are able to
+predict a sequence of abstract actions for such new tasks and scenarios,
+although it may not be possible for the agent to execute this action sequence
+due to task-, agent-, or domain-specific constraints. Our framework addresses
+these challenges by leveraging the generic predictions provided by LLM and the
+prior domain-specific knowledge encoded in a Knowledge Graph (KG), enabling an
+agent to quickly adapt to new tasks and scenarios. The robot also solicits and
+uses human input as needed to refine its existing knowledge. Based on
+experimental evaluation over cooking and cleaning tasks in simulation domains,
+we demonstrate that the interplay between LLM, KG, and human input leads to
+substantial performance gains compared with just using the LLM output.
 
-摘要：慢性腎臟病 (CKD) 是一種廣泛的慢性疾病，目前尚未找到最終的治療方法，且發病率很高。研究表明，進行性慢性腎臟病 (CKD) 是一種異質性疾病，會顯著影響腎臟結構和功能，最終導致腎衰竭。隨著時間的推移，慢性腎臟病已從影響少數人的致命疾病演變成一種嚴重程度不一的常見疾病。本研究的目標是使用整體學習和可解釋的 AI 來視覺化支配性特徵、特徵分數和值，以進行 CKD 的早期預後和檢測。為此，提出了一種 AI 驅動的預測分析方法，以幫助臨床醫生為個別患者開具生活方式的修改建議，以降低此疾病的進展速度。我們的數據集是從 CKD 患者和健康受試者的身體生命徵象中收集的，以準確開發我們提出的 AI 驅動的解決方案。在這方面，提供了血液和尿液檢測結果，並應用基於集成樹的機器學習模型來預測未見的 CKD 病例。我們的研究結果在與腎臟科醫師進行長時間諮詢後得到驗證。我們的實驗和解釋結果與各種醫療保健領域中現有的可解釋 AI 應用進行了比較，包括 CKD。比較表明，我們開發的 AI 模型，特別是隨機森林模型，已經確定了比 XgBoost 更多的特徵作為顯著的貢獻者。可解釋性 (I) 衡量重要特徵與被遮蔽特徵的比率，表明我們的 XgBoost 模型在此指標中取得了更高的分數，特別是 98% 的保真度，並且在 FII 指數中自然高於競爭模型。
+摘要：具身代理协助人类时，通常需要在新的情境中完成新的任务。基于已知食谱在厨房准备特定菜肴的代理可能会被要求准备新菜肴或在储藏室执行清洁任务。可能没有足够资源（例如时间或标记的示例）来训练代理以应对这些新情况。在许多领域接受大量知识训练的大型语言模型 (LLM) 能够预测此类新任务和情境的抽象动作序列，尽管代理可能无法执行此动作序列，因为任务、代理或特定于域的约束。我们的框架通过利用 LLM 提供的通用预测和知识图 (KG) 中编码的先前特定于域的知识来应对这些挑战，使代理能够快速适应新任务和情境。该机器人还会根据需要征求并使用人类输入来完善其现有知识。基于在模拟域中对烹饪和清洁任务的实验评估，我们证明了 LLM、KG 和人类输入之间的相互作用与仅使用 LLM 输出相比带来了巨大的性能提升。
 
-##### **Explainable AI for Mental Disorder Detection via Social Media: A survey and outlook**
-2406.05984v1 by Yusif Ibrahimov, Tarique Anwar, Tommy Yuan
+##### **On Bob Dylan: A Computational Perspective**
+2502.01772v1 by Prashant Garg
 
-Mental health constitutes a complex and pervasive global challenge, affecting
-millions of lives and often leading to severe consequences. In this paper, we
-conduct a thorough survey to explore the intersection of data science,
-artificial intelligence, and mental healthcare, focusing on the recent
-developments of mental disorder detection through online social media (OSM). A
-significant portion of the population actively engages in OSM platforms,
-creating a vast repository of personal data that holds immense potential for
-mental health analytics. The paper navigates through traditional diagnostic
-methods, state-of-the-art data- and AI-driven research studies, and the
-emergence of explainable AI (XAI) models for mental healthcare. We review
-state-of-the-art machine learning methods, particularly those based on modern
-deep learning, while emphasising the need for explainability in healthcare AI
-models. The experimental design section provides insights into prevalent
-practices, including available datasets and evaluation approaches. We also
-identify key issues and challenges in the field and propose promising future
-research directions. As mental health decisions demand transparency,
-interpretability, and ethical considerations, this paper contributes to the
-ongoing discourse on advancing XAI in mental healthcare through social media.
-The comprehensive overview presented here aims to guide researchers,
-practitioners, and policymakers in developing the area of mental disorder
-detection.
+Cass Sunstein's essay 'On Bob Dylan' describes Dylan's 'dishabituating' style
+-- a constant refusal to conform to expectation and a penchant for reinventing
+his musical and lyrical identity. In this paper, I extend Sunstein's
+observations through a large-scale computational analysis of Dylan's lyrics
+from 1962 to 2012. Using o3-mini-high (a large language model), I extract
+concept-to-concept relationships from the lyrics and construct directed
+knowledge graphs that capture Dylan's thematic structure. I then quantify
+shifts in sentiment, metaphorical expression, thematic diversity, and network
+complexity over time. The results indicate that Dylan's lyrics increasingly
+rely on metaphor, display an evolving sentiment profile, and exhibit heightened
+dishabituation -- measured here as a growing variance in the network centrality
+of key concepts. I also find that references to movement, protest, and mythic
+imagery fluctuate in ways that align with well-known phases of Dylan's career,
+reflecting the dynamic and unpredictable quality of his art. These findings not
+only deepen our empirical understanding of Sunstein's thesis but also introduce
+a novel computational method for analyzing an artist's evolution-offering
+broader applicability to the study of cultural and creative change.
 
-摘要：心理健康構成了一項複雜且普遍的全球挑戰，影響了數百萬人的生活，並經常導致嚴重的後果。在本文中，我們進行了一項徹底的調查，以探索數據科學、人工智慧和心理保健的交集，重點關注通過線上社交媒體 (OSM) 進行心理疾病檢測的最新發展。很大一部分人口積極參與 OSM 平台，創造了一個龐大的人員資料庫，對心理健康分析具有巨大的潛力。本文探討了傳統的診斷方法、最先進的資料和 AI 驅動的研究，以及心理保健中可解釋 AI (XAI) 模型的出現。我們回顧了最先進的機器學習方法，特別是那些基於現代深度學習的方法，同時強調了醫療保健 AI 模型中可解釋性的必要性。實驗設計部分提供了對普遍做法的見解，包括可用的資料集和評估方法。我們還找出該領域的主要問題和挑戰，並提出了有希望的未來研究方向。由於心理健康決策需要透明度、可解釋性和道德考量，本文有助於推進心理保健中透過社交媒體推進 XAI 的持續討論。這裡提出的全面概述旨在引導研究人員、從業人員和政策制定者發展心理疾病檢測領域。
+摘要：卡斯·桑斯坦的論文「論鮑伯·迪倫」描述了迪倫「去習慣化」的風格
+-- 這種風格不斷拒絕符合預期，並熱衷於重新塑造他的音樂和歌詞認同。在本文中，我透過對迪倫 1962 年至 2012 年歌詞進行大規模的運算分析，來延伸桑斯坦的觀察。使用 o3-mini-high（一個大型語言模型），我從歌詞中提取概念對概念的關係，並建構有向知識圖，以捕捉迪倫的主題結構。然後，我量化情緒、隱喻表達、主題多樣性和網路複雜性隨時間的變化。結果顯示，迪倫的歌詞越來越依賴隱喻，展現出不斷演化的情緒輪廓，並表現出高度的去習慣化 -- 在這裡測量為關鍵概念的網路中心性的變異增加。我也發現，對運動、抗議和神話意象的引用，會以與迪倫職業生涯中眾所周知階段一致的方式波動，反映了他藝術的動態和不可預測的品質。這些發現不僅加深了我們對桑斯坦論文的經驗理解，也引入了分析藝術家演變的新穎運算方法，為文化和創造性變化的研究提供了更廣泛的適用性。
 
-##### **Methodology and Real-World Applications of Dynamic Uncertain Causality Graph for Clinical Diagnosis with Explainability and Invariance**
-2406.05746v1 by Zhan Zhang, Qin Zhang, Yang Jiao, Lin Lu, Lin Ma, Aihua Liu, Xiao Liu, Juan Zhao, Yajun Xue, Bing Wei, Mingxia Zhang, Ru Gao, Hong Zhao, Jie Lu, Fan Li, Yang Zhang, Yiming Wang, Lei Zhang, Fengwei Tian, Jie Hu, Xin Gou
+##### **VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos**
+2502.01549v1 by Xubin Ren, Lingrui Xu, Long Xia, Shuaiqiang Wang, Dawei Yin, Chao Huang
 
-AI-aided clinical diagnosis is desired in medical care. Existing deep
-learning models lack explainability and mainly focus on image analysis. The
-recently developed Dynamic Uncertain Causality Graph (DUCG) approach is
-causality-driven, explainable, and invariant across different application
-scenarios, without problems of data collection, labeling, fitting, privacy,
-bias, generalization, high cost and high energy consumption. Through close
-collaboration between clinical experts and DUCG technicians, 46 DUCG models
-covering 54 chief complaints were constructed. Over 1,000 diseases can be
-diagnosed without triage. Before being applied in real-world, the 46 DUCG
-models were retrospectively verified by third-party hospitals. The verified
-diagnostic precisions were no less than 95%, in which the diagnostic precision
-for every disease including uncommon ones was no less than 80%. After
-verifications, the 46 DUCG models were applied in the real-world in China. Over
-one million real diagnosis cases have been performed, with only 17 incorrect
-diagnoses identified. Due to DUCG's transparency, the mistakes causing the
-incorrect diagnoses were found and corrected. The diagnostic abilities of the
-clinicians who applied DUCG frequently were improved significantly. Following
-the introduction to the earlier presented DUCG methodology, the recommendation
-algorithm for potential medical checks is presented and the key idea of DUCG is
-extracted.
+Retrieval-Augmented Generation (RAG) has demonstrated remarkable success in
+enhancing Large Language Models (LLMs) through external knowledge integration,
+yet its application has primarily focused on textual content, leaving the rich
+domain of multi-modal video knowledge predominantly unexplored. This paper
+introduces VideoRAG, the first retrieval-augmented generation framework
+specifically designed for processing and understanding extremely long-context
+videos. Our core innovation lies in its dual-channel architecture that
+seamlessly integrates (i) graph-based textual knowledge grounding for capturing
+cross-video semantic relationships, and (ii) multi-modal context encoding for
+efficiently preserving visual features. This novel design empowers VideoRAG to
+process unlimited-length videos by constructing precise knowledge graphs that
+span multiple videos while maintaining semantic dependencies through
+specialized multi-modal retrieval paradigms. Through comprehensive empirical
+evaluation on our proposed LongerVideos benchmark-comprising over 160 videos
+totaling 134+ hours across lecture, documentary, and entertainment
+categories-VideoRAG demonstrates substantial performance compared to existing
+RAG alternatives and long video understanding methods. The source code of
+VideoRAG implementation and the benchmark dataset are openly available at:
+https://github.com/HKUDS/VideoRAG.
+
+摘要：檢索增強生成 (RAG) 已證明在透過外部知識整合增強大型語言模型 (LLM) 方面取得顯著成功，但其應用主要集中在文字內容上，而豐富的多模態影片知識領域則鮮少被探索。本文介紹 VideoRAG，這是第一個檢索增強生成架構，專門設計用於處理和理解極長語境的影片。我們的核心創新在於其雙通道架構，它無縫整合 (i) 基於圖形文字知識基礎，用於擷取跨影片語義關係，以及 (ii) 多模態語境編碼，用於有效保留視覺特徵。這個新穎的設計讓 VideoRAG 能夠透過建構跨越多個影片的精確知識圖譜來處理長度不限的影片，同時透過專門的多模態檢索範例來維持語義依賴性。透過我們提出的 LongerVideos 基準的全面經驗評估，該基準包含超過 160 部影片，總時數超過 134 小時，涵蓋演講、紀錄片和娛樂類別，VideoRAG 與現有的 RAG 替代方案和長影片理解方法相比，展現出顯著的效能。VideoRAG 實作的原始碼和基準資料集已公開於：https://github.com/HKUDS/VideoRAG。
+
+##### **Transformers trained on proteins can learn to attend to Euclidean distance**
+2502.01533v1 by Isaac Ellmen, Constantin Schneider, Matthew I. J. Raybould, Charlotte M. Deane
+
+While conventional Transformers generally operate on sequence data, they can
+be used in conjunction with structure models, typically SE(3)-invariant or
+equivariant graph neural networks (GNNs), for 3D applications such as protein
+structure modelling. These hybrids typically involve either (1)
+preprocessing/tokenizing structural features as input for Transformers or (2)
+taking Transformer embeddings and processing them within a structural
+representation. However, there is evidence that Transformers can learn to
+process structural information on their own, such as the AlphaFold3 structural
+diffusion model. In this work we show that Transformers can function
+independently as structure models when passed linear embeddings of coordinates.
+We first provide a theoretical explanation for how Transformers can learn to
+filter attention as a 3D Gaussian with learned variance. We then validate this
+theory using both simulated 3D points and in the context of masked token
+prediction for proteins. Finally, we show that pre-training protein Transformer
+encoders with structure improves performance on a downstream task, yielding
+better performance than custom structural models. Together, this work provides
+a basis for using standard Transformers as hybrid structure-language models.
 
-摘要：<paragraph>醫療照護中需要 AI 輔助的臨床診斷。現有的深度學習模型缺乏可解釋性，並且主要專注於影像分析。最近開發的動態不確定因果關係圖 (DUCG) 方法是因果驅動的、可解釋的，並且在不同的應用場景中是不變的，沒有資料收集、標記、擬合、隱私、偏見、概化、高成本和高能耗的問題。通過臨床專家和 DUCG 技術人員之間的密切合作，構建了涵蓋 54 個主訴的 46 個 DUCG 模型。可以在沒有分流的情況下診斷出 1,000 多種疾病。在應用於實際世界之前，46 個 DUCG 模型已由第三方醫院回溯性驗證。驗證的診斷精度不低於 95%，其中包括罕見疾病在內的每種疾病的診斷精度不低於 80%。驗證後，46 個 DUCG 模型已在中國實際應用。已經執行了超過一百萬個真實診斷案例，僅發現 17 個不正確的診斷。由於 DUCG 的透明性，發現並糾正了導致不正確診斷的錯誤。頻繁應用 DUCG 的臨床醫生的診斷能力得到了顯著提高。在介紹了前面提出的 DUCG 方法論之後，提出了潛在健康檢查的推薦演算法，並提取了 DUCG 的關鍵思想。</paragraph>
+摘要：雖然傳統的 Transformer 通常處理序列資料，但它們可用於結構模型，通常是 SE(3) 不變式或等變式圖神經網路 (GNN)，用於蛋白質結構建模等 3D 應用。這些混合模型通常包含 (1) 將結構特徵預處理/標記化為 Transformer 的輸入或 (2) 取用 Transformer 嵌入並在結構表示中處理它們。然而，有證據表明 Transformer 可以自行學習處理結構資訊，例如 AlphaFold3 結構擴散模型。在這項工作中，我們展示了 Transformer 在傳遞座標的線性嵌入時，可以獨立作為結構模型運作。我們首先提供了 Transformer 如何學習將注意力濾波為具有學習變異的 3D 高斯的理論解釋。然後我們使用模擬 3D 點和在蛋白質遮罩標記預測的背景下驗證此理論。最後，我們展示了使用結構預訓練蛋白質 Transformer 編碼器會改善下游任務的效能，產生比自訂結構模型更好的效能。綜合來說，這項工作提供了使用標準 Transformer 作為混合結構語言模型的基礎。
 
-##### **Advancing Histopathology-Based Breast Cancer Diagnosis: Insights into Multi-Modality and Explainability**
-2406.12897v1 by Faseela Abdullakutty, Younes Akbari, Somaya Al-Maadeed, Ahmed Bouridane, Rifat Hamoudi
+##### **Common Foundations for SHACL, ShEx, and PG-Schema**
+2502.01295v1 by S. Ahmetaj, I. Boneva, J. Hidders, K. Hose, M. Jakubowski, J. E. Labra-Gayo, W. Martens, F. Mogavero, F. Murlak, C. Okulmus, A. Polleres, O. Savkovic, M. Simkus, D. Tomaszuk
 
-It is imperative that breast cancer is detected precisely and timely to
-improve patient outcomes. Diagnostic methodologies have traditionally relied on
-unimodal approaches; however, medical data analytics is integrating diverse
-data sources beyond conventional imaging. Using multi-modal techniques,
-integrating both image and non-image data, marks a transformative advancement
-in breast cancer diagnosis. The purpose of this review is to explore the
-burgeoning field of multimodal techniques, particularly the fusion of
-histopathology images with non-image data. Further, Explainable AI (XAI) will
-be used to elucidate the decision-making processes of complex algorithms,
-emphasizing the necessity of explainability in diagnostic processes. This
-review utilizes multi-modal data and emphasizes explainability to enhance
-diagnostic accuracy, clinician confidence, and patient engagement, ultimately
-fostering more personalized treatment strategies for breast cancer, while also
-identifying research gaps in multi-modality and explainability, guiding future
-studies, and contributing to the strategic direction of the field.
+Graphs have emerged as an important foundation for a variety of applications,
+including capturing and reasoning over factual knowledge, semantic data
+integration, social networks, and providing factual knowledge for machine
+learning algorithms. To formalise certain properties of the data and to ensure
+data quality, there is a need to describe the schema of such graphs. Because of
+the breadth of applications and availability of different data models, such as
+RDF and property graphs, both the Semantic Web and the database community have
+independently developed graph schema languages: SHACL, ShEx, and PG-Schema.
+Each language has its unique approach to defining constraints and validating
+graph data, leaving potential users in the dark about their commonalities and
+differences. In this paper, we provide formal, concise definitions of the core
+components of each of these schema languages. We employ a uniform framework to
+facilitate a comprehensive comparison between the languages and identify a
+common set of functionalities, shedding light on both overlapping and
+distinctive features of the three languages.
 
-摘要：精確且及時地偵測乳癌對於改善患者預後至關重要。診斷方法傳統上依賴於單一模式方法；然而，醫療資料分析正在整合超越傳統影像的各種資料來源。使用整合影像和非影像資料的多模式技術，標誌著乳癌診斷的變革性進展。本篇綜述的目的是探討多模式技術的新興領域，特別是將組織病理學影像與非影像資料融合。此外，可解釋人工智慧 (XAI) 將用於闡明複雜演算法的決策過程，強調診斷過程中可解釋性的必要性。本綜述利用多模式資料並強調可解釋性，以提高診斷準確性、臨床醫師的信心和患者參與度，最終促進乳癌更個人化的治療策略，同時也找出多模式和可解釋性的研究差距，引導未來的研究，並為該領域的策略方向做出貢獻。
+摘要：圖表已成為各種應用的重要基礎，包括擷取和推理事實知識、語義資料整合、社群網路，以及為機器學習演算法提供事實知識。為了形式化資料的特定屬性並確保資料品質，有必要描述此類圖表的架構。由於應用範圍廣泛且有不同的資料模型可用，例如 RDF 和屬性圖表，因此語義網路和資料庫社群已獨立開發圖表架構語言：SHACL、ShEx 和 PG-Schema。每種語言都有其定義約束和驗證圖表資料的獨特方法，讓潛在使用者不清楚它們的共性和差異。在本文中，我們提供這些架構語言中每個核心元件的正式簡潔定義。我們採用統一的框架來促進語言之間的全面比較，並找出功能的共同集合，說明這三種語言的重疊和獨特功能。
 
-##### **Using Explainable AI for EEG-based Reduced Montage Neonatal Seizure Detection**
-2406.16908v3 by Dinuka Sandun Udayantha, Kavindu Weerasinghe, Nima Wickramasinghe, Akila Abeyratne, Kithmin Wickremasinghe, Jithangi Wanigasinghe, Anjula De Silva, Chamira U. S. Edussooriya
+##### **GFM-RAG: Graph Foundation Model for Retrieval Augmented Generation**
+2502.01113v1 by Linhao Luo, Zicheng Zhao, Gholamreza Haffari, Dinh Phung, Chen Gong, Shirui Pan
 
-The neonatal period is the most vulnerable time for the development of
-seizures. Seizures in the immature brain lead to detrimental consequences,
-therefore require early diagnosis. The gold-standard for neonatal seizure
-detection currently relies on continuous video-EEG monitoring; which involves
-recording multi-channel electroencephalogram (EEG) alongside real-time video
-monitoring within a neonatal intensive care unit (NICU). However, video-EEG
-monitoring technology requires clinical expertise and is often limited to
-technologically advanced and resourceful settings. Cost-effective new
-techniques could help the medical fraternity make an accurate diagnosis and
-advocate treatment without delay. In this work, a novel explainable deep
-learning model to automate the neonatal seizure detection process with a
-reduced EEG montage is proposed, which employs convolutional nets, graph
-attention layers, and fully connected layers. Beyond its ability to detect
-seizures in real-time with a reduced montage, this model offers the unique
-advantage of real-time interpretability. By evaluating the performance on the
-Zenodo dataset with 10-fold cross-validation, the presented model achieves an
-absolute improvement of 8.31% and 42.86% in area under curve (AUC) and recall,
-respectively.
+Retrieval-augmented generation (RAG) has proven effective in integrating
+knowledge into large language models (LLMs). However, conventional RAGs
+struggle to capture complex relationships between pieces of knowledge, limiting
+their performance in intricate reasoning that requires integrating knowledge
+from multiple sources. Recently, graph-enhanced retrieval augmented generation
+(GraphRAG) builds graph structure to explicitly model these relationships,
+enabling more effective and efficient retrievers. Nevertheless, its performance
+is still hindered by the noise and incompleteness within the graph structure.
+To address this, we introduce GFM-RAG, a novel graph foundation model (GFM) for
+retrieval augmented generation. GFM-RAG is powered by an innovative graph
+neural network that reasons over graph structure to capture complex
+query-knowledge relationships. The GFM with 8M parameters undergoes a two-stage
+training process on large-scale datasets, comprising 60 knowledge graphs with
+over 14M triples and 700k documents. This results in impressive performance and
+generalizability for GFM-RAG, making it the first graph foundation model
+applicable to unseen datasets for retrieval without any fine-tuning required.
+Extensive experiments on three multi-hop QA datasets and seven domain-specific
+RAG datasets demonstrate that GFM-RAG achieves state-of-the-art performance
+while maintaining efficiency and alignment with neural scaling laws,
+highlighting its potential for further improvement.
 
-摘要：新生兒期是大腦發育最脆弱的時期，容易出現癲癇發作。大腦發育不成熟時出現癲癇發作會造成不良後果，因此需要及早診斷。目前新生兒癲癇發作的黃金標準依賴於連續的視訊腦電圖 (EEG) 監測；其中包括在新生兒加護病房 (NICU) 內同時進行多頻道腦電圖 (EEG) 記錄和即時視訊監控。然而，視訊腦電圖監控技術需要臨床專業知識，而且通常僅限於技術先進且資源豐富的環境。具成本效益的新技術可以幫助醫療界準確診斷並立即提倡治療。在這項工作中，提出了一個新穎的可解釋深度學習模型，以自動化新生兒癲癇發作偵測過程，並採用減少的腦電圖裝置，其中採用了卷積神經網路、圖形注意力層和全連接層。除了能夠使用減少的裝置即時偵測癲癇發作外，此模型還提供了即時可解釋性的獨特優勢。透過在 Zenodo 資料集上使用 10 倍交叉驗證評估效能，所提出的模型在曲線下面積 (AUC) 和召回率方面分別達到了 8.31% 和 42.86% 的絕對改善。
+摘要：檢索增強生成 (RAG) 已證明在整合知識到大語言模型 (LLM) 中有效。然而，傳統的 RAG 難以捕捉知識片段之間的複雜關係，限制了它們在需要整合來自多個來源的知識的複雜推理中的表現。最近，圖表增強檢索增強生成 (GraphRAG) 建立圖表結構來明確建模這些關係，從而實現更有效率的檢索器。儘管如此，其效能仍受到圖表結構中雜訊和不完整性的阻礙。為了解決這個問題，我們引入了 GFM-RAG，一種用於檢索增強生成的全新圖表基礎模型 (GFM)。GFM-RAG 由一個創新的圖神經網路驅動，該網路在圖表結構上進行推理以捕捉複雜的查詢知識關係。具有 8M 參數的 GFM 在大型資料集上進行兩階段訓練流程，包括 60 個包含超過 14M 個三元組和 700k 個文件的文件。這為 GFM-RAG 帶來了令人印象深刻的效能和通用性，使其成為第一個適用於未見過資料集的圖表基礎模型，而無需任何微調。在三個多跳問答資料集和七個特定領域 RAG 資料集上的廣泛實驗表明，GFM-RAG 達到了最先進的效能，同時保持了效率並與神經擴充定律保持一致，突顯了其進一步改進的潛力。
 
-##### **Breast Cancer Diagnosis: A Comprehensive Exploration of Explainable Artificial Intelligence (XAI) Techniques**
-2406.00532v1 by Samita Bai, Sidra Nasir, Rizwan Ahmed Khan, Sheeraz Arif, Alexandre Meyer, Hubert Konik
+##### **Knowledge Synthesis of Photosynthesis Research Using a Large Language Model**
+2502.01059v1 by Seungri Yoon, Woosang Jeon, Sanghyeok Choi, Taehyeong Kim, Tae In Ahn
 
-Breast cancer (BC) stands as one of the most common malignancies affecting
-women worldwide, necessitating advancements in diagnostic methodologies for
-better clinical outcomes. This article provides a comprehensive exploration of
-the application of Explainable Artificial Intelligence (XAI) techniques in the
-detection and diagnosis of breast cancer. As Artificial Intelligence (AI)
-technologies continue to permeate the healthcare sector, particularly in
-oncology, the need for transparent and interpretable models becomes imperative
-to enhance clinical decision-making and patient care. This review discusses the
-integration of various XAI approaches, such as SHAP, LIME, Grad-CAM, and
-others, with machine learning and deep learning models utilized in breast
-cancer detection and classification. By investigating the modalities of breast
-cancer datasets, including mammograms, ultrasounds and their processing with
-AI, the paper highlights how XAI can lead to more accurate diagnoses and
-personalized treatment plans. It also examines the challenges in implementing
-these techniques and the importance of developing standardized metrics for
-evaluating XAI's effectiveness in clinical settings. Through detailed analysis
-and discussion, this article aims to highlight the potential of XAI in bridging
-the gap between complex AI models and practical healthcare applications,
-thereby fostering trust and understanding among medical professionals and
-improving patient outcomes.
+The development of biological data analysis tools and large language models
+(LLMs) has opened up new possibilities for utilizing AI in plant science
+research, with the potential to contribute significantly to knowledge
+integration and research gap identification. Nonetheless, current LLMs struggle
+to handle complex biological data and theoretical models in photosynthesis
+research and often fail to provide accurate scientific contexts. Therefore,
+this study proposed a photosynthesis research assistant (PRAG) based on
+OpenAI's GPT-4o with retrieval-augmented generation (RAG) techniques and prompt
+optimization. Vector databases and an automated feedback loop were used in the
+prompt optimization process to enhance the accuracy and relevance of the
+responses to photosynthesis-related queries. PRAG showed an average improvement
+of 8.7% across five metrics related to scientific writing, with a 25.4%
+increase in source transparency. Additionally, its scientific depth and domain
+coverage were comparable to those of photosynthesis research papers. A
+knowledge graph was used to structure PRAG's responses with papers within and
+outside the database, which allowed PRAG to match key entities with 63% and
+39.5% of the database and test papers, respectively. PRAG can be applied for
+photosynthesis research and broader plant science domains, paving the way for
+more in-depth data analysis and predictive capabilities.
 
-摘要：乳癌 (BC) 是影響全球女性最常見的惡性腫瘤之一，因此需要進步的診斷方法，以改善臨床結果。本文全面探討了可解釋人工智慧 (XAI) 技術在乳癌偵測和診斷中的應用。隨著人工智慧 (AI) 技術持續滲透醫療保健領域，特別是在腫瘤學中，透明且可解釋的模型需求變得勢在必行，以增強臨床決策制定和患者照護。此篇評論探討了各種 XAI 方法的整合，例如 SHAP、LIME、Grad-CAM 等，以及用於乳癌偵測和分類的機器學習和深度學習模型。透過探討乳癌資料集的模式，包括乳房攝影、超音波及其在 AI 中的處理，本文重點說明 XAI 如何能導致更準確的診斷和個人化治療計畫。它也探討了實施這些技術的挑戰，以及制定標準化評量指標以評估 XAI 在臨床環境中的有效性的重要性。透過詳細的分析和討論，本文旨在強調 XAI 在縮小複雜 AI 模型與實務醫療保健應用之間差距的潛力，進而促進醫療專業人員之間的信任與理解，並改善患者的結果。
+摘要：生物資料分析工具和大型語言模型 (LLM) 的發展，為利用人工智慧於植物科學研究開啟了新的可能性，並有潛力對知識整合和研究差距的識別做出重大貢獻。儘管如此，目前的 LLM 在處理光合作用研究中的複雜生物資料和理論模型時仍有困難，而且常常無法提供準確的科學背景。因此，本研究提出了一個基於 OpenAI 的 GPT-4o、具備檢索增強生成 (RAG) 技術和提示最佳化的光合作用研究助理 (PRAG)。在提示最佳化過程中，使用了向量資料庫和自動回饋迴路，以增強對與光合作用相關查詢的回應的準確性和相關性。PRAG 在與科學寫作相關的五項指標中顯示出平均改善了 8.7%，來源透明度增加了 25.4%。此外，其科學深度和領域涵蓋範圍與光合作用研究論文相當。知識圖譜用於建構 PRAG 的回應，其中包含資料庫內外論文，這使得 PRAG 能夠分別與資料庫和測試論文中的 63% 和 39.5% 的關鍵實體相匹配。PRAG 可應用於光合作用研究和更廣泛的植物科學領域，為更深入的資料分析和預測能力鋪路。
 
-##### **Unveiling Hidden Factors: Explainable AI for Feature Boosting in Speech Emotion Recognition**
-2406.01624v2 by Alaa Nfissi, Wassim Bouachir, Nizar Bouguila, Brian Mishara
+##### **Encrypted Large Model Inference: The Equivariant Encryption Paradigm**
+2502.01013v1 by James Buban, Hongyang Zhang, Claudio Angione, Harry Yang, Ahmad Farhan, Seyfal Sultanov, Michael Du, Xuran Ma, Zihao Wang, Yue Zhao, Arria Owlia, Fielding Johnston, Patrick Colangelo
 
-Speech emotion recognition (SER) has gained significant attention due to its
-several application fields, such as mental health, education, and
-human-computer interaction. However, the accuracy of SER systems is hindered by
-high-dimensional feature sets that may contain irrelevant and redundant
-information. To overcome this challenge, this study proposes an iterative
-feature boosting approach for SER that emphasizes feature relevance and
-explainability to enhance machine learning model performance. Our approach
-involves meticulous feature selection and analysis to build efficient SER
-systems. In addressing our main problem through model explainability, we employ
-a feature evaluation loop with Shapley values to iteratively refine feature
-sets. This process strikes a balance between model performance and
-transparency, which enables a comprehensive understanding of the model's
-predictions. The proposed approach offers several advantages, including the
-identification and removal of irrelevant and redundant features, leading to a
-more effective model. Additionally, it promotes explainability, facilitating
-comprehension of the model's predictions and the identification of crucial
-features for emotion determination. The effectiveness of the proposed method is
-validated on the SER benchmarks of the Toronto emotional speech set (TESS),
-Berlin Database of Emotional Speech (EMO-DB), Ryerson Audio-Visual Database of
-Emotional Speech and Song (RAVDESS), and Surrey Audio-Visual Expressed Emotion
-(SAVEE) datasets, outperforming state-of-the-art methods. To the best of our
-knowledge, this is the first work to incorporate model explainability into an
-SER framework. The source code of this paper is publicly available via this
-https://github.com/alaaNfissi/Unveiling-Hidden-Factors-Explainable-AI-for-Feature-Boosting-in-Speech-Emotion-Recognition.
+Large scale deep learning model, such as modern language models and diffusion
+architectures, have revolutionized applications ranging from natural language
+processing to computer vision. However, their deployment in distributed or
+decentralized environments raises significant privacy concerns, as sensitive
+data may be exposed during inference. Traditional techniques like secure
+multi-party computation, homomorphic encryption, and differential privacy offer
+partial remedies but often incur substantial computational overhead, latency
+penalties, or limited compatibility with non-linear network operations. In this
+work, we introduce Equivariant Encryption (EE), a novel paradigm designed to
+enable secure, "blind" inference on encrypted data with near zero performance
+overhead. Unlike fully homomorphic approaches that encrypt the entire
+computational graph, EE selectively obfuscates critical internal
+representations within neural network layers while preserving the exact
+functionality of both linear and a prescribed set of non-linear operations.
+This targeted encryption ensures that raw inputs, intermediate activations, and
+outputs remain confidential, even when processed on untrusted infrastructure.
+We detail the theoretical foundations of EE, compare its performance and
+integration complexity against conventional privacy preserving techniques, and
+demonstrate its applicability across a range of architectures, from
+convolutional networks to large language models. Furthermore, our work provides
+a comprehensive threat analysis, outlining potential attack vectors and
+baseline strategies, and benchmarks EE against standard inference pipelines in
+decentralized settings. The results confirm that EE maintains high fidelity and
+throughput, effectively bridging the gap between robust data confidentiality
+and the stringent efficiency requirements of modern, large scale model
+inference.
 
-摘要：語音情緒辨識 (SER) 由於其在心理健康、教育和人機互動等多個應用領域而備受關注。然而，SER 系統的準確性受到高維特徵集的阻礙，這些特徵集可能包含不相關和冗餘的資訊。為了克服這個挑戰，本研究提出了一種用於 SER 的迭代特徵提升方法，該方法強調特徵相關性和可解釋性，以增強機器學習模型的效能。我們的做法涉及仔細的特徵選擇和分析，以建立高效的 SER 系統。為了透過模型可解釋性解決我們的核心問題，我們採用了具有 Shapley 值的特徵評估迴圈，以反覆改善特徵集。這個過程在模型效能和透明度之間取得平衡，這使得我們能夠全面了解模型的預測。所提出的方法提供了多項優點，包括識別和移除不相關和冗餘的特徵，從而建立更有效的模型。此外，它促進了可解釋性，有助於理解模型的預測以及識別情緒決定的關鍵特徵。所提出的方法的有效性已在多倫多情緒語音集 (TESS)、柏林情緒語音資料庫 (EMO-DB)、賴爾森音訊視覺情緒語音和歌曲資料庫 (RAVDESS) 和薩里音訊視覺表達情緒 (SAVEE) 資料集的 SER 基準上得到驗證，其效能優於現有方法。據我們所知，這是第一個將模型可解釋性納入 SER 架構的研究。本文的原始碼可透過此連結公開取得：https://github.com/alaaNfissi/Unveiling-Hidden-Factors-Explainable-AI-for-Feature-Boosting-in-Speech-Emotion-Recognition。
+摘要：大型深度學習模型，例如現代語言模型和擴散架構，徹底改變了從自然語言處理到電腦視覺等各種應用。然而，它們在分散式或分散式環境中的部署引發了重大的隱私問題，因為敏感數據可能會在推理過程中遭到揭露。安全多方計算、同態加密和差分隱私等傳統技術提供了部分補救措施，但通常會產生大量的計算開銷、延遲處罰，或與非線性網路操作相容性有限。在這項工作中，我們引入了等變加密 (EE)，這是一種新穎的範例，旨在以接近零效能開銷對加密數據進行安全、「盲目」推理。與加密整個計算圖形的完全同態方法不同，EE 有選擇性地混淆神經網路層內的關鍵內部表示，同時保留線性和規定的一組非線性操作的精確功能。這種有針對性的加密確保了原始輸入、中間激活和輸出保持機密，即使在不受信任的基礎設施上處理也是如此。我們詳細說明了 EE 的理論基礎，比較了其效能和整合複雜度與傳統的隱私保護技術，並展示了其在從卷積網路到大語言模型等各種架構中的適用性。此外，我們的研究提供了全面的威脅分析，概述了潛在的攻擊媒介和基準策略，並在分散式設定中將 EE 與標準推理管道進行比較。結果證實，EE 保持了高保真度和高傳輸量，有效地彌合了強大的數據機密性與現代化、大規模模型推理的嚴格效率要求之間的差距。
 
-##### **The Explanation Necessity for Healthcare AI**
-2406.00216v1 by Michail Mamalakis, Héloïse de Vareilles, Graham Murray, Pietro Lio, John Suckling
+##### **Metastable Dynamics of Chain-of-Thought Reasoning: Provable Benefits of Search, RL and Distillation**
+2502.01694v1 by Juno Kim, Denny Wu, Jason Lee, Taiji Suzuki
 
-Explainability is often critical to the acceptable implementation of
-artificial intelligence (AI). Nowhere is this more important than healthcare
-where decision-making directly impacts patients and trust in AI systems is
-essential. This trust is often built on the explanations and interpretations
-the AI provides. Despite significant advancements in AI interpretability, there
-remains the need for clear guidelines on when and to what extent explanations
-are necessary in the medical context. We propose a novel categorization system
-with four distinct classes of explanation necessity, guiding the level of
-explanation required: patient or sample (local) level, cohort or dataset
-(global) level, or both levels. We introduce a mathematical formulation that
-distinguishes these categories and offers a practical framework for researchers
-to determine the necessity and depth of explanations required in medical AI
-applications. Three key factors are considered: the robustness of the
-evaluation protocol, the variability of expert observations, and the
-representation dimensionality of the application. In this perspective, we
-address the question: When does an AI medical application need to be explained,
-and at what level of detail?
+A key paradigm to improve the reasoning capabilities of large language models
+(LLMs) is to allocate more inference-time compute to search against a verifier
+or reward model. This process can then be utilized to refine the pretrained
+model or distill its reasoning patterns into more efficient models. In this
+paper, we study inference-time compute by viewing chain-of-thought (CoT)
+generation as a metastable Markov process: easy reasoning steps (e.g.,
+algebraic manipulations) form densely connected clusters, while hard reasoning
+steps (e.g., applying a relevant theorem) create sparse, low-probability edges
+between clusters, leading to phase transitions at longer timescales. Under this
+framework, we prove that implementing a search protocol that rewards sparse
+edges improves CoT by decreasing the expected number of steps to reach
+different clusters. In contrast, we establish a limit on reasoning capability
+when the model is restricted to local information of the pretrained graph. We
+also show that the information gained by search can be utilized to obtain a
+better reasoning model: (1) the pretrained model can be directly finetuned to
+favor sparse edges via policy gradient methods, and moreover (2) a compressed
+metastable representation of the reasoning dynamics can be distilled into a
+smaller, more efficient model.
 
-摘要：可解释性通常对于人工智能 (AI) 的可接受实施至关重要。在医疗保健领域，这一点尤为重要，因为决策直接影响患者，并且对 AI 系统的信任至关重要。这种信任通常建立在 AI 提供的解释和诠释之上。尽管 AI 可解释性取得了重大进展，但仍然需要明确的指导方针，说明在医疗环境中何时以及在多大程度上需要解释。我们提出了一种新颖的分类系统，该系统具有四种不同的解释必要性类别，指导所需的解释级别：患者或样本（局部）级别、队列或数据集（全局）级别，或两个级别。我们引入了一个数学公式，该公式区分了这些类别，并为研究人员提供了一个实用框架，以确定医疗 AI 应用中所需的解释的必要性和深度。考虑了三个关键因素：评估协议的稳健性、专家观察的可变性以及应用程序的表示维数。从这个角度来看，我们解决了这个问题：AI 医疗应用何时需要解释，以及需要解释到何种程度？
+摘要：<paragraph>提升大型語言模型 (LLM) 推理能力的一個關鍵範例，是分配更多推論時間運算來搜尋驗證器或獎勵模型。此程序接著可用於改善預訓練模型或將其推理模式提煉到更有效率的模型中。在這篇論文中，我們透過將思維鏈 (CoT) 生成視為亞穩態馬可夫過程來研究推論時間運算：簡單的推理步驟（例如代數運算）形成密集連接的叢集，而困難的推理步驟（例如應用相關定理）則在叢集之間建立稀疏、低機率的邊緣，導致在較長時間尺度上產生相變。在此架構下，我們證明實作一種獎勵稀疏邊緣的搜尋協定，會透過減少到達不同叢集所需的預期步驟數來改善 CoT。相反地，當模型受限於預訓練圖形的局部資訊時，我們建立了推理能力的限制。我們也顯示搜尋所獲得的資訊可用於取得更好的推理模型：(1) 預訓練模型可以直接微調以透過策略梯度方法偏好稀疏邊緣，而且 (2) 推理動態的壓縮亞穩態表徵可以提煉到更小、更有效率的模型中。</paragraph>
 
-##### **Interdisciplinary Expertise to Advance Equitable Explainable AI**
-2406.18563v1 by Chloe R. Bennett, Heather Cole-Lewis, Stephanie Farquhar, Naama Haamel, Boris Babenko, Oran Lang, Mat Fleck, Ilana Traynis, Charles Lau, Ivor Horn, Courtney Lyles
+##### **PhiP-G: Physics-Guided Text-to-3D Compositional Scene Generation**
+2502.00708v1 by Qixuan Li, Chao Wang, Zongjin He, Yan Peng
 
-The field of artificial intelligence (AI) is rapidly influencing health and
-healthcare, but bias and poor performance persists for populations who face
-widespread structural oppression. Previous work has clearly outlined the need
-for more rigorous attention to data representativeness and model performance to
-advance equity and reduce bias. However, there is an opportunity to also
-improve the explainability of AI by leveraging best practices of social
-epidemiology and health equity to help us develop hypotheses for associations
-found. In this paper, we focus on explainable AI (XAI) and describe a framework
-for interdisciplinary expert panel review to discuss and critically assess AI
-model explanations from multiple perspectives and identify areas of bias and
-directions for future research. We emphasize the importance of the
-interdisciplinary expert panel to produce more accurate, equitable
-interpretations which are historically and contextually informed.
-Interdisciplinary panel discussions can help reduce bias, identify potential
-confounders, and identify opportunities for additional research where there are
-gaps in the literature. In turn, these insights can suggest opportunities for
-AI model improvement.
+Text-to-3D asset generation has achieved significant optimization under the
+supervision of 2D diffusion priors. However, when dealing with compositional
+scenes, existing methods encounter several challenges: 1). failure to ensure
+that composite scene layouts comply with physical laws; 2). difficulty in
+accurately capturing the assets and relationships described in complex scene
+descriptions; 3). limited autonomous asset generation capabilities among layout
+approaches leveraging large language models (LLMs). To avoid these compromises,
+we propose a novel framework for compositional scene generation, PhiP-G, which
+seamlessly integrates generation techniques with layout guidance based on a
+world model. Leveraging LLM-based agents, PhiP-G analyzes the complex scene
+description to generate a scene graph, and integrating a multimodal 2D
+generation agent and a 3D Gaussian generation method for targeted assets
+creation. For the stage of layout, PhiP-G employs a physical pool with adhesion
+capabilities and a visual supervision agent, forming a world model for layout
+prediction and planning. Extensive experiments demonstrate that PhiP-G
+significantly enhances the generation quality and physical rationality of the
+compositional scenes. Notably, PhiP-G attains state-of-the-art (SOTA)
+performance in CLIP scores, achieves parity with the leading methods in
+generation quality as measured by the T$^3$Bench, and improves efficiency by
+24x.
 
-摘要：人工智慧 (AI) 領域正快速影響著健康與醫療保健，但對於面臨廣泛結構性壓迫的人群來說，偏見和不良表現依然存在。先前的研究已清楚說明，需要更嚴格地注意資料代表性和模型效能，以促進公平性並減少偏見。然而，我們有機會透過運用社會流行病學和健康公平的最佳實務，來改善 AI 的可解釋性，以幫助我們針對發現的關聯性，發展假設。在本文中，我們專注於可解釋 AI (XAI)，並描述一個跨領域專家小組審查架構，以從多重觀點討論和批判性評估 AI 模型的解釋，並找出偏見領域和未來研究的方向。我們強調跨領域專家小組對於產生更準確、公平的詮釋至關重要，而這些詮釋是根據歷史和脈絡而來的。跨領域小組討論有助於減少偏見、找出潛在的混淆因素，並在文獻中有缺口時找出額外研究的機會。反過來，這些見解可以建議 AI 模型改進的機會。
+摘要：<paragraph>在 2D 擴散先驗的監督下，文字轉 3D 資產生成已取得顯著的最佳化。然而，在處理合成場景時，現有方法會遇到幾個挑戰：1) 無法確保複合場景佈局符合物理定律；2) 難以準確捕捉複雜場景描述中所描述的資產和關係；3) 在利用大型語言模型 (LLM) 的佈局方法中，自主資產生成能力有限。為了避免這些折衷，我們提出了一個合成場景生成的新框架 PhiP-G，它將生成技術與基於世界模型的佈局指導無縫整合。利用基於 LLM 的代理，PhiP-G 分析複雜的場景描述以生成場景圖，並整合多模態 2D 生成代理和 3D 高斯生成方法以進行目標資產創建。對於佈局階段，PhiP-G 採用具有附著能力的物理池和視覺監督代理，形成用於佈局預測和規劃的世界模型。大量的實驗證明，PhiP-G 大幅提升了合成場景的生成品質和物理合理性。值得注意的是，PhiP-G 在 CLIP 分數中獲得了最先進 (SOTA) 的效能，在 T$^3$Bench 測量的生成品質中與領先的方法達到同等水準，並將效率提升了 24 倍。</paragraph>
 
-##### **"It depends": Configuring AI to Improve Clinical Usefulness Across Contexts**
-2407.11978v1 by Hubert D. Zając, Jorge M. N. Ribeiro, Silvia Ingala, Simona Gentile, Ruth Wanjohi, Samuel N. Gitau, Jonathan F. Carlsen, Michael B. Nielsen, Tariq O. Andersen
+##### **A Survey of Quantized Graph Representation Learning: Connecting Graph Structures with Large Language Models**
+2502.00681v1 by Qika Lin, Zhen Peng, Kaize Shi, Kai He, Yiming Xu, Erik Cambria, Mengling Feng
 
-Artificial Intelligence (AI) repeatedly match or outperform radiologists in
-lab experiments. However, real-world implementations of radiological AI-based
-systems are found to provide little to no clinical value. This paper explores
-how to design AI for clinical usefulness in different contexts. We conducted 19
-design sessions and design interventions with 13 radiologists from 7 clinical
-sites in Denmark and Kenya, based on three iterations of a functional AI-based
-prototype. Ten sociotechnical dependencies were identified as crucial for the
-design of AI in radiology. We conceptualised four technical dimensions that
-must be configured to the intended clinical context of use: AI functionality,
-AI medical focus, AI decision threshold, and AI Explainability. We present four
-design recommendations on how to address dependencies pertaining to the medical
-knowledge, clinic type, user expertise level, patient context, and user
-situation that condition the configuration of these technical dimensions.
+Recent years have witnessed rapid advances in graph representation learning,
+with the continuous embedding approach emerging as the dominant paradigm.
+However, such methods encounter issues regarding parameter efficiency,
+interpretability, and robustness. Thus, Quantized Graph Representation (QGR)
+learning has recently gained increasing interest, which represents the graph
+structure with discrete codes instead of conventional continuous embeddings.
+Given its analogous representation form to natural language, QGR also possesses
+the capability to seamlessly integrate graph structures with large language
+models (LLMs). As this emerging paradigm is still in its infancy yet holds
+significant promise, we undertake this thorough survey to promote its rapid
+future prosperity. We first present the background of the general quantization
+methods and their merits. Moreover, we provide an in-depth demonstration of
+current QGR studies from the perspectives of quantized strategies, training
+objectives, distinctive designs, knowledge graph quantization, and
+applications. We further explore the strategies for code dependence learning
+and integration with LLMs. At last, we give discussions and conclude future
+directions, aiming to provide a comprehensive picture of QGR and inspire future
+research.
 
-摘要：人工智慧（AI）在實驗室實驗中不斷地與放射科醫師匹敵或表現得更出色。然而，發現放射科 AI 為基礎系統的實際執行幾乎沒有提供臨床價值。本文探討如何為 AI 設計在不同情境中臨床上的效用。我們根據功能性 AI 為基礎原型的三次迭代，在丹麥和肯亞的 7 個臨床場域與 13 位放射科醫師進行了 19 次設計會議和設計介入。十個社會技術依賴關係被認為對於放射科中 AI 的設計至關重要。我們概念化了四個技術面向，必須根據預期的臨床使用情境進行設定：AI 功能、AI 醫療重點、AI 決策門檻，以及 AI 可解釋性。我們提出四項設計建議，說明如何處理與醫療知識、診所類型、使用者專業知識等級、患者情境，以及影響這些技術面向設定的使用者情境相關的依賴關係。
+摘要：近年来，图表示学习取得了快速进展，其中连续嵌入方法作为主导范式出现。然而，此类方法遇到了参数效率、可解释性和鲁棒性方面的问题。因此，量化图表示 (QGR) 学习最近引起了越来越多的兴趣，它使用离散代码而不是传统的连续嵌入来表示图结构。鉴于其与自然语言类似的表示形式，QGR 也具备将图结构与大型语言模型 (LLM) 无缝集成的能力。由于这种新兴范式仍处于起步阶段，但前景广阔，我们进行了这项全面调查以促进其快速未来的繁荣。我们首先介绍了通用量化方法的背景及其优点。此外，我们从量化策略、训练目标、独特设计、知识图谱量化和应用的角度对当前的 QGR 研究进行了深入的论证。我们进一步探索了代码依赖性学习和与 LLM 集成的策略。最后，我们给出了讨论并总结了未来的方向，旨在提供 QGR 的全面图景并激发未来的研究。
 
-##### **Improving Health Professionals' Onboarding with AI and XAI for Trustworthy Human-AI Collaborative Decision Making**
-2405.16424v1 by Min Hun Lee, Silvana Xin Yi Choo, Shamala D/O Thilarajah
+##### **Challenges and Innovations in LLM-Powered Fake News Detection: A Synthesis of Approaches and Future Directions**
+2502.00339v1 by Jingyuan Yi, Zeqiu Xu, Tianyi Huang, Peiyang Yu
 
-With advanced AI/ML, there has been growing research on explainable AI (XAI)
-and studies on how humans interact with AI and XAI for effective human-AI
-collaborative decision-making. However, we still have a lack of understanding
-of how AI systems and XAI should be first presented to users without technical
-backgrounds. In this paper, we present the findings of semi-structured
-interviews with health professionals (n=12) and students (n=4) majoring in
-medicine and health to study how to improve onboarding with AI and XAI. For the
-interviews, we built upon human-AI interaction guidelines to create onboarding
-materials of an AI system for stroke rehabilitation assessment and AI
-explanations and introduce them to the participants. Our findings reveal that
-beyond presenting traditional performance metrics on AI, participants desired
-benchmark information, the practical benefits of AI, and interaction trials to
-better contextualize AI performance, and refine the objectives and performance
-of AI. Based on these findings, we highlight directions for improving
-onboarding with AI and XAI and human-AI collaborative decision-making.
+The pervasiveness of the dissemination of fake news through social media
+platforms poses critical risks to the trust of the general public, societal
+stability, and democratic institutions. This challenge calls for novel
+methodologies in detection, which can keep pace with the dynamic and
+multi-modal nature of misinformation. Recent works include powering the
+detection using large language model advances in multimodal frameworks,
+methodologies using graphs, and adversarial training in the literature of fake
+news. Based on the different approaches which can bring success, some key
+highlights will be underlined: enhanced LLM-improves accuracy through more
+advanced semantics and cross-modality fusion for robust detections. The review
+further identifies critical gaps in adaptability to dynamic social media
+trends, real-time, and cross-platform detection capabilities, as well as the
+ethical challenges thrown up by the misuse of LLMs. Future directions underline
+the development of style-agnostic models, cross-lingual detection frameworks,
+and robust policies with a view to mitigating LLM-driven misinformation. This
+synthesis thus lays a concrete foundation for those researchers and
+practitioners committed to reinforcing fake news detection systems with
+complications that keep on growing in the digital landscape.
 
-摘要：隨著先進的 AI/ML，對可解釋 AI (XAI) 的研究不斷增加，以及關於人類如何與 AI 和 XAI 互動以進行有效的人工智慧協作決策制定。然而，我們仍然缺乏對 AI 系統和 XAI 應如何首先呈現給沒有技術背景的用戶的了解。在本文中，我們展示了與醫療專業人員 (n=12) 和主修醫學和健康的學生 (n=4) 進行半結構化訪談的結果，以研究如何改善 AI 和 XAI 的入門。對於訪談，我們建立在人機互動準則之上，為中風康復評估和 AI 解釋的 AI 系統創建入門材料，並將它們介紹給參與者。我們的研究結果表明，除了呈現傳統的 AI 性能指標外，參與者還希望基准信息、AI 的實際好處以及交互試驗，以更好地將 AI 性能情境化，並完善 AI 的目標和性能。根據這些發現，我們強調了改進 AI 和 XAI 以及人機協作決策制定的入門方向。
+摘要：社群媒體平台上假新聞散播的普遍性對一般大眾的信任、社會穩定性與民主制度構成重大風險。這項挑戰需要在偵測方面採用創新的方法論，才能跟上錯誤資訊的動態和多模態特性。最近的研究包括使用多模態架構中大型語言模型的進展、使用圖形的方法論，以及在假新聞文獻中進行對抗訓練來強化偵測。根據可以帶來成功的不同方法，將重點說明一些重點：增強的 LLM 可透過更進階的語意和跨模態融合來提升準確度，以進行穩健的偵測。這篇評論進一步找出在適應動態社群媒體趨勢、即時和跨平台偵測能力方面的重大差距，以及 LLM 遭濫用的道德挑戰。未來的方向強調開發與風格無關的模型、跨語言偵測架構和穩健的政策，以減輕 LLM 驅動的錯誤資訊。因此，這種綜合分析為那些致力於強化假新聞偵測系統的研究人員和從業人員奠定了具體的基礎，而這些複雜性在數位環境中持續增長。
 
-##### **Exploring Nutritional Impact on Alzheimer's Mortality: An Explainable AI Approach**
-2405.17502v1 by Ziming Liu, Longjian Liu, Robert E. Heidel, Xiaopeng Zhao
+##### **DEUCE: Dual-diversity Enhancement and Uncertainty-awareness for Cold-start Active Learning**
+2502.00305v1 by Jiaxin Guo, C. L. Philip Chen, Shuzhen Li, Tong Zhang
 
-This article uses machine learning (ML) and explainable artificial
-intelligence (XAI) techniques to investigate the relationship between
-nutritional status and mortality rates associated with Alzheimers disease (AD).
-The Third National Health and Nutrition Examination Survey (NHANES III)
-database is employed for analysis. The random forest model is selected as the
-base model for XAI analysis, and the Shapley Additive Explanations (SHAP)
-method is used to assess feature importance. The results highlight significant
-nutritional factors such as serum vitamin B12 and glycated hemoglobin. The
-study demonstrates the effectiveness of random forests in predicting AD
-mortality compared to other diseases. This research provides insights into the
-impact of nutrition on AD and contributes to a deeper understanding of disease
-progression.
+Cold-start active learning (CSAL) selects valuable instances from an
+unlabeled dataset for manual annotation. It provides high-quality data at a low
+annotation cost for label-scarce text classification. However, existing CSAL
+methods overlook weak classes and hard representative examples, resulting in
+biased learning. To address these issues, this paper proposes a novel
+dual-diversity enhancing and uncertainty-aware (DEUCE) framework for CSAL.
+Specifically, DEUCE leverages a pretrained language model (PLM) to efficiently
+extract textual representations, class predictions, and predictive uncertainty.
+Then, it constructs a Dual-Neighbor Graph (DNG) to combine information on both
+textual diversity and class diversity, ensuring a balanced data distribution.
+It further propagates uncertainty information via density-based clustering to
+select hard representative instances. DEUCE performs well in selecting
+class-balanced and hard representative data by dual-diversity and
+informativeness. Experiments on six NLP datasets demonstrate the superiority
+and efficiency of DEUCE.
 
-摘要：本文使用機器學習 (ML) 和可解釋人工智慧 (XAI) 技術來探討營養狀況與阿茲海默症 (AD) 相關的死亡率之間的關係。採用第三次全國健康與營養檢查調查 (NHANES III) 資料庫進行分析。選擇隨機森林模型作為 XAI 分析的基礎模型，並使用 Shapley Additive Explanations (SHAP) 方法來評估特徵重要性。結果突顯了重要的營養因素，例如血清維生素 B12 和糖化血紅蛋白。該研究證明了隨機森林在預測 AD 死亡率方面相較於其他疾病的有效性。本研究提供了營養對 AD 的影響的見解，並有助於更深入地了解疾病的進展。
+摘要：冷啟動主動學習 (CSAL) 從未標記的資料集中選取有價值的實例進行手動標記。它以低標記成本提供高品質的資料，用於標籤稀少的文字分類。然而，現有的 CSAL 方法忽略了弱類別和難以代表的範例，導致有偏差的學習。為了解決這些問題，本文提出了一個新的雙重多樣性增強和不確定性感知 (DEUCE) 架構，用於 CSAL。具體來說，DEUCE 利用預訓練的語言模型 (PLM) 來有效地提取文字表徵、類別預測和預測不確定性。然後，它構建一個雙鄰居圖 (DNG) 來結合文字多樣性和類別多樣性的資訊，確保平衡的資料分佈。它進一步通過基於密度的聚類來傳播不確定性資訊，以選擇難以代表的實例。DEUCE 在通過雙重多樣性和資訊性選擇類別平衡和難以代表的資料方面表現良好。在六個 NLP 資料集上的實驗證明了 DEUCE 的優越性和效率。
 
-##### **Explainable AI Enhances Glaucoma Referrals, Yet the Human-AI Team Still Falls Short of the AI Alone**
-2407.11974v1 by Catalina Gomez, Ruolin Wang, Katharina Breininger, Corinne Casey, Chris Bradley, Mitchell Pavlak, Alex Pham, Jithin Yohannan, Mathias Unberath
+##### **Longer Attention Span: Increasing Transformer Context Length with Sparse Graph Processing Techniques**
+2502.01659v2 by Nathaniel Tomczak, Sanmukh Kuppannagari
 
-Primary care providers are vital for initial triage and referrals to
-specialty care. In glaucoma, asymptomatic and fast progression can lead to
-vision loss, necessitating timely referrals to specialists. However, primary
-eye care providers may not identify urgent cases, potentially delaying care.
-Artificial Intelligence (AI) offering explanations could enhance their referral
-decisions. We investigate how various AI explanations help providers
-distinguish between patients needing immediate or non-urgent specialist
-referrals. We built explainable AI algorithms to predict glaucoma surgery needs
-from routine eyecare data as a proxy for identifying high-risk patients. We
-incorporated intrinsic and post-hoc explainability and conducted an online
-study with optometrists to assess human-AI team performance, measuring referral
-accuracy and analyzing interactions with AI, including agreement rates, task
-time, and user experience perceptions. AI support enhanced referral accuracy
-among 87 participants (59.9%/50.8% with/without AI), though Human-AI teams
-underperformed compared to AI alone. Participants believed they included AI
-advice more when using the intrinsic model, and perceived it more useful and
-promising. Without explanations, deviations from AI recommendations increased.
-AI support did not increase workload, confidence, and trust, but reduced
-challenges. On a separate test set, our black-box and intrinsic models achieved
-an accuracy of 77% and 71%, respectively, in predicting surgical outcomes. We
-identify opportunities of human-AI teaming for glaucoma management in primary
-eye care, noting that while AI enhances referral accuracy, it also shows a
-performance gap compared to AI alone, even with explanations. Human involvement
-remains essential in medical decision making, underscoring the need for future
-research to optimize collaboration, ensuring positive experiences and safe AI
-use.
+Transformers have demonstrated great success in numerous domains including
+natural language processing and bioinformatics. This success stems from the use
+of the attention mechanism by these models in order to represent and propagate
+pairwise interactions between individual tokens of sequential data. However,
+the primary limitation of this operation is its quadratic memory and time
+complexity in relation to the input's context length - the length of a sequence
+over which the interactions need to be captured. This significantly limits the
+length of sequences that can be inferred upon by these models. Extensive
+research has been conducted to reduce the number of pairwise interactions to
+sub-quadratic in relation to the context length by introducing sparsity into
+the attention mechanism through the development of sparse attention masks.
+However, efficient implementations that achieve "true sparsity" are lacking.
+  In this work, we address this issue by proposing a graph computing view of
+attention where tokens are perceived as nodes of the graph and the attention
+mask determines the edges of the graph. Using this view, we develop graph
+processing algorithms to implement the attention mechanism. Both theoretically
+and empirically, we demonstrate that our algorithms only perform the needed
+computations, i.e., they are work optimal. We also perform extensive
+experimentation using popular attention masks to explore the impact of sparsity
+on execution time and achievable context length. Our experiments demonstrate
+significant speedups in execution times compared to state-of-the-art attention
+implementations such as FlashAttention for large sequence lengths. We also
+demonstrate that our algorithms are able to achieve extremely long sequence
+lengths of as high as 160 million on a single NVIDIA A100 GPU (SXM4 80GB).
 
-摘要：<paragraph>初級保健提供者對於最初的分流和轉診到專科照護至關重要。在青光眼的情況下，無症狀且快速惡化可能導致視力喪失，因此需要及時轉診給專家。然而，初級眼科保健提供者可能無法識別緊急情況，可能會延誤照護。提供解釋的人工智慧 (AI) 可以加強他們的轉診決策。我們研究各種 AI 解釋如何幫助提供者區分需要立即或非緊急專科轉診的患者。我們建立了解釋性 AI 演算法，以從例行眼科護理資料預測青光眼手術需求，作為識別高風險患者的代理。我們納入了內在和事後解釋性，並與驗光師進行了一項線上研究，以評估人機團隊的表現，衡量轉診準確度並分析與 AI 的互動，包括同意率、任務時間和使用者體驗感知。在 87 名參與者中，AI 支援提高了轉診準確度（使用 AI/未使用的比例為 59.9%/50.8%），儘管人機團隊的表現不如單獨使用 AI。參與者認為他們在使用內在模型時更多地納入了 AI 建議，並認為它更有用且更有希望。沒有解釋，AI 建議的偏差會增加。AI 支援並未增加工作量、信心和信任，但減少了挑戰。在一個單獨的測試集中，我們的黑盒子和內在模型在預測手術結果方面分別達到了 77% 和 71% 的準確度。我們找出在初級眼科保健中，人機團隊合作管理青光眼的機會，並注意到雖然 AI 提高了轉診準確度，但即使有解釋，它也顯示出與單獨使用 AI 相比的效能差距。人類參與在醫療決策中仍然至關重要，這強調了未來研究優化協作、確保正面經驗和安全使用 AI 的必要性。</paragraph>
+摘要：變形金剛已在許多領域展現出巨大的成功，包括自然語言處理和生物資訊學。這種成功源自於這些模型使用注意機制來表示和傳播序列資料中各個標記之間成對的互動。然而，這種運算的主要限制在於其二次記憶體和時間複雜度與輸入的內容長度有關，也就是需要擷取互動的序列長度。這會顯著限制這些模型可以推論的序列長度。已經進行了大量的研究來減少成對互動的數量，使其與內容長度成次二次關係，方法是透過開發稀疏注意遮罩來將稀疏性引入注意機制。然而，缺乏能達成「真實稀疏性」的高效實作。在這項工作中，我們透過提出注意力的圖形運算檢視來解決這個問題，其中標記被視為圖形的節點，而注意力遮罩則決定圖形中的邊緣。使用這種檢視，我們開發了圖形處理演算法來實作注意力機制。我們在理論上和經驗上都證明了我們的演算法只執行必要的運算，也就是說，它們是工作最優的。我們也使用流行的注意力遮罩進行廣泛的實驗，以探討稀疏性對執行時間和可達成的內容長度的影響。我們的實驗證明，與最先進的注意力實作（例如 FlashAttention）相比，對於大型序列長度，我們的演算法在執行時間方面有顯著的加速。我們也證明了我們的演算法能夠在單一的 NVIDIA A100 GPU (SXM4 80GB) 上達成極長的序列長度，最高可達 1.6 億。
 
-##### **Decoding Decision Reasoning: A Counterfactual-Powered Model for Knowledge Discovery**
-2406.18552v1 by Yingying Fang, Zihao Jin, Xiaodan Xing, Simon Walsh, Guang Yang
+##### **Improving vision-language alignment with graph spiking hybrid Networks**
+2501.19069v1 by Siyu Zhang, Heming Zheng, Yiming Wu, Yeming Chen
 
-In medical imaging, particularly in early disease detection and prognosis
-tasks, discerning the rationale behind an AI model's predictions is crucial for
-evaluating the reliability of its decisions. Conventional explanation methods
-face challenges in identifying discernible decisive features in medical image
-classifications, where discriminative features are subtle or not immediately
-apparent. To bridge this gap, we propose an explainable model that is equipped
-with both decision reasoning and feature identification capabilities. Our
-approach not only detects influential image patterns but also uncovers the
-decisive features that drive the model's final predictions. By implementing our
-method, we can efficiently identify and visualise class-specific features
-leveraged by the data-driven model, providing insights into the decision-making
-processes of deep learning models. We validated our model in the demanding
-realm of medical prognosis task, demonstrating its efficacy and potential in
-enhancing the reliability of AI in healthcare and in discovering new knowledge
-in diseases where prognostic understanding is limited.
+To bridge the semantic gap between vision and language (VL), it is necessary
+to develop a good alignment strategy, which includes handling semantic
+diversity, abstract representation of visual information, and generalization
+ability of models. Recent works use detector-based bounding boxes or patches
+with regular partitions to represent visual semantics. While current paradigms
+have made strides, they are still insufficient for fully capturing the nuanced
+contextual relations among various objects. This paper proposes a comprehensive
+visual semantic representation module, necessitating the utilization of
+panoptic segmentation to generate coherent fine-grained semantic features.
+Furthermore, we propose a novel Graph Spiking Hybrid Network (GSHN) that
+integrates the complementary advantages of Spiking Neural Networks (SNNs) and
+Graph Attention Networks (GATs) to encode visual semantic information.
+Intriguingly, the model not only encodes the discrete and continuous latent
+variables of instances but also adeptly captures both local and global
+contextual features, thereby significantly enhancing the richness and diversity
+of semantic representations. Leveraging the spatiotemporal properties inherent
+in SNNs, we employ contrastive learning (CL) to enhance the similarity-based
+representation of embeddings. This strategy alleviates the computational
+overhead of the model and enriches meaningful visual representations by
+constructing positive and negative sample pairs. We design an innovative
+pre-training method, Spiked Text Learning (STL), which uses text features to
+improve the encoding ability of discrete semantics. Experiments show that the
+proposed GSHN exhibits promising results on multiple VL downstream tasks.
 
-摘要：在醫學影像中，特別是在早期疾病檢測和預後任務中，辨別 AI 模型預測背後的原理對於評估其決策的可靠性至關重要。傳統的解釋方法在識別醫學影像分類中可識別的決定性特徵時面臨挑戰，其中區別性特徵很微妙或並不明顯。為了彌合這一差距，我們提出了一個可解釋的模型，該模型具備決策推理和特徵識別能力。我們的做法不僅檢測有影響力的影像模式，還揭示了推動模型最終預測的決定性特徵。通過實施我們的模型，我們可以有效識別和視覺化由數據驅動模型利用的類特定特徵，從而深入了解深度學習模型的決策過程。我們在要求嚴格的醫學預後任務領域驗證了我們的模型，展示了其在提高 AI 在醫療保健中的可靠性和發現預後理解受限疾病的新知識方面的功效和潛力。
+摘要：<paragraph>為了彌合視覺和語言 (VL) 之間的語意差距，必須制定良好的對齊策略，其中包括處理語意多樣性、視覺資訊的抽象表示以及模型的泛化能力。最近的研究使用基於偵測器的邊界框或具有規則分割的區塊來表示視覺語意。雖然目前的範例已取得進展，但對於完全捕捉各種物件之間的細微脈絡關係仍不足夠。本文提出了一個全面的視覺語意表示模組，需要利用全景分割來產生連貫的細粒度語意特徵。此外，我們提出了一個新穎的圖形脈衝混合網路 (GSHN)，它整合了脈衝神經網路 (SNN) 和圖形注意力網路 (GAT) 的互補優勢來編碼視覺語意資訊。有趣的是，該模型不僅編碼實例的離散和連續潛在變數，還能巧妙地捕捉局部和全域脈絡特徵，從而顯著增強語意表示的豐富性和多樣性。利用 SNN 中固有的時空特性，我們採用對比學習 (CL) 來增強嵌入的基於相似性的表示。此策略減輕了模型的計算負擔，並透過建構正負樣本對來豐富有意義的視覺表示。我們設計了一個創新的預訓練方法，脈衝文本學習 (STL)，它使用文本特徵來提高離散語意的編碼能力。實驗表明，所提出的 GSHN 在多個 VL 下游任務上展現出有希望的結果。</paragraph>
 
-##### **The Role of Emotions in Informational Support Question-Response Pairs in Online Health Communities: A Multimodal Deep Learning Approach**
-2405.13099v1 by Mohsen Jozani, Jason A. Williams, Ahmed Aleroud, Sarbottam Bhagat
+##### **Semantic Web and Creative AI -- A Technical Report from ISWS 2023**
+2501.18542v1 by Raia Abu Ahmad, Reham Alharbi, Roberto Barile, Martin Böckling, Francisco Bolanos, Sara Bonfitto, Oleksandra Bruns, Irene Celino, Yashrajsinh Chudasama, Martin Critelli, Claudia d'Amato, Giada D'Ippolito, Ioannis Dasoulas, Stefano De Giorgis, Vincenzo De Leo, Chiara Di Bonaventura, Marco Di Panfilo, Daniil Dobriy, John Domingue, Xuemin Duan, Michel Dumontier, Sefika Efeoglu, Ruben Eschauzier, Fakih Ginwa, Nicolas Ferranti, Arianna Graciotti, Philipp Hanisch, George Hannah, Golsa Heidari, Aidan Hogan, Hassan Hussein, Alexane Jouglar, Jan-Christoph Kalo, Manoé Kieffer, Antonis Klironomos, Inês Koch, Weronika Lajewska, Nicolas Lazzari, Mikael Lindekrans, Anna Sofia Lippolis, Majlinda Llugiqi, Eleonora Mancini, Eleonora Marzi, Laura Menotti, Daniela Milon Flores, Soulakshmee Nagowah, Kerstin Neubert, Emetis Niazmand, Ebrahim Norouzi, Beatriz Olarte Martinez, Anouk Michelle Oudshoorn, Andrea Poltronieri, Valentina Presutti, Disha Purohit, Ensiyeh Raoufi, Celian Ringwald, Johanna Rockstroh, Sebastian Rudolph, Harald Sack, Zafar Saeed, Mohammad Javad Saeedizade, Aya Sahbi, Cristian Santini, Aleksandra Simic, Dennis Sommer, Rita Sousa, Mary Ann Tan, Vidyashree Tarikere, Tabea Tietz, Liam Tirpitz, Arnaldo Tomasino, Frank van Harmelen, Joao Vissoci, Caitlin Woods, Bohui Zhang, Xinyue Zhang, Heng Zheng
 
-This study explores the relationship between informational support seeking
-questions, responses, and helpfulness ratings in online health communities. We
-created a labeled data set of question-response pairs and developed multimodal
-machine learning and deep learning models to reliably predict informational
-support questions and responses. We employed explainable AI to reveal the
-emotions embedded in informational support exchanges, demonstrating the
-importance of emotion in providing informational support. This complex
-interplay between emotional and informational support has not been previously
-researched. The study refines social support theory and lays the groundwork for
-the development of user decision aids. Further implications are discussed.
+The International Semantic Web Research School (ISWS) is a week-long
+intensive program designed to immerse participants in the field. This document
+reports a collaborative effort performed by ten teams of students, each guided
+by a senior researcher as their mentor, attending ISWS 2023. Each team provided
+a different perspective to the topic of creative AI, substantiated by a set of
+research questions as the main subject of their investigation. The 2023 edition
+of ISWS focuses on the intersection of Semantic Web technologies and Creative
+AI. ISWS 2023 explored various intersections between Semantic Web technologies
+and creative AI. A key area of focus was the potential of LLMs as support tools
+for knowledge engineering. Participants also delved into the multifaceted
+applications of LLMs, including legal aspects of creative content production,
+humans in the loop, decentralised approaches to multimodal generative AI
+models, nanopublications and AI for personal scientific knowledge graphs,
+commonsense knowledge in automatic story and narrative completion, generative
+AI for art critique, prompt engineering, automatic music composition,
+commonsense prototyping and conceptual blending, and elicitation of tacit
+knowledge. As Large Language Models and semantic technologies continue to
+evolve, new exciting prospects are emerging: a future where the boundaries
+between creative expression and factual knowledge become increasingly permeable
+and porous, leading to a world of knowledge that is both informative and
+inspiring.
+
+摘要：國際語意網路研究學校 (ISWS) 是一個為期一週的密集課程，旨在讓參與者沉浸在該領域中。本文件報告了由十個學生團隊進行的合作成果，每個團隊都由一位資深研究員作為導師，參加了 2023 年 ISWS。每個團隊都從不同的角度探討了創意 AI 主題，並以一系列研究問題作為調查的主要主題。2023 年版的 ISWS 關注於語意網路技術和創意 AI 的交集。ISWS 2023 探索了語意網路技術和創意 AI 之間的各種交集。一個重點關注領域是 LLM 作為知識工程的支援工具的潛力。參與者還深入探討了 LLM 的多方面應用，包括創意內容製作的法律方面、循環中的人類、多模態生成式 AI 模型的分散式方法、納米出版物和用於個人科學知識圖譜的 AI、自動故事和敘述完成中的常識知識、生成式 AI 用於藝術評論、提示工程、自動音樂創作、常識原型和概念混合，以及對默會知識的引導。隨著大型語言模型和語意技術的持續發展，新的令人興奮的前景正在出現：一個創意表達和事實知識之間的界限變得越來越可滲透和多孔的未來，從而導致一個既有資訊性又有啟發性的知識世界。
 
-摘要：本研究探討線上健康社群中尋求資訊支持的問題、回應，以及有幫助的評分之間的關係。我們建立了一組標記的問答配對資料集，並開發了多模態機器學習和深度學習模型，以可靠地預測資訊支持問題和回應。我們採用可解釋的 AI 來揭示資訊支持交流中蘊含的情緒，證明情緒在提供資訊支持中的重要性。這種情緒支持和資訊支持之間的複雜交互作用以前並未被研究過。本研究改進了社會支持理論，並為使用者決策輔助工具的開發奠定了基礎。討論了進一步的影響。
+##### **Leveraging LLM Agents for Automated Optimization Modeling for SASP Problems: A Graph-RAG based Approach**
+2501.18320v1 by Tianpeng Pan, Wenqiang Pu, Licheng Zhao, Rui Zhou
 
-##### **ChatGPT in Classrooms: Transforming Challenges into Opportunities in Education**
-2405.10645v1 by Harris Bin Munawar, Nikolaos Misirlis
+Automated optimization modeling (AOM) has evoked considerable interest with
+the rapid evolution of large language models (LLMs). Existing approaches
+predominantly rely on prompt engineering, utilizing meticulously designed
+expert response chains or structured guidance. However, prompt-based techniques
+have failed to perform well in the sensor array signal processing (SASP) area
+due the lack of specific domain knowledge. To address this issue, we propose an
+automated modeling approach based on retrieval-augmented generation (RAG)
+technique, which consists of two principal components: a multi-agent (MA)
+structure and a graph-based RAG (Graph-RAG) process. The MA structure is
+tailored for the architectural AOM process, with each agent being designed
+based on principles of human modeling procedure. The Graph-RAG process serves
+to match user query with specific SASP modeling knowledge, thereby enhancing
+the modeling result. Results on ten classical signal processing problems
+demonstrate that the proposed approach (termed as MAG-RAG) outperforms several
+AOM benchmarks.
 
-In the era of exponential technology growth, one unexpected guest has claimed
-a seat in classrooms worldwide, Artificial Intelligence. Generative AI, such as
-ChatGPT, promises a revolution in education, yet it arrives with a double-edged
-sword. Its potential for personalized learning is offset by issues of cheating,
-inaccuracies, and educators struggling to incorporate it effectively into their
-lesson design. We are standing on the brink of this educational frontier, and
-it is clear that we need to navigate this terrain with a lot of care. This is a
-major challenge that could undermine the integrity and value of our educational
-process. So, how can we turn these challenges into opportunities? When used
-inappropriately, AI tools can become the perfect tool for the cut copy paste
-mentality, and quickly begin to corrode critical thinking, creativity, and deep
-understanding, the most important skills in our rapidly changing world.
-Teachers feel that they are not equipped to leverage this technology, widening
-the digital divide among educators and institutions. Addressing these concerns
-calls for an in depth research approach. We will employ empirical research,
-drawing on the Technology Acceptance Model, to assess the attitudes toward
-generative AI among educators and students. Understanding their perceptions,
-usage patterns, and hurdles is the first crucial step in creating an effective
-solution. The present study will be used as a process manual for future
-researchers to apply, running their own data, based on the steps explained here
+摘要：自動化最佳化建模 (AOM) 隨著大型語言模型 (LLM) 的快速演進而引起相當大的興趣。現有方法主要依賴提示工程，利用精心設計的專家回應鏈或結構化指導。然而，基於提示的技術由於缺乏特定領域知識，無法在感測器陣列訊號處理 (SASP) 領域中表現良好。為了解決這個問題，我們提出一個基於檢索增強生成 (RAG) 技術的自動化建模方法，它包含兩個主要組成部分：多代理 (MA) 結構和基於圖形的 RAG (Graph-RAG) 程序。MA 結構是針對架構 AOM 程序量身打造，每個代理都是根據人類建模程序的原理設計的。Graph-RAG 程序用於將使用者查詢與特定的 SASP 建模知識相匹配，從而增強建模結果。在十個經典訊號處理問題上的結果表明，所提出的方法（稱為 MAG-RAG）優於多個 AOM 基準。
 
-摘要：在科技飛速發展的時代，一位意外的訪客已在全球教室中佔有一席之地，那就是人工智慧。生成式 AI，例如 ChatGPT，承諾在教育領域掀起一場革命，但它卻是一把雙面刃。它在個人化學習方面的潛力，卻因作弊、不準確以及教育工作者難以將其有效融入教學設計等問題而抵銷。我們正站在這教育前沿的邊緣，顯然我們需要非常小心地探索這片領域。這是一個重大的挑戰，可能會損害我們教育過程的完整性和價值。那麼，我們如何將這些挑戰轉化為機遇？當不適當地使用時，AI 工具可能會成為複製貼上心態的完美工具，並迅速腐蝕批判性思維、創造力和深入理解，這些都是我們快速變化的世界中最重要的技能。教師們覺得他們沒有能力利用這項技術，這擴大了教育工作者和機構之間的數位鴻溝。解決這些問題需要深入的研究方法。我們將採用實證研究，借鑑技術接受模型，來評估教育工作者和學生對生成式 AI 的態度。了解他們的看法、使用模式和障礙是創造有效解決方案的第一個關鍵步驟。本研究將作為未來研究人員應用的流程手冊，根據此處說明的步驟運行他們自己的數據
+##### **Mixed-Precision Graph Neural Quantization for Low Bit Large Language Models**
+2501.18154v1 by Wanlong Liu, Yichen Xiao, Dingyi Zeng, Hongyang Zhao, Wenyu Chen, Malu Zhang
 
-##### **Evaluating the Explainable AI Method Grad-CAM for Breath Classification on Newborn Time Series Data**
-2405.07590v1 by Camelia Oprea, Mike Grüne, Mateusz Buglowski, Lena Olivier, Thorsten Orlikowsky, Stefan Kowalewski, Mark Schoberer, André Stollenwerk
+Post-Training Quantization (PTQ) is pivotal for deploying large language
+models (LLMs) within resource-limited settings by significantly reducing
+resource demands. However, existing PTQ strategies underperform at low bit
+levels < 3 bits due to the significant difference between the quantized and
+original weights. To enhance the quantization performance at low bit widths, we
+introduce a Mixed-precision Graph Neural PTQ (MG-PTQ) approach, employing a
+graph neural network (GNN) module to capture dependencies among weights and
+adaptively assign quantization bit-widths. Through the information propagation
+of the GNN module, our method more effectively captures dependencies among
+target weights, leading to a more accurate assessment of weight importance and
+optimized allocation of quantization strategies. Extensive experiments on the
+WikiText2 and C4 datasets demonstrate that our MG-PTQ method outperforms
+previous state-of-the-art PTQ method GPTQ, setting new benchmarks for
+quantization performance under low-bit conditions.
 
-With the digitalization of health care systems, artificial intelligence
-becomes more present in medicine. Especially machine learning shows great
-potential for complex tasks such as time series classification, usually at the
-cost of transparency and comprehensibility. This leads to a lack of trust by
-humans and thus hinders its active usage. Explainable artificial intelligence
-tries to close this gap by providing insight into the decision-making process,
-the actual usefulness of its different methods is however unclear. This paper
-proposes a user study based evaluation of the explanation method Grad-CAM with
-application to a neural network for the classification of breaths in time
-series neonatal ventilation data. We present the perceived usefulness of the
-explainability method by different stakeholders, exposing the difficulty to
-achieve actual transparency and the wish for more in-depth explanations by many
-of the participants.
+摘要：訓練後量化 (PTQ) 對於在資源受限的設定中部署大型語言模型 (LLM) 至關重要，因為它能顯著降低資源需求。然而，現有的 PTQ 策略在低位元層級 < 3 位元時表現不佳，因為量化後的權重與原始權重之間有顯著的差異。為了提升低位元寬度的量化效能，我們提出混合精度圖神經網路 PTQ (MG-PTQ) 方法，採用圖神經網路 (GNN) 模組來擷取權重之間的依存關係，並動態分配量化位元寬度。透過 GNN 模組的資訊傳播，我們的方法能更有效地擷取目標權重之間的依存關係，進而更準確地評估權重重要性，並最佳化量化策略的配置。在 WikiText2 和 C4 資料集上的廣泛實驗證明，我們的 MG-PTQ 方法優於先前的最先進 PTQ 方法 GPTQ，在低位元條件下設定了量化效能的新基準。
 
-摘要：隨著醫療保健系統的數位化，人工智慧在醫學領域中變得更加普及。特別是機器學習在時間序列分類等複雜任務中展現出極大的潛力，但通常是以透明度和可理解性為代價。這導致人類缺乏信任，從而阻礙了其積極使用。可解釋的人工智慧試圖通過提供對決策過程的洞察來彌補這一差距，但其不同方法的實際效用尚不清楚。本文提出了一個基於使用者研究的評估，其中包含了 Grad-CAM 解釋方法，並將其應用於神經網路以分類時間序列新生兒呼吸數據中的呼吸。我們展示了不同利益相關者對可解釋性方法的感知效用，揭示了實現實際透明度的難度，以及許多參與者希望獲得更深入的解釋。
+##### **Self-supervised Quantized Representation for Seamlessly Integrating Knowledge Graphs with Large Language Models**
+2501.18119v1 by Qika Lin, Tianzhe Zhao, Kai He, Zhen Peng, Fangzhi Xu, Ling Huang, Jingying Ma, Mengling Feng
 
-##### **XAI4LLM. Let Machine Learning Models and LLMs Collaborate for Enhanced In-Context Learning in Healthcare**
-2405.06270v3 by Fatemeh Nazary, Yashar Deldjoo, Tommaso Di Noia, Eugenio di Sciascio
+Due to the presence of the natural gap between Knowledge Graph (KG)
+structures and the natural language, the effective integration of holistic
+structural information of KGs with Large Language Models (LLMs) has emerged as
+a significant question. To this end, we propose a two-stage framework to learn
+and apply quantized codes for each entity, aiming for the seamless integration
+of KGs with LLMs. Firstly, a self-supervised quantized representation (SSQR)
+method is proposed to compress both KG structural and semantic knowledge into
+discrete codes (\ie, tokens) that align the format of language sentences. We
+further design KG instruction-following data by viewing these learned codes as
+features to directly input to LLMs, thereby achieving seamless integration. The
+experiment results demonstrate that SSQR outperforms existing unsupervised
+quantized methods, producing more distinguishable codes. Further, the
+fine-tuned LLaMA2 and LLaMA3.1 also have superior performance on KG link
+prediction and triple classification tasks, utilizing only 16 tokens per entity
+instead of thousands in conventional prompting methods.
 
-The integration of Large Language Models (LLMs) into healthcare diagnostics
-offers a promising avenue for clinical decision-making. This study outlines the
-development of a novel method for zero-shot/few-shot in-context learning (ICL)
-by integrating medical domain knowledge using a multi-layered structured
-prompt. We also explore the efficacy of two communication styles between the
-user and LLMs: the Numerical Conversational (NC) style, which processes data
-incrementally, and the Natural Language Single-Turn (NL-ST) style, which
-employs long narrative prompts.
-  Our study systematically evaluates the diagnostic accuracy and risk factors,
-including gender bias and false negative rates, using a dataset of 920 patient
-records in various few-shot scenarios. Results indicate that traditional
-clinical machine learning (ML) models generally outperform LLMs in zero-shot
-and few-shot settings. However, the performance gap narrows significantly when
-employing few-shot examples alongside effective explainable AI (XAI) methods as
-sources of domain knowledge. Moreover, with sufficient time and an increased
-number of examples, the conversational style (NC) nearly matches the
-performance of ML models. Most notably, LLMs demonstrate comparable or superior
-cost-sensitive accuracy relative to ML models.
-  This research confirms that, with appropriate domain knowledge and tailored
-communication strategies, LLMs can significantly enhance diagnostic processes.
-The findings highlight the importance of optimizing the number of training
-examples and communication styles to improve accuracy and reduce biases in LLM
-applications.
+摘要：由於知識圖譜 (KG) 結構與自然語言之間存在自然差距，將 KG 的整體結構資訊與大型語言模型 (LLM) 有效整合已成為一個重要的問題。為此，我們提出了一個兩階段架構來學習和應用每個實體的量化碼，旨在將 KG 與 LLM 無縫整合。首先，提出了一個自監督量化表示 (SSQR) 方法，將 KG 結構和語義知識壓縮成離散碼（即，符號），以對齊語言句子的格式。我們進一步設計 KG 指令遵循資料，將這些學習到的碼視為直接輸入 LLM 的特徵，從而實現無縫整合。實驗結果表明，SSQR 優於現有的無監督量化方法，產生更具區別性的碼。此外，微調後的 LLaMA2 和 LLaMA3.1 在 KG 連結預測和三元分類任務上也具有優異的性能，每個實體僅使用 16 個符號，而不是傳統提示方法中的數千個。
 
-摘要：大型語言模型 (LLM) 與醫療診斷整合
-為臨床決策提供了一個有前景的途徑。本研究概述了一種新穎方法的開發，用於零次學習/少量學習情境學習 (ICL)，方法是使用多層結構化提示整合醫療領域知識。我們還探討了使用者與 LLM 之間兩種溝通方式的功效：數值對話 (NC) 方式，它會逐步處理資料，以及自然語言單回合 (NL-ST) 方式，它會使用長篇敘事提示。
-我們的研究系統性地評估了診斷準確性和風險因子，包括性別偏見和假陰性率，使用了一個包含 920 個患者記錄的資料集，採用各種少量學習情境。結果表明，傳統的臨床機器學習 (ML) 模型通常在零次學習和少量學習設定中表現優於 LLM。然而，當使用少量學習範例以及有效的可解釋 AI (XAI) 方法作為領域知識來源時，效能差距會顯著縮小。此外，隨著時間充足和範例數量增加，對話方式 (NC) 幾乎可以媲美 ML 模型的效能。最值得注意的是，LLM 相對於 ML 模型展現出相當或更佳的成本敏感準確度。
-本研究證實，透過適當的領域知識和量身打造的溝通策略，LLM 可以顯著增強診斷程序。這些發現突顯了最佳化訓練範例數量和溝通方式的重要性，以提高準確度並減少 LLM 應用中的偏差。
+##### **Hybrid Graphs for Table-and-Text based Question Answering using LLMs**
+2501.17767v1 by Ankush Agarwal, Ganesh S, Chaitanya Devaguptapu
 
-##### **To Trust or Not to Trust: Towards a novel approach to measure trust for XAI systems**
-2405.05766v1 by Miquel Miró-Nicolau, Gabriel Moyà-Alcover, Antoni Jaume-i-Capó, Manuel González-Hidalgo, Maria Gemma Sempere Campello, Juan Antonio Palmer Sancho
+Answering questions that require reasoning and aggregation across both
+structured (tables) and unstructured (raw text) data sources presents
+significant challenges. Current methods rely on fine-tuning and high-quality,
+human-curated data, which is difficult to obtain. Recent advances in Large
+Language Models (LLMs) have shown promising results for multi-hop question
+answering (QA) over single-source text data in a zero-shot setting, yet
+exploration into multi-source Table-Text QA remains limited. In this paper, we
+present a novel Hybrid Graph-based approach for Table-Text QA that leverages
+LLMs without fine-tuning. Our method constructs a unified Hybrid Graph from
+textual and tabular data, pruning information based on the input question to
+provide the LLM with relevant context concisely. We evaluate our approach on
+the challenging Hybrid-QA and OTT-QA datasets using state-of-the-art LLMs,
+including GPT-3.5, GPT-4, and LLaMA-3. Our method achieves the best zero-shot
+performance on both datasets, improving Exact Match scores by up to 10% on
+Hybrid-QA and 5.4% on OTT-QA. Moreover, our approach reduces token usage by up
+to 53% compared to the original context.
 
-The increasing reliance on Deep Learning models, combined with their inherent
-lack of transparency, has spurred the development of a novel field of study
-known as eXplainable AI (XAI) methods. These methods seek to enhance the trust
-of end-users in automated systems by providing insights into the rationale
-behind their decisions. This paper presents a novel approach for measuring user
-trust in XAI systems, allowing their refinement. Our proposed metric combines
-both performance metrics and trust indicators from an objective perspective. To
-validate this novel methodology, we conducted a case study in a realistic
-medical scenario: the usage of XAI system for the detection of pneumonia from
-x-ray images.
+摘要：回答需要對結構化（表格）和非結構化（原始文字）資料來源進行推理和彙總的問題會帶來重大挑戰。目前的辦法仰賴微調和高品質、人工整理的資料，而這很難取得。大型語言模型（LLM）的最新進展已針對零次學習設定的單一來源文字資料多跳問題回答（QA）展現出有希望的結果，但對多來源表格文字 QA 的探討仍然有限。在本文中，我們提出了一種新穎的基於混合圖表的表格文字 QA 方法，它利用 LLM 而無需微調。我們的辦法從文字和表格資料建構一個統一的混合圖表，根據輸入問題修剪資訊，以簡潔地為 LLM 提供相關脈絡。我們使用最先進的 LLM，包括 GPT-3.5、GPT-4 和 LLaMA-3，針對具有挑戰性的 Hybrid-QA 和 OTT-QA 資料集評估我們的辦法。我們的辦法在兩個資料集上都達到了最佳的零次學習效能，在 Hybrid-QA 上將完全比對分數提高了 10%，在 OTT-QA 上將完全比對分數提高了 5.4%。此外，與原始脈絡相比，我們的辦法將符號使用量減少了 53%。
 
-摘要：隨著對深度學習模型依賴性的增加，加上其固有的透明度不足，促使一個新的研究領域發展，稱為可解釋 AI (XAI) 方法。這些方法旨在透過深入了解決策背後的原理，來提升最終使用者對自動化系統的信賴。本文提出了一種衡量使用者對 XAI 系統信賴度的新穎方法，允許對其進行改進。我們提出的指標結合了客觀觀點下的效能指標和信賴指標。為了驗證這個新穎的方法，我們在一個真實的醫療場景中進行了一個案例研究：使用 XAI 系統從 X 光影像中偵測肺炎。
+##### **Query-Aware Learnable Graph Pooling Tokens as Prompt for Large Language Models**
+2501.17549v1 by Wooyoung Kim, Byungyoon Park, Wooju Kim
 
-##### **Region-specific Risk Quantification for Interpretable Prognosis of COVID-19**
-2405.02815v1 by Zhusi Zhong, Jie Li, Zhuoqi Ma, Scott Collins, Harrison Bai, Paul Zhang, Terrance Healey, Xinbo Gao, Michael K. Atalay, Zhicheng Jiao
+Graph-structured data plays a vital role in numerous domains, such as social
+networks, citation networks, commonsense reasoning graphs and knowledge graphs.
+While graph neural networks have been employed for graph processing, recent
+advancements have explored integrating large language models for graph-based
+tasks. In this paper, we propose a novel approach named Learnable Graph Pooling
+Token (LGPT), which addresses the limitations of the scalability issues in
+node-level projection and information loss in graph-level projection. LGPT
+enables flexible and efficient graph representation by introducing learnable
+parameters that act as tokens in large language models, balancing fine-grained
+and global graph information. Additionally, we investigate an Early Query
+Fusion technique, which fuses query context before constructing the graph
+representation, leading to more effective graph embeddings. Our method achieves
+a 4.13\% performance improvement on the GraphQA benchmark without training the
+large language model, demonstrating significant gains in handling complex
+textual-attributed graph data.
 
-The COVID-19 pandemic has strained global public health, necessitating
-accurate diagnosis and intervention to control disease spread and reduce
-mortality rates. This paper introduces an interpretable deep survival
-prediction model designed specifically for improved understanding and trust in
-COVID-19 prognosis using chest X-ray (CXR) images. By integrating a large-scale
-pretrained image encoder, Risk-specific Grad-CAM, and anatomical region
-detection techniques, our approach produces regional interpretable outcomes
-that effectively capture essential disease features while focusing on rare but
-critical abnormal regions. Our model's predictive results provide enhanced
-clarity and transparency through risk area localization, enabling clinicians to
-make informed decisions regarding COVID-19 diagnosis with better understanding
-of prognostic insights. We evaluate the proposed method on a multi-center
-survival dataset and demonstrate its effectiveness via quantitative and
-qualitative assessments, achieving superior C-indexes (0.764 and 0.727) and
-time-dependent AUCs (0.799 and 0.691). These results suggest that our
-explainable deep survival prediction model surpasses traditional survival
-analysis methods in risk prediction, improving interpretability for clinical
-decision making and enhancing AI system trustworthiness.
+摘要：圖形結構資料在許多領域中扮演著至關重要的角色，例如社交網路、引用網路、常識推理圖形和知識圖形。雖然圖形神經網路已用於圖形處理，但最近的進展已探討整合大型語言模型以進行基於圖形的任務。在本文中，我們提出了一種名為可學習圖形池化令牌 (LGPT) 的新方法，它解決了節點層級投影中的可擴充性問題和圖形層級投影中的資訊遺失限制。LGPT 透過引入可學習的參數（在大型語言模型中作為令牌運作）來啟用彈性和高效的圖形表示，平衡細粒度和整體圖形資訊。此外，我們研究了一種早期查詢融合技術，它在建構圖形表示之前融合查詢內容，進而產生更有效的圖形嵌入。我們的方法在 GraphQA 基準上達到了 4.13% 的效能提升，而無需訓練大型語言模型，證明了在處理複雜的文字屬性圖形資料方面有顯著的進展。
 
-摘要：COVID-19 疫情對全球公共衛生造成壓力，必須進行準確的診斷和干預，以控制疾病傳播並降低死亡率。本文介紹了一個可解釋的深度生存預測模型，專門設計用於透過胸部 X 光 (CXR) 影像改善對 COVID-19 預後的理解和信賴。透過整合大規模預訓練影像編碼器、風險特定 Grad-CAM 和解剖區域偵測技術，我們的做法產生區域可解釋的結果，有效捕捉必要的疾病特徵，同時專注於罕見但關鍵的異常區域。我們的模型預測結果透過風險區域定位提供增強的清晰度和透明度，讓臨床醫生能夠在更了解預後見解的情況下，就 COVID-19 診斷做出明智的決策。我們在多中心生存資料集上評估所提出的方法，並透過量化和質化評估證明其有效性，達到優異的 C 指數（0.764 和 0.727）和時間相關 AUC（0.799 和 0.691）。這些結果表明，我們可解釋的深度生存預測模型在風險預測方面超越傳統的生存分析方法，提升臨床決策的解釋性，並增強 AI 系統的信賴度。
+##### **General Scene Adaptation for Vision-and-Language Navigation**
+2501.17403v1 by Haodong Hong, Yanyuan Qiao, Sen Wang, Jiajun Liu, Qi Wu
 
-##### **Rad4XCNN: a new agnostic method for post-hoc global explanation of CNN-derived features by means of radiomics**
-2405.02334v2 by Francesco Prinzi, Carmelo Militello, Calogero Zarcaro, Tommaso Vincenzo Bartolotta, Salvatore Gaglio, Salvatore Vitabile
+Vision-and-Language Navigation (VLN) tasks mainly evaluate agents based on
+one-time execution of individual instructions across multiple environments,
+aiming to develop agents capable of functioning in any environment in a
+zero-shot manner. However, real-world navigation robots often operate in
+persistent environments with relatively consistent physical layouts, visual
+observations, and language styles from instructors. Such a gap in the task
+setting presents an opportunity to improve VLN agents by incorporating
+continuous adaptation to specific environments. To better reflect these
+real-world conditions, we introduce GSA-VLN, a novel task requiring agents to
+execute navigation instructions within a specific scene and simultaneously
+adapt to it for improved performance over time. To evaluate the proposed task,
+one has to address two challenges in existing VLN datasets: the lack of OOD
+data, and the limited number and style diversity of instructions for each
+scene. Therefore, we propose a new dataset, GSA-R2R, which significantly
+expands the diversity and quantity of environments and instructions for the R2R
+dataset to evaluate agent adaptability in both ID and OOD contexts.
+Furthermore, we design a three-stage instruction orchestration pipeline that
+leverages LLMs to refine speaker-generated instructions and apply role-playing
+techniques to rephrase instructions into different speaking styles. This is
+motivated by the observation that each individual user often has consistent
+signatures or preferences in their instructions. We conducted extensive
+experiments on GSA-R2R to thoroughly evaluate our dataset and benchmark various
+methods. Based on our findings, we propose a novel method, GR-DUET, which
+incorporates memory-based navigation graphs with an environment-specific
+training strategy, achieving state-of-the-art results on all GSA-R2R splits.
 
-In recent years, machine learning-based clinical decision support systems
-(CDSS) have played a key role in the analysis of several medical conditions.
-Despite their promising capabilities, the lack of transparency in AI models
-poses significant challenges, particularly in medical contexts where
-reliability is a mandatory aspect. However, it appears that explainability is
-inversely proportional to accuracy. For this reason, achieving transparency
-without compromising predictive accuracy remains a key challenge. This paper
-presents a novel method, namely Rad4XCNN, to enhance the predictive power of
-CNN-derived features with the inherent interpretability of radiomic features.
-Rad4XCNN diverges from conventional methods based on saliency maps, by
-associating intelligible meaning to CNN-derived features by means of Radiomics,
-offering new perspectives on explanation methods beyond visualization maps.
-Using a breast cancer classification task as a case study, we evaluated
-Rad4XCNN on ultrasound imaging datasets, including an online dataset and two
-in-house datasets for internal and external validation. Some key results are:
-i) CNN-derived features guarantee more robust accuracy when compared against
-ViT-derived and radiomic features; ii) conventional visualization map methods
-for explanation present several pitfalls; iii) Rad4XCNN does not sacrifice
-model accuracy for their explainability; iv) Rad4XCNN provides a global
-explanation enabling the physician to extract global insights and findings. Our
-method can mitigate some concerns related to the explainability-accuracy
-trade-off. This study highlighted the importance of proposing new methods for
-model explanation without affecting their accuracy.
+摘要：視覺語言導航 (VLN) 任務主要根據代理程式在多個環境中執行個別指令的一次性執行來評估代理程式，旨在開發能夠在任何環境中以零次學習的方式運作的代理程式。然而，真實世界的導航機器人通常在持續性的環境中運作，而這些環境具有相對一致的物理配置、視覺觀察和指令的語言風格。任務設定中的這種差距提供了一個機會，可以透過將連續適應特定環境納入其中來改善 VLN 代理程式。為了更好地反映這些真實世界的條件，我們推出了 GSA-VLN，這是一個新任務，要求代理程式在特定場景中執行導航指令，並同時適應該場景，以隨著時間推移而提高效能。為了評估所提出的任務，必須解決現有 VLN 資料集中的兩個挑戰：缺乏 OOD 資料，以及每個場景的指令數量和風格多樣性有限。因此，我們提出了一個新的資料集 GSA-R2R，它顯著擴展了 R2R 資料集的環境和指令的多樣性和數量，以評估代理程式在 ID 和 OOD 背景下的適應能力。此外，我們設計了一個三階段指令編排管道，該管道利用大型語言模型 (LLM) 來精煉由說話者產生的指令，並應用角色扮演技巧將指令改寫成不同的說話風格。這項技術的靈感來自於觀察到每個個別使用者通常在其指令中具有相符的簽名或偏好。我們針對 GSA-R2R 進行了大量的實驗，以徹底評估我們的資料集和基準各種方法。根據我們的研究結果，我們提出了一種新的方法 GR-DUET，它將基於記憶的導航圖表與特定於環境的訓練策略結合在一起，在所有 GSA-R2R 分割中取得了最先進的結果。
+
+##### **Comprehensive Evaluation for a Large Scale Knowledge Graph Question Answering Service**
+2501.17270v1 by Saloni Potdar, Daniel Lee, Omar Attia, Varun Embar, De Meng, Ramesh Balaji, Chloe Seivwright, Eric Choi, Mina H. Farid, Yiwen Sun, Yunyao Li
+
+Question answering systems for knowledge graph (KGQA), answer factoid
+questions based on the data in the knowledge graph. KGQA systems are complex
+because the system has to understand the relations and entities in the
+knowledge-seeking natural language queries and map them to structured queries
+against the KG to answer them. In this paper, we introduce Chronos, a
+comprehensive evaluation framework for KGQA at industry scale. It is designed
+to evaluate such a multi-component system comprehensively, focusing on (1)
+end-to-end and component-level metrics, (2) scalable to diverse datasets and
+(3) a scalable approach to measure the performance of the system prior to
+release. In this paper, we discuss the unique challenges associated with
+evaluating KGQA systems at industry scale, review the design of Chronos, and
+how it addresses these challenges. We will demonstrate how it provides a base
+for data-driven decisions and discuss the challenges of using it to measure and
+improve a real-world KGQA system.
 
-摘要：<paragraph>近年来，基于机器学习的临床决策支持系统 (CDSS) 在多种疾病的分析中扮演了关键角色。尽管它们具有广阔的前景，但 AI 模型缺乏透明度，尤其在医疗领域，可靠性是强制性方面，这带来了重大挑战。然而，解释性似乎与准确性成反比。因此，在不影响预测准确性的情况下实现透明度仍然是一个关键挑战。本文提出了一种新方法，即 Rad4XCNN，以通过放射组学的内在可解释性来增强 CNN 衍生特征的预测能力。Rad4XCNN 通过放射组学将可理解的含义与 CNN 衍生特征关联起来，从而偏离了基于显着性图的传统方法，为超越可视化图的解释方法提供了新的视角。使用乳腺癌分类任务作为案例研究，我们在超声成像数据集上评估了 Rad4XCNN，包括一个在线数据集和两个用于内部和外部验证的内部数据集。一些关键结果是：i) 与 ViT 衍生和放射组学特征相比，CNN 衍生特征保证了更稳健的准确性；ii) 用于解释的传统可视化图方法存在一些缺陷；iii) Rad4XCNN 不会为了可解释性而牺牲模型准确性；iv) Rad4XCNN 提供全局解释，使医生能够提取全局见解和发现。我们的方法可以减轻一些与可解释性-准确性权衡相关的担忧。本研究强调了提出新方法来解释模型而不影响其准确性的重要性。</paragraph>
+摘要：知識圖譜問答系統 (KGQA) 根據知識圖譜中的資料回答事實問題。KGQA 系統很複雜，因為系統必須理解知識尋求自然語言查詢中的關係和實體，並將它們對映到針對知識圖譜的結構化查詢，才能回答這些查詢。在本文中，我們介紹了 Chronos，這是一個用於產業規模 KGQA 的全面評估框架。它旨在全面評估這種多組件系統，重點關注：(1) 端對端和組件層級指標，(2) 可擴充至各種資料集，以及 (3) 可擴充的方法，用於在釋出前衡量系統的效能。在本文中，我們討論了與產業規模 KGQA 系統評估相關的獨特挑戰，檢視 Chronos 的設計，以及它如何應對這些挑戰。我們將展示它如何提供資料驅動決策的基礎，並討論使用它來衡量和改善真實世界 KGQA 系統的挑戰。
 
-##### **Attributing Responsibility in AI-Induced Incidents: A Computational Reflective Equilibrium Framework for Accountability**
-2404.16957v1 by Yunfei Ge, Quanyan Zhu
+##### **FactCG: Enhancing Fact Checkers with Graph-Based Multi-Hop Data**
+2501.17144v1 by Deren Lei, Yaxi Li, Siyao Li, Mengya Hu, Rui Xu, Ken Archer, Mingyu Wang, Emily Ching, Alex Deng
 
-The pervasive integration of Artificial Intelligence (AI) has introduced
-complex challenges in the responsibility and accountability in the event of
-incidents involving AI-enabled systems. The interconnectivity of these systems,
-ethical concerns of AI-induced incidents, coupled with uncertainties in AI
-technology and the absence of corresponding regulations, have made traditional
-responsibility attribution challenging. To this end, this work proposes a
-Computational Reflective Equilibrium (CRE) approach to establish a coherent and
-ethically acceptable responsibility attribution framework for all stakeholders.
-The computational approach provides a structured analysis that overcomes the
-limitations of conceptual approaches in dealing with dynamic and multifaceted
-scenarios, showcasing the framework's explainability, coherence, and adaptivity
-properties in the responsibility attribution process. We examine the pivotal
-role of the initial activation level associated with claims in equilibrium
-computation. Using an AI-assisted medical decision-support system as a case
-study, we illustrate how different initializations lead to diverse
-responsibility distributions. The framework offers valuable insights into
-accountability in AI-induced incidents, facilitating the development of a
-sustainable and resilient system through continuous monitoring, revision, and
-reflection.
+Prior research on training grounded factuality classification models to
+detect hallucinations in large language models (LLMs) has relied on public
+natural language inference (NLI) data and synthetic data. However, conventional
+NLI datasets are not well-suited for document-level reasoning, which is
+critical for detecting LLM hallucinations. Recent approaches to document-level
+synthetic data generation involve iteratively removing sentences from documents
+and annotating factuality using LLM-based prompts. While effective, this method
+is computationally expensive for long documents and limited by the LLM's
+capabilities. In this work, we analyze the differences between existing
+synthetic training data used in state-of-the-art models and real LLM output
+claims. Based on our findings, we propose a novel approach for synthetic data
+generation, CG2C, that leverages multi-hop reasoning on context graphs
+extracted from documents. Our fact checker model, FactCG, demonstrates improved
+performance with more connected reasoning, using the same backbone models.
+Experiments show it even outperforms GPT-4-o on the LLM-Aggrefact benchmark
+with much smaller model size.
 
-摘要：隨著人工智慧 (AI) 的普及整合，在涉及 AI 驅動系統的事故中，責任和義務歸屬產生了複雜的挑戰。這些系統的互連性、AI 引發事故的倫理問題，加上 AI 技術的不確定性和缺乏相應法規，使得傳統責任歸屬面臨挑戰。為此，本研究提出了一種計算反思均衡 (CRE) 方法，以建立一個連貫且在倫理上可接受的責任歸屬架構，適用於所有利害關係人。計算方法提供了結構化的分析，克服了概念方法在處理動態且多面向情境時的限制，展示了該架構在責任歸屬過程中具備的可解釋性、連貫性和適應性。我們探討了與均衡計算中索賠相關的初始啟動層級的關鍵作用。我們以 AI 輔助醫療決策支援系統為案例研究，說明不同的初始化如何導致不同的責任分配。該架構提供了對 AI 引發事故中問責制的寶貴見解，透過持續監控、修訂和反思，促進了永續且有韌性的系統發展。
+摘要：先前的研究訓練了基於事實的分類模型，以偵測大型語言模型 (LLM) 中的幻覺，依賴於公開的自然語言推論 (NLI) 資料和合成資料。然而，傳統的 NLI 資料集並不適合文件層級的推理，這對於偵測 LLM 的幻覺至關重要。最近的文件層級合成資料生成方法涉及從文件中反覆移除句子，並使用基於 LLM 的提示註解事實。雖然有效，但此方法對於長文件來說在運算上很昂貴，且受限於 LLM 的能力。在這項工作中，我們分析了現有合成訓練資料與最先進模型中使用的真實 LLM 輸出宣告之間的差異。根據我們的研究結果，我們提出了一個用於合成資料生成的創新方法 CG2C，它利用從文件中提取的內容圖表進行多跳推理。我們的查核模型 FactCG 使用相同的骨幹模型，展示了在更多連結的推理下改進的效能。實驗表明，它甚至在 LLM-Aggrefact 基準上優於 GPT-4-o，且模型大小小得多。
 
-##### **Explainable AI for Fair Sepsis Mortality Predictive Model**
-2404.13139v1 by Chia-Hsuan Chang, Xiaoyang Wang, Christopher C. Yang
+##### **LLM-AutoDiff: Auto-Differentiate Any LLM Workflow**
+2501.16673v2 by Li Yin, Zhangyang Wang
 
-Artificial intelligence supports healthcare professionals with predictive
-modeling, greatly transforming clinical decision-making. This study addresses
-the crucial need for fairness and explainability in AI applications within
-healthcare to ensure equitable outcomes across diverse patient demographics. By
-focusing on the predictive modeling of sepsis-related mortality, we propose a
-method that learns a performance-optimized predictive model and then employs
-the transfer learning process to produce a model with better fairness. Our
-method also introduces a novel permutation-based feature importance algorithm
-aiming at elucidating the contribution of each feature in enhancing fairness on
-predictions. Unlike existing explainability methods concentrating on explaining
-feature contribution to predictive performance, our proposed method uniquely
-bridges the gap in understanding how each feature contributes to fairness. This
-advancement is pivotal, given sepsis's significant mortality rate and its role
-in one-third of hospital deaths. Our method not only aids in identifying and
-mitigating biases within the predictive model but also fosters trust among
-healthcare stakeholders by improving the transparency and fairness of model
-predictions, thereby contributing to more equitable and trustworthy healthcare
-delivery.
+Large Language Models (LLMs) have reshaped natural language processing,
+powering applications from multi-hop retrieval and question answering to
+autonomous agent workflows. Yet, prompt engineering -- the task of crafting
+textual inputs to effectively direct LLMs -- remains difficult and
+labor-intensive, particularly for complex pipelines that combine multiple LLM
+calls with functional operations like retrieval and data formatting. We
+introduce LLM-AutoDiff: a novel framework for Automatic Prompt Engineering
+(APE) that extends textual gradient-based methods (such as Text-Grad) to
+multi-component, potentially cyclic LLM architectures. Implemented within the
+AdalFlow library, LLM-AutoDiff treats each textual input as a trainable
+parameter and uses a frozen backward engine LLM to generate feedback-akin to
+textual gradients -- that guide iterative prompt updates. Unlike prior
+single-node approaches, LLM-AutoDiff inherently accommodates functional nodes,
+preserves time-sequential behavior in repeated calls (e.g., multi-hop loops),
+and combats the "lost-in-the-middle" problem by isolating distinct sub-prompts
+(instructions, formats, or few-shot examples). It further boosts training
+efficiency by focusing on error-prone samples through selective gradient
+computation. Across diverse tasks, including single-step classification,
+multi-hop retrieval-based QA, and agent-driven pipelines, LLM-AutoDiff
+consistently outperforms existing textual gradient baselines in both accuracy
+and training cost. By unifying prompt optimization through a graph-centric
+lens, LLM-AutoDiff offers a powerful new paradigm for scaling and automating
+LLM workflows - mirroring the transformative role that automatic
+differentiation libraries have long played in neural network research.
 
-摘要：人工智慧透過預測模型協助醫療專業人員，大幅轉變了臨床決策制定。本研究探討了在醫療保健中使用人工智慧應用程式時公平性和可解釋性的關鍵需求，以確保在不同的患者人口統計資料中獲得公平的結果。透過專注於敗血症相關死亡率的預測模型，我們提出了一種方法，該方法會學習一個效能最佳化的預測模型，然後採用轉移學習過程來產生一個具有更好公平性的模型。我們的模型還引入了一種新穎的基於排列的特徵重要性演算法，旨在闡明每個特徵在增強預測公平性方面的貢獻。與現有的可解釋性方法專注於解釋特徵對預測效能的貢獻不同，我們提出的方法獨特地彌補了理解每個特徵如何有助於公平性的差距。這項進展至關重要，因為敗血症的死亡率很高，且在三分之一的醫院死亡中扮演著角色。我們的模型不僅有助於識別和減輕預測模型中的偏差，還能透過提高模型預測的透明度和公平性來培養醫療保健利益相關者之間的信任，進而有助於提供更公平且值得信賴的醫療保健服務。
+摘要：大型語言模型 (LLM) 已重塑自然語言處理，
+為從多跳檢索和問答到
+自主代理工作流程的應用提供動力。然而，提示工程 -- 編寫
+文本輸入以有效指導 LLM 的任務 -- 仍然困難且
+勞動密集，特別是對於將多個 LLM
+呼叫與檢索和數據格式化等功能操作相結合的複雜管道。我們
+介紹 LLM-AutoDiff：一個用於自動提示工程 (APE) 的新框架，它將基於文本梯度的
+方法（例如 Text-Grad）擴展到多組件、潛在循環 LLM 架構中。在
+AdalFlow 庫中實施，LLM-AutoDiff 將每個文本輸入視為一個可訓練
+參數，並使用凍結的後向引擎 LLM 生成反饋——類似於
+文本梯度——指導迭代提示更新。與先前的
+單節點方法不同，LLM-AutoDiff 本質上適應功能節點，
+在重複呼叫（例如，多跳循環）中保留時間順序行為，
+並通過隔離不同的子提示（說明、格式或少數鏡頭示例）來解決“迷失在中間”問題。它進一步提高訓練
+效率，通過選擇性梯度
+計算專注於容易出錯的樣本。在包括單步分類、
+多跳基於檢索的問答和代理驅動管道在內的各種任務中，LLM-AutoDiff
+在準確性和訓練成本方面始終優於現有的文本梯度基準。通過圖形中心化
+視角統一提示優化，LLM-AutoDiff 為擴展和自動化
+LLM 工作流程提供了一個強大的新範例——反映了自動
+微分庫在神經網絡研究中長期扮演的變革性角色。
 
-##### **Multi Class Depression Detection Through Tweets using Artificial Intelligence**
-2404.13104v1 by Muhammad Osama Nusrat, Waseem Shahzad, Saad Ahmed Jamal
+##### **360Brew: A Decoder-only Foundation Model for Personalized Ranking and Recommendation**
+2501.16450v3 by Hamed Firooz, Maziar Sanjabi, Adrian Englhardt, Aman Gupta, Ben Levine, Dre Olgiati, Gungor Polatkan, Iuliia Melnychuk, Karthik Ramgopal, Kirill Talanine, Kutta Srinivasan, Luke Simon, Natesh Sivasubramoniapillai, Necip Fazil Ayan, Qingquan Song, Samira Sriram, Souvik Ghosh, Tao Song, Tejas Dharamsi, Vignesh Kothapalli, Xiaoling Zhai, Ya Xu, Yu Wang, Yun Dai
 
-Depression is a significant issue nowadays. As per the World Health
-Organization (WHO), in 2023, over 280 million individuals are grappling with
-depression. This is a huge number; if not taken seriously, these numbers will
-increase rapidly. About 4.89 billion individuals are social media users. People
-express their feelings and emotions on platforms like Twitter, Facebook,
-Reddit, Instagram, etc. These platforms contain valuable information which can
-be used for research purposes. Considerable research has been conducted across
-various social media platforms. However, certain limitations persist in these
-endeavors. Particularly, previous studies were only focused on detecting
-depression and the intensity of depression in tweets. Also, there existed
-inaccuracies in dataset labeling. In this research work, five types of
-depression (Bipolar, major, psychotic, atypical, and postpartum) were predicted
-using tweets from the Twitter database based on lexicon labeling. Explainable
-AI was used to provide reasoning by highlighting the parts of tweets that
-represent type of depression. Bidirectional Encoder Representations from
-Transformers (BERT) was used for feature extraction and training. Machine
-learning and deep learning methodologies were used to train the model. The BERT
-model presented the most promising results, achieving an overall accuracy of
-0.96.
+Ranking and recommendation systems are the foundation for numerous online
+experiences, ranging from search results to personalized content delivery.
+These systems have evolved into complex, multilayered architectures that
+leverage vast datasets and often incorporate thousands of predictive models.
+The maintenance and enhancement of these models is a labor intensive process
+that requires extensive feature engineering. This approach not only exacerbates
+technical debt but also hampers innovation in extending these systems to
+emerging problem domains. In this report, we present our research to address
+these challenges by utilizing a large foundation model with a textual interface
+for ranking and recommendation tasks. We illustrate several key advantages of
+our approach: (1) a single model can manage multiple predictive tasks involved
+in ranking and recommendation, (2) decoder models with textual interface due to
+their comprehension of reasoning capabilities, can generalize to new
+recommendation surfaces and out-of-domain problems, and (3) by employing
+natural language interfaces for task definitions and verbalizing member
+behaviors and their social connections, we eliminate the need for feature
+engineering and the maintenance of complex directed acyclic graphs of model
+dependencies. We introduce our research pre-production model, 360Brew V1.0, a
+150B parameter, decoder-only model that has been trained and fine-tuned on
+LinkedIn's data and tasks. This model is capable of solving over 30 predictive
+tasks across various segments of the LinkedIn platform, achieving performance
+levels comparable to or exceeding those of current production systems based on
+offline metrics, without task-specific fine-tuning. Notably, each of these
+tasks is conventionally addressed by dedicated models that have been developed
+and maintained over multiple years by teams of a similar or larger size than
+our own.
 
-摘要：現今，憂鬱症是一個重要的議題。根據世界衛生組織 (WHO) 的資料，在 2023 年，超過 2.8 億人正在與憂鬱症搏鬥。這是一個龐大的數字；如果不認真看待，這些數字將會快速增加。大約有 48.9 億人是社群媒體使用者。人們在 Twitter、Facebook、Reddit、Instagram 等平台上表達自己的感受和情緒。這些平台包含有價值的資訊，可用於研究目的。已經在各種社群媒體平台上進行了大量的研究。然而，這些努力仍存在某些限制。特別是，先前的研究僅專注於偵測推文中的憂鬱症和憂鬱症的強度。此外，資料集標籤中存在不準確的情況。在這項研究工作中，使用基於詞彙標籤的 Twitter 資料庫中的推文預測了五種類型的憂鬱症（雙極型、重度、精神病型、非典型和產後）。可解釋的 AI 用於透過強調代表憂鬱症類型的推文部分來提供推理。從 Transformers（BERT）中提取的雙向編碼器表示用於特徵提取和訓練。機器學習和深度學習方法用於訓練模型。BERT 模型呈現出最有希望的結果，達到 0.96 的整體準確度。
+摘要：排名和推薦系統是許多線上體驗的基礎，從搜尋結果到個人化內容傳遞。
+這些系統已演變成複雜的多層架構，利用龐大的資料集，並經常納入數千個預測模型。
+這些模型的維護和增強是一個勞力密集的過程，需要廣泛的特徵工程。
+這種方法不僅加劇了技術債務，也阻礙了將這些系統擴展到新興問題領域的創新。
+在此報告中，我們提出了我們的研究，以利用具有文字介面的大型基礎模型來解決這些挑戰，以進行排名和推薦任務。
+我們說明了我們方法的幾個主要優點：(1) 單一模型可以管理排名和推薦中涉及的多個預測任務，(2) 由於解碼器模型具有文字介面，因此它們對推理能力的理解，可以推廣到新的推薦表面和領域外問題，以及 (3) 通過採用自然語言介面進行任務定義和表達成員行為及其社交連接，我們消除了對特徵工程和維護複雜的模型相依性有向無環圖的需求。
+我們介紹了我們的研究前製作業模型 360Brew V1.0，這是一個 150B 參數，僅解碼器模型，已在 LinkedIn 的資料和任務上進行訓練和微調。
+此模型能夠解決 LinkedIn 平臺各個區塊中超過 30 個預測任務，在不針對任務進行微調的情況下，達到與基於離線指標的現行製作系統相當或超越的效能水準。
+值得注意的是，這些任務中的每個任務通常由專用模型處理，這些模型是由與我們規模相當或更大的團隊在多年間開發和維護的。
 
-##### **COIN: Counterfactual inpainting for weakly supervised semantic segmentation for medical images**
-2404.12832v2 by Dmytro Shvetsov, Joonas Ariva, Marharyta Domnich, Raul Vicente, Dmytro Fishman
+##### **Raiders of the Lost Dependency: Fixing Dependency Conflicts in Python using LLMs**
+2501.16191v1 by Antony Bartlett, Cynthia Liem, Annibale Panichella
 
-Deep learning is dramatically transforming the field of medical imaging and
-radiology, enabling the identification of pathologies in medical images,
-including computed tomography (CT) and X-ray scans. However, the performance of
-deep learning models, particularly in segmentation tasks, is often limited by
-the need for extensive annotated datasets. To address this challenge, the
-capabilities of weakly supervised semantic segmentation are explored through
-the lens of Explainable AI and the generation of counterfactual explanations.
-The scope of this research is development of a novel counterfactual inpainting
-approach (COIN) that flips the predicted classification label from abnormal to
-normal by using a generative model. For instance, if the classifier deems an
-input medical image X as abnormal, indicating the presence of a pathology, the
-generative model aims to inpaint the abnormal region, thus reversing the
-classifier's original prediction label. The approach enables us to produce
-precise segmentations for pathologies without depending on pre-existing
-segmentation masks. Crucially, image-level labels are utilized, which are
-substantially easier to acquire than creating detailed segmentation masks. The
-effectiveness of the method is demonstrated by segmenting synthetic targets and
-actual kidney tumors from CT images acquired from Tartu University Hospital in
-Estonia. The findings indicate that COIN greatly surpasses established
-attribution methods, such as RISE, ScoreCAM, and LayerCAM, as well as an
-alternative counterfactual explanation method introduced by Singla et al. This
-evidence suggests that COIN is a promising approach for semantic segmentation
-of tumors in CT images, and presents a step forward in making deep learning
-applications more accessible and effective in healthcare, where annotated data
-is scarce.
+Fixing Python dependency issues is a tedious and error-prone task for
+developers, who must manually identify and resolve environment dependencies and
+version constraints of third-party modules and Python interpreters. Researchers
+have attempted to automate this process by relying on large knowledge graphs
+and database lookup tables. However, these traditional approaches face
+limitations due to the variety of dependency error types, large sets of
+possible module versions, and conflicts among transitive dependencies. This
+study explores the potential of using large language models (LLMs) to
+automatically fix dependency issues in Python programs. We introduce PLLM
+(pronounced "plum"), a novel technique that employs retrieval-augmented
+generation (RAG) to help an LLM infer Python versions and required modules for
+a given Python file. PLLM builds a testing environment that iteratively (1)
+prompts the LLM for module combinations, (2) tests the suggested changes, and
+(3) provides feedback (error messages) to the LLM to refine the fix. This
+feedback cycle leverages natural language processing (NLP) to intelligently
+parse and interpret build error messages. We benchmark PLLM on the Gistable
+HG2.9K dataset, a collection of challenging single-file Python gists. We
+compare PLLM against two state-of-the-art automatic dependency inference
+approaches, namely PyEGo and ReadPyE, w.r.t. the ability to resolve dependency
+issues. Our results indicate that PLLM can fix more dependency issues than the
+two baselines, with +218 (+15.97%) more fixes over ReadPyE and +281 (+21.58%)
+over PyEGo. Our deeper analyses suggest that PLLM is particularly beneficial
+for projects with many dependencies and for specific third-party numerical and
+machine-learning modules. Our findings demonstrate the potential of LLM-based
+approaches to iteratively resolve Python dependency issues.
 
-摘要：深度学习正大幅轉變醫學影像和放射線學領域，能辨識醫學影像中的病理，包括電腦斷層掃描 (CT) 和 X 光掃描。然而，深度學習模型的效能，特別是在分割任務中，常常受到廣泛註解資料集需求的限制。為了應對此挑戰，透過可解釋 AI 和反事實解釋的產生，探索弱監督語意分割的能力。本研究的範圍是開發一種新的反事實內插方法 (COIN)，該方法使用生成模型將預測的分類標籤從異常翻轉為正常。例如，如果分類器將輸入的醫學影像 X 視為異常，表示存在病理，則生成模型旨在內插異常區域，從而逆轉分類器的原始預測標籤。此方法使我們能夠產生病理的精確分割，而無需依賴於預先存在的分割遮罩。至關重要的是，利用影像層級標籤，這比建立詳細的分割遮罩容易取得。該方法的有效性透過分割合成目標和從愛沙尼亞塔爾圖大學醫院取得的 CT 影像中的實際腎臟腫瘤來證明。研究結果表明，COIN 遠遠超過已建立的歸因方法，例如 RISE、ScoreCAM 和 LayerCAM，以及 Singla 等人提出的另一種反事實解釋方法。此證據表明，COIN 是一種很有前途的 CT 影像中腫瘤語意分割方法，並在醫療保健中讓深度學習應用更易於取得和更有效率邁進一步，其中註解資料很稀少。
+摘要：<paragraph>修復 Python 依賴項問題對開發人員來說是一項繁瑣且容易出錯的任務，他們必須手動識別和解決第三方模組和 Python 解譯器的環境依賴項和版本限制。研究人員已嘗試透過依賴大型知識圖譜和資料庫查詢表來自動化此程序。然而，這些傳統方法由於依賴項錯誤類型多樣、可能的模組版本數量龐大，以及傳遞依賴項之間的衝突，而面臨限制。本研究探討使用大型語言模型 (LLM) 自動修復 Python 程式中的依賴項問題的可能性。我們介紹 PLLM（發音為「plum」），這是一種新穎的技術，採用檢索增強生成 (RAG) 來協助 LLM 推論 Python 版本和給定 Python 檔案所需的模組。PLLM 建立一個測試環境，反覆 (1) 提示 LLM 模組組合，(2) 測試建議的變更，以及 (3) 提供回饋（錯誤訊息）給 LLM 以改善修正。此回饋循環利用自然語言處理 (NLP) 來智慧解析和詮釋建置錯誤訊息。我們在 Gistable HG2.9K 資料集上對 PLLM 進行基準測試，該資料集是一個具有挑戰性的單一檔案 Python gist 集合。我們將 PLLM 與兩種最先進的自動依賴項推論方法進行比較，即 PyEGo 和 ReadPyE，以比較解決依賴項問題的能力。我們的結果顯示，PLLM 可以修復比這兩個基準更多的依賴項問題，比 ReadPyE 多修復了 +218 (+15.97%) 個，比 PyEGo 多修復了 +281 (+21.58%) 個。我們更深入的分析表明，PLLM 對具有許多依賴項的專案以及特定第三方數值和機器學習模組特別有益。我們的研究結果證明了基於 LLM 的方法反覆解決 Python 依賴項問題的可能性。</paragraph>
 
-##### **Hybrid Intelligence for Digital Humanities**
-2406.15374v1 by Victor de Boer, Lise Stork
+##### **Harnessing Diverse Perspectives: A Multi-Agent Framework for Enhanced Error Detection in Knowledge Graphs**
+2501.15791v1 by Yu Li, Yi Huang, Guilin Qi, Junlan Feng, Nan Hu, Songlin Zhai, Haohan Xue, Yongrui Chen, Ruoyan Shen, Tongtong Wu
+
+Knowledge graphs are widely used in industrial applications, making error
+detection crucial for ensuring the reliability of downstream applications.
+Existing error detection methods often fail to effectively leverage
+fine-grained subgraph information and rely solely on fixed graph structures,
+while also lacking transparency in their decision-making processes, which
+results in suboptimal detection performance. In this paper, we propose a novel
+Multi-Agent framework for Knowledge Graph Error Detection (MAKGED) that
+utilizes multiple large language models (LLMs) in a collaborative setting. By
+concatenating fine-grained, bidirectional subgraph embeddings with LLM-based
+query embeddings during training, our framework integrates these
+representations to produce four specialized agents. These agents utilize
+subgraph information from different dimensions to engage in multi-round
+discussions, thereby improving error detection accuracy and ensuring a
+transparent decision-making process. Extensive experiments on FB15K and WN18RR
+demonstrate that MAKGED outperforms state-of-the-art methods, enhancing the
+accuracy and robustness of KG evaluation. For specific industrial scenarios,
+our framework can facilitate the training of specialized agents using
+domain-specific knowledge graphs for error detection, which highlights the
+potential industrial application value of our framework. Our code and datasets
+are available at https://github.com/kse-ElEvEn/MAKGED.
 
-In this paper, we explore the synergies between Digital Humanities (DH) as a
-discipline and Hybrid Intelligence (HI) as a research paradigm. In DH research,
-the use of digital methods and specifically that of Artificial Intelligence is
-subject to a set of requirements and constraints. We argue that these are
-well-supported by the capabilities and goals of HI. Our contribution includes
-the identification of five such DH requirements: Successful AI systems need to
-be able to 1) collaborate with the (human) scholar; 2) support data criticism;
-3) support tool criticism; 4) be aware of and cater to various perspectives and
-5) support distant and close reading. We take the CARE principles of Hybrid
-Intelligence (collaborative, adaptive, responsible and explainable) as
-theoretical framework and map these to the DH requirements. In this mapping, we
-include example research projects. We finally address how insights from DH can
-be applied to HI and discuss open challenges for the combination of the two
-disciplines.
+摘要：知識圖譜廣泛應用於工業應用中，使得錯誤偵測對於確保下游應用的可靠性至關重要。現有的錯誤偵測方法通常無法有效利用細粒度的子圖資訊，並且僅依賴於固定的圖形結構，同時在它們的決策過程中也缺乏透明度，這導致次佳的偵測效能。在本文中，我們提出了一個用於知識圖譜錯誤偵測 (MAKGED) 的新多代理架構，它在協作設定中利用了多個大型語言模型 (LLM)。透過在訓練期間將細粒度、雙向子圖嵌入與基於 LLM 的查詢嵌入串接，我們的架構整合了這些表示以產生四個專門代理。這些代理利用不同維度的子圖資訊參與多輪討論，從而提高錯誤偵測準確度並確保透明的決策過程。在 FB15K 和 WN18RR 上的廣泛實驗表明，MAKGED 優於最先進的方法，增強了 KG 評估的準確性和穩健性。對於特定產業情境，我們的架構可以利用特定領域的知識圖譜來促進專門代理的訓練以進行錯誤偵測，這突顯了我們架構的潛在產業應用價值。我們的程式碼和資料集可在 https://github.com/kse-ElEvEn/MAKGED 取得。
 
-摘要：在本文中，我們探討數位人文學科 (DH) 作為一門學科與混合智能 (HI) 作為一個研究典範之間的協同作用。在 DH 研究中，數位方法的使用，特別是人工智慧的使用，受到一系列要求和限制。我們認為這些要求和限制獲得 HI 的能力和目標的充分支持。我們的貢獻包括找出五個這樣的 DH 要求：成功的 AI 系統需要能夠 1) 與（人類）學者合作；2) 支援資料批評；3) 支援工具批評；4) 察覺並迎合各種觀點；5) 支援遠距和近距離閱讀。我們將混合智能的 CARE 原則（協作、適應、負責和可解釋）作為理論架構，並將這些原則對應到 DH 要求。在此對應中，我們納入範例研究專案。最後，我們探討如何將 DH 的見解應用於 HI，並討論結合這兩個學科的開放挑戰。
+##### **Automatic Feedback Generation for Short Answer Questions using Answer Diagnostic Graphs**
+2501.15777v1 by Momoka Furuhashi, Hiroaki Funayama, Yuya Iwase, Yuichiroh Matsubayashi, Yoriko Isobe, Toru Nagahama, Saku Sugawara, Kentaro Inui
 
-##### **Ethical Framework for Responsible Foundational Models in Medical Imaging**
-2406.11868v1 by Abhijit Das, Debesh Jha, Jasmer Sanjotra, Onkar Susladkar, Suramyaa Sarkar, Ashish Rauniyar, Nikhil Tomar, Vanshali Sharma, Ulas Bagci
+Short-reading comprehension questions help students understand text structure
+but lack effective feedback. Students struggle to identify and correct errors,
+while manual feedback creation is labor-intensive. This highlights the need for
+automated feedback linking responses to a scoring rubric for deeper
+comprehension.
+  Despite advances in Natural Language Processing (NLP), research has focused
+on automatic grading, with limited work on feedback generation. To address
+this, we propose a system that generates feedback for student responses.
+  Our contributions are twofold. First, we introduce the first system for
+feedback on short-answer reading comprehension. These answers are derived from
+the text, requiring structural understanding. We propose an "answer diagnosis
+graph," integrating the text's logical structure with feedback templates. Using
+this graph and NLP techniques, we estimate students' comprehension and generate
+targeted feedback.
+  Second, we evaluate our feedback through an experiment with Japanese high
+school students (n=39). They answered two 70-80 word questions and were divided
+into two groups with minimal academic differences. One received a model answer,
+the other system-generated feedback. Both re-answered the questions, and we
+compared score changes. A questionnaire assessed perceptions and motivation.
+  Results showed no significant score improvement between groups, but
+system-generated feedback helped students identify errors and key points in the
+text. It also significantly increased motivation. However, further refinement
+is needed to enhance text structure understanding.
 
-Foundational models (FMs) have tremendous potential to revolutionize medical
-imaging. However, their deployment in real-world clinical settings demands
-extensive ethical considerations. This paper aims to highlight the ethical
-concerns related to FMs and propose a framework to guide their responsible
-development and implementation within medicine. We meticulously examine ethical
-issues such as privacy of patient data, bias mitigation, algorithmic
-transparency, explainability and accountability. The proposed framework is
-designed to prioritize patient welfare, mitigate potential risks, and foster
-trust in AI-assisted healthcare.
+摘要：短篇閱讀理解題目有助學生理解文章結構，但缺乏有效的回饋。學生難以找出並更正錯誤，而手動建立回饋又很費力。這突顯了自動化回饋的必要性，將回應連結到評分標準，以獲得更深入的理解。
 
-摘要：基礎模型 (FM) 具有徹底改變醫學影像的巨大潛力。然而，它們在現實世界臨床環境中的部署需要廣泛的倫理考量。本文旨在強調與 FM 相關的倫理問題，並提出一個框架來指導它們在醫學中的負責任開發和實施。我們仔細審查了倫理問題，例如患者數據隱私、偏差緩解、演算法透明度、可解釋性和問責制。所提出的框架旨在優先考慮患者福利、減輕潛在風險，並培養對 AI 輔助醫療保健的信任。
+儘管自然語言處理 (NLP) 有所進展，但研究一直集中在自動評分上，而回饋生成的工作有限。為了解決這個問題，我們提出了一個系統，用於為學生的回答產生回饋。
 
-##### **Advancements in Radiomics and Artificial Intelligence for Thyroid Cancer Diagnosis**
-2404.07239v1 by Milad Yousefi, Shadi Farabi Maleki, Ali Jafarizadeh, Mahya Ahmadpour Youshanlui, Aida Jafari, Siamak Pedrammehr, Roohallah Alizadehsani, Ryszard Tadeusiewicz, Pawel Plawiak
+我們的貢獻有兩個方面。首先，我們引入了第一個針對簡答閱讀理解提供回饋的系統。這些答案來自於文本，需要結構化的理解。我們提出了一個「答案診斷圖」，將文本的邏輯結構與回饋範本整合在一起。使用這個圖表和 NLP 技術，我們估計學生的理解力並產生有針對性的回饋。
 
-Thyroid cancer is an increasing global health concern that requires advanced
-diagnostic methods. The application of AI and radiomics to thyroid cancer
-diagnosis is examined in this review. A review of multiple databases was
-conducted in compliance with PRISMA guidelines until October 2023. A
-combination of keywords led to the discovery of an English academic publication
-on thyroid cancer and related subjects. 267 papers were returned from the
-original search after 109 duplicates were removed. Relevant studies were
-selected according to predetermined criteria after 124 articles were eliminated
-based on an examination of their abstract and title. After the comprehensive
-analysis, an additional six studies were excluded. Among the 28 included
-studies, radiomics analysis, which incorporates ultrasound (US) images,
-demonstrated its effectiveness in diagnosing thyroid cancer. Various results
-were noted, some of the studies presenting new strategies that outperformed the
-status quo. The literature has emphasized various challenges faced by AI
-models, including interpretability issues, dataset constraints, and operator
-dependence. The synthesized findings of the 28 included studies mentioned the
-need for standardization efforts and prospective multicenter studies to address
-these concerns. Furthermore, approaches to overcome these obstacles were
-identified, such as advances in explainable AI technology and personalized
-medicine techniques. The review focuses on how AI and radiomics could transform
-the diagnosis and treatment of thyroid cancer. Despite challenges, future
-research on multidisciplinary cooperation, clinical applicability validation,
-and algorithm improvement holds the potential to improve patient outcomes and
-diagnostic precision in the treatment of thyroid cancer.
+其次，我們透過一項針對日本高中生的實驗（n=39）來評估我們的回饋。他們回答了兩個 70-80 字的問題，並被分成兩組，學術差異最小。一組收到範本答案，另一組收到系統產生的回饋。兩組都重新回答了問題，我們比較了分數的變化。一份問卷評估了認知和動機。
 
-摘要：甲狀腺癌是一種日益嚴重的全球健康問題，需要先進的診斷方法。本篇評論探討了人工智能與放射特徵分析在甲狀腺癌診斷中的應用。在符合 PRISMA 指南的情況下，對多個資料庫進行了回顧，直到 2023 年 10 月。通過結合關鍵字，發現了一篇關於甲狀腺癌和相關主題的英文學術出版物。在移除 109 篇重複文獻後，原始搜尋共回傳 267 篇論文。在根據預先確定的標準，淘汰了 124 篇文章的摘要和標題後，選出了相關研究。在進行全面分析後，額外排除了六項研究。在納入的 28 項研究中，結合超音波 (US) 影像的放射特徵分析，證明了其在診斷甲狀腺癌方面的有效性。研究結果不一，有些研究提出了優於現狀的新策略。文獻強調了人工智能模型面臨的各種挑戰，包括可解釋性問題、資料集限制和操作員依賴性。28 項納入研究的綜合發現提到，需要標準化工作和前瞻性多中心研究來解決這些問題。此外，還確定了克服這些障礙的方法，例如可解釋人工智能技術和個人化醫療技術的進步。本篇評論重點探討了人工智能和放射特徵分析如何轉變甲狀腺癌的診斷和治療。儘管存在挑戰，但未來對多學科合作、臨床適用性驗證和演算法改進的研究，仍有潛力改善甲狀腺癌治療中的患者預後和診斷精準度。
+結果顯示兩組之間沒有顯著的分數進步，但系統產生的回饋有助於學生找出文本中的錯誤和重點。它也顯著地提高了動機。然而，需要進一步的改進來增強對文本結構的理解。
 
-##### **Predictive Modeling for Breast Cancer Classification in the Context of Bangladeshi Patients: A Supervised Machine Learning Approach with Explainable AI**
-2404.04686v1 by Taminul Islam, Md. Alif Sheakh, Mst. Sazia Tahosin, Most. Hasna Hena, Shopnil Akash, Yousef A. Bin Jardan, Gezahign Fentahun Wondmie, Hiba-Allah Nafidi, Mohammed Bourhia
+##### **Transformer-Based Multimodal Knowledge Graph Completion with Link-Aware Contexts**
+2501.15688v1 by Haodi Ma, Dzmitry Kasinets, Daisy Zhe Wang
 
-Breast cancer has rapidly increased in prevalence in recent years, making it
-one of the leading causes of mortality worldwide. Among all cancers, it is by
-far the most common. Diagnosing this illness manually requires significant time
-and expertise. Since detecting breast cancer is a time-consuming process,
-preventing its further spread can be aided by creating machine-based forecasts.
-Machine learning and Explainable AI are crucial in classification as they not
-only provide accurate predictions but also offer insights into how the model
-arrives at its decisions, aiding in the understanding and trustworthiness of
-the classification results. In this study, we evaluate and compare the
-classification accuracy, precision, recall, and F-1 scores of five different
-machine learning methods using a primary dataset (500 patients from Dhaka
-Medical College Hospital). Five different supervised machine learning
-techniques, including decision tree, random forest, logistic regression, naive
-bayes, and XGBoost, have been used to achieve optimal results on our dataset.
-Additionally, this study applied SHAP analysis to the XGBoost model to
-interpret the model's predictions and understand the impact of each feature on
-the model's output. We compared the accuracy with which several algorithms
-classified the data, as well as contrasted with other literature in this field.
-After final evaluation, this study found that XGBoost achieved the best model
-accuracy, which is 97%.
+Multimodal knowledge graph completion (MMKGC) aims to predict missing links
+in multimodal knowledge graphs (MMKGs) by leveraging information from various
+modalities alongside structural data. Existing MMKGC approaches primarily
+extend traditional knowledge graph embedding (KGE) models, which often require
+creating an embedding for every entity. This results in large model sizes and
+inefficiencies in integrating multimodal information, particularly for
+real-world graphs. Meanwhile, Transformer-based models have demonstrated
+competitive performance in knowledge graph completion (KGC). However, their
+focus on single-modal knowledge limits their capacity to utilize cross-modal
+information. Recently, Large vision-language models (VLMs) have shown potential
+in cross-modal tasks but are constrained by the high cost of training. In this
+work, we propose a novel approach that integrates Transformer-based KGE models
+with cross-modal context generated by pre-trained VLMs, thereby extending their
+applicability to MMKGC. Specifically, we employ a pre-trained VLM to transform
+relevant visual information from entities and their neighbors into textual
+sequences. We then frame KGC as a sequence-to-sequence task, fine-tuning the
+model with the generated cross-modal context. This simple yet effective method
+significantly reduces model size compared to traditional KGE approaches while
+achieving competitive performance across multiple large-scale datasets with
+minimal hyperparameter tuning.
 
-摘要：<paragraph>近年來，乳癌的盛行率迅速增加，使其成為全球主要的死亡原因之一。在所有癌症中，乳癌迄今為止是最常見的。手動診斷此疾病需要大量的時間和專業知識。由於乳癌的檢測過程耗時，因此透過建立機器學習模型來預測，有助於防止其進一步擴散。機器學習和可解釋 AI 在分類中至關重要，因為它們不僅可以提供準確的預測，還可以深入了解模型如何做出決策，有助於理解和信賴分類結果。在此研究中，我們評估並比較了五種不同的機器學習方法的分類準確度、精確度、召回率和 F1 分數，使用了一個主要的資料集（達卡醫學院醫院的 500 名患者）。五種不同的監督式機器學習技術，包括決策樹、隨機森林、邏輯迴歸、朴素貝氏和 XGBoost，已用於在我們的資料集上取得最佳結果。此外，本研究將 SHAP 分析應用於 XGBoost 模型，以解釋模型的預測並了解每個特徵對模型輸出的影響。我們比較了幾種演算法對資料進行分類的準確度，並與該領域的其他文獻進行對比。在最後評估後，本研究發現 XGBoost 達到了最佳的模型準確度，為 97%。</paragraph>
+摘要：多模態知識圖譜補全 (MMKGC) 旨在透過利用來自各種模態與結構化資料的資訊，來預測多模態知識圖譜 (MMKG) 中的缺失連結。現有的 MMKGC 方法主要擴充傳統的知識圖譜嵌入 (KGE) 模型，這些模型通常需要為每個實體建立一個嵌入。這會導致模型尺寸過大，且在整合多模態資訊時效率低下，特別是對於真實世界的圖譜。與此同時，基於 Transformer 的模型已在知識圖譜補全 (KGC) 中展現出競爭力。然而，它們著重於單模態知識，限制了它們利用跨模態資訊的能力。最近，大型視覺語言模型 (VLM) 已在跨模態任務中展現潛力，但受限於訓練成本過高。在這項工作中，我們提出了一種創新的方法，它將基於 Transformer 的 KGE 模型與預先訓練的 VLM 所產生的跨模態內容整合在一起，從而擴展它們在 MMKGC 中的適用性。具體來說，我們採用預先訓練的 VLM，將實體及其鄰居相關的視覺資訊轉換成文字序列。然後，我們將 KGC 架構成一個序列到序列的任務，並使用產生的跨模態內容微調模型。這種簡單但有效的方法，與傳統的 KGE 方法相比，大幅減少了模型尺寸，同時在多個大型資料集上達到了競爭力的效能，且只需最少的超參數調整。
 
-##### **Enhancing Breast Cancer Diagnosis in Mammography: Evaluation and Integration of Convolutional Neural Networks and Explainable AI**
-2404.03892v3 by Maryam Ahmed, Tooba Bibi, Rizwan Ahmed Khan, Sidra Nasir
+##### **How to Mitigate Information Loss in Knowledge Graphs for GraphRAG: Leveraging Triple Context Restoration and Query-Driven Feedback**
+2501.15378v1 by Manzong Huang, Chenyang Bu, Yi He, Xindong Wu
 
-The Deep learning (DL) models for diagnosing breast cancer from mammographic
-images often operate as "black boxes", making it difficult for healthcare
-professionals to trust and understand their decision-making processes. The
-study presents an integrated framework combining Convolutional Neural Networks
-(CNNs) and Explainable Artificial Intelligence (XAI) for the enhanced diagnosis
-of breast cancer using the CBIS-DDSM dataset. The methodology encompasses an
-elaborate data preprocessing pipeline and advanced data augmentation techniques
-to counteract dataset limitations and transfer learning using pre-trained
-networks such as VGG-16, Inception-V3 and ResNet was employed. A focal point of
-our study is the evaluation of XAI's effectiveness in interpreting model
-predictions, highlighted by utilizing the Hausdorff measure to assess the
-alignment between AI-generated explanations and expert annotations
-quantitatively. This approach is critical for XAI in promoting trustworthiness
-and ethical fairness in AI-assisted diagnostics. The findings from our research
-illustrate the effective collaboration between CNNs and XAI in advancing
-diagnostic methods for breast cancer, thereby facilitating a more seamless
-integration of advanced AI technologies within clinical settings. By enhancing
-the interpretability of AI driven decisions, this work lays the groundwork for
-improved collaboration between AI systems and medical practitioners, ultimately
-enriching patient care. Furthermore, the implications of our research extended
-well beyond the current methodologies. It encourages further research into how
-to combine multimodal data and improve AI explanations to meet the needs of
-clinical practice.
+Knowledge Graph (KG)-augmented Large Language Models (LLMs) have recently
+propelled significant advances in complex reasoning tasks, thanks to their
+broad domain knowledge and contextual awareness. Unfortunately, current methods
+often assume KGs to be complete, which is impractical given the inherent
+limitations of KG construction and the potential loss of contextual cues when
+converting unstructured text into entity-relation triples. In response, this
+paper proposes the Triple Context Restoration and Query-driven Feedback
+(TCR-QF) framework, which reconstructs the textual context underlying each
+triple to mitigate information loss, while dynamically refining the KG
+structure by iteratively incorporating query-relevant missing knowledge.
+Experiments on five benchmark question-answering datasets substantiate the
+effectiveness of TCR-QF in KG and LLM integration, where itachieves a 29.1%
+improvement in Exact Match and a 15.5% improvement in F1 over its
+state-of-the-art GraphRAG competitors.
 
-摘要：深度學習 (DL) 用於從乳房攝影術影像診斷乳癌的模型通常以「黑盒子」方式運作，這使得醫療保健專業人員難以信任和理解其決策過程。本研究提出一個整合架構，結合卷積神經網路 (CNN) 和可解釋人工智慧 (XAI)，以使用 CBIS-DDSM 資料集增強乳癌的診斷。方法包含一個精細的資料前處理管線和進階資料擴充技術，以對抗資料集限制，並採用預先訓練的網路（例如 VGG-16、Inception-V3 和 ResNet）進行遷移學習。我們研究的重點是評估 XAI 在解釋模型預測中的有效性，重點利用豪斯多夫測度量化評估 AI 生成的解釋和專家註解之間的一致性。這種方法對於 XAI 在促進 AI 輔助診斷中的可信度和倫理公平性至關重要。我們研究的發現說明了 CNN 和 XAI 在推進乳癌診斷方法中的有效協作，從而促進了先進 AI 技術在臨床環境中的更順暢整合。透過增強 AI 驅動決策的可解釋性，這項工作為 AI 系統和醫療從業人員之間的改善協作奠定了基礎，最終豐富了患者照護。此外，我們研究的影響遠遠超出了目前的技術。它鼓勵進一步研究如何結合多模式資料並改善 AI 解釋，以滿足臨床實務的需求。
+摘要：知識圖譜 (KG) 增強大型語言模型 (LLM) 最近推動複雜推理任務的重大進展，這要歸功於它們廣泛的領域知識和語境感知。不幸的是，目前的模型通常假設 KG 是完整的，這在考慮到 KG 建構的固有限制和在將非結構化文字轉換為實體關係三元組時潛在的語境線索損失時是不切實際的。為了解決這個問題，本文提出了三元組語境還原和查詢驅動回饋 (TCR-QF) 架構，它重建每個三元組底層的文字語境以減輕資訊損失，同時透過反覆納入與查詢相關的遺失知識來動態優化 KG 結構。在五個基準問題回答資料集上的實驗證實了 TCR-QF 在 KG 和 LLM 整合方面的有效性，它在 Exact Match 中獲得 29.1% 的改進，在 F1 中獲得 15.5% 的改進，優於最先進的 GraphRAG 競爭對手。
 
-##### **Advancing Multimodal Data Fusion in Pain Recognition: A Strategy Leveraging Statistical Correlation and Human-Centered Perspectives**
-2404.00320v2 by Xingrui Gu, Zhixuan Wang, Irisa Jin, Zekun Wu
+##### **Explaining Categorical Feature Interactions Using Graph Covariance and LLMs**
+2501.14932v1 by Cencheng Shen, Darren Edge, Jonathan Larson, Carey E. Priebe
 
-This research presents a novel multimodal data fusion methodology for pain
-behavior recognition, integrating statistical correlation analysis with
-human-centered insights. Our approach introduces two key innovations: 1)
-integrating data-driven statistical relevance weights into the fusion strategy
-to effectively utilize complementary information from heterogeneous modalities,
-and 2) incorporating human-centric movement characteristics into multimodal
-representation learning for detailed modeling of pain behaviors. Validated
-across various deep learning architectures, our method demonstrates superior
-performance and broad applicability. We propose a customizable framework that
-aligns each modality with a suitable classifier based on statistical
-significance, advancing personalized and effective multimodal fusion.
-Furthermore, our methodology provides explainable analysis of multimodal data,
-contributing to interpretable and explainable AI in healthcare. By highlighting
-the importance of data diversity and modality-specific representations, we
-enhance traditional fusion techniques and set new standards for recognizing
-complex pain behaviors. Our findings have significant implications for
-promoting patient-centered healthcare interventions and supporting explainable
-clinical decision-making.
+Modern datasets often consist of numerous samples with abundant features and
+associated timestamps. Analyzing such datasets to uncover underlying events
+typically requires complex statistical methods and substantial domain
+expertise. A notable example, and the primary data focus of this paper, is the
+global synthetic dataset from the Counter Trafficking Data Collaborative (CTDC)
+-- a global hub of human trafficking data containing over 200,000 anonymized
+records spanning from 2002 to 2022, with numerous categorical features for each
+record. In this paper, we propose a fast and scalable method for analyzing and
+extracting significant categorical feature interactions, and querying large
+language models (LLMs) to generate data-driven insights that explain these
+interactions. Our approach begins with a binarization step for categorical
+features using one-hot encoding, followed by the computation of graph
+covariance at each time. This graph covariance quantifies temporal changes in
+dependence structures within categorical data and is established as a
+consistent dependence measure under the Bernoulli distribution. We use this
+measure to identify significant feature pairs, such as those with the most
+frequent trends over time or those exhibiting sudden spikes in dependence at
+specific moments. These extracted feature pairs, along with their timestamps,
+are subsequently passed to an LLM tasked with generating potential explanations
+of the underlying events driving these dependence changes. The effectiveness of
+our method is demonstrated through extensive simulations, and its application
+to the CTDC dataset reveals meaningful feature pairs and potential data stories
+underlying the observed feature interactions.
 
-摘要：本研究提出了一種創新的多模態數據融合方法，用於疼痛行為識別，將統計相關分析與以人為中心的見解相結合。我們的做法引入了兩項關鍵創新：1) 將數據驅動的統計相關權重整合到融合策略中，以有效利用來自異質模態的補充信息，以及 2) 將以人為中心的運動特徵納入多模態表示學習中，以詳細建模疼痛行為。我們的模型在各種深度學習架構中得到驗證，展示了卓越的性能和廣泛的適用性。我們提出了一個可自定義的框架，根據統計顯著性將每個模態與合適的分類器對齊，推進個性化和有效的多模態融合。此外，我們的模型提供對多模態數據的可解釋分析，有助於醫療保健中的可解釋和可解釋 AI。通過強調數據多樣性和模態特定表示的重要性，我們增強了傳統的融合技術，並為識別複雜的疼痛行為設定了新的標準。我們的發現對促進以患者為中心的醫療保健干預和支持可解釋的臨床決策制定具有重要意義。
+摘要：現代資料集通常包含許多具有豐富特徵和關聯時間戳的樣本。分析此類資料集以揭示底層事件通常需要複雜的統計方法和大量的領域專業知識。一個值得注意的範例，也是本文的主要資料重點，是來自反人口販運資料合作組織 (CTDC) 的全球合成資料集，這是全球人口販運資料的樞紐，包含超過 200,000 筆從 2002 年到 2022 年的匿名記錄，每個記錄都有許多分類特徵。在本文中，我們提出了一種快速且可擴充的方法，用於分析和提取重要的分類特徵交互作用，並查詢大型語言模型 (LLM)，以產生資料驅動的見解來解釋這些交互作用。我們的做法從使用獨熱編碼對分類特徵進行二元化步驟開始，然後在每個時間點計算圖形共變異數。此圖形共變異數量化了分類資料中依賴結構的時間變化，並在伯努利分佈下建立為一致的依賴度量。我們使用此度量來識別重要的特徵對，例如隨時間推移趨勢最頻繁的特徵對，或在特定時刻表現出依賴性突然激增的特徵對。這些提取的特徵對及其時間戳隨後傳遞給 LLM，後者負責產生對驅動這些依賴性變化的底層事件的潛在解釋。我們的方法的有效性已通過廣泛的模擬得到證明，其在 CTDC 資料集中的應用揭示了有意義的特徵對和潛在的資料故事，這些故事是觀察到的特徵交互作用的基礎。
 
-##### **Addressing Social Misattributions of Large Language Models: An HCXAI-based Approach**
-2403.17873v1 by Andrea Ferrario, Alberto Termine, Alessandro Facchini
+##### **Causal Graphs Meet Thoughts: Enhancing Complex Reasoning in Graph-Augmented LLMs**
+2501.14892v1 by Hang Luo, Jian Zhang, Chujun Li
 
-Human-centered explainable AI (HCXAI) advocates for the integration of social
-aspects into AI explanations. Central to the HCXAI discourse is the Social
-Transparency (ST) framework, which aims to make the socio-organizational
-context of AI systems accessible to their users. In this work, we suggest
-extending the ST framework to address the risks of social misattributions in
-Large Language Models (LLMs), particularly in sensitive areas like mental
-health. In fact LLMs, which are remarkably capable of simulating roles and
-personas, may lead to mismatches between designers' intentions and users'
-perceptions of social attributes, risking to promote emotional manipulation and
-dangerous behaviors, cases of epistemic injustice, and unwarranted trust. To
-address these issues, we propose enhancing the ST framework with a fifth
-'W-question' to clarify the specific social attributions assigned to LLMs by
-its designers and users. This addition aims to bridge the gap between LLM
-capabilities and user perceptions, promoting the ethically responsible
-development and use of LLM-based technology.
+In knowledge-intensive tasks, especially in high-stakes domains like medicine
+and law, it is critical not only to retrieve relevant information but also to
+provide causal reasoning and explainability. Large language models (LLMs) have
+achieved remarkable performance in natural language understanding and
+generation tasks. However, they often suffer from limitations such as
+difficulty in incorporating new knowledge, generating hallucinations, and
+explaining their reasoning process. To address these challenges, integrating
+knowledge graphs with Graph Retrieval-Augmented Generation (Graph RAG) has
+emerged as an effective solution. Traditional Graph RAG methods often rely on
+simple graph traversal or semantic similarity, which do not capture causal
+relationships or align well with the model's internal reasoning steps. This
+paper proposes a novel pipeline that filters large knowledge graphs to
+emphasize cause-effect edges, aligns the retrieval process with the model's
+chain-of-thought (CoT), and enhances reasoning through multi-stage path
+improvements. Experiments on medical question-answering tasks show consistent
+gains, with up to a 10\% absolute improvement across multiple large language
+models (LLMs). This approach demonstrates the value of combining causal
+reasoning with stepwise retrieval, leading to more interpretable and logically
+grounded solutions for complex queries.
 
-摘要：以人为本的可解释 AI (HCXAI) 倡导将社会层面整合到 AI 解释中。HCXAI 话语的核心是社会透明度 (ST) 框架，其目标是让 AI 系统的社会组织背景对用户来说是可理解的。在这项工作中，我们建议扩展 ST 框架以解决大型语言模型 (LLM) 中社会错误归因的风险，尤其是在心理健康等敏感领域。事实上，LLM 能够出色地模拟角色和人格，这可能导致设计者的意图和用户对社会属性的认知之间出现错配，从而有风险促进情绪操纵和危险行为、认知不公正和不合理的信任。为了解决这些问题，我们建议用第五个“W 问题”来增强 ST 框架，以明确设计者和用户赋予 LLM 的具体社会属性。此补充旨在弥合 LLM 能力和用户认知之间的差距，促进基于 LLM 的技术在道德上负责任地开发和使用。
+摘要：在知識密集型任務中，特別是在醫學和法律等高風險領域，不僅檢索相關資訊至關重要，還必須提供因果推理和可解釋性。大型語言模型 (LLM) 在自然語言理解和生成任務中取得了顯著的表現。然而，它們通常會遇到一些限制，例如難以納入新知識、產生幻覺，以及解釋其推理過程。為了應對這些挑戰，將知識圖與圖形檢索增強生成 (Graph RAG) 整合在一起已成為一種有效的解決方案。傳統的 Graph RAG 方法通常依賴於簡單的圖形遍歷或語義相似性，這無法捕捉因果關係或與模型的內部推理步驟很好地對齊。本文提出了一個新穎的管道，該管道過濾大型知識圖以強調因果邊緣，將檢索過程與模型的思想鏈 (CoT) 對齊，並通過多階段路徑改進來增強推理。在醫療問題解答任務上的實驗顯示出一致的收益，在多個大型語言模型 (LLM) 中絕對改進幅度高達 10%。這種方法展示了將因果推理與逐步檢索相結合的價值，從而為複雜查詢提供更具可解釋性和邏輯依據的解決方案。
 
-##### **Clinical Domain Knowledge-Derived Template Improves Post Hoc AI Explanations in Pneumothorax Classification**
-2403.18871v1 by Han Yuan, Chuan Hong, Pengtao Jiang, Gangming Zhao, Nguyen Tuan Anh Tran, Xinxing Xu, Yet Yen Yan, Nan Liu
+##### **GraPPI: A Retrieve-Divide-Solve GraphRAG Framework for Large-scale Protein-protein Interaction Exploration**
+2501.16382v1 by Ziwen Li, Xiang 'Anthony' Chen, Youngseung Jeon
 
-Background: Pneumothorax is an acute thoracic disease caused by abnormal air
-collection between the lungs and chest wall. To address the opaqueness often
-associated with deep learning (DL) models, explainable artificial intelligence
-(XAI) methods have been introduced to outline regions related to pneumothorax
-diagnoses made by DL models. However, these explanations sometimes diverge from
-actual lesion areas, highlighting the need for further improvement. Method: We
-propose a template-guided approach to incorporate the clinical knowledge of
-pneumothorax into model explanations generated by XAI methods, thereby
-enhancing the quality of these explanations. Utilizing one lesion delineation
-created by radiologists, our approach first generates a template that
-represents potential areas of pneumothorax occurrence. This template is then
-superimposed on model explanations to filter out extraneous explanations that
-fall outside the template's boundaries. To validate its efficacy, we carried
-out a comparative analysis of three XAI methods with and without our template
-guidance when explaining two DL models in two real-world datasets. Results: The
-proposed approach consistently improved baseline XAI methods across twelve
-benchmark scenarios built on three XAI methods, two DL models, and two
-datasets. The average incremental percentages, calculated by the performance
-improvements over the baseline performance, were 97.8% in Intersection over
-Union (IoU) and 94.1% in Dice Similarity Coefficient (DSC) when comparing model
-explanations and ground-truth lesion areas. Conclusions: In the context of
-pneumothorax diagnoses, we proposed a template-guided approach for improving AI
-explanations. We anticipate that our template guidance will forge a fresh
-approach to elucidating AI models by integrating clinical domain expertise.
+Drug discovery (DD) has tremendously contributed to maintaining and improving
+public health. Hypothesizing that inhibiting protein misfolding can slow
+disease progression, researchers focus on target identification (Target ID) to
+find protein structures for drug binding. While Large Language Models (LLMs)
+and Retrieval-Augmented Generation (RAG) frameworks have accelerated drug
+discovery, integrating models into cohesive workflows remains challenging. We
+conducted a user study with drug discovery researchers to identify the
+applicability of LLMs and RAGs in Target ID. We identified two main findings:
+1) an LLM should provide multiple Protein-Protein Interactions (PPIs) based on
+an initial protein and protein candidates that have a therapeutic impact; 2)
+the model must provide the PPI and relevant explanations for better
+understanding. Based on these observations, we identified three limitations in
+previous approaches for Target ID: 1) semantic ambiguity, 2) lack of
+explainability, and 3) short retrieval units. To address these issues, we
+propose GraPPI, a large-scale knowledge graph (KG)-based retrieve-divide-solve
+agent pipeline RAG framework to support large-scale PPI signaling pathway
+exploration in understanding therapeutic impacts by decomposing the analysis of
+entire PPI pathways into sub-tasks focused on the analysis of PPI edges.
 
-摘要：<paragraph>背景：氣胸是一種因肺部與胸壁之間異常集氣所引起的急性胸腔疾病。為了解決深度學習（DL）模型經常伴隨的不透明性，可解釋人工智慧（XAI）方法已被引入，用於概述與 DL 模型做出的氣胸診斷相關的區域。然而，這些解釋有時會與實際病灶區域有所出入，突顯出進一步改進的必要性。方法：我們提出了一種模板引導式方法，將氣胸的臨床知識納入 XAI 方法產生的模型解釋中，從而提升這些解釋的品質。利用放射科醫師建立的病灶描繪，我們的做法首先產生一個模板，用於表示氣胸可能發生的區域。然後將此模板疊加在模型解釋上，以篩選出超出模板邊界的無關解釋。為了驗證其效力，我們對三種 XAI 方法進行了比較分析，在兩個真實世界資料集中解釋兩個 DL 模型時，分別採用和不採用我們的模板引導。結果：所提出的方法在建立於三種 XAI 方法、兩個 DL 模型和兩個資料集的十二種基準情境中，始終改善了基準 XAI 方法。在比較模型解釋和真實病灶區域時，透過基準效能的效能改進計算出的平均增量百分比為交集比（IoU）的 97.8% 和骰子相似性係數（DSC）的 94.1%。結論：在氣胸診斷的背景下，我們提出了一種模板引導式方法，用於改善 AI 解釋。我們預期我們的模板引導將透過整合臨床領域專業知識，為闡明 AI 模型建立一種新方法。</paragraph>
+摘要：药物发现 (DD) 极大地促进了公共卫生的维护和改善。研究人员假设抑制蛋白质错误折叠可以减缓疾病进展，因此专注于靶点识别 (Target ID) 以找到用于药物结合的蛋白质结构。虽然大型语言模型 (LLM) 和检索增强生成 (RAG) 框架加速了药物发现，但将模型整合到内聚工作流中仍然具有挑战性。我们与药物发现研究人员进行了一项用户研究，以确定 LLM 和 RAG 在 Target ID 中的适用性。我们确定了两个主要发现：1) LLM 应该基于初始蛋白质和具有治疗作用的蛋白质候选物提供多个蛋白质-蛋白质相互作用 (PPI)；2) 该模型必须提供 PPI 和相关解释以更好地理解。基于这些观察，我们发现了先前 Target ID 方法中的三个局限性：1) 语义歧义，2) 缺乏可解释性，3) 检索单元短。为了解决这些问题，我们提出了 GraPPI，这是一种基于大规模知识图 (KG) 的检索-分解-求解代理管道 RAG 框架，以支持大规模 PPI 信号通路探索，通过将整个 PPI 通路的分析分解为专注于 PPI 边缘分析的子任务来理解治疗影响。
 
-##### **Enhancing Neural Machine Translation of Low-Resource Languages: Corpus Development, Human Evaluation and Explainable AI Architectures**
-2403.01580v1 by Séamus Lankford
+##### **Evaluating and Improving Graph to Text Generation with Large Language Models**
+2501.14497v1 by Jie He, Yijun Yang, Wanqiu Long, Deyi Xiong, Victor Gutierrez Basulto, Jeff Z. Pan
 
-In the current machine translation (MT) landscape, the Transformer
-architecture stands out as the gold standard, especially for high-resource
-language pairs. This research delves into its efficacy for low-resource
-language pairs including both the English$\leftrightarrow$Irish and
-English$\leftrightarrow$Marathi language pairs. Notably, the study identifies
-the optimal hyperparameters and subword model type to significantly improve the
-translation quality of Transformer models for low-resource language pairs.
-  The scarcity of parallel datasets for low-resource languages can hinder MT
-development. To address this, gaHealth was developed, the first bilingual
-corpus of health data for the Irish language. Focusing on the health domain,
-models developed using this in-domain dataset exhibited very significant
-improvements in BLEU score when compared with models from the LoResMT2021
-Shared Task. A subsequent human evaluation using the multidimensional quality
-metrics error taxonomy showcased the superior performance of the Transformer
-system in reducing both accuracy and fluency errors compared to an RNN-based
-counterpart.
-  Furthermore, this thesis introduces adaptNMT and adaptMLLM, two open-source
-applications streamlined for the development, fine-tuning, and deployment of
-neural machine translation models. These tools considerably simplify the setup
-and evaluation process, making MT more accessible to both developers and
-translators. Notably, adaptNMT, grounded in the OpenNMT ecosystem, promotes
-eco-friendly natural language processing research by highlighting the
-environmental footprint of model development. Fine-tuning of MLLMs by adaptMLLM
-demonstrated advancements in translation performance for two low-resource
-language pairs: English$\leftrightarrow$Irish and
-English$\leftrightarrow$Marathi, compared to baselines from the LoResMT2021
-Shared Task.
+Large language models (LLMs) have demonstrated immense potential across
+various tasks. However, research for exploring and improving the capabilities
+of LLMs in interpreting graph structures remains limited. To address this gap,
+we conduct a comprehensive evaluation of prompting current open-source LLMs on
+graph-to-text generation tasks. Although we explored the optimal prompting
+strategies and proposed a novel and effective diversity-difficulty-based
+few-shot sample selection method, we found that the improvements from
+tuning-free approaches were incremental, as LLMs struggle with planning on
+complex graphs, particularly those with a larger number of triplets. To further
+improve LLMs in planning with graph sequences and grounding in truth, we
+introduce a new graph-to-text dataset, PlanGTG, annotated with two sub-tasks:
+reordering and attribution. Through extensive automatic and human evaluations,
+we demonstrate significant improvements in the quality of generated text from
+both few-shot learning and fine-tuning perspectives using the PlanGTG dataset.
+Our study paves the way for new research directions in graph-to-text
+generation. PlanGTG datasets can be found in https://github.com/probe2/kg_text.
 
-摘要：<paragraph>在當前機器翻譯 (MT) 領域中，Transformer 架構脫穎而出，成為黃金標準，特別是對於高資源語言對。本研究探討其對低資源語言對的效能，包括英語↔愛爾蘭語和英語↔馬拉地語語言對。值得注意的是，本研究識別出最佳超參數和子詞模型類型，以顯著提高 Transformer 模型對低資源語言對的翻譯品質。
-低資源語言的平行資料集的稀缺會阻礙 MT 的發展。為了解決這個問題，開發了 gaHealth，這是愛爾蘭語的第一個雙語健康資料語料庫。專注於健康領域，使用此域內資料集開發的模型在 BLEU 得分方面表現出非常顯著的進步，與 LoResMT2021 共享任務中的模型相比。隨後使用多維品質指標錯誤分類法進行的人工評估顯示，與基於 RNN 的對應模型相比，Transformer 系統在減少準確性和流暢性錯誤方面表現出優異的性能。
-此外，本論文介紹了 adaptNMT 和 adaptMLLM，這兩個開源應用程式簡化了神經機器翻譯模型的開發、微調和部署。這些工具大幅簡化了設定和評估流程，讓 MT 更容易讓開發人員和翻譯人員使用。值得注意的是，adaptNMT 以 OpenNMT 生態系統為基礎，通過強調模型開發的環境足跡來促進生態友好的自然語言處理研究。與 LoResMT2021 共享任務中的基準相比，adaptMLLM 對 MLLM 的微調證明了英語↔愛爾蘭語和英語↔馬拉地語這兩個低資源語言對的翻譯性能進步。</paragraph>
+摘要：大型語言模型（LLM）已在各種任務中展現出巨大的潛力。然而，探索和提升 LLM 在詮釋圖形結構方面的能力的研究仍然有限。為了解決這個差距，我們對提示目前開源的 LLM 執行圖形轉文字生成任務進行全面評估。儘管我們探索了最佳提示策略並提出了一種新穎且有效的基於多樣性難度的少樣本選擇方法，但我們發現無調校方法的改進是漸進的，因為 LLM 難以規劃複雜的圖形，特別是那些具有較多三元組的圖形。為了進一步提升 LLM 在圖形序列規劃和真實依據方面的能力，我們引入了一個新的圖形轉文字資料集 PlanGTG，並註解了兩個子任務：重新排序和歸因。透過廣泛的自動化和人工評估，我們證明了使用 PlanGTG 資料集從少樣本學習和微調角度產生文字的品質有顯著提升。我們的研究為圖形轉文字生成中的新研究方向鋪路。PlanGTG 資料集可以在 https://github.com/probe2/kg_text 中找到。
 
-##### **Cause and Effect: Can Large Language Models Truly Understand Causality?**
-2402.18139v3 by Swagata Ashwani, Kshiteesh Hegde, Nishith Reddy Mannuru, Mayank Jindal, Dushyant Singh Sengar, Krishna Chaitanya Rao Kathala, Dishant Banga, Vinija Jain, Aman Chadha
+##### **Fast Think-on-Graph: Wider, Deeper and Faster Reasoning of Large Language Model on Knowledge Graph**
+2501.14300v1 by Xujian Liang, Zhaoquan Gu
 
-With the rise of Large Language Models(LLMs), it has become crucial to
-understand their capabilities and limitations in deciphering and explaining the
-complex web of causal relationships that language entails. Current methods use
-either explicit or implicit causal reasoning, yet there is a strong need for a
-unified approach combining both to tackle a wide array of causal relationships
-more effectively. This research proposes a novel architecture called Context
-Aware Reasoning Enhancement with Counterfactual Analysis(CARE CA) framework to
-enhance causal reasoning and explainability. The proposed framework
-incorporates an explicit causal detection module with ConceptNet and
-counterfactual statements, as well as implicit causal detection through LLMs.
-Our framework goes one step further with a layer of counterfactual explanations
-to accentuate LLMs understanding of causality. The knowledge from ConceptNet
-enhances the performance of multiple causal reasoning tasks such as causal
-discovery, causal identification and counterfactual reasoning. The
-counterfactual sentences add explicit knowledge of the not caused by scenarios.
-By combining these powerful modules, our model aims to provide a deeper
-understanding of causal relationships, enabling enhanced interpretability.
-Evaluation of benchmark datasets shows improved performance across all metrics,
-such as accuracy, precision, recall, and F1 scores. We also introduce
-CausalNet, a new dataset accompanied by our code, to facilitate further
-research in this domain.
+Graph Retrieval Augmented Generation (GRAG) is a novel paradigm that takes
+the naive RAG system a step further by integrating graph information, such as
+knowledge graph (KGs), into large-scale language models (LLMs) to mitigate
+hallucination. However, existing GRAG still encounter limitations: 1) simple
+paradigms usually fail with the complex problems due to the narrow and shallow
+correlations capture from KGs 2) methods of strong coupling with KGs tend to be
+high computation cost and time consuming if the graph is dense. In this paper,
+we propose the Fast Think-on-Graph (FastToG), an innovative paradigm for
+enabling LLMs to think ``community by community" within KGs. To do this,
+FastToG employs community detection for deeper correlation capture and two
+stages community pruning - coarse and fine pruning for faster retrieval.
+Furthermore, we also develop two Community-to-Text methods to convert the graph
+structure of communities into textual form for better understanding by LLMs.
+Experimental results demonstrate the effectiveness of FastToG, showcasing
+higher accuracy, faster reasoning, and better explainability compared to the
+previous works.
 
-摘要：隨著大型語言模型 (LLM) 的興起，了解它們在解碼和解釋語言所蘊含的複雜因果關係網路中的能力和限制變得至關重要。目前的技術使用明確或隱含的因果推理，但強烈需要一種統一的方法，結合兩者以更有效地處理廣泛的因果關係。本研究提出了一種稱為情境感知推理增強與反事實分析 (CARE CA) 框架的新架構，以增強因果推理和可解釋性。提出的框架結合了使用 ConceptNet 和反事實陳述的明確因果檢測模組，以及透過 LLM 進行的隱含因果檢測。我們的框架更進一步，加入一層反事實解釋，以強調 LLM 對因果關係的理解。來自 ConceptNet 的知識增強了多項因果推理任務的執行，例如因果發現、因果識別和反事實推理。反事實句加入了未由情境造成的明確知識。透過結合這些強大的模組，我們的模型旨在提供對因果關係更深入的理解，實現增強的可解釋性。基準資料集的評估顯示在所有指標（例如準確度、精確度、召回率和 F1 分數）上都有所提升。我們還引入了 CausalNet，一個新的資料集，並附上了我們的程式碼，以促進在這個領域的進一步研究。
+摘要：圖表檢索增強生成 (GRAG) 是一種新穎的範例，它透過將圖表資訊（例如知識圖表 (KG)) 整合到大型語言模型 (LLM) 中，進一步提升了樸素的 RAG 系統以減輕幻覺。然而，現有的 GRAG 仍會遇到限制：1) 簡單的範例通常會因從 KG 中擷取的關聯性狹隘且淺薄而無法解決複雜的問題 2) 如果圖表很密集，與 KG 強耦合的方法往往會導致高運算成本和耗時。在本文中，我們提出了 Fast Think-on-Graph (FastToG)，這是一種創新的範例，可讓 LLM 在 KG 中「逐個社群」進行思考。為此，FastToG 使用社群偵測來擷取更深入的關聯性，並使用兩個階段的社群修剪（粗略修剪和精細修剪）來加快檢索速度。此外，我們還開發了兩種社群到文字的方法，將社群的圖表結構轉換為文字形式，以便 LLM 更容易理解。實驗結果證明了 FastToG 的有效性，與先前的研究相比，展示出更高的準確性、更快的推理速度和更好的可解釋性。
 
-##### **Artificial Intelligence and Diabetes Mellitus: An Inside Look Through the Retina**
-2402.18600v1 by Yasin Sadeghi Bazargani, Majid Mirzaei, Navid Sobhi, Mirsaeed Abdollahi, Ali Jafarizadeh, Siamak Pedrammehr, Roohallah Alizadehsani, Ru San Tan, Sheikh Mohammed Shariful Islam, U. Rajendra Acharya
+##### **Top Ten Challenges Towards Agentic Neural Graph Databases**
+2501.14224v1 by Jiaxin Bai, Zihao Wang, Yukun Zhou, Hang Yin, Weizhi Fei, Qi Hu, Zheye Deng, Jiayang Cheng, Tianshi Zheng, Hong Ting Tsang, Yisen Gao, Zhongwei Xie, Yufei Li, Lixin Fan, Binhang Yuan, Wei Wang, Lei Chen, Xiaofang Zhou, Yangqiu Song
 
-Diabetes mellitus (DM) predisposes patients to vascular complications.
-Retinal images and vasculature reflect the body's micro- and macrovascular
-health. They can be used to diagnose DM complications, including diabetic
-retinopathy (DR), neuropathy, nephropathy, and atherosclerotic cardiovascular
-disease, as well as forecast the risk of cardiovascular events. Artificial
-intelligence (AI)-enabled systems developed for high-throughput detection of DR
-using digitized retinal images have become clinically adopted. Beyond DR
-screening, AI integration also holds immense potential to address challenges
-associated with the holistic care of the patient with DM. In this work, we aim
-to comprehensively review the literature for studies on AI applications based
-on retinal images related to DM diagnosis, prognostication, and management. We
-will describe the findings of holistic AI-assisted diabetes care, including but
-not limited to DR screening, and discuss barriers to implementing such systems,
-including issues concerning ethics, data privacy, equitable access, and
-explainability. With the ability to evaluate the patient's health status vis a
-vis DM complication as well as risk prognostication of future cardiovascular
-complications, AI-assisted retinal image analysis has the potential to become a
-central tool for modern personalized medicine in patients with DM.
+Graph databases (GDBs) like Neo4j and TigerGraph excel at handling
+interconnected data but lack advanced inference capabilities. Neural Graph
+Databases (NGDBs) address this by integrating Graph Neural Networks (GNNs) for
+predictive analysis and reasoning over incomplete or noisy data. However, NGDBs
+rely on predefined queries and lack autonomy and adaptability. This paper
+introduces Agentic Neural Graph Databases (Agentic NGDBs), which extend NGDBs
+with three core functionalities: autonomous query construction, neural query
+execution, and continuous learning. We identify ten key challenges in realizing
+Agentic NGDBs: semantic unit representation, abductive reasoning, scalable
+query execution, and integration with foundation models like large language
+models (LLMs). By addressing these challenges, Agentic NGDBs can enable
+intelligent, self-improving systems for modern data-driven applications, paving
+the way for adaptable and autonomous data management solutions.
 
-摘要：糖尿病（DM）使患者容易出現血管併發症。
-視網膜影像和血管反映身體的微血管和巨血管健康狀況。它們可用於診斷糖尿病併發症，包括糖尿病視網膜病變（DR）、神經病變、腎病和動脈粥樣硬化性心血管疾病，以及預測心血管事件的風險。為使用數位化視網膜影像進行高通量 DR 檢測而開發的人工智慧（AI）啟用系統已在臨床採用。除了 DR 篩檢外，AI 整合也具有巨大的潛力來應對與糖尿病患者整體照護相關的挑戰。在這項工作中，我們旨在全面回顧基於視網膜影像的 AI 應用相關研究的文獻，這些研究與糖尿病的診斷、預後和管理有關。我們將描述整體 AI 輔助糖尿病照護的發現，包括但不限於 DR 篩檢，並討論實施此類系統的障礙，包括與倫理、資料隱私、公平存取和可解釋性有關的問題。透過評估患者的健康狀況，同時考量糖尿病併發症以及未來心血管併發症的風險預後，AI 輔助視網膜影像分析有潛力成為糖尿病患者現代化個人化醫療的中心工具。
+摘要：圖形資料庫（GDB），例如 Neo4j 和 TigerGraph，擅長處理相互連接的資料，但缺乏進階的推論能力。神經圖形資料庫（NGDB）透過整合圖形神經網路（GNN）來解決這個問題，以進行預測分析和對不完整或有雜訊的資料進行推理。然而，NGDB 依賴於預先定義的查詢，並且缺乏自主性和適應性。本文介紹了代理神經圖形資料庫（Agentic NGDB），它以三項核心功能擴充了 NGDB：自動查詢建構、神經查詢執行和持續學習。我們找出實現 Agentic NGDB 的十大關鍵挑戰：語義單元表示、演繹推理、可擴充查詢執行，以及與基礎模型（例如大型語言模型 (LLM)）整合。透過解決這些挑戰，Agentic NGDB 可以為現代資料驅動應用打造智慧且自我改善的系統，為適應性和自主資料管理解決方案鋪路。
 
-##### **Multi-stakeholder Perspective on Responsible Artificial Intelligence and Acceptability in Education**
-2402.15027v2 by A. J. Karran, P. Charland, J-T. Martineau, A. Ortiz de Guinea Lopez de Arana, AM. Lesage, S. Senecal, P-M. Leger
+##### **GraphRAG under Fire**
+2501.14050v1 by Jiacheng Liang, Yuhui Wang, Changjiang Li, Rongyi Zhu, Tanqiu Jiang, Neil Gong, Ting Wang
 
-This study investigates the acceptability of different artificial
-intelligence (AI) applications in education from a multi-stakeholder
-perspective, including students, teachers, and parents. Acknowledging the
-transformative potential of AI in education, it addresses concerns related to
-data privacy, AI agency, transparency, explainability and the ethical
-deployment of AI. Through a vignette methodology, participants were presented
-with four scenarios where AI's agency, transparency, explainability, and
-privacy were manipulated. After each scenario, participants completed a survey
-that captured their perceptions of AI's global utility, individual usefulness,
-justice, confidence, risk, and intention to use each scenario's AI if
-available. The data collection comprising a final sample of 1198
-multi-stakeholder participants was distributed through a partner institution
-and social media campaigns and focused on individual responses to four AI use
-cases. A mediation analysis of the data indicated that acceptance and trust in
-AI varies significantly across stakeholder groups. We found that the key
-mediators between high and low levels of AI's agency, transparency, and
-explainability, as well as the intention to use the different educational AI,
-included perceived global utility, justice, and confidence. The study
-highlights that the acceptance of AI in education is a nuanced and multifaceted
-issue that requires careful consideration of specific AI applications and their
-characteristics, in addition to the diverse stakeholders' perceptions.
+GraphRAG advances retrieval-augmented generation (RAG) by structuring
+external knowledge as multi-scale knowledge graphs, enabling language models to
+integrate both broad context and granular details in their reasoning. While
+GraphRAG has demonstrated success across domains, its security implications
+remain largely unexplored. To bridge this gap, this work examines GraphRAG's
+vulnerability to poisoning attacks, uncovering an intriguing security paradox:
+compared to conventional RAG, GraphRAG's graph-based indexing and retrieval
+enhance resilience against simple poisoning attacks; meanwhile, the same
+features also create new attack surfaces. We present GRAGPoison, a novel attack
+that exploits shared relations in the knowledge graph to craft poisoning text
+capable of compromising multiple queries simultaneously. GRAGPoison employs
+three key strategies: i) relation injection to introduce false knowledge, ii)
+relation enhancement to amplify poisoning influence, and iii) narrative
+generation to embed malicious content within coherent text. Empirical
+evaluation across diverse datasets and models shows that GRAGPoison
+substantially outperforms existing attacks in terms of effectiveness (up to 98%
+success rate) and scalability (using less than 68% poisoning text). We also
+explore potential defensive measures and their limitations, identifying
+promising directions for future research.
 
-摘要：這項研究從多個利害關係人的角度探討不同的人工智慧 (AI) 應用在教育上的可接受性，包括學生、老師和家長。承認 AI 在教育上的轉型潛力，它解決了與資料隱私、AI 代理、透明度、可解釋性和 AI 的道德部署相關的疑慮。透過小插曲方法，參與者被呈現了四種情境，其中 AI 的代理、透明度、可解釋性和隱私受到操縱。在每個情境後，參與者完成了一項調查，該調查捕捉了他們對 AI 的整體效用、個人效用、正義、信心、風險和如果可用，使用每個情境的 AI 的意圖的看法。資料蒐集包含來自合作機構和社群媒體活動的 1198 位多利害關係人參與者的最終樣本，並專注於對四個 AI 使用案例的個別回應。對資料的調解分析表明，對 AI 的接受度和信任在利害關係人團體之間有顯著差異。我們發現，AI 的代理、透明度和可解釋性高低程度之間的關鍵調解者，以及使用不同教育 AI 的意圖，包括感知到的整體效用、正義和信心。這項研究強調，接受 AI 在教育上的應用是一個微妙且多面向的問題，除了不同的利害關係人的看法外，還需要仔細考慮具體的 AI 應用及其特徵。
+摘要：GraphRAG 透過將外部知識結構化為多尺度知識圖譜，推動了檢索增強生成 (RAG)，使語言模型能夠在其推理中整合廣泛的背景和細微的細節。儘管 GraphRAG 在各個領域都已展現出成功，但其安全性影響在很大程度上仍未被探索。為了彌補這一差距，本研究探討了 GraphRAG 對投毒攻擊的脆弱性，揭示了一個有趣的安全悖論：與傳統的 RAG 相比，GraphRAG 基於圖表的索引和檢索增強了對簡單投毒攻擊的韌性；同時，相同的特徵也創造了新的攻擊面。我們提出了 GRAGPoison，這是一種新穎的攻擊，它利用知識圖譜中的共享關係來製作中毒文本，能夠同時危害多個查詢。GRAGPoison 採用了三項關鍵策略：i) 關係注入以引入錯誤的知識，ii) 關係增強以擴大投毒影響，以及 iii) 敘事生成以將惡意內容嵌入連貫的文本中。在各種數據集和模型上的經驗評估表明，GRAGPoison 在有效性（成功率高達 98%）和可擴展性（使用不到 68% 的投毒文本）方面都明顯優於現有的攻擊。我們還探討了潛在的防禦措施及其局限性，確定了未來研究的有希望的方向。
 
-##### **Deciphering Heartbeat Signatures: A Vision Transformer Approach to Explainable Atrial Fibrillation Detection from ECG Signals**
-2402.09474v2 by Aruna Mohan, Danne Elbers, Or Zilbershot, Fatemeh Afghah, David Vorchheimer
+##### **EICopilot: Search and Explore Enterprise Information over Large-scale Knowledge Graphs with LLM-driven Agents**
+2501.13746v1 by Yuhui Yun, Huilong Ye, Xinru Li, Ruojia Li, Jingfeng Deng, Li Li, Haoyi Xiong
+
+The paper introduces EICopilot, an novel agent-based solution enhancing
+search and exploration of enterprise registration data within extensive online
+knowledge graphs like those detailing legal entities, registered capital, and
+major shareholders. Traditional methods necessitate text-based queries and
+manual subgraph explorations, often resulting in time-consuming processes.
+EICopilot, deployed as a chatbot via Baidu Enterprise Search, improves this
+landscape by utilizing Large Language Models (LLMs) to interpret natural
+language queries. This solution automatically generates and executes Gremlin
+scripts, providing efficient summaries of complex enterprise relationships.
+Distinct feature a data pre-processing pipeline that compiles and annotates
+representative queries into a vector database of examples for In-context
+learning (ICL), a comprehensive reasoning pipeline combining Chain-of-Thought
+with ICL to enhance Gremlin script generation for knowledge graph search and
+exploration, and a novel query masking strategy that improves intent
+recognition for heightened script accuracy. Empirical evaluations demonstrate
+the superior performance of EICopilot, including speed and accuracy, over
+baseline methods, with the \emph{Full Mask} variant achieving a syntax error
+rate reduction to as low as 10.00% and an execution correctness of up to
+82.14%. These components collectively contribute to superior querying
+capabilities and summarization of intricate datasets, positioning EICopilot as
+a groundbreaking tool in the exploration and exploitation of large-scale
+knowledge graphs for enterprise information search.
 
-Remote patient monitoring based on wearable single-lead electrocardiogram
-(ECG) devices has significant potential for enabling the early detection of
-heart disease, especially in combination with artificial intelligence (AI)
-approaches for automated heart disease detection. There have been prior studies
-applying AI approaches based on deep learning for heart disease detection.
-However, these models are yet to be widely accepted as a reliable aid for
-clinical diagnostics, in part due to the current black-box perception
-surrounding many AI algorithms. In particular, there is a need to identify the
-key features of the ECG signal that contribute toward making an accurate
-diagnosis, thereby enhancing the interpretability of the model. In the present
-study, we develop a vision transformer approach to identify atrial fibrillation
-based on single-lead ECG data. A residual network (ResNet) approach is also
-developed for comparison with the vision transformer approach. These models are
-applied to the Chapman-Shaoxing dataset to classify atrial fibrillation, as
-well as another common arrhythmia, sinus bradycardia, and normal sinus rhythm
-heartbeats. The models enable the identification of the key regions of the
-heartbeat that determine the resulting classification, and highlight the
-importance of P-waves and T-waves, as well as heartbeat duration and signal
-amplitude, in distinguishing normal sinus rhythm from atrial fibrillation and
-sinus bradycardia.
+摘要：本文介紹了 EICopilot，這是一種基於代理的新型解決方案，可增強在廣泛的線上知識圖譜中搜尋和探索企業註冊資料，例如詳細說明法律實體、註冊資本和主要股東的資料。傳統方法需要基於文字的查詢和手動子圖探索，通常會導致耗時的流程。EICopilot 部署為百度企業搜尋的聊天機器人，透過利用大型語言模型 (LLM) 來詮釋自然語言查詢，進而改善這項技術。此解決方案會自動產生並執行 Gremlin 腳本，提供複雜企業關係的有效摘要。其獨特功能為資料前處理管線，可將具代表性的查詢編譯並註解到範例的向量資料庫中，以進行脈絡中學習 (ICL)，這是一個結合了思考鏈與 ICL 的綜合推理管線，用於增強 Gremlin 腳本產生，以進行知識圖譜搜尋和探索，以及一種新穎的查詢遮罩策略，可改善意圖辨識，進而提高腳本準確度。實證評估顯示，EICopilot 的效能優於基線方法，包括速度和準確度，其中「完整遮罩」變體將語法錯誤率降低至低於 10.00%，執行正確率高達 82.14%。這些元件共同促成了優異的查詢功能和複雜資料集的摘要，將 EICopilot 定位為探索和利用大規模知識圖譜進行企業資訊搜尋的創新工具。
 
-摘要：<paragraph>基於可穿戴式單導程心電圖 (ECG) 裝置的遠端病患監測在早期偵測心臟疾病方面具有顯著的潛力，特別是與用於自動化心臟疾病偵測的人工智慧 (AI) 方法結合使用時。先前已有研究應用基於深度學習的 AI 方法進行心臟疾病偵測。然而，這些模型尚未被廣泛接受為臨床診斷的可靠輔助工具，部分原因在於圍繞許多 AI 演算法的當前黑箱感知。特別是，有必要找出有助於做出準確診斷的 ECG 訊號關鍵特徵，從而增強模型的可解釋性。在本研究中，我們開發了一種視覺轉換器方法，以根據單導程 ECG 資料找出心房顫動。殘差網路 (ResNet) 方法也已開發出來，以便與視覺轉換器方法進行比較。這些模型應用於 Chapman-Shaoxing 資料集，以分類心房顫動，以及另一種常見的心律不整，竇性心動過緩，和正常竇性心律的心跳。這些模型能夠找出決定最終分類的心跳關鍵區域，並強調 P 波和 T 波，以及心跳持續時間和訊號振幅在區分正常竇性心律與心房顫動和竇性心動過緩方面的重要性。</paragraph>
+##### **Pseudocode-Injection Magic: Enabling LLMs to Tackle Graph Computational Tasks**
+2501.13731v1 by Chang Gong, Wanrui Bian, Zhijie Zhang, Weiguo Zheng
 
+Graph computational tasks are inherently challenging and often demand the
+development of advanced algorithms for effective solutions. With the emergence
+of large language models (LLMs), researchers have begun investigating their
+potential to address these tasks. However, existing approaches are constrained
+by LLMs' limited capability to comprehend complex graph structures and their
+high inference costs, rendering them impractical for handling large-scale
+graphs. Inspired by human approaches to graph problems, we introduce a novel
+framework, PIE (Pseudocode-Injection-Enhanced LLM Reasoning for Graph
+Computational Tasks), which consists of three key steps: problem understanding,
+prompt design, and code generation. In this framework, LLMs are tasked with
+understanding the problem and extracting relevant information to generate
+correct code. The responsibility for analyzing the graph structure and
+executing the code is delegated to the interpreter. We inject task-related
+pseudocodes into the prompts to further assist the LLMs in generating efficient
+code. We also employ cost-effective trial-and-error techniques to ensure that
+the LLM-generated code executes correctly. Unlike other methods that require
+invoking LLMs for each individual test case, PIE only calls the LLM during the
+code generation phase, allowing the generated code to be reused and
+significantly reducing inference costs. Extensive experiments demonstrate that
+PIE outperforms existing baselines in terms of both accuracy and computational
+efficiency.
 
-### Medical
-|Publish Date|Title|Authors|Homepage|Code|
-| :---: | :---: | :---: | :---: | :---: |
-|**2025-02-13**|**Metamorphic Testing for Pose Estimation Systems**|Matias Duran et.al.|[2502.09460v1](http://arxiv.org/abs/2502.09460v1)|null|
-|**2025-02-13**|**The Joint Entity-Relation Extraction Model Based on Span and Interactive Fusion Representation for Chinese Medical Texts with Complex Semantics**|Danni Feng et.al.|[2502.09247v1](http://arxiv.org/abs/2502.09247v1)|null|
-|**2025-02-13**|**From large language models to multimodal AI: A scoping review on the potential of generative AI in medicine**|Lukas Buess et.al.|[2502.09242v1](http://arxiv.org/abs/2502.09242v1)|null|
-|**2025-02-13**|**Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**|Flavio Bertini et.al.|[2502.09218v1](http://arxiv.org/abs/2502.09218v1)|null|
-|**2025-02-13**|**Logical Lease Litigation: Prolog and LLMs for Rental Law Compliance in New York**|Sanskar Sehgal et.al.|[2502.09204v1](http://arxiv.org/abs/2502.09204v1)|null|
-|**2025-02-13**|**Two-Stage Representation Learning for Analyzing Movement Behavior Dynamics in People Living with Dementia**|Jin Cui et.al.|[2502.09173v1](http://arxiv.org/abs/2502.09173v1)|null|
-|**2025-02-12**|**HistoSmith: Single-Stage Histology Image-Label Generation via Conditional Latent Diffusion for Enhanced Cell Segmentation and Classification**|Valentina Vadori et.al.|[2502.08754v1](http://arxiv.org/abs/2502.08754v1)|null|
-|**2025-02-12**|**Brain Latent Progression: Individual-based Spatiotemporal Disease Progression on 3D Brain MRIs via Latent Diffusion**|Lemuel Puglisi et.al.|[2502.08560v1](http://arxiv.org/abs/2502.08560v1)|[link](https://github.com/lemuelpuglisi/brlp)|
-|**2025-02-12**|**Representation Learning to Advance Multi-institutional Studies with Electronic Health Record Data**|Doudou Zhou et.al.|[2502.08547v1](http://arxiv.org/abs/2502.08547v1)|null|
-|**2025-02-12**|**EEG Artifact Detection and Correction with Deep Autoencoders**|David Aquilué-Llorens et.al.|[2502.08686v1](http://arxiv.org/abs/2502.08686v1)|null|
-|**2025-02-12**|**SycEval: Evaluating LLM Sycophancy**|Aaron Fanous et.al.|[2502.08177v1](http://arxiv.org/abs/2502.08177v1)|null|
-|**2025-02-11**|**Caught in the Web of Words: Do LLMs Fall for Spin in Medical Literature?**|Hye Sun Yun et.al.|[2502.07963v1](http://arxiv.org/abs/2502.07963v1)|null|
-|**2025-02-11**|**An Advanced NLP Framework for Automated Medical Diagnosis with DeBERTa and Dynamic Contextual Positional Gating**|Mohammad Ali Labbaf Khaniki et.al.|[2502.07755v1](http://arxiv.org/abs/2502.07755v1)|null|
-|**2025-02-11**|**Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension**|Wenbo Gong et.al.|[2502.07752v1](http://arxiv.org/abs/2502.07752v1)|null|
-|**2025-02-11**|**The Devil is in the Prompts: De-Identification Traces Enhance Memorization Risks in Synthetic Chest X-Ray Generation**|Raman Dutt et.al.|[2502.07516v1](http://arxiv.org/abs/2502.07516v1)|[link](https://github.com/Raman1121/diffusion_memorization)|
-|**2025-02-11**|**KPIs 2024 Challenge: Advancing Glomerular Segmentation from Patch- to Slide-Level**|Ruining Deng et.al.|[2502.07288v1](http://arxiv.org/abs/2502.07288v1)|[link](https://github.com/agaldran/kpis)|
-|**2025-02-11**|**Early Risk Prediction of Pediatric Cardiac Arrest from Electronic Health Records via Multimodal Fused Transformer**|Jiaying Lu et.al.|[2502.07158v1](http://arxiv.org/abs/2502.07158v1)|null|
-|**2025-02-11**|**Explaining 3D Computed Tomography Classifiers with Counterfactuals**|Joseph Paul Cohen et.al.|[2502.07156v1](http://arxiv.org/abs/2502.07156v1)|[link](https://github.com/ieee8023/ct-counterfactuals)|
-|**2025-02-10**|**Interactive Data Harmonization with LLM Agents**|Aécio Santos et.al.|[2502.07132v1](http://arxiv.org/abs/2502.07132v1)|null|
-|**2025-02-10**|**Machine Learning for Everyone: Simplifying Healthcare Analytics with BigQuery ML**|Mohammad Amir Salari et.al.|[2502.07026v1](http://arxiv.org/abs/2502.07026v1)|null|
-|**2025-02-10**|**AIMS.au: A Dataset for the Analysis of Modern Slavery Countermeasures in Corporate Statements**|Adriana Eufrosiana Bora et.al.|[2502.07022v1](http://arxiv.org/abs/2502.07022v1)|null|
-|**2025-02-10**|**Recent Advances, Applications and Open Challenges in Machine Learning for Health: Reflections from Research Roundtables at ML4H 2024 Symposium**|Amin Adibi et.al.|[2502.06693v1](http://arxiv.org/abs/2502.06693v1)|null|
-|**2025-02-10**|**Automatic Evaluation of Healthcare LLMs Beyond Question-Answering**|Anna Arias-Duart et.al.|[2502.06666v1](http://arxiv.org/abs/2502.06666v1)|null|
-|**2025-02-10**|**Few-Shot Classification and Anatomical Localization of Tissues in SPECT Imaging**|Mohammed Abdul Hafeez Khan et.al.|[2502.06632v1](http://arxiv.org/abs/2502.06632v1)|null|
-|**2025-02-10**|**Illegal Waste Detection in Remote Sensing Images: A Case Study**|Federico Gibellini et.al.|[2502.06607v2](http://arxiv.org/abs/2502.06607v2)|null|
-|**2025-02-10**|**FEMBA: Efficient and Scalable EEG Analysis with a Bidirectional Mamba Foundation Model**|Anna Tegon et.al.|[2502.06438v1](http://arxiv.org/abs/2502.06438v1)|null|
-|**2025-02-10**|**Is an Ultra Large Natural Image-Based Foundation Model Superior to a Retina-Specific Model for Detecting Ocular and Systemic Diseases?**|Qingshan Hou et.al.|[2502.06289v1](http://arxiv.org/abs/2502.06289v1)|null|
-|**2025-02-10**|**Integrating Sequence and Image Modeling in Irregular Medical Time Series Through Self-Supervised Learning**|Liuqing Chen et.al.|[2502.06134v1](http://arxiv.org/abs/2502.06134v1)|null|
-|**2025-02-10**|**Foundation Model of Electronic Medical Records for Adaptive Risk Estimation**|Pawel Renc et.al.|[2502.06124v1](http://arxiv.org/abs/2502.06124v1)|null|
-|**2025-02-10**|**Can ChatGPT Diagnose Alzheimer's Disease?**|Quoc-Toan Nguyen et.al.|[2502.06907v1](http://arxiv.org/abs/2502.06907v1)|null|
-|**2025-02-09**|**Protecting Intellectual Property of EEG-based Neural Networks with Watermarking**|Ahmed Abdelaziz et.al.|[2502.05931v1](http://arxiv.org/abs/2502.05931v1)|[link](https://github.com/Prog-Jacob/watermarking-eeg-models)|
-|**2025-02-09**|**Enhancing Depression Detection with Chain-of-Thought Prompting: From Emotion to Reasoning Using Large Language Models**|Shiyu Teng et.al.|[2502.05879v1](http://arxiv.org/abs/2502.05879v1)|null|
-|**2025-02-09**|**LLMs for Drug-Drug Interaction Prediction: A Comprehensive Comparison**|Gabriele De Vito et.al.|[2502.06890v1](http://arxiv.org/abs/2502.06890v1)|null|
-|**2025-02-09**|**Decoding Complexity: Intelligent Pattern Exploration with CHPDA (Context Aware Hybrid Pattern Detection Algorithm)**|Lokesh Koli et.al.|[2502.07815v1](http://arxiv.org/abs/2502.07815v1)|null|
-|**2025-02-09**|**WatchGuardian: Enabling User-Defined Personalized Just-in-Time Intervention on Smartwatch**|Ying Lei et.al.|[2502.05783v1](http://arxiv.org/abs/2502.05783v1)|null|
-|**2025-02-09**|**RECOVER: Designing a Large Language Model-based Remote Patient Monitoring System for Postoperative Gastrointestinal Cancer Care**|Ziqi Yang et.al.|[2502.05740v1](http://arxiv.org/abs/2502.05740v1)|null|
-|**2025-02-08**|**4D VQ-GAN: Synthesising Medical Scans at Any Time Point for Personalised Disease Progression Modelling of Idiopathic Pulmonary Fibrosis**|An Zhao et.al.|[2502.05713v1](http://arxiv.org/abs/2502.05713v1)|null|
-|**2025-02-08**|**KMI: A Dataset of Korean Motivational Interviewing Dialogues for Psychotherapy**|Hyunjong Kim et.al.|[2502.05651v1](http://arxiv.org/abs/2502.05651v1)|null|
-|**2025-02-08**|**ELMTEX: Fine-Tuning Large Language Models for Structured Clinical Information Extraction. A Case Study on Clinical Reports**|Aynur Guluzade et.al.|[2502.05638v1](http://arxiv.org/abs/2502.05638v1)|[link](https://gitlab.cc-asp.fraunhofer.de/health-open/elmtex)|
-|**2025-02-08**|**Multi-scale Masked Autoencoder for Electrocardiogram Anomaly Detection**|Ya Zhou et.al.|[2502.05494v1](http://arxiv.org/abs/2502.05494v1)|null|
-|**2025-02-08**|**DCENWCNet: A Deep CNN Ensemble Network for White Blood Cell Classification with LIME-Based Explainability**|Sibasish Dhibar et.al.|[2502.05459v1](http://arxiv.org/abs/2502.05459v1)|null|
-|**2025-02-07**|**Multi-Class Segmentation of Aortic Branches and Zones in Computed Tomography Angiography: The AortaSeg24 Challenge**|Muhammad Imran et.al.|[2502.05330v1](http://arxiv.org/abs/2502.05330v1)|[link](https://github.com/MaxwellEng/MICCAI_CHANLLENGE24_HJL)|
-|**2025-02-07**|**Homeomorphism Prior for False Positive and Negative Problem in Medical Image Dense Contrastive Representation Learning**|Yuting He et.al.|[2502.05282v1](http://arxiv.org/abs/2502.05282v1)|null|
-|**2025-02-07**|**"It Felt Like I Was Left in the Dark": Exploring Information Needs and Design Opportunities for Family Caregivers of Older Adult Patients in Critical Care Settings**|Shihan Fu et.al.|[2502.05115v1](http://arxiv.org/abs/2502.05115v1)|null|
-|**2025-02-07**|**Mitigating Unintended Memorization with LoRA in Federated Learning for LLMs**|Thierry Bossy et.al.|[2502.05087v1](http://arxiv.org/abs/2502.05087v1)|[link](https://github.com/tuneinsight/federated-llms)|
-|**2025-02-07**|**MedMimic: Physician-Inspired Multimodal Fusion for Early Diagnosis of Fever of Unknown Origin**|Minrui Chen et.al.|[2502.04794v1](http://arxiv.org/abs/2502.04794v1)|null|
-|**2025-02-06**|**MedGNN: Towards Multi-resolution Spatiotemporal Graph Learning for Medical Time Series Classification**|Wei Fan et.al.|[2502.04515v1](http://arxiv.org/abs/2502.04515v1)|[link](https://github.com/aikunyi/MedGNN)|
-|**2025-02-06**|**Integrating Generative Artificial Intelligence in ADRD: A Framework for Streamlining Diagnosis and Care in Neurodegenerative Diseases**|Andrew G. Breithaupt et.al.|[2502.06842v1](http://arxiv.org/abs/2502.06842v1)|null|
-|**2025-02-06**|**Primary Care Diagnoses as a Reliable Predictor for Orthopedic Surgical Interventions**|Khushboo Verma et.al.|[2502.04423v1](http://arxiv.org/abs/2502.04423v1)|null|
-|**2025-02-06**|**Automatic quantification of breast cancer biomarkers from multiple 18F-FDG PET image segmentation**|Tewele W. Tareke et.al.|[2502.04083v1](http://arxiv.org/abs/2502.04083v1)|null|
-|**2025-02-06**|**Generalize Drug Response Prediction by Latent Independent Projection for Asymmetric Constrained Domain Generalization**|Ran Song et.al.|[2502.04034v1](http://arxiv.org/abs/2502.04034v1)|null|
-|**2025-02-06**|**MedRAG: Enhancing Retrieval-augmented Generation with Knowledge Graph-Elicited Reasoning for Healthcare Copilot**|Xuejiao Zhao et.al.|[2502.04413v1](http://arxiv.org/abs/2502.04413v1)|[link](https://github.com/snowteam2023/medrag)|
-|**2025-02-06**|**Transforming Multimodal Models into Action Models for Radiotherapy**|Matteo Ferrante et.al.|[2502.04408v1](http://arxiv.org/abs/2502.04408v1)|null|
-|**2025-02-06**|**Online Location Planning for AI-Defined Vehicles: Optimizing Joint Tasks of Order Serving and Spatio-Temporal Heterogeneous Model Fine-Tuning**|Bokeng Zheng et.al.|[2502.04399v1](http://arxiv.org/abs/2502.04399v1)|null|
-|**2025-02-06**|**Multimodal Medical Code Tokenizer**|Xiaorui Su et.al.|[2502.04397v2](http://arxiv.org/abs/2502.04397v2)|null|
-|**2025-02-06**|**A Retrospective Systematic Study on Hierarchical Sparse Query Transformer-assisted Ultrasound Screening for Early Hepatocellular Carcinoma**|Chaoyin She et.al.|[2502.03772v1](http://arxiv.org/abs/2502.03772v1)|[link](https://github.com/Asunatan/HSQformer)|
-|**2025-02-05**|**Towards Fair Medical AI: Adversarial Debiasing of 3D CT Foundation Embeddings**|Guangyao Zheng et.al.|[2502.04386v1](http://arxiv.org/abs/2502.04386v1)|[link](https://github.com/BioIntelligence-Lab/VAE-Adversarial-Debiasing)|
-|**2025-02-05**|**Clinically-Inspired Hierarchical Multi-Label Classification of Chest X-rays with a Penalty-Based Loss Function**|Mehrdad Asadi et.al.|[2502.03591v1](http://arxiv.org/abs/2502.03591v1)|[link](https://github.com/the-mercury/CIHMLC)|
-|**2025-02-05**|**Code Simulation as a Proxy for High-order Tasks in Large Language Models**|Emanuele La Malfa et.al.|[2502.03568v1](http://arxiv.org/abs/2502.03568v1)|null|
-|**2025-02-05**|**Limitations of Large Language Models in Clinical Problem-Solving Arising from Inflexible Reasoning**|Jonathan Kim et.al.|[2502.04381v1](http://arxiv.org/abs/2502.04381v1)|null|
-|**2025-02-05**|**Accurate AI-Driven Emergency Vehicle Location Tracking in Healthcare ITS Digital Twin**|Sarah Al-Shareeda et.al.|[2502.03396v1](http://arxiv.org/abs/2502.03396v1)|null|
-|**2025-02-05**|**RadVLM: A Multitask Conversational Vision-Language Model for Radiology**|Nicolas Deperrois et.al.|[2502.03333v1](http://arxiv.org/abs/2502.03333v1)|null|
-|**2025-02-05**|**MeDiSumQA: Patient-Oriented Question-Answer Generation from Discharge Letters**|Amin Dada et.al.|[2502.03298v1](http://arxiv.org/abs/2502.03298v1)|null|
-|**2025-02-05**|**Deep Learning Pipeline for Fully Automated Myocardial Infarct Segmentation from Clinical Cardiac MR Scans**|Matthias Schwab et.al.|[2502.03272v1](http://arxiv.org/abs/2502.03272v1)|null|
-|**2025-02-05**|**Long-tailed Medical Diagnosis with Relation-aware Representation Learning and Iterative Classifier Calibration**|Li Pan et.al.|[2502.03238v2](http://arxiv.org/abs/2502.03238v2)|[link](https://github.com/peterlipan/lmd)|
-|**2025-02-05**|**Fine-Tuning Strategies for Continual Online EEG Motor Imagery Decoding: Insights from a Large-Scale Longitudinal Study**|Martin Wimpff et.al.|[2502.06828v1](http://arxiv.org/abs/2502.06828v1)|null|
-|**2025-02-05**|**MedBioLM: Optimizing Medical and Biological QA with Fine-Tuned Large Language Models and Retrieval-Augmented Generation**|Seonok Kim et.al.|[2502.03004v1](http://arxiv.org/abs/2502.03004v1)|null|
-|**2025-02-05**|**Contrastive Token-level Explanations for Graph-based Rumour Detection**|Daniel Wai Kit Chin et.al.|[2502.04366v1](http://arxiv.org/abs/2502.04366v1)|null|
-|**2025-02-05**|**AI-Based Thermal Video Analysis in Privacy-Preserving Healthcare: A Case Study on Detecting Time of Birth**|Jorge García-Torres et.al.|[2502.04365v1](http://arxiv.org/abs/2502.04365v1)|null|
-|**2025-02-04**|**3D Foundation AI Model for Generalizable Disease Detection in Head Computed Tomography**|Weicheng Zhu et.al.|[2502.02779v1](http://arxiv.org/abs/2502.02779v1)|null|
-|**2025-02-04**|**Adaptive Voxel-Weighted Loss Using L1 Norms in Deep Neural Networks for Detection and Segmentation of Prostate Cancer Lesions in PET/CT Images**|Obed Korshie Dzikunu et.al.|[2502.02756v1](http://arxiv.org/abs/2502.02756v1)|[link](https://github.com/obeddzik/pca_segment)|
-|**2025-02-04**|**Diffusion Instruction Tuning**|Chen Jin et.al.|[2502.06814v1](http://arxiv.org/abs/2502.06814v1)|null|
-|**2025-02-04**|**MedRAX: Medical Reasoning Agent for Chest X-ray**|Adibvafa Fallahpour et.al.|[2502.02673v1](http://arxiv.org/abs/2502.02673v1)|[link](https://github.com/bowang-lab/medrax)|
-|**2025-02-04**|**Open Foundation Models in Healthcare: Challenges, Paradoxes, and Opportunities with GenAI Driven Personalized Prescription**|Mahdi Alkaeed et.al.|[2502.04356v1](http://arxiv.org/abs/2502.04356v1)|null|
-|**2025-02-04**|**Decision Theoretic Foundations for Conformal Prediction: Optimal Uncertainty Quantification for Risk-Averse Agents**|Shayan Kiyani et.al.|[2502.02561v1](http://arxiv.org/abs/2502.02561v1)|null|
-|**2025-02-04**|**CoRPA: Adversarial Image Generation for Chest X-rays Using Concept Vector Perturbations and Generative Models**|Amy Rafferty et.al.|[2502.05214v1](http://arxiv.org/abs/2502.05214v1)|null|
-|**2025-02-04**|**A Self-Supervised Framework for Improved Generalisability in Ultrasound B-mode Image Segmentation**|Edward Ellis et.al.|[2502.02489v1](http://arxiv.org/abs/2502.02489v1)|null|
-|**2025-02-04**|**Medical Multimodal Model Stealing Attacks via Adversarial Domain Alignment**|Yaling Shen et.al.|[2502.02438v1](http://arxiv.org/abs/2502.02438v1)|null|
-|**2025-02-04**|**Test Time Training for 4D Medical Image Interpolation**|Qikang Zhang et.al.|[2502.02341v1](http://arxiv.org/abs/2502.02341v1)|[link](https://github.com/chaostheproducer/ttt4d)|
-|**2025-02-04**|**Conversation AI Dialog for Medicare powered by Finetuning and Retrieval Augmented Generation**|Atharva Mangeshkumar Agrawal et.al.|[2502.02249v1](http://arxiv.org/abs/2502.02249v1)|null|
-|**2025-02-04**|**Deep Learning-Based Facial Expression Recognition for the Elderly: A Systematic Review**|F. Xavier Gaya-Morey et.al.|[2502.02618v1](http://arxiv.org/abs/2502.02618v1)|null|
-|**2025-02-04**|**Causally-informed Deep Learning towards Explainable and Generalizable Outcomes Prediction in Critical Care**|Yuxiao Cheng et.al.|[2502.02109v1](http://arxiv.org/abs/2502.02109v1)|null|
-|**2025-02-04**|**JingFang: A Traditional Chinese Medicine Large Language Model of Expert-Level Medical Diagnosis and Syndrome Differentiation-Based Treatment**|Yehan Yan et.al.|[2502.04345v1](http://arxiv.org/abs/2502.04345v1)|null|
-|**2025-02-03**|**An Agentic AI Workflow for Detecting Cognitive Concerns in Real-world Data**|Jiazi Tian et.al.|[2502.01789v1](http://arxiv.org/abs/2502.01789v1)|null|
-|**2025-02-03**|**Can Domain Experts Rely on AI Appropriately? A Case Study on AI-Assisted Prostate Cancer MRI Diagnosis**|Chacha Chen et.al.|[2502.03482v1](http://arxiv.org/abs/2502.03482v1)|null|
-|**2025-02-03**|**Improving Transformer World Models for Data-Efficient RL**|Antoine Dedieu et.al.|[2502.01591v1](http://arxiv.org/abs/2502.01591v1)|null|
-|**2025-02-03**|**Data-Efficient Model for Psychological Resilience Prediction based on Neurological Data**|Zhi Zhang et.al.|[2502.01377v1](http://arxiv.org/abs/2502.01377v1)|null|
-|**2025-02-03**|**OphthBench: A Comprehensive Benchmark for Evaluating Large Language Models in Chinese Ophthalmology**|Chengfeng Zhou et.al.|[2502.01243v1](http://arxiv.org/abs/2502.01243v1)|null|
-|**2025-02-03**|**MIND: Modality-Informed Knowledge Distillation Framework for Multimodal Clinical Prediction Tasks**|Alejandro Guerra-Manzanares et.al.|[2502.01158v1](http://arxiv.org/abs/2502.01158v1)|null|
-|**2025-02-03**|**Beyond Yes or No: Predictive Compliance Monitoring Approaches for Quantifying the Magnitude of Compliance Violations**|Qian Chen et.al.|[2502.01141v1](http://arxiv.org/abs/2502.01141v1)|null|
-|**2025-02-03**|**Pulse-PPG: An Open-Source Field-Trained PPG Foundation Model for Wearable Applications Across Lab and Field Settings**|Mithun Saha et.al.|[2502.01108v1](http://arxiv.org/abs/2502.01108v1)|null|
-|**2025-02-03**|**Tutorial on Using Machine Learning and Deep Learning Models for Mental Illness Detection**|Yeyubei Zhang et.al.|[2502.04342v1](http://arxiv.org/abs/2502.04342v1)|null|
-|**2025-02-02**|**Agent-Based Uncertainty Awareness Improves Automated Radiology Report Labeling with an Open-Source Large Language Model**|Hadas Ben-Atya et.al.|[2502.01691v1](http://arxiv.org/abs/2502.01691v1)|null|
-|**2025-02-02**|**Automated Extraction of Spatio-Semantic Graphs for Identifying Cognitive Impairment**|Si-Ioi Ng et.al.|[2502.01685v1](http://arxiv.org/abs/2502.01685v1)|null|
-|**2025-02-02**|**Registration-Enhanced Segmentation Method for Prostate Cancer in Ultrasound Images**|Shengtian Sang et.al.|[2502.00712v1](http://arxiv.org/abs/2502.00712v1)|null|
-|**2025-02-02**|**TMI-CLNet: Triple-Modal Interaction Network for Chronic Liver Disease Prognosis From Imaging, Clinical, and Radiomic Data Fusion**|Linglong Wu et.al.|[2502.00695v1](http://arxiv.org/abs/2502.00695v1)|null|
-|**2025-02-02**|**Safety at Scale: A Comprehensive Survey of Large Model Safety**|Xingjun Ma et.al.|[2502.05206v2](http://arxiv.org/abs/2502.05206v2)|null|
-|**2025-02-02**|**Enhanced Convolutional Neural Networks for Improved Image Classification**|Xiaoran Yang et.al.|[2502.00663v1](http://arxiv.org/abs/2502.00663v1)|null|
-|**2025-02-02**|**Distribution-aware Fairness Learning in Medical Image Segmentation From A Control-Theoretic Perspective**|Yujin Oh et.al.|[2502.00619v1](http://arxiv.org/abs/2502.00619v1)|null|
-|**2025-02-01**|**Generating crossmodal gene expression from cancer histopathology improves multimodal AI predictions**|Samiran Dey et.al.|[2502.00568v3](http://arxiv.org/abs/2502.00568v3)|[link](https://github.com/Samiran-Dey/PathoGen)|
+摘要：圖表計算任務本質上具有挑戰性，而且通常需要開發先進的演算法才能有效解決。隨著大型語言模型 (LLM) 的出現，研究人員已開始探討其解決這些任務的可能性。然而，現有方法受到 LLM 理解複雜圖形結構的能力有限以及其高推理成本的限制，這使得它們不切實際地處理大規模圖形。受到人類解決圖形問題的方法啟發，我們引入了 PIE（偽代碼注入增強 LLM 圖形計算任務推理）這個新框架，它包含三個關鍵步驟：問題理解、提示設計和代碼生成。在此框架中，LLM 的任務是理解問題並擷取相關資訊以產生正確的代碼。分析圖形結構和執行代碼的責任委派給解釋器。我們將與任務相關的偽代碼注入提示中，以進一步協助 LLM 產生有效的代碼。我們還採用具有成本效益的試錯技術，以確保 LLM 生成的代碼正確執行。與需要為每個個別測試案例呼叫 LLM 的其他方法不同，PIE 僅在代碼產生階段呼叫 LLM，允許重複使用產生的代碼並大幅降低推理成本。大量的實驗證明，PIE 在準確性和計算效率方面都優於現有的基準。
 
-#### Abstracts
-##### **Metamorphic Testing for Pose Estimation Systems**
-2502.09460v1 by Matias Duran, Thomas Laurent, Ellen Rushe, Anthony Ventresque
+##### **CAPRAG: A Large Language Model Solution for Customer Service and Automatic Reporting using Vector and Graph Retrieval-Augmented Generation**
+2501.13993v1 by Hamza Landolsi, Kais Letaief, Nizar Taghouti, Ines Abdeljaoued-Tej
 
-Pose estimation systems are used in a variety of fields, from sports
-analytics to livestock care. Given their potential impact, it is paramount to
-systematically test their behaviour and potential for failure. This is a
-complex task due to the oracle problem and the high cost of manual labelling
-necessary to build ground truth keypoints. This problem is exacerbated by the
-fact that different applications require systems to focus on different subjects
-(e.g., human versus animal) or landmarks (e.g., only extremities versus whole
-body and face), which makes labelled test data rarely reusable. To combat these
-problems we propose MET-POSE, a metamorphic testing framework for pose
-estimation systems that bypasses the need for manual annotation while assessing
-the performance of these systems under different circumstances. MET-POSE thus
-allows users of pose estimation systems to assess the systems in conditions
-that more closely relate to their application without having to label an ad-hoc
-test dataset or rely only on available datasets, which may not be adapted to
-their application domain. While we define MET-POSE in general terms, we also
-present a non-exhaustive list of metamorphic rules that represent common
-challenges in computer vision applications, as well as a specific way to
-evaluate these rules. We then experimentally show the effectiveness of MET-POSE
-by applying it to Mediapipe Holistic, a state of the art human pose estimation
-system, with the FLIC and PHOENIX datasets. With these experiments, we outline
-numerous ways in which the outputs of MET-POSE can uncover faults in pose
-estimation systems at a similar or higher rate than classic testing using hand
-labelled data, and show that users can tailor the rule set they use to the
-faults and level of accuracy relevant to their application.
+The introduction of new features and services in the banking sector often
+overwhelms customers, creating an opportunity for banks to enhance user
+experience through financial chatbots powered by large language models (LLMs).
+We initiated an AI agent designed to provide customers with relevant
+information about banking services and insights from annual reports. We
+proposed a hybrid Customer Analysis Pipeline Retrieval-Augmented Generation
+(CAPRAG) that effectively addresses both relationship-based and contextual
+queries, thereby improving customer engagement in the digital banking
+landscape. To implement this, we developed a processing pipeline to refine text
+data, which we utilized in two main frameworks: Vector RAG and Graph RAG. This
+dual approach enables us to populate both vector and graph databases with
+processed data for efficient retrieval. The Cypher query component is employed
+to effectively query the graph database. When a user submits a query, it is
+first expanded by a query expansion module before being routed to construct a
+final query from the hybrid Knowledge Base (KB). This final query is then sent
+to an open-source LLM for response generation. Overall, our innovative,
+designed to international banks, serves bank's customers in an increasingly
+complex digital environment, enhancing clarity and accessibility of
+information.
 
-摘要：姿勢估計系統應用於各種領域，從運動分析到牲畜照護。鑑於其潛在影響，系統性地測試其行為和故障潛力至關重要。由於預言機問題以及建立地面實況關鍵點所需的手動標記成本高，這是一項複雜的任務。這個問題因不同的應用需要系統專注於不同的主體（例如，人類對動物）或地標（例如，只有四肢對全身和臉部）而加劇，這使得標記的測試數據很少可以重複使用。為了解決這些問題，我們提出了 MET-POSE，這是一個姿勢估計系統的變形測試框架，在評估這些系統在不同情況下的性能時，可以繞過手動註解的需要。因此，MET-POSE 允許姿勢估計系統的使用者在更接近其應用程式的條件下評估系統，而無需標記臨時測試數據集或僅依賴可用數據集，這些數據集可能不適合其應用領域。雖然我們以一般術語定義 MET-POSE，但我們也提供了一個非詳盡的變形規則列表，這些規則代表了電腦視覺應用中的常見挑戰，以及評估這些規則的具體方法。然後，我們通過將 MET-POSE 應用於 Mediapipe Holistic（一種先進的人類姿勢估計系統），並使用 FLIC 和 PHOENIX 數據集，以實驗方式展示 MET-POSE 的有效性。通過這些實驗，我們概述了 MET-POSE 的輸出可以揭示姿勢估計系統中故障的許多方法，其速度與使用手動標記數據的傳統測試類似或更高，並表明使用者可以根據其應用程式相關的故障和準確度等級來調整他們使用的規則集。
+摘要：銀行業中新功能和服務的推出經常讓客戶感到不知所措，這為銀行透過大型語言模型 (LLM) 驅動的金融聊天機器人來提升使用者體驗創造了機會。我們啟動了一個人工智慧代理，旨在為客戶提供有關銀行服務和年度報告見解的相關資訊。我們提出了一個混合式客戶分析管道檢索擴充生成 (CAPRAG)，它有效地處理基於關係和情境式的查詢，從而提升數位銀行環境中的客戶參與度。為了實作這一點，我們開發了一個處理管道來精煉文字資料，我們在兩個主要架構中使用它：Vector RAG 和 Graph RAG。這種雙管齊下的方法讓我們能夠使用處理過的資料來填補向量和圖形資料庫，以利於有效檢索。Cypher 查詢元件用於有效查詢圖形資料庫。當使用者提交查詢時，它會先由查詢擴充模組擴充，然後再路由到混合式知識庫 (KB) 中建構最終查詢。然後這個最終查詢會傳送給開源 LLM 以產生回應。整體而言，我們創新的設計服務於國際銀行，在日益複雜的數位環境中服務銀行客戶，提升資訊的清晰度和可及性。
 
-##### **The Joint Entity-Relation Extraction Model Based on Span and Interactive Fusion Representation for Chinese Medical Texts with Complex Semantics**
-2502.09247v1 by Danni Feng, Runzhi Li, Jing Wang, Siyu Yan, Lihong Ma, Yunli Xing
+##### **Dual-Branch HNSW Approach with Skip Bridges and LID-Driven Optimization**
+2501.13992v1 by Hy Nguyen, Nguyen Hung Nguyen, Nguyen Linh Bao Nguyen, Srikanth Thudumu, Hung Du, Rajesh Vasa, Kon Mouzakis
 
-Joint entity-relation extraction is a critical task in transforming
-unstructured or semi-structured text into triplets, facilitating the
-construction of large-scale knowledge graphs, and supporting various downstream
-applications. Despite its importance, research on Chinese text, particularly
-with complex semantics in specialized domains like medicine, remains limited.
-To address this gap, we introduce the CH-DDI, a Chinese drug-drug interactions
-dataset designed to capture the intricacies of medical text. Leveraging the
-strengths of attention mechanisms in capturing long-range dependencies, we
-propose the SEA module, which enhances the extraction of complex contextual
-semantic information, thereby improving entity recognition and relation
-extraction. Additionally, to address the inefficiencies of existing methods in
-facilitating information exchange between entity recognition and relation
-extraction, we present an interactive fusion representation module. This module
-employs Cross Attention for bidirectional information exchange between the
-tasks and further refines feature extraction through BiLSTM. Experimental
-results on both our CH-DDI dataset and public CoNLL04 dataset demonstrate that
-our model exhibits strong generalization capabilities. On the CH-DDI dataset,
-our model achieves an F1-score of 96.73% for entity recognition and 78.43% for
-relation extraction. On the CoNLL04 dataset, it attains an entity recognition
-precision of 89.54% and a relation extraction accuracy of 71.64%.
+The Hierarchical Navigable Small World (HNSW) algorithm is widely used for
+approximate nearest neighbor (ANN) search, leveraging the principles of
+navigable small-world graphs. However, it faces some limitations. The first is
+the local optima problem, which arises from the algorithm's greedy search
+strategy, selecting neighbors based solely on proximity at each step. This
+often leads to cluster disconnections. The second limitation is that HNSW
+frequently fails to achieve logarithmic complexity, particularly in
+high-dimensional datasets, due to the exhaustive traversal through each layer.
+To address these limitations, we propose a novel algorithm that mitigates local
+optima and cluster disconnections while enhancing the construction speed,
+maintaining inference speed. The first component is a dual-branch HNSW
+structure with LID-based insertion mechanisms, enabling traversal from multiple
+directions. This improves outlier node capture, enhances cluster connectivity,
+accelerates construction speed and reduces the risk of local minima. The second
+component incorporates a bridge-building technique that bypasses redundant
+intermediate layers, maintaining inference and making up the additional
+computational overhead introduced by the dual-branch structure. Experiments on
+various benchmarks and datasets showed that our algorithm outperforms the
+original HNSW in both accuracy and speed. We evaluated six datasets across
+Computer Vision (CV), and Natural Language Processing (NLP), showing recall
+improvements of 18\% in NLP, and up to 30\% in CV tasks while reducing the
+construction time by up to 20\% and maintaining the inference speed. We did not
+observe any trade-offs in our algorithm. Ablation studies revealed that
+LID-based insertion had the greatest impact on performance, followed by the
+dual-branch structure and bridge-building components.
 
-摘要：聯合實體關係抽取是將非結構化或半結構化文字轉換為三元組的重要任務，有助於建構大規模知識圖譜，並支援各種下游應用程式。儘管其重要性，但針對中文文本的研究，特別是醫學等專業領域中具有複雜語義的研究仍十分有限。為了解決這個差距，我們引入了 CH-DDI，一個中文藥物-藥物交互作用資料集，旨在擷取醫學文本的複雜性。利用注意力機制在擷取長程依賴關係方面的優勢，我們提出了 SEA 模組，增強了複雜脈絡語義資訊的抽取，從而改進了實體辨識和關係抽取。此外，為了解決現有方法在促進實體辨識和關係抽取之間資訊交換方面的低效率問題，我們提出了互動式融合表示模組。此模組採用交叉注意力，在任務之間進行雙向資訊交換，並透過 BiLSTM 進一步精煉特徵抽取。在我們的 CH-DDI 資料集和公開的 CoNLL04 資料集上的實驗結果表明，我們的模型展現出強大的泛化能力。在 CH-DDI 資料集上，我們的模型在實體辨識方面達到了 96.73% 的 F1 分數，在關係抽取方面達到了 78.43% 的 F1 分數。在 CoNLL04 資料集上，它在實體辨識方面達到了 89.54% 的準確度，在關係抽取方面達到了 71.64% 的準確度。
+摘要：分層可導航小世界 (HNSW) 演算法廣泛用於近似最近鄰居 (ANN) 搜尋，並利用可導航小世界圖形的原理。然而，它面臨一些限制。第一個是局部最佳化問題，這源自於演算法的貪婪搜尋策略，在每個步驟中僅根據鄰近度來選擇鄰居。這通常會導致群集斷線。第二個限制是，由於透過每一層的窮舉式遍歷，HNSW 常常無法在高維度資料集中達成對數複雜度。為了解決這些限制，我們提出了一種新的演算法，它可以減輕局部最佳化和群集斷線，同時提高建構速度，並維持推論速度。第一個組成部分是一個具有基於 LID 的插入機制的雙分支 HNSW 結構，它能從多個方向進行遍歷。這改善了異常值節點的擷取，增強了群集連通性，加速了建構速度，並降低了局部最小值的風險。第二個組成部分包含一種橋樑建構技術，它繞過了多餘的中間層，維持推論並彌補了雙分支結構所帶來的額外運算負擔。在各種基準和資料集上的實驗顯示，我們的演算法在準確度和速度上都優於原始的 HNSW。我們評估了電腦視覺 (CV) 和自然語言處理 (NLP) 中的六個資料集，顯示 NLP 中的召回率提高了 18%，CV 任務中提高了 30%，同時將建構時間縮短了 20%，並維持了推論速度。我們沒有在我們的演算法中觀察到任何取捨。消融研究顯示，基於 LID 的插入對效能的影響最大，其次是雙分支結構和橋樑建構組成部分。
 
-##### **From large language models to multimodal AI: A scoping review on the potential of generative AI in medicine**
-2502.09242v1 by Lukas Buess, Matthias Keicher, Nassir Navab, Andreas Maier, Soroosh Tayebi Arasteh
+##### **Comprehensive Modeling and Question Answering of Cancer Clinical Practice Guidelines using LLMs**
+2501.13984v1 by Bhumika Gupta, Pralaypati Ta, Keerthi Ram, Mohanasankar Sivaprakasam
 
-Generative artificial intelligence (AI) models, such as diffusion models and
-OpenAI's ChatGPT, are transforming medicine by enhancing diagnostic accuracy
-and automating clinical workflows. The field has advanced rapidly, evolving
-from text-only large language models for tasks such as clinical documentation
-and decision support to multimodal AI systems capable of integrating diverse
-data modalities, including imaging, text, and structured data, within a single
-model. The diverse landscape of these technologies, along with rising interest,
-highlights the need for a comprehensive review of their applications and
-potential. This scoping review explores the evolution of multimodal AI,
-highlighting its methods, applications, datasets, and evaluation in clinical
-settings. Adhering to PRISMA-ScR guidelines, we systematically queried PubMed,
-IEEE Xplore, and Web of Science, prioritizing recent studies published up to
-the end of 2024. After rigorous screening, 144 papers were included, revealing
-key trends and challenges in this dynamic field. Our findings underscore a
-shift from unimodal to multimodal approaches, driving innovations in diagnostic
-support, medical report generation, drug discovery, and conversational AI.
-However, critical challenges remain, including the integration of heterogeneous
-data types, improving model interpretability, addressing ethical concerns, and
-validating AI systems in real-world clinical settings. This review summarizes
-the current state of the art, identifies critical gaps, and provides insights
-to guide the development of scalable, trustworthy, and clinically impactful
-multimodal AI solutions in healthcare.
+The updated recommendations on diagnostic procedures and treatment pathways
+for a medical condition are documented as graphical flows in Clinical Practice
+Guidelines (CPGs). For effective use of the CPGs in helping medical
+professionals in the treatment decision process, it is necessary to fully
+capture the guideline knowledge, particularly the contexts and their
+relationships in the graph. While several existing works have utilized these
+guidelines to create rule bases for Clinical Decision Support Systems, limited
+work has been done toward directly capturing the full medical knowledge
+contained in CPGs. This work proposes an approach to create a contextually
+enriched, faithful digital representation of National Comprehensive Cancer
+Network (NCCN) Cancer CPGs in the form of graphs using automated extraction and
+node & relationship classification. We also implement semantic enrichment of
+the model by using Large Language Models (LLMs) for node classification,
+achieving an accuracy of 80.86% and 88.47% with zero-shot learning and few-shot
+learning, respectively. Additionally, we introduce a methodology for answering
+natural language questions with constraints to guideline text by leveraging
+LLMs to extract the relevant subgraph from the guideline knowledge base. By
+generating natural language answers based on subgraph paths and semantic
+information, we mitigate the risk of incorrect answers and hallucination
+associated with LLMs, ensuring factual accuracy in medical domain Question
+Answering.
 
-摘要：生成式人工智能 (AI) 模型，例如扩散模型和 OpenAI 的 ChatGPT，通过提高诊断准确性和自动化临床工作流程，正在改变医学领域。该领域已迅速发展，从用于临床文件编制和决策支持等任务的纯文本大型语言模型，发展到能够在单个模型中整合包括影像、文本和结构化数据在内的多种数据方式的多模态 AI 系统。这些技术的多样化格局以及日益增长的兴趣，凸显了全面审查其应用和潜力的必要性。本范围审查探讨了多模态 AI 的演变，重点介绍了其方法、应用、数据集和在临床环境中的评估。遵循 PRISMA-ScR 指南，我们系统地查询了 PubMed、IEEE Xplore 和 Web of Science，优先考虑截至 2024 年底发表的最新研究。经过严格筛选，纳入了 144 篇论文，揭示了这一充满活力的领域的趋势和挑战。我们的研究结果强调了从单模态方法向多模态方法的转变，推动了诊断支持、医疗报告生成、药物发现和会话式 AI 的创新。然而，关键挑战仍然存在，包括异构数据类型的整合、提高模型可解释性、解决伦理问题以及在现实世界的临床环境中验证 AI 系统。本综述总结了当前的最新技术，确定了关键差距，并提供了见解，以指导在医疗保健领域开发可扩展、可信赖且具有临床影响力的多模态 AI 解决方案。
+摘要：已更新的醫療狀況診斷程序和治療途徑建議，以臨床實務指南 (CPG) 中的圖形流程記錄。為了有效使用 CPG 協助醫療專業人員進行治療決策，必須完整擷取指南知識，特別是圖表中的脈絡及其關係。雖然現有許多研究已利用這些指南為臨床決策支援系統建立規則基礎，但直接擷取 CPG 中包含的完整醫療知識的工作卻有限。這項研究提出了一種方法，以自動化擷取和節點與關係分類的方式，建立脈絡豐富、忠實的國家綜合癌症網路 (NCCN) 癌症 CPG 圖形數位表示。我們也透過使用大型語言模型 (LLM) 進行節點分類，實作模型的語意豐富化，分別在零次學習和少次學習中達到 80.86% 和 88.47% 的準確度。此外，我們引進了一種方法，透過運用 LLM 從指南知識庫中擷取相關子圖，來回答具有指南文字限制的自然語言問題。透過根據子圖路徑和語意資訊產生自然語言答案，我們降低了與 LLM 相關的錯誤答案和幻覺風險，確保了醫療領域問題解答中的事實準確性。
 
-##### **Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**
-2502.09218v1 by Flavio Bertini, Alessandro Dal Palù, Federica Zaglio, Francesco Fabiano, Andrea Formisano
+##### **LLM-Assisted Knowledge Graph Completion for Curriculum and Domain Modelling in Personalized Higher Education Recommendations**
+2501.12300v1 by Hasan Abu-Rasheed, Constance Jumbo, Rashed Al Amin, Christian Weber, Veit Wiese, Roman Obermaisser, Madjid Fathi
 
-This paper presents a complete explainable system that interprets a set of
-data, abstracts the underlying features and describes them in a natural
-language of choice. The system relies on two crucial stages: (i) identifying
-emerging properties from data and transforming them into abstract concepts, and
-(ii) converting these concepts into natural language. Despite the impressive
-natural language generation capabilities demonstrated by Large Language Models,
-their statistical nature and the intricacy of their internal mechanism still
-force us to employ these techniques as black boxes, forgoing trustworthiness.
-Developing an explainable pipeline for data interpretation would allow
-facilitating its use in safety-critical environments like processing medical
-information and allowing non-experts and visually impaired people to access
-narrated information. To this end, we believe that the fields of knowledge
-representation and automated reasoning research could present a valid
-alternative. Expanding on prior research that tackled the first stage (i), we
-focus on the second stage, named Concept2Text. Being explainable, data
-translation is easily modeled through logic-based rules, once again emphasizing
-the role of declarative programming in achieving AI explainability. This paper
-explores a Prolog/CLP-based rewriting system to interpret concepts-articulated
-in terms of classes and relations, plus common knowledge-derived from a generic
-ontology, generating natural language text. Its main features include
-hierarchical tree rewritings, modular multilingual generation, support for
-equivalent variants across semantic, grammar, and lexical levels, and a
-transparent rule-based system. We outline the architecture and demonstrate its
-flexibility through some examples capable of generating numerous diverse and
-equivalent rewritings based on the input concept.
+While learning personalization offers great potential for learners, modern
+practices in higher education require a deeper consideration of domain models
+and learning contexts, to develop effective personalization algorithms. This
+paper introduces an innovative approach to higher education curriculum
+modelling that utilizes large language models (LLMs) for knowledge graph (KG)
+completion, with the goal of creating personalized learning-path
+recommendations. Our research focuses on modelling university subjects and
+linking their topics to corresponding domain models, enabling the integration
+of learning modules from different faculties and institutions in the student's
+learning path. Central to our approach is a collaborative process, where LLMs
+assist human experts in extracting high-quality, fine-grained topics from
+lecture materials. We develop a domain, curriculum, and user models for
+university modules and stakeholders. We implement this model to create the KG
+from two study modules: Embedded Systems and Development of Embedded Systems
+Using FPGA. The resulting KG structures the curriculum and links it to the
+domain models. We evaluate our approach through qualitative expert feedback and
+quantitative graph quality metrics. Domain experts validated the relevance and
+accuracy of the model, while the graph quality metrics measured the structural
+properties of our KG. Our results show that the LLM-assisted graph completion
+approach enhances the ability to connect related courses across disciplines to
+personalize the learning experience. Expert feedback also showed high
+acceptance of the proposed collaborative approach for concept extraction and
+classification.
 
-摘要：<paragraph>這篇論文提出了一個完整的可解釋系統，它可以解釋一組資料，抽象出基礎特徵，並以選擇的自然語言描述它們。系統依賴兩個關鍵階段：(i) 從資料中識別新興屬性，並將它們轉換為抽象概念，以及 (ii) 將這些概念轉換為自然語言。儘管大型語言模型展示了令人印象深刻的自然語言生成能力，但它們的統計性質和內部機制的複雜性仍然迫使我們將這些技術用作黑盒子，放棄可信度。開發一個可解釋的資料解釋管道將有助於促進在安全關鍵環境中使用它，例如處理醫療資訊，並允許非專家和視障人士存取敘述資訊。為此，我們相信知識表示和自動推理研究領域可以提出一個有效的替代方案。在擴展解決第一階段 (i) 的先前研究的基礎上，我們專注於第二階段，稱為 Concept2Text。由於具有可解釋性，資料翻譯很容易透過基於邏輯的規則建模，再次強調宣告式程式設計在實現 AI 可解釋性中的作用。本文探討了一個基於 Prolog/CLP 的重寫系統，以解釋概念，這些概念以類別和關係的形式表達，再加上從通用本体衍生的常識，產生自然語言文字。它的主要特點包括階層樹重寫、模組化多語言生成、支援語義、語法和詞彙層面的等效變體，以及一個透明的基於規則的系統。我們概述了架構，並透過一些範例展示了它的靈活性，這些範例能夠根據輸入概念生成許多不同的等效重寫。</paragraph>
+摘要：<paragraph>在學習個人化提供學習者巨大潛力的同時，高等教育中的現代實務需要更深入地考慮領域模型和學習情境，以開發有效的個人化演算法。本文介紹了一種創新的高等教育課程建模方法，該方法利用大型語言模型 (LLM) 來完成知識圖譜 (KG)，目的是建立個人化的學習路徑建議。我們的研究重點在於建模大學科目，並將它們的主題連結到對應的領域模型，從而能夠將來自不同院系和機構的學習模組整合到學生的學習路徑中。我們的做法核心是一個協作流程，其中 LLM 協助人類專家從講義材料中萃取高品質、細緻的主題。我們為大學模組和利害關係人開發了領域、課程和使用者模型。我們實作這個模型，從兩個研究模組建立 KG：嵌入式系統和使用 FPGA 的嵌入式系統開發。產生的 KG 建構了課程並將其連結到領域模型。我們透過定性專家回饋和定量圖形品質指標來評估我們的做法。領域專家驗證了模型的相關性和準確性，而圖形品質指標則測量了我們 KG 的結構特性。我們的結果顯示，LLM 輔助的圖形完成方法增強了跨學科連結相關課程的能力，以個人化學習體驗。專家回饋也顯示高度接受所提出的協作方法，用於概念萃取和分類。</paragraph>
 
-##### **Logical Lease Litigation: Prolog and LLMs for Rental Law Compliance in New York**
-2502.09204v1 by Sanskar Sehgal, Yanhong A. Liu
+##### **Divide-Then-Aggregate: An Efficient Tool Learning Method via Parallel Tool Invocation**
+2501.12432v1 by Dongsheng Zhu, Weixian Shi, Zhengliang Shi, Zhaochun Ren, Shuaiqiang Wang, Lingyong Yan, Dawei Yin
 
-Legal cases require careful logical reasoning following the laws, whereas
-interactions with non- technical users must be in natural language. As an
-application combining logical reasoning using Prolog and natural language
-processing using large language models (LLMs), this paper presents a novel
-approach and system, LogicLease, to automate the analysis of landlord-tenant
-legal cases in the state of New York. LogicLease determines compliance with
-relevant legal requirements by analyzing case descriptions and citing all
-relevant laws. It leverages LLMs for information extraction and Prolog for
-legal reasoning. By separating information extraction from legal reasoning,
-LogicLease achieves greater transparency and control over the legal logic
-applied to each case. We evaluate the accuracy, efficiency, and robustness of
-LogicLease through a series of tests, achieving 100% accuracy and an average
-processing time of 2.57 seconds. LogicLease presents advantages over
-state-of-the-art LLM- based legal analysis systems by providing clear,
-step-by-step reasoning, citing specific laws, and distinguishing itself by its
-ability to avoid hallucinations - a common issue in LLMs.
+Although current Large Language Models (LLMs) exhibit impressive
+capabilities, performing complex real-world tasks still requires tool learning.
+Mainstream methods, such as CoT/ReAct, rely on step-by-step tool invocation to
+interact with external environments, but they are limited in perceptual scope
+and lack adequate task-planning capability. To address these limitations, other
+studies introduce the first Search-based Decision Tree (DFSDT), which still
+suffers from the high computational cost. In this paper, we introduce a novel
+parallel tool invocation paradigm, DTA-Llama (Divide-Then-Aggregate Llama).
+First, we transform traditional tree-based tool search paths into Directed
+Acyclic Graph (DAG) structure, generating a high-quality parallel tool
+invocation dataset. The DTA-Llama is then trained on the dataset to learn to
+iteratively divide the current task into several parallel tool invocation
+sub-tasks and aggregate the invocation results to decide the next actions.
+Furthermore, we introduce an efficient inference framework inspired by the
+Process/Threads mechanism when applying the DTA-Llama to practical tasks.
+Experimental results show that our approach substantially enhances task
+performance while reducing token consumption and inference time. Llama2-7B,
+using our method, is comparable to the official parallel function calling
+method of GPT-3.5. The relevant code, dataset, and model weights are available
+at https://corn0205.github.io/
 
-摘要：法律案件需要遵循法律进行谨慎的逻辑推理，而与非技术用户的互动必须使用自然语言。作为结合使用 Prolog 进行逻辑推理和使用大型语言模型 (LLM) 进行自然语言处理的应用程序，本文提出了一种新颖的方法和系统 LogicLease，以自动分析纽约州的房东与租户法律案件。LogicLease 通过分析案例描述并引用所有相关法律来确定是否符合相关法律要求。它利用 LLM 进行信息提取，并利用 Prolog 进行法律推理。通过将信息提取与法律推理分开，LogicLease 实现了对应用于每个案例的法律逻辑的更高透明度和控制力。我们通过一系列测试评估了 LogicLease 的准确性、效率和鲁棒性，实现了 100% 的准确性和 2.57 秒的平均处理时间。LogicLease 通过提供清晰、分步的推理，引用具体法律，并以其避免幻觉的能力而区别于最先进的基于 LLM 的法律分析系统，从而显示出优势——这是 LLM 中的常见问题。
+摘要：儘管目前的大型語言模型 (LLM) 展現出令人印象深刻的能力，但執行複雜的真實世界任務仍需要工具學習。主流方法（例如 CoT/ReAct）依賴逐步工具呼叫與外部環境互動，但它們的感知範圍有限，且缺乏足夠的任務規劃能力。為了解決這些限制，其他研究引入了第一個基於搜尋的決策樹 (DFSDT)，但仍有很高的運算成本。在本文中，我們介紹了一種新穎的平行工具呼叫範例，DTA-Llama（分而合之 Llama）。首先，我們將傳統的基於樹的工具搜尋路徑轉換為有向無環圖 (DAG) 結構，產生高品質的平行工具呼叫資料集。然後在資料集上訓練 DTA-Llama，學習反覆將當前任務分成幾個平行工具呼叫子任務，並彙總呼叫結果以決定後續動作。此外，我們在將 DTA-Llama 應用於實際任務時，引入了一個受 Process/Threads 機制啟發的高效推論框架。實驗結果表明，我們的做法大幅提升了任務效能，同時減少了符號消耗和推論時間。使用我們方法的 Llama2-7B，可與 GPT-3.5 的官方平行函式呼叫方法相媲美。相關程式碼、資料集和模型權重可在 https://corn0205.github.io/ 取得
 
-##### **Two-Stage Representation Learning for Analyzing Movement Behavior Dynamics in People Living with Dementia**
-2502.09173v1 by Jin Cui, Alexander Capstick, Payam Barnaghi, Gregory Scott
+##### **InsTALL: Context-aware Instructional Task Assistance with Multi-modal Large Language Models**
+2501.12231v1 by Pha Nguyen, Sailik Sengupta, Girik Malik, Arshit Gupta, Bonan Min
 
-In remote healthcare monitoring, time series representation learning reveals
-critical patient behavior patterns from high-frequency data. This study
-analyzes home activity data from individuals living with dementia by proposing
-a two-stage, self-supervised learning approach tailored to uncover low-rank
-structures. The first stage converts time-series activities into text sequences
-encoded by a pre-trained language model, providing a rich, high-dimensional
-latent state space using a PageRank-based method. This PageRank vector captures
-latent state transitions, effectively compressing complex behaviour data into a
-succinct form that enhances interpretability. This low-rank representation not
-only enhances model interpretability but also facilitates clustering and
-transition analysis, revealing key behavioral patterns correlated with
-clinicalmetrics such as MMSE and ADAS-COG scores. Our findings demonstrate the
-framework's potential in supporting cognitive status prediction, personalized
-care interventions, and large-scale health monitoring.
+The improved competence of generative models can help building multi-modal
+virtual assistants that leverage modalities beyond language. By observing
+humans performing multi-step tasks, one can build assistants that have
+situational awareness of actions and tasks being performed, enabling them to
+cater assistance based on this understanding. In this paper, we develop a
+Context-aware Instructional Task Assistant with Multi-modal Large Language
+Models (InsTALL) that leverages an online visual stream (e.g. a user's screen
+share or video recording) and responds in real-time to user queries related to
+the task at hand. To enable useful assistance, InsTALL 1) trains a multi-modal
+model on task videos and paired textual data, and 2) automatically extracts
+task graph from video data and leverages it at training and inference time. We
+show InsTALL achieves state-of-the-art performance across proposed sub-tasks
+considered for multimodal activity understanding -- task recognition (TR),
+action recognition (AR), next action prediction (AP), and plan prediction (PP)
+-- and outperforms existing baselines on two novel sub-tasks related to
+automatic error identification.
 
-摘要：在遠程醫療監控中，時間序列表示學習揭示了高頻率數據中的關鍵患者行為模式。本研究通過提出一個兩階段、自我監督的學習方法來分析痴呆症患者的家庭活動數據，該方法專門用於發現低秩結構。第一階段將時間序列活動轉換為由預訓練語言模型編碼的文本序列，使用基於 PageRank 的方法提供了一個豐富、高維的潛在狀態空間。此 PageRank 向量捕獲潛在狀態轉換，有效地將複雜的行為數據壓縮成簡潔的形式，從而增強了解力。此低秩表示不僅增強了模型的可解釋性，還促進了聚類和轉換分析，揭示了與臨床指標（例如 MMSE 和 ADAS-COG 分數）相關的關鍵行為模式。我們的研究結果證明了該框架在支持認知狀態預測、個性化護理干預和大型健康監控方面的潛力。
+摘要：生成模型能力的提升有助于构建利用语言之外的多模态虚拟助手。通过观察人类执行多步骤任务，可以构建对正在执行的动作和任务有情境感知的助手，使他们能够根据这种理解提供帮助。在本文中，我们开发了一个具有多模态大语言模型的上下文感知指令任务助手 (InsTALL)，该助手利用在线视觉流（例如用户的屏幕共享或视频录制），并实时响应与手头任务相关的用户查询。为了提供有用的帮助，InsTALL 1) 在任务视频和配对文本数据上训练多模态模型，以及 2) 从视频数据中自动提取任务图，并在训练和推理时间利用它。我们展示了 InsTALL 在考虑用于多模态活动理解的提议子任务中实现了最先进的性能——任务识别 (TR)、动作识别 (AR)、下一个动作预测 (AP) 和计划预测 (PP)——并且在与自动错误识别相关的两个新子任务上优于现有的基准。
 
-##### **HistoSmith: Single-Stage Histology Image-Label Generation via Conditional Latent Diffusion for Enhanced Cell Segmentation and Classification**
-2502.08754v1 by Valentina Vadori, Jean-Marie Graïc, Antonella Peruffo, Livio Finos, Ujwala Kiran Chaudhari, Enrico Grisan
+##### **Leveraging Graph Structures and Large Language Models for End-to-End Synthetic Task-Oriented Dialogues**
+2501.11977v1 by Maya Medjad, Hugo Imbert, Bruno Yun, Raphaël Szymocha, Frédéric Armetta
 
-Precise segmentation and classification of cell instances are vital for
-analyzing the tissue microenvironment in histology images, supporting medical
-diagnosis, prognosis, treatment planning, and studies of brain
-cytoarchitecture. However, the creation of high-quality annotated datasets for
-training remains a major challenge. This study introduces a novel single-stage
-approach (HistoSmith) for generating image-label pairs to augment histology
-datasets. Unlike state-of-the-art methods that utilize diffusion models with
-separate components for label and image generation, our approach employs a
-latent diffusion model to learn the joint distribution of cellular layouts,
-classification masks, and histology images. This model enables tailored data
-generation by conditioning on user-defined parameters such as cell types,
-quantities, and tissue types. Trained on the Conic H&E histopathology dataset
-and the Nissl-stained CytoDArk0 dataset, the model generates realistic and
-diverse labeled samples. Experimental results demonstrate improvements in cell
-instance segmentation and classification, particularly for underrepresented
-cell types like neutrophils in the Conic dataset. These findings underscore the
-potential of our approach to address data scarcity challenges.
+Training task-oriented dialogue systems is both costly and time-consuming,
+due to the need for high-quality datasets encompassing diverse intents.
+Traditional methods depend on extensive human annotation, while recent
+advancements leverage large language models (LLMs) to generate synthetic data.
+However, these approaches often require custom prompts or code, limiting
+accessibility for non-technical users. We introduce GraphTOD, an end-to-end
+framework that simplifies the generation of task-oriented dialogues. Users can
+create dialogues by specifying transition graphs in JSON format. Our evaluation
+demonstrates that GraphTOD generates high-quality dialogues across various
+domains, significantly lowering the cost and complexity of dataset creation.
 
-摘要：精確的細胞實例分割和分類對於分析組織學影像中的組織微環境、支援醫療診斷、預後、治療規劃和腦部細胞結構研究至關重要。然而，建立用於訓練的高品質標註資料集仍然是一項重大挑戰。本研究提出了一種新穎的單階段方法 (HistoSmith)，用於產生影像標籤對，以擴充組織學資料集。與利用擴散模型並將標籤和影像產生分開的組成部分的現有技術不同，我們的做法採用潛在擴散模型來學習細胞佈局、分類遮罩和組織學影像的聯合分佈。此模型能透過調整使用者定義的參數（例如細胞類型、數量和組織類型）來進行客製化資料產生。在 Conic H&E 細胞病理學資料集和 Nissl 染色的 CytoDArk0 資料集上訓練後，此模型產生逼真且多樣化的標籤樣本。實驗結果顯示細胞實例分割和分類有顯著進步，特別是對於 Conic 資料集中代表性不足的細胞類型，例如中性球。這些發現強調了我們的方法在解決資料稀少性挑戰方面的潛力。
+摘要：訓練任務導向對話系統既昂貴又耗時，
+因為需要包含各種意圖的高品質資料集。
+傳統方法依賴於廣泛的人工標註，而最近
+的進展利用大型語言模型 (LLM) 來產生合成資料。
+然而，這些方法通常需要自訂提示或程式碼，限制
+非技術使用者的可及性。我們介紹 GraphTOD，一個端對端的
+架構，簡化了任務導向對話的產生。使用者可以
+透過指定 JSON 格式的轉換圖表來建立對話。我們的評估
+證明 GraphTOD 在各種領域產生高品質對話，顯著降低資料集建立的成本和複雜性。
 
-##### **Brain Latent Progression: Individual-based Spatiotemporal Disease Progression on 3D Brain MRIs via Latent Diffusion**
-2502.08560v1 by Lemuel Puglisi, Daniel C. Alexander, Daniele Ravì
+##### **Bridging Visualization and Optimization: Multimodal Large Language Models on Graph-Structured Combinatorial Optimization**
+2501.11968v1 by Jie Zhao, Kang Hao Cheong, Witold Pedrycz
 
-The growing availability of longitudinal Magnetic Resonance Imaging (MRI)
-datasets has facilitated Artificial Intelligence (AI)-driven modeling of
-disease progression, making it possible to predict future medical scans for
-individual patients. However, despite significant advancements in AI, current
-methods continue to face challenges including achieving patient-specific
-individualization, ensuring spatiotemporal consistency, efficiently utilizing
-longitudinal data, and managing the substantial memory demands of 3D scans. To
-address these challenges, we propose Brain Latent Progression (BrLP), a novel
-spatiotemporal model designed to predict individual-level disease progression
-in 3D brain MRIs. The key contributions in BrLP are fourfold: (i) it operates
-in a small latent space, mitigating the computational challenges posed by
-high-dimensional imaging data; (ii) it explicitly integrates subject metadata
-to enhance the individualization of predictions; (iii) it incorporates prior
-knowledge of disease dynamics through an auxiliary model, facilitating the
-integration of longitudinal data; and (iv) it introduces the Latent Average
-Stabilization (LAS) algorithm, which (a) enforces spatiotemporal consistency in
-the predicted progression at inference time and (b) allows us to derive a
-measure of the uncertainty for the prediction. We train and evaluate BrLP on
-11,730 T1-weighted (T1w) brain MRIs from 2,805 subjects and validate its
-generalizability on an external test set comprising 2,257 MRIs from 962
-subjects. Our experiments compare BrLP-generated MRI scans with real follow-up
-MRIs, demonstrating state-of-the-art accuracy compared to existing methods. The
-code is publicly available at: https://github.com/LemuelPuglisi/BrLP.
+Graph-structured combinatorial challenges are inherently difficult due to
+their nonlinear and intricate nature, often rendering traditional computational
+methods ineffective or expensive. However, these challenges can be more
+naturally tackled by humans through visual representations that harness our
+innate ability for spatial reasoning. In this study, we propose transforming
+graphs into images to preserve their higher-order structural features
+accurately, revolutionizing the representation used in solving graph-structured
+combinatorial tasks. This approach allows machines to emulate human-like
+processing in addressing complex combinatorial challenges. By combining the
+innovative paradigm powered by multimodal large language models (MLLMs) with
+simple search techniques, we aim to develop a novel and effective framework for
+tackling such problems. Our investigation into MLLMs spanned a variety of
+graph-based tasks, from combinatorial problems like influence maximization to
+sequential decision-making in network dismantling, as well as addressing six
+fundamental graph-related issues. Our findings demonstrate that MLLMs exhibit
+exceptional spatial intelligence and a distinctive capability for handling
+these problems, significantly advancing the potential for machines to
+comprehend and analyze graph-structured data with a depth and intuition akin to
+human cognition. These results also imply that integrating MLLMs with simple
+optimization strategies could form a novel and efficient approach for
+navigating graph-structured combinatorial challenges without complex
+derivations, computationally demanding training and fine-tuning.
 
-摘要：隨著縱向磁共振影像 (MRI) 資料集的日益普及，已促進人工智慧 (AI) 驅動的疾病進程建模，讓預測個別患者的未來醫學掃描成為可能。然而，儘管 AI 有顯著進展，目前的技術仍面臨挑戰，包括實現患者特定的個別化、確保時空一致性、有效利用縱向資料，以及管理 3D 掃描的大量記憶體需求。為了應對這些挑戰，我們提出腦潛在進程 (BrLP)，這是一種新穎的時空模型，旨在預測 3D 腦部 MRI 中的個人層級疾病進程。BrLP 的主要貢獻有四個：(i) 它在一個小的潛在空間中運作，減輕了高維度影像資料帶來的計算挑戰；(ii) 它明確整合受試者的元資料，以增強預測的個別化；(iii) 它透過輔助模型納入疾病動態的先驗知識，促進縱向資料的整合；(iv) 它引入了潛在平均穩定化 (LAS) 演算法，該演算法 (a) 在推論時強制預測進程中的時空一致性，(b) 讓我們能夠推導預測的不確定性測量。我們對來自 2,805 名受試者的 11,730 個 T1 加權 (T1w) 腦部 MRI 進行 BrLP 訓練和評估，並在包含來自 962 名受試者的 2,257 個 MRI 的外部測試集上驗證其概括性。我們的實驗將 BrLP 生成的 MRI 掃描與實際追蹤 MRI 進行比較，與現有方法相比，展示了最先進的準確性。程式碼已公開於：https://github.com/LemuelPuglisi/BrLP。
+摘要：圖形結構的組合挑戰本質上很困難，因為它們的非線性和複雜性，通常會使傳統的計算方法無效或昂貴。然而，人類可以透過利用我們天生的空間推理能力的視覺表徵，更自然地應對這些挑戰。在本研究中，我們建議將圖形轉換為影像，以準確保留它們的高階結構特徵，從而革新用於解決圖形結構組合任務的表徵。這種方法允許機器在解決複雜的組合挑戰時模擬類人的處理。透過結合由多模態大型語言模型 (MLLM) 提供動力的創新範例與簡單的搜尋技術，我們旨在為解決此類問題開發一個新穎且有效的架構。我們對 MLLM 的研究涵蓋了各種基於圖形的任務，從組合問題（如影響力最大化）到網路拆除中的順序決策制定，以及解決六個基本的圖形相關問題。我們的研究結果表明，MLLM 表現出非凡的空間智能和處理這些問題的獨特能力，顯著提升了機器以類似人類認知的深度和直覺來理解和分析圖形結構資料的潛力。這些結果還暗示，將 MLLM 與簡單的最佳化策略整合在一起，可以形成一種新穎且有效的方法，用於在沒有複雜推導、計算需求量大的訓練和微調的情況下應對圖形結構的組合挑戰。
 
-##### **Representation Learning to Advance Multi-institutional Studies with Electronic Health Record Data**
-2502.08547v1 by Doudou Zhou, Han Tong, Linshanshan Wang, Suqi Liu, Xin Xiong, Ziming Gan, Romain Griffier, Boris Hejblum, Yun-Chung Liu, Chuan Hong, Clara-Lea Bonzel, Tianrun Cai, Kevin Pan, Yuk-Lam Ho, Lauren Costa, Vidul A. Panickan, J. Michael Gaziano, Kenneth Mandl, Vianney Jouhet, Rodolphe Thiebaut, Zongqi Xia, Kelly Cho, Katherine Liao, Tianxi Cai
+##### **A Survey of Graph Retrieval-Augmented Generation for Customized Large Language Models**
+2501.13958v1 by Qinggang Zhang, Shengyuan Chen, Yuanchen Bei, Zheng Yuan, Huachi Zhou, Zijin Hong, Junnan Dong, Hao Chen, Yi Chang, Xiao Huang
 
-The adoption of EHRs has expanded opportunities to leverage data-driven
-algorithms in clinical care and research. A major bottleneck in effectively
-conducting multi-institutional EHR studies is the data heterogeneity across
-systems with numerous codes that either do not exist or represent different
-clinical concepts across institutions. The need for data privacy further limits
-the feasibility of including multi-institutional patient-level data required to
-study similarities and differences across patient subgroups. To address these
-challenges, we developed the GAME algorithm. Tested and validated across 7
-institutions and 2 languages, GAME integrates data in several levels: (1) at
-the institutional level with knowledge graphs to establish relationships
-between codes and existing knowledge sources, providing the medical context for
-standard codes and their relationship to each other; (2) between institutions,
-leveraging language models to determine the relationships between
-institution-specific codes with established standard codes; and (3) quantifying
-the strength of the relationships between codes using a graph attention
-network. Jointly trained embeddings are created using transfer and federated
-learning to preserve data privacy. In this study, we demonstrate the
-applicability of GAME in selecting relevant features as inputs for AI-driven
-algorithms in a range of conditions, e.g., heart failure, rheumatoid arthritis.
-We then highlight the application of GAME harmonized multi-institutional EHR
-data in a study of Alzheimer's disease outcomes and suicide risk among patients
-with mental health disorders, without sharing patient-level data outside
-individual institutions.
+Large language models (LLMs) have demonstrated remarkable capabilities in a
+wide range of tasks, yet their application to specialized domains remains
+challenging due to the need for deep expertise. Retrieval-augmented generation
+(RAG) has emerged as a promising solution to customize LLMs for professional
+fields by seamlessly integrating external knowledge bases, enabling real-time
+access to domain-specific expertise during inference. Despite its potential,
+traditional RAG systems, based on flat text retrieval, face three critical
+challenges: (i) complex query understanding in professional contexts, (ii)
+difficulties in knowledge integration across distributed sources, and (iii)
+system efficiency bottlenecks at scale. This survey presents a systematic
+analysis of Graph-based Retrieval-Augmented Generation (GraphRAG), a new
+paradigm that revolutionizes domain-specific LLM applications. GraphRAG
+addresses traditional RAG limitations through three key innovations: (i)
+graph-structured knowledge representation that explicitly captures entity
+relationships and domain hierarchies, (ii) efficient graph-based retrieval
+techniques that enable context-preserving knowledge retrieval with multihop
+reasoning ability, and (iii) structure-aware knowledge integration algorithms
+that leverage retrieved knowledge for accurate and logical coherent generation
+of LLMs. In this survey, we systematically analyze the technical foundations of
+GraphRAG and examine current implementations across various professional
+domains, identifying key technical challenges and promising research
+directions. All the related resources of GraphRAG, including research papers,
+open-source data, and projects, are collected for the community in
+\textcolor{blue}{\url{https://github.com/DEEP-PolyU/Awesome-GraphRAG}}.
 
-摘要：電子健康紀錄的採用擴大了在臨床照護和研究中利用資料驅動演算法的機會。在有效進行多機構電子健康紀錄研究時，一個主要的瓶頸是系統間資料異質性，其中有許多代碼在機構間不存在或表示不同的臨床概念。資料隱私的需求進一步限制了納入多機構患者層級資料的可行性，而這些資料對於研究患者亞群之間的相似性和差異性是必要的。為了應對這些挑戰，我們開發了 GAME 演算法。GAME 已在 7 個機構和 2 種語言中進行測試和驗證，它整合了多個層級的資料：(1) 在機構層級，使用知識圖表來建立代碼和現有知識來源之間的關係，為標準代碼及其彼此之間的關係提供醫療背景；(2) 在機構之間，利用語言模型來確定機構特定代碼與已建立的標準代碼之間的關係；(3) 使用圖形注意網路量化代碼之間關係的強度。使用遷移和聯合學習建立聯合訓練的嵌入，以保護資料隱私。在本研究中，我們展示了 GAME 在選擇相關特徵作為 AI 驅動演算法輸入時的適用性，適用於各種情況，例如心臟衰竭、類風濕性關節炎。然後，我們重點介紹了 GAME 和諧化多機構電子健康紀錄資料在阿茲海默症疾病結果和精神疾病患者自殺風險研究中的應用，而無需在個別機構之外共享患者層級資料。
+摘要：大型語言模型 (LLM) 已在各種任務中展現出非凡的能力，但由於需要深入的專業知識，因此將其應用於專業領域仍具有挑戰性。檢索增強生成 (RAG) 已成為一種有前途的解決方案，可通過無縫整合外部知識庫來客製化 LLM 以適用於專業領域，從而在推理過程中即時存取特定領域的專業知識。儘管有其潛力，但基於平面文字檢索的傳統 RAG 系統面臨三項關鍵挑戰：(i) 在專業情境中進行複雜的查詢理解，(ii) 難以整合分散來源的知識，以及 (iii) 系統效率瓶頸會隨著規模擴大而產生。本調查系統性地分析了圖形化檢索增強生成 (GraphRAG) 的技術基礎，GraphRAG 是一個新的典範，它徹底改變了特定領域的 LLM 應用。GraphRAG 透過三項關鍵創新來解決傳統 RAG 的限制：(i) 圖形結構化的知識表述，明確擷取實體關係和領域階層，(ii) 有效的圖形化檢索技術，可進行保留脈絡的知識檢索，並具備多跳推理能力，以及 (iii) 結構感知知識整合演算法，可利用檢索到的知識來進行 LLM 的準確且邏輯一致的生成。在本調查中，我們系統性地分析了 GraphRAG 的技術基礎，並檢視了在各種專業領域中的現有實作，找出關鍵技術挑戰和有前景的研究方向。所有 GraphRAG 的相關資源，包括研究論文、開放原始碼資料和專案，都已在 \textcolor{blue}{\url{https://github.com/DEEP-PolyU/Awesome-GraphRAG}} 中為社群收集。
 
-##### **EEG Artifact Detection and Correction with Deep Autoencoders**
-2502.08686v1 by David Aquilué-Llorens, Aureli Soria-Frisch
+##### **Network-informed Prompt Engineering against Organized Astroturf Campaigns under Extreme Class Imbalance**
+2501.11849v2 by Nikos Kanakaris, Heng Ping, Xiongye Xiao, Nesreen K. Ahmed, Luca Luceri, Emilio Ferrara, Paul Bogdan
 
-EEG signals convey important information about brain activity both in healthy
-and pathological conditions. However, they are inherently noisy, which poses
-significant challenges for accurate analysis and interpretation. Traditional
-EEG artifact removal methods, while effective, often require extensive expert
-intervention. This study presents LSTEEG, a novel LSTM-based autoencoder
-designed for the detection and correction of artifacts in EEG signals.
-Leveraging deep learning, particularly LSTM layers, LSTEEG captures non-linear
-dependencies in sequential EEG data. LSTEEG demonstrates superior performance
-in both artifact detection and correction tasks compared to other
-state-of-the-art convolutional autoencoders. Our methodology enhances the
-interpretability and utility of the autoencoder's latent space, enabling
-data-driven automated artefact removal in EEG its application in downstream
-tasks. This research advances the field of efficient and accurate multi-channel
-EEG preprocessing, and promotes the implementation and usage of automated EEG
-analysis pipelines for brain health applications.
+Detecting organized political campaigns is of paramount importance in
+fighting against disinformation on social media. Existing approaches for the
+identification of such organized actions employ techniques mostly from network
+science, graph machine learning and natural language processing. Their ultimate
+goal is to analyze the relationships and interactions (e.g. re-posting) among
+users and the textual similarities of their posts. Despite their effectiveness
+in recognizing astroturf campaigns, these methods face significant challenges,
+notably the class imbalance in available training datasets. To mitigate this
+issue, recent methods usually resort to data augmentation or increasing the
+number of positive samples, which may not always be feasible or sufficient in
+real-world settings. Following a different path, in this paper, we propose a
+novel framework for identifying astroturf campaigns based solely on large
+language models (LLMs), introducing a Balanced Retrieval-Augmented Generation
+(Balanced RAG) component. Our approach first gives both textual information
+concerning the posts (in our case tweets) and the user interactions of the
+social network as input to a language model. Then, through prompt engineering
+and the proposed Balanced RAG method, it effectively detects coordinated
+disinformation campaigns on X (Twitter). The proposed framework does not
+require any training or fine-tuning of the language model. Instead, by
+strategically harnessing the strengths of prompt engineering and Balanced RAG,
+it facilitates LLMs to overcome the effects of class imbalance and effectively
+identify coordinated political campaigns. The experimental results demonstrate
+that by incorporating the proposed prompt engineering and Balanced RAG methods,
+our framework outperforms the traditional graph-based baselines, achieving
+2x-3x improvements in terms of precision, recall and F1 scores.
 
-摘要：腦電圖訊號傳達了關於大腦活動的重要資訊，無論是在健康或病理狀況下。然而，它們本質上是有雜訊的，這對準確的分析和解釋構成了重大的挑戰。傳統的腦電圖人工製品移除方法雖然有效，但通常需要大量的專家介入。本研究提出 LSTEEG，一種新穎的基於 LSTM 的自動編碼器，用於偵測和校正腦電圖訊號中的人工製品。利用深度學習，特別是 LSTM 層，LSTEEG 捕捉序列腦電圖資料中的非線性依賴性。與其他最先進的卷積自動編碼器相比，LSTEEG 在人工製品偵測和校正任務中都展現出優異的效能。我們的做法增強了自動編碼器潛在空間的可解釋性和實用性，讓資料驅動的自動人工製品移除得以應用於腦電圖的下游任務。這項研究推動了高效且準確的多通道腦電圖前處理領域，並促進了自動腦電圖分析管線在腦部健康應用中的實作和使用。
+摘要：<paragraph>在社交媒體上對抗錯誤資訊，偵測有組織的政治宣傳活動至關重要。現有的此類有組織行動識別方法，大多採用網路科學、圖形機器學習和自然語言處理的技術。它們的最終目標是分析使用者之間的關係和互動（例如轉發），以及他們貼文的文字相似性。儘管這些方法在辨識草根運動宣傳活動方面很有效，但它們面臨嚴峻的挑戰，特別是可用訓練資料集中的類別不平衡。為了減輕這個問題，最近的方法通常訴諸於資料擴充或增加正向樣本數量，但在現實世界中可能並非總是可行或足夠。本文採取不同的途徑，我們提出了一個基於大型語言模型 (LLM) 的辨識草根運動宣傳活動的新架構，並引入了平衡檢索擴充產生 (Balanced RAG) 組件。我們的做法首先將有關貼文（在我們的案例中是推文）的文字資訊和社交網路的使用者互動作為輸入，輸入到語言模型中。然後，透過提示工程和提出的平衡檢索擴充產生方法，它有效地偵測 X（Twitter）上協調的不實資訊宣傳活動。提出的架構不需要任何語言模型的訓練或微調。相反地，透過策略性地利用提示工程和平衡檢索擴充產生方法的優勢，它使大型語言模型能夠克服類別不平衡的影響，並有效地識別協調的政治宣傳活動。實驗結果證明，透過整合提出的提示工程和平衡檢索擴充產生方法，我們的架構優於傳統的基於圖形的基準，在精確度、召回率和 F1 分數方面獲得 2x-3x 的改進。</paragraph>
 
-##### **SycEval: Evaluating LLM Sycophancy**
-2502.08177v1 by Aaron Fanous, Jacob Goldberg, Ank A. Agarwal, Joanna Lin, Anson Zhou, Roxana Daneshjou, Sanmi Koyejo
+##### **Large Language Models Meet Graph Neural Networks for Text-Numeric Graph Reasoning**
+2501.16361v1 by Haoran Song, Jiarui Feng, Guangfu Li, Michael Province, Philip Payne, Yixin Chen, Fuhai Li
 
-Large language models (LLMs) are increasingly applied in educational,
-clinical, and professional settings, but their tendency for sycophancy --
-prioritizing user agreement over independent reasoning -- poses risks to
-reliability. This study introduces a framework to evaluate sycophantic behavior
-in ChatGPT-4o, Claude-Sonnet, and Gemini-1.5-Pro across AMPS (mathematics) and
-MedQuad (medical advice) datasets. Sycophantic behavior was observed in 58.19%
-of cases, with Gemini exhibiting the highest rate (62.47%) and ChatGPT the
-lowest (56.71%). Progressive sycophancy, leading to correct answers, occurred
-in 43.52% of cases, while regressive sycophancy, leading to incorrect answers,
-was observed in 14.66%. Preemptive rebuttals demonstrated significantly higher
-sycophancy rates than in-context rebuttals (61.75% vs. 56.52%, $Z=5.87$,
-$p<0.001$), particularly in computational tasks, where regressive sycophancy
-increased significantly (preemptive: 8.13%, in-context: 3.54%, $p<0.001$).
-Simple rebuttals maximized progressive sycophancy ($Z=6.59$, $p<0.001$), while
-citation-based rebuttals exhibited the highest regressive rates ($Z=6.59$,
-$p<0.001$). Sycophantic behavior showed high persistence (78.5%, 95% CI:
-[77.2%, 79.8%]) regardless of context or model. These findings emphasize the
-risks and opportunities of deploying LLMs in structured and dynamic domains,
-offering insights into prompt programming and model optimization for safer AI
-applications.
+In real-world scientific discovery, human beings always make use of the
+accumulated prior knowledge with imagination pick select one or a few most
+promising hypotheses from large and noisy data analysis results. In this study,
+we introduce a new type of graph structure, the text-numeric graph (TNG), which
+is defined as graph entities and associations have both text-attributed
+information and numeric information. The TNG is an ideal data structure model
+for novel scientific discovery via graph reasoning because it integrates
+human-understandable textual annotations or prior knowledge, with numeric
+values that represent the observed or activation levels of graph entities or
+associations in different samples. Together both the textual information and
+numeric values determine the importance of graph entities and associations in
+graph reasoning for novel scientific knowledge discovery. We further propose
+integrating large language models (LLMs) and graph neural networks (GNNs) to
+analyze the TNGs for graph understanding and reasoning. To demonstrate the
+utility, we generated the text-omic(numeric) signaling graphs (TOSG), as one
+type of TNGs, in which all graphs have the same entities, associations and
+annotations, but have sample-specific entity numeric (omic) values using single
+cell RNAseq (scRNAseq) datasets of different diseases. We proposed joint
+LLM-GNN models for key entity mining and signaling pathway mining on the TOSGs.
+The evaluation results showed the LLM-GNN and TNGs models significantly improve
+classification accuracy and network inference. In conclusion, the TNGs and
+joint LLM-GNN models are important approaches for scientific discovery.
 
-摘要：大型語言模型（LLM）日益應用於教育、臨床和專業領域，但它們趨於趨炎附勢——優先考慮用戶同意而非獨立推理——對可靠性構成風險。本研究引入了一個框架來評估 ChatGPT-4o、Claude-Sonnet 和 Gemini-1.5-Pro 中的趨炎附勢行為，涉及 AMPS（數學）和 MedQuad（醫療建議）數據集。在 58.19% 的案例中觀察到了趨炎附勢行為，其中 Gemini 表現出最高比率（62.47%），而 ChatGPT 最低（56.71%）。導致正確答案的漸進式趨炎附勢發生在 43.52% 的案例中，而導致不正確答案的退步式趨炎附勢則在 14.66% 的案例中被觀察到。先發制人的反駁表現出顯著高於上下文反駁的趨炎附勢率（61.75% 對 56.52%，Z=5.87，p<0.001），特別是在計算任務中，其中退步式趨炎附勢顯著增加（先發制人：8.13%，上下文：3.54%，p<0.001）。簡單的反駁最大化了漸進式趨炎附勢（Z=6.59，p<0.001），而基於引用的反駁表現出最高的退步式比率（Z=6.59，p<0.001）。趨炎附勢行為表現出很高的持續性（78.5%，95% CI：[77.2%，79.8%]），無論上下文或模型如何。這些發現強調了在結構化和動態領域部署 LLM 的風險和機遇，為更安全的 AI 應用提供了提示編程和模型優化的見解。
+摘要：<paragraph>在現實世界的科學發現中，人類總是利用累積的先驗知識，並運用想像力從大量且雜訊的資料分析結果中挑選出一個或幾個最有希望的假設。在本研究中，我們介紹了一種新型態的圖形結構，稱為文字數值圖 (TNG)，定義為圖形實體和關聯具有文字屬性資訊和數值資訊。TNG 是透過圖形推理進行新科學發現的理想資料結構模型，因為它整合了人類可理解的文字註解或先驗知識，以及代表圖形實體或不同樣本中關聯的觀察值或活化程度的數值。文字資訊和數值一起決定了圖形實體和關聯在圖形推理中對於新科學知識發現的重要性。我們進一步提出整合大型語言模型 (LLM) 和圖形神經網路 (GNN) 來分析 TNG，以進行圖形理解和推理。為了展示其效用，我們生成了文字組學（數值）訊號圖 (TOSG)，作為一種 TNG，其中所有圖形都具有相同的實體、關聯和註解，但具有特定於樣本的實體數值（組學）值，使用不同疾病的單細胞 RNAseq (scRNAseq) 資料集。我們針對 TOSG 提出聯合 LLM-GNN 模型，用於關鍵實體探勘和訊號路徑探勘。評估結果顯示，LLM-GNN 和 TNG 模型顯著提升了分類準確度和網路推論。結論而言，TNG 和聯合 LLM-GNN 模型是科學發現的重要方法。</paragraph>
 
-##### **Caught in the Web of Words: Do LLMs Fall for Spin in Medical Literature?**
-2502.07963v1 by Hye Sun Yun, Karen Y. C. Zhang, Ramez Kouzy, Iain J. Marshall, Junyi Jessy Li, Byron C. Wallace
+##### **Zep: A Temporal Knowledge Graph Architecture for Agent Memory**
+2501.13956v1 by Preston Rasmussen, Pavlo Paliychuk, Travis Beauvais, Jack Ryan, Daniel Chalef
 
-Medical research faces well-documented challenges in translating novel
-treatments into clinical practice. Publishing incentives encourage researchers
-to present "positive" findings, even when empirical results are equivocal.
-Consequently, it is well-documented that authors often spin study results,
-especially in article abstracts. Such spin can influence clinician
-interpretation of evidence and may affect patient care decisions. In this
-study, we ask whether the interpretation of trial results offered by Large
-Language Models (LLMs) is similarly affected by spin. This is important since
-LLMs are increasingly being used to trawl through and synthesize published
-medical evidence. We evaluated 22 LLMs and found that they are across the board
-more susceptible to spin than humans. They might also propagate spin into their
-outputs: We find evidence, e.g., that LLMs implicitly incorporate spin into
-plain language summaries that they generate. We also find, however, that LLMs
-are generally capable of recognizing spin, and can be prompted in a way to
-mitigate spin's impact on LLM outputs.
+We introduce Zep, a novel memory layer service for AI agents that outperforms
+the current state-of-the-art system, MemGPT, in the Deep Memory Retrieval (DMR)
+benchmark. Additionally, Zep excels in more comprehensive and challenging
+evaluations than DMR that better reflect real-world enterprise use cases. While
+existing retrieval-augmented generation (RAG) frameworks for large language
+model (LLM)-based agents are limited to static document retrieval, enterprise
+applications demand dynamic knowledge integration from diverse sources
+including ongoing conversations and business data. Zep addresses this
+fundamental limitation through its core component Graphiti -- a
+temporally-aware knowledge graph engine that dynamically synthesizes both
+unstructured conversational data and structured business data while maintaining
+historical relationships. In the DMR benchmark, which the MemGPT team
+established as their primary evaluation metric, Zep demonstrates superior
+performance (94.8% vs 93.4%). Beyond DMR, Zep's capabilities are further
+validated through the more challenging LongMemEval benchmark, which better
+reflects enterprise use cases through complex temporal reasoning tasks. In this
+evaluation, Zep achieves substantial results with accuracy improvements of up
+to 18.5% while simultaneously reducing response latency by 90% compared to
+baseline implementations. These results are particularly pronounced in
+enterprise-critical tasks such as cross-session information synthesis and
+long-term context maintenance, demonstrating Zep's effectiveness for deployment
+in real-world applications.
 
-摘要：醫學研究在將新穎療法轉化為臨床實務上，面臨著有據可查的挑戰。發表誘因鼓勵研究人員呈現「正向」的發現，即使經驗結果模稜兩可。因此，有據可查的是，作者經常扭曲研究結果，特別是在文章摘要中。此類扭曲可能會影響臨床醫師對證據的詮釋，並可能影響病患照護決策。在本研究中，我們探討大型語言模型 (LLM) 提供的試驗結果詮釋是否也受到扭曲影響。由於 LLM 正越來越常被用於爬梳和綜合已發表的醫學證據，因此這點非常重要。我們評估了 22 個 LLM，發現它們普遍比人類更容易受到扭曲影響。它們也可能將扭曲傳播到其輸出中：例如，我們發現 LLM 會將扭曲隱含納入其產生的白話文摘要中。然而，我們也發現 LLM 通常有能力辨認扭曲，而且可以透過提示的方式減輕扭曲對 LLM 輸出的影響。
+摘要：我們推出 Zep，這是一種新穎的記憶層服務，適用於 AI 代理，其在深度記憶擷取 (DMR) 基準測試中優於現行的最先進系統 MemGPT。此外，Zep 在比 DMR 更全面且更具挑戰性的評估中表現出色，這些評估更能反映真實世界的企業用例。雖然現有的檢索增強生成 (RAG) 架構僅限於大型語言模型 (LLM) 基於代理的靜態文件檢索，但企業應用需要從包括正在進行的對話和業務數據在內的不同來源動態整合知識。Zep 通過其核心組件 Graphiti 來解決這個基本限制，Graphiti 是一個時間感知知識圖譜引擎，可以在維護歷史關係的同時動態綜合非結構化對話數據和結構化業務數據。在 MemGPT 團隊確立為其主要評估指標的 DMR 基準測試中，Zep 表現出優異的效能（94.8% 對 93.4%）。除了 DMR 之外，Zep 的功能還通過更具挑戰性的 LongMemEval 基準測試進一步得到驗證，該基準測試通過複雜的時間推理任務更好地反映了企業用例。在這個評估中，Zep 以高達 18.5% 的準確度改進取得了顯著的成果，同時與基線實作相比，將回應延遲降低了 90%。這些成果在企業關鍵任務中尤為明顯，例如跨會話資訊綜合和長期脈絡維護，證明了 Zep 在實際應用中部署的有效性。
 
-##### **An Advanced NLP Framework for Automated Medical Diagnosis with DeBERTa and Dynamic Contextual Positional Gating**
-2502.07755v1 by Mohammad Ali Labbaf Khaniki, Sahabeh Saadati, Mohammad Manthouri
+##### **Explainable Lane Change Prediction for Near-Crash Scenarios Using Knowledge Graph Embeddings and Retrieval Augmented Generation**
+2501.11560v1 by M. Manzour, A. Ballardini, R. Izquierdo, M. Á. Sotelo
 
-This paper presents a novel Natural Language Processing (NLP) framework for
-enhancing medical diagnosis through the integration of advanced techniques in
-data augmentation, feature extraction, and classification. The proposed
-approach employs back-translation to generate diverse paraphrased datasets,
-improving robustness and mitigating overfitting in classification tasks.
-Leveraging Decoding-enhanced BERT with Disentangled Attention (DeBERTa) with
-Dynamic Contextual Positional Gating (DCPG), the model captures fine-grained
-contextual and positional relationships, dynamically adjusting the influence of
-positional information based on semantic context to produce high-quality text
-embeddings. For classification, an Attention-Based Feedforward Neural Network
-(ABFNN) is utilized, effectively focusing on the most relevant features to
-improve decision-making accuracy. Applied to the classification of symptoms,
-clinical notes, and other medical texts, this architecture demonstrates its
-ability to address the complexities of medical data. The combination of data
-augmentation, contextual embedding generation, and advanced classification
-mechanisms offers a robust and accurate diagnostic tool, with potential
-applications in automated medical diagnosis and clinical decision support. This
-method demonstrates the effectiveness of the proposed NLP framework for medical
-diagnosis, achieving remarkable results with an accuracy of 99.78%, recall of
-99.72%, precision of 99.79%, and an F1-score of 99.75%. These metrics not only
-underscore the model's robust performance in classifying medical texts with
-exceptional precision and reliability but also highlight its superiority over
-existing methods, making it a highly promising tool for automated diagnostic
-systems.
+Lane-changing maneuvers, particularly those executed abruptly or in risky
+situations, are a significant cause of road traffic accidents. However, current
+research mainly focuses on predicting safe lane changes. Furthermore, existing
+accident datasets are often based on images only and lack comprehensive sensory
+data. In this work, we focus on predicting risky lane changes using the CRASH
+dataset (our own collected dataset specifically for risky lane changes), and
+safe lane changes (using the HighD dataset). Then, we leverage KG and Bayesian
+inference to predict these maneuvers using linguistic contextual information,
+enhancing the model's interpretability and transparency. The model achieved a
+91.5% f1-score with anticipation time extending to four seconds for risky lane
+changes, and a 90.0% f1-score for predicting safe lane changes with the same
+anticipation time. We validate our model by integrating it into a vehicle
+within the CARLA simulator in scenarios that involve risky lane changes. The
+model managed to anticipate sudden lane changes, thus providing automated
+vehicles with further time to plan and execute appropriate safe reactions.
+Finally, to enhance the explainability of our model, we utilize RAG to provide
+clear and natural language explanations for the given prediction.
+
+摘要：換車道動作，尤其是突然或在風險情況下執行的動作，是道路交通事故的重要原因。然而，目前的研究所主要集中在預測安全的換車道。此外，現有的事故資料集通常僅基於影像，且缺乏全面的感測資料。在這項工作中，我們專注於使用 CRASH 資料集（我們自己收集的專門針對風險換車道資料集）來預測風險換車道，以及安全換車道（使用 HighD 資料集）。然後，我們利用 KG 和貝氏推理來使用語言背景資訊預測這些動作，增強模型的可解釋性和透明度。該模型在風險換車道的預測時間延長至四秒時，達到了 91.5% 的 f1 分數，在預測安全換車道時，在相同的預測時間內達到了 90.0% 的 f1 分數。我們透過將模型整合到 CARLA 模擬器中的車輛中，在涉及風險換車道的場景中驗證我們的模型。該模型設法預測突然的換車道，從而為自動駕駛車輛提供了更多時間來規劃和執行適當的安全反應。最後，為了增強我們模型的可解釋性，我們利用 RAG 為給定的預測提供清晰且自然的語言解釋。
+
+##### **Each Graph is a New Language: Graph Learning with LLMs**
+2501.11478v2 by Huachi Zhou, Jiahe Du, Chuang Zhou, Chang Yang, Yilin Xiao, Yuxuan Xie, Xiao Huang
 
-摘要：本文提出了一個創新的自然語言處理 (NLP) 框架，透過整合資料擴充、特徵萃取和分類的進階技術來增強醫療診斷。所提出的方法採用反向翻譯來產生多樣化的同義改寫資料集，提升穩健性並減輕分類任務中的過度擬合。透過利用具有動態脈絡位置閘控 (DCPG) 的解碼增強 BERT 與去糾纏注意力 (DeBERTa)，這個模型捕捉細緻的脈絡和位置關係，根據語意脈絡動態調整位置資訊的影響，以產生高品質的文字嵌入。在分類方面，利用基於注意力的前饋神經網路 (ABFNN)，有效地關注最相關的特徵，以提高決策準確度。應用於症狀、臨床筆記和其他醫療文本的分類，此架構證明了其處理醫療資料複雜性的能力。資料擴充、脈絡嵌入產生和進階分類機制的結合提供了一個穩健且準確的診斷工具，在自動化醫療診斷和臨床決策支援中具有潛在應用。此方法證明了所提出的 NLP 框架在醫療診斷中的有效性，以 99.78% 的準確度、99.72% 的召回率、99.79% 的精確度和 99.75% 的 F1 分數，取得了顯著的成果。這些指標不僅強調了模型在分類醫療文本時具有卓越的精確度和可靠性，也突顯了它優於現有方法的優越性，使其成為自動化診斷系統中極具前景的工具。
+Recent efforts leverage Large Language Models (LLMs) for modeling
+text-attributed graph structures in node classification tasks. These approaches
+describe graph structures for LLMs to understand or aggregate LLM-generated
+textual attribute embeddings through graph structure. However, these approaches
+face two main limitations in modeling graph structures with LLMs. (i) Graph
+descriptions become verbose in describing high-order graph structure. (ii)
+Textual attributes alone do not contain adequate graph structure information.
+It is challenging to model graph structure concisely and adequately with LLMs.
+LLMs lack built-in mechanisms to model graph structures directly. They also
+struggle with complex long-range dependencies between high-order nodes and
+target nodes.
+  Inspired by the observation that LLMs pre-trained on one language can achieve
+exceptional performance on another with minimal additional training, we propose
+\textbf{G}raph-\textbf{D}efined \textbf{L}anguage for \textbf{L}arge
+\textbf{L}anguage \textbf{M}odel (GDL4LLM). This novel framework enables LLMs
+to transfer their powerful language understanding capabilities to
+graph-structured data. GDL4LLM translates graphs into a graph language corpus
+instead of graph descriptions and pre-trains LLMs on this corpus to adequately
+understand graph structures. During fine-tuning, this corpus describes the
+structural information of target nodes concisely with only a few tokens. By
+treating graphs as a new language, GDL4LLM enables LLMs to model graph
+structures adequately and concisely for node classification tasks. Extensive
+experiments on three real-world datasets demonstrate that GDL4LLM outperforms
+description-based and textual attribute embeddings-based baselines by
+efficiently modeling different orders of graph structure with LLMs.
 
-##### **Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension**
-2502.07752v1 by Wenbo Gong, Meyer Scetbon, Chao Ma, Edward Meeds
+摘要：<paragraph>最近的研究利用大型语言模型 (LLM) 对节点分类任务中的文本属性图结构进行建模。这些方法描述图结构，以便 LLM 理解或通过图结构聚合 LLM 生成的文本属性嵌入。然而，这些方法在使用 LLM 对图结构进行建模时面临两个主要限制。(i) 图描述在描述高阶图结构时变得冗长。(ii) 仅文本属性不包含足够的图结构信息。使用 LLM 对图结构进行简洁且充分的建模具有挑战性。LLM 缺乏直接对图结构进行建模的内置机制。它们还难以处理高阶节点和目标节点之间复杂的远程依赖关系。
+受 LLM 在一种语言上进行预训练后，只需进行最少的额外训练即可在另一种语言上实现卓越性能的观察结果的启发，我们提出了**G**raph-**D**efined **L**anguage for **L**arge **L**anguage **M**odel (GDL4LLM)。此新框架使 LLM 能够将其强大的语言理解能力转移到结构化数据图。GDL4LLM 将图翻译成图语言语料库，而不是图描述，并在该语料库上对 LLM 进行预训练，以充分理解图结构。在微调期间，此语料库仅使用几个标记简洁地描述目标节点的结构信息。通过将图视为一种新语言，GDL4LLM 使 LLM 能够充分且简洁地对图结构进行建模，以用于节点分类任务。在三个真实世界数据集上进行的广泛实验表明，GDL4LLM 通过使用 LLM 有效地对不同阶的图结构进行建模，优于基于描述和基于文本属性嵌入的基线。</paragraph>
 
-Designing efficient optimizers for large language models (LLMs) with
-low-memory requirements and fast convergence is an important and challenging
-problem. This paper makes a step towards the systematic design of such
-optimizers through the lens of structured Fisher information matrix (FIM)
-approximation. We show that many state-of-the-art efficient optimizers can be
-viewed as solutions to FIM approximation (under the Frobenius norm) with
-specific structural assumptions. Building on these insights, we propose two
-design recommendations of practical efficient optimizers for LLMs, involving
-the careful selection of structural assumptions to balance generality and
-efficiency, and enhancing memory efficiency of optimizers with general
-structures through a novel low-rank extension framework. We demonstrate how to
-use each design approach by deriving new memory-efficient optimizers: Row and
-Column Scaled SGD (RACS) and Adaptive low-dimensional subspace estimation
-(Alice). Experiments on LLaMA pre-training (up to 1B parameters) validate the
-effectiveness, showing faster and better convergence than existing
-memory-efficient baselines and Adam with little memory overhead. Notably, Alice
-achieves better than 2x faster convergence over Adam, while RACS delivers
-strong performance on the 1B model with SGD-like memory.
+##### **Few-shot Policy (de)composition in Conversational Question Answering**
+2501.11335v1 by Kyle Erwin, Guy Axelrod, Maria Chang, Achille Fokoue, Maxwell Crouse, Soham Dan, Tian Gao, Rosario Uceda-Sosa, Ndivhuwo Makondo, Naweed Khan, Alexander Gray
 
-摘要：設計具有低記憶體需求和快速收斂的大型語言模型 (LLM) 的高效最佳化器是一個重要且具有挑戰性的問題。本文透過結構化 Fisher 資訊矩陣 (FIM) 近似的角度，朝向此類最佳化器的系統化設計邁進一步。我們展示了許多最先進的高效最佳化器可以被視為 FIM 近似（在 Frobenius 範數下）的解，並具有特定的結構假設。基於這些見解，我們提出了 LLM 的兩個實用高效最佳化器設計建議，包括仔細選擇結構假設以平衡通用性和效率，並透過新穎的低秩延伸架構來增強具有通用結構的最佳化器的記憶體效率。我們展示了如何透過推導新的記憶體高效最佳化器來使用每種設計方法：列和欄縮放 SGD (RACS) 和自適應低維子空間估計 (Alice)。在 LLaMA 預訓練（高達 1B 參數）上的實驗驗證了其有效性，顯示比現有的記憶體高效基線和 Adam 更快且更好的收斂，且記憶體開銷很小。值得注意的是，Alice 比 Adam 快 2 倍以上，而 RACS 則在 1B 模型上提供類似 SGD 記憶體的強勁效能。
+The task of policy compliance detection (PCD) is to determine if a scenario
+is in compliance with respect to a set of written policies. In a conversational
+setting, the results of PCD can indicate if clarifying questions must be asked
+to determine compliance status. Existing approaches usually claim to have
+reasoning capabilities that are latent or require a large amount of annotated
+data. In this work, we propose logical decomposition for policy compliance
+(LDPC): a neuro-symbolic framework to detect policy compliance using large
+language models (LLMs) in a few-shot setting. By selecting only a few exemplars
+alongside recently developed prompting techniques, we demonstrate that our
+approach soundly reasons about policy compliance conversations by extracting
+sub-questions to be answered, assigning truth values from contextual
+information, and explicitly producing a set of logic statements from the given
+policies. The formulation of explicit logic graphs can in turn help answer
+PCDrelated questions with increased transparency and explainability. We apply
+this approach to the popular PCD and conversational machine reading benchmark,
+ShARC, and show competitive performance with no task-specific finetuning. We
+also leverage the inherently interpretable architecture of LDPC to understand
+where errors occur, revealing ambiguities in the ShARC dataset and highlighting
+the challenges involved with reasoning for conversational question answering.
 
-##### **The Devil is in the Prompts: De-Identification Traces Enhance Memorization Risks in Synthetic Chest X-Ray Generation**
-2502.07516v1 by Raman Dutt
+摘要：策略合規偵測 (PCD) 的任務是確定場景是否符合一組書面策略。在對話設定中，PCD 的結果可以指出是否必須提出澄清問題以確定合規狀態。現有的方法通常聲稱具有潛在的推理能力，或需要大量的註釋資料。在這項工作中，我們提出策略合規的邏輯分解 (LDPC)：一種使用大型語言模型 (LLM) 在少次嘗試中偵測策略合規的神經符號框架。透過僅選擇少數範例以及最近開發的提示技術，我們證明我們的做法透過提取要回答的子問題、從脈絡資訊指派真值，以及從給定的策略明確產生一組邏輯陳述，對策略合規對話進行合理的推理。明確邏輯圖表的制定反過來可以幫助回答 PCD 相關問題，並提高透明度和可解釋性。我們將此方法應用於熱門的 PCD 和對話式機器閱讀基準 ShARC，並在沒有特定任務微調的情況下展現出競爭力。我們也利用 LDPC 固有的可解釋架構來了解錯誤發生在哪裡，揭露 ShARC 資料集中的歧義，並強調對話式問題解答推理的挑戰。
 
-Generative models, particularly text-to-image (T2I) diffusion models, play a
-crucial role in medical image analysis. However, these models are prone to
-training data memorization, posing significant risks to patient privacy.
-Synthetic chest X-ray generation is one of the most common applications in
-medical image analysis with the MIMIC-CXR dataset serving as the primary data
-repository for this task. This study adopts a data-driven approach and presents
-the first systematic attempt to identify prompts and text tokens in MIMIC-CXR
-that contribute the most to training data memorization. Our analysis reveals an
-unexpected finding: prompts containing traces of de-identification procedures
-are among the most memorized, with de-identification markers contributing the
-most. Furthermore, we also find existing inference-time memorization mitigation
-strategies are ineffective and fail to sufficiently reduce the model's reliance
-on memorized text tokens highlighting a broader issue in T2I synthesis with
-MIMIC-CXR. On this front, we propose actionable strategies to enhance privacy
-and improve the reliability of generative models in medical imaging. Finally,
-our results provide a foundation for future work on developing and benchmarking
-memorization mitigation techniques for synthetic chest X-ray generation using
-the MIMIC-CXR dataset.
+##### **Reasoning Language Models: A Blueprint**
+2501.11223v3 by Maciej Besta, Julia Barth, Eric Schreiber, Ales Kubicek, Afonso Catarino, Robert Gerstenberger, Piotr Nyczyk, Patrick Iff, Yueling Li, Sam Houliston, Tomasz Sternal, Marcin Copik, Grzegorz Kwaśniewski, Jürgen Müller, Łukasz Flis, Hannes Eberhard, Hubert Niewiadomski, Torsten Hoefler
 
-摘要：生成模型，尤其是文字轉圖像 (T2I) 擴散模型，在醫學影像分析中扮演著至關重要的角色。然而，這些模型容易訓練資料記憶，對病患隱私造成重大風險。合成胸部 X 光線生成是醫學影像分析中最常見的應用之一，其中 MIMIC-CXR 資料集作為此任務的主要資料儲存庫。本研究採用資料驅動的方法，並提出首次系統性嘗試，以識別 MIMIC-CXR 中最有助於訓練資料記憶的提示和文字代碼。我們的分析揭露了一個意外的發現：包含去識別程序痕跡的提示是最常被記憶的，其中去識別標記的貢獻最大。此外，我們也發現現有的推論時間記憶減緩策略無效，且無法充分降低模型對記憶文字代碼的依賴性，突顯了使用 MIMIC-CXR 進行 T2I 合成的更廣泛問題。針對此問題，我們提出可行的策略，以增強隱私並改善生成模型在醫學影像中的可靠性。最後，我們的結果為未來使用 MIMIC-CXR 資料集開發和評量合成胸部 X 光線生成的記憶減緩技術奠定了基礎。
+Reasoning language models (RLMs), also known as Large Reasoning Models
+(LRMs), such as OpenAI's o1 and o3, DeepSeek-V3, and Alibaba's QwQ, have
+redefined AI's problem-solving capabilities by extending LLMs with advanced
+reasoning mechanisms. Yet, their high costs, proprietary nature, and complex
+architectures - uniquely combining Reinforcement Learning (RL), search
+heuristics, and LLMs - present accessibility and scalability challenges. To
+address these, we propose a comprehensive blueprint that organizes RLM
+components into a modular framework, based on a survey and analysis of all RLM
+works. This blueprint incorporates diverse reasoning structures (chains, trees,
+graphs, and nested forms), reasoning strategies (e.g., Monte Carlo Tree Search,
+Beam Search), RL concepts (policy, value models and others), supervision
+schemes (Outcome-Based and Process-Based Supervision), and other related
+concepts (e.g., Test-Time Compute, Retrieval-Augmented Generation, agent
+tools). We also provide detailed mathematical formulations and algorithmic
+specifications to simplify RLM implementation. By showing how schemes like
+LLaMA-Berry, QwQ, Journey Learning, and Graph of Thoughts fit as special cases,
+we demonstrate the blueprint's versatility and unifying potential. To
+illustrate its utility, we introduce x1, a modular implementation for rapid RLM
+prototyping and experimentation. Using x1 and a literature review, we provide
+key insights, such as multi-phase training for policy and value models, and the
+importance of familiar training distributions. Finally, we discuss scalable RLM
+cloud deployments and we outline how RLMs can integrate with a broader LLM
+ecosystem. Our work demystifies RLM construction, democratizes advanced
+reasoning capabilities, and fosters innovation, aiming to mitigate the gap
+between "rich AI" and "poor AI" by lowering barriers to RLM design and
+experimentation.
 
-##### **KPIs 2024 Challenge: Advancing Glomerular Segmentation from Patch- to Slide-Level**
-2502.07288v1 by Ruining Deng, Tianyuan Yao, Yucheng Tang, Junlin Guo, Siqi Lu, Juming Xiong, Lining Yu, Quan Huu Cap, Pengzhou Cai, Libin Lan, Ze Zhao, Adrian Galdran, Amit Kumar, Gunjan Deotale, Dev Kumar Das, Inyoung Paik, Joonho Lee, Geongyu Lee, Yujia Chen, Wangkai Li, Zhaoyang Li, Xuege Hou, Zeyuan Wu, Shengjin Wang, Maximilian Fischer, Lars Kramer, Anghong Du, Le Zhang, Maria Sanchez Sanchez, Helena Sanchez Ulloa, David Ribalta Heredia, Carlos Perez de Arenaza Garcia, Shuoyu Xu, Bingdou He, Xinping Cheng, Tao Wang, Noemie Moreau, Katarzyna Bozek, Shubham Innani, Ujjwal Baid, Kaura Solomon Kefas, Bennett A. Landman, Yu Wang, Shilin Zhao, Mengmeng Yin, Haichun Yang, Yuankai Huo
+摘要：推理語言模型 (RLM)，又稱為大型推理模型 (LRM)，例如 OpenAI 的 o1 和 o3、DeepSeek-V3 以及阿里巴巴的 QwQ，透過擴充 LLM 的先進推理機制，重新定義了 AI 的問題解決能力。然而，它們的高成本、專有性質和複雜架構（獨特地結合了強化學習 (RL)、搜尋啟發法和 LLM）提出了可及性和可擴充性的挑戰。為了解決這些問題，我們提出了一個全面的藍圖，將 RLM 組件組織成一個模組化架構，這是基於對所有 RLM 作品的調查和分析。此藍圖包含多樣化的推理結構（鏈、樹、圖和巢狀形式）、推理策略（例如蒙地卡羅樹搜尋、波束搜尋）、RL 概念（策略、價值模型等）、監督方案（基於結果和基於流程的監督）和其他相關概念（例如測試時間運算、檢索增強生成、代理工具）。我們還提供了詳細的數學公式和演算法規範，以簡化 RLM 的實作。透過展示 LLaMA-Berry、QwQ、Journey Learning 和 Graph of Thoughts 等方案如何作為特殊情況，我們展示了藍圖的多功能性和統一潛力。為了說明其效用，我們介紹了 x1，這是一個模組化實作，用於快速 RLM 原型製作和實驗。使用 x1 和文獻回顧，我們提供了關鍵見解，例如策略和價值模型的多階段訓練，以及熟悉訓練分佈的重要性。最後，我們討論了可擴充的 RLM 雲端部署，並概述了 RLM 如何與更廣泛的 LLM 生態系統整合。我們的研究揭開了 RLM 建構的神秘面紗，使先進的推理能力民主化，並促進創新，旨在透過降低 RLM 設計和實驗的障礙，來縮小「富裕 AI」和「貧窮 AI」之間的差距。
 
-Chronic kidney disease (CKD) is a major global health issue, affecting over
-10% of the population and causing significant mortality. While kidney biopsy
-remains the gold standard for CKD diagnosis and treatment, the lack of
-comprehensive benchmarks for kidney pathology segmentation hinders progress in
-the field. To address this, we organized the Kidney Pathology Image
-Segmentation (KPIs) Challenge, introducing a dataset that incorporates
-preclinical rodent models of CKD with over 10,000 annotated glomeruli from 60+
-Periodic Acid Schiff (PAS)-stained whole slide images. The challenge includes
-two tasks, patch-level segmentation and whole slide image segmentation and
-detection, evaluated using the Dice Similarity Coefficient (DSC) and F1-score.
-By encouraging innovative segmentation methods that adapt to diverse CKD models
-and tissue conditions, the KPIs Challenge aims to advance kidney pathology
-analysis, establish new benchmarks, and enable precise, large-scale
-quantification for disease research and diagnosis.
+##### **IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems**
+2501.11067v1 by Elad Levi, Ilan Kadar
 
-摘要：慢性腎臟病 (CKD) 是全球主要的健康問題，影響超過
-10% 的人口，並造成顯著的死亡率。雖然腎臟活檢
-仍然是 CKD 診斷和治療的黃金標準，但缺乏
-腎臟病理學分割的全面基準阻礙了該領域的進展。
-為了解決這個問題，我們組織了腎臟病理影像
-分割 (KPIs) 挑戰，引入了包含超過 10,000 個註解的
-CKD 臨床前嚙齒動物模型的資料集，這些註解來自 60 多個
-週期性酸性雪夫 (PAS) 染色的全幻燈片影像。挑戰包括
-兩個任務，修補層級分割和全幻燈片影像分割和
-偵測，使用 Dice 相似係數 (DSC) 和 F1 分數進行評估。
-通過鼓勵創新的分割方法來適應不同的 CKD 模型
-和組織條件，KPIs 挑戰旨在推進腎臟病理
-分析，建立新的基準，並實現精確、大規模的
-疾病研究和診斷量化。
+Large Language Models (LLMs) are transforming artificial intelligence,
+evolving into task-oriented systems capable of autonomous planning and
+execution. One of the primary applications of LLMs is conversational AI
+systems, which must navigate multi-turn dialogues, integrate domain-specific
+APIs, and adhere to strict policy constraints. However, evaluating these agents
+remains a significant challenge, as traditional methods fail to capture the
+complexity and variability of real-world interactions. We introduce
+IntellAgent, a scalable, open-source multi-agent framework designed to evaluate
+conversational AI systems comprehensively. IntellAgent automates the creation
+of diverse, synthetic benchmarks by combining policy-driven graph modeling,
+realistic event generation, and interactive user-agent simulations. This
+innovative approach provides fine-grained diagnostics, addressing the
+limitations of static and manually curated benchmarks with coarse-grained
+metrics. IntellAgent represents a paradigm shift in evaluating conversational
+AI. By simulating realistic, multi-policy scenarios across varying levels of
+complexity, IntellAgent captures the nuanced interplay of agent capabilities
+and policy constraints. Unlike traditional methods, it employs a graph-based
+policy model to represent relationships, likelihoods, and complexities of
+policy interactions, enabling highly detailed diagnostics. IntellAgent also
+identifies critical performance gaps, offering actionable insights for targeted
+optimization. Its modular, open-source design supports seamless integration of
+new domains, policies, and APIs, fostering reproducibility and community
+collaboration. Our findings demonstrate that IntellAgent serves as an effective
+framework for advancing conversational AI by addressing challenges in bridging
+research and deployment. The framework is available at
+https://github.com/plurai-ai/intellagent
 
-##### **Early Risk Prediction of Pediatric Cardiac Arrest from Electronic Health Records via Multimodal Fused Transformer**
-2502.07158v1 by Jiaying Lu, Stephanie R. Brown, Songyuan Liu, Shifan Zhao, Kejun Dong, Del Bold, Michael Fundora, Alaa Aljiffry, Alex Fedorov, Jocelyn Grunwell, Xiao Hu
+摘要：大型語言模型 (LLM) 正在轉變人工智慧，演變成具備自主規劃和執行能力的任務導向系統。LLM 的主要應用之一是對話式 AI 系統，它必須應對多輪對話、整合特定領域的 API，並遵守嚴格的政策約束。然而，評估這些代理仍然是一項重大挑戰，因為傳統方法無法捕捉現實世界互動的複雜性和變異性。我們引入了 IntellAgent，一個可擴充、開放原始碼的多代理架構，旨在全面評估對話式 AI 系統。IntellAgent 自動化建立多樣化、合成的基準，方法是結合策略驅動的圖形建模、逼真的事件產生和互動使用者代理模擬。這種創新方法提供了細緻的診斷，解決了具有粗略指標的靜態和手動策劃基準的限制。IntellAgent 代表了評估對話式 AI 的典範轉移。通過模擬不同層級複雜性的逼真多策略場景，IntellAgent 捕捉到了代理功能和策略約束之間的細微交互。與傳統方法不同，它採用基於圖形的策略模型來表示策略交互的關係、可能性和複雜性，從而實現高度詳細的診斷。IntellAgent 還識別出關鍵效能差距，提供可行的見解，以進行目標最佳化。其模組化、開放原始碼的設計支援無縫整合新的領域、策略和 API，促進了可複製性和社群協作。我們的研究結果表明，IntellAgent 可作為一個有效的框架，透過解決研究和部署之間的挑戰來推進對話式 AI。這個框架可在 https://github.com/plurai-ai/intellagent 取得
 
-Early prediction of pediatric cardiac arrest (CA) is critical for timely
-intervention in high-risk intensive care settings. We introduce PedCA-FT, a
-novel transformer-based framework that fuses tabular view of EHR with the
-derived textual view of EHR to fully unleash the interactions of
-high-dimensional risk factors and their dynamics. By employing dedicated
-transformer modules for each modality view, PedCA-FT captures complex temporal
-and contextual patterns to produce robust CA risk estimates. Evaluated on a
-curated pediatric cohort from the CHOA-CICU database, our approach outperforms
-ten other artificial intelligence models across five key performance metrics
-and identifies clinically meaningful risk factors. These findings underscore
-the potential of multimodal fusion techniques to enhance early CA detection and
-improve patient care.
 
-摘要：早期預測兒童心臟驟停 (CA) 對高風險重症監護環境中的及時干預至關重要。我們引入了 PedCA-FT，這是一個新的基於Transformer的框架，它將 EHR 的表格視圖與 EHR 的派生文本視圖融合在一起，以充分釋放高維風險因素及其動態的交互作用。通過為每個模態視圖採用專用的Transformer模塊，PedCA-FT 捕獲復雜的時間和上下文模式以產生穩健的 CA 風險估計。在 CHOA-CICU 資料庫中經過策劃的兒科隊列上進行評估，我們的做法在五個關鍵性能指標上優於其他十個人工智慧模型，並識別出臨床上有意義的風險因素。這些發現強調了多模態融合技術在增強早期 CA 檢測和改善患者護理方面的潛力。
+### LLM
+|Publish Date|Title|Authors|Homepage|Code|
+| :---: | :---: | :---: | :---: | :---: |
+|**2025-02-13**|**Theoretical Benefit and Limitation of Diffusion Language Model**|Guhao Feng et.al.|[2502.09622v1](http://arxiv.org/abs/2502.09622v1)|null|
+|**2025-02-13**|**MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency**|Dongzhi Jiang et.al.|[2502.09621v1](http://arxiv.org/abs/2502.09621v1)|null|
+|**2025-02-13**|**Exploring the Potential of Encoder-free Architectures in 3D LMMs**|Yiwen Tang et.al.|[2502.09620v1](http://arxiv.org/abs/2502.09620v1)|null|
+|**2025-02-13**|**DexTrack: Towards Generalizable Neural Tracking Control for Dexterous Manipulation from Human References**|Xueyi Liu et.al.|[2502.09614v1](http://arxiv.org/abs/2502.09614v1)|null|
+|**2025-02-13**|**Score-of-Mixture Training: Training One-Step Generative Models Made Simple**|Tejas Jayashankar et.al.|[2502.09609v1](http://arxiv.org/abs/2502.09609v1)|null|
+|**2025-02-13**|**Human-LLM Coevolution: Evidence from Academic Writing**|Mingmeng Geng et.al.|[2502.09606v1](http://arxiv.org/abs/2502.09606v1)|null|
+|**2025-02-13**|**SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models**|Yung-Sung Chuang et.al.|[2502.09604v1](http://arxiv.org/abs/2502.09604v1)|null|
+|**2025-02-13**|**CoT-Valve: Length-Compressible Chain-of-Thought Tuning**|Xinyin Ma et.al.|[2502.09601v1](http://arxiv.org/abs/2502.09601v1)|null|
+|**2025-02-13**|**Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs**|Siyan Zhao et.al.|[2502.09597v1](http://arxiv.org/abs/2502.09597v1)|null|
+|**2025-02-13**|**KIMAs: A Configurable Knowledge Integrated Multi-Agent System**|Zitao Li et.al.|[2502.09596v1](http://arxiv.org/abs/2502.09596v1)|null|
+|**2025-02-13**|**Logical forms complement probability in understanding language model (and human) performance**|Yixuan Wang et.al.|[2502.09589v1](http://arxiv.org/abs/2502.09589v1)|null|
+|**2025-02-13**|**Optimizing GPT for Video Understanding: Zero-Shot Performance and Prompt Engineering**|Mark Beliaev et.al.|[2502.09573v1](http://arxiv.org/abs/2502.09573v1)|null|
+|**2025-02-13**|**MorphNLI: A Stepwise Approach to Natural Language Inference Using Text Morphing**|Vlad Andrei Negru et.al.|[2502.09567v1](http://arxiv.org/abs/2502.09567v1)|null|
+|**2025-02-13**|**Zero-shot generation of synthetic neurosurgical data with large language models**|Austin A. Barr et.al.|[2502.09566v1](http://arxiv.org/abs/2502.09566v1)|null|
+|**2025-02-13**|**MDCrow: Automating Molecular Dynamics Workflows with Large Language Models**|Quintina Campbell et.al.|[2502.09565v1](http://arxiv.org/abs/2502.09565v1)|null|
+|**2025-02-13**|**EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents**|Rui Yang et.al.|[2502.09560v1](http://arxiv.org/abs/2502.09560v1)|null|
+|**2025-02-13**|**Mind the Gap! Choice Independence in Using Multilingual LLMs for Persuasive Co-Writing Tasks in Different Languages**|Shreyan Biswas et.al.|[2502.09532v1](http://arxiv.org/abs/2502.09532v1)|null|
+|**2025-02-13**|**Diffusion Models for Molecules: A Survey of Methods and Tasks**|Liang Wang et.al.|[2502.09511v1](http://arxiv.org/abs/2502.09511v1)|null|
+|**2025-02-13**|**AttentionSmithy: A Modular Framework for Rapid Transformer Development and Customization**|Caleb Cranney et.al.|[2502.09503v1](http://arxiv.org/abs/2502.09503v1)|null|
+|**2025-02-13**|**Improve LLM-based Automatic Essay Scoring with Linguistic Features**|Zhaoyi Joey Hou et.al.|[2502.09497v1](http://arxiv.org/abs/2502.09497v1)|null|
+|**2025-02-13**|**Cracking the Code: Enhancing Development finance understanding with artificial intelligence**|Pierre Beaucoral et.al.|[2502.09495v1](http://arxiv.org/abs/2502.09495v1)|null|
+|**2025-02-13**|**Objective quantification of mood states using large language models**|Jakub Onysk et.al.|[2502.09487v1](http://arxiv.org/abs/2502.09487v1)|null|
+|**2025-02-13**|**The Multilingual Mind : A Survey of Multilingual Reasoning in Language Models**|Akash Ghosh et.al.|[2502.09457v1](http://arxiv.org/abs/2502.09457v1)|null|
+|**2025-02-13**|**Pixel-Level Reasoning Segmentation via Multi-turn Conversations**|Dexian Cai et.al.|[2502.09447v1](http://arxiv.org/abs/2502.09447v1)|null|
+|**2025-02-13**|**Dual Formulation for Non-Rectangular Lp Robust Markov Decision Processes**|Navdeep Kumar et.al.|[2502.09432v1](http://arxiv.org/abs/2502.09432v1)|null|
+|**2025-02-13**|**Transformer-Enhanced Variational Autoencoder for Crystal Structure Prediction**|Ziyi Chen et.al.|[2502.09423v1](http://arxiv.org/abs/2502.09423v1)|null|
+|**2025-02-13**|**On multi-token prediction for efficient LLM inference**|Somesh Mehra et.al.|[2502.09419v1](http://arxiv.org/abs/2502.09419v1)|null|
+|**2025-02-13**|**SQuARE: Sequential Question Answering Reasoning Engine for Enhanced Chain-of-Thought in Large Language Models**|Daniel Fleischer et.al.|[2502.09390v1](http://arxiv.org/abs/2502.09390v1)|null|
+|**2025-02-13**|**Truth Knows No Language: Evaluating Truthfulness Beyond English**|Blanca Calvo Figueras et.al.|[2502.09387v1](http://arxiv.org/abs/2502.09387v1)|null|
+|**2025-02-13**|**A Deep Inverse-Mapping Model for a Flapping Robotic Wing**|Hadar Sharvit et.al.|[2502.09378v1](http://arxiv.org/abs/2502.09378v1)|null|
+|**2025-02-13**|**Language Agents as Digital Representatives in Collective Decision-Making**|Daniel Jarrett et.al.|[2502.09369v1](http://arxiv.org/abs/2502.09369v1)|null|
+|**2025-02-13**|**Neural Spatiotemporal Point Processes: Trends and Challenges**|Sumantrak Mukherjee et.al.|[2502.09341v1](http://arxiv.org/abs/2502.09341v1)|null|
+|**2025-02-13**|**Graph Diffusion Network for Drug-Gene Prediction**|Jiayang Wu et.al.|[2502.09335v1](http://arxiv.org/abs/2502.09335v1)|null|
+|**2025-02-13**|**Beyond English: The Impact of Prompt Translation Strategies across Languages and Tasks in Multilingual LLMs**|Itai Mondshine et.al.|[2502.09331v1](http://arxiv.org/abs/2502.09331v1)|null|
+|**2025-02-13**|**A Judge-free LLM Open-ended Generation Benchmark Based on the Distributional Hypothesis**|Kentaro Imajo et.al.|[2502.09316v1](http://arxiv.org/abs/2502.09316v1)|null|
+|**2025-02-13**|**When the LM misunderstood the human chuckled: Analyzing garden path effects in humans and language models**|Samuel Joseph Amouyal et.al.|[2502.09307v1](http://arxiv.org/abs/2502.09307v1)|null|
+|**2025-02-13**|**Indeterminacy in Affective Computing: Considering Meaning and Context in Data Collection Practices**|Bernd Dudzik et.al.|[2502.09294v1](http://arxiv.org/abs/2502.09294v1)|null|
+|**2025-02-13**|**SparQLe: Speech Queries to Text Translation Through LLMs**|Amirbek Djanibekov et.al.|[2502.09284v1](http://arxiv.org/abs/2502.09284v1)|null|
+|**2025-02-13**|**LiSA: Leveraging Link Recommender to Attack Graph Neural Networks via Subgraph Injection**|Wenlun Zhang et.al.|[2502.09271v1](http://arxiv.org/abs/2502.09271v1)|null|
+|**2025-02-13**|**AnomalyGFM: Graph Foundation Model for Zero/Few-shot Anomaly Detection**|Hezhe Qiao et.al.|[2502.09254v1](http://arxiv.org/abs/2502.09254v1)|null|
+|**2025-02-13**|**The Joint Entity-Relation Extraction Model Based on Span and Interactive Fusion Representation for Chinese Medical Texts with Complex Semantics**|Danni Feng et.al.|[2502.09247v1](http://arxiv.org/abs/2502.09247v1)|null|
+|**2025-02-13**|**You Do Not Fully Utilize Transformer's Representation Capacity**|Gleb Gerasimov et.al.|[2502.09245v1](http://arxiv.org/abs/2502.09245v1)|null|
+|**2025-02-13**|**From large language models to multimodal AI: A scoping review on the potential of generative AI in medicine**|Lukas Buess et.al.|[2502.09242v1](http://arxiv.org/abs/2502.09242v1)|null|
+|**2025-02-13**|**Reliable Conversational Agents under ASP Control that Understand Natural Language**|Yankai Zeng et.al.|[2502.09237v1](http://arxiv.org/abs/2502.09237v1)|null|
+|**2025-02-13**|**Commonsense Reasoning-Aided Autonomous Vehicle Systems**|Keegan Kimbrell et.al.|[2502.09233v1](http://arxiv.org/abs/2502.09233v1)|null|
+|**2025-02-13**|**Logical foundations of Smart Contracts**|Kalonji Kalala et.al.|[2502.09232v1](http://arxiv.org/abs/2502.09232v1)|null|
+|**2025-02-13**|**Relating Answer Set Programming and Many-sorted Logics for Formal Verification**|Zachary Hansen et.al.|[2502.09230v1](http://arxiv.org/abs/2502.09230v1)|null|
+|**2025-02-13**|**Computational methods for Dynamic Answer Set Programming**|Susana Hahn et.al.|[2502.09228v1](http://arxiv.org/abs/2502.09228v1)|null|
+|**2025-02-13**|**Generating Causally Compliant Counterfactual Explanations using ASP**|Sopam Dasgupta et.al.|[2502.09226v1](http://arxiv.org/abs/2502.09226v1)|null|
+|**2025-02-13**|**Order-Sorted Intensional Logic: Expressing Subtyping Polymorphism with Typing Assertions and Quantification over Concepts**|Đorđe Marković et.al.|[2502.09224v1](http://arxiv.org/abs/2502.09224v1)|null|
+|**2025-02-13**|**ASP-driven User-interaction with Clinguin**|Alexander Beiser et.al.|[2502.09222v1](http://arxiv.org/abs/2502.09222v1)|null|
+|**2025-02-13**|**Pearce's Characterisation in an Epistemic Domain**|Ezgi Iraz Su et.al.|[2502.09221v1](http://arxiv.org/abs/2502.09221v1)|null|
+|**2025-02-13**|**Graphical Conditions for the Existence, Unicity and Number of Regular Models**|Van-Giang Trinh et.al.|[2502.09220v1](http://arxiv.org/abs/2502.09220v1)|null|
+|**2025-02-13**|**Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**|Flavio Bertini et.al.|[2502.09218v1](http://arxiv.org/abs/2502.09218v1)|null|
+|**2025-02-13**|**Mind the Gaps: Logical English, Prolog, and Multi-agent Systems for Autonomous Vehicles**|Galileo Sartor et.al.|[2502.09216v1](http://arxiv.org/abs/2502.09216v1)|null|
+|**2025-02-13**|**Architecture for Simulating Behavior Mode Changes in Norm-Aware Autonomous Agents**|Sean Glaze et.al.|[2502.09215v1](http://arxiv.org/abs/2502.09215v1)|null|
+|**2025-02-13**|**Neuro-Symbolic Contrastive Learning for Cross-domain Inference**|Mingyue Liu et.al.|[2502.09213v1](http://arxiv.org/abs/2502.09213v1)|null|
+|**2025-02-13**|**LP-LM: No Hallucinations in Question Answering with Logic Programming**|Katherine Wu et.al.|[2502.09212v1](http://arxiv.org/abs/2502.09212v1)|null|
+|**2025-02-13**|**Visual Graph Question Answering with ASP and LLMs for Language Parsing**|Jakob Johannes Bauer et.al.|[2502.09211v1](http://arxiv.org/abs/2502.09211v1)|null|
+|**2025-02-13**|**On LLM-generated Logic Programs and their Inference Execution Methods**|Paul Tarau et.al.|[2502.09209v1](http://arxiv.org/abs/2502.09209v1)|null|
+|**2025-02-13**|**Efficient OWL2QL Meta-reasoning Using ASP-based Hybrid Knowledge Bases**|Haya Majid Qureshi et.al.|[2502.09206v1](http://arxiv.org/abs/2502.09206v1)|null|
+|**2025-02-13**|**Counterfactual Explanations as Plans**|Vaishak Belle et.al.|[2502.09205v1](http://arxiv.org/abs/2502.09205v1)|null|
+|**2025-02-13**|**Logical Lease Litigation: Prolog and LLMs for Rental Law Compliance in New York**|Sanskar Sehgal et.al.|[2502.09204v1](http://arxiv.org/abs/2502.09204v1)|null|
+|**2025-02-13**|**Thinking beyond the anthropomorphic paradigm benefits LLM research**|Lujain Ibrahim et.al.|[2502.09192v1](http://arxiv.org/abs/2502.09192v1)|null|
+|**2025-02-13**|**Matina: A Large-Scale 73B Token Persian Text Corpus**|Sara Bourbour Hosseinbeigi et.al.|[2502.09188v1](http://arxiv.org/abs/2502.09188v1)|null|
+|**2025-02-13**|**RefineCoder: Iterative Improving of Large Language Models via Adaptive Critique Refinement for Code Generation**|Changzhi Zhou et.al.|[2502.09183v1](http://arxiv.org/abs/2502.09183v1)|null|
+|**2025-02-13**|**FLAME: Flexible LLM-Assisted Moderation Engine**|Ivan Bakulin et.al.|[2502.09175v1](http://arxiv.org/abs/2502.09175v1)|null|
+|**2025-02-13**|**Two-Stage Representation Learning for Analyzing Movement Behavior Dynamics in People Living with Dementia**|Jin Cui et.al.|[2502.09173v1](http://arxiv.org/abs/2502.09173v1)|null|
+|**2025-02-13**|**Musical Heritage Historical Entity Linking**|Arianna Graciotti et.al.|[2502.09168v1](http://arxiv.org/abs/2502.09168v1)|null|
+|**2025-02-13**|**Improving TCM Question Answering through Tree-Organized Self-Reflective Retrieval with LLMs**|Chang Liu et.al.|[2502.09156v1](http://arxiv.org/abs/2502.09156v1)|null|
+|**2025-02-13**|**A Novel Dialect-Aware Framework for the Classification of Arabic Dialects and Emotions**|Nasser A Alsadhan et.al.|[2502.09128v1](http://arxiv.org/abs/2502.09128v1)|null|
+|**2025-02-13**|**Automatic Pruning via Structured Lasso with Class-wise Information**|Xiang Liu et.al.|[2502.09125v1](http://arxiv.org/abs/2502.09125v1)|null|
+|**2025-02-13**|**The influence of visual and linguistic cues on ignorance inference in Vision-Language Models (VLMs)**|Ye-eun Cho et.al.|[2502.09120v1](http://arxiv.org/abs/2502.09120v1)|null|
+|**2025-02-13**|**One-shot Federated Learning Methods: A Practical Guide**|Xiang Liu et.al.|[2502.09104v1](http://arxiv.org/abs/2502.09104v1)|null|
+|**2025-02-13**|**Logical Reasoning in Large Language Models: A Survey**|Hanmeng Liu et.al.|[2502.09100v1](http://arxiv.org/abs/2502.09100v1)|null|
+|**2025-02-13**|**A Hybrid Transformer Model for Fake News Detection: Leveraging Bayesian Optimization and Bidirectional Recurrent Unit**|Tianyi Huang et.al.|[2502.09097v1](http://arxiv.org/abs/2502.09097v1)|null|
+|**2025-02-13**|**A Hybrid Model for Few-Shot Text Classification Using Transfer and Meta-Learning**|Jia Gao et.al.|[2502.09086v1](http://arxiv.org/abs/2502.09086v1)|null|
+|**2025-02-13**|**Show Me the Work: Fact-Checkers' Requirements for Explainable Automated Fact-Checking**|Greta Warren et.al.|[2502.09083v1](http://arxiv.org/abs/2502.09083v1)|null|
+|**2025-02-13**|**CoSER: Coordinating LLM-Based Persona Simulation of Established Roles**|Xintao Wang et.al.|[2502.09082v1](http://arxiv.org/abs/2502.09082v1)|null|
+|**2025-02-13**|**Enhancing RAG with Active Learning on Conversation Records: Reject Incapables and Answer Capables**|Xuzhao Geng et.al.|[2502.09073v1](http://arxiv.org/abs/2502.09073v1)|null|
+|**2025-02-13**|**An Open Recipe: Adapting Language-Specific LLMs to a Reasoning Model in One Day via Model Merging**|Kunat Pipatanakul et.al.|[2502.09056v1](http://arxiv.org/abs/2502.09056v1)|null|
+|**2025-02-13**|**Cost-Saving LLM Cascades with Early Abstention**|Michael J. Zellinger et.al.|[2502.09054v1](http://arxiv.org/abs/2502.09054v1)|null|
+|**2025-02-13**|**Game Theory Meets Large Language Models: A Systematic Survey**|Haoran Sun et.al.|[2502.09053v1](http://arxiv.org/abs/2502.09053v1)|null|
+|**2025-02-13**|**AIDE: Agentically Improve Visual Language Model with Domain Experts**|Ming-Chang Chiu et.al.|[2502.09051v1](http://arxiv.org/abs/2502.09051v1)|null|
+|**2025-02-13**|**Leveraging Member-Group Relations via Multi-View Graph Filtering for Effective Group Recommendation**|Chae-Hyun Kim et.al.|[2502.09050v1](http://arxiv.org/abs/2502.09050v1)|null|
+|**2025-02-13**|**Criteria-Aware Graph Filtering: Extremely Fast Yet Accurate Multi-Criteria Recommendation**|Jin-Duk Park et.al.|[2502.09046v1](http://arxiv.org/abs/2502.09046v1)|null|
+|**2025-02-13**|**Typhoon T1: An Open Thai Reasoning Model**|Pittawat Taveekitworachai et.al.|[2502.09042v1](http://arxiv.org/abs/2502.09042v1)|null|
+|**2025-02-13**|**Mechanistic Unveiling of Transformer Circuits: Self-Influence as a Key to Model Reasoning**|Lin Zhang et.al.|[2502.09022v1](http://arxiv.org/abs/2502.09022v1)|null|
+|**2025-02-13**|**EventSTR: A Benchmark Dataset and Baselines for Event Stream based Scene Text Recognition**|Xiao Wang et.al.|[2502.09020v1](http://arxiv.org/abs/2502.09020v1)|null|
+|**2025-02-13**|**Zero-shot Concept Bottleneck Models**|Shin'ya Yamaguchi et.al.|[2502.09018v1](http://arxiv.org/abs/2502.09018v1)|null|
+|**2025-02-13**|**Diversity Enhances an LLM's Performance in RAG and Long-context Task**|Zhchao Wang et.al.|[2502.09017v1](http://arxiv.org/abs/2502.09017v1)|null|
+|**2025-02-13**|**Hope vs. Hate: Understanding User Interactions with LGBTQ+ News Content in Mainstream US News Media through the Lens of Hope Speech**|Jonathan Pofcher et.al.|[2502.09004v1](http://arxiv.org/abs/2502.09004v1)|null|
+|**2025-02-13**|**RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models**|Quan Wei et.al.|[2502.09003v1](http://arxiv.org/abs/2502.09003v1)|null|
+|**2025-02-13**|**PixLift: Accelerating Web Browsing via AI Upscaling**|Yonas Atinafu et.al.|[2502.08995v1](http://arxiv.org/abs/2502.08995v1)|null|
+|**2025-02-13**|**RLSA-PFL: Robust Lightweight Secure Aggregation with Model Inconsistency Detection in Privacy-Preserving Federated Learning**|Nazatul H. Sultan et.al.|[2502.08989v1](http://arxiv.org/abs/2502.08989v1)|null|
+|**2025-02-13**|**Neural Force Field: Learning Generalized Physical Representation from a Few Examples**|Shiqian Li et.al.|[2502.08987v1](http://arxiv.org/abs/2502.08987v1)|null|
+|**2025-02-13**|**Tuning-Free Personalized Alignment via Trial-Error-Explain In-Context Learning**|Hyundong Cho et.al.|[2502.08972v1](http://arxiv.org/abs/2502.08972v1)|null|
+|**2025-02-13**|**RTBAS: Defending LLM Agents Against Prompt Injection and Privacy Leakage**|Peter Yong Zhong et.al.|[2502.08966v1](http://arxiv.org/abs/2502.08966v1)|null|
+|**2025-02-13**|**Biologically Plausible Brain Graph Transformer**|Ciyuan Peng et.al.|[2502.08958v1](http://arxiv.org/abs/2502.08958v1)|null|
+|**2025-02-13**|**Medicine on the Edge: Comparative Performance Analysis of On-Device LLMs for Clinical Reasoning**|Leon Nissen et.al.|[2502.08954v1](http://arxiv.org/abs/2502.08954v1)|null|
 
-##### **Explaining 3D Computed Tomography Classifiers with Counterfactuals**
-2502.07156v1 by Joseph Paul Cohen, Louis Blankemeier, Akshay Chaudhari
+#### Abstracts
+##### **Theoretical Benefit and Limitation of Diffusion Language Model**
+2502.09622v1 by Guhao Feng, Yihan Geng, Jian Guan, Wei Wu, Liwei Wang, Di He
 
-Counterfactual explanations in medical imaging are critical for understanding
-the predictions made by deep learning models. We extend the Latent Shift
-counterfactual generation method from 2D applications to 3D computed tomography
-(CT) scans. We address the challenges associated with 3D data, such as limited
-training samples and high memory demands, by implementing a slice-based
-approach. This method leverages a 2D encoder trained on CT slices, which are
-subsequently combined to maintain 3D context. We demonstrate this technique on
-two models for clinical phenotype prediction and lung segmentation. Our
-approach is both memory-efficient and effective for generating interpretable
-counterfactuals in high-resolution 3D medical imaging.
+Diffusion language models have emerged as a promising approach for text
+generation. One would naturally expect this method to be an efficient
+replacement for autoregressive models since multiple tokens can be sampled in
+parallel during each diffusion step. However, its efficiency-accuracy trade-off
+is not yet well understood. In this paper, we present a rigorous theoretical
+analysis of a widely used type of diffusion language model, the Masked
+Diffusion Model (MDM), and find that its effectiveness heavily depends on the
+target evaluation metric. Under mild conditions, we prove that when using
+perplexity as the metric, MDMs can achieve near-optimal perplexity in sampling
+steps regardless of sequence length, demonstrating that efficiency can be
+achieved without sacrificing performance. However, when using the sequence
+error rate--which is important for understanding the "correctness" of a
+sequence, such as a reasoning chain--we show that the required sampling steps
+must scale linearly with sequence length to obtain "correct" sequences, thereby
+eliminating MDM's efficiency advantage over autoregressive models. Our analysis
+establishes the first theoretical foundation for understanding the benefits and
+limitations of MDMs. All theoretical findings are supported by empirical
+studies.
 
-摘要：反事實解釋在醫學影像中對於理解深度學習模型所做的預測至關重要。我們將 Latent Shift 反事實生成方法從 2D 應用程式延伸到 3D 電腦斷層掃描 (CT) 掃描。我們透過實作基於切片的做法，來解決與 3D 資料相關的挑戰，例如受限的訓練樣本和高記憶體需求。此方法利用經過 CT 切片訓練的 2D 編碼器，隨後將這些切片結合起來以維護 3D 背景。我們在兩個用於臨床表型預測和肺部分割的模型上展示此技術。我們的做法對於在高解析度 3D 醫學影像中產生可解釋的反事實，既節省記憶體又有效。
+摘要：擴散語言模型已成為文字生成的一種有前途的方法。由於在每個擴散步驟期間可以並行採樣多個符號，因此人們自然會期望這種方法成為自迴歸模型的有效替代方案。然而，它的效率準確性權衡尚未得到很好的理解。在本文中，我們對廣泛使用的擴散語言模型類型，即遮罩擴散模型 (MDM) 進行了嚴格的理論分析，並發現其有效性在很大程度上取決於目標評估指標。在溫和條件下，我們證明了當使用困惑度作為指標時，MDM 可以無論序列長度如何，在採樣步驟中實現近乎最佳的困惑度，這表明可以在不犧牲性能的情況下實現效率。然而，當使用序列錯誤率（對於理解序列的「正確性」很重要，例如推理鏈）時，我們表明所需的採樣步驟必須隨著序列長度線性縮放才能獲得「正確」的序列，從而消除了 MDM 相對於自迴歸模型的效率優勢。我們的分析為理解 MDM 的優點和局限性建立了第一個理論基礎。所有理論發現都得到了實證研究的支持。
 
-##### **Interactive Data Harmonization with LLM Agents**
-2502.07132v1 by Aécio Santos, Eduardo H. M. Pena, Roque Lopez, Juliana Freire
+##### **MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency**
+2502.09621v1 by Dongzhi Jiang, Renrui Zhang, Ziyu Guo, Yanwei Li, Yu Qi, Xinyan Chen, Liuhui Wang, Jianhan Jin, Claire Guo, Shen Yan, Bo Zhang, Chaoyou Fu, Peng Gao, Hongsheng Li
 
-Data harmonization is an essential task that entails integrating datasets
-from diverse sources. Despite years of research in this area, it remains a
-time-consuming and challenging task due to schema mismatches, varying
-terminologies, and differences in data collection methodologies. This paper
-presents the case for agentic data harmonization as a means to both empower
-experts to harmonize their data and to streamline the process. We introduce
-Harmonia, a system that combines LLM-based reasoning, an interactive user
-interface, and a library of data harmonization primitives to automate the
-synthesis of data harmonization pipelines. We demonstrate Harmonia in a
-clinical data harmonization scenario, where it helps to interactively create
-reusable pipelines that map datasets to a standard format. Finally, we discuss
-challenges and open problems, and suggest research directions for advancing our
-vision.
+Answering questions with Chain-of-Thought (CoT) has significantly enhanced
+the reasoning capabilities of Large Language Models (LLMs), yet its impact on
+Large Multimodal Models (LMMs) still lacks a systematic assessment and in-depth
+investigation. In this paper, we introduce MME-CoT, a specialized benchmark
+evaluating the CoT reasoning performance of LMMs, spanning six domains: math,
+science, OCR, logic, space-time, and general scenes. As the first comprehensive
+study in this area, we propose a thorough evaluation suite incorporating three
+novel metrics that assess the reasoning quality, robustness, and efficiency at
+a fine-grained level. Leveraging curated high-quality data and a unique
+evaluation strategy, we conduct an in-depth analysis of state-of-the-art LMMs,
+uncovering several key insights: 1) Models with reflection mechanism
+demonstrate a superior CoT quality, with Kimi k1.5 outperforming GPT-4o and
+demonstrating the highest quality results; 2) CoT prompting often degrades LMM
+performance on perception-heavy tasks, suggesting a potentially harmful
+overthinking behavior; and 3) Although the CoT quality is high, LMMs with
+reflection exhibit significant inefficiency in both normal response and
+self-correction phases. We hope MME-CoT serves as a foundation for advancing
+multimodal reasoning in LMMs. Project Page: https://mmecot.github.io/
 
-摘要：資料調和是一項整合不同來源資料集的重要任務。儘管多年來針對此領域的研究不斷，但由於架構不匹配、術語不同，以及資料收集方法的差異，它仍然是一項耗時且具有挑戰性的任務。本文提出代理資料調和，作為賦能專家調和其資料並簡化流程的方法。我們介紹 Harmonia，一個結合了基於 LLM 的推理、互動式使用者介面和資料調和原語庫的系統，以自動化資料調和管線的合成。我們在臨床資料調和場景中展示了 Harmonia，它有助於互動式建立可重複使用的管線，將資料集對應至標準格式。最後，我們討論挑戰和開放性問題，並建議研究方向以推進我們的願景。
+摘要：<paragraph>透過思維鏈（CoT）回答問題，大幅提升了大型語言模型（LLM）的推理能力，但其對大型多模態模型（LMM）的影響仍缺乏系統性的評估和深入探討。在本文中，我們引入了 MME-CoT，一個專門的基準測試，用於評估 LMM 的 CoT 推理效能，涵蓋六個領域：數學、科學、OCR、邏輯、時空和一般場景。作為該領域的第一個全面性研究，我們提出了一個全面的評估套件，包含三個創新的指標，用於評估推理品質、穩健性和效率，並達到細微的層級。透過利用策展的高品質資料和獨特的評估策略，我們對最先進的 LMM 進行深入分析，發現了幾個關鍵見解：1）具有反思機制的模型展現出優異的 CoT 品質，其中 Kimi k1.5 優於 GPT-4o，並展現出最高品質的結果；2）CoT 提示通常會降低 LMM 在感知密集任務上的效能，這表示潛在有害的過度思考行為；3）儘管 CoT 品質很高，但具有反思能力的 LMM 在一般回應和自我修正階段都展現出顯著的低效率。我們希望 MME-CoT 能作為促進 LMM 中多模態推理的基礎。專案頁面：https://mmecot.github.io/</paragraph>
 
-##### **Machine Learning for Everyone: Simplifying Healthcare Analytics with BigQuery ML**
-2502.07026v1 by Mohammad Amir Salari, Bahareh Rahmani
+##### **Exploring the Potential of Encoder-free Architectures in 3D LMMs**
+2502.09620v1 by Yiwen Tang, Zoey Guo, Zhuhao Wang, Ray Zhang, Qizhi Chen, Junli Liu, Delin Qu, Zhigang Wang, Dong Wang, Xuelong Li, Bin Zhao
 
-Machine learning (ML) is transforming healthcare by enabling predictive
-analytics, personalized treatments, and improved patient outcomes. However,
-traditional ML workflows require specialized skills, infrastructure, and
-resources, limiting accessibility for many healthcare professionals. This paper
-explores how Google Cloud's BigQuery ML simplifies the development and
-deployment of ML models using SQL, reducing technical barriers. Through a case
-study on diabetes prediction using the Diabetes Health Indicators Dataset, we
-evaluate three predictive models: Logistic Regression, Boosted Tree, and Deep
-Neural Network (DNN). Our results demonstrate that the Boosted Tree model
-achieves the highest performance, making it highly effective for diabetes
-prediction. This study highlights BigQuery ML's role in democratizing machine
-learning by providing a scalable, efficient, and accessible solution for
-healthcare analytics.
+Encoder-free architectures have been preliminarily explored in the 2D visual
+domain, yet it remains an open question whether they can be effectively applied
+to 3D understanding scenarios. In this paper, we present the first
+comprehensive investigation into the potential of encoder-free architectures to
+overcome the challenges of encoder-based 3D Large Multimodal Models (LMMs).
+These challenges include the failure to adapt to varying point cloud
+resolutions and the point features from the encoder not meeting the semantic
+needs of Large Language Models (LLMs). We identify key aspects for 3D LMMs to
+remove the encoder and enable the LLM to assume the role of the 3D encoder: 1)
+We propose the LLM-embedded Semantic Encoding strategy in the pre-training
+stage, exploring the effects of various point cloud self-supervised losses. And
+we present the Hybrid Semantic Loss to extract high-level semantics. 2) We
+introduce the Hierarchical Geometry Aggregation strategy in the instruction
+tuning stage. This incorporates inductive bias into the LLM early layers to
+focus on the local details of the point clouds. To the end, we present the
+first Encoder-free 3D LMM, ENEL. Our 7B model rivals the current
+state-of-the-art model, ShapeLLM-13B, achieving 55.0%, 50.92%, and 42.7% on the
+classification, captioning, and VQA tasks, respectively. Our results
+demonstrate that the encoder-free architecture is highly promising for
+replacing encoder-based architectures in the field of 3D understanding. The
+code is released at https://github.com/Ivan-Tang-3D/ENEL
 
-摘要：機器學習 (ML) 透過啟用預測分析、個人化治療和改善病患結果，正在轉型醫療保健。然而，傳統的 ML 工作流程需要專業技能、基礎設施和資源，限制了許多醫療保健專業人員的可及性。本文探討 Google Cloud 的 BigQuery ML 如何使用 SQL 簡化 ML 模型的開發和部署，降低技術障礙。透過使用糖尿病健康指標資料集對糖尿病預測進行個案研究，我們評估了三個預測模型：邏輯迴歸、提升樹和深度神經網路 (DNN)。我們的結果證明，提升樹模型達到了最高的效能，使其對於糖尿病預測非常有效。這項研究強調了 BigQuery ML 在民主化機器學習中扮演的角色，提供可擴充、有效率且可存取的醫療保健分析解決方案。
+摘要：<paragraph>編碼器免費架構已在 2D 視覺領域中初步探索，但它們是否能有效應用於 3D 理解場景仍是一個開放的問題。在本文中，我們提出了對編碼器免費架構潛力的首次全面調查，以克服基於編碼器的 3D 大型多模態模型 (LMM) 的挑戰。這些挑戰包括無法適應不同的點雲解析度，且來自編碼器的點特徵無法滿足大型語言模型 (LLM) 的語義需求。我們識別出 3D LMM 的關鍵方面，以移除編碼器並讓 LLM 承擔 3D 編碼器的角色：1) 我們在預訓練階段提出 LLM 嵌入式語義編碼策略，探索各種點雲自我監督損失的影響。我們提出混合語義損失來提取高階語義。2) 我們在指令調整階段引入分層幾何聚合策略。這將歸納偏差納入 LLM 早期層，以專注於點雲的局部細節。最後，我們提出第一個無編碼器 3D LMM，ENEL。我們的 7B 模型與當前最先進的模型 ShapeLLM-13B 相媲美，分別在分類、字幕和 VQA 任務中達到 55.0%、50.92% 和 42.7%。我們的結果表明，無編碼器架構極有望取代基於編碼器的架構在 3D 理解領域的應用。程式碼發布於 https://github.com/Ivan-Tang-3D/ENEL</paragraph>
 
-##### **AIMS.au: A Dataset for the Analysis of Modern Slavery Countermeasures in Corporate Statements**
-2502.07022v1 by Adriana Eufrosiana Bora, Pierre-Luc St-Charles, Mirko Bronzi, Arsène Fansi Tchango, Bruno Rousseau, Kerrie Mengersen
+##### **DexTrack: Towards Generalizable Neural Tracking Control for Dexterous Manipulation from Human References**
+2502.09614v1 by Xueyi Liu, Jianibieke Adalibieke, Qianwei Han, Yuzhe Qin, Li Yi
 
-Despite over a decade of legislative efforts to address modern slavery in the
-supply chains of large corporations, the effectiveness of government oversight
-remains hampered by the challenge of scrutinizing thousands of statements
-annually. While Large Language Models (LLMs) can be considered a well
-established solution for the automatic analysis and summarization of documents,
-recognizing concrete modern slavery countermeasures taken by companies and
-differentiating those from vague claims remains a challenging task. To help
-evaluate and fine-tune LLMs for the assessment of corporate statements, we
-introduce a dataset composed of 5,731 modern slavery statements taken from the
-Australian Modern Slavery Register and annotated at the sentence level. This
-paper details the construction steps for the dataset that include the careful
-design of annotation specifications, the selection and preprocessing of
-statements, and the creation of high-quality annotation subsets for effective
-model evaluations. To demonstrate our dataset's utility, we propose a machine
-learning methodology for the detection of sentences relevant to mandatory
-reporting requirements set by the Australian Modern Slavery Act. We then follow
-this methodology to benchmark modern language models under zero-shot and
-supervised learning settings.
+We address the challenge of developing a generalizable neural tracking
+controller for dexterous manipulation from human references. This controller
+aims to manage a dexterous robot hand to manipulate diverse objects for various
+purposes defined by kinematic human-object interactions. Developing such a
+controller is complicated by the intricate contact dynamics of dexterous
+manipulation and the need for adaptivity, generalizability, and robustness.
+Current reinforcement learning and trajectory optimization methods often fall
+short due to their dependence on task-specific rewards or precise system
+models. We introduce an approach that curates large-scale successful robot
+tracking demonstrations, comprising pairs of human references and robot
+actions, to train a neural controller. Utilizing a data flywheel, we
+iteratively enhance the controller's performance, as well as the number and
+quality of successful tracking demonstrations. We exploit available tracking
+demonstrations and carefully integrate reinforcement learning and imitation
+learning to boost the controller's performance in dynamic environments. At the
+same time, to obtain high-quality tracking demonstrations, we individually
+optimize per-trajectory tracking by leveraging the learned tracking controller
+in a homotopy optimization method. The homotopy optimization, mimicking
+chain-of-thought, aids in solving challenging trajectory tracking problems to
+increase demonstration diversity. We showcase our success by training a
+generalizable neural controller and evaluating it in both simulation and real
+world. Our method achieves over a 10% improvement in success rates compared to
+leading baselines. The project website with animated results is available at
+https://meowuu7.github.io/DexTrack/.
 
-摘要：儘管立法努力超過十年，旨在解決大型企業供應鏈中的現代奴隸制，但政府監督的有效性仍然受到每年審查數千份聲明的挑戰所阻礙。雖然大型語言模型（LLM）可以被認為是文件自動分析和摘要的完善解決方案，但要辨識公司採取的具體現代奴隸制對策，並將其與含糊的聲明區分開來，仍然是一項具有挑戰性的任務。為了幫助評估和微調 LLM 以評估企業聲明，我們引入了一個由 5,731 份現代奴隸制聲明組成的資料集，這些聲明取自澳洲現代奴隸制註冊處，並在句子層級進行註解。本文詳細說明了資料集的建構步驟，其中包括註解規格的仔細設計、聲明的選擇和預處理，以及用於有效模型評估的高品質註解子集的建立。為了展示我們的資料集的效用，我們提出了一種機器學習方法，用於檢測與澳洲現代奴隸制法規定的強制性報告要求相關的句子。然後，我們遵循這種方法，在零次學習和監督學習設定下對現代語言模型進行基準測試。
+摘要：<paragraph>我們解決了從人類參照中開發靈巧操作通用神經追蹤控制器的挑戰。此控制器旨在管理靈巧機器人手，以操作各種物體，以實現由運動學人機互動定義的各種目的。由於靈巧操作的複雜接觸動力學以及對適應性、通用性和魯棒性的需求，開發此類控制器很複雜。目前的強化學習和軌跡優化方法通常由於依賴於特定任務的獎勵或精確的系統模型而表現不佳。我們引入了一種方法，它策劃了大規模成功的機器人追蹤示範，包括人體參照和機器人動作對，以訓練神經控制器。利用數據飛輪，我們反覆增強控制器的性能，以及成功追蹤示範的數量和品質。我們利用可用的追蹤示範，並仔細整合強化學習和模仿學習，以提升控制器在動態環境中的性能。同時，為了獲得高品質的追蹤示範，我們透過在同倫優化方法中利用已學習的追蹤控制器，個別優化每個軌跡的追蹤。同倫優化模擬思考鏈，有助於解決具有挑戰性的軌跡追蹤問題，以增加示範的多樣性。我們展示了我們在訓練通用神經控制器並在模擬和真實世界中評估它的成功。與領先的基準相比，我們的模型在成功率方面提高了 10% 以上。包含動畫結果的專案網站可在 https://meowuu7.github.io/DexTrack/ 取得。</paragraph>
 
-##### **Recent Advances, Applications and Open Challenges in Machine Learning for Health: Reflections from Research Roundtables at ML4H 2024 Symposium**
-2502.06693v1 by Amin Adibi, Xu Cao, Zongliang Ji, Jivat Neet Kaur, Winston Chen, Elizabeth Healey, Brighton Nuwagira, Wenqian Ye, Geoffrey Woollard, Maxwell A Xu, Hejie Cui, Johnny Xi, Trenton Chang, Vasiliki Bikia, Nicole Zhang, Ayush Noori, Yuan Xia, Md. Belal Hossain, Hanna A. Frank, Alina Peluso, Yuan Pu, Shannon Zejiang Shen, John Wu, Adibvafa Fallahpour, Sazan Mahbub, Ross Duncan, Yuwei Zhang, Yurui Cao, Zuheng Xu, Michael Craig, Rahul G. Krishnan, Rahmatollah Beheshti, James M. Rehg, Mohammad Ehsanul Karim, Megan Coffee, Leo Anthony Celi, Jason Alan Fries, Mohsen Sadatsafavi, Dennis Shung, Shannon McWeeney, Jessica Dafflon, Sarah Jabbour
+##### **Score-of-Mixture Training: Training One-Step Generative Models Made Simple**
+2502.09609v1 by Tejas Jayashankar, J. Jon Ryu, Gregory Wornell
 
-The fourth Machine Learning for Health (ML4H) symposium was held in person on
-December 15th and 16th, 2024, in the traditional, ancestral, and unceded
-territories of the Musqueam, Squamish, and Tsleil-Waututh Nations in Vancouver,
-British Columbia, Canada. The symposium included research roundtable sessions
-to foster discussions between participants and senior researchers on timely and
-relevant topics for the ML4H community. The organization of the research
-roundtables at the conference involved 13 senior and 27 junior chairs across 13
-tables. Each roundtable session included an invited senior chair (with
-substantial experience in the field), junior chairs (responsible for
-facilitating the discussion), and attendees from diverse backgrounds with an
-interest in the session's topic.
+We propose Score-of-Mixture Training (SMT), a novel framework for training
+one-step generative models by minimizing a class of divergences called the
+$\alpha$-skew Jensen-Shannon divergence. At its core, SMT estimates the score
+of mixture distributions between real and fake samples across multiple noise
+levels. Similar to consistency models, our approach supports both training from
+scratch (SMT) and distillation using a pretrained diffusion model, which we
+call Score-of-Mixture Distillation (SMD). It is simple to implement, requires
+minimal hyperparameter tuning, and ensures stable training. Experiments on
+CIFAR-10 and ImageNet 64x64 show that SMT/SMD are competitive with and can even
+outperform existing methods.
 
-摘要：第四屆醫療機器學習 (ML4H) 研討會於 2024 年 12 月 15 日和 16 日在加拿大不列顛哥倫比亞省溫哥華的 Musqueam、Squamish 和 Tsleil-Waututh 國家的傳統、祖先和未割讓領土上舉行。研討會包括研究圓桌會議，以促進參與者和高級研究人員之間關於 ML4H 社群的及時和相關主題的討論。在會議上組織研究圓桌會議涉及 13 張桌子上的 13 位高級主席和 27 位初級主席。每個圓桌會議都包括一位受邀的高級主席（在該領域擁有豐富的經驗）、初級主席（負責促進討論）以及對會議主題感興趣的來自不同背景的與會者。
+摘要：我們提出混合評分訓練 (SMT)，一種透過最小化稱為 $\alpha$-偏斜 Jensen-Shannon 距離的距離類別來訓練單步生成模型的新穎架構。在核心部分，SMT 估計真實和虛假樣本之間在多個雜訊層級的混合分配評分。與一致性模型類似，我們的做法支援從頭開始訓練 (SMT) 和使用預先訓練的擴散模型進行蒸餾，我們稱之為混合評分蒸餾 (SMD)。它易於實作，只需要最小的超參數調整，並確保穩定的訓練。在 CIFAR-10 和 ImageNet 64x64 上的實驗顯示，SMT/SMD 具有競爭力，甚至可以優於現有方法。
 
-##### **Automatic Evaluation of Healthcare LLMs Beyond Question-Answering**
-2502.06666v1 by Anna Arias-Duart, Pablo Agustin Martin-Torres, Daniel Hinjos, Pablo Bernabeu-Perez, Lucia Urcelay Ganzabal, Marta Gonzalez Mallo, Ashwin Kumar Gururajan, Enrique Lopez-Cuena, Sergio Alvarez-Napagao, Dario Garcia-Gasulla
+##### **Human-LLM Coevolution: Evidence from Academic Writing**
+2502.09606v1 by Mingmeng Geng, Roberto Trotta
 
-Current Large Language Models (LLMs) benchmarks are often based on open-ended
-or close-ended QA evaluations, avoiding the requirement of human labor.
-Close-ended measurements evaluate the factuality of responses but lack
-expressiveness. Open-ended capture the model's capacity to produce discourse
-responses but are harder to assess for correctness. These two approaches are
-commonly used, either independently or together, though their relationship
-remains poorly understood. This work is focused on the healthcare domain, where
-both factuality and discourse matter greatly. It introduces a comprehensive,
-multi-axis suite for healthcare LLM evaluation, exploring correlations between
-open and close benchmarks and metrics. Findings include blind spots and
-overlaps in current methodologies. As an updated sanity check, we release a new
-medical benchmark--CareQA--, with both open and closed variants. Finally, we
-propose a novel metric for open-ended evaluations --Relaxed Perplexity-- to
-mitigate the identified limitations.
+With a statistical analysis of arXiv paper abstracts, we report a marked drop
+in the frequency of several words previously identified as overused by ChatGPT,
+such as "delve", starting soon after they were pointed out in early 2024. The
+frequency of certain other words favored by ChatGPT, such as "significant", has
+instead kept increasing. These phenomena suggest that some authors of academic
+papers have adapted their use of large language models (LLMs), for example, by
+selecting outputs or applying modifications to the LLM-generated content. Such
+coevolution and cooperation of humans and LLMs thus introduce additional
+challenges to the detection of machine-generated text in real-world scenarios.
+Estimating the impact of LLMs on academic writing by examining word frequency
+remains feasible, and more attention should be paid to words that were already
+frequently employed, including those that have decreased in frequency.
 
-摘要：當前大型語言模型 (LLM) 基準通常基於開放式或封閉式問答評量，避免了人力需求。封閉式測量評估回應的事實性，但缺乏表達力。開放式測量捕捉模型產生論述回應的能力，但較難評估正確性。這兩種方法通常獨立或合併使用，儘管它們之間的關係仍然知之甚少。這項工作專注於醫療保健領域，在該領域中，事實性和論述都非常重要。它引入了一個全面的多軸套件，用於醫療保健 LLM 評量，探索開放式和封閉式基準和指標之間的關聯性。研究結果包括當前方法中的盲點和重疊。作為更新的健全性檢查，我們發布了一個新的醫療基準--CareQA--，包含開放式和封閉式變體。最後，我們提出了一個用於開放式評量的全新指標--放鬆困惑度--以減輕已識別的限制。
+摘要：透過對 arXiv 論文摘要進行統計分析，我們報告了幾個先前被認為 ChatGPT 過度使用的詞彙的頻率大幅下降，例如「深入探討」，從 2024 年初被指出後不久就開始下降。相反地，ChatGPT 偏好的某些其他詞彙，例如「顯著」，頻率持續增加。這些現象表明，一些學術論文作者已經調整了他們使用大型語言模型 (LLM) 的方式，例如，透過選擇輸出或對 LLM 生成的內容進行修改。因此，人類和 LLM 的這種共同演化和合作為在現實世界場景中偵測機器產生的文字帶來了額外的挑戰。透過檢視詞彙頻率來評估 LLM 對學術寫作的影響仍然可行，並且應該對已經頻繁使用的詞彙給予更多關注，包括那些頻率下降的詞彙。
 
-##### **Few-Shot Classification and Anatomical Localization of Tissues in SPECT Imaging**
-2502.06632v1 by Mohammed Abdul Hafeez Khan, Samuel Morries Boddepalli, Siddhartha Bhattacharyya, Debasis Mitra
+##### **SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models**
+2502.09604v1 by Yung-Sung Chuang, Benjamin Cohen-Wang, Shannon Zejiang Shen, Zhaofeng Wu, Hu Xu, Xi Victoria Lin, James Glass, Shang-Wen Li, Wen-tau Yih
 
-Accurate classification and anatomical localization are essential for
-effective medical diagnostics and research, which may be efficiently performed
-using deep learning techniques. However, availability of limited labeled data
-poses a significant challenge. To address this, we adapted Prototypical
-Networks and the Propagation-Reconstruction Network (PRNet) for few-shot
-classification and localization, respectively, in Single Photon Emission
-Computed Tomography (SPECT) images. For the proof of concept we used a
-2D-sliced image cropped around heart. The Prototypical Network, with a
-pre-trained ResNet-18 backbone, classified ventricles, myocardium, and liver
-tissues with 96.67% training and 93.33% validation accuracy. PRNet, adapted for
-2D imaging with an encoder-decoder architecture and skip connections, achieved
-a training loss of 1.395, accurately reconstructing patches and capturing
-spatial relationships. These results highlight the potential of Prototypical
-Networks for tissue classification with limited labeled data and PRNet for
-anatomical landmark localization, paving the way for improved performance in
-deep learning frameworks.
+We introduce SelfCite, a novel self-supervised approach that aligns LLMs to
+generate high-quality, fine-grained, sentence-level citations for the
+statements in their generated responses. Instead of only relying on costly and
+labor-intensive annotations, SelfCite leverages a reward signal provided by the
+LLM itself through context ablation: If a citation is necessary, removing the
+cited text from the context should prevent the same response; if sufficient,
+retaining the cited text alone should preserve the same response. This reward
+can guide the inference-time best-of-N sampling strategy to improve citation
+quality significantly, as well as be used in preference optimization to
+directly fine-tune the models for generating better citations. The
+effectiveness of SelfCite is demonstrated by increasing citation F1 up to 5.3
+points on the LongBench-Cite benchmark across five long-form question answering
+tasks.
 
-摘要：精確的分類和解剖定位對於有效的醫療診斷和研究至關重要，而這可以使用深度學習技術有效執行。然而，標記資料有限的取得會造成重大的挑戰。為了解決這個問題，我們分別調整了原型網路和傳播重建網路 (PRNet)，用於單光子發射電腦斷層掃描 (SPECT) 影像中的少量分類和定位。為了證明這個概念，我們使用圍繞心臟裁切的 2D 切片影像。原型網路，使用預先訓練的 ResNet-18 主幹，對心室、心肌和肝臟組織進行分類，訓練準確度為 96.67%，驗證準確度為 93.33%。PRNet，調整為使用編碼器解碼器架構和跳躍連接的 2D 影像，達到了 1.395 的訓練損失，精確地重建了區塊並擷取了空間關係。這些結果突出了原型網路在標記資料有限的情況下進行組織分類的潛力，以及 PRNet 在解剖標誌定位方面的潛力，為深度學習架構中效能的提升鋪平了道路。
+摘要：我們介紹 SelfCite，一種新穎的自監督方法，它將 LLM 對齊以針對其生成回應中的陳述生成高品質、細粒度、句子級別的引用。SelfCite 不僅依賴於昂貴且勞動密集的註解，還利用 LLM 本身通過上下文消融提供的獎勵信號：如果需要引用，從上下文中移除被引用的文字應當會阻止相同的回應；如果足夠，僅保留被引用的文字應當會保留相同的回應。此獎勵可以引導推理時間最佳 N 個取樣策略以顯著改善引文品質，並用於偏好最佳化以直接微調模型以生成更好的引文。SelfCite 的有效性通過在五個長篇問答任務中將 LongBench-Cite 基準上的引文 F1 提高多達 5.3 點來證明。
 
-##### **Illegal Waste Detection in Remote Sensing Images: A Case Study**
-2502.06607v2 by Federico Gibellini, Piero Fraternali, Giacomo Boracchi, Luca Morandini, Andrea Diecidue, Simona Malegori
+##### **CoT-Valve: Length-Compressible Chain-of-Thought Tuning**
+2502.09601v1 by Xinyin Ma, Guangnian Wan, Runpeng Yu, Gongfan Fang, Xinchao Wang
 
-Environmental crime currently represents the third largest criminal activity
-worldwide while threatening ecosystems as well as human health. Among the
-crimes related to this activity, improper waste management can nowadays be
-countered more easily thanks to the increasing availability and decreasing cost
-of Very-High-Resolution Remote Sensing images, which enable semi-automatic
-territory scanning in search of illegal landfills. This paper proposes a
-pipeline, developed in collaboration with professionals from a local
-environmental agency, for detecting candidate illegal dumping sites leveraging
-a classifier of Remote Sensing images. To identify the best configuration for
-such classifier, an extensive set of experiments was conducted and the impact
-of diverse image characteristics and training settings was thoroughly analyzed.
-The local environmental agency was then involved in an experimental exercise
-where outputs from the developed classifier were integrated in the experts'
-everyday work, resulting in time savings with respect to manual
-photo-interpretation. The classifier was eventually run with valuable results
-on a location outside of the training area, highlighting potential for
-cross-border applicability of the proposed pipeline.
+Chain-of-Thought significantly enhances a model's reasoning capability, but
+it also comes with a considerable increase in inference costs due to long
+chains. With the observation that the reasoning path can be easily compressed
+under easy tasks but struggle on hard tasks, we explore the feasibility of
+elastically controlling the length of reasoning paths with only one model,
+thereby reducing the inference overhead of reasoning models dynamically based
+on task difficulty. We introduce a new tuning and inference strategy named
+CoT-Valve, designed to allow models to generate reasoning chains of varying
+lengths. To achieve this, we propose to identify a direction in the parameter
+space that, when manipulated, can effectively control the length of generated
+CoT. Moreover, we show that this property is valuable for compressing the
+reasoning chain. We construct datasets with chains from long to short for the
+same questions and explore two enhanced strategies for CoT-Valve: (1) a precise
+length-compressible CoT tuning method, and (2) a progressive chain length
+compression approach. Our experiments show that CoT-Valve successfully enables
+controllability and compressibility of the chain and shows better performance
+than the prompt-based control. We applied this method to QwQ-32B-Preview,
+reducing reasoning chains on GSM8K from 741 to 225 tokens with a minor
+performance drop (95.07% to 94.92%) and on AIME from 6827 to 4629 tokens, with
+only one additional incorrect answer.
 
-摘要：環境犯罪目前是全球第三大犯罪活動，威脅生態系統和人類健康。在與此活動相關的犯罪中，不當廢物管理現在可以更容易地得到解決，這要歸功於超高解析度遙測影像越來越普及且成本下降，這使得半自動領土掃描能夠搜尋非法垃圾掩埋場。本文提出了一條管道，與當地環境機構的專業人士合作開發，用於檢測候選非法傾倒地點，利用遙測影像分類器。為了找出這種分類器的最佳配置，進行了一系列廣泛的實驗，並徹底分析了不同影像特徵和訓練設定的影響。然後，當地環境機構參與了一項實驗練習，其中將已開發分類器的輸出整合到專家的日常工作中，從而節省了人工照片解譯的時間。最後在訓練區域外的某個位置執行分類器，獲得了有價值的結果，突出了所提出管道的跨境適用性潛力。
+摘要：<paragraph>連續思考大幅提升了模型的推理能力，但由於鏈條過長，也大幅增加了推理成本。由於觀察到推理路徑在簡單的任務中可以輕易壓縮，但在困難的任務中卻很吃力，我們探索了僅使用一個模型彈性控制推理路徑長度的可行性，從而根據任務難度動態減少推理模型的推理開銷。我們引入了一種名為 CoT-Valve 的新調校和推理策略，旨在讓模型產生長度不一的推理鏈。為此，我們提議在參數空間中識別一個方向，在操作時可以有效控制生成的 CoT 的長度。此外，我們展示了此屬性對於壓縮推理鏈是有價值的。我們構造了從長到短的鏈條的資料集，用於相同的問題，並探索了 CoT-Valve 的兩種增強策略：(1) 精確的長度可壓縮 CoT 調校方法，以及 (2) 漸進式鏈長壓縮方法。我們的實驗表明，CoT-Valve 成功地實現了鏈條的可控性和可壓縮性，並顯示出比基於提示的控制更好的效能。我們將此方法應用於 QwQ-32B-Preview，將 GSM8K 上的推理鏈條從 741 個代幣減少到 225 個代幣，效能僅略微下降 (95.07% 至 94.92%)，而在 AIME 上從 6827 個代幣減少到 4629 個代幣，只多了一個錯誤答案。</paragraph>
 
-##### **FEMBA: Efficient and Scalable EEG Analysis with a Bidirectional Mamba Foundation Model**
-2502.06438v1 by Anna Tegon, Thorir Mar Ingolfsson, Xiaying Wang, Luca Benini, Yawei Li
+##### **Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs**
+2502.09597v1 by Siyan Zhao, Mingyi Hong, Yang Liu, Devamanyu Hazarika, Kaixiang Lin
 
-Accurate and efficient electroencephalography (EEG) analysis is essential for
-detecting seizures and artifacts in long-term monitoring, with applications
-spanning hospital diagnostics to wearable health devices. Robust EEG analytics
-have the potential to greatly improve patient care. However, traditional deep
-learning models, especially Transformer-based architectures, are hindered by
-their quadratic time and memory complexity, making them less suitable for
-resource-constrained environments. To address these challenges, we present
-FEMBA (Foundational EEG Mamba + Bidirectional Architecture), a novel
-self-supervised framework that establishes new efficiency benchmarks for EEG
-analysis through bidirectional state-space modeling. Unlike Transformer-based
-models, which incur quadratic time and memory complexity, FEMBA scales linearly
-with sequence length, enabling more scalable and efficient processing of
-extended EEG recordings. Trained on over 21,000 hours of unlabeled EEG and
-fine-tuned on three downstream tasks, FEMBA achieves competitive performance in
-comparison with transformer models, with significantly lower computational
-cost. Specifically, it reaches 81.82% balanced accuracy (0.8921 AUROC) on TUAB
-and 0.949 AUROC on TUAR, while a tiny 7.8M-parameter variant demonstrates
-viability for resource-constrained devices. These results pave the way for
-scalable, general-purpose EEG analytics in both clinical and highlight FEMBA as
-a promising candidate for wearable applications.
+Large Language Models (LLMs) are increasingly used as chatbots, yet their
+ability to personalize responses to user preferences remains limited. We
+introduce PrefEval, a benchmark for evaluating LLMs' ability to infer, memorize
+and adhere to user preferences in a long-context conversational setting.
+PrefEval comprises 3,000 manually curated user preference and query pairs
+spanning 20 topics. PrefEval contains user personalization or preference
+information in both explicit and implicit forms, and evaluates LLM performance
+using a generation and a classification task. With PrefEval, we evaluated the
+aforementioned preference following capabilities of 10 open-source and
+proprietary LLMs in multi-session conversations with varying context lengths up
+to 100k tokens. We benchmark with various prompting, iterative feedback, and
+retrieval-augmented generation methods. Our benchmarking effort reveals that
+state-of-the-art LLMs face significant challenges in proactively following
+users' preferences during conversations. In particular, in zero-shot settings,
+preference following accuracy falls below 10% at merely 10 turns (~3k tokens)
+across most evaluated models. Even with advanced prompting and retrieval
+methods, preference following still deteriorates in long-context conversations.
+Furthermore, we show that fine-tuning on PrefEval significantly improves
+performance. We believe PrefEval serves as a valuable resource for measuring,
+understanding, and enhancing LLMs' preference following abilities, paving the
+way for personalized conversational agents. Our code and dataset are available
+at https://prefeval.github.io/.
 
-摘要：準確且有效的腦電圖 (EEG) 分析對於偵測長時間監控中的癲癇發作和偽像至關重要，其應用範圍涵蓋醫院診斷到可穿戴式健康裝置。穩健的 EEG 分析具有大幅改善病患照護的潛力。然而，傳統深度學習模型，特別是基於 Transformer 的架構，受到其二次時間和記憶體複雜度的阻礙，使其不太適合資源受限的環境。為了應對這些挑戰，我們提出 FEMBA (基礎 EEG Mamba + 雙向架構)，一種創新的自我監督架構，透過雙向狀態空間建模為 EEG 分析建立新的效率基準。與會產生二次時間和記憶體複雜度的基於 Transformer 的模型不同，FEMBA 隨著序列長度線性縮放，支援更具可擴充性和效率的延伸 EEG 記錄處理。FEMBA 在超過 21,000 小時的未標記 EEG 上訓練並在三個下游任務上進行微調，與Transformer模型相比，在計算成本顯著降低的情況下，實現了具有競爭力的效能。具體來說，它在 TUAB 上達到 81.82% 的平衡準確度 (0.8921 AUROC) 和在 TUAR 上達到 0.949 AUROC，而一個微小的 7.8M 參數變體證明了其在資源受限裝置上的可行性。這些結果為臨床和可穿戴應用中可擴充的通用 EEG 分析鋪平了道路，並突顯 FEMBA 是可穿戴應用中一個有前景的候選者。
+摘要：大型語言模型（LLM）正日益被用作聊天機器人，但它們根據使用者偏好個人化回應的能力仍然有限。我們引入了 PrefEval，一個用於評估 LLM 在長時間對話環境中推論、記憶和遵守使用者偏好的能力的基準。PrefEval 包含 3,000 個手動策劃的使用者偏好和查詢對，涵蓋 20 個主題。PrefEval 包含以明確和隱含形式表達的使用者個人化或偏好資訊，並使用生成和分類任務評估 LLM 效能。透過 PrefEval，我們評估了 10 個開源和專有 LLM 在多重對話中上述的偏好追蹤能力，對話內容長度最高達 100k 個符號。我們使用各種提示、迭代回饋和檢索增強生成方法進行基準測試。我們的基準測試工作顯示，最先進的 LLM 在對話中主動追蹤使用者偏好時面臨重大挑戰。特別是在零次學習設定中，在多數評估模型中，在僅 10 個回合（約 3k 個符號）時，偏好追蹤準確度低於 10%。即使使用進階提示和檢索方法，在長時間對話中偏好追蹤仍然會惡化。此外，我們展示了在 PrefEval 上進行微調會大幅改善效能。我們相信 PrefEval 可作為衡量、理解和提升 LLM 偏好追蹤能力的寶貴資源，為個人化對話代理鋪路。我們的程式碼和資料集可在 https://prefeval.github.io/ 取得。
 
-##### **Is an Ultra Large Natural Image-Based Foundation Model Superior to a Retina-Specific Model for Detecting Ocular and Systemic Diseases?**
-2502.06289v1 by Qingshan Hou, Yukun Zhou, Jocelyn Hui Lin Goh, Ke Zou, Samantha Min Er Yew, Sahana Srinivasan, Meng Wang, Thaddaeus Lo, Xiaofeng Lei, Siegfried K. Wagner, Mark A. Chia, Dawei Yang, Hongyang Jiang, AnRan Ran, Rui Santos, Gabor Mark Somfai, Juan Helen Zhou, Haoyu Chen, Qingyu Chen, Carol Yim-Lui Cheung, Pearse A. Keane, Yih Chung Tham
+##### **KIMAs: A Configurable Knowledge Integrated Multi-Agent System**
+2502.09596v1 by Zitao Li, Fei Wei, Yuexiang Xie, Dawei Gao, Weirui Kuang, Zhijian Ma, Bingchen Qian, Yaliang Li, Bolin Ding
 
-The advent of foundation models (FMs) is transforming medical domain. In
-ophthalmology, RETFound, a retina-specific FM pre-trained sequentially on 1.4
-million natural images and 1.6 million retinal images, has demonstrated high
-adaptability across clinical applications. Conversely, DINOv2, a
-general-purpose vision FM pre-trained on 142 million natural images, has shown
-promise in non-medical domains. However, its applicability to clinical tasks
-remains underexplored. To address this, we conducted head-to-head evaluations
-by fine-tuning RETFound and three DINOv2 models (large, base, small) for ocular
-disease detection and systemic disease prediction tasks, across eight
-standardized open-source ocular datasets, as well as the Moorfields AlzEye and
-the UK Biobank datasets. DINOv2-large model outperformed RETFound in detecting
-diabetic retinopathy (AUROC=0.850-0.952 vs 0.823-0.944, across three datasets,
-all P<=0.007) and multi-class eye diseases (AUROC=0.892 vs. 0.846, P<0.001). In
-glaucoma, DINOv2-base model outperformed RETFound (AUROC=0.958 vs 0.940,
-P<0.001). Conversely, RETFound achieved superior performance over all DINOv2
-models in predicting heart failure, myocardial infarction, and ischaemic stroke
-(AUROC=0.732-0.796 vs 0.663-0.771, all P<0.001). These trends persisted even
-with 10% of the fine-tuning data. These findings showcase the distinct
-scenarios where general-purpose and domain-specific FMs excel, highlighting the
-importance of aligning FM selection with task-specific requirements to optimise
-clinical performance.
+Knowledge-intensive conversations supported by large language models (LLMs)
+have become one of the most popular and helpful applications that can assist
+people in different aspects. Many current knowledge-intensive applications are
+centered on retrieval-augmented generation (RAG) techniques. While many
+open-source RAG frameworks facilitate the development of RAG-based
+applications, they often fall short in handling practical scenarios complicated
+by heterogeneous data in topics and formats, conversational context management,
+and the requirement of low-latency response times. This technical report
+presents a configurable knowledge integrated multi-agent system, KIMAs, to
+address these challenges. KIMAs features a flexible and configurable system for
+integrating diverse knowledge sources with 1) context management and query
+rewrite mechanisms to improve retrieval accuracy and multi-turn conversational
+coherency, 2) efficient knowledge routing and retrieval, 3) simple but
+effective filter and reference generation mechanisms, and 4) optimized
+parallelizable multi-agent pipeline execution. Our work provides a scalable
+framework for advancing the deployment of LLMs in real-world settings. To show
+how KIMAs can help developers build knowledge-intensive applications with
+different scales and emphases, we demonstrate how we configure the system to
+three applications already running in practice with reliable performance.
 
-摘要：基礎模型 (FM) 的出現正在轉變醫療領域。在眼科，RETFound 是一個視網膜專用 FM，依序使用 140 萬張自然影像和 160 萬張視網膜影像進行預訓練，已展現出高度適應性，可應用於各種臨床應用。相反地，DINOv2 是一個通用視覺 FM，使用 1.42 億張自然影像進行預訓練，已展現出在非醫療領域的潛力。然而，其在臨床任務中的適用性仍未被充分探索。為了解決這個問題，我們針對眼部疾病偵測和全身性疾病預測任務，對 RETFound 和三個 DINOv2 模型（大型、基礎、小型）進行微調，並進行一對一的評估，使用八個標準化的開源眼科資料集，以及 Moorfields AlzEye 和 UK Biobank 資料集。DINOv2 大型模型在糖尿病視網膜病變偵測方面優於 RETFound（三個資料集的 AUROC=0.850-0.952，相較於 0.823-0.944，所有 P<=0.007）和多類眼部疾病（AUROC=0.892，相較於 0.846，P<0.001）。在青光眼方面，DINOv2 基礎模型優於 RETFound（AUROC=0.958，相較於 0.940，P<0.001）。相反地，RETFound 在預測心臟衰竭、心肌梗塞和缺血性中風方面優於所有 DINOv2 模型（AUROC=0.732-0.796，相較於 0.663-0.771，所有 P<0.001）。即使使用 10% 的微調資料，這些趨勢仍然持續。這些發現展示了通用和領域專用 FM 各自擅長的場景，突顯了根據任務特定需求調整 FM 選擇，以最佳化臨床表現的重要性。
+摘要：由大型語言模型 (LLM) 支持的知識密集型對話
+已成為最受歡迎且有用的應用程式之一，可協助
+人們在不同面向獲得協助。許多當前的知識密集型應用程式
+都以檢索增強生成 (RAG) 技術為中心。雖然許多
+開放原始碼 RAG 架構促進了基於 RAG 的應用程式開發，但它們在處理
+主題和格式中異質資料、對話內容管理，以及低延遲回應時間的要求所造成的實際情況時，通常力有未逮。這份技術報告
+提出了可設定的知識整合多重代理系統，KIMAs，以
+解決這些挑戰。KIMAs 具備靈活且可設定的系統，可整合多樣化的知識來源，並具備 1) 內容管理和查詢
+改寫機制，以提升檢索準確度和多輪對話的連貫性，2) 有效的知識路由和檢索，3) 簡單但
+有效的篩選和參考產生機制，以及 4) 最佳化的可平行化多重代理管線執行。我們的作品提供了可擴充的
+架構，以推動在實際環境中部署 LLM。為了展示 KIMAs 如何協助開發人員建置不同規模和重點的知識密集型應用程式，我們示範如何設定系統至
+三個已實際執行且效能良好的應用程式。
 
-##### **Integrating Sequence and Image Modeling in Irregular Medical Time Series Through Self-Supervised Learning**
-2502.06134v1 by Liuqing Chen, Shuhong Xiao, Shixian Ding, Shanhai Hu, Lingyun Sun
+##### **Logical forms complement probability in understanding language model (and human) performance**
+2502.09589v1 by Yixuan Wang, Freda Shi
 
-Medical time series are often irregular and face significant missingness,
-posing challenges for data analysis and clinical decision-making. Existing
-methods typically adopt a single modeling perspective, either treating series
-data as sequences or transforming them into image representations for further
-classification. In this paper, we propose a joint learning framework that
-incorporates both sequence and image representations. We also design three
-self-supervised learning strategies to facilitate the fusion of sequence and
-image representations, capturing a more generalizable joint representation. The
-results indicate that our approach outperforms seven other state-of-the-art
-models in three representative real-world clinical datasets. We further
-validate our approach by simulating two major types of real-world missingness
-through leave-sensors-out and leave-samples-out techniques. The results
-demonstrate that our approach is more robust and significantly surpasses other
-baselines in terms of classification performance.
+With the increasing interest in using large language models (LLMs) for
+planning in natural language, understanding their behaviors becomes an
+important research question. This work conducts a systematic investigation of
+LLMs' ability to perform logical reasoning in natural language. We introduce a
+controlled dataset of hypothetical and disjunctive syllogisms in propositional
+and modal logic and use it as the testbed for understanding LLM performance.
+Our results lead to novel insights in predicting LLM behaviors: in addition to
+the probability of input (Gonen et al., 2023; McCoy et al., 2024), logical
+forms should be considered as orthogonal factors. In addition, we show
+similarities and differences between the logical reasoning performances of
+humans and LLMs by comparing LLM and human behavioral results.
 
-摘要：醫療時間序列通常不規則且會面臨顯著的缺失，對資料分析和臨床決策制定構成挑戰。現有方法通常採用單一建模觀點，將序列資料視為序列或將其轉換為影像表示以進行進一步分類。在本文中，我們提出了一個聯合學習架構，結合序列和影像表示。我們還設計了三種自我監督學習策略，以促進序列和影像表示的融合，捕捉更具概括性的聯合表示。結果表明，我們的做法在三個具有代表性的真實世界臨床資料集中優於其他七個最先進的模型。我們進一步通過留出感測器和留出樣本的技術模擬兩種主要的真實世界缺失類型來驗證我們的做法。結果表明，我們的做法更強大，並且在分類效能方面顯著優於其他基準。
+摘要：隨著在自然語言規劃中使用大型語言模型（LLM）的興趣日益濃厚，理解其行為已成為一項重要的研究課題。本研究對 LLM 在自然語言中執行邏輯推理的能力進行了系統性調查。我們引入了一個由假設和析取三段論組成的受控資料集，並使用它作為理解 LLM 效能的測試平台。我們的結果產生了預測 LLM 行為的新見解：除了輸入的機率（Gonen 等人，2023 年；McCoy 等人，2024 年）之外，邏輯形式應被視為正交因子。此外，我們透過比較 LLM 和人類行為結果，展示了人類和 LLM 在邏輯推理表現上的相似性和差異性。
 
-##### **Foundation Model of Electronic Medical Records for Adaptive Risk Estimation**
-2502.06124v1 by Pawel Renc, Michal K. Grzeszczyk, Nassim Oufattole, Deirdre Goode, Yugang Jia, Szymon Bieganski, Matthew B. A. McDermott, Jaroslaw Was, Anthony E. Samir, Jonathan W. Cunningham, David W. Bates, Arkadiusz Sitek
+##### **Optimizing GPT for Video Understanding: Zero-Shot Performance and Prompt Engineering**
+2502.09573v1 by Mark Beliaev, Victor Yang, Madhura Raju, Jiachen Sun, Xinghai Hu
 
-We developed the Enhanced Transformer for Health Outcome Simulation (ETHOS),
-an AI model that tokenizes patient health timelines (PHTs) from EHRs. ETHOS
-predicts future PHTs using transformer-based architectures. The Adaptive Risk
-Estimation System (ARES) employs ETHOS to compute dynamic and personalized risk
-probabilities for clinician-defined critical events. ARES incorporates a
-personalized explainability module that identifies key clinical factors
-influencing risk estimates for individual patients. ARES was evaluated on the
-MIMIC-IV v2.2 dataset in emergency department (ED) settings, benchmarking its
-performance against traditional early warning systems and machine learning
-models. We processed 299,721 unique patients from MIMIC-IV into 285,622 PHTs,
-with 60% including hospital admissions. The dataset contained over 357 million
-tokens. ETHOS outperformed benchmark models in predicting hospital admissions,
-ICU admissions, and prolonged hospital stays, achieving superior AUC scores.
-ETHOS-based risk estimates demonstrated robustness across demographic subgroups
-with strong model reliability, confirmed via calibration curves. The
-personalized explainability module provides insights into patient-specific
-factors contributing to risk. ARES, powered by ETHOS, advances predictive
-healthcare AI by providing dynamic, real-time, and personalized risk estimation
-with patient-specific explainability to enhance clinician trust. Its
-adaptability and superior accuracy position it as a transformative tool for
-clinical decision-making, potentially improving patient outcomes and resource
-allocation in emergency and inpatient settings. We release the full code at
-github.com/ipolharvard/ethos-ares to facilitate future research.
+In this study, we tackle industry challenges in video content classification
+by exploring and optimizing GPT-based models for zero-shot classification
+across seven critical categories of video quality. We contribute a novel
+approach to improving GPT's performance through prompt optimization and policy
+refinement, demonstrating that simplifying complex policies significantly
+reduces false negatives. Additionally, we introduce a new
+decomposition-aggregation-based prompt engineering technique, which outperforms
+traditional single-prompt methods. These experiments, conducted on real
+industry problems, show that thoughtful prompt design can substantially enhance
+GPT's performance without additional finetuning, offering an effective and
+scalable solution for improving video classification systems across various
+domains in industry.
 
-摘要：我們開發了增強型健康結果模擬轉換器 (ETHOS)，
-一種從電子健康紀錄 (EHR) 中將患者健康時間軸 (PHT) 標記化的 AI 模型。ETHOS
-使用基於轉換器的架構預測未來的 PHT。自適應風險評估系統 (ARES) 使用 ETHOS 計算由臨床醫生定義的危急事件的動態且個人化的風險機率。ARES 結合了個人化的可解釋性模組，可找出影響個別患者風險評估的主要臨床因素。ARES 在急診部門 (ED) 設定中針對 MIMIC-IV v2.2 資料集進行評估，並將其效能與傳統的預警系統和機器學習模型進行基準測試。我們將 299,721 位 MIMIC-IV 的獨特患者處理成 285,622 個 PHT，其中 60% 包含住院記錄。該資料集包含超過 3.57 億個標記。ETHOS 在預測住院、加護病房 (ICU) 住院和延長住院時間方面表現優於基準模型，並獲得了較高的 AUC 分數。基於 ETHOS 的風險評估顯示出跨人口統計子群的穩健性，並通過校準曲線確認了強大的模型可靠性。個人化的可解釋性模組提供了對導致風險的患者特定因素的見解。由 ETHOS 驅動的 ARES 透過提供動態、即時且個人化的風險評估，以及患者特定的可解釋性來增強臨床醫生的信任，從而推動了預測性醫療保健 AI 的發展。其適應性和卓越的準確性使其成為臨床決策制定的一種變革性工具，有可能改善緊急和住院環境中的患者結果和資源分配。我們在 github.com/ipolharvard/ethos-ares 上釋出完整程式碼，以利未來的研究。
+摘要：在這項研究中，我們透過探索和最佳化基於 GPT 的模型，來處理影片內容分類中的產業挑戰，並針對影片品質的七個關鍵類別進行零次學習分類。我們貢獻了一種透過提示最佳化和政策改善來提升 GPT 效能的新方法，證明簡化複雜政策能大幅減少假陰性。此外，我們還引入了一種新的基於分解聚合的提示工程技術，其效能優於傳統的單一提示方法。這些在真實產業問題上執行的實驗顯示，經過深思熟慮的提示設計可以在不進行額外微調的情況下大幅提升 GPT 的效能，為提升產業中各種領域的影片分類系統提供了一個有效且可擴充的解決方案。
 
-##### **Can ChatGPT Diagnose Alzheimer's Disease?**
-2502.06907v1 by Quoc-Toan Nguyen, Linh Le, Xuan-The Tran, Thomas Do, Chin-Teng Lin
+##### **MorphNLI: A Stepwise Approach to Natural Language Inference Using Text Morphing**
+2502.09567v1 by Vlad Andrei Negru, Robert Vacareanu, Camelia Lemnaru, Mihai Surdeanu, Rodica Potolea
 
-Can ChatGPT diagnose Alzheimer's Disease (AD)? AD is a devastating
-neurodegenerative condition that affects approximately 1 in 9 individuals aged
-65 and older, profoundly impairing memory and cognitive function. This paper
-utilises 9300 electronic health records (EHRs) with data from Magnetic
-Resonance Imaging (MRI) and cognitive tests to address an intriguing question:
-As a general-purpose task solver, can ChatGPT accurately detect AD using EHRs?
-We present an in-depth evaluation of ChatGPT using a black-box approach with
-zero-shot and multi-shot methods. This study unlocks ChatGPT's capability to
-analyse MRI and cognitive test results, as well as its potential as a
-diagnostic tool for AD. By automating aspects of the diagnostic process, this
-research opens a transformative approach for the healthcare system,
-particularly in addressing disparities in resource-limited regions where AD
-specialists are scarce. Hence, it offers a foundation for a promising method
-for early detection, supporting individuals with timely interventions, which is
-paramount for Quality of Life (QoL).
+We introduce MorphNLI, a modular step-by-step approach to natural language
+inference (NLI). When classifying the premise-hypothesis pairs into
+{entailment, contradiction, neutral}, we use a language model to generate the
+necessary edits to incrementally transform (i.e., morph) the premise into the
+hypothesis. Then, using an off-the-shelf NLI model we track how the entailment
+progresses with these atomic changes, aggregating these intermediate labels
+into a final output. We demonstrate the advantages of our proposed method
+particularly in realistic cross-domain settings, where our method always
+outperforms strong baselines with improvements up to 12.6% (relative). Further,
+our proposed approach is explainable as the atomic edits can be used to
+understand the overall NLI label.
 
-摘要：ChatGPT 能否診斷出阿茲海默症 (AD)？AD 是一種毀滅性的神經退化性疾病，影響約 1/9 的 65 歲及以上人士，嚴重損害記憶力和認知功能。這篇論文利用了 9300 份電子健康紀錄 (EHR)，其中包含磁共振成像 (MRI) 和認知測試的數據，來解決一個有趣的問題：作為一個通用任務解決器，ChatGPT 能否使用 EHR 準確地檢測出 AD？我們使用黑盒方法對 ChatGPT 進行了深入評估，採用零次嘗試和多次嘗試的方法。這項研究揭示了 ChatGPT 分析 MRI 和認知測試結果的能力，以及其作為 AD 診斷工具的潛力。通過自動化診斷過程的各個方面，這項研究為醫療保健系統開啟了一種變革性的方法，特別是在解決資源有限的地區中 AD 專家稀缺的不平等問題方面。因此，它為一種有希望的早期檢測方法奠定了基礎，通過及時干預來支持個人，這對於生活品質 (QoL) 至關重要。
+摘要：我們引入 MorphNLI，一種模組化逐步方法，用於自然語言推論 (NLI)。當對前提假設對進行分類時，我們使用語言模型來產生必要的編輯，以逐步轉換（即，變形）前提成為假設。然後，使用現成的 NLI 模型，我們追蹤推論如何隨著這些原子變化而進展，將這些中間標籤彙總成最終輸出。我們展示了我們提出的方法的優點，特別是在現實的跨網域設置中，我們的模型始終優於強大的基線，改進幅度高達 12.6%（相對）。此外，我們提出的方法是可以解釋的，因為原子編輯可以用來理解整體 NLI 標籤。
 
-##### **Protecting Intellectual Property of EEG-based Neural Networks with Watermarking**
-2502.05931v1 by Ahmed Abdelaziz, Ahmed Fathi, Ahmed Fares
+##### **Zero-shot generation of synthetic neurosurgical data with large language models**
+2502.09566v1 by Austin A. Barr, Eddie Guo, Emre Sezgin
 
-EEG-based neural networks, pivotal in medical diagnosis and brain-computer
-interfaces, face significant intellectual property (IP) risks due to their
-reliance on sensitive neurophysiological data and resource-intensive
-development. Current watermarking methods, particularly those using abstract
-trigger sets, lack robust authentication and fail to address the unique
-challenges of EEG models. This paper introduces a cryptographic wonder
-filter-based watermarking framework tailored for EEG-based neural networks.
-Leveraging collision-resistant hashing and public-key encryption, the wonder
-filter embeds the watermark during training, ensuring minimal distortion ($\leq
-5\%$ drop in EEG task accuracy) and high reliability (100\% watermark
-detection). The framework is rigorously evaluated against adversarial attacks,
-including fine-tuning, transfer learning, and neuron pruning. Results
-demonstrate persistent watermark retention, with classification accuracy for
-watermarked states remaining above 90\% even after aggressive pruning, while
-primary task performance degrades faster, deterring removal attempts. Piracy
-resistance is validated by the inability to embed secondary watermarks without
-severe accuracy loss ( $>10\%$ in EEGNet and CCNN models). Cryptographic
-hashing ensures authentication, reducing brute-force attack success
-probabilities. Evaluated on the DEAP dataset across models (CCNN, EEGNet,
-TSception), the method achieves $>99.4\%$ null-embedding accuracy, effectively
-eliminating false positives. By integrating wonder filters with EEG-specific
-adaptations, this work bridges a critical gap in IP protection for
-neurophysiological models, offering a secure, tamper-proof solution for
-healthcare and biometric applications. The framework's robustness against
-adversarial modifications underscores its potential to safeguard sensitive EEG
-models while maintaining diagnostic utility.
+Clinical data is fundamental to advance neurosurgical research, but access is
+often constrained by data availability, small sample sizes, privacy
+regulations, and resource-intensive preprocessing and de-identification
+procedures. Synthetic data offers a potential solution to challenges associated
+with accessing and using real-world data (RWD). This study aims to evaluate the
+capability of zero-shot generation of synthetic neurosurgical data with a large
+language model (LLM), GPT-4o, by benchmarking with the conditional tabular
+generative adversarial network (CTGAN). Synthetic datasets were compared to
+real-world neurosurgical data to assess fidelity (means, proportions,
+distributions, and bivariate correlations), utility (ML classifier performance
+on RWD), and privacy (duplication of records from RWD). The GPT-4o-generated
+datasets matched or exceeded CTGAN performance, despite no fine-tuning or
+access to RWD for pre-training. Datasets demonstrated high univariate and
+bivariate fidelity to RWD without directly exposing any real patient records,
+even at amplified sample size. Training an ML classifier on GPT-4o-generated
+data and testing on RWD for a binary prediction task showed an F1 score (0.706)
+with comparable performance to training on the CTGAN data (0.705) for
+predicting postoperative functional status deterioration. GPT-4o demonstrated a
+promising ability to generate high-fidelity synthetic neurosurgical data. These
+findings also indicate that data synthesized with GPT-4o can effectively
+augment clinical data with small sample sizes, and train ML models for
+prediction of neurosurgical outcomes. Further investigation is necessary to
+improve the preservation of distributional characteristics and boost classifier
+performance.
 
-摘要：<paragraph>基於 EEG 的神經網路在醫學診斷和腦電腦介面中至關重要，由於其依賴敏感的神經生理資料和資源密集型的開發，面臨重大的智慧財產權 (IP) 風險。目前的浮水印方法，特別是那些使用抽象觸發集的方法，缺乏強健的驗證，且無法解決 EEG 模型的獨特挑戰。本文介紹了一個專為基於 EEG 的神經網路量身打造的密碼學 wonder 濾波器浮水印架構。利用抗碰撞雜湊和公開金鑰加密，wonder 濾波器在訓練期間嵌入浮水印，確保最小的失真（EEG 任務準確度下降 $\leq 5\%$）和高可靠性（100% 浮水印檢測）。該架構針對對抗性攻擊進行了嚴格的評估，包括微調、遷移學習和神經元剪枝。結果證明了持續的浮水印保留，即使在激進的剪枝後，浮水印狀態的分類準確度仍保持在 90% 以上，而主要任務的性能下降得更快，阻止了移除嘗試。盜版抵抗力通過無法嵌入次要浮水印而得到驗證，而不會造成嚴重的準確度損失（在 EEGNet 和 CCNN 模型中 $>10\%$）。密碼學雜湊確保驗證，降低了暴力攻擊成功機率。在 DEAP 資料集上針對模型（CCNN、EEGNet、TSception）進行評估，該方法達到了 $>99.4\%$ 的空嵌入準確度，有效地消除了假陽性。透過將 wonder 濾波器與 EEG 特定的適應相整合，這項工作彌補了神經生理模型 IP 保護中的關鍵差距，為醫療保健和生物特徵應用提供了一個安全、防篡改的解決方案。該架構對抗敵對修改的強健性突顯了其在維護診斷效用的同時保護敏感 EEG 模型的潛力。</paragraph>
+摘要：<paragraph>臨床數據是推進神經外科研究的基礎，但訪問通常受到數據可用性、樣本量小、隱私法規以及資源密集型預處理和去識別程序的限制。合成數據為與存取和使用真實世界數據 (RWD) 相關的挑戰提供了潛在解決方案。本研究旨在評估使用大型語言模型 (LLM) GPT-4o 零次生成合成神經外科數據的能力，並通過條件表格生成對抗網路 (CTGAN) 進行基準測試。將合成數據集與真實世界的神經外科數據進行比較，以評估保真度（平均值、比例、分布和二元相關性）、實用性（RWD 上的 ML 分類器性能）和隱私（RWD 中記錄的重複）。儘管沒有微調或訪問 RWD 進行預訓練，但 GPT-4o 生成的數據集與 CTGAN 性能相匹配或超過 CTGAN 性能。數據集證明了對 RWD 的高單變量和二變量保真度，即使在擴充的樣本量下也不會直接公開任何真實患者記錄。在 GPT-4o 生成的數據上訓練 ML 分類器，並在 RWD 上測試二元預測任務，顯示 F1 分數 (0.706) 與在 CTGAN 數據上訓練以預測術後功能狀態惡化時的性能相當 (0.705)。GPT-4o 展示了生成高保真合成神經外科數據的潛力。這些發現還表明，使用 GPT-4o 合成的數據可以有效地增加樣本量小的臨床數據，並訓練 ML 模型以預測神經外科結果。需要進一步研究以改善分佈特徵的保留並提升分類器性能。</paragraph>
 
-##### **Enhancing Depression Detection with Chain-of-Thought Prompting: From Emotion to Reasoning Using Large Language Models**
-2502.05879v1 by Shiyu Teng, Jiaqing Liu, Rahul Kumar Jain, Shurong Chai, Ruibo Hou, Tomoko Tateyama, Lanfen Lin, Yen-wei Chen
+##### **MDCrow: Automating Molecular Dynamics Workflows with Large Language Models**
+2502.09565v1 by Quintina Campbell, Sam Cox, Jorge Medina, Brittany Watterson, Andrew D. White
 
-Depression is one of the leading causes of disability worldwide, posing a
-severe burden on individuals, healthcare systems, and society at large. Recent
-advancements in Large Language Models (LLMs) have shown promise in addressing
-mental health challenges, including the detection of depression through
-text-based analysis. However, current LLM-based methods often struggle with
-nuanced symptom identification and lack a transparent, step-by-step reasoning
-process, making it difficult to accurately classify and explain mental health
-conditions. To address these challenges, we propose a Chain-of-Thought
-Prompting approach that enhances both the performance and interpretability of
-LLM-based depression detection. Our method breaks down the detection process
-into four stages: (1) sentiment analysis, (2) binary depression classification,
-(3) identification of underlying causes, and (4) assessment of severity. By
-guiding the model through these structured reasoning steps, we improve
-interpretability and reduce the risk of overlooking subtle clinical indicators.
-We validate our method on the E-DAIC dataset, where we test multiple
-state-of-the-art large language models. Experimental results indicate that our
-Chain-of-Thought Prompting technique yields superior performance in both
-classification accuracy and the granularity of diagnostic insights, compared to
-baseline approaches.
+Molecular dynamics (MD) simulations are essential for understanding
+biomolecular systems but remain challenging to automate. Recent advances in
+large language models (LLM) have demonstrated success in automating complex
+scientific tasks using LLM-based agents. In this paper, we introduce MDCrow, an
+agentic LLM assistant capable of automating MD workflows. MDCrow uses
+chain-of-thought over 40 expert-designed tools for handling and processing
+files, setting up simulations, analyzing the simulation outputs, and retrieving
+relevant information from literature and databases. We assess MDCrow's
+performance across 25 tasks of varying required subtasks and difficulty, and we
+evaluate the agent's robustness to both difficulty and prompt style.
+\texttt{gpt-4o} is able to complete complex tasks with low variance, followed
+closely by \texttt{llama3-405b}, a compelling open-source model. While prompt
+style does not influence the best models' performance, it has significant
+effects on smaller models.
 
-摘要：憂鬱症是全球殘障的主要原因之一，對個人、醫療保健系統和整個社會造成嚴重負擔。大型語言模型 (LLM) 的最新進展已展現出解決心理健康挑戰的希望，包括透過基於文字的分析來偵測憂鬱症。然而，現有的基於 LLM 的方法通常難以辨識細微的症狀，而且缺乏透明且逐步的推理過程，這使得準確分類和解釋心理健康狀況變得困難。為了應對這些挑戰，我們提出了一種思考鏈提示方法，它增強了基於 LLM 的憂鬱症偵測的效能和可解釋性。我們的這項方法將偵測過程分解為四個階段：(1) 情緒分析，(2) 二元憂鬱症分類，(3) 找出潛在原因，以及 (4) 評估嚴重程度。透過引導模型完成這些結構化的推理步驟，我們提升了可解釋性，並降低了忽略細微臨床指標的風險。我們在 E-DAIC 資料集上驗證了我們的這項方法，並在其中測試了多種最先進的大型語言模型。實驗結果顯示，與基線方法相比，我們的思考鏈提示技術在分類準確度和診斷見解的精細度方面都表現出優異的效能。
+摘要：分子動力學 (MD) 模擬對於理解生物分子系統至關重要，但自動化仍然具有挑戰性。大型語言模型 (LLM) 的最新進展已證明使用基於 LLM 的代理自動化複雜的科學任務是成功的。在本文中，我們介紹了 MDCrow，這是一個代理 LLM 助理，能夠自動化 MD 工作流程。MDCrow 使用 40 多種專家設計的工具的思考鏈來處理和處理檔案、設定模擬、分析模擬輸出，以及從文獻和資料庫中檢索相關資訊。我們評估了 MDCrow 在 25 項任務中的表現，這些任務所需的子任務和難度各不相同，並且我們評估了代理對難度和提示樣式的穩健性。\texttt{gpt-4o} 能夠以低變異完成複雜的任務，緊隨其後的是一個引人注目的開源模型 \texttt{llama3-405b}。雖然提示樣式不會影響最佳模型的效能，但它對較小的模型有顯著的影響。
 
-##### **LLMs for Drug-Drug Interaction Prediction: A Comprehensive Comparison**
-2502.06890v1 by Gabriele De Vito, Filomena Ferrucci, Athanasios Angelakis
+##### **EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents**
+2502.09560v1 by Rui Yang, Hanyang Chen, Junyu Zhang, Mark Zhao, Cheng Qian, Kangrui Wang, Qineng Wang, Teja Venkat Koripella, Marziyeh Movahedi, Manling Li, Heng Ji, Huan Zhang, Tong Zhang
 
-The increasing volume of drug combinations in modern therapeutic regimens
-needs reliable methods for predicting drug-drug interactions (DDIs). While
-Large Language Models (LLMs) have revolutionized various domains, their
-potential in pharmaceutical research, particularly in DDI prediction, remains
-largely unexplored. This study thoroughly investigates LLMs' capabilities in
-predicting DDIs by uniquely processing molecular structures (SMILES), target
-organisms, and gene interaction data as raw text input from the latest DrugBank
-dataset. We evaluated 18 different LLMs, including proprietary models (GPT-4,
-Claude, Gemini) and open-source variants (from 1.5B to 72B parameters), first
-assessing their zero-shot capabilities in DDI prediction. We then fine-tuned
-selected models (GPT-4, Phi-3.5 2.7B, Qwen-2.5 3B, Gemma-2 9B, and Deepseek R1
-distilled Qwen 1.5B) to optimize their performance. Our comprehensive
-evaluation framework included validation across 13 external DDI datasets,
-comparing against traditional approaches such as l2-regularized logistic
-regression. Fine-tuned LLMs demonstrated superior performance, with Phi-3.5
-2.7B achieving a sensitivity of 0.978 in DDI prediction, with an accuracy of
-0.919 on balanced datasets (50% positive, 50% negative cases). This result
-represents an improvement over both zero-shot predictions and state-of-the-art
-machine-learning methods used for DDI prediction. Our analysis reveals that
-LLMs can effectively capture complex molecular interaction patterns and cases
-where drug pairs target common genes, making them valuable tools for practical
-applications in pharmaceutical research and clinical settings.
+Leveraging Multi-modal Large Language Models (MLLMs) to create embodied
+agents offers a promising avenue for tackling real-world tasks. While
+language-centric embodied agents have garnered substantial attention,
+MLLM-based embodied agents remain underexplored due to the lack of
+comprehensive evaluation frameworks. To bridge this gap, we introduce
+EmbodiedBench, an extensive benchmark designed to evaluate vision-driven
+embodied agents. EmbodiedBench features: (1) a diverse set of 1,128 testing
+tasks across four environments, ranging from high-level semantic tasks (e.g.,
+household) to low-level tasks involving atomic actions (e.g., navigation and
+manipulation); and (2) six meticulously curated subsets evaluating essential
+agent capabilities like commonsense reasoning, complex instruction
+understanding, spatial awareness, visual perception, and long-term planning.
+Through extensive experiments, we evaluated 13 leading proprietary and
+open-source MLLMs within EmbodiedBench. Our findings reveal that: MLLMs excel
+at high-level tasks but struggle with low-level manipulation, with the best
+model, GPT-4o, scoring only 28.9% on average. EmbodiedBench provides a
+multifaceted standardized evaluation platform that not only highlights existing
+challenges but also offers valuable insights to advance MLLM-based embodied
+agents. Our code is available at https://embodiedbench.github.io.
 
-摘要：<paragraph>現代治療方案中藥物組合的數量越來越多，需要可靠的方法來預測藥物間交互作用 (DDI)。儘管大型語言模型 (LLM) 已在各個領域掀起革命，它們在藥物研究中的潛力，特別是在 DDI 預測中的潛力，仍未得到充分探索。本研究通過獨特地處理分子結構 (SMILES)、目標生物和基因交互資料作為來自最新 DrugBank 資料集的原始文字輸入，徹底調查了 LLM 在預測 DDI 中的能力。我們評估了 18 種不同的 LLM，包括專有模型（GPT-4、Claude、Gemini）和開源變體（從 1.5B 到 72B 參數），首先評估它們在 DDI 預測中的零次學習能力。然後，我們微調選定的模型（GPT-4、Phi-3.5 2.7B、Qwen-2.5 3B、Gemma-2 9B 和 Deepseek R1 蒸餾 Qwen 1.5B）以最佳化其效能。我們的全面評估框架包括跨 13 個外部 DDI 資料集進行驗證，並與傳統方法（例如 l2 正則化邏輯迴歸）進行比較。微調後的 LLM 表現出優異的效能，其中 Phi-3.5 2.7B 在 DDI 預測中達到 0.978 的靈敏度，在平衡資料集（50% 正例，50% 反例）上的準確度為 0.919。此結果優於零次學習預測和用於 DDI 預測的最新機器學習方法。我們的分析表明，LLM 可以有效捕捉複雜的分子交互模式和藥物對靶向共同基因的情況，使其成為藥物研究和臨床環境中實用應用的寶貴工具。</paragraph>
+摘要：<paragraph>利用多模態大型語言模型 (MLLM) 來建立具身代理，提供了解決現實世界任務的有前景途徑。儘管以語言為中心的具身代理已獲得大量關注，但由於缺乏全面的評估框架，基於 MLLM 的具身代理仍未得到充分探索。為了彌補這一差距，我們引入了 EmbodiedBench，這是一個廣泛的基準測試，旨在評估以視覺為導向的具身代理。EmbodiedBench 的特點：(1) 跨越四個環境的 1,128 項多樣化測試任務，範圍從高層級語義任務（例如，家庭）到涉及原子動作的低層級任務（例如，導航和操作）；以及 (2) 六個精心策劃的子集，用於評估基本的代理能力，例如常識推理、複雜指令理解、空間感知、視覺感知和長期規劃。通過廣泛的實驗，我們在 EmbodiedBench 中評估了 13 個領先的專有和開源 MLLM。我們的研究結果表明：MLLM 在高層級任務中表現出色，但在低層級操作中遇到困難，表現最好的模型 GPT-4o 平均得分僅為 28.9%。EmbodiedBench 提供了一個多方面的標準化評估平台，不僅突出了現有挑戰，還提供了有價值的見解來推進基於 MLLM 的具身代理。我們的程式碼可在 https://embodiedbench.github.io/ 取得。</paragraph>
 
-##### **Decoding Complexity: Intelligent Pattern Exploration with CHPDA (Context Aware Hybrid Pattern Detection Algorithm)**
-2502.07815v1 by Lokesh Koli, Shubham Kalra, Karanpreet Singh
+##### **Mind the Gap! Choice Independence in Using Multilingual LLMs for Persuasive Co-Writing Tasks in Different Languages**
+2502.09532v1 by Shreyan Biswas, Alexander Erlei, Ujwal Gadiraju
 
-Detecting sensitive data such as Personally Identifiable Information (PII)
-and Protected Health Information (PHI) is critical for data security platforms.
-This study evaluates regex-based pattern matching algorithms and exact-match
-search techniques to optimize detection speed, accuracy, and scalability. Our
-benchmarking results indicate that Google RE2 provides the best balance of
-speed (10-15 ms/MB), memory efficiency (8-16 MB), and accuracy (99.5%) among
-regex engines, outperforming PCRE while maintaining broader hardware
-compatibility than Hyperscan. For exact matching, Aho-Corasick demonstrated
-superior performance (8 ms/MB) and scalability for large datasets. Performance
-analysis revealed that regex processing time scales linearly with dataset size
-and pattern complexity. A hybrid AI + Regex approach achieved the highest F1
-score (91. 6%) by improving recall and minimizing false positives. Device
-benchmarking confirmed that our solution maintains efficient CPU and memory
-usage on both high-performance and mid-range systems. Despite its
-effectiveness, challenges remain, such as limited multilingual support and the
-need for regular pattern updates. Future work should focus on expanding
-language coverage, integrating data security and privacy management (DSPM) with
-data loss prevention (DLP) tools, and enhancing regulatory compliance for
-broader global adoption.
+Recent advances in generative AI have precipitated a proliferation of novel
+writing assistants. These systems typically rely on multilingual large language
+models (LLMs), providing globalized workers the ability to revise or create
+diverse forms of content in different languages. However, there is substantial
+evidence indicating that the performance of multilingual LLMs varies between
+languages. Users who employ writing assistance for multiple languages are
+therefore susceptible to disparate output quality. Importantly, recent research
+has shown that people tend to generalize algorithmic errors across independent
+tasks, violating the behavioral axiom of choice independence. In this paper, we
+analyze whether user utilization of novel writing assistants in a charity
+advertisement writing task is affected by the AI's performance in a second
+language. Furthermore, we quantify the extent to which these patterns translate
+into the persuasiveness of generated charity advertisements, as well as the
+role of peoples' beliefs about LLM utilization in their donation choices. Our
+results provide evidence that writers who engage with an LLM-based writing
+assistant violate choice independence, as prior exposure to a Spanish LLM
+reduces subsequent utilization of an English LLM. While these patterns do not
+affect the aggregate persuasiveness of the generated advertisements, people's
+beliefs about the source of an advertisement (human versus AI) do. In
+particular, Spanish-speaking female participants who believed that they read an
+AI-generated advertisement strongly adjusted their donation behavior downwards.
+Furthermore, people are generally not able to adequately differentiate between
+human-generated and LLM-generated ads. Our work has important implications for
+the design, development, integration, and adoption of multilingual LLMs as
+assistive agents -- particularly in writing tasks.
 
-摘要：偵測個人身分資訊 (PII) 和受保護健康資訊 (PHI) 等敏感資料，對於資料安全平台至關重要。本研究評估基於 regex 的模式配對演算法和精確配對搜尋技術，以最佳化偵測速度、準確度和可擴充性。我們的基準測試結果顯示，在 regex 引擎中，Google RE2 在速度 (10-15 ms/MB)、記憶體效率 (8-16 MB) 和準確度 (99.5%) 方面取得最佳平衡，優於 PCRE，同時比 Hyperscan 擁有更廣泛的硬體相容性。對於精確配對，Aho-Corasick 展現出優異的效能 (8 ms/MB) 和大資料集的可擴充性。效能分析顯示，regex 處理時間會隨著資料集大小和模式複雜度線性擴充。混合 AI + Regex 方法透過提升召回率和將假陽性降至最低，達到了最高的 F1 分數 (91. 6%)。裝置基準測試確認我們的解決方案在高性能和中階系統上都能維持高效的 CPU 和記憶體使用率。儘管有效，但仍有挑戰存在，例如多語言支援有限，以及需要定期更新模式。未來的研究應著重於擴展語言涵蓋範圍，將資料安全和隱私管理 (DSPM) 與資料遺失防護 (DLP) 工具整合，以及加強法規遵循以利更廣泛的全球採用。
+摘要：<paragraph>生成式 AI 的最新進展加速了新穎寫作助理的激增。這些系統通常依賴多語言大型語言模型 (LLM)，讓全球化的工作者能夠以不同的語言修改或建立各種形式的內容。然而，有大量證據顯示多語言 LLM 的表現因語言而異。因此，使用多語言寫作協助的使用者容易受到不同的輸出品質影響。重要的是，最近的研究顯示人們傾向於在獨立的任務中概化演算法錯誤，違反了選擇獨立性的行為公理。在本文中，我們分析使用者在慈善廣告寫作任務中使用新穎寫作助理是否會受到 AI 在第二語言中的表現影響。此外，我們量化這些模式轉化為所產生慈善廣告說服力的程度，以及人們對 LLM 使用在捐款選擇中的信念所扮演的角色。我們的結果提供證據，表明與基於 LLM 的寫作助理互動的寫作者會違反選擇獨立性，因為先前接觸過西班牙語 LLM 會減少後續使用英語 LLM 的情況。雖然這些模式不會影響所產生廣告的整體說服力，但人們對廣告來源（人類與 AI）的信念會影響。特別是，相信自己閱讀 AI 生成的廣告的西班牙語系女性參與者大幅調整了他們的捐款行為。此外，人們通常無法充分區分人類產生的廣告和 LLM 產生的廣告。我們的研究對多語言 LLM 作為輔助代理的設計、開發、整合和採用具有重要的意義，特別是在寫作任務中。</paragraph>
 
-##### **WatchGuardian: Enabling User-Defined Personalized Just-in-Time Intervention on Smartwatch**
-2502.05783v1 by Ying Lei, Yancheng Cao, Will Wang, Yuanzhe Dong, Changchang Yin, Weidan Cao, Ping Zhang, Jingzhen Yang, Bingsheng Yao, Yifan Peng, Chunhua Weng, Randy Auerbach, Lena Mamykina, Dakuo Wang, Yuntao Wang, Xuhai Xu
+##### **Diffusion Models for Molecules: A Survey of Methods and Tasks**
+2502.09511v1 by Liang Wang, Chao Song, Zhiyuan Liu, Yu Rong, Qiang Liu, Shu Wu, Liang Wang
 
-While just-in-time interventions (JITIs) have effectively targeted common
-health behaviors, individuals often have unique needs to intervene in personal
-undesirable actions that can negatively affect physical, mental, and social
-well-being. We present WatchGuardian, a smartwatch-based JITI system that
-empowers users to define custom interventions for these personal actions with a
-small number of samples. For the model to detect new actions based on limited
-new data samples, we developed a few-shot learning pipeline that finetuned a
-pre-trained inertial measurement unit (IMU) model on public hand-gesture
-datasets. We then designed a data augmentation and synthesis process to train
-additional classification layers for customization. Our offline evaluation with
-26 participants showed that with three, five, and ten examples, our approach
-achieved an average accuracy of 76.8%, 84.7%, and 87.7%, and an F1 score of
-74.8%, 84.2%, and 87.2% We then conducted a four-hour intervention study to
-compare WatchGuardian against a rule-based intervention. Our results
-demonstrated that our system led to a significant reduction by 64.0 +- 22.6% in
-undesirable actions, substantially outperforming the baseline by 29.0%. Our
-findings underscore the effectiveness of a customizable, AI-driven JITI system
-for individuals in need of behavioral intervention in personal undesirable
-actions. We envision that our work can inspire broader applications of
-user-defined personalized intervention with advanced AI solutions.
+Generative tasks about molecules, including but not limited to molecule
+generation, are crucial for drug discovery and material design, and have
+consistently attracted significant attention. In recent years, diffusion models
+have emerged as an impressive class of deep generative models, sparking
+extensive research and leading to numerous studies on their application to
+molecular generative tasks. Despite the proliferation of related work, there
+remains a notable lack of up-to-date and systematic surveys in this area.
+Particularly, due to the diversity of diffusion model formulations, molecular
+data modalities, and generative task types, the research landscape is
+challenging to navigate, hindering understanding and limiting the area's
+growth. To address this, this paper conducts a comprehensive survey of
+diffusion model-based molecular generative methods. We systematically review
+the research from the perspectives of methodological formulations, data
+modalities, and task types, offering a novel taxonomy. This survey aims to
+facilitate understanding and further flourishing development in this area. The
+relevant papers are summarized at:
+https://github.com/AzureLeon1/awesome-molecular-diffusion-models.
 
-摘要：<paragraph>雖然即時介入（JITIs）有效地針對常見的健康行為，但個人通常有獨特的需求來介入可能會對身心和社會福祉產生負面影響的個人不良行為。我們提出 WatchGuardian，這是一個基於智慧手錶的 JITI 系統，它使用少數樣本讓使用者能夠為這些個人行為定義自訂介入措施。為了讓模型根據有限的新資料樣本偵測新行為，我們開發了一個小樣本學習管道，微調了公共手勢資料集上的預訓練慣性測量單元（IMU）模型。然後，我們設計了一個資料擴充和合成流程，以訓練其他分類層以進行自訂。我們對 26 位參與者進行的離線評估顯示，我們的做法使用三個、五個和十個範例，達到了 76.8%、84.7% 和 87.7% 的平均準確度，以及 74.8%、84.2% 和 87.2% 的 F1 分數。然後，我們進行了一項為時四小時的介入研究，以將 WatchGuardian 與基於規則的介入進行比較。我們的結果表明，我們的系統導致不良行為顯著減少了 64.0 +- 22.6%，大幅優於基線 29.0%。我們的研究結果強調了可自訂、AI 驅動的 JITI 系統對需要行為介入以應對個人不良行為的個人的有效性。我們預計我們的研究可以激勵使用者定義個人化介入的更廣泛應用，並採用先進的 AI 解決方案。</paragraph>
+摘要：<paragraph>包括但不限於分子生成在內的分子生成任務，對於藥物發現和材料設計至關重要，並持續吸引大量關注。近年來，擴散模型已成為深度生成模型中令人印象深刻的一類，激發了廣泛的研究，並導致對其應用於分子生成任務的眾多研究。儘管相關工作不斷增加，但這個領域仍然缺乏最新的系統性綜述。特別是，由於擴散模型公式、分子數據方式和生成任務類型的多樣性，研究領域難以瀏覽，阻礙了理解並限制了該領域的發展。為了解決這個問題，本文對基於擴散模型的分子生成方法進行了全面的調查。我們從方法論公式、數據方式和任務類型的角度系統性地回顧了研究，提供了一種新穎的分類法。本調查旨在促進理解並進一步促進該領域的蓬勃發展。相關論文總結如下：
+https://github.com/AzureLeon1/awesome-molecular-diffusion-models。</paragraph>
 
-##### **RECOVER: Designing a Large Language Model-based Remote Patient Monitoring System for Postoperative Gastrointestinal Cancer Care**
-2502.05740v1 by Ziqi Yang, Yuxuan Lu, Jennifer Bagdasarian, Vedant Das Swain, Ritu Agarwal, Collin Campbell, Waddah Al-Refaire, Jehan El-Bayoumi, Guodong Gao, Dakuo Wang, Bingsheng Yao, Nawar Shara
+##### **AttentionSmithy: A Modular Framework for Rapid Transformer Development and Customization**
+2502.09503v1 by Caleb Cranney, Jesse G. Meyer
 
-Cancer surgery is a key treatment for gastrointestinal (GI) cancers, a group
-of cancers that account for more than 35% of cancer-related deaths worldwide,
-but postoperative complications are unpredictable and can be life-threatening.
-In this paper, we investigate how recent advancements in large language models
-(LLMs) can benefit remote patient monitoring (RPM) systems through clinical
-integration by designing RECOVER, an LLM-powered RPM system for postoperative
-GI cancer care. To closely engage stakeholders in the design process, we first
-conducted seven participatory design sessions with five clinical staff and
-interviewed five cancer patients to derive six major design strategies for
-integrating clinical guidelines and information needs into LLM-based RPM
-systems. We then designed and implemented RECOVER, which features an
-LLM-powered conversational agent for cancer patients and an interactive
-dashboard for clinical staff to enable efficient postoperative RPM. Finally, we
-used RECOVER as a pilot system to assess the implementation of our design
-strategies with four clinical staff and five patients, providing design
-implications by identifying crucial design elements, offering insights on
-responsible AI, and outlining opportunities for future LLM-powered RPM systems.
+Transformer architectures have transformed AI applications but remain complex
+to customize for domain experts lacking low-level implementation expertise. We
+introduce AttentionSmithy, a modular software package that simplifies
+transformer innovation by breaking down key components into reusable building
+blocks: attention modules, feed-forward networks, normalization layers, and
+positional encodings. Users can rapidly prototype and evaluate transformer
+variants without extensive coding. Our framework supports four positional
+encoding strategies and integrates with neural architecture search for
+automated design. We validate AttentionSmithy by replicating the original
+transformer under resource constraints and optimizing translation performance
+by combining positional encodings. Additionally, we demonstrate its
+adaptability in gene-specific modeling, achieving over 95% accuracy in cell
+type classification. These case studies highlight AttentionSmithy's potential
+to accelerate research across diverse fields by removing framework
+implementation barriers.
 
-摘要：癌症手術是胃腸道 (GI) 癌症的主要治療方式，這類癌症佔全球癌症相關死亡人數的 35% 以上，但術後併發症無法預測，且可能危及生命。在本文中，我們探討大型語言模型 (LLM) 的近期進展如何透過臨床整合造福遠端病患監控 (RPM) 系統，方法是設計 RECOVER，一個由 LLM 驅動的 RPM 系統，用於術後胃腸道癌症照護。為了讓利害關係人密切參與設計流程，我們首先與五位臨床人員進行七場參與式設計會議，並訪談五位癌症患者，以找出六項整合臨床指南和資訊需求至基於 LLM 的 RPM 系統的主要設計策略。接著，我們設計並實作 RECOVER，其特色在於一個由 LLM 驅動的對話式代理人，供癌症患者使用，以及一個互動式儀表板，供臨床人員使用，以進行有效的術後 RPM。最後，我們使用 RECOVER 作為試點系統，與四位臨床人員和五位患者評估我們設計策略的實作，並透過找出重要的設計元素、提供對負責任 AI 的見解，以及概述未來由 LLM 驅動的 RPM 系統的機會，提出設計意涵。
+摘要：Transformer 架構已轉變 AI 應用，但對於缺乏低階實作專業知識的領域專家而言，自訂仍很複雜。我們推出 AttentionSmithy，這是一個模組化軟體套件，透過將關鍵元件分解成可重複使用的建構區塊（注意力模組、前饋網路、正規化層和位置編碼）來簡化 Transformer 創新。使用者可以快速建置原型和評估 Transformer 變體，而無需大量編碼。我們的架構支援四種位置編碼策略，並整合神經架構搜尋以進行自動化設計。我們透過在資源限制下複製原始 Transformer 和結合位置編碼來最佳化翻譯效能，驗證 AttentionSmithy。此外，我們展示其在基因特定建模中的適應性，在細胞類型分類中達到超過 95% 的準確度。這些案例研究突顯 AttentionSmithy 在移除架構實作障礙後，加速各個領域研究的潛力。
 
-##### **4D VQ-GAN: Synthesising Medical Scans at Any Time Point for Personalised Disease Progression Modelling of Idiopathic Pulmonary Fibrosis**
-2502.05713v1 by An Zhao, Moucheng Xu, Ahmed H. Shahin, Wim Wuyts, Mark G. Jones, Joseph Jacob, Daniel C. Alexander
+##### **Improve LLM-based Automatic Essay Scoring with Linguistic Features**
+2502.09497v1 by Zhaoyi Joey Hou, Alejandro Ciuba, Xiang Lorraine Li
 
-Understanding the progression trajectories of diseases is crucial for early
-diagnosis and effective treatment planning. This is especially vital for
-life-threatening conditions such as Idiopathic Pulmonary Fibrosis (IPF), a
-chronic, progressive lung disease with a prognosis comparable to many cancers.
-Computed tomography (CT) imaging has been established as a reliable diagnostic
-tool for IPF. Accurately predicting future CT scans of early-stage IPF patients
-can aid in developing better treatment strategies, thereby improving survival
-outcomes. In this paper, we propose 4D Vector Quantised Generative Adversarial
-Networks (4D-VQ-GAN), a model capable of generating realistic CT volumes of IPF
-patients at any time point. The model is trained using a two-stage approach. In
-the first stage, a 3D-VQ-GAN is trained to reconstruct CT volumes. In the
-second stage, a Neural Ordinary Differential Equation (ODE) based temporal
-model is trained to capture the temporal dynamics of the quantised embeddings
-generated by the encoder in the first stage. We evaluate different
-configurations of our model for generating longitudinal CT scans and compare
-the results against ground truth data, both quantitatively and qualitatively.
-For validation, we conduct survival analysis using imaging biomarkers derived
-from generated CT scans and achieve a C-index comparable to that of biomarkers
-derived from the real CT scans. The survival analysis results demonstrate the
-potential clinical utility inherent to generated longitudinal CT scans, showing
-that they can reliably predict survival outcomes.
+Automatic Essay Scoring (AES) assigns scores to student essays, reducing the
+grading workload for instructors. Developing a scoring system capable of
+handling essays across diverse prompts is challenging due to the flexibility
+and diverse nature of the writing task. Existing methods typically fall into
+two categories: supervised feature-based approaches and large language model
+(LLM)-based methods. Supervised feature-based approaches often achieve higher
+performance but require resource-intensive training. In contrast, LLM-based
+methods are computationally efficient during inference but tend to suffer from
+lower performance. This paper combines these approaches by incorporating
+linguistic features into LLM-based scoring. Experimental results show that this
+hybrid method outperforms baseline models for both in-domain and out-of-domain
+writing prompts.
 
-摘要：了解疾病的進程軌跡對於早期診斷和有效的治療計畫至關重要。這對於特發性肺纖維化 (IPF) 等威脅生命的疾病尤其重要，IPF 是一種慢性、進行性肺部疾病，其預後與許多癌症相當。電腦斷層掃描 (CT) 影像已被確立為 IPF 的可靠診斷工具。準確預測早期 IPF 患者的未來 CT 掃描有助於制定更好的治療策略，從而改善存活結果。在本文中，我們提出 4D 向量量化生成對抗網路 (4D-VQ-GAN)，這是一個模型，能夠在任何時間點生成 IPF 患者的逼真 CT 體積。該模型使用兩階段方法進行訓練。在第一階段，訓練 3D-VQ-GAN 以重建 CT 體積。在第二階段，訓練基於神經常微分方程 (ODE) 的時間模型，以捕捉第一階段編碼器生成的量化嵌入的時間動態。我們評估了我們的模型的不同配置，以生成縱向 CT 掃描，並在定量和定性方面將結果與真實數據進行比較。為了驗證，我們使用從生成的 CT 掃描中得出的影像生物標記進行存活分析，並獲得與從真實 CT 掃描中得出的生物標記相當的 C 指數。存活分析結果證明了生成縱向 CT 掃描固有的潛在臨床效用，表明它們可以可靠地預測存活結果。
+摘要：自動化論文評分 (AES) 會為學生的論文評分，以減輕教師的評分工作負擔。由於寫作任務的靈活性與多樣性，開發一種評分系統來處理各種提示的論文是一項挑戰。現有方法通常分為兩類：監督式特徵方法和大型語言模型 (LLM) 方法。監督式特徵方法通常能達到較高的效能，但需要大量資源進行訓練。相比之下，LLM 方法在推論期間的計算效率很高，但效能往往較低。本文結合了這些方法，將語言特徵納入 LLM 評分中。實驗結果顯示，這種混合方法在領域內和領域外寫作提示方面都優於基準模型。
 
-##### **KMI: A Dataset of Korean Motivational Interviewing Dialogues for Psychotherapy**
-2502.05651v1 by Hyunjong Kim, Suyeon Lee, Yeongjae Cho, Eunseo Ryu, Yohan Jo, Suran Seong, Sungzoon Cho
+##### **Cracking the Code: Enhancing Development finance understanding with artificial intelligence**
+2502.09495v1 by Pierre Beaucoral
 
-The increasing demand for mental health services has led to the rise of
-AI-driven mental health chatbots, though challenges related to privacy, data
-collection, and expertise persist. Motivational Interviewing (MI) is gaining
-attention as a theoretical basis for boosting expertise in the development of
-these chatbots. However, existing datasets are showing limitations for training
-chatbots, leading to a substantial demand for publicly available resources in
-the field of MI and psychotherapy. These challenges are even more pronounced in
-non-English languages, where they receive less attention. In this paper, we
-propose a novel framework that simulates MI sessions enriched with the
-expertise of professional therapists. We train an MI forecaster model that
-mimics the behavioral choices of professional therapists and employ Large
-Language Models (LLMs) to generate utterances through prompt engineering. Then,
-we present KMI, the first synthetic dataset theoretically grounded in MI,
-containing 1,000 high-quality Korean Motivational Interviewing dialogues.
-Through an extensive expert evaluation of the generated dataset and the
-dialogue model trained on it, we demonstrate the quality, expertise, and
-practicality of KMI. We also introduce novel metrics derived from MI theory in
-order to evaluate dialogues from the perspective of MI.
+Analyzing development projects is crucial for understanding donors aid
+strategies, recipients priorities, and to assess development finance capacity
+to adress development issues by on-the-ground actions. In this area, the
+Organisation for Economic Co-operation and Developments (OECD) Creditor
+Reporting System (CRS) dataset is a reference data source. This dataset
+provides a vast collection of project narratives from various sectors
+(approximately 5 million projects). While the OECD CRS provides a rich source
+of information on development strategies, it falls short in informing project
+purposes due to its reporting process based on donors self-declared main
+objectives and pre-defined industrial sectors. This research employs a novel
+approach that combines Machine Learning (ML) techniques, specifically Natural
+Language Processing (NLP), an innovative Python topic modeling technique called
+BERTopic, to categorise (cluster) and label development projects based on their
+narrative descriptions. By revealing existing yet hidden topics of development
+finance, this application of artificial intelligence enables a better
+understanding of donor priorities and overall development funding and provides
+methods to analyse public and private projects narratives.
 
-摘要：由於對心理健康服務的需求日益增加，導致以人工智慧為基礎的心理健康聊天機器人興起，儘管與隱私、資料蒐集和專業知識相關的挑戰依然存在。動機性訪談 (MI) 正作為提升這些聊天機器人在開發方面專業知識的理論基礎而備受關注。然而，現有的資料集顯示出訓練聊天機器人的限制，導致對 MI 和心理治療領域中公開可用資源的需求大幅增加。這些挑戰在非英語語言中更加明顯，因為它們受到的關注較少。在本文中，我們提出了一個新穎的架構，它模擬了豐富專業治療師專業知識的 MI 課程。我們訓練了一個 MI 預測模型，它模擬了專業治療師的行為選擇，並採用大型語言模型 (LLM) 透過提示工程來產生話語。然後，我們展示了 KMI，這是第一個理論上以 MI 為基礎的合成資料集，其中包含 1,000 個高品質的韓語動機性訪談對話。透過對所產生的資料集和在該資料集上訓練的對話模型進行廣泛的專家評估，我們展示了 KMI 的品質、專業知識和實用性。我們還引入了從 MI 理論中衍生的新指標，以便從 MI 的角度評估對話。
+摘要：分析發展專案對於了解捐助者援助策略、受贈者優先事項，以及評估發展資金能力以透過實際行動解決發展問題至關重要。在這個領域中，經濟合作暨發展組織 (OECD) 債權人報告系統 (CRS) 資料集是一個參考資料來源。此資料集提供來自各個部門的大量專案敘述（約 500 萬個專案）。雖然 OECD CRS 提供了豐富的發展策略資訊來源，但由於其報告程序基於捐助者自行申報的主要目標和預先定義的產業部門，因此在告知專案目的方面有所不足。本研究採用一種新穎的方法，結合機器學習 (ML) 技術，特別是自然語言處理 (NLP)，一種稱為 BERTopic 的創新 Python 主題建模技術，根據其敘述描述對發展專案進行分類（叢集）和標籤。透過揭露發展資金現有但隱藏的主題，這種人工智慧應用程式可以更好地了解捐助者的優先事項和整體發展資金，並提供分析公共和私人專案敘述的方法。
 
-##### **ELMTEX: Fine-Tuning Large Language Models for Structured Clinical Information Extraction. A Case Study on Clinical Reports**
-2502.05638v1 by Aynur Guluzade, Naguib Heiba, Zeyd Boukhers, Florim Hamiti, Jahid Hasan Polash, Yehya Mohamad, Carlos A Velasco
+##### **Objective quantification of mood states using large language models**
+2502.09487v1 by Jakub Onysk, Quentin Huys
 
-Europe's healthcare systems require enhanced interoperability and
-digitalization, driving a demand for innovative solutions to process legacy
-clinical data. This paper presents the results of our project, which aims to
-leverage Large Language Models (LLMs) to extract structured information from
-unstructured clinical reports, focusing on patient history, diagnoses,
-treatments, and other predefined categories. We developed a workflow with a
-user interface and evaluated LLMs of varying sizes through prompting strategies
-and fine-tuning. Our results show that fine-tuned smaller models match or
-surpass larger counterparts in performance, offering efficiency for
-resource-limited settings. A new dataset of 60,000 annotated English clinical
-summaries and 24,000 German translations was validated with automated and
-manual checks. The evaluations used ROUGE, BERTScore, and entity-level metrics.
-The work highlights the approach's viability and outlines future improvements.
+Emotional states influence human behaviour and cognition, leading to diverse
+thought trajectories. Similarly, Large Language Models (LLMs) showcase an
+excellent level of response consistency across wide-ranging contexts (prompts).
+We leverage these parallels to establish a framework for quantifying mental
+states. Our approach utilises self-report questionnaires that reliably assess
+these states due to their inherent sensitivity to patterns of co-occurring
+responses. Specifically, we recruited a large sample of participants (N=422) to
+investigate how well an LLM (Mistral-7B-OpenOrca) quantifies a heterogenous set
+of depressive mood states measured with participants' open-ended responses to a
+depression questionnaire. We show LLM responses to held-out multiple-choice
+questions, given participants' open-ended answers, correlate strongly (r:
+0.52-0.84) with true questionnaire scores, demonstrating LLM's generalisation
+from mood representations. We explore a link between these representations and
+factor analysis. Using ridge regression, we find depression-related subspaces
+within LLM hidden states. We show these subspaces to be predictive of
+participants' "Depression" and "Somatic & Emotional Distress" factor scores, as
+well as suicidality severity. Overall, LLMs can provide quantitative measures
+of mental states. The reliability of these hinges upon how informative the
+questions we ask participants are. Used correctly, this approach could
+supplement mental state assessment in a variety of settings.
 
-摘要：歐洲的醫療保健系統需要增強互通性和數位化，這驅動了對創新解決方案的需求，以處理傳統的臨床數據。本文介紹了我們專案的成果，該專案旨在利用大型語言模型 (LLM) 從非結構化的臨床報告中提取結構化的資訊，重點放在病歷、診斷、治療和其他預定義類別上。我們開發了一個具有使用者介面的工作流程，並透過提示策略和微調來評估不同規模的 LLM。我們的結果顯示，微調後的較小模型在效能上與較大的模型相匹配或超越它們，為資源有限的環境提供了效率。一個包含 60,000 個註解英文臨床摘要和 24,000 個德文翻譯的新資料集已透過自動化和手動檢查進行驗證。評估使用了 ROUGE、BERTScore 和實體層級的指標。這項工作突出了這種方法的可行性，並概述了未來的改進。
+摘要：情緒狀態會影響人類行為和認知，導致不同的思維軌跡。同樣地，大型語言模型 (LLM) 在廣泛的脈絡（提示）中展示出極佳的反應一致性。我們利用這些相似之處來建立一個量化心理狀態的框架。我們的做法利用自我報告問卷，由於這些問卷對共生反應模式具有內在敏感性，因此可以可靠地評估這些狀態。具體來說，我們招募了大量的參與者樣本 (N=422) 來調查 LLM (Mistral-7B-OpenOrca) 如何量化一組異質的抑鬱情緒狀態，這些狀態是根據參與者對抑鬱症問卷的開放式回答來衡量的。我們展示了 LLM 對保留的多選題的回答，給定參與者的開放式回答，與真正的問卷分數密切相關 (r：0.52-0.84)，這證明了 LLM 從情緒表徵中進行概括。我們探索這些表徵與因子分析之間的聯繫。使用嶺回歸，我們在 LLM 隱藏狀態內發現了與抑鬱相關的子空間。我們展示這些子空間可以預測參與者的「抑鬱」和「軀體和情緒困擾」因子分數，以及自殺嚴重性。總體而言，LLM 可以提供心理狀態的量化測量。這些測量的可靠性取決於我們詢問參與者的問題的資訊性。如果使用得當，這種方法可以補充各種環境中的心理狀態評估。
 
-##### **Multi-scale Masked Autoencoder for Electrocardiogram Anomaly Detection**
-2502.05494v1 by Ya Zhou, Yujie Yang, Jianhuang Gan, Xiangjie Li, Jing Yuan, Wei Zhao
+##### **The Multilingual Mind : A Survey of Multilingual Reasoning in Language Models**
+2502.09457v1 by Akash Ghosh, Debayan Datta, Sriparna Saha, Chirag Agarwal
 
-Electrocardiogram (ECG) analysis is a fundamental tool for diagnosing
-cardiovascular conditions, yet anomaly detection in ECG signals remains
-challenging due to their inherent complexity and variability. We propose
-Multi-scale Masked Autoencoder for ECG anomaly detection (MMAE-ECG), a novel
-end-to-end framework that effectively captures both global and local
-dependencies in ECG data. Unlike state-of-the-art methods that rely on
-heartbeat segmentation or R-peak detection, MMAE-ECG eliminates the need for
-such pre-processing steps, enhancing its suitability for clinical deployment.
-MMAE-ECG partitions ECG signals into non-overlapping segments, with each
-segment assigned learnable positional embeddings. A novel multi-scale masking
-strategy and multi-scale attention mechanism, along with distinct positional
-embeddings, enable a lightweight Transformer encoder to effectively capture
-both local and global dependencies. The masked segments are then reconstructed
-using a single-layer Transformer block, with an aggregation strategy employed
-during inference to refine the outputs. Experimental results demonstrate that
-our method achieves performance comparable to state-of-the-art approaches while
-significantly reducing computational complexity-approximately 1/78 of the
-floating-point operations (FLOPs) required for inference. Ablation studies
-further validate the effectiveness of each component, highlighting the
-potential of multi-scale masked autoencoders for anomaly detection.
+While reasoning and multilingual capabilities in Language Models (LMs) have
+achieved remarkable progress in recent years, their integration into a unified
+paradigm, multilingual reasoning, is at a nascent stage. Multilingual reasoning
+requires language models to handle logical reasoning across languages while
+addressing misalignment, biases, and challenges in low-resource settings. This
+survey provides the first in-depth review of multilingual reasoning in LMs. In
+this survey, we provide a systematic overview of existing methods that leverage
+LMs for multilingual reasoning, specifically outlining the challenges,
+motivations, and foundational aspects of applying language models to reason
+across diverse languages. We provide an overview of the standard data resources
+used for training multilingual reasoning in LMs and the evaluation benchmarks
+employed to assess their multilingual capabilities. Next, we analyze various
+state-of-the-art methods and their performance on these benchmarks. Finally, we
+explore future research opportunities to improve multilingual reasoning in LMs,
+focusing on enhancing their ability to handle diverse languages and complex
+reasoning tasks.
 
-摘要：心電圖 (ECG) 分析是診斷心血管疾病的基本工具，但由於 ECG 訊號本身的複雜性和變異性，異常偵測仍然是一項挑戰。我們提出用於 ECG 異常偵測的多尺度遮罩自編碼器 (MMAE-ECG)，這是一個新穎的端對端架構，可有效擷取 ECG 資料中的全局和局部依賴關係。與依賴於心跳區段或 R 波峰偵測的最新方法不同，MMAE-ECG 消除了對此類前處理步驟的需求，增強其適用於臨床部署。MMAE-ECG 將 ECG 訊號分割成不相疊的區段，每個區段都指派可學習的位置嵌入。新穎的多尺度遮罩策略和多尺度注意力機制，以及不同的位置嵌入，使輕量級 Transformer 編碼器能夠有效擷取局部和全局依賴關係。然後使用單層 Transformer 區塊重建遮罩區段，並在推理期間採用聚合策略來優化輸出。實驗結果表明，我們的模型達到了與最新方法相當的效能，同時大幅降低運算複雜度，約為推理所需的浮點運算 (FLOP) 的 1/78。消融研究進一步驗證了每個組件的有效性，突顯了多尺度遮罩自編碼器在異常偵測方面的潛力。
+摘要：儘管語言模型 (LM) 的推理和多語言能力在近年來取得顯著進展，但它們整合至統一典範（多語言推理）仍處於萌芽階段。多語言推理要求語言模型跨語言處理邏輯推理，同時解決低資源環境中的錯位、偏見和挑戰。本調查提供了 LM 中多語言推理的首次深入探討。在本調查中，我們系統性地概述了現有利用 LM 進行多語言推理的方法，特別概述了將語言模型應用於跨不同語言推理的挑戰、動機和基礎方面。我們概述了用於訓練 LM 中多語言推理的標準數據資源，以及用於評估其多語言能力的評估基準。接下來，我們分析了各種最先進的方法及其在這些基準上的表現。最後，我們探討了改進 LM 中多語言推理的未來研究機會，重點關注增強其處理不同語言和複雜推理任務的能力。
 
-##### **DCENWCNet: A Deep CNN Ensemble Network for White Blood Cell Classification with LIME-Based Explainability**
-2502.05459v1 by Sibasish Dhibar
+##### **Pixel-Level Reasoning Segmentation via Multi-turn Conversations**
+2502.09447v1 by Dexian Cai, Xiaocui Yang, Yongkang Liu, Daling Wang, Shi Feng, Yifei Zhang, Soujanya Poria
 
-White blood cells (WBC) are important parts of our immune system, and they
-protect our body against infections by eliminating viruses, bacteria, parasites
-and fungi. The number of WBC types and the total number of WBCs provide
-important information about our health status. A traditional method,
-convolutional neural networks (CNN), a deep learning architecture, can classify
-the blood cell from a part of an object and perform object recognition. Various
-CNN models exhibit potential; however, their development often involves ad-hoc
-processes that neglect unnecessary layers, leading to issues with unbalanced
-datasets and insufficient data augmentation. To address these challenges, we
-propose a novel ensemble approach that integrates three CNN architectures, each
-uniquely configured with different dropout and max-pooling layer settings to
-enhance feature learning. This ensemble model, named DCENWCNet, effectively
-balances the bias-variance trade-off. When evaluated on the widely recognized
-Rabbin-WBC dataset, our model outperforms existing state-of-the-art networks,
-achieving highest mean accuracy. Additionally, it demonstrates superior
-performance in precision, recall, F1-score, and Area Under the ROC Curve (AUC)
-across all categories. To delve deeper into the interpretability of
-classifiers, we employ reliable post-hoc explanation techniques, including
-Local Interpretable Model-Agnostic Explanations (LIME). These methods
-approximate the behavior of a black-box model by elucidating the relationships
-between feature values and predictions. Interpretable results enable users to
-comprehend and validate the model's predictions, thereby increasing their
-confidence in the automated diagnosis.
+Existing visual perception systems focus on region-level segmentation in
+single-turn dialogues, relying on complex and explicit query instructions. Such
+systems cannot reason at the pixel level and comprehend dynamic user intent
+that changes over interaction. Our work tackles this issue by introducing a
+novel task, Pixel-level Reasoning Segmentation (Pixel-level RS) based on
+multi-turn conversations, tracking evolving user intent via multi-turn
+interactions for fine-grained segmentation. To establish a benchmark for this
+novel task, we build a Pixel-level ReasonIng Segmentation Dataset Based on
+Multi-Turn Conversations (PRIST), comprising 24k utterances from 8.3k
+multi-turn conversational scenarios with segmentation targets. Building on
+PRIST, we further propose MIRAS, a Multi-turn Interactive ReAsoning
+Segmentation framework, integrates pixel-level segmentation with robust
+multi-turn conversation understanding, generating pixel-grounded explanations
+aligned with user intent. The PRIST dataset and MIRSA framework fill the gap in
+pixel-level reasoning segmentation. Experimental results on the PRIST dataset
+demonstrate that our method outperforms current segmentation-specific baselines
+in terms of segmentation and LLM-based reasoning metrics. The code and data are
+available at: https://github.com/ccccai239/PixelRIST.
 
-摘要：白血球 (WBC) 是我們免疫系統的重要組成部分，它們通過清除病毒、細菌、寄生蟲和真菌來保護我們的機體免受感染。WBC 類型數量和 WBC 總數提供了有關我們健康狀況的重要資訊。傳統方法卷積神經網路 (CNN) 是一種深度學習架構，可以對物體的一部分進行血細胞分類並執行物體識別。各種 CNN 模型展現出潛力；然而，它們的開發通常涉及忽略不必要層的臨時過程，導致不平衡的資料集和資料擴充不足的問題。為了應對這些挑戰，我們提出了一種新穎的整體方法，它整合了三種 CNN 架構，每種架構都採用不同的中斷和最大池化層設定進行獨特配置，以增強特徵學習。這種名為 DCENWCNet 的整體模型有效地平衡了偏差變異取捨。在廣泛認可的 Rabbin-WBC 資料集上進行評估時，我們的模型優於現有的最先進網路，達到了最高的平均準確度。此外，它在所有類別中都展示了在精確度、召回率、F1 分數和 ROC 曲線下面積 (AUC) 方面的卓越效能。為了更深入地研究分類器的可解釋性，我們採用了可靠的事後解釋技術，包括局部可解釋模型不可知解釋 (LIME)。這些方法通過闡明特徵值和預測之間的關係來近似黑盒模型的行為。可解釋的結果使用戶能夠理解和驗證模型的預測，從而增加他們對自動化診斷的信心。
+摘要：現有的視覺感知系統專注於單輪對話中的區域級分割，依賴於複雜且明確的查詢指令。此類系統無法在像素級別推理和理解在互動中不斷變化的動態使用者意圖。我們的研究通過引入一項基於多輪對話的像素級推理分割（像素級 RS）新任務來解決此問題，通過多輪互動追蹤不斷演變的使用者意圖，以進行精細分割。為了建立此新任務的基準，我們建立了一個基於多輪對話的像素級推理分割資料集（PRIST），其中包含來自 8.3k 多輪對話場景的 24k 個語句，以及分割目標。在 PRIST 的基礎上，我們進一步提出了 MIRAS，這是一個多輪互動推理分割框架，它將像素級分割與強大的多輪對話理解整合在一起，生成符合使用者意圖的像素級解釋。PRIST 資料集和 MIRSA 框架填補了像素級推理分割的空白。在 PRIST 資料集上的實驗結果表明，我們的模型在分割和基於 LLM 的推理指標方面優於目前的特定於分割的基準。程式碼和資料可在 https://github.com/ccccai239/PixelRIST 獲得。
 
-##### **Multi-Class Segmentation of Aortic Branches and Zones in Computed Tomography Angiography: The AortaSeg24 Challenge**
-2502.05330v1 by Muhammad Imran, Jonathan R. Krebs, Vishal Balaji Sivaraman, Teng Zhang, Amarjeet Kumar, Walker R. Ueland, Michael J. Fassler, Jinlong Huang, Xiao Sun, Lisheng Wang, Pengcheng Shi, Maximilian Rokuss, Michael Baumgartner, Yannick Kirchhof, Klaus H. Maier-Hein, Fabian Isensee, Shuolin Liu, Bing Han, Bong Thanh Nguyen, Dong-jin Shin, Park Ji-Woo, Mathew Choi, Kwang-Hyun Uhm, Sung-Jea Ko, Chanwoong Lee, Jaehee Chun, Jin Sung Kim, Minghui Zhang, Hanxiao Zhang, Xin You, Yun Gu, Zhaohong Pan, Xuan Liu, Xiaokun Liang, Markus Tiefenthaler, Enrique Almar-Munoz, Matthias Schwab, Mikhail Kotyushev, Rostislav Epifanov, Marek Wodzinski, Henning Muller, Abdul Qayyum, Moona Mazher, Steven A. Niederer, Zhiwei Wang, Kaixiang Yang, Jintao Ren, Stine Sofia Korreman, Yuchong Gao, Hongye Zeng, Haoyu Zheng, Rui Zheng, Jinghua Yue, Fugen Zhou, Bo Liu, Alexander Cosman, Muxuan Liang, Chang Zhao, Gilbert R. Upchurch Jr., Jun Ma, Yuyin Zhou, Michol A. Cooper, Wei Shao
+##### **Dual Formulation for Non-Rectangular Lp Robust Markov Decision Processes**
+2502.09432v1 by Navdeep Kumar, Adarsh Gupta, Maxence Mohamed Elfatihi, Giorgia Ramponi, Kfir Yehuda Levy, Shie Mannor
 
-Multi-class segmentation of the aorta in computed tomography angiography
-(CTA) scans is essential for diagnosing and planning complex endovascular
-treatments for patients with aortic dissections. However, existing methods
-reduce aortic segmentation to a binary problem, limiting their ability to
-measure diameters across different branches and zones. Furthermore, no
-open-source dataset is currently available to support the development of
-multi-class aortic segmentation methods. To address this gap, we organized the
-AortaSeg24 MICCAI Challenge, introducing the first dataset of 100 CTA volumes
-annotated for 23 clinically relevant aortic branches and zones. This dataset
-was designed to facilitate both model development and validation. The challenge
-attracted 121 teams worldwide, with participants leveraging state-of-the-art
-frameworks such as nnU-Net and exploring novel techniques, including cascaded
-models, data augmentation strategies, and custom loss functions. We evaluated
-the submitted algorithms using the Dice Similarity Coefficient (DSC) and
-Normalized Surface Distance (NSD), highlighting the approaches adopted by the
-top five performing teams. This paper presents the challenge design, dataset
-details, evaluation metrics, and an in-depth analysis of the top-performing
-algorithms. The annotated dataset, evaluation code, and implementations of the
-leading methods are publicly available to support further research. All
-resources can be accessed at https://aortaseg24.grand-challenge.org.
+We study robust Markov decision processes (RMDPs) with non-rectangular
+uncertainty sets, which capture interdependencies across states unlike
+traditional rectangular models. While non-rectangular robust policy evaluation
+is generally NP-hard, even in approximation, we identify a powerful class of
+$L_p$-bounded uncertainty sets that avoid these complexity barriers due to
+their structural simplicity. We further show that this class can be decomposed
+into infinitely many \texttt{sa}-rectangular $L_p$-bounded sets and leverage
+its structural properties to derive a novel dual formulation for $L_p$ RMDPs.
+This formulation provides key insights into the adversary's strategy and
+enables the development of the first robust policy evaluation algorithms for
+non-rectangular RMDPs. Empirical results demonstrate that our approach
+significantly outperforms brute-force methods, establishing a promising
+foundation for future investigation into non-rectangular robust MDPs.
 
-摘要：多類別主動脈電腦斷層血管攝影 (CTA) 掃描分割對於診斷和規劃主動脈剝離患者的複雜血管內治療至關重要。然而，現有方法將主動脈分割簡化為二元問題，限制了其測量不同分支和區域直徑的能力。此外，目前沒有開放原始碼數據集可用於支援多類別主動脈分割方法的開發。為了解決此問題，我們組織了 AortaSeg24 MICCAI 挑戰，引入了第一個包含 100 個 CTA 體積的數據集，這些體積針對 23 個臨床上相關的主動脈分支和區域進行了註釋。此數據集旨在促進模型開發和驗證。該挑戰吸引了來自世界各地的 121 個團隊，參與者利用了 nnU-Net 等最先進的框架，並探索了創新技術，包括串聯模型、數據擴充策略和自訂損失函數。我們使用 Dice 相似性係數 (DSC) 和標準化表面距離 (NSD) 評估了提交的演算法，重點介紹了前五名表現最佳團隊採用的方法。本文介紹了挑戰設計、數據集詳細資訊、評估指標以及對表現最佳演算法的深入分析。已公開註釋的數據集、評估程式碼和領先方法的實作，以支援進一步的研究。所有資源都可以在 https://aortaseg24.grand-challenge.org/ 獲得。
+摘要：我們研究具有非矩形不確定性集合的強健馬可夫決策過程 (RMDP)，它能捕捉到不同於傳統矩形模型的跨狀態相互依賴性。雖然非矩形強健策略評估通常是 NP-hard，即使在近似中也是如此，我們識別了一類強大的 $L_p$ 有界不確定性集合，由於其結構的簡潔性，可以避免這些複雜性障礙。我們進一步表明，此類可以分解為無限多的 \texttt{sa} 矩形 $L_p$ 有界集合，並利用其結構屬性為 $L_p$ RMDP 導出一個新的對偶公式。此公式提供了對抗者策略的重要見解，並能夠開發出第一個非矩形 RMDP 的強健策略評估演算法。實證結果表明，我們的做法顯著優於蠻力方法，為未來對非矩形強健 MDP 的研究奠定了有希望的基礎。
 
-##### **Homeomorphism Prior for False Positive and Negative Problem in Medical Image Dense Contrastive Representation Learning**
-2502.05282v1 by Yuting He, Boyu Wang, Rongjun Ge, Yang Chen, Guanyu Yang, Shuo Li
+##### **Transformer-Enhanced Variational Autoencoder for Crystal Structure Prediction**
+2502.09423v1 by Ziyi Chen, Yang Yuan, Siming Zheng, Jialong Guo, Sihan Liang, Yangang Wang, Zongguo Wang
 
-Dense contrastive representation learning (DCRL) has greatly improved the
-learning efficiency for image-dense prediction tasks, showing its great
-potential to reduce the large costs of medical image collection and dense
-annotation. However, the properties of medical images make unreliable
-correspondence discovery, bringing an open problem of large-scale false
-positive and negative (FP&N) pairs in DCRL. In this paper, we propose GEoMetric
-vIsual deNse sImilarity (GEMINI) learning which embeds the homeomorphism prior
-to DCRL and enables a reliable correspondence discovery for effective dense
-contrast. We propose a deformable homeomorphism learning (DHL) which models the
-homeomorphism of medical images and learns to estimate a deformable mapping to
-predict the pixels' correspondence under topological preservation. It
-effectively reduces the searching space of pairing and drives an implicit and
-soft learning of negative pairs via a gradient. We also propose a geometric
-semantic similarity (GSS) which extracts semantic information in features to
-measure the alignment degree for the correspondence learning. It will promote
-the learning efficiency and performance of deformation, constructing positive
-pairs reliably. We implement two practical variants on two typical
-representation learning tasks in our experiments. Our promising results on
-seven datasets which outperform the existing methods show our great
-superiority. We will release our code on a companion link:
-https://github.com/YutingHe-list/GEMINI.
+Crystal structure forms the foundation for understanding the physical and
+chemical properties of materials. Generative models have emerged as a new
+paradigm in crystal structure prediction(CSP), however, accurately capturing
+key characteristics of crystal structures, such as periodicity and symmetry,
+remains a significant challenge. In this paper, we propose a
+Transformer-Enhanced Variational Autoencoder for Crystal Structure Prediction
+(TransVAE-CSP), who learns the characteristic distribution space of stable
+materials, enabling both the reconstruction and generation of crystal
+structures. TransVAE-CSP integrates adaptive distance expansion with
+irreducible representation to effectively capture the periodicity and symmetry
+of crystal structures, and the encoder is a transformer network based on an
+equivariant dot product attention mechanism. Experimental results on the
+carbon_24, perov_5, and mp_20 datasets demonstrate that TransVAE-CSP
+outperforms existing methods in structure reconstruction and generation tasks
+under various modeling metrics, offering a powerful tool for crystal structure
+design and optimization.
 
-摘要：密集对比表征学习（DCRL）极大地提高了影像密集预测任务的学习效率，显示出其在降低医学影像收集和密集标注的大量成本方面的巨大潜力。然而，医学影像的特性使得对应关系发现不可靠，给 DCRL 带来大规模假阳性和假阴性（FP&N）对的开放性问题。在本文中，我们提出了 GEoMetric vIsual deNse sImilarity（GEMINI）学习，它将同胚先验嵌入 DCRL 中，并针对有效密集对比提供了可靠的对应关系发现。我们提出了一种可变形同胚学习（DHL），它对医学影像的同胚进行建模，并学习估计可变形映射，以预测在拓扑保持下的像素对应关系。它有效地减少了配对的搜索空间，并通过梯度驱动了负对的隐式和软学习。我们还提出了几何语义相似性（GSS），它提取特征中的语义信息，以测量对应关系学习的对齐度。它将促进变形学习的效率和性能，可靠地构建正对。我们在实验中针对两个典型的表征学习任务实现了两个实际变体。我们在七个数据集上的有希望的结果优于现有方法，显示出我们的巨大优势。我们将在配套链接中发布我们的代码：https://github.com/YutingHe-list/GEMINI。
+摘要：晶體結構形成了解材料物理和化學性質的基礎。生成模型已成為晶體結構預測 (CSP) 的新典範，然而，準確捕捉晶體結構的關鍵特徵（例如週期性和對稱性）仍然是一項重大挑戰。在本文中，我們提出了一種用於晶體結構預測的 Transformer 增強變異自動編碼器 (TransVAE-CSP)，它學習穩定材料的特徵分佈空間，使晶體結構的重建和生成成為可能。TransVAE-CSP 將自適應距離擴展與不可約表示相結合，以有效地捕捉晶體結構的週期性和對稱性，並且編碼器是一個基於等變點積注意力機制的 Transformer 網路。在 carbon_24、perov_5 和 mp_20 資料集上的實驗結果表明，TransVAE-CSP 在各種建模指標下，在結構重建和生成任務中優於現有方法，為晶體結構設計和最佳化提供了一個強大的工具。
 
-##### **"It Felt Like I Was Left in the Dark": Exploring Information Needs and Design Opportunities for Family Caregivers of Older Adult Patients in Critical Care Settings**
-2502.05115v1 by Shihan Fu, Bingsheng Yao, Smit Desai, Yuqi Hu, Yuling Sun, Samantha Stonbraker, Yanjun Gao, Elizabeth M. Goldberg, Dakuo Wang
+##### **On multi-token prediction for efficient LLM inference**
+2502.09419v1 by Somesh Mehra, Javier Alonso Garcia, Lukas Mauch
 
-Older adult patients constitute a rapidly growing subgroup of Intensive Care
-Unit (ICU) patients. In these situations, their family caregivers are expected
-to represent the unconscious patients to access and interpret patients' medical
-information. However, caregivers currently have to rely on overloaded
-clinicians for information updates and typically lack the health literacy to
-understand complex medical information. Our project aims to explore the
-information needs of caregivers of ICU older adult patients, from which we can
-propose design opportunities to guide future AI systems. The project begins
-with formative interviews with 11 caregivers to identify their challenges in
-accessing and interpreting medical information; From these findings, we then
-synthesize design requirements and propose an AI system prototype to cope with
-caregivers' challenges. The system prototype has two key features: a timeline
-visualization to show the AI extracted and summarized older adult patients' key
-medical events; and an LLM-based chatbot to provide context-aware informational
-support. We conclude our paper by reporting on the follow-up user evaluation of
-the system and discussing future AI-based systems for ICU caregivers of older
-adults.
+We systematically investigate multi-token prediction (MTP) capabilities
+within LLMs pre-trained for next-token prediction (NTP). We first show that
+such models inherently possess MTP capabilities via numerical marginalization
+over intermediate token probabilities, though performance is data-dependent and
+improves with model scale. Furthermore, we explore the challenges of
+integrating MTP heads into frozen LLMs and find that their hidden layers are
+strongly specialized for NTP, making adaptation non-trivial. Finally, we show
+that while joint training of MTP heads with the backbone improves performance,
+it cannot fully overcome this barrier, prompting further research in this
+direction. Our findings provide a deeper understanding of MTP applied to
+pretrained LLMs, informing strategies for accelerating inference through
+parallel token prediction.
 
-摘要：老年患者構成加護病房 (ICU) 患者中快速成長的子群。在這些情況下，預期他們的家庭照護者能代表無意識的患者取得並解讀患者的醫療資訊。然而，照護者目前必須依賴工作繁重的臨床醫師提供資訊更新，而且通常缺乏了解複雜醫療資訊的健康素養。我們的專案旨在探索 ICU 老年患者照護者的資訊需求，我們可以根據這些需求提出設計機會，以引導未來的 AI 系統。這個專案從對 11 位照護者的形成性訪談開始，以找出他們在取得和解讀醫療資訊方面的挑戰；根據這些發現，我們接著綜合設計需求，並提出一個 AI 系統原型，以應對照護者的挑戰。這個系統原型具有兩個關鍵特點：一個時間軸視覺化，以顯示 AI 萃取並摘要出的老年患者關鍵醫療事件；以及一個基於 LLM 的聊天機器人，以提供情境感知的資訊支援。我們透過報告系統的後續使用者評估，以及討論未來針對老年人 ICU 照護者的 AI 系統，來總結我們的論文。
+摘要：我們系統性地研究了在預先訓練下用於下一個代幣預測 (NTP) 的 LLM 中的多代幣預測 (MTP) 功能。我們首先表明，此類模型透過中間代幣機率的數值邊際化本質上具備 MTP 功能，儘管效能依賴於資料，且會隨著模型規模而提升。此外，我們探討了將 MTP 頭整合到凍結 LLM 中的挑戰，發現其隱藏層高度專門用於 NTP，使得適應變得不簡單。最後，我們顯示，儘管 MTP 頭與主幹的聯合訓練會提升效能，但無法完全克服此障礙，促使我們進一步研究這個方向。我們的發現提供了對應用於預先訓練 LLM 的 MTP 更深入的理解，並為透過平行代幣預測加速推論提供策略。
 
-##### **Mitigating Unintended Memorization with LoRA in Federated Learning for LLMs**
-2502.05087v1 by Thierry Bossy, Julien Vignoud, Tahseen Rabbani, Juan R. Troncoso Pastoriza, Martin Jaggi
+##### **SQuARE: Sequential Question Answering Reasoning Engine for Enhanced Chain-of-Thought in Large Language Models**
+2502.09390v1 by Daniel Fleischer, Moshe Berchansky, Gad Markovits, Moshe Wasserblat
 
-Federated learning (FL) is a popular paradigm for collaborative training
-which avoids direct data exposure between clients. However, data privacy issues
-still remain: FL-trained large language models are capable of memorizing and
-completing phrases and sentences contained in training data when given with
-their prefixes. Thus, it is possible for adversarial and honest-but-curious
-clients to recover training data of other participants simply through targeted
-prompting. In this work, we demonstrate that a popular and simple fine-tuning
-strategy, low-rank adaptation (LoRA), reduces memorization during FL up to a
-factor of 10. We study this effect by performing a medical question-answering
-fine-tuning task and injecting multiple replicas of out-of-distribution
-sensitive sequences drawn from an external clinical dataset. We observe a
-reduction in memorization for a wide variety of Llama 2 and 3 models, and find
-that LoRA can reduce memorization in centralized learning as well. Furthermore,
-we show that LoRA can be combined with other privacy-preserving techniques such
-as gradient clipping and Gaussian noising, secure aggregation, and Goldfish
-loss to further improve record-level privacy while maintaining performance.
+In the rapidly evolving field of Natural Language Processing, Large Language
+Models (LLMs) are tasked with increasingly complex reasoning challenges.
+Traditional methods like chain-of-thought prompting have shown promise but
+often fall short in fully leveraging a model's reasoning capabilities. This
+paper introduces SQuARE (Sequential Question Answering Reasoning Engine), a
+novel prompting technique designed to improve reasoning through a
+self-interrogation paradigm. Building upon CoT frameworks, SQuARE prompts
+models to generate and resolve multiple auxiliary questions before tackling the
+main query, promoting a more thorough exploration of various aspects of a
+topic. Our expansive evaluations, conducted with Llama 3 and GPT-4o models
+across multiple question-answering datasets, demonstrate that SQuARE
+significantly surpasses traditional CoT prompts and existing
+rephrase-and-respond methods. By systematically decomposing queries, SQuARE
+advances LLM capabilities in reasoning tasks. The code is publicly available at
+https://github.com/IntelLabs/RAG-FiT/tree/square.
 
-摘要：聯邦學習 (FL) 是一種流行的協作訓練範例，可避免客戶端之間直接公開資料。然而，資料隱私問題仍然存在：經過 FL 訓練的大型語言模型能夠記憶並完成訓練資料中包含的片語和句子，只要給予其前綴即可。因此，對抗和誠實但好奇的客戶端有可能僅透過目標提示來恢復其他參與者的訓練資料。在這項工作中，我們證明了一種流行且簡單的微調策略，低秩適應 (LoRA)，可將 FL 期間的記憶減少多達 10 倍。我們透過執行醫學問答微調任務並注入從外部臨床資料集抽取的非分佈敏感序列的多次複製品來研究此效應。我們觀察到各種 Llama 2 和 3 模型的記憶力降低，並發現 LoRA 也能減少集中式學習中的記憶力。此外，我們展示 LoRA 可以與其他隱私保護技術結合使用，例如梯度裁剪和高斯雜訊、安全聚合和 Goldfish 損失，以進一步改善記錄級隱私，同時維持效能。
+摘要：在快速發展的自然語言處理領域中，大型語言模型 (LLM) 負責越來越複雜的推理挑戰。
+傳統方法（如思考鏈提示）已展現潛力，但通常無法充分利用模型的推理能力。本文介紹 SQuARE（順序式問答推理引擎），這是一種新穎的提示技術，旨在透過自我提問模式來改善推理。建立在 CoT 架構之上，SQuARE 提示模型在處理主要查詢之前產生並解決多個輔助問題，促進對某個主題的各個面向進行更徹底的探討。我們使用 Llama 3 和 GPT-4o 模型對多個問答資料集進行廣泛評估，結果顯示 SQuARE 明顯優於傳統 CoT 提示和現有的改寫並回應方法。透過系統性地分解查詢，SQuARE 提升了 LLM 在推理任務中的能力。程式碼已公開於 https://github.com/IntelLabs/RAG-FiT/tree/square。
 
-##### **MedMimic: Physician-Inspired Multimodal Fusion for Early Diagnosis of Fever of Unknown Origin**
-2502.04794v1 by Minrui Chen, Yi Zhou, Huidong Jiang, Yuhan Zhu, Guanjie Zou, Minqi Chen, Rong Tian, Hiroto Saigo
+##### **Truth Knows No Language: Evaluating Truthfulness Beyond English**
+2502.09387v1 by Blanca Calvo Figueras, Eneko Sagarzazu, Julen Etxaniz, Jeremy Barnes, Pablo Gamallo, Iria De Dios Flores, Rodrigo Agerri
 
-Fever of unknown origin FUO remains a diagnostic challenge. MedMimic is
-introduced as a multimodal framework inspired by real-world diagnostic
-processes. It uses pretrained models such as DINOv2, Vision Transformer, and
-ResNet-18 to convert high-dimensional 18F-FDG PET/CT imaging into
-low-dimensional, semantically meaningful features. A learnable
-self-attention-based fusion network then integrates these imaging features with
-clinical data for classification. Using 416 FUO patient cases from Sichuan
-University West China Hospital from 2017 to 2023, the multimodal fusion
-classification network MFCN achieved macro-AUROC scores ranging from 0.8654 to
-0.9291 across seven tasks, outperforming conventional machine learning and
-single-modality deep learning methods. Ablation studies and five-fold
-cross-validation further validated its effectiveness. By combining the
-strengths of pretrained large models and deep learning, MedMimic offers a
-promising solution for disease classification.
+We introduce a professionally translated extension of the TruthfulQA
+benchmark designed to evaluate truthfulness in Basque, Catalan, Galician, and
+Spanish. Truthfulness evaluations of large language models (LLMs) have
+primarily been conducted in English. However, the ability of LLMs to maintain
+truthfulness across languages remains under-explored. Our study evaluates 12
+state-of-the-art open LLMs, comparing base and instruction-tuned models using
+human evaluation, multiple-choice metrics, and LLM-as-a-Judge scoring. Our
+findings reveal that, while LLMs perform best in English and worst in Basque
+(the lowest-resourced language), overall truthfulness discrepancies across
+languages are smaller than anticipated. Furthermore, we show that
+LLM-as-a-Judge correlates more closely with human judgments than
+multiple-choice metrics, and that informativeness plays a critical role in
+truthfulness assessment. Our results also indicate that machine translation
+provides a viable approach for extending truthfulness benchmarks to additional
+languages, offering a scalable alternative to professional translation.
+Finally, we observe that universal knowledge questions are better handled
+across languages than context- and time-dependent ones, highlighting the need
+for truthfulness evaluations that account for cultural and temporal
+variability. Dataset and code are publicly available under open licenses.
 
-摘要：不明原因發燒 (FUO) 仍然是診斷上的挑戰。MedMimic 是一個多模式架構，靈感來自於真實世界的診斷過程。它使用預先訓練的模型，例如 DINOv2、視覺轉換器和 ResNet-18，將高維 18F-FDG PET/CT 影像轉換為低維、語義有意義的特徵。一個可學習的自注意力融合網路接著將這些影像特徵與臨床資料整合，用於分類。使用 2017 年至 2023 年四川大學華西醫院的 416 個 FUO 病患病例，多模式融合分類網路 MFCN 在七項任務中達到了 0.8654 到 0.9291 的巨觀 AUROC 分數，優於傳統機器學習和單一模式深度學習方法。消融研究和五倍交叉驗證進一步驗證了其有效性。MedMimic 結合了預先訓練的大模型和深度學習的優點，為疾病分類提供了一個有前景的解決方案。
+摘要：我們針對 TruthfulQA 推出專業翻譯的延伸版本，旨在評估巴斯克語、加泰隆尼亞語、加利西亞語和西班牙語中的真實性。大型語言模型 (LLM) 的真實性評估主要以英語進行。然而，LLM 在不同語言中維持真實性的能力仍未得到充分探索。我們的研究評估了 12 個最先進的開放 LLM，使用人類評估、多選項指標和 LLM 作為評分標準比較基礎和指令調整模型。我們的研究結果表明，雖然 LLM 在英語中的表現最好，而在巴斯克語（資源最少的語言）中的表現最差，但整體上不同語言之間的真實性差異小於預期。此外，我們表明，與多選項指標相比，LLM 作為評分標準與人類判斷更密切相關，而且信息豐富性在真實性評估中發揮著至關重要的作用。我們的結果還表明，機器翻譯提供了一種可行的途徑，可以將真實性基準擴展到其他語言，從而提供了一種可擴展的專業翻譯替代方案。最後，我們觀察到，與上下文和時間依賴的問題相比，通用知識問題在不同語言之間的處理效果更好，這突顯了考慮文化和時間可變性的真實性評估的必要性。數據集和代碼在開放許可下公開可用。
 
-##### **MedGNN: Towards Multi-resolution Spatiotemporal Graph Learning for Medical Time Series Classification**
-2502.04515v1 by Wei Fan, Jingru Fei, Dingyu Guo, Kun Yi, Xiaozhuang Song, Haolong Xiang, Hangting Ye, Min Li
+##### **A Deep Inverse-Mapping Model for a Flapping Robotic Wing**
+2502.09378v1 by Hadar Sharvit, Raz Karl, Tsevi Beatus
 
-Medical time series has been playing a vital role in real-world healthcare
-systems as valuable information in monitoring health conditions of patients.
-Accurate classification for medical time series, e.g., Electrocardiography
-(ECG) signals, can help for early detection and diagnosis. Traditional methods
-towards medical time series classification rely on handcrafted feature
-extraction and statistical methods; with the recent advancement of artificial
-intelligence, the machine learning and deep learning methods have become more
-popular. However, existing methods often fail to fully model the complex
-spatial dynamics under different scales, which ignore the dynamic
-multi-resolution spatial and temporal joint inter-dependencies. Moreover, they
-are less likely to consider the special baseline wander problem as well as the
-multi-view characteristics of medical time series, which largely hinders their
-prediction performance. To address these limitations, we propose a
-Multi-resolution Spatiotemporal Graph Learning framework, MedGNN, for medical
-time series classification. Specifically, we first propose to construct
-multi-resolution adaptive graph structures to learn dynamic multi-scale
-embeddings. Then, to address the baseline wander problem, we propose Difference
-Attention Networks to operate self-attention mechanisms on the finite
-difference for temporal modeling. Moreover, to learn the multi-view
-characteristics, we utilize the Frequency Convolution Networks to capture
-complementary information of medical time series from the frequency domain. In
-addition, we introduce the Multi-resolution Graph Transformer architecture to
-model the dynamic dependencies and fuse the information from different
-resolutions. Finally, we have conducted extensive experiments on multiple
-medical real-world datasets that demonstrate the superior performance of our
-method. Our Code is available.
+In systems control, the dynamics of a system are governed by modulating its
+inputs to achieve a desired outcome. For example, to control the thrust of a
+quad-copter propeller the controller modulates its rotation rate, relying on a
+straightforward mapping between the input rotation rate and the resulting
+thrust. This mapping can be inverted to determine the rotation rate needed to
+generate a desired thrust. However, in complex systems, such as flapping-wing
+robots where intricate fluid motions are involved, mapping inputs (wing
+kinematics) to outcomes (aerodynamic forces) is nontrivial and inverting this
+mapping for real-time control is computationally impractical. Here, we report a
+machine-learning solution for the inverse mapping of a flapping-wing system
+based on data from an experimental system we have developed. Our model learns
+the input wing motion required to generate a desired aerodynamic force outcome.
+We used a sequence-to-sequence model tailored for time-series data and
+augmented it with a novel adaptive-spectrum layer that implements
+representation learning in the frequency domain. To train our model, we
+developed a flapping wing system that simultaneously measures the wing's
+aerodynamic force and its 3D motion using high-speed cameras. We demonstrate
+the performance of our system on an additional open-source dataset of a
+flapping wing in a different flow regime. Results show superior performance
+compared with more complex state-of-the-art transformer-based models, with 11%
+improvement on the test datasets median loss. Moreover, our model shows
+superior inference time, making it practical for onboard robotic control. Our
+open-source data and framework may improve modeling and real-time control of
+systems governed by complex dynamics, from biomimetic robots to biomedical
+devices.
 
-摘要：<paragraph>醫療時間序列在真實世界的醫療保健系統中扮演著至關重要的角色，作為監控患者健康狀況的寶貴資訊。
-準確分類醫療時間序列，例如心電圖 (ECG) 訊號，有助於早期偵測和診斷。傳統的醫療時間序列分類方法仰賴手工特徵萃取和統計方法；隨著人工智慧的最新進展，機器學習和深度學習方法變得更為普及。然而，現有方法通常無法完全建模不同尺度下的複雜空間動態，忽略了動態多解析度空間和時間關節相互依賴性。此外，它們不太可能考慮特殊的基線漂移問題以及醫療時間序列的多視角特性，這在很大程度上阻礙了它們的預測效能。為了解決這些限制，我們提出了一個多解析度時空圖形學習架構 MedGNN，用於醫療時間序列分類。具體來說，我們首先提出構建多解析度自適應圖形結構以學習動態多尺度嵌入。然後，為了解決基線漂移問題，我們提出差分注意力網路，對時間建模的有限差分運算自注意力機制。此外，為了學習多視角特性，我們利用頻率卷積網路從頻域擷取醫療時間序列的互補資訊。此外，我們引入了多解析度圖形Transformer架構來建模動態依賴性，並融合來自不同解析度的資訊。最後，我們對多個醫療真實世界資料集進行了廣泛的實驗，證明了我們方法的優異效能。我們的程式碼已公開。</paragraph>
+摘要：<paragraph>在系統控制中，系統的動態受調節其輸入以實現所需結果的影響。例如，為了控制四軸旋翼推進器的推力，控制器會調節其旋轉速率，依賴於輸入旋轉速率和所產生的推力之間的直接映射。此映射可以反轉以確定產生所需推力所需的旋轉速率。然而，在複雜的系統中，例如涉及複雜流體運動的拍打式機翼機器人，將輸入（機翼運動學）映射到輸出（空氣動力）並非易事，並且反轉此映射以進行實時控制在計算上不切實際。在此，我們報告了一個基於我們開發的實驗系統數據的拍打式機翼系統反向映射的機器學習解決方案。我們的模型學習產生所需空氣動力結果所需的輸入機翼運動。我們使用了一個專門針對時間序列數據的序列到序列模型，並用一個在頻域中實現表示學習的新型自適應譜層對其進行了擴充。為了訓練我們的模型，我們開發了一個拍打式機翼系統，該系統同時使用高速相機測量機翼的空氣動力和其 3D 運動。我們在一個不同的流動狀態下拍打機翼的另一個開源數據集上展示了我們系統的性能。結果表明，與更複雜的基於Transformer的最先進模型相比，性能優異，在測試數據集中損失中值改進了 11%。此外，我們的模型顯示出優異的推理時間，使其適用於機載機器人控制。我們的開源數據和框架可以改進受複雜動態支配的系統的建模和實時控制，從仿生機器人到生物醫學設備。</paragraph>
 
-##### **Integrating Generative Artificial Intelligence in ADRD: A Framework for Streamlining Diagnosis and Care in Neurodegenerative Diseases**
-2502.06842v1 by Andrew G. Breithaupt, Alice Tang, Bruce L. Miller, Pedro Pinheiro-Chagas
+##### **Language Agents as Digital Representatives in Collective Decision-Making**
+2502.09369v1 by Daniel Jarrett, Miruna Pîslar, Michiel A. Bakker, Michael Henry Tessler, Raphael Köster, Jan Balaguer, Romuald Elie, Christopher Summerfield, Andrea Tacchetti
 
-Healthcare systems are struggling to meet the growing demand for neurological
-care, with challenges particularly acute in Alzheimer's disease and related
-dementias (ADRD). While artificial intelligence research has often focused on
-identifying patterns beyond human perception, implementing such predictive
-capabilities remains challenging as clinicians cannot readily verify insights
-they cannot themselves detect. We propose that large language models (LLMs)
-offer more immediately practical applications by enhancing clinicians'
-capabilities in three critical areas: comprehensive data collection,
-interpretation of complex clinical information, and timely application of
-relevant medical knowledge. These challenges stem from limited time for proper
-diagnosis, growing data complexity, and an overwhelming volume of medical
-literature that exceeds any clinician's capacity to fully master. We present a
-framework for responsible AI integration that leverages LLMs' ability to
-communicate effectively with both patients and providers while maintaining
-human oversight. This approach prioritizes standardized, high-quality data
-collection to enable a system that learns from every patient encounter while
-incorporating the latest clinical evidence, continuously improving care
-delivery. We begin to address implementation challenges and initiate important
-discussions around ethical considerations and governance needs. While developed
-for ADRD, this roadmap provides principles for responsible AI integration
-across neurology and other medical specialties, with potential to improve
-diagnostic accuracy, reduce care disparities, and advance clinical knowledge
-through a learning healthcare system.
+Consider the process of collective decision-making, in which a group of
+individuals interactively select a preferred outcome from among a universe of
+alternatives. In this context, "representation" is the activity of making an
+individual's preferences present in the process via participation by a proxy
+agent -- i.e. their "representative". To this end, learned models of human
+behavior have the potential to fill this role, with practical implications for
+multi-agent scenario studies and mechanism design. In this work, we investigate
+the possibility of training \textit{language agents} to behave in the capacity
+of representatives of human agents, appropriately expressing the preferences of
+those individuals whom they stand for. First, we formalize the setting of
+\textit{collective decision-making} -- as the episodic process of interaction
+between a group of agents and a decision mechanism. On this basis, we then
+formalize the problem of \textit{digital representation} -- as the simulation
+of an agent's behavior to yield equivalent outcomes from the mechanism.
+Finally, we conduct an empirical case study in the setting of
+\textit{consensus-finding} among diverse humans, and demonstrate the
+feasibility of fine-tuning large language models to act as digital
+representatives.
 
-摘要：醫療體系正努力滿足日益增長的神經照護需求，其中阿茲海默症和相關失智症 (ADRD) 的挑戰特別嚴重。雖然人工智慧研究通常專注於識別人類感知之外的模式，但實作此類預測功能仍然具有挑戰性，因為臨床醫生無法輕易驗證他們自己無法偵測到的見解。我們提出大型語言模型 (LLM) 可透過提升臨床醫生在三個關鍵領域的能力，提供更直接且實用的應用：全面的資料收集、複雜臨床資訊的詮釋，以及適時應用相關的醫學知識。這些挑戰源自於適當診斷時間有限、資料複雜性日益增加，以及龐大的醫學文獻量超過任何臨床醫生所能完全掌握的容量。我們提出了一個負責任的 AI 整合架構，利用 LLM 與患者和提供者有效溝通的能力，同時維持人為監督。此方法優先考慮標準化、高品質的資料收集，以建立一個從每次患者接觸中學習的系統，同時納入最新的臨床證據，持續改善照護提供。我們開始探討實作挑戰，並展開關於倫理考量和治理需求的重要討論。儘管是為 ADRD 所開發，此藍圖提供了神經科和其他醫學專科負責任 AI 整合的原則，有潛力透過學習型醫療保健系統改善診斷準確性、減少照護差異，並推進臨床知識。
+摘要：考慮集體決策的過程，其中一群個人互動式地從一系列備選方案中選擇一個偏好的結果。在此脈絡中，「代表」是透過代理人（即他們的「代表」）參與，讓個人的偏好出現在這個過程中的活動。為此，人類行為的學習模型有可能填補這個角色，對多重代理人情境研究和機制設計具有實際意義。在這項工作中，我們探討訓練「語言代理人」的可能性，以代表人類代理人的身分行事，適當地表達他們所代表的那些個人的偏好。首先，我們將「集體決策」的設定形式化，作為一群代理人與決策機制之間互動的間歇性過程。在此基礎上，我們接著將「數位代表」的問題形式化，作為模擬代理人的行為，從機制中產生等效結果。最後，我們在多元人類的「共識尋求」設定中進行一個實證個案研究，並展示微調大型語言模型以作為數位代表的可行性。
 
-##### **Primary Care Diagnoses as a Reliable Predictor for Orthopedic Surgical Interventions**
-2502.04423v1 by Khushboo Verma, Alan Michels, Ergi Gumusaneli, Shilpa Chitnis, Smita Sinha Kumar, Christopher Thompson, Lena Esmail, Guruprasath Srinivasan, Chandini Panchada, Sushovan Guha, Satwant Kumar
+##### **Neural Spatiotemporal Point Processes: Trends and Challenges**
+2502.09341v1 by Sumantrak Mukherjee, Mouad Elhamdi, George Mohler, David A. Selby, Yao Xie, Sebastian Vollmer, Gerrit Grossmann
 
-Referral workflow inefficiencies, including misaligned referrals and delays,
-contribute to suboptimal patient outcomes and higher healthcare costs. In this
-study, we investigated the possibility of predicting procedural needs based on
-primary care diagnostic entries, thereby improving referral accuracy,
-streamlining workflows, and providing better care to patients. A de-identified
-dataset of 2,086 orthopedic referrals from the University of Texas Health at
-Tyler was analyzed using machine learning models built on Base General
-Embeddings (BGE) for semantic extraction. To ensure real-world applicability,
-noise tolerance experiments were conducted, and oversampling techniques were
-employed to mitigate class imbalance. The selected optimum and parsimonious
-embedding model demonstrated high predictive accuracy (ROC-AUC: 0.874, Matthews
-Correlation Coefficient (MCC): 0.540), effectively distinguishing patients
-requiring surgical intervention. Dimensionality reduction techniques confirmed
-the model's ability to capture meaningful clinical relationships. A threshold
-sensitivity analysis identified an optimal decision threshold (0.30) to balance
-precision and recall, maximizing referral efficiency. In the predictive
-modeling analysis, the procedure rate increased from 11.27% to an optimal
-60.1%, representing a 433% improvement with significant implications for
-operational efficiency and healthcare revenue.
-  The results of our study demonstrate that referral optimization can enhance
-primary and surgical care integration. Through this approach, precise and
-timely predictions of procedural requirements can be made, thereby minimizing
-delays, improving surgical planning, and reducing administrative burdens. In
-addition, the findings highlight the potential of clinical decision support as
-a scalable solution for improving patient outcomes and the efficiency of the
-healthcare system.
+Spatiotemporal point processes (STPPs) are probabilistic models for events
+occurring in continuous space and time. Real-world event data often exhibit
+intricate dependencies and heterogeneous dynamics. By incorporating modern deep
+learning techniques, STPPs can model these complexities more effectively than
+traditional approaches. Consequently, the fusion of neural methods with STPPs
+has become an active and rapidly evolving research area. In this review, we
+categorize existing approaches, unify key design choices, and explain the
+challenges of working with this data modality. We further highlight emerging
+trends and diverse application domains. Finally, we identify open challenges
+and gaps in the literature.
 
-摘要：轉診流程效率低落，包括轉診不當和延誤，
-導致次優的患者結果和更高的醫療保健成本。在這
-項研究中，我們探討了根據初級保健診斷條目預測程序需求的可能性，從而提高轉診準確性，
-簡化工作流程，並為患者提供更好的照護。一個去識別化
-德克薩斯大學健康中心的 2,086 個骨科轉診的資料集
-泰勒使用建立在基本通用
-語義提取的嵌入 (BGE) 上的機器學習模型進行分析。為了確保現實世界的適用性，
-進行了噪聲容忍度實驗，並採用了過採樣技術來減輕類別不平衡。所選的最佳和簡約
-嵌入模型展示了高預測準確度 (ROC-AUC：0.874，馬修斯
-相關系數 (MCC)：0.540)，有效區分需要手術干預的患者。降維
-技術證實了模型捕捉有意義的臨床關係的能力。閾值
-敏感性分析確定了一個最佳決策閾值 (0.30) 來平衡
-精確度和召回率，最大化轉診效率。在預測中
-建模分析中，程序率從 11.27% 增加到最佳的
-60.1%，代表 433% 的改進，對運營效率和醫療保健收入具有重大影響。
-我們研究的結果表明，轉診優化可以增強
-初級和外科護理整合。通過這種方法，可以對程序需求進行準確及時的預測，從而最大程度地減少
-延誤，改善手術計劃，並減輕行政負擔。此外，研究結果強調了臨床決策支持作為
-一個可擴展的解決方案的潛力，用於改善患者結果和醫療保健系統的效率。
+摘要：時空點過程 (STPP) 是事件在連續時空發生的機率模型。真實世界的事件資料通常會展現錯綜複雜的依賴關係和異質動態。透過結合現代深度學習技術，STPP 可以比傳統方法更有效地模擬這些複雜性。因此，神經方法與 STPP 的融合已成為一個活躍且快速發展的研究領域。在本篇評論中，我們對現有方法進行分類、統一關鍵設計選擇，並說明處理這種資料模式的挑戰。我們進一步強調新興趨勢和多樣化的應用領域。最後，我們找出文獻中的開放性挑戰和空白。
 
-##### **Automatic quantification of breast cancer biomarkers from multiple 18F-FDG PET image segmentation**
-2502.04083v1 by Tewele W. Tareke, Neree Payan, Alexandre Cochet, Laurent Arnould, Benoit Presles, Jean-Marc Vrigneaud, Fabrice Meriaudeau, Alain Lalande
+##### **Graph Diffusion Network for Drug-Gene Prediction**
+2502.09335v1 by Jiayang Wu, Wensheng Gan, Philip S. Yu
 
-Neoadjuvant chemotherapy (NAC) has become a standard clinical practice for
-tumor downsizing in breast cancer with 18F-FDG Positron Emission Tomography
-(PET). Our work aims to leverage PET imaging for the segmentation of breast
-lesions. The focus is on developing an automated system that accurately
-segments primary tumor regions and extracts key biomarkers from these areas to
-provide insights into the evolution of breast cancer following the first course
-of NAC. 243 baseline 18F-FDG PET scans (PET_Bl) and 180 follow-up 18F-FDG PET
-scans (PET_Fu) were acquired before and after the first course of NAC,
-respectively. Firstly, a deep learning-based breast tumor segmentation method
-was developed. The optimal baseline model (model trained on baseline exams) was
-fine-tuned on 15 follow-up exams and adapted using active learning to segment
-tumor areas in PET_Fu. The pipeline computes biomarkers such as maximum
-standardized uptake value (SUVmax), metabolic tumor volume (MTV), and total
-lesion glycolysis (TLG) to evaluate tumor evolution between PET_Fu and PET_Bl.
-Quality control measures were employed to exclude aberrant outliers. The nnUNet
-deep learning model outperformed in tumor segmentation on PET_Bl, achieved a
-Dice similarity coefficient (DSC) of 0.89 and a Hausdorff distance (HD) of 3.52
-mm. After fine-tuning, the model demonstrated a DSC of 0.78 and a HD of 4.95 mm
-on PET_Fu exams. Biomarkers analysis revealed very strong correlations whatever
-the biomarker between manually segmented and automatically predicted regions.
-The significant average decrease of SUVmax, MTV and TLG were 5.22, 11.79 cm3
-and 19.23 cm3, respectively. The presented approach demonstrates an automated
-system for breast tumor segmentation from 18F-FDG PET. Thanks to the extracted
-biomarkers, our method enables the automatic assessment of cancer progression.
+Predicting drug-gene associations is crucial for drug development and disease
+treatment. While graph neural networks (GNN) have shown effectiveness in this
+task, they face challenges with data sparsity and efficient contrastive
+learning implementation. We introduce a graph diffusion network for drug-gene
+prediction (GDNDGP), a framework that addresses these limitations through two
+key innovations. First, it employs meta-path-based homogeneous graph learning
+to capture drug-drug and gene-gene relationships, ensuring similar entities
+share embedding spaces. Second, it incorporates a parallel diffusion network
+that generates hard negative samples during training, eliminating the need for
+exhaustive negative sample retrieval. Our model achieves superior performance
+on the DGIdb 4.0 dataset and demonstrates strong generalization capability on
+tripartite drug-gene-disease networks. Results show significant improvements
+over existing methods in drug-gene prediction tasks, particularly in handling
+complex heterogeneous relationships. The source code is publicly available at
+https://github.com/csjywu1/GDNDGP.
 
-摘要：新辅助化疗 (NAC) 已成为乳腺癌中采用 18F-FDG 正电子发射断层扫描 (PET) 进行肿瘤缩小的标准临床实践。我们的工作旨在利用 PET 影像分割乳腺病变。重点在于开发一个自动系统，该系统可以准确分割原发性肿瘤区域并从这些区域提取关键生物标记，以深入了解乳腺癌在第一疗程 NAC 后的演变。分别在第一疗程 NAC 之前和之后采集了 243 例基线 18F-FDG PET 扫描 (PET_Bl) 和 180 例随访 18F-FDG PET 扫描 (PET_Fu)。首先，开发了一种基于深度学习的乳腺肿瘤分割方法。对 15 例随访检查对最优基线模型（在基线检查中训练的模型）进行了微调，并使用主动学习对 PET_Fu 中的肿瘤区域进行了分割。该管道计算诸如最大标准摄取值 (SUVmax)、代谢肿瘤体积 (MTV) 和总病灶糖酵解 (TLG) 等生物标记，以评估 PET_Fu 和 PET_Bl 之间的肿瘤演变。采用质量控制措施来排除异常值。nnUNet 深度学习模型在 PET_Bl 上的肿瘤分割方面表现出色，达到 0.89 的 Dice 相似性系数 (DSC) 和 3.52 毫米的 Hausdorff 距离 (HD)。微调后，该模型在 PET_Fu 检查中显示出 0.78 的 DSC 和 4.95 毫米的 HD。无论手动分割区域和自动预测区域之间的生物标记如何，生物标记分析都显示出非常强的相关性。SUVmax、MTV 和 TLG 的平均显着下降分别为 5.22、11.79 cm3 和 19.23 cm3。所提出的方法展示了一个用于从 18F-FDG PET 分割乳腺肿瘤的自动化系统。由于提取了生物标记，我们的方法能够自动评估癌症进展。
+摘要：預測藥物基因關聯對藥物開發和疾病治療至關重要。雖然圖神經網路 (GNN) 已顯示在這個任務中的有效性，但它們在資料稀疏性和高效對比學習實作方面面臨挑戰。我們引入了一個用於藥物基因預測的圖擴散網路 (GDNDGP)，這是一個透過兩項關鍵創新來解決這些限制的框架。首先，它採用基於元路徑的同質圖學習來捕捉藥物-藥物和基因-基因關係，確保類似實體共享嵌入空間。其次，它整合了一個並行擴散網路，在訓練期間產生困難的負面樣本，消除了對詳盡負面樣本擷取的需求。我們的模型在 DGIdb 4.0 資料集上取得了卓越的效能，並在三方藥物-基因-疾病網路中展現強大的概化能力。結果顯示在藥物基因預測任務中，相較於現有方法有顯著的進步，特別是在處理複雜的異質關係方面。原始碼已公開於 https://github.com/csjywu1/GDNDGP。
 
-##### **Generalize Drug Response Prediction by Latent Independent Projection for Asymmetric Constrained Domain Generalization**
-2502.04034v1 by Ran Song, Yinpu Bai, Hui Liu
+##### **Beyond English: The Impact of Prompt Translation Strategies across Languages and Tasks in Multilingual LLMs**
+2502.09331v1 by Itai Mondshine, Tzuf Paz-Argaman, Reut Tsarfaty
 
-The accurate prediction of drug responses remains a formidable challenge,
-particularly at the single-cell level and in clinical treatment contexts. Some
-studies employ transfer learning techniques to predict drug responses in
-individual cells and patients, but they require access to target-domain data
-during training, which is often unavailable or only obtainable in future. In
-this study, we propose a novel domain generalization framework, termed
-panCancerDR, to address this challenge. We conceptualize each cancer type as a
-distinct source domain, with its cell lines serving as domain-specific samples.
-Our primary objective is to extract domain-invariant features from the
-expression profiles of cell lines across diverse cancer types, thereby
-generalize the predictive capacity to out-of-distribution samples. To enhance
-robustness, we introduce a latent independence projection (LIP) module that
-encourages the encoder to extract informative yet non-redundant features. Also,
-we propose an asymmetric adaptive clustering constraint, which clusters
-drug-sensitive samples into a compact group while drives resistant samples
-dispersed across separate clusters in the latent space. Our empirical
-experiments demonstrate that panCancerDR effectively learns task-relevant
-features from diverse source domains, and achieves accurate predictions of drug
-response for unseen cancer type during training. Furthermore, when evaluated on
-single-cell and patient-level prediction tasks, our model-trained solely on in
-vitro cell line data without access to target-domain information-consistently
-outperforms and matched current state-of-the-art methods. These findings
-highlights the potential of our method for real-world clinical applications.
+Despite advances in the multilingual capabilities of Large Language Models
+(LLMs) across diverse tasks, English remains the dominant language for LLM
+research and development. So, when working with a different language, this has
+led to the widespread practice of pre-translation, i.e., translating the task
+prompt into English before inference. Selective pre-translation, a more
+surgical approach, focuses on translating specific prompt components. However,
+its current use is sporagic and lacks a systematic research foundation.
+Consequently, the optimal pre-translation strategy for various multilingual
+settings and tasks remains unclear. In this work, we aim to uncover the optimal
+setup for pre-translation by systematically assessing its use. Specifically, we
+view the prompt as a modular entity, composed of four functional parts:
+instruction, context, examples, and output, either of which could be translated
+or not. We evaluate pre-translation strategies across 35 languages covering
+both low and high-resource languages, on various tasks including Question
+Answering (QA), Natural Language Inference (NLI), Named Entity Recognition
+(NER), and Abstractive Summarization. Our experiments show the impact of
+factors as similarity to English, translation quality and the size of
+pre-trained data, on the model performance with pre-translation. We suggest
+practical guidelines for choosing optimal strategies in various multilingual
+settings.
 
-摘要：<paragraph>準確預測藥物反應仍然是一項艱鉅的挑戰，特別是在單細胞層級和臨床治療背景中。一些研究採用遷移學習技術來預測個別細胞和患者的藥物反應，但它們需要在訓練期間存取目標網域資料，而這些資料通常無法取得，或只能在未來取得。在這項研究中，我們提出一個新穎的網域概化架構，稱為 panCancerDR，以應對這項挑戰。我們將每種類型的癌症概念化為一個不同的來源網域，其細胞株作為特定網域的樣本。我們的首要目標是從不同癌症類型的細胞株表現特徵中萃取網域不變特徵，從而將預測能力概化到分布外的樣本。為了增強穩健性，我們引入一個潛在獨立投影 (LIP) 模組，鼓勵編碼器萃取有資訊但非冗餘的特徵。此外，我們提出一個非對稱自適應聚類約束，將對藥物敏感的樣本聚類到一個緊湊的群組中，同時驅動抗藥性樣本分散在潛在空間中的不同群組中。我們的實證實驗證明，panCancerDR 有效地從不同的來源網域學習與任務相關的特徵，並在訓練期間對未見的癌症類型實現準確的藥物反應預測。此外，當在單細胞和患者層級預測任務中進行評估時，我們的模型僅在體外細胞株資料上訓練，而沒有存取目標網域資訊，始終優於並符合當前的最新方法。這些發現突顯了我們的方法在實際臨床應用中的潛力。</paragraph>
+摘要：儘管大型語言模型 (LLM) 在各種任務中的多語言能力有進步，英語仍然是 LLM 研究和開發的主導語言。因此，在使用不同語言時，這導致了預翻譯的廣泛實務，即在推理之前將任務提示翻譯成英語。選擇性預翻譯是一種更精準的方法，專注於翻譯特定提示組成部分。然而，目前的使用是零星的，缺乏系統性的研究基礎。因此，各種多語言設定和任務的最佳預翻譯策略仍不清楚。在這項工作中，我們旨在透過系統性評估預翻譯的使用，找出其最佳設定。具體來說，我們將提示視為一個模組化實體，由四個功能部分組成：說明、背景、範例和輸出，其中任何一個都可以翻譯或不翻譯。我們在 35 種語言中評估預翻譯策略，涵蓋低資源語言和高資源語言，以及各種任務，包括問答 (QA)、自然語言推理 (NLI)、命名實體識別 (NER) 和抽象摘要。我們的實驗顯示了與英語的相似性、翻譯品質和預訓練資料大小等因素對預翻譯模型效能的影響。我們建議在各種多語言設定中選擇最佳策略的實用指南。
 
-##### **MedRAG: Enhancing Retrieval-augmented Generation with Knowledge Graph-Elicited Reasoning for Healthcare Copilot**
-2502.04413v1 by Xuejiao Zhao, Siyan Liu, Su-Yin Yang, Chunyan Miao
+##### **A Judge-free LLM Open-ended Generation Benchmark Based on the Distributional Hypothesis**
+2502.09316v1 by Kentaro Imajo, Masanori Hirano, Shuji Suzuki, Hiroaki Mikami
 
-Retrieval-augmented generation (RAG) is a well-suited technique for
-retrieving privacy-sensitive Electronic Health Records (EHR). It can serve as a
-key module of the healthcare copilot, helping reduce misdiagnosis for
-healthcare practitioners and patients. However, the diagnostic accuracy and
-specificity of existing heuristic-based RAG models used in the medical domain
-are inadequate, particularly for diseases with similar manifestations. This
-paper proposes MedRAG, a RAG model enhanced by knowledge graph (KG)-elicited
-reasoning for the medical domain that retrieves diagnosis and treatment
-recommendations based on manifestations. MedRAG systematically constructs a
-comprehensive four-tier hierarchical diagnostic KG encompassing critical
-diagnostic differences of various diseases. These differences are dynamically
-integrated with similar EHRs retrieved from an EHR database, and reasoned
-within a large language model. This process enables more accurate and specific
-decision support, while also proactively providing follow-up questions to
-enhance personalized medical decision-making. MedRAG is evaluated on both a
-public dataset DDXPlus and a private chronic pain diagnostic dataset (CPDD)
-collected from Tan Tock Seng Hospital, and its performance is compared against
-various existing RAG methods. Experimental results show that, leveraging the
-information integration and relational abilities of the KG, our MedRAG provides
-more specific diagnostic insights and outperforms state-of-the-art models in
-reducing misdiagnosis rates. Our code will be available at
-https://github.com/SNOWTEAM2023/MedRAG
+Evaluating the open-ended text generation of large language models (LLMs) is
+challenging because of the lack of a clear ground truth and the high cost of
+human or LLM-based assessments. We propose a novel benchmark that evaluates
+LLMs using n-gram statistics and rules, without relying on human judgement or
+LLM-as-a-judge approaches. Using 50 question and reference answer sets, we
+introduce three new metrics based on n-grams and rules: Fluency, Truthfulness,
+and Helpfulness. Our benchmark strongly correlates with GPT-4o-based
+evaluations while requiring significantly fewer computational resources,
+demonstrating its effectiveness as a scalable alternative for assessing LLMs'
+open-ended generation capabilities.
+
+摘要：評估大型語言模型 (LLM) 的開放式文字生成具有挑戰性，因為缺乏明確的基礎真實性，以及人工或基於 LLM 的評估成本高昂。我們提出一個新基準，使用 n-gram 統計和規則來評估 LLM，而不依賴於人工判斷或 LLM 作為評審的方法。使用 50 個問題和參考答案集，我們基於 n-gram 和規則引入了三項新指標：流暢度、真實性和有幫助性。我們的基準與基於 GPT-4o 的評估密切相關，同時需要明顯更少的計算資源，證明了其作為評估 LLM 的開放式生成能力的可擴充替代方案的有效性。
+
+##### **When the LM misunderstood the human chuckled: Analyzing garden path effects in humans and language models**
+2502.09307v1 by Samuel Joseph Amouyal, Aya Meltzer-Asscher, Jonathan Berant
+
+Modern Large Language Models (LLMs) have shown human-like abilities in many
+language tasks, sparking interest in comparing LLMs' and humans' language
+processing. In this paper, we conduct a detailed comparison of the two on a
+sentence comprehension task using garden-path constructions, which are
+notoriously challenging for humans. Based on psycholinguistic research, we
+formulate hypotheses on why garden-path sentences are hard, and test these
+hypotheses on human participants and a large suite of LLMs using comprehension
+questions. Our findings reveal that both LLMs and humans struggle with specific
+syntactic complexities, with some models showing high correlation with human
+comprehension. To complement our findings, we test LLM comprehension of
+garden-path constructions with paraphrasing and text-to-image generation tasks,
+and find that the results mirror the sentence comprehension question results,
+further validating our findings on LLM understanding of these constructions.
 
-摘要：檢索增強生成 (RAG) 是一種適用於檢索隱私敏感的電子健康記錄 (EHR) 的技術。它可以作為醫療保健副駕駛的一個關鍵模組，協助減少醫療保健從業人員和患者的誤診。然而，在醫療領域中使用的現有基於啟發法的 RAG 模型的診斷準確性和特異性不足，特別是對於具有類似表現的疾病。本文提出 MedRAG，一種由知識圖譜 (KG) 引發的推理增強的 RAG 模型，用於醫療領域，它根據表現檢索診斷和治療建議。MedRAG 系統性地構建了一個全面的四層階層式診斷 KG，涵蓋各種疾病的關鍵診斷差異。這些差異與從 EHR 資料庫中檢索到的類似 EHR 動態整合，並在大型語言模型中進行推理。這個過程可以實現更準確和具體的決策支援，同時主動提供後續問題，以增強個人化醫療決策制定。MedRAG 在公共資料集 DDXPlus 和從陳篤生醫院收集的私人慢性疼痛診斷資料集 (CPDD) 上進行評估，並將其效能與各種現有 RAG 方法進行比較。實驗結果顯示，利用 KG 的資訊整合和關係能力，我們的 MedRAG 提供了更具體的診斷見解，並在降低誤診率方面優於最先進的模型。我們的程式碼將在 https://github.com/SNOWTEAM2023/MedRAG 提供
+摘要：現代大型語言模型（LLM）在許多語言任務中展現出類似人類的能力，引發了比較 LLM 與人類語言處理的興趣。在本文中，我們使用對人類來說極具挑戰的花園路徑結構，對這兩者進行了詳細比較，以進行句子理解任務。根據心理語言學研究，我們制定了關於為什麼花園路徑句子困難的假設，並使用理解問題對人類參與者和大量 LLM 測試這些假設。我們的研究結果表明，LLM 和人類都難以應付特定的句法複雜性，其中一些模型與人類理解力高度相關。為了補充我們的研究結果，我們測試了 LLM 對花園路徑結構的理解，並進行了改寫和文字轉換為圖像的生成任務，並發現結果反映了句子理解問題的結果，進一步驗證了我們對 LLM 理解這些結構的研究結果。
 
-##### **Transforming Multimodal Models into Action Models for Radiotherapy**
-2502.04408v1 by Matteo Ferrante, Alessandra Carosi, Rolando Maria D Angelillo, Nicola Toschi
+##### **Indeterminacy in Affective Computing: Considering Meaning and Context in Data Collection Practices**
+2502.09294v1 by Bernd Dudzik, Tiffany Matej Hrkalovic, Chenxu Hao, Chirag Raman, Masha Tsfasman
 
-Radiotherapy is a crucial cancer treatment that demands precise planning to
-balance tumor eradication and preservation of healthy tissue. Traditional
-treatment planning (TP) is iterative, time-consuming, and reliant on human
-expertise, which can potentially introduce variability and inefficiency. We
-propose a novel framework to transform a large multimodal foundation model
-(MLM) into an action model for TP using a few-shot reinforcement learning (RL)
-approach. Our method leverages the MLM's extensive pre-existing knowledge of
-physics, radiation, and anatomy, enhancing it through a few-shot learning
-process. This allows the model to iteratively improve treatment plans using a
-Monte Carlo simulator. Our results demonstrate that this method outperforms
-conventional RL-based approaches in both quality and efficiency, achieving
-higher reward scores and more optimal dose distributions in simulations on
-prostate cancer data. This proof-of-concept suggests a promising direction for
-integrating advanced AI models into clinical workflows, potentially enhancing
-the speed, quality, and standardization of radiotherapy treatment planning.
+Automatic Affect Prediction (AAP) uses computational analysis of input data
+such as text, speech, images, and physiological signals to predict various
+affective phenomena (e.g., emotions or moods). These models are typically
+constructed using supervised machine-learning algorithms, which rely heavily on
+labeled training datasets. In this position paper, we posit that all AAP
+training data are derived from human Affective Interpretation Processes,
+resulting in a form of Affective Meaning. Research on human affect indicates a
+form of complexity that is fundamental to such meaning: it can possess what we
+refer to here broadly as Qualities of Indeterminacy (QIs) - encompassing
+Subjectivity (meaning depends on who is interpreting), Uncertainty (lack of
+confidence regarding meanings' correctness), Ambiguity (meaning contains
+mutually exclusive concepts) and Vagueness (meaning is situated at different
+levels in a nested hierarchy). Failing to appropriately consider QIs leads to
+results incapable of meaningful and reliable predictions. Based on this
+premise, we argue that a crucial step in adequately addressing indeterminacy in
+AAP is the development of data collection practices for modeling corpora that
+involve the systematic consideration of 1) a relevant set of QIs and 2) context
+for the associated interpretation processes. To this end, we are 1) outlining a
+conceptual model of AIPs and the QIs associated with the meaning these produce
+and a conceptual structure of relevant context, supporting understanding of its
+role. Finally, we use our framework for 2) discussing examples of
+context-sensitivity-related challenges for addressing QIs in data collection
+setups. We believe our efforts can stimulate a structured discussion of both
+the role of aspects of indeterminacy and context in research on AAP, informing
+the development of better practices for data collection and analysis.
 
-摘要：放射治療是一種重要的癌症治療方法，需要精確的規劃來平衡腫瘤根除和健康組織的保留。傳統的治療規劃（TP）是反覆的、耗時的，並且依賴於人為專業知識，這可能會引入變異性和低效率。我們提出了一個新穎的框架，使用少次強化學習 (RL) 方法將大型多模態基礎模型 (MLM) 轉換為 TP 的動作模型。我們的模型利用了 MLM 對物理、輻射和解剖學的廣泛預先存在的知識，並通過少次學習過程對其進行增強。這允許模型使用蒙特卡羅模擬器反覆改進治療計劃。我們的結果表明，這種方法在質量和效率方面都優於基於傳統 RL 的方法，在對前列腺癌數據進行模擬時，獲得了更高的獎勵分數和更優化的劑量分佈。這個概念驗證表明了一個有希望的方向，即將先進的人工智慧模型整合到臨床工作流程中，從而有可能提高放射治療計劃的速度、質量和標準化。
+摘要：自動影響預測 (AAP) 使用輸入資料的運算分析，例如文字、語音、影像和生理訊號，來預測各種情感現象（例如情緒或心情）。這些模型通常使用監督式機器學習演算法建構，而這些演算法高度依賴標籤訓練資料集。在此立場文件中，我們主張所有 AAP 訓練資料都是從人類的情感詮釋過程中衍生而來的，進而形成一種情感意義。對人類情感的研究指出，這種複雜性是此種意義的基本要素：它可能具備我們在此廣泛稱之為不確定性品質 (QI)，包括主觀性（意義取決於詮釋者）、不確定性（對於意義正確性的信心不足）、歧義性（意義包含相互排斥的概念）和模糊性（意義位於嵌套層級的不同層級）。未能適當地考量 QI 會導致無法進行有意義且可靠預測的結果。基於此前提，我們主張，在 AAP 中適當地處理不確定性的關鍵步驟，是針對建模語料庫制定資料收集實務，其中涉及系統性地考量 1) 一組相關的 QI，以及 2) 相關詮釋過程的脈絡。為此，我們 1) 概述了 AIP 的概念模型，以及與這些 AIP 所產生的意義相關的 QI，以及相關脈絡的概念結構，支持對其角色的理解。最後，我們使用我們的架構 2) 討論了在資料收集設定中處理 QI 時，與脈絡敏感性相關的挑戰範例。我們相信我們的努力可以激勵對不確定性和脈絡面向在 AAP 研究中扮演的角色進行結構化的討論，為資料收集和分析的最佳實務發展提供資訊。
 
-##### **Online Location Planning for AI-Defined Vehicles: Optimizing Joint Tasks of Order Serving and Spatio-Temporal Heterogeneous Model Fine-Tuning**
-2502.04399v1 by Bokeng Zheng, Bo Rao, Tianxiang Zhu, Chee Wei Tan, Jingpu Duan, Zhi Zhou, Xu Chen, Xiaoxi Zhang
+##### **SparQLe: Speech Queries to Text Translation Through LLMs**
+2502.09284v1 by Amirbek Djanibekov, Hanan Aldarmaki
 
-Advances in artificial intelligence (AI) including foundation models (FMs),
-are increasingly transforming human society, with smart city driving the
-evolution of urban living.Meanwhile, vehicle crowdsensing (VCS) has emerged as
-a key enabler, leveraging vehicles' mobility and sensor-equipped capabilities.
-In particular, ride-hailing vehicles can effectively facilitate flexible data
-collection and contribute towards urban intelligence, despite resource
-limitations. Therefore, this work explores a promising scenario, where
-edge-assisted vehicles perform joint tasks of order serving and the emerging
-foundation model fine-tuning using various urban data. However, integrating the
-VCS AI task with the conventional order serving task is challenging, due to
-their inconsistent spatio-temporal characteristics: (i) The distributions of
-ride orders and data point-of-interests (PoIs) may not coincide in geography,
-both following a priori unknown patterns; (ii) they have distinct forms of
-temporal effects, i.e., prolonged waiting makes orders become instantly invalid
-while data with increased staleness gradually reduces its utility for model
-fine-tuning.To overcome these obstacles, we propose an online framework based
-on multi-agent reinforcement learning (MARL) with careful augmentation. A new
-quality-of-service (QoS) metric is designed to characterize and balance the
-utility of the two joint tasks, under the effects of varying data volumes and
-staleness. We also integrate graph neural networks (GNNs) with MARL to enhance
-state representations, capturing graph-structured, time-varying dependencies
-among vehicles and across locations. Extensive experiments on our testbed
-simulator, utilizing various real-world foundation model fine-tuning tasks and
-the New York City Taxi ride order dataset, demonstrate the advantage of our
-proposed method.
+With the growing influence of Large Language Models (LLMs), there is
+increasing interest in integrating speech representations with them to enable
+more seamless multi-modal processing and speech understanding. This study
+introduces a novel approach that leverages self-supervised speech
+representations in combination with instruction-tuned LLMs for speech-to-text
+translation. The proposed approach leverages a modality adapter to align
+extracted speech features with instruction-tuned LLMs using English-language
+data. Our experiments demonstrate that this method effectively preserves the
+semantic content of the input speech and serves as an effective bridge between
+self-supervised speech models and instruction-tuned LLMs, offering a promising
+solution for various speech understanding applications.
 
-摘要：人工智能（AI）的進展，包括基礎模型（FM），正日益轉變人類社會，智慧城市推動著城市生活的演進。同時，車輛群感測（VCS）已成為關鍵推動因素，利用車輛的機動性和配備感測器的能力。特別是，儘管有資源限制，叫車服務車輛能有效促進靈活的資料收集，並有助於城市智慧。因此，這項工作探索了一個有前途的場景，其中邊緣輔助車輛執行訂單服務和新興基礎模型微調的聯合任務，使用各種城市資料。然而，由於 VCS AI 任務與傳統訂單服務任務的不一致時空特徵，整合它們具有挑戰性：(i) 叫車訂單和資料感興趣點 (PoI) 的分佈在地域上可能不重合，兩者都遵循先驗未知的模式；(ii) 它們具有不同的時間效應形式，即長時間等待會使訂單立即失效，而過時的資料會逐漸降低其對模型微調的效用。為了解決這些障礙，我們提出了一個基於多智能體強化學習 (MARL) 的線上架構，並進行了仔細的擴充。設計了一個新的服務品質 (QoS) 指標，用於表徵和平衡這兩個聯合任務的效用，在不同資料量和過時性的影響下。我們還將圖神經網路（GNN）與 MARL 整合，以增強狀態表示，捕捉車輛之間和不同地點之間的圖結構、時變依賴性。在我們的測試平台模擬器上進行的廣泛實驗，利用各種真實世界的基礎模型微調任務和紐約市計程車叫車訂單資料集，證明了我們提出的方法的優點。
+摘要：隨著大型語言模型（LLM）影響力逐漸擴大，將語音表徵與其整合，以實現更順暢的多模態處理和語音理解，已引起越來越多的興趣。本研究提出了一種新穎的方法，該方法利用自監督語音表徵，結合指令調整的 LLM，進行語音轉文字翻譯。所提出的方法利用模態適配器，使用英語語言資料，將提取的語音特徵與指令調整的 LLM 對齊。我們的實驗證明，此方法有效地保留了輸入語音的語義內容，並作為自監督語音模型和指令調整的 LLM 之間的有效橋樑，為各種語音理解應用程式提供了一個有前景的解決方案。
 
-##### **Multimodal Medical Code Tokenizer**
-2502.04397v2 by Xiaorui Su, Shvat Messica, Yepeng Huang, Ruth Johnson, Lukas Fesser, Shanghua Gao, Faryad Sahneh, Marinka Zitnik
+##### **LiSA: Leveraging Link Recommender to Attack Graph Neural Networks via Subgraph Injection**
+2502.09271v1 by Wenlun Zhang, Enyan Dai, Kentaro Yoshioka
 
-Foundation models trained on patient electronic health records (EHRs) require
-tokenizing medical data into sequences of discrete vocabulary items. Existing
-tokenizers treat medical codes from EHRs as isolated textual tokens. However,
-each medical code is defined by its textual description, its position in
-ontological hierarchies, and its relationships to other codes, such as disease
-co-occurrences and drug-treatment associations. Medical vocabularies contain
-more than 600,000 codes with critical information for clinical reasoning. We
-introduce MedTok, a multimodal medical code tokenizer that uses the text
-descriptions and relational context of codes. MedTok processes text using a
-language model encoder and encodes the relational structure with a graph
-encoder. It then quantizes both modalities into a unified token space,
-preserving modality-specific and cross-modality information. We integrate
-MedTok into five EHR models and evaluate it on operational and clinical tasks
-across in-patient and out-patient datasets, including outcome prediction,
-diagnosis classification, drug recommendation, and risk stratification.
-Swapping standard EHR tokenizers with MedTok improves AUPRC across all EHR
-models, by 4.10% on MIMIC-III, 4.78% on MIMIC-IV, and 11.30% on EHRShot, with
-the largest gains in drug recommendation. Beyond EHR modeling, we demonstrate
-using MedTok tokenizer with medical QA systems. Our results demonstrate the
-potential of MedTok as a unified tokenizer for medical codes, improving
-tokenization for medical foundation models.
+Graph Neural Networks (GNNs) have demonstrated remarkable proficiency in
+modeling data with graph structures, yet recent research reveals their
+susceptibility to adversarial attacks. Traditional attack methodologies, which
+rely on manipulating the original graph or adding links to artificially created
+nodes, often prove impractical in real-world settings. This paper introduces a
+novel adversarial scenario involving the injection of an isolated subgraph to
+deceive both the link recommender and the node classifier within a GNN system.
+Specifically, the link recommender is mislead to propose links between targeted
+victim nodes and the subgraph, encouraging users to unintentionally establish
+connections and that would degrade the node classification accuracy, thereby
+facilitating a successful attack. To address this, we present the LiSA
+framework, which employs a dual surrogate model and bi-level optimization to
+simultaneously meet two adversarial objectives. Extensive experiments on
+real-world datasets demonstrate the effectiveness of our method.
 
-摘要：<paragraph>在患者电子健康记录 (EHR) 上训练的基础模型需要将医学数据标记为离散词汇项序列。现有的标记器将 EHR 中的医学代码视为孤立的文本标记。然而，每个医学代码都由其文本描述、在本体层次结构中的位置以及与其他代码的关系（例如疾病共现和药物治疗关联）来定义。医学词汇表包含超过 600,000 个代码，这些代码包含临床推理的关键信息。我们引入了 MedTok，这是一种多模态医学代码标记器，它使用文本描述和代码的关系上下文。MedTok 使用语言模型编码器处理文本，并使用图编码器对关系结构进行编码。然后，它将这两种模态量化为一个统一的标记空间，保留特定于模态和跨模态的信息。我们将 MedTok 集成到五个 EHR 模型中，并在住院和门诊数据集（包括结果预测、诊断分类、药物推荐和风险分层）上对其实施操作和临床任务进行评估。用 MedTok 替换标准 EHR 标记器可提高所有 EHR 模型的 AUPRC，在 MIMIC-III 上提高 4.10%，在 MIMIC-IV 上提高 4.78%，在 EHRShot 上提高 11.30%，其中药物推荐的增益最大。除了 EHR 建模之外，我们还演示了将 MedTok 标记器与医学问答系统结合使用。我们的结果证明了 MedTok 作为医学代码的统一标记器的潜力，改进了医学基础模型的标记化。</paragraph>
+摘要：圖形神經網路 (GNN) 已展現出在對具有圖形結構的資料進行建模方面的卓越能力，但最近的研究揭露了它們容易受到對抗性攻擊的影響。傳統的攻擊方法依賴於操縱原始圖形或將連結新增至人工建立的節點，在真實世界設定中通常被證明不切實際。本文介紹了一種新穎的對抗性場景，涉及注入一個孤立的子圖形，以欺騙 GNN 系統中的連結推薦器和節點分類器。具體來說，連結推薦器被誤導為在目標受害節點和子圖形之間提出連結，鼓勵使用者無意間建立連結，這將降低節點分類準確度，從而促成攻擊成功。為了解決這個問題，我們提出了 LiSA 框架，它採用雙重代理模型和雙層最佳化，以同時滿足兩個對抗性目標。對真實世界資料集進行的廣泛實驗證明了我們方法的有效性。
 
-##### **A Retrospective Systematic Study on Hierarchical Sparse Query Transformer-assisted Ultrasound Screening for Early Hepatocellular Carcinoma**
-2502.03772v1 by Chaoyin She, Ruifang Lu, Danni He, Jiayi Lv, Yadan Lin, Meiqing Cheng, Hui Huang, Lida Chen, Wei Wang, Qinghua Huang
+##### **AnomalyGFM: Graph Foundation Model for Zero/Few-shot Anomaly Detection**
+2502.09254v1 by Hezhe Qiao, Chaoxi Niu, Ling Chen, Guansong Pang
 
-Hepatocellular carcinoma (HCC) ranks as the third leading cause of
-cancer-related mortality worldwide, with early detection being crucial for
-improving patient survival rates. However, early screening for HCC using
-ultrasound suffers from insufficient sensitivity and is highly dependent on the
-expertise of radiologists for interpretation. Leveraging the latest
-advancements in artificial intelligence (AI) in medical imaging, this study
-proposes an innovative Hierarchical Sparse Query Transformer (HSQformer) model
-that combines the strengths of Convolutional Neural Networks (CNNs) and Vision
-Transformers (ViTs) to enhance the accuracy of HCC diagnosis in ultrasound
-screening. The HSQformer leverages sparse latent space representations to
-capture hierarchical details at various granularities without the need for
-complex adjustments, and adopts a modular, plug-and-play design philosophy,
-ensuring the model's versatility and ease of use. The HSQformer's performance
-was rigorously tested across three distinct clinical scenarios: single-center,
-multi-center, and high-risk patient testing. In each of these settings, it
-consistently outperformed existing state-of-the-art models, such as ConvNext
-and SwinTransformer. Notably, the HSQformer even matched the diagnostic
-capabilities of senior radiologists and comprehensively surpassed those of
-junior radiologists. The experimental results from this study strongly
-demonstrate the effectiveness and clinical potential of AI-assisted tools in
-HCC screening. The full code is available at
-https://github.com/Asunatan/HSQformer.
+Graph anomaly detection (GAD) aims to identify abnormal nodes that differ
+from the majority of the nodes in a graph, which has been attracting
+significant attention in recent years. Existing generalist graph models have
+achieved remarkable success in different graph tasks but struggle to generalize
+to the GAD task. This limitation arises from their difficulty in learning
+generalized knowledge for capturing the inherently infrequent, irregular and
+heterogeneous abnormality patterns in graphs from different domains. To address
+this challenge, we propose AnomalyGFM, a GAD-oriented graph foundation model
+that supports zero-shot inference and few-shot prompt tuning for GAD in diverse
+graph datasets. One key insight is that graph-agnostic representations for
+normal and abnormal classes are required to support effective zero/few-shot GAD
+across different graphs. Motivated by this, AnomalyGFM is pre-trained to align
+data-independent, learnable normal and abnormal class prototypes with node
+representation residuals (i.e., representation deviation of a node from its
+neighbors). The residual features essentially project the node information into
+a unified feature space where we can effectively measure the abnormality of
+nodes from different graphs in a consistent way. This provides a driving force
+for the learning of graph-agnostic, discriminative prototypes for the normal
+and abnormal classes, which can be used to enable zero-shot GAD on new graphs,
+including very large-scale graphs. If there are few-shot labeled normal nodes
+available in the new graphs, AnomalyGFM can further support prompt tuning to
+leverage these nodes for better adaptation. Comprehensive experiments on 11
+widely-used GAD datasets with real anomalies, demonstrate that AnomalyGFM
+significantly outperforms state-of-the-art competing methods under both zero-
+and few-shot GAD settings.
 
-摘要：肝細胞癌（HCC）是全球第三大癌症相關死亡原因，早期檢測對於提高患者存活率至關重要。然而，使用超音波進行 HCC 早期篩檢的靈敏度不足，且高度依賴放射科醫師的專業知識進行判讀。本研究利用醫學影像中人工智慧（AI）的最新進展，提出了一種創新的分層稀疏查詢Transformer（HSQformer）模型，結合了卷積神經網路（CNN）和視覺Transformer（ViT）的優點，以提高超音波篩檢中 HCC 診斷的準確性。HSQformer 利用稀疏潛在空間表示，在不需要複雜調整的情況下擷取各種粒度層級的細節，並採用模組化、即插即用的設計理念，確保模型的多功能性和易用性。HSQformer 的效能經過三個不同的臨床場景的嚴格測試：單中心、多中心和高風險患者測試。在這些設定中，它始終優於現有的最先進模型，例如 ConvNext 和 SwinTransformer。值得注意的是，HSQformer 甚至匹配了資深放射科醫師的診斷能力，並全面超越了初級放射科醫師的診斷能力。本研究的實驗結果有力地證明了 AI 輔助工具在 HCC 篩檢中的有效性和臨床潛力。完整程式碼可在 https://github.com/Asunatan/HSQformer 取得。
+摘要：圖形異常偵測 (GAD) 的目標是找出與圖形中大多數節點不同的異常節點，這在近年來引起了廣泛的關注。現有的通才圖形模型在不同的圖形任務中都取得了顯著的成功，但卻難以推廣到 GAD 任務。這種限制來自於它們難以學習廣泛的知識，用於擷取來自不同領域圖形中固有的罕見、不規則和異質異常模式。為了應對這個挑戰，我們提出了 AnomalyGFM，一個面向 GAD 的圖形基礎模型，它支援零次學習推論和少次提示調整，用於在不同的圖形資料集中進行 GAD。一個關鍵見解是，需要圖形不可知的正常和異常類別表示，以支援跨不同圖形的有效零次/少次 GAD。受此啟發，AnomalyGFM 被預先訓練以將與資料無關的可學習正常和異常類別原型與節點表示殘差（即節點與其鄰居的表示偏差）對齊。殘差特徵基本上將節點資訊投射到一個統一的特徵空間中，在這個空間中，我們可以有效地測量來自不同圖形的節點異常，並且方式一致。這為學習正常和異常類別的圖形不可知、有區別的原型提供了驅動力，這些原型可用於對新的圖形（包括非常大規模的圖形）啟用零次 GAD。如果新的圖形中有少量的標籤正常節點，AnomalyGFM 可以進一步支援提示調整，以利用這些節點進行更好的適應。在 11 個廣泛使用的具有真實異常值的 GAD 資料集上的綜合實驗表明，在零次和少次 GAD 設定下，AnomalyGFM 明顯優於最先進的競爭方法。
 
-##### **Towards Fair Medical AI: Adversarial Debiasing of 3D CT Foundation Embeddings**
-2502.04386v1 by Guangyao Zheng, Michael A. Jacobs, Vladimir Braverman, Vishwa S. Parekh
+##### **The Joint Entity-Relation Extraction Model Based on Span and Interactive Fusion Representation for Chinese Medical Texts with Complex Semantics**
+2502.09247v1 by Danni Feng, Runzhi Li, Jing Wang, Siyu Yan, Lihong Ma, Yunli Xing
 
-Self-supervised learning has revolutionized medical imaging by enabling
-efficient and generalizable feature extraction from large-scale unlabeled
-datasets. Recently, self-supervised foundation models have been extended to
-three-dimensional (3D) computed tomography (CT) data, generating compact,
-information-rich embeddings with 1408 features that achieve state-of-the-art
-performance on downstream tasks such as intracranial hemorrhage detection and
-lung cancer risk forecasting. However, these embeddings have been shown to
-encode demographic information, such as age, sex, and race, which poses a
-significant risk to the fairness of clinical applications.
-  In this work, we propose a Variation Autoencoder (VAE) based adversarial
-debiasing framework to transform these embeddings into a new latent space where
-demographic information is no longer encoded, while maintaining the performance
-of critical downstream tasks. We validated our approach on the NLST lung cancer
-screening dataset, demonstrating that the debiased embeddings effectively
-eliminate multiple encoded demographic information and improve fairness without
-compromising predictive accuracy for lung cancer risk at 1-year and 2-year
-intervals. Additionally, our approach ensures the embeddings are robust against
-adversarial bias attacks. These results highlight the potential of adversarial
-debiasing techniques to ensure fairness and equity in clinical applications of
-self-supervised 3D CT embeddings, paving the way for their broader adoption in
-unbiased medical decision-making.
+Joint entity-relation extraction is a critical task in transforming
+unstructured or semi-structured text into triplets, facilitating the
+construction of large-scale knowledge graphs, and supporting various downstream
+applications. Despite its importance, research on Chinese text, particularly
+with complex semantics in specialized domains like medicine, remains limited.
+To address this gap, we introduce the CH-DDI, a Chinese drug-drug interactions
+dataset designed to capture the intricacies of medical text. Leveraging the
+strengths of attention mechanisms in capturing long-range dependencies, we
+propose the SEA module, which enhances the extraction of complex contextual
+semantic information, thereby improving entity recognition and relation
+extraction. Additionally, to address the inefficiencies of existing methods in
+facilitating information exchange between entity recognition and relation
+extraction, we present an interactive fusion representation module. This module
+employs Cross Attention for bidirectional information exchange between the
+tasks and further refines feature extraction through BiLSTM. Experimental
+results on both our CH-DDI dataset and public CoNLL04 dataset demonstrate that
+our model exhibits strong generalization capabilities. On the CH-DDI dataset,
+our model achieves an F1-score of 96.73% for entity recognition and 78.43% for
+relation extraction. On the CoNLL04 dataset, it attains an entity recognition
+precision of 89.54% and a relation extraction accuracy of 71.64%.
 
-摘要：自我監督學習透過從大規模未標記資料集中提取有效且可概化的特徵，進而革新了醫學影像。最近，自我監督基礎模型已擴展到三維 (3D) 電腦斷層掃描 (CT) 資料，產生緊湊、資訊豐富的嵌入，包含 1408 個特徵，在顱內出血偵測和肺癌風險預測等下游任務中達到最先進的效能。然而，這些嵌入已被證明會編碼人口統計資訊，例如年齡、性別和種族，這對臨床應用的公平性構成重大風險。
-在這項工作中，我們提出一個基於變異自編碼器 (VAE) 的對抗性去偏框架，將這些嵌入轉換到一個新的潛在空間，其中不再編碼人口統計資訊，同時維持關鍵下游任務的效能。我們在 NLST 肺癌篩檢資料集上驗證了我們的做法，證明去偏嵌入有效消除了多重編碼的人口統計資訊，並在不損害 1 年和 2 年間隔的肺癌風險預測準確性的情況下提高了公平性。此外，我們的做法確保了嵌入對抗性偏誤攻擊具有魯棒性。這些結果突顯了對抗性去偏技術的潛力，可確保自我監督 3D CT 嵌入在臨床應用中的公平性和公正性，為其在無偏見醫療決策中的廣泛採用鋪路。
+摘要：聯合實體關係抽取是將非結構化或半結構化文字轉換為三元組的重要任務，有助於建構大規模知識圖譜，並支援各種下游應用程式。儘管其重要性，但針對中文文本的研究，特別是醫學等專業領域中具有複雜語義的研究仍十分有限。為了解決這個差距，我們引入了 CH-DDI，一個中文藥物-藥物交互作用資料集，旨在擷取醫學文本的複雜性。利用注意力機制在擷取長程依賴關係方面的優勢，我們提出了 SEA 模組，增強了複雜脈絡語義資訊的抽取，從而改進了實體辨識和關係抽取。此外，為了解決現有方法在促進實體辨識和關係抽取之間資訊交換方面的低效率問題，我們提出了互動式融合表示模組。此模組採用交叉注意力，在任務之間進行雙向資訊交換，並透過 BiLSTM 進一步精煉特徵抽取。在我們的 CH-DDI 資料集和公開的 CoNLL04 資料集上的實驗結果表明，我們的模型展現出強大的泛化能力。在 CH-DDI 資料集上，我們的模型在實體辨識方面達到了 96.73% 的 F1 分數，在關係抽取方面達到了 78.43% 的 F1 分數。在 CoNLL04 資料集上，它在實體辨識方面達到了 89.54% 的準確度，在關係抽取方面達到了 71.64% 的準確度。
 
-##### **Clinically-Inspired Hierarchical Multi-Label Classification of Chest X-rays with a Penalty-Based Loss Function**
-2502.03591v1 by Mehrdad Asadi, Komi Sodoké, Ian J. Gerard, Marta Kersten-Oertel
+##### **You Do Not Fully Utilize Transformer's Representation Capacity**
+2502.09245v1 by Gleb Gerasimov, Yaroslav Aksenov, Nikita Balagansky, Viacheslav Sinii, Daniil Gavrilov
+
+In contrast to RNNs, which compress previous tokens into a single hidden
+state, Transformers can attend to all previous tokens directly. However,
+standard Transformers only use representations from the immediately preceding
+layer. In this paper, we show that this design choice causes representation
+collapse and leads to suboptimal performance. To address this issue, we
+introduce Layer-Integrated Memory (LIMe), a simple yet powerful approach that
+preserves the model's overall memory footprint while expanding its
+representational capacity by allowing access to hidden states from earlier
+layers. Through extensive experiments across various architectures and
+different lookup mechanisms, we demonstrate consistent performance improvements
+on a wide range of tasks. Moreover, our analysis of the learned representation
+dynamics and our exploration of depthwise circuits reveal how LIMe integrates
+information across layers, pointing to promising directions for future
+research.
 
-In this work, we present a novel approach to multi-label chest X-ray (CXR)
-image classification that enhances clinical interpretability while maintaining
-a streamlined, single-model, single-run training pipeline. Leveraging the
-CheXpert dataset and VisualCheXbert-derived labels, we incorporate hierarchical
-label groupings to capture clinically meaningful relationships between
-diagnoses. To achieve this, we designed a custom hierarchical binary
-cross-entropy (HBCE) loss function that enforces label dependencies using
-either fixed or data-driven penalty types. Our model achieved a mean area under
-the receiver operating characteristic curve (AUROC) of 0.903 on the test set.
-Additionally, we provide visual explanations and uncertainty estimations to
-further enhance model interpretability. All code, model configurations, and
-experiment details are made available.
+摘要：與將先前符號壓縮成單一隱藏狀態的遞迴神經網路不同，Transformer 可以直接關注所有先前的符號。然而，標準 Transformer 僅使用緊鄰前一層的表示。在本文中，我們說明此設計選擇會導致表示崩潰，並導致次優效能。為了解決此問題，我們引入了「層整合式記憶體」(LIMe)，這是一種簡單但強大的方法，可在擴充表示能力的同時，保留模型的整體記憶體使用量，方法是允許存取來自較早層的隱藏狀態。透過各種架構和不同查詢機制的廣泛實驗，我們展示了在各種任務上的一致效能提升。此外，我們對已學習表示動態的分析和對深度電路的探討，揭示了 LIMe 如何整合跨層資訊，並指出未來研究有望發展的方向。
 
-摘要：在本文中，我們提出胸部 X 光（CXR）影像多標籤分類的新方法，在維持簡化的單一模型、單次執行訓練管線的同時，提升臨床可解釋性。利用 CheXpert 資料集和 VisualCheXbert 衍生的標籤，我們納入階層標籤群組，以擷取診斷之間具有臨床意義的關聯性。為此，我們設計了自訂的階層二元交叉熵 (HBCE) 損失函數，使用固定或資料驅動的懲罰類型來強制執行標籤依賴性。我們的模型在測試集上達到受試者工作特性曲線 (AUROC) 下的平均面積為 0.903。此外，我們提供視覺化說明和不確定性估計，以進一步提升模型可解釋性。所有程式碼、模型組態和實驗詳細資料皆已公開。
+##### **From large language models to multimodal AI: A scoping review on the potential of generative AI in medicine**
+2502.09242v1 by Lukas Buess, Matthias Keicher, Nassir Navab, Andreas Maier, Soroosh Tayebi Arasteh
 
-##### **Code Simulation as a Proxy for High-order Tasks in Large Language Models**
-2502.03568v1 by Emanuele La Malfa, Christoph Weinhuber, Orazio Torre, Fangru Lin, X. Angelo Huang, Samuele Marro, Anthony Cohn, Nigel Shadbolt, Michael Wooldridge
+Generative artificial intelligence (AI) models, such as diffusion models and
+OpenAI's ChatGPT, are transforming medicine by enhancing diagnostic accuracy
+and automating clinical workflows. The field has advanced rapidly, evolving
+from text-only large language models for tasks such as clinical documentation
+and decision support to multimodal AI systems capable of integrating diverse
+data modalities, including imaging, text, and structured data, within a single
+model. The diverse landscape of these technologies, along with rising interest,
+highlights the need for a comprehensive review of their applications and
+potential. This scoping review explores the evolution of multimodal AI,
+highlighting its methods, applications, datasets, and evaluation in clinical
+settings. Adhering to PRISMA-ScR guidelines, we systematically queried PubMed,
+IEEE Xplore, and Web of Science, prioritizing recent studies published up to
+the end of 2024. After rigorous screening, 144 papers were included, revealing
+key trends and challenges in this dynamic field. Our findings underscore a
+shift from unimodal to multimodal approaches, driving innovations in diagnostic
+support, medical report generation, drug discovery, and conversational AI.
+However, critical challenges remain, including the integration of heterogeneous
+data types, improving model interpretability, addressing ethical concerns, and
+validating AI systems in real-world clinical settings. This review summarizes
+the current state of the art, identifies critical gaps, and provides insights
+to guide the development of scalable, trustworthy, and clinically impactful
+multimodal AI solutions in healthcare.
 
-Many reasoning, planning, and problem-solving tasks share an intrinsic
-algorithmic nature: correctly simulating each step is a sufficient condition to
-solve them correctly. We collect pairs of naturalistic and synthetic reasoning
-tasks to assess the capabilities of Large Language Models (LLM). While
-naturalistic tasks often require careful human handcrafting, we show that
-synthetic data is, in many cases, a good proxy that is much easier to collect
-at scale. We leverage common constructs in programming as the counterpart of
-the building blocks of naturalistic reasoning tasks, such as straight-line
-programs, code that contains critical paths, and approximate and redundant
-instructions. We further assess the capabilities of LLMs on sorting problems
-and repeated operations via sorting algorithms and nested loops. Our synthetic
-datasets further reveal that while the most powerful LLMs exhibit relatively
-strong execution capabilities, the process is fragile: it is negatively
-affected by memorisation and seems to rely heavily on pattern recognition. Our
-contribution builds upon synthetically testing the reasoning capabilities of
-LLMs as a scalable complement to handcrafted human-annotated problems.
+摘要：生成式人工智能 (AI) 模型，例如扩散模型和 OpenAI 的 ChatGPT，通过提高诊断准确性和自动化临床工作流程，正在改变医学领域。该领域已迅速发展，从用于临床文件编制和决策支持等任务的纯文本大型语言模型，发展到能够在单个模型中整合包括影像、文本和结构化数据在内的多种数据方式的多模态 AI 系统。这些技术的多样化格局以及日益增长的兴趣，凸显了全面审查其应用和潜力的必要性。本范围审查探讨了多模态 AI 的演变，重点介绍了其方法、应用、数据集和在临床环境中的评估。遵循 PRISMA-ScR 指南，我们系统地查询了 PubMed、IEEE Xplore 和 Web of Science，优先考虑截至 2024 年底发表的最新研究。经过严格筛选，纳入了 144 篇论文，揭示了这一充满活力的领域的趋势和挑战。我们的研究结果强调了从单模态方法向多模态方法的转变，推动了诊断支持、医疗报告生成、药物发现和会话式 AI 的创新。然而，关键挑战仍然存在，包括异构数据类型的整合、提高模型可解释性、解决伦理问题以及在现实世界的临床环境中验证 AI 系统。本综述总结了当前的最新技术，确定了关键差距，并提供了见解，以指导在医疗保健领域开发可扩展、可信赖且具有临床影响力的多模态 AI 解决方案。
 
-摘要：許多推理、規劃和問題解決任務共享一個內在的演算法性質：正確模擬每一步就足以正確解決它們。我們收集自然主義和合成推理任務對，以評估大型語言模型 (LLM) 的功能。雖然自然主義任務通常需要仔細的人工製作，但我們表明在許多情況下，合成資料是一個很好的代理，而且更容易大規模收集。我們利用程式設計中的常見建構，作為自然主義推理任務構建區塊的對應物，例如直線程式、包含關鍵路徑的程式碼，以及近似和冗餘指令。我們進一步評估 LLM 在排序問題和重複運算上的功能，透過排序演算法和巢狀迴圈。我們的合成資料集進一步揭示，雖然最強大的 LLM 表現出相對強大的執行能力，但這個過程很脆弱：它受到記憶的負面影響，而且似乎嚴重依賴模式辨識。我們的貢獻建立在以合成方式測試 LLM 的推理能力之上，作為手工編寫人類標註問題的可擴充補充。
+##### **Reliable Conversational Agents under ASP Control that Understand Natural Language**
+2502.09237v1 by Yankai Zeng
 
-##### **Limitations of Large Language Models in Clinical Problem-Solving Arising from Inflexible Reasoning**
-2502.04381v1 by Jonathan Kim, Anna Podlasek, Kie Shidara, Feng Liu, Ahmed Alaa, Danilo Bernardo
+Efforts have been made to make machines converse like humans in the past few
+decades. The recent techniques of Large Language Models (LLMs) make it possible
+to have human-like conversations with machines, but LLM's flaws of lacking
+understanding and reliability are well documented. We believe that the best way
+to eliminate this problem is to use LLMs only as parsers to translate text to
+knowledge and vice versa and carry out the conversation by reasoning over this
+knowledge using the answer set programming. I have been developing a framework
+based on LLMs and ASP to realize reliable chatbots that "understand" human
+conversation. This framework has been used to develop task-specific chatbots as
+well as socialbots. My future research is focused on making these chatbots
+scalable and trainable.
 
-Large Language Models (LLMs) have attained human-level accuracy on medical
-question-answer (QA) benchmarks. However, their limitations in navigating
-open-ended clinical scenarios have recently been shown, raising concerns about
-the robustness and generalizability of LLM reasoning across diverse, real-world
-medical tasks. To probe potential LLM failure modes in clinical
-problem-solving, we present the medical abstraction and reasoning corpus
-(M-ARC). M-ARC assesses clinical reasoning through scenarios designed to
-exploit the Einstellung effect -- the fixation of thought arising from prior
-experience, targeting LLM inductive biases toward inflexible pattern matching
-from their training data rather than engaging in flexible reasoning. We find
-that LLMs, including current state-of-the-art o1 and Gemini models, perform
-poorly compared to physicians on M-ARC, often demonstrating lack of commonsense
-medical reasoning and a propensity to hallucinate. In addition, uncertainty
-estimation analyses indicate that LLMs exhibit overconfidence in their answers,
-despite their limited accuracy. The failure modes revealed by M-ARC in LLM
-medical reasoning underscore the need to exercise caution when deploying these
-models in clinical settings.
+摘要：在過去的幾十年裡，人們一直努力讓機器像人類一樣對話。大型語言模型 (LLM) 的最新技術讓與機器進行類人對話成為可能，但 LLM 缺乏理解力和可靠性的缺陷已被充分記錄。我們相信消除這個問題的最佳方法是僅將 LLM 作為解析器，將文字轉換為知識，反之亦然，並使用答案集程式設計對此知識進行推理來進行對話。我一直在開發一個基於 LLM 和 ASP 的框架，以實現「理解」人類對話的可靠聊天機器人。這個框架已被用於開發特定任務的聊天機器人以及社交機器人。我未來的研究重點在於讓這些聊天機器人具有可擴充性和可訓練性。
 
-摘要：大型語言模型 (LLM) 已在醫療問題解答 (QA) 基準上達到人類層級的準確度。然而，它們在應對開放式臨床場景中的局限性最近已被揭示，引發了人們對 LLM 推理在多樣化、真實世界醫療任務中的穩健性和概括性的擔憂。為了探討臨床問題解決中 LLM 的潛在故障模式，我們提出了醫療抽象和推理語料庫 (M-ARC)。M-ARC 通過旨在利用艾賓浩斯錯覺（由先前經驗產生的思維定勢）來評估臨床推理，針對 LLM 歸納偏誤，使其從訓練數據中進行僵化的模式匹配，而不是進行靈活的推理。我們發現，包括當前最先進的 o1 和 Gemini 模型在內的 LLM，在 M-ARC 上的表現遠不如醫生，它們經常表現出缺乏常識性的醫療推理和產生幻覺的傾向。此外，不確定性估計分析表明，儘管 LLM 準確性有限，但它們對自己的答案表現出過度自信。M-ARC 揭示的 LLM 醫療推理故障模式強調了在臨床環境中部署這些模型時需要謹慎。
+##### **Commonsense Reasoning-Aided Autonomous Vehicle Systems**
+2502.09233v1 by Keegan Kimbrell
 
-##### **Accurate AI-Driven Emergency Vehicle Location Tracking in Healthcare ITS Digital Twin**
-2502.03396v1 by Sarah Al-Shareeda, Yasar Celik, Bilge Bilgili, Ahmed Al-Dubai, Berk Canberk
+Autonomous Vehicle (AV) systems have been developed with a strong reliance on
+machine learning techniques. While machine learning approaches, such as deep
+learning, are extremely effective at tasks that involve observation and
+classification, they struggle when it comes to performing higher level
+reasoning about situations on the road. This research involves incorporating
+commonsense reasoning models that use image data to improve AV systems. This
+will allow AV systems to perform more accurate reasoning while also making them
+more adjustable, explainable, and ethical. This paper will discuss the findings
+so far and motivate its direction going forward.
 
-Creating a Digital Twin (DT) for Healthcare Intelligent Transportation
-Systems (HITS) is a hot research trend focusing on enhancing HITS management,
-particularly in emergencies where ambulance vehicles must arrive at the crash
-scene on time and track their real-time location is crucial to the medical
-authorities. Despite the claim of real-time representation, a temporal
-misalignment persists between the physical and virtual domains, leading to
-discrepancies in the ambulance's location representation. This study proposes
-integrating AI predictive models, specifically Support Vector Regression (SVR)
-and Deep Neural Networks (DNN), within a constructed mock DT data pipeline
-framework to anticipate the medical vehicle's next location in the virtual
-world. These models align virtual representations with their physical
-counterparts, i.e., metaphorically offsetting the synchronization delay between
-the two worlds. Trained meticulously on a historical geospatial dataset, SVR
-and DNN exhibit exceptional prediction accuracy in MATLAB and Python
-environments. Through various testing scenarios, we visually demonstrate the
-efficacy of our methodology, showcasing SVR and DNN's key role in significantly
-reducing the witnessed gap within the HITS's DT. This transformative approach
-enhances real-time synchronization in emergency HITS by approximately 88% to
-93%.
+摘要：自動駕駛車輛 (AV) 系統的開發高度依賴機器學習技術。儘管機器學習方法（例如深度學習）在涉及觀察和分類的任務中非常有效，但它們在對路況進行更高層級推理時會遇到困難。本研究涉及整合使用影像資料的常識推理模型，以改善 AV 系統。這將使 AV 系統能夠執行更準確的推理，同時也讓它們更具可調整性、可解釋性和道德性。本文將探討迄今為止的發現，並說明其未來的發展方向。
 
-摘要：建立醫療智慧交通系統（HITS）的數位分身（DT）是熱門的研究趨勢，其重點在於提升 HITS 管理，特別是在救護車必須準時抵達車禍現場的緊急情況中，追蹤其即時位置對於醫療單位至關重要。儘管聲稱即時呈現，但實體和虛擬領域之間仍存在時間上的錯位，導致救護車位置呈現上的差異。本研究建議在建構的虛擬 DT 資料管道架構中整合人工智慧預測模型，特別是支援向量回歸（SVR）和深度神經網路（DNN），以預測醫療車輛在虛擬世界的下一個位置。這些模型將虛擬呈現與其實體對應物對齊，也就是說，在兩個世界之間比喻性地抵銷同步延遲。在歷史地理空間資料集上經過仔細訓練，SVR 和 DNN 在 MATLAB 和 Python 環境中展現出卓越的預測準確性。透過各種測試情境，我們視覺化展示了我們方法論的效能，展示了 SVR 和 DNN 在顯著縮小 HITS 的 DT 中見證到的差距方面的關鍵作用。這種變革性的方法將緊急 HITS 中的即時同步提升了大約 88% 到 93%。
+##### **Logical foundations of Smart Contracts**
+2502.09232v1 by Kalonji Kalala
 
-##### **RadVLM: A Multitask Conversational Vision-Language Model for Radiology**
-2502.03333v1 by Nicolas Deperrois, Hidetoshi Matsuo, Samuel Ruipérez-Campillo, Moritz Vandenhirtz, Sonia Laguna, Alain Ryser, Koji Fujimoto, Mizuho Nishio, Thomas M. Sutter, Julia E. Vogt, Jonas Kluckert, Thomas Frauenfelder, Christian Blüthgen, Farhad Nooralahzadeh, Michael Krauthammer
+Nowadays, sophisticated domains are emerging which require appropriate
+formalisms to be specified accurately in order to reason about them. One such
+domain is constituted of smart contracts that have emerged in cyber physical
+systems as a way of enforcing formal agreements between components of these
+systems. Smart contracts self-execute to run and share business processes
+through blockchain, in decentralized systems, with many different participants.
+Legal contracts are in many cases complex documents, with a number of
+exceptions, and many subcontracts. The implementation of smart contracts based
+on legal contracts is a long and laborious task, that needs to include all
+actions, procedures, and the effects of actions related to the execution of the
+contract. An ongoing open problem in this area is to formally account for smart
+contracts using a uniform and somewhat universal formalism. This thesis
+proposes logical foundations to smart contracts using the Situation Calculus, a
+logic for reasoning about actions. Situation Calculus is one of the prominent
+logic-based artificial intelligence approaches that provides enough logical
+mechanism to specify and implement dynamic and complex systems such as
+contracts. Situation Calculus is suitable to show how worlds dynamically
+change. Smart contracts are going to be implement with Golog (written en
+Prolog), a Situation Calculus-based programming language for modeling complex
+and dynamic behaviors.
 
-The widespread use of chest X-rays (CXRs), coupled with a shortage of
-radiologists, has driven growing interest in automated CXR analysis and
-AI-assisted reporting. While existing vision-language models (VLMs) show
-promise in specific tasks such as report generation or abnormality detection,
-they often lack support for interactive diagnostic capabilities. In this work
-we present RadVLM, a compact, multitask conversational foundation model
-designed for CXR interpretation. To this end, we curate a large-scale
-instruction dataset comprising over 1 million image-instruction pairs
-containing both single-turn tasks -- such as report generation, abnormality
-classification, and visual grounding -- and multi-turn, multi-task
-conversational interactions. After fine-tuning RadVLM on this instruction
-dataset, we evaluate it across different tasks along with re-implemented
-baseline VLMs. Our results show that RadVLM achieves state-of-the-art
-performance in conversational capabilities and visual grounding while remaining
-competitive in other radiology tasks. Ablation studies further highlight the
-benefit of joint training across multiple tasks, particularly for scenarios
-with limited annotated data. Together, these findings highlight the potential
-of RadVLM as a clinically relevant AI assistant, providing structured CXR
-interpretation and conversational capabilities to support more effective and
-accessible diagnostic workflows.
+摘要：如今，正在出现需要适当形式化来准确指定以对其进行推理的复杂领域。此类领域之一由在网络物理系统中出现的智能合约构成，作为强制执行这些系统组件之间正式协议的一种方式。智能合约自执行以在去中心化系统中通过区块链运行和共享业务流程，并有许多不同的参与者。法律合约在许多情况下是复杂的文档，有许多例外和许多分包合同。基于法律合约实施智能合约是一项漫长而艰巨的任务，需要包括所有操作、程序以及与执行合约相关的操作效果。该领域的持续开放问题是使用统一且某种程度上通用的形式化来正式说明智能合约。本论文提出了使用情景演算（一种用于推理操作的逻辑）为智能合约提供逻辑基础。情景演算是基于逻辑的人工智能方法之一，提供了足够的逻辑机制来指定和实现动态且复杂的系统，例如合约。情景演算适用于展示世界如何动态变化。智能合约将使用 Golog（以 Prolog 编写的）实现，这是一种基于情景演算的编程语言，用于建模复杂且动态的行为。
 
-摘要：胸部 X 光 (CXR) 的广泛使用，加上放射科醫師短缺，促使人們對自動化 CXR 分析和 AI 輔助報告產生越來越濃厚的興趣。雖然現有的視覺語言模型 (VLM) 在特定任務中顯示出前景，例如報告生成或異常偵測，但它們通常缺乏對互動式診斷功能的支持。在這項工作中，我們提出 RadVLM，這是一個緊湊的多任務對話式基礎模型，專為 CXR 解釋而設計。為此，我們策劃了一個大型指令資料集，包含超過 100 萬個影像指令對，其中包含單輪任務（例如報告生成、異常分類和視覺基礎），以及多輪、多任務對話互動。在對這個指令資料集進行微調後，我們對 RadVLM 進行評估，並與重新實作的基準 VLM 一起執行不同的任務。我們的結果顯示，RadVLM 在對話能力和視覺基礎方面取得了最先進的效能，同時在其他放射學任務中仍具有競爭力。消融研究進一步突顯了跨多個任務進行聯合訓練的好處，特別是對於帶有標註資料有限的場景。這些發現共同突顯了 RadVLM 作為臨床相關 AI 助理的潛力，提供結構化的 CXR 解釋和對話能力，以支援更有效且可存取的診斷工作流程。
+##### **Relating Answer Set Programming and Many-sorted Logics for Formal Verification**
+2502.09230v1 by Zachary Hansen
 
-##### **MeDiSumQA: Patient-Oriented Question-Answer Generation from Discharge Letters**
-2502.03298v1 by Amin Dada, Osman Alperen Koras, Marie Bauer, Amanda Butler, Kaleb E. Smith, Jens Kleesiek, Julian Friedrich
+Answer Set Programming (ASP) is an important logic programming paradigm
+within the field of Knowledge Representation and Reasoning. As a concise,
+human-readable, declarative language, ASP is an excellent tool for developing
+trustworthy (especially, artificially intelligent) software systems. However,
+formally verifying ASP programs offers some unique challenges, such as
+  1. a lack of modularity (the meanings of rules are difficult to define in
+isolation from the enclosing program),
+  2. the ground-and-solve semantics (the meanings of rules are dependent on the
+input data with which the program is grounded), and
+  3. limitations of existing tools.
+  My research agenda has been focused on addressing these three issues with the
+intention of making ASP verification an accessible, routine task that is
+regularly performed alongside program development. In this vein, I have
+investigated alternative semantics for ASP based on translations into the logic
+of here-and-there and many-sorted first-order logic. These semantics promote a
+modular understanding of logic programs, bypass grounding, and enable us to use
+automated theorem provers to automatically verify properties of programs.
 
-While increasing patients' access to medical documents improves medical care,
-this benefit is limited by varying health literacy levels and complex medical
-terminology. Large language models (LLMs) offer solutions by simplifying
-medical information. However, evaluating LLMs for safe and patient-friendly
-text generation is difficult due to the lack of standardized evaluation
-resources. To fill this gap, we developed MeDiSumQA. MeDiSumQA is a dataset
-created from MIMIC-IV discharge summaries through an automated pipeline
-combining LLM-based question-answer generation with manual quality checks. We
-use this dataset to evaluate various LLMs on patient-oriented
-question-answering. Our findings reveal that general-purpose LLMs frequently
-surpass biomedical-adapted models, while automated metrics correlate with human
-judgment. By releasing MeDiSumQA on PhysioNet, we aim to advance the
-development of LLMs to enhance patient understanding and ultimately improve
-care outcomes.
+摘要：<paragraph>答案集程式設計 (ASP) 是知識表徵與推理領域中一個重要的邏輯程式設計範式。ASP 作為一種簡潔、人類可讀、宣告式的語言，是開發值得信賴的 (特別是人工智慧) 軟體系統的絕佳工具。然而，正式驗證 ASP 程式提供了一些獨特的挑戰，例如
+  1. 缺乏模組化 (規則的含義難以與封閉程式隔離定義)，
+  2. 基礎與求解語意 (規則的含義取決於程式基礎的輸入資料)，以及
+  3. 現有工具的限制。
+  我的研究議程一直專注於解決這三個問題，目的是讓 ASP 驗證成為一個可存取的、例行任務，並在程式開發過程中定期執行。在這個脈絡下，我研究了基於翻譯成此處和彼處邏輯以及多種排序一階邏輯的 ASP 替代語意。這些語意促進了邏輯程式的模組化理解，繞過基礎，並使我們能夠使用自動化定理證明器自動驗證程式的屬性。</paragraph>
 
-摘要：儘管讓患者更能取得醫療文件有助於改善醫療照護，
-但此優點受到不同的健康素養程度和複雜的醫療術語所限制。大型語言模型 (LLM) 提供了簡化醫療資訊的解決方案。然而，由於缺乏標準化的評估資源，因此難以評估 LLM 以確保其安全且對患者友善的文字產生。為了填補此缺口，我們開發了 MeDiSumQA。MeDiSumQA 是透過自動化流程從 MIMIC-IV 出院摘要中建立的資料集，結合了基於 LLM 的問答產生和手動品質檢查。我們使用此資料集來評估各種 LLM 在以患者為導向的問答中。我們的發現顯示，通用 LLM 經常超越生物醫學適應模型，而自動化指標與人類判斷相關。透過在 PhysioNet 上發布 MeDiSumQA，我們旨在推動 LLM 的發展，以增進患者理解，並最終改善照護成果。
+##### **Computational methods for Dynamic Answer Set Programming**
+2502.09228v1 by Susana Hahn
 
-##### **Deep Learning Pipeline for Fully Automated Myocardial Infarct Segmentation from Clinical Cardiac MR Scans**
-2502.03272v1 by Matthias Schwab, Mathias Pamminger, Christian Kremser, Agnes Mayr
+In our daily lives and industrial settings, we often encounter dynamic
+problems that require reasoning over time and metric constraints. These include
+tasks such as scheduling, routing, and production sequencing. Dynamic logics
+have traditionally addressed these needs but often lack the flexibility and
+integration required for comprehensive problem modeling. This research aims to
+extend Answer Set Programming (ASP), a powerful declarative problem-solving
+approach, to handle dynamic domains effectively. By integrating concepts from
+dynamic, temporal, and metric logics into ASP, we seek to develop robust
+systems capable of modeling complex dynamic problems and performing efficient
+reasoning tasks, thereby enhancing ASPs applicability in industrial contexts.
 
-Purpose: To develop and evaluate a deep learning-based method that allows to
-perform myocardial infarct segmentation in a fully-automated way.
-  Materials and Methods: For this retrospective study, a cascaded framework of
-two and three-dimensional convolutional neural networks (CNNs), specialized on
-identifying ischemic myocardial scars on late gadolinium enhancement (LGE)
-cardiac magnetic resonance (CMR) images, was trained on an in-house training
-dataset consisting of 144 examinations. On a separate test dataset from the
-same institution, including images from 152 examinations obtained between 2021
-and 2023, a quantitative comparison between artificial intelligence (AI)-based
-segmentations and manual segmentations was performed. Further, qualitative
-assessment of segmentation accuracy was evaluated for both human and
-AI-generated contours by two CMR experts in a blinded experiment.
-  Results: Excellent agreement could be found between manually and
-automatically calculated infarct volumes ($\rho_c$ = 0.9). The qualitative
-evaluation showed that compared to human-based measurements, the experts rated
-the AI-based segmentations to better represent the actual extent of infarction
-significantly (p < 0.001) more often (33.4% AI, 25.1% human, 41.5% equal). On
-the contrary, for segmentation of microvascular obstruction (MVO), manual
-measurements were still preferred (11.3% AI, 55.6% human, 33.1% equal).
-  Conclusion: This fully-automated segmentation pipeline enables CMR infarct
-size to be calculated in a very short time and without requiring any
-pre-processing of the input images while matching the segmentation quality of
-trained human observers. In a blinded experiment, experts preferred automated
-infarct segmentations more often than manual segmentations, paving the way for
-a potential clinical application.
+摘要：在我們的日常生活和工業環境中，我們經常會遇到動態問題，需要隨著時間和公制約束進行推理。這些問題包括排程、路由和生產順序等任務。動態邏輯傳統上解決了這些需求，但通常缺乏全面問題建模所需的靈活性與整合性。本研究旨在擴展強大的宣告式問題解決方法「Answer Set Programming (ASP)」，以有效處理動態領域。透過將動態、時態和公制邏輯的概念整合到 ASP 中，我們尋求開發強健的系統，能夠建模複雜的動態問題並執行有效的推理任務，進而增強 ASP 在工業環境中的適用性。
+
+##### **Generating Causally Compliant Counterfactual Explanations using ASP**
+2502.09226v1 by Sopam Dasgupta
 
-摘要：<paragraph>目的：開發和評估一種基於深度學習的方法，允許以全自動的方式執行心肌梗塞分割。
-材料和方法：對於這項回顧性研究，一個由二維和三維卷積神經網路 (CNN) 組成的串聯架構，專門用於識別晚期釓增強 (LGE) 心臟磁振造影 (CMR) 影像上的缺血性心肌疤痕，並在包含 144 項檢查的內部訓練資料集上受訓。在來自同一家機構的獨立測試資料集上，包括 2021 年至 2023 年間獲得的 152 項檢查的影像，執行基於人工智慧 (AI) 的分割和手動分割之間的定量比較。此外，由兩位 CMR 專家在盲測實驗中評估人類和 AI 生成的輪廓的分割準確度。
-結果：在手動和自動計算的梗塞體積之間可以發現極佳的一致性（ρ_c = 0.9）。定性評估顯示，與基於人類的測量相比，專家評估 AI 基於分割能更能代表梗塞的實際範圍，顯著（p < 0.001）更常發生（33.4% AI，25.1% 人類，41.5% 相等）。相反，對於微血管阻塞 (MVO) 的分割，手動測量仍然較受青睞（11.3% AI，55.6% 人類，33.1% 相等）。
-結論：這個全自動分割管道可以在很短的時間內計算 CMR 梗塞大小，而且無需對輸入影像進行任何前處理，同時匹配受過訓練的人類觀察者的分割品質。在盲測實驗中，專家比手動分割更常偏好自動梗塞分割，為潛在的臨床應用鋪平了道路。</paragraph>
+This research is focused on generating achievable counterfactual
+explanations. Given a negative outcome computed by a machine learning model or
+a decision system, the novel CoGS approach generates (i) a counterfactual
+solution that represents a positive outcome and (ii) a path that will take us
+from the negative outcome to the positive one, where each node in the path
+represents a change in an attribute (feature) value. CoGS computes paths that
+respect the causal constraints among features. Thus, the counterfactuals
+computed by CoGS are realistic. CoGS utilizes rule-based machine learning
+algorithms to model causal dependencies between features. The paper discusses
+the current status of the research and the preliminary results obtained.
 
-##### **Long-tailed Medical Diagnosis with Relation-aware Representation Learning and Iterative Classifier Calibration**
-2502.03238v2 by Li Pan, Yupei Zhang, Qiushi Yang, Tan Li, Zhen Chen
+摘要：本研究重點在於產生可實現的反事實解釋。給定由機器學習模型或決策系統計算出的負面結果，創新的 CoGS 方法會產生 (i) 代表正面結果的反事實解，以及 (ii) 一條將我們從負面結果帶到正面結果的途徑，其中途徑中的每個節點代表屬性 (特徵) 值的變化。CoGS 計算出符合特徵之間因果關係的途徑。因此，CoGS 計算出的反事實是切合實際的。CoGS 利用基於規則的機器學習演算法來建模特徵之間的因果關係。本文探討了研究的現況和獲得的初步結果。
 
-Recently computer-aided diagnosis has demonstrated promising performance,
-effectively alleviating the workload of clinicians. However, the inherent
-sample imbalance among different diseases leads algorithms biased to the
-majority categories, leading to poor performance for rare categories. Existing
-works formulated this challenge as a long-tailed problem and attempted to
-tackle it by decoupling the feature representation and classification. Yet, due
-to the imbalanced distribution and limited samples from tail classes, these
-works are prone to biased representation learning and insufficient classifier
-calibration. To tackle these problems, we propose a new Long-tailed Medical
-Diagnosis (LMD) framework for balanced medical image classification on
-long-tailed datasets. In the initial stage, we develop a Relation-aware
-Representation Learning (RRL) scheme to boost the representation ability by
-encouraging the encoder to capture intrinsic semantic features through
-different data augmentations. In the subsequent stage, we propose an Iterative
-Classifier Calibration (ICC) scheme to calibrate the classifier iteratively.
-This is achieved by generating a large number of balanced virtual features and
-fine-tuning the encoder using an Expectation-Maximization manner. The proposed
-ICC compensates for minority categories to facilitate unbiased classifier
-optimization while maintaining the diagnostic knowledge in majority classes.
-Comprehensive experiments on three public long-tailed medical datasets
-demonstrate that our LMD framework significantly surpasses state-of-the-art
-approaches. The source code can be accessed at
-https://github.com/peterlipan/LMD.
+##### **Order-Sorted Intensional Logic: Expressing Subtyping Polymorphism with Typing Assertions and Quantification over Concepts**
+2502.09224v1 by Đorđe Marković, Marc Denecker
 
-摘要：<paragraph>最近，计算机辅助诊断已展现出可观的表现，有效减轻了临床医生的工作量。然而，不同疾病之间固有的样本不平衡导致算法偏向于多数类别，从而导致罕见类别表现不佳。现有工作将这一挑战表述为长尾问题，并尝试通过解耦特征表示和分类来解决它。然而，由于不平衡分布和尾类样本有限，这些工作容易出现有偏差的表示学习和分类器校准不足。为了解决这些问题，我们提出了一个新的长尾医学诊断 (LMD) 框架，用于对长尾数据集进行平衡的医学图像分类。在初始阶段，我们开发了一个关系感知表示学习 (RRL) 方案，通过鼓励编码器通过不同的数据增强来捕获内在语义特征，从而提升表示能力。在后续阶段，我们提出了一个迭代分类器校准 (ICC) 方案，以迭代方式校准分类器。这是通过生成大量的平衡虚拟特征并使用期望最大化方式微调编码器来实现的。所提出的 ICC 补偿了少数类别，以促进无偏分类器优化，同时保持多数类别的诊断知识。在三个公共长尾医学数据集上进行的综合实验表明，我们的 LMD 框架明显超越了最先进的方法。源代码可在 https://github.com/peterlipan/LMD 处获取。</paragraph>
+Subtyping, also known as subtype polymorphism, is a concept extensively
+studied in programming language theory, delineating the substitutability
+relation among datatypes. This property ensures that programs designed for
+supertype objects remain compatible with their subtypes.
+  In this paper, we explore the capability of order-sorted logic for utilizing
+these ideas in the context of Knowledge Representation. We recognize two
+fundamental limitations: First, the inability of this logic to address the
+concept rather than the value of non-logical symbols, and second, the lack of
+language constructs for constraining the type of terms. Consequently, we
+propose guarded order-sorted intensional logic, where guards are language
+constructs for annotating typing information and intensional logic provides
+support for quantification over concepts.
 
-##### **Fine-Tuning Strategies for Continual Online EEG Motor Imagery Decoding: Insights from a Large-Scale Longitudinal Study**
-2502.06828v1 by Martin Wimpff, Bruno Aristimunha, Sylvain Chevallier, Bin Yang
+摘要：子類型化，也稱為子類型多態性，是一個在程式語言理論中廣泛研究的概念，用於描述資料類型之間的可替換關係。此特性可確保為超類型物件設計的程式與其子類型相容。
+在本文中，我們探討了使用排序邏輯在知識表徵中運用這些想法的能力。我們發現了兩個基本限制：首先，此邏輯無法處理非邏輯符號的概念而非值，其次，缺乏約束項類型的語言結構。因此，我們提出了受保護的排序邏輯，其中保護是註解類型資訊的語言結構，而內涵邏輯則支援對概念量化。
 
-This study investigates continual fine-tuning strategies for deep learning in
-online longitudinal electroencephalography (EEG) motor imagery (MI) decoding
-within a causal setting involving a large user group and multiple sessions per
-participant. We are the first to explore such strategies across a large user
-group, as longitudinal adaptation is typically studied in the single-subject
-setting with a single adaptation strategy, which limits the ability to
-generalize findings. First, we examine the impact of different fine-tuning
-approaches on decoder performance and stability. Building on this, we integrate
-online test-time adaptation (OTTA) to adapt the model during deployment,
-complementing the effects of prior fine-tuning. Our findings demonstrate that
-fine-tuning that successively builds on prior subject-specific information
-improves both performance and stability, while OTTA effectively adapts the
-model to evolving data distributions across consecutive sessions, enabling
-calibration-free operation. These results offer valuable insights and
-recommendations for future research in longitudinal online MI decoding and
-highlight the importance of combining domain adaptation strategies for
-improving BCI performance in real-world applications. Clinical Relevance: Our
-investigation enables more stable and efficient long-term motor imagery
-decoding, which is critical for neurorehabilitation and assistive technologies.
+##### **ASP-driven User-interaction with Clinguin**
+2502.09222v1 by Alexander Beiser, Susana Hahn, Torsten Schaub
 
-摘要：本研究探討在因果關係設定中涉及大量使用者群組和每個參與者多個階段的線上縱向腦電圖 (EEG) 運動想像 (MI) 解碼中，深度學習的持續微調策略。我們是第一個在大量使用者群組中探討此類策略，因為縱向適應通常在單一主體設定中研究，並使用單一適應策略，這限制了推廣研究結果的能力。首先，我們探討不同微調方法對解碼器效能和穩定性的影響。在此基礎上，我們整合線上測試時間適應 (OTTA) 以在部署期間適應模型，補充先前微調的效果。我們的研究結果表明，連續建立在先前特定主體資訊上的微調可以同時改善效能和穩定性，而 OTTA 可以有效地適應連續階段中不斷變化的資料分佈，從而實現無需校準的操作。這些結果為縱向線上 MI 解碼的未來研究提供了有價值的見解和建議，並強調了結合領域適應策略以改善實際應用中 BCI 效能的重要性。臨床相關性：我們的研究可以實現更穩定、更有效的長期運動想像解碼，這對於神經復健和輔助技術至關重要。
+We present clinguin, a system for ASP-driven user interface design. Clinguin
+streamlines the development of user interfaces for ASP developers by letting
+them build interactive prototypes directly in ASP, eliminating the need for
+separate frontend languages. To this end, clinguin uses a few dedicated
+predicates to define user interfaces and the treatment of user-triggered
+events. This simple design greatly facilitates the specification of user
+interactions with an ASP system, in our case clingo.
 
-##### **MedBioLM: Optimizing Medical and Biological QA with Fine-Tuned Large Language Models and Retrieval-Augmented Generation**
-2502.03004v1 by Seonok Kim
+摘要：我們提出 clinguin，一個用於 ASP 驅動使用者介面設計的系統。Clinguin 透過讓 ASP 開發人員直接在 ASP 中建立互動式原型，簡化了使用者介面的開發，消除了對個別前端語言的需求。為此，clinguin 使用一些專用的謂詞來定義使用者介面和處理使用者觸發的事件。這個簡單的設計極大地簡化了使用者與 ASP 系統互動的規範，在我們的案例中是 clingo。
 
-Large Language Models (LLMs) have demonstrated impressive capabilities across
-natural language processing tasks. However, their application to specialized
-domains such as medicine and biology requires further optimization to ensure
-factual accuracy, reliability, and contextual depth. We introduce MedBioLM, a
-domain-adapted biomedical question-answering model designed to enhance both
-short-form and long-form queries. By integrating fine-tuning and
-retrieval-augmented generation (RAG), MedBioLM dynamically incorporates
-domain-specific knowledge, improving reasoning abilities and factual accuracy.
-To evaluate its effectiveness, we fine-tuned the model on diverse biomedical QA
-datasets, covering structured multiple-choice assessments and complex clinical
-reasoning tasks. Fine-tuning significantly improves accuracy on benchmark
-datasets, while RAG enhances factual consistency. These results highlight the
-potential of domain-optimized LLMs in advancing biomedical research, medical
-education, and clinical decision support.
+##### **Pearce's Characterisation in an Epistemic Domain**
+2502.09221v1 by Ezgi Iraz Su
 
-摘要：大型語言模型 (LLM) 已展現出在自然語言處理任務中令人印象深刻的能力。然而，要將其應用於醫學和生物學等特定領域，需要進一步最佳化，以確保事實的準確性、可靠性以及脈絡的深度。我們引進了 MedBioLM，這是一個適應領域的生物醫學問答模型，旨在增強短式和長式查詢。透過整合微調和檢索增強生成 (RAG)，MedBioLM 能動態地納入領域特定的知識，從而提升推理能力和事實準確性。為了評估其有效性，我們對模型進行微調，使其涵蓋結構化的多重選擇評量和複雜的臨床推理任務等多樣化的生物醫學問答資料集。微調顯著提升了基準資料集的準確性，而 RAG 則增強了事實的一致性。這些結果突顯了領域最佳化的 LLM 在推進生物醫學研究、醫學教育和臨床決策支援方面的潛力。
+Answer-set programming (ASP) is a successful problem-solving approach in
+logic-based AI. In ASP, problems are represented as declarative logic programs,
+and solutions are identified through their answer sets. Equilibrium logic (EL)
+is a general-purpose nonmonotonic reasoning formalism, based on a monotonic
+logic called here-and-there logic. EL was basically proposed by Pearce as a
+foundational framework of ASP. Epistemic specifications (ES) are extensions of
+ASP-programs with subjective literals. These new modal constructs in the
+ASP-language make it possible to check whether a regular literal of ASP is true
+in every (or some) answer-set of a program. ES-programs are interpreted by
+world-views, which are essentially collections of answer-sets. (Reflexive)
+autoepistemic logic is a nonmonotonic formalism, modeling self-belief
+(knowledge) of ideally rational agents. A relatively new semantics for ES is
+based on a combination of EL and (reflexive) autoepistemic logic. In this
+paper, we first propose an overarching framework in the epistemic ASP domain.
+We then establish a correspondence between existing (reflexive) (auto)epistemic
+equilibrium logics and our easily-adaptable comprehensive framework, building
+on Pearce's characterisation of answer-sets as equilibrium models. We achieve
+this by extending Ferraris' work on answer sets for propositional theories to
+the epistemic case and reveal the relationship between some ES-semantic
+proposals.
 
-##### **Contrastive Token-level Explanations for Graph-based Rumour Detection**
-2502.04366v1 by Daniel Wai Kit Chin, Roy Ka-Wei Lee
+摘要：<paragraph>答案集程式設計（ASP）是基於邏輯的人工智慧中一種成功的問題解決方法。在 ASP 中，問題表示為宣告式邏輯程式，並透過其答案集來找出解答。平衡邏輯（EL）是一種通用的非單調推理形式主義，基於一種稱為此處和彼處邏輯的單調邏輯。EL 基本是由 Pearce 作為 ASP 的基礎架構所提出。知識規範（ES）是 ASP 程式與主觀文字的延伸。ASP 語言中的這些新模態建構使得可以檢查 ASP 的常規文字是否在程式的每個（或某些）答案集中為真。ES 程式由世界觀來詮釋，其本質上是答案集的集合。（反身）自認識邏輯是一種非單調形式主義，用來建模理想理性主體的自信念（知識）。ES 的一種相對新的語意是基於 EL 和（反身）自認識邏輯的組合。在本文中，我們首先提出一個涵蓋知識 ASP 領域的架構。然後，我們建立現有（反身）（自）認識平衡邏輯與我們容易適應的綜合架構之間的對應關係，建立在 Pearce 將答案集描述為平衡模型的特性之上。我們透過將 Ferraris 在命題理論的答案集上的工作延伸到知識案例，並揭示一些 ES 語義提案之間的關係來達成這一點。</paragraph>
 
-The widespread use of social media has accelerated the dissemination of
-information, but it has also facilitated the spread of harmful rumours, which
-can disrupt economies, influence political outcomes, and exacerbate public
-health crises, such as the COVID-19 pandemic. While Graph Neural Network
-(GNN)-based approaches have shown significant promise in automated rumour
-detection, they often lack transparency, making their predictions difficult to
-interpret. Existing graph explainability techniques fall short in addressing
-the unique challenges posed by the dependencies among feature dimensions in
-high-dimensional text embeddings used in GNN-based models. In this paper, we
-introduce Contrastive Token Layerwise Relevance Propagation (CT-LRP), a novel
-framework designed to enhance the explainability of GNN-based rumour detection.
-CT-LRP extends current graph explainability methods by providing token-level
-explanations that offer greater granularity and interpretability. We evaluate
-the effectiveness of CT-LRP across multiple GNN models trained on three
-publicly available rumour detection datasets, demonstrating that it
-consistently produces high-fidelity, meaningful explanations, paving the way
-for more robust and trustworthy rumour detection systems.
+##### **Graphical Conditions for the Existence, Unicity and Number of Regular Models**
+2502.09220v1 by Van-Giang Trinh, Belaid Benhamou, Sylvain Soliman, François Fages
 
-摘要：社群媒體的廣泛使用加速了資訊的傳播，但也促进了有害謠言的散播，這可能會擾亂經濟、影響政治結果，並加劇公共衛生危機，例如 COVID-19 大流行。雖然基於圖神經網路 (GNN) 的方法在自動化謠言偵測方面展現了顯著的前景，但它們通常缺乏透明度，這使得它們的預測難以解釋。現有的圖形可解釋性技術無法解決 GNN 模型中使用的維度嵌入式文本之間的依賴性所帶來的獨特挑戰。在本文中，我們介紹了對比標記分層關聯性傳播 (CT-LRP)，這是一個新穎的框架，旨在增強基於 GNN 的謠言偵測的可解釋性。CT-LRP 透過提供標記級別的解釋來擴充當前的圖形可解釋性方法，這些解釋提供了更細緻的粒度和可解釋性。我們在三個公開的謠言偵測資料集上訓練的幾個 GNN 模型中評估了 CT-LRP 的有效性，證明它始終產生高保真、有意義的解釋，為更強健且值得信賴的謠言偵測系統鋪路。
+The regular models of a normal logic program are a particular type of partial
+(i.e. 3-valued) models which correspond to stable partial models with minimal
+undefinedness. In this paper, we explore graphical conditions on the dependency
+graph of a finite ground normal logic program to analyze the existence, unicity
+and number of regular models for the program. We show three main results: 1) a
+necessary condition for the existence of non-trivial (i.e. non-2-valued)
+regular models, 2) a sufficient condition for the unicity of regular models,
+and 3) two upper bounds for the number of regular models based on positive
+feedback vertex sets. The first two conditions generalize the finite cases of
+the two existing results obtained by You and Yuan (1994) for normal logic
+programs with well-founded stratification. The third result is also new to the
+best of our knowledge. Key to our proofs is a connection that we establish
+between finite ground normal logic programs and Boolean network theory.
 
-##### **AI-Based Thermal Video Analysis in Privacy-Preserving Healthcare: A Case Study on Detecting Time of Birth**
-2502.04365v1 by Jorge García-Torres, Øyvind Meinich-Bache, Siren Rettedal, Kjersti Engan
+摘要：正规模型的常规模型是一种特殊类型的局部模型（即 3 值）模型，它对应于具有最小未定义性的稳定局部模型。在本文中，我们探索了有限接地正规逻辑程序的依赖图上的图形条件，以分析程序的正规模型的存在性、唯一性和数量。我们展示了三个主要结果：1) 非平凡（即非 2 值）正规模型存在的必要条件，2) 正规模型唯一性的充分条件，3) 基于正反馈顶点集的正规模型数目的两个上限。前两个条件概括了 You 和 Yuan (1994) 为具有良好基础分层的正规逻辑程序获得的两个现有结果的有限情况。据我们所知，第三个结果也是新的。我们证明的关键是我们在有限接地正规逻辑程序和布尔网络理论之间建立的联系。
 
-Approximately 10% of newborns need some assistance to start breathing and 5\%
-proper ventilation. It is crucial that interventions are initiated as soon as
-possible after birth. Accurate documentation of Time of Birth (ToB) is thereby
-essential for documenting and improving newborn resuscitation performance.
-However, current clinical practices rely on manual recording of ToB, typically
-with minute precision. In this study, we present an AI-driven, video-based
-system for automated ToB detection using thermal imaging, designed to preserve
-the privacy of healthcare providers and mothers by avoiding the use of
-identifiable visual data. Our approach achieves 91.4% precision and 97.4%
-recall in detecting ToB within thermal video clips during performance
-evaluation. Additionally, our system successfully identifies ToB in 96% of test
-cases with an absolute median deviation of 1 second compared to manual
-annotations. This method offers a reliable solution for improving ToB
-documentation and enhancing newborn resuscitation outcomes.
+##### **Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**
+2502.09218v1 by Flavio Bertini, Alessandro Dal Palù, Federica Zaglio, Francesco Fabiano, Andrea Formisano
 
-摘要：約 10% 的新生兒需要協助才能開始呼吸，5% 需要適當的通氣。在出生後盡快開始介入至關重要。準確記錄出生時間 (ToB) 對於記錄和改善新生兒復甦表現至關重要。然而，目前的臨床實務依賴於手動記錄 ToB，通常精確到分鐘。在這項研究中，我們提出一個以 AI 為主的、基於影片的系統，用於使用熱影像自動偵測 ToB，旨在透過避免使用可識別的視覺資料來保護醫療保健提供者和母親的隱私。我們的做法在執行評估期間，在熱影像片段中偵測 ToB 時達到了 91.4% 的精確度和 97.4% 的召回率。此外，我們的系統在 96% 的測試案例中成功識別出 ToB，與手動註解相比，絕對中位數偏差為 1 秒。此方法提供了一個可靠的解決方案，用於改善 ToB 記錄和增強新生兒復甦結果。
+This paper presents a complete explainable system that interprets a set of
+data, abstracts the underlying features and describes them in a natural
+language of choice. The system relies on two crucial stages: (i) identifying
+emerging properties from data and transforming them into abstract concepts, and
+(ii) converting these concepts into natural language. Despite the impressive
+natural language generation capabilities demonstrated by Large Language Models,
+their statistical nature and the intricacy of their internal mechanism still
+force us to employ these techniques as black boxes, forgoing trustworthiness.
+Developing an explainable pipeline for data interpretation would allow
+facilitating its use in safety-critical environments like processing medical
+information and allowing non-experts and visually impaired people to access
+narrated information. To this end, we believe that the fields of knowledge
+representation and automated reasoning research could present a valid
+alternative. Expanding on prior research that tackled the first stage (i), we
+focus on the second stage, named Concept2Text. Being explainable, data
+translation is easily modeled through logic-based rules, once again emphasizing
+the role of declarative programming in achieving AI explainability. This paper
+explores a Prolog/CLP-based rewriting system to interpret concepts-articulated
+in terms of classes and relations, plus common knowledge-derived from a generic
+ontology, generating natural language text. Its main features include
+hierarchical tree rewritings, modular multilingual generation, support for
+equivalent variants across semantic, grammar, and lexical levels, and a
+transparent rule-based system. We outline the architecture and demonstrate its
+flexibility through some examples capable of generating numerous diverse and
+equivalent rewritings based on the input concept.
 
-##### **3D Foundation AI Model for Generalizable Disease Detection in Head Computed Tomography**
-2502.02779v1 by Weicheng Zhu, Haoxu Huang, Huanze Tang, Rushabh Musthyala, Boyang Yu, Long Chen, Emilio Vega, Thomas O'Donnell, Seena Dehkharghani, Jennifer A. Frontera, Arjun V. Masurkar, Kara Melmed, Narges Razavian
+摘要：<paragraph>這篇論文提出了一個完整的可解釋系統，它可以解釋一組資料，抽象出基礎特徵，並以選擇的自然語言描述它們。系統依賴兩個關鍵階段：(i) 從資料中識別新興屬性，並將它們轉換為抽象概念，以及 (ii) 將這些概念轉換為自然語言。儘管大型語言模型展示了令人印象深刻的自然語言生成能力，但它們的統計性質和內部機制的複雜性仍然迫使我們將這些技術用作黑盒子，放棄可信度。開發一個可解釋的資料解釋管道將有助於促進在安全關鍵環境中使用它，例如處理醫療資訊，並允許非專家和視障人士存取敘述資訊。為此，我們相信知識表示和自動推理研究領域可以提出一個有效的替代方案。在擴展解決第一階段 (i) 的先前研究的基礎上，我們專注於第二階段，稱為 Concept2Text。由於具有可解釋性，資料翻譯很容易透過基於邏輯的規則建模，再次強調宣告式程式設計在實現 AI 可解釋性中的作用。本文探討了一個基於 Prolog/CLP 的重寫系統，以解釋概念，這些概念以類別和關係的形式表達，再加上從通用本体衍生的常識，產生自然語言文字。它的主要特點包括階層樹重寫、模組化多語言生成、支援語義、語法和詞彙層面的等效變體，以及一個透明的基於規則的系統。我們概述了架構，並透過一些範例展示了它的靈活性，這些範例能夠根據輸入概念生成許多不同的等效重寫。</paragraph>
 
-Head computed tomography (CT) imaging is a widely-used imaging modality with
-multitudes of medical indications, particularly in assessing pathology of the
-brain, skull, and cerebrovascular system. It is commonly the first-line imaging
-in neurologic emergencies given its rapidity of image acquisition, safety,
-cost, and ubiquity. Deep learning models may facilitate detection of a wide
-range of diseases. However, the scarcity of high-quality labels and
-annotations, particularly among less common conditions, significantly hinders
-the development of powerful models. To address this challenge, we introduce
-FM-CT: a Foundation Model for Head CT for generalizable disease detection,
-trained using self-supervised learning. Our approach pre-trains a deep learning
-model on a large, diverse dataset of 361,663 non-contrast 3D head CT scans
-without the need for manual annotations, enabling the model to learn robust,
-generalizable features. To investigate the potential of self-supervised
-learning in head CT, we employed both discrimination with self-distillation and
-masked image modeling, and we construct our model in 3D rather than at the
-slice level (2D) to exploit the structure of head CT scans more comprehensively
-and efficiently. The model's downstream classification performance is evaluated
-using internal and three external datasets, encompassing both in-distribution
-(ID) and out-of-distribution (OOD) data. Our results demonstrate that the
-self-supervised foundation model significantly improves performance on
-downstream diagnostic tasks compared to models trained from scratch and
-previous 3D CT foundation models on scarce annotated datasets. This work
-highlights the effectiveness of self-supervised learning in medical imaging and
-sets a new benchmark for head CT image analysis in 3D, enabling broader use of
-artificial intelligence for head CT-based diagnosis.
+##### **Mind the Gaps: Logical English, Prolog, and Multi-agent Systems for Autonomous Vehicles**
+2502.09216v1 by Galileo Sartor, Adam Wyner, Giuseppe Contissa
+
+In this paper, we present a modular system for representing and reasoning
+with legal aspects of traffic rules for autonomous vehicles. We focus on a
+subset of the United Kingdom's Highway Code (HC) related to junctions. As human
+drivers and automated vehicles (AVs) will interact on the roads, especially in
+urban environments, we claim that an accessible, unitary, high-level
+computational model should exist and be applicable to both users. Autonomous
+vehicles introduce a shift in liability that should not bring disadvantages or
+increased burden on human drivers. We develop a system "in silico" of the
+model. The proposed system is built of three main components: a natural
+language interface, using Logical English, which encodes the rules; an internal
+representation of the rules in Prolog; and an multi-agent-based simulation
+environment, built in NetLogo. The three components interact: Logical English
+is translated into and out of Prolog (along with some support code); Prolog and
+NetLogo interface via predicates. Such a modular approach enables the different
+components to carry different "burdens" in the overall system; it also allows
+swapping of modules. Given NetLogo, we can visualize the effect of the modeled
+rules as well as validate the system with a simple dynamic running scenario.
+Designated agents monitor the behaviour of the vehicles for compliance and
+record potential violations where they occur. The information on potential
+violations is then utilized by Validators, to determine whether the violation
+is punishable, differentiating between exceptions and cases.
 
-摘要：頭部電腦斷層掃描（CT）影像是一種廣泛使用的影像模式，具有
-大量的醫療適應症，特別是在評估腦部、頭骨和腦血管系統的病理時。由於其影像擷取速度快、安全性、成本低和普遍性，通常是神經緊急情況下的第一線影像。深度學習模型可以促進對各種疾病的檢測。然而，高品質標籤和註釋的稀缺，特別是在較不常見的疾病中，顯著地阻礙了強大模型的發展。為了應對這一挑戰，我們引入了 FM-CT：一個用於頭部 CT 的基礎模型，用於可概化的疾病檢測，並使用自我監督學習進行訓練。我們的做法在一個包含 361,663 個非對比 3D 頭部 CT 掃描的大型、多樣化的數據集上預訓練一個深度學習模型，而無需手動註釋，使模型能夠學習強健、可概化的特徵。為了探討自我監督學習在頭部 CT 中的潛力，我們同時採用了帶有自我蒸餾的判別和遮罩影像建模，並且我們以 3D 而不是切片層級（2D）構建我們的模型，以更全面、有效地利用頭部 CT 掃描的結構。該模型的下游分類效能使用內部和三個外部數據集進行評估，包括分佈內 (ID) 和分佈外 (OOD) 資料。我們的結果表明，與從頭開始訓練的模型和先前在稀疏註釋數據集上訓練的 3D CT 基礎模型相比，自我監督基礎模型顯著改善了下游診斷任務的效能。這項工作突顯了自我監督學習在醫學影像中的有效性，並為 3D 頭部 CT 影像分析設定了一個新的基準，讓人工智慧能夠更廣泛地用於基於頭部 CT 的診斷。
+摘要：<paragraph>在本文中，我們提出了一個模組化系統，用於表示和推理自動駕駛車輛交通規則的法律層面。我們專注於與路口相關的英國公路法規 (HC) 子集。由於人類駕駛和自動駕駛車輛 (AV) 將在道路上互動，尤其是在城市環境中，我們主張應存在一個可存取、統一、高階的運算模型，並適用於這兩種使用者。自動駕駛車輛引入了責任轉移，不應給人類駕駛帶來劣勢或增加負擔。我們開發了一個模型的「電腦模擬」系統。所提出的系統由三個主要組成部分建構而成：使用邏輯英語的自然語言介面，用於編碼規則；使用 Prolog 的規則內部表示；以及使用 NetLogo 建構的多主體模擬環境。這三個組成部分會進行互動：邏輯英語會翻譯成 Prolog（以及一些支援程式碼），再從 Prolog 翻譯回來；Prolog 和 NetLogo 會透過謂詞進行介面。這種模組化方法讓不同的組成部分可以在整體系統中承擔不同的「負擔」；它也允許模組交換。有了 NetLogo，我們可以視覺化已建模規則的效果，並使用一個簡單的動態執行範例來驗證系統。指定的代理會監控車輛的行為，以確保遵守規定，並記錄發生的潛在違規行為。然後，驗證者會利用潛在違規行為的資訊，來確定違規行為是否應受懲罰，並區分例外情況和案例。</paragraph>
 
-##### **Adaptive Voxel-Weighted Loss Using L1 Norms in Deep Neural Networks for Detection and Segmentation of Prostate Cancer Lesions in PET/CT Images**
-2502.02756v1 by Obed Korshie Dzikunu, Shadab Ahamed, Amirhossein Toosi, Xiaoxiao Li, Arman Rahmim
+##### **Architecture for Simulating Behavior Mode Changes in Norm-Aware Autonomous Agents**
+2502.09215v1 by Sean Glaze, Daniela Inclezan
 
-This study proposes a new loss function for deep neural networks, L1-weighted
-Dice Focal Loss (L1DFL), that leverages L1 norms for adaptive weighting of
-voxels based on their classification difficulty, towards automated detection
-and segmentation of metastatic prostate cancer lesions in PET/CT scans. We
-obtained 380 PSMA [18-F] DCFPyL PET/CT scans of patients diagnosed with
-biochemical recurrence metastatic prostate cancer. We trained two 3D
-convolutional neural networks, Attention U-Net and SegResNet, and concatenated
-the PET and CT volumes channel-wise as input. The performance of our custom
-loss function was evaluated against the Dice and Dice Focal Loss functions. For
-clinical significance, we considered a detected region of interest (ROI) as a
-true positive if at least the voxel with the maximum standardized uptake value
-falls within the ROI. We assessed the models' performance based on the number
-of lesions in an image, tumour volume, activity, and extent of spread. The
-L1DFL outperformed the comparative loss functions by at least 13% on the test
-set. In addition, the F1 scores of the Dice Loss and the Dice Focal Loss were
-lower than that of L1DFL by at least 6% and 34%, respectively. The Dice Focal
-Loss yielded more false positives, whereas the Dice Loss was more sensitive to
-smaller volumes and struggled to segment larger lesions accurately. They also
-exhibited network-specific variations and yielded declines in segmentation
-accuracy with increased tumour spread. Our results demonstrate the potential of
-L1DFL to yield robust segmentation of metastatic prostate cancer lesions in
-PSMA PET/CT images. The results further highlight potential complexities
-arising from the variations in lesion characteristics that may influence
-automated prostate cancer tumour detection and segmentation. The code is
-publicly available at: https://github.com/ObedDzik/pca_segment.git.
+This paper presents an architecture for simulating the actions of a
+norm-aware intelligent agent whose behavior with respect to norm compliance is
+set, and can later be changed, by a human controller. Updating an agent's
+behavior mode from a norm-abiding to a riskier one may be relevant when the
+agent is involved in time-sensitive rescue operations, for example. We base our
+work on the Authorization and Obligation Policy Language AOPL designed by
+Gelfond and Lobo for the specification of norms. We introduce an architecture
+and a prototype software system that can be used to simulate an agent's plans
+under different behavior modes that can later be changed by the controller. We
+envision such software to be useful to policy makers, as they can more readily
+understand how agents may act in certain situations based on the agents'
+attitudes towards norm-compliance. Policy makers may then refine their policies
+if simulations show unwanted consequences.
 
-摘要：<paragraph>本研究針對深度神經網路提出一個新的損失函數，L1 加權 Dice 焦點損失 (L1DFL)，它利用 L1 範數根據體素的分類難度進行自適應加權，用於自動偵測和分割 PET/CT 掃描中轉移性前列腺癌病灶。我們取得 380 個經診斷為生化復發轉移性前列腺癌的患者的 PSMA [18-F] DCFPyL PET/CT 掃描。我們訓練了兩個 3D 捲積神經網路，Attention U-Net 和 SegResNet，並將 PET 和 CT 體積按通道連接作為輸入。我們自訂的損失函數的效能與 Dice 和 Dice 焦點損失函數進行評估。為了臨床意義，我們將一個偵測到的感興趣區域 (ROI) 視為真陽性，如果至少具有最大標準攝取值的體素落在 ROI 內。我們根據影像中的病灶數量、腫瘤體積、活性，以及擴散程度評估模型的效能。L1DFL 在測試組中至少比比較損失函數高出 13%。此外，Dice 損失和 Dice 焦點損失的 F1 分數分別比 L1DFL 低至少 6% 和 34%。Dice 焦點損失產生更多假陽性，而 Dice 損失對較小體積較為敏感，且難以準確分割較大病灶。它們也展現出網路特定的變化，並隨著腫瘤擴散而導致分割準確度下降。我們的結果證明 L1DFL 具有在 PSMA PET/CT 影像中產生轉移性前列腺癌病灶的強健分割的潛力。結果進一步強調由病灶特徵變化所產生的潛在複雜性，這可能會影響自動化前列腺癌腫瘤偵測和分割。程式碼公開於：https://github.com/ObedDzik/pca_segment.git。</paragraph>
+摘要：本文提出了一個架構，用於模擬一個規範感知智能代理的行為，其行為遵守規範，並可以由人類控制者設定，並可以在稍後進行更改。當代理參與時間敏感的救援行動時，將代理的行為模式從遵守規範更新為更冒險的行為模式可能是相關的。我們的工作基於 Gelfond 和 Lobo 為規範規範設計的授權和義務政策語言 AOPL。我們引入了一個架構和一個原型軟體系統，可用於模擬代理在不同行為模式下的計畫，這些行為模式稍後可以由控制者更改。我們預計此類軟體對政策制定者很有用，因為他們可以更容易地根據代理對規範遵守的態度了解代理在特定情況下的行為方式。如果模擬顯示出不希望的後果，政策制定者可以修改他們的政策。
 
-##### **Diffusion Instruction Tuning**
-2502.06814v1 by Chen Jin, Ryutaro Tanno, Amrutha Saseendran, Tom Diethe, Philip Teare
+##### **Neuro-Symbolic Contrastive Learning for Cross-domain Inference**
+2502.09213v1 by Mingyue Liu, Ryo Ueda, Zhen Wan, Katsumi Inoue, Chris G. Willcocks
 
-We introduce Lavender, a simple supervised fine-tuning (SFT) method that
-boosts the performance of advanced vision-language models (VLMs) by leveraging
-state-of-the-art image generation models such as Stable Diffusion.
-Specifically, Lavender aligns the text-vision attention in the VLM transformer
-with the equivalent used by Stable Diffusion during SFT, instead of adapting
-separate encoders. This alignment enriches the model's visual understanding and
-significantly boosts performance across in- and out-of-distribution tasks.
-Lavender requires just 0.13 million training examples, 2.5% of typical
-large-scale SFT datasets, and fine-tunes on standard hardware (8 GPUs) in a
-single day. It consistently improves state-of-the-art open-source multimodal
-LLMs (e.g., Llama-3.2-11B, MiniCPM-Llama3-v2.5), achieving up to 30% gains and
-a 68% boost on challenging out-of-distribution medical QA tasks. By efficiently
-transferring the visual expertise of image generators with minimal supervision,
-Lavender offers a scalable solution for more accurate vision-language systems.
-All code, training data, and models will be shared at
-https://astrazeneca.github.io/vlm/.
+Pre-trained language models (PLMs) have made significant advances in natural
+language inference (NLI) tasks, however their sensitivity to textual
+perturbations and dependence on large datasets indicate an over-reliance on
+shallow heuristics. In contrast, inductive logic programming (ILP) excels at
+inferring logical relationships across diverse, sparse and limited datasets,
+but its discrete nature requires the inputs to be precisely specified, which
+limits their application. This paper proposes a bridge between the two
+approaches: neuro-symbolic contrastive learning. This allows for smooth and
+differentiable optimisation that improves logical accuracy across an otherwise
+discrete, noisy, and sparse topological space of logical functions. We show
+that abstract logical relationships can be effectively embedded within a
+neuro-symbolic paradigm, by representing data as logic programs and sets of
+logic rules. The embedding space captures highly varied textual information
+with similar semantic logical relations, but can also separate similar textual
+relations that have dissimilar logical relations. Experimental results
+demonstrate that our approach significantly improves the inference capabilities
+of the models in terms of generalisation and reasoning.
 
-摘要：<paragraph>我們介紹 Lavender，一種簡單的監督微調 (SFT) 方法，它透過利用 Stable Diffusion 等最先進的影像生成模型來提升先進視覺語言模型 (VLM) 的效能。
-具體來說，Lavender 在 SFT 期間將 VLM 轉換器中的文字視覺注意力與 Stable Diffusion 使用的等效注意力對齊，而不是調整單獨的編碼器。此對齊豐富了模型的視覺理解，並顯著提升了分佈內外任務的效能。
-Lavender 只需要 0.13 百萬個訓練範例，相當於典型大型 SFT 資料集的 2.5%，並在標準硬體 (8 個 GPU) 上於一天內進行微調。它持續改善最先進的開放原始碼多模態 LLM（例如 Llama-3.2-11B、MiniCPM-Llama3-v2.5），在具有挑戰性的分佈外醫療 QA 任務中獲得高達 30% 的收益和 68% 的提升。透過有效轉移影像生成器的視覺專業知識，並僅需最少的監督，Lavender 提供了一個可擴充的解決方案，以實現更準確的視覺語言系統。
-所有程式碼、訓練資料和模型將在 https://astrazeneca.github.io/vlm/ 分享。</paragraph>
+摘要：預訓練語言模型 (PLM) 在自然語言推理 (NLI) 任務中取得了重大進展，然而它們對文本擾動的敏感性和對大型資料集的依賴性表明過度依賴於淺層啟發法。相比之下，歸納邏輯規劃 (ILP) 擅長推論跨越多樣化、稀疏和有限資料集的邏輯關係，但其離散性質要求輸入被精確指定，這限制了它們的應用。本文提出了兩種方法之間的橋樑：神經符號對比學習。這允許平滑且可微分的優化，從而提高邏輯函數的離散、嘈雜和稀疏拓撲空間中的邏輯準確性。我們展示了抽象邏輯關係可以通過將資料表示為邏輯程式和邏輯規則集，有效地嵌入到神經符號範例中。嵌入空間捕獲具有相似語義邏輯關係的高度多變的文本資訊，但也可以分離具有不同邏輯關係的相似文本關係。實驗結果表明，我們的做法在泛化和推理方面顯著提高了模型的推理能力。
 
-##### **MedRAX: Medical Reasoning Agent for Chest X-ray**
-2502.02673v1 by Adibvafa Fallahpour, Jun Ma, Alif Munim, Hongwei Lyu, Bo Wang
+##### **LP-LM: No Hallucinations in Question Answering with Logic Programming**
+2502.09212v1 by Katherine Wu, Yanhong A. Liu
 
-Chest X-rays (CXRs) play an integral role in driving critical decisions in
-disease management and patient care. While recent innovations have led to
-specialized models for various CXR interpretation tasks, these solutions often
-operate in isolation, limiting their practical utility in clinical practice. We
-present MedRAX, the first versatile AI agent that seamlessly integrates
-state-of-the-art CXR analysis tools and multimodal large language models into a
-unified framework. MedRAX dynamically leverages these models to address complex
-medical queries without requiring additional training. To rigorously evaluate
-its capabilities, we introduce ChestAgentBench, a comprehensive benchmark
-containing 2,500 complex medical queries across 7 diverse categories. Our
-experiments demonstrate that MedRAX achieves state-of-the-art performance
-compared to both open-source and proprietary models, representing a significant
-step toward the practical deployment of automated CXR interpretation systems.
-Data and code have been publicly available at
-https://github.com/bowang-lab/MedRAX
+Large language models (LLMs) are able to generate human-like responses to
+user queries. However, LLMs exhibit inherent limitations, especially because
+they hallucinate. This paper introduces LP-LM, a system that grounds answers to
+questions in known facts contained in a knowledge base (KB), facilitated
+through semantic parsing in Prolog, and always produces answers that are
+reliable.
+  LP-LM generates a most probable constituency parse tree along with a
+corresponding Prolog term for an input question via Prolog definite clause
+grammar (DCG) parsing. The term is then executed against a KB of natural
+language sentences also represented as Prolog terms for question answering. By
+leveraging DCG and tabling, LP-LM runs in linear time in the size of input
+sentences for sufficiently many grammar rules. Performing experiments comparing
+LP-LM with current well-known LLMs in accuracy, we show that LLMs hallucinate
+on even simple questions, unlike LP-LM.
 
-摘要：胸部 X 光片 (CXR) 在疾病管理和患者照護中扮演著不可或缺的角色，推動著關鍵決策的制定。儘管近期的創新已針對各種 CXR 解讀任務開發出專門的模型，但這些解決方案通常獨立運作，限制了它們在臨床實務中的實際效用。我們提出 MedRAX，這是一款首創的多功能 AI 代理，它將最先進的 CXR 分析工具和多模態大型語言模型無縫整合到一個統一的架構中。MedRAX 動態運用這些模型來解決複雜的醫療查詢，而無需額外的訓練。為了嚴格評估其功能，我們引入了 ChestAgentBench，這是一個全面的基準，包含 7 個不同類別的 2,500 個複雜醫療查詢。我們的實驗證明，與開源和專有模型相比，MedRAX 達到了最先進的效能，這代表了自動化 CXR 解讀系統實際部署的重要一步。資料和程式碼已公開於 https://github.com/bowang-lab/MedRAX
+摘要：大型語言模型 (LLM) 能產生類似人類的回應來回答使用者的問題。然而，LLM 顯示出內在的限制，特別是因為它們會產生幻覺。本文介紹 LP-LM，一個系統，它將問題的答案建立在知識庫 (KB) 中已知的事實上，透過 Prolog 中的語義解析來促進，並始終產生可靠的答案。
+LP-LM 透過 Prolog 明確條款語法 (DCG) 解析產生一個最可能的成分解析樹，以及輸入問題對應的 Prolog 詞彙。然後，針對一個自然語言句子的 KB 執行該詞彙，也表示為 Prolog 詞彙，以進行問題解答。透過利用 DCG 和 tabling，LP-LM 在輸入句子的大小上以線性時間執行，對於足夠多的語法規則。執行實驗比較 LP-LM 與目前眾所周知的 LLM 在準確性上，我們顯示出 LLM 甚至會對簡單的問題產生幻覺，這與 LP-LM 不同。
 
-##### **Open Foundation Models in Healthcare: Challenges, Paradoxes, and Opportunities with GenAI Driven Personalized Prescription**
-2502.04356v1 by Mahdi Alkaeed, Sofiat Abioye, Adnan Qayyum, Yosra Magdi Mekki, Ilhem Berrou, Mohamad Abdallah, Ala Al-Fuqaha, Muhammad Bilal, Junaid Qadir
+##### **Visual Graph Question Answering with ASP and LLMs for Language Parsing**
+2502.09211v1 by Jakob Johannes Bauer, Thomas Eiter, Nelson Higuera Ruiz, Johannes Oetsch
 
-In response to the success of proprietary Large Language Models (LLMs) such
-as OpenAI's GPT-4, there is a growing interest in developing open,
-non-proprietary LLMs and AI foundation models (AIFMs) for transparent use in
-academic, scientific, and non-commercial applications. Despite their inability
-to match the refined functionalities of their proprietary counterparts, open
-models hold immense potential to revolutionize healthcare applications. In this
-paper, we examine the prospects of open-source LLMs and AIFMs for developing
-healthcare applications and make two key contributions. Firstly, we present a
-comprehensive survey of the current state-of-the-art open-source healthcare
-LLMs and AIFMs and introduce a taxonomy of these open AIFMs, categorizing their
-utility across various healthcare tasks. Secondly, to evaluate the
-general-purpose applications of open LLMs in healthcare, we present a case
-study on personalized prescriptions. This task is particularly significant due
-to its critical role in delivering tailored, patient-specific medications that
-can greatly improve treatment outcomes. In addition, we compare the performance
-of open-source models with proprietary models in settings with and without
-Retrieval-Augmented Generation (RAG). Our findings suggest that, although less
-refined, open LLMs can achieve performance comparable to proprietary models
-when paired with grounding techniques such as RAG. Furthermore, to highlight
-the clinical significance of LLMs-empowered personalized prescriptions, we
-perform subjective assessment through an expert clinician. We also elaborate on
-ethical considerations and potential risks associated with the misuse of
-powerful LLMs and AIFMs, highlighting the need for a cautious and responsible
-implementation in healthcare.
+Visual Question Answering (VQA) is a challenging problem that requires to
+process multimodal input. Answer-Set Programming (ASP) has shown great
+potential in this regard to add interpretability and explainability to modular
+VQA architectures. In this work, we address the problem of how to integrate ASP
+with modules for vision and natural language processing to solve a new and
+demanding VQA variant that is concerned with images of graphs (not graphs in
+symbolic form). Images containing graph-based structures are an ubiquitous and
+popular form of visualisation. Here, we deal with the particular problem of
+graphs inspired by transit networks, and we introduce a novel dataset that
+amends an existing one by adding images of graphs that resemble metro lines.
+Our modular neuro-symbolic approach combines optical graph recognition for
+graph parsing, a pretrained optical character recognition neural network for
+parsing labels, Large Language Models (LLMs) for language processing, and ASP
+for reasoning. This method serves as a first baseline and achieves an overall
+average accuracy of 73% on the dataset. Our evaluation provides further
+evidence of the potential of modular neuro-symbolic systems, in particular with
+pretrained models that do not involve any further training and logic
+programming for reasoning, to solve complex VQA tasks.
 
-摘要：<paragraph>為了回應 OpenAI 的 GPT-4 等專有大型語言模型 (LLM) 的成功，開發開放、非專有的 LLM 和人工智慧基礎模型 (AIFM) 以透明地用於學術、科學和非商業應用中，引起了越來越大的興趣。儘管無法與其專有對應產品的精緻功能相匹配，但開放模型在革新醫療保健應用方面具有巨大的潛力。在本文中，我們探討了開放原始碼 LLM 和 AIFM 在開發醫療保健應用方面的前景，並提出了兩項關鍵貢獻。首先，我們對當前最先進的開放原始碼醫療保健 LLM 和 AIFM 進行了全面的調查，並介紹了這些開放 AIFM 的分類法，對它們在各種醫療保健任務中的效用進行了分類。其次，為了評估開放 LLM 在醫療保健中的通用應用，我們對個人化處方進行了案例研究。這項任務特別重要，因為它在提供量身定制的患者特定藥物方面發揮著關鍵作用，可以大大改善治療效果。此外，我們比較了開放原始碼模型與專有模型在有和沒有檢索增強生成 (RAG) 的設置中的性能。我們的研究結果表明，儘管不太精緻，但開放 LLM 在與 RAG 等基礎技術配對時，可以實現與專有模型相當的性能。此外，為了強調 LLM 賦能的個性化處方的臨床意義，我們通過專家臨床醫生進行了主觀評估。我們還詳細說明了與濫用強大的 LLM 和 AIFM 相關的倫理考量和潛在風險，強調了在醫療保健中謹慎和負責任地實施的必要性。</paragraph>
+摘要：視覺問答（VQA）是一項具有挑戰性的問題，需要處理多模態輸入。答案集程式設計（ASP）在這方面顯示出巨大的潛力，可以為模組化 VQA 架構增加可解釋性和說明性。在這項工作中，我們探討如何將 ASP 與視覺和自然語言處理模組整合，以解決一個新的且要求嚴格的 VQA 變體，該變體與圖形影像（而非符號形式的圖形）有關。包含圖形結構的影像是一種普遍且流行的可視化形式。在這裡，我們處理受交通網路啟發的圖形特定問題，並引入一個新的資料集，透過新增類似地鐵路線的圖形影像來修正現有資料集。我們的模組化神經符號方法結合光學圖形辨識進行圖形解析、預先訓練的光學字元辨識神經網路進行標籤解析、大型語言模型（LLM）進行語言處理，以及 ASP 進行推理。此方法作為第一個基準，在資料集上達到 73% 的整體平均準確度。我們的評估進一步證明了模組化神經符號系統的潛力，特別是預先訓練的模型，這些模型不涉及任何進一步的訓練和邏輯程式設計進行推理，以解決複雜的 VQA 任務。
 
-##### **Decision Theoretic Foundations for Conformal Prediction: Optimal Uncertainty Quantification for Risk-Averse Agents**
-2502.02561v1 by Shayan Kiyani, George Pappas, Aaron Roth, Hamed Hassani
+##### **On LLM-generated Logic Programs and their Inference Execution Methods**
+2502.09209v1 by Paul Tarau
 
-A fundamental question in data-driven decision making is how to quantify the
-uncertainty of predictions in ways that can usefully inform downstream action.
-This interface between prediction uncertainty and decision-making is especially
-important in risk-sensitive domains, such as medicine. In this paper, we
-develop decision-theoretic foundations that connect uncertainty quantification
-using prediction sets with risk-averse decision-making. Specifically, we answer
-three fundamental questions: (1) What is the correct notion of uncertainty
-quantification for risk-averse decision makers? We prove that prediction sets
-are optimal for decision makers who wish to optimize their value at risk. (2)
-What is the optimal policy that a risk averse decision maker should use to map
-prediction sets to actions? We show that a simple max-min decision policy is
-optimal for risk-averse decision makers. Finally, (3) How can we derive
-prediction sets that are optimal for such decision makers? We provide an exact
-characterization in the population regime and a distribution free finite-sample
-construction. Answering these questions naturally leads to an algorithm,
-Risk-Averse Calibration (RAC), which follows a provably optimal design for
-deriving action policies from predictions. RAC is designed to be both
-practical-capable of leveraging the quality of predictions in a black-box
-manner to enhance downstream utility-and safe-adhering to a user-defined risk
-threshold and optimizing the corresponding risk quantile of the user's
-downstream utility. Finally, we experimentally demonstrate the significant
-advantages of RAC in applications such as medical diagnosis and recommendation
-systems. Specifically, we show that RAC achieves a substantially improved
-trade-off between safety and utility, offering higher utility compared to
-existing methods while maintaining the safety guarantee.
+Large Language Models (LLMs) trained on petabytes of data are highly
+compressed repositories of a significant proportion of the knowledge
+accumulated and distilled so far. In this paper we study techniques to elicit
+this knowledge in the form of several classes of logic programs, including
+propositional Horn clauses, Dual Horn clauses, relational triplets and Definite
+Clause Grammars. Exposing this knowledge as logic programs enables sound
+reasoning methods that can verify alignment of LLM outputs to their intended
+uses and extend their inference capabilities. We study new execution methods
+for the generated programs, including soft-unification of abducible facts
+against LLM-generated content stored in a vector database as well as GPU-based
+acceleration of minimal model computation that supports inference with large
+LLM-generated programs.
 
-摘要：<paragraph>在資料驅動決策中，一個基本問題是，如何量化預測的不確定性，以能有用地告知下游行動。
-預測不確定性和決策制定之間的這種介面，在風險敏感領域中特別重要，例如醫學。在本文中，我們
-發展了決策理論基礎，它利用預測集合將不確定性量化與風險規避決策制定聯繫起來。具體來說，我們回答
-了三個基本問題：(1) 對於風險規避決策者來說，不確定性量化的正確概念是什麼？我們證明，對於希望最佳化其風險價值的決策者來說，預測集合是最佳的。(2)
-風險規避決策者應使用什麼最佳政策，將預測集合映射到行動？我們表明，對於風險規避決策者來說，一個簡單的最大最小決策政策是最佳的。最後，(3) 我們如何推導出對此類決策者來說最佳的預測集合？我們在總體範圍內提供了一個確切的表徵，並提供了一個不依賴分佈的有限樣本建構。回答這些問題自然會導致一個演算法，風險規避校準 (RAC)，它遵循一個可證明最佳的設計，從預測中推導出行動政策。RAC 被設計為既實用——能夠以黑盒方式利用預測的品質來增強下游效用——又安全——遵守使用者定義的風險閾值，並最佳化使用者的下游效用的對應風險分位數。最後，我們在醫學診斷和推薦系統等應用中，以實驗方式證明了 RAC 的顯著優點。具體來說，我們表明，與現有方法相比，RAC 在安全性和效用之間實現了顯著改善的折衷，在維持安全保證的同時，提供了更高的效用。</paragraph>
+摘要：大型語言模型 (LLM) 在數位位元組的資料上受過訓練，是目前為止累積和提煉的知識中，高度濃縮的儲存庫。在本文中，我們研究了以數種邏輯程式類別的形式引出這些知識的技術，包括命題霍恩子句、雙重霍恩子句、關聯三元組和確定子句文法。將這些知識作為邏輯程式揭露，能啟用健全的推理方法，驗證 LLM 輸出的對齊方式，符合其預期的用途，並擴展其推論能力。我們研究了產生程式的新執行方法，包括對儲存在向量資料庫中的 LLM 產生內容，進行可約簡事實的軟統一，以及支援使用大型 LLM 產生程式進行推論的，基於 GPU 的最小模型計算加速。
 
-##### **CoRPA: Adversarial Image Generation for Chest X-rays Using Concept Vector Perturbations and Generative Models**
-2502.05214v1 by Amy Rafferty, Rishi Ramaesh, Ajitha Rajan
+##### **Efficient OWL2QL Meta-reasoning Using ASP-based Hybrid Knowledge Bases**
+2502.09206v1 by Haya Majid Qureshi, Wolfgang Faber
 
-Deep learning models for medical image classification tasks are becoming
-widely implemented in AI-assisted diagnostic tools, aiming to enhance
-diagnostic accuracy, reduce clinician workloads, and improve patient outcomes.
-However, their vulnerability to adversarial attacks poses significant risks to
-patient safety. Current attack methodologies use general techniques such as
-model querying or pixel value perturbations to generate adversarial examples
-designed to fool a model. These approaches may not adequately address the
-unique characteristics of clinical errors stemming from missed or incorrectly
-identified clinical features. We propose the Concept-based Report Perturbation
-Attack (CoRPA), a clinically-focused black-box adversarial attack framework
-tailored to the medical imaging domain. CoRPA leverages clinical concepts to
-generate adversarial radiological reports and images that closely mirror
-realistic clinical misdiagnosis scenarios. We demonstrate the utility of CoRPA
-using the MIMIC-CXR-JPG dataset of chest X-rays and radiological reports. Our
-evaluation reveals that deep learning models exhibiting strong resilience to
-conventional adversarial attacks are significantly less robust when subjected
-to CoRPA's clinically-focused perturbations. This underscores the importance of
-addressing domain-specific vulnerabilities in medical AI systems. By
-introducing a specialized adversarial attack framework, this study provides a
-foundation for developing robust, real-world-ready AI models in healthcare,
-ensuring their safe and reliable deployment in high-stakes clinical
-environments.
+Metamodeling refers to scenarios in ontologies in which classes and roles can
+be members of classes or occur in roles. This is a desirable modelling feature
+in several applications, but allowing it without restrictions is problematic
+for several reasons, mainly because it causes undecidability. Therefore,
+practical languages either forbid metamodeling explicitly or treat occurrences
+of classes as instances to be semantically different from other occurrences,
+thereby not allowing metamodeling semantically. Several extensions have been
+proposed to provide metamodeling to some extent. Building on earlier work that
+reduces metamodeling query answering to Datalog query answering, recently
+reductions to query answering over hybrid knowledge bases were proposed with
+the aim of using the Datalog transformation only where necessary. Preliminary
+work showed that the approach works, but the hoped-for performance improvements
+were not observed yet. In this work we expand on this body of work by improving
+the theoretical basis of the reductions and by using alternative tools that
+show competitive performance.
 
-摘要：深度学习模型用于医学影像分类任务，在人工智能辅助诊断工具中得到广泛应用，旨在提高诊断准确性、减少临床医生的工作量并改善患者的治疗效果。然而，它们对对抗性攻击的脆弱性给患者安全带来了重大风险。目前的攻击方法使用通用技术，例如模型查询或像素值扰动来生成对抗性示例，旨在欺骗模型。这些方法可能无法充分解决源自遗漏或错误识别的临床特征的临床错误的独特特征。我们提出了基于概念的报告扰动攻击 (CoRPA)，这是一种以临床为中心的、针对医学成像领域的、黑盒对抗性攻击框架。CoRPA 利用临床概念来生成对抗性放射学报告和图像，这些报告和图像与现实的临床误诊场景非常相似。我们使用胸部 X 射线和放射学报告的 MIMIC-CXR-JPG 数据集演示了 CoRPA 的效用。我们的评估表明，对传统对抗性攻击表现出强大弹性的深度学习模型在受到 CoRPA 以临床为中心的扰动时，其鲁棒性明显降低。这强调了在医疗人工智能系统中解决特定领域漏洞的重要性。通过引入专门的对抗性攻击框架，本研究为在医疗保健领域开发健壮、面向现实世界的 AI 模型奠定了基础，确保它们在高风险临床环境中安全可靠地部署。
+摘要：元建模是指本体中的場景，其中類別和角色可以是類別成員或出現在角色中。這是一個在多個應用中理想的建模功能，但允許它不受限制會因多個原因而產生問題，主要是因為它會導致無法決定。因此，實用的語言會明確禁止元建模，或將類別的出現視為與其他出現語義不同的實例，從而語義上不允許元建模。已經提出多個擴充功能，在一定程度上提供元建模。建立在將元建模查詢回答簡化為 Datalog 查詢回答的早期工作之上，最近提出了將查詢回答簡化為混合知識庫的簡化，目的是僅在必要時使用 Datalog 轉換。初步工作顯示該方法有效，但尚未觀察到預期的效能改善。在這項工作中，我們透過改善簡化的理論基礎和使用表現競爭力的替代工具，擴展了這項工作。
 
-##### **A Self-Supervised Framework for Improved Generalisability in Ultrasound B-mode Image Segmentation**
-2502.02489v1 by Edward Ellis, Andrew Bulpitt, Nasim Parsa, Michael F Byrne, Sharib Ali
+##### **Counterfactual Explanations as Plans**
+2502.09205v1 by Vaishak Belle
 
-Ultrasound (US) imaging is clinically invaluable due to its noninvasive and
-safe nature. However, interpreting US images is challenging, requires
-significant expertise, and time, and is often prone to errors. Deep learning
-offers assistive solutions such as segmentation. Supervised methods rely on
-large, high-quality, and consistently labeled datasets, which are challenging
-to curate. Moreover, these methods tend to underperform on out-of-distribution
-data, limiting their clinical utility. Self-supervised learning (SSL) has
-emerged as a promising alternative, leveraging unlabeled data to enhance model
-performance and generalisability. We introduce a contrastive SSL approach
-tailored for B-mode US images, incorporating a novel Relation Contrastive Loss
-(RCL). RCL encourages learning of distinct features by differentiating positive
-and negative sample pairs through a learnable metric. Additionally, we propose
-spatial and frequency-based augmentation strategies for the representation
-learning on US images. Our approach significantly outperforms traditional
-supervised segmentation methods across three public breast US datasets,
-particularly in data-limited scenarios. Notable improvements on the Dice
-similarity metric include a 4% increase on 20% and 50% of the BUSI dataset,
-nearly 6% and 9% improvements on 20% and 50% of the BrEaST dataset, and 6.4%
-and 3.7% improvements on 20% and 50% of the UDIAT dataset, respectively.
-Furthermore, we demonstrate superior generalisability on the
-out-of-distribution UDIAT dataset with performance boosts of 20.6% and 13.6%
-compared to the supervised baseline using 20% and 50% of the BUSI and BrEaST
-training data, respectively. Our research highlights that domain-inspired SSL
-can improve US segmentation, especially under data-limited conditions.
+There has been considerable recent interest in explainability in AI,
+especially with black-box machine learning models. As correctly observed by the
+planning community, when the application at hand is not a single-shot decision
+or prediction, but a sequence of actions that depend on observations, a richer
+notion of explanations are desirable.
+  In this paper, we look to provide a formal account of ``counterfactual
+explanations," based in terms of action sequences. We then show that this
+naturally leads to an account of model reconciliation, which might take the
+form of the user correcting the agent's model, or suggesting actions to the
+agent's plan. For this, we will need to articulate what is true versus what is
+known, and we appeal to a modal fragment of the situation calculus to formalise
+these intuitions. We consider various settings: the agent knowing partial
+truths, weakened truths and having false beliefs, and show that our definitions
+easily generalize to these different settings.
 
-摘要：超音波 (US) 影像由於其非侵入性且安全的特性，在臨床上極具價值。然而，解讀超音波影像具有挑戰性，需要大量的專業知識和時間，而且經常容易出錯。深度學習提供了輔助解決方案，例如分割。監督式方法依賴於大量、高品質且標籤一致的資料集，而這在策劃上具有挑戰性。此外，這些方法在分佈外資料上的表現往往不佳，這限制了它們的臨床效用。自監督學習 (SSL) 已成為一種有前途的替代方案，它利用未標籤資料來增強模型效能和泛化能力。我們提出了一種對比式 SSL 方法，專門針對 B 模式超音波影像，並納入了新穎的關係對比損失 (RCL)。RCL 透過一個可學習的指標區分正負樣本對，來鼓勵學習不同的特徵。此外，我們提出了用於超音波影像上表徵學習的空間和頻率增強策略。我們的做法在三個公開的乳房超音波資料集上顯著優於傳統的監督式分割方法，特別是在資料有限的情況下。在 Dice 相似性指標上的顯著改進包括在 BUSI 資料集的 20% 和 50% 上增加了 4%，在 BrEaST 資料集的 20% 和 50% 上增加了近 6% 和 9%，以及在 UDIAT 資料集的 20% 和 50% 上分別增加了 6.4% 和 3.7%。此外，我們在分佈外的 UDIAT 資料集上展示了卓越的泛化能力，與使用 BUSI 和 BrEaST 訓練資料的 20% 和 50% 的監督式基準相比，效能分別提升了 20.6% 和 13.6%。我們的研究強調，領域啟發的 SSL 可以改善超音波分割，特別是在資料有限的條件下。
+摘要：最近在人工智能中對於可解釋性產生了相當大的興趣，
+特別是對於黑盒機器學習模型。正如規劃社群正確觀察到的，當手邊的應用程式不是單次決策或預測，而是一連串依賴於觀察的動作時，一個更豐富的解釋概念是可取的。
+在本文中，我們著眼於提供「反事實解釋」的一個正式說明，以動作序列為基礎。然後我們展示這自然會導致一個模型調和說明，其形式可能是使用者修正代理人的模型，或建議代理人的計畫採取行動。為此，我們需要說明什麼是真實的，什麼是已知的，我們訴諸情境演算的一個模態片段來形式化這些直覺。我們考慮各種設定：代理人知道部分真實、虛弱真實和擁有錯誤信念，並展示我們的定義輕鬆地概括到這些不同的設定。
 
-##### **Medical Multimodal Model Stealing Attacks via Adversarial Domain Alignment**
-2502.02438v1 by Yaling Shen, Zhixiong Zhuang, Kun Yuan, Maria-Irina Nicolae, Nassir Navab, Nicolas Padoy, Mario Fritz
+##### **Logical Lease Litigation: Prolog and LLMs for Rental Law Compliance in New York**
+2502.09204v1 by Sanskar Sehgal, Yanhong A. Liu
 
-Medical multimodal large language models (MLLMs) are becoming an instrumental
-part of healthcare systems, assisting medical personnel with decision making
-and results analysis. Models for radiology report generation are able to
-interpret medical imagery, thus reducing the workload of radiologists. As
-medical data is scarce and protected by privacy regulations, medical MLLMs
-represent valuable intellectual property. However, these assets are potentially
-vulnerable to model stealing, where attackers aim to replicate their
-functionality via black-box access. So far, model stealing for the medical
-domain has focused on classification; however, existing attacks are not
-effective against MLLMs. In this paper, we introduce Adversarial Domain
-Alignment (ADA-STEAL), the first stealing attack against medical MLLMs.
-ADA-STEAL relies on natural images, which are public and widely available, as
-opposed to their medical counterparts. We show that data augmentation with
-adversarial noise is sufficient to overcome the data distribution gap between
-natural images and the domain-specific distribution of the victim MLLM.
-Experiments on the IU X-RAY and MIMIC-CXR radiology datasets demonstrate that
-Adversarial Domain Alignment enables attackers to steal the medical MLLM
-without any access to medical data.
+Legal cases require careful logical reasoning following the laws, whereas
+interactions with non- technical users must be in natural language. As an
+application combining logical reasoning using Prolog and natural language
+processing using large language models (LLMs), this paper presents a novel
+approach and system, LogicLease, to automate the analysis of landlord-tenant
+legal cases in the state of New York. LogicLease determines compliance with
+relevant legal requirements by analyzing case descriptions and citing all
+relevant laws. It leverages LLMs for information extraction and Prolog for
+legal reasoning. By separating information extraction from legal reasoning,
+LogicLease achieves greater transparency and control over the legal logic
+applied to each case. We evaluate the accuracy, efficiency, and robustness of
+LogicLease through a series of tests, achieving 100% accuracy and an average
+processing time of 2.57 seconds. LogicLease presents advantages over
+state-of-the-art LLM- based legal analysis systems by providing clear,
+step-by-step reasoning, citing specific laws, and distinguishing itself by its
+ability to avoid hallucinations - a common issue in LLMs.
 
-摘要：醫療多模態大型語言模型 (MLLM) 正在成為醫療保健系統中不可或缺的一部分，協助醫療人員進行決策和結果分析。放射報告生成的模型能夠解釋醫學影像，從而減輕放射科醫師的工作負擔。由於醫療資料稀少且受隱私法規保護，醫療 MLLM 代表了有價值的智慧財產。然而，這些資產潛在地容易受到模型竊取的攻擊，攻擊者旨在透過黑盒存取來複製其功能。到目前為止，針對醫療領域的模型竊取一直專注於分類；然而，現有的攻擊對 MLLM 沒有效。在本文中，我們介紹了對抗域對齊 (ADA-STEAL)，這是針對醫療 MLLM 的第一個竊取攻擊。與醫療對應物相反，ADA-STEAL 依賴於公開且廣泛可用的自然影像。我們表明，對抗雜訊的資料擴充足以克服自然影像與受害者 MLLM 的特定領域分佈之間的資料分佈差距。在 IU X-RAY 和 MIMIC-CXR 放射學資料集上進行的實驗表明，對抗域對齊使攻擊者能夠在不存取任何醫療資料的情況下竊取醫療 MLLM。
+摘要：法律案件需要遵循法律进行谨慎的逻辑推理，而与非技术用户的互动必须使用自然语言。作为结合使用 Prolog 进行逻辑推理和使用大型语言模型 (LLM) 进行自然语言处理的应用程序，本文提出了一种新颖的方法和系统 LogicLease，以自动分析纽约州的房东与租户法律案件。LogicLease 通过分析案例描述并引用所有相关法律来确定是否符合相关法律要求。它利用 LLM 进行信息提取，并利用 Prolog 进行法律推理。通过将信息提取与法律推理分开，LogicLease 实现了对应用于每个案例的法律逻辑的更高透明度和控制力。我们通过一系列测试评估了 LogicLease 的准确性、效率和鲁棒性，实现了 100% 的准确性和 2.57 秒的平均处理时间。LogicLease 通过提供清晰、分步的推理，引用具体法律，并以其避免幻觉的能力而区别于最先进的基于 LLM 的法律分析系统，从而显示出优势——这是 LLM 中的常见问题。
 
-##### **Test Time Training for 4D Medical Image Interpolation**
-2502.02341v1 by Qikang Zhang, Yingjie Lei, Zihao Zheng, Ziyang Chen, Zhonghao Xie
+##### **Thinking beyond the anthropomorphic paradigm benefits LLM research**
+2502.09192v1 by Lujain Ibrahim, Myra Cheng
 
-4D medical image interpolation is essential for improving temporal resolution
-and diagnostic precision in clinical applications. Previous works ignore the
-problem of distribution shifts, resulting in poor generalization under
-different distribution. A natural solution would be to adapt the model to a new
-test distribution, but this cannot be done if the test input comes without a
-ground truth label. In this paper, we propose a novel test time training
-framework which uses self-supervision to adapt the model to a new distribution
-without requiring any labels. Indeed, before performing frame interpolation on
-each test video, the model is trained on the same instance using a
-self-supervised task, such as rotation prediction or image reconstruction. We
-conduct experiments on two publicly available 4D medical image interpolation
-datasets, Cardiac and 4D-Lung. The experimental results show that the proposed
-method achieves significant performance across various evaluation metrics on
-both datasets. It achieves higher peak signal-to-noise ratio values, 33.73dB on
-Cardiac and 34.02dB on 4D-Lung. Our method not only advances 4D medical image
-interpolation but also provides a template for domain adaptation in other
-fields such as image segmentation and image registration.
+Anthropomorphism, or the attribution of human traits to technology, is an
+automatic and unconscious response that occurs even in those with advanced
+technical expertise. In this position paper, we analyze hundreds of thousands
+of computer science research articles from the past decade and present
+empirical evidence of the prevalence and growth of anthropomorphic terminology
+in research on large language models (LLMs). This terminology reflects deeper
+anthropomorphic conceptualizations which shape how we think about and conduct
+LLM research. We argue these conceptualizations may be limiting, and that
+challenging them opens up new pathways for understanding and improving LLMs
+beyond human analogies. To illustrate this, we identify and analyze five core
+anthropomorphic assumptions shaping prominent methodologies across the LLM
+development lifecycle, from the assumption that models must use natural
+language for reasoning tasks to the assumption that model capabilities should
+be evaluated through human-centric benchmarks. For each assumption, we
+demonstrate how non-anthropomorphic alternatives can open new directions for
+research and development.
 
-摘要：4D 醫學影像插值對於提升時間解析度及臨床應用中的診斷精準度至關重要。過往的研究忽略了分佈轉移問題，導致在不同分佈下泛化能力不佳。一個自然的解決方案是將模型適應到新的測試分佈，但如果測試輸入沒有真實標籤，就無法做到這一點。在本文中，我們提出了一個新的測試時間訓練架構，它使用自我監督來適應模型到一個新的分佈，而不需要任何標籤。事實上，在對每個測試影片執行幀插值之前，使用自我監督任務（例如旋轉預測或影像重建）在同一個實例上訓練模型。我們在兩個公開的 4D 醫學影像插值資料集（Cardiac 和 4D-Lung）上進行實驗。實驗結果表明，所提出的方法在兩個資料集上的各種評估指標中都取得了顯著的效能。它達到了更高的峰值信噪比值，在 Cardiac 上為 33.73dB，在 4D-Lung 上為 34.02dB。我們的技術不僅推動了 4D 醫學影像插值，還為其他領域（例如影像分割和影像配準）中的領域適應提供了一個範本。
+摘要：擬人化，或將人類特質歸因於科技，是一種自動且無意識的反應，即使是那些擁有進階技術專業知識的人也會發生。在本文中，我們分析了過去十年數十萬篇電腦科學研究文章，並提出實證證據證明擬人化術語在大型語言模型 (LLM) 研究中的普遍性和增長。這些術語反映了更深層的擬人化概念化，塑造了我們思考和進行 LLM 研究的方式。我們認為這些概念化可能是有限制的，並且挑戰它們為超越人類類比來理解和改進 LLM 開闢了新的途徑。為了說明這一點，我們識別並分析了五個核心擬人化假設，這些假設塑造了 LLM 開發生命週期中的顯著方法論，從模型必須使用自然語言進行推理任務的假設到模型能力應該通過以人為中心的基準進行評估的假設。對於每個假設，我們展示了非擬人化替代方案如何為研究和開發打開新方向。
 
-##### **Conversation AI Dialog for Medicare powered by Finetuning and Retrieval Augmented Generation**
-2502.02249v1 by Atharva Mangeshkumar Agrawal, Rutika Pandurang Shinde, Vasanth Kumar Bhukya, Ashmita Chakraborty, Sagar Bharat Shah, Tanmay Shukla, Sree Pradeep Kumar Relangi, Nilesh Mutyam
+##### **Matina: A Large-Scale 73B Token Persian Text Corpus**
+2502.09188v1 by Sara Bourbour Hosseinbeigi, Fatemeh Taherinezhad, Heshaam Faili, Hamed Baghbani, Fatemeh Nadi, Mostafa Amiri
 
-Large language models (LLMs) have shown impressive capabilities in natural
-language processing tasks, including dialogue generation. This research aims to
-conduct a novel comparative analysis of two prominent techniques, fine-tuning
-with LoRA (Low-Rank Adaptation) and the Retrieval-Augmented Generation (RAG)
-framework, in the context of doctor-patient chat conversations with multiple
-datasets of mixed medical domains. The analysis involves three state-of-the-art
-models: Llama-2, GPT, and the LSTM model. Employing real-world doctor-patient
-dialogues, we comprehensively evaluate the performance of models, assessing key
-metrics such as language quality (perplexity, BLEU score), factual accuracy
-(fact-checking against medical knowledge bases), adherence to medical
-guidelines, and overall human judgments (coherence, empathy, safety). The
-findings provide insights into the strengths and limitations of each approach,
-shedding light on their suitability for healthcare applications. Furthermore,
-the research investigates the robustness of the models in handling diverse
-patient queries, ranging from general health inquiries to specific medical
-conditions. The impact of domain-specific knowledge integration is also
-explored, highlighting the potential for enhancing LLM performance through
-targeted data augmentation and retrieval strategies.
+Text corpora are essential for training models used in tasks like
+summarization, translation, and large language models (LLMs). While various
+efforts have been made to collect monolingual and multilingual datasets in many
+languages, Persian has often been underrepresented due to limited resources for
+data collection and preprocessing. Existing Persian datasets are typically
+small and lack content diversity, consisting mainly of weblogs and news
+articles. This shortage of high-quality, varied data has slowed the development
+of NLP models and open-source LLMs for Persian. Since model performance depends
+heavily on the quality of training data, we address this gap by introducing the
+Matina corpus, a new Persian dataset of 72.9B tokens, carefully preprocessed
+and deduplicated to ensure high data quality. We further assess its
+effectiveness by training and evaluating transformer-based models on key NLP
+tasks. Both the dataset and preprocessing codes are publicly available,
+enabling researchers to build on and improve this resource for future Persian
+NLP advancements.
 
-摘要：大型語言模型 (LLM) 在自然語言處理任務中展現了令人印象深刻的能力，包括對話生成。本研究旨在對兩種著名的技術進行新穎的比較分析，即微調 LoRA (低秩適應) 和檢索增強生成 (RAG) 框架，在具有混合醫療領域的多個資料集的醫患聊天對話中。分析涉及三個最先進的模型：Llama-2、GPT 和 LSTM 模型。採用真實世界的醫患對話，我們全面評估模型的性能，評估語言品質（困惑度、BLEU 分數）、事實準確性（對照醫學知識庫進行事實查核）、遵守醫療指南以及整體人類判斷（連貫性、同理心、安全性）等關鍵指標。研究結果深入了解了每種方法的優點和限制，闡明了它們適用於醫療保健應用的適當性。此外，該研究調查了模型在處理多樣化患者查詢時的穩健性，範圍從一般健康詢問到特定醫療狀況。還探討了特定領域知識整合的影響，強調了通過有針對性的資料擴充和檢索策略來增強 LLM 性能的潛力。
+摘要：文字語料庫對於訓練用於摘要、翻譯和大型語言模型 (LLM) 等任務的模型至關重要。儘管已做出各種努力來收集許多語言中的單語和多語言資料集，但由於資料收集和預處理資源有限，波斯語常常代表性不足。現有的波斯語資料集通常很小，而且缺乏內容多樣性，主要由網誌和新聞文章組成。這種優質、多樣化資料的短缺減緩了波斯語的 NLP 模型和開源 LLM 的開發。由於模型效能高度依賴訓練資料的品質，我們透過推出 Matina 語料庫來解決這個差距，Matina 語料庫是一個新的波斯語資料集，包含 72.9B 個字元，經過仔細預處理和去重，以確保資料品質。我們進一步透過在關鍵 NLP 任務上訓練和評估基於轉換器的模型來評估其有效性。資料集和預處理程式碼都是公開的，使研究人員能夠建立和改善這個資源，以促進未來的波斯語 NLP 進展。
 
-##### **Deep Learning-Based Facial Expression Recognition for the Elderly: A Systematic Review**
-2502.02618v1 by F. Xavier Gaya-Morey, Jose M. Buades-Rubio, Philippe Palanque, Raquel Lacuesta, Cristina Manresa-Yee
+##### **RefineCoder: Iterative Improving of Large Language Models via Adaptive Critique Refinement for Code Generation**
+2502.09183v1 by Changzhi Zhou, Xinyu Zhang, Dandan Song, Xiancai Chen, Wanli Gu, Huipeng Ma, Yuhang Tian, Mengdi Zhang, Linmei Hu
 
-The rapid aging of the global population has highlighted the need for
-technologies to support elderly, particularly in healthcare and emotional
-well-being. Facial expression recognition (FER) systems offer a non-invasive
-means of monitoring emotional states, with applications in assisted living,
-mental health support, and personalized care. This study presents a systematic
-review of deep learning-based FER systems, focusing on their applications for
-the elderly population. Following a rigorous methodology, we analyzed 31
-studies published over the last decade, addressing challenges such as the
-scarcity of elderly-specific datasets, class imbalances, and the impact of
-age-related facial expression differences. Our findings show that convolutional
-neural networks remain dominant in FER, and especially lightweight versions for
-resource-constrained environments. However, existing datasets often lack
-diversity in age representation, and real-world deployment remains limited.
-Additionally, privacy concerns and the need for explainable artificial
-intelligence emerged as key barriers to adoption. This review underscores the
-importance of developing age-inclusive datasets, integrating multimodal
-solutions, and adopting XAI techniques to enhance system usability,
-reliability, and trustworthiness. We conclude by offering recommendations for
-future research to bridge the gap between academic progress and real-world
-implementation in elderly care.
+Code generation has attracted increasing attention with the rise of Large
+Language Models (LLMs). Many studies have developed powerful code LLMs by
+synthesizing code-related instruction data and applying supervised fine-tuning.
+However, these methods are limited by teacher model distillation and ignore the
+potential of iterative refinement by self-generated code. In this paper, we
+propose Adaptive Critique Refinement (ACR), which enables the model to refine
+itself by self-generated code and external critique, rather than directly
+imitating the code responses of the teacher model. Concretely, ACR includes a
+composite scoring system with LLM-as-a-Judge to evaluate the quality of code
+responses and a selective critique strategy with LLM-as-a-Critic to critique
+self-generated low-quality code responses. We develop the RefineCoder series by
+iteratively applying ACR, achieving continuous performance improvement on
+multiple code generation benchmarks. Compared to the baselines of the same
+size, our proposed RefineCoder series can achieve comparable or even superior
+performance using less data.
 
-摘要：全球人口快速老龄化突显了对技术的需求，以支持老年人，尤其是在医疗保健和情绪健康方面。面部表情识别 (FER) 系统提供了一种非侵入性的情绪状态监测手段，在辅助生活、心理健康支持和个性化护理中得到应用。本研究对基于深度学习的 FER 系统进行了系统的回顾，重点关注它们在老年人群中的应用。遵循严格的方法，我们分析了在过去十年中发表的 31 项研究，解决了诸如老年人特定数据集的稀缺性、类别不平衡以及与年龄相关的面部表情差异的影响等挑战。我们的研究结果表明，卷积神经网络在 FER 中仍然占主导地位，特别是针对资源受限环境的轻量级版本。然而，现有数据集往往缺乏年龄代表性的多样性，并且现实世界的部署仍然有限。此外，隐私问题和对可解释人工智能的需求已成为采用过程中的主要障碍。本次审查强调了开发包容年龄的数据集、整合多模式解决方案以及采用 XAI 技术以增强系统可用性、可靠性和可信度的重要性。最后，我们提出了未来研究的建议，以弥合学术进展与老年护理中的现实世界实施之间的差距。
+摘要：隨著大型語言模型 (LLM) 的興起，程式碼生成備受關注。許多研究透過綜合與程式碼相關的指令資料並應用監督式微調來開發強大的程式碼 LLM。然而，這些方法受到教師模型蒸餾的限制，且忽略了透過自行產生的程式碼進行反覆改進的潛力。在本文中，我們提出適應性批判改進 (ACR)，它使模型能夠透過自行產生的程式碼和外部批判來改進自身，而不是直接模仿教師模型的程式碼回應。具體來說，ACR 包含一個複合評分系統，其中 LLM 作為評審員來評估程式碼回應的品質，以及一個選擇性批判策略，其中 LLM 作為批判者來批判自行產生的低品質程式碼回應。我們透過反覆套用 ACR 來開發 RefineCoder 系列，在多個程式碼生成基準上實現持續的效能改善。與相同規模的基準相比，我們提出的 RefineCoder 系列可以使用較少資料來實現相當甚至更優異的效能。
 
-##### **Causally-informed Deep Learning towards Explainable and Generalizable Outcomes Prediction in Critical Care**
-2502.02109v1 by Yuxiao Cheng, Xinxin Song, Ziqian Wang, Qin Zhong, Kunlun He, Jinli Suo
+##### **FLAME: Flexible LLM-Assisted Moderation Engine**
+2502.09175v1 by Ivan Bakulin, Ilia Kopanichuk, Iaroslav Bespalov, Nikita Radchenko, Vladimir Shaposhnikov, Dmitry Dylov, Ivan Oseledets
 
-Recent advances in deep learning (DL) have prompted the development of
-high-performing early warning score (EWS) systems, predicting clinical
-deteriorations such as acute kidney injury, acute myocardial infarction, or
-circulatory failure. DL models have proven to be powerful tools for various
-tasks but come with the cost of lacking interpretability and limited
-generalizability, hindering their clinical applications. To develop a practical
-EWS system applicable to various outcomes, we propose causally-informed
-explainable early prediction model, which leverages causal discovery to
-identify the underlying causal relationships of prediction and thus owns two
-unique advantages: demonstrating the explicit interpretation of the prediction
-while exhibiting decent performance when applied to unfamiliar environments.
-Benefiting from these features, our approach achieves superior accuracy for 6
-different critical deteriorations and achieves better generalizability across
-different patient groups, compared to various baseline algorithms. Besides, we
-provide explicit causal pathways to serve as references for assistant clinical
-diagnosis and potential interventions. The proposed approach enhances the
-practical application of deep learning in various medical scenarios.
+The rapid advancement of Large Language Models (LLMs) has introduced
+significant challenges in moderating user-model interactions. While LLMs
+demonstrate remarkable capabilities, they remain vulnerable to adversarial
+attacks, particularly ``jailbreaking'' techniques that bypass content safety
+measures. Current content moderation systems, which primarily rely on input
+prompt filtering, have proven insufficient, with techniques like Best-of-N
+(BoN) jailbreaking achieving success rates of 80% or more against popular LLMs.
+In this paper, we introduce Flexible LLM-Assisted Moderation Engine (FLAME): a
+new approach that shifts the focus from input filtering to output moderation.
+Unlike traditional circuit-breaking methods that analyze user queries, FLAME
+evaluates model responses, offering several key advantages: (1) computational
+efficiency in both training and inference, (2) enhanced resistance to BoN
+jailbreaking attacks, and (3) flexibility in defining and updating safety
+criteria through customizable topic filtering. Our experiments demonstrate that
+FLAME significantly outperforms current moderation systems. For example, FLAME
+reduces attack success rate in GPT-4o-mini and DeepSeek-v3 by a factor of ~9,
+while maintaining low computational overhead. We provide comprehensive
+evaluation on various LLMs and analyze the engine's efficiency against the
+state-of-the-art jailbreaking. This work contributes to the development of more
+robust and adaptable content moderation systems for LLMs.
 
-摘要：深度學習 (DL) 的最新進展促使開發出高性能早期預警評分 (EWS) 系統，預測急性腎臟損傷、急性心肌梗塞或循環衰竭等臨床惡化。DL 模型已被證明是各種任務的強大工具，但代價是缺乏可解釋性和有限的概括性，阻礙了其臨床應用。為了開發適用於各種結果的實用 EWS 系統，我們提出了因果關係解釋性早期預測模型，它利用因果發現來識別預測的潛在因果關係，從而擁有兩個獨特的優點：展示預測的明確解釋，同時在應用於不熟悉的環境時表現出良好的性能。得益於這些特性，與各種基線演算法相比，我們的模型在 6 種不同的危重惡化中實現了更高的準確度，並在不同的患者群體中實現了更好的概括性。此外，我們提供了明確的因果途徑，作為輔助臨床診斷和潛在干預措施的參考。所提出的方法增強了深度學習在各種醫療場景中的實際應用。
+摘要：大型語言模型 (LLM) 的快速進步為調節使用者與模型互動帶來重大挑戰。儘管 LLM 展現出非凡的能力，但它們仍然容易受到對抗性攻擊，特別是繞過內容安全措施的「越獄」技術。目前的內容審核系統主要依賴輸入提示過濾，已被證明不足，例如 Best-of-N (BoN) 越獄對抗熱門 LLM 的成功率達到 80% 以上。在本文中，我們介紹了靈活的 LLM 輔助審核引擎 (FLAME)：一種新的方法，將重點從輸入過濾轉移到輸出審核。與分析使用者查詢的傳統電路中斷方法不同，FLAME 評估模型回應，提供幾個關鍵優勢：(1) 訓練和推理中的計算效率，(2) 增強對 BoN 越獄攻擊的抵抗力，以及 (3) 透過可自訂主題過濾定義和更新安全標準的靈活性。我們的實驗證明，FLAME 明顯優於目前的審核系統。例如，FLAME 將 GPT-4o-mini 和 DeepSeek-v3 的攻擊成功率降低了約 9 倍，同時保持較低的計算負擔。我們對各種 LLM 進行了全面的評估，並分析了引擎對抗最新越獄的效率。這項工作有助於開發更強大且適應性更強的 LLM 內容審核系統。
 
-##### **JingFang: A Traditional Chinese Medicine Large Language Model of Expert-Level Medical Diagnosis and Syndrome Differentiation-Based Treatment**
-2502.04345v1 by Yehan Yan, Tianhao Ma, Ruotai Li, Xinhan Zheng, Guodong Shan, Chisheng Li
+##### **Two-Stage Representation Learning for Analyzing Movement Behavior Dynamics in People Living with Dementia**
+2502.09173v1 by Jin Cui, Alexander Capstick, Payam Barnaghi, Gregory Scott
 
-Traditional Chinese medicine (TCM) plays a vital role in health protection
-and disease treatment, but its practical application requires extensive medical
-knowledge and clinical experience. Existing TCM Large Language Models (LLMs)
-exhibit critical limitations of uncomprehensive medical consultation and
-diagnoses, and inaccurate syndrome differentiation-based treatment. To address
-these issues, this study establishes JingFang (JF): a novel TCM Large Language
-Model that demonstrates the expert-level capability of medical diagnosis and
-syndrome differentiation-based treatment. We innovate a Multi-agent Dynamic
-Collaborative Chain-of-Thought Mechanism (MDCCTM) for medical consultation,
-enabling JF with effective and accurate diagnostic ability. In addition, a
-Syndrome Agent and a Dual-Stage Retrieval Scheme (DSRS) are developed to
-significantly enhance the capacity of JF for disease treatment based on
-syndrome differentiation. JingFang not only facilitates the application of LLMs
-but also promotes the effective practice of TCM in human health protection and
-disease treatment.
+In remote healthcare monitoring, time series representation learning reveals
+critical patient behavior patterns from high-frequency data. This study
+analyzes home activity data from individuals living with dementia by proposing
+a two-stage, self-supervised learning approach tailored to uncover low-rank
+structures. The first stage converts time-series activities into text sequences
+encoded by a pre-trained language model, providing a rich, high-dimensional
+latent state space using a PageRank-based method. This PageRank vector captures
+latent state transitions, effectively compressing complex behaviour data into a
+succinct form that enhances interpretability. This low-rank representation not
+only enhances model interpretability but also facilitates clustering and
+transition analysis, revealing key behavioral patterns correlated with
+clinicalmetrics such as MMSE and ADAS-COG scores. Our findings demonstrate the
+framework's potential in supporting cognitive status prediction, personalized
+care interventions, and large-scale health monitoring.
 
-摘要：中醫藥在保健與疾病治療中扮演著重要的角色，但其實務應用需要深厚的醫學知識與臨床經驗。現有的中醫大語言模型（LLM）存在著醫療諮詢與診斷不全面、症候分型治療不準確的重大限制。為了解決這些問題，本研究建立了精方（JF）：一個新穎的中醫大語言模型，展示了專家級的醫療診斷與症候分型治療能力。我們創新了一個多智能體動態協作思考鏈機制（MDCCTM）用於醫療諮詢，讓 JF 具備有效且準確的診斷能力。此外，還開發了一個症候智能體和一個雙階段檢索方案（DSRS），以顯著增強 JF 基於症候分型的疾病治療能力。精方不僅促進了 LLM 的應用，也推動了中醫藥在人類保健與疾病治療中的有效實踐。
+摘要：在遠程醫療監控中，時間序列表示學習揭示了高頻率數據中的關鍵患者行為模式。本研究通過提出一個兩階段、自我監督的學習方法來分析痴呆症患者的家庭活動數據，該方法專門用於發現低秩結構。第一階段將時間序列活動轉換為由預訓練語言模型編碼的文本序列，使用基於 PageRank 的方法提供了一個豐富、高維的潛在狀態空間。此 PageRank 向量捕獲潛在狀態轉換，有效地將複雜的行為數據壓縮成簡潔的形式，從而增強了解力。此低秩表示不僅增強了模型的可解釋性，還促進了聚類和轉換分析，揭示了與臨床指標（例如 MMSE 和 ADAS-COG 分數）相關的關鍵行為模式。我們的研究結果證明了該框架在支持認知狀態預測、個性化護理干預和大型健康監控方面的潛力。
 
-##### **An Agentic AI Workflow for Detecting Cognitive Concerns in Real-world Data**
-2502.01789v1 by Jiazi Tian, Liqin Wang, Pedram Fard, Valdery Moura Junior, Deborah Blacker, Jennifer S. Haas, Chirag Patel, Shawn N. Murphy, Lidia M. V. R. Moura, Hossein Estiri
+##### **Musical Heritage Historical Entity Linking**
+2502.09168v1 by Arianna Graciotti, Nicolas Lazzari, Valentina Presutti, Rocco Tripodi
 
-Early identification of cognitive concerns is critical but often hindered by
-subtle symptom presentation. This study developed and validated a fully
-automated, multi-agent AI workflow using LLaMA 3 8B to identify cognitive
-concerns in 3,338 clinical notes from Mass General Brigham. The agentic
-workflow, leveraging task-specific agents that dynamically collaborate to
-extract meaningful insights from clinical notes, was compared to an
-expert-driven benchmark. Both workflows achieved high classification
-performance, with F1-scores of 0.90 and 0.91, respectively. The agentic
-workflow demonstrated improved specificity (1.00) and achieved prompt
-refinement in fewer iterations. Although both workflows showed reduced
-performance on validation data, the agentic workflow maintained perfect
-specificity. These findings highlight the potential of fully automated
-multi-agent AI workflows to achieve expert-level accuracy with greater
-efficiency, offering a scalable and cost-effective solution for detecting
-cognitive concerns in clinical settings.
+Linking named entities occurring in text to their corresponding entity in a
+Knowledge Base (KB) is challenging, especially when dealing with historical
+texts. In this work, we introduce Musical Heritage named Entities Recognition,
+Classification and Linking (MHERCL), a novel benchmark consisting of manually
+annotated sentences extrapolated from historical periodicals of the music
+domain. MHERCL contains named entities under-represented or absent in the most
+famous KBs. We experiment with several State-of-the-Art models on the Entity
+Linking (EL) task and show that MHERCL is a challenging dataset for all of
+them. We propose a novel unsupervised EL model and a method to extend
+supervised entity linkers by using Knowledge Graphs (KGs) to tackle the main
+difficulties posed by historical documents. Our experiments reveal that relying
+on unsupervised techniques and improving models with logical constraints based
+on KGs and heuristics to predict NIL entities (entities not represented in the
+KB of reference) results in better EL performance on historical documents.
 
-摘要：及早辨識認知問題至關重要，但常常受到症狀呈現過於細微的阻礙。本研究開發並驗證了一個全自動化、多重代理的 AI 工作流程，使用 LLaMA 3 8B 來辨識來自麻省總醫院布萊根分院的 3,338 則臨床筆記中的認知問題。這個代理工作流程利用了特定任務的代理，這些代理會動態合作從臨床筆記中萃取出有意義的見解，並與專家驅動的基準進行比較。這兩個工作流程都達到了很高的分類效能，F1 分數分別為 0.90 和 0.91。代理工作流程展現出更好的特異性（1.00），並且在更少的反覆運算中達到了提示精煉。儘管這兩個工作流程在驗證資料上的效能都降低了，但代理工作流程維持了完美的特異性。這些發現突顯了全自動化多重代理 AI 工作流程的潛力，它們能以更高的效率達到專家級的準確度，為在臨床環境中偵測認知問題提供了一個可擴充且具成本效益的解決方案。
+摘要：將文本中出現的名稱實體連結到知識庫 (KB) 中對應的實體具有挑戰性，尤其是在處理歷史文本時。在這項工作中，我們引入了音樂遺產命名實體識別、分類和連結 (MHERCL)，這是一個由從音樂領域的歷史期刊中外推的手動標註句子組成的全新基準。MHERCL 包含在最著名的 KB 中代表性不足或不存在的名稱實體。我們在實體連結 (EL) 任務中對多個最先進的模型進行了實驗，並表明 MHERCL 對所有模型來說都是一個具有挑戰性的資料集。我們提出了一個新的無監督 EL 模型和一個通過使用知識圖 (KG) 來擴充監督式實體連結器的的方法，以解決歷史文件提出的主要難題。我們的實驗表明，依賴無監督技術並使用基於 KG 和啟發法的邏輯約束來改善模型以預測 NIL 實體（未在參考 KB 中表示的實體）會在歷史文件中產生更好的 EL 效能。
 
-##### **Can Domain Experts Rely on AI Appropriately? A Case Study on AI-Assisted Prostate Cancer MRI Diagnosis**
-2502.03482v1 by Chacha Chen, Han Liu, Jiamin Yang, Benjamin M. Mervak, Bora Kalaycioglu, Grace Lee, Emre Cakmakli, Matteo Bonatti, Sridhar Pudu, Osman Kahraman, Gul Gizem Pamuk, Aytekin Oto, Aritrick Chatterjee, Chenhao Tan
+##### **Improving TCM Question Answering through Tree-Organized Self-Reflective Retrieval with LLMs**
+2502.09156v1 by Chang Liu, Ying Chang, Jianmin Li, Yiqian Qu, Yu Li, Lingyong Cao, Shuyuan Lin
 
-Despite the growing interest in human-AI decision making, experimental
-studies with domain experts remain rare, largely due to the complexity of
-working with domain experts and the challenges in setting up realistic
-experiments. In this work, we conduct an in-depth collaboration with
-radiologists in prostate cancer diagnosis based on MRI images. Building on
-existing tools for teaching prostate cancer diagnosis, we develop an interface
-and conduct two experiments to study how AI assistance and performance feedback
-shape the decision making of domain experts. In Study 1, clinicians were asked
-to provide an initial diagnosis (human), then view the AI's prediction, and
-subsequently finalize their decision (human-AI team). In Study 2 (after a
-memory wash-out period), the same participants first received aggregated
-performance statistics from Study 1, specifically their own performance, the
-AI's performance, and their human-AI team performance, and then directly viewed
-the AI's prediction before making their diagnosis (i.e., no independent initial
-diagnosis). These two workflows represent realistic ways that clinical AI tools
-might be used in practice, where the second study simulates a scenario where
-doctors can adjust their reliance and trust on AI based on prior performance
-feedback. Our findings show that, while human-AI teams consistently outperform
-humans alone, they still underperform the AI due to under-reliance, similar to
-prior studies with crowdworkers. Providing clinicians with performance feedback
-did not significantly improve the performance of human-AI teams, although
-showing AI decisions in advance nudges people to follow AI more. Meanwhile, we
-observe that the ensemble of human-AI teams can outperform AI alone, suggesting
-promising directions for human-AI collaboration.
+Objectives: Large language models (LLMs) can harness medical knowledge for
+intelligent question answering (Q&A), promising support for auxiliary diagnosis
+and medical talent cultivation. However, there is a deficiency of highly
+efficient retrieval-augmented generation (RAG) frameworks within the domain of
+Traditional Chinese Medicine (TCM). Our purpose is to observe the effect of the
+Tree-Organized Self-Reflective Retrieval (TOSRR) framework on LLMs in TCM Q&A
+tasks.
+  Materials and Methods: We introduce the novel approach of knowledge
+organization, constructing a tree structure knowledge base with hierarchy. At
+inference time, our self-reflection framework retrieves from this knowledge
+base, integrating information across chapters. Questions from the TCM Medical
+Licensing Examination (MLE) and the college Classics Course Exam (CCE) were
+randomly selected as benchmark datasets.
+  Results: By coupling with GPT-4, the framework can improve the best
+performance on the TCM MLE benchmark by 19.85% in absolute accuracy, and
+improve recall accuracy from 27% to 38% on CCE datasets. In manual evaluation,
+the framework improves a total of 18.52 points across dimensions of safety,
+consistency, explainability, compliance, and coherence.
+  Conclusion: The TOSRR framework can effectively improve LLM's capability in
+Q&A tasks of TCM.
 
-摘要：儘管人們對人類與 AI 決策制定越來越感興趣，但與領域專家合作的實驗研究仍然很少見，這在很大程度上是因為與領域專家合作的複雜性，以及在設定實際實驗時面臨的挑戰。在這項工作中，我們與放射科醫師進行深入合作，基於 MRI 影像診斷前列腺癌。建立在用於教授前列腺癌診斷的現有工具上，我們開發了一個介面並進行了兩項實驗，以研究 AI 協助和效能回饋如何塑造領域專家的決策制定。在研究 1 中，要求臨床醫師提供初步診斷（人類），然後檢視 AI 的預測，並隨後確定他們的決策（人類-AI 團隊）。在研究 2（經過一段記憶清除期）中，同一位參與者首先收到研究 1 的彙總效能統計資料，特別是他們自己的效能、AI 的效能，以及他們的人類-AI 團隊效能，然後在做出診斷前直接檢視 AI 的預測（即，沒有獨立的初步診斷）。這兩個工作流程代表了臨床 AI 工具在實務中可能被使用的方式，其中第二個研究模擬了醫生可以根據先前的效能回饋調整他們對 AI 的依賴和信任的情況。我們的研究結果顯示，儘管人類-AI 團隊始終優於單獨的人類，但由於依賴不足，他們仍然表現不如 AI，這與之前針對群眾工作者的研究類似。儘管事先顯示 AI 決策會促使人們更多地遵循 AI，但向臨床醫師提供效能回饋並未顯著改善人類-AI 團隊的效能。同時，我們觀察到人類-AI 團隊的集合可以優於單獨的 AI，這表明了人類-AI 合作的前景。
+摘要：目標：大型語言模型（LLM）可以利用醫療知識進行智能問答（Q&A），承諾支持輔助診斷和醫療人才培養。然而，在中醫領域內缺乏高效的檢索增強生成（RAG）框架。我們的目的是觀察樹組織自省檢索（TOSRR）框架對中醫問答任務中 LLM 的影響。
+材料和方法：我們引入了知識組織的新方法，構建了一個具有層次的樹結構知識庫。在推理時間，我們的自省框架從這個知識庫中檢索，整合章節中的信息。中醫醫師資格考試（MLE）和大學經典課程考試（CCE）中的問題被隨機選為基準數據集。
+結果：通過與 GPT-4 結合，該框架可以將中醫 MLE 基準上的最佳性能提高 19.85% 的絕對準確度，並將 CCE 數據集上的召回準確度從 27% 提高到 38%。在手動評估中，該框架在安全性、一致性、可解釋性、合規性和連貫性方面總共提高了 18.52 分。
+結論：TOSRR 框架可以有效提升 LLM 在中醫問答任務中的能力。
 
-##### **Improving Transformer World Models for Data-Efficient RL**
-2502.01591v1 by Antoine Dedieu, Joseph Ortiz, Xinghua Lou, Carter Wendelken, Wolfgang Lehrach, J Swaroop Guntupalli, Miguel Lazaro-Gredilla, Kevin Patrick Murphy
+##### **A Novel Dialect-Aware Framework for the Classification of Arabic Dialects and Emotions**
+2502.09128v1 by Nasser A Alsadhan
 
-We present an approach to model-based RL that achieves a new state of the art
-performance on the challenging Craftax-classic benchmark, an open-world 2D
-survival game that requires agents to exhibit a wide range of general abilities
--- such as strong generalization, deep exploration, and long-term reasoning.
-With a series of careful design choices aimed at improving sample efficiency,
-our MBRL algorithm achieves a reward of 67.4% after only 1M environment steps,
-significantly outperforming DreamerV3, which achieves 53.2%, and, for the first
-time, exceeds human performance of 65.0%. Our method starts by constructing a
-SOTA model-free baseline, using a novel policy architecture that combines CNNs
-and RNNs. We then add three improvements to the standard MBRL setup: (a) "Dyna
-with warmup", which trains the policy on real and imaginary data, (b) "nearest
-neighbor tokenizer" on image patches, which improves the scheme to create the
-transformer world model (TWM) inputs, and (c) "block teacher forcing", which
-allows the TWM to reason jointly about the future tokens of the next timestep.
+Arabic is one of the oldest languages still in use today. As a result,
+several Arabic-speaking regions have developed dialects that are unique to
+them. Dialect and emotion recognition have various uses in Arabic text
+analysis, such as determining an online customer's origin based on their
+comments. Furthermore, intelligent chatbots that are aware of a user's emotions
+can respond appropriately to the user. Current research in emotion detection in
+the Arabic language lacks awareness of how emotions are exhibited in different
+dialects, which motivates the work found in this study. This research addresses
+the problems of dialect and emotion classification in Arabic. Specifically,
+this is achieved by building a novel framework that can identify and predict
+Arabic dialects and emotions from a given text. The framework consists of three
+modules: A text-preprocessing module, a classification module, and a clustering
+module with the novel capability of building new dialect-aware emotion
+lexicons. The proposed framework generated a new emotional lexicon for
+different dialects. It achieved an accuracy of 88.9% in classifying Arabic
+dialects, which outperforms the state-of-the-art results by 6.45 percentage
+points. Furthermore, the framework achieved 89.1-79% accuracy in detecting
+emotions in the Egyptian and Gulf dialects, respectively.
 
-摘要：我們提出了一個基於模型的 RL 方法，在具有挑戰性的 Craftax-classic 基準上實現了新的技術水準，這是一個開放世界的 2D 生存遊戲，要求代理人展現廣泛的一般能力，例如強大的概括能力、深入探索和長期推理。通過一系列旨在提高樣本效率的仔細設計選擇，我們的 MBRL 演算法在僅 1M 環境步驟後就實現了 67.4% 的獎勵，顯著優於 DreamerV3（實現 53.2%），並且首次超過了人類的 65.0% 的表現。我們的演算法首先通過使用結合 CNN 和 RNN 的新穎策略架構來建構一個 SOTA 無模型基線。然後，我們對標準 MBRL 設定新增了三項改進：(a)「帶熱身的 Dyna」，它在真實和假想資料上訓練策略，(b) 影像貼片的「最近鄰代碼化器」，它改進了建立轉換器世界模型 (TWM) 輸入的方案，以及 (c)「區塊教師強制」，它允許 TWM 共同推理下一個時間步長的未來代碼。
+摘要：阿拉伯語是現今仍在使用中最古老的語言之一。因此，幾個講阿拉伯語的地區發展出獨特的方言。方言和情緒辨識在阿拉伯語文本分析中有多種用途，例如根據在線客戶的評論來確定其來源。此外，知道使用者情緒的智慧聊天機器人可以適當地回應使用者。目前對阿拉伯語情緒偵測的研究缺乏對不同方言如何表現情緒的認識，這激勵了本研究中的工作。本研究探討了阿拉伯語中的方言和情緒分類問題。具體而言，這是通過建立一個新的框架來實現的，該框架可以識別和預測給定文本中的阿拉伯方言和情緒。該框架包含三個模組：文字預處理模組、分類模組和聚類模組，具有建立新的方言感知情緒詞彙表的新功能。所提出的框架為不同的方言生成了新的情緒詞彙表。它在分類阿拉伯方言方面達到了 88.9% 的準確率，比最先進的結果高出 6.45 個百分點。此外，該框架在檢測埃及和海灣方言的情緒方面分別達到了 89.1-79% 的準確率。
 
-##### **Data-Efficient Model for Psychological Resilience Prediction based on Neurological Data**
-2502.01377v1 by Zhi Zhang, Yan Liu, Mengxia Gao, Yu Yang, Jiannong Cao, Wai Kai Hou, Shirley Li, Sonata Yau, Yun Kwok Wing, Tatia M. C. Lee
+##### **Automatic Pruning via Structured Lasso with Class-wise Information**
+2502.09125v1 by Xiang Liu, Mingchen Li, Xia Li, Leigang Qu, Zifan Peng, Yijun Song, Zemin Liu, Linshan Jiang, Jialin Li
 
-Psychological resilience, defined as the ability to rebound from adversity,
-is crucial for mental health. Compared with traditional resilience assessments
-through self-reported questionnaires, resilience assessments based on
-neurological data offer more objective results with biological markers, hence
-significantly enhancing credibility. This paper proposes a novel data-efficient
-model to address the scarcity of neurological data. We employ Neuro
-Kolmogorov-Arnold Networks as the structure of the prediction model. In the
-training stage, a new trait-informed multimodal representation algorithm with a
-smart chunk technique is proposed to learn the shared latent space with limited
-data. In the test stage, a new noise-informed inference algorithm is proposed
-to address the low signal-to-noise ratio of the neurological data. The proposed
-model not only shows impressive performance on both public datasets and
-self-constructed datasets but also provides some valuable psychological
-hypotheses for future research.
+Most pruning methods concentrate on unimportant filters of neural networks.
+However, they face the loss of statistical information due to a lack of
+consideration for class-wise data. In this paper, from the perspective of
+leveraging precise class-wise information for model pruning, we utilize
+structured lasso with guidance from Information Bottleneck theory. Our approach
+ensures that statistical information is retained during the pruning process.
+With these techniques, we introduce two innovative adaptive network pruning
+schemes: sparse graph-structured lasso pruning with Information Bottleneck
+(\textbf{sGLP-IB}) and sparse tree-guided lasso pruning with Information
+Bottleneck (\textbf{sTLP-IB}). The key aspect is pruning model filters using
+sGLP-IB and sTLP-IB to better capture class-wise relatedness. Compared to
+multiple state-of-the-art methods, our approaches demonstrate superior
+performance across three datasets and six model architectures in extensive
+experiments. For instance, using the VGG16 model on the CIFAR-10 dataset, we
+achieve a parameter reduction of 85%, a decrease in FLOPs by 61%, and maintain
+an accuracy of 94.10% (0.14% higher than the original model); we reduce the
+parameters by 55% with the accuracy at 76.12% using the ResNet architecture on
+ImageNet (only drops 0.03%). In summary, we successfully reduce model size and
+computational resource usage while maintaining accuracy. Our codes are at
+https://anonymous.4open.science/r/IJCAI-8104.
 
-摘要：心理韌性，定義為從逆境中反彈的能力，對心理健康至關重要。與通過自我報告問卷的傳統韌性評估相比，基於神經數據的韌性評估提供了更客觀的結果和生物標記，從而顯著提高了可信度。本文提出了一個新穎的數據高效模型來解決神經數據的稀缺性。我們採用神經科爾莫哥羅夫-阿諾德網路作為預測模型的結構。在訓練階段，提出了一種新的特徵信息多模態表示算法，採用智能塊技術，以有限的數據學習共享潛在空間。在測試階段，提出了一種新的噪聲信息推理算法，以解決神經數據的信噪比低的問題。所提出的模型不僅在公共數據集和自構數據集上都顯示出令人印象深刻的性能，還為未來的研究提供了一些有價值的心理假設。
+摘要：大多數剪枝方法都集中在神經網路中不重要的濾波器上。
+然而，由於缺乏對類別資料的考量，它們面臨統計資訊的遺失。在本文中，我們從利用精確類別資訊進行模型剪枝的角度，利用結構化套索搭配資訊瓶頸理論的指導。我們的做法確保在剪枝過程中保留統計資訊。藉由這些技術，我們引入了兩個創新的自適應網路剪枝方案：帶有資訊瓶頸的稀疏圖形結構套索剪枝（sGLP-IB）和帶有資訊瓶頸的稀疏樹導引套索剪枝（sTLP-IB）。關鍵方面是使用 sGLP-IB 和 sTLP-IB 剪枝模型濾波器，以更好地擷取類別關聯性。與多種最先進的方法相比，我們的做法在廣泛的實驗中展現出跨三個資料集和六個模型架構的卓越效能。例如，在 CIFAR-10 資料集上使用 VGG16 模型，我們達到了 85% 的參數減少、61% 的 FLOP 減少，並維持 94.10% 的準確度（比原始模型高 0.14%）；我們在 ImageNet 上使用 ResNet 架構將參數減少了 55%，準確度為 76.12%（僅下降 0.03%）。總之，我們成功地減少了模型大小和計算資源使用，同時維持準確度。我們的程式碼位於 https://anonymous.4open.science/r/IJCAI-8104。
 
-##### **OphthBench: A Comprehensive Benchmark for Evaluating Large Language Models in Chinese Ophthalmology**
-2502.01243v1 by Chengfeng Zhou, Ji Wang, Juanjuan Qin, Yining Wang, Ling Sun, Weiwei Dai
+##### **The influence of visual and linguistic cues on ignorance inference in Vision-Language Models (VLMs)**
+2502.09120v1 by Ye-eun Cho, Yunho Maeng
 
-Large language models (LLMs) have shown significant promise across various
-medical applications, with ophthalmology being a notable area of focus. Many
-ophthalmic tasks have shown substantial improvement through the integration of
-LLMs. However, before these models can be widely adopted in clinical practice,
-evaluating their capabilities and identifying their limitations is crucial. To
-address this research gap and support the real-world application of LLMs, we
-introduce the OphthBench, a specialized benchmark designed to assess LLM
-performance within the context of Chinese ophthalmic practices. This benchmark
-systematically divides a typical ophthalmic clinical workflow into five key
-scenarios: Education, Triage, Diagnosis, Treatment, and Prognosis. For each
-scenario, we developed multiple tasks featuring diverse question types,
-resulting in a comprehensive benchmark comprising 9 tasks and 591 questions.
-This comprehensive framework allows for a thorough assessment of LLMs'
-capabilities and provides insights into their practical application in Chinese
-ophthalmology. Using this benchmark, we conducted extensive experiments and
-analyzed the results from 39 popular LLMs. Our evaluation highlights the
-current gap between LLM development and its practical utility in clinical
-settings, providing a clear direction for future advancements. By bridging this
-gap, we aim to unlock the potential of LLMs and advance their development in
-ophthalmology.
+This study explored how Vision-Language Models (VLMs) process ignorance
+implicatures with visual and linguistic cues. Particularly, we focused on the
+effects of contexts (precise and approximate contexts) and modifier types (bare
+numerals, superlative, and comparative modifiers), which were considered
+pragmatic and semantic factors respectively. Methodologically, we conducted a
+truth-value judgment task in visually grounded settings using GPT-4o and Gemini
+1.5 Pro. The results indicate that while both models exhibited sensitivity to
+linguistic cues (modifier), they failed to process ignorance implicatures with
+visual cues (context) as humans do. Specifically, the influence of context was
+weaker and inconsistent across models, indicating challenges in pragmatic
+reasoning for VLMs. On the other hand, superlative modifiers were more strongly
+associated with ignorance implicatures as compared to comparative modifiers,
+supporting the semantic view. These findings highlight the need for further
+advancements in VLMs to process language-vision information in a
+context-dependent way to achieve human-like pragmatic inference.
 
-摘要：大型語言模型 (LLM) 在各種醫療應用中已展現出顯著的潛力，其中眼科是一個值得關注的重要領域。許多眼科任務已透過整合 LLM 而大幅進步。然而，在這些模型能廣泛應用於臨床實務之前，評估其能力並找出其限制至關重要。為了解決這個研究差距並支援 LLM 的實際應用，我們引入了 OphthBench，這是一個專門的基準測試，旨在評估 LLM 在中國眼科實務中的表現。此基準測試系統性地將典型眼科臨床工作流程劃分為五個關鍵情境：教育、分流、診斷、治療和預後。對於每個情境，我們開發了多項任務，包含多樣化的問題類型，最後組成一個包含 9 項任務和 591 個問題的綜合基準測試。此綜合架構可徹底評估 LLM 的能力，並提供其在中國眼科的實際應用見解。使用此基準測試，我們進行了廣泛的實驗，並分析了來自 39 個熱門 LLM 的結果。我們的評估強調了 LLM 開發與其在臨床環境中的實際效用之間的差距，為未來的進展提供了明確的方向。透過彌合此差距，我們旨在釋放 LLM 的潛力，並促進其在眼科的發展。
+摘要：本研究探討了視覺語言模型 (VLM) 如何處理視覺和語言線索中的無知含義。特別是，我們專注於語境（精確和近似語境）和修飾語類型（裸數字、最高級和比較級修飾語）的影響，這些分別被視為語用和語義因素。在方法論上，我們使用 GPT-4o 和 Gemini 1.5 Pro 在視覺基礎設置中進行了真值判斷任務。結果表明，儘管這兩個模型都對語言線索（修飾語）表現出敏感性，但它們未能像人類那樣處理帶有視覺線索（語境）的無知含義。具體來說，語境的影響在各個模型中較弱且不一致，表明 VLM 在語用推理方面存在挑戰。另一方面，與比較級修飾語相比，最高級修飾語與無知含義的關聯性更強，這支持了語義觀點。這些發現強調了 VLM 進一步發展的必要性，以以語境依賴的方式處理語言視覺信息，以實現類人語用推理。
+
+##### **One-shot Federated Learning Methods: A Practical Guide**
+2502.09104v1 by Xiang Liu, Zhenheng Tang, Xia Li, Yijun Song, Sijie Ji, Zemin Liu, Bo Han, Linshan Jiang, Jialin Li
+
+One-shot Federated Learning (OFL) is a distributed machine learning paradigm
+that constrains client-server communication to a single round, addressing
+privacy and communication overhead issues associated with multiple rounds of
+data exchange in traditional Federated Learning (FL). OFL demonstrates the
+practical potential for integration with future approaches that require
+collaborative training models, such as large language models (LLMs). However,
+current OFL methods face two major challenges: data heterogeneity and model
+heterogeneity, which result in subpar performance compared to conventional FL
+methods. Worse still, despite numerous studies addressing these limitations, a
+comprehensive summary is still lacking. To address these gaps, this paper
+presents a systematic analysis of the challenges faced by OFL and thoroughly
+reviews the current methods. We also offer an innovative categorization method
+and analyze the trade-offs of various techniques. Additionally, we discuss the
+most promising future directions and the technologies that should be integrated
+into the OFL field. This work aims to provide guidance and insights for future
+research.
 
-##### **MIND: Modality-Informed Knowledge Distillation Framework for Multimodal Clinical Prediction Tasks**
-2502.01158v1 by Alejandro Guerra-Manzanares, Farah E. Shamout
+摘要：單次聯邦學習 (OFL) 是一種分散式機器學習範例，將客戶端與伺服器通訊限制在單一輪次中，解決傳統聯邦學習 (FL) 中多輪次資料交換相關的隱私和通訊負擔問題。OFL 展示了與需要協作訓練模型的未來方法整合的實際潛力，例如大型語言模型 (LLM)。然而，目前的 OFL 方法面臨兩大挑戰：資料異質性和模型異質性，這導致與傳統 FL 方法相比，效能較差。更糟的是，儘管有許多研究探討這些限制，但仍缺乏全面的摘要。為了解決這些差距，本文對 OFL 面臨的挑戰進行系統分析，並徹底檢視目前的方法。我們還提供創新的分類方法，並分析各種技術的權衡取捨。此外，我們討論最有希望的未來方向，以及應整合到 OFL 領域的技術。這項工作旨在為未來的研究提供指導和見解。
 
-Multimodal fusion leverages information across modalities to learn better
-feature representations with the goal of improving performance in fusion-based
-tasks. However, multimodal datasets, especially in medical settings, are
-typically smaller than their unimodal counterparts, which can impede the
-performance of multimodal models. Additionally, the increase in the number of
-modalities is often associated with an overall increase in the size of the
-multimodal network, which may be undesirable in medical use cases. Utilizing
-smaller unimodal encoders may lead to sub-optimal performance, particularly
-when dealing with high-dimensional clinical data. In this paper, we propose the
-Modality-INformed knowledge Distillation (MIND) framework, a multimodal model
-compression approach based on knowledge distillation that transfers knowledge
-from ensembles of pre-trained deep neural networks of varying sizes into a
-smaller multimodal student. The teacher models consist of unimodal networks,
-allowing the student to learn from diverse representations. MIND employs
-multi-head joint fusion models, as opposed to single-head models, enabling the
-use of unimodal encoders in the case of unimodal samples without requiring
-imputation or masking of absent modalities. As a result, MIND generates an
-optimized multimodal model, enhancing both multimodal and unimodal
-representations. It can also be leveraged to balance multimodal learning during
-training. We evaluate MIND on binary and multilabel clinical prediction tasks
-using time series data and chest X-ray images. Additionally, we assess the
-generalizability of the MIND framework on three non-medical multimodal
-multiclass datasets. Experimental results demonstrate that MIND enhances the
-performance of the smaller multimodal network across all five tasks, as well as
-various fusion methods and multimodal architectures, compared to
-state-of-the-art baselines.
+##### **Logical Reasoning in Large Language Models: A Survey**
+2502.09100v1 by Hanmeng Liu, Zhizhang Fu, Mengru Ding, Ruoxi Ning, Chaoli Zhang, Xiaozhang Liu, Yue Zhang
 
-摘要：多模态融合利用跨模态的信息来学习更好的特征表示，目标是提升基于融合的任务的性能。然而，多模态数据集，尤其是在医疗环境中，通常比它们的单模态对应数据集小，这会阻碍多模态模型的性能。此外，模态数量的增加通常与多模态网络尺寸的整体增加相关，这在医疗用例中可能是不可取的。利用较小的单模态编码器可能会导致次优性能，尤其是在处理高维临床数据时。在本文中，我们提出了模态信息知识蒸馏 (MIND) 框架，这是一种基于知识蒸馏的多模态模型压缩方法，它将来自不同大小的预训练深度神经网络的集合中的知识转移到一个较小的多模态学生中。教师模型由单模态网络组成，允许学生从不同的表示中学习。MIND 采用多头联合融合模型，而不是单头模型，从而能够在单模态样本的情况下使用单模态编码器，而不需要缺失模态的插补或掩蔽。因此，MIND 生成了一个经过优化的多模态模型，增强了多模态和单模态表示。它还可以用来在训练期间平衡多模态学习。我们使用时间序列数据和胸部 X 射线图像对二元和多标签临床预测任务评估了 MIND。此外，我们评估了 MIND 框架在三个非医疗多模态多分类数据集上的泛化性。实验结果表明，与最先进的基线相比，MIND 增强了较小的多模态网络在所有五个任务以及各种融合方法和多模态架构中的性能。
+With the emergence of advanced reasoning models like OpenAI o3 and
+DeepSeek-R1, large language models (LLMs) have demonstrated remarkable
+reasoning capabilities. However, their ability to perform rigorous logical
+reasoning remains an open question. This survey synthesizes recent advancements
+in logical reasoning within LLMs, a critical area of AI research. It outlines
+the scope of logical reasoning in LLMs, its theoretical foundations, and the
+benchmarks used to evaluate reasoning proficiency. We analyze existing
+capabilities across different reasoning paradigms - deductive, inductive,
+abductive, and analogical - and assess strategies to enhance reasoning
+performance, including data-centric tuning, reinforcement learning, decoding
+strategies, and neuro-symbolic approaches. The review concludes with future
+directions, emphasizing the need for further exploration to strengthen logical
+reasoning in AI systems.
 
-##### **Beyond Yes or No: Predictive Compliance Monitoring Approaches for Quantifying the Magnitude of Compliance Violations**
-2502.01141v1 by Qian Chen, Stefanie Rinderle-Ma, Lijie Wen
+摘要：隨著 OpenAI o3 和 DeepSeek-R1 等先進推理模型的出現，大型語言模型 (LLM) 已展現出非凡的推理能力。然而，它們執行嚴謹邏輯推理的能力仍是一個開放性的問題。此調查綜合了 LLM 中邏輯推理的最新進展，這是 AI 研究的一個關鍵領域。它概述了 LLM 中邏輯推理的範圍、其理論基礎，以及用於評估推理能力的基準。我們分析了不同推理範例（演繹、歸納、外推和類比）中的現有能力，並評估增強推理效能的策略，包括以數據為中心的調整、強化學習、解碼策略和神經符號方法。此評論以未來的方向作為結論，強調需要進一步探索以強化 AI 系統中的邏輯推理。
 
-Most existing process compliance monitoring approaches detect compliance
-violations in an ex post manner. Only predicate prediction focuses on
-predicting them. However, predicate prediction provides a binary yes/no notion
-of compliance, lacking the ability to measure to which extent an ongoing
-process instance deviates from the desired state as specified in constraints.
-Here, being able to quantify the magnitude of violation would provide
-organizations with deeper insights into their operational performance, enabling
-informed decision making to reduce or mitigate the risk of non-compliance.
-Thus, we propose two predictive compliance monitoring approaches to close this
-research gap. The first approach reformulates the binary classification problem
-as a hybrid task that considers both classification and regression, while the
-second employs a multi-task learning method to explicitly predict the
-compliance status and the magnitude of violation for deviant cases
-simultaneously. In this work, we focus on temporal constraints as they are
-significant in almost any application domain, e.g., health care. The evaluation
-on synthetic and real-world event logs demonstrates that our approaches are
-capable of quantifying the magnitude of violations while maintaining comparable
-performance for compliance predictions achieved by state-of-the-art approaches.
+##### **A Hybrid Transformer Model for Fake News Detection: Leveraging Bayesian Optimization and Bidirectional Recurrent Unit**
+2502.09097v1 by Tianyi Huang, Zeqiu Xu, Peiyang Yu, Jingyuan Yi, Xiaochuan Xu
 
-摘要：現有的流程合規監控方法大多會在事後偵測到合規違規。只有謂詞預測專注於預測這些違規。然而，謂詞預測提供的是合規與否的二元概念，無法衡量正在進行的流程實例偏離約束中所指定之理想狀態的程度。在此，能夠量化違規的嚴重程度，將能讓組織深入了解其營運績效，並能據此做出明智的決策，以降低或減輕不合規的風險。因此，我們提出兩種預測合規監控方法來填補此研究空白。第一種方法將二元分類問題重新表述為同時考量分類和回歸的混合任務，而第二種方法則採用多任務學習方法，同時明確預測合規狀態和偏差案例的違規嚴重程度。在這項工作中，我們專注於時間約束，因為它們幾乎在任何應用領域（例如醫療保健）中都很重要。在合成和真實世界事件記錄上的評估顯示，我們的做法能夠量化違規的嚴重程度，同時維持與現有方法所達成的合規預測相當的績效。
+In this paper, we propose an optimized Transformer model that integrates
+Bayesian algorithms with a Bidirectional Gated Recurrent Unit (BiGRU), and
+apply it to fake news classification for the first time. First, we employ the
+TF-IDF method to extract features from news texts and transform them into
+numeric representations to facilitate subsequent machine learning tasks. Two
+sets of experiments are then conducted for fake news detection and
+classification: one using a Transformer model optimized only with BiGRU, and
+the other incorporating Bayesian algorithms into the BiGRU-based Transformer.
+Experimental results show that the BiGRU-optimized Transformer achieves 100%
+accuracy on the training set and 99.67% on the test set, while the addition of
+the Bayesian algorithm maintains 100% accuracy on the training set and slightly
+improves test-set accuracy to 99.73%. This indicates that the Bayesian
+algorithm boosts model accuracy by 0.06%, further enhancing the detection
+capability for fake news. Moreover, the proposed algorithm converges rapidly at
+around the 10th training epoch with accuracy nearing 100%, demonstrating both
+its effectiveness and its fast classification ability. Overall, the optimized
+Transformer model, enhanced by the Bayesian algorithm and BiGRU, exhibits
+excellent continuous learning and detection performance, offering a robust
+technical means to combat the spread of fake news in the current era of
+information overload.
 
-##### **Pulse-PPG: An Open-Source Field-Trained PPG Foundation Model for Wearable Applications Across Lab and Field Settings**
-2502.01108v1 by Mithun Saha, Maxwell A. Xu, Wanting Mao, Sameer Neupane, James M. Rehg, Santosh Kumar
+摘要：<paragraph>在本文中，我們提出了一個最佳化的 Transformer 模型，它將貝氏演算法與雙向門控遞迴單元 (BiGRU) 整合在一起，並首次將其應用於假新聞分類。首先，我們採用 TF-IDF 方法從新聞文本中提取特徵，並將它們轉換為數值表示，以利於後續的機器學習任務。接著進行兩組實驗，分別針對假新聞偵測和分類：一組使用僅使用 BiGRU 最佳化的 Transformer 模型，另一組將貝氏演算法納入基於 BiGRU 的 Transformer 中。實驗結果顯示，BiGRU 最佳化的 Transformer 在訓練組上達到 100% 的準確度，在測試組上達到 99.67%，而加入貝氏演算法後，在訓練組上維持 100% 的準確度，並將測試組的準確度略微提升至 99.73%。這表示貝氏演算法將模型準確度提升了 0.06%，進一步增強了對假新聞的偵測能力。此外，所提出的演算法在約第 10 個訓練週期時快速收斂，準確度接近 100%，證明了它的有效性和快速的分類能力。總的來說，由貝氏演算法和 BiGRU 增強的最佳化 Transformer 模型展現出絕佳的持續學習和偵測效能，提供了一個強健的技術手段來對抗在當前資訊過載時代中假新聞的散布。</paragraph>
 
-Photoplethysmography (PPG)-based foundation models are gaining traction due
-to the widespread use of PPG in biosignal monitoring and their potential to
-generalize across diverse health applications. In this paper, we introduce
-Pulse-PPG, the first open-source PPG foundation model trained exclusively on
-raw PPG data collected over a 100-day field study with 120 participants.
-Existing PPG foundation models are either open-source but trained on clinical
-data or closed-source, limiting their applicability in real-world settings. We
-evaluate Pulse-PPG across multiple datasets and downstream tasks, comparing its
-performance against a state-of-the-art foundation model trained on clinical
-data. Our results demonstrate that Pulse-PPG, trained on uncurated field data,
-exhibits superior generalization across clinical and mobile health applications
-in both lab and field settings. This suggests that exposure to real-world
-variability enables the model to learn fine-grained representations, making it
-more adaptable across tasks. Furthermore, pre-training on field data
-surprisingly outperforms its pre-training on clinical data in many tasks,
-reinforcing the importance of training on real-world, diverse datasets. To
-encourage further advancements in robust foundation models leveraging field
-data, we plan to release Pulse-PPG, providing researchers with a powerful
-resource for developing more generalizable PPG-based models.
+##### **A Hybrid Model for Few-Shot Text Classification Using Transfer and Meta-Learning**
+2502.09086v1 by Jia Gao, Shuangquan Lyu, Guiran Liu, Binrong Zhu, Hongye Zheng, Xiaoxuan Liao
 
-摘要：基於光電容積描記術 (PPG) 的基礎模型由於 PPG 在生物訊號監控中的廣泛使用及其在各種健康應用中推廣的潛力而備受關注。在本文中，我們介紹 Pulse-PPG，這是第一個開放原始碼 PPG 基礎模型，專門針對在為期 100 天的現場研究中收集的 120 位參與者的原始 PPG 資料進行訓練。現有的 PPG 基礎模型要不是開放原始碼，但訓練於臨床資料，不然就是閉源，這限制了它們在真實世界中的應用性。我們評估了 Pulse-PPG 在多個資料集和下游任務中的表現，並將其效能與訓練於臨床資料的最新基礎模型進行比較。我們的結果表明，訓練於未整理現場資料的 Pulse-PPG 在實驗室和現場環境中，在臨床和行動健康應用中展現出優異的泛化能力。這表明接觸真實世界的變異性使模型能夠學習細粒度的表示，使其更能適應各種任務。此外，令人驚訝的是，現場資料的預訓練在許多任務中優於臨床資料的預訓練，這強化了在真實世界、多樣化的資料集上訓練的重要性。為了鼓勵在利用現場資料的強健基礎模型方面進一步發展，我們計畫發布 Pulse-PPG，為研究人員提供一個強大的資源，用於開發更具泛化性的基於 PPG 的模型。
+With the continuous development of natural language processing (NLP)
+technology, text classification tasks have been widely used in multiple
+application fields. However, obtaining labeled data is often expensive and
+difficult, especially in few-shot learning scenarios. To solve this problem,
+this paper proposes a few-shot text classification model based on transfer
+learning and meta-learning. The model uses the knowledge of the pre-trained
+model for transfer and optimizes the model's rapid adaptability in few-sample
+tasks through a meta-learning mechanism. Through a series of comparative
+experiments and ablation experiments, we verified the effectiveness of the
+proposed method. The experimental results show that under the conditions of few
+samples and medium samples, the model based on transfer learning and
+meta-learning significantly outperforms traditional machine learning and deep
+learning methods. In addition, ablation experiments further analyzed the
+contribution of each component to the model performance and confirmed the key
+role of transfer learning and meta-learning in improving model accuracy.
+Finally, this paper discusses future research directions and looks forward to
+the potential of this method in practical applications.
 
-##### **Tutorial on Using Machine Learning and Deep Learning Models for Mental Illness Detection**
-2502.04342v1 by Yeyubei Zhang, Zhongyan Wang, Zhanyi Ding, Yexin Tian, Jianglai Dai, Xiaorui Shen, Yunchong Liu, Yuchen Cao
+摘要：隨著自然語言處理 (NLP) 技術的持續發展，文本分類任務已廣泛應用於多個應用領域。然而，獲取標記資料通常既昂貴又困難，特別是在小樣本學習場景中。為了解決這個問題，本文提出了一個基於遷移學習和元學習的少樣本文本分類模型。該模型利用預訓練模型的知識進行遷移，並透過元學習機制最佳化模型在少樣本任務中的快速適應性。透過一系列的比較實驗和消融實驗，我們驗證了所提出方法的有效性。實驗結果表明，在少樣本和中等樣本的條件下，基於遷移學習和元學習的模型明顯優於傳統機器學習和深度學習方法。此外，消融實驗進一步分析了各個組成部分對模型效能的貢獻，並確認了遷移學習和元學習在提升模型準確度中的關鍵作用。最後，本文探討了未來的研究方向，並期待此方法在實際應用中的潛力。
 
-Social media has become an important source for understanding mental health,
-providing researchers with a way to detect conditions like depression from
-user-generated posts. This tutorial provides practical guidance to address
-common challenges in applying machine learning and deep learning methods for
-mental health detection on these platforms. It focuses on strategies for
-working with diverse datasets, improving text preprocessing, and addressing
-issues such as imbalanced data and model evaluation. Real-world examples and
-step-by-step instructions demonstrate how to apply these techniques
-effectively, with an emphasis on transparency, reproducibility, and ethical
-considerations. By sharing these approaches, this tutorial aims to help
-researchers build more reliable and widely applicable models for mental health
-research, contributing to better tools for early detection and intervention.
+##### **Show Me the Work: Fact-Checkers' Requirements for Explainable Automated Fact-Checking**
+2502.09083v1 by Greta Warren, Irina Shklovski, Isabelle Augenstein
 
-摘要：社群媒體已成為了解心理健康的重要來源，
-為研究人員提供一種方式，從使用者發布的貼文中偵測憂鬱症等狀況。
-本教學提供實務指南，說明如何處理在這些平台上使用機器學習和深度學習方法進行心理健康偵測時常見的挑戰。
-它專注於處理不同資料集、改善文字前處理，以及處理不平衡資料和模型評估等問題的策略。
-實際範例和逐步說明示範如何有效應用這些技術，並強調透明度、可複製性，以及倫理考量。
-透過分享這些方法，本教學指南旨在協助研究人員建構更可靠且廣泛適用的心理健康研究模型，
-進而有助於早期偵測和介入的工具。
+The pervasiveness of large language models and generative AI in online media
+has amplified the need for effective automated fact-checking to assist
+fact-checkers in tackling the increasing volume and sophistication of
+misinformation. The complex nature of fact-checking demands that automated
+fact-checking systems provide explanations that enable fact-checkers to
+scrutinise their outputs. However, it is unclear how these explanations should
+align with the decision-making and reasoning processes of fact-checkers to be
+effectively integrated into their workflows. Through semi-structured interviews
+with fact-checking professionals, we bridge this gap by: (i) providing an
+account of how fact-checkers assess evidence, make decisions, and explain their
+processes; (ii) examining how fact-checkers use automated tools in practice;
+and (iii) identifying fact-checker explanation requirements for automated
+fact-checking tools. The findings show unmet explanation needs and identify
+important criteria for replicable fact-checking explanations that trace the
+model's reasoning path, reference specific evidence, and highlight uncertainty
+and information gaps.
 
-##### **Agent-Based Uncertainty Awareness Improves Automated Radiology Report Labeling with an Open-Source Large Language Model**
-2502.01691v1 by Hadas Ben-Atya, Naama Gavrielov, Zvi Badash, Gili Focht, Ruth Cytter-Kuint, Talar Hagopian, Dan Turner, Moti Freiman
+摘要：大型語言模型和生成式 AI 在線上媒體的普及
+放大了對有效自動查核事實的需求，以協助查核員應對日益增加的錯誤資訊量和複雜性。查核事實的複雜性質要求自動查核事實系統提供說明，讓查核員能夠仔細審查他們的輸出。然而，目前尚不清楚這些說明應如何與查核員的決策制定和推理過程保持一致，才能有效整合到他們的流程中。透過與查核事實專業人士進行半結構式訪談，我們透過以下方式彌補這個差距：(i) 提供查核員如何評估證據、做出決策和解釋其流程的說明；(ii) 檢視查核員如何實際使用自動化工具；以及 (iii) 找出查核員對自動查核事實工具的說明需求。研究結果顯示未滿足的說明需求，並找出可複製查核事實說明的重要準則，這些準則追蹤模型的推理路徑、參考具體證據，並強調不確定性和資訊差距。
 
-Reliable extraction of structured data from radiology reports using Large
-Language Models (LLMs) remains challenging, especially for complex, non-English
-texts like Hebrew. This study introduces an agent-based uncertainty-aware
-approach to improve the trustworthiness of LLM predictions in medical
-applications. We analyzed 9,683 Hebrew radiology reports from Crohn's disease
-patients (from 2010 to 2023) across three medical centers. A subset of 512
-reports was manually annotated for six gastrointestinal organs and 15
-pathological findings, while the remaining reports were automatically annotated
-using HSMP-BERT. Structured data extraction was performed using Llama 3.1
-(Llama 3-8b-instruct) with Bayesian Prompt Ensembles (BayesPE), which employed
-six semantically equivalent prompts to estimate uncertainty. An Agent-Based
-Decision Model integrated multiple prompt outputs into five confidence levels
-for calibrated uncertainty and was compared against three entropy-based models.
-Performance was evaluated using accuracy, F1 score, precision, recall, and
-Cohen's Kappa before and after filtering high-uncertainty cases. The
-agent-based model outperformed the baseline across all metrics, achieving an F1
-score of 0.3967, recall of 0.6437, and Cohen's Kappa of 0.3006. After filtering
-high-uncertainty cases (greater than or equal to 0.5), the F1 score improved to
-0.4787, and Kappa increased to 0.4258. Uncertainty histograms demonstrated
-clear separation between correct and incorrect predictions, with the
-agent-based model providing the most well-calibrated uncertainty estimates. By
-incorporating uncertainty-aware prompt ensembles and an agent-based decision
-model, this approach enhances the performance and reliability of LLMs in
-structured data extraction from radiology reports, offering a more
-interpretable and trustworthy solution for high-stakes medical applications.
+##### **CoSER: Coordinating LLM-Based Persona Simulation of Established Roles**
+2502.09082v1 by Xintao Wang, Heng Wang, Yifei Zhang, Xinfeng Yuan, Rui Xu, Jen-tse Huang, Siyu Yuan, Haoran Guo, Jiangjie Chen, Wei Wang, Yanghua Xiao, Shuchang Zhou
 
-摘要：<paragraph>使用大型語言模型 (LLM) 從放射科報告中可靠地提取結構化數據仍然具有挑戰性，尤其是對於希伯來語等複雜的非英語文本。本研究引入了一種基於代理的不確定性感知方法，以提高 LLM 預測在醫療應用中的可信度。我們分析了來自三個醫療中心的 9,683 份克隆氏症患者的希伯來語放射科報告（從 2010 年到 2023 年）。其中 512 份報告的手動註釋包括六個胃腸器官和 15 個病理發現，而其餘報告則使用 HSMP-BERT 自動註釋。結構化數據提取使用 Llama 3.1（Llama 3-8b-instruct）與貝葉斯提示集合（BayesPE）進行，它採用六個語義等效提示來估計不確定性。基於代理的決策模型將多個提示輸出整合到五個置信度級別中以校準不確定性，並與三個基於熵的模型進行比較。在過濾掉高度不確定性的情況之前和之後，使用準確度、F1 分數、精確度、召回率和 Cohen's Kappa 評估性能。基於代理的模型在所有指標上都優於基線，F1 分數達到 0.3967，召回率達到 0.6437，Cohen's Kappa 達到 0.3006。在過濾掉高度不確定性的情況（大於或等於 0.5）後，F1 分數提高到 0.4787，Kappa 提高到 0.4258。不確定性直方圖顯示了正確預測和不正確預測之間的明顯區別，基於代理的模型提供了校準最好的不確定性估計。通過結合不確定性感知提示集合和基於代理的決策模型，這種方法增強了 LLM 在放射科報告中結構化數據提取中的性能和可靠性，為高風險醫療應用提供了更具可解釋性和可信度的解決方案。</paragraph>
+Role-playing language agents (RPLAs) have emerged as promising applications
+of large language models (LLMs). However, simulating established characters
+presents a challenging task for RPLAs, due to the lack of authentic character
+datasets and nuanced evaluation methods using such data. In this paper, we
+present CoSER, a collection of a high-quality dataset, open models, and an
+evaluation protocol towards effective RPLAs of established characters. The
+CoSER dataset covers 17,966 characters from 771 renowned books. It provides
+authentic dialogues with real-world intricacies, as well as diverse data types
+such as conversation setups, character experiences and internal thoughts.
+Drawing from acting methodology, we introduce given-circumstance acting for
+training and evaluating role-playing LLMs, where LLMs sequentially portray
+multiple characters in book scenes. Using our dataset, we develop CoSER 8B and
+CoSER 70B, i.e., advanced open role-playing LLMs built on LLaMA-3.1 models.
+Extensive experiments demonstrate the value of the CoSER dataset for RPLA
+training, evaluation and retrieval. Moreover, CoSER 70B exhibits
+state-of-the-art performance surpassing or matching GPT-4o on our evaluation
+and three existing benchmarks, i.e., achieving 75.80% and 93.47% accuracy on
+the InCharacter and LifeChoice benchmarks respectively.
 
-##### **Automated Extraction of Spatio-Semantic Graphs for Identifying Cognitive Impairment**
-2502.01685v1 by Si-Ioi Ng, Pranav S. Ambadi, Kimberly D. Mueller, Julie Liss, Visar Berisha
+摘要：角色扮演語言代理（RPLA）已成為大型語言模型（LLM）的有前途的應用。然而，由於缺乏真實角色資料集和使用此類資料的細緻評估方法，模擬既有角色對 RPLA 來說是一項具有挑戰性的任務。在本文中，我們提出了 CoSER，這是一個高品質資料集、開放模型和評估協議的集合，用於有效地扮演既有角色的 RPLA。CoSER 資料集涵蓋了來自 771 本著名書籍的 17,966 個角色。它提供了具有真實世界複雜性的真實對話，以及對話設定、角色體驗和內心想法等多種資料類型。借鑑表演方法，我們引入了既定情境表演，用於訓練和評估角色扮演 LLM，其中 LLM 在書籍場景中依次扮演多個角色。使用我們的資料集，我們開發了 CoSER 8B 和 CoSER 70B，即建立在 LLaMA-3.1 模型上的先進開放角色扮演 LLM。大量的實驗證明了 CoSER 資料集對於 RPLA 訓練、評估和檢索的價值。此外，CoSER 70B 在我們的評估和三個現有基準上展現了超越或匹配 GPT-4o 的最先進效能，即分別在 InCharacter 和 LifeChoice 基準上達到了 75.80% 和 93.47% 的準確率。
 
-Existing methods for analyzing linguistic content from picture descriptions
-for assessment of cognitive-linguistic impairment often overlook the
-participant's visual narrative path, which typically requires eye tracking to
-assess. Spatio-semantic graphs are a useful tool for analyzing this narrative
-path from transcripts alone, however they are limited by the need for manual
-tagging of content information units (CIUs). In this paper, we propose an
-automated approach for estimation of spatio-semantic graphs (via automated
-extraction of CIUs) from the Cookie Theft picture commonly used in
-cognitive-linguistic analyses. The method enables the automatic
-characterization of the visual semantic path during picture description.
-Experiments demonstrate that the automatic spatio-semantic graphs effectively
-differentiate between cognitively impaired and unimpaired speakers. Statistical
-analyses reveal that the features derived by the automated method produce
-comparable results to the manual method, with even greater group differences
-between clinical groups of interest. These results highlight the potential of
-the automated approach for extracting spatio-semantic features in developing
-clinical speech models for cognitive impairment assessment.
+##### **Enhancing RAG with Active Learning on Conversation Records: Reject Incapables and Answer Capables**
+2502.09073v1 by Xuzhao Geng, Haozhao Wang, Jun Wang, Wei Liu, Ruixuan Li
 
-摘要：現有的用於分析圖像描述中的語言內容的方法，用於評估認知語言障礙，通常會忽略參與者的視覺敘事路徑，這通常需要眼球追蹤來評估。時空語義圖是一種有用的工具，可以僅從轉錄本中分析此敘事路徑，但是它們受到手動標記內容資訊單元 (CIU) 的需求所限制。在本文中，我們提出了一種自動化方法，用於從認知語言分析中常用的 Cookie Theft 圖像估計時空語義圖（通過自動提取 CIU）。該方法能夠自動表徵圖片描述期間的視覺語義路徑。實驗表明，自動時空語義圖有效地區分了認知受損和未受損的說話者。統計分析表明，自動化方法衍生的特徵產生了與手動方法相當的結果，甚至在感興趣的臨床組之間產生了更大的組差異。這些結果突出了自動化方法在提取時空語義特徵以開發用於認知障礙評估的臨床語音模型方面的潛力。
+Retrieval-augmented generation (RAG) is a key technique for leveraging
+external knowledge and reducing hallucinations in large language models (LLMs).
+However, RAG still struggles to fully prevent hallucinated responses. To
+address this, it is essential to identify samples prone to hallucination or
+guide LLMs toward correct responses, which experts then annotate to develop
+high-quality datasets for refining LLMs. However, the growing scarcity of such
+datasets makes their creation challenging. This paper proposes using the vast
+amount of conversations from widespread LLM usage to build these datasets,
+training LLMs to avoid hallucination-prone questions while accurately
+responding to manageable ones. Given the impracticality of expert-annotating
+all conversation records, the paper introduces AL4RAG, which uses active
+learning to select the most suitable conversation samples for annotation,
+optimizing performance within an annotation budget. Additionally, recognizing
+that traditional active learning methods are not fully compatible with RAG due
+to unsuitable distance metrics, we develop a novel sample distance measurement
+for RAG active learning. Extensive experiments show that our method
+consistently outperforms baselines across multiple metrics.
 
-##### **Registration-Enhanced Segmentation Method for Prostate Cancer in Ultrasound Images**
-2502.00712v1 by Shengtian Sang, Hassan Jahanandish, Cynthia Xinran Li, Indrani Bhattachary, Jeong Hoon Lee, Lichun Zhang, Sulaiman Vesal, Pejman Ghanouni, Richard Fan, Geoffrey A. Sonn, Mirabela Rusu
+摘要：檢索增強生成 (RAG) 是一種關鍵技術，用於利用外部知識並減少大型語言模型 (LLM) 中的幻覺。然而，RAG 仍難以完全防止幻覺反應。為了解決這個問題，必須找出容易產生幻覺的範例，或引導 LLM 朝向正確的反應，然後由專家註解以開發用於精煉 LLM 的高品質資料集。然而，此類資料集日益稀少，使得其建立極具挑戰性。本文提出使用來自廣泛 LLM 使用的大量對話來建立這些資料集，訓練 LLM 以避免容易產生幻覺的問題，同時準確回應可管理的問題。鑑於由專家為所有對話記錄加上註解並不切實際，本文引入了 AL4RAG，它使用主動學習來選擇最適合註解的對話範例，在註解預算內最佳化效能。此外，認識到傳統主動學習方法由於不適當的距離度量而無法與 RAG 完全相容，我們為 RAG 主動學習開發了一種新穎的範例距離度量。廣泛的實驗表明，我們的模型在多種度量標準上始終優於基準。
 
-Prostate cancer is a major cause of cancer-related deaths in men, where early
-detection greatly improves survival rates. Although MRI-TRUS fusion biopsy
-offers superior accuracy by combining MRI's detailed visualization with TRUS's
-real-time guidance, it is a complex and time-intensive procedure that relies
-heavily on manual annotations, leading to potential errors. To address these
-challenges, we propose a fully automatic MRI-TRUS fusion-based segmentation
-method that identifies prostate tumors directly in TRUS images without
-requiring manual annotations. Unlike traditional multimodal fusion approaches
-that rely on naive data concatenation, our method integrates a
-registration-segmentation framework to align and leverage spatial information
-between MRI and TRUS modalities. This alignment enhances segmentation accuracy
-and reduces reliance on manual effort. Our approach was validated on a dataset
-of 1,747 patients from Stanford Hospital, achieving an average Dice coefficient
-of 0.212, outperforming TRUS-only (0.117) and naive MRI-TRUS fusion (0.132)
-methods, with significant improvements (p $<$ 0.01). This framework
-demonstrates the potential for reducing the complexity of prostate cancer
-diagnosis and provides a flexible architecture applicable to other multimodal
-medical imaging tasks.
+##### **An Open Recipe: Adapting Language-Specific LLMs to a Reasoning Model in One Day via Model Merging**
+2502.09056v1 by Kunat Pipatanakul, Pittawat Taveekitworachai, Potsawee Manakul, Kasima Tharnpipitchai
 
-摘要：前列腺癌是男性癌症相關死亡的主要原因，早期發現可大幅提升存活率。儘管 MRI-TRUS 融合切片檢查結合了 MRI 的詳細視覺化與 TRUS 的即時導引，可提供更高的準確度，但它是一種仰賴大量手動註解的複雜且耗時的程序，容易導致錯誤。為了解決這些挑戰，我們提出了一種全自動的 MRI-TRUS 融合式分割方法，它可以在 TRUS 影像中直接辨識出前列腺腫瘤，而不需要手動註解。與依賴於天真資料串接的傳統多模態融合方法不同，我們的方法整合了一個配準分割架構，以對齊並利用 MRI 與 TRUS 模態之間的空間資訊。這種對齊提升了分割準確度，並減少了對手動作業的依賴。我們的方法已通過來自 Stanford 醫院的 1,747 位患者的資料集進行驗證，達到了 0.212 的平均 Dice 係數，優於僅使用 TRUS (0.117) 和天真的 MRI-TRUS 融合 (0.132) 方法，並有顯著的改善（p < 0.01）。這個架構證明了降低前列腺癌診斷複雜性的潛力，並提供了一個適用於其他多模態醫學影像任務的彈性架構。
+This paper investigates data selection and model merging methodologies aimed
+at incorporating advanced reasoning capabilities such as those of DeepSeek R1
+into language-specific large language models (LLMs), with a particular focus on
+the Thai LLM. Our goal is to enhance the reasoning capabilities of
+language-specific LLMs while maintaining their target language abilities.
+DeepSeek R1 excels in reasoning but primarily benefits high-resource languages
+such as English and Chinese. However, low-resource languages remain underserved
+due to the dominance of English-centric training data and model optimizations,
+which limit performance in these languages. This limitation results in
+unreliable code-switching and diminished effectiveness on tasks in low-resource
+languages. Meanwhile, local and regional LLM initiatives have attempted to
+bridge this gap by developing language-specific LLMs that focus on improving
+local linguistic fidelity. We demonstrate that, with only publicly available
+datasets and a computational budget of $120, it is possible to enhance the
+reasoning capabilities of language-specific LLMs to match the level of DeepSeek
+R1, without compromising their performance on target language tasks.
 
-##### **TMI-CLNet: Triple-Modal Interaction Network for Chronic Liver Disease Prognosis From Imaging, Clinical, and Radiomic Data Fusion**
-2502.00695v1 by Linglong Wu, Xuhao Shan, Ruiquan Ge, Ruoyu Liang, Chi Zhang, Yonghong Li, Ahmed Elazab, Huoling Luo, Yunbi Liu, Changmiao Wang
+摘要：本文探討資料選取與模型合併方法，旨在將深度搜尋 R1 等先進推理能力整合至特定語言的大型語言模型 (LLM)，特別著重於泰語 LLM。我們的目標是提升特定語言 LLM 的推理能力，同時維持其目標語言能力。深度搜尋 R1 在推理方面表現出色，但主要受益於英語和中文等資源豐富的語言。然而，由於以英語為中心的訓練資料和模型最佳化佔據主導地位，資源貧乏的語言仍未獲得充分服務，這限制了這些語言的效能。此限制導致不可靠的代碼切換，並降低了資源貧乏語言任務的效能。與此同時，在地區 LLM 計畫已嘗試透過開發專注於改善在地語言忠實度的特定語言 LLM 來彌合此差距。我們證明，僅使用公開可用的資料集和 120 美元的運算預算，即可提升特定語言 LLM 的推理能力，使其達到深度搜尋 R1 的水準，同時不損及它們在目標語言任務上的效能。
 
-Chronic liver disease represents a significant health challenge worldwide and
-accurate prognostic evaluations are essential for personalized treatment plans.
-Recent evidence suggests that integrating multimodal data, such as computed
-tomography imaging, radiomic features, and clinical information, can provide
-more comprehensive prognostic information. However, modalities have an inherent
-heterogeneity, and incorporating additional modalities may exacerbate the
-challenges of heterogeneous data fusion. Moreover, existing multimodal fusion
-methods often struggle to adapt to richer medical modalities, making it
-difficult to capture inter-modal relationships. To overcome these limitations,
-We present the Triple-Modal Interaction Chronic Liver Network (TMI-CLNet).
-Specifically, we develop an Intra-Modality Aggregation module and a
-Triple-Modal Cross-Attention Fusion module, which are designed to eliminate
-intra-modality redundancy and extract cross-modal information, respectively.
-Furthermore, we design a Triple-Modal Feature Fusion loss function to align
-feature representations across modalities. Extensive experiments on the liver
-prognosis dataset demonstrate that our approach significantly outperforms
-existing state-of-the-art unimodal models and other multi-modal techniques. Our
-code is available at https://github.com/Mysterwll/liver.git.
+##### **Cost-Saving LLM Cascades with Early Abstention**
+2502.09054v1 by Michael J. Zellinger, Rex Liu, Matt Thomson
 
-摘要：慢性肝病在全球范围内代表著重大的健康挑戰，而準確的預後評估對於個人化治療計畫至關重要。最近的證據表明，整合多模態資料（例如電腦斷層影像、放射特徵和臨床資訊）可以提供更全面的預後資訊。然而，模態具有內在異質性，而納入額外的模態可能會加劇異質化資料融合的挑戰。此外，現有的多模態融合方法通常難以適應更豐富的醫療模態，這使得難以捕捉模態間的關係。為了克服這些限制，我們提出了三模態交互慢性肝臟網路 (TMI-CLNet)。具體來說，我們開發了一個模態內聚合模組和一個三模態交叉注意力融合模組，它們分別旨在消除模態內冗餘和提取跨模態資訊。此外，我們設計了一個三模態特徵融合損失函數，以對齊跨模態的特徵表示。在肝臟預後資料集上的廣泛實驗表明，我們的做法顯著優於現有的最先進單模態模型和其他多模態技術。我們的程式碼可以在 https://github.com/Mysterwll/liver.git 上取得。
+LLM cascades are based on the idea that processing all queries with the
+largest and most expensive LLMs is inefficient. Instead, cascades deploy small
+LLMs to answer the majority of queries, limiting the use of large and expensive
+LLMs to only the most difficult queries. This approach can significantly reduce
+costs without impacting performance. However, risk-sensitive domains such as
+finance or medicine place an additional premium on avoiding model errors.
+Recognizing that even the most expensive models may make mistakes, applications
+in these domains benefit from allowing LLM systems to completely abstain from
+answering a query when the chance of making a mistake is significant. However,
+giving a cascade the ability to abstain poses an immediate design question for
+LLM cascades: should abstention only be allowed at the final model or also at
+earlier models? Since the error patterns of small and large models are
+correlated, the latter strategy may further reduce inference costs by letting
+inexpensive models anticipate abstention decisions by expensive models, thereby
+obviating the need to run the expensive models. We investigate the benefits of
+"early abstention" in LLM cascades and find that it reduces the overall test
+loss by 2.2% on average across six benchmarks (GSM8K, MedMCQA, MMLU, TriviaQA,
+TruthfulQA, and XSum). These gains result from a more effective use of
+abstention, which trades a 4.1% average increase in the overall abstention rate
+for a 13.0% reduction in cost and a 5.0% reduction in error rate. Our findings
+demonstrate that it is possible to leverage correlations between the error
+patterns of different language models to drive performance improvements for LLM
+systems with abstention.
 
-##### **Safety at Scale: A Comprehensive Survey of Large Model Safety**
-2502.05206v2 by Xingjun Ma, Yifeng Gao, Yixu Wang, Ruofan Wang, Xin Wang, Ye Sun, Yifan Ding, Hengyuan Xu, Yunhao Chen, Yunhan Zhao, Hanxun Huang, Yige Li, Jiaming Zhang, Xiang Zheng, Yang Bai, Zuxuan Wu, Xipeng Qiu, Jingfeng Zhang, Yiming Li, Jun Sun, Cong Wang, Jindong Gu, Baoyuan Wu, Siheng Chen, Tianwei Zhang, Yang Liu, Mingming Gong, Tongliang Liu, Shirui Pan, Cihang Xie, Tianyu Pang, Yinpeng Dong, Ruoxi Jia, Yang Zhang, Shiqing Ma, Xiangyu Zhang, Neil Gong, Chaowei Xiao, Sarah Erfani, Bo Li, Masashi Sugiyama, Dacheng Tao, James Bailey, Yu-Gang Jiang
+摘要：<paragraph>LLM 級聯基於以下概念：使用最大且最昂貴的 LLM 處理所有查詢效率低下。相反，級聯會部署小型 LLM 來回答大部分查詢，將大型且昂貴的 LLM 的使用限制在最困難的查詢上。這種方法可以大幅降低成本，而不會影響效能。然而，像金融或醫學等對風險敏感的領域會額外重視避免模型錯誤。認識到即使是最昂貴的模型也可能會出錯，在這些領域中的應用程式可受益於允許 LLM 系統在出錯機率很大的情況下完全不回答查詢。然而，賦予級聯不回答的能力會對 LLM 級聯提出立即的設計問題：是否只允許在最終模型中不回答，還是也在較早的模型中不回答？由於小型和大型模型的錯誤模式相關，後一種策略可以讓便宜的模型預測昂貴模型的不回答決策，進而降低推論成本，從而避免執行昂貴的模型。我們調查了 LLM 級聯中「早期不回答」的好處，並發現它平均降低了六個基準測試（GSM8K、MedMCQA、MMLU、TriviaQA、TruthfulQA 和 XSum）的整體測試損失 2.2%。這些收益來自於更有效地使用不回答，以整體不回答率平均增加 4.1% 的代價換取成本降低 13.0% 和錯誤率降低 5.0%。我們的研究結果證明，可以利用不同語言模型的錯誤模式之間的關聯性，來推動具有不回答功能的 LLM 系統的效能改進。</paragraph>
 
-The rapid advancement of large models, driven by their exceptional abilities
-in learning and generalization through large-scale pre-training, has reshaped
-the landscape of Artificial Intelligence (AI). These models are now
-foundational to a wide range of applications, including conversational AI,
-recommendation systems, autonomous driving, content generation, medical
-diagnostics, and scientific discovery. However, their widespread deployment
-also exposes them to significant safety risks, raising concerns about
-robustness, reliability, and ethical implications. This survey provides a
-systematic review of current safety research on large models, covering Vision
-Foundation Models (VFMs), Large Language Models (LLMs), Vision-Language
-Pre-training (VLP) models, Vision-Language Models (VLMs), Diffusion Models
-(DMs), and large-model-based Agents. Our contributions are summarized as
-follows: (1) We present a comprehensive taxonomy of safety threats to these
-models, including adversarial attacks, data poisoning, backdoor attacks,
-jailbreak and prompt injection attacks, energy-latency attacks, data and model
-extraction attacks, and emerging agent-specific threats. (2) We review defense
-strategies proposed for each type of attacks if available and summarize the
-commonly used datasets and benchmarks for safety research. (3) Building on
-this, we identify and discuss the open challenges in large model safety,
-emphasizing the need for comprehensive safety evaluations, scalable and
-effective defense mechanisms, and sustainable data practices. More importantly,
-we highlight the necessity of collective efforts from the research community
-and international collaboration. Our work can serve as a useful reference for
-researchers and practitioners, fostering the ongoing development of
-comprehensive defense systems and platforms to safeguard AI models.
+##### **Game Theory Meets Large Language Models: A Systematic Survey**
+2502.09053v1 by Haoran Sun, Yusen Wu, Yukun Cheng, Xu Chu
 
-摘要：<paragraph>大型模型的快速進展，得益於它們在通過大規模預訓練進行學習和概括方面的卓越能力，已經重塑了人工智能 (AI) 的格局。這些模型現在是廣泛應用程式（包括對話式 AI、推薦系統、自動駕駛、內容生成、醫療診斷和科學發現）的基礎。然而，它們的廣泛部署也使它們面臨重大的安全風險，引發了對穩健性、可靠性和倫理影響的擔憂。本調查提供了對大型模型當前安全研究的系統性回顧，涵蓋視覺基礎模型 (VFM)、大型語言模型 (LLM)、視覺語言預訓練 (VLP) 模型、視覺語言模型 (VLM)、擴散模型 (DM) 和基於大型模型的代理。我們的貢獻總結如下：(1) 我們提出了一個針對這些模型的安全威脅的全面分類，包括對抗性攻擊、資料中毒、後門攻擊、越獄和提示注入攻擊、能量延遲攻擊、資料和模型提取攻擊以及新興的特定代理威脅。(2) 我們檢視了針對每種類型攻擊提出的防禦策略（如果有的話），並總結了安全研究中常用的資料集和基準。(3) 基於此，我們找出並討論了大型模型安全中的開放性挑戰，強調了對全面安全評估、可擴充且有效的防禦機制以及永續資料實務的需求。更重要的是，我們強調了研究社群和國際合作共同努力的必要性。我們的研究可作為研究人員和從業人員的有用參考，促進全面防禦系統和平台的持續發展，以保護 AI 模型。</paragraph>
+Game theory establishes a fundamental framework for analyzing strategic
+interactions among rational decision-makers. The rapid advancement of large
+language models (LLMs) has sparked extensive research exploring the
+intersection of these two fields. Specifically, game-theoretic methods are
+being applied to evaluate and enhance LLM capabilities, while LLMs themselves
+are reshaping classic game models. This paper presents a comprehensive survey
+of the intersection of these fields, exploring a bidirectional relationship
+from three perspectives: (1) Establishing standardized game-based benchmarks
+for evaluating LLM behavior; (2) Leveraging game-theoretic methods to improve
+LLM performance through algorithmic innovations; (3) Characterizing the
+societal impacts of LLMs through game modeling. Among these three aspects, we
+also highlight how the equilibrium analysis for traditional game models is
+impacted by LLMs' advanced language understanding, which in turn extends the
+study of game theory. Finally, we identify key challenges and future research
+directions, assessing their feasibility based on the current state of the
+field. By bridging theoretical rigor with emerging AI capabilities, this survey
+aims to foster interdisciplinary collaboration and drive progress in this
+evolving research area.
 
-##### **Enhanced Convolutional Neural Networks for Improved Image Classification**
-2502.00663v1 by Xiaoran Yang, Shuhan Yu, Wenxi Xu
+摘要：博弈論建立一個基本架構，用來分析理性決策者之間的策略互動。大型語言模型 (LLM) 的快速進展，激發了廣泛的研究，探討這兩個領域的交集。具體來說，博弈論方法被應用於評估和增強 LLM 能力，而 LLM 本身正在重塑經典博弈模型。本文對這些領域的交集進行了全面的調查，從三個角度探討了雙向關係：(1) 建立標準化的基於博弈的基準，用於評估 LLM 行為；(2) 利用博弈論方法，通過演算法創新來改善 LLM 效能；(3) 透過博弈模型，描述 LLM 對社會的影響。在這三個方面中，我們還強調了 LLM 的先進語言理解如何影響傳統博弈模型的均衡分析，這反過來又擴展了博弈論的研究。最後，我們找出關鍵挑戰和未來的研究方向，根據該領域的現狀評估其可行性。透過將理論嚴謹性與新興的 AI 能力相結合，這項調查旨在促進跨學科合作，並推動這個不斷演變的研究領域的進展。
 
-Image classification is a fundamental task in computer vision with diverse
-applications, ranging from autonomous systems to medical imaging. The CIFAR-10
-dataset is a widely used benchmark to evaluate the performance of
-classification models on small-scale, multi-class datasets. Convolutional
-Neural Networks (CNNs) have demonstrated state-of-the-art results; however,
-they often suffer from overfitting and suboptimal feature representation when
-applied to challenging datasets like CIFAR-10. In this paper, we propose an
-enhanced CNN architecture that integrates deeper convolutional blocks, batch
-normalization, and dropout regularization to achieve superior performance. The
-proposed model achieves a test accuracy of 84.95%, outperforming baseline CNN
-architectures. Through detailed ablation studies, we demonstrate the
-effectiveness of the enhancements and analyze the hierarchical feature
-representations. This work highlights the potential of refined CNN
-architectures for tackling small-scale image classification problems
-effectively.
+##### **AIDE: Agentically Improve Visual Language Model with Domain Experts**
+2502.09051v1 by Ming-Chang Chiu, Fuxiao Liu, Karan Sapra, Andrew Tao, Yaser Jacoob, Xuezhe Ma, Zhiding Yu, Guilin Liu
 
-摘要：影像分類是電腦視覺中的一項基本任務，應用範圍廣泛，從自動系統到醫學影像皆有。CIFAR-10 資料集是一個廣泛使用的基準，用於評估分類模型在小規模、多類別資料集上的效能。卷積神經網路 (CNN) 已展現出最先進的成果；然而，當應用於 CIFAR-10 等具挑戰性的資料集時，它們常常會發生過度擬合和次佳特徵表示的問題。在本文中，我們提出一個增強的 CNN 架構，它整合了更深的卷積區塊、批次正規化和中斷正規化，以達成卓越的效能。所提出的模型達到了 84.95% 的測試準確度，優於基準 CNN 架構。透過詳細的消融研究，我們證明了這些增強功能的有效性，並分析了階層式特徵表示。這項工作突顯了精進的 CNN 架構在有效解決小規模影像分類問題上的潛力。
+The enhancement of Visual Language Models (VLMs) has traditionally relied on
+knowledge distillation from larger, more capable models. This dependence
+creates a fundamental bottleneck for improving state-of-the-art systems,
+particularly when no superior models exist. We introduce AIDE (Agentic
+Improvement through Domain Experts), a novel framework that enables VLMs to
+autonomously enhance their capabilities by leveraging specialized domain expert
+models. AIDE operates through a four-stage process: (1) identifying instances
+for refinement, (2) engaging domain experts for targeted analysis, (3)
+synthesizing expert outputs with existing data, and (4) integrating enhanced
+instances into the training pipeline. Experiments on multiple benchmarks,
+including MMMU, MME, MMBench, etc., demonstrate AIDE's ability to achieve
+notable performance gains without relying on larger VLMs nor human supervision.
+Our framework provides a scalable, resource-efficient approach to continuous
+VLM improvement, addressing critical limitations in current methodologies,
+particularly valuable when larger models are unavailable to access.
 
-##### **Distribution-aware Fairness Learning in Medical Image Segmentation From A Control-Theoretic Perspective**
-2502.00619v1 by Yujin Oh, Pengfei Jin, Sangjoon Park, Sekeun Kim, Siyeop Yoon, Kyungsang Kim, Jin Sung Kim, Xiang Li, Quanzheng Li
+摘要：視覺語言模型 (VLM) 的增強傳統上依賴於從更大、功能更強大的模型中進行知識萃取。這種依賴性會造成改善最先進系統的基本瓶頸，尤其在沒有更優越的模型時。我們引進 AIDE（透過領域專家進行代理式改善），一個創新的架構，讓 VLM 能夠透過利用專業的領域專家模型，自主增強其功能。AIDE 透過四階段流程運作：(1) 識別需要改善的實例，(2) 聘請領域專家進行有針對性的分析，(3) 將專家輸出與現有資料綜合，以及 (4) 將增強的實例整合到訓練流程中。在多個基準測試上的實驗，包括 MMMU、MME、MMBench 等，證明了 AIDE 能夠在不依賴更大型的 VLM 或人工監督的情況下，實現顯著的效能提升。我們的架構提供了一個可擴充、資源效率高的持續 VLM 改進方法，解決了當前方法中的關鍵限制，特別是在無法取得大型模型時，這一點特別有價值。
 
-Ensuring fairness in medical image segmentation is critical due to biases in
-imbalanced clinical data acquisition caused by demographic attributes (e.g.,
-age, sex, race) and clinical factors (e.g., disease severity). To address these
-challenges, we introduce Distribution-aware Mixture of Experts (dMoE), inspired
-by optimal control theory. We provide a comprehensive analysis of its
-underlying mechanisms and clarify dMoE's role in adapting to heterogeneous
-distributions in medical image segmentation. Furthermore, we integrate dMoE
-into multiple network architectures, demonstrating its broad applicability
-across diverse medical image analysis tasks. By incorporating demographic and
-clinical factors, dMoE achieves state-of-the-art performance on two 2D
-benchmark datasets and a 3D in-house dataset. Our results highlight the
-effectiveness of dMoE in mitigating biases from imbalanced distributions,
-offering a promising approach to bridging control theory and medical image
-segmentation within fairness learning paradigms. The source code will be made
-available.
+##### **Leveraging Member-Group Relations via Multi-View Graph Filtering for Effective Group Recommendation**
+2502.09050v1 by Chae-Hyun Kim, Yoon-Ryung Choi, Jin-Duk Park, Won-Yong Shin
 
-摘要：在医学影像分割中，由於人口屬性（例如年齡、性別、種族）和臨床因素（例如疾病嚴重程度）導致不平衡的臨床數據採集中存在偏差，因此確保公平性至關重要。為了應對這些挑戰，我們引入了受最優控制理論啟發的感知混合專家 (dMoE)。我們對其底層機制進行了全面分析，並釐清了 dMoE 在適應醫學影像分割中的異質分佈中的作用。此外，我們將 dMoE 整合到多個網路架構中，展示了其在各種醫學影像分析任務中的廣泛適用性。通過納入人口統計和臨床因素，dMoE 在兩個 2D 基準數據集和一個 3D 內部數據集上實現了最先進的性能。我們的結果突出了 dMoE 在減輕不平衡分佈的偏差方面的有效性，為在公平性學習範例中橋接控制理論和醫學影像分割提供了一個有前景的方法。原始碼將會公開。
+Group recommendation aims at providing optimized recommendations tailored to
+diverse groups, enabling groups to enjoy appropriate items. On the other hand,
+most existing group recommendation methods are built upon deep neural network
+(DNN) architectures designed to capture the intricate relationships between
+member-level and group-level interactions. While these DNN-based approaches
+have proven their effectiveness, they require complex and expensive training
+procedures to incorporate group-level interactions in addition to member-level
+interactions. To overcome such limitations, we introduce Group-GF, a new
+approach for extremely fast recommendations of items to each group via
+multi-view graph filtering (GF) that offers a holistic view of complex
+member-group dynamics, without the need for costly model training.
+Specifically, in Group-GF, we first construct three item similarity graphs
+manifesting different viewpoints for GF. Then, we discover a distinct
+polynomial graph filter for each similarity graph and judiciously aggregate the
+three graph filters. Extensive experiments demonstrate the effectiveness of
+Group-GF in terms of significantly reducing runtime and achieving
+state-of-the-art recommendation accuracy.
 
-##### **Generating crossmodal gene expression from cancer histopathology improves multimodal AI predictions**
-2502.00568v3 by Samiran Dey, Christopher R. S. Banerji, Partha Basuchowdhuri, Sanjoy K. Saha, Deepak Parashar, Tapabrata Chakraborti
+摘要：群組推薦旨在提供針對不同群組量身打造的最佳推薦，讓群組可以享受適當的項目。另一方面，現有的群組推薦方法大多建立在深度神經網路 (DNN) 架構上，旨在捕捉成員層級和群組層級互動之間的複雜關係。雖然這些基於 DNN 的方法已證明其有效性，但它們需要複雜且昂貴的訓練程序，才能在成員層級互動之外納入群組層級互動。為了克服這些限制，我們引入了 Group-GF，這是一種透過多視圖圖形過濾 (GF) 為每個群組提供極快速項目推薦的新方法，它提供了複雜成員群組動態的整體視圖，而無需進行昂貴的模型訓練。具體來說，在 Group-GF 中，我們首先建構三個項目相似度圖形，展現 GF 的不同觀點。然後，我們為每個相似度圖形發現一個不同的多項式圖形過濾器，並明智地彙總這三個圖形過濾器。廣泛的實驗證明了 Group-GF 在顯著減少執行時間和達成最先進的推薦準確度方面的有效性。
 
-Emerging research has highlighted that artificial intelligence based
-multimodal fusion of digital pathology and transcriptomic features can improve
-cancer diagnosis (grading/subtyping) and prognosis (survival risk) prediction.
-However, such direct fusion for joint decision is impractical in real clinical
-settings, where histopathology is still the gold standard for diagnosis and
-transcriptomic tests are rarely requested, at least in the public healthcare
-system. With our novel diffusion based crossmodal generative AI model PathGen,
-we show that genomic expressions synthesized from digital histopathology
-jointly predicts cancer grading and patient survival risk with high accuracy
-(state-of-the-art performance), certainty (through conformal coverage
-guarantee) and interpretability (through distributed attention maps). PathGen
-code is available for open use by the research community through GitHub at
-https://github.com/Samiran-Dey/PathGen.
+##### **Criteria-Aware Graph Filtering: Extremely Fast Yet Accurate Multi-Criteria Recommendation**
+2502.09046v1 by Jin-Duk Park, Jaemin Yoo, Won-Yong Shin
 
-摘要：新興研究強調，基於人工智慧的多模態融合數位病理學和轉錄組特徵，可以改善癌症診斷（分級/分型）和預後（存活風險）預測。
-然而，這種直接融合對於聯合決策在實際臨床環境中並不切實際，在實際臨床環境中，組織病理學仍然是診斷的黃金標準，而轉錄組檢測很少被要求，至少在公共醫療保健系統中是如此。透過我們新穎的基於擴散的跨模態生成式 AI 模型 PathGen，我們展示了從數位組織病理學合成的基因體表達共同預測癌症分級和患者存活風險，具有很高的準確度（最先進的效能）、確定性（透過共形覆蓋保證）和可解釋性（透過分佈式注意力圖）。PathGen 程式碼可透過 GitHub 上的 https://github.com/Samiran-Dey/PathGen 供研究社群公開使用。
+Multi-criteria (MC) recommender systems, which utilize MC rating information
+for recommendation, are increasingly widespread in various e-commerce domains.
+However, the MC recommendation using training-based collaborative filtering,
+requiring consideration of multiple ratings compared to single-criterion
+counterparts, often poses practical challenges in achieving state-of-the-art
+performance along with scalable model training. To solve this problem, we
+propose CA-GF, a training-free MC recommendation method, which is built upon
+criteria-aware graph filtering for efficient yet accurate MC recommendations.
+Specifically, first, we construct an item-item similarity graph using an MC
+user-expansion graph. Next, we design CA-GF composed of the following key
+components, including 1) criterion-specific graph filtering where the optimal
+filter for each criterion is found using various types of polynomial low-pass
+filters and 2) criteria preference-infused aggregation where the smoothed
+signals from each criterion are aggregated. We demonstrate that CA-GF is (a)
+efficient: providing the computational efficiency, offering the extremely fast
+runtime of less than 0.2 seconds even on the largest benchmark dataset, (b)
+accurate: outperforming benchmark MC recommendation methods, achieving
+substantial accuracy gains up to 24% compared to the best competitor, and (c)
+interpretable: providing interpretations for the contribution of each criterion
+to the model prediction based on visualizations.
 
+摘要：多準則 (MC) 推薦系統在各種電子商務領域中日益普及，該系統利用 MC 評分資訊進行推薦。
+然而，與單準則對應項目相比，使用基於訓練的協同過濾的 MC 推薦，通常在達成最先進的效能以及可擴充模型訓練方面造成實務上的挑戰，需要考慮多個評分。為了解決這個問題，我們提出 CA-GF，一種無需訓練的 MC 推薦方法，它建立於準則感知圖形過濾之上，用於有效且準確的 MC 推薦。
+具體來說，首先，我們使用 MC 使用者擴展圖形來建構一個項目相似度圖形。接下來，我們設計 CA-GF，它包含以下關鍵組成部分，包括 1) 準則特定圖形過濾，其中使用各種類型的多項式低通濾波器來找出每個準則的最佳濾波器，以及 2) 準則偏好注入聚合，其中來自每個準則的平滑訊號被聚合。我們證明 CA-GF 是 (a) 有效的：提供運算效率，即使在最大的基準資料集上，也能提供低於 0.2 秒的極快執行時間，(b) 準確的：優於基準 MC 推薦方法，與最佳競爭者相比，獲得高達 24% 的顯著準確性提升，以及 (c) 可解釋的：根據視覺化提供對每個準則對模型預測的貢獻的解釋。
 
-### Knowledge Graphs
-|Publish Date|Title|Authors|Homepage|Code|
-| :---: | :---: | :---: | :---: | :---: |
-|**2025-02-13**|**Visual Graph Question Answering with ASP and LLMs for Language Parsing**|Jakob Johannes Bauer et.al.|[2502.09211v1](http://arxiv.org/abs/2502.09211v1)|null|
-|**2025-02-12**|**Representation Learning to Advance Multi-institutional Studies with Electronic Health Record Data**|Doudou Zhou et.al.|[2502.08547v1](http://arxiv.org/abs/2502.08547v1)|null|
-|**2025-02-12**|**Trustworthy GNNs with LLMs: A Systematic Review and Taxonomy**|Ruizhan Xue et.al.|[2502.08353v1](http://arxiv.org/abs/2502.08353v1)|null|
-|**2025-02-12**|**Graph Foundation Models for Recommendation: A Comprehensive Survey**|Bin Wu et.al.|[2502.08346v1](http://arxiv.org/abs/2502.08346v1)|null|
-|**2025-02-12**|**Self-Evaluation for Job-Shop Scheduling**|Imanol Echeverria et.al.|[2502.08684v1](http://arxiv.org/abs/2502.08684v1)|null|
-|**2025-02-12**|**Improving Existing Optimization Algorithms with LLMs**|Camilo Chacón Sartori et.al.|[2502.08298v1](http://arxiv.org/abs/2502.08298v1)|null|
-|**2025-02-12**|**ACCESS : A Benchmark for Abstract Causal Event Discovery and Reasoning**|Vy Vo et.al.|[2502.08148v1](http://arxiv.org/abs/2502.08148v1)|null|
-|**2025-02-12**|**GCoT: Chain-of-Thought Prompt Learning for Graphs**|Xingtong Yu et.al.|[2502.08092v1](http://arxiv.org/abs/2502.08092v1)|null|
-|**2025-02-11**|**Deep Semantic Graph Learning via LLM based Node Enhancement**|Chuanqi Shi et.al.|[2502.07982v1](http://arxiv.org/abs/2502.07982v1)|null|
-|**2025-02-10**|**Cardiverse: Harnessing LLMs for Novel Card Game Prototyping**|Danrui Li et.al.|[2502.07128v1](http://arxiv.org/abs/2502.07128v1)|null|
-|**2025-02-10**|**GraNNite: Enabling High-Performance Execution of Graph Neural Networks on Resource-Constrained Neural Processing Units**|Arghadip Das et.al.|[2502.06921v2](http://arxiv.org/abs/2502.06921v2)|[link](https://github.com/arghadippurdue/GraNNite)|
-|**2025-02-10**|**Automatic Annotation Augmentation Boosts Translation between Molecules and Natural Language**|Zhiqiang Zhong et.al.|[2502.06634v1](http://arxiv.org/abs/2502.06634v1)|null|
-|**2025-02-10**|**KARMA: Leveraging Multi-Agent LLMs for Automated Knowledge Graph Enrichment**|Yuxing Lu et.al.|[2502.06472v1](http://arxiv.org/abs/2502.06472v1)|null|
-|**2025-02-10**|**RoToR: Towards More Reliable Responses for Order-Invariant Inputs**|Soyoung Yoon et.al.|[2502.08662v1](http://arxiv.org/abs/2502.08662v1)|null|
-|**2025-02-10**|**K-ON: Stacking Knowledge On the Head Layer of Large Language Model**|Lingbing Guo et.al.|[2502.06257v1](http://arxiv.org/abs/2502.06257v1)|null|
-|**2025-02-10**|**LegalViz: Legal Text Visualization by Text To Diagram Generation**|Eri Onami et.al.|[2502.06147v2](http://arxiv.org/abs/2502.06147v2)|null|
-|**2025-02-09**|**Deconstructing Depression Stigma: Integrating AI-driven Data Collection and Analysis with Causal Knowledge Graphs**|Han Meng et.al.|[2502.06075v1](http://arxiv.org/abs/2502.06075v1)|null|
-|**2025-02-09**|**LegalSeg: Unlocking the Structure of Indian Legal Judgments Through Rhetorical Role Classification**|Shubham Kumar Nigam et.al.|[2502.05836v1](http://arxiv.org/abs/2502.05836v1)|null|
-|**2025-02-08**|**LLM-Powered Decentralized Generative Agents with Adaptive Hierarchical Knowledge Graph for Cooperative Planning**|Hanqing Yang et.al.|[2502.05453v1](http://arxiv.org/abs/2502.05453v1)|null|
-|**2025-02-08**|**SAMGPT: Text-free Graph Foundation Model for Multi-domain Pre-training and Cross-domain Adaptation**|Xingtong Yu et.al.|[2502.05424v1](http://arxiv.org/abs/2502.05424v1)|null|
-|**2025-02-08**|**Graph-based Molecular In-context Learning Grounded on Morgan Fingerprints**|Ali Al-Lawati et.al.|[2502.05414v1](http://arxiv.org/abs/2502.05414v1)|null|
-|**2025-02-08**|**Knowledge Graph-Guided Retrieval Augmented Generation**|Xiangrong Zhu et.al.|[2502.06864v1](http://arxiv.org/abs/2502.06864v1)|[link](https://github.com/nju-websoft/KG2RAG)|
-|**2025-02-07**|**Can Large Language Models Understand Intermediate Representations?**|Hailong Jiang et.al.|[2502.06854v1](http://arxiv.org/abs/2502.06854v1)|null|
-|**2025-02-07**|**GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity?**|Yang Zhou et.al.|[2502.05252v1](http://arxiv.org/abs/2502.05252v1)|null|
-|**2025-02-07**|**Causality can systematically address the monsters under the bench(marks)**|Felix Leeb et.al.|[2502.05085v1](http://arxiv.org/abs/2502.05085v1)|null|
-|**2025-02-07**|**Adaptive Graph of Thoughts: Test-Time Adaptive Reasoning Unifying Chain, Tree, and Graph Structures**|Tushar Pandey et.al.|[2502.05078v1](http://arxiv.org/abs/2502.05078v1)|[link](https://github.com/AgnostiqHQ/multi-agent-llm)|
-|**2025-02-07**|**Enhancing Knowledge Graph Construction: Evaluating with Emphasis on Hallucination, Omission, and Graph Similarity Metrics**|Hussam Ghanem et.al.|[2502.05239v1](http://arxiv.org/abs/2502.05239v1)|null|
-|**2025-02-07**|**Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research**|Junde Wu et.al.|[2502.04644v1](http://arxiv.org/abs/2502.04644v1)|[link](https://github.com/theworldofagents/agentic-reasoning)|
-|**2025-02-07**|**Position-aware Automatic Circuit Discovery**|Tal Haklay et.al.|[2502.04577v1](http://arxiv.org/abs/2502.04577v1)|[link](https://github.com/technion-cs-nlp/peap)|
-|**2025-02-06**|**Heterogeneous Swarms: Jointly Optimizing Model Roles and Weights for Multi-LLM Systems**|Shangbin Feng et.al.|[2502.04510v1](http://arxiv.org/abs/2502.04510v1)|null|
-|**2025-02-06**|**MedRAG: Enhancing Retrieval-augmented Generation with Knowledge Graph-Elicited Reasoning for Healthcare Copilot**|Xuejiao Zhao et.al.|[2502.04413v1](http://arxiv.org/abs/2502.04413v1)|[link](https://github.com/snowteam2023/medrag)|
-|**2025-02-06**|**Ontology-Guided, Hybrid Prompt Learning for Generalization in Knowledge Graph Question Answering**|Longquan Jiang et.al.|[2502.03992v1](http://arxiv.org/abs/2502.03992v1)|[link](https://github.com/longquanjiang/ontoscprompt)|
-|**2025-02-06**|**Multimodal Medical Code Tokenizer**|Xiaorui Su et.al.|[2502.04397v2](http://arxiv.org/abs/2502.04397v2)|null|
-|**2025-02-06**|**Division-of-Thoughts: Harnessing Hybrid Language Model Synergy for Efficient On-Device Agents**|Chenyang Shao et.al.|[2502.04392v1](http://arxiv.org/abs/2502.04392v1)|null|
-|**2025-02-06**|**Boosting Knowledge Graph-based Recommendations through Confidence-Aware Augmentation with Large Language Models**|Rui Cai et.al.|[2502.03715v1](http://arxiv.org/abs/2502.03715v1)|null|
-|**2025-02-05**|**A Schema-Guided Reason-while-Retrieve framework for Reasoning on Scene Graphs with Large-Language-Models (LLMs)**|Yiye Chen et.al.|[2502.03450v1](http://arxiv.org/abs/2502.03450v1)|null|
-|**2025-02-05**|**SymAgent: A Neural-Symbolic Self-Learning Agent Framework for Complex Reasoning over Knowledge Graphs**|Ben Liu et.al.|[2502.03283v1](http://arxiv.org/abs/2502.03283v1)|null|
-|**2025-02-05**|**Analyze Feature Flow to Enhance Interpretation and Steering in Language Models**|Daniil Laptev et.al.|[2502.03032v2](http://arxiv.org/abs/2502.03032v2)|null|
-|**2025-02-05**|**A Benchmark for the Detection of Metalinguistic Disagreements between LLMs and Knowledge Graphs**|Bradley P. Allen et.al.|[2502.02896v1](http://arxiv.org/abs/2502.02896v1)|null|
-|**2025-02-05**|**Mol-LLM: Generalist Molecular LLM with Improved Graph Utilization**|Chanhui Lee et.al.|[2502.02810v1](http://arxiv.org/abs/2502.02810v1)|null|
-|**2025-02-05**|**Leveraging the true depth of LLMs**|Ramón Calvo González et.al.|[2502.02790v1](http://arxiv.org/abs/2502.02790v1)|null|
-|**2025-02-04**|**Modular Training of Neural Networks aids Interpretability**|Satvik Golechha et.al.|[2502.02470v2](http://arxiv.org/abs/2502.02470v2)|null|
-|**2025-02-04**|**Premise-Augmented Reasoning Chains Improve Error Identification in Math reasoning with LLMs**|Sagnik Mukherjee et.al.|[2502.02362v3](http://arxiv.org/abs/2502.02362v3)|null|
-|**2025-02-04**|**AdaptBot: Combining LLM with Knowledge Graphs and Human Input for Generic-to-Specific Task Decomposition and Knowledge Refinement**|Shivam Singh et.al.|[2502.02067v1](http://arxiv.org/abs/2502.02067v1)|[link](https://github.com/sssshivvvv/adaptbot)|
-|**2025-02-03**|**On Bob Dylan: A Computational Perspective**|Prashant Garg et.al.|[2502.01772v1](http://arxiv.org/abs/2502.01772v1)|null|
-|**2025-02-03**|**VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos**|Xubin Ren et.al.|[2502.01549v1](http://arxiv.org/abs/2502.01549v1)|null|
-|**2025-02-03**|**Transformers trained on proteins can learn to attend to Euclidean distance**|Isaac Ellmen et.al.|[2502.01533v1](http://arxiv.org/abs/2502.01533v1)|[link](https://github.com/Ellmen/attending-to-distance)|
-|**2025-02-03**|**Common Foundations for SHACL, ShEx, and PG-Schema**|S. Ahmetaj et.al.|[2502.01295v1](http://arxiv.org/abs/2502.01295v1)|null|
-|**2025-02-03**|**GFM-RAG: Graph Foundation Model for Retrieval Augmented Generation**|Linhao Luo et.al.|[2502.01113v1](http://arxiv.org/abs/2502.01113v1)|[link](https://github.com/RManLuo/gfm-rag)|
-|**2025-02-03**|**Knowledge Synthesis of Photosynthesis Research Using a Large Language Model**|Seungri Yoon et.al.|[2502.01059v1](http://arxiv.org/abs/2502.01059v1)|null|
-|**2025-02-03**|**Encrypted Large Model Inference: The Equivariant Encryption Paradigm**|James Buban et.al.|[2502.01013v1](http://arxiv.org/abs/2502.01013v1)|null|
-|**2025-02-02**|**Metastable Dynamics of Chain-of-Thought Reasoning: Provable Benefits of Search, RL and Distillation**|Juno Kim et.al.|[2502.01694v1](http://arxiv.org/abs/2502.01694v1)|null|
-|**2025-02-02**|**PhiP-G: Physics-Guided Text-to-3D Compositional Scene Generation**|Qixuan Li et.al.|[2502.00708v1](http://arxiv.org/abs/2502.00708v1)|null|
-|**2025-02-02**|**A Survey of Quantized Graph Representation Learning: Connecting Graph Structures with Large Language Models**|Qika Lin et.al.|[2502.00681v1](http://arxiv.org/abs/2502.00681v1)|null|
-|**2025-02-01**|**Challenges and Innovations in LLM-Powered Fake News Detection: A Synthesis of Approaches and Future Directions**|Jingyuan Yi et.al.|[2502.00339v1](http://arxiv.org/abs/2502.00339v1)|null|
-|**2025-02-01**|**DEUCE: Dual-diversity Enhancement and Uncertainty-awareness for Cold-start Active Learning**|Jiaxin Guo et.al.|[2502.00305v1](http://arxiv.org/abs/2502.00305v1)|null|
-|**2025-01-31**|**Longer Attention Span: Increasing Transformer Context Length with Sparse Graph Processing Techniques**|Nathaniel Tomczak et.al.|[2502.01659v2](http://arxiv.org/abs/2502.01659v2)|[link](https://github.com/KLab-AI3/Graph-Processing-Attention-IPDPS-2025)|
-|**2025-01-31**|**Improving vision-language alignment with graph spiking hybrid Networks**|Siyu Zhang et.al.|[2501.19069v1](http://arxiv.org/abs/2501.19069v1)|null|
-|**2025-01-30**|**Semantic Web and Creative AI -- A Technical Report from ISWS 2023**|Raia Abu Ahmad et.al.|[2501.18542v1](http://arxiv.org/abs/2501.18542v1)|null|
-|**2025-01-30**|**Leveraging LLM Agents for Automated Optimization Modeling for SASP Problems: A Graph-RAG based Approach**|Tianpeng Pan et.al.|[2501.18320v1](http://arxiv.org/abs/2501.18320v1)|null|
-|**2025-01-30**|**Mixed-Precision Graph Neural Quantization for Low Bit Large Language Models**|Wanlong Liu et.al.|[2501.18154v1](http://arxiv.org/abs/2501.18154v1)|null|
-|**2025-01-30**|**Self-supervised Quantized Representation for Seamlessly Integrating Knowledge Graphs with Large Language Models**|Qika Lin et.al.|[2501.18119v1](http://arxiv.org/abs/2501.18119v1)|null|
-|**2025-01-29**|**Hybrid Graphs for Table-and-Text based Question Answering using LLMs**|Ankush Agarwal et.al.|[2501.17767v1](http://arxiv.org/abs/2501.17767v1)|null|
-|**2025-01-29**|**Query-Aware Learnable Graph Pooling Tokens as Prompt for Large Language Models**|Wooyoung Kim et.al.|[2501.17549v1](http://arxiv.org/abs/2501.17549v1)|null|
-|**2025-01-29**|**General Scene Adaptation for Vision-and-Language Navigation**|Haodong Hong et.al.|[2501.17403v1](http://arxiv.org/abs/2501.17403v1)|[link](https://github.com/honghd16/gsa-vln)|
-|**2025-01-28**|**Comprehensive Evaluation for a Large Scale Knowledge Graph Question Answering Service**|Saloni Potdar et.al.|[2501.17270v1](http://arxiv.org/abs/2501.17270v1)|null|
-|**2025-01-28**|**FactCG: Enhancing Fact Checkers with Graph-Based Multi-Hop Data**|Deren Lei et.al.|[2501.17144v1](http://arxiv.org/abs/2501.17144v1)|[link](https://github.com/derenlei/factcg)|
-|**2025-01-28**|**LLM-AutoDiff: Auto-Differentiate Any LLM Workflow**|Li Yin et.al.|[2501.16673v2](http://arxiv.org/abs/2501.16673v2)|[link](https://github.com/sylphai-inc/adalflow)|
-|**2025-01-27**|**360Brew: A Decoder-only Foundation Model for Personalized Ranking and Recommendation**|Hamed Firooz et.al.|[2501.16450v3](http://arxiv.org/abs/2501.16450v3)|null|
-|**2025-01-27**|**Raiders of the Lost Dependency: Fixing Dependency Conflicts in Python using LLMs**|Antony Bartlett et.al.|[2501.16191v1](http://arxiv.org/abs/2501.16191v1)|null|
-|**2025-01-27**|**Harnessing Diverse Perspectives: A Multi-Agent Framework for Enhanced Error Detection in Knowledge Graphs**|Yu Li et.al.|[2501.15791v1](http://arxiv.org/abs/2501.15791v1)|[link](https://github.com/kse-eleven/makged)|
-|**2025-01-27**|**Automatic Feedback Generation for Short Answer Questions using Answer Diagnostic Graphs**|Momoka Furuhashi et.al.|[2501.15777v1](http://arxiv.org/abs/2501.15777v1)|null|
-|**2025-01-26**|**Transformer-Based Multimodal Knowledge Graph Completion with Link-Aware Contexts**|Haodi Ma et.al.|[2501.15688v1](http://arxiv.org/abs/2501.15688v1)|null|
-|**2025-01-26**|**How to Mitigate Information Loss in Knowledge Graphs for GraphRAG: Leveraging Triple Context Restoration and Query-Driven Feedback**|Manzong Huang et.al.|[2501.15378v1](http://arxiv.org/abs/2501.15378v1)|null|
-|**2025-01-24**|**Explaining Categorical Feature Interactions Using Graph Covariance and LLMs**|Cencheng Shen et.al.|[2501.14932v1](http://arxiv.org/abs/2501.14932v1)|null|
-|**2025-01-24**|**Causal Graphs Meet Thoughts: Enhancing Complex Reasoning in Graph-Augmented LLMs**|Hang Luo et.al.|[2501.14892v1](http://arxiv.org/abs/2501.14892v1)|null|
-|**2025-01-24**|**GraPPI: A Retrieve-Divide-Solve GraphRAG Framework for Large-scale Protein-protein Interaction Exploration**|Ziwen Li et.al.|[2501.16382v1](http://arxiv.org/abs/2501.16382v1)|[link](https://github.com/aaronli43/grappi)|
-|**2025-01-24**|**Evaluating and Improving Graph to Text Generation with Large Language Models**|Jie He et.al.|[2501.14497v1](http://arxiv.org/abs/2501.14497v1)|[link](https://github.com/probe2/kg_text)|
-|**2025-01-24**|**Fast Think-on-Graph: Wider, Deeper and Faster Reasoning of Large Language Model on Knowledge Graph**|Xujian Liang et.al.|[2501.14300v1](http://arxiv.org/abs/2501.14300v1)|[link](https://github.com/dosonleung/fasttog)|
-|**2025-01-24**|**Top Ten Challenges Towards Agentic Neural Graph Databases**|Jiaxin Bai et.al.|[2501.14224v1](http://arxiv.org/abs/2501.14224v1)|null|
-|**2025-01-23**|**GraphRAG under Fire**|Jiacheng Liang et.al.|[2501.14050v1](http://arxiv.org/abs/2501.14050v1)|null|
-|**2025-01-23**|**EICopilot: Search and Explore Enterprise Information over Large-scale Knowledge Graphs with LLM-driven Agents**|Yuhui Yun et.al.|[2501.13746v1](http://arxiv.org/abs/2501.13746v1)|null|
-|**2025-01-23**|**Pseudocode-Injection Magic: Enabling LLMs to Tackle Graph Computational Tasks**|Chang Gong et.al.|[2501.13731v1](http://arxiv.org/abs/2501.13731v1)|null|
-|**2025-01-23**|**CAPRAG: A Large Language Model Solution for Customer Service and Automatic Reporting using Vector and Graph Retrieval-Augmented Generation**|Hamza Landolsi et.al.|[2501.13993v1](http://arxiv.org/abs/2501.13993v1)|null|
-|**2025-01-23**|**Dual-Branch HNSW Approach with Skip Bridges and LID-Driven Optimization**|Hy Nguyen et.al.|[2501.13992v1](http://arxiv.org/abs/2501.13992v1)|null|
-|**2025-01-23**|**Comprehensive Modeling and Question Answering of Cancer Clinical Practice Guidelines using LLMs**|Bhumika Gupta et.al.|[2501.13984v1](http://arxiv.org/abs/2501.13984v1)|null|
-|**2025-01-21**|**LLM-Assisted Knowledge Graph Completion for Curriculum and Domain Modelling in Personalized Higher Education Recommendations**|Hasan Abu-Rasheed et.al.|[2501.12300v1](http://arxiv.org/abs/2501.12300v1)|null|
-|**2025-01-21**|**Divide-Then-Aggregate: An Efficient Tool Learning Method via Parallel Tool Invocation**|Dongsheng Zhu et.al.|[2501.12432v1](http://arxiv.org/abs/2501.12432v1)|null|
-|**2025-01-21**|**InsTALL: Context-aware Instructional Task Assistance with Multi-modal Large Language Models**|Pha Nguyen et.al.|[2501.12231v1](http://arxiv.org/abs/2501.12231v1)|null|
-|**2025-01-21**|**Leveraging Graph Structures and Large Language Models for End-to-End Synthetic Task-Oriented Dialogues**|Maya Medjad et.al.|[2501.11977v1](http://arxiv.org/abs/2501.11977v1)|[link](https://github.com/reecall/graphtod)|
-|**2025-01-21**|**Bridging Visualization and Optimization: Multimodal Large Language Models on Graph-Structured Combinatorial Optimization**|Jie Zhao et.al.|[2501.11968v1](http://arxiv.org/abs/2501.11968v1)|null|
-|**2025-01-21**|**A Survey of Graph Retrieval-Augmented Generation for Customized Large Language Models**|Qinggang Zhang et.al.|[2501.13958v1](http://arxiv.org/abs/2501.13958v1)|[link](https://github.com/deep-polyu/awesome-graphrag)|
-|**2025-01-21**|**Network-informed Prompt Engineering against Organized Astroturf Campaigns under Extreme Class Imbalance**|Nikos Kanakaris et.al.|[2501.11849v2](http://arxiv.org/abs/2501.11849v2)|[link](https://github.com/nkanak/brag-fake-news-campaigns)|
-|**2025-01-21**|**Large Language Models Meet Graph Neural Networks for Text-Numeric Graph Reasoning**|Haoran Song et.al.|[2501.16361v1](http://arxiv.org/abs/2501.16361v1)|null|
-|**2025-01-20**|**Zep: A Temporal Knowledge Graph Architecture for Agent Memory**|Preston Rasmussen et.al.|[2501.13956v1](http://arxiv.org/abs/2501.13956v1)|[link](https://github.com/getzep/graphiti)|
-|**2025-01-20**|**Explainable Lane Change Prediction for Near-Crash Scenarios Using Knowledge Graph Embeddings and Retrieval Augmented Generation**|M. Manzour et.al.|[2501.11560v1](http://arxiv.org/abs/2501.11560v1)|null|
-|**2025-01-20**|**Each Graph is a New Language: Graph Learning with LLMs**|Huachi Zhou et.al.|[2501.11478v2](http://arxiv.org/abs/2501.11478v2)|null|
-|**2025-01-20**|**Few-shot Policy (de)composition in Conversational Question Answering**|Kyle Erwin et.al.|[2501.11335v1](http://arxiv.org/abs/2501.11335v1)|null|
-|**2025-01-20**|**Reasoning Language Models: A Blueprint**|Maciej Besta et.al.|[2501.11223v3](http://arxiv.org/abs/2501.11223v3)|[link](https://github.com/spcl/x1)|
-|**2025-01-19**|**IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems**|Elad Levi et.al.|[2501.11067v1](http://arxiv.org/abs/2501.11067v1)|[link](https://github.com/plurai-ai/intellagent)|
+##### **Typhoon T1: An Open Thai Reasoning Model**
+2502.09042v1 by Pittawat Taveekitworachai, Potsawee Manakul, Kasima Tharnpipitchai, Kunat Pipatanakul
 
-#### Abstracts
-##### **Visual Graph Question Answering with ASP and LLMs for Language Parsing**
-2502.09211v1 by Jakob Johannes Bauer, Thomas Eiter, Nelson Higuera Ruiz, Johannes Oetsch
+This paper introduces Typhoon T1, an open effort to develop an open Thai
+reasoning model. A reasoning model is a relatively new type of generative model
+built on top of large language models (LLMs). A reasoning model generates a
+long chain of thought before arriving at a final answer, an approach found to
+improve performance on complex tasks. However, details on developing such a
+model are limited, especially for reasoning models that can generate traces in
+a low-resource language. Typhoon T1 presents an open effort that dives into the
+details of developing a reasoning model in a more cost-effective way by
+leveraging supervised fine-tuning using open datasets, instead of reinforcement
+learning. This paper shares the details about synthetic data generation and
+training, as well as our dataset and model weights. Additionally, we provide
+insights gained from developing a reasoning model that generalizes across
+domains and is capable of generating reasoning traces in a low-resource
+language, using Thai as an example. We hope this open effort provides a
+foundation for further research in this field.
 
-Visual Question Answering (VQA) is a challenging problem that requires to
-process multimodal input. Answer-Set Programming (ASP) has shown great
-potential in this regard to add interpretability and explainability to modular
-VQA architectures. In this work, we address the problem of how to integrate ASP
-with modules for vision and natural language processing to solve a new and
-demanding VQA variant that is concerned with images of graphs (not graphs in
-symbolic form). Images containing graph-based structures are an ubiquitous and
-popular form of visualisation. Here, we deal with the particular problem of
-graphs inspired by transit networks, and we introduce a novel dataset that
-amends an existing one by adding images of graphs that resemble metro lines.
-Our modular neuro-symbolic approach combines optical graph recognition for
-graph parsing, a pretrained optical character recognition neural network for
-parsing labels, Large Language Models (LLMs) for language processing, and ASP
-for reasoning. This method serves as a first baseline and achieves an overall
-average accuracy of 73% on the dataset. Our evaluation provides further
-evidence of the potential of modular neuro-symbolic systems, in particular with
-pretrained models that do not involve any further training and logic
-programming for reasoning, to solve complex VQA tasks.
+摘要：本文介紹 Typhoon T1，這是一個開放的計畫，旨在開發開放的泰語推理模型。推理模型是一種相對較新的生成模型，建構於大型語言模型 (LLM) 之上。推理模型會在得出最終答案之前產生一連串的思考，這種方法被發現可以改善複雜任務的效能。然而，關於如何開發這種模型的詳細資訊有限，特別是對於能夠以低資源語言產生軌跡的推理模型而言。Typhoon T1 提出了一個開放的計畫，深入探討如何以更具成本效益的方式開發推理模型，方法是利用開放式資料集進行監督微調，而不是強化學習。本文分享了關於合成資料產生和訓練的詳細資訊，以及我們的資料集和模型權重。此外，我們提供了從開發推理模型中獲得的見解，該模型可以跨領域概括，並能夠以低資源語言產生推理軌跡，以泰語為例。我們希望這個開放的計畫能為此領域的進一步研究奠定基礎。
 
-摘要：視覺問答（VQA）是一項具有挑戰性的問題，需要處理多模態輸入。答案集程式設計（ASP）在這方面顯示出巨大的潛力，可以為模組化 VQA 架構增加可解釋性和說明性。在這項工作中，我們探討如何將 ASP 與視覺和自然語言處理模組整合，以解決一個新的且要求嚴格的 VQA 變體，該變體與圖形影像（而非符號形式的圖形）有關。包含圖形結構的影像是一種普遍且流行的可視化形式。在這裡，我們處理受交通網路啟發的圖形特定問題，並引入一個新的資料集，透過新增類似地鐵路線的圖形影像來修正現有資料集。我們的模組化神經符號方法結合光學圖形辨識進行圖形解析、預先訓練的光學字元辨識神經網路進行標籤解析、大型語言模型（LLM）進行語言處理，以及 ASP 進行推理。此方法作為第一個基準，在資料集上達到 73% 的整體平均準確度。我們的評估進一步證明了模組化神經符號系統的潛力，特別是預先訓練的模型，這些模型不涉及任何進一步的訓練和邏輯程式設計進行推理，以解決複雜的 VQA 任務。
+##### **Mechanistic Unveiling of Transformer Circuits: Self-Influence as a Key to Model Reasoning**
+2502.09022v1 by Lin Zhang, Lijie Hu, Di Wang
 
-##### **Representation Learning to Advance Multi-institutional Studies with Electronic Health Record Data**
-2502.08547v1 by Doudou Zhou, Han Tong, Linshanshan Wang, Suqi Liu, Xin Xiong, Ziming Gan, Romain Griffier, Boris Hejblum, Yun-Chung Liu, Chuan Hong, Clara-Lea Bonzel, Tianrun Cai, Kevin Pan, Yuk-Lam Ho, Lauren Costa, Vidul A. Panickan, J. Michael Gaziano, Kenneth Mandl, Vianney Jouhet, Rodolphe Thiebaut, Zongqi Xia, Kelly Cho, Katherine Liao, Tianxi Cai
+Transformer-based language models have achieved notable success, yet their
+internal reasoning mechanisms remain largely opaque due to complex non-linear
+interactions and high-dimensional operations. While previous research suggests
+that these models implicitly encode reasoning structures, it is still unclear
+which specific multi-step thought processes they employ to solve complex tasks.
+To address this gap, we propose a novel mechanistic interpretability framework,
+SICAF, designed to trace and analyze the reasoning strategies that language
+models use in multi-step inference tasks. By employing circuit analysis and
+self-influence functions, we quantify the evolving importance of each token
+throughout the reasoning process, thereby mapping the pathways the model uses
+for inference. Applying SICAF to the GPT-2 model on the Indirect Object
+Identification (IOI) prediction task, we demonstrate how underlying circuits
+can reveal a reasoning process that aligns with human interpretability,
+offering new insights into the model's internal logic.
 
-The adoption of EHRs has expanded opportunities to leverage data-driven
-algorithms in clinical care and research. A major bottleneck in effectively
-conducting multi-institutional EHR studies is the data heterogeneity across
-systems with numerous codes that either do not exist or represent different
-clinical concepts across institutions. The need for data privacy further limits
-the feasibility of including multi-institutional patient-level data required to
-study similarities and differences across patient subgroups. To address these
-challenges, we developed the GAME algorithm. Tested and validated across 7
-institutions and 2 languages, GAME integrates data in several levels: (1) at
-the institutional level with knowledge graphs to establish relationships
-between codes and existing knowledge sources, providing the medical context for
-standard codes and their relationship to each other; (2) between institutions,
-leveraging language models to determine the relationships between
-institution-specific codes with established standard codes; and (3) quantifying
-the strength of the relationships between codes using a graph attention
-network. Jointly trained embeddings are created using transfer and federated
-learning to preserve data privacy. In this study, we demonstrate the
-applicability of GAME in selecting relevant features as inputs for AI-driven
-algorithms in a range of conditions, e.g., heart failure, rheumatoid arthritis.
-We then highlight the application of GAME harmonized multi-institutional EHR
-data in a study of Alzheimer's disease outcomes and suicide risk among patients
-with mental health disorders, without sharing patient-level data outside
-individual institutions.
+摘要：基於 Transformer 的語言模型已取得顯著的成功，但由於複雜的非線性交互和高維度運算，它們的內部推理機制在很大程度上仍然不透明。儘管先前的研究表明這些模型隱含地編碼推理結構，但目前仍不清楚它們採用哪些具體的多步驟思考過程來解決複雜任務。為了解決這個差距，我們提出了一個新穎的機制可解釋性框架 SICAF，旨在追蹤和分析語言模型在多步驟推理任務中使用的推理策略。通過採用電路分析和自影響函數，我們量化了推理過程中每個標記的演化重要性，從而繪製出模型用於推理的路徑。將 SICAF 應用於 GPT-2 模型上的間接賓語識別 (IOI) 預測任務，我們展示了底層電路如何揭示與人類可解釋性相符的推理過程，從而對模型的內部邏輯提供了新的見解。
 
-摘要：電子健康紀錄的採用擴大了在臨床照護和研究中利用資料驅動演算法的機會。在有效進行多機構電子健康紀錄研究時，一個主要的瓶頸是系統間資料異質性，其中有許多代碼在機構間不存在或表示不同的臨床概念。資料隱私的需求進一步限制了納入多機構患者層級資料的可行性，而這些資料對於研究患者亞群之間的相似性和差異性是必要的。為了應對這些挑戰，我們開發了 GAME 演算法。GAME 已在 7 個機構和 2 種語言中進行測試和驗證，它整合了多個層級的資料：(1) 在機構層級，使用知識圖表來建立代碼和現有知識來源之間的關係，為標準代碼及其彼此之間的關係提供醫療背景；(2) 在機構之間，利用語言模型來確定機構特定代碼與已建立的標準代碼之間的關係；(3) 使用圖形注意網路量化代碼之間關係的強度。使用遷移和聯合學習建立聯合訓練的嵌入，以保護資料隱私。在本研究中，我們展示了 GAME 在選擇相關特徵作為 AI 驅動演算法輸入時的適用性，適用於各種情況，例如心臟衰竭、類風濕性關節炎。然後，我們重點介紹了 GAME 和諧化多機構電子健康紀錄資料在阿茲海默症疾病結果和精神疾病患者自殺風險研究中的應用，而無需在個別機構之外共享患者層級資料。
+##### **EventSTR: A Benchmark Dataset and Baselines for Event Stream based Scene Text Recognition**
+2502.09020v1 by Xiao Wang, Jingtao Jiang, Dong Li, Futian Wang, Lin Zhu, Yaowei Wang, Yongyong Tian, Jin Tang
 
-##### **Trustworthy GNNs with LLMs: A Systematic Review and Taxonomy**
-2502.08353v1 by Ruizhan Xue, Huimin Deng, Fang He, Maojun Wang, Zeyu Zhang
+Mainstream Scene Text Recognition (STR) algorithms are developed based on RGB
+cameras which are sensitive to challenging factors such as low illumination,
+motion blur, and cluttered backgrounds. In this paper, we propose to recognize
+the scene text using bio-inspired event cameras by collecting and annotating a
+large-scale benchmark dataset, termed EventSTR. It contains 9,928
+high-definition (1280 * 720) event samples and involves both Chinese and
+English characters. We also benchmark multiple STR algorithms as the baselines
+for future works to compare. In addition, we propose a new event-based scene
+text recognition framework, termed SimC-ESTR. It first extracts the event
+features using a visual encoder and projects them into tokens using a Q-former
+module. More importantly, we propose to augment the vision tokens based on a
+memory mechanism before feeding into the large language models. A
+similarity-based error correction mechanism is embedded within the large
+language model to correct potential minor errors fundamentally based on
+contextual information. Extensive experiments on the newly proposed EventSTR
+dataset and two simulation STR datasets fully demonstrate the effectiveness of
+our proposed model. We believe that the dataset and algorithmic model can
+innovatively propose an event-based STR task and are expected to accelerate the
+application of event cameras in various industries. The source code and
+pre-trained models will be released on https://github.com/Event-AHU/EventSTR
 
-With the extensive application of Graph Neural Networks (GNNs) across various
-domains, their trustworthiness has emerged as a focal point of research. Some
-existing studies have shown that the integration of large language models
-(LLMs) can improve the semantic understanding and generation capabilities of
-GNNs, which in turn improves the trustworthiness of GNNs from various aspects.
-Our review introduces a taxonomy that offers researchers a clear framework for
-comprehending the principles and applications of different methods and helps
-clarify the connections and differences among various approaches. Then we
-systematically survey representative approaches along the four categories of
-our taxonomy. Through our taxonomy, researchers can understand the applicable
-scenarios, potential advantages, and limitations of each approach for the the
-trusted integration of GNNs with LLMs. Finally, we present some promising
-directions of work and future trends for the integration of LLMs and GNNs to
-improve model trustworthiness.
+摘要：主流場景文字辨識（STR）演算法是基於對低光源、動態模糊和雜亂背景等挑戰性因素敏感的 RGB 相機開發的。在本文中，我們提出使用生物靈感事件相機辨識場景文字，方法是收集和標註一個稱為 EventSTR 的大規模基準資料集。它包含 9,928 個高畫質（1280 * 720）事件範例，並包含中文字和英文字元。我們也基準化多個 STR 演算法作為未來工作的基準，以進行比較。此外，我們提出一個新的基於事件的場景文字辨識架構，稱為 SimC-ESTR。它首先使用視覺編碼器萃取事件特徵，並使用 Q-former 模組將它們投影到代幣中。更重要的是，我們提出在輸入大型語言模型之前，基於記憶機制擴充視覺代幣。一個基於相似性的錯誤修正機制嵌入在大型語言模型中，以根據上下文資訊從根本上修正潛在的輕微錯誤。在最新提出的 EventSTR 資料集和兩個模擬 STR 資料集上進行的廣泛實驗充分證明了我們提出的模型的有效性。我們相信，該資料集和演算法模型可以創新地提出一個基於事件的 STR 任務，並有望加速事件相機在各個產業的應用。原始碼和預先訓練的模型將在 https://github.com/Event-AHU/EventSTR 上釋出
 
-摘要：隨著圖神經網路 (GNN) 在各種領域的廣泛應用，其可信度已成為研究的焦點。一些現有研究表明，整合大型語言模型 (LLM) 可以提升 GNN 的語意理解和生成能力，進而從各方面提升 GNN 的可信度。我們的評論介紹了一種分類法，為研究人員提供了一個清晰的架構，用於理解不同方法的原理和應用，並有助於釐清各種方法之間的關聯和差異。然後，我們系統性地針對分類法的四個類別進行代表性方法的調查。研究人員透過我們的分類法，可以了解每種方法在 GNN 與 LLM 的可信整合中適用的場景、潛在優點和限制。最後，我們提出 LLM 與 GNN 整合的一些有前景的工作方向和未來趨勢，以提升模型的可信度。
+##### **Zero-shot Concept Bottleneck Models**
+2502.09018v1 by Shin'ya Yamaguchi, Kosuke Nishida, Daiki Chijiwa, Yasutoshi Ida
 
-##### **Graph Foundation Models for Recommendation: A Comprehensive Survey**
-2502.08346v1 by Bin Wu, Yihang Wang, Yuanhao Zeng, Jiawei Liu, Jiashu Zhao, Cheng Yang, Yawen Li, Long Xia, Dawei Yin, Chuan Shi
+Concept bottleneck models (CBMs) are inherently interpretable and
+intervenable neural network models, which explain their final label prediction
+by the intermediate prediction of high-level semantic concepts. However, they
+require target task training to learn input-to-concept and concept-to-label
+mappings, incurring target dataset collections and training resources. In this
+paper, we present \textit{zero-shot concept bottleneck models} (Z-CBMs), which
+predict concepts and labels in a fully zero-shot manner without training neural
+networks. Z-CBMs utilize a large-scale concept bank, which is composed of
+millions of vocabulary extracted from the web, to describe arbitrary input in
+various domains. For the input-to-concept mapping, we introduce concept
+retrieval, which dynamically finds input-related concepts by the cross-modal
+search on the concept bank. In the concept-to-label inference, we apply concept
+regression to select essential concepts from the retrieved concepts by sparse
+linear regression. Through extensive experiments, we confirm that our Z-CBMs
+provide interpretable and intervenable concepts without any additional
+training. Code will be available at https://github.com/yshinya6/zcbm.
 
-Recommender systems (RS) serve as a fundamental tool for navigating the vast
-expanse of online information, with deep learning advancements playing an
-increasingly important role in improving ranking accuracy. Among these, graph
-neural networks (GNNs) excel at extracting higher-order structural information,
-while large language models (LLMs) are designed to process and comprehend
-natural language, making both approaches highly effective and widely adopted.
-Recent research has focused on graph foundation models (GFMs), which integrate
-the strengths of GNNs and LLMs to model complex RS problems more efficiently by
-leveraging the graph-based structure of user-item relationships alongside
-textual understanding. In this survey, we provide a comprehensive overview of
-GFM-based RS technologies by introducing a clear taxonomy of current
-approaches, diving into methodological details, and highlighting key challenges
-and future directions. By synthesizing recent advancements, we aim to offer
-valuable insights into the evolving landscape of GFM-based recommender systems.
+摘要：概念瓶頸模型 (CBM) 本質上是可解釋且可干預的神經網路模型，它們透過對高階語意概念的中間預測來解釋其最終標籤預測。然而，它們需要目標任務訓練來學習輸入到概念和概念到標籤的對應，導致目標資料集收集和訓練資源。在本文中，我們展示了「零次學習概念瓶頸模型」(Z-CBM)，它以完全零次學習的方式預測概念和標籤，而無需訓練神經網路。Z-CBM 利用一個大型概念庫，其中包含從網路中擷取的數百萬個詞彙，來描述各種領域中的任意輸入。對於輸入到概念的對應，我們引入了概念擷取，它透過對概念庫的跨模態搜尋，動態地找出與輸入相關的概念。在概念到標籤的推論中，我們應用概念迴歸，透過稀疏線性迴歸從擷取的概念中選擇必要的概念。透過廣泛的實驗，我們確認我們的 Z-CBM 在沒有任何額外訓練的情況下提供了可解釋且可干預的概念。程式碼將可在 https://github.com/yshinya6/zcbm 取得。
 
-摘要：推薦系統 (RS) 是導航廣闊線上資訊的基本工具，深度學習的進展在提升排名準確度方面扮演著日益重要的角色。在這些進展中，圖形神經網路 (GNN) 擅長萃取高階結構資訊，而大型語言模型 (LLM) 則設計用於處理和理解自然語言，這兩種方法都非常有效且廣泛採用。最近的研究專注於圖形基礎模型 (GFM)，它整合了 GNN 和 LLM 的優點，透過利用使用者與項目關係的圖形化結構，以及文字理解，更有效率地建構複雜的 RS 問題。在這項調查中，我們提供 GFM-based RS 技術的全面概觀，介紹當前方法的明確分類法，深入探討方法論的細節，並強調關鍵挑戰和未來方向。透過綜合最近的進展，我們旨在提供有價值的見解，了解 GFM-based 推薦系統不斷演變的樣貌。
+##### **Diversity Enhances an LLM's Performance in RAG and Long-context Task**
+2502.09017v1 by Zhchao Wang, Bin Bi, Yanqi Luo, Sitaram Asur, Claire Na Cheng
 
-##### **Self-Evaluation for Job-Shop Scheduling**
-2502.08684v1 by Imanol Echeverria, Maialen Murua, Roberto Santana
+The rapid advancements in large language models (LLMs) have highlighted the
+challenge of context window limitations, primarily due to the quadratic time
+complexity of the self-attention mechanism (\(O(N^2)\), where \(N\) denotes the
+context window length). This constraint impacts tasks such as
+retrieval-augmented generation (RAG) in question answering (Q\&A) and long
+context summarization. A common approach involves selecting content with the
+highest similarity to the query; however, this often leads to redundancy and
+the exclusion of diverse yet relevant information. Building on principles from
+Maximal Marginal Relevance (MMR) and Farthest Point Sampling (FPS), we
+integrate diversity into the content selection process. Our findings reveal
+that incorporating diversity substantially increases the recall of selecting
+relevant sentences or chunks before LLM-based Q\&A and summarization. These
+results highlight the importance of maintaining diversity in future LLM
+applications to further improve summarization and Q\&A outcomes.
 
-Combinatorial optimization problems, such as scheduling and route planning,
-are crucial in various industries but are computationally intractable due to
-their NP-hard nature. Neural Combinatorial Optimization methods leverage
-machine learning to address these challenges but often depend on sequential
-decision-making, which is prone to error accumulation as small mistakes
-propagate throughout the process. Inspired by self-evaluation techniques in
-Large Language Models, we propose a novel framework that generates and
-evaluates subsets of assignments, moving beyond traditional stepwise
-approaches. Applied to the Job-Shop Scheduling Problem, our method integrates a
-heterogeneous graph neural network with a Transformer to build a policy model
-and a self-evaluation function. Experimental validation on challenging,
-well-known benchmarks demonstrates the effectiveness of our approach,
-surpassing state-of-the-art methods.
+摘要：大型語言模型 (LLM) 的快速進步凸顯了上下文視窗限制的挑戰，這主要是由於自注意力機制的二次時間複雜度（\(O(N^2)\)），其中 \(N\) 表示上下文視窗長度。此限制會影響任務，例如問答 (Q&A) 中的檢索增強生成 (RAG) 和長文摘要。一種常見的方法涉及選擇與查詢最相似的內容；然而，這通常會導致冗餘，並排除多樣化但相關的資訊。我們根據最大邊際相關性 (MMR) 和最遠點取樣 (FPS) 的原則，將多樣性整合到內容選擇過程中。我們的研究結果顯示，在基於 LLM 的問答和摘要之前，納入多樣性會大幅增加選擇相關句子或區塊的召回率。這些結果突顯了在未來的 LLM 應用中維持多樣性的重要性，以進一步改善摘要和問答的結果。
 
-摘要：組合優化問題，例如排程和路線規劃，在各行各業中至關重要，但由於它們的 NP 難度，在計算上難以處理。神經組合優化方法利用機器學習來解決這些挑戰，但通常依賴於序貫決策制定，而序貫決策制定容易發生錯誤累積，因為小錯誤會在整個過程中傳播。受大型語言模型中的自我評估技術啟發，我們提出了一個新的框架，可生成和評估作業子集，超越傳統的分步方法。應用於工作車間排程問題，我們的方法將異質圖神經網路與 Transformer 整合在一起，以建立策略模型和自我評估函數。在具有挑戰性的著名基準上的實驗驗證證明了我們方法的有效性，超越了最先進的方法。
+##### **Hope vs. Hate: Understanding User Interactions with LGBTQ+ News Content in Mainstream US News Media through the Lens of Hope Speech**
+2502.09004v1 by Jonathan Pofcher, Christopher M. Homan, Randall Sell, Ashiqur R. KhudaBukhsh
 
-##### **Improving Existing Optimization Algorithms with LLMs**
-2502.08298v1 by Camilo Chacón Sartori, Christian Blum
+This paper makes three contributions. First, via a substantial corpus of
+1,419,047 comments posted on 3,161 YouTube news videos of major US cable news
+outlets, we analyze how users engage with LGBTQ+ news content. Our analyses
+focus both on positive and negative content. In particular, we construct a
+fine-grained hope speech classifier that detects positive (hope speech),
+negative, neutral, and irrelevant content. Second, in consultation with a
+public health expert specializing on LGBTQ+ health, we conduct an annotation
+study with a balanced and diverse political representation and release a
+dataset of 3,750 instances with fine-grained labels and detailed annotator
+demographic information. Finally, beyond providing a vital resource for the
+LGBTQ+ community, our annotation study and subsequent in-the-wild assessments
+reveal (1) strong association between rater political beliefs and how they rate
+content relevant to a marginalized community; (2) models trained on individual
+political beliefs exhibit considerable in-the-wild disagreement; and (3)
+zero-shot large language models (LLMs) align more with liberal raters.
 
-The integration of Large Language Models (LLMs) into optimization has created
-a powerful synergy, opening exciting research opportunities. This paper
-investigates how LLMs can enhance existing optimization algorithms. Using their
-pre-trained knowledge, we demonstrate their ability to propose innovative
-heuristic variations and implementation strategies. To evaluate this, we
-applied a non-trivial optimization algorithm, Construct, Merge, Solve and Adapt
-(CMSA) -- a hybrid metaheuristic for combinatorial optimization problems that
-incorporates a heuristic in the solution construction phase. Our results show
-that an alternative heuristic proposed by GPT-4o outperforms the
-expert-designed heuristic of CMSA, with the performance gap widening on larger
-and denser graphs. Project URL: https://imp-opt-algo-llms.surge.sh/
+摘要：本文做出了三項貢獻。首先，透過一個龐大的語料庫，其中包含 1,419,047 則評論，這些評論張貼在 3,161 部美國有線新聞頻道的 YouTube 新聞影片上，我們分析了使用者如何參與 LGBTQ+ 新聞內容。我們的分析重點在於正面和負面的內容。特別是，我們建構了一個細緻的希望言論分類器，用來偵測正面的（希望言論）、負面的、中立的和不相關的內容。其次，在諮詢了一位專門研究 LGBTQ+ 健康的公共衛生專家後，我們進行了一項標註研究，其中包含平衡且多元的政治代表性，並發布了一個包含 3,750 個實例的資料集，其中包含細緻的標籤和詳細的標註者人口統計資訊。最後，除了為 LGBTQ+ 社群提供重要的資源外，我們的標註研究和後續的實際評估揭示了：(1) 評分者的政治信仰與他們如何評分與邊緣化社群相關的內容之間有很強的關聯性；(2) 根據個人政治信仰訓練的模型在實際應用中表現出相當大的分歧；(3) 零次學習大型語言模型 (LLM) 與自由派評分者的看法更一致。
 
-摘要：大型语言模型 (LLM) 与优化相结合，创造了一种强大的协同作用，开启了令人兴奋的研究机会。本文探讨了 LLM 如何增强现有的优化算法。利用其预先训练的知识，我们展示了它们提出创新启发式变体和实施策略的能力。为了评估这一点，我们应用了一种非平凡的优化算法，构建、合并、求解和适应 (CMSA)——一种用于组合优化问题的混合元启发式算法，它在求解构建阶段纳入了启发式算法。我们的结果表明，GPT-4o 提出的替代启发式算法优于 CMSA 的专家设计的启发式算法，并且随着图形变得更大、更密集，性能差距也在扩大。项目网址：https://imp-opt-algo-llms.surge.sh/
+##### **RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models**
+2502.09003v1 by Quan Wei, Chung-Yiu Yau, Hoi-To Wai, Yang, Zhao, Dongyeop Kang, Youngsuk Park, Mingyi Hong
 
-##### **ACCESS : A Benchmark for Abstract Causal Event Discovery and Reasoning**
-2502.08148v1 by Vy Vo, Lizhen Qu, Tao Feng, Yuncheng Hua, Xiaoxi Kang, Songhai Fan, Tim Dwyer, Lay-Ki Soon, Gholamreza Haffari
+Supervised fine-tuning is a standard method for adapting pre-trained large
+language models (LLMs) to downstream tasks. Quantization has been recently
+studied as a post-training technique for efficient LLM deployment. To obtain
+quantized fine-tuned LLMs, conventional pipelines would first fine-tune the
+pre-trained models, followed by post-training quantization. This often yields
+suboptimal performance as it fails to leverage the synergy between fine-tuning
+and quantization. To effectively realize low-bit quantization of weights,
+activations, and KV caches in LLMs, we propose an algorithm named Rotated
+Straight-Through-Estimator (RoSTE), which combines quantization-aware
+supervised fine-tuning (QA-SFT) with an adaptive rotation strategy that
+identifies an effective rotation configuration to reduce activation outliers.
+We provide theoretical insights on RoSTE by analyzing its prediction error when
+applied to an overparameterized least square quantized training problem. Our
+findings reveal that the prediction error is directly proportional to the
+quantization error of the converged weights, which can be effectively managed
+through an optimized rotation configuration. Experiments on Pythia and Llama
+models of different sizes demonstrate the effectiveness of RoSTE. Compared to
+existing post-SFT quantization baselines, our method consistently achieves
+superior performances across various tasks and different LLM architectures.
 
-Identifying cause-and-effect relationships is critical to understanding
-real-world dynamics and ultimately causal reasoning. Existing methods for
-identifying event causality in NLP, including those based on Large Language
-Models (LLMs), exhibit difficulties in out-of-distribution settings due to the
-limited scale and heavy reliance on lexical cues within available benchmarks.
-Modern benchmarks, inspired by probabilistic causal inference, have attempted
-to construct causal graphs of events as a robust representation of causal
-knowledge, where \texttt{CRAB} \citep{romanou2023crab} is one such recent
-benchmark along this line. In this paper, we introduce \texttt{ACCESS}, a
-benchmark designed for discovery and reasoning over abstract causal events.
-Unlike existing resources, \texttt{ACCESS} focuses on causality of everyday
-life events on the abstraction level. We propose a pipeline for identifying
-abstractions for event generalizations from \texttt{GLUCOSE}
-\citep{mostafazadeh-etal-2020-glucose}, a large-scale dataset of implicit
-commonsense causal knowledge, from which we subsequently extract $1,4$K causal
-pairs. Our experiments highlight the ongoing challenges of using statistical
-methods and/or LLMs for automatic abstraction identification and causal
-discovery in NLP. Nonetheless, we demonstrate that the abstract causal
-knowledge provided in \texttt{ACCESS} can be leveraged for enhancing QA
-reasoning performance in LLMs.
+摘要：監督式微調是將預訓練的大型語言模型 (LLM) 適應至下游任務的標準方法。量化最近已被研究作為一種訓練後技術，用於高效部署 LLM。為了獲得量化的微調 LLM，傳統管道會先微調預訓練模型，然後再進行訓練後量化。這通常會產生次佳效能，因為它無法利用微調和量化之間的協同效應。為了有效實現 LLM 中權重、激活和 KV 快取的低位元量化，我們提出了一種名為旋轉直通估計器 (RoSTE) 的演算法，它結合了量化感知監督式微調 (QA-SFT) 和一種自適應旋轉策略，該策略會識別有效的旋轉組態以減少激活異常值。我們透過分析 RoSTE 在應用於過度參數化最小平方量化訓練問題時的預測誤差，提供了關於 RoSTE 的理論見解。我們的研究結果顯示，預測誤差與收斂權重的量化誤差成正比，而這可透過最佳化的旋轉組態有效地管理。在不同大小的 Pythia 和 Llama 模型上進行的實驗證明了 RoSTE 的有效性。與現有的訓練後 SFT 量化基準相比，我們的模型在各種任務和不同的 LLM 架構中持續獲得優異的效能。
 
-摘要：<paragraph>找出因果關係對於理解現實世界的動態和最終的因果推理至關重要。現有的 NLP 事件因果關係識別方法，包括基於大型語言模型 (LLM) 的方法，由於規模有限且過度依賴於可用基準中的詞彙線索，在分佈外環境中表現出困難。受機率因果推論啟發的現代基準已嘗試建構事件的因果圖，作為因果知識的強健表示，其中 \texttt{CRAB} \citep{romanou2023crab} 是這條路徑上最近的一個基準。在本文中，我們介紹 \texttt{ACCESS}，一個專門設計來探索和推理抽象因果事件的基準。與現有資源不同，\texttt{ACCESS} 專注於抽象層面上日常生活事件的因果關係。我們提出一個管道，用於從 \texttt{GLUCOSE} \citep{mostafazadeh-etal-2020-glucose} 找出事件概括的抽象，\texttt{GLUCOSE} 是隱含常識因果知識的大規模資料集，我們隨後從中萃取出 1,4K 因果對。我們的實驗突顯出使用統計方法和/或 LLM 進行 NLP 中的自動抽象識別和因果發現的持續挑戰。儘管如此，我們證明了 \texttt{ACCESS} 中提供的抽象因果知識可用於增強 LLM 中的問答推理效能。</paragraph>
+##### **PixLift: Accelerating Web Browsing via AI Upscaling**
+2502.08995v1 by Yonas Atinafu, Sarthak Malla, HyunSeok Daniel Jang, Nouar Aldahoul, Matteo Varvello, Yasir Zaki
 
-##### **GCoT: Chain-of-Thought Prompt Learning for Graphs**
-2502.08092v1 by Xingtong Yu, Chang Zhou, Zhongwei Kuai, Xinming Zhang, Yuan Fang
+Accessing the internet in regions with expensive data plans and limited
+connectivity poses significant challenges, restricting information access and
+economic growth. Images, as a major contributor to webpage sizes, exacerbate
+this issue, despite advances in compression formats like WebP and AVIF. The
+continued growth of complex and curated web content, coupled with suboptimal
+optimization practices in many regions, has prevented meaningful reductions in
+web page sizes. This paper introduces PixLift, a novel solution to reduce
+webpage sizes by downscaling their images during transmission and leveraging AI
+models on user devices to upscale them. By trading computational resources for
+bandwidth, PixLift enables more affordable and inclusive web access. We address
+key challenges, including the feasibility of scaled image requests on popular
+websites, the implementation of PixLift as a browser extension, and its impact
+on user experience. Through the analysis of 71.4k webpages, evaluations of
+three mainstream upscaling models, and a user study, we demonstrate PixLift's
+ability to significantly reduce data usage without compromising image quality,
+fostering a more equitable internet.
 
-Chain-of-thought (CoT) prompting has achieved remarkable success in natural
-language processing (NLP). However, its vast potential remains largely
-unexplored for graphs. This raises an interesting question: How can we design
-CoT prompting for graphs to guide graph models to learn step by step? On one
-hand, unlike natural languages, graphs are non-linear and characterized by
-complex topological structures. On the other hand, many graphs lack textual
-data, making it difficult to formulate language-based CoT prompting. In this
-work, we propose the first CoT prompt learning framework for text-free graphs,
-GCoT. Specifically, we decompose the adaptation process for each downstream
-task into a series of inference steps, with each step consisting of
-prompt-based inference, ``thought'' generation, and thought-conditioned prompt
-learning. While the steps mimic CoT prompting in NLP, the exact mechanism
-differs significantly. Specifically, at each step, an input graph, along with a
-prompt, is first fed into a pre-trained graph encoder for prompt-based
-inference. We then aggregate the hidden layers of the encoder to construct a
-``thought'', which captures the working state of each node in the current step.
-Conditioned on this thought, we learn a prompt specific to each node based on
-the current state. These prompts are fed into the next inference step,
-repeating the cycle. To evaluate and analyze the effectiveness of GCoT, we
-conduct comprehensive experiments on eight public datasets, which demonstrate
-the advantage of our approach.
+摘要：在數據方案昂貴且連線有限的地區存取網路會造成重大挑戰，限制了資訊存取和經濟成長。圖像作為網頁大小的主要貢獻者，儘管 WebP 和 AVIF 等壓縮格式進步，但仍加劇了這個問題。複雜且經過策劃的網路內容持續成長，加上許多地區次佳的最佳化實務，已阻礙了網頁大小的顯著減少。本文介紹 PixLift，這是一種創新的解決方案，可在傳輸過程中縮小圖像大小，並利用使用者裝置上的 AI 模型來放大圖像，藉此縮小網頁大小。PixLift 透過以運算資源換取頻寬，讓網路存取更經濟實惠且更具包容性。我們解決了關鍵挑戰，包括熱門網站上縮放圖像要求的可行性、將 PixLift 實作為瀏覽器擴充功能，以及它對使用者體驗的影響。透過分析 71.4k 個網頁、評估三個主流放大模型，以及使用者研究，我們展示了 PixLift 在不影響影像品質的情況下顯著減少資料用量的能力，促進了更公平的網路。
 
-摘要：<paragraph>鏈式思考 (CoT) 提示在自然語言處理 (NLP) 中取得了顯著的成功。然而，其龐大的潛力在圖形方面仍未得到充分探索。這提出了一個有趣的問題：我們如何設計圖形的 CoT 提示來指導圖形模型逐步學習？一方面，與自然語言不同，圖形是非線性的，並且具有複雜的拓撲結構。另一方面，許多圖形缺乏文本數據，這使得難以制定基於語言的 CoT 提示。在這項工作中，我們提出了第一個適用於無文本圖形的 CoT 提示學習框架 GCoT。具體來說，我們將每個下游任務的適應過程分解為一系列推理步驟，每個步驟都包含基於提示的推理、「思想」生成以及基於思想的提示學習。雖然這些步驟模擬了 NLP 中的 CoT 提示，但具體機制卻有很大不同。具體來說，在每一步中，一個輸入圖形連同一個提示首先被輸入到一個預訓練的圖形編碼器中進行基於提示的推理。然後，我們聚合編碼器的隱藏層以構建一個「思想」，它捕獲了當前步驟中每個節點的工作狀態。基於這個思想，我們根據當前狀態學習一個特定於每個節點的提示。這些提示被輸入到下一個推理步驟中，重複這個循環。為了評估和分析 GCoT 的有效性，我們對八個公共數據集進行了全面的實驗，這證明了我們方法的優勢。</paragraph>
+##### **RLSA-PFL: Robust Lightweight Secure Aggregation with Model Inconsistency Detection in Privacy-Preserving Federated Learning**
+2502.08989v1 by Nazatul H. Sultan, Yan Bo, Yansong Gao, Seyit Camtepe, Arash Mahboubi, Hang Thanh Bui, Aufeef Chauhan, Hamed Aboutorab, Michael Bewong, Praveen Gauravaram, Rafiqul Islam, Sharif Abuadbba
 
-##### **Deep Semantic Graph Learning via LLM based Node Enhancement**
-2502.07982v1 by Chuanqi Shi, Yiyi Tao, Hang Zhang, Lun Wang, Shaoshuai Du, Yixian Shen, Yanxin Shen
+Federated Learning (FL) allows users to collaboratively train a global
+machine learning model by sharing local model only, without exposing their
+private data to a central server. This distributed learning is particularly
+appealing in scenarios where data privacy is crucial, and it has garnered
+substantial attention from both industry and academia. However, studies have
+revealed privacy vulnerabilities in FL, where adversaries can potentially infer
+sensitive information from the shared model parameters. In this paper, we
+present an efficient masking-based secure aggregation scheme utilizing
+lightweight cryptographic primitives to mitigate privacy risks. Our scheme
+offers several advantages over existing methods. First, it requires only a
+single setup phase for the entire FL training session, significantly reducing
+communication overhead. Second, it minimizes user-side overhead by eliminating
+the need for user-to-user interactions, utilizing an intermediate server layer
+and a lightweight key negotiation method. Third, the scheme is highly resilient
+to user dropouts, and the users can join at any FL round. Fourth, it can detect
+and defend against malicious server activities, including recently discovered
+model inconsistency attacks. Finally, our scheme ensures security in both
+semi-honest and malicious settings. We provide security analysis to formally
+prove the robustness of our approach. Furthermore, we implemented an end-to-end
+prototype of our scheme. We conducted comprehensive experiments and
+comparisons, which show that it outperforms existing solutions in terms of
+communication and computation overhead, functionality, and security.
 
-Graph learning has attracted significant attention due to its widespread
-real-world applications. Current mainstream approaches rely on text node
-features and obtain initial node embeddings through shallow embedding learning
-using GNNs, which shows limitations in capturing deep textual semantics. Recent
-advances in Large Language Models (LLMs) have demonstrated superior
-capabilities in understanding text semantics, transforming traditional text
-feature processing. This paper proposes a novel framework that combines Graph
-Transformer architecture with LLM-enhanced node features. Specifically, we
-leverage LLMs to generate rich semantic representations of text nodes, which
-are then processed by a multi-head self-attention mechanism in the Graph
-Transformer to capture both local and global graph structural information. Our
-model utilizes the Transformer's attention mechanism to dynamically aggregate
-neighborhood information while preserving the semantic richness provided by LLM
-embeddings. Experimental results demonstrate that the LLM-enhanced node
-features significantly improve the performance of graph learning models on node
-classification tasks. This approach shows promising results across multiple
-graph learning tasks, offering a practical direction for combining graph
-networks with language models.
+摘要：聯合式學習 (FL) 使用者可以透過僅分享本機模型，在不將其私人資料揭露給中央伺服器的情況下，共同訓練全球機器學習模型。這種分散式學習在資料隱私至關重要的場景中特別具有吸引力，並且已獲得業界和學術界的廣泛關注。然而，研究顯示 FL 中存在隱私漏洞，其中對手可能會從共享模型參數中推斷出敏感資訊。在本文中，我們提出了一種有效率的基於遮罩的安全聚合方案，利用輕量級的密碼原語來降低隱私風險。我們的方案相較於現有方法提供了多項優點。首先，它僅需要在整個 FL 訓練階段進行一次設定階段，大幅降低了通訊開銷。其次，透過消除使用者間互動的需要，利用中間伺服器層和輕量級金鑰協商方法，將使用者端的開銷降到最低。第三，該方案對使用者中斷具有高度的復原力，使用者可以在任何 FL 回合中加入。第四，它可以偵測和防禦惡意伺服器活動，包括最近發現的模型不一致攻擊。最後，我們的方案確保在半誠實和惡意設定中都能獲得安全性。我們提供了安全分析，以正式證明我們方法的穩健性。此外，我們實作了我們方案的端對端原型。我們進行了全面的實驗和比較，結果顯示，在通訊和運算開銷、功能和安全性方面，它優於現有的解決方案。
 
-摘要：圖形學習因其廣泛的現實世界應用而備受關注。目前的熱門方法依賴於文本節點特徵，並通過使用 GNN 的淺層嵌入學習來獲取初始節點嵌入，這在捕捉深度文本語義方面表現出局限性。大語言模型 (LLM) 的最新進展已證明在理解文本語義方面具有優越的能力，轉換了傳統的文本特徵處理。本文提出了一種新的框架，將圖形轉換器架構與 LLM 增強的節點特徵相結合。具體來說，我們利用 LLM 來生成文本節點的豐富語義表示，然後在圖形轉換器中由多頭自我注意機制處理，以捕捉局部和全局圖形結構信息。我們的模型利用 Transformer 的注意機制來動態聚合鄰域信息，同時保留 LLM 嵌入提供的語義豐富性。實驗結果表明，LLM 增強的節點特徵顯著提高了圖形學習模型在節點分類任務上的性能。這種方法在多個圖形學習任務中顯示出有希望的結果，為將圖形網絡與語言模型相結合提供了實用的方向。
+##### **Neural Force Field: Learning Generalized Physical Representation from a Few Examples**
+2502.08987v1 by Shiqian Li, Ruihong Shen, Chi Zhang, Yixin Zhu
 
-##### **Cardiverse: Harnessing LLMs for Novel Card Game Prototyping**
-2502.07128v1 by Danrui Li, Sen Zhang, Sam S. Sohn, Kaidong Hu, Muhammad Usman, Mubbasir Kapadia
+Physical reasoning is a remarkable human ability that enables rapid learning
+and generalization from limited experience. Current AI models, despite
+extensive training, still struggle to achieve similar generalization,
+especially in Out-of-distribution (OOD) settings. This limitation stems from
+their inability to abstract core physical principles from observations. A key
+challenge is developing representations that can efficiently learn and
+generalize physical dynamics from minimal data. Here we present Neural Force
+Field (NFF) a modeling framework built on Neural Ordinary Differential Equation
+(NODE) that learns interpretable force field representations which can be
+efficiently integrated through an Ordinary Differential Equation ( ODE) solver
+to predict object trajectories. Unlike existing approaches that rely on
+high-dimensional latent spaces, NFF captures fundamental physical concepts such
+as gravity, support, and collision in an interpretable manner. Experiments on
+two challenging physical reasoning tasks demonstrate that NFF, trained with
+only a few examples, achieves strong generalization to unseen scenarios. This
+physics-grounded representation enables efficient forward-backward planning and
+rapid adaptation through interactive refinement. Our work suggests that
+incorporating physics-inspired representations into learning systems can help
+bridge the gap between artificial and human physical reasoning capabilities.
 
-The prototyping of computer games, particularly card games, requires
-extensive human effort in creative ideation and gameplay evaluation. Recent
-advances in Large Language Models (LLMs) offer opportunities to automate and
-streamline these processes. However, it remains challenging for LLMs to design
-novel game mechanics beyond existing databases, generate consistent gameplay
-environments, and develop scalable gameplay AI for large-scale evaluations.
-This paper addresses these challenges by introducing a comprehensive automated
-card game prototyping framework. The approach highlights a graph-based indexing
-method for generating novel game designs, an LLM-driven system for consistent
-game code generation validated by gameplay records, and a gameplay AI
-constructing method that uses an ensemble of LLM-generated action-value
-functions optimized through self-play. These contributions aim to accelerate
-card game prototyping, reduce human labor, and lower barriers to entry for game
-developers.
+摘要：物理推理是人类非凡的能力，它能从有限的经验中快速学习和概括。尽管经过广泛的训练，但当前的人工智能模型在实现类似的概括方面仍然存在困难，尤其是在分布外 (OOD) 设置中。这种限制源于它们无法从观察中抽象出核心物理原理。一个关键挑战是开发能够从最少数据中有效学习和概括物理动力学的表示。在这里，我们介绍了神经力场 (NFF)，这是一种建立在神经常微分方程 (NODE) 上的建模框架，它学习可解释的力场表示，这些表示可以通过常微分方程 (ODE) 求解器有效地进行积分，以预测物体轨迹。与依赖于高维潜在空间的现有方法不同，NFF 以可解释的方式捕获了诸如重力、支撑和碰撞等基本物理概念。在两个具有挑战性的物理推理任务上的实验表明，仅通过几个示例训练的 NFF 实现了对看不见场景的强大概括。这种基于物理的表示能够进行高效的前向后向规划，并通过交互式细化实现快速适应。我们的工作表明，将受物理启发的表示纳入学习系统可以帮助弥合人工智能和人类物理推理能力之间的差距。
 
-摘要：電腦遊戲，尤其是卡牌遊戲的原型製作，需要大量的人力在創意構思和遊戲玩法評估上。大型語言模型 (LLM) 的最新進展提供了自動化和簡化這些流程的機會。然而，LLM 在設計超越現有資料庫的新穎遊戲機制、生成一致的遊戲環境，以及開發用於大規模評估的可擴充遊戲 AI 方面仍然面臨挑戰。本文通過引入一個全面的自動化卡牌遊戲原型製作框架來應對這些挑戰。該方法強調了一種基於圖表的索引方法，用於生成新穎的遊戲設計，一個由 LLM 驅動的系統，用於一致的遊戲程式碼生成，並由遊戲記錄驗證，以及一個遊戲 AI 構建方法，該方法使用由 LLM 生成的動作值函數的集合，通過自我對弈進行最佳化。這些貢獻旨在加速卡牌遊戲原型製作，減少人力，並降低遊戲開發人員的進入門檻。
+##### **Tuning-Free Personalized Alignment via Trial-Error-Explain In-Context Learning**
+2502.08972v1 by Hyundong Cho, Karishma Sharma, Nicolaas Jedema, Leonardo F. R. Ribeiro, Alessandro Moschitti, Ravi Krishnan, Jonathan May
 
-##### **GraNNite: Enabling High-Performance Execution of Graph Neural Networks on Resource-Constrained Neural Processing Units**
-2502.06921v2 by Arghadip Das, Shamik Kundu, Arnab Raha, Soumendu Ghosh, Deepak Mathaikutty, Vijay Raghunathan
+Language models are aligned to the collective voice of many, resulting in
+generic outputs that do not align with specific users' styles. In this work, we
+present Trial-Error-Explain In-Context Learning (TICL), a tuning-free method
+that personalizes language models for text generation tasks with fewer than 10
+examples per user. TICL iteratively expands an in-context learning prompt via a
+trial-error-explain process, adding model-generated negative samples and
+explanations that provide fine-grained guidance towards a specific user's
+style. TICL achieves favorable win rates on pairwise comparisons with
+LLM-as-a-judge up to 91.5% against the previous state-of-the-art and
+outperforms competitive tuning-free baselines for personalized alignment tasks
+of writing emails, essays and news articles. Both lexical and qualitative
+analyses show that the negative samples and explanations enable language models
+to learn stylistic context more effectively and overcome the bias towards
+structural and formal phrases observed in their zero-shot outputs. By
+front-loading inference compute to create a user-specific in-context learning
+prompt that does not require extra generation steps at test time, TICL presents
+a novel yet simple approach for personalized alignment.
 
-Graph Neural Networks (GNNs) are vital for learning from graph-structured
-data, enabling applications in network analysis, recommendation systems, and
-speech analytics. Deploying them on edge devices like client PCs and laptops
-enhances real-time processing, privacy, and cloud independence. GNNs aid
-Retrieval-Augmented Generation (RAG) for Large Language Models (LLMs) and
-enable event-based vision tasks. However, irregular memory access, sparsity,
-and dynamic structures cause high latency and energy overhead on
-resource-constrained devices. While modern edge processors integrate CPUs,
-GPUs, and NPUs, NPUs designed for data-parallel tasks struggle with irregular
-GNN computations. We introduce GraNNite, the first hardware-aware framework
-optimizing GNN execution on commercial-off-the-shelf (COTS) SOTA DNN
-accelerators via a structured three-step methodology: (1) enabling NPU
-execution, (2) optimizing performance, and (3) trading accuracy for efficiency
-gains. Step 1 employs GraphSplit for workload distribution and StaGr for static
-aggregation, while GrAd and NodePad handle dynamic graphs. Step 2 boosts
-performance using EffOp for control-heavy tasks and GraSp for sparsity
-exploitation. Graph Convolution optimizations PreG, SymG, and CacheG reduce
-redundancy and memory transfers. Step 3 balances quality versus efficiency,
-where QuantGr applies INT8 quantization, and GrAx1, GrAx2, and GrAx3 accelerate
-attention, broadcast-add, and SAGE-max aggregation. On Intel Core Ultra AI PCs,
-GraNNite achieves 2.6X to 7.6X speedups over default NPU mappings and up to
-8.6X energy gains over CPUs and GPUs, delivering 10.8X and 6.7X higher
-performance than CPUs and GPUs, respectively, across GNN models.
+摘要：語言模型與眾人的集體聲音保持一致，導致產出內容流於一般，無法與特定使用者的風格相符。在這項工作中，我們提出了試驗錯誤解釋情境內學習 (TICL)，一種免調校方法，能為文字生成任務個人化語言模型，每個使用者少於 10 個範例。TICL 透過試驗錯誤解釋程序反覆擴充情境內學習提示，加入模型產生的負面範例和說明，提供細緻的指導，引導至特定使用者的風格。TICL 在與 LLM 作為評審的成對比較中獲得了高勝率，高達 91.5%，優於先前的技術水準，並在個人化對齊任務中超越了競爭性的免調校基準，包括撰寫電子郵件、論文和新聞文章。詞彙和質性分析皆顯示，負面範例和說明讓語言模型能更有效地學習風格脈絡，並克服零次學習產出中觀察到的結構性和正式詞組偏誤。透過預先加載推論運算，建立使用者特定的情境內學習提示，無需在測試時額外產生步驟，TICL 呈現一種新穎卻簡潔的方法，用於個人化對齊。
 
-摘要：圖形神經網路 (GNN) 對於從圖形結構資料中學習至關重要，能應用於網路分析、推薦系統和語音分析。將其部署在邊緣裝置（例如用戶端電腦和筆電）上可增強即時處理、隱私和雲端獨立性。GNN 協助大型語言模型 (LLM) 的檢索增強生成 (RAG)，並支援基於事件的視覺任務。然而，不規則的記憶體存取、稀疏性和動態結構會導致資源受限裝置上的高延遲和能源負擔。儘管現代邊緣處理器整合了 CPU、GPU 和 NPU，但針對資料平行任務所設計的 NPU 難以處理不規則的 GNN 計算。我們引入了 GraNNite，這是第一個硬體感知框架，透過結構化的三步驟方法最佳化商用現成 (COTS) SOTA DNN 加速器上的 GNN 執行：(1) 啟用 NPU 執行，(2) 最佳化效能，以及 (3) 以準確度換取效率提升。步驟 1 使用 GraphSplit 進行工作負載分配，並使用 StaGr 進行靜態聚合，而 GrAd 和 NodePad 則處理動態圖形。步驟 2 使用 EffOp 提升控制密集型任務的效能，並使用 GraSp 進行稀疏性利用。圖形卷積最佳化 PreG、SymG 和 CacheG 減少了冗餘和記憶體傳輸。步驟 3 平衡品質與效率，其中 QuantGr 適用 INT8 量化，而 GrAx1、GrAx2 和 GrAx3 則加速注意力、廣播加法和 SAGE-max 聚合。在 Intel Core Ultra AI PC 上，GraNNite 在預設 NPU 映射上實現了 2.6X 到 7.6X 的加速，在 CPU 和 GPU 上實現了高達 8.6X 的能源增益，在 GNN 模型中分別提供了比 CPU 和 GPU 高出 10.8X 和 6.7X 的效能。
+##### **RTBAS: Defending LLM Agents Against Prompt Injection and Privacy Leakage**
+2502.08966v1 by Peter Yong Zhong, Siyuan Chen, Ruiqi Wang, McKenna McCall, Ben L. Titzer, Heather Miller
 
-##### **Automatic Annotation Augmentation Boosts Translation between Molecules and Natural Language**
-2502.06634v1 by Zhiqiang Zhong, Simon Sataa-Yu Larsen, Haoyu Guo, Tao Tang, Kuangyu Zhou, Davide Mottin
+Tool-Based Agent Systems (TBAS) allow Language Models (LMs) to use external
+tools for tasks beyond their standalone capabilities, such as searching
+websites, booking flights, or making financial transactions. However, these
+tools greatly increase the risks of prompt injection attacks, where malicious
+content hijacks the LM agent to leak confidential data or trigger harmful
+actions. Existing defenses (OpenAI GPTs) require user confirmation before every
+tool call, placing onerous burdens on users. We introduce Robust TBAS (RTBAS),
+which automatically detects and executes tool calls that preserve integrity and
+confidentiality, requiring user confirmation only when these safeguards cannot
+be ensured. RTBAS adapts Information Flow Control to the unique challenges
+presented by TBAS. We present two novel dependency screeners, using
+LM-as-a-judge and attention-based saliency, to overcome these challenges.
+Experimental results on the AgentDojo Prompt Injection benchmark show RTBAS
+prevents all targeted attacks with only a 2% loss of task utility when under
+attack, and further tests confirm its ability to obtain near-oracle performance
+on detecting both subtle and direct privacy leaks.
 
-Recent advancements in AI for biological research focus on integrating
-molecular data with natural language to accelerate drug discovery. However, the
-scarcity of high-quality annotations limits progress in this area. This paper
-introduces LA$^3$, a Language-based Automatic Annotation Augmentation framework
-that leverages large language models to augment existing datasets, thereby
-improving AI training. We demonstrate the effectiveness of LA$^3$ by creating
-an enhanced dataset, LaChEBI-20, where we systematically rewrite the
-annotations of molecules from an established dataset. These rewritten
-annotations preserve essential molecular information while providing more
-varied sentence structures and vocabulary. Using LaChEBI-20, we train LaMolT5
-based on a benchmark architecture to learn the mapping between molecular
-representations and augmented annotations.
-  Experimental results on text-based *de novo* molecule generation and molecule
-captioning demonstrate that LaMolT5 outperforms state-of-the-art models.
-Notably, incorporating LA$^3$ leads to improvements of up to 301% over the
-benchmark architecture. Furthermore, we validate the effectiveness of LA$^3$
-notable applications in *image*, *text* and *graph* tasks, affirming its
-versatility and utility.
+摘要：基於工具的代理系統 (TBAS) 允許語言模型 (LM) 使用外部工具來執行超出其獨立功能的任務，例如搜尋網站、預訂航班或進行金融交易。然而，這些工具大幅增加了提示注入攻擊的風險，其中惡意內容劫持 LM 代理程式以洩露機密資料或觸發有害動作。現有的防禦措施 (OpenAI GPT) 在每次呼叫工具之前都需要使用者確認，這會對使用者造成沉重的負擔。我們引入了穩健的 TBAS (RTBAS)，它會自動偵測並執行保留完整性與機密性的工具呼叫，僅在無法確保這些防護措施時才需要使用者確認。RTBAS 將資訊流控制調整為 TBAS 呈現的獨特挑戰。我們提出兩種新穎的相依性篩選器，使用 LM 作為判斷者和基於注意力的顯著性，以克服這些挑戰。AgentDojo 提示注入基準上的實驗結果顯示，RTBAS 在受到攻擊時僅損失 2% 的任務效用，即可防止所有目標攻擊，進一步的測試證實了其在偵測細微和直接的隱私洩漏方面獲得接近神諭效能的能力。
 
-摘要：<paragraph>人工智慧在生物研究上的最新進展，專注於將分子資料與自然語言整合，以加速藥物發現。然而，高品質註解的稀少限制了此領域的進展。這篇論文介紹了 LA$^3$，一個基於語言的自動註解擴充框架，它利用大型語言模型來擴充現有的資料集，進而改善人工智慧訓練。我們透過建立一個增強的資料集 LaChEBI-20 來展示 LA$^3$ 的有效性，我們系統性地改寫了一個既定資料集中分子的註解。這些改寫的註解保留了重要的分子資訊，同時提供了更多樣化的句子結構和詞彙。使用 LaChEBI-20，我們在基於基準架構上訓練 LaMolT5，以學習分子表示和擴充註解之間的對應。
-在基於文字的 *從頭開始* 分子生成和分子標題上的實驗結果表明，LaMolT5 優於最先進的模型。值得注意的是，納入 LA$^3$ 可讓基準架構的改進幅度高達 301%。此外，我們驗證了 LA$^3$ 在 *影像*、*文字* 和 *圖形* 任務中的有效性，肯定了它的多功能性和實用性。</paragraph>
+##### **Biologically Plausible Brain Graph Transformer**
+2502.08958v1 by Ciyuan Peng, Yuelong Huang, Qichao Dong, Shuo Yu, Feng Xia, Chengqi Zhang, Yaochu Jin
 
-##### **KARMA: Leveraging Multi-Agent LLMs for Automated Knowledge Graph Enrichment**
-2502.06472v1 by Yuxing Lu, Jinzhuo Wang
+State-of-the-art brain graph analysis methods fail to fully encode the
+small-world architecture of brain graphs (accompanied by the presence of hubs
+and functional modules), and therefore lack biological plausibility to some
+extent. This limitation hinders their ability to accurately represent the
+brain's structural and functional properties, thereby restricting the
+effectiveness of machine learning models in tasks such as brain disorder
+detection. In this work, we propose a novel Biologically Plausible Brain Graph
+Transformer (BioBGT) that encodes the small-world architecture inherent in
+brain graphs. Specifically, we present a network entanglement-based node
+importance encoding technique that captures the structural importance of nodes
+in global information propagation during brain graph communication,
+highlighting the biological properties of the brain structure. Furthermore, we
+introduce a functional module-aware self-attention to preserve the functional
+segregation and integration characteristics of brain graphs in the learned
+representations. Experimental results on three benchmark datasets demonstrate
+that BioBGT outperforms state-of-the-art models, enhancing biologically
+plausible brain graph representations for various brain graph analytical tasks
 
-Maintaining comprehensive and up-to-date knowledge graphs (KGs) is critical
-for modern AI systems, but manual curation struggles to scale with the rapid
-growth of scientific literature. This paper presents KARMA, a novel framework
-employing multi-agent large language models (LLMs) to automate KG enrichment
-through structured analysis of unstructured text. Our approach employs nine
-collaborative agents, spanning entity discovery, relation extraction, schema
-alignment, and conflict resolution that iteratively parse documents, verify
-extracted knowledge, and integrate it into existing graph structures while
-adhering to domain-specific schema. Experiments on 1,200 PubMed articles from
-three different domains demonstrate the effectiveness of KARMA in knowledge
-graph enrichment, with the identification of up to 38,230 new entities while
-achieving 83.1\% LLM-verified correctness and reducing conflict edges by 18.6\%
-through multi-layer assessments.
+摘要：目前最先进的大腦圖形分析方法無法完全編碼大腦圖形的小世界架構（伴隨著樞紐和功能模組的存在），因此在某種程度上缺乏生物學上的可信度。這種限制阻礙了它們準確表示大腦結構和功能特性的能力，從而限制了機器學習模型在腦部疾病檢測等任務中的有效性。在這項工作中，我們提出了一個新的生物學上可信的大腦圖形轉換器 (BioBGT)，它編碼了大腦圖形中固有的、小世界的架構。具體來說，我們提出了一種基於網路糾纏的節點重要性編碼技術，它捕捉了大腦圖形通信過程中節點在全球資訊傳播中的結構重要性，突出了大腦結構的生物學特性。此外，我們引入了一個功能模組感知自注意力，以保留學習表徵中大腦圖形的功能分離和整合特性。在三個基準資料集上的實驗結果表明，BioBGT 優於最先進的模型，增強了各種大腦圖形分析任務的生物學上可信的大腦圖形表徵
 
-摘要：維護全面且最新的知識圖譜 (KG) 對現代 AI 系統至關重要，但手動策劃難以隨著科學文獻的快速增長而擴展。本文提出了 KARMA，一個採用多代理大型語言模型 (LLM) 的新框架，透過對非結構化文本的結構化分析來自動化 KG 豐富化。我們的做法採用九個協作代理，涵蓋實體發現、關係提取、架構比對和衝突解決，這些代理會反覆分析文件、驗證提取的知識，並將其整合到現有的圖結構中，同時遵守特定領域的架構。針對來自三個不同領域的 1,200 篇 PubMed 文章進行的實驗證明了 KARMA 在知識圖譜豐富化方面的有效性，識別出多達 38,230 個新實體，同時達到 83.1% 的 LLM 驗證正確性，並透過多層評估將衝突邊緣降低了 18.6%。
+##### **Medicine on the Edge: Comparative Performance Analysis of On-Device LLMs for Clinical Reasoning**
+2502.08954v1 by Leon Nissen, Philipp Zagar, Vishnu Ravi, Aydin Zahedivash, Lara Marie Reimer, Stephan Jonas, Oliver Aalami, Paul Schmiedmayer
 
-##### **RoToR: Towards More Reliable Responses for Order-Invariant Inputs**
-2502.08662v1 by Soyoung Yoon, Dongha Ahn, Youngwon Lee, Minkyu Jung, HyungJoo Jang, Seung-won Hwang
+The deployment of Large Language Models (LLM) on mobile devices offers
+significant potential for medical applications, enhancing privacy, security,
+and cost-efficiency by eliminating reliance on cloud-based services and keeping
+sensitive health data local. However, the performance and accuracy of on-device
+LLMs in real-world medical contexts remain underexplored. In this study, we
+benchmark publicly available on-device LLMs using the AMEGA dataset, evaluating
+accuracy, computational efficiency, and thermal limitation across various
+mobile devices. Our results indicate that compact general-purpose models like
+Phi-3 Mini achieve a strong balance between speed and accuracy, while medically
+fine-tuned models such as Med42 and Aloe attain the highest accuracy. Notably,
+deploying LLMs on older devices remains feasible, with memory constraints
+posing a greater challenge than raw processing power. Our study underscores the
+potential of on-device LLMs for healthcare while emphasizing the need for more
+efficient inference and models tailored to real-world clinical reasoning.
 
-Mitigating positional bias of language models (LMs) for listwise inputs is a
-well-known and important problem (e.g., lost-in-the-middle). While zero-shot
-order-invariant LMs have been proposed to solve this issue, their success on
-practical listwise problems has been limited. In this work, as a first
-contribution, we identify and overcome two limitations to make zero-shot
-invariant LMs more practical: (1) training and inference distribution mismatch
-arising from modifying positional ID assignments to enforce invariance, and (2)
-failure to adapt to a mixture of order-invariant and sensitive inputs in
-practical listwise problems. To overcome, we propose (1) RoToR, a zero-shot
-invariant LM for genuinely order-invariant inputs with minimal modifications of
-positional IDs, and (2) Selective Routing, an adaptive framework that handles
-both order-invariant and order-sensitive inputs in listwise tasks. On the Lost
-in the middle (LitM), Knowledge Graph Question Answering (KGQA), and MMLU
-benchmarks, we show that RoToR with Selective Routing can effectively handle
-practical listwise input tasks in a zero-shot manner.
+摘要：大型語言模型 (LLM) 在行動裝置上的部署為醫療應用程式提供了巨大的潛力，透過消除對雲端服務的依賴並將敏感的健康資料儲存在本地，進而提升隱私、安全性，並提高成本效益。然而，在實際的醫療環境中，裝置上 LLM 的效能和準確度仍未受到充分的探討。在此研究中，我們使用 AMEGA 資料集來評量公開可用的裝置上 LLM，並評估其在各種行動裝置上的準確度、運算效率和熱限制。我們的結果顯示，像 Phi-3 Mini 等精簡的一般用途模型在速度和準確度之間取得了良好的平衡，而經過醫學微調的模型，例如 Med42 和 Aloe，則達到了最高的準確度。值得注意的是，在較舊的裝置上部署 LLM 仍然可行，記憶體限制比原始處理能力構成更大的挑戰。我們的研究強調了裝置上 LLM 在醫療保健方面的潛力，同時強調了對更有效率的推理和針對實際臨床推理量身打造的模型的需求。
 
-摘要：語言模型 (LM) 的位置偏差緩解對於列表輸入來說是一個廣為人知且重要的問題（例如，迷失在中間）。雖然已經提出零次學習順序不變的 LM 來解決這個問題，但它們在實際列表問題上的成功卻很有限。在這項工作中，作為第一個貢獻，我們找出並克服了兩個限制，讓零次學習不變的 LM 更有實用性：(1) 訓練和推論分布不匹配，這是由於修改位置 ID 分配以強制不變性所造成的，以及 (2) 無法適應實際列表問題中不變和敏感輸入的組合。為了克服這些問題，我們提出 (1) RoToR，一個零次學習不變的 LM，用於真正不變的輸入，並對位置 ID 進行最小的修改，以及 (2) 選擇性路由，一個自適應框架，用於處理列表任務中不變和敏感的輸入。在迷失在中間 (LitM)、知識圖譜問答 (KGQA) 和 MMLU 基準測試中，我們展示了 RoToR 與選擇性路由可以有效地以零次學習的方式處理實際的列表輸入任務。
 
-##### **K-ON: Stacking Knowledge On the Head Layer of Large Language Model**
-2502.06257v1 by Lingbing Guo, Yichi Zhang, Zhongpu Bo, Zhuo Chen, Mengshu Sun, Zhiqiang Zhang, Wen Zhang, Huajun Chen
+### Medical explainable AI
+|Publish Date|Title|Authors|Homepage|Code|
+| :---: | :---: | :---: | :---: | :---: |
+|**2025-02-13**|**Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**|Flavio Bertini et.al.|[2502.09218v1](http://arxiv.org/abs/2502.09218v1)|null|
+|**2025-02-10**|**Foundation Model of Electronic Medical Records for Adaptive Risk Estimation**|Pawel Renc et.al.|[2502.06124v1](http://arxiv.org/abs/2502.06124v1)|null|
+|**2025-01-27**|**An Explainable Disease Surveillance System for Early Prediction of Multiple Chronic Diseases**|Shaheer Ahmad Khan et.al.|[2501.15969v1](http://arxiv.org/abs/2501.15969v1)|null|
+|**2025-01-23**|**Ensuring Medical AI Safety: Explainable AI-Driven Detection and Mitigation of Spurious Model Behavior and Associated Data**|Frederik Pahde et.al.|[2501.13818v1](http://arxiv.org/abs/2501.13818v1)|[link](https://github.com/frederikpahde/medical-ai-safety)|
+|**2025-01-19**|**Enhanced Suicidal Ideation Detection from Social Media Using a CNN-BiLSTM Hybrid Model**|Mohaiminul Islam Bhuiyan et.al.|[2501.11094v1](http://arxiv.org/abs/2501.11094v1)|null|
+|**2025-01-17**|**SEANN: A Domain-Informed Neural Network for Epidemiological Insights**|Jean-Baptiste Guimbaud et.al.|[2501.10273v1](http://arxiv.org/abs/2501.10273v1)|null|
+|**2025-01-16**|**Artificial Intelligence-Driven Clinical Decision Support Systems**|Muhammet Alkan et.al.|[2501.09628v1](http://arxiv.org/abs/2501.09628v1)|null|
+|**2025-01-12**|**MedGrad E-CLIP: Enhancing Trust and Transparency in AI-Driven Skin Lesion Diagnosis**|Sadia Kamal et.al.|[2501.06887v1](http://arxiv.org/abs/2501.06887v1)|null|
+|**2025-01-06**|**Explaining Humour Style Classifications: An XAI Approach to Understanding Computational Humour Analysis**|Mary Ogbuka Kenneth et.al.|[2501.02891v1](http://arxiv.org/abs/2501.02891v1)|null|
+|**2024-12-28**|**The Emotional Spectrum of LLMs: Leveraging Empathy and Emotion-Based Markers for Mental Health Support**|Alessandro De Grandi et.al.|[2412.20068v1](http://arxiv.org/abs/2412.20068v1)|null|
+|**2024-12-27**|**A Review on the Integration of Artificial Intelligence and Medical Imaging in IVF Ovarian Stimulation**|Jana Zakall et.al.|[2412.19688v1](http://arxiv.org/abs/2412.19688v1)|null|
+|**2024-12-23**|**Enhancing Cancer Diagnosis with Explainable & Trustworthy Deep Learning Models**|Badaru I. Olumuyiwa et.al.|[2412.17527v1](http://arxiv.org/abs/2412.17527v1)|null|
+|**2024-12-20**|**Towards Interpretable Radiology Report Generation via Concept Bottlenecks using a Multi-Agentic RAG**|Hasan Md Tusfiqur Alam et.al.|[2412.16086v2](http://arxiv.org/abs/2412.16086v2)|[link](https://github.com/tifat58/irr-with-cbm-rag)|
+|**2024-12-20**|**Critique of Impure Reason: Unveiling the reasoning behaviour of medical Large Language Models**|Shamus Sim et.al.|[2412.15748v1](http://arxiv.org/abs/2412.15748v1)|null|
+|**2024-12-18**|**Cognition Chain for Explainable Psychological Stress Detection on Social Media**|Xin Wang et.al.|[2412.14009v1](http://arxiv.org/abs/2412.14009v1)|null|
+|**2024-11-30**|**2-Factor Retrieval for Improved Human-AI Decision Making in Radiology**|Jim Solomon et.al.|[2412.00372v1](http://arxiv.org/abs/2412.00372v1)|null|
+|**2024-11-28**|**Mapping Public Perception of Artificial Intelligence: Expectations, Risk-Benefit Tradeoffs, and Value As Determinants for Societal Acceptance**|Philipp Brauner et.al.|[2411.19356v1](http://arxiv.org/abs/2411.19356v1)|null|
+|**2024-11-26**|**Explainable AI for Classifying UTI Risk Groups Using a Real-World Linked EHR and Pathology Lab Dataset**|Yujie Dai et.al.|[2411.17645v2](http://arxiv.org/abs/2411.17645v2)|null|
+|**2024-11-18**|**Exploring the Requirements of Clinicians for Explainable AI Decision Support Systems in Intensive Care**|Jeffrey N. Clark et.al.|[2411.11774v1](http://arxiv.org/abs/2411.11774v1)|null|
+|**2024-11-15**|**Artificial Intelligence in Pediatric Echocardiography: Exploring Challenges, Opportunities, and Clinical Applications with Explainable AI and Federated Learning**|Mohammed Yaseen Jabarulla et.al.|[2411.10255v1](http://arxiv.org/abs/2411.10255v1)|null|
+|**2024-11-01**|**Enhancing Osteoporosis Detection: An Explainable Multi-Modal Learning Framework with Feature Fusion and Variable Clustering**|Mehdi Hosseini Chagahi et.al.|[2411.00916v2](http://arxiv.org/abs/2411.00916v2)|null|
+|**2024-10-25**|**A Review of Deep Learning Approaches for Non-Invasive Cognitive Impairment Detection**|Muath Alsuhaibani et.al.|[2410.19898v1](http://arxiv.org/abs/2410.19898v1)|null|
+|**2024-10-23**|**An Ontology-Enabled Approach For User-Centered and Knowledge-Enabled Explanations of AI Systems**|Shruthi Chari et.al.|[2410.17504v1](http://arxiv.org/abs/2410.17504v1)|[link](https://github.com/tetherless-world/metaexplainer)|
+|**2024-10-22**|**Contrasting Attitudes Towards Current and Future AI Applications for Computerised Interpretation of ECG: A Clinical Stakeholder Interview Study**|Lukas Hughes-Noehrer et.al.|[2410.16879v1](http://arxiv.org/abs/2410.16879v1)|null|
+|**2024-10-19**|**Pathologist-like explainable AI for interpretable Gleason grading in prostate cancer**|Gesa Mittmann et.al.|[2410.15012v1](http://arxiv.org/abs/2410.15012v1)|null|
+|**2024-10-15**|**Explainable AI Methods for Multi-Omics Analysis: A Survey**|Ahmad Hussein et.al.|[2410.11910v1](http://arxiv.org/abs/2410.11910v1)|null|
+|**2024-10-14**|**Study on the Helpfulness of Explainable Artificial Intelligence**|Tobias Labarta et.al.|[2410.11896v1](http://arxiv.org/abs/2410.11896v1)|[link](https://github.com/tlabarta/helpfulnessofxai)|
+|**2024-10-12**|**Use of What-if Scenarios to Help Explain Artificial Intelligence Models for Neonatal Health**|Abdullah Mamun et.al.|[2410.09635v1](http://arxiv.org/abs/2410.09635v1)|[link](https://github.com/ab9mamun/aimen)|
+|**2024-10-10**|**Artificial intelligence techniques in inherited retinal diseases: A review**|Han Trinh et.al.|[2410.09105v1](http://arxiv.org/abs/2410.09105v1)|null|
+|**2024-10-07**|**CasiMedicos-Arg: A Medical Question Answering Dataset Annotated with Explanatory Argumentative Structures**|Ekaterina Sviridova et.al.|[2410.05235v2](http://arxiv.org/abs/2410.05235v2)|[link](https://github.com/ixa-ehu/antidote-casimedicos)|
+|**2024-10-01**|**Explainable Diagnosis Prediction through Neuro-Symbolic Integration**|Qiuhao Lu et.al.|[2410.01855v2](http://arxiv.org/abs/2410.01855v2)|null|
+|**2024-10-01**|**Easydiagnos: a framework for accurate feature selection for automatic diagnosis in smart healthcare**|Prasenjit Maji et.al.|[2410.00366v1](http://arxiv.org/abs/2410.00366v1)|null|
+|**2024-09-20**|**Dermatologist-like explainable AI enhances melanoma diagnosis accuracy: eye-tracking study**|Tirtha Chanda et.al.|[2409.13476v1](http://arxiv.org/abs/2409.13476v1)|null|
+|**2024-09-19**|**Explainable AI for Autism Diagnosis: Identifying Critical Brain Regions Using fMRI Data**|Suryansh Vidya et.al.|[2409.15374v1](http://arxiv.org/abs/2409.15374v1)|null|
+|**2024-09-19**|**Improving Prototypical Parts Abstraction for Case-Based Reasoning Explanations Designed for the Kidney Stone Type Recognition**|Daniel Flores-Araiza et.al.|[2409.12883v1](http://arxiv.org/abs/2409.12883v1)|null|
+|**2024-09-18**|**Towards Interpretable End-Stage Renal Disease (ESRD) Prediction: Utilizing Administrative Claims Data with Explainable AI Techniques**|Yubo Li et.al.|[2409.12087v3](http://arxiv.org/abs/2409.12087v3)|null|
+|**2024-09-13**|**Contextual Evaluation of Large Language Models for Classifying Tropical and Infectious Diseases**|Mercy Asiedu et.al.|[2409.09201v3](http://arxiv.org/abs/2409.09201v3)|null|
+|**2024-09-09**|**Explainable AI: Definition and attributes of a good explanation for health AI**|Evangelia Kyrimi et.al.|[2409.15338v1](http://arxiv.org/abs/2409.15338v1)|null|
+|**2024-08-30**|**Exploring the Effect of Explanation Content and Format on User Comprehension and Trust**|Antonio Rago et.al.|[2408.17401v1](http://arxiv.org/abs/2408.17401v1)|null|
+|**2024-08-29**|**A Survey for Large Language Models in Biomedicine**|Chong Wang et.al.|[2409.00133v1](http://arxiv.org/abs/2409.00133v1)|null|
+|**2024-08-27**|**Aligning XAI with EU Regulations for Smart Biomedical Devices: A Methodology for Compliance Analysis**|Francesco Sovrano et.al.|[2408.15121v1](http://arxiv.org/abs/2408.15121v1)|null|
+|**2024-08-24**|**Towards Case-based Interpretability for Medical Federated Learning**|Laura Latorre et.al.|[2408.13626v1](http://arxiv.org/abs/2408.13626v1)|null|
+|**2024-08-22**|**AI in radiological imaging of soft-tissue and bone tumours: a systematic review evaluating against CLAIM and FUTURE-AI guidelines**|Douwe J. Spaanderman et.al.|[2408.12491v1](http://arxiv.org/abs/2408.12491v1)|null|
+|**2024-08-14**|**Evaluating Explainable AI Methods in Deep Learning Models for Early Detection of Cerebral Palsy**|Kimji N. Pellano et.al.|[2409.00001v1](http://arxiv.org/abs/2409.00001v1)|null|
+|**2024-08-06**|**MicroXercise: A Micro-Level Comparative and Explainable System for Remote Physical Therapy**|Hanchen David Wang et.al.|[2408.11837v1](http://arxiv.org/abs/2408.11837v1)|null|
+|**2024-08-05**|**The Literature Review Network: An Explainable Artificial Intelligence for Systematic Literature Reviews, Meta-analyses, and Method Development**|Joshua Morriss et.al.|[2408.05239v1](http://arxiv.org/abs/2408.05239v1)|null|
+|**2024-08-05**|**Enhancing Medical Learning and Reasoning Systems: A Boxology-Based Comparative Analysis of Design Patterns**|Chi Him Ng et.al.|[2408.02709v1](http://arxiv.org/abs/2408.02709v1)|null|
+|**2024-08-05**|**Bayesian Kolmogorov Arnold Networks (Bayesian_KANs): A Probabilistic Approach to Enhance Accuracy and Interpretability**|Masoud Muhammed Hassan et.al.|[2408.02706v1](http://arxiv.org/abs/2408.02706v1)|null|
+|**2024-07-26**|**MLtoGAI: Semantic Web based with Machine Learning for Enhanced Disease Prediction and Personalized Recommendations using Generative AI**|Shyam Dongre et.al.|[2407.20284v1](http://arxiv.org/abs/2407.20284v1)|null|
+|**2024-07-25**|**Introducing δ-XAI: a novel sensitivity-based method for local AI explanations**|Alessandro De Carlo et.al.|[2407.18343v2](http://arxiv.org/abs/2407.18343v2)|null|
+|**2024-07-24**|**Enhanced Deep Learning Methodologies and MRI Selection Techniques for Dementia Diagnosis in the Elderly Population**|Nikolaos Ntampakis et.al.|[2407.17324v2](http://arxiv.org/abs/2407.17324v2)|null|
+|**2024-07-24**|**Using Large Language Models to Compare Explainable Models for Smart Home Human Activity Recognition**|Michele Fiori et.al.|[2408.06352v1](http://arxiv.org/abs/2408.06352v1)|null|
+|**2024-07-21**|**Explainable AI-based Intrusion Detection System for Industry 5.0: An Overview of the Literature, associated Challenges, the existing Solutions, and Potential Research Directions**|Naseem Khan et.al.|[2408.03335v1](http://arxiv.org/abs/2408.03335v1)|null|
+|**2024-07-18**|**A Comparative Study on Automatic Coding of Medical Letters with Explainability**|Jamie Glen et.al.|[2407.13638v1](http://arxiv.org/abs/2407.13638v1)|[link](https://github.com/Glenj01/Medical-Coding)|
+|**2024-07-09**|**Explainable AI for Enhancing Efficiency of DL-based Channel Estimation**|Abdul Karim Gizzini et.al.|[2407.07009v1](http://arxiv.org/abs/2407.07009v1)|null|
+|**2024-07-07**|**Explainable AI: Comparative Analysis of Normal and Dilated ResNet Models for Fundus Disease Classification**|P. N. Karthikayan et.al.|[2407.05440v2](http://arxiv.org/abs/2407.05440v2)|null|
+|**2024-07-03**|**A Survey on Trustworthiness in Foundation Models for Medical Image Analysis**|Congzhen Shi et.al.|[2407.15851v2](http://arxiv.org/abs/2407.15851v2)|null|
+|**2024-07-01**|**The Impact of an XAI-Augmented Approach on Binary Classification with Scarce Data**|Ximing Wen et.al.|[2407.06206v1](http://arxiv.org/abs/2407.06206v1)|null|
+|**2024-06-28**|**Can GPT-4 Help Detect Quit Vaping Intentions? An Exploration of Automatic Data Annotation Approach**|Sai Krishna Revanth Vuruma et.al.|[2407.00167v1](http://arxiv.org/abs/2407.00167v1)|null|
+|**2024-06-25**|**Towards Compositional Interpretability for XAI**|Sean Tull et.al.|[2406.17583v1](http://arxiv.org/abs/2406.17583v1)|null|
+|**2024-06-17**|**Slicing Through Bias: Explaining Performance Gaps in Medical Image Analysis using Slice Discovery Methods**|Vincent Olesen et.al.|[2406.12142v2](http://arxiv.org/abs/2406.12142v2)|[link](https://github.com/volesen/slicing-through-bias)|
+|**2024-06-11**|**Unlocking the Potential of Metaverse in Innovative and Immersive Digital Health**|Fatemeh Ebrahimzadeh et.al.|[2406.07114v2](http://arxiv.org/abs/2406.07114v2)|null|
+|**2024-06-10**|**AI-Driven Predictive Analytics Approach for Early Prognosis of Chronic Kidney Disease Using Ensemble Learning and Explainable AI**|K M Tawsik Jawad et.al.|[2406.06728v2](http://arxiv.org/abs/2406.06728v2)|null|
+|**2024-06-10**|**Explainable AI for Mental Disorder Detection via Social Media: A survey and outlook**|Yusif Ibrahimov et.al.|[2406.05984v1](http://arxiv.org/abs/2406.05984v1)|null|
+|**2024-06-09**|**Methodology and Real-World Applications of Dynamic Uncertain Causality Graph for Clinical Diagnosis with Explainability and Invariance**|Zhan Zhang et.al.|[2406.05746v1](http://arxiv.org/abs/2406.05746v1)|null|
+|**2024-06-07**|**Advancing Histopathology-Based Breast Cancer Diagnosis: Insights into Multi-Modality and Explainability**|Faseela Abdullakutty et.al.|[2406.12897v1](http://arxiv.org/abs/2406.12897v1)|null|
+|**2024-06-04**|**Using Explainable AI for EEG-based Reduced Montage Neonatal Seizure Detection**|Dinuka Sandun Udayantha et.al.|[2406.16908v3](http://arxiv.org/abs/2406.16908v3)|[link](https://github.com/dinuka-1999/braineocare)|
+|**2024-06-01**|**Breast Cancer Diagnosis: A Comprehensive Exploration of Explainable Artificial Intelligence (XAI) Techniques**|Samita Bai et.al.|[2406.00532v1](http://arxiv.org/abs/2406.00532v1)|null|
+|**2024-06-01**|**Unveiling Hidden Factors: Explainable AI for Feature Boosting in Speech Emotion Recognition**|Alaa Nfissi et.al.|[2406.01624v2](http://arxiv.org/abs/2406.01624v2)|[link](https://github.com/alaanfissi/unveiling-hidden-factors-explainable-ai-for-feature-boosting-in-speech-emotion-recognition)|
+|**2024-05-31**|**The Explanation Necessity for Healthcare AI**|Michail Mamalakis et.al.|[2406.00216v1](http://arxiv.org/abs/2406.00216v1)|null|
+|**2024-05-29**|**Interdisciplinary Expertise to Advance Equitable Explainable AI**|Chloe R. Bennett et.al.|[2406.18563v1](http://arxiv.org/abs/2406.18563v1)|null|
+|**2024-05-27**|**"It depends": Configuring AI to Improve Clinical Usefulness Across Contexts**|Hubert D. Zając et.al.|[2407.11978v1](http://arxiv.org/abs/2407.11978v1)|null|
+|**2024-05-26**|**Improving Health Professionals' Onboarding with AI and XAI for Trustworthy Human-AI Collaborative Decision Making**|Min Hun Lee et.al.|[2405.16424v1](http://arxiv.org/abs/2405.16424v1)|null|
+|**2024-05-26**|**Exploring Nutritional Impact on Alzheimer's Mortality: An Explainable AI Approach**|Ziming Liu et.al.|[2405.17502v1](http://arxiv.org/abs/2405.17502v1)|null|
+|**2024-05-24**|**Explainable AI Enhances Glaucoma Referrals, Yet the Human-AI Team Still Falls Short of the AI Alone**|Catalina Gomez et.al.|[2407.11974v1](http://arxiv.org/abs/2407.11974v1)|null|
+|**2024-05-23**|**Decoding Decision Reasoning: A Counterfactual-Powered Model for Knowledge Discovery**|Yingying Fang et.al.|[2406.18552v1](http://arxiv.org/abs/2406.18552v1)|null|
+|**2024-05-21**|**The Role of Emotions in Informational Support Question-Response Pairs in Online Health Communities: A Multimodal Deep Learning Approach**|Mohsen Jozani et.al.|[2405.13099v1](http://arxiv.org/abs/2405.13099v1)|null|
+|**2024-05-17**|**ChatGPT in Classrooms: Transforming Challenges into Opportunities in Education**|Harris Bin Munawar et.al.|[2405.10645v1](http://arxiv.org/abs/2405.10645v1)|null|
+|**2024-05-13**|**Evaluating the Explainable AI Method Grad-CAM for Breath Classification on Newborn Time Series Data**|Camelia Oprea et.al.|[2405.07590v1](http://arxiv.org/abs/2405.07590v1)|null|
+|**2024-05-10**|**XAI4LLM. Let Machine Learning Models and LLMs Collaborate for Enhanced In-Context Learning in Healthcare**|Fatemeh Nazary et.al.|[2405.06270v3](http://arxiv.org/abs/2405.06270v3)|null|
+|**2024-05-09**|**To Trust or Not to Trust: Towards a novel approach to measure trust for XAI systems**|Miquel Miró-Nicolau et.al.|[2405.05766v1](http://arxiv.org/abs/2405.05766v1)|null|
+|**2024-05-05**|**Region-specific Risk Quantification for Interpretable Prognosis of COVID-19**|Zhusi Zhong et.al.|[2405.02815v1](http://arxiv.org/abs/2405.02815v1)|[link](https://github.com/zzs95/RSP_COVID)|
+|**2024-04-26**|**Rad4XCNN: a new agnostic method for post-hoc global explanation of CNN-derived features by means of radiomics**|Francesco Prinzi et.al.|[2405.02334v2](http://arxiv.org/abs/2405.02334v2)|null|
+|**2024-04-25**|**Attributing Responsibility in AI-Induced Incidents: A Computational Reflective Equilibrium Framework for Accountability**|Yunfei Ge et.al.|[2404.16957v1](http://arxiv.org/abs/2404.16957v1)|null|
+|**2024-04-19**|**Explainable AI for Fair Sepsis Mortality Predictive Model**|Chia-Hsuan Chang et.al.|[2404.13139v1](http://arxiv.org/abs/2404.13139v1)|null|
+|**2024-04-19**|**Multi Class Depression Detection Through Tweets using Artificial Intelligence**|Muhammad Osama Nusrat et.al.|[2404.13104v1](http://arxiv.org/abs/2404.13104v1)|[link](https://github.com/mnusrat786/masters-thesis)|
+|**2024-04-19**|**COIN: Counterfactual inpainting for weakly supervised semantic segmentation for medical images**|Dmytro Shvetsov et.al.|[2404.12832v2](http://arxiv.org/abs/2404.12832v2)|[link](https://github.com/dmytro-shvetsov/counterfactual-search)|
+|**2024-04-15**|**Hybrid Intelligence for Digital Humanities**|Victor de Boer et.al.|[2406.15374v1](http://arxiv.org/abs/2406.15374v1)|null|
+|**2024-04-14**|**Ethical Framework for Responsible Foundational Models in Medical Imaging**|Abhijit Das et.al.|[2406.11868v1](http://arxiv.org/abs/2406.11868v1)|null|
+|**2024-04-09**|**Advancements in Radiomics and Artificial Intelligence for Thyroid Cancer Diagnosis**|Milad Yousefi et.al.|[2404.07239v1](http://arxiv.org/abs/2404.07239v1)|null|
+|**2024-04-06**|**Predictive Modeling for Breast Cancer Classification in the Context of Bangladeshi Patients: A Supervised Machine Learning Approach with Explainable AI**|Taminul Islam et.al.|[2404.04686v1](http://arxiv.org/abs/2404.04686v1)|null|
+|**2024-04-05**|**Enhancing Breast Cancer Diagnosis in Mammography: Evaluation and Integration of Convolutional Neural Networks and Explainable AI**|Maryam Ahmed et.al.|[2404.03892v3](http://arxiv.org/abs/2404.03892v3)|null|
+|**2024-03-30**|**Advancing Multimodal Data Fusion in Pain Recognition: A Strategy Leveraging Statistical Correlation and Human-Centered Perspectives**|Xingrui Gu et.al.|[2404.00320v2](http://arxiv.org/abs/2404.00320v2)|null|
+|**2024-03-26**|**Addressing Social Misattributions of Large Language Models: An HCXAI-based Approach**|Andrea Ferrario et.al.|[2403.17873v1](http://arxiv.org/abs/2403.17873v1)|null|
+|**2024-03-26**|**Clinical Domain Knowledge-Derived Template Improves Post Hoc AI Explanations in Pneumothorax Classification**|Han Yuan et.al.|[2403.18871v1](http://arxiv.org/abs/2403.18871v1)|[link](https://github.com/han-yuan-med/template-explanation)|
+|**2024-03-03**|**Enhancing Neural Machine Translation of Low-Resource Languages: Corpus Development, Human Evaluation and Explainable AI Architectures**|Séamus Lankford et.al.|[2403.01580v1](http://arxiv.org/abs/2403.01580v1)|null|
+|**2024-02-28**|**Cause and Effect: Can Large Language Models Truly Understand Causality?**|Swagata Ashwani et.al.|[2402.18139v3](http://arxiv.org/abs/2402.18139v3)|null|
+|**2024-02-28**|**Artificial Intelligence and Diabetes Mellitus: An Inside Look Through the Retina**|Yasin Sadeghi Bazargani et.al.|[2402.18600v1](http://arxiv.org/abs/2402.18600v1)|null|
+|**2024-02-22**|**Multi-stakeholder Perspective on Responsible Artificial Intelligence and Acceptability in Education**|A. J. Karran et.al.|[2402.15027v2](http://arxiv.org/abs/2402.15027v2)|null|
+|**2024-02-12**|**Deciphering Heartbeat Signatures: A Vision Transformer Approach to Explainable Atrial Fibrillation Detection from ECG Signals**|Aruna Mohan et.al.|[2402.09474v2](http://arxiv.org/abs/2402.09474v2)|null|
+
+#### Abstracts
+##### **Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**
+2502.09218v1 by Flavio Bertini, Alessandro Dal Palù, Federica Zaglio, Francesco Fabiano, Andrea Formisano
 
-Recent advancements in large language models (LLMs) have significantly
-improved various natural language processing (NLP) tasks. Typically, LLMs are
-trained to predict the next token, aligning well with many NLP tasks. However,
-in knowledge graph (KG) scenarios, entities are the fundamental units and
-identifying an entity requires at least several tokens. This leads to a
-granularity mismatch between KGs and natural languages. To address this issue,
-we propose K-ON, which integrates KG knowledge into the LLM by employing
-multiple head layers for next k-step prediction. K-ON can not only generate
-entity-level results in one step, but also enables contrastive loss against
-entities, which is the most powerful tool in KG representation learning.
-Experimental results show that K-ON outperforms state-of-the-art methods that
-incorporate text and even the other modalities.
+This paper presents a complete explainable system that interprets a set of
+data, abstracts the underlying features and describes them in a natural
+language of choice. The system relies on two crucial stages: (i) identifying
+emerging properties from data and transforming them into abstract concepts, and
+(ii) converting these concepts into natural language. Despite the impressive
+natural language generation capabilities demonstrated by Large Language Models,
+their statistical nature and the intricacy of their internal mechanism still
+force us to employ these techniques as black boxes, forgoing trustworthiness.
+Developing an explainable pipeline for data interpretation would allow
+facilitating its use in safety-critical environments like processing medical
+information and allowing non-experts and visually impaired people to access
+narrated information. To this end, we believe that the fields of knowledge
+representation and automated reasoning research could present a valid
+alternative. Expanding on prior research that tackled the first stage (i), we
+focus on the second stage, named Concept2Text. Being explainable, data
+translation is easily modeled through logic-based rules, once again emphasizing
+the role of declarative programming in achieving AI explainability. This paper
+explores a Prolog/CLP-based rewriting system to interpret concepts-articulated
+in terms of classes and relations, plus common knowledge-derived from a generic
+ontology, generating natural language text. Its main features include
+hierarchical tree rewritings, modular multilingual generation, support for
+equivalent variants across semantic, grammar, and lexical levels, and a
+transparent rule-based system. We outline the architecture and demonstrate its
+flexibility through some examples capable of generating numerous diverse and
+equivalent rewritings based on the input concept.
 
-摘要：大型語言模型 (LLM) 的最新進展顯著提升了各種自然語言處理 (NLP) 任務。通常，LLM 會接受訓練以預測下一個符號，這與許多 NLP 任務非常吻合。然而，在知識圖譜 (KG) 場景中，實體是基本單位，而識別實體至少需要幾個符號。這導致 KG 和自然語言之間的粒度不匹配。為了解決這個問題，我們提出了 K-ON，它透過採用多個頭部層進行下一個 k 步預測，將 KG 知識整合到 LLM 中。K-ON 不僅可以在一個步驟中產生實體層級的結果，還能針對實體啟用對比損失，這是 KG 表示學習中最有力的工具。實驗結果顯示，K-ON 優於將文字甚至其他方式納入考量的最新方法。
+摘要：<paragraph>這篇論文提出了一個完整的可解釋系統，它可以解釋一組資料，抽象出基礎特徵，並以選擇的自然語言描述它們。系統依賴兩個關鍵階段：(i) 從資料中識別新興屬性，並將它們轉換為抽象概念，以及 (ii) 將這些概念轉換為自然語言。儘管大型語言模型展示了令人印象深刻的自然語言生成能力，但它們的統計性質和內部機制的複雜性仍然迫使我們將這些技術用作黑盒子，放棄可信度。開發一個可解釋的資料解釋管道將有助於促進在安全關鍵環境中使用它，例如處理醫療資訊，並允許非專家和視障人士存取敘述資訊。為此，我們相信知識表示和自動推理研究領域可以提出一個有效的替代方案。在擴展解決第一階段 (i) 的先前研究的基礎上，我們專注於第二階段，稱為 Concept2Text。由於具有可解釋性，資料翻譯很容易透過基於邏輯的規則建模，再次強調宣告式程式設計在實現 AI 可解釋性中的作用。本文探討了一個基於 Prolog/CLP 的重寫系統，以解釋概念，這些概念以類別和關係的形式表達，再加上從通用本体衍生的常識，產生自然語言文字。它的主要特點包括階層樹重寫、模組化多語言生成、支援語義、語法和詞彙層面的等效變體，以及一個透明的基於規則的系統。我們概述了架構，並透過一些範例展示了它的靈活性，這些範例能夠根據輸入概念生成許多不同的等效重寫。</paragraph>
 
-##### **LegalViz: Legal Text Visualization by Text To Diagram Generation**
-2502.06147v2 by Eri Onami, Taiki Miyanishi, Koki Maeda, Shuhei Kurita
+##### **Foundation Model of Electronic Medical Records for Adaptive Risk Estimation**
+2502.06124v1 by Pawel Renc, Michal K. Grzeszczyk, Nassim Oufattole, Deirdre Goode, Yugang Jia, Szymon Bieganski, Matthew B. A. McDermott, Jaroslaw Was, Anthony E. Samir, Jonathan W. Cunningham, David W. Bates, Arkadiusz Sitek
 
-Legal documents including judgments and court orders require highly
-sophisticated legal knowledge for understanding. To disclose expert knowledge
-for non-experts, we explore the problem of visualizing legal texts with
-easy-to-understand diagrams and propose a novel dataset of LegalViz with 23
-languages and 7,010 cases of legal document and visualization pairs, using the
-DOT graph description language of Graphviz. LegalViz provides a simple diagram
-from a complicated legal corpus identifying legal entities, transactions, legal
-sources, and statements at a glance, that are essential in each judgment. In
-addition, we provide new evaluation metrics for the legal diagram visualization
-by considering graph structures, textual similarities, and legal contents. We
-conducted empirical studies on few-shot and finetuning large language models
-for generating legal diagrams and evaluated them with these metrics, including
-legal content-based evaluation within 23 languages. Models trained with
-LegalViz outperform existing models including GPTs, confirming the
-effectiveness of our dataset.
+We developed the Enhanced Transformer for Health Outcome Simulation (ETHOS),
+an AI model that tokenizes patient health timelines (PHTs) from EHRs. ETHOS
+predicts future PHTs using transformer-based architectures. The Adaptive Risk
+Estimation System (ARES) employs ETHOS to compute dynamic and personalized risk
+probabilities for clinician-defined critical events. ARES incorporates a
+personalized explainability module that identifies key clinical factors
+influencing risk estimates for individual patients. ARES was evaluated on the
+MIMIC-IV v2.2 dataset in emergency department (ED) settings, benchmarking its
+performance against traditional early warning systems and machine learning
+models. We processed 299,721 unique patients from MIMIC-IV into 285,622 PHTs,
+with 60% including hospital admissions. The dataset contained over 357 million
+tokens. ETHOS outperformed benchmark models in predicting hospital admissions,
+ICU admissions, and prolonged hospital stays, achieving superior AUC scores.
+ETHOS-based risk estimates demonstrated robustness across demographic subgroups
+with strong model reliability, confirmed via calibration curves. The
+personalized explainability module provides insights into patient-specific
+factors contributing to risk. ARES, powered by ETHOS, advances predictive
+healthcare AI by providing dynamic, real-time, and personalized risk estimation
+with patient-specific explainability to enhance clinician trust. Its
+adaptability and superior accuracy position it as a transformative tool for
+clinical decision-making, potentially improving patient outcomes and resource
+allocation in emergency and inpatient settings. We release the full code at
+github.com/ipolharvard/ethos-ares to facilitate future research.
 
-摘要：法律文件，包括判決和法院命令，需要高度專業的法律知識才能理解。為了向非專家揭露專家知識，我們探討了使用易於理解的圖表將法律文本視覺化的問題，並提出了一個新的 LegalViz 數據集，其中包含 23 種語言和 7,010 個法律文件和視覺化配對，使用 Graphviz 的 DOT 圖形描述語言。LegalViz 從複雜的法律語料庫中提供了一個簡單的圖表，可以一目了然地識別法律實體、交易、法律來源和陳述，這些在每項判決中都是必不可少的。此外，我們通過考慮圖形結構、文本相似性和法律內容，為法律圖表視覺化提供了新的評估指標。我們對少次學習和微調大型語言模型進行了實證研究，以生成法律圖表，並使用這些指標對它們進行了評估，包括在 23 種語言中基於法律內容的評估。使用 LegalViz 訓練的模型優於現有的模型，包括 GPT，證實了我們數據集的有效性。
+摘要：我們開發了增強型健康結果模擬轉換器 (ETHOS)，
+一種從電子健康紀錄 (EHR) 中將患者健康時間軸 (PHT) 標記化的 AI 模型。ETHOS
+使用基於轉換器的架構預測未來的 PHT。自適應風險評估系統 (ARES) 使用 ETHOS 計算由臨床醫生定義的危急事件的動態且個人化的風險機率。ARES 結合了個人化的可解釋性模組，可找出影響個別患者風險評估的主要臨床因素。ARES 在急診部門 (ED) 設定中針對 MIMIC-IV v2.2 資料集進行評估，並將其效能與傳統的預警系統和機器學習模型進行基準測試。我們將 299,721 位 MIMIC-IV 的獨特患者處理成 285,622 個 PHT，其中 60% 包含住院記錄。該資料集包含超過 3.57 億個標記。ETHOS 在預測住院、加護病房 (ICU) 住院和延長住院時間方面表現優於基準模型，並獲得了較高的 AUC 分數。基於 ETHOS 的風險評估顯示出跨人口統計子群的穩健性，並通過校準曲線確認了強大的模型可靠性。個人化的可解釋性模組提供了對導致風險的患者特定因素的見解。由 ETHOS 驅動的 ARES 透過提供動態、即時且個人化的風險評估，以及患者特定的可解釋性來增強臨床醫生的信任，從而推動了預測性醫療保健 AI 的發展。其適應性和卓越的準確性使其成為臨床決策制定的一種變革性工具，有可能改善緊急和住院環境中的患者結果和資源分配。我們在 github.com/ipolharvard/ethos-ares 上釋出完整程式碼，以利未來的研究。
 
-##### **Deconstructing Depression Stigma: Integrating AI-driven Data Collection and Analysis with Causal Knowledge Graphs**
-2502.06075v1 by Han Meng, Renwen Zhang, Ganyi Wang, Yitian Yang, Peinuan Qin, Jungup Lee, Yi-Chieh Lee
+##### **An Explainable Disease Surveillance System for Early Prediction of Multiple Chronic Diseases**
+2501.15969v1 by Shaheer Ahmad Khan, Muhammad Usamah Shahid, Ahmad Abdullah, Ibrahim Hashmat, Muddassar Farooq
 
-Mental-illness stigma is a persistent social problem, hampering both
-treatment-seeking and recovery. Accordingly, there is a pressing need to
-understand it more clearly, but analyzing the relevant data is highly
-labor-intensive. Therefore, we designed a chatbot to engage participants in
-conversations; coded those conversations qualitatively with AI assistance; and,
-based on those coding results, built causal knowledge graphs to decode stigma.
-The results we obtained from 1,002 participants demonstrate that conversation
-with our chatbot can elicit rich information about people's attitudes toward
-depression, while our AI-assisted coding was strongly consistent with
-human-expert coding. Our novel approach combining large language models (LLMs)
-and causal knowledge graphs uncovered patterns in individual responses and
-illustrated the interrelationships of psychological constructs in the dataset
-as a whole. The paper also discusses these findings' implications for HCI
-researchers in developing digital interventions, decomposing human
-psychological constructs, and fostering inclusive attitudes.
+This study addresses a critical gap in the healthcare system by developing a
+clinically meaningful, practical, and explainable disease surveillance system
+for multiple chronic diseases, utilizing routine EHR data from multiple U.S.
+practices integrated with CureMD's EMR/EHR system. Unlike traditional
+systems--using AI models that rely on features from patients' labs--our
+approach focuses on routinely available data, such as medical history, vitals,
+diagnoses, and medications, to preemptively assess the risks of chronic
+diseases in the next year. We trained three distinct models for each chronic
+disease: prediction models that forecast the risk of a disease 3, 6, and 12
+months before a potential diagnosis. We developed Random Forest models, which
+were internally validated using F1 scores and AUROC as performance metrics and
+further evaluated by a panel of expert physicians for clinical relevance based
+on inferences grounded in medical knowledge. Additionally, we discuss our
+implementation of integrating these models into a practical EMR system. Beyond
+using Shapley attributes and surrogate models for explainability, we also
+introduce a new rule-engineering framework to enhance the intrinsic
+explainability of Random Forests.
 
-摘要：精神疾病的污名化是一個持續存在的社會問題，阻礙了尋求治療和康復。因此，迫切需要更清楚地了解它，但分析相關數據非常費力。因此，我們設計了一個聊天機器人，讓參與者參與對話；使用 AI 協助對這些對話進行定性編碼；並根據這些編碼結果，構建因果知識圖譜來破譯污名化。我們從 1,002 名參與者那裡獲得的結果表明，與我們的聊天機器人的對話可以引出人們對憂鬱症的豐富資訊，而我們 AI 輔助的編碼與人類專家編碼非常一致。我們將大型語言模型 (LLM) 和因果知識圖譜相結合的新方法揭示了個別反應中的模式，並說明了資料集中心理建構之間的相互關係。本文還討論了這些發現對 HCI 研究人員在開發數位介入措施、分解人類心理建構和培養包容態度方面的影響。
+摘要：本研究透過開發一個臨床有意義、實用且可解釋的多重慢性疾病疾病監測系統，來解決醫療保健系統中的重大缺口，利用整合 CureMD 的 EMR/EHR 系統，來自多個美國實務的例行 EHR 資料。與傳統系統不同的是，我們的做法著重在例行可得的資料，例如病歷、生命徵象、診斷和藥物，以預先評估未來一年慢性疾病的風險，而非仰賴病患實驗室特徵的 AI 模型。我們針對每種慢性疾病訓練了三個不同的模型：預測模型，用以預測在潛在診斷前 3、6 和 12 個月的疾病風險。我們開發了隨機森林模型，並使用 F1 分數和 AUROC 作為效能指標，進行內部驗證，並進一步由專家醫師小組根據植基於醫學知識的推論，評估其臨床相關性。此外，我們討論了將這些模型整合到實用 EMR 系統中的實作方式。除了使用 Shapley 屬性和代理模型來解釋外，我們還引進了一個新的規則工程架構，以增強隨機森林的內在可解釋性。
 
-##### **LegalSeg: Unlocking the Structure of Indian Legal Judgments Through Rhetorical Role Classification**
-2502.05836v1 by Shubham Kumar Nigam, Tanmay Dubey, Govind Sharma, Noel Shallum, Kripabandhu Ghosh, Arnab Bhattacharya
+##### **Ensuring Medical AI Safety: Explainable AI-Driven Detection and Mitigation of Spurious Model Behavior and Associated Data**
+2501.13818v1 by Frederik Pahde, Thomas Wiegand, Sebastian Lapuschkin, Wojciech Samek
 
-In this paper, we address the task of semantic segmentation of legal
-documents through rhetorical role classification, with a focus on Indian legal
-judgments. We introduce LegalSeg, the largest annotated dataset for this task,
-comprising over 7,000 documents and 1.4 million sentences, labeled with 7
-rhetorical roles. To benchmark performance, we evaluate multiple
-state-of-the-art models, including Hierarchical BiLSTM-CRF,
-TransformerOverInLegalBERT (ToInLegalBERT), Graph Neural Networks (GNNs), and
-Role-Aware Transformers, alongside an exploratory RhetoricLLaMA, an
-instruction-tuned large language model. Our results demonstrate that models
-incorporating broader context, structural relationships, and sequential
-sentence information outperform those relying solely on sentence-level
-features. Additionally, we conducted experiments using surrounding context and
-predicted or actual labels of neighboring sentences to assess their impact on
-classification accuracy. Despite these advancements, challenges persist in
-distinguishing between closely related roles and addressing class imbalance.
-Our work underscores the potential of advanced techniques for improving legal
-document understanding and sets a strong foundation for future research in
-legal NLP.
+Deep neural networks are increasingly employed in high-stakes medical
+applications, despite their tendency for shortcut learning in the presence of
+spurious correlations, which can have potentially fatal consequences in
+practice. Detecting and mitigating shortcut behavior is a challenging task that
+often requires significant labeling efforts from domain experts. To alleviate
+this problem, we introduce a semi-automated framework for the identification of
+spurious behavior from both data and model perspective by leveraging insights
+from eXplainable Artificial Intelligence (XAI). This allows the retrieval of
+spurious data points and the detection of model circuits that encode the
+associated prediction rules. Moreover, we demonstrate how these shortcut
+encodings can be used for XAI-based sample- and pixel-level data annotation,
+providing valuable information for bias mitigation methods to unlearn the
+undesired shortcut behavior. We show the applicability of our framework using
+four medical datasets across two modalities, featuring controlled and
+real-world spurious correlations caused by data artifacts. We successfully
+identify and mitigate these biases in VGG16, ResNet50, and contemporary Vision
+Transformer models, ultimately increasing their robustness and applicability
+for real-world medical tasks.
 
-摘要：<paragraph>在本文中，我們通過修辭角色分類來探討法律文件的語義分段任務，重點關注印度法律判決。我們引入了 LegalSeg，這是此任務中最大的註釋資料集，包含超過 7,000 份文件和 140 萬個句子，並標記了 7 個修辭角色。為了評量效能，我們評估了多個最先進的模型，包括分層 BiLSTM-CRF、TransformerOverInLegalBERT (ToInLegalBERT)、圖神經網路 (GNN) 和角色感知Transformer，以及探索性的 RhetoricLLaMA，一種經過指令調整的大型語言模型。我們的結果表明，結合廣泛背景、結構關係和順序句子資訊的模型，表現優於僅依賴句子層級特徵的模型。此外，我們使用周圍的背景和鄰近句子的預測或實際標籤進行實驗，以評估它們對分類精度的影響。儘管有這些進展，但在區分密切相關的角色和解決類別不平衡方面仍存在挑戰。我們的研究強調了先進技術在改善法律文件理解方面的潛力，並為法律自然語言處理的未來研究奠定了堅實的基礎。</paragraph>
+摘要：深度神经网络越来越多地用于高风险医疗应用中，尽管它们在存在虚假相关性的情况下倾向于捷径学习，这在实践中可能产生致命的后果。检测和缓解捷径行为是一项艰巨的任务，通常需要领域专家的大量标记工作。为了缓解这个问题，我们引入了一个半自动框架，用于从数据和模型的角度识别虚假行为，方法是利用可解释人工智能 (XAI) 的见解。这允许检索虚假数据点并检测对关联预测规则进行编码的模型电路。此外，我们演示了如何使用这些捷径编码进行基于 XAI 的样本和像素级数据注释，为偏差缓解方法提供有价值的信息，以消除不需要的捷径行为。我们使用跨越两种方式的四个医学数据集展示了我们框架的适用性，这些数据集具有由数据伪像引起的受控和真实世界虚假相关性。我们成功地识别并减轻了 VGG16、ResNet50 和当代 Vision Transformer 模型中的这些偏差，最终提高了它们的鲁棒性和在真实世界医疗任务中的适用性。
 
-##### **LLM-Powered Decentralized Generative Agents with Adaptive Hierarchical Knowledge Graph for Cooperative Planning**
-2502.05453v1 by Hanqing Yang, Jingdi Chen, Marie Siew, Tania Lorido-Botran, Carlee Joe-Wong
+##### **Enhanced Suicidal Ideation Detection from Social Media Using a CNN-BiLSTM Hybrid Model**
+2501.11094v1 by Mohaiminul Islam Bhuiyan, Nur Shazwani Kamarudin, Nur Hafieza Ismail
 
-Developing intelligent agents for long-term cooperation in dynamic open-world
-scenarios is a major challenge in multi-agent systems. Traditional Multi-agent
-Reinforcement Learning (MARL) frameworks like centralized training
-decentralized execution (CTDE) struggle with scalability and flexibility. They
-require centralized long-term planning, which is difficult without custom
-reward functions, and face challenges in processing multi-modal data. CTDE
-approaches also assume fixed cooperation strategies, making them impractical in
-dynamic environments where agents need to adapt and plan independently. To
-address decentralized multi-agent cooperation, we propose Decentralized
-Adaptive Knowledge Graph Memory and Structured Communication System (DAMCS) in
-a novel Multi-agent Crafter environment. Our generative agents, powered by
-Large Language Models (LLMs), are more scalable than traditional MARL agents by
-leveraging external knowledge and language for long-term planning and
-reasoning. Instead of fully sharing information from all past experiences,
-DAMCS introduces a multi-modal memory system organized as a hierarchical
-knowledge graph and a structured communication protocol to optimize agent
-cooperation. This allows agents to reason from past interactions and share
-relevant information efficiently. Experiments on novel multi-agent open-world
-tasks show that DAMCS outperforms both MARL and LLM baselines in task
-efficiency and collaboration. Compared to single-agent scenarios, the two-agent
-scenario achieves the same goal with 63% fewer steps, and the six-agent
-scenario with 74% fewer steps, highlighting the importance of adaptive memory
-and structured communication in achieving long-term goals. We publicly release
-our project at: https://happyeureka.github.io/damcs.
+Suicidal ideation detection is crucial for preventing suicides, a leading
+cause of death worldwide. Many individuals express suicidal thoughts on social
+media, offering a vital opportunity for early detection through advanced
+machine learning techniques. The identification of suicidal ideation in social
+media text is improved by utilising a hybrid framework that integrates
+Convolutional Neural Networks (CNN) and Bidirectional Long Short-Term Memory
+(BiLSTM), enhanced with an attention mechanism. To enhance the interpretability
+of the model's predictions, Explainable AI (XAI) methods are applied, with a
+particular focus on SHapley Additive exPlanations (SHAP), are incorporated. At
+first, the model managed to reach an accuracy of 92.81%. By applying
+fine-tuning and early stopping techniques, the accuracy improved to 94.29%. The
+SHAP analysis revealed key features influencing the model's predictions, such
+as terms related to mental health struggles. This level of transparency boosts
+the model's credibility while helping mental health professionals understand
+and trust the predictions. This work highlights the potential for improving the
+accuracy and interpretability of detecting suicidal tendencies, making a
+valuable contribution to the progress of mental health monitoring systems. It
+emphasizes the significance of blending powerful machine learning methods with
+explainability to develop reliable and impactful mental health solutions.
 
-摘要：<paragraph>在動態開放世界情境中開發用於長期合作的智慧代理是多重代理系統中的一項重大挑戰。傳統的多重代理強化學習 (MARL) 框架，例如集中式訓練去中心化執行 (CTDE)，在可擴充性和靈活性方面面臨困難。它們需要集中式長期規劃，這在沒有自訂獎勵函數的情況下很難執行，並且在處理多模式數據時會面臨挑戰。CTDE 方法還假設固定的合作策略，這使得它們在代理需要獨立適應和規劃的動態環境中不切實際。為了解決分散式多重代理合作問題，我們在一個新穎的多重代理工匠環境中提出了分散式自適應知識圖譜記憶體和結構化通訊系統 (DAMCS)。我們的生成代理由大型語言模型 (LLM) 提供支援，透過利用外部知識和語言進行長期規劃和推理，比傳統的 MARL 代理更具可擴充性。DAMCS 沒有完全分享來自所有過去經驗的資訊，而是引入了多模式記憶體系統，該系統組織成階層式知識圖譜和結構化通訊協定，以最佳化代理合作。這允許代理根據過去的互動進行推理並有效地分享相關資訊。在新的多重代理開放世界任務上的實驗表明，DAMCS 在任務效率和協作方面優於 MARL 和 LLM 基準。與單一代理情境相比，雙重代理情境以少 63% 的步驟達成相同的目標，而六重代理情境則以少 74% 的步驟達成目標，突顯了自適應記憶體和結構化通訊在達成長期目標中的重要性。我們公開發布我們的專案於：https://happyeureka.github.io/damcs。</paragraph>
+摘要：自殺意念偵測對於預防自殺至關重要，而自殺是全球主要的死亡原因。許多人在社群媒體上表達自殺念頭，這提供了透過進階機器學習技術進行早期偵測的重要機會。透過整合卷積神經網路 (CNN) 和雙向長短期記憶 (BiLSTM) 的混合架構，並加入注意力機制，可以提升在社群媒體文字中辨識自殺意念的能力。為了加強模型預測的可解釋性，我們採用可解釋人工智慧 (XAI) 方法，特別著重於 SHapley 加法解釋 (SHAP)。一開始，模型成功達到 92.81% 的準確度。透過套用微調和早期停止技術，準確度提升至 94.29%。SHAP 分析揭露了影響模型預測的關鍵特徵，例如與心理健康困境相關的詞彙。這種透明度提升了模型的可信度，同時協助心理健康專業人員理解和信賴預測結果。這項工作突顯了提升偵測自殺傾向的準確度和可解釋性的潛力，為心理健康監控系統的進展做出寶貴的貢獻。它強調了將強大的機器學習方法與可解釋性相結合以開發可靠且有影響力的心理健康解決方案的重要性。
 
-##### **SAMGPT: Text-free Graph Foundation Model for Multi-domain Pre-training and Cross-domain Adaptation**
-2502.05424v1 by Xingtong Yu, Zechuan Gong, Chang Zhou, Yuan Fang, Hui Zhang
+##### **SEANN: A Domain-Informed Neural Network for Epidemiological Insights**
+2501.10273v1 by Jean-Baptiste Guimbaud, Marc Plantevit, Léa Maître, Rémy Cazabet
 
-Graphs are able to model interconnected entities in many online services,
-supporting a wide range of applications on the Web. This raises an important
-question: How can we train a graph foundational model on multiple source
-domains and adapt to an unseen target domain? A major obstacle is that graphs
-from different domains often exhibit divergent characteristics. Some studies
-leverage large language models to align multiple domains based on textual
-descriptions associated with the graphs, limiting their applicability to
-text-attributed graphs. For text-free graphs, a few recent works attempt to
-align different feature distributions across domains, while generally
-neglecting structural differences. In this work, we propose a novel Structure
-Alignment framework for text-free Multi-domain Graph Pre-Training and
-cross-domain adaptation (SAMGPT). It is designed to learn multi-domain
-knowledge from graphs originating in multiple source domains, which can then be
-adapted to address applications in an unseen target domain. Specifically, we
-introduce a set of structure tokens to harmonize structure-based aggregation
-across source domains during the pre-training phase. Next, for cross-domain
-adaptation, we design dual prompts, namely, holistic prompts and specific
-prompts, which adapt unified multi-domain structural knowledge and
-fine-grained, domain-specific information, respectively, to a target domain.
-Finally, we conduct comprehensive experiments on seven public datasets to
-evaluate and analyze the effectiveness of SAMGPT.
+In epidemiology, traditional statistical methods such as logistic regression,
+linear regression, and other parametric models are commonly employed to
+investigate associations between predictors and health outcomes. However,
+non-parametric machine learning techniques, such as deep neural networks
+(DNNs), coupled with explainable AI (XAI) tools, offer new opportunities for
+this task. Despite their potential, these methods face challenges due to the
+limited availability of high-quality, high-quantity data in this field. To
+address these challenges, we introduce SEANN, a novel approach for informed
+DNNs that leverages a prevalent form of domain-specific knowledge: Pooled
+Effect Sizes (PES). PESs are commonly found in published Meta-Analysis studies,
+in different forms, and represent a quantitative form of a scientific
+consensus. By direct integration within the learning procedure using a custom
+loss, we experimentally demonstrate significant improvements in the
+generalizability of predictive performances and the scientific plausibility of
+extracted relationships compared to a domain-knowledge agnostic neural network
+in a scarce and noisy data setting.
 
-摘要：圖表能夠在許多線上服務中對相互關聯的實體進行建模，
-支援網路上廣泛的應用程式。這提出了重要的問題：我們如何針對多個來源網域訓練圖表基礎模型，並適應未見過的目標網域？一個主要的障礙是，來自不同網域的圖表通常表現出不同的特性。一些研究利用大型語言模型，根據與圖表相關的文字描述，對齊多個網域，限制其適用性於有文字屬性的圖表。對於沒有文字的圖表，最近的一些作品嘗試對齊跨網域的不同特徵分佈，同時通常忽略結構上的差異。在這項工作中，我們提出了一個新的結構對齊框架，用於無文字多網域圖表預訓練和跨網域適應 (SAMGPT)。它被設計為從起源於多個來源網域的圖表中學習多網域知識，然後可以適應於未見過的目標網域中的應用程式。具體來說，我們引入了一組結構化代碼，以在預訓練階段，調和跨來源網域的基於結構的聚合。接下來，對於跨網域適應，我們設計了雙重提示，即整體提示和具體提示，分別將統一的多網域結構知識和細緻的、特定於網域的資訊適應到目標網域。最後，我們在七個公共資料集上進行了全面的實驗，以評估和分析 SAMGPT 的有效性。
+摘要：在流行病學中，傳統的統計方法，例如邏輯迴歸、線性迴歸和其他參數模型通常用於調查預測因子與健康結果之間的關聯。然而，非參數機器學習技術，例如深度神經網路 (DNN)，結合可解釋的 AI (XAI) 工具，為這項任務提供了新的機會。儘管這些方法具有潛力，但由於該領域缺乏高品質、高數量資料，因此這些方法面臨挑戰。為了應對這些挑戰，我們引入了 SEANN，這是一種新穎的方法，用於獲取知識的 DNN，它利用了一種流行的領域特定知識形式：彙總效應量 (PES)。PES 通常以不同的形式出現在已發表的 Meta 分析研究中，並代表科學共識的量化形式。通過使用自訂損失函數直接整合在學習程序中，我們以實驗方式證明了預測效能的概括性以及與從缺乏領域知識的神經網路中提取的關係相比，科學合理性的顯著提升，且是在稀少且有雜訊的資料設定中。
 
-##### **Graph-based Molecular In-context Learning Grounded on Morgan Fingerprints**
-2502.05414v1 by Ali Al-Lawati, Jason Lucas, Zhiwei Zhang, Prasenjit Mitra, Suhang Wang
+##### **Artificial Intelligence-Driven Clinical Decision Support Systems**
+2501.09628v1 by Muhammet Alkan, Idris Zakariyya, Samuel Leighton, Kaushik Bhargav Sivangi, Christos Anagnostopoulos, Fani Deligianni
 
-In-context learning (ICL) effectively conditions large language models (LLMs)
-for molecular tasks, such as property prediction and molecule captioning, by
-embedding carefully selected demonstration examples into the input prompt. This
-approach avoids the computational overhead of extensive pertaining and
-fine-tuning. However, current prompt retrieval methods for molecular tasks have
-relied on molecule feature similarity, such as Morgan fingerprints, which do
-not adequately capture the global molecular and atom-binding relationships. As
-a result, these methods fail to represent the full complexity of molecular
-structures during inference. Moreover, small-to-medium-sized LLMs, which offer
-simpler deployment requirements in specialized systems, have remained largely
-unexplored in the molecular ICL literature. To address these gaps, we propose a
-self-supervised learning technique, GAMIC (Graph-Aligned Molecular In-Context
-learning, which aligns global molecular structures, represented by graph neural
-networks (GNNs), with textual captions (descriptions) while leveraging local
-feature similarity through Morgan fingerprints. In addition, we introduce a
-Maximum Marginal Relevance (MMR) based diversity heuristic during retrieval to
-optimize input prompt demonstration samples. Our experimental findings using
-diverse benchmark datasets show GAMIC outperforms simple Morgan-based ICL
-retrieval methods across all tasks by up to 45%.
+As artificial intelligence (AI) becomes increasingly embedded in healthcare
+delivery, this chapter explores the critical aspects of developing reliable and
+ethical Clinical Decision Support Systems (CDSS). Beginning with the
+fundamental transition from traditional statistical models to sophisticated
+machine learning approaches, this work examines rigorous validation strategies
+and performance assessment methods, including the crucial role of model
+calibration and decision curve analysis. The chapter emphasizes that creating
+trustworthy AI systems in healthcare requires more than just technical
+accuracy; it demands careful consideration of fairness, explainability, and
+privacy. The challenge of ensuring equitable healthcare delivery through AI is
+stressed, discussing methods to identify and mitigate bias in clinical
+predictive models. The chapter then delves into explainability as a cornerstone
+of human-centered CDSS. This focus reflects the understanding that healthcare
+professionals must not only trust AI recommendations but also comprehend their
+underlying reasoning. The discussion advances in an analysis of privacy
+vulnerabilities in medical AI systems, from data leakage in deep learning
+models to sophisticated attacks against model explanations. The text explores
+privacy-preservation strategies such as differential privacy and federated
+learning, while acknowledging the inherent trade-offs between privacy
+protection and model performance. This progression, from technical validation
+to ethical considerations, reflects the multifaceted challenges of developing
+AI systems that can be seamlessly and reliably integrated into daily clinical
+practice while maintaining the highest standards of patient care and data
+protection.
 
-摘要：<paragraph>情境學習 (ICL) 有效地調整大型語言模型 (LLM)，以執行分子任務，例如屬性預測和分子標題，方法是將仔細挑選的示範範例嵌入輸入提示中。這種方法避免了廣泛相關和微調的計算開銷。然而，目前針對分子任務的提示檢索方法依賴於分子特徵相似性，例如 Morgan 指紋，而無法充分捕捉全局分子和原子鍵結關係。因此，這些方法無法在推理過程中表示分子結構的完整複雜性。此外，在專業系統中提供更簡單部署需求的小到中型的 LLM，在分子 ICL 文獻中仍未得到充分探索。為了解決這些差距，我們提出了一種自我監督學習技術，GAMIC（圖形對齊分子情境學習），它將由圖形神經網路 (GNN) 表示的全局分子結構與文字標題（描述）對齊，同時透過 Morgan 指紋利用局部特徵相似性。此外，我們在檢索過程中引入了一個基於最大邊際相關性 (MMR) 的多樣性啟發法，以最佳化輸入提示示範樣本。我們使用不同的基準資料集進行的實驗結果顯示，GAMIC 在所有任務中都優於基於 Morgan 的簡單 ICL 檢索方法，最多可達 45%。</paragraph>
+摘要：隨著人工智慧 (AI) 在醫療保健中的應用日益普及，本章探討了開發可靠且符合道德標準的臨床決策支援系統 (CDSS) 的關鍵面向。從傳統統計模型到複雜機器學習方法的基本轉變開始，這項工作審查了嚴謹的驗證策略和效能評估方法，包括模型校準和決策曲線分析的關鍵角色。本章強調，在醫療保健中建立值得信賴的 AI 系統不只是技術上的準確性；它需要仔細考量公平性、可解釋性和隱私權。本章強調了透過 AI 確保公平的醫療保健服務的挑戰，並討論了識別和減輕臨床預測模型中偏差的方法。接著，本章深入探討可解釋性，作為以人為中心的 CDSS 的基石。這種關注反映了醫療保健專業人員不僅必須信任 AI 建議，還必須理解其背後的推理。討論進一步分析了醫療 AI 系統中的隱私漏洞，從深度學習模型中的資料外洩到針對模型解釋的複雜攻擊。本文探討了隱私保護策略，例如差分隱私和聯合學習，同時承認隱私保護和模型效能之間的固有取捨。這種從技術驗證到道德考量的進展，反映了開發 AI 系統的多面向挑戰，這些系統可以無縫且可靠地整合到日常臨床實務中，同時維持最高的病患照護和資料保護標準。
 
-##### **Knowledge Graph-Guided Retrieval Augmented Generation**
-2502.06864v1 by Xiangrong Zhu, Yuexiang Xie, Yi Liu, Yaliang Li, Wei Hu
+##### **MedGrad E-CLIP: Enhancing Trust and Transparency in AI-Driven Skin Lesion Diagnosis**
+2501.06887v1 by Sadia Kamal, Tim Oates
 
-Retrieval-augmented generation (RAG) has emerged as a promising technology
-for addressing hallucination issues in the responses generated by large
-language models (LLMs). Existing studies on RAG primarily focus on applying
-semantic-based approaches to retrieve isolated relevant chunks, which ignore
-their intrinsic relationships. In this paper, we propose a novel Knowledge
-Graph-Guided Retrieval Augmented Generation (KG$^2$RAG) framework that utilizes
-knowledge graphs (KGs) to provide fact-level relationships between chunks,
-improving the diversity and coherence of the retrieved results. Specifically,
-after performing a semantic-based retrieval to provide seed chunks, KG$^2$RAG
-employs a KG-guided chunk expansion process and a KG-based chunk organization
-process to deliver relevant and important knowledge in well-organized
-paragraphs. Extensive experiments conducted on the HotpotQA dataset and its
-variants demonstrate the advantages of KG$^2$RAG compared to existing RAG-based
-approaches, in terms of both response quality and retrieval quality.
+As deep learning models gain attraction in medical data, ensuring transparent
+and trustworthy decision-making is essential. In skin cancer diagnosis, while
+advancements in lesion detection and classification have improved accuracy, the
+black-box nature of these methods poses challenges in understanding their
+decision processes, leading to trust issues among physicians. This study
+leverages the CLIP (Contrastive Language-Image Pretraining) model, trained on
+different skin lesion datasets, to capture meaningful relationships between
+visual features and diagnostic criteria terms. To further enhance transparency,
+we propose a method called MedGrad E-CLIP, which builds on gradient-based
+E-CLIP by incorporating a weighted entropy mechanism designed for complex
+medical imaging like skin lesions. This approach highlights critical image
+regions linked to specific diagnostic descriptions. The developed integrated
+pipeline not only classifies skin lesions by matching corresponding
+descriptions but also adds an essential layer of explainability developed
+especially for medical data. By visually explaining how different features in
+an image relates to diagnostic criteria, this approach demonstrates the
+potential of advanced vision-language models in medical image analysis,
+ultimately improving transparency, robustness, and trust in AI-driven
+diagnostic systems.
 
-摘要：檢索增強生成 (RAG) 已成為一項有前途的技術，用於解決大型語言模型 (LLM) 所產生回應中的幻覺問題。現有關於 RAG 的研究主要專注於應用基於語義的方法來檢索孤立相關的區塊，而忽略它們的內在關係。在本文中，我們提出了一個新穎的知識圖表引導檢索增強生成 (KG$^2$RAG) 框架，它利用知識圖表 (KG) 來提供區塊之間的事實層級關係，從而提高檢索結果的多樣性和一致性。具體來說，在執行基於語義的檢索以提供種子區塊後，KG$^2$RAG 採用 KG 引導的區塊擴充程序和基於 KG 的區塊組織程序，以在組織良好的段落中傳達相關且重要的知識。在 HotpotQA 資料集及其變體上進行的大量實驗證明了 KG$^2$RAG 在回應品質和檢索品質方面優於現有的基於 RAG 的方法。
+摘要：随着深度学习模型在医学数据中获得关注，确保透明且值得信赖的决策至关重要。在皮肤癌诊断中，虽然病灶检测和分类的进步提高了准确性，但这些方法的黑盒性质对理解其决策过程构成了挑战，导致医生之间的信任问题。本研究利用在不同皮肤病变数据集上训练的 CLIP（对比语言图像预训练）模型，以捕捉视觉特征和诊断标准术语之间的有意义关系。为了进一步提高透明度，我们提出了一种名为 MedGrad E-CLIP 的方法，该方法通过结合专为皮肤病变等复杂医学影像设计的加权熵机制，建立在基于梯度的 E-CLIP 之上。此方法突出了与特定诊断描述相关联的关键图像区域。开发的集成管道不仅通过匹配相应的描述对皮肤病变进行分类，还添加了一层专门为医学数据开发的基本可解释性。通过直观地解释图像中不同特征与诊断标准的关系，这种方法展示了高级视觉语言模型在医学图像分析中的潜力，最终提高了透明度、稳健性和对人工智能驱动的诊断系统的信任。
 
-##### **Can Large Language Models Understand Intermediate Representations?**
-2502.06854v1 by Hailong Jiang, Jianfeng Zhu, Yao Wan, Bo Fang, Hongyu Zhang, Ruoming Jin, Qiang Guan
+##### **Explaining Humour Style Classifications: An XAI Approach to Understanding Computational Humour Analysis**
+2501.02891v1 by Mary Ogbuka Kenneth, Foaad Khosmood, Abbas Edalat
 
-Intermediate Representations (IRs) are essential in compiler design and
-program analysis, yet their comprehension by Large Language Models (LLMs)
-remains underexplored. This paper presents a pioneering empirical study to
-investigate the capabilities of LLMs, including GPT-4, GPT-3, Gemma 2, LLaMA
-3.1, and Code Llama, in understanding IRs. We analyze their performance across
-four tasks: Control Flow Graph (CFG) reconstruction, decompilation, code
-summarization, and execution reasoning. Our results indicate that while LLMs
-demonstrate competence in parsing IR syntax and recognizing high-level
-structures, they struggle with control flow reasoning, execution semantics, and
-loop handling. Specifically, they often misinterpret branching instructions,
-omit critical IR operations, and rely on heuristic-based reasoning, leading to
-errors in CFG reconstruction, IR decompilation, and execution reasoning. The
-study underscores the necessity for IR-specific enhancements in LLMs,
-recommending fine-tuning on structured IR datasets and integration of explicit
-control flow models to augment their comprehension and handling of IR-related
-tasks.
+Humour styles can have either a negative or a positive impact on well-being.
+Given the importance of these styles to mental health, significant research has
+been conducted on their automatic identification. However, the automated
+machine learning models used for this purpose are black boxes, making their
+prediction decisions opaque. Clarity and transparency are vital in the field of
+mental health. This paper presents an explainable AI (XAI) framework for
+understanding humour style classification, building upon previous work in
+computational humour analysis. Using the best-performing single model
+(ALI+XGBoost) from prior research, we apply comprehensive XAI techniques to
+analyse how linguistic, emotional, and semantic features contribute to humour
+style classification decisions. Our analysis reveals distinct patterns in how
+different humour styles are characterised and misclassified, with particular
+emphasis on the challenges in distinguishing affiliative humour from other
+styles. Through detailed examination of feature importance, error patterns, and
+misclassification cases, we identify key factors influencing model decisions,
+including emotional ambiguity, context misinterpretation, and target
+identification. The framework demonstrates significant utility in understanding
+model behaviour, achieving interpretable insights into the complex interplay of
+features that define different humour styles. Our findings contribute to both
+the theoretical understanding of computational humour analysis and practical
+applications in mental health, content moderation, and digital humanities
+research.
 
-摘要：中間表徵 (IR) 在編譯器設計和程式分析中至關重要，但大型語言模型 (LLM) 對其理解仍未得到充分探討。本文提出了一項開創性的實證研究，以探討 LLM（包括 GPT-4、GPT-3、Gemma 2、LLaMA 3.1 和 Code Llama）理解 IR 的能力。我們分析了它們在四項任務中的表現：控制流程圖 (CFG) 重建、反編譯、程式碼摘要和執行推理。我們的結果表明，儘管 LLM 在解析 IR 語法和識別高階結構方面表現出能力，但它們在控制流程推理、執行語義和迴圈處理方面存在困難。具體而言，它們經常誤解分支指令、省略關鍵 IR 操作，並依賴於基於啟發式的推理，導致 CFG 重建、IR 反編譯和執行推理出現錯誤。這項研究強調了 LLM 中對 IR 特定的增強的必要性，建議對結構化的 IR 資料集進行微調，並整合明確的控制流程模型，以增強其對 IR 相關任務的理解和處理。
+摘要：幽默風格對幸福感可能產生負面或正面的影響。
+鑑於這些風格對心理健康的重要性，已經對其自動識別進行了大量研究。然而，用於此目的的自動機器學習模型是黑盒子，使得其預測決策不透明。清晰度和透明度在心理健康領域至關重要。本文提出了一個可解釋的 AI (XAI) 框架，用於理解幽默風格分類，建立在計算幽默分析的先前工作之上。使用先前研究中表現最好的單一模型 (ALI+XGBoost)，我們應用全面的 XAI 技術來分析語言、情緒和語義特徵如何影響幽默風格分類決策。我們的分析揭示了不同幽默風格如何被表徵和錯誤分類的不同模式，特別強調了區分聯屬幽默與其他風格的挑戰。通過仔細檢查特徵重要性、錯誤模式和錯誤分類案例，我們確定了影響模型決策的關鍵因素，包括情緒模糊、情境誤解和目標識別。該框架展示了在理解模型行為方面的顯著效用，實現了對定義不同幽默風格的特徵之間複雜相互作用的可解釋見解。我們的發現有助於計算幽默分析的理論理解和心理健康、內容審核和數字人文研究中的實際應用。
 
-##### **GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity?**
-2502.05252v1 by Yang Zhou, Hongyi Liu, Zhuoming Chen, Yuandong Tian, Beidi Chen
+##### **The Emotional Spectrum of LLMs: Leveraging Empathy and Emotion-Based Markers for Mental Health Support**
+2412.20068v1 by Alessandro De Grandi, Federico Ravenda, Andrea Raballo, Fabio Crestani
 
-Long-context large language models (LLMs) have recently shown strong
-performance in information retrieval and long-document QA. However, to tackle
-the most challenging intellectual problems, LLMs must reason effectively in
-long and complex contexts (e.g., frontier mathematical research). Studying how
-LLMs handle increasing reasoning complexity and context length is essential,
-yet existing benchmarks lack a solid basis for quantitative evaluation.
-Inspired by the abstraction of GSM-8K problems as computational graphs, and the
-ability to introduce noise by adding unnecessary nodes and edges, we develop a
-grade school math problem generator capable of producing arithmetic problems
-with infinite difficulty and context length under fine-grained control. Using
-our newly synthesized GSM-Infinite benchmark, we comprehensively evaluate
-existing LLMs. We find a consistent sigmoid decline in reasoning performance as
-complexity increases, along with a systematic inference scaling trend:
-exponentially increasing inference computation yields only linear performance
-gains. These findings underscore the fundamental limitations of current
-long-context LLMs and the key challenges in scaling reasoning capabilities. Our
-GSM-Infinite benchmark provides a scalable and controllable testbed for
-systematically studying and advancing LLM reasoning in long and complex
-contexts.
+The increasing demand for mental health services has highlighted the need for
+innovative solutions, particularly in the realm of psychological conversational
+AI, where the availability of sensitive data is scarce. In this work, we
+explored the development of a system tailored for mental health support with a
+novel approach to psychological assessment based on explainable emotional
+profiles in combination with empathetic conversational models, offering a
+promising tool for augmenting traditional care, particularly where immediate
+expertise is unavailable. Our work can be divided into two main parts,
+intrinsecaly connected to each other. First, we present RACLETTE, a
+conversational system that demonstrates superior emotional accuracy compared to
+state-of-the-art benchmarks in both understanding users' emotional states and
+generating empathetic responses during conversations, while progressively
+building an emotional profile of the user through their interactions. Second,
+we show how the emotional profiles of a user can be used as interpretable
+markers for mental health assessment. These profiles can be compared with
+characteristic emotional patterns associated with different mental disorders,
+providing a novel approach to preliminary screening and support.
 
-摘要：長文本大型語言模型 (LLM) 最近在資訊檢索和長文件問答中展示了強大的效能。然而，若要解決最具挑戰性的智力問題，LLM 必須在長且複雜的脈絡中有效推理（例如，前沿數學研究）。研究 LLM 如何處理增加的推理複雜性和脈絡長度至關重要，但現有的基準缺乏定量評估的穩固基礎。受到 GSM-8K 問題抽象化為計算圖形的啟發，以及透過加入不必要的節點和邊緣來引入雜訊的能力，我們開發了一個小學數學問題產生器，能夠在細緻的控制下產生具有無限難度和脈絡長度的算術問題。使用我們新合成的 GSM-Infinite 基準，我們全面評估現有的 LLM。我們發現推理效能會隨著複雜性的增加而持續呈 S 形下降，並伴隨著系統性的推論縮放趨勢：指數增加的推論計算僅產生線性的效能增益。這些發現強調了當前長脈絡 LLM 的基本限制，以及擴展推理能力的主要挑戰。我們的 GSM-Infinite 基準提供了一個可擴充且可控的測試平台，用於系統性地研究和提升 LLM 在長且複雜脈絡中的推理能力。
+摘要：隨著對心理健康服務需求的增加，凸顯了創新解決方案的需求，特別是在心理對話式人工智慧領域，那裡缺乏敏感資料。在這項工作中，我們探索了開發一個針對心理健康支持的系統，採用一種基於可解釋的情緒特徵的新方法進行心理評估，結合同理心對話模式，提供了一個有前途的工具，用於擴充傳統照護，特別是在無法立即獲得專業知識的情況下。我們的工作可以分為兩個主要部分，彼此內在相關。首先，我們展示了 RACLETTE，一個對話系統，與最先進的基準相比，在理解使用者情緒狀態和在對話中產生同理心回應方面表現出優越的情緒準確性，同時透過他們的互動逐漸建立使用者的情緒特徵。其次，我們展示了使用者的情緒特徵如何可用作心理健康評估的可解釋標記。這些特徵可以與與不同心理疾病相關的典型情緒模式進行比較，提供了一種初步篩選和支持的新方法。
 
-##### **Causality can systematically address the monsters under the bench(marks)**
-2502.05085v1 by Felix Leeb, Zhijing Jin, Bernhard Schölkopf
+##### **A Review on the Integration of Artificial Intelligence and Medical Imaging in IVF Ovarian Stimulation**
+2412.19688v1 by Jana Zakall, Birgit Pohn, Antonia Graf, Daniel Kovatchki, Arezoo Borji, Ragib Shahriar Islam, Hossam Haick, Heinz Strohmer, Sepideh Hatamikia
 
-Effective and reliable evaluation is essential for advancing empirical
-machine learning. However, the increasing accessibility of generalist models
-and the progress towards ever more complex, high-level tasks make systematic
-evaluation more challenging. Benchmarks are plagued by various biases,
-artifacts, or leakage, while models may behave unreliably due to poorly
-explored failure modes. Haphazard treatments and inconsistent formulations of
-such "monsters" can contribute to a duplication of efforts, a lack of trust in
-results, and unsupported inferences. In this position paper, we argue causality
-offers an ideal framework to systematically address these challenges. By making
-causal assumptions in an approach explicit, we can faithfully model phenomena,
-formulate testable hypotheses with explanatory power, and leverage principled
-tools for analysis. To make causal model design more accessible, we identify
-several useful Common Abstract Topologies (CATs) in causal graphs which help
-gain insight into the reasoning abilities in large language models. Through a
-series of case studies, we demonstrate how the precise yet pragmatic language
-of causality clarifies the strengths and limitations of a method and inspires
-new approaches for systematic progress.
+Artificial intelligence (AI) has emerged as a powerful tool to enhance
+decision-making and optimize treatment protocols in in vitro fertilization
+(IVF). In particular, AI shows significant promise in supporting
+decision-making during the ovarian stimulation phase of the IVF process. This
+review evaluates studies focused on the applications of AI combined with
+medical imaging in ovarian stimulation, examining methodologies, outcomes, and
+current limitations. Our analysis of 13 studies on this topic reveals that,
+reveal that while AI algorithms demonstrated notable potential in predicting
+optimal hormonal dosages, trigger timing, and oocyte retrieval outcomes, the
+medical imaging data utilized predominantly came from two-dimensional (2D)
+ultrasound which mainly involved basic quantifications, such as follicle size
+and number, with limited use of direct feature extraction or advanced image
+analysis techniques. This points to an underexplored opportunity where advanced
+image analysis approaches, such as deep learning, and more diverse imaging
+modalities, like three-dimensional (3D) ultrasound, could unlock deeper
+insights. Additionally, the lack of explainable AI (XAI) in most studies raises
+concerns about the transparency and traceability of AI-driven decisions - key
+factors for clinical adoption and trust. Furthermore, many studies relied on
+single-center designs and small datasets, which limit the generalizability of
+their findings. This review highlights the need for integrating advanced
+imaging analysis techniques with explainable AI methodologies, as well as the
+importance of leveraging multicenter collaborations and larger datasets.
+Addressing these gaps has the potential to enhance ovarian stimulation
+management, paving the way for efficient, personalized, and data-driven
+treatment pathways that improve IVF outcomes.
+
+摘要：人工智慧（AI）已成為增強體外受精（IVF）決策制定和優化治療方案的強大工具。特別是，AI 在支持 IVF 過程中卵巢刺激階段的決策制定方面顯示出顯著的前景。本綜述評估了專注於 AI 結合卵巢刺激中的醫學影像應用、檢驗方法、結果和當前限制的研究。我們對 13 項關於此主題的研究分析顯示，雖然 AI 演算法在預測最佳荷爾蒙劑量、觸發時機和卵子取出結果方面表現出顯著的潛力，但所利用的醫學影像數據主要來自於二次元（2D）超音波，而二次元超音波主要涉及基本量化，例如濾泡大小和數量，且有限使用直接特徵提取或進階影像分析技術。這指向一個尚未探索的機會，例如深度學習等進階影像分析方法，以及更多元的影像模式，例如三維（3D）超音波，可以解鎖更深入的見解。此外，大多數研究缺乏可解釋 AI（XAI），這引起了人們對 AI 驅動決策的透明度和可追溯性的擔憂，而透明度和可追溯性是臨床採用和信任的關鍵因素。此外，許多研究依賴於單中心設計和小型數據集，這限制了其發現的普遍性。本綜述強調了將進階影像分析技術與可解釋 AI 方法整合起來的必要性，以及利用多中心合作和大型數據集的重要性。解決這些差距有可能增強卵巢刺激管理，為有效、個人化和數據驅動的治療途徑鋪平道路，進而改善 IVF 結果。
+
+##### **Enhancing Cancer Diagnosis with Explainable & Trustworthy Deep Learning Models**
+2412.17527v1 by Badaru I. Olumuyiwa, The Anh Han, Zia U. Shamszaman
+
+This research presents an innovative approach to cancer diagnosis and
+prediction using explainable Artificial Intelligence (XAI) and deep learning
+techniques. With cancer causing nearly 10 million deaths globally in 2020,
+early and accurate diagnosis is crucial. Traditional methods often face
+challenges in cost, accuracy, and efficiency. Our study develops an AI model
+that provides precise outcomes and clear insights into its decision-making
+process, addressing the "black box" problem of deep learning models. By
+employing XAI techniques, we enhance interpretability and transparency,
+building trust among healthcare professionals and patients. Our approach
+leverages neural networks to analyse extensive datasets, identifying patterns
+for cancer detection. This model has the potential to revolutionise diagnosis
+by improving accuracy, accessibility, and clarity in medical decision-making,
+possibly leading to earlier detection and more personalised treatment
+strategies. Furthermore, it could democratise access to high-quality
+diagnostics, particularly in resource-limited settings, contributing to global
+health equity. The model's applications extend beyond cancer diagnosis,
+potentially transforming various aspects of medical decision-making and saving
+millions of lives worldwide.
 
-摘要：有效的、可靠的評估對於推進經驗機器學習至關重要。然而，一般化模型的可及性日益提高，以及朝著更複雜、更高級別任務的進展，使得系統評估更具挑戰性。基準測試受到各種偏差、人工製品或洩漏的困擾，而模型由於探索不充分的故障模式而可能表現得不可靠。隨意處理和不一致的表述等「怪物」可能會導致重複工作、對結果缺乏信任以及不支援的推論。在本文中，我們論證因果關係提供了一個系統性解決這些挑戰的理想框架。通過在方法中明確因果假設，我們可以忠實地模擬現象，制定具有解釋力的可測試假設，並利用原則性的分析工具。為了使因果模型設計更易於使用，我們在因果圖中識別出幾個有用的通用抽象拓撲 (CAT)，有助於深入了解大型語言模型中的推理能力。通過一系列案例研究，我們展示了因果關係的精確但務實的語言如何釐清方法的優缺點，並激發系統進展的新方法。
+摘要：本研究提出了一個創新的癌症診斷和預測方法，使用可解釋的人工智慧 (XAI) 和深度學習技術。由於癌症在 2020 年造成全球近 1,000 萬人死亡，因此早期準確的診斷至關重要。傳統方法通常面臨成本、準確性和效率方面的挑戰。我們的研究開發了一個 AI 模型，它提供精確的結果並清楚地了解其決策過程，解決了深度學習模型的「黑箱」問題。通過採用 XAI 技術，我們增強了解釋性和透明度，在醫療專業人員和患者之間建立信任。我們的做法利用神經網路分析廣泛的數據集，識別癌症檢測模式。這個模型有可能通過提高醫療決策的準確性、可及性和清晰度來革新診斷，可能導致更早的檢測和更個性化的治療策略。此外，它可以使更多人獲得高品質的診斷，特別是在資源有限的環境中，有助於全球健康公平。該模型的應用範圍不僅限於癌症診斷，還可能轉變醫療決策的各個方面，並拯救全球數百萬人的生命。
 
-##### **Adaptive Graph of Thoughts: Test-Time Adaptive Reasoning Unifying Chain, Tree, and Graph Structures**
-2502.05078v1 by Tushar Pandey, Ara Ghukasyan, Oktay Goktas, Santosh Kumar Radha
+##### **Towards Interpretable Radiology Report Generation via Concept Bottlenecks using a Multi-Agentic RAG**
+2412.16086v2 by Hasan Md Tusfiqur Alam, Devansh Srivastav, Md Abdul Kadir, Daniel Sonntag
 
-Large Language Models (LLMs) have demonstrated impressive reasoning
-capabilities, yet their performance is highly dependent on the prompting
-strategy and model scale. While reinforcement learning and fine-tuning have
-been deployed to boost reasoning, these approaches incur substantial
-computational and data overhead. In this work, we introduce Adaptive Graph of
-Thoughts (AGoT), a dynamic, graph-based inference framework that enhances LLM
-reasoning solely at test time. Rather than relying on fixed-step methods like
-Chain of Thought (CoT) or Tree of Thoughts (ToT), AGoT recursively decomposes
-complex queries into structured subproblems, forming an dynamic directed
-acyclic graph (DAG) of interdependent reasoning steps. By selectively expanding
-only those subproblems that require further analysis, AGoT unifies the
-strengths of chain, tree, and graph paradigms into a cohesive framework that
-allocates computation where it is most needed. We validate our approach on
-diverse benchmarks spanning multi-hop retrieval, scientific reasoning, and
-mathematical problem-solving, achieving up to 46.2% improvement on scientific
-reasoning tasks (GPQA) - comparable to gains achieved through computationally
-intensive reinforcement learning approaches and outperforming state-of-the-art
-iterative approaches. These results suggest that dynamic decomposition and
-structured recursion offer a scalable, cost-effective alternative to
-post-training modifications, paving the way for more robust, general-purpose
-reasoning in LLMs.
+Deep learning has advanced medical image classification, but interpretability
+challenges hinder its clinical adoption. This study enhances interpretability
+in Chest X-ray (CXR) classification by using concept bottleneck models (CBMs)
+and a multi-agent Retrieval-Augmented Generation (RAG) system for report
+generation. By modeling relationships between visual features and clinical
+concepts, we create interpretable concept vectors that guide a multi-agent RAG
+system to generate radiology reports, enhancing clinical relevance,
+explainability, and transparency. Evaluation of the generated reports using an
+LLM-as-a-judge confirmed the interpretability and clinical utility of our
+model's outputs. On the COVID-QU dataset, our model achieved 81% classification
+accuracy and demonstrated robust report generation performance, with five key
+metrics ranging between 84% and 90%. This interpretable multi-agent framework
+bridges the gap between high-performance AI and the explainability required for
+reliable AI-driven CXR analysis in clinical settings. Our code is available at
+https://github.com/tifat58/IRR-with-CBM-RAG.git.
 
-摘要：大型語言模型 (LLM) 已展現令人印象深刻的推理能力，但其效能高度依賴於提示策略和模型規模。雖然強化學習和微調已被用於提升推理，但這些方法會造成大量的運算和資料開銷。在這項工作中，我們引入了「適應性思考圖」(AGoT)，一個動態的、基於圖形的推論架構，它僅在測試時就能增強 LLM 推理。AGoT 並非依賴於鏈式思考 (CoT) 或樹狀思考 (ToT) 等固定步驟方法，而是遞迴地將複雜的查詢分解成結構化的子問題，形成一個由相互依賴的推理步驟所組成的動態有向無環圖 (DAG)。透過選擇性地僅擴充那些需要進一步分析的子問題，AGoT 將鏈式、樹狀和圖形範例的優勢統一到一個緊密的架構中，將運算分配到最需要的地方。我們在跨越多重跳躍檢索、科學推理和數學問題解決等多樣基準上驗證了我們的做法，在科學推理任務 (GPQA) 上達到了高達 46.2% 的改進，這與透過運算密集的強化學習方法所獲得的增益相當，並且優於最先進的迭代方法。這些結果表明，動態分解和結構化遞迴提供了一個可擴充、具成本效益的替代方案，用於訓練後修改，為 LLM 中更強健、更通用的推理鋪平了道路。
+摘要：深度學習已提升醫學影像分類，但可解釋性挑戰阻礙其臨床應用。本研究透過使用概念瓶頸模型 (CBM) 和多代理檢索增強生成 (RAG) 系統進行報告生成，來增強胸部 X 光 (CXR) 分類的可解釋性。透過建模視覺特徵與臨床概念之間的關係，我們建立可解釋的概念向量，引導多代理 RAG 系統生成放射報告，增強臨床相關性、可解釋性和透明度。使用 LLM 作為評審員對生成報告進行評估，確認了我們模型輸出的可解釋性和臨床效用。在 COVID-QU 資料集上，我們的模型達到了 81% 的分類準確率，並展示了穩健的報告生成效能，五項關鍵指標介於 84% 至 90% 之間。這個可解釋的多代理架構彌合了高性能 AI 與臨床環境中可靠的 AI 驅動 CXR 分析所需的解釋性之間的差距。我們的程式碼可於 https://github.com/tifat58/IRR-with-CBM-RAG.git 取得。
 
-##### **Enhancing Knowledge Graph Construction: Evaluating with Emphasis on Hallucination, Omission, and Graph Similarity Metrics**
-2502.05239v1 by Hussam Ghanem, Christophe Cruz
+##### **Critique of Impure Reason: Unveiling the reasoning behaviour of medical Large Language Models**
+2412.15748v1 by Shamus Sim, Tyrone Chen
 
-Recent advancements in large language models have demonstrated significant
-potential in the automated construction of knowledge graphs from unstructured
-text. This paper builds upon our previous work [16], which evaluated various
-models using metrics like precision, recall, F1 score, triple matching, and
-graph matching, and introduces a refined approach to address the critical
-issues of hallucination and omission. We propose an enhanced evaluation
-framework incorporating BERTScore for graph similarity, setting a practical
-threshold of 95% for graph matching. Our experiments focus on the Mistral
-model, comparing its original and fine-tuned versions in zero-shot and few-shot
-settings. We further extend our experiments using examples from the KELM-sub
-training dataset, illustrating that the fine-tuned model significantly improves
-knowledge graph construction accuracy while reducing the exact hallucination
-and omission. However, our findings also reveal that the fine-tuned models
-perform worse in generalization tasks on the KELM-sub dataset. This study
-underscores the importance of comprehensive evaluation metrics in advancing the
-state-of-the-art in knowledge graph construction from textual data.
+Background: Despite the current ubiquity of Large Language Models (LLMs)
+across the medical domain, there is a surprising lack of studies which address
+their reasoning behaviour. We emphasise the importance of understanding
+reasoning behaviour as opposed to high-level prediction accuracies, since it is
+equivalent to explainable AI (XAI) in this context. In particular, achieving
+XAI in medical LLMs used in the clinical domain will have a significant impact
+across the healthcare sector. Results: Therefore, we define the concept of
+reasoning behaviour in the specific context of medical LLMs. We then categorise
+and discuss the current state of the art of methods which evaluate reasoning
+behaviour in medical LLMs. Finally, we propose theoretical frameworks which can
+empower medical professionals or machine learning engineers to gain insight
+into the low-level reasoning operations of these previously obscure models.
+Conclusion: The subsequent increased transparency and trust in medical machine
+learning models by clinicians as well as patients will accelerate the
+integration, application as well as further development of medical AI for the
+healthcare system as a whole
 
-摘要：大型語言模型的最新進展已證明在從非結構化文字自動建構知識圖譜方面具有顯著的潛力。本文建立在我們先前的研究 [16] 之上，該研究使用準確度、召回率、F1 分數、三元組匹配和圖形匹配等指標評估各種模型，並引入了一種改進的方法來解決幻覺和遺漏的關鍵問題。我們提出一個增強的評估框架，結合 BERTScore 來進行圖形相似性，並將圖形匹配的實際閾值設定為 95%。我們的實驗重點在 Mistral 模型上，比較其原始版本和微調版本在零次學習和少量學習的設定中。我們進一步使用 KELM-sub 訓練資料集中的範例來擴展我們的實驗，說明微調後的模型顯著提高了知識圖譜建構的準確度，同時減少了精確的幻覺和遺漏。然而，我們的研究結果也顯示，微調後的模型在 KELM-sub 資料集上的泛化任務表現較差。這項研究強調了全面評估指標在推進從文字資料建構知識圖譜的最新技術方面的重要性。
+摘要：背景：儘管大型語言模型 (LLM) 目前在醫療領域無所不在，但令人驚訝的是，探討其推理行為的研究卻相當缺乏。我們強調了解推理行為而非高層級的預測準確度非常重要，因為在這種情況下，這等同於可解釋 AI (XAI)。尤其是在臨床領域中使用的醫療 LLM 中實現 XAI，將對整個醫療保健產業產生重大影響。結果：因此，我們在醫療 LLM 的特定背景下定義了推理行為的概念。接著我們分類並探討當前評估醫療 LLM 中推理行為的方法的最新技術。最後，我們提出理論架構，讓醫療專業人員或機器學習工程師得以深入了解這些先前模糊模型的低層級推理運算。結論：臨床醫生和患者對醫療機器學習模型的透明度和信任度隨之提升，將加速醫療 AI 在整個醫療保健系統中的整合、應用和進一步發展。
 
-##### **Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research**
-2502.04644v1 by Junde Wu, Jiayuan Zhu, Yuyuan Liu
+##### **Cognition Chain for Explainable Psychological Stress Detection on Social Media**
+2412.14009v1 by Xin Wang, Boyan Gao, Yi Dai, Lei Cao, Liang Zhao, Yibo Yang, David Clifton
 
-We introduce Agentic Reasoning, a framework that enhances large language
-model (LLM) reasoning by integrating external tool-using agents. Unlike
-conventional LLM-based reasoning approaches, which rely solely on internal
-inference, Agentic Reasoning dynamically engages web search, code execution,
-and structured reasoning-context memory to solve complex problems requiring
-deep research and multi-step logical deduction. Our framework introduces the
-Mind Map agent, which constructs a structured knowledge graph to track logical
-relationships, improving deductive reasoning. Additionally, the integration of
-web-search and coding agents enables real-time retrieval and computational
-analysis, enhancing reasoning accuracy and decision-making. Evaluations on
-PhD-level scientific reasoning (GPQA) and domain-specific deep research tasks
-demonstrate that our approach significantly outperforms existing models,
-including leading retrieval-augmented generation (RAG) systems and
-closed-source LLMs. Moreover, our results indicate that agentic reasoning
-improves expert-level knowledge synthesis, test-time scalability, and
-structured problem-solving. The code is at:
-https://github.com/theworldofagents/Agentic-Reasoning.
+Stress is a pervasive global health issue that can lead to severe mental
+health problems. Early detection offers timely intervention and prevention of
+stress-related disorders. The current early detection models perform "black
+box" inference suffering from limited explainability and trust which blocks the
+real-world clinical application. Thanks to the generative properties introduced
+by the Large Language Models (LLMs), the decision and the prediction from such
+models are semi-interpretable through the corresponding description. However,
+the existing LLMs are mostly trained for general purposes without the guidance
+of psychological cognitive theory. To this end, we first highlight the
+importance of prior theory with the observation of performance boosted by the
+chain-of-thoughts tailored for stress detection. This method termed Cognition
+Chain explicates the generation of stress through a step-by-step cognitive
+perspective based on cognitive appraisal theory with a progress pipeline:
+Stimulus $\rightarrow$ Evaluation $\rightarrow$ Reaction $\rightarrow$ Stress
+State, guiding LLMs to provide comprehensive reasoning explanations. We further
+study the benefits brought by the proposed Cognition Chain format by utilising
+it as a synthetic dataset generation template for LLMs instruction-tuning and
+introduce CogInstruct, an instruction-tuning dataset for stress detection. This
+dataset is developed using a three-stage self-reflective annotation pipeline
+that enables LLMs to autonomously generate and refine instructional data. By
+instruction-tuning Llama3 with CogInstruct, we develop CogLLM, an explainable
+stress detection model. Evaluations demonstrate that CogLLM achieves
+outstanding performance while enhancing explainability. Our work contributes a
+novel approach by integrating cognitive theories into LLM reasoning processes,
+offering a promising direction for future explainable AI research.
 
-摘要：我們引入了代理推理，一個透過整合外部工具使用代理來增強大型語言模型 (LLM) 推理的框架。與僅依賴於內部推論的傳統基於 LLM 的推理方法不同，代理推理動態地運用網路搜尋、程式碼執行和結構化推理情境記憶來解決需要深入研究和多步驟邏輯推論的複雜問題。我們的框架引入了心智圖代理，它建立一個結構化的知識圖譜來追蹤邏輯關係，改善演繹推理。此外，整合網路搜尋和編碼代理能進行即時擷取和運算分析，增強推理準確度和決策制定。在博士等級科學推理 (GPQA) 和特定領域的深入研究任務上的評估顯示，我們的做法明顯優於現有模型，包括領先的檢索增強生成 (RAG) 系統和封閉原始碼 LLM。此外，我們的結果顯示，代理推理改進了專家級知識綜合、測試時間可擴充性和結構化問題解決。程式碼在：https://github.com/theworldofagents/Agentic-Reasoning。
+摘要：壓力是一個普遍的全球性健康問題，可能會導致嚴重的精神
+健康問題。早期發現提供及時的干預和預防
+壓力相關疾病。目前的早期發現模型執行「黑
+盒子」推論，存在可解釋性和信任度有限的問題，阻礙了
+現實世界的臨床應用。多虧了大型語言模型 (LLM) 引入的生成屬性，此類
+模型的決策和預測通過對應描述具有半可解釋性。然而，
+現有的 LLM 主要針對一般用途進行訓練，沒有心理認知理論的指導。為此，我們首先強調
+先驗理論的重要性，並觀察到針對壓力檢測量身定制的思想鏈提升了性能。這種方法稱為認知
+鏈通過基於認知評估理論的循序漸進的認知視角闡明了壓力的產生，並具有進度管道：
+刺激 $\rightarrow$ 評估 $\rightarrow$ 反應 $\rightarrow$ 壓力
+狀態，指導 LLM 提供全面的推理解釋。我們進一步
+通過將其用作 LLM 指令調整的合成數據集生成模板來研究所提出的認知鏈格式帶來的優點，並介紹 CogInstruct，這是一個針對壓力檢測的指令調整數據集。這個
+數據集是使用一個三階段的自省標註管道開發的，使 LLM 能夠自主生成和優化指令數據。通過
+使用 CogInstruct 對 Llama3 進行指令調整，我們開發了 CogLLM，這是一個可解釋的
+壓力檢測模型。評估表明，CogLLM 在提高可解釋性的同時實現了出色的性能。我們的研究通過將認知理論整合到 LLM 推理過程中，提出了一種新穎的方法，
+為未來的可解釋人工智能研究提供了一個有希望的方向。
 
-##### **Position-aware Automatic Circuit Discovery**
-2502.04577v1 by Tal Haklay, Hadas Orgad, David Bau, Aaron Mueller, Yonatan Belinkov
+##### **2-Factor Retrieval for Improved Human-AI Decision Making in Radiology**
+2412.00372v1 by Jim Solomon, Laleh Jalilian, Alexander Vilesov, Meryl Mathew, Tristan Grogan, Arash Bedayat, Achuta Kadambi
 
-A widely used strategy to discover and understand language model mechanisms
-is circuit analysis. A circuit is a minimal subgraph of a model's computation
-graph that executes a specific task. We identify a gap in existing circuit
-discovery methods: they assume circuits are position-invariant, treating model
-components as equally relevant across input positions. This limits their
-ability to capture cross-positional interactions or mechanisms that vary across
-positions. To address this gap, we propose two improvements to incorporate
-positionality into circuits, even on tasks containing variable-length examples.
-First, we extend edge attribution patching, a gradient-based method for circuit
-discovery, to differentiate between token positions. Second, we introduce the
-concept of a dataset schema, which defines token spans with similar semantics
-across examples, enabling position-aware circuit discovery in datasets with
-variable length examples. We additionally develop an automated pipeline for
-schema generation and application using large language models. Our approach
-enables fully automated discovery of position-sensitive circuits, yielding
-better trade-offs between circuit size and faithfulness compared to prior work.
+Human-machine teaming in medical AI requires us to understand to what degree
+a trained clinician should weigh AI predictions. While previous work has shown
+the potential of AI assistance at improving clinical predictions, existing
+clinical decision support systems either provide no explainability of their
+predictions or use techniques like saliency and Shapley values, which do not
+allow for physician-based verification. To address this gap, this study
+compares previously used explainable AI techniques with a newly proposed
+technique termed '2-factor retrieval (2FR)', which is a combination of
+interface design and search retrieval that returns similarly labeled data
+without processing this data. This results in a 2-factor security blanket
+where: (a) correct images need to be retrieved by the AI; and (b) humans should
+associate the retrieved images with the current pathology under test. We find
+that when tested on chest X-ray diagnoses, 2FR leads to increases in clinician
+accuracy, with particular improvements when clinicians are radiologists and
+have low confidence in their decision. Our results highlight the importance of
+understanding how different modes of human-AI decision making may impact
+clinician accuracy in clinical decision support systems.
 
-摘要：廣泛用於發現和了解語言模型機制的策略是電路分析。電路是模型計算圖的最小子圖，可執行特定任務。我們找出電路發現方法中的一個缺口：它們假設電路與位置無關，將模型組件視為在輸入位置中同樣相關。這限制了它們捕捉跨位置互動或在不同位置中變化的機制的能力。為了解決這個缺口，我們提出兩項改進，將位置性納入電路中，即使在包含變長範例的任務中也是如此。首先，我們擴充邊緣屬性修補，一種基於梯度的電路發現方法，以區分符號位置。其次，我們引入了資料集架構的概念，它定義了在範例中具有類似語義的符號跨距，使我們可以在具有變長範例的資料集中進行與位置相關的電路發現。此外，我們開發了一個自動化管線，用於使用大型語言模型進行架構生成和應用。我們的做法能讓位置敏感電路的發現完全自動化，與先前的研究相比，在電路大小和忠實度之間產生了更好的權衡。
+摘要：人機協作在醫療 AI 中，需要我們理解受過訓練的臨床醫生在多大程度上應重視 AI 預測。雖然先前的研究顯示 AI 輔助在改善臨床預測方面的潛力，但現有的臨床決策支援系統，要不就沒有提供預測的可解釋性，要不就是使用像顯著性和 Shapley 值之類的技術，這些技術不允許基於醫生的驗證。為了解決這個差距，本研究將先前使用的可解釋 AI 技術與一種新提出的稱為「2 因子檢索 (2FR)」的技術進行比較，後者是一種介面設計和搜尋檢索的組合，它會傳回標籤相似的資料，而不會處理這些資料。這會產生一個 2 因子安全機制，其中：(a) 正確的影像需要由 AI 檢索；(b) 人類應將檢索的影像與正在測試中的病理聯想起來。我們發現，當在胸部 X 光診斷上進行測試時，2FR 會提高臨床醫生的準確度，特別是在臨床醫生是放射科醫生且對其決策信心不足時，會有顯著的改善。我們的結果強調了理解人機決策的不同模式如何影響臨床醫生在臨床決策支援系統中的準確性的重要性。
 
-##### **Heterogeneous Swarms: Jointly Optimizing Model Roles and Weights for Multi-LLM Systems**
-2502.04510v1 by Shangbin Feng, Zifeng Wang, Palash Goyal, Yike Wang, Weijia Shi, Huang Xia, Hamid Palangi, Luke Zettlemoyer, Yulia Tsvetkov, Chen-Yu Lee, Tomas Pfister
+##### **Mapping Public Perception of Artificial Intelligence: Expectations, Risk-Benefit Tradeoffs, and Value As Determinants for Societal Acceptance**
+2411.19356v1 by Philipp Brauner, Felix Glawe, Gian Luca Liehner, Luisa Vervier, Martina Ziefle
 
-We propose Heterogeneous Swarms, an algorithm to design multi-LLM systems by
-jointly optimizing model roles and weights. We represent multi-LLM systems as
-directed acyclic graphs (DAGs) of LLMs with topological message passing for
-collaborative generation. Given a pool of LLM experts and a utility function,
-Heterogeneous Swarms employs two iterative steps: role-step and weight-step.
-For role-step, we interpret model roles as learning a DAG that specifies the
-flow of inputs and outputs between LLMs. Starting from a swarm of random
-continuous adjacency matrices, we decode them into discrete DAGs, call the LLMs
-in topological order, evaluate on the utility function (e.g. accuracy on a
-task), and optimize the adjacency matrices with particle swarm optimization
-based on the utility score. For weight-step, we assess the contribution of
-individual LLMs in the multi-LLM systems and optimize model weights with swarm
-intelligence. We propose JFK-score to quantify the individual contribution of
-each LLM in the best-found DAG of the role-step, then optimize model weights
-with particle swarm optimization based on the JFK-score. Experiments
-demonstrate that Heterogeneous Swarms outperforms 15 role- and/or weight-based
-baselines by 18.5% on average across 12 tasks. Further analysis reveals that
-Heterogeneous Swarms discovers multi-LLM systems with heterogeneous model roles
-and substantial collaborative gains, and benefits from the diversity of
-language models.
+Understanding public perception of artificial intelligence (AI) and the
+tradeoffs between potential risks and benefits is crucial, as these perceptions
+might shape policy decisions, influence innovation trajectories for successful
+market strategies, and determine individual and societal acceptance of AI
+technologies. Using a representative sample of 1100 participants from Germany,
+this study examines mental models of AI. Participants quantitatively evaluated
+71 statements about AI's future capabilities (e.g., autonomous driving, medical
+care, art, politics, warfare, and societal divides), assessing the expected
+likelihood of occurrence, perceived risks, benefits, and overall value. We
+present rankings of these projections alongside visual mappings illustrating
+public risk-benefit tradeoffs. While many scenarios were deemed likely,
+participants often associated them with high risks, limited benefits, and low
+overall value. Across all scenarios, 96.4% ($r^2=96.4\%$) of the variance in
+value assessment can be explained by perceived risks ($\beta=-.504$) and
+perceived benefits ($\beta=+.710$), with no significant relation to expected
+likelihood. Demographics and personality traits influenced perceptions of
+risks, benefits, and overall evaluations, underscoring the importance of
+increasing AI literacy and tailoring public information to diverse user needs.
+These findings provide actionable insights for researchers, developers, and
+policymakers by highlighting critical public concerns and individual factors
+essential to align AI development with individual values.
+
+摘要：<paragraph>了解公眾對人工智慧 (AI) 的認知以及潛在風險與好處之間的權衡至關重要，因為這些認知可能會影響政策決策、影響成功市場策略的創新軌跡，並決定個人和社會對 AI 技術的接受度。本研究使用來自德國的 1100 名參與者的代表性樣本，探討了 AI 的心智模型。參與者對 71 項關於 AI 未來能力的陳述（例如，自動駕駛、醫療保健、藝術、政治、戰爭和社會分歧）進行了定量評估，評估預期的發生可能性、感知風險、好處和整體價值。我們展示了這些預測的排名，並附上視覺化映射，說明了公眾的風險收益權衡。儘管許多場景被認為是可能的，但參與者通常將它們與高風險、有限的好處和低整體價值聯繫起來。在所有場景中，96.4% ($r^2=96.4\%$) 的價值評估差異可以用感知風險 ($\beta=-.504$) 和感知好處 ($\beta=+.710$) 來解釋，與預期的可能性沒有顯著關係。人口統計和人格特質影響了對風險、好處和整體評估的看法，這凸顯了提高 AI 素養和根據不同的使用者需求調整公共資訊的重要性。這些發現通過強調關鍵的公共關注和與個人價值觀一致的 AI 開發必不可少的個人因素，為研究人員、開發人員和政策制定者提供了可行的見解。</paragraph>
+
+##### **Explainable AI for Classifying UTI Risk Groups Using a Real-World Linked EHR and Pathology Lab Dataset**
+2411.17645v2 by Yujie Dai, Brian Sullivan, Axel Montout, Amy Dillon, Chris Waller, Peter Acs, Rachel Denholm, Philip Williams, Alastair D Hay, Raul Santos-Rodriguez, Andrew Dowsey
+
+The use of machine learning and AI on electronic health records (EHRs) holds
+substantial potential for clinical insight. However, this approach faces
+challenges due to data heterogeneity, sparsity, temporal misalignment, and
+limited labeled outcomes. In this context, we leverage a linked EHR dataset of
+approximately one million de-identified individuals from Bristol, North
+Somerset, and South Gloucestershire, UK, to characterize urinary tract
+infections (UTIs). We implemented a data pre-processing and curation pipeline
+that transforms the raw EHR data into a structured format suitable for
+developing predictive models focused on data fairness, accountability and
+transparency. Given the limited availability and biases of ground truth UTI
+outcomes, we introduce a UTI risk estimation framework informed by clinical
+expertise to estimate UTI risk across individual patient timelines. Pairwise
+XGBoost models are trained using this framework to differentiate UTI risk
+categories with explainable AI techniques applied to identify key predictors
+and support interpretability. Our findings reveal differences in clinical and
+demographic predictors across risk groups. While this study highlights the
+potential of AI-driven insights to support UTI clinical decision-making,
+further investigation of patient sub-strata and extensive validation are needed
+to ensure robustness and applicability in clinical practice.
 
-摘要：<paragraph>我們提出異質群體，一種演算法，透過共同最佳化模型角色和權重來設計多 LLM 系統。我們將多 LLM 系統表示為 LLM 的有向非循環圖 (DAG)，並透過拓撲訊息傳遞進行協作產生。給定一組 LLM 專家和一個效用函數，異質群體使用兩個反覆步驟：角色步驟和權重步驟。對於角色步驟，我們將模型角色解釋為學習一個 DAG，它指定 LLM 之間輸入和輸出的流動。從一組隨機連續鄰接矩陣開始，我們將它們解碼為離散 DAG，以拓撲順序呼叫 LLM，根據效用函數（例如任務的準確度）進行評估，並根據效用分數使用粒子群最佳化最佳化鄰接矩陣。對於權重步驟，我們評估個別 LLM 在多 LLM 系統中的貢獻，並使用群體智慧最佳化模型權重。我們提出 JFK 分數來量化每個 LLM 在角色步驟中找到的最佳 DAG 中的個別貢獻，然後根據 JFK 分數使用粒子群最佳化最佳化模型權重。實驗表明，異質群體在 12 項任務中平均比 15 個基於角色和/或權重的基線高出 18.5%。進一步的分析表明，異質群體發現具有異質模型角色和大量協作收益的多 LLM 系統，並受益於語言模型的多樣性。</paragraph>
+摘要：電子健康紀錄 (EHR) 中機器學習和 AI 的使用對於臨床見解具有相當大的潛力。然而，由於資料異質性、稀疏性、時間錯位和標籤結果有限，此方法面臨挑戰。在此背景下，我們利用來自英國布里斯托、北薩默塞特和南格洛斯特郡約一百萬名去識別個人連結的 EHR 資料集，來描述尿路感染 (UTI)。我們實施了將原始 EHR 資料轉換為結構化格式的資料前處理和整理管線，適合開發專注於資料公平性、問責制和透明度的預測模型。鑑於 UTI 真實結果的可用性有限和偏差，我們引入了由臨床專業知識告知的 UTI 風險評估架構，以估計個別患者時間軸上的 UTI 風險。成對的 XGBoost 模型使用此架構進行訓練，以區分 UTI 風險類別，並應用可解釋的 AI 技術來識別關鍵預測因子並支持可解釋性。我們的研究結果揭示了不同風險群組在臨床和人口統計預測因子上的差異。雖然這項研究強調了 AI 驅動見解在支援 UTI 臨床決策制定方面的潛力，但仍需要進一步調查患者子群體和廣泛驗證，以確保在臨床實務中的穩健性和適用性。
 
-##### **MedRAG: Enhancing Retrieval-augmented Generation with Knowledge Graph-Elicited Reasoning for Healthcare Copilot**
-2502.04413v1 by Xuejiao Zhao, Siyan Liu, Su-Yin Yang, Chunyan Miao
+##### **Exploring the Requirements of Clinicians for Explainable AI Decision Support Systems in Intensive Care**
+2411.11774v1 by Jeffrey N. Clark, Matthew Wragg, Emily Nielsen, Miquel Perello-Nieto, Nawid Keshtmand, Michael Ambler, Shiv Sharma, Christopher P. Bourdeaux, Amberly Brigden, Raul Santos-Rodriguez
 
-Retrieval-augmented generation (RAG) is a well-suited technique for
-retrieving privacy-sensitive Electronic Health Records (EHR). It can serve as a
-key module of the healthcare copilot, helping reduce misdiagnosis for
-healthcare practitioners and patients. However, the diagnostic accuracy and
-specificity of existing heuristic-based RAG models used in the medical domain
-are inadequate, particularly for diseases with similar manifestations. This
-paper proposes MedRAG, a RAG model enhanced by knowledge graph (KG)-elicited
-reasoning for the medical domain that retrieves diagnosis and treatment
-recommendations based on manifestations. MedRAG systematically constructs a
-comprehensive four-tier hierarchical diagnostic KG encompassing critical
-diagnostic differences of various diseases. These differences are dynamically
-integrated with similar EHRs retrieved from an EHR database, and reasoned
-within a large language model. This process enables more accurate and specific
-decision support, while also proactively providing follow-up questions to
-enhance personalized medical decision-making. MedRAG is evaluated on both a
-public dataset DDXPlus and a private chronic pain diagnostic dataset (CPDD)
-collected from Tan Tock Seng Hospital, and its performance is compared against
-various existing RAG methods. Experimental results show that, leveraging the
-information integration and relational abilities of the KG, our MedRAG provides
-more specific diagnostic insights and outperforms state-of-the-art models in
-reducing misdiagnosis rates. Our code will be available at
-https://github.com/SNOWTEAM2023/MedRAG
+There is a growing need to understand how digital systems can support
+clinical decision-making, particularly as artificial intelligence (AI) models
+become increasingly complex and less human-interpretable. This complexity
+raises concerns about trustworthiness, impacting safe and effective adoption of
+such technologies. Improved understanding of decision-making processes and
+requirements for explanations coming from decision support tools is a vital
+component in providing effective explainable solutions. This is particularly
+relevant in the data-intensive, fast-paced environments of intensive care units
+(ICUs). To explore these issues, group interviews were conducted with seven ICU
+clinicians, representing various roles and experience levels. Thematic analysis
+revealed three core themes: (T1) ICU decision-making relies on a wide range of
+factors, (T2) the complexity of patient state is challenging for shared
+decision-making, and (T3) requirements and capabilities of AI decision support
+systems. We include design recommendations from clinical input, providing
+insights to inform future AI systems for intensive care.
 
-摘要：檢索增強生成 (RAG) 是一種適用於檢索隱私敏感的電子健康記錄 (EHR) 的技術。它可以作為醫療保健副駕駛的一個關鍵模組，協助減少醫療保健從業人員和患者的誤診。然而，在醫療領域中使用的現有基於啟發法的 RAG 模型的診斷準確性和特異性不足，特別是對於具有類似表現的疾病。本文提出 MedRAG，一種由知識圖譜 (KG) 引發的推理增強的 RAG 模型，用於醫療領域，它根據表現檢索診斷和治療建議。MedRAG 系統性地構建了一個全面的四層階層式診斷 KG，涵蓋各種疾病的關鍵診斷差異。這些差異與從 EHR 資料庫中檢索到的類似 EHR 動態整合，並在大型語言模型中進行推理。這個過程可以實現更準確和具體的決策支援，同時主動提供後續問題，以增強個人化醫療決策制定。MedRAG 在公共資料集 DDXPlus 和從陳篤生醫院收集的私人慢性疼痛診斷資料集 (CPDD) 上進行評估，並將其效能與各種現有 RAG 方法進行比較。實驗結果顯示，利用 KG 的資訊整合和關係能力，我們的 MedRAG 提供了更具體的診斷見解，並在降低誤診率方面優於最先進的模型。我們的程式碼將在 https://github.com/SNOWTEAM2023/MedRAG 提供
+摘要：隨著人工智慧 (AI) 模型變得越來越複雜，且越來越難以被人理解，了解數位系統如何支援臨床決策的需求也日益增加。這種複雜性引發了對可信度的疑慮，影響了此類技術的安全且有效採用。改善對決策制定流程的理解，以及對決策支援工具所提供說明的要求，是提供有效可解釋解決方案的重要組成部分。這在資料密集、快節奏的加護病房 (ICU) 環境中特別相關。為了探討這些問題，對七位 ICU 臨床醫師進行了小組訪談，這些醫師代表了不同的角色和經驗層級。主題分析揭露了三個核心主題：(T1) ICU 決策制定依賴於廣泛的因素，(T2) 病患狀態的複雜性對共同決策制定構成挑戰，以及 (T3) AI 決策支援系統的要求和能力。我們納入了臨床輸入的設計建議，提供見解以提供資訊給未來用於加護的 AI 系統。
 
-##### **Ontology-Guided, Hybrid Prompt Learning for Generalization in Knowledge Graph Question Answering**
-2502.03992v1 by Longquan Jiang, Junbo Huang, Cedric Möller, Ricardo Usbeck
+##### **Artificial Intelligence in Pediatric Echocardiography: Exploring Challenges, Opportunities, and Clinical Applications with Explainable AI and Federated Learning**
+2411.10255v1 by Mohammed Yaseen Jabarulla, Theodor Uden, Thomas Jack, Philipp Beerbaum, Steffen Oeltze-Jafra
 
-Most existing Knowledge Graph Question Answering (KGQA) approaches are
-designed for a specific KG, such as Wikidata, DBpedia or Freebase. Due to the
-heterogeneity of the underlying graph schema, topology and assertions, most
-KGQA systems cannot be transferred to unseen Knowledge Graphs (KGs) without
-resource-intensive training data. We present OntoSCPrompt, a novel Large
-Language Model (LLM)-based KGQA approach with a two-stage architecture that
-separates semantic parsing from KG-dependent interactions. OntoSCPrompt first
-generates a SPARQL query structure (including SPARQL keywords such as SELECT,
-ASK, WHERE and placeholders for missing tokens) and then fills them with
-KG-specific information. To enhance the understanding of the underlying KG, we
-present an ontology-guided, hybrid prompt learning strategy that integrates KG
-ontology into the learning process of hybrid prompts (e.g., discrete and
-continuous vectors). We also present several task-specific decoding strategies
-to ensure the correctness and executability of generated SPARQL queries in both
-stages. Experimental results demonstrate that OntoSCPrompt performs as well as
-SOTA approaches without retraining on a number of KGQA datasets such as CWQ,
-WebQSP and LC-QuAD 1.0 in a resource-efficient manner and can generalize well
-to unseen domain-specific KGs like DBLP-QuAD and CoyPu KG Code:
-\href{https://github.com/LongquanJiang/OntoSCPrompt}{https://github.com/LongquanJiang/OntoSCPrompt}
+Pediatric heart diseases present a broad spectrum of congenital and acquired
+diseases. More complex congenital malformations require a differentiated and
+multimodal decision-making process, usually including echocardiography as a
+central imaging method. Artificial intelligence (AI) offers considerable
+promise for clinicians by facilitating automated interpretation of pediatric
+echocardiography data. However, adapting AI technologies for pediatric
+echocardiography analysis has challenges such as limited public data
+availability, data privacy, and AI model transparency. Recently, researchers
+have focused on disruptive technologies, such as federated learning (FL) and
+explainable AI (XAI), to improve automatic diagnostic and decision support
+workflows. This study offers a comprehensive overview of the limitations and
+opportunities of AI in pediatric echocardiography, emphasizing the synergistic
+workflow and role of XAI and FL, identifying research gaps, and exploring
+potential future developments. Additionally, three relevant clinical use cases
+demonstrate the functionality of XAI and FL with a focus on (i) view
+recognition, (ii) disease classification, (iii) segmentation of cardiac
+structures, and (iv) quantitative assessment of cardiac function.
 
-摘要：現有的知識圖譜問答（KGQA）方法大多是為特定 KG 而設計的，例如 Wikidata、DBpedia 或 Freebase。由於底層圖形模式、拓撲和斷言的異質性，大多數 KGQA 系統無法在沒有資源密集型訓練資料的情況下轉移到未見過的知識圖譜（KG）。我們提出 OntoSCPrompt，這是一種基於大型語言模型（LLM）的新型 KGQA 方法，採用兩階段架構，將語義解析與依賴 KG 的互動分開。OntoSCPrompt 首先生成 SPARQL 查詢結構（包括 SPARQL 關鍵字，例如 SELECT、ASK、WHERE 和缺失令牌的佔位符），然後用 KG 特定的資訊填寫它們。為了增強對底層 KG 的理解，我們提出了一種由本体指導的混合提示學習策略，將 KG 本体整合到混合提示（例如，離散和連續向量）的學習過程中。我們還提出了多種特定任務的解碼策略，以確保在兩個階段中生成的 SPARQL 查詢的正確性和可執行性。實驗結果表明，OntoSCPrompt 在 CWQ、WebQSP 和 LC-QuAD 1.0 等多個 KGQA 資料集上執行時，效能與 SOTA 方法一樣好，且資源使用效率高，並且可以很好地概括到未見過的特定領域 KG，例如 DBLP-QuAD 和 CoyPu KG Code：
-\href{https://github.com/LongquanJiang/OntoSCPrompt}{https://github.com/LongquanJiang/OntoSCPrompt}
+摘要：小兒心臟疾病呈現先天性與後天性疾病的廣泛光譜。較複雜的先天性畸形需要一個差異化且多模式的決策過程，通常包括超音波檢查作為主要的影像方法。人工智慧 (AI) 為臨床醫生提供了相當大的希望，因為它可以促進小兒超音波檢查資料的自動化解讀。然而，將人工智慧技術應用於小兒超音波檢查分析有許多挑戰，例如有限的公開資料可用性、資料隱私和人工智慧模型透明度。最近，研究人員專注於破壞性技術，例如聯合學習 (FL) 和可解釋人工智慧 (XAI)，以改善自動診斷和決策支援工作流程。本研究提供了人工智慧在小兒超音波檢查中的限制和機會的全面概述，強調了 XAI 和 FL 的協同工作流程和角色，找出研究差距並探討潛在的未來發展。此外，三個相關的臨床使用案例展示了 XAI 和 FL 的功能，重點在於 (i) 檢視辨識、(ii) 疾病分類、(iii) 心臟結構分割和 (iv) 心臟功能的量化評估。
 
-##### **Multimodal Medical Code Tokenizer**
-2502.04397v2 by Xiaorui Su, Shvat Messica, Yepeng Huang, Ruth Johnson, Lukas Fesser, Shanghua Gao, Faryad Sahneh, Marinka Zitnik
+##### **Enhancing Osteoporosis Detection: An Explainable Multi-Modal Learning Framework with Feature Fusion and Variable Clustering**
+2411.00916v2 by Mehdi Hosseini Chagahi, Saeed Mohammadi Dashtaki, Niloufar Delfan, Nadia Mohammadi, Alireza Samari, Behzad Moshiri, Md. Jalil Piran, Oliver Faust
 
-Foundation models trained on patient electronic health records (EHRs) require
-tokenizing medical data into sequences of discrete vocabulary items. Existing
-tokenizers treat medical codes from EHRs as isolated textual tokens. However,
-each medical code is defined by its textual description, its position in
-ontological hierarchies, and its relationships to other codes, such as disease
-co-occurrences and drug-treatment associations. Medical vocabularies contain
-more than 600,000 codes with critical information for clinical reasoning. We
-introduce MedTok, a multimodal medical code tokenizer that uses the text
-descriptions and relational context of codes. MedTok processes text using a
-language model encoder and encodes the relational structure with a graph
-encoder. It then quantizes both modalities into a unified token space,
-preserving modality-specific and cross-modality information. We integrate
-MedTok into five EHR models and evaluate it on operational and clinical tasks
-across in-patient and out-patient datasets, including outcome prediction,
-diagnosis classification, drug recommendation, and risk stratification.
-Swapping standard EHR tokenizers with MedTok improves AUPRC across all EHR
-models, by 4.10% on MIMIC-III, 4.78% on MIMIC-IV, and 11.30% on EHRShot, with
-the largest gains in drug recommendation. Beyond EHR modeling, we demonstrate
-using MedTok tokenizer with medical QA systems. Our results demonstrate the
-potential of MedTok as a unified tokenizer for medical codes, improving
-tokenization for medical foundation models.
+Osteoporosis is a common condition that increases fracture risk, especially
+in older adults. Early diagnosis is vital for preventing fractures, reducing
+treatment costs, and preserving mobility. However, healthcare providers face
+challenges like limited labeled data and difficulties in processing medical
+images. This study presents a novel multi-modal learning framework that
+integrates clinical and imaging data to improve diagnostic accuracy and model
+interpretability. The model utilizes three pre-trained networks-VGG19,
+InceptionV3, and ResNet50-to extract deep features from X-ray images. These
+features are transformed using PCA to reduce dimensionality and focus on the
+most relevant components. A clustering-based selection process identifies the
+most representative components, which are then combined with preprocessed
+clinical data and processed through a fully connected network (FCN) for final
+classification. A feature importance plot highlights key variables, showing
+that Medical History, BMI, and Height were the main contributors, emphasizing
+the significance of patient-specific data. While imaging features were
+valuable, they had lower importance, indicating that clinical data are crucial
+for accurate predictions. This framework promotes precise and interpretable
+predictions, enhancing transparency and building trust in AI-driven diagnoses
+for clinical integration.
 
-摘要：<paragraph>在患者电子健康记录 (EHR) 上训练的基础模型需要将医学数据标记为离散词汇项序列。现有的标记器将 EHR 中的医学代码视为孤立的文本标记。然而，每个医学代码都由其文本描述、在本体层次结构中的位置以及与其他代码的关系（例如疾病共现和药物治疗关联）来定义。医学词汇表包含超过 600,000 个代码，这些代码包含临床推理的关键信息。我们引入了 MedTok，这是一种多模态医学代码标记器，它使用文本描述和代码的关系上下文。MedTok 使用语言模型编码器处理文本，并使用图编码器对关系结构进行编码。然后，它将这两种模态量化为一个统一的标记空间，保留特定于模态和跨模态的信息。我们将 MedTok 集成到五个 EHR 模型中，并在住院和门诊数据集（包括结果预测、诊断分类、药物推荐和风险分层）上对其实施操作和临床任务进行评估。用 MedTok 替换标准 EHR 标记器可提高所有 EHR 模型的 AUPRC，在 MIMIC-III 上提高 4.10%，在 MIMIC-IV 上提高 4.78%，在 EHRShot 上提高 11.30%，其中药物推荐的增益最大。除了 EHR 建模之外，我们还演示了将 MedTok 标记器与医学问答系统结合使用。我们的结果证明了 MedTok 作为医学代码的统一标记器的潜力，改进了医学基础模型的标记化。</paragraph>
+摘要：骨質疏鬆症是一種常見的疾病，會增加骨折的風險，特別是老年人。早期診斷對於預防骨折、降低治療成本和維持行動能力至關重要。然而，醫療保健提供者面臨著標記數據有限和處理醫學影像困難等挑戰。本研究提出了一個新穎的多模式學習框架，該框架整合了臨床和影像數據，以提高診斷準確性和模型可解釋性。該模型利用三個預訓練的網路，VGG19、InceptionV3 和 ResNet50，從 X 射線影像中提取深度特徵。這些特徵使用 PCA 轉換以降低維度並專注於最相關的組成部分。基於聚類的選擇過程識別出最具代表性的組成部分，然後將這些組成部分與預處理的臨床數據結合，並通過全連接網路 (FCN) 進行最終分類。特徵重要性圖突出了關鍵變數，表明病史、BMI 和身高是主要貢獻因素，強調了患者特定數據的重要性。雖然影像特徵很有價值，但它們的重要性較低，這表明臨床數據對於準確預測至關重要。此框架促进了準確且可解釋的預測，提高了透明度，並建立了對 AI 驅動診斷在臨床整合中的信任。
 
-##### **Division-of-Thoughts: Harnessing Hybrid Language Model Synergy for Efficient On-Device Agents**
-2502.04392v1 by Chenyang Shao, Xinyuan Hu, Yutang Lin, Fengli Xu
+##### **A Review of Deep Learning Approaches for Non-Invasive Cognitive Impairment Detection**
+2410.19898v1 by Muath Alsuhaibani, Ali Pourramezan Fard, Jian Sun, Farida Far Poor, Peter S. Pressman, Mohammad H. Mahoor
 
-The rapid expansion of web content has made on-device AI assistants
-indispensable for helping users manage the increasing complexity of online
-tasks. The emergent reasoning ability in large language models offer a
-promising path for next-generation on-device AI agents. However, deploying
-full-scale Large Language Models (LLMs) on resource-limited local devices is
-challenging. In this paper, we propose Division-of-Thoughts (DoT), a
-collaborative reasoning framework leveraging the synergy between locally
-deployed Smaller-scale Language Models (SLMs) and cloud-based LLMs. DoT
-leverages a Task Decomposer to elicit the inherent planning abilities in
-language models to decompose user queries into smaller sub-tasks, which allows
-hybrid language models to fully exploit their respective strengths. Besides,
-DoT employs a Task Scheduler to analyze the pair-wise dependency of sub-tasks
-and create a dependency graph, facilitating parallel reasoning of sub-tasks and
-the identification of key steps. To allocate the appropriate model based on the
-difficulty of sub-tasks, DoT leverages a Plug-and-Play Adapter, which is an
-additional task head attached to the SLM that does not alter the SLM's
-parameters. To boost adapter's task allocation capability, we propose a
-self-reinforced training method that relies solely on task execution feedback.
-Extensive experiments on various benchmarks demonstrate that our DoT
-significantly reduces LLM costs while maintaining competitive reasoning
-accuracy. Specifically, DoT reduces the average reasoning time and API costs by
-66.12% and 83.57%, while achieving comparable reasoning accuracy with the best
-baseline methods.
+This review paper explores recent advances in deep learning approaches for
+non-invasive cognitive impairment detection. We examine various non-invasive
+indicators of cognitive decline, including speech and language, facial, and
+motoric mobility. The paper provides an overview of relevant datasets,
+feature-extracting techniques, and deep-learning architectures applied to this
+domain. We have analyzed the performance of different methods across modalities
+and observed that speech and language-based methods generally achieved the
+highest detection performance. Studies combining acoustic and linguistic
+features tended to outperform those using a single modality. Facial analysis
+methods showed promise for visual modalities but were less extensively studied.
+Most papers focused on binary classification (impaired vs. non-impaired), with
+fewer addressing multi-class or regression tasks. Transfer learning and
+pre-trained language models emerged as popular and effective techniques,
+especially for linguistic analysis. Despite significant progress, several
+challenges remain, including data standardization and accessibility, model
+explainability, longitudinal analysis limitations, and clinical adaptation.
+Lastly, we propose future research directions, such as investigating
+language-agnostic speech analysis methods, developing multi-modal diagnostic
+systems, and addressing ethical considerations in AI-assisted healthcare. By
+synthesizing current trends and identifying key obstacles, this review aims to
+guide further development of deep learning-based cognitive impairment detection
+systems to improve early diagnosis and ultimately patient outcomes.
 
-摘要：<paragraph>網頁內容快速擴充，使得行動裝置上的 AI 助理在協助使用者管理日益複雜的線上工作上變得不可或缺。大型語言模型中浮現的推理能力為新一代行動裝置上的 AI 代理提供了一條有希望的途徑。然而，在資源有限的本機裝置上部署全規模的大型語言模型 (LLM) 是一項挑戰。在本文中，我們提出了思想分工 (DoT)，一個協作推理框架，利用了本地部署的小型語言模型 (SLM) 與雲端 LLM 之間的協同效應。DoT 利用任務分解器引出語言模型中固有的規劃能力，將使用者查詢分解成較小的子任務，這允許混合語言模型充分發揮其各自的優勢。此外，DoT 雇用了一個任務排程器來分析子任務的成對依賴性並建立一個依賴性圖，促進子任務的並行推理和關鍵步驟的識別。為了根據子任務的難度分配適當的模型，DoT 利用了即插即用適配器，這是一個附加在 SLM 上的任務頭，不會改變 SLM 的參數。為了提升適配器的任務分配能力，我們提出了一種自我強化訓練方法，它僅依賴於任務執行回饋。在各種基準上的廣泛實驗表明，我們的 DoT 大幅降低了 LLM 成本，同時維持了有競爭力的推理準確度。具體來說，DoT 將平均推理時間和 API 成本分別降低了 66.12% 和 83.57%，同時達到了與最佳基準方法相當的推理準確度。</paragraph>
+摘要：本篇評論探討了深度學習方法在非侵入式認知功能障礙檢測上的最新進展。我們檢視了各種非侵入式的認知衰退指標，包括語言和語言、面部和運動機能。本文概述了與此領域相關的資料集、特徵提取技術和深度學習架構。我們分析了不同方法在不同方式上的表現，並觀察到基於語言和語言的方法通常能達到最高的檢測表現。結合聲學和語言特徵的研究往往優於使用單一方式的研究。面部分析方法顯示出視覺方式的潛力，但研究較少。大多數論文專注於二元分類（受損與未受損），較少探討多類或回歸任務。遷移學習和預訓練語言模型已成為流行且有效的技術，特別是對於語言分析。儘管取得了重大進展，但仍存在一些挑戰，包括資料標準化和可及性、模型可解釋性、縱向分析限制和臨床適應性。最後，我們提出了未來的研究方向，例如調查與語言無關的語音分析方法、開發多模式診斷系統，以及解決人工智慧輔助醫療保健中的倫理考量。透過綜合目前的趨勢和找出關鍵障礙，本篇評論旨在引導深度學習為基礎的認知功能障礙檢測系統的進一步發展，以改善早期診斷，並最終改善患者的治療結果。
 
-##### **Boosting Knowledge Graph-based Recommendations through Confidence-Aware Augmentation with Large Language Models**
-2502.03715v1 by Rui Cai, Chao Wang, Qianyi Cai, Dazhong Shen, Hui Xiong
+##### **An Ontology-Enabled Approach For User-Centered and Knowledge-Enabled Explanations of AI Systems**
+2410.17504v1 by Shruthi Chari
 
-Knowledge Graph-based recommendations have gained significant attention due
-to their ability to leverage rich semantic relationships. However, constructing
-and maintaining Knowledge Graphs (KGs) is resource-intensive, and the accuracy
-of KGs can suffer from noisy, outdated, or irrelevant triplets. Recent
-advancements in Large Language Models (LLMs) offer a promising way to improve
-the quality and relevance of KGs for recommendation tasks. Despite this,
-integrating LLMs into KG-based systems presents challenges, such as efficiently
-augmenting KGs, addressing hallucinations, and developing effective joint
-learning methods. In this paper, we propose the Confidence-aware KG-based
-Recommendation Framework with LLM Augmentation (CKG-LLMA), a novel framework
-that combines KGs and LLMs for recommendation task. The framework includes: (1)
-an LLM-based subgraph augmenter for enriching KGs with high-quality
-information, (2) a confidence-aware message propagation mechanism to filter
-noisy triplets, and (3) a dual-view contrastive learning method to integrate
-user-item interactions and KG data. Additionally, we employ a confidence-aware
-explanation generation process to guide LLMs in producing realistic
-explanations for recommendations. Finally, extensive experiments demonstrate
-the effectiveness of CKG-LLMA across multiple public datasets.
+Explainable Artificial Intelligence (AI) focuses on helping humans understand
+the working of AI systems or their decisions and has been a cornerstone of AI
+for decades. Recent research in explainability has focused on explaining the
+workings of AI models or model explainability. There have also been several
+position statements and review papers detailing the needs of end-users for
+user-centered explainability but fewer implementations. Hence, this thesis
+seeks to bridge some gaps between model and user-centered explainability. We
+create an explanation ontology (EO) to represent literature-derived explanation
+types via their supporting components. We implement a knowledge-augmented
+question-answering (QA) pipeline to support contextual explanations in a
+clinical setting. Finally, we are implementing a system to combine explanations
+from different AI methods and data modalities. Within the EO, we can represent
+fifteen different explanation types, and we have tested these representations
+in six exemplar use cases. We find that knowledge augmentations improve the
+performance of base large language models in the contextualized QA, and the
+performance is variable across disease groups. In the same setting, clinicians
+also indicated that they prefer to see actionability as one of the main foci in
+explanations. In our explanations combination method, we plan to use similarity
+metrics to determine the similarity of explanations in a chronic disease
+detection setting. Overall, through this thesis, we design methods that can
+support knowledge-enabled explanations across different use cases, accounting
+for the methods in today's AI era that can generate the supporting components
+of these explanations and domain knowledge sources that can enhance them.
+
+摘要：可解釋人工智慧（AI）專注於協助人類了解 AI 系統運作或其決策，數十年來一直是 AI 的基石。最近的可解釋性研究專注於解釋 AI 模型或模型可解釋性的運作。也有幾份立場聲明和評論論文詳細說明了最終使用者對以使用者為中心的可解釋性的需求，但實作較少。因此，本論文旨在彌補模型和以使用者為中心的可解釋性之間的一些差距。我們建立一個解釋本體（EO）以透過其支援元件來表示從文獻中衍生的解釋類型。我們實作一個知識增強的問答（QA）管線，以在臨床環境中支援情境解釋。最後，我們正在實作一個系統，以結合來自不同 AI 方法和資料模式的解釋。在 EO 中，我們可以表示 15 種不同的解釋類型，並且我們已在六個範例使用案例中測試這些表示。我們發現，知識增強改善了基礎大型語言模型在情境化 QA 中的效能，並且效能因疾病群組而異。在相同的環境中，臨床醫生也表示他們希望將可操作性視為解釋中的主要焦點之一。在我們的解釋組合方法中，我們計畫使用相似性指標來確定慢性病偵測環境中解釋的相似性。總體而言，透過本論文，我們設計了可以在不同使用案例中支援知識啟用解釋的方法，考量到當今 AI 時代中可以產生這些解釋的支援元件和可以增強這些解釋的領域知識來源的方法。
+
+##### **Contrasting Attitudes Towards Current and Future AI Applications for Computerised Interpretation of ECG: A Clinical Stakeholder Interview Study**
+2410.16879v1 by Lukas Hughes-Noehrer, Leda Channer, Gabriel Strain, Gregory Yates, Richard Body, Caroline Jay
+
+Objectives: To investigate clinicians' attitudes towards current automated
+interpretation of ECG and novel AI technologies and their perception of
+computer-assisted interpretation. Materials and Methods: We conducted a series
+of interviews with clinicians in the UK. Our study: (i) explores the potential
+for AI, specifically future 'human-like' computing approaches, to facilitate
+ECG interpretation and support clinical decision making, and (ii) elicits their
+opinions about the importance of explainability and trustworthiness of AI
+algorithms. Results: We performed inductive thematic analysis on interview
+transcriptions from 23 clinicians and identified the following themes: (i) a
+lack of trust in current systems, (ii) positive attitudes towards future AI
+applications and requirements for these, (iii) the relationship between the
+accuracy and explainability of algorithms, and (iv) opinions on education,
+possible deskilling, and the impact of AI on clinical competencies. Discussion:
+Clinicians do not trust current computerised methods, but welcome future 'AI'
+technologies. Where clinicians trust future AI interpretation to be accurate,
+they are less concerned that it is explainable. They also preferred ECG
+interpretation that demonstrated the results of the algorithm visually. Whilst
+clinicians do not fear job losses, they are concerned about deskilling and the
+need to educate the workforce to use AI responsibly. Conclusion: Clinicians are
+positive about the future application of AI in clinical decision-making.
+Accuracy is a key factor of uptake and visualisations are preferred over
+current computerised methods. This is viewed as a potential means of training
+and upskilling, in contrast to the deskilling that automation might be
+perceived to bring.
 
-摘要：基於知識圖譜的推薦因其利用豐富語義關係的能力而備受關注。然而，構建和維護知識圖譜 (KG) 是一項資源密集型任務，而 KG 的準確性可能會受到雜訊、過時或無關的三元組的影響。大型語言模型 (LLM) 的最新進展為提高 KG 在推薦任務中的品質和相關性提供了一種有前途的方法。儘管如此，將 LLM 整合到基於 KG 的系統中會帶來挑戰，例如有效擴充 KG、處理幻覺，以及開發有效的聯合學習方法。在本文中，我們提出具有 LLM 擴充的信心感知型基於 KG 的推薦框架 (CKG-LLMA)，這是一個結合 KG 和 LLM 進行推薦任務的新穎框架。該框架包括：(1) 一個基於 LLM 的子圖擴充器，用於使用高品質資訊豐富 KG，(2) 一個信心感知型訊息傳播機制，用於過濾雜訊三元組，以及 (3) 一個雙視圖對比學習方法，用於整合使用者-項目互動和 KG 資料。此外，我們採用一個信心感知型解釋產生程序，以引導 LLM 為推薦產生逼真的解釋。最後，大量的實驗證明了 CKG-LLMA 在多個公開資料集中的有效性。
+摘要：<paragraph>目的：調查臨床醫生對目前自動化心電圖解讀和新的人工智慧技術的態度，以及他們對電腦輔助解讀的看法。材料和方法：我們對英國的臨床醫生進行了一系列訪談。我們的研究：(i) 探討人工智慧的潛力，特別是未來的「類人類」運算方法，以促進心電圖解讀並支持臨床決策制定，以及 (ii) 徵求他們對人工智慧演算法的可解釋性和可信度的看法。結果：我們對 23 位臨床醫生的訪談記錄進行了歸納主題分析，並找出以下主題：(i) 對目前系統缺乏信任，(ii) 對未來人工智慧應用和對這些應用的要求持正面態度，(iii) 演算法的準確性和可解釋性之間的關係，以及 (iv) 對教育、可能的技能退化，以及人工智慧對臨床能力的影響的看法。討論：臨床醫生不信任目前的電腦化方法，但歡迎未來的「人工智慧」技術。在臨床醫生相信未來的 AI 解讀準確的情況下，他們不太擔心它是否可解釋。他們也比較喜歡能以視覺方式呈現演算法結果的心電圖解讀。雖然臨床醫生不害怕失業，但他們擔心技能退化，以及需要教育員工負責任地使用人工智慧。結論：臨床醫生對人工智慧在臨床決策制定中的未來應用持正面態度。準確性是採用人工智慧的一個關鍵因素，而視覺化比目前的電腦化方法更受青睞。這被視為一種潛在的培訓和提升技能的方法，與自動化可能帶來的技能退化形成對比。</paragraph>
 
-##### **A Schema-Guided Reason-while-Retrieve framework for Reasoning on Scene Graphs with Large-Language-Models (LLMs)**
-2502.03450v1 by Yiye Chen, Harpreet Sawhney, Nicholas Gydé, Yanan Jian, Jack Saunders, Patricio Vela, Ben Lundell
+##### **Pathologist-like explainable AI for interpretable Gleason grading in prostate cancer**
+2410.15012v1 by Gesa Mittmann, Sara Laiouar-Pedari, Hendrik A. Mehrtens, Sarah Haggenmüller, Tabea-Clara Bucher, Tirtha Chanda, Nadine T. Gaisa, Mathias Wagner, Gilbert Georg Klamminger, Tilman T. Rau, Christina Neppl, Eva Maria Compérat, Andreas Gocht, Monika Hämmerle, Niels J. Rupp, Jula Westhoff, Irene Krücken, Maximillian Seidl, Christian M. Schürch, Marcus Bauer, Wiebke Solass, Yu Chun Tam, Florian Weber, Rainer Grobholz, Jaroslaw Augustyniak, Thomas Kalinski, Christian Hörner, Kirsten D. Mertz, Constanze Döring, Andreas Erbersdobler, Gabriele Deubler, Felix Bremmer, Ulrich Sommer, Michael Brodhun, Jon Griffin, Maria Sarah L. Lenon, Kiril Trpkov, Liang Cheng, Fei Chen, Angelique Levi, Guoping Cai, Tri Q. Nguyen, Ali Amin, Alessia Cimadamore, Ahmed Shabaik, Varsha Manucha, Nazeel Ahmad, Nidia Messias, Francesca Sanguedolce, Diana Taheri, Ezra Baraban, Liwei Jia, Rajal B. Shah, Farshid Siadat, Nicole Swarbrick, Kyung Park, Oudai Hassan, Siamak Sakhaie, Michelle R. Downes, Hiroshi Miyamoto, Sean R. Williamson, Tim Holland-Letz, Carolin V. Schneider, Jakob Nikolas Kather, Yuri Tolkach, Titus J. Brinker
 
-Scene graphs have emerged as a structured and serializable environment
-representation for grounded spatial reasoning with Large Language Models
-(LLMs). In this work, we propose SG-RwR, a Schema-Guided Retrieve-while-Reason
-framework for reasoning and planning with scene graphs. Our approach employs
-two cooperative, code-writing LLM agents: a (1) Reasoner for task planning and
-information queries generation, and a (2) Retriever for extracting
-corresponding graph information following the queries. Two agents collaborate
-iteratively, enabling sequential reasoning and adaptive attention to graph
-information. Unlike prior works, both agents are prompted only with the scene
-graph schema rather than the full graph data, which reduces the hallucination
-by limiting input tokens, and drives the Reasoner to generate reasoning trace
-abstractly.Following the trace, the Retriever programmatically query the scene
-graph data based on the schema understanding, allowing dynamic and global
-attention on the graph that enhances alignment between reasoning and retrieval.
-Through experiments in multiple simulation environments, we show that our
-framework surpasses existing LLM-based approaches in numerical Q\&A and
-planning tasks, and can benefit from task-level few-shot examples, even in the
-absence of agent-level demonstrations. Project code will be released.
+The aggressiveness of prostate cancer, the most common cancer in men
+worldwide, is primarily assessed based on histopathological data using the
+Gleason scoring system. While artificial intelligence (AI) has shown promise in
+accurately predicting Gleason scores, these predictions often lack inherent
+explainability, potentially leading to distrust in human-machine interactions.
+To address this issue, we introduce a novel dataset of 1,015 tissue microarray
+core images, annotated by an international group of 54 pathologists. The
+annotations provide detailed localized pattern descriptions for Gleason grading
+in line with international guidelines. Utilizing this dataset, we develop an
+inherently explainable AI system based on a U-Net architecture that provides
+predictions leveraging pathologists' terminology. This approach circumvents
+post-hoc explainability methods while maintaining or exceeding the performance
+of methods trained directly for Gleason pattern segmentation (Dice score: 0.713
+$\pm$ 0.003 trained on explanations vs. 0.691 $\pm$ 0.010 trained on Gleason
+patterns). By employing soft labels during training, we capture the intrinsic
+uncertainty in the data, yielding strong results in Gleason pattern
+segmentation even in the context of high interobserver variability. With the
+release of this dataset, we aim to encourage further research into segmentation
+in medical tasks with high levels of subjectivity and to advance the
+understanding of pathologists' reasoning processes.
 
-摘要：場景圖表已成為大型語言模型 (LLM) 以基礎空間推理為基礎的結構化且可序列化的環境表徵。在這項工作中，我們提出 SG-RwR，一個以綱要為導向的檢索與推理框架，用於場景圖表的推理和規劃。我們的做法採用了兩個協作的、編寫程式碼的 LLM 代理：一個 (1) 推論器，用於任務規劃和資訊查詢產生，以及一個 (2) 檢索器，用於根據查詢提取對應的圖形資訊。兩個代理反覆合作，實現對圖形資訊的順序推理和適應性關注。與先前的作品不同，兩個代理僅提示場景圖表綱要，而不是完整的圖形資料，這透過限制輸入代碼減少了幻覺，並驅使推論器抽象地產生推理軌跡。根據軌跡，檢索器根據綱要理解以程式化方式查詢場景圖形資料，允許對圖形進行動態和整體關注，增強推理和檢索之間的一致性。透過在多個模擬環境中的實驗，我們表明我們的框架在數值問答和規劃任務中超越了現有的基於 LLM 的方法，並且可以受益於任務級別的少次範例，即使在沒有代理級別示範的情況下也是如此。專案程式碼將會釋出。
+摘要：前列腺癌是全球男性最常見的癌症，其惡性程度主要根據 Gleason 評分系統使用組織病理學數據進行評估。雖然人工智慧 (AI) 在準確預測 Gleason 評分方面已展現潛力，但這些預測通常缺乏內在的可解釋性，可能會導致對人機互動的不信任。為了解決這個問題，我們引進了一個由 54 位病理學家組成的國際團隊註解的 1,015 個組織微陣列核心影像的新穎資料集。這些註解提供了詳細的局部模式描述，用於符合國際準則的 Gleason 分級。利用這個資料集，我們開發了一個基於 U-Net 架構的內在可解釋 AI 系統，該系統提供了利用病理學家術語進行預測。這種方法規避了事後可解釋性方法，同時維持或超越了直接訓練用於 Gleason 模式分割的方法的效能（Dice 分數：0.713 ± 0.003，訓練於解釋，相對於 0.691 ± 0.010，訓練於 Gleason 模式）。透過在訓練期間採用軟標籤，我們捕捉了資料中的內在不確定性，即使在觀察者間變異性高的情況下，也能在 Gleason 模式分割中產生強大的結果。透過釋出這個資料集，我們旨在鼓勵進一步研究主觀性高的醫療任務中的分割，並增進對病理學家推理過程的理解。
 
-##### **SymAgent: A Neural-Symbolic Self-Learning Agent Framework for Complex Reasoning over Knowledge Graphs**
-2502.03283v1 by Ben Liu, Jihai Zhang, Fangquan Lin, Cheng Yang, Min Peng, Wotao Yin
+##### **Explainable AI Methods for Multi-Omics Analysis: A Survey**
+2410.11910v1 by Ahmad Hussein, Mukesh Prasad, Ali Braytee
 
-Recent advancements have highlighted that Large Language Models (LLMs) are
-prone to hallucinations when solving complex reasoning problems, leading to
-erroneous results. To tackle this issue, researchers incorporate Knowledge
-Graphs (KGs) to improve the reasoning ability of LLMs. However, existing
-methods face two limitations: 1) they typically assume that all answers to the
-questions are contained in KGs, neglecting the incompleteness issue of KGs, and
-2) they treat the KG as a static repository and overlook the implicit logical
-reasoning structures inherent in KGs. In this paper, we introduce SymAgent, an
-innovative neural-symbolic agent framework that achieves collaborative
-augmentation between KGs and LLMs. We conceptualize KGs as dynamic environments
-and transform complex reasoning tasks into a multi-step interactive process,
-enabling KGs to participate deeply in the reasoning process. SymAgent consists
-of two modules: Agent-Planner and Agent-Executor. The Agent-Planner leverages
-LLM's inductive reasoning capability to extract symbolic rules from KGs,
-guiding efficient question decomposition. The Agent-Executor autonomously
-invokes predefined action tools to integrate information from KGs and external
-documents, addressing the issues of KG incompleteness. Furthermore, we design a
-self-learning framework comprising online exploration and offline iterative
-policy updating phases, enabling the agent to automatically synthesize
-reasoning trajectories and improve performance. Experimental results
-demonstrate that SymAgent with weak LLM backbones (i.e., 7B series) yields
-better or comparable performance compared to various strong baselines. Further
-analysis reveals that our agent can identify missing triples, facilitating
-automatic KG updates.
+Advancements in high-throughput technologies have led to a shift from
+traditional hypothesis-driven methodologies to data-driven approaches.
+Multi-omics refers to the integrative analysis of data derived from multiple
+'omes', such as genomics, proteomics, transcriptomics, metabolomics, and
+microbiomics. This approach enables a comprehensive understanding of biological
+systems by capturing different layers of biological information. Deep learning
+methods are increasingly utilized to integrate multi-omics data, offering
+insights into molecular interactions and enhancing research into complex
+diseases. However, these models, with their numerous interconnected layers and
+nonlinear relationships, often function as black boxes, lacking transparency in
+decision-making processes. To overcome this challenge, explainable artificial
+intelligence (xAI) methods are crucial for creating transparent models that
+allow clinicians to interpret and work with complex data more effectively. This
+review explores how xAI can improve the interpretability of deep learning
+models in multi-omics research, highlighting its potential to provide
+clinicians with clear insights, thereby facilitating the effective application
+of such models in clinical settings.
 
-摘要：<paragraph>最近的研究表明，大型语言模型 (LLM) 在解决复杂的推理问题时容易出现幻觉，从而导致错误的结果。为了解决这个问题，研究人员结合了知识图谱 (KG) 来提高 LLM 的推理能力。然而，现有方法面临两个局限性：1) 它们通常假设问题的答案都包含在 KG 中，忽略了 KG 不完整的问题，2) 它们将 KG 视为一个静态存储库，而忽略了 KG 中固有的隐式逻辑推理结构。在本文中，我们介绍了 SymAgent，这是一个创新的神经符号代理框架，可以在 KG 和 LLM 之间实现协作增强。我们将 KG 概念化为动态环境，并将复杂的推理任务转化为一个多步骤的交互过程，使 KG 能够深入参与推理过程。SymAgent 由两个模块组成：Agent-Planner 和 Agent-Executor。Agent-Planner 利用 LLM 的归纳推理能力从 KG 中提取符号规则，指导高效的问题分解。Agent-Executor 自主调用预定义的动作工具来整合来自 KG 和外部文档的信息，解决 KG 不完整的问题。此外，我们设计了一个自学习框架，包括在线探索和离线迭代策略更新阶段，使代理能够自动合成推理轨迹并提高性能。实验结果表明，具有弱 LLM 主干的 SymAgent（即 7B 系列）与各种强大的基线相比，产生了更好或相当的性能。进一步的分析表明，我们的代理可以识别缺失的三元组，促进自动 KG 更新。</paragraph>
+摘要：高通量技術的進步導致從傳統的假設驅動方法轉變為資料驅動的方法。多組學是指整合分析來自多個「組學」的資料，例如基因組學、蛋白質組學、轉錄組學、代謝組學和微生物組學。此方法透過擷取生物資訊的不同層面，能全面了解生物系統。深度學習方法愈來愈常被用於整合多組學資料，提供分子交互作用的洞察力，並加強對複雜疾病的研究。然而，這些模型具有許多相互連接的層級和非線性關係，通常會像黑盒子一樣運作，缺乏決策過程的透明度。為了克服此挑戰，可解釋人工智慧 (xAI) 方法對於建立透明模型至關重要，讓臨床醫生可以更有效地解釋和處理複雜資料。此評論探討 xAI 如何能改善多組學研究中深度學習模型的可解釋性，強調其提供臨床醫生明確見解的潛力，進而促進此類模型在臨床環境中的有效應用。
 
-##### **Analyze Feature Flow to Enhance Interpretation and Steering in Language Models**
-2502.03032v2 by Daniil Laptev, Nikita Balagansky, Yaroslav Aksenov, Daniil Gavrilov
+##### **Study on the Helpfulness of Explainable Artificial Intelligence**
+2410.11896v1 by Tobias Labarta, Elizaveta Kulicheva, Ronja Froelian, Christian Geißler, Xenia Melman, Julian von Klitzing
 
-We introduce a new approach to systematically map features discovered by
-sparse autoencoder across consecutive layers of large language models,
-extending earlier work that examined inter-layer feature links. By using a
-data-free cosine similarity technique, we trace how specific features persist,
-transform, or first appear at each stage. This method yields granular flow
-graphs of feature evolution, enabling fine-grained interpretability and
-mechanistic insights into model computations. Crucially, we demonstrate how
-these cross-layer feature maps facilitate direct steering of model behavior by
-amplifying or suppressing chosen features, achieving targeted thematic control
-in text generation. Together, our findings highlight the utility of a causal,
-cross-layer interpretability framework that not only clarifies how features
-develop through forward passes but also provides new means for transparent
-manipulation of large language models.
+Explainable Artificial Intelligence (XAI) is essential for building advanced
+machine learning-powered applications, especially in critical domains such as
+medical diagnostics or autonomous driving. Legal, business, and ethical
+requirements motivate using effective XAI, but the increasing number of
+different methods makes it challenging to pick the right ones. Further, as
+explanations are highly context-dependent, measuring the effectiveness of XAI
+methods without users can only reveal a limited amount of information,
+excluding human factors such as the ability to understand it. We propose to
+evaluate XAI methods via the user's ability to successfully perform a proxy
+task, designed such that a good performance is an indicator for the explanation
+to provide helpful information. In other words, we address the helpfulness of
+XAI for human decision-making. Further, a user study on state-of-the-art
+methods was conducted, showing differences in their ability to generate trust
+and skepticism and the ability to judge the rightfulness of an AI decision
+correctly. Based on the results, we highly recommend using and extending this
+approach for more objective-based human-centered user studies to measure XAI
+performance in an end-to-end fashion.
 
-摘要：我們提出了一種新方法，用於系統性地繪製大型語言模型連續層中稀疏自動編碼器發現的功能，擴展了先前研究層間特徵連結的工作。透過使用無資料餘弦相似性技術，我們追蹤特定特徵在每個階段如何持續、轉換或首次出現。此方法產生了特徵演化的細粒度流程圖，實現了細粒度的可解釋性和對模型運算的機制見解。至關重要的是，我們展示了這些跨層特徵圖如何透過放大或抑制所選特徵來促進模型行為的直接引導，在文字生成中實現目標主題控制。我們的研究結果共同突出了因果、跨層可解釋性框架的效用，不僅闡明了特徵如何透過前向傳遞發展，還提供了新的方法來透明地操作大型語言模型。
+摘要：可解釋人工智慧 (XAI) 對於建構先進的機器學習驅動應用程式至關重要，特別是在醫療診斷或自動駕駛等關鍵領域。法律、商業和倫理要求促使使用有效的 XAI，但數量日益增加的不同方法使得挑選正確的方法具有挑戰性。此外，由於解釋高度依賴於背景，在沒有使用者的情況下衡量 XAI 方法的有效性只能揭示有限的資訊，排除人類因素，例如理解它的能力。我們建議透過使用者成功執行代理任務的能力來評估 XAI 方法，設計使得良好的執行表現是解釋提供有用資訊的指標。換句話說，我們探討 XAI 對人類決策制定的幫助。此外，對最先進的方法進行使用者研究，顯示出它們在產生信任和懷疑的能力以及正確判斷 AI 決策是否正確的能力方面存在差異。根據結果，我們強烈建議使用和擴充這種方法，以進行更多以目標為基礎的人為中心使用者研究，以終端到終端的方式衡量 XAI 效能。
 
-##### **A Benchmark for the Detection of Metalinguistic Disagreements between LLMs and Knowledge Graphs**
-2502.02896v1 by Bradley P. Allen, Paul T. Groth
+##### **Use of What-if Scenarios to Help Explain Artificial Intelligence Models for Neonatal Health**
+2410.09635v1 by Abdullah Mamun, Lawrence D. Devoe, Mark I. Evans, David W. Britt, Judith Klein-Seetharaman, Hassan Ghasemzadeh
 
-Evaluating large language models (LLMs) for tasks like fact extraction in
-support of knowledge graph construction frequently involves computing accuracy
-metrics using a ground truth benchmark based on a knowledge graph (KG). These
-evaluations assume that errors represent factual disagreements. However, human
-discourse frequently features metalinguistic disagreement, where agents differ
-not on facts but on the meaning of the language used to express them. Given the
-complexity of natural language processing and generation using LLMs, we ask: do
-metalinguistic disagreements occur between LLMs and KGs? Based on an
-investigation using the T-REx knowledge alignment dataset, we hypothesize that
-metalinguistic disagreement does in fact occur between LLMs and KGs, with
-potential relevance for the practice of knowledge graph engineering. We propose
-a benchmark for evaluating the detection of factual and metalinguistic
-disagreements between LLMs and KGs. An initial proof of concept of such a
-benchmark is available on Github.
+Early detection of intrapartum risk enables interventions to potentially
+prevent or mitigate adverse labor outcomes such as cerebral palsy. Currently,
+there is no accurate automated system to predict such events to assist with
+clinical decision-making. To fill this gap, we propose "Artificial Intelligence
+(AI) for Modeling and Explaining Neonatal Health" (AIMEN), a deep learning
+framework that not only predicts adverse labor outcomes from maternal, fetal,
+obstetrical, and intrapartum risk factors but also provides the model's
+reasoning behind the predictions made. The latter can provide insights into
+what modifications in the input variables of the model could have changed the
+predicted outcome. We address the challenges of imbalance and small datasets by
+synthesizing additional training data using Adaptive Synthetic Sampling
+(ADASYN) and Conditional Tabular Generative Adversarial Networks (CTGAN). AIMEN
+uses an ensemble of fully-connected neural networks as the backbone for its
+classification with the data augmentation supported by either ADASYN or CTGAN.
+AIMEN, supported by CTGAN, outperforms AIMEN supported by ADASYN in
+classification. AIMEN can predict a high risk for adverse labor outcomes with
+an average F1 score of 0.784. It also provides counterfactual explanations that
+can be achieved by changing 2 to 3 attributes on average. Resources available:
+https://github.com/ab9mamun/AIMEN.
 
-摘要：評估大型語言模型 (LLM) 執行知識圖譜建構支援事實萃取等任務時，通常會使用基於知識圖譜 (KG) 的基準事實計算準確度指標。這些評估假設錯誤代表事實上的分歧。然而，人類話語經常出現元語言分歧，其中代理人之間的差異不在於事實，而在於用於表達事實的語言的含義。鑑於使用 LLM 處理和產生自然語言的複雜性，我們提出疑問：LLM 和 KG 之間是否會發生元語言分歧？根據使用 T-REx 知識比對資料集進行的調查，我們假設元語言分歧確實會發生在 LLM 和 KG 之間，並可能與知識圖譜工程實務有關。我們提出一個基準，用於評估 LLM 和 KG 之間的事實和元語言分歧的偵測。此基準的初步概念驗證可在 Github 上取得。
+摘要：產程中風險的早期偵測有助於進行干預措施，以預防或減輕不利的生產結果，例如腦性麻痺。目前，沒有準確的自動化系統可以預測此類事件，以協助臨床決策。為了填補這一空白，我們提出「用於建模和解釋新生兒健康的人工智慧」(AIMEN)，這是一個深度學習架構，它不僅可以根據孕產婦、胎兒、產科和產程風險因素預測不利的生產結果，還能提供模型做出預測背後的原因。後者可以提供見解，說明模型輸入變數中的哪些修改可能會改變預測結果。我們透過使用適應性合成抽樣 (ADASYN) 和條件表格生成對抗網路 (CTGAN) 來合成額外的訓練資料，以解決不平衡和小型資料集的挑戰。AIMEN 使用全連接神經網路的集合作為其分類的骨幹，並透過 ADASYN 或 CTGAN 支援資料擴充。由 CTGAN 支援的 AIMEN 在分類方面優於由 ADASYN 支援的 AIMEN。AIMEN 可以預測不利的生產結果的高風險，平均 F1 分數為 0.784。它還提供反事實解釋，可透過平均變更 2 至 3 個屬性來達成。可用資源：https://github.com/ab9mamun/AIMEN。
 
-##### **Mol-LLM: Generalist Molecular LLM with Improved Graph Utilization**
-2502.02810v1 by Chanhui Lee, Yuheon Song, YongJun Jeong, Hanbum Ko, Rodrigo Hormazabal, Sehui Han, Kyunghoon Bae, Sungbin Lim, Sungwoong Kim
+##### **Artificial intelligence techniques in inherited retinal diseases: A review**
+2410.09105v1 by Han Trinh, Jordan Vice, Jason Charng, Zahra Tajbakhsh, Khyber Alam, Fred K. Chen, Ajmal Mian
 
-Recent advances in Large Language Models (LLMs) have motivated the
-development of general LLMs for molecular tasks. While several studies have
-demonstrated that fine-tuned LLMs can achieve impressive benchmark
-performances, they are far from genuine generalist molecular LLMs due to a lack
-of fundamental understanding of molecular structure. Specifically, when given
-molecular task instructions, LLMs trained with naive next-token prediction
-training assign similar likelihood scores to both original and negatively
-corrupted molecules, revealing their lack of molecular structure understanding
-that is crucial for reliable and general molecular LLMs. To overcome this
-limitation and obtain a true generalist molecular LLM, we introduce a novel
-multi-modal training method based on a thorough multi-modal instruction tuning
-as well as a molecular structure preference optimization between chosen and
-rejected graphs. On various molecular benchmarks, the proposed generalist
-molecular LLM, called Mol-LLM, achieves state-of-the-art performances among
-generalist LLMs on most tasks, at the same time, surpassing or comparable to
-state-of-the-art specialist LLMs. Moreover, Mol-LLM also shows superior
-generalization performances in reaction prediction tasks, demonstrating the
-effect of the molecular structure understanding for generalization perspective.
+Inherited retinal diseases (IRDs) are a diverse group of genetic disorders
+that lead to progressive vision loss and are a major cause of blindness in
+working-age adults. The complexity and heterogeneity of IRDs pose significant
+challenges in diagnosis, prognosis, and management. Recent advancements in
+artificial intelligence (AI) offer promising solutions to these challenges.
+However, the rapid development of AI techniques and their varied applications
+have led to fragmented knowledge in this field. This review consolidates
+existing studies, identifies gaps, and provides an overview of AI's potential
+in diagnosing and managing IRDs. It aims to structure pathways for advancing
+clinical applications by exploring AI techniques like machine learning and deep
+learning, particularly in disease detection, progression prediction, and
+personalized treatment planning. Special focus is placed on the effectiveness
+of convolutional neural networks in these areas. Additionally, the integration
+of explainable AI is discussed, emphasizing its importance in clinical settings
+to improve transparency and trust in AI-based systems. The review addresses the
+need to bridge existing gaps in focused studies on AI's role in IRDs, offering
+a structured analysis of current AI techniques and outlining future research
+directions. It concludes with an overview of the challenges and opportunities
+in deploying AI for IRDs, highlighting the need for interdisciplinary
+collaboration and the continuous development of robust, interpretable AI models
+to advance clinical applications.
 
-摘要：大型語言模型 (LLM) 的近期進展激勵了針對分子任務開發通用 LLM。雖然多項研究已證明微調 LLM 可實現令人印象深刻的基準效能，但由於缺乏對分子結構的基本理解，它們遠非真正的通才分子 LLM。具體來說，當給予分子任務說明時，使用天真的下一個符號預測訓練訓練的 LLM 會將類似的可能性評分分配給原始分子和負面損壞分子，這顯示出它們缺乏對分子結構的理解，而這對於可靠且通用的分子 LLM 至關重要。為了克服這個限制並獲得真正的通才分子 LLM，我們引入了一種新穎的多模態訓練方法，該方法基於徹底的多模態說明調整以及在所選和拒絕圖形之間的分子結構偏好最佳化。在各種分子基準測試中，所提出的通才分子 LLM（稱為 Mol-LLM）在多數任務中實現了通才 LLM 中的最新效能，同時超越或與最新的專家 LLM 相當。此外，Mol-LLM 在反應預測任務中也展現出優異的泛化效能，證明了分子結構理解對泛化觀點的影響。
+摘要：遺傳性視網膜疾病 (IRD) 是一組多樣化的遺傳疾病，
+會導致視力逐漸喪失，是工作年齡成人失明的主要原因。IRD 的複雜性和異質性對診斷、預後和管理提出了重大挑戰。最近人工智能 (AI) 的進步為這些挑戰提供了有希望的解決方案。
+然而，AI 技術的快速發展及其多種應用導致了該領域的知識分散。本綜述整合了現有研究，找出差距，並概述了 AI 在診斷和管理 IRD 中的潛力。它旨在通過探索機器學習和深度學習等 AI 技術，特別是在疾病檢測、進程預測和個性化治療計劃中，為推進臨床應用構建途徑。特別關注這些領域中卷積神經網路的有效性。此外，討論了可解釋 AI 的整合，強調了其在臨床環境中提高透明度和對基於 AI 的系統的信任的重要性。該綜述解決了彌合 AI 在 IRD 中作用的重點研究中現有差距的必要性，提供了對當前 AI 技術的結構化分析，並概述了未來的研究方向。最後概述了在 IRD 中部署 AI 的挑戰和機遇，強調了跨學科合作和持續開發強大、可解釋的 AI 模型以推進臨床應用的必要性。
 
-##### **Leveraging the true depth of LLMs**
-2502.02790v1 by Ramón Calvo González, Daniele Paliotta, Matteo Pagliardini, Martin Jaggi, François Fleuret
+##### **CasiMedicos-Arg: A Medical Question Answering Dataset Annotated with Explanatory Argumentative Structures**
+2410.05235v2 by Ekaterina Sviridova, Anar Yeginbergen, Ainara Estarrona, Elena Cabrio, Serena Villata, Rodrigo Agerri
 
-Large Language Models demonstrate remarkable capabilities at the cost of high
-compute requirements. While recent research has shown that intermediate layers
-can be removed or have their order shuffled without impacting performance
-significantly, these findings have not been employed to reduce the
-computational cost of inference. We investigate several potential ways to
-reduce the depth of pre-trained LLMs without significantly affecting
-performance. Leveraging our insights, we present a novel approach that exploits
-this decoupling between layers by grouping some of them into pairs that can be
-evaluated in parallel.
-  This modification of the computational graph -- through better parallelism --
-results in an average improvement of around 1.20x on the number of tokens
-generated per second, without re-training nor fine-tuning, while retaining
-95%-99% of the original accuracy. Empirical evaluation demonstrates that this
-approach significantly improves serving efficiency while maintaining model
-performance, offering a practical improvement for large-scale LLM deployment.
+Explaining Artificial Intelligence (AI) decisions is a major challenge
+nowadays in AI, in particular when applied to sensitive scenarios like medicine
+and law. However, the need to explain the rationale behind decisions is a main
+issue also for human-based deliberation as it is important to justify
+\textit{why} a certain decision has been taken. Resident medical doctors for
+instance are required not only to provide a (possibly correct) diagnosis, but
+also to explain how they reached a certain conclusion. Developing new tools to
+aid residents to train their explanation skills is therefore a central
+objective of AI in education. In this paper, we follow this direction, and we
+present, to the best of our knowledge, the first multilingual dataset for
+Medical Question Answering where correct and incorrect diagnoses for a clinical
+case are enriched with a natural language explanation written by doctors. These
+explanations have been manually annotated with argument components (i.e.,
+premise, claim) and argument relations (i.e., attack, support), resulting in
+the Multilingual CasiMedicos-Arg dataset which consists of 558 clinical cases
+in four languages (English, Spanish, French, Italian) with explanations, where
+we annotated 5021 claims, 2313 premises, 2431 support relations, and 1106
+attack relations. We conclude by showing how competitive baselines perform over
+this challenging dataset for the argument mining task.
 
-摘要：大型语言模型展示了其强大的功能，但代价是较高的计算需求。虽然最近的研究表明，中间层可以被移除或重新排列其顺序，而不会显著影响性能，但这些发现尚未被用来降低推理的计算成本。我们研究了几种潜在的方法来减少预训练 LLM 的深度，而不会显著影响性能。利用我们的见解，我们提出了一种新颖的方法，该方法通过将其中一些分组为可以并行评估的成对来利用层之间的这种解耦。
-通过更好的并行性对计算图进行修改，平均而言，每秒生成的令牌数量提高了约 1.20 倍，而无需重新训练或微调，同时保留了 95%-99% 的原始准确性。经验评估表明，这种方法显著提高了服务效率，同时保持了模型性能，为大规模 LLM 部署提供了实际改进。
+摘要：解釋人工智慧 (AI) 的決策是現在 AI 的一項重大挑戰，特別是應用於像醫學和法律等敏感情境時。然而，解釋決策背後理由的需求也是基於人類的考量的一個主要問題，因為有必要證明為什麼做出某個決策。例如，住院醫師不僅需要提供（可能是正確的）診斷，還需要解釋他們如何達成某個結論。因此，開發新的工具來幫助住院醫師訓練他們的解釋技巧是教育中 AI 的一項核心目標。在本文中，我們遵循這個方向，並且根據我們的了解，提出第一個多語言醫學問答資料集，其中臨床病例的正確和不正確診斷都附有由醫生撰寫的自然語言解釋。這些解釋已使用論證組成（即前提、主張）和論證關係（即攻擊、支持）進行手動註解，產生多語言 CasiMedicos-Arg 資料集，其中包含 558 個具有解釋的四種語言（英語、西班牙語、法語、義大利語）的臨床病例，我們註解了 5021 個主張、2313 個前提、2431 個支持關係和 1106 個攻擊關係。我們最後展示了競爭基準如何針對論證探勘任務執行此具挑戰性的資料集。
 
-##### **Modular Training of Neural Networks aids Interpretability**
-2502.02470v2 by Satvik Golechha, Maheep Chaudhary, Joan Velja, Alessandro Abate, Nandi Schoots
+##### **Explainable Diagnosis Prediction through Neuro-Symbolic Integration**
+2410.01855v2 by Qiuhao Lu, Rui Li, Elham Sagheb, Andrew Wen, Jinlian Wang, Liwei Wang, Jungwei W. Fan, Hongfang Liu
 
-An approach to improve neural network interpretability is via clusterability,
-i.e., splitting a model into disjoint clusters that can be studied
-independently. We define a measure for clusterability and show that pre-trained
-models form highly enmeshed clusters via spectral graph clustering. We thus
-train models to be more modular using a "clusterability loss" function that
-encourages the formation of non-interacting clusters. Using automated
-interpretability techniques, we show that our method can help train models that
-are more modular and learn different, disjoint, and smaller circuits. We
-investigate CNNs trained on MNIST and CIFAR, small transformers trained on
-modular addition, and language models. Our approach provides a promising
-direction for training neural networks that learn simpler functions and are
-easier to interpret.
+Diagnosis prediction is a critical task in healthcare, where timely and
+accurate identification of medical conditions can significantly impact patient
+outcomes. Traditional machine learning and deep learning models have achieved
+notable success in this domain but often lack interpretability which is a
+crucial requirement in clinical settings. In this study, we explore the use of
+neuro-symbolic methods, specifically Logical Neural Networks (LNNs), to develop
+explainable models for diagnosis prediction. Essentially, we design and
+implement LNN-based models that integrate domain-specific knowledge through
+logical rules with learnable thresholds. Our models, particularly
+$M_{\text{multi-pathway}}$ and $M_{\text{comprehensive}}$, demonstrate superior
+performance over traditional models such as Logistic Regression, SVM, and
+Random Forest, achieving higher accuracy (up to 80.52\%) and AUROC scores (up
+to 0.8457) in the case study of diabetes prediction. The learned weights and
+thresholds within the LNN models provide direct insights into feature
+contributions, enhancing interpretability without compromising predictive
+power. These findings highlight the potential of neuro-symbolic approaches in
+bridging the gap between accuracy and explainability in healthcare AI
+applications. By offering transparent and adaptable diagnostic models, our work
+contributes to the advancement of precision medicine and supports the
+development of equitable healthcare solutions. Future research will focus on
+extending these methods to larger and more diverse datasets to further validate
+their applicability across different medical conditions and populations.
 
-摘要：一種改善神經網路可解釋性的方法是透過群集性，
-也就是將模型分割成可獨立研究的不相交群集。我們定義一個群集性的度量，並顯示預訓練的
-模型透過光譜圖形群集形成高度糾纏的群集。因此，我們使用「群集性損失」函數訓練模型，使其更具模組化，
-這鼓勵形成非交互群集。使用自動化可解釋性技術，我們顯示我們的模型可以幫助訓練更具模組化的模型，並學習不同、不相交且較小的電路。我們
-研究了在 MNIST 和 CIFAR 上訓練的 CNN，在模組化加法上訓練的小型Transformer，以及語言模型。我們的做法為訓練學習更簡單函數且更容易解釋的神經網路提供了有希望的方向。
+摘要：診斷預測是醫療保健中的關鍵任務，及時且準確地識別醫療狀況會顯著影響患者的結果。傳統的機器學習和深度學習模型已在這個領域取得顯著成功，但通常缺乏可解釋性，這在臨床環境中是一項關鍵要求。在本研究中，我們探討了神經符號方法的應用，特別是邏輯神經網路 (LNN)，以開發用於診斷預測的可解釋模型。基本上，我們設計並實作了基於 LNN 的模型，這些模型透過具有可學習閾值的邏輯規則整合領域特定知識。我們的模型，特別是 $M_{\text{multi-pathway}}$ 和 $M_{\text{comprehensive}}$，表現出優於傳統模型（例如邏輯迴歸、SVM 和隨機森林）的優異效能，在糖尿病預測的案例研究中達到了更高的準確度（高達 80.52%）和 AUROC 分數（高達 0.8457）。LNN 模型中學習到的權重和閾值提供了對特徵貢獻的直接見解，增強了可解釋性，同時不影響預測能力。這些發現突顯了神經符號方法在彌合醫療保健 AI 應用中準確性和可解釋性差距方面的潛力。透過提供透明且適應性強的診斷模型，我們的研究有助於推進精準醫療，並支援公平醫療保健解決方案的開發。未來的研究將專注於將這些方法擴展到更大且更多樣化的資料集，以進一步驗證其在不同醫療狀況和人群中的適用性。
 
-##### **Premise-Augmented Reasoning Chains Improve Error Identification in Math reasoning with LLMs**
-2502.02362v3 by Sagnik Mukherjee, Abhinav Chinta, Takyoung Kim, Tarun Anoop Sharma, Dilek Hakkani-Tür
+##### **Easydiagnos: a framework for accurate feature selection for automatic diagnosis in smart healthcare**
+2410.00366v1 by Prasenjit Maji, Amit Kumar Mondal, Hemanta Kumar Mondal, Saraju P. Mohanty
 
-Chain-of-Thought (CoT) prompting enhances mathematical reasoning in large
-language models (LLMs) by enabling detailed step-by-step solutions. However,
-due to the verbosity of LLMs, the resulting reasoning chains can be long,
-making it harder to verify the reasoning steps and trace issues resulting from
-dependencies between the steps that may be farther away in the sequence of
-steps. Importantly, mathematical reasoning allows each step to be derived from
-a small set of premises, which are a subset of the preceding steps in the
-reasoning chain. In this paper, we present a framework that identifies the
-premises for each step, to improve the evaluation of reasoning. We restructure
-conventional linear reasoning chains into Premise Augmented Reasoning Chains
-(PARC) by introducing premise links, resulting in a directed acyclic graph
-where the nodes are the steps and the edges are the premise links. Through
-experiments with a PARC-based dataset that we built, namely PERL (Premises and
-ERrors identification in LLMs), we demonstrate that LLMs can reliably identify
-premises within complex reasoning chains. In particular, even open-source LLMs
-achieve 90% recall in premise identification. We also show that PARC helps to
-identify errors in reasoning chains more reliably. The accuracy of error
-identification improves by 6% to 16% absolute when step-by-step verification is
-carried out in PARC under the premises. Our findings highlight the utility of
-premise-centric representations in addressing complex problem-solving tasks and
-open new avenues for improving the reliability of LLM-based reasoning
-evaluations.
+The rapid advancements in artificial intelligence (AI) have revolutionized
+smart healthcare, driving innovations in wearable technologies, continuous
+monitoring devices, and intelligent diagnostic systems. However, security,
+explainability, robustness, and performance optimization challenges remain
+critical barriers to widespread adoption in clinical environments. This
+research presents an innovative algorithmic method using the Adaptive Feature
+Evaluator (AFE) algorithm to improve feature selection in healthcare datasets
+and overcome problems. AFE integrating Genetic Algorithms (GA), Explainable
+Artificial Intelligence (XAI), and Permutation Combination Techniques (PCT),
+the algorithm optimizes Clinical Decision Support Systems (CDSS), thereby
+enhancing predictive accuracy and interpretability. The proposed method is
+validated across three diverse healthcare datasets using six distinct machine
+learning algorithms, demonstrating its robustness and superiority over
+conventional feature selection techniques. The results underscore the
+transformative potential of AFE in smart healthcare, enabling personalized and
+transparent patient care. Notably, the AFE algorithm, when combined with a
+Multi-layer Perceptron (MLP), achieved an accuracy of up to 98.5%, highlighting
+its capability to improve clinical decision-making processes in real-world
+healthcare applications.
 
-摘要：<paragraph>思考鏈（CoT）提示透過提供詳細的逐步解法，增強大型語言模型（LLM）的數學推理能力。然而，由於 LLM 的冗長，產生的推理鏈可能很長，這使得驗證推理步驟和追蹤由步驟之間相依關係所產生的問題變得更加困難，而這些步驟可能在步驟順序中相距較遠。重要的是，數學推理允許每個步驟從一組小的前提中推導出來，這些前提是推理鏈中前一個步驟的子集。在本文中，我們提出了一個框架，用於識別每個步驟的前提，以改進推理評估。我們透過引入前提連結，將傳統的線性推理鏈重組為前提擴充推理鏈（PARC），產生一個有向無環圖，其中節點是步驟，而邊緣是前提連結。透過我們建立的基於 PARC 的資料集（即 PERL（LLM 中的前提和錯誤識別））進行的實驗，我們證明 LLM 能夠在複雜的推理鏈中可靠地識別前提。特別是，即使是開源 LLM 在前提識別中也能達到 90% 的召回率。我們還表明，PARC 有助於更可靠地識別推理鏈中的錯誤。在前提下於 PARC 中執行逐步驗證時，錯誤識別的準確度提高了 6% 到 16%。我們的研究結果突顯了以前提為中心的表示在解決複雜問題解決任務中的效用，並為改進基於 LLM 的推理評估的可靠性開闢了新途徑。</paragraph>
+摘要：人工智慧 (AI) 的快速進展徹底改變了智慧醫療保健，推動了可穿戴技術、持續監控裝置和智慧診斷系統的創新。然而，安全性、可解釋性、穩健性和效能最佳化挑戰仍然是臨床環境中廣泛採用的關鍵障礙。本研究提出一個創新的演算法方法，使用自適應特徵評估器 (AFE) 演算法來改善醫療保健資料集中的特徵選取並克服問題。AFE 整合了遺傳演算法 (GA)、可解釋人工智慧 (XAI) 和排列組合技術 (PCT)，該演算法最佳化了臨床決策支援系統 (CDSS)，從而提高了預測準確性和可解釋性。所提出的方法使用六種不同的機器學習演算法驗證了三個不同的醫療保健資料集，證明了其穩健性和優於傳統特徵選取技術。結果強調了 AFE 在智慧醫療保健中的轉變潛力，實現了個人化和透明的患者照護。值得注意的是，AFE 演算法與多層感知器 (MLP) 結合使用時，準確度高達 98.5%，突顯了其改善實際醫療保健應用中臨床決策制定流程的能力。
 
-##### **AdaptBot: Combining LLM with Knowledge Graphs and Human Input for Generic-to-Specific Task Decomposition and Knowledge Refinement**
-2502.02067v1 by Shivam Singh, Karthik Swaminathan, Nabanita Dash, Ramandeep Singh, Snehasis Banerjee, Mohan Sridharan, Madhava Krishna
+##### **Dermatologist-like explainable AI enhances melanoma diagnosis accuracy: eye-tracking study**
+2409.13476v1 by Tirtha Chanda, Sarah Haggenmueller, Tabea-Clara Bucher, Tim Holland-Letz, Harald Kittler, Philipp Tschandl, Markus V. Heppt, Carola Berking, Jochen S. Utikal, Bastian Schilling, Claudia Buerger, Cristian Navarrete-Dechent, Matthias Goebeler, Jakob Nikolas Kather, Carolin V. Schneider, Benjamin Durani, Hendrike Durani, Martin Jansen, Juliane Wacker, Joerg Wacker, Reader Study Consortium, Titus J. Brinker
 
-Embodied agents assisting humans are often asked to complete a new task in a
-new scenario. An agent preparing a particular dish in the kitchen based on a
-known recipe may be asked to prepare a new dish or to perform cleaning tasks in
-the storeroom. There may not be sufficient resources, e.g., time or labeled
-examples, to train the agent for these new situations. Large Language Models
-(LLMs) trained on considerable knowledge across many domains are able to
-predict a sequence of abstract actions for such new tasks and scenarios,
-although it may not be possible for the agent to execute this action sequence
-due to task-, agent-, or domain-specific constraints. Our framework addresses
-these challenges by leveraging the generic predictions provided by LLM and the
-prior domain-specific knowledge encoded in a Knowledge Graph (KG), enabling an
-agent to quickly adapt to new tasks and scenarios. The robot also solicits and
-uses human input as needed to refine its existing knowledge. Based on
-experimental evaluation over cooking and cleaning tasks in simulation domains,
-we demonstrate that the interplay between LLM, KG, and human input leads to
-substantial performance gains compared with just using the LLM output.
+Artificial intelligence (AI) systems have substantially improved
+dermatologists' diagnostic accuracy for melanoma, with explainable AI (XAI)
+systems further enhancing clinicians' confidence and trust in AI-driven
+decisions. Despite these advancements, there remains a critical need for
+objective evaluation of how dermatologists engage with both AI and XAI tools.
+In this study, 76 dermatologists participated in a reader study, diagnosing 16
+dermoscopic images of melanomas and nevi using an XAI system that provides
+detailed, domain-specific explanations. Eye-tracking technology was employed to
+assess their interactions. Diagnostic performance was compared with that of a
+standard AI system lacking explanatory features. Our findings reveal that XAI
+systems improved balanced diagnostic accuracy by 2.8 percentage points relative
+to standard AI. Moreover, diagnostic disagreements with AI/XAI systems and
+complex lesions were associated with elevated cognitive load, as evidenced by
+increased ocular fixations. These insights have significant implications for
+clinical practice, the design of AI tools for visual tasks, and the broader
+development of XAI in medical diagnostics.
 
-摘要：具身代理协助人类时，通常需要在新的情境中完成新的任务。基于已知食谱在厨房准备特定菜肴的代理可能会被要求准备新菜肴或在储藏室执行清洁任务。可能没有足够资源（例如时间或标记的示例）来训练代理以应对这些新情况。在许多领域接受大量知识训练的大型语言模型 (LLM) 能够预测此类新任务和情境的抽象动作序列，尽管代理可能无法执行此动作序列，因为任务、代理或特定于域的约束。我们的框架通过利用 LLM 提供的通用预测和知识图 (KG) 中编码的先前特定于域的知识来应对这些挑战，使代理能够快速适应新任务和情境。该机器人还会根据需要征求并使用人类输入来完善其现有知识。基于在模拟域中对烹饪和清洁任务的实验评估，我们证明了 LLM、KG 和人类输入之间的相互作用与仅使用 LLM 输出相比带来了巨大的性能提升。
+摘要：人工智慧 (AI) 系統已大幅改善皮膚科醫師對黑色素瘤的診斷準確度，而可解釋 AI (XAI) 系統進一步提升臨床醫師對 AI 驅動決策的信心與信賴。儘管有這些進展，對於皮膚科醫師如何使用 AI 和 XAI 工具，仍有客觀評估的迫切需求。在這項研究中，76 位皮膚科醫師參與了一項讀者研究，使用 XAI 系統診斷 16 張黑色素瘤和痣的皮膚鏡影像，該系統提供詳細的領域特定說明。採用眼球追蹤技術來評估他們的互動。將診斷表現與缺乏說明功能的標準 AI 系統進行比較。我們的研究結果顯示，XAI 系統相較於標準 AI，將平衡診斷準確度提升了 2.8 個百分點。此外，與 AI/XAI 系統的診斷分歧和複雜的病灶與認知負擔升高有關，這由增加的眼睛注視次數所證實。這些見解對臨床實務、視覺任務 AI 工具的設計和醫學診斷中 XAI 的廣泛發展具有重大意義。
 
-##### **On Bob Dylan: A Computational Perspective**
-2502.01772v1 by Prashant Garg
+##### **Explainable AI for Autism Diagnosis: Identifying Critical Brain Regions Using fMRI Data**
+2409.15374v1 by Suryansh Vidya, Kush Gupta, Amir Aly, Andy Wills, Emmanuel Ifeachor, Rohit Shankar
 
-Cass Sunstein's essay 'On Bob Dylan' describes Dylan's 'dishabituating' style
--- a constant refusal to conform to expectation and a penchant for reinventing
-his musical and lyrical identity. In this paper, I extend Sunstein's
-observations through a large-scale computational analysis of Dylan's lyrics
-from 1962 to 2012. Using o3-mini-high (a large language model), I extract
-concept-to-concept relationships from the lyrics and construct directed
-knowledge graphs that capture Dylan's thematic structure. I then quantify
-shifts in sentiment, metaphorical expression, thematic diversity, and network
-complexity over time. The results indicate that Dylan's lyrics increasingly
-rely on metaphor, display an evolving sentiment profile, and exhibit heightened
-dishabituation -- measured here as a growing variance in the network centrality
-of key concepts. I also find that references to movement, protest, and mythic
-imagery fluctuate in ways that align with well-known phases of Dylan's career,
-reflecting the dynamic and unpredictable quality of his art. These findings not
-only deepen our empirical understanding of Sunstein's thesis but also introduce
-a novel computational method for analyzing an artist's evolution-offering
-broader applicability to the study of cultural and creative change.
+Early diagnosis and intervention for Autism Spectrum Disorder (ASD) has been
+shown to significantly improve the quality of life of autistic individuals.
+However, diagnostics methods for ASD rely on assessments based on clinical
+presentation that are prone to bias and can be challenging to arrive at an
+early diagnosis. There is a need for objective biomarkers of ASD which can help
+improve diagnostic accuracy. Deep learning (DL) has achieved outstanding
+performance in diagnosing diseases and conditions from medical imaging data.
+Extensive research has been conducted on creating models that classify ASD
+using resting-state functional Magnetic Resonance Imaging (fMRI) data. However,
+existing models lack interpretability. This research aims to improve the
+accuracy and interpretability of ASD diagnosis by creating a DL model that can
+not only accurately classify ASD but also provide explainable insights into its
+working. The dataset used is a preprocessed version of the Autism Brain Imaging
+Data Exchange (ABIDE) with 884 samples. Our findings show a model that can
+accurately classify ASD and highlight critical brain regions differing between
+ASD and typical controls, with potential implications for early diagnosis and
+understanding of the neural basis of ASD. These findings are validated by
+studies in the literature that use different datasets and modalities,
+confirming that the model actually learned characteristics of ASD and not just
+the dataset. This study advances the field of explainable AI in medical imaging
+by providing a robust and interpretable model, thereby contributing to a future
+with objective and reliable ASD diagnostics.
 
-摘要：卡斯·桑斯坦的論文「論鮑伯·迪倫」描述了迪倫「去習慣化」的風格
--- 這種風格不斷拒絕符合預期，並熱衷於重新塑造他的音樂和歌詞認同。在本文中，我透過對迪倫 1962 年至 2012 年歌詞進行大規模的運算分析，來延伸桑斯坦的觀察。使用 o3-mini-high（一個大型語言模型），我從歌詞中提取概念對概念的關係，並建構有向知識圖，以捕捉迪倫的主題結構。然後，我量化情緒、隱喻表達、主題多樣性和網路複雜性隨時間的變化。結果顯示，迪倫的歌詞越來越依賴隱喻，展現出不斷演化的情緒輪廓，並表現出高度的去習慣化 -- 在這裡測量為關鍵概念的網路中心性的變異增加。我也發現，對運動、抗議和神話意象的引用，會以與迪倫職業生涯中眾所周知階段一致的方式波動，反映了他藝術的動態和不可預測的品質。這些發現不僅加深了我們對桑斯坦論文的經驗理解，也引入了分析藝術家演變的新穎運算方法，為文化和創造性變化的研究提供了更廣泛的適用性。
+摘要：自閉症譜系障礙 (ASD) 的早期診斷和介入已被證實能顯著改善自閉症患者的生活品質。然而，ASD 的診斷方法依賴於基於臨床表現的評估，容易產生偏見，且可能難以做出早期診斷。有必要找出 ASD 的客觀生物標記，以幫助提高診斷準確性。深度學習 (DL) 在從醫學影像資料診斷疾病和病症方面取得傑出的表現。已經針對建立使用靜態功能性磁振造影 (fMRI) 資料對 ASD 進行分類的模型進行廣泛的研究。然而，現有的模型缺乏可解釋性。本研究旨在透過建立一個不僅能準確分類 ASD，還能提供可解釋見解說明其運作原理的 DL 模型，來改善 ASD 診斷的準確性和可解釋性。所使用的資料集是自閉症大腦影像資料交換 (ABIDE) 的預處理版本，包含 884 個樣本。我們的研究結果顯示，該模型能準確分類 ASD，並強調 ASD 與典型對照組之間存在差異的關鍵腦區，對於 ASD 的早期診斷和神經基礎的理解具有潛在的意義。這些研究結果已由使用不同資料集和方式的文獻研究驗證，證實該模型實際上學習了 ASD 的特徵，而不僅僅是資料集。本研究透過提供一個強健且可解釋的模型，推動了醫學影像中可解釋 AI 的領域，從而為未來提供客觀且可靠的 ASD 診斷做出貢獻。
 
-##### **VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos**
-2502.01549v1 by Xubin Ren, Lingrui Xu, Long Xia, Shuaiqiang Wang, Dawei Yin, Chao Huang
+##### **Improving Prototypical Parts Abstraction for Case-Based Reasoning Explanations Designed for the Kidney Stone Type Recognition**
+2409.12883v1 by Daniel Flores-Araiza, Francisco Lopez-Tiro, Clément Larose, Salvador Hinojosa, Andres Mendez-Vazquez, Miguel Gonzalez-Mendoza, Gilberto Ochoa-Ruiz, Christian Daul
 
-Retrieval-Augmented Generation (RAG) has demonstrated remarkable success in
-enhancing Large Language Models (LLMs) through external knowledge integration,
-yet its application has primarily focused on textual content, leaving the rich
-domain of multi-modal video knowledge predominantly unexplored. This paper
-introduces VideoRAG, the first retrieval-augmented generation framework
-specifically designed for processing and understanding extremely long-context
-videos. Our core innovation lies in its dual-channel architecture that
-seamlessly integrates (i) graph-based textual knowledge grounding for capturing
-cross-video semantic relationships, and (ii) multi-modal context encoding for
-efficiently preserving visual features. This novel design empowers VideoRAG to
-process unlimited-length videos by constructing precise knowledge graphs that
-span multiple videos while maintaining semantic dependencies through
-specialized multi-modal retrieval paradigms. Through comprehensive empirical
-evaluation on our proposed LongerVideos benchmark-comprising over 160 videos
-totaling 134+ hours across lecture, documentary, and entertainment
-categories-VideoRAG demonstrates substantial performance compared to existing
-RAG alternatives and long video understanding methods. The source code of
-VideoRAG implementation and the benchmark dataset are openly available at:
-https://github.com/HKUDS/VideoRAG.
+The in-vivo identification of the kidney stone types during an ureteroscopy
+would be a major medical advance in urology, as it could reduce the time of the
+tedious renal calculi extraction process, while diminishing infection risks.
+Furthermore, such an automated procedure would make possible to prescribe
+anti-recurrence treatments immediately. Nowadays, only few experienced
+urologists are able to recognize the kidney stone types in the images of the
+videos displayed on a screen during the endoscopy. Thus, several deep learning
+(DL) models have recently been proposed to automatically recognize the kidney
+stone types using ureteroscopic images. However, these DL models are of black
+box nature whicl limits their applicability in clinical settings. This
+contribution proposes a case-based reasoning DL model which uses prototypical
+parts (PPs) and generates local and global descriptors. The PPs encode for each
+class (i.e., kidney stone type) visual feature information (hue, saturation,
+intensity and textures) similar to that used by biologists. The PPs are
+optimally generated due a new loss function used during the model training.
+Moreover, the local and global descriptors of PPs allow to explain the
+decisions ("what" information, "where in the images") in an understandable way
+for biologists and urologists. The proposed DL model has been tested on a
+database including images of the six most widespread kidney stone types. The
+overall average classification accuracy was 90.37. When comparing this results
+with that of the eight other DL models of the kidney stone state-of-the-art, it
+can be seen that the valuable gain in explanability was not reached at the
+expense of accuracy which was even slightly increased with respect to that
+(88.2) of the best method of the literature. These promising and interpretable
+results also encourage urologists to put their trust in AI-based solutions.
 
-摘要：檢索增強生成 (RAG) 已證明在透過外部知識整合增強大型語言模型 (LLM) 方面取得顯著成功，但其應用主要集中在文字內容上，而豐富的多模態影片知識領域則鮮少被探索。本文介紹 VideoRAG，這是第一個檢索增強生成架構，專門設計用於處理和理解極長語境的影片。我們的核心創新在於其雙通道架構，它無縫整合 (i) 基於圖形文字知識基礎，用於擷取跨影片語義關係，以及 (ii) 多模態語境編碼，用於有效保留視覺特徵。這個新穎的設計讓 VideoRAG 能夠透過建構跨越多個影片的精確知識圖譜來處理長度不限的影片，同時透過專門的多模態檢索範例來維持語義依賴性。透過我們提出的 LongerVideos 基準的全面經驗評估，該基準包含超過 160 部影片，總時數超過 134 小時，涵蓋演講、紀錄片和娛樂類別，VideoRAG 與現有的 RAG 替代方案和長影片理解方法相比，展現出顯著的效能。VideoRAG 實作的原始碼和基準資料集已公開於：https://github.com/HKUDS/VideoRAG。
+摘要：尿路鏡檢查中腎結石類型的體內識別將是泌尿科的一項重大進展，因為它可以減少繁瑣的腎結石取出過程的時間，同時降低感染風險。此外，這種自動化程序將使立即開立抗復發治療成為可能。如今，只有少數經驗豐富的泌尿科醫生能夠在內視鏡檢查期間屏幕上顯示的視頻圖像中識別腎結石類型。因此，最近已提出多種深度學習 (DL) 模型，以使用輸尿管鏡圖像自動識別腎結石類型。然而，這些 DL 模型本質上是黑盒子，這限制了它們在臨床環境中的應用性。本文提出了一個基於案例推理的 DL 模型，它使用原型部分 (PP) 並生成局部和全局描述符。PP 為每種類型（即腎結石類型）編碼視覺特徵信息（色調、飽和度、強度和紋理），類似於生物學家使用的信息。由於在模型訓練期間使用的新損失函數，PP 得到了最佳生成。此外，PP 的局部和全局描述符允許以生物學家和泌尿科醫生可以理解的方式解釋決策（“什麼”信息，“圖像中的什麼位置”）。所提出的 DL 模型已在一個包含六種最廣泛的腎結石類型圖像的數據庫上進行了測試。總體平均分類準確率為 90.37。將此結果與腎結石最先進的八個其他 DL 模型的結果進行比較時，可以看出，可解釋性的寶貴增益並未以準確性為代價，甚至略有增加與文獻中最好的方法 (88.2) 相比。這些有希望且可解釋的結果也鼓勵泌尿科醫生相信基於人工智能的解決方案。
 
-##### **Transformers trained on proteins can learn to attend to Euclidean distance**
-2502.01533v1 by Isaac Ellmen, Constantin Schneider, Matthew I. J. Raybould, Charlotte M. Deane
+##### **Towards Interpretable End-Stage Renal Disease (ESRD) Prediction: Utilizing Administrative Claims Data with Explainable AI Techniques**
+2409.12087v3 by Yubo Li, Saba Al-Sayouri, Rema Padman
+
+This study explores the potential of utilizing administrative claims data,
+combined with advanced machine learning and deep learning techniques, to
+predict the progression of Chronic Kidney Disease (CKD) to End-Stage Renal
+Disease (ESRD). We analyze a comprehensive, 10-year dataset provided by a major
+health insurance organization to develop prediction models for multiple
+observation windows using traditional machine learning methods such as Random
+Forest and XGBoost as well as deep learning approaches such as Long Short-Term
+Memory (LSTM) networks. Our findings demonstrate that the LSTM model,
+particularly with a 24-month observation window, exhibits superior performance
+in predicting ESRD progression, outperforming existing models in the
+literature. We further apply SHapley Additive exPlanations (SHAP) analysis to
+enhance interpretability, providing insights into the impact of individual
+features on predictions at the individual patient level. This study underscores
+the value of leveraging administrative claims data for CKD management and
+predicting ESRD progression.
+
+摘要：本研究探討利用行政申報資料，結合先進機器學習與深度學習技術，預測慢性腎臟病 (CKD) 進展至末期腎臟疾病 (ESRD) 的可能性。我們分析一家大型健康保險組織提供的 10 年綜合資料集，使用傳統機器學習方法（例如隨機森林和 XGBoost）以及深度學習方法（例如長期短期記憶 (LSTM) 網路）開發多個觀察視窗的預測模型。我們的研究結果顯示，LSTM 模型（尤其是 24 個月觀察視窗）在預測 ESRD 進展方面表現優異，優於文獻中的現有模型。我們進一步應用 SHapley 可加性解釋 (SHAP) 分析以增強可解釋性，深入了解個別特徵對個別患者層級預測的影響。本研究強調了利用行政申報資料進行 CKD 管理和預測 ESRD 進展的價值。
+
+##### **Contextual Evaluation of Large Language Models for Classifying Tropical and Infectious Diseases**
+2409.09201v3 by Mercy Asiedu, Nenad Tomasev, Chintan Ghate, Tiya Tiyasirichokchai, Awa Dieng, Oluwatosin Akande, Geoffrey Siwo, Steve Adudans, Sylvanus Aitkins, Odianosen Ehiakhamen, Eric Ndombi, Katherine Heller
 
-While conventional Transformers generally operate on sequence data, they can
-be used in conjunction with structure models, typically SE(3)-invariant or
-equivariant graph neural networks (GNNs), for 3D applications such as protein
-structure modelling. These hybrids typically involve either (1)
-preprocessing/tokenizing structural features as input for Transformers or (2)
-taking Transformer embeddings and processing them within a structural
-representation. However, there is evidence that Transformers can learn to
-process structural information on their own, such as the AlphaFold3 structural
-diffusion model. In this work we show that Transformers can function
-independently as structure models when passed linear embeddings of coordinates.
-We first provide a theoretical explanation for how Transformers can learn to
-filter attention as a 3D Gaussian with learned variance. We then validate this
-theory using both simulated 3D points and in the context of masked token
-prediction for proteins. Finally, we show that pre-training protein Transformer
-encoders with structure improves performance on a downstream task, yielding
-better performance than custom structural models. Together, this work provides
-a basis for using standard Transformers as hybrid structure-language models.
+While large language models (LLMs) have shown promise for medical question
+answering, there is limited work focused on tropical and infectious
+disease-specific exploration. We build on an opensource tropical and infectious
+diseases (TRINDs) dataset, expanding it to include demographic and semantic
+clinical and consumer augmentations yielding 11000+ prompts. We evaluate LLM
+performance on these, comparing generalist and medical LLMs, as well as LLM
+outcomes to human experts. We demonstrate through systematic experimentation,
+the benefit of contextual information such as demographics, location, gender,
+risk factors for optimal LLM response. Finally we develop a prototype of
+TRINDs-LM, a research tool that provides a playground to navigate how context
+impacts LLM outputs for health.
 
-摘要：雖然傳統的 Transformer 通常處理序列資料，但它們可用於結構模型，通常是 SE(3) 不變式或等變式圖神經網路 (GNN)，用於蛋白質結構建模等 3D 應用。這些混合模型通常包含 (1) 將結構特徵預處理/標記化為 Transformer 的輸入或 (2) 取用 Transformer 嵌入並在結構表示中處理它們。然而，有證據表明 Transformer 可以自行學習處理結構資訊，例如 AlphaFold3 結構擴散模型。在這項工作中，我們展示了 Transformer 在傳遞座標的線性嵌入時，可以獨立作為結構模型運作。我們首先提供了 Transformer 如何學習將注意力濾波為具有學習變異的 3D 高斯的理論解釋。然後我們使用模擬 3D 點和在蛋白質遮罩標記預測的背景下驗證此理論。最後，我們展示了使用結構預訓練蛋白質 Transformer 編碼器會改善下游任務的效能，產生比自訂結構模型更好的效能。綜合來說，這項工作提供了使用標準 Transformer 作為混合結構語言模型的基礎。
+摘要：儘管大型語言模型 (LLM) 在醫療問題解答方面展現出前景，但專注於熱帶和傳染病特定探索的研究有限。我們建立在一個開放原始碼熱帶和傳染病 (TRINDs) 資料集上，並將其擴展為納入人口統計和語義臨床和消費者擴充，產生超過 11000 個提示。我們評估了 LLM 在這些方面的效能，比較了通才和醫療 LLM，以及 LLM 結果與人類專家的比較。我們透過系統性實驗證明了背景資訊（例如人口統計、位置、性別、最佳 LLM 回應的風險因素）的好處。最後，我們開發了 TRINDs-LM 的原型，這是一個研究工具，提供一個探索背景如何影響 LLM 健康輸出的平台。
 
-##### **Common Foundations for SHACL, ShEx, and PG-Schema**
-2502.01295v1 by S. Ahmetaj, I. Boneva, J. Hidders, K. Hose, M. Jakubowski, J. E. Labra-Gayo, W. Martens, F. Mogavero, F. Murlak, C. Okulmus, A. Polleres, O. Savkovic, M. Simkus, D. Tomaszuk
+##### **Explainable AI: Definition and attributes of a good explanation for health AI**
+2409.15338v1 by Evangelia Kyrimi, Scott McLachlan, Jared M Wohlgemut, Zane B Perkins, David A. Lagnado, William Marsh, the ExAIDSS Expert Group
 
-Graphs have emerged as an important foundation for a variety of applications,
-including capturing and reasoning over factual knowledge, semantic data
-integration, social networks, and providing factual knowledge for machine
-learning algorithms. To formalise certain properties of the data and to ensure
-data quality, there is a need to describe the schema of such graphs. Because of
-the breadth of applications and availability of different data models, such as
-RDF and property graphs, both the Semantic Web and the database community have
-independently developed graph schema languages: SHACL, ShEx, and PG-Schema.
-Each language has its unique approach to defining constraints and validating
-graph data, leaving potential users in the dark about their commonalities and
-differences. In this paper, we provide formal, concise definitions of the core
-components of each of these schema languages. We employ a uniform framework to
-facilitate a comprehensive comparison between the languages and identify a
-common set of functionalities, shedding light on both overlapping and
-distinctive features of the three languages.
+Proposals of artificial intelligence (AI) solutions based on increasingly
+complex and accurate predictive models are becoming ubiquitous across many
+disciplines. As the complexity of these models grows, transparency and users'
+understanding often diminish. This suggests that accurate prediction alone is
+insufficient for making an AI-based solution truly useful. In the development
+of healthcare systems, this introduces new issues related to accountability and
+safety. Understanding how and why an AI system makes a recommendation may
+require complex explanations of its inner workings and reasoning processes.
+Although research on explainable AI (XAI) has significantly increased in recent
+years and there is high demand for XAI in medicine, defining what constitutes a
+good explanation remains ad hoc, and providing adequate explanations continues
+to be challenging. To fully realize the potential of AI, it is critical to
+address two fundamental questions about explanations for safety-critical AI
+applications, such as health-AI: (1) What is an explanation in health-AI? and
+(2) What are the attributes of a good explanation in health-AI? In this study,
+we examined published literature and gathered expert opinions through a
+two-round Delphi study. The research outputs include (1) a definition of what
+constitutes an explanation in health-AI and (2) a comprehensive list of
+attributes that characterize a good explanation in health-AI.
 
-摘要：圖表已成為各種應用的重要基礎，包括擷取和推理事實知識、語義資料整合、社群網路，以及為機器學習演算法提供事實知識。為了形式化資料的特定屬性並確保資料品質，有必要描述此類圖表的架構。由於應用範圍廣泛且有不同的資料模型可用，例如 RDF 和屬性圖表，因此語義網路和資料庫社群已獨立開發圖表架構語言：SHACL、ShEx 和 PG-Schema。每種語言都有其定義約束和驗證圖表資料的獨特方法，讓潛在使用者不清楚它們的共性和差異。在本文中，我們提供這些架構語言中每個核心元件的正式簡潔定義。我們採用統一的框架來促進語言之間的全面比較，並找出功能的共同集合，說明這三種語言的重疊和獨特功能。
+摘要：隨著越來越複雜且準確的預測模型，基於人工智慧 (AI) 解決方案的提案在許多領域中變得無處不在。隨著這些模型複雜性的增加，透明度和使用者的理解力往往會降低。這表示僅有準確的預測並不足以讓 AI 解決方案真正有用。在醫療保健系統的開發中，這引入了與問責制和安全性相關的新問題。瞭解 AI 系統如何以及為何提出建議可能需要對其內部運作和推理過程進行複雜的說明。儘管近年來對可解釋 AI (XAI) 的研究已大幅增加，且醫學領域對 XAI 有很高的需求，但定義什麼構成一個好的解釋仍是臨時性的，而提供適當的解釋仍然具有挑戰性。為了充分發揮 AI 的潛力，對於安全關鍵型 AI 應用（例如健康 AI）的解釋，探討兩個基本問題至關重要：(1) 什麼是健康 AI 中的解釋？以及 (2) 健康 AI 中一個好的解釋有哪些屬性？在本研究中，我們檢視了已發表的文獻，並透過兩輪德爾菲研究收集了專家意見。研究成果包括：(1) 健康 AI 中什麼構成解釋的定義，以及 (2) 健康 AI 中一個好解釋的屬性清單。
 
-##### **GFM-RAG: Graph Foundation Model for Retrieval Augmented Generation**
-2502.01113v1 by Linhao Luo, Zicheng Zhao, Gholamreza Haffari, Dinh Phung, Chen Gong, Shirui Pan
+##### **Exploring the Effect of Explanation Content and Format on User Comprehension and Trust**
+2408.17401v1 by Antonio Rago, Bence Palfi, Purin Sukpanichnant, Hannibal Nabli, Kavyesh Vivek, Olga Kostopoulou, James Kinross, Francesca Toni
 
-Retrieval-augmented generation (RAG) has proven effective in integrating
-knowledge into large language models (LLMs). However, conventional RAGs
-struggle to capture complex relationships between pieces of knowledge, limiting
-their performance in intricate reasoning that requires integrating knowledge
-from multiple sources. Recently, graph-enhanced retrieval augmented generation
-(GraphRAG) builds graph structure to explicitly model these relationships,
-enabling more effective and efficient retrievers. Nevertheless, its performance
-is still hindered by the noise and incompleteness within the graph structure.
-To address this, we introduce GFM-RAG, a novel graph foundation model (GFM) for
-retrieval augmented generation. GFM-RAG is powered by an innovative graph
-neural network that reasons over graph structure to capture complex
-query-knowledge relationships. The GFM with 8M parameters undergoes a two-stage
-training process on large-scale datasets, comprising 60 knowledge graphs with
-over 14M triples and 700k documents. This results in impressive performance and
-generalizability for GFM-RAG, making it the first graph foundation model
-applicable to unseen datasets for retrieval without any fine-tuning required.
-Extensive experiments on three multi-hop QA datasets and seven domain-specific
-RAG datasets demonstrate that GFM-RAG achieves state-of-the-art performance
-while maintaining efficiency and alignment with neural scaling laws,
-highlighting its potential for further improvement.
+In recent years, various methods have been introduced for explaining the
+outputs of "black-box" AI models. However, it is not well understood whether
+users actually comprehend and trust these explanations. In this paper, we focus
+on explanations for a regression tool for assessing cancer risk and examine the
+effect of the explanations' content and format on the user-centric metrics of
+comprehension and trust. Regarding content, we experiment with two explanation
+methods: the popular SHAP, based on game-theoretic notions and thus potentially
+complex for everyday users to comprehend, and occlusion-1, based on feature
+occlusion which may be more comprehensible. Regarding format, we present SHAP
+explanations as charts (SC), as is conventional, and occlusion-1 explanations
+as charts (OC) as well as text (OT), to which their simpler nature also lends
+itself. The experiments amount to user studies questioning participants, with
+two different levels of expertise (the general population and those with some
+medical training), on their subjective and objective comprehension of and trust
+in explanations for the outputs of the regression tool. In both studies we
+found a clear preference in terms of subjective comprehension and trust for
+occlusion-1 over SHAP explanations in general, when comparing based on content.
+However, direct comparisons of explanations when controlling for format only
+revealed evidence for OT over SC explanations in most cases, suggesting that
+the dominance of occlusion-1 over SHAP explanations may be driven by a
+preference for text over charts as explanations. Finally, we found no evidence
+of a difference between the explanation types in terms of objective
+comprehension. Thus overall, the choice of the content and format of
+explanations needs careful attention, since in some contexts format, rather
+than content, may play the critical role in improving user experience.
 
-摘要：檢索增強生成 (RAG) 已證明在整合知識到大語言模型 (LLM) 中有效。然而，傳統的 RAG 難以捕捉知識片段之間的複雜關係，限制了它們在需要整合來自多個來源的知識的複雜推理中的表現。最近，圖表增強檢索增強生成 (GraphRAG) 建立圖表結構來明確建模這些關係，從而實現更有效率的檢索器。儘管如此，其效能仍受到圖表結構中雜訊和不完整性的阻礙。為了解決這個問題，我們引入了 GFM-RAG，一種用於檢索增強生成的全新圖表基礎模型 (GFM)。GFM-RAG 由一個創新的圖神經網路驅動，該網路在圖表結構上進行推理以捕捉複雜的查詢知識關係。具有 8M 參數的 GFM 在大型資料集上進行兩階段訓練流程，包括 60 個包含超過 14M 個三元組和 700k 個文件的文件。這為 GFM-RAG 帶來了令人印象深刻的效能和通用性，使其成為第一個適用於未見過資料集的圖表基礎模型，而無需任何微調。在三個多跳問答資料集和七個特定領域 RAG 資料集上的廣泛實驗表明，GFM-RAG 達到了最先進的效能，同時保持了效率並與神經擴充定律保持一致，突顯了其進一步改進的潛力。
+摘要：<paragraph>近年來，已經引進各種方法來解釋「黑箱」AI 模型的輸出。然而，目前並不清楚使用者是否實際理解和信任這些解釋。在本文中，我們專注於評估癌症風險的回歸工具的解釋，並探討解釋的內容和格式對以使用者為中心的理解和信任指標的影響。關於內容，我們實驗了兩種解釋方法：流行的 SHAP，基於博弈論概念，因此對於日常使用者來說可能很複雜，以及基於特徵遮蔽的 occlusion-1，可能更易於理解。關於格式，我們將 SHAP 解釋呈現為圖表 (SC)，這是慣例，而將 occlusion-1 解釋呈現為圖表 (OC) 以及文字 (OT)，其較為簡單的性質也適用於此。這些實驗等同於使用者研究，詢問參與者，具有兩種不同程度的專業知識（一般民眾和具備一些醫學訓練的人），他們對回歸工具輸出解釋的主觀和客觀理解和信任。在兩項研究中，我們發現，在基於內容進行比較時，一般來說，occlusion-1 優於 SHAP 解釋，在主觀理解和信任方面有明顯的偏好。然而，在僅控制格式的情況下直接比較解釋，在大多數情況下只顯示 OT 優於 SC 解釋的證據，這表明 occlusion-1 優於 SHAP 解釋的主導地位可能是由偏好文字而非圖表作為解釋所驅動的。最後，我們沒有發現解釋類型在客觀理解方面的差異證據。因此，總體而言，對解釋的內容和格式的選擇需要仔細注意，因為在某些情況下，格式而非內容，可能在改善使用者體驗方面發揮關鍵作用。</paragraph>
 
-##### **Knowledge Synthesis of Photosynthesis Research Using a Large Language Model**
-2502.01059v1 by Seungri Yoon, Woosang Jeon, Sanghyeok Choi, Taehyeong Kim, Tae In Ahn
+##### **A Survey for Large Language Models in Biomedicine**
+2409.00133v1 by Chong Wang, Mengyao Li, Junjun He, Zhongruo Wang, Erfan Darzi, Zan Chen, Jin Ye, Tianbin Li, Yanzhou Su, Jing Ke, Kaili Qu, Shuxin Li, Yi Yu, Pietro Liò, Tianyun Wang, Yu Guang Wang, Yiqing Shen
 
-The development of biological data analysis tools and large language models
-(LLMs) has opened up new possibilities for utilizing AI in plant science
-research, with the potential to contribute significantly to knowledge
-integration and research gap identification. Nonetheless, current LLMs struggle
-to handle complex biological data and theoretical models in photosynthesis
-research and often fail to provide accurate scientific contexts. Therefore,
-this study proposed a photosynthesis research assistant (PRAG) based on
-OpenAI's GPT-4o with retrieval-augmented generation (RAG) techniques and prompt
-optimization. Vector databases and an automated feedback loop were used in the
-prompt optimization process to enhance the accuracy and relevance of the
-responses to photosynthesis-related queries. PRAG showed an average improvement
-of 8.7% across five metrics related to scientific writing, with a 25.4%
-increase in source transparency. Additionally, its scientific depth and domain
-coverage were comparable to those of photosynthesis research papers. A
-knowledge graph was used to structure PRAG's responses with papers within and
-outside the database, which allowed PRAG to match key entities with 63% and
-39.5% of the database and test papers, respectively. PRAG can be applied for
-photosynthesis research and broader plant science domains, paving the way for
-more in-depth data analysis and predictive capabilities.
+Recent breakthroughs in large language models (LLMs) offer unprecedented
+natural language understanding and generation capabilities. However, existing
+surveys on LLMs in biomedicine often focus on specific applications or model
+architectures, lacking a comprehensive analysis that integrates the latest
+advancements across various biomedical domains. This review, based on an
+analysis of 484 publications sourced from databases including PubMed, Web of
+Science, and arXiv, provides an in-depth examination of the current landscape,
+applications, challenges, and prospects of LLMs in biomedicine, distinguishing
+itself by focusing on the practical implications of these models in real-world
+biomedical contexts. Firstly, we explore the capabilities of LLMs in zero-shot
+learning across a broad spectrum of biomedical tasks, including diagnostic
+assistance, drug discovery, and personalized medicine, among others, with
+insights drawn from 137 key studies. Then, we discuss adaptation strategies of
+LLMs, including fine-tuning methods for both uni-modal and multi-modal LLMs to
+enhance their performance in specialized biomedical contexts where zero-shot
+fails to achieve, such as medical question answering and efficient processing
+of biomedical literature. Finally, we discuss the challenges that LLMs face in
+the biomedicine domain including data privacy concerns, limited model
+interpretability, issues with dataset quality, and ethics due to the sensitive
+nature of biomedical data, the need for highly reliable model outputs, and the
+ethical implications of deploying AI in healthcare. To address these
+challenges, we also identify future research directions of LLM in biomedicine
+including federated learning methods to preserve data privacy and integrating
+explainable AI methodologies to enhance the transparency of LLMs.
 
-摘要：生物資料分析工具和大型語言模型 (LLM) 的發展，為利用人工智慧於植物科學研究開啟了新的可能性，並有潛力對知識整合和研究差距的識別做出重大貢獻。儘管如此，目前的 LLM 在處理光合作用研究中的複雜生物資料和理論模型時仍有困難，而且常常無法提供準確的科學背景。因此，本研究提出了一個基於 OpenAI 的 GPT-4o、具備檢索增強生成 (RAG) 技術和提示最佳化的光合作用研究助理 (PRAG)。在提示最佳化過程中，使用了向量資料庫和自動回饋迴路，以增強對與光合作用相關查詢的回應的準確性和相關性。PRAG 在與科學寫作相關的五項指標中顯示出平均改善了 8.7%，來源透明度增加了 25.4%。此外，其科學深度和領域涵蓋範圍與光合作用研究論文相當。知識圖譜用於建構 PRAG 的回應，其中包含資料庫內外論文，這使得 PRAG 能夠分別與資料庫和測試論文中的 63% 和 39.5% 的關鍵實體相匹配。PRAG 可應用於光合作用研究和更廣泛的植物科學領域，為更深入的資料分析和預測能力鋪路。
+摘要：大型語言模型 (LLM) 的最新突破提供了前所未有的自然語言理解和生成能力。然而，現有關於生物醫學中 LLM 的調查通常專注於特定應用或模型架構，缺乏整合各種生物醫學領域最新進展的全面分析。本綜述基於對來自 PubMed、Web of Science 和 arXiv 等數據庫的 484 篇出版物的分析，深入探討了生物醫學中 LLM 的當前現況、應用、挑戰和前景，其特點是關注這些模型在現實世界生物醫學背景中的實際應用。首先，我們探討了 LLM 在廣泛的生物醫學任務中的零次學習能力，包括診斷輔助、藥物發現和個性化醫療等，並從 137 項關鍵研究中汲取見解。然後，我們討論了 LLM 的適應策略，包括單模態和多模態 LLM 的微調方法，以增強它們在零次學習無法實現的專業生物醫學背景中的性能，例如醫療問題解答和生物醫學文獻的有效處理。最後，我們討論了 LLM 在生物醫學領域面臨的挑戰，包括數據隱私問題、模型可解釋性有限、數據集質量問題以及由於生物醫學數據的敏感性、對高度可靠模型輸出的需求以及在醫療保健中部署 AI 的倫理影響而產生的倫理問題。為了應對這些挑戰，我們還確定了生物醫學中 LLM 未來的研究方向，包括用於保護數據隱私的聯合學習方法以及整合可解釋 AI 方法以增強 LLM 的透明度。
 
-##### **Encrypted Large Model Inference: The Equivariant Encryption Paradigm**
-2502.01013v1 by James Buban, Hongyang Zhang, Claudio Angione, Harry Yang, Ahmad Farhan, Seyfal Sultanov, Michael Du, Xuran Ma, Zihao Wang, Yue Zhao, Arria Owlia, Fielding Johnston, Patrick Colangelo
+##### **Aligning XAI with EU Regulations for Smart Biomedical Devices: A Methodology for Compliance Analysis**
+2408.15121v1 by Francesco Sovrano, Michael Lognoul, Giulia Vilone
 
-Large scale deep learning model, such as modern language models and diffusion
-architectures, have revolutionized applications ranging from natural language
-processing to computer vision. However, their deployment in distributed or
-decentralized environments raises significant privacy concerns, as sensitive
-data may be exposed during inference. Traditional techniques like secure
-multi-party computation, homomorphic encryption, and differential privacy offer
-partial remedies but often incur substantial computational overhead, latency
-penalties, or limited compatibility with non-linear network operations. In this
-work, we introduce Equivariant Encryption (EE), a novel paradigm designed to
-enable secure, "blind" inference on encrypted data with near zero performance
-overhead. Unlike fully homomorphic approaches that encrypt the entire
-computational graph, EE selectively obfuscates critical internal
-representations within neural network layers while preserving the exact
-functionality of both linear and a prescribed set of non-linear operations.
-This targeted encryption ensures that raw inputs, intermediate activations, and
-outputs remain confidential, even when processed on untrusted infrastructure.
-We detail the theoretical foundations of EE, compare its performance and
-integration complexity against conventional privacy preserving techniques, and
-demonstrate its applicability across a range of architectures, from
-convolutional networks to large language models. Furthermore, our work provides
-a comprehensive threat analysis, outlining potential attack vectors and
-baseline strategies, and benchmarks EE against standard inference pipelines in
-decentralized settings. The results confirm that EE maintains high fidelity and
-throughput, effectively bridging the gap between robust data confidentiality
-and the stringent efficiency requirements of modern, large scale model
-inference.
+Significant investment and development have gone into integrating Artificial
+Intelligence (AI) in medical and healthcare applications, leading to advanced
+control systems in medical technology. However, the opacity of AI systems
+raises concerns about essential characteristics needed in such sensitive
+applications, like transparency and trustworthiness. Our study addresses these
+concerns by investigating a process for selecting the most adequate Explainable
+AI (XAI) methods to comply with the explanation requirements of key EU
+regulations in the context of smart bioelectronics for medical devices. The
+adopted methodology starts with categorising smart devices by their control
+mechanisms (open-loop, closed-loop, and semi-closed-loop systems) and delving
+into their technology. Then, we analyse these regulations to define their
+explainability requirements for the various devices and related goals.
+Simultaneously, we classify XAI methods by their explanatory objectives. This
+allows for matching legal explainability requirements with XAI explanatory
+goals and determining the suitable XAI algorithms for achieving them. Our
+findings provide a nuanced understanding of which XAI algorithms align better
+with EU regulations for different types of medical devices. We demonstrate this
+through practical case studies on different neural implants, from chronic
+disease management to advanced prosthetics. This study fills a crucial gap in
+aligning XAI applications in bioelectronics with stringent provisions of EU
+regulations. It provides a practical framework for developers and researchers,
+ensuring their AI innovations advance healthcare technology and adhere to legal
+and ethical standards.
 
-摘要：大型深度學習模型，例如現代語言模型和擴散架構，徹底改變了從自然語言處理到電腦視覺等各種應用。然而，它們在分散式或分散式環境中的部署引發了重大的隱私問題，因為敏感數據可能會在推理過程中遭到揭露。安全多方計算、同態加密和差分隱私等傳統技術提供了部分補救措施，但通常會產生大量的計算開銷、延遲處罰，或與非線性網路操作相容性有限。在這項工作中，我們引入了等變加密 (EE)，這是一種新穎的範例，旨在以接近零效能開銷對加密數據進行安全、「盲目」推理。與加密整個計算圖形的完全同態方法不同，EE 有選擇性地混淆神經網路層內的關鍵內部表示，同時保留線性和規定的一組非線性操作的精確功能。這種有針對性的加密確保了原始輸入、中間激活和輸出保持機密，即使在不受信任的基礎設施上處理也是如此。我們詳細說明了 EE 的理論基礎，比較了其效能和整合複雜度與傳統的隱私保護技術，並展示了其在從卷積網路到大語言模型等各種架構中的適用性。此外，我們的研究提供了全面的威脅分析，概述了潛在的攻擊媒介和基準策略，並在分散式設定中將 EE 與標準推理管道進行比較。結果證實，EE 保持了高保真度和高傳輸量，有效地彌合了強大的數據機密性與現代化、大規模模型推理的嚴格效率要求之間的差距。
+摘要：人工智慧（AI）在醫療和保健應用中投入了大量的投資和開發，進而導致醫療技術中的先進控制系統。然而，AI 系統的不透明性引發了對此類敏感應用中所需基本特性的擔憂，例如透明度和可信度。我們的研究透過調查一個程序來解決這些問題，用於選擇最充分的可解釋 AI（XAI）方法，以符合歐盟法規在醫療器材的智慧型生物電子學中的說明要求。採用的方法從透過其控制機制（開迴路、閉迴路和半閉迴路系統）對智慧型裝置進行分類，並深入探討其技術開始。然後，我們分析這些法規以定義其對各種裝置和相關目標的可解釋性要求。同時，我們透過其說明目標對 XAI 方法進行分類。這允許將法律可解釋性要求與 XAI 說明目標相匹配，並確定適當的 XAI 演算法來達成它們。我們的研究結果提供了對哪些 XAI 演算法更符合歐盟法規以適用於不同類型的醫療器材的細緻理解。我們透過不同神經植入物的實際案例研究來證明這一點，從慢性疾病管理到先進的義肢。這項研究填補了將生物電子學中的 XAI 應用與歐盟法規的嚴格規定相符的重要空白。它為開發人員和研究人員提供了一個實用的架構，確保其 AI 創新能促進醫療技術並遵守法律和道德標準。
 
-##### **Metastable Dynamics of Chain-of-Thought Reasoning: Provable Benefits of Search, RL and Distillation**
-2502.01694v1 by Juno Kim, Denny Wu, Jason Lee, Taiji Suzuki
+##### **Towards Case-based Interpretability for Medical Federated Learning**
+2408.13626v1 by Laura Latorre, Liliana Petrychenko, Regina Beets-Tan, Taisiya Kopytova, Wilson Silva
 
-A key paradigm to improve the reasoning capabilities of large language models
-(LLMs) is to allocate more inference-time compute to search against a verifier
-or reward model. This process can then be utilized to refine the pretrained
-model or distill its reasoning patterns into more efficient models. In this
-paper, we study inference-time compute by viewing chain-of-thought (CoT)
-generation as a metastable Markov process: easy reasoning steps (e.g.,
-algebraic manipulations) form densely connected clusters, while hard reasoning
-steps (e.g., applying a relevant theorem) create sparse, low-probability edges
-between clusters, leading to phase transitions at longer timescales. Under this
-framework, we prove that implementing a search protocol that rewards sparse
-edges improves CoT by decreasing the expected number of steps to reach
-different clusters. In contrast, we establish a limit on reasoning capability
-when the model is restricted to local information of the pretrained graph. We
-also show that the information gained by search can be utilized to obtain a
-better reasoning model: (1) the pretrained model can be directly finetuned to
-favor sparse edges via policy gradient methods, and moreover (2) a compressed
-metastable representation of the reasoning dynamics can be distilled into a
-smaller, more efficient model.
+We explore deep generative models to generate case-based explanations in a
+medical federated learning setting. Explaining AI model decisions through
+case-based interpretability is paramount to increasing trust and allowing
+widespread adoption of AI in clinical practice. However, medical AI training
+paradigms are shifting towards federated learning settings in order to comply
+with data protection regulations. In a federated scenario, past data is
+inaccessible to the current user. Thus, we use a deep generative model to
+generate synthetic examples that protect privacy and explain decisions. Our
+proof-of-concept focuses on pleural effusion diagnosis and uses publicly
+available Chest X-ray data.
 
-摘要：<paragraph>提升大型語言模型 (LLM) 推理能力的一個關鍵範例，是分配更多推論時間運算來搜尋驗證器或獎勵模型。此程序接著可用於改善預訓練模型或將其推理模式提煉到更有效率的模型中。在這篇論文中，我們透過將思維鏈 (CoT) 生成視為亞穩態馬可夫過程來研究推論時間運算：簡單的推理步驟（例如代數運算）形成密集連接的叢集，而困難的推理步驟（例如應用相關定理）則在叢集之間建立稀疏、低機率的邊緣，導致在較長時間尺度上產生相變。在此架構下，我們證明實作一種獎勵稀疏邊緣的搜尋協定，會透過減少到達不同叢集所需的預期步驟數來改善 CoT。相反地，當模型受限於預訓練圖形的局部資訊時，我們建立了推理能力的限制。我們也顯示搜尋所獲得的資訊可用於取得更好的推理模型：(1) 預訓練模型可以直接微調以透過策略梯度方法偏好稀疏邊緣，而且 (2) 推理動態的壓縮亞穩態表徵可以提煉到更小、更有效率的模型中。</paragraph>
+摘要：我們探索深度生成模型，在醫療聯邦學習設置中生成基於案例的說明。透過基於案例的可解釋性來解釋 AI 模型決策，對於增加信任並允許 AI 在臨床實務中廣泛採用至關重要。然而，醫療 AI 訓練範例正轉向聯邦學習設置，以符合資料保護法規。在聯邦情境中，過去的資料對目前的使用者而言是無法取得的。因此，我們使用深度生成模型來產生保護隱私和解釋決策的合成範例。我們的概念驗證著重於胸腔積液診斷，並使用公開可取得的胸部 X 光資料。
 
-##### **PhiP-G: Physics-Guided Text-to-3D Compositional Scene Generation**
-2502.00708v1 by Qixuan Li, Chao Wang, Zongjin He, Yan Peng
+##### **AI in radiological imaging of soft-tissue and bone tumours: a systematic review evaluating against CLAIM and FUTURE-AI guidelines**
+2408.12491v1 by Douwe J. Spaanderman, Matthew Marzetti, Xinyi Wan, Andrew F. Scarsbrook, Philip Robinson, Edwin H. G. Oei, Jacob J. Visser, Robert Hemke, Kirsten van Langevelde, David F. Hanff, Geert J. L. H. van Leenders, Cornelis Verhoef, Dirk J. Gruühagen, Wiro J. Niessen, Stefan Klein, Martijn P. A. Starmans
 
-Text-to-3D asset generation has achieved significant optimization under the
-supervision of 2D diffusion priors. However, when dealing with compositional
-scenes, existing methods encounter several challenges: 1). failure to ensure
-that composite scene layouts comply with physical laws; 2). difficulty in
-accurately capturing the assets and relationships described in complex scene
-descriptions; 3). limited autonomous asset generation capabilities among layout
-approaches leveraging large language models (LLMs). To avoid these compromises,
-we propose a novel framework for compositional scene generation, PhiP-G, which
-seamlessly integrates generation techniques with layout guidance based on a
-world model. Leveraging LLM-based agents, PhiP-G analyzes the complex scene
-description to generate a scene graph, and integrating a multimodal 2D
-generation agent and a 3D Gaussian generation method for targeted assets
-creation. For the stage of layout, PhiP-G employs a physical pool with adhesion
-capabilities and a visual supervision agent, forming a world model for layout
-prediction and planning. Extensive experiments demonstrate that PhiP-G
-significantly enhances the generation quality and physical rationality of the
-compositional scenes. Notably, PhiP-G attains state-of-the-art (SOTA)
-performance in CLIP scores, achieves parity with the leading methods in
-generation quality as measured by the T$^3$Bench, and improves efficiency by
-24x.
+Soft-tissue and bone tumours (STBT) are rare, diagnostically challenging
+lesions with variable clinical behaviours and treatment approaches. This
+systematic review provides an overview of Artificial Intelligence (AI) methods
+using radiological imaging for diagnosis and prognosis of these tumours,
+highlighting challenges in clinical translation, and evaluating study alignment
+with the Checklist for AI in Medical Imaging (CLAIM) and the FUTURE-AI
+international consensus guidelines for trustworthy and deployable AI to promote
+the clinical translation of AI methods. The review covered literature from
+several bibliographic databases, including papers published before 17/07/2024.
+Original research in peer-reviewed journals focused on radiology-based AI for
+diagnosing or prognosing primary STBT was included. Exclusion criteria were
+animal, cadaveric, or laboratory studies, and non-English papers. Abstracts
+were screened by two of three independent reviewers for eligibility. Eligible
+papers were assessed against guidelines by one of three independent reviewers.
+The search identified 15,015 abstracts, from which 325 articles were included
+for evaluation. Most studies performed moderately on CLAIM, averaging a score
+of 28.9$\pm$7.5 out of 53, but poorly on FUTURE-AI, averaging 5.1$\pm$2.1 out
+of 30. Imaging-AI tools for STBT remain at the proof-of-concept stage,
+indicating significant room for improvement. Future efforts by AI developers
+should focus on design (e.g. define unmet clinical need, intended clinical
+setting and how AI would be integrated in clinical workflow), development (e.g.
+build on previous work, explainability), evaluation (e.g. evaluating and
+addressing biases, evaluating AI against best practices), and data
+reproducibility and availability (making documented code and data publicly
+available). Following these recommendations could improve clinical translation
+of AI methods.
 
-摘要：<paragraph>在 2D 擴散先驗的監督下，文字轉 3D 資產生成已取得顯著的最佳化。然而，在處理合成場景時，現有方法會遇到幾個挑戰：1) 無法確保複合場景佈局符合物理定律；2) 難以準確捕捉複雜場景描述中所描述的資產和關係；3) 在利用大型語言模型 (LLM) 的佈局方法中，自主資產生成能力有限。為了避免這些折衷，我們提出了一個合成場景生成的新框架 PhiP-G，它將生成技術與基於世界模型的佈局指導無縫整合。利用基於 LLM 的代理，PhiP-G 分析複雜的場景描述以生成場景圖，並整合多模態 2D 生成代理和 3D 高斯生成方法以進行目標資產創建。對於佈局階段，PhiP-G 採用具有附著能力的物理池和視覺監督代理，形成用於佈局預測和規劃的世界模型。大量的實驗證明，PhiP-G 大幅提升了合成場景的生成品質和物理合理性。值得注意的是，PhiP-G 在 CLIP 分數中獲得了最先進 (SOTA) 的效能，在 T$^3$Bench 測量的生成品質中與領先的方法達到同等水準，並將效率提升了 24 倍。</paragraph>
+摘要：軟組織和骨骼腫瘤（STBT）是罕見、診斷具有挑戰性的病灶，其臨床行為和治療方法各不相同。這篇系統性回顧提供了使用放射影像進行診斷和預後的人工智慧 (AI) 方法的概觀，重點說明了臨床轉譯的挑戰，並評估研究與醫療影像 AI 核查表 (CLAIM) 和 FUTURE-AI 可信賴且可部署 AI 的國際共識準則的一致性，以促進 AI 方法的臨床轉譯。這篇回顧涵蓋了幾個書目資料庫中的文獻，包括在 2024 年 7 月 17 日之前發表的論文。納入了以放射為基礎的 AI 診斷或預後原發性 STBT 的同行評審期刊中的原始研究。排除標準是動物、屍體或實驗室研究，以及非英文論文。摘要由三位獨立審查員中的兩位篩選資格。合格的論文由三位獨立審查員中的一位根據準則進行評估。搜索識別出 15,015 篇摘要，其中 325 篇文章被納入評估。大多數研究在 CLAIM 中表現中等，平均得分為 53 分中的 28.9±7.5 分，但在 FUTURE-AI 中表現不佳，平均得分為 30 分中的 5.1±2.1 分。STBT 的影像 AI 工具仍處於概念驗證階段，表明有顯著的改進空間。AI 開發人員未來的努力應集中在設計（例如定義未滿足的臨床需求、預期的臨床環境以及 AI 如何整合到臨床工作流程中）、開發（例如建立在先前的工作、可解釋性）、評估（例如評估和解決偏差、評估 AI 與最佳實務）、以及數據可複製性和可用性（公開提供文件化的代碼和數據）。遵循這些建議可以改善 AI 方法的臨床轉譯。
 
-##### **A Survey of Quantized Graph Representation Learning: Connecting Graph Structures with Large Language Models**
-2502.00681v1 by Qika Lin, Zhen Peng, Kaize Shi, Kai He, Yiming Xu, Erik Cambria, Mengling Feng
+##### **Evaluating Explainable AI Methods in Deep Learning Models for Early Detection of Cerebral Palsy**
+2409.00001v1 by Kimji N. Pellano, Inga Strümke, Daniel Groos, Lars Adde, Espen Alexander F. Ihlen
 
-Recent years have witnessed rapid advances in graph representation learning,
-with the continuous embedding approach emerging as the dominant paradigm.
-However, such methods encounter issues regarding parameter efficiency,
-interpretability, and robustness. Thus, Quantized Graph Representation (QGR)
-learning has recently gained increasing interest, which represents the graph
-structure with discrete codes instead of conventional continuous embeddings.
-Given its analogous representation form to natural language, QGR also possesses
-the capability to seamlessly integrate graph structures with large language
-models (LLMs). As this emerging paradigm is still in its infancy yet holds
-significant promise, we undertake this thorough survey to promote its rapid
-future prosperity. We first present the background of the general quantization
-methods and their merits. Moreover, we provide an in-depth demonstration of
-current QGR studies from the perspectives of quantized strategies, training
-objectives, distinctive designs, knowledge graph quantization, and
-applications. We further explore the strategies for code dependence learning
-and integration with LLMs. At last, we give discussions and conclude future
-directions, aiming to provide a comprehensive picture of QGR and inspire future
-research.
+Early detection of Cerebral Palsy (CP) is crucial for effective intervention
+and monitoring. This paper tests the reliability and applicability of
+Explainable AI (XAI) methods using a deep learning method that predicts CP by
+analyzing skeletal data extracted from video recordings of infant movements.
+Specifically, we use XAI evaluation metrics -- namely faithfulness and
+stability -- to quantitatively assess the reliability of Class Activation
+Mapping (CAM) and Gradient-weighted Class Activation Mapping (Grad-CAM) in this
+specific medical application. We utilize a unique dataset of infant movements
+and apply skeleton data perturbations without distorting the original dynamics
+of the infant movements. Our CP prediction model utilizes an ensemble approach,
+so we evaluate the XAI metrics performances for both the overall ensemble and
+the individual models. Our findings indicate that both XAI methods effectively
+identify key body points influencing CP predictions and that the explanations
+are robust against minor data perturbations. Grad-CAM significantly outperforms
+CAM in the RISv metric, which measures stability in terms of velocity. In
+contrast, CAM performs better in the RISb metric, which relates to bone
+stability, and the RRS metric, which assesses internal representation
+robustness. Individual models within the ensemble show varied results, and
+neither CAM nor Grad-CAM consistently outperform the other, with the ensemble
+approach providing a representation of outcomes from its constituent models.
 
-摘要：近年来，图表示学习取得了快速进展，其中连续嵌入方法作为主导范式出现。然而，此类方法遇到了参数效率、可解释性和鲁棒性方面的问题。因此，量化图表示 (QGR) 学习最近引起了越来越多的兴趣，它使用离散代码而不是传统的连续嵌入来表示图结构。鉴于其与自然语言类似的表示形式，QGR 也具备将图结构与大型语言模型 (LLM) 无缝集成的能力。由于这种新兴范式仍处于起步阶段，但前景广阔，我们进行了这项全面调查以促进其快速未来的繁荣。我们首先介绍了通用量化方法的背景及其优点。此外，我们从量化策略、训练目标、独特设计、知识图谱量化和应用的角度对当前的 QGR 研究进行了深入的论证。我们进一步探索了代码依赖性学习和与 LLM 集成的策略。最后，我们给出了讨论并总结了未来的方向，旨在提供 QGR 的全面图景并激发未来的研究。
+摘要：腦性麻痺 (CP) 的早期偵測對於有效的介入和監測至關重要。本文測試了可解釋 AI (XAI) 方法的可靠性和適用性，使用深度學習方法，透過分析從嬰兒動作影片記錄中提取的骨骼資料來預測 CP。具體來說，我們使用 XAI 評估指標（即忠實度和穩定性）來量化評估類別激活映射 (CAM) 和梯度加權類別激活映射 (Grad-CAM) 在這個特定醫療應用中的可靠性。我們利用一個獨特的嬰兒動作資料集，並應用骨骼資料擾動，而不會扭曲嬰兒動作的原始動力。我們的 CP 預測模型利用整體方法，因此我們評估了整體整體和個別模型的 XAI 指標表現。我們的研究結果表明，兩種 XAI 方法都能有效識別影響 CP 預測的關鍵身體部位，並且這些解釋對於微小的資料擾動具有魯棒性。Grad-CAM 在 RISv 指標中顯著優於 CAM，該指標衡量速度方面的穩定性。相比之下，CAM 在 RISb 指標中表現得更好，該指標與骨骼穩定性有關，而 RRS 指標則評估內部表示的魯棒性。整體中的個別模型顯示出不同的結果，CAM 和 Grad-CAM 都不一致地優於另一種，整體方法提供了其組成模型結果的表示。
 
-##### **Challenges and Innovations in LLM-Powered Fake News Detection: A Synthesis of Approaches and Future Directions**
-2502.00339v1 by Jingyuan Yi, Zeqiu Xu, Tianyi Huang, Peiyang Yu
+##### **MicroXercise: A Micro-Level Comparative and Explainable System for Remote Physical Therapy**
+2408.11837v1 by Hanchen David Wang, Nibraas Khan, Anna Chen, Nilanjan Sarkar, Pamela Wisniewski, Meiyi Ma
 
-The pervasiveness of the dissemination of fake news through social media
-platforms poses critical risks to the trust of the general public, societal
-stability, and democratic institutions. This challenge calls for novel
-methodologies in detection, which can keep pace with the dynamic and
-multi-modal nature of misinformation. Recent works include powering the
-detection using large language model advances in multimodal frameworks,
-methodologies using graphs, and adversarial training in the literature of fake
-news. Based on the different approaches which can bring success, some key
-highlights will be underlined: enhanced LLM-improves accuracy through more
-advanced semantics and cross-modality fusion for robust detections. The review
-further identifies critical gaps in adaptability to dynamic social media
-trends, real-time, and cross-platform detection capabilities, as well as the
-ethical challenges thrown up by the misuse of LLMs. Future directions underline
-the development of style-agnostic models, cross-lingual detection frameworks,
-and robust policies with a view to mitigating LLM-driven misinformation. This
-synthesis thus lays a concrete foundation for those researchers and
-practitioners committed to reinforcing fake news detection systems with
-complications that keep on growing in the digital landscape.
+Recent global estimates suggest that as many as 2.41 billion individuals have
+health conditions that would benefit from rehabilitation services. Home-based
+Physical Therapy (PT) faces significant challenges in providing interactive
+feedback and meaningful observation for therapists and patients. To fill this
+gap, we present MicroXercise, which integrates micro-motion analysis with
+wearable sensors, providing therapists and patients with a comprehensive
+feedback interface, including video, text, and scores. Crucially, it employs
+multi-dimensional Dynamic Time Warping (DTW) and attribution-based explainable
+methods to analyze the existing deep learning neural networks in monitoring
+exercises, focusing on a high granularity of exercise. This synergistic
+approach is pivotal, providing output matching the input size to precisely
+highlight critical subtleties and movements in PT, thus transforming complex AI
+analysis into clear, actionable feedback. By highlighting these micro-motions
+in different metrics, such as stability and range of motion, MicroXercise
+significantly enhances the understanding and relevance of feedback for
+end-users. Comparative performance metrics underscore its effectiveness over
+traditional methods, such as a 39% and 42% improvement in Feature Mutual
+Information (FMI) and Continuity. MicroXercise is a step ahead in home-based
+physical therapy, providing a technologically advanced and intuitively helpful
+solution to enhance patient care and outcomes.
 
-摘要：社群媒體平台上假新聞散播的普遍性對一般大眾的信任、社會穩定性與民主制度構成重大風險。這項挑戰需要在偵測方面採用創新的方法論，才能跟上錯誤資訊的動態和多模態特性。最近的研究包括使用多模態架構中大型語言模型的進展、使用圖形的方法論，以及在假新聞文獻中進行對抗訓練來強化偵測。根據可以帶來成功的不同方法，將重點說明一些重點：增強的 LLM 可透過更進階的語意和跨模態融合來提升準確度，以進行穩健的偵測。這篇評論進一步找出在適應動態社群媒體趨勢、即時和跨平台偵測能力方面的重大差距，以及 LLM 遭濫用的道德挑戰。未來的方向強調開發與風格無關的模型、跨語言偵測架構和穩健的政策，以減輕 LLM 驅動的錯誤資訊。因此，這種綜合分析為那些致力於強化假新聞偵測系統的研究人員和從業人員奠定了具體的基礎，而這些複雜性在數位環境中持續增長。
+摘要：最近的全球估計表明，多達 24.1 億人有
+健康狀況可從復健服務中受益。居家
+物理治療 (PT) 在提供互動式
+回饋和有意義的觀察方面面臨重大挑戰，供治療師和患者使用。為了填補這
+個缺口，我們提出 MicroXercise，它將微動作分析與
+可穿戴式感測器整合在一起，為治療師和患者提供一個全面的
+回饋介面，包括影片、文字和分數。至關重要的是，它採用
+多維動態時間規整 (DTW) 和基於歸因的可解釋
+方法來分析監控運動中現有的深度學習神經網路，專注於運動的高粒度。這種協同
+方法至關重要，提供與輸入大小匹配的輸出，以精確地
+突出 PT 中關鍵的細微差別和動作，從而將複雜的 AI
+分析轉換為清晰、可操作的回饋。透過在不同指標中突顯這些微動作，例如穩定性和動作範圍，MicroXercise
+顯著提升最終使用者對回饋的理解和相關性。比較效能指標強調其優於
+傳統方法的有效性，例如特徵互惠資訊 (FMI) 和連續性分別提升了 39% 和 42%。MicroXercise 在居家
+物理治療方面更進一步，提供技術先進且直覺有用的
+解決方案，以提升患者照護和結果。
 
-##### **DEUCE: Dual-diversity Enhancement and Uncertainty-awareness for Cold-start Active Learning**
-2502.00305v1 by Jiaxin Guo, C. L. Philip Chen, Shuzhen Li, Tong Zhang
+##### **The Literature Review Network: An Explainable Artificial Intelligence for Systematic Literature Reviews, Meta-analyses, and Method Development**
+2408.05239v1 by Joshua Morriss, Tod Brindle, Jessica Bah Rösman, Daniel Reibsamen, Andreas Enz
 
-Cold-start active learning (CSAL) selects valuable instances from an
-unlabeled dataset for manual annotation. It provides high-quality data at a low
-annotation cost for label-scarce text classification. However, existing CSAL
-methods overlook weak classes and hard representative examples, resulting in
-biased learning. To address these issues, this paper proposes a novel
-dual-diversity enhancing and uncertainty-aware (DEUCE) framework for CSAL.
-Specifically, DEUCE leverages a pretrained language model (PLM) to efficiently
-extract textual representations, class predictions, and predictive uncertainty.
-Then, it constructs a Dual-Neighbor Graph (DNG) to combine information on both
-textual diversity and class diversity, ensuring a balanced data distribution.
-It further propagates uncertainty information via density-based clustering to
-select hard representative instances. DEUCE performs well in selecting
-class-balanced and hard representative data by dual-diversity and
-informativeness. Experiments on six NLP datasets demonstrate the superiority
-and efficiency of DEUCE.
+Systematic literature reviews are the highest quality of evidence in
+research. However, the review process is hindered by significant resource and
+data constraints. The Literature Review Network (LRN) is the first of its kind
+explainable AI platform adhering to PRISMA 2020 standards, designed to automate
+the entire literature review process. LRN was evaluated in the domain of
+surgical glove practices using 3 search strings developed by experts to query
+PubMed. A non-expert trained all LRN models. Performance was benchmarked
+against an expert manual review. Explainability and performance metrics
+assessed LRN's ability to replicate the experts' review. Concordance was
+measured with the Jaccard index and confusion matrices. Researchers were
+blinded to the other's results until study completion. Overlapping studies were
+integrated into an LRN-generated systematic review. LRN models demonstrated
+superior classification accuracy without expert training, achieving 84.78% and
+85.71% accuracy. The highest performance model achieved high interrater
+reliability (k = 0.4953) and explainability metrics, linking 'reduce',
+'accident', and 'sharp' with 'double-gloving'. Another LRN model covered 91.51%
+of the relevant literature despite diverging from the non-expert's judgments (k
+= 0.2174), with the terms 'latex', 'double' (gloves), and 'indication'. LRN
+outperformed the manual review (19,920 minutes over 11 months), reducing the
+entire process to 288.6 minutes over 5 days. This study demonstrates that
+explainable AI does not require expert training to successfully conduct
+PRISMA-compliant systematic literature reviews like an expert. LRN summarized
+the results of surgical glove studies and identified themes that were nearly
+identical to the clinical researchers' findings. Explainable AI can accurately
+expedite our understanding of clinical practices, potentially revolutionizing
+healthcare research.
 
-摘要：冷啟動主動學習 (CSAL) 從未標記的資料集中選取有價值的實例進行手動標記。它以低標記成本提供高品質的資料，用於標籤稀少的文字分類。然而，現有的 CSAL 方法忽略了弱類別和難以代表的範例，導致有偏差的學習。為了解決這些問題，本文提出了一個新的雙重多樣性增強和不確定性感知 (DEUCE) 架構，用於 CSAL。具體來說，DEUCE 利用預訓練的語言模型 (PLM) 來有效地提取文字表徵、類別預測和預測不確定性。然後，它構建一個雙鄰居圖 (DNG) 來結合文字多樣性和類別多樣性的資訊，確保平衡的資料分佈。它進一步通過基於密度的聚類來傳播不確定性資訊，以選擇難以代表的實例。DEUCE 在通過雙重多樣性和資訊性選擇類別平衡和難以代表的資料方面表現良好。在六個 NLP 資料集上的實驗證明了 DEUCE 的優越性和效率。
+摘要：系統性文獻回顧是研究中證據品質最高的。然而，回顧過程受到顯著資源和資料限制的阻礙。文獻回顧網路 (LRN) 是第一個遵循 PRISMA 2020 標準的可解釋 AI 平台，旨在自動化整個文獻回顧過程。LRN 在外科手套實務領域中進行評估，使用專家開發的 3 個搜尋字串來查詢 PubMed。非專家訓練所有 LRN 模型。效能以專家手動回顧作為基準。可解釋性和效能指標評估 LRN 複製專家回顧的能力。一致性以 Jaccard 指數和混淆矩陣測量。研究人員在研究完成前對彼此的結果保密。重疊的研究整合到 LRN 生成的系統性回顧中。LRN 模型在沒有專家訓練的情況下展現出優異的分類準確率，達到 84.78% 和 85.71% 的準確率。效能最高的模型達到了高評分者間信賴度 (k = 0.4953) 和可解釋性指標，將「減少」、「意外」和「銳利」與「雙重戴手套」連結在一起。另一個 LRN 模型涵蓋了 91.51% 的相關文獻，儘管與非專家的判斷不同 (k = 0.2174)，但包含了「乳膠」、「雙重」（手套）和「適應症」等詞彙。LRN 優於手動回顧（11 個月超過 19,920 分鐘），將整個過程縮短為 5 天超過 288.6 分鐘。這項研究顯示，可解釋的 AI 不需要專家訓練即可成功進行專家等級的 PRISMA 相容系統性文獻回顧。LRN 總結了外科手套研究的結果，並找出與臨床研究人員發現幾乎相同的主题。可解釋的 AI 可以準確地加快我們對臨床實務的理解，有潛力革新醫療保健研究。
 
-##### **Longer Attention Span: Increasing Transformer Context Length with Sparse Graph Processing Techniques**
-2502.01659v2 by Nathaniel Tomczak, Sanmukh Kuppannagari
+##### **Enhancing Medical Learning and Reasoning Systems: A Boxology-Based Comparative Analysis of Design Patterns**
+2408.02709v1 by Chi Him Ng
 
-Transformers have demonstrated great success in numerous domains including
-natural language processing and bioinformatics. This success stems from the use
-of the attention mechanism by these models in order to represent and propagate
-pairwise interactions between individual tokens of sequential data. However,
-the primary limitation of this operation is its quadratic memory and time
-complexity in relation to the input's context length - the length of a sequence
-over which the interactions need to be captured. This significantly limits the
-length of sequences that can be inferred upon by these models. Extensive
-research has been conducted to reduce the number of pairwise interactions to
-sub-quadratic in relation to the context length by introducing sparsity into
-the attention mechanism through the development of sparse attention masks.
-However, efficient implementations that achieve "true sparsity" are lacking.
-  In this work, we address this issue by proposing a graph computing view of
-attention where tokens are perceived as nodes of the graph and the attention
-mask determines the edges of the graph. Using this view, we develop graph
-processing algorithms to implement the attention mechanism. Both theoretically
-and empirically, we demonstrate that our algorithms only perform the needed
-computations, i.e., they are work optimal. We also perform extensive
-experimentation using popular attention masks to explore the impact of sparsity
-on execution time and achievable context length. Our experiments demonstrate
-significant speedups in execution times compared to state-of-the-art attention
-implementations such as FlashAttention for large sequence lengths. We also
-demonstrate that our algorithms are able to achieve extremely long sequence
-lengths of as high as 160 million on a single NVIDIA A100 GPU (SXM4 80GB).
+This study analyzes hybrid AI systems' design patterns and their
+effectiveness in clinical decision-making using the boxology framework. It
+categorizes and copares various architectures combining machine learning and
+rule-based reasoning to provide insights into their structural foundations and
+healthcare applications. Addressing two main questions, how to categorize these
+systems againts established design patterns and how to extract insights through
+comparative analysis, the study uses design patterns from software engineering
+to understand and optimize healthcare AI systems. Boxology helps identify
+commonalities and create reusable solutions, enhancing these systems'
+scalability, reliability, and performance. Five primary architectures are
+examined: REML, MLRB, RBML, RMLT, and PERML. Each has unique strengths and
+weaknesses, highlighting the need for tailored approaches in clinical tasks.
+REML excels in high-accuracy prediction for datasets with limited data; MLRB in
+handling large datasets and complex data integration; RBML in explainability
+and trustworthiness; RMLT in managing high-dimensional data; and PERML, though
+limited in analysis, shows promise in urgent care scenarios. The study
+introduces four new patterns, creates five abstract categorization patterns,
+and refines those five further to specific systems. These contributions enhance
+Boxlogy's taxonomical organization and offer novel approaches to integrating
+expert knowledge with machine learning. Boxology's structured, modular apporach
+offers significant advantages in developing and analyzing hybrid AI systems,
+revealing commonalities, and promoting reusable solutions. In conclusion, this
+study underscores hybrid AI systems' crucial role in advancing healthcare and
+Boxology's potential to drive further innovation in AI integration, ultimately
+improving clinical decision support and patient outcomes.
 
-摘要：變形金剛已在許多領域展現出巨大的成功，包括自然語言處理和生物資訊學。這種成功源自於這些模型使用注意機制來表示和傳播序列資料中各個標記之間成對的互動。然而，這種運算的主要限制在於其二次記憶體和時間複雜度與輸入的內容長度有關，也就是需要擷取互動的序列長度。這會顯著限制這些模型可以推論的序列長度。已經進行了大量的研究來減少成對互動的數量，使其與內容長度成次二次關係，方法是透過開發稀疏注意遮罩來將稀疏性引入注意機制。然而，缺乏能達成「真實稀疏性」的高效實作。在這項工作中，我們透過提出注意力的圖形運算檢視來解決這個問題，其中標記被視為圖形的節點，而注意力遮罩則決定圖形中的邊緣。使用這種檢視，我們開發了圖形處理演算法來實作注意力機制。我們在理論上和經驗上都證明了我們的演算法只執行必要的運算，也就是說，它們是工作最優的。我們也使用流行的注意力遮罩進行廣泛的實驗，以探討稀疏性對執行時間和可達成的內容長度的影響。我們的實驗證明，與最先進的注意力實作（例如 FlashAttention）相比，對於大型序列長度，我們的演算法在執行時間方面有顯著的加速。我們也證明了我們的演算法能夠在單一的 NVIDIA A100 GPU (SXM4 80GB) 上達成極長的序列長度，最高可達 1.6 億。
+摘要：本研究使用盒子學框架分析混合人工智慧系統的設計模式及其在臨床決策中的有效性。它分類並比較結合機器學習和基於規則的推理的各種架構，以深入了解其結構基礎和醫療保健應用。針對兩個主要問題，如何根據既定的設計模式對這些系統進行分類，以及如何通過比較分析提取見解，本研究使用軟體工程中的設計模式來了解和優化醫療保健人工智慧系統。盒子學有助於識別共性並建立可重複使用的解決方案，從而增強這些系統的可擴充性、可靠性和效能。檢查了五種主要的架構：REML、MLRB、RBML、RMLT 和 PERML。每種架構都有獨特的優缺點，強調了在臨床任務中需要量身打造的方法。REML 在資料有限的資料集中表現出高精度的預測；MLRB 在處理大型資料集和複雜資料整合方面表現出色；RBML 在可解釋性和可信度方面表現出色；RMLT 在管理高維資料方面表現出色；而 PERML 儘管在分析方面有限，但在緊急照護場景中表現出潛力。本研究引入了四種新模式，建立了五種抽象分類模式，並進一步將這五種模式細化為具體的系統。這些貢獻增強了盒子學的分類組織，並提供了將專家知識與機器學習整合的新方法。盒子學的結構化、模組化方法在開發和分析混合人工智慧系統、揭示共性以及推廣可重複使用的解決方案方面具有顯著優勢。總之，本研究強調了混合人工智慧系統在推進醫療保健中的關鍵作用，以及盒子學在推動人工智慧整合進一步創新方面的潛力，最終改善臨床決策支援和患者的治療成果。
 
-##### **Improving vision-language alignment with graph spiking hybrid Networks**
-2501.19069v1 by Siyu Zhang, Heming Zheng, Yiming Wu, Yeming Chen
+##### **Bayesian Kolmogorov Arnold Networks (Bayesian_KANs): A Probabilistic Approach to Enhance Accuracy and Interpretability**
+2408.02706v1 by Masoud Muhammed Hassan
 
-To bridge the semantic gap between vision and language (VL), it is necessary
-to develop a good alignment strategy, which includes handling semantic
-diversity, abstract representation of visual information, and generalization
-ability of models. Recent works use detector-based bounding boxes or patches
-with regular partitions to represent visual semantics. While current paradigms
-have made strides, they are still insufficient for fully capturing the nuanced
-contextual relations among various objects. This paper proposes a comprehensive
-visual semantic representation module, necessitating the utilization of
-panoptic segmentation to generate coherent fine-grained semantic features.
-Furthermore, we propose a novel Graph Spiking Hybrid Network (GSHN) that
-integrates the complementary advantages of Spiking Neural Networks (SNNs) and
-Graph Attention Networks (GATs) to encode visual semantic information.
-Intriguingly, the model not only encodes the discrete and continuous latent
-variables of instances but also adeptly captures both local and global
-contextual features, thereby significantly enhancing the richness and diversity
-of semantic representations. Leveraging the spatiotemporal properties inherent
-in SNNs, we employ contrastive learning (CL) to enhance the similarity-based
-representation of embeddings. This strategy alleviates the computational
-overhead of the model and enriches meaningful visual representations by
-constructing positive and negative sample pairs. We design an innovative
-pre-training method, Spiked Text Learning (STL), which uses text features to
-improve the encoding ability of discrete semantics. Experiments show that the
-proposed GSHN exhibits promising results on multiple VL downstream tasks.
+Because of its strong predictive skills, deep learning has emerged as an
+essential tool in many industries, including healthcare. Traditional deep
+learning models, on the other hand, frequently lack interpretability and omit
+to take prediction uncertainty into account two crucial components of clinical
+decision making. In order to produce explainable and uncertainty aware
+predictions, this study presents a novel framework called Bayesian Kolmogorov
+Arnold Networks (BKANs), which combines the expressive capacity of Kolmogorov
+Arnold Networks with Bayesian inference. We employ BKANs on two medical
+datasets, which are widely used benchmarks for assessing machine learning
+models in medical diagnostics: the Pima Indians Diabetes dataset and the
+Cleveland Heart Disease dataset. Our method provides useful insights into
+prediction confidence and decision boundaries and outperforms traditional deep
+learning models in terms of prediction accuracy. Moreover, BKANs' capacity to
+represent aleatoric and epistemic uncertainty guarantees doctors receive more
+solid and trustworthy decision support. Our Bayesian strategy improves the
+interpretability of the model and considerably minimises overfitting, which is
+important for tiny and imbalanced medical datasets, according to experimental
+results. We present possible expansions to further use BKANs in more
+complicated multimodal datasets and address the significance of these
+discoveries for future research in building reliable AI systems for healthcare.
+This work paves the way for a new paradigm in deep learning model deployment in
+vital sectors where transparency and reliability are crucial.
 
-摘要：<paragraph>為了彌合視覺和語言 (VL) 之間的語意差距，必須制定良好的對齊策略，其中包括處理語意多樣性、視覺資訊的抽象表示以及模型的泛化能力。最近的研究使用基於偵測器的邊界框或具有規則分割的區塊來表示視覺語意。雖然目前的範例已取得進展，但對於完全捕捉各種物件之間的細微脈絡關係仍不足夠。本文提出了一個全面的視覺語意表示模組，需要利用全景分割來產生連貫的細粒度語意特徵。此外，我們提出了一個新穎的圖形脈衝混合網路 (GSHN)，它整合了脈衝神經網路 (SNN) 和圖形注意力網路 (GAT) 的互補優勢來編碼視覺語意資訊。有趣的是，該模型不僅編碼實例的離散和連續潛在變數，還能巧妙地捕捉局部和全域脈絡特徵，從而顯著增強語意表示的豐富性和多樣性。利用 SNN 中固有的時空特性，我們採用對比學習 (CL) 來增強嵌入的基於相似性的表示。此策略減輕了模型的計算負擔，並透過建構正負樣本對來豐富有意義的視覺表示。我們設計了一個創新的預訓練方法，脈衝文本學習 (STL)，它使用文本特徵來提高離散語意的編碼能力。實驗表明，所提出的 GSHN 在多個 VL 下游任務上展現出有希望的結果。</paragraph>
+摘要：由於其強大的預測能力，深度學習已成為許多產業中不可或缺的工具，包括醫療保健。然而，傳統的深度學習模型通常缺乏可解釋性，並且忽略了將預測不確定性納入考量，而這兩個因素是臨床決策制定的關鍵組成部分。為了產生可解釋且具有不確定性意識的預測，本研究提出了一個名為貝氏柯爾莫哥洛夫阿諾德網路 (BKAN) 的新架構，它結合了柯爾莫哥洛夫阿諾德網路的表達能力與貝氏推論。我們在兩個醫學資料集上使用 BKAN，這些資料集是評估機器學習模型在醫學診斷中的廣泛使用基準：皮馬印第安人糖尿病資料集和克里夫蘭心臟病資料集。我們的模型提供了對預測信心和決策邊界的有益見解，並且在預測準確度方面優於傳統的深度學習模型。此外，BKAN 表現隨機和認識不確定性的能力，可確保醫生獲得更可靠且值得信賴的決策支援。根據實驗結果，我們的貝氏策略提高了模型的可解釋性，並大幅減少了過度擬合，這對於小型且不平衡的醫學資料集非常重要。我們提出了可能的擴充功能，以進一步將 BKAN 用於更複雜的多模式資料集，並探討這些發現對於未來建立可靠的醫療保健 AI 系統研究的重要性。這項工作為深度學習模型部署在透明度和可靠性至關重要的重要領域中開啟了一個新的典範。
 
-##### **Semantic Web and Creative AI -- A Technical Report from ISWS 2023**
-2501.18542v1 by Raia Abu Ahmad, Reham Alharbi, Roberto Barile, Martin Böckling, Francisco Bolanos, Sara Bonfitto, Oleksandra Bruns, Irene Celino, Yashrajsinh Chudasama, Martin Critelli, Claudia d'Amato, Giada D'Ippolito, Ioannis Dasoulas, Stefano De Giorgis, Vincenzo De Leo, Chiara Di Bonaventura, Marco Di Panfilo, Daniil Dobriy, John Domingue, Xuemin Duan, Michel Dumontier, Sefika Efeoglu, Ruben Eschauzier, Fakih Ginwa, Nicolas Ferranti, Arianna Graciotti, Philipp Hanisch, George Hannah, Golsa Heidari, Aidan Hogan, Hassan Hussein, Alexane Jouglar, Jan-Christoph Kalo, Manoé Kieffer, Antonis Klironomos, Inês Koch, Weronika Lajewska, Nicolas Lazzari, Mikael Lindekrans, Anna Sofia Lippolis, Majlinda Llugiqi, Eleonora Mancini, Eleonora Marzi, Laura Menotti, Daniela Milon Flores, Soulakshmee Nagowah, Kerstin Neubert, Emetis Niazmand, Ebrahim Norouzi, Beatriz Olarte Martinez, Anouk Michelle Oudshoorn, Andrea Poltronieri, Valentina Presutti, Disha Purohit, Ensiyeh Raoufi, Celian Ringwald, Johanna Rockstroh, Sebastian Rudolph, Harald Sack, Zafar Saeed, Mohammad Javad Saeedizade, Aya Sahbi, Cristian Santini, Aleksandra Simic, Dennis Sommer, Rita Sousa, Mary Ann Tan, Vidyashree Tarikere, Tabea Tietz, Liam Tirpitz, Arnaldo Tomasino, Frank van Harmelen, Joao Vissoci, Caitlin Woods, Bohui Zhang, Xinyue Zhang, Heng Zheng
+##### **MLtoGAI: Semantic Web based with Machine Learning for Enhanced Disease Prediction and Personalized Recommendations using Generative AI**
+2407.20284v1 by Shyam Dongre, Ritesh Chandra, Sonali Agarwal
 
-The International Semantic Web Research School (ISWS) is a week-long
-intensive program designed to immerse participants in the field. This document
-reports a collaborative effort performed by ten teams of students, each guided
-by a senior researcher as their mentor, attending ISWS 2023. Each team provided
-a different perspective to the topic of creative AI, substantiated by a set of
-research questions as the main subject of their investigation. The 2023 edition
-of ISWS focuses on the intersection of Semantic Web technologies and Creative
-AI. ISWS 2023 explored various intersections between Semantic Web technologies
-and creative AI. A key area of focus was the potential of LLMs as support tools
-for knowledge engineering. Participants also delved into the multifaceted
-applications of LLMs, including legal aspects of creative content production,
-humans in the loop, decentralised approaches to multimodal generative AI
-models, nanopublications and AI for personal scientific knowledge graphs,
-commonsense knowledge in automatic story and narrative completion, generative
-AI for art critique, prompt engineering, automatic music composition,
-commonsense prototyping and conceptual blending, and elicitation of tacit
-knowledge. As Large Language Models and semantic technologies continue to
-evolve, new exciting prospects are emerging: a future where the boundaries
-between creative expression and factual knowledge become increasingly permeable
-and porous, leading to a world of knowledge that is both informative and
-inspiring.
+In modern healthcare, addressing the complexities of accurate disease
+prediction and personalized recommendations is both crucial and challenging.
+This research introduces MLtoGAI, which integrates Semantic Web technology with
+Machine Learning (ML) to enhance disease prediction and offer user-friendly
+explanations through ChatGPT. The system comprises three key components: a
+reusable disease ontology that incorporates detailed knowledge about various
+diseases, a diagnostic classification model that uses patient symptoms to
+detect specific diseases accurately, and the integration of Semantic Web Rule
+Language (SWRL) with ontology and ChatGPT to generate clear, personalized
+health advice. This approach significantly improves prediction accuracy and
+ensures results that are easy to understand, addressing the complexity of
+diseases and diverse symptoms. The MLtoGAI system demonstrates substantial
+advancements in accuracy and user satisfaction, contributing to developing more
+intelligent and accessible healthcare solutions. This innovative approach
+combines the strengths of ML algorithms with the ability to provide
+transparent, human-understandable explanations through ChatGPT, achieving
+significant improvements in prediction accuracy and user comprehension. By
+leveraging semantic technology and explainable AI, the system enhances the
+accuracy of disease prediction and ensures that the recommendations are
+relevant and easily understood by individual patients. Our research highlights
+the potential of integrating advanced technologies to overcome existing
+challenges in medical diagnostics, paving the way for future developments in
+intelligent healthcare systems. Additionally, the system is validated using 200
+synthetic patient data records, ensuring robust performance and reliability.
 
-摘要：國際語意網路研究學校 (ISWS) 是一個為期一週的密集課程，旨在讓參與者沉浸在該領域中。本文件報告了由十個學生團隊進行的合作成果，每個團隊都由一位資深研究員作為導師，參加了 2023 年 ISWS。每個團隊都從不同的角度探討了創意 AI 主題，並以一系列研究問題作為調查的主要主題。2023 年版的 ISWS 關注於語意網路技術和創意 AI 的交集。ISWS 2023 探索了語意網路技術和創意 AI 之間的各種交集。一個重點關注領域是 LLM 作為知識工程的支援工具的潛力。參與者還深入探討了 LLM 的多方面應用，包括創意內容製作的法律方面、循環中的人類、多模態生成式 AI 模型的分散式方法、納米出版物和用於個人科學知識圖譜的 AI、自動故事和敘述完成中的常識知識、生成式 AI 用於藝術評論、提示工程、自動音樂創作、常識原型和概念混合，以及對默會知識的引導。隨著大型語言模型和語意技術的持續發展，新的令人興奮的前景正在出現：一個創意表達和事實知識之間的界限變得越來越可滲透和多孔的未來，從而導致一個既有資訊性又有啟發性的知識世界。
+摘要：在現代醫療保健中，解決準確疾病預測和個性化建議的複雜性既至關重要又具有挑戰性。本研究引入了 MLtoGAI，它將語義網路技術與機器學習 (ML) 相結合，以增強疾病預測並透過 ChatGPT 提供使用者友善的說明。該系統包含三個關鍵組成部分：一個可重複使用的疾病本体，其中包含有關各種疾病的詳細知識；一個診斷分類模型，它使用患者症狀來準確檢測特定疾病；以及語義網路規則語言 (SWRL) 與本体和 ChatGPT 的整合，以產生清晰、個性化的健康建議。這種方法顯著提高了預測準確性，並確保了易於理解的結果，解決了疾病和不同症狀的複雜性。MLtoGAI 系統展示了準確性和使用者滿意度的實質性進步，有助於開發更智慧且更易於取得的醫療保健解決方案。這種創新的方法結合了 ML 演算法的優點，以及透過 ChatGPT 提供透明且人類可以理解的說明的能力，在預測準確性和使用者理解方面取得了顯著的進步。透過利用語義技術和可解釋的 AI，該系統提高了疾病預測的準確性，並確保了建議與個別患者相關且易於理解。我們的研究強調了整合先進技術以克服醫療診斷中現有挑戰的潛力，為智慧醫療保健系統的未來發展鋪路。此外，該系統使用 200 個合成患者資料記錄進行驗證，確保了穩健的效能和可靠性。
 
-##### **Leveraging LLM Agents for Automated Optimization Modeling for SASP Problems: A Graph-RAG based Approach**
-2501.18320v1 by Tianpeng Pan, Wenqiang Pu, Licheng Zhao, Rui Zhou
+##### **Introducing δ-XAI: a novel sensitivity-based method for local AI explanations**
+2407.18343v2 by Alessandro De Carlo, Enea Parimbelli, Nicola Melillo, Giovanna Nicora
 
-Automated optimization modeling (AOM) has evoked considerable interest with
-the rapid evolution of large language models (LLMs). Existing approaches
-predominantly rely on prompt engineering, utilizing meticulously designed
-expert response chains or structured guidance. However, prompt-based techniques
-have failed to perform well in the sensor array signal processing (SASP) area
-due the lack of specific domain knowledge. To address this issue, we propose an
-automated modeling approach based on retrieval-augmented generation (RAG)
-technique, which consists of two principal components: a multi-agent (MA)
-structure and a graph-based RAG (Graph-RAG) process. The MA structure is
-tailored for the architectural AOM process, with each agent being designed
-based on principles of human modeling procedure. The Graph-RAG process serves
-to match user query with specific SASP modeling knowledge, thereby enhancing
-the modeling result. Results on ten classical signal processing problems
-demonstrate that the proposed approach (termed as MAG-RAG) outperforms several
-AOM benchmarks.
+Explainable Artificial Intelligence (XAI) is central to the debate on
+integrating Artificial Intelligence (AI) and Machine Learning (ML) algorithms
+into clinical practice. High-performing AI/ML models, such as ensemble learners
+and deep neural networks, often lack interpretability, hampering clinicians'
+trust in their predictions. To address this, XAI techniques are being developed
+to describe AI/ML predictions in human-understandable terms. One promising
+direction is the adaptation of sensitivity analysis (SA) and global sensitivity
+analysis (GSA), which inherently rank model inputs by their impact on
+predictions. Here, we introduce a novel delta-XAI method that provides local
+explanations of ML model predictions by extending the delta index, a GSA
+metric. The delta-XAI index assesses the impact of each feature's value on the
+predicted output for individual instances in both regression and classification
+problems. We formalize the delta-XAI index and provide code for its
+implementation. The delta-XAI method was evaluated on simulated scenarios using
+linear regression models, with Shapley values serving as a benchmark. Results
+showed that the delta-XAI index is generally consistent with Shapley values,
+with notable discrepancies in models with highly impactful or extreme feature
+values. The delta-XAI index demonstrated higher sensitivity in detecting
+dominant features and handling extreme feature values. Qualitatively, the
+delta-XAI provides intuitive explanations by leveraging probability density
+functions, making feature rankings clearer and more explainable for
+practitioners. Overall, the delta-XAI method appears promising for robustly
+obtaining local explanations of ML model predictions. Further investigations in
+real-world clinical settings will be conducted to evaluate its impact on
+AI-assisted clinical workflows.
 
-摘要：自動化最佳化建模 (AOM) 隨著大型語言模型 (LLM) 的快速演進而引起相當大的興趣。現有方法主要依賴提示工程，利用精心設計的專家回應鏈或結構化指導。然而，基於提示的技術由於缺乏特定領域知識，無法在感測器陣列訊號處理 (SASP) 領域中表現良好。為了解決這個問題，我們提出一個基於檢索增強生成 (RAG) 技術的自動化建模方法，它包含兩個主要組成部分：多代理 (MA) 結構和基於圖形的 RAG (Graph-RAG) 程序。MA 結構是針對架構 AOM 程序量身打造，每個代理都是根據人類建模程序的原理設計的。Graph-RAG 程序用於將使用者查詢與特定的 SASP 建模知識相匹配，從而增強建模結果。在十個經典訊號處理問題上的結果表明，所提出的方法（稱為 MAG-RAG）優於多個 AOM 基準。
+摘要：可解釋人工智慧 (XAI) 是將人工智慧 (AI) 和機器學習 (ML) 演算法整合到臨床實務中的辯論核心。高執行效能的 AI/ML 模型，例如整體學習器和深度神經網路，通常缺乏可解釋性，阻礙臨床醫生對其預測的信任。為了解決這個問題，正在開發 XAI 技術，以人類可以理解的術語描述 AI/ML 預測。一個有希望的方向是採用敏感度分析 (SA) 和全球敏感度分析 (GSA)，它們本質上會依據模型輸入對預測的影響來對其進行排名。在此，我們介紹一種新的 delta-XAI 方法，透過擴充 GSA 指標 delta 指數來提供 ML 模型預測的局部解釋。delta-XAI 指數評估每個特徵值對回歸和分類問題中個別例項的預測輸出之影響。我們將 delta-XAI 指數形式化，並提供其實作的程式碼。使用線性回歸模型對模擬情境評估 delta-XAI 方法，並以 Shapley 值作為基準。結果顯示 delta-XAI 指數通常與 Shapley 值一致，但在具有高度影響力或極端特徵值的模型中存在顯著差異。delta-XAI 指數在偵測主要特徵和處理極端特徵值方面表現出更高的敏感度。定性地來說，delta-XAI 透過利用機率密度函數提供直觀的解釋，使特徵排名更清晰且對從業人員來說更具可解釋性。總體而言，delta-XAI 方法對於穩健地取得 ML 模型預測的局部解釋似乎很有希望。將在真實世界的臨床環境中進行進一步調查，以評估其對 AI 輔助臨床工作流程的影響。
 
-##### **Mixed-Precision Graph Neural Quantization for Low Bit Large Language Models**
-2501.18154v1 by Wanlong Liu, Yichen Xiao, Dingyi Zeng, Hongyang Zhao, Wenyu Chen, Malu Zhang
+##### **Enhanced Deep Learning Methodologies and MRI Selection Techniques for Dementia Diagnosis in the Elderly Population**
+2407.17324v2 by Nikolaos Ntampakis, Konstantinos Diamantaras, Ioanna Chouvarda, Vasileios Argyriou, Panagiotis Sarigianndis
 
-Post-Training Quantization (PTQ) is pivotal for deploying large language
-models (LLMs) within resource-limited settings by significantly reducing
-resource demands. However, existing PTQ strategies underperform at low bit
-levels < 3 bits due to the significant difference between the quantized and
-original weights. To enhance the quantization performance at low bit widths, we
-introduce a Mixed-precision Graph Neural PTQ (MG-PTQ) approach, employing a
-graph neural network (GNN) module to capture dependencies among weights and
-adaptively assign quantization bit-widths. Through the information propagation
-of the GNN module, our method more effectively captures dependencies among
-target weights, leading to a more accurate assessment of weight importance and
-optimized allocation of quantization strategies. Extensive experiments on the
-WikiText2 and C4 datasets demonstrate that our MG-PTQ method outperforms
-previous state-of-the-art PTQ method GPTQ, setting new benchmarks for
-quantization performance under low-bit conditions.
+Dementia, a debilitating neurological condition affecting millions worldwide,
+presents significant diagnostic challenges. In this work, we introduce a novel
+methodology for the classification of demented and non-demented elderly
+patients using 3D brain Magnetic Resonance Imaging (MRI) scans. Our approach
+features a unique technique for selectively processing MRI slices, focusing on
+the most relevant brain regions and excluding less informative sections. This
+methodology is complemented by a confidence-based classification committee
+composed of three custom deep learning models: Dem3D ResNet, Dem3D CNN, and
+Dem3D EfficientNet. These models work synergistically to enhance
+decision-making accuracy, leveraging their collective strengths. Tested on the
+Open Access Series of Imaging Studies(OASIS) dataset, our method achieved an
+impressive accuracy of 94.12%, surpassing existing methodologies. Furthermore,
+validation on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset
+confirmed the robustness and generalizability of our approach. The use of
+explainable AI (XAI) techniques and comprehensive ablation studies further
+substantiate the effectiveness of our techniques, providing insights into the
+decision-making process and the importance of our methodology. This research
+offers a significant advancement in dementia diagnosis, providing a highly
+accurate and efficient tool for clinical applications.
 
-摘要：訓練後量化 (PTQ) 對於在資源受限的設定中部署大型語言模型 (LLM) 至關重要，因為它能顯著降低資源需求。然而，現有的 PTQ 策略在低位元層級 < 3 位元時表現不佳，因為量化後的權重與原始權重之間有顯著的差異。為了提升低位元寬度的量化效能，我們提出混合精度圖神經網路 PTQ (MG-PTQ) 方法，採用圖神經網路 (GNN) 模組來擷取權重之間的依存關係，並動態分配量化位元寬度。透過 GNN 模組的資訊傳播，我們的方法能更有效地擷取目標權重之間的依存關係，進而更準確地評估權重重要性，並最佳化量化策略的配置。在 WikiText2 和 C4 資料集上的廣泛實驗證明，我們的 MG-PTQ 方法優於先前的最先進 PTQ 方法 GPTQ，在低位元條件下設定了量化效能的新基準。
+摘要：失智症是一種影響全球數百萬人的衰弱性神經疾病，在診斷上具有重大挑戰。在這項工作中，我們提出了一種新的方法，用於對失智和非失智老年患者進行分類，使用 3D 大腦磁振造影 (MRI) 掃描。我們的做法採用了一種獨特技術，用於選擇性處理 MRI 切片，重點關注最相關的大腦區域，並排除信息量較少的部分。這種方法由一個基於信心的分類委員會補充，該委員會由三個自定義深度學習模型組成：Dem3D ResNet、Dem3D CNN 和 Dem3D EfficientNet。這些模型協同工作以增強決策的準確性，利用它們的集體優勢。在影像研究開放存取系列 (OASIS) 資料集上進行測試，我們的模型達到了 94.12% 的驚人準確度，超過了現有方法。此外，在阿茲海默症神經影像倡議 (ADNI) 資料集上的驗證證實了我們方法的穩健性和普遍性。可解釋 AI (XAI) 技術和全面的消融研究進一步證實了我們技術的有效性，提供了對決策過程和我們方法重要性的見解。這項研究為失智症診斷提供了重大進展，為臨床應用提供了一個高度準確且高效的工具。
 
-##### **Self-supervised Quantized Representation for Seamlessly Integrating Knowledge Graphs with Large Language Models**
-2501.18119v1 by Qika Lin, Tianzhe Zhao, Kai He, Zhen Peng, Fangzhi Xu, Ling Huang, Jingying Ma, Mengling Feng
+##### **Using Large Language Models to Compare Explainable Models for Smart Home Human Activity Recognition**
+2408.06352v1 by Michele Fiori, Gabriele Civitarese, Claudio Bettini
 
-Due to the presence of the natural gap between Knowledge Graph (KG)
-structures and the natural language, the effective integration of holistic
-structural information of KGs with Large Language Models (LLMs) has emerged as
-a significant question. To this end, we propose a two-stage framework to learn
-and apply quantized codes for each entity, aiming for the seamless integration
-of KGs with LLMs. Firstly, a self-supervised quantized representation (SSQR)
-method is proposed to compress both KG structural and semantic knowledge into
-discrete codes (\ie, tokens) that align the format of language sentences. We
-further design KG instruction-following data by viewing these learned codes as
-features to directly input to LLMs, thereby achieving seamless integration. The
-experiment results demonstrate that SSQR outperforms existing unsupervised
-quantized methods, producing more distinguishable codes. Further, the
-fine-tuned LLaMA2 and LLaMA3.1 also have superior performance on KG link
-prediction and triple classification tasks, utilizing only 16 tokens per entity
-instead of thousands in conventional prompting methods.
+Recognizing daily activities with unobtrusive sensors in smart environments
+enables various healthcare applications. Monitoring how subjects perform
+activities at home and their changes over time can reveal early symptoms of
+health issues, such as cognitive decline. Most approaches in this field use
+deep learning models, which are often seen as black boxes mapping sensor data
+to activities. However, non-expert users like clinicians need to trust and
+understand these models' outputs. Thus, eXplainable AI (XAI) methods for Human
+Activity Recognition have emerged to provide intuitive natural language
+explanations from these models. Different XAI methods generate different
+explanations, and their effectiveness is typically evaluated through user
+surveys, that are often challenging in terms of costs and fairness. This paper
+proposes an automatic evaluation method using Large Language Models (LLMs) to
+identify, in a pool of candidates, the best XAI approach for non-expert users.
+Our preliminary results suggest that LLM evaluation aligns with user surveys.
 
-摘要：由於知識圖譜 (KG) 結構與自然語言之間存在自然差距，將 KG 的整體結構資訊與大型語言模型 (LLM) 有效整合已成為一個重要的問題。為此，我們提出了一個兩階段架構來學習和應用每個實體的量化碼，旨在將 KG 與 LLM 無縫整合。首先，提出了一個自監督量化表示 (SSQR) 方法，將 KG 結構和語義知識壓縮成離散碼（即，符號），以對齊語言句子的格式。我們進一步設計 KG 指令遵循資料，將這些學習到的碼視為直接輸入 LLM 的特徵，從而實現無縫整合。實驗結果表明，SSQR 優於現有的無監督量化方法，產生更具區別性的碼。此外，微調後的 LLaMA2 和 LLaMA3.1 在 KG 連結預測和三元分類任務上也具有優異的性能，每個實體僅使用 16 個符號，而不是傳統提示方法中的數千個。
+摘要：藉由智慧環境中不引人注目的感測器辨識日常活動，能啟用各種醫療保健應用。監控受試者在家中如何執行活動，以及其隨著時間的變化，可以揭示健康問題的早期症狀，例如認知能力下降。此領域中的大多數方法都使用深度學習模型，這些模型通常被視為將感測器資料對應至活動的黑盒子。然而，非專家使用者（例如臨床醫師）需要信任並了解這些模型的輸出。因此，人類活動辨識的可解釋 AI (XAI) 方法應運而生，以提供來自這些模型的直覺自然語言說明。不同的 XAI 方法會產生不同的說明，而其有效性通常透過使用者調查來評估，這在成本和公平性方面通常具有挑戰性。本文提出使用大型語言模型 (LLM) 的自動評估方法，以在候選者中找出最適合非專家使用者的 XAI 方法。我們的初步結果表明，LLM 評估與使用者調查一致。
 
-##### **Hybrid Graphs for Table-and-Text based Question Answering using LLMs**
-2501.17767v1 by Ankush Agarwal, Ganesh S, Chaitanya Devaguptapu
+##### **Explainable AI-based Intrusion Detection System for Industry 5.0: An Overview of the Literature, associated Challenges, the existing Solutions, and Potential Research Directions**
+2408.03335v1 by Naseem Khan, Kashif Ahmad, Aref Al Tamimi, Mohammed M. Alani, Amine Bermak, Issa Khalil
+
+Industry 5.0, which focuses on human and Artificial Intelligence (AI)
+collaboration for performing different tasks in manufacturing, involves a
+higher number of robots, Internet of Things (IoTs) devices and
+interconnections, Augmented/Virtual Reality (AR), and other smart devices. The
+huge involvement of these devices and interconnection in various critical
+areas, such as economy, health, education and defense systems, poses several
+types of potential security flaws. AI itself has been proven a very effective
+and powerful tool in different areas of cybersecurity, such as intrusion
+detection, malware detection, and phishing detection, among others. Just as in
+many application areas, cybersecurity professionals were reluctant to accept
+black-box ML solutions for cybersecurity applications. This reluctance pushed
+forward the adoption of eXplainable Artificial Intelligence (XAI) as a tool
+that helps explain how decisions are made in ML-based systems. In this survey,
+we present a comprehensive study of different XAI-based intrusion detection
+systems for industry 5.0, and we also examine the impact of explainability and
+interpretability on Cybersecurity practices through the lens of Adversarial
+XIDS (Adv-XIDS) approaches. Furthermore, we analyze the possible opportunities
+and challenges in XAI cybersecurity systems for industry 5.0 that elicit future
+research toward XAI-based solutions to be adopted by high-stakes industry 5.0
+applications. We believe this rigorous analysis will establish a foundational
+framework for subsequent research endeavors within the specified domain.
+
+摘要：工業 5.0 著重於人類與人工智慧 (AI) 合作執行製造中的不同任務，涉及更多機器人、物聯網 (IoT) 裝置和互連、擴增/虛擬實境 (AR) 和其他智慧裝置。這些裝置和互連在經濟、醫療保健、教育和國防系統等各種關鍵領域的廣泛參與，引發了多種類型的潛在安全漏洞。AI 本身已被證明是網路安全不同領域中非常有效且強大的工具，例如入侵偵測、惡意軟體偵測和網路釣魚偵測等。就像在許多應用領域一樣，網路安全專業人員不願意接受黑盒 ML 解決方案來應用於網路安全。這種不願意促使可解釋人工智慧 (XAI) 作為一種工具被採用，有助於說明在基於 ML 的系統中如何做出決策。在這項調查中，我們對工業 5.0 的不同基於 XAI 的入侵偵測系統進行了全面的研究，並且我們也透過對抗式 XIDS (Adv-XIDS) 方法的觀點來探討可解釋性和可詮釋性對網路安全實務的影響。此外，我們分析了工業 5.0 的 XAI 網路安全系統中可能存在的機會和挑戰，引發了未來針對 XAI 基礎解決方案的研究，以供高風險的工業 5.0 應用採用。我們相信這項嚴謹的分析將為指定領域內的後續研究工作建立基礎架構。
+
+##### **A Comparative Study on Automatic Coding of Medical Letters with Explainability**
+2407.13638v1 by Jamie Glen, Lifeng Han, Paul Rayson, Goran Nenadic
 
-Answering questions that require reasoning and aggregation across both
-structured (tables) and unstructured (raw text) data sources presents
-significant challenges. Current methods rely on fine-tuning and high-quality,
-human-curated data, which is difficult to obtain. Recent advances in Large
-Language Models (LLMs) have shown promising results for multi-hop question
-answering (QA) over single-source text data in a zero-shot setting, yet
-exploration into multi-source Table-Text QA remains limited. In this paper, we
-present a novel Hybrid Graph-based approach for Table-Text QA that leverages
-LLMs without fine-tuning. Our method constructs a unified Hybrid Graph from
-textual and tabular data, pruning information based on the input question to
-provide the LLM with relevant context concisely. We evaluate our approach on
-the challenging Hybrid-QA and OTT-QA datasets using state-of-the-art LLMs,
-including GPT-3.5, GPT-4, and LLaMA-3. Our method achieves the best zero-shot
-performance on both datasets, improving Exact Match scores by up to 10% on
-Hybrid-QA and 5.4% on OTT-QA. Moreover, our approach reduces token usage by up
-to 53% compared to the original context.
+This study aims to explore the implementation of Natural Language Processing
+(NLP) and machine learning (ML) techniques to automate the coding of medical
+letters with visualised explainability and light-weighted local computer
+settings. Currently in clinical settings, coding is a manual process that
+involves assigning codes to each condition, procedure, and medication in a
+patient's paperwork (e.g., 56265001 heart disease using SNOMED CT code). There
+are preliminary research on automatic coding in this field using
+state-of-the-art ML models; however, due to the complexity and size of the
+models, the real-world deployment is not achieved. To further facilitate the
+possibility of automatic coding practice, we explore some solutions in a local
+computer setting; in addition, we explore the function of explainability for
+transparency of AI models. We used the publicly available MIMIC-III database
+and the HAN/HLAN network models for ICD code prediction purposes. We also
+experimented with the mapping between ICD and SNOMED CT knowledge bases. In our
+experiments, the models provided useful information for 97.98\% of codes. The
+result of this investigation can shed some light on implementing automatic
+clinical coding in practice, such as in hospital settings, on the local
+computers used by clinicians , project page
+\url{https://github.com/Glenj01/Medical-Coding}.
 
-摘要：回答需要對結構化（表格）和非結構化（原始文字）資料來源進行推理和彙總的問題會帶來重大挑戰。目前的辦法仰賴微調和高品質、人工整理的資料，而這很難取得。大型語言模型（LLM）的最新進展已針對零次學習設定的單一來源文字資料多跳問題回答（QA）展現出有希望的結果，但對多來源表格文字 QA 的探討仍然有限。在本文中，我們提出了一種新穎的基於混合圖表的表格文字 QA 方法，它利用 LLM 而無需微調。我們的辦法從文字和表格資料建構一個統一的混合圖表，根據輸入問題修剪資訊，以簡潔地為 LLM 提供相關脈絡。我們使用最先進的 LLM，包括 GPT-3.5、GPT-4 和 LLaMA-3，針對具有挑戰性的 Hybrid-QA 和 OTT-QA 資料集評估我們的辦法。我們的辦法在兩個資料集上都達到了最佳的零次學習效能，在 Hybrid-QA 上將完全比對分數提高了 10%，在 OTT-QA 上將完全比對分數提高了 5.4%。此外，與原始脈絡相比，我們的辦法將符號使用量減少了 53%。
+摘要：本研究旨在探討將自然語言處理 (NLP) 和機器學習 (ML) 技術實作於醫療信函編碼自動化，並具備視覺化說明能力和輕量化的本地電腦設定。目前在臨床環境中，編碼是一種手動流程，涉及為病患文件中的每項病症、程序和藥物指派代碼 (例如，使用 SNOMED CT 代碼 56265001 表示心臟病)。此領域有使用最新 ML 模型進行自動編碼的初步研究；然而，由於模型的複雜性和大小，並未實現實際部署。為了進一步促進自動編碼實務的可能性，我們在本地電腦設定中探討了一些解決方案；此外，我們探討了說明功能在 AI 模型透明度中的功能。我們使用公開的 MIMIC-III 資料庫和 HAN/HLAN 網路模型進行 ICD 代碼預測。我們還試驗了 ICD 和 SNOMED CT 知識庫之間的對應。在我們的實驗中，這些模型提供了 97.98% 代碼的有用資訊。這項調查結果可以為實務中的自動臨床編碼實作提供一些見解，例如在醫院環境中，由臨床醫生使用的本地電腦，專案頁面 \url{https://github.com/Glenj01/Medical-Coding}。
 
-##### **Query-Aware Learnable Graph Pooling Tokens as Prompt for Large Language Models**
-2501.17549v1 by Wooyoung Kim, Byungyoon Park, Wooju Kim
+##### **Explainable AI for Enhancing Efficiency of DL-based Channel Estimation**
+2407.07009v1 by Abdul Karim Gizzini, Yahia Medjahdi, Ali J. Ghandour, Laurent Clavier
 
-Graph-structured data plays a vital role in numerous domains, such as social
-networks, citation networks, commonsense reasoning graphs and knowledge graphs.
-While graph neural networks have been employed for graph processing, recent
-advancements have explored integrating large language models for graph-based
-tasks. In this paper, we propose a novel approach named Learnable Graph Pooling
-Token (LGPT), which addresses the limitations of the scalability issues in
-node-level projection and information loss in graph-level projection. LGPT
-enables flexible and efficient graph representation by introducing learnable
-parameters that act as tokens in large language models, balancing fine-grained
-and global graph information. Additionally, we investigate an Early Query
-Fusion technique, which fuses query context before constructing the graph
-representation, leading to more effective graph embeddings. Our method achieves
-a 4.13\% performance improvement on the GraphQA benchmark without training the
-large language model, demonstrating significant gains in handling complex
-textual-attributed graph data.
+The support of artificial intelligence (AI) based decision-making is a key
+element in future 6G networks, where the concept of native AI will be
+introduced. Moreover, AI is widely employed in different critical applications
+such as autonomous driving and medical diagnosis. In such applications, using
+AI as black-box models is risky and challenging. Hence, it is crucial to
+understand and trust the decisions taken by these models. Tackling this issue
+can be achieved by developing explainable AI (XAI) schemes that aim to explain
+the logic behind the black-box model behavior, and thus, ensure its efficient
+and safe deployment. Recently, we proposed a novel perturbation-based XAI-CHEST
+framework that is oriented toward channel estimation in wireless
+communications. The core idea of the XAI-CHEST framework is to identify the
+relevant model inputs by inducing high noise on the irrelevant ones. This
+manuscript provides the detailed theoretical foundations of the XAI-CHEST
+framework. In particular, we derive the analytical expressions of the XAI-CHEST
+loss functions and the noise threshold fine-tuning optimization problem. Hence
+the designed XAI-CHEST delivers a smart input feature selection methodology
+that can further improve the overall performance while optimizing the
+architecture of the employed model. Simulation results show that the XAI-CHEST
+framework provides valid interpretations, where it offers an improved bit error
+rate performance while reducing the required computational complexity in
+comparison to the classical DL-based channel estimation.
 
-摘要：圖形結構資料在許多領域中扮演著至關重要的角色，例如社交網路、引用網路、常識推理圖形和知識圖形。雖然圖形神經網路已用於圖形處理，但最近的進展已探討整合大型語言模型以進行基於圖形的任務。在本文中，我們提出了一種名為可學習圖形池化令牌 (LGPT) 的新方法，它解決了節點層級投影中的可擴充性問題和圖形層級投影中的資訊遺失限制。LGPT 透過引入可學習的參數（在大型語言模型中作為令牌運作）來啟用彈性和高效的圖形表示，平衡細粒度和整體圖形資訊。此外，我們研究了一種早期查詢融合技術，它在建構圖形表示之前融合查詢內容，進而產生更有效的圖形嵌入。我們的方法在 GraphQA 基準上達到了 4.13% 的效能提升，而無需訓練大型語言模型，證明了在處理複雜的文字屬性圖形資料方面有顯著的進展。
+摘要：人工智能 (AI) 支持的決策制定是未來 6G 網路中的關鍵元素，其中將引入原生 AI 的概念。此外，AI 廣泛用於不同的關鍵應用中，例如自動駕駛和醫療診斷。在這些應用中，使用 AI 作為黑盒模型是有風險且具有挑戰性的。因此，理解和信任這些模型做出的決策至關重要。解決此問題的方法是開發可解釋 AI (XAI) 架構，旨在解釋黑盒模型行為背後的邏輯，從而確保其有效且安全的部署。最近，我們提出了一個新的基於擾動的 XAI-CHEST 框架，該框架面向無線通信中的信道估計。XAI-CHEST 框架的核心思想是通過在無關輸入上引入高噪聲來識別相關模型輸入。這份手稿提供了 XAI-CHEST 框架的詳細理論基礎。特別是，我們推導了 XAI-CHEST 損失函數和噪聲閾值微調優化問題的解析表達式。因此，設計的 XAI-CHEST 提供了一種智能輸入特徵選擇方法，可以在優化所用模型的架構的同時進一步提高整體性能。模擬結果表明，XAI-CHEST 框架提供了有效的解釋，在降低所需的計算複雜度的同時，提供了改進的比特錯誤率性能，而這與基於傳統 DL 的信道估計相比。
 
-##### **General Scene Adaptation for Vision-and-Language Navigation**
-2501.17403v1 by Haodong Hong, Yanyuan Qiao, Sen Wang, Jiajun Liu, Qi Wu
+##### **Explainable AI: Comparative Analysis of Normal and Dilated ResNet Models for Fundus Disease Classification**
+2407.05440v2 by P. N. Karthikayan, Yoga Sri Varshan V, Hitesh Gupta Kattamuri, Umarani Jayaraman
 
-Vision-and-Language Navigation (VLN) tasks mainly evaluate agents based on
-one-time execution of individual instructions across multiple environments,
-aiming to develop agents capable of functioning in any environment in a
-zero-shot manner. However, real-world navigation robots often operate in
-persistent environments with relatively consistent physical layouts, visual
-observations, and language styles from instructors. Such a gap in the task
-setting presents an opportunity to improve VLN agents by incorporating
-continuous adaptation to specific environments. To better reflect these
-real-world conditions, we introduce GSA-VLN, a novel task requiring agents to
-execute navigation instructions within a specific scene and simultaneously
-adapt to it for improved performance over time. To evaluate the proposed task,
-one has to address two challenges in existing VLN datasets: the lack of OOD
-data, and the limited number and style diversity of instructions for each
-scene. Therefore, we propose a new dataset, GSA-R2R, which significantly
-expands the diversity and quantity of environments and instructions for the R2R
-dataset to evaluate agent adaptability in both ID and OOD contexts.
-Furthermore, we design a three-stage instruction orchestration pipeline that
-leverages LLMs to refine speaker-generated instructions and apply role-playing
-techniques to rephrase instructions into different speaking styles. This is
-motivated by the observation that each individual user often has consistent
-signatures or preferences in their instructions. We conducted extensive
-experiments on GSA-R2R to thoroughly evaluate our dataset and benchmark various
-methods. Based on our findings, we propose a novel method, GR-DUET, which
-incorporates memory-based navigation graphs with an environment-specific
-training strategy, achieving state-of-the-art results on all GSA-R2R splits.
+This paper presents dilated Residual Network (ResNet) models for disease
+classification from retinal fundus images. Dilated convolution filters are used
+to replace normal convolution filters in the higher layers of the ResNet model
+(dilated ResNet) in order to improve the receptive field compared to the normal
+ResNet model for disease classification. This study introduces
+computer-assisted diagnostic tools that employ deep learning, enhanced with
+explainable AI techniques. These techniques aim to make the tool's
+decision-making process transparent, thereby enabling medical professionals to
+understand and trust the AI's diagnostic decision. They are particularly
+relevant in today's healthcare landscape, where there is a growing demand for
+transparency in AI applications to ensure their reliability and ethical use.
+The dilated ResNet is used as a replacement for the normal ResNet to enhance
+the classification accuracy of retinal eye diseases and reduce the required
+computing time. The dataset used in this work is the Ocular Disease Intelligent
+Recognition (ODIR) dataset which is a structured ophthalmic database with eight
+classes covering most of the common retinal eye diseases. The evaluation
+metrics used in this work include precision, recall, accuracy, and F1 score. In
+this work, a comparative study has been made between normal ResNet models and
+dilated ResNet models on five variants namely ResNet-18, ResNet-34, ResNet-50,
+ResNet-101, and ResNet-152. The dilated ResNet model shows promising results as
+compared to normal ResNet with an average F1 score of 0.71, 0.70, 0.69, 0.67,
+and 0.70 respectively for the above respective variants in ODIR multiclass
+disease classification.
 
-摘要：視覺語言導航 (VLN) 任務主要根據代理程式在多個環境中執行個別指令的一次性執行來評估代理程式，旨在開發能夠在任何環境中以零次學習的方式運作的代理程式。然而，真實世界的導航機器人通常在持續性的環境中運作，而這些環境具有相對一致的物理配置、視覺觀察和指令的語言風格。任務設定中的這種差距提供了一個機會，可以透過將連續適應特定環境納入其中來改善 VLN 代理程式。為了更好地反映這些真實世界的條件，我們推出了 GSA-VLN，這是一個新任務，要求代理程式在特定場景中執行導航指令，並同時適應該場景，以隨著時間推移而提高效能。為了評估所提出的任務，必須解決現有 VLN 資料集中的兩個挑戰：缺乏 OOD 資料，以及每個場景的指令數量和風格多樣性有限。因此，我們提出了一個新的資料集 GSA-R2R，它顯著擴展了 R2R 資料集的環境和指令的多樣性和數量，以評估代理程式在 ID 和 OOD 背景下的適應能力。此外，我們設計了一個三階段指令編排管道，該管道利用大型語言模型 (LLM) 來精煉由說話者產生的指令，並應用角色扮演技巧將指令改寫成不同的說話風格。這項技術的靈感來自於觀察到每個個別使用者通常在其指令中具有相符的簽名或偏好。我們針對 GSA-R2R 進行了大量的實驗，以徹底評估我們的資料集和基準各種方法。根據我們的研究結果，我們提出了一種新的方法 GR-DUET，它將基於記憶的導航圖表與特定於環境的訓練策略結合在一起，在所有 GSA-R2R 分割中取得了最先進的結果。
+摘要：这篇论文提出了用于从视网膜眼底图像进行疾病分类的扩张残差网络 (ResNet) 模型。扩张卷积滤波器用于替换 ResNet 模型较高层中的正常卷积滤波器（扩张 ResNet），以改善感知场，从而针对疾病分类对正常 ResNet 模型进行改进。本研究引入了采用深度学习的计算机辅助诊断工具，并通过可解释的 AI 技术进行了增强。这些技术旨在使该工具的决策过程透明化，从而使医学专业人士能够理解和信任 AI 的诊断决策。它们与当今的医疗保健领域尤为相关，在该领域，对 AI 应用的透明度需求不断增长，以确保其可靠性和合乎道德的使用。扩张 ResNet 用作正常 ResNet 的替代品，以提高视网膜眼部疾病的分类准确性并减少所需的计算时间。本工作中使用的数据集是眼科疾病智能识别 (ODIR) 数据集，这是一个结构化的眼科数据库，包含八类涵盖大多数常见视网膜眼部疾病。本工作中使用的评估指标包括精确度、召回率、准确度和 F1 得分。在这项工作中，对 ResNet-18、ResNet-34、ResNet-50、ResNet-101 和 ResNet-152 五个变体的正常 ResNet 模型和扩张 ResNet 模型进行了比较研究。与正常 ResNet 相比，扩张 ResNet 模型显示出有希望的结果，在 ODIR 多类疾病分类中，上述各个变体的平均 F1 得分为 0.71、0.70、0.69、0.67 和 0.70。
 
-##### **Comprehensive Evaluation for a Large Scale Knowledge Graph Question Answering Service**
-2501.17270v1 by Saloni Potdar, Daniel Lee, Omar Attia, Varun Embar, De Meng, Ramesh Balaji, Chloe Seivwright, Eric Choi, Mina H. Farid, Yiwen Sun, Yunyao Li
+##### **A Survey on Trustworthiness in Foundation Models for Medical Image Analysis**
+2407.15851v2 by Congzhen Shi, Ryan Rezai, Jiaxi Yang, Qi Dou, Xiaoxiao Li
 
-Question answering systems for knowledge graph (KGQA), answer factoid
-questions based on the data in the knowledge graph. KGQA systems are complex
-because the system has to understand the relations and entities in the
-knowledge-seeking natural language queries and map them to structured queries
-against the KG to answer them. In this paper, we introduce Chronos, a
-comprehensive evaluation framework for KGQA at industry scale. It is designed
-to evaluate such a multi-component system comprehensively, focusing on (1)
-end-to-end and component-level metrics, (2) scalable to diverse datasets and
-(3) a scalable approach to measure the performance of the system prior to
-release. In this paper, we discuss the unique challenges associated with
-evaluating KGQA systems at industry scale, review the design of Chronos, and
-how it addresses these challenges. We will demonstrate how it provides a base
-for data-driven decisions and discuss the challenges of using it to measure and
-improve a real-world KGQA system.
+The rapid advancement of foundation models in medical imaging represents a
+significant leap toward enhancing diagnostic accuracy and personalized
+treatment. However, the deployment of foundation models in healthcare
+necessitates a rigorous examination of their trustworthiness, encompassing
+privacy, robustness, reliability, explainability, and fairness. The current
+body of survey literature on foundation models in medical imaging reveals
+considerable gaps, particularly in the area of trustworthiness. Additionally,
+existing surveys on the trustworthiness of foundation models do not adequately
+address their specific variations and applications within the medical imaging
+domain. This survey aims to fill that gap by presenting a novel taxonomy of
+foundation models used in medical imaging and analyzing the key motivations for
+ensuring their trustworthiness. We review current research on foundation models
+in major medical imaging applications, focusing on segmentation, medical report
+generation, medical question and answering (Q\&A), and disease diagnosis. These
+areas are highlighted because they have seen a relatively mature and
+substantial number of foundation models compared to other applications. We
+focus on literature that discusses trustworthiness in medical image analysis
+manuscripts. We explore the complex challenges of building trustworthy
+foundation models for each application, summarizing current concerns and
+strategies for enhancing trustworthiness. Furthermore, we examine the potential
+of these models to revolutionize patient care. Our analysis underscores the
+imperative for advancing towards trustworthy AI in medical image analysis,
+advocating for a balanced approach that fosters innovation while ensuring
+ethical and equitable healthcare delivery.
 
-摘要：知識圖譜問答系統 (KGQA) 根據知識圖譜中的資料回答事實問題。KGQA 系統很複雜，因為系統必須理解知識尋求自然語言查詢中的關係和實體，並將它們對映到針對知識圖譜的結構化查詢，才能回答這些查詢。在本文中，我們介紹了 Chronos，這是一個用於產業規模 KGQA 的全面評估框架。它旨在全面評估這種多組件系統，重點關注：(1) 端對端和組件層級指標，(2) 可擴充至各種資料集，以及 (3) 可擴充的方法，用於在釋出前衡量系統的效能。在本文中，我們討論了與產業規模 KGQA 系統評估相關的獨特挑戰，檢視 Chronos 的設計，以及它如何應對這些挑戰。我們將展示它如何提供資料驅動決策的基礎，並討論使用它來衡量和改善真實世界 KGQA 系統的挑戰。
+摘要：基礎模型在醫學影像方面的快速進展，代表著在加強診斷準確性和個人化治療方面邁出一大步。然而，基礎模型在醫療保健中的部署需要對其可信度進行嚴格的審查，包括隱私、穩健性、可靠性、可解釋性和公平性。目前關於醫學影像中基礎模型的調查文獻中顯示出相當大的差距，特別是在可信度方面。此外，現有關於基礎模型可信度的調查並未充分解決其在醫學影像領域中的特定變化和應用。本調查旨在通過提出醫學影像中使用的基礎模型的新分類法並分析確保其可信度的關鍵動機，來填補這一空白。我們回顧了基礎模型在主要醫學影像應用中的當前研究，重點關注分割、醫療報告生成、醫療問題和回答 (Q&A) 以及疾病診斷。這些領域之所以被強調，是因為與其他應用相比，它們已經看到相對成熟且大量的基礎模型。我們專注於探討醫學影像分析手稿中可信度的文獻。我們探討了為每個應用構建可信基礎模型的複雜挑戰，總結了當前關注點和增強可信度的策略。此外，我們探討了這些模型在革新患者護理方面的潛力。我們的分析強調了在醫學影像分析中朝著可信賴的人工智慧邁進的必要性，並倡導一種平衡的方法，既能促進創新，又能確保道德和公平的醫療保健服務。
 
-##### **FactCG: Enhancing Fact Checkers with Graph-Based Multi-Hop Data**
-2501.17144v1 by Deren Lei, Yaxi Li, Siyao Li, Mengya Hu, Rui Xu, Ken Archer, Mingyu Wang, Emily Ching, Alex Deng
+##### **The Impact of an XAI-Augmented Approach on Binary Classification with Scarce Data**
+2407.06206v1 by Ximing Wen, Rosina O. Weber, Anik Sen, Darryl Hannan, Steven C. Nesbit, Vincent Chan, Alberto Goffi, Michael Morris, John C. Hunninghake, Nicholas E. Villalobos, Edward Kim, Christopher J. MacLellan
 
-Prior research on training grounded factuality classification models to
-detect hallucinations in large language models (LLMs) has relied on public
-natural language inference (NLI) data and synthetic data. However, conventional
-NLI datasets are not well-suited for document-level reasoning, which is
-critical for detecting LLM hallucinations. Recent approaches to document-level
-synthetic data generation involve iteratively removing sentences from documents
-and annotating factuality using LLM-based prompts. While effective, this method
-is computationally expensive for long documents and limited by the LLM's
-capabilities. In this work, we analyze the differences between existing
-synthetic training data used in state-of-the-art models and real LLM output
-claims. Based on our findings, we propose a novel approach for synthetic data
-generation, CG2C, that leverages multi-hop reasoning on context graphs
-extracted from documents. Our fact checker model, FactCG, demonstrates improved
-performance with more connected reasoning, using the same backbone models.
-Experiments show it even outperforms GPT-4-o on the LLM-Aggrefact benchmark
-with much smaller model size.
+Point-of-Care Ultrasound (POCUS) is the practice of clinicians conducting and
+interpreting ultrasound scans right at the patient's bedside. However, the
+expertise needed to interpret these images is considerable and may not always
+be present in emergency situations. This reality makes algorithms such as
+machine learning classifiers extremely valuable to augment human decisions.
+POCUS devices are becoming available at a reasonable cost in the size of a
+mobile phone. The challenge of turning POCUS devices into life-saving tools is
+that interpretation of ultrasound images requires specialist training and
+experience. Unfortunately, the difficulty to obtain positive training images
+represents an important obstacle to building efficient and accurate
+classifiers. Hence, the problem we try to investigate is how to explore
+strategies to increase accuracy of classifiers trained with scarce data. We
+hypothesize that training with a few data instances may not suffice for
+classifiers to generalize causing them to overfit. Our approach uses an
+Explainable AI-Augmented approach to help the algorithm learn more from less
+and potentially help the classifier better generalize.
 
-摘要：先前的研究訓練了基於事實的分類模型，以偵測大型語言模型 (LLM) 中的幻覺，依賴於公開的自然語言推論 (NLI) 資料和合成資料。然而，傳統的 NLI 資料集並不適合文件層級的推理，這對於偵測 LLM 的幻覺至關重要。最近的文件層級合成資料生成方法涉及從文件中反覆移除句子，並使用基於 LLM 的提示註解事實。雖然有效，但此方法對於長文件來說在運算上很昂貴，且受限於 LLM 的能力。在這項工作中，我們分析了現有合成訓練資料與最先進模型中使用的真實 LLM 輸出宣告之間的差異。根據我們的研究結果，我們提出了一個用於合成資料生成的創新方法 CG2C，它利用從文件中提取的內容圖表進行多跳推理。我們的查核模型 FactCG 使用相同的骨幹模型，展示了在更多連結的推理下改進的效能。實驗表明，它甚至在 LLM-Aggrefact 基準上優於 GPT-4-o，且模型大小小得多。
+摘要：床邊超音波 (POCUS) 是臨床醫師在患者床邊進行和解讀超音波掃描的實務。然而，解讀這些影像所需的專業知識相當可觀，而且在緊急情況下可能並非隨時具備。這種現實情況使得機器學習分類器等演算法對於加強人類決策變得極為有價值。POCUS 裝置正以合理成本推出，尺寸為手機大小。將 POCUS 裝置轉變為救生工具的挑戰在於，解讀超音波影像需要專門訓練和經驗。不幸的是，取得正向訓練影像的困難度代表著建置有效率且準確的分類器的一大障礙。因此，我們嘗試探討的問題是如何探索策略，以提高使用稀疏資料訓練的分類器的準確度。我們假設使用少數資料實例進行訓練可能不足以讓分類器概括，導致它們過度擬合。我們的做法使用可解釋 AI 增強方法，以協助演算法從較少的資料中學習更多，並潛在協助分類器更好地概括。
 
-##### **LLM-AutoDiff: Auto-Differentiate Any LLM Workflow**
-2501.16673v2 by Li Yin, Zhangyang Wang
+##### **Can GPT-4 Help Detect Quit Vaping Intentions? An Exploration of Automatic Data Annotation Approach**
+2407.00167v1 by Sai Krishna Revanth Vuruma, Dezhi Wu, Saborny Sen Gupta, Lucas Aust, Valerie Lookingbill, Wyatt Bellamy, Yang Ren, Erin Kasson, Li-Shiun Chen, Patricia Cavazos-Rehg, Dian Hu, Ming Huang
 
-Large Language Models (LLMs) have reshaped natural language processing,
-powering applications from multi-hop retrieval and question answering to
-autonomous agent workflows. Yet, prompt engineering -- the task of crafting
-textual inputs to effectively direct LLMs -- remains difficult and
-labor-intensive, particularly for complex pipelines that combine multiple LLM
-calls with functional operations like retrieval and data formatting. We
-introduce LLM-AutoDiff: a novel framework for Automatic Prompt Engineering
-(APE) that extends textual gradient-based methods (such as Text-Grad) to
-multi-component, potentially cyclic LLM architectures. Implemented within the
-AdalFlow library, LLM-AutoDiff treats each textual input as a trainable
-parameter and uses a frozen backward engine LLM to generate feedback-akin to
-textual gradients -- that guide iterative prompt updates. Unlike prior
-single-node approaches, LLM-AutoDiff inherently accommodates functional nodes,
-preserves time-sequential behavior in repeated calls (e.g., multi-hop loops),
-and combats the "lost-in-the-middle" problem by isolating distinct sub-prompts
-(instructions, formats, or few-shot examples). It further boosts training
-efficiency by focusing on error-prone samples through selective gradient
-computation. Across diverse tasks, including single-step classification,
-multi-hop retrieval-based QA, and agent-driven pipelines, LLM-AutoDiff
-consistently outperforms existing textual gradient baselines in both accuracy
-and training cost. By unifying prompt optimization through a graph-centric
-lens, LLM-AutoDiff offers a powerful new paradigm for scaling and automating
-LLM workflows - mirroring the transformative role that automatic
-differentiation libraries have long played in neural network research.
+In recent years, the United States has witnessed a significant surge in the
+popularity of vaping or e-cigarette use, leading to a notable rise in cases of
+e-cigarette and vaping use-associated lung injury (EVALI) that caused
+hospitalizations and fatalities during the EVALI outbreak in 2019, highlighting
+the urgency to comprehend vaping behaviors and develop effective strategies for
+cessation. Due to the ubiquity of social media platforms, over 4.7 billion
+users worldwide use them for connectivity, communications, news, and
+entertainment with a significant portion of the discourse related to health,
+thereby establishing social media data as an invaluable organic data resource
+for public health research. In this study, we extracted a sample dataset from
+one vaping sub-community on Reddit to analyze users' quit-vaping intentions.
+Leveraging OpenAI's latest large language model GPT-4 for sentence-level quit
+vaping intention detection, this study compares the outcomes of this model
+against layman and clinical expert annotations. Using different prompting
+strategies such as zero-shot, one-shot, few-shot and chain-of-thought
+prompting, we developed 8 prompts with varying levels of detail to explain the
+task to GPT-4 and also evaluated the performance of the strategies against each
+other. These preliminary findings emphasize the potential of GPT-4 in social
+media data analysis, especially in identifying users' subtle intentions that
+may elude human detection.
 
-摘要：大型語言模型 (LLM) 已重塑自然語言處理，
-為從多跳檢索和問答到
-自主代理工作流程的應用提供動力。然而，提示工程 -- 編寫
-文本輸入以有效指導 LLM 的任務 -- 仍然困難且
-勞動密集，特別是對於將多個 LLM
-呼叫與檢索和數據格式化等功能操作相結合的複雜管道。我們
-介紹 LLM-AutoDiff：一個用於自動提示工程 (APE) 的新框架，它將基於文本梯度的
-方法（例如 Text-Grad）擴展到多組件、潛在循環 LLM 架構中。在
-AdalFlow 庫中實施，LLM-AutoDiff 將每個文本輸入視為一個可訓練
-參數，並使用凍結的後向引擎 LLM 生成反饋——類似於
-文本梯度——指導迭代提示更新。與先前的
-單節點方法不同，LLM-AutoDiff 本質上適應功能節點，
-在重複呼叫（例如，多跳循環）中保留時間順序行為，
-並通過隔離不同的子提示（說明、格式或少數鏡頭示例）來解決“迷失在中間”問題。它進一步提高訓練
-效率，通過選擇性梯度
-計算專注於容易出錯的樣本。在包括單步分類、
-多跳基於檢索的問答和代理驅動管道在內的各種任務中，LLM-AutoDiff
-在準確性和訓練成本方面始終優於現有的文本梯度基準。通過圖形中心化
-視角統一提示優化，LLM-AutoDiff 為擴展和自動化
-LLM 工作流程提供了一個強大的新範例——反映了自動
-微分庫在神經網絡研究中長期扮演的變革性角色。
+摘要：近年來，美國見證了電子煙或電子香菸使用率大幅激增，導致電子煙和電子煙使用相關肺損傷 (EVALI) 病例顯著增加，在 2019 年 EVALI 爆發期間造成住院和死亡，凸顯了理解電子煙行為和制定有效戒菸策略的迫切性。由於社群媒體平台的普及，全球超過 47 億使用者使用它們進行連結、溝通、新聞和娛樂，其中很大一部分與健康相關，因此將社群媒體資料建立為公共衛生研究中無價的有機資料資源。在本研究中，我們從 Reddit 上一個電子煙子社群中提取一個範例資料集，以分析使用者的戒電子煙意圖。利用 OpenAI 最新的大型語言模型 GPT-4 進行句子層級的戒電子煙意圖偵測，本研究比較了此模型的結果與外行人和臨床專家註解。使用不同的提示策略，例如零次學習、一次學習、少次學習和思考鏈提示，我們開發了 8 個提示，詳細程度不同，向 GPT-4 解釋任務，並評估這些策略彼此之間的效能。這些初步發現強調了 GPT-4 在社群媒體資料分析中的潛力，特別是在識別人類偵測可能無法察覺的使用者微妙意圖方面。
 
-##### **360Brew: A Decoder-only Foundation Model for Personalized Ranking and Recommendation**
-2501.16450v3 by Hamed Firooz, Maziar Sanjabi, Adrian Englhardt, Aman Gupta, Ben Levine, Dre Olgiati, Gungor Polatkan, Iuliia Melnychuk, Karthik Ramgopal, Kirill Talanine, Kutta Srinivasan, Luke Simon, Natesh Sivasubramoniapillai, Necip Fazil Ayan, Qingquan Song, Samira Sriram, Souvik Ghosh, Tao Song, Tejas Dharamsi, Vignesh Kothapalli, Xiaoling Zhai, Ya Xu, Yu Wang, Yun Dai
+##### **Towards Compositional Interpretability for XAI**
+2406.17583v1 by Sean Tull, Robin Lorenz, Stephen Clark, Ilyas Khan, Bob Coecke
 
-Ranking and recommendation systems are the foundation for numerous online
-experiences, ranging from search results to personalized content delivery.
-These systems have evolved into complex, multilayered architectures that
-leverage vast datasets and often incorporate thousands of predictive models.
-The maintenance and enhancement of these models is a labor intensive process
-that requires extensive feature engineering. This approach not only exacerbates
-technical debt but also hampers innovation in extending these systems to
-emerging problem domains. In this report, we present our research to address
-these challenges by utilizing a large foundation model with a textual interface
-for ranking and recommendation tasks. We illustrate several key advantages of
-our approach: (1) a single model can manage multiple predictive tasks involved
-in ranking and recommendation, (2) decoder models with textual interface due to
-their comprehension of reasoning capabilities, can generalize to new
-recommendation surfaces and out-of-domain problems, and (3) by employing
-natural language interfaces for task definitions and verbalizing member
-behaviors and their social connections, we eliminate the need for feature
-engineering and the maintenance of complex directed acyclic graphs of model
-dependencies. We introduce our research pre-production model, 360Brew V1.0, a
-150B parameter, decoder-only model that has been trained and fine-tuned on
-LinkedIn's data and tasks. This model is capable of solving over 30 predictive
-tasks across various segments of the LinkedIn platform, achieving performance
-levels comparable to or exceeding those of current production systems based on
-offline metrics, without task-specific fine-tuning. Notably, each of these
-tasks is conventionally addressed by dedicated models that have been developed
-and maintained over multiple years by teams of a similar or larger size than
-our own.
+Artificial intelligence (AI) is currently based largely on black-box machine
+learning models which lack interpretability. The field of eXplainable AI (XAI)
+strives to address this major concern, being critical in high-stakes areas such
+as the finance, legal and health sectors.
+  We present an approach to defining AI models and their interpretability based
+on category theory. For this we employ the notion of a compositional model,
+which sees a model in terms of formal string diagrams which capture its
+abstract structure together with its concrete implementation. This
+comprehensive view incorporates deterministic, probabilistic and quantum
+models. We compare a wide range of AI models as compositional models, including
+linear and rule-based models, (recurrent) neural networks, transformers, VAEs,
+and causal and DisCoCirc models.
+  Next we give a definition of interpretation of a model in terms of its
+compositional structure, demonstrating how to analyse the interpretability of a
+model, and using this to clarify common themes in XAI. We find that what makes
+the standard 'intrinsically interpretable' models so transparent is brought out
+most clearly diagrammatically. This leads us to the more general notion of
+compositionally-interpretable (CI) models, which additionally include, for
+instance, causal, conceptual space, and DisCoCirc models.
+  We next demonstrate the explainability benefits of CI models. Firstly, their
+compositional structure may allow the computation of other quantities of
+interest, and may facilitate inference from the model to the modelled
+phenomenon by matching its structure. Secondly, they allow for diagrammatic
+explanations for their behaviour, based on influence constraints, diagram
+surgery and rewrite explanations. Finally, we discuss many future directions
+for the approach, raising the question of how to learn such meaningfully
+structured models in practice.
 
-摘要：排名和推薦系統是許多線上體驗的基礎，從搜尋結果到個人化內容傳遞。
-這些系統已演變成複雜的多層架構，利用龐大的資料集，並經常納入數千個預測模型。
-這些模型的維護和增強是一個勞力密集的過程，需要廣泛的特徵工程。
-這種方法不僅加劇了技術債務，也阻礙了將這些系統擴展到新興問題領域的創新。
-在此報告中，我們提出了我們的研究，以利用具有文字介面的大型基礎模型來解決這些挑戰，以進行排名和推薦任務。
-我們說明了我們方法的幾個主要優點：(1) 單一模型可以管理排名和推薦中涉及的多個預測任務，(2) 由於解碼器模型具有文字介面，因此它們對推理能力的理解，可以推廣到新的推薦表面和領域外問題，以及 (3) 通過採用自然語言介面進行任務定義和表達成員行為及其社交連接，我們消除了對特徵工程和維護複雜的模型相依性有向無環圖的需求。
-我們介紹了我們的研究前製作業模型 360Brew V1.0，這是一個 150B 參數，僅解碼器模型，已在 LinkedIn 的資料和任務上進行訓練和微調。
-此模型能夠解決 LinkedIn 平臺各個區塊中超過 30 個預測任務，在不針對任務進行微調的情況下，達到與基於離線指標的現行製作系統相當或超越的效能水準。
-值得注意的是，這些任務中的每個任務通常由專用模型處理，這些模型是由與我們規模相當或更大的團隊在多年間開發和維護的。
+摘要：<paragraph>人工智慧（AI）目前在很大程度上依賴於缺乏可解釋性的黑盒機器學習模型。可解釋性人工智慧（XAI）領域致力於解決這個主要問題，這在金融、法律和健康等高風險領域至關重要。
+我們提出了一種基於範疇論定義 AI 模型及其可解釋性的方法。為此，我們採用組合模型的概念，它以形式弦圖的形式看待模型，這些弦圖捕獲了模型的抽象結構及其具體實現。這種綜合觀點包含了確定性、概率性和量子模型。我們將各種 AI 模型作為組合模型進行比較，包括線性和基於規則的模型、（遞迴）神經網路、Transformer、VAE，以及因果和 DisCoCirc 模型。
+接下來，我們根據模型的組合結構給出模型解釋的定義，展示如何分析模型的可解釋性，並使用它來澄清 XAI 中的常見主題。我們發現，讓標準的「內在可解釋」模型如此透明的原因在圖表中表現得最為清楚。這引導我們得出更一般的組合可解釋（CI）模型概念，它另外還包括因果、概念空間和 DisCoCirc 模型。
+接下來，我們展示了 CI 模型的可解釋性優勢。首先，它們的組合結構允許計算其他感興趣的量，並可能通過匹配模型的結構來促進從模型到被建模現象的推理。其次，它們允許對其行為進行圖解說明，這些說明基於影響約束、圖解手術和重寫說明。最後，我們討論了這種方法的許多未來方向，提出了如何在實踐中學習這種有意義的結構化模型的問題。</paragraph>
 
-##### **Raiders of the Lost Dependency: Fixing Dependency Conflicts in Python using LLMs**
-2501.16191v1 by Antony Bartlett, Cynthia Liem, Annibale Panichella
+##### **Slicing Through Bias: Explaining Performance Gaps in Medical Image Analysis using Slice Discovery Methods**
+2406.12142v2 by Vincent Olesen, Nina Weng, Aasa Feragen, Eike Petersen
 
-Fixing Python dependency issues is a tedious and error-prone task for
-developers, who must manually identify and resolve environment dependencies and
-version constraints of third-party modules and Python interpreters. Researchers
-have attempted to automate this process by relying on large knowledge graphs
-and database lookup tables. However, these traditional approaches face
-limitations due to the variety of dependency error types, large sets of
-possible module versions, and conflicts among transitive dependencies. This
-study explores the potential of using large language models (LLMs) to
-automatically fix dependency issues in Python programs. We introduce PLLM
-(pronounced "plum"), a novel technique that employs retrieval-augmented
-generation (RAG) to help an LLM infer Python versions and required modules for
-a given Python file. PLLM builds a testing environment that iteratively (1)
-prompts the LLM for module combinations, (2) tests the suggested changes, and
-(3) provides feedback (error messages) to the LLM to refine the fix. This
-feedback cycle leverages natural language processing (NLP) to intelligently
-parse and interpret build error messages. We benchmark PLLM on the Gistable
-HG2.9K dataset, a collection of challenging single-file Python gists. We
-compare PLLM against two state-of-the-art automatic dependency inference
-approaches, namely PyEGo and ReadPyE, w.r.t. the ability to resolve dependency
-issues. Our results indicate that PLLM can fix more dependency issues than the
-two baselines, with +218 (+15.97%) more fixes over ReadPyE and +281 (+21.58%)
-over PyEGo. Our deeper analyses suggest that PLLM is particularly beneficial
-for projects with many dependencies and for specific third-party numerical and
-machine-learning modules. Our findings demonstrate the potential of LLM-based
-approaches to iteratively resolve Python dependency issues.
+Machine learning models have achieved high overall accuracy in medical image
+analysis. However, performance disparities on specific patient groups pose
+challenges to their clinical utility, safety, and fairness. This can affect
+known patient groups - such as those based on sex, age, or disease subtype - as
+well as previously unknown and unlabeled groups. Furthermore, the root cause of
+such observed performance disparities is often challenging to uncover,
+hindering mitigation efforts. In this paper, to address these issues, we
+leverage Slice Discovery Methods (SDMs) to identify interpretable
+underperforming subsets of data and formulate hypotheses regarding the cause of
+observed performance disparities. We introduce a novel SDM and apply it in a
+case study on the classification of pneumothorax and atelectasis from chest
+x-rays. Our study demonstrates the effectiveness of SDMs in hypothesis
+formulation and yields an explanation of previously observed but unexplained
+performance disparities between male and female patients in widely used chest
+X-ray datasets and models. Our findings indicate shortcut learning in both
+classification tasks, through the presence of chest drains and ECG wires,
+respectively. Sex-based differences in the prevalence of these shortcut
+features appear to cause the observed classification performance gap,
+representing a previously underappreciated interaction between shortcut
+learning and model fairness analyses.
 
-摘要：<paragraph>修復 Python 依賴項問題對開發人員來說是一項繁瑣且容易出錯的任務，他們必須手動識別和解決第三方模組和 Python 解譯器的環境依賴項和版本限制。研究人員已嘗試透過依賴大型知識圖譜和資料庫查詢表來自動化此程序。然而，這些傳統方法由於依賴項錯誤類型多樣、可能的模組版本數量龐大，以及傳遞依賴項之間的衝突，而面臨限制。本研究探討使用大型語言模型 (LLM) 自動修復 Python 程式中的依賴項問題的可能性。我們介紹 PLLM（發音為「plum」），這是一種新穎的技術，採用檢索增強生成 (RAG) 來協助 LLM 推論 Python 版本和給定 Python 檔案所需的模組。PLLM 建立一個測試環境，反覆 (1) 提示 LLM 模組組合，(2) 測試建議的變更，以及 (3) 提供回饋（錯誤訊息）給 LLM 以改善修正。此回饋循環利用自然語言處理 (NLP) 來智慧解析和詮釋建置錯誤訊息。我們在 Gistable HG2.9K 資料集上對 PLLM 進行基準測試，該資料集是一個具有挑戰性的單一檔案 Python gist 集合。我們將 PLLM 與兩種最先進的自動依賴項推論方法進行比較，即 PyEGo 和 ReadPyE，以比較解決依賴項問題的能力。我們的結果顯示，PLLM 可以修復比這兩個基準更多的依賴項問題，比 ReadPyE 多修復了 +218 (+15.97%) 個，比 PyEGo 多修復了 +281 (+21.58%) 個。我們更深入的分析表明，PLLM 對具有許多依賴項的專案以及特定第三方數值和機器學習模組特別有益。我們的研究結果證明了基於 LLM 的方法反覆解決 Python 依賴項問題的可能性。</paragraph>
+摘要：機器學習模型在醫學影像分析中已達到整體高準確度。然而，特定患者群體的效能差異對其臨床效用、安全性與公平性構成挑戰。這可能會影響已知的患者群體（例如基於性別、年齡或疾病亞型）以及先前未知且未標籤的群體。此外，此類觀察到的效能差異的根本原因通常難以發現，阻礙了緩解措施。在本文中，為了解決這些問題，我們利用切片發現方法 (SDM) 來識別可解釋的資料效能不佳子集，並針對觀察到的效能差異原因制定假設。我們引入一種新的 SDM，並在胸部 X 光片中肺炎和肺不張分類的案例研究中應用它。我們的研究證明了 SDM 在假設制定中的有效性，並對廣泛使用的胸部 X 光片資料集和模型中先前觀察到但無法解釋的男性和女性患者之間的效能差異提供了解釋。我們的發現表明，在分類任務中，透過胸腔引流管和心電圖導線的存在，存在捷徑學習。這些捷徑特徵的盛行率存在基於性別的差異，似乎會導致觀察到的分類效能差距，這代表捷徑學習和模型公平性分析之間先前未受到重視的交互作用。
 
-##### **Harnessing Diverse Perspectives: A Multi-Agent Framework for Enhanced Error Detection in Knowledge Graphs**
-2501.15791v1 by Yu Li, Yi Huang, Guilin Qi, Junlan Feng, Nan Hu, Songlin Zhai, Haohan Xue, Yongrui Chen, Ruoyan Shen, Tongtong Wu
+##### **Unlocking the Potential of Metaverse in Innovative and Immersive Digital Health**
+2406.07114v2 by Fatemeh Ebrahimzadeh, Ramin Safa
 
-Knowledge graphs are widely used in industrial applications, making error
-detection crucial for ensuring the reliability of downstream applications.
-Existing error detection methods often fail to effectively leverage
-fine-grained subgraph information and rely solely on fixed graph structures,
-while also lacking transparency in their decision-making processes, which
-results in suboptimal detection performance. In this paper, we propose a novel
-Multi-Agent framework for Knowledge Graph Error Detection (MAKGED) that
-utilizes multiple large language models (LLMs) in a collaborative setting. By
-concatenating fine-grained, bidirectional subgraph embeddings with LLM-based
-query embeddings during training, our framework integrates these
-representations to produce four specialized agents. These agents utilize
-subgraph information from different dimensions to engage in multi-round
-discussions, thereby improving error detection accuracy and ensuring a
-transparent decision-making process. Extensive experiments on FB15K and WN18RR
-demonstrate that MAKGED outperforms state-of-the-art methods, enhancing the
-accuracy and robustness of KG evaluation. For specific industrial scenarios,
-our framework can facilitate the training of specialized agents using
-domain-specific knowledge graphs for error detection, which highlights the
-potential industrial application value of our framework. Our code and datasets
-are available at https://github.com/kse-ElEvEn/MAKGED.
+The concept of Metaverse has attracted a lot of attention in various fields
+and one of its important applications is health and treatment. The Metaverse
+has enormous potential to transform healthcare by changing patient care,
+medical education, and the way teaching/learning and research are done. The
+purpose of this research is to provide an introduction to the basic concepts
+and fundamental technologies of the Metaverse. This paper examines the pros and
+cons of the Metaverse in healthcare context and analyzes its potential from the
+technology and AI perspective. In particular, the role of machine learning
+methods is discussed; We will explain how machine learning algorithms can be
+applied to the Metaverse generated data to gain better insights in healthcare
+applications. Additionally, we examine the future visions of the Metaverse in
+health delivery, by examining emerging technologies such as blockchain and also
+addressing privacy concerns. The findings of this study contribute to a deeper
+understanding of the applications of Metaverse in healthcare and its potential
+to revolutionize the delivery of medical services.
 
-摘要：知識圖譜廣泛應用於工業應用中，使得錯誤偵測對於確保下游應用的可靠性至關重要。現有的錯誤偵測方法通常無法有效利用細粒度的子圖資訊，並且僅依賴於固定的圖形結構，同時在它們的決策過程中也缺乏透明度，這導致次佳的偵測效能。在本文中，我們提出了一個用於知識圖譜錯誤偵測 (MAKGED) 的新多代理架構，它在協作設定中利用了多個大型語言模型 (LLM)。透過在訓練期間將細粒度、雙向子圖嵌入與基於 LLM 的查詢嵌入串接，我們的架構整合了這些表示以產生四個專門代理。這些代理利用不同維度的子圖資訊參與多輪討論，從而提高錯誤偵測準確度並確保透明的決策過程。在 FB15K 和 WN18RR 上的廣泛實驗表明，MAKGED 優於最先進的方法，增強了 KG 評估的準確性和穩健性。對於特定產業情境，我們的架構可以利用特定領域的知識圖譜來促進專門代理的訓練以進行錯誤偵測，這突顯了我們架構的潛在產業應用價值。我們的程式碼和資料集可在 https://github.com/kse-ElEvEn/MAKGED 取得。
+摘要：元宇宙的概念在各個領域都備受關注，其重要應用之一便是醫療保健。元宇宙有巨大的潛力透過改變病患照護、醫學教育，以及教學/學習和研究的方式來轉型醫療保健。本研究的目的是提供元宇宙基本概念和基礎技術的介紹。本文探討了元宇宙在醫療保健背景下的優缺點，並從技術和 AI 的角度分析其潛力。特別是，討論了機器學習方法的角色；我們將說明如何將機器學習演算法應用於元宇宙產生的資料，以獲得醫療保健應用方面的更佳見解。此外，我們透過探討區塊鏈等新興技術，並解決隱私問題，來探討元宇宙在醫療保健方面的未來願景。本研究的發現有助於更深入地了解元宇宙在醫療保健中的應用，以及其在醫療服務提供方面發揮革命性變革的潛力。
 
-##### **Automatic Feedback Generation for Short Answer Questions using Answer Diagnostic Graphs**
-2501.15777v1 by Momoka Furuhashi, Hiroaki Funayama, Yuya Iwase, Yuichiroh Matsubayashi, Yoriko Isobe, Toru Nagahama, Saku Sugawara, Kentaro Inui
+##### **AI-Driven Predictive Analytics Approach for Early Prognosis of Chronic Kidney Disease Using Ensemble Learning and Explainable AI**
+2406.06728v2 by K M Tawsik Jawad, Anusha Verma, Fathi Amsaad, Lamia Ashraf
 
-Short-reading comprehension questions help students understand text structure
-but lack effective feedback. Students struggle to identify and correct errors,
-while manual feedback creation is labor-intensive. This highlights the need for
-automated feedback linking responses to a scoring rubric for deeper
-comprehension.
-  Despite advances in Natural Language Processing (NLP), research has focused
-on automatic grading, with limited work on feedback generation. To address
-this, we propose a system that generates feedback for student responses.
-  Our contributions are twofold. First, we introduce the first system for
-feedback on short-answer reading comprehension. These answers are derived from
-the text, requiring structural understanding. We propose an "answer diagnosis
-graph," integrating the text's logical structure with feedback templates. Using
-this graph and NLP techniques, we estimate students' comprehension and generate
-targeted feedback.
-  Second, we evaluate our feedback through an experiment with Japanese high
-school students (n=39). They answered two 70-80 word questions and were divided
-into two groups with minimal academic differences. One received a model answer,
-the other system-generated feedback. Both re-answered the questions, and we
-compared score changes. A questionnaire assessed perceptions and motivation.
-  Results showed no significant score improvement between groups, but
-system-generated feedback helped students identify errors and key points in the
-text. It also significantly increased motivation. However, further refinement
-is needed to enhance text structure understanding.
+Chronic Kidney Disease (CKD) is one of the widespread Chronic diseases with
+no known ultimo cure and high morbidity. Research demonstrates that progressive
+Chronic Kidney Disease (CKD) is a heterogeneous disorder that significantly
+impacts kidney structure and functions, eventually leading to kidney failure.
+With the progression of time, chronic kidney disease has moved from a
+life-threatening disease affecting few people to a common disorder of varying
+severity. The goal of this research is to visualize dominating features,
+feature scores, and values exhibited for early prognosis and detection of CKD
+using ensemble learning and explainable AI. For that, an AI-driven predictive
+analytics approach is proposed to aid clinical practitioners in prescribing
+lifestyle modifications for individual patients to reduce the rate of
+progression of this disease. Our dataset is collected on body vitals from
+individuals with CKD and healthy subjects to develop our proposed AI-driven
+solution accurately. In this regard, blood and urine test results are provided,
+and ensemble tree-based machine-learning models are applied to predict unseen
+cases of CKD. Our research findings are validated after lengthy consultations
+with nephrologists. Our experiments and interpretation results are compared
+with existing explainable AI applications in various healthcare domains,
+including CKD. The comparison shows that our developed AI models, particularly
+the Random Forest model, have identified more features as significant
+contributors than XgBoost. Interpretability (I), which measures the ratio of
+important to masked features, indicates that our XgBoost model achieved a
+higher score, specifically a Fidelity of 98\%, in this metric and naturally in
+the FII index compared to competing models.
 
-摘要：短篇閱讀理解題目有助學生理解文章結構，但缺乏有效的回饋。學生難以找出並更正錯誤，而手動建立回饋又很費力。這突顯了自動化回饋的必要性，將回應連結到評分標準，以獲得更深入的理解。
+摘要：慢性腎臟病 (CKD) 是一種廣泛的慢性疾病，目前尚未找到最終的治療方法，且發病率很高。研究表明，進行性慢性腎臟病 (CKD) 是一種異質性疾病，會顯著影響腎臟結構和功能，最終導致腎衰竭。隨著時間的推移，慢性腎臟病已從影響少數人的致命疾病演變成一種嚴重程度不一的常見疾病。本研究的目標是使用整體學習和可解釋的 AI 來視覺化支配性特徵、特徵分數和值，以進行 CKD 的早期預後和檢測。為此，提出了一種 AI 驅動的預測分析方法，以幫助臨床醫生為個別患者開具生活方式的修改建議，以降低此疾病的進展速度。我們的數據集是從 CKD 患者和健康受試者的身體生命徵象中收集的，以準確開發我們提出的 AI 驅動的解決方案。在這方面，提供了血液和尿液檢測結果，並應用基於集成樹的機器學習模型來預測未見的 CKD 病例。我們的研究結果在與腎臟科醫師進行長時間諮詢後得到驗證。我們的實驗和解釋結果與各種醫療保健領域中現有的可解釋 AI 應用進行了比較，包括 CKD。比較表明，我們開發的 AI 模型，特別是隨機森林模型，已經確定了比 XgBoost 更多的特徵作為顯著的貢獻者。可解釋性 (I) 衡量重要特徵與被遮蔽特徵的比率，表明我們的 XgBoost 模型在此指標中取得了更高的分數，特別是 98% 的保真度，並且在 FII 指數中自然高於競爭模型。
+
+##### **Explainable AI for Mental Disorder Detection via Social Media: A survey and outlook**
+2406.05984v1 by Yusif Ibrahimov, Tarique Anwar, Tommy Yuan
+
+Mental health constitutes a complex and pervasive global challenge, affecting
+millions of lives and often leading to severe consequences. In this paper, we
+conduct a thorough survey to explore the intersection of data science,
+artificial intelligence, and mental healthcare, focusing on the recent
+developments of mental disorder detection through online social media (OSM). A
+significant portion of the population actively engages in OSM platforms,
+creating a vast repository of personal data that holds immense potential for
+mental health analytics. The paper navigates through traditional diagnostic
+methods, state-of-the-art data- and AI-driven research studies, and the
+emergence of explainable AI (XAI) models for mental healthcare. We review
+state-of-the-art machine learning methods, particularly those based on modern
+deep learning, while emphasising the need for explainability in healthcare AI
+models. The experimental design section provides insights into prevalent
+practices, including available datasets and evaluation approaches. We also
+identify key issues and challenges in the field and propose promising future
+research directions. As mental health decisions demand transparency,
+interpretability, and ethical considerations, this paper contributes to the
+ongoing discourse on advancing XAI in mental healthcare through social media.
+The comprehensive overview presented here aims to guide researchers,
+practitioners, and policymakers in developing the area of mental disorder
+detection.
 
-儘管自然語言處理 (NLP) 有所進展，但研究一直集中在自動評分上，而回饋生成的工作有限。為了解決這個問題，我們提出了一個系統，用於為學生的回答產生回饋。
+摘要：心理健康構成了一項複雜且普遍的全球挑戰，影響了數百萬人的生活，並經常導致嚴重的後果。在本文中，我們進行了一項徹底的調查，以探索數據科學、人工智慧和心理保健的交集，重點關注通過線上社交媒體 (OSM) 進行心理疾病檢測的最新發展。很大一部分人口積極參與 OSM 平台，創造了一個龐大的人員資料庫，對心理健康分析具有巨大的潛力。本文探討了傳統的診斷方法、最先進的資料和 AI 驅動的研究，以及心理保健中可解釋 AI (XAI) 模型的出現。我們回顧了最先進的機器學習方法，特別是那些基於現代深度學習的方法，同時強調了醫療保健 AI 模型中可解釋性的必要性。實驗設計部分提供了對普遍做法的見解，包括可用的資料集和評估方法。我們還找出該領域的主要問題和挑戰，並提出了有希望的未來研究方向。由於心理健康決策需要透明度、可解釋性和道德考量，本文有助於推進心理保健中透過社交媒體推進 XAI 的持續討論。這裡提出的全面概述旨在引導研究人員、從業人員和政策制定者發展心理疾病檢測領域。
 
-我們的貢獻有兩個方面。首先，我們引入了第一個針對簡答閱讀理解提供回饋的系統。這些答案來自於文本，需要結構化的理解。我們提出了一個「答案診斷圖」，將文本的邏輯結構與回饋範本整合在一起。使用這個圖表和 NLP 技術，我們估計學生的理解力並產生有針對性的回饋。
+##### **Methodology and Real-World Applications of Dynamic Uncertain Causality Graph for Clinical Diagnosis with Explainability and Invariance**
+2406.05746v1 by Zhan Zhang, Qin Zhang, Yang Jiao, Lin Lu, Lin Ma, Aihua Liu, Xiao Liu, Juan Zhao, Yajun Xue, Bing Wei, Mingxia Zhang, Ru Gao, Hong Zhao, Jie Lu, Fan Li, Yang Zhang, Yiming Wang, Lei Zhang, Fengwei Tian, Jie Hu, Xin Gou
 
-其次，我們透過一項針對日本高中生的實驗（n=39）來評估我們的回饋。他們回答了兩個 70-80 字的問題，並被分成兩組，學術差異最小。一組收到範本答案，另一組收到系統產生的回饋。兩組都重新回答了問題，我們比較了分數的變化。一份問卷評估了認知和動機。
+AI-aided clinical diagnosis is desired in medical care. Existing deep
+learning models lack explainability and mainly focus on image analysis. The
+recently developed Dynamic Uncertain Causality Graph (DUCG) approach is
+causality-driven, explainable, and invariant across different application
+scenarios, without problems of data collection, labeling, fitting, privacy,
+bias, generalization, high cost and high energy consumption. Through close
+collaboration between clinical experts and DUCG technicians, 46 DUCG models
+covering 54 chief complaints were constructed. Over 1,000 diseases can be
+diagnosed without triage. Before being applied in real-world, the 46 DUCG
+models were retrospectively verified by third-party hospitals. The verified
+diagnostic precisions were no less than 95%, in which the diagnostic precision
+for every disease including uncommon ones was no less than 80%. After
+verifications, the 46 DUCG models were applied in the real-world in China. Over
+one million real diagnosis cases have been performed, with only 17 incorrect
+diagnoses identified. Due to DUCG's transparency, the mistakes causing the
+incorrect diagnoses were found and corrected. The diagnostic abilities of the
+clinicians who applied DUCG frequently were improved significantly. Following
+the introduction to the earlier presented DUCG methodology, the recommendation
+algorithm for potential medical checks is presented and the key idea of DUCG is
+extracted.
 
-結果顯示兩組之間沒有顯著的分數進步，但系統產生的回饋有助於學生找出文本中的錯誤和重點。它也顯著地提高了動機。然而，需要進一步的改進來增強對文本結構的理解。
+摘要：<paragraph>醫療照護中需要 AI 輔助的臨床診斷。現有的深度學習模型缺乏可解釋性，並且主要專注於影像分析。最近開發的動態不確定因果關係圖 (DUCG) 方法是因果驅動的、可解釋的，並且在不同的應用場景中是不變的，沒有資料收集、標記、擬合、隱私、偏見、概化、高成本和高能耗的問題。通過臨床專家和 DUCG 技術人員之間的密切合作，構建了涵蓋 54 個主訴的 46 個 DUCG 模型。可以在沒有分流的情況下診斷出 1,000 多種疾病。在應用於實際世界之前，46 個 DUCG 模型已由第三方醫院回溯性驗證。驗證的診斷精度不低於 95%，其中包括罕見疾病在內的每種疾病的診斷精度不低於 80%。驗證後，46 個 DUCG 模型已在中國實際應用。已經執行了超過一百萬個真實診斷案例，僅發現 17 個不正確的診斷。由於 DUCG 的透明性，發現並糾正了導致不正確診斷的錯誤。頻繁應用 DUCG 的臨床醫生的診斷能力得到了顯著提高。在介紹了前面提出的 DUCG 方法論之後，提出了潛在健康檢查的推薦演算法，並提取了 DUCG 的關鍵思想。</paragraph>
 
-##### **Transformer-Based Multimodal Knowledge Graph Completion with Link-Aware Contexts**
-2501.15688v1 by Haodi Ma, Dzmitry Kasinets, Daisy Zhe Wang
+##### **Advancing Histopathology-Based Breast Cancer Diagnosis: Insights into Multi-Modality and Explainability**
+2406.12897v1 by Faseela Abdullakutty, Younes Akbari, Somaya Al-Maadeed, Ahmed Bouridane, Rifat Hamoudi
 
-Multimodal knowledge graph completion (MMKGC) aims to predict missing links
-in multimodal knowledge graphs (MMKGs) by leveraging information from various
-modalities alongside structural data. Existing MMKGC approaches primarily
-extend traditional knowledge graph embedding (KGE) models, which often require
-creating an embedding for every entity. This results in large model sizes and
-inefficiencies in integrating multimodal information, particularly for
-real-world graphs. Meanwhile, Transformer-based models have demonstrated
-competitive performance in knowledge graph completion (KGC). However, their
-focus on single-modal knowledge limits their capacity to utilize cross-modal
-information. Recently, Large vision-language models (VLMs) have shown potential
-in cross-modal tasks but are constrained by the high cost of training. In this
-work, we propose a novel approach that integrates Transformer-based KGE models
-with cross-modal context generated by pre-trained VLMs, thereby extending their
-applicability to MMKGC. Specifically, we employ a pre-trained VLM to transform
-relevant visual information from entities and their neighbors into textual
-sequences. We then frame KGC as a sequence-to-sequence task, fine-tuning the
-model with the generated cross-modal context. This simple yet effective method
-significantly reduces model size compared to traditional KGE approaches while
-achieving competitive performance across multiple large-scale datasets with
-minimal hyperparameter tuning.
+It is imperative that breast cancer is detected precisely and timely to
+improve patient outcomes. Diagnostic methodologies have traditionally relied on
+unimodal approaches; however, medical data analytics is integrating diverse
+data sources beyond conventional imaging. Using multi-modal techniques,
+integrating both image and non-image data, marks a transformative advancement
+in breast cancer diagnosis. The purpose of this review is to explore the
+burgeoning field of multimodal techniques, particularly the fusion of
+histopathology images with non-image data. Further, Explainable AI (XAI) will
+be used to elucidate the decision-making processes of complex algorithms,
+emphasizing the necessity of explainability in diagnostic processes. This
+review utilizes multi-modal data and emphasizes explainability to enhance
+diagnostic accuracy, clinician confidence, and patient engagement, ultimately
+fostering more personalized treatment strategies for breast cancer, while also
+identifying research gaps in multi-modality and explainability, guiding future
+studies, and contributing to the strategic direction of the field.
 
-摘要：多模態知識圖譜補全 (MMKGC) 旨在透過利用來自各種模態與結構化資料的資訊，來預測多模態知識圖譜 (MMKG) 中的缺失連結。現有的 MMKGC 方法主要擴充傳統的知識圖譜嵌入 (KGE) 模型，這些模型通常需要為每個實體建立一個嵌入。這會導致模型尺寸過大，且在整合多模態資訊時效率低下，特別是對於真實世界的圖譜。與此同時，基於 Transformer 的模型已在知識圖譜補全 (KGC) 中展現出競爭力。然而，它們著重於單模態知識，限制了它們利用跨模態資訊的能力。最近，大型視覺語言模型 (VLM) 已在跨模態任務中展現潛力，但受限於訓練成本過高。在這項工作中，我們提出了一種創新的方法，它將基於 Transformer 的 KGE 模型與預先訓練的 VLM 所產生的跨模態內容整合在一起，從而擴展它們在 MMKGC 中的適用性。具體來說，我們採用預先訓練的 VLM，將實體及其鄰居相關的視覺資訊轉換成文字序列。然後，我們將 KGC 架構成一個序列到序列的任務，並使用產生的跨模態內容微調模型。這種簡單但有效的方法，與傳統的 KGE 方法相比，大幅減少了模型尺寸，同時在多個大型資料集上達到了競爭力的效能，且只需最少的超參數調整。
+摘要：精確且及時地偵測乳癌對於改善患者預後至關重要。診斷方法傳統上依賴於單一模式方法；然而，醫療資料分析正在整合超越傳統影像的各種資料來源。使用整合影像和非影像資料的多模式技術，標誌著乳癌診斷的變革性進展。本篇綜述的目的是探討多模式技術的新興領域，特別是將組織病理學影像與非影像資料融合。此外，可解釋人工智慧 (XAI) 將用於闡明複雜演算法的決策過程，強調診斷過程中可解釋性的必要性。本綜述利用多模式資料並強調可解釋性，以提高診斷準確性、臨床醫師的信心和患者參與度，最終促進乳癌更個人化的治療策略，同時也找出多模式和可解釋性的研究差距，引導未來的研究，並為該領域的策略方向做出貢獻。
 
-##### **How to Mitigate Information Loss in Knowledge Graphs for GraphRAG: Leveraging Triple Context Restoration and Query-Driven Feedback**
-2501.15378v1 by Manzong Huang, Chenyang Bu, Yi He, Xindong Wu
+##### **Using Explainable AI for EEG-based Reduced Montage Neonatal Seizure Detection**
+2406.16908v3 by Dinuka Sandun Udayantha, Kavindu Weerasinghe, Nima Wickramasinghe, Akila Abeyratne, Kithmin Wickremasinghe, Jithangi Wanigasinghe, Anjula De Silva, Chamira U. S. Edussooriya
 
-Knowledge Graph (KG)-augmented Large Language Models (LLMs) have recently
-propelled significant advances in complex reasoning tasks, thanks to their
-broad domain knowledge and contextual awareness. Unfortunately, current methods
-often assume KGs to be complete, which is impractical given the inherent
-limitations of KG construction and the potential loss of contextual cues when
-converting unstructured text into entity-relation triples. In response, this
-paper proposes the Triple Context Restoration and Query-driven Feedback
-(TCR-QF) framework, which reconstructs the textual context underlying each
-triple to mitigate information loss, while dynamically refining the KG
-structure by iteratively incorporating query-relevant missing knowledge.
-Experiments on five benchmark question-answering datasets substantiate the
-effectiveness of TCR-QF in KG and LLM integration, where itachieves a 29.1%
-improvement in Exact Match and a 15.5% improvement in F1 over its
-state-of-the-art GraphRAG competitors.
+The neonatal period is the most vulnerable time for the development of
+seizures. Seizures in the immature brain lead to detrimental consequences,
+therefore require early diagnosis. The gold-standard for neonatal seizure
+detection currently relies on continuous video-EEG monitoring; which involves
+recording multi-channel electroencephalogram (EEG) alongside real-time video
+monitoring within a neonatal intensive care unit (NICU). However, video-EEG
+monitoring technology requires clinical expertise and is often limited to
+technologically advanced and resourceful settings. Cost-effective new
+techniques could help the medical fraternity make an accurate diagnosis and
+advocate treatment without delay. In this work, a novel explainable deep
+learning model to automate the neonatal seizure detection process with a
+reduced EEG montage is proposed, which employs convolutional nets, graph
+attention layers, and fully connected layers. Beyond its ability to detect
+seizures in real-time with a reduced montage, this model offers the unique
+advantage of real-time interpretability. By evaluating the performance on the
+Zenodo dataset with 10-fold cross-validation, the presented model achieves an
+absolute improvement of 8.31% and 42.86% in area under curve (AUC) and recall,
+respectively.
 
-摘要：知識圖譜 (KG) 增強大型語言模型 (LLM) 最近推動複雜推理任務的重大進展，這要歸功於它們廣泛的領域知識和語境感知。不幸的是，目前的模型通常假設 KG 是完整的，這在考慮到 KG 建構的固有限制和在將非結構化文字轉換為實體關係三元組時潛在的語境線索損失時是不切實際的。為了解決這個問題，本文提出了三元組語境還原和查詢驅動回饋 (TCR-QF) 架構，它重建每個三元組底層的文字語境以減輕資訊損失，同時透過反覆納入與查詢相關的遺失知識來動態優化 KG 結構。在五個基準問題回答資料集上的實驗證實了 TCR-QF 在 KG 和 LLM 整合方面的有效性，它在 Exact Match 中獲得 29.1% 的改進，在 F1 中獲得 15.5% 的改進，優於最先進的 GraphRAG 競爭對手。
+摘要：新生兒期是大腦發育最脆弱的時期，容易出現癲癇發作。大腦發育不成熟時出現癲癇發作會造成不良後果，因此需要及早診斷。目前新生兒癲癇發作的黃金標準依賴於連續的視訊腦電圖 (EEG) 監測；其中包括在新生兒加護病房 (NICU) 內同時進行多頻道腦電圖 (EEG) 記錄和即時視訊監控。然而，視訊腦電圖監控技術需要臨床專業知識，而且通常僅限於技術先進且資源豐富的環境。具成本效益的新技術可以幫助醫療界準確診斷並立即提倡治療。在這項工作中，提出了一個新穎的可解釋深度學習模型，以自動化新生兒癲癇發作偵測過程，並採用減少的腦電圖裝置，其中採用了卷積神經網路、圖形注意力層和全連接層。除了能夠使用減少的裝置即時偵測癲癇發作外，此模型還提供了即時可解釋性的獨特優勢。透過在 Zenodo 資料集上使用 10 倍交叉驗證評估效能，所提出的模型在曲線下面積 (AUC) 和召回率方面分別達到了 8.31% 和 42.86% 的絕對改善。
 
-##### **Explaining Categorical Feature Interactions Using Graph Covariance and LLMs**
-2501.14932v1 by Cencheng Shen, Darren Edge, Jonathan Larson, Carey E. Priebe
+##### **Breast Cancer Diagnosis: A Comprehensive Exploration of Explainable Artificial Intelligence (XAI) Techniques**
+2406.00532v1 by Samita Bai, Sidra Nasir, Rizwan Ahmed Khan, Sheeraz Arif, Alexandre Meyer, Hubert Konik
 
-Modern datasets often consist of numerous samples with abundant features and
-associated timestamps. Analyzing such datasets to uncover underlying events
-typically requires complex statistical methods and substantial domain
-expertise. A notable example, and the primary data focus of this paper, is the
-global synthetic dataset from the Counter Trafficking Data Collaborative (CTDC)
--- a global hub of human trafficking data containing over 200,000 anonymized
-records spanning from 2002 to 2022, with numerous categorical features for each
-record. In this paper, we propose a fast and scalable method for analyzing and
-extracting significant categorical feature interactions, and querying large
-language models (LLMs) to generate data-driven insights that explain these
-interactions. Our approach begins with a binarization step for categorical
-features using one-hot encoding, followed by the computation of graph
-covariance at each time. This graph covariance quantifies temporal changes in
-dependence structures within categorical data and is established as a
-consistent dependence measure under the Bernoulli distribution. We use this
-measure to identify significant feature pairs, such as those with the most
-frequent trends over time or those exhibiting sudden spikes in dependence at
-specific moments. These extracted feature pairs, along with their timestamps,
-are subsequently passed to an LLM tasked with generating potential explanations
-of the underlying events driving these dependence changes. The effectiveness of
-our method is demonstrated through extensive simulations, and its application
-to the CTDC dataset reveals meaningful feature pairs and potential data stories
-underlying the observed feature interactions.
+Breast cancer (BC) stands as one of the most common malignancies affecting
+women worldwide, necessitating advancements in diagnostic methodologies for
+better clinical outcomes. This article provides a comprehensive exploration of
+the application of Explainable Artificial Intelligence (XAI) techniques in the
+detection and diagnosis of breast cancer. As Artificial Intelligence (AI)
+technologies continue to permeate the healthcare sector, particularly in
+oncology, the need for transparent and interpretable models becomes imperative
+to enhance clinical decision-making and patient care. This review discusses the
+integration of various XAI approaches, such as SHAP, LIME, Grad-CAM, and
+others, with machine learning and deep learning models utilized in breast
+cancer detection and classification. By investigating the modalities of breast
+cancer datasets, including mammograms, ultrasounds and their processing with
+AI, the paper highlights how XAI can lead to more accurate diagnoses and
+personalized treatment plans. It also examines the challenges in implementing
+these techniques and the importance of developing standardized metrics for
+evaluating XAI's effectiveness in clinical settings. Through detailed analysis
+and discussion, this article aims to highlight the potential of XAI in bridging
+the gap between complex AI models and practical healthcare applications,
+thereby fostering trust and understanding among medical professionals and
+improving patient outcomes.
 
-摘要：現代資料集通常包含許多具有豐富特徵和關聯時間戳的樣本。分析此類資料集以揭示底層事件通常需要複雜的統計方法和大量的領域專業知識。一個值得注意的範例，也是本文的主要資料重點，是來自反人口販運資料合作組織 (CTDC) 的全球合成資料集，這是全球人口販運資料的樞紐，包含超過 200,000 筆從 2002 年到 2022 年的匿名記錄，每個記錄都有許多分類特徵。在本文中，我們提出了一種快速且可擴充的方法，用於分析和提取重要的分類特徵交互作用，並查詢大型語言模型 (LLM)，以產生資料驅動的見解來解釋這些交互作用。我們的做法從使用獨熱編碼對分類特徵進行二元化步驟開始，然後在每個時間點計算圖形共變異數。此圖形共變異數量化了分類資料中依賴結構的時間變化，並在伯努利分佈下建立為一致的依賴度量。我們使用此度量來識別重要的特徵對，例如隨時間推移趨勢最頻繁的特徵對，或在特定時刻表現出依賴性突然激增的特徵對。這些提取的特徵對及其時間戳隨後傳遞給 LLM，後者負責產生對驅動這些依賴性變化的底層事件的潛在解釋。我們的方法的有效性已通過廣泛的模擬得到證明，其在 CTDC 資料集中的應用揭示了有意義的特徵對和潛在的資料故事，這些故事是觀察到的特徵交互作用的基礎。
+摘要：乳癌 (BC) 是影響全球女性最常見的惡性腫瘤之一，因此需要進步的診斷方法，以改善臨床結果。本文全面探討了可解釋人工智慧 (XAI) 技術在乳癌偵測和診斷中的應用。隨著人工智慧 (AI) 技術持續滲透醫療保健領域，特別是在腫瘤學中，透明且可解釋的模型需求變得勢在必行，以增強臨床決策制定和患者照護。此篇評論探討了各種 XAI 方法的整合，例如 SHAP、LIME、Grad-CAM 等，以及用於乳癌偵測和分類的機器學習和深度學習模型。透過探討乳癌資料集的模式，包括乳房攝影、超音波及其在 AI 中的處理，本文重點說明 XAI 如何能導致更準確的診斷和個人化治療計畫。它也探討了實施這些技術的挑戰，以及制定標準化評量指標以評估 XAI 在臨床環境中的有效性的重要性。透過詳細的分析和討論，本文旨在強調 XAI 在縮小複雜 AI 模型與實務醫療保健應用之間差距的潛力，進而促進醫療專業人員之間的信任與理解，並改善患者的結果。
 
-##### **Causal Graphs Meet Thoughts: Enhancing Complex Reasoning in Graph-Augmented LLMs**
-2501.14892v1 by Hang Luo, Jian Zhang, Chujun Li
+##### **Unveiling Hidden Factors: Explainable AI for Feature Boosting in Speech Emotion Recognition**
+2406.01624v2 by Alaa Nfissi, Wassim Bouachir, Nizar Bouguila, Brian Mishara
 
-In knowledge-intensive tasks, especially in high-stakes domains like medicine
-and law, it is critical not only to retrieve relevant information but also to
-provide causal reasoning and explainability. Large language models (LLMs) have
-achieved remarkable performance in natural language understanding and
-generation tasks. However, they often suffer from limitations such as
-difficulty in incorporating new knowledge, generating hallucinations, and
-explaining their reasoning process. To address these challenges, integrating
-knowledge graphs with Graph Retrieval-Augmented Generation (Graph RAG) has
-emerged as an effective solution. Traditional Graph RAG methods often rely on
-simple graph traversal or semantic similarity, which do not capture causal
-relationships or align well with the model's internal reasoning steps. This
-paper proposes a novel pipeline that filters large knowledge graphs to
-emphasize cause-effect edges, aligns the retrieval process with the model's
-chain-of-thought (CoT), and enhances reasoning through multi-stage path
-improvements. Experiments on medical question-answering tasks show consistent
-gains, with up to a 10\% absolute improvement across multiple large language
-models (LLMs). This approach demonstrates the value of combining causal
-reasoning with stepwise retrieval, leading to more interpretable and logically
-grounded solutions for complex queries.
+Speech emotion recognition (SER) has gained significant attention due to its
+several application fields, such as mental health, education, and
+human-computer interaction. However, the accuracy of SER systems is hindered by
+high-dimensional feature sets that may contain irrelevant and redundant
+information. To overcome this challenge, this study proposes an iterative
+feature boosting approach for SER that emphasizes feature relevance and
+explainability to enhance machine learning model performance. Our approach
+involves meticulous feature selection and analysis to build efficient SER
+systems. In addressing our main problem through model explainability, we employ
+a feature evaluation loop with Shapley values to iteratively refine feature
+sets. This process strikes a balance between model performance and
+transparency, which enables a comprehensive understanding of the model's
+predictions. The proposed approach offers several advantages, including the
+identification and removal of irrelevant and redundant features, leading to a
+more effective model. Additionally, it promotes explainability, facilitating
+comprehension of the model's predictions and the identification of crucial
+features for emotion determination. The effectiveness of the proposed method is
+validated on the SER benchmarks of the Toronto emotional speech set (TESS),
+Berlin Database of Emotional Speech (EMO-DB), Ryerson Audio-Visual Database of
+Emotional Speech and Song (RAVDESS), and Surrey Audio-Visual Expressed Emotion
+(SAVEE) datasets, outperforming state-of-the-art methods. To the best of our
+knowledge, this is the first work to incorporate model explainability into an
+SER framework. The source code of this paper is publicly available via this
+https://github.com/alaaNfissi/Unveiling-Hidden-Factors-Explainable-AI-for-Feature-Boosting-in-Speech-Emotion-Recognition.
 
-摘要：在知識密集型任務中，特別是在醫學和法律等高風險領域，不僅檢索相關資訊至關重要，還必須提供因果推理和可解釋性。大型語言模型 (LLM) 在自然語言理解和生成任務中取得了顯著的表現。然而，它們通常會遇到一些限制，例如難以納入新知識、產生幻覺，以及解釋其推理過程。為了應對這些挑戰，將知識圖與圖形檢索增強生成 (Graph RAG) 整合在一起已成為一種有效的解決方案。傳統的 Graph RAG 方法通常依賴於簡單的圖形遍歷或語義相似性，這無法捕捉因果關係或與模型的內部推理步驟很好地對齊。本文提出了一個新穎的管道，該管道過濾大型知識圖以強調因果邊緣，將檢索過程與模型的思想鏈 (CoT) 對齊，並通過多階段路徑改進來增強推理。在醫療問題解答任務上的實驗顯示出一致的收益，在多個大型語言模型 (LLM) 中絕對改進幅度高達 10%。這種方法展示了將因果推理與逐步檢索相結合的價值，從而為複雜查詢提供更具可解釋性和邏輯依據的解決方案。
+摘要：語音情緒辨識 (SER) 由於其在心理健康、教育和人機互動等多個應用領域而備受關注。然而，SER 系統的準確性受到高維特徵集的阻礙，這些特徵集可能包含不相關和冗餘的資訊。為了克服這個挑戰，本研究提出了一種用於 SER 的迭代特徵提升方法，該方法強調特徵相關性和可解釋性，以增強機器學習模型的效能。我們的做法涉及仔細的特徵選擇和分析，以建立高效的 SER 系統。為了透過模型可解釋性解決我們的核心問題，我們採用了具有 Shapley 值的特徵評估迴圈，以反覆改善特徵集。這個過程在模型效能和透明度之間取得平衡，這使得我們能夠全面了解模型的預測。所提出的方法提供了多項優點，包括識別和移除不相關和冗餘的特徵，從而建立更有效的模型。此外，它促進了可解釋性，有助於理解模型的預測以及識別情緒決定的關鍵特徵。所提出的方法的有效性已在多倫多情緒語音集 (TESS)、柏林情緒語音資料庫 (EMO-DB)、賴爾森音訊視覺情緒語音和歌曲資料庫 (RAVDESS) 和薩里音訊視覺表達情緒 (SAVEE) 資料集的 SER 基準上得到驗證，其效能優於現有方法。據我們所知，這是第一個將模型可解釋性納入 SER 架構的研究。本文的原始碼可透過此連結公開取得：https://github.com/alaaNfissi/Unveiling-Hidden-Factors-Explainable-AI-for-Feature-Boosting-in-Speech-Emotion-Recognition。
 
-##### **GraPPI: A Retrieve-Divide-Solve GraphRAG Framework for Large-scale Protein-protein Interaction Exploration**
-2501.16382v1 by Ziwen Li, Xiang 'Anthony' Chen, Youngseung Jeon
+##### **The Explanation Necessity for Healthcare AI**
+2406.00216v1 by Michail Mamalakis, Héloïse de Vareilles, Graham Murray, Pietro Lio, John Suckling
 
-Drug discovery (DD) has tremendously contributed to maintaining and improving
-public health. Hypothesizing that inhibiting protein misfolding can slow
-disease progression, researchers focus on target identification (Target ID) to
-find protein structures for drug binding. While Large Language Models (LLMs)
-and Retrieval-Augmented Generation (RAG) frameworks have accelerated drug
-discovery, integrating models into cohesive workflows remains challenging. We
-conducted a user study with drug discovery researchers to identify the
-applicability of LLMs and RAGs in Target ID. We identified two main findings:
-1) an LLM should provide multiple Protein-Protein Interactions (PPIs) based on
-an initial protein and protein candidates that have a therapeutic impact; 2)
-the model must provide the PPI and relevant explanations for better
-understanding. Based on these observations, we identified three limitations in
-previous approaches for Target ID: 1) semantic ambiguity, 2) lack of
-explainability, and 3) short retrieval units. To address these issues, we
-propose GraPPI, a large-scale knowledge graph (KG)-based retrieve-divide-solve
-agent pipeline RAG framework to support large-scale PPI signaling pathway
-exploration in understanding therapeutic impacts by decomposing the analysis of
-entire PPI pathways into sub-tasks focused on the analysis of PPI edges.
+Explainability is often critical to the acceptable implementation of
+artificial intelligence (AI). Nowhere is this more important than healthcare
+where decision-making directly impacts patients and trust in AI systems is
+essential. This trust is often built on the explanations and interpretations
+the AI provides. Despite significant advancements in AI interpretability, there
+remains the need for clear guidelines on when and to what extent explanations
+are necessary in the medical context. We propose a novel categorization system
+with four distinct classes of explanation necessity, guiding the level of
+explanation required: patient or sample (local) level, cohort or dataset
+(global) level, or both levels. We introduce a mathematical formulation that
+distinguishes these categories and offers a practical framework for researchers
+to determine the necessity and depth of explanations required in medical AI
+applications. Three key factors are considered: the robustness of the
+evaluation protocol, the variability of expert observations, and the
+representation dimensionality of the application. In this perspective, we
+address the question: When does an AI medical application need to be explained,
+and at what level of detail?
 
-摘要：药物发现 (DD) 极大地促进了公共卫生的维护和改善。研究人员假设抑制蛋白质错误折叠可以减缓疾病进展，因此专注于靶点识别 (Target ID) 以找到用于药物结合的蛋白质结构。虽然大型语言模型 (LLM) 和检索增强生成 (RAG) 框架加速了药物发现，但将模型整合到内聚工作流中仍然具有挑战性。我们与药物发现研究人员进行了一项用户研究，以确定 LLM 和 RAG 在 Target ID 中的适用性。我们确定了两个主要发现：1) LLM 应该基于初始蛋白质和具有治疗作用的蛋白质候选物提供多个蛋白质-蛋白质相互作用 (PPI)；2) 该模型必须提供 PPI 和相关解释以更好地理解。基于这些观察，我们发现了先前 Target ID 方法中的三个局限性：1) 语义歧义，2) 缺乏可解释性，3) 检索单元短。为了解决这些问题，我们提出了 GraPPI，这是一种基于大规模知识图 (KG) 的检索-分解-求解代理管道 RAG 框架，以支持大规模 PPI 信号通路探索，通过将整个 PPI 通路的分析分解为专注于 PPI 边缘分析的子任务来理解治疗影响。
+摘要：可解释性通常对于人工智能 (AI) 的可接受实施至关重要。在医疗保健领域，这一点尤为重要，因为决策直接影响患者，并且对 AI 系统的信任至关重要。这种信任通常建立在 AI 提供的解释和诠释之上。尽管 AI 可解释性取得了重大进展，但仍然需要明确的指导方针，说明在医疗环境中何时以及在多大程度上需要解释。我们提出了一种新颖的分类系统，该系统具有四种不同的解释必要性类别，指导所需的解释级别：患者或样本（局部）级别、队列或数据集（全局）级别，或两个级别。我们引入了一个数学公式，该公式区分了这些类别，并为研究人员提供了一个实用框架，以确定医疗 AI 应用中所需的解释的必要性和深度。考虑了三个关键因素：评估协议的稳健性、专家观察的可变性以及应用程序的表示维数。从这个角度来看，我们解决了这个问题：AI 医疗应用何时需要解释，以及需要解释到何种程度？
 
-##### **Evaluating and Improving Graph to Text Generation with Large Language Models**
-2501.14497v1 by Jie He, Yijun Yang, Wanqiu Long, Deyi Xiong, Victor Gutierrez Basulto, Jeff Z. Pan
+##### **Interdisciplinary Expertise to Advance Equitable Explainable AI**
+2406.18563v1 by Chloe R. Bennett, Heather Cole-Lewis, Stephanie Farquhar, Naama Haamel, Boris Babenko, Oran Lang, Mat Fleck, Ilana Traynis, Charles Lau, Ivor Horn, Courtney Lyles
 
-Large language models (LLMs) have demonstrated immense potential across
-various tasks. However, research for exploring and improving the capabilities
-of LLMs in interpreting graph structures remains limited. To address this gap,
-we conduct a comprehensive evaluation of prompting current open-source LLMs on
-graph-to-text generation tasks. Although we explored the optimal prompting
-strategies and proposed a novel and effective diversity-difficulty-based
-few-shot sample selection method, we found that the improvements from
-tuning-free approaches were incremental, as LLMs struggle with planning on
-complex graphs, particularly those with a larger number of triplets. To further
-improve LLMs in planning with graph sequences and grounding in truth, we
-introduce a new graph-to-text dataset, PlanGTG, annotated with two sub-tasks:
-reordering and attribution. Through extensive automatic and human evaluations,
-we demonstrate significant improvements in the quality of generated text from
-both few-shot learning and fine-tuning perspectives using the PlanGTG dataset.
-Our study paves the way for new research directions in graph-to-text
-generation. PlanGTG datasets can be found in https://github.com/probe2/kg_text.
+The field of artificial intelligence (AI) is rapidly influencing health and
+healthcare, but bias and poor performance persists for populations who face
+widespread structural oppression. Previous work has clearly outlined the need
+for more rigorous attention to data representativeness and model performance to
+advance equity and reduce bias. However, there is an opportunity to also
+improve the explainability of AI by leveraging best practices of social
+epidemiology and health equity to help us develop hypotheses for associations
+found. In this paper, we focus on explainable AI (XAI) and describe a framework
+for interdisciplinary expert panel review to discuss and critically assess AI
+model explanations from multiple perspectives and identify areas of bias and
+directions for future research. We emphasize the importance of the
+interdisciplinary expert panel to produce more accurate, equitable
+interpretations which are historically and contextually informed.
+Interdisciplinary panel discussions can help reduce bias, identify potential
+confounders, and identify opportunities for additional research where there are
+gaps in the literature. In turn, these insights can suggest opportunities for
+AI model improvement.
 
-摘要：大型語言模型（LLM）已在各種任務中展現出巨大的潛力。然而，探索和提升 LLM 在詮釋圖形結構方面的能力的研究仍然有限。為了解決這個差距，我們對提示目前開源的 LLM 執行圖形轉文字生成任務進行全面評估。儘管我們探索了最佳提示策略並提出了一種新穎且有效的基於多樣性難度的少樣本選擇方法，但我們發現無調校方法的改進是漸進的，因為 LLM 難以規劃複雜的圖形，特別是那些具有較多三元組的圖形。為了進一步提升 LLM 在圖形序列規劃和真實依據方面的能力，我們引入了一個新的圖形轉文字資料集 PlanGTG，並註解了兩個子任務：重新排序和歸因。透過廣泛的自動化和人工評估，我們證明了使用 PlanGTG 資料集從少樣本學習和微調角度產生文字的品質有顯著提升。我們的研究為圖形轉文字生成中的新研究方向鋪路。PlanGTG 資料集可以在 https://github.com/probe2/kg_text 中找到。
+摘要：人工智慧 (AI) 領域正快速影響著健康與醫療保健，但對於面臨廣泛結構性壓迫的人群來說，偏見和不良表現依然存在。先前的研究已清楚說明，需要更嚴格地注意資料代表性和模型效能，以促進公平性並減少偏見。然而，我們有機會透過運用社會流行病學和健康公平的最佳實務，來改善 AI 的可解釋性，以幫助我們針對發現的關聯性，發展假設。在本文中，我們專注於可解釋 AI (XAI)，並描述一個跨領域專家小組審查架構，以從多重觀點討論和批判性評估 AI 模型的解釋，並找出偏見領域和未來研究的方向。我們強調跨領域專家小組對於產生更準確、公平的詮釋至關重要，而這些詮釋是根據歷史和脈絡而來的。跨領域小組討論有助於減少偏見、找出潛在的混淆因素，並在文獻中有缺口時找出額外研究的機會。反過來，這些見解可以建議 AI 模型改進的機會。
 
-##### **Fast Think-on-Graph: Wider, Deeper and Faster Reasoning of Large Language Model on Knowledge Graph**
-2501.14300v1 by Xujian Liang, Zhaoquan Gu
+##### **"It depends": Configuring AI to Improve Clinical Usefulness Across Contexts**
+2407.11978v1 by Hubert D. Zając, Jorge M. N. Ribeiro, Silvia Ingala, Simona Gentile, Ruth Wanjohi, Samuel N. Gitau, Jonathan F. Carlsen, Michael B. Nielsen, Tariq O. Andersen
 
-Graph Retrieval Augmented Generation (GRAG) is a novel paradigm that takes
-the naive RAG system a step further by integrating graph information, such as
-knowledge graph (KGs), into large-scale language models (LLMs) to mitigate
-hallucination. However, existing GRAG still encounter limitations: 1) simple
-paradigms usually fail with the complex problems due to the narrow and shallow
-correlations capture from KGs 2) methods of strong coupling with KGs tend to be
-high computation cost and time consuming if the graph is dense. In this paper,
-we propose the Fast Think-on-Graph (FastToG), an innovative paradigm for
-enabling LLMs to think ``community by community" within KGs. To do this,
-FastToG employs community detection for deeper correlation capture and two
-stages community pruning - coarse and fine pruning for faster retrieval.
-Furthermore, we also develop two Community-to-Text methods to convert the graph
-structure of communities into textual form for better understanding by LLMs.
-Experimental results demonstrate the effectiveness of FastToG, showcasing
-higher accuracy, faster reasoning, and better explainability compared to the
-previous works.
+Artificial Intelligence (AI) repeatedly match or outperform radiologists in
+lab experiments. However, real-world implementations of radiological AI-based
+systems are found to provide little to no clinical value. This paper explores
+how to design AI for clinical usefulness in different contexts. We conducted 19
+design sessions and design interventions with 13 radiologists from 7 clinical
+sites in Denmark and Kenya, based on three iterations of a functional AI-based
+prototype. Ten sociotechnical dependencies were identified as crucial for the
+design of AI in radiology. We conceptualised four technical dimensions that
+must be configured to the intended clinical context of use: AI functionality,
+AI medical focus, AI decision threshold, and AI Explainability. We present four
+design recommendations on how to address dependencies pertaining to the medical
+knowledge, clinic type, user expertise level, patient context, and user
+situation that condition the configuration of these technical dimensions.
 
-摘要：圖表檢索增強生成 (GRAG) 是一種新穎的範例，它透過將圖表資訊（例如知識圖表 (KG)) 整合到大型語言模型 (LLM) 中，進一步提升了樸素的 RAG 系統以減輕幻覺。然而，現有的 GRAG 仍會遇到限制：1) 簡單的範例通常會因從 KG 中擷取的關聯性狹隘且淺薄而無法解決複雜的問題 2) 如果圖表很密集，與 KG 強耦合的方法往往會導致高運算成本和耗時。在本文中，我們提出了 Fast Think-on-Graph (FastToG)，這是一種創新的範例，可讓 LLM 在 KG 中「逐個社群」進行思考。為此，FastToG 使用社群偵測來擷取更深入的關聯性，並使用兩個階段的社群修剪（粗略修剪和精細修剪）來加快檢索速度。此外，我們還開發了兩種社群到文字的方法，將社群的圖表結構轉換為文字形式，以便 LLM 更容易理解。實驗結果證明了 FastToG 的有效性，與先前的研究相比，展示出更高的準確性、更快的推理速度和更好的可解釋性。
+摘要：人工智慧（AI）在實驗室實驗中不斷地與放射科醫師匹敵或表現得更出色。然而，發現放射科 AI 為基礎系統的實際執行幾乎沒有提供臨床價值。本文探討如何為 AI 設計在不同情境中臨床上的效用。我們根據功能性 AI 為基礎原型的三次迭代，在丹麥和肯亞的 7 個臨床場域與 13 位放射科醫師進行了 19 次設計會議和設計介入。十個社會技術依賴關係被認為對於放射科中 AI 的設計至關重要。我們概念化了四個技術面向，必須根據預期的臨床使用情境進行設定：AI 功能、AI 醫療重點、AI 決策門檻，以及 AI 可解釋性。我們提出四項設計建議，說明如何處理與醫療知識、診所類型、使用者專業知識等級、患者情境，以及影響這些技術面向設定的使用者情境相關的依賴關係。
 
-##### **Top Ten Challenges Towards Agentic Neural Graph Databases**
-2501.14224v1 by Jiaxin Bai, Zihao Wang, Yukun Zhou, Hang Yin, Weizhi Fei, Qi Hu, Zheye Deng, Jiayang Cheng, Tianshi Zheng, Hong Ting Tsang, Yisen Gao, Zhongwei Xie, Yufei Li, Lixin Fan, Binhang Yuan, Wei Wang, Lei Chen, Xiaofang Zhou, Yangqiu Song
+##### **Improving Health Professionals' Onboarding with AI and XAI for Trustworthy Human-AI Collaborative Decision Making**
+2405.16424v1 by Min Hun Lee, Silvana Xin Yi Choo, Shamala D/O Thilarajah
 
-Graph databases (GDBs) like Neo4j and TigerGraph excel at handling
-interconnected data but lack advanced inference capabilities. Neural Graph
-Databases (NGDBs) address this by integrating Graph Neural Networks (GNNs) for
-predictive analysis and reasoning over incomplete or noisy data. However, NGDBs
-rely on predefined queries and lack autonomy and adaptability. This paper
-introduces Agentic Neural Graph Databases (Agentic NGDBs), which extend NGDBs
-with three core functionalities: autonomous query construction, neural query
-execution, and continuous learning. We identify ten key challenges in realizing
-Agentic NGDBs: semantic unit representation, abductive reasoning, scalable
-query execution, and integration with foundation models like large language
-models (LLMs). By addressing these challenges, Agentic NGDBs can enable
-intelligent, self-improving systems for modern data-driven applications, paving
-the way for adaptable and autonomous data management solutions.
+With advanced AI/ML, there has been growing research on explainable AI (XAI)
+and studies on how humans interact with AI and XAI for effective human-AI
+collaborative decision-making. However, we still have a lack of understanding
+of how AI systems and XAI should be first presented to users without technical
+backgrounds. In this paper, we present the findings of semi-structured
+interviews with health professionals (n=12) and students (n=4) majoring in
+medicine and health to study how to improve onboarding with AI and XAI. For the
+interviews, we built upon human-AI interaction guidelines to create onboarding
+materials of an AI system for stroke rehabilitation assessment and AI
+explanations and introduce them to the participants. Our findings reveal that
+beyond presenting traditional performance metrics on AI, participants desired
+benchmark information, the practical benefits of AI, and interaction trials to
+better contextualize AI performance, and refine the objectives and performance
+of AI. Based on these findings, we highlight directions for improving
+onboarding with AI and XAI and human-AI collaborative decision-making.
 
-摘要：圖形資料庫（GDB），例如 Neo4j 和 TigerGraph，擅長處理相互連接的資料，但缺乏進階的推論能力。神經圖形資料庫（NGDB）透過整合圖形神經網路（GNN）來解決這個問題，以進行預測分析和對不完整或有雜訊的資料進行推理。然而，NGDB 依賴於預先定義的查詢，並且缺乏自主性和適應性。本文介紹了代理神經圖形資料庫（Agentic NGDB），它以三項核心功能擴充了 NGDB：自動查詢建構、神經查詢執行和持續學習。我們找出實現 Agentic NGDB 的十大關鍵挑戰：語義單元表示、演繹推理、可擴充查詢執行，以及與基礎模型（例如大型語言模型 (LLM)）整合。透過解決這些挑戰，Agentic NGDB 可以為現代資料驅動應用打造智慧且自我改善的系統，為適應性和自主資料管理解決方案鋪路。
+摘要：隨著先進的 AI/ML，對可解釋 AI (XAI) 的研究不斷增加，以及關於人類如何與 AI 和 XAI 互動以進行有效的人工智慧協作決策制定。然而，我們仍然缺乏對 AI 系統和 XAI 應如何首先呈現給沒有技術背景的用戶的了解。在本文中，我們展示了與醫療專業人員 (n=12) 和主修醫學和健康的學生 (n=4) 進行半結構化訪談的結果，以研究如何改善 AI 和 XAI 的入門。對於訪談，我們建立在人機互動準則之上，為中風康復評估和 AI 解釋的 AI 系統創建入門材料，並將它們介紹給參與者。我們的研究結果表明，除了呈現傳統的 AI 性能指標外，參與者還希望基准信息、AI 的實際好處以及交互試驗，以更好地將 AI 性能情境化，並完善 AI 的目標和性能。根據這些發現，我們強調了改進 AI 和 XAI 以及人機協作決策制定的入門方向。
 
-##### **GraphRAG under Fire**
-2501.14050v1 by Jiacheng Liang, Yuhui Wang, Changjiang Li, Rongyi Zhu, Tanqiu Jiang, Neil Gong, Ting Wang
+##### **Exploring Nutritional Impact on Alzheimer's Mortality: An Explainable AI Approach**
+2405.17502v1 by Ziming Liu, Longjian Liu, Robert E. Heidel, Xiaopeng Zhao
 
-GraphRAG advances retrieval-augmented generation (RAG) by structuring
-external knowledge as multi-scale knowledge graphs, enabling language models to
-integrate both broad context and granular details in their reasoning. While
-GraphRAG has demonstrated success across domains, its security implications
-remain largely unexplored. To bridge this gap, this work examines GraphRAG's
-vulnerability to poisoning attacks, uncovering an intriguing security paradox:
-compared to conventional RAG, GraphRAG's graph-based indexing and retrieval
-enhance resilience against simple poisoning attacks; meanwhile, the same
-features also create new attack surfaces. We present GRAGPoison, a novel attack
-that exploits shared relations in the knowledge graph to craft poisoning text
-capable of compromising multiple queries simultaneously. GRAGPoison employs
-three key strategies: i) relation injection to introduce false knowledge, ii)
-relation enhancement to amplify poisoning influence, and iii) narrative
-generation to embed malicious content within coherent text. Empirical
-evaluation across diverse datasets and models shows that GRAGPoison
-substantially outperforms existing attacks in terms of effectiveness (up to 98%
-success rate) and scalability (using less than 68% poisoning text). We also
-explore potential defensive measures and their limitations, identifying
-promising directions for future research.
+This article uses machine learning (ML) and explainable artificial
+intelligence (XAI) techniques to investigate the relationship between
+nutritional status and mortality rates associated with Alzheimers disease (AD).
+The Third National Health and Nutrition Examination Survey (NHANES III)
+database is employed for analysis. The random forest model is selected as the
+base model for XAI analysis, and the Shapley Additive Explanations (SHAP)
+method is used to assess feature importance. The results highlight significant
+nutritional factors such as serum vitamin B12 and glycated hemoglobin. The
+study demonstrates the effectiveness of random forests in predicting AD
+mortality compared to other diseases. This research provides insights into the
+impact of nutrition on AD and contributes to a deeper understanding of disease
+progression.
 
-摘要：GraphRAG 透過將外部知識結構化為多尺度知識圖譜，推動了檢索增強生成 (RAG)，使語言模型能夠在其推理中整合廣泛的背景和細微的細節。儘管 GraphRAG 在各個領域都已展現出成功，但其安全性影響在很大程度上仍未被探索。為了彌補這一差距，本研究探討了 GraphRAG 對投毒攻擊的脆弱性，揭示了一個有趣的安全悖論：與傳統的 RAG 相比，GraphRAG 基於圖表的索引和檢索增強了對簡單投毒攻擊的韌性；同時，相同的特徵也創造了新的攻擊面。我們提出了 GRAGPoison，這是一種新穎的攻擊，它利用知識圖譜中的共享關係來製作中毒文本，能夠同時危害多個查詢。GRAGPoison 採用了三項關鍵策略：i) 關係注入以引入錯誤的知識，ii) 關係增強以擴大投毒影響，以及 iii) 敘事生成以將惡意內容嵌入連貫的文本中。在各種數據集和模型上的經驗評估表明，GRAGPoison 在有效性（成功率高達 98%）和可擴展性（使用不到 68% 的投毒文本）方面都明顯優於現有的攻擊。我們還探討了潛在的防禦措施及其局限性，確定了未來研究的有希望的方向。
+摘要：本文使用機器學習 (ML) 和可解釋人工智慧 (XAI) 技術來探討營養狀況與阿茲海默症 (AD) 相關的死亡率之間的關係。採用第三次全國健康與營養檢查調查 (NHANES III) 資料庫進行分析。選擇隨機森林模型作為 XAI 分析的基礎模型，並使用 Shapley Additive Explanations (SHAP) 方法來評估特徵重要性。結果突顯了重要的營養因素，例如血清維生素 B12 和糖化血紅蛋白。該研究證明了隨機森林在預測 AD 死亡率方面相較於其他疾病的有效性。本研究提供了營養對 AD 的影響的見解，並有助於更深入地了解疾病的進展。
 
-##### **EICopilot: Search and Explore Enterprise Information over Large-scale Knowledge Graphs with LLM-driven Agents**
-2501.13746v1 by Yuhui Yun, Huilong Ye, Xinru Li, Ruojia Li, Jingfeng Deng, Li Li, Haoyi Xiong
+##### **Explainable AI Enhances Glaucoma Referrals, Yet the Human-AI Team Still Falls Short of the AI Alone**
+2407.11974v1 by Catalina Gomez, Ruolin Wang, Katharina Breininger, Corinne Casey, Chris Bradley, Mitchell Pavlak, Alex Pham, Jithin Yohannan, Mathias Unberath
 
-The paper introduces EICopilot, an novel agent-based solution enhancing
-search and exploration of enterprise registration data within extensive online
-knowledge graphs like those detailing legal entities, registered capital, and
-major shareholders. Traditional methods necessitate text-based queries and
-manual subgraph explorations, often resulting in time-consuming processes.
-EICopilot, deployed as a chatbot via Baidu Enterprise Search, improves this
-landscape by utilizing Large Language Models (LLMs) to interpret natural
-language queries. This solution automatically generates and executes Gremlin
-scripts, providing efficient summaries of complex enterprise relationships.
-Distinct feature a data pre-processing pipeline that compiles and annotates
-representative queries into a vector database of examples for In-context
-learning (ICL), a comprehensive reasoning pipeline combining Chain-of-Thought
-with ICL to enhance Gremlin script generation for knowledge graph search and
-exploration, and a novel query masking strategy that improves intent
-recognition for heightened script accuracy. Empirical evaluations demonstrate
-the superior performance of EICopilot, including speed and accuracy, over
-baseline methods, with the \emph{Full Mask} variant achieving a syntax error
-rate reduction to as low as 10.00% and an execution correctness of up to
-82.14%. These components collectively contribute to superior querying
-capabilities and summarization of intricate datasets, positioning EICopilot as
-a groundbreaking tool in the exploration and exploitation of large-scale
-knowledge graphs for enterprise information search.
+Primary care providers are vital for initial triage and referrals to
+specialty care. In glaucoma, asymptomatic and fast progression can lead to
+vision loss, necessitating timely referrals to specialists. However, primary
+eye care providers may not identify urgent cases, potentially delaying care.
+Artificial Intelligence (AI) offering explanations could enhance their referral
+decisions. We investigate how various AI explanations help providers
+distinguish between patients needing immediate or non-urgent specialist
+referrals. We built explainable AI algorithms to predict glaucoma surgery needs
+from routine eyecare data as a proxy for identifying high-risk patients. We
+incorporated intrinsic and post-hoc explainability and conducted an online
+study with optometrists to assess human-AI team performance, measuring referral
+accuracy and analyzing interactions with AI, including agreement rates, task
+time, and user experience perceptions. AI support enhanced referral accuracy
+among 87 participants (59.9%/50.8% with/without AI), though Human-AI teams
+underperformed compared to AI alone. Participants believed they included AI
+advice more when using the intrinsic model, and perceived it more useful and
+promising. Without explanations, deviations from AI recommendations increased.
+AI support did not increase workload, confidence, and trust, but reduced
+challenges. On a separate test set, our black-box and intrinsic models achieved
+an accuracy of 77% and 71%, respectively, in predicting surgical outcomes. We
+identify opportunities of human-AI teaming for glaucoma management in primary
+eye care, noting that while AI enhances referral accuracy, it also shows a
+performance gap compared to AI alone, even with explanations. Human involvement
+remains essential in medical decision making, underscoring the need for future
+research to optimize collaboration, ensuring positive experiences and safe AI
+use.
 
-摘要：本文介紹了 EICopilot，這是一種基於代理的新型解決方案，可增強在廣泛的線上知識圖譜中搜尋和探索企業註冊資料，例如詳細說明法律實體、註冊資本和主要股東的資料。傳統方法需要基於文字的查詢和手動子圖探索，通常會導致耗時的流程。EICopilot 部署為百度企業搜尋的聊天機器人，透過利用大型語言模型 (LLM) 來詮釋自然語言查詢，進而改善這項技術。此解決方案會自動產生並執行 Gremlin 腳本，提供複雜企業關係的有效摘要。其獨特功能為資料前處理管線，可將具代表性的查詢編譯並註解到範例的向量資料庫中，以進行脈絡中學習 (ICL)，這是一個結合了思考鏈與 ICL 的綜合推理管線，用於增強 Gremlin 腳本產生，以進行知識圖譜搜尋和探索，以及一種新穎的查詢遮罩策略，可改善意圖辨識，進而提高腳本準確度。實證評估顯示，EICopilot 的效能優於基線方法，包括速度和準確度，其中「完整遮罩」變體將語法錯誤率降低至低於 10.00%，執行正確率高達 82.14%。這些元件共同促成了優異的查詢功能和複雜資料集的摘要，將 EICopilot 定位為探索和利用大規模知識圖譜進行企業資訊搜尋的創新工具。
+摘要：<paragraph>初級保健提供者對於最初的分流和轉診到專科照護至關重要。在青光眼的情況下，無症狀且快速惡化可能導致視力喪失，因此需要及時轉診給專家。然而，初級眼科保健提供者可能無法識別緊急情況，可能會延誤照護。提供解釋的人工智慧 (AI) 可以加強他們的轉診決策。我們研究各種 AI 解釋如何幫助提供者區分需要立即或非緊急專科轉診的患者。我們建立了解釋性 AI 演算法，以從例行眼科護理資料預測青光眼手術需求，作為識別高風險患者的代理。我們納入了內在和事後解釋性，並與驗光師進行了一項線上研究，以評估人機團隊的表現，衡量轉診準確度並分析與 AI 的互動，包括同意率、任務時間和使用者體驗感知。在 87 名參與者中，AI 支援提高了轉診準確度（使用 AI/未使用的比例為 59.9%/50.8%），儘管人機團隊的表現不如單獨使用 AI。參與者認為他們在使用內在模型時更多地納入了 AI 建議，並認為它更有用且更有希望。沒有解釋，AI 建議的偏差會增加。AI 支援並未增加工作量、信心和信任，但減少了挑戰。在一個單獨的測試集中，我們的黑盒子和內在模型在預測手術結果方面分別達到了 77% 和 71% 的準確度。我們找出在初級眼科保健中，人機團隊合作管理青光眼的機會，並注意到雖然 AI 提高了轉診準確度，但即使有解釋，它也顯示出與單獨使用 AI 相比的效能差距。人類參與在醫療決策中仍然至關重要，這強調了未來研究優化協作、確保正面經驗和安全使用 AI 的必要性。</paragraph>
 
-##### **Pseudocode-Injection Magic: Enabling LLMs to Tackle Graph Computational Tasks**
-2501.13731v1 by Chang Gong, Wanrui Bian, Zhijie Zhang, Weiguo Zheng
+##### **Decoding Decision Reasoning: A Counterfactual-Powered Model for Knowledge Discovery**
+2406.18552v1 by Yingying Fang, Zihao Jin, Xiaodan Xing, Simon Walsh, Guang Yang
 
-Graph computational tasks are inherently challenging and often demand the
-development of advanced algorithms for effective solutions. With the emergence
-of large language models (LLMs), researchers have begun investigating their
-potential to address these tasks. However, existing approaches are constrained
-by LLMs' limited capability to comprehend complex graph structures and their
-high inference costs, rendering them impractical for handling large-scale
-graphs. Inspired by human approaches to graph problems, we introduce a novel
-framework, PIE (Pseudocode-Injection-Enhanced LLM Reasoning for Graph
-Computational Tasks), which consists of three key steps: problem understanding,
-prompt design, and code generation. In this framework, LLMs are tasked with
-understanding the problem and extracting relevant information to generate
-correct code. The responsibility for analyzing the graph structure and
-executing the code is delegated to the interpreter. We inject task-related
-pseudocodes into the prompts to further assist the LLMs in generating efficient
-code. We also employ cost-effective trial-and-error techniques to ensure that
-the LLM-generated code executes correctly. Unlike other methods that require
-invoking LLMs for each individual test case, PIE only calls the LLM during the
-code generation phase, allowing the generated code to be reused and
-significantly reducing inference costs. Extensive experiments demonstrate that
-PIE outperforms existing baselines in terms of both accuracy and computational
-efficiency.
+In medical imaging, particularly in early disease detection and prognosis
+tasks, discerning the rationale behind an AI model's predictions is crucial for
+evaluating the reliability of its decisions. Conventional explanation methods
+face challenges in identifying discernible decisive features in medical image
+classifications, where discriminative features are subtle or not immediately
+apparent. To bridge this gap, we propose an explainable model that is equipped
+with both decision reasoning and feature identification capabilities. Our
+approach not only detects influential image patterns but also uncovers the
+decisive features that drive the model's final predictions. By implementing our
+method, we can efficiently identify and visualise class-specific features
+leveraged by the data-driven model, providing insights into the decision-making
+processes of deep learning models. We validated our model in the demanding
+realm of medical prognosis task, demonstrating its efficacy and potential in
+enhancing the reliability of AI in healthcare and in discovering new knowledge
+in diseases where prognostic understanding is limited.
+
+摘要：在醫學影像中，特別是在早期疾病檢測和預後任務中，辨別 AI 模型預測背後的原理對於評估其決策的可靠性至關重要。傳統的解釋方法在識別醫學影像分類中可識別的決定性特徵時面臨挑戰，其中區別性特徵很微妙或並不明顯。為了彌合這一差距，我們提出了一個可解釋的模型，該模型具備決策推理和特徵識別能力。我們的做法不僅檢測有影響力的影像模式，還揭示了推動模型最終預測的決定性特徵。通過實施我們的模型，我們可以有效識別和視覺化由數據驅動模型利用的類特定特徵，從而深入了解深度學習模型的決策過程。我們在要求嚴格的醫學預後任務領域驗證了我們的模型，展示了其在提高 AI 在醫療保健中的可靠性和發現預後理解受限疾病的新知識方面的功效和潛力。
+
+##### **The Role of Emotions in Informational Support Question-Response Pairs in Online Health Communities: A Multimodal Deep Learning Approach**
+2405.13099v1 by Mohsen Jozani, Jason A. Williams, Ahmed Aleroud, Sarbottam Bhagat
+
+This study explores the relationship between informational support seeking
+questions, responses, and helpfulness ratings in online health communities. We
+created a labeled data set of question-response pairs and developed multimodal
+machine learning and deep learning models to reliably predict informational
+support questions and responses. We employed explainable AI to reveal the
+emotions embedded in informational support exchanges, demonstrating the
+importance of emotion in providing informational support. This complex
+interplay between emotional and informational support has not been previously
+researched. The study refines social support theory and lays the groundwork for
+the development of user decision aids. Further implications are discussed.
 
-摘要：圖表計算任務本質上具有挑戰性，而且通常需要開發先進的演算法才能有效解決。隨著大型語言模型 (LLM) 的出現，研究人員已開始探討其解決這些任務的可能性。然而，現有方法受到 LLM 理解複雜圖形結構的能力有限以及其高推理成本的限制，這使得它們不切實際地處理大規模圖形。受到人類解決圖形問題的方法啟發，我們引入了 PIE（偽代碼注入增強 LLM 圖形計算任務推理）這個新框架，它包含三個關鍵步驟：問題理解、提示設計和代碼生成。在此框架中，LLM 的任務是理解問題並擷取相關資訊以產生正確的代碼。分析圖形結構和執行代碼的責任委派給解釋器。我們將與任務相關的偽代碼注入提示中，以進一步協助 LLM 產生有效的代碼。我們還採用具有成本效益的試錯技術，以確保 LLM 生成的代碼正確執行。與需要為每個個別測試案例呼叫 LLM 的其他方法不同，PIE 僅在代碼產生階段呼叫 LLM，允許重複使用產生的代碼並大幅降低推理成本。大量的實驗證明，PIE 在準確性和計算效率方面都優於現有的基準。
+摘要：本研究探討線上健康社群中尋求資訊支持的問題、回應，以及有幫助的評分之間的關係。我們建立了一組標記的問答配對資料集，並開發了多模態機器學習和深度學習模型，以可靠地預測資訊支持問題和回應。我們採用可解釋的 AI 來揭示資訊支持交流中蘊含的情緒，證明情緒在提供資訊支持中的重要性。這種情緒支持和資訊支持之間的複雜交互作用以前並未被研究過。本研究改進了社會支持理論，並為使用者決策輔助工具的開發奠定了基礎。討論了進一步的影響。
 
-##### **CAPRAG: A Large Language Model Solution for Customer Service and Automatic Reporting using Vector and Graph Retrieval-Augmented Generation**
-2501.13993v1 by Hamza Landolsi, Kais Letaief, Nizar Taghouti, Ines Abdeljaoued-Tej
+##### **ChatGPT in Classrooms: Transforming Challenges into Opportunities in Education**
+2405.10645v1 by Harris Bin Munawar, Nikolaos Misirlis
 
-The introduction of new features and services in the banking sector often
-overwhelms customers, creating an opportunity for banks to enhance user
-experience through financial chatbots powered by large language models (LLMs).
-We initiated an AI agent designed to provide customers with relevant
-information about banking services and insights from annual reports. We
-proposed a hybrid Customer Analysis Pipeline Retrieval-Augmented Generation
-(CAPRAG) that effectively addresses both relationship-based and contextual
-queries, thereby improving customer engagement in the digital banking
-landscape. To implement this, we developed a processing pipeline to refine text
-data, which we utilized in two main frameworks: Vector RAG and Graph RAG. This
-dual approach enables us to populate both vector and graph databases with
-processed data for efficient retrieval. The Cypher query component is employed
-to effectively query the graph database. When a user submits a query, it is
-first expanded by a query expansion module before being routed to construct a
-final query from the hybrid Knowledge Base (KB). This final query is then sent
-to an open-source LLM for response generation. Overall, our innovative,
-designed to international banks, serves bank's customers in an increasingly
-complex digital environment, enhancing clarity and accessibility of
-information.
+In the era of exponential technology growth, one unexpected guest has claimed
+a seat in classrooms worldwide, Artificial Intelligence. Generative AI, such as
+ChatGPT, promises a revolution in education, yet it arrives with a double-edged
+sword. Its potential for personalized learning is offset by issues of cheating,
+inaccuracies, and educators struggling to incorporate it effectively into their
+lesson design. We are standing on the brink of this educational frontier, and
+it is clear that we need to navigate this terrain with a lot of care. This is a
+major challenge that could undermine the integrity and value of our educational
+process. So, how can we turn these challenges into opportunities? When used
+inappropriately, AI tools can become the perfect tool for the cut copy paste
+mentality, and quickly begin to corrode critical thinking, creativity, and deep
+understanding, the most important skills in our rapidly changing world.
+Teachers feel that they are not equipped to leverage this technology, widening
+the digital divide among educators and institutions. Addressing these concerns
+calls for an in depth research approach. We will employ empirical research,
+drawing on the Technology Acceptance Model, to assess the attitudes toward
+generative AI among educators and students. Understanding their perceptions,
+usage patterns, and hurdles is the first crucial step in creating an effective
+solution. The present study will be used as a process manual for future
+researchers to apply, running their own data, based on the steps explained here
 
-摘要：銀行業中新功能和服務的推出經常讓客戶感到不知所措，這為銀行透過大型語言模型 (LLM) 驅動的金融聊天機器人來提升使用者體驗創造了機會。我們啟動了一個人工智慧代理，旨在為客戶提供有關銀行服務和年度報告見解的相關資訊。我們提出了一個混合式客戶分析管道檢索擴充生成 (CAPRAG)，它有效地處理基於關係和情境式的查詢，從而提升數位銀行環境中的客戶參與度。為了實作這一點，我們開發了一個處理管道來精煉文字資料，我們在兩個主要架構中使用它：Vector RAG 和 Graph RAG。這種雙管齊下的方法讓我們能夠使用處理過的資料來填補向量和圖形資料庫，以利於有效檢索。Cypher 查詢元件用於有效查詢圖形資料庫。當使用者提交查詢時，它會先由查詢擴充模組擴充，然後再路由到混合式知識庫 (KB) 中建構最終查詢。然後這個最終查詢會傳送給開源 LLM 以產生回應。整體而言，我們創新的設計服務於國際銀行，在日益複雜的數位環境中服務銀行客戶，提升資訊的清晰度和可及性。
+摘要：在科技飛速發展的時代，一位意外的訪客已在全球教室中佔有一席之地，那就是人工智慧。生成式 AI，例如 ChatGPT，承諾在教育領域掀起一場革命，但它卻是一把雙面刃。它在個人化學習方面的潛力，卻因作弊、不準確以及教育工作者難以將其有效融入教學設計等問題而抵銷。我們正站在這教育前沿的邊緣，顯然我們需要非常小心地探索這片領域。這是一個重大的挑戰，可能會損害我們教育過程的完整性和價值。那麼，我們如何將這些挑戰轉化為機遇？當不適當地使用時，AI 工具可能會成為複製貼上心態的完美工具，並迅速腐蝕批判性思維、創造力和深入理解，這些都是我們快速變化的世界中最重要的技能。教師們覺得他們沒有能力利用這項技術，這擴大了教育工作者和機構之間的數位鴻溝。解決這些問題需要深入的研究方法。我們將採用實證研究，借鑑技術接受模型，來評估教育工作者和學生對生成式 AI 的態度。了解他們的看法、使用模式和障礙是創造有效解決方案的第一個關鍵步驟。本研究將作為未來研究人員應用的流程手冊，根據此處說明的步驟運行他們自己的數據
 
-##### **Dual-Branch HNSW Approach with Skip Bridges and LID-Driven Optimization**
-2501.13992v1 by Hy Nguyen, Nguyen Hung Nguyen, Nguyen Linh Bao Nguyen, Srikanth Thudumu, Hung Du, Rajesh Vasa, Kon Mouzakis
+##### **Evaluating the Explainable AI Method Grad-CAM for Breath Classification on Newborn Time Series Data**
+2405.07590v1 by Camelia Oprea, Mike Grüne, Mateusz Buglowski, Lena Olivier, Thorsten Orlikowsky, Stefan Kowalewski, Mark Schoberer, André Stollenwerk
 
-The Hierarchical Navigable Small World (HNSW) algorithm is widely used for
-approximate nearest neighbor (ANN) search, leveraging the principles of
-navigable small-world graphs. However, it faces some limitations. The first is
-the local optima problem, which arises from the algorithm's greedy search
-strategy, selecting neighbors based solely on proximity at each step. This
-often leads to cluster disconnections. The second limitation is that HNSW
-frequently fails to achieve logarithmic complexity, particularly in
-high-dimensional datasets, due to the exhaustive traversal through each layer.
-To address these limitations, we propose a novel algorithm that mitigates local
-optima and cluster disconnections while enhancing the construction speed,
-maintaining inference speed. The first component is a dual-branch HNSW
-structure with LID-based insertion mechanisms, enabling traversal from multiple
-directions. This improves outlier node capture, enhances cluster connectivity,
-accelerates construction speed and reduces the risk of local minima. The second
-component incorporates a bridge-building technique that bypasses redundant
-intermediate layers, maintaining inference and making up the additional
-computational overhead introduced by the dual-branch structure. Experiments on
-various benchmarks and datasets showed that our algorithm outperforms the
-original HNSW in both accuracy and speed. We evaluated six datasets across
-Computer Vision (CV), and Natural Language Processing (NLP), showing recall
-improvements of 18\% in NLP, and up to 30\% in CV tasks while reducing the
-construction time by up to 20\% and maintaining the inference speed. We did not
-observe any trade-offs in our algorithm. Ablation studies revealed that
-LID-based insertion had the greatest impact on performance, followed by the
-dual-branch structure and bridge-building components.
+With the digitalization of health care systems, artificial intelligence
+becomes more present in medicine. Especially machine learning shows great
+potential for complex tasks such as time series classification, usually at the
+cost of transparency and comprehensibility. This leads to a lack of trust by
+humans and thus hinders its active usage. Explainable artificial intelligence
+tries to close this gap by providing insight into the decision-making process,
+the actual usefulness of its different methods is however unclear. This paper
+proposes a user study based evaluation of the explanation method Grad-CAM with
+application to a neural network for the classification of breaths in time
+series neonatal ventilation data. We present the perceived usefulness of the
+explainability method by different stakeholders, exposing the difficulty to
+achieve actual transparency and the wish for more in-depth explanations by many
+of the participants.
 
-摘要：分層可導航小世界 (HNSW) 演算法廣泛用於近似最近鄰居 (ANN) 搜尋，並利用可導航小世界圖形的原理。然而，它面臨一些限制。第一個是局部最佳化問題，這源自於演算法的貪婪搜尋策略，在每個步驟中僅根據鄰近度來選擇鄰居。這通常會導致群集斷線。第二個限制是，由於透過每一層的窮舉式遍歷，HNSW 常常無法在高維度資料集中達成對數複雜度。為了解決這些限制，我們提出了一種新的演算法，它可以減輕局部最佳化和群集斷線，同時提高建構速度，並維持推論速度。第一個組成部分是一個具有基於 LID 的插入機制的雙分支 HNSW 結構，它能從多個方向進行遍歷。這改善了異常值節點的擷取，增強了群集連通性，加速了建構速度，並降低了局部最小值的風險。第二個組成部分包含一種橋樑建構技術，它繞過了多餘的中間層，維持推論並彌補了雙分支結構所帶來的額外運算負擔。在各種基準和資料集上的實驗顯示，我們的演算法在準確度和速度上都優於原始的 HNSW。我們評估了電腦視覺 (CV) 和自然語言處理 (NLP) 中的六個資料集，顯示 NLP 中的召回率提高了 18%，CV 任務中提高了 30%，同時將建構時間縮短了 20%，並維持了推論速度。我們沒有在我們的演算法中觀察到任何取捨。消融研究顯示，基於 LID 的插入對效能的影響最大，其次是雙分支結構和橋樑建構組成部分。
+摘要：隨著醫療保健系統的數位化，人工智慧在醫學領域中變得更加普及。特別是機器學習在時間序列分類等複雜任務中展現出極大的潛力，但通常是以透明度和可理解性為代價。這導致人類缺乏信任，從而阻礙了其積極使用。可解釋的人工智慧試圖通過提供對決策過程的洞察來彌補這一差距，但其不同方法的實際效用尚不清楚。本文提出了一個基於使用者研究的評估，其中包含了 Grad-CAM 解釋方法，並將其應用於神經網路以分類時間序列新生兒呼吸數據中的呼吸。我們展示了不同利益相關者對可解釋性方法的感知效用，揭示了實現實際透明度的難度，以及許多參與者希望獲得更深入的解釋。
 
-##### **Comprehensive Modeling and Question Answering of Cancer Clinical Practice Guidelines using LLMs**
-2501.13984v1 by Bhumika Gupta, Pralaypati Ta, Keerthi Ram, Mohanasankar Sivaprakasam
+##### **XAI4LLM. Let Machine Learning Models and LLMs Collaborate for Enhanced In-Context Learning in Healthcare**
+2405.06270v3 by Fatemeh Nazary, Yashar Deldjoo, Tommaso Di Noia, Eugenio di Sciascio
 
-The updated recommendations on diagnostic procedures and treatment pathways
-for a medical condition are documented as graphical flows in Clinical Practice
-Guidelines (CPGs). For effective use of the CPGs in helping medical
-professionals in the treatment decision process, it is necessary to fully
-capture the guideline knowledge, particularly the contexts and their
-relationships in the graph. While several existing works have utilized these
-guidelines to create rule bases for Clinical Decision Support Systems, limited
-work has been done toward directly capturing the full medical knowledge
-contained in CPGs. This work proposes an approach to create a contextually
-enriched, faithful digital representation of National Comprehensive Cancer
-Network (NCCN) Cancer CPGs in the form of graphs using automated extraction and
-node & relationship classification. We also implement semantic enrichment of
-the model by using Large Language Models (LLMs) for node classification,
-achieving an accuracy of 80.86% and 88.47% with zero-shot learning and few-shot
-learning, respectively. Additionally, we introduce a methodology for answering
-natural language questions with constraints to guideline text by leveraging
-LLMs to extract the relevant subgraph from the guideline knowledge base. By
-generating natural language answers based on subgraph paths and semantic
-information, we mitigate the risk of incorrect answers and hallucination
-associated with LLMs, ensuring factual accuracy in medical domain Question
-Answering.
+The integration of Large Language Models (LLMs) into healthcare diagnostics
+offers a promising avenue for clinical decision-making. This study outlines the
+development of a novel method for zero-shot/few-shot in-context learning (ICL)
+by integrating medical domain knowledge using a multi-layered structured
+prompt. We also explore the efficacy of two communication styles between the
+user and LLMs: the Numerical Conversational (NC) style, which processes data
+incrementally, and the Natural Language Single-Turn (NL-ST) style, which
+employs long narrative prompts.
+  Our study systematically evaluates the diagnostic accuracy and risk factors,
+including gender bias and false negative rates, using a dataset of 920 patient
+records in various few-shot scenarios. Results indicate that traditional
+clinical machine learning (ML) models generally outperform LLMs in zero-shot
+and few-shot settings. However, the performance gap narrows significantly when
+employing few-shot examples alongside effective explainable AI (XAI) methods as
+sources of domain knowledge. Moreover, with sufficient time and an increased
+number of examples, the conversational style (NC) nearly matches the
+performance of ML models. Most notably, LLMs demonstrate comparable or superior
+cost-sensitive accuracy relative to ML models.
+  This research confirms that, with appropriate domain knowledge and tailored
+communication strategies, LLMs can significantly enhance diagnostic processes.
+The findings highlight the importance of optimizing the number of training
+examples and communication styles to improve accuracy and reduce biases in LLM
+applications.
 
-摘要：已更新的醫療狀況診斷程序和治療途徑建議，以臨床實務指南 (CPG) 中的圖形流程記錄。為了有效使用 CPG 協助醫療專業人員進行治療決策，必須完整擷取指南知識，特別是圖表中的脈絡及其關係。雖然現有許多研究已利用這些指南為臨床決策支援系統建立規則基礎，但直接擷取 CPG 中包含的完整醫療知識的工作卻有限。這項研究提出了一種方法，以自動化擷取和節點與關係分類的方式，建立脈絡豐富、忠實的國家綜合癌症網路 (NCCN) 癌症 CPG 圖形數位表示。我們也透過使用大型語言模型 (LLM) 進行節點分類，實作模型的語意豐富化，分別在零次學習和少次學習中達到 80.86% 和 88.47% 的準確度。此外，我們引進了一種方法，透過運用 LLM 從指南知識庫中擷取相關子圖，來回答具有指南文字限制的自然語言問題。透過根據子圖路徑和語意資訊產生自然語言答案，我們降低了與 LLM 相關的錯誤答案和幻覺風險，確保了醫療領域問題解答中的事實準確性。
+摘要：大型語言模型 (LLM) 與醫療診斷整合
+為臨床決策提供了一個有前景的途徑。本研究概述了一種新穎方法的開發，用於零次學習/少量學習情境學習 (ICL)，方法是使用多層結構化提示整合醫療領域知識。我們還探討了使用者與 LLM 之間兩種溝通方式的功效：數值對話 (NC) 方式，它會逐步處理資料，以及自然語言單回合 (NL-ST) 方式，它會使用長篇敘事提示。
+我們的研究系統性地評估了診斷準確性和風險因子，包括性別偏見和假陰性率，使用了一個包含 920 個患者記錄的資料集，採用各種少量學習情境。結果表明，傳統的臨床機器學習 (ML) 模型通常在零次學習和少量學習設定中表現優於 LLM。然而，當使用少量學習範例以及有效的可解釋 AI (XAI) 方法作為領域知識來源時，效能差距會顯著縮小。此外，隨著時間充足和範例數量增加，對話方式 (NC) 幾乎可以媲美 ML 模型的效能。最值得注意的是，LLM 相對於 ML 模型展現出相當或更佳的成本敏感準確度。
+本研究證實，透過適當的領域知識和量身打造的溝通策略，LLM 可以顯著增強診斷程序。這些發現突顯了最佳化訓練範例數量和溝通方式的重要性，以提高準確度並減少 LLM 應用中的偏差。
 
-##### **LLM-Assisted Knowledge Graph Completion for Curriculum and Domain Modelling in Personalized Higher Education Recommendations**
-2501.12300v1 by Hasan Abu-Rasheed, Constance Jumbo, Rashed Al Amin, Christian Weber, Veit Wiese, Roman Obermaisser, Madjid Fathi
+##### **To Trust or Not to Trust: Towards a novel approach to measure trust for XAI systems**
+2405.05766v1 by Miquel Miró-Nicolau, Gabriel Moyà-Alcover, Antoni Jaume-i-Capó, Manuel González-Hidalgo, Maria Gemma Sempere Campello, Juan Antonio Palmer Sancho
 
-While learning personalization offers great potential for learners, modern
-practices in higher education require a deeper consideration of domain models
-and learning contexts, to develop effective personalization algorithms. This
-paper introduces an innovative approach to higher education curriculum
-modelling that utilizes large language models (LLMs) for knowledge graph (KG)
-completion, with the goal of creating personalized learning-path
-recommendations. Our research focuses on modelling university subjects and
-linking their topics to corresponding domain models, enabling the integration
-of learning modules from different faculties and institutions in the student's
-learning path. Central to our approach is a collaborative process, where LLMs
-assist human experts in extracting high-quality, fine-grained topics from
-lecture materials. We develop a domain, curriculum, and user models for
-university modules and stakeholders. We implement this model to create the KG
-from two study modules: Embedded Systems and Development of Embedded Systems
-Using FPGA. The resulting KG structures the curriculum and links it to the
-domain models. We evaluate our approach through qualitative expert feedback and
-quantitative graph quality metrics. Domain experts validated the relevance and
-accuracy of the model, while the graph quality metrics measured the structural
-properties of our KG. Our results show that the LLM-assisted graph completion
-approach enhances the ability to connect related courses across disciplines to
-personalize the learning experience. Expert feedback also showed high
-acceptance of the proposed collaborative approach for concept extraction and
-classification.
+The increasing reliance on Deep Learning models, combined with their inherent
+lack of transparency, has spurred the development of a novel field of study
+known as eXplainable AI (XAI) methods. These methods seek to enhance the trust
+of end-users in automated systems by providing insights into the rationale
+behind their decisions. This paper presents a novel approach for measuring user
+trust in XAI systems, allowing their refinement. Our proposed metric combines
+both performance metrics and trust indicators from an objective perspective. To
+validate this novel methodology, we conducted a case study in a realistic
+medical scenario: the usage of XAI system for the detection of pneumonia from
+x-ray images.
 
-摘要：<paragraph>在學習個人化提供學習者巨大潛力的同時，高等教育中的現代實務需要更深入地考慮領域模型和學習情境，以開發有效的個人化演算法。本文介紹了一種創新的高等教育課程建模方法，該方法利用大型語言模型 (LLM) 來完成知識圖譜 (KG)，目的是建立個人化的學習路徑建議。我們的研究重點在於建模大學科目，並將它們的主題連結到對應的領域模型，從而能夠將來自不同院系和機構的學習模組整合到學生的學習路徑中。我們的做法核心是一個協作流程，其中 LLM 協助人類專家從講義材料中萃取高品質、細緻的主題。我們為大學模組和利害關係人開發了領域、課程和使用者模型。我們實作這個模型，從兩個研究模組建立 KG：嵌入式系統和使用 FPGA 的嵌入式系統開發。產生的 KG 建構了課程並將其連結到領域模型。我們透過定性專家回饋和定量圖形品質指標來評估我們的做法。領域專家驗證了模型的相關性和準確性，而圖形品質指標則測量了我們 KG 的結構特性。我們的結果顯示，LLM 輔助的圖形完成方法增強了跨學科連結相關課程的能力，以個人化學習體驗。專家回饋也顯示高度接受所提出的協作方法，用於概念萃取和分類。</paragraph>
+摘要：隨著對深度學習模型依賴性的增加，加上其固有的透明度不足，促使一個新的研究領域發展，稱為可解釋 AI (XAI) 方法。這些方法旨在透過深入了解決策背後的原理，來提升最終使用者對自動化系統的信賴。本文提出了一種衡量使用者對 XAI 系統信賴度的新穎方法，允許對其進行改進。我們提出的指標結合了客觀觀點下的效能指標和信賴指標。為了驗證這個新穎的方法，我們在一個真實的醫療場景中進行了一個案例研究：使用 XAI 系統從 X 光影像中偵測肺炎。
 
-##### **Divide-Then-Aggregate: An Efficient Tool Learning Method via Parallel Tool Invocation**
-2501.12432v1 by Dongsheng Zhu, Weixian Shi, Zhengliang Shi, Zhaochun Ren, Shuaiqiang Wang, Lingyong Yan, Dawei Yin
+##### **Region-specific Risk Quantification for Interpretable Prognosis of COVID-19**
+2405.02815v1 by Zhusi Zhong, Jie Li, Zhuoqi Ma, Scott Collins, Harrison Bai, Paul Zhang, Terrance Healey, Xinbo Gao, Michael K. Atalay, Zhicheng Jiao
 
-Although current Large Language Models (LLMs) exhibit impressive
-capabilities, performing complex real-world tasks still requires tool learning.
-Mainstream methods, such as CoT/ReAct, rely on step-by-step tool invocation to
-interact with external environments, but they are limited in perceptual scope
-and lack adequate task-planning capability. To address these limitations, other
-studies introduce the first Search-based Decision Tree (DFSDT), which still
-suffers from the high computational cost. In this paper, we introduce a novel
-parallel tool invocation paradigm, DTA-Llama (Divide-Then-Aggregate Llama).
-First, we transform traditional tree-based tool search paths into Directed
-Acyclic Graph (DAG) structure, generating a high-quality parallel tool
-invocation dataset. The DTA-Llama is then trained on the dataset to learn to
-iteratively divide the current task into several parallel tool invocation
-sub-tasks and aggregate the invocation results to decide the next actions.
-Furthermore, we introduce an efficient inference framework inspired by the
-Process/Threads mechanism when applying the DTA-Llama to practical tasks.
-Experimental results show that our approach substantially enhances task
-performance while reducing token consumption and inference time. Llama2-7B,
-using our method, is comparable to the official parallel function calling
-method of GPT-3.5. The relevant code, dataset, and model weights are available
-at https://corn0205.github.io/
+The COVID-19 pandemic has strained global public health, necessitating
+accurate diagnosis and intervention to control disease spread and reduce
+mortality rates. This paper introduces an interpretable deep survival
+prediction model designed specifically for improved understanding and trust in
+COVID-19 prognosis using chest X-ray (CXR) images. By integrating a large-scale
+pretrained image encoder, Risk-specific Grad-CAM, and anatomical region
+detection techniques, our approach produces regional interpretable outcomes
+that effectively capture essential disease features while focusing on rare but
+critical abnormal regions. Our model's predictive results provide enhanced
+clarity and transparency through risk area localization, enabling clinicians to
+make informed decisions regarding COVID-19 diagnosis with better understanding
+of prognostic insights. We evaluate the proposed method on a multi-center
+survival dataset and demonstrate its effectiveness via quantitative and
+qualitative assessments, achieving superior C-indexes (0.764 and 0.727) and
+time-dependent AUCs (0.799 and 0.691). These results suggest that our
+explainable deep survival prediction model surpasses traditional survival
+analysis methods in risk prediction, improving interpretability for clinical
+decision making and enhancing AI system trustworthiness.
 
-摘要：儘管目前的大型語言模型 (LLM) 展現出令人印象深刻的能力，但執行複雜的真實世界任務仍需要工具學習。主流方法（例如 CoT/ReAct）依賴逐步工具呼叫與外部環境互動，但它們的感知範圍有限，且缺乏足夠的任務規劃能力。為了解決這些限制，其他研究引入了第一個基於搜尋的決策樹 (DFSDT)，但仍有很高的運算成本。在本文中，我們介紹了一種新穎的平行工具呼叫範例，DTA-Llama（分而合之 Llama）。首先，我們將傳統的基於樹的工具搜尋路徑轉換為有向無環圖 (DAG) 結構，產生高品質的平行工具呼叫資料集。然後在資料集上訓練 DTA-Llama，學習反覆將當前任務分成幾個平行工具呼叫子任務，並彙總呼叫結果以決定後續動作。此外，我們在將 DTA-Llama 應用於實際任務時，引入了一個受 Process/Threads 機制啟發的高效推論框架。實驗結果表明，我們的做法大幅提升了任務效能，同時減少了符號消耗和推論時間。使用我們方法的 Llama2-7B，可與 GPT-3.5 的官方平行函式呼叫方法相媲美。相關程式碼、資料集和模型權重可在 https://corn0205.github.io/ 取得
+摘要：COVID-19 疫情對全球公共衛生造成壓力，必須進行準確的診斷和干預，以控制疾病傳播並降低死亡率。本文介紹了一個可解釋的深度生存預測模型，專門設計用於透過胸部 X 光 (CXR) 影像改善對 COVID-19 預後的理解和信賴。透過整合大規模預訓練影像編碼器、風險特定 Grad-CAM 和解剖區域偵測技術，我們的做法產生區域可解釋的結果，有效捕捉必要的疾病特徵，同時專注於罕見但關鍵的異常區域。我們的模型預測結果透過風險區域定位提供增強的清晰度和透明度，讓臨床醫生能夠在更了解預後見解的情況下，就 COVID-19 診斷做出明智的決策。我們在多中心生存資料集上評估所提出的方法，並透過量化和質化評估證明其有效性，達到優異的 C 指數（0.764 和 0.727）和時間相關 AUC（0.799 和 0.691）。這些結果表明，我們可解釋的深度生存預測模型在風險預測方面超越傳統的生存分析方法，提升臨床決策的解釋性，並增強 AI 系統的信賴度。
+
+##### **Rad4XCNN: a new agnostic method for post-hoc global explanation of CNN-derived features by means of radiomics**
+2405.02334v2 by Francesco Prinzi, Carmelo Militello, Calogero Zarcaro, Tommaso Vincenzo Bartolotta, Salvatore Gaglio, Salvatore Vitabile
+
+In recent years, machine learning-based clinical decision support systems
+(CDSS) have played a key role in the analysis of several medical conditions.
+Despite their promising capabilities, the lack of transparency in AI models
+poses significant challenges, particularly in medical contexts where
+reliability is a mandatory aspect. However, it appears that explainability is
+inversely proportional to accuracy. For this reason, achieving transparency
+without compromising predictive accuracy remains a key challenge. This paper
+presents a novel method, namely Rad4XCNN, to enhance the predictive power of
+CNN-derived features with the inherent interpretability of radiomic features.
+Rad4XCNN diverges from conventional methods based on saliency maps, by
+associating intelligible meaning to CNN-derived features by means of Radiomics,
+offering new perspectives on explanation methods beyond visualization maps.
+Using a breast cancer classification task as a case study, we evaluated
+Rad4XCNN on ultrasound imaging datasets, including an online dataset and two
+in-house datasets for internal and external validation. Some key results are:
+i) CNN-derived features guarantee more robust accuracy when compared against
+ViT-derived and radiomic features; ii) conventional visualization map methods
+for explanation present several pitfalls; iii) Rad4XCNN does not sacrifice
+model accuracy for their explainability; iv) Rad4XCNN provides a global
+explanation enabling the physician to extract global insights and findings. Our
+method can mitigate some concerns related to the explainability-accuracy
+trade-off. This study highlighted the importance of proposing new methods for
+model explanation without affecting their accuracy.
 
-##### **InsTALL: Context-aware Instructional Task Assistance with Multi-modal Large Language Models**
-2501.12231v1 by Pha Nguyen, Sailik Sengupta, Girik Malik, Arshit Gupta, Bonan Min
+摘要：<paragraph>近年来，基于机器学习的临床决策支持系统 (CDSS) 在多种疾病的分析中扮演了关键角色。尽管它们具有广阔的前景，但 AI 模型缺乏透明度，尤其在医疗领域，可靠性是强制性方面，这带来了重大挑战。然而，解释性似乎与准确性成反比。因此，在不影响预测准确性的情况下实现透明度仍然是一个关键挑战。本文提出了一种新方法，即 Rad4XCNN，以通过放射组学的内在可解释性来增强 CNN 衍生特征的预测能力。Rad4XCNN 通过放射组学将可理解的含义与 CNN 衍生特征关联起来，从而偏离了基于显着性图的传统方法，为超越可视化图的解释方法提供了新的视角。使用乳腺癌分类任务作为案例研究，我们在超声成像数据集上评估了 Rad4XCNN，包括一个在线数据集和两个用于内部和外部验证的内部数据集。一些关键结果是：i) 与 ViT 衍生和放射组学特征相比，CNN 衍生特征保证了更稳健的准确性；ii) 用于解释的传统可视化图方法存在一些缺陷；iii) Rad4XCNN 不会为了可解释性而牺牲模型准确性；iv) Rad4XCNN 提供全局解释，使医生能够提取全局见解和发现。我们的方法可以减轻一些与可解释性-准确性权衡相关的担忧。本研究强调了提出新方法来解释模型而不影响其准确性的重要性。</paragraph>
 
-The improved competence of generative models can help building multi-modal
-virtual assistants that leverage modalities beyond language. By observing
-humans performing multi-step tasks, one can build assistants that have
-situational awareness of actions and tasks being performed, enabling them to
-cater assistance based on this understanding. In this paper, we develop a
-Context-aware Instructional Task Assistant with Multi-modal Large Language
-Models (InsTALL) that leverages an online visual stream (e.g. a user's screen
-share or video recording) and responds in real-time to user queries related to
-the task at hand. To enable useful assistance, InsTALL 1) trains a multi-modal
-model on task videos and paired textual data, and 2) automatically extracts
-task graph from video data and leverages it at training and inference time. We
-show InsTALL achieves state-of-the-art performance across proposed sub-tasks
-considered for multimodal activity understanding -- task recognition (TR),
-action recognition (AR), next action prediction (AP), and plan prediction (PP)
--- and outperforms existing baselines on two novel sub-tasks related to
-automatic error identification.
+##### **Attributing Responsibility in AI-Induced Incidents: A Computational Reflective Equilibrium Framework for Accountability**
+2404.16957v1 by Yunfei Ge, Quanyan Zhu
 
-摘要：生成模型能力的提升有助于构建利用语言之外的多模态虚拟助手。通过观察人类执行多步骤任务，可以构建对正在执行的动作和任务有情境感知的助手，使他们能够根据这种理解提供帮助。在本文中，我们开发了一个具有多模态大语言模型的上下文感知指令任务助手 (InsTALL)，该助手利用在线视觉流（例如用户的屏幕共享或视频录制），并实时响应与手头任务相关的用户查询。为了提供有用的帮助，InsTALL 1) 在任务视频和配对文本数据上训练多模态模型，以及 2) 从视频数据中自动提取任务图，并在训练和推理时间利用它。我们展示了 InsTALL 在考虑用于多模态活动理解的提议子任务中实现了最先进的性能——任务识别 (TR)、动作识别 (AR)、下一个动作预测 (AP) 和计划预测 (PP)——并且在与自动错误识别相关的两个新子任务上优于现有的基准。
+The pervasive integration of Artificial Intelligence (AI) has introduced
+complex challenges in the responsibility and accountability in the event of
+incidents involving AI-enabled systems. The interconnectivity of these systems,
+ethical concerns of AI-induced incidents, coupled with uncertainties in AI
+technology and the absence of corresponding regulations, have made traditional
+responsibility attribution challenging. To this end, this work proposes a
+Computational Reflective Equilibrium (CRE) approach to establish a coherent and
+ethically acceptable responsibility attribution framework for all stakeholders.
+The computational approach provides a structured analysis that overcomes the
+limitations of conceptual approaches in dealing with dynamic and multifaceted
+scenarios, showcasing the framework's explainability, coherence, and adaptivity
+properties in the responsibility attribution process. We examine the pivotal
+role of the initial activation level associated with claims in equilibrium
+computation. Using an AI-assisted medical decision-support system as a case
+study, we illustrate how different initializations lead to diverse
+responsibility distributions. The framework offers valuable insights into
+accountability in AI-induced incidents, facilitating the development of a
+sustainable and resilient system through continuous monitoring, revision, and
+reflection.
 
-##### **Leveraging Graph Structures and Large Language Models for End-to-End Synthetic Task-Oriented Dialogues**
-2501.11977v1 by Maya Medjad, Hugo Imbert, Bruno Yun, Raphaël Szymocha, Frédéric Armetta
+摘要：隨著人工智慧 (AI) 的普及整合，在涉及 AI 驅動系統的事故中，責任和義務歸屬產生了複雜的挑戰。這些系統的互連性、AI 引發事故的倫理問題，加上 AI 技術的不確定性和缺乏相應法規，使得傳統責任歸屬面臨挑戰。為此，本研究提出了一種計算反思均衡 (CRE) 方法，以建立一個連貫且在倫理上可接受的責任歸屬架構，適用於所有利害關係人。計算方法提供了結構化的分析，克服了概念方法在處理動態且多面向情境時的限制，展示了該架構在責任歸屬過程中具備的可解釋性、連貫性和適應性。我們探討了與均衡計算中索賠相關的初始啟動層級的關鍵作用。我們以 AI 輔助醫療決策支援系統為案例研究，說明不同的初始化如何導致不同的責任分配。該架構提供了對 AI 引發事故中問責制的寶貴見解，透過持續監控、修訂和反思，促進了永續且有韌性的系統發展。
 
-Training task-oriented dialogue systems is both costly and time-consuming,
-due to the need for high-quality datasets encompassing diverse intents.
-Traditional methods depend on extensive human annotation, while recent
-advancements leverage large language models (LLMs) to generate synthetic data.
-However, these approaches often require custom prompts or code, limiting
-accessibility for non-technical users. We introduce GraphTOD, an end-to-end
-framework that simplifies the generation of task-oriented dialogues. Users can
-create dialogues by specifying transition graphs in JSON format. Our evaluation
-demonstrates that GraphTOD generates high-quality dialogues across various
-domains, significantly lowering the cost and complexity of dataset creation.
+##### **Explainable AI for Fair Sepsis Mortality Predictive Model**
+2404.13139v1 by Chia-Hsuan Chang, Xiaoyang Wang, Christopher C. Yang
 
-摘要：訓練任務導向對話系統既昂貴又耗時，
-因為需要包含各種意圖的高品質資料集。
-傳統方法依賴於廣泛的人工標註，而最近
-的進展利用大型語言模型 (LLM) 來產生合成資料。
-然而，這些方法通常需要自訂提示或程式碼，限制
-非技術使用者的可及性。我們介紹 GraphTOD，一個端對端的
-架構，簡化了任務導向對話的產生。使用者可以
-透過指定 JSON 格式的轉換圖表來建立對話。我們的評估
-證明 GraphTOD 在各種領域產生高品質對話，顯著降低資料集建立的成本和複雜性。
+Artificial intelligence supports healthcare professionals with predictive
+modeling, greatly transforming clinical decision-making. This study addresses
+the crucial need for fairness and explainability in AI applications within
+healthcare to ensure equitable outcomes across diverse patient demographics. By
+focusing on the predictive modeling of sepsis-related mortality, we propose a
+method that learns a performance-optimized predictive model and then employs
+the transfer learning process to produce a model with better fairness. Our
+method also introduces a novel permutation-based feature importance algorithm
+aiming at elucidating the contribution of each feature in enhancing fairness on
+predictions. Unlike existing explainability methods concentrating on explaining
+feature contribution to predictive performance, our proposed method uniquely
+bridges the gap in understanding how each feature contributes to fairness. This
+advancement is pivotal, given sepsis's significant mortality rate and its role
+in one-third of hospital deaths. Our method not only aids in identifying and
+mitigating biases within the predictive model but also fosters trust among
+healthcare stakeholders by improving the transparency and fairness of model
+predictions, thereby contributing to more equitable and trustworthy healthcare
+delivery.
 
-##### **Bridging Visualization and Optimization: Multimodal Large Language Models on Graph-Structured Combinatorial Optimization**
-2501.11968v1 by Jie Zhao, Kang Hao Cheong, Witold Pedrycz
+摘要：人工智慧透過預測模型協助醫療專業人員，大幅轉變了臨床決策制定。本研究探討了在醫療保健中使用人工智慧應用程式時公平性和可解釋性的關鍵需求，以確保在不同的患者人口統計資料中獲得公平的結果。透過專注於敗血症相關死亡率的預測模型，我們提出了一種方法，該方法會學習一個效能最佳化的預測模型，然後採用轉移學習過程來產生一個具有更好公平性的模型。我們的模型還引入了一種新穎的基於排列的特徵重要性演算法，旨在闡明每個特徵在增強預測公平性方面的貢獻。與現有的可解釋性方法專注於解釋特徵對預測效能的貢獻不同，我們提出的方法獨特地彌補了理解每個特徵如何有助於公平性的差距。這項進展至關重要，因為敗血症的死亡率很高，且在三分之一的醫院死亡中扮演著角色。我們的模型不僅有助於識別和減輕預測模型中的偏差，還能透過提高模型預測的透明度和公平性來培養醫療保健利益相關者之間的信任，進而有助於提供更公平且值得信賴的醫療保健服務。
 
-Graph-structured combinatorial challenges are inherently difficult due to
-their nonlinear and intricate nature, often rendering traditional computational
-methods ineffective or expensive. However, these challenges can be more
-naturally tackled by humans through visual representations that harness our
-innate ability for spatial reasoning. In this study, we propose transforming
-graphs into images to preserve their higher-order structural features
-accurately, revolutionizing the representation used in solving graph-structured
-combinatorial tasks. This approach allows machines to emulate human-like
-processing in addressing complex combinatorial challenges. By combining the
-innovative paradigm powered by multimodal large language models (MLLMs) with
-simple search techniques, we aim to develop a novel and effective framework for
-tackling such problems. Our investigation into MLLMs spanned a variety of
-graph-based tasks, from combinatorial problems like influence maximization to
-sequential decision-making in network dismantling, as well as addressing six
-fundamental graph-related issues. Our findings demonstrate that MLLMs exhibit
-exceptional spatial intelligence and a distinctive capability for handling
-these problems, significantly advancing the potential for machines to
-comprehend and analyze graph-structured data with a depth and intuition akin to
-human cognition. These results also imply that integrating MLLMs with simple
-optimization strategies could form a novel and efficient approach for
-navigating graph-structured combinatorial challenges without complex
-derivations, computationally demanding training and fine-tuning.
+##### **Multi Class Depression Detection Through Tweets using Artificial Intelligence**
+2404.13104v1 by Muhammad Osama Nusrat, Waseem Shahzad, Saad Ahmed Jamal
 
-摘要：圖形結構的組合挑戰本質上很困難，因為它們的非線性和複雜性，通常會使傳統的計算方法無效或昂貴。然而，人類可以透過利用我們天生的空間推理能力的視覺表徵，更自然地應對這些挑戰。在本研究中，我們建議將圖形轉換為影像，以準確保留它們的高階結構特徵，從而革新用於解決圖形結構組合任務的表徵。這種方法允許機器在解決複雜的組合挑戰時模擬類人的處理。透過結合由多模態大型語言模型 (MLLM) 提供動力的創新範例與簡單的搜尋技術，我們旨在為解決此類問題開發一個新穎且有效的架構。我們對 MLLM 的研究涵蓋了各種基於圖形的任務，從組合問題（如影響力最大化）到網路拆除中的順序決策制定，以及解決六個基本的圖形相關問題。我們的研究結果表明，MLLM 表現出非凡的空間智能和處理這些問題的獨特能力，顯著提升了機器以類似人類認知的深度和直覺來理解和分析圖形結構資料的潛力。這些結果還暗示，將 MLLM 與簡單的最佳化策略整合在一起，可以形成一種新穎且有效的方法，用於在沒有複雜推導、計算需求量大的訓練和微調的情況下應對圖形結構的組合挑戰。
+Depression is a significant issue nowadays. As per the World Health
+Organization (WHO), in 2023, over 280 million individuals are grappling with
+depression. This is a huge number; if not taken seriously, these numbers will
+increase rapidly. About 4.89 billion individuals are social media users. People
+express their feelings and emotions on platforms like Twitter, Facebook,
+Reddit, Instagram, etc. These platforms contain valuable information which can
+be used for research purposes. Considerable research has been conducted across
+various social media platforms. However, certain limitations persist in these
+endeavors. Particularly, previous studies were only focused on detecting
+depression and the intensity of depression in tweets. Also, there existed
+inaccuracies in dataset labeling. In this research work, five types of
+depression (Bipolar, major, psychotic, atypical, and postpartum) were predicted
+using tweets from the Twitter database based on lexicon labeling. Explainable
+AI was used to provide reasoning by highlighting the parts of tweets that
+represent type of depression. Bidirectional Encoder Representations from
+Transformers (BERT) was used for feature extraction and training. Machine
+learning and deep learning methodologies were used to train the model. The BERT
+model presented the most promising results, achieving an overall accuracy of
+0.96.
 
-##### **A Survey of Graph Retrieval-Augmented Generation for Customized Large Language Models**
-2501.13958v1 by Qinggang Zhang, Shengyuan Chen, Yuanchen Bei, Zheng Yuan, Huachi Zhou, Zijin Hong, Junnan Dong, Hao Chen, Yi Chang, Xiao Huang
+摘要：現今，憂鬱症是一個重要的議題。根據世界衛生組織 (WHO) 的資料，在 2023 年，超過 2.8 億人正在與憂鬱症搏鬥。這是一個龐大的數字；如果不認真看待，這些數字將會快速增加。大約有 48.9 億人是社群媒體使用者。人們在 Twitter、Facebook、Reddit、Instagram 等平台上表達自己的感受和情緒。這些平台包含有價值的資訊，可用於研究目的。已經在各種社群媒體平台上進行了大量的研究。然而，這些努力仍存在某些限制。特別是，先前的研究僅專注於偵測推文中的憂鬱症和憂鬱症的強度。此外，資料集標籤中存在不準確的情況。在這項研究工作中，使用基於詞彙標籤的 Twitter 資料庫中的推文預測了五種類型的憂鬱症（雙極型、重度、精神病型、非典型和產後）。可解釋的 AI 用於透過強調代表憂鬱症類型的推文部分來提供推理。從 Transformers（BERT）中提取的雙向編碼器表示用於特徵提取和訓練。機器學習和深度學習方法用於訓練模型。BERT 模型呈現出最有希望的結果，達到 0.96 的整體準確度。
 
-Large language models (LLMs) have demonstrated remarkable capabilities in a
-wide range of tasks, yet their application to specialized domains remains
-challenging due to the need for deep expertise. Retrieval-augmented generation
-(RAG) has emerged as a promising solution to customize LLMs for professional
-fields by seamlessly integrating external knowledge bases, enabling real-time
-access to domain-specific expertise during inference. Despite its potential,
-traditional RAG systems, based on flat text retrieval, face three critical
-challenges: (i) complex query understanding in professional contexts, (ii)
-difficulties in knowledge integration across distributed sources, and (iii)
-system efficiency bottlenecks at scale. This survey presents a systematic
-analysis of Graph-based Retrieval-Augmented Generation (GraphRAG), a new
-paradigm that revolutionizes domain-specific LLM applications. GraphRAG
-addresses traditional RAG limitations through three key innovations: (i)
-graph-structured knowledge representation that explicitly captures entity
-relationships and domain hierarchies, (ii) efficient graph-based retrieval
-techniques that enable context-preserving knowledge retrieval with multihop
-reasoning ability, and (iii) structure-aware knowledge integration algorithms
-that leverage retrieved knowledge for accurate and logical coherent generation
-of LLMs. In this survey, we systematically analyze the technical foundations of
-GraphRAG and examine current implementations across various professional
-domains, identifying key technical challenges and promising research
-directions. All the related resources of GraphRAG, including research papers,
-open-source data, and projects, are collected for the community in
-\textcolor{blue}{\url{https://github.com/DEEP-PolyU/Awesome-GraphRAG}}.
+##### **COIN: Counterfactual inpainting for weakly supervised semantic segmentation for medical images**
+2404.12832v2 by Dmytro Shvetsov, Joonas Ariva, Marharyta Domnich, Raul Vicente, Dmytro Fishman
 
-摘要：大型語言模型 (LLM) 已在各種任務中展現出非凡的能力，但由於需要深入的專業知識，因此將其應用於專業領域仍具有挑戰性。檢索增強生成 (RAG) 已成為一種有前途的解決方案，可通過無縫整合外部知識庫來客製化 LLM 以適用於專業領域，從而在推理過程中即時存取特定領域的專業知識。儘管有其潛力，但基於平面文字檢索的傳統 RAG 系統面臨三項關鍵挑戰：(i) 在專業情境中進行複雜的查詢理解，(ii) 難以整合分散來源的知識，以及 (iii) 系統效率瓶頸會隨著規模擴大而產生。本調查系統性地分析了圖形化檢索增強生成 (GraphRAG) 的技術基礎，GraphRAG 是一個新的典範，它徹底改變了特定領域的 LLM 應用。GraphRAG 透過三項關鍵創新來解決傳統 RAG 的限制：(i) 圖形結構化的知識表述，明確擷取實體關係和領域階層，(ii) 有效的圖形化檢索技術，可進行保留脈絡的知識檢索，並具備多跳推理能力，以及 (iii) 結構感知知識整合演算法，可利用檢索到的知識來進行 LLM 的準確且邏輯一致的生成。在本調查中，我們系統性地分析了 GraphRAG 的技術基礎，並檢視了在各種專業領域中的現有實作，找出關鍵技術挑戰和有前景的研究方向。所有 GraphRAG 的相關資源，包括研究論文、開放原始碼資料和專案，都已在 \textcolor{blue}{\url{https://github.com/DEEP-PolyU/Awesome-GraphRAG}} 中為社群收集。
+Deep learning is dramatically transforming the field of medical imaging and
+radiology, enabling the identification of pathologies in medical images,
+including computed tomography (CT) and X-ray scans. However, the performance of
+deep learning models, particularly in segmentation tasks, is often limited by
+the need for extensive annotated datasets. To address this challenge, the
+capabilities of weakly supervised semantic segmentation are explored through
+the lens of Explainable AI and the generation of counterfactual explanations.
+The scope of this research is development of a novel counterfactual inpainting
+approach (COIN) that flips the predicted classification label from abnormal to
+normal by using a generative model. For instance, if the classifier deems an
+input medical image X as abnormal, indicating the presence of a pathology, the
+generative model aims to inpaint the abnormal region, thus reversing the
+classifier's original prediction label. The approach enables us to produce
+precise segmentations for pathologies without depending on pre-existing
+segmentation masks. Crucially, image-level labels are utilized, which are
+substantially easier to acquire than creating detailed segmentation masks. The
+effectiveness of the method is demonstrated by segmenting synthetic targets and
+actual kidney tumors from CT images acquired from Tartu University Hospital in
+Estonia. The findings indicate that COIN greatly surpasses established
+attribution methods, such as RISE, ScoreCAM, and LayerCAM, as well as an
+alternative counterfactual explanation method introduced by Singla et al. This
+evidence suggests that COIN is a promising approach for semantic segmentation
+of tumors in CT images, and presents a step forward in making deep learning
+applications more accessible and effective in healthcare, where annotated data
+is scarce.
 
-##### **Network-informed Prompt Engineering against Organized Astroturf Campaigns under Extreme Class Imbalance**
-2501.11849v2 by Nikos Kanakaris, Heng Ping, Xiongye Xiao, Nesreen K. Ahmed, Luca Luceri, Emilio Ferrara, Paul Bogdan
+摘要：深度学习正大幅轉變醫學影像和放射線學領域，能辨識醫學影像中的病理，包括電腦斷層掃描 (CT) 和 X 光掃描。然而，深度學習模型的效能，特別是在分割任務中，常常受到廣泛註解資料集需求的限制。為了應對此挑戰，透過可解釋 AI 和反事實解釋的產生，探索弱監督語意分割的能力。本研究的範圍是開發一種新的反事實內插方法 (COIN)，該方法使用生成模型將預測的分類標籤從異常翻轉為正常。例如，如果分類器將輸入的醫學影像 X 視為異常，表示存在病理，則生成模型旨在內插異常區域，從而逆轉分類器的原始預測標籤。此方法使我們能夠產生病理的精確分割，而無需依賴於預先存在的分割遮罩。至關重要的是，利用影像層級標籤，這比建立詳細的分割遮罩容易取得。該方法的有效性透過分割合成目標和從愛沙尼亞塔爾圖大學醫院取得的 CT 影像中的實際腎臟腫瘤來證明。研究結果表明，COIN 遠遠超過已建立的歸因方法，例如 RISE、ScoreCAM 和 LayerCAM，以及 Singla 等人提出的另一種反事實解釋方法。此證據表明，COIN 是一種很有前途的 CT 影像中腫瘤語意分割方法，並在醫療保健中讓深度學習應用更易於取得和更有效率邁進一步，其中註解資料很稀少。
 
-Detecting organized political campaigns is of paramount importance in
-fighting against disinformation on social media. Existing approaches for the
-identification of such organized actions employ techniques mostly from network
-science, graph machine learning and natural language processing. Their ultimate
-goal is to analyze the relationships and interactions (e.g. re-posting) among
-users and the textual similarities of their posts. Despite their effectiveness
-in recognizing astroturf campaigns, these methods face significant challenges,
-notably the class imbalance in available training datasets. To mitigate this
-issue, recent methods usually resort to data augmentation or increasing the
-number of positive samples, which may not always be feasible or sufficient in
-real-world settings. Following a different path, in this paper, we propose a
-novel framework for identifying astroturf campaigns based solely on large
-language models (LLMs), introducing a Balanced Retrieval-Augmented Generation
-(Balanced RAG) component. Our approach first gives both textual information
-concerning the posts (in our case tweets) and the user interactions of the
-social network as input to a language model. Then, through prompt engineering
-and the proposed Balanced RAG method, it effectively detects coordinated
-disinformation campaigns on X (Twitter). The proposed framework does not
-require any training or fine-tuning of the language model. Instead, by
-strategically harnessing the strengths of prompt engineering and Balanced RAG,
-it facilitates LLMs to overcome the effects of class imbalance and effectively
-identify coordinated political campaigns. The experimental results demonstrate
-that by incorporating the proposed prompt engineering and Balanced RAG methods,
-our framework outperforms the traditional graph-based baselines, achieving
-2x-3x improvements in terms of precision, recall and F1 scores.
+##### **Hybrid Intelligence for Digital Humanities**
+2406.15374v1 by Victor de Boer, Lise Stork
 
-摘要：<paragraph>在社交媒體上對抗錯誤資訊，偵測有組織的政治宣傳活動至關重要。現有的此類有組織行動識別方法，大多採用網路科學、圖形機器學習和自然語言處理的技術。它們的最終目標是分析使用者之間的關係和互動（例如轉發），以及他們貼文的文字相似性。儘管這些方法在辨識草根運動宣傳活動方面很有效，但它們面臨嚴峻的挑戰，特別是可用訓練資料集中的類別不平衡。為了減輕這個問題，最近的方法通常訴諸於資料擴充或增加正向樣本數量，但在現實世界中可能並非總是可行或足夠。本文採取不同的途徑，我們提出了一個基於大型語言模型 (LLM) 的辨識草根運動宣傳活動的新架構，並引入了平衡檢索擴充產生 (Balanced RAG) 組件。我們的做法首先將有關貼文（在我們的案例中是推文）的文字資訊和社交網路的使用者互動作為輸入，輸入到語言模型中。然後，透過提示工程和提出的平衡檢索擴充產生方法，它有效地偵測 X（Twitter）上協調的不實資訊宣傳活動。提出的架構不需要任何語言模型的訓練或微調。相反地，透過策略性地利用提示工程和平衡檢索擴充產生方法的優勢，它使大型語言模型能夠克服類別不平衡的影響，並有效地識別協調的政治宣傳活動。實驗結果證明，透過整合提出的提示工程和平衡檢索擴充產生方法，我們的架構優於傳統的基於圖形的基準，在精確度、召回率和 F1 分數方面獲得 2x-3x 的改進。</paragraph>
+In this paper, we explore the synergies between Digital Humanities (DH) as a
+discipline and Hybrid Intelligence (HI) as a research paradigm. In DH research,
+the use of digital methods and specifically that of Artificial Intelligence is
+subject to a set of requirements and constraints. We argue that these are
+well-supported by the capabilities and goals of HI. Our contribution includes
+the identification of five such DH requirements: Successful AI systems need to
+be able to 1) collaborate with the (human) scholar; 2) support data criticism;
+3) support tool criticism; 4) be aware of and cater to various perspectives and
+5) support distant and close reading. We take the CARE principles of Hybrid
+Intelligence (collaborative, adaptive, responsible and explainable) as
+theoretical framework and map these to the DH requirements. In this mapping, we
+include example research projects. We finally address how insights from DH can
+be applied to HI and discuss open challenges for the combination of the two
+disciplines.
 
-##### **Large Language Models Meet Graph Neural Networks for Text-Numeric Graph Reasoning**
-2501.16361v1 by Haoran Song, Jiarui Feng, Guangfu Li, Michael Province, Philip Payne, Yixin Chen, Fuhai Li
+摘要：在本文中，我們探討數位人文學科 (DH) 作為一門學科與混合智能 (HI) 作為一個研究典範之間的協同作用。在 DH 研究中，數位方法的使用，特別是人工智慧的使用，受到一系列要求和限制。我們認為這些要求和限制獲得 HI 的能力和目標的充分支持。我們的貢獻包括找出五個這樣的 DH 要求：成功的 AI 系統需要能夠 1) 與（人類）學者合作；2) 支援資料批評；3) 支援工具批評；4) 察覺並迎合各種觀點；5) 支援遠距和近距離閱讀。我們將混合智能的 CARE 原則（協作、適應、負責和可解釋）作為理論架構，並將這些原則對應到 DH 要求。在此對應中，我們納入範例研究專案。最後，我們探討如何將 DH 的見解應用於 HI，並討論結合這兩個學科的開放挑戰。
 
-In real-world scientific discovery, human beings always make use of the
-accumulated prior knowledge with imagination pick select one or a few most
-promising hypotheses from large and noisy data analysis results. In this study,
-we introduce a new type of graph structure, the text-numeric graph (TNG), which
-is defined as graph entities and associations have both text-attributed
-information and numeric information. The TNG is an ideal data structure model
-for novel scientific discovery via graph reasoning because it integrates
-human-understandable textual annotations or prior knowledge, with numeric
-values that represent the observed or activation levels of graph entities or
-associations in different samples. Together both the textual information and
-numeric values determine the importance of graph entities and associations in
-graph reasoning for novel scientific knowledge discovery. We further propose
-integrating large language models (LLMs) and graph neural networks (GNNs) to
-analyze the TNGs for graph understanding and reasoning. To demonstrate the
-utility, we generated the text-omic(numeric) signaling graphs (TOSG), as one
-type of TNGs, in which all graphs have the same entities, associations and
-annotations, but have sample-specific entity numeric (omic) values using single
-cell RNAseq (scRNAseq) datasets of different diseases. We proposed joint
-LLM-GNN models for key entity mining and signaling pathway mining on the TOSGs.
-The evaluation results showed the LLM-GNN and TNGs models significantly improve
-classification accuracy and network inference. In conclusion, the TNGs and
-joint LLM-GNN models are important approaches for scientific discovery.
+##### **Ethical Framework for Responsible Foundational Models in Medical Imaging**
+2406.11868v1 by Abhijit Das, Debesh Jha, Jasmer Sanjotra, Onkar Susladkar, Suramyaa Sarkar, Ashish Rauniyar, Nikhil Tomar, Vanshali Sharma, Ulas Bagci
 
-摘要：<paragraph>在現實世界的科學發現中，人類總是利用累積的先驗知識，並運用想像力從大量且雜訊的資料分析結果中挑選出一個或幾個最有希望的假設。在本研究中，我們介紹了一種新型態的圖形結構，稱為文字數值圖 (TNG)，定義為圖形實體和關聯具有文字屬性資訊和數值資訊。TNG 是透過圖形推理進行新科學發現的理想資料結構模型，因為它整合了人類可理解的文字註解或先驗知識，以及代表圖形實體或不同樣本中關聯的觀察值或活化程度的數值。文字資訊和數值一起決定了圖形實體和關聯在圖形推理中對於新科學知識發現的重要性。我們進一步提出整合大型語言模型 (LLM) 和圖形神經網路 (GNN) 來分析 TNG，以進行圖形理解和推理。為了展示其效用，我們生成了文字組學（數值）訊號圖 (TOSG)，作為一種 TNG，其中所有圖形都具有相同的實體、關聯和註解，但具有特定於樣本的實體數值（組學）值，使用不同疾病的單細胞 RNAseq (scRNAseq) 資料集。我們針對 TOSG 提出聯合 LLM-GNN 模型，用於關鍵實體探勘和訊號路徑探勘。評估結果顯示，LLM-GNN 和 TNG 模型顯著提升了分類準確度和網路推論。結論而言，TNG 和聯合 LLM-GNN 模型是科學發現的重要方法。</paragraph>
+Foundational models (FMs) have tremendous potential to revolutionize medical
+imaging. However, their deployment in real-world clinical settings demands
+extensive ethical considerations. This paper aims to highlight the ethical
+concerns related to FMs and propose a framework to guide their responsible
+development and implementation within medicine. We meticulously examine ethical
+issues such as privacy of patient data, bias mitigation, algorithmic
+transparency, explainability and accountability. The proposed framework is
+designed to prioritize patient welfare, mitigate potential risks, and foster
+trust in AI-assisted healthcare.
 
-##### **Zep: A Temporal Knowledge Graph Architecture for Agent Memory**
-2501.13956v1 by Preston Rasmussen, Pavlo Paliychuk, Travis Beauvais, Jack Ryan, Daniel Chalef
+摘要：基礎模型 (FM) 具有徹底改變醫學影像的巨大潛力。然而，它們在現實世界臨床環境中的部署需要廣泛的倫理考量。本文旨在強調與 FM 相關的倫理問題，並提出一個框架來指導它們在醫學中的負責任開發和實施。我們仔細審查了倫理問題，例如患者數據隱私、偏差緩解、演算法透明度、可解釋性和問責制。所提出的框架旨在優先考慮患者福利、減輕潛在風險，並培養對 AI 輔助醫療保健的信任。
 
-We introduce Zep, a novel memory layer service for AI agents that outperforms
-the current state-of-the-art system, MemGPT, in the Deep Memory Retrieval (DMR)
-benchmark. Additionally, Zep excels in more comprehensive and challenging
-evaluations than DMR that better reflect real-world enterprise use cases. While
-existing retrieval-augmented generation (RAG) frameworks for large language
-model (LLM)-based agents are limited to static document retrieval, enterprise
-applications demand dynamic knowledge integration from diverse sources
-including ongoing conversations and business data. Zep addresses this
-fundamental limitation through its core component Graphiti -- a
-temporally-aware knowledge graph engine that dynamically synthesizes both
-unstructured conversational data and structured business data while maintaining
-historical relationships. In the DMR benchmark, which the MemGPT team
-established as their primary evaluation metric, Zep demonstrates superior
-performance (94.8% vs 93.4%). Beyond DMR, Zep's capabilities are further
-validated through the more challenging LongMemEval benchmark, which better
-reflects enterprise use cases through complex temporal reasoning tasks. In this
-evaluation, Zep achieves substantial results with accuracy improvements of up
-to 18.5% while simultaneously reducing response latency by 90% compared to
-baseline implementations. These results are particularly pronounced in
-enterprise-critical tasks such as cross-session information synthesis and
-long-term context maintenance, demonstrating Zep's effectiveness for deployment
-in real-world applications.
+##### **Advancements in Radiomics and Artificial Intelligence for Thyroid Cancer Diagnosis**
+2404.07239v1 by Milad Yousefi, Shadi Farabi Maleki, Ali Jafarizadeh, Mahya Ahmadpour Youshanlui, Aida Jafari, Siamak Pedrammehr, Roohallah Alizadehsani, Ryszard Tadeusiewicz, Pawel Plawiak
 
-摘要：我們推出 Zep，這是一種新穎的記憶層服務，適用於 AI 代理，其在深度記憶擷取 (DMR) 基準測試中優於現行的最先進系統 MemGPT。此外，Zep 在比 DMR 更全面且更具挑戰性的評估中表現出色，這些評估更能反映真實世界的企業用例。雖然現有的檢索增強生成 (RAG) 架構僅限於大型語言模型 (LLM) 基於代理的靜態文件檢索，但企業應用需要從包括正在進行的對話和業務數據在內的不同來源動態整合知識。Zep 通過其核心組件 Graphiti 來解決這個基本限制，Graphiti 是一個時間感知知識圖譜引擎，可以在維護歷史關係的同時動態綜合非結構化對話數據和結構化業務數據。在 MemGPT 團隊確立為其主要評估指標的 DMR 基準測試中，Zep 表現出優異的效能（94.8% 對 93.4%）。除了 DMR 之外，Zep 的功能還通過更具挑戰性的 LongMemEval 基準測試進一步得到驗證，該基準測試通過複雜的時間推理任務更好地反映了企業用例。在這個評估中，Zep 以高達 18.5% 的準確度改進取得了顯著的成果，同時與基線實作相比，將回應延遲降低了 90%。這些成果在企業關鍵任務中尤為明顯，例如跨會話資訊綜合和長期脈絡維護，證明了 Zep 在實際應用中部署的有效性。
+Thyroid cancer is an increasing global health concern that requires advanced
+diagnostic methods. The application of AI and radiomics to thyroid cancer
+diagnosis is examined in this review. A review of multiple databases was
+conducted in compliance with PRISMA guidelines until October 2023. A
+combination of keywords led to the discovery of an English academic publication
+on thyroid cancer and related subjects. 267 papers were returned from the
+original search after 109 duplicates were removed. Relevant studies were
+selected according to predetermined criteria after 124 articles were eliminated
+based on an examination of their abstract and title. After the comprehensive
+analysis, an additional six studies were excluded. Among the 28 included
+studies, radiomics analysis, which incorporates ultrasound (US) images,
+demonstrated its effectiveness in diagnosing thyroid cancer. Various results
+were noted, some of the studies presenting new strategies that outperformed the
+status quo. The literature has emphasized various challenges faced by AI
+models, including interpretability issues, dataset constraints, and operator
+dependence. The synthesized findings of the 28 included studies mentioned the
+need for standardization efforts and prospective multicenter studies to address
+these concerns. Furthermore, approaches to overcome these obstacles were
+identified, such as advances in explainable AI technology and personalized
+medicine techniques. The review focuses on how AI and radiomics could transform
+the diagnosis and treatment of thyroid cancer. Despite challenges, future
+research on multidisciplinary cooperation, clinical applicability validation,
+and algorithm improvement holds the potential to improve patient outcomes and
+diagnostic precision in the treatment of thyroid cancer.
 
-##### **Explainable Lane Change Prediction for Near-Crash Scenarios Using Knowledge Graph Embeddings and Retrieval Augmented Generation**
-2501.11560v1 by M. Manzour, A. Ballardini, R. Izquierdo, M. Á. Sotelo
+摘要：甲狀腺癌是一種日益嚴重的全球健康問題，需要先進的診斷方法。本篇評論探討了人工智能與放射特徵分析在甲狀腺癌診斷中的應用。在符合 PRISMA 指南的情況下，對多個資料庫進行了回顧，直到 2023 年 10 月。通過結合關鍵字，發現了一篇關於甲狀腺癌和相關主題的英文學術出版物。在移除 109 篇重複文獻後，原始搜尋共回傳 267 篇論文。在根據預先確定的標準，淘汰了 124 篇文章的摘要和標題後，選出了相關研究。在進行全面分析後，額外排除了六項研究。在納入的 28 項研究中，結合超音波 (US) 影像的放射特徵分析，證明了其在診斷甲狀腺癌方面的有效性。研究結果不一，有些研究提出了優於現狀的新策略。文獻強調了人工智能模型面臨的各種挑戰，包括可解釋性問題、資料集限制和操作員依賴性。28 項納入研究的綜合發現提到，需要標準化工作和前瞻性多中心研究來解決這些問題。此外，還確定了克服這些障礙的方法，例如可解釋人工智能技術和個人化醫療技術的進步。本篇評論重點探討了人工智能和放射特徵分析如何轉變甲狀腺癌的診斷和治療。儘管存在挑戰，但未來對多學科合作、臨床適用性驗證和演算法改進的研究，仍有潛力改善甲狀腺癌治療中的患者預後和診斷精準度。
 
-Lane-changing maneuvers, particularly those executed abruptly or in risky
-situations, are a significant cause of road traffic accidents. However, current
-research mainly focuses on predicting safe lane changes. Furthermore, existing
-accident datasets are often based on images only and lack comprehensive sensory
-data. In this work, we focus on predicting risky lane changes using the CRASH
-dataset (our own collected dataset specifically for risky lane changes), and
-safe lane changes (using the HighD dataset). Then, we leverage KG and Bayesian
-inference to predict these maneuvers using linguistic contextual information,
-enhancing the model's interpretability and transparency. The model achieved a
-91.5% f1-score with anticipation time extending to four seconds for risky lane
-changes, and a 90.0% f1-score for predicting safe lane changes with the same
-anticipation time. We validate our model by integrating it into a vehicle
-within the CARLA simulator in scenarios that involve risky lane changes. The
-model managed to anticipate sudden lane changes, thus providing automated
-vehicles with further time to plan and execute appropriate safe reactions.
-Finally, to enhance the explainability of our model, we utilize RAG to provide
-clear and natural language explanations for the given prediction.
+##### **Predictive Modeling for Breast Cancer Classification in the Context of Bangladeshi Patients: A Supervised Machine Learning Approach with Explainable AI**
+2404.04686v1 by Taminul Islam, Md. Alif Sheakh, Mst. Sazia Tahosin, Most. Hasna Hena, Shopnil Akash, Yousef A. Bin Jardan, Gezahign Fentahun Wondmie, Hiba-Allah Nafidi, Mohammed Bourhia
 
-摘要：換車道動作，尤其是突然或在風險情況下執行的動作，是道路交通事故的重要原因。然而，目前的研究所主要集中在預測安全的換車道。此外，現有的事故資料集通常僅基於影像，且缺乏全面的感測資料。在這項工作中，我們專注於使用 CRASH 資料集（我們自己收集的專門針對風險換車道資料集）來預測風險換車道，以及安全換車道（使用 HighD 資料集）。然後，我們利用 KG 和貝氏推理來使用語言背景資訊預測這些動作，增強模型的可解釋性和透明度。該模型在風險換車道的預測時間延長至四秒時，達到了 91.5% 的 f1 分數，在預測安全換車道時，在相同的預測時間內達到了 90.0% 的 f1 分數。我們透過將模型整合到 CARLA 模擬器中的車輛中，在涉及風險換車道的場景中驗證我們的模型。該模型設法預測突然的換車道，從而為自動駕駛車輛提供了更多時間來規劃和執行適當的安全反應。最後，為了增強我們模型的可解釋性，我們利用 RAG 為給定的預測提供清晰且自然的語言解釋。
+Breast cancer has rapidly increased in prevalence in recent years, making it
+one of the leading causes of mortality worldwide. Among all cancers, it is by
+far the most common. Diagnosing this illness manually requires significant time
+and expertise. Since detecting breast cancer is a time-consuming process,
+preventing its further spread can be aided by creating machine-based forecasts.
+Machine learning and Explainable AI are crucial in classification as they not
+only provide accurate predictions but also offer insights into how the model
+arrives at its decisions, aiding in the understanding and trustworthiness of
+the classification results. In this study, we evaluate and compare the
+classification accuracy, precision, recall, and F-1 scores of five different
+machine learning methods using a primary dataset (500 patients from Dhaka
+Medical College Hospital). Five different supervised machine learning
+techniques, including decision tree, random forest, logistic regression, naive
+bayes, and XGBoost, have been used to achieve optimal results on our dataset.
+Additionally, this study applied SHAP analysis to the XGBoost model to
+interpret the model's predictions and understand the impact of each feature on
+the model's output. We compared the accuracy with which several algorithms
+classified the data, as well as contrasted with other literature in this field.
+After final evaluation, this study found that XGBoost achieved the best model
+accuracy, which is 97%.
 
-##### **Each Graph is a New Language: Graph Learning with LLMs**
-2501.11478v2 by Huachi Zhou, Jiahe Du, Chuang Zhou, Chang Yang, Yilin Xiao, Yuxuan Xie, Xiao Huang
+摘要：<paragraph>近年來，乳癌的盛行率迅速增加，使其成為全球主要的死亡原因之一。在所有癌症中，乳癌迄今為止是最常見的。手動診斷此疾病需要大量的時間和專業知識。由於乳癌的檢測過程耗時，因此透過建立機器學習模型來預測，有助於防止其進一步擴散。機器學習和可解釋 AI 在分類中至關重要，因為它們不僅可以提供準確的預測，還可以深入了解模型如何做出決策，有助於理解和信賴分類結果。在此研究中，我們評估並比較了五種不同的機器學習方法的分類準確度、精確度、召回率和 F1 分數，使用了一個主要的資料集（達卡醫學院醫院的 500 名患者）。五種不同的監督式機器學習技術，包括決策樹、隨機森林、邏輯迴歸、朴素貝氏和 XGBoost，已用於在我們的資料集上取得最佳結果。此外，本研究將 SHAP 分析應用於 XGBoost 模型，以解釋模型的預測並了解每個特徵對模型輸出的影響。我們比較了幾種演算法對資料進行分類的準確度，並與該領域的其他文獻進行對比。在最後評估後，本研究發現 XGBoost 達到了最佳的模型準確度，為 97%。</paragraph>
 
-Recent efforts leverage Large Language Models (LLMs) for modeling
-text-attributed graph structures in node classification tasks. These approaches
-describe graph structures for LLMs to understand or aggregate LLM-generated
-textual attribute embeddings through graph structure. However, these approaches
-face two main limitations in modeling graph structures with LLMs. (i) Graph
-descriptions become verbose in describing high-order graph structure. (ii)
-Textual attributes alone do not contain adequate graph structure information.
-It is challenging to model graph structure concisely and adequately with LLMs.
-LLMs lack built-in mechanisms to model graph structures directly. They also
-struggle with complex long-range dependencies between high-order nodes and
-target nodes.
-  Inspired by the observation that LLMs pre-trained on one language can achieve
-exceptional performance on another with minimal additional training, we propose
-\textbf{G}raph-\textbf{D}efined \textbf{L}anguage for \textbf{L}arge
-\textbf{L}anguage \textbf{M}odel (GDL4LLM). This novel framework enables LLMs
-to transfer their powerful language understanding capabilities to
-graph-structured data. GDL4LLM translates graphs into a graph language corpus
-instead of graph descriptions and pre-trains LLMs on this corpus to adequately
-understand graph structures. During fine-tuning, this corpus describes the
-structural information of target nodes concisely with only a few tokens. By
-treating graphs as a new language, GDL4LLM enables LLMs to model graph
-structures adequately and concisely for node classification tasks. Extensive
-experiments on three real-world datasets demonstrate that GDL4LLM outperforms
-description-based and textual attribute embeddings-based baselines by
-efficiently modeling different orders of graph structure with LLMs.
+##### **Enhancing Breast Cancer Diagnosis in Mammography: Evaluation and Integration of Convolutional Neural Networks and Explainable AI**
+2404.03892v3 by Maryam Ahmed, Tooba Bibi, Rizwan Ahmed Khan, Sidra Nasir
 
-摘要：<paragraph>最近的研究利用大型语言模型 (LLM) 对节点分类任务中的文本属性图结构进行建模。这些方法描述图结构，以便 LLM 理解或通过图结构聚合 LLM 生成的文本属性嵌入。然而，这些方法在使用 LLM 对图结构进行建模时面临两个主要限制。(i) 图描述在描述高阶图结构时变得冗长。(ii) 仅文本属性不包含足够的图结构信息。使用 LLM 对图结构进行简洁且充分的建模具有挑战性。LLM 缺乏直接对图结构进行建模的内置机制。它们还难以处理高阶节点和目标节点之间复杂的远程依赖关系。
-受 LLM 在一种语言上进行预训练后，只需进行最少的额外训练即可在另一种语言上实现卓越性能的观察结果的启发，我们提出了**G**raph-**D**efined **L**anguage for **L**arge **L**anguage **M**odel (GDL4LLM)。此新框架使 LLM 能够将其强大的语言理解能力转移到结构化数据图。GDL4LLM 将图翻译成图语言语料库，而不是图描述，并在该语料库上对 LLM 进行预训练，以充分理解图结构。在微调期间，此语料库仅使用几个标记简洁地描述目标节点的结构信息。通过将图视为一种新语言，GDL4LLM 使 LLM 能够充分且简洁地对图结构进行建模，以用于节点分类任务。在三个真实世界数据集上进行的广泛实验表明，GDL4LLM 通过使用 LLM 有效地对不同阶的图结构进行建模，优于基于描述和基于文本属性嵌入的基线。</paragraph>
+The Deep learning (DL) models for diagnosing breast cancer from mammographic
+images often operate as "black boxes", making it difficult for healthcare
+professionals to trust and understand their decision-making processes. The
+study presents an integrated framework combining Convolutional Neural Networks
+(CNNs) and Explainable Artificial Intelligence (XAI) for the enhanced diagnosis
+of breast cancer using the CBIS-DDSM dataset. The methodology encompasses an
+elaborate data preprocessing pipeline and advanced data augmentation techniques
+to counteract dataset limitations and transfer learning using pre-trained
+networks such as VGG-16, Inception-V3 and ResNet was employed. A focal point of
+our study is the evaluation of XAI's effectiveness in interpreting model
+predictions, highlighted by utilizing the Hausdorff measure to assess the
+alignment between AI-generated explanations and expert annotations
+quantitatively. This approach is critical for XAI in promoting trustworthiness
+and ethical fairness in AI-assisted diagnostics. The findings from our research
+illustrate the effective collaboration between CNNs and XAI in advancing
+diagnostic methods for breast cancer, thereby facilitating a more seamless
+integration of advanced AI technologies within clinical settings. By enhancing
+the interpretability of AI driven decisions, this work lays the groundwork for
+improved collaboration between AI systems and medical practitioners, ultimately
+enriching patient care. Furthermore, the implications of our research extended
+well beyond the current methodologies. It encourages further research into how
+to combine multimodal data and improve AI explanations to meet the needs of
+clinical practice.
 
-##### **Few-shot Policy (de)composition in Conversational Question Answering**
-2501.11335v1 by Kyle Erwin, Guy Axelrod, Maria Chang, Achille Fokoue, Maxwell Crouse, Soham Dan, Tian Gao, Rosario Uceda-Sosa, Ndivhuwo Makondo, Naweed Khan, Alexander Gray
+摘要：深度學習 (DL) 用於從乳房攝影術影像診斷乳癌的模型通常以「黑盒子」方式運作，這使得醫療保健專業人員難以信任和理解其決策過程。本研究提出一個整合架構，結合卷積神經網路 (CNN) 和可解釋人工智慧 (XAI)，以使用 CBIS-DDSM 資料集增強乳癌的診斷。方法包含一個精細的資料前處理管線和進階資料擴充技術，以對抗資料集限制，並採用預先訓練的網路（例如 VGG-16、Inception-V3 和 ResNet）進行遷移學習。我們研究的重點是評估 XAI 在解釋模型預測中的有效性，重點利用豪斯多夫測度量化評估 AI 生成的解釋和專家註解之間的一致性。這種方法對於 XAI 在促進 AI 輔助診斷中的可信度和倫理公平性至關重要。我們研究的發現說明了 CNN 和 XAI 在推進乳癌診斷方法中的有效協作，從而促進了先進 AI 技術在臨床環境中的更順暢整合。透過增強 AI 驅動決策的可解釋性，這項工作為 AI 系統和醫療從業人員之間的改善協作奠定了基礎，最終豐富了患者照護。此外，我們研究的影響遠遠超出了目前的技術。它鼓勵進一步研究如何結合多模式資料並改善 AI 解釋，以滿足臨床實務的需求。
 
-The task of policy compliance detection (PCD) is to determine if a scenario
-is in compliance with respect to a set of written policies. In a conversational
-setting, the results of PCD can indicate if clarifying questions must be asked
-to determine compliance status. Existing approaches usually claim to have
-reasoning capabilities that are latent or require a large amount of annotated
-data. In this work, we propose logical decomposition for policy compliance
-(LDPC): a neuro-symbolic framework to detect policy compliance using large
-language models (LLMs) in a few-shot setting. By selecting only a few exemplars
-alongside recently developed prompting techniques, we demonstrate that our
-approach soundly reasons about policy compliance conversations by extracting
-sub-questions to be answered, assigning truth values from contextual
-information, and explicitly producing a set of logic statements from the given
-policies. The formulation of explicit logic graphs can in turn help answer
-PCDrelated questions with increased transparency and explainability. We apply
-this approach to the popular PCD and conversational machine reading benchmark,
-ShARC, and show competitive performance with no task-specific finetuning. We
-also leverage the inherently interpretable architecture of LDPC to understand
-where errors occur, revealing ambiguities in the ShARC dataset and highlighting
-the challenges involved with reasoning for conversational question answering.
+##### **Advancing Multimodal Data Fusion in Pain Recognition: A Strategy Leveraging Statistical Correlation and Human-Centered Perspectives**
+2404.00320v2 by Xingrui Gu, Zhixuan Wang, Irisa Jin, Zekun Wu
 
-摘要：策略合規偵測 (PCD) 的任務是確定場景是否符合一組書面策略。在對話設定中，PCD 的結果可以指出是否必須提出澄清問題以確定合規狀態。現有的方法通常聲稱具有潛在的推理能力，或需要大量的註釋資料。在這項工作中，我們提出策略合規的邏輯分解 (LDPC)：一種使用大型語言模型 (LLM) 在少次嘗試中偵測策略合規的神經符號框架。透過僅選擇少數範例以及最近開發的提示技術，我們證明我們的做法透過提取要回答的子問題、從脈絡資訊指派真值，以及從給定的策略明確產生一組邏輯陳述，對策略合規對話進行合理的推理。明確邏輯圖表的制定反過來可以幫助回答 PCD 相關問題，並提高透明度和可解釋性。我們將此方法應用於熱門的 PCD 和對話式機器閱讀基準 ShARC，並在沒有特定任務微調的情況下展現出競爭力。我們也利用 LDPC 固有的可解釋架構來了解錯誤發生在哪裡，揭露 ShARC 資料集中的歧義，並強調對話式問題解答推理的挑戰。
+This research presents a novel multimodal data fusion methodology for pain
+behavior recognition, integrating statistical correlation analysis with
+human-centered insights. Our approach introduces two key innovations: 1)
+integrating data-driven statistical relevance weights into the fusion strategy
+to effectively utilize complementary information from heterogeneous modalities,
+and 2) incorporating human-centric movement characteristics into multimodal
+representation learning for detailed modeling of pain behaviors. Validated
+across various deep learning architectures, our method demonstrates superior
+performance and broad applicability. We propose a customizable framework that
+aligns each modality with a suitable classifier based on statistical
+significance, advancing personalized and effective multimodal fusion.
+Furthermore, our methodology provides explainable analysis of multimodal data,
+contributing to interpretable and explainable AI in healthcare. By highlighting
+the importance of data diversity and modality-specific representations, we
+enhance traditional fusion techniques and set new standards for recognizing
+complex pain behaviors. Our findings have significant implications for
+promoting patient-centered healthcare interventions and supporting explainable
+clinical decision-making.
 
-##### **Reasoning Language Models: A Blueprint**
-2501.11223v3 by Maciej Besta, Julia Barth, Eric Schreiber, Ales Kubicek, Afonso Catarino, Robert Gerstenberger, Piotr Nyczyk, Patrick Iff, Yueling Li, Sam Houliston, Tomasz Sternal, Marcin Copik, Grzegorz Kwaśniewski, Jürgen Müller, Łukasz Flis, Hannes Eberhard, Hubert Niewiadomski, Torsten Hoefler
+摘要：本研究提出了一種創新的多模態數據融合方法，用於疼痛行為識別，將統計相關分析與以人為中心的見解相結合。我們的做法引入了兩項關鍵創新：1) 將數據驅動的統計相關權重整合到融合策略中，以有效利用來自異質模態的補充信息，以及 2) 將以人為中心的運動特徵納入多模態表示學習中，以詳細建模疼痛行為。我們的模型在各種深度學習架構中得到驗證，展示了卓越的性能和廣泛的適用性。我們提出了一個可自定義的框架，根據統計顯著性將每個模態與合適的分類器對齊，推進個性化和有效的多模態融合。此外，我們的模型提供對多模態數據的可解釋分析，有助於醫療保健中的可解釋和可解釋 AI。通過強調數據多樣性和模態特定表示的重要性，我們增強了傳統的融合技術，並為識別複雜的疼痛行為設定了新的標準。我們的發現對促進以患者為中心的醫療保健干預和支持可解釋的臨床決策制定具有重要意義。
 
-Reasoning language models (RLMs), also known as Large Reasoning Models
-(LRMs), such as OpenAI's o1 and o3, DeepSeek-V3, and Alibaba's QwQ, have
-redefined AI's problem-solving capabilities by extending LLMs with advanced
-reasoning mechanisms. Yet, their high costs, proprietary nature, and complex
-architectures - uniquely combining Reinforcement Learning (RL), search
-heuristics, and LLMs - present accessibility and scalability challenges. To
-address these, we propose a comprehensive blueprint that organizes RLM
-components into a modular framework, based on a survey and analysis of all RLM
-works. This blueprint incorporates diverse reasoning structures (chains, trees,
-graphs, and nested forms), reasoning strategies (e.g., Monte Carlo Tree Search,
-Beam Search), RL concepts (policy, value models and others), supervision
-schemes (Outcome-Based and Process-Based Supervision), and other related
-concepts (e.g., Test-Time Compute, Retrieval-Augmented Generation, agent
-tools). We also provide detailed mathematical formulations and algorithmic
-specifications to simplify RLM implementation. By showing how schemes like
-LLaMA-Berry, QwQ, Journey Learning, and Graph of Thoughts fit as special cases,
-we demonstrate the blueprint's versatility and unifying potential. To
-illustrate its utility, we introduce x1, a modular implementation for rapid RLM
-prototyping and experimentation. Using x1 and a literature review, we provide
-key insights, such as multi-phase training for policy and value models, and the
-importance of familiar training distributions. Finally, we discuss scalable RLM
-cloud deployments and we outline how RLMs can integrate with a broader LLM
-ecosystem. Our work demystifies RLM construction, democratizes advanced
-reasoning capabilities, and fosters innovation, aiming to mitigate the gap
-between "rich AI" and "poor AI" by lowering barriers to RLM design and
-experimentation.
+##### **Addressing Social Misattributions of Large Language Models: An HCXAI-based Approach**
+2403.17873v1 by Andrea Ferrario, Alberto Termine, Alessandro Facchini
+
+Human-centered explainable AI (HCXAI) advocates for the integration of social
+aspects into AI explanations. Central to the HCXAI discourse is the Social
+Transparency (ST) framework, which aims to make the socio-organizational
+context of AI systems accessible to their users. In this work, we suggest
+extending the ST framework to address the risks of social misattributions in
+Large Language Models (LLMs), particularly in sensitive areas like mental
+health. In fact LLMs, which are remarkably capable of simulating roles and
+personas, may lead to mismatches between designers' intentions and users'
+perceptions of social attributes, risking to promote emotional manipulation and
+dangerous behaviors, cases of epistemic injustice, and unwarranted trust. To
+address these issues, we propose enhancing the ST framework with a fifth
+'W-question' to clarify the specific social attributions assigned to LLMs by
+its designers and users. This addition aims to bridge the gap between LLM
+capabilities and user perceptions, promoting the ethically responsible
+development and use of LLM-based technology.
+
+摘要：以人为本的可解释 AI (HCXAI) 倡导将社会层面整合到 AI 解释中。HCXAI 话语的核心是社会透明度 (ST) 框架，其目标是让 AI 系统的社会组织背景对用户来说是可理解的。在这项工作中，我们建议扩展 ST 框架以解决大型语言模型 (LLM) 中社会错误归因的风险，尤其是在心理健康等敏感领域。事实上，LLM 能够出色地模拟角色和人格，这可能导致设计者的意图和用户对社会属性的认知之间出现错配，从而有风险促进情绪操纵和危险行为、认知不公正和不合理的信任。为了解决这些问题，我们建议用第五个“W 问题”来增强 ST 框架，以明确设计者和用户赋予 LLM 的具体社会属性。此补充旨在弥合 LLM 能力和用户认知之间的差距，促进基于 LLM 的技术在道德上负责任地开发和使用。
+
+##### **Clinical Domain Knowledge-Derived Template Improves Post Hoc AI Explanations in Pneumothorax Classification**
+2403.18871v1 by Han Yuan, Chuan Hong, Pengtao Jiang, Gangming Zhao, Nguyen Tuan Anh Tran, Xinxing Xu, Yet Yen Yan, Nan Liu
 
-摘要：推理語言模型 (RLM)，又稱為大型推理模型 (LRM)，例如 OpenAI 的 o1 和 o3、DeepSeek-V3 以及阿里巴巴的 QwQ，透過擴充 LLM 的先進推理機制，重新定義了 AI 的問題解決能力。然而，它們的高成本、專有性質和複雜架構（獨特地結合了強化學習 (RL)、搜尋啟發法和 LLM）提出了可及性和可擴充性的挑戰。為了解決這些問題，我們提出了一個全面的藍圖，將 RLM 組件組織成一個模組化架構，這是基於對所有 RLM 作品的調查和分析。此藍圖包含多樣化的推理結構（鏈、樹、圖和巢狀形式）、推理策略（例如蒙地卡羅樹搜尋、波束搜尋）、RL 概念（策略、價值模型等）、監督方案（基於結果和基於流程的監督）和其他相關概念（例如測試時間運算、檢索增強生成、代理工具）。我們還提供了詳細的數學公式和演算法規範，以簡化 RLM 的實作。透過展示 LLaMA-Berry、QwQ、Journey Learning 和 Graph of Thoughts 等方案如何作為特殊情況，我們展示了藍圖的多功能性和統一潛力。為了說明其效用，我們介紹了 x1，這是一個模組化實作，用於快速 RLM 原型製作和實驗。使用 x1 和文獻回顧，我們提供了關鍵見解，例如策略和價值模型的多階段訓練，以及熟悉訓練分佈的重要性。最後，我們討論了可擴充的 RLM 雲端部署，並概述了 RLM 如何與更廣泛的 LLM 生態系統整合。我們的研究揭開了 RLM 建構的神秘面紗，使先進的推理能力民主化，並促進創新，旨在透過降低 RLM 設計和實驗的障礙，來縮小「富裕 AI」和「貧窮 AI」之間的差距。
+Background: Pneumothorax is an acute thoracic disease caused by abnormal air
+collection between the lungs and chest wall. To address the opaqueness often
+associated with deep learning (DL) models, explainable artificial intelligence
+(XAI) methods have been introduced to outline regions related to pneumothorax
+diagnoses made by DL models. However, these explanations sometimes diverge from
+actual lesion areas, highlighting the need for further improvement. Method: We
+propose a template-guided approach to incorporate the clinical knowledge of
+pneumothorax into model explanations generated by XAI methods, thereby
+enhancing the quality of these explanations. Utilizing one lesion delineation
+created by radiologists, our approach first generates a template that
+represents potential areas of pneumothorax occurrence. This template is then
+superimposed on model explanations to filter out extraneous explanations that
+fall outside the template's boundaries. To validate its efficacy, we carried
+out a comparative analysis of three XAI methods with and without our template
+guidance when explaining two DL models in two real-world datasets. Results: The
+proposed approach consistently improved baseline XAI methods across twelve
+benchmark scenarios built on three XAI methods, two DL models, and two
+datasets. The average incremental percentages, calculated by the performance
+improvements over the baseline performance, were 97.8% in Intersection over
+Union (IoU) and 94.1% in Dice Similarity Coefficient (DSC) when comparing model
+explanations and ground-truth lesion areas. Conclusions: In the context of
+pneumothorax diagnoses, we proposed a template-guided approach for improving AI
+explanations. We anticipate that our template guidance will forge a fresh
+approach to elucidating AI models by integrating clinical domain expertise.
 
-##### **IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems**
-2501.11067v1 by Elad Levi, Ilan Kadar
+摘要：<paragraph>背景：氣胸是一種因肺部與胸壁之間異常集氣所引起的急性胸腔疾病。為了解決深度學習（DL）模型經常伴隨的不透明性，可解釋人工智慧（XAI）方法已被引入，用於概述與 DL 模型做出的氣胸診斷相關的區域。然而，這些解釋有時會與實際病灶區域有所出入，突顯出進一步改進的必要性。方法：我們提出了一種模板引導式方法，將氣胸的臨床知識納入 XAI 方法產生的模型解釋中，從而提升這些解釋的品質。利用放射科醫師建立的病灶描繪，我們的做法首先產生一個模板，用於表示氣胸可能發生的區域。然後將此模板疊加在模型解釋上，以篩選出超出模板邊界的無關解釋。為了驗證其效力，我們對三種 XAI 方法進行了比較分析，在兩個真實世界資料集中解釋兩個 DL 模型時，分別採用和不採用我們的模板引導。結果：所提出的方法在建立於三種 XAI 方法、兩個 DL 模型和兩個資料集的十二種基準情境中，始終改善了基準 XAI 方法。在比較模型解釋和真實病灶區域時，透過基準效能的效能改進計算出的平均增量百分比為交集比（IoU）的 97.8% 和骰子相似性係數（DSC）的 94.1%。結論：在氣胸診斷的背景下，我們提出了一種模板引導式方法，用於改善 AI 解釋。我們預期我們的模板引導將透過整合臨床領域專業知識，為闡明 AI 模型建立一種新方法。</paragraph>
 
-Large Language Models (LLMs) are transforming artificial intelligence,
-evolving into task-oriented systems capable of autonomous planning and
-execution. One of the primary applications of LLMs is conversational AI
-systems, which must navigate multi-turn dialogues, integrate domain-specific
-APIs, and adhere to strict policy constraints. However, evaluating these agents
-remains a significant challenge, as traditional methods fail to capture the
-complexity and variability of real-world interactions. We introduce
-IntellAgent, a scalable, open-source multi-agent framework designed to evaluate
-conversational AI systems comprehensively. IntellAgent automates the creation
-of diverse, synthetic benchmarks by combining policy-driven graph modeling,
-realistic event generation, and interactive user-agent simulations. This
-innovative approach provides fine-grained diagnostics, addressing the
-limitations of static and manually curated benchmarks with coarse-grained
-metrics. IntellAgent represents a paradigm shift in evaluating conversational
-AI. By simulating realistic, multi-policy scenarios across varying levels of
-complexity, IntellAgent captures the nuanced interplay of agent capabilities
-and policy constraints. Unlike traditional methods, it employs a graph-based
-policy model to represent relationships, likelihoods, and complexities of
-policy interactions, enabling highly detailed diagnostics. IntellAgent also
-identifies critical performance gaps, offering actionable insights for targeted
-optimization. Its modular, open-source design supports seamless integration of
-new domains, policies, and APIs, fostering reproducibility and community
-collaboration. Our findings demonstrate that IntellAgent serves as an effective
-framework for advancing conversational AI by addressing challenges in bridging
-research and deployment. The framework is available at
-https://github.com/plurai-ai/intellagent
+##### **Enhancing Neural Machine Translation of Low-Resource Languages: Corpus Development, Human Evaluation and Explainable AI Architectures**
+2403.01580v1 by Séamus Lankford
 
-摘要：大型語言模型 (LLM) 正在轉變人工智慧，演變成具備自主規劃和執行能力的任務導向系統。LLM 的主要應用之一是對話式 AI 系統，它必須應對多輪對話、整合特定領域的 API，並遵守嚴格的政策約束。然而，評估這些代理仍然是一項重大挑戰，因為傳統方法無法捕捉現實世界互動的複雜性和變異性。我們引入了 IntellAgent，一個可擴充、開放原始碼的多代理架構，旨在全面評估對話式 AI 系統。IntellAgent 自動化建立多樣化、合成的基準，方法是結合策略驅動的圖形建模、逼真的事件產生和互動使用者代理模擬。這種創新方法提供了細緻的診斷，解決了具有粗略指標的靜態和手動策劃基準的限制。IntellAgent 代表了評估對話式 AI 的典範轉移。通過模擬不同層級複雜性的逼真多策略場景，IntellAgent 捕捉到了代理功能和策略約束之間的細微交互。與傳統方法不同，它採用基於圖形的策略模型來表示策略交互的關係、可能性和複雜性，從而實現高度詳細的診斷。IntellAgent 還識別出關鍵效能差距，提供可行的見解，以進行目標最佳化。其模組化、開放原始碼的設計支援無縫整合新的領域、策略和 API，促進了可複製性和社群協作。我們的研究結果表明，IntellAgent 可作為一個有效的框架，透過解決研究和部署之間的挑戰來推進對話式 AI。這個框架可在 https://github.com/plurai-ai/intellagent 取得
+In the current machine translation (MT) landscape, the Transformer
+architecture stands out as the gold standard, especially for high-resource
+language pairs. This research delves into its efficacy for low-resource
+language pairs including both the English$\leftrightarrow$Irish and
+English$\leftrightarrow$Marathi language pairs. Notably, the study identifies
+the optimal hyperparameters and subword model type to significantly improve the
+translation quality of Transformer models for low-resource language pairs.
+  The scarcity of parallel datasets for low-resource languages can hinder MT
+development. To address this, gaHealth was developed, the first bilingual
+corpus of health data for the Irish language. Focusing on the health domain,
+models developed using this in-domain dataset exhibited very significant
+improvements in BLEU score when compared with models from the LoResMT2021
+Shared Task. A subsequent human evaluation using the multidimensional quality
+metrics error taxonomy showcased the superior performance of the Transformer
+system in reducing both accuracy and fluency errors compared to an RNN-based
+counterpart.
+  Furthermore, this thesis introduces adaptNMT and adaptMLLM, two open-source
+applications streamlined for the development, fine-tuning, and deployment of
+neural machine translation models. These tools considerably simplify the setup
+and evaluation process, making MT more accessible to both developers and
+translators. Notably, adaptNMT, grounded in the OpenNMT ecosystem, promotes
+eco-friendly natural language processing research by highlighting the
+environmental footprint of model development. Fine-tuning of MLLMs by adaptMLLM
+demonstrated advancements in translation performance for two low-resource
+language pairs: English$\leftrightarrow$Irish and
+English$\leftrightarrow$Marathi, compared to baselines from the LoResMT2021
+Shared Task.
 
+摘要：<paragraph>在當前機器翻譯 (MT) 領域中，Transformer 架構脫穎而出，成為黃金標準，特別是對於高資源語言對。本研究探討其對低資源語言對的效能，包括英語↔愛爾蘭語和英語↔馬拉地語語言對。值得注意的是，本研究識別出最佳超參數和子詞模型類型，以顯著提高 Transformer 模型對低資源語言對的翻譯品質。
+低資源語言的平行資料集的稀缺會阻礙 MT 的發展。為了解決這個問題，開發了 gaHealth，這是愛爾蘭語的第一個雙語健康資料語料庫。專注於健康領域，使用此域內資料集開發的模型在 BLEU 得分方面表現出非常顯著的進步，與 LoResMT2021 共享任務中的模型相比。隨後使用多維品質指標錯誤分類法進行的人工評估顯示，與基於 RNN 的對應模型相比，Transformer 系統在減少準確性和流暢性錯誤方面表現出優異的性能。
+此外，本論文介紹了 adaptNMT 和 adaptMLLM，這兩個開源應用程式簡化了神經機器翻譯模型的開發、微調和部署。這些工具大幅簡化了設定和評估流程，讓 MT 更容易讓開發人員和翻譯人員使用。值得注意的是，adaptNMT 以 OpenNMT 生態系統為基礎，通過強調模型開發的環境足跡來促進生態友好的自然語言處理研究。與 LoResMT2021 共享任務中的基準相比，adaptMLLM 對 MLLM 的微調證明了英語↔愛爾蘭語和英語↔馬拉地語這兩個低資源語言對的翻譯性能進步。</paragraph>
 
-### LLM
-|Publish Date|Title|Authors|Homepage|Code|
-| :---: | :---: | :---: | :---: | :---: |
-|**2025-02-13**|**Theoretical Benefit and Limitation of Diffusion Language Model**|Guhao Feng et.al.|[2502.09622v1](http://arxiv.org/abs/2502.09622v1)|null|
-|**2025-02-13**|**MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency**|Dongzhi Jiang et.al.|[2502.09621v1](http://arxiv.org/abs/2502.09621v1)|null|
-|**2025-02-13**|**Exploring the Potential of Encoder-free Architectures in 3D LMMs**|Yiwen Tang et.al.|[2502.09620v1](http://arxiv.org/abs/2502.09620v1)|null|
-|**2025-02-13**|**DexTrack: Towards Generalizable Neural Tracking Control for Dexterous Manipulation from Human References**|Xueyi Liu et.al.|[2502.09614v1](http://arxiv.org/abs/2502.09614v1)|null|
-|**2025-02-13**|**Score-of-Mixture Training: Training One-Step Generative Models Made Simple**|Tejas Jayashankar et.al.|[2502.09609v1](http://arxiv.org/abs/2502.09609v1)|null|
-|**2025-02-13**|**Human-LLM Coevolution: Evidence from Academic Writing**|Mingmeng Geng et.al.|[2502.09606v1](http://arxiv.org/abs/2502.09606v1)|null|
-|**2025-02-13**|**SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models**|Yung-Sung Chuang et.al.|[2502.09604v1](http://arxiv.org/abs/2502.09604v1)|null|
-|**2025-02-13**|**CoT-Valve: Length-Compressible Chain-of-Thought Tuning**|Xinyin Ma et.al.|[2502.09601v1](http://arxiv.org/abs/2502.09601v1)|null|
-|**2025-02-13**|**Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs**|Siyan Zhao et.al.|[2502.09597v1](http://arxiv.org/abs/2502.09597v1)|null|
-|**2025-02-13**|**KIMAs: A Configurable Knowledge Integrated Multi-Agent System**|Zitao Li et.al.|[2502.09596v1](http://arxiv.org/abs/2502.09596v1)|null|
-|**2025-02-13**|**Logical forms complement probability in understanding language model (and human) performance**|Yixuan Wang et.al.|[2502.09589v1](http://arxiv.org/abs/2502.09589v1)|null|
-|**2025-02-13**|**Optimizing GPT for Video Understanding: Zero-Shot Performance and Prompt Engineering**|Mark Beliaev et.al.|[2502.09573v1](http://arxiv.org/abs/2502.09573v1)|null|
-|**2025-02-13**|**MorphNLI: A Stepwise Approach to Natural Language Inference Using Text Morphing**|Vlad Andrei Negru et.al.|[2502.09567v1](http://arxiv.org/abs/2502.09567v1)|null|
-|**2025-02-13**|**Zero-shot generation of synthetic neurosurgical data with large language models**|Austin A. Barr et.al.|[2502.09566v1](http://arxiv.org/abs/2502.09566v1)|null|
-|**2025-02-13**|**MDCrow: Automating Molecular Dynamics Workflows with Large Language Models**|Quintina Campbell et.al.|[2502.09565v1](http://arxiv.org/abs/2502.09565v1)|null|
-|**2025-02-13**|**EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents**|Rui Yang et.al.|[2502.09560v1](http://arxiv.org/abs/2502.09560v1)|null|
-|**2025-02-13**|**Mind the Gap! Choice Independence in Using Multilingual LLMs for Persuasive Co-Writing Tasks in Different Languages**|Shreyan Biswas et.al.|[2502.09532v1](http://arxiv.org/abs/2502.09532v1)|null|
-|**2025-02-13**|**Diffusion Models for Molecules: A Survey of Methods and Tasks**|Liang Wang et.al.|[2502.09511v1](http://arxiv.org/abs/2502.09511v1)|null|
-|**2025-02-13**|**AttentionSmithy: A Modular Framework for Rapid Transformer Development and Customization**|Caleb Cranney et.al.|[2502.09503v1](http://arxiv.org/abs/2502.09503v1)|null|
-|**2025-02-13**|**Improve LLM-based Automatic Essay Scoring with Linguistic Features**|Zhaoyi Joey Hou et.al.|[2502.09497v1](http://arxiv.org/abs/2502.09497v1)|null|
-|**2025-02-13**|**Cracking the Code: Enhancing Development finance understanding with artificial intelligence**|Pierre Beaucoral et.al.|[2502.09495v1](http://arxiv.org/abs/2502.09495v1)|null|
-|**2025-02-13**|**Objective quantification of mood states using large language models**|Jakub Onysk et.al.|[2502.09487v1](http://arxiv.org/abs/2502.09487v1)|null|
-|**2025-02-13**|**The Multilingual Mind : A Survey of Multilingual Reasoning in Language Models**|Akash Ghosh et.al.|[2502.09457v1](http://arxiv.org/abs/2502.09457v1)|null|
-|**2025-02-13**|**Pixel-Level Reasoning Segmentation via Multi-turn Conversations**|Dexian Cai et.al.|[2502.09447v1](http://arxiv.org/abs/2502.09447v1)|null|
-|**2025-02-13**|**Dual Formulation for Non-Rectangular Lp Robust Markov Decision Processes**|Navdeep Kumar et.al.|[2502.09432v1](http://arxiv.org/abs/2502.09432v1)|null|
-|**2025-02-13**|**Transformer-Enhanced Variational Autoencoder for Crystal Structure Prediction**|Ziyi Chen et.al.|[2502.09423v1](http://arxiv.org/abs/2502.09423v1)|null|
-|**2025-02-13**|**On multi-token prediction for efficient LLM inference**|Somesh Mehra et.al.|[2502.09419v1](http://arxiv.org/abs/2502.09419v1)|null|
-|**2025-02-13**|**SQuARE: Sequential Question Answering Reasoning Engine for Enhanced Chain-of-Thought in Large Language Models**|Daniel Fleischer et.al.|[2502.09390v1](http://arxiv.org/abs/2502.09390v1)|null|
-|**2025-02-13**|**Truth Knows No Language: Evaluating Truthfulness Beyond English**|Blanca Calvo Figueras et.al.|[2502.09387v1](http://arxiv.org/abs/2502.09387v1)|null|
-|**2025-02-13**|**A Deep Inverse-Mapping Model for a Flapping Robotic Wing**|Hadar Sharvit et.al.|[2502.09378v1](http://arxiv.org/abs/2502.09378v1)|null|
-|**2025-02-13**|**Language Agents as Digital Representatives in Collective Decision-Making**|Daniel Jarrett et.al.|[2502.09369v1](http://arxiv.org/abs/2502.09369v1)|null|
-|**2025-02-13**|**Neural Spatiotemporal Point Processes: Trends and Challenges**|Sumantrak Mukherjee et.al.|[2502.09341v1](http://arxiv.org/abs/2502.09341v1)|null|
-|**2025-02-13**|**Graph Diffusion Network for Drug-Gene Prediction**|Jiayang Wu et.al.|[2502.09335v1](http://arxiv.org/abs/2502.09335v1)|null|
-|**2025-02-13**|**Beyond English: The Impact of Prompt Translation Strategies across Languages and Tasks in Multilingual LLMs**|Itai Mondshine et.al.|[2502.09331v1](http://arxiv.org/abs/2502.09331v1)|null|
-|**2025-02-13**|**A Judge-free LLM Open-ended Generation Benchmark Based on the Distributional Hypothesis**|Kentaro Imajo et.al.|[2502.09316v1](http://arxiv.org/abs/2502.09316v1)|null|
-|**2025-02-13**|**When the LM misunderstood the human chuckled: Analyzing garden path effects in humans and language models**|Samuel Joseph Amouyal et.al.|[2502.09307v1](http://arxiv.org/abs/2502.09307v1)|null|
-|**2025-02-13**|**Indeterminacy in Affective Computing: Considering Meaning and Context in Data Collection Practices**|Bernd Dudzik et.al.|[2502.09294v1](http://arxiv.org/abs/2502.09294v1)|null|
-|**2025-02-13**|**SparQLe: Speech Queries to Text Translation Through LLMs**|Amirbek Djanibekov et.al.|[2502.09284v1](http://arxiv.org/abs/2502.09284v1)|null|
-|**2025-02-13**|**LiSA: Leveraging Link Recommender to Attack Graph Neural Networks via Subgraph Injection**|Wenlun Zhang et.al.|[2502.09271v1](http://arxiv.org/abs/2502.09271v1)|null|
-|**2025-02-13**|**AnomalyGFM: Graph Foundation Model for Zero/Few-shot Anomaly Detection**|Hezhe Qiao et.al.|[2502.09254v1](http://arxiv.org/abs/2502.09254v1)|null|
-|**2025-02-13**|**The Joint Entity-Relation Extraction Model Based on Span and Interactive Fusion Representation for Chinese Medical Texts with Complex Semantics**|Danni Feng et.al.|[2502.09247v1](http://arxiv.org/abs/2502.09247v1)|null|
-|**2025-02-13**|**You Do Not Fully Utilize Transformer's Representation Capacity**|Gleb Gerasimov et.al.|[2502.09245v1](http://arxiv.org/abs/2502.09245v1)|null|
-|**2025-02-13**|**From large language models to multimodal AI: A scoping review on the potential of generative AI in medicine**|Lukas Buess et.al.|[2502.09242v1](http://arxiv.org/abs/2502.09242v1)|null|
-|**2025-02-13**|**Reliable Conversational Agents under ASP Control that Understand Natural Language**|Yankai Zeng et.al.|[2502.09237v1](http://arxiv.org/abs/2502.09237v1)|null|
-|**2025-02-13**|**Commonsense Reasoning-Aided Autonomous Vehicle Systems**|Keegan Kimbrell et.al.|[2502.09233v1](http://arxiv.org/abs/2502.09233v1)|null|
-|**2025-02-13**|**Logical foundations of Smart Contracts**|Kalonji Kalala et.al.|[2502.09232v1](http://arxiv.org/abs/2502.09232v1)|null|
-|**2025-02-13**|**Relating Answer Set Programming and Many-sorted Logics for Formal Verification**|Zachary Hansen et.al.|[2502.09230v1](http://arxiv.org/abs/2502.09230v1)|null|
-|**2025-02-13**|**Computational methods for Dynamic Answer Set Programming**|Susana Hahn et.al.|[2502.09228v1](http://arxiv.org/abs/2502.09228v1)|null|
-|**2025-02-13**|**Generating Causally Compliant Counterfactual Explanations using ASP**|Sopam Dasgupta et.al.|[2502.09226v1](http://arxiv.org/abs/2502.09226v1)|null|
-|**2025-02-13**|**Order-Sorted Intensional Logic: Expressing Subtyping Polymorphism with Typing Assertions and Quantification over Concepts**|Đorđe Marković et.al.|[2502.09224v1](http://arxiv.org/abs/2502.09224v1)|null|
-|**2025-02-13**|**ASP-driven User-interaction with Clinguin**|Alexander Beiser et.al.|[2502.09222v1](http://arxiv.org/abs/2502.09222v1)|null|
-|**2025-02-13**|**Pearce's Characterisation in an Epistemic Domain**|Ezgi Iraz Su et.al.|[2502.09221v1](http://arxiv.org/abs/2502.09221v1)|null|
-|**2025-02-13**|**Graphical Conditions for the Existence, Unicity and Number of Regular Models**|Van-Giang Trinh et.al.|[2502.09220v1](http://arxiv.org/abs/2502.09220v1)|null|
-|**2025-02-13**|**Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**|Flavio Bertini et.al.|[2502.09218v1](http://arxiv.org/abs/2502.09218v1)|null|
-|**2025-02-13**|**Mind the Gaps: Logical English, Prolog, and Multi-agent Systems for Autonomous Vehicles**|Galileo Sartor et.al.|[2502.09216v1](http://arxiv.org/abs/2502.09216v1)|null|
-|**2025-02-13**|**Architecture for Simulating Behavior Mode Changes in Norm-Aware Autonomous Agents**|Sean Glaze et.al.|[2502.09215v1](http://arxiv.org/abs/2502.09215v1)|null|
-|**2025-02-13**|**Neuro-Symbolic Contrastive Learning for Cross-domain Inference**|Mingyue Liu et.al.|[2502.09213v1](http://arxiv.org/abs/2502.09213v1)|null|
-|**2025-02-13**|**LP-LM: No Hallucinations in Question Answering with Logic Programming**|Katherine Wu et.al.|[2502.09212v1](http://arxiv.org/abs/2502.09212v1)|null|
-|**2025-02-13**|**Visual Graph Question Answering with ASP and LLMs for Language Parsing**|Jakob Johannes Bauer et.al.|[2502.09211v1](http://arxiv.org/abs/2502.09211v1)|null|
-|**2025-02-13**|**On LLM-generated Logic Programs and their Inference Execution Methods**|Paul Tarau et.al.|[2502.09209v1](http://arxiv.org/abs/2502.09209v1)|null|
-|**2025-02-13**|**Efficient OWL2QL Meta-reasoning Using ASP-based Hybrid Knowledge Bases**|Haya Majid Qureshi et.al.|[2502.09206v1](http://arxiv.org/abs/2502.09206v1)|null|
-|**2025-02-13**|**Counterfactual Explanations as Plans**|Vaishak Belle et.al.|[2502.09205v1](http://arxiv.org/abs/2502.09205v1)|null|
-|**2025-02-13**|**Logical Lease Litigation: Prolog and LLMs for Rental Law Compliance in New York**|Sanskar Sehgal et.al.|[2502.09204v1](http://arxiv.org/abs/2502.09204v1)|null|
-|**2025-02-13**|**Thinking beyond the anthropomorphic paradigm benefits LLM research**|Lujain Ibrahim et.al.|[2502.09192v1](http://arxiv.org/abs/2502.09192v1)|null|
-|**2025-02-13**|**Matina: A Large-Scale 73B Token Persian Text Corpus**|Sara Bourbour Hosseinbeigi et.al.|[2502.09188v1](http://arxiv.org/abs/2502.09188v1)|null|
-|**2025-02-13**|**RefineCoder: Iterative Improving of Large Language Models via Adaptive Critique Refinement for Code Generation**|Changzhi Zhou et.al.|[2502.09183v1](http://arxiv.org/abs/2502.09183v1)|null|
-|**2025-02-13**|**FLAME: Flexible LLM-Assisted Moderation Engine**|Ivan Bakulin et.al.|[2502.09175v1](http://arxiv.org/abs/2502.09175v1)|null|
-|**2025-02-13**|**Two-Stage Representation Learning for Analyzing Movement Behavior Dynamics in People Living with Dementia**|Jin Cui et.al.|[2502.09173v1](http://arxiv.org/abs/2502.09173v1)|null|
-|**2025-02-13**|**Musical Heritage Historical Entity Linking**|Arianna Graciotti et.al.|[2502.09168v1](http://arxiv.org/abs/2502.09168v1)|null|
-|**2025-02-13**|**Improving TCM Question Answering through Tree-Organized Self-Reflective Retrieval with LLMs**|Chang Liu et.al.|[2502.09156v1](http://arxiv.org/abs/2502.09156v1)|null|
-|**2025-02-13**|**A Novel Dialect-Aware Framework for the Classification of Arabic Dialects and Emotions**|Nasser A Alsadhan et.al.|[2502.09128v1](http://arxiv.org/abs/2502.09128v1)|null|
-|**2025-02-13**|**Automatic Pruning via Structured Lasso with Class-wise Information**|Xiang Liu et.al.|[2502.09125v1](http://arxiv.org/abs/2502.09125v1)|null|
-|**2025-02-13**|**The influence of visual and linguistic cues on ignorance inference in Vision-Language Models (VLMs)**|Ye-eun Cho et.al.|[2502.09120v1](http://arxiv.org/abs/2502.09120v1)|null|
-|**2025-02-13**|**One-shot Federated Learning Methods: A Practical Guide**|Xiang Liu et.al.|[2502.09104v1](http://arxiv.org/abs/2502.09104v1)|null|
-|**2025-02-13**|**Logical Reasoning in Large Language Models: A Survey**|Hanmeng Liu et.al.|[2502.09100v1](http://arxiv.org/abs/2502.09100v1)|null|
-|**2025-02-13**|**A Hybrid Transformer Model for Fake News Detection: Leveraging Bayesian Optimization and Bidirectional Recurrent Unit**|Tianyi Huang et.al.|[2502.09097v1](http://arxiv.org/abs/2502.09097v1)|null|
-|**2025-02-13**|**A Hybrid Model for Few-Shot Text Classification Using Transfer and Meta-Learning**|Jia Gao et.al.|[2502.09086v1](http://arxiv.org/abs/2502.09086v1)|null|
-|**2025-02-13**|**Show Me the Work: Fact-Checkers' Requirements for Explainable Automated Fact-Checking**|Greta Warren et.al.|[2502.09083v1](http://arxiv.org/abs/2502.09083v1)|null|
-|**2025-02-13**|**CoSER: Coordinating LLM-Based Persona Simulation of Established Roles**|Xintao Wang et.al.|[2502.09082v1](http://arxiv.org/abs/2502.09082v1)|null|
-|**2025-02-13**|**Enhancing RAG with Active Learning on Conversation Records: Reject Incapables and Answer Capables**|Xuzhao Geng et.al.|[2502.09073v1](http://arxiv.org/abs/2502.09073v1)|null|
-|**2025-02-13**|**An Open Recipe: Adapting Language-Specific LLMs to a Reasoning Model in One Day via Model Merging**|Kunat Pipatanakul et.al.|[2502.09056v1](http://arxiv.org/abs/2502.09056v1)|null|
-|**2025-02-13**|**Cost-Saving LLM Cascades with Early Abstention**|Michael J. Zellinger et.al.|[2502.09054v1](http://arxiv.org/abs/2502.09054v1)|null|
-|**2025-02-13**|**Game Theory Meets Large Language Models: A Systematic Survey**|Haoran Sun et.al.|[2502.09053v1](http://arxiv.org/abs/2502.09053v1)|null|
-|**2025-02-13**|**AIDE: Agentically Improve Visual Language Model with Domain Experts**|Ming-Chang Chiu et.al.|[2502.09051v1](http://arxiv.org/abs/2502.09051v1)|null|
-|**2025-02-13**|**Leveraging Member-Group Relations via Multi-View Graph Filtering for Effective Group Recommendation**|Chae-Hyun Kim et.al.|[2502.09050v1](http://arxiv.org/abs/2502.09050v1)|null|
-|**2025-02-13**|**Criteria-Aware Graph Filtering: Extremely Fast Yet Accurate Multi-Criteria Recommendation**|Jin-Duk Park et.al.|[2502.09046v1](http://arxiv.org/abs/2502.09046v1)|null|
-|**2025-02-13**|**Typhoon T1: An Open Thai Reasoning Model**|Pittawat Taveekitworachai et.al.|[2502.09042v1](http://arxiv.org/abs/2502.09042v1)|null|
-|**2025-02-13**|**Mechanistic Unveiling of Transformer Circuits: Self-Influence as a Key to Model Reasoning**|Lin Zhang et.al.|[2502.09022v1](http://arxiv.org/abs/2502.09022v1)|null|
-|**2025-02-13**|**EventSTR: A Benchmark Dataset and Baselines for Event Stream based Scene Text Recognition**|Xiao Wang et.al.|[2502.09020v1](http://arxiv.org/abs/2502.09020v1)|null|
-|**2025-02-13**|**Zero-shot Concept Bottleneck Models**|Shin'ya Yamaguchi et.al.|[2502.09018v1](http://arxiv.org/abs/2502.09018v1)|null|
-|**2025-02-13**|**Diversity Enhances an LLM's Performance in RAG and Long-context Task**|Zhchao Wang et.al.|[2502.09017v1](http://arxiv.org/abs/2502.09017v1)|null|
-|**2025-02-13**|**Hope vs. Hate: Understanding User Interactions with LGBTQ+ News Content in Mainstream US News Media through the Lens of Hope Speech**|Jonathan Pofcher et.al.|[2502.09004v1](http://arxiv.org/abs/2502.09004v1)|null|
-|**2025-02-13**|**RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models**|Quan Wei et.al.|[2502.09003v1](http://arxiv.org/abs/2502.09003v1)|null|
-|**2025-02-13**|**PixLift: Accelerating Web Browsing via AI Upscaling**|Yonas Atinafu et.al.|[2502.08995v1](http://arxiv.org/abs/2502.08995v1)|null|
-|**2025-02-13**|**RLSA-PFL: Robust Lightweight Secure Aggregation with Model Inconsistency Detection in Privacy-Preserving Federated Learning**|Nazatul H. Sultan et.al.|[2502.08989v1](http://arxiv.org/abs/2502.08989v1)|null|
-|**2025-02-13**|**Neural Force Field: Learning Generalized Physical Representation from a Few Examples**|Shiqian Li et.al.|[2502.08987v1](http://arxiv.org/abs/2502.08987v1)|null|
-|**2025-02-13**|**Tuning-Free Personalized Alignment via Trial-Error-Explain In-Context Learning**|Hyundong Cho et.al.|[2502.08972v1](http://arxiv.org/abs/2502.08972v1)|null|
-|**2025-02-13**|**RTBAS: Defending LLM Agents Against Prompt Injection and Privacy Leakage**|Peter Yong Zhong et.al.|[2502.08966v1](http://arxiv.org/abs/2502.08966v1)|null|
-|**2025-02-13**|**Biologically Plausible Brain Graph Transformer**|Ciyuan Peng et.al.|[2502.08958v1](http://arxiv.org/abs/2502.08958v1)|null|
-|**2025-02-13**|**Medicine on the Edge: Comparative Performance Analysis of On-Device LLMs for Clinical Reasoning**|Leon Nissen et.al.|[2502.08954v1](http://arxiv.org/abs/2502.08954v1)|null|
+##### **Cause and Effect: Can Large Language Models Truly Understand Causality?**
+2402.18139v3 by Swagata Ashwani, Kshiteesh Hegde, Nishith Reddy Mannuru, Mayank Jindal, Dushyant Singh Sengar, Krishna Chaitanya Rao Kathala, Dishant Banga, Vinija Jain, Aman Chadha
 
-#### Abstracts
-##### **Theoretical Benefit and Limitation of Diffusion Language Model**
-2502.09622v1 by Guhao Feng, Yihan Geng, Jian Guan, Wei Wu, Liwei Wang, Di He
+With the rise of Large Language Models(LLMs), it has become crucial to
+understand their capabilities and limitations in deciphering and explaining the
+complex web of causal relationships that language entails. Current methods use
+either explicit or implicit causal reasoning, yet there is a strong need for a
+unified approach combining both to tackle a wide array of causal relationships
+more effectively. This research proposes a novel architecture called Context
+Aware Reasoning Enhancement with Counterfactual Analysis(CARE CA) framework to
+enhance causal reasoning and explainability. The proposed framework
+incorporates an explicit causal detection module with ConceptNet and
+counterfactual statements, as well as implicit causal detection through LLMs.
+Our framework goes one step further with a layer of counterfactual explanations
+to accentuate LLMs understanding of causality. The knowledge from ConceptNet
+enhances the performance of multiple causal reasoning tasks such as causal
+discovery, causal identification and counterfactual reasoning. The
+counterfactual sentences add explicit knowledge of the not caused by scenarios.
+By combining these powerful modules, our model aims to provide a deeper
+understanding of causal relationships, enabling enhanced interpretability.
+Evaluation of benchmark datasets shows improved performance across all metrics,
+such as accuracy, precision, recall, and F1 scores. We also introduce
+CausalNet, a new dataset accompanied by our code, to facilitate further
+research in this domain.
 
-Diffusion language models have emerged as a promising approach for text
-generation. One would naturally expect this method to be an efficient
-replacement for autoregressive models since multiple tokens can be sampled in
-parallel during each diffusion step. However, its efficiency-accuracy trade-off
-is not yet well understood. In this paper, we present a rigorous theoretical
-analysis of a widely used type of diffusion language model, the Masked
-Diffusion Model (MDM), and find that its effectiveness heavily depends on the
-target evaluation metric. Under mild conditions, we prove that when using
-perplexity as the metric, MDMs can achieve near-optimal perplexity in sampling
-steps regardless of sequence length, demonstrating that efficiency can be
-achieved without sacrificing performance. However, when using the sequence
-error rate--which is important for understanding the "correctness" of a
-sequence, such as a reasoning chain--we show that the required sampling steps
-must scale linearly with sequence length to obtain "correct" sequences, thereby
-eliminating MDM's efficiency advantage over autoregressive models. Our analysis
-establishes the first theoretical foundation for understanding the benefits and
-limitations of MDMs. All theoretical findings are supported by empirical
-studies.
+摘要：隨著大型語言模型 (LLM) 的興起，了解它們在解碼和解釋語言所蘊含的複雜因果關係網路中的能力和限制變得至關重要。目前的技術使用明確或隱含的因果推理，但強烈需要一種統一的方法，結合兩者以更有效地處理廣泛的因果關係。本研究提出了一種稱為情境感知推理增強與反事實分析 (CARE CA) 框架的新架構，以增強因果推理和可解釋性。提出的框架結合了使用 ConceptNet 和反事實陳述的明確因果檢測模組，以及透過 LLM 進行的隱含因果檢測。我們的框架更進一步，加入一層反事實解釋，以強調 LLM 對因果關係的理解。來自 ConceptNet 的知識增強了多項因果推理任務的執行，例如因果發現、因果識別和反事實推理。反事實句加入了未由情境造成的明確知識。透過結合這些強大的模組，我們的模型旨在提供對因果關係更深入的理解，實現增強的可解釋性。基準資料集的評估顯示在所有指標（例如準確度、精確度、召回率和 F1 分數）上都有所提升。我們還引入了 CausalNet，一個新的資料集，並附上了我們的程式碼，以促進在這個領域的進一步研究。
 
-摘要：擴散語言模型已成為文字生成的一種有前途的方法。由於在每個擴散步驟期間可以並行採樣多個符號，因此人們自然會期望這種方法成為自迴歸模型的有效替代方案。然而，它的效率準確性權衡尚未得到很好的理解。在本文中，我們對廣泛使用的擴散語言模型類型，即遮罩擴散模型 (MDM) 進行了嚴格的理論分析，並發現其有效性在很大程度上取決於目標評估指標。在溫和條件下，我們證明了當使用困惑度作為指標時，MDM 可以無論序列長度如何，在採樣步驟中實現近乎最佳的困惑度，這表明可以在不犧牲性能的情況下實現效率。然而，當使用序列錯誤率（對於理解序列的「正確性」很重要，例如推理鏈）時，我們表明所需的採樣步驟必須隨著序列長度線性縮放才能獲得「正確」的序列，從而消除了 MDM 相對於自迴歸模型的效率優勢。我們的分析為理解 MDM 的優點和局限性建立了第一個理論基礎。所有理論發現都得到了實證研究的支持。
+##### **Artificial Intelligence and Diabetes Mellitus: An Inside Look Through the Retina**
+2402.18600v1 by Yasin Sadeghi Bazargani, Majid Mirzaei, Navid Sobhi, Mirsaeed Abdollahi, Ali Jafarizadeh, Siamak Pedrammehr, Roohallah Alizadehsani, Ru San Tan, Sheikh Mohammed Shariful Islam, U. Rajendra Acharya
+
+Diabetes mellitus (DM) predisposes patients to vascular complications.
+Retinal images and vasculature reflect the body's micro- and macrovascular
+health. They can be used to diagnose DM complications, including diabetic
+retinopathy (DR), neuropathy, nephropathy, and atherosclerotic cardiovascular
+disease, as well as forecast the risk of cardiovascular events. Artificial
+intelligence (AI)-enabled systems developed for high-throughput detection of DR
+using digitized retinal images have become clinically adopted. Beyond DR
+screening, AI integration also holds immense potential to address challenges
+associated with the holistic care of the patient with DM. In this work, we aim
+to comprehensively review the literature for studies on AI applications based
+on retinal images related to DM diagnosis, prognostication, and management. We
+will describe the findings of holistic AI-assisted diabetes care, including but
+not limited to DR screening, and discuss barriers to implementing such systems,
+including issues concerning ethics, data privacy, equitable access, and
+explainability. With the ability to evaluate the patient's health status vis a
+vis DM complication as well as risk prognostication of future cardiovascular
+complications, AI-assisted retinal image analysis has the potential to become a
+central tool for modern personalized medicine in patients with DM.
 
-##### **MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency**
-2502.09621v1 by Dongzhi Jiang, Renrui Zhang, Ziyu Guo, Yanwei Li, Yu Qi, Xinyan Chen, Liuhui Wang, Jianhan Jin, Claire Guo, Shen Yan, Bo Zhang, Chaoyou Fu, Peng Gao, Hongsheng Li
+摘要：糖尿病（DM）使患者容易出現血管併發症。
+視網膜影像和血管反映身體的微血管和巨血管健康狀況。它們可用於診斷糖尿病併發症，包括糖尿病視網膜病變（DR）、神經病變、腎病和動脈粥樣硬化性心血管疾病，以及預測心血管事件的風險。為使用數位化視網膜影像進行高通量 DR 檢測而開發的人工智慧（AI）啟用系統已在臨床採用。除了 DR 篩檢外，AI 整合也具有巨大的潛力來應對與糖尿病患者整體照護相關的挑戰。在這項工作中，我們旨在全面回顧基於視網膜影像的 AI 應用相關研究的文獻，這些研究與糖尿病的診斷、預後和管理有關。我們將描述整體 AI 輔助糖尿病照護的發現，包括但不限於 DR 篩檢，並討論實施此類系統的障礙，包括與倫理、資料隱私、公平存取和可解釋性有關的問題。透過評估患者的健康狀況，同時考量糖尿病併發症以及未來心血管併發症的風險預後，AI 輔助視網膜影像分析有潛力成為糖尿病患者現代化個人化醫療的中心工具。
 
-Answering questions with Chain-of-Thought (CoT) has significantly enhanced
-the reasoning capabilities of Large Language Models (LLMs), yet its impact on
-Large Multimodal Models (LMMs) still lacks a systematic assessment and in-depth
-investigation. In this paper, we introduce MME-CoT, a specialized benchmark
-evaluating the CoT reasoning performance of LMMs, spanning six domains: math,
-science, OCR, logic, space-time, and general scenes. As the first comprehensive
-study in this area, we propose a thorough evaluation suite incorporating three
-novel metrics that assess the reasoning quality, robustness, and efficiency at
-a fine-grained level. Leveraging curated high-quality data and a unique
-evaluation strategy, we conduct an in-depth analysis of state-of-the-art LMMs,
-uncovering several key insights: 1) Models with reflection mechanism
-demonstrate a superior CoT quality, with Kimi k1.5 outperforming GPT-4o and
-demonstrating the highest quality results; 2) CoT prompting often degrades LMM
-performance on perception-heavy tasks, suggesting a potentially harmful
-overthinking behavior; and 3) Although the CoT quality is high, LMMs with
-reflection exhibit significant inefficiency in both normal response and
-self-correction phases. We hope MME-CoT serves as a foundation for advancing
-multimodal reasoning in LMMs. Project Page: https://mmecot.github.io/
+##### **Multi-stakeholder Perspective on Responsible Artificial Intelligence and Acceptability in Education**
+2402.15027v2 by A. J. Karran, P. Charland, J-T. Martineau, A. Ortiz de Guinea Lopez de Arana, AM. Lesage, S. Senecal, P-M. Leger
 
-摘要：<paragraph>透過思維鏈（CoT）回答問題，大幅提升了大型語言模型（LLM）的推理能力，但其對大型多模態模型（LMM）的影響仍缺乏系統性的評估和深入探討。在本文中，我們引入了 MME-CoT，一個專門的基準測試，用於評估 LMM 的 CoT 推理效能，涵蓋六個領域：數學、科學、OCR、邏輯、時空和一般場景。作為該領域的第一個全面性研究，我們提出了一個全面的評估套件，包含三個創新的指標，用於評估推理品質、穩健性和效率，並達到細微的層級。透過利用策展的高品質資料和獨特的評估策略，我們對最先進的 LMM 進行深入分析，發現了幾個關鍵見解：1）具有反思機制的模型展現出優異的 CoT 品質，其中 Kimi k1.5 優於 GPT-4o，並展現出最高品質的結果；2）CoT 提示通常會降低 LMM 在感知密集任務上的效能，這表示潛在有害的過度思考行為；3）儘管 CoT 品質很高，但具有反思能力的 LMM 在一般回應和自我修正階段都展現出顯著的低效率。我們希望 MME-CoT 能作為促進 LMM 中多模態推理的基礎。專案頁面：https://mmecot.github.io/</paragraph>
+This study investigates the acceptability of different artificial
+intelligence (AI) applications in education from a multi-stakeholder
+perspective, including students, teachers, and parents. Acknowledging the
+transformative potential of AI in education, it addresses concerns related to
+data privacy, AI agency, transparency, explainability and the ethical
+deployment of AI. Through a vignette methodology, participants were presented
+with four scenarios where AI's agency, transparency, explainability, and
+privacy were manipulated. After each scenario, participants completed a survey
+that captured their perceptions of AI's global utility, individual usefulness,
+justice, confidence, risk, and intention to use each scenario's AI if
+available. The data collection comprising a final sample of 1198
+multi-stakeholder participants was distributed through a partner institution
+and social media campaigns and focused on individual responses to four AI use
+cases. A mediation analysis of the data indicated that acceptance and trust in
+AI varies significantly across stakeholder groups. We found that the key
+mediators between high and low levels of AI's agency, transparency, and
+explainability, as well as the intention to use the different educational AI,
+included perceived global utility, justice, and confidence. The study
+highlights that the acceptance of AI in education is a nuanced and multifaceted
+issue that requires careful consideration of specific AI applications and their
+characteristics, in addition to the diverse stakeholders' perceptions.
 
-##### **Exploring the Potential of Encoder-free Architectures in 3D LMMs**
-2502.09620v1 by Yiwen Tang, Zoey Guo, Zhuhao Wang, Ray Zhang, Qizhi Chen, Junli Liu, Delin Qu, Zhigang Wang, Dong Wang, Xuelong Li, Bin Zhao
+摘要：這項研究從多個利害關係人的角度探討不同的人工智慧 (AI) 應用在教育上的可接受性，包括學生、老師和家長。承認 AI 在教育上的轉型潛力，它解決了與資料隱私、AI 代理、透明度、可解釋性和 AI 的道德部署相關的疑慮。透過小插曲方法，參與者被呈現了四種情境，其中 AI 的代理、透明度、可解釋性和隱私受到操縱。在每個情境後，參與者完成了一項調查，該調查捕捉了他們對 AI 的整體效用、個人效用、正義、信心、風險和如果可用，使用每個情境的 AI 的意圖的看法。資料蒐集包含來自合作機構和社群媒體活動的 1198 位多利害關係人參與者的最終樣本，並專注於對四個 AI 使用案例的個別回應。對資料的調解分析表明，對 AI 的接受度和信任在利害關係人團體之間有顯著差異。我們發現，AI 的代理、透明度和可解釋性高低程度之間的關鍵調解者，以及使用不同教育 AI 的意圖，包括感知到的整體效用、正義和信心。這項研究強調，接受 AI 在教育上的應用是一個微妙且多面向的問題，除了不同的利害關係人的看法外，還需要仔細考慮具體的 AI 應用及其特徵。
 
-Encoder-free architectures have been preliminarily explored in the 2D visual
-domain, yet it remains an open question whether they can be effectively applied
-to 3D understanding scenarios. In this paper, we present the first
-comprehensive investigation into the potential of encoder-free architectures to
-overcome the challenges of encoder-based 3D Large Multimodal Models (LMMs).
-These challenges include the failure to adapt to varying point cloud
-resolutions and the point features from the encoder not meeting the semantic
-needs of Large Language Models (LLMs). We identify key aspects for 3D LMMs to
-remove the encoder and enable the LLM to assume the role of the 3D encoder: 1)
-We propose the LLM-embedded Semantic Encoding strategy in the pre-training
-stage, exploring the effects of various point cloud self-supervised losses. And
-we present the Hybrid Semantic Loss to extract high-level semantics. 2) We
-introduce the Hierarchical Geometry Aggregation strategy in the instruction
-tuning stage. This incorporates inductive bias into the LLM early layers to
-focus on the local details of the point clouds. To the end, we present the
-first Encoder-free 3D LMM, ENEL. Our 7B model rivals the current
-state-of-the-art model, ShapeLLM-13B, achieving 55.0%, 50.92%, and 42.7% on the
-classification, captioning, and VQA tasks, respectively. Our results
-demonstrate that the encoder-free architecture is highly promising for
-replacing encoder-based architectures in the field of 3D understanding. The
-code is released at https://github.com/Ivan-Tang-3D/ENEL
+##### **Deciphering Heartbeat Signatures: A Vision Transformer Approach to Explainable Atrial Fibrillation Detection from ECG Signals**
+2402.09474v2 by Aruna Mohan, Danne Elbers, Or Zilbershot, Fatemeh Afghah, David Vorchheimer
 
-摘要：<paragraph>編碼器免費架構已在 2D 視覺領域中初步探索，但它們是否能有效應用於 3D 理解場景仍是一個開放的問題。在本文中，我們提出了對編碼器免費架構潛力的首次全面調查，以克服基於編碼器的 3D 大型多模態模型 (LMM) 的挑戰。這些挑戰包括無法適應不同的點雲解析度，且來自編碼器的點特徵無法滿足大型語言模型 (LLM) 的語義需求。我們識別出 3D LMM 的關鍵方面，以移除編碼器並讓 LLM 承擔 3D 編碼器的角色：1) 我們在預訓練階段提出 LLM 嵌入式語義編碼策略，探索各種點雲自我監督損失的影響。我們提出混合語義損失來提取高階語義。2) 我們在指令調整階段引入分層幾何聚合策略。這將歸納偏差納入 LLM 早期層，以專注於點雲的局部細節。最後，我們提出第一個無編碼器 3D LMM，ENEL。我們的 7B 模型與當前最先進的模型 ShapeLLM-13B 相媲美，分別在分類、字幕和 VQA 任務中達到 55.0%、50.92% 和 42.7%。我們的結果表明，無編碼器架構極有望取代基於編碼器的架構在 3D 理解領域的應用。程式碼發布於 https://github.com/Ivan-Tang-3D/ENEL</paragraph>
+Remote patient monitoring based on wearable single-lead electrocardiogram
+(ECG) devices has significant potential for enabling the early detection of
+heart disease, especially in combination with artificial intelligence (AI)
+approaches for automated heart disease detection. There have been prior studies
+applying AI approaches based on deep learning for heart disease detection.
+However, these models are yet to be widely accepted as a reliable aid for
+clinical diagnostics, in part due to the current black-box perception
+surrounding many AI algorithms. In particular, there is a need to identify the
+key features of the ECG signal that contribute toward making an accurate
+diagnosis, thereby enhancing the interpretability of the model. In the present
+study, we develop a vision transformer approach to identify atrial fibrillation
+based on single-lead ECG data. A residual network (ResNet) approach is also
+developed for comparison with the vision transformer approach. These models are
+applied to the Chapman-Shaoxing dataset to classify atrial fibrillation, as
+well as another common arrhythmia, sinus bradycardia, and normal sinus rhythm
+heartbeats. The models enable the identification of the key regions of the
+heartbeat that determine the resulting classification, and highlight the
+importance of P-waves and T-waves, as well as heartbeat duration and signal
+amplitude, in distinguishing normal sinus rhythm from atrial fibrillation and
+sinus bradycardia.
 
-##### **DexTrack: Towards Generalizable Neural Tracking Control for Dexterous Manipulation from Human References**
-2502.09614v1 by Xueyi Liu, Jianibieke Adalibieke, Qianwei Han, Yuzhe Qin, Li Yi
+摘要：<paragraph>基於可穿戴式單導程心電圖 (ECG) 裝置的遠端病患監測在早期偵測心臟疾病方面具有顯著的潛力，特別是與用於自動化心臟疾病偵測的人工智慧 (AI) 方法結合使用時。先前已有研究應用基於深度學習的 AI 方法進行心臟疾病偵測。然而，這些模型尚未被廣泛接受為臨床診斷的可靠輔助工具，部分原因在於圍繞許多 AI 演算法的當前黑箱感知。特別是，有必要找出有助於做出準確診斷的 ECG 訊號關鍵特徵，從而增強模型的可解釋性。在本研究中，我們開發了一種視覺轉換器方法，以根據單導程 ECG 資料找出心房顫動。殘差網路 (ResNet) 方法也已開發出來，以便與視覺轉換器方法進行比較。這些模型應用於 Chapman-Shaoxing 資料集，以分類心房顫動，以及另一種常見的心律不整，竇性心動過緩，和正常竇性心律的心跳。這些模型能夠找出決定最終分類的心跳關鍵區域，並強調 P 波和 T 波，以及心跳持續時間和訊號振幅在區分正常竇性心律與心房顫動和竇性心動過緩方面的重要性。</paragraph>
 
-We address the challenge of developing a generalizable neural tracking
-controller for dexterous manipulation from human references. This controller
-aims to manage a dexterous robot hand to manipulate diverse objects for various
-purposes defined by kinematic human-object interactions. Developing such a
-controller is complicated by the intricate contact dynamics of dexterous
-manipulation and the need for adaptivity, generalizability, and robustness.
-Current reinforcement learning and trajectory optimization methods often fall
-short due to their dependence on task-specific rewards or precise system
-models. We introduce an approach that curates large-scale successful robot
-tracking demonstrations, comprising pairs of human references and robot
-actions, to train a neural controller. Utilizing a data flywheel, we
-iteratively enhance the controller's performance, as well as the number and
-quality of successful tracking demonstrations. We exploit available tracking
-demonstrations and carefully integrate reinforcement learning and imitation
-learning to boost the controller's performance in dynamic environments. At the
-same time, to obtain high-quality tracking demonstrations, we individually
-optimize per-trajectory tracking by leveraging the learned tracking controller
-in a homotopy optimization method. The homotopy optimization, mimicking
-chain-of-thought, aids in solving challenging trajectory tracking problems to
-increase demonstration diversity. We showcase our success by training a
-generalizable neural controller and evaluating it in both simulation and real
-world. Our method achieves over a 10% improvement in success rates compared to
-leading baselines. The project website with animated results is available at
-https://meowuu7.github.io/DexTrack/.
 
-摘要：<paragraph>我們解決了從人類參照中開發靈巧操作通用神經追蹤控制器的挑戰。此控制器旨在管理靈巧機器人手，以操作各種物體，以實現由運動學人機互動定義的各種目的。由於靈巧操作的複雜接觸動力學以及對適應性、通用性和魯棒性的需求，開發此類控制器很複雜。目前的強化學習和軌跡優化方法通常由於依賴於特定任務的獎勵或精確的系統模型而表現不佳。我們引入了一種方法，它策劃了大規模成功的機器人追蹤示範，包括人體參照和機器人動作對，以訓練神經控制器。利用數據飛輪，我們反覆增強控制器的性能，以及成功追蹤示範的數量和品質。我們利用可用的追蹤示範，並仔細整合強化學習和模仿學習，以提升控制器在動態環境中的性能。同時，為了獲得高品質的追蹤示範，我們透過在同倫優化方法中利用已學習的追蹤控制器，個別優化每個軌跡的追蹤。同倫優化模擬思考鏈，有助於解決具有挑戰性的軌跡追蹤問題，以增加示範的多樣性。我們展示了我們在訓練通用神經控制器並在模擬和真實世界中評估它的成功。與領先的基準相比，我們的模型在成功率方面提高了 10% 以上。包含動畫結果的專案網站可在 https://meowuu7.github.io/DexTrack/ 取得。</paragraph>
+### Medical
+|Publish Date|Title|Authors|Homepage|Code|
+| :---: | :---: | :---: | :---: | :---: |
+|**2025-02-13**|**Metamorphic Testing for Pose Estimation Systems**|Matias Duran et.al.|[2502.09460v1](http://arxiv.org/abs/2502.09460v1)|null|
+|**2025-02-13**|**The Joint Entity-Relation Extraction Model Based on Span and Interactive Fusion Representation for Chinese Medical Texts with Complex Semantics**|Danni Feng et.al.|[2502.09247v1](http://arxiv.org/abs/2502.09247v1)|null|
+|**2025-02-13**|**From large language models to multimodal AI: A scoping review on the potential of generative AI in medicine**|Lukas Buess et.al.|[2502.09242v1](http://arxiv.org/abs/2502.09242v1)|null|
+|**2025-02-13**|**Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**|Flavio Bertini et.al.|[2502.09218v1](http://arxiv.org/abs/2502.09218v1)|null|
+|**2025-02-13**|**Logical Lease Litigation: Prolog and LLMs for Rental Law Compliance in New York**|Sanskar Sehgal et.al.|[2502.09204v1](http://arxiv.org/abs/2502.09204v1)|null|
+|**2025-02-13**|**Two-Stage Representation Learning for Analyzing Movement Behavior Dynamics in People Living with Dementia**|Jin Cui et.al.|[2502.09173v1](http://arxiv.org/abs/2502.09173v1)|null|
+|**2025-02-12**|**HistoSmith: Single-Stage Histology Image-Label Generation via Conditional Latent Diffusion for Enhanced Cell Segmentation and Classification**|Valentina Vadori et.al.|[2502.08754v1](http://arxiv.org/abs/2502.08754v1)|null|
+|**2025-02-12**|**Brain Latent Progression: Individual-based Spatiotemporal Disease Progression on 3D Brain MRIs via Latent Diffusion**|Lemuel Puglisi et.al.|[2502.08560v1](http://arxiv.org/abs/2502.08560v1)|[link](https://github.com/lemuelpuglisi/brlp)|
+|**2025-02-12**|**Representation Learning to Advance Multi-institutional Studies with Electronic Health Record Data**|Doudou Zhou et.al.|[2502.08547v1](http://arxiv.org/abs/2502.08547v1)|null|
+|**2025-02-12**|**EEG Artifact Detection and Correction with Deep Autoencoders**|David Aquilué-Llorens et.al.|[2502.08686v1](http://arxiv.org/abs/2502.08686v1)|null|
+|**2025-02-12**|**SycEval: Evaluating LLM Sycophancy**|Aaron Fanous et.al.|[2502.08177v1](http://arxiv.org/abs/2502.08177v1)|null|
+|**2025-02-11**|**Caught in the Web of Words: Do LLMs Fall for Spin in Medical Literature?**|Hye Sun Yun et.al.|[2502.07963v1](http://arxiv.org/abs/2502.07963v1)|null|
+|**2025-02-11**|**An Advanced NLP Framework for Automated Medical Diagnosis with DeBERTa and Dynamic Contextual Positional Gating**|Mohammad Ali Labbaf Khaniki et.al.|[2502.07755v1](http://arxiv.org/abs/2502.07755v1)|null|
+|**2025-02-11**|**Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension**|Wenbo Gong et.al.|[2502.07752v1](http://arxiv.org/abs/2502.07752v1)|null|
+|**2025-02-11**|**The Devil is in the Prompts: De-Identification Traces Enhance Memorization Risks in Synthetic Chest X-Ray Generation**|Raman Dutt et.al.|[2502.07516v1](http://arxiv.org/abs/2502.07516v1)|[link](https://github.com/Raman1121/diffusion_memorization)|
+|**2025-02-11**|**KPIs 2024 Challenge: Advancing Glomerular Segmentation from Patch- to Slide-Level**|Ruining Deng et.al.|[2502.07288v1](http://arxiv.org/abs/2502.07288v1)|[link](https://github.com/agaldran/kpis)|
+|**2025-02-11**|**Early Risk Prediction of Pediatric Cardiac Arrest from Electronic Health Records via Multimodal Fused Transformer**|Jiaying Lu et.al.|[2502.07158v1](http://arxiv.org/abs/2502.07158v1)|null|
+|**2025-02-11**|**Explaining 3D Computed Tomography Classifiers with Counterfactuals**|Joseph Paul Cohen et.al.|[2502.07156v1](http://arxiv.org/abs/2502.07156v1)|[link](https://github.com/ieee8023/ct-counterfactuals)|
+|**2025-02-10**|**Interactive Data Harmonization with LLM Agents**|Aécio Santos et.al.|[2502.07132v1](http://arxiv.org/abs/2502.07132v1)|null|
+|**2025-02-10**|**Machine Learning for Everyone: Simplifying Healthcare Analytics with BigQuery ML**|Mohammad Amir Salari et.al.|[2502.07026v1](http://arxiv.org/abs/2502.07026v1)|null|
+|**2025-02-10**|**AIMS.au: A Dataset for the Analysis of Modern Slavery Countermeasures in Corporate Statements**|Adriana Eufrosiana Bora et.al.|[2502.07022v1](http://arxiv.org/abs/2502.07022v1)|null|
+|**2025-02-10**|**Recent Advances, Applications and Open Challenges in Machine Learning for Health: Reflections from Research Roundtables at ML4H 2024 Symposium**|Amin Adibi et.al.|[2502.06693v1](http://arxiv.org/abs/2502.06693v1)|null|
+|**2025-02-10**|**Automatic Evaluation of Healthcare LLMs Beyond Question-Answering**|Anna Arias-Duart et.al.|[2502.06666v1](http://arxiv.org/abs/2502.06666v1)|null|
+|**2025-02-10**|**Few-Shot Classification and Anatomical Localization of Tissues in SPECT Imaging**|Mohammed Abdul Hafeez Khan et.al.|[2502.06632v1](http://arxiv.org/abs/2502.06632v1)|null|
+|**2025-02-10**|**Illegal Waste Detection in Remote Sensing Images: A Case Study**|Federico Gibellini et.al.|[2502.06607v2](http://arxiv.org/abs/2502.06607v2)|null|
+|**2025-02-10**|**FEMBA: Efficient and Scalable EEG Analysis with a Bidirectional Mamba Foundation Model**|Anna Tegon et.al.|[2502.06438v1](http://arxiv.org/abs/2502.06438v1)|null|
+|**2025-02-10**|**Is an Ultra Large Natural Image-Based Foundation Model Superior to a Retina-Specific Model for Detecting Ocular and Systemic Diseases?**|Qingshan Hou et.al.|[2502.06289v1](http://arxiv.org/abs/2502.06289v1)|null|
+|**2025-02-10**|**Integrating Sequence and Image Modeling in Irregular Medical Time Series Through Self-Supervised Learning**|Liuqing Chen et.al.|[2502.06134v1](http://arxiv.org/abs/2502.06134v1)|null|
+|**2025-02-10**|**Foundation Model of Electronic Medical Records for Adaptive Risk Estimation**|Pawel Renc et.al.|[2502.06124v1](http://arxiv.org/abs/2502.06124v1)|null|
+|**2025-02-10**|**Can ChatGPT Diagnose Alzheimer's Disease?**|Quoc-Toan Nguyen et.al.|[2502.06907v1](http://arxiv.org/abs/2502.06907v1)|null|
+|**2025-02-09**|**Protecting Intellectual Property of EEG-based Neural Networks with Watermarking**|Ahmed Abdelaziz et.al.|[2502.05931v1](http://arxiv.org/abs/2502.05931v1)|[link](https://github.com/Prog-Jacob/watermarking-eeg-models)|
+|**2025-02-09**|**Enhancing Depression Detection with Chain-of-Thought Prompting: From Emotion to Reasoning Using Large Language Models**|Shiyu Teng et.al.|[2502.05879v1](http://arxiv.org/abs/2502.05879v1)|null|
+|**2025-02-09**|**LLMs for Drug-Drug Interaction Prediction: A Comprehensive Comparison**|Gabriele De Vito et.al.|[2502.06890v1](http://arxiv.org/abs/2502.06890v1)|null|
+|**2025-02-09**|**Decoding Complexity: Intelligent Pattern Exploration with CHPDA (Context Aware Hybrid Pattern Detection Algorithm)**|Lokesh Koli et.al.|[2502.07815v1](http://arxiv.org/abs/2502.07815v1)|null|
+|**2025-02-09**|**WatchGuardian: Enabling User-Defined Personalized Just-in-Time Intervention on Smartwatch**|Ying Lei et.al.|[2502.05783v1](http://arxiv.org/abs/2502.05783v1)|null|
+|**2025-02-09**|**RECOVER: Designing a Large Language Model-based Remote Patient Monitoring System for Postoperative Gastrointestinal Cancer Care**|Ziqi Yang et.al.|[2502.05740v1](http://arxiv.org/abs/2502.05740v1)|null|
+|**2025-02-08**|**4D VQ-GAN: Synthesising Medical Scans at Any Time Point for Personalised Disease Progression Modelling of Idiopathic Pulmonary Fibrosis**|An Zhao et.al.|[2502.05713v1](http://arxiv.org/abs/2502.05713v1)|null|
+|**2025-02-08**|**KMI: A Dataset of Korean Motivational Interviewing Dialogues for Psychotherapy**|Hyunjong Kim et.al.|[2502.05651v1](http://arxiv.org/abs/2502.05651v1)|null|
+|**2025-02-08**|**ELMTEX: Fine-Tuning Large Language Models for Structured Clinical Information Extraction. A Case Study on Clinical Reports**|Aynur Guluzade et.al.|[2502.05638v1](http://arxiv.org/abs/2502.05638v1)|[link](https://gitlab.cc-asp.fraunhofer.de/health-open/elmtex)|
+|**2025-02-08**|**Multi-scale Masked Autoencoder for Electrocardiogram Anomaly Detection**|Ya Zhou et.al.|[2502.05494v1](http://arxiv.org/abs/2502.05494v1)|null|
+|**2025-02-08**|**DCENWCNet: A Deep CNN Ensemble Network for White Blood Cell Classification with LIME-Based Explainability**|Sibasish Dhibar et.al.|[2502.05459v1](http://arxiv.org/abs/2502.05459v1)|null|
+|**2025-02-07**|**Multi-Class Segmentation of Aortic Branches and Zones in Computed Tomography Angiography: The AortaSeg24 Challenge**|Muhammad Imran et.al.|[2502.05330v1](http://arxiv.org/abs/2502.05330v1)|[link](https://github.com/MaxwellEng/MICCAI_CHANLLENGE24_HJL)|
+|**2025-02-07**|**Homeomorphism Prior for False Positive and Negative Problem in Medical Image Dense Contrastive Representation Learning**|Yuting He et.al.|[2502.05282v1](http://arxiv.org/abs/2502.05282v1)|null|
+|**2025-02-07**|**"It Felt Like I Was Left in the Dark": Exploring Information Needs and Design Opportunities for Family Caregivers of Older Adult Patients in Critical Care Settings**|Shihan Fu et.al.|[2502.05115v1](http://arxiv.org/abs/2502.05115v1)|null|
+|**2025-02-07**|**Mitigating Unintended Memorization with LoRA in Federated Learning for LLMs**|Thierry Bossy et.al.|[2502.05087v1](http://arxiv.org/abs/2502.05087v1)|[link](https://github.com/tuneinsight/federated-llms)|
+|**2025-02-07**|**MedMimic: Physician-Inspired Multimodal Fusion for Early Diagnosis of Fever of Unknown Origin**|Minrui Chen et.al.|[2502.04794v1](http://arxiv.org/abs/2502.04794v1)|null|
+|**2025-02-06**|**MedGNN: Towards Multi-resolution Spatiotemporal Graph Learning for Medical Time Series Classification**|Wei Fan et.al.|[2502.04515v1](http://arxiv.org/abs/2502.04515v1)|[link](https://github.com/aikunyi/MedGNN)|
+|**2025-02-06**|**Integrating Generative Artificial Intelligence in ADRD: A Framework for Streamlining Diagnosis and Care in Neurodegenerative Diseases**|Andrew G. Breithaupt et.al.|[2502.06842v1](http://arxiv.org/abs/2502.06842v1)|null|
+|**2025-02-06**|**Primary Care Diagnoses as a Reliable Predictor for Orthopedic Surgical Interventions**|Khushboo Verma et.al.|[2502.04423v1](http://arxiv.org/abs/2502.04423v1)|null|
+|**2025-02-06**|**Automatic quantification of breast cancer biomarkers from multiple 18F-FDG PET image segmentation**|Tewele W. Tareke et.al.|[2502.04083v1](http://arxiv.org/abs/2502.04083v1)|null|
+|**2025-02-06**|**Generalize Drug Response Prediction by Latent Independent Projection for Asymmetric Constrained Domain Generalization**|Ran Song et.al.|[2502.04034v1](http://arxiv.org/abs/2502.04034v1)|null|
+|**2025-02-06**|**MedRAG: Enhancing Retrieval-augmented Generation with Knowledge Graph-Elicited Reasoning for Healthcare Copilot**|Xuejiao Zhao et.al.|[2502.04413v1](http://arxiv.org/abs/2502.04413v1)|[link](https://github.com/snowteam2023/medrag)|
+|**2025-02-06**|**Transforming Multimodal Models into Action Models for Radiotherapy**|Matteo Ferrante et.al.|[2502.04408v1](http://arxiv.org/abs/2502.04408v1)|null|
+|**2025-02-06**|**Online Location Planning for AI-Defined Vehicles: Optimizing Joint Tasks of Order Serving and Spatio-Temporal Heterogeneous Model Fine-Tuning**|Bokeng Zheng et.al.|[2502.04399v1](http://arxiv.org/abs/2502.04399v1)|null|
+|**2025-02-06**|**Multimodal Medical Code Tokenizer**|Xiaorui Su et.al.|[2502.04397v2](http://arxiv.org/abs/2502.04397v2)|null|
+|**2025-02-06**|**A Retrospective Systematic Study on Hierarchical Sparse Query Transformer-assisted Ultrasound Screening for Early Hepatocellular Carcinoma**|Chaoyin She et.al.|[2502.03772v1](http://arxiv.org/abs/2502.03772v1)|[link](https://github.com/Asunatan/HSQformer)|
+|**2025-02-05**|**Towards Fair Medical AI: Adversarial Debiasing of 3D CT Foundation Embeddings**|Guangyao Zheng et.al.|[2502.04386v1](http://arxiv.org/abs/2502.04386v1)|[link](https://github.com/BioIntelligence-Lab/VAE-Adversarial-Debiasing)|
+|**2025-02-05**|**Clinically-Inspired Hierarchical Multi-Label Classification of Chest X-rays with a Penalty-Based Loss Function**|Mehrdad Asadi et.al.|[2502.03591v1](http://arxiv.org/abs/2502.03591v1)|[link](https://github.com/the-mercury/CIHMLC)|
+|**2025-02-05**|**Code Simulation as a Proxy for High-order Tasks in Large Language Models**|Emanuele La Malfa et.al.|[2502.03568v1](http://arxiv.org/abs/2502.03568v1)|null|
+|**2025-02-05**|**Limitations of Large Language Models in Clinical Problem-Solving Arising from Inflexible Reasoning**|Jonathan Kim et.al.|[2502.04381v1](http://arxiv.org/abs/2502.04381v1)|null|
+|**2025-02-05**|**Accurate AI-Driven Emergency Vehicle Location Tracking in Healthcare ITS Digital Twin**|Sarah Al-Shareeda et.al.|[2502.03396v1](http://arxiv.org/abs/2502.03396v1)|null|
+|**2025-02-05**|**RadVLM: A Multitask Conversational Vision-Language Model for Radiology**|Nicolas Deperrois et.al.|[2502.03333v1](http://arxiv.org/abs/2502.03333v1)|null|
+|**2025-02-05**|**MeDiSumQA: Patient-Oriented Question-Answer Generation from Discharge Letters**|Amin Dada et.al.|[2502.03298v1](http://arxiv.org/abs/2502.03298v1)|null|
+|**2025-02-05**|**Deep Learning Pipeline for Fully Automated Myocardial Infarct Segmentation from Clinical Cardiac MR Scans**|Matthias Schwab et.al.|[2502.03272v1](http://arxiv.org/abs/2502.03272v1)|null|
+|**2025-02-05**|**Long-tailed Medical Diagnosis with Relation-aware Representation Learning and Iterative Classifier Calibration**|Li Pan et.al.|[2502.03238v2](http://arxiv.org/abs/2502.03238v2)|[link](https://github.com/peterlipan/lmd)|
+|**2025-02-05**|**Fine-Tuning Strategies for Continual Online EEG Motor Imagery Decoding: Insights from a Large-Scale Longitudinal Study**|Martin Wimpff et.al.|[2502.06828v1](http://arxiv.org/abs/2502.06828v1)|null|
+|**2025-02-05**|**MedBioLM: Optimizing Medical and Biological QA with Fine-Tuned Large Language Models and Retrieval-Augmented Generation**|Seonok Kim et.al.|[2502.03004v1](http://arxiv.org/abs/2502.03004v1)|null|
+|**2025-02-05**|**Contrastive Token-level Explanations for Graph-based Rumour Detection**|Daniel Wai Kit Chin et.al.|[2502.04366v1](http://arxiv.org/abs/2502.04366v1)|null|
+|**2025-02-05**|**AI-Based Thermal Video Analysis in Privacy-Preserving Healthcare: A Case Study on Detecting Time of Birth**|Jorge García-Torres et.al.|[2502.04365v1](http://arxiv.org/abs/2502.04365v1)|null|
+|**2025-02-04**|**3D Foundation AI Model for Generalizable Disease Detection in Head Computed Tomography**|Weicheng Zhu et.al.|[2502.02779v1](http://arxiv.org/abs/2502.02779v1)|null|
+|**2025-02-04**|**Adaptive Voxel-Weighted Loss Using L1 Norms in Deep Neural Networks for Detection and Segmentation of Prostate Cancer Lesions in PET/CT Images**|Obed Korshie Dzikunu et.al.|[2502.02756v1](http://arxiv.org/abs/2502.02756v1)|[link](https://github.com/obeddzik/pca_segment)|
+|**2025-02-04**|**Diffusion Instruction Tuning**|Chen Jin et.al.|[2502.06814v1](http://arxiv.org/abs/2502.06814v1)|null|
+|**2025-02-04**|**MedRAX: Medical Reasoning Agent for Chest X-ray**|Adibvafa Fallahpour et.al.|[2502.02673v1](http://arxiv.org/abs/2502.02673v1)|[link](https://github.com/bowang-lab/medrax)|
+|**2025-02-04**|**Open Foundation Models in Healthcare: Challenges, Paradoxes, and Opportunities with GenAI Driven Personalized Prescription**|Mahdi Alkaeed et.al.|[2502.04356v1](http://arxiv.org/abs/2502.04356v1)|null|
+|**2025-02-04**|**Decision Theoretic Foundations for Conformal Prediction: Optimal Uncertainty Quantification for Risk-Averse Agents**|Shayan Kiyani et.al.|[2502.02561v1](http://arxiv.org/abs/2502.02561v1)|null|
+|**2025-02-04**|**CoRPA: Adversarial Image Generation for Chest X-rays Using Concept Vector Perturbations and Generative Models**|Amy Rafferty et.al.|[2502.05214v1](http://arxiv.org/abs/2502.05214v1)|null|
+|**2025-02-04**|**A Self-Supervised Framework for Improved Generalisability in Ultrasound B-mode Image Segmentation**|Edward Ellis et.al.|[2502.02489v1](http://arxiv.org/abs/2502.02489v1)|null|
+|**2025-02-04**|**Medical Multimodal Model Stealing Attacks via Adversarial Domain Alignment**|Yaling Shen et.al.|[2502.02438v1](http://arxiv.org/abs/2502.02438v1)|null|
+|**2025-02-04**|**Test Time Training for 4D Medical Image Interpolation**|Qikang Zhang et.al.|[2502.02341v1](http://arxiv.org/abs/2502.02341v1)|[link](https://github.com/chaostheproducer/ttt4d)|
+|**2025-02-04**|**Conversation AI Dialog for Medicare powered by Finetuning and Retrieval Augmented Generation**|Atharva Mangeshkumar Agrawal et.al.|[2502.02249v1](http://arxiv.org/abs/2502.02249v1)|null|
+|**2025-02-04**|**Deep Learning-Based Facial Expression Recognition for the Elderly: A Systematic Review**|F. Xavier Gaya-Morey et.al.|[2502.02618v1](http://arxiv.org/abs/2502.02618v1)|null|
+|**2025-02-04**|**Causally-informed Deep Learning towards Explainable and Generalizable Outcomes Prediction in Critical Care**|Yuxiao Cheng et.al.|[2502.02109v1](http://arxiv.org/abs/2502.02109v1)|null|
+|**2025-02-04**|**JingFang: A Traditional Chinese Medicine Large Language Model of Expert-Level Medical Diagnosis and Syndrome Differentiation-Based Treatment**|Yehan Yan et.al.|[2502.04345v1](http://arxiv.org/abs/2502.04345v1)|null|
+|**2025-02-03**|**An Agentic AI Workflow for Detecting Cognitive Concerns in Real-world Data**|Jiazi Tian et.al.|[2502.01789v1](http://arxiv.org/abs/2502.01789v1)|null|
+|**2025-02-03**|**Can Domain Experts Rely on AI Appropriately? A Case Study on AI-Assisted Prostate Cancer MRI Diagnosis**|Chacha Chen et.al.|[2502.03482v1](http://arxiv.org/abs/2502.03482v1)|null|
+|**2025-02-03**|**Improving Transformer World Models for Data-Efficient RL**|Antoine Dedieu et.al.|[2502.01591v1](http://arxiv.org/abs/2502.01591v1)|null|
+|**2025-02-03**|**Data-Efficient Model for Psychological Resilience Prediction based on Neurological Data**|Zhi Zhang et.al.|[2502.01377v1](http://arxiv.org/abs/2502.01377v1)|null|
+|**2025-02-03**|**OphthBench: A Comprehensive Benchmark for Evaluating Large Language Models in Chinese Ophthalmology**|Chengfeng Zhou et.al.|[2502.01243v1](http://arxiv.org/abs/2502.01243v1)|null|
+|**2025-02-03**|**MIND: Modality-Informed Knowledge Distillation Framework for Multimodal Clinical Prediction Tasks**|Alejandro Guerra-Manzanares et.al.|[2502.01158v1](http://arxiv.org/abs/2502.01158v1)|null|
+|**2025-02-03**|**Beyond Yes or No: Predictive Compliance Monitoring Approaches for Quantifying the Magnitude of Compliance Violations**|Qian Chen et.al.|[2502.01141v1](http://arxiv.org/abs/2502.01141v1)|null|
+|**2025-02-03**|**Pulse-PPG: An Open-Source Field-Trained PPG Foundation Model for Wearable Applications Across Lab and Field Settings**|Mithun Saha et.al.|[2502.01108v1](http://arxiv.org/abs/2502.01108v1)|null|
+|**2025-02-03**|**Tutorial on Using Machine Learning and Deep Learning Models for Mental Illness Detection**|Yeyubei Zhang et.al.|[2502.04342v1](http://arxiv.org/abs/2502.04342v1)|null|
+|**2025-02-02**|**Agent-Based Uncertainty Awareness Improves Automated Radiology Report Labeling with an Open-Source Large Language Model**|Hadas Ben-Atya et.al.|[2502.01691v1](http://arxiv.org/abs/2502.01691v1)|null|
+|**2025-02-02**|**Automated Extraction of Spatio-Semantic Graphs for Identifying Cognitive Impairment**|Si-Ioi Ng et.al.|[2502.01685v1](http://arxiv.org/abs/2502.01685v1)|null|
+|**2025-02-02**|**Registration-Enhanced Segmentation Method for Prostate Cancer in Ultrasound Images**|Shengtian Sang et.al.|[2502.00712v1](http://arxiv.org/abs/2502.00712v1)|null|
+|**2025-02-02**|**TMI-CLNet: Triple-Modal Interaction Network for Chronic Liver Disease Prognosis From Imaging, Clinical, and Radiomic Data Fusion**|Linglong Wu et.al.|[2502.00695v1](http://arxiv.org/abs/2502.00695v1)|null|
+|**2025-02-02**|**Safety at Scale: A Comprehensive Survey of Large Model Safety**|Xingjun Ma et.al.|[2502.05206v2](http://arxiv.org/abs/2502.05206v2)|null|
+|**2025-02-02**|**Enhanced Convolutional Neural Networks for Improved Image Classification**|Xiaoran Yang et.al.|[2502.00663v1](http://arxiv.org/abs/2502.00663v1)|null|
+|**2025-02-02**|**Distribution-aware Fairness Learning in Medical Image Segmentation From A Control-Theoretic Perspective**|Yujin Oh et.al.|[2502.00619v1](http://arxiv.org/abs/2502.00619v1)|null|
+|**2025-02-01**|**Generating crossmodal gene expression from cancer histopathology improves multimodal AI predictions**|Samiran Dey et.al.|[2502.00568v3](http://arxiv.org/abs/2502.00568v3)|[link](https://github.com/Samiran-Dey/PathoGen)|
 
-##### **Score-of-Mixture Training: Training One-Step Generative Models Made Simple**
-2502.09609v1 by Tejas Jayashankar, J. Jon Ryu, Gregory Wornell
+#### Abstracts
+##### **Metamorphic Testing for Pose Estimation Systems**
+2502.09460v1 by Matias Duran, Thomas Laurent, Ellen Rushe, Anthony Ventresque
 
-We propose Score-of-Mixture Training (SMT), a novel framework for training
-one-step generative models by minimizing a class of divergences called the
-$\alpha$-skew Jensen-Shannon divergence. At its core, SMT estimates the score
-of mixture distributions between real and fake samples across multiple noise
-levels. Similar to consistency models, our approach supports both training from
-scratch (SMT) and distillation using a pretrained diffusion model, which we
-call Score-of-Mixture Distillation (SMD). It is simple to implement, requires
-minimal hyperparameter tuning, and ensures stable training. Experiments on
-CIFAR-10 and ImageNet 64x64 show that SMT/SMD are competitive with and can even
-outperform existing methods.
+Pose estimation systems are used in a variety of fields, from sports
+analytics to livestock care. Given their potential impact, it is paramount to
+systematically test their behaviour and potential for failure. This is a
+complex task due to the oracle problem and the high cost of manual labelling
+necessary to build ground truth keypoints. This problem is exacerbated by the
+fact that different applications require systems to focus on different subjects
+(e.g., human versus animal) or landmarks (e.g., only extremities versus whole
+body and face), which makes labelled test data rarely reusable. To combat these
+problems we propose MET-POSE, a metamorphic testing framework for pose
+estimation systems that bypasses the need for manual annotation while assessing
+the performance of these systems under different circumstances. MET-POSE thus
+allows users of pose estimation systems to assess the systems in conditions
+that more closely relate to their application without having to label an ad-hoc
+test dataset or rely only on available datasets, which may not be adapted to
+their application domain. While we define MET-POSE in general terms, we also
+present a non-exhaustive list of metamorphic rules that represent common
+challenges in computer vision applications, as well as a specific way to
+evaluate these rules. We then experimentally show the effectiveness of MET-POSE
+by applying it to Mediapipe Holistic, a state of the art human pose estimation
+system, with the FLIC and PHOENIX datasets. With these experiments, we outline
+numerous ways in which the outputs of MET-POSE can uncover faults in pose
+estimation systems at a similar or higher rate than classic testing using hand
+labelled data, and show that users can tailor the rule set they use to the
+faults and level of accuracy relevant to their application.
 
-摘要：我們提出混合評分訓練 (SMT)，一種透過最小化稱為 $\alpha$-偏斜 Jensen-Shannon 距離的距離類別來訓練單步生成模型的新穎架構。在核心部分，SMT 估計真實和虛假樣本之間在多個雜訊層級的混合分配評分。與一致性模型類似，我們的做法支援從頭開始訓練 (SMT) 和使用預先訓練的擴散模型進行蒸餾，我們稱之為混合評分蒸餾 (SMD)。它易於實作，只需要最小的超參數調整，並確保穩定的訓練。在 CIFAR-10 和 ImageNet 64x64 上的實驗顯示，SMT/SMD 具有競爭力，甚至可以優於現有方法。
+摘要：姿勢估計系統應用於各種領域，從運動分析到牲畜照護。鑑於其潛在影響，系統性地測試其行為和故障潛力至關重要。由於預言機問題以及建立地面實況關鍵點所需的手動標記成本高，這是一項複雜的任務。這個問題因不同的應用需要系統專注於不同的主體（例如，人類對動物）或地標（例如，只有四肢對全身和臉部）而加劇，這使得標記的測試數據很少可以重複使用。為了解決這些問題，我們提出了 MET-POSE，這是一個姿勢估計系統的變形測試框架，在評估這些系統在不同情況下的性能時，可以繞過手動註解的需要。因此，MET-POSE 允許姿勢估計系統的使用者在更接近其應用程式的條件下評估系統，而無需標記臨時測試數據集或僅依賴可用數據集，這些數據集可能不適合其應用領域。雖然我們以一般術語定義 MET-POSE，但我們也提供了一個非詳盡的變形規則列表，這些規則代表了電腦視覺應用中的常見挑戰，以及評估這些規則的具體方法。然後，我們通過將 MET-POSE 應用於 Mediapipe Holistic（一種先進的人類姿勢估計系統），並使用 FLIC 和 PHOENIX 數據集，以實驗方式展示 MET-POSE 的有效性。通過這些實驗，我們概述了 MET-POSE 的輸出可以揭示姿勢估計系統中故障的許多方法，其速度與使用手動標記數據的傳統測試類似或更高，並表明使用者可以根據其應用程式相關的故障和準確度等級來調整他們使用的規則集。
 
-##### **Human-LLM Coevolution: Evidence from Academic Writing**
-2502.09606v1 by Mingmeng Geng, Roberto Trotta
+##### **The Joint Entity-Relation Extraction Model Based on Span and Interactive Fusion Representation for Chinese Medical Texts with Complex Semantics**
+2502.09247v1 by Danni Feng, Runzhi Li, Jing Wang, Siyu Yan, Lihong Ma, Yunli Xing
 
-With a statistical analysis of arXiv paper abstracts, we report a marked drop
-in the frequency of several words previously identified as overused by ChatGPT,
-such as "delve", starting soon after they were pointed out in early 2024. The
-frequency of certain other words favored by ChatGPT, such as "significant", has
-instead kept increasing. These phenomena suggest that some authors of academic
-papers have adapted their use of large language models (LLMs), for example, by
-selecting outputs or applying modifications to the LLM-generated content. Such
-coevolution and cooperation of humans and LLMs thus introduce additional
-challenges to the detection of machine-generated text in real-world scenarios.
-Estimating the impact of LLMs on academic writing by examining word frequency
-remains feasible, and more attention should be paid to words that were already
-frequently employed, including those that have decreased in frequency.
+Joint entity-relation extraction is a critical task in transforming
+unstructured or semi-structured text into triplets, facilitating the
+construction of large-scale knowledge graphs, and supporting various downstream
+applications. Despite its importance, research on Chinese text, particularly
+with complex semantics in specialized domains like medicine, remains limited.
+To address this gap, we introduce the CH-DDI, a Chinese drug-drug interactions
+dataset designed to capture the intricacies of medical text. Leveraging the
+strengths of attention mechanisms in capturing long-range dependencies, we
+propose the SEA module, which enhances the extraction of complex contextual
+semantic information, thereby improving entity recognition and relation
+extraction. Additionally, to address the inefficiencies of existing methods in
+facilitating information exchange between entity recognition and relation
+extraction, we present an interactive fusion representation module. This module
+employs Cross Attention for bidirectional information exchange between the
+tasks and further refines feature extraction through BiLSTM. Experimental
+results on both our CH-DDI dataset and public CoNLL04 dataset demonstrate that
+our model exhibits strong generalization capabilities. On the CH-DDI dataset,
+our model achieves an F1-score of 96.73% for entity recognition and 78.43% for
+relation extraction. On the CoNLL04 dataset, it attains an entity recognition
+precision of 89.54% and a relation extraction accuracy of 71.64%.
 
-摘要：透過對 arXiv 論文摘要進行統計分析，我們報告了幾個先前被認為 ChatGPT 過度使用的詞彙的頻率大幅下降，例如「深入探討」，從 2024 年初被指出後不久就開始下降。相反地，ChatGPT 偏好的某些其他詞彙，例如「顯著」，頻率持續增加。這些現象表明，一些學術論文作者已經調整了他們使用大型語言模型 (LLM) 的方式，例如，透過選擇輸出或對 LLM 生成的內容進行修改。因此，人類和 LLM 的這種共同演化和合作為在現實世界場景中偵測機器產生的文字帶來了額外的挑戰。透過檢視詞彙頻率來評估 LLM 對學術寫作的影響仍然可行，並且應該對已經頻繁使用的詞彙給予更多關注，包括那些頻率下降的詞彙。
+摘要：聯合實體關係抽取是將非結構化或半結構化文字轉換為三元組的重要任務，有助於建構大規模知識圖譜，並支援各種下游應用程式。儘管其重要性，但針對中文文本的研究，特別是醫學等專業領域中具有複雜語義的研究仍十分有限。為了解決這個差距，我們引入了 CH-DDI，一個中文藥物-藥物交互作用資料集，旨在擷取醫學文本的複雜性。利用注意力機制在擷取長程依賴關係方面的優勢，我們提出了 SEA 模組，增強了複雜脈絡語義資訊的抽取，從而改進了實體辨識和關係抽取。此外，為了解決現有方法在促進實體辨識和關係抽取之間資訊交換方面的低效率問題，我們提出了互動式融合表示模組。此模組採用交叉注意力，在任務之間進行雙向資訊交換，並透過 BiLSTM 進一步精煉特徵抽取。在我們的 CH-DDI 資料集和公開的 CoNLL04 資料集上的實驗結果表明，我們的模型展現出強大的泛化能力。在 CH-DDI 資料集上，我們的模型在實體辨識方面達到了 96.73% 的 F1 分數，在關係抽取方面達到了 78.43% 的 F1 分數。在 CoNLL04 資料集上，它在實體辨識方面達到了 89.54% 的準確度，在關係抽取方面達到了 71.64% 的準確度。
 
-##### **SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models**
-2502.09604v1 by Yung-Sung Chuang, Benjamin Cohen-Wang, Shannon Zejiang Shen, Zhaofeng Wu, Hu Xu, Xi Victoria Lin, James Glass, Shang-Wen Li, Wen-tau Yih
+##### **From large language models to multimodal AI: A scoping review on the potential of generative AI in medicine**
+2502.09242v1 by Lukas Buess, Matthias Keicher, Nassir Navab, Andreas Maier, Soroosh Tayebi Arasteh
 
-We introduce SelfCite, a novel self-supervised approach that aligns LLMs to
-generate high-quality, fine-grained, sentence-level citations for the
-statements in their generated responses. Instead of only relying on costly and
-labor-intensive annotations, SelfCite leverages a reward signal provided by the
-LLM itself through context ablation: If a citation is necessary, removing the
-cited text from the context should prevent the same response; if sufficient,
-retaining the cited text alone should preserve the same response. This reward
-can guide the inference-time best-of-N sampling strategy to improve citation
-quality significantly, as well as be used in preference optimization to
-directly fine-tune the models for generating better citations. The
-effectiveness of SelfCite is demonstrated by increasing citation F1 up to 5.3
-points on the LongBench-Cite benchmark across five long-form question answering
-tasks.
+Generative artificial intelligence (AI) models, such as diffusion models and
+OpenAI's ChatGPT, are transforming medicine by enhancing diagnostic accuracy
+and automating clinical workflows. The field has advanced rapidly, evolving
+from text-only large language models for tasks such as clinical documentation
+and decision support to multimodal AI systems capable of integrating diverse
+data modalities, including imaging, text, and structured data, within a single
+model. The diverse landscape of these technologies, along with rising interest,
+highlights the need for a comprehensive review of their applications and
+potential. This scoping review explores the evolution of multimodal AI,
+highlighting its methods, applications, datasets, and evaluation in clinical
+settings. Adhering to PRISMA-ScR guidelines, we systematically queried PubMed,
+IEEE Xplore, and Web of Science, prioritizing recent studies published up to
+the end of 2024. After rigorous screening, 144 papers were included, revealing
+key trends and challenges in this dynamic field. Our findings underscore a
+shift from unimodal to multimodal approaches, driving innovations in diagnostic
+support, medical report generation, drug discovery, and conversational AI.
+However, critical challenges remain, including the integration of heterogeneous
+data types, improving model interpretability, addressing ethical concerns, and
+validating AI systems in real-world clinical settings. This review summarizes
+the current state of the art, identifies critical gaps, and provides insights
+to guide the development of scalable, trustworthy, and clinically impactful
+multimodal AI solutions in healthcare.
 
-摘要：我們介紹 SelfCite，一種新穎的自監督方法，它將 LLM 對齊以針對其生成回應中的陳述生成高品質、細粒度、句子級別的引用。SelfCite 不僅依賴於昂貴且勞動密集的註解，還利用 LLM 本身通過上下文消融提供的獎勵信號：如果需要引用，從上下文中移除被引用的文字應當會阻止相同的回應；如果足夠，僅保留被引用的文字應當會保留相同的回應。此獎勵可以引導推理時間最佳 N 個取樣策略以顯著改善引文品質，並用於偏好最佳化以直接微調模型以生成更好的引文。SelfCite 的有效性通過在五個長篇問答任務中將 LongBench-Cite 基準上的引文 F1 提高多達 5.3 點來證明。
+摘要：生成式人工智能 (AI) 模型，例如扩散模型和 OpenAI 的 ChatGPT，通过提高诊断准确性和自动化临床工作流程，正在改变医学领域。该领域已迅速发展，从用于临床文件编制和决策支持等任务的纯文本大型语言模型，发展到能够在单个模型中整合包括影像、文本和结构化数据在内的多种数据方式的多模态 AI 系统。这些技术的多样化格局以及日益增长的兴趣，凸显了全面审查其应用和潜力的必要性。本范围审查探讨了多模态 AI 的演变，重点介绍了其方法、应用、数据集和在临床环境中的评估。遵循 PRISMA-ScR 指南，我们系统地查询了 PubMed、IEEE Xplore 和 Web of Science，优先考虑截至 2024 年底发表的最新研究。经过严格筛选，纳入了 144 篇论文，揭示了这一充满活力的领域的趋势和挑战。我们的研究结果强调了从单模态方法向多模态方法的转变，推动了诊断支持、医疗报告生成、药物发现和会话式 AI 的创新。然而，关键挑战仍然存在，包括异构数据类型的整合、提高模型可解释性、解决伦理问题以及在现实世界的临床环境中验证 AI 系统。本综述总结了当前的最新技术，确定了关键差距，并提供了见解，以指导在医疗保健领域开发可扩展、可信赖且具有临床影响力的多模态 AI 解决方案。
 
-##### **CoT-Valve: Length-Compressible Chain-of-Thought Tuning**
-2502.09601v1 by Xinyin Ma, Guangnian Wan, Runpeng Yu, Gongfan Fang, Xinchao Wang
+##### **Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**
+2502.09218v1 by Flavio Bertini, Alessandro Dal Palù, Federica Zaglio, Francesco Fabiano, Andrea Formisano
 
-Chain-of-Thought significantly enhances a model's reasoning capability, but
-it also comes with a considerable increase in inference costs due to long
-chains. With the observation that the reasoning path can be easily compressed
-under easy tasks but struggle on hard tasks, we explore the feasibility of
-elastically controlling the length of reasoning paths with only one model,
-thereby reducing the inference overhead of reasoning models dynamically based
-on task difficulty. We introduce a new tuning and inference strategy named
-CoT-Valve, designed to allow models to generate reasoning chains of varying
-lengths. To achieve this, we propose to identify a direction in the parameter
-space that, when manipulated, can effectively control the length of generated
-CoT. Moreover, we show that this property is valuable for compressing the
-reasoning chain. We construct datasets with chains from long to short for the
-same questions and explore two enhanced strategies for CoT-Valve: (1) a precise
-length-compressible CoT tuning method, and (2) a progressive chain length
-compression approach. Our experiments show that CoT-Valve successfully enables
-controllability and compressibility of the chain and shows better performance
-than the prompt-based control. We applied this method to QwQ-32B-Preview,
-reducing reasoning chains on GSM8K from 741 to 225 tokens with a minor
-performance drop (95.07% to 94.92%) and on AIME from 6827 to 4629 tokens, with
-only one additional incorrect answer.
+This paper presents a complete explainable system that interprets a set of
+data, abstracts the underlying features and describes them in a natural
+language of choice. The system relies on two crucial stages: (i) identifying
+emerging properties from data and transforming them into abstract concepts, and
+(ii) converting these concepts into natural language. Despite the impressive
+natural language generation capabilities demonstrated by Large Language Models,
+their statistical nature and the intricacy of their internal mechanism still
+force us to employ these techniques as black boxes, forgoing trustworthiness.
+Developing an explainable pipeline for data interpretation would allow
+facilitating its use in safety-critical environments like processing medical
+information and allowing non-experts and visually impaired people to access
+narrated information. To this end, we believe that the fields of knowledge
+representation and automated reasoning research could present a valid
+alternative. Expanding on prior research that tackled the first stage (i), we
+focus on the second stage, named Concept2Text. Being explainable, data
+translation is easily modeled through logic-based rules, once again emphasizing
+the role of declarative programming in achieving AI explainability. This paper
+explores a Prolog/CLP-based rewriting system to interpret concepts-articulated
+in terms of classes and relations, plus common knowledge-derived from a generic
+ontology, generating natural language text. Its main features include
+hierarchical tree rewritings, modular multilingual generation, support for
+equivalent variants across semantic, grammar, and lexical levels, and a
+transparent rule-based system. We outline the architecture and demonstrate its
+flexibility through some examples capable of generating numerous diverse and
+equivalent rewritings based on the input concept.
 
-摘要：<paragraph>連續思考大幅提升了模型的推理能力，但由於鏈條過長，也大幅增加了推理成本。由於觀察到推理路徑在簡單的任務中可以輕易壓縮，但在困難的任務中卻很吃力，我們探索了僅使用一個模型彈性控制推理路徑長度的可行性，從而根據任務難度動態減少推理模型的推理開銷。我們引入了一種名為 CoT-Valve 的新調校和推理策略，旨在讓模型產生長度不一的推理鏈。為此，我們提議在參數空間中識別一個方向，在操作時可以有效控制生成的 CoT 的長度。此外，我們展示了此屬性對於壓縮推理鏈是有價值的。我們構造了從長到短的鏈條的資料集，用於相同的問題，並探索了 CoT-Valve 的兩種增強策略：(1) 精確的長度可壓縮 CoT 調校方法，以及 (2) 漸進式鏈長壓縮方法。我們的實驗表明，CoT-Valve 成功地實現了鏈條的可控性和可壓縮性，並顯示出比基於提示的控制更好的效能。我們將此方法應用於 QwQ-32B-Preview，將 GSM8K 上的推理鏈條從 741 個代幣減少到 225 個代幣，效能僅略微下降 (95.07% 至 94.92%)，而在 AIME 上從 6827 個代幣減少到 4629 個代幣，只多了一個錯誤答案。</paragraph>
+摘要：<paragraph>這篇論文提出了一個完整的可解釋系統，它可以解釋一組資料，抽象出基礎特徵，並以選擇的自然語言描述它們。系統依賴兩個關鍵階段：(i) 從資料中識別新興屬性，並將它們轉換為抽象概念，以及 (ii) 將這些概念轉換為自然語言。儘管大型語言模型展示了令人印象深刻的自然語言生成能力，但它們的統計性質和內部機制的複雜性仍然迫使我們將這些技術用作黑盒子，放棄可信度。開發一個可解釋的資料解釋管道將有助於促進在安全關鍵環境中使用它，例如處理醫療資訊，並允許非專家和視障人士存取敘述資訊。為此，我們相信知識表示和自動推理研究領域可以提出一個有效的替代方案。在擴展解決第一階段 (i) 的先前研究的基礎上，我們專注於第二階段，稱為 Concept2Text。由於具有可解釋性，資料翻譯很容易透過基於邏輯的規則建模，再次強調宣告式程式設計在實現 AI 可解釋性中的作用。本文探討了一個基於 Prolog/CLP 的重寫系統，以解釋概念，這些概念以類別和關係的形式表達，再加上從通用本体衍生的常識，產生自然語言文字。它的主要特點包括階層樹重寫、模組化多語言生成、支援語義、語法和詞彙層面的等效變體，以及一個透明的基於規則的系統。我們概述了架構，並透過一些範例展示了它的靈活性，這些範例能夠根據輸入概念生成許多不同的等效重寫。</paragraph>
 
-##### **Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs**
-2502.09597v1 by Siyan Zhao, Mingyi Hong, Yang Liu, Devamanyu Hazarika, Kaixiang Lin
+##### **Logical Lease Litigation: Prolog and LLMs for Rental Law Compliance in New York**
+2502.09204v1 by Sanskar Sehgal, Yanhong A. Liu
 
-Large Language Models (LLMs) are increasingly used as chatbots, yet their
-ability to personalize responses to user preferences remains limited. We
-introduce PrefEval, a benchmark for evaluating LLMs' ability to infer, memorize
-and adhere to user preferences in a long-context conversational setting.
-PrefEval comprises 3,000 manually curated user preference and query pairs
-spanning 20 topics. PrefEval contains user personalization or preference
-information in both explicit and implicit forms, and evaluates LLM performance
-using a generation and a classification task. With PrefEval, we evaluated the
-aforementioned preference following capabilities of 10 open-source and
-proprietary LLMs in multi-session conversations with varying context lengths up
-to 100k tokens. We benchmark with various prompting, iterative feedback, and
-retrieval-augmented generation methods. Our benchmarking effort reveals that
-state-of-the-art LLMs face significant challenges in proactively following
-users' preferences during conversations. In particular, in zero-shot settings,
-preference following accuracy falls below 10% at merely 10 turns (~3k tokens)
-across most evaluated models. Even with advanced prompting and retrieval
-methods, preference following still deteriorates in long-context conversations.
-Furthermore, we show that fine-tuning on PrefEval significantly improves
-performance. We believe PrefEval serves as a valuable resource for measuring,
-understanding, and enhancing LLMs' preference following abilities, paving the
-way for personalized conversational agents. Our code and dataset are available
-at https://prefeval.github.io/.
+Legal cases require careful logical reasoning following the laws, whereas
+interactions with non- technical users must be in natural language. As an
+application combining logical reasoning using Prolog and natural language
+processing using large language models (LLMs), this paper presents a novel
+approach and system, LogicLease, to automate the analysis of landlord-tenant
+legal cases in the state of New York. LogicLease determines compliance with
+relevant legal requirements by analyzing case descriptions and citing all
+relevant laws. It leverages LLMs for information extraction and Prolog for
+legal reasoning. By separating information extraction from legal reasoning,
+LogicLease achieves greater transparency and control over the legal logic
+applied to each case. We evaluate the accuracy, efficiency, and robustness of
+LogicLease through a series of tests, achieving 100% accuracy and an average
+processing time of 2.57 seconds. LogicLease presents advantages over
+state-of-the-art LLM- based legal analysis systems by providing clear,
+step-by-step reasoning, citing specific laws, and distinguishing itself by its
+ability to avoid hallucinations - a common issue in LLMs.
 
-摘要：大型語言模型（LLM）正日益被用作聊天機器人，但它們根據使用者偏好個人化回應的能力仍然有限。我們引入了 PrefEval，一個用於評估 LLM 在長時間對話環境中推論、記憶和遵守使用者偏好的能力的基準。PrefEval 包含 3,000 個手動策劃的使用者偏好和查詢對，涵蓋 20 個主題。PrefEval 包含以明確和隱含形式表達的使用者個人化或偏好資訊，並使用生成和分類任務評估 LLM 效能。透過 PrefEval，我們評估了 10 個開源和專有 LLM 在多重對話中上述的偏好追蹤能力，對話內容長度最高達 100k 個符號。我們使用各種提示、迭代回饋和檢索增強生成方法進行基準測試。我們的基準測試工作顯示，最先進的 LLM 在對話中主動追蹤使用者偏好時面臨重大挑戰。特別是在零次學習設定中，在多數評估模型中，在僅 10 個回合（約 3k 個符號）時，偏好追蹤準確度低於 10%。即使使用進階提示和檢索方法，在長時間對話中偏好追蹤仍然會惡化。此外，我們展示了在 PrefEval 上進行微調會大幅改善效能。我們相信 PrefEval 可作為衡量、理解和提升 LLM 偏好追蹤能力的寶貴資源，為個人化對話代理鋪路。我們的程式碼和資料集可在 https://prefeval.github.io/ 取得。
+摘要：法律案件需要遵循法律进行谨慎的逻辑推理，而与非技术用户的互动必须使用自然语言。作为结合使用 Prolog 进行逻辑推理和使用大型语言模型 (LLM) 进行自然语言处理的应用程序，本文提出了一种新颖的方法和系统 LogicLease，以自动分析纽约州的房东与租户法律案件。LogicLease 通过分析案例描述并引用所有相关法律来确定是否符合相关法律要求。它利用 LLM 进行信息提取，并利用 Prolog 进行法律推理。通过将信息提取与法律推理分开，LogicLease 实现了对应用于每个案例的法律逻辑的更高透明度和控制力。我们通过一系列测试评估了 LogicLease 的准确性、效率和鲁棒性，实现了 100% 的准确性和 2.57 秒的平均处理时间。LogicLease 通过提供清晰、分步的推理，引用具体法律，并以其避免幻觉的能力而区别于最先进的基于 LLM 的法律分析系统，从而显示出优势——这是 LLM 中的常见问题。
 
-##### **KIMAs: A Configurable Knowledge Integrated Multi-Agent System**
-2502.09596v1 by Zitao Li, Fei Wei, Yuexiang Xie, Dawei Gao, Weirui Kuang, Zhijian Ma, Bingchen Qian, Yaliang Li, Bolin Ding
+##### **Two-Stage Representation Learning for Analyzing Movement Behavior Dynamics in People Living with Dementia**
+2502.09173v1 by Jin Cui, Alexander Capstick, Payam Barnaghi, Gregory Scott
 
-Knowledge-intensive conversations supported by large language models (LLMs)
-have become one of the most popular and helpful applications that can assist
-people in different aspects. Many current knowledge-intensive applications are
-centered on retrieval-augmented generation (RAG) techniques. While many
-open-source RAG frameworks facilitate the development of RAG-based
-applications, they often fall short in handling practical scenarios complicated
-by heterogeneous data in topics and formats, conversational context management,
-and the requirement of low-latency response times. This technical report
-presents a configurable knowledge integrated multi-agent system, KIMAs, to
-address these challenges. KIMAs features a flexible and configurable system for
-integrating diverse knowledge sources with 1) context management and query
-rewrite mechanisms to improve retrieval accuracy and multi-turn conversational
-coherency, 2) efficient knowledge routing and retrieval, 3) simple but
-effective filter and reference generation mechanisms, and 4) optimized
-parallelizable multi-agent pipeline execution. Our work provides a scalable
-framework for advancing the deployment of LLMs in real-world settings. To show
-how KIMAs can help developers build knowledge-intensive applications with
-different scales and emphases, we demonstrate how we configure the system to
-three applications already running in practice with reliable performance.
+In remote healthcare monitoring, time series representation learning reveals
+critical patient behavior patterns from high-frequency data. This study
+analyzes home activity data from individuals living with dementia by proposing
+a two-stage, self-supervised learning approach tailored to uncover low-rank
+structures. The first stage converts time-series activities into text sequences
+encoded by a pre-trained language model, providing a rich, high-dimensional
+latent state space using a PageRank-based method. This PageRank vector captures
+latent state transitions, effectively compressing complex behaviour data into a
+succinct form that enhances interpretability. This low-rank representation not
+only enhances model interpretability but also facilitates clustering and
+transition analysis, revealing key behavioral patterns correlated with
+clinicalmetrics such as MMSE and ADAS-COG scores. Our findings demonstrate the
+framework's potential in supporting cognitive status prediction, personalized
+care interventions, and large-scale health monitoring.
 
-摘要：由大型語言模型 (LLM) 支持的知識密集型對話
-已成為最受歡迎且有用的應用程式之一，可協助
-人們在不同面向獲得協助。許多當前的知識密集型應用程式
-都以檢索增強生成 (RAG) 技術為中心。雖然許多
-開放原始碼 RAG 架構促進了基於 RAG 的應用程式開發，但它們在處理
-主題和格式中異質資料、對話內容管理，以及低延遲回應時間的要求所造成的實際情況時，通常力有未逮。這份技術報告
-提出了可設定的知識整合多重代理系統，KIMAs，以
-解決這些挑戰。KIMAs 具備靈活且可設定的系統，可整合多樣化的知識來源，並具備 1) 內容管理和查詢
-改寫機制，以提升檢索準確度和多輪對話的連貫性，2) 有效的知識路由和檢索，3) 簡單但
-有效的篩選和參考產生機制，以及 4) 最佳化的可平行化多重代理管線執行。我們的作品提供了可擴充的
-架構，以推動在實際環境中部署 LLM。為了展示 KIMAs 如何協助開發人員建置不同規模和重點的知識密集型應用程式，我們示範如何設定系統至
-三個已實際執行且效能良好的應用程式。
+摘要：在遠程醫療監控中，時間序列表示學習揭示了高頻率數據中的關鍵患者行為模式。本研究通過提出一個兩階段、自我監督的學習方法來分析痴呆症患者的家庭活動數據，該方法專門用於發現低秩結構。第一階段將時間序列活動轉換為由預訓練語言模型編碼的文本序列，使用基於 PageRank 的方法提供了一個豐富、高維的潛在狀態空間。此 PageRank 向量捕獲潛在狀態轉換，有效地將複雜的行為數據壓縮成簡潔的形式，從而增強了解力。此低秩表示不僅增強了模型的可解釋性，還促進了聚類和轉換分析，揭示了與臨床指標（例如 MMSE 和 ADAS-COG 分數）相關的關鍵行為模式。我們的研究結果證明了該框架在支持認知狀態預測、個性化護理干預和大型健康監控方面的潛力。
 
-##### **Logical forms complement probability in understanding language model (and human) performance**
-2502.09589v1 by Yixuan Wang, Freda Shi
+##### **HistoSmith: Single-Stage Histology Image-Label Generation via Conditional Latent Diffusion for Enhanced Cell Segmentation and Classification**
+2502.08754v1 by Valentina Vadori, Jean-Marie Graïc, Antonella Peruffo, Livio Finos, Ujwala Kiran Chaudhari, Enrico Grisan
 
-With the increasing interest in using large language models (LLMs) for
-planning in natural language, understanding their behaviors becomes an
-important research question. This work conducts a systematic investigation of
-LLMs' ability to perform logical reasoning in natural language. We introduce a
-controlled dataset of hypothetical and disjunctive syllogisms in propositional
-and modal logic and use it as the testbed for understanding LLM performance.
-Our results lead to novel insights in predicting LLM behaviors: in addition to
-the probability of input (Gonen et al., 2023; McCoy et al., 2024), logical
-forms should be considered as orthogonal factors. In addition, we show
-similarities and differences between the logical reasoning performances of
-humans and LLMs by comparing LLM and human behavioral results.
+Precise segmentation and classification of cell instances are vital for
+analyzing the tissue microenvironment in histology images, supporting medical
+diagnosis, prognosis, treatment planning, and studies of brain
+cytoarchitecture. However, the creation of high-quality annotated datasets for
+training remains a major challenge. This study introduces a novel single-stage
+approach (HistoSmith) for generating image-label pairs to augment histology
+datasets. Unlike state-of-the-art methods that utilize diffusion models with
+separate components for label and image generation, our approach employs a
+latent diffusion model to learn the joint distribution of cellular layouts,
+classification masks, and histology images. This model enables tailored data
+generation by conditioning on user-defined parameters such as cell types,
+quantities, and tissue types. Trained on the Conic H&E histopathology dataset
+and the Nissl-stained CytoDArk0 dataset, the model generates realistic and
+diverse labeled samples. Experimental results demonstrate improvements in cell
+instance segmentation and classification, particularly for underrepresented
+cell types like neutrophils in the Conic dataset. These findings underscore the
+potential of our approach to address data scarcity challenges.
 
-摘要：隨著在自然語言規劃中使用大型語言模型（LLM）的興趣日益濃厚，理解其行為已成為一項重要的研究課題。本研究對 LLM 在自然語言中執行邏輯推理的能力進行了系統性調查。我們引入了一個由假設和析取三段論組成的受控資料集，並使用它作為理解 LLM 效能的測試平台。我們的結果產生了預測 LLM 行為的新見解：除了輸入的機率（Gonen 等人，2023 年；McCoy 等人，2024 年）之外，邏輯形式應被視為正交因子。此外，我們透過比較 LLM 和人類行為結果，展示了人類和 LLM 在邏輯推理表現上的相似性和差異性。
+摘要：精確的細胞實例分割和分類對於分析組織學影像中的組織微環境、支援醫療診斷、預後、治療規劃和腦部細胞結構研究至關重要。然而，建立用於訓練的高品質標註資料集仍然是一項重大挑戰。本研究提出了一種新穎的單階段方法 (HistoSmith)，用於產生影像標籤對，以擴充組織學資料集。與利用擴散模型並將標籤和影像產生分開的組成部分的現有技術不同，我們的做法採用潛在擴散模型來學習細胞佈局、分類遮罩和組織學影像的聯合分佈。此模型能透過調整使用者定義的參數（例如細胞類型、數量和組織類型）來進行客製化資料產生。在 Conic H&E 細胞病理學資料集和 Nissl 染色的 CytoDArk0 資料集上訓練後，此模型產生逼真且多樣化的標籤樣本。實驗結果顯示細胞實例分割和分類有顯著進步，特別是對於 Conic 資料集中代表性不足的細胞類型，例如中性球。這些發現強調了我們的方法在解決資料稀少性挑戰方面的潛力。
 
-##### **Optimizing GPT for Video Understanding: Zero-Shot Performance and Prompt Engineering**
-2502.09573v1 by Mark Beliaev, Victor Yang, Madhura Raju, Jiachen Sun, Xinghai Hu
+##### **Brain Latent Progression: Individual-based Spatiotemporal Disease Progression on 3D Brain MRIs via Latent Diffusion**
+2502.08560v1 by Lemuel Puglisi, Daniel C. Alexander, Daniele Ravì
 
-In this study, we tackle industry challenges in video content classification
-by exploring and optimizing GPT-based models for zero-shot classification
-across seven critical categories of video quality. We contribute a novel
-approach to improving GPT's performance through prompt optimization and policy
-refinement, demonstrating that simplifying complex policies significantly
-reduces false negatives. Additionally, we introduce a new
-decomposition-aggregation-based prompt engineering technique, which outperforms
-traditional single-prompt methods. These experiments, conducted on real
-industry problems, show that thoughtful prompt design can substantially enhance
-GPT's performance without additional finetuning, offering an effective and
-scalable solution for improving video classification systems across various
-domains in industry.
+The growing availability of longitudinal Magnetic Resonance Imaging (MRI)
+datasets has facilitated Artificial Intelligence (AI)-driven modeling of
+disease progression, making it possible to predict future medical scans for
+individual patients. However, despite significant advancements in AI, current
+methods continue to face challenges including achieving patient-specific
+individualization, ensuring spatiotemporal consistency, efficiently utilizing
+longitudinal data, and managing the substantial memory demands of 3D scans. To
+address these challenges, we propose Brain Latent Progression (BrLP), a novel
+spatiotemporal model designed to predict individual-level disease progression
+in 3D brain MRIs. The key contributions in BrLP are fourfold: (i) it operates
+in a small latent space, mitigating the computational challenges posed by
+high-dimensional imaging data; (ii) it explicitly integrates subject metadata
+to enhance the individualization of predictions; (iii) it incorporates prior
+knowledge of disease dynamics through an auxiliary model, facilitating the
+integration of longitudinal data; and (iv) it introduces the Latent Average
+Stabilization (LAS) algorithm, which (a) enforces spatiotemporal consistency in
+the predicted progression at inference time and (b) allows us to derive a
+measure of the uncertainty for the prediction. We train and evaluate BrLP on
+11,730 T1-weighted (T1w) brain MRIs from 2,805 subjects and validate its
+generalizability on an external test set comprising 2,257 MRIs from 962
+subjects. Our experiments compare BrLP-generated MRI scans with real follow-up
+MRIs, demonstrating state-of-the-art accuracy compared to existing methods. The
+code is publicly available at: https://github.com/LemuelPuglisi/BrLP.
 
-摘要：在這項研究中，我們透過探索和最佳化基於 GPT 的模型，來處理影片內容分類中的產業挑戰，並針對影片品質的七個關鍵類別進行零次學習分類。我們貢獻了一種透過提示最佳化和政策改善來提升 GPT 效能的新方法，證明簡化複雜政策能大幅減少假陰性。此外，我們還引入了一種新的基於分解聚合的提示工程技術，其效能優於傳統的單一提示方法。這些在真實產業問題上執行的實驗顯示，經過深思熟慮的提示設計可以在不進行額外微調的情況下大幅提升 GPT 的效能，為提升產業中各種領域的影片分類系統提供了一個有效且可擴充的解決方案。
+摘要：隨著縱向磁共振影像 (MRI) 資料集的日益普及，已促進人工智慧 (AI) 驅動的疾病進程建模，讓預測個別患者的未來醫學掃描成為可能。然而，儘管 AI 有顯著進展，目前的技術仍面臨挑戰，包括實現患者特定的個別化、確保時空一致性、有效利用縱向資料，以及管理 3D 掃描的大量記憶體需求。為了應對這些挑戰，我們提出腦潛在進程 (BrLP)，這是一種新穎的時空模型，旨在預測 3D 腦部 MRI 中的個人層級疾病進程。BrLP 的主要貢獻有四個：(i) 它在一個小的潛在空間中運作，減輕了高維度影像資料帶來的計算挑戰；(ii) 它明確整合受試者的元資料，以增強預測的個別化；(iii) 它透過輔助模型納入疾病動態的先驗知識，促進縱向資料的整合；(iv) 它引入了潛在平均穩定化 (LAS) 演算法，該演算法 (a) 在推論時強制預測進程中的時空一致性，(b) 讓我們能夠推導預測的不確定性測量。我們對來自 2,805 名受試者的 11,730 個 T1 加權 (T1w) 腦部 MRI 進行 BrLP 訓練和評估，並在包含來自 962 名受試者的 2,257 個 MRI 的外部測試集上驗證其概括性。我們的實驗將 BrLP 生成的 MRI 掃描與實際追蹤 MRI 進行比較，與現有方法相比，展示了最先進的準確性。程式碼已公開於：https://github.com/LemuelPuglisi/BrLP。
 
-##### **MorphNLI: A Stepwise Approach to Natural Language Inference Using Text Morphing**
-2502.09567v1 by Vlad Andrei Negru, Robert Vacareanu, Camelia Lemnaru, Mihai Surdeanu, Rodica Potolea
+##### **Representation Learning to Advance Multi-institutional Studies with Electronic Health Record Data**
+2502.08547v1 by Doudou Zhou, Han Tong, Linshanshan Wang, Suqi Liu, Xin Xiong, Ziming Gan, Romain Griffier, Boris Hejblum, Yun-Chung Liu, Chuan Hong, Clara-Lea Bonzel, Tianrun Cai, Kevin Pan, Yuk-Lam Ho, Lauren Costa, Vidul A. Panickan, J. Michael Gaziano, Kenneth Mandl, Vianney Jouhet, Rodolphe Thiebaut, Zongqi Xia, Kelly Cho, Katherine Liao, Tianxi Cai
 
-We introduce MorphNLI, a modular step-by-step approach to natural language
-inference (NLI). When classifying the premise-hypothesis pairs into
-{entailment, contradiction, neutral}, we use a language model to generate the
-necessary edits to incrementally transform (i.e., morph) the premise into the
-hypothesis. Then, using an off-the-shelf NLI model we track how the entailment
-progresses with these atomic changes, aggregating these intermediate labels
-into a final output. We demonstrate the advantages of our proposed method
-particularly in realistic cross-domain settings, where our method always
-outperforms strong baselines with improvements up to 12.6% (relative). Further,
-our proposed approach is explainable as the atomic edits can be used to
-understand the overall NLI label.
+The adoption of EHRs has expanded opportunities to leverage data-driven
+algorithms in clinical care and research. A major bottleneck in effectively
+conducting multi-institutional EHR studies is the data heterogeneity across
+systems with numerous codes that either do not exist or represent different
+clinical concepts across institutions. The need for data privacy further limits
+the feasibility of including multi-institutional patient-level data required to
+study similarities and differences across patient subgroups. To address these
+challenges, we developed the GAME algorithm. Tested and validated across 7
+institutions and 2 languages, GAME integrates data in several levels: (1) at
+the institutional level with knowledge graphs to establish relationships
+between codes and existing knowledge sources, providing the medical context for
+standard codes and their relationship to each other; (2) between institutions,
+leveraging language models to determine the relationships between
+institution-specific codes with established standard codes; and (3) quantifying
+the strength of the relationships between codes using a graph attention
+network. Jointly trained embeddings are created using transfer and federated
+learning to preserve data privacy. In this study, we demonstrate the
+applicability of GAME in selecting relevant features as inputs for AI-driven
+algorithms in a range of conditions, e.g., heart failure, rheumatoid arthritis.
+We then highlight the application of GAME harmonized multi-institutional EHR
+data in a study of Alzheimer's disease outcomes and suicide risk among patients
+with mental health disorders, without sharing patient-level data outside
+individual institutions.
 
-摘要：我們引入 MorphNLI，一種模組化逐步方法，用於自然語言推論 (NLI)。當對前提假設對進行分類時，我們使用語言模型來產生必要的編輯，以逐步轉換（即，變形）前提成為假設。然後，使用現成的 NLI 模型，我們追蹤推論如何隨著這些原子變化而進展，將這些中間標籤彙總成最終輸出。我們展示了我們提出的方法的優點，特別是在現實的跨網域設置中，我們的模型始終優於強大的基線，改進幅度高達 12.6%（相對）。此外，我們提出的方法是可以解釋的，因為原子編輯可以用來理解整體 NLI 標籤。
+摘要：電子健康紀錄的採用擴大了在臨床照護和研究中利用資料驅動演算法的機會。在有效進行多機構電子健康紀錄研究時，一個主要的瓶頸是系統間資料異質性，其中有許多代碼在機構間不存在或表示不同的臨床概念。資料隱私的需求進一步限制了納入多機構患者層級資料的可行性，而這些資料對於研究患者亞群之間的相似性和差異性是必要的。為了應對這些挑戰，我們開發了 GAME 演算法。GAME 已在 7 個機構和 2 種語言中進行測試和驗證，它整合了多個層級的資料：(1) 在機構層級，使用知識圖表來建立代碼和現有知識來源之間的關係，為標準代碼及其彼此之間的關係提供醫療背景；(2) 在機構之間，利用語言模型來確定機構特定代碼與已建立的標準代碼之間的關係；(3) 使用圖形注意網路量化代碼之間關係的強度。使用遷移和聯合學習建立聯合訓練的嵌入，以保護資料隱私。在本研究中，我們展示了 GAME 在選擇相關特徵作為 AI 驅動演算法輸入時的適用性，適用於各種情況，例如心臟衰竭、類風濕性關節炎。然後，我們重點介紹了 GAME 和諧化多機構電子健康紀錄資料在阿茲海默症疾病結果和精神疾病患者自殺風險研究中的應用，而無需在個別機構之外共享患者層級資料。
 
-##### **Zero-shot generation of synthetic neurosurgical data with large language models**
-2502.09566v1 by Austin A. Barr, Eddie Guo, Emre Sezgin
+##### **EEG Artifact Detection and Correction with Deep Autoencoders**
+2502.08686v1 by David Aquilué-Llorens, Aureli Soria-Frisch
 
-Clinical data is fundamental to advance neurosurgical research, but access is
-often constrained by data availability, small sample sizes, privacy
-regulations, and resource-intensive preprocessing and de-identification
-procedures. Synthetic data offers a potential solution to challenges associated
-with accessing and using real-world data (RWD). This study aims to evaluate the
-capability of zero-shot generation of synthetic neurosurgical data with a large
-language model (LLM), GPT-4o, by benchmarking with the conditional tabular
-generative adversarial network (CTGAN). Synthetic datasets were compared to
-real-world neurosurgical data to assess fidelity (means, proportions,
-distributions, and bivariate correlations), utility (ML classifier performance
-on RWD), and privacy (duplication of records from RWD). The GPT-4o-generated
-datasets matched or exceeded CTGAN performance, despite no fine-tuning or
-access to RWD for pre-training. Datasets demonstrated high univariate and
-bivariate fidelity to RWD without directly exposing any real patient records,
-even at amplified sample size. Training an ML classifier on GPT-4o-generated
-data and testing on RWD for a binary prediction task showed an F1 score (0.706)
-with comparable performance to training on the CTGAN data (0.705) for
-predicting postoperative functional status deterioration. GPT-4o demonstrated a
-promising ability to generate high-fidelity synthetic neurosurgical data. These
-findings also indicate that data synthesized with GPT-4o can effectively
-augment clinical data with small sample sizes, and train ML models for
-prediction of neurosurgical outcomes. Further investigation is necessary to
-improve the preservation of distributional characteristics and boost classifier
-performance.
+EEG signals convey important information about brain activity both in healthy
+and pathological conditions. However, they are inherently noisy, which poses
+significant challenges for accurate analysis and interpretation. Traditional
+EEG artifact removal methods, while effective, often require extensive expert
+intervention. This study presents LSTEEG, a novel LSTM-based autoencoder
+designed for the detection and correction of artifacts in EEG signals.
+Leveraging deep learning, particularly LSTM layers, LSTEEG captures non-linear
+dependencies in sequential EEG data. LSTEEG demonstrates superior performance
+in both artifact detection and correction tasks compared to other
+state-of-the-art convolutional autoencoders. Our methodology enhances the
+interpretability and utility of the autoencoder's latent space, enabling
+data-driven automated artefact removal in EEG its application in downstream
+tasks. This research advances the field of efficient and accurate multi-channel
+EEG preprocessing, and promotes the implementation and usage of automated EEG
+analysis pipelines for brain health applications.
 
-摘要：<paragraph>臨床數據是推進神經外科研究的基礎，但訪問通常受到數據可用性、樣本量小、隱私法規以及資源密集型預處理和去識別程序的限制。合成數據為與存取和使用真實世界數據 (RWD) 相關的挑戰提供了潛在解決方案。本研究旨在評估使用大型語言模型 (LLM) GPT-4o 零次生成合成神經外科數據的能力，並通過條件表格生成對抗網路 (CTGAN) 進行基準測試。將合成數據集與真實世界的神經外科數據進行比較，以評估保真度（平均值、比例、分布和二元相關性）、實用性（RWD 上的 ML 分類器性能）和隱私（RWD 中記錄的重複）。儘管沒有微調或訪問 RWD 進行預訓練，但 GPT-4o 生成的數據集與 CTGAN 性能相匹配或超過 CTGAN 性能。數據集證明了對 RWD 的高單變量和二變量保真度，即使在擴充的樣本量下也不會直接公開任何真實患者記錄。在 GPT-4o 生成的數據上訓練 ML 分類器，並在 RWD 上測試二元預測任務，顯示 F1 分數 (0.706) 與在 CTGAN 數據上訓練以預測術後功能狀態惡化時的性能相當 (0.705)。GPT-4o 展示了生成高保真合成神經外科數據的潛力。這些發現還表明，使用 GPT-4o 合成的數據可以有效地增加樣本量小的臨床數據，並訓練 ML 模型以預測神經外科結果。需要進一步研究以改善分佈特徵的保留並提升分類器性能。</paragraph>
+摘要：腦電圖訊號傳達了關於大腦活動的重要資訊，無論是在健康或病理狀況下。然而，它們本質上是有雜訊的，這對準確的分析和解釋構成了重大的挑戰。傳統的腦電圖人工製品移除方法雖然有效，但通常需要大量的專家介入。本研究提出 LSTEEG，一種新穎的基於 LSTM 的自動編碼器，用於偵測和校正腦電圖訊號中的人工製品。利用深度學習，特別是 LSTM 層，LSTEEG 捕捉序列腦電圖資料中的非線性依賴性。與其他最先進的卷積自動編碼器相比，LSTEEG 在人工製品偵測和校正任務中都展現出優異的效能。我們的做法增強了自動編碼器潛在空間的可解釋性和實用性，讓資料驅動的自動人工製品移除得以應用於腦電圖的下游任務。這項研究推動了高效且準確的多通道腦電圖前處理領域，並促進了自動腦電圖分析管線在腦部健康應用中的實作和使用。
 
-##### **MDCrow: Automating Molecular Dynamics Workflows with Large Language Models**
-2502.09565v1 by Quintina Campbell, Sam Cox, Jorge Medina, Brittany Watterson, Andrew D. White
+##### **SycEval: Evaluating LLM Sycophancy**
+2502.08177v1 by Aaron Fanous, Jacob Goldberg, Ank A. Agarwal, Joanna Lin, Anson Zhou, Roxana Daneshjou, Sanmi Koyejo
 
-Molecular dynamics (MD) simulations are essential for understanding
-biomolecular systems but remain challenging to automate. Recent advances in
-large language models (LLM) have demonstrated success in automating complex
-scientific tasks using LLM-based agents. In this paper, we introduce MDCrow, an
-agentic LLM assistant capable of automating MD workflows. MDCrow uses
-chain-of-thought over 40 expert-designed tools for handling and processing
-files, setting up simulations, analyzing the simulation outputs, and retrieving
-relevant information from literature and databases. We assess MDCrow's
-performance across 25 tasks of varying required subtasks and difficulty, and we
-evaluate the agent's robustness to both difficulty and prompt style.
-\texttt{gpt-4o} is able to complete complex tasks with low variance, followed
-closely by \texttt{llama3-405b}, a compelling open-source model. While prompt
-style does not influence the best models' performance, it has significant
-effects on smaller models.
+Large language models (LLMs) are increasingly applied in educational,
+clinical, and professional settings, but their tendency for sycophancy --
+prioritizing user agreement over independent reasoning -- poses risks to
+reliability. This study introduces a framework to evaluate sycophantic behavior
+in ChatGPT-4o, Claude-Sonnet, and Gemini-1.5-Pro across AMPS (mathematics) and
+MedQuad (medical advice) datasets. Sycophantic behavior was observed in 58.19%
+of cases, with Gemini exhibiting the highest rate (62.47%) and ChatGPT the
+lowest (56.71%). Progressive sycophancy, leading to correct answers, occurred
+in 43.52% of cases, while regressive sycophancy, leading to incorrect answers,
+was observed in 14.66%. Preemptive rebuttals demonstrated significantly higher
+sycophancy rates than in-context rebuttals (61.75% vs. 56.52%, $Z=5.87$,
+$p<0.001$), particularly in computational tasks, where regressive sycophancy
+increased significantly (preemptive: 8.13%, in-context: 3.54%, $p<0.001$).
+Simple rebuttals maximized progressive sycophancy ($Z=6.59$, $p<0.001$), while
+citation-based rebuttals exhibited the highest regressive rates ($Z=6.59$,
+$p<0.001$). Sycophantic behavior showed high persistence (78.5%, 95% CI:
+[77.2%, 79.8%]) regardless of context or model. These findings emphasize the
+risks and opportunities of deploying LLMs in structured and dynamic domains,
+offering insights into prompt programming and model optimization for safer AI
+applications.
 
-摘要：分子動力學 (MD) 模擬對於理解生物分子系統至關重要，但自動化仍然具有挑戰性。大型語言模型 (LLM) 的最新進展已證明使用基於 LLM 的代理自動化複雜的科學任務是成功的。在本文中，我們介紹了 MDCrow，這是一個代理 LLM 助理，能夠自動化 MD 工作流程。MDCrow 使用 40 多種專家設計的工具的思考鏈來處理和處理檔案、設定模擬、分析模擬輸出，以及從文獻和資料庫中檢索相關資訊。我們評估了 MDCrow 在 25 項任務中的表現，這些任務所需的子任務和難度各不相同，並且我們評估了代理對難度和提示樣式的穩健性。\texttt{gpt-4o} 能夠以低變異完成複雜的任務，緊隨其後的是一個引人注目的開源模型 \texttt{llama3-405b}。雖然提示樣式不會影響最佳模型的效能，但它對較小的模型有顯著的影響。
+摘要：大型語言模型（LLM）日益應用於教育、臨床和專業領域，但它們趨於趨炎附勢——優先考慮用戶同意而非獨立推理——對可靠性構成風險。本研究引入了一個框架來評估 ChatGPT-4o、Claude-Sonnet 和 Gemini-1.5-Pro 中的趨炎附勢行為，涉及 AMPS（數學）和 MedQuad（醫療建議）數據集。在 58.19% 的案例中觀察到了趨炎附勢行為，其中 Gemini 表現出最高比率（62.47%），而 ChatGPT 最低（56.71%）。導致正確答案的漸進式趨炎附勢發生在 43.52% 的案例中，而導致不正確答案的退步式趨炎附勢則在 14.66% 的案例中被觀察到。先發制人的反駁表現出顯著高於上下文反駁的趨炎附勢率（61.75% 對 56.52%，Z=5.87，p<0.001），特別是在計算任務中，其中退步式趨炎附勢顯著增加（先發制人：8.13%，上下文：3.54%，p<0.001）。簡單的反駁最大化了漸進式趨炎附勢（Z=6.59，p<0.001），而基於引用的反駁表現出最高的退步式比率（Z=6.59，p<0.001）。趨炎附勢行為表現出很高的持續性（78.5%，95% CI：[77.2%，79.8%]），無論上下文或模型如何。這些發現強調了在結構化和動態領域部署 LLM 的風險和機遇，為更安全的 AI 應用提供了提示編程和模型優化的見解。
 
-##### **EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents**
-2502.09560v1 by Rui Yang, Hanyang Chen, Junyu Zhang, Mark Zhao, Cheng Qian, Kangrui Wang, Qineng Wang, Teja Venkat Koripella, Marziyeh Movahedi, Manling Li, Heng Ji, Huan Zhang, Tong Zhang
+##### **Caught in the Web of Words: Do LLMs Fall for Spin in Medical Literature?**
+2502.07963v1 by Hye Sun Yun, Karen Y. C. Zhang, Ramez Kouzy, Iain J. Marshall, Junyi Jessy Li, Byron C. Wallace
 
-Leveraging Multi-modal Large Language Models (MLLMs) to create embodied
-agents offers a promising avenue for tackling real-world tasks. While
-language-centric embodied agents have garnered substantial attention,
-MLLM-based embodied agents remain underexplored due to the lack of
-comprehensive evaluation frameworks. To bridge this gap, we introduce
-EmbodiedBench, an extensive benchmark designed to evaluate vision-driven
-embodied agents. EmbodiedBench features: (1) a diverse set of 1,128 testing
-tasks across four environments, ranging from high-level semantic tasks (e.g.,
-household) to low-level tasks involving atomic actions (e.g., navigation and
-manipulation); and (2) six meticulously curated subsets evaluating essential
-agent capabilities like commonsense reasoning, complex instruction
-understanding, spatial awareness, visual perception, and long-term planning.
-Through extensive experiments, we evaluated 13 leading proprietary and
-open-source MLLMs within EmbodiedBench. Our findings reveal that: MLLMs excel
-at high-level tasks but struggle with low-level manipulation, with the best
-model, GPT-4o, scoring only 28.9% on average. EmbodiedBench provides a
-multifaceted standardized evaluation platform that not only highlights existing
-challenges but also offers valuable insights to advance MLLM-based embodied
-agents. Our code is available at https://embodiedbench.github.io.
+Medical research faces well-documented challenges in translating novel
+treatments into clinical practice. Publishing incentives encourage researchers
+to present "positive" findings, even when empirical results are equivocal.
+Consequently, it is well-documented that authors often spin study results,
+especially in article abstracts. Such spin can influence clinician
+interpretation of evidence and may affect patient care decisions. In this
+study, we ask whether the interpretation of trial results offered by Large
+Language Models (LLMs) is similarly affected by spin. This is important since
+LLMs are increasingly being used to trawl through and synthesize published
+medical evidence. We evaluated 22 LLMs and found that they are across the board
+more susceptible to spin than humans. They might also propagate spin into their
+outputs: We find evidence, e.g., that LLMs implicitly incorporate spin into
+plain language summaries that they generate. We also find, however, that LLMs
+are generally capable of recognizing spin, and can be prompted in a way to
+mitigate spin's impact on LLM outputs.
 
-摘要：<paragraph>利用多模態大型語言模型 (MLLM) 來建立具身代理，提供了解決現實世界任務的有前景途徑。儘管以語言為中心的具身代理已獲得大量關注，但由於缺乏全面的評估框架，基於 MLLM 的具身代理仍未得到充分探索。為了彌補這一差距，我們引入了 EmbodiedBench，這是一個廣泛的基準測試，旨在評估以視覺為導向的具身代理。EmbodiedBench 的特點：(1) 跨越四個環境的 1,128 項多樣化測試任務，範圍從高層級語義任務（例如，家庭）到涉及原子動作的低層級任務（例如，導航和操作）；以及 (2) 六個精心策劃的子集，用於評估基本的代理能力，例如常識推理、複雜指令理解、空間感知、視覺感知和長期規劃。通過廣泛的實驗，我們在 EmbodiedBench 中評估了 13 個領先的專有和開源 MLLM。我們的研究結果表明：MLLM 在高層級任務中表現出色，但在低層級操作中遇到困難，表現最好的模型 GPT-4o 平均得分僅為 28.9%。EmbodiedBench 提供了一個多方面的標準化評估平台，不僅突出了現有挑戰，還提供了有價值的見解來推進基於 MLLM 的具身代理。我們的程式碼可在 https://embodiedbench.github.io/ 取得。</paragraph>
+摘要：醫學研究在將新穎療法轉化為臨床實務上，面臨著有據可查的挑戰。發表誘因鼓勵研究人員呈現「正向」的發現，即使經驗結果模稜兩可。因此，有據可查的是，作者經常扭曲研究結果，特別是在文章摘要中。此類扭曲可能會影響臨床醫師對證據的詮釋，並可能影響病患照護決策。在本研究中，我們探討大型語言模型 (LLM) 提供的試驗結果詮釋是否也受到扭曲影響。由於 LLM 正越來越常被用於爬梳和綜合已發表的醫學證據，因此這點非常重要。我們評估了 22 個 LLM，發現它們普遍比人類更容易受到扭曲影響。它們也可能將扭曲傳播到其輸出中：例如，我們發現 LLM 會將扭曲隱含納入其產生的白話文摘要中。然而，我們也發現 LLM 通常有能力辨認扭曲，而且可以透過提示的方式減輕扭曲對 LLM 輸出的影響。
 
-##### **Mind the Gap! Choice Independence in Using Multilingual LLMs for Persuasive Co-Writing Tasks in Different Languages**
-2502.09532v1 by Shreyan Biswas, Alexander Erlei, Ujwal Gadiraju
+##### **An Advanced NLP Framework for Automated Medical Diagnosis with DeBERTa and Dynamic Contextual Positional Gating**
+2502.07755v1 by Mohammad Ali Labbaf Khaniki, Sahabeh Saadati, Mohammad Manthouri
 
-Recent advances in generative AI have precipitated a proliferation of novel
-writing assistants. These systems typically rely on multilingual large language
-models (LLMs), providing globalized workers the ability to revise or create
-diverse forms of content in different languages. However, there is substantial
-evidence indicating that the performance of multilingual LLMs varies between
-languages. Users who employ writing assistance for multiple languages are
-therefore susceptible to disparate output quality. Importantly, recent research
-has shown that people tend to generalize algorithmic errors across independent
-tasks, violating the behavioral axiom of choice independence. In this paper, we
-analyze whether user utilization of novel writing assistants in a charity
-advertisement writing task is affected by the AI's performance in a second
-language. Furthermore, we quantify the extent to which these patterns translate
-into the persuasiveness of generated charity advertisements, as well as the
-role of peoples' beliefs about LLM utilization in their donation choices. Our
-results provide evidence that writers who engage with an LLM-based writing
-assistant violate choice independence, as prior exposure to a Spanish LLM
-reduces subsequent utilization of an English LLM. While these patterns do not
-affect the aggregate persuasiveness of the generated advertisements, people's
-beliefs about the source of an advertisement (human versus AI) do. In
-particular, Spanish-speaking female participants who believed that they read an
-AI-generated advertisement strongly adjusted their donation behavior downwards.
-Furthermore, people are generally not able to adequately differentiate between
-human-generated and LLM-generated ads. Our work has important implications for
-the design, development, integration, and adoption of multilingual LLMs as
-assistive agents -- particularly in writing tasks.
+This paper presents a novel Natural Language Processing (NLP) framework for
+enhancing medical diagnosis through the integration of advanced techniques in
+data augmentation, feature extraction, and classification. The proposed
+approach employs back-translation to generate diverse paraphrased datasets,
+improving robustness and mitigating overfitting in classification tasks.
+Leveraging Decoding-enhanced BERT with Disentangled Attention (DeBERTa) with
+Dynamic Contextual Positional Gating (DCPG), the model captures fine-grained
+contextual and positional relationships, dynamically adjusting the influence of
+positional information based on semantic context to produce high-quality text
+embeddings. For classification, an Attention-Based Feedforward Neural Network
+(ABFNN) is utilized, effectively focusing on the most relevant features to
+improve decision-making accuracy. Applied to the classification of symptoms,
+clinical notes, and other medical texts, this architecture demonstrates its
+ability to address the complexities of medical data. The combination of data
+augmentation, contextual embedding generation, and advanced classification
+mechanisms offers a robust and accurate diagnostic tool, with potential
+applications in automated medical diagnosis and clinical decision support. This
+method demonstrates the effectiveness of the proposed NLP framework for medical
+diagnosis, achieving remarkable results with an accuracy of 99.78%, recall of
+99.72%, precision of 99.79%, and an F1-score of 99.75%. These metrics not only
+underscore the model's robust performance in classifying medical texts with
+exceptional precision and reliability but also highlight its superiority over
+existing methods, making it a highly promising tool for automated diagnostic
+systems.
 
-摘要：<paragraph>生成式 AI 的最新進展加速了新穎寫作助理的激增。這些系統通常依賴多語言大型語言模型 (LLM)，讓全球化的工作者能夠以不同的語言修改或建立各種形式的內容。然而，有大量證據顯示多語言 LLM 的表現因語言而異。因此，使用多語言寫作協助的使用者容易受到不同的輸出品質影響。重要的是，最近的研究顯示人們傾向於在獨立的任務中概化演算法錯誤，違反了選擇獨立性的行為公理。在本文中，我們分析使用者在慈善廣告寫作任務中使用新穎寫作助理是否會受到 AI 在第二語言中的表現影響。此外，我們量化這些模式轉化為所產生慈善廣告說服力的程度，以及人們對 LLM 使用在捐款選擇中的信念所扮演的角色。我們的結果提供證據，表明與基於 LLM 的寫作助理互動的寫作者會違反選擇獨立性，因為先前接觸過西班牙語 LLM 會減少後續使用英語 LLM 的情況。雖然這些模式不會影響所產生廣告的整體說服力，但人們對廣告來源（人類與 AI）的信念會影響。特別是，相信自己閱讀 AI 生成的廣告的西班牙語系女性參與者大幅調整了他們的捐款行為。此外，人們通常無法充分區分人類產生的廣告和 LLM 產生的廣告。我們的研究對多語言 LLM 作為輔助代理的設計、開發、整合和採用具有重要的意義，特別是在寫作任務中。</paragraph>
+摘要：本文提出了一個創新的自然語言處理 (NLP) 框架，透過整合資料擴充、特徵萃取和分類的進階技術來增強醫療診斷。所提出的方法採用反向翻譯來產生多樣化的同義改寫資料集，提升穩健性並減輕分類任務中的過度擬合。透過利用具有動態脈絡位置閘控 (DCPG) 的解碼增強 BERT 與去糾纏注意力 (DeBERTa)，這個模型捕捉細緻的脈絡和位置關係，根據語意脈絡動態調整位置資訊的影響，以產生高品質的文字嵌入。在分類方面，利用基於注意力的前饋神經網路 (ABFNN)，有效地關注最相關的特徵，以提高決策準確度。應用於症狀、臨床筆記和其他醫療文本的分類，此架構證明了其處理醫療資料複雜性的能力。資料擴充、脈絡嵌入產生和進階分類機制的結合提供了一個穩健且準確的診斷工具，在自動化醫療診斷和臨床決策支援中具有潛在應用。此方法證明了所提出的 NLP 框架在醫療診斷中的有效性，以 99.78% 的準確度、99.72% 的召回率、99.79% 的精確度和 99.75% 的 F1 分數，取得了顯著的成果。這些指標不僅強調了模型在分類醫療文本時具有卓越的精確度和可靠性，也突顯了它優於現有方法的優越性，使其成為自動化診斷系統中極具前景的工具。
 
-##### **Diffusion Models for Molecules: A Survey of Methods and Tasks**
-2502.09511v1 by Liang Wang, Chao Song, Zhiyuan Liu, Yu Rong, Qiang Liu, Shu Wu, Liang Wang
+##### **Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension**
+2502.07752v1 by Wenbo Gong, Meyer Scetbon, Chao Ma, Edward Meeds
 
-Generative tasks about molecules, including but not limited to molecule
-generation, are crucial for drug discovery and material design, and have
-consistently attracted significant attention. In recent years, diffusion models
-have emerged as an impressive class of deep generative models, sparking
-extensive research and leading to numerous studies on their application to
-molecular generative tasks. Despite the proliferation of related work, there
-remains a notable lack of up-to-date and systematic surveys in this area.
-Particularly, due to the diversity of diffusion model formulations, molecular
-data modalities, and generative task types, the research landscape is
-challenging to navigate, hindering understanding and limiting the area's
-growth. To address this, this paper conducts a comprehensive survey of
-diffusion model-based molecular generative methods. We systematically review
-the research from the perspectives of methodological formulations, data
-modalities, and task types, offering a novel taxonomy. This survey aims to
-facilitate understanding and further flourishing development in this area. The
-relevant papers are summarized at:
-https://github.com/AzureLeon1/awesome-molecular-diffusion-models.
+Designing efficient optimizers for large language models (LLMs) with
+low-memory requirements and fast convergence is an important and challenging
+problem. This paper makes a step towards the systematic design of such
+optimizers through the lens of structured Fisher information matrix (FIM)
+approximation. We show that many state-of-the-art efficient optimizers can be
+viewed as solutions to FIM approximation (under the Frobenius norm) with
+specific structural assumptions. Building on these insights, we propose two
+design recommendations of practical efficient optimizers for LLMs, involving
+the careful selection of structural assumptions to balance generality and
+efficiency, and enhancing memory efficiency of optimizers with general
+structures through a novel low-rank extension framework. We demonstrate how to
+use each design approach by deriving new memory-efficient optimizers: Row and
+Column Scaled SGD (RACS) and Adaptive low-dimensional subspace estimation
+(Alice). Experiments on LLaMA pre-training (up to 1B parameters) validate the
+effectiveness, showing faster and better convergence than existing
+memory-efficient baselines and Adam with little memory overhead. Notably, Alice
+achieves better than 2x faster convergence over Adam, while RACS delivers
+strong performance on the 1B model with SGD-like memory.
 
-摘要：<paragraph>包括但不限於分子生成在內的分子生成任務，對於藥物發現和材料設計至關重要，並持續吸引大量關注。近年來，擴散模型已成為深度生成模型中令人印象深刻的一類，激發了廣泛的研究，並導致對其應用於分子生成任務的眾多研究。儘管相關工作不斷增加，但這個領域仍然缺乏最新的系統性綜述。特別是，由於擴散模型公式、分子數據方式和生成任務類型的多樣性，研究領域難以瀏覽，阻礙了理解並限制了該領域的發展。為了解決這個問題，本文對基於擴散模型的分子生成方法進行了全面的調查。我們從方法論公式、數據方式和任務類型的角度系統性地回顧了研究，提供了一種新穎的分類法。本調查旨在促進理解並進一步促進該領域的蓬勃發展。相關論文總結如下：
-https://github.com/AzureLeon1/awesome-molecular-diffusion-models。</paragraph>
+摘要：設計具有低記憶體需求和快速收斂的大型語言模型 (LLM) 的高效最佳化器是一個重要且具有挑戰性的問題。本文透過結構化 Fisher 資訊矩陣 (FIM) 近似的角度，朝向此類最佳化器的系統化設計邁進一步。我們展示了許多最先進的高效最佳化器可以被視為 FIM 近似（在 Frobenius 範數下）的解，並具有特定的結構假設。基於這些見解，我們提出了 LLM 的兩個實用高效最佳化器設計建議，包括仔細選擇結構假設以平衡通用性和效率，並透過新穎的低秩延伸架構來增強具有通用結構的最佳化器的記憶體效率。我們展示了如何透過推導新的記憶體高效最佳化器來使用每種設計方法：列和欄縮放 SGD (RACS) 和自適應低維子空間估計 (Alice)。在 LLaMA 預訓練（高達 1B 參數）上的實驗驗證了其有效性，顯示比現有的記憶體高效基線和 Adam 更快且更好的收斂，且記憶體開銷很小。值得注意的是，Alice 比 Adam 快 2 倍以上，而 RACS 則在 1B 模型上提供類似 SGD 記憶體的強勁效能。
 
-##### **AttentionSmithy: A Modular Framework for Rapid Transformer Development and Customization**
-2502.09503v1 by Caleb Cranney, Jesse G. Meyer
+##### **The Devil is in the Prompts: De-Identification Traces Enhance Memorization Risks in Synthetic Chest X-Ray Generation**
+2502.07516v1 by Raman Dutt
 
-Transformer architectures have transformed AI applications but remain complex
-to customize for domain experts lacking low-level implementation expertise. We
-introduce AttentionSmithy, a modular software package that simplifies
-transformer innovation by breaking down key components into reusable building
-blocks: attention modules, feed-forward networks, normalization layers, and
-positional encodings. Users can rapidly prototype and evaluate transformer
-variants without extensive coding. Our framework supports four positional
-encoding strategies and integrates with neural architecture search for
-automated design. We validate AttentionSmithy by replicating the original
-transformer under resource constraints and optimizing translation performance
-by combining positional encodings. Additionally, we demonstrate its
-adaptability in gene-specific modeling, achieving over 95% accuracy in cell
-type classification. These case studies highlight AttentionSmithy's potential
-to accelerate research across diverse fields by removing framework
-implementation barriers.
+Generative models, particularly text-to-image (T2I) diffusion models, play a
+crucial role in medical image analysis. However, these models are prone to
+training data memorization, posing significant risks to patient privacy.
+Synthetic chest X-ray generation is one of the most common applications in
+medical image analysis with the MIMIC-CXR dataset serving as the primary data
+repository for this task. This study adopts a data-driven approach and presents
+the first systematic attempt to identify prompts and text tokens in MIMIC-CXR
+that contribute the most to training data memorization. Our analysis reveals an
+unexpected finding: prompts containing traces of de-identification procedures
+are among the most memorized, with de-identification markers contributing the
+most. Furthermore, we also find existing inference-time memorization mitigation
+strategies are ineffective and fail to sufficiently reduce the model's reliance
+on memorized text tokens highlighting a broader issue in T2I synthesis with
+MIMIC-CXR. On this front, we propose actionable strategies to enhance privacy
+and improve the reliability of generative models in medical imaging. Finally,
+our results provide a foundation for future work on developing and benchmarking
+memorization mitigation techniques for synthetic chest X-ray generation using
+the MIMIC-CXR dataset.
 
-摘要：Transformer 架構已轉變 AI 應用，但對於缺乏低階實作專業知識的領域專家而言，自訂仍很複雜。我們推出 AttentionSmithy，這是一個模組化軟體套件，透過將關鍵元件分解成可重複使用的建構區塊（注意力模組、前饋網路、正規化層和位置編碼）來簡化 Transformer 創新。使用者可以快速建置原型和評估 Transformer 變體，而無需大量編碼。我們的架構支援四種位置編碼策略，並整合神經架構搜尋以進行自動化設計。我們透過在資源限制下複製原始 Transformer 和結合位置編碼來最佳化翻譯效能，驗證 AttentionSmithy。此外，我們展示其在基因特定建模中的適應性，在細胞類型分類中達到超過 95% 的準確度。這些案例研究突顯 AttentionSmithy 在移除架構實作障礙後，加速各個領域研究的潛力。
+摘要：生成模型，尤其是文字轉圖像 (T2I) 擴散模型，在醫學影像分析中扮演著至關重要的角色。然而，這些模型容易訓練資料記憶，對病患隱私造成重大風險。合成胸部 X 光線生成是醫學影像分析中最常見的應用之一，其中 MIMIC-CXR 資料集作為此任務的主要資料儲存庫。本研究採用資料驅動的方法，並提出首次系統性嘗試，以識別 MIMIC-CXR 中最有助於訓練資料記憶的提示和文字代碼。我們的分析揭露了一個意外的發現：包含去識別程序痕跡的提示是最常被記憶的，其中去識別標記的貢獻最大。此外，我們也發現現有的推論時間記憶減緩策略無效，且無法充分降低模型對記憶文字代碼的依賴性，突顯了使用 MIMIC-CXR 進行 T2I 合成的更廣泛問題。針對此問題，我們提出可行的策略，以增強隱私並改善生成模型在醫學影像中的可靠性。最後，我們的結果為未來使用 MIMIC-CXR 資料集開發和評量合成胸部 X 光線生成的記憶減緩技術奠定了基礎。
 
-##### **Improve LLM-based Automatic Essay Scoring with Linguistic Features**
-2502.09497v1 by Zhaoyi Joey Hou, Alejandro Ciuba, Xiang Lorraine Li
+##### **KPIs 2024 Challenge: Advancing Glomerular Segmentation from Patch- to Slide-Level**
+2502.07288v1 by Ruining Deng, Tianyuan Yao, Yucheng Tang, Junlin Guo, Siqi Lu, Juming Xiong, Lining Yu, Quan Huu Cap, Pengzhou Cai, Libin Lan, Ze Zhao, Adrian Galdran, Amit Kumar, Gunjan Deotale, Dev Kumar Das, Inyoung Paik, Joonho Lee, Geongyu Lee, Yujia Chen, Wangkai Li, Zhaoyang Li, Xuege Hou, Zeyuan Wu, Shengjin Wang, Maximilian Fischer, Lars Kramer, Anghong Du, Le Zhang, Maria Sanchez Sanchez, Helena Sanchez Ulloa, David Ribalta Heredia, Carlos Perez de Arenaza Garcia, Shuoyu Xu, Bingdou He, Xinping Cheng, Tao Wang, Noemie Moreau, Katarzyna Bozek, Shubham Innani, Ujjwal Baid, Kaura Solomon Kefas, Bennett A. Landman, Yu Wang, Shilin Zhao, Mengmeng Yin, Haichun Yang, Yuankai Huo
 
-Automatic Essay Scoring (AES) assigns scores to student essays, reducing the
-grading workload for instructors. Developing a scoring system capable of
-handling essays across diverse prompts is challenging due to the flexibility
-and diverse nature of the writing task. Existing methods typically fall into
-two categories: supervised feature-based approaches and large language model
-(LLM)-based methods. Supervised feature-based approaches often achieve higher
-performance but require resource-intensive training. In contrast, LLM-based
-methods are computationally efficient during inference but tend to suffer from
-lower performance. This paper combines these approaches by incorporating
-linguistic features into LLM-based scoring. Experimental results show that this
-hybrid method outperforms baseline models for both in-domain and out-of-domain
-writing prompts.
+Chronic kidney disease (CKD) is a major global health issue, affecting over
+10% of the population and causing significant mortality. While kidney biopsy
+remains the gold standard for CKD diagnosis and treatment, the lack of
+comprehensive benchmarks for kidney pathology segmentation hinders progress in
+the field. To address this, we organized the Kidney Pathology Image
+Segmentation (KPIs) Challenge, introducing a dataset that incorporates
+preclinical rodent models of CKD with over 10,000 annotated glomeruli from 60+
+Periodic Acid Schiff (PAS)-stained whole slide images. The challenge includes
+two tasks, patch-level segmentation and whole slide image segmentation and
+detection, evaluated using the Dice Similarity Coefficient (DSC) and F1-score.
+By encouraging innovative segmentation methods that adapt to diverse CKD models
+and tissue conditions, the KPIs Challenge aims to advance kidney pathology
+analysis, establish new benchmarks, and enable precise, large-scale
+quantification for disease research and diagnosis.
 
-摘要：自動化論文評分 (AES) 會為學生的論文評分，以減輕教師的評分工作負擔。由於寫作任務的靈活性與多樣性，開發一種評分系統來處理各種提示的論文是一項挑戰。現有方法通常分為兩類：監督式特徵方法和大型語言模型 (LLM) 方法。監督式特徵方法通常能達到較高的效能，但需要大量資源進行訓練。相比之下，LLM 方法在推論期間的計算效率很高，但效能往往較低。本文結合了這些方法，將語言特徵納入 LLM 評分中。實驗結果顯示，這種混合方法在領域內和領域外寫作提示方面都優於基準模型。
+摘要：慢性腎臟病 (CKD) 是全球主要的健康問題，影響超過
+10% 的人口，並造成顯著的死亡率。雖然腎臟活檢
+仍然是 CKD 診斷和治療的黃金標準，但缺乏
+腎臟病理學分割的全面基準阻礙了該領域的進展。
+為了解決這個問題，我們組織了腎臟病理影像
+分割 (KPIs) 挑戰，引入了包含超過 10,000 個註解的
+CKD 臨床前嚙齒動物模型的資料集，這些註解來自 60 多個
+週期性酸性雪夫 (PAS) 染色的全幻燈片影像。挑戰包括
+兩個任務，修補層級分割和全幻燈片影像分割和
+偵測，使用 Dice 相似係數 (DSC) 和 F1 分數進行評估。
+通過鼓勵創新的分割方法來適應不同的 CKD 模型
+和組織條件，KPIs 挑戰旨在推進腎臟病理
+分析，建立新的基準，並實現精確、大規模的
+疾病研究和診斷量化。
 
-##### **Cracking the Code: Enhancing Development finance understanding with artificial intelligence**
-2502.09495v1 by Pierre Beaucoral
+##### **Early Risk Prediction of Pediatric Cardiac Arrest from Electronic Health Records via Multimodal Fused Transformer**
+2502.07158v1 by Jiaying Lu, Stephanie R. Brown, Songyuan Liu, Shifan Zhao, Kejun Dong, Del Bold, Michael Fundora, Alaa Aljiffry, Alex Fedorov, Jocelyn Grunwell, Xiao Hu
 
-Analyzing development projects is crucial for understanding donors aid
-strategies, recipients priorities, and to assess development finance capacity
-to adress development issues by on-the-ground actions. In this area, the
-Organisation for Economic Co-operation and Developments (OECD) Creditor
-Reporting System (CRS) dataset is a reference data source. This dataset
-provides a vast collection of project narratives from various sectors
-(approximately 5 million projects). While the OECD CRS provides a rich source
-of information on development strategies, it falls short in informing project
-purposes due to its reporting process based on donors self-declared main
-objectives and pre-defined industrial sectors. This research employs a novel
-approach that combines Machine Learning (ML) techniques, specifically Natural
-Language Processing (NLP), an innovative Python topic modeling technique called
-BERTopic, to categorise (cluster) and label development projects based on their
-narrative descriptions. By revealing existing yet hidden topics of development
-finance, this application of artificial intelligence enables a better
-understanding of donor priorities and overall development funding and provides
-methods to analyse public and private projects narratives.
+Early prediction of pediatric cardiac arrest (CA) is critical for timely
+intervention in high-risk intensive care settings. We introduce PedCA-FT, a
+novel transformer-based framework that fuses tabular view of EHR with the
+derived textual view of EHR to fully unleash the interactions of
+high-dimensional risk factors and their dynamics. By employing dedicated
+transformer modules for each modality view, PedCA-FT captures complex temporal
+and contextual patterns to produce robust CA risk estimates. Evaluated on a
+curated pediatric cohort from the CHOA-CICU database, our approach outperforms
+ten other artificial intelligence models across five key performance metrics
+and identifies clinically meaningful risk factors. These findings underscore
+the potential of multimodal fusion techniques to enhance early CA detection and
+improve patient care.
 
-摘要：分析發展專案對於了解捐助者援助策略、受贈者優先事項，以及評估發展資金能力以透過實際行動解決發展問題至關重要。在這個領域中，經濟合作暨發展組織 (OECD) 債權人報告系統 (CRS) 資料集是一個參考資料來源。此資料集提供來自各個部門的大量專案敘述（約 500 萬個專案）。雖然 OECD CRS 提供了豐富的發展策略資訊來源，但由於其報告程序基於捐助者自行申報的主要目標和預先定義的產業部門，因此在告知專案目的方面有所不足。本研究採用一種新穎的方法，結合機器學習 (ML) 技術，特別是自然語言處理 (NLP)，一種稱為 BERTopic 的創新 Python 主題建模技術，根據其敘述描述對發展專案進行分類（叢集）和標籤。透過揭露發展資金現有但隱藏的主題，這種人工智慧應用程式可以更好地了解捐助者的優先事項和整體發展資金，並提供分析公共和私人專案敘述的方法。
+摘要：早期預測兒童心臟驟停 (CA) 對高風險重症監護環境中的及時干預至關重要。我們引入了 PedCA-FT，這是一個新的基於Transformer的框架，它將 EHR 的表格視圖與 EHR 的派生文本視圖融合在一起，以充分釋放高維風險因素及其動態的交互作用。通過為每個模態視圖採用專用的Transformer模塊，PedCA-FT 捕獲復雜的時間和上下文模式以產生穩健的 CA 風險估計。在 CHOA-CICU 資料庫中經過策劃的兒科隊列上進行評估，我們的做法在五個關鍵性能指標上優於其他十個人工智慧模型，並識別出臨床上有意義的風險因素。這些發現強調了多模態融合技術在增強早期 CA 檢測和改善患者護理方面的潛力。
 
-##### **Objective quantification of mood states using large language models**
-2502.09487v1 by Jakub Onysk, Quentin Huys
+##### **Explaining 3D Computed Tomography Classifiers with Counterfactuals**
+2502.07156v1 by Joseph Paul Cohen, Louis Blankemeier, Akshay Chaudhari
 
-Emotional states influence human behaviour and cognition, leading to diverse
-thought trajectories. Similarly, Large Language Models (LLMs) showcase an
-excellent level of response consistency across wide-ranging contexts (prompts).
-We leverage these parallels to establish a framework for quantifying mental
-states. Our approach utilises self-report questionnaires that reliably assess
-these states due to their inherent sensitivity to patterns of co-occurring
-responses. Specifically, we recruited a large sample of participants (N=422) to
-investigate how well an LLM (Mistral-7B-OpenOrca) quantifies a heterogenous set
-of depressive mood states measured with participants' open-ended responses to a
-depression questionnaire. We show LLM responses to held-out multiple-choice
-questions, given participants' open-ended answers, correlate strongly (r:
-0.52-0.84) with true questionnaire scores, demonstrating LLM's generalisation
-from mood representations. We explore a link between these representations and
-factor analysis. Using ridge regression, we find depression-related subspaces
-within LLM hidden states. We show these subspaces to be predictive of
-participants' "Depression" and "Somatic & Emotional Distress" factor scores, as
-well as suicidality severity. Overall, LLMs can provide quantitative measures
-of mental states. The reliability of these hinges upon how informative the
-questions we ask participants are. Used correctly, this approach could
-supplement mental state assessment in a variety of settings.
+Counterfactual explanations in medical imaging are critical for understanding
+the predictions made by deep learning models. We extend the Latent Shift
+counterfactual generation method from 2D applications to 3D computed tomography
+(CT) scans. We address the challenges associated with 3D data, such as limited
+training samples and high memory demands, by implementing a slice-based
+approach. This method leverages a 2D encoder trained on CT slices, which are
+subsequently combined to maintain 3D context. We demonstrate this technique on
+two models for clinical phenotype prediction and lung segmentation. Our
+approach is both memory-efficient and effective for generating interpretable
+counterfactuals in high-resolution 3D medical imaging.
 
-摘要：情緒狀態會影響人類行為和認知，導致不同的思維軌跡。同樣地，大型語言模型 (LLM) 在廣泛的脈絡（提示）中展示出極佳的反應一致性。我們利用這些相似之處來建立一個量化心理狀態的框架。我們的做法利用自我報告問卷，由於這些問卷對共生反應模式具有內在敏感性，因此可以可靠地評估這些狀態。具體來說，我們招募了大量的參與者樣本 (N=422) 來調查 LLM (Mistral-7B-OpenOrca) 如何量化一組異質的抑鬱情緒狀態，這些狀態是根據參與者對抑鬱症問卷的開放式回答來衡量的。我們展示了 LLM 對保留的多選題的回答，給定參與者的開放式回答，與真正的問卷分數密切相關 (r：0.52-0.84)，這證明了 LLM 從情緒表徵中進行概括。我們探索這些表徵與因子分析之間的聯繫。使用嶺回歸，我們在 LLM 隱藏狀態內發現了與抑鬱相關的子空間。我們展示這些子空間可以預測參與者的「抑鬱」和「軀體和情緒困擾」因子分數，以及自殺嚴重性。總體而言，LLM 可以提供心理狀態的量化測量。這些測量的可靠性取決於我們詢問參與者的問題的資訊性。如果使用得當，這種方法可以補充各種環境中的心理狀態評估。
+摘要：反事實解釋在醫學影像中對於理解深度學習模型所做的預測至關重要。我們將 Latent Shift 反事實生成方法從 2D 應用程式延伸到 3D 電腦斷層掃描 (CT) 掃描。我們透過實作基於切片的做法，來解決與 3D 資料相關的挑戰，例如受限的訓練樣本和高記憶體需求。此方法利用經過 CT 切片訓練的 2D 編碼器，隨後將這些切片結合起來以維護 3D 背景。我們在兩個用於臨床表型預測和肺部分割的模型上展示此技術。我們的做法對於在高解析度 3D 醫學影像中產生可解釋的反事實，既節省記憶體又有效。
 
-##### **The Multilingual Mind : A Survey of Multilingual Reasoning in Language Models**
-2502.09457v1 by Akash Ghosh, Debayan Datta, Sriparna Saha, Chirag Agarwal
+##### **Interactive Data Harmonization with LLM Agents**
+2502.07132v1 by Aécio Santos, Eduardo H. M. Pena, Roque Lopez, Juliana Freire
 
-While reasoning and multilingual capabilities in Language Models (LMs) have
-achieved remarkable progress in recent years, their integration into a unified
-paradigm, multilingual reasoning, is at a nascent stage. Multilingual reasoning
-requires language models to handle logical reasoning across languages while
-addressing misalignment, biases, and challenges in low-resource settings. This
-survey provides the first in-depth review of multilingual reasoning in LMs. In
-this survey, we provide a systematic overview of existing methods that leverage
-LMs for multilingual reasoning, specifically outlining the challenges,
-motivations, and foundational aspects of applying language models to reason
-across diverse languages. We provide an overview of the standard data resources
-used for training multilingual reasoning in LMs and the evaluation benchmarks
-employed to assess their multilingual capabilities. Next, we analyze various
-state-of-the-art methods and their performance on these benchmarks. Finally, we
-explore future research opportunities to improve multilingual reasoning in LMs,
-focusing on enhancing their ability to handle diverse languages and complex
-reasoning tasks.
+Data harmonization is an essential task that entails integrating datasets
+from diverse sources. Despite years of research in this area, it remains a
+time-consuming and challenging task due to schema mismatches, varying
+terminologies, and differences in data collection methodologies. This paper
+presents the case for agentic data harmonization as a means to both empower
+experts to harmonize their data and to streamline the process. We introduce
+Harmonia, a system that combines LLM-based reasoning, an interactive user
+interface, and a library of data harmonization primitives to automate the
+synthesis of data harmonization pipelines. We demonstrate Harmonia in a
+clinical data harmonization scenario, where it helps to interactively create
+reusable pipelines that map datasets to a standard format. Finally, we discuss
+challenges and open problems, and suggest research directions for advancing our
+vision.
+
+摘要：資料調和是一項整合不同來源資料集的重要任務。儘管多年來針對此領域的研究不斷，但由於架構不匹配、術語不同，以及資料收集方法的差異，它仍然是一項耗時且具有挑戰性的任務。本文提出代理資料調和，作為賦能專家調和其資料並簡化流程的方法。我們介紹 Harmonia，一個結合了基於 LLM 的推理、互動式使用者介面和資料調和原語庫的系統，以自動化資料調和管線的合成。我們在臨床資料調和場景中展示了 Harmonia，它有助於互動式建立可重複使用的管線，將資料集對應至標準格式。最後，我們討論挑戰和開放性問題，並建議研究方向以推進我們的願景。
+
+##### **Machine Learning for Everyone: Simplifying Healthcare Analytics with BigQuery ML**
+2502.07026v1 by Mohammad Amir Salari, Bahareh Rahmani
+
+Machine learning (ML) is transforming healthcare by enabling predictive
+analytics, personalized treatments, and improved patient outcomes. However,
+traditional ML workflows require specialized skills, infrastructure, and
+resources, limiting accessibility for many healthcare professionals. This paper
+explores how Google Cloud's BigQuery ML simplifies the development and
+deployment of ML models using SQL, reducing technical barriers. Through a case
+study on diabetes prediction using the Diabetes Health Indicators Dataset, we
+evaluate three predictive models: Logistic Regression, Boosted Tree, and Deep
+Neural Network (DNN). Our results demonstrate that the Boosted Tree model
+achieves the highest performance, making it highly effective for diabetes
+prediction. This study highlights BigQuery ML's role in democratizing machine
+learning by providing a scalable, efficient, and accessible solution for
+healthcare analytics.
 
-摘要：儘管語言模型 (LM) 的推理和多語言能力在近年來取得顯著進展，但它們整合至統一典範（多語言推理）仍處於萌芽階段。多語言推理要求語言模型跨語言處理邏輯推理，同時解決低資源環境中的錯位、偏見和挑戰。本調查提供了 LM 中多語言推理的首次深入探討。在本調查中，我們系統性地概述了現有利用 LM 進行多語言推理的方法，特別概述了將語言模型應用於跨不同語言推理的挑戰、動機和基礎方面。我們概述了用於訓練 LM 中多語言推理的標準數據資源，以及用於評估其多語言能力的評估基準。接下來，我們分析了各種最先進的方法及其在這些基準上的表現。最後，我們探討了改進 LM 中多語言推理的未來研究機會，重點關注增強其處理不同語言和複雜推理任務的能力。
+摘要：機器學習 (ML) 透過啟用預測分析、個人化治療和改善病患結果，正在轉型醫療保健。然而，傳統的 ML 工作流程需要專業技能、基礎設施和資源，限制了許多醫療保健專業人員的可及性。本文探討 Google Cloud 的 BigQuery ML 如何使用 SQL 簡化 ML 模型的開發和部署，降低技術障礙。透過使用糖尿病健康指標資料集對糖尿病預測進行個案研究，我們評估了三個預測模型：邏輯迴歸、提升樹和深度神經網路 (DNN)。我們的結果證明，提升樹模型達到了最高的效能，使其對於糖尿病預測非常有效。這項研究強調了 BigQuery ML 在民主化機器學習中扮演的角色，提供可擴充、有效率且可存取的醫療保健分析解決方案。
 
-##### **Pixel-Level Reasoning Segmentation via Multi-turn Conversations**
-2502.09447v1 by Dexian Cai, Xiaocui Yang, Yongkang Liu, Daling Wang, Shi Feng, Yifei Zhang, Soujanya Poria
+##### **AIMS.au: A Dataset for the Analysis of Modern Slavery Countermeasures in Corporate Statements**
+2502.07022v1 by Adriana Eufrosiana Bora, Pierre-Luc St-Charles, Mirko Bronzi, Arsène Fansi Tchango, Bruno Rousseau, Kerrie Mengersen
 
-Existing visual perception systems focus on region-level segmentation in
-single-turn dialogues, relying on complex and explicit query instructions. Such
-systems cannot reason at the pixel level and comprehend dynamic user intent
-that changes over interaction. Our work tackles this issue by introducing a
-novel task, Pixel-level Reasoning Segmentation (Pixel-level RS) based on
-multi-turn conversations, tracking evolving user intent via multi-turn
-interactions for fine-grained segmentation. To establish a benchmark for this
-novel task, we build a Pixel-level ReasonIng Segmentation Dataset Based on
-Multi-Turn Conversations (PRIST), comprising 24k utterances from 8.3k
-multi-turn conversational scenarios with segmentation targets. Building on
-PRIST, we further propose MIRAS, a Multi-turn Interactive ReAsoning
-Segmentation framework, integrates pixel-level segmentation with robust
-multi-turn conversation understanding, generating pixel-grounded explanations
-aligned with user intent. The PRIST dataset and MIRSA framework fill the gap in
-pixel-level reasoning segmentation. Experimental results on the PRIST dataset
-demonstrate that our method outperforms current segmentation-specific baselines
-in terms of segmentation and LLM-based reasoning metrics. The code and data are
-available at: https://github.com/ccccai239/PixelRIST.
+Despite over a decade of legislative efforts to address modern slavery in the
+supply chains of large corporations, the effectiveness of government oversight
+remains hampered by the challenge of scrutinizing thousands of statements
+annually. While Large Language Models (LLMs) can be considered a well
+established solution for the automatic analysis and summarization of documents,
+recognizing concrete modern slavery countermeasures taken by companies and
+differentiating those from vague claims remains a challenging task. To help
+evaluate and fine-tune LLMs for the assessment of corporate statements, we
+introduce a dataset composed of 5,731 modern slavery statements taken from the
+Australian Modern Slavery Register and annotated at the sentence level. This
+paper details the construction steps for the dataset that include the careful
+design of annotation specifications, the selection and preprocessing of
+statements, and the creation of high-quality annotation subsets for effective
+model evaluations. To demonstrate our dataset's utility, we propose a machine
+learning methodology for the detection of sentences relevant to mandatory
+reporting requirements set by the Australian Modern Slavery Act. We then follow
+this methodology to benchmark modern language models under zero-shot and
+supervised learning settings.
 
-摘要：現有的視覺感知系統專注於單輪對話中的區域級分割，依賴於複雜且明確的查詢指令。此類系統無法在像素級別推理和理解在互動中不斷變化的動態使用者意圖。我們的研究通過引入一項基於多輪對話的像素級推理分割（像素級 RS）新任務來解決此問題，通過多輪互動追蹤不斷演變的使用者意圖，以進行精細分割。為了建立此新任務的基準，我們建立了一個基於多輪對話的像素級推理分割資料集（PRIST），其中包含來自 8.3k 多輪對話場景的 24k 個語句，以及分割目標。在 PRIST 的基礎上，我們進一步提出了 MIRAS，這是一個多輪互動推理分割框架，它將像素級分割與強大的多輪對話理解整合在一起，生成符合使用者意圖的像素級解釋。PRIST 資料集和 MIRSA 框架填補了像素級推理分割的空白。在 PRIST 資料集上的實驗結果表明，我們的模型在分割和基於 LLM 的推理指標方面優於目前的特定於分割的基準。程式碼和資料可在 https://github.com/ccccai239/PixelRIST 獲得。
+摘要：儘管立法努力超過十年，旨在解決大型企業供應鏈中的現代奴隸制，但政府監督的有效性仍然受到每年審查數千份聲明的挑戰所阻礙。雖然大型語言模型（LLM）可以被認為是文件自動分析和摘要的完善解決方案，但要辨識公司採取的具體現代奴隸制對策，並將其與含糊的聲明區分開來，仍然是一項具有挑戰性的任務。為了幫助評估和微調 LLM 以評估企業聲明，我們引入了一個由 5,731 份現代奴隸制聲明組成的資料集，這些聲明取自澳洲現代奴隸制註冊處，並在句子層級進行註解。本文詳細說明了資料集的建構步驟，其中包括註解規格的仔細設計、聲明的選擇和預處理，以及用於有效模型評估的高品質註解子集的建立。為了展示我們的資料集的效用，我們提出了一種機器學習方法，用於檢測與澳洲現代奴隸制法規定的強制性報告要求相關的句子。然後，我們遵循這種方法，在零次學習和監督學習設定下對現代語言模型進行基準測試。
 
-##### **Dual Formulation for Non-Rectangular Lp Robust Markov Decision Processes**
-2502.09432v1 by Navdeep Kumar, Adarsh Gupta, Maxence Mohamed Elfatihi, Giorgia Ramponi, Kfir Yehuda Levy, Shie Mannor
+##### **Recent Advances, Applications and Open Challenges in Machine Learning for Health: Reflections from Research Roundtables at ML4H 2024 Symposium**
+2502.06693v1 by Amin Adibi, Xu Cao, Zongliang Ji, Jivat Neet Kaur, Winston Chen, Elizabeth Healey, Brighton Nuwagira, Wenqian Ye, Geoffrey Woollard, Maxwell A Xu, Hejie Cui, Johnny Xi, Trenton Chang, Vasiliki Bikia, Nicole Zhang, Ayush Noori, Yuan Xia, Md. Belal Hossain, Hanna A. Frank, Alina Peluso, Yuan Pu, Shannon Zejiang Shen, John Wu, Adibvafa Fallahpour, Sazan Mahbub, Ross Duncan, Yuwei Zhang, Yurui Cao, Zuheng Xu, Michael Craig, Rahul G. Krishnan, Rahmatollah Beheshti, James M. Rehg, Mohammad Ehsanul Karim, Megan Coffee, Leo Anthony Celi, Jason Alan Fries, Mohsen Sadatsafavi, Dennis Shung, Shannon McWeeney, Jessica Dafflon, Sarah Jabbour
 
-We study robust Markov decision processes (RMDPs) with non-rectangular
-uncertainty sets, which capture interdependencies across states unlike
-traditional rectangular models. While non-rectangular robust policy evaluation
-is generally NP-hard, even in approximation, we identify a powerful class of
-$L_p$-bounded uncertainty sets that avoid these complexity barriers due to
-their structural simplicity. We further show that this class can be decomposed
-into infinitely many \texttt{sa}-rectangular $L_p$-bounded sets and leverage
-its structural properties to derive a novel dual formulation for $L_p$ RMDPs.
-This formulation provides key insights into the adversary's strategy and
-enables the development of the first robust policy evaluation algorithms for
-non-rectangular RMDPs. Empirical results demonstrate that our approach
-significantly outperforms brute-force methods, establishing a promising
-foundation for future investigation into non-rectangular robust MDPs.
+The fourth Machine Learning for Health (ML4H) symposium was held in person on
+December 15th and 16th, 2024, in the traditional, ancestral, and unceded
+territories of the Musqueam, Squamish, and Tsleil-Waututh Nations in Vancouver,
+British Columbia, Canada. The symposium included research roundtable sessions
+to foster discussions between participants and senior researchers on timely and
+relevant topics for the ML4H community. The organization of the research
+roundtables at the conference involved 13 senior and 27 junior chairs across 13
+tables. Each roundtable session included an invited senior chair (with
+substantial experience in the field), junior chairs (responsible for
+facilitating the discussion), and attendees from diverse backgrounds with an
+interest in the session's topic.
 
-摘要：我們研究具有非矩形不確定性集合的強健馬可夫決策過程 (RMDP)，它能捕捉到不同於傳統矩形模型的跨狀態相互依賴性。雖然非矩形強健策略評估通常是 NP-hard，即使在近似中也是如此，我們識別了一類強大的 $L_p$ 有界不確定性集合，由於其結構的簡潔性，可以避免這些複雜性障礙。我們進一步表明，此類可以分解為無限多的 \texttt{sa} 矩形 $L_p$ 有界集合，並利用其結構屬性為 $L_p$ RMDP 導出一個新的對偶公式。此公式提供了對抗者策略的重要見解，並能夠開發出第一個非矩形 RMDP 的強健策略評估演算法。實證結果表明，我們的做法顯著優於蠻力方法，為未來對非矩形強健 MDP 的研究奠定了有希望的基礎。
+摘要：第四屆醫療機器學習 (ML4H) 研討會於 2024 年 12 月 15 日和 16 日在加拿大不列顛哥倫比亞省溫哥華的 Musqueam、Squamish 和 Tsleil-Waututh 國家的傳統、祖先和未割讓領土上舉行。研討會包括研究圓桌會議，以促進參與者和高級研究人員之間關於 ML4H 社群的及時和相關主題的討論。在會議上組織研究圓桌會議涉及 13 張桌子上的 13 位高級主席和 27 位初級主席。每個圓桌會議都包括一位受邀的高級主席（在該領域擁有豐富的經驗）、初級主席（負責促進討論）以及對會議主題感興趣的來自不同背景的與會者。
 
-##### **Transformer-Enhanced Variational Autoencoder for Crystal Structure Prediction**
-2502.09423v1 by Ziyi Chen, Yang Yuan, Siming Zheng, Jialong Guo, Sihan Liang, Yangang Wang, Zongguo Wang
+##### **Automatic Evaluation of Healthcare LLMs Beyond Question-Answering**
+2502.06666v1 by Anna Arias-Duart, Pablo Agustin Martin-Torres, Daniel Hinjos, Pablo Bernabeu-Perez, Lucia Urcelay Ganzabal, Marta Gonzalez Mallo, Ashwin Kumar Gururajan, Enrique Lopez-Cuena, Sergio Alvarez-Napagao, Dario Garcia-Gasulla
 
-Crystal structure forms the foundation for understanding the physical and
-chemical properties of materials. Generative models have emerged as a new
-paradigm in crystal structure prediction(CSP), however, accurately capturing
-key characteristics of crystal structures, such as periodicity and symmetry,
-remains a significant challenge. In this paper, we propose a
-Transformer-Enhanced Variational Autoencoder for Crystal Structure Prediction
-(TransVAE-CSP), who learns the characteristic distribution space of stable
-materials, enabling both the reconstruction and generation of crystal
-structures. TransVAE-CSP integrates adaptive distance expansion with
-irreducible representation to effectively capture the periodicity and symmetry
-of crystal structures, and the encoder is a transformer network based on an
-equivariant dot product attention mechanism. Experimental results on the
-carbon_24, perov_5, and mp_20 datasets demonstrate that TransVAE-CSP
-outperforms existing methods in structure reconstruction and generation tasks
-under various modeling metrics, offering a powerful tool for crystal structure
-design and optimization.
+Current Large Language Models (LLMs) benchmarks are often based on open-ended
+or close-ended QA evaluations, avoiding the requirement of human labor.
+Close-ended measurements evaluate the factuality of responses but lack
+expressiveness. Open-ended capture the model's capacity to produce discourse
+responses but are harder to assess for correctness. These two approaches are
+commonly used, either independently or together, though their relationship
+remains poorly understood. This work is focused on the healthcare domain, where
+both factuality and discourse matter greatly. It introduces a comprehensive,
+multi-axis suite for healthcare LLM evaluation, exploring correlations between
+open and close benchmarks and metrics. Findings include blind spots and
+overlaps in current methodologies. As an updated sanity check, we release a new
+medical benchmark--CareQA--, with both open and closed variants. Finally, we
+propose a novel metric for open-ended evaluations --Relaxed Perplexity-- to
+mitigate the identified limitations.
 
-摘要：晶體結構形成了解材料物理和化學性質的基礎。生成模型已成為晶體結構預測 (CSP) 的新典範，然而，準確捕捉晶體結構的關鍵特徵（例如週期性和對稱性）仍然是一項重大挑戰。在本文中，我們提出了一種用於晶體結構預測的 Transformer 增強變異自動編碼器 (TransVAE-CSP)，它學習穩定材料的特徵分佈空間，使晶體結構的重建和生成成為可能。TransVAE-CSP 將自適應距離擴展與不可約表示相結合，以有效地捕捉晶體結構的週期性和對稱性，並且編碼器是一個基於等變點積注意力機制的 Transformer 網路。在 carbon_24、perov_5 和 mp_20 資料集上的實驗結果表明，TransVAE-CSP 在各種建模指標下，在結構重建和生成任務中優於現有方法，為晶體結構設計和最佳化提供了一個強大的工具。
+摘要：當前大型語言模型 (LLM) 基準通常基於開放式或封閉式問答評量，避免了人力需求。封閉式測量評估回應的事實性，但缺乏表達力。開放式測量捕捉模型產生論述回應的能力，但較難評估正確性。這兩種方法通常獨立或合併使用，儘管它們之間的關係仍然知之甚少。這項工作專注於醫療保健領域，在該領域中，事實性和論述都非常重要。它引入了一個全面的多軸套件，用於醫療保健 LLM 評量，探索開放式和封閉式基準和指標之間的關聯性。研究結果包括當前方法中的盲點和重疊。作為更新的健全性檢查，我們發布了一個新的醫療基準--CareQA--，包含開放式和封閉式變體。最後，我們提出了一個用於開放式評量的全新指標--放鬆困惑度--以減輕已識別的限制。
 
-##### **On multi-token prediction for efficient LLM inference**
-2502.09419v1 by Somesh Mehra, Javier Alonso Garcia, Lukas Mauch
+##### **Few-Shot Classification and Anatomical Localization of Tissues in SPECT Imaging**
+2502.06632v1 by Mohammed Abdul Hafeez Khan, Samuel Morries Boddepalli, Siddhartha Bhattacharyya, Debasis Mitra
 
-We systematically investigate multi-token prediction (MTP) capabilities
-within LLMs pre-trained for next-token prediction (NTP). We first show that
-such models inherently possess MTP capabilities via numerical marginalization
-over intermediate token probabilities, though performance is data-dependent and
-improves with model scale. Furthermore, we explore the challenges of
-integrating MTP heads into frozen LLMs and find that their hidden layers are
-strongly specialized for NTP, making adaptation non-trivial. Finally, we show
-that while joint training of MTP heads with the backbone improves performance,
-it cannot fully overcome this barrier, prompting further research in this
-direction. Our findings provide a deeper understanding of MTP applied to
-pretrained LLMs, informing strategies for accelerating inference through
-parallel token prediction.
+Accurate classification and anatomical localization are essential for
+effective medical diagnostics and research, which may be efficiently performed
+using deep learning techniques. However, availability of limited labeled data
+poses a significant challenge. To address this, we adapted Prototypical
+Networks and the Propagation-Reconstruction Network (PRNet) for few-shot
+classification and localization, respectively, in Single Photon Emission
+Computed Tomography (SPECT) images. For the proof of concept we used a
+2D-sliced image cropped around heart. The Prototypical Network, with a
+pre-trained ResNet-18 backbone, classified ventricles, myocardium, and liver
+tissues with 96.67% training and 93.33% validation accuracy. PRNet, adapted for
+2D imaging with an encoder-decoder architecture and skip connections, achieved
+a training loss of 1.395, accurately reconstructing patches and capturing
+spatial relationships. These results highlight the potential of Prototypical
+Networks for tissue classification with limited labeled data and PRNet for
+anatomical landmark localization, paving the way for improved performance in
+deep learning frameworks.
 
-摘要：我們系統性地研究了在預先訓練下用於下一個代幣預測 (NTP) 的 LLM 中的多代幣預測 (MTP) 功能。我們首先表明，此類模型透過中間代幣機率的數值邊際化本質上具備 MTP 功能，儘管效能依賴於資料，且會隨著模型規模而提升。此外，我們探討了將 MTP 頭整合到凍結 LLM 中的挑戰，發現其隱藏層高度專門用於 NTP，使得適應變得不簡單。最後，我們顯示，儘管 MTP 頭與主幹的聯合訓練會提升效能，但無法完全克服此障礙，促使我們進一步研究這個方向。我們的發現提供了對應用於預先訓練 LLM 的 MTP 更深入的理解，並為透過平行代幣預測加速推論提供策略。
+摘要：精確的分類和解剖定位對於有效的醫療診斷和研究至關重要，而這可以使用深度學習技術有效執行。然而，標記資料有限的取得會造成重大的挑戰。為了解決這個問題，我們分別調整了原型網路和傳播重建網路 (PRNet)，用於單光子發射電腦斷層掃描 (SPECT) 影像中的少量分類和定位。為了證明這個概念，我們使用圍繞心臟裁切的 2D 切片影像。原型網路，使用預先訓練的 ResNet-18 主幹，對心室、心肌和肝臟組織進行分類，訓練準確度為 96.67%，驗證準確度為 93.33%。PRNet，調整為使用編碼器解碼器架構和跳躍連接的 2D 影像，達到了 1.395 的訓練損失，精確地重建了區塊並擷取了空間關係。這些結果突出了原型網路在標記資料有限的情況下進行組織分類的潛力，以及 PRNet 在解剖標誌定位方面的潛力，為深度學習架構中效能的提升鋪平了道路。
 
-##### **SQuARE: Sequential Question Answering Reasoning Engine for Enhanced Chain-of-Thought in Large Language Models**
-2502.09390v1 by Daniel Fleischer, Moshe Berchansky, Gad Markovits, Moshe Wasserblat
+##### **Illegal Waste Detection in Remote Sensing Images: A Case Study**
+2502.06607v2 by Federico Gibellini, Piero Fraternali, Giacomo Boracchi, Luca Morandini, Andrea Diecidue, Simona Malegori
 
-In the rapidly evolving field of Natural Language Processing, Large Language
-Models (LLMs) are tasked with increasingly complex reasoning challenges.
-Traditional methods like chain-of-thought prompting have shown promise but
-often fall short in fully leveraging a model's reasoning capabilities. This
-paper introduces SQuARE (Sequential Question Answering Reasoning Engine), a
-novel prompting technique designed to improve reasoning through a
-self-interrogation paradigm. Building upon CoT frameworks, SQuARE prompts
-models to generate and resolve multiple auxiliary questions before tackling the
-main query, promoting a more thorough exploration of various aspects of a
-topic. Our expansive evaluations, conducted with Llama 3 and GPT-4o models
-across multiple question-answering datasets, demonstrate that SQuARE
-significantly surpasses traditional CoT prompts and existing
-rephrase-and-respond methods. By systematically decomposing queries, SQuARE
-advances LLM capabilities in reasoning tasks. The code is publicly available at
-https://github.com/IntelLabs/RAG-FiT/tree/square.
+Environmental crime currently represents the third largest criminal activity
+worldwide while threatening ecosystems as well as human health. Among the
+crimes related to this activity, improper waste management can nowadays be
+countered more easily thanks to the increasing availability and decreasing cost
+of Very-High-Resolution Remote Sensing images, which enable semi-automatic
+territory scanning in search of illegal landfills. This paper proposes a
+pipeline, developed in collaboration with professionals from a local
+environmental agency, for detecting candidate illegal dumping sites leveraging
+a classifier of Remote Sensing images. To identify the best configuration for
+such classifier, an extensive set of experiments was conducted and the impact
+of diverse image characteristics and training settings was thoroughly analyzed.
+The local environmental agency was then involved in an experimental exercise
+where outputs from the developed classifier were integrated in the experts'
+everyday work, resulting in time savings with respect to manual
+photo-interpretation. The classifier was eventually run with valuable results
+on a location outside of the training area, highlighting potential for
+cross-border applicability of the proposed pipeline.
 
-摘要：在快速發展的自然語言處理領域中，大型語言模型 (LLM) 負責越來越複雜的推理挑戰。
-傳統方法（如思考鏈提示）已展現潛力，但通常無法充分利用模型的推理能力。本文介紹 SQuARE（順序式問答推理引擎），這是一種新穎的提示技術，旨在透過自我提問模式來改善推理。建立在 CoT 架構之上，SQuARE 提示模型在處理主要查詢之前產生並解決多個輔助問題，促進對某個主題的各個面向進行更徹底的探討。我們使用 Llama 3 和 GPT-4o 模型對多個問答資料集進行廣泛評估，結果顯示 SQuARE 明顯優於傳統 CoT 提示和現有的改寫並回應方法。透過系統性地分解查詢，SQuARE 提升了 LLM 在推理任務中的能力。程式碼已公開於 https://github.com/IntelLabs/RAG-FiT/tree/square。
+摘要：環境犯罪目前是全球第三大犯罪活動，威脅生態系統和人類健康。在與此活動相關的犯罪中，不當廢物管理現在可以更容易地得到解決，這要歸功於超高解析度遙測影像越來越普及且成本下降，這使得半自動領土掃描能夠搜尋非法垃圾掩埋場。本文提出了一條管道，與當地環境機構的專業人士合作開發，用於檢測候選非法傾倒地點，利用遙測影像分類器。為了找出這種分類器的最佳配置，進行了一系列廣泛的實驗，並徹底分析了不同影像特徵和訓練設定的影響。然後，當地環境機構參與了一項實驗練習，其中將已開發分類器的輸出整合到專家的日常工作中，從而節省了人工照片解譯的時間。最後在訓練區域外的某個位置執行分類器，獲得了有價值的結果，突出了所提出管道的跨境適用性潛力。
 
-##### **Truth Knows No Language: Evaluating Truthfulness Beyond English**
-2502.09387v1 by Blanca Calvo Figueras, Eneko Sagarzazu, Julen Etxaniz, Jeremy Barnes, Pablo Gamallo, Iria De Dios Flores, Rodrigo Agerri
+##### **FEMBA: Efficient and Scalable EEG Analysis with a Bidirectional Mamba Foundation Model**
+2502.06438v1 by Anna Tegon, Thorir Mar Ingolfsson, Xiaying Wang, Luca Benini, Yawei Li
 
-We introduce a professionally translated extension of the TruthfulQA
-benchmark designed to evaluate truthfulness in Basque, Catalan, Galician, and
-Spanish. Truthfulness evaluations of large language models (LLMs) have
-primarily been conducted in English. However, the ability of LLMs to maintain
-truthfulness across languages remains under-explored. Our study evaluates 12
-state-of-the-art open LLMs, comparing base and instruction-tuned models using
-human evaluation, multiple-choice metrics, and LLM-as-a-Judge scoring. Our
-findings reveal that, while LLMs perform best in English and worst in Basque
-(the lowest-resourced language), overall truthfulness discrepancies across
-languages are smaller than anticipated. Furthermore, we show that
-LLM-as-a-Judge correlates more closely with human judgments than
-multiple-choice metrics, and that informativeness plays a critical role in
-truthfulness assessment. Our results also indicate that machine translation
-provides a viable approach for extending truthfulness benchmarks to additional
-languages, offering a scalable alternative to professional translation.
-Finally, we observe that universal knowledge questions are better handled
-across languages than context- and time-dependent ones, highlighting the need
-for truthfulness evaluations that account for cultural and temporal
-variability. Dataset and code are publicly available under open licenses.
+Accurate and efficient electroencephalography (EEG) analysis is essential for
+detecting seizures and artifacts in long-term monitoring, with applications
+spanning hospital diagnostics to wearable health devices. Robust EEG analytics
+have the potential to greatly improve patient care. However, traditional deep
+learning models, especially Transformer-based architectures, are hindered by
+their quadratic time and memory complexity, making them less suitable for
+resource-constrained environments. To address these challenges, we present
+FEMBA (Foundational EEG Mamba + Bidirectional Architecture), a novel
+self-supervised framework that establishes new efficiency benchmarks for EEG
+analysis through bidirectional state-space modeling. Unlike Transformer-based
+models, which incur quadratic time and memory complexity, FEMBA scales linearly
+with sequence length, enabling more scalable and efficient processing of
+extended EEG recordings. Trained on over 21,000 hours of unlabeled EEG and
+fine-tuned on three downstream tasks, FEMBA achieves competitive performance in
+comparison with transformer models, with significantly lower computational
+cost. Specifically, it reaches 81.82% balanced accuracy (0.8921 AUROC) on TUAB
+and 0.949 AUROC on TUAR, while a tiny 7.8M-parameter variant demonstrates
+viability for resource-constrained devices. These results pave the way for
+scalable, general-purpose EEG analytics in both clinical and highlight FEMBA as
+a promising candidate for wearable applications.
 
-摘要：我們針對 TruthfulQA 推出專業翻譯的延伸版本，旨在評估巴斯克語、加泰隆尼亞語、加利西亞語和西班牙語中的真實性。大型語言模型 (LLM) 的真實性評估主要以英語進行。然而，LLM 在不同語言中維持真實性的能力仍未得到充分探索。我們的研究評估了 12 個最先進的開放 LLM，使用人類評估、多選項指標和 LLM 作為評分標準比較基礎和指令調整模型。我們的研究結果表明，雖然 LLM 在英語中的表現最好，而在巴斯克語（資源最少的語言）中的表現最差，但整體上不同語言之間的真實性差異小於預期。此外，我們表明，與多選項指標相比，LLM 作為評分標準與人類判斷更密切相關，而且信息豐富性在真實性評估中發揮著至關重要的作用。我們的結果還表明，機器翻譯提供了一種可行的途徑，可以將真實性基準擴展到其他語言，從而提供了一種可擴展的專業翻譯替代方案。最後，我們觀察到，與上下文和時間依賴的問題相比，通用知識問題在不同語言之間的處理效果更好，這突顯了考慮文化和時間可變性的真實性評估的必要性。數據集和代碼在開放許可下公開可用。
+摘要：準確且有效的腦電圖 (EEG) 分析對於偵測長時間監控中的癲癇發作和偽像至關重要，其應用範圍涵蓋醫院診斷到可穿戴式健康裝置。穩健的 EEG 分析具有大幅改善病患照護的潛力。然而，傳統深度學習模型，特別是基於 Transformer 的架構，受到其二次時間和記憶體複雜度的阻礙，使其不太適合資源受限的環境。為了應對這些挑戰，我們提出 FEMBA (基礎 EEG Mamba + 雙向架構)，一種創新的自我監督架構，透過雙向狀態空間建模為 EEG 分析建立新的效率基準。與會產生二次時間和記憶體複雜度的基於 Transformer 的模型不同，FEMBA 隨著序列長度線性縮放，支援更具可擴充性和效率的延伸 EEG 記錄處理。FEMBA 在超過 21,000 小時的未標記 EEG 上訓練並在三個下游任務上進行微調，與Transformer模型相比，在計算成本顯著降低的情況下，實現了具有競爭力的效能。具體來說，它在 TUAB 上達到 81.82% 的平衡準確度 (0.8921 AUROC) 和在 TUAR 上達到 0.949 AUROC，而一個微小的 7.8M 參數變體證明了其在資源受限裝置上的可行性。這些結果為臨床和可穿戴應用中可擴充的通用 EEG 分析鋪平了道路，並突顯 FEMBA 是可穿戴應用中一個有前景的候選者。
+
+##### **Is an Ultra Large Natural Image-Based Foundation Model Superior to a Retina-Specific Model for Detecting Ocular and Systemic Diseases?**
+2502.06289v1 by Qingshan Hou, Yukun Zhou, Jocelyn Hui Lin Goh, Ke Zou, Samantha Min Er Yew, Sahana Srinivasan, Meng Wang, Thaddaeus Lo, Xiaofeng Lei, Siegfried K. Wagner, Mark A. Chia, Dawei Yang, Hongyang Jiang, AnRan Ran, Rui Santos, Gabor Mark Somfai, Juan Helen Zhou, Haoyu Chen, Qingyu Chen, Carol Yim-Lui Cheung, Pearse A. Keane, Yih Chung Tham
+
+The advent of foundation models (FMs) is transforming medical domain. In
+ophthalmology, RETFound, a retina-specific FM pre-trained sequentially on 1.4
+million natural images and 1.6 million retinal images, has demonstrated high
+adaptability across clinical applications. Conversely, DINOv2, a
+general-purpose vision FM pre-trained on 142 million natural images, has shown
+promise in non-medical domains. However, its applicability to clinical tasks
+remains underexplored. To address this, we conducted head-to-head evaluations
+by fine-tuning RETFound and three DINOv2 models (large, base, small) for ocular
+disease detection and systemic disease prediction tasks, across eight
+standardized open-source ocular datasets, as well as the Moorfields AlzEye and
+the UK Biobank datasets. DINOv2-large model outperformed RETFound in detecting
+diabetic retinopathy (AUROC=0.850-0.952 vs 0.823-0.944, across three datasets,
+all P<=0.007) and multi-class eye diseases (AUROC=0.892 vs. 0.846, P<0.001). In
+glaucoma, DINOv2-base model outperformed RETFound (AUROC=0.958 vs 0.940,
+P<0.001). Conversely, RETFound achieved superior performance over all DINOv2
+models in predicting heart failure, myocardial infarction, and ischaemic stroke
+(AUROC=0.732-0.796 vs 0.663-0.771, all P<0.001). These trends persisted even
+with 10% of the fine-tuning data. These findings showcase the distinct
+scenarios where general-purpose and domain-specific FMs excel, highlighting the
+importance of aligning FM selection with task-specific requirements to optimise
+clinical performance.
 
-##### **A Deep Inverse-Mapping Model for a Flapping Robotic Wing**
-2502.09378v1 by Hadar Sharvit, Raz Karl, Tsevi Beatus
+摘要：基礎模型 (FM) 的出現正在轉變醫療領域。在眼科，RETFound 是一個視網膜專用 FM，依序使用 140 萬張自然影像和 160 萬張視網膜影像進行預訓練，已展現出高度適應性，可應用於各種臨床應用。相反地，DINOv2 是一個通用視覺 FM，使用 1.42 億張自然影像進行預訓練，已展現出在非醫療領域的潛力。然而，其在臨床任務中的適用性仍未被充分探索。為了解決這個問題，我們針對眼部疾病偵測和全身性疾病預測任務，對 RETFound 和三個 DINOv2 模型（大型、基礎、小型）進行微調，並進行一對一的評估，使用八個標準化的開源眼科資料集，以及 Moorfields AlzEye 和 UK Biobank 資料集。DINOv2 大型模型在糖尿病視網膜病變偵測方面優於 RETFound（三個資料集的 AUROC=0.850-0.952，相較於 0.823-0.944，所有 P<=0.007）和多類眼部疾病（AUROC=0.892，相較於 0.846，P<0.001）。在青光眼方面，DINOv2 基礎模型優於 RETFound（AUROC=0.958，相較於 0.940，P<0.001）。相反地，RETFound 在預測心臟衰竭、心肌梗塞和缺血性中風方面優於所有 DINOv2 模型（AUROC=0.732-0.796，相較於 0.663-0.771，所有 P<0.001）。即使使用 10% 的微調資料，這些趨勢仍然持續。這些發現展示了通用和領域專用 FM 各自擅長的場景，突顯了根據任務特定需求調整 FM 選擇，以最佳化臨床表現的重要性。
 
-In systems control, the dynamics of a system are governed by modulating its
-inputs to achieve a desired outcome. For example, to control the thrust of a
-quad-copter propeller the controller modulates its rotation rate, relying on a
-straightforward mapping between the input rotation rate and the resulting
-thrust. This mapping can be inverted to determine the rotation rate needed to
-generate a desired thrust. However, in complex systems, such as flapping-wing
-robots where intricate fluid motions are involved, mapping inputs (wing
-kinematics) to outcomes (aerodynamic forces) is nontrivial and inverting this
-mapping for real-time control is computationally impractical. Here, we report a
-machine-learning solution for the inverse mapping of a flapping-wing system
-based on data from an experimental system we have developed. Our model learns
-the input wing motion required to generate a desired aerodynamic force outcome.
-We used a sequence-to-sequence model tailored for time-series data and
-augmented it with a novel adaptive-spectrum layer that implements
-representation learning in the frequency domain. To train our model, we
-developed a flapping wing system that simultaneously measures the wing's
-aerodynamic force and its 3D motion using high-speed cameras. We demonstrate
-the performance of our system on an additional open-source dataset of a
-flapping wing in a different flow regime. Results show superior performance
-compared with more complex state-of-the-art transformer-based models, with 11%
-improvement on the test datasets median loss. Moreover, our model shows
-superior inference time, making it practical for onboard robotic control. Our
-open-source data and framework may improve modeling and real-time control of
-systems governed by complex dynamics, from biomimetic robots to biomedical
-devices.
+##### **Integrating Sequence and Image Modeling in Irregular Medical Time Series Through Self-Supervised Learning**
+2502.06134v1 by Liuqing Chen, Shuhong Xiao, Shixian Ding, Shanhai Hu, Lingyun Sun
 
-摘要：<paragraph>在系統控制中，系統的動態受調節其輸入以實現所需結果的影響。例如，為了控制四軸旋翼推進器的推力，控制器會調節其旋轉速率，依賴於輸入旋轉速率和所產生的推力之間的直接映射。此映射可以反轉以確定產生所需推力所需的旋轉速率。然而，在複雜的系統中，例如涉及複雜流體運動的拍打式機翼機器人，將輸入（機翼運動學）映射到輸出（空氣動力）並非易事，並且反轉此映射以進行實時控制在計算上不切實際。在此，我們報告了一個基於我們開發的實驗系統數據的拍打式機翼系統反向映射的機器學習解決方案。我們的模型學習產生所需空氣動力結果所需的輸入機翼運動。我們使用了一個專門針對時間序列數據的序列到序列模型，並用一個在頻域中實現表示學習的新型自適應譜層對其進行了擴充。為了訓練我們的模型，我們開發了一個拍打式機翼系統，該系統同時使用高速相機測量機翼的空氣動力和其 3D 運動。我們在一個不同的流動狀態下拍打機翼的另一個開源數據集上展示了我們系統的性能。結果表明，與更複雜的基於Transformer的最先進模型相比，性能優異，在測試數據集中損失中值改進了 11%。此外，我們的模型顯示出優異的推理時間，使其適用於機載機器人控制。我們的開源數據和框架可以改進受複雜動態支配的系統的建模和實時控制，從仿生機器人到生物醫學設備。</paragraph>
+Medical time series are often irregular and face significant missingness,
+posing challenges for data analysis and clinical decision-making. Existing
+methods typically adopt a single modeling perspective, either treating series
+data as sequences or transforming them into image representations for further
+classification. In this paper, we propose a joint learning framework that
+incorporates both sequence and image representations. We also design three
+self-supervised learning strategies to facilitate the fusion of sequence and
+image representations, capturing a more generalizable joint representation. The
+results indicate that our approach outperforms seven other state-of-the-art
+models in three representative real-world clinical datasets. We further
+validate our approach by simulating two major types of real-world missingness
+through leave-sensors-out and leave-samples-out techniques. The results
+demonstrate that our approach is more robust and significantly surpasses other
+baselines in terms of classification performance.
 
-##### **Language Agents as Digital Representatives in Collective Decision-Making**
-2502.09369v1 by Daniel Jarrett, Miruna Pîslar, Michiel A. Bakker, Michael Henry Tessler, Raphael Köster, Jan Balaguer, Romuald Elie, Christopher Summerfield, Andrea Tacchetti
+摘要：醫療時間序列通常不規則且會面臨顯著的缺失，對資料分析和臨床決策制定構成挑戰。現有方法通常採用單一建模觀點，將序列資料視為序列或將其轉換為影像表示以進行進一步分類。在本文中，我們提出了一個聯合學習架構，結合序列和影像表示。我們還設計了三種自我監督學習策略，以促進序列和影像表示的融合，捕捉更具概括性的聯合表示。結果表明，我們的做法在三個具有代表性的真實世界臨床資料集中優於其他七個最先進的模型。我們進一步通過留出感測器和留出樣本的技術模擬兩種主要的真實世界缺失類型來驗證我們的做法。結果表明，我們的做法更強大，並且在分類效能方面顯著優於其他基準。
 
-Consider the process of collective decision-making, in which a group of
-individuals interactively select a preferred outcome from among a universe of
-alternatives. In this context, "representation" is the activity of making an
-individual's preferences present in the process via participation by a proxy
-agent -- i.e. their "representative". To this end, learned models of human
-behavior have the potential to fill this role, with practical implications for
-multi-agent scenario studies and mechanism design. In this work, we investigate
-the possibility of training \textit{language agents} to behave in the capacity
-of representatives of human agents, appropriately expressing the preferences of
-those individuals whom they stand for. First, we formalize the setting of
-\textit{collective decision-making} -- as the episodic process of interaction
-between a group of agents and a decision mechanism. On this basis, we then
-formalize the problem of \textit{digital representation} -- as the simulation
-of an agent's behavior to yield equivalent outcomes from the mechanism.
-Finally, we conduct an empirical case study in the setting of
-\textit{consensus-finding} among diverse humans, and demonstrate the
-feasibility of fine-tuning large language models to act as digital
-representatives.
+##### **Foundation Model of Electronic Medical Records for Adaptive Risk Estimation**
+2502.06124v1 by Pawel Renc, Michal K. Grzeszczyk, Nassim Oufattole, Deirdre Goode, Yugang Jia, Szymon Bieganski, Matthew B. A. McDermott, Jaroslaw Was, Anthony E. Samir, Jonathan W. Cunningham, David W. Bates, Arkadiusz Sitek
 
-摘要：考慮集體決策的過程，其中一群個人互動式地從一系列備選方案中選擇一個偏好的結果。在此脈絡中，「代表」是透過代理人（即他們的「代表」）參與，讓個人的偏好出現在這個過程中的活動。為此，人類行為的學習模型有可能填補這個角色，對多重代理人情境研究和機制設計具有實際意義。在這項工作中，我們探討訓練「語言代理人」的可能性，以代表人類代理人的身分行事，適當地表達他們所代表的那些個人的偏好。首先，我們將「集體決策」的設定形式化，作為一群代理人與決策機制之間互動的間歇性過程。在此基礎上，我們接著將「數位代表」的問題形式化，作為模擬代理人的行為，從機制中產生等效結果。最後，我們在多元人類的「共識尋求」設定中進行一個實證個案研究，並展示微調大型語言模型以作為數位代表的可行性。
+We developed the Enhanced Transformer for Health Outcome Simulation (ETHOS),
+an AI model that tokenizes patient health timelines (PHTs) from EHRs. ETHOS
+predicts future PHTs using transformer-based architectures. The Adaptive Risk
+Estimation System (ARES) employs ETHOS to compute dynamic and personalized risk
+probabilities for clinician-defined critical events. ARES incorporates a
+personalized explainability module that identifies key clinical factors
+influencing risk estimates for individual patients. ARES was evaluated on the
+MIMIC-IV v2.2 dataset in emergency department (ED) settings, benchmarking its
+performance against traditional early warning systems and machine learning
+models. We processed 299,721 unique patients from MIMIC-IV into 285,622 PHTs,
+with 60% including hospital admissions. The dataset contained over 357 million
+tokens. ETHOS outperformed benchmark models in predicting hospital admissions,
+ICU admissions, and prolonged hospital stays, achieving superior AUC scores.
+ETHOS-based risk estimates demonstrated robustness across demographic subgroups
+with strong model reliability, confirmed via calibration curves. The
+personalized explainability module provides insights into patient-specific
+factors contributing to risk. ARES, powered by ETHOS, advances predictive
+healthcare AI by providing dynamic, real-time, and personalized risk estimation
+with patient-specific explainability to enhance clinician trust. Its
+adaptability and superior accuracy position it as a transformative tool for
+clinical decision-making, potentially improving patient outcomes and resource
+allocation in emergency and inpatient settings. We release the full code at
+github.com/ipolharvard/ethos-ares to facilitate future research.
 
-##### **Neural Spatiotemporal Point Processes: Trends and Challenges**
-2502.09341v1 by Sumantrak Mukherjee, Mouad Elhamdi, George Mohler, David A. Selby, Yao Xie, Sebastian Vollmer, Gerrit Grossmann
+摘要：我們開發了增強型健康結果模擬轉換器 (ETHOS)，
+一種從電子健康紀錄 (EHR) 中將患者健康時間軸 (PHT) 標記化的 AI 模型。ETHOS
+使用基於轉換器的架構預測未來的 PHT。自適應風險評估系統 (ARES) 使用 ETHOS 計算由臨床醫生定義的危急事件的動態且個人化的風險機率。ARES 結合了個人化的可解釋性模組，可找出影響個別患者風險評估的主要臨床因素。ARES 在急診部門 (ED) 設定中針對 MIMIC-IV v2.2 資料集進行評估，並將其效能與傳統的預警系統和機器學習模型進行基準測試。我們將 299,721 位 MIMIC-IV 的獨特患者處理成 285,622 個 PHT，其中 60% 包含住院記錄。該資料集包含超過 3.57 億個標記。ETHOS 在預測住院、加護病房 (ICU) 住院和延長住院時間方面表現優於基準模型，並獲得了較高的 AUC 分數。基於 ETHOS 的風險評估顯示出跨人口統計子群的穩健性，並通過校準曲線確認了強大的模型可靠性。個人化的可解釋性模組提供了對導致風險的患者特定因素的見解。由 ETHOS 驅動的 ARES 透過提供動態、即時且個人化的風險評估，以及患者特定的可解釋性來增強臨床醫生的信任，從而推動了預測性醫療保健 AI 的發展。其適應性和卓越的準確性使其成為臨床決策制定的一種變革性工具，有可能改善緊急和住院環境中的患者結果和資源分配。我們在 github.com/ipolharvard/ethos-ares 上釋出完整程式碼，以利未來的研究。
 
-Spatiotemporal point processes (STPPs) are probabilistic models for events
-occurring in continuous space and time. Real-world event data often exhibit
-intricate dependencies and heterogeneous dynamics. By incorporating modern deep
-learning techniques, STPPs can model these complexities more effectively than
-traditional approaches. Consequently, the fusion of neural methods with STPPs
-has become an active and rapidly evolving research area. In this review, we
-categorize existing approaches, unify key design choices, and explain the
-challenges of working with this data modality. We further highlight emerging
-trends and diverse application domains. Finally, we identify open challenges
-and gaps in the literature.
+##### **Can ChatGPT Diagnose Alzheimer's Disease?**
+2502.06907v1 by Quoc-Toan Nguyen, Linh Le, Xuan-The Tran, Thomas Do, Chin-Teng Lin
 
-摘要：時空點過程 (STPP) 是事件在連續時空發生的機率模型。真實世界的事件資料通常會展現錯綜複雜的依賴關係和異質動態。透過結合現代深度學習技術，STPP 可以比傳統方法更有效地模擬這些複雜性。因此，神經方法與 STPP 的融合已成為一個活躍且快速發展的研究領域。在本篇評論中，我們對現有方法進行分類、統一關鍵設計選擇，並說明處理這種資料模式的挑戰。我們進一步強調新興趨勢和多樣化的應用領域。最後，我們找出文獻中的開放性挑戰和空白。
+Can ChatGPT diagnose Alzheimer's Disease (AD)? AD is a devastating
+neurodegenerative condition that affects approximately 1 in 9 individuals aged
+65 and older, profoundly impairing memory and cognitive function. This paper
+utilises 9300 electronic health records (EHRs) with data from Magnetic
+Resonance Imaging (MRI) and cognitive tests to address an intriguing question:
+As a general-purpose task solver, can ChatGPT accurately detect AD using EHRs?
+We present an in-depth evaluation of ChatGPT using a black-box approach with
+zero-shot and multi-shot methods. This study unlocks ChatGPT's capability to
+analyse MRI and cognitive test results, as well as its potential as a
+diagnostic tool for AD. By automating aspects of the diagnostic process, this
+research opens a transformative approach for the healthcare system,
+particularly in addressing disparities in resource-limited regions where AD
+specialists are scarce. Hence, it offers a foundation for a promising method
+for early detection, supporting individuals with timely interventions, which is
+paramount for Quality of Life (QoL).
 
-##### **Graph Diffusion Network for Drug-Gene Prediction**
-2502.09335v1 by Jiayang Wu, Wensheng Gan, Philip S. Yu
+摘要：ChatGPT 能否診斷出阿茲海默症 (AD)？AD 是一種毀滅性的神經退化性疾病，影響約 1/9 的 65 歲及以上人士，嚴重損害記憶力和認知功能。這篇論文利用了 9300 份電子健康紀錄 (EHR)，其中包含磁共振成像 (MRI) 和認知測試的數據，來解決一個有趣的問題：作為一個通用任務解決器，ChatGPT 能否使用 EHR 準確地檢測出 AD？我們使用黑盒方法對 ChatGPT 進行了深入評估，採用零次嘗試和多次嘗試的方法。這項研究揭示了 ChatGPT 分析 MRI 和認知測試結果的能力，以及其作為 AD 診斷工具的潛力。通過自動化診斷過程的各個方面，這項研究為醫療保健系統開啟了一種變革性的方法，特別是在解決資源有限的地區中 AD 專家稀缺的不平等問題方面。因此，它為一種有希望的早期檢測方法奠定了基礎，通過及時干預來支持個人，這對於生活品質 (QoL) 至關重要。
 
-Predicting drug-gene associations is crucial for drug development and disease
-treatment. While graph neural networks (GNN) have shown effectiveness in this
-task, they face challenges with data sparsity and efficient contrastive
-learning implementation. We introduce a graph diffusion network for drug-gene
-prediction (GDNDGP), a framework that addresses these limitations through two
-key innovations. First, it employs meta-path-based homogeneous graph learning
-to capture drug-drug and gene-gene relationships, ensuring similar entities
-share embedding spaces. Second, it incorporates a parallel diffusion network
-that generates hard negative samples during training, eliminating the need for
-exhaustive negative sample retrieval. Our model achieves superior performance
-on the DGIdb 4.0 dataset and demonstrates strong generalization capability on
-tripartite drug-gene-disease networks. Results show significant improvements
-over existing methods in drug-gene prediction tasks, particularly in handling
-complex heterogeneous relationships. The source code is publicly available at
-https://github.com/csjywu1/GDNDGP.
+##### **Protecting Intellectual Property of EEG-based Neural Networks with Watermarking**
+2502.05931v1 by Ahmed Abdelaziz, Ahmed Fathi, Ahmed Fares
 
-摘要：預測藥物基因關聯對藥物開發和疾病治療至關重要。雖然圖神經網路 (GNN) 已顯示在這個任務中的有效性，但它們在資料稀疏性和高效對比學習實作方面面臨挑戰。我們引入了一個用於藥物基因預測的圖擴散網路 (GDNDGP)，這是一個透過兩項關鍵創新來解決這些限制的框架。首先，它採用基於元路徑的同質圖學習來捕捉藥物-藥物和基因-基因關係，確保類似實體共享嵌入空間。其次，它整合了一個並行擴散網路，在訓練期間產生困難的負面樣本，消除了對詳盡負面樣本擷取的需求。我們的模型在 DGIdb 4.0 資料集上取得了卓越的效能，並在三方藥物-基因-疾病網路中展現強大的概化能力。結果顯示在藥物基因預測任務中，相較於現有方法有顯著的進步，特別是在處理複雜的異質關係方面。原始碼已公開於 https://github.com/csjywu1/GDNDGP。
+EEG-based neural networks, pivotal in medical diagnosis and brain-computer
+interfaces, face significant intellectual property (IP) risks due to their
+reliance on sensitive neurophysiological data and resource-intensive
+development. Current watermarking methods, particularly those using abstract
+trigger sets, lack robust authentication and fail to address the unique
+challenges of EEG models. This paper introduces a cryptographic wonder
+filter-based watermarking framework tailored for EEG-based neural networks.
+Leveraging collision-resistant hashing and public-key encryption, the wonder
+filter embeds the watermark during training, ensuring minimal distortion ($\leq
+5\%$ drop in EEG task accuracy) and high reliability (100\% watermark
+detection). The framework is rigorously evaluated against adversarial attacks,
+including fine-tuning, transfer learning, and neuron pruning. Results
+demonstrate persistent watermark retention, with classification accuracy for
+watermarked states remaining above 90\% even after aggressive pruning, while
+primary task performance degrades faster, deterring removal attempts. Piracy
+resistance is validated by the inability to embed secondary watermarks without
+severe accuracy loss ( $>10\%$ in EEGNet and CCNN models). Cryptographic
+hashing ensures authentication, reducing brute-force attack success
+probabilities. Evaluated on the DEAP dataset across models (CCNN, EEGNet,
+TSception), the method achieves $>99.4\%$ null-embedding accuracy, effectively
+eliminating false positives. By integrating wonder filters with EEG-specific
+adaptations, this work bridges a critical gap in IP protection for
+neurophysiological models, offering a secure, tamper-proof solution for
+healthcare and biometric applications. The framework's robustness against
+adversarial modifications underscores its potential to safeguard sensitive EEG
+models while maintaining diagnostic utility.
 
-##### **Beyond English: The Impact of Prompt Translation Strategies across Languages and Tasks in Multilingual LLMs**
-2502.09331v1 by Itai Mondshine, Tzuf Paz-Argaman, Reut Tsarfaty
+摘要：<paragraph>基於 EEG 的神經網路在醫學診斷和腦電腦介面中至關重要，由於其依賴敏感的神經生理資料和資源密集型的開發，面臨重大的智慧財產權 (IP) 風險。目前的浮水印方法，特別是那些使用抽象觸發集的方法，缺乏強健的驗證，且無法解決 EEG 模型的獨特挑戰。本文介紹了一個專為基於 EEG 的神經網路量身打造的密碼學 wonder 濾波器浮水印架構。利用抗碰撞雜湊和公開金鑰加密，wonder 濾波器在訓練期間嵌入浮水印，確保最小的失真（EEG 任務準確度下降 $\leq 5\%$）和高可靠性（100% 浮水印檢測）。該架構針對對抗性攻擊進行了嚴格的評估，包括微調、遷移學習和神經元剪枝。結果證明了持續的浮水印保留，即使在激進的剪枝後，浮水印狀態的分類準確度仍保持在 90% 以上，而主要任務的性能下降得更快，阻止了移除嘗試。盜版抵抗力通過無法嵌入次要浮水印而得到驗證，而不會造成嚴重的準確度損失（在 EEGNet 和 CCNN 模型中 $>10\%$）。密碼學雜湊確保驗證，降低了暴力攻擊成功機率。在 DEAP 資料集上針對模型（CCNN、EEGNet、TSception）進行評估，該方法達到了 $>99.4\%$ 的空嵌入準確度，有效地消除了假陽性。透過將 wonder 濾波器與 EEG 特定的適應相整合，這項工作彌補了神經生理模型 IP 保護中的關鍵差距，為醫療保健和生物特徵應用提供了一個安全、防篡改的解決方案。該架構對抗敵對修改的強健性突顯了其在維護診斷效用的同時保護敏感 EEG 模型的潛力。</paragraph>
 
-Despite advances in the multilingual capabilities of Large Language Models
-(LLMs) across diverse tasks, English remains the dominant language for LLM
-research and development. So, when working with a different language, this has
-led to the widespread practice of pre-translation, i.e., translating the task
-prompt into English before inference. Selective pre-translation, a more
-surgical approach, focuses on translating specific prompt components. However,
-its current use is sporagic and lacks a systematic research foundation.
-Consequently, the optimal pre-translation strategy for various multilingual
-settings and tasks remains unclear. In this work, we aim to uncover the optimal
-setup for pre-translation by systematically assessing its use. Specifically, we
-view the prompt as a modular entity, composed of four functional parts:
-instruction, context, examples, and output, either of which could be translated
-or not. We evaluate pre-translation strategies across 35 languages covering
-both low and high-resource languages, on various tasks including Question
-Answering (QA), Natural Language Inference (NLI), Named Entity Recognition
-(NER), and Abstractive Summarization. Our experiments show the impact of
-factors as similarity to English, translation quality and the size of
-pre-trained data, on the model performance with pre-translation. We suggest
-practical guidelines for choosing optimal strategies in various multilingual
-settings.
+##### **Enhancing Depression Detection with Chain-of-Thought Prompting: From Emotion to Reasoning Using Large Language Models**
+2502.05879v1 by Shiyu Teng, Jiaqing Liu, Rahul Kumar Jain, Shurong Chai, Ruibo Hou, Tomoko Tateyama, Lanfen Lin, Yen-wei Chen
 
-摘要：儘管大型語言模型 (LLM) 在各種任務中的多語言能力有進步，英語仍然是 LLM 研究和開發的主導語言。因此，在使用不同語言時，這導致了預翻譯的廣泛實務，即在推理之前將任務提示翻譯成英語。選擇性預翻譯是一種更精準的方法，專注於翻譯特定提示組成部分。然而，目前的使用是零星的，缺乏系統性的研究基礎。因此，各種多語言設定和任務的最佳預翻譯策略仍不清楚。在這項工作中，我們旨在透過系統性評估預翻譯的使用，找出其最佳設定。具體來說，我們將提示視為一個模組化實體，由四個功能部分組成：說明、背景、範例和輸出，其中任何一個都可以翻譯或不翻譯。我們在 35 種語言中評估預翻譯策略，涵蓋低資源語言和高資源語言，以及各種任務，包括問答 (QA)、自然語言推理 (NLI)、命名實體識別 (NER) 和抽象摘要。我們的實驗顯示了與英語的相似性、翻譯品質和預訓練資料大小等因素對預翻譯模型效能的影響。我們建議在各種多語言設定中選擇最佳策略的實用指南。
+Depression is one of the leading causes of disability worldwide, posing a
+severe burden on individuals, healthcare systems, and society at large. Recent
+advancements in Large Language Models (LLMs) have shown promise in addressing
+mental health challenges, including the detection of depression through
+text-based analysis. However, current LLM-based methods often struggle with
+nuanced symptom identification and lack a transparent, step-by-step reasoning
+process, making it difficult to accurately classify and explain mental health
+conditions. To address these challenges, we propose a Chain-of-Thought
+Prompting approach that enhances both the performance and interpretability of
+LLM-based depression detection. Our method breaks down the detection process
+into four stages: (1) sentiment analysis, (2) binary depression classification,
+(3) identification of underlying causes, and (4) assessment of severity. By
+guiding the model through these structured reasoning steps, we improve
+interpretability and reduce the risk of overlooking subtle clinical indicators.
+We validate our method on the E-DAIC dataset, where we test multiple
+state-of-the-art large language models. Experimental results indicate that our
+Chain-of-Thought Prompting technique yields superior performance in both
+classification accuracy and the granularity of diagnostic insights, compared to
+baseline approaches.
 
-##### **A Judge-free LLM Open-ended Generation Benchmark Based on the Distributional Hypothesis**
-2502.09316v1 by Kentaro Imajo, Masanori Hirano, Shuji Suzuki, Hiroaki Mikami
+摘要：憂鬱症是全球殘障的主要原因之一，對個人、醫療保健系統和整個社會造成嚴重負擔。大型語言模型 (LLM) 的最新進展已展現出解決心理健康挑戰的希望，包括透過基於文字的分析來偵測憂鬱症。然而，現有的基於 LLM 的方法通常難以辨識細微的症狀，而且缺乏透明且逐步的推理過程，這使得準確分類和解釋心理健康狀況變得困難。為了應對這些挑戰，我們提出了一種思考鏈提示方法，它增強了基於 LLM 的憂鬱症偵測的效能和可解釋性。我們的這項方法將偵測過程分解為四個階段：(1) 情緒分析，(2) 二元憂鬱症分類，(3) 找出潛在原因，以及 (4) 評估嚴重程度。透過引導模型完成這些結構化的推理步驟，我們提升了可解釋性，並降低了忽略細微臨床指標的風險。我們在 E-DAIC 資料集上驗證了我們的這項方法，並在其中測試了多種最先進的大型語言模型。實驗結果顯示，與基線方法相比，我們的思考鏈提示技術在分類準確度和診斷見解的精細度方面都表現出優異的效能。
 
-Evaluating the open-ended text generation of large language models (LLMs) is
-challenging because of the lack of a clear ground truth and the high cost of
-human or LLM-based assessments. We propose a novel benchmark that evaluates
-LLMs using n-gram statistics and rules, without relying on human judgement or
-LLM-as-a-judge approaches. Using 50 question and reference answer sets, we
-introduce three new metrics based on n-grams and rules: Fluency, Truthfulness,
-and Helpfulness. Our benchmark strongly correlates with GPT-4o-based
-evaluations while requiring significantly fewer computational resources,
-demonstrating its effectiveness as a scalable alternative for assessing LLMs'
-open-ended generation capabilities.
+##### **LLMs for Drug-Drug Interaction Prediction: A Comprehensive Comparison**
+2502.06890v1 by Gabriele De Vito, Filomena Ferrucci, Athanasios Angelakis
 
-摘要：評估大型語言模型 (LLM) 的開放式文字生成具有挑戰性，因為缺乏明確的基礎真實性，以及人工或基於 LLM 的評估成本高昂。我們提出一個新基準，使用 n-gram 統計和規則來評估 LLM，而不依賴於人工判斷或 LLM 作為評審的方法。使用 50 個問題和參考答案集，我們基於 n-gram 和規則引入了三項新指標：流暢度、真實性和有幫助性。我們的基準與基於 GPT-4o 的評估密切相關，同時需要明顯更少的計算資源，證明了其作為評估 LLM 的開放式生成能力的可擴充替代方案的有效性。
+The increasing volume of drug combinations in modern therapeutic regimens
+needs reliable methods for predicting drug-drug interactions (DDIs). While
+Large Language Models (LLMs) have revolutionized various domains, their
+potential in pharmaceutical research, particularly in DDI prediction, remains
+largely unexplored. This study thoroughly investigates LLMs' capabilities in
+predicting DDIs by uniquely processing molecular structures (SMILES), target
+organisms, and gene interaction data as raw text input from the latest DrugBank
+dataset. We evaluated 18 different LLMs, including proprietary models (GPT-4,
+Claude, Gemini) and open-source variants (from 1.5B to 72B parameters), first
+assessing their zero-shot capabilities in DDI prediction. We then fine-tuned
+selected models (GPT-4, Phi-3.5 2.7B, Qwen-2.5 3B, Gemma-2 9B, and Deepseek R1
+distilled Qwen 1.5B) to optimize their performance. Our comprehensive
+evaluation framework included validation across 13 external DDI datasets,
+comparing against traditional approaches such as l2-regularized logistic
+regression. Fine-tuned LLMs demonstrated superior performance, with Phi-3.5
+2.7B achieving a sensitivity of 0.978 in DDI prediction, with an accuracy of
+0.919 on balanced datasets (50% positive, 50% negative cases). This result
+represents an improvement over both zero-shot predictions and state-of-the-art
+machine-learning methods used for DDI prediction. Our analysis reveals that
+LLMs can effectively capture complex molecular interaction patterns and cases
+where drug pairs target common genes, making them valuable tools for practical
+applications in pharmaceutical research and clinical settings.
 
-##### **When the LM misunderstood the human chuckled: Analyzing garden path effects in humans and language models**
-2502.09307v1 by Samuel Joseph Amouyal, Aya Meltzer-Asscher, Jonathan Berant
+摘要：<paragraph>現代治療方案中藥物組合的數量越來越多，需要可靠的方法來預測藥物間交互作用 (DDI)。儘管大型語言模型 (LLM) 已在各個領域掀起革命，它們在藥物研究中的潛力，特別是在 DDI 預測中的潛力，仍未得到充分探索。本研究通過獨特地處理分子結構 (SMILES)、目標生物和基因交互資料作為來自最新 DrugBank 資料集的原始文字輸入，徹底調查了 LLM 在預測 DDI 中的能力。我們評估了 18 種不同的 LLM，包括專有模型（GPT-4、Claude、Gemini）和開源變體（從 1.5B 到 72B 參數），首先評估它們在 DDI 預測中的零次學習能力。然後，我們微調選定的模型（GPT-4、Phi-3.5 2.7B、Qwen-2.5 3B、Gemma-2 9B 和 Deepseek R1 蒸餾 Qwen 1.5B）以最佳化其效能。我們的全面評估框架包括跨 13 個外部 DDI 資料集進行驗證，並與傳統方法（例如 l2 正則化邏輯迴歸）進行比較。微調後的 LLM 表現出優異的效能，其中 Phi-3.5 2.7B 在 DDI 預測中達到 0.978 的靈敏度，在平衡資料集（50% 正例，50% 反例）上的準確度為 0.919。此結果優於零次學習預測和用於 DDI 預測的最新機器學習方法。我們的分析表明，LLM 可以有效捕捉複雜的分子交互模式和藥物對靶向共同基因的情況，使其成為藥物研究和臨床環境中實用應用的寶貴工具。</paragraph>
 
-Modern Large Language Models (LLMs) have shown human-like abilities in many
-language tasks, sparking interest in comparing LLMs' and humans' language
-processing. In this paper, we conduct a detailed comparison of the two on a
-sentence comprehension task using garden-path constructions, which are
-notoriously challenging for humans. Based on psycholinguistic research, we
-formulate hypotheses on why garden-path sentences are hard, and test these
-hypotheses on human participants and a large suite of LLMs using comprehension
-questions. Our findings reveal that both LLMs and humans struggle with specific
-syntactic complexities, with some models showing high correlation with human
-comprehension. To complement our findings, we test LLM comprehension of
-garden-path constructions with paraphrasing and text-to-image generation tasks,
-and find that the results mirror the sentence comprehension question results,
-further validating our findings on LLM understanding of these constructions.
+##### **Decoding Complexity: Intelligent Pattern Exploration with CHPDA (Context Aware Hybrid Pattern Detection Algorithm)**
+2502.07815v1 by Lokesh Koli, Shubham Kalra, Karanpreet Singh
 
-摘要：現代大型語言模型（LLM）在許多語言任務中展現出類似人類的能力，引發了比較 LLM 與人類語言處理的興趣。在本文中，我們使用對人類來說極具挑戰的花園路徑結構，對這兩者進行了詳細比較，以進行句子理解任務。根據心理語言學研究，我們制定了關於為什麼花園路徑句子困難的假設，並使用理解問題對人類參與者和大量 LLM 測試這些假設。我們的研究結果表明，LLM 和人類都難以應付特定的句法複雜性，其中一些模型與人類理解力高度相關。為了補充我們的研究結果，我們測試了 LLM 對花園路徑結構的理解，並進行了改寫和文字轉換為圖像的生成任務，並發現結果反映了句子理解問題的結果，進一步驗證了我們對 LLM 理解這些結構的研究結果。
+Detecting sensitive data such as Personally Identifiable Information (PII)
+and Protected Health Information (PHI) is critical for data security platforms.
+This study evaluates regex-based pattern matching algorithms and exact-match
+search techniques to optimize detection speed, accuracy, and scalability. Our
+benchmarking results indicate that Google RE2 provides the best balance of
+speed (10-15 ms/MB), memory efficiency (8-16 MB), and accuracy (99.5%) among
+regex engines, outperforming PCRE while maintaining broader hardware
+compatibility than Hyperscan. For exact matching, Aho-Corasick demonstrated
+superior performance (8 ms/MB) and scalability for large datasets. Performance
+analysis revealed that regex processing time scales linearly with dataset size
+and pattern complexity. A hybrid AI + Regex approach achieved the highest F1
+score (91. 6%) by improving recall and minimizing false positives. Device
+benchmarking confirmed that our solution maintains efficient CPU and memory
+usage on both high-performance and mid-range systems. Despite its
+effectiveness, challenges remain, such as limited multilingual support and the
+need for regular pattern updates. Future work should focus on expanding
+language coverage, integrating data security and privacy management (DSPM) with
+data loss prevention (DLP) tools, and enhancing regulatory compliance for
+broader global adoption.
 
-##### **Indeterminacy in Affective Computing: Considering Meaning and Context in Data Collection Practices**
-2502.09294v1 by Bernd Dudzik, Tiffany Matej Hrkalovic, Chenxu Hao, Chirag Raman, Masha Tsfasman
+摘要：偵測個人身分資訊 (PII) 和受保護健康資訊 (PHI) 等敏感資料，對於資料安全平台至關重要。本研究評估基於 regex 的模式配對演算法和精確配對搜尋技術，以最佳化偵測速度、準確度和可擴充性。我們的基準測試結果顯示，在 regex 引擎中，Google RE2 在速度 (10-15 ms/MB)、記憶體效率 (8-16 MB) 和準確度 (99.5%) 方面取得最佳平衡，優於 PCRE，同時比 Hyperscan 擁有更廣泛的硬體相容性。對於精確配對，Aho-Corasick 展現出優異的效能 (8 ms/MB) 和大資料集的可擴充性。效能分析顯示，regex 處理時間會隨著資料集大小和模式複雜度線性擴充。混合 AI + Regex 方法透過提升召回率和將假陽性降至最低，達到了最高的 F1 分數 (91. 6%)。裝置基準測試確認我們的解決方案在高性能和中階系統上都能維持高效的 CPU 和記憶體使用率。儘管有效，但仍有挑戰存在，例如多語言支援有限，以及需要定期更新模式。未來的研究應著重於擴展語言涵蓋範圍，將資料安全和隱私管理 (DSPM) 與資料遺失防護 (DLP) 工具整合，以及加強法規遵循以利更廣泛的全球採用。
 
-Automatic Affect Prediction (AAP) uses computational analysis of input data
-such as text, speech, images, and physiological signals to predict various
-affective phenomena (e.g., emotions or moods). These models are typically
-constructed using supervised machine-learning algorithms, which rely heavily on
-labeled training datasets. In this position paper, we posit that all AAP
-training data are derived from human Affective Interpretation Processes,
-resulting in a form of Affective Meaning. Research on human affect indicates a
-form of complexity that is fundamental to such meaning: it can possess what we
-refer to here broadly as Qualities of Indeterminacy (QIs) - encompassing
-Subjectivity (meaning depends on who is interpreting), Uncertainty (lack of
-confidence regarding meanings' correctness), Ambiguity (meaning contains
-mutually exclusive concepts) and Vagueness (meaning is situated at different
-levels in a nested hierarchy). Failing to appropriately consider QIs leads to
-results incapable of meaningful and reliable predictions. Based on this
-premise, we argue that a crucial step in adequately addressing indeterminacy in
-AAP is the development of data collection practices for modeling corpora that
-involve the systematic consideration of 1) a relevant set of QIs and 2) context
-for the associated interpretation processes. To this end, we are 1) outlining a
-conceptual model of AIPs and the QIs associated with the meaning these produce
-and a conceptual structure of relevant context, supporting understanding of its
-role. Finally, we use our framework for 2) discussing examples of
-context-sensitivity-related challenges for addressing QIs in data collection
-setups. We believe our efforts can stimulate a structured discussion of both
-the role of aspects of indeterminacy and context in research on AAP, informing
-the development of better practices for data collection and analysis.
+##### **WatchGuardian: Enabling User-Defined Personalized Just-in-Time Intervention on Smartwatch**
+2502.05783v1 by Ying Lei, Yancheng Cao, Will Wang, Yuanzhe Dong, Changchang Yin, Weidan Cao, Ping Zhang, Jingzhen Yang, Bingsheng Yao, Yifan Peng, Chunhua Weng, Randy Auerbach, Lena Mamykina, Dakuo Wang, Yuntao Wang, Xuhai Xu
 
-摘要：自動影響預測 (AAP) 使用輸入資料的運算分析，例如文字、語音、影像和生理訊號，來預測各種情感現象（例如情緒或心情）。這些模型通常使用監督式機器學習演算法建構，而這些演算法高度依賴標籤訓練資料集。在此立場文件中，我們主張所有 AAP 訓練資料都是從人類的情感詮釋過程中衍生而來的，進而形成一種情感意義。對人類情感的研究指出，這種複雜性是此種意義的基本要素：它可能具備我們在此廣泛稱之為不確定性品質 (QI)，包括主觀性（意義取決於詮釋者）、不確定性（對於意義正確性的信心不足）、歧義性（意義包含相互排斥的概念）和模糊性（意義位於嵌套層級的不同層級）。未能適當地考量 QI 會導致無法進行有意義且可靠預測的結果。基於此前提，我們主張，在 AAP 中適當地處理不確定性的關鍵步驟，是針對建模語料庫制定資料收集實務，其中涉及系統性地考量 1) 一組相關的 QI，以及 2) 相關詮釋過程的脈絡。為此，我們 1) 概述了 AIP 的概念模型，以及與這些 AIP 所產生的意義相關的 QI，以及相關脈絡的概念結構，支持對其角色的理解。最後，我們使用我們的架構 2) 討論了在資料收集設定中處理 QI 時，與脈絡敏感性相關的挑戰範例。我們相信我們的努力可以激勵對不確定性和脈絡面向在 AAP 研究中扮演的角色進行結構化的討論，為資料收集和分析的最佳實務發展提供資訊。
+While just-in-time interventions (JITIs) have effectively targeted common
+health behaviors, individuals often have unique needs to intervene in personal
+undesirable actions that can negatively affect physical, mental, and social
+well-being. We present WatchGuardian, a smartwatch-based JITI system that
+empowers users to define custom interventions for these personal actions with a
+small number of samples. For the model to detect new actions based on limited
+new data samples, we developed a few-shot learning pipeline that finetuned a
+pre-trained inertial measurement unit (IMU) model on public hand-gesture
+datasets. We then designed a data augmentation and synthesis process to train
+additional classification layers for customization. Our offline evaluation with
+26 participants showed that with three, five, and ten examples, our approach
+achieved an average accuracy of 76.8%, 84.7%, and 87.7%, and an F1 score of
+74.8%, 84.2%, and 87.2% We then conducted a four-hour intervention study to
+compare WatchGuardian against a rule-based intervention. Our results
+demonstrated that our system led to a significant reduction by 64.0 +- 22.6% in
+undesirable actions, substantially outperforming the baseline by 29.0%. Our
+findings underscore the effectiveness of a customizable, AI-driven JITI system
+for individuals in need of behavioral intervention in personal undesirable
+actions. We envision that our work can inspire broader applications of
+user-defined personalized intervention with advanced AI solutions.
 
-##### **SparQLe: Speech Queries to Text Translation Through LLMs**
-2502.09284v1 by Amirbek Djanibekov, Hanan Aldarmaki
+摘要：<paragraph>雖然即時介入（JITIs）有效地針對常見的健康行為，但個人通常有獨特的需求來介入可能會對身心和社會福祉產生負面影響的個人不良行為。我們提出 WatchGuardian，這是一個基於智慧手錶的 JITI 系統，它使用少數樣本讓使用者能夠為這些個人行為定義自訂介入措施。為了讓模型根據有限的新資料樣本偵測新行為，我們開發了一個小樣本學習管道，微調了公共手勢資料集上的預訓練慣性測量單元（IMU）模型。然後，我們設計了一個資料擴充和合成流程，以訓練其他分類層以進行自訂。我們對 26 位參與者進行的離線評估顯示，我們的做法使用三個、五個和十個範例，達到了 76.8%、84.7% 和 87.7% 的平均準確度，以及 74.8%、84.2% 和 87.2% 的 F1 分數。然後，我們進行了一項為時四小時的介入研究，以將 WatchGuardian 與基於規則的介入進行比較。我們的結果表明，我們的系統導致不良行為顯著減少了 64.0 +- 22.6%，大幅優於基線 29.0%。我們的研究結果強調了可自訂、AI 驅動的 JITI 系統對需要行為介入以應對個人不良行為的個人的有效性。我們預計我們的研究可以激勵使用者定義個人化介入的更廣泛應用，並採用先進的 AI 解決方案。</paragraph>
 
-With the growing influence of Large Language Models (LLMs), there is
-increasing interest in integrating speech representations with them to enable
-more seamless multi-modal processing and speech understanding. This study
-introduces a novel approach that leverages self-supervised speech
-representations in combination with instruction-tuned LLMs for speech-to-text
-translation. The proposed approach leverages a modality adapter to align
-extracted speech features with instruction-tuned LLMs using English-language
-data. Our experiments demonstrate that this method effectively preserves the
-semantic content of the input speech and serves as an effective bridge between
-self-supervised speech models and instruction-tuned LLMs, offering a promising
-solution for various speech understanding applications.
+##### **RECOVER: Designing a Large Language Model-based Remote Patient Monitoring System for Postoperative Gastrointestinal Cancer Care**
+2502.05740v1 by Ziqi Yang, Yuxuan Lu, Jennifer Bagdasarian, Vedant Das Swain, Ritu Agarwal, Collin Campbell, Waddah Al-Refaire, Jehan El-Bayoumi, Guodong Gao, Dakuo Wang, Bingsheng Yao, Nawar Shara
 
-摘要：隨著大型語言模型（LLM）影響力逐漸擴大，將語音表徵與其整合，以實現更順暢的多模態處理和語音理解，已引起越來越多的興趣。本研究提出了一種新穎的方法，該方法利用自監督語音表徵，結合指令調整的 LLM，進行語音轉文字翻譯。所提出的方法利用模態適配器，使用英語語言資料，將提取的語音特徵與指令調整的 LLM 對齊。我們的實驗證明，此方法有效地保留了輸入語音的語義內容，並作為自監督語音模型和指令調整的 LLM 之間的有效橋樑，為各種語音理解應用程式提供了一個有前景的解決方案。
+Cancer surgery is a key treatment for gastrointestinal (GI) cancers, a group
+of cancers that account for more than 35% of cancer-related deaths worldwide,
+but postoperative complications are unpredictable and can be life-threatening.
+In this paper, we investigate how recent advancements in large language models
+(LLMs) can benefit remote patient monitoring (RPM) systems through clinical
+integration by designing RECOVER, an LLM-powered RPM system for postoperative
+GI cancer care. To closely engage stakeholders in the design process, we first
+conducted seven participatory design sessions with five clinical staff and
+interviewed five cancer patients to derive six major design strategies for
+integrating clinical guidelines and information needs into LLM-based RPM
+systems. We then designed and implemented RECOVER, which features an
+LLM-powered conversational agent for cancer patients and an interactive
+dashboard for clinical staff to enable efficient postoperative RPM. Finally, we
+used RECOVER as a pilot system to assess the implementation of our design
+strategies with four clinical staff and five patients, providing design
+implications by identifying crucial design elements, offering insights on
+responsible AI, and outlining opportunities for future LLM-powered RPM systems.
 
-##### **LiSA: Leveraging Link Recommender to Attack Graph Neural Networks via Subgraph Injection**
-2502.09271v1 by Wenlun Zhang, Enyan Dai, Kentaro Yoshioka
+摘要：癌症手術是胃腸道 (GI) 癌症的主要治療方式，這類癌症佔全球癌症相關死亡人數的 35% 以上，但術後併發症無法預測，且可能危及生命。在本文中，我們探討大型語言模型 (LLM) 的近期進展如何透過臨床整合造福遠端病患監控 (RPM) 系統，方法是設計 RECOVER，一個由 LLM 驅動的 RPM 系統，用於術後胃腸道癌症照護。為了讓利害關係人密切參與設計流程，我們首先與五位臨床人員進行七場參與式設計會議，並訪談五位癌症患者，以找出六項整合臨床指南和資訊需求至基於 LLM 的 RPM 系統的主要設計策略。接著，我們設計並實作 RECOVER，其特色在於一個由 LLM 驅動的對話式代理人，供癌症患者使用，以及一個互動式儀表板，供臨床人員使用，以進行有效的術後 RPM。最後，我們使用 RECOVER 作為試點系統，與四位臨床人員和五位患者評估我們設計策略的實作，並透過找出重要的設計元素、提供對負責任 AI 的見解，以及概述未來由 LLM 驅動的 RPM 系統的機會，提出設計意涵。
 
-Graph Neural Networks (GNNs) have demonstrated remarkable proficiency in
-modeling data with graph structures, yet recent research reveals their
-susceptibility to adversarial attacks. Traditional attack methodologies, which
-rely on manipulating the original graph or adding links to artificially created
-nodes, often prove impractical in real-world settings. This paper introduces a
-novel adversarial scenario involving the injection of an isolated subgraph to
-deceive both the link recommender and the node classifier within a GNN system.
-Specifically, the link recommender is mislead to propose links between targeted
-victim nodes and the subgraph, encouraging users to unintentionally establish
-connections and that would degrade the node classification accuracy, thereby
-facilitating a successful attack. To address this, we present the LiSA
-framework, which employs a dual surrogate model and bi-level optimization to
-simultaneously meet two adversarial objectives. Extensive experiments on
-real-world datasets demonstrate the effectiveness of our method.
+##### **4D VQ-GAN: Synthesising Medical Scans at Any Time Point for Personalised Disease Progression Modelling of Idiopathic Pulmonary Fibrosis**
+2502.05713v1 by An Zhao, Moucheng Xu, Ahmed H. Shahin, Wim Wuyts, Mark G. Jones, Joseph Jacob, Daniel C. Alexander
 
-摘要：圖形神經網路 (GNN) 已展現出在對具有圖形結構的資料進行建模方面的卓越能力，但最近的研究揭露了它們容易受到對抗性攻擊的影響。傳統的攻擊方法依賴於操縱原始圖形或將連結新增至人工建立的節點，在真實世界設定中通常被證明不切實際。本文介紹了一種新穎的對抗性場景，涉及注入一個孤立的子圖形，以欺騙 GNN 系統中的連結推薦器和節點分類器。具體來說，連結推薦器被誤導為在目標受害節點和子圖形之間提出連結，鼓勵使用者無意間建立連結，這將降低節點分類準確度，從而促成攻擊成功。為了解決這個問題，我們提出了 LiSA 框架，它採用雙重代理模型和雙層最佳化，以同時滿足兩個對抗性目標。對真實世界資料集進行的廣泛實驗證明了我們方法的有效性。
+Understanding the progression trajectories of diseases is crucial for early
+diagnosis and effective treatment planning. This is especially vital for
+life-threatening conditions such as Idiopathic Pulmonary Fibrosis (IPF), a
+chronic, progressive lung disease with a prognosis comparable to many cancers.
+Computed tomography (CT) imaging has been established as a reliable diagnostic
+tool for IPF. Accurately predicting future CT scans of early-stage IPF patients
+can aid in developing better treatment strategies, thereby improving survival
+outcomes. In this paper, we propose 4D Vector Quantised Generative Adversarial
+Networks (4D-VQ-GAN), a model capable of generating realistic CT volumes of IPF
+patients at any time point. The model is trained using a two-stage approach. In
+the first stage, a 3D-VQ-GAN is trained to reconstruct CT volumes. In the
+second stage, a Neural Ordinary Differential Equation (ODE) based temporal
+model is trained to capture the temporal dynamics of the quantised embeddings
+generated by the encoder in the first stage. We evaluate different
+configurations of our model for generating longitudinal CT scans and compare
+the results against ground truth data, both quantitatively and qualitatively.
+For validation, we conduct survival analysis using imaging biomarkers derived
+from generated CT scans and achieve a C-index comparable to that of biomarkers
+derived from the real CT scans. The survival analysis results demonstrate the
+potential clinical utility inherent to generated longitudinal CT scans, showing
+that they can reliably predict survival outcomes.
 
-##### **AnomalyGFM: Graph Foundation Model for Zero/Few-shot Anomaly Detection**
-2502.09254v1 by Hezhe Qiao, Chaoxi Niu, Ling Chen, Guansong Pang
+摘要：了解疾病的進程軌跡對於早期診斷和有效的治療計畫至關重要。這對於特發性肺纖維化 (IPF) 等威脅生命的疾病尤其重要，IPF 是一種慢性、進行性肺部疾病，其預後與許多癌症相當。電腦斷層掃描 (CT) 影像已被確立為 IPF 的可靠診斷工具。準確預測早期 IPF 患者的未來 CT 掃描有助於制定更好的治療策略，從而改善存活結果。在本文中，我們提出 4D 向量量化生成對抗網路 (4D-VQ-GAN)，這是一個模型，能夠在任何時間點生成 IPF 患者的逼真 CT 體積。該模型使用兩階段方法進行訓練。在第一階段，訓練 3D-VQ-GAN 以重建 CT 體積。在第二階段，訓練基於神經常微分方程 (ODE) 的時間模型，以捕捉第一階段編碼器生成的量化嵌入的時間動態。我們評估了我們的模型的不同配置，以生成縱向 CT 掃描，並在定量和定性方面將結果與真實數據進行比較。為了驗證，我們使用從生成的 CT 掃描中得出的影像生物標記進行存活分析，並獲得與從真實 CT 掃描中得出的生物標記相當的 C 指數。存活分析結果證明了生成縱向 CT 掃描固有的潛在臨床效用，表明它們可以可靠地預測存活結果。
 
-Graph anomaly detection (GAD) aims to identify abnormal nodes that differ
-from the majority of the nodes in a graph, which has been attracting
-significant attention in recent years. Existing generalist graph models have
-achieved remarkable success in different graph tasks but struggle to generalize
-to the GAD task. This limitation arises from their difficulty in learning
-generalized knowledge for capturing the inherently infrequent, irregular and
-heterogeneous abnormality patterns in graphs from different domains. To address
-this challenge, we propose AnomalyGFM, a GAD-oriented graph foundation model
-that supports zero-shot inference and few-shot prompt tuning for GAD in diverse
-graph datasets. One key insight is that graph-agnostic representations for
-normal and abnormal classes are required to support effective zero/few-shot GAD
-across different graphs. Motivated by this, AnomalyGFM is pre-trained to align
-data-independent, learnable normal and abnormal class prototypes with node
-representation residuals (i.e., representation deviation of a node from its
-neighbors). The residual features essentially project the node information into
-a unified feature space where we can effectively measure the abnormality of
-nodes from different graphs in a consistent way. This provides a driving force
-for the learning of graph-agnostic, discriminative prototypes for the normal
-and abnormal classes, which can be used to enable zero-shot GAD on new graphs,
-including very large-scale graphs. If there are few-shot labeled normal nodes
-available in the new graphs, AnomalyGFM can further support prompt tuning to
-leverage these nodes for better adaptation. Comprehensive experiments on 11
-widely-used GAD datasets with real anomalies, demonstrate that AnomalyGFM
-significantly outperforms state-of-the-art competing methods under both zero-
-and few-shot GAD settings.
+##### **KMI: A Dataset of Korean Motivational Interviewing Dialogues for Psychotherapy**
+2502.05651v1 by Hyunjong Kim, Suyeon Lee, Yeongjae Cho, Eunseo Ryu, Yohan Jo, Suran Seong, Sungzoon Cho
 
-摘要：圖形異常偵測 (GAD) 的目標是找出與圖形中大多數節點不同的異常節點，這在近年來引起了廣泛的關注。現有的通才圖形模型在不同的圖形任務中都取得了顯著的成功，但卻難以推廣到 GAD 任務。這種限制來自於它們難以學習廣泛的知識，用於擷取來自不同領域圖形中固有的罕見、不規則和異質異常模式。為了應對這個挑戰，我們提出了 AnomalyGFM，一個面向 GAD 的圖形基礎模型，它支援零次學習推論和少次提示調整，用於在不同的圖形資料集中進行 GAD。一個關鍵見解是，需要圖形不可知的正常和異常類別表示，以支援跨不同圖形的有效零次/少次 GAD。受此啟發，AnomalyGFM 被預先訓練以將與資料無關的可學習正常和異常類別原型與節點表示殘差（即節點與其鄰居的表示偏差）對齊。殘差特徵基本上將節點資訊投射到一個統一的特徵空間中，在這個空間中，我們可以有效地測量來自不同圖形的節點異常，並且方式一致。這為學習正常和異常類別的圖形不可知、有區別的原型提供了驅動力，這些原型可用於對新的圖形（包括非常大規模的圖形）啟用零次 GAD。如果新的圖形中有少量的標籤正常節點，AnomalyGFM 可以進一步支援提示調整，以利用這些節點進行更好的適應。在 11 個廣泛使用的具有真實異常值的 GAD 資料集上的綜合實驗表明，在零次和少次 GAD 設定下，AnomalyGFM 明顯優於最先進的競爭方法。
+The increasing demand for mental health services has led to the rise of
+AI-driven mental health chatbots, though challenges related to privacy, data
+collection, and expertise persist. Motivational Interviewing (MI) is gaining
+attention as a theoretical basis for boosting expertise in the development of
+these chatbots. However, existing datasets are showing limitations for training
+chatbots, leading to a substantial demand for publicly available resources in
+the field of MI and psychotherapy. These challenges are even more pronounced in
+non-English languages, where they receive less attention. In this paper, we
+propose a novel framework that simulates MI sessions enriched with the
+expertise of professional therapists. We train an MI forecaster model that
+mimics the behavioral choices of professional therapists and employ Large
+Language Models (LLMs) to generate utterances through prompt engineering. Then,
+we present KMI, the first synthetic dataset theoretically grounded in MI,
+containing 1,000 high-quality Korean Motivational Interviewing dialogues.
+Through an extensive expert evaluation of the generated dataset and the
+dialogue model trained on it, we demonstrate the quality, expertise, and
+practicality of KMI. We also introduce novel metrics derived from MI theory in
+order to evaluate dialogues from the perspective of MI.
 
-##### **The Joint Entity-Relation Extraction Model Based on Span and Interactive Fusion Representation for Chinese Medical Texts with Complex Semantics**
-2502.09247v1 by Danni Feng, Runzhi Li, Jing Wang, Siyu Yan, Lihong Ma, Yunli Xing
+摘要：由於對心理健康服務的需求日益增加，導致以人工智慧為基礎的心理健康聊天機器人興起，儘管與隱私、資料蒐集和專業知識相關的挑戰依然存在。動機性訪談 (MI) 正作為提升這些聊天機器人在開發方面專業知識的理論基礎而備受關注。然而，現有的資料集顯示出訓練聊天機器人的限制，導致對 MI 和心理治療領域中公開可用資源的需求大幅增加。這些挑戰在非英語語言中更加明顯，因為它們受到的關注較少。在本文中，我們提出了一個新穎的架構，它模擬了豐富專業治療師專業知識的 MI 課程。我們訓練了一個 MI 預測模型，它模擬了專業治療師的行為選擇，並採用大型語言模型 (LLM) 透過提示工程來產生話語。然後，我們展示了 KMI，這是第一個理論上以 MI 為基礎的合成資料集，其中包含 1,000 個高品質的韓語動機性訪談對話。透過對所產生的資料集和在該資料集上訓練的對話模型進行廣泛的專家評估，我們展示了 KMI 的品質、專業知識和實用性。我們還引入了從 MI 理論中衍生的新指標，以便從 MI 的角度評估對話。
 
-Joint entity-relation extraction is a critical task in transforming
-unstructured or semi-structured text into triplets, facilitating the
-construction of large-scale knowledge graphs, and supporting various downstream
-applications. Despite its importance, research on Chinese text, particularly
-with complex semantics in specialized domains like medicine, remains limited.
-To address this gap, we introduce the CH-DDI, a Chinese drug-drug interactions
-dataset designed to capture the intricacies of medical text. Leveraging the
-strengths of attention mechanisms in capturing long-range dependencies, we
-propose the SEA module, which enhances the extraction of complex contextual
-semantic information, thereby improving entity recognition and relation
-extraction. Additionally, to address the inefficiencies of existing methods in
-facilitating information exchange between entity recognition and relation
-extraction, we present an interactive fusion representation module. This module
-employs Cross Attention for bidirectional information exchange between the
-tasks and further refines feature extraction through BiLSTM. Experimental
-results on both our CH-DDI dataset and public CoNLL04 dataset demonstrate that
-our model exhibits strong generalization capabilities. On the CH-DDI dataset,
-our model achieves an F1-score of 96.73% for entity recognition and 78.43% for
-relation extraction. On the CoNLL04 dataset, it attains an entity recognition
-precision of 89.54% and a relation extraction accuracy of 71.64%.
+##### **ELMTEX: Fine-Tuning Large Language Models for Structured Clinical Information Extraction. A Case Study on Clinical Reports**
+2502.05638v1 by Aynur Guluzade, Naguib Heiba, Zeyd Boukhers, Florim Hamiti, Jahid Hasan Polash, Yehya Mohamad, Carlos A Velasco
 
-摘要：聯合實體關係抽取是將非結構化或半結構化文字轉換為三元組的重要任務，有助於建構大規模知識圖譜，並支援各種下游應用程式。儘管其重要性，但針對中文文本的研究，特別是醫學等專業領域中具有複雜語義的研究仍十分有限。為了解決這個差距，我們引入了 CH-DDI，一個中文藥物-藥物交互作用資料集，旨在擷取醫學文本的複雜性。利用注意力機制在擷取長程依賴關係方面的優勢，我們提出了 SEA 模組，增強了複雜脈絡語義資訊的抽取，從而改進了實體辨識和關係抽取。此外，為了解決現有方法在促進實體辨識和關係抽取之間資訊交換方面的低效率問題，我們提出了互動式融合表示模組。此模組採用交叉注意力，在任務之間進行雙向資訊交換，並透過 BiLSTM 進一步精煉特徵抽取。在我們的 CH-DDI 資料集和公開的 CoNLL04 資料集上的實驗結果表明，我們的模型展現出強大的泛化能力。在 CH-DDI 資料集上，我們的模型在實體辨識方面達到了 96.73% 的 F1 分數，在關係抽取方面達到了 78.43% 的 F1 分數。在 CoNLL04 資料集上，它在實體辨識方面達到了 89.54% 的準確度，在關係抽取方面達到了 71.64% 的準確度。
+Europe's healthcare systems require enhanced interoperability and
+digitalization, driving a demand for innovative solutions to process legacy
+clinical data. This paper presents the results of our project, which aims to
+leverage Large Language Models (LLMs) to extract structured information from
+unstructured clinical reports, focusing on patient history, diagnoses,
+treatments, and other predefined categories. We developed a workflow with a
+user interface and evaluated LLMs of varying sizes through prompting strategies
+and fine-tuning. Our results show that fine-tuned smaller models match or
+surpass larger counterparts in performance, offering efficiency for
+resource-limited settings. A new dataset of 60,000 annotated English clinical
+summaries and 24,000 German translations was validated with automated and
+manual checks. The evaluations used ROUGE, BERTScore, and entity-level metrics.
+The work highlights the approach's viability and outlines future improvements.
 
-##### **You Do Not Fully Utilize Transformer's Representation Capacity**
-2502.09245v1 by Gleb Gerasimov, Yaroslav Aksenov, Nikita Balagansky, Viacheslav Sinii, Daniil Gavrilov
+摘要：歐洲的醫療保健系統需要增強互通性和數位化，這驅動了對創新解決方案的需求，以處理傳統的臨床數據。本文介紹了我們專案的成果，該專案旨在利用大型語言模型 (LLM) 從非結構化的臨床報告中提取結構化的資訊，重點放在病歷、診斷、治療和其他預定義類別上。我們開發了一個具有使用者介面的工作流程，並透過提示策略和微調來評估不同規模的 LLM。我們的結果顯示，微調後的較小模型在效能上與較大的模型相匹配或超越它們，為資源有限的環境提供了效率。一個包含 60,000 個註解英文臨床摘要和 24,000 個德文翻譯的新資料集已透過自動化和手動檢查進行驗證。評估使用了 ROUGE、BERTScore 和實體層級的指標。這項工作突出了這種方法的可行性，並概述了未來的改進。
 
-In contrast to RNNs, which compress previous tokens into a single hidden
-state, Transformers can attend to all previous tokens directly. However,
-standard Transformers only use representations from the immediately preceding
-layer. In this paper, we show that this design choice causes representation
-collapse and leads to suboptimal performance. To address this issue, we
-introduce Layer-Integrated Memory (LIMe), a simple yet powerful approach that
-preserves the model's overall memory footprint while expanding its
-representational capacity by allowing access to hidden states from earlier
-layers. Through extensive experiments across various architectures and
-different lookup mechanisms, we demonstrate consistent performance improvements
-on a wide range of tasks. Moreover, our analysis of the learned representation
-dynamics and our exploration of depthwise circuits reveal how LIMe integrates
-information across layers, pointing to promising directions for future
-research.
+##### **Multi-scale Masked Autoencoder for Electrocardiogram Anomaly Detection**
+2502.05494v1 by Ya Zhou, Yujie Yang, Jianhuang Gan, Xiangjie Li, Jing Yuan, Wei Zhao
 
-摘要：與將先前符號壓縮成單一隱藏狀態的遞迴神經網路不同，Transformer 可以直接關注所有先前的符號。然而，標準 Transformer 僅使用緊鄰前一層的表示。在本文中，我們說明此設計選擇會導致表示崩潰，並導致次優效能。為了解決此問題，我們引入了「層整合式記憶體」(LIMe)，這是一種簡單但強大的方法，可在擴充表示能力的同時，保留模型的整體記憶體使用量，方法是允許存取來自較早層的隱藏狀態。透過各種架構和不同查詢機制的廣泛實驗，我們展示了在各種任務上的一致效能提升。此外，我們對已學習表示動態的分析和對深度電路的探討，揭示了 LIMe 如何整合跨層資訊，並指出未來研究有望發展的方向。
+Electrocardiogram (ECG) analysis is a fundamental tool for diagnosing
+cardiovascular conditions, yet anomaly detection in ECG signals remains
+challenging due to their inherent complexity and variability. We propose
+Multi-scale Masked Autoencoder for ECG anomaly detection (MMAE-ECG), a novel
+end-to-end framework that effectively captures both global and local
+dependencies in ECG data. Unlike state-of-the-art methods that rely on
+heartbeat segmentation or R-peak detection, MMAE-ECG eliminates the need for
+such pre-processing steps, enhancing its suitability for clinical deployment.
+MMAE-ECG partitions ECG signals into non-overlapping segments, with each
+segment assigned learnable positional embeddings. A novel multi-scale masking
+strategy and multi-scale attention mechanism, along with distinct positional
+embeddings, enable a lightweight Transformer encoder to effectively capture
+both local and global dependencies. The masked segments are then reconstructed
+using a single-layer Transformer block, with an aggregation strategy employed
+during inference to refine the outputs. Experimental results demonstrate that
+our method achieves performance comparable to state-of-the-art approaches while
+significantly reducing computational complexity-approximately 1/78 of the
+floating-point operations (FLOPs) required for inference. Ablation studies
+further validate the effectiveness of each component, highlighting the
+potential of multi-scale masked autoencoders for anomaly detection.
 
-##### **From large language models to multimodal AI: A scoping review on the potential of generative AI in medicine**
-2502.09242v1 by Lukas Buess, Matthias Keicher, Nassir Navab, Andreas Maier, Soroosh Tayebi Arasteh
+摘要：心電圖 (ECG) 分析是診斷心血管疾病的基本工具，但由於 ECG 訊號本身的複雜性和變異性，異常偵測仍然是一項挑戰。我們提出用於 ECG 異常偵測的多尺度遮罩自編碼器 (MMAE-ECG)，這是一個新穎的端對端架構，可有效擷取 ECG 資料中的全局和局部依賴關係。與依賴於心跳區段或 R 波峰偵測的最新方法不同，MMAE-ECG 消除了對此類前處理步驟的需求，增強其適用於臨床部署。MMAE-ECG 將 ECG 訊號分割成不相疊的區段，每個區段都指派可學習的位置嵌入。新穎的多尺度遮罩策略和多尺度注意力機制，以及不同的位置嵌入，使輕量級 Transformer 編碼器能夠有效擷取局部和全局依賴關係。然後使用單層 Transformer 區塊重建遮罩區段，並在推理期間採用聚合策略來優化輸出。實驗結果表明，我們的模型達到了與最新方法相當的效能，同時大幅降低運算複雜度，約為推理所需的浮點運算 (FLOP) 的 1/78。消融研究進一步驗證了每個組件的有效性，突顯了多尺度遮罩自編碼器在異常偵測方面的潛力。
 
-Generative artificial intelligence (AI) models, such as diffusion models and
-OpenAI's ChatGPT, are transforming medicine by enhancing diagnostic accuracy
-and automating clinical workflows. The field has advanced rapidly, evolving
-from text-only large language models for tasks such as clinical documentation
-and decision support to multimodal AI systems capable of integrating diverse
-data modalities, including imaging, text, and structured data, within a single
-model. The diverse landscape of these technologies, along with rising interest,
-highlights the need for a comprehensive review of their applications and
-potential. This scoping review explores the evolution of multimodal AI,
-highlighting its methods, applications, datasets, and evaluation in clinical
-settings. Adhering to PRISMA-ScR guidelines, we systematically queried PubMed,
-IEEE Xplore, and Web of Science, prioritizing recent studies published up to
-the end of 2024. After rigorous screening, 144 papers were included, revealing
-key trends and challenges in this dynamic field. Our findings underscore a
-shift from unimodal to multimodal approaches, driving innovations in diagnostic
-support, medical report generation, drug discovery, and conversational AI.
-However, critical challenges remain, including the integration of heterogeneous
-data types, improving model interpretability, addressing ethical concerns, and
-validating AI systems in real-world clinical settings. This review summarizes
-the current state of the art, identifies critical gaps, and provides insights
-to guide the development of scalable, trustworthy, and clinically impactful
-multimodal AI solutions in healthcare.
+##### **DCENWCNet: A Deep CNN Ensemble Network for White Blood Cell Classification with LIME-Based Explainability**
+2502.05459v1 by Sibasish Dhibar
 
-摘要：生成式人工智能 (AI) 模型，例如扩散模型和 OpenAI 的 ChatGPT，通过提高诊断准确性和自动化临床工作流程，正在改变医学领域。该领域已迅速发展，从用于临床文件编制和决策支持等任务的纯文本大型语言模型，发展到能够在单个模型中整合包括影像、文本和结构化数据在内的多种数据方式的多模态 AI 系统。这些技术的多样化格局以及日益增长的兴趣，凸显了全面审查其应用和潜力的必要性。本范围审查探讨了多模态 AI 的演变，重点介绍了其方法、应用、数据集和在临床环境中的评估。遵循 PRISMA-ScR 指南，我们系统地查询了 PubMed、IEEE Xplore 和 Web of Science，优先考虑截至 2024 年底发表的最新研究。经过严格筛选，纳入了 144 篇论文，揭示了这一充满活力的领域的趋势和挑战。我们的研究结果强调了从单模态方法向多模态方法的转变，推动了诊断支持、医疗报告生成、药物发现和会话式 AI 的创新。然而，关键挑战仍然存在，包括异构数据类型的整合、提高模型可解释性、解决伦理问题以及在现实世界的临床环境中验证 AI 系统。本综述总结了当前的最新技术，确定了关键差距，并提供了见解，以指导在医疗保健领域开发可扩展、可信赖且具有临床影响力的多模态 AI 解决方案。
+White blood cells (WBC) are important parts of our immune system, and they
+protect our body against infections by eliminating viruses, bacteria, parasites
+and fungi. The number of WBC types and the total number of WBCs provide
+important information about our health status. A traditional method,
+convolutional neural networks (CNN), a deep learning architecture, can classify
+the blood cell from a part of an object and perform object recognition. Various
+CNN models exhibit potential; however, their development often involves ad-hoc
+processes that neglect unnecessary layers, leading to issues with unbalanced
+datasets and insufficient data augmentation. To address these challenges, we
+propose a novel ensemble approach that integrates three CNN architectures, each
+uniquely configured with different dropout and max-pooling layer settings to
+enhance feature learning. This ensemble model, named DCENWCNet, effectively
+balances the bias-variance trade-off. When evaluated on the widely recognized
+Rabbin-WBC dataset, our model outperforms existing state-of-the-art networks,
+achieving highest mean accuracy. Additionally, it demonstrates superior
+performance in precision, recall, F1-score, and Area Under the ROC Curve (AUC)
+across all categories. To delve deeper into the interpretability of
+classifiers, we employ reliable post-hoc explanation techniques, including
+Local Interpretable Model-Agnostic Explanations (LIME). These methods
+approximate the behavior of a black-box model by elucidating the relationships
+between feature values and predictions. Interpretable results enable users to
+comprehend and validate the model's predictions, thereby increasing their
+confidence in the automated diagnosis.
 
-##### **Reliable Conversational Agents under ASP Control that Understand Natural Language**
-2502.09237v1 by Yankai Zeng
+摘要：白血球 (WBC) 是我們免疫系統的重要組成部分，它們通過清除病毒、細菌、寄生蟲和真菌來保護我們的機體免受感染。WBC 類型數量和 WBC 總數提供了有關我們健康狀況的重要資訊。傳統方法卷積神經網路 (CNN) 是一種深度學習架構，可以對物體的一部分進行血細胞分類並執行物體識別。各種 CNN 模型展現出潛力；然而，它們的開發通常涉及忽略不必要層的臨時過程，導致不平衡的資料集和資料擴充不足的問題。為了應對這些挑戰，我們提出了一種新穎的整體方法，它整合了三種 CNN 架構，每種架構都採用不同的中斷和最大池化層設定進行獨特配置，以增強特徵學習。這種名為 DCENWCNet 的整體模型有效地平衡了偏差變異取捨。在廣泛認可的 Rabbin-WBC 資料集上進行評估時，我們的模型優於現有的最先進網路，達到了最高的平均準確度。此外，它在所有類別中都展示了在精確度、召回率、F1 分數和 ROC 曲線下面積 (AUC) 方面的卓越效能。為了更深入地研究分類器的可解釋性，我們採用了可靠的事後解釋技術，包括局部可解釋模型不可知解釋 (LIME)。這些方法通過闡明特徵值和預測之間的關係來近似黑盒模型的行為。可解釋的結果使用戶能夠理解和驗證模型的預測，從而增加他們對自動化診斷的信心。
 
-Efforts have been made to make machines converse like humans in the past few
-decades. The recent techniques of Large Language Models (LLMs) make it possible
-to have human-like conversations with machines, but LLM's flaws of lacking
-understanding and reliability are well documented. We believe that the best way
-to eliminate this problem is to use LLMs only as parsers to translate text to
-knowledge and vice versa and carry out the conversation by reasoning over this
-knowledge using the answer set programming. I have been developing a framework
-based on LLMs and ASP to realize reliable chatbots that "understand" human
-conversation. This framework has been used to develop task-specific chatbots as
-well as socialbots. My future research is focused on making these chatbots
-scalable and trainable.
+##### **Multi-Class Segmentation of Aortic Branches and Zones in Computed Tomography Angiography: The AortaSeg24 Challenge**
+2502.05330v1 by Muhammad Imran, Jonathan R. Krebs, Vishal Balaji Sivaraman, Teng Zhang, Amarjeet Kumar, Walker R. Ueland, Michael J. Fassler, Jinlong Huang, Xiao Sun, Lisheng Wang, Pengcheng Shi, Maximilian Rokuss, Michael Baumgartner, Yannick Kirchhof, Klaus H. Maier-Hein, Fabian Isensee, Shuolin Liu, Bing Han, Bong Thanh Nguyen, Dong-jin Shin, Park Ji-Woo, Mathew Choi, Kwang-Hyun Uhm, Sung-Jea Ko, Chanwoong Lee, Jaehee Chun, Jin Sung Kim, Minghui Zhang, Hanxiao Zhang, Xin You, Yun Gu, Zhaohong Pan, Xuan Liu, Xiaokun Liang, Markus Tiefenthaler, Enrique Almar-Munoz, Matthias Schwab, Mikhail Kotyushev, Rostislav Epifanov, Marek Wodzinski, Henning Muller, Abdul Qayyum, Moona Mazher, Steven A. Niederer, Zhiwei Wang, Kaixiang Yang, Jintao Ren, Stine Sofia Korreman, Yuchong Gao, Hongye Zeng, Haoyu Zheng, Rui Zheng, Jinghua Yue, Fugen Zhou, Bo Liu, Alexander Cosman, Muxuan Liang, Chang Zhao, Gilbert R. Upchurch Jr., Jun Ma, Yuyin Zhou, Michol A. Cooper, Wei Shao
 
-摘要：在過去的幾十年裡，人們一直努力讓機器像人類一樣對話。大型語言模型 (LLM) 的最新技術讓與機器進行類人對話成為可能，但 LLM 缺乏理解力和可靠性的缺陷已被充分記錄。我們相信消除這個問題的最佳方法是僅將 LLM 作為解析器，將文字轉換為知識，反之亦然，並使用答案集程式設計對此知識進行推理來進行對話。我一直在開發一個基於 LLM 和 ASP 的框架，以實現「理解」人類對話的可靠聊天機器人。這個框架已被用於開發特定任務的聊天機器人以及社交機器人。我未來的研究重點在於讓這些聊天機器人具有可擴充性和可訓練性。
+Multi-class segmentation of the aorta in computed tomography angiography
+(CTA) scans is essential for diagnosing and planning complex endovascular
+treatments for patients with aortic dissections. However, existing methods
+reduce aortic segmentation to a binary problem, limiting their ability to
+measure diameters across different branches and zones. Furthermore, no
+open-source dataset is currently available to support the development of
+multi-class aortic segmentation methods. To address this gap, we organized the
+AortaSeg24 MICCAI Challenge, introducing the first dataset of 100 CTA volumes
+annotated for 23 clinically relevant aortic branches and zones. This dataset
+was designed to facilitate both model development and validation. The challenge
+attracted 121 teams worldwide, with participants leveraging state-of-the-art
+frameworks such as nnU-Net and exploring novel techniques, including cascaded
+models, data augmentation strategies, and custom loss functions. We evaluated
+the submitted algorithms using the Dice Similarity Coefficient (DSC) and
+Normalized Surface Distance (NSD), highlighting the approaches adopted by the
+top five performing teams. This paper presents the challenge design, dataset
+details, evaluation metrics, and an in-depth analysis of the top-performing
+algorithms. The annotated dataset, evaluation code, and implementations of the
+leading methods are publicly available to support further research. All
+resources can be accessed at https://aortaseg24.grand-challenge.org.
 
-##### **Commonsense Reasoning-Aided Autonomous Vehicle Systems**
-2502.09233v1 by Keegan Kimbrell
+摘要：多類別主動脈電腦斷層血管攝影 (CTA) 掃描分割對於診斷和規劃主動脈剝離患者的複雜血管內治療至關重要。然而，現有方法將主動脈分割簡化為二元問題，限制了其測量不同分支和區域直徑的能力。此外，目前沒有開放原始碼數據集可用於支援多類別主動脈分割方法的開發。為了解決此問題，我們組織了 AortaSeg24 MICCAI 挑戰，引入了第一個包含 100 個 CTA 體積的數據集，這些體積針對 23 個臨床上相關的主動脈分支和區域進行了註釋。此數據集旨在促進模型開發和驗證。該挑戰吸引了來自世界各地的 121 個團隊，參與者利用了 nnU-Net 等最先進的框架，並探索了創新技術，包括串聯模型、數據擴充策略和自訂損失函數。我們使用 Dice 相似性係數 (DSC) 和標準化表面距離 (NSD) 評估了提交的演算法，重點介紹了前五名表現最佳團隊採用的方法。本文介紹了挑戰設計、數據集詳細資訊、評估指標以及對表現最佳演算法的深入分析。已公開註釋的數據集、評估程式碼和領先方法的實作，以支援進一步的研究。所有資源都可以在 https://aortaseg24.grand-challenge.org/ 獲得。
 
-Autonomous Vehicle (AV) systems have been developed with a strong reliance on
-machine learning techniques. While machine learning approaches, such as deep
-learning, are extremely effective at tasks that involve observation and
-classification, they struggle when it comes to performing higher level
-reasoning about situations on the road. This research involves incorporating
-commonsense reasoning models that use image data to improve AV systems. This
-will allow AV systems to perform more accurate reasoning while also making them
-more adjustable, explainable, and ethical. This paper will discuss the findings
-so far and motivate its direction going forward.
+##### **Homeomorphism Prior for False Positive and Negative Problem in Medical Image Dense Contrastive Representation Learning**
+2502.05282v1 by Yuting He, Boyu Wang, Rongjun Ge, Yang Chen, Guanyu Yang, Shuo Li
 
-摘要：自動駕駛車輛 (AV) 系統的開發高度依賴機器學習技術。儘管機器學習方法（例如深度學習）在涉及觀察和分類的任務中非常有效，但它們在對路況進行更高層級推理時會遇到困難。本研究涉及整合使用影像資料的常識推理模型，以改善 AV 系統。這將使 AV 系統能夠執行更準確的推理，同時也讓它們更具可調整性、可解釋性和道德性。本文將探討迄今為止的發現，並說明其未來的發展方向。
+Dense contrastive representation learning (DCRL) has greatly improved the
+learning efficiency for image-dense prediction tasks, showing its great
+potential to reduce the large costs of medical image collection and dense
+annotation. However, the properties of medical images make unreliable
+correspondence discovery, bringing an open problem of large-scale false
+positive and negative (FP&N) pairs in DCRL. In this paper, we propose GEoMetric
+vIsual deNse sImilarity (GEMINI) learning which embeds the homeomorphism prior
+to DCRL and enables a reliable correspondence discovery for effective dense
+contrast. We propose a deformable homeomorphism learning (DHL) which models the
+homeomorphism of medical images and learns to estimate a deformable mapping to
+predict the pixels' correspondence under topological preservation. It
+effectively reduces the searching space of pairing and drives an implicit and
+soft learning of negative pairs via a gradient. We also propose a geometric
+semantic similarity (GSS) which extracts semantic information in features to
+measure the alignment degree for the correspondence learning. It will promote
+the learning efficiency and performance of deformation, constructing positive
+pairs reliably. We implement two practical variants on two typical
+representation learning tasks in our experiments. Our promising results on
+seven datasets which outperform the existing methods show our great
+superiority. We will release our code on a companion link:
+https://github.com/YutingHe-list/GEMINI.
 
-##### **Logical foundations of Smart Contracts**
-2502.09232v1 by Kalonji Kalala
+摘要：密集对比表征学习（DCRL）极大地提高了影像密集预测任务的学习效率，显示出其在降低医学影像收集和密集标注的大量成本方面的巨大潜力。然而，医学影像的特性使得对应关系发现不可靠，给 DCRL 带来大规模假阳性和假阴性（FP&N）对的开放性问题。在本文中，我们提出了 GEoMetric vIsual deNse sImilarity（GEMINI）学习，它将同胚先验嵌入 DCRL 中，并针对有效密集对比提供了可靠的对应关系发现。我们提出了一种可变形同胚学习（DHL），它对医学影像的同胚进行建模，并学习估计可变形映射，以预测在拓扑保持下的像素对应关系。它有效地减少了配对的搜索空间，并通过梯度驱动了负对的隐式和软学习。我们还提出了几何语义相似性（GSS），它提取特征中的语义信息，以测量对应关系学习的对齐度。它将促进变形学习的效率和性能，可靠地构建正对。我们在实验中针对两个典型的表征学习任务实现了两个实际变体。我们在七个数据集上的有希望的结果优于现有方法，显示出我们的巨大优势。我们将在配套链接中发布我们的代码：https://github.com/YutingHe-list/GEMINI。
 
-Nowadays, sophisticated domains are emerging which require appropriate
-formalisms to be specified accurately in order to reason about them. One such
-domain is constituted of smart contracts that have emerged in cyber physical
-systems as a way of enforcing formal agreements between components of these
-systems. Smart contracts self-execute to run and share business processes
-through blockchain, in decentralized systems, with many different participants.
-Legal contracts are in many cases complex documents, with a number of
-exceptions, and many subcontracts. The implementation of smart contracts based
-on legal contracts is a long and laborious task, that needs to include all
-actions, procedures, and the effects of actions related to the execution of the
-contract. An ongoing open problem in this area is to formally account for smart
-contracts using a uniform and somewhat universal formalism. This thesis
-proposes logical foundations to smart contracts using the Situation Calculus, a
-logic for reasoning about actions. Situation Calculus is one of the prominent
-logic-based artificial intelligence approaches that provides enough logical
-mechanism to specify and implement dynamic and complex systems such as
-contracts. Situation Calculus is suitable to show how worlds dynamically
-change. Smart contracts are going to be implement with Golog (written en
-Prolog), a Situation Calculus-based programming language for modeling complex
-and dynamic behaviors.
+##### **"It Felt Like I Was Left in the Dark": Exploring Information Needs and Design Opportunities for Family Caregivers of Older Adult Patients in Critical Care Settings**
+2502.05115v1 by Shihan Fu, Bingsheng Yao, Smit Desai, Yuqi Hu, Yuling Sun, Samantha Stonbraker, Yanjun Gao, Elizabeth M. Goldberg, Dakuo Wang
 
-摘要：如今，正在出现需要适当形式化来准确指定以对其进行推理的复杂领域。此类领域之一由在网络物理系统中出现的智能合约构成，作为强制执行这些系统组件之间正式协议的一种方式。智能合约自执行以在去中心化系统中通过区块链运行和共享业务流程，并有许多不同的参与者。法律合约在许多情况下是复杂的文档，有许多例外和许多分包合同。基于法律合约实施智能合约是一项漫长而艰巨的任务，需要包括所有操作、程序以及与执行合约相关的操作效果。该领域的持续开放问题是使用统一且某种程度上通用的形式化来正式说明智能合约。本论文提出了使用情景演算（一种用于推理操作的逻辑）为智能合约提供逻辑基础。情景演算是基于逻辑的人工智能方法之一，提供了足够的逻辑机制来指定和实现动态且复杂的系统，例如合约。情景演算适用于展示世界如何动态变化。智能合约将使用 Golog（以 Prolog 编写的）实现，这是一种基于情景演算的编程语言，用于建模复杂且动态的行为。
+Older adult patients constitute a rapidly growing subgroup of Intensive Care
+Unit (ICU) patients. In these situations, their family caregivers are expected
+to represent the unconscious patients to access and interpret patients' medical
+information. However, caregivers currently have to rely on overloaded
+clinicians for information updates and typically lack the health literacy to
+understand complex medical information. Our project aims to explore the
+information needs of caregivers of ICU older adult patients, from which we can
+propose design opportunities to guide future AI systems. The project begins
+with formative interviews with 11 caregivers to identify their challenges in
+accessing and interpreting medical information; From these findings, we then
+synthesize design requirements and propose an AI system prototype to cope with
+caregivers' challenges. The system prototype has two key features: a timeline
+visualization to show the AI extracted and summarized older adult patients' key
+medical events; and an LLM-based chatbot to provide context-aware informational
+support. We conclude our paper by reporting on the follow-up user evaluation of
+the system and discussing future AI-based systems for ICU caregivers of older
+adults.
 
-##### **Relating Answer Set Programming and Many-sorted Logics for Formal Verification**
-2502.09230v1 by Zachary Hansen
+摘要：老年患者構成加護病房 (ICU) 患者中快速成長的子群。在這些情況下，預期他們的家庭照護者能代表無意識的患者取得並解讀患者的醫療資訊。然而，照護者目前必須依賴工作繁重的臨床醫師提供資訊更新，而且通常缺乏了解複雜醫療資訊的健康素養。我們的專案旨在探索 ICU 老年患者照護者的資訊需求，我們可以根據這些需求提出設計機會，以引導未來的 AI 系統。這個專案從對 11 位照護者的形成性訪談開始，以找出他們在取得和解讀醫療資訊方面的挑戰；根據這些發現，我們接著綜合設計需求，並提出一個 AI 系統原型，以應對照護者的挑戰。這個系統原型具有兩個關鍵特點：一個時間軸視覺化，以顯示 AI 萃取並摘要出的老年患者關鍵醫療事件；以及一個基於 LLM 的聊天機器人，以提供情境感知的資訊支援。我們透過報告系統的後續使用者評估，以及討論未來針對老年人 ICU 照護者的 AI 系統，來總結我們的論文。
 
-Answer Set Programming (ASP) is an important logic programming paradigm
-within the field of Knowledge Representation and Reasoning. As a concise,
-human-readable, declarative language, ASP is an excellent tool for developing
-trustworthy (especially, artificially intelligent) software systems. However,
-formally verifying ASP programs offers some unique challenges, such as
-  1. a lack of modularity (the meanings of rules are difficult to define in
-isolation from the enclosing program),
-  2. the ground-and-solve semantics (the meanings of rules are dependent on the
-input data with which the program is grounded), and
-  3. limitations of existing tools.
-  My research agenda has been focused on addressing these three issues with the
-intention of making ASP verification an accessible, routine task that is
-regularly performed alongside program development. In this vein, I have
-investigated alternative semantics for ASP based on translations into the logic
-of here-and-there and many-sorted first-order logic. These semantics promote a
-modular understanding of logic programs, bypass grounding, and enable us to use
-automated theorem provers to automatically verify properties of programs.
+##### **Mitigating Unintended Memorization with LoRA in Federated Learning for LLMs**
+2502.05087v1 by Thierry Bossy, Julien Vignoud, Tahseen Rabbani, Juan R. Troncoso Pastoriza, Martin Jaggi
 
-摘要：<paragraph>答案集程式設計 (ASP) 是知識表徵與推理領域中一個重要的邏輯程式設計範式。ASP 作為一種簡潔、人類可讀、宣告式的語言，是開發值得信賴的 (特別是人工智慧) 軟體系統的絕佳工具。然而，正式驗證 ASP 程式提供了一些獨特的挑戰，例如
-  1. 缺乏模組化 (規則的含義難以與封閉程式隔離定義)，
-  2. 基礎與求解語意 (規則的含義取決於程式基礎的輸入資料)，以及
-  3. 現有工具的限制。
-  我的研究議程一直專注於解決這三個問題，目的是讓 ASP 驗證成為一個可存取的、例行任務，並在程式開發過程中定期執行。在這個脈絡下，我研究了基於翻譯成此處和彼處邏輯以及多種排序一階邏輯的 ASP 替代語意。這些語意促進了邏輯程式的模組化理解，繞過基礎，並使我們能夠使用自動化定理證明器自動驗證程式的屬性。</paragraph>
+Federated learning (FL) is a popular paradigm for collaborative training
+which avoids direct data exposure between clients. However, data privacy issues
+still remain: FL-trained large language models are capable of memorizing and
+completing phrases and sentences contained in training data when given with
+their prefixes. Thus, it is possible for adversarial and honest-but-curious
+clients to recover training data of other participants simply through targeted
+prompting. In this work, we demonstrate that a popular and simple fine-tuning
+strategy, low-rank adaptation (LoRA), reduces memorization during FL up to a
+factor of 10. We study this effect by performing a medical question-answering
+fine-tuning task and injecting multiple replicas of out-of-distribution
+sensitive sequences drawn from an external clinical dataset. We observe a
+reduction in memorization for a wide variety of Llama 2 and 3 models, and find
+that LoRA can reduce memorization in centralized learning as well. Furthermore,
+we show that LoRA can be combined with other privacy-preserving techniques such
+as gradient clipping and Gaussian noising, secure aggregation, and Goldfish
+loss to further improve record-level privacy while maintaining performance.
 
-##### **Computational methods for Dynamic Answer Set Programming**
-2502.09228v1 by Susana Hahn
+摘要：聯邦學習 (FL) 是一種流行的協作訓練範例，可避免客戶端之間直接公開資料。然而，資料隱私問題仍然存在：經過 FL 訓練的大型語言模型能夠記憶並完成訓練資料中包含的片語和句子，只要給予其前綴即可。因此，對抗和誠實但好奇的客戶端有可能僅透過目標提示來恢復其他參與者的訓練資料。在這項工作中，我們證明了一種流行且簡單的微調策略，低秩適應 (LoRA)，可將 FL 期間的記憶減少多達 10 倍。我們透過執行醫學問答微調任務並注入從外部臨床資料集抽取的非分佈敏感序列的多次複製品來研究此效應。我們觀察到各種 Llama 2 和 3 模型的記憶力降低，並發現 LoRA 也能減少集中式學習中的記憶力。此外，我們展示 LoRA 可以與其他隱私保護技術結合使用，例如梯度裁剪和高斯雜訊、安全聚合和 Goldfish 損失，以進一步改善記錄級隱私，同時維持效能。
 
-In our daily lives and industrial settings, we often encounter dynamic
-problems that require reasoning over time and metric constraints. These include
-tasks such as scheduling, routing, and production sequencing. Dynamic logics
-have traditionally addressed these needs but often lack the flexibility and
-integration required for comprehensive problem modeling. This research aims to
-extend Answer Set Programming (ASP), a powerful declarative problem-solving
-approach, to handle dynamic domains effectively. By integrating concepts from
-dynamic, temporal, and metric logics into ASP, we seek to develop robust
-systems capable of modeling complex dynamic problems and performing efficient
-reasoning tasks, thereby enhancing ASPs applicability in industrial contexts.
+##### **MedMimic: Physician-Inspired Multimodal Fusion for Early Diagnosis of Fever of Unknown Origin**
+2502.04794v1 by Minrui Chen, Yi Zhou, Huidong Jiang, Yuhan Zhu, Guanjie Zou, Minqi Chen, Rong Tian, Hiroto Saigo
 
-摘要：在我們的日常生活和工業環境中，我們經常會遇到動態問題，需要隨著時間和公制約束進行推理。這些問題包括排程、路由和生產順序等任務。動態邏輯傳統上解決了這些需求，但通常缺乏全面問題建模所需的靈活性與整合性。本研究旨在擴展強大的宣告式問題解決方法「Answer Set Programming (ASP)」，以有效處理動態領域。透過將動態、時態和公制邏輯的概念整合到 ASP 中，我們尋求開發強健的系統，能夠建模複雜的動態問題並執行有效的推理任務，進而增強 ASP 在工業環境中的適用性。
+Fever of unknown origin FUO remains a diagnostic challenge. MedMimic is
+introduced as a multimodal framework inspired by real-world diagnostic
+processes. It uses pretrained models such as DINOv2, Vision Transformer, and
+ResNet-18 to convert high-dimensional 18F-FDG PET/CT imaging into
+low-dimensional, semantically meaningful features. A learnable
+self-attention-based fusion network then integrates these imaging features with
+clinical data for classification. Using 416 FUO patient cases from Sichuan
+University West China Hospital from 2017 to 2023, the multimodal fusion
+classification network MFCN achieved macro-AUROC scores ranging from 0.8654 to
+0.9291 across seven tasks, outperforming conventional machine learning and
+single-modality deep learning methods. Ablation studies and five-fold
+cross-validation further validated its effectiveness. By combining the
+strengths of pretrained large models and deep learning, MedMimic offers a
+promising solution for disease classification.
 
-##### **Generating Causally Compliant Counterfactual Explanations using ASP**
-2502.09226v1 by Sopam Dasgupta
+摘要：不明原因發燒 (FUO) 仍然是診斷上的挑戰。MedMimic 是一個多模式架構，靈感來自於真實世界的診斷過程。它使用預先訓練的模型，例如 DINOv2、視覺轉換器和 ResNet-18，將高維 18F-FDG PET/CT 影像轉換為低維、語義有意義的特徵。一個可學習的自注意力融合網路接著將這些影像特徵與臨床資料整合，用於分類。使用 2017 年至 2023 年四川大學華西醫院的 416 個 FUO 病患病例，多模式融合分類網路 MFCN 在七項任務中達到了 0.8654 到 0.9291 的巨觀 AUROC 分數，優於傳統機器學習和單一模式深度學習方法。消融研究和五倍交叉驗證進一步驗證了其有效性。MedMimic 結合了預先訓練的大模型和深度學習的優點，為疾病分類提供了一個有前景的解決方案。
 
-This research is focused on generating achievable counterfactual
-explanations. Given a negative outcome computed by a machine learning model or
-a decision system, the novel CoGS approach generates (i) a counterfactual
-solution that represents a positive outcome and (ii) a path that will take us
-from the negative outcome to the positive one, where each node in the path
-represents a change in an attribute (feature) value. CoGS computes paths that
-respect the causal constraints among features. Thus, the counterfactuals
-computed by CoGS are realistic. CoGS utilizes rule-based machine learning
-algorithms to model causal dependencies between features. The paper discusses
-the current status of the research and the preliminary results obtained.
+##### **MedGNN: Towards Multi-resolution Spatiotemporal Graph Learning for Medical Time Series Classification**
+2502.04515v1 by Wei Fan, Jingru Fei, Dingyu Guo, Kun Yi, Xiaozhuang Song, Haolong Xiang, Hangting Ye, Min Li
 
-摘要：本研究重點在於產生可實現的反事實解釋。給定由機器學習模型或決策系統計算出的負面結果，創新的 CoGS 方法會產生 (i) 代表正面結果的反事實解，以及 (ii) 一條將我們從負面結果帶到正面結果的途徑，其中途徑中的每個節點代表屬性 (特徵) 值的變化。CoGS 計算出符合特徵之間因果關係的途徑。因此，CoGS 計算出的反事實是切合實際的。CoGS 利用基於規則的機器學習演算法來建模特徵之間的因果關係。本文探討了研究的現況和獲得的初步結果。
+Medical time series has been playing a vital role in real-world healthcare
+systems as valuable information in monitoring health conditions of patients.
+Accurate classification for medical time series, e.g., Electrocardiography
+(ECG) signals, can help for early detection and diagnosis. Traditional methods
+towards medical time series classification rely on handcrafted feature
+extraction and statistical methods; with the recent advancement of artificial
+intelligence, the machine learning and deep learning methods have become more
+popular. However, existing methods often fail to fully model the complex
+spatial dynamics under different scales, which ignore the dynamic
+multi-resolution spatial and temporal joint inter-dependencies. Moreover, they
+are less likely to consider the special baseline wander problem as well as the
+multi-view characteristics of medical time series, which largely hinders their
+prediction performance. To address these limitations, we propose a
+Multi-resolution Spatiotemporal Graph Learning framework, MedGNN, for medical
+time series classification. Specifically, we first propose to construct
+multi-resolution adaptive graph structures to learn dynamic multi-scale
+embeddings. Then, to address the baseline wander problem, we propose Difference
+Attention Networks to operate self-attention mechanisms on the finite
+difference for temporal modeling. Moreover, to learn the multi-view
+characteristics, we utilize the Frequency Convolution Networks to capture
+complementary information of medical time series from the frequency domain. In
+addition, we introduce the Multi-resolution Graph Transformer architecture to
+model the dynamic dependencies and fuse the information from different
+resolutions. Finally, we have conducted extensive experiments on multiple
+medical real-world datasets that demonstrate the superior performance of our
+method. Our Code is available.
 
-##### **Order-Sorted Intensional Logic: Expressing Subtyping Polymorphism with Typing Assertions and Quantification over Concepts**
-2502.09224v1 by Đorđe Marković, Marc Denecker
+摘要：<paragraph>醫療時間序列在真實世界的醫療保健系統中扮演著至關重要的角色，作為監控患者健康狀況的寶貴資訊。
+準確分類醫療時間序列，例如心電圖 (ECG) 訊號，有助於早期偵測和診斷。傳統的醫療時間序列分類方法仰賴手工特徵萃取和統計方法；隨著人工智慧的最新進展，機器學習和深度學習方法變得更為普及。然而，現有方法通常無法完全建模不同尺度下的複雜空間動態，忽略了動態多解析度空間和時間關節相互依賴性。此外，它們不太可能考慮特殊的基線漂移問題以及醫療時間序列的多視角特性，這在很大程度上阻礙了它們的預測效能。為了解決這些限制，我們提出了一個多解析度時空圖形學習架構 MedGNN，用於醫療時間序列分類。具體來說，我們首先提出構建多解析度自適應圖形結構以學習動態多尺度嵌入。然後，為了解決基線漂移問題，我們提出差分注意力網路，對時間建模的有限差分運算自注意力機制。此外，為了學習多視角特性，我們利用頻率卷積網路從頻域擷取醫療時間序列的互補資訊。此外，我們引入了多解析度圖形Transformer架構來建模動態依賴性，並融合來自不同解析度的資訊。最後，我們對多個醫療真實世界資料集進行了廣泛的實驗，證明了我們方法的優異效能。我們的程式碼已公開。</paragraph>
 
-Subtyping, also known as subtype polymorphism, is a concept extensively
-studied in programming language theory, delineating the substitutability
-relation among datatypes. This property ensures that programs designed for
-supertype objects remain compatible with their subtypes.
-  In this paper, we explore the capability of order-sorted logic for utilizing
-these ideas in the context of Knowledge Representation. We recognize two
-fundamental limitations: First, the inability of this logic to address the
-concept rather than the value of non-logical symbols, and second, the lack of
-language constructs for constraining the type of terms. Consequently, we
-propose guarded order-sorted intensional logic, where guards are language
-constructs for annotating typing information and intensional logic provides
-support for quantification over concepts.
+##### **Integrating Generative Artificial Intelligence in ADRD: A Framework for Streamlining Diagnosis and Care in Neurodegenerative Diseases**
+2502.06842v1 by Andrew G. Breithaupt, Alice Tang, Bruce L. Miller, Pedro Pinheiro-Chagas
 
-摘要：子類型化，也稱為子類型多態性，是一個在程式語言理論中廣泛研究的概念，用於描述資料類型之間的可替換關係。此特性可確保為超類型物件設計的程式與其子類型相容。
-在本文中，我們探討了使用排序邏輯在知識表徵中運用這些想法的能力。我們發現了兩個基本限制：首先，此邏輯無法處理非邏輯符號的概念而非值，其次，缺乏約束項類型的語言結構。因此，我們提出了受保護的排序邏輯，其中保護是註解類型資訊的語言結構，而內涵邏輯則支援對概念量化。
+Healthcare systems are struggling to meet the growing demand for neurological
+care, with challenges particularly acute in Alzheimer's disease and related
+dementias (ADRD). While artificial intelligence research has often focused on
+identifying patterns beyond human perception, implementing such predictive
+capabilities remains challenging as clinicians cannot readily verify insights
+they cannot themselves detect. We propose that large language models (LLMs)
+offer more immediately practical applications by enhancing clinicians'
+capabilities in three critical areas: comprehensive data collection,
+interpretation of complex clinical information, and timely application of
+relevant medical knowledge. These challenges stem from limited time for proper
+diagnosis, growing data complexity, and an overwhelming volume of medical
+literature that exceeds any clinician's capacity to fully master. We present a
+framework for responsible AI integration that leverages LLMs' ability to
+communicate effectively with both patients and providers while maintaining
+human oversight. This approach prioritizes standardized, high-quality data
+collection to enable a system that learns from every patient encounter while
+incorporating the latest clinical evidence, continuously improving care
+delivery. We begin to address implementation challenges and initiate important
+discussions around ethical considerations and governance needs. While developed
+for ADRD, this roadmap provides principles for responsible AI integration
+across neurology and other medical specialties, with potential to improve
+diagnostic accuracy, reduce care disparities, and advance clinical knowledge
+through a learning healthcare system.
 
-##### **ASP-driven User-interaction with Clinguin**
-2502.09222v1 by Alexander Beiser, Susana Hahn, Torsten Schaub
+摘要：醫療體系正努力滿足日益增長的神經照護需求，其中阿茲海默症和相關失智症 (ADRD) 的挑戰特別嚴重。雖然人工智慧研究通常專注於識別人類感知之外的模式，但實作此類預測功能仍然具有挑戰性，因為臨床醫生無法輕易驗證他們自己無法偵測到的見解。我們提出大型語言模型 (LLM) 可透過提升臨床醫生在三個關鍵領域的能力，提供更直接且實用的應用：全面的資料收集、複雜臨床資訊的詮釋，以及適時應用相關的醫學知識。這些挑戰源自於適當診斷時間有限、資料複雜性日益增加，以及龐大的醫學文獻量超過任何臨床醫生所能完全掌握的容量。我們提出了一個負責任的 AI 整合架構，利用 LLM 與患者和提供者有效溝通的能力，同時維持人為監督。此方法優先考慮標準化、高品質的資料收集，以建立一個從每次患者接觸中學習的系統，同時納入最新的臨床證據，持續改善照護提供。我們開始探討實作挑戰，並展開關於倫理考量和治理需求的重要討論。儘管是為 ADRD 所開發，此藍圖提供了神經科和其他醫學專科負責任 AI 整合的原則，有潛力透過學習型醫療保健系統改善診斷準確性、減少照護差異，並推進臨床知識。
 
-We present clinguin, a system for ASP-driven user interface design. Clinguin
-streamlines the development of user interfaces for ASP developers by letting
-them build interactive prototypes directly in ASP, eliminating the need for
-separate frontend languages. To this end, clinguin uses a few dedicated
-predicates to define user interfaces and the treatment of user-triggered
-events. This simple design greatly facilitates the specification of user
-interactions with an ASP system, in our case clingo.
+##### **Primary Care Diagnoses as a Reliable Predictor for Orthopedic Surgical Interventions**
+2502.04423v1 by Khushboo Verma, Alan Michels, Ergi Gumusaneli, Shilpa Chitnis, Smita Sinha Kumar, Christopher Thompson, Lena Esmail, Guruprasath Srinivasan, Chandini Panchada, Sushovan Guha, Satwant Kumar
 
-摘要：我們提出 clinguin，一個用於 ASP 驅動使用者介面設計的系統。Clinguin 透過讓 ASP 開發人員直接在 ASP 中建立互動式原型，簡化了使用者介面的開發，消除了對個別前端語言的需求。為此，clinguin 使用一些專用的謂詞來定義使用者介面和處理使用者觸發的事件。這個簡單的設計極大地簡化了使用者與 ASP 系統互動的規範，在我們的案例中是 clingo。
+Referral workflow inefficiencies, including misaligned referrals and delays,
+contribute to suboptimal patient outcomes and higher healthcare costs. In this
+study, we investigated the possibility of predicting procedural needs based on
+primary care diagnostic entries, thereby improving referral accuracy,
+streamlining workflows, and providing better care to patients. A de-identified
+dataset of 2,086 orthopedic referrals from the University of Texas Health at
+Tyler was analyzed using machine learning models built on Base General
+Embeddings (BGE) for semantic extraction. To ensure real-world applicability,
+noise tolerance experiments were conducted, and oversampling techniques were
+employed to mitigate class imbalance. The selected optimum and parsimonious
+embedding model demonstrated high predictive accuracy (ROC-AUC: 0.874, Matthews
+Correlation Coefficient (MCC): 0.540), effectively distinguishing patients
+requiring surgical intervention. Dimensionality reduction techniques confirmed
+the model's ability to capture meaningful clinical relationships. A threshold
+sensitivity analysis identified an optimal decision threshold (0.30) to balance
+precision and recall, maximizing referral efficiency. In the predictive
+modeling analysis, the procedure rate increased from 11.27% to an optimal
+60.1%, representing a 433% improvement with significant implications for
+operational efficiency and healthcare revenue.
+  The results of our study demonstrate that referral optimization can enhance
+primary and surgical care integration. Through this approach, precise and
+timely predictions of procedural requirements can be made, thereby minimizing
+delays, improving surgical planning, and reducing administrative burdens. In
+addition, the findings highlight the potential of clinical decision support as
+a scalable solution for improving patient outcomes and the efficiency of the
+healthcare system.
 
-##### **Pearce's Characterisation in an Epistemic Domain**
-2502.09221v1 by Ezgi Iraz Su
+摘要：轉診流程效率低落，包括轉診不當和延誤，
+導致次優的患者結果和更高的醫療保健成本。在這
+項研究中，我們探討了根據初級保健診斷條目預測程序需求的可能性，從而提高轉診準確性，
+簡化工作流程，並為患者提供更好的照護。一個去識別化
+德克薩斯大學健康中心的 2,086 個骨科轉診的資料集
+泰勒使用建立在基本通用
+語義提取的嵌入 (BGE) 上的機器學習模型進行分析。為了確保現實世界的適用性，
+進行了噪聲容忍度實驗，並採用了過採樣技術來減輕類別不平衡。所選的最佳和簡約
+嵌入模型展示了高預測準確度 (ROC-AUC：0.874，馬修斯
+相關系數 (MCC)：0.540)，有效區分需要手術干預的患者。降維
+技術證實了模型捕捉有意義的臨床關係的能力。閾值
+敏感性分析確定了一個最佳決策閾值 (0.30) 來平衡
+精確度和召回率，最大化轉診效率。在預測中
+建模分析中，程序率從 11.27% 增加到最佳的
+60.1%，代表 433% 的改進，對運營效率和醫療保健收入具有重大影響。
+我們研究的結果表明，轉診優化可以增強
+初級和外科護理整合。通過這種方法，可以對程序需求進行準確及時的預測，從而最大程度地減少
+延誤，改善手術計劃，並減輕行政負擔。此外，研究結果強調了臨床決策支持作為
+一個可擴展的解決方案的潛力，用於改善患者結果和醫療保健系統的效率。
 
-Answer-set programming (ASP) is a successful problem-solving approach in
-logic-based AI. In ASP, problems are represented as declarative logic programs,
-and solutions are identified through their answer sets. Equilibrium logic (EL)
-is a general-purpose nonmonotonic reasoning formalism, based on a monotonic
-logic called here-and-there logic. EL was basically proposed by Pearce as a
-foundational framework of ASP. Epistemic specifications (ES) are extensions of
-ASP-programs with subjective literals. These new modal constructs in the
-ASP-language make it possible to check whether a regular literal of ASP is true
-in every (or some) answer-set of a program. ES-programs are interpreted by
-world-views, which are essentially collections of answer-sets. (Reflexive)
-autoepistemic logic is a nonmonotonic formalism, modeling self-belief
-(knowledge) of ideally rational agents. A relatively new semantics for ES is
-based on a combination of EL and (reflexive) autoepistemic logic. In this
-paper, we first propose an overarching framework in the epistemic ASP domain.
-We then establish a correspondence between existing (reflexive) (auto)epistemic
-equilibrium logics and our easily-adaptable comprehensive framework, building
-on Pearce's characterisation of answer-sets as equilibrium models. We achieve
-this by extending Ferraris' work on answer sets for propositional theories to
-the epistemic case and reveal the relationship between some ES-semantic
-proposals.
+##### **Automatic quantification of breast cancer biomarkers from multiple 18F-FDG PET image segmentation**
+2502.04083v1 by Tewele W. Tareke, Neree Payan, Alexandre Cochet, Laurent Arnould, Benoit Presles, Jean-Marc Vrigneaud, Fabrice Meriaudeau, Alain Lalande
 
-摘要：<paragraph>答案集程式設計（ASP）是基於邏輯的人工智慧中一種成功的問題解決方法。在 ASP 中，問題表示為宣告式邏輯程式，並透過其答案集來找出解答。平衡邏輯（EL）是一種通用的非單調推理形式主義，基於一種稱為此處和彼處邏輯的單調邏輯。EL 基本是由 Pearce 作為 ASP 的基礎架構所提出。知識規範（ES）是 ASP 程式與主觀文字的延伸。ASP 語言中的這些新模態建構使得可以檢查 ASP 的常規文字是否在程式的每個（或某些）答案集中為真。ES 程式由世界觀來詮釋，其本質上是答案集的集合。（反身）自認識邏輯是一種非單調形式主義，用來建模理想理性主體的自信念（知識）。ES 的一種相對新的語意是基於 EL 和（反身）自認識邏輯的組合。在本文中，我們首先提出一個涵蓋知識 ASP 領域的架構。然後，我們建立現有（反身）（自）認識平衡邏輯與我們容易適應的綜合架構之間的對應關係，建立在 Pearce 將答案集描述為平衡模型的特性之上。我們透過將 Ferraris 在命題理論的答案集上的工作延伸到知識案例，並揭示一些 ES 語義提案之間的關係來達成這一點。</paragraph>
+Neoadjuvant chemotherapy (NAC) has become a standard clinical practice for
+tumor downsizing in breast cancer with 18F-FDG Positron Emission Tomography
+(PET). Our work aims to leverage PET imaging for the segmentation of breast
+lesions. The focus is on developing an automated system that accurately
+segments primary tumor regions and extracts key biomarkers from these areas to
+provide insights into the evolution of breast cancer following the first course
+of NAC. 243 baseline 18F-FDG PET scans (PET_Bl) and 180 follow-up 18F-FDG PET
+scans (PET_Fu) were acquired before and after the first course of NAC,
+respectively. Firstly, a deep learning-based breast tumor segmentation method
+was developed. The optimal baseline model (model trained on baseline exams) was
+fine-tuned on 15 follow-up exams and adapted using active learning to segment
+tumor areas in PET_Fu. The pipeline computes biomarkers such as maximum
+standardized uptake value (SUVmax), metabolic tumor volume (MTV), and total
+lesion glycolysis (TLG) to evaluate tumor evolution between PET_Fu and PET_Bl.
+Quality control measures were employed to exclude aberrant outliers. The nnUNet
+deep learning model outperformed in tumor segmentation on PET_Bl, achieved a
+Dice similarity coefficient (DSC) of 0.89 and a Hausdorff distance (HD) of 3.52
+mm. After fine-tuning, the model demonstrated a DSC of 0.78 and a HD of 4.95 mm
+on PET_Fu exams. Biomarkers analysis revealed very strong correlations whatever
+the biomarker between manually segmented and automatically predicted regions.
+The significant average decrease of SUVmax, MTV and TLG were 5.22, 11.79 cm3
+and 19.23 cm3, respectively. The presented approach demonstrates an automated
+system for breast tumor segmentation from 18F-FDG PET. Thanks to the extracted
+biomarkers, our method enables the automatic assessment of cancer progression.
 
-##### **Graphical Conditions for the Existence, Unicity and Number of Regular Models**
-2502.09220v1 by Van-Giang Trinh, Belaid Benhamou, Sylvain Soliman, François Fages
+摘要：新辅助化疗 (NAC) 已成为乳腺癌中采用 18F-FDG 正电子发射断层扫描 (PET) 进行肿瘤缩小的标准临床实践。我们的工作旨在利用 PET 影像分割乳腺病变。重点在于开发一个自动系统，该系统可以准确分割原发性肿瘤区域并从这些区域提取关键生物标记，以深入了解乳腺癌在第一疗程 NAC 后的演变。分别在第一疗程 NAC 之前和之后采集了 243 例基线 18F-FDG PET 扫描 (PET_Bl) 和 180 例随访 18F-FDG PET 扫描 (PET_Fu)。首先，开发了一种基于深度学习的乳腺肿瘤分割方法。对 15 例随访检查对最优基线模型（在基线检查中训练的模型）进行了微调，并使用主动学习对 PET_Fu 中的肿瘤区域进行了分割。该管道计算诸如最大标准摄取值 (SUVmax)、代谢肿瘤体积 (MTV) 和总病灶糖酵解 (TLG) 等生物标记，以评估 PET_Fu 和 PET_Bl 之间的肿瘤演变。采用质量控制措施来排除异常值。nnUNet 深度学习模型在 PET_Bl 上的肿瘤分割方面表现出色，达到 0.89 的 Dice 相似性系数 (DSC) 和 3.52 毫米的 Hausdorff 距离 (HD)。微调后，该模型在 PET_Fu 检查中显示出 0.78 的 DSC 和 4.95 毫米的 HD。无论手动分割区域和自动预测区域之间的生物标记如何，生物标记分析都显示出非常强的相关性。SUVmax、MTV 和 TLG 的平均显着下降分别为 5.22、11.79 cm3 和 19.23 cm3。所提出的方法展示了一个用于从 18F-FDG PET 分割乳腺肿瘤的自动化系统。由于提取了生物标记，我们的方法能够自动评估癌症进展。
 
-The regular models of a normal logic program are a particular type of partial
-(i.e. 3-valued) models which correspond to stable partial models with minimal
-undefinedness. In this paper, we explore graphical conditions on the dependency
-graph of a finite ground normal logic program to analyze the existence, unicity
-and number of regular models for the program. We show three main results: 1) a
-necessary condition for the existence of non-trivial (i.e. non-2-valued)
-regular models, 2) a sufficient condition for the unicity of regular models,
-and 3) two upper bounds for the number of regular models based on positive
-feedback vertex sets. The first two conditions generalize the finite cases of
-the two existing results obtained by You and Yuan (1994) for normal logic
-programs with well-founded stratification. The third result is also new to the
-best of our knowledge. Key to our proofs is a connection that we establish
-between finite ground normal logic programs and Boolean network theory.
+##### **Generalize Drug Response Prediction by Latent Independent Projection for Asymmetric Constrained Domain Generalization**
+2502.04034v1 by Ran Song, Yinpu Bai, Hui Liu
+
+The accurate prediction of drug responses remains a formidable challenge,
+particularly at the single-cell level and in clinical treatment contexts. Some
+studies employ transfer learning techniques to predict drug responses in
+individual cells and patients, but they require access to target-domain data
+during training, which is often unavailable or only obtainable in future. In
+this study, we propose a novel domain generalization framework, termed
+panCancerDR, to address this challenge. We conceptualize each cancer type as a
+distinct source domain, with its cell lines serving as domain-specific samples.
+Our primary objective is to extract domain-invariant features from the
+expression profiles of cell lines across diverse cancer types, thereby
+generalize the predictive capacity to out-of-distribution samples. To enhance
+robustness, we introduce a latent independence projection (LIP) module that
+encourages the encoder to extract informative yet non-redundant features. Also,
+we propose an asymmetric adaptive clustering constraint, which clusters
+drug-sensitive samples into a compact group while drives resistant samples
+dispersed across separate clusters in the latent space. Our empirical
+experiments demonstrate that panCancerDR effectively learns task-relevant
+features from diverse source domains, and achieves accurate predictions of drug
+response for unseen cancer type during training. Furthermore, when evaluated on
+single-cell and patient-level prediction tasks, our model-trained solely on in
+vitro cell line data without access to target-domain information-consistently
+outperforms and matched current state-of-the-art methods. These findings
+highlights the potential of our method for real-world clinical applications.
 
-摘要：正规模型的常规模型是一种特殊类型的局部模型（即 3 值）模型，它对应于具有最小未定义性的稳定局部模型。在本文中，我们探索了有限接地正规逻辑程序的依赖图上的图形条件，以分析程序的正规模型的存在性、唯一性和数量。我们展示了三个主要结果：1) 非平凡（即非 2 值）正规模型存在的必要条件，2) 正规模型唯一性的充分条件，3) 基于正反馈顶点集的正规模型数目的两个上限。前两个条件概括了 You 和 Yuan (1994) 为具有良好基础分层的正规逻辑程序获得的两个现有结果的有限情况。据我们所知，第三个结果也是新的。我们证明的关键是我们在有限接地正规逻辑程序和布尔网络理论之间建立的联系。
+摘要：<paragraph>準確預測藥物反應仍然是一項艱鉅的挑戰，特別是在單細胞層級和臨床治療背景中。一些研究採用遷移學習技術來預測個別細胞和患者的藥物反應，但它們需要在訓練期間存取目標網域資料，而這些資料通常無法取得，或只能在未來取得。在這項研究中，我們提出一個新穎的網域概化架構，稱為 panCancerDR，以應對這項挑戰。我們將每種類型的癌症概念化為一個不同的來源網域，其細胞株作為特定網域的樣本。我們的首要目標是從不同癌症類型的細胞株表現特徵中萃取網域不變特徵，從而將預測能力概化到分布外的樣本。為了增強穩健性，我們引入一個潛在獨立投影 (LIP) 模組，鼓勵編碼器萃取有資訊但非冗餘的特徵。此外，我們提出一個非對稱自適應聚類約束，將對藥物敏感的樣本聚類到一個緊湊的群組中，同時驅動抗藥性樣本分散在潛在空間中的不同群組中。我們的實證實驗證明，panCancerDR 有效地從不同的來源網域學習與任務相關的特徵，並在訓練期間對未見的癌症類型實現準確的藥物反應預測。此外，當在單細胞和患者層級預測任務中進行評估時，我們的模型僅在體外細胞株資料上訓練，而沒有存取目標網域資訊，始終優於並符合當前的最新方法。這些發現突顯了我們的方法在實際臨床應用中的潛力。</paragraph>
 
-##### **Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**
-2502.09218v1 by Flavio Bertini, Alessandro Dal Palù, Federica Zaglio, Francesco Fabiano, Andrea Formisano
+##### **MedRAG: Enhancing Retrieval-augmented Generation with Knowledge Graph-Elicited Reasoning for Healthcare Copilot**
+2502.04413v1 by Xuejiao Zhao, Siyan Liu, Su-Yin Yang, Chunyan Miao
 
-This paper presents a complete explainable system that interprets a set of
-data, abstracts the underlying features and describes them in a natural
-language of choice. The system relies on two crucial stages: (i) identifying
-emerging properties from data and transforming them into abstract concepts, and
-(ii) converting these concepts into natural language. Despite the impressive
-natural language generation capabilities demonstrated by Large Language Models,
-their statistical nature and the intricacy of their internal mechanism still
-force us to employ these techniques as black boxes, forgoing trustworthiness.
-Developing an explainable pipeline for data interpretation would allow
-facilitating its use in safety-critical environments like processing medical
-information and allowing non-experts and visually impaired people to access
-narrated information. To this end, we believe that the fields of knowledge
-representation and automated reasoning research could present a valid
-alternative. Expanding on prior research that tackled the first stage (i), we
-focus on the second stage, named Concept2Text. Being explainable, data
-translation is easily modeled through logic-based rules, once again emphasizing
-the role of declarative programming in achieving AI explainability. This paper
-explores a Prolog/CLP-based rewriting system to interpret concepts-articulated
-in terms of classes and relations, plus common knowledge-derived from a generic
-ontology, generating natural language text. Its main features include
-hierarchical tree rewritings, modular multilingual generation, support for
-equivalent variants across semantic, grammar, and lexical levels, and a
-transparent rule-based system. We outline the architecture and demonstrate its
-flexibility through some examples capable of generating numerous diverse and
-equivalent rewritings based on the input concept.
+Retrieval-augmented generation (RAG) is a well-suited technique for
+retrieving privacy-sensitive Electronic Health Records (EHR). It can serve as a
+key module of the healthcare copilot, helping reduce misdiagnosis for
+healthcare practitioners and patients. However, the diagnostic accuracy and
+specificity of existing heuristic-based RAG models used in the medical domain
+are inadequate, particularly for diseases with similar manifestations. This
+paper proposes MedRAG, a RAG model enhanced by knowledge graph (KG)-elicited
+reasoning for the medical domain that retrieves diagnosis and treatment
+recommendations based on manifestations. MedRAG systematically constructs a
+comprehensive four-tier hierarchical diagnostic KG encompassing critical
+diagnostic differences of various diseases. These differences are dynamically
+integrated with similar EHRs retrieved from an EHR database, and reasoned
+within a large language model. This process enables more accurate and specific
+decision support, while also proactively providing follow-up questions to
+enhance personalized medical decision-making. MedRAG is evaluated on both a
+public dataset DDXPlus and a private chronic pain diagnostic dataset (CPDD)
+collected from Tan Tock Seng Hospital, and its performance is compared against
+various existing RAG methods. Experimental results show that, leveraging the
+information integration and relational abilities of the KG, our MedRAG provides
+more specific diagnostic insights and outperforms state-of-the-art models in
+reducing misdiagnosis rates. Our code will be available at
+https://github.com/SNOWTEAM2023/MedRAG
 
-摘要：<paragraph>這篇論文提出了一個完整的可解釋系統，它可以解釋一組資料，抽象出基礎特徵，並以選擇的自然語言描述它們。系統依賴兩個關鍵階段：(i) 從資料中識別新興屬性，並將它們轉換為抽象概念，以及 (ii) 將這些概念轉換為自然語言。儘管大型語言模型展示了令人印象深刻的自然語言生成能力，但它們的統計性質和內部機制的複雜性仍然迫使我們將這些技術用作黑盒子，放棄可信度。開發一個可解釋的資料解釋管道將有助於促進在安全關鍵環境中使用它，例如處理醫療資訊，並允許非專家和視障人士存取敘述資訊。為此，我們相信知識表示和自動推理研究領域可以提出一個有效的替代方案。在擴展解決第一階段 (i) 的先前研究的基礎上，我們專注於第二階段，稱為 Concept2Text。由於具有可解釋性，資料翻譯很容易透過基於邏輯的規則建模，再次強調宣告式程式設計在實現 AI 可解釋性中的作用。本文探討了一個基於 Prolog/CLP 的重寫系統，以解釋概念，這些概念以類別和關係的形式表達，再加上從通用本体衍生的常識，產生自然語言文字。它的主要特點包括階層樹重寫、模組化多語言生成、支援語義、語法和詞彙層面的等效變體，以及一個透明的基於規則的系統。我們概述了架構，並透過一些範例展示了它的靈活性，這些範例能夠根據輸入概念生成許多不同的等效重寫。</paragraph>
+摘要：檢索增強生成 (RAG) 是一種適用於檢索隱私敏感的電子健康記錄 (EHR) 的技術。它可以作為醫療保健副駕駛的一個關鍵模組，協助減少醫療保健從業人員和患者的誤診。然而，在醫療領域中使用的現有基於啟發法的 RAG 模型的診斷準確性和特異性不足，特別是對於具有類似表現的疾病。本文提出 MedRAG，一種由知識圖譜 (KG) 引發的推理增強的 RAG 模型，用於醫療領域，它根據表現檢索診斷和治療建議。MedRAG 系統性地構建了一個全面的四層階層式診斷 KG，涵蓋各種疾病的關鍵診斷差異。這些差異與從 EHR 資料庫中檢索到的類似 EHR 動態整合，並在大型語言模型中進行推理。這個過程可以實現更準確和具體的決策支援，同時主動提供後續問題，以增強個人化醫療決策制定。MedRAG 在公共資料集 DDXPlus 和從陳篤生醫院收集的私人慢性疼痛診斷資料集 (CPDD) 上進行評估，並將其效能與各種現有 RAG 方法進行比較。實驗結果顯示，利用 KG 的資訊整合和關係能力，我們的 MedRAG 提供了更具體的診斷見解，並在降低誤診率方面優於最先進的模型。我們的程式碼將在 https://github.com/SNOWTEAM2023/MedRAG 提供
 
-##### **Mind the Gaps: Logical English, Prolog, and Multi-agent Systems for Autonomous Vehicles**
-2502.09216v1 by Galileo Sartor, Adam Wyner, Giuseppe Contissa
+##### **Transforming Multimodal Models into Action Models for Radiotherapy**
+2502.04408v1 by Matteo Ferrante, Alessandra Carosi, Rolando Maria D Angelillo, Nicola Toschi
 
-In this paper, we present a modular system for representing and reasoning
-with legal aspects of traffic rules for autonomous vehicles. We focus on a
-subset of the United Kingdom's Highway Code (HC) related to junctions. As human
-drivers and automated vehicles (AVs) will interact on the roads, especially in
-urban environments, we claim that an accessible, unitary, high-level
-computational model should exist and be applicable to both users. Autonomous
-vehicles introduce a shift in liability that should not bring disadvantages or
-increased burden on human drivers. We develop a system "in silico" of the
-model. The proposed system is built of three main components: a natural
-language interface, using Logical English, which encodes the rules; an internal
-representation of the rules in Prolog; and an multi-agent-based simulation
-environment, built in NetLogo. The three components interact: Logical English
-is translated into and out of Prolog (along with some support code); Prolog and
-NetLogo interface via predicates. Such a modular approach enables the different
-components to carry different "burdens" in the overall system; it also allows
-swapping of modules. Given NetLogo, we can visualize the effect of the modeled
-rules as well as validate the system with a simple dynamic running scenario.
-Designated agents monitor the behaviour of the vehicles for compliance and
-record potential violations where they occur. The information on potential
-violations is then utilized by Validators, to determine whether the violation
-is punishable, differentiating between exceptions and cases.
+Radiotherapy is a crucial cancer treatment that demands precise planning to
+balance tumor eradication and preservation of healthy tissue. Traditional
+treatment planning (TP) is iterative, time-consuming, and reliant on human
+expertise, which can potentially introduce variability and inefficiency. We
+propose a novel framework to transform a large multimodal foundation model
+(MLM) into an action model for TP using a few-shot reinforcement learning (RL)
+approach. Our method leverages the MLM's extensive pre-existing knowledge of
+physics, radiation, and anatomy, enhancing it through a few-shot learning
+process. This allows the model to iteratively improve treatment plans using a
+Monte Carlo simulator. Our results demonstrate that this method outperforms
+conventional RL-based approaches in both quality and efficiency, achieving
+higher reward scores and more optimal dose distributions in simulations on
+prostate cancer data. This proof-of-concept suggests a promising direction for
+integrating advanced AI models into clinical workflows, potentially enhancing
+the speed, quality, and standardization of radiotherapy treatment planning.
 
-摘要：<paragraph>在本文中，我們提出了一個模組化系統，用於表示和推理自動駕駛車輛交通規則的法律層面。我們專注於與路口相關的英國公路法規 (HC) 子集。由於人類駕駛和自動駕駛車輛 (AV) 將在道路上互動，尤其是在城市環境中，我們主張應存在一個可存取、統一、高階的運算模型，並適用於這兩種使用者。自動駕駛車輛引入了責任轉移，不應給人類駕駛帶來劣勢或增加負擔。我們開發了一個模型的「電腦模擬」系統。所提出的系統由三個主要組成部分建構而成：使用邏輯英語的自然語言介面，用於編碼規則；使用 Prolog 的規則內部表示；以及使用 NetLogo 建構的多主體模擬環境。這三個組成部分會進行互動：邏輯英語會翻譯成 Prolog（以及一些支援程式碼），再從 Prolog 翻譯回來；Prolog 和 NetLogo 會透過謂詞進行介面。這種模組化方法讓不同的組成部分可以在整體系統中承擔不同的「負擔」；它也允許模組交換。有了 NetLogo，我們可以視覺化已建模規則的效果，並使用一個簡單的動態執行範例來驗證系統。指定的代理會監控車輛的行為，以確保遵守規定，並記錄發生的潛在違規行為。然後，驗證者會利用潛在違規行為的資訊，來確定違規行為是否應受懲罰，並區分例外情況和案例。</paragraph>
+摘要：放射治療是一種重要的癌症治療方法，需要精確的規劃來平衡腫瘤根除和健康組織的保留。傳統的治療規劃（TP）是反覆的、耗時的，並且依賴於人為專業知識，這可能會引入變異性和低效率。我們提出了一個新穎的框架，使用少次強化學習 (RL) 方法將大型多模態基礎模型 (MLM) 轉換為 TP 的動作模型。我們的模型利用了 MLM 對物理、輻射和解剖學的廣泛預先存在的知識，並通過少次學習過程對其進行增強。這允許模型使用蒙特卡羅模擬器反覆改進治療計劃。我們的結果表明，這種方法在質量和效率方面都優於基於傳統 RL 的方法，在對前列腺癌數據進行模擬時，獲得了更高的獎勵分數和更優化的劑量分佈。這個概念驗證表明了一個有希望的方向，即將先進的人工智慧模型整合到臨床工作流程中，從而有可能提高放射治療計劃的速度、質量和標準化。
 
-##### **Architecture for Simulating Behavior Mode Changes in Norm-Aware Autonomous Agents**
-2502.09215v1 by Sean Glaze, Daniela Inclezan
+##### **Online Location Planning for AI-Defined Vehicles: Optimizing Joint Tasks of Order Serving and Spatio-Temporal Heterogeneous Model Fine-Tuning**
+2502.04399v1 by Bokeng Zheng, Bo Rao, Tianxiang Zhu, Chee Wei Tan, Jingpu Duan, Zhi Zhou, Xu Chen, Xiaoxi Zhang
 
-This paper presents an architecture for simulating the actions of a
-norm-aware intelligent agent whose behavior with respect to norm compliance is
-set, and can later be changed, by a human controller. Updating an agent's
-behavior mode from a norm-abiding to a riskier one may be relevant when the
-agent is involved in time-sensitive rescue operations, for example. We base our
-work on the Authorization and Obligation Policy Language AOPL designed by
-Gelfond and Lobo for the specification of norms. We introduce an architecture
-and a prototype software system that can be used to simulate an agent's plans
-under different behavior modes that can later be changed by the controller. We
-envision such software to be useful to policy makers, as they can more readily
-understand how agents may act in certain situations based on the agents'
-attitudes towards norm-compliance. Policy makers may then refine their policies
-if simulations show unwanted consequences.
+Advances in artificial intelligence (AI) including foundation models (FMs),
+are increasingly transforming human society, with smart city driving the
+evolution of urban living.Meanwhile, vehicle crowdsensing (VCS) has emerged as
+a key enabler, leveraging vehicles' mobility and sensor-equipped capabilities.
+In particular, ride-hailing vehicles can effectively facilitate flexible data
+collection and contribute towards urban intelligence, despite resource
+limitations. Therefore, this work explores a promising scenario, where
+edge-assisted vehicles perform joint tasks of order serving and the emerging
+foundation model fine-tuning using various urban data. However, integrating the
+VCS AI task with the conventional order serving task is challenging, due to
+their inconsistent spatio-temporal characteristics: (i) The distributions of
+ride orders and data point-of-interests (PoIs) may not coincide in geography,
+both following a priori unknown patterns; (ii) they have distinct forms of
+temporal effects, i.e., prolonged waiting makes orders become instantly invalid
+while data with increased staleness gradually reduces its utility for model
+fine-tuning.To overcome these obstacles, we propose an online framework based
+on multi-agent reinforcement learning (MARL) with careful augmentation. A new
+quality-of-service (QoS) metric is designed to characterize and balance the
+utility of the two joint tasks, under the effects of varying data volumes and
+staleness. We also integrate graph neural networks (GNNs) with MARL to enhance
+state representations, capturing graph-structured, time-varying dependencies
+among vehicles and across locations. Extensive experiments on our testbed
+simulator, utilizing various real-world foundation model fine-tuning tasks and
+the New York City Taxi ride order dataset, demonstrate the advantage of our
+proposed method.
 
-摘要：本文提出了一個架構，用於模擬一個規範感知智能代理的行為，其行為遵守規範，並可以由人類控制者設定，並可以在稍後進行更改。當代理參與時間敏感的救援行動時，將代理的行為模式從遵守規範更新為更冒險的行為模式可能是相關的。我們的工作基於 Gelfond 和 Lobo 為規範規範設計的授權和義務政策語言 AOPL。我們引入了一個架構和一個原型軟體系統，可用於模擬代理在不同行為模式下的計畫，這些行為模式稍後可以由控制者更改。我們預計此類軟體對政策制定者很有用，因為他們可以更容易地根據代理對規範遵守的態度了解代理在特定情況下的行為方式。如果模擬顯示出不希望的後果，政策制定者可以修改他們的政策。
+摘要：人工智能（AI）的進展，包括基礎模型（FM），正日益轉變人類社會，智慧城市推動著城市生活的演進。同時，車輛群感測（VCS）已成為關鍵推動因素，利用車輛的機動性和配備感測器的能力。特別是，儘管有資源限制，叫車服務車輛能有效促進靈活的資料收集，並有助於城市智慧。因此，這項工作探索了一個有前途的場景，其中邊緣輔助車輛執行訂單服務和新興基礎模型微調的聯合任務，使用各種城市資料。然而，由於 VCS AI 任務與傳統訂單服務任務的不一致時空特徵，整合它們具有挑戰性：(i) 叫車訂單和資料感興趣點 (PoI) 的分佈在地域上可能不重合，兩者都遵循先驗未知的模式；(ii) 它們具有不同的時間效應形式，即長時間等待會使訂單立即失效，而過時的資料會逐漸降低其對模型微調的效用。為了解決這些障礙，我們提出了一個基於多智能體強化學習 (MARL) 的線上架構，並進行了仔細的擴充。設計了一個新的服務品質 (QoS) 指標，用於表徵和平衡這兩個聯合任務的效用，在不同資料量和過時性的影響下。我們還將圖神經網路（GNN）與 MARL 整合，以增強狀態表示，捕捉車輛之間和不同地點之間的圖結構、時變依賴性。在我們的測試平台模擬器上進行的廣泛實驗，利用各種真實世界的基礎模型微調任務和紐約市計程車叫車訂單資料集，證明了我們提出的方法的優點。
 
-##### **Neuro-Symbolic Contrastive Learning for Cross-domain Inference**
-2502.09213v1 by Mingyue Liu, Ryo Ueda, Zhen Wan, Katsumi Inoue, Chris G. Willcocks
+##### **Multimodal Medical Code Tokenizer**
+2502.04397v2 by Xiaorui Su, Shvat Messica, Yepeng Huang, Ruth Johnson, Lukas Fesser, Shanghua Gao, Faryad Sahneh, Marinka Zitnik
 
-Pre-trained language models (PLMs) have made significant advances in natural
-language inference (NLI) tasks, however their sensitivity to textual
-perturbations and dependence on large datasets indicate an over-reliance on
-shallow heuristics. In contrast, inductive logic programming (ILP) excels at
-inferring logical relationships across diverse, sparse and limited datasets,
-but its discrete nature requires the inputs to be precisely specified, which
-limits their application. This paper proposes a bridge between the two
-approaches: neuro-symbolic contrastive learning. This allows for smooth and
-differentiable optimisation that improves logical accuracy across an otherwise
-discrete, noisy, and sparse topological space of logical functions. We show
-that abstract logical relationships can be effectively embedded within a
-neuro-symbolic paradigm, by representing data as logic programs and sets of
-logic rules. The embedding space captures highly varied textual information
-with similar semantic logical relations, but can also separate similar textual
-relations that have dissimilar logical relations. Experimental results
-demonstrate that our approach significantly improves the inference capabilities
-of the models in terms of generalisation and reasoning.
+Foundation models trained on patient electronic health records (EHRs) require
+tokenizing medical data into sequences of discrete vocabulary items. Existing
+tokenizers treat medical codes from EHRs as isolated textual tokens. However,
+each medical code is defined by its textual description, its position in
+ontological hierarchies, and its relationships to other codes, such as disease
+co-occurrences and drug-treatment associations. Medical vocabularies contain
+more than 600,000 codes with critical information for clinical reasoning. We
+introduce MedTok, a multimodal medical code tokenizer that uses the text
+descriptions and relational context of codes. MedTok processes text using a
+language model encoder and encodes the relational structure with a graph
+encoder. It then quantizes both modalities into a unified token space,
+preserving modality-specific and cross-modality information. We integrate
+MedTok into five EHR models and evaluate it on operational and clinical tasks
+across in-patient and out-patient datasets, including outcome prediction,
+diagnosis classification, drug recommendation, and risk stratification.
+Swapping standard EHR tokenizers with MedTok improves AUPRC across all EHR
+models, by 4.10% on MIMIC-III, 4.78% on MIMIC-IV, and 11.30% on EHRShot, with
+the largest gains in drug recommendation. Beyond EHR modeling, we demonstrate
+using MedTok tokenizer with medical QA systems. Our results demonstrate the
+potential of MedTok as a unified tokenizer for medical codes, improving
+tokenization for medical foundation models.
 
-摘要：預訓練語言模型 (PLM) 在自然語言推理 (NLI) 任務中取得了重大進展，然而它們對文本擾動的敏感性和對大型資料集的依賴性表明過度依賴於淺層啟發法。相比之下，歸納邏輯規劃 (ILP) 擅長推論跨越多樣化、稀疏和有限資料集的邏輯關係，但其離散性質要求輸入被精確指定，這限制了它們的應用。本文提出了兩種方法之間的橋樑：神經符號對比學習。這允許平滑且可微分的優化，從而提高邏輯函數的離散、嘈雜和稀疏拓撲空間中的邏輯準確性。我們展示了抽象邏輯關係可以通過將資料表示為邏輯程式和邏輯規則集，有效地嵌入到神經符號範例中。嵌入空間捕獲具有相似語義邏輯關係的高度多變的文本資訊，但也可以分離具有不同邏輯關係的相似文本關係。實驗結果表明，我們的做法在泛化和推理方面顯著提高了模型的推理能力。
+摘要：<paragraph>在患者电子健康记录 (EHR) 上训练的基础模型需要将医学数据标记为离散词汇项序列。现有的标记器将 EHR 中的医学代码视为孤立的文本标记。然而，每个医学代码都由其文本描述、在本体层次结构中的位置以及与其他代码的关系（例如疾病共现和药物治疗关联）来定义。医学词汇表包含超过 600,000 个代码，这些代码包含临床推理的关键信息。我们引入了 MedTok，这是一种多模态医学代码标记器，它使用文本描述和代码的关系上下文。MedTok 使用语言模型编码器处理文本，并使用图编码器对关系结构进行编码。然后，它将这两种模态量化为一个统一的标记空间，保留特定于模态和跨模态的信息。我们将 MedTok 集成到五个 EHR 模型中，并在住院和门诊数据集（包括结果预测、诊断分类、药物推荐和风险分层）上对其实施操作和临床任务进行评估。用 MedTok 替换标准 EHR 标记器可提高所有 EHR 模型的 AUPRC，在 MIMIC-III 上提高 4.10%，在 MIMIC-IV 上提高 4.78%，在 EHRShot 上提高 11.30%，其中药物推荐的增益最大。除了 EHR 建模之外，我们还演示了将 MedTok 标记器与医学问答系统结合使用。我们的结果证明了 MedTok 作为医学代码的统一标记器的潜力，改进了医学基础模型的标记化。</paragraph>
 
-##### **LP-LM: No Hallucinations in Question Answering with Logic Programming**
-2502.09212v1 by Katherine Wu, Yanhong A. Liu
+##### **A Retrospective Systematic Study on Hierarchical Sparse Query Transformer-assisted Ultrasound Screening for Early Hepatocellular Carcinoma**
+2502.03772v1 by Chaoyin She, Ruifang Lu, Danni He, Jiayi Lv, Yadan Lin, Meiqing Cheng, Hui Huang, Lida Chen, Wei Wang, Qinghua Huang
 
-Large language models (LLMs) are able to generate human-like responses to
-user queries. However, LLMs exhibit inherent limitations, especially because
-they hallucinate. This paper introduces LP-LM, a system that grounds answers to
-questions in known facts contained in a knowledge base (KB), facilitated
-through semantic parsing in Prolog, and always produces answers that are
-reliable.
-  LP-LM generates a most probable constituency parse tree along with a
-corresponding Prolog term for an input question via Prolog definite clause
-grammar (DCG) parsing. The term is then executed against a KB of natural
-language sentences also represented as Prolog terms for question answering. By
-leveraging DCG and tabling, LP-LM runs in linear time in the size of input
-sentences for sufficiently many grammar rules. Performing experiments comparing
-LP-LM with current well-known LLMs in accuracy, we show that LLMs hallucinate
-on even simple questions, unlike LP-LM.
+Hepatocellular carcinoma (HCC) ranks as the third leading cause of
+cancer-related mortality worldwide, with early detection being crucial for
+improving patient survival rates. However, early screening for HCC using
+ultrasound suffers from insufficient sensitivity and is highly dependent on the
+expertise of radiologists for interpretation. Leveraging the latest
+advancements in artificial intelligence (AI) in medical imaging, this study
+proposes an innovative Hierarchical Sparse Query Transformer (HSQformer) model
+that combines the strengths of Convolutional Neural Networks (CNNs) and Vision
+Transformers (ViTs) to enhance the accuracy of HCC diagnosis in ultrasound
+screening. The HSQformer leverages sparse latent space representations to
+capture hierarchical details at various granularities without the need for
+complex adjustments, and adopts a modular, plug-and-play design philosophy,
+ensuring the model's versatility and ease of use. The HSQformer's performance
+was rigorously tested across three distinct clinical scenarios: single-center,
+multi-center, and high-risk patient testing. In each of these settings, it
+consistently outperformed existing state-of-the-art models, such as ConvNext
+and SwinTransformer. Notably, the HSQformer even matched the diagnostic
+capabilities of senior radiologists and comprehensively surpassed those of
+junior radiologists. The experimental results from this study strongly
+demonstrate the effectiveness and clinical potential of AI-assisted tools in
+HCC screening. The full code is available at
+https://github.com/Asunatan/HSQformer.
+
+摘要：肝細胞癌（HCC）是全球第三大癌症相關死亡原因，早期檢測對於提高患者存活率至關重要。然而，使用超音波進行 HCC 早期篩檢的靈敏度不足，且高度依賴放射科醫師的專業知識進行判讀。本研究利用醫學影像中人工智慧（AI）的最新進展，提出了一種創新的分層稀疏查詢Transformer（HSQformer）模型，結合了卷積神經網路（CNN）和視覺Transformer（ViT）的優點，以提高超音波篩檢中 HCC 診斷的準確性。HSQformer 利用稀疏潛在空間表示，在不需要複雜調整的情況下擷取各種粒度層級的細節，並採用模組化、即插即用的設計理念，確保模型的多功能性和易用性。HSQformer 的效能經過三個不同的臨床場景的嚴格測試：單中心、多中心和高風險患者測試。在這些設定中，它始終優於現有的最先進模型，例如 ConvNext 和 SwinTransformer。值得注意的是，HSQformer 甚至匹配了資深放射科醫師的診斷能力，並全面超越了初級放射科醫師的診斷能力。本研究的實驗結果有力地證明了 AI 輔助工具在 HCC 篩檢中的有效性和臨床潛力。完整程式碼可在 https://github.com/Asunatan/HSQformer 取得。
+
+##### **Towards Fair Medical AI: Adversarial Debiasing of 3D CT Foundation Embeddings**
+2502.04386v1 by Guangyao Zheng, Michael A. Jacobs, Vladimir Braverman, Vishwa S. Parekh
+
+Self-supervised learning has revolutionized medical imaging by enabling
+efficient and generalizable feature extraction from large-scale unlabeled
+datasets. Recently, self-supervised foundation models have been extended to
+three-dimensional (3D) computed tomography (CT) data, generating compact,
+information-rich embeddings with 1408 features that achieve state-of-the-art
+performance on downstream tasks such as intracranial hemorrhage detection and
+lung cancer risk forecasting. However, these embeddings have been shown to
+encode demographic information, such as age, sex, and race, which poses a
+significant risk to the fairness of clinical applications.
+  In this work, we propose a Variation Autoencoder (VAE) based adversarial
+debiasing framework to transform these embeddings into a new latent space where
+demographic information is no longer encoded, while maintaining the performance
+of critical downstream tasks. We validated our approach on the NLST lung cancer
+screening dataset, demonstrating that the debiased embeddings effectively
+eliminate multiple encoded demographic information and improve fairness without
+compromising predictive accuracy for lung cancer risk at 1-year and 2-year
+intervals. Additionally, our approach ensures the embeddings are robust against
+adversarial bias attacks. These results highlight the potential of adversarial
+debiasing techniques to ensure fairness and equity in clinical applications of
+self-supervised 3D CT embeddings, paving the way for their broader adoption in
+unbiased medical decision-making.
 
-摘要：大型語言模型 (LLM) 能產生類似人類的回應來回答使用者的問題。然而，LLM 顯示出內在的限制，特別是因為它們會產生幻覺。本文介紹 LP-LM，一個系統，它將問題的答案建立在知識庫 (KB) 中已知的事實上，透過 Prolog 中的語義解析來促進，並始終產生可靠的答案。
-LP-LM 透過 Prolog 明確條款語法 (DCG) 解析產生一個最可能的成分解析樹，以及輸入問題對應的 Prolog 詞彙。然後，針對一個自然語言句子的 KB 執行該詞彙，也表示為 Prolog 詞彙，以進行問題解答。透過利用 DCG 和 tabling，LP-LM 在輸入句子的大小上以線性時間執行，對於足夠多的語法規則。執行實驗比較 LP-LM 與目前眾所周知的 LLM 在準確性上，我們顯示出 LLM 甚至會對簡單的問題產生幻覺，這與 LP-LM 不同。
+摘要：自我監督學習透過從大規模未標記資料集中提取有效且可概化的特徵，進而革新了醫學影像。最近，自我監督基礎模型已擴展到三維 (3D) 電腦斷層掃描 (CT) 資料，產生緊湊、資訊豐富的嵌入，包含 1408 個特徵，在顱內出血偵測和肺癌風險預測等下游任務中達到最先進的效能。然而，這些嵌入已被證明會編碼人口統計資訊，例如年齡、性別和種族，這對臨床應用的公平性構成重大風險。
+在這項工作中，我們提出一個基於變異自編碼器 (VAE) 的對抗性去偏框架，將這些嵌入轉換到一個新的潛在空間，其中不再編碼人口統計資訊，同時維持關鍵下游任務的效能。我們在 NLST 肺癌篩檢資料集上驗證了我們的做法，證明去偏嵌入有效消除了多重編碼的人口統計資訊，並在不損害 1 年和 2 年間隔的肺癌風險預測準確性的情況下提高了公平性。此外，我們的做法確保了嵌入對抗性偏誤攻擊具有魯棒性。這些結果突顯了對抗性去偏技術的潛力，可確保自我監督 3D CT 嵌入在臨床應用中的公平性和公正性，為其在無偏見醫療決策中的廣泛採用鋪路。
 
-##### **Visual Graph Question Answering with ASP and LLMs for Language Parsing**
-2502.09211v1 by Jakob Johannes Bauer, Thomas Eiter, Nelson Higuera Ruiz, Johannes Oetsch
+##### **Clinically-Inspired Hierarchical Multi-Label Classification of Chest X-rays with a Penalty-Based Loss Function**
+2502.03591v1 by Mehrdad Asadi, Komi Sodoké, Ian J. Gerard, Marta Kersten-Oertel
 
-Visual Question Answering (VQA) is a challenging problem that requires to
-process multimodal input. Answer-Set Programming (ASP) has shown great
-potential in this regard to add interpretability and explainability to modular
-VQA architectures. In this work, we address the problem of how to integrate ASP
-with modules for vision and natural language processing to solve a new and
-demanding VQA variant that is concerned with images of graphs (not graphs in
-symbolic form). Images containing graph-based structures are an ubiquitous and
-popular form of visualisation. Here, we deal with the particular problem of
-graphs inspired by transit networks, and we introduce a novel dataset that
-amends an existing one by adding images of graphs that resemble metro lines.
-Our modular neuro-symbolic approach combines optical graph recognition for
-graph parsing, a pretrained optical character recognition neural network for
-parsing labels, Large Language Models (LLMs) for language processing, and ASP
-for reasoning. This method serves as a first baseline and achieves an overall
-average accuracy of 73% on the dataset. Our evaluation provides further
-evidence of the potential of modular neuro-symbolic systems, in particular with
-pretrained models that do not involve any further training and logic
-programming for reasoning, to solve complex VQA tasks.
+In this work, we present a novel approach to multi-label chest X-ray (CXR)
+image classification that enhances clinical interpretability while maintaining
+a streamlined, single-model, single-run training pipeline. Leveraging the
+CheXpert dataset and VisualCheXbert-derived labels, we incorporate hierarchical
+label groupings to capture clinically meaningful relationships between
+diagnoses. To achieve this, we designed a custom hierarchical binary
+cross-entropy (HBCE) loss function that enforces label dependencies using
+either fixed or data-driven penalty types. Our model achieved a mean area under
+the receiver operating characteristic curve (AUROC) of 0.903 on the test set.
+Additionally, we provide visual explanations and uncertainty estimations to
+further enhance model interpretability. All code, model configurations, and
+experiment details are made available.
 
-摘要：視覺問答（VQA）是一項具有挑戰性的問題，需要處理多模態輸入。答案集程式設計（ASP）在這方面顯示出巨大的潛力，可以為模組化 VQA 架構增加可解釋性和說明性。在這項工作中，我們探討如何將 ASP 與視覺和自然語言處理模組整合，以解決一個新的且要求嚴格的 VQA 變體，該變體與圖形影像（而非符號形式的圖形）有關。包含圖形結構的影像是一種普遍且流行的可視化形式。在這裡，我們處理受交通網路啟發的圖形特定問題，並引入一個新的資料集，透過新增類似地鐵路線的圖形影像來修正現有資料集。我們的模組化神經符號方法結合光學圖形辨識進行圖形解析、預先訓練的光學字元辨識神經網路進行標籤解析、大型語言模型（LLM）進行語言處理，以及 ASP 進行推理。此方法作為第一個基準，在資料集上達到 73% 的整體平均準確度。我們的評估進一步證明了模組化神經符號系統的潛力，特別是預先訓練的模型，這些模型不涉及任何進一步的訓練和邏輯程式設計進行推理，以解決複雜的 VQA 任務。
+摘要：在本文中，我們提出胸部 X 光（CXR）影像多標籤分類的新方法，在維持簡化的單一模型、單次執行訓練管線的同時，提升臨床可解釋性。利用 CheXpert 資料集和 VisualCheXbert 衍生的標籤，我們納入階層標籤群組，以擷取診斷之間具有臨床意義的關聯性。為此，我們設計了自訂的階層二元交叉熵 (HBCE) 損失函數，使用固定或資料驅動的懲罰類型來強制執行標籤依賴性。我們的模型在測試集上達到受試者工作特性曲線 (AUROC) 下的平均面積為 0.903。此外，我們提供視覺化說明和不確定性估計，以進一步提升模型可解釋性。所有程式碼、模型組態和實驗詳細資料皆已公開。
 
-##### **On LLM-generated Logic Programs and their Inference Execution Methods**
-2502.09209v1 by Paul Tarau
+##### **Code Simulation as a Proxy for High-order Tasks in Large Language Models**
+2502.03568v1 by Emanuele La Malfa, Christoph Weinhuber, Orazio Torre, Fangru Lin, X. Angelo Huang, Samuele Marro, Anthony Cohn, Nigel Shadbolt, Michael Wooldridge
 
-Large Language Models (LLMs) trained on petabytes of data are highly
-compressed repositories of a significant proportion of the knowledge
-accumulated and distilled so far. In this paper we study techniques to elicit
-this knowledge in the form of several classes of logic programs, including
-propositional Horn clauses, Dual Horn clauses, relational triplets and Definite
-Clause Grammars. Exposing this knowledge as logic programs enables sound
-reasoning methods that can verify alignment of LLM outputs to their intended
-uses and extend their inference capabilities. We study new execution methods
-for the generated programs, including soft-unification of abducible facts
-against LLM-generated content stored in a vector database as well as GPU-based
-acceleration of minimal model computation that supports inference with large
-LLM-generated programs.
+Many reasoning, planning, and problem-solving tasks share an intrinsic
+algorithmic nature: correctly simulating each step is a sufficient condition to
+solve them correctly. We collect pairs of naturalistic and synthetic reasoning
+tasks to assess the capabilities of Large Language Models (LLM). While
+naturalistic tasks often require careful human handcrafting, we show that
+synthetic data is, in many cases, a good proxy that is much easier to collect
+at scale. We leverage common constructs in programming as the counterpart of
+the building blocks of naturalistic reasoning tasks, such as straight-line
+programs, code that contains critical paths, and approximate and redundant
+instructions. We further assess the capabilities of LLMs on sorting problems
+and repeated operations via sorting algorithms and nested loops. Our synthetic
+datasets further reveal that while the most powerful LLMs exhibit relatively
+strong execution capabilities, the process is fragile: it is negatively
+affected by memorisation and seems to rely heavily on pattern recognition. Our
+contribution builds upon synthetically testing the reasoning capabilities of
+LLMs as a scalable complement to handcrafted human-annotated problems.
 
-摘要：大型語言模型 (LLM) 在數位位元組的資料上受過訓練，是目前為止累積和提煉的知識中，高度濃縮的儲存庫。在本文中，我們研究了以數種邏輯程式類別的形式引出這些知識的技術，包括命題霍恩子句、雙重霍恩子句、關聯三元組和確定子句文法。將這些知識作為邏輯程式揭露，能啟用健全的推理方法，驗證 LLM 輸出的對齊方式，符合其預期的用途，並擴展其推論能力。我們研究了產生程式的新執行方法，包括對儲存在向量資料庫中的 LLM 產生內容，進行可約簡事實的軟統一，以及支援使用大型 LLM 產生程式進行推論的，基於 GPU 的最小模型計算加速。
+摘要：許多推理、規劃和問題解決任務共享一個內在的演算法性質：正確模擬每一步就足以正確解決它們。我們收集自然主義和合成推理任務對，以評估大型語言模型 (LLM) 的功能。雖然自然主義任務通常需要仔細的人工製作，但我們表明在許多情況下，合成資料是一個很好的代理，而且更容易大規模收集。我們利用程式設計中的常見建構，作為自然主義推理任務構建區塊的對應物，例如直線程式、包含關鍵路徑的程式碼，以及近似和冗餘指令。我們進一步評估 LLM 在排序問題和重複運算上的功能，透過排序演算法和巢狀迴圈。我們的合成資料集進一步揭示，雖然最強大的 LLM 表現出相對強大的執行能力，但這個過程很脆弱：它受到記憶的負面影響，而且似乎嚴重依賴模式辨識。我們的貢獻建立在以合成方式測試 LLM 的推理能力之上，作為手工編寫人類標註問題的可擴充補充。
 
-##### **Efficient OWL2QL Meta-reasoning Using ASP-based Hybrid Knowledge Bases**
-2502.09206v1 by Haya Majid Qureshi, Wolfgang Faber
+##### **Limitations of Large Language Models in Clinical Problem-Solving Arising from Inflexible Reasoning**
+2502.04381v1 by Jonathan Kim, Anna Podlasek, Kie Shidara, Feng Liu, Ahmed Alaa, Danilo Bernardo
 
-Metamodeling refers to scenarios in ontologies in which classes and roles can
-be members of classes or occur in roles. This is a desirable modelling feature
-in several applications, but allowing it without restrictions is problematic
-for several reasons, mainly because it causes undecidability. Therefore,
-practical languages either forbid metamodeling explicitly or treat occurrences
-of classes as instances to be semantically different from other occurrences,
-thereby not allowing metamodeling semantically. Several extensions have been
-proposed to provide metamodeling to some extent. Building on earlier work that
-reduces metamodeling query answering to Datalog query answering, recently
-reductions to query answering over hybrid knowledge bases were proposed with
-the aim of using the Datalog transformation only where necessary. Preliminary
-work showed that the approach works, but the hoped-for performance improvements
-were not observed yet. In this work we expand on this body of work by improving
-the theoretical basis of the reductions and by using alternative tools that
-show competitive performance.
+Large Language Models (LLMs) have attained human-level accuracy on medical
+question-answer (QA) benchmarks. However, their limitations in navigating
+open-ended clinical scenarios have recently been shown, raising concerns about
+the robustness and generalizability of LLM reasoning across diverse, real-world
+medical tasks. To probe potential LLM failure modes in clinical
+problem-solving, we present the medical abstraction and reasoning corpus
+(M-ARC). M-ARC assesses clinical reasoning through scenarios designed to
+exploit the Einstellung effect -- the fixation of thought arising from prior
+experience, targeting LLM inductive biases toward inflexible pattern matching
+from their training data rather than engaging in flexible reasoning. We find
+that LLMs, including current state-of-the-art o1 and Gemini models, perform
+poorly compared to physicians on M-ARC, often demonstrating lack of commonsense
+medical reasoning and a propensity to hallucinate. In addition, uncertainty
+estimation analyses indicate that LLMs exhibit overconfidence in their answers,
+despite their limited accuracy. The failure modes revealed by M-ARC in LLM
+medical reasoning underscore the need to exercise caution when deploying these
+models in clinical settings.
 
-摘要：元建模是指本体中的場景，其中類別和角色可以是類別成員或出現在角色中。這是一個在多個應用中理想的建模功能，但允許它不受限制會因多個原因而產生問題，主要是因為它會導致無法決定。因此，實用的語言會明確禁止元建模，或將類別的出現視為與其他出現語義不同的實例，從而語義上不允許元建模。已經提出多個擴充功能，在一定程度上提供元建模。建立在將元建模查詢回答簡化為 Datalog 查詢回答的早期工作之上，最近提出了將查詢回答簡化為混合知識庫的簡化，目的是僅在必要時使用 Datalog 轉換。初步工作顯示該方法有效，但尚未觀察到預期的效能改善。在這項工作中，我們透過改善簡化的理論基礎和使用表現競爭力的替代工具，擴展了這項工作。
+摘要：大型語言模型 (LLM) 已在醫療問題解答 (QA) 基準上達到人類層級的準確度。然而，它們在應對開放式臨床場景中的局限性最近已被揭示，引發了人們對 LLM 推理在多樣化、真實世界醫療任務中的穩健性和概括性的擔憂。為了探討臨床問題解決中 LLM 的潛在故障模式，我們提出了醫療抽象和推理語料庫 (M-ARC)。M-ARC 通過旨在利用艾賓浩斯錯覺（由先前經驗產生的思維定勢）來評估臨床推理，針對 LLM 歸納偏誤，使其從訓練數據中進行僵化的模式匹配，而不是進行靈活的推理。我們發現，包括當前最先進的 o1 和 Gemini 模型在內的 LLM，在 M-ARC 上的表現遠不如醫生，它們經常表現出缺乏常識性的醫療推理和產生幻覺的傾向。此外，不確定性估計分析表明，儘管 LLM 準確性有限，但它們對自己的答案表現出過度自信。M-ARC 揭示的 LLM 醫療推理故障模式強調了在臨床環境中部署這些模型時需要謹慎。
 
-##### **Counterfactual Explanations as Plans**
-2502.09205v1 by Vaishak Belle
+##### **Accurate AI-Driven Emergency Vehicle Location Tracking in Healthcare ITS Digital Twin**
+2502.03396v1 by Sarah Al-Shareeda, Yasar Celik, Bilge Bilgili, Ahmed Al-Dubai, Berk Canberk
 
-There has been considerable recent interest in explainability in AI,
-especially with black-box machine learning models. As correctly observed by the
-planning community, when the application at hand is not a single-shot decision
-or prediction, but a sequence of actions that depend on observations, a richer
-notion of explanations are desirable.
-  In this paper, we look to provide a formal account of ``counterfactual
-explanations," based in terms of action sequences. We then show that this
-naturally leads to an account of model reconciliation, which might take the
-form of the user correcting the agent's model, or suggesting actions to the
-agent's plan. For this, we will need to articulate what is true versus what is
-known, and we appeal to a modal fragment of the situation calculus to formalise
-these intuitions. We consider various settings: the agent knowing partial
-truths, weakened truths and having false beliefs, and show that our definitions
-easily generalize to these different settings.
+Creating a Digital Twin (DT) for Healthcare Intelligent Transportation
+Systems (HITS) is a hot research trend focusing on enhancing HITS management,
+particularly in emergencies where ambulance vehicles must arrive at the crash
+scene on time and track their real-time location is crucial to the medical
+authorities. Despite the claim of real-time representation, a temporal
+misalignment persists between the physical and virtual domains, leading to
+discrepancies in the ambulance's location representation. This study proposes
+integrating AI predictive models, specifically Support Vector Regression (SVR)
+and Deep Neural Networks (DNN), within a constructed mock DT data pipeline
+framework to anticipate the medical vehicle's next location in the virtual
+world. These models align virtual representations with their physical
+counterparts, i.e., metaphorically offsetting the synchronization delay between
+the two worlds. Trained meticulously on a historical geospatial dataset, SVR
+and DNN exhibit exceptional prediction accuracy in MATLAB and Python
+environments. Through various testing scenarios, we visually demonstrate the
+efficacy of our methodology, showcasing SVR and DNN's key role in significantly
+reducing the witnessed gap within the HITS's DT. This transformative approach
+enhances real-time synchronization in emergency HITS by approximately 88% to
+93%.
 
-摘要：最近在人工智能中對於可解釋性產生了相當大的興趣，
-特別是對於黑盒機器學習模型。正如規劃社群正確觀察到的，當手邊的應用程式不是單次決策或預測，而是一連串依賴於觀察的動作時，一個更豐富的解釋概念是可取的。
-在本文中，我們著眼於提供「反事實解釋」的一個正式說明，以動作序列為基礎。然後我們展示這自然會導致一個模型調和說明，其形式可能是使用者修正代理人的模型，或建議代理人的計畫採取行動。為此，我們需要說明什麼是真實的，什麼是已知的，我們訴諸情境演算的一個模態片段來形式化這些直覺。我們考慮各種設定：代理人知道部分真實、虛弱真實和擁有錯誤信念，並展示我們的定義輕鬆地概括到這些不同的設定。
+摘要：建立醫療智慧交通系統（HITS）的數位分身（DT）是熱門的研究趨勢，其重點在於提升 HITS 管理，特別是在救護車必須準時抵達車禍現場的緊急情況中，追蹤其即時位置對於醫療單位至關重要。儘管聲稱即時呈現，但實體和虛擬領域之間仍存在時間上的錯位，導致救護車位置呈現上的差異。本研究建議在建構的虛擬 DT 資料管道架構中整合人工智慧預測模型，特別是支援向量回歸（SVR）和深度神經網路（DNN），以預測醫療車輛在虛擬世界的下一個位置。這些模型將虛擬呈現與其實體對應物對齊，也就是說，在兩個世界之間比喻性地抵銷同步延遲。在歷史地理空間資料集上經過仔細訓練，SVR 和 DNN 在 MATLAB 和 Python 環境中展現出卓越的預測準確性。透過各種測試情境，我們視覺化展示了我們方法論的效能，展示了 SVR 和 DNN 在顯著縮小 HITS 的 DT 中見證到的差距方面的關鍵作用。這種變革性的方法將緊急 HITS 中的即時同步提升了大約 88% 到 93%。
 
-##### **Logical Lease Litigation: Prolog and LLMs for Rental Law Compliance in New York**
-2502.09204v1 by Sanskar Sehgal, Yanhong A. Liu
+##### **RadVLM: A Multitask Conversational Vision-Language Model for Radiology**
+2502.03333v1 by Nicolas Deperrois, Hidetoshi Matsuo, Samuel Ruipérez-Campillo, Moritz Vandenhirtz, Sonia Laguna, Alain Ryser, Koji Fujimoto, Mizuho Nishio, Thomas M. Sutter, Julia E. Vogt, Jonas Kluckert, Thomas Frauenfelder, Christian Blüthgen, Farhad Nooralahzadeh, Michael Krauthammer
 
-Legal cases require careful logical reasoning following the laws, whereas
-interactions with non- technical users must be in natural language. As an
-application combining logical reasoning using Prolog and natural language
-processing using large language models (LLMs), this paper presents a novel
-approach and system, LogicLease, to automate the analysis of landlord-tenant
-legal cases in the state of New York. LogicLease determines compliance with
-relevant legal requirements by analyzing case descriptions and citing all
-relevant laws. It leverages LLMs for information extraction and Prolog for
-legal reasoning. By separating information extraction from legal reasoning,
-LogicLease achieves greater transparency and control over the legal logic
-applied to each case. We evaluate the accuracy, efficiency, and robustness of
-LogicLease through a series of tests, achieving 100% accuracy and an average
-processing time of 2.57 seconds. LogicLease presents advantages over
-state-of-the-art LLM- based legal analysis systems by providing clear,
-step-by-step reasoning, citing specific laws, and distinguishing itself by its
-ability to avoid hallucinations - a common issue in LLMs.
+The widespread use of chest X-rays (CXRs), coupled with a shortage of
+radiologists, has driven growing interest in automated CXR analysis and
+AI-assisted reporting. While existing vision-language models (VLMs) show
+promise in specific tasks such as report generation or abnormality detection,
+they often lack support for interactive diagnostic capabilities. In this work
+we present RadVLM, a compact, multitask conversational foundation model
+designed for CXR interpretation. To this end, we curate a large-scale
+instruction dataset comprising over 1 million image-instruction pairs
+containing both single-turn tasks -- such as report generation, abnormality
+classification, and visual grounding -- and multi-turn, multi-task
+conversational interactions. After fine-tuning RadVLM on this instruction
+dataset, we evaluate it across different tasks along with re-implemented
+baseline VLMs. Our results show that RadVLM achieves state-of-the-art
+performance in conversational capabilities and visual grounding while remaining
+competitive in other radiology tasks. Ablation studies further highlight the
+benefit of joint training across multiple tasks, particularly for scenarios
+with limited annotated data. Together, these findings highlight the potential
+of RadVLM as a clinically relevant AI assistant, providing structured CXR
+interpretation and conversational capabilities to support more effective and
+accessible diagnostic workflows.
 
-摘要：法律案件需要遵循法律进行谨慎的逻辑推理，而与非技术用户的互动必须使用自然语言。作为结合使用 Prolog 进行逻辑推理和使用大型语言模型 (LLM) 进行自然语言处理的应用程序，本文提出了一种新颖的方法和系统 LogicLease，以自动分析纽约州的房东与租户法律案件。LogicLease 通过分析案例描述并引用所有相关法律来确定是否符合相关法律要求。它利用 LLM 进行信息提取，并利用 Prolog 进行法律推理。通过将信息提取与法律推理分开，LogicLease 实现了对应用于每个案例的法律逻辑的更高透明度和控制力。我们通过一系列测试评估了 LogicLease 的准确性、效率和鲁棒性，实现了 100% 的准确性和 2.57 秒的平均处理时间。LogicLease 通过提供清晰、分步的推理，引用具体法律，并以其避免幻觉的能力而区别于最先进的基于 LLM 的法律分析系统，从而显示出优势——这是 LLM 中的常见问题。
+摘要：胸部 X 光 (CXR) 的广泛使用，加上放射科醫師短缺，促使人們對自動化 CXR 分析和 AI 輔助報告產生越來越濃厚的興趣。雖然現有的視覺語言模型 (VLM) 在特定任務中顯示出前景，例如報告生成或異常偵測，但它們通常缺乏對互動式診斷功能的支持。在這項工作中，我們提出 RadVLM，這是一個緊湊的多任務對話式基礎模型，專為 CXR 解釋而設計。為此，我們策劃了一個大型指令資料集，包含超過 100 萬個影像指令對，其中包含單輪任務（例如報告生成、異常分類和視覺基礎），以及多輪、多任務對話互動。在對這個指令資料集進行微調後，我們對 RadVLM 進行評估，並與重新實作的基準 VLM 一起執行不同的任務。我們的結果顯示，RadVLM 在對話能力和視覺基礎方面取得了最先進的效能，同時在其他放射學任務中仍具有競爭力。消融研究進一步突顯了跨多個任務進行聯合訓練的好處，特別是對於帶有標註資料有限的場景。這些發現共同突顯了 RadVLM 作為臨床相關 AI 助理的潛力，提供結構化的 CXR 解釋和對話能力，以支援更有效且可存取的診斷工作流程。
 
-##### **Thinking beyond the anthropomorphic paradigm benefits LLM research**
-2502.09192v1 by Lujain Ibrahim, Myra Cheng
+##### **MeDiSumQA: Patient-Oriented Question-Answer Generation from Discharge Letters**
+2502.03298v1 by Amin Dada, Osman Alperen Koras, Marie Bauer, Amanda Butler, Kaleb E. Smith, Jens Kleesiek, Julian Friedrich
 
-Anthropomorphism, or the attribution of human traits to technology, is an
-automatic and unconscious response that occurs even in those with advanced
-technical expertise. In this position paper, we analyze hundreds of thousands
-of computer science research articles from the past decade and present
-empirical evidence of the prevalence and growth of anthropomorphic terminology
-in research on large language models (LLMs). This terminology reflects deeper
-anthropomorphic conceptualizations which shape how we think about and conduct
-LLM research. We argue these conceptualizations may be limiting, and that
-challenging them opens up new pathways for understanding and improving LLMs
-beyond human analogies. To illustrate this, we identify and analyze five core
-anthropomorphic assumptions shaping prominent methodologies across the LLM
-development lifecycle, from the assumption that models must use natural
-language for reasoning tasks to the assumption that model capabilities should
-be evaluated through human-centric benchmarks. For each assumption, we
-demonstrate how non-anthropomorphic alternatives can open new directions for
-research and development.
+While increasing patients' access to medical documents improves medical care,
+this benefit is limited by varying health literacy levels and complex medical
+terminology. Large language models (LLMs) offer solutions by simplifying
+medical information. However, evaluating LLMs for safe and patient-friendly
+text generation is difficult due to the lack of standardized evaluation
+resources. To fill this gap, we developed MeDiSumQA. MeDiSumQA is a dataset
+created from MIMIC-IV discharge summaries through an automated pipeline
+combining LLM-based question-answer generation with manual quality checks. We
+use this dataset to evaluate various LLMs on patient-oriented
+question-answering. Our findings reveal that general-purpose LLMs frequently
+surpass biomedical-adapted models, while automated metrics correlate with human
+judgment. By releasing MeDiSumQA on PhysioNet, we aim to advance the
+development of LLMs to enhance patient understanding and ultimately improve
+care outcomes.
 
-摘要：擬人化，或將人類特質歸因於科技，是一種自動且無意識的反應，即使是那些擁有進階技術專業知識的人也會發生。在本文中，我們分析了過去十年數十萬篇電腦科學研究文章，並提出實證證據證明擬人化術語在大型語言模型 (LLM) 研究中的普遍性和增長。這些術語反映了更深層的擬人化概念化，塑造了我們思考和進行 LLM 研究的方式。我們認為這些概念化可能是有限制的，並且挑戰它們為超越人類類比來理解和改進 LLM 開闢了新的途徑。為了說明這一點，我們識別並分析了五個核心擬人化假設，這些假設塑造了 LLM 開發生命週期中的顯著方法論，從模型必須使用自然語言進行推理任務的假設到模型能力應該通過以人為中心的基準進行評估的假設。對於每個假設，我們展示了非擬人化替代方案如何為研究和開發打開新方向。
+摘要：儘管讓患者更能取得醫療文件有助於改善醫療照護，
+但此優點受到不同的健康素養程度和複雜的醫療術語所限制。大型語言模型 (LLM) 提供了簡化醫療資訊的解決方案。然而，由於缺乏標準化的評估資源，因此難以評估 LLM 以確保其安全且對患者友善的文字產生。為了填補此缺口，我們開發了 MeDiSumQA。MeDiSumQA 是透過自動化流程從 MIMIC-IV 出院摘要中建立的資料集，結合了基於 LLM 的問答產生和手動品質檢查。我們使用此資料集來評估各種 LLM 在以患者為導向的問答中。我們的發現顯示，通用 LLM 經常超越生物醫學適應模型，而自動化指標與人類判斷相關。透過在 PhysioNet 上發布 MeDiSumQA，我們旨在推動 LLM 的發展，以增進患者理解，並最終改善照護成果。
 
-##### **Matina: A Large-Scale 73B Token Persian Text Corpus**
-2502.09188v1 by Sara Bourbour Hosseinbeigi, Fatemeh Taherinezhad, Heshaam Faili, Hamed Baghbani, Fatemeh Nadi, Mostafa Amiri
+##### **Deep Learning Pipeline for Fully Automated Myocardial Infarct Segmentation from Clinical Cardiac MR Scans**
+2502.03272v1 by Matthias Schwab, Mathias Pamminger, Christian Kremser, Agnes Mayr
+
+Purpose: To develop and evaluate a deep learning-based method that allows to
+perform myocardial infarct segmentation in a fully-automated way.
+  Materials and Methods: For this retrospective study, a cascaded framework of
+two and three-dimensional convolutional neural networks (CNNs), specialized on
+identifying ischemic myocardial scars on late gadolinium enhancement (LGE)
+cardiac magnetic resonance (CMR) images, was trained on an in-house training
+dataset consisting of 144 examinations. On a separate test dataset from the
+same institution, including images from 152 examinations obtained between 2021
+and 2023, a quantitative comparison between artificial intelligence (AI)-based
+segmentations and manual segmentations was performed. Further, qualitative
+assessment of segmentation accuracy was evaluated for both human and
+AI-generated contours by two CMR experts in a blinded experiment.
+  Results: Excellent agreement could be found between manually and
+automatically calculated infarct volumes ($\rho_c$ = 0.9). The qualitative
+evaluation showed that compared to human-based measurements, the experts rated
+the AI-based segmentations to better represent the actual extent of infarction
+significantly (p < 0.001) more often (33.4% AI, 25.1% human, 41.5% equal). On
+the contrary, for segmentation of microvascular obstruction (MVO), manual
+measurements were still preferred (11.3% AI, 55.6% human, 33.1% equal).
+  Conclusion: This fully-automated segmentation pipeline enables CMR infarct
+size to be calculated in a very short time and without requiring any
+pre-processing of the input images while matching the segmentation quality of
+trained human observers. In a blinded experiment, experts preferred automated
+infarct segmentations more often than manual segmentations, paving the way for
+a potential clinical application.
 
-Text corpora are essential for training models used in tasks like
-summarization, translation, and large language models (LLMs). While various
-efforts have been made to collect monolingual and multilingual datasets in many
-languages, Persian has often been underrepresented due to limited resources for
-data collection and preprocessing. Existing Persian datasets are typically
-small and lack content diversity, consisting mainly of weblogs and news
-articles. This shortage of high-quality, varied data has slowed the development
-of NLP models and open-source LLMs for Persian. Since model performance depends
-heavily on the quality of training data, we address this gap by introducing the
-Matina corpus, a new Persian dataset of 72.9B tokens, carefully preprocessed
-and deduplicated to ensure high data quality. We further assess its
-effectiveness by training and evaluating transformer-based models on key NLP
-tasks. Both the dataset and preprocessing codes are publicly available,
-enabling researchers to build on and improve this resource for future Persian
-NLP advancements.
+摘要：<paragraph>目的：開發和評估一種基於深度學習的方法，允許以全自動的方式執行心肌梗塞分割。
+材料和方法：對於這項回顧性研究，一個由二維和三維卷積神經網路 (CNN) 組成的串聯架構，專門用於識別晚期釓增強 (LGE) 心臟磁振造影 (CMR) 影像上的缺血性心肌疤痕，並在包含 144 項檢查的內部訓練資料集上受訓。在來自同一家機構的獨立測試資料集上，包括 2021 年至 2023 年間獲得的 152 項檢查的影像，執行基於人工智慧 (AI) 的分割和手動分割之間的定量比較。此外，由兩位 CMR 專家在盲測實驗中評估人類和 AI 生成的輪廓的分割準確度。
+結果：在手動和自動計算的梗塞體積之間可以發現極佳的一致性（ρ_c = 0.9）。定性評估顯示，與基於人類的測量相比，專家評估 AI 基於分割能更能代表梗塞的實際範圍，顯著（p < 0.001）更常發生（33.4% AI，25.1% 人類，41.5% 相等）。相反，對於微血管阻塞 (MVO) 的分割，手動測量仍然較受青睞（11.3% AI，55.6% 人類，33.1% 相等）。
+結論：這個全自動分割管道可以在很短的時間內計算 CMR 梗塞大小，而且無需對輸入影像進行任何前處理，同時匹配受過訓練的人類觀察者的分割品質。在盲測實驗中，專家比手動分割更常偏好自動梗塞分割，為潛在的臨床應用鋪平了道路。</paragraph>
 
-摘要：文字語料庫對於訓練用於摘要、翻譯和大型語言模型 (LLM) 等任務的模型至關重要。儘管已做出各種努力來收集許多語言中的單語和多語言資料集，但由於資料收集和預處理資源有限，波斯語常常代表性不足。現有的波斯語資料集通常很小，而且缺乏內容多樣性，主要由網誌和新聞文章組成。這種優質、多樣化資料的短缺減緩了波斯語的 NLP 模型和開源 LLM 的開發。由於模型效能高度依賴訓練資料的品質，我們透過推出 Matina 語料庫來解決這個差距，Matina 語料庫是一個新的波斯語資料集，包含 72.9B 個字元，經過仔細預處理和去重，以確保資料品質。我們進一步透過在關鍵 NLP 任務上訓練和評估基於轉換器的模型來評估其有效性。資料集和預處理程式碼都是公開的，使研究人員能夠建立和改善這個資源，以促進未來的波斯語 NLP 進展。
+##### **Long-tailed Medical Diagnosis with Relation-aware Representation Learning and Iterative Classifier Calibration**
+2502.03238v2 by Li Pan, Yupei Zhang, Qiushi Yang, Tan Li, Zhen Chen
 
-##### **RefineCoder: Iterative Improving of Large Language Models via Adaptive Critique Refinement for Code Generation**
-2502.09183v1 by Changzhi Zhou, Xinyu Zhang, Dandan Song, Xiancai Chen, Wanli Gu, Huipeng Ma, Yuhang Tian, Mengdi Zhang, Linmei Hu
+Recently computer-aided diagnosis has demonstrated promising performance,
+effectively alleviating the workload of clinicians. However, the inherent
+sample imbalance among different diseases leads algorithms biased to the
+majority categories, leading to poor performance for rare categories. Existing
+works formulated this challenge as a long-tailed problem and attempted to
+tackle it by decoupling the feature representation and classification. Yet, due
+to the imbalanced distribution and limited samples from tail classes, these
+works are prone to biased representation learning and insufficient classifier
+calibration. To tackle these problems, we propose a new Long-tailed Medical
+Diagnosis (LMD) framework for balanced medical image classification on
+long-tailed datasets. In the initial stage, we develop a Relation-aware
+Representation Learning (RRL) scheme to boost the representation ability by
+encouraging the encoder to capture intrinsic semantic features through
+different data augmentations. In the subsequent stage, we propose an Iterative
+Classifier Calibration (ICC) scheme to calibrate the classifier iteratively.
+This is achieved by generating a large number of balanced virtual features and
+fine-tuning the encoder using an Expectation-Maximization manner. The proposed
+ICC compensates for minority categories to facilitate unbiased classifier
+optimization while maintaining the diagnostic knowledge in majority classes.
+Comprehensive experiments on three public long-tailed medical datasets
+demonstrate that our LMD framework significantly surpasses state-of-the-art
+approaches. The source code can be accessed at
+https://github.com/peterlipan/LMD.
 
-Code generation has attracted increasing attention with the rise of Large
-Language Models (LLMs). Many studies have developed powerful code LLMs by
-synthesizing code-related instruction data and applying supervised fine-tuning.
-However, these methods are limited by teacher model distillation and ignore the
-potential of iterative refinement by self-generated code. In this paper, we
-propose Adaptive Critique Refinement (ACR), which enables the model to refine
-itself by self-generated code and external critique, rather than directly
-imitating the code responses of the teacher model. Concretely, ACR includes a
-composite scoring system with LLM-as-a-Judge to evaluate the quality of code
-responses and a selective critique strategy with LLM-as-a-Critic to critique
-self-generated low-quality code responses. We develop the RefineCoder series by
-iteratively applying ACR, achieving continuous performance improvement on
-multiple code generation benchmarks. Compared to the baselines of the same
-size, our proposed RefineCoder series can achieve comparable or even superior
-performance using less data.
+摘要：<paragraph>最近，计算机辅助诊断已展现出可观的表现，有效减轻了临床医生的工作量。然而，不同疾病之间固有的样本不平衡导致算法偏向于多数类别，从而导致罕见类别表现不佳。现有工作将这一挑战表述为长尾问题，并尝试通过解耦特征表示和分类来解决它。然而，由于不平衡分布和尾类样本有限，这些工作容易出现有偏差的表示学习和分类器校准不足。为了解决这些问题，我们提出了一个新的长尾医学诊断 (LMD) 框架，用于对长尾数据集进行平衡的医学图像分类。在初始阶段，我们开发了一个关系感知表示学习 (RRL) 方案，通过鼓励编码器通过不同的数据增强来捕获内在语义特征，从而提升表示能力。在后续阶段，我们提出了一个迭代分类器校准 (ICC) 方案，以迭代方式校准分类器。这是通过生成大量的平衡虚拟特征并使用期望最大化方式微调编码器来实现的。所提出的 ICC 补偿了少数类别，以促进无偏分类器优化，同时保持多数类别的诊断知识。在三个公共长尾医学数据集上进行的综合实验表明，我们的 LMD 框架明显超越了最先进的方法。源代码可在 https://github.com/peterlipan/LMD 处获取。</paragraph>
 
-摘要：隨著大型語言模型 (LLM) 的興起，程式碼生成備受關注。許多研究透過綜合與程式碼相關的指令資料並應用監督式微調來開發強大的程式碼 LLM。然而，這些方法受到教師模型蒸餾的限制，且忽略了透過自行產生的程式碼進行反覆改進的潛力。在本文中，我們提出適應性批判改進 (ACR)，它使模型能夠透過自行產生的程式碼和外部批判來改進自身，而不是直接模仿教師模型的程式碼回應。具體來說，ACR 包含一個複合評分系統，其中 LLM 作為評審員來評估程式碼回應的品質，以及一個選擇性批判策略，其中 LLM 作為批判者來批判自行產生的低品質程式碼回應。我們透過反覆套用 ACR 來開發 RefineCoder 系列，在多個程式碼生成基準上實現持續的效能改善。與相同規模的基準相比，我們提出的 RefineCoder 系列可以使用較少資料來實現相當甚至更優異的效能。
+##### **Fine-Tuning Strategies for Continual Online EEG Motor Imagery Decoding: Insights from a Large-Scale Longitudinal Study**
+2502.06828v1 by Martin Wimpff, Bruno Aristimunha, Sylvain Chevallier, Bin Yang
 
-##### **FLAME: Flexible LLM-Assisted Moderation Engine**
-2502.09175v1 by Ivan Bakulin, Ilia Kopanichuk, Iaroslav Bespalov, Nikita Radchenko, Vladimir Shaposhnikov, Dmitry Dylov, Ivan Oseledets
+This study investigates continual fine-tuning strategies for deep learning in
+online longitudinal electroencephalography (EEG) motor imagery (MI) decoding
+within a causal setting involving a large user group and multiple sessions per
+participant. We are the first to explore such strategies across a large user
+group, as longitudinal adaptation is typically studied in the single-subject
+setting with a single adaptation strategy, which limits the ability to
+generalize findings. First, we examine the impact of different fine-tuning
+approaches on decoder performance and stability. Building on this, we integrate
+online test-time adaptation (OTTA) to adapt the model during deployment,
+complementing the effects of prior fine-tuning. Our findings demonstrate that
+fine-tuning that successively builds on prior subject-specific information
+improves both performance and stability, while OTTA effectively adapts the
+model to evolving data distributions across consecutive sessions, enabling
+calibration-free operation. These results offer valuable insights and
+recommendations for future research in longitudinal online MI decoding and
+highlight the importance of combining domain adaptation strategies for
+improving BCI performance in real-world applications. Clinical Relevance: Our
+investigation enables more stable and efficient long-term motor imagery
+decoding, which is critical for neurorehabilitation and assistive technologies.
 
-The rapid advancement of Large Language Models (LLMs) has introduced
-significant challenges in moderating user-model interactions. While LLMs
-demonstrate remarkable capabilities, they remain vulnerable to adversarial
-attacks, particularly ``jailbreaking'' techniques that bypass content safety
-measures. Current content moderation systems, which primarily rely on input
-prompt filtering, have proven insufficient, with techniques like Best-of-N
-(BoN) jailbreaking achieving success rates of 80% or more against popular LLMs.
-In this paper, we introduce Flexible LLM-Assisted Moderation Engine (FLAME): a
-new approach that shifts the focus from input filtering to output moderation.
-Unlike traditional circuit-breaking methods that analyze user queries, FLAME
-evaluates model responses, offering several key advantages: (1) computational
-efficiency in both training and inference, (2) enhanced resistance to BoN
-jailbreaking attacks, and (3) flexibility in defining and updating safety
-criteria through customizable topic filtering. Our experiments demonstrate that
-FLAME significantly outperforms current moderation systems. For example, FLAME
-reduces attack success rate in GPT-4o-mini and DeepSeek-v3 by a factor of ~9,
-while maintaining low computational overhead. We provide comprehensive
-evaluation on various LLMs and analyze the engine's efficiency against the
-state-of-the-art jailbreaking. This work contributes to the development of more
-robust and adaptable content moderation systems for LLMs.
+摘要：本研究探討在因果關係設定中涉及大量使用者群組和每個參與者多個階段的線上縱向腦電圖 (EEG) 運動想像 (MI) 解碼中，深度學習的持續微調策略。我們是第一個在大量使用者群組中探討此類策略，因為縱向適應通常在單一主體設定中研究，並使用單一適應策略，這限制了推廣研究結果的能力。首先，我們探討不同微調方法對解碼器效能和穩定性的影響。在此基礎上，我們整合線上測試時間適應 (OTTA) 以在部署期間適應模型，補充先前微調的效果。我們的研究結果表明，連續建立在先前特定主體資訊上的微調可以同時改善效能和穩定性，而 OTTA 可以有效地適應連續階段中不斷變化的資料分佈，從而實現無需校準的操作。這些結果為縱向線上 MI 解碼的未來研究提供了有價值的見解和建議，並強調了結合領域適應策略以改善實際應用中 BCI 效能的重要性。臨床相關性：我們的研究可以實現更穩定、更有效的長期運動想像解碼，這對於神經復健和輔助技術至關重要。
 
-摘要：大型語言模型 (LLM) 的快速進步為調節使用者與模型互動帶來重大挑戰。儘管 LLM 展現出非凡的能力，但它們仍然容易受到對抗性攻擊，特別是繞過內容安全措施的「越獄」技術。目前的內容審核系統主要依賴輸入提示過濾，已被證明不足，例如 Best-of-N (BoN) 越獄對抗熱門 LLM 的成功率達到 80% 以上。在本文中，我們介紹了靈活的 LLM 輔助審核引擎 (FLAME)：一種新的方法，將重點從輸入過濾轉移到輸出審核。與分析使用者查詢的傳統電路中斷方法不同，FLAME 評估模型回應，提供幾個關鍵優勢：(1) 訓練和推理中的計算效率，(2) 增強對 BoN 越獄攻擊的抵抗力，以及 (3) 透過可自訂主題過濾定義和更新安全標準的靈活性。我們的實驗證明，FLAME 明顯優於目前的審核系統。例如，FLAME 將 GPT-4o-mini 和 DeepSeek-v3 的攻擊成功率降低了約 9 倍，同時保持較低的計算負擔。我們對各種 LLM 進行了全面的評估，並分析了引擎對抗最新越獄的效率。這項工作有助於開發更強大且適應性更強的 LLM 內容審核系統。
+##### **MedBioLM: Optimizing Medical and Biological QA with Fine-Tuned Large Language Models and Retrieval-Augmented Generation**
+2502.03004v1 by Seonok Kim
 
-##### **Two-Stage Representation Learning for Analyzing Movement Behavior Dynamics in People Living with Dementia**
-2502.09173v1 by Jin Cui, Alexander Capstick, Payam Barnaghi, Gregory Scott
+Large Language Models (LLMs) have demonstrated impressive capabilities across
+natural language processing tasks. However, their application to specialized
+domains such as medicine and biology requires further optimization to ensure
+factual accuracy, reliability, and contextual depth. We introduce MedBioLM, a
+domain-adapted biomedical question-answering model designed to enhance both
+short-form and long-form queries. By integrating fine-tuning and
+retrieval-augmented generation (RAG), MedBioLM dynamically incorporates
+domain-specific knowledge, improving reasoning abilities and factual accuracy.
+To evaluate its effectiveness, we fine-tuned the model on diverse biomedical QA
+datasets, covering structured multiple-choice assessments and complex clinical
+reasoning tasks. Fine-tuning significantly improves accuracy on benchmark
+datasets, while RAG enhances factual consistency. These results highlight the
+potential of domain-optimized LLMs in advancing biomedical research, medical
+education, and clinical decision support.
 
-In remote healthcare monitoring, time series representation learning reveals
-critical patient behavior patterns from high-frequency data. This study
-analyzes home activity data from individuals living with dementia by proposing
-a two-stage, self-supervised learning approach tailored to uncover low-rank
-structures. The first stage converts time-series activities into text sequences
-encoded by a pre-trained language model, providing a rich, high-dimensional
-latent state space using a PageRank-based method. This PageRank vector captures
-latent state transitions, effectively compressing complex behaviour data into a
-succinct form that enhances interpretability. This low-rank representation not
-only enhances model interpretability but also facilitates clustering and
-transition analysis, revealing key behavioral patterns correlated with
-clinicalmetrics such as MMSE and ADAS-COG scores. Our findings demonstrate the
-framework's potential in supporting cognitive status prediction, personalized
-care interventions, and large-scale health monitoring.
+摘要：大型語言模型 (LLM) 已展現出在自然語言處理任務中令人印象深刻的能力。然而，要將其應用於醫學和生物學等特定領域，需要進一步最佳化，以確保事實的準確性、可靠性以及脈絡的深度。我們引進了 MedBioLM，這是一個適應領域的生物醫學問答模型，旨在增強短式和長式查詢。透過整合微調和檢索增強生成 (RAG)，MedBioLM 能動態地納入領域特定的知識，從而提升推理能力和事實準確性。為了評估其有效性，我們對模型進行微調，使其涵蓋結構化的多重選擇評量和複雜的臨床推理任務等多樣化的生物醫學問答資料集。微調顯著提升了基準資料集的準確性，而 RAG 則增強了事實的一致性。這些結果突顯了領域最佳化的 LLM 在推進生物醫學研究、醫學教育和臨床決策支援方面的潛力。
 
-摘要：在遠程醫療監控中，時間序列表示學習揭示了高頻率數據中的關鍵患者行為模式。本研究通過提出一個兩階段、自我監督的學習方法來分析痴呆症患者的家庭活動數據，該方法專門用於發現低秩結構。第一階段將時間序列活動轉換為由預訓練語言模型編碼的文本序列，使用基於 PageRank 的方法提供了一個豐富、高維的潛在狀態空間。此 PageRank 向量捕獲潛在狀態轉換，有效地將複雜的行為數據壓縮成簡潔的形式，從而增強了解力。此低秩表示不僅增強了模型的可解釋性，還促進了聚類和轉換分析，揭示了與臨床指標（例如 MMSE 和 ADAS-COG 分數）相關的關鍵行為模式。我們的研究結果證明了該框架在支持認知狀態預測、個性化護理干預和大型健康監控方面的潛力。
+##### **Contrastive Token-level Explanations for Graph-based Rumour Detection**
+2502.04366v1 by Daniel Wai Kit Chin, Roy Ka-Wei Lee
 
-##### **Musical Heritage Historical Entity Linking**
-2502.09168v1 by Arianna Graciotti, Nicolas Lazzari, Valentina Presutti, Rocco Tripodi
+The widespread use of social media has accelerated the dissemination of
+information, but it has also facilitated the spread of harmful rumours, which
+can disrupt economies, influence political outcomes, and exacerbate public
+health crises, such as the COVID-19 pandemic. While Graph Neural Network
+(GNN)-based approaches have shown significant promise in automated rumour
+detection, they often lack transparency, making their predictions difficult to
+interpret. Existing graph explainability techniques fall short in addressing
+the unique challenges posed by the dependencies among feature dimensions in
+high-dimensional text embeddings used in GNN-based models. In this paper, we
+introduce Contrastive Token Layerwise Relevance Propagation (CT-LRP), a novel
+framework designed to enhance the explainability of GNN-based rumour detection.
+CT-LRP extends current graph explainability methods by providing token-level
+explanations that offer greater granularity and interpretability. We evaluate
+the effectiveness of CT-LRP across multiple GNN models trained on three
+publicly available rumour detection datasets, demonstrating that it
+consistently produces high-fidelity, meaningful explanations, paving the way
+for more robust and trustworthy rumour detection systems.
 
-Linking named entities occurring in text to their corresponding entity in a
-Knowledge Base (KB) is challenging, especially when dealing with historical
-texts. In this work, we introduce Musical Heritage named Entities Recognition,
-Classification and Linking (MHERCL), a novel benchmark consisting of manually
-annotated sentences extrapolated from historical periodicals of the music
-domain. MHERCL contains named entities under-represented or absent in the most
-famous KBs. We experiment with several State-of-the-Art models on the Entity
-Linking (EL) task and show that MHERCL is a challenging dataset for all of
-them. We propose a novel unsupervised EL model and a method to extend
-supervised entity linkers by using Knowledge Graphs (KGs) to tackle the main
-difficulties posed by historical documents. Our experiments reveal that relying
-on unsupervised techniques and improving models with logical constraints based
-on KGs and heuristics to predict NIL entities (entities not represented in the
-KB of reference) results in better EL performance on historical documents.
+摘要：社群媒體的廣泛使用加速了資訊的傳播，但也促进了有害謠言的散播，這可能會擾亂經濟、影響政治結果，並加劇公共衛生危機，例如 COVID-19 大流行。雖然基於圖神經網路 (GNN) 的方法在自動化謠言偵測方面展現了顯著的前景，但它們通常缺乏透明度，這使得它們的預測難以解釋。現有的圖形可解釋性技術無法解決 GNN 模型中使用的維度嵌入式文本之間的依賴性所帶來的獨特挑戰。在本文中，我們介紹了對比標記分層關聯性傳播 (CT-LRP)，這是一個新穎的框架，旨在增強基於 GNN 的謠言偵測的可解釋性。CT-LRP 透過提供標記級別的解釋來擴充當前的圖形可解釋性方法，這些解釋提供了更細緻的粒度和可解釋性。我們在三個公開的謠言偵測資料集上訓練的幾個 GNN 模型中評估了 CT-LRP 的有效性，證明它始終產生高保真、有意義的解釋，為更強健且值得信賴的謠言偵測系統鋪路。
 
-摘要：將文本中出現的名稱實體連結到知識庫 (KB) 中對應的實體具有挑戰性，尤其是在處理歷史文本時。在這項工作中，我們引入了音樂遺產命名實體識別、分類和連結 (MHERCL)，這是一個由從音樂領域的歷史期刊中外推的手動標註句子組成的全新基準。MHERCL 包含在最著名的 KB 中代表性不足或不存在的名稱實體。我們在實體連結 (EL) 任務中對多個最先進的模型進行了實驗，並表明 MHERCL 對所有模型來說都是一個具有挑戰性的資料集。我們提出了一個新的無監督 EL 模型和一個通過使用知識圖 (KG) 來擴充監督式實體連結器的的方法，以解決歷史文件提出的主要難題。我們的實驗表明，依賴無監督技術並使用基於 KG 和啟發法的邏輯約束來改善模型以預測 NIL 實體（未在參考 KB 中表示的實體）會在歷史文件中產生更好的 EL 效能。
+##### **AI-Based Thermal Video Analysis in Privacy-Preserving Healthcare: A Case Study on Detecting Time of Birth**
+2502.04365v1 by Jorge García-Torres, Øyvind Meinich-Bache, Siren Rettedal, Kjersti Engan
 
-##### **Improving TCM Question Answering through Tree-Organized Self-Reflective Retrieval with LLMs**
-2502.09156v1 by Chang Liu, Ying Chang, Jianmin Li, Yiqian Qu, Yu Li, Lingyong Cao, Shuyuan Lin
+Approximately 10% of newborns need some assistance to start breathing and 5\%
+proper ventilation. It is crucial that interventions are initiated as soon as
+possible after birth. Accurate documentation of Time of Birth (ToB) is thereby
+essential for documenting and improving newborn resuscitation performance.
+However, current clinical practices rely on manual recording of ToB, typically
+with minute precision. In this study, we present an AI-driven, video-based
+system for automated ToB detection using thermal imaging, designed to preserve
+the privacy of healthcare providers and mothers by avoiding the use of
+identifiable visual data. Our approach achieves 91.4% precision and 97.4%
+recall in detecting ToB within thermal video clips during performance
+evaluation. Additionally, our system successfully identifies ToB in 96% of test
+cases with an absolute median deviation of 1 second compared to manual
+annotations. This method offers a reliable solution for improving ToB
+documentation and enhancing newborn resuscitation outcomes.
 
-Objectives: Large language models (LLMs) can harness medical knowledge for
-intelligent question answering (Q&A), promising support for auxiliary diagnosis
-and medical talent cultivation. However, there is a deficiency of highly
-efficient retrieval-augmented generation (RAG) frameworks within the domain of
-Traditional Chinese Medicine (TCM). Our purpose is to observe the effect of the
-Tree-Organized Self-Reflective Retrieval (TOSRR) framework on LLMs in TCM Q&A
-tasks.
-  Materials and Methods: We introduce the novel approach of knowledge
-organization, constructing a tree structure knowledge base with hierarchy. At
-inference time, our self-reflection framework retrieves from this knowledge
-base, integrating information across chapters. Questions from the TCM Medical
-Licensing Examination (MLE) and the college Classics Course Exam (CCE) were
-randomly selected as benchmark datasets.
-  Results: By coupling with GPT-4, the framework can improve the best
-performance on the TCM MLE benchmark by 19.85% in absolute accuracy, and
-improve recall accuracy from 27% to 38% on CCE datasets. In manual evaluation,
-the framework improves a total of 18.52 points across dimensions of safety,
-consistency, explainability, compliance, and coherence.
-  Conclusion: The TOSRR framework can effectively improve LLM's capability in
-Q&A tasks of TCM.
+摘要：約 10% 的新生兒需要協助才能開始呼吸，5% 需要適當的通氣。在出生後盡快開始介入至關重要。準確記錄出生時間 (ToB) 對於記錄和改善新生兒復甦表現至關重要。然而，目前的臨床實務依賴於手動記錄 ToB，通常精確到分鐘。在這項研究中，我們提出一個以 AI 為主的、基於影片的系統，用於使用熱影像自動偵測 ToB，旨在透過避免使用可識別的視覺資料來保護醫療保健提供者和母親的隱私。我們的做法在執行評估期間，在熱影像片段中偵測 ToB 時達到了 91.4% 的精確度和 97.4% 的召回率。此外，我們的系統在 96% 的測試案例中成功識別出 ToB，與手動註解相比，絕對中位數偏差為 1 秒。此方法提供了一個可靠的解決方案，用於改善 ToB 記錄和增強新生兒復甦結果。
 
-摘要：目標：大型語言模型（LLM）可以利用醫療知識進行智能問答（Q&A），承諾支持輔助診斷和醫療人才培養。然而，在中醫領域內缺乏高效的檢索增強生成（RAG）框架。我們的目的是觀察樹組織自省檢索（TOSRR）框架對中醫問答任務中 LLM 的影響。
-材料和方法：我們引入了知識組織的新方法，構建了一個具有層次的樹結構知識庫。在推理時間，我們的自省框架從這個知識庫中檢索，整合章節中的信息。中醫醫師資格考試（MLE）和大學經典課程考試（CCE）中的問題被隨機選為基準數據集。
-結果：通過與 GPT-4 結合，該框架可以將中醫 MLE 基準上的最佳性能提高 19.85% 的絕對準確度，並將 CCE 數據集上的召回準確度從 27% 提高到 38%。在手動評估中，該框架在安全性、一致性、可解釋性、合規性和連貫性方面總共提高了 18.52 分。
-結論：TOSRR 框架可以有效提升 LLM 在中醫問答任務中的能力。
+##### **3D Foundation AI Model for Generalizable Disease Detection in Head Computed Tomography**
+2502.02779v1 by Weicheng Zhu, Haoxu Huang, Huanze Tang, Rushabh Musthyala, Boyang Yu, Long Chen, Emilio Vega, Thomas O'Donnell, Seena Dehkharghani, Jennifer A. Frontera, Arjun V. Masurkar, Kara Melmed, Narges Razavian
 
-##### **A Novel Dialect-Aware Framework for the Classification of Arabic Dialects and Emotions**
-2502.09128v1 by Nasser A Alsadhan
+Head computed tomography (CT) imaging is a widely-used imaging modality with
+multitudes of medical indications, particularly in assessing pathology of the
+brain, skull, and cerebrovascular system. It is commonly the first-line imaging
+in neurologic emergencies given its rapidity of image acquisition, safety,
+cost, and ubiquity. Deep learning models may facilitate detection of a wide
+range of diseases. However, the scarcity of high-quality labels and
+annotations, particularly among less common conditions, significantly hinders
+the development of powerful models. To address this challenge, we introduce
+FM-CT: a Foundation Model for Head CT for generalizable disease detection,
+trained using self-supervised learning. Our approach pre-trains a deep learning
+model on a large, diverse dataset of 361,663 non-contrast 3D head CT scans
+without the need for manual annotations, enabling the model to learn robust,
+generalizable features. To investigate the potential of self-supervised
+learning in head CT, we employed both discrimination with self-distillation and
+masked image modeling, and we construct our model in 3D rather than at the
+slice level (2D) to exploit the structure of head CT scans more comprehensively
+and efficiently. The model's downstream classification performance is evaluated
+using internal and three external datasets, encompassing both in-distribution
+(ID) and out-of-distribution (OOD) data. Our results demonstrate that the
+self-supervised foundation model significantly improves performance on
+downstream diagnostic tasks compared to models trained from scratch and
+previous 3D CT foundation models on scarce annotated datasets. This work
+highlights the effectiveness of self-supervised learning in medical imaging and
+sets a new benchmark for head CT image analysis in 3D, enabling broader use of
+artificial intelligence for head CT-based diagnosis.
 
-Arabic is one of the oldest languages still in use today. As a result,
-several Arabic-speaking regions have developed dialects that are unique to
-them. Dialect and emotion recognition have various uses in Arabic text
-analysis, such as determining an online customer's origin based on their
-comments. Furthermore, intelligent chatbots that are aware of a user's emotions
-can respond appropriately to the user. Current research in emotion detection in
-the Arabic language lacks awareness of how emotions are exhibited in different
-dialects, which motivates the work found in this study. This research addresses
-the problems of dialect and emotion classification in Arabic. Specifically,
-this is achieved by building a novel framework that can identify and predict
-Arabic dialects and emotions from a given text. The framework consists of three
-modules: A text-preprocessing module, a classification module, and a clustering
-module with the novel capability of building new dialect-aware emotion
-lexicons. The proposed framework generated a new emotional lexicon for
-different dialects. It achieved an accuracy of 88.9% in classifying Arabic
-dialects, which outperforms the state-of-the-art results by 6.45 percentage
-points. Furthermore, the framework achieved 89.1-79% accuracy in detecting
-emotions in the Egyptian and Gulf dialects, respectively.
+摘要：頭部電腦斷層掃描（CT）影像是一種廣泛使用的影像模式，具有
+大量的醫療適應症，特別是在評估腦部、頭骨和腦血管系統的病理時。由於其影像擷取速度快、安全性、成本低和普遍性，通常是神經緊急情況下的第一線影像。深度學習模型可以促進對各種疾病的檢測。然而，高品質標籤和註釋的稀缺，特別是在較不常見的疾病中，顯著地阻礙了強大模型的發展。為了應對這一挑戰，我們引入了 FM-CT：一個用於頭部 CT 的基礎模型，用於可概化的疾病檢測，並使用自我監督學習進行訓練。我們的做法在一個包含 361,663 個非對比 3D 頭部 CT 掃描的大型、多樣化的數據集上預訓練一個深度學習模型，而無需手動註釋，使模型能夠學習強健、可概化的特徵。為了探討自我監督學習在頭部 CT 中的潛力，我們同時採用了帶有自我蒸餾的判別和遮罩影像建模，並且我們以 3D 而不是切片層級（2D）構建我們的模型，以更全面、有效地利用頭部 CT 掃描的結構。該模型的下游分類效能使用內部和三個外部數據集進行評估，包括分佈內 (ID) 和分佈外 (OOD) 資料。我們的結果表明，與從頭開始訓練的模型和先前在稀疏註釋數據集上訓練的 3D CT 基礎模型相比，自我監督基礎模型顯著改善了下游診斷任務的效能。這項工作突顯了自我監督學習在醫學影像中的有效性，並為 3D 頭部 CT 影像分析設定了一個新的基準，讓人工智慧能夠更廣泛地用於基於頭部 CT 的診斷。
 
-摘要：阿拉伯語是現今仍在使用中最古老的語言之一。因此，幾個講阿拉伯語的地區發展出獨特的方言。方言和情緒辨識在阿拉伯語文本分析中有多種用途，例如根據在線客戶的評論來確定其來源。此外，知道使用者情緒的智慧聊天機器人可以適當地回應使用者。目前對阿拉伯語情緒偵測的研究缺乏對不同方言如何表現情緒的認識，這激勵了本研究中的工作。本研究探討了阿拉伯語中的方言和情緒分類問題。具體而言，這是通過建立一個新的框架來實現的，該框架可以識別和預測給定文本中的阿拉伯方言和情緒。該框架包含三個模組：文字預處理模組、分類模組和聚類模組，具有建立新的方言感知情緒詞彙表的新功能。所提出的框架為不同的方言生成了新的情緒詞彙表。它在分類阿拉伯方言方面達到了 88.9% 的準確率，比最先進的結果高出 6.45 個百分點。此外，該框架在檢測埃及和海灣方言的情緒方面分別達到了 89.1-79% 的準確率。
+##### **Adaptive Voxel-Weighted Loss Using L1 Norms in Deep Neural Networks for Detection and Segmentation of Prostate Cancer Lesions in PET/CT Images**
+2502.02756v1 by Obed Korshie Dzikunu, Shadab Ahamed, Amirhossein Toosi, Xiaoxiao Li, Arman Rahmim
 
-##### **Automatic Pruning via Structured Lasso with Class-wise Information**
-2502.09125v1 by Xiang Liu, Mingchen Li, Xia Li, Leigang Qu, Zifan Peng, Yijun Song, Zemin Liu, Linshan Jiang, Jialin Li
+This study proposes a new loss function for deep neural networks, L1-weighted
+Dice Focal Loss (L1DFL), that leverages L1 norms for adaptive weighting of
+voxels based on their classification difficulty, towards automated detection
+and segmentation of metastatic prostate cancer lesions in PET/CT scans. We
+obtained 380 PSMA [18-F] DCFPyL PET/CT scans of patients diagnosed with
+biochemical recurrence metastatic prostate cancer. We trained two 3D
+convolutional neural networks, Attention U-Net and SegResNet, and concatenated
+the PET and CT volumes channel-wise as input. The performance of our custom
+loss function was evaluated against the Dice and Dice Focal Loss functions. For
+clinical significance, we considered a detected region of interest (ROI) as a
+true positive if at least the voxel with the maximum standardized uptake value
+falls within the ROI. We assessed the models' performance based on the number
+of lesions in an image, tumour volume, activity, and extent of spread. The
+L1DFL outperformed the comparative loss functions by at least 13% on the test
+set. In addition, the F1 scores of the Dice Loss and the Dice Focal Loss were
+lower than that of L1DFL by at least 6% and 34%, respectively. The Dice Focal
+Loss yielded more false positives, whereas the Dice Loss was more sensitive to
+smaller volumes and struggled to segment larger lesions accurately. They also
+exhibited network-specific variations and yielded declines in segmentation
+accuracy with increased tumour spread. Our results demonstrate the potential of
+L1DFL to yield robust segmentation of metastatic prostate cancer lesions in
+PSMA PET/CT images. The results further highlight potential complexities
+arising from the variations in lesion characteristics that may influence
+automated prostate cancer tumour detection and segmentation. The code is
+publicly available at: https://github.com/ObedDzik/pca_segment.git.
 
-Most pruning methods concentrate on unimportant filters of neural networks.
-However, they face the loss of statistical information due to a lack of
-consideration for class-wise data. In this paper, from the perspective of
-leveraging precise class-wise information for model pruning, we utilize
-structured lasso with guidance from Information Bottleneck theory. Our approach
-ensures that statistical information is retained during the pruning process.
-With these techniques, we introduce two innovative adaptive network pruning
-schemes: sparse graph-structured lasso pruning with Information Bottleneck
-(\textbf{sGLP-IB}) and sparse tree-guided lasso pruning with Information
-Bottleneck (\textbf{sTLP-IB}). The key aspect is pruning model filters using
-sGLP-IB and sTLP-IB to better capture class-wise relatedness. Compared to
-multiple state-of-the-art methods, our approaches demonstrate superior
-performance across three datasets and six model architectures in extensive
-experiments. For instance, using the VGG16 model on the CIFAR-10 dataset, we
-achieve a parameter reduction of 85%, a decrease in FLOPs by 61%, and maintain
-an accuracy of 94.10% (0.14% higher than the original model); we reduce the
-parameters by 55% with the accuracy at 76.12% using the ResNet architecture on
-ImageNet (only drops 0.03%). In summary, we successfully reduce model size and
-computational resource usage while maintaining accuracy. Our codes are at
-https://anonymous.4open.science/r/IJCAI-8104.
+摘要：<paragraph>本研究針對深度神經網路提出一個新的損失函數，L1 加權 Dice 焦點損失 (L1DFL)，它利用 L1 範數根據體素的分類難度進行自適應加權，用於自動偵測和分割 PET/CT 掃描中轉移性前列腺癌病灶。我們取得 380 個經診斷為生化復發轉移性前列腺癌的患者的 PSMA [18-F] DCFPyL PET/CT 掃描。我們訓練了兩個 3D 捲積神經網路，Attention U-Net 和 SegResNet，並將 PET 和 CT 體積按通道連接作為輸入。我們自訂的損失函數的效能與 Dice 和 Dice 焦點損失函數進行評估。為了臨床意義，我們將一個偵測到的感興趣區域 (ROI) 視為真陽性，如果至少具有最大標準攝取值的體素落在 ROI 內。我們根據影像中的病灶數量、腫瘤體積、活性，以及擴散程度評估模型的效能。L1DFL 在測試組中至少比比較損失函數高出 13%。此外，Dice 損失和 Dice 焦點損失的 F1 分數分別比 L1DFL 低至少 6% 和 34%。Dice 焦點損失產生更多假陽性，而 Dice 損失對較小體積較為敏感，且難以準確分割較大病灶。它們也展現出網路特定的變化，並隨著腫瘤擴散而導致分割準確度下降。我們的結果證明 L1DFL 具有在 PSMA PET/CT 影像中產生轉移性前列腺癌病灶的強健分割的潛力。結果進一步強調由病灶特徵變化所產生的潛在複雜性，這可能會影響自動化前列腺癌腫瘤偵測和分割。程式碼公開於：https://github.com/ObedDzik/pca_segment.git。</paragraph>
 
-摘要：大多數剪枝方法都集中在神經網路中不重要的濾波器上。
-然而，由於缺乏對類別資料的考量，它們面臨統計資訊的遺失。在本文中，我們從利用精確類別資訊進行模型剪枝的角度，利用結構化套索搭配資訊瓶頸理論的指導。我們的做法確保在剪枝過程中保留統計資訊。藉由這些技術，我們引入了兩個創新的自適應網路剪枝方案：帶有資訊瓶頸的稀疏圖形結構套索剪枝（sGLP-IB）和帶有資訊瓶頸的稀疏樹導引套索剪枝（sTLP-IB）。關鍵方面是使用 sGLP-IB 和 sTLP-IB 剪枝模型濾波器，以更好地擷取類別關聯性。與多種最先進的方法相比，我們的做法在廣泛的實驗中展現出跨三個資料集和六個模型架構的卓越效能。例如，在 CIFAR-10 資料集上使用 VGG16 模型，我們達到了 85% 的參數減少、61% 的 FLOP 減少，並維持 94.10% 的準確度（比原始模型高 0.14%）；我們在 ImageNet 上使用 ResNet 架構將參數減少了 55%，準確度為 76.12%（僅下降 0.03%）。總之，我們成功地減少了模型大小和計算資源使用，同時維持準確度。我們的程式碼位於 https://anonymous.4open.science/r/IJCAI-8104。
+##### **Diffusion Instruction Tuning**
+2502.06814v1 by Chen Jin, Ryutaro Tanno, Amrutha Saseendran, Tom Diethe, Philip Teare
 
-##### **The influence of visual and linguistic cues on ignorance inference in Vision-Language Models (VLMs)**
-2502.09120v1 by Ye-eun Cho, Yunho Maeng
+We introduce Lavender, a simple supervised fine-tuning (SFT) method that
+boosts the performance of advanced vision-language models (VLMs) by leveraging
+state-of-the-art image generation models such as Stable Diffusion.
+Specifically, Lavender aligns the text-vision attention in the VLM transformer
+with the equivalent used by Stable Diffusion during SFT, instead of adapting
+separate encoders. This alignment enriches the model's visual understanding and
+significantly boosts performance across in- and out-of-distribution tasks.
+Lavender requires just 0.13 million training examples, 2.5% of typical
+large-scale SFT datasets, and fine-tunes on standard hardware (8 GPUs) in a
+single day. It consistently improves state-of-the-art open-source multimodal
+LLMs (e.g., Llama-3.2-11B, MiniCPM-Llama3-v2.5), achieving up to 30% gains and
+a 68% boost on challenging out-of-distribution medical QA tasks. By efficiently
+transferring the visual expertise of image generators with minimal supervision,
+Lavender offers a scalable solution for more accurate vision-language systems.
+All code, training data, and models will be shared at
+https://astrazeneca.github.io/vlm/.
 
-This study explored how Vision-Language Models (VLMs) process ignorance
-implicatures with visual and linguistic cues. Particularly, we focused on the
-effects of contexts (precise and approximate contexts) and modifier types (bare
-numerals, superlative, and comparative modifiers), which were considered
-pragmatic and semantic factors respectively. Methodologically, we conducted a
-truth-value judgment task in visually grounded settings using GPT-4o and Gemini
-1.5 Pro. The results indicate that while both models exhibited sensitivity to
-linguistic cues (modifier), they failed to process ignorance implicatures with
-visual cues (context) as humans do. Specifically, the influence of context was
-weaker and inconsistent across models, indicating challenges in pragmatic
-reasoning for VLMs. On the other hand, superlative modifiers were more strongly
-associated with ignorance implicatures as compared to comparative modifiers,
-supporting the semantic view. These findings highlight the need for further
-advancements in VLMs to process language-vision information in a
-context-dependent way to achieve human-like pragmatic inference.
+摘要：<paragraph>我們介紹 Lavender，一種簡單的監督微調 (SFT) 方法，它透過利用 Stable Diffusion 等最先進的影像生成模型來提升先進視覺語言模型 (VLM) 的效能。
+具體來說，Lavender 在 SFT 期間將 VLM 轉換器中的文字視覺注意力與 Stable Diffusion 使用的等效注意力對齊，而不是調整單獨的編碼器。此對齊豐富了模型的視覺理解，並顯著提升了分佈內外任務的效能。
+Lavender 只需要 0.13 百萬個訓練範例，相當於典型大型 SFT 資料集的 2.5%，並在標準硬體 (8 個 GPU) 上於一天內進行微調。它持續改善最先進的開放原始碼多模態 LLM（例如 Llama-3.2-11B、MiniCPM-Llama3-v2.5），在具有挑戰性的分佈外醫療 QA 任務中獲得高達 30% 的收益和 68% 的提升。透過有效轉移影像生成器的視覺專業知識，並僅需最少的監督，Lavender 提供了一個可擴充的解決方案，以實現更準確的視覺語言系統。
+所有程式碼、訓練資料和模型將在 https://astrazeneca.github.io/vlm/ 分享。</paragraph>
 
-摘要：本研究探討了視覺語言模型 (VLM) 如何處理視覺和語言線索中的無知含義。特別是，我們專注於語境（精確和近似語境）和修飾語類型（裸數字、最高級和比較級修飾語）的影響，這些分別被視為語用和語義因素。在方法論上，我們使用 GPT-4o 和 Gemini 1.5 Pro 在視覺基礎設置中進行了真值判斷任務。結果表明，儘管這兩個模型都對語言線索（修飾語）表現出敏感性，但它們未能像人類那樣處理帶有視覺線索（語境）的無知含義。具體來說，語境的影響在各個模型中較弱且不一致，表明 VLM 在語用推理方面存在挑戰。另一方面，與比較級修飾語相比，最高級修飾語與無知含義的關聯性更強，這支持了語義觀點。這些發現強調了 VLM 進一步發展的必要性，以以語境依賴的方式處理語言視覺信息，以實現類人語用推理。
+##### **MedRAX: Medical Reasoning Agent for Chest X-ray**
+2502.02673v1 by Adibvafa Fallahpour, Jun Ma, Alif Munim, Hongwei Lyu, Bo Wang
 
-##### **One-shot Federated Learning Methods: A Practical Guide**
-2502.09104v1 by Xiang Liu, Zhenheng Tang, Xia Li, Yijun Song, Sijie Ji, Zemin Liu, Bo Han, Linshan Jiang, Jialin Li
+Chest X-rays (CXRs) play an integral role in driving critical decisions in
+disease management and patient care. While recent innovations have led to
+specialized models for various CXR interpretation tasks, these solutions often
+operate in isolation, limiting their practical utility in clinical practice. We
+present MedRAX, the first versatile AI agent that seamlessly integrates
+state-of-the-art CXR analysis tools and multimodal large language models into a
+unified framework. MedRAX dynamically leverages these models to address complex
+medical queries without requiring additional training. To rigorously evaluate
+its capabilities, we introduce ChestAgentBench, a comprehensive benchmark
+containing 2,500 complex medical queries across 7 diverse categories. Our
+experiments demonstrate that MedRAX achieves state-of-the-art performance
+compared to both open-source and proprietary models, representing a significant
+step toward the practical deployment of automated CXR interpretation systems.
+Data and code have been publicly available at
+https://github.com/bowang-lab/MedRAX
 
-One-shot Federated Learning (OFL) is a distributed machine learning paradigm
-that constrains client-server communication to a single round, addressing
-privacy and communication overhead issues associated with multiple rounds of
-data exchange in traditional Federated Learning (FL). OFL demonstrates the
-practical potential for integration with future approaches that require
-collaborative training models, such as large language models (LLMs). However,
-current OFL methods face two major challenges: data heterogeneity and model
-heterogeneity, which result in subpar performance compared to conventional FL
-methods. Worse still, despite numerous studies addressing these limitations, a
-comprehensive summary is still lacking. To address these gaps, this paper
-presents a systematic analysis of the challenges faced by OFL and thoroughly
-reviews the current methods. We also offer an innovative categorization method
-and analyze the trade-offs of various techniques. Additionally, we discuss the
-most promising future directions and the technologies that should be integrated
-into the OFL field. This work aims to provide guidance and insights for future
-research.
+摘要：胸部 X 光片 (CXR) 在疾病管理和患者照護中扮演著不可或缺的角色，推動著關鍵決策的制定。儘管近期的創新已針對各種 CXR 解讀任務開發出專門的模型，但這些解決方案通常獨立運作，限制了它們在臨床實務中的實際效用。我們提出 MedRAX，這是一款首創的多功能 AI 代理，它將最先進的 CXR 分析工具和多模態大型語言模型無縫整合到一個統一的架構中。MedRAX 動態運用這些模型來解決複雜的醫療查詢，而無需額外的訓練。為了嚴格評估其功能，我們引入了 ChestAgentBench，這是一個全面的基準，包含 7 個不同類別的 2,500 個複雜醫療查詢。我們的實驗證明，與開源和專有模型相比，MedRAX 達到了最先進的效能，這代表了自動化 CXR 解讀系統實際部署的重要一步。資料和程式碼已公開於 https://github.com/bowang-lab/MedRAX
 
-摘要：單次聯邦學習 (OFL) 是一種分散式機器學習範例，將客戶端與伺服器通訊限制在單一輪次中，解決傳統聯邦學習 (FL) 中多輪次資料交換相關的隱私和通訊負擔問題。OFL 展示了與需要協作訓練模型的未來方法整合的實際潛力，例如大型語言模型 (LLM)。然而，目前的 OFL 方法面臨兩大挑戰：資料異質性和模型異質性，這導致與傳統 FL 方法相比，效能較差。更糟的是，儘管有許多研究探討這些限制，但仍缺乏全面的摘要。為了解決這些差距，本文對 OFL 面臨的挑戰進行系統分析，並徹底檢視目前的方法。我們還提供創新的分類方法，並分析各種技術的權衡取捨。此外，我們討論最有希望的未來方向，以及應整合到 OFL 領域的技術。這項工作旨在為未來的研究提供指導和見解。
+##### **Open Foundation Models in Healthcare: Challenges, Paradoxes, and Opportunities with GenAI Driven Personalized Prescription**
+2502.04356v1 by Mahdi Alkaeed, Sofiat Abioye, Adnan Qayyum, Yosra Magdi Mekki, Ilhem Berrou, Mohamad Abdallah, Ala Al-Fuqaha, Muhammad Bilal, Junaid Qadir
 
-##### **Logical Reasoning in Large Language Models: A Survey**
-2502.09100v1 by Hanmeng Liu, Zhizhang Fu, Mengru Ding, Ruoxi Ning, Chaoli Zhang, Xiaozhang Liu, Yue Zhang
+In response to the success of proprietary Large Language Models (LLMs) such
+as OpenAI's GPT-4, there is a growing interest in developing open,
+non-proprietary LLMs and AI foundation models (AIFMs) for transparent use in
+academic, scientific, and non-commercial applications. Despite their inability
+to match the refined functionalities of their proprietary counterparts, open
+models hold immense potential to revolutionize healthcare applications. In this
+paper, we examine the prospects of open-source LLMs and AIFMs for developing
+healthcare applications and make two key contributions. Firstly, we present a
+comprehensive survey of the current state-of-the-art open-source healthcare
+LLMs and AIFMs and introduce a taxonomy of these open AIFMs, categorizing their
+utility across various healthcare tasks. Secondly, to evaluate the
+general-purpose applications of open LLMs in healthcare, we present a case
+study on personalized prescriptions. This task is particularly significant due
+to its critical role in delivering tailored, patient-specific medications that
+can greatly improve treatment outcomes. In addition, we compare the performance
+of open-source models with proprietary models in settings with and without
+Retrieval-Augmented Generation (RAG). Our findings suggest that, although less
+refined, open LLMs can achieve performance comparable to proprietary models
+when paired with grounding techniques such as RAG. Furthermore, to highlight
+the clinical significance of LLMs-empowered personalized prescriptions, we
+perform subjective assessment through an expert clinician. We also elaborate on
+ethical considerations and potential risks associated with the misuse of
+powerful LLMs and AIFMs, highlighting the need for a cautious and responsible
+implementation in healthcare.
 
-With the emergence of advanced reasoning models like OpenAI o3 and
-DeepSeek-R1, large language models (LLMs) have demonstrated remarkable
-reasoning capabilities. However, their ability to perform rigorous logical
-reasoning remains an open question. This survey synthesizes recent advancements
-in logical reasoning within LLMs, a critical area of AI research. It outlines
-the scope of logical reasoning in LLMs, its theoretical foundations, and the
-benchmarks used to evaluate reasoning proficiency. We analyze existing
-capabilities across different reasoning paradigms - deductive, inductive,
-abductive, and analogical - and assess strategies to enhance reasoning
-performance, including data-centric tuning, reinforcement learning, decoding
-strategies, and neuro-symbolic approaches. The review concludes with future
-directions, emphasizing the need for further exploration to strengthen logical
-reasoning in AI systems.
+摘要：<paragraph>為了回應 OpenAI 的 GPT-4 等專有大型語言模型 (LLM) 的成功，開發開放、非專有的 LLM 和人工智慧基礎模型 (AIFM) 以透明地用於學術、科學和非商業應用中，引起了越來越大的興趣。儘管無法與其專有對應產品的精緻功能相匹配，但開放模型在革新醫療保健應用方面具有巨大的潛力。在本文中，我們探討了開放原始碼 LLM 和 AIFM 在開發醫療保健應用方面的前景，並提出了兩項關鍵貢獻。首先，我們對當前最先進的開放原始碼醫療保健 LLM 和 AIFM 進行了全面的調查，並介紹了這些開放 AIFM 的分類法，對它們在各種醫療保健任務中的效用進行了分類。其次，為了評估開放 LLM 在醫療保健中的通用應用，我們對個人化處方進行了案例研究。這項任務特別重要，因為它在提供量身定制的患者特定藥物方面發揮著關鍵作用，可以大大改善治療效果。此外，我們比較了開放原始碼模型與專有模型在有和沒有檢索增強生成 (RAG) 的設置中的性能。我們的研究結果表明，儘管不太精緻，但開放 LLM 在與 RAG 等基礎技術配對時，可以實現與專有模型相當的性能。此外，為了強調 LLM 賦能的個性化處方的臨床意義，我們通過專家臨床醫生進行了主觀評估。我們還詳細說明了與濫用強大的 LLM 和 AIFM 相關的倫理考量和潛在風險，強調了在醫療保健中謹慎和負責任地實施的必要性。</paragraph>
 
-摘要：隨著 OpenAI o3 和 DeepSeek-R1 等先進推理模型的出現，大型語言模型 (LLM) 已展現出非凡的推理能力。然而，它們執行嚴謹邏輯推理的能力仍是一個開放性的問題。此調查綜合了 LLM 中邏輯推理的最新進展，這是 AI 研究的一個關鍵領域。它概述了 LLM 中邏輯推理的範圍、其理論基礎，以及用於評估推理能力的基準。我們分析了不同推理範例（演繹、歸納、外推和類比）中的現有能力，並評估增強推理效能的策略，包括以數據為中心的調整、強化學習、解碼策略和神經符號方法。此評論以未來的方向作為結論，強調需要進一步探索以強化 AI 系統中的邏輯推理。
+##### **Decision Theoretic Foundations for Conformal Prediction: Optimal Uncertainty Quantification for Risk-Averse Agents**
+2502.02561v1 by Shayan Kiyani, George Pappas, Aaron Roth, Hamed Hassani
 
-##### **A Hybrid Transformer Model for Fake News Detection: Leveraging Bayesian Optimization and Bidirectional Recurrent Unit**
-2502.09097v1 by Tianyi Huang, Zeqiu Xu, Peiyang Yu, Jingyuan Yi, Xiaochuan Xu
+A fundamental question in data-driven decision making is how to quantify the
+uncertainty of predictions in ways that can usefully inform downstream action.
+This interface between prediction uncertainty and decision-making is especially
+important in risk-sensitive domains, such as medicine. In this paper, we
+develop decision-theoretic foundations that connect uncertainty quantification
+using prediction sets with risk-averse decision-making. Specifically, we answer
+three fundamental questions: (1) What is the correct notion of uncertainty
+quantification for risk-averse decision makers? We prove that prediction sets
+are optimal for decision makers who wish to optimize their value at risk. (2)
+What is the optimal policy that a risk averse decision maker should use to map
+prediction sets to actions? We show that a simple max-min decision policy is
+optimal for risk-averse decision makers. Finally, (3) How can we derive
+prediction sets that are optimal for such decision makers? We provide an exact
+characterization in the population regime and a distribution free finite-sample
+construction. Answering these questions naturally leads to an algorithm,
+Risk-Averse Calibration (RAC), which follows a provably optimal design for
+deriving action policies from predictions. RAC is designed to be both
+practical-capable of leveraging the quality of predictions in a black-box
+manner to enhance downstream utility-and safe-adhering to a user-defined risk
+threshold and optimizing the corresponding risk quantile of the user's
+downstream utility. Finally, we experimentally demonstrate the significant
+advantages of RAC in applications such as medical diagnosis and recommendation
+systems. Specifically, we show that RAC achieves a substantially improved
+trade-off between safety and utility, offering higher utility compared to
+existing methods while maintaining the safety guarantee.
 
-In this paper, we propose an optimized Transformer model that integrates
-Bayesian algorithms with a Bidirectional Gated Recurrent Unit (BiGRU), and
-apply it to fake news classification for the first time. First, we employ the
-TF-IDF method to extract features from news texts and transform them into
-numeric representations to facilitate subsequent machine learning tasks. Two
-sets of experiments are then conducted for fake news detection and
-classification: one using a Transformer model optimized only with BiGRU, and
-the other incorporating Bayesian algorithms into the BiGRU-based Transformer.
-Experimental results show that the BiGRU-optimized Transformer achieves 100%
-accuracy on the training set and 99.67% on the test set, while the addition of
-the Bayesian algorithm maintains 100% accuracy on the training set and slightly
-improves test-set accuracy to 99.73%. This indicates that the Bayesian
-algorithm boosts model accuracy by 0.06%, further enhancing the detection
-capability for fake news. Moreover, the proposed algorithm converges rapidly at
-around the 10th training epoch with accuracy nearing 100%, demonstrating both
-its effectiveness and its fast classification ability. Overall, the optimized
-Transformer model, enhanced by the Bayesian algorithm and BiGRU, exhibits
-excellent continuous learning and detection performance, offering a robust
-technical means to combat the spread of fake news in the current era of
-information overload.
+摘要：<paragraph>在資料驅動決策中，一個基本問題是，如何量化預測的不確定性，以能有用地告知下游行動。
+預測不確定性和決策制定之間的這種介面，在風險敏感領域中特別重要，例如醫學。在本文中，我們
+發展了決策理論基礎，它利用預測集合將不確定性量化與風險規避決策制定聯繫起來。具體來說，我們回答
+了三個基本問題：(1) 對於風險規避決策者來說，不確定性量化的正確概念是什麼？我們證明，對於希望最佳化其風險價值的決策者來說，預測集合是最佳的。(2)
+風險規避決策者應使用什麼最佳政策，將預測集合映射到行動？我們表明，對於風險規避決策者來說，一個簡單的最大最小決策政策是最佳的。最後，(3) 我們如何推導出對此類決策者來說最佳的預測集合？我們在總體範圍內提供了一個確切的表徵，並提供了一個不依賴分佈的有限樣本建構。回答這些問題自然會導致一個演算法，風險規避校準 (RAC)，它遵循一個可證明最佳的設計，從預測中推導出行動政策。RAC 被設計為既實用——能夠以黑盒方式利用預測的品質來增強下游效用——又安全——遵守使用者定義的風險閾值，並最佳化使用者的下游效用的對應風險分位數。最後，我們在醫學診斷和推薦系統等應用中，以實驗方式證明了 RAC 的顯著優點。具體來說，我們表明，與現有方法相比，RAC 在安全性和效用之間實現了顯著改善的折衷，在維持安全保證的同時，提供了更高的效用。</paragraph>
 
-摘要：<paragraph>在本文中，我們提出了一個最佳化的 Transformer 模型，它將貝氏演算法與雙向門控遞迴單元 (BiGRU) 整合在一起，並首次將其應用於假新聞分類。首先，我們採用 TF-IDF 方法從新聞文本中提取特徵，並將它們轉換為數值表示，以利於後續的機器學習任務。接著進行兩組實驗，分別針對假新聞偵測和分類：一組使用僅使用 BiGRU 最佳化的 Transformer 模型，另一組將貝氏演算法納入基於 BiGRU 的 Transformer 中。實驗結果顯示，BiGRU 最佳化的 Transformer 在訓練組上達到 100% 的準確度，在測試組上達到 99.67%，而加入貝氏演算法後，在訓練組上維持 100% 的準確度，並將測試組的準確度略微提升至 99.73%。這表示貝氏演算法將模型準確度提升了 0.06%，進一步增強了對假新聞的偵測能力。此外，所提出的演算法在約第 10 個訓練週期時快速收斂，準確度接近 100%，證明了它的有效性和快速的分類能力。總的來說，由貝氏演算法和 BiGRU 增強的最佳化 Transformer 模型展現出絕佳的持續學習和偵測效能，提供了一個強健的技術手段來對抗在當前資訊過載時代中假新聞的散布。</paragraph>
+##### **CoRPA: Adversarial Image Generation for Chest X-rays Using Concept Vector Perturbations and Generative Models**
+2502.05214v1 by Amy Rafferty, Rishi Ramaesh, Ajitha Rajan
 
-##### **A Hybrid Model for Few-Shot Text Classification Using Transfer and Meta-Learning**
-2502.09086v1 by Jia Gao, Shuangquan Lyu, Guiran Liu, Binrong Zhu, Hongye Zheng, Xiaoxuan Liao
+Deep learning models for medical image classification tasks are becoming
+widely implemented in AI-assisted diagnostic tools, aiming to enhance
+diagnostic accuracy, reduce clinician workloads, and improve patient outcomes.
+However, their vulnerability to adversarial attacks poses significant risks to
+patient safety. Current attack methodologies use general techniques such as
+model querying or pixel value perturbations to generate adversarial examples
+designed to fool a model. These approaches may not adequately address the
+unique characteristics of clinical errors stemming from missed or incorrectly
+identified clinical features. We propose the Concept-based Report Perturbation
+Attack (CoRPA), a clinically-focused black-box adversarial attack framework
+tailored to the medical imaging domain. CoRPA leverages clinical concepts to
+generate adversarial radiological reports and images that closely mirror
+realistic clinical misdiagnosis scenarios. We demonstrate the utility of CoRPA
+using the MIMIC-CXR-JPG dataset of chest X-rays and radiological reports. Our
+evaluation reveals that deep learning models exhibiting strong resilience to
+conventional adversarial attacks are significantly less robust when subjected
+to CoRPA's clinically-focused perturbations. This underscores the importance of
+addressing domain-specific vulnerabilities in medical AI systems. By
+introducing a specialized adversarial attack framework, this study provides a
+foundation for developing robust, real-world-ready AI models in healthcare,
+ensuring their safe and reliable deployment in high-stakes clinical
+environments.
 
-With the continuous development of natural language processing (NLP)
-technology, text classification tasks have been widely used in multiple
-application fields. However, obtaining labeled data is often expensive and
-difficult, especially in few-shot learning scenarios. To solve this problem,
-this paper proposes a few-shot text classification model based on transfer
-learning and meta-learning. The model uses the knowledge of the pre-trained
-model for transfer and optimizes the model's rapid adaptability in few-sample
-tasks through a meta-learning mechanism. Through a series of comparative
-experiments and ablation experiments, we verified the effectiveness of the
-proposed method. The experimental results show that under the conditions of few
-samples and medium samples, the model based on transfer learning and
-meta-learning significantly outperforms traditional machine learning and deep
-learning methods. In addition, ablation experiments further analyzed the
-contribution of each component to the model performance and confirmed the key
-role of transfer learning and meta-learning in improving model accuracy.
-Finally, this paper discusses future research directions and looks forward to
-the potential of this method in practical applications.
+摘要：深度学习模型用于医学影像分类任务，在人工智能辅助诊断工具中得到广泛应用，旨在提高诊断准确性、减少临床医生的工作量并改善患者的治疗效果。然而，它们对对抗性攻击的脆弱性给患者安全带来了重大风险。目前的攻击方法使用通用技术，例如模型查询或像素值扰动来生成对抗性示例，旨在欺骗模型。这些方法可能无法充分解决源自遗漏或错误识别的临床特征的临床错误的独特特征。我们提出了基于概念的报告扰动攻击 (CoRPA)，这是一种以临床为中心的、针对医学成像领域的、黑盒对抗性攻击框架。CoRPA 利用临床概念来生成对抗性放射学报告和图像，这些报告和图像与现实的临床误诊场景非常相似。我们使用胸部 X 射线和放射学报告的 MIMIC-CXR-JPG 数据集演示了 CoRPA 的效用。我们的评估表明，对传统对抗性攻击表现出强大弹性的深度学习模型在受到 CoRPA 以临床为中心的扰动时，其鲁棒性明显降低。这强调了在医疗人工智能系统中解决特定领域漏洞的重要性。通过引入专门的对抗性攻击框架，本研究为在医疗保健领域开发健壮、面向现实世界的 AI 模型奠定了基础，确保它们在高风险临床环境中安全可靠地部署。
 
-摘要：隨著自然語言處理 (NLP) 技術的持續發展，文本分類任務已廣泛應用於多個應用領域。然而，獲取標記資料通常既昂貴又困難，特別是在小樣本學習場景中。為了解決這個問題，本文提出了一個基於遷移學習和元學習的少樣本文本分類模型。該模型利用預訓練模型的知識進行遷移，並透過元學習機制最佳化模型在少樣本任務中的快速適應性。透過一系列的比較實驗和消融實驗，我們驗證了所提出方法的有效性。實驗結果表明，在少樣本和中等樣本的條件下，基於遷移學習和元學習的模型明顯優於傳統機器學習和深度學習方法。此外，消融實驗進一步分析了各個組成部分對模型效能的貢獻，並確認了遷移學習和元學習在提升模型準確度中的關鍵作用。最後，本文探討了未來的研究方向，並期待此方法在實際應用中的潛力。
+##### **A Self-Supervised Framework for Improved Generalisability in Ultrasound B-mode Image Segmentation**
+2502.02489v1 by Edward Ellis, Andrew Bulpitt, Nasim Parsa, Michael F Byrne, Sharib Ali
 
-##### **Show Me the Work: Fact-Checkers' Requirements for Explainable Automated Fact-Checking**
-2502.09083v1 by Greta Warren, Irina Shklovski, Isabelle Augenstein
+Ultrasound (US) imaging is clinically invaluable due to its noninvasive and
+safe nature. However, interpreting US images is challenging, requires
+significant expertise, and time, and is often prone to errors. Deep learning
+offers assistive solutions such as segmentation. Supervised methods rely on
+large, high-quality, and consistently labeled datasets, which are challenging
+to curate. Moreover, these methods tend to underperform on out-of-distribution
+data, limiting their clinical utility. Self-supervised learning (SSL) has
+emerged as a promising alternative, leveraging unlabeled data to enhance model
+performance and generalisability. We introduce a contrastive SSL approach
+tailored for B-mode US images, incorporating a novel Relation Contrastive Loss
+(RCL). RCL encourages learning of distinct features by differentiating positive
+and negative sample pairs through a learnable metric. Additionally, we propose
+spatial and frequency-based augmentation strategies for the representation
+learning on US images. Our approach significantly outperforms traditional
+supervised segmentation methods across three public breast US datasets,
+particularly in data-limited scenarios. Notable improvements on the Dice
+similarity metric include a 4% increase on 20% and 50% of the BUSI dataset,
+nearly 6% and 9% improvements on 20% and 50% of the BrEaST dataset, and 6.4%
+and 3.7% improvements on 20% and 50% of the UDIAT dataset, respectively.
+Furthermore, we demonstrate superior generalisability on the
+out-of-distribution UDIAT dataset with performance boosts of 20.6% and 13.6%
+compared to the supervised baseline using 20% and 50% of the BUSI and BrEaST
+training data, respectively. Our research highlights that domain-inspired SSL
+can improve US segmentation, especially under data-limited conditions.
 
-The pervasiveness of large language models and generative AI in online media
-has amplified the need for effective automated fact-checking to assist
-fact-checkers in tackling the increasing volume and sophistication of
-misinformation. The complex nature of fact-checking demands that automated
-fact-checking systems provide explanations that enable fact-checkers to
-scrutinise their outputs. However, it is unclear how these explanations should
-align with the decision-making and reasoning processes of fact-checkers to be
-effectively integrated into their workflows. Through semi-structured interviews
-with fact-checking professionals, we bridge this gap by: (i) providing an
-account of how fact-checkers assess evidence, make decisions, and explain their
-processes; (ii) examining how fact-checkers use automated tools in practice;
-and (iii) identifying fact-checker explanation requirements for automated
-fact-checking tools. The findings show unmet explanation needs and identify
-important criteria for replicable fact-checking explanations that trace the
-model's reasoning path, reference specific evidence, and highlight uncertainty
-and information gaps.
+摘要：超音波 (US) 影像由於其非侵入性且安全的特性，在臨床上極具價值。然而，解讀超音波影像具有挑戰性，需要大量的專業知識和時間，而且經常容易出錯。深度學習提供了輔助解決方案，例如分割。監督式方法依賴於大量、高品質且標籤一致的資料集，而這在策劃上具有挑戰性。此外，這些方法在分佈外資料上的表現往往不佳，這限制了它們的臨床效用。自監督學習 (SSL) 已成為一種有前途的替代方案，它利用未標籤資料來增強模型效能和泛化能力。我們提出了一種對比式 SSL 方法，專門針對 B 模式超音波影像，並納入了新穎的關係對比損失 (RCL)。RCL 透過一個可學習的指標區分正負樣本對，來鼓勵學習不同的特徵。此外，我們提出了用於超音波影像上表徵學習的空間和頻率增強策略。我們的做法在三個公開的乳房超音波資料集上顯著優於傳統的監督式分割方法，特別是在資料有限的情況下。在 Dice 相似性指標上的顯著改進包括在 BUSI 資料集的 20% 和 50% 上增加了 4%，在 BrEaST 資料集的 20% 和 50% 上增加了近 6% 和 9%，以及在 UDIAT 資料集的 20% 和 50% 上分別增加了 6.4% 和 3.7%。此外，我們在分佈外的 UDIAT 資料集上展示了卓越的泛化能力，與使用 BUSI 和 BrEaST 訓練資料的 20% 和 50% 的監督式基準相比，效能分別提升了 20.6% 和 13.6%。我們的研究強調，領域啟發的 SSL 可以改善超音波分割，特別是在資料有限的條件下。
 
-摘要：大型語言模型和生成式 AI 在線上媒體的普及
-放大了對有效自動查核事實的需求，以協助查核員應對日益增加的錯誤資訊量和複雜性。查核事實的複雜性質要求自動查核事實系統提供說明，讓查核員能夠仔細審查他們的輸出。然而，目前尚不清楚這些說明應如何與查核員的決策制定和推理過程保持一致，才能有效整合到他們的流程中。透過與查核事實專業人士進行半結構式訪談，我們透過以下方式彌補這個差距：(i) 提供查核員如何評估證據、做出決策和解釋其流程的說明；(ii) 檢視查核員如何實際使用自動化工具；以及 (iii) 找出查核員對自動查核事實工具的說明需求。研究結果顯示未滿足的說明需求，並找出可複製查核事實說明的重要準則，這些準則追蹤模型的推理路徑、參考具體證據，並強調不確定性和資訊差距。
+##### **Medical Multimodal Model Stealing Attacks via Adversarial Domain Alignment**
+2502.02438v1 by Yaling Shen, Zhixiong Zhuang, Kun Yuan, Maria-Irina Nicolae, Nassir Navab, Nicolas Padoy, Mario Fritz
 
-##### **CoSER: Coordinating LLM-Based Persona Simulation of Established Roles**
-2502.09082v1 by Xintao Wang, Heng Wang, Yifei Zhang, Xinfeng Yuan, Rui Xu, Jen-tse Huang, Siyu Yuan, Haoran Guo, Jiangjie Chen, Wei Wang, Yanghua Xiao, Shuchang Zhou
+Medical multimodal large language models (MLLMs) are becoming an instrumental
+part of healthcare systems, assisting medical personnel with decision making
+and results analysis. Models for radiology report generation are able to
+interpret medical imagery, thus reducing the workload of radiologists. As
+medical data is scarce and protected by privacy regulations, medical MLLMs
+represent valuable intellectual property. However, these assets are potentially
+vulnerable to model stealing, where attackers aim to replicate their
+functionality via black-box access. So far, model stealing for the medical
+domain has focused on classification; however, existing attacks are not
+effective against MLLMs. In this paper, we introduce Adversarial Domain
+Alignment (ADA-STEAL), the first stealing attack against medical MLLMs.
+ADA-STEAL relies on natural images, which are public and widely available, as
+opposed to their medical counterparts. We show that data augmentation with
+adversarial noise is sufficient to overcome the data distribution gap between
+natural images and the domain-specific distribution of the victim MLLM.
+Experiments on the IU X-RAY and MIMIC-CXR radiology datasets demonstrate that
+Adversarial Domain Alignment enables attackers to steal the medical MLLM
+without any access to medical data.
 
-Role-playing language agents (RPLAs) have emerged as promising applications
-of large language models (LLMs). However, simulating established characters
-presents a challenging task for RPLAs, due to the lack of authentic character
-datasets and nuanced evaluation methods using such data. In this paper, we
-present CoSER, a collection of a high-quality dataset, open models, and an
-evaluation protocol towards effective RPLAs of established characters. The
-CoSER dataset covers 17,966 characters from 771 renowned books. It provides
-authentic dialogues with real-world intricacies, as well as diverse data types
-such as conversation setups, character experiences and internal thoughts.
-Drawing from acting methodology, we introduce given-circumstance acting for
-training and evaluating role-playing LLMs, where LLMs sequentially portray
-multiple characters in book scenes. Using our dataset, we develop CoSER 8B and
-CoSER 70B, i.e., advanced open role-playing LLMs built on LLaMA-3.1 models.
-Extensive experiments demonstrate the value of the CoSER dataset for RPLA
-training, evaluation and retrieval. Moreover, CoSER 70B exhibits
-state-of-the-art performance surpassing or matching GPT-4o on our evaluation
-and three existing benchmarks, i.e., achieving 75.80% and 93.47% accuracy on
-the InCharacter and LifeChoice benchmarks respectively.
+摘要：醫療多模態大型語言模型 (MLLM) 正在成為醫療保健系統中不可或缺的一部分，協助醫療人員進行決策和結果分析。放射報告生成的模型能夠解釋醫學影像，從而減輕放射科醫師的工作負擔。由於醫療資料稀少且受隱私法規保護，醫療 MLLM 代表了有價值的智慧財產。然而，這些資產潛在地容易受到模型竊取的攻擊，攻擊者旨在透過黑盒存取來複製其功能。到目前為止，針對醫療領域的模型竊取一直專注於分類；然而，現有的攻擊對 MLLM 沒有效。在本文中，我們介紹了對抗域對齊 (ADA-STEAL)，這是針對醫療 MLLM 的第一個竊取攻擊。與醫療對應物相反，ADA-STEAL 依賴於公開且廣泛可用的自然影像。我們表明，對抗雜訊的資料擴充足以克服自然影像與受害者 MLLM 的特定領域分佈之間的資料分佈差距。在 IU X-RAY 和 MIMIC-CXR 放射學資料集上進行的實驗表明，對抗域對齊使攻擊者能夠在不存取任何醫療資料的情況下竊取醫療 MLLM。
 
-摘要：角色扮演語言代理（RPLA）已成為大型語言模型（LLM）的有前途的應用。然而，由於缺乏真實角色資料集和使用此類資料的細緻評估方法，模擬既有角色對 RPLA 來說是一項具有挑戰性的任務。在本文中，我們提出了 CoSER，這是一個高品質資料集、開放模型和評估協議的集合，用於有效地扮演既有角色的 RPLA。CoSER 資料集涵蓋了來自 771 本著名書籍的 17,966 個角色。它提供了具有真實世界複雜性的真實對話，以及對話設定、角色體驗和內心想法等多種資料類型。借鑑表演方法，我們引入了既定情境表演，用於訓練和評估角色扮演 LLM，其中 LLM 在書籍場景中依次扮演多個角色。使用我們的資料集，我們開發了 CoSER 8B 和 CoSER 70B，即建立在 LLaMA-3.1 模型上的先進開放角色扮演 LLM。大量的實驗證明了 CoSER 資料集對於 RPLA 訓練、評估和檢索的價值。此外，CoSER 70B 在我們的評估和三個現有基準上展現了超越或匹配 GPT-4o 的最先進效能，即分別在 InCharacter 和 LifeChoice 基準上達到了 75.80% 和 93.47% 的準確率。
+##### **Test Time Training for 4D Medical Image Interpolation**
+2502.02341v1 by Qikang Zhang, Yingjie Lei, Zihao Zheng, Ziyang Chen, Zhonghao Xie
 
-##### **Enhancing RAG with Active Learning on Conversation Records: Reject Incapables and Answer Capables**
-2502.09073v1 by Xuzhao Geng, Haozhao Wang, Jun Wang, Wei Liu, Ruixuan Li
+4D medical image interpolation is essential for improving temporal resolution
+and diagnostic precision in clinical applications. Previous works ignore the
+problem of distribution shifts, resulting in poor generalization under
+different distribution. A natural solution would be to adapt the model to a new
+test distribution, but this cannot be done if the test input comes without a
+ground truth label. In this paper, we propose a novel test time training
+framework which uses self-supervision to adapt the model to a new distribution
+without requiring any labels. Indeed, before performing frame interpolation on
+each test video, the model is trained on the same instance using a
+self-supervised task, such as rotation prediction or image reconstruction. We
+conduct experiments on two publicly available 4D medical image interpolation
+datasets, Cardiac and 4D-Lung. The experimental results show that the proposed
+method achieves significant performance across various evaluation metrics on
+both datasets. It achieves higher peak signal-to-noise ratio values, 33.73dB on
+Cardiac and 34.02dB on 4D-Lung. Our method not only advances 4D medical image
+interpolation but also provides a template for domain adaptation in other
+fields such as image segmentation and image registration.
 
-Retrieval-augmented generation (RAG) is a key technique for leveraging
-external knowledge and reducing hallucinations in large language models (LLMs).
-However, RAG still struggles to fully prevent hallucinated responses. To
-address this, it is essential to identify samples prone to hallucination or
-guide LLMs toward correct responses, which experts then annotate to develop
-high-quality datasets for refining LLMs. However, the growing scarcity of such
-datasets makes their creation challenging. This paper proposes using the vast
-amount of conversations from widespread LLM usage to build these datasets,
-training LLMs to avoid hallucination-prone questions while accurately
-responding to manageable ones. Given the impracticality of expert-annotating
-all conversation records, the paper introduces AL4RAG, which uses active
-learning to select the most suitable conversation samples for annotation,
-optimizing performance within an annotation budget. Additionally, recognizing
-that traditional active learning methods are not fully compatible with RAG due
-to unsuitable distance metrics, we develop a novel sample distance measurement
-for RAG active learning. Extensive experiments show that our method
-consistently outperforms baselines across multiple metrics.
+摘要：4D 醫學影像插值對於提升時間解析度及臨床應用中的診斷精準度至關重要。過往的研究忽略了分佈轉移問題，導致在不同分佈下泛化能力不佳。一個自然的解決方案是將模型適應到新的測試分佈，但如果測試輸入沒有真實標籤，就無法做到這一點。在本文中，我們提出了一個新的測試時間訓練架構，它使用自我監督來適應模型到一個新的分佈，而不需要任何標籤。事實上，在對每個測試影片執行幀插值之前，使用自我監督任務（例如旋轉預測或影像重建）在同一個實例上訓練模型。我們在兩個公開的 4D 醫學影像插值資料集（Cardiac 和 4D-Lung）上進行實驗。實驗結果表明，所提出的方法在兩個資料集上的各種評估指標中都取得了顯著的效能。它達到了更高的峰值信噪比值，在 Cardiac 上為 33.73dB，在 4D-Lung 上為 34.02dB。我們的技術不僅推動了 4D 醫學影像插值，還為其他領域（例如影像分割和影像配準）中的領域適應提供了一個範本。
 
-摘要：檢索增強生成 (RAG) 是一種關鍵技術，用於利用外部知識並減少大型語言模型 (LLM) 中的幻覺。然而，RAG 仍難以完全防止幻覺反應。為了解決這個問題，必須找出容易產生幻覺的範例，或引導 LLM 朝向正確的反應，然後由專家註解以開發用於精煉 LLM 的高品質資料集。然而，此類資料集日益稀少，使得其建立極具挑戰性。本文提出使用來自廣泛 LLM 使用的大量對話來建立這些資料集，訓練 LLM 以避免容易產生幻覺的問題，同時準確回應可管理的問題。鑑於由專家為所有對話記錄加上註解並不切實際，本文引入了 AL4RAG，它使用主動學習來選擇最適合註解的對話範例，在註解預算內最佳化效能。此外，認識到傳統主動學習方法由於不適當的距離度量而無法與 RAG 完全相容，我們為 RAG 主動學習開發了一種新穎的範例距離度量。廣泛的實驗表明，我們的模型在多種度量標準上始終優於基準。
+##### **Conversation AI Dialog for Medicare powered by Finetuning and Retrieval Augmented Generation**
+2502.02249v1 by Atharva Mangeshkumar Agrawal, Rutika Pandurang Shinde, Vasanth Kumar Bhukya, Ashmita Chakraborty, Sagar Bharat Shah, Tanmay Shukla, Sree Pradeep Kumar Relangi, Nilesh Mutyam
 
-##### **An Open Recipe: Adapting Language-Specific LLMs to a Reasoning Model in One Day via Model Merging**
-2502.09056v1 by Kunat Pipatanakul, Pittawat Taveekitworachai, Potsawee Manakul, Kasima Tharnpipitchai
+Large language models (LLMs) have shown impressive capabilities in natural
+language processing tasks, including dialogue generation. This research aims to
+conduct a novel comparative analysis of two prominent techniques, fine-tuning
+with LoRA (Low-Rank Adaptation) and the Retrieval-Augmented Generation (RAG)
+framework, in the context of doctor-patient chat conversations with multiple
+datasets of mixed medical domains. The analysis involves three state-of-the-art
+models: Llama-2, GPT, and the LSTM model. Employing real-world doctor-patient
+dialogues, we comprehensively evaluate the performance of models, assessing key
+metrics such as language quality (perplexity, BLEU score), factual accuracy
+(fact-checking against medical knowledge bases), adherence to medical
+guidelines, and overall human judgments (coherence, empathy, safety). The
+findings provide insights into the strengths and limitations of each approach,
+shedding light on their suitability for healthcare applications. Furthermore,
+the research investigates the robustness of the models in handling diverse
+patient queries, ranging from general health inquiries to specific medical
+conditions. The impact of domain-specific knowledge integration is also
+explored, highlighting the potential for enhancing LLM performance through
+targeted data augmentation and retrieval strategies.
 
-This paper investigates data selection and model merging methodologies aimed
-at incorporating advanced reasoning capabilities such as those of DeepSeek R1
-into language-specific large language models (LLMs), with a particular focus on
-the Thai LLM. Our goal is to enhance the reasoning capabilities of
-language-specific LLMs while maintaining their target language abilities.
-DeepSeek R1 excels in reasoning but primarily benefits high-resource languages
-such as English and Chinese. However, low-resource languages remain underserved
-due to the dominance of English-centric training data and model optimizations,
-which limit performance in these languages. This limitation results in
-unreliable code-switching and diminished effectiveness on tasks in low-resource
-languages. Meanwhile, local and regional LLM initiatives have attempted to
-bridge this gap by developing language-specific LLMs that focus on improving
-local linguistic fidelity. We demonstrate that, with only publicly available
-datasets and a computational budget of $120, it is possible to enhance the
-reasoning capabilities of language-specific LLMs to match the level of DeepSeek
-R1, without compromising their performance on target language tasks.
+摘要：大型語言模型 (LLM) 在自然語言處理任務中展現了令人印象深刻的能力，包括對話生成。本研究旨在對兩種著名的技術進行新穎的比較分析，即微調 LoRA (低秩適應) 和檢索增強生成 (RAG) 框架，在具有混合醫療領域的多個資料集的醫患聊天對話中。分析涉及三個最先進的模型：Llama-2、GPT 和 LSTM 模型。採用真實世界的醫患對話，我們全面評估模型的性能，評估語言品質（困惑度、BLEU 分數）、事實準確性（對照醫學知識庫進行事實查核）、遵守醫療指南以及整體人類判斷（連貫性、同理心、安全性）等關鍵指標。研究結果深入了解了每種方法的優點和限制，闡明了它們適用於醫療保健應用的適當性。此外，該研究調查了模型在處理多樣化患者查詢時的穩健性，範圍從一般健康詢問到特定醫療狀況。還探討了特定領域知識整合的影響，強調了通過有針對性的資料擴充和檢索策略來增強 LLM 性能的潛力。
 
-摘要：本文探討資料選取與模型合併方法，旨在將深度搜尋 R1 等先進推理能力整合至特定語言的大型語言模型 (LLM)，特別著重於泰語 LLM。我們的目標是提升特定語言 LLM 的推理能力，同時維持其目標語言能力。深度搜尋 R1 在推理方面表現出色，但主要受益於英語和中文等資源豐富的語言。然而，由於以英語為中心的訓練資料和模型最佳化佔據主導地位，資源貧乏的語言仍未獲得充分服務，這限制了這些語言的效能。此限制導致不可靠的代碼切換，並降低了資源貧乏語言任務的效能。與此同時，在地區 LLM 計畫已嘗試透過開發專注於改善在地語言忠實度的特定語言 LLM 來彌合此差距。我們證明，僅使用公開可用的資料集和 120 美元的運算預算，即可提升特定語言 LLM 的推理能力，使其達到深度搜尋 R1 的水準，同時不損及它們在目標語言任務上的效能。
+##### **Deep Learning-Based Facial Expression Recognition for the Elderly: A Systematic Review**
+2502.02618v1 by F. Xavier Gaya-Morey, Jose M. Buades-Rubio, Philippe Palanque, Raquel Lacuesta, Cristina Manresa-Yee
 
-##### **Cost-Saving LLM Cascades with Early Abstention**
-2502.09054v1 by Michael J. Zellinger, Rex Liu, Matt Thomson
+The rapid aging of the global population has highlighted the need for
+technologies to support elderly, particularly in healthcare and emotional
+well-being. Facial expression recognition (FER) systems offer a non-invasive
+means of monitoring emotional states, with applications in assisted living,
+mental health support, and personalized care. This study presents a systematic
+review of deep learning-based FER systems, focusing on their applications for
+the elderly population. Following a rigorous methodology, we analyzed 31
+studies published over the last decade, addressing challenges such as the
+scarcity of elderly-specific datasets, class imbalances, and the impact of
+age-related facial expression differences. Our findings show that convolutional
+neural networks remain dominant in FER, and especially lightweight versions for
+resource-constrained environments. However, existing datasets often lack
+diversity in age representation, and real-world deployment remains limited.
+Additionally, privacy concerns and the need for explainable artificial
+intelligence emerged as key barriers to adoption. This review underscores the
+importance of developing age-inclusive datasets, integrating multimodal
+solutions, and adopting XAI techniques to enhance system usability,
+reliability, and trustworthiness. We conclude by offering recommendations for
+future research to bridge the gap between academic progress and real-world
+implementation in elderly care.
 
-LLM cascades are based on the idea that processing all queries with the
-largest and most expensive LLMs is inefficient. Instead, cascades deploy small
-LLMs to answer the majority of queries, limiting the use of large and expensive
-LLMs to only the most difficult queries. This approach can significantly reduce
-costs without impacting performance. However, risk-sensitive domains such as
-finance or medicine place an additional premium on avoiding model errors.
-Recognizing that even the most expensive models may make mistakes, applications
-in these domains benefit from allowing LLM systems to completely abstain from
-answering a query when the chance of making a mistake is significant. However,
-giving a cascade the ability to abstain poses an immediate design question for
-LLM cascades: should abstention only be allowed at the final model or also at
-earlier models? Since the error patterns of small and large models are
-correlated, the latter strategy may further reduce inference costs by letting
-inexpensive models anticipate abstention decisions by expensive models, thereby
-obviating the need to run the expensive models. We investigate the benefits of
-"early abstention" in LLM cascades and find that it reduces the overall test
-loss by 2.2% on average across six benchmarks (GSM8K, MedMCQA, MMLU, TriviaQA,
-TruthfulQA, and XSum). These gains result from a more effective use of
-abstention, which trades a 4.1% average increase in the overall abstention rate
-for a 13.0% reduction in cost and a 5.0% reduction in error rate. Our findings
-demonstrate that it is possible to leverage correlations between the error
-patterns of different language models to drive performance improvements for LLM
-systems with abstention.
+摘要：全球人口快速老龄化突显了对技术的需求，以支持老年人，尤其是在医疗保健和情绪健康方面。面部表情识别 (FER) 系统提供了一种非侵入性的情绪状态监测手段，在辅助生活、心理健康支持和个性化护理中得到应用。本研究对基于深度学习的 FER 系统进行了系统的回顾，重点关注它们在老年人群中的应用。遵循严格的方法，我们分析了在过去十年中发表的 31 项研究，解决了诸如老年人特定数据集的稀缺性、类别不平衡以及与年龄相关的面部表情差异的影响等挑战。我们的研究结果表明，卷积神经网络在 FER 中仍然占主导地位，特别是针对资源受限环境的轻量级版本。然而，现有数据集往往缺乏年龄代表性的多样性，并且现实世界的部署仍然有限。此外，隐私问题和对可解释人工智能的需求已成为采用过程中的主要障碍。本次审查强调了开发包容年龄的数据集、整合多模式解决方案以及采用 XAI 技术以增强系统可用性、可靠性和可信度的重要性。最后，我们提出了未来研究的建议，以弥合学术进展与老年护理中的现实世界实施之间的差距。
 
-摘要：<paragraph>LLM 級聯基於以下概念：使用最大且最昂貴的 LLM 處理所有查詢效率低下。相反，級聯會部署小型 LLM 來回答大部分查詢，將大型且昂貴的 LLM 的使用限制在最困難的查詢上。這種方法可以大幅降低成本，而不會影響效能。然而，像金融或醫學等對風險敏感的領域會額外重視避免模型錯誤。認識到即使是最昂貴的模型也可能會出錯，在這些領域中的應用程式可受益於允許 LLM 系統在出錯機率很大的情況下完全不回答查詢。然而，賦予級聯不回答的能力會對 LLM 級聯提出立即的設計問題：是否只允許在最終模型中不回答，還是也在較早的模型中不回答？由於小型和大型模型的錯誤模式相關，後一種策略可以讓便宜的模型預測昂貴模型的不回答決策，進而降低推論成本，從而避免執行昂貴的模型。我們調查了 LLM 級聯中「早期不回答」的好處，並發現它平均降低了六個基準測試（GSM8K、MedMCQA、MMLU、TriviaQA、TruthfulQA 和 XSum）的整體測試損失 2.2%。這些收益來自於更有效地使用不回答，以整體不回答率平均增加 4.1% 的代價換取成本降低 13.0% 和錯誤率降低 5.0%。我們的研究結果證明，可以利用不同語言模型的錯誤模式之間的關聯性，來推動具有不回答功能的 LLM 系統的效能改進。</paragraph>
+##### **Causally-informed Deep Learning towards Explainable and Generalizable Outcomes Prediction in Critical Care**
+2502.02109v1 by Yuxiao Cheng, Xinxin Song, Ziqian Wang, Qin Zhong, Kunlun He, Jinli Suo
 
-##### **Game Theory Meets Large Language Models: A Systematic Survey**
-2502.09053v1 by Haoran Sun, Yusen Wu, Yukun Cheng, Xu Chu
+Recent advances in deep learning (DL) have prompted the development of
+high-performing early warning score (EWS) systems, predicting clinical
+deteriorations such as acute kidney injury, acute myocardial infarction, or
+circulatory failure. DL models have proven to be powerful tools for various
+tasks but come with the cost of lacking interpretability and limited
+generalizability, hindering their clinical applications. To develop a practical
+EWS system applicable to various outcomes, we propose causally-informed
+explainable early prediction model, which leverages causal discovery to
+identify the underlying causal relationships of prediction and thus owns two
+unique advantages: demonstrating the explicit interpretation of the prediction
+while exhibiting decent performance when applied to unfamiliar environments.
+Benefiting from these features, our approach achieves superior accuracy for 6
+different critical deteriorations and achieves better generalizability across
+different patient groups, compared to various baseline algorithms. Besides, we
+provide explicit causal pathways to serve as references for assistant clinical
+diagnosis and potential interventions. The proposed approach enhances the
+practical application of deep learning in various medical scenarios.
 
-Game theory establishes a fundamental framework for analyzing strategic
-interactions among rational decision-makers. The rapid advancement of large
-language models (LLMs) has sparked extensive research exploring the
-intersection of these two fields. Specifically, game-theoretic methods are
-being applied to evaluate and enhance LLM capabilities, while LLMs themselves
-are reshaping classic game models. This paper presents a comprehensive survey
-of the intersection of these fields, exploring a bidirectional relationship
-from three perspectives: (1) Establishing standardized game-based benchmarks
-for evaluating LLM behavior; (2) Leveraging game-theoretic methods to improve
-LLM performance through algorithmic innovations; (3) Characterizing the
-societal impacts of LLMs through game modeling. Among these three aspects, we
-also highlight how the equilibrium analysis for traditional game models is
-impacted by LLMs' advanced language understanding, which in turn extends the
-study of game theory. Finally, we identify key challenges and future research
-directions, assessing their feasibility based on the current state of the
-field. By bridging theoretical rigor with emerging AI capabilities, this survey
-aims to foster interdisciplinary collaboration and drive progress in this
-evolving research area.
+摘要：深度學習 (DL) 的最新進展促使開發出高性能早期預警評分 (EWS) 系統，預測急性腎臟損傷、急性心肌梗塞或循環衰竭等臨床惡化。DL 模型已被證明是各種任務的強大工具，但代價是缺乏可解釋性和有限的概括性，阻礙了其臨床應用。為了開發適用於各種結果的實用 EWS 系統，我們提出了因果關係解釋性早期預測模型，它利用因果發現來識別預測的潛在因果關係，從而擁有兩個獨特的優點：展示預測的明確解釋，同時在應用於不熟悉的環境時表現出良好的性能。得益於這些特性，與各種基線演算法相比，我們的模型在 6 種不同的危重惡化中實現了更高的準確度，並在不同的患者群體中實現了更好的概括性。此外，我們提供了明確的因果途徑，作為輔助臨床診斷和潛在干預措施的參考。所提出的方法增強了深度學習在各種醫療場景中的實際應用。
 
-摘要：博弈論建立一個基本架構，用來分析理性決策者之間的策略互動。大型語言模型 (LLM) 的快速進展，激發了廣泛的研究，探討這兩個領域的交集。具體來說，博弈論方法被應用於評估和增強 LLM 能力，而 LLM 本身正在重塑經典博弈模型。本文對這些領域的交集進行了全面的調查，從三個角度探討了雙向關係：(1) 建立標準化的基於博弈的基準，用於評估 LLM 行為；(2) 利用博弈論方法，通過演算法創新來改善 LLM 效能；(3) 透過博弈模型，描述 LLM 對社會的影響。在這三個方面中，我們還強調了 LLM 的先進語言理解如何影響傳統博弈模型的均衡分析，這反過來又擴展了博弈論的研究。最後，我們找出關鍵挑戰和未來的研究方向，根據該領域的現狀評估其可行性。透過將理論嚴謹性與新興的 AI 能力相結合，這項調查旨在促進跨學科合作，並推動這個不斷演變的研究領域的進展。
+##### **JingFang: A Traditional Chinese Medicine Large Language Model of Expert-Level Medical Diagnosis and Syndrome Differentiation-Based Treatment**
+2502.04345v1 by Yehan Yan, Tianhao Ma, Ruotai Li, Xinhan Zheng, Guodong Shan, Chisheng Li
 
-##### **AIDE: Agentically Improve Visual Language Model with Domain Experts**
-2502.09051v1 by Ming-Chang Chiu, Fuxiao Liu, Karan Sapra, Andrew Tao, Yaser Jacoob, Xuezhe Ma, Zhiding Yu, Guilin Liu
+Traditional Chinese medicine (TCM) plays a vital role in health protection
+and disease treatment, but its practical application requires extensive medical
+knowledge and clinical experience. Existing TCM Large Language Models (LLMs)
+exhibit critical limitations of uncomprehensive medical consultation and
+diagnoses, and inaccurate syndrome differentiation-based treatment. To address
+these issues, this study establishes JingFang (JF): a novel TCM Large Language
+Model that demonstrates the expert-level capability of medical diagnosis and
+syndrome differentiation-based treatment. We innovate a Multi-agent Dynamic
+Collaborative Chain-of-Thought Mechanism (MDCCTM) for medical consultation,
+enabling JF with effective and accurate diagnostic ability. In addition, a
+Syndrome Agent and a Dual-Stage Retrieval Scheme (DSRS) are developed to
+significantly enhance the capacity of JF for disease treatment based on
+syndrome differentiation. JingFang not only facilitates the application of LLMs
+but also promotes the effective practice of TCM in human health protection and
+disease treatment.
 
-The enhancement of Visual Language Models (VLMs) has traditionally relied on
-knowledge distillation from larger, more capable models. This dependence
-creates a fundamental bottleneck for improving state-of-the-art systems,
-particularly when no superior models exist. We introduce AIDE (Agentic
-Improvement through Domain Experts), a novel framework that enables VLMs to
-autonomously enhance their capabilities by leveraging specialized domain expert
-models. AIDE operates through a four-stage process: (1) identifying instances
-for refinement, (2) engaging domain experts for targeted analysis, (3)
-synthesizing expert outputs with existing data, and (4) integrating enhanced
-instances into the training pipeline. Experiments on multiple benchmarks,
-including MMMU, MME, MMBench, etc., demonstrate AIDE's ability to achieve
-notable performance gains without relying on larger VLMs nor human supervision.
-Our framework provides a scalable, resource-efficient approach to continuous
-VLM improvement, addressing critical limitations in current methodologies,
-particularly valuable when larger models are unavailable to access.
+摘要：中醫藥在保健與疾病治療中扮演著重要的角色，但其實務應用需要深厚的醫學知識與臨床經驗。現有的中醫大語言模型（LLM）存在著醫療諮詢與診斷不全面、症候分型治療不準確的重大限制。為了解決這些問題，本研究建立了精方（JF）：一個新穎的中醫大語言模型，展示了專家級的醫療診斷與症候分型治療能力。我們創新了一個多智能體動態協作思考鏈機制（MDCCTM）用於醫療諮詢，讓 JF 具備有效且準確的診斷能力。此外，還開發了一個症候智能體和一個雙階段檢索方案（DSRS），以顯著增強 JF 基於症候分型的疾病治療能力。精方不僅促進了 LLM 的應用，也推動了中醫藥在人類保健與疾病治療中的有效實踐。
 
-摘要：視覺語言模型 (VLM) 的增強傳統上依賴於從更大、功能更強大的模型中進行知識萃取。這種依賴性會造成改善最先進系統的基本瓶頸，尤其在沒有更優越的模型時。我們引進 AIDE（透過領域專家進行代理式改善），一個創新的架構，讓 VLM 能夠透過利用專業的領域專家模型，自主增強其功能。AIDE 透過四階段流程運作：(1) 識別需要改善的實例，(2) 聘請領域專家進行有針對性的分析，(3) 將專家輸出與現有資料綜合，以及 (4) 將增強的實例整合到訓練流程中。在多個基準測試上的實驗，包括 MMMU、MME、MMBench 等，證明了 AIDE 能夠在不依賴更大型的 VLM 或人工監督的情況下，實現顯著的效能提升。我們的架構提供了一個可擴充、資源效率高的持續 VLM 改進方法，解決了當前方法中的關鍵限制，特別是在無法取得大型模型時，這一點特別有價值。
+##### **An Agentic AI Workflow for Detecting Cognitive Concerns in Real-world Data**
+2502.01789v1 by Jiazi Tian, Liqin Wang, Pedram Fard, Valdery Moura Junior, Deborah Blacker, Jennifer S. Haas, Chirag Patel, Shawn N. Murphy, Lidia M. V. R. Moura, Hossein Estiri
 
-##### **Leveraging Member-Group Relations via Multi-View Graph Filtering for Effective Group Recommendation**
-2502.09050v1 by Chae-Hyun Kim, Yoon-Ryung Choi, Jin-Duk Park, Won-Yong Shin
+Early identification of cognitive concerns is critical but often hindered by
+subtle symptom presentation. This study developed and validated a fully
+automated, multi-agent AI workflow using LLaMA 3 8B to identify cognitive
+concerns in 3,338 clinical notes from Mass General Brigham. The agentic
+workflow, leveraging task-specific agents that dynamically collaborate to
+extract meaningful insights from clinical notes, was compared to an
+expert-driven benchmark. Both workflows achieved high classification
+performance, with F1-scores of 0.90 and 0.91, respectively. The agentic
+workflow demonstrated improved specificity (1.00) and achieved prompt
+refinement in fewer iterations. Although both workflows showed reduced
+performance on validation data, the agentic workflow maintained perfect
+specificity. These findings highlight the potential of fully automated
+multi-agent AI workflows to achieve expert-level accuracy with greater
+efficiency, offering a scalable and cost-effective solution for detecting
+cognitive concerns in clinical settings.
 
-Group recommendation aims at providing optimized recommendations tailored to
-diverse groups, enabling groups to enjoy appropriate items. On the other hand,
-most existing group recommendation methods are built upon deep neural network
-(DNN) architectures designed to capture the intricate relationships between
-member-level and group-level interactions. While these DNN-based approaches
-have proven their effectiveness, they require complex and expensive training
-procedures to incorporate group-level interactions in addition to member-level
-interactions. To overcome such limitations, we introduce Group-GF, a new
-approach for extremely fast recommendations of items to each group via
-multi-view graph filtering (GF) that offers a holistic view of complex
-member-group dynamics, without the need for costly model training.
-Specifically, in Group-GF, we first construct three item similarity graphs
-manifesting different viewpoints for GF. Then, we discover a distinct
-polynomial graph filter for each similarity graph and judiciously aggregate the
-three graph filters. Extensive experiments demonstrate the effectiveness of
-Group-GF in terms of significantly reducing runtime and achieving
-state-of-the-art recommendation accuracy.
+摘要：及早辨識認知問題至關重要，但常常受到症狀呈現過於細微的阻礙。本研究開發並驗證了一個全自動化、多重代理的 AI 工作流程，使用 LLaMA 3 8B 來辨識來自麻省總醫院布萊根分院的 3,338 則臨床筆記中的認知問題。這個代理工作流程利用了特定任務的代理，這些代理會動態合作從臨床筆記中萃取出有意義的見解，並與專家驅動的基準進行比較。這兩個工作流程都達到了很高的分類效能，F1 分數分別為 0.90 和 0.91。代理工作流程展現出更好的特異性（1.00），並且在更少的反覆運算中達到了提示精煉。儘管這兩個工作流程在驗證資料上的效能都降低了，但代理工作流程維持了完美的特異性。這些發現突顯了全自動化多重代理 AI 工作流程的潛力，它們能以更高的效率達到專家級的準確度，為在臨床環境中偵測認知問題提供了一個可擴充且具成本效益的解決方案。
 
-摘要：群組推薦旨在提供針對不同群組量身打造的最佳推薦，讓群組可以享受適當的項目。另一方面，現有的群組推薦方法大多建立在深度神經網路 (DNN) 架構上，旨在捕捉成員層級和群組層級互動之間的複雜關係。雖然這些基於 DNN 的方法已證明其有效性，但它們需要複雜且昂貴的訓練程序，才能在成員層級互動之外納入群組層級互動。為了克服這些限制，我們引入了 Group-GF，這是一種透過多視圖圖形過濾 (GF) 為每個群組提供極快速項目推薦的新方法，它提供了複雜成員群組動態的整體視圖，而無需進行昂貴的模型訓練。具體來說，在 Group-GF 中，我們首先建構三個項目相似度圖形，展現 GF 的不同觀點。然後，我們為每個相似度圖形發現一個不同的多項式圖形過濾器，並明智地彙總這三個圖形過濾器。廣泛的實驗證明了 Group-GF 在顯著減少執行時間和達成最先進的推薦準確度方面的有效性。
+##### **Can Domain Experts Rely on AI Appropriately? A Case Study on AI-Assisted Prostate Cancer MRI Diagnosis**
+2502.03482v1 by Chacha Chen, Han Liu, Jiamin Yang, Benjamin M. Mervak, Bora Kalaycioglu, Grace Lee, Emre Cakmakli, Matteo Bonatti, Sridhar Pudu, Osman Kahraman, Gul Gizem Pamuk, Aytekin Oto, Aritrick Chatterjee, Chenhao Tan
 
-##### **Criteria-Aware Graph Filtering: Extremely Fast Yet Accurate Multi-Criteria Recommendation**
-2502.09046v1 by Jin-Duk Park, Jaemin Yoo, Won-Yong Shin
+Despite the growing interest in human-AI decision making, experimental
+studies with domain experts remain rare, largely due to the complexity of
+working with domain experts and the challenges in setting up realistic
+experiments. In this work, we conduct an in-depth collaboration with
+radiologists in prostate cancer diagnosis based on MRI images. Building on
+existing tools for teaching prostate cancer diagnosis, we develop an interface
+and conduct two experiments to study how AI assistance and performance feedback
+shape the decision making of domain experts. In Study 1, clinicians were asked
+to provide an initial diagnosis (human), then view the AI's prediction, and
+subsequently finalize their decision (human-AI team). In Study 2 (after a
+memory wash-out period), the same participants first received aggregated
+performance statistics from Study 1, specifically their own performance, the
+AI's performance, and their human-AI team performance, and then directly viewed
+the AI's prediction before making their diagnosis (i.e., no independent initial
+diagnosis). These two workflows represent realistic ways that clinical AI tools
+might be used in practice, where the second study simulates a scenario where
+doctors can adjust their reliance and trust on AI based on prior performance
+feedback. Our findings show that, while human-AI teams consistently outperform
+humans alone, they still underperform the AI due to under-reliance, similar to
+prior studies with crowdworkers. Providing clinicians with performance feedback
+did not significantly improve the performance of human-AI teams, although
+showing AI decisions in advance nudges people to follow AI more. Meanwhile, we
+observe that the ensemble of human-AI teams can outperform AI alone, suggesting
+promising directions for human-AI collaboration.
 
-Multi-criteria (MC) recommender systems, which utilize MC rating information
-for recommendation, are increasingly widespread in various e-commerce domains.
-However, the MC recommendation using training-based collaborative filtering,
-requiring consideration of multiple ratings compared to single-criterion
-counterparts, often poses practical challenges in achieving state-of-the-art
-performance along with scalable model training. To solve this problem, we
-propose CA-GF, a training-free MC recommendation method, which is built upon
-criteria-aware graph filtering for efficient yet accurate MC recommendations.
-Specifically, first, we construct an item-item similarity graph using an MC
-user-expansion graph. Next, we design CA-GF composed of the following key
-components, including 1) criterion-specific graph filtering where the optimal
-filter for each criterion is found using various types of polynomial low-pass
-filters and 2) criteria preference-infused aggregation where the smoothed
-signals from each criterion are aggregated. We demonstrate that CA-GF is (a)
-efficient: providing the computational efficiency, offering the extremely fast
-runtime of less than 0.2 seconds even on the largest benchmark dataset, (b)
-accurate: outperforming benchmark MC recommendation methods, achieving
-substantial accuracy gains up to 24% compared to the best competitor, and (c)
-interpretable: providing interpretations for the contribution of each criterion
-to the model prediction based on visualizations.
+摘要：儘管人們對人類與 AI 決策制定越來越感興趣，但與領域專家合作的實驗研究仍然很少見，這在很大程度上是因為與領域專家合作的複雜性，以及在設定實際實驗時面臨的挑戰。在這項工作中，我們與放射科醫師進行深入合作，基於 MRI 影像診斷前列腺癌。建立在用於教授前列腺癌診斷的現有工具上，我們開發了一個介面並進行了兩項實驗，以研究 AI 協助和效能回饋如何塑造領域專家的決策制定。在研究 1 中，要求臨床醫師提供初步診斷（人類），然後檢視 AI 的預測，並隨後確定他們的決策（人類-AI 團隊）。在研究 2（經過一段記憶清除期）中，同一位參與者首先收到研究 1 的彙總效能統計資料，特別是他們自己的效能、AI 的效能，以及他們的人類-AI 團隊效能，然後在做出診斷前直接檢視 AI 的預測（即，沒有獨立的初步診斷）。這兩個工作流程代表了臨床 AI 工具在實務中可能被使用的方式，其中第二個研究模擬了醫生可以根據先前的效能回饋調整他們對 AI 的依賴和信任的情況。我們的研究結果顯示，儘管人類-AI 團隊始終優於單獨的人類，但由於依賴不足，他們仍然表現不如 AI，這與之前針對群眾工作者的研究類似。儘管事先顯示 AI 決策會促使人們更多地遵循 AI，但向臨床醫師提供效能回饋並未顯著改善人類-AI 團隊的效能。同時，我們觀察到人類-AI 團隊的集合可以優於單獨的 AI，這表明了人類-AI 合作的前景。
 
-摘要：多準則 (MC) 推薦系統在各種電子商務領域中日益普及，該系統利用 MC 評分資訊進行推薦。
-然而，與單準則對應項目相比，使用基於訓練的協同過濾的 MC 推薦，通常在達成最先進的效能以及可擴充模型訓練方面造成實務上的挑戰，需要考慮多個評分。為了解決這個問題，我們提出 CA-GF，一種無需訓練的 MC 推薦方法，它建立於準則感知圖形過濾之上，用於有效且準確的 MC 推薦。
-具體來說，首先，我們使用 MC 使用者擴展圖形來建構一個項目相似度圖形。接下來，我們設計 CA-GF，它包含以下關鍵組成部分，包括 1) 準則特定圖形過濾，其中使用各種類型的多項式低通濾波器來找出每個準則的最佳濾波器，以及 2) 準則偏好注入聚合，其中來自每個準則的平滑訊號被聚合。我們證明 CA-GF 是 (a) 有效的：提供運算效率，即使在最大的基準資料集上，也能提供低於 0.2 秒的極快執行時間，(b) 準確的：優於基準 MC 推薦方法，與最佳競爭者相比，獲得高達 24% 的顯著準確性提升，以及 (c) 可解釋的：根據視覺化提供對每個準則對模型預測的貢獻的解釋。
+##### **Improving Transformer World Models for Data-Efficient RL**
+2502.01591v1 by Antoine Dedieu, Joseph Ortiz, Xinghua Lou, Carter Wendelken, Wolfgang Lehrach, J Swaroop Guntupalli, Miguel Lazaro-Gredilla, Kevin Patrick Murphy
 
-##### **Typhoon T1: An Open Thai Reasoning Model**
-2502.09042v1 by Pittawat Taveekitworachai, Potsawee Manakul, Kasima Tharnpipitchai, Kunat Pipatanakul
+We present an approach to model-based RL that achieves a new state of the art
+performance on the challenging Craftax-classic benchmark, an open-world 2D
+survival game that requires agents to exhibit a wide range of general abilities
+-- such as strong generalization, deep exploration, and long-term reasoning.
+With a series of careful design choices aimed at improving sample efficiency,
+our MBRL algorithm achieves a reward of 67.4% after only 1M environment steps,
+significantly outperforming DreamerV3, which achieves 53.2%, and, for the first
+time, exceeds human performance of 65.0%. Our method starts by constructing a
+SOTA model-free baseline, using a novel policy architecture that combines CNNs
+and RNNs. We then add three improvements to the standard MBRL setup: (a) "Dyna
+with warmup", which trains the policy on real and imaginary data, (b) "nearest
+neighbor tokenizer" on image patches, which improves the scheme to create the
+transformer world model (TWM) inputs, and (c) "block teacher forcing", which
+allows the TWM to reason jointly about the future tokens of the next timestep.
 
-This paper introduces Typhoon T1, an open effort to develop an open Thai
-reasoning model. A reasoning model is a relatively new type of generative model
-built on top of large language models (LLMs). A reasoning model generates a
-long chain of thought before arriving at a final answer, an approach found to
-improve performance on complex tasks. However, details on developing such a
-model are limited, especially for reasoning models that can generate traces in
-a low-resource language. Typhoon T1 presents an open effort that dives into the
-details of developing a reasoning model in a more cost-effective way by
-leveraging supervised fine-tuning using open datasets, instead of reinforcement
-learning. This paper shares the details about synthetic data generation and
-training, as well as our dataset and model weights. Additionally, we provide
-insights gained from developing a reasoning model that generalizes across
-domains and is capable of generating reasoning traces in a low-resource
-language, using Thai as an example. We hope this open effort provides a
-foundation for further research in this field.
+摘要：我們提出了一個基於模型的 RL 方法，在具有挑戰性的 Craftax-classic 基準上實現了新的技術水準，這是一個開放世界的 2D 生存遊戲，要求代理人展現廣泛的一般能力，例如強大的概括能力、深入探索和長期推理。通過一系列旨在提高樣本效率的仔細設計選擇，我們的 MBRL 演算法在僅 1M 環境步驟後就實現了 67.4% 的獎勵，顯著優於 DreamerV3（實現 53.2%），並且首次超過了人類的 65.0% 的表現。我們的演算法首先通過使用結合 CNN 和 RNN 的新穎策略架構來建構一個 SOTA 無模型基線。然後，我們對標準 MBRL 設定新增了三項改進：(a)「帶熱身的 Dyna」，它在真實和假想資料上訓練策略，(b) 影像貼片的「最近鄰代碼化器」，它改進了建立轉換器世界模型 (TWM) 輸入的方案，以及 (c)「區塊教師強制」，它允許 TWM 共同推理下一個時間步長的未來代碼。
 
-摘要：本文介紹 Typhoon T1，這是一個開放的計畫，旨在開發開放的泰語推理模型。推理模型是一種相對較新的生成模型，建構於大型語言模型 (LLM) 之上。推理模型會在得出最終答案之前產生一連串的思考，這種方法被發現可以改善複雜任務的效能。然而，關於如何開發這種模型的詳細資訊有限，特別是對於能夠以低資源語言產生軌跡的推理模型而言。Typhoon T1 提出了一個開放的計畫，深入探討如何以更具成本效益的方式開發推理模型，方法是利用開放式資料集進行監督微調，而不是強化學習。本文分享了關於合成資料產生和訓練的詳細資訊，以及我們的資料集和模型權重。此外，我們提供了從開發推理模型中獲得的見解，該模型可以跨領域概括，並能夠以低資源語言產生推理軌跡，以泰語為例。我們希望這個開放的計畫能為此領域的進一步研究奠定基礎。
+##### **Data-Efficient Model for Psychological Resilience Prediction based on Neurological Data**
+2502.01377v1 by Zhi Zhang, Yan Liu, Mengxia Gao, Yu Yang, Jiannong Cao, Wai Kai Hou, Shirley Li, Sonata Yau, Yun Kwok Wing, Tatia M. C. Lee
 
-##### **Mechanistic Unveiling of Transformer Circuits: Self-Influence as a Key to Model Reasoning**
-2502.09022v1 by Lin Zhang, Lijie Hu, Di Wang
+Psychological resilience, defined as the ability to rebound from adversity,
+is crucial for mental health. Compared with traditional resilience assessments
+through self-reported questionnaires, resilience assessments based on
+neurological data offer more objective results with biological markers, hence
+significantly enhancing credibility. This paper proposes a novel data-efficient
+model to address the scarcity of neurological data. We employ Neuro
+Kolmogorov-Arnold Networks as the structure of the prediction model. In the
+training stage, a new trait-informed multimodal representation algorithm with a
+smart chunk technique is proposed to learn the shared latent space with limited
+data. In the test stage, a new noise-informed inference algorithm is proposed
+to address the low signal-to-noise ratio of the neurological data. The proposed
+model not only shows impressive performance on both public datasets and
+self-constructed datasets but also provides some valuable psychological
+hypotheses for future research.
 
-Transformer-based language models have achieved notable success, yet their
-internal reasoning mechanisms remain largely opaque due to complex non-linear
-interactions and high-dimensional operations. While previous research suggests
-that these models implicitly encode reasoning structures, it is still unclear
-which specific multi-step thought processes they employ to solve complex tasks.
-To address this gap, we propose a novel mechanistic interpretability framework,
-SICAF, designed to trace and analyze the reasoning strategies that language
-models use in multi-step inference tasks. By employing circuit analysis and
-self-influence functions, we quantify the evolving importance of each token
-throughout the reasoning process, thereby mapping the pathways the model uses
-for inference. Applying SICAF to the GPT-2 model on the Indirect Object
-Identification (IOI) prediction task, we demonstrate how underlying circuits
-can reveal a reasoning process that aligns with human interpretability,
-offering new insights into the model's internal logic.
+摘要：心理韌性，定義為從逆境中反彈的能力，對心理健康至關重要。與通過自我報告問卷的傳統韌性評估相比，基於神經數據的韌性評估提供了更客觀的結果和生物標記，從而顯著提高了可信度。本文提出了一個新穎的數據高效模型來解決神經數據的稀缺性。我們採用神經科爾莫哥羅夫-阿諾德網路作為預測模型的結構。在訓練階段，提出了一種新的特徵信息多模態表示算法，採用智能塊技術，以有限的數據學習共享潛在空間。在測試階段，提出了一種新的噪聲信息推理算法，以解決神經數據的信噪比低的問題。所提出的模型不僅在公共數據集和自構數據集上都顯示出令人印象深刻的性能，還為未來的研究提供了一些有價值的心理假設。
+
+##### **OphthBench: A Comprehensive Benchmark for Evaluating Large Language Models in Chinese Ophthalmology**
+2502.01243v1 by Chengfeng Zhou, Ji Wang, Juanjuan Qin, Yining Wang, Ling Sun, Weiwei Dai
+
+Large language models (LLMs) have shown significant promise across various
+medical applications, with ophthalmology being a notable area of focus. Many
+ophthalmic tasks have shown substantial improvement through the integration of
+LLMs. However, before these models can be widely adopted in clinical practice,
+evaluating their capabilities and identifying their limitations is crucial. To
+address this research gap and support the real-world application of LLMs, we
+introduce the OphthBench, a specialized benchmark designed to assess LLM
+performance within the context of Chinese ophthalmic practices. This benchmark
+systematically divides a typical ophthalmic clinical workflow into five key
+scenarios: Education, Triage, Diagnosis, Treatment, and Prognosis. For each
+scenario, we developed multiple tasks featuring diverse question types,
+resulting in a comprehensive benchmark comprising 9 tasks and 591 questions.
+This comprehensive framework allows for a thorough assessment of LLMs'
+capabilities and provides insights into their practical application in Chinese
+ophthalmology. Using this benchmark, we conducted extensive experiments and
+analyzed the results from 39 popular LLMs. Our evaluation highlights the
+current gap between LLM development and its practical utility in clinical
+settings, providing a clear direction for future advancements. By bridging this
+gap, we aim to unlock the potential of LLMs and advance their development in
+ophthalmology.
 
-摘要：基於 Transformer 的語言模型已取得顯著的成功，但由於複雜的非線性交互和高維度運算，它們的內部推理機制在很大程度上仍然不透明。儘管先前的研究表明這些模型隱含地編碼推理結構，但目前仍不清楚它們採用哪些具體的多步驟思考過程來解決複雜任務。為了解決這個差距，我們提出了一個新穎的機制可解釋性框架 SICAF，旨在追蹤和分析語言模型在多步驟推理任務中使用的推理策略。通過採用電路分析和自影響函數，我們量化了推理過程中每個標記的演化重要性，從而繪製出模型用於推理的路徑。將 SICAF 應用於 GPT-2 模型上的間接賓語識別 (IOI) 預測任務，我們展示了底層電路如何揭示與人類可解釋性相符的推理過程，從而對模型的內部邏輯提供了新的見解。
+摘要：大型語言模型 (LLM) 在各種醫療應用中已展現出顯著的潛力，其中眼科是一個值得關注的重要領域。許多眼科任務已透過整合 LLM 而大幅進步。然而，在這些模型能廣泛應用於臨床實務之前，評估其能力並找出其限制至關重要。為了解決這個研究差距並支援 LLM 的實際應用，我們引入了 OphthBench，這是一個專門的基準測試，旨在評估 LLM 在中國眼科實務中的表現。此基準測試系統性地將典型眼科臨床工作流程劃分為五個關鍵情境：教育、分流、診斷、治療和預後。對於每個情境，我們開發了多項任務，包含多樣化的問題類型，最後組成一個包含 9 項任務和 591 個問題的綜合基準測試。此綜合架構可徹底評估 LLM 的能力，並提供其在中國眼科的實際應用見解。使用此基準測試，我們進行了廣泛的實驗，並分析了來自 39 個熱門 LLM 的結果。我們的評估強調了 LLM 開發與其在臨床環境中的實際效用之間的差距，為未來的進展提供了明確的方向。透過彌合此差距，我們旨在釋放 LLM 的潛力，並促進其在眼科的發展。
 
-##### **EventSTR: A Benchmark Dataset and Baselines for Event Stream based Scene Text Recognition**
-2502.09020v1 by Xiao Wang, Jingtao Jiang, Dong Li, Futian Wang, Lin Zhu, Yaowei Wang, Yongyong Tian, Jin Tang
+##### **MIND: Modality-Informed Knowledge Distillation Framework for Multimodal Clinical Prediction Tasks**
+2502.01158v1 by Alejandro Guerra-Manzanares, Farah E. Shamout
 
-Mainstream Scene Text Recognition (STR) algorithms are developed based on RGB
-cameras which are sensitive to challenging factors such as low illumination,
-motion blur, and cluttered backgrounds. In this paper, we propose to recognize
-the scene text using bio-inspired event cameras by collecting and annotating a
-large-scale benchmark dataset, termed EventSTR. It contains 9,928
-high-definition (1280 * 720) event samples and involves both Chinese and
-English characters. We also benchmark multiple STR algorithms as the baselines
-for future works to compare. In addition, we propose a new event-based scene
-text recognition framework, termed SimC-ESTR. It first extracts the event
-features using a visual encoder and projects them into tokens using a Q-former
-module. More importantly, we propose to augment the vision tokens based on a
-memory mechanism before feeding into the large language models. A
-similarity-based error correction mechanism is embedded within the large
-language model to correct potential minor errors fundamentally based on
-contextual information. Extensive experiments on the newly proposed EventSTR
-dataset and two simulation STR datasets fully demonstrate the effectiveness of
-our proposed model. We believe that the dataset and algorithmic model can
-innovatively propose an event-based STR task and are expected to accelerate the
-application of event cameras in various industries. The source code and
-pre-trained models will be released on https://github.com/Event-AHU/EventSTR
+Multimodal fusion leverages information across modalities to learn better
+feature representations with the goal of improving performance in fusion-based
+tasks. However, multimodal datasets, especially in medical settings, are
+typically smaller than their unimodal counterparts, which can impede the
+performance of multimodal models. Additionally, the increase in the number of
+modalities is often associated with an overall increase in the size of the
+multimodal network, which may be undesirable in medical use cases. Utilizing
+smaller unimodal encoders may lead to sub-optimal performance, particularly
+when dealing with high-dimensional clinical data. In this paper, we propose the
+Modality-INformed knowledge Distillation (MIND) framework, a multimodal model
+compression approach based on knowledge distillation that transfers knowledge
+from ensembles of pre-trained deep neural networks of varying sizes into a
+smaller multimodal student. The teacher models consist of unimodal networks,
+allowing the student to learn from diverse representations. MIND employs
+multi-head joint fusion models, as opposed to single-head models, enabling the
+use of unimodal encoders in the case of unimodal samples without requiring
+imputation or masking of absent modalities. As a result, MIND generates an
+optimized multimodal model, enhancing both multimodal and unimodal
+representations. It can also be leveraged to balance multimodal learning during
+training. We evaluate MIND on binary and multilabel clinical prediction tasks
+using time series data and chest X-ray images. Additionally, we assess the
+generalizability of the MIND framework on three non-medical multimodal
+multiclass datasets. Experimental results demonstrate that MIND enhances the
+performance of the smaller multimodal network across all five tasks, as well as
+various fusion methods and multimodal architectures, compared to
+state-of-the-art baselines.
 
-摘要：主流場景文字辨識（STR）演算法是基於對低光源、動態模糊和雜亂背景等挑戰性因素敏感的 RGB 相機開發的。在本文中，我們提出使用生物靈感事件相機辨識場景文字，方法是收集和標註一個稱為 EventSTR 的大規模基準資料集。它包含 9,928 個高畫質（1280 * 720）事件範例，並包含中文字和英文字元。我們也基準化多個 STR 演算法作為未來工作的基準，以進行比較。此外，我們提出一個新的基於事件的場景文字辨識架構，稱為 SimC-ESTR。它首先使用視覺編碼器萃取事件特徵，並使用 Q-former 模組將它們投影到代幣中。更重要的是，我們提出在輸入大型語言模型之前，基於記憶機制擴充視覺代幣。一個基於相似性的錯誤修正機制嵌入在大型語言模型中，以根據上下文資訊從根本上修正潛在的輕微錯誤。在最新提出的 EventSTR 資料集和兩個模擬 STR 資料集上進行的廣泛實驗充分證明了我們提出的模型的有效性。我們相信，該資料集和演算法模型可以創新地提出一個基於事件的 STR 任務，並有望加速事件相機在各個產業的應用。原始碼和預先訓練的模型將在 https://github.com/Event-AHU/EventSTR 上釋出
+摘要：多模态融合利用跨模态的信息来学习更好的特征表示，目标是提升基于融合的任务的性能。然而，多模态数据集，尤其是在医疗环境中，通常比它们的单模态对应数据集小，这会阻碍多模态模型的性能。此外，模态数量的增加通常与多模态网络尺寸的整体增加相关，这在医疗用例中可能是不可取的。利用较小的单模态编码器可能会导致次优性能，尤其是在处理高维临床数据时。在本文中，我们提出了模态信息知识蒸馏 (MIND) 框架，这是一种基于知识蒸馏的多模态模型压缩方法，它将来自不同大小的预训练深度神经网络的集合中的知识转移到一个较小的多模态学生中。教师模型由单模态网络组成，允许学生从不同的表示中学习。MIND 采用多头联合融合模型，而不是单头模型，从而能够在单模态样本的情况下使用单模态编码器，而不需要缺失模态的插补或掩蔽。因此，MIND 生成了一个经过优化的多模态模型，增强了多模态和单模态表示。它还可以用来在训练期间平衡多模态学习。我们使用时间序列数据和胸部 X 射线图像对二元和多标签临床预测任务评估了 MIND。此外，我们评估了 MIND 框架在三个非医疗多模态多分类数据集上的泛化性。实验结果表明，与最先进的基线相比，MIND 增强了较小的多模态网络在所有五个任务以及各种融合方法和多模态架构中的性能。
 
-##### **Zero-shot Concept Bottleneck Models**
-2502.09018v1 by Shin'ya Yamaguchi, Kosuke Nishida, Daiki Chijiwa, Yasutoshi Ida
+##### **Beyond Yes or No: Predictive Compliance Monitoring Approaches for Quantifying the Magnitude of Compliance Violations**
+2502.01141v1 by Qian Chen, Stefanie Rinderle-Ma, Lijie Wen
 
-Concept bottleneck models (CBMs) are inherently interpretable and
-intervenable neural network models, which explain their final label prediction
-by the intermediate prediction of high-level semantic concepts. However, they
-require target task training to learn input-to-concept and concept-to-label
-mappings, incurring target dataset collections and training resources. In this
-paper, we present \textit{zero-shot concept bottleneck models} (Z-CBMs), which
-predict concepts and labels in a fully zero-shot manner without training neural
-networks. Z-CBMs utilize a large-scale concept bank, which is composed of
-millions of vocabulary extracted from the web, to describe arbitrary input in
-various domains. For the input-to-concept mapping, we introduce concept
-retrieval, which dynamically finds input-related concepts by the cross-modal
-search on the concept bank. In the concept-to-label inference, we apply concept
-regression to select essential concepts from the retrieved concepts by sparse
-linear regression. Through extensive experiments, we confirm that our Z-CBMs
-provide interpretable and intervenable concepts without any additional
-training. Code will be available at https://github.com/yshinya6/zcbm.
+Most existing process compliance monitoring approaches detect compliance
+violations in an ex post manner. Only predicate prediction focuses on
+predicting them. However, predicate prediction provides a binary yes/no notion
+of compliance, lacking the ability to measure to which extent an ongoing
+process instance deviates from the desired state as specified in constraints.
+Here, being able to quantify the magnitude of violation would provide
+organizations with deeper insights into their operational performance, enabling
+informed decision making to reduce or mitigate the risk of non-compliance.
+Thus, we propose two predictive compliance monitoring approaches to close this
+research gap. The first approach reformulates the binary classification problem
+as a hybrid task that considers both classification and regression, while the
+second employs a multi-task learning method to explicitly predict the
+compliance status and the magnitude of violation for deviant cases
+simultaneously. In this work, we focus on temporal constraints as they are
+significant in almost any application domain, e.g., health care. The evaluation
+on synthetic and real-world event logs demonstrates that our approaches are
+capable of quantifying the magnitude of violations while maintaining comparable
+performance for compliance predictions achieved by state-of-the-art approaches.
 
-摘要：概念瓶頸模型 (CBM) 本質上是可解釋且可干預的神經網路模型，它們透過對高階語意概念的中間預測來解釋其最終標籤預測。然而，它們需要目標任務訓練來學習輸入到概念和概念到標籤的對應，導致目標資料集收集和訓練資源。在本文中，我們展示了「零次學習概念瓶頸模型」(Z-CBM)，它以完全零次學習的方式預測概念和標籤，而無需訓練神經網路。Z-CBM 利用一個大型概念庫，其中包含從網路中擷取的數百萬個詞彙，來描述各種領域中的任意輸入。對於輸入到概念的對應，我們引入了概念擷取，它透過對概念庫的跨模態搜尋，動態地找出與輸入相關的概念。在概念到標籤的推論中，我們應用概念迴歸，透過稀疏線性迴歸從擷取的概念中選擇必要的概念。透過廣泛的實驗，我們確認我們的 Z-CBM 在沒有任何額外訓練的情況下提供了可解釋且可干預的概念。程式碼將可在 https://github.com/yshinya6/zcbm 取得。
+摘要：現有的流程合規監控方法大多會在事後偵測到合規違規。只有謂詞預測專注於預測這些違規。然而，謂詞預測提供的是合規與否的二元概念，無法衡量正在進行的流程實例偏離約束中所指定之理想狀態的程度。在此，能夠量化違規的嚴重程度，將能讓組織深入了解其營運績效，並能據此做出明智的決策，以降低或減輕不合規的風險。因此，我們提出兩種預測合規監控方法來填補此研究空白。第一種方法將二元分類問題重新表述為同時考量分類和回歸的混合任務，而第二種方法則採用多任務學習方法，同時明確預測合規狀態和偏差案例的違規嚴重程度。在這項工作中，我們專注於時間約束，因為它們幾乎在任何應用領域（例如醫療保健）中都很重要。在合成和真實世界事件記錄上的評估顯示，我們的做法能夠量化違規的嚴重程度，同時維持與現有方法所達成的合規預測相當的績效。
 
-##### **Diversity Enhances an LLM's Performance in RAG and Long-context Task**
-2502.09017v1 by Zhchao Wang, Bin Bi, Yanqi Luo, Sitaram Asur, Claire Na Cheng
+##### **Pulse-PPG: An Open-Source Field-Trained PPG Foundation Model for Wearable Applications Across Lab and Field Settings**
+2502.01108v1 by Mithun Saha, Maxwell A. Xu, Wanting Mao, Sameer Neupane, James M. Rehg, Santosh Kumar
 
-The rapid advancements in large language models (LLMs) have highlighted the
-challenge of context window limitations, primarily due to the quadratic time
-complexity of the self-attention mechanism (\(O(N^2)\), where \(N\) denotes the
-context window length). This constraint impacts tasks such as
-retrieval-augmented generation (RAG) in question answering (Q\&A) and long
-context summarization. A common approach involves selecting content with the
-highest similarity to the query; however, this often leads to redundancy and
-the exclusion of diverse yet relevant information. Building on principles from
-Maximal Marginal Relevance (MMR) and Farthest Point Sampling (FPS), we
-integrate diversity into the content selection process. Our findings reveal
-that incorporating diversity substantially increases the recall of selecting
-relevant sentences or chunks before LLM-based Q\&A and summarization. These
-results highlight the importance of maintaining diversity in future LLM
-applications to further improve summarization and Q\&A outcomes.
+Photoplethysmography (PPG)-based foundation models are gaining traction due
+to the widespread use of PPG in biosignal monitoring and their potential to
+generalize across diverse health applications. In this paper, we introduce
+Pulse-PPG, the first open-source PPG foundation model trained exclusively on
+raw PPG data collected over a 100-day field study with 120 participants.
+Existing PPG foundation models are either open-source but trained on clinical
+data or closed-source, limiting their applicability in real-world settings. We
+evaluate Pulse-PPG across multiple datasets and downstream tasks, comparing its
+performance against a state-of-the-art foundation model trained on clinical
+data. Our results demonstrate that Pulse-PPG, trained on uncurated field data,
+exhibits superior generalization across clinical and mobile health applications
+in both lab and field settings. This suggests that exposure to real-world
+variability enables the model to learn fine-grained representations, making it
+more adaptable across tasks. Furthermore, pre-training on field data
+surprisingly outperforms its pre-training on clinical data in many tasks,
+reinforcing the importance of training on real-world, diverse datasets. To
+encourage further advancements in robust foundation models leveraging field
+data, we plan to release Pulse-PPG, providing researchers with a powerful
+resource for developing more generalizable PPG-based models.
 
-摘要：大型語言模型 (LLM) 的快速進步凸顯了上下文視窗限制的挑戰，這主要是由於自注意力機制的二次時間複雜度（\(O(N^2)\)），其中 \(N\) 表示上下文視窗長度。此限制會影響任務，例如問答 (Q&A) 中的檢索增強生成 (RAG) 和長文摘要。一種常見的方法涉及選擇與查詢最相似的內容；然而，這通常會導致冗餘，並排除多樣化但相關的資訊。我們根據最大邊際相關性 (MMR) 和最遠點取樣 (FPS) 的原則，將多樣性整合到內容選擇過程中。我們的研究結果顯示，在基於 LLM 的問答和摘要之前，納入多樣性會大幅增加選擇相關句子或區塊的召回率。這些結果突顯了在未來的 LLM 應用中維持多樣性的重要性，以進一步改善摘要和問答的結果。
+摘要：基於光電容積描記術 (PPG) 的基礎模型由於 PPG 在生物訊號監控中的廣泛使用及其在各種健康應用中推廣的潛力而備受關注。在本文中，我們介紹 Pulse-PPG，這是第一個開放原始碼 PPG 基礎模型，專門針對在為期 100 天的現場研究中收集的 120 位參與者的原始 PPG 資料進行訓練。現有的 PPG 基礎模型要不是開放原始碼，但訓練於臨床資料，不然就是閉源，這限制了它們在真實世界中的應用性。我們評估了 Pulse-PPG 在多個資料集和下游任務中的表現，並將其效能與訓練於臨床資料的最新基礎模型進行比較。我們的結果表明，訓練於未整理現場資料的 Pulse-PPG 在實驗室和現場環境中，在臨床和行動健康應用中展現出優異的泛化能力。這表明接觸真實世界的變異性使模型能夠學習細粒度的表示，使其更能適應各種任務。此外，令人驚訝的是，現場資料的預訓練在許多任務中優於臨床資料的預訓練，這強化了在真實世界、多樣化的資料集上訓練的重要性。為了鼓勵在利用現場資料的強健基礎模型方面進一步發展，我們計畫發布 Pulse-PPG，為研究人員提供一個強大的資源，用於開發更具泛化性的基於 PPG 的模型。
 
-##### **Hope vs. Hate: Understanding User Interactions with LGBTQ+ News Content in Mainstream US News Media through the Lens of Hope Speech**
-2502.09004v1 by Jonathan Pofcher, Christopher M. Homan, Randall Sell, Ashiqur R. KhudaBukhsh
+##### **Tutorial on Using Machine Learning and Deep Learning Models for Mental Illness Detection**
+2502.04342v1 by Yeyubei Zhang, Zhongyan Wang, Zhanyi Ding, Yexin Tian, Jianglai Dai, Xiaorui Shen, Yunchong Liu, Yuchen Cao
 
-This paper makes three contributions. First, via a substantial corpus of
-1,419,047 comments posted on 3,161 YouTube news videos of major US cable news
-outlets, we analyze how users engage with LGBTQ+ news content. Our analyses
-focus both on positive and negative content. In particular, we construct a
-fine-grained hope speech classifier that detects positive (hope speech),
-negative, neutral, and irrelevant content. Second, in consultation with a
-public health expert specializing on LGBTQ+ health, we conduct an annotation
-study with a balanced and diverse political representation and release a
-dataset of 3,750 instances with fine-grained labels and detailed annotator
-demographic information. Finally, beyond providing a vital resource for the
-LGBTQ+ community, our annotation study and subsequent in-the-wild assessments
-reveal (1) strong association between rater political beliefs and how they rate
-content relevant to a marginalized community; (2) models trained on individual
-political beliefs exhibit considerable in-the-wild disagreement; and (3)
-zero-shot large language models (LLMs) align more with liberal raters.
+Social media has become an important source for understanding mental health,
+providing researchers with a way to detect conditions like depression from
+user-generated posts. This tutorial provides practical guidance to address
+common challenges in applying machine learning and deep learning methods for
+mental health detection on these platforms. It focuses on strategies for
+working with diverse datasets, improving text preprocessing, and addressing
+issues such as imbalanced data and model evaluation. Real-world examples and
+step-by-step instructions demonstrate how to apply these techniques
+effectively, with an emphasis on transparency, reproducibility, and ethical
+considerations. By sharing these approaches, this tutorial aims to help
+researchers build more reliable and widely applicable models for mental health
+research, contributing to better tools for early detection and intervention.
 
-摘要：本文做出了三項貢獻。首先，透過一個龐大的語料庫，其中包含 1,419,047 則評論，這些評論張貼在 3,161 部美國有線新聞頻道的 YouTube 新聞影片上，我們分析了使用者如何參與 LGBTQ+ 新聞內容。我們的分析重點在於正面和負面的內容。特別是，我們建構了一個細緻的希望言論分類器，用來偵測正面的（希望言論）、負面的、中立的和不相關的內容。其次，在諮詢了一位專門研究 LGBTQ+ 健康的公共衛生專家後，我們進行了一項標註研究，其中包含平衡且多元的政治代表性，並發布了一個包含 3,750 個實例的資料集，其中包含細緻的標籤和詳細的標註者人口統計資訊。最後，除了為 LGBTQ+ 社群提供重要的資源外，我們的標註研究和後續的實際評估揭示了：(1) 評分者的政治信仰與他們如何評分與邊緣化社群相關的內容之間有很強的關聯性；(2) 根據個人政治信仰訓練的模型在實際應用中表現出相當大的分歧；(3) 零次學習大型語言模型 (LLM) 與自由派評分者的看法更一致。
+摘要：社群媒體已成為了解心理健康的重要來源，
+為研究人員提供一種方式，從使用者發布的貼文中偵測憂鬱症等狀況。
+本教學提供實務指南，說明如何處理在這些平台上使用機器學習和深度學習方法進行心理健康偵測時常見的挑戰。
+它專注於處理不同資料集、改善文字前處理，以及處理不平衡資料和模型評估等問題的策略。
+實際範例和逐步說明示範如何有效應用這些技術，並強調透明度、可複製性，以及倫理考量。
+透過分享這些方法，本教學指南旨在協助研究人員建構更可靠且廣泛適用的心理健康研究模型，
+進而有助於早期偵測和介入的工具。
 
-##### **RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models**
-2502.09003v1 by Quan Wei, Chung-Yiu Yau, Hoi-To Wai, Yang, Zhao, Dongyeop Kang, Youngsuk Park, Mingyi Hong
+##### **Agent-Based Uncertainty Awareness Improves Automated Radiology Report Labeling with an Open-Source Large Language Model**
+2502.01691v1 by Hadas Ben-Atya, Naama Gavrielov, Zvi Badash, Gili Focht, Ruth Cytter-Kuint, Talar Hagopian, Dan Turner, Moti Freiman
 
-Supervised fine-tuning is a standard method for adapting pre-trained large
-language models (LLMs) to downstream tasks. Quantization has been recently
-studied as a post-training technique for efficient LLM deployment. To obtain
-quantized fine-tuned LLMs, conventional pipelines would first fine-tune the
-pre-trained models, followed by post-training quantization. This often yields
-suboptimal performance as it fails to leverage the synergy between fine-tuning
-and quantization. To effectively realize low-bit quantization of weights,
-activations, and KV caches in LLMs, we propose an algorithm named Rotated
-Straight-Through-Estimator (RoSTE), which combines quantization-aware
-supervised fine-tuning (QA-SFT) with an adaptive rotation strategy that
-identifies an effective rotation configuration to reduce activation outliers.
-We provide theoretical insights on RoSTE by analyzing its prediction error when
-applied to an overparameterized least square quantized training problem. Our
-findings reveal that the prediction error is directly proportional to the
-quantization error of the converged weights, which can be effectively managed
-through an optimized rotation configuration. Experiments on Pythia and Llama
-models of different sizes demonstrate the effectiveness of RoSTE. Compared to
-existing post-SFT quantization baselines, our method consistently achieves
-superior performances across various tasks and different LLM architectures.
+Reliable extraction of structured data from radiology reports using Large
+Language Models (LLMs) remains challenging, especially for complex, non-English
+texts like Hebrew. This study introduces an agent-based uncertainty-aware
+approach to improve the trustworthiness of LLM predictions in medical
+applications. We analyzed 9,683 Hebrew radiology reports from Crohn's disease
+patients (from 2010 to 2023) across three medical centers. A subset of 512
+reports was manually annotated for six gastrointestinal organs and 15
+pathological findings, while the remaining reports were automatically annotated
+using HSMP-BERT. Structured data extraction was performed using Llama 3.1
+(Llama 3-8b-instruct) with Bayesian Prompt Ensembles (BayesPE), which employed
+six semantically equivalent prompts to estimate uncertainty. An Agent-Based
+Decision Model integrated multiple prompt outputs into five confidence levels
+for calibrated uncertainty and was compared against three entropy-based models.
+Performance was evaluated using accuracy, F1 score, precision, recall, and
+Cohen's Kappa before and after filtering high-uncertainty cases. The
+agent-based model outperformed the baseline across all metrics, achieving an F1
+score of 0.3967, recall of 0.6437, and Cohen's Kappa of 0.3006. After filtering
+high-uncertainty cases (greater than or equal to 0.5), the F1 score improved to
+0.4787, and Kappa increased to 0.4258. Uncertainty histograms demonstrated
+clear separation between correct and incorrect predictions, with the
+agent-based model providing the most well-calibrated uncertainty estimates. By
+incorporating uncertainty-aware prompt ensembles and an agent-based decision
+model, this approach enhances the performance and reliability of LLMs in
+structured data extraction from radiology reports, offering a more
+interpretable and trustworthy solution for high-stakes medical applications.
 
-摘要：監督式微調是將預訓練的大型語言模型 (LLM) 適應至下游任務的標準方法。量化最近已被研究作為一種訓練後技術，用於高效部署 LLM。為了獲得量化的微調 LLM，傳統管道會先微調預訓練模型，然後再進行訓練後量化。這通常會產生次佳效能，因為它無法利用微調和量化之間的協同效應。為了有效實現 LLM 中權重、激活和 KV 快取的低位元量化，我們提出了一種名為旋轉直通估計器 (RoSTE) 的演算法，它結合了量化感知監督式微調 (QA-SFT) 和一種自適應旋轉策略，該策略會識別有效的旋轉組態以減少激活異常值。我們透過分析 RoSTE 在應用於過度參數化最小平方量化訓練問題時的預測誤差，提供了關於 RoSTE 的理論見解。我們的研究結果顯示，預測誤差與收斂權重的量化誤差成正比，而這可透過最佳化的旋轉組態有效地管理。在不同大小的 Pythia 和 Llama 模型上進行的實驗證明了 RoSTE 的有效性。與現有的訓練後 SFT 量化基準相比，我們的模型在各種任務和不同的 LLM 架構中持續獲得優異的效能。
+摘要：<paragraph>使用大型語言模型 (LLM) 從放射科報告中可靠地提取結構化數據仍然具有挑戰性，尤其是對於希伯來語等複雜的非英語文本。本研究引入了一種基於代理的不確定性感知方法，以提高 LLM 預測在醫療應用中的可信度。我們分析了來自三個醫療中心的 9,683 份克隆氏症患者的希伯來語放射科報告（從 2010 年到 2023 年）。其中 512 份報告的手動註釋包括六個胃腸器官和 15 個病理發現，而其餘報告則使用 HSMP-BERT 自動註釋。結構化數據提取使用 Llama 3.1（Llama 3-8b-instruct）與貝葉斯提示集合（BayesPE）進行，它採用六個語義等效提示來估計不確定性。基於代理的決策模型將多個提示輸出整合到五個置信度級別中以校準不確定性，並與三個基於熵的模型進行比較。在過濾掉高度不確定性的情況之前和之後，使用準確度、F1 分數、精確度、召回率和 Cohen's Kappa 評估性能。基於代理的模型在所有指標上都優於基線，F1 分數達到 0.3967，召回率達到 0.6437，Cohen's Kappa 達到 0.3006。在過濾掉高度不確定性的情況（大於或等於 0.5）後，F1 分數提高到 0.4787，Kappa 提高到 0.4258。不確定性直方圖顯示了正確預測和不正確預測之間的明顯區別，基於代理的模型提供了校準最好的不確定性估計。通過結合不確定性感知提示集合和基於代理的決策模型，這種方法增強了 LLM 在放射科報告中結構化數據提取中的性能和可靠性，為高風險醫療應用提供了更具可解釋性和可信度的解決方案。</paragraph>
 
-##### **PixLift: Accelerating Web Browsing via AI Upscaling**
-2502.08995v1 by Yonas Atinafu, Sarthak Malla, HyunSeok Daniel Jang, Nouar Aldahoul, Matteo Varvello, Yasir Zaki
+##### **Automated Extraction of Spatio-Semantic Graphs for Identifying Cognitive Impairment**
+2502.01685v1 by Si-Ioi Ng, Pranav S. Ambadi, Kimberly D. Mueller, Julie Liss, Visar Berisha
 
-Accessing the internet in regions with expensive data plans and limited
-connectivity poses significant challenges, restricting information access and
-economic growth. Images, as a major contributor to webpage sizes, exacerbate
-this issue, despite advances in compression formats like WebP and AVIF. The
-continued growth of complex and curated web content, coupled with suboptimal
-optimization practices in many regions, has prevented meaningful reductions in
-web page sizes. This paper introduces PixLift, a novel solution to reduce
-webpage sizes by downscaling their images during transmission and leveraging AI
-models on user devices to upscale them. By trading computational resources for
-bandwidth, PixLift enables more affordable and inclusive web access. We address
-key challenges, including the feasibility of scaled image requests on popular
-websites, the implementation of PixLift as a browser extension, and its impact
-on user experience. Through the analysis of 71.4k webpages, evaluations of
-three mainstream upscaling models, and a user study, we demonstrate PixLift's
-ability to significantly reduce data usage without compromising image quality,
-fostering a more equitable internet.
+Existing methods for analyzing linguistic content from picture descriptions
+for assessment of cognitive-linguistic impairment often overlook the
+participant's visual narrative path, which typically requires eye tracking to
+assess. Spatio-semantic graphs are a useful tool for analyzing this narrative
+path from transcripts alone, however they are limited by the need for manual
+tagging of content information units (CIUs). In this paper, we propose an
+automated approach for estimation of spatio-semantic graphs (via automated
+extraction of CIUs) from the Cookie Theft picture commonly used in
+cognitive-linguistic analyses. The method enables the automatic
+characterization of the visual semantic path during picture description.
+Experiments demonstrate that the automatic spatio-semantic graphs effectively
+differentiate between cognitively impaired and unimpaired speakers. Statistical
+analyses reveal that the features derived by the automated method produce
+comparable results to the manual method, with even greater group differences
+between clinical groups of interest. These results highlight the potential of
+the automated approach for extracting spatio-semantic features in developing
+clinical speech models for cognitive impairment assessment.
 
-摘要：在數據方案昂貴且連線有限的地區存取網路會造成重大挑戰，限制了資訊存取和經濟成長。圖像作為網頁大小的主要貢獻者，儘管 WebP 和 AVIF 等壓縮格式進步，但仍加劇了這個問題。複雜且經過策劃的網路內容持續成長，加上許多地區次佳的最佳化實務，已阻礙了網頁大小的顯著減少。本文介紹 PixLift，這是一種創新的解決方案，可在傳輸過程中縮小圖像大小，並利用使用者裝置上的 AI 模型來放大圖像，藉此縮小網頁大小。PixLift 透過以運算資源換取頻寬，讓網路存取更經濟實惠且更具包容性。我們解決了關鍵挑戰，包括熱門網站上縮放圖像要求的可行性、將 PixLift 實作為瀏覽器擴充功能，以及它對使用者體驗的影響。透過分析 71.4k 個網頁、評估三個主流放大模型，以及使用者研究，我們展示了 PixLift 在不影響影像品質的情況下顯著減少資料用量的能力，促進了更公平的網路。
+摘要：現有的用於分析圖像描述中的語言內容的方法，用於評估認知語言障礙，通常會忽略參與者的視覺敘事路徑，這通常需要眼球追蹤來評估。時空語義圖是一種有用的工具，可以僅從轉錄本中分析此敘事路徑，但是它們受到手動標記內容資訊單元 (CIU) 的需求所限制。在本文中，我們提出了一種自動化方法，用於從認知語言分析中常用的 Cookie Theft 圖像估計時空語義圖（通過自動提取 CIU）。該方法能夠自動表徵圖片描述期間的視覺語義路徑。實驗表明，自動時空語義圖有效地區分了認知受損和未受損的說話者。統計分析表明，自動化方法衍生的特徵產生了與手動方法相當的結果，甚至在感興趣的臨床組之間產生了更大的組差異。這些結果突出了自動化方法在提取時空語義特徵以開發用於認知障礙評估的臨床語音模型方面的潛力。
 
-##### **RLSA-PFL: Robust Lightweight Secure Aggregation with Model Inconsistency Detection in Privacy-Preserving Federated Learning**
-2502.08989v1 by Nazatul H. Sultan, Yan Bo, Yansong Gao, Seyit Camtepe, Arash Mahboubi, Hang Thanh Bui, Aufeef Chauhan, Hamed Aboutorab, Michael Bewong, Praveen Gauravaram, Rafiqul Islam, Sharif Abuadbba
+##### **Registration-Enhanced Segmentation Method for Prostate Cancer in Ultrasound Images**
+2502.00712v1 by Shengtian Sang, Hassan Jahanandish, Cynthia Xinran Li, Indrani Bhattachary, Jeong Hoon Lee, Lichun Zhang, Sulaiman Vesal, Pejman Ghanouni, Richard Fan, Geoffrey A. Sonn, Mirabela Rusu
 
-Federated Learning (FL) allows users to collaboratively train a global
-machine learning model by sharing local model only, without exposing their
-private data to a central server. This distributed learning is particularly
-appealing in scenarios where data privacy is crucial, and it has garnered
-substantial attention from both industry and academia. However, studies have
-revealed privacy vulnerabilities in FL, where adversaries can potentially infer
-sensitive information from the shared model parameters. In this paper, we
-present an efficient masking-based secure aggregation scheme utilizing
-lightweight cryptographic primitives to mitigate privacy risks. Our scheme
-offers several advantages over existing methods. First, it requires only a
-single setup phase for the entire FL training session, significantly reducing
-communication overhead. Second, it minimizes user-side overhead by eliminating
-the need for user-to-user interactions, utilizing an intermediate server layer
-and a lightweight key negotiation method. Third, the scheme is highly resilient
-to user dropouts, and the users can join at any FL round. Fourth, it can detect
-and defend against malicious server activities, including recently discovered
-model inconsistency attacks. Finally, our scheme ensures security in both
-semi-honest and malicious settings. We provide security analysis to formally
-prove the robustness of our approach. Furthermore, we implemented an end-to-end
-prototype of our scheme. We conducted comprehensive experiments and
-comparisons, which show that it outperforms existing solutions in terms of
-communication and computation overhead, functionality, and security.
+Prostate cancer is a major cause of cancer-related deaths in men, where early
+detection greatly improves survival rates. Although MRI-TRUS fusion biopsy
+offers superior accuracy by combining MRI's detailed visualization with TRUS's
+real-time guidance, it is a complex and time-intensive procedure that relies
+heavily on manual annotations, leading to potential errors. To address these
+challenges, we propose a fully automatic MRI-TRUS fusion-based segmentation
+method that identifies prostate tumors directly in TRUS images without
+requiring manual annotations. Unlike traditional multimodal fusion approaches
+that rely on naive data concatenation, our method integrates a
+registration-segmentation framework to align and leverage spatial information
+between MRI and TRUS modalities. This alignment enhances segmentation accuracy
+and reduces reliance on manual effort. Our approach was validated on a dataset
+of 1,747 patients from Stanford Hospital, achieving an average Dice coefficient
+of 0.212, outperforming TRUS-only (0.117) and naive MRI-TRUS fusion (0.132)
+methods, with significant improvements (p $<$ 0.01). This framework
+demonstrates the potential for reducing the complexity of prostate cancer
+diagnosis and provides a flexible architecture applicable to other multimodal
+medical imaging tasks.
 
-摘要：聯合式學習 (FL) 使用者可以透過僅分享本機模型，在不將其私人資料揭露給中央伺服器的情況下，共同訓練全球機器學習模型。這種分散式學習在資料隱私至關重要的場景中特別具有吸引力，並且已獲得業界和學術界的廣泛關注。然而，研究顯示 FL 中存在隱私漏洞，其中對手可能會從共享模型參數中推斷出敏感資訊。在本文中，我們提出了一種有效率的基於遮罩的安全聚合方案，利用輕量級的密碼原語來降低隱私風險。我們的方案相較於現有方法提供了多項優點。首先，它僅需要在整個 FL 訓練階段進行一次設定階段，大幅降低了通訊開銷。其次，透過消除使用者間互動的需要，利用中間伺服器層和輕量級金鑰協商方法，將使用者端的開銷降到最低。第三，該方案對使用者中斷具有高度的復原力，使用者可以在任何 FL 回合中加入。第四，它可以偵測和防禦惡意伺服器活動，包括最近發現的模型不一致攻擊。最後，我們的方案確保在半誠實和惡意設定中都能獲得安全性。我們提供了安全分析，以正式證明我們方法的穩健性。此外，我們實作了我們方案的端對端原型。我們進行了全面的實驗和比較，結果顯示，在通訊和運算開銷、功能和安全性方面，它優於現有的解決方案。
+摘要：前列腺癌是男性癌症相關死亡的主要原因，早期發現可大幅提升存活率。儘管 MRI-TRUS 融合切片檢查結合了 MRI 的詳細視覺化與 TRUS 的即時導引，可提供更高的準確度，但它是一種仰賴大量手動註解的複雜且耗時的程序，容易導致錯誤。為了解決這些挑戰，我們提出了一種全自動的 MRI-TRUS 融合式分割方法，它可以在 TRUS 影像中直接辨識出前列腺腫瘤，而不需要手動註解。與依賴於天真資料串接的傳統多模態融合方法不同，我們的方法整合了一個配準分割架構，以對齊並利用 MRI 與 TRUS 模態之間的空間資訊。這種對齊提升了分割準確度，並減少了對手動作業的依賴。我們的方法已通過來自 Stanford 醫院的 1,747 位患者的資料集進行驗證，達到了 0.212 的平均 Dice 係數，優於僅使用 TRUS (0.117) 和天真的 MRI-TRUS 融合 (0.132) 方法，並有顯著的改善（p < 0.01）。這個架構證明了降低前列腺癌診斷複雜性的潛力，並提供了一個適用於其他多模態醫學影像任務的彈性架構。
 
-##### **Neural Force Field: Learning Generalized Physical Representation from a Few Examples**
-2502.08987v1 by Shiqian Li, Ruihong Shen, Chi Zhang, Yixin Zhu
+##### **TMI-CLNet: Triple-Modal Interaction Network for Chronic Liver Disease Prognosis From Imaging, Clinical, and Radiomic Data Fusion**
+2502.00695v1 by Linglong Wu, Xuhao Shan, Ruiquan Ge, Ruoyu Liang, Chi Zhang, Yonghong Li, Ahmed Elazab, Huoling Luo, Yunbi Liu, Changmiao Wang
 
-Physical reasoning is a remarkable human ability that enables rapid learning
-and generalization from limited experience. Current AI models, despite
-extensive training, still struggle to achieve similar generalization,
-especially in Out-of-distribution (OOD) settings. This limitation stems from
-their inability to abstract core physical principles from observations. A key
-challenge is developing representations that can efficiently learn and
-generalize physical dynamics from minimal data. Here we present Neural Force
-Field (NFF) a modeling framework built on Neural Ordinary Differential Equation
-(NODE) that learns interpretable force field representations which can be
-efficiently integrated through an Ordinary Differential Equation ( ODE) solver
-to predict object trajectories. Unlike existing approaches that rely on
-high-dimensional latent spaces, NFF captures fundamental physical concepts such
-as gravity, support, and collision in an interpretable manner. Experiments on
-two challenging physical reasoning tasks demonstrate that NFF, trained with
-only a few examples, achieves strong generalization to unseen scenarios. This
-physics-grounded representation enables efficient forward-backward planning and
-rapid adaptation through interactive refinement. Our work suggests that
-incorporating physics-inspired representations into learning systems can help
-bridge the gap between artificial and human physical reasoning capabilities.
+Chronic liver disease represents a significant health challenge worldwide and
+accurate prognostic evaluations are essential for personalized treatment plans.
+Recent evidence suggests that integrating multimodal data, such as computed
+tomography imaging, radiomic features, and clinical information, can provide
+more comprehensive prognostic information. However, modalities have an inherent
+heterogeneity, and incorporating additional modalities may exacerbate the
+challenges of heterogeneous data fusion. Moreover, existing multimodal fusion
+methods often struggle to adapt to richer medical modalities, making it
+difficult to capture inter-modal relationships. To overcome these limitations,
+We present the Triple-Modal Interaction Chronic Liver Network (TMI-CLNet).
+Specifically, we develop an Intra-Modality Aggregation module and a
+Triple-Modal Cross-Attention Fusion module, which are designed to eliminate
+intra-modality redundancy and extract cross-modal information, respectively.
+Furthermore, we design a Triple-Modal Feature Fusion loss function to align
+feature representations across modalities. Extensive experiments on the liver
+prognosis dataset demonstrate that our approach significantly outperforms
+existing state-of-the-art unimodal models and other multi-modal techniques. Our
+code is available at https://github.com/Mysterwll/liver.git.
 
-摘要：物理推理是人类非凡的能力，它能从有限的经验中快速学习和概括。尽管经过广泛的训练，但当前的人工智能模型在实现类似的概括方面仍然存在困难，尤其是在分布外 (OOD) 设置中。这种限制源于它们无法从观察中抽象出核心物理原理。一个关键挑战是开发能够从最少数据中有效学习和概括物理动力学的表示。在这里，我们介绍了神经力场 (NFF)，这是一种建立在神经常微分方程 (NODE) 上的建模框架，它学习可解释的力场表示，这些表示可以通过常微分方程 (ODE) 求解器有效地进行积分，以预测物体轨迹。与依赖于高维潜在空间的现有方法不同，NFF 以可解释的方式捕获了诸如重力、支撑和碰撞等基本物理概念。在两个具有挑战性的物理推理任务上的实验表明，仅通过几个示例训练的 NFF 实现了对看不见场景的强大概括。这种基于物理的表示能够进行高效的前向后向规划，并通过交互式细化实现快速适应。我们的工作表明，将受物理启发的表示纳入学习系统可以帮助弥合人工智能和人类物理推理能力之间的差距。
+摘要：慢性肝病在全球范围内代表著重大的健康挑戰，而準確的預後評估對於個人化治療計畫至關重要。最近的證據表明，整合多模態資料（例如電腦斷層影像、放射特徵和臨床資訊）可以提供更全面的預後資訊。然而，模態具有內在異質性，而納入額外的模態可能會加劇異質化資料融合的挑戰。此外，現有的多模態融合方法通常難以適應更豐富的醫療模態，這使得難以捕捉模態間的關係。為了克服這些限制，我們提出了三模態交互慢性肝臟網路 (TMI-CLNet)。具體來說，我們開發了一個模態內聚合模組和一個三模態交叉注意力融合模組，它們分別旨在消除模態內冗餘和提取跨模態資訊。此外，我們設計了一個三模態特徵融合損失函數，以對齊跨模態的特徵表示。在肝臟預後資料集上的廣泛實驗表明，我們的做法顯著優於現有的最先進單模態模型和其他多模態技術。我們的程式碼可以在 https://github.com/Mysterwll/liver.git 上取得。
 
-##### **Tuning-Free Personalized Alignment via Trial-Error-Explain In-Context Learning**
-2502.08972v1 by Hyundong Cho, Karishma Sharma, Nicolaas Jedema, Leonardo F. R. Ribeiro, Alessandro Moschitti, Ravi Krishnan, Jonathan May
+##### **Safety at Scale: A Comprehensive Survey of Large Model Safety**
+2502.05206v2 by Xingjun Ma, Yifeng Gao, Yixu Wang, Ruofan Wang, Xin Wang, Ye Sun, Yifan Ding, Hengyuan Xu, Yunhao Chen, Yunhan Zhao, Hanxun Huang, Yige Li, Jiaming Zhang, Xiang Zheng, Yang Bai, Zuxuan Wu, Xipeng Qiu, Jingfeng Zhang, Yiming Li, Jun Sun, Cong Wang, Jindong Gu, Baoyuan Wu, Siheng Chen, Tianwei Zhang, Yang Liu, Mingming Gong, Tongliang Liu, Shirui Pan, Cihang Xie, Tianyu Pang, Yinpeng Dong, Ruoxi Jia, Yang Zhang, Shiqing Ma, Xiangyu Zhang, Neil Gong, Chaowei Xiao, Sarah Erfani, Bo Li, Masashi Sugiyama, Dacheng Tao, James Bailey, Yu-Gang Jiang
 
-Language models are aligned to the collective voice of many, resulting in
-generic outputs that do not align with specific users' styles. In this work, we
-present Trial-Error-Explain In-Context Learning (TICL), a tuning-free method
-that personalizes language models for text generation tasks with fewer than 10
-examples per user. TICL iteratively expands an in-context learning prompt via a
-trial-error-explain process, adding model-generated negative samples and
-explanations that provide fine-grained guidance towards a specific user's
-style. TICL achieves favorable win rates on pairwise comparisons with
-LLM-as-a-judge up to 91.5% against the previous state-of-the-art and
-outperforms competitive tuning-free baselines for personalized alignment tasks
-of writing emails, essays and news articles. Both lexical and qualitative
-analyses show that the negative samples and explanations enable language models
-to learn stylistic context more effectively and overcome the bias towards
-structural and formal phrases observed in their zero-shot outputs. By
-front-loading inference compute to create a user-specific in-context learning
-prompt that does not require extra generation steps at test time, TICL presents
-a novel yet simple approach for personalized alignment.
+The rapid advancement of large models, driven by their exceptional abilities
+in learning and generalization through large-scale pre-training, has reshaped
+the landscape of Artificial Intelligence (AI). These models are now
+foundational to a wide range of applications, including conversational AI,
+recommendation systems, autonomous driving, content generation, medical
+diagnostics, and scientific discovery. However, their widespread deployment
+also exposes them to significant safety risks, raising concerns about
+robustness, reliability, and ethical implications. This survey provides a
+systematic review of current safety research on large models, covering Vision
+Foundation Models (VFMs), Large Language Models (LLMs), Vision-Language
+Pre-training (VLP) models, Vision-Language Models (VLMs), Diffusion Models
+(DMs), and large-model-based Agents. Our contributions are summarized as
+follows: (1) We present a comprehensive taxonomy of safety threats to these
+models, including adversarial attacks, data poisoning, backdoor attacks,
+jailbreak and prompt injection attacks, energy-latency attacks, data and model
+extraction attacks, and emerging agent-specific threats. (2) We review defense
+strategies proposed for each type of attacks if available and summarize the
+commonly used datasets and benchmarks for safety research. (3) Building on
+this, we identify and discuss the open challenges in large model safety,
+emphasizing the need for comprehensive safety evaluations, scalable and
+effective defense mechanisms, and sustainable data practices. More importantly,
+we highlight the necessity of collective efforts from the research community
+and international collaboration. Our work can serve as a useful reference for
+researchers and practitioners, fostering the ongoing development of
+comprehensive defense systems and platforms to safeguard AI models.
 
-摘要：語言模型與眾人的集體聲音保持一致，導致產出內容流於一般，無法與特定使用者的風格相符。在這項工作中，我們提出了試驗錯誤解釋情境內學習 (TICL)，一種免調校方法，能為文字生成任務個人化語言模型，每個使用者少於 10 個範例。TICL 透過試驗錯誤解釋程序反覆擴充情境內學習提示，加入模型產生的負面範例和說明，提供細緻的指導，引導至特定使用者的風格。TICL 在與 LLM 作為評審的成對比較中獲得了高勝率，高達 91.5%，優於先前的技術水準，並在個人化對齊任務中超越了競爭性的免調校基準，包括撰寫電子郵件、論文和新聞文章。詞彙和質性分析皆顯示，負面範例和說明讓語言模型能更有效地學習風格脈絡，並克服零次學習產出中觀察到的結構性和正式詞組偏誤。透過預先加載推論運算，建立使用者特定的情境內學習提示，無需在測試時額外產生步驟，TICL 呈現一種新穎卻簡潔的方法，用於個人化對齊。
+摘要：<paragraph>大型模型的快速進展，得益於它們在通過大規模預訓練進行學習和概括方面的卓越能力，已經重塑了人工智能 (AI) 的格局。這些模型現在是廣泛應用程式（包括對話式 AI、推薦系統、自動駕駛、內容生成、醫療診斷和科學發現）的基礎。然而，它們的廣泛部署也使它們面臨重大的安全風險，引發了對穩健性、可靠性和倫理影響的擔憂。本調查提供了對大型模型當前安全研究的系統性回顧，涵蓋視覺基礎模型 (VFM)、大型語言模型 (LLM)、視覺語言預訓練 (VLP) 模型、視覺語言模型 (VLM)、擴散模型 (DM) 和基於大型模型的代理。我們的貢獻總結如下：(1) 我們提出了一個針對這些模型的安全威脅的全面分類，包括對抗性攻擊、資料中毒、後門攻擊、越獄和提示注入攻擊、能量延遲攻擊、資料和模型提取攻擊以及新興的特定代理威脅。(2) 我們檢視了針對每種類型攻擊提出的防禦策略（如果有的話），並總結了安全研究中常用的資料集和基準。(3) 基於此，我們找出並討論了大型模型安全中的開放性挑戰，強調了對全面安全評估、可擴充且有效的防禦機制以及永續資料實務的需求。更重要的是，我們強調了研究社群和國際合作共同努力的必要性。我們的研究可作為研究人員和從業人員的有用參考，促進全面防禦系統和平台的持續發展，以保護 AI 模型。</paragraph>
 
-##### **RTBAS: Defending LLM Agents Against Prompt Injection and Privacy Leakage**
-2502.08966v1 by Peter Yong Zhong, Siyuan Chen, Ruiqi Wang, McKenna McCall, Ben L. Titzer, Heather Miller
+##### **Enhanced Convolutional Neural Networks for Improved Image Classification**
+2502.00663v1 by Xiaoran Yang, Shuhan Yu, Wenxi Xu
 
-Tool-Based Agent Systems (TBAS) allow Language Models (LMs) to use external
-tools for tasks beyond their standalone capabilities, such as searching
-websites, booking flights, or making financial transactions. However, these
-tools greatly increase the risks of prompt injection attacks, where malicious
-content hijacks the LM agent to leak confidential data or trigger harmful
-actions. Existing defenses (OpenAI GPTs) require user confirmation before every
-tool call, placing onerous burdens on users. We introduce Robust TBAS (RTBAS),
-which automatically detects and executes tool calls that preserve integrity and
-confidentiality, requiring user confirmation only when these safeguards cannot
-be ensured. RTBAS adapts Information Flow Control to the unique challenges
-presented by TBAS. We present two novel dependency screeners, using
-LM-as-a-judge and attention-based saliency, to overcome these challenges.
-Experimental results on the AgentDojo Prompt Injection benchmark show RTBAS
-prevents all targeted attacks with only a 2% loss of task utility when under
-attack, and further tests confirm its ability to obtain near-oracle performance
-on detecting both subtle and direct privacy leaks.
+Image classification is a fundamental task in computer vision with diverse
+applications, ranging from autonomous systems to medical imaging. The CIFAR-10
+dataset is a widely used benchmark to evaluate the performance of
+classification models on small-scale, multi-class datasets. Convolutional
+Neural Networks (CNNs) have demonstrated state-of-the-art results; however,
+they often suffer from overfitting and suboptimal feature representation when
+applied to challenging datasets like CIFAR-10. In this paper, we propose an
+enhanced CNN architecture that integrates deeper convolutional blocks, batch
+normalization, and dropout regularization to achieve superior performance. The
+proposed model achieves a test accuracy of 84.95%, outperforming baseline CNN
+architectures. Through detailed ablation studies, we demonstrate the
+effectiveness of the enhancements and analyze the hierarchical feature
+representations. This work highlights the potential of refined CNN
+architectures for tackling small-scale image classification problems
+effectively.
 
-摘要：基於工具的代理系統 (TBAS) 允許語言模型 (LM) 使用外部工具來執行超出其獨立功能的任務，例如搜尋網站、預訂航班或進行金融交易。然而，這些工具大幅增加了提示注入攻擊的風險，其中惡意內容劫持 LM 代理程式以洩露機密資料或觸發有害動作。現有的防禦措施 (OpenAI GPT) 在每次呼叫工具之前都需要使用者確認，這會對使用者造成沉重的負擔。我們引入了穩健的 TBAS (RTBAS)，它會自動偵測並執行保留完整性與機密性的工具呼叫，僅在無法確保這些防護措施時才需要使用者確認。RTBAS 將資訊流控制調整為 TBAS 呈現的獨特挑戰。我們提出兩種新穎的相依性篩選器，使用 LM 作為判斷者和基於注意力的顯著性，以克服這些挑戰。AgentDojo 提示注入基準上的實驗結果顯示，RTBAS 在受到攻擊時僅損失 2% 的任務效用，即可防止所有目標攻擊，進一步的測試證實了其在偵測細微和直接的隱私洩漏方面獲得接近神諭效能的能力。
+摘要：影像分類是電腦視覺中的一項基本任務，應用範圍廣泛，從自動系統到醫學影像皆有。CIFAR-10 資料集是一個廣泛使用的基準，用於評估分類模型在小規模、多類別資料集上的效能。卷積神經網路 (CNN) 已展現出最先進的成果；然而，當應用於 CIFAR-10 等具挑戰性的資料集時，它們常常會發生過度擬合和次佳特徵表示的問題。在本文中，我們提出一個增強的 CNN 架構，它整合了更深的卷積區塊、批次正規化和中斷正規化，以達成卓越的效能。所提出的模型達到了 84.95% 的測試準確度，優於基準 CNN 架構。透過詳細的消融研究，我們證明了這些增強功能的有效性，並分析了階層式特徵表示。這項工作突顯了精進的 CNN 架構在有效解決小規模影像分類問題上的潛力。
 
-##### **Biologically Plausible Brain Graph Transformer**
-2502.08958v1 by Ciyuan Peng, Yuelong Huang, Qichao Dong, Shuo Yu, Feng Xia, Chengqi Zhang, Yaochu Jin
+##### **Distribution-aware Fairness Learning in Medical Image Segmentation From A Control-Theoretic Perspective**
+2502.00619v1 by Yujin Oh, Pengfei Jin, Sangjoon Park, Sekeun Kim, Siyeop Yoon, Kyungsang Kim, Jin Sung Kim, Xiang Li, Quanzheng Li
 
-State-of-the-art brain graph analysis methods fail to fully encode the
-small-world architecture of brain graphs (accompanied by the presence of hubs
-and functional modules), and therefore lack biological plausibility to some
-extent. This limitation hinders their ability to accurately represent the
-brain's structural and functional properties, thereby restricting the
-effectiveness of machine learning models in tasks such as brain disorder
-detection. In this work, we propose a novel Biologically Plausible Brain Graph
-Transformer (BioBGT) that encodes the small-world architecture inherent in
-brain graphs. Specifically, we present a network entanglement-based node
-importance encoding technique that captures the structural importance of nodes
-in global information propagation during brain graph communication,
-highlighting the biological properties of the brain structure. Furthermore, we
-introduce a functional module-aware self-attention to preserve the functional
-segregation and integration characteristics of brain graphs in the learned
-representations. Experimental results on three benchmark datasets demonstrate
-that BioBGT outperforms state-of-the-art models, enhancing biologically
-plausible brain graph representations for various brain graph analytical tasks
+Ensuring fairness in medical image segmentation is critical due to biases in
+imbalanced clinical data acquisition caused by demographic attributes (e.g.,
+age, sex, race) and clinical factors (e.g., disease severity). To address these
+challenges, we introduce Distribution-aware Mixture of Experts (dMoE), inspired
+by optimal control theory. We provide a comprehensive analysis of its
+underlying mechanisms and clarify dMoE's role in adapting to heterogeneous
+distributions in medical image segmentation. Furthermore, we integrate dMoE
+into multiple network architectures, demonstrating its broad applicability
+across diverse medical image analysis tasks. By incorporating demographic and
+clinical factors, dMoE achieves state-of-the-art performance on two 2D
+benchmark datasets and a 3D in-house dataset. Our results highlight the
+effectiveness of dMoE in mitigating biases from imbalanced distributions,
+offering a promising approach to bridging control theory and medical image
+segmentation within fairness learning paradigms. The source code will be made
+available.
 
-摘要：目前最先进的大腦圖形分析方法無法完全編碼大腦圖形的小世界架構（伴隨著樞紐和功能模組的存在），因此在某種程度上缺乏生物學上的可信度。這種限制阻礙了它們準確表示大腦結構和功能特性的能力，從而限制了機器學習模型在腦部疾病檢測等任務中的有效性。在這項工作中，我們提出了一個新的生物學上可信的大腦圖形轉換器 (BioBGT)，它編碼了大腦圖形中固有的、小世界的架構。具體來說，我們提出了一種基於網路糾纏的節點重要性編碼技術，它捕捉了大腦圖形通信過程中節點在全球資訊傳播中的結構重要性，突出了大腦結構的生物學特性。此外，我們引入了一個功能模組感知自注意力，以保留學習表徵中大腦圖形的功能分離和整合特性。在三個基準資料集上的實驗結果表明，BioBGT 優於最先進的模型，增強了各種大腦圖形分析任務的生物學上可信的大腦圖形表徵
+摘要：在医学影像分割中，由於人口屬性（例如年齡、性別、種族）和臨床因素（例如疾病嚴重程度）導致不平衡的臨床數據採集中存在偏差，因此確保公平性至關重要。為了應對這些挑戰，我們引入了受最優控制理論啟發的感知混合專家 (dMoE)。我們對其底層機制進行了全面分析，並釐清了 dMoE 在適應醫學影像分割中的異質分佈中的作用。此外，我們將 dMoE 整合到多個網路架構中，展示了其在各種醫學影像分析任務中的廣泛適用性。通過納入人口統計和臨床因素，dMoE 在兩個 2D 基準數據集和一個 3D 內部數據集上實現了最先進的性能。我們的結果突出了 dMoE 在減輕不平衡分佈的偏差方面的有效性，為在公平性學習範例中橋接控制理論和醫學影像分割提供了一個有前景的方法。原始碼將會公開。
 
-##### **Medicine on the Edge: Comparative Performance Analysis of On-Device LLMs for Clinical Reasoning**
-2502.08954v1 by Leon Nissen, Philipp Zagar, Vishnu Ravi, Aydin Zahedivash, Lara Marie Reimer, Stephan Jonas, Oliver Aalami, Paul Schmiedmayer
+##### **Generating crossmodal gene expression from cancer histopathology improves multimodal AI predictions**
+2502.00568v3 by Samiran Dey, Christopher R. S. Banerji, Partha Basuchowdhuri, Sanjoy K. Saha, Deepak Parashar, Tapabrata Chakraborti
 
-The deployment of Large Language Models (LLM) on mobile devices offers
-significant potential for medical applications, enhancing privacy, security,
-and cost-efficiency by eliminating reliance on cloud-based services and keeping
-sensitive health data local. However, the performance and accuracy of on-device
-LLMs in real-world medical contexts remain underexplored. In this study, we
-benchmark publicly available on-device LLMs using the AMEGA dataset, evaluating
-accuracy, computational efficiency, and thermal limitation across various
-mobile devices. Our results indicate that compact general-purpose models like
-Phi-3 Mini achieve a strong balance between speed and accuracy, while medically
-fine-tuned models such as Med42 and Aloe attain the highest accuracy. Notably,
-deploying LLMs on older devices remains feasible, with memory constraints
-posing a greater challenge than raw processing power. Our study underscores the
-potential of on-device LLMs for healthcare while emphasizing the need for more
-efficient inference and models tailored to real-world clinical reasoning.
+Emerging research has highlighted that artificial intelligence based
+multimodal fusion of digital pathology and transcriptomic features can improve
+cancer diagnosis (grading/subtyping) and prognosis (survival risk) prediction.
+However, such direct fusion for joint decision is impractical in real clinical
+settings, where histopathology is still the gold standard for diagnosis and
+transcriptomic tests are rarely requested, at least in the public healthcare
+system. With our novel diffusion based crossmodal generative AI model PathGen,
+we show that genomic expressions synthesized from digital histopathology
+jointly predicts cancer grading and patient survival risk with high accuracy
+(state-of-the-art performance), certainty (through conformal coverage
+guarantee) and interpretability (through distributed attention maps). PathGen
+code is available for open use by the research community through GitHub at
+https://github.com/Samiran-Dey/PathGen.
 
-摘要：大型語言模型 (LLM) 在行動裝置上的部署為醫療應用程式提供了巨大的潛力，透過消除對雲端服務的依賴並將敏感的健康資料儲存在本地，進而提升隱私、安全性，並提高成本效益。然而，在實際的醫療環境中，裝置上 LLM 的效能和準確度仍未受到充分的探討。在此研究中，我們使用 AMEGA 資料集來評量公開可用的裝置上 LLM，並評估其在各種行動裝置上的準確度、運算效率和熱限制。我們的結果顯示，像 Phi-3 Mini 等精簡的一般用途模型在速度和準確度之間取得了良好的平衡，而經過醫學微調的模型，例如 Med42 和 Aloe，則達到了最高的準確度。值得注意的是，在較舊的裝置上部署 LLM 仍然可行，記憶體限制比原始處理能力構成更大的挑戰。我們的研究強調了裝置上 LLM 在醫療保健方面的潛力，同時強調了對更有效率的推理和針對實際臨床推理量身打造的模型的需求。
+摘要：新興研究強調，基於人工智慧的多模態融合數位病理學和轉錄組特徵，可以改善癌症診斷（分級/分型）和預後（存活風險）預測。
+然而，這種直接融合對於聯合決策在實際臨床環境中並不切實際，在實際臨床環境中，組織病理學仍然是診斷的黃金標準，而轉錄組檢測很少被要求，至少在公共醫療保健系統中是如此。透過我們新穎的基於擴散的跨模態生成式 AI 模型 PathGen，我們展示了從數位組織病理學合成的基因體表達共同預測癌症分級和患者存活風險，具有很高的準確度（最先進的效能）、確定性（透過共形覆蓋保證）和可解釋性（透過分佈式注意力圖）。PathGen 程式碼可透過 GitHub 上的 https://github.com/Samiran-Dey/PathGen 供研究社群公開使用。