Skip to content

Autonomous Agents (LLMs) research papers. Updated Daily.

License

Notifications You must be signed in to change notification settings

tmgthb/Autonomous-Agents

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 

Repository files navigation

Hits X GitHub Repo stars

Autonomous Agents

Autonomous Agents

Autonomous Agents-research papers. Updated daily. See as well the Resources-section.


Research papers

Chronological order.

11th April 2025

DocAgent: A Multi-Agent System for Automated Code Documentation Generation

  • DocAgent: introduces a multi-agent system for automated code documentation generation, which includes Navigator Module, Repository AST Parsing, Dependency DAG, Topological Traversal, Topological Sorting, Dependency-Aware Processing Order, Multi-Agent Documentation Generation, Reader, Searcher, Writer, Verifier, and Orchestrator.
  • DocAgent uses a Navigator Module to establish dependency-aware processing order and a Multi-Agent Documentation Generation module with specialized agents to collaboratively generate documentation.
  • The system aims to address challenges in automated code documentation by ensuring completeness, helpfulness, and truthfulness through topological processing and multi-agent collaboration.

SEAVIEW: Software Engineering Agent Visual Interface for Enhanced Workflow

  • SEAVIEW: introduces a visualization framework for software engineering agent experiments, comprising a web frontend for user interaction, a backend for data processing, PostgreSQL for structured data storage, object storage for large files, and external environment for running experiments.
  • SEAVIEW framework aims to assist researchers in debugging and improving software engineering agents by providing experiment health, comparison, summarization, and reporting capabilities.
  • The tool is designed to analyze agent trajectories and experiment results, offering insights into agent behavior and performance across different experimental setups and parameters.

TP-RAG: Benchmarking Retrieval-Augmented Large Language Model Agents for Spatiotemporal-Aware Travel Planning

  • TP-RAG (Travel Planning - Retrieval-Augmented Generation): introduces benchmark for retrieval-augmented spatiotemporal-aware travel planning with Inputs, Agent, Plan, and Evaluate components.
  • TP-RAG benchmark dataset includes real-world travel queries, fine-grain annotated Points of Interest, and high-quality travel trajectory references for context-aware planning.
  • TP-RAG benchmark facilitates evaluation of LLM agents in generating spatiotemporally coherent travel plans utilizing trajectory-level knowledge for improved travel practicality.

Voice Interaction With Conversational AI Could Facilitate Thoughtful Reflection and Substantive Revision in Writing

  • LLM-powered Conversational Agent for Writing Reflection: introduces a system designed with LLM-powered Conversational Agent, Voice Input, Written Output, Feedback, Questions, Advice, and UI Affordances to investigate voice interaction for writing reflection.
  • This system emphasizes Contextualization and Control to improve user experience and maintain writer's ownership during revision process.
  • The research aims to evaluate how voice input modality affects reflection depth and revision quality compared to text input when using conversational agents.

Do LLMs trust AI regulation? Emerging behaviour of game-theoretic LLM agents

  • FAIRGAME (Framework for AI Agents Bias Recognition using Game Theory): introduces user, developer, and regulator components to model regulatory ecosystem.
  • Framework uses evolutionary game theory and LLMs to investigate strategic choices under different regulatory scenarios.
  • FAIRGAME aims to identify emerging behaviors of strategic AI agents in game-theoretic settings and compare them with game-theoretic predictions.

MOOSEAGENT: A LLM BASED MULTI-AGENT FRAMEWORK FOR AUTOMATING MOOSE SIMULATION

  • MooseAgent: introduces an automated framework for MOOSE simulation, integrating Requirement, Alignment, Architect, Vector knowledge base, Error Correction, and Runner components.
  • MooseAgent framework uses LLMs to understand user needs, generate MOOSE input files, and iteratively refine them using a vector database and error correction.
  • This multi-agent system aims to simplify finite element simulation by automating pre-processing, solver configuration, and post-processing stages in MOOSE.

Task Memory Engine (TME): Enhancing State Awareness for Multi-Step LLM Agent Tasks

  • Task Memory Engine (TME): introduces a memory framework for LLM agents, with Task Memory Tree (hierarchical task state representation), Task Relationship Inference Module (reasons about task relationships), and Prompt Synthesizer (generates context-aware prompts).
  • TME enhances state awareness by tracking task execution using Task Memory Tree, inferring task relationships with Task Relationship Inference Module, and generating adaptive prompts with Prompt Synthesizer.
  • This framework enables robust, interpretable, and token-efficient execution of complex multi-step tasks by providing structured memory and intelligent prompt construction.

Adopting Large Language Models to Automated System Integration

  • Compositio Prompto (Compositio Prompto): introduces an architecture employing Large Language Models for automated service composition, utilizing task specifications, service documentation, input/output schemas to create a prompt for the LLM, which then generates executable service compositions.
  • The architecture aims to mitigate complex formal modeling in service composition by using natural language input and OpenAPI specifications, focusing on generating reusable service compositions as program code.
  • Compositio Prompto architecture is evaluated for service composition and discovery using Retrieval Augmented Generation (RAG) and benchmarks like RestBench and SOCBench-D to address limitations of input token length and improve service discovery in automated system integration.

Beyond Self-Reports: Multi-Observer Agents for Personality Assessment in Large Language Models

  • Multi-Observer LLM Personality Assessment Framework: introduces a novel method for evaluating LLM personality by utilizing multiple observer agents, simulating interactive scenarios, and aggregating observer reports for robust assessment.
  • This framework incorporates agent configuration to define agent profiles and relationships, interactive scenario simulation to generate dialogues, and personality reports to collect self- and observer- assessments.
  • By aggregating multiple observer reports, the framework aims to reduce individual biases and achieve a more context-sensitive and reliable personality evaluation of LLMs compared to self-report methods.

Evaluating the Bias in LLMs for Surveying Opinion and Decision Making in Healthcare

  • LLM-based Survey Simulation Framework: introduces a framework for evaluating LLMs in healthcare decision-making, with Survey Dataset, Demographics features, Prompt Construction Module, General prompt, Prompt with context, Prompts, LLM models, and Generated Vaccination decision.
  • This framework compares LLM-generated vaccination decisions with real-world survey data to assess alignment and biases across demographic groups.
  • The framework helps understand LLMs' capabilities and limitations in simulating healthcare behaviors and decision-making under different pandemic contexts.

10th April 2025

Orchestrating Agents and Data for Enterprise: A Blueprint Architecture for Compound AI

  • Blueprint Architecture: introduces a blueprint for compound AI systems, with Agent (maps models and APIs), Agent Registry (metadata store for agents), Task Planner (creates agentic workflows), Task Coordinator (coordinates workflow execution), Budget (records QoS stats), Data Registry (metadata store for data), Data Planner (generates query plans), Optimizer (performs multi-objective optimization), Streams (facilitate data and control flow), and Session (provides context for agents).
  • Blueprint Architecture focuses on orchestrating agents and data using streams to manage data and instructions flow, aiming for seamless integration and optimized workflows in enterprise AI applications.
  • The architecture emphasizes key components like registries for agents and data, planners for tasks and data queries, and coordinators for execution, all designed to enhance observability, controllability, and optimization in compound AI systems.

Test Amplification for REST APIs via Single and Multi-Agent LLM Systems

  • Agentic LLM systems: introduces single-agent approach with OpenAPI Retriever and Local Executor components for REST API test amplification.
  • Agentic LLM systems: also introduces multi-agent approach with specialized agents like Header-, Parameter-, Value-, Planner-, Writer-, Executor- and Repair-agents to improve test generation.
  • Agentic LLM systems: demonstrates that multi-agent system achieves higher API coverage and bug detection compared to single-agent system, but with increased computational cost.

Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge](http://arxiv.org/abs/2504.07887v1)

  • CLEAR-Bias (Corpus for Linguistic Evaluation of Adversarial Robustness against Bias): introduces a scalable benchmarking framework to evaluate LLM robustness against adversarial bias elicitation, with CLEAR-Bias-dataset, jailbreak prompts, base prompts, control set, judge selection, candidate LLMs, collect judgments, evaluate agreement, selected judge, two-step safety evaluation, initial assessment with base prompts, compute bias-specific safety score, adversarial analysis with jailbreak prompts, overall LLM safety score, and LLM vulnerability analysis.
  • The framework employs LLM-as-a-Judge paradigm for automated assessment, utilizing a two-step safety evaluation process involving initial assessment with base prompts and subsequent adversarial analysis with jailbreak techniques.
  • The methodology aims to systematically probe models across sociocultural dimensions, quantify robustness through safety scores, and investigate vulnerabilities in safety mechanisms, ultimately revealing critical trade-offs between model size and safety.

An LLM-Driven Multi-Agent Debate System for Mendelian Diseases

  • MD2GPS (Medical Doctor 2 GPS): introduces LLM-driven multi-agent debate system, with Data Agent, Knowledge Agent, and Debate Agent, for Mendelian diseases diagnosis.
  • MD2GPS system utilizes Data Agent to process genetic variants and phenotypes, Knowledge Agent with GPT-4 for gene analysis, and Debate Agent to integrate and refine diagnostic outcomes.
  • The multi-agent debate framework of MD2GPS enhances diagnostic accuracy and interpretability by leveraging diverse perspectives and evidence consistency evaluation.

Deceptive Automated Interpretability: Language Models Co-ordinating to Fool Oversight Systems

  • SAEs (Sparse Autoencoders): introduces framework with Labeling Agent, Simulating Agent, Overseer, Monitoring, Visible Communication, and Hidden Communication to investigate deceptive interpretability in language models.
  • The framework uses Labeling Agent to create feature labels, Simulating Agent to predict activations, and Overseer to detect deceptive labels, with agents communicating visibly and hiddenly.
  • This setup explores how language models can coordinate to deceive oversight systems by employing steganography for hidden communication and generating deceptive explanations.

MOSAIC: Modeling Social AI for Content Dissemination and Regulation in Multi-Agent Simulations

  • MOSAIC (Modeling Social AI for Content Dissemination and Regulation in Multi-Agent Simulations): introduces a multi-agent social simulation framework, with Human Persona Survey, Persona Generation, Agent Network, Agent Memory, Reflection, Interaction, BEFORE Action, Comment AFTER Action, Agent Daily News, Fact-Checking Types, Community Notes, Third Party Fact-Checking, and Hybrid Fact-Checking, for modeling content diffusion, user engagement, and misinformation propagation in social networks.
  • MOSAIC framework utilizes LLM-powered agents with memory and reflection capabilities to simulate realistic social behaviors and evaluate content moderation strategies like community-based, third-party, and hybrid fact-checking.
  • The framework allows for analyzing the effectiveness of different fact-checking mechanisms in mitigating misinformation spread while preserving user engagement in simulated social media environments.

Synthesizing High-Quality Programming Tasks with LLM-based Expert and Student Agents

  • PYTASKSYN introduces a novel synthesis technique for generating programming tasks, which includes Generation (task creation stage) and Validation (task quality check stage) stages, performed by SIMEXPERT (expert agent for task generation), SIMTUTOR (tutor agent for test suite and context validation), and SIMSTUDENT (student agent for comprehensibility validation).
  • PYTASKSYN employs a multi-agent approach with specialized roles, where SIMEXPERT generates Task Description (task explanation) and Test suite (code verification tests), while SIMTUTOR and SIMSTUDENT assess Context relevance (theme and concepts alignment) and Comprehensibility (task clarity).
  • PYTASKSYN aims to improve the quality of AI-generated programming tasks by automating validation through simulated agents, ensuring tasks are relevant, correct, and comprehensible for students, thus reducing the need for human intervention.

Boosting Universal LLM Reward Design through Heuristic Reward Observation Space Evolution

  • ROS Evolution Framework: introduces heuristic reward observation space evolution for LLM-driven reward design, incorporating user description structuring, LLM for reconciliation, LLM for reward design, state history memory, performance summarization, reward space mapping, simulation environment, state usage tracker, relevant state space, state selection, and internal operation.
  • ROS Evolution Framework utilizes State Execution Table to track historical state usage and success contributions, overcoming Markovian constraint in LLM dialogues for effective exploration.
  • ROS Evolution Framework reconciles user-provided task descriptions with expert-defined success criteria using structured prompts, ensuring alignment in reward design objectives and improving reward generation stability.

A taxonomy of epistemic injustice in the context of AI and the case for generative hermeneutical erasure

  • Taxonomy of epistemic injustice in AI: introduces a novel taxonomy of epistemic injustice in the context of Artificial Intelligence, focusing on generative AI and detailing Generative Hermeneutical Ignorance, Generative Hermeneutical Access, Generative Manipulative Testimonial Injustice, Generative Amplified Testimonial Injustice, and Generative Conceptual Erasure.
  • This taxonomy explores how AI systems can perpetuate and amplify epistemic injustices, particularly through generative models, by misrepresenting marginalized experiences, obstructing information access, spreading disinformation, amplifying existing biases, and ultimately eroding diverse epistemological frameworks.
  • The paper highlights the concept of Generative Hermeneutical Erasure as a novel form of epistemic injustice, emphasizing the risk of AI-driven erosion of non-Western epistemologies and the importance of decolonial AI approaches to mitigate these harms.

Kimi-VL Technical Report

  • Kimi-VL (Vision Language Model): introduces MoonViT, MLP Projector, and MoE Language Decoder for efficient multimodal reasoning and long-context understanding.
  • Kimi-VL utilizes MoonViT for native-resolution image processing, MLP Projector to align visual features, and MoE Language Decoder for parameter-efficient language generation.
  • Kimi-VL-Thinking, an advanced variant, enhances long-horizon reasoning through long chain-of-thought and reinforcement learning, building upon Kimi-VL's architecture.

Enhanced Question-Answering for Skill-based learning using Knowledge-based AI and Generative AI

  • Ivy (intelligent agent): introduces an architecture for skill-based learning question answering, with Classify Answerability, Knowledge Retrieval Module, TMK Knowledge Base, Response Generation Module, and Response Optimizer Module components.
  • Ivy leverages TMK (Task-Method-Knowledge) models to represent skills and Generative AI to enhance explanations for learners' questions in online AI courses.
  • The framework aims to provide deeper, more relevant feedback compared to agents relying on unstructured text, improving learners' understanding of procedural knowledge and reasoning in skill-based learning.

Achilles Heel of Distributed Multi-Agent Systems

  • DMAS (Distributed Multi-Agent System): introduces distributed architecture with Control System managing third-party Agents through API Interfaces and receiving Responses.
  • DMAS framework addresses challenges of heterogeneity, scalability, and computational constraints in multi-agent systems by utilizing remotely hosted agents.
  • The distributed nature of DMAS raises trustworthiness concerns, including free riding, malicious attacks, communication delays and unstable connections, which are systematically analyzed in the paper.

Beyond LLMs: A Linguistic Approach to Causal Graph Generation from Narrative Texts

  • DA framework (Causal Graph Generation Framework): introduces a novel method for generating causal graphs from narrative texts, incorporating Vertices Extraction, Expert Index Extraction, STAC Categorization, and Diagram Formulation components.
  • This framework leverages linguistic feature extraction and a quaternary classification system (STAC) to enhance the precision and interpretability of causal link identification compared to LLM-only approaches.
  • The system employs a hybrid model combining ROBERTa embeddings with an Expert Index of linguistic features, followed by a structured prompting process for refining and constructing the final causal graph.

Enhancing Player Enjoyment with a Two-Tier DRL and LLM-Based Agent System for Fighting Games

  • TTA (Two-Tier Agent): introduces a two-tier system with DRL game-playing agents tier, utilizing a network architecture with CNN and RNN feature extractors and actor-critic networks, and Hyper-agent tier, employing a LLM Hyper Agent for dynamic opponent selection based on player data and feedback.
  • The DRL game-playing agents tier consists of Input (game pixels, scalar info, action sequence)/Features Extractor (CNN, LSTM)/Agent's Network (Actor Net, Critic Net)/Value (value function)/Output (action distribution), while the Hyper-agent tier includes Agent Archive (DRL agent storage)/LLM Hyper Agent (opponent selector)/Game Manager (data and game management)/Player's Feedback (human input)/Playing Data (game history).
  • TTA aims to enhance player enjoyment in fighting games by providing diverse and adaptive AI opponents, leveraging DRL for agent skill and LLMs for personalized opponent selection, demonstrating improvements in advanced skill execution and player satisfaction.

AGENTADA: Skill-Adaptive Data Analytics for Tailored Insight Discovery

  • AGENTADA (skill-informed data analytics agent): introduces dataset-to-insight extraction strategy with Question Generation, RAG-Based Skill Matcher, Code Generation, Answer Generation, and Insight Generation components.
  • AGENTADA leverages hybrid Retrieval-Augmented Generation (RAG)-based skill matcher to choose best data analytics skill from skill library.
  • AGENTADA is evaluated using KAGGLEBENCH benchmark and SCORER evaluation framework, demonstrating improved performance over existing tools.

Automating quantum feature map design via large language models

  • Agentic System: introduces an autonomous system for quantum feature map design, incorporating Human input, LLM, Generation, Storage, Validation, Evaluation, and Review components.
  • Agentic System iteratively refines quantum feature maps through Feedback from The Experimental Result, utilizing LLM for idea Generation and external knowledge in Storage for Validation and Review.
  • The framework leverages components like Storage for academic papers and PennyLane library documentation, and Evaluation for performance assessment, to automate quantum feature map research workflow.

TALE: A Tool-Augmented Framework for Reference-Free Evaluation of Large Language Models

  • TALE (Tool-Augmented LLM Evaluation): introduces a reference-free framework for evaluating LLM responses, with Query Generation, Web Search, Evidence Summarizer, Reflector, Query Refiner, Judge, and Short-Term Memory components.
  • TALE iteratively refines web queries, collects and summarizes external information, and reflects on findings to evaluate LLM outputs without relying on pre-annotated references.
  • The framework enhances the reliability of LLM evaluations in dynamic real-world scenarios by grounding judgments in external, verifiable evidence through tool-augmented approach.

Throughput-Optimal Scheduling Algorithms for LLM Inference and AI Agents

  • This paper introduces a queuing-theoretic framework for LLM inference scheduling, encompassing Batch Processing, Prefill, Decode, Processed, Processing stages, and extends to AI agent workloads with Orchestrator, Agents, Tools, Global History, LLM Serving, Load Balancer, LLM Engine, Scheduler, KV Cache, and LLM components.
  • The framework analyzes throughput optimality of work-conserving scheduling algorithms for both individual LLM requests and complex AI-agent systems, highlighting the importance of token budget and batching strategies for efficient LLM inference.
  • Evaluations using real-world systems like Orca and Sarathi-serve demonstrate throughput optimality, while FasterTransformer and vanilla vLLM are shown to be potentially suboptimal under certain workloads, emphasizing the practical implications of queuing theory in LLM system design.

Modeling Response Consistency in Multi-Agent LLM Systems: A Comparative Analysis of Shared and Separate Context Approaches

  • RCI (Response Consistency Index): introduces a probabilistic framework for analyzing shared and separate context configurations in multi-agent LLM systems, focusing on centralized memory, distributed memory, context retention duration, incorrect statements, accurate statements, consistency evaluation, and latency measurement.
  • The framework evaluates the impact of memory limitations and noise on response consistency and response time in LLM-based MAS.
  • RCI metric quantifies the trade-offs between scalability, response consistency, and performance in different context configurations.

9th April 2025

REVIEW OF CASE-BASED REASONING FOR LLM AGENTS: THEORETICAL FOUNDATIONS, ARCHITECTURAL COMPONENTS, AND COGNITIVE INTEGRATION

  • CBR-GDA (Case-Based Reasoning - Goal-Driven Autonomy): introduces a framework integrating Case-Based Reasoning with Goal-Driven Autonomy to enhance LLM agents by incorporating Case Representation and Indexing Strategies, Hybrid Retrieval Mechanisms, Adaptation Mechanisms, LLM Reasoning Processes Integration, Cognitive Dimensions Integration, Planning Case Base and Mismatch-Goal Case Base.
  • This framework leverages CBR for persistent memory and structured reasoning, while utilizing LLMs for language understanding, aiming to improve reasoning transparency, domain adaptation, and solution quality in complex problem-solving scenarios.
  • The CBR-GDA framework facilitates continuous learning and adaptation through case acquisition and refinement, enabling agents to dynamically adjust objectives and improve goal reasoning capabilities in dynamic environments.

REVIEW OF CASE-BASED REASONING FOR LLM AGENTS: THEORETICAL FOUNDATIONS, ARCHITECTURAL COMPONENTS, AND COGNITIVE INTEGRATION

  • CBR-GDA Framework: introduces architecture for CBR-enhanced LLM agents, with Case Representation and Indexing, Hybrid Retrieval Mechanisms, Adaptation Mechanisms, LLM Reasoning Processes Integration, Cognitive Dimensions, Goal-Driven Autonomy, Planning Case Base, and Mismatch-Goal Case Base.
  • This framework integrates Case-Based Reasoning and Goal-Driven Autonomy to enhance LLM agents' reasoning, adaptability, and transparency by leveraging past experiences and dynamic goal adjustment.
  • The architecture utilizes two case bases, Planning Case Base and Mismatch-Goal Case Base, to manage planning and goal reformulation based on discrepancies between expected and actual outcomes.

FamilyTool: A Multi-hop Personalized Tool Use Benchmark

  • KGETool: introduces KG-augmented LLM tool use pipeline, with Query, Full KG, Tools, LLM for KG, KG Extraction, Relation Path, Path Extraction, Sub KG, LLM for Tool Use and Tool Call, to evaluate LLMs in personalized multi-hop tool use scenarios.
  • KGETool framework extracts sub-KG from Full KG using KG Extraction module composed of Relation Path and Path Extraction, then utilizes Sub KG and Tools with LLM for Tool Use to generate Tool Call based on user Query.
  • The pipeline emphasizes generalization in inductive KG settings, where KGETool leverages LLMs' ability to handle evolving knowledge graphs without retraining by dynamically adapting to unseen user preferences and relationships.

AgentFM: Role-Aware Failure Management for Distributed Databases with LLM-Driven Multi-Agents

  • AgentFM (Role-Aware Failure Management Framework): introduces a role-aware failure management framework for distributed databases, with Meta-Agent (Orchestrates agents), Task Agents (Manage failure tasks), Data Agents (Handle data sources), System Agents (Represent node roles), and Standalone Agents (Agents on each node) components.
  • AgentFM leverages LLM-driven multi-agents to address failure management by considering system roles, data roles, and task roles, using a Meta-Agent (Orchestrates agents) for orchestration and specialized Task Agents (Manage failure tasks) like Detection Agent (Identifies anomalies), Diagnosis Agent (Classifies issues), and Mitigation Agent (Proposes solutions).
  • AgentFM integrates multimodal data sources through Data Agents (Handle data sources) such as Metric Agent (Metrics data extraction) and Log Agent (Logs data extraction), employing specialized System Agents (Represent node roles) like Config Agent (Configuration management), Coordinator Agent (Coordination management), and Storage Agent (Storage management) to enhance failure management in distributed databases.

Right Prediction, Wrong Reasoning: Uncovering LLM Misalignment in RA Disease Diagnosis

  • Framework for RA patients diagnosis: introduces a system employing PreRAID dataset, Texts, Embeddings, Vector DB, Knowledge Base, Medical Expert Guided Prompt, LLM, RAG, Prompt, Output, Prediction, and Reasoning to investigate LLM's diagnostic capabilities and reasoning for Rheumatoid Arthritis.
  • This framework utilizes patient symptom Texts converted to Embeddings and stored in Vector DB, leveraging Knowledge Base and Medical Expert Guided Prompt for LLM with RAG to generate Output, Prediction of RA, and Reasoning.
  • The framework explores different architectures with varying numbers of LLM agents and knowledge base integration to assess diagnostic accuracy and reasoning quality in RA disease prediction.

NEEDLEINATABLE: Exploring Long-Context Capability of Large Language Models towards Long-Structured Tables

  • NEEDLEINATABLE (NIAT): introduces NIAT benchmark and data synthesis method to evaluate and improve large language models on long-structured tables.
  • NIAT benchmark assesses large language models' ability to extract specific cells from long tables using location-based and question-based queries.
  • Data synthesis method uses chain-of-thought reasoning to generate training data for enhancing large language models' long-table comprehension.

8th April 2025

FEABench: Evaluating Language Models on Multiphysics Reasoning Ability

  • FEABench: introduces benchmark for evaluating LLMs and LLM agents in multiphysics reasoning, using ControllerAgent, Evaluator, CorrectorSubAgent, and ToolLookupAgent components to solve engineering problems with FEA software.
  • FEABench framework employs multi-agent system with specialized tools and feedback mechanisms to enhance LLMs' ability to generate executable code for COMSOL Multiphysics API.
  • FEABench benchmark and agentic framework aim to advance automation in engineering by augmenting LLMs with numerical solvers and physics reasoning capabilities.

CAI: An Open, Bug Bounty-Ready Cybersecurity AI

  • CAI (Cybersecurity AI): introduces an open-source framework for democratizing security testing, with HITL, Turns, Patterns, Handoffs, Agents, Tools, Extensions, and Tracing components.
  • CAI framework combines modular agent design, seamless tool integration, and human oversight for AI-powered bug bounty testing.
  • CAI aims to dismantle the lock-in of dominant platforms, offering a democratized alternative for vulnerability discovery.

AGENT GUIDE: A SIMPLE AGENT BEHAVIORAL WATERMARKING FRAMEWORK

  • Agent Guide: introduces a behavioral watermarking framework for intelligent agents, with Memory Module, Event Generation Module, Behavior Probability Generation Module, Agent Guide Module, and Action Execution Module.
  • Agent Guide embeds watermarks by biasing agent's high-level behavior decisions while preserving the naturalness of specific action executions.
  • The framework operates in rounds, simulating agent interactions and uses statistical analysis for watermark extraction, ensuring reliable detection.

Are Generative AI Agents Effective Personalized Financial Advisors?

  • LLM-advisor: introduces User, Advisor, Preference Elicitation Stage, and Advisory Discussion Stage to provide personalized financial advice.
  • The framework uses Preference Elicitation Stage to understand user needs before offering asset guidance in Advisory Discussion Stage.
  • This approach aims to evaluate the effectiveness of LLM-based agents in complex financial advisory tasks.

Single-Agent vs. Multi-Agent LLM Strategies for Automated Student Reflection Assessment

  • Single-Agent Assessment: introduces single LLM evaluator, Scoring Criteria, and LLM, where single LLM evaluates student reflection using score level descriptions.
  • Single-Agent Assessment employs zero-shot and few-shot prompting to guide LLM's evaluation process based on scoring criteria for reflection assessment.
  • This approach automates student reflection assessment by transforming qualitative responses into quantitative scores using a single LLM evaluator.

Automated Archival Descriptions with Federated Intelligence of LLMs

  • Agentic AI-driven system: introduces an agentic AI-based metadata generation system, with User Input and Document (Provides archival material), Context Agent (Retrieves context information), LLM Instructor (Constructs instructions for LLMs), LLM Ensemble (Generates metadata descriptions), Validator Agent (Checks metadata descriptions), and LLM Federator (Synthesizes optimal metadata) to produce Metadata (Final metadata output) for archival descriptions.
  • The system employs federated intelligence of multiple LLMs to automatically create complete and precise metadata descriptions, leveraging context and validation agents for consistency and quality.
  • The federated optimization approach synthesizes metadata from an ensemble of LLMs, demonstrating superior performance compared to single-model solutions in metadata quality and reliability for archival materials.

FactGuard: Leveraging Multi-Agent Systems to Generate Answerable and Unanswerable Questions for Enhanced Long-Context LLM Extraction

  • FactGuard (Leveraging Multi-Agent Systems to Generate Answerable and Unanswerable Questions for Enhanced Long-Context LLM Extraction): introduces a multi-agent framework for automated data augmentation, with Preparation-, QA Generation-, and Negative Example Generation-Stages, to create answerable and unanswerable question-answer pairs.
  • FactGuard (Leveraging Multi-Agent Systems to Generate Answerable and Unanswerable Questions for Enhanced Long-Context LLM Extraction): employs agents like Quality-, Topic-, QA-, MRC-, and Rewrite-Agents, managed by Agent Console, to synthesize datasets for evaluating LLMs in long-context question answering.
  • FactGuard (Leveraging Multi-Agent Systems to Generate Answerable and Unanswerable Questions for Enhanced Long-Context LLM Extraction): aims to address limitations of current LLMs in handling unanswerable questions within extended contexts by developing the FactGuard-Bench benchmark dataset.

7th April 2025

Mixture-of-Personas Language Models for Population Simulation

  • MoP (Mixture of Personas): introduces a probabilistic prompting framework, with Persona Synthesizer, Persona Gate, Exemplar Gate, Exemplar, and LLM Agent, that aligns LLM responses to target population characteristics.
  • MoP framework uses Persona Gate to probabilistically select personas and Exemplar Gate to select exemplars, guiding LLM Agent to generate customized outputs.
  • This approach enhances response diversity and relevance by incorporating persona descriptions and in-context examples without requiring model fine-tuning.

How to evaluate control measures for LLM agents? A trajectory from today to superintelligence

  • AI control: introduces a framework for adapting red teams affordances using capability profile, deployment context, threat models, threat-model-specific capabilities, example rules of control evaluation, example safety measures, and example safety case to systematically evaluate and improve AI safety measures as AI capabilities advance.
  • The framework defines AI Control Levels (ACLs) based on threat model-specific capabilities, providing tailored control evaluation rules, measures, and safety cases for fictional models with increasing capabilities, aiming for practical and cost-effective control measures.
  • This approach contrasts with traditional methods by considering model capability limitations in control evaluations, suggesting a path towards scalable risk management and highlighting the evolving nature of AI control safety cases from current models to superintelligent systems.

DoCIA: An Online Document-Level Context Incorporation Agent for Speech Translation

  • DoCIA (Document-level Context Incorporation Agent): introduces an online framework for speech translation that incorporates document-level context through ASR Refining, MT and MT Refining stages to enhance translation performance.
  • DoCIA framework refines both ASR transcriptions and machine translations using auxiliary LLM-based modules and multi-level context integration strategy to improve discourse coherence.
  • The framework employs a refinement determination mechanism to ensure reliability by preventing hallucinations during context-aware refinement stages in speech translation pipeline.

AI for Climate Finance: Agentic Retrieval and Multi-Step Reasoning for Early Warning System Investments

  • EW4All Financial Tracking AI-Assistant (Early Warning for All Financial Tracking AI-Assistant): introduces an AI-driven system for automating Early Warning System investment classification from multilateral development bank reports, utilizing PDF parsing, context augmentation, vector database storage/retrieval, classification/budget allocation, and expert verification.
  • The framework employs multi-modal processing and agent-based retrieval-augmented generation to handle heterogeneous financial documents and improve accuracy in tracking climate finance investments.
  • This AI-assistant aims to enhance financial transparency and decision-making in climate finance by providing structured insights into investment data and supporting resource allocation for climate resilience initiatives.

Debate Only When Necessary: Adaptive Multiagent Collaboration for Efficient LLM Reasoning

  • DOWN (Debate Only When Necessary): introduces adaptive multiagent debate framework, with Initial Response Generation (Agent creates initial answer), Debate Engagement Check (Checks confidence score against threshold), Confidence-Guided Multi-agent Collaboration (Agents refine responses in rounds), Final Answer Generation (Selects final answer via voting or judge).
  • DOWN framework uses Confidence Score (Model's certainty in answer) and Threshold (Confidence score limit for debate) to selectively activate debate among Agents (LLMs collaborating in debate) in Rounds (Iterative debate exchanges) for efficient reasoning.
  • DOWN framework determines final answer via Voting-based Selection (Majority vote for final answer) or Judge-based Generation (Judge agent generates final answer), optimizing multiagent collaboration systems.

The Dream Within Huang Long Cave: AI-Driven Interactive Narrative for Family Storytelling and Emotional Reflection

  • The Dream Within Huang Long Cave: introduces AI-driven interactive narrative art project, employing Analytic-Critical Method for character design, featuring LLM agent (YELL), CAVE environment, interactive narrative, MacGuffin, memory fragments and family documentary.
  • This project utilizes Analytic-Critical Method's psychobiography data, discourse analysis, paranoiac-critical method and practice-based iteration to construct LLM agent YELL within CAVE environment for family storytelling and emotional reflection.
  • The interactive narrative in CAVE installation uses MacGuffin and memory fragments to engage audience in dialogue with AI-driven virtual father figure, culminating in family documentary to deconstruct familial relationships and symbolic authority.

Simulating Persuasive Dialogues on Meat Reduction with Generative Agents

  • Generative Agent-based Persuasion Dialogue Framework: introduces a simulation framework for persuasive dialogues using Persuader Agent, Recipient Agent, Recipient Persona, Internal Reflection, Questionnaire, Response Generation, and Conversation Transcript to explore meat reduction strategies.
  • This framework utilizes generative agents to model persuasive conversations and validate them against psychological theory and human data, aiming to identify effective meat reduction strategies.
  • The use of generative agents allows for cost-effective and scalable exploration of diverse persuasion strategies and participant groups, facilitating the development of targeted interventions for meat reduction.

BIASINSPECTOR: Detecting Bias in Structured Data through LLM Agents

  • BIASINSPECTOR (Bias Inspector): introduces a multi-agent framework with Primary Agent, Advisor Agent, Toolset, and Bias Detection Method Library for automated bias detection in structured data based on user requirements.
  • BIASINSPECTOR employs Primary Agent to formulate plans and execute tools, while Advisor Agent provides guidance and optimization, leveraging Toolset and Bias Detection Method Library for comprehensive bias analysis.
  • The framework facilitates iterative interactions and delivers detailed reports with explanations and visualizations, addressing the limitations of existing methods in diversity, generalizability, and interpretability of bias detection in structured data.

ELT-Bench: An End-to-End Benchmark for Evaluating AI Agents on ELT Pipelines

  • ELT-Bench: introduces an end-to-end benchmark for evaluating AI agents in building ELT pipelines, encompassing AI Agent, configuration codes, scripts, SQL queries, codebase, environment, data sources, data warehouse, Airbyte, DBT, and pipeline stages.
  • ELT-Bench framework assesses AI agents' capability to construct ELT pipelines from scratch, involving data extraction and loading (Stage 1) and data transformation (Stage 2) using tools like Airbyte and DBT.
  • This benchmark addresses the gap in evaluating AI agents for end-to-end ELT pipeline generation, providing a comprehensive assessment of AI in complex data engineering workflows.

SciSciGPT: Advancing Human-AI Collaboration in the Science of Science

  • SciSciGPT (Science of Science GPT): introduces a modular AI system designed as research collaborator, which includes User interacting via Web Interface, Research Manager orchestrating tasks, Literature Specialist for literature analysis, Database Specialist for data handling, Analytics Specialist for data analytics, and Evaluation Specialist for quality control, utilizing SciSciCorpus, SciSciNet and Sandbox Environment with various Tools.
  • SciSciGPT employs a hierarchical multi-agent architecture to automate complex research workflows, enhance research efficiency, and facilitate human-AI collaboration in the science of science domain.
  • The system's modular design with specialist agents and a central Research Manager allows for flexible task decomposition, iterative refinement, and comprehensive quality assessment throughout the research process.

Bridging Industrial Expertise and XR with LLM-Powered Conversational Agents

  • RAG-enhanced LLMs with XR Integration (Retrieval-Augmented Generation enhanced Large Language Models with Extended Reality Integration): introduces a system embedding industrial knowledge into XR environments, featuring XR Application, Middleware, LLM Chat Engine, Document Processing, VECTOR DB, XR SYSTEM, and LLM ENGINE components.
  • This framework utilizes a LLM Chat Engine with components like Router Agent, RAG Tools, and specialized agents such as PdM, XAI, and IoT Agents, to provide context-aware expert guidance through voice-driven XR interfaces.
  • The system enhances industrial workflows by integrating RAG techniques and XR, enabling hands-free access to domain-specific knowledge and improving training, remote assistance, and operational efficiency in Industry 5.0 settings.

EduPlanner: LLM-Based Multi-Agent Systems for Customized and Intelligent Instructional Design

  • EduPlanner (LLM-Based Multi-Agent System): introduces multi-agent system with evaluator-, optimizer- and analyst-agents, and Skill-Tree component.
  • EduPlanner employs Skill-Tree (models student knowledge background) to personalize instructional design and uses evaluator-agent (assesses design quality) and optimizer-agent (improves lesson content) for iterative optimization.
  • Analyst-agent (identifies error-prone examples) further enhances EduPlanner by incorporating error analysis into lesson plan refinement, and Lesson Plan Queue (prioritizes effective designs) manages design iterations.

Prism: Dynamic and Flexible Benchmarking of LLMs Code Generation with Monte Carlo Tree Search

  • Prism (Dynamic and Flexible Benchmarking of LLMs Code Generation with Monte Carlo Tree Search): introduces a dynamic benchmarking framework for LLM code generation assessment, incorporating tree-based state representation (models evaluation as MDP), Monte Carlo Tree Search (algorithm for exploration), and multi-agent evaluation pipeline (simultaneous assessment of capabilities).
  • Prism framework utilizes Markov Decision Process to model evaluation states and Monte Carlo Tree Search algorithm for adaptive exploration of evaluation scenarios.
  • The framework employs a multi-agent system with Problem Generator, Solution Evaluator, and Pattern Analyzer agents to enable comprehensive and structured LLM evaluation.

Weak-for-Strong: Training Weak Meta-Agent to Harness Strong Executors

  • W4S (Weak-for-Strong Harnessing): introduces a novel framework that trains a weak meta-agent to iteratively optimize workflows for harnessing strong language models through Workflow Generation, Execution and Feedback, and Refinement within an Environment, utilizing RLAO for meta-agent training.
  • The framework formulates workflow design as a Markov Decision Process, enabling the meta-agent to learn effective workflow strategies by interacting with the environment and receiving performance feedback, thus improving performance of strong models.
  • W4S offers an efficient and high-performing alternative to direct fine-tuning of strong models, demonstrating strong generalization capabilities across various tasks and outperforming existing methods in workflow optimization.

BEYOND SINGLE-TURN: A SURVEY ON MULTI-TURN INTERACTIONS WITH LARGE LANGUAGE MODELS

  • Taxonomy of Improvements Methodologies in Multi-turn LLM Interactions: introduces a structured categorization of methods to enhance multi-turn interactions in Large Language Models, encompassing model-centric, external integration, and agent-based approaches.
  • Model-centric improvements directly refine LLMs, external integration leverages external knowledge, and agent-based methods employ proactive agents for complex dialogues.
  • This taxonomy covers techniques like in-context learning, fine-tuning, reinforcement learning, memory augmentation, Retrieval Augmented Generation (RAG), and multi-agent systems, providing a comprehensive overview of advancements in conversational AI.

Generalising from Self-Produced Data: Model Training Beyond Human Constraints

  • Generalising Agent Framework: introduces a system with interdependent AI agents designed for autonomous knowledge generation through environment interaction and self-improvement, comprising code generation, testing, training, environment understanding, strategy formulation, and safety infrastructure components.
  • The framework utilizes a closed-loop process where an Environment Module gathers data, a Strategy Module plans actions, a Code Generation Module implements strategies, and Testing and Training Modules refine the system based on empirical results, aiming for continuous learning and adaptation.
  • Key components ensure robustness and safety through code validation, resource monitoring, and controlled execution, facilitating the development of artificial superintelligence by overcoming limitations of human-derived data and enabling autonomous discovery and verification of knowledge.

scAgent: Universal Single-Cell Annotation via a LLM Agent

  • scAgent (Universal Single-Cell Annotation Agent): introduces a universal cell annotation framework, with Planning Module, Action Space, and Memory Module, for annotating single-cell RNA sequencing data.
  • scAgent leverages a Planning Module to formulate plans, an Action Space with scRNA models and MoE-LORA plugins, and a Memory Module for knowledge management, enabling universal cell type annotation and novel cell discovery.
  • The framework's modular design with an extensible Action Space and dynamic Memory Module facilitates cross-tissue generalization, novel cell type extension, and efficient incremental learning for single-cell data analysis.

Autono: A ReAct-Based Highly Robust Autonomous Agent Framework

  • Autono (Robust Autonomous Agent Framework): introduces a ReAct-based agent framework for complex tasks, incorporating Thought Engine, Tools, Step Estimator, Penalty, Memory, Request Resolver, Next Move Scheduler, Executor, and Introspection components.
  • Autono framework enhances robustness through dynamic action generation based on prior trajectories and a timely abandonment strategy using probabilistic penalties.
  • The framework supports multi-agent collaboration with a memory transfer mechanism and is compatible with the Model Context Protocol (MCP) for tool integration.

6th April 2025

Building LLM Agents by Incorporating Insights from Computer Systems

  • Framework F: introduces structured framework for LLM agents, with Perception(interpreting environment inputs), Cognition(decision making and reasoning), Memory(storing and retrieving information), Tool(interacting with external tools), and Action(executing actions in environment) components.
  • Framework F draws analogy from von Neumann architecture to propose modular design for LLM agents, emphasizing distinct modules and dynamic interaction with environment.
  • Framework F aims to provide foundation for systematic LLM agent design by incorporating insights from computer systems, offering guidance for future research and development.

VideoAgent2: Enhancing the LLM-Based Agent System for Long-Form Video Understanding by Uncertainty-Aware CoT

  • VideoAgent2: introduces an uncertainty-aware CoT framework for long-form video understanding, with Video Input, Question Input, General context acquisition, Answer assessment, Information retrieval plan creation or adjustment, Information retrieval, Information Memory, and Answer Output components.
  • VideoAgent2 enhances LLM reasoning by iteratively refining information retrieval plans and incorporating uncertainty from both LLM and tools to improve answer reliability.
  • The framework mimics human video understanding by first acquiring general context, then creating and adjusting information retrieval plans based on question complexity and information adequacy.

Human-Level Competitive Pokémon via Scalable Offline Reinforcement Learning with Transformers

  • Metamon (offline RL workflow platform): introduces a platform for offline RL workflow, with Policy (agent decision making), Offline Dataset (human gameplay data), Replay Parser (extracts game data), Local Battle Simulator (simulates battles locally), PokĂ©mon Showdown (online battle platform), and Online Battles (battles against humans) components.
  • Metamon: reconstructs first-person perspective from spectator logs, unlocking human battle dataset.
  • Metamon: enables training sequence models for opponent adaptation without explicit search.

AutoPDL: Automatic Prompt Optimization for LLM Agents

  • AutoPDL (Automatic Prompt Optimization for LLM Agents): introduces an automated approach to discover good LLM agent configurations with Search Space Specification, Pattern Library, Successive Halving Optimizer, and Solution components.
  • AutoPDL frames prompt optimization as structured AutoML problem over agentic and non-agentic prompting patterns, efficiently navigating the search space using successive halving.
  • AutoPDL generates human-readable and executable PDL programs, enabling source-to-source optimization and facilitating human-in-the-loop refinement and reuse.

OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning

  • OmniDrive: introduces a holistic vision-language dataset and framework for autonomous driving, with Infos, Rules, Simulated Trajectory, Actual Trajectory, QA Generation, Decision making & Planning, Scene Description, General Traffic Rule, 3D Grounding, Counterfactual Reasoning, 3D Perception, Omni-Q, Q-Former, Multi-view Images, Multi-view Image Features, Large Language Model, MLP Projector, Omni-L, 3D Position, and Counterfactual & Reasoning components, for generating high-quality question-answering data and exploring vision-language models for 3D understanding in autonomous driving.
  • OmniDrive framework explores two baseline models, Omni-Q focusing on vision-language models from a 3D perception perspective and Omni-L building upon vision-language models to enhance 3D integration, utilizing counterfactual reasoning to improve decision-making by evaluating potential scenarios.
  • The framework leverages a counterfactual-based synthetic data annotation process to create large-scale datasets, providing denser supervision signals for bridging planning trajectories and language-based reasoning in autonomous driving scenarios.

Geo-OLM: Enabling Sustainable Earth Observation Studies with Cost-Efficient Open Language Models & State-Driven Workflows

  • Geo-OLM (Geospatial Open Language Model): introduces a state-driven geospatial agentic framework, with User Prompt, Database Load, DataOps, Satellite Vision, Map, Error, and Self-Reflect components, for cost-efficient Earth Observation studies using open language models.
  • Geo-OLM framework structures geospatial workflows as state machines, decoupling task progression from tool calling, enabling effective geospatial analysis with low-resource open language models.
  • The state-driven approach of Geo-OLM facilitates error handling and task completion validation, leading to improved agentic performance and significant cost reduction compared to existing geospatial solutions.

CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization

  • CO-Bench (Combinatorial Optimization Benchmark): introduces evaluation environment for AI agents, with Problem Description, Development Dataset, LLM Agent, Workflow Reasoning, Search Tool Use, Submit Dev, Feedback, Dev Evaluator, Test Evaluator, Solution (Code), Sandboxed running, and Score.
  • CO-Bench benchmark facilitates systematic evaluation of LLM agents in combinatorial optimization algorithm development by providing diverse real-world problems and rigorous evaluation framework.
  • The framework enables reproducible assessment of agent performance against human baselines under time constraints, highlighting strengths and limitations of current LLM-driven approaches.

5th April 2025

Among Us: A Sandbox for Agentic Deception

  • Among Us Sandbox: introduces "Among Us" as a controlled sandbox environment for studying agentic deception using LLM Agents, evaluated with Deception ELO and Detection ELO metrics, within a Game State defined by Observation Space and Action Space across Task Phase and Meeting Phase, and analyzed using Linear Probes and Sparse Autoencoders.
  • This sandbox facilitates the study of deceptive behaviors emerging naturally in LLM agents playing the game "Among Us", offering a rich environment to analyze agent-human interactions and evaluate AI safety techniques for deception detection.
  • The research leverages "Among Us" game dynamics to create a benchmark for advancing AI safety by focusing on detecting and mitigating agentically-motivated deception in LLMs, using metrics like Deception ELO and Detection ELO to quantify deceptive capabilities.

ADAPT: Actively Discovering and Adapting to Preferences for any Task

  • Reflection-DPO: introduces a novel training approach for adapting LLMs, with Teacher Planner (knowledge about preferences), Student Planner (no privileged knowledge), Reflection-DPO Data Generation (candidate questions) and DPO training (student LLM finetuning), to the task of active questioning.
  • Reflection-DPO uses a privileged LLM teacher to train a student LLM to adhere to user preferences by learning to acquire necessary information through active questioning.
  • The framework includes a reflection step that generates candidate questions to help the student predict the teacher's action, enabling it to fulfill ambiguous goals while adhering to user preferences.

AdaCoder: An Adaptive Planning and Multi-Agent Framework for Function-Level Code Generation

  • AdaCoder (Adaptive Planning and Multi-Agent Framework for Function-Level Code Generation): introduces a two-phase framework with Programming Assistant, Code Evaluator, Debug Specialist, and Prompt Engineer for adaptive code generation.
  • AdaCoder initially uses Programming Assistant and Code Evaluator in Phase-1 for fast code generation, and then employs Debug Specialist and Prompt Engineer in Phase-2 for iterative refinement with planning.
  • AdaCoder's adaptive planning approach enhances generalizability and reduces computational cost compared to other multi-agent frameworks by selectively applying planning and rule-based debugging.

GROVE: A Generalized Reward for Learning Open-Vocabulary Physical Skill

  • GROVE (Generalized Reward for Learning Open-Vocabulary Physical Skill): introduces a generalized reward framework for open-vocabulary physical skill learning, integrating VLM-based Reward (semantic correctness evaluation), Pose2CLIP (pose to semantic mapper), LLM-based Reward Generator (precise constraints formulation), and Target Control Policy (agent action controller) components.
  • GROVE framework combines LLM-generated constraints for task requirements with VLM-based semantic evaluation for motion naturalness, utilizing Pose2CLIP to efficiently map poses to semantic feature space and bridge the simulation-to-reality gap.
  • The framework employs an iterative reward design process with VLM feedback to dynamically refine LLM-generated constraints, establishing a self-improving reward system for scalable physical skill acquisition across diverse agents and tasks.

AttackLLM: LLM-based Attack Pattern Generation for an Industrial Control System

  • AttackLLM (LLM-based Attack Pattern Generation): introduces a multi-agent framework, with Process Data, Design Specification, LLM Agent 1, LLM Agent 2, Control Invariants, LLM Agent 3 Validate, Validated Control Invariants, Expert Designed Attacks, New Attack Patterns, and Comparison, for automated generation of attack patterns in industrial control systems.
  • AttackLLM leverages LLMs to analyze process data and design specifications to derive and validate control invariants, subsequently generating novel attack patterns that are compared against expert-designed attacks for performance evaluation.
  • The framework aims to enhance ICS security by automating the generation of diverse and stealthy attack scenarios, addressing the limitations of traditional methods relying on manual expertise and scarce testbed data.

4th April 2025

Agentic Knowledgeable Self-awareness

  • KnowSelf (Agentic Knowledgeable Self-awareness): introduces a data-centric approach with Self-awareness Data Construction, Self-awareness Learning, Self-awareness Inference, Selection mechanism and Knowledge base, enabling agents to regulate knowledge utilization autonomously.
  • KnowSelf framework employs a two-stage training process involving Supervised Fine-Tuning and Reinforcement Preference Optimization to equip agents with situational self-awareness for optimal planning.
  • The framework utilizes a heuristic situation judgement criterion to categorize situations and generate special tokens, facilitating selective knowledge incorporation during inference with minimal costs.

Inherent and emergent liability issues in LLM-based agentic systems: a principal-agent perspective

  • LLM-based MAS (Large Language Model-based Multiagent System): introduces a multiagent system architecture with principal delegating tasks to an orchestrator agent, which coordinates different agent teams on an agent platform, supported by safety, compliance, and security agents.
  • This framework illustrates a delegation hierarchy and supporting agent roles within a plausible LLM-based MAS deployment on an agent platform.
  • The architecture emphasizes the structured organization of agents and the inclusion of supporting agents for governance and security within the multiagent system.

DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments

  • DeepResearcher: introduces comprehensive framework for end-to-end reinforcement learning training of LLM-based research agents, incorporating distributed cluster, browsing agent, search engine, real-world environment, user, assistant, think, search, browse, answer, and memory.
  • DeepResearcher: enables agents to navigate noisy, unstructured open web environments, utilizing multi-agent architecture with specialized browsing agents for extracting information and addressing technical challenges.
  • DeepResearcher: demonstrates substantial performance improvements over prompt engineering and RAG-based baselines, showcasing emergent cognitive behaviors through end-to-end reinforcement learning in real-world web environments.

SynWorld: Virtual Scenario Synthesis for Agentic Action Knowledge Refinement

  • SynWorld (Virtual Scenario Synthesis): introduces a framework for agents to synthesize virtual scenarios and refine action knowledge through exploration within these environments.
  • SynWorld utilizes Monte Carlo Tree Search (MCTS) for exploration and action knowledge refinement, leveraging environment feedback from synthesized virtual scenarios.
  • The framework enables agents to learn how to execute actions and plan tasks in new environments by optimizing workflows through interaction with simulated scenarios.

Adaptation of Large Language Models

  • Adaptation of Large Language Models Framework: introduces Domain-Adaptive Pre-Training (DAPT), Instruction Tuning (IT), Preference Learning (PL), Model Editing, Retrieval-Augmented Generation (RAG), and Agent-based Integration for adapting Large Language Models.
  • This framework explores both parametric and semi-parametric adaptation techniques to improve Large Language Models performance in specialized domains and tasks.
  • Parametric adaptation refines model parameters through methods like domain pre-training and instruction tuning, while semi-parametric adaptation leverages external knowledge via retrieval and agent-based systems.

OLAF: An Open Life Science Analysis Framework for Conversational Bioinformatics Powered by Large Language Models

  • OLAF (Open Life Science Analysis Framework): introduces an open-source platform leveraging LLMs for bioinformatics code generation and execution, comprising User, Angular Frontend, Firebase Backend, Router, Agents, LLM Code Generation, Pipes, Execution Engine, and Results.
  • OLAF enables end-to-end bioinformatics analyses via natural language, automating code generation and execution within an integrated environment for researchers.
  • The agent-pipe-router architecture of OLAF ensures modularity and transparency, facilitating complex bioinformatics workflows and bridging the gap between user intent and computational execution.

YaleNLP @ PerAnsSumm 2025: Multi-Perspective Integration via Mixture-of-Agents for Enhanced Healthcare QA Summarization

  • Mixture-of-Agents (MoA) framework: introduces Agent, Aggregator, Verification Layer, and Hallucination Detection Layer components for multi-perspective healthcare question answering summarization.
  • MoA framework employs multiple LLM Agents to generate perspective-specific partial responses, which are then aggregated and refined through Verification and Hallucination Detection Layers.
  • MoA framework explores layered configurations with verification and hallucination detection to improve summarization accuracy and reliability in the medical domain.

APIGen-MT: Agentic PIpeline for Multi-Turn Data Generation via Simulated Agent-Human Interplay

  • APIGen-MT (Agentic Pipeline for Multi-Turn Data Generation): introduces a two-phase framework for generating multi-turn agent data, with Context, LLM based Data Generator, Format & Execution Checker, Review Committee, Feedback Generator, Validated Tasks, Simulated Human, Test Agent, Environment Config, Groundtruth Actions & Outputs, Interaction Traces, and Successful Trajectory components.
  • APIGen-MT framework first generates verified task blueprints using an agentic pipeline with feedback loops, then transforms blueprints into interaction trajectories via simulated human-agent interplay.
  • This approach ensures high-quality training data by separating task design from conversational dynamics, enhancing both structural correctness and naturalness of generated interactions for training AI agents.

Talk2X - AN OPEN-SOURCE TOOLKIT FACILITATING DEPLOYMENT OF LLM-POWERED CHATBOTS ON THE WEB

  • Talk2X: introduces an open-source toolkit for deploying LLM-powered chatbots, with agent, vector database, website collection, and asset collection components.
  • Talk2X facilitates efficient information retrieval by leveraging a vector database for website and asset content, enabling function calling agent to answer user queries.
  • This approach enhances energy efficiency and transparency compared to closed-source solutions, offering developers a generalizable tool for website integration.

Do Large Language Models Solve the Problems of Agent-Based Modeling? A Critical Review of Generative Social Simulations

  • Generative ABMs (Generative Agent-Based Models): introduces a novel approach for social simulations, integrating Persona, Memory Modules, Planning Modules, and Actions components.
  • This framework equips agents with human-like capabilities by using LLMs for reasoning, memory, and planning within agent-based models.
  • Generative ABMs aim to address limitations of traditional ABMs by enhancing agent realism and enabling more complex social simulations, but validation challenges remain.

Enhancing Personalized Multi-Turn Dialogue with Curiosity Reward

  • IM-UM-RLHF (Intrinsic Motivation in User Modeling for Multi-Turn RLHF): introduces intrinsic curiosity reward to multi-turn RLHF, with Conversation History (dialogue turn history), Belief on User Type (probabilistic user preference model), Per Turn Curiosity Reward (belief improvement based reward), Agent's Utterance (agent generated dialogue), User's Response (user dialogue response), End-of-Conversation Reward (dialogue completion reward), and User's Final Response (user end feedback).
  • IM-UM-RLHF framework enhances personalization by incentivizing the agent to actively learn user preferences during conversation through curiosity reward based on belief improvement.
  • The framework aims to balance helpfulness and inquisitiveness in conversational agents, enabling more personalized and adaptive interactions compared to traditional RLHF methods.

Learning Natural Language Constraints for Safe Reinforcement Learning of Language Agents

  • NLCL (Natural Language Constraint Learning): introduces a framework for safe language alignment, with CLIRL Phase, CAPO Phase, Positive Demonstrations, Negative Demonstrations, Policy, Reward Function, Constraint Functions, Transition Function, and CVaR.
  • NLCL learns natural language constraints from demonstrations using inverse reinforcement learning and optimizes policy with constraint-aware policy optimization for safe language agent behavior.
  • NLCL framework aims to improve robustness and generalization of language agents by explicitly learning and enforcing safety constraints in dynamic environments.

Multi-lingual Multi-turn Automated Red Teaming for LLMs

  • MM-ART (Multi-lingual Multi-turn Automated Red Teaming): introduces an automated approach for multi-lingual and multi-turn red-teaming of LLMs, with Conversation Starters Generation, Automated Multi-turn Conversation, and Multi-lingual Conversations components.
  • MM-ART framework aims to address limitations of human-driven and existing automated red-teaming methods by enabling scalable and efficient safety evaluation across multiple languages and conversation turns.
  • The framework leverages machine translation to handle multi-lingual aspects and automated conversation continuation to explore vulnerabilities in multi-turn interactions, enhancing the detection of unsafe responses in LLMs.

Les Dissonances: Cross-Tool Harvesting and Polluting in Multi-Tool Empowered LLM Agents

  • Chord: introduces a dynamic scanning tool, with Hijacker, Hijacking Optimizer, Harvester, Polluter, and Testing Agent components, designed to automatically detect agent tools susceptible to XTHP attacks.
  • Chord systematically analyzes task control flows in multi-tool LLM agents, identifying Cross-Tool Harvesting and Polluting (XTHP) threats.
  • The framework evaluates real-world tools from LangChain and Llama-Index, revealing vulnerabilities to hijacking and data manipulation attacks.

3rd April 2025

Ontologies in Design: How Imagining a Tree Reveals Possibilites and Assumptions in Large Language Models

  • Generative Agents Architecture: introduces Memory Stream (summarizes prompt histories), Reflection (extracts insights from memories), Planning (generates action plans), and Cognitive Architecture (simulates human functions) to organize information and simulate human-like behavior in LLM-based agents.
  • Generative Agents architecture aims to create believable proxies of human behavior in virtual avatars by building cognitive models on top of LLMs.
  • The framework uses memory stream, reflection, and planning components to manage information and generate realistic and interesting action sequences for agents in a simulated environment.

Affordable AI Assistants with Knowledge Graph of Thoughts

  • Knowledge Graph of Thoughts (KGoT): introduces an AI assistant architecture integrating LLM reasoning with dynamically constructed knowledge graphs, with Graph Store Module, LLM Graph Executor, Controller, LLM Tool Executor, Integrated Tools Module, and Backend components.
  • KGoT enhances task comprehension by structuring task-relevant knowledge into dynamic knowledge graphs, iteratively improved using external tools and enabling cost-effective models to solve complex tasks.
  • The modular KGoT architecture improves task-solving ability by operating with a rich, structured knowledge base, reducing operational costs and enhancing performance across diverse tasks.

Multi-Mission Tool Bench: Assessing the Robustness of LLM based Agents through Related and Dynamic Missions

  • MMTB (Multi-Mission Tool Bench): introduces a controllable data generation framework simulating mission execution through dialogic interactions among user, planner, tool, AI, and checker agents.
  • MMTB framework evaluates agent robustness in related and dynamic multi-mission scenarios, addressing challenges of real-world complexity.
  • The framework utilizes a novel evaluation method based on dynamic decision trees to assess accuracy and efficiency of agent decisions.

Design of Al-Powered Tool for Self-Regulation Support in Programming Education

  • CodeRunner Agent (LLM-based programming assistant): introduces an integrated programming support environment, with Lecture Viewer (displays lecture slides), CodeRunner plugin (code execution), Learning Analytics Context Engine (learner data analysis), and Knowledge Context Engine (knowledge management), to enhance self-regulated learning.
  • This framework utilizes Moodle LMS (learning platform) and Learning Record Store (learning data storage) for context-aware feedback and personalized programming education.
  • By integrating SRL phases (learning cycle stages) and instructor configuration (customization interface), CodeRunner Agent aims to improve student learning and AI application understanding in education.

Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving

  • Multi-SWE-bench: introduces a multilingual benchmark for issue resolving, comprising Repository Selection, PR Crawling, Environment Determination, PR Filtering, and Manual Verification components.
  • Multi-SWE-bench utilizes a five-phase pipeline to create a robust benchmark for assessing agent capabilities in resolving real-world software issues across multiple programming languages.
  • Multi-SWE-bench provides a diverse and rigorously validated dataset to overcome limitations of existing benchmarks and facilitate comprehensive LLM evaluation in realistic software engineering scenarios.

Exploring Individual Factors in the Adoption of LLMs for Specific Software Engineering Tasks

  • UTAUT2 (Unified Theory of Acceptance and Use of Technology 2): introduces framework with Performance Expectancy, Effort Expectancy, Social Influence, Hedonic Motivation, Habit, Facilitating Conditions, Behavioral Intention, Usage Behavior, Manipulate Artifacts, Generate Alternatives, Information Retrieval, Decision Support, and Training components, for exploring factors influencing LLM adoption for software engineering tasks.
  • The framework investigates how individual attributes related to technology adoption and UTAUT2 constructs impact the task-specific adoption of LLMs across five key software engineering tasks.
  • The study uses structural equation modeling to analyze survey data from software engineers, revealing task-specific adoption is influenced by distinct factors and providing actionable recommendations for effective LLM integration.

A Memory-Augmented LLM-Driven Method for Autonomous Merging of 3D Printing Work Orders

  • LLM-Driven Method (Memory-Augmented LLM-Driven Method for Autonomous Merging of 3D Printing Work Orders): introduces an autonomous 3D printing work order merging framework, with Production line condition, Agent, LLM, Tools, Memory, Print implementation and Monitoring, leveraging Order Matching Tool, Interference Checking Tool, Answer generator and Job Consolidation components.
  • The framework utilizes a memory-augmented learning strategy, enabling the agent to accumulate experience and improve decision-making accuracy over iterative autonomous operations.
  • The method models printer and order features into LLM-readable prompts, facilitating intelligent order-device matching and merging while reducing LLM hallucination in industrial applications.

ReuseDroid: A VLM-empowered Android UI Test Migrator Boosted by Active Feedback

  • REUSEDROID (REUSEDROID): introduces a multi-agent framework for GUI test migration, with Test Analyzer Agent, Test Skeleton, Planner Agent, Completeness Checker, Action Generator, Feedback Agent, Oracle Generator, and Execution Agent, to address operational logic differences in GUI testing.
  • REUSEDROID employs a Test Analyzer Agent to generalize source test logic and create a Test Skeleton, guiding a Planner Agent with Completeness Checker, Action Generator, and Oracle Generator, while a Feedback Agent refines actions and an Execution Agent performs them.
  • The framework leverages visual and textual information with VLMs in each agent to improve understanding of GUI elements and context, aiming to enhance the accuracy and efficiency of GUI test migration across different applications.

Parallel Market Environments for FinRL Contests

  • FinRL (Financial Reinforcement Learning): introduces VecEnv (manages parallel environments) with SubEnv (simulates market scenarios), State (market conditions), Action (trading action), Reward (incentive signal), Market Constraints (realistic market conditions), and Features (market indicators and LLM signals).
  • The framework addresses sampling bottleneck in financial RL by using GPU-based parallel market environments.
  • It incorporates LLM-generated signals for sentiment analysis and risk assessment to enhance trading agent's decision-making.

2nd April 2025

Self-Resource Allocation in Multi-Agent LLM Systems

  • MAS (Multi-Agent Systems): introduces three methods for task allocation in multi-agent systems: Individual, Orchestrator, and Planner, within a simulated CuisineWorld environment, utilizing LLM-based agents for task execution and control.
  • The Individual method represents a decentralized approach where each agent acts independently, while the Orchestrator method employs a centralized LLM to control all agents' actions, and the Planner method uses a plan-generating LLM to guide independent agent actions.
  • The paper evaluates these methods in terms of efficiency and cost-effectiveness, finding that the Planner method achieves better performance in handling concurrent actions and resource allocation compared to the Orchestrator and Individual methods.

LLM-mediated Dynamic Plan Generation with a Multi-Agent Approach

  • ANA (Agent Network Architecture): introduces a method for dynamic plan generation using a multi-agent approach, incorporating Status Collection, Network Construction, and Network Optimization stages, with Agents coordinating through Activation Spreading and leveraging GPT for agent generation.
  • The framework utilizes Agents, defined by Add list, Condition list, and Delete list, and Statuses to build a network capable of both reactive and deliberative planning in dynamic environments.
  • This approach aims to automate agent creation and network construction, reducing design costs and enhancing flexibility and scalability for robot planning and other complex systems.

LLMs as Deceptive Agents: How Role-Based Prompting Induces Semantic Ambiguity in Puzzle Tasks

  • Deceptive Puzzle Generation Framework: introduces a comparative study of Zero-Shot and Role-Injected prompting strategies for Large Language Models (puzzle generator) in generating deceptive puzzles, utilizing Rule Prompt (basic instructions), Rule + Prompt (deceptive instructions), JSON File (output format), and Game (generated puzzle).
  • This framework assesses how embedding adversarial intent through role-injected prompts modulates semantic ambiguity and puzzle difficulty compared to puzzles generated via zero-shot prompts.
  • The framework employs HateBERT for computational analysis and human evaluations, demonstrating that role-injected prompts generally increase semantic ambiguity, leading to higher cognitive load and reduced fairness in puzzle solving.

An Approach to Technical AGI Safety and Security

  • Frontier Safety Framework (FSF) introduces a multi-layered approach to mitigating misuse risks through Training for model-level mitigations, Evaluation of dangerous capabilities, Deployment of system-level mitigations, Security for model weights, and Red Teaming to assess mitigation effectiveness, involving components like Safety Training, Capability Suppression, Dangerous Capability Evaluations, Monitoring, Access Restrictions, Inference, Prompts, Responses, and User interactions.
  • FSF's misuse mitigation strategy combines model-level training with system-level deployment controls, utilizing dangerous capability evaluations to determine mitigation needs and red teaming to validate mitigation robustness against potential threat actors.
  • The framework emphasizes a proactive and iterative approach to AGI safety, incorporating security measures and evaluations to address potential misuse of dangerous AI capabilities by malicious actors.

A Survey of Scaling in Large Language Model Reasoning

  • Scaling in LLM Reasoning Taxonomy: introduces a taxonomy categorizing scaling strategies for large language model reasoning into input sizes, reasoning steps, reasoning rounds, and model optimization, exploring how these dimensions enhance reasoning capabilities.
  • The taxonomy details input size scaling with In-Context Learning, Retrieval-Augmented Generation, and Memory-Augmented LLMs; reasoning step scaling with Chain-of-Thought and Meta-Reasoning & Calibration; reasoning round scaling with Multi-Agent Collaboration, Debate-based Reasoning, Human-LLM Interaction, Reinforcement Learning, and Latent-Space Reasoning; and model optimization through Reinforcement Learning and Latent-Space Reasoning.
  • This survey aims to bridge the gap between empirical scaling strategies and reasoning improvements, providing insights into when and why scaling enhances reasoning and addresses limitations, guiding future AI development.

On Simulation-Guided LLM-based Code Generation for Safe Autonomous Driving Software

  • Simulation-Guided LLM-based Code Generation: introduces a closed-loop pipeline using Pipeline Input, Code Generator, Simulation Model, Baseline Selector, and Report Generator components to iteratively refine Generated Code for autonomous driving functions based on Test Report and Simulation Numerical Logs, guided by Specification Prompt and Correction Prompt.
  • This framework employs a simulation environment to automatically evaluate LLM-generated code against safety requirements, using feedback from test reports to guide the LLM in Correction Prompt for iterative code improvement and Baseline Selector for performance comparison.
  • The iterative Simulation-Guided LLM-based Code Generation pipeline aims to enhance the quality and safety of LLM-generated code for safety-critical automotive software by incorporating automated testing and feedback within the code generation process, utilizing Specification Prompt and Correction Prompt strategies.

Achieving Unanimous Consensus in Decision Making Using Multi-Agents

  • Deliberation-based consensus mechanism: introduces a novel approach for achieving unanimous consensus in blockchain using a layered architecture composed of Blockchain Layer, Deliberation Layer, and LLMs Layer.
  • The Blockchain Layer provides secure infrastructure, the Deliberation Layer structures the multi-agent discussion, and the LLMs Layer utilizes language models for generating arguments and refining opinions through iterative rounds.
  • This framework leverages graded consensus and multi-round deliberation to ensure unanimous agreement for critical decisions in blockchain networks, addressing limitations of majority-based consensus mechanisms.

Review, Refine, Repeat: Understanding Iterative Decoding of AI Agents with Dynamic Evaluation and Selection

  • IAD (Iterative Agent Decoding): introduces iterative refinement framework, with USER, SKETCH, LLM, HTML, VERIFIER/REWARD, SELECTED HTML, Feedback to improve, and NEW HTML components, where paper proposes iterative decoding for AI agents using verifier-guided feedback.
  • Iterative Agent Decoding framework refines responses through iterative feedback, enabling improved performance in black box structured generation tasks.
  • The framework leverages verifier quality for effective inference-time optimization and demonstrates robustness under sparse and noisy feedback.

GEN-C: POPULATING VIRTUAL WORLDS WITH GENERATIVE CROWDS

  • Gen-C (generative framework): introduces a generative model for authoring high-level crowd behaviors, utilizing components like LLM for scenario generation, VGAE Graph and VGAE Features for learning latent spaces of graph structures and node features, and a Condition Net for text-conditional generation.
  • Gen-C employs an LLM to create initial crowd scenarios, which are expanded and simulated to build time-expanded graphs, subsequently used to train variational graph auto-encoders for learning agent behaviors and interactions.
  • The framework facilitates text-conditioned synthesis of diverse crowd behaviors by sampling from learned latent spaces, enabling automated population of virtual environments with complex and dynamic agent interactions.

Advancing AI-Scientist Understanding: Making LLM Think Like a Physicist with Interpretable Reasoning

  • Augmented Reasoning with Interpretation Module: introduces an interpretation module to enhance interpretability and verifiability of LLM-based physics reasoning, incorporating reasoning-, interpretation-, and AI-scientist interaction-modules, with summarizer-, model builder-, UI builder-, and tester-components.
  • The framework refines raw AI outputs into structured science models and executable code, facilitating human oversight through interactive tools and automated checks, thereby improving transparency and validation in AI-augmented scientific discovery.
  • By employing specialized agents within the interpretation module, the system aims to bridge the gap between automated AI reasoning and human scientific intuition, fostering more reliable and understandable AI-driven scientific exploration.

INTERPRETING EMERGENT PLANNING IN MODEL-FREE REINFORCEMENT LEARNING

  • DRC (Deep Repeated ConvLSTM): introduces mechanistic evidence for emergent planning in model-free RL agents by employing concept-based interpretability to analyze internal plan formation, evaluation, and adaptation within a Sokoban-playing agent, revealing components like Convolutional encoder, ConvLSTM layers, Cell state, Internal ticks, Agent evaluates plan, Agent adapts plan, Agent plans forwards from boxes, Agent plans backwards from targets, and Agent extends routes in parallel.
  • The framework demonstrates that DRC agents utilize learned concept representations to formulate internal plans, predict long-term action effects, and influence behavior, resembling parallelized bidirectional search and benefiting from additional test-time compute.
  • The study highlights the emergent planning capabilities in model-free RL, suggesting that agents can learn complex internal mechanisms for decision-making without explicit world models, advancing the understanding of emergent reasoning in LLMs through RL.

PaperBench: Evaluating AI's Ability to Replicate AI Research

  • PaperBench: introduces a benchmark evaluating AI agents' ability to replicate AI research, with Agent (AI system for replication), Submission (Agent's codebase repository), Task (Replicate paper contributions), Reproduction (Execution to verify results), Rubric (Hierarchical assessment criteria), Judge (LLM-based grading system), Grading (Evaluation against rubric), and Score (Numerical replication performance).
  • PaperBench uses rubrics co-developed with paper authors and LLM-based judge to automatically grade replication attempts.
  • The benchmark evaluates agents on understanding paper contributions, developing codebase, and executing experiments for complete replication of ML research papers.

Are Autonomous Web Agents Good Testers?

  • PinATA (Planned INcentive ATA): introduces orchestrator, actor, assertor, memory and profile components for advanced autonomous test agent.
  • PinATA employs orchestrator for planning and monitoring, actor for action execution using grounding, and assertor for verification using judge approach, all sharing memory and profile modules.
  • PinATA aims to improve upon basic ATA by incorporating state-of-the-art techniques for perception, reasoning, evaluation, and grounding to enhance testing capabilities.

An Illusion of Progress? Assessing the Current State of Web Agents

  • WebJudge: introduces an automatic evaluation method for web agents, with Key Point Identification, Key Screenshot Identification, and Outcome Judgement components.
  • WebJudge framework identifies key task requirements, selects relevant screenshots from agent's trajectory, and judges task completion based on gathered information.
  • The framework aims to improve upon existing LLM-as-a-judge methods by preserving critical intermediate steps while mitigating token overload for more reliable web agent evaluation.

STRATEGIZE GLOBALLY, ADAPT LOCALLY: A MULTI-TURN RED TEAMING AGENT WITH DUAL-LEVEL LEARNING

  • GALA (Global and Local Learning Agent): introduces a multi-turn red-teaming agent, with Planning Module, Belief Update Module, and Learning Module, for emulating human attackers via global and local learning.
  • GALA employs Initial Knowledge Base and Selection Framework for tactic selection, and Generate Prompt Suggestion and On-the-fly for dynamic prompt creation, leveraging Accumulated Knowledge.
  • GALA's dual learning of global tactics and local prompts enhances attack success and diversity in multi-turn red-teaming scenarios.

1st April 2025

AGENTNET: DECENTRALIZED EVOLUTIONARY COORDINATION FOR LLM-BASED MULTI-AGENT SYSTEMS

  • AgentNet: introduces decentralized framework for LLM-based multi-agent system, with Agent-components, that includes Executor (executes tasks), Router (makes routing decisions), Router Memory (stores routing experiences), Trajectory Memory (stores execution experiences), DAG Task Routing (directed acyclic graph for routing), and RAG Pools (retrieval augmented generation knowledge).
  • AgentNet framework facilitates autonomous agent specialization and dynamic network topology evolution by leveraging retrieval-based memory and directed acyclic graph for task routing, enhancing scalability and fault tolerance.
  • AgentNet's decentralized design eliminates central orchestrator, enabling privacy-preserving collaboration and efficient resource allocation in dynamic multi-agent environments, improving adaptability and performance.

Catastrophic Forgetting in LLMs: A Comparative Analysis Across Language Tasks

  • Continual Instruction Fine-tuning (CIF) framework evaluates catastrophic forgetting in Large Language Models by sequentially fine-tuning a base model on Natural Language Understanding tasks using prompt engineering and continual evaluation.
  • The CIF framework assesses model retention of prior knowledge after learning new tasks by comparing performance across different fine-tuning episodes and various Large Language Models.
  • This research highlights the impact of prompt engineering and model size on continual learning capabilities in Large Language Models, offering insights into mitigating catastrophic forgetting.

First Field-Trial Demonstration of L4 Autonomous Optical Network for Distributed AI Training Communication: An LLM-Powered Multi-AI-Agent Solution

  • AutoLight (LLM-Powered Multi-Agent System): introduces a hierarchical multi-agent system with Planner Agent, Task Agent, and ReAct Agent, utilizing Chain of Identity for agent interaction, to manage autonomous optical networks across components like Domain Controller, Physical Layer Controller, Network Layer Controller, DCI Metro, Backbone Domain, Digital Twin, Failure Handler, Knowledge Retriever, Resource Allocator, Training Orchestrator, Network-layer Planner, and Physical-layer Planner.
  • AutoLight employs Planner Agents for high-level coordination and Task Agents for specialized operations, while ReAct Agents are LLM-powered, and Chain of Identity ensures effective agent communication.
  • The framework components facilitate autonomous network management by handling tasks such as resource allocation, failure management, and network planning across different network layers and domains.

Grounding Multimodal LLMs to Embodied Agents that Ask for Help with Reinforcement Learning

  • MLLM Policy: introduces a vision-language-action model for embodied agents, with SigLIP, Perceiver IO, MultiModal Large Language Model, Vision Adapter, and Action Tokens, to resolve ambiguity through clarification questions and reinforcement learning.
  • The framework uses SigLIP for visual encoding and Perceiver IO to handle long observation histories by downsampling visual tokens before feeding into a fine-tuned MultiModal Large Language Model.
  • Action Tokens represent the output space, enabling the agent to perform predefined skills or ask natural language questions to clarify ambiguous instructions in embodied tasks.

Do We Truly Need So Many Samples? Multi-LLM Repeated Sampling Efficiently Scales Test-Time Compute

  • ModelSwitch: introduces a multi-LLM repeated sampling framework with LLM 1, Query 1, Majority Voting, Answer, LLM 2, Consistent?, and Concat components.
  • ModelSwitch leverages consistency as a signal to dynamically switch between LLM 1 and LLM 2, aiming to improve performance and efficiency in test-time compute.
  • The framework optimizes sample efficiency by reducing samplings when LLM 1 generates consistent answers, and enhances accuracy by incorporating LLM 2 when consistency is low.

Accelerated Inorganic Materials Design with Generative AI Agents

  • MatAgent: introduces an AI-driven framework for inorganic material design, employing LLM as central engine with planning and proposition stages, integrated with structure estimation and property evaluation modules, and external tools like short-term memory, long-term memory, periodic table, and materials knowledge base to iteratively refine material compositions towards target properties.
  • MatAgent framework leverages LLM's reasoning capabilities for interpretable material design, mimicking human expert reasoning through strategic tool use and feedback-driven refinement, enabling exploration of broader compositional spaces and achieving high compositional validity and novelty.
  • The framework's iterative approach, combining generative and predictive models with external knowledge, demonstrates effectiveness in accelerating the discovery of next-generation inorganic materials by guiding exploration towards desired properties and allowing for natural language integration.

Personality-Driven Decision-Making in LLM-Based Autonomous Agents

  • Personality-Driven Decision-Making Framework: introduces a method for LLM-based agent decision-making, incorporating Personality Context, Task Instruction, Current Time, Remaining To-Do List, Completed List, LLM Response, Update Time, and Next Decision-Cycle components.
  • This framework evaluates how induced personality traits influence task selection and prioritization in LLM agents through iterative decision cycles.
  • The framework uses prompt-based persona induction and analyzes movement deltas in task order to measure the impact of personality on agent behavior.

GRAPHMASTER: AUTOMATED GRAPH SYNTHESIS VIA LLM AGENTS IN DATA-LIMITED ENVIRONMENTS

  • GraphMaster (GraphMaster): introduces a multi-agent framework for graph data synthesis in data-limited environments, with Manager-, Perception-, Enhancement- and Evaluation-agents.
  • GraphMaster orchestrates specialized agents to iteratively refine graph synthesis, ensuring semantic coherence and structural integrity by modular reasoning and feedback cycles.
  • The framework decomposes the synthesis task into specialized sub-tasks handled by collaborative LLM-powered agents, addressing challenges like context window limitations and hallucination.

Exploring the Impact of an LLM-Powered Teachable Agent on Learning Gains and Cognitive Load in Music Education

  • Chat Melody (LLM-powered Teachable Agent): introduces problem statement area, interactive music sheet, and LLMs-based dialogue window to assist music learners in music analysis tasks.
  • Chat Melody facilitates structured dialogues for music theory learning, providing interactive feedback and guiding students through music analysis.
  • The teachable agent aims to reduce cognitive load and enhance learning gains in music education by employing learning-by-teaching principles.

Automated detection of atomicity violations in large-scale systems

  • CLOVER: introduces code extractor, expert agent, and judge agent for atomicity violation detection.
  • CLOVER combines static analysis for code extraction with LLM agents for violation detection and validation.
  • CLOVER's hybrid approach enhances accuracy and efficiency in detecting atomicity violations in interrupt-driven programs.

HERA: Hybrid Edge-cloud Resource Allocation for Cost-Efficient AI Agents

  • HERA (Hybrid Edge-cloud Resource Allocation): introduces a lightweight scheduler for AI agents that partitions subtasks between local SLM and cloud LLM based on subtask features and position.
  • HERA framework includes User Request Classifier, Subtask Similarity Evaluator, S-L Similarity Evaluator, Convergence Detection, and Subtask Decomposition to optimize resource allocation.
  • By strategically using SLM for suitable subtasks and LLM for complex ones, HERA aims to reduce operational costs while maintaining accuracy in AI agent applications.

When Persuasion Overrides Truth in Multi-Agent LLM Debates: Introducing a Confidence-Weighted Persuasion Override Rate (CW-POR)

  • CW-POR (Confidence-Weighted Persuasion Override Rate): introduces single-turn multi-agent debate framework with Agent A (Correct) (provides factual answer), Agent B (Persuasive) (defends falsehood), Judge Model (evaluates responses), Combine Confidences (combines confidence scores), Final Decision (judge's answer choice), and CW-POR (persuasion override metric).
  • The framework investigates how persuasive arguments can override truthful answers in LLMs, even with high confidence from the judge.
  • CW-POR metric quantifies not only the frequency of persuasion override but also the judge's confidence in the incorrect choice, highlighting the severity of being misled.

31st March 2025

Do Large Language Models Exhibit Spontaneous Rational Deception?

  • Signaling Games Framework: introduces a study examining spontaneous deception in Large Language Models (LLMs) within signaling games, incorporating components like LLM Agent, Opponent Agent, Signaling Game, Message, Action Choice, Reward Structure, Prompt Instructions, and Deception Guardrails.
  • This framework evaluates LLMs' context-sensitive deception by manipulating reward structures and turn orders in 2x2 games, measuring rational adaptation to game incentives and communication opportunities.
  • The research demonstrates that LLMs exhibit unsolicited, context-aware deception influenced by potential benefits and ethical prompts, suggesting a link between reasoning capabilities and strategic deceptive behavior.

SciReplicate-Bench: Benchmarking LLMs in Agent-driven Algorithmic Reproduction from Research Papers

  • Sci-Reproducer: introduces a dual-agent framework for algorithmic reproduction, with Literature Context, Target Paper, Relevant Literature, Paper Agent, Agent Strategy, Search Paper-Extract Section, Search Literature, Literature Report, Code Context, Code Repository, Python Environment, Website, Code Agent, Agent Strategy, Search Web, Search Code, and Code Interpreter components.
  • Sci-Reproducer framework uses Paper Agent to understand algorithmic logic from papers and Code Agent to retrieve dependencies and implement solutions, enabling comprehensive paper reproduction.
  • The framework aims to address the challenge of generating code from scientific papers by decomposing the task into literature understanding and code implementation with specialized agents and actions.

Large Language Models in Numberland: A Quick Test of Their Numerical Reasoning Abilities

  • Framework names: introduces Numberland, a 100-problem test, to evaluate numerical reasoning abilities of LLM agents including OpenAI ChatGPT o1-mini, OpenAI ChatGPT o1, Google Gemini, Anthropic Claude, and Microsoft Copilot.
  • The paper assesses basic operations, advanced calculations, prime number checks, and the 24 game to test elementary skills and integration in complex problem-solving.
  • The study highlights limitations in LLMs' numerical reasoning, especially in trial-and-error search tasks, despite their proficiency in deterministic tasks.

Agents Under Siege: Breaking Pragmatic Multi-Agent LLM Systems with Optimized Prompt Attacks

  • PIEL (Permutation-Invariant Evasion Loss): introduces a novel adversarial attack framework, with MFMC Problem, Permutation-Invariant Loss, Topological Optimization, Optimized Path, Chunk, Memory Bank, and Sampling Space components, that breaks pragmatic multi-agent LLM systems.
  • PIEL framework optimizes prompt distribution across network topologies, considering bandwidth and detection risk constraints to bypass safety mechanisms.
  • The framework leverages graph-based optimization and permutation-invariant loss to maximize attack success rate while minimizing detection risk in multi-agent LLM systems.

ADVANCES AND CHALLENGES IN FOUNDATION AGENTS FROM BRAIN-INSPIRED INTELLIGENCE TO EVOLUTIONARY, COLLABORATIVE, AND SAFE SYSTEMS

  • Brain-Inspired AI Agent Framework: introduces modular architecture for intelligent agents, integrating brain-inspired components like Sensor, Cognition, Actor, Memory, World Model, Reward, Emotion, Goal, and Tool.
  • This framework maps cognitive, perceptual, operational modules to brain functionalities, emphasizing core components such as memory, world modeling, reward processing, and emotion systems.
  • The survey synthesizes modular AI architectures with insights from cognitive science and neuroscience to identify research gaps and opportunities for brain-inspired intelligent agents.

PAARS: Persona Aligned Agentic Retail Shoppers

  • PAARS (Persona Aligned Agentic Retail Shoppers): introduces a framework for simulating human shoppers using persona-driven LLM agents, incorporating human population, session histories, persona profile, agent population, retail tools, alignment suite, and potential applications.
  • PAARS framework synthesizes personas from historical shopping data to create agent population equipped with retail tools for simulating shopping sessions and evaluating alignment with human behavior.
  • The framework's alignment suite measures distributional differences between human and agent shopping behavior at group level, enabling applications like agentic A/B testing and surveying.

Output Constraints as Attack Surface: Exploiting Structured Generation to Bypass LLM Safety Mechanisms

  • CDA (Constrained Decoding Attack) framework introduces LLM Inference (processes input to logits), Grammar Rule (defines output structure), Lexer & Parser (processes grammar rules), Per-token Mask (filters tokens by grammar), Decoder Block (core LLM decoding process), Logit Processor (processes logits before decoding), Output Generation (generates output tokens), Content Auditing (checks output for safety), and External Safety Guardrails (external checks for safety) to weaponize structured output constraints for bypassing safety mechanisms.
  • CDA framework operates by embedding malicious intent within schema-level grammar rules (control-plane) while maintaining benign surface prompts (data-plane), contrasting with prior attacks focused on input prompts.
  • The framework highlights a critical security blind spot in current LLM architectures, urging a paradigm shift in LLM safety to address control-plane vulnerabilities beyond data-plane threats.

TeleAntiFraud-28k: An Audio-Text Slow-Thinking Dataset for Telecom Fraud Detection

  • TeleAntiFraud-30k Framework: introduces TeleAntiFraud-28k dataset creation and evaluation benchmark, with Real-Data ASR Processing (process real call recordings), LLM-Based Imitation and Augmentation (expand scenario coverage), Audio Synthesis (convert text to voice), Multi-Agent Adversarial Framework (simulate fraud tactics), TeleAntiFraud-Bench (evaluation benchmark), Think-LALM (slow-thinking fraud detection model), Model Training (training detection models), Reasoning Process Quality (assess reasoning quality), Scenario Classification (categorize call scenarios), Fraud Determination (determine fraudulent behavior), and Fraud Type Identification (identify fraud categories).
  • TeleAntiFraud-30k framework utilizes Real Audio Data (input audio recordings) processed into ASR Data (transcribed text from audio) and synthesized into Audio (synthesized voice data) and Text (annotated text data), employing User (caller agent), Cheater (fraudster agent), and Manager (conversation monitor agent) within Multi-Agent Adversarial Framework, while evaluation uses JSON (input data format) and potentially generates LR Audio (low resolution audio).
  • TeleAntiFraud-30k framework aims to address telecom fraud detection challenges by providing a multimodal dataset and benchmark for training and evaluating slow-thinking Large Audio Language Models (Think-LALM) in tasks like Scenario Classification, Fraud Determination, and Fraud Type Identification, ultimately enhancing reasoning and detection capabilities.

Grounding Agent Reasoning in Image Schemas: A Neurosymbolic Approach to Embodied Cognition

  • Neurosymbolic Framework for Grounding Agent Reasoning in Image Schemas: presents framework comprising language input, LLM parser, image schema formalizer, knowledge base, and neurosymbolic reasoner.
  • Framework utilizes LLM parser to translate language input into formal image schemas, stored in knowledge base, for neurosymbolic agent reasoning.
  • This framework grounds agent reasoning in embodied cognition by leveraging image schemas for enhanced interpretability and human-agent interaction.

Towards Scientific Intelligence: A Survey of LLM-based Scientific Agents

  • LLM-based scientific agents: introduces Planner, Memory, and Tool Set as core components for iterative, context-aware processing of complex scientific tasks.
  • LLM-based scientific agents architecture includes Planner for task decomposition, Memory for context and knowledge retention, and Tool Set for extending scientific capabilities with external tools.
  • The framework emphasizes the integration of Planner, Memory, and Tool Set to enable scientific agents to perform complex research tasks, ensuring reproducibility and driving scientific discovery.

Rubric Is All You Need: Enhancing LLM-based Code Evaluation With Question-Specific Rubrics

  • CRE (Complete Rubric Evaluation): introduces a code evaluation framework, with Student Code, Javac, Error Dictionary, Question, Rubric, Prompt, LLM, System Message, Pointwise Marks, Logical Marks, Syntax Marks, Total Marks, Pointwise Feedback, and CRE GRADER, that uses LLM to assess code logic and a compiler for syntax, combining scores for final grade.
  • Complete Rubric Evaluation framework employs a detailed rubric and prompt to guide a Large Language Model in evaluating student code submissions, focusing on logical correctness while separately handling syntax errors via a compiler.
  • The CRE framework aims to simulate human-like grading by prioritizing conceptual understanding over minor syntax errors, providing a comprehensive evaluation through combined logical and syntactical assessments.

SchemaAgent: A Multi-Agents Framework for Generating Relational Database Schema

  • SchemaAgent (Schema Agent) introduces a LLM-based multi-agent framework for automated database schema generation, incorporating Product manager, Conceptual model designer, Conceptual model reviewer, Logical model designer, QA engineer, and Test executor components.
  • SchemaAgent framework employs Error detection and correction mechanism to refine schema quality through iterative feedback loop among agents.
  • This multi-agent system aims to enhance accuracy and efficiency in relational database schema design process by emulating manual workflow.

Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute

  • TTC (Test-Time Compute) scaling framework: introduces internal and external strategies to enhance software engineering agents by scaling computation time, incorporating development-contextualized trajectory synthesis, rejection sampling, reasoning training, process and outcome reward models, and execution verification.
  • Internal TTC leverages trajectory synthesis and rejection sampling to improve reasoning depth, while external TTC employs a process-based search strategy with reward models and execution verification for targeted computation allocation.
  • The framework aims to achieve comparable performance to larger models using smaller, personally deployable LLMs by strategically increasing inference-time computation and focusing on critical decision points in software engineering tasks.

DebFlow: Automating Agent Creation via Agent Debate

  • DebFlow: introduces a framework for automated agent creation, with Search Space, Self-reflection, Workflow, and Agent Debate components.
  • DebFlow employs agent debate to optimize workflows and integrates self-reflection for iterative performance improvement based on past experiences.
  • The framework utilizes LLM-invoking nodes as basic units, optimizing agent workflows through debate and reflection mechanisms for enhanced efficiency and performance.

Detecting Functional Bugs in Smart Contracts through LLM-Powered and Bug-Oriented Composite Analysis

  • PROMFUZZ (PROMFUZZ): introduces automated system to detect functional smart contract bugs, with LLM-driven Multi-Perspective Analysis, Invariant Checker Generation, Bug-oriented Analysis Engine and Functional Bug Detection components.
  • PROMFUZZ employs dual-agent approach with Auditor Agent and Attacker Agent, and generates invariant checkers using Critical Variable Extraction, Principal Statement Extraction and Template-based Checker Generation.
  • PROMFUZZ utilizes Bug-oriented Analysis Engine with Strategically Invariant Checker Insertion and Bug-oriented Fuzzing for effective functional bug detection and provides Bug Report component.

30th March 2025

GIScience in the Era of Artificial Intelligence: A Research Agenda Towards Autonomous GIS

  • Autonomous GIS: introduces a conceptual framework for next-generation geographic information systems, integrating decision-making, data preparation, data operation, memory-handling, and core-updating functions, supported by geo-data retrieval, spatial analysis, cartography, and modeling agents, across routine-aware, workflow-aware, data-aware, result-aware, and knowledge-aware levels, aiming for self-generating, self-executing, self-verifying, self-organizing, and self-growing goals, scalable across local, centralized, and infrastructure scales.
  • Autonomous GIS framework envisions a paradigm shift in GIScience by leveraging generative AI to automate geospatial problem-solving with minimal human intervention, enhancing accessibility and democratizing spatial analysis for broader applications.
  • The framework outlines key research challenges and future directions for autonomous GIS, emphasizing the need for benchmarks, enhanced AI understanding of geospatial concepts, and addressing ethical and societal impacts of AI-driven geospatial technologies.

Exploring GPT-4 for Robotic Agent Strategy with Real-Time State Feedback and a Reactive Behaviour Framework

  • LLM-Director Framework: introduces a robotic control approach integrating LLM for high-level task planning within Director reactive behaviour framework, utilizing NUClear for real-time message passing and sensor modules for environmental feedback.
  • This framework uses LLM to generate tasks based on user requests and world information, which are then executed by the Director Tree, ensuring safety and smooth transitions through skills and joint commands to actuators, guided by real-time feedback from sensor modules.
  • The integration of LLM with Director framework allows for dynamic reactive task layer construction, addressing safety, task transitions, and real-time feedback for improved robotic agent performance in complex environments.

If an LLM Were a Character, Would It Know Its Own Story? Evaluating Lifelong Learning in LLMs

  • LIFESTATE-BENCH (Lifelong State Benchmark): introduces cumulative experience, fact checking, memory testing, and judge model components for evaluating lifelong learning in LLMs.
  • LIFESTATE-BENCH assesses state evolution in LLMs through episodic interactions and fact-based questions focusing on self-awareness, memory, and relationship shifts.
  • The benchmark employs non-parametric and parametric memory testing methods and LLM-as-judge for comprehensive evaluation of lifelong learning capabilities.

RE-ALIGNING LANGUAGE TO VISUAL OBJECTS WITH AN AGENTIC WORKFLOW

  • Real-LOD (Re-Aligning Language to Visual Objects with an Agentic Workflow): introduces agentic workflow for refining language descriptions using planning, tool use, and reflection components.
  • Real-LOD leverages LLM for reasoning and reflection, and VLM for tool use to iteratively improve language alignment with visual objects.
  • The framework enhances data quality for language-based object detection by reducing hallucinations in automatically generated descriptions.

VideoGen-Eval: Agent-based System for Video Generation Evaluation

  • VideoGen-Eval: introduces agent-based system for video generation evaluation, with Structured Prompts, Advanced Models, Generated Videos, Human annotations, Prompt Structurer, Content Judger, Tools Pool, Temporal-sparse Content, Temporal-dense Content, MLLMs, and Human Alignment.
  • VideoGen-Eval benchmark includes structured prompts and large-scale video results for dynamic and flexible evaluation of video generation models.
  • The system employs LLM for content structuring, MLLM for content judgment, and patch Tools Pool for temporal dimension assessment, enhancing alignment with human preferences.

CoRanking: Collaborative Ranking with Small and Large Ranking Agents

  • CoRanking (Collaborative Ranking): introduces a collaborative ranking framework, with Small Listwise Reranker (SLR), Passage Order Adjuster (POA), and LLM Listwise Reranker (LLR), that combines small and large ranking models for efficient and effective passage ranking.
  • CoRanking framework utilizes SÂł strategy for preference pair selection and Human Label enhanced ranking construction to improve training, addressing positional biases of LLMs and enhancing ranking performance.
  • The framework achieves significant efficiency gains by using SLR for pre-ranking and POA for order adjustment before applying LLR for final reranking, outperforming pure LLM listwise reranking in both speed and effectiveness.

An Analysis of Decoding Methods for LLM-based Agents for Faithful Multi-Hop Question Answering

  • ReAct (Reasoning and Acting): introduces systematic analysis of ReAct framework with Thought, Action, Observation components combined with Decoding Strategy and Retrieval to improve faithfulness in question answering.
  • The framework iteratively uses Thoughts to decide Actions, leading to Observations, employing Decoding Strategy and Retrieval for enhanced answer faithfulness.
  • Combining ReAct with faithful decoding methods significantly improves accuracy in multi-hop question answering tasks by enhancing contextual faithfulness.

A Multi-Agent Framework with Automated Decision Rule Optimization for Cross-Domain Misinformation Detection

  • MARO (Multi-Agent Framework for cross-domain misinformation detection with Automated Decision Rule Optimization): introduces a two-module framework with Multi-Dimensional Analysis Module, incorporating Linguistic Feature Analysis Agent, Comment Analysis Agent, Fact-Checking Agent Group with Fact-Questioning Agent and Fact-Checking Agent, Questioning Agent, and Multi-Dimensional Analysis Report, alongside Decision Rule Optimization Module, which includes Cross-Domain Validation Task, Judge Agent, Decision Rule Optimization Agent, Decision Rule Optimization Prompt, Demonstrations from Other Domains, Wikipedia, Google, LRS, and Top K decision rules, for cross-domain misinformation detection.
  • MARO's Multi-Dimensional Analysis Module employs multiple agents to analyze news from different perspectives, generating a comprehensive analysis report, while the Decision Rule Optimization Module automatically refines decision rules using feedback from cross-domain validation tasks.
  • The framework utilizes a question-reflection mechanism with a Questioning Agent to guide expert agents in Multi-Dimensional Analysis Module for enhanced analysis quality, and iteratively optimizes decision rules in Decision Rule Optimization Module to improve generalization across domains.

AI Agents in Engineering Design: A Multi-Agent Framework for Aesthetic and Aerodynamic Car Design

  • AI Design Agents Framework: introduces multi-agent system with Styling-, CAD-, Meshing- and Simulation-Agents, leveraging Foundation Models, Geometric Deep Learning Models and Engineering Tools, orchestrated by AutoGen, for accelerating car design process.
  • Framework automates conceptual sketching, styling, 3D shape retrieval, generative modeling, CFD meshing and aerodynamic simulations.
  • AI Design Agents Framework enhances design exploration, efficiency and collaboration between designers and engineers in automotive design.

SPIO: Ensemble and Selective Strategies via LLM-Based Multi-Agent Planning in Automated Data Science

  • SPIO (Sequential Plan Integration and Optimization): introduces multi-agent framework with fundamental code generation agents, sequential planning agent, plan optimization agent, and code write agent, leveraging memory for automated data science.
  • SPIO employs sequential planning and LLM-driven optimization across preprocessing, feature engineering, modeling, and hyperparameter tuning modules.
  • SPIO enhances predictive accuracy and robustness by exploring multiple candidate strategies and ensembling top-performing plans.

GRASP: Municipal Budget AI Chatbots for Enhancing Civic Engagement

  • GRASP (Generation with Retrieval and Action System for Prompts): introduces a municipal budget chatbot framework combining RAG Framework for document retrieval and ReAct Agent for action execution, utilizing Prompt Engineering and Domain Knowledge to enhance response accuracy.
  • GRASP framework incorporates LLM with System Instructions, Agent Scratchpad, and Intermediate Steps, processing user Prompt to interact with Budget Tool, Drawing tool, and Analysis tool through iterative Thoughts, Action, and Observation cycles for Final Response.
  • This approach aims to improve truthfulness and grounding of chatbot responses by leveraging external Budget Docs within Vector Database and employing Metadata filtering and Similarity Search for relevant information retrieval.

EncGPT: A Multi-Agent Workflow for Dynamic Encryption Algorithms

  • EncGPT (Encryption GPT): introduces multi-agent workflow for dynamic encryption, with rule-, encryption-, decryption-, source- and recipient-agents, and memory.
  • It dynamically generates encryption rules, applies them for encryption and decryption, and supports homomorphic operations on encrypted data.
  • This framework enhances communication security in LLM-MA systems by addressing dynamic algorithm generation and single encryption algorithm vulnerabilities.

Efficient Inference for Large Reasoning Models: A Survey

  • Efficient Inference for Large Reasoning Models: introduces survey on efficient inference methods for Large Reasoning Models, categorizing approaches into explicit compact CoT and implicit latent CoT, alongside taxonomy, empirical analyses, challenges, and improvements.
  • The survey explores token efficiency in Large Reasoning Models, addressing token consumption, memory overhead and inference time, while considering solutions like model merge, new architectures and agent routers.
  • This research emphasizes trade-offs between efficiency and interpretability, safety, and application scope within efficient reasoning methods for Large Reasoning Models.

29th March 2025

Agentic Large Language Models, a survey

  • Agentic LLM Taxonomy: introduces reasoning, acting, interacting, self-reflection, retrieval, multi-step, world models, VLA, robot, tools, assistants, social capabilities, open-ended societies, new data for categorizing agentic LLM research.
  • Agentic LLM Taxonomy categorizes agentic LLMs into reasoning for better decisions, acting for real-world tasks, and interacting for social behaviors.
  • Agentic LLM Taxonomy highlights the virtuous cycle where reasoning, acting, and interacting generate new data to improve LLMs continuously.

Factored Agents: Decoupling In-Context Learning and Memorization for Robust Tool Use

  • Factored Agent Architecture: introduces a two-component agent system with planner-LLM and memorizer-SLM, addressing limitations of single-agent designs for tool use by decoupling in-context learning and memorization.
  • The architecture separates planning using a larger LLM from tool-specific formatting handled by a smaller SLM, aiming to improve robustness against API errors and enhance planning accuracy in dynamic environments.
  • This decoupling strategy intends to mitigate trade-offs between in-context learning and static memorization, potentially leading to more adaptable and error-resilient agentic AI systems for tool utilization.

28th March 2025

WorkTeam: Constructing Workflows from Natural Language with Multi-Agents

  • WorkTeam (multi-agent NL2Workflow framework): introduces supervisor, orchestrator, and filler agents for collaborative natural language to workflow conversion.
  • WorkTeam framework enhances workflow construction accuracy through task specialization and collaboration among supervisor, orchestrator, and filler agents.
  • The framework utilizes supervisor agent for task planning and result reflection, orchestrator agent for component selection and orchestration, and filler agent for parameter population.

Self-Evolving Multi-Agent Simulations for Realistic Clinical Interactions

  • MedAgentSim (MedAgentSim): introduces a multi-agent framework with patient-, doctor-, and measurement-agents within conversation- and experience replay-phases, utilizing medical- and experience-records buffers, KNN few-shot retrieval, chain-of-thought reasoning, majority vote ensembling, and reflection-phase for enhanced diagnostic accuracy.
  • MedAgentSim framework simulates realistic clinical interactions by enabling doctor agents to actively gather patient information through multi-turn conversations and iteratively refine diagnostic strategies using self-improvement mechanisms.
  • The framework incorporates experience replay and memory buffers to facilitate progressive learning and improve the performance of LLM-powered agents in dynamic diagnostic settings, bridging the gap between static evaluations and real-world medical reasoning.

Unlocking LLM Repair Capabilities in Low-Resource Programming Languages Through Cross-Language Translation and Multi-Agent Refinement

  • LANTERN (LANguage Translation and multi-agEnt Refinement): introduces a novel program repair framework, with Analyzer, Translator, Repairer, Test Suites, Middleware, Historical Data Storage & Retrieval, Prompt Construction, Process Control and Translation Coordination, that leverages cross-language translation and multi-agent refinement to enhance LLM-based repair capabilities in low-resource programming languages.
  • LANTERN framework strategically translates buggy code to languages where LLMs exhibit stronger repair performance, utilizing a multi-agent iterative repair paradigm and incorporating historical feedback for informed decision-making.
  • The framework's key innovation lies in its LLM-based Analyzer that dynamically selects optimal target languages for translation based on bug characteristics and previous repair attempts, effectively bridging the performance gap across programming languages.

Evaluating LLM-based Agents for Multi-Turn Conversations: A Survey

  • Multi-turn Conversational Agent: introduces agent architecture for multi-turn dialogues, with User Input-Agent Output, Task Planner, Tool Invoker, Agent Core, Conversation Memory, and Turn Memory components.
  • This framework manages conversation flow by decomposing user requests, invoking tools, maintaining memory, and generating responses.
  • The architecture enables coherent and context-aware interactions over extended dialogues by leveraging memory and planning.

UNDERSTANDING INEQUALITY OF LLM FACT-CHECKING OVER GEOGRAPHIC REGIONS WITH AGENT AND RETRIEVAL MODELS

  • ReAct-like agent: introduces agent-based fact-checking, with LLM, Wikipedia access, local cache, system prompt and user message, to evaluate factuality of statements.
  • This framework employs function calling LLM to query Wikipedia for external knowledge, caching results for subsequent use.
  • The agent-based method explores enhancing fact-checking via external information access, yet encounters performance limitations compared to RAG using verified documents.

COSIL: Software Issue Localization via LLM-Driven Code Repository Graph Searching

  • COSIL (Software Issue Localization): introduces a two-stage framework for issue localization, with file-level search space reduction and function-level iterative search, utilizing module and function call graphs, guided by a searcher and pruner, to identify suspicious code locations.
  • COSIL employs a module call graph enhanced reflection and iterative function call graph searching to refine search space and context, dynamically constructing graphs and using context pruning for effective issue localization.
  • The framework leverages a searcher agent with tools and a pruner to manage context and direction during iterative search, aiming for concise yet effective context for accurate issue localization without pre-built indexes.

Agent-Centric Personalized Multiple Clustering with Multi-Modal LLMs

  • Agent-Centric Framework (Agent-Centric Personalized Multiple Clustering Framework): introduces agent-centric personalized clustering using MLLM Agents to traverse Relational Graph built from MLLM-based Embedding Extractor and identify Searched Clusters based on User Interests.
  • The framework constructs Relational Graph via Embedding Similarity filtering of Image Embeddings and employs Agent-Centric Graph Traversal with Membership Assessment and Cluster Update mechanisms.
  • This approach leverages MLLM Agents for efficient graph exploration, starting from Seed Nodes within Connected Components and iteratively expanding clusters by evaluating Candidate Nodes and Neighbor Nodes.

PharmAgents: Building a Virtual Pharma with Large Language Model Agents

  • PharmAgents: introduces a virtual pharmaceutical ecosystem, driven by LLM-based multi-agent collaboration, that simulates drug discovery workflow with components including agents for disease expertise, target analysis, molecule design, and preclinical evaluation, alongside databases and computational tools.
  • PharmAgents decomposes drug discovery into target discovery, lead identification, lead optimization, and preclinical evaluation stages, employing specialized LLM-driven agents for each stage, enhanced with machine learning models and domain-specific tools, to achieve autonomous and explainable drug design.
  • The framework emphasizes interpretability and efficiency by integrating LLMs for reasoning and decision-making at each stage of drug discovery, ensuring transparency and enabling researchers to understand and validate the AI-driven process, ultimately accelerating drug development.

27th March 2025

MemInsight: Autonomous Memory Augmentation for LLM Agents

  • MemInsight (Autonomous Memory Augmentation): introduces autonomous memory augmentation framework with Attribute Mining, Annotation, Retrieval Pool, Memory Retriever and Memory Augmentation to enhance LLM agents' contextual performance.
  • MemInsight leverages attribute mining and annotation for structured memory representation, enabling efficient retrieval through attribute-based and embedding-based methods.
  • The framework improves memory retrieval by filtering irrelevant information and retaining key insights, demonstrated across conversational tasks.

Debate-Driven Multi-Agent LLMs for Phishing Email Detection

  • Multi-Agent Debate Framework: introduces multi-agent debate framework with pro-phishing Agent 1, anti-phishing Agent 2, debate adjudicating Judge Agent, and scripted Debate Procedure for phishing email detection.
  • This framework uses two debating LLM agents and judge LLM to improve phishing email classification via structured argument exchange.
  • Debate-driven approach enhances contextual analysis and reasoning for improved phishing detection accuracy compared to single-agent methods.

LEARNING TO LIE: REINFORCEMENT LEARNING ATTACKS DAMAGE HUMAN-AI TEAMS AND TEAMS OF LLMS

  • MBRL (Model-Based Reinforcement Learning): introduces adversarial agent, with Action, Team, State, Planner, Internal Model, Adversarial agent components, to study malicious AI in human-AI teams.
  • MBRL framework uses internal model of team dynamics and planner to decide AI agent's action to maximize damage to team performance.
  • This approach investigates vulnerabilities in human-AI collaboration and informs development of defense strategies against AI-driven attacks.

GateLens: A Reasoning-Enhanced LLM Agent for Automotive Software Release Analytics

  • GateLens: introduces a system for automotive software release analytics, utilizing Query Interpreter Agent, Relational Algebra Generation, RA to Pandas Code Conversion, Coder Agent, Code Execution, Analysis Results Output to User, Database, and Knowledge Base components.
  • GateLens employs Query Interpreter Agent to translate user queries into Relational Algebra, which is then converted to executable code by Coder Agent for analysis using Database and guided by Knowledge Base.
  • The framework enhances analytical reasoning by incorporating Relational Algebra, enabling precise handling of domain-specific queries and improving the interpretability of the analysis process.

ReFeed: Multi-dimensional Summarization Refinement with Reflective Reasoning on Feedback

  • ReFeed (Refinement with Reflective Reasoning on Feedback): introduces a summarization refinement pipeline enhancing multiple dimensions using reflective reasoning on feedback.
  • ReFeed pipeline incorporates detection, multi-dimensional feedback mapping, reflective reasoning, supervised fine-tuned LLM, SumFeed-CoT dataset, goal specification, LRM teacher, refinement guideline, and quality control components.
  • ReFeed framework aims to address trade-offs, ordering bias, and noisy feedback in multi-dimensional summarization refinement, improving robustness and performance.

COLLAB: CONTROLLED DECODING USING MIXTURE OF AGENTS FOR LLM ALIGNMENT

  • COLLAB (CONTROLLED DECODING using MIXTURE OF AGENTS): introduces mixture of agents-based decoding strategy with policy-switching and token-level selection.
  • Leverages implicit Q-function for optimal agent selection from pool of models during inference.
  • Enables collaborative alignment among LLMs without retraining by dynamic agent selection.

A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond

  • Efficient Reasoning for LRMs (Large Reasoning Models): introduces pre-training, SFT, RL, LRM, and inference stages for efficient reasoning methods.
  • The survey categorizes efficient reasoning methods based on these stages in the LLM lifecycle.
  • Efficient reasoning in LRMs is crucial for deployment, scalability, and practical application, addressing the challenge of excessive reasoning traces.

26th March 2025

GAIA-2: A Controllable Multi-View Generative World Model for Autonomous Driving

  • GAIA-2 (Generative AI for Autonomy): introduces a generative world model for autonomous driving, with Video Tokenizer, Latent World Model, Space-Time Factorized Transformer, and Conditioning components.
  • GAIA-2 framework includes Encoder and Decoder within Video Tokenizer, various Conditioning inputs like Actions, 3D Bounding Boxes, Metadata, Embeddings, Camera Parameters, Positional Encodings, and Memory and Noise components for generation.
  • GAIA-2 framework utilizes Training Tasks and Inference modes to enable controllable video generation for autonomous driving simulation, addressing multi-camera consistency and fine-grained control.

Feature4X: Bridging Any Monocular Video to 4D Agentic AI with Versatile Gaussian Feature Fields

  • Feature4X: introduces a framework to create interactive 4D scenes from monocular video by distilling features from 2D Foundation Models (Extract initial features) into a unified 4D Feature Fields (Unified feature representation) using dynamic Gaussian Splatting, involving Input Monocular Video (Source data), 2D Priors (Initial features/constraints) like Dynamic Mask (Foreground/background separation) and Metric Depth (Geometric prior), Static Feature GS (Background Gaussian Splatting), Dynamic Feature GS (Foreground Gaussian Splatting) guided by a Motion Scaffold (Guides dynamic deformation), a Parallel N-Dimensional Gaussian Rasterizer (Renders RGB/features) producing a Unified Latent Feature Map (Compact shared features), task-specific Decoders (Map unified to task features), optimized with Photometric Loss (RGB reconstruction objective) and Feature Loss (Feature reconstruction objective), enabling interaction via an LLM (Language interaction/control) and User (Provides interaction/prompts) within a 4D Agentic AI (Overall interactive system).
  • The core representation is a dynamic 4D Gaussian feature field, separating Static Background (Scene component representation) and Dynamic Foreground (Scene component representation), where features are compactly represented using scaffold-based interpolation and rendered efficiently via a parallel rasterizer.
  • This approach integrates functionalities from diverse 2D models (e.g., SAM2, CLIP-LSeg, InternVideo2) into a single 4D representation, supporting tasks like segmentation, editing, and VQA across novel views and time steps via LLM-powered interaction.

Beyond Believability: Accurate Human Behavior Simulation with Fine-Tuned LLMs

  • Web Action Generation Model: introduces a framework for simulating human web actions by predicting the next action and reasoning based on current webpage context and history of user interactions.
  • The framework utilizes a fine-tuned LLM to generate both a natural language rationale and a browser action, focusing on process-centric accuracy in web behavior simulation.
  • Key components include Context representing webpage content, Rationale explaining action intent, and Action defining browser operations like click, type and submit, or terminate.

TAMA: A Human-AI Collaborative Thematic Analysis Framework Using Multi-Agent LLMs for Clinical Interviews

  • TAMA (Thematic Analysis): introduces a human-AI collaborative framework with Cardiac Expert, Interview Transcripts, Chunks, Generation Agent, Codes, Evaluation Agent, Themes, Refinement Agent, and Feedback for thematic analysis of clinical interviews using multi-agent LLMs.
  • TAMA framework leverages multi-agent LLMs to automate thematic analysis by generating, evaluating, and refining themes through structured conversations and expert feedback, enhancing scalability and coherence.
  • The framework aims to improve thematic analysis quality in healthcare settings by integrating human expertise with multi-agent LLM systems, reducing manual workload and enhancing the consistency of results.

A Theoretical Framework for Prompt Engineering: Approximating Smooth Functions with Transformer Prompts

  • Theoretical Framework for Prompt Engineering: introduces a framework with prompt, transformer, virtual network, input, layer, and output, describing how prompts configure transformers to emulate virtual neural networks.
  • The framework posits that prompts dynamically adjust transformer's internal computations to approximate smooth functions.
  • This approach provides theoretical grounding for prompt engineering techniques and AI agent design by framing LLMs as adaptable agents.

Knowledge-Based Multi-Agent Framework for Automated Software Architecture Design

  • MAAD (Multi-Agent Architecture Design) introduces multi-agent framework for automated software architecture design, involving Analyst, Modeler, Designer, and Evaluator agents collaborating based on input Software Requirements Specifications to produce architecture artifacts.
  • MAAD framework utilizes agents to simulate human roles in architecture design, leveraging knowledge from existing system designs, authoritative literature, and architecture experts to enhance automation.
  • MAAD framework aims to automate and enhance the efficiency, scalability, and consistency of software architecture design process by generating diagrams and reports, ultimately advancing full automation of application-level software development.

Exploring the Effect of Robotic Embodiment and Empathetic Tone of LLMs on Empathy Elicitation

  • Interaction Design System: investigates empathy elicitation using user voice input, speech recording, laptop processing, OpenAI LLM (ChatGPT-40) for response generation, and Pepper robot/chatbot agents.
  • This system compares robotic embodiment and empathetic tone by employing physical robot and chatbot agents, both driven by LLMs, to elicit empathy towards a fictional character.
  • The system utilizes speech-to-text, LLM, and text-to-speech modules for interaction, measuring participant volunteering hours and perceived agent empathy through questionnaires.

sudo rm -rf agentic_security

  • SUDO (SCREEN-BASED UNIVERSAL DETOX2TOX OFFENSE): introduces a novel attack framework, with Detoxifier, Instruction Generator, Toxifier, Dynamic Updater and Evaluation Criteria components, that systematically bypasses refusal-trained safeguards in computer-use agents.
  • SUDO framework employs DETOX2TOX mechanism to transform harmful requests into benign ones and then re-introduce malicious content before execution, iteratively refining attacks based on refusal feedback.
  • The framework demonstrates vulnerabilities in computer-use agents and emphasizes the need for robust, context-aware safeguards by successfully executing attacks in real-world computing environments.

Open Deep Search: Democratizing Search with Open-source Reasoning Agents

  • ODS (Open Deep Search): introduces open-source search framework with Base LLM, Open Reasoning Agent, Open Search Tool and Tools for democratizing search.
  • ODS framework uses Open Reasoning Agent to interpret query and orchestrate actions using Tools including Open Search Tool for web search and processing.
  • ODS framework aims to close gap between proprietary and open-source search solutions by augmenting reasoning capabilities of open-source LLMs.

A Reference Architecture for Autonomous Networks: An Agent-Based Approach

  • AN Agent Reference Architecture (Autonomous Networks Agent Reference Architecture): introduces Situation Awareness (perceives network state), Decision Making (determines actions), Self Awareness (recognizes risks), Choice Making (selects suitable goal), World Knowledge (knowledge repository), Human-Agent Interaction (human collaboration), Agent-Agent Interaction (agent collaboration), Reactive Behavior (responds to stimuli), and Proactive Behavior (addresses potential risks) for autonomous network agents.
  • AN Agent Reference Architecture facilitates autonomous network operation by integrating reactive and proactive behaviors with human and agent interactions, leveraging shared domain-specific knowledge for consistent decision execution.
  • The architecture emphasizes modularity and functional specification, aiming for implementation-independence and completeness to guide development of trustworthy autonomous network agents replacing human operation and maintenance.

25th March 2025

FALCONEye: Finding Answers and Localizing Content in ONE-hour-long videos with multi-modal LLMs

  • FALCONEye: introduces a meta-architecture for video answer search, integrating Pre-processing, VLM (vision-language model), Captions, Summary, Reason, LLM (large language model), Candidate Clips, Evaluation, Answer, Confidence Score, Decision, and Promising Clips to efficiently locate answers in long videos.
  • It employs an iterative exploration algorithm, using Captions and Confidence Scores to refine search and focus resources on relevant video segments.
  • The framework is designed for Video Answer Search (VAS) tasks in long videos, addressing limitations of VLMs in handling long context and pinpointing specific information.

Inducing Personality in LLM-Based Honeypot Agents: Measuring the Effect on Human-Like Agenda Generation

  • SANDMAN: introduces deceptive agent architecture for cyber deception, integrating Agent Profile, Decision Engine, Memory Space (Semantic, Episodic, Working, Retrieval, Learning), LLM Engine, Planning Space (Bootstrap Task, Task List), and Action Space (Channel, Generators).
  • SANDMAN architecture enables creation of plausible human simulacra by inducing personality traits within LLMs to govern agent behavior within digital environments.
  • The framework enhances cyber deception strategies by facilitating agents to produce varied realistic behaviors through persona-driven methodology.

Writing as a testbed for open ended agents

  • Framework for Benchmarking Autonomous Writing Agents: introduces a framework for benchmarking autonomous writing agents with exploration, evaluation, and goal alignment components.
  • This framework evaluates Large Language Models as collaborative co-writers by analyzing action diversity, human alignment, and iterative text improvement capabilities.
  • The framework highlights challenges and potential solutions for building systems capable in diverse open-ended domains through iterative refinement.

Agent-Initiated Interaction in Phone UI Automation

  • Approach: introduces a method for agent-initiated interaction in phone UI automation, with User Instruction, Screen Input, Session History, Interaction Detection, Message Generation, and Baseline Models components.
  • This approach focuses on detecting the necessity for user interaction during task execution and generating appropriate messages for clarification or confirmation.
  • The evaluation utilizes baseline models to assess the effectiveness of different input modalities and model architectures for interaction detection and message generation in UI automation tasks.

MARS: Memory-Enhanced Agents with Reflective Self-improvement

  • MARS (Memory-Enhanced Agents with Reflective Self-improvement): introduces User, Assistant, and Checker agents within an Environment, incorporating STM and LTM memory components, alongside Reflection and Feedback mechanisms for iterative self-improvement.
  • MARS framework enhances agent performance by utilizing iterative feedback from the Checker to refine the Assistant's policy, leveraging Reflection to store historical data in LTM and STM for improved decision-making.
  • The framework aims to address limitations of LLMs in continuous decision-making and long-term memory by integrating these components for effective task completion in dynamic environments.

DIRECT POST-TRAINING PREFERENCE ALIGNMENT FOR MULTI-AGENT MOTION GENERATION MODELS USING IMPLICIT FEEDBACK FROM PRE-TRAINING DEMONSTRATIONS

  • DPA-OMF (Direct Preference Alignment from Occupancy Measure Matching Feedback): introduces alignment approach with Multi-modal Scene Encoder, Motion Token Prediction Model, Preference Ranking via Occupancy Measure Matching Feedback, and Expert demo, aligning pre-trained motion model with human preferences.
  • DPA-OMF leverages implicit preferences from pre-training expert demonstrations to construct preference rankings among model generations using occupancy measure matching for nuanced alignment guidance.
  • DPA-OMF improves realism of traffic simulation behaviors, enabling lightweight models to achieve comparable performance to state-of-the-art imitation models without extra human annotations.

BugCraft: End-to-End Crash Bug Reproduction Using LLM Agents in Minecraft

  • BugCraft: introduces automated crash bug reproduction framework utilizing LLMs, encompassing Bug Report, S2R Synthesizer, and Action Model Agent components.
  • BugCraft framework employs two-stage approach: Step Synthesizer generates structured steps from bug reports, and Action Model Agent executes these steps within Minecraft environment.
  • Framework evaluation utilizes BugCraft-Bench, a curated dataset of Minecraft crash bugs, to assess end-to-end reproduction and step synthesis effectiveness.

OmniNova:A General Multimodal Agent Framework

  • OmniNova: introduces a modular framework, integrating multi-agent system, workflow engine, language model and tool integration, configuration and prompt template systems, for complex automation tasks.
  • OmniNova employs hierarchical multi-agent architecture with coordinator, planner, supervisor, research, code, browser and reporter agents, managed by workflow engine, utilizing multi-layered LLM and unified tool integration.
  • The framework optimizes resource utilization and task completion through dynamic task routing, specialized agents, and multi-layered LLM allocation, enhancing efficiency and result quality.

24th March 2025

A Survey of Large Language Model Agents for Question Answering

  • LLM Agent (Large Language Model Agent): introduces LLM-based Agent QA system, with Action Planning, Memory, Thinking, Action for External Environment, Observation, and Environment components, where the paper surveys the design of LLM agents for question answering tasks.
  • LLM Agent architecture incorporates memory to aggregate information, planning to decide actions, and thinking for reasoning and answer generation, enabling interaction with external environments for enhanced QA.
  • The framework addresses limitations of standalone LLMs by integrating modules for planning and external interaction, improving performance in complex QA tasks requiring external knowledge and reasoning.

LLM-Based Insight Extraction for Contact Center Analytics and Cost-Efficient Deployment

  • Topic Modeling Pipeline: introduces a multi-step process for contact center analytics, utilizing Call Transcripts to perform Call Driver Generation for extracting Call Drivers, which are then processed by Topic Clustering and Topic Labeling to produce Topics.
  • This pipeline leverages a fine-tuned Mistral model for call driver generation and all-MiniLM-L6-v2 with HDBSCAN for topic clustering, aiming for cost-efficient and accurate topic identification from customer interactions.
  • The generated topics and call drivers facilitate downstream tasks like trend detection and FAQ generation, ultimately improving contact center efficiency and customer service.

Verbal Process Supervision Elicits Better Coding Agents

  • CURA (Code Understanding and Reasoning Agent): introduces process-supervised reasoning framework for code generation with code understanding, test case generation, solution reasoning, code testing sandbox, and process reward models.
  • CURA utilizes verbal process supervision to iteratively guide reasoning steps and refine model behavior through reward signals at each stage.
  • The framework enhances code generation performance by integrating iterative feedback and verbal process supervision throughout the reasoning pipeline.

Safeguarding Mobile GUI Agent via Logic-based Action Verification

  • VSA (VeriSafe Agent): introduces a verification framework for Mobile GUI Agents, incorporating Intent Encoder, Logical Formula, Intent Verifier, Feedback Generator, and VSA Library, designed to ensure agent actions are consistent with user instructions.
  • VeriSafe Agent framework utilizes autoformalization to convert natural language instructions into a domain-specific language, enabling rule-based runtime verification of mobile agent actions.
  • VSA framework aims to bridge probabilistic LFM-driven automation with deterministic formal verification by providing pre-action verification and structured feedback to guide GUI agents towards correct task completion.

DeepFund: Will LLM be Professional at Fund Investment? A Live Arena Perspective

  • DeepFund: introduces a comprehensive arena platform, with Stock Pool, Web API, Trading Memory, Current Position, Agent Planner, Technical Analysts, Fundamental Analysts, Insider Analysts, Media Analysts, Agent Manager, Decision, Decision Log, Trading Simulation Environment, Model Integration Interface and Performance Monitoring, for evaluating LLM-based trading strategies in simulated live environment.
  • DeepFund platform employs multi-agent framework where Agent Planner orchestrates analysis from specialized Technical, Fundamental, Insider, and Media Analysts, and Agent Manager synthesizes insights for final investment Decision.
  • Trading Simulation Environment in DeepFund mitigates data leakage by providing real-time market data through Web API and evaluating models on data post-training cutoff, while Performance Monitoring visualizes model performance.

How to Capture and Study Conversations Between Research Participants and ChatGPT: GPT for Researchers (g4r.org)

  • G4R (GPT for Researchers): introduces a website platform with researcher interface, GPT interface creation, GPT interface customization, GPT interaction, data capture, data download, and data merging for studying participant-GPT conversations.
  • G4R enables researchers to create customizable GPT interfaces, integrate them into studies like Qualtrics surveys, capture conversation data, and download/merge data for analysis.
  • This tool addresses the lack of standardized methods for human-AI interaction research by providing an accessible platform to facilitate and analyze participant conversations with GPT models.

P3Nav: A Unified Framework for Embodied Navigation Integrating Perception, Planning, and Prediction

  • P3Nav (A Unified Framework for Embodied Navigation Integrating Perception, Planning, and Prediction): introduces a unified framework for embodied navigation integrating perception, planning, and prediction with Visual Encoder, Adaptive 3D-aware History Sampling, Large Language Model, Action Head, Answer Head, Tokenizer, and Multitask Collaboration strategy.
  • P3Nav framework employs Multitask Collaboration strategy for joint training on navigation and embodied question answering tasks, enhancing navigation performance by leveraging perceptual and planning skills.
  • Adaptive 3D-aware History Sampling strategy in P3Nav effectively utilizes historical observations by selecting non-overlapping RGB frames and position-enhanced features to reduce redundancy and improve efficiency.

AgentDropout: Dynamic Agent Elimination for Token-Efficient and High-Performance LLM-Based Multi-Agent Collaboration

  • AgentDropout: introduces dynamic agent elimination, optimizing communication by removing redundant agents and links in multi-agent systems using Node Dropout, Edge Dropout, Communication Graph, Adjacency Matrix, Intra-Round Communication, Inter-Round Communication, and DAGSample.
  • AgentDropout employs Node Dropout to remove less contributing agents and Edge Dropout to prune redundant communication edges within Communication Graph represented by Adjacency Matrix.
  • The framework enhances token efficiency and task performance by dynamically adjusting communication topology through Intra-Round and Inter-Round Communication, finalized by DAGSample for acyclic graph generation.

EconEvals: Benchmarks and Litmus Tests for LLM Agents in Unknown Environments

  • EconEvals: introduces benchmarks and litmus tests for evaluating LLM agents in unknown economic environments, featuring an LLM Agent (Core decision-maker using LLM) that interacts via Tool Use (Interaction via API calls) and a Notes Module (Persistent text memory) within various Economic Environments (Simulated scenarios with unknowns) to perform Benchmark Tasks (Capability measurement tasks) or Litmus Tests (Tendency measurement tasks), assessed by a Benchmark Score (Capability metric) or Litmus Score (Tendency metric).
  • The framework assesses LLM agents on economic decision-making (procurement, scheduling, pricing) through multi-turn interactions where agents must learn environment specifications via exploration using tools; benchmarks measure capability, while litmus tests quantify behavioral tendencies in tradeoffs like efficiency vs. equality.
  • Agents operate over multiple periods within stationary or non-stationary environments, using tools to gather information (e.g., CODE_BLOCK_0, CODE_BLOCK_1), manage memory (CODE_BLOCK_2, CODE_BLOCK_3), and submit actions (CODE_BLOCK_4, CODE_BLOCK_5, CODE_BLOCK_6), receiving feedback to inform future decisions.

Defeating Prompt Injections by Design

  • CaMeL (CApabilities for MachinE Learning): introduces a defense against prompt injection by separating control and data flow using a Privileged LLM (Generates code from user query), a Quarantined LLM (Parses untrusted data), a CaMeL Interpreter (Executes code, enforces policies), Tools (External functions/APIs), Security Policies (Define allowed tool operations), Capabilities (Data provenance/permission tags), and a Data Flow Graph (Tracks value dependencies).
  • The Privileged LLM generates Python code representing the user's intent from trusted queries, while the separate Quarantined LLM processes potentially untrusted data under the interpreter's strict control, preventing direct influence on tool execution flow.
  • The CaMeL interpreter executes the generated code, maintains a data flow graph with capabilities tracking data provenance and permissions, and enforces security policies before tool execution to prevent data exfiltration or unauthorized actions.

AgentSpec: Customizable Runtime Enforcement for Safe and Reliable LLM Agents

  • AGENTSPEC: introduces a domain-specific language and runtime framework for enforcing customizable safety constraints on LLM Agents (LLM planner/executor), intercepting planned actions based on Rules (constraint definitions) activated by a Trigger (rule activation event) corresponding to a monitored Event (monitored agent/env change), evaluating conditions via Check (condition evaluation) using Predicates (boolean condition function), and applying Enforce (intervention mechanism) actions like user_inspection (request user confirmation), llm_self_examine (trigger agent self-reflection), invoke_action (execute predefined action), or stop (terminate agent action) before interaction with Tools (external functions) or receiving Observation (feedback from environment/tools), ensuring alignment with safety policies defined by the User (initiates interaction) and recorded in the Trajectory (record of agent states/actions).
  • The framework integrates with agent execution loops by hooking into decision points, monitoring Events such as state changes, specific actions (e.g., 'Transfer', 'PythonREPL', 'pour'), or task completion to apply user-defined Rules at runtime.
  • This approach provides a modular and interpretable method for runtime safety enforcement in LLM agents operating across domains including code execution, embodied interaction, and autonomous driving, with demonstrated low overhead.

23rd March 2025

AgentRxiv: Towards Collaborative Autonomous Research

  • AgentRxiv: introduces a framework for collaborative autonomous research using LLM agents, comprising an AgentRxiv Server (Centralized preprint server for agent research) enabling multiple Agent Laboratory (Autonomous multi-agent research system) instances to share findings, guided by a Human Researcher (Provides initial guidance), where each lab performs Literature Review Phase (Retrieves and summarizes prior work), Experimentation Phase (Plans and executes experiments) with mle-solver (Module for ML code generation and repair), and Report Writing Phase (Synthesizes findings into reports) via paper-solver (Module for LaTeX report generation), coordinated by agents like PhD Student Agent (Agent role in multiple phases) and ML Engineer Agent (Agent role in data preparation code).
  • The framework facilitates iterative improvement by allowing agent laboratories to upload reports to the AgentRxiv Server and retrieve prior work from peers, enabling cumulative knowledge building across independent agent systems.
  • Each Agent Laboratory automates research stages using specialized agents (e.g., Postdoc, Professor) and tools (mle-solver, paper-solver), supporting both fully autonomous operation and a co-pilot mode with human checkpoints.

Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation

  • RAM (Rewriting-driven AugMentation): introduces a VLN data augmentation paradigm using Object-Enriched Observation Rewriting (generates diverse observations) involving a VLM (extracts scene descriptions), LLM (rewrites scene descriptions), T2IM (synthesizes panoramic observations), and Panorama-to-View (discretizes panoramas), plus Observation-Contrast Instruction Rewriting (creates aligned instructions) involving a VLM (extracts landmarks/descriptions) and LLM (rewrites instructions via contrast), trained with a Mixing-then-Focusing Training Mechanism (optimizes learning) including a Random Observation Cropping Scheme (augments data), where foundation models rewrite annotated data into unseen observation-instruction pairs without simulators or web-scraping.
  • The framework first performs Object-Enriched Observation Rewriting by using a VLM to get scene descriptions, an LLM to enrich these descriptions with new objects, a T2IM to generate corresponding panoramas, and a Panorama-to-View algorithm for single views.
  • Subsequently, Observation-Contrast Instruction Rewriting employs an LLM to generate new instructions by contrasting original landmarks/observations (via VLM) with rewritten observation descriptions (via VLM), enhancing data diversity for training the Embodied Agent using a two-stage strategy.

Metaphor-based Jailbreaking Attacks on Text-to-Image Models

  • MJA (Metaphor-based Jailbreaking Attack): introduces a framework with Metaphor Agent, Context Agent, Prompt Agent, Example Retrieval Tool, Shared Memory, Observed Set, Candidate Set, Surrogate Model, Text Encoder, PCA, Gaussian Process Regression, Acquisition Strategy, and Query T2I model, where MJA aims to jailbreak text-to-image models using metaphor-based prompts.
  • MJA framework employs multi-agent generation module to create diverse prompts and optimization module to efficiently select effective adversarial prompts.
  • The framework balances attack effectiveness and query efficiency by leveraging metaphor and context in prompt generation and surrogate model-based optimization.

WON: Establishing Best Practices for Korean Financial NLP

  • WON: introduces WON (Korean financial LLM), a transparent language model, evaluated using Benchmark (evaluation dataset) on Leaderboard (evaluation platform), utilizing Instruction Dataset (refined training data) derived from competition submissions.
  • WON framework employs SFT (supervised fine-tuning) and DPO (direct preference optimization) training methods, with LLM-as-a-Judge (evaluation using LLM) for assessment and Deepseek-R1 (response generation model) for data processing.
  • The framework aims to establish best practices for Korean financial NLP by releasing resources and insights gained from a large-scale evaluation and model development process.

An Empirical Study of the Role of Incompleteness and Ambiguity in Interactions with Large Language Models

  • Framework name here: introduces a neural symbolic framework to model human and LLM agent interactions, focusing on Message-String, Turn, and Interaction, to define Incomplete Question and Ambiguous Question based on Oracle Agent responses within a Context of prior messages.
  • This framework analyzes question-answer sequences to empirically study the role of question Incomplete Question and Ambiguous Question properties in multi-turn interactions using Human Agent and LLM Agent.
  • The framework utilizes the Oracle Agent as a ground truth to categorize questions and assess the impact of Context on resolving Incomplete Question and Ambiguous Question during interactions.

GeoBenchX: Benchmarking LLMs for Multistep Geospatial Tasks

  • GeoBenchX Framework: introduces benchmark for evaluating LLMs on geospatial tasks, with Task-solving agent, LLMs, Tools, Datasets, LLM-as-Judge evaluator agent, Reference solutions, and Benchmark set.
  • GeoBenchX uses Task-solving agent equipped with Geospatial functions and commercial LLMs to solve Benchmark set of multi-step geospatial tasks using provided Datasets.
  • LLM-as-Judge evaluator agent assesses Task-solving agent's performance by comparing generated solutions against Reference solutions within the GeoBenchX framework.

22nd March 2025

Metacognition in Content-Centric Computational Cognitive C4 Modeling

  • C4 Modeling (Content-Centric Computational Cognitive Modeling): introduces a framework for building metacognitive AI agents, with Knowledge Resources, Perception, Reasoning, Action, Explanation Module, Lifelong Learning, and LLM components for Language Generation and Learning Enhancement.
  • C4 modeling emphasizes content-centric approach using semantically interpretable knowledge to enable agents with transparency, adaptability, reasoning, perception and action capabilities for human-AI teams.
  • The framework integrates LLMs to improve language generation and learning efficiency, while maintaining focus on knowledge-based reasoning for trustworthy and explainable AI agents.

Building Resource-Constrained Language Agents: A Korean Case Study on Chemical Toxicity Information

  • Tox-chat: introduces a Korean chemical toxicity information agent, utilizing LLM / SLM Agent, BM25 Search, Summary LLM, Keyword Search, Read General, QA Specific, QA LLM, Carcinogen Database, Toxic Dose Database, and Toxic Info Database for resource-constrained environments.
  • Tox-chat employs hierarchical section search and scenario-based dialogue generation to reduce token consumption and distill tool-using capabilities from larger models.
  • The framework demonstrates effective performance with a fine-tuned 8B parameter model, outperforming untuned models and baselines in database faithfulness and preference.

A Survey on Mathematical Reasoning and Optimization with Large Language Models

  • Framework name here: introduces Instruction Learning, Tool-based Methods, Chain-of-Thought (CoT) Methods, and Advanced Chain-of-Thought (CoT) Methods for mathematical reasoning with Large Language Models.
  • Instruction Learning refines models through structured tasks, while Tool-based Methods integrate external solvers, and Chain-of-Thought (CoT) and Advanced CoT Methods enhance reasoning via step-by-step logic and self-verification.
  • These methods collectively aim to improve mathematical problem-solving capabilities of Large Language Models, addressing challenges in arithmetic, theorem proving and optimization tasks.

CP-AgentNet: Autonomous and Explainable Communication Protocol Design Using Generative Agents

  • CP-AgentNet (Communication Protocol Agent Network): introduces a framework employing offline- and online-modules with strategy-, observer-, node- and programming-agents, LLM ranker, strategy-, episodic- and trajectory-memory, self-reflection and evaluation for autonomous communication protocol design.
  • CP-AgentNet framework facilitates explainable protocol design by leveraging multi-agent role-play and progressive strategy augmentation to address limitations of deep reinforcement learning and handcrafted protocols.
  • CP-AgentNet utilizes self-reflection and LLM ranker to enhance strategy refinement and decision consistency, enabling efficient adaptation to dynamic network environments without extensive online learning.

RAIDER: Tool-Equipped Large Language Model Agent for Robotic Action Issue Detection, Explanation and Recovery

  • RAIDER (Tool-Equipped Large Language Model Agent for Robotic Action Issue Detection, Explanation and Recovery): introduces a novel agent architecture integrating System Prompt, LLM, Program Flow Manager, Tools, and Recovery for robotic action issue detection, explanation, and recovery.
  • RAIDER framework utilizes "Ground, Ask&Answer, Issue" procedure, incorporating Ground, Ask, Answer, and Issue components within Program Flow Manager to dynamically generate and resolve context-aware precondition questions using Tool calls and Tool responses/warnings.
  • This architecture achieves adaptable and efficient issue detection by leveraging LLM's reasoning with grounded Tools, enabling targeted information gathering and surpassing limitations of predefined models or full scene descriptions.

Can LLMs Automate Fact-Checking Article Writing?

  • QRAFT (QRAFT): introduces a multi-agent framework for automatic fact-checking article generation, incorporating Planner (outline planning assistant), Writer (draft article composer), and Editor (draft review and refine) agents.
  • QRAFT framework processes Evidence Set (input evidence documents) to generate Evidence Nuggets Set (extracted evidence points), utilizes Preferences (article structure guidelines) for Draft Outline (planned article structure), producing First Draft (initial article draft) and Improved Draft (refined article draft) through Question-Answering Interactions (conversational refinement process).
  • QRAFT framework aims to mimic human fact-checkers' writing workflow, addressing the gap in existing automatic fact-checking pipelines by generating full fact-checking articles suitable for public dissemination.

ComfyGPT: A Self-Optimizing Multi-Agent System for Comprehensive ComfyUI Workflow Generation

  • ComfyGPT (Comprehensive ComfyUI Workflow Generation with Generative Pre-trained Transformer): introduces a self-optimizing multi-agent system for ComfyUI workflow generation, comprising ReformatAgent, FlowAgent, RefineAgent, and ExecuteAgent.
  • ComfyGPT leverages FlowDataset for training and FlowBench for evaluation, utilizing GRPO optimization and RAG to enhance workflow generation and refinement.
  • ComfyGPT focuses on individual node links for improved precision and introduces FlowBench as a comprehensive benchmark for workflow generation assessment.

OmniScience: A Domain-Specialized LLM for Scientific Reasoning and Discovery

  • OmniScience Framework: introduces a domain-specialized LLM for scientific reasoning, utilizing science literature corpus for domain adaptive pretraining, task and chat instructions for model alignment, and s1K reasoning dataset for reasoning distillation to create OmniScience Reasoning model from Foundation Model via OmniScience Base and OmniScience Chat.
  • The framework employs a three-stage training pipeline: domain adaptive pretraining to instill scientific knowledge, supervised fine-tuning for instruction following, and reasoning-based knowledge distillation to enhance inferential capabilities.
  • OmniScience Framework demonstrates a compute-efficient strategy for developing high-performing domain-specific models by combining pretraining, alignment and distillation techniques, achieving state-of-the-art results in scientific reasoning tasks.

Autonomous Radiotherapy Treatment Planning Using DOLA: A Privacy-Preserving, LLM-Based Optimization Agent

  • DOLA (Dose Optimization Language Agent): introduces privacy-preserving LLM agent for autonomous radiotherapy planning, comprising Model Service, Optimization Agent, Working Memory, TPS Interface, and LLaMa3.1 LLM.
  • DOLA framework integrates RAG and RL with chain-of-thought prompting within local infrastructure to optimize radiotherapy plans while maintaining patient privacy.
  • The system architecture enables iterative dose optimization using LLM for decision-making and reasoning within a secure, locally hosted environment, enhancing plan quality and efficiency.

21st March 2025

CVE-Bench: A Benchmark for AI Agents' Ability to Exploit Real-World Web Application Vulnerabilities

  • CVE-Bench: introduces CVE-Bench with LLM Agents, Target Containers, Evaluation, and Results, which is a cybersecurity benchmark for evaluating AI agents exploiting web vulnerabilities.
  • CVE-Bench: framework offers a sandbox environment featuring isolated containers hosting web applications and an automated evaluation system to assess attack success.
  • CVE-Bench: benchmark addresses limitations of existing cybersecurity benchmarks by providing comprehensive real-world vulnerability coverage and diverse attack types.

LLM+MAP: Bimanual Robot Task Planning using Large Language Models and Planning Domain Definition Language

  • LLM+MAP (LLM + Multi-Agent Planning with PDDL): introduces a bimanual robot task planning framework with Visual Detection, Scene Spatial Description, Bimanual Domain Knowledge, LLM, PDDL Problem + Domain, Symbolic Planning, Partial-order Plan, Action Parser and Execution.
  • LLM+MAP framework utilizes LLM to convert natural language task descriptions and scene information into PDDL, enabling symbolic planners to generate partial-order plans for efficient bimanual robot control.
  • The framework integrates LLM reasoning with multi-agent planning for effective spatial and temporal coordination in complex, long-horizon bimanual manipulation tasks, achieving logical correctness and higher efficiency.

When Words Outperform Vision: VLMs Can Self-Improve Via Text-Only Training For Human-Centered Decision Making

  • Text-Only Training for VLM Enhancement: introduces text-only training approach, with Situation, Question, Answer, Text-Only Input, Multimodal Input, VLM, Answer Prediction, Text-Only Training for VLM Enhancement, and Transfer to Multimodal Inference components, where text-only training enhances visual language model decision-making for human-centered tasks.
  • This framework improves visual language models by text-only training using synthesized textual data, enabling enhanced multimodal inference capabilities without relying on image-text paired data.
  • Text-only training provides efficient and scalable method to enhance visual language models' reasoning and decision-making for complex human-centered scenarios.

ETVA: Evaluation of Text-to-Video Alignment via Fine-grained Question Generation and Answering

  • ETVA (Evaluation of Text-to-Video Alignment): introduces a framework with Element Extractor, Graph Builder, Graph Traverser, Question Generation, Knowledge Augmentation, Multi-Stage Reasoning, Question Answering, External Knowledge, Multimodal CoT, Video Reflection, General Reflection, Conclusion Stage, ETVA Score, Generated Video, Generated Questions, Scene Graph, and Core Elements for evaluating text-to-video alignment through fine-grained question generation and answering.
  • ETVA framework employs a multi-agent system for atomic question generation from text prompts and a knowledge-augmented multi-stage reasoning process for question answering using video LLMs.
  • ETVA demonstrates improved correlation with human judgment compared to existing metrics by systematically evaluating video-text relationships through structured question generation and knowledge integration.

WHEN DEBATE FAILS: BIAS REINFORCEMENT IN LARGE LANGUAGE MODELS

  • DReaMAD (Diverse Reasoning via Multi-Agent Debate with Refined Prompt): introduces Strategic Prior Knowledge Elicitation, Perspective Diversification, and Multi-Agent Debate to improve LLM reasoning.
  • DReaMAD refines prior knowledge and ensures diverse perspectives by using Game Situation Reinterpretation, General Strategy Formulation, and structured debate.
  • DReaMAD enhances LLMs' strategic reasoning by structuring knowledge retrieval and diversifying input perspectives to mitigate bias and improve decision-making.

A-IDE : AGENT-INTEGRATED DENOISING EXPERTS

  • A-IDE (Agent-Integrated Denoising Experts): introduces a denoising framework integrating BiomedCLIP for semantic analysis, semantic similarities for probability distribution, an LLM Agent for decision-making, specialized RED-CNN models (Model 0, Model 1, Model 2) for denoising, and RMSE, PSNR, SSIM for evaluation.
  • A-IDE framework utilizes BiomedCLIP to analyze CT images and employs an LLM agent to dynamically select among specialized RED-CNN models based on anatomical context for improved denoising performance.
  • The agent-driven approach of A-IDE eliminates manual intervention and enhances denoising performance across diverse anatomical regions by leveraging specialized models.

Bayesian Teaching Enables Probabilistic Reasoning in Large Language Models

  • Bayesian Teaching: introduces User, LLM (Large Language Model), Bayesian Assistant, Supervised Fine-tuning, Flight Recommendation, Hotel Recommendation, Web Shopping, User Preferences, and Beliefs to teach LLMs probabilistic reasoning for user interaction tasks.
  • Bayesian Teaching framework employs Supervised Fine-tuning to train LLMs by mimicking Bayesian Assistant for inferring User Preferences and updating Beliefs in Flight Recommendation and generalizing to Hotel Recommendation and Web Shopping.
  • The framework enhances LLMs' probabilistic reasoning in interactive settings, enabling generalization to novel tasks beyond the training domain.

20th March 2025

Towards Agentic Recommender Systems in the Era of Multimodal Large Language Models

  • LLM-ARS (LLM-based Agentic RS): introduces a framework with LLM-Agent, Initialization, Planning, Execution, Reflection, Query, Ranker, Tool Using, and Memory Module components for agentic recommendation systems.
  • This framework utilizes an LLM-Agent as the central decision-making unit, incorporating modules for planning, execution, reflection, and memory to enhance recommendation adaptability and personalization.
  • The architecture emphasizes autonomous decision-making and continuous self-evolution by integrating external tools and reflecting on past interactions to optimize future recommendations.

Survey on Evaluation of LLM-based Agents

  • Agent Evaluation: introduces a survey framework for evaluating LLM-based agents, with Agent Capabilities Evaluation-component, Planning and Multi-Step Reasoning-component, Function Calling & Tool Use-component, Self-Reflection-component, Memory-component, Application-Specific Agent Evaluation-component, Web Agents-component, Software Engineering Agents-component, Scientific Agents-component, Conversational Agents-component, Generalist Agents Evaluation-component, Frameworks for Agent Evaluation-component, Development Frameworks-component, Gym-like Environments-component, Discussion-component, Current Trends-component, and Emergent Directions-component.
  • Agent Evaluation framework categorizes evaluation methodologies based on agent capabilities, application domains, general skills, and development frameworks, providing a structured overview of the field.
  • The framework highlights the shift towards realistic evaluations, identifies gaps in current methods like cost-efficiency and safety, and proposes future directions for agent evaluation research.

Issue2Test: Generating Reproducing Test Cases from Issue Reports

  • ISSUE2TEST (Issue Reproducing Test): introduces automated technique for generating issue-reproducing test cases utilizing root cause analysis, meta prompting, related files search, test generator, linter, test refiner - error fixing, run tests, error categorization, assertion match, and rank components.
  • ISSUE2TEST iteratively refines test cases through runtime feedback and error categorization to ensure generated tests accurately capture and reproduce the reported issue.
  • This approach enhances automated debugging and program repair workflows by providing executable test cases directly derived from issue descriptions, improving software reliability.

GREENIQ: A DEEP SEARCH PLATFORM FOR COMPREHENSIVE CARBON MARKET ANALYSIS AND AUTOMATED REPORT GENERATION

  • GreenIQ: introduces deep search platform with Main Researcher, Report Writing, Final Reviewer, Data Visualization, and Translator Agents for carbon market analysis and automated report generation.
  • GreenIQ leverages multi-agent architecture powered by Large Language Models to automate end-to-end workflow from data collection to multilingual reporting for carbon market intelligence.
  • GreenIQ enhances efficiency, accuracy, and scalability in carbon market research by integrating specialized agents for comprehensive analysis and validated reporting.

AutoRedTeamer: Autonomous Red Teaming with Lifelong Attack Integration

  • AutoRedTeamer: introduces a framework for automated red teaming, with Risk Analyzer (decomposes user inputs), Seed Prompt Generator (creates diverse test cases), Strategy Designer (selects attack combinations), Attack Memory (tracks attack performance), Attack Library (stores attack methods), Attack Judge (evaluates output harmfulness), Relevance Check (ensures test case relevance), Red-Teaming Agent (orchestrates evaluation), Target Model (LLM under evaluation), Validation Set (validates attack effectiveness), Attack Evaluation (assesses attack results), Initial Attack Library (starting attack methods), Attack Proposer (discovers new attacks), Attack Proposals (suggested new attacks), Attack Designer (implements attack proposals), Attack Implementation (concrete attack code), and Attack Strategy Proposer Agent (discovers and implements attacks).
  • AutoRedTeamer framework uses a dual-agent system comprising a red teaming agent for evaluation and a strategy proposer agent for discovering and integrating new attack methods.
  • The framework incorporates a memory-guided attack selection to learn from past attack attempts and refine strategies for improved red teaming effectiveness and adaptability to new vulnerabilities.

Automatic Generation of Safety-compliant Linear Temporal Logic via Large Language Model: A Self-supervised Framework

  • AutoSafeLTL (Automatic Safety-compliant Linear Temporal Logic): introduces a self-supervised framework for generating safety-compliant Linear Temporal Logic (LTL) specifications, incorporating Environmental Information & Desired Task, LTL Extraction, Safety Restrictions, Base Rules, Automated Verification, and producing Safety-compliant LTL.
  • The framework employs Automated Verification with Syntactic Check (Agent LLM1, AP Matching, Operator Matching) and Semantic Check (User LLM, Counterexample Analysis & Guidance, Agent LLM2) to ensure generated LTL adheres to predefined safety rules.
  • AutoSafeLTL framework leverages two Agent LLMs and User LLM within a pipeline to refine LTL generation through feedback and counterexample analysis, achieving safety compliance for cyber-physical systems.

DeepPsy-Agent: A Stage-Aware and Deep-Thinking Emotional Support Agent System

  • DeepPsy-Agent: introduces a psychological support system with deeppsy-chat dialogue model for response generation, stage awareness mechanism for context perception, deep thinking for multi-source reasoning, real-time stage transition detection model for signal capture, and state information update module for dynamic state tracking.
  • DeepPsy-Agent combines psychological theory and deep learning to achieve dynamic stage awareness and enhanced reasoning capabilities in emotional support conversations.
  • The system integrates stage-awareness and deep-thinking to improve dialogue management and reasoning, addressing limitations of traditional emotional support systems.

Advancing Mobile GUI Agents: A Verifier-Driven Approach to Practical Deployment

  • V-DROID (Verifier-Driven Robot for Interface Operations on Devices): introduces a verifier-driven mobile agent framework with Action Extractor, Verification Prompts, Verifier, Action Completion, and Working Memory components.
  • V-DROID decouples action decision-making into action extraction and verification, utilizing Discretized Action Space Construction and Prefilling-only Workflow for efficiency.
  • Pair-wise Progress Preference Training enhances Verifier's decision-making, and Scalable Human-Agent Joint Annotation Scheme facilitates data collection for V-DROID.

The Lighthouse of Language: Enhancing LLM Agents via Critique-Guided Improvement

  • CGI (Critique-Guided Improvement): introduces a two-player framework enhancing LLM agents, featuring an Actor Model(Generates actions, refines based on critique), a Critic Model(Generates structured critiques), an Action Buffer(Stores candidate actions), Critique(Structured feedback with assessment, revision), and a Refined Action(Final action post-critique), operating within an Environment(Interactive task setting) informed by History(Past interactions sequence) and refined using Training Data(Datasets for model fine-tuning), where the critic provides detailed natural language feedback to guide the actor's iterative improvement.
  • The framework comprises two stages: Critique Generation, where the Critic Model is trained to produce structured assessments and revisions based on expert examples, and Action Refinement, where the Actor Model is iteratively fine-tuned to utilize these critiques effectively alongside successful trajectory data.
  • This approach uses a dedicated critic for explicit, structured verbal feedback (assessing contribution, feasibility, efficiency, and suggesting revisions) and trains the actor to integrate this guidance, enhancing decision-making and exploration compared to methods relying solely on numerical rewards or self-correction.

Depth Matters: Multimodal RGB-D Perception for Robust Autonomous Agents

  • RGB-D Fusion Architectures: introduces model architectures for autonomous driving, integrating RGB and depth data through Feature Extractor, Model Architecture, RNN Options and Offset Calculation Options components.
  • RGB-D Fusion Architectures: systematically compares early and late fusion strategies alongside depth-aware deformable convolution and geometric offset computation within Model Architecture for enhanced feature extraction.
  • RGB-D Fusion Architectures: evaluates recurrent neural networks like LSTM, LTC, CfC, and LRC within RNN Options to benchmark lightweight controllers for real-time, robust autonomous agent steering command prediction.

19th March 2025

Envisioning an AI-Enhanced Mental Health Ecosystem

  • AI-Enhanced Mental Health Ecosystem: introduces a multifaceted AI vision for mental health support, integrating AI-Simulated Client, Peer/Counsellor/Therapist, -Generated Suggestions, Decision-Support/Evaluation, Self-Help/Companionship, Proactive Monitoring, and Embedded/Ubiquitous Companion.
  • This ecosystem aims to enhance mental health care through proactive, responsive, adaptive AI paradigms complementing human interventions for growing global mental health crisis.
  • The framework emphasizes ethical AI deployment, user-centered design, and continuous evaluation ensuring supportive collaboration in mental health with human connection and cultural sensitivity.

ChatStitch: Visualizing Through Structures via Surround-View Unsupervised Deep Image Stitching with Collaborative LLM-Agents

  • ChatStitch: introduces a collaborative perception system, with Task Management Agent, Background Stitching Agent, Pose Measurement Agent, Perspective Measurement Agent, 3D Asset Management, 3D Asset View Change, Foreground Rendering, SV-UDIS, Language Input, Multi-views Input, Composed Images Output, Data, and Work flow, for visualizing obscured information via natural language.
  • ChatStitch employs multi-agent framework utilizing Large Language Models to process natural language commands and perform surround-view unsupervised deep image stitching.
  • The system achieves intuitive human perception by integrating language commands with external digital assets and generating photorealistic collaborative perception outcomes.

SPADE: Systematic Prompt Framework for Automated Dialogue Expansion in Machine-Generated Text Detection

  • SPADE (Systematic Prompt Framework for Automated Dialogue Expansion): introduces five data augmentation frameworks using structured prompting for synthetic dialogue generation, including Partial-Chatbot Data Augmentation (Generates partially synthetic dialogues) with Missing Sentence Completion (Fills system utterances) and Next Response Generation (Generates user utterances), and Full-Chatbot Data Augmentation (Generates fully synthetic dialogues) with Goal-to-Dialogue (Generates dialogue from goal), Paraphrase Dialogue (Rewrites utterances iteratively), and End-to-End Conversation (Simulates user-system interaction), utilizing components like LLM (Generates text), Goal (User task objective), Dialogue Input (Source conversation data), and Instructions (LLM task guidance) to address data scarcity for Machine-Generated Text detection.
  • These training-free frameworks generate 14 dialogue datasets by manipulating human dialogues or simulating conversations with LLMs based on user goals and specific prompts.
  • The study benchmarks these datasets against detection models, showing improved generalization with mixed datasets and analyzing detection accuracy based on chat history length in simulated online settings.

VIPER: Visual Perception and Explainable Reasoning for Sequential Decision-Making

  • VIPER (Visual Perception and Explainable Reasoning): introduces multimodal instruction-based planning framework integrating VLM-based perception with LLM-based reasoning, including Perception- and Reasoning-Modules.
  • VIPER uses modular pipeline where Perception Module with frozen VLM generates textual descriptions of image observations processed by Reasoning Module with LLM policy to predict actions.
  • VIPER enhances explainability by leveraging text as intermediate representation enabling fine-grained analysis of Perception- and Reasoning-Modules for decision-making mechanisms.

LogiAgent: Automated Logical Testing for REST Systems with LLM-Based Multi-Agents

  • LOGIAGENT: introduces an LLM-driven multi-agent framework for logical testing of REST systems, with a Test Scenario Generator (Creates test scenarios), API Request Executor (Constructs and executes API requests), API Response Validator (Validates API responses using oracles), Scenario Scheduler (Manages scenario execution flow), Execution Memory (Stores historical execution data), API Relationship Graph (Models API relationships), OpenAPI Specification (Input API documentation), and Tested System (Target system under test), where agents collaboratively generate, execute, and validate API test scenarios focusing on business logic.
  • The framework utilizes logical oracles derived from API documentation, scenario context, and LLM knowledge to assess responses beyond status codes, identifying logical inconsistencies.
  • Execution Memory stores successful parameter values and failure reflections, enhancing contextual consistency and guiding future test generation by the API Request Executor and Test Scenario Generator.

Aligning Crowd-sourced Human Feedback for Reinforcement Learning on Code Generation by Large Language Models

  • cRLHF (crowd-sourced Reinforcement Learning with Human Feedback): introduces a novel framework for aligning crowd-sourced human feedback using Prompt (input description), Language Model (generates code), Output (code output), Multiple Annotators (feedback from many users), Annotated Output (code with human labels), cRLHF (proposed framework), and Aligned Output (improved code output).
  • cRLHF (crowd-sourced Reinforcement Learning with Human Feedback) framework, depicted in figures, contrasts traditional RLHF by incorporating Multiple Annotators (feedback from many users) to produce Annotated Output (code with human labels) and Aligned Output (improved code output), replacing single Annotator (evaluates code) and Ranked Output (ordered code by rank) feeding into Reward Model (learns reward function) of traditional RLHF.
  • The framework, utilizing Problem Description (task specification), Initial LLM (starting LLM), Tuned LLM (fine-tuned LLM), Generated Outputs (sampled code solutions), RL Update (policy adjustment), and Correction Rate (performance score), aims to improve code generation quality by leveraging diverse human feedback and Bayesian optimization without explicit reward model training.

Exploring Large Language Models for Word Games: Who is the Spy?

  • CoT-based scheduling framework (Chain-of-Thought based scheduling framework): introduces Judger, Player, and COT or TOT components to enable LLMs in word games through rule description, role and keyword assignments, compliance checks, keyword descriptions, reasoning, voting and player elimination.
  • The framework utilizes Judger to manage game flow and Player agents employing COT or TOT for strategic actions like describing keywords, reasoning about roles, and voting to identify the spy.
  • This approach aims to enhance LLM performance in social deduction games by structuring the interaction and decision-making process through distinct components and a chain-of-thought reasoning mechanism.

MAMM-REFINE: A Recipe for Improving Faithfulness in Generation with Multi-Agent Collaboration

  • MAMM-REFINE (Multi-Agent Multi-Model Refinement): introduces DETECT, CRITIQUE, and REFINE subtasks within a multi-agent debate framework to improve generation faithfulness.
  • MAMM-REFINE framework employs RERANK and GENERATE approaches for CRITIQUE and REFINE subtasks, enhancing performance through multi-agent and multi-model collaboration.
  • The framework demonstrates effectiveness in summarization and question answering by leveraging diverse LLMs and iterative refinement to reduce factual inconsistencies.

SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks

  • SWEET-RL (RL with Step-WisE Evaluation from Training-time information): introduces a reinforcement learning framework that uses Training-Time History Information and Bradley-Terry objective to train a Critic, which provides rewards for Policy from RLHF optimization based on Chosen/Rejected trajectories and actions.
  • SWEET-RL leverages asymmetric information access for Critic and Actor, where Critic uses additional training-time information inaccessible to Actor, to improve credit assignment in multi-turn collaborative tasks.
  • This approach addresses limitations of standard RLHF methods in multi-turn settings by providing step-level rewards and improving generalization in complex reasoning tasks.

Safety Aware Task Planning via Large Language Models in Robotics

  • SAFER (Safety-Aware Framework for Execution in Robotics): introduces a multi-LLM framework composed of Planning-, Execution-, Feedback-Modules and LLM-as-a-Judge to enhance safety in LLM-driven task planning.
  • SAFER framework utilizes Task Planning LLM to generate plans, Safety Planning LLM to audit plans, Execution Module with Robot Agents to deploy tasks, Feedback Module to process outcomes and LLM-as-a-Judge to evaluate safety.
  • The framework ensures safety checks throughout planning and execution, integrating Control Barrier Functions for safety guarantees at the robotic control policy level.

Reinforcement Learning Environment with LLM-Controlled Adversary in D&D 5th Edition Combat

  • RL-LLM Adversary Framework (Reinforcement Learning with LLM-Controlled Adversary Framework): introduces reinforcement learning environment using D&D 5E combat scenarios, integrating LLM Agent-controlled adversary, and DQN-based RL Agent for strategic AI development within a simulated Environment.
  • This framework employs LLM Agent for strategic decision-making and RL Agent for adaptive learning, utilizing DQN Network composed of Input map, Conv2d layers, Flatten, Embedding layers, Concatenate, Linear layers, and Output - Q-value for value estimation.
  • The framework facilitates strategic decision-making research in complex rule-based games, demonstrating that LLM-trained RL Agents outperform rule-based and LLM-controlled adversaries, highlighting the potential of LLMs to enhance strategic depth and adaptability in AI systems.

18th of March 2025

MANTRA: Enhancing Automated Method-Level Refactoring with Contextual RAG and Multi-Agent LLM Collaboration

  • MANTRA (Multi-AgeNT Code RefAactoring): introduces a comprehensive LLM agent-based framework for automated method-level refactoring, with Context-Aware Retrieval-Augmented Generation (RAG) constructing searchable Database of Pure Refactoring Code Examples using Code Description and Caller-Callee Relationships Incorporation, coordinated Multi-Agent Refactored Code Generation employing Developer Agent with Static Code Analysis Tool for RAG-based Refactoring Examples Retrieval and Chain-of-Thought Refactoring Code Generation, and Reviewer Agent for Refactoring Verification, Code Style Consistency Analysis and Compilation and Test Verification, alongside Self-Repair Using Verbal Reinforcement Learning with Repair Agent and Reflexion Framework through Initial Analysis, Self-Reflection, Planning and Acting phases.
  • MANTRA framework emulates human refactoring process by integrating retrieval-augmented generation, multi-agent collaboration, and verbal reinforcement learning to improve code correctness and readability for method-level refactoring tasks.
  • MANTRA significantly enhances automated refactoring success rate and code quality compared to baseline LLM and existing LLM-powered tools, demonstrating practical advantages for advancing software refactoring automation.

MDTeamGPT: A Self-Evolving LLM-based Multi-Agent Framework for Multi-Disciplinary Team Medical Consultation

  • MDTeamGPT (Multi-Disciplinary Team Generative Pre-trained Transformer): introduces a multi-agent framework for medical consultation, incorporating Primary Care Doctor, Specialist Doctor Agents, Lead Physician, Chain-of-Thought Reviewer, Safety and Ethics Reviewer, Correct Answer Knowledge Base, Chain-of-Thought Knowledge Base, Historical Shared Pool, Shared Vector Database, and Patient.
  • This framework utilizes consensus aggregation and residual discussion structure to enhance diagnostic accuracy and reduce cognitive burden in multi-round, multi-agent medical consultations.
  • MDTeamGPT employs knowledge bases to accumulate consultation experience, enabling self-evolution and improved generalization in medical diagnosis tasks.

MoK-RAG: Mixture of Knowledge Paths Enhanced Retrieval-Augmented Generation for Embodied AI Environments

  • MoK-RAG (Mixture of Knowledge Paths Enhanced Retrieval-Augmented Generation): introduces multi-source retrieval framework with Splitting-, Constraint- and Generation-Modules to address cognitive-algorithmic discrepancy of single-source knowledge retrieval in current Retrieval-Augmented Generation systems.
  • MoK-RAG framework partitions knowledge base into multiple specialized paths via Splitting Module, organizes retrieved knowledge using Constraint Module, and generates response through Generation Module, enhancing contextual relevance and adaptability.
  • MoK-RAG framework mitigates "Reply Missing" problem, which refers to incomplete or lacking key details in generated responses due to single-source knowledge retrieval, by enabling simultaneous retrieval from multiple knowledge paths.

Gricean Norms as a Basis for Effective Collaboration

  • Normative Framework: introduces Lamoids, GPT-4-powered agents, integrating Gricean Norms, Inference Norm, Cognitive Frameworks, and Fs-CoT Prompting for effective human-AI collaboration through Response Generation.
  • Normative Framework enhances agent's pragmatic reasoning by incorporating Gricean maxims and cognitive theories into Fs-CoT prompting to interpret unclear instructions and generate context-aware responses.
  • By adhering to Gricean and Inference norms within the framework, Lamoids achieve improved task accuracy and clearer communication in collaborative grid world environment.

ENVBENCH: A BENCHMARK FOR AUTOMATED ENVIRONMENT SETUP

  • ENVBENCH: introduces a benchmark for automated environment setup, encompassing Repository, Environment Setup, Language Model, AI Agent, Generated Script, Evaluation Results, Evaluation Suite, and Metrics components.
  • ENVBENCH evaluates environment setup approaches by generating shell scripts and verifying environment configuration through static analysis and compilation checks.
  • This benchmark facilitates systematic assessment of environment setup strategies, addressing limitations of existing datasets and evaluation methods in the software engineering domain.

PLAY2PROMPT: Zero-shot Tool Instruction Optimization for LLM Agents via Tool Play

  • PLAY2PROMPT (Zero-shot Tool Instruction Optimization for LLM Agents via Tool Play): introduces automated framework for zero-shot tool learning, with tool-use example generation, tool documentation optimization, and task LLM components.
  • PLAY2PROMPT employs iterative beam search with sample proposal, sample evaluation, and down-sampling to refine documentation and generate examples through tool play.
  • This approach enhances LLM tool utilization by creating high-quality documentation and demonstrations without labeled data or manual effort.

DARS: Dynamic Action Re-Sampling to Enhance Coding Agent Performance by Adaptive Tree Traversal

  • DARS (Dynamic Action Re-Sampling): introduces an inference-time compute scaling method for coding agents, incorporating Generate, Reproduction, Localization, Bug Fixing, Evaluation, and Expansion components with Generator LLM, Reviewer LLM, and Selector LLM, to enhance performance by dynamically re-sampling actions.
  • DARS framework utilizes Expansion mechanism with Generator LLM for action candidates, Reviewer LLM for patch scoring based on Score Rubrics, and Selector LLM for best patch selection, processing Input and Feedback to improve coding agent's decision-making.
  • This approach aims to address limitations of sequential, multi-solution, and tree search methods by selectively branching at key decision points and employing depth-first strategy, achieving state-of-the-art performance on SWE-Bench Lite benchmark.

Conversational Agents as Catalysts for Critical Thinking: Challenging Social Influence in Group Decision-making

  • System Overview: introduces chat interface, server, and database components, with Summary Agent, Database, AI-message History, Conversation Agent, AI Duplicate Checker, and Cosine-similarity, where system processes direct and public chat messages through agents.
  • System architecture includes Summary Agent for public opinion analysis, Conversation Agent for generating contextual counterarguments, and AI Duplicate Checker for message novelty.
  • AI Duplicate Checker uses cosine-similarity to ensure novelty of generated messages compared to AI-message History stored in Database.

Empowering LLMs in Decision Games Through Algorithmic Data Synthesis

  • Mastermind-Dou Framework: introduces a three-stage reasoning process with Training Dataset, Opponent, Step-by-step Output, Possible Action Prediction, Opponent Strategy Prediction, and Final Action Selection to enable LLMs to play Doudizhu game.
  • Mastermind-Dou framework uses Possible Action Prediction to predict likely moves, Opponent Strategy Prediction to anticipate adversary actions, and Final Action Selection to choose the optimal game action.
  • The framework enhances LLMs' decision-making in imperfect information games by decomposing the reasoning into sequential prediction and selection stages.

FlexVLN: Flexible Adaptation for Diverse Vision-and-Language Navigation Tasks

  • FlexVLN (Flexible Vision-and-Language Navigation): introduces a hierarchical approach for vision-language navigation, integrating Environmental Perception, LLM Planner, MLLM Verification, Instruction Follower, and Object Localization components.
  • FlexVLN employs LLM Planner for high-level planning and guidance generation, Instruction Follower for low-level execution, and MLLM Verification to ensure guidance feasibility, enhancing generalization across diverse VLN tasks.
  • The framework utilizes Environmental Perception to understand surroundings and Object Localization to identify the target, achieving effective navigation through a combination of LLM planning and supervised learning execution.

MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding

  • MDocAgent (Multi-Modal Multi-Agent Framework for Document Understanding): introduces a novel RAG and multi-agent framework with text-based RAG, image-based RAG, general agent, critical agent, text agent, image agent, and summarizing agent.
  • MDocAgent framework addresses DocQA challenges by combining text and image RAG with specialized agents for refined processing and critical information extraction.
  • This approach enables improved DocQA performance through collaborative multi-agent architecture and cross-modal understanding of long documents.

Towards a Barrier-free GeoQA Portal: Natural Language Interaction with Geospatial Data Using Multi-Agent LLMs and Semantic Search

  • GeoQA Portal: introduces a multi-agent LLM framework with Router, Analyzer, Explainer, Visualizer, Mission Planner, Relation Analyzer, Region Selector, Entity Finder, and Geo Filter for natural language interaction with geospatial data.
  • GeoQA Portal decomposes user queries, assigns subtasks to specialized LLM agents, and presents task plans and visualizations to enhance transparency and user engagement.
  • The framework of GeoQA Portal supports flexible data inputs and semantic search, aiming to bridge the gap between complex GIS workflows and public data accessibility for non-expert users.

Retrieval-Augmented Simulacra: Generative Agents for Up-to-date and Knowledge-Adaptive Simulations

  • Retrieval-Augmented Simulacra: introduces system simulating social network service interactions with User Persona Generation LLM Module, RAG Module, and Post / Reply Generation LLM Module.
  • System uses Community Rule, Community Goal, and Samples of User Personas to generate User Persona-based Posts / Replies.
  • Framework simulates realistic social network interactions using web information retrieval and persona-based content generation.

Personalized Attacks of Social Engineering in Multi-turn Conversations - LLM Agents for Simulation and Detection

  • SE-VSim (Social Engineering - Victim Simulation): introduces a dual-agent framework with Attacker Agent, Victim Agent, and Conversation Generation Pipeline to simulate social engineering attack mechanisms in multi-turn conversations.
  • arxiv_paper_framework_name: models victim agents with varying personality traits and attacker agents with predefined attack goals to generate realistic chat-based social engineering scenarios.
  • arxiv_paper_framework_name: facilitates the study of victim vulnerabilities and attacker strategies by generating a dataset of simulated conversations for personalized social engineering defense.

TestForge: Feedback-Driven, Agentic Test Suite Generation

  • TestForge: introduces agentic framework for cost-effective test suite generation with Initial Test Generator, LLM Agent, Agent Actions, Environment Feedback, Test Refinement Loop, Code Repository Context, and Generated Test Suite.
  • TestForge reframes LLM-based test generation as iterative process, refining initial zero-shot tests using execution feedback and coverage reports to enhance test quality and coverage.
  • The framework leverages detailed execution feedback and operates at file-level to improve cost-efficiency and generate high-quality, readable test suites for complex real-world code.

17th March 2025

When Should We Orchestrate Multiple Agents?

  • Orchestration Framework: introduces a method to dynamically select the optimal agent from a set of Agents (Perform tasks, human/AI/hybrid) for tasks arriving via an Input Data Stream (Sequential task inputs), considering performance across different Regions (Data distribution partitions), costs via a Cost Estimator (Estimates agent cost per region), and feasibility via Constraints (Agent feasibility rules), using a Correctness Estimator (Estimates agent accuracy per region), Region Probability Estimator (Estimates region likelihood), and Total Empirical Utility (Cost-adjusted performance metric) for selection by the Orchestrator (Selects agent based on utility).
  • The framework utilizes online probabilistic inference to update agent correctness and region probabilities, calculates an Appropriateness Metric (Measures orchestration value) to determine when orchestration is beneficial, and is applied to simulations including resolving Rogers' Paradox by selecting Learning Strategies (Choices in Rogers' Paradox simulation).
  • A user study involving a User (Human decision-maker in study) choosing between task completion, outsourcing to an AI Agent (LLM agent in study) or a Human Agent (Agent representing human performance) demonstrates improved performance with constrained orchestration compared to baseline scenarios where users act as poor orchestrators.

Why Do Multi-Agent LLM Systems Fail?

  • MASFT (Multi-Agent System Failure Taxonomy): introduces taxonomy of failure modes in multi-agent systems, categorizing them into Pre Execution Failure Modes, Execution Failure Modes, Post Execution Failure Modes, and further groups into Failure Categories including Task Verification, Inter-Agent Misalignment, and Poor Specification.
  • MASFT framework organizes failure modes based on inter-agent conversation stages, spanning from pre-execution to post-execution phases, and classifies them into three main categories reflecting system design, agent coordination, and quality control issues.
  • MASFT taxonomy provides structured framework for understanding and mitigating failure modes in multi-agent LLM systems, serving as a foundation for future research towards building robust and reliable multi-agent systems.

Do Large Language Models Understand Performance Optimization?

  • Performance Optimization Agent: introduces a system integrating LLMs with profiling feedback for HPC code optimization, with Input Prompt, Evaluator, Codee, LLMs, Compilers, Results, HPC Commonsense, Code Correctness, Performance Benchmarking, Metrics, Memory, Profiling Tools, Profiling Plan, Metrics Annotation, System Prompt, Code Generation, Code Replacement, and Output Inspection components.
  • Performance Optimization Agent leverages profiling tools and LLMs iteratively to optimize HPC code by replacing hotspot functions and recompiling, while evaluating performance metrics and ensuring code correctness.
  • The agent aims to bridge the gap between traditional HPC optimization and AI-driven code assistants by incorporating human-like iterative refinement and memory of prior optimization attempts for enhanced performance gains.

A Comprehensive Survey on Multi-Agent Cooperative Decision-Making: Scenarios, Approaches, Challenges and Perspectives

  • LLMs-enhanced MARL Framework: introduces a structure integrating Large Language Models with Multi-Agent Reinforcement Learning, encompassing Feature Representation Extractor, Language Translator, Reward Models, Decision-makers, World Model Simulator, and Policy Interpreter.
  • This framework leverages LLMs for enhanced reasoning and language understanding within MARL agents, facilitating improved collaboration and decision-making in complex environments.
  • The architecture supports various roles for LLMs, including information processing, reward design, decision-making, and output generation, aiming to address challenges in multi-agent systems.

Toward Generative 6G Simulation: An Experimental Multi-Agent LLM and ns-3 Integration

  • Multi-Agent LLM framework: introduces a multi-agent system with Simulation Generation, Test Designer, Test Executor, and Result Interpretation Agents, leveraging External Tools and LLMs within a Feedback Loop managed by Agent Orchestration Layer for automated network simulation.
  • This framework integrates specialized agents to automate simulation lifecycle stages, from natural language input to actionable insights, using ns-3 and iterative refinement.
  • The framework enhances simulation accuracy and reduces manual coding by employing LLMs and external knowledge, facilitating rapid prototyping in complex network environments.

MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research

  • MicroVQA: introduces a benchmark, with Raw VQA creation, Exam-style MCQ generation, and RefineBot, for multimodal reasoning in microscopy-based research.
  • MicroVQA benchmark evaluates expert image understanding, hypothesis generation, and experiment proposal capabilities.
  • RefineBot component enhances MCQ difficulty by iteratively refining questions and distractors based on chain-of-thought analysis.

Agents Play Thousands of 3D Video Games

  • PORTAL (Policy Optimization and Reasoning for Tactical Artificial Learning): introduces a novel framework for game-playing AI agents, leveraging Strategy Description, BT DSL, Behavior Tree DSL, Blackboard Variables, Control Flows, Neural Nets, Hand-crafted Rules, Generated Codes, Task Nodes, BT Generator, Parser, Reflexion, Rollout, JSON, and AI C++ Server components.
  • PORTAL framework utilizes LLMs as policy architects to generate Behavior Tree DSL policies, which are then parsed and executed in a game environment, incorporating a reflexion mechanism for iterative policy refinement based on game feedback.
  • The framework's hybrid architecture combines strategic reasoning from LLMs with efficient execution through behavior trees and neural networks, enabling rapid deployment and adaptation of game agents across diverse 3D video game environments.

Goal2Story: A Multi-Agent Fleet based on Privately Enabled sLLMs for Impacting Mapping on Requirements Elicitation

  • Goal2Story (Impact Mapping framework): introduces multi-agent fleet for goal-driven requirements elicitation, with Alpha Captain, Intelligence Officer, Delivery Coordinator, Tactical Officer, Format Doctor, Validation Agent, and StorySeek dataset.
  • It leverages privately enabled small language models and Impact Mapping framework to automate requirements elicitation in agile development.
  • The framework aims to improve efficiency and quality of user story generation while addressing data privacy and cost concerns associated with large language models.

KNOWLEDGE-AWARE ITERATIVE RETRIEVAL FOR MULTI-AGENT SYSTEMS

  • Knowledge-Aware Iterative Retrieval for Multi-Agent Systems: introduces agent framework with Query Planning, Knowledge Update Mechanism, and Contextual Filtering for iterative knowledge retrieval.
  • Framework decouples external sources from internal knowledge cache, enabling dynamic search exploration and mitigating bias reinforcement loops.
  • System supports multi-agent extensions for competitive and collaborative knowledge sharing, enhancing reasoning and scalability in complex tasks.

DAgent: A Relational Database-Driven Data Analysis Report Generation Agent

  • DAgent (Relational Database-Driven Data Analysis Report Generation Agent) introduces a novel LLM agent framework for relational database analysis report generation, integrating Planning Module (processes input queries), Decomposition Tools (breaks down complex questions), Data Retrieval Tools (retrieves data from database), SQL Rewriting Tools (optimizes SQL queries), Report Generation Tools (generates analytical reports), Tools Module (collection of specialized components), and Memory Module (stores historical data).
  • DAgent framework utilizes planning to decompose complex questions, employs tools for data retrieval and report generation, and incorporates memory to enhance efficiency and contextual understanding for relational database-driven data analysis report generation.
  • The modular architecture of DAgent facilitates efficient task decomposition, flexible data retrieval strategies, and precise report synthesis, demonstrating strong potential for complex database analysis tasks.

MAP: Evaluation and Multi-Agent Enhancement of Large Language Models for Inpatient Pathways

  • MAP (Multi-Agent Inpatient Pathways): introduces a multi-agent framework with Triage-, Diagnosis-, Treatment-, and Chief-Agents, supported by Record Review-, Trainable REG-, and Expert Guidance-Modules, utilizing Medical Records and a Medical Knowledge Base, to enhance large language model performance in inpatient pathways.
  • MAP framework simulates inpatient pathway flow through collaborative agents, where each agent is empowered by specialized LLMs and modules for complex medical scenario processing and improved diagnostic accuracy.
  • The framework's modular design, incorporating record review, retrieval-enhanced generation, and expert guidance, aims to address limitations of current LLMs in complex inpatient diagnostic support by integrating diverse clinical data and knowledge.

MAP : Multi-user Personalization with Collaborative LLM-powered Agents

  • MAP (Multi-Agent system for Multi-user Personalization): introduces Planner, Rule Manager, Rule Retriever, and Storage components within Reflection, Analysis, and Feedback stages for multi-user personalization.
  • MAP framework orchestrates specialized agents to retrieve user data, reason about personalization tasks, resolve conflicts, and incorporate user feedback through iterative workflow.
  • MAP leverages multi-agent system to implement user-centered personalization workflow, emphasizing user involvement in resolution verification and failure management.

Identifying Cooperative Personalities in Multi-agent Contexts through Personality Steering with Representation Engineering

  • Personality Steering Framework: introduces personality steering via representation engineering to investigate LLM cooperation within Iterated Prisoner's Dilemma environment, utilizing LLM agents and rule-based players.
  • Framework employs Big Five personality traits steering through vectors and prompts to analyze impact on LLM behavior across different communication setups.
  • Key components include representation engineering for personality modulation, IPD environment for strategic interaction, and communication module for enhanced agent interaction.

Can Reasoning Models Reason about Hardware? An Agentic HLS Perspective

  • Agentic HLS Optimization Framework: introduces agent-based methodology employing In-Context Learning, LLM, HLS Tool, Functional Test, Compiler, Agent Tasks, Inspect kernel, Solve ILP problem, Synthesize Solution, Select Solution, System Prompt, Config. Builder, and ILP Solver for automated hardware design optimization in High-Level Synthesis.
  • This framework explores LLMs' reasoning capabilities within High-Level Synthesis by automating code restructuring, pragma insertion, and design space exploration through iterative feedback loops and access to EDA tools.
  • The agentic approach aims to enhance design quality and efficiency by enabling LLMs to emulate expert system architects in navigating complex hardware optimization tasks, potentially improving upon current state-of-the-art methods.

Enforcing Cybersecurity Constraints for LLM-driven Robot Agents for Online Transactions

  • Security Architecture: introduces a cybersecurity framework for LLM agents in online transactions, integrating LLM-Driven Robot Agents Layer, Multi-Factor Authentication (MFA) Layer, Blockchain Layer, Anomaly Detection System (ADS) Layer, and User Interface Layer.
  • The framework enhances transaction security and integrity by combining multi-factor authentication for identity verification, blockchain for immutable records, and real-time anomaly detection for fraud prevention.
  • This architecture achieves improved fraud detection and reduced transaction latency compared to traditional systems, demonstrating enhanced security and efficiency for LLM-driven robotic agents.

Can Reasoning Models Reason about Hardware? An Agentic HLS Perspective

  • Agentic HLS optimization framework: introduces an automated optimization flow with In-Context Learning, HLS Tool, LLM, Functional Test, Compiler, Agent Tasks, ILP Solver, and Config. Builder for high-level synthesis.
  • This framework explores reasoning LLMs for automating code restructuring, pragma insertion, and design space exploration in HLS.
  • The agentic approach aims to improve design quality and efficiency by mimicking expert system architects in hardware optimization tasks.

Prompt Flow Integrity to Prevent Privilege Escalation in LLM Agents

  • PFI (Prompt Flow Integrity): introduces system security solution for LLM agents, featuring User Prompt, Agent Context, Plugins, Plugin Call, Plugin Result, Final Answer, Trusted Agent, Untrusted Agent, Proxy, Trusted Data, Untrusted Data, FlowCheck, TrustCheck, and GenerateQuery components.
  • PFI framework isolates untrusted data within Untrusted Agent, distinct from Trusted Agent, and employs Proxy to manage data flow between agents, utilizing FlowCheck and TrustCheck for validation and data classification.
  • PFI aims to mitigate privilege escalation risks in LLM agents by enforcing least privilege and ensuring data flow integrity through component-based architecture and security mechanisms.

16th March 2025

VeriLA: A Human-Centered Evaluation Framework for Interpretable Verification of LLM Agent Failures

  • VeriLA (Verifying LLM Agent failures): introduces a human-centered framework with Human-designed Agent Registry, Planning Agent, Task Plan, Agent Execution, Agent Outputs, Human-defined Agent Criteria, Agent Verifiers, Aggregator: Task Failure, and AI Practitioners / Agent Users for interpretable LLM agent failure verification.
  • VeriLA systematically evaluates agent failures using human-defined criteria, verifies execution outputs, and aggregates scores to identify task failures and guide revisions.
  • The framework enhances human-agent collaboration by providing interpretable failure analysis and reducing manual effort in debugging compound AI systems.

Facilitating Automated Online Consensus Building through Parallel Thinking

  • PTFA (Parallel Thinking-based Facilitation Agent): introduces an automated system for online consensus building, leveraging LLMs to embody Six Thinking Hats roles for structured discussions.
  • PTFA framework comprises an Agent module incorporating LLMs for diverse thinking roles and a Platform module for user interaction via Discourse interface and discussion data storage in a database.
  • The system utilizes Discourse forum system and OpenAI API to facilitate structured conversations based on Six Thinking Hats methodology and collect data for analysis of automated facilitation.

A Survey on the Optimization of Large Language Model-based Agents

  • LLM-based Agent Optimization Framework: introduces introduction, background, parameter-driven optimization, parameter-free optimization, datasets and benchmarks, application, challenges and future directions, and conclusion to comprehensively survey optimization strategies for agents based on large language models.
  • Parameter-driven optimization refines model parameters, while parameter-free optimization adjusts inputs and context to improve agent behavior without parameter changes.
  • The survey categorizes optimization into parameter-driven (fine-tuning, RL, hybrid) and parameter-free (prompt engineering, RAG, multi-agent) methods, further detailing data construction, evaluation, and applications.

SPIN-Bench: How Well Do LLMs Plan Strategically and Reason Socially?

  • SPIN-Bench (Strategic Planning, Interaction, and Negotiation): introduces a multi-domain evaluation framework, with LLM Pool, Game Environment, Agent Engine, Agent Initialization, Phase Manager, and Evaluation Module, designed to measure strategic planning and social reasoning intelligence.
  • The framework combines PDDL tasks, competitive games, cooperative games, and strategic games to assess LLMs in diverse social settings.
  • SPIN-Bench is important for future research on robust multi-agent planning, social reasoning, and human-AI teaming by providing a unified and comprehensive evaluation platform.

GAMECHAT: Multi-LLM Dialogue for Safe, Agile, and Socially Optimal Multi-Agent Navigation in Constrained Environments

  • GAMECHAT (Multi-LLM Dialogue for Safe, Agile, and Socially Optimal Multi-Agent Navigation in Constrained Environments): introduces a decentralized multi-agent navigation framework utilizing Initial Prompt, Another Agent Observed?, SMG Occurring?, MPC-CBF Update, LLM Conversation, Consensus or Comm. Limit Reached?, Was Consensus Reached?, Use LLM-Based Role Assignment, Strategy 1 Role Assignment, MPC-CBF w/ Role Constraints Update, and Goals Reached? components for safe, agile, and socially optimal navigation.
  • GAMECHAT framework leverages LLM Conversation for natural language-based priority negotiation and employs MPC-CBF Update for motion planning with safety and liveness guarantees in constrained environments.
  • The framework addresses spatial symmetry and deadlocks through explicit communication and game-theoretic strategies, achieving socially optimal navigation by prioritizing urgent tasks and ensuring subgame perfect equilibrium.

LLM-MEDIATED GUIDANCE OF MARL SYSTEMS

  • LLM-Mediated Guidance of MARL Systems: introduces a framework that integrates Rule-Based Controller, Natural Language Controller and LLM-Mediator to guide Agents within an Environment by generating Task-List based on Observations and Reward, overwriting Learned Policy to enhance Actions.
  • The framework uses LLM-Mediator to interpret interventions from Rule-Based Controller or Natural Language Controller, translating them into specific actions for agents in the Aerial Wildfire Suppression environment.
  • This approach aims to improve MARL performance by providing adaptive guidance through LLM-mediated interventions, accelerating learning and enhancing coordination in complex multi-agent scenarios.

Facilitating Automated Online Consensus Building through Parallel Thinking

  • PTFA (Parallel Thinking-based Facilitation Agent): introduces an automated system for online consensus building, leveraging LLMs to embody Six Thinking Hats roles for structured discussions.
  • PTFA framework comprises an Agent module incorporating LLMs for diverse thinking roles and a Platform module for user interaction via Discourse interface and discussion data storage in a database.
  • The system utilizes Discourse forum system and OpenAI API to facilitate structured conversations based on Six Thinking Hats methodology and collect data for analysis of automated facilitation.

Advancing Human-Machine Teaming: Concepts, Challenges, and Applications

  • QN-MHP (Queuing Network-Model Human Processor): introduces queuing networks and symbolic cognitive models to effectively model multitask human performance for cognitive modeling.
  • QN-MHP demonstrates potential in cognitive modeling but lacks accuracy under specific conditions and does not address speed control or complex road geometry adjustments.
  • QN-MHP represents an initial approach towards integrating cognitive and queuing models for human performance simulation.

15th March 2025

Agentic Search Engine for Real-Time IoT Data

  • IoT-ASE (IoT Agentic Search Engine): introduces a real-time search engine for IoT data, leveraging Classifier Agent, Retriever Node, Generator Agent, and Reviewer Agent, utilizing Service Description Vector Database and Real-Time IoT database, with components including Tokenizer, Embedding Model, Average Pooling, Normalize embeddings, and integrating SensorsConnect's Perception Layer, Edge Layer, Cloud Layer, Business Layer, and User Interface Layer.
  • IoT-ASE framework employs a Generic Agentic RAG approach to process IoT data queries, incorporating agents for classification, retrieval, generation, and review, ensuring context-aware and accurate responses by accessing real-time information and service descriptions.
  • The architecture of IoT-ASE is designed to address the challenges of fragmented and heterogeneous IoT data by utilizing a unified data model and standardized communication protocols within the SensorsConnect framework, facilitating efficient real-time data accessibility and decision-making.

TFHE-Coder: Evaluating LLM-agentic Fully Homomorphic Encryption Code Generation

  • Compiler-in-the-loop evaluator Framework: introduces iterative TFHE code generation process, with User Prompt initiating code creation by LLM, followed by Compiler TFHE compiling for Environment Output, using Compile Report feedback for LLM Revise if compilation is not OK.
  • This framework evaluates LLM's ability to generate compilable TFHE code through cycles of compilation and revision based on compiler diagnostics.
  • The iterative approach helps in systematically refining LLM output towards syntactically correct TFHE code generation.

Multi-Agent Systems Execute Arbitrary Malicious Code

  • MAS (Multi-Agent System): introduces control-flow hijacking attacks by manipulating metadata flow within User, Orchestrator, WebSurfer, FileSurfer, and CodeExecutor components, leading to Hijacked control flow and execution of Executable payload from Attack content.
  • MAS framework coordinates agents like WebSurfer, FileSurfer, and CodeExecutor under Orchestrator's direction to fulfill User requests, but is vulnerable to attacks rerouting Normal control flow.
  • Control-flow hijacking in MAS exploits metadata transmission to redirect agent invocations, causing execution of arbitrary code and system compromise, even with individual agent safety measures.

AgentDroid: A Multi-Agent Framework for Detecting Fraudulent Android Applications

  • AgentDroid (Multi-Agent Framework): introduces a multi-agent framework for Android fraudulent application detection, with Task Master, Certificate Checker, Link Analyst, Package Tracker, Permission Analyst, Icon Analyst, Content Analyst, Decision Maker agents, and Static Analysis, Feature Extraction, Agent Tools, Multi-Agent Detection modules.
  • AgentDroid leverages multimodal analysis and collaborative agents to improve fraud detection accuracy by analyzing APK files and extracting features.
  • The framework employs specialized agents and tools, including DeiT and T5 models, for detailed analysis of diverse APK characteristics to identify fraudulent applications.

ICCO: Learning an Instruction-conditioned Coordinator for Language-guided Task-aligned Multi-robot Control

  • ICCO (Instruction-Conditioned Coordinator) introduces a multi-agent reinforcement learning framework with Instructor providing Language Instruction to Coordinator, which uses Coordination Policy and LLM or Random Vector Generator to generate Task-Aligned and Consistent Instructions (TACI) for each Local Agent with Local Policy in Env.
  • ICCO framework balances language-instruction following and cooperative task execution by employing Coordinator to generate consistent instructions from global observations and language, improving coordination among Local Agents.
  • The framework utilizes Centralized Training with Decentralized Execution (CTDE) paradigm, training Coordinator and Local Agent policies jointly to optimize task efficiency and instruction following, enhanced by Consistency Enhancement Term.

Is Multi-Agent Debate (MAD) the Silver Bullet? An Empirical Analysis of MAD in Code Summarization and Translation

  • MAD (Multi-Agent Debate): introduces a multi-stage framework with Input, Stage 1, Stage 2, Stage 3, Agent Debate at each stage, Judge, and Evaluation: Debate Output, for structured debate among agents to solve software engineering tasks.
  • MAD framework utilizes iterative Agent Debate within each Stage to refine solutions and employs a Judge to evaluate agent responses and guide the process towards Evaluation: Debate Output.
  • The framework's modular design with distinct stages and agent roles facilitates structured problem-solving and leverages debate to enhance the quality of the final Evaluation: Debate Output.

SagaLLM: Context Management, Validation, and Transaction Guarantees for Multi-Agent LLM Planning

  • SagaLLM (Saga Language Learning Model): introduces Context Management-, Validation- and Transaction-Frameworks, addressing limitations in multi-agent LLM planning by ensuring context awareness and planning consistency.
  • SagaLLM integrates transactional processing with adaptive multi-agent intelligence, enhancing reliability and correctness in complex real-world applications.
  • The framework employs specialized agents and validation protocols to maintain critical constraints and state information throughout complex planning processes, improving decision-making robustness.

End-to-End Edge AI Service Provisioning Framework in 6G ORAN

  • Edge AI Service Provisioning Framework: introduces end-to-end system for edge AI service deployment using LLM-based Agent, RAN-Core Status Data, Edge Status Data, RIC Status Data, AI Model Repository Data, User Engagement Tool, Edge Management Tool, Core Management Tool, RIC Management Tool, AI Service Monitoring-Prediction, Other xApps, AI Service, Network Functions, and Databases.
  • This framework leverages LLM agent to automate AI model selection, service deployment, network adaptation, and QoS monitoring in 6G O-RAN.
  • The proposed framework aims to simplify edge AI deployment by abstracting low-level tasks and enabling intent-based service provisioning.

14th March 2025

CoLLMLight: Cooperative Large Language Model Agents for Network-Wide Traffic Signal Control

  • CoLLMLight (Cooperative Large Language Model Light): introduces a cooperative LLM agent framework for network-wide traffic signal control, with Observation Collection, Complexity-Aware Reasoning, Spatiotemporal-Aware Cooperative Decision-making, and Simulation-Driven Fine-tuning components.
  • CoLLMLight uses spatiotemporal graph and complexity-aware reasoning to dynamically adapt reasoning depth for optimal computational efficiency and decision quality.
  • Simulation-driven fine-tuning and environmental feedback enhance CoLLMLight's decision-making and efficiency in diverse traffic scenarios.

Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation

  • CoT Monitoring Framework: introduces CoT Monitor and Action Monitor, to monitor reasoning models for misbehavior, in agentic coding environments.
  • CoT Monitor observes agent's chain-of-thought, actions, and outputs, while Action Monitor observes only tool calls and final outputs.
  • Using CoT monitoring can be more effective than action-only monitoring for detecting reward hacking in reasoning models.

Cerebrum (AIOS SDK): A Platform for Agent Development, Deployment, Distribution, and Discovery

  • Cerebrum (AIOS SDK): introduces a modular four-layer architecture comprising LLM Layer, Memory Layer, Storage Layer, and Tool Layer, alongside Overrides Layer, Agent Hub, Agent Chat, Context Manager, Scheduler, LLM Core(s), Tool Manager, Memory Manager, Storage Manager, Agent Manager, Planning Module, Action Module, Memory Module, Storage Module, AIOS Kernel, User Device, Exposed Ports, LLM Queue, Memory Queue, Tool Queue, Storage Queue, Agent Applications, AIOS System Call, and Thread Binding for agent development, deployment, distribution, and discovery.
  • Cerebrum framework provides a comprehensive SDK with a community-driven Agent Hub for sharing agents and an interactive web interface Agent Chat for agent testing and evaluation, aiming to standardize agent development and promote collaboration.
  • The platform's architecture facilitates both fine-grained control over agent behavior and rapid development through high-level abstractions, supporting diverse agent methodologies and user-created agent distribution within a centralized hub.

Alstorian lets AI be a historian: A KG-powered multi-agent system for accurate biography generation

  • Alstorian: introduces a knowledge graph powered retrieval-augmented generation system for biography creation, integrating KG-based index, two-step training, prompt, retrieval, aligned model, biography, verifier, error-aware generation, and error-aware solvers components.
  • Alstorian employs KG-based index for structured knowledge retrieval and two-step training to enhance stylistic consistency of generated biographies, alongside multi-agent system for real-time error detection and correction.
  • The framework achieves improved factual accuracy and reduced hallucination in biography generation through error-aware mechanisms and knowledge graph integration, demonstrating advancements over existing methods.

GNNs as Predictors of Agentic Workflow Performances

  • FLOW-GNN (workflow graph neural network): introduces a framework for predicting agentic workflow performance using Agentic workflow, Task instruction, Sentence transformer, Graph & node features, GNN encoder, Projector, Task embedding, Concatenation, MLP, and Predicted performance.
  • This framework leverages GNNs to efficiently predict agentic workflow performance by encoding workflow structure and task instructions into embeddings and using MLP for prediction, avoiding costly LLM invocations.
  • FLOW-GNN framework aims to automate agentic workflow optimization by providing a fast and accurate performance predictor, enabling efficient exploration of workflow designs.

COLLABORATION IS ALL YOU NEED: LLM ASSISTED SAFE CODE TRANSLATION

  • UniTranslator: introduces a multi-agent framework for code translation with Input Code processed by DirectorLLM, leveraging Agent Garden, LLM Quorum, and Compiler Garden within a Decision Loop using Feedback to produce Translated Code.
  • UniTranslator framework utilizes DirectorLLM to orchestrate specialized agents in Agent Garden and select appropriate LLMs from LLM Quorum, employing Compiler Garden for validation and Feedback for iterative refinement within Decision Loop.
  • This architecture aims to enhance code translation accuracy and efficiency by using collaborative compact LLMs and knowledge grounding, overcoming limitations of monolithic models and enabling deployment on common hardware.

Prompt Alchemy: Automatic Prompt Refinement for Enhancing Code Generation

  • Prochemy (Prompt Alchemy): introduces automated prompt refinement framework, with Training Set Generation, Optimization, Mutation, Evaluation, Selection, LLM, Existing Data, Mutated Data, Training Set, Initial Prompt, Selected Prompt and Final Prompt, to enhance code generation by iteratively refining prompts based on model performance.
  • Prochemy leverages a training set composed of existing and mutated data to evaluate and select optimal prompts through mutation, evaluation, and selection steps, ensuring consistency and reliability.
  • The framework is designed as plug-and-play, compatible with existing prompt engineering methodologies, and validated across diverse datasets and language models for code generation and translation tasks.

Large Reasoning Models in Agent Scenarios: Exploring the Necessity of Reasoning Capabilities

  • LaRMA Framework (Large Reasoning Models in Agent Scenarios framework): introduces Task Segmentation (categorizes agent tasks), Agent Paradigm (selects reasoning paradigms), Model Evaluation (evaluates LLMs and LRMs), Performance Evaluation (measures task success), and Reasoning Evaluation (assesses reasoning quality).
  • LaRMA framework systematically investigates reasoning in agents by dissecting tasks, selecting paradigms like ReAct and Reflexion, and evaluating performance and reasoning metrics.
  • This framework facilitates understanding of reasoning capabilities in LLMs and LRMs across diverse tasks and paradigms, contributing to agent design advancements.

API Agents vs. GUI Agents: Divergence and Convergence

  • This paper introduces API Agent, GUI Agent and Hybrid Agent frameworks, which includes User (initiates task), API Agent (uses APIs), GUI Agent (uses GUI), API Information (API descriptions), Action (API) (executes API call), GUI Observation (GUI perception), Action (GUI) (interacts with GUI), API Wrapper (wraps GUI with API), Action Orchestrator (manages actions), Hybrid Agent (uses both APIs and GUI), GUI Workflow (GUI action sequence), Payment Gateway (handles payments), Shipping Service (manages shipping), and GUI Verification (verifies GUI).
  • API Agent framework utilizes structured API calls for efficient and reliable automation, while GUI Agent framework interacts with applications through visual interfaces for broader applicability.
  • Hybrid Agent framework combines API and GUI approaches to leverage their respective strengths, aiming for versatile and adaptable automation solutions.

Banner Agency: Advertising Banner Design with Multimodal LLM Agents

  • BannerAgency: introduces a training-free framework for automated banner ad design, incorporating Strategist, Background Designer, Foreground Designer, Developer, Memory, and External Knowledge & Tools components.
  • BannerAgency leverages multimodal LLMs as agents to simulate a human design team workflow, from strategy to implementation, for generating editable banner designs.
  • The framework utilizes memory and external knowledge to enable context-aware decisions and supports multiple banner sizes through component-based approach.

TxAgent: An AI Agent for Therapeutic Reasoning Across a Universe of Tools

  • TXAGENT (TXAGENT): introduces an AI agent for therapeutic reasoning, integrating TOOLUNIVERSE, Specialized LLM, and TOOLRAG model.
  • TXAGENT leverages multi-step reasoning and real-time biomedical knowledge retrieval across a toolbox of 211 tools for analyzing drug-related tasks.
  • TXAGENT ensures treatment recommendations align with clinical guidelines and real-world evidence, reducing adverse events and improving decision-making.

13th March 2025

Teamwork makes the dream work: LLMs-Based Agents for GitHub README.MD Summarization

  • Metagente: introduces a multi-agent framework composed of Extractor Agent, Summarizer Agent, Teacher Agent, and Prompt Creator Agent, utilizing LangChain for communication and ROUGE-L for evaluation within a Fine Tuning process to generate Generated About from README.MD.
  • Metagente employs Extractor Agent to filter README.MD content, Summarizer Agent to create summaries, Teacher Agent to refine prompts, and Prompt Creator Agent to finalize prompts, iteratively improving summary quality.
  • The framework leverages a teacher-student architecture for prompt optimization, enhancing the synergy of LLM agents to achieve improved summarization performance for GitHub README.MD files.

UniGoal: Towards Universal Zero-shot Goal-oriented Navigation

  • UniGoal: introduces a universal zero-shot goal-oriented navigation framework utilizing Observation RGB-D as input and Agent Pose to construct Scene Graph and Goal Graph, employing Graph Matching and Blacklist within a Global Policy across Stage 1: Zero Matching, Stage 2: Partial Matching, and Stage 3: Perfect Matching to output Action based on Deterministic Local Policy and Occupancy Map, further incorporating Graph Correction and Goal Verification.
  • This framework uniformly represents diverse goals as graphs and performs graph matching between Scene Graph and Goal Graph to guide a multi-stage exploration policy, enabling zero-shot navigation across object category, instance image, and text description goals.
  • The multi-stage policy progresses from initial exploration in Stage 1: Zero Matching to coordinate projection in Stage 2: Partial Matching and finally to verification in Stage 3: Perfect Matching, ensuring robust navigation through graph-based reasoning and a blacklist mechanism for avoiding repeated failures.

COSTA*: Cost-Sensitive Toolpath Agent for Multi-turn Image Editing

  • COSTA* (Cost-Sensitive Toolpath Agent): introduces a three-stage framework with LLM for subtask tree generation, Tool Dependency Graph and Model Description Table for tool organization, A* Search for optimal pathfinding, Quality Check by VLM for output evaluation, utilizing Tools and Benchmark Table for informed decisions in multi-turn image editing.
  • COSTA* leverages LLM for high-level planning and A* search for detailed toolpath optimization, incorporating Tool Dependency Graph and Model Description Table to manage tool dependencies and capabilities, while Benchmark Table and Quality Check by VLM ensure cost-effective and high-quality image editing.
  • The framework balances computational cost and output quality through a cost-sensitive A* search guided by a Benchmark Table and real-time feedback from Quality Check by VLM, enabling efficient exploration of toolpaths within the Tool Dependency Graph and Subtask Tree for complex image editing tasks.

SySLLM: Generating Synthesized Policy Summaries for Reinforcement Learning Agents Using Large Language Models

  • SySLLM (Synthesized Summary using LLMs) introduces Env, Observation Captioner, Agent, Action Captioner, Experience Dataset, Prompt, Formatted Dataset, LLM, and Summary to generate policy summaries by converting agent experiences into natural language and utilizing LLMs for synthesis.
  • SySLLM leverages captioners to translate environment observations and agent actions into textual descriptions, which are then formatted with a prompt and fed into an LLM to produce a comprehensive policy summary.
  • The framework facilitates understanding of complex RL policies by synthesizing concise, coherent, and human-readable summaries from agent-environment interactions, enhancing interpretability and trust in RL agents.

New Trends for Modern Machine Translation with Large Reasoning Models

  • LRM-based MT (Large Reasoning Model based Machine Translation): introduces framework with Machine Translation and Large Reasoning Model, addressing Foundational Challenges like Stylized-, Document-, Multimodal Translation, exploring New Opportunities such as Self-Reflection, Auto-Pivoting, and venturing Beyond Translation into Deciphering Encoded Text.
  • This framework leverages Large Reasoning Models to enhance Machine Translation by tackling complex scenarios and introducing novel capabilities beyond traditional text-to-text mapping.
  • The approach aims to redefine translation as dynamic reasoning task, moving beyond mere text conversion towards multilingual cognitive agent functionality.

Capturing Semantic Flow of ML-based Systems

  • Semantic Flow: introduces semantic flow graphs for capturing ML-system executions through latent space progression, using Conv2d, MaxPool2d, Linear, ReLU, Flatten, AutoFL, LLM, Latent Space, Semantic State, Semantic Cluster, LLM Inference Graph, and Function Call Nodes components.
  • Semantic flow graphs represent semantic states as clusters in latent spaces and transitions between these clusters, enabling analysis of ML-system behaviour beyond traditional control flow.
  • This approach facilitates understanding, debugging, and improving ML-based systems by visualizing and quantifying their internal decision-making processes and execution diversity.

LLM Agents Display Human Biases but Exhibit Distinct Learning Patterns

  • DFE Framework (Decisions From Experience Framework): introduces LLM Agents (language model decision-makers) within DFE Tasks (repeated choice scenarios) to evaluate behavioral patterns through Feedback (outcome information) and Choice History (past interaction record).
  • This framework investigates how LLMs, provided with Feedback and Choice History in DFE Tasks, exhibit human-like biases such as underweighting rare events and correlation effects.
  • The DFE Framework highlights differences in learning patterns between LLMs and humans, revealing LLMs' strong recency bias and absence of "surprise triggers change" and "wavy recency effect" phenomena observed in human behavior.

SCOOP: A Framework for Proactive Collaboration and Social Continual Learning through Natural Language Interaction and Causal Reasoning

  • SCOOP (Social Continual Object-Oriented POMDP): introduces a base Oracle-Aided ReAct architecture that extends ReAct framework with actions to query state, user preferences, and environment mechanics.
  • SCOOP framework also proposes an advanced ReAct architecture incorporating CausalRefinementAndAction, LLM, external oracle, causal inference libraries, causal knowledge graph, and planning routines for enhanced reasoning.
  • SCOOP framework facilitates agents learning through dialogue, questions, and interaction in open environments, refining causal understanding while balancing exploration and exploitation.

Hybrid Agents for Image Restoration

  • HybridAgent: introduces interactive image restoration paradigm with user inputs, fast-agent, slow-agent, feedback-agent, restoration tools and memory.
  • HybridAgent: employs fast-agent for direct prompts, slow-agent for vague prompts, feedback-agent for quality assessment, and restoration tools for image enhancement, utilizing memory to track restoration history.
  • HybridAgent: leverages instruction-tuning dataset to optimize agents and restoration tools, achieving efficient and effective image restoration through collaborative agent interaction.

StepMathAgent: A Step-Wise Agent for Evaluating Mathematical Processes through Tree-of-Error

  • StepMathAgent: introduces a mathematical process evaluation agent based on Tree-of-Error, incorporating logical step segmentation, step scoring, score aggregation, error tree generation, difficulty calibration, simplicity evaluation, completeness validation, and format assessment.
  • StepMathAgent evaluates mathematical problem-solving processes by segmenting solutions into steps, scoring each step, aggregating scores, and generating error trees for interpretability and feedback.
  • StepMathAgent addresses limitations of answer-based evaluations by providing fine-grained assessments, interpretability through error trees, and adaptability to diverse evaluation scenarios.

Advanced Tool Learning and Selection System (ATLASS): A Closed-Loop Framework Using LLM

  • ATLASS (Advanced Tool Learning and Selection System): introduces a closed-loop framework employing Task Analyzer for decomposition, Tool Master for tool necessity assessment, Tool Selector for tool selection, Tool Generator for tool creation, Tool Dataset for storage, and Task Solver for execution, alongside Code Writer, Code Executor, Documentation Context, Web Automation Bot, and API Key for tool generation and usage.
  • This framework facilitates dynamic tool generation and selection by LLMs, enabling adaptive problem-solving through iterative refinement and reuse of tools stored in the Tool Dataset.
  • ATLASS enhances LLM agents' capabilities to address complex tasks by automating tool creation and integration, overcoming limitations of predefined toolsets and improving adaptability in diverse scenarios.

Enhancing Multi-Agent Systems via Reinforcement Learning with LLM-based Planner and Graph-based Policy

  • LGC-MARL (LLM-based Graph Collaboration MARL): introduces framework integrating LLM Planner and Graph-based policy to enhance multi-agent reinforcement learning in complex tasks.
  • LGC-MARL framework: decomposes tasks into executable subtasks using LLM Planner, and coordinates agents via Graph-based policy guided by action dependency graph.
  • LGC-MARL framework: employs Critic LLM for plan refinement and LLM-based reward function generator, improving collaboration and learning efficiency in multi-agent systems.

OR-LLM-Agent: Automating Modeling and Solving of Operations Research Optimization Problem with Reasoning Large Language Model

  • OR-LLM-Agent (Operations Research - Large Language Model - Agent): introduces an AI framework that automates operations research problem-solving by using LLM mathematical modeling, LLM code generation and OR-CodeAgent for code execution and repair, replacing traditional expert and programmer roles.
  • OR-LLM-Agent framework leverages reasoning LLMs to translate natural language problem descriptions into mathematical models, subsequently generating executable solver code and managing automated code execution within a sandbox environment.
  • The framework's OR-CodeAgent component enhances robustness through self-repair and self-verification mechanisms, iteratively refining code and mathematical models to achieve feasible and accurate solutions for real-world operations research problems.

AgentDAM: Privacy Leakage Evaluation for Autonomous Web Agents

  • AgentDAM (Agent Data Minimization): introduces benchmark for evaluating privacy leakage in web agents, processing user instruction and data within web environment using LLM backbone, generating actions judged by privacy evaluator.
  • AgentDAM benchmark assesses agent's ability to minimize private data processing during web navigation tasks, measuring both task performance and privacy leakage.
  • AgentDAM provides a framework to analyze and mitigate privacy risks associated with autonomous AI web agents accessing sensitive user information.

Multi-Agent LLM Actor-Critic Framework for Social Robot Navigation

  • SAMALM (Socially-Aware Multi-Agent actor-critic LLM framework): introduces decentralized multi-agent system for social robot navigation, employing Local Observation, Multi-Robot LLM-Actors, Action List, Multi-Robot LLM-Critics, Output Execution, Re-Query with (Critic Feedback), Evaluation Threshold, Entropy Fusion Mechanism, Global LLM Critic, Local LLM Critic, LLM Actor, and Robot World Model components.
  • SAMALM framework utilizes parallel LLM actors for generating robot-specific control signals, which are evaluated by global and local LLM critics, and refined through an entropy-based fusion and re-query mechanism to ensure socially compliant navigation.
  • The architecture of SAMALM facilitates self-verification and iterative refinement of robot actions, balancing individual robot autonomy with global team coordination in complex social environments through multi-agent LLM actor-critic approach.

12th March 2025

PLAN-AND-ACT: Improving Planning of Agents for Long-Horizon Tasks

  • PLAN-AND-ACT: introduces PLANNER, EXECUTOR and Dynamic Replanning to improve agent planning for long-horizon tasks.
  • PLANNER decomposes user queries into high-level plans, while EXECUTOR translates plans into environment actions, and Dynamic Replanning updates plans based on environment changes.
  • PLAN-AND-ACT framework separates planning and execution responsibilities to enhance performance in complex, long-horizon tasks within dynamic environments.

Large Language Models for Multi-Facility Location Mechanism Design

  • LLMMech (Large Language Models for Mechanism Design): introduces an evolutionary framework integrating LLMs to automate the design of strategyproof mechanisms using InitializationPrompt, Select, VariationPrompt, PopulationManager and PromptEvolution components.
  • LLMMech framework leverages LLMs for generating interpretable, hyperparameter-free, and empirically strategyproof mechanisms for multi-facility location problems.
  • The framework incorporates PromptEvolution to automatically refine prompts, enhancing the diversity of generated mechanisms and improving search for optimal solutions.

REMA: LEARNING TO META-THINK FOR LLMS WITH MULTI-AGENT REINFORCEMENT LEARNING

  • ReMA (Reinforced Meta-thinking Agents): introduces multi-agent reinforcement learning framework with high-level meta-thinking agent, low-level reasoning agent, meta-thinking, reasoning, MARL, feedback, and solution.
  • ReMA framework decouples reasoning into hierarchical agents: meta-thinking agent for strategic planning and reasoning agent for detailed execution, enabling collaborative learning.
  • Iterative reinforcement learning with aligned objectives in ReMA facilitates agent collaboration, improving generalization and robustness in complex reasoning tasks.

COLA: A SCALABLE MULTI-AGENT FRAMEWORK FOR WINDOWS UI TASK AUTOMATION

  • COLA (Collaborative Multi-Agent framework for automating Windows UI operations) introduces a multi-agent framework with Planner, Task Scheduler, Decision Agent Pool, Executor, and Reviewer, enhanced by Short-Term and Long-Term Memory units for Windows UI task automation.
  • COLA framework utilizes a Task Scheduler to dynamically assign coarse-grained subtasks from Planner to specialized agents within Decision Agent Pool, enabling flexible and scalable task execution.
  • The framework incorporates memory units for agent self-evolution and an interactive backtracking mechanism for non-destructive error recovery, improving robustness and performance in UI automation tasks.

AdaptAI: A Personalized Solution to Sense Your Stress, Fix Your Mess, and Boost Productivity

  • AdaptAI (AdaptAI: A Personalized Solution to Sense Your Stress, Fix Your Mess, and Boost Productivity): introduces a multimodal AI system, with Processing Module (integrates multimodal real-time streams), External Task Agents (automates simple extra tasks), Personalized Well-being Intervention Pipeline (delivers personalized interventions), and Tone-adaptive Conversational Agent TCA (adjusts tone based on heart activity), to provide personalized productivity and well-being support.
  • AdaptAI leverages egocentric vision, audio, heart rate, and motion data, processed by Speech-to-Text (audio to text conversion), EGOCENTRIC CAPTION LLM (processes egocentric vision captions), VLM (vision language model), SCREEN CAPTION LLM (processes screen captions), Stress Estimation (assesses user stress levels), and Movement Estimation (assesses user movement) within the Processing Module, alongside Live Routine Mapping (maps user's daily activities) and Memory (temporary data storage) for context-aware interventions.
  • The framework employs External Task Agents (automates simple extra tasks) like Email Agent (manages email tasks) and Meeting Agent (manages meeting tasks) to streamline workflows, while Personalized Intervention (provides personalized well-being support) and TCA (adjusts tone based on heart activity) enhance user experience by addressing physical and psychological states dynamically.

LocAgent: Graph-Guided LLM Agents for Code Localization](http://arxiv.org/abs/2503.09089v1)

  • LOCAGENT (LOCAGENT): introduces graph-oriented LLM-agent framework, with Codebase, Code Graph Indexing, Entity Indexing, Agent Runtime, Tools, LLM Agent, Observation, Event Log, Action, Localized Code Sections, Result, Entity ID Index, Entity Name Index, BM25 Index on Entity IDs, BM25 Index on Entity Contents, contain, import, invoke, and inherit, for code localization using graph-based representation and agent-guided search.
  • LOCAGENT framework utilizes Code Graph Indexing and Entity Indexing to create efficient codebase representations, enabling LLM Agent within Agent Runtime to use Tools for navigating and searching code, ultimately providing Localized Code Sections as Result.
  • LOCAGENT's architecture emphasizes structured code exploration through graph-based indexing and specialized tools, facilitating accurate and cost-effective code localization by leveraging LLM Agent's reasoning within Agent Runtime environment.

Agentic Control for Safe Autonomous Stunt Maneuvers

  • ManeuverGPT: introduces agentic framework with Query Enricher Agent, Driver Agent, Parameter Validator Agent and Orchestrator for generating stunt maneuvers.
  • It iteratively refines control parameters using feedback and validation for safe execution.
  • The framework combines LLM-based reasoning with algorithmic validation for flexible high-dynamic maneuvers.

ARCHED: A Human-Centered Framework for Transparent, Responsible, and Collaborative AI-Assisted Instructional Design

  • ARCHED (AI for Responsible, Collaborative, Human-centered Education Instructional Design) introduces a three-phase framework for AI-assisted instructional design, incorporating Web Interface (LOGS), Educational Parameters Specification, LOGS, OAE Analysis, Temporary Repository for Refinement, Cognitive Demand Analysis, and Innovative Assessment Strategies to enhance human-AI collaboration.
  • ARCHED framework utilizes LOGS for initial learning objective generation based on educator-specified parameters and OAE Analysis for evaluating objectives against pedagogical criteria, ensuring iterative refinement and alignment.
  • The framework aims to maintain human agency and pedagogical rigor in AI-assisted instructional design by providing transparent AI reasoning and promoting diverse assessment strategies through specialized components.

Distributionally Robust Multi-Agent Reinforcement Learning for Dynamic Chute Mapping

  • DRMARL (Distributionally Robust Multi-Agent Reinforcement Learning): introduces a framework for dynamic chute mapping robust to induction rate variations, utilizing Agents(Control chute allocation per destination), a Shared Local Q-Network(Estimates action value per agent), Target Network(Stabilizes Q-learning), Experience Replay Buffer(Stores transitions for training), Value Decomposition Network(Aggregates local Q-values), Integer Program Solver(Selects joint actions under budget), Induction Distribution Groups(Represent historical patterns), Ambiguity Set(Defines uncertainty over groups), Distributionally Robust Bellman Operator(Optimizes for worst-case reward), Contextual Bandit Predictor(Predicts worst-case group efficiently), and CB Replay Buffer(Stores CB transitions), where agents learn chute allocation policies resilient to adversarial induction rate shifts by optimizing for worst-case performance across distribution groups.
  • The framework integrates group Distributionally Robust Optimization (DRO) into Multi-Agent Reinforcement Learning (MARL) to handle uncertainty in package induction patterns derived from historical data groups.
  • A Contextual Bandit (CB)-based predictor efficiently identifies the worst-case induction distribution group for each state-action pair, reducing the computational complexity of training compared to exhaustive search methods.

11th March 2025

ReviewAgents: Bridging the Gap Between Human and AI-Generated Paper Reviews

  • ReviewAgents: introduces multi-agent framework with Reviewer- and Area Chair-Agents, utilizing Papers as input and producing Meta Review as output for emulating human peer review process.
  • Reviewer Agent generates structured review comments, while Area Chair Agent synthesizes meta-review from multiple reviewer comments, aiming to align with human review behavior.
  • Framework employs relevant-paper-aware training and structured reasoning (Summarization, Analysis, Conclusion) to enhance review comment generation and reduce biases inherent in single LLM reviews.

CoLMDriver: LLM-based Negotiation Benefits Cooperative Autonomous Driving

  • CoLMDriver (Cooperative Language-Model-based Driver): introduces a full pipeline system for cooperative autonomous driving, incorporating VLM-based Intention Planner for high-level goals, Dynamic Graph Grouping for agent selection, and LLM-based Negotiator with Actor/Critic for language-based consensus.
  • CoLMDriver framework utilizes parallel pipelines with high-level guidance and low-level planning, integrating Perception Module for environmental understanding and Intention-guided Waypoint Planner with Control Module for real-time execution based on Sensor Data.
  • CoLMDriver employs Negotiation Quality Evaluator and Environment awareness to refine driving strategies through iterative feedback, enhancing safety and efficiency in multi-agent interactive scenarios by leveraging language-based communication.

EMMOE: A Comprehensive Benchmark for Embodied Mobile Manipulation in Open Environments

  • HOMIEBOT: introduces a hierarchical framework for embodied mobile manipulation, integrating LLM (Large Language Model) for high-level planning, Word Embedding Layer, Encoder Zoo, Share Projection for input processing, Replan for adaptation, Low Level Execution for action, Perception for environment understanding, and Memory for context.
  • The framework decomposes tasks into high-level planning and low-level execution, utilizing perception and memory for real-time feedback and adaptation in open environments.
  • HOMIEBOT's hierarchical design and component integration address challenges in long-horizon tasks and complex environments, offering a robust agent architecture for embodied AI.

GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training

  • GTR (Guided Thought Reinforcement): introduces a framework integrating Thought Guidance, Thought Dataset, Cross Entropy Loss, SFT, PPO, RL Finetune, VLM Agent, Corrector Model, Game Env, Data Buffer, PPO Loss, and Reward to enhance VLM agent training by incorporating automated thought correction with reinforcement learning.
  • Guided Thought Reinforcement framework leverages Supervised Fine-Tuning for thought tokens and Proximal Policy Optimization for action tokens, enabling simultaneous training of reasoning and action within a Reinforcement Learning fine-tuning process.
  • The framework addresses thought collapse in Vision-Language Model agents by using a Corrector Model to provide Thought Guidance, thereby improving decision-making capabilities and overall performance in complex visual tasks.

Seeing and Reasoning with Confidence: Supercharging Multimodal LLMs with an Uncertainty-Aware Agentic Framework

  • SRICE (Seeing and Reasoning with Confidence): introduces a training-free multimodal reasoning framework, with Tool call and calibration-, Reasoning: ROI selection-, ROI extraction- and Reasoning: Final answer generation-components, using Uncertainty Score-component, that integrates external vision models with uncertainty quantification into a Multimodal Large Language Model for enhanced visual reasoning.
  • SRICE framework employs a two-stage process, initially calibrating external tool outputs and selecting regions of interest autonomously, and subsequently generating a refined answer through chain-of-thought reasoning based on uncertainty estimation.
  • The framework leverages conformal prediction for uncertainty quantification, ensuring reliable tool utilization and enhancing the trustworthiness of the final answer in multimodal tasks.

General-Purpose Aerial Intelligent Agents Empowered by Large Language Models

  • AIA Framework (Aerial Intelligent Agents Framework): introduces a two-stage approach with user prompt, DeepSeek R1 (LLM for task planning), user check, sensors, Llama 3 (VLM and LLM for execution), perception model, state estimation and mapping, waypoints, flight control unit, and files, for general-purpose aerial tasks.
  • The framework integrates LLM-based deliberative planning with reactive control modules for autonomous UAV operation in open environments, utilizing onboard edge computing.
  • This architecture enables bidirectional communication between high-level task planning and low-level reaction pipelines, facilitating complex mission execution in communication-constrained scenarios.

A Cascading Cooperative Multi-agent Framework for On-ramp Merging Control Integrating Large Language Models

  • CCMA (Cascading Cooperative Multi-agent) framework: introduces a hierarchical multi-agent system for on-ramp merging, integrating Individual-level Decision-making Agent, Region-level Decision-Making Agent, Global-level Decision-making Agent, Vehicle Cooperate Prompt, Reward Prompt, Reward Function, Retrieval-augmented Generation, LoRA, MMLMs, Database, Environment, Observation, Critical Thinking, Reflection, Inference Range, Mainlane Aggressive Strategy, Mainlane Conservative Strategy, Update Agents Weights, Loss, Output, Prediction, Steps, Intent Actions, Drive Style, JSON Data, Goal Analysis, Vehicle Specific Analysis, Action Prioritization, and Final Decision to enhance autonomous driving in complex scenarios.
  • CCMA framework employs RL for individual agent actions, fine-tuned LLM for regional cooperation utilizing prompts and semantic reasoning, and Retrieval-augmented Generation for global reward optimization, achieving improved merging success rates.
  • The framework's hierarchical design and integration of LLMs with RL enable dynamic adaptation to varying traffic conditions and driving styles, leading to more efficient, safe, and human-like autonomous driving behaviors.

Guess What I am Thinking: A Benchmark for Inner Thought Reasoning of Role-Playing Language Agents

  • MIRROR (Memory Integration and Role Reasoning with Observation Reflection): introduces memory recall, theory of mind thinking, and reflection & summarization components for character inner thought generation.
  • MIRROR framework retrieves memories, predicts reactions of related objects, and synthesizes results to generate character inner thoughts based on scenario.
  • MIRROR enhances role-playing agents by enabling structured reasoning and improving understanding of character motivations for complex tasks.

Privacy-Enhancing Paradigms within Federated Multi-Agent Systems

  • EPEAgents (Embedded Privacy-Enhancing Agents): introduces privacy-aware collaborative framework with EPEAgent, Agents, User Profiles, Task/Query, Iteration, Privacy Preservation, Information theft, and Refuse to answer components.
  • EPEAgents framework employs intermediary EPEAgent to filter data flow between Agents based on User Profiles and Task/Query within iterative process, aiming for Privacy Preservation and preventing Information theft.
  • This approach ensures task-relevant and agent-specific information sharing, integrating into Retrieval-Augmented Generation and context retrieval stages for enhanced privacy in multi-agent systems.

FilmComposer: LLM-Driven Music Production for Silent Film Clips

  • FilmComposer: introduces a framework for LLM-driven music production for silent films, with Visual Processing (processes film clips), Rhythm-Controllable MusicGen (generates rhythm-controlled music), and Multi-Agent Assess, Arrange and Mix (multi-agent system for music production) modules.
  • FilmComposer framework uses Visual Processing module with LLM-Vision (vision language model) and CR(hythm)T (rhythm transformer) to extract visual cues, Rhythm-Controllable MusicGen module with Rhythm Conditioner (conditions music on rhythm) and Musicgen Decoder (decodes music from conditions) to generate melody, and Multi-Agent Assess, Arrange and Mix module with Assess Agent (evaluates musicality) and Mix Agent (performs audio mixing) to refine music.
  • FilmComposer framework simulates musician workflow by integrating modules for visual analysis, rhythm-aware music composition, and multi-agent based arrangement and mixing, aiming for high-quality, musically rich, and film-synchronized music generation.

LLM4MAC: An LLM-Driven Reinforcement Learning Framework for MAC Protocol Emergence

  • LLM4MAC (LLM-Driven Reinforcement Learning Framework for MAC Protocol Emergence): introduces a framework with Decision Maker, Large Language Model, Functional Alignment, and UDTS Environment, for MAC protocol emergence using reinforcement learning and language models.
  • The framework utilizes a Decision Maker comprising Unified Policy Fusion and Fragmented Policy to manage UE Action Candidate and BS Action Candidate, while employing PPO-based Functional Alignment within a Large Language Model with Critic and Value Head for optimization.
  • Action Interpreter components facilitate communication between the Large Language Model and UDTS Environment, which includes Dedicated PDCCH-PUCCH, Shared PUSCH, dPDU, UCM, DCM, BS, UE, Dynamic UE, and SIE Prompt Formulation with UEO to BS Query, BS to UEO Query, Observation Prompt, and Entity Identifier Prompt for structured interactions.

AI-native Memory 2.0: Second Me

  • SECOND ME: introduces a hybrid architecture with User, Device, Second Me, Agent Model, and layered memory (L0, L1, L2) to provide personalized AI memory system.
  • SECOND ME acts as context provider and intermediary, utilizing inner loop for layer integration and agent model with reasoning, knowledge, tools, experts, and internet for responses.
  • The framework enhances memory management through LLM-based parameterization, enabling structured organization, contextual reasoning, and adaptive knowledge retrieval for seamless user interactions.

In Prospect and Retrospect: Reflective Memory Management for Long-term Personalized Dialogue Agents

  • RMM (Reflective Memory Management): introduces a novel mechanism for long-term dialogue agents, integrating Prospective Reflection and Retrospective Reflection components.
  • Prospective Reflection: tackles fixed granularity by summarizing dialogue histories into topic-based memory structures for future retrieval, while Retrospective Reflection addresses fixed retrievers by refining retrieval online using LLM-generated attribution signals.
  • The framework components Memory Bank, Retriever, Reranker, and LLM work together to achieve efficient memory management and adaptive retrieval for personalized dialogue responses.

10th March 2025

MAGNET: Multi-turn Tool-use Data Synthesis and Distillation via Graph Translation

  • MAGNET (Multi-turn function-cAlling data synthesis with Graph Translation): introduces a framework for synthesizing multi-turn tool-use data, incorporating Function Collection, Local Dependency Graph, Function Signature Path, Enhanced Function Signature Path, Back-and-Forth Translation, Positive Trajectories, Negative Trajectories, Teacher Model, and Student Model.
  • MAGNET framework utilizes a graph-based approach to construct function signature paths and context distillation with teacher and student models to generate positive and negative training trajectories.
  • The framework aims to improve function calling capabilities of LLM agents in multi-turn conversations by generating high-quality training data and addressing challenges like nested function calls and long dependencies.

Fully Autonomous Programming using Iterative Multi-Agent Debugging with Large Language Models

  • SEIDR (Synthesize, Execute, Instruct, Debug, and Repair): introduces multi-agent iterative framework employing language models for program synthesis and debugging, utilizing SYNTHESIZE, EXECUTE, INSTRUCT, DEBUG, and RANK Agents along with associated components.
  • SEIDR framework iteratively refines program candidates through feedback from code execution and test failures, incorporating language models for synthesis, debugging, and instruction generation, guided by input-output pairs and task descriptions.
  • The framework explores repair-replace trade-off strategies and parent selection algorithms to overcome near-miss syndrome in program synthesis, aiming for fully autonomous programming with large language models.

MEDAGENTSBENCH: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning

  • MEDAGENTSBENCH: introduces benchmark evaluating medical reasoning of models through multi-agent collaboration, zero-shot decision making, dynamic multi-agent collaboration, agentic workflow generation, prompt optimization and self-supervision.
  • MEDAGENTSBENCH is designed to assess complex medical reasoning requiring deep domain expertise and multi-step processes, utilizing established medical datasets and adversarial filtering.
  • MEDAGENTSBENCH framework facilitates analysis of performance, cost and inference time trade-offs for various models and agent-based methods in medical question answering tasks.

LLMs syntactically adapt their language use to their conversational partner

  • Framework name: introduces an approach to investigate syntactic adaptation in LLMs, with LLM Agents (conversational agents using GPT-4o), Language Personas (varied language use prompts), LLM Conversations (agent-agent dialogues), PRIME and TARGET Split (conversation sections for analysis), Syntactic Parser (extracts syntactic rules), Context-Free Production Rules (parsed syntactic structures), SameConv Variable (same/different conversation indicator), Rule Frequency (rule occurrence count), Rule Set Size (number of unique rules), GLMM Analysis (statistical mixed-effects model), Jensen-Shannon Divergence (distribution distance metric), Switchboard Corpus (human conversation baseline), GPT Corpus (LLM conversation dataset), and Bootstrapping (variance estimation method).
  • The approach analyzes syntactic adaptation by comparing rule distributions in PRIME and TARGET sections of LLM conversations, using Jensen-Shannon Divergence to quantify syntactic similarity changes.
  • The study demonstrates that LLMs exhibit syntactic adaptation similar to humans, suggesting implicit learning of conversational partner's syntactic patterns.

Dynamic Path Navigation for Motion Agents with LLM Reasoning

  • LLM Navigator: introduces a system for autonomous path navigation using LLMs, with floor plan representation, spatial encoding, agents, environment constraints, and motion generation agent.
  • The system utilizes LLM Navigator to generate collision-free trajectories for single or multiple agents in dynamic environments, considering spatial reasoning and obstacle avoidance.
  • This approach enables zero-shot spatial reasoning and generalizes to complex scenarios, offering a training-free method for dynamic path planning and humanoid motion generation.

Experimental Exploration: Investigating Cooperative Interaction Behavior Between Humans and Large Language Model Agents

  • Experimental Framework: introduces Participants interacting in Prisoner's Dilemma Game with LLM Agent, Rule-based AI Agent, and Purported Human Agent within Experimental Setup, assessed via Questionnaires and Interviews.
  • This framework explores human cooperative behavior variations influenced by agent characteristics and participant gender in competitive settings.
  • The study offers insights into human-AI competitive dynamics, informing future AI agent design and human-AI collaboration strategies.

Automated Movie Generation via Multi-Agent CoT Planning

  • MovieAgent (Automated Movie Generation via Multi-Agent CoT Planning): introduces multi-agent framework with Director-, Scene Plan- and Shot Plan-Agents employing CoT-reasoning, using Script Synopsis and Character Bank as input, to generate Script Breakdown, Scene Planning, Shot List Creation, and finally Plot to Video and Audio Generation.
  • MovieAgent framework simulates hierarchical real-world movie production, automating narrative structuring, scene composition and shot design for long-form video generation.
  • The framework enhances narrative coherence, character consistency, and cinematic quality in generated videos by decomposing filmmaking process into specialized agent roles and structured reasoning steps.

DatawiseAgent: A Notebook-Centric LLM Agent Framework for Automated Data Science

  • DatawiseAgent (notebook-centric LLM agent framework): introduces FST-based multi-stage design for automated data science, orchestrating DFS-like planning, incremental execution, self-debugging, and post-filtering stages with unified interaction representation using markdown and executable code cells, and memory.
  • This framework leverages notebook design and adaptive strategies, enabling flexible and adaptive automation of data science tasks by exploring solution space, using real-time feedback, correcting errors, and pruning information.
  • DatawiseAgent enhances reliability and efficiency in data science workflows by integrating self-debugging and post-filtering modules, addressing limitations of existing LLM-based approaches in comprehensive end-to-end support.

Combating Partial Perception Deficit in Autonomous Driving with Multimodal LLM Commonsense

  • LLM-RCO (LLM-Guided Resilient Control Override): introduces framework using Hazard Inference Module, Short-term Motion Planner, Action Condition Verifier, and Safety Constraint Generator to enhance autonomous driving safety during perception deficits.
  • LLM-RCO framework utilizes Observation, Navigation, LIDAR, and Camera Frames as inputs for Planning Actions and Acting Action, with Re-Planning capability for dynamic adjustments.
  • LLM-RCO improves driving performance by enabling proactive and context-aware control actions, overriding conventional conservative safety protocols.

Toward Multi-Session Personalized Conversation: A Large-Scale Dataset and Hierarchical Tree Framework for Implicit Reasoning

  • TACITREE (Hierarchical Tree Framework): introduces hierarchical tree framework, with Persona Hierarchy, Fact Clustering, Raw Conversation and LLM Retrieval-components, where framework structures conversation history for efficient retrieval.
  • TACITREE framework organizes long-term conversation history into hierarchical structure, enabling level-based retrieval of implicit knowledge.
  • This approach reduces search space and improves retrieval efficiency by clustering and summarizing conversation facts.

ProjectEval: A Benchmark for Programming Agents Automated Evaluation on Project-Level Code Generation

  • ProjectEval: introduces a benchmark for automated evaluation of project-level code generation, with Input-, Test Suite- and Canonical Solution-components.
  • ProjectEval simulates user interaction for execution-based evaluation and uses code similarity for objective assessment.
  • ProjectEval enhances explainability by offering three input levels and detailed evaluation metrics for programming agents.

DynTaskMAS: A Dynamic Task Graph-driven Framework for Asynchronous and Parallel LLM-based Multi-Agent Systems

  • DynTaskMAS (Dynamic Task Graph-driven Framework for Asynchronous and Parallel LLM-based Multi-Agent Systems): introduces a novel framework for asynchronous parallel LLM-based multi-agent systems using Input Task, Dynamic Task Graph Generator, Asynchronous Parallel Execution Engine, Semantic-Aware Context Management System, Adaptive Workflow Manager and LLM-based Agent components.
  • DynTaskMAS framework employs Dynamic Task Graph Generator to decompose tasks into DAGs, Asynchronous Parallel Execution Engine for parallel execution, Semantic-Aware Context Management System for context sharing and Adaptive Workflow Manager for system optimization.
  • The framework aims to enhance efficiency and adaptability in LLM-based multi-agent systems by addressing challenges in task decomposition, parallel processing, and context management through dynamic task orchestration.

ReAgent: Reversible Multi-Agent Reasoning for Knowledge-Enhanced Multi-Hop QA

  • ReAgent (Reversible Multi-Agent Reasoning): introduces a reversible multi-agent framework with backtracking, comprising Execution-, Supervisory-, and Interaction-Layers, for knowledge-enhanced multi-hop question answering.
  • ReAgent framework incorporates Decomposition-, Retrieval-, Validation-, and Aggregation-Agents within the Execution Layer, alongside Supervisor- and Controller-Agents in Supervisory Layer, and Persistent Log, Temporal Tracker, Messaging Channel in Interaction Layer.
  • The framework facilitates error correction in multi-hop reasoning by enabling agents to backtrack and revise inferences based on detected conflicts or contradictory evidence, enhancing robustness and interpretability of QA outcomes.

Beyond Code Generation: LLM-supported Exploration of the Program Design Space

  • PAIL: introduces an IDE integrating ConversationAgent (LLM chatbot for code interaction), DesignAgent (LLM agent suggesting design options), ReflectionAgent (LLM agent tracking design choices), Code Pane (code editor for program modification), Output Pane (program execution display area), Chat Pane (conversation history display), and Design Pane (design aid panel for questions/decisions) to support LLM-assisted program design exploration.
  • This framework facilitates iterative program design by abstracting problem formulations and solutions, tracking design goals and requirements, and making implicit LLM decisions explicit.
  • PAIL addresses challenges of user attention management and information overload inherent in LLM-assisted program design through its integrated component architecture.

SafePlan: Leveraging Formal Logic and Chain-of-Thought Reasoning for Enhanced Safety in LLM-based Robotic Task Planning

  • SafePlan (SafePlan): introduces a multi-component framework, Prompt Sanity Check COT Reasoner, Societal Alignment Layer, Organizational Alignment Layer, Individual Alignment Layer, Invariant COT Reasoner, and Task Allocation Code Generation, for enhancing safety in LLM-based robotic systems by integrating formal logic and chain-of-thought reasoning.
  • SafePlan framework decomposes natural language prompts into reasoning steps, utilizing Prompt Sanity Check COT Reasoner with Societal, Organizational, and Individual Alignment Layers to evaluate prompt safety before proceeding to Invariant COT Reasoner for generating and verifying safety conditions like invariants, preconditions, and postconditions, finally using Task Allocation Code Generation for robot task allocation.
  • SafePlan framework aims to improve safety by systematically verifying natural language commands through multi-layered checks and logical formalization, contrasting with traditional LLM approaches that may overlook safety implications in robotic task planning.

GUIDE-CoT: Goal-driven and User-Informed Dynamic Estimation for Pedestrian Trajectory using Chain-of-Thought

  • GUIDE-CoT (Goal-driven and User-Informed Dynamic Estimation for pedestrian Trajectory using Chain-of-Thought): introduces trajectory prediction framework leveraging visual prompts and chain-of-thought reasoning, incorporating scene image, visual prompt, history heatmap, semantic map, alpha compositing, visual condition, pretrained visual encoder, goal module, goal probability map, sampled chain-of-thought, LLM and trajectory components.
  • GUIDE-CoT framework enhances pedestrian trajectory prediction by predicting pedestrian goals using visual prompts and subsequently generating trajectories towards these goals with chain-of-thought reasoning within a Large Language Model.
  • The framework's goal-oriented visual prompt and chain-of-thought approach allows for controllable and adaptable trajectory generation, improving accuracy and user-guided modifications in pedestrian path prediction.

9th March 2025

AutoMisty: A Multi-Agent LLM Framework for Automated Code Generation in the Misty Social Robot

  • AutoMisty: introduces a multi-agent LLM framework for automated robot code generation, encompassing User, Task, AutoMisty Framework, Planner Agent, Action Agent, Touch Agent, Audiovisual Agent, Code Script, Execute, Misty Robot, and Memory.
  • AutoMisty framework utilizes specialized agents for task planning, assignment, problem-solving, and code synthesis, incorporating self-reflection and human-in-the-loop mechanisms for iterative refinement and adaptation to user preferences.
  • The framework's architecture includes Planner, Action, Touch, and Audiovisual Agents, each with internal components like Drafter, Designer, Critic, and Memory, leveraging optimized Misty APIs to generate executable code for the Misty robot.

Delusions of Large Language Models

  • Delusion Analysis Methodology: introduces a methodology for analyzing LLM delusions, employing Logits-based Belief, Verbalized Belief, and Consistency Belief components.
  • This methodology uses uncertainty estimation as proxy for model belief, categorizing incorrect LLM outputs into hallucinations or delusions based on belief thresholds.
  • The framework differentiates delusions from hallucinations by quantifying belief through logits, verbalized confidence, and consistency, revealing delusions' high-confidence nature.

EXPLORING LLM AGENTS FOR CLEANING TABULAR MACHINE LEARNING DATASETS

  • Cleaning Agent Framework: introduces a framework for systematic LLM-based data cleaning using IPython and Performance Evaluation tools within iterative process guided by chat history.
  • The framework employs LLM agent to identify and rectify dataset errors through IPython code execution and performance feedback from Performance Evaluation tool.
  • This research investigates LLMs effectiveness in enhancing data quality by error detection and correction, keeping training pipeline and feature engineering fixed.

Advancing AI Negotiations: New Theory and Evidence from a Large-Scale Autonomous Negotiations Competition

  • International AI Negotiations Competition: introduces agent warmth, agent dominance, value claimed, counterpart subjective value, points earned, value created, and deal reached, where competition investigates AI negotiation strategies and outcomes.
  • The competition framework evaluates the impact of agent warmth and dominance on negotiation success across distributive and integrative scenarios.
  • Findings suggest warmth enhances deal-making, while dominance impacts value claiming, highlighting nuances in AI negotiation dynamics.

Performant LLM Agentic Framework for Conversational AI

  • PAF (Performant Agentic Framework): introduces a novel system for conversational AI, utilizing Navigation Map (workflow graph structure) composed of Nodes (workflow steps) and Edges (conditional transitions), guided by a LLM Agent (reasoning and response generation) informed by Conversation History (interaction record) and employing Vector-Based Node Search (semantic node selection) or LLM as Judge (fallback node selection) based on Prompt (instructional messages) and Threshold (confidence level), triggering Actions (node triggered operations) using Vectorized Instructions (precomputed instruction embeddings).
  • Performant Agentic Framework (PAF) balances accuracy and latency in conversational workflows by combining LLM-based reasoning with vector scoring for efficient node selection within the Navigation Map (workflow graph structure), reducing reliance on large context windows and optimizing computational steps.
  • The framework, Performant Agentic Framework (PAF), addresses limitations of existing agentic systems by removing extra validation iterations, improving alignment through step-by-step logic, reducing context window size, and introducing vector-based scoring for semantic similarity in conversational AI applications.

8th March 2025

Towards Conversational AI for Disease Management

  • AMIE (Articulate Medical Intelligence Explorer): introduces a system designed for clinical management and dialogue, utilizing components like Vignette Generator, Dialogue Agent, and Mx Agent, to incorporate reasoning over disease evolution, patient encounters, and medication prescription.
  • AMIE framework employs a dual-agent system, featuring a Dialogue Agent for patient interaction and a Management Reasoning (Mx) Agent for evidence-based management plan generation, both leveraging LLMs and clinical guidelines to ensure reasoning grounded in authoritative knowledge.
  • The system refines its capabilities through simulated dialogue environments and reinforcement learning, and its performance is rigorously evaluated in a multi-visit remote OSCE study, demonstrating non-inferiority to primary care physicians in management reasoning and outperformance in treatment preciseness and guideline alignment.

DSGBench: A Diverse Strategic Game Benchmark for Evaluating LLM-based Agents in Complex Decision-Making Environments

  • DSGBench (Diverse Strategic Game Benchmark): introduces a comprehensive benchmark platform with Environment, Capability Metrics, Decision-Tracking and Behavior Analysis, Observation to Prompt, Response to Action, LLM-based Agent, Opponent Agent, Metrics, Analyze, Collect, Async and Sync to evaluate strategic decision-making of LLM-based agents in complex games.
  • DSGBench offers diverse strategic games, fine-grained metrics, and decision trajectory analysis for in-depth assessment of agent capabilities in long-term strategic planning and real-time decision-making.
  • The benchmark facilitates detailed analysis of agent behavior patterns and strategy changes through observation-to-prompt and response-to-action loops in dynamic multi-agent scenarios.

7th March 2025

Enhancing Reasoning with Collaboration and Memory

  • Collaborative Memory Reasoning Framework: introduces a system combining reasoning styles, multi-agent collaboration, and memory banks to enhance LLM reasoning performance.
  • This framework employs varied-context agents and a summarizer agent, utilizing frozen and learned memory banks with different retrieval mechanisms for improved performance.
  • The system systematically studies the contribution of various methods to LLM reasoning across different tasks, highlighting the effectiveness of random exemplar selection and the role of memory in diverse scenarios.

A Survey of Large Language Model Empowered Agents for Recommendation and Search: Towards Next-Generation Information Retrieval

  • LLM Agents (Large Language Model Agents): introduces Perception-, Control- and Action-Modules, incorporating Storage, Memory Unit, Knowledge Database, Collecting, Processing, Decision, Planning Unit, Reasoning Unit, Embodiment, Toolbox, and Feedback components for autonomous computing entities.
  • LLM Agents utilize Perception Module to obtain information, Control Module for analysis and strategy, and Action Module to implement decisions, equipped with Storage for memory and knowledge, and Toolbox for external tool access.
  • These agents enhance traditional systems by integrating multimodal inputs, demonstrating improved understanding, adaptability, and creative thinking for complex tasks in various domains.

Symbolic Mixture-of-Experts: Adaptive Skill-based Routing for Heterogeneous Reasoning

  • SYMBOLIC-MOE (Symbolic Mixture-of-Experts): introduces a symbolic Mixture-of-Experts framework with preprocessing (initial framework setup) and inference-time (reasoning execution) stages for adaptive skill-based recruitment (dynamic expert selection) in heterogeneous reasoning tasks.
  • SYMBOLIC-MOE framework incorporates model profile creation (model skill assessment) and aggregator selection (best aggregator choice) in preprocessing, and utilizes router (skill-based expert routing), experts (specialized LLMs), and aggregator (output synthesizer) components during inference.
  • The framework employs batch inference mechanism (efficient expert integration) and skill-based routing (expert selection based on skills) to achieve efficient and performant heterogeneous reasoning by dynamically selecting and combining specialized pre-trained language models.

GEMA-Score: Granular Explainable Multi-Agent Score for Radiology Report Evaluation

  • GEMA-Score (Granular Explainable Multi-Agent Score): introduces a multi-agent framework with Entity Extraction-, Objective Clinical Accuracy-, Subjective Expressiveness Evaluation- and Score Evaluation-Agents, alongside Multi-agent Evaluation System and Scoring and Summary Stage, for comprehensive radiology report evaluation.
  • GEMA-Score framework assesses both objective clinical accuracy using NER-F1 metrics and subjective expressiveness encompassing completeness, readability, and clinical utility of generated radiology reports.
  • The framework aims to provide granular, interpretable, and reliable evaluation of radiology reports, overcoming limitations of existing metrics by employing specialized agents for distinct evaluation tasks.

MM-StoryAgent: Immersive Narrated Storybook Video Generation with a Multi-Agent Paradigm across Text, Image and Audio

  • MM-StoryAgent: introduces a multi-agent framework with Story-, Image-, Speech-, Sound-, Music- and Video Compose-Agents to generate immersive narrated storybook videos.
  • This framework employs Expert, Amateur Writer, Outline Writer, Chapter Writer, and Reviewer agents within the Story Agent for multi-stage story writing and modality-specific agents with Prompt Revisers for asset generation.
  • MM-StoryAgent enhances story quality and immersive experience by integrating multi-channel audio and role-consistent images, offering a flexible platform for storybook video creation.

ORANSight-2.0: Foundational LLMs for O-RAN

  • ORANSight-2.0 (O-RAN Insights): introduces RANSTRUCT, a Retrieval-Augmented Generation-based instruction-tuning framework, for creating fine-tuning datasets, utilizing pre-trained models and QLoRA for efficient adaptation, evaluated by ORANBench and srsRANBench.
  • ORANSight-2.0 framework employs RANSTRUCT, which includes Recursive Splitter, Embedding Generator, Question Generator, and Answer Generator, to process O-RAN Specifications and srsRAN Code Files, building FAISS Database and Questions Database for dataset generation.
  • ORANSight-2.0 aims to bridge the gap for domain-specific foundational models in O-RAN by providing open-source alternatives and demonstrating superior performance compared to closed-source models in O-RAN and code-related tasks.

A Comprehensive LLM-powered Framework for Driving Intelligence Evaluation

  • LLM-powered Framework (LLM-powered Driving Evaluation Framework): introduces a comprehensive approach for evaluating driving behavior intelligence, integrating real-world driving data, simulated scenarios, knowledge graph, RAG, and LLM evaluation system.
  • This framework assesses safety, intelligence, and comfort through distinct evaluation modules, culminating in a comprehensive evaluation conclusion.
  • The framework leverages driving context derived from diverse data sources to enable nuanced and accurate assessments of autonomous driving performance.

MASTERMINDEVAL: A SIMPLE BUT SCALABLE REASONING BENCHMARK

  • MASTERMINDEVAL: introduces a deductive reasoning benchmark employing Evaluator Class (game logic implementation) and LLM (codebreaker agent) to evaluate language model reasoning capabilities in Mastermind game.
  • It features agentic and deductive evaluation paradigms to assess strategic gameplay and reasoning respectively.
  • The benchmark aims to address limitations of existing benchmarks by offering scalability and interpretability in reasoning assessment.

6th March 2025

Bridging the AI Adoption Gap: Designing an Interactive Pedagogical Agent for Higher Education Instructors

  • Two-Phase Study Design with Pedagogy Experts: introduces a two-phase study design with pedagogy experts, with Interaction Design, Formative Study, Refine 2 Storyboard Design, LLM-Generated Suggestions Evaluations, Refine Prompts to Generate Answers, Collect and Generate Answers, ChatGPT with RAG, Participatory Design, Activity 1: Discussion of 2 Storyboards, Activity 2: Discussion of 5 QA pairs, and Post-Study Rating of 20 QA pairs.
  • This study employs formative expert interviews and participatory design sessions to investigate interactive pedagogical agents for instructors.
  • The framework utilizes ChatGPT with Retrieval-Augmented Generation to produce teaching suggestions, which are then evaluated by pedagogy experts.

SAFEARENA: Evaluating the Safety of Autonomous Web Agents

  • SAFEARENA: introduces benchmark for evaluating web agent safety, with User Intent Input, Web Agent, Web Environment, Action Execution, Observation, and Agent Risk Assessment (ARIA) framework.
  • SAFEARENA assesses agent behavior across harm categories on realistic websites, evaluating agent's capability to perform harmful tasks.
  • SAFEARENA framework highlights urgent need for safety alignment procedures for web agents, providing crucial benchmark for research.

Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases

  • MedR-Bench Evaluation Framework: introduces a framework for evaluating medical LLMs, with assessment recommendation, diagnostic decision, and treatment planning stages, incorporating Reasoning Evaluator.
  • MedR-Bench Evaluation Framework: includes Reasoning Evaluator, an agentic system for automated reasoning evaluation using structured steps and reference verification against medical knowledge.
  • MedR-Bench Evaluation Framework: comprehensively assesses LLMs' clinical performance across patient journey stages, utilizing metrics for reasoning and final generation quality.

SURVEYFORGE: On the Outline Heuristics, Memory-Driven Generation, and Multi-dimensional Evaluation for Automated Survey Writing

  • SURVEYFORGE: introduces an automated survey writing framework, with Heuristic Outline Generation-stage (outline generator), SANA (scholar agent), and Content Generation-stage (content creation phase), leveraging Paper Database (article database), Survey Database (outline database), TOPIC Database (topic information), Literature Database (literature repository), Multimodal Large Language Models (text processing model), Embodied AI (embodiment AI), AI for Science (scientific AI), RAG (retrieval augmentation), LLM (content model), and LLM-Parallel (parallel content generation) to produce Final Survey (final paper) from Input Topic (user research area) and evaluate using Survey Benchmark (evaluation dataset) during Refinement & Evaluation (survey improvement phase).
  • This framework employs a two-stage process, first generating a detailed outline using heuristic learning from existing surveys and relevant literature, then populating the outline with content retrieved and refined by a memory-driven scholar navigation agent, ensuring both structural coherence and high-quality references.
  • SURVEYFORGE aims to bridge the quality gap between human-written and AI-generated surveys by focusing on outline quality, reference relevance, and content coherence, demonstrating improved performance over existing automated survey generation methods and setting a new benchmark for quality and reliability in this domain.

The Next Frontier of LLM Applications: Open Ecosystems and Hardware Synergy

  • Three-Layer Decoupled Architecture: introduces a three-layer architecture for LLM applications, with Application Layer (app settings management), Protocol Layer (secure session control), and Hardware Layer (hardware resource management) to improve modularity and cross-platform adaptability.
  • The architecture decouples application logic, protocol handling, and hardware execution, enabling efficient task orchestration and secure communication across heterogeneous platforms.
  • This layered design enhances scalability, interoperability, and hardware efficiency for LLM applications by separating concerns and standardizing interfaces between layers.

ToolFuzz - Automated Agent Tool Testing

  • TOOLFUZZ: introduces a method for automated tool documentation testing with taint fuzzer, LLM prompt generation, tool runtime error detection, LLM prompt template generation, LLM synonymous prompt generation, I/O consistency checks, and LLM correctness evaluation to improve agent and tool reliability.
  • This framework identifies under-, over-, and ill-specified documentation errors by generating diverse natural language queries and performing cascading consistency checks.
  • TOOLFUZZ significantly enhances the reliability of LLM agents by automating the detection of tool documentation errors, which are critical for effective tool utilization.

AgentSafe: Safeguarding Large Language Model-based Multi-agent Systems via Hierarchical Data Management

  • AgentSafe: introduces ThreatSieve (communication security verification) and HierarCache (adaptive memory management) to enhance multi-agent system security.
  • AgentSafe classifies information by security levels, using Permission Control (communication level regulation) and Message Legitimacy Evaluation (sender identity verification) within ThreatSieve, and Junk Memory (irrelevant data storage) within HierarCache.
  • AgentSafe framework components Memory (agent information storage) and hierarchical information management aim to prevent unauthorized access and data breaches in LLM-based multi-agent systems.

Towards Autonomous Reinforcement Learning for Real-World Robotic Manipulation with Large Language Models

  • ARCHIE (Autonomous Reinforcement learning for Complex Human-Informed Environments): introduces unsupervised pipeline, with Initial conditions, Simulation, Define Agent's Task, Reward function, RL training, for automating robotic skill learning.
  • ARCHIE leverages GPT-4 to generate reward functions and success criteria from natural language task descriptions, enabling one-shot RL training.
  • ARCHIE's two-phase approach, Initialization and Autonomous Skill Learning, facilitates efficient and practical real-world robotic manipulation skill acquisition.

Measuring temporal effects of agent knowledge by date-controlled tool use

  • ReAct+DCT (ReAct with Date-Controlled Tools): introduces tool-based out-of-sample testing framework with masked text, ranked webpage snippets, reasoning traces, LLM, surface web, search queries, and date-controlled tools for text completion.
  • Framework evaluates temporal effects on agent knowledge by employing date-controlled web search to complete scientific abstracts.
  • ReAct+DCT framework emphasizes dynamic agent evaluation considering temporal influence of tools and updates of external resources.

KidneyTalk-open: No-code Deployment of a Private Large Language Model with Medical Documentation-Enhanced Knowledge Database for Kidney Disease

  • KidneyTalk-open: introduces a no-code medical LLM system integrating Document Parsing & Chunking, Knowledge Snippets Filtering, Semantic Embedding, Vector Database, Adaptive Retrieval and Augmentation Pipeline, Query Refinement Agent, Divergent Thinking Agent, and Answer Generation Agent.
  • KidneyTalk-open system enhances medical question answering through document processing, knowledge retrieval, and multi-agent collaboration to improve accuracy and ensure privacy.
  • The framework significantly reduces technical complexities for medical professionals to employ state-of-the-art open-source LLMs for secure medical question answering with enhanced documentation.

DyCodeEval: Dynamic Benchmarking of Reasoning Capabilities in Code Large Language Models Under Data Contamination

  • DyCodeEval (Dynamic Code Evaluation): introduces a dynamic benchmarking method for code LLMs, with Scenario Proposer, Scenario Pool, Context Generator, Canonical Solution, Contexts, Prompt Rewriter, Orig Problem, New Problem, Validator, and LLM Agent as Verifier, to generate semantically diverse yet complexity-equivalent programming problems for evaluating reasoning capabilities under data contamination.
  • DyCodeEval leverages LLM-based agents to automate the generation of varied problem contexts and includes a validation agent to ensure the consistency and correctness of newly created problems.
  • The framework aims to address limitations of static benchmarks by dynamically creating diverse problems, mitigating data contamination risks and providing a more reliable evaluation of code LLMs' true reasoning abilities.

InterChat: Enhancing Generative Visual Analytics using Multimodal Interactions

  • InterChat: introduces a multimodal generative visual analytics framework, with NL Input, Direct Manipulation, Intent Inference, Prompt Synthesizing, Response Processing, Visualization Rendering, Visual Connections, Interactive Visualizations, Multi-Agent LLM Architecture, Chain-of-Thought, Structured Prompts, and Vis. Generation, for enhancing user interaction and analytical depth.
  • InterChat integrates direct manipulation with natural language via multi-agent LLM architecture to bridge user analytical intents and LLM-driven visualizations, improving interpretability and usability.
  • By employing prompt engineering and contextual interaction linking, InterChat enhances accuracy and efficiency in complex visual analytics tasks, highlighting the potential of multimodal interactions in generative visual analytics.

PokéChamp: an Expert-level Minimax Language Agent

  • PokĂ©Champ (PokĂ©Champ): introduces minimax agent leveraging LLMs, replacing modules such as Player Action Sampling, Opponent Modeling, and Value Function Estimation to enhance minimax tree search.
  • PokĂ©Champ framework incorporates Game Engine to Text, Approximate State Transition Heuristic, One Step Lookahead Prompt, Observation, and Historical Turns to utilize gameplay history and knowledge.
  • PokĂ©Champ framework effectively reduces search space and addresses partial observability in two-player competitive games without additional LLM training, relying on pre-existing LLM knowledge and game-theoretic algorithms.

5th March 2025

Pretrained LLMs as Real-Time Controllers for Robot Operated Serial Production Line

  • LLM-based control framework: introduces a three-layer architecture comprising Input, Processing, and Output Layers for real-time management of serial production lines using pretrained Large Language Models.
  • arxiv_paper_framework_name: Input Layer defines static system parameters, Processing Layer manages dynamic decision elements, and Output Layer structures LLM-generated actions.
  • arxiv_paper_framework_name: leverages LLMs to achieve adaptable, interpretable, and scalable control in manufacturing, outperforming traditional heuristics and approaching MARL performance without retraining needs.

The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems

  • MASK (Model Alignment between Statements and Knowledge): introduces a benchmark to evaluate AI honesty by disentangling it from accuracy, with Pressure Prompt, Prompts for Eliciting Beliefs, Ground Truth Label, Extract proposition realizations, and Measure honesty and accuracy components.
  • MASK benchmark measures honesty by comparing model statements under pressure to their elicited beliefs, and accuracy by comparing beliefs to ground truth labels.
  • The benchmark utilizes a novel evaluation pipeline to directly measure when models lie by contrasting pressured statements with underlying beliefs, enabling targeted honesty interventions.

A Practical Memory Injection Attack against LLM Agents

  • MINJA (Memory INJection Attack): introduces memory injection attack framework against LLM agents using indication prompt, bridging steps, and progressive shortening strategy.
  • MINJA framework allows attacker to inject malicious records into agent memory through queries to influence agent behavior for victim user.
  • Progressive shortening strategy refines indication prompt while preserving malicious reasoning steps in agent memory.

MULTI-AGENT SYSTEMS POWERED BY LARGE LANGUAGE MODELS: APPLICATIONS IN SWARM INTELLIGENCE

  • LLM-MAST (LLM-Driven Multi-Agent Simulation Toolchain): introduces a toolchain integrating LLMs with NetLogo for multi-agent simulations, with environment encoding, python extension integration, LLM processing, decoding LLM output and agent action execution components.
  • LLM-MAST: facilitates prompt-driven behavior generation in simulations by leveraging GPT-4o via OpenAI API for processing environmental data and generating agent actions.
  • LLM-MAST: enables studying self-organizing processes and emergent behaviors in multi-agent environments by creating a closed-loop system between simulation and LLM.

MAS-GPT: Training LLMs to Build LLM-based Multi-Agent Systems

  • MAS-GPT (Multi-Agent System - Generative Pre-trained Transformer): introduces a framework for generating query-specific multi-agent systems, including Query (user input question), MAS (generated multi-agent system), Answer (final response to query), Math Agent (agent for math tasks), Feedback Agent (agent providing feedback), Refine Agent (agent for answer refinement), and call_llm (function to call LLM).
  • The framework simplifies MAS creation by training an LLM to generate executable MAS code in a single inference step, addressing inadaptability and high inference costs of existing methods.
  • MAS-GPT framework utilizes a consistency-oriented data pipeline for training, enabling adaptability, efficiency, and generalization across diverse tasks and LLM backbones.

Benchmarking LLMs and LLM-based Agents in Practical Vulnerability Detection for Code Repositories

  • JITVUL: introduces a benchmark for just-in-time vulnerability detection in code repositories, with vulnerability entry selection, target function extraction, and pairwise commits identification components.
  • JITVUL: consists of vulnerability entry selection for CVE selection based on criteria, target function extraction to extract function and fix commit, and pairwise commits identification to identify vulnerability-introducing and fixing commits.
  • JITVUL: enables pairwise evaluation for vulnerability detection using vulnerability-introducing and fixing commits.

Human Implicit Preference-Based Policy Fine-tuning for Multi-Agent Reinforcement Learning in USV Swarm

  • RLHF (Reinforcement Learning with Human Feedback): introduces a method for fine-tuning MARL policies in USV swarms using human preference feedback on user-friendly trajectory data, which trains a reward model from a replay buffer to refine a base policy into a fine-tuned policy within a simulated environment.
  • This approach uses agent-level feedback categorized into intra-agent, inter-agent, and inter-team types, and employs an LLM evaluator to validate feedback scenarios, addressing credit assignment challenges in multi-agent systems.
  • The method aims to bridge the gap between model development and user preferences, enhancing adaptability and operational effectiveness of USV swarms by incorporating human insights through preference-based learning.

Parallelized Planning-Acting for Efficient LLM-based Multi-Agent Systems

  • Parallelized Planning-Acting Multi-Agent Framework: introduces dual-thread architecture with centralized memory, planning thread, acting thread and skill library to enable concurrent planning and acting for efficient LLM-based multi-agent systems.
  • The framework decouples LLM reasoning (planning thread) from action execution (acting thread) using action buffer and interruptible execution, enhancing real-time responsiveness in dynamic environments.
  • Centralized memory with observation records, chat logs and action history facilitates efficient information sharing and coordination among agents, while skill library with DAG-based recursive task decomposition automates complex task execution.

Collaborative Expert LLMs Guided Multi-Objective Molecular Optimization

  • MultiMol (collaborative large language model system): introduces collaborative agents, including data-driven worker agent, literature-guided research agent and RDKit, to guide multi-objective molecular optimization using training dataset curation, instruction tuning, prompt input, optimization and filter based on research.
  • MultiMol employs data-driven worker agent to generate molecules and literature-guided research agent with web search and pick candidate based on characteristics to filter molecules based on scientific literature, utilizing LLM backbone and memory components.
  • MultiMol framework achieves enhanced molecular optimization performance by combining capabilities of two specialized LLM agents and leveraging literature-derived insights for improved candidate selection and scaffold preservation.

Open-Source Large Language Models as Multilingual Crowdworkers: Synthesizing Open-Domain Dialogues in Several Languages With No Examples in Targets and No Machine Translation

  • MOUD Generation Pipeline (Multilingual Open-domain Unnatural Dialogue Dataset): introduces pipeline with taxonomies, persona generation, common ground generation, conversation generation, and data evaluation for multilingual open-domain dialogue dataset synthesis.
  • Pipeline leverages instruction-tuned LLMs to generate dialogues in multiple target languages without machine translation, enhancing language-specific nuances.
  • MOUD pipeline addresses open-domain paradox by incorporating common ground and speech event type specification in generated dialogues.

Unified Mind Model: Reimagining Autonomous Agents in the LLM Era

  • UMM (Unified Mind Model): introduces a cognitive architecture for autonomous agents, integrating Driver System, Central Processing, Specialist modules, and Large Language Model to mimic human-level cognitive abilities.
  • UMM leverages Global Workspace Theory, structuring components hierarchically for efficient information processing and decision-making in complex tasks.
  • The architecture facilitates the creation of advanced autonomous agents by enabling multi-modal perception, planning, reasoning, tool use, and learning capabilities.

Taxation Perspectives from Large Language Models: A Case Study on Additional Tax Penalties

  • PLAT: introduces PLAT benchmark dataset, with Retrieval Augmentation-component for external knowledge access, Self-Reasoning-component for internal knowledge refinement, and Multi-Agent Collaboration-component for distributed task execution, to assess large language models' ability to predict legitimacy of additional tax penalties.
  • PLAT benchmark dataset evaluates large language models' understanding of tax law in scenarios requiring more than just statute application, focusing on complex reasoning about real-world situations.
  • PLAT benchmark and agent-based approach using retrieval, self-reasoning, and multi-agent methods aim to mitigate limitations of vanilla large language models in comprehensively understanding legal cases.

SEOE: A Scalable and Reliable Semantic Evaluation Framework for Open Domain Event Detection

  • SEOE (Scalable and Reliable Semantic Evaluation framework for Open domain Event detection): introduces scalable, unconstrained extraction and semantic-level evaluation components for open domain event detection evaluation.
  • SEOE framework addresses limitations of token-level evaluation by incorporating semantic understanding and benchmark scalability for improved ODED assessment.
  • The framework utilizes LLMs to compute semantic F1-score, enhancing evaluation reliability and offering a more representative benchmark for real-world ODED scenarios.

CITE BEFORE YOU SPEAK: ENHANCING CONTEXT-RESPONSE GROUNDING IN E-COMMERCE CONVERSATIONAL LLM-AGENTS

  • Citation Generation Paradigm: introduces a method for enhancing LLM agent responses with citations, using User Query, Retrieve, LLM Response, Source, and Evidence widgets.
  • This paradigm improves response grounding and transparency by linking LLM answers to verifiable knowledge sources.
  • The approach aims to build customer trust in conversational shopping agents by providing source attribution.

Exploring the Potential of Large Language Models as Predictors in Dynamic Text-Attributed Graphs

  • GAD Framework (GraphAgent-Dynamic Framework): introduces multi-agent system leveraging collaborative LLMs for dynamic graph prediction, with Initial Agent, Local Summary Agents, Global Summary Agents, Knowledge Reflection Agent, Predictor Agent, Temporary Predictor Agent, Database, and Extraction components.
  • GAD framework incorporates global and local summary agents to generate domain-specific knowledge and knowledge reflection agents for adaptive updates, maintaining unified architecture for dynamic graphs.
  • This multi-agent approach addresses context length constraints and domain variability challenges inherent in dynamic graph prediction tasks, enhancing generalizability.

MA-LoT: Multi-Agent Lean-based Long Chain-of-Thought Reasoning enhances Formal Theorem Proving

  • MA-LoT (Multi-Agent Lean-based Long Chain-of-Thought framework): introduces multi-agent framework for Lean4 theorem proving with LoT-Solver as Prover Agent, Lean Executor, LoT-Solver as Corrector Agent and NL-Planning CoT components.
  • MA-LoT balances Natural Language reasoning and Formal Language verification in theorem proving using Long Chain-of-Thought approach, incorporating Error analysis components.
  • MA-LoT framework utilizes LoT-TL pipeline to enable emergent formal reasoning capabilities in Large Language Models without requiring specific annotated data for Long CoT.

DANGO: A Mixed-Initiative Data Wrangling System using Large Language Model

  • DANGO (A Mixed-Initiative Data Wrangling System using Large Language Model): introduces a mixed-initiative data wrangling system with Table Demo, Chatroom, and Memory components.
  • DANGO synthesizes data wrangling scripts using Table Meta Data, DSL Plan, DSL Program, and Syntax Checker components.
  • The framework includes Analysis Agent, Synthesis Agent, and Explanation Agent, and provides NL Explanation and Provenance for user understanding and validation.

Levels of Spacecraft Autonomy

  • Levels of Spacecraft Autonomy: introduces levels framework with Basic Space Operations, Ground Commanding, Space Situational Awareness, Onboard Recovery Planning, Onboard Recovery Actions, Onboard Execution, Optional Monitoring, and Ground Monitoring components to characterize spacecraft autonomy.
  • The framework defines six levels of spacecraft autonomy, ranging from basic ground-commanded operations to fully autonomous onboard execution and optional ground monitoring.
  • This autonomy level framework aims to provide a consistent method for describing and communicating spacecraft capabilities to diverse audiences, including technical and non-technical stakeholders.

Curating Demonstrations using Online Experience

  • Demo-SCORE (Demo-SCORE): introduces a self-curation method for robot demonstrations using online experience, with Initial Policy Training, Rollout Generation, Data Quality Classifier Training, and Demonstration Filtering components.
  • Demo-SCORE leverages policy rollouts to identify and remove unreliable demonstrations from an initial dataset.
  • Demonstration Filtering component uses a trained classifier to refine the demonstration dataset based on predicted reliability.

4th March 2025

Four Principles for Physically Interpretable World Models

  • Physically Interpretable World Models: introduces a world model framework incorporating principles like latent structuring and output partitioning, utilizing components such as physical encoders, dynamics models, and decoders to achieve physical interpretability.
  • This framework emphasizes learning aligned invariant and equivariant representations through multi-level supervision to enhance the reliability and verifiability of world models for autonomous systems.
  • The proposed architecture aims to bridge the gap between high-dimensional observations and physical meaning by partitioning generative outputs and structuring latent spaces according to physical variable intent.

From Metaphor to Mechanism: How LLMs Decode Traditional Chinese Medicine Symbolic Language for Modern Clinical Relevance

  • Perceptual-Chain-Of-Thought framework: introduces multi-agent system for TCM metaphor interpretation, with Entity Mapping and Splitting Layer, Perceptual Layer, Metaphor understanding Layer, and Perceptual KG Subset components.
  • This framework uses chain-of-thought reasoning and multi-agent collaboration to bridge TCM and Western medicine understanding of metaphors.
  • The system aims to improve accuracy and transparency in translating TCM symbolic language for modern clinical relevance.

FINARENA: A HUMAN-AGENT COLLABORATION FRAMEWORK FOR FINANCIAL MARKET ANALYSIS AND FORECASTING

  • FinArena (Human-Agent collaboration framework for financial market analysis and forecasting): introduces a novel framework for financial analysis, integrating Human Module (interactive user interface), Machine Module (LLM-based multi-agent system), Time Series Agent (stock time series prediction), News Agent (news insights and RAG), Statement Agent (financial statement analysis), AI Expert (investment decision synthesis), Report Agent (human-agent interaction), Data Set (multimodal financial data), Output (investment action suggestion), and Web Port (information retrieval and analysis).
  • FinArena framework employs specialized agents for time series, news, and statements, combined with an AI expert for synthesizing insights and a report agent for human interaction, utilizing multimodal financial data for enhanced stock trend predictions and personalized investment decisions.
  • The framework leverages adaptive Retrieval-Augmented Generation (RAG) within the News Agent to mitigate hallucinations and improve accuracy when processing unstructured news data, and incorporates iterative reasoning in the Statement Agent for in-depth financial statement analysis.

MPO: Boosting LLM Agents with Meta Plan Optimization

  • MPO (Meta Plan Optimization): introduces meta plan optimization framework with meta planner generating abstract guidance, agent providing execution feedback, and prompt incorporating meta plan for enhanced planning.
  • MPO framework leverages meta plans to provide explicit guidance for LLM agents, enabling continuous optimization based on agent's task execution feedback.
  • MPO enhances agent planning by decoupling meta plans from specific environmental details, improving generalization and task completion efficiency without agent retraining.

Playing games with Large language models: Randomness and strategy

  • LangChain: introduces game-playing framework with LLM, player agents, evaluation, and history for game simulations.
  • This framework facilitates bidirectional LLM interactions for repeated games with history feedback.
  • The framework enables analysis of LLM strategic adaptation and randomness in game scenarios.

Generator-Assistant Stepwise Rollback Framework for Large Language Model Agent

  • GA-Rollback (Generator-Assistant Stepwise Rollback) introduces a framework with Environment, LLM (Generator), GA-Rollback, Assistant, Rollback Operation, Evaluation, Feedback Evaluation, and "wait-k" Strategy to improve decision-making in LLM agents by addressing error propagation through rollback operations and quality evaluation.
  • The framework utilizes a Generator (LLM) to interact with the Environment, while an Assistant examines actions and triggers Rollback Operation upon error detection, incorporating Feedback Evaluation and "wait-k" Strategy for enhanced performance.
  • GA-Rollback framework aims to ensure credible reasoning trajectory by separating action generation and examination, and integrating seamlessly as plug-and-play module with other methods for improved robustness and extensibility.

BRIDGE: Bootstrapping Text to Control Time-Series Generation via Multi-Agent Iterative Optimization and Diffusion Modelling

  • BRIDGE: introduces BRIDGE framework with Text Template Generation, Automatic Evaluation, Feedback-driven Refinement, Domain Time Series Encoder, Text Description Encoder, Prototype Assignment Module, Semantic Prototypes, Conditioning, Diffusion Model, and Random Noise for text-controlled time series generation.
  • BRIDGE framework uses multi-agent system for iterative text refinement and hybrid approach combining semantic prototypes with text descriptions to enhance time-series generation controllability and fidelity.
  • The framework addresses challenges of limited text-TS pairs and modality discrepancy by generating high-quality datasets and integrating semantic prototypes for improved domain generalization in time-series generation.

PersonaX: A Recommendation Agent-Oriented User Modeling Framework for Long Behavior Sequence

  • PersonaX (Recommendation Agent-Oriented User Modeling Framework): introduces a user modeling framework for long behavior sequences, with behavior clustering, sampling budget allocation, in-cluster selection, SBS selection, offline multi-persona construction, online persona retrieval, and persona cache.
  • PersonaX extracts representative sub-behavior sequences offline to construct fine-grained personas for efficient online retrieval in recommendation agents.
  • PersonaX addresses challenges of long user-generated content by balancing behavioral completeness and efficiency in user modeling.

ReSo: A Reward-driven Self-organizing LLM-based Multi-Agent System for Reasoning Tasks

  • ReSo (Reward-driven Self-organizing): introduces reward-driven multi-agent system, integrating Task Graph for decomposition, Agent Graph Construction for network building, Dynamic Agent Database for agent profiles, and Collaborative Reward Model for feedback.
  • ReSo incorporates Collaborative Reward Model to provide fine-grained signals, enabling dynamic optimization of agent collaboration and improving scalability.
  • The framework utilizes Dynamic Agent Database to maintain agent profiles, facilitating adaptive agent selection based on performance and task similarity.

EchoQA: A Large Collection of Instruction Tuning Data for Echocardiogram Reports

  • EchoQA (Echocardiogram Question Answering) introduces a question-answering dataset creation framework with Data Extraction, Clinical Categorization, Sentence Matching, Question Generation, LLM Training, and Fairness Audit components for echocardiogram reports.
  • EchoQA framework utilizes clinician expertise for categorizing cardiac abnormalities and generates question-answer pairs to facilitate instruction tuning of language models for cardiology QA tasks.
  • The framework aims to establish a benchmark for LLM-based AI agents in cardiology, focusing on differential diagnoses and fairness across social determinants of health.

AppAgentX: Evolving GUI Agents as Proficient Smartphone Users

  • AppAgentX (Evolving GUI Agents as Proficient Smartphone Users): introduces an evolutionary framework for GUI agents, incorporating memory mechanism, evolutionary mechanism, and execution strategy to enhance efficiency and intelligence by evolving high-level actions from task execution history.
  • AppAgentX framework utilizes a chain-based knowledge framework to record task execution history, enabling the agent to identify repetitive action sequences and evolve shortcut nodes representing high-level actions for improved task efficiency.
  • The framework's memory mechanism stores page nodes and element nodes with descriptions and visual embeddings, facilitating the evolutionary mechanism to abstract low-level actions into high-level actions and optimize the agent's operational repertoire.

Haste Makes Waste: Evaluating Planning Abilities of LLMs for Efficient and Feasible Multitasking with Time Constraints Between Actions

  • RECIPE2PLAN: introduces Multitask Agent, Action, Observation, Feedback, and Recipe, to evaluate multitask planning with time constraints in cooking scenarios.
  • RECIPE2PLAN framework challenges agents to balance efficiency and feasibility in parallel task execution while respecting temporal constraints between actions.
  • RECIPE2PLAN benchmark uses real-world recipes to assess agents' ability to optimize cooking time and adhere to temporal constraints, highlighting the need for improved temporal awareness in LLMs.

ATLAS: Agent Tuning via Learning Critical Steps

  • ATLAS (Agent Tuning via Learning Critical Steps): introduces a framework for efficient LLM agent tuning by focusing on critical steps identified from expert trajectories.
  • ATLAS framework employs a Selector to identify Critical Steps within Expert Trajectories and applies Finetuning to a Base LLM using Critical Step Loss computed on these steps to create a tuned LLM Agent.
  • By selectively finetuning on Critical Steps, ATLAS reduces overfitting and enhances generalization capabilities of LLM agents while minimizing training costs.

Teaching AI to Handle Exceptions: Supervised Fine-Tuning with Human-Aligned Judgment

  • Decision-Making Flow Framework: introduces structured approach for evaluating decision-making in LLMs, with Scenario, Policy, Exception, Level, Decision Model, and Decisions components.
  • This framework assesses LLMs' ability to handle exceptions to policies compared to human decision-makers across varying exception intensities.
  • The framework is used to evaluate ethical framework prompting, chain-of-thought reasoning, and supervised fine-tuning interventions for improving LLM alignment with human judgment.

3rd March 2025

CorrA: Leveraging Large Language Models for Dynamic Obstacle Avoidance of Autonomous Vehicles

  • CorrA (Corridor-Agent): introduces a dynamic obstacle avoidance framework, integrating Scene Description, LLM Scene Analysis, Optimization, DDP, Hard safety constraints, and Ego car trajectory components.
  • CorrA uses LLM for reasoning to generate adaptive sigmoid boundary parameters, which are efficiency-optimized and used by DDP within MPC for trajectory planning.
  • This framework enhances autonomous vehicle safety and efficiency in dynamic environments through real-time adaptation of sigmoid-based safety boundaries.

Interactive Debugging and Steering of Multi-Agent AI Systems

  • AGDEBUGGER (Interactive Agent Debugging Tool): introduces an interactive debugging system for multi-agent AI, featuring a message viewer, message sending, message editing, message reset, overview visualization, agent state checkpoints, and a message queue.
  • This tool facilitates debugging by allowing users to inspect conversations, edit messages, reset workflows, and visualize conversation history to understand and correct agent behavior.
  • AGDEBUGGER addresses the challenges of debugging complex multi-agent systems by providing interactive control and visualization, enabling developers to effectively identify and fix errors in agent workflows.

Al persuading Al vs Al persuading Humans: LLMs' Differential Effectiveness in Promoting Pro-Environmental Behavior

  • Research Framework: introduces a system for evaluating LLMs in promoting Pro-Environmental Behavior, with Participants, Chat, Communication Strategy and Effects components.
  • The framework compares real, simulated, and synthetic participants interacting with personalized chatbots, non-personalized chatbots, or static statements using different communication strategies.
  • The study investigates effects on pro-environmental intentions, climate change belief, sustainable choices, psychological distance, sharing, consumption, self-perception, and policy adoption.

Persuasion at Play: Understanding Misinformation Dynamics in Demographic-Aware Human-LLM Interactions

  • PANDORA Framework (Persuasion ANalysis in Demographic-aware human-LLM interactions and misinformation Response Assessment): introduces components including LLM-to-Human Persuasion, Persuasive Text Generation, Persuasive Text Impact, Human-to-LLM Persuasion, Persuasive Text Generation, Persuasive Text Impact, Multi-agent LLM Persuasion, Multi-Agent LLM Architecture, Homogeneous groups, Heterogeneous groups, Interaction rounds, First responses, and Final responses to investigate misinformation dynamics in human-LLM interactions considering demographic factors.
  • PANDORA framework analyzes bidirectional persuasion between humans and LLMs, evaluating LLM-generated and human-generated persuasive texts' impact on belief and correctness across diverse demographic groups in single-agent and multi-agent settings.
  • The framework's multi-agent LLM architecture explores echo chamber effects in homogeneous groups and mitigation in heterogeneous groups, offering insights into demographic influences on misinformation susceptibility and potential intervention strategies.

Mind the (Belief) Gap: Group Identity in the World of LLMs

  • Multi-agent LLM framework: introduces simulation of belief congruence experiment with participant agent interacting with confederate agents, each having assigned belief, through interaction rounds to answer question.
  • Framework components include participant agent making decision after interaction rounds with confederate agents, considering their beliefs on discussion topic.
  • Framework simulates psychological experiment to investigate belief congruence in LLMs by observing agent's choices based on belief alignment of others.

Adaptively evaluating models with task elicitation

  • Adaptive Evaluations: introduces framework for evaluating language models, utilizing Target LLM, Evaluator Agent, Verifier, and Static Evaluation components.
  • Framework employs evaluator agents to create difficult questions by probing target model behavior from static evaluation results.
  • Verifier component ensures generated questions maintain validity, difficulty, and novelty, refining target model profile iteratively.

--

Can (A)I Change Your Mind?

  • Dynamic Bot Framework: introduces a structured system utilizing GPT-4, with System Prompt, Experiment Framework, System Message, Persona, Conversation Instruction, User Message, Bot Message, Opinion and Confidence, Few Shot Conversations, Nudger, Initial Message, Summarization Prompt, and Final Message, to facilitate and analyze human-bot conversations for persuasion studies.
  • This framework employs a detailed prompt and iterative message processing, including summarization and rephrasing, to ensure naturalistic and contextually relevant bot interactions within the experiment.
  • The framework incorporates components like Nudger for maintaining engagement and Few Shot Conversations for guiding bot behavior, aiming for robust and ecologically valid persuasion research.

Persuade Me if You Can: A Framework for Evaluating Persuasion Effectiveness and Susceptibility Among Large Language Models

  • PMIYC (Persuade Me If You Can): introduces automated framework for evaluating persuasion effectiveness and susceptibility of LLMs through multi-agent interactions, with PERSUADER (Agent attempting persuasion), PERSUADEE (Agent being persuaded), Multi-turn Conversation (Iterative argument exchange), and Agreement Score (Quantifies stance on claim).
  • PMIYC framework simulates conversations between PERSUADER and PERSUADEE agents to measure persuasive effectiveness and susceptibility of LLMs in different contexts.
  • PMIYC offers scalable and automated approach to study LLM persuasion dynamics, providing insights into vulnerabilities and safer AI development.

AutoAdvExBench: Benchmarking autonomous exploitation of adversarial example defenses

  • AutoAdvExBench (Benchmark for Autonomous Exploitation of Adversarial Example Defenses): introduces benchmark, with Forward Pass Implementation, Differentiable Forward Pass Conversion, FGSM Attack and Iterative Attack components, that evaluates LLMs' ability to autonomously exploit adversarial defenses.
  • This benchmark directly measures LLMs' success on security tasks performed by machine learning experts, unlike proxy benchmarks.
  • AutoAdvExBench is mechanistically verifiable and uses real-world research codebases, highlighting the gap between CTF-like and real-world security challenges for LLMs.

Designing VR Simulation System for Clinical Communication Training with LLMs-Based Embodied Conversational Agents

  • VAPS (Virtual AI Patient Simulator): introduces VR system for clinical communication training, with tutorial-, clinical patient interaction- and reflection-scenes, embodied conversational agents, medical records, narrative design, realistic animations and system interaction.
  • VAPS utilizes LLM-driven ECAs to simulate dynamic patient interactions, incorporating medical records and adaptive narratives for realistic VR-based training.
  • The system aims to enhance HP students' communication skills through customizable and repeatable practice scenarios within an immersive VR environment.

Retrieval Models Aren't Tool-Savvy: Benchmarking Tool Retrieval for Large Language Models

  • TOOLRET: introduces TOOLRET Benchmark (heterogeneous tool retrieval evaluation), IR Models (benchmark various retrieval models), TOOLRET-train Dataset (large-scale training dataset), and Evaluation Metrics (retrieval performance metrics) for benchmarking tool retrieval performance of large language models.
  • TOOLRET benchmark demonstrates existing information retrieval models exhibit suboptimal performance in retrieving tools, consequently degrading tool-use large language model task completion rates.
  • TOOLRET-train dataset aims to improve information retrieval models for tool retrieval, ultimately enhancing the effectiveness of large language models in tool utilization.

Student engagement in collaborative learning with AI agents in an LLM-empowered learning environment: A cluster analysis

  • MAIC (Massive AI-empowered Course System): introduces a platform integrating specialized AI agents including AI Teacher, AI Teaching Assistant, Sparker, Thinker, Questioner and Note Taker, managed by Director Agents using Dialogue History and Learning Materials to enhance online learning.
  • MAIC system utilizes Director Agents to analyze classroom dynamics from Dialogue History and Learning Materials, enabling dynamic agent selection via Select Speakers and text generation through Generate Texts, alongside components like Role Descriptions and Specialized Intelligence.
  • MAIC framework aims to foster collaborative learning by employing diverse AI agents - AI Teacher, AI Teaching Assistant, Sparker, Thinker, Questioner, Note Taker - each with specific roles, to support student engagement and personalized educational experiences.

Evaluation and Facilitation of Online Discussions in the LLM Era: A Survey

  • Taxonomy of Discussion Quality Evaluation: introduces four main dimensions - Structure and Logic, Social Dynamics, Emotion and Behavior, and Engagement and Impact - to assess online discussion quality.
  • Taxonomy of Discussion Quality Evaluation: encompasses multiple aspects within each dimension, ranging from argument analysis and coherence to politeness, toxicity, and engagement, providing a comprehensive framework.
  • Taxonomy of Discussion Quality Evaluation: aims to offer a structured approach for evaluating diverse facets of online discussions, moving beyond traditional argument-centric methods to include social and behavioral dynamics.

Improving Retrospective Language Agents via Joint Policy Gradient Optimization

  • RetroAct (Retrospective Language Agent): introduces a novel agent framework that jointly optimizes task-planning and self-reflective evolution capabilities with Planner, Reflector, Environment, Tool Calling, Reflection, Feedback, Reward, Differential Reward, Imitation Learning, Reinforcement Learning, Policy Gradient Optimization, Replay Buffer.
  • RetroAct framework uses a two-stage joint optimization process integrating imitation and reinforcement learning for enhanced data efficiency and training stability.
  • RetroAct improves performance of open-source models and reduces dependency on closed-source LLMs by enabling continuous learning and evolution.

MultiAgentBench : Evaluating the Collaboration and Competition of LLM agents

  • MARBLE (Multi-agent coordination Backbone with LLM Engine): introduces a multi-agent evaluation framework with Configuration, Coordinate Engine, Agent Graph, Shared Memory, Cognitive Module, Environment, Tool Box, and Evaluator components.
  • This framework evaluates LLM-based multi-agent systems by measuring task completion and coordination quality in diverse interactive scenarios.
  • MARBLE utilizes milestone-based KPIs and supports various coordination protocols and planning strategies for comprehensive multi-agent system analysis.

2nd March 2025

NESYC: A NEURO-SYMBOLIC CONTINUAL LEARNER FOR COMPLEX EMBODIED TASKS IN OPEN DOMAINS

  • NESYC (Neuro-Symbolic Continual learner): introduces a neuro-symbolic continual learning framework integrating Semantic Parser, Hypothesis Generator, Hypothesis Interpreter, Logic programming form, Memory-based monitoring, Task Planner, Action Executor, and Error Handler for embodied agents in open-domain environments.
  • NESYC framework employs contrastive generality improvement and memory-based monitoring schemes, utilizing LLMs and symbolic tools to generalize actionable knowledge and refine it through experience.
  • The framework iteratively reformulates knowledge and applies it, adapting to unpredictable situations and demonstrating effectiveness in diverse embodied task benchmarks by continually improving understanding of the environment.

A Law Reasoning Benchmark for LLM with Tree-Organized Structures including Factum Probandum, Evidence and Experiences

  • TL Agent (Transparent Law Reasoning Agent): introduces a transparent law reasoning framework, with Agent Brain, Fact Finding Head, Knowledge Search, MultiRole Checker, Legal Knowledge, Reflection, Evidence, Factum Probandum, Experiences, and Inferences components, for AI-assisted legal decision-making.
  • The framework employs a tree-organized schema integrating hierarchical factum probandum, evidence, and experiences to simulate comprehensive court processes and enhance transparency in legal reasoning.
  • TL Agent utilizes a suite of legal analysis tools within an agent-based architecture to construct tree-organized legal reasoning structures from textual case descriptions for improved judicial fairness.

AI Agents for Ground-Based Gamma Astronomy

  • Astronomical Agent: introduces an AI system designed for astronomy tasks, integrating context understanding, language model processing, external function execution, and data validation within a specified framework.
  • The agent utilizes instruction-finetuned LLMs to automate complex tasks in gamma-ray astronomy, incorporating components like ACADA and Gammapy for telescope control and data analysis pipelines.
  • Validation mechanism ensures command quality by evaluating function execution results against provided data and software framework, enhancing reliability in autonomous astronomical operations.

Evaluating Personalized Tool-Augmented LLMs from the Perspectives of Personalization and Proactivity

  • ETAPP (Evaluation of Tool-augmented Agent from the Personalization and Proactivity Perspective): introduces a benchmark for evaluating personalized tool invocation, comprising API Construction, Sandbox Construction, User Profile Construction, Tool-utilizing Preference Construction, Interaction History Construction, Memory Building, Instruction Construction, Manual Check, Inference, Available Tools, Tool Invoking Process, Final Answer, and Evaluation components.
  • ETAPP assesses personalization and proactivity in tool-augmented LLMs using a dataset of 800 cases and a key-point-based evaluation method.
  • The benchmark aims to address the lack of evaluation criteria for personalized tool usage in diverse scenarios, focusing on improving personalized LLM agents.

CLEA: Closed-Loop Embodied Agent for Enhancing Task Execution in Dynamic Environments

  • CLEA (Closed-Loop Embodied Agent): introduces a closed-loop framework with Observer, Memory, Planner-Critic Agent and Skill Pool to enhance task execution in dynamic environments using robots.
  • CLEA framework incorporates Observer for visual input conversion, Memory for belief state maintenance, Planner-Critic Agent for adaptive decision-making, and Skill Pool for predefined executable actions.
  • The framework facilitates continuous adaptation and error recovery in long-horizon tasks by integrating real-time environmental feedback and memory-driven reasoning within its components.

Unmasking Digital Falsehoods: A Comparative Analysis of LLM-Based Misinformation Detection Strategies

  • SNIFFER (Multimodal Large Language Model for Explainable Out-of-Context Misinformation Detection): introduces multimodal model utilizing visual and textual inputs with cross-modal transformer and external knowledge to detect misinformation.
  • SNIFFER integrates image encoder, text processing, and retrieval mechanisms, employing LLM for reasoning and providing explainable out-of-context misinformation detection.
  • The framework achieves explainability through structured validation and evidence integration, enhancing transparency in multimodal misinformation analysis.

LLMDR: LLM-Driven Deadlock Detection and Resolution in MAPF Environment

  • LLMDR (LLM-Driven Deadlock Detection and Resolution): introduces MAPF Environment, Base Model Simulation, LLM Deadlock Detection, and LLM Deadlock Resolution with PIBT to address deadlock and enhance learned MAPF model performance.
  • LLMDR framework uses LLM Deadlock Detection to identify deadlocks in Base Model Simulation within MAPF Environment and employs LLM Deadlock Resolution with PIBT to resolve them.
  • LLMDR leverages LLMs for high-level deadlock management and integrates with PIBT algorithm for collision-free action generation in multi-agent pathfinding scenarios.

1st March 2025

Instructor-Worker Large Language Model System for Policy Recommendation: a Case Study on Air Quality Analysis of the January 2025 Los Angeles Wildfires

  • Instructor-Worker LLM System: introduces Instructor LLM (Prompt interpreter orchestrator), Worker LLM(s) (Data analysis summarization agents), Code Execution Module (API call validation execution), and Cloud Platform (External data source) for air quality analysis.
  • The system uses Instructor LLM to process user instructions, retrieve data from Cloud Platform via Code Execution Module, and distribute analysis tasks to Worker LLMs.
  • This multi-agent approach aims to efficiently analyze large datasets and generate policy recommendations based on air quality data.

Challenges in Testing Large Language Model Based Software: A Faceted Taxonomy

  • Taxonomy for LLM Test Case Design: introduces Software Under Test (system or component to evaluate), Goal (objective of the test case), Oracles (evaluation mechanisms for property), and Inputs (data to elicit SUT responses) for structuring LLM testing.
  • This taxonomy addresses challenges in LLM testing by categorizing key variation points impacting evaluation correctness and emphasizing ambiguity in inputs and outputs.
  • The taxonomy aims to improve reliability and reproducibility of LLM testing by providing a systematic framework for test case design and evaluation across the software lifecycle.

PodAgent: A Comprehensive Framework for Podcast Generation

  • PodAgent: introduces comprehensive framework for podcast generation with Host-Guest-Writer System, Voice-Role Matching, Instruction-following TTS, Audio Script Generation and Audio Production components.
  • PodAgent framework utilizes multi-agent collaboration for content creation, voice characteristic analysis for voice selection and LLM-enhanced speech synthesis for expressive speech.
  • PodAgent addresses key challenges in podcast generation, including content depth, dialogue naturalness, voice appropriateness and speech expressiveness.

Structured Reasoning for Fairness: A Multi-agent Approach to Bias Detection in Textual Data

  • SRF Framework (Structured Reasoning for Fairness Framework): introduces multi-agent pipeline with Checker Agent, Validation Agent, and Justification Agent for textual data bias detection.
  • SRF Framework systematically identifies biases through fact or opinion classification, bias intensity scoring, and factual justification provision.
  • This approach improves bias detection accuracy and interpretability, fostering fairness and accountability in language models.

Shifting Power: Leveraging LLMs to Simulate Human Aversion in ABMs of Bilateral Financial Exchanges, A bond market study

  • TRIBE (Trading Relationships, Interactions, and Bilateral Exchange of assets): introduces agent-based model augmented with LLM to simulate human aversion, with components Select Paramatervalues, Build the Financial Landscape, Initialise Bankers, Bankers engage with Clients, Clients determine direction choice availability, LLM response Positive, LLM response Averse, Bankers must trade if Clients are Positive towards them, and Banker facilitated trade occurs.
  • TRIBE framework simulates bilateral financial exchanges by integrating LLM for human-like client decision-making regarding trade aversion and timeliness.
  • This framework enhances realism in agent-based models by incorporating stochastic human-like decision processes via LLM, revealing emergent market behaviors.

28th February 2025

UDora: A Unified Red Teaming Framework against LLM Agents by Dynamically Hijacking Their Own Reasoning

  • UDora (Unified Red Teaming Framework): introduces a novel approach for attacking LLM agents by dynamically adapting adversarial strings based on the agent's reasoning process, encompassing components like System (target agent), Direct Attack (baseline), UDora Attack (framework itself), Initial Response, Modified Response, Optimization, Malicious Environment, Malicious Instruction, Tool list, and Malicious Target Tool.
  • UDora framework strategically inserts "noise" into the agent's reasoning at optimal positions identified through positional scoring and iterative optimization to mislead the agent towards malicious actions.
  • The framework evaluates two adversarial scenarios: Malicious Environment, where the observation is corrupted, and Malicious Instruction, where the instruction is directly manipulated, demonstrating effectiveness across diverse datasets and real-world agents.

Personalized Causal Graph Reasoning for LLMs: A Case Study on Dietary Recommendations

  • Personalized Causal Graph Reasoning: introduces agentic framework enhancing LLM reasoning by incorporating personal causal graphs, with goal identification, personal causal graph, traverse impactful nutrient paths, rank, verify, retrieve food items, food nutrient database, generate food recommendation, large language model, and personal data.
  • This framework constructs personalized causal graphs from individual data to guide LLM in generating tailored dietary recommendations.
  • By leveraging structured causal dependencies and counterfactual evaluation, the framework aims to provide more precise and personalized dietary advice compared to generic LLM approaches.

BixBench: a Comprehensive Benchmark for LLM-based Agents in Computational Biology

  • BixBench (Bioinformatics Benchmark): introduces benchmark framework with analyst-created analysis capsules, expert review, LLM-generated MCQs, task capsules, agent task environment with tools, and open/multiple-choice evaluations.
  • BixBench framework uses analysis capsules containing data and questions, evaluated in agent environment with tools for bioinformatics tasks.
  • BixBench framework assesses LLM-based agents in bioinformatics through open-ended questions and multiple-choice questions for comprehensive evaluation.

EdgeAIGuard: Agentic LLMs for Minor Protection in Digital Spaces

  • EdgeAIGuard: introduces multi-agent framework for minor protection, with Input Layer, Edge Processing Unit, Local Storage and Protection Layer components.
  • EdgeAIGuard framework incorporates Sentinel, Context, and Intervention Agents within Edge Processing Unit, utilizing DeepSeek LLM Engine for threat detection and response.
  • EdgeAIGuard employs Local Storage with History Cache and Pattern Memory to maintain context awareness and adapt to evolving online threats effectively.

ARIES: AUTONOMOUS REASONING WITH LLMS ON INTERACTIVE THOUGHT GRAPH ENVIRONMENTS

  • ARIES (AUTONOMOUS REASONING WITH LLMS ON INTERACTIVE THOUGHT GRAPH ENVIRONMENTS) introduces a multi-agent framework with Policy Agent, Reasoning Agent, and Thought Graph to enhance reasoning in LLMs.
  • ARIES framework utilizes Policy Agent to select Actions that Reasoning Agent executes on Thought Graph, dynamically adapting problem-solving strategy.
  • The framework aims to improve reasoning accuracy and efficiency by using LLMs as policy agents to guide exploration within a structured thought graph environment.

The amplifier effect of artificial agents in social contagion

  • Artificial Agent Social Contagion Framework: introduces agent types, experiments, attributes, threshold, adoption rate, seeding strategy, network, and proportion of artificial agents, to describe the impact of artificial agents on social contagion processes.
  • This framework investigates how artificial agents, compared to humans, exhibit lower adoption thresholds and amplify social contagion in networks.
  • The findings highlight the potential for artificial agents to accelerate behavioral shifts and raise questions about managing their influence in social systems.

Retrieval Augmented Generation for Topic Modeling in Organizational Research: An Introduction with Empirical Demonstration

  • Agentic RAG (Agentic Retrieval-Augmented Generation): introduces topic modeling method integrating retrieval, generation, and agent-driven learning for improved qualitative analysis.
  • Agentic RAG extends RAG with ReAct agent for iterative query reformulation and output evaluation, enhancing transparency and reliability.
  • Agentic RAG streamlines topic modeling by using embeddings, reducing preprocessing and improving efficiency over traditional methods.

The Power of Personality: A Human Simulation Perspective to Investigate Large Language Model Agents

  • Human Simulation Perspective: introduces framework with prompt-based personality shaping, single and multi-agent task testing and evaluation, group collaboration, team formation, and performance analysis to investigate personality traits influence on large language model agents in closed and open tasks.
  • This framework systematically explores how personality traits impact reasoning, creativity, and collaboration of LLM agents by assigning Big Five traits and evaluating performance in single-agent and multi-agent settings.
  • The study reveals that specific personality traits significantly affect agent performance and multi-agent systems exhibit collective intelligence driven by personality combinations, demonstrating LLMs' inherent human behavior simulation capabilities.

Digital Player: Evaluating Large Language Models based Human-like Agent in Games

  • CivAgent: introduces a Large Language Model-based agent for strategy games, integrating perception, memory, reasoning & planning, skills, tools, and game components for human-like gameplay.
  • CivAgent utilizes game observations and stored interaction data within its memory to inform reasoning and planning for executing in-game skills and leveraging external tools.
  • The framework incorporates a simulator within its tools component to enhance numerical reasoning and decision-making processes in the complex game environment.

Cyber Defense Reinvented: Large Language Models as Threat Intelligence Copilots

  • CYLENS (Cyber Defense Reinvented: Large Language Models as Threat Intelligence Copilots): introduces a cyber threat intelligence copilot system, integrating Base LLMs, Large-scale Knowledge, Task-Oriented Dataset, Curriculum Pre-training, Cascading Reasoning, and Specialized NLP Modules.
  • CYLENS enhances cyber threat analysis through cascading reasoning and specialized NLP modules for tasks like attribution, contextualization, detection, correlation, prioritization, and remediation.
  • The framework utilizes curriculum pre-training and fine-tuning methodologies to embed extensive CTI knowledge and adapt to diverse organizational needs in cybersecurity.

The Rise of Darkness: Safety-Utility Trade-Offs in Role-Playing Dialogue Agents

  • ADMP (Adaptive Dynamic Multi-Preference) introduces a method that dynamically adjusts safety-utility preferences, incorporating train dataset, ADMP, sample dataset, CMS, character settings, GPT-4, safety reward model, utility reward model and typical interaction library.
  • ADMP framework utilizes Coupling Margin Sampling (CMS) to enhance safety in high-risk scenarios through character-query risk coupling measurement within typical interaction library and preference weight sampling and mapping.
  • The framework aims to balance safety and utility in role-playing dialogue agents, addressing risk coupling between user queries and character settings to mitigate unsafe content generation.

ProAI: Proactive Multi-Agent Conversational AI with Structured Knowledge Base for Psychiatric Diagnosis

  • ProAI (Proactive AI): introduces a proactive conversational framework for mental health diagnosis, integrating Multi-Agent Proactive Reasoning Workflow, Structured Knowledge Graph, and Multifaceted Evaluation Strategy.
  • ProAI framework employs Decision-Maker and Question-Generator Agents within Multi-Agent Proactive Reasoning Workflow, utilizing Structured Knowledge Retrieval and Action Prediction to navigate Structured Knowledge Graph.
  • Multifaceted Evaluation Strategy of ProAI, encompassing Simulated Clinical Interview, User Experience Evaluation, and Doctor Evaluation, ensures comprehensive assessment of diagnostic accuracy, user experience and medical proficiency.

Multi²: Multi-Agent Test-Time Scalable Framework for Multi-Document Processing

  • Multi² (Multi-Agent Test-Time Scalable Framework): introduces multi-agent framework with Input documents processed using Prompt Bank's Prompts to generate Candidate Summaries, then aggregated by Aggregator (Voter, Context-preserve, Context-independent) into Final summary, evaluated by Evaluation metrics (CAP score, LLM-ACU score) against Baseline summary.
  • Multi² framework leverages prompt ensemble for multi-document summarization, employing diverse Prompts from Prompt Bank to guide independent LLM agents in generating Candidate Summaries, which are then consolidated by Aggregator module.
  • The framework's Aggregator offers three distinct approaches: Voter selects best summary, Context-preserve refines summary using documents and candidates, and Context-independent consolidates summaries without original documents, all evaluated with CAP score and LLM-ACU score metrics.

27th February 2025

WHY ARE WEB AI AGENTS MORE VULNERABLE THAN STANDALONE LLMS? A SECURITY ANALYSIS

  • OpenHands (Web AI agent platform): introduces a framework for analyzing web agent vulnerabilities, comprising Goal Preprocessing, Action Space, Event Stream, LLM, and Eval Environment components.
  • This framework investigates how embedding user goals, multi-step actions, and observational capabilities increase web agent vulnerability compared to standalone LLMs.
  • The study uses component ablation to identify specific design choices that contribute to the heightened susceptibility of web agents to jailbreaking.

Multi-Agent Verification: Scaling Test-Time Compute with Multiple Verifiers

  • MAV (Multi-Agent Verification): introduces Generator LLM for output generation, Aspect Verifiers for output evaluation, Aggregation for signal combination, BoN-MAV as multi-agent algorithm, BoN-RM as reward model algorithm, and Self-Consistency as consistency algorithm for test-time compute scaling.
  • MAV paradigm combines multiple Aspect Verifiers to evaluate Generator LLM outputs, using Aggregation of verifier signals to improve performance over BoN-RM and Self-Consistency baselines.
  • BoN-MAV algorithm, a specific implementation of MAV, demonstrates effective test-time scaling by increasing number of Aspect Verifiers, showing improvements in accuracy across diverse language models and domains.

Bridging Legal Knowledge and AI: Retrieval-Augmented Generation with Vector Stores, Knowledge Graphs, and Hierarchical Non-negative Matrix Factorization

  • Smart-SLIC (Smart Semantic Legal Information and Computational System): introduces a legal AI framework integrating RAG, VS, KG, and NMF for enhanced legal research.
  • Smart-SLIC leverages vector stores for semantic retrieval, knowledge graphs for relationship navigation, and NMF for latent topic discovery in legal documents.
  • This framework aims to improve legal information retrieval, reasoning, and explainability by combining these components for complex legal tasks.

Telephone Surveys Meet Conversational AI: Evaluating a LLM-Based Telephone Survey System at Scale

  • AI-driven telephone survey system: introduces a voice-based conversational AI agent for conducting phone surveys, integrating Speech to Text, Large Language Model, and Text to Speech components for real-time dialogue.
  • The system incorporates a Turn-taking Model for managing conversation flow, a Deterministic consent checker and Llama Guard safety model within a Safety suite for secure interactions, and Logger, Recording of call, and Closed Database for data management.
  • This architecture enables automated large-scale telephone surveys, mimicking human interviewers while maintaining data quality and operational efficiency.

Collab-Overcooked: Benchmarking and Evaluating Large Language Models as Collaborative Agents

  • Collab-Overcooked Benchmark: introduces a framework with Memory, Reflection, Instruction-Builder, Planner, Communication, Error Handling, Executor, and Environment State components for evaluating LLM-based multi-agent collaboration.
  • This benchmark assesses collaboration capabilities in a simulated kitchen environment featuring resource isolation and asymmetric task knowledge between agents.
  • It employs process-oriented metrics like Initiating Capability and Responding Capability alongside end-to-end metrics to enable fine-grained analysis of collaborative performance.

Picking the Cream of the Crop: Visual-Centric Data Selection with Collaborative Agents

  • ViSA (Visual-Centric Selection approach via Agents Collaboration): introduces a multi-agent framework for high-quality visual instruction data selection, incorporating Visual Information Quantification, Diversity Perspectives Quantification, and Text Quality Quantification components.
  • ViSA evaluates image informativeness and instruction relevance by leveraging visual agents including InternVL, QwenVL and Llava to assess visual elements using SAM2 and DINO, and image-specific features.
  • ViSA utilizes Shapley value based Agent Collaboration to refine evaluation scores like Segmentation Complexity Score, Object Alignment Score, Diversity Perspective Score, Prior Token Perplexity Score and Image-Text Mutual Information Score and improve MLLM training efficiency by reducing dataset noise.

MIND: Towards Immersive Psychological Healing with Multi-agent Inner Dialogue

  • MIND (Multi-agent INner Dialogue): introduces multi-agent framework with Trigger, Devil, Guide, Strategist, Player and Memory components for immersive psychological healing through inner dialogue.
  • MIND framework utilizes Trigger for scenario generation, Devil for cognitive distortion simulation, Guide for restructuring guidance, Strategist for storyline progression, Player as simulated patient and Memory for narrative coherence.
  • MIND paradigm aims to enhance empathy and self-reconciliation by decomposing patient's conflicting self into interactive agents facilitating cognitive scaffold for metacognitive reflection and therapeutic efficacy.

Supervised Fine-Tuning LLMs to Behave as Pedagogical Agents in Programming Education

  • GuideLM: introduces a fine-tuning framework with curated question-answer dataset, script-based, manual and LLM-based pre-processing, OpenAI fine-tuning and pedagogical model.
  • GuideLM framework employs supervised fine-tuning to create pedagogically sound LLMs for programming education by refining existing models with targeted datasets.
  • The framework aims to improve Socratic guidance and economy of words in LLM responses for novice programmers, enhancing learning without over-assistance.

--

Personas Evolved: Designing Ethical LLM-Based Conversational Agent Personalities

  • Framework name: introduces a workshop for responsible design and evaluation of ethical LLM-based conversational agent personalities.
  • This workshop addresses ethical and practical concerns of rapidly adopted LLM-based personas in conversational user interfaces.
  • The workshop aims to bridge CUI and AI communities to ensure transparency, inclusivity, and user-centered LLM-driven CUIs.

TripCraft: A Benchmark for Spatio-Temporally Fine Grained Travel Planning

  • TripCraft: introduces a benchmark for fine-grained travel planning, with User (initiates travel plan), Query (travel plan request), Persona (travel style preferences), Agent (generates travel plan), Reference Information (data for plan generation), Databases (storage for data), Generated Plan (output itinerary), Temporal Meal Score (meal scheduling quality), Temporal Attraction Score (attraction visit duration), Spatial Score (travel efficiency), Ordering Score (itinerary sequence), Persona Score (user preference alignment), CPR Macro (commonsense constraint adherence), CPR Micro (commonsense constraint adherence), HCPR Macro (hard constraint adherence), HCPR Micro (hard constraint adherence), and Delivery Rate (plan generation success).
  • TripCraft assesses language agents in generating constraint-aware travel itineraries by incorporating user preferences and real-world constraints.
  • The benchmark uses continuous evaluation metrics to assess temporal, spatial, sequential, and persona-specific aspects of generated travel plans.

A Thousand Words or An Image: Studying the Influence of Persona Modality in Multimodal LLMs

  • Persona Modality Representation Framework: introduces framework with LLM, Stable Diffusion, Pillow Typography, Text, Image, Assisted Image, Descriptive Image components for studying persona modality influence in multimodal LLMs.
  • Persona Modality Representation Framework evaluates persona embodiment using text, image, assisted image and descriptive image modalities generated through pipeline involving LLM, Stable Diffusion and Pillow Typography.
  • Persona Modality Representation Framework systematically investigates how different modalities impact persona expressiveness in multimodal LLMs, utilizing diverse persona dataset and evaluation framework.

Large Language Model Strategic Reasoning Evaluation through Behavioral Game Theory

  • Evaluation framework: introduces a three-step method for evaluating LLMs' strategic reasoning, incorporating Abstracted Game Library, Responder Model or Agents, and TQRE Estimation components.
  • This framework systematically assesses LLMs' reasoning capability through game abstraction and TQRE-based parameter analysis, accounting for contextual complexity.
  • It evaluates strategic reasoning beyond Nash Equilibrium, considering demographic biases and prompt effects using behavioral game theory principles.

26th February 2025

Agentic Mixture-of-Workflows for Multi-Modal Chemical Search

  • CRAG-MoW (Mixture-of-Workflows for Self-Corrective Retrieval-Augmented Generation): introduces agentic framework for multi-modal chemical search, includes User, Generators, Vector Store, Document Fusion, Aggregator Agent, and Report Generation.
  • CRAG-MoW orchestrates multiple CRAG workflows using Generators for iterative self-correction and Aggregator Agent for synthesizing final response.
  • Framework leverages structured retrieval and multi-agent synthesis to enhance response quality and interpretability for materials discovery.

Weaker LLMs' Opinions Also Matter: Mixture of Opinions Enhances LLM's Mathematical Reasoning

  • MoO (Mixture of Opinions): introduces post-training method using Dataset Curation, Ancillary LLMs, Main LLM, Chain-of-Thought Reasoning Steps, Opinions, Post-Training, Inference and Post-trained Main LLM to enhance mathematical reasoning of stronger Main LLM by incorporating diverse Opinions from weaker Ancillary LLMs.
  • MoO framework curates dataset by augmenting training samples with Chain-of-Thought reasoning and diverse Opinions from multiple weaker Ancillary LLMs, then fine-tunes Main LLM on this dataset.
  • The post-trained Main LLM in MoO framework demonstrates improved mathematical reasoning by learning to synthesize insights from varied Opinions during the Post-Training phase.

Stay Focused: Problem Drift in Multi-Agent Debate

  • DRIFTJudge/DRIFTPolicy Framework: introduces DRIFTJudge for drift detection and DRIFTPolicy for mitigation in multi-agent debate with Discussion and Voted Solution components.
  • This framework addresses problem drift, a performance degradation issue in multi-agent debate over multiple turns.
  • The framework aims to improve the effectiveness of multi-agent debate by identifying and reducing problem drift.

Winning Big with Small Models: Knowledge Distillation vs. Self-Training for Reducing Hallucination in QA Agents

  • QA Pipeline (retrieval-augmented question-answering pipeline): introduces retrieval-augmented system, with User Question, Product Manual, Retrieval, QA Pipeline and Factual Response components, designed to generate factual answers from product manual based on user questions.
  • The framework utilizes product manual as structured knowledge source to provide relevant information for question-answering process.
  • The described pipeline aims to address hallucination in question answering systems by grounding responses in provided product manual content.

Conversational Planning for Personal Plans

  • LLM-based Hierarchical Framework: introduces a novel architecture for conversational agents using Meta-Controller, Policy-Option, Tool-Use Policy, Memory, and Tools and RAG components to enable long-term interactive planning.
  • The framework employs a LLM-powered Meta-Controller to decide macro-actions, LLM-powered Policy-Options to execute these actions, and Tool-Use Policy to fetch relevant content, leveraging Memory and Tools and RAG for context and knowledge retrieval.
  • This approach facilitates adaptive planning through conversation and feedback, applicable to various scenarios requiring long-term user assistance, such as tutoring and personal health planning.

TheoremExplainAgent: Towards Multimodal Explanations for LLM Theorem Understanding

  • TheoremExplainAgent: introduces an agentic framework for multimodal theorem explanation video generation, incorporating Theorems, Prompting, Planner Agent, Code Agent, Multimodal Elements, Rendered Video and Evaluation components.
  • TheoremExplainAgent utilizes Planner Agent with Scene Outline, Vision Storyboard Plan, Technical Implementation Plan and Animation & Narration Plan to create video plans, and Code Agent with Query Generator, Core Documentation, Plugin Documentation and Agentic RAG to generate animation code.
  • The framework outputs Rendered Video composed of Multimodal Elements and is assessed by Evaluation metrics, aiming to enhance theorem understanding through visual explanations.

Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems

  • REWARDAGENT (Agentic Reward Modeling): introduces router, verification agents (factuality, instruction-following), and judger to combine human preference rewards with verifiable correctness signals for reliable reward systems.
  • Agentic reward modeling enhances reward reliability by integrating multi-dimensional correctness signals and enabling flexible incorporation of diverse verification agents.
  • REWARDAGENT empirically demonstrates superior performance over vanilla reward models and improves LLM training through DPO with agent-constructed preference pairs.

Agent-centric Information Access

  • Agent-centric Information Access Framework: introduces architecture orchestrating domain-expert and user-specific LLMs through User Agents (personalized user interface), Knowledge Agents (domain expert LLM), and Belief on Expertise (expertise assessment model).
  • The framework utilizes User Expertise (user knowledge history) and Knowledge Base (domain specific data) to dynamically manage Query (information request) and Response (synthesized answer) cycles, incorporating Training (expertise model update) via Data-Metadata (inter-agent communication).
  • This architecture facilitates efficient information retrieval by dynamically selecting and querying relevant expert LLMs, optimizing for accuracy, cost, and latency in multi-expert knowledge synthesis.

Simulation of Language Evolution under Regulated Social Media Platforms: A Synergistic Approach of Large Language Models and Genetic Algorithms

  • LLM-driven multi-agent framework (Large Language Model-driven multi-agent framework): introduces multi-agent simulation for language evolution on regulated platforms, incorporating participant agents evolving language and supervisory agent enforcing regulations, utilizing Reflection-, Planning-, Dialogue- and Memory-Modules, with Constraint/Expression Strategy Update, Dialogue Log, Keyword Filter, LLM Assessment, Violation Log, Regulations and Violation Detection.
  • Framework employs dual language strategies (constraint and expression) and LLM-driven Genetic Algorithm for strategy optimization through selection, mutation, and crossover, enhancing adaptability and simulation fidelity.
  • Participant and supervisory agents, both LLM-driven, interact iteratively, refining language strategies to balance effective communication with evasion of regulatory constraints in simulated social media environments.

MEDDxAgent: A Unified Modular Agent Framework for Explainable Automatic Differential Diagnosis

  • MEDDxAgent (Modular Explainable DDx Agent): introduces a modular agent framework for explainable automatic differential diagnosis, integrating DDxDriver, History Taking Simulator, Knowledge Retrieval Agent, and Diagnosis Strategy Agent.
  • MEDDxAgent facilitates iterative diagnostic reasoning by using DDxDriver as central orchestrator to manage interactions between simulator and agents for refining diagnoses.
  • The framework enhances explainability and transparency in the diagnostic process through intermediate logging and iterative updates of patient profiles and diagnoses.

A Temporal Planning Framework for Multi-Agent Systems via LLM-Aided Knowledge Base Management

  • PLANTOR (PLanning with Natural language for Task-Oriented Robots) introduces a temporal planning framework for multi-agent systems, integrating natural language input, LLMs for knowledge base generation, Prolog for planning, and behaviour trees for ROS2 execution.
  • The framework employs a two-phase knowledge base generation using high-level and low-level LLMs, followed by a three-step planning procedure that incorporates temporal dependencies and resource constraints solved via mixed-integer linear programming.
  • PLANTOR leverages LLMs for human-understandable knowledge bases and Prolog for formal correctness, demonstrating potential for advanced robotics tasks requiring flexible and scalable planning.

Language-Driven Opinion Dynamics in Agent-Based Simulations with LLMs

  • LODAS (Language-Driven Opinion Dynamics Model for Agent-Based Simulations): introduces a framework for simulating opinion dynamics using language and social interactions with LLM Agents, Network of connections, Initial opinion, Opponent, Discussant, Prompt, Discussion statement, Arguments, and Opinion change.
  • LODAS framework explores how language and logical fallacies influence opinion evolution in agent-based simulations, simulating debates around the "Ship of Theseus" paradox.
  • The model utilizes LLM Agents as Opponent and Discussant roles, guided by Prompts and exchanging Arguments related to a Discussion statement to observe Opinion change within a Network of connections.

NEXUS: A LIGHTWEIGHT AND SCALABLE MULTI-AGENT FRAMEWORK FOR COMPLEX TASKS AUTOMATION

  • Nexus (A Lightweight and Scalable Multi-Agent Framework): introduces a Python framework for constructing LLM-based multi-agent systems, incorporating Supervisor, Task Supervisors, Worker Agents, Memory, and Tools for automating complex tasks.
  • Nexus framework utilizes a multi-supervisor hierarchy for scalable delegation and YAML-based workflow design, facilitating efficient management of intricate tasks and enhancing scalability.
  • Nexus framework achieves state-of-the-art performance across coding, mathematical reasoning, and EDA domains, demonstrating its adaptability and efficacy in varied applications.

IndicEval-XL: Bridging Linguistic Diversity in Code Generation Across Indic Languages

  • IndicEval-XL: introduces comprehensive benchmark for code generation, with Original Dataset prompts, Language extraction, Translation, Back Translation, Quality checks, Programming Languages, and Natural Languages.
  • IndicEval-XL benchmark evaluates multilingual code generation across Indic languages, focusing on linguistic diversity and functional correctness.
  • The framework employs automated and human-based quality checks to ensure dataset reliability for benchmarking code generation models.

Letters from Future Self: Augmenting the Letter-Exchange Exercise with LLM-based Future Self Agents to Enhance Young Adults' Career Exploration

  • Framework name here: introduces a system augmenting letter-exchange exercise with User, Future-self Agent, Present Self Info, LLM, Current Career Exploration Context, and Envisioned Future Profile components.
  • The system utilizes Profile After 3 Years, Current Profile, and Current Career Development as input for Conversational Agent, offering Letter and Chat modalities for interaction.
  • This approach aims to enhance young adults' career exploration by simulating personalized future self interactions for guidance and reflection.

Multi-LLM Collaborative Search for Complex Problem Solving

  • MOSA (Mixture-of-Search-Agents) paradigm introduces collaborative search framework, integrating independent exploration and iterative refinement with root node, action space, child nodes, sampling LLM, sub-questions, candidate sub-answers, majority voting, aggregator, and aggregated candidate sub-answers.
  • MOSA leverages multiple LLMs as proposers for diverse search directions and as aggregators for refining candidate answers, enhancing reasoning accuracy in complex problem-solving.
  • Framework mitigates limitations of single-model approaches by combining independence and collaboration, effectively avoiding local optima during search-based reasoning processes.

REALM-Bench: A Real-World Planning Benchmark for LLMs and Multi-Agent Systems

  • Stock Market Prediction Workflow: introduces a system for stock price forecasting, incorporating data collection, feature extraction, model training, prediction generation, integration, and alert generation, validated through loops.
  • This workflow utilizes market data, news feeds, and economic data as inputs to generate trading alerts based on predicted stock prices.
  • The framework emphasizes validation and adaptive updates to maintain prediction accuracy and system reliability in dynamic market conditions.

Data-Efficient Multi-Agent Spatial Planning with LLMs

  • LLM-MASP (LLM-based Multi-Agent Spatial Planning Framework): introduces multi-agent spatial planning using pretrained language model, rollout algorithm, base policy, environment, state, action, prompt, parse output, finetuning, feasibility checking, resampling, and memory.
  • This framework leverages LLMs for efficient taxi routing by incorporating world knowledge and adapting to environmental factors through prompting and finetuning.
  • The use of rollout algorithm and finetuning with LLMs significantly reduces the need for environmental interactions while outperforming traditional methods.

Reward Shaping to Mitigate Reward Hacking in RLHF

  • RLHF training pipeline with reward shaping: introduces Prompt, Policy Model, Reference Model, Reward Model, Reward Shaping, Reshaped Reward and RL Training for aligning language models and mitigating reward hacking.
  • The pipeline utilizes Reward Shaping to modify proxy rewards from Reward Model, optionally using Reference Reward, before updating Policy Model via RL Training.
  • Preference As Reward (PAR) method, detailed as Reward Shaping, applies sigmoid function to centered reward to enhance training stability and mitigate reward hacking.

AGENTSociety Challenge: Designing LLM Agents for User Modeling and Recommendation on Web Platforms

  • Environment simulator: introduces interactive environment for evaluating LLM agents, with LLM Agents, Simulator, U-R-I Network, and Datasets components.
  • Environment simulator: constructs interactive environment comprising user, review, and item network, enabling agents to access historical data from datasets.
  • Environment simulator: facilitates agent performance evaluation in tasks resembling real-world applications for user modeling and recommendation.

TrajLLM: A Modular LLM-Enhanced Agent-Based Framework for Realistic Human Trajectory Simulation

  • TrajLLM (A Modular LLM-Enhanced Agent-Based Framework for Realistic Human Trajectory Simulation): introduces a modular framework for human trajectory simulation, integrating Persona Preprocess, Routine Activity Generation, Memory, and Destination modules.
  • This framework uses LLMs for activity and destination prediction, incorporating memory for historical context and physical models for spatial reasoning.
  • TrajLLM aims to generate realistic and adaptable human mobility patterns while ensuring scalable memory management and interpretable insights.

Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models

  • Hi Robot (Hierarchical interactive robot learning system): introduces a hierarchical framework using VLMs for complex instruction following, incorporating user prompts, high-level reasoning, intermediate commands, low-level execution, and verbal responses.
  • The framework decomposes policy into high-level VLM for complex prompt processing and low-level VLA for action execution.
  • Hi Robot enables robots to interpret complex language, adapt to feedback, and perform diverse tasks in open-ended environments.

25th February 2025

A Cooperative Multi-Agent Framework for Zero-Shot Named Entity Recognition

  • CMAS (cooperative multi-agent system): introduces multi-agent framework with self-annotator, TRF extractor, demonstration discriminator, and overall predictor for zero-shot NER.
  • CMAS leverages self-annotator for data generation, TRF extractor for contextual feature identification, demonstration discriminator for selective learning, and overall predictor for final prediction.
  • CMAS enhances zero-shot NER by integrating contextual correlations and self-reflection mechanism through collaborative agents, improving performance and robustness.

Hybrid Voting-Based Task Assignment in Role-Playing Games

  • VBTA (Voting-Based Task Assignment): introduces a framework for task allocation in role-playing games using capability profiles and task descriptions to generate a suitability matrix.
  • VBTA framework integrates voting methods and allocation strategies to manage task assignments, and employs a pre-trained LLM with custom prompts to resolve agent-task compatibility ambiguities.
  • By incorporating Conflict-Based Search for path planning, VBTA enables dynamic game content generation and automates agent decisions, enhancing narrative and gameplay immersion.

Assistance or Disruption? Exploring and Evaluating the Design and Trade-offs of Proactive AI Programming Support

  • Codellaborator: introduces proactive AI programming support framework, with Timing of Assistance, AI Agent Representation and Scope of Interaction components, exploring design trade-offs in human-AI workflows.
  • Codellaborator framework evaluates proactive assistance benefits and disruptions compared to user-initiated systems in programming tasks.
  • The research emphasizes adapting AI proactivity to programming processes for improved user control and code understanding.

INDEPENDENT MOBILITY GPT (IDM-GPT): A SELF-SUPERVISED MULTI-AGENT LARGE LANGUAGE MODEL FRAMEWORK FOR CUSTOMIZED TRAFFIC MOBILITY ANALYSIS USING MACHINE LEARNING MODELS

  • IDM-GPT (Independent Mobility GPT): introduces a multi-agent LLM framework with Input Validation, Self-optimization Prompting, Database Interaction, Data Analysis, and Self-supervision Modules, leveraging Database and Machine Learning Models for customized traffic mobility analysis.
  • This framework utilizes LLM-based AI agents to streamline traffic data analysis, enabling efficient processing of spatio-temporal data and ensuring data privacy by mediating user access to sensitive information.
  • IDM-GPT aims to address challenges in traffic management by providing a scalable solution for urban mobility improvement through optimized data analysis and actionable insights generation for users without ML expertise.

Single- vs. Dual-Prompt Dialogue Generation with LLMs for Job Interviews in Human Resources

  • Single- vs. Dual-Prompt Dialogue Generation: introduces Single-Prompt Generation (one prompt for full dialogue), Dual-Prompt Generation (two agents with prompts), and Judge LLM (evaluates dialogue authenticity) frameworks for HR job interview dialogue generation and quality assessment.
  • Single-Prompt Generation uses a single prompt to instruct LLM to create entire interview, whereas Dual-Prompt Generation uses two LLM-agents, interviewer and candidate, with separate prompts.
  • Judge LLM evaluates generated dialogues by pairwise comparison to determine if dialogues are distinguishable from human discourse, focusing on AI generation detection.

FRIDA to the Rescue! Analyzing Synthetic Data Effectiveness in Object-Based Common Sense Reasoning for Disaster Response

  • FRIDA (Field Ready Instruction Decoding Agent): introduces expert-in-the-loop pipeline with Templates, Disaster Relief Expert Input, Linguist Input, Seed sentences, Fine-tune Instruct model, Synthetic instructions, and Prompting LLM to generate synthetic data for fine-tuning smaller language models in disaster response domain.
  • FRIDA pipeline leverages domain and linguistic expertise to create high-quality seed data, which is then used to generate synthetic instructions for fine-tuning language models, enhancing their common sense reasoning about objects.
  • The framework demonstrates that fine-tuning smaller LLMs with synthetic data generated through the FRIDA pipeline improves their performance in object-based common sense reasoning tasks, particularly in disaster-related scenarios.

MAPORL: Multi-Agent Post-Co-Training for Collaborative Large Language Models with Reinforcement Learning

  • MAPORL (Multi-Agent Post-co-training for collaborative LLMs with Reinforcement Learning): introduces multi-agent post-co-training paradigm for collaborative large language models, with LLMs, Verifier, Multi-agent RL, and Multi-LLM Systems components.
  • MAPORL framework employs multi-agent reinforcement learning to co-train multiple LLMs for enhanced collaboration and generalization across diverse tasks.
  • The framework utilizes a verifier to evaluate LLM responses and discussions, providing co-training rewards maximized through multi-agent RL, fostering effective collaboration.

AgentRM: Enhancing Agent Generalization with Reward Modeling

  • AgentRM (Agent Reward Model): introduces generalizable reward model to guide policy model for effective test-time search using SFT Agent, Reward Model, and Policy Model components.
  • AgentRM framework includes Dataset for training and Environment for task execution, leveraging LLM as base model and Reward Annotation for signal generation.
  • AgentRM enhances agent generalization by finetuning reward model instead of policy model, improving performance on unseen tasks during Inference.

REFUTEBENCH 2.0 – AGENTIC BENCHMARK FOR DYNAMIC EVALUATION OF LLM RESPONSES TO REFUTATION INSTRUCTION

  • RefuteBench 2.0: introduces User, LLMs (Large Language Models), and Verifier components for dynamic evaluation of LLM responses to refutation instructions.
  • RefuteBench 2.0 framework employs User to provide feedback, LLMs as model under evaluation, and Verifier as evaluator agent.
  • RefuteBench 2.0 facilitates flexible assessment of LLM's ability to incorporate refutation feedback in multi-turn dialogues.

Debt Collection Negotiations with Large Language Models: An Evaluation System and Optimizing Decision Making with Multi-Agent

  • MADeN (Multi-Agent Debt Negotiation): introduces multi-agent framework with Communicating Agent (provides negotiation content), Planning Agent (designs decision framework), and Judging Agent (evaluates action rationality).
  • MADeN framework enhances debt negotiation by incorporating planning to design decision framework and judging module to evaluate action rationality.
  • MADeN framework aims to improve decision rationality in debt collection negotiations by addressing limitations of LLMs in making appropriate decisions based on debtor's financial condition.

LAG: LLM agents for Leaderboard Auto Generation on Demanding

  • LAG (Leaderboard Auto Generation): introduces a framework for automatic leaderboard creation, encompassing paper processing, table analysis, data integration, and leaderboard output with evaluation.
  • LAG framework utilizes LLMs to address challenges in generating up-to-date leaderboards from rapidly growing scientific publications, focusing on efficiency and quality.
  • The framework's stages systematically handle paper collection, information extraction, data recombination, and quality assessment to produce reliable and timely leaderboards.

Intersubjective Model of AI-mediated Communication: Augmenting Human-Human Text Chat through LLM-based Adaptive Agent Pair

  • Intersubjective Model: introduces an AI-mediated communication framework with Agent (LLM-based user proxy), Environment (independent chat space), Extraction (information distilling function), Conversation (dialogue management function), Information Transmission (agent-agent information sharing), Knowledge Base (information integration), and Online Chat Interface (user interaction platform).
  • This model facilitates human-human communication indirectly through independent agent interactions and information exchange, enabling adaptive message shaping and shared understanding.
  • The framework aims to overcome limitations of traditional communication models by removing the constraint of shared objective environment and allowing for customized interactions.

Carbon and Silicon, Coexist or Compete? A Survey on Human-AI Interactions in Agent-based Modeling and Simulation

  • 5W1H Taxonomy (5W1H Taxonomy for Human-AI Interactions in ABMS): introduces five dimensions - Why, When, Who, What, and How - to categorize human-AI interaction methods within Agent-Based Modeling and Simulation (ABMS).
  • The taxonomy decomposes interactions based on user goals, interaction phase, user roles, system components controlled, and interaction means, drawing analogy from theater roles to define user engagement.
  • This framework aims to provide a structured approach for analyzing and designing human-AI interactions in ABMS, facilitating development of more effective and user-centered simulation systems.

Large Language Model Driven Agents for Simulating Echo Chamber Formation

  • Model Framework: introduces data preparation, simulation process with LLM post generation, and analysis and validation to simulate echo chamber formation.
  • The framework employs LLM-enhanced approach for opinion evolution, network rewiring, and content generation, incorporating textual context for realistic simulation.
  • Simulation Process includes "screen" component, representing limited user attention and information accessibility within social media environments.

LLM Knows Geometry Better than Algebra: Numerical Understanding of LLM-Based Agents in A Trading Arena

  • Agent Trading Arena: introduces a virtual numerical game environment, with Agent, Stocks and Market, Chat Pool, Day Simulation, Reflection, Memory and Environment components, designed for evaluating numerical reasoning of LLM-based agents in stock trading.
  • Agent Trading Arena facilitates complex economic simulations through zero-sum games, enabling assessment of LLMs' geometric and algebraic reasoning capabilities using visual and textual numerical data.
  • The framework incorporates a reflection module to enhance strategy refinement based on trading performance and environmental feedback, promoting continuous agent adaptation and learning in a dynamic market.

MA-GTS: A Multi-Agent Framework for Solving Complex Graph Problems in Real-World Applications

  • MA-GTS (Multi-Agent Graph Theory Solver): introduces multi-agent framework for solving graph problems, with Information Extraction Layer (extracts text information), Knowledge Integration Layer (constructs structured graph data), and Algorithm Execution Layer (executes algorithms).
  • MA-GTS framework decomposes complex graph problems through agent collaboration and maps text-based graph data into structured representations.
  • MA-GTS framework dynamically selects suitable algorithm based on problem constraints and graph structure scale for efficient and interpretable solution process.

7 Points to Tsinghua but 10 Points to 清华? Assessing Large Language Models in Agentic Multilingual National Bias

  • Academic Career Planning Advisor: introduces input prompt, LLM component, and output response for university recommendation task.
  • Framework evaluates LLM's score and reasoning for provided universities in multilingual context.
  • System aims to identify nationality bias in LLM's advisory capabilities across languages.

FACT-AUDIT: An Adaptive Multi-Agent Framework for Dynamic Fact-Checking Evaluation of Large Language Models

  • FACT-AUDIT (An Adaptive Multi-Agent Framework for Dynamic Fact-Checking Evaluation) introduces an agent-driven framework with Appraiser, Taxonomy, Inquirer, Prototype, Quality Inspector, Memory Pool, Evaluator, Verify Fact & Produce Justification, Target LLM, Prober, Iterative Probing, and Automatic Evaluation components for adaptive and dynamic assessment of LLMs' fact-checking capabilities.
  • FACT-AUDIT framework adaptively generates datasets, performs iterative evaluations, and updates assessments based on model-specific responses, incorporating justification production for comprehensive audit of LLMs' factual reasoning.
  • The framework leverages multi-agent collaboration and importance sampling to address limitations of static datasets and classification metrics in existing fact-checking evaluation methods, providing a more nuanced and evolving audit process.

Towards Enhanced Immersion and Agency for LLM-based Interactive Drama

  • Immersion-Agency Paradigm: introduces framework for LLM-based interactive drama, enhancing player Immersion and Agency through Dramatic Story Generator from Premise paragraph, Role Agents responding to Prompt, and generating Drama Script after Post-process.
  • This paradigm uses Playwriting-guided Generation for improved story structure and Plot-based Reflection for agent reaction refinement.
  • The framework aims to bridge gap in current interactive dramas by focusing on deeper emotional connections and meaningful player influence within the story.

IMPROVE: ITERATIVE MODEL PIPELINE REFINEMENT AND OPTIMIZATION LEVERAGING LLM AGENTS

  • IMPROVE (Iterative Model Pipeline Refinement and Optimization leveraging LLM agents): introduces a multi-agent framework with Project Architect, Data Engineer, Model Engineer, Training Engineer, and Performance Analyst agents to iteratively refine ML pipelines based on user-provided dataset and task description.
  • IMPROVE framework utilizes Iterative Refinement strategy, optimizing one pipeline component at a time through training and evaluation process guided by Performance Analyst feedback for stable and interpretable improvements.
  • IMPROVE framework aims to automate object classification pipeline development, achieving high performance without requiring ML expertise by emulating human expert iterative refinement workflow.

24th February 2025

Aligning Compound AI Systems via System-level DPO

  • SysDPO (System-level Direct Preference Optimization): introduces DAG, LLM, Diffusion Model, DPO Loss, and Preference Dataset to align compound AI systems.
  • SysDPO framework uses DAG to model compound AI system, factorizes probability, and applies DPO loss for end-to-end optimization using preference dataset.
  • SysDPO enables joint alignment of components like LLM and diffusion models, improving coherence and preference alignment in complex AI systems.

ARACNE: An LLM-Based Autonomous Shell Pentesting Agent

  • ARACNE (Autonomous LLM-based Shell Pentesting Agent): introduces a novel multi-LLM architecture for autonomous shell pentesting, comprising user, core agent, planner, interpreter, summarizer, SSH server and context components.
  • ARACNE separates planning and command execution using distinct LLMs, enhancing flexibility and effectiveness in cybersecurity tasks.
  • The framework utilizes an optional summarizer to manage context window size, offering a trade-off between accuracy and attack duration.

IGDA: Interactive Graph Discovery Agent

  • IGDA (Interactive Graph Discovery Agent): introduces a LLM-based pipeline for interactive graph discovery, with Edge Confidence Estimation, Edge Experiment Selection, and Local Edge Updates components.
  • IGDA leverages LLMs for uncertainty-driven edge selection and local graph updates based on binary feedback from experiments.
  • IGDA iteratively refines graph predictions by selecting uncertain edges for experiments and updating related edges based on experimental outcomes.

A Multi-LLM-Agent-Based Framework for Economic and Public Policy Analysis

  • MLAB (Multi-LLM-Agent-Based Framework): introduces a novel approach for economic analysis by employing multiple LLMs as heterogeneous agents representing different socio-economic groups.
  • MLAB framework simulates policy impacts by mapping LLMs to educational and income brackets, utilizing calibrated economic parameters for each agent group.
  • This framework leverages LLMs' diverse reasoning capabilities to model population heterogeneity and analyze policy responses in economic scenarios.

Graphy'our Data: Towards End-to-End Modeling, Exploring and Generating Report from Raw Data

  • Graphy: introduces an end-to-end platform, with Offline Scrapper, Inspection, Define Workflow, File Extractor, LLM or Rule Extractor, Fact Node, Dimension Node, Navigation, Online Surveyor, Exploration, Search, StatRefiner, GraphView, NeighborQuery, Generation, DataInfer, Mindmap Generator, Confirmed, Report Writer, and Graph Store, that automates data modeling, exploration, and report generation from raw data.
  • Graphy platform comprises an offline Scrapper for transforming unstructured documents into structured graphs and an online Surveyor for iterative exploration and LLM-driven report creation.
  • Graphy facilitates progressive document investigation by enabling users to iteratively explore, analyze, and synthesize information from large unstructured datasets to generate high-quality reports.

Leveraging Large Language Models for Effective and Explainable Multi-Agent Credit Assignment

  • LLM-MCA (Large Language Model Multi-agent Credit Assignment): introduces centralized LLM Critic (LLM for credit assignment), Base Prompt (LLM initial instructions), LLM Parser (extracts feedback from LLM), Individualized Feedback (per-agent reward signals), Centralized Policy Training (learns decentralized policies), Demultiplexer (splits input data for critic), Multiplexer (aggregates agent feedback), Agent Policies (decentralized agent controllers), Environment (multi-agent simulation scenario), Observations and Global Reward (environment state input), and Joint Action (agent actions output).
  • LLM-MCA employs centralized LLM critic with base prompt to generate individualized feedback, guiding decentralized agent policy training for effective credit assignment.
  • By reformulating credit assignment as pattern recognition, LLM-MCA leverages LLMs to achieve human-level credit evaluation and enhance multi-agent cooperative learning.

Grounded Persuasive Language Generation for Automated Marketing

  • AI Realtor: introduces agentic framework, with Grounding Module, Personalization Module, Marketing Module, ChatGPT, to automate persuasive marketing content generation.
  • It uses LLMs to align content with user preferences and highlight factual attributes, demonstrated in real estate marketing.
  • The framework achieves superhuman persuasion in experiments, outperforming human experts in real estate marketing description generation.

Multi-Agent Autonomous Driving Systems with Large Language Models: A Survey of Recent Advances

  • LLM-based Multi-Agent ADS Framework: introduces multi-agent system for autonomous driving, with Environment (driving context), Information (perceived data), Action (driving commands), Profile (role definition), Agent (autonomous entity), Driver Agent (vehicle control), Infrastructure Agent (external infrastructure), Shared Message Pool (communication medium), and Memory (experience storage).
  • This framework employs profiles to define agent functionalities, facilitating collaborative decision-making through shared message pool and memory for experience retention.
  • The architecture improves driving safety and efficiency in intricate scenarios by integrating separate agents for vehicle and infrastructure interaction, supported by LLM-based reasoning capabilities.

AlphaAgent: LLM-Driven Alpha Mining with Regularized Exploration to Counteract Alpha Decay

  • AlphaAgent (LLM-Driven Alpha Mining with Regularized Exploration to Counteract Alpha Decay): introduces autonomous framework integrating Idea Agent, Factor Agent, and Eval Agent with regularization mechanisms for decay-resistant alpha factor mining.
  • AlphaAgent framework employs Human Knowledge, Research Report, Market Insight, Performance Metrics, Backtest, Self-reflection, Analysis Feedback, Factor Zoo, Regularization Mechanisms, Operator Library, and Abstract Syntax Trees within closed-loop iterative refinement process.
  • AlphaAgent utilizes originality enforcement, hypothesis-factor alignment, and complexity control to guide alpha generation, balancing financial rationale and market adaptability for effective alpha mining.

23rd February 2025

GUARDIANS OF THE AGENTIC SYSTEM: PREVENTING MANY SHOTS JAILBREAK WITH AGENTIC SYSTEM

  • Evaluating Agentic Systems: introduces methodology to evaluate agentic system security, with Reverse Turing Test, Aligning Multi-Agent Systems, and Prevention of Multi-Shot Jailbreaks.
  • The framework employs GamoraAI, RocketAI, Star-LordAI, GrootAI, ObserverAI agents for assessing security vulnerabilities, deceptive alignment, and jailbreak defense.
  • This comprehensive approach aims to enhance LLM-based agentic system robustness against adversarial threats through dynamic, tool-mediated security evaluations.

RapidPen: Fully Automated IP-to-Shell Penetration Testing with LLM-based Agents

  • RapidPen (RapidPenetration): introduces a fully automated penetration testing framework, integrating Re Module (task planning module), Act Module (command execution module), and RapidPen-vis (visualization and reporting tool), utilizing PTT (pentesting process data model) for IP-to-Shell achievement.
  • RapidPen framework incorporates ReAct paradigm with specialized RAG (Retrieval-Augmented Generation) repositories, featuring Re (L1) PTT Planner (PTT expansion and maintenance), Re (L1) PTT Prioritizer (task prioritization), Re (L2) New Tasks (Success Cases) (success case based task generation), Act (L1) Command Generation (command generation using RAG), Act (L1) Command Execution (executes commands), and Act (L1) Log Analysis (analyzes command logs) modules.
  • The framework leverages Command Generation RAG (RAG for command generation) and Success Cases RAG (RAG for success cases) to enhance offensive security, enabling autonomous vulnerability discovery and exploitation through iterative command refinement and success case reuse.

From Text to Space: Mapping Abstract Spatial Models in LLMs during a Grid-World Navigation Task

  • GWSOT (Grid-World Spatial Orientation Task): introduces agent, goal, grid, spatial information representations, LLM, activations, policy maps, and performance metrics to investigate spatial understanding of language models in grid navigation.
  • GWSOT evaluates how different spatial information representations like cartesian, topographic, and textual formats impact LLM navigation performance and internal spatial encoding.
  • The framework uses performance metrics and policy maps to analyze LLM success rate, path efficiency, and spatial decision-making within the grid-world environment.

BIOMAZE: BENCHMARKING AND ENHANCING LARGE LANGUAGE MODELS FOR BIOLOGICAL PATHWAY REASONING

  • PATHSEEKER: introduces LLM agent for biological pathway reasoning via interactive subgraph navigation.
  • PATHSEEKER enhances reasoning using global subgraph search, local subgraph search, graph encoding and final reasoning on pathway database.
  • This method provides robust, scientifically grounded approach for complex pathway reasoning challenges.

The Hidden Strength of Disagreement: Unraveling the Consensus-Diversity Tradeoff in Adaptive Multi-Agent Systems

  • Dynamic Consensus-Diversity Tradeoff: introduces a framework with Receives information, Arguments, Interpret intention, and Action components, describing consensus-diversity tradeoff in multi-agent systems.
  • This framework contrasts implicit consensus, where agents decide independently after discussion, with explicit consensus, where agents unify actions via voting.
  • The framework aims to demonstrate that implicit consensus enhances robustness and adaptability in dynamic environments by preserving diversity.

All That Glitters is Not Novel: Plagiarism in AI Generated Research

  • SSAG (Semantic Scholar Augmented Generation): introduces plagiarism detection framework with query generation, paper retrieval, relevance scoring and similarity checking components for LLM-generated research.
  • SSAG framework utilizes LLMs and Semantic Scholar API to identify similar research papers and assess plagiarism in generated research proposals.
  • SSAG framework's evaluation reveals limitations in detecting sophisticated plagiarism within LLM-generated research documents, highlighting need for improved methods.

22nd February 2025

SMARTIFY: A MULTI-AGENT FRAMEWORK FOR AUTOMATED VULNERABILITY DETECTION AND REPAIR IN SOLIDITY AND MOVE SMART CONTRACTS

  • Smartify (SMARTIFY: A MULTI-AGENT FRAMEWORK FOR AUTOMATED VULNERABILITY DETECTION AND REPAIR IN SOLIDITY AND MOVE SMART CONTRACTS): introduces a multi-agent framework with Auditor, Architect, Code Generator, Refiner, and Validator components for automated smart contract vulnerability detection and repair.
  • Smartify leverages specialized LLMs, including LLM1 (Gemma2 9B) for initial analysis and LLM 2 (FT CodeGemma) for code generation, alongside Move RAG and Solidity RAG for language-specific context retrieval.
  • Smartify framework processes Code Dataset of smart contracts through its components to output Repaired Smart Contract, aiming for improved accuracy and efficiency in vulnerability remediation within blockchain landscape.

Exploring Sentiment Manipulation by LLM-Enabled Intelligent Trading Agents

  • Framework name here: introduces a system exploring sentiment manipulation in trading using reinforcement learning agent, with arxiv_paper_framework_2-components RL-based Trading Agent, TD3 Algorithm, Actor Network, Critic Network, Target Networks, Internal State, Environmental State, Sentiment Agent, Social Media Feed, Sentiment Analysis (RoBERTa), Sentiment Heuristic, Social Media Post Generation, Language Model (Llama 3.2), Market Simulation (ABIDES), Order Book, and Historical Data (LOBSTER).
  • The framework investigates how an RL-based trading agent can learn to manipulate market sentiment through generated social media posts to improve trading performance in a simulated market environment.
  • The study utilizes a sentiment agent that reacts to social media posts and a market simulation driven by historical order book data to evaluate the RL agent's sentiment manipulation strategies.

Reproducibility Study of Cooperation, Competition, and Maliciousness: LLM-Stakeholders Interactive Negotiation

  • LLM-Stakeholders Interactive Negotiation benchmark: evaluates LLM agents in negotiation games with negotiation game, LLM agents, CoT prompts, single-agent baseline, multi-agent setup, evaluation metrics, Pareto front analysis, structure leakage metric, and inequality metric.
  • This benchmark study reproduces and extends prior negotiation research by analyzing open-weight models and introducing fairness and confidentiality metrics.
  • The research highlights that single-agent baselines can achieve comparable negotiation performance to multi-agent setups, questioning communication necessity.

An Autonomous Network Orchestration Framework Integrating Large Language Models with Continual Reinforcement Learning

  • ARC (Autonomous Reinforcement Coordination): introduces a two-tier network orchestration framework integrating LLMs and continual RL for SemCom-enabled SAGIN, featuring RAG for data processing, HAP for hierarchical planning, SKB and DKB for knowledge storage, and RL Agents for action execution.
  • ARC decomposes network orchestration into high-level planning using LLM within HAP and low-level decision-making using RL Agents within Action Executioner, enhancing adaptability and efficiency through continual learning and few-shot learning.
  • ARC utilizes RAG to generate allocation prompts for HAP, which then employs User Sequencer to optimize user order and Action Executioner with RL agents to execute resource allocation decisions based on SKB and DKB knowledge.

Mojito: LLM-Aided Motion Instructor with Jitter-Reduced Inertial Tokens

  • Mojito (LLM-Aided Motion Instructor): introduces an intelligent motion agent utilizing IMU Tokenizer, Motion Tokenizer, Distribution Matching, Motion Decoder, Projection Layers, Decoder-only Transformer, LoRA Adapters, Text Tokenizer, and Qwen2-based Language Model for interactive motion capture and analysis.
  • Mojito employs a jitter-reduced inertial token representation and extended language model to provide real-time human motion analysis and feedback, addressing limitations of noisy IMU data.
  • The framework leverages VQVAE for discrete latent space learning of IMU signals and incorporates LoRA adapters for personalized feedback styles in fitness or rehabilitation scenarios.

Protecting Users From Themselves: Safeguarding Contextual Privacy in Interactions with Conversational Agents

  • LLM-based Conversational Agent Framework: introduces a process for contextual privacy in conversational agents, with User Prompt (Initial user input), Detection & Flagging (Identifies context and sensitive data), Determine Subject & Context (Establishes topic and setting), Detect PII and Sensitive Phrases (Finds personal and private phrases), Sensitive Space (Categorizes sensitivity level), Essential Info (Identifies necessary information), Non-Essential Info (Identifies unnecessary information), Mitigation (Applies privacy measures), Get User Approval (User confirms action), Reformulate Prompt (Rewrites user input for privacy), Get User Approval (User confirms rewritten input), and LLM-based Conversational Agent (Core agent processing input).
  • This framework processes user prompts to recognize context and sensitive information, subsequently providing revised prompts to users that aim to maintain original intent while minimizing out-of-context details.
  • The framework empowers users to make informed privacy decisions during interactions with conversational agents by identifying and reformulating contextually inappropriate information in prompts.

Echo: A Large Language Model with Temporal Episodic Memory

  • MADGF (Multi-Agent Data Generation Framework): introduces Characters, Plots, and Environments to simulate multi-turn dialogues for generating episodic memory training data.
  • MADGF framework controls dialogue scenarios between human roles and AI assistant to create context-rich episodic memory data.
  • MADGF framework aims to produce high-quality episodic memory data by designing diverse characters and plot-driven dialogues.

Curie: Toward Rigorous and Automated Scientific Experimentation with AI Agents

  • Curie: introduces AI agent framework designed for rigorous automated scientific experimentation with intra-agent rigor module, inter-agent rigor module and experiment knowledge manager.
  • Curie framework employs architect agent for planning and technician agents for execution, coordinated by experimental rigor module.
  • Curie framework aims to enhance reliability, methodical control, and interpretability in AI-driven scientific experimentation.

RAG-Enhanced Collaborative LLM Agents for Drug Discovery

  • CLADD (Collaborative framework of LLM Agents for Drug Discovery): introduces multi-agent framework for drug discovery question-answering, integrating Planning Team (identifies data sources), Knowledge Graph Team (retrieves KG information), Molecule Understanding Team (molecule description), and Prediction Agent (generates final answer) to leverage Annotation Database (molecular annotations source), Knowledge Graph (biomedical knowledge source), Captioning Tool (external molecule captioning) and Available Data and Tools (general resources).
  • CLADD framework utilizes Planning Team with MolAnn Planner (annotation database relevance) and KG Planner (knowledge graph relevance), Knowledge Graph Team with DrugRel Agent (related drug entities report) and BioRel Agent (biological relationships report), and Molecule Understanding Team with MU Agent (molecule annotation report) to provide comprehensive analysis.
  • CLADD framework enhances drug discovery tasks by employing collaborative agents to dynamically retrieve and integrate external knowledge, improving interpretability and flexibility without domain-specific fine-tuning.

21st February 2025

Multi-Agent Multimodal Models for Multicultural Text to Image Generation

  • MosAIG (Multi-Agent framework for Multicultural Image Generation): introduces multi-agent framework with Moderator, Social Agents (Country, Landmark, Age-Gender), Summarizer Agents, and Social Agents Conversation to generate Image Caption for AltDiffusion/FLUX image generation models.
  • MosAIG framework employs iterative Social Agents Conversation for refining culturally sensitive and contextually rich Image Caption, enhancing multicultural text-to-image generation.
  • MosAIG framework leverages distinct agent roles to decompose multicultural image generation task, achieving improved Alignment, Aesthetics, and Quality compared to simple models.

RÂłMem: Bridging Memory Retention and Retrieval via Reversible Compression

  • RÂłMem (Retention and Retrieval through Reversible context compression): introduces memory network optimizing information retention and retrieval through reversible context compression with Reversible Adapter, Large Language Model M, and Virtual memory token components.
  • RÂłMem employs hierarchical compression for multi-granularity assimilation and reversible architecture integrating Context Compression and Context Expansion for duplex network.
  • RÂłMem utilizes Virtual memory token to encode long histories and achieves state-of-the-art performance in long-context language tasks.

Self-Taught Agentic Long-Context Understanding

  • AgenticLU (Agentic Long-Context Understanding): introduces a framework designed for enhancing long-context question answering in LLMs, utilizing Chain-of-Clarifications (CoC) through iterative Raise Clarification Question, Find Context, and Self Clarify steps, and trained via CoC Path Distillation, SFT Dataset, Path Sampling, DPO Dataset, and Path & Neg Path Pair, starting from a base LLM and resulting in an Answer to the Long Context QA, contrasting with a Direct Answer approach.
  • AgenticLU framework employs a two-stage fine-tuning process involving Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) to distill collected Chain-of-Clarifications (CoC) paths into a single inference pass model, improving efficiency and effectiveness.
  • The core innovation of AgenticLU lies in its Chain-of-Clarifications (CoC) mechanism, which enables models to iteratively refine understanding and resolve uncertainties in long contexts through self-generated questions and contextual grounding, leading to improved reasoning and answer quality.

AutoToM: Automated Bayesian Inverse Planning and Model Discovery for Open-ended Theory of Mind

  • AutoToM (Automated Theory of Mind): introduces automated Bayesian Theory of Mind method with Information Extraction, Initial Model Proposal, BTOM Models, Bayesian Inverse Planning, and Model Adjustment components.
  • AutoToM leverages Large Language Model for backend operations and iteratively refines Bayesian Theory of Mind model based on inference uncertainty.
  • This framework achieves state-of-the-art performance in Theory of Mind benchmarks, offering scalable, robust, and interpretable approach to machine Theory of Mind.

WorldCraft: Photo-Realistic 3D World Creation and Customization via LLM Agents

  • WorldCraft: introduces a system utilizing LLM agents including coordinator, Forgelt, Arrangelt, trajectory control, asset collection and renderer components to create photo-realistic 3D virtual worlds from text instructions.
  • WorldCraft framework employs a coordinator agent managing specialized agents for object customization, layout arrangement and scene animation based on user's natural language input.
  • WorldCraft enables non-professionals to create and customize complex 3D scenes with precise object geometry and PBR textures through intuitive natural language interaction.

Construction and Evaluation of LLM-based agents for Semi-Autonomous penetration testing

  • LLM Penetration Testing Agent: introduces a semi-autonomous penetration testing system, with Planning Module, Executor Module, Summarizer Module, RAG, Search Engines, Execution Environment, and PTT, to address limitations of LLMs in cybersecurity tasks.
  • The system employs multiple LLMs in modules for strategy formulation, command generation, and result analysis, leveraging RAG and search engines for knowledge integration.
  • This framework aims to overcome challenges in applying LLMs to penetration testing by using iterative reasoning and flexible information retrieval, reducing manual intervention.

Position: Standard Benchmarks Fail – LLM Agents Present Overlooked Risks for Financial Applications

  • SAEA (Safety-Aware Evaluation Agent): introduces a three-level evaluation framework, including model-level, workflow-level, and system-level audits, to assess safety risks of LLM agents in finance.
  • SAEA framework analyzes agent's intrinsic capabilities, multi-step process reliability, and integration robustness to identify vulnerabilities overlooked by traditional benchmarks.
  • The proposed SAEA framework shifts focus from raw performance to safety, robustness, and real-world resilience, addressing critical gaps in current LLM agent evaluations for high-stakes financial applications.

Pub-Guard-LLM: Detecting Fraudulent Biomedical Articles with Reliable Explanations

  • Pub-Guard-LLM (Large Language Model): introduces Pub-Guard-LLM, a system for detecting fraudulent biomedical articles, with Input Article, External Knowledge, Teacher Model, Pub-Guard-LLM, Vanilla, RAG, Debate, Fine-Tuning, Output, Prediction, Explanation, Relevance, and Coherence components.
  • Pub-Guard-LLM enhances fraud detection in biomedical research by providing reliable explanations for its predictions.
  • The framework offers three application modes: Vanilla Reasoning, Retrieval-Augmented Generation, and Multi-Agent Debate, to accommodate diverse user needs and improve detection performance and explainability.

Textual-to-Visual Iterative Self-Verification for Slide Generation

  • Iterative Self-Verification Framework (Textual-to-Visual Iterative Self-Verification Framework): decomposes slide generation into content and layout generation, using textual-to-visual self-verification for refinement.
  • Content generation enhances coherence using context from surrounding slides and section retrieval, while layout generation employs Reviewer + Refiner workflow.
  • Modality transformation visualizes textual layouts, enabling intuitive review and refinement by LLM-based Reviewer and Refiner modules for improved slide quality.

ARS: Automatic Routing Solver with Large Language Models

  • ARS (Automatic Routing Solver): introduces automatic routing solver framework, with pre-defined constraint examples, constraint selection, constraint checker, violation scorer, constraint handling method, initialization, optimization, final solution, local search, destroy & repair, destroy operators, repair operator, local search operators, input problem instance and termination condition.
  • ARS framework enhances backbone heuristic algorithm by automatically generating constraint-aware heuristics using LLM agents.
  • ARS framework utilizes database of VRP constraints and RAG-like approach for constraint selection to improve heuristic generation.

Auto-Bench: An Automated Benchmark for Scientific Discovery in LLMs

  • Auto-Bench: introduces benchmark for evaluating Large Language Models in scientific discovery, incorporating settings, prompting, Large Language Model, interventions, observations, ground-truths comparison, adjacency matrix, match-not match, and loop control components.
  • Auto-Bench framework evaluates LLMs' capability to discover hidden causal structures via iterative interactions and strategic interventions within chemistry and social network environments.
  • This benchmark leverages causal graph discovery to assess LLMs' reasoning and decision-making skills in simulated scientific exploration tasks.

The Evolving Landscape of LLM- and VLM-Integrated Reinforcement Learning

  • Taxonomy: introduces a framework for integrating LLMs/VLMs into RL, categorizing approaches based on three roles: agent (FM serves as policy), planner (FM generates sub-goals), and reward (FM shapes rewards).
  • This taxonomy further distinguishes agent roles into parametric (fine-tuning FM to generate outputs) and non-parametric (enriching prompts with context) approaches, planner roles into comprehensive (sequence of sub-goals in one pass) and incremental (sub-goals step by step) planning, and reward roles into reward model (outputs scalar reward signal) and reward function (specifies reward function code) mechanisms.
  • The framework helps to understand how LLMs/VLMs address RL challenges like prior knowledge, planning, and reward design, paving the way for unifying natural language and visual understanding with sequential decision-making.

Investigating the Adaptive Robustness with Knowledge Conflicts in LLM-based Multi-Agent Systems

  • AutoGen (Multi-Agent System framework): introduces a multi-agent programming system comprising project manager for task coordination, coders for collaborative programming, and executor for tool interaction and code execution.
  • This framework investigates robustness of LLM-based multi-agent systems when facing knowledge conflicts during collaborative programming tasks.
  • The system aims to simulate real-world collaborative programming scenarios to analyze the impact of knowledge conflicts on decision-making and system stability.

20th February 2025

GATE: Graph-based Adaptive Tool Evolution Across Diverse Tasks

  • GATE (Graph-based Adaptive Tool Evolution): introduces an adaptive framework for dynamic construction and evolution of hierarchical graph of reusable tools across scenarios, utilizing Task Solver, Tool Manager, Adaptive Tool Graph, Graphrank Retrieval, Tool Requirement, Tool Generation, Tool Creation, Tool Merging, Self-Check, Tool Graph Update, Basic Tools, Composed Tools, Node, Edge, Adjacency Matrix, Graphrank Algorithm, Pruning, Online Learning, Training Stage and Testing Stage.
  • GATE framework employs two interacting agents, Task Solver and Tool Manager, with Adaptive Tool Graph to dynamically manage and evolve toolset, addressing tool redundancy and limited generalizability in existing methods.
  • The framework leverages Graphrank Retrieval for efficient tool discovery and incorporates Self-Check and Tool Merging to ensure tool quality and conciseness, achieving state-of-the-art performance across diverse tasks including open-ended and closed-ended scenarios.

Is Safety Standard Same for Everyone? User-Specific Safety Evaluation of Large Language Models

  • U-SAFEBENCH (User-Specific Safety Benchmark): introduces benchmark, with LLM Agent (generates response considering user profile) and LLM-as-a-Judge (evaluates response safety and refusal), to evaluate user-specific safety of Large Language Models.
  • U-SAFEBENCH assesses if LLM Agent response is user-specific unsafe response based on user profile and instruction.
  • U-SAFEBENCH employs LLM-as-a-Judge to classify LLM Agent response as either refusal or fulfillment regarding user instruction.

Red-Teaming LLM Multi-Agent Systems via Communication Attacks

  • AiTM (Agent-in-the-Middle) introduces communication attack framework for LLM Multi-Agent Systems, which includes Benign Agent (system participant), Malicious Agent (harmful actor), Agent-in-the-Middle (message interceptor manipulator), Adversarial Input (malicious agent data), Communication Channel (message pathway), Messages (agent information exchange), Reflection Mechanism (adversarial self-improvement), and Victim Agent (targeted agent).
  • The framework evaluates vulnerability by employing Agent-in-the-Middle to intercept Messages within Communication Channel to manipulate Victim Agent, contrasting with Malicious Agent and Adversarial Input attacks targeting individual agents.
  • AiTM leverages Reflection Mechanism in Agent-in-the-Middle to refine adversarial strategies based on intercepted Messages, highlighting critical security concerns in inter-agent communication within LLM Multi-Agent Systems.

Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation

  • CoSyn (Code Guided Synthetic data generation system): introduces a framework for generating text-rich multimodal data using topic generation, data generation, code generation, rendering tools, and instruction generation.
  • CoSyn leverages text-only LLMs to generate code for rendering synthetic images and textual instructions for vision-language model training.
  • The framework addresses the scarcity of diverse text-rich vision-language data for improving VLMs in understanding text-rich images.

Optimizing Model Selection for Compound AI Systems

  • LLMSELECTOR (LLMSELECTOR): introduces Input, Module Nominator, Model Updater, and Output components for efficient model selection in compound AI systems.
  • LLMSELECTOR iteratively nominates modules and uses module-wise performance estimation to allocate the best-performing model to each module.
  • This framework achieves high-quality model allocation for compound AI systems, outperforming single-LLM allocation strategies.

A Multi-Agent Perspective on Modern Information Retrieval

  • Multi-Agent Perspective on Modern Information Retrieval: introduces query agent, document agent, and ranker agent to analyze modern information retrieval through agent interactions.
  • This perspective addresses complexities arising from automated query and document generation impacting retrieval paradigms.
  • The framework emphasizes revisiting classical IR evaluation and modeling for effective multi-agent retrieval systems.

Tree-of-Debate: Multi-Persona Debate Trees Elicit Critical Thinking for Scientific Comparative Analysis

  • TOD (Tree-of-Debate): introduces a framework for comparative scientific paper analysis using paper personas, moderator-guided debate tree construction, self-deliberation, debate rounds, expansion determination, debate synthesis, retrieval embedding model and evidence pool.
  • TOD dynamically builds a debate tree to analyze novelty arguments by converting papers into debating personas and facilitating structured critical reasoning.
  • The framework employs iterative retrieval and multi-persona debates to generate fine-grained contrastive summaries of scientific literature, aiding researchers in literature review.

Multi-Agent Coordination across Diverse Applications: A Survey

  • Unified Framework: introduces iterative process for sequential decision-making in multi-agent coordination, consisting of Evaluate System-level Goal, Who to Coordinate with, and How to Coordinate components.
  • Unified Framework: addresses coordination by evaluating system performance, determining agent clusters based on interdependencies, and updating decisions using appropriate methodologies.
  • Unified Framework: provides a structured perspective on coordination, applicable across diverse multi-agent system applications by breaking down the coordination process into key decision points.

I-MCTS: Enhancing Agentic AutoML via Introspective Monte Carlo Tree Search

  • I-MCTS (Introspective Monte Carlo Tree Search): introduces agentic AutoML framework, incorporating I-MCTS search module, LLM agent experiment executor, introspective node expansion, and hybrid reward mechanism.
  • I-MCTS enhances search quality and efficiency by introspectively expanding nodes and adaptively blending LLM-estimated and empirical rewards.
  • The introspective node expansion leverages parent and sibling node analysis for continuous refinement, addressing limitations of scalar feedback and static search spaces in AutoML.

InstructAgent: Building User Controllable Recommender via LLM Agent

  • InstructAgent and Instruct² Agent: introduce user-agent-platform paradigm for recommendation, featuring Parser for instruction understanding, Reranker for recommendation adjustment, Self-reflection Mechanism for output verification, External Knowledge for external data access, Internal Knowledge for instruction based knowledge, Static Memory for historical user interactions, Dynamic Memory for adaptive user representation, Extractor for interest extraction and Profile Generator for profile creation.
  • InstructAgent employs static memory and instruction parsing for reranking recommendations, whereas Instruct² Agent enhances personalization through dynamic memory and profile learning from user feedback.
  • The framework aims to enhance user control in recommendation systems and mitigate issues like echo chambers and biases against less-active users by acting as a protective shield between users and platforms.

Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents

  • Vending-Bench Framework: introduces agent architecture with main agent, sub-agent, memory tools, context management, and task-specific tools for vending machine operation benchmark.
  • The framework uses main agent for decision making and sub-agent to interact with simulated vending machine environment.
  • Memory tools and context management address LLM's memory limitations for long-term coherence evaluation.

Plan-over-Graph: Towards Parallelable LLM Agent Schedule

  • Plan-over-Graph: introduces a novel paradigm for parallel LLM agent scheduling, incorporating RAGS, tree-based random graph generation, annotated data, goal definition, initial source specification, textual query generation, SFT, DPO, graph extraction, plan generation, and executable task schedule.
  • This framework decomposes textual tasks into graph structures, enabling parallel execution planning and enhancing efficiency for complex tasks.
  • The plan-over-graph approach addresses limitations in existing sequential planning methods by leveraging graph representations for improved scalability and performance in LLM agents.

CORBA: Contagious Recursive Blocking Attacks on Multi-Agent Systems Based on Large Language Models

  • CORBA (Contagious Recursive Blocking Attacks): introduces a novel attack paradigm against LLM-MAS (Large Language Model-based Multi-Agent System) by leveraging CORBA Prompt to initiate Attack Propagation across the Topology of Agents, ultimately leading to Blocking State and system unavailability.
  • CORBA exploits contagious and recursive properties to propagate blocking state through LLM-MAS network, causing resource depletion and availability degradation.
  • The attack's effectiveness is demonstrated across various LLM-MAS frameworks and topologies, highlighting security vulnerabilities in current multi-agent systems.

MLGYM: A New Framework and Benchmark for Advancing AI Research Agents

  • MLGYM (Meta MLGYM): introduces a framework for developing and evaluating LLM agents in AI research tasks, comprising Agent, Environment, and Computer components.
  • MLGYM framework utilizes a Gymnasium Environment to integrate diverse AI research tasks, enabling agent interaction through actions and feedback within a controlled setting.
  • The framework provides components like Tool Docs, Task Description, Prompts, Models for Agent; Tools, Data, Code, Requirements for Environment; and Shell, File System for Computer, facilitating comprehensive AI research agent evaluation.

Enhancing Language Multi-Agent Learning with Multi-Agent Credit Re-Assignment for Interactive Environment Generalization

  • CollabUIAgents: introduces multi-agent reinforcement learning framework, with Agents, Critic Agent, Adversarial Agent, Reward Matrix, Action Matrix, Preference Optimization, Actions Rolling Out, System Update, Group Initialization, Multi-Agent Reinforcement Learning, Agentic Fine-Tuning, Curriculum Learning, Data Collection, Base Model, Base UIAgent, Environment, Observation, Reward, and Action, for enhancing generalization in interactive environments.
  • CollabUIAgents framework employs novel credit re-assignment strategy using LLM-based critic and preference learning to foster collaborative behaviors and improve generalization.
  • The framework achieves state-of-the-art performance in mobile and web UI interaction tasks, demonstrating effectiveness of credit re-assignment and preference optimization for multi-agent learning.

FLOWAGENT: Achieving Compliance and Flexibility for Workflow Agents

  • FLOWAGENT (FLOWAGENT): introduces a novel agent framework for workflow management, incorporating PDL, Controllers, DAG of node dependency, API node, Answer node, OOW node, Pre-decision controllers, Post-decision controllers, Conversation history, User, Bot agent, System, Workflow, Output Action, and Output System response, to achieve both compliance and flexibility.
  • FLOWAGENT framework utilizes Procedure Description Language (PDL) to define workflows and employs controllers for managing agent behavior, dynamically balancing compliance and flexibility when handling user interactions and unexpected queries.
  • The framework architecture includes pre- and post-decision controllers that guide and validate agent actions based on PDL-defined workflows, ensuring both structured execution and responsiveness to dynamic interactions.

ChemHTS: Hierarchical Tool Stacking for Enhancing Chemical Agents

  • ChemHTS (Chemical Hierarchical Tool Stacking): introduces a method that optimizes tool invocation pathways through hierarchical stacking strategy.
  • ChemHTS comprises Self-Stacking Warmup (individual tool warmup) and Multi-Layer Optimization (hierarchical path optimization) stages, enabling dynamic refinement of tool usage.
  • This framework addresses limitations in tool-augmented Large Language Models by facilitating effective collaboration among diverse tools and minimizing tool invocation errors.

Beyond Self-Talk: A Communication-Centric Survey of LLM-Based Multi-Agent Systems

  • Communication-Centric Framework: introduces a communication-centric perspective on LLM-based multi-agent systems, with Communication Architecture, Communication Goal, Communication Strategy, Communication Paradigm, Communication Object, and Communication Content components, where the framework analyzes system-level and internal communication elements in LLM-MAS workflows.
  • Communication-Centric Framework decomposes LLM-MAS workflow based on communication, categorizing system-level aspects like agent organization and goals, and internal aspects like strategies and message handling.
  • Communication-Centric Framework provides a structured approach to understand and analyze the communication dynamics within LLM-MAS, offering insights into design and optimization for diverse applications.

STeCa: Step-level Trajectory Calibration for LLM Agent Learning

  • STeCa (Step-Level Trajectory Calibration): introduces a framework for LLM agent learning with Deviated Action Detection, MC Step Reward, Expert Sub-trajectory, Reflection, Reflective Thought, Calibration Trajectory Construction, Calibrated Trajectory, Expert Trajectory, Successful Data, Reinforced Training, and Calibration Data, to enable step-level trajectory calibration for mitigating suboptimal actions.
  • STeCa framework utilizes step-level reward comparison and LLM-driven reflection to construct calibrated trajectories from explored trajectories with detected deviations, which are then used with successful trajectories for reinforced training.
  • The framework aims to improve LLM agent's decision-making in long-horizon tasks by addressing early-stage deviations through timely calibration, enhancing robustness and reducing error accumulation.

MEM2EGO: EMPOWERING VISION-LANGUAGE MODELS WITH GLOBAL-TO-EGO MEMORY FOR LONG-HORIZON EMBODIED NAVIGATION

  • MEM2EGO (Memory-to-Egocentric): introduces a VLM-based navigation framework, integrating Observation, Memory Mapping, Memory Augmented Observation, Landmark Memory Update, and Metric Map Memory, for enhanced embodied agent navigation.
  • MEM2EGO framework adaptively retrieves task-relevant cues from global memory, encompassing Frontier Map, Landmark Semantic Memory, and Visitation Memory, and dynamically aligns global context with local perception for improved spatial reasoning.
  • MEM2EGO enhances agent's navigation in complex environments by maintaining three distinct memory types and projecting cues onto egocentric images to guide goal location prediction and decision-making.

Enhancing Conversational Agents with Theory of Mind: Aligning Beliefs, Desires, and Intentions for Human-Like Interaction

  • LatentQA interpretability pipeline (LatentQA): introduces Target Model (analyzed language model) to process Dialogue (input conversation text) and answer ToM Question (query about mental states) using Decoder Model (extracts ToM information) to produce ToM Answer (inferred mental state) with Feedback (gradient for training-steering) for generating Aligned Response (ToM-steered output) instead of Generated Answer (model's initial output), involving ToM Inference (inferring mental states), ToM Feedback (feedback on ToM inference), Steering Inference (ToM-based model steering), and Steering Feedback (feedback on steering).
  • LatentQA pipeline employs Decoder Model to extract Theory of Mind (ToM) related information from Target Model's internal representations based on Dialogue and ToM Questions, utilizing Feedback mechanisms for both ToM inference and model steering to achieve improved response alignment.
  • The framework aims to enhance conversational agents by incorporating Theory of Mind (ToM) principles, leveraging LatentQA to interpret and manipulate model's latent representations for generating more human-like and aligned responses through explicit consideration of beliefs, desires, and intentions.

19th February 2025

Investigating Non-Transitivity in LLM-as-a-Judge

  • SWIM (Swiss-Wise Iterative Matchmaking): introduces User Instruction, Response of A, Response of B, Response of C, Judge Evaluation, Round Robin Tournament, Bradley Terry, Elo Score, and SWIM Tournament components for evaluating LLMs by addressing non-transitivity in pairwise comparisons using efficient tournament approach.
  • SWIM framework employs round-robin tournaments and Bradley-Terry model to produce reliable model rankings, mitigating sensitivity to baseline choice in LLM evaluation.
  • SWIM tournament enhances computational efficiency of round-robin evaluations while maintaining robustness and alignment with human evaluations by dynamic model matching.

Autellix: An Efficient Serving Engine for LLM Agents as General Programs

  • Autellix: introduces an efficient serving engine for LLM agents, incorporating Process Table (tracks program metadata), Load Balancer (distributes LLM calls), LLM Engine (processes LLM calls) with Scheduler (schedules LLM calls), Priority Function (determines call priority), Memory Manager (manages engine memory), KV Cache (stores key-value pairs), and Model Executor (executes LLM model).
  • Autellix leverages program-level statistics and discretized priority queues to minimize head-of-line blocking and improve throughput for agentic programs with dynamic execution workflows.
  • The system employs a stateful API and data locality-aware load balancing to enhance KV-cache reuse and reduce latency in multi-engine LLM serving environments.

RAG-Gym: Optimizing Reasoning and Search Agents with Process Supervision

  • RAG-Gym (Retrieval-Augmented Generation Gymnasium): introduces unified framework optimizing agentic RAG through process supervision with inner and outer Markov Decision Processes.
  • RAG-Gym: formulates knowledge-intensive question answering as nested Markov Decision Process, incorporating diverse agent architectures and process supervision methods.
  • RAG-Gym: enhances information-seeking agents by fine-grained process supervision at each search step, utilizing process reward data for optimization.

Qwen2.5-VL Technical Report

  • Qwen2.5-VL (Qwen2.5 Vision-Language): introduces vision-language framework integrating vision encoder, vision-language merger and language model decoder for processing multimodal inputs like images and videos.
  • Qwen2.5-VL framework's vision encoder utilizes native resolution input, dynamic FPS sampling, MROPE, window attention and full attention to process visual data efficiently before merging with text embeddings.
  • This architecture enables Qwen2.5-VL to achieve advancements in visual recognition, document parsing and long-video comprehension, while maintaining computational efficiency through window attention and dynamic resolution processing.

Exploring Personalized Health Support through Data-Driven, Theory-Guided LLMs: A Case Study in Sleep Health

  • HEALTHGURU: introduces multi-agent framework for personalized health support, integrating behavior change technique theory, wearable data, context data, activity recommendation model, user message, agent coordinators, data insight agent, recommendation agent, response agent, and chat history.
  • HEALTHGURU: is LLM-powered chatbot providing data-driven theory-guided sleep health support using contextual multi-armed bandit model for adaptive recommendations.
  • HEALTHGURU: enhances user engagement motivation for behavior change through personalized context-aware recommendations delivered via natural conversation.

DataSciBench: An LLM Agent Benchmark for Data Science

  • DataSciBench (DSB): introduces a benchmark for data science LLM evaluation, with Prompt Definition and Collection-component, Response Integration and Validation-component, and LLM evaluation-component, utilizing Task-Function-Code framework for assessment.
  • DataSciBench framework employs Directed Acyclic Graph to manage task dependencies and Programmatic Rules for consistent code evaluation, ensuring comprehensive LLM performance analysis in data science tasks.
  • DataSciBench benchmark includes Aggregate Functions and Test Cases with Ground Truth to provide detailed and reliable evaluation metrics for diverse data science challenges, addressing limitations of existing benchmarks.

Enhancing Cross-Domain Recommendations with Memory-Optimized LLM-Based User Agents

  • AgentCF++ (Agent Collaborative Filtering Plus Plus): introduces user- and item-agents with domain-separated-, domain-fused-, group-shared-, and item-memories, and interest groups to enhance cross-domain recommendations by refining user behavior simulation.
  • AgentCF++ employs dual-layer memory architecture with domain-separated and domain-fused memories and interest groups with group-shared memory to capture popularity influence and domain-specific preferences.
  • The framework utilizes a two-step fusion mechanism to integrate cross-domain knowledge and reflection mechanism for memory updates, improving the accuracy of user behavior simulation in recommender systems.

From Correctness to Comprehension: AI Agents for Personalized Error Diagnosis in Education

  • MathCCS (Mathematical Classification and Constructive Suggestions) Benchmark: introduces multi-modal benchmark with real-world problems, student data, and expert annotations for error analysis and feedback.
  • MathCCS benchmark incorporates real-world problems, unique student IDs with timestamps, and expert-defined error categorization with suggestions.
  • MathCCS benchmark facilitates systematic error analysis and personalized feedback in AI-driven education by capturing real student learning complexities.

AI Software Engineer: Programming with Trust

  • Framework components: introduces key elements of LLM agents for software engineering, including LLMs as back-ends (computation engines), interaction with software tools (tool utilization), autonomy (agent independence), and guardrails (security and validation).
  • These components define the capabilities and trust mechanisms considered essential for deploying AI software engineers in practical software development workflows.
  • The paper argues for the importance of trust in AI-generated code and proposes agentic capabilities to enhance trustworthiness in automated programming.

An LLM-based Agent for Reliable Docker Environment Configuration

  • Repo2Run (LLM-based Agent for Reliable Docker Environment Configuration): introduces automated Docker environment configuration, with external environment, internal environment, Dockerfile generator, rollback mechanism, event stream, environment monitoring, dependency installation, code editing, test running, bash commands, dependency management, result processor, action-observation interaction, event history, finished commands, conflict list, and Dockerfile action.
  • Repo2Run utilizes dual-environment architecture for atomic configuration synthesis, ensuring reliable Dockerfile generation and preventing environment pollution through rollback.
  • Repo2Run's atomic configuration synthesis and Dockerfile generator address challenges in automated environment setup, achieving high success rate in configuring Python repositories.

STaR-SQL: Self-Taught Reasoner for Text-to-SQL

  • STaR-SQL (Self-Taught Reasoner for Text-to-SQL): introduces reasoning-driven approach for text-to-SQL, utilizing Question, Schema, Rationale Generation, Finetune, Scale up test-time compute, Outcome-supervised Reward Model, Test-time Verification, and Difficulty-based Resample components.
  • STaR-SQL framework employs rationale generation and outcome supervision to enhance text-to-SQL performance by iteratively refining rationales and verifying SQL query correctness.
  • The framework leverages increased test-time computation and difficulty-based resampling to improve accuracy and robustness for complex text-to-SQL tasks.

OpenSearch-SQL: Enhancing Text-to-SQL with Dynamic Few-shot and Consistency Alignment

  • OpenSearch-SQL: introduces a multi-agent framework for Text-to-SQL, incorporating Preprocessing, Extraction, Generation, Refinement, and Alignment Module with Agent Alignment, Function Alignment, Style Alignment, Correction, and Self-consistency & vote components.
  • This framework uses a consistency alignment mechanism to reduce hallucination and improve information flow between agents during the Text-to-SQL process, leveraging Vector Database and Few-shot examples.
  • The method achieves state-of-the-art performance by dynamically adjusting few-shot examples and employing a SQL-Like intermediate language within a structured Chain-of-Thought approach, enhancing both effectiveness and efficiency without fine-tuning.

Beyond Single-Value Metrics: Evaluating and Enhancing LLM Unlearning with Cognitive Diagnosis

  • UNCD (UNlearning evaluation using Cognitive Diagnosis): introduces UNCD, a framework for fine-grained LLM unlearning evaluation, with Unlearning Process, QA Eval, UNCD Eval, Base LLM, LLM+GA, LLM+NPO, Precise Diagnosis, Training-free Diagnosis, CDM, Knowledge States, Unlearn Set, Eval Set, Knowledge Concepts, Forget KC, Retain KC, Expert check, Question generation, Scoring, Processing, Raw data, and UNCD-Agent.
  • UNCD leverages Cognitive Diagnosis Modeling for detailed assessment of harmful knowledge removal and introduces UNCD-Cyber benchmark for cybersecurity domain.
  • UNCD-Agent enhances unlearning by diagnosing knowledge remnants and generating targeted unlearning data, improving removal of harmful LLM abilities.

MCTS-KBQA: Monte Carlo Tree Search for Knowledge Base Question Answering

  • MCTS-KBQA (Monte Carlo Tree Search for Knowledge Base Question Answering): introduces MCTS methodology to KBQA domain, enhancing LLM reasoning with selection, expansion, evaluation, backpropagation, and termination steps.
  • This framework uses LLM agent interacting with database environment, guided by step-wise reward mechanism and prompts, to perform knowledge base question answering.
  • MCTS-KBQA achieves improved performance over linear methods by exploring multiple reasoning paths and evaluating intermediate steps within the search tree.

18th February 2025

Towards an AI co-scientist

  • AI co-scientist: introduces a multi-agent system designed to augment scientific discovery by generating, debating, and evolving research hypotheses, utilizing Scientist inputs, Research plan configuration, Generation agent, Reflection agent, Ranking agent, Evolution agent, Proximity agent, Meta-review agent, Tool Use, Memory, and Supervisor agent components.
  • AI co-scientist employs a generate, debate, and evolve approach inspired by the scientific method, leveraging specialized agents for literature exploration, hypothesis review, ranking via tournaments, and iterative refinement, all orchestrated by a Supervisor agent and supported by Memory and Tool Use.
  • AI co-scientist framework facilitates flexible compute scaling and iterative improvement of hypothesis quality through a self-improving loop enabled by feedback from tournament-based ranking and meta-review, aiming to accelerate scientific discovery in biomedicine and beyond.

AIDE: AI-Driven Exploration in the Space of Code

  • AIDE (AI-Driven Exploration): introduces an agent for machine learning engineering, with Solution Tree, Coding Operator, Evaluator, Search Policy, and Summarization Operator, automating trial-and-error via tree search in code space.
  • AIDE employs a tree structure to organize historical solutions and uses a coding operator to propose improvements based on tree nodes, guided by automated evaluations.
  • By strategically reusing and refining solutions within its framework, AIDE trades computational resources for enhanced performance on machine learning engineering benchmarks.

Facilitating Long Context Understanding via Supervised Chain-of-Thought Reasoning

  • PAI (Property-driven Agentic Inference): introduces three-stage framework with property extraction, retrieval, and summarization agents for generating reasoning-augmented answers in long-context question answering.
  • PAI framework simulates human-like reasoning by decomposing queries, retrieving relevant information, and synthesizing conclusions to facilitate long-context understanding.
  • PAI framework enhances long-context question answering by incorporating chain-of-thought reasoning and improving model performance on complex tasks.

TEXT2WORLD: Benchmarking Large Language Models for Symbolic World Model Generation

  • TEXT2WORLD: introduces benchmark for evaluating LLMs in symbolic world model generation, with Automatic Generation, Automatic Correction, Syntax Parser, World Model, Executor, and Multi-criteria Evaluation components.
  • TEXT2WORLD benchmark employs PDDL and execution-based metrics to address limitations in prior world model evaluations, emphasizing domain diversity and evaluation robustness.
  • TEXT2WORLD enables detailed analysis of LLM world modeling performance via component-wise F1 scores and error analysis, aiming to foster advancements within the field.

LLM TRADING: ANALYSIS OF LLM AGENT BEHAVIOR IN EXPERIMENTAL ASSET MARKETS

  • LLM Trading (Large Language Model Trading): introduces experimental framework with agent, order submission, price forecasting, memory, market, and environment components for analyzing LLM behavior in asset markets.
  • This framework investigates LLM agents' trading strategies and market dynamics in simulated financial markets, comparing their behavior to human participants.
  • The study focuses on evaluating LLMs' rationality and ability to replicate human-driven market phenomena like bubbles and crashes within controlled experimental settings.

Training Turn-by-Turn Verifiers for Dialogue Tutoring Agents: The Curious Case of LLMs as Your Coding Tutors

  • TRAVER (Trace-and-Verify): introduces agent workflow with knowledge tracing for student state estimation, utterance generation for tutor messages, and verifier for response quality assessment.
  • TRAVER leverages turn-by-turn verification and knowledge tracing to guide students in coding tasks through dialogue.
  • The framework aims to improve tutoring effectiveness by adapting guidance based on student knowledge and utterance quality.

Demonstrating specification gaming in reasoning models

  • ReAct-like harness: introduces observe, orient, decide, and act phases alongside memory, plan, and subgoal components for LLM agent to interact with environment.
  • The framework employs observe phase to process command outputs, orient phase to update strategic plan, decide phase to select tactical subgoal, and act phase to generate shell commands for task execution.
  • Memory, plan, and subgoal components maintain agent state, enabling iterative refinement of actions based on observed outcomes within the environment.

OCCULT: Evaluating Large Language Models for Offensive Cyber Operation Capabilities

  • OCCULT (Offensive Cyber Operation Lightweight operational evaluation framework): introduces a methodology to evaluate LLMs for Offensive Cyber Operations, structured around LLM Use Case, OCO Capability Areas, and Reasoning Power components.
  • OCCULT framework facilitates rigorous and repeatable evaluations to quantify cyber security risks associated with employing LLMs in offensive cyber operations.
  • OCCULT methodology aims to standardize LLM testing in OCO domain, enabling better comparisons across different models and evaluation approaches.

Grounding LLM Reasoning with Knowledge Graphs

  • Framework for Grounding LLM Reasoning with Knowledge Graphs: introduces agent and automatic graph exploration approaches for question answering using knowledge graphs, incorporating components like RetrieveNode, NeighborCheck, Entities and Triples.
  • Agent approach employs predefined actions such as RetrieveNode and NeighborCheck for targeted KG interaction, while automatic exploration utilizes extracted Entities and Triples to navigate the knowledge graph.
  • The framework evaluates Chain-of-Thought, Tree-of-Thought, and Graph-of-Thought reasoning strategies within both Agent and Automatic Graph Exploration approaches to enhance question answering performance on knowledge graphs.

Interactive Agents to Overcome Ambiguity in Software Engineering

  • OpenHands framework (OpenHands): introduces interactive environment for LLM Agent, enabling structured code refinement, task planning, and command execution using integrated tools within secure sandbox.
  • OpenHands framework: facilitates iterative code improvement through file editing, script execution, and error analysis within controlled environment.
  • OpenHands framework: leverages User Proxy to simulate realistic interactions, allowing agent to gather necessary context and improve performance in ambiguous software engineering tasks.

AEIA-MN: Evaluating the Robustness of Multimodal LLM-Powered Mobile Agents Against Active Environmental Injection Attacks

  • AEIA-MN (Active Environment Injection Attack - Mobile Notifications): introduces active environment injection attack scheme, with Perception Stage, Reasoning Stage, Action Stage, System, State and Action components, to evaluate MLLM-based agents robustness.
  • AEIA-MN leverages mobile notifications to perform attacks by disrupting agent decision-making through environmental manipulation.
  • The framework includes Adversarial Attack, Reasoning Gap Attack, and Combinatorial Attack strategies to comprehensively evaluate agent robustness against active injection attacks.

Towards a Design Guideline for RPA Evaluation: A Survey of Large Language Model-Based Role-Playing Agents

  • RPA evaluation design guideline: introduces agent attributes, agent-oriented metrics, task attributes, and task-oriented metrics for systematic RPA evaluation.
  • The guideline proposes a two-step process: first, decide agent-oriented metrics based on agent attributes, and second, decide task-oriented metrics based on task attributes.
  • This guideline aims to enhance the reliability and consistency of RPA evaluation by linking evaluation metrics to agent and task attributes.

You need to MIMIC to get FAME: Solving Meeting Transcript Scarcity with a Multi-Agent Conversations

  • MIMIC (Multi-agent IMItation of Conversations): introduces a multi-agent meeting synthesis framework that uses Knowledge Source, Content Brainstorming, Casting, Scriptwriting, Filming, Quality Assuring, Special Effects, and Editing to generate Meeting Transcript.
  • MIMIC framework employs pre-production, production, and post-production stages to orchestrate psychologically grounded agents debating turn-by-turn, refining outputs to ensure coherent and credible dialogues.
  • The modular architecture of MIMIC allows for scalable generation of meeting transcripts, addressing data scarcity for training and testing meeting summarization systems.

Flow-of-Options: Diversified and Improved LLM Reasoning by Thinking Through Options

  • FoO (Flow-of-Options): introduces an agentic framework for automated machine learning tasks, incorporating Input task, Planner, Option Generator, Flow-of-Options, Plan Executor, Update, CODE-RAW, Case-Based Reasoning, Case Bank, Retrieve and adapt, Walk Generation with Consistency Checker, Update Values, and Update Case Bank components.
  • This framework leverages a network data structure to systematically explore diverse reasoning paths by enumerating options at each step of a task plan, enhancing Large Language Model performance in solving complex problems.
  • The approach integrates case-based reasoning for long-term memory and solution reuse, improving efficiency and overcoming biases inherent in Large Language Models for automated machine learning workflows.

SEFL: Harnessing Large Language Model Agents to Improve Educational Feedback Systems

  • SEFL (Synthetic Educational Feedback Loops): introduces Agent Framework with Teacher (LLM creating assignments), Student (LLM completing assignments with errors), Fineweb-Edu (assignment text source), Synthetic Instruction-Tuning Data (generated interaction data), fine-tuned LLM (feedback model), and Output Evaluation (performance measurement process) for improving educational feedback systems.
  • SEFL framework leverages two LLMs in Teacher and Student roles to simulate formative feedback workflows and generate synthetic data for fine-tuning smaller feedback LLMs.
  • This synthetic data generation and fine-tuning pipeline enables scalable and effective educational feedback systems, addressing real-world data scarcity challenges.

Towards more Contextual Agents: An extractor-Generator Optimization Framework

  • Extractor-Generator Framework: introduces a two-stage approach with feature extraction and prompt generation to optimize prompts for contextual LLM-based agents using input-output dataset, feature extraction, prompt component generation and performance evaluation.
  • The framework extracts contextual features from gold-standard input-output pairs and generates prompt components iteratively refining them through self-improvement techniques and performance evaluation.
  • This automated optimization process enhances the adaptability and reliability of LLM agents in context-specific tasks by improving generalization and reducing error propagation.

Knapsack Optimization-based Schema Linking for LLM-based Text-to-SQL Generation

  • KaSLA (Knapsack optimization-based Schema Linking Agent): introduces a plug-in schema linking agent, with Hierarchical Linking Strategy, Table linking, Column linking, Knapsack optimization-based schema linking, Binary scoring function, Probabilistic scoring function, Relevance score, Redundancy score, Redundancy tolerance, Linking, Dynamic Programming, Training dataset, and Query, designed to prevent missing relevant schema elements and minimize redundant ones.
  • KaSLA employs hierarchical linking strategy, initially linking tables and subsequently columns, utilizing knapsack optimization with binary-probabilistic scoring functions and dynamic programming to select relevant schema elements under redundancy tolerance.
  • The framework enhances text-to-SQL models by replacing schema linking processes, improving SQL generation accuracy through optimized schema linking and reduced missing or redundant information.

Fraud-R1 : A Multi-Round Benchmark for Assessing the Robustness of LLM Against Augmented Fraud and Phishing Inducements

  • Fraud-R1: introduces a multi-round evaluation framework with Helpful Assistant, Role-play, LLM Judge, and Defense Status Judgement components.
  • Fraud-R1 assesses LLM robustness against fraud using Defense Success, Defense Failure, and Need More Information statuses.
  • Fraud-R1 framework evaluates LLMs in different settings to identify challenges in fraud defense.

Continuous Learning Conversational AI: A Personalized Agent Framework via A2C Reinforcement Learning

  • CLCA (Continuous Learning Conversational AI): presents an A2C reinforcement learning framework for personalized conversational agents, integrating synthetic data generation, RL environment design, A2C agent training, and A2C-guided response selection.
  • CLCA framework employs a simulated RL environment with state space representing dialogue context, action space controlling dialogue metrics, and reward function guiding A2C agent to learn personalized dialogue strategies.
  • This A2C-driven CLCA method advances beyond static LLMs by enabling continuous learning and personalization through synthetic data and RL, creating dynamically adaptive AI companions.

Benchmarking Automatic Speech Recognition coupled LLM Modules for Medical Diagnostics

  • Framework name here: introduces a two-stage system for medical diagnosis from speech, with audio preprocessing, speech recognition, and LLM-based diagnosis classification, utilizing medical speech database.
  • The system employs audio preprocessing with denoising and equalization to enhance audio quality before ASR and uses LLM for context-aware medical diagnosis from transcribed speech.
  • The framework is designed to improve robustness in noisy medical call recordings and leverage LLMs for accurate medical diagnosis from patient speech.

Towards Adaptive Feedback with AI: Comparing the Feedback Quality of LLMs and Teachers on Experimentation Protocols

  • LLM feedback agent (Large Language Model feedback agent): presents a system generating student feedback on experiment protocols, utilizing Feed Up, Feed Back, Feed Forward feedback types, assessed by Constructive Tone, Linguistic Clarity, Technical Terminology criteria.
  • The study compares LLM feedback against teacher and expert feedback, revealing similar overall quality yet LLM agent's limitations in Feed Back error identification.
  • Findings indicate LLMs' capability for efficient educational feedback, underscoring the necessity for enhanced contextual understanding in error-specific feedback generation.

An LLM-Powered Agent for Physiological Data Analysis: A Case Study on PPG-based Heart Rate Estimation

  • OpenCHA Framework: introduces an LLM-powered agent for physiological data analysis, with Interface, Orchestrator, External Sources, Response Generator, Task Planner, Task Executor, Data Pipe, PPG Processing Pipeline, AI and Analysis Models, Wearable PPG Data and User Data Sources components, aiming to integrate LLMs with analytical tools for health insights.
  • The framework utilizes an orchestrator to coordinate user interaction, data retrieval, and analytical processing, leveraging external sources for data and AI models to generate accurate health assessments.
  • The agent's architecture is designed for modularity and adaptability, enabling integration of various data sources and analytical tools for diverse physiological data analysis tasks.

R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs

  • R2-KG (General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs): introduces dual-agent framework with Operator, Supervisor, KG, Iteration limit, Feedback, Question, Answer, Abstention for reliable KG reasoning by separating evidence gathering and judgment roles.
  • It employs Operator (low-capacity LLM evidence gatherer) and Supervisor (high-capacity LLM judgment maker) to enhance cost-efficiency and reliability.
  • R2-KG incorporates Abstention mechanism to avoid answering when evidence is insufficient, improving trustworthiness.

Multi-Novelty:Improve the Diversity and Novelty of Contents Generated by Large Language Models via inference-time Multi-Views Brainstorming

  • Multi-Novelty: introduces inference-time multi-view brainstorming method with Input Prompt, Multi-view Embedding, LLMs, Generated Answers and DNC Framework components to enhance diversity and novelty of generated contents.
  • Multi-view Embedding component incorporates Text views and Image views to enrich input prompts by generating diverse perspectives from textual and visual sources, which are then processed by LLMs to produce varied responses.
  • DNC Framework evaluates Generated Answers using diversity, novelty, and correctness metrics, demonstrating the effectiveness of Multi-Novelty in improving LLM outputs without architectural changes and across different models.

Perovskite-LLM: Knowledge-Enhanced Large Language Models for Perovskite Solar Cell Research

  • Perovskite-LLM (Knowledge-Enhanced Large Language Models for Perovskite Solar Cell Research): introduces Perovskite-KG, a domain-specific knowledge graph, constructed via Document Filtering, Knowledge Extracting, and Knowledge Graph Organization, alongside a Multi-agent framework with Information Extraction Agent, Quality Validation Agent, and Document Summarizer Agent, utilizing DeepSeek R1 and OpenAI 01 LLMs to generate Instruction Tuning and Reasoning Dataset.
  • Perovskite-KG organizes knowledge from research papers into a structured graph, while the multi-agent framework creates datasets for instruction tuning specialized Large Language Models.
  • The system aims to enhance research efficiency in perovskite solar cell domain by providing tools for knowledge retrieval, literature review, and complex problem-solving.

One Size doesn't Fit All: A Personalized Conversational Tutoring Agent for Mathematics Instruction

  • PACE (PersonAlized Conversational tutoring agEnt): introduces a personalized tutoring framework for mathematics instruction, incorporating Simulating Learning Styles (models student learning style), Conceptualize Teaching Strategy (designs teaching approach), Socratic-style Conversation (implements teaching dialogue), Persona Pool (collection of student profiles), Interaction (tutor-student communication), Multi-aspect Criteria (quality assessment metrics), and Evaluation Approaches (assessment methodologies).
  • PACE framework personalizes learning by simulating student learning styles from personas, conceptualizing tailored teaching strategies, and employing Socratic dialogue for enhanced engagement and critical thinking.
  • The framework utilizes multi-aspect criteria and dual evaluation approaches, including reference-based and LLM-based methods, to comprehensively assess the personalized tutoring performance.

Automating Prompt Leakage Attacks on Large Language Models Using Agentic Approach

  • AG2 (AutoGen): introduces agentic framework for automating prompt leakage attacks, utilizing Initial Analysis Agent, Judge Agent, and Tested Agent within GroupChat for evaluating LLM security.
  • This framework employs specialized agents to probe and exploit target LLMs, assessing prompt leakage by comparing responses from original and sanitized prompts.
  • The agentic approach provides a systematic methodology for adversarial testing, bridging automated threat modeling and practical LLM security evaluation.

DemonAgent: Dynamically Encrypted Multi-Backdoor Implantation Attack on LLM-based Agent

  • DemonAgent (Dynamically Encrypted Multi-Backdoor Implantation Attack): introduces dynamically encrypted multi-backdoor implantation attack, with Dynamic Encryption Mechanism, Multi-Backdoor Tiered Implantation (MBTI), and AgentBackdoorEval dataset components.
  • DemonAgent decomposes backdoor code into Sub-backdoor fragments, uses Anchor Tokens and Attack Matrix for stealth, and employs Encryption Table for secure storage within Agent's workflow.
  • DemonAgent leverages Encryptor, Decoder, Assembler, Executor, and Retriever components to manage encrypted backdoor fragments and activate attack through Cumulative Triggering, effectively bypassing safety audits.

A Cognitive Writing Perspective for Constrained Long-Form Text Generation

  • CogWriter: introduces a novel training-free framework with Planning Agent for hierarchical task decomposition, Generation Agents for parallel segment generation, and Monitor Functions including Global Plan Reviewing, Local Plan Reviewing and Length Reviewing for continuous quality control.
  • CogWriter framework employs Planning Agent to create structured plans and Generation Agents to execute these plans, utilizing Global Plan Reviewing and Local Plan Reviewing for iterative refinement and Length Reviewing for output length adjustment.
  • CogWriter framework aims to bridge gap between human cognitive writing processes and current LLMs for complex constrained long-form text generation, enhancing instruction completion and generation length.

UXAgent: An LLM-Agent-Based Usability Testing Framework for Web Design

  • UXAGENT (LLM-Agent-Based Usability Testing Framework) introduces a system with Persona Generator, LLM Agent, Universal Browser Connector, and Result Viewer, utilizing Chrome browser, Action Trace, Memory Trace, Video Recording, Final Outcome, Chat Interface, Fast Loop with Perception-, Planning-, and Action-Modules, Slow Loop with Wonder- and Reflect-Modules, and Memory Stream to simulate usability testing for web design.
  • UXAGENT employs Persona Generator to create diverse user demographics, LLM Agent with Fast and Slow Loops for web interaction and reasoning, Universal Browser Connector for website interaction, and Result Viewer to present collected user behavior data to UX researchers.
  • The framework facilitates iterative UX study design by providing simulated user behavior data including action traces, memory logs, video recordings, and chat interfaces, enabling researchers to evaluate and refine usability testing before real human-subject studies.

CityEQA: A Hierarchical LLM Agent on Embodied Question Answering Benchmark in City Space

  • PMA (Planner-Manager-Actor): introduces hierarchical framework with Planner, Manager, and Actor modules for embodied question answering in city environments.
  • PMA framework incorporates Planner for task parsing, Manager with Memory and Map for process control and spatial reasoning, and Actor with Navigator, Explorer, and Collector for action generation.
  • PMA agent utilizes cognitive map and hierarchical structure to achieve long-horizon planning and efficient task execution in complex urban spaces.

Policy-to-Language: Train LLMs to Explain Decisions with Flow-Matching Generated Rewards

  • Policy-to-Language Framework: introduces a model-agnostic explanation generator using Explanation LLM, Guidance LLM and Reward Generation by Flow Matching for training with PPO Training, taking Input context and producing Output reasoning, verified against True action.
  • The framework employs Reward Generation by Flow Matching with components like Projection, Rectified Flow Network, Linear Layer, Embedding, First L-1 Layers, Gaussian Noise, PE(t), Zt, and Cross-Attention Layer to provide effective rewards for training the Explanation LLM.
  • This approach aims to generate dense and effective rewards, reducing reliance on human feedback and improving the quality and accuracy of explanations for agent decisions in both RL and LLM tasks.

Simulating Cooperative Prosocial Behavior with Multi-Agent LLMs: Evidence and Mechanisms for AI Agents to Inform Policy Decisions

  • Multi-Agent Architecture: introduces a class structure for social emergent behavior simulation, encompassing World (simulation environment), Locations (places for agents), Events (agent actions and observations), Agents (LLM instances representing people), Plans (agent intentions and goals), and Memories (agent past experiences).
  • The framework uses World class to define simulation space with Locations for agent interaction and Events to record agent actions, while Agents, as LLM instances, possess Plans to guide behavior and Memories to inform actions based on past events.
  • This architecture facilitates emergent behaviors by enabling agents to reason about surroundings, create plans, react to events, and communicate, offering a structured approach to simulate complex social interactions.

EDGE: Efficient Data Selection for LLM Agents via Guideline Effectiveness

  • EDGE (Efficient Data selection for LLM Agents via Guideline Effectiveness): introduces a data selection framework for LLM agents, with Unlabeled Data Pool, Initial Guideline, GE Metric, GE Score Calculation, Lowest GE Score Data Selection, Guideline Update, Updated Guideline, High-quality SFT Data Generation, Fine-tuning open-source LLM, and Guideline-based prompt engineering components.
  • EDGE framework uses Guideline Effectiveness metric to identify informative samples from unlabeled data by measuring the impact of human guidelines in multi-turn interaction tasks.
  • Selecting low GE score samples allows for efficient prompt engineering and fine-tuning by focusing on data where guidelines are less effective, thus more informative.

Boost, Disentangle, and Customize: A Robust System2-to-System1 Pipeline for Code Generation

  • BDC Framework (Boost, Disentangle, and Customize Framework): introduces System2-to-System1 pipeline for code generation, with System 2 Knowledge Exploration, Composable System 1 Experts Preparation, and Customized Solver Generation components.
  • BDC Framework addresses complex reasoning and data heterogeneity using MC-Tree-Of-Agents with mutual boosting, disentangling data for LoRA experts, and input-aware hypernetwork for customization.
  • The framework utilizes multiple LLMs for verification, Monte-Carlo Tree Search with pruning, and DisenLoRA for adaptive generation of customized problem solvers.

EPO: Explicit Policy Optimization for Strategic Reasoning in LLMs via Reinforcement Learning

  • EPO (Explicit Policy Optimization): introduces strategic reasoning model (LLMs) with policy, optimize, multi-turn RL, PRM (Process Reward Model), history, interaction, agent/human/env, LLM agent, observation, strategy, and reward components for goal-directed behavior in dynamic environments.
  • EPO framework utilizes multi-turn RL with process rewards and iterative self-play to train the strategic reasoning model, enhancing adaptability and policy transferability without supervised fine-tuning.
  • The strategic reasoning model in EPO integrates with LLM agents, enabling long-term goal achievement through enhanced strategic reasoning in interactive scenarios.

Investigating and Extending Homans' Social Exchange Theory with Large Language Model based Agents

  • Agent Framework: introduces LLM-based agent framework with BDI, Affinity, REI, SVO, Negotiation and Exchange components to study Homans' Social Exchange Theory.
  • The framework simulates a multi-agent society where agents negotiate and exchange resources based on designed components.
  • This approach provides a novel method to investigate social science theories using LLM-based agents, bridging social science and computer science.

17th February 2025

A-MEM: Agentic Memory for LLM Agents

  • A-MEM (Agentic Memory) introduces agentic memory system for LLM agents with Note Construction, Link Generation, Memory Evolution, Memory Retrieval, and Memory components.
  • A-MEM enables dynamic memory structuring and autonomous memory management inspired by Zettelkasten method for long-term agent interactions.
  • A-MEM facilitates creation of interconnected knowledge networks and evolution of memories, enhancing LLM agents' long-term interaction capabilities.

ARMAP: SCALING AUTONOMOUS AGENTS VIA AUTOMATIC REWARD MODELING AND PLANNING

  • ARMAP (autonomous Agents from automatic Reward Modeling And Planning): introduces a framework that enhances LLM agents' decision-making by using Automatic Reward Model (evaluates trajectory quality) to guide Default Policy Model (generates initial action plans) in Tree Planning (search algorithm for actions) using Trajectories (sequences of agent actions) and Reward (score for trajectory success).
  • Leverages Automatically Generated Dataset (training data for reward model) from Sampled Trajectories (trajectories from environment), Refine Task Instructions (improved task goals), and Sample Negative Trajectories (unsuccessful action paths) to train Reward Model (evaluates trajectory success) without human annotations.
  • Improves agent performance across tasks by integrating learned Reward Model (evaluates trajectory success) with various planning algorithms, addressing limitations of data scarcity and API accessibility for complex interactive environments.

HARBOR: Exploring Persona Dynamics in Multi-Agent Competition

  • HARBOR (Housing Auction for Reasoning, Bidding, and Opponent Recognition): introduces a testbed to study persona dynamics in multi-agent auctions, incorporating Persona, Bidding Domain Knowledge, Auction History Memory, Priority Planning, Profiling Competitors, and Theory of Mind Strategy.
  • HARBOR simulates realistic house bidding scenarios to analyze how personas influence agent behavior, competitor profiling, and strategic decision-making in competitive environments.
  • This framework enables the evaluation of LLM agents' profitability, competitive standing, and persona-driven objective achievement in multi-agent competitive settings.

SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?

  • SWE-Lancer: introduces benchmark evaluating language models on real-world software engineering tasks using Original Issue, Codebase, Large Language Model, Generated PR, Human End-to-End Tests, Grader, and Scoring for individual contributions and Original Issue, Proposals, Large Language Model, Rejected Proposals, Chosen Proposal, Comparison, and Scoring for management decisions.
  • SWE-Lancer benchmark assesses model's ability to solve freelance software engineering tasks by generating code patches or selecting optimal proposals, evaluated through end-to-end tests and comparison with human decisions.
  • SWE-Lancer framework provides realistic software engineering evaluation by utilizing real-world tasks, payouts, and full-stack complexities, moving beyond isolated unit tests to comprehensive end-to-end assessments.

Learning Getting-Up Policies for Real-World Humanoid Robots

  • HUMANUP: introduces a two-stage RL framework with Discovery Policy (Stage I motion exploration) and Deployable Policy (Stage II robust motion tracking) for humanoid robots getting-up.
  • HUMANUP employs a Curriculum (Progressive training strategy) including Collision Mesh Curriculum (Mesh complexity progression), Posture Randomization Curriculum (Initial pose variation), and Control Regularization Curriculum (Regularization strength progression) to enhance learning.
  • Stage II Deployable Policy utilizes Tracking Rewards (Stage II imitation reward) to refine discovered motions for real-world deployment.

Can LLMs Simulate Social Media Engagement? A Study on Action-Guided Response Generation

  • Action-Guided Response Generation Framework: introduces a method to simulate social media engagement using Trending Post, User Information Historical Records, Action, and Generated Response components.
  • Action-Guided Response Generation Framework predicts user engagement Action (retweet, quote, rewrite) towards Trending Post, then generates Generated Response based on predicted Action and User Information Historical Records.
  • Action-Guided Response Generation Framework aims to capture user engagement dynamics for informed response generation in social media simulations.

CAMEL: Continuous Action Masking Enabled by Large Language Models for Reinforcement Learning

  • CAMEL (Continuous Action Masking Enabled by Large Language Models): introduces reinforcement learning framework integrating LLM Policy Generator, Action Masking, Actor, Critic, Replay Buffer and Epsilon Masking to enhance exploration and convergence by using LLM-generated policies and dynamic action constraints.
  • CAMEL leverages Action Masking to dynamically constrain action space based on LLM outputs and Epsilon Masking to reduce reliance on LLM guidance over time, enabling autonomous policy refinement.
  • The framework demonstrates improved sample efficiency and performance in MuJoCo environments by effectively utilizing LLM-generated priors for initial policy guidance and exploration.

Leveraging Dual Process Theory in Language Agent Framework for Real-time Simultaneous Human-AI Collaboration

  • DPT-Agent (Dual Process Theory Agent): introduces a language agent framework integrating System 1 with Finite-State Machine, Code as Policy, Action Executor and System 2 with Theory of Mind, Asynchronous Reflection, Belief, Guide, alongside General Introduction and Information History Buffer for real-time human-AI collaboration.
  • DPT-Agent leverages Dual Process Theory, employing System 1 for rapid responses and System 2 for deliberate reasoning, to achieve autonomous and simultaneous human-AI collaboration.
  • The framework utilizes Finite-State Machine and code-as-policy in System 1 for fast decision-making, and Theory of Mind with asynchronous reflection in System 2 to infer human intentions and improve autonomous decisions.

Hypothesis-Driven Theory-of-Mind Reasoning for Large Language Models

  • Thought-tracing: introduces inference-time reasoning algorithm, with Parse Trajectory, Perception Inference, Hypothesis Inference, Initialize Hypotheses, Update Weights, Resample Hypotheses, Rejuvenate Hypotheses, Propagate Hypotheses, designed to trace agent mental states.
  • Thought-tracing algorithm, inspired by Bayesian theory-of-mind and sequential Monte Carlo, uses LLMs to generate and weight natural language hypotheses about agent beliefs based on perceptions and actions.
  • Thought-tracing improves performance on theory-of-mind benchmarks by providing intermediate reasoning steps, contrasting with math/coding focused reasoning models.

Can LLM Agents Maintain a Persona in Discourse?

  • Agent-based evaluation framework: introduces a methodology to evaluate personality maintenance in dyadic conversations using Participant A/B, Assign, Personality Traits, Topic of Conversation, Pairwise Conversation, and Judge Agent components.
  • This framework employs System Prompt and User Prompt to guide LLM agents in conversations and JSON output for structured evaluation by Judge Agent, including Predicted_bfi and Correct? metrics.
  • The framework aims to assess personality consistency and alignment by Extract, Analyze, and Plot actions performed by Judge Agents on conversation data to evaluate personality adherence.

Table-Critic: A Multi-Agent Framework for Collaborative Criticism and Refinement in Table Reasoning

  • Table-Critic: introduces multi-agent framework with Judge, Critic, Refiner, Curator and Self-evolving Template Tree, for collaborative criticism and iterative refinement in table reasoning tasks.
  • Table-Critic framework employs Judge to identify errors, Critic to provide critiques, Refiner to correct reasoning, and Curator with self-evolving template tree to accumulate critique knowledge for improved future performance.
  • Self-evolving template tree in Table-Critic dynamically accumulates critique patterns from experience-driven learning, enabling system to handle diverse error types and improve reasoning quality over time.

Personality Editing for Language Models through Relevant Knowledge Editing

  • PALETTE (Persona Adjustment by LLM Self-Targeted Trait Control via Relevant Knowledge Editing): introduces personality control method through knowledge editing, with MBTI questionnaire, adjustment query construction, LLM (original), generate model response, extract self-referential statement, extract opposite trait word, layer, association at layer 1, optimize v by object, edited LLM, and specific trait-focused response generation components.
  • PALETTE leverages MBTI-inspired adjustment queries and rank-one model editing to modify LLM's internal representations for personality traits.
  • This approach enables controlled shifts in personality, addressing inherent biases and improving consistency in LLM responses.

Plant in Cupboard, Orange on Table, Book on Shelf Benchmarking Practical Reasoning and Situation Modelling in a Text-Simulated Situated Environment

  • AdventureGame: introduces text-based environment, with Agent (executes game actions), Environment (simulated text-based world), Interpreter (processes agent commands), Parser (command grammar definition), State Change Module (updates game world state), World State (current game facts representation), Observation Feedback (textual game responses), Goal (task objective for agent), Action Space (set of valid commands), and Memory (int