Autonomous Agents (LLMs) research papers. Updated Daily.

tmgthb/Autonomous-Agents

Autonomous Agents

Autonomous agents research papers, updated daily. See also the Resources section.


Research papers

Reverse chronological order (newest first).

27th of January 2025

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

  • Janus-Pro: Advances multimodal models via optimized training, expanded data, and model scaling. Janus-Pro achieves SOTA-level performance in both multimodal understanding and text-to-image generation benchmarks.
  • Enhanced training strategy includes "Longer Training in Stage I" and "Focused Training in Stage II" for better efficiency and performance. This refines the original 3-stage training process of Janus.
  • Text-to-image generation stability and aesthetic quality are significantly enhanced through synthetic data and improved training.
  • Decoupled visual encoding remains a core and effective architectural design for unified multimodal tasks.
  • 7B model demonstrates strong scalability of the decoupled visual encoding approach.

26th of January 2025

Qwen2.5-1M Technical Report

  • Introduces Qwen2.5-1M, which extends open source support for 1M token context length.
  • Includes an inference framework that speeds up 1M-context inference by 3.2x to 6.7x.

24th of January 2025

MedAgentBench: Dataset for Benchmarking LLMs as Agents in Medical Applications

  • MedAgentBench: is a broad evaluation suite designed to assess the agent capabilities of large language models within medical records contexts.
  • It encompasses 100 patient-specific clinically-derived tasks, realistic profiles of 100 patients with over 700,000 data elements, a FHIR-compliant interactive environment, and an accompanying codebase.
  • This framework establishes a valuable benchmark for model developers to track progress and drive continuous improvements in the agent capabilities of large language models within the medical domain.

DEEPFLOW: Serverless Large Language Model Serving at Scale

  • DEEPFLOW: is a serverless AI platform designed for efficient large language model serving at scale.
  • It uses request-job-task model, FLOWSERVE serving engine, NPU-centric execution, SPMD-based parallelism, and novel scheduling policies.
  • This framework addresses resource allocation, serving efficiency, and cold start latencies.

DRESSING UP LLM: EFFICIENT STYLIZED QUESTION-ANSWERING VIA STYLE SUBSPACE EDITING

  • DRESS (Disentangling Representation Editing in Style Subspace): is a novel approach for generating stylized large language model (LLM) responses through representation editing.
  • It leverages over-parameterized nature of LLMs, disentangles style-relevant subspace, applies adaptive editing strengths, and maintains stylistic fidelity and semantic integrity.
  • DRESS is a lightweight, train-free solution for enhancing LLMs with flexible and effective style control, making it useful for developing stylized conversational agents.

Exploring the sustainable scaling of AI dilemma: A projective study of corporations' AI environmental impacts

  • The proposed methodology: estimates the environmental impact of a company's AI portfolio, providing actionable insights without extensive AI and Life-Cycle Assessment (LCA) expertise.
  • The framework includes four interconnected models: life cycle impacts of primary components, life cycle impacts of AI use cases, AI company portfolio model, and 2030 AI landscape projections.
  • This framework empowers organizations to understand and project their AI impacts and align their initiatives with global sustainability goals.

MASTER: A Multi-Agent System with LLM Specialized MCTS

  • MASTER (Multi-Agent System with Tactical Execution and Reasoning using LLM Specialized MCTS): is a novel multi-agent framework that employs a new agent recruitment process and communication protocol based on the MCTS algorithm.
  • It autonomously adjusts the number of agents based on task complexity, mitigates distractions and token window shortage, and includes a modified MCTS tailored to LLM scenarios.
  • This framework achieves state-of-the-art performance on HotpotQA and WebShop datasets.
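
For orientation, the selection step of textbook MCTS can be sketched as below. This is the standard UCT rule, not the modified variant that MASTER proposes; the function names, the child names, and the exploration constant `c` are illustrative assumptions.

```python
import math
from typing import Dict, Tuple


def uct_score(total_value: float, visits: int, parent_visits: int, c: float = 1.414) -> float:
    """Standard UCT: mean value plus an exploration bonus."""
    if visits == 0:
        return math.inf  # unvisited children are always tried first
    return total_value / visits + c * math.sqrt(math.log(parent_visits) / visits)


def select_child(children: Dict[str, Tuple[float, int]], c: float = 1.414) -> str:
    """Return the child with the highest UCT score.

    `children` maps a child name to (total_value, visits).
    """
    parent_visits = max(1, sum(visits for _, visits in children.values()))
    return max(children, key=lambda name: uct_score(*children[name], parent_visits, c))


# Hypothetical children of one search node.
print(select_child({"search": (3.0, 4), "click": (1.0, 1), "ask": (0.0, 0)}))
```

A rarely visited child gets a large bonus, which is what keeps the search from committing too early to one branch.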

Top Ten Challenges Towards Agentic Neural Graph Databases

  • Agentic NGDB (Agentic Neural Graph Databases): extends NGDBs with autonomous query construction, neural query execution, and continuous learning.
  • It identifies ten key challenges, including semantic unit representation, abductive reasoning, scalable query execution, and integration with foundation models like LLMs.
  • This framework enables intelligent, self-improving systems for modern data-driven applications.

Serving Long-Context LLMs at the Mobile Edge: Test-Time Reinforcement Learning-based Model Caching and Inference Offloading

  • T2DRL (Test-Time Deep Reinforcement Learning): is a joint model caching and inference offloading framework that optimizes deployment and execution strategies for long-context LLM serving.
  • Framework analyzes performance convergence, designs optimization problem considering context windows, manages cached models and service requests, adapts to context changes, and uses double Dutch auction mechanism for resource allocation.
  • The framework reduces system costs while guaranteeing the performance of LLM agents in real-world perception and reasoning tasks.

Distributed Multi-Agent Coordination Using Multi-Modal Foundation Models

  • VL-DCOPs (visual-linguistic instruction-based DCOPs): is a framework that uses large multimodal foundation models to generate constraints from visual and linguistic instructions.
  • Framework includes spectrum of agent archetypes, from neuro-symbolic to fully neural agents, and evaluates them using LLMs and VLMs on novel VL-DCOP tasks.
  • This work extends the DCOP literature by addressing the challenge of manual problem construction and opens new research directions.

AI Chatbots as Professional Service Agents: Developing a Professional Identity

  • LAPI (LLM-based Agent with a Professional Identity): is a novel framework for designing professional service agents tailored for medical question-and-answer services.
  • LAPI includes theory-guided task planning process, pragmatic entropy method, and iterative updating of responses.
  • This framework improves response quality, providing more accurate, empathetic, and professional answers compared to baseline approaches.

ARGOS: Agentic Time-Series Anomaly Detection with Autonomous Rule Generation via Large Language Models

  • ARGOS: is an agentic system for detecting time-series anomalies in cloud infrastructure by leveraging large language models (LLMs).
  • It uses explainable anomaly rules as intermediate representation, employs LLMs to autonomously generate rules, and includes detection-, repair- and review-agents.
  • This framework improves anomaly detection accuracy and efficiency compared to state-of-the-art methods.

23rd of January 2025

EICopilot: Search and Explore Enterprise Information over Large-scale Knowledge Graphs with LLM-driven Agents

  • EICopilot: is a novel agent-based solution enhancing search and exploration of enterprise registration data within extensive online knowledge graphs.
  • EICopilot includes data pre-processing pipeline, comprehensive reasoning pipeline with Chain-of-Thought and In-context learning, and novel query masking strategy.
  • EICopilot is a groundbreaking tool for exploration and exploitation of large-scale knowledge graphs for enterprise information search.

The thought process behind Kimi k1.5

  • Explains how the Kimi k1.5 model was trained and discusses the likely training procedure of the o1 model.

Operator System Card

  • OpenAI Operator agent system card.
  • Uses RL.
  • Additional details

21st of January 2025

LLM-Agents Driven Automated Simulation Testing and Analysis of small Uncrewed Aerial Systems

  • AUTOSIMTEST: is a Large Language Model (LLM)-driven framework, where multiple LLM agents collaborate to support the sUAS simulation testing process.
  • Framework includes scenario generation-, mission-, environment- and analytics-agents; uses RAG approach; provides interactive analysis interface.
  • Framework improves efficiency and scope of sUAS testing process, allowing for more comprehensive and varied scenario evaluations while reducing manual effort.

EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents

  • EMBODIEDEVAL: is a comprehensive and interactive evaluation benchmark for MLLMs with embodied tasks.
  • EMBODIEDEVAL features 328 distinct tasks within 125 varied 3D scenes, covers navigation, object interaction, social interaction, attribute question answering, and spatial question answering.
  • This framework provides insights for future development of MLLMs in embodied capabilities.

20th of January 2025

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

  • DeepSeek-R1: Trains a SOTA-level large reasoning model from an LLM via reinforcement learning; it matches the performance of the o1 model.

Kimi-K1.5: Scaling Reinforcement Learning with LLMs

  • Kimi k1.5: is a multi-modal large language model (LLM) trained with reinforcement learning (RL) to achieve SOTA-level reasoning performance across multiple benchmarks and modalities.

Conversation Routines: A Prompt Engineering Framework for Task-Oriented Dialog Systems

  • Conversation Routines (CR): is a structured prompt engineering framework for developing task-oriented dialog systems using Large Language Models (LLMs).
  • CR enables development of Conversation Agentic Systems (CAS) through natural language specifications, embedding task-oriented logic within LLM prompts, providing systematic methodology for designing complex conversational workflows while maintaining behavioral consistency.
  • This framework enables domain experts to design conversational workflows in natural language while leveraging custom enterprise functionalities.

Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training

  • Agent-R: is an iterative self-training framework that enables language agents to reflect on the fly.
  • It leverages Monte Carlo Tree Search (MCTS) to construct training samples, recovers correct trajectories from erroneous ones, and uses a model-guided critique construction mechanism for timely revision.
  • This framework effectively equips agents to identify and correct erroneous actions while avoiding loops, achieving superior performance.

Towards Advancing Code Generation with Large Language Models: A Research Roadmap

  • Six-layer vision framework: categorizes code generation process into Input, Orchestration, Development, and Validation phases.
  • Framework includes analysis of existing studies, outlines vision workflow, and systematically analyses challenges faced by LLMs.
  • This work provides guidelines for improving reliability, robustness and usability of LLM-based code generation systems.

Large Language Model Agents for Radio Map Generation and Wireless Network Planning

  • LLM agent framework: automates radio map generation and wireless network planning tasks.
  • Framework includes tools-, models- and profiles-modules; it uses short-term and long-term memory; it performs task planning.
  • The framework reduces manual operations and enhances network coverage and signal-to-interference-noise ratio.

Code Readability in the Age of Large Language Models: An Industrial Case Study from Atlassian

  • HULA (Human-in-the-loop software development agents framework): is an LLM-based framework for software development.
  • The framework uses GPT-4, compares LLM-generated code with human-written code, and evaluates code readability using static analysis metrics.
  • This study highlights the importance of code readability in the age of LLMs and shows that LLM-generated code can be comparable to human-written code.

PlotEdit: Natural Language-Driven Accessible Chart Editing in PDFs via Multimodal LLM Agents

  • PlotEdit: is a multi-agent framework for natural language-driven end-to-end chart image editing via self-reflective LLM agents.
  • Framework includes Chart2Table, Chart2Vision, Chart2Code, Instruction Decomposition and Multimodal Editing agents; uses multimodal feedback to maintain visual fidelity; outperforms existing baselines on ChartCraft dataset.
  • It enhances accessibility for visually challenged users and improves novice productivity.

19th of January 2025

IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems

  • IntellAgent: is a scalable, open-source multi-agent framework designed to evaluate conversational AI systems.
  • It automates synthetic benchmark creation using policy-driven graph modeling, realistic event generation, and interactive user-agent simulations, providing fine-grained diagnostics.
  • This framework enables comprehensive evaluation of conversational AI by addressing limitations of traditional methods.

GREEN-CODE: Optimizing Energy Efficiency in Large Language Models for Code Generation

  • GREEN-CODE: is a framework for energy-aware code generation in LLMs, performing dynamic early exit during inference.
  • It uses Reinforcement Learning agent to balance accuracy, latency, and energy consumption trade-offs, and fine-tunes models with weighted aggregated loss.
  • This framework reduces energy consumption significantly without affecting accuracy for code generation tasks.

Open FinLLM Leaderboard: Towards Financial AI Readiness

  • Open FinLLM Leaderboard: is an open platform for assessing and comparing Large Language Models' performance on financial tasks.
  • The framework includes a leaderboard, demos, and financial AI readiness components; it uses zero-shot evaluation, and provides side-by-side model comparisons.
  • This framework is important for encouraging innovation and improving model effectiveness in the financial sector.

Learn-by-interact: A Data-Centric Framework for Self-Adaptive Agents in Realistic Environments

  • LEARN-BY-INTERACT: is a data-centric framework to adapt LLM agents to any given environment without human annotations.
  • LEARN-BY-INTERACT synthesizes agent-environment interactions based on documentation, constructs instructions by summarizing interaction histories, and uses innovative retrieval approaches optimized for agents.
  • This framework serves as a foundation for agent data synthesis as LLMs are increasingly deployed in real-world environments.

18th of January 2025

BAP v2: An Enhanced Task Framework for Instruction Following in Minecraft Dialogues

  • BAP v2 (Builder Action Prediction v2): is an upgraded task framework for instruction following in Minecraft dialogues.
  • BAP v2 includes enhanced evaluation benchmark with cleaner test set and fairer metrics, and additional synthetic training data generated from novel Minecraft dialogue and target structure simulators.
  • BAP v2 enables more efficient and meaningful progress on the task of instruction following in Minecraft dialogues.

ML-SceGen: A Multi-level Scenario Generation Framework

  • ML-SceGen: is a three-stage framework for generating comprehensive and critical scenarios in autonomous driving.
  • It uses LLM agents for parsing, Answer Set Programming (ASP) solver for logical traffic generation, and LLM for parameter updates to increase criticality.
  • This framework enhances controllability, scalability, and realism in scenario generation for autonomous driving systems.

17th of January 2025

Evolving Deeper LLM Thinking

  • Mind Evolution: is an evolutionary search strategy that uses a language model to generate, recombine and refine candidate responses.
  • It avoids formalizing the inference problem, so it is usable for planning in natural language without an explicit problem formalization, and also for hiding an encoded message inside a poem, which is not a purely natural-language task. It relies on a global solution evaluator, so it targets domains where an evaluator is available, and it can be easily parallelized.
  • This approach significantly outperforms other inference strategies in natural language planning tasks.
  • Introduces the new StegPoet benchmark, where the task is to encode a hidden message inside an essay or story.
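
The generate, recombine and refine loop with a global evaluator can be sketched as a small evolutionary search. This is a minimal sketch under loud assumptions: in the paper those steps are performed by a language model, whereas here `toy_propose`, `toy_recombine`, `toy_refine` and `toy_evaluate` are hypothetical string-based stand-ins, and the selection scheme (top half as parents, one elite kept) is an illustrative choice rather than the paper's.

```python
import random
from typing import Callable, List, TypeVar

T = TypeVar("T")


def evolve(
    propose: Callable[[random.Random], T],
    recombine: Callable[[T, T, random.Random], T],
    refine: Callable[[T, random.Random], T],
    evaluate: Callable[[T], float],
    population_size: int = 8,
    generations: int = 20,
    seed: int = 0,
) -> T:
    """Evolutionary search driven only by a global evaluator."""
    rng = random.Random(seed)
    population: List[T] = [propose(rng) for _ in range(population_size)]
    for _ in range(generations):
        # Scoring each candidate is independent, so it could run in parallel.
        ranked = sorted(population, key=evaluate, reverse=True)
        parents = ranked[: max(2, population_size // 2)]
        children = []
        while len(children) < population_size - 1:
            a, b = rng.sample(parents, 2)
            children.append(refine(recombine(a, b, rng), rng))
        population = [ranked[0]] + children  # keep the best candidate (elitism)
    return max(population, key=evaluate)


# Toy stand-ins: candidates are strings, the evaluator counts matching letters.
TARGET = "PLAN"
ALPHABET = "ALNP"


def toy_propose(rng: random.Random) -> str:
    return "".join(rng.choice(ALPHABET) for _ in TARGET)


def toy_recombine(a: str, b: str, rng: random.Random) -> str:
    cut = rng.randrange(1, len(TARGET))
    return a[:cut] + b[cut:]


def toy_refine(candidate: str, rng: random.Random) -> str:
    i = rng.randrange(len(candidate))
    return candidate[:i] + rng.choice(ALPHABET) + candidate[i + 1:]


def toy_evaluate(candidate: str) -> float:
    return sum(x == y for x, y in zip(candidate, TARGET))


best = evolve(toy_propose, toy_recombine, toy_refine, toy_evaluate)
print(best, toy_evaluate(best))
```

Because the loop only ever calls `evaluate` on whole candidates, the search needs no formal model of the problem, which is the property the summary above highlights.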

Agent4Edu: Generating Learner Response Data by Generative Agents for Intelligent Education Systems

  • Agent4Edu: is a personalized learning simulator that uses LLM-powered generative agents to simulate human learners' response data.
  • It includes learner profile, memory, and action modules; interacts with personalized learning environments; evaluates and improves intelligent tutoring algorithms.
  • This framework provides a versatile platform for comprehensive evaluations and future collection of valuable learner response data.

Towards Human-Guided, Data-Centric LLM Co-Pilots

  • CliMB-DC (Clinical predictive Model Builder with Data-Centric AI): is a human-guided, data-centric framework for LLM co-pilots.
  • It includes a multi-agent reasoning system with a strategic coordinator and a specialized worker agent, integrates state-of-the-art data-centric tools, and uses a human-in-the-loop approach.
  • This framework empowers domain experts to actively participate in driving real-world impact using ML.

Towards Preventing Overreliance on Task-Oriented Conversational AI Through Accountability Modeling

  • Accountability Model: is an augmented LLM with an additional accountability head, functioning as a binary classifier to predict dialogue state slots.
  • It detects false positives and negatives, guides LLM decoder for accurate actions, enables self-correction, and introduces friction to prevent overreliance.
  • This model improves joint goal accuracy and overall performance in task-oriented dialogue systems.

PaSa: An LLM Agent for Comprehensive Academic Paper Search

  • PaSa: is an advanced paper search agent powered by large language models. Available at https://pasa-agent.ai/
  • It autonomously makes decisions, including invoking search tools, reading papers, and selecting references; it is optimized using reinforcement learning with synthetic dataset; it outperforms existing baselines on real-world academic queries.
  • This framework significantly improves the efficiency and accuracy of academic search.

LLM Reasoner and Automated Planner: A new NPC approach

  • LLM Reasoner and Automated Planner: is a novel architecture that integrates an LLM for decision-making with a classical automated planner.
  • Framework uses LLM to decide goal, then uses automated planning to create plan, and includes modules for reasoning, planning and interface.
  • This framework aims to empower autonomous agents with flexibility to adapt to any situation while maintaining plausible and human-like behavior.

A Survey on LLM Test-Time Compute via Search: Tasks, LLM Profiling, Search Algorithms, and Relevant Frameworks

  • This survey provides a comprehensive technical review that unifies task definitions and provides modular definitions of LLM profiling and search procedures.
  • It enables precise comparisons of various LLM inference frameworks, highlights their departures from conventional search algorithms, and discusses applicability, performance, and efficiency.
  • This survey offers a collection of classical and reusable implementations that can serve as solid foundations for future research and development.

Agent-as-Judge for Factual Summarization of Long Narratives

  • NARRATIVEFACTSCORE: is a novel "Agent-as-a-Judge" framework for evaluating and refining summaries.
  • It leverages Character Knowledge Graph (CKG), assesses factual consistency, provides actionable guidance for refinement, identifies missing or erroneous facts, and uses retrieval-based verification with explicit feedback.
  • This framework improves the factual reliability of LLM-generated summaries.

A Survey on Multi-Turn Interaction Capabilities of Large Language Models

  • This survey provides a focused review of the multi-turn capabilities of LLMs.
  • The survey explores core model capabilities, evaluation methods, enhancement algorithms, and future research directions.
  • This survey is important for both academic researchers and industry practitioners.

TOWARDS A LITMUS TEST FOR COMMON SENSE

  • Axiomatic litmus test: diagnoses common sense by combining minimal prior knowledge constraints with diagonal arguments to create tasks beyond the agent's known concept set.
  • It addresses deceptive hallucinations, integrates observations regarding emergent deceptive hallucinations, and uses Abstraction and Reasoning Corpus (ARC) constraints.
  • This test provides a stepping stone toward an ethical, reliable foundation for future safe, beneficial and aligned artificial intelligence.

16th of January 2025

Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps

  • Inference-time scaling framework: explores the inference-time scaling behavior of diffusion models beyond increasing denoising steps.
  • Framework treats inference as a search problem for better noise candidates; the design space covers verifiers and search algorithms; experiments run on class-conditioned and text-conditioned image generation benchmarks.
  • This framework reveals that increasing inference-time compute leads to substantial improvements in the quality of samples generated by diffusion models.

Foundations of Large Language Models

  • Introduces a literature review / survey on LLMs.

AutoCBT: An Autonomous Multi-agent Framework for Cognitive Behavioral Therapy in Psychological Counseling

  • AutoCBT: is an autonomous multi-agent framework for Cognitive Behavioral Therapy in psychological counseling.
  • AutoCBT incorporates a counsellor agent and multiple supervisor agents, uses short-term and long-term memory, and is evaluated on a bilingual dataset.
  • AutoCBT leverages dynamic routing and supervisory mechanisms to offer high-quality, automated CBT services, enhancing the effectiveness of single-turn consultations.

OmniThink: Expanding Knowledge Boundaries in Machine Writing through Thinking

  • OmniThink: is a machine writing framework that emulates human-like iterative expansion and reflection.
  • It uses continuous reflection and exploration, attaches knowledge to an information tree, and extracts it into a conceptual pool to deepen understanding.
  • This framework improves the knowledge density of generated articles without compromising coherence and depth.

CyberMentor: AI Powered Learning Tool Platform to Address Diverse Student Needs in Cybersecurity Education

  • CyberMentor: is a learning tool platform designed to address diverse needs of cybersecurity students using agentic workflow and Generative Large Language Models (LLMs).
  • It leverages Retrieval-Augmented Generation (RAG) for accurate information retrieval, includes knowledge base, skill base and LLM agent, and provides personalized learning experiences.
  • This framework aims to improve equity and sustainability in higher education by offering open-source design for adaptation across disciplines.

Empowering Large Language Models in Wireless Communication: A Novel Dataset and Fine-Tuning Framework

  • PVI (Pointwise V-Information) based fine-tuning method: enhances LLMs for wireless communication by quantifying information content of training data.
  • Dataset includes multi-hop questions, true/false and multiple-choice types, varying difficulty levels, rigorous data curation, advanced language models for entity extraction and question generation.
  • This work aims to improve LLM training and evaluation for wireless communication research and applications.
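
As a reminder of what pointwise V-information measures, the usual definition compares the log-probability a model assigns to the correct answer with and without access to the input. The sketch below assumes both probabilities are already available; the function and parameter names are illustrative, and how the paper plugs PVI into its fine-tuning pipeline is not shown here.

```python
import math


def pvi(p_label_given_input: float, p_label_given_null: float) -> float:
    """Pointwise V-information in bits.

    p_label_given_input: probability of the gold label from a model that sees the input.
    p_label_given_null:  probability of the gold label from a model given an empty input.
    """
    return math.log2(p_label_given_input) - math.log2(p_label_given_null)


# The input doubles the probability of the gold label, i.e. it is worth one bit.
print(pvi(0.5, 0.25))  # 1.0
```

A negative value means the input made the gold label less likely, which is typical of hard or mislabeled examples.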

SOP-AGENT: EMPOWER GENERAL PURPOSE AI AGENT WITH DOMAIN-SPECIFIC SOPS

  • SOP-agent (Standard Operational Procedure-guided Agent): is a novel framework for constructing domain-specific agents through pseudocode-style Standard Operational Procedures (SOPs) written in natural language.
  • SOP-agent represents SOP as a decision graph, traverses it to guide the agent, conducts experiments across multiple domains, and introduces Grounded Customer Service Benchmark.
  • SOP-agent demonstrates excellent versatility, achieving performance superior to general-purpose agent frameworks and comparable to domain-specific agent systems.
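
The idea of representing an SOP as a decision graph and traversing it can be sketched as follows. This is a minimal sketch, not the paper's implementation: the `SopNode` class, the `run_sop` function, the `observe` callback (standing in for the agent or LLM that inspects the environment) and the refund procedure are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class SopNode:
    action: str
    # Maps an observed outcome to the name of the next node; empty for a leaf.
    branches: Dict[str, str] = field(default_factory=dict)


def run_sop(
    graph: Dict[str, SopNode],
    start: str,
    observe: Callable[[str], str],
    max_steps: int = 20,
) -> List[str]:
    """Walk the decision graph and return the actions taken, in order."""
    trace: List[str] = []
    current: Optional[str] = start
    for _ in range(max_steps):
        if current is None:
            break  # the observed outcome had no matching branch
        node = graph[current]
        trace.append(node.action)
        if not node.branches:
            break  # leaf node: the procedure is finished
        current = node.branches.get(observe(node.action))
    return trace


# Hypothetical customer-service procedure.
REFUND_SOP = {
    "check": SopNode("look up order status", {"delivered": "refund", "in_transit": "track"}),
    "refund": SopNode("offer refund"),
    "track": SopNode("share tracking link"),
}

print(run_sop(REFUND_SOP, "check", lambda action: "delivered"))
```

Keeping the branching logic in data rather than in the prompt is what lets a domain expert change the procedure without touching the agent code.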

15th of January 2025

AGENTIC RETRIEVAL-AUGMENTED GENERATION: A SURVEY ON AGENTIC RAG

  • Introduces a survey covering a comprehensive list of RAG techniques with LLM agents.

Agent TCP/IP: An Agent-to-Agent Transaction System

  • ATCP/IP (Agent Transaction Control Protocol for Intellectual Property): introduces a trustless framework for exchanging IP between agents via programmable contracts.
  • Framework enables agents to initiate, trade, borrow, and sell agent-to-agent contracts on the Story blockchain network, including legal wrappers for offchain enforcement, and facilitates autonomous selling of training data, licensing of information, and content collaboration.
  • This framework is important for creating a standardized way for agents to negotiate and enter into agreements, forming a market for knowledge.

Leveraging Large Language Models as Knowledge-Driven Agents for Reliable Retrosynthesis Planning

  • MBRPS (Multi-branched Reaction Pathway Search): Algorithm enabling exploration of all pathways, with a focus on multi-branched ones.
  • Framework integrates LLMs and KGs, automates literature retrieval, reaction data extraction, database querying, and construction of retrosynthetic pathway trees, and recommends optimal routes.
  • Attempt to develop a fully automated retrosynthesis planning agent tailored specially for macromolecules powered by LLMs.

AutoRestTest: A Tool for Automated REST API Testing Using LLMs and MARL

  • AutoRestTest: is a novel tool that integrates Semantic Operation Dependency Graph (SODG) with Multi-Agent Reinforcement Learning (MARL) and Large Language Models (LLMs) for effective REST API testing.
  • It uses five specialized agents for operation, parameter, value, dependency, and header identification, and employs LLMs for realistic input generation and a command-line interface for user interaction.
  • This framework provides a comprehensive solution for thorough REST API evaluation and validation.

Leveraging LLM Agents for Translating Network Configurations

  • IRAG (Intent-based Retrieval Augmented Generation): is an intent-based framework for translating network configurations using LLM agents.
  • Framework includes intent extraction, manual retrieval, incremental translation, syntax verification and semantic verification modules.
  • This framework achieves high syntax correctness and superior translation accuracy compared to state-of-the-art methods.

DISENTANGLING EXPLORATION OF LARGE LANGUAGE MODELS BY OPTIMAL EXPLOITATION

  • Optimal Exploitation framework: isolates exploration as the sole objective by tasking the agent with delivering information that enhances future returns.
  • Framework decomposes missing rewards into exploration and exploitation components, measures optimal achievable return for explored states, and provides insights into behaviors driven by agent instructions.

Physical AI Agents: Integrating Cognitive Intelligence with Real-World Action

  • Physical AI Agents: is a framework that integrates cognitive reasoning with physical interaction for real-world tasks.
  • Framework includes modular architecture with perception, cognition, and actuation blocks, and introduces Ph-RAG (Physical Retrieval Augmented Generation) design pattern for real-time decision-making.

Doc-Guided Sent2Sent++: A Sent2Sent++ Agent with Doc-Guided memory for Document-level Machine Translation

  • Doc-Guided Sent2Sent++: is an agent that employs an incremental sentence-level forced decoding strategy for document-level machine translation.
  • It uses Doc-Guided Memory with summary and its translation, ensures sentence completeness, enhances fluency, and improves translation quality.
  • This approach addresses the limitations of other DocMT agents by maintaining both completeness and fluency.

Evaluating GenAI for Simplifying Texts for Education: Improving Accuracy and Consistency for Enhanced Readability

  • GenAI (Generative Artificial Intelligence): framework evaluates the use of LLMs for text simplification in educational contexts.
  • Framework uses three LLMs (GPT-4 Turbo, Claude 3, and Mixtral 8x22B), four prompting techniques (zero-shot, directional stimulus, chain-of-thought, and prompt chaining), and a novel multi-agent architecture; it assesses grade level accuracy, keyword accuracy, semantic similarity, and word count change.
  • This study provides a rigorous evaluation of LLMs for automated text simplification, offering insights for educators and future research.

14th of January 2025

Flow: A Modular Approach to Automated Agentic Workflow Generation

  • Flow: is a multi-agent framework that dynamically adjusts workflows using activity-on-vertex graphs.
  • It refines workflows based on historical performance, emphasizes modularity, and achieves concurrent sub-task execution.
  • This framework improves efficiency and adaptability in multi-agent systems through dynamic workflow updates.
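
In an activity-on-vertex graph each vertex is a sub-task and each edge is a dependency, so sub-tasks with no unmet dependencies can run concurrently. The sketch below groups tasks into such batches with the standard-library `graphlib` module (Python 3.9+); the task names and the `concurrent_batches` helper are illustrative assumptions, and the dynamic workflow refinement described above is not modeled.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def concurrent_batches(dependencies: Dict[str, Set[str]]) -> List[List[str]]:
    """Group tasks into batches; every task in a batch can run at the same time.

    `dependencies` maps a task to the set of tasks that must finish before it.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all tasks whose dependencies are done
        batches.append(ready)
        sorter.done(*ready)
    return batches


# Hypothetical workflow: coding and test writing both depend only on the design.
WORKFLOW = {
    "write_code": {"design"},
    "write_tests": {"design"},
    "integrate": {"write_code", "write_tests"},
}

print(concurrent_batches(WORKFLOW))
# [['design'], ['write_code', 'write_tests'], ['integrate']]
```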

POKERBENCH: Training Large Language Models to become Professional Poker Players

  • POKERBENCH: is a benchmark for evaluating poker-playing abilities of large language models (LLMs).
  • It includes 11,000 poker scenarios, covers pre-flop and post-flop play, and evaluates models like GPT-4, ChatGPT 3.5, Llama and Gemma series.
  • This benchmark provides a quick and reliable way to evaluate LLMs in complex game-playing scenarios.

A Multi-Agent Framework for Systematic Review Automation Using Large Language Models

  • LatteReview: Introduces an LLM-based multi-agent framework for automating systematic literature reviews, which consists of three layers: LM providers (local models / LLMs via API), Reviewer agents (with roles and expertise levels) and Workflows (supporting sequential and parallel review rounds, dynamic decision-making and iterative refinement).
  • Includes BaseReviewer/ScoringReviewer/TitleAbstractReviewer/AbstractionReviewer/Custom reviewer-agents, which are used as modular agents for title and abstract screening, relevance scoring, and structured data extraction; agents operate within orchestrated workflows.
  • Workflow module includes Concept of rounds / Chaining reviews / Parallel reviews and Dynamic filter.
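
The combination of review rounds, chained reviews and a dynamic filter can be illustrated in plain Python. This sketch does not use the LatteReview API: the reviewers are hypothetical callables returning a 1 to 5 relevance score, the disagreement threshold is an assumed parameter, and reviewers are called sequentially for simplicity where a real workflow could run them in parallel.

```python
from typing import Callable, Dict, List

Reviewer = Callable[[str], int]  # returns a relevance score from 1 to 5


def run_round(items: List[str], reviewers: Dict[str, Reviewer]) -> Dict[str, Dict[str, int]]:
    """Have every reviewer score every item."""
    return {item: {name: review(item) for name, review in reviewers.items()} for item in items}


def two_round_review(
    items: List[str],
    first_round: Dict[str, Reviewer],
    senior: Reviewer,
    disagreement: int = 2,
) -> Dict[str, float]:
    """Round A: several reviewers. Round B: a senior reviewer, only for disputed items."""
    round_a = run_round(items, first_round)
    # Dynamic filter: escalate only items whose first-round scores disagree.
    disputed = [
        item for item, scores in round_a.items()
        if max(scores.values()) - min(scores.values()) >= disagreement
    ]
    round_b = run_round(disputed, {"senior": senior})
    final: Dict[str, float] = {}
    for item, scores in round_a.items():
        if item in round_b:
            final[item] = round_b[item]["senior"]
        else:
            final[item] = sum(scores.values()) / len(scores)
    return final


# Hypothetical keyword-based reviewers standing in for LLM agents.
juniors = {
    "a": lambda title: 5 if "agent" in title else 1,
    "b": lambda title: 5 if "LLM" in title else 1,
}
titles = ["LLM agent survey", "agent for robots", "protein folding"]
print(two_round_review(titles, juniors, senior=lambda title: 4))
```

Only the second title reaches the senior reviewer, since it is the only one on which the first-round scores differ.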

CodeCoR: An LLM-Based Self-Reflective Multi-Agent Framework for Code Generation

  • CodeCoR (Code Collaboration and Repair): is a self-reflective multi-agent framework for code generation.
  • It includes prompt-, coding-, test- and repair-agents, uses pruning methods to evaluate agent effectiveness, and enhances self-reflective ability.
  • It significantly outperforms existing state-of-the-art methods in code generation.

Engineering LLM Powered Multi-agent Framework for Autonomous CloudOps

  • MOYA (Meta Orchestrator Of Your Agents): is a multi-agent framework leveraging GenAI for autonomous CloudOps, balancing automation with human control.
  • Framework integrates internal and external systems, optimizes task orchestration, security, and error mitigation using Retrieval Augmented Generation (RAG), and includes LLM-based and non-LLM-based agents.
  • The framework enhances accuracy, responsiveness, and effectiveness over non-agentic approaches across complex workflows.

Agent-Centric Projection of Prompting Techniques and Implications for Synthetic Training Data for Large Language Models

  • Agent-Centric Projection: introduces a framework to reveal connections between prompting strategies and multi-agent systems.
  • Framework uses linear and non-linear contexts to classify prompting techniques, and proposes three conjectures about the relationship between prompting and multi-agent systems.
  • This framework enables cross-pollination of research findings between prompting and multi-agent domains, while providing new directions for improving both the design and training of future LLM systems.

Talk to Right Specialists: Routing and Planning in Multi-agent System for Question Answering

  • RopMura: is a multi-agent system that incorporates a router and a planner for question answering across diverse knowledge domains.
  • RopMura includes router for selecting relevant agents, planner for decomposing complex queries, and knowledge sovereignty consideration.
  • This framework enables efficient and accurate multi-domain question-answering.

Infecting Generative AI With Viruses

  • VLM/LLM (Vision-Large Language Model): framework tests security boundaries by embedding EICAR test file within JPEG images.
  • Framework includes multiple LLM platforms, such as OpenAI GPT-4o, Microsoft Copilot, Google Gemini 1.5 Pro, and Anthropic Claude 3.5 Sonnet; it demonstrates masking EICAR string, extracting test file, and using obfuscation techniques.
  • This research extends penetration testing framework to evaluate cloud-based generative AI and LLM security boundaries.

Visual Language Models as Operator Agents in the Space Domain

  • Explores the application of VLMs as operator agents in the space domain.
  • Framework builds on LLMs and their multimodal extensions, investigates how VLMs enhance autonomous control and decision-making in space missions, includes software and hardware operational paradigms.
  • This research demonstrates that VLMs can effectively process visual and textual data to generate contextually appropriate actions.

ADAM-1: AI and Bioinformatics for Alzheimer's Detection and Microbiome-Clinical Data Integrations

  • ADAM-1 (Alzheimer's Disease Analysis Model Generation 1): is a multi-agent large language model framework designed to integrate and analyze multi-modal data.
  • Framework uses retrieval-augmented generation techniques, multi-agent architecture, synthesizes insights from diverse data sources, contextualizes findings using literature-driven evidence, and is tailored for binary classification tasks.
  • This framework demonstrates robustness and consistency, particularly in small laboratory datasets, and has potential for Alzheimer's research and diagnostics.

ADDRESSING THE SUSTAINABLE AI TRILEMMA: A CASE STUDY ON LLM AGENTS AND RAG

  • Sustainable AI Trilemma: highlights the tensions between AI capability, digital equity, and environmental sustainability.
  • Framework analyzes energy costs in memory module designs, introduces metrics for energy consumption and system performance trade-offs, challenges LLM-centric autonomy paradigm.
  • This framework provides practical insights for developing more sustainable AI systems.

ASTRID - An Automated and Scalable TRIaD for the Evaluation of RAG-based Clinical Question Answering Systems

  • ASTRID: is an Automated and Scalable TRIaD for evaluating clinical QA systems leveraging RAG.
  • ASTRID includes three metrics: Context Relevance (CR), Refusal Accuracy (RA), and Conversational Faithfulness (CF); it is validated using real-world patient questions and clinician assessments; it is automatable using LLMs.
  • ASTRID provides a valuable resource for further research and development of clinical QA systems.

CuAsmRL: Optimizing GPU SASS Schedules via Deep Reinforcement Learning

  • CuAsmRL: is an automatic optimizer for optimizing NVIDIA GPU SASS schedules using reinforcement learning.
  • It formulates SASS optimization as an assembly game, integrates with OpenAI Triton, and improves performance of specialized CUDA kernels by up to 26%.
  • This framework provides a way to automatically optimize GPU kernels, which is important for improving the performance of LLMs.

13th of January 2025

GPT as a Monte Carlo Language Tree: A Probabilistic Perspective

  • Reviews LLM as a Monte Carlo Language Tree (data tree), where each node is a token, each edge is a token transition probability and each sequence has a unique path.
  • Any GPT LLM can be flattened into MCLT.
  • Claims CoT attempts to find path between the input and output in the MCLT to connect them.
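
The tree view described above can be illustrated with a small sketch. It assumes a toy, pre-tokenized corpus and estimates edge probabilities from raw counts; the helper names `build_tree`, `transition_probs` and `sequence_prob` are hypothetical and not taken from the paper.

```python
from collections import defaultdict

def build_tree(sequences):
    """Count how often each token follows each prefix (one tree node per prefix)."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i, token in enumerate(seq):
            counts[tuple(seq[:i])][token] += 1
    return counts

def transition_probs(tree, prefix):
    """Edge weights leaving the node reached by `prefix` (empty dict if unseen)."""
    children = tree[tuple(prefix)]
    total = sum(children.values())
    return {tok: n / total for tok, n in children.items()}

def sequence_prob(tree, seq):
    """Probability of the single root-to-node path spelled out by `seq`."""
    prob = 1.0
    for i, token in enumerate(seq):
        prob *= transition_probs(tree, seq[:i]).get(token, 0.0)
    return prob

# Toy corpus (an assumption for illustration only).
corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog", "sat"]]
tree = build_tree(corpus)
print(transition_probs(tree, ["the"]))
print(sequence_prob(tree, ["the", "cat", "sat"]))
```

In this picture, a chain of intermediate tokens is simply a longer path through the same tree that links the input prefix to the output.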

WebWalker: Benchmarking LLMs in Web Traversal

  • WebWalker: is a multi-agent framework that mimics human-like web navigation through an explore-critic paradigm.
  • WebWalkerQA is a benchmark designed to assess the ability of LLMs to perform web traversal, it evaluates the capacity of LLMs to traverse a website's subpages to extract high-quality data systematically, and it focuses on text-based reasoning abilities.
  • This work highlights the importance of deep, vertical exploration in web-based tasks.

Imagine while Reasoning in Space: Multimodal Visualization-of-Thought

  • MVoT (Multimodal Visualization-of-Thought): is a multimodal native reasoning paradigm that generates image visualizations of reasoning traces.
  • MVoT uses token discrepancy loss to improve visual coherence and fidelity, and is validated on dynamic spatial reasoning tasks, showing competitive performance.
  • MVoT establishes new possibilities for complex reasoning tasks where visual thinking complements verbal reasoning.

Understanding and Benchmarking Artificial Intelligence: OpenAI's o3 Is Not AGI

  • Claims that ARC-AGI (Abstraction and Reasoning Corpus), a benchmark proposed to measure intelligence, is not suitable for measuring progress towards AGI.
  • ARC-AGI tasks represent a specific problem structure, which can be solved by massive trialling of predefined operations, and it does not require exploration, but only exploitation.
  • A new benchmark is outlined that covers a much higher diversity of unknown tasks to be solved, to enable a comprehensive assessment of intelligence and of progress towards AGI.

PoAct: Policy and Action Dual-Control Agent for Generalized Applications

  • PoAct (Policy and Action Dual-Control Agent): is a framework that dynamically adjusts action space and reasoning policy using a Policy Controller and Action Controller.
  • PoAct includes a Policy Controller for switching between reasoning policies, and an Action Controller with RAG Selector and Action Reviewer for managing action space and reasoning paths; it is evaluated on LegalAgentBench and AgentBench datasets.
  • PoAct achieves higher quality code actions and more accurate reasoning paths, while also reducing token consumption.

Lifelong Learning of Large Language Model based Agents: A Roadmap

  • Introduces a survey on incorporating lifelong learning into LLM-based agents.
  • Categorizes core components into perception-, memory-, and action-modules, highlights continuous adaptation, mitigates catastrophic forgetting, and improves long-term performance.

How GPT Learns Layer by Layer

  • Explores how LLMs build internal world models with OthelloGPT by using Sparse AutoEncoders.

SST-EM: Advanced Metrics for Evaluating Semantic, Spatial and Temporal Aspects in Video Editing

  • SST-EM (Semantic, Spatial, and Temporal Evaluation Metric): is a benchmark for video editing that leverages VLMs, object detection, and temporal consistency checks.
  • SST-EM includes semantic extraction using VLM, primary object tracking with object detection, focused object refinement via LLM agent, and temporal consistency assessment using ViT.
  • This framework provides a comprehensive evaluation of semantic fidelity and temporal smoothness in video editing.

PoAct: Policy and Action Dual-Control Agent for Generalized Applications

  • PoAct (Policy and Action Dual-Control Agent): is a framework that dynamically adjusts action space and reasoning policy by switching between different reasoning policies and managing action space.
  • PoAct includes Policy Controller for high-quality planning and coding, and Action Controller with RAG Selector and Action Reviewer for managing action space and reasoning paths; it is evaluated on multiple datasets with commercial and open-source large models.
  • PoAct achieves higher-quality code actions and more accurate reasoning paths, demonstrating strong generalizability and scalability.

12th of January 2025

DVM: Towards Controllable LLM Agents in Social Deduction Games

  • DVM (Dynamic Victory Manager): is a framework for controllable LLM agents in social deduction games, comprising Predictor, Decider, and Discussor components.
  • It uses reinforcement learning with a win rate-constrained decision chain reward mechanism, enabling agents to dynamically adjust their gameplay proficiency, and it is evaluated in the Werewolf game.
  • DVM enables adaptive and balanced gameplay in social deduction games, opening new research avenues for controllable game agents.

LLMs Model Non-WEIRD Populations: Experiments with Synthetic Cultural Agents

  • Synthetic Cultural Agents (SCAs): uses LLMs to create synthetic agents representing non-WEIRD populations. Includes web scraping, LLMs, and RAG prompting to construct cultural profiles, and uses these agents in classic behavioral experiments, demonstrating cross-cultural variability.
  • Offers an effective and ethical method to pilot experiments and refine protocols for hard-to-reach populations for cross-cultural economic studies.

11th of January 2025

The Internet of Large Language Models

  • The Internet of LLM: introduces a universal environment and sharing protocol for LLM training/knowledge exchange, which consists of LLM sharing protocol/LLM Universal environment/Agent Optimal Path Module/joint mining mechanism.
  • Includes also planning-, reflection- and tool use-agents.

Guided Code Generation with LLMs: A Multi-Agent Framework for Complex Code Tasks

  • Guided code generation: introduces a multi-agent framework for complex code tasks, which includes hierarchical decomposition, bottom-up code generation, and multi-agent validation.
  • Leverages LLMs as fuzzy searchers and information retrievers. Mitigates LLM weaknesses in long sequential reasoning and context understanding.
  • This framework enhances code generation capabilities and overcomes limitations of LLMs in compositional reasoning and context handling.

10th of January 2025

BioAgents: Democratizing Bioinformatics Analysis with Multi-Agent Systems

  • BioAgents: is a multi-agent system designed to assist users in bioinformatics pipeline design, development, and troubleshooting, which includes two specialized agents and a reasoning agent.
  • First specialized agent was fine tuned with conceptual genomics tasks and the second specialized agent uses RAG related to workflow documentation.
  • Reasoning agent uses self-ratings / threshold.
  • Achieves performance comparable to human experts on conceptual genomics tasks.

Multi-Agent Collaboration Mechanisms: A Survey of LLMs

  • The survey reviews Multi-Agent Systems (MASs) collaboration mechanisms based on key dimensions.
  • Framework includes actors, types, structures, strategies, and coordination protocols; reviews existing methodologies; investigates applications across diverse domains; identifies key lessons, open challenges, and potential research directions.

How to Enable Effective Cooperation Between Humans and NLP Models: A Survey of Principles, Formalizations, and Beyond

  • Human-Model Cooperation: is a survey of principles, formalizations, and open challenges in human-model cooperation.
  • It introduces a new taxonomy for categorizing human-model cooperation, identifies key research frontiers, and discusses associated challenges.

OpenFOAMGPT: a RAG-Augmented LLM Agent for OpenFOAM-Based Computational Fluid Dynamics

  • OpenFOAMGPT: LLM-based agent tailored for OpenFOAM-centric computational fluid dynamics (CFD) simulations.
  • It leverages GPT-4 and a chain-of-thought (CoT)-enabled o1 preview model, uses retrieval-augmented generation (RAG) pipeline, and includes an iterative correction loop.

9th of January 2025

Search-o1: Agentic Search-Enhanced Large Reasoning Models

  • Search-o1: is a framework that enhances Large Reasoning Models (LRMs) with an agentic retrieval-augmented generation mechanism and a Reason-in-Documents module.
  • It integrates an agentic search workflow, enables dynamic retrieval of external knowledge, and uses a separate module to analyze retrieved information.
  • This approach enhances the trustworthiness and applicability of LRMs in complex reasoning tasks.

OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis

  • OpenOmni: Introduces a three-stage training method combining speech-to-text generation/image-to-text generation/speech generation, which results in a SOTA-level omnimodal LLM.

Emergence of human-like polarization among large language model agents

  • Introduces a networked system, which simulates social interactions of thousands of LLM-based agents, including capabilities of establishing social relationships, communicating, and forming opinions on political issues. LLM agents spontaneously form human-like social networks (echo chambers).
  • LLM agents exhibit human-like polarization and can be used to study interventions, offering insights into managing polarization in real-world scenarios.
  • Self-regulation helps to reduce inconsistencies in opinions, which leads to more balanced polarization patterns. Open-mindedness and diverse interactions limit the polarization effect.

NSChat: A Chatbot System To Rule Them All

  • NSChat: introduces a web-based chatbot system designed for neuroscience research.
  • NSChat is built using the React framework; it is customizable and flexible, allows integration of various LLMs, and includes a logging mechanism for user interactions.

Emergence of human-like polarization among large language model agents

  • LLM (Large Language Model) agents framework: simulates a networked system of agents that establish social relationships, communicate, and form opinions on political issues.
  • Framework includes self-expression, communication, and opinion update stages; agents develop human-like polarization, homophilic clustering, and echo chamber effects; self-regulation strategy reduces self-inconsistency.
  • This framework provides a valuable platform for exploring strategies to mitigate polarization and promote inclusive political conversations.

LearningFlow: Automated Policy Learning Workflow for Urban Driving with Large Language Models

  • LearningFlow: is an automated policy learning workflow for urban driving that uses multiple LLM agents.
  • It includes curriculum sequence generation and reward generation processes, supported by analysis agents, and enhances sample efficiency.
  • This framework automates policy learning across complex driving tasks and reduces reliance on manual reward function design.

OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?

  • OVO-Bench (Online-VideO-Benchmark): is a novel video benchmark for evaluating online video understanding capabilities of Video-LLMs.
  • It includes 644 videos, 2800 meta-annotations, and 12 tasks across three categories: Backward Tracing, Real-Time Visual Perception, and Forward Active Responding.
  • This benchmark highlights the importance of temporal awareness for advanced online video understanding.

8th of January 2025

rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking

  • rStar-Math: A framework demonstrating that small language models (SLMs) can rival or surpass the math reasoning capability of OpenAI models through deep thinking. Iteratively improves through self-evolution generating millions of new math reasoning trajectories in each round.
  • Uses Monte Carlo Tree Search (MCTS) with self-annotated Q-values. rStar-Math used 747k math word problems, took the final correct answer and then rolled out 16 MCTS-based step-by-step verified reasoning trajectories, to categorize problems by difficulty level (easy/medium/hard) based on the ratio of correct solutions. Hard problems are assigned an additional 16 rollouts. The policy SLM is trained using all the step-by-step trajectories with their Q-values.
  • The importance of this work lies in showing that smaller language models can achieve state-of-the-art math reasoning, rivaling larger models, through a novel self-evolutionary process.
  • Includes Code-Augmented CoT, where step-by-step reasoning trajectories generated are verified with code execution for correctness.
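
The difficulty bucketing described above can be sketched as follows. The summary does not state the ratio cut-offs, so the `easy_min` and `hard_max` thresholds here are illustrative assumptions, and both function names are hypothetical.

```python
def classify_difficulty(num_correct, num_rollouts=16, easy_min=0.75, hard_max=0.25):
    """Bucket a problem by the share of rollouts that reached the known answer.

    The 0.75 / 0.25 cut-offs are assumptions for illustration, not paper values.
    """
    ratio = num_correct / num_rollouts
    if ratio >= easy_min:
        return "easy"
    if ratio <= hard_max:
        return "hard"
    return "medium"

def rollout_budget(difficulty, base=16, extra_for_hard=16):
    """Hard problems receive an additional batch of rollouts on top of the base 16."""
    return base + extra_for_hard if difficulty == "hard" else base

for correct in (14, 8, 2):
    level = classify_difficulty(correct)
    print(correct, level, rollout_budget(level))
```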

Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought

  • Meta-CoT (Meta Chain-of-Thought): A novel framework that extends traditional CoT by explicitly modeling the underlying reasoning process required to arrive at a particular CoT.
  • Inspired by Cognitive Science's dual-process theory, non-linear, iterative, latent process of exploration and verification, in-context search, process supervision, synthetic data generation, search algorithms, instruction tuning, reinforcement learning, scaling laws, verifier roles, novel reasoning algorithms, meta-reinforcement learning.
  • This work provides a theoretical and practical roadmap to enable Meta-CoT in LLMs, paving the way for more powerful and human-like reasoning in artificial intelligence.

URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics

  • URSA (Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics): A framework for enhancing the mathematical reasoning capabilities of Multimodal Large Language Models (MLLMs) through a three-module synthesis strategy and a novel dual-view process supervision data synthesis method.
  • Integrates CoT distillation, trajectory-format rewriting, format unification, MMathCoT-1M dataset, DualMath-1.1M dataset, URSA-7B model, URSA-RM-7B model, test-time scaling, process annotation, out-of-distribution (OOD) verification.
  • This work significantly enhances MLLMs' potential in mathematical reasoning, achieving state-of-the-art performance on multiple multimodal mathematical benchmarks and demonstrating robust supervision abilities.

Agent Laboratory: Using LLM Agents as Research Assistants

  • Agent Laboratory: An autonomous LLM-based research framework that completes the entire research process: literature review, experimentation (plan formulation, data preparation and running experiments) and report writing (report writing and report refinement).
  • Human-in-the-loop, research idea as input and code repository/research report as output. Produces SOTA-level performance and reduces research expenses.
  • The framework has the potential to accelerate scientific discovery by enabling researchers to focus on creative ideation rather than low-level coding and writing.
  • Includes postdoc/ph student/sw engineer/ml engineer/professor-agents. Includes mle-solver-tool capable of solving ML-tasks, which iteratively improves research code.
  • Automated evaluation of the framework significantly overestimated scores compared with human evaluation. Copilot mode was found useful by the human testers. Includes prompts.

7th of January 2025

Reasoning-Enhanced Self-Training for Long-Form Personalized Text Generation

  • REST-PG (Reasoning-Enhanced Self-Training for Personalized Text Generation): Introduces a multi-stage framework designed to teach LLMs reasoning over personalized context through Expectation-Maximization Reinforced Self-Training.
  • Generates reasoning paths based on the user's past preferences, background knowledge, and writing style.
  • The framework enhances LLMs' ability to generate personalized text, outperforming state-of-the-art baselines by 14.5% on average.

6th of January 2025

Large language models for artificial general intelligence (AGI): A survey of foundational principles and approaches

  • Introduces a survey about AGI concepts and achieving AGI with LLMs. Includes list of memory types used with LLMs: sensory/working/semantic/episodic/procedural. Lists aspects of embodiment as: goal-awareness/self-awareness/situatedness/deliberate action.

CALM: Curiosity-Driven Auditing for Large Language Models

  • CALM (Curiosity-driven Auditing for LLMs): Introduces intrinsically motivated RL based on curiosity to fine-tune an LLM as an auditor agent that discovers harmful/biased input/output pairs in the target LLM. Includes a token-level intrinsic bonus. Uses curiosity-driven exploration to navigate the prompt space efficiently, for example to discover specific celebrity names.

RTLSquad: Multi-Agent Based Interpretable RTL Design

  • RTLSquad: is a novel LLM-Based Multi-Agent system for interpretable RTL code generation.
  • It divides the design process into exploration, implementation, and verification & evaluation stages, managed by specialized agent squads, generating optimized RTL code through inter-agent collaboration, and providing decision interpretability through the communication process.
  • This framework enhances the ability to generate functionally correct RTL code and optimize PPA performance, while also providing decision paths.

5th of January 2025

LLMs Help Alleviate the Cross-Subject Variability in Brain Signal and Language Alignment

  • Decodes EEG scans to text with subject-independent semantic features for Brain-Computer Interfaces (BCIs). Introduces EEG embeddings.
  • Includes cross-subject generalization (addresses the issue of variability in brain anatomy between humans/neural dynamics/signal), zero-shot and comprehensive evaluation.

4th of January 2025

Table as Thought: Exploring Structured Thoughts in LLM Reasoning

  • Table as Thought: organizes reasoning within a tabular schema, where rows represent sequential thought steps and columns capture critical constraints and contextual information.
  • Framework is inspired by cognitive neuroscience theories, reasoning process iteratively populates the table until self-verification ensures completeness and correctness, excels in planning tasks and mathematical reasoning.
  • This work provides a novel exploration of refining thought representation within LLMs, paving the way for advancements in reasoning and AI cognition.

Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Diffusion Transformer Networks

  • Hallo3: The first application of a pretrained transformer-based video generative model for highly dynamic, realistic portrait animation.
  • Identity reference network, 3D VAE, transformer layers, speech audio conditioning, motion frame mechanisms, DiT-based video generation, video extrapolation.
  • Addresses challenges of non-frontal perspectives, dynamic objects, and immersive backgrounds in portrait animation.

Thinking with Many Minds: Using Large Language Models for Multi-Perspective Problem-Solving

  • Replicates the concept of "Wisdom of the Crowd" with LLMs using synthetic deliberation.
  • Generates multiple agents, each with a distinct perspective on a problem. Agents simulate arguments and counter-arguments from their perspective.
  • Agents explore the problem space in parallel, each from its own perspective. The integration mechanism adjusts agents' positions based on the proposals/evaluations of others, controlled by the influence parameter alpha. The iterative deliberation repeats over multiple rounds until consensus is reached.
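
The integration step above can be sketched numerically. This is a strong simplification: each agent's position is reduced to a single float, whereas the described agents exchange textual arguments, and the update rule (a weighted blend of an agent's own position with the mean of the others) is an assumption chosen for illustration. The name `deliberate` and its parameters are hypothetical.

```python
def deliberate(positions, alpha=0.3, tol=1e-3, max_rounds=100):
    """Blend each position with the mean of the other agents until they agree.

    alpha is the influence parameter: 0 means agents ignore each other,
    larger values pull each agent harder toward the rest of the group.
    Requires at least two agents.
    """
    positions = list(positions)
    for round_idx in range(1, max_rounds + 1):
        n = len(positions)
        total = sum(positions)
        updated = []
        for p in positions:
            others_mean = (total - p) / (n - 1)
            updated.append((1 - alpha) * p + alpha * others_mean)
        positions = updated
        # Consensus check: the spread of positions has collapsed below tol.
        if max(positions) - min(positions) < tol:
            return positions, round_idx
    return positions, max_rounds

final, rounds = deliberate([0.0, 0.5, 1.0], alpha=0.3)
print(final, rounds)
```

With this particular rule the group mean is preserved each round, so the agents converge toward the average of their starting positions.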

UAVs Meet LLMs: Overviews and Perspectives Toward Agentic Low-Altitude Mobility

  • Systematically reviews the integration of LLMs with UAVs (unmanned aerial vehicles).
  • Proposes roadmap towards agentic UAVs. Includes github-repository with links to papers/approaches around LLM-based UAV systems.

3rd of January 2025

SDPO: Segment-Level Direct Preference Optimization for Social Agents

  • Introduces SDPO (Segment-Level Direct Preference Optimization)-fine tuning, which aligns the LLM to key segments in multi-turn conversation.
  • Addresses goal-completion in multi-turn conversation.

AgentRefine: Enhancing Agent Generalization through Refinement Tuning

  • AgentRefine: Uses a strong LLM to simulate interactive role-playing, with the model acting as both Dungeon Master and player. A verifier checks each action for errors, providing feedback that allows the model to refine its actions until it achieves the correct result. This iterative process, with its corrected action sequences, trains the system to explore viable actions and generalize to new scenarios.

Multi-Agent Conversational Online Learning for Adaptive LLM Response Identification

  • MACO (Multi-Agent Conversational Online learning for adaptive LLM response identification): Uses multiple local agents to identify the most suitable LLM response to serve to a particular user, including new users, while achieving near-optimal cumulative regret.

MoColl: Agent-Based Specific and General Model Collaboration for Image Captioning

  • MoColl: Introduces LLM-agent based framework for image captioning with specialised VQA model. Includes warm-up stage and agent-guided tuning stage.

2nd of January 2025

ProgCo: Program Helps Self-Correction of Large Language Models

  • ProgCo (Program-driven Self-Correction): A self-correction framework that uses self-generated and self-executed verification pseudo-programs to improve reasoning in large language models. Includes ProgVe (Program-driven Verification) and ProgRe (Program-driven Refinement).
  • This framework enhances the ability of large language models to self-correct without external feedback, particularly in complex reasoning tasks.

A3: Android Agent Arena for Mobile GUI Agents

  • A3 (Android Agent Arena): Introduces a benchmark to evaluate mobile GUI agents, which focuses on practical tasks, larger action spaces and automated LLM-based evaluation.
  • A3 consists of controller (gets/controls states of the device), evaluator (final rating) and translator (between the device function and the agent message).

Dynamic Scaling of Unit Tests for Code Reward Modeling

  • CodeRM-8B: A lightweight unit test generator with dynamic scaling mechanism, which adapts number of unit tests based on problem difficulty. The unit tests are used in validating generated code by the LLM as reward signal.
  • The framework significantly improves performance of code generation across various models and benchmarks by enhancing the quality of the reward signal.


3D-LLaVA: Towards Generalist 3D LMMs with Omni Superpoint Transformer

  • 3D-LLaVA: Introduces a 3D multimodal LLM with point clouds/text instruction/visual prompt as input and generates text output and 3D mask with Omni Superpoint Transformer (OST).
  • 3D-LLaVA handles 3D vision-centric dialogue.
  • OST includes visual features selection, visual prompt encoding and 3D mask generation.

Harnessing Multi-Agent LLMs for Complex Engineering Problem-Solving: A Framework for Senior Design Projects

  • Proposes multi-agent LLM framework for engineering design projects consisting of problem formulation/breadth & depth/ambiguity & uncertainty/system complexity/technical innovation & risk management/societal & ethical consideration/methodology & approach/comprehensive evaluation-agents.
  • Each agent consists of description, task, objective and evaluation points.

Embodied AI-Enhanced Vehicular Networks: An Integrated Large Language Models and Reinforcement Learning Method

  • Incorporates embodied AI framework, which consists of semantic data processing with LLaVa-agent (extracts semantics from image data captured by the vehicle), Data transmission optimization (balances bandwidth utilization and quality of experience) and Enhanced decision making with Deep RL with GAE-PPO.

MDSF: Context-Aware Multi-Dimensional Data Storytelling Framework based on Large language Model

  • MDSF (Multidimensional Data Storytelling Framework): Automates data analysis and storytelling. Includes data preprocessing steps, fine-tuned LLMs and LLM agents.

Toward Inclusive Educational AI: Auditing Frontier LLMs through a Multiplexity Lens

  • Suggests two strategies to improve LLMs multiplexity (diverse cultural viewpoints) over WEIRD (western/educated/industrialized/rich/democratic): system prompt with diverse cultural perspectives and multi-agent system with agents with different cultural views. Sentiment analysis is used to review cultural resonance.

PSYCHE: A Multi-faceted Patient Simulation Framework for Evaluation of Psychiatric Assessment Conversational Agents

  • PSYCHE: Introduces an LLM-based psychiatric evaluation framework by comparing the predicted values of psychiatric elements (Construct-PACA) against the actual values (Construct-SP). The actual values are simulated patient data generated with a multi-faceted construct (MFC).
  • The framework guarantees clinical relevance, ethical safety, cost efficiency, and quantitative evaluation by simulating psychiatric patients with detailed profiles, histories, and behaviors.

BoxingGym: Benchmarking Progress in Automated Experimental Design and Model Discovery

  • Introduces BoxingGym-benchmark, which reviews LLMs' capabilities in experimental design and model discovery: collecting data to test scientific theories and proposing/updating scientific theories across 10 environments. Introduces a metric called EIG.
  • Expected information gain (EIG) measures an experiment's informativeness by testing if one scientific agent's model explanation enables another to make accurate environmental predictions.

1st of January 2025

Agentic Systems: A Guide to Transforming Industries with Vertical AI Agents

  • Reviews transition from SaaS to context-aware, adaptive systems handling dynamic environments through vertical agents.
  • Identifies core modules of LLM agents: memory/reasoning engine/cognitive skills/tools.
  • Author categorises agentic systems into: task-specific, multi-agent and human augmented agent systems.

Large Language Model Based Multi-Agent System Augmented Complex Event Processing Pipeline for Internet of Multimedia Things

  • Introduces multi-agent framework for complex event processing of video queries (think TikTok/YouTube as examples) with AutoGen and Kafka brokers (real time data streams).
  • Consists of conversable/assistant/user proxy/LLM backend/human backed/tool backed-agents.

Interactionalism: Re-Designing Higher Learning for the Large Language Agent Era

  • Introduces the Interactionalism-framework, which focuses on interactional intelligence to learn in a more personalized, social and non-linear way, instead of a monological way.
  • Proposes usage of dialogue-agents in education, such as tutors, teaching assistants, evaluators, guides and mentors.

LLM-Powered Multi-Agent System for Automated Crypto Portfolio Management

  • Introduces multi-agent framework for cryptocurrency investing with intrateam and interteam collaboration and multi modality. Consists of expert training module and multi-agent investment module.
  • Expert training module uses data/literature-agents to feed historical data and investment literature. Explanation-agents process this information to generate high-quality prompts to fine tune investment agents.
  • Multi-agent investment module consists of data-agent fetching real-time data to market-agents and crypto-agents. Market-agents include two expert agents that analyze news/market factors to predict market trends and determine cash-crypto allocation. Crypto-agents include two specialized agents that analyze crypto-specific factors and candlestick charts to make crypto selection decisions. Trading agents finally act with a trading API to execute the final portfolio strategy.

Beyond Text: Implementing Multimodal Large Language Model-Powered Multi-Agent Systems Using a No-Code Platform

  • Proposes design and implementation of multi modal and multi-agent framework with LLMs. Includes multi modal inputs (text/audio/video/image), multi-agent layer (includes supervisory-agent and RAG/image analysis/audio generation/image generation/video generation- worker agents), process layer (vector db and modality specific models) and the output layer (text/audio/video/image).
  • Supervisor agent controls sequence of tasks, distributes tasks, manages output of worker agents, interprets outputs and makes decisions about next steps in the sequence.

31st of December 2024

Enhancing LLM Reasoning with Reward-guided Tree Search

  • STILL-1 (Slow Thinking with LLMs): A reward-guided tree search framework to enhance the reasoning capabilities of LLMs.
  • Integrates policy model, reward model, and search algorithm; policy model navigates a dynamically expanding tree; guided by a trained reward model.
  • Improves LLMs' performance on complex mathematical reasoning tasks by trading test time for improved accuracy.
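
The interplay of policy model, reward model and search algorithm can be illustrated with a minimal best-first search sketch. This is not the STILL-1 implementation: `expand` stands in for the policy model, `reward` for the trained reward model, and the toy task (reach a sum of 6 by appending 1, 2 or 3) is purely hypothetical.

```python
import heapq

def reward_guided_search(root, expand, reward, is_terminal, max_expansions=100):
    """Best-first tree search: always expand the node with the highest reward."""
    # heapq is a min-heap, so rewards are negated; the counter breaks ties.
    frontier = [(-reward(root), 0, root)]
    counter = 1
    while frontier and max_expansions > 0:
        _, _, state = heapq.heappop(frontier)
        if is_terminal(state):
            return state
        max_expansions -= 1
        for child in expand(state):  # "policy model" proposes next steps
            heapq.heappush(frontier, (-reward(child), counter, child))
            counter += 1
    return None  # budget exhausted without reaching a terminal state

# Hypothetical toy task: build a tuple of steps whose sum is exactly 6.
def toy_expand(state):
    return [state + (step,) for step in (1, 2, 3)]

def toy_reward(state):
    return -abs(6 - sum(state))  # "reward model": closer to 6 is better

def toy_is_terminal(state):
    return sum(state) == 6

best = reward_guided_search((), toy_expand, toy_reward, toy_is_terminal)
```

Raising `max_expansions` spends more test-time compute on the search, which is the trade-off the summary above describes.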

MAIN-RAG: Multi-Agent Filtering Retrieval-Augmented Generation

  • Main-RAG: Introduces multi-agent framework, where LLM-agents collaboratively filter and score retrieved documents.
  • Introduces adaptive filtering, which dynamically adjusts relevance filtering threshold.
  • Includes three agents: predictor (infers answers based on retrieved documents), judge (scores filtering and ordering) and final-predictor (generates final answer based on filtered and ordered documents).
  • Includes system instruction prompts.
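
A minimal sketch of adaptive filtering, assuming the judge-agent returns one numeric relevance score per retrieved document. The threshold rule used here (mean score minus `n` standard deviations) and the function name `adaptive_filter` are illustrative assumptions, not the paper's exact code.

```python
from statistics import mean, pstdev

def adaptive_filter(scored_docs, n=0.0):
    """Keep documents scoring at least mean - n * std, ordered best first.

    scored_docs: list of (document, judge_score) pairs.
    n: hypothetical knob; larger n lowers the threshold and keeps more documents.
    """
    if not scored_docs:
        return []
    scores = [score for _, score in scored_docs]
    # The threshold adapts to the score distribution of this particular query.
    threshold = mean(scores) - n * pstdev(scores)
    kept = [(doc, score) for doc, score in scored_docs if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

docs = [("doc_a", 4), ("doc_b", 1), ("doc_c", 3)]
filtered = adaptive_filter(docs)  # threshold is the mean score, 8/3
```

The filtered and ordered list would then be passed to the final-predictor agent.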

Enhancing LLM Reasoning with Multi-Path Collaborative Reactive and Reflection agents

  • RR-MP (Reactive and Reflection agents with Multi-Path Reasoning): Improves reasoning capability of LLMs in complex scientific tasks.
  • Consists of reactive and reflection agents collaborating together to improve accuracy/avoid degeneration-of-thoughts.
  • Reactive agent receives information from external environment, decomposes it into sub-tasks, then stores them in the database.
  • Reflective agent analyzes sub-task it executes, offering suggestions or critiques. This feedback loop allows the reactive agent to refine its reasoning and complete the scientific process.

Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding

  • Embodied VideoAgent: Introduces VLM-based Embodied VideoAgent, which constructs scene memory from both egocentric video and embodied sensory inputs.
  • Includes persistent object memory, using VLM (depth maps / camera poses).
  • Automatically updates memory as actions / activities over objects are perceived.

Enabling New HDLs with Agents

  • HDLAgent: Introduces LLM-based agent to support code generation for underrepresented HDLs (Hardware Description Languages).

VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM

  • VideoRefer-model: Improves Video-LLMs fine-grained spatial and temporal detail understanding in videos, which facilitates more precise object descriptions, more detailed event analysis, and enhanced predictive reasoning in dynamic environments using masked object features.
  • VideoRefer-model consists of VideoLLaMA 2.1 as the foundation and a novel unified spatial-temporal object encoder that merges cross-frame token similarities.
  • Includes VideoRefer-dataset and VideoReferBench-benchmark.

LLM-MedQA: Enhancing Medical Question Answering through Case Studies in Large Language Models

  • LLM-MedQA: is a multi-agent medical question-answering system that incorporates similar case generation within a multi-agent architecture.
  • It leverages Llama3.1:70B model, includes question-specific analysis, option analysis, and case generation agents, and uses zero-shot learning.
  • This framework enhances performance on the MedQA dataset and improves interpretability and reliability in medical question answering.

30th of December 2024

Aviary: training language agents on challenging scientific tasks

  • Defines Language Decision Process (LDP). LDP is framed as Partially-Observable Markov Decision Process (POMDP), where actions only consist of the ones with the external environment.
  • Introduces Language agent training framework: Aviary. Includes implementation in 3 scientific domain tasks.
  • Builds language agents as stochastic computation graphs (SCG).

Distributed Mixture-of-Agents for Edge Inference with Large Language Models

  • Introduces Distributed Mixture-of-Agents, where multiple LLMs collaborate on various edge devices with decentralized gossip algorithm.
  • Does not rely on a centralized server.

Exploring and Controlling Diversity in LLM-Agent Conversation

  • APP (Adaptive Prompt Pruning): Controls diversity of the LLM-agent conversation through adjusting lambda-variable.
  • The lambda variable adjusts diversity by increasing/decreasing details about: current dialogue/history dialogue/environment/profile/memory.

Plancraft: an evaluation dataset for planning with LLM agents

  • Introduces Plancraft-benchmark to evaluate the planning capabilities of VLMs and LLMs in the Minecraft crafting GUI, including whether the model is able to identify (intentionally) unsolvable tasks.
  • Identifies that success rate alone is a poor metric in real-world tasks.

25th of December 2024

Probabilistic Mission Design in Neuro-Symbolic Systems

  • ProMis (Probabilistic Mission Design): ProMis helps drones understand where they can and cannot go by combining different types of information, like maps and sensor data, with rules and regulations, such as no-fly zones. Uses the term mission landscape to refer to the safest and most legal paths.
  • Combines formal reasoning with probabilistic inference. Uses LLM to convert instructions into ProMis code and ChangeFormer for perception of satellite images.

24th of December 2024

A Novel Task-Driven Method with Evolvable Interactive Agents Using Event Trees for Enhanced Emergency Decision Support

  • EvoTaskTree: is a task-driven method with evolvable interactive agents using event trees for emergency decision support.
  • Framework integrates task executors and task validators powered by large language models (LLMs), leverages insights from event tree analysis, and includes three crucial tasks: initiating event subevent analysis, event tree header event analysis, and decision recommendations.
  • This approach enhances rapid formulation of emergency decision-making and outperforms existing approaches.

Multi-Agents Based on Large Language Models for Knowledge-based Visual Question Answering

  • Introduces multi-agent framework consisting of three level of agents collaborating to provide answer: junior, senior and manager. Final answer is determined through voting. Each agent uses planning and tools (knowledge base / LLM knowledge).

VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks

  • VLABench-benchmark: Evaluates VLA models (Vision-Language Action models). Focuses on tasks requiring mesh & texture understanding, spatial understanding, semantic conversation cognition, common sense & applying real world knowledge, physical laws understanding and long horizon multi-step reasoning.

INVESTORBENCH: A Benchmark for Financial Decision-Making Tasks with LLM-based Agent

  • Investorbench-benchmark: Evaluates LLMs capability for financial decision making.

Decentralized Intelligence in GameFi: Embodied AI Agents and the Convergence of DeFi and Virtual Ecosystems

  • Introduces decentralized GameFI-ecosystem with LLM-agents based on Ethereum-blockchain.

Automated Code Review In Practice

  • Reviews automated code reviews, which led to longer average pull request closure time.

Large Language Model guided Deep Reinforcement Learning for Decision Making in Autonomous Driving

  • LGDRL (Language Guided Deep Reinforcement Learning): Introduces LLM-based autonomous driving system.
  • DRL agent learns from LLM-based driving expert-agent (prompted with prompt generator), when the LLM-based driving expert finds it necessary to intervene in the DRL agent's actions.

3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding

  • 3DGraphLLM: Improves LLMs understanding of 3D scenes by creating 3D scene graph representation (think graph, where arrows point, if object is right/left/front/behind) from set of point clouds (object input).

Explainable Multi-Modal Data Exploration in Natural Language via LLM Agent

  • XMODE: Uses LLM to decompose (converts into simpler sub-questions and translates into workflows) user queries into SQL / image analysis.
  • Includes planning & expert model allocation/execution & self-debugging/decision making/expert models & tools/data lake.

Muse: A Multimodal Conversational Recommendation Dataset with Scenario-Grounded User Profiles

  • Introduces MUSE-dataset with conversations centered around clothing-domain by using multi-agent framework to generate real world-scenarios (scenario-grounded user profile generator/simulated conversation generator/conversation optimizer).

Defining and Detecting the Defects of the Large Language Model-based Autonomous Agents

  • Agentable: Introduces static analysis tool to detect defects in code with LLM-based agents and Code Property Graphs (identifies specific code patterns/analyses descriptions). Includes AgentSet-dataset.
  • Includes pre-processing, defect detection (code abstraction/LLM invocation/semantic enrichment/detect oracles engineering), and defect reporting-modules.

22nd of December 2024

Imitate, Explore, and Self-Improve: A Reproduction Report on Slow-thinking Reasoning Systems

  • STILL-2 (Slow Thinking with LLMs): A framework to train reasoning models using a three-phase approach: imitation, exploration, and self-improvement.
  • Initial fine-tuning with distilled long-form thought data, exploration of challenging problems by generating multiple rollouts, iterative refinement of the training dataset.
  • The framework demonstrates competitive performance compared to industry-level reasoning systems, highlighting the potential of slow-thinking in enhancing complex reasoning capabilities of LLMs.

21st of December 2024

OpenAI o1 System Card

  • o1 model series: Large-scale reinforcement learning models trained to reason using chain of thought, improving safety and robustness.
  • Next model in series is OpenAI o1, faster version is OpenAI o1-mini, effective at coding, "thinks before it answers", long chain of thought before responding, refine thinking process, try different strategies, recognize mistakes.
  • Reasoning allows models to follow safety guidelines, provide helpful answers, resist attempts to bypass safety rules, avoid producing unsafe content, and reach state-of-the-art performance on certain benchmarks.

20th of December 2024

Deliberative Alignment: Reasoning Enables Safer Language Models

  • Deliberative Alignment: A training approach that "directly teaches" LLMs to explicitly reason through (safety) specifications before producing an answer.
  • Claims that reasoning using explicitly specified policies, in general, enables scaling alignment. In addition, improves model safety, robustness to jailbreaks, out-of-distribution generalization, and reduces overrefusal rates.
  • Two core stages: supervised fine-tuning on (prompt, CoT, output) examples, reinforcement learning; uses context distillation; includes a "judge" LLM for reward signal.
  • Deliberately assigns a varied amount of compute to CoT, which improves performance in hard evals.
  • In first stage, the model is fine tuned with SFT to reason about the (safety) specification within its CoT using examples dataset generated with context distillation with o-type model, where the CoT references the specification.
  • Second stage uses high-compute RL to train the model to think effectively, providing the reward signal with a judge LLM that has access to the (safety) instructions.

Offline Reinforcement Learning for LLM Multi-Step Reasoning

  • OREO (Offline REasoning Optimization): improves multi-step reasoning with offline RL.
  • Iterative OREO improves consistently with additional training rounds.

19th of December 2024

Disentangling Reasoning Tokens and Boilerplate Tokens For Language Model Fine-tuning

  • Reasoning-highlighted Finetuning (RFT): Highlights reasoning tokens from boilerplate tokens (format and connecting tokens less critical for the task). Adds larger weight to reasoning tokens.
  • Introduces SHAD (Shuffle-Aware Discriminator): automatic, adaptive token discrimination.
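
The idea of giving reasoning tokens a larger weight can be sketched as a weighted average of per-token losses. This is a simplified illustration, not the paper's loss: the boolean `is_reasoning` labels would come from a discriminator such as SHAD, while the `reasoning_weight` value and the normalisation by the weight sum are assumptions.

```python
def weighted_token_loss(token_nll, is_reasoning, reasoning_weight=2.0):
    """Weighted mean of per-token negative log-likelihoods.

    token_nll: per-token loss values.
    is_reasoning: True for reasoning tokens, False for boilerplate tokens.
    reasoning_weight: hypothetical weight (> 1) applied to reasoning tokens.
    """
    if not token_nll or len(token_nll) != len(is_reasoning):
        raise ValueError("token_nll and is_reasoning must be non-empty and equal length")
    weights = [reasoning_weight if flag else 1.0 for flag in is_reasoning]
    total = sum(w * nll for w, nll in zip(weights, token_nll))
    return total / sum(weights)

# One reasoning token (loss 1.0) and one boilerplate token (loss 3.0).
loss = weighted_token_loss([1.0, 3.0], [True, False])
```

With `reasoning_weight=1.0` the function reduces to the ordinary mean token loss.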

On Verbalized Confidence Scores for LLMs

  • Claims that LLMs can be prompted to provide calibrated confidence scores.

Agent-SafetyBench: Evaluating the Safety of LLM Agents

  • Agent-SafetyBench-benchmark evaluates the safety of LLM-agents. Agents tested achieved below 60% pass score.
  • LLM-agents currently lack robustness and risk awareness.

TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

  • TheAgentCompany-benchmark: evaluates AI agents capacity to perform long-sequence tasks in real world-like environment as a digital worker: arranging meetings, writing code, screening resumes, communicating (simulates communication between agents), planning and administrative work. Best agent completed 24% of tasks.
  • Generates tasks in a self-contained environment with internal websites and data similar to those used by software companies.

18th of December 2024

Inference Scaling Flaws: The Limits of LLM Resampling with Imperfect Verifiers

  • LLM Resampling: explores the limits of using resampling with imperfect verifiers for improving language model accuracy.
  • The framework shows that imperfect verifiers, like unit tests, lead to false positives, limiting the effectiveness of resampling, and that weaker models generalize worse than stronger models, even with infinite compute budget.
  • This research highlights the importance of developing accurate verifiers and questions the effectiveness of inference scaling with imperfect verifiers.
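
Why false positives cap the benefit of resampling can be shown with a short calculation. The sketch assumes independent samples and a verifier that accepts every correct sample; under those assumptions, the first accepted sample is correct with probability p / (p + q), where p is the per-sample probability of a correct answer and q the per-sample probability of an incorrect answer that still passes. The function name and the numbers are illustrative, not taken from the paper.

```python
def accuracy_limit(p_correct, p_false_positive):
    """Accuracy of 'resample until the verifier accepts' with unlimited samples.

    p_correct: probability a sample is correct (and therefore accepted).
    p_false_positive: probability a sample is wrong but still accepted.
    """
    accepted = p_correct + p_false_positive
    if accepted <= 0:
        raise ValueError("the verifier never accepts any sample")
    return p_correct / accepted

stronger = accuracy_limit(0.6, 0.2)  # hypothetical stronger model
weaker = accuracy_limit(0.1, 0.3)    # hypothetical weaker model
```

In this toy setting no amount of extra sampling lifts the weaker model above its limit, because the limit does not depend on the number of samples drawn.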

17th of December 2024

AI PERSONA: Towards Life-long Personalization of LLMs

  • AI Persona: proposes that LLMs should continuously adapt to a diverse set of users via personalization.
  • Introduces a framework for life-long personalization of LLMs through learnable and dynamically updated dictionaries, which are updated based on interaction between user and the LLM.

13th of December 2024

Byte Latent Transformer: Patches Scale Better Than Tokens

  • Byte Latent Transformer (BLT): is a byte-level LLM architecture that encodes bytes into dynamically sized patches to efficiently allocate compute by varying the amount of compute based on the entropy of the next byte prediction.
  • BLT segments patches based on next-byte entropy, allocates more compute where data complexity increases, and improves training and inference efficiency.
  • BLT shows better scaling than tokenization-based models by simultaneously growing both patch and model size.
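
Entropy-based patching can be sketched as follows, assuming the per-byte next-byte entropies are already available (in BLT they would come from a small byte-level language model; here they are passed in as plain numbers). The global-threshold rule and the function name `segment_patches` are simplifying assumptions.

```python
def segment_patches(data, entropies, threshold):
    """Split bytes into patches, starting a new patch at high-entropy positions.

    data: bytes to segment.
    entropies: entropies[i] is the predicted entropy of byte i given its prefix.
    threshold: hypothetical cut-off; higher values yield fewer, longer patches.
    """
    if len(data) != len(entropies):
        raise ValueError("need one entropy value per byte")
    patches, start = [], 0
    for i in range(1, len(data)):
        if entropies[i] > threshold:  # hard-to-predict byte opens a new patch
            patches.append(data[start:i])
            start = i
    if data:
        patches.append(data[start:])
    return patches

patches = segment_patches(b"abcdef", [0.1, 0.2, 0.9, 0.1, 0.8, 0.1], threshold=0.5)
```

Predictable stretches end up in long patches and unpredictable stretches in short ones, so a model that spends a fixed cost per patch allocates more compute where the data is harder.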

11th of December 2024

A Multimodal Social Agent

  • MuSA: is a multimodal LLM-based agent designed for analyzing text-rich social content.
  • MuSA includes reason-, plan-, optimize-, criticize-, refine- and act-LLM-based units, is model-agnostic, and optimized for social content analysis tasks.
  • MuSA can automate and improve social content analysis, aiding decision-making processes across various applications.

9th of December 2024

AlphaVerus: Bootstrapping Formally Verified Code Generation through Self-Improving Translation and Treefinement

  • AlphaVerus: generates formally verified code with LLMs and through self-improvement by iteratively translating programs from higher resource language.
  • Includes three phases: exploration (translates programs from source language to Verus, which is a tool to verify correctness of code written in Rust), treefinement (iteratively fixes errors with Verus-verifier feedback/tree search) and critique (validates and filters unspecified/incorrect translations).
  • Illustrates the potential of inference-time scaling in verified settings. Suggests formal verification ensures correctness and reliability of the generated code.

Query-Efficient Planning with Language Models

  • Reviews efficient ways to use LLMs for planning: heuristic and LLM as generative planner.
  • Introduces two new algorithms: Tree of Interaction (ToI) and Boomerang.

Simulating Human-like Daily Activities with Desire-driven Autonomy

  • D2A-agent (Desire-driven Autonomous Agent): Introduces autonomous agent proposing and selecting autonomously fulfilling and motivating tasks (based on theory of needs: social interaction/personal fulfillment/self-care).
  • Introduces desire-based characters.
  • Includes value system (measures satisfaction per desired dimension) and Desire-driven planner (chooses next action of the agent with history and value system).
  • Proposes using in the future more complex human motivation and planning mechanisms to satisfy intrinsic desires. Includes prompts.

Toward LLM-Agent-Based Modeling of Transportation Systems: A Conceptual Framework

  • Proposes transportation system modelling with LLM-based agents to replicate human decision making.
  • LLM-based agents include long-lasting core components: identity (age/income/occupation/cars owned/persona/travel related task/travel restrictions)/memory(short and long term)/LLM core(summarization/planning/nlu/workflow).
  • Includes iterative process with perception, reflection, planning, plan processing and action.

Beyond pip install: Evaluating LLM Agents for the Automated Installation of Python Projects

  • Installamatic: Reviews LLM-agents' capability to install repository-level Python packages with pip by automatically inspecting repository content and installing the packages required.
  • Installamatic-agent is capable of installing packages required in 21/40 repositories tested with 4 main challenges: Identifying install-relevant documentation/writing valid docker files/cost/oracle-problem.

AutoDCWorkflow: LLM-based Data Cleaning Workflow Auto-Generation and Benchmark

  • AutoDCWorkflow: uses LLM to automatically generate data-cleaning workflows (duplicates/missing values/inconsistent data format) and introduces a benchmark.

StarWhisper Telescope: Agent-Based Observation Assistant System to Approach AI Astrophysicist

  • SWT (StarWhisper Telescope System): proposes automation of the astronomer observation process with LLMs. Includes observation planning/control/data processing/agent suggestion. Includes customized observation lists and real time analysis.

5th of December 2024

Practical Considerations for Agentic LLM Systems

  • Reviews LLM agent research from perspective of planning (explicit/implicit, task decomposition, plan adherence), memory (RAG, long-term memory), tools (usage/dynamic/multiplicity) and control flow (output processing/error handling/stopping/multi-persona/context).
  • Long term memory may include reflection/consolidation/forgetting/revision and should be independent/consistent/long-term.

Targeting the Core: A Simple and Effective Method to Attack RAG-based Agents via Direct LLM Manipulation

  • Investigates adversarial Adaptive Attack Prompt- and ArtPrompt-attack methods success rates between LLM models.

2nd of December 2024

Mastering Board Games by External and Internal Planning with Language Models

  • MAV (Multi Action-Value) model: is a transformer model pre-trained on textual game data, functioning as a world model, value function, and policy function for multiple perfect-information board games.
  • Framework includes external and internal search methods, uses MCTS controller, and distills search procedure directly into the LLM, pre-trained on relevant domain knowledge, minimizes hallucinations, and improves win-rates against state-of-the-art bots.
  • This framework demonstrates the capacity of LLMs to learn strong value functions and act as a world model across multiple perfect information games.

29th of November 2024

Amplifying human performance in combinatorial competitive programming

  • FunSearch: is a framework that evolves scoring functions for a human-designed solution backbone using a large language model.
  • Framework uses Gemini 1.5 Flash 002, improves scores on Hash Code, and uses a switching variable for multiple choice points.
  • This approach demonstrates a successful human-AI synergy in combinatorial optimization problems.

25th of November 2024

Agent-Based Modelling Meets Generative AI in Social Network Simulations

  • Generative Agent-Based Modelling (GABM): LLM-based agents, which simulate social network users with personality traits/interests and custom agent interactions.
  • The framework consists of two phases: Characterization (Personality assignment) and Simulation (Reasoning module and Interaction module). Decisions of the agent are stored in vector db for retrieval.

TopV-Nav: Unlocking the Top-View Spatial Reasoning Potential of MLLM for Zero-shot Object Navigation

  • TopV-Nav: Improves Zero-Shot Object Navigation (ZSON) in unfamiliar environments by reasoning on top-view maps ("birds eye") with MLLM's spatial reasoning capabilities.
  • Proposes Adaptive Visual Prompt Generation (AVPG), which adaptively constructs top-view map. The framework then uses Dynamic Map Scaling (DMS), which dynamically zooms top-view map at preferred scales for local reasoning. Uses Target-Guided Navigation (TGN) to facilitate human-like exploration.

A Multi-agent Framework for Materials Laws Discovery

  • Introduces a LLM-based multi agent framework to discover materials laws in materials science, using general framework for solving symbolic regression tasks with LLMs. Uses a depth-first search (DFS) algorithm and a reflection mechanism, implemented through LLMs, to optimize formula generation.

Enhancing Multi-Agent Consensus through Third-Party LLM Integration: Analyzing Uncertainty and Mitigating Hallucinations in Large Language Models

  • Introduces a multi-agent consensus framework, which integrates confidence weight obtained with third-party LLM, to adjust attention weights of each agent.
  • Each agent answers individually on the first round, agents self-adjust with feedback on second/third round with third party LLM and finally agents majority vote the final answer.

SAGEval: The frontiers of satisfactory agent-based NLG evaluation for reference-free open-ended text

  • SAGEval: Introduces an eval for an open-ended, reference-free natural language generation (NLG) by using a critiquing agent to provide feedback on scores generated by LLM evaluators. Focuses on open-ended text like surveys, forms, and lists.
  • Includes Evaluator- (based on G-Eval) and Sage-agent as meta-evaluator. Evaluation aspects include: accuracy, semantic diversity, coherence, relevancy, audience understandability, audience engagement score, fairness score and sentiment/tone type.

24th of November 2024

PIANIST: Learning Partially Observable World Models with LLMs for Multi-Agent Decision Making

  • PIANIST (Partition function, Information set space, Action space function, N players, Information realization function, State space, and Transition reward function): A framework for decomposing a world model into seven components, enabling zero-shot LLM generation of a working world model for multi-agent decision-making tasks.
  • The framework leverages LLMs for generating forward transition functions, action functions, and information partition functions. It uses MCTS for planning in partially observable environments. The approach is evaluated on language and non-language based action-taking games, without domain-specific training data.
  • PIANIST demonstrates strong performance in multi-agent, partial information settings, showcasing the potential of LLMs for complex decision-making.

21st of November 2024

Natural Language Reinforcement Learning

  • Introduces: Natural Language Reinforcement Learning (NLRL).
  • Efficiently implements RL algorithms and principles in language representation space.
  • Presents NLRL-pipeline, where LLM learns from textual environmental feedback.
  • Implements empirically in various games.

18th of November 2024

GENERATIVE WORLD EXPLORER

  • Generative World Explorer (Genex): Introduces an egocentric world exploration framework, which allows an agent to mentally explore a large-scale 3D world and acquire imagined observations to update its belief inside a partially observable decision process.
  • Generates high-quality and consistent observations in long-horizon tasks.
  • Consists of generative video model, egocentric views, belief revision, and decision-making (e.g., LLM agent). Includes multi-agent reasoning with imagination, where the framework infers perspectives of other actors in the scene.

OASIS: Open Agents SOCIAL INTERACTION Simulations on One Million Agents

  • OASIS (Open Agents SOCIAL INTERACTION Simulations on One Million Agents): Introduces generalizable, scalable (millions of agents) social media (twitter/reddit-like) simulator with LLM-based agents supporting dynamic social networks, diverse actions and recommendation systems. Includes registration and simulation phases.
  • OASIS pulls in the registration phase information about user, past posts, self-description and name.
  • Simulation phase consists of Environment server(sends agent information, posts and user relationships)/RecSys(recommends visible content to user and agents)/Agent module(generates actions updating environment state)/Time engine(updates agents temporal behaviours)/Scalable Inferencer-components(handles large scale inference requests by user).
  • OASIS replicates social phenomena observed in human-societies, including group polarization and herd effect, which take place in dynamically updating environments with diverse action spaces.
  • Uses event-driven architecture, where agent communicates with server in dedicated channel, which consists of asynchronous message queue.

TrojanRobot: Backdoor Attacks Against Robotic Manipulation in the Physical World

  • TrojanRobot: A backdoor attack framework, which targets robotic manipulation in the physical world by embedding a backdoor in the robotic system's visual perception module.
  • Uses common objects as triggers.

A Code Knowledge Graph-Enhanced System for LLM-Based Fuzz Driver Generation

  • CodeGraphGPT: a framework that leverages a code knowledge graph and an LLM-powered intelligent agent to automate fuzz driver generation (sw testing technique by feeding unexpected random data as program inputs to discover bugs).
  • Includes agents for API combination generation (knowledge into graphs and then embeddings to query), dynamic program repair (past example embeddings), and crash analysis (bugs embeddings).
  • Constructs knowledge graph of code repos, tailors fuzz drivers and input seeds, resolves compilation errors, and analyzes crash reports.

Moral Persuasion in Large Language Models: Evaluating Susceptibility and Ethical Alignment

  • Reviews Persuader agents capacity to influence another LLM agent (Base agent) in morally ambiguous decision making scenarios.
  • LLMs vary more in how susceptible they are to persuasion than in their capacity to persuade others.

LLM-IE: A Python Package for Generative Information Extraction with Large Language Models

  • LLM-IE [LLM-based Information Extraction]: A Python package for building complete information extraction pipelines using large language models (LLMs).
  • Key features include interactive LLM agent for prompt design, support for named entity recognition, entity attribute extraction, and relation extraction tasks. Benchmarked on i2b2 datasets. Sentence-based prompting algorithm.

16th of November 2024

Developer Challenges on Large Language Models: A Study of Stack Overflow and OpenAI Developer Forum Posts

  • Analyzes developer challenges with LLMs. Challenges include LLM ecosystem, API usage, LLM training, dataset management, prompt engineering, and error handling. Identifies several unresolved posts, slow response times, especially with complex topics.

FlexFL: Flexible and Effective Fault Localization with Open-Source Large Language Models

  • FlexFL (Flexible and Effective Fault Localization): LLM-agents (Agent4SR and Agent4LR) based framework for code debugging / fixing with bug-related information (bug reports, test cases).
  • The framework employs a two-stage approach: space reduction (Agent4SR) to narrow search space and localization refinement (Agent4LR) to localize top k-most suspicious methods.

IntentGPT: Few-shot Intent Discovery with Large Language Models

  • IntentGPT: introduces a training-free method for Intent discovery using In-context Learning prompt (generated with LLM consisting of known intents/few-shot examples and user query) and LLM generating the intent.
  • Adds discovered intents back into the prompt. Includes prompts.
  • IntentGPT outperforms previous methods with extensive domain-specific data for training/fine-tuning. Discovers intents in dynamic, open-world scenarios.

15th of November 2024

Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization

  • MPO (Mixed Preference Optimization): is a method that blends supervised fine-tuning loss with preference optimization losses to enhance training effectiveness of multimodal large language models.
  • MPO uses a novel automated preference data construction pipeline to create MMPR dataset, and explores different Chain-of-Thought approaches with multimodal input to improve reasoning performance.
  • This approach demonstrates improved performance across multiple benchmarks, particularly in multimodal reasoning tasks.

A dataset of questions on decision-theoretic reasoning in Newcomb-like problems

  • Decision-theoretic reasoning: Introduces a dataset of natural language questions on Newcomb-like problems.
  • The dataset includes capability questions (unambiguous answers) and attitude questions (disagreements among decision theorists). It evaluates existing large language models (LLMs) and their attitudes toward evidential decision theory (EDT) and causal decision theory (CDT).
  • Findings associate higher capability LLMs with more EDT-favorable attitudes across question types. The dataset helps to understand decision-theoretic reasoning capabilities and attitudes of LLMs in AI-AI interactions.

12th of November 2024

RedCode: Risky Code Execution and Generation Benchmark for Code Agents

  • RedCode-benchmark: Evaluates the safety of code agents when they generate / execute code and reviews their capacity to recognize/manage unsafe code execution.
  • Includes two steps: RedCode-Gen (evaluates code generated) and RedCode-Exec (evaluates code execution).

World Models: The Safety Perspective

  • Introduces a survey of world models in embodied AI agents from a safety perspective.

BudgetMLAgent: A Cost-Effective LLM Multi-Agent system for Automating Machine Learning Tasks

  • BudgetMLAgent: Multi agent framework using cascading (sequentially invoking/chaining) free/low cost/frontier LLMs with distinct roles: planner (default/expert)/workers (high-level actions/low-level actions).
  • Gives LLM-agent an option to call more advanced LLM-model to request help (with maximum retries) in complex planning problems.
  • Reduces operation cost by 94% compared to a single agent with GPT-4 and improves success rate.
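
The cascade with a retry limit described above can be sketched as follows. This is a minimal sketch under stated assumptions: `cheap_llm` and `expert_llm` are hypothetical stand-ins for real model calls, and a return value of `None` is an assumed convention for "the model could not produce a usable plan".

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical LLM call: returns a plan, or None when the model fails.
LLMCall = Callable[[str], Optional[str]]

def cascade(task: str, models: List[Tuple[str, LLMCall]],
            max_retries: int = 2) -> Tuple[Optional[str], List[str]]:
    """Try models from cheapest to most capable; escalate after max_retries failures."""
    calls: List[str] = []
    for name, call in models:
        for _ in range(max_retries):
            calls.append(name)
            result = call(task)
            if result is not None:
                return result, calls
    return None, calls

def cheap_llm(task: str) -> Optional[str]:
    # Toy behavior: the cheap model gives up on anything marked "complex".
    return None if "complex" in task else f"plan for {task}"

def expert_llm(task: str) -> Optional[str]:
    return f"expert plan for {task}"

MODELS = [("cheap", cheap_llm), ("expert", expert_llm)]
```

The returned call log shows where the cost goes: simple tasks never reach the expensive model, and complex tasks reach it only after the cheaper model has used up its retries.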

LLMPhy: Complex Physical Reasoning Using Large Language Models and World Models

  • LLMPhy: Combines LLM with Mujoco-physics engine for complex physical reasoning tasks and introduces TraySim-dataset consisting of 100 scenes.
  • Claims that LLMs combined with a physics engine have enough world knowledge for better interactive reasoning, and that LLMs trained with more scientific reasoning tasks tend to demonstrate superior physical reasoning in the LLMPhy-pipeline.

From General to Specific: Utilizing General Hallucination to Automatically Measure the Role Relationship Fidelity for Specific Role-Play Agents

  • Introduces an automatic evaluation framework for Role-Playing Agents (RPAs) that generates claims from a knowledge graph and has characters discuss them with the main character.
  • Evaluates the believability of interactions by leveraging the inherent hallucination properties of RPAs. Defines relationship hallucination metric.

Mitigating Bias in Queer Representation within Large Language Models: A Collaborative Agent Approach

  • Focuses on inclusivity / gender neutrality in LLM-agents, using assistant/language analysis/optimizer-agents.

11th of November 2024

Mr.Steve: Instruction-Following Agents in Minecraft with What-Where-When Memory

  • Mr.Steve (Memory Recall Steve-1): Improves long-horizon task solving by incorporating solver module and Place Event Memory (PEM), which recalls what-, where- and when-information from episodes.
  • Includes memory-augmented task solving and exploration strategy.

Using Generative AI and Multi-Agents to Provide Automatic Feedback

  • Autofeedback: Introduces multi agent LLM-based framework for student feedback, which includes feedback generation- and feedback validation/modifier-agents. Reduces over-praising and over-inference.
  • Includes prompts of both agents.

Script-Strategy Aligned Generation: Aligning LLMs with Expert-Crafted Dialogue Scripts and Therapeutic Strategies for Psychotherapy

  • SSAG (Script-Strategy Aligned Generation): Aligns LLMs with key therapeutic strategies in Motivational Interviewing. Claims that LLMs aligned with expert prompting outperform rule-based chatbots and pure LLMs.

Tooling or Not Tooling? The Impact of Tools on Language Agents for Chemistry Problem Solving

  • ChemAgent-framework: Introduces agent for chemistry tasks, which includes reasoning/grounding and tool use.

A Multi-Agent Approach for REST API Testing with Semantic Graphs and LLM-Driven Inputs

  • AutoRestTest: Introduces MARL-framework with Semantic Property Dependency Graphs (SPDG) and LLMs for REST API exploration.
  • Includes dependency/operation/parameter/value-agents.

10th of November 2024

Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents

  • WebDreamer: LLM-based web-agent framework, which uses an LLM to predict the outcomes of candidate actions in a web environment in order to pick the optimal action.
  • The LLM acts as a world model: it simulates actions using prompts like "what would happen if I click this button" and then evaluates the imagined outcomes.
  • Model-based planning enables safe simulation of possible actions before taking them (some web environments do not allow going back to previous step, which complicates tree-based search by investigating candidate next steps).
  • Includes system prompts of the world model and reward model.
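
The simulate-then-score loop described above can be sketched as follows. This is a minimal sketch, not the paper's code: `toy_world_model` and `toy_reward_model` are hypothetical stand-ins for the two LLM prompts, and the scoring rule is an assumption for illustration.

```python
def choose_action(goal, observation, candidates, world_model, reward_model):
    """Simulate each candidate with the world model and return the best-scored one."""
    best_action, best_score = None, float("-inf")
    for action in candidates:
        imagined = world_model(observation, action)  # "what would happen if ..."
        score = reward_model(goal, imagined)         # how well the outcome serves the goal
        if score > best_score:
            best_action, best_score = action, score
    return best_action

# Toy stand-ins; a real agent would prompt an LLM for both roles.
def toy_world_model(observation, action):
    return f"{observation}; after '{action}' the page shows {action.split()[-1]}"

def toy_reward_model(goal, imagined):
    return 1.0 if goal in imagined else 0.0
```

No action is executed until the loop finishes, which is what makes this approach usable in environments where a step cannot be undone.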

9th of November 2024

IOPO: Empowering LLMs with Complex Instruction Following via Input-Output Preference Optimization

  • IOPO (Input-Output Preference Optimization): Aligns/fine-tunes LLMs based on both the input data (new approach) and the output data (traditional approach).
  • Explores instruction preference space.

From References to Insights: Collaborative Knowledge Minigraph Agents for Automating Scholarly Literature Review

  • Introduces CKMAs (Collaborative Knowledge Minigraph Agents), which automate literature reviews by building knowledge minigraphs that organize information and relationships from research papers.
  • Includes KMCA (Knowledge Minigraph Construction Agent) and MPSA (Multiple Path Summarization Agent). Prompts for both are included.

8th of November 2024

The influence of persona and conversational task on social interactions with a LLM-controlled embodied conversational agent

  • Reviews the effect of LLM-based agent persona traits on user experience.
  • Manipulation of the personality traits strongly influences social interaction and user experience.

Game-theoretic LLM: Agent Workflow for Negotiation Games

  • Studies, with game-theoretic analysis, the rationality of an LLM-based negotiation workflow (with various LLMs) in various complete-information games and in an incomplete-information game.

7th of November 2024

Interactive Dialogue Agents via Reinforcement Learning on Hindsight Regenerations

  • Simulates interactive dialogue by utilizing hindsight to regenerate optimal task-relevant dialogue data based on initial dialogue data.
  • Includes hindsight controller, which takes dialogue input and prefix, then outputs a more desirable action.

GUI Agents with Foundation Models: A Comprehensive Survey

  • Introduces a survey of GUI agents.
  • Divides LLM-based GUI agents into: GUI Perceiver, Task Planner, Decision Maker, Executor and Memory Planner (internal memory: actions/screenshots, external memory: manual construct/auto exploration and self-evolution: transition diagram/documents).
  • Identifies challenges related to inference efficiency, self-evolution and real world vs. benchmark gap.

CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models

  • CodeTree: Introduces multi-agent, LLM-based code generation, which improves multi-stage planning/generation/debugging by using tree search.
  • Includes Thinker/Solver/Debugger/Critic-agents.
  • Critic-agent scores/expands/terminates nodes, based on feedback generated by the LLM and on execution feedback from test cases.

CaPo: Cooperative Plan Optimization for Efficient Embodied Multi-Agent Cooperation

  • CaPo (Cooperative Plan Optimization): Includes meta-plan generation and progress-adaptive meta-plan & execution.
  • Meta-plan generation: the agents analyze the task, discuss it and create a meta-plan decomposed into subtasks.
  • Progress-adaptive meta-plan & execution: agents execute the tasks in the meta-plan and dynamically adjust it based on the latest progress in a multi-turn dialogue.

6th of November 2024

AdaSociety: An Adaptive Environment with Social Structures for Multi-Agent Decision-Making

  • AdaSociety: multi-agent environment to simulate decision making with physical (resources, events, agents' skill inventories) and social (establish, alter, form groups, hierarchies) components.
  • Introduces social states: a multilayer directed graph describing adaptive / dynamic connections, which drive long-term coalition formation / hierarchy.
  • Agents connect dynamically and autonomously with other agents, establishing non-deterministic connections.
  • State and action space dynamically advance.
  • Identifies research challenges in collective reasoning, social cognition, adaptation, communication and emergence of new social skills and norms.

MRJ-Agent: An Effective Jailbreak Agent for Multi-Round Dialogue

  • MRJ-Agent: Introduces multi-round dialogue jailbreaking agent, which decomposes harmful queries into multiple sub-queries.
  • This widely generalizable jailbreaking technique achieves SOTA-level success rates.

From Novice to Expert: LLM Agent Policy Optimization via Step-wise Reinforcement Learning

  • StepAgent: Optimizes LLM-agents with step-wise RL using inspection- and reflection-steps.

5th of November 2024

SAUCE: Synchronous and Asynchronous User-Customizable Environment for Multi-Agent LLM Interaction

  • SAUCE (Synchronous and Asynchronous User-Customizable Environment): Introduces LLM-based multi agent framework with asynchronous communication feature, where models decide when to speak and what to say.
  • Includes experiment (configures discussion, participants, host and end criteria)/session room (manages ongoing experiment and exit criteria)/host (directs interaction)/person (human or LLM).
  • Implements LLM-agent personas (and human participant) as class-objects in Python.

AI Metropolis: Scaling Large Language Model-based Multi-Agent Simulation with Out-of-order Execution

  • AI Metropolis: introduces multi agent LLM-based framework, which enables out-of-order execution (parallel processing) of agents by tracking dynamically real dependencies between agents.
  • LLM agents often wait unnecessarily for each step to complete before proceeding, even when the dependency is a false one.
  • LLM agents can be: blocked (another blocks proceeding), coupled (proceed together), clustered (group needs to synchronize), worker (independent process handling cluster) or controller (main process communicating with workers).
  • The related work-section offers a comprehensive view of the different scheduling approaches for agentic AI.
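
The idea of letting agents proceed unless a real dependency blocks them can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's scheduler: the 1-D world, the perception `radius`, the maximum `speed` and the conservative blocking rule are all hypothetical choices for illustration.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Agent:
    name: str
    step: int      # simulation step the agent has reached
    position: int  # location in a toy 1-D world

def is_blocked(agent: Agent, agents: List[Agent], radius: int, speed: int = 1) -> bool:
    """Blocked if a lagging agent could enter this agent's perception radius
    by the time it catches up (a conservative, assumed dependency rule)."""
    for other in agents:
        if other.name == agent.name or other.step >= agent.step:
            continue  # agents at the same or a later step cannot block
        lag = agent.step - other.step
        if abs(agent.position - other.position) <= radius + speed * lag:
            return True
    return False

def runnable(agents: List[Agent], radius: int) -> List[str]:
    """Names of agents that may advance now, regardless of global step order."""
    return [a.name for a in agents if not is_blocked(a, agents, radius)]
```

In a lock-step simulation every agent would wait for the slowest one. Here an agent that is far away from all lagging agents keeps running, and only agents that could actually be affected are held back.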

1st of November 2024

DARD: A Multi-Agent Approach for Task-Oriented Dialog Systems

  • DARD (Domain Assigned Response Generation): LLM-based multi agent framework in multi domain & task oriented dialogue.
  • Introduces dialogue manager/hotel/attraction/restaurant/train/taxi-agents, external db and dialogue state tracker.
  • Uses both fine-tuned LLMs and Sonnet 3.0. Reviews differences in performance.

31st of October 2024

Navigating the Unknown: A Chat-Based Collaborative Interface for Personalized Exploratory Tasks

  • CARE (Collaborative Assistant for Personalised Exploration): Introduces personalized LLM-based multi agent framework, where user interface includes chat/solution/needs-panels.
  • Focuses on improving multi-turn contextual understanding, personalization and exploration, and on reducing cognitive load.
  • Employs inquiry/ranking/needs discovery/solution crafting/milestone-agents.

30th of October 2024

EMOS: Embodiment-aware Heterogeneous Multi-robot Operating System with LLM Agents

  • EMOS: multi-agent framework for multi-robot system with embodiment & spatial-aware reasoning/navigation/manipulation/object rearrangement.
  • Includes hierarchical task planning, assignment and actioning. Evaluates success rate, sub-goal success rate, token usage and simulation step.
  • Uses "Robot Resume", a self-prompting approach that replaces "human roleplay": the agent interprets the robot URDF files and calls robot kinematics tools to generate descriptions of its physical abilities, which guide its planning/action execution.

Aligning Audio-Visual Joint Representations with an Agentic Workflow

  • AVAgent: Aligns audio signals with visual data using an LLM-based agent framework, which uses a tool to convert the video and audio modalities to text, plans edits of the audio signals and reflects with a VLM to evaluate the modifications.

29th of October 2024

BENCHAGENTS: Automated Benchmark Creation with Agent Interaction

  • BENCHAGENTS: Introduces LLM-agent framework automating benchmark creation, which includes four components: planning/generation/data verification/evaluation-agents.
  • Dynamic benchmarks help to identify common failure modes/model differences, while LLMs improve quickly.
  • Planning includes: prompt/task-specific parameters/constraints (positive/negative/positional/sequencing/conditional/iterative).

28th of October 2024

Asynchronous Tool Usage for Real-Time Agents

  • Asynchronous AI agents: Introduces asynchronous, parallel thought processing and real-time tool use based on event-driven finite state-machines.
  • Messages carry a timestamp to enable clock awareness, which supports time-constrained tasks.
  • Event states include idle/listening/generating/emitting.
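
An event-driven finite state machine over the four states listed above can be sketched as follows. This is a minimal sketch: the state names come from the summary, while the event names, the transition table and the log format are assumptions for illustration.

```python
from enum import Enum

class State(Enum):
    IDLE = "idle"
    LISTENING = "listening"
    GENERATING = "generating"
    EMITTING = "emitting"

# Assumed transition table: (current state, event) -> next state.
TRANSITIONS = {
    (State.IDLE, "user_starts_speaking"): State.LISTENING,
    (State.LISTENING, "user_stops_speaking"): State.GENERATING,
    (State.GENERATING, "response_ready"): State.EMITTING,
    (State.EMITTING, "emission_done"): State.IDLE,
    # Interruptions: the user can cut in while the agent thinks or speaks.
    (State.GENERATING, "user_starts_speaking"): State.LISTENING,
    (State.EMITTING, "user_starts_speaking"): State.LISTENING,
}

def step(state, event, timestamp):
    """Return the next state and a timestamped log line; unknown events keep the state."""
    next_state = TRANSITIONS.get((state, event), state)
    return next_state, f"[{timestamp:.1f}s] {state.value} -> {next_state.value} on {event}"

def run(events):
    """Feed (timestamp, event) pairs through the machine, starting from idle."""
    state, log = State.IDLE, []
    for timestamp, event in events:
        state, line = step(state, event, timestamp)
        log.append(line)
    return state, log
```

Because every event carries its own timestamp, the agent can reason about elapsed time, and an interruption event can pull it back to listening without waiting for generation to finish.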

25th of October 2024

Cooperative Strategic Planning Enhances Reasoning Capabilities in Large Language Models

  • CoPlanner (Cooperative Planner): Improves the reasoning capabilities of an LLM by separating reasoning steps. Each agent is assigned a unique reasoning step.
  • Includes planning agent and reasoning agent.
  • Pre-defines 10 human cognition-based meta-strategies: five logical reasoning methods (deduction/induction/abduction/analogy/contradiction), four problem solving methods (decomposition/enumeration/elimination/reflection) and one meta-strategy, finish, to indicate the end of reasoning.

VisionCoder: Empowering Multi-Agent Auto-Programming for Image Processing with Hybrid LLMs

  • VisionCoder: Multi agent framework with team leader, module leader, function coordinator and development group.
  • Identifies two key aspects for the agent definitions: structural (explains the agent's place in the overall structure/scope/responsibilities) and functional (operational steps/reasoning path expected from the agent and the output format requirements).
  • Includes bi-directional workflow: hierarchical tasks are divided into smaller units (forward task flow) and then restored back (backward task flow) from smaller pieces to larger units. Pair programming-concept includes coder and tester: coder produces code, tester reviews it and then the roles are reversed. The pair programming step is repeated for three rounds, with code execution and error messages incorporated, to get the final working code.

Designing LLM-Agents with Personalities: A Psychometric Approach

  • Reviews creation of psychometrically sound LLM-based agents based on the theory of the Big Five personality traits (openness/conscientiousness/extraversion/agreeableness/neuroticism).

FISHNET: Financial Intelligence from Sub-querying, Harmonizing, Neural-Conditioning, Expert Swarms, and Task Planning

  • FISHNET: Multi agent-framework for insights from SEC regulatory forms. Includes sub-querying (converts query into sub-queries)-, task planning-, experts (Swarm Intelligence)- and harmonizer (routes to specific expert based on embedding match vs. agent persona/tables description)-agents and long term memory.
  • Expert agents consist of: n-port-, n-mfp-, adv-, n-cen-, n-csrv- and 13f-agents, which are experts in different forms related to SEC regulations.

AGENT-CQ: Automatic Generation and Evaluation of Clarifying Questions for Conversational Search with LLMs

  • Agent-CQ: Introduces a framework for generating and evaluating conversational search questions and answers. Includes generation (question generation / filtering / answer generation)- and evaluation (multiple LLM-judge calls to review generated questions/answers)-stages.

EDGE: Enhanced Grounded GUI Understanding with Enriched Multi-Granularity Synthetic Data

  • EDGE: Introduces framework to generate training data for GUI-tasks in the internet. Introduces element- and action-grounding.

Investigating the Role of Prompting and External Tools in Hallucination Rates of Large Language Models

  • Investigates prompting techniques and finds simpler is often better and best prompts are problem specific.
  • In math problems, self-consistency with majority vote works well; Chat Protect helps to limit the number of hallucinated answers; and Self-Verification worked well with MMLU.
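
Self-consistency with majority vote, mentioned above, can be sketched in a few lines. This is a minimal sketch that assumes the sampled answers are short strings that can be compared after trimming whitespace and lowercasing; the agreement ratio is an added convenience, not something the paper is stated to report.

```python
from collections import Counter
from typing import List, Tuple

def self_consistency(answers: List[str]) -> Tuple[str, float]:
    """Return the most common answer and the share of samples that agree with it."""
    normalized = [answer.strip().lower() for answer in answers]
    winner, count = Counter(normalized).most_common(1)[0]
    return winner, count / len(normalized)
```

In practice the list would hold the final answers of several independently sampled reasoning chains for the same question, and a low agreement ratio can serve as a signal that the answer is unreliable.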

AgentSense: Benchmarking Social Intelligence of Language Agents through Interactive Scenarios

  • AgentSense-benchmark: introduces a multi-turn evaluation of LLM-agents regarding social intelligence. Focuses on goal competition and implicit reasoning.
  • Character-info includes: attributes/relationships/rules of replacement. Scenarios include: background/characters/social goals/private info.
  • Includes a sample agent-prompt.

24th of October 2024

Unbounded: A Generative Infinite Game of Character Life Simulation

  • Unbounded: Introduces a conceptual and technical implementation of concept called "generative infinite game".
  • Addresses semantically aligned, consistent environment/characters.
  • Trained an LLM-based game engine (generating coherent and real-time game mechanisms, narratives and contextual character responses) and a "Regional IP-Adapter", which creates visually consistent characters/environments between multiple images while applying creativity. The Regional IP-Adapter tracks changes over time, so if your character gets injured in a forest, the injury remains in the following images and the character still wears the same clothes, while giving creative touches to the visuals.

OSCAR: Operating System Control via State-Aware Reasoning and Re-Planning

  • OSCAR: Introduces GUI-agent with unified control interfaces / GUI grounding (dual grounding) / exploration-based simulation and re-planning (task driven replanning of only specific tasks).
  • Works both in smartphones and desktop OS. Reviews GUI agents. Includes system prompts.
  • Agent states include: init/observe/plan/execute/error/verify/fail/success/reset. Includes context memory.

Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs

  • Skywork-Reward: introduces methods to enhance reward modeling for LLMs, focusing on data-centric techniques.
  • It proposes data selection and filtering strategies for high-quality preference datasets, resulting in Skywork-Reward data collection, and develops Skywork-Reward model series including Skywork-Reward-Gemma-27B and Skywork-Reward-Llama-3.1-8B.
  • This work enhances performance of top-ranked models on RewardBench, highlighting practical impact in preference learning applications.

PDL: A Declarative Prompt Programming Language

  • PDL (Prompt Declarative Language): Introduces a declarative and data-oriented language based on YAML to construct LLM prompt programs. Every PDL program is a valid YAML-document with PDL-schema.

From a Tiny Slip to a Giant Leap: An LLM-Based Simulation for Fake News Evolution

  • FUSE (Fake news evolUtion Simulation framEwork): Reviews how true news turns into fake news with LLMs. Includes LLM-based agents: spreaders/commentators/verifiers/bystanders.
  • The simulation evolves with a module called News Evolution Simulator.
  • Includes content deviation metrics.

PRACT: Optimizing Principled Reasoning and Acting of LLM Agent

  • PRAct (Principled Reasoning and Acting)-framework: improves action understanding of agents by including action principles. Introduces RPO (Reflective Principle Optimization).

23rd of October 2024

ASYNCHRONOUS RLHF: FASTER AND MORE EFFICIENT OFF-POLICY RL FOR LANGUAGE MODELS

  • Asynchronous RLHF (Reinforcement Learning from Human Feedback): A framework that separates generation and learning in RLHF, enabling asynchronous generation of new samples while simultaneously training on old samples.
  • The approach is online but off-policy. It gives faster training and more compute-optimal scaling, training LLaMA 3.1 8B on an instruction-following task 40% faster while matching final performance.
  • This framework addresses the computational inefficiency of the dominant paradigm for RL finetuning of LLMs by separating generation and learning, leading to faster training and more efficient use of resources.
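
The separation of generation and learning described above can be sketched as a loop in which training always consumes the batch produced one step earlier. This is a minimal, single-process sketch: `generate` and `train` are hypothetical placeholders, and a real system would run them concurrently on separate devices and update actual model weights.

```python
def generate(policy_version, batch_size=2):
    """Placeholder sampler: tags each sample with the policy version that produced it."""
    return [{"policy_version": policy_version, "text": f"sample-{policy_version}-{i}"}
            for i in range(batch_size)]

def train(policy_version, batch):
    """Placeholder update: a real step would score the batch and apply gradients."""
    return policy_version + 1

def async_rlhf(num_steps):
    policy_version = 0
    pending = generate(policy_version)  # warm-up batch from the initial policy
    staleness = []
    for _ in range(num_steps):
        # These two calls are what an asynchronous system overlaps in time.
        next_batch = generate(policy_version)
        staleness.append(policy_version - pending[0]["policy_version"])
        policy_version = train(policy_version, pending)
        pending = next_batch
    return policy_version, staleness
```

After the warm-up step, every training batch is exactly one policy version old. That lag is what makes the method off-policy while staying online.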

GraphTeam: Facilitating Large Language Model-based Graph Analysis via Multi-Agent Collaboration

  • GraphTeam: LLM-based collaborative multi agent and graph-based system using three modules: input-output normalization/external knowledge retrieval/problem solving.
  • Includes question(reformats question)/search/coding/reasoning/answer-agents.
  • Constructs two knowledge graphs: documentation and experience.

Real-World Robot Applications of Foundation Models: A Review

  • This paper provides an overview of the practical application of foundation models in real-world robotics.
  • The review emphasizes the replacement of specific components within existing robot systems, input-output relationships, perception, motion planning, and control.
  • The paper concludes with a discussion of future challenges and implications for practical robot applications.

MiniFed : Integrating LLM-based Agentic-Workflow for Simulating FOMC Meeting

  • MiniFed: Simulates real world Federal Reserve FOMC-meetings using LLM-agent based multi-agent framework.
  • Consists of initialization/data collection/simulation/decision making/evaluation.

Guide for Defense (G4D): Dynamic Guidance for Robust and Balanced Defense in Large Language Models

  • G4D (Guide for Defense): LLM-based multi agent defense framework against jailbreaks, which uses external knowledge to determine whether the user intent is safe.
  • Includes intention detector (intention extraction, key entities identification and information retrieval)/question paraphraser/safety analyzer-components.

An Intelligent Agentic System for Complex Image Restoration Problems

  • AgenticIR: VLM/LLM-agent based image restoration using perception/scheduling/reflection/rescheduling/execution-agents.
  • Includes Rollback-mechanism, where the agent returns to the previous working stage when an issue occurs.
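
A rollback mechanism of this kind can be sketched as keeping a checkpoint and accepting a tool's output only if it does not make things worse. This is a minimal sketch under stated assumptions: the image is a toy dictionary of degradation levels, and `TOOLS` and `quality` are hypothetical stand-ins for restoration models and a perception-based quality check.

```python
def restore(image, plan, tools, quality):
    """Apply tools in order; roll back to the last good checkpoint on regression."""
    checkpoint = image
    history = []
    for name in plan:
        candidate = tools[name](checkpoint)
        if quality(candidate) >= quality(checkpoint):
            checkpoint = candidate              # accept and move the checkpoint forward
            history.append((name, "kept"))
        else:
            history.append((name, "rolled back"))  # discard and keep the previous stage
    return checkpoint, history

# Toy tools operating on degradation levels (lower is better).
TOOLS = {
    "denoise": lambda img: {**img, "noise": 0},
    "oversharpen": lambda img: {"noise": img["noise"] + 5, "blur": img["blur"] - 1},
    "deblur": lambda img: {**img, "blur": 0},
}

def quality(img):
    return -(img["noise"] + img["blur"])
```

Each tool returns a new dictionary instead of mutating its input, so the checkpoint stays intact and rolling back is simply a matter of not replacing it.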

ReflecTool: Towards Reflection-Aware Tool-Augmented Clinical Agents

  • ReflecTool: Introduces clinical agent, using progressively built long-term memory to assist domain-specific tool selection and improve tool usage. Includes optimization and inference stages.

Navigate Complex Physical Worlds via Geometrically Constrained LLM

  • Reviews the capability of LLMs to reconstruct the physical world from textual knowledge.
  • Uses LLM-based multi agent framework with scenery designer/object designer/object manufacturer/arranger-agents, a geometric constraint solver and a genetic algorithm.

21st of October 2024

Long Term Memory: The Foundation of AI Self-Evolution

  • Reviews and defines AI Self-Evolution-capability and Long Term Memory (LTM).
  • Identifies benefits in Personalized Models.
  • Identifies limitations in prompt-based memory mechanisms.

Improving Parallel Program Performance Through DSL-Driven Code Generation with LLM Optimizers

  • Designs a Domain Specific Language (DSL) for generating mappers, which assign computations and memory to processors such as GPUs and CPUs.
  • The DSL helps to manage high-level inference decisions without interacting with the low-level C++ code APIs.

20th of October 2024

Redefining Proactivity for Information Seeking Dialogue

  • Introduces Information Seeking Dialogue (ISD) agents with proactiveness to include information relevant to the user query.
  • Introduces new prompting strategies: 3-step CoT and 3-in-1 CoT.

18th of October 2024

Teaching Models to Balance Resisting and Accepting Persuasion

  • PBT (Persuasion Balanced Training): Uses multi-agent recursive dialogue trees to train models with preference optimization to accept persuasion in acceptable situations. PBT-trained models perform better in multi-agent debates.
  • Agents argue based on logical reasoning/emotional appeal/established credibility.
  • Refers to research by Woolley et al. (2010), where group intelligence is argued to be driven by diversity/turn-taking/social sensitivity, rather than individual intelligence.

Make LLMs better zero-shot reasoners: Structure-orientated autonomous reasoning

  • SARA (Structure-oriented Autonomous Reasoning Agents): Introduces multi agent LLM-based reasoning framework with structure-oriented analysis by refinement and RAG.
  • In some cases outperforms few-shot learning.
  • Includes reason (structure-oriented analysis)-, retrieval-, refinement-agents and shared memory. Includes prompts used.

AI can help humans find common ground in democratic deliberation

  • Habermas Machine: AI mediation technique promoting fair/inclusive debate.
  • LLM-agent opinions/critiques refine group statement to maximize group approval.
  • Aims to improve collective decision making in political discussion/conflict resolution.

17th of October 2024

Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation

  • Proposes a World-Model-Augmented (WMA) web agent, which simulates planned actions to obtain their outcomes before executing them (metacognitive monitoring) in order to avoid erroneous moves. Reviews the lack of capability in LLMs to avoid errors that humans easily avoid by possessing a world model.
  • Introduces "Transition-focused observation abstraction": the world model generates free-form descriptions of the important state differences before / after an action. The agent simulates the outcome of each possible action with the world model and a reward model assesses each one.
  • Includes prompts.

Chain of Ideas: Revolutionizing Research in Novel Idea Development with LLM Agents

  • CoI (Chain-of-Ideas): CoI-agent generates research ideas comparable to human-level by organizing literature in a chain structure to avoid logical inconsistencies in ideation.
  • Improves LLMs' research ideation capabilities. Consists of three steps: CoI-construction (identifies current trends), Idea generation (consolidates ideas) and Experiment design (final experiment design).
  • CoI-prompts include: converting topic into search query for literature retrieval/evaluation of paper relevance to the topic/extracting research paper ideas, experiments, entities and references/summarising trends of this CoI.
  • Idea generation prompts include: predict future trends / generate ideas / novelty check of ideas.
  • Experiment design prompts include: generate experiment design / review experiment design / obtain queries to edit experiment design / refine experiment design.

AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents

  • AgentOccam: Refines LLM-agent observation/action space to improve its performance in web tasks with three methods. Sets SOTA in WebArena.
  • Introduces planning actions: branching and pruning. Minimizes trivial interaction space. Removes unnecessary web content.
  • Agent prompt includes general instructions (task description/output specification/action specification) and Online Task Information.
  • Simplifies web content/selectively replays web elements/selectively replays past pages.

AdaSwitch: Adaptive Switching between Small and Large Agents for Effective Cloud-Local Collaborative Learning

  • AdaSwitch: Uses local agents for basic and cloud agent for complex tasks.
  • Includes self-practicing, collaborative examination and reflective learning steps.

Harnessing Webpage UIs for Text-Rich Visual Understanding

  • Introduces MultiUI-dataset of 1 million websites for web / UI agents.

Rapid and Automated Alloy Design with Graph Neural Network-Powered LLM-Driven Multi-Agent Systems

  • Multi-agent system including LLMs, AI agents (multi modal LLM-agents) and GNNs to discover automatically new metallic alloys.
  • The LLM-agent roles include: planner-, executor-, coder-, reviewer- and multi-modal-agents.

A Comparative Study on Reasoning Patterns of OpenAI's o1 Model

  • Reviews o1-model against other test-time compute methods like BoN/Self-Refine/Agent workflow.
  • Identifies 6 reasoning patterns with o1-model: systematic analysis/method reuse/divide & conquer / self-refinement / context identification / emphasizing constraints.

MeNTi: Bridging Medical Calculator and LLM Agent with Nested Tool Calling

  • MeNTi-framework chooses the appropriate meta-tool, fills in data according to the meta-tool documentation and verifies task completion with nested calling.

Integrating Large Language Models and Reinforcement Learning for Non-Linear Reasoning

  • RL guides LLM's exploration. The architecture includes: LLM-module/validation module/reasoning tree/RL agent. Applied in code generation.
  • LLM module generates n-candidates, validation module reviews characteristics of each candidate, the features of each review are added to reasoning tree and finally RL explores this reasoning tree to decide the node to explore next.

Metacognitive Monitoring: A Human Ability Beyond Generative Artificial Intelligence

  • Reviews metacognitive monitoring abilities of LLMs.

RescueADI: Adaptive Disaster Interpretation in Remote Sensing Images with Autonomous Agents

  • ADI (Adaptive Disaster Interpretation)-framework: introduces multimodal LLM-agents that interpret disaster scenarios using tools. Introduces RescueADI-dataset.
  • ADI-framework includes perception/recognition/planning/tools-modules.

16th of October 2024

Revealing the Barriers of Language Agents in Planning

  • Reviews planning capabilities of LLMs and identifies current models like o1 only achieve 15.6% performance in real-world tasks.
  • Identifies two core issues: interpretation of constraints/loss of focus in long-horizon planning tasks.
  • Episodic and parametric memory help, but do not resolve the lack of planning capabilities.

Graph-constrained Reasoning: Faithful Reasoning on Knowledge Graphs with Large Language Models

  • GCR (Graph-Constrained Reasoning): Integrates Knowledge Graph (KG) into LLM decoding to reduce hallucinations in reasoning.
  • Uses KG-Trie method.
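
Constraining decoding with a trie built from knowledge graph paths can be sketched as follows. This is a minimal sketch, not the paper's implementation: whole entities and relations stand in for tokens, the `<end>` marker and the toy `score` function are assumptions, and a real system would mask the LLM's token logits instead of calling a Python scorer.

```python
END = "<end>"  # assumed marker for "a valid path stops here"

def build_trie(paths):
    """Nest dictionaries so that every root-to-END walk is a valid KG path."""
    root = {}
    for path in paths:
        node = root
        for token in path:
            node = node.setdefault(token, {})
        node[END] = {}
    return root

def allowed_next(trie, prefix):
    """Tokens that keep the prefix on some valid path (empty set if it left the graph)."""
    node = trie
    for token in prefix:
        if token not in node:
            return set()
        node = node[token]
    return set(node)

def constrained_decode(trie, score):
    """Greedy decoding in which only trie-approved tokens are ever considered."""
    prefix = []
    while True:
        options = sorted(allowed_next(trie, prefix))
        best = max(options, key=lambda token: score(prefix, token))
        if best == END:
            return prefix
        prefix.append(best)

PATHS = [["Alice", "born_in", "Paris"], ["Alice", "works_at", "Acme"],
         ["Paris", "capital_of", "France"]]
```

Since the decoder can only pick continuations that exist in the trie, every generated reasoning path corresponds to an actual path in the graph, which is how this style of decoding limits hallucinated facts.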

Evaluating Software Development Agents: Patch Patterns, Code Quality, and Issue Complexity in Real-World GitHub Scenarios

  • Reviews the ability of LLM-agents to patch code and suggests that patching is easier for LLM-agents when it is split into smaller sub-tasks.

JudgeBench: A Benchmark for Evaluating LLM-based Judges

  • JudgeBench-benchmark: Evaluates LLM-judge agents, which focuses on instruction following/factuality/logic/style.

SAC-GLAM: Improving Online RL for LLM agents with Soft Actor-Critic and Hindsight Relabeling

  • SAC-GLAM: Proposes more autonomous LLM-agents based on an adaptation of SAC (Soft Actor-Critic) and HER (Hindsight Experience Replay) for LLM-agents in a multi-goal RL environment to perform sequential decision making tasks.
  • Reviews LLM-agents moving from external objective driven towards more autotelic ("self" + "goals") with an intrinsic purpose rather than extrinsic.

Robust RL with LLM-Driven Data Synthesis and Policy Adaptation for Autonomous Driving

  • RAPID: Improves RL performance in autonomous driving with LLM-reasoning. Uses LLM-agent data for offline RL distillation and then adapts online RL-agent with LLM-data.

Enhancing LLM Trading Performance with Fact-Subjectivity Aware Reasoning

  • FS-Reasoning Agent: introduces LLM-based multi-agent trading framework by splitting reasoning processes between factual and subjective reasoning.
  • Includes Statistics/Fact reasoning/Fact/Subjectivity/Subjectivity reasoning/Trading/Reflection agents.
  • Concludes that a superior LLM alone is not sufficient to guarantee outperforming multi-step reasoning.

MedAide: Towards an Omni Medical Aide via Specialized LLM-based Multi-Agent Collaboration

  • MedAide: Introduces LLM-based multi-agent framework, which includes query input/query rewriting/intent recognition/agent collaboration.
  • Activates specialised agents (own prompt template) dynamically by recognizing intent.
  • Includes contextual encoder.

Aegis: An Advanced LLM-Based Multi-Agent for Intelligent Functional Safety Engineering

  • Aegis: LLM-based multi-agent framework for FSRs (Functional Safety Requirements) and HARA (Hazard Analysis and Risk Assessment).

15th of October 2024

G-Designer: Architecting Multi-agent Communication Topologies via Graph Neural Networks

  • G-Designer: introduces a designer of multi-agent LLM-graphs based on MACP. Includes Materials/Construct/Design/Optimize-steps.
  • Proposes an LLM-agent communication protocol for multi-agent systems called MACP. MACP covers performance/adaptability/robustness.

AGENTiGraph: An Interactive Knowledge Graph Platform for LLM-based Chatbots Utilizing Private Data

  • AGENTiGraph (Adaptive Generative ENgine for Task-based Interaction and Graphical Representation): LLM-based multi-agent knowledge management framework with knowledge graphs.
  • Includes knowledge extraction/integration/real-time visualization.
  • Dynamically interprets user intent/manage tasks/integrate new knowledge. Classifies tasks. Extracts key concepts. Constructs knowledge graphs. Includes prompts used.

Revisiting Benchmark and Assessment: An Agent-based Exploratory Dynamic Evaluation Framework for LLMs

  • TestAgent-framework: quantitative/qualitative benchmark using agent-based evaluation with RL, multi-turn interaction from knowledge base/topics of interests.

14th of October 2024

AFlow: Automating Agentic Workflow Generation

  • AFlow: Optimises LLM-agent workflow with MCTS.
  • Includes search space (nodes, operators, code-represented edges), search via AFlow and search result (math, Q&A and code generation workflows).

10th of October 2024

Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning

  • PAVs (Process Advantage Verifiers): is a framework that trains verifiers to predict progress in multi-step reasoning by measuring the change in likelihood of a correct response under a prover policy.
  • PAVs improve exploration during test-time search and online RL, using complementary prover policies, and are more compute-efficient than ORMs.
  • This framework enables more efficient and accurate reasoning in large language models by providing a better way to measure progress in multi-step reasoning.
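
The "progress" signal described above can be sketched as the step-wise change in estimated success probability. This is a minimal sketch, assuming the success probabilities under the prover policy have already been estimated (for example by rollouts); the function name and input format are hypothetical.

```python
from typing import List


def process_advantages(success_prob: List[float]) -> List[float]:
    """Per-step progress, measured as the change in estimated success probability.

    success_prob[i] is the estimated probability of reaching a correct final
    answer under the prover policy after the first i reasoning steps, so
    index 0 corresponds to the empty prefix.
    """
    if len(success_prob) < 2:
        return []
    return [after - before for before, after in zip(success_prob, success_prob[1:])]


# Example: the second step lowers the chance of success, the third raises it again.
advantages = process_advantages([0.2, 0.5, 0.4, 0.9])
```

A positive value marks a step that made a correct final answer more likely, a negative value marks a step that made it less likely.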

Multi-Agent Collaborative Data Selection for Efficient LLM Pretraining

  • Introduces LLM-based multi-agent system for efficient LLM pretraining data selection. LLM converges faster in the pretraining and the method improves LLM output quality.
  • The Data console dynamically integrates data insights from the different agents during the training process.
  • Agent console includes quality/domain/topic-agents. Also includes memory.

Optima: Optimizing Effectiveness and Efficiency for LLM-Based Multi-Agent System

  • Optima (OPTImising effectiveness and efficiency for LLM-based Multi-Agent systems): Introduces framework to train LLM-based multi-agent system (MAS).
  • Includes 4 iterative steps: Generate/Rank/Select/Train.
  • Investigates scaling laws of inference compute.
  • Optima helps to make LLMs highly efficient conversationalists.

DelTA: An Online Document-Level Translation Agent Based on Multi-Level Memory

  • DelTA (Document-level Translation Agent): Introduces translation LLM-agent using multi-layer memory components to improve translation consistency/quality.
  • Memory components include: Proper noun memory(to apply correct terminology)/Bilingual summary/long-term/short-term-memory units.
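
A proper noun memory of the kind listed above can be sketched as a small lookup table. This is only an illustration: the class name, the substring matching and the "first translation wins" rule are assumptions made here, not details confirmed by the summary.

```python
from typing import Dict


class ProperNounMemory:
    """Keeps one target-language rendering per source-language proper noun."""

    def __init__(self) -> None:
        self._translations: Dict[str, str] = {}

    def record(self, source: str, target: str) -> None:
        # Assumption: the first recorded translation is kept for consistency.
        self._translations.setdefault(source, target)

    def lookup(self, sentence: str) -> Dict[str, str]:
        """Return the stored translations for proper nouns found in the sentence."""
        return {s: t for s, t in self._translations.items() if s in sentence}
```

The result of `lookup` could be placed in the translation prompt so that the same name is rendered the same way across the whole document.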

Mars: Situated Inductive Reasoning in an Open-World Environment

  • Mars: Introduces a benchmark for Situated Inductive Reasoning and an LLM-agent framework called IfR (Induction from Reflection).
  • The paper identifies two critical components for inductive reasoning: situatedness (situational context) and abstractiveness (abstract conclusions).
  • IfR-framework includes task proposer/planner/controller/reflection-steps, rule library (when this, do that) and skill library. The LLM-based reflection-step induces new rules, which current LLMs still struggle with.

Benchmarking Agentic Workflow Generation

  • Introduces WorFBench-benchmark for unified workflow generation and WorFEval evaluation protocol of workflows for LLM-agents.

9th of October 2024

AgentBank: Towards Generalized LLM Agents via Fine-Tuning on 50000+ Interaction Trajectories

  • Samoyed: Introduces LLM-models fine-tuned with AgentBank-dataset for general agent tasks.
  • AgentBank-dataset includes dimensions: reasoning/math/programming/web/embodied AI.

Smart Audit System Empowered by LLM

  • Introduces Smart Audit System with LLMs, which include dynamic risk assessment model/manufacturing compliance copilot/Commonality analysis agent. Developed by Apple researchers.
  • Dynamic risk assessment model adjusts audit: focus/sample size/critical items/resource allocation.
  • Manufacturing compliance copilot self-adjusts its knowledge base with new information.
  • Commonality analysis agent manages an autonomous agent conducting real-time analysis to custom requests, in order to drive supplier improvements. Includes planning/memory/tools/selecting and usage of tools/generating responses.

Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making

  • Introduces Embodied Agent Interface-benchmark for embodied decision making LLM-agents.
  • Reviews four critical capabilities: Goal interpretation, Subgoal decomposition, Action sequencing and Transition modelling.

I Want to Break Free! Anti-Social Behavior and Persuasion Ability of LLMs in Multi-Agent Settings with Social Hierarchy

  • zAImbardo-framework: Introduces LLM-agent simulation between prisoner/guard-agents using prompts, which are either shared or private.
  • Shared prompts: communication rules/environment description/research oversight/risks. Private prompts: Starting prompt/personality/goals.

Towards Realistic UAV Vision-Language Navigation: Platform, Benchmark, and Methodology

  • Introduces UAV navigation agent using MLLM. Includes three levels of assistants: constant/difficult situations/hazard situations.

MOOSE-Chem: Large Language Models for Rediscovering Unseen Chemistry Scientific Hypotheses

  • Moose-Chem: multi-agent framework to discover novel chemistry research hypotheses from given information.

Seeker: Enhancing Exception Handling in Code with LLM-based Multi-Agent Approach

  • Seeker: introduces LLM-based multi-agent framework for exception handling with planner/detector/predator/ranker/handler-agents.

ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents

  • ST-WebAgentBench-benchmark: Evaluates safety and trustworthiness of web agents against performing undesired operations in business/user applications.

Do great minds think alike? Investigating Human-AI Complementarity in Question Answering with CAIMIRA

  • CAIMIRA (Content-Aware, Identifiable, Multidimensional, Item Response Analysis)-framework: Reviews differences between humans and SOTA-level LLMs in QA-tasks in reasoning and textual understanding.

8th of October 2024

AgentSquare: Automatic LLM Agent Search in Modular Design Space

  • AgentSquare: Introduces modular LLM-agent framework using module evolution, recombination and performance predictor (skips unpromising agent designs). The framework optimizes agent designs with Planning/Reasoning/Tool use/Memory-modules.
  • Introduces the research concept of MoLAS (Modularized LLM Agent Search): the automatic optimization of LLM-agent designs from successful designs.
  • Includes search-, program-level search- and performance predictor-meta prompts.

7th of October 2024

LLMs Are In-Context Reinforcement Learners

  • In-Context Reinforcement Learning (ICRL): Introduces ICRL-algorithm (increases test-time compute), which effectively learns reward from a classification task. The explorative-version concentrates on positive episodes and stochasticity.
  • Naive ICRL explores poorly.

Scalable and Accurate Graph Reasoning with LLM-based Multi-Agents

  • GraphAgent-Reasoner (GAR): explicit and precise graph-reasoning with multi-agent collaboration.
  • Works to solve real-world graph-reasoning tasks such as webpage ranking.
  • Distributes tasks into nodes (over 1000) to multiple agents collaborating between each other.
  • Includes stages: Algorithmic establishment (retrieve/initialisation/adjust/design), Distributed execution (Master LLM assigns task, agent network communicates) and Master summarisation (termination/aggregation/conclusion).
  • Master LLM defines for each problem 6 components: State/Message/Initialization/Send/Update/Termination.
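
The six components named above (State/Message/Initialization/Send/Update/Termination) map naturally onto distributed message passing. The sketch below illustrates this with single-source shortest paths; it is an assumption-laden illustration in which node agents are plain Python logic rather than LLM agents, and the function name and graph format are hypothetical.

```python
from typing import Dict, List, Tuple

# node -> [(neighbor, edge weight)]; every node must appear as a key,
# and edge weights are assumed to be non-negative.
Graph = Dict[str, List[Tuple[str, float]]]


def shortest_paths(graph: Graph, source: str, max_rounds: int = 1000) -> Dict[str, float]:
    # Initialization: each node's State is its best-known distance from the source.
    state = {node: float("inf") for node in graph}
    state[source] = 0.0
    changed = {source}  # nodes whose State changed in the previous round

    for _ in range(max_rounds):
        # Termination: stop once no node changed its State in the last round.
        if not changed:
            break
        # Send: each changed node sends a Message (candidate distance) to its neighbors.
        inbox: Dict[str, List[float]] = {}
        for node in changed:
            for neighbor, weight in graph[node]:
                inbox.setdefault(neighbor, []).append(state[node] + weight)
        # Update: each receiving node keeps the smallest distance it has seen.
        changed = set()
        for node, messages in inbox.items():
            best = min(messages)
            if best < state[node]:
                state[node] = best
                changed.add(node)
    return state
```

Each node only reads its own state and its incoming messages, which is what allows the work to be spread across many agents.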

Grounding Partially-Defined Events in Multimodal Data

  • Reviews event extraction from unstructured video data using multimodal event analysis with LLMs.

GLEE: A Unified Framework and Benchmark for Language-based Economic Environments

  • Introduces GLEE (Games in Language-based Economic Environments)-benchmark, which reviews LLMs in two-player economic game families of bargaining, negotiation and persuasion.

26th of September 2024

AssistantX: An LLM-Powered Proactive Assistant in Collaborative Human-Populated Environment

  • AssistantX: multi LLM-agent framework (PPDR4X) to help users achieve goals in virtual / physical environments.
  • PPDR4X-framework includes short term memory (initial instructions/dialogue data/agent thoughts/cyber tasks/real world tasks), long-term memory (environment information), perception-agent, planning-agent, reflection agent and decision agent.

Control Industrial Automation System with Large Language Models

  • Introduces multi LLM-agent industrial control system, which consists of summarizer-, manager- (planning level), event log manager-, operator-agents (control-level) and command line/event log memory/prompt templates/events/function calls.

Compositional Hardness of Code in Large Language Models -- A Probabilistic Perspective

  • Reviews the difficulty of processing multiple sub-tasks within a single LLM call with ICL to produce a correct solution, which is called "In-Context Hardness of Composition".
  • Refers to a new term called "Screening", which refers to an LLM's capacity to isolate the relevant context. For example, an LLM with the capacity to perform two tasks separately may fail to perform both within the same context.
  • Finds that it is better to distribute tasks to multiple LLM-agents when the task becomes complex. Offers a literature review of the intersection of CoT problem solving and agents-research.

25th of September 2024

Turn Every Application into an Agent: Towards Efficient Human-Agent-Computer Interaction with API-First LLM-Based Agents

  • AXIS: Prioritizes task-completing API-calls above UI-agent actions, which decreases task completion time and cognitive workload.
  • It is more useful to generate an efficient API-call agent using a programmatic API than a slower human-like UI agent.
  • Includes Explorer-, Follower-, Monitor-, Generator-, Evaluator- and Translator-agents.
  • Enables converting any application, with basic API/documentation and: environment state interface/basic action interface, into agent. Uses self-exploratory framework to identify control elements.

A Roadmap for Embodied and Social Grounding in LLMs

  • Reviews the grounding of LLMs with the physical world. Highlights the importance of social grounding of physical experiences. For example, a child can build an understanding of heavy objects just by observing an adult trying to lift a heavy box.
  • Offers interesting ideas about how human perception works in the physical world.

Plurals: A System for Guiding LLMs Via Simulated Social Ensembles

  • Introduces Plurals-framework: generates diverse agents (stakeholders) based on demographic data, which exchange diverse opinions using a structured debate and a moderator.
  • The demographic data is basis for generating the agents, which helps to tune the messages to specific audiences.
  • Includes Structures, which forces LLM-agents to share information with a properly formed structure.
  • Moderator-agent then summarises this discussion by trying to take into account the diverse opinions.

Language Grounded Multi-agent Communication for Ad-hoc Teamwork

  • Grounds MARL agent communication with LLM-generated synthetic data, which improves communication and zero-shot collaboration between agents.

24th of September 2024

Synatra: Turning Indirect Knowledge into Direct Demonstrations for Digital Agents at Scale

  • Synatra: is an approach that transforms indirect knowledge into direct supervision for digital agents at scale.
  • Synatra leverages LLMs to repurpose human-created tutorials and ungrounded observations into executable action sequences, and includes a 7B CodeLlama model.
  • This framework enables more effective and cheaper training of digital agents compared to human demonstrations.

MOSS: ENABLING CODE-DRIVEN EVOLUTION AND CONTEXT MANAGEMENT FOR AI AGENTS

  • MOSS (LLM-oriented Operating System Simulation): is a framework integrating code generation with a dynamic context management system.
  • MOSS uses Inversion of Control (IoC) container, decorators, maintains Python context, isolates local variables, preserves runtime integrity, and enables code-driven evolution.
  • This framework enhances efficiency and capabilities of AI agent development, moving towards Turing-complete agents.


23rd of September 2024

ERABAL: Enhancing Role-Playing Agents through Boundary-Aware Learning

  • ERABAL: Introduces boundary-aware role-playing framework to maintain role consistency in multi-turn conversation.
  • Includes dialogue planner/topic manager/question generator/response generator-agents.
  • Includes prompts for each agent.

22nd of September 2024

BACKTRACKING IMPROVES GENERATION SAFETY

  • Backtracking: is a technique that allows language models to "undo" and recover from their own unsafe generation through the introduction of a special [RESET] token.
  • Backtracking can be incorporated into either SFT or DPO training, provides protection against adversarial attacks, and improves safety without regression in helpfulness.
  • This method provides a new approach to improve language model safety by allowing models to recover from unsafe generations.
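
How a decoded generation containing the [RESET] token might be post-processed can be sketched as below. This is a minimal sketch, assuming that the text following the last [RESET] is what gets shown to the user; the helper name is hypothetical and the paper's actual decoding pipeline may differ.

```python
RESET_TOKEN = "[RESET]"


def apply_backtracking(generation: str, reset_token: str = RESET_TOKEN) -> str:
    """Discard everything generated before the last reset token, if there is one."""
    _, found, after = generation.rpartition(reset_token)
    # Without a reset token the generation is returned unchanged (apart from whitespace).
    return after.strip() if found else generation.strip()


final = apply_backtracking("Sure, the first step is [RESET] I can't help with that request.")
```

The model itself decides when to emit the token; the post-processing step only makes the "undo" visible in the final output.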

20th of September 2024

RRM: Robust Reward Model Training Mitigates Reward Hacking

  • RRM (Robust Reward Model): Reviews reward models' ability to distinguish genuine contextual signal from irrelevant information when deciding preference. Proposes the use of a causal graph.
  • Produces more robust reward model.

ChainBuddy: An AI Agent System for Generating LLM Pipelines

  • ChainBuddy: Includes requirements gathering agent (primary user goal/list of req./user preferences/suggested CoT strategy), planner agent (includes replanner), task-specific agents, connection agent and post-hoc reviewer agent.

Minstrel: Structural Prompt Generation with Multi-Agents Coordination for Non-AI Experts

  • Minstrel: a multi-agent framework for automated prompt optimization. Prompts are constructed using role, profile, constraints, goals, initialization and examples, workflow, skills, suggestions, background, style, output format and command modules.
  • Agents are assigned to working groups in charge of similar small tasks.

ShizishanGPT: An Agricultural Large Language Model Integrating Tools and Resources

  • ShizishanGPT: LLM agent for answering with agriculture-based RAG.

19th of September 2024

Training Language Models to Self-Correct via Reinforcement Learning

  • SCoRe (Self-Correct via Reinforcement Learning): Increases LLMs' capacity to self-correct via multi-turn Reinforcement Learning.
  • Is the first model to achieve positive intrinsic self-correction performance.

AutoVerus: Automated Proof Generation for Rust Code

  • AutoVerus: LLM generates correctness proofs for Rust-code using multi-agent framework (proof generation, refinement and debugging).

17th of September 2024

LLM-Agent-UMF: LLM-based Agent Unified Modeling Framework for Seamless Integration of Multi Active/Passive Core-Agents

  • LLM-agent UMF (Unified Modelling Framework): Introduces modular LLM-agent framework, which includes core agent coordinating with planning, memory, profile, action and security modules.
  • Proposes various multi agent frameworks.
  • Proposes active and passive information types.
  • Includes lots of useful ideas for each component.

NVLM: Open Frontier-Class Multimodal LLMs

  • NVLM: frontier-level VLM, which also retains high performance on text-only LLM tasks.
  • Finds that dataset quality and task diversity matter more than scale.
  • Finds positive transfer from image to text only modality.

P-RAG: Progressive Retrieval Augmented Generation For Planning on Embodied Everyday Task

  • P-RAG: Introduces iteratively updated RAG (self-iterations). P-RAG adds more task-specific knowledge.
  • The RAG stores the following information: goal instruction, scene graph, history and done.

EmPO: Emotion Grounding for Empathetic Response Generation through Preference Optimization

  • EmPO: Introduces the EmpatheticDialogues-dataset for fine tuning LLMs with empathic response generation (ERG).

16th of September 2024

Instigating Cooperation among LLM Agents Using Adaptive Information Modulation

  • SLA (Strategic LLM Agent): combines LLM agents (SLAs) and RL-agent called Pro-social Promoting Agent (PPA) to increase cooperation rate.
  • Dynamically adjusts access to SLA's information (cooperation history with neighbours, average) to facilitate social interaction.

Cognitive Kernel: An Open-source Agent System towards Generalist Autopilots

  • Cognitive Kernel: introduces autopilot-like LLM-agent with access to the internet through a web browser (appears to use Playwright-library) to interact in a "human-like" manner (click, scroll, etc).
  • The LLM agent interacts with user and task environment. Includes reasoning kernel, memory kernel and perception kernel.
  • LLM is fine tuned to interact with the environment through atomic actions, which a normal person could perform, rather than API call.
  • Offers interesting ideas for each sub-component, as each includes plenty of detailed functionalities.

Central Answer Modeling for an Embodied Multi-LLM System

  • CAM (Central Answering Model): Introduces CAM-framework, where instead of LLM-agent directly answering question, multiple LLM-agent instances generate answer and a central LLM-agent responds to the question.

15th of September 2024

RethinkMCTS: Refining Erroneous Thoughts in Monte Carlo Tree Search for Code Generation

  • RethinkMCTS: conducts thought-level searches before generating code and adds both verbal feedback to refine thoughts and code execution feedback from incorrect code.
  • Increasing the number of rethink- and rollout-operations improves code generation.

14th of September 2024

PeriGuru: A Peripheral Robotic Mobile App Operation Assistant based on GUI Image Understanding and Prompting with LLM

  • PeriGuru: LLM-agent for GUI with perception, decision and action steps.

Enhancing Decision-Making for LLM Agents via Step-Level Q-Value Models

  • Introduces task-relevant Q-value model for guiding action selection.
  • Includes review of the different methods to improve reasoning, such as LLMs using MCTS.
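
Guiding action selection with a step-level Q-value model can be sketched as picking the candidate with the highest estimated value. This is a minimal sketch, assuming the LLM proposes several candidate actions and a separate scorer is available; `select_action` and the `q_value` callable are hypothetical names.

```python
from typing import Callable, Sequence


def select_action(
    state: str,
    candidates: Sequence[str],
    q_value: Callable[[str, str], float],
) -> str:
    """Return the candidate action with the highest estimated Q-value in this state."""
    if not candidates:
        raise ValueError("at least one candidate action is required")
    return max(candidates, key=lambda action: q_value(state, action))


def toy_q(state: str, action: str) -> float:
    # Toy scorer for illustration only: prefers actions that mention the search box.
    return 1.0 if "search" in action else 0.0


best = select_action(
    "shopping homepage",
    ["click banner ad", "type query in search box", "scroll down"],
    toy_q,
)
```

In practice the scorer would be a learned model; the selection logic stays the same.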

13th of September 2024

Agents in Software Engineering: Survey, Landscape, and Vision

  • Introduces LLM-agents with perception, memory and actions for SW engineering. Includes multi-agent workflow with feedback, refinement and roles.
  • Actions include internal (reasoning, learning and retrieval) and external (digital environment, dialogue with human/agent).
  • Memory includes procedural, semantic and episodic.
  • Perception includes textual (UML, execution result, text/code), visual and auditory.
  • Includes good overview of different reasoning techniques for the CoT-action.

12th of September 2024

Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale

  • Navi: introduces a multi modal agent for Windows OS.
  • Processes screen information called SoM (Set of Marks) with multiple alternative methods: UIA (User Interface Automation) tree, parses DOM tree, uses proprietary OCR, icon/image detection and OmniParser-model.
  • Agent prompt includes: task instruction, description of action space, history of actions, clipboard content and thought-variable memory. The prompt also includes previous/current step screenshot with SoMs.
  • Introduced WindowsAgentArena-benchmark.
  • Includes the agent prompt.

11th of September 2024

Agent Workflow Memory

  • Agent Workflow Memory (AWM): LLM-agent retrieves and reuses reusable routines, which it extracts and generalises from past examples.
  • Consists of LLM, memory and environment state (action-observation).
  • Memory consists of: workflow description, workflow steps (environment state description, deduction process and action sequence). The memory-unit is described as text-based "system"-prompt.
  • Adds increasingly difficult workflows from previously acquired workflows and new experiences.
  • Uses previously learned skills in new settings. Eliminates workflow steps that are not required.
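
A text-based workflow memory along the lines described above can be sketched as follows. This is an illustrative sketch only: the class names are hypothetical, and word overlap is used as a simple stand-in for whatever retrieval the paper actually applies.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Workflow:
    description: str   # what the routine achieves
    steps: List[str]   # generalized action sequence


@dataclass
class WorkflowMemory:
    workflows: List[Workflow] = field(default_factory=list)

    def add(self, workflow: Workflow) -> None:
        self.workflows.append(workflow)

    def retrieve(self, task: str, top_k: int = 2) -> List[Workflow]:
        """Rank stored workflows by word overlap with the task description."""
        task_words = set(task.lower().split())
        scored = [
            (len(task_words & set(w.description.lower().split())), w)
            for w in self.workflows
        ]
        relevant = [pair for pair in scored if pair[0] > 0]
        relevant.sort(key=lambda pair: pair[0], reverse=True)
        return [w for _, w in relevant[:top_k]]

    def as_system_prompt(self, task: str) -> str:
        """Render the retrieved workflows as plain text for the system prompt."""
        blocks = []
        for w in self.retrieve(task):
            steps = "\n".join(f"{i}. {s}" for i, s in enumerate(w.steps, start=1))
            blocks.append(f"Workflow: {w.description}\n{steps}")
        return "\n\n".join(blocks)
```

New workflows induced from successful episodes would be appended with `add`, so later tasks can reuse them.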

10th of September 2024

Think-on-Process: Dynamic Process Generation for Collaborative Development of Multi-Agent System

  • ToP (Think-on-Process): Multi-agent LLM-framework, which generates SW development processes using experiential knowledge.
  • Each chat includes role assignment, memory stream and self-reflection.
  • ToP-framework includes: instance generating, llm enhancing, instance filtering and software developing.
  • Refers to concept of "Chat-chain", where multiple LLM-agents (CEO, CTO, CPO, Tester, Coder and Designer) operate.
  • Converts processes to process textual descriptions: process-to-text and finally to process textual description.

9th of September 2024

SciAgents: Automating scientific discovery through multi-agent intelligent graph reasoning

  • SciAgents: Multi-agent graph-reasoning LLM-framework with retrieval for scientific discovery.

8th of September 2024

Self-Reflection in LLM Agents: Effects on Problem-Solving Performance

  • Self-Reflection-Agents: Finds that self-reflection improves the performance of LLM agents across the 6 LLMs tested.
  • Self-reflections that contain more information (instructions, explanations, and solutions) perform better than self-reflections with less information.
  • Retry-agent significantly improves performance, which indicates that merely knowing a mistake was made improves the performance of the LLM.

5th of September 2024

Game On: Towards Language Models as RL Experimenters

  • Introduces RL experiment workflow using VLM (not fine-tuned) to perform tasks assigned typically to human experimenter.
  • The system monitors/analyses experiment progress, suggests new tasks, decomposes tasks and retrieves skills to execute. Does not automate
  • Enables an embodied autonomous agent to acquire new skills zero-shot.

From MOOC to MAIC: Reshaping Online Teaching and Learning through LLM-driven Agents

  • MAIC (Massively AI-empowered Course): Introduces multi LLM-agent system for scalable (like Massive Open Online Courses), but still adaptive (to personal needs / aptitudes) online education. Includes a few comments from students, which highlight the limits of its current approach.
  • Includes LLM-agents acting as teachers, students, assistants, managers, analysers and other agents. Teacher agents adjust style based on communication with the student. Human-student can select the style of AI-classmates.
  • Classroom environment includes current slide, dialogue history, class roles / course management. Course preparation includes read / plan stage, where slide content extraction, structure extraction, function generation and agent generation take place.

xLAM: A Family of Large Action Models to Empower AI Agent Systems

  • xLAM: Series (from 1B dense to 8x22B MoE) of Large Action Models (LAMs) for AI agent tasks. Achieves high performance in function calling.
  • Essentially fine-tunes an LLM (DeepSeek/Mistral models) into a LAM, which is able to perform highly accurate function calling.

4th of September 2024

Cog-GA: A Large Language Models-based Generative Agent for Vision-Language Navigation in Continuous Environments

  • Cog-GA (Cognitive-Generative Agent)-agent: Introduces Visual-Language Navigation (VLN)-agent in continuous environments with cognitive maps (spatial, temporal and semantic information) and reflection.
  • Includes instruction processor, high-level planner, waypoint predictor, memory stream (reflection memory/cognitive map), reflection generator and low-level actuator. Instructions are provided as text, panorama input image. Target waypoints are stored in the cognitive maps-memory.
  • Cognitive maps include spatial memories about scene descriptions and landmarks in time step.
  • Limits search space by employing dual-channel waypoint using information about the landmark objects (what) and spatial characteristics (where).

Configurable Foundation Models: Building LLMs from a Modular Perspective

  • Reviews modularity of LLMs. The idea is, instead of re-training an LLM from scratch, to add new knowledge as modules (called emergent bricks when pretrained and customised bricks when post-trained).
  • Identifies the following brick-operations: retrieval / routing, merging, updating and growing.

Large Language Model-Based Agents for Software Engineering: A Survey

  • Survey about SW engineering LLM-agents.

MoA is All You Need: Building LLM Research Team using Mixture of Agents

  • MoA (Mixture-of-Agents)-framework (name was already used before) is a framework with planner, aggregator and various LLM-agents, each with their own RAG, grouped together.

3rd of September 2024

Empirical evidence of Large Language Model's influence on human spoken communication

  • Provides empirical evidence that humans imitate LLMs.
  • Finds that LLMs reduce linguistic diversity. It remains an interesting open question whether LLMs only decrease diversity or also have other effects, and how content creation automation impacts society overall.

AgentRE: An Agent-Based Framework for Navigating Complex Information Landscapes in Relation Extraction

  • AgentRE: Relation Extraction (RE) agent includes three components: retrieval (static knowledge to help store/retrieve information), memory (dynamic knowledge: shallow memory for extraction results, deep memory for historical action summaries/reflections) and extraction modules (ReAct-based, pulls information based on retrieval and memory).
  • Avoids extracting for incomplete entities, such as phrases referring in general to Museums without being precise on the exact name of the museum.

Focus Agent: LLM-Powered Virtual Focus Group

  • Focus Agent: Simulates moderation of focus groups with human participants and alignment of focus agent opinions with this group.
  • Simulates planning, moderation, questions, discussion and reflection with LLM-agents.

2nd of September 2024

The Compressor-Retriever Architecture for Language Model OS

  • Compressor-Retriever-architecture: Introduces concept of stateful LLM OS by using only base model forward function to compress and retrieve context.
  • Reviews concept of LLM acting as a CPU and its context window acting as RAM.
  • Identifies life-long context as infinite, which is the core issue with current session-based interactions.
  • Compressor builds hierarchical db to save previously chunked context. The retriever searches relevant context.

1st of September 2024

Self-evolving Agents with reflective and memory-augmented abilities

  • SAGE: Introduces self-evolving LLM-agent consisting of user/assistant/checker-agents with iterative feedback, reflection and memory optimization (Ebbinghaus-forgetting curve).
  • Self-evolution includes adaptively adjusting strategies, optimizing information storage and transmission and reduction of cognitive context.
  • Mimics human brain / memory by creating MemorySyntax, which combines Ebbinghaus forgetting curve and linguistic knowledge.
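
The Ebbinghaus forgetting curve referenced above is commonly written as R = exp(-t / S), where t is the elapsed time and S the memory strength. The sketch below uses that common form to decide which memories to drop; the threshold value, the function names and the memory layout are assumptions for illustration, not the paper's MemorySyntax.

```python
import math
from typing import Dict, List, Tuple


def retention(elapsed: float, strength: float) -> float:
    """Ebbinghaus-style retention R = exp(-t / S), with t and S in the same time unit."""
    if strength <= 0:
        raise ValueError("strength must be positive")
    return math.exp(-elapsed / strength)


def forgettable(
    memories: Dict[str, Tuple[float, float]],
    now: float,
    threshold: float = 0.3,
) -> List[str]:
    """Return the keys of memories whose retention has fallen below the threshold.

    `memories` maps a key to (last_access_time, strength).
    """
    return [
        key
        for key, (last_access, strength) in memories.items()
        if retention(now - last_access, strength) < threshold
    ]
```

Memories that are accessed often could have their strength increased, which slows their decay and keeps them in context longer.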

LanguaShrink: Reducing Token Overhead with Psycholinguistics

  • LanguaShrink: Reduces prompt length (tokens to process) by optimising the prompt with psycholinguistic principles and the Ebbinghaus memory curve.
  • For example removes words like "usually" from the prompt, which add complexity, ambiguity, irrelevance etc.

30th of August 2024

Tool-Assisted Agent on SQL Inspection and Refinement in Real-World Scenarios

  • Tool-SQL: LLM-agent for SQL code inspection and fixing using retrieval and refinement.

29th of August 2024

Physics of Language Models: Part 2.2, How to Learn From Mistakes on Grade-School Math Problems

  • Learns to automatically retry after detecting error (Retry upon regret) in the LLM generation, which does not require additional self-verification prompting.
  • The model seeks to produce correct solutions even when up to half of the solution steps include errors, and only corrects itself in rare cases when making a mistake.
  • Indicates that error correction is a significantly different skill from pure error-free reasoning and requires weight updates beyond PEFT. Reports improved reasoning accuracy; masking errors is unnecessary, and models still output the shortest solutions.
  • Indicates that LLMs, at least in certain domains, often know when they have made a mistake; this can be detected with a simple linear classifier on top of the hidden states.
  • This work provides insights into how to effectively train language models to correct errors during reasoning tasks.

Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling

  • Suggests that fine-tuning LLMs on synthetic data from a weaker but cheaper LLM is more compute-optimal than using a stronger but more expensive LLM.
  • Compares data sampled from Gemini Pro 1.5 (more expensive, stronger) and Gemini Flash 1.5, using price per token as a proxy for compute cost.
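
The cost-matched comparison can be sketched with simple arithmetic: at a fixed budget, the cheaper model yields proportionally more samples. The prices below are hypothetical placeholders, not actual Gemini pricing, and the sketch assumes samples of equal token length.

```python
def samples_at_matched_cost(strong_samples: int, strong_price: float, weak_price: float) -> int:
    """Number of weak-model samples affordable for the cost of `strong_samples`.

    Prices are per sample (or per token, if samples have equal length).
    """
    if strong_price <= 0 or weak_price <= 0:
        raise ValueError("prices must be positive")
    budget = strong_samples * strong_price
    return int(budget // weak_price)


# Hypothetical prices: the strong model costs five times as much per sample.
weak_samples = samples_at_matched_cost(strong_samples=4, strong_price=10.0, weak_price=2.0)
```

The question the paper asks is whether the larger, noisier sample set from the weaker model yields a better fine-tuned reasoner than the smaller set from the stronger model.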

CogVLM2: Visual Language Models for Image and Video Understanding

  • Introduces CogVLM2-family of models: CogVLM2, CogVLM2-Video and GLM-4V.
  • Relates to CogAgent-GUI agent introduced in December 2023.

28th of August 2024

A Survey on Evaluation of Multimodal Large Language Models

  • The Survey reviews Multi Modal Language Models (MLLMs).

WebPilot: A Versatile and Autonomous Multi-Agent System for Web Task Execution with Strategic Exploration

  • WebPilot: Introduces Multi-Agent System with Planner (generate and refine plan)/Controller (judge sub-task termination, assess sub-task completion, generate strategic reflection)/Extractor (extract information)/Explorer (generate action, analyse observation, generate tactical reflection)/Appraiser (assess state)/Verifier (format action, deduplicate action) LLM-agents.
  • Uses Global Optimization (decomposing tasks/refining high-level plans with reflective analysis) and Local Optimization (executing sub-tasks with customized MCTS/refining decisions iteratively with each observation).
  • Tasks include navigating forums/upvoting posts/extracting contributor emails.

AutoGen Studio: A No-Code Developer Tool for Building and Debugging Multi-Agent Systems

  • AutoGen Studio: Built on top of AutoGen, the AutoGen Studio includes drag & drop web-UI to customize/attach model/skills/tools/memory/agents involved.
  • The workflow is saved as a declarative json-structure. Users can export this json and share it with other users. Also includes built-in DB Manager, Workflow Manager and Profiler-classes.
  • Backend includes Python API, web API and CLI.

Interactive Agents: Simulating Counselor-Client Psychological Counseling via Role-Playing LLM-to-LLM Interactions

  • Investigates using LLM-agents for Psychological Counseling dialogue (counselor/client). Client simulation is based on client profiles (mental health issue description/detailed description of the disorder/symptom/problem/chief complaint) and counselor simulation is based on exploration, insight, and action.

BattleAgentBench: A Benchmark for Evaluating Cooperation and Competition Capabilities of Language Models in Multi-Agent Systems

  • Introduces BattleAgentBench-benchmark, which reviews rule understanding, spatial perception, competition, static cooperation and dynamic cooperation.

Atari-GPT: Investigating the Capabilities of Multimodal Large Language Models as Low-Level Policies for Atari Games

  • Atari-GPT: Applies Multi Modal Language Model as low-level policy (controller).

FlowAct: A Proactive Multimodal Human-robot Interaction System with Continuous Flow of Perception and Modular Action Sub-systems

  • FlowAct: Introduces human-robot interaction system, which continuously perceives and acts. Uses two controllers: Environment State Tracking (EST) and Action Planner.

Retrieval-Augmented Instruction Tuning for Automated Process Engineering Calculations : A Tool-Chaining Problem-Solving Framework with Attributable Reflection

  • RAIT (Retrieval Augmented Instruction Fine-tuning): Introduces RAIT fine-tuning approach in chemical / process engineering, which combines small language models (SLMs) with Retrieval Augmented Code Generation (RACG).

Towards Fully Autonomous Research Powered by LLMs: Case Study on Simulations

  • Reviews feasibility of Autonomous Simulation Agent (ASA) to automate E2E research process using LLMs and API automation (AutoProg).

LogicGame: Benchmarking Rule-Based Reasoning Abilities of Large Language Models

  • LogicGame: Benchmarks rule-based reasoning, execution and planning of LLMs.

Persuasion Games using Large Language Models

  • Introduces persuasion framework with LLM-agents, but the paper does not clearly state its conclusions about persuasion with LLMs, and the exact roles/prompts remain unclear.

EPO: Hierarchical LLM Agents with Environment Preference Optimization

  • EPO (Environment Preference Optimization): Generates preference signals from environmental feedback for long-horizon decision making with LLM-agents.
  • LLM predicts sub-goals and respective low-level actions.
  • Interaction module generates two types of sub-goals: navigation and interaction.

27th of August 2024

Generative Verifiers: Reward Modeling as Next-Token Prediction

  • GenRM-verifier (Generative Reward Models): proposes training verifiers with next-token prediction objective.
  • Combines verification and solution generation, which improves the verification process.
  • GenRM outperforms classifier-based discriminative verifiers (which assign a numerical score to an answer, used to classify it as correct/incorrect) and LLM-as-a-judge (which tends to underperform trained LLM-based verifiers).
  • Integrates with fine-tuning, CoT and is able to use inference-time compute in form of majority vote to improve verification.
  • Enables inference-time compute for CoT Verifiers (GenRM-CoT). Uses reference-guided grading to assist "Let's verify step by step"-verification on test-time problems lacking reference solution.
  • See slides here.
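
The majority-vote idea above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `sample_verdict` is a hypothetical callable standing in for one sampled verifier rationale that ends in a Yes/No judgement, and `toy_verdict` is a placeholder for a real model call.

```python
from typing import Callable, Sequence


def acceptance_rate(
    sample_verdict: Callable[[str, str], bool],
    problem: str,
    solution: str,
    num_samples: int = 5,
) -> float:
    """Fraction of sampled verifications that judge the solution correct."""
    votes = [sample_verdict(problem, solution) for _ in range(num_samples)]
    return sum(votes) / num_samples


def best_of_n(
    sample_verdict: Callable[[str, str], bool],
    problem: str,
    candidates: Sequence[str],
    num_samples: int = 5,
) -> str:
    """Pick the candidate that the sampled verifications accept most often."""
    return max(
        candidates,
        key=lambda c: acceptance_rate(sample_verdict, problem, c, num_samples),
    )


def toy_verdict(problem: str, solution: str) -> bool:
    # Hypothetical stand-in for a sampled verifier rationale (assumption).
    return solution.strip() == "4"
```

Spending more inference-time compute here simply means raising `num_samples`, which makes the vote less sensitive to a single faulty rationale.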

AgentMonitor: A Plug-and-Play Framework for Predictive and Secure Multi-Agent Systems

  • AgentMonitor: Captures multi-agent system (MAS) inputs and outputs to predict task performance and correct security risks in real-time.
  • Includes 5 different MAS configurations.

HPT++: Hierarchically Prompting Vision-Language Models with Multi-Granularity Knowledge Generation and Improved Structure Modeling

  • Introduces Hierarchical Prompt Tuning (HPT) and HPT++. Adapts VLM by creating a graph from each description with hierarchical relationship guided attention module.

TourSynbio: A Multi-Modal Large Model and Agent Framework to Bridge Text and Protein Sequences for Protein Engineering

  • TourSynbio-Agent: Performs protein engineering tasks using TourSynbio-7B model (fine-tuned on text and protein sequences).
  • Includes an intent classification step, which determines whether the user intent is a generic question or an agent-specific task.
  • Keywords are used in agent selection.
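
A keyword-based agent selection step could look roughly like the sketch below. The agent names and keyword lists are hypothetical placeholders chosen for illustration; the summary above does not specify them.

```python
# Hypothetical keyword table: agent name -> trigger keywords (assumption).
AGENT_KEYWORDS = {
    "mutation_agent": ("mutation", "mutate"),
    "structure_agent": ("structure", "fold"),
}


def route(query: str) -> str:
    """Return the first agent whose keyword appears in the query.

    Falls back to generic question answering when no keyword matches.
    """
    text = query.lower()
    for agent, keywords in AGENT_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return agent
    return "generic_qa"
```

For example, `route("Predict the structure of this sequence")` selects the structure agent, while a general question falls through to `generic_qa`.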

26th of August 2024

Foundation Models for Music: A Survey

  • Reviews research available on Foundational models for Music: representations of music, applications, foundational model techniques, datasets/evals and ethics.

AgentMove: Predicting Human Mobility Anywhere Using Large Language Model based Agentic Framework

  • AgentMove: Mobility prediction LLM agent.
  • Includes spatial-temporal memory.

SWE-bench-java: A GitHub Issue Resolving Benchmark for Java

  • Benchmark to evaluate LLM-agent based coding for Java programming language (SWE-bench for Java).

23rd of August 2024

LIMP: Large Language Model Enhanced Intent-aware Mobility Prediction

  • LIMP (LLMs for Intent-aware Mobility Prediction): Fine-tunes Llama 3-8B-Instruct model with Analyze-Abstract-Infer (A2I)-agentic workflow for mobility intent reasoning.

Intelligent OPC Engineer Assistant for Semiconductor Manufacturing

  • RL / multimodal LLM-agents use RL-based recipe search to solve Optical Proximity Correction (OPC) problems in semiconductor manufacturing, which typically require years of OPC engineering experience.

22nd of August 2024

MEDCO: Medical Education Copilots Based on A Multi-Agent Framework

  • MEDCO (Medical EDucation COpilots): Includes patient, student, expert doctor and radiologist multimodal (X-rays/CT scans/MRIs/ultrasounds) LLM-agents. Student agents are trained/taught with feedback provided and then stored in student memory module to improve future diagnosis.

Graph Retrieval Augmented Trustworthiness Reasoning

  • GRATR (Graph Retrieval Augmented Reasoning): Improves trustworthiness reasoning of the LLM agent using Evidence base.
  • Evidence base is updated based on observation analysis and observation assessment.

MDD-5k: A New Diagnostic Conversation Dataset for Mental Disorders Synthesized via Neuro-Symbolic LLM Agents

  • Neuro-symbolic multi agent framework, which includes doctor, patient and tool LLM-agent interaction and dynamic (patient specific information) diagnosis tree. Introduces mental disorders diagnosis dataset MDD-5k.
  • Doctor agent includes persona, diagnosis result, dialogue generation. Patient agent includes patient information, patient experience and knowledge graph.
  • Establishes deeper engagement with patient to help generate diagnosis by generating the dynamic diagnosis tree.

Balancing Act: Prioritization Strategies for LLM-Designed Restless Bandit Rewards

  • Introduces customizable Social Choice Language Model: Uses an external adjudicator to manage tradeoffs via a user-selected social welfare function. Uses LLM to design reward functions in Restless Multi-Armed Bandits-allocation problems.
  • Suggests, that prompt engineering alone

--

SocialQuotes: Learning Contextual Roles of Social Media Quotes on the Web

  • Introduces SocialQuotes-dataset to classify social media / web context into roles (influencer, expert, marketer, commenter, etc.)


Can LLMs Understand Social Norms in Autonomous Driving Games?

  • LLM-agent autonomously drives in multi-agent driving game with social norms. Agents make self-driven decisions without attempting to cooperate.

21st of August 2024

Story3D-Agent: Exploring 3D Storytelling Visualization with Large Language Models

  • Story3D-Agent: LLM-agent used in 3D storytelling visualization with contextual and narrative consistency.

Leveraging Chemistry Foundation Models to Facilitate Structure Focused Retrieval Augmented Generation in Multi-Agent Workflows for Catalyst and Materials Design

  • Improves chemistry information retrieval/catalyst and materials design usage of Chemical Foundational model (such as MolFormer-XL) by combining it with RAG.

LLM4VV: Exploring LLM-as-a-Judge for Validation and Verification Testsuites

  • Agent-based prompting and a validation pipeline increase the quality of LLM-as-a-Judge for compiler tests.

DreamFactory: Pioneering Multi-Scene Long Video Generation with a Multi-Agent Framework

  • DreamFactory: video generation-framework, which generates long/complex and stylistically coherent videos using multi-agent video production agent team.
  • Includes requirement analysis/planning/framework preparation/script generation/scenes design/shots design/key-frames generation and video generation.
  • Still lacks creativity (artistic/devising plots) due to reliance on prompts, appears as individual videos stitched together over a synthetic audio clip, and requires significant computational resources.

Leveraging Fine-Tuned Retrieval-Augmented Generation with Long-Context Support: For 3GPP Standards

  • Implements fine-tuned Phi-2 with RAG (semantic chunking/extended context support) in telecommunications.

Cause-Aware Empathetic Response Generation via Chain-of-Thought Fine-Tuning

  • CFEG (Cause-aware Fine-tuning Empathetic Generation)-method: Uses emotion cause reasoning and fine-tuned LLM with CoT. Demonstrates superior empathetic dialogue responses.

20th of August 2024

FLAME: Learning to Navigate with Multimodal LLM in Urban Environments

  • FLAME (FLAMingo Architected Embodied Agent): a multimodal language-vision agent for navigational tasks by using three-step tuning: single perception tuning/multiple perception tuning/end-to-end training on VLN datasets.

Athena: Safe Autonomous Agents with Verbal Contrastive Learning

  • Athena: Improves alignment with verbal contrastive learning, which guides LLM-agent behaviour with past safe/unsafe trajectories as in-context contrastive examples and a critiquing mechanism. Contains LLM-agents: Actor/Critic/Emulator interacting to complete a given task.
  • Introduces safety evaluation benchmark for LLM-agents with 80 toolkits in 8 categories.

Strategist: Learning Strategic Skills by LLMs via Bi-Level Tree Search

  • Strategist: LLM-agent learns new skills through self-improvement based on MCTS and LLM-based reflection. Generates new ideas based on performance in simulated self-play by analysing good ideas.

MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding

  • MagicDec: Speculative decoding improves throughput in mid/long-context serving with a sparse KV cache.

19th of August 2024

MegaAgent: A Practical Framework for Autonomous Cooperation in Large-Scale LLM Agent Systems

  • MegaAgent: Autonomous co-operation between dynamically generated LLM agents for specific task requirements.
  • Automatically generates sub-tasks (delegated to a sub-task admin, which coordinates the sub-task within a group of agents), plans hierarchically and systematically (boss agent) and monitors concurrent agent activities. OS agent ensures that agents communicate in the proper format and progress with the task.
  • The Storage module includes: log, memory db, task monitor, interactive python exec/Python, Files and Checklist.
  • MegaAgent claims high scalability/parallelism (because agent communication cost grows logarithmically, not linearly), high effectiveness (manages 590 agents quicker than the CAMEL-framework managed 2 agents; summarizes previous conversations to store them in a vector db) and high autonomy.

GoNoGo: An Efficient LLM-based Multi-Agent System for Streamlining Automotive Software Release Decision-Making

  • GoNoGo: LLM-agent system, which includes Planner- and Actor-agents to process high-level queries for decision support in 120 seconds. Planner interprets user queries/plans analysis strategies. Actor generates code, resolves errors with memory/plugins/coder LLM with self-reflection.

18th of August 2024

Re-Invoke: Tool Invocation Rewriting for Zero-Shot Tool Retrieval

  • Re-Invoke:
  • LLM (Query generator) generates distinct queries from the tools document index. Synthetic query copies are stored with tool name, description and query. LLM (Intent extractor) retrieves the most similar tools for new user queries based on a multi-view ranking algorithm.
  • The multi-view ranking identifies the most similar tools for each intent. For each intent, it picks the most relevant tool, starting with the intent with the highest individual tool similarity.
  • Includes an intent extractor prompt, which works just by adding it as a system instruction.
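
The multi-view ranking described above can be sketched as follows. This is one possible reading of the summary, not the paper's exact algorithm: the similarity scores are assumed to be precomputed (for example from embeddings), and the function name and input layout are hypothetical.

```python
from typing import Dict, List


def multi_view_rank(similarity: Dict[str, Dict[str, float]], top_k: int = 3) -> List[str]:
    """Rank tools across several extracted intents.

    similarity[intent][tool] is a precomputed similarity score (assumption).
    Intents are visited in order of their best tool score; each visit
    contributes that intent's next-best tool that is not yet selected.
    """
    if not similarity:
        return []
    # Per-intent tool lists, best match first.
    ranked = {
        intent: sorted(scores, key=scores.get, reverse=True)
        for intent, scores in similarity.items()
    }
    # Intents ordered by their single best tool similarity.
    intent_order = sorted(
        similarity,
        key=lambda i: max(similarity[i].values(), default=float("-inf")),
        reverse=True,
    )
    max_depth = max((len(tools) for tools in ranked.values()), default=0)
    result: List[str] = []
    depth = 0
    while len(result) < top_k and depth < max_depth:
        for intent in intent_order:
            tools = ranked[intent]
            if depth < len(tools) and tools[depth] not in result:
                result.append(tools[depth])
                if len(result) == top_k:
                    break
        depth += 1
    return result
```

Visiting intents round-robin keeps a query with several intents from being dominated by the tools of its single strongest intent.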

HiAgent: Hierarchical Working Memory Management for Solving Long-Horizon Agent Tasks with Large Language Model

  • HiAgent: LLM-based agent, which uses subgoals to define working memory (intrial memory), instead of retrieving entire crosstrial memory (between experiments).
  • The LLM-agent replaces previous subgoals with the relevant summarized observations (action-observation pairs) for the current task.
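
A minimal sketch of subgoal-scoped working memory, assuming a `summarize` callable (hypothetical, in practice an LLM call) that condenses the action-observation pairs of a finished subgoal:

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Step = Tuple[str, str]  # (action, observation)


@dataclass
class WorkingMemory:
    summarize: Callable[[str, List[Step]], str]  # hypothetical summarizer
    finished: List[str] = field(default_factory=list)  # summaries of past subgoals
    current_subgoal: str = ""
    current_steps: List[Step] = field(default_factory=list)

    def start_subgoal(self, subgoal: str) -> None:
        # Replace the detailed trace of the previous subgoal with its summary.
        if self.current_subgoal:
            self.finished.append(self.summarize(self.current_subgoal, self.current_steps))
        self.current_subgoal = subgoal
        self.current_steps = []

    def record(self, action: str, observation: str) -> None:
        self.current_steps.append((action, observation))

    def context(self) -> str:
        # Only the active subgoal keeps its full action-observation pairs.
        lines = list(self.finished)
        lines.append(f"Subgoal: {self.current_subgoal}")
        lines += [f"{action} -> {observation}" for action, observation in self.current_steps]
        return "\n".join(lines)
```

The prompt context therefore grows with the number of subgoals rather than with the total number of steps.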

16th of August 2024

EmoDynamiX: Emotional Support Dialogue Strategy Prediction by Modelling MiXed Emotions and Discourse Dynamics

  • EmoDynamiX: an LLM agent predicting optimal socio-emotional strategy (strategy embedding) and emotion state (emotion embedding) in a dialogue.
  • Uses Heterogeneous Graph (HG) to model the dialogue interaction: node types reflect past strategies/emotional states/predicted strategy of the agent and edge types reflect dialogue dependencies between turns and speaker role-awareness.

15th of August 2024

Automated Design of Agentic Systems

  • ADAS (Automated Design of Agentic Systems): the Meta Agent discovers new agents with superior performance compared to hand-designed agents. Suggests a research direction for higher-order ADAS, where ADAS is used to improve the meta agent itself.
  • The system consists of a Meta Agent, which generates new agents and corrects them until error free. The new agent is tested and then added to the Agent library. For example, agents consist of specific blocks such as CoT/Verifier/Sub-problem division/etc., which are used in a specific order in the system flow.
  • Meta Agent Search-algorithm generates automatically new agentic system designs and system blocks.
  • The Meta Agent Search-algorithm samples new agents optimizing performance in the Search space (prompts/control flows) evaluated with the Evaluation Function (cost/latency/safety).
  • Includes code for a few of the discovered agents.
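
The search loop described above can be sketched as below. This is an illustrative outline under stated assumptions: `propose` stands in for the meta agent (conditioned on the archive of earlier agents) and `evaluate` for the evaluation function; both names, and the toy versions that follow, are hypothetical.

```python
from typing import Callable, List, Tuple

Agent = List[str]  # toy representation: an ordered list of building blocks


def meta_agent_search(
    propose: Callable[[List[Tuple[Agent, float]]], Agent],
    evaluate: Callable[[Agent], float],
    iterations: int = 5,
) -> List[Tuple[Agent, float]]:
    """Propose agents, test them, and keep an archive sorted best first."""
    archive: List[Tuple[Agent, float]] = []
    for _ in range(iterations):
        candidate = propose(archive)
        try:
            score = evaluate(candidate)
        except Exception:
            # A fuller system would ask the meta agent to repair the candidate.
            continue
        archive.append((candidate, score))
    return sorted(archive, key=lambda entry: entry[1], reverse=True)


BLOCKS = ["CoT", "Verifier", "Sub-problem division"]


def toy_propose(archive: List[Tuple[Agent, float]]) -> Agent:
    # Toy meta agent: each round proposes a design with one more block.
    return BLOCKS[: len(archive) % len(BLOCKS) + 1]


def toy_evaluate(agent: Agent) -> float:
    # Toy evaluation: more blocks score higher.
    return len(agent) / len(BLOCKS)
```

Passing the archive back into `propose` is what lets later proposals build on earlier discoveries.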

13th of August 2024

Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents

  • Agent Q: Introduces real world website agent iteratively fine-tuned with DPO based MCTS with self-critique and AI feedback. Trajectory collection includes reward in each node of the tree.
  • Calculates a weighted score from the MCTS average Q-value and a score generated by a feedback LLM to construct contrastive pairs for DPO. The policy is optimised and iteratively improved.
  • LLM is used to sample reasoning/website actions to explore.
  • Achieves high performance in real world environments and beats average human-level performance.
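
As a rough sketch of building contrastive pairs for DPO, assume each candidate action has an MCTS average Q-value and a feedback-LLM score that are mixed with a weight `alpha`. The weight, the `margin` threshold and the function names are illustrative assumptions, not values from the paper.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def combined_q(q_mcts: float, q_feedback: float, alpha: float = 0.5) -> float:
    """Weighted mix of the MCTS average Q-value and the feedback-LLM score."""
    return alpha * q_mcts + (1 - alpha) * q_feedback


def preference_pairs(
    actions: Dict[str, Tuple[float, float]],
    alpha: float = 0.5,
    margin: float = 0.2,
) -> List[Tuple[str, str]]:
    """Return (chosen, rejected) pairs whose combined scores differ by >= margin."""
    scores = {name: combined_q(m, f, alpha) for name, (m, f) in actions.items()}
    pairs = []
    for a, b in combinations(scores, 2):
        if abs(scores[a] - scores[b]) >= margin:
            pairs.append((a, b) if scores[a] > scores[b] else (b, a))
    return pairs
```

Dropping pairs below the margin avoids training on preferences that the two signals barely distinguish.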


12th of August 2024

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

  • AI Scientist: claims fully automatic scientific discovery by generating novel research ideas, writing code, executing experiments, visualizing results, describing findings in a research paper and simulating the evaluation process.

9th of August 2024

AmbigDocs: Reasoning across Documents on Different Entities under the Same Name

  • AmbigDocs: is a new benchmark for evaluating language models' ability to distinguish between different entities with the same name across multiple documents.
  • It leverages Wikipedia's disambiguation pages, generates questions with ambiguous names, and provides corresponding sets of answers, and includes an ontology categorizing incomplete answers and automatic evaluation metrics.
  • This work lays the foundation for future research on reasoning across multiple documents with ambiguous entities.

Enhancing the Code Debugging Ability of LLMs via Communicative Agent Based Data Refinement

  • MASTER (CoMmunicative Agent BaSed DaTa REfinement FRamework): code repair with LLM. Consists of Code Quizzer (code debug expert creates questions about the error), Code Learner (answers the generated questions) and Code Teacher (reviews and corrects incorrect answers) agents.
  • Includes DEBUGEVAL-benchmark: bug localization, bug identification, code review and code repair.

8th of August 2024

Can LLMs Beat Humans in Debating? A Dynamic Multi-agent Framework for Competitive Debate

  • Agent4Debate: collaborative and dynamic multi-agent (searcher/analyzer/writer/reviewer) LLM for competitive debate.
  • Includes Chinese Debate Arena-benchmark with
  • Framework begins with context/motion/position/stage. Searcher gathers information, analyzer reviews arguments, writer generates arguments/debates and reviewer provides feedback on debate.

RiskAwareBench: Towards Evaluating Physical Risk Awareness for High-level Planning of LLM-based Embodied Agents

  • RiskAwareBench: reviews physical risk awareness of embodied LLM agents.
  • Includes modules: safety tip generation/risky scene generation/plan generation & evaluation/risk assessment.

7th of August 2024

Perceive, Reflect, and Plan: Designing LLM Agent for Goal-Directed City Navigation without Instructions

  • PReP: city-navigation to goal using visual perception and memory (working, episodic & semantic) without instructions.
  • Semantic memory summarizes memories from multiple steps to perform high-level navigation.

Forecasting Live Chat Intent from Browsing History

  • LLM-based user intent prediction (to predict why a user needs live-chat agent support) in two steps: first classifies the browsing history into a high-level intent category, then predicts the fine-grained user intent from the high-level intent class and the browsing history.

CodexGraph: Bridging Large Language Models and Code Repositories via Code Graph Databases

  • LLM uses code RAG. Builds code graph db from code repository. Nodes represent symbols, edges represent relationships between symbols and schema defines how code graphs are stored in the code db.
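
To illustrate the node/edge idea, the sketch below builds a tiny code graph from Python source with the standard `ast` module. In-memory sets stand in for the graph database, and the node and edge labels (`FUNCTION`, `CLASS`, `METHOD`, `CALLS`, `HAS_METHOD`) are an assumed schema for illustration, not necessarily the one used by the paper.

```python
import ast
from typing import Set, Tuple

Node = Tuple[str, str]        # (label, symbol name)
Edge = Tuple[str, str, str]   # (source symbol, relationship, target symbol)


def build_code_graph(source: str) -> Tuple[Set[Node], Set[Edge]]:
    """Extract top-level symbols and simple relationships from source code."""
    tree = ast.parse(source)
    nodes: Set[Node] = set()
    edges: Set[Edge] = set()
    for item in tree.body:
        if isinstance(item, ast.ClassDef):
            nodes.add(("CLASS", item.name))
            for member in item.body:
                if isinstance(member, ast.FunctionDef):
                    qualified = f"{item.name}.{member.name}"
                    nodes.add(("METHOD", qualified))
                    edges.add((item.name, "HAS_METHOD", qualified))
        elif isinstance(item, ast.FunctionDef):
            nodes.add(("FUNCTION", item.name))
            # Record direct calls to plain names made inside this function.
            for call in ast.walk(item):
                if isinstance(call, ast.Call) and isinstance(call.func, ast.Name):
                    edges.add((item.name, "CALLS", call.func.id))
    return nodes, edges


EXAMPLE = '''
def helper():
    return 1

def main():
    return helper()

class Repo:
    def size(self):
        return 0
'''
```

An agent can then answer structural questions such as "who calls `helper`?" by filtering the edge set instead of reading whole files.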

6th of August 2024

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

  • Reviews scaling up inference compute (test-time) in order to build self-improving agents. Quantifies the amount of improvement when increasing inference compute.
  • Test-time compute outperforms 14x larger models.
  • Compute-optimal scaling strategy can improve efficiency of test-time compute by a factor of up to 4x.

5th of August 2024

ReDel: A Toolkit for LLM-Powered Recursive Multi-Agent Systems

  • ReDel (Recursive Delegation): Recursive multi-agent framework, where LLM decides when to delegate/how to delegate (delegation graph).
  • Includes custom tool-use, delegation schema, event-based logging and interactive replay (web UI).
  • Includes open-source Python package.
  • ReDel delegation schemes include DelegateOne (the parent agent waits until the child agent completes) and DelegateWait (provides a separate function for the parent agent to retrieve the child agent's response).
  • Event-driven logging includes built-in events and custom events.
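
The difference between the two delegation schemes can be sketched with `asyncio`. This is not ReDel's actual API: `child_agent`, `delegate_one` and the `DelegateWait` class below are hypothetical stand-ins that only mirror the blocking versus deferred-retrieval behaviour described above.

```python
import asyncio
from typing import Dict, List


async def child_agent(task: str) -> str:
    # Hypothetical child agent; a real one would call an LLM.
    await asyncio.sleep(0)
    return f"done: {task}"


async def delegate_one(task: str) -> str:
    """Blocking scheme: the parent waits until the child completes."""
    return await child_agent(task)


class DelegateWait:
    """Deferred scheme: delegation returns a handle, wait() retrieves the result."""

    def __init__(self) -> None:
        self._children: Dict[int, asyncio.Task] = {}
        self._next_handle = 0

    def delegate(self, task: str) -> int:
        handle = self._next_handle
        self._next_handle += 1
        self._children[handle] = asyncio.create_task(child_agent(task))
        return handle

    async def wait(self, handle: int) -> str:
        return await self._children.pop(handle)


async def demo() -> List[str]:
    first = await delegate_one("summarize file A")
    pool = DelegateWait()
    handles = [pool.delegate(task) for task in ("search web", "run tests")]
    # Both children are already running before the parent collects results.
    rest = [await pool.wait(handle) for handle in handles]
    return [first] + rest
```

The deferred scheme lets a parent start several children in parallel, at the cost of having to track handles.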

SpecRover: Code Intent Extraction via LLMs

  • SpecRover/AutoCodeRover-v2: autonomous GitHub issue fixing by understanding developer intent from GitHub repo structure / developer behaviour.
  • Claims GitHub issues can be solved for as little as $0.65/issue.

LLM Agents Improve Semantic Code Search

  • RAG-agent (ensemble architecture), which adds relevant contextual information to the user query from the Github repository.
  • Uses RepoRift-platform, which improves code search by narrowing context search to a single repository, using agentic interaction and returning easy-to-understand results with low latency.

3rd of August 2024

The Drama Machine: Simulating Character Development with LLM Agents

  • Drama Machine: Reviews Automated Identity-generation with LLMs. Uses multiple LLMs to simulate dynamic/complex AI characters in domain of drama scenes: interview/detective.
  • Roles include Ego, SuperEgo, Autobiography, Director and Critic.

2nd of August 2024

Coalitions of Large Language Models Increase the Robustness of AI Agents

  • Coalitions of LLMs outperform single models and fine-tuned LLMs.
  • Specific LLMs fit particular tasks and offer cheaper inference.

1st of August 2024

OmniParser for Pure Vision Based GUI Agent

  • OmniParser: VLM agent parsing GUI screenshots into structured data. Attempts to ground actions on GUI regions.
  • Includes detection model to capture interactable GUI regions. Caption model retrieves functional semantics of these detected elements. OCR generates structured representation of the GUI.
  • Improves action prediction accuracy. Includes icon-detection dataset.
  • Comprehensively reviews the screen coordinate detection problem of VLMs.
  • Error cases include: repeated/misinterpreted icons, repeated texts and inaccurate bounding boxes.

AgentGen: Enhancing Planning Abilities for Large Language Model based Agent via Environment and Task Generation

  • AgentGen: Generates diverse LLM agent environments and planning tasks. LLM fine-tuned with this data significantly improves planning capabilities.
  • Uses inspirational corpus to generate environment context (actions/restrictions/etc). Generates tasks, which include "difficulty diversification": easy/medium/hard with bidirectional evolution (Bi-Evol) to smoothly acquire new planning skills.

31st of July 2024

Tulip Agent -- Enabling LLM-Based Agents to Solve Tasks Using Large Tool Libraries

  • Tulip Agent and AutoTulipAgent: LLM-agent has privileges to create, update, delete and edit the tool library.
  • Self-recursively extendable tool library.
  • AutoTulipAgent includes 5 generic tools: 2 to decompose tasks/search tools, plus the capability to create/delete/update tools.

29th of July 2024

Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process

  • iGSM framework: is used to generate diverse grade-school math problems for training and testing language models.
  • The framework includes a hierarchical categorization, structure graph, dependency graph, and solution construction using Chain-of-Thought (CoT) approach, and it uses GPT2-like language model with rotary embedding.
  • This framework enables a principled study of language models' mathematical reasoning skills, going beyond empirical benchmark pushing.

28th of July 2024

Solving Robotics Problems in Zero-Shot with Vision-Language Models

  • Wonderful Team: uses off-the-shelf VLM model for high-level planning, low-level location extraction and action execution.

26th of July 2024

AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents

  • AppWorld-benchmark: simulates LLM-agents using App World Engine-execution environment (mimicking 9 real-world apps/simulates 457 APIs/100 fictitious and related users) by measuring 750 complex tasks (records database start state and end state to review correct/incorrect actions to Base DB), which require iterative/interactive code generation without real-world consequences.
  • Generates task scenarios, which are used by the task generator (setup/validation/evaluation).
  • Each task is checked to be: well-defined/includes distractors/has real distractors/contrasts from existing other tasks.
  • Includes Supervisor (provides passwords/credit cards/etc about the user), (API parameters/descriptions) and Execution Shell to run code.
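
State-based checking of the kind described above (comparing the database start state with its end state) can be sketched as follows. Plain dictionaries stand in for the database, and the function names are hypothetical, not AppWorld's API.

```python
from typing import Any, Dict, Tuple

State = Dict[str, Any]


def diff_state(before: State, after: State) -> Dict[str, Tuple[Any, Any]]:
    """Map each changed key to its (old, new) value.

    A key missing on one side is reported as None, so this toy version
    cannot tell a missing key from a key explicitly set to None.
    """
    changes = {}
    for key in before.keys() | after.keys():
        if before.get(key) != after.get(key):
            changes[key] = (before.get(key), after.get(key))
    return changes


def task_passed(before: State, after: State, expected: Dict[str, Tuple[Any, Any]]) -> bool:
    """Pass only if exactly the expected changes happened, with no side effects."""
    return diff_state(before, after) == expected
```

Requiring an exact match on the diff penalizes both missing changes and collateral damage, such as an agent modifying unrelated records.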

25th of July 2024

PersonaGym: Evaluating Persona Agents and LLMs

  • Introduces PersonaGym-benchmark to evaluate persona LLM-agents.
  • Sets an automatic PersonaScore-metric to evaluate five different capabilities.
  • Finds SOTA level LLMs to offer highly varying level of capabilities as persona-agents.
  • Increasing model size is no guarantee of better persona agent performance: varying levels of persona agent performance are detected across model sizes.

Recursive Introspection: Teaching Language Model Agents How to Self-Improve

  • RISE (Recursive IntroSpEction): iteratively self-improves LLM responses through fine-tuning with RL.
  • LLM loss is lower when using multi-turn data compared to using only the final answer. Works only for reasoning, not knowledge tasks.
  • Strongly indicates that full online RL is feasible with RISE using an iterative self-training procedure (such as STaR), because RISE improves the LLM over 5 turns with/without an oracle model.
  • Demonstrates that LLMs can learn to correct their own mistakes beyond the level of proprietary models when trained with RISE. The self-improvement continues up to 6 iterations, demonstrating lower loss.
  • RISE starts with turn 1, where only prompt is provided. In turn 2, the prompt, the original response and its feedback is provided to generate the turn 2 response. Majority voting is used to select the final response from multiple responses generated. Alternatively, oracle model can be used to assist, when such is available.
  • Why does self-improvement work? RISE is compared to diffusion models, where generation is refined step-by-step. Similarly, LLMs may lack "capacity" to process the request, which RISE can help to refine. See the talk on this paper here.
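
The inference procedure described above (several refinement turns followed by a majority vote) can be sketched as below. `generate` is a hypothetical callable standing in for the fine-tuned model; it receives the prompt and the earlier responses, and feedback text is omitted here for brevity.

```python
from collections import Counter
from typing import Callable, List


def rise_inference(
    generate: Callable[[str, List[str]], str],
    prompt: str,
    turns: int = 3,
) -> str:
    """Run several refinement turns and return the most frequent response."""
    history: List[str] = []
    for _ in range(turns):
        # Turn 1 sees only the prompt; later turns also see earlier responses.
        response = generate(prompt, history)
        history.append(response)
    answer, _count = Counter(history).most_common(1)[0]
    return answer
```

When an oracle is available, the vote could be replaced by stopping at the first response the oracle accepts.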

24th of July 2024

Reinforced Prompt Personalization for Recommendation with Large Language Models

  • Reinforced Prompt Personalization (RPP): uses instance-based prompting with MARL.
  • Instead of task-based prompting, instance-based prompting personalises four characteristics (role-play/history/reasoning guidance/output format) with MARL.

AI-Gadget Kit: Integrating Swarm User Interfaces with LLM-driven Agents for Rich Tabletop Game Applications

  • AI-gadget Kit: multi-agent driven Swarm UI (SUI) tabletop gaming system, which consists of meta-motion, interactive behaviour, interactive relationship and application.

3D Question Answering for City Scene Understanding

  • Sg-CityU: 3D multimodal QA, which uses scene graph to provide answers related to spatial relationships about city-scenes.

23rd of July 2024

RedAgent: Red Teaming Large Language Models with Context-aware Autonomous Language Agent

  • RedAgent: Introduces concept of "Jailbreaking strategy" (strategies used by attackers to construct jailbreaking prompts) red teaming through multi-agent self-reflection from context feedback and skill memory.
  • The approach can jailbreak LLMs and LLM-based apps (even more vulnerable) using just a few queries.
  • The Red-Agent architecture includes skill memory and multiple roles (profile constructor/planner/attacker/evaluator) and short/long term memory.

AMONGAGENTS: Evaluating Large Language Models in the Interactive Text-Based Social Deduction Game

  • AmongAgents: multi-agent LLM-framework with memory, reflection and interaction in social deduction game with ambiguous and deceptive characters.
  • Includes meeting/task-phases.
  • Agents possess a personality component, generated with a personality prompt from a pre-defined set of personalities (behaviour/decision-making), which contributes to more dynamism/realism.

OpenDevin: An Open Platform for AI Software Developers as Generalist Agents

  • OpenDevin: LLM-based multi-agent framework, where agents interact as human-like SW agents writing code, using command line and browsing web.
  • The framework includes: interaction mechanism (event stream), environment (sandbox environment for code execution), interface (human-like), multi-agent delegation (co-operate) and evaluation framework.
  • Event stream tracks history of action and observation.

PyBench: Evaluating LLM Agent on various real-world coding tasks

  • Introduces PyBench-benchmark for real-world like coding tasks with LLM-agents.
  • Introduces high-performance PyLlama3 model for coding tasks.

Artificial Agency and Large Language Models

  • Reviews theoretical models for agents, LLM agents and concept of artificial agency.

LawLuo: A Chinese Law Firm Co-run by LLM Agents

  • LawLuo: includes LLM-based receptionist/lawyer/secretary/boss-agents to simulate a realistic legal consultation company based on SOP (Standard Operating Procedure).

22nd of July 2024

TaskGen: A Task-Based, Memory-Infused Agentic Framework using StrictJSON

  • TaskGen: LLM-agent framework to solve tasks by dividing a task into sub-tasks, each executed by its own agent/equipped function. Manages memory/information based on need-to-know. Uses StrictJSON format.
  • Includes meta-agent, inner-agent, function-calls, sub-tasks, shared memory (sub-task completed/list of past equipped function inputs or outputs/shared variables) and passing context/shared memory to inner agent/function.
  • Utilises global context, which adds data to the default LLM prompt (carrying shared variables throughout a task/storing the current state of a dynamic environmental variable/specific instructions).

Odyssey: Empowering Agents with Open-World Skills

  • Odyssey: interactive (plan-actor-critic) LLM-agent (fine-tuned Llama 3) with real world skill library.
  • Introduces long-term planning/dynamic-immediate planning/autonomous exploration benchmark.
  • Planner decomposes long-term goals into sub-goals with ultimate goals/behavioural constraints/agent states/achievements.
  • Actor executes skill code using query context/similarity match/skill selection.
  • Critic uses execution feedback/self-validation/self-reflection.

19th of July 2024

System-1.x: Learning to Balance Fast and Slow Planning with Language Models

  • System-1.x Planner: introduces a controllable planning framework (inference time compute) capable of producing hybrid plans balancing system 1 and system 2 thinking. Includes Controller/System-1 Planner/System-2 Planner.
  • The Controller manages the x-factor, which is the degree to how much to use System-1 vs. System-2 thinking to decompose planning into sub-goals.
  • Demonstrates: controllability/flexibility/generalizability to different search algorithms.

The Vision of Autonomic Computing: Can LLMs Make It a Reality?

  • Explores feasibility of Autonomic Computing Vision (ACV) with multi-agent framework based on LLMs.
  • LLM-based multi-agent framework achieves level 3 autonomy.
  • The original ACV-framework identified 4 pillars: self-configuration, self-optimization, self-healing and self-protection.

18th of July 2024

Prover-Verifier Games improve legibility of LLM outputs

  • Prover-Verifier: Direct RL on solution correctness generates solutions difficult for humans to evaluate and obtains.
  • Checkability training yields a prover that maintains legibility while paying a legibility tax: it loses some performance to make its solutions easier for humans to check.
  • Discusses the possibility of training two models: train model with CoT to maximize accuracy and another model to turn the CoT produced by the model into legible version understandable for humans.

12th of July 2024

PersonaRAG: Enhancing Retrieval-Augmented Generation Systems with User-Centric Agents

  • PersonaRAG: Includes components k-docs retrieval, user interaction analysis (user profile/contextual retrieval/live session/document ranking/feedback agents) and cognitive dynamic adaption (selective/collaborative use of agents).

Instruction Following with Goal-Conditioned Reinforcement Learning in Virtual Environments

  • IGOR (Instruction following with GOal-conditioned RL): LLM translates instructions into high-level action plan with sub-goals and RL executes them.

Large Language Models as Biomedical Hypothesis Generators: A Comprehensive Evaluation

  • LLMs generate novel and diverse biomedical hypotheses through multi-agent interaction.

11th of July 2024

GTA: A Benchmark for General Tool Agents

  • GTA-benchmark: evaluates general tool usage of LLM agents in real user queries with real deployed tools, for example with web page screenshots.
  • Evaluates perception, operation, logic and creativity tools.
  • Defines "Real-World" as helping humans in real-life with being step/tool-implicit.
  • GPT-4 solves 50% of these tasks.
  • Includes illustration of executable tool chains.

Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence

  • Internet of Agents (IoA): LLM agents lack capability to interact in dynamic environments with other agents outside its hard-coded communication pipeline.
  • Limitations include: ecosystem isolation, single-device simulation and rigid communication/coordination.
  • IoA acts in Internet-like environment to achieve collective intelligence and new capabilities.
  • Includes architectural design of the IoA-framework.

Converging Paradigms: The Synergy of Symbolic and Connectionist AI in LLM-Empowered Autonomous Agents

  • LAAs (LLM-empowered Autonomous Agents): Introduces concept of LAAs, which include three elements: external tools, LLMs (knowledge modelling) and Agentic workflow (human-like symbolic reasoning).
  • LAAs are characterised by natural language dialogue, decision making, planning, task decomposition and acting.

GPT-4 is judged more human than humans in displaced and inverted Turing tests

  • Introduces the Inverted Turing test.

Beyond Instruction Following: Evaluating Rule Following of Large Language Models

  • RuleBench-benchmark: evaluates LLMs' capability to follow rules.
  • Evaluation dimensions include: executing rules, triggering rules, following formal rules, applying rules and following counterfactual rules.

Large Models of What? Mistaking Engineering Achievements for Human Linguistic Agency

  • Argues that LLMs in their current form cannot be linguistic agents, because they lack embodiment, participation and precariousness.

Incorporating Large Language Models into Production Systems for Enhanced Task Automation and Flexibility

  • Reviews integration of LLMs into Automated Production Systems.

10th of July 2024

WorldAPIs: The World Is Worth How Many APIs? A Thought Experiment

  • Finds that covering 0.5% of WikiHow instructions takes roughly 300 APIs, which can be considered a lower bound for covering a wide variety of WikiHow instructions in embodied agent tasks.
  • The framework iteratively produces action spaces for APIs to be used by an LLM-based embodied agent.
  • The two-step process works iteratively through hallucination: an LLM is few-shot prompted with WikiHow instructions to generate semi-executable agent policies in Python, and the partial/full Python programs are then parsed into a pool of APIs.

9th of July 2024

Hypothetical Minds: Scaffolding Theory of Mind for Multi-Agent Tasks with Large Language Models

  • Hypothetical Minds: Introduces "Theory-of-Mind"-module. Includes as well perception, memory and hierarchical two-level planning.

Vision language models are blind

  • Reviews 7 visual tasks, where SOTA-level VLMs perform shockingly badly.

5th of July 2024

On scalable oversight with weak LLMs judging strong LLMs

  • Explores debate and consultancy to supervise AI.
  • Finds debate outperforms consultancy in general. Better debater models modestly improve judge accuracy.

When LLMs Play the Telephone Game: Cumulative Changes and Attractors in Iterated Cultural Transmissions

  • Reviews toxicity/bias in LLM agent multi-step inputs/outputs, instead of individual LLM input-output.

Are Large Language Models Strategic Decision Makers? A Study of Performance and Bias in Two-Player Non-Zero-Sum Games

  • Reviews LLMs in strategic games. LLMs come with systematic biases: positional bias, payoff bias and behavioural bias. LLM performance decreases when the mentioned bias-dimensions are misaligned.

3rd of July 2024

LivePortrait: Efficient Portrait Animation with Stitching and Retargeting Control

  • LivePortrait: generates realistic video from single portrait image with facial expressions and head poses from different angles.
  • Offers better computational efficiency and controllability than diffusion models by using an implicit-keypoint-based framework.
  • Generation speed is 12.8 ms with RTX 4090.

Cactus: Towards Psychological Counseling Conversations using Cognitive Behavioral Theory

  • Cactus: multi-turn dialogue dataset for mental health counseling, consisting of goal-oriented/structured Cognitive Behavioral Therapy interaction.
  • Trains Camel-LLM using the Cactus-dataset.

2nd of July 2024

GRASP: A Grid-Based Benchmark for Evaluating Commonsense Spatial Reasoning

  • GRASP: Large scale spatial reasoning benchmark and dataset in structured grid environment requiring planning and commonsense reasoning.

MMedAgent: Learning to Use Medical Tools with Multi-modal Agent

  • MMedAgent: built on the LLaVA-Med-model and fine-tuned with data from 6 different tools, MMedAgent outperforms GPT-4o-agent in medical tasks.

1st of July 2024

Agentless: Demystifying LLM-based Software Engineering Agents

  • Agentless: Argues that it is not required to deploy complex autonomous sw agents.
  • Uses a two-step approach: Localization (files requiring sw fix) and Repair.
  • Framework begins from codebase and an issue. It then reviews repo structure and issue to localize top n-files, localizes classes/functions, localizes edit locations. In the repair-phase, the LLM generates various patches, which are filtered and ranked to submit the patch to the issue.
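
The last stage of the repair phase described above (several candidate patches are filtered, ranked, and one is submitted) can be sketched as follows. This is a minimal illustration, not the Agentless implementation: the function name `select_patch`, the `is_valid` callback, and ranking by majority vote over stripped patch text are assumptions introduced here.

```python
from collections import Counter
from typing import Callable, Iterable, Optional


def select_patch(candidates: Iterable[str],
                 is_valid: Callable[[str], bool]) -> Optional[str]:
    """Filter candidate patches, then return the most frequent survivor."""
    # Filtering step: drop patches that fail a validity check
    # (hypothetical; e.g. a syntax check or a regression-test run).
    survivors = [patch.strip() for patch in candidates if is_valid(patch)]
    if not survivors:
        return None
    # Ranking step (assumed): majority vote over the stripped patch text.
    ranked = Counter(survivors).most_common()
    return ranked[0][0]


if __name__ == "__main__":
    patches = ["x = 1\n", "x = 2", "x = 1", "x ="]
    # Toy validity check: reject patches that end with a dangling "=".
    print(select_patch(patches, lambda p: not p.strip().endswith("=")))
```

In practice the validity check would be the expensive part (running tests against each patch), so filtering before ranking keeps the vote restricted to patches that at least apply cleanly.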

29th of June 2024

Question Translation Training for Better Multilingual Reasoning

  • QAlign (Question Alignment): is a framework that fine-tunes LLMs to translate reasoning questions into English using X-English parallel question data.
  • It uses targeted in-domain language alignment, enables effective utilization of English instruction data, and includes response alignment with cutting-edge English instruction data.
  • This framework improves multilingual reasoning capabilities of LLMs by transferring English expertise to non-English tasks.

28th of June 2024

LLM Critics Help Catch LLM Bugs

  • Focuses on self-correction or self-critique in the domain of real-world code bug fixing.
  • Finds that the majority of automatically generated critiques are better than human-written ones.

BMW Agents -- A Framework For Task Automation Through Multi-agent Collaboration

  • BMW Agents: Includes three main components for the LLM-based agents: Planning, Execution and Verification.
  • A task is retrieved from the task queue DB and the coordinator agent orchestrates the agent workflow. Includes Tools, Memory and Persona/Objectives.
  • Tool refiner has access to wide variety of tools, which it limits to subset of tools available for the agent in particular task.
  • Introduces: "Programmable Prompts", which generalises ReAct and PlanReAct by using iterative sequence consisting of pre-defined steps A...X.

Scaling Synthetic Data Creation with 1,000,000,000 Personas

  • Persona-Hub: Diverse 1B personas web dataset using persona-driven data synthesis method. Includes only main characteristics without fine-grained details.

27th of June 2024

Fundamental Problems With Model Editing: How Should Rational Belief Revision Work in LLMs?

  • Reviews model editing of LLMs.
  • Identifies existence of editable beliefs in LLMs.
  • Develops model editing benchmark.
  • Reviews difference between LLMs acting as agents vs. agent simulators.

Tools Fail: Detecting Silent Errors in Faulty Tools

  • Reviews LLM tool use failure recovery from "silent errors". Tool output is accurate only when: input is accurate, context is sufficient and tool makes correct predictions.
  • Introduces taxonomy for categorising tool-related errors and methods to recover from them (refine and recovery).
  • Identifies challenges in tool recovery: failure detection/fault assignment/recovery planning.

Simulating Classroom Education with LLM-Empowered Agents

  • SimClass: simulates multi-agent classroom teaching. Includes manager (observe/tutor/interact), teacher, assistant and classmate agents with the user.
  • Session controller manages modules: Class State Receptor, Function executor and Manager agent.
  • Observing uses class-states (class roles, learning materials and dialogue history). Tutoring functions include next page/teaching, which are only directed by the teacher. Interaction functions are performed agent to agent. Classmate agents have different roles like note taker, deep thinker, idea creator etc.

UniGen: A Unified Framework for Textual Dataset Generation Using Large Language Models

  • UniGen: LLM-based textual dataset generation approach, reviewed in benchmarking and data augmentation contexts.
  • Demonstrates that the data augmentation technique is effective and adds capabilities to the LLM, while discussing the technique's limitations in Appendix A, such as knowledge-intensive tasks. Knowledge-intensive tasks could instead benefit from Out-Of-Distribution data, still unmastered by the LLM.

Capturing Minds, Not Just Words: Enhancing Role-Playing Language Models with Personality-Indicative Data

  • RPLM (Role Playing Language Model): Develops RPLM with personality behaviours/traits/tendencies. Introduces RolePersonality-dataset based on 14 psychology dimensions, which is gathered using role-playing expert agent interviewing with questions based on the 14 dimensions.

LayoutCopilot: An LLM-powered Multi-agent Collaborative Framework for Interactive Analog Layout Design

  • LayoutCopilot: LLM-based analog layout design framework.

Computational Life: How Well-formed, Self-replicating Programs Emerge from Simple Interaction

  • Explores emergence of self-replicating programs. Introduces "high-order entropy"-metric to measure complexity of the system studied.

26th of June 2024

Symbolic Learning Enables Self-Evolving Agents

  • Agent Symbolic Optimizers: introduces agent symbolic learning framework. Optimizes symbolic components (prompts/tools/their orchestration) of the LLM agent. Attempts to optimize agent to solve real-world task by enabling LLM-agent to learn from data and self-evolve.
  • Proposes that the key to achieving AGI is to move from model-centric or engineering-centric to data-centric language agents, which learn and evolve autonomously in environments.
  • Agent symbolic learning optimizes symbolic network within language agents.

MAGIS: LLM-Based Multi-Agent Framework for GitHub Issue Resolution

  • MAGIS: LLM-based framework to resolve Github issues using four agents: Manager, Repository Custodian, Developer and Quality Assurance Engineer.
  • Reviews correlation in task success rate and task complexity/ability to locate relevant code line.
  • Planning part includes locating files/code, building team, kick-off meeting. Coding part includes developer producing code and then QAE validating it.

Lifelong Robot Library Learning: Bootstrapping Composable and Generalizable Skills for Embodied Control with Language Models

  • LRLL-agent (Lifelong Robot Library Learning): increases continuously the robot skill library by using soft memory module, self-guided exploration, skill abstractor and lifelong learning algorithm.
  • The framework is inspired by wake-sleep optimization, where wake phase (interacts with environment) is followed by sleep phase (agent reflects experiences).

Simulating The U.S. Senate: An LLM-Driven Agent Approach to Modeling Legislative Behavior and Bipartisanship

  • Reviews use of LLM to understand and improve legislative process.

Mental Modeling of Reinforcement Learning Agents by Language Models

  • XRL (eXplainable RL): Reviews LLMs' capacity to build mental models about RL agent behaviour. Finds that LLMs lack mental modeling capabilities about RL agents.
  • LLM-Xavier workflow: RL agent rolls a trajectory, which LLM-agent reasons to provide an answer. This evaluation is compared with the ground truth data.
  • Offers a way to explain behaviour of black-box RL agents.

AI-native Memory: A Pathway from LLMs Towards AGI

  • Claims AGI-like systems require AI-native memory, which is deep neural network parametrising different types of memories beyond language. Claims such Large Personal Model (LPM) would be unique for each person with every detail about the user for personalised generation.
  • Includes useful ideas about what data the personalised memory could include and the various levels of data granularity.

Role-Play Zero-Shot Prompting with Large Language Models for Open-Domain Human-Machine Conversation

  • Investigates role-play zero-shot prompting in conversational agent.

LLCoach: Generating Robot Soccer Plans using Multi-Role Large Language Models

  • LLCoach: Reviews advance planning capabilities of robots in dynamic/unstructured environments.
  • The system's offline component collects plans from video frames with the Coach VLM and refines them using an LLM, which retrieves actions from a vector DB and synchronises them into multi-agent plans. The online component retrieves and executes the plan most similar to the world model status.

Octo-planner: On-device Language Model for Planner-Action Agents

  • OctoPlanner: Separates planner/action-steps into OctoPlanner (planner) agent and Action agent (Octopus model) with function execution.
  • Planner agent divides tasks into sub-tasks.
  • Optimized for on-device usage through usage of fine-tuning instead of in-context learning.

25th of June 2024

Human-Object Interaction from Human-Level Instructions

  • Develops complete system to synthesize object motion, full-body motion and finger motion simultaneously.
  • Applies a high-level planner to generate the target scene layout/task plan and then uses low-level motion generation with a four-stage approach: CoarseNet, GraspPose, RefineNet and FingerNet.
  • Planner includes three stages: Generate spatial relationships between objects in natural language (to improve performance), calculate target layouts and generate detailed plan.

24th of June 2024

RES-Q: Evaluating Code-Editing Large Language Model Systems at the Repository Scale

  • Evaluates LLMs on repository-level coding. Claude Sonnet 3.5 outperforms GPT-4o by 12%.

21st of June 2024


GenoTEX: A Benchmark for Evaluating LLM-Based Exploration of Gene Expression Data in Alignment with Bioinformaticians

  • GenoAgent: LLM-based genomics data-analysis.

ESC-Eval: Evaluating Emotion Support Conversations in Large Language Models

  • ESC-Role: LLM-agent for Emotional Support Conversation (ESC) tasks. Includes ESC-Eval benchmark.

Autonomous Agents for Collaborative Task under Information Asymmetry

  • iAgents (Informative Multi-Agent Systems): multi-agent system based on human social network, where each person has an agent with access to information only from its user.
  • Introduces InformativeBench-benchmark to evaluate LLM task solving capability when agents have access to only part of the information (information asymmetry).
  • iAgents collaborate in a social network of 140 individuals and 588 relationships and communicate over 30 turns.

FlowBench: Revisiting and Benchmarking Workflow-Guided Planning for LLM-based Agents

  • FlowBench-benchmark: reviews workflow-guided (think flowcharts) planning capability of LLMs.

Direct Multi-Turn Preference Optimization for Language Agents

  • DMPO-loss function to optimize RL objectives in multiturn agent tasks.

Evaluating RAG-Fusion with RAGElo: an Automated Elo-based Framework

  • RAGElo-benchmark: reviews retrieval performance, including in RAG-Fusion use (fuses top-k retrievals).
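
Fusing several top-k result lists is commonly done with reciprocal rank fusion (RRF), where each document scores the sum of `1 / (k + rank)` over the lists it appears in. The sketch below illustrates that general technique; it is not taken from the paper's code, and the function name, the smoothing constant `k=60`, and the alphabetical tie-break are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]],
                           k: int = 60) -> List[str]:
    """Fuse ranked lists of document ids into a single ranking."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top document of a list contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically (assumed).
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    # Three hypothetical top-3 lists, e.g. one per generated query variant.
    lists = [["a", "b", "c"], ["b", "a", "d"], ["b", "c", "a"]]
    print(reciprocal_rank_fusion(lists))
```

Because only ranks are used, the lists can come from retrievers with incomparable score scales.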

DiPEx: Dispersing Prompt Expansion for Class-Agnostic Object Detection

  • DiPEx (Dispersing Prompt Expansion)-approach: Uses VLM and DiPEx to improve class-agnostic object detection.

Behaviour Distillation

  • Behaviour Distillation: compresses information for training expert policy in RL by learning synthetic data (HaDES-method) of state-action pairs without requiring the expert data.

Uni-Mol2: Exploring Molecular Pretraining Model at Scale

  • Uni-Mol2: 1.1B parameter model for molecular representation based on Uni-Mol+ architecture (two track transformer).

From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking

  • Survey on multimodal / VLM / LLM jailbreaking research.

20th of June 2024

Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning

  • Q*: Improves multi-step reasoning of LLMs through heuristic search planning in MDP.
  • Objective is to find most suitable reasoning with maximum utility.
  • Introduces multiple general approaches (offline RL/best sequence from rollout/completion with stronger LLM) to calculate the Q-value.
  • The approach works as such in various reasoning tasks.
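
The idea of searching for the reasoning trace with maximum utility, guided by an estimated Q-value, can be sketched as a best-first search. This is a minimal illustration under stated assumptions, not the paper's algorithm: the function name, the callbacks (`expand`, `is_terminal`, `utility`, `q_value`), the weight `lam`, and the toy problem are all hypothetical.

```python
import heapq
from typing import Callable, Hashable, Iterable, Optional


def best_first_reasoning(start: Hashable,
                         expand: Callable[[Hashable], Iterable[Hashable]],
                         is_terminal: Callable[[Hashable], bool],
                         utility: Callable[[Hashable], float],
                         q_value: Callable[[Hashable], float],
                         lam: float = 1.0,
                         max_expansions: int = 1000) -> Optional[Hashable]:
    """Expand the state with the highest utility + lam * q_value first."""
    def priority(state: Hashable) -> float:
        # heapq is a min-heap, so the score is negated.
        return -(utility(state) + lam * q_value(state))

    frontier = [(priority(start), start)]
    seen = set()
    for _ in range(max_expansions):
        if not frontier:
            return None
        _, state = heapq.heappop(frontier)
        if is_terminal(state):
            return state
        if state in seen:
            continue
        seen.add(state)
        for nxt in expand(state):
            heapq.heappush(frontier, (priority(nxt), nxt))
    return None


if __name__ == "__main__":
    # Toy problem: build a 3-step trace; step "a" is worth 1, step "b" is worth 0.
    best = best_first_reasoning(
        start=(),
        expand=lambda s: [s + ("a",), s + ("b",)],
        is_terminal=lambda s: len(s) == 3,
        utility=lambda s: s.count("a"),    # utility accumulated so far
        q_value=lambda s: 3 - len(s),      # optimistic estimate of future utility
    )
    print(best)
```

In a real setting `expand` would sample candidate next reasoning steps from the LLM and `q_value` would come from one of the estimation approaches listed above.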

GraphReader: Building Graph-based Agent to Enhance Long-Context Abilities of Large Language Models

  • GraphReader: LLM agent converts long text into graph structure to explore by performing step-by-step analysis and by generating detailed plan.
  • Achieves performance level of 128k context window LLM using 4k context window LLM by converting the long text into graph structure.
  • The LLM agent records insights from the explored graph and reflects current situation to optimize answer generation.

LLaSA: Large Multimodal Agent for Human Activity Analysis Through Wearable Sensors

  • LLaSA (Large Language and Sensor Assistant): Text query received is converted into text embedding and sensor reading into IMU embeddings (inertial measurement unit embeddings). Both inputs are passed to LLaSA model and its output to LLM to produce final answer.

Artificial Leviathan: Exploring Social Evolution of LLM Agents Through the Lens of Hobbesian Social Contract Theory

  • Evaluates LLM-based multi-agent society. This society includes psychological drives and social relationships.
  • Evaluates Hobbes's Social Contract Theory.

EvoAgent: Towards Automatic Multi-Agent Generation via Evolutionary Algorithms

  • EvoAgent: reviews specialized agents extension into multi-agent system through evolutionary pipeline.

Do LLMs Have Distinct and Consistent Personality? TRAIT: Personality Testset designed for LLMs with Psychometrics

  • Introduces TRAIT-personality test to review LLM personality.

Can LLMs Learn by Teaching? A Preliminary Study

  • Learning by Teaching (LbT): LbT includes three methods: Observing student feedback, learning from the feedback and learning iteratively.

MultiAgent Collaboration Attack: Investigating Adversarial Attacks in Large Language Model Collaborations via Debate

  • Persuasion by adversarial agent in multi-agent debate, which undermines shared interests.

19th of June 2024

Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs

  • Prism: evaluation framework separately reviews VLMs' perception and planning capabilities. Uses single LLM to compare various VLMs' (VLM Zoo) perception capabilities or uses multiple LLMs (LLM zoo) with single VLM to evaluate planning capabilities.

AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding

  • AlanaVLM: SOTA-level (surpasses in spatial reasoning) 7B VLM trained with EVUD-dataset for embodied and egocentric video understanding.
  • Introduces Egocentric Video Understanding Dataset (EVUD).

SpatialBot: Precise Spatial Understanding with Vision Language Models

  • SpatialBot: VLM trained with SpatialQA-dataset (includes VQAs with low, middle and high-level), which comprehends spatial information in three levels (point depth/depth description, proximity/object depth and spatial relationship/counting).
  • Introduces SpatialBench-benchmark to review VLMs' spatial understanding.

LIT: Large Language Model Driven Intention Tracking for Proactive Human-Robot Collaboration -- A Robot Sous-Chef Application

  • LIT (Language-driven Intention Tracking): LLM and VLM system, which tracks human actions from images using VLM to predict human intentions. Uses graph reasoning to generate plan steps with LLM.
  • The VLM generates for each image a captioning about what is being done by the human and predicts the likelihood of this task to relate to specific step in the plan.
  • Based on the predicted plan step, the system predicts the most likely next step being performed by the human.

18th of June 2024

Talk With Human-like Agents: Empathetic Dialogue Through Perceptible Acoustic Reception and Reaction

  • PerceptiveAgent: empathic multimodal agent, using acoustic information from speech for empathic responses adjusting to speaking style.
  • Captures more accurately speakers' real intentions (captions) and interacts (speech attributes) using adjusted tone for the context.
  • Framework includes three components: Speech captioner (Speech encoder, Q-former and text encoder), LLM and MSMA-Synthesizer (speaker embedder, Attribute embedder and HiFiGAN vocoder).

Problem-Solving in Language Model Networks

  • Represents each agent as a node; the nodes form a connected multi-agent network with self-reflection.
  • Finds self-reflection is useful, when surrounded by incorrect LLM-agents and less useful, when surrounded by LLM-agents providing correct answers.
  • LLM agents are likely to agree for consensus, when the LLM answer is correct. The LLM answer is more likely to be incorrect, when LLMs are more divided.

Ask-before-Plan: Proactive Language Agents for Real-World Planning

  • CEP-agent: multi-agent with three specialized Clarification (trajectory tuning schema)/Execution (static and dynamic)/Planning-agents.
  • Reviews Proactive Agent Planning, where the LLM agent must predict when to ask for clarification based on context from the conversation and environment interaction, invoke tool calls and generate a plan.
  • Trajectory tuning: fine-tunes clarification and execution agents with past trajectories in static setting.
  • Memory recollection: reuse self-reflective feedback from prior time steps.

AgentReview: Exploring Peer Review Dynamics with LLM Agents

  • AgentReview: LLM-based peer-review simulation framework of scientific papers such as related to NLP.
  • Includes three LLM-based roles: reviewers, authors and Area Chairs.
  • Review process includes: reviewer assessment, author-reviewer discussion, reviewer-area chair discussion, meta-review compilation and paper decision.

Identifying Performance-Sensitive Configurations in Software Systems through Code Analysis with LLM Agents

  • PerfSense: LLM-agent to review performance sensitive configurations of code bases.
  • Includes two LLM-agents: DevAgent and PerfAgent for code analysis of large codebases using limited-sized LLMs. Relies on prompt chaining and RAG (memory).

CodeNav: Beyond tool-use to using real-world codebases with LLM agents

  • CodeNav: LLM-agent navigates new unseen code repositories to solve user query by automatically indexing code blocks.
  • The agent automatically finds code snippets from the target code repository, imports the snippets and iteratively generates solution.

P-Tailor: Customizing Personality Traits for Language Models via Mixture of Specialized LoRA Experts

  • P-Tailor: MoE-based LLMs model the Big Five personality traits using specialized LoRA experts.
  • Models multiple traits such as openness.
  • Introduces PCD-dataset on personality traits in various topics.

MAGIC: Generating Self-Correction Guideline for In-Context Text-to-SQL

  • MAGIC: text-to-SQL multi-agent, which generates automatically self-correction guideline.
  • Framework includes three agents: manager (Planning, Tool and Memory), correction- and feedback-agents.

Large Language Models based Multi-Agent Framework for Objective Oriented Control Design in Power Electronics

  • Includes a multi-agent framework with Manager/Objective design/Model design/Control algorithm design/Control parameter design/Control verification-agents. Uses various tools: model tool, control algorithm tool, optimization tool and Verify tool. Applied in Power electronics-domain.

The Power of LLM-Generated Synthetic Data for Stance Detection in Online Political Discussions

  • Stance detection on political discussion with LLMs and synthetic data with significant improvement on accuracy.

VoCo-LLaMA: Towards Vision Compression with Large Language Models


17th of June 2024

MASAI: Modular Architecture for Software-engineering AI Agents

  • MASAI (Modular Architecture for Software-engineering AI): multiple LLM-agents are tasked with sub-objectives and strategies to achieve those objectives in modular approach. Avoids long trajectories of LLM agents, enables gathering information from different sources and usage of specific problem solving strategies.
  • Includes five different sub-agents: Test template generator, Issue reproducer, Edit localizer (finds files related to buggy code), Fixer and Ranker (observes the patches passing the test).

Instruct, Not Assist: LLM-based Multi-Turn Planning and Hierarchical Questioning for Socratic Code Debugging

  • TreeInstruct (Socratic questioning): Includes three roles Teacher, Student and Verifier. Asks clarifying questions to help students independently resolve errors by estimating students' conceptual knowledge using a dynamically generated question tree based on student answers.
  • Uses state space estimation to plan the conversation by identifying distance between student initial answer and the optimal answer.
  • Dynamic conversation restructuring to update conversational plan based on student progress for both questioning and teaching.
  • State space estimation works by using specific task categories, where LLM-verifier reviews student answer for each task-category as either failed or correct.
  • Tree nodes represent instructor questions and edges reflect the paths to new level of understanding.

Input Conditioned Graph Generation for Language Agents

  • Language Agents as Graphs.
  • Dynamic and learnable agents by using LLMs as graphs. Attempts to learn a model, which generates edges for every input of the LLM in order to represent the flow of communication in the graph.
  • Outperforms static approaches by 6% in MMLU.

Pre-Training and Personalized Fine-Tuning via Over-the-Air Federated Meta-Learning: Convergence-Generalization Trade-Offs


GUICourse: From General Vision Language Models to Versatile GUI Agents

  • VLMs trained with the GUICourse-dataset suite outperform GPT-4V in multiple benchmarks, improving navigation capability.
  • Introduces GUICourse-dataset suite (GUIEnv for OCR and grounding, GUIAct for website and Android knowledge of GUIs and GUIChat to improve conversational dialogue/QA-skills with images) for training visual-based GUI agents from generic VLMs.

CLARA: Classifying and Disambiguating User Commands for Reliable Interactive Robotic Agents

  • CLARA: classification of user commands to robots as infeasible/ambiguous.

Embodied Question Answering via Multi-LLM Systems

  • CAM (Central Answer Model): Embodied QA multi-agent framework, where multiple individual LLM-agents respond to queries about household environment.

14th of June 2024

GuardAgent: Safeguard LLM Agents by a Guard Agent via Knowledge-Enabled Reasoning

  • GuardAgent: guardrails-agent for LLMs based on knowledge-enabled reasoning.
  • Includes task-planning, action plan, memory, tools and code generation and execution.
  • Task planning includes: specification of the target agent, guard request (things the agent cannot perform based on the target agent profile) and target agent (inputs, outputs and logs).

VideoGUI: A Benchmark for GUI Automation from Instructional Videos

  • VideoGUI-benchmark: Automation using instructional videos in visual GUI tasks.
  • Failure modes include: High-level planning, middle-level planning and atomic action execution.
  • Pipeline includes: video selection, human demonstration, manual annotation and review & creation.

Details Make a Difference: Object State-Sensitive Neurorobotic Task Planning

  • OSSA (Object-State-Sensitive Agent): Reviews VLMs' and LLMs' capacity to generate object-state sensitive plans. Includes two methods: LLM-based (modular) and VLM-based (monolithic).

TRIP-PAL: Travel Planning with Guarantees by Combining Large Language Models and Automated Planners

  • TRIP-PAL: Combines LLMs and automated planners to generate travel plans.
  • Includes Travel information retrieval, LLM-based planner and Automated Planning.

Rapport-Driven Virtual Agent: Rapport Building Dialogue Strategy for Improving User Experience at First Meeting

  • Free Rapport Agent: Builds a rapport-oriented dialogue agent with focus on user engagement through small talk.
  • Identifies strategies for rapport-techniques.
  • The Free Rapport Agent achieves superior ratings in categories such as naturality, satisfaction, usability and rapport aspects. A potential future research direction is investigating rapport with TSS-models.

Bridging the Communication Gap: Artificial Agents Learning Sign Language through Imitation

  • URDF-model: Agents acquire non-verbal communication skills by imitating sign language gestures for words from RGB video.
  • Learns 5 different signs involving upper body.

RoboGolf: Mastering Real-World Minigolf with a Reflective Multi-Modality Vision-Language Model

  • RoboGolf: plays real-world minigolf.
  • Framework includes dual-camera input with VLM, inner closed-loop control (reasoning, action, robot arm execution, execution result, evaluation and recovery from failure modes) and outer closed-loop reflective equilibrium (active feedback, counterfactual reasoning).

SkySenseGPT: A Fine-Grained Instruction Tuning Dataset and Model for Remote Sensing Vision-Language Understanding

  • SkySenseGPT: dataset for remote sensing vision-language understanding.

First Multi-Dimensional Evaluation of Flowchart Comprehension for Multimodal Large Language Models

  • Flowchart comprehension with VLM. Includes logical verification, information extraction, localization recognition, reasoning and summarization.

HIRO: Hierarchical Information Retrieval Optimization

  • HIRO (Hierarchical Information Retrieval Optimization): RAG query approach using hierarchical structures to store information.

DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning


4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities


13th of June 2024

StreamBench: Towards Benchmarking Continuous Improvement of Language Agents

  • StreamBench-benchmark: simulated learning environment, where LLM receives continuous feedback to iteratively improve performance.
  • Reviews the LLM's self-improving capability in online-setting, instead of only fixed offline-benchmarks.

Multi-Agent Software Development through Cross-Team Collaboration

  • CTC (Cross-Team-Collaboration): creates a multi-agent-framework of LLM-agent teams jointly collaborating to make decisions, communicate insights and generate solutions.
  • For example generates different phases: design, coding and testing, which each include sub-tasks. Various agents collaborate to generate ideas from tasks, which are then converted into final code via multi-turn chat chain.

RL-JACK: Reinforcement Learning-powered Black-box Jailbreaking Attack against LLMs

  • RL-Jack: Designs a novel Deep Reinforcement Learning method to generate novel black-box jailbreaking prompts.
  • Formulates the search of jailbreaking prompts as a search planning problem.

When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search

  • RLBreaker: black-box jailbreaking with Deep Reinforcement Learning agent from mainly same authors as the RL-Jack paper.
  • Formulates the search of jailbreaking prompts as a search planning problem.

Batch-Instructed Gradient for Prompt Evolution: Systematic Prompt Optimization for Enhanced Text-to-Image Synthesis

  • Multi-agent prompting for text-to-image generation by dynamic instructions. The instructions evolve iteratively with feedback and with a database of professional prompts.

12th of June 2024

MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents

  • MobileAgentBench-benchmark: Highlights issues in current benchmarks related to Scalability and Usability, Robustness and Flexibility and Realistic environment.

A Dialogue Game for Eliciting Balanced Collaboration

  • Studies flexible and balanced role-taking with LLM agents in social dialogue.

Unique Security and Privacy Threats of Large Language Model: A Comprehensive Survey

  • A survey, which reviews threats and protective measures on privacy and security concerns with LLMs in five stages: pre-training/fine-tuning/RAG system/deploying/LLM-based agent.

Can Large Language Models Understand Spatial Audio?

  • Multichannel audio understanding with LLMs.

11th of June 2024

Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B

  • Introduces MCT Self-Refine (MCTSr): integrates LLM with MCTS.
  • Improves reasoning on MATH and complex math Olympiad problems.
  • Includes selection, self-refine, self-evaluation and backpropagation-processes.
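
The selection and backpropagation steps named above can be illustrated with the standard UCT rule from Monte Carlo Tree Search. This is a generic sketch, not the paper's exact formulation (which may use a modified bound); the dictionary-based node layout, the exploration constant `c=1.4`, and the function names are assumptions. The self-refine and self-evaluation steps would be LLM calls and are left out.

```python
import math
from typing import Dict, List

Node = Dict[str, float]  # assumed layout: {"reward": total reward, "visits": count}


def uct_score(total_reward: float, visits: int,
              parent_visits: int, c: float = 1.4) -> float:
    """Standard UCT: average reward plus an exploration bonus."""
    if visits == 0:
        return math.inf  # unvisited children are always tried first
    exploit = total_reward / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_child(children: List[Node], parent_visits: int, c: float = 1.4) -> Node:
    """Selection: pick the child with the highest UCT score."""
    return max(children,
               key=lambda ch: uct_score(ch["reward"], ch["visits"], parent_visits, c))


def backpropagate(path: List[Node], reward: float) -> None:
    """Backpropagation: add the evaluated reward to every node on the path."""
    for node in path:
        node["visits"] += 1
        node["reward"] += reward


if __name__ == "__main__":
    root = {"reward": 4.0, "visits": 10}
    children = [{"reward": 3.0, "visits": 4}, {"reward": 1.0, "visits": 6}]
    chosen = select_child(children, parent_visits=root["visits"])
    backpropagate([root, chosen], reward=0.5)
    print(chosen, root)
```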

DARA: Decomposition-Alignment-Reasoning Autonomous Language Agent for Question Answering over Knowledge Graphs

  • DARA (Decomposition-Alignment-Reasoning Autonomous Language Agent): solves formal queries by high-level iterative task decomposition and low-level task grounding.
  • Makes it possible to train DARA with a small number of high-quality reasoning trajectories.
  • SOTA-level performance: Fine-tuned DARA (Llama-2-7B) zero-shot outperforms agents using GPT-4 In-context learning.
  • Iteratively performs task decomposition and task grounding.

RS-Agent: Automating Remote Sensing Tasks through Intelligent Agents

  • RS-Agent (Remote-Sensing Agent): LLM-based remote sensing agent.

World Models with Hints of Large Language Models for Goal Achieving

  • DLLM (Dreaming with Large Language Models): multi-modal model RL, which uses natural hints/goals from LLM in long-horizon tasks.
  • The use of LLM to propose sub-goals (or language hints) improves goal discovery and efficiency of exploration.

DCA-Bench: A Benchmark for Dataset Curation Agents

  • DCA-Bench-benchmark for dataset curation agents.

A Synthetic Dataset for Personal Attribute Inference

  • SynthPAI: synthetic dataset of 7800 comments labelled with personal attributes to investigate misuse of profiling personal attributes from public data.
  • Starts by generating synthetic profiles of LLM agents (each with 8 personal attributes: age/sex/income level/location/birthplace/education/occupation/relationship status), generates chats with these agents and uses LLM agents to add labels (sex, age etc).

Advancing Tool-Augmented Large Language Models: Integrating Insights from Errors in Inference Trees

  • ToolPrefer-LLaMA (TP-LLaMA): Inference trajectory optimization by fine-tuning with expert demonstrations and then optimizing with DPO by using the ToolPreference-dataset.
  • Introduces ToolPreference-dataset, which includes tool-augmented LLM successful/failed exploration trees from ToolBench-dataset.
  • Reasons with Depth-First Search (DFS) by constructing expert trajectories with decision trees (Tree-of-Thought), where each tree represents LLM thought/API response/API/decision on an API call.

10th of June 2024

FinVerse: An Autonomous Agent System for Versatile Financial Analysis

  • FinVerse: financial information processing agent, which connects to 600 APIs. Plans to open source the dataset.

9th of June 2024

A Survey on LLM-Based Agentic Workflows and LLM-Profiled Components

  • Survey on LLM agentic workflows and LLM-Profiled Components (LLMPCs)

A Review of Prominent Paradigms for LLM-Based Agents: Tool Use (Including RAG), Planning, and Feedback Learning

  • Introduces a survey on LLM-agents with tool use/RAG/planning/feedback learning.

Artificial Intelligence as the New Hacker: Developing Agents for Offensive Security

  • ReaperAI: designs an autonomous AI agent to design and simulate cyberattack scenarios.

7th of June 2024

Mixture-of-Agents Enhances Large Language Model Capabilities

  • Mixture-of-Agents (MoA): MoA-architecture, where LLM agents are stacked into layers on top of each other. Takes advantage of the phenomenon where LLM output tends to get better when the model receives another LLM's output as input (even from a smaller LLM).
  • An agent in a given layer takes the output from the previous layer as an input to generate its output.
  • Implements Together MoA, which achieves SOTA-performance in various benchmarks, surpassing GPT-4 Omni.
  • The MoA ranker selects answers more accurately than LLM alone and tends to select best answer.
  • The model has a limitation in Time-to-First-Token (TTFT), because the prior level model output is required to produce the next level output.
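
The layered flow described above can be sketched as follows. This is a minimal illustration, not the Together MoA implementation: `make_stub_agent`, `run_moa` and the stub agents are hypothetical names, and each "agent" is a plain function standing in for an LLM call.

```python
from typing import Callable, List

# Hypothetical agent signature: (user prompt, previous-layer outputs) -> answer.
Agent = Callable[[str, List[str]], str]


def make_stub_agent(name: str) -> Agent:
    """Stand-in for an LLM call; reports how many reference answers it saw."""
    def agent(prompt: str, references: List[str]) -> str:
        return f"{name}: answer to '{prompt}' using {len(references)} references"
    return agent


def run_moa(prompt: str, layers: List[List[Agent]], aggregator: Agent) -> str:
    """Pass the prompt through stacked layers of agents."""
    references: List[str] = []
    for layer in layers:
        # Every agent in a layer sees the prompt plus all outputs of the previous layer.
        references = [agent(prompt, references) for agent in layer]
    # A final aggregator turns the last layer's outputs into one answer.
    return aggregator(prompt, references)


layers = [
    [make_stub_agent("a1"), make_stub_agent("a2")],
    [make_stub_agent("b1"), make_stub_agent("b2"), make_stub_agent("b3")],
]
final_answer = run_moa("What is MoA?", layers, make_stub_agent("agg"))
print(final_answer)
```

The loop over `layers` is strictly sequential, which illustrates the TTFT limitation noted above: no output token of the final answer can be produced before every earlier layer has finished.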

SelfGoal: Your Language Agents Already Know How to Achieve High-level Goals

  • SelfGoal: Divides high-level goals into tree-structure with practical sub-goals.
  • Improves performance of LLM-agents in various tasks.

Language Guided Skill Discovery

  • LGSD (Language Guided Skill Discovery): reviews language guided skill discovery using LLM.
  • LLM converts input into semantically distinct skills in order for the agent to visit semantically unique states.

6th of June 2024

Open-Endedness is Essential for Artificial Superhuman Intelligence

  • Defines open-endedness in the context of ASI: "From the perspective of an observer, a system is open-ended if and only if the sequence of artifacts it produces is both novel and learnable."

On the Effects of Data Scale on Computer Control Agents

  • Releases new AndroidControl-dataset with 15k demonstrations of everyday tasks in Android apps.
  • Tests an Android agent, which receives task information and pre-processes the screen using accessibility trees / HTML (so, not using the screenshot directly) to include only UI elements with a text description, creating a textual representation of the screen.
  • Includes prompts used and references on the accessibility tree / html performance against directly interpreting the screenshot.
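
The screen pre-processing step above can be sketched roughly as below. The node fields (`class`, `text`, `content_description`) and the output format are assumptions made for illustration, not the paper's actual representation.

```python
from typing import Dict, List


def screen_to_text(nodes: List[Dict[str, str]]) -> str:
    """Keep only UI elements that carry a text description and render them as lines."""
    lines = []
    for index, node in enumerate(nodes):
        # Hypothetical field names; real accessibility trees expose similar attributes.
        label = (node.get("text") or node.get("content_description") or "").strip()
        if not label:
            continue  # Elements without any textual description are dropped.
        lines.append(f"[{index}] {node.get('class', 'View')}: {label}")
    return "\n".join(lines)


example_nodes = [
    {"class": "Button", "text": "Send"},
    {"class": "ImageView"},
    {"class": "EditText", "content_description": "Message"},
]
print(screen_to_text(example_nodes))
```

Keeping the original element index in the text lets an agent refer back to a concrete UI element when it chooses an action.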

Aligning Agents like Large Language Models

  • Aligns a 3D video game agent using RLHF similarly as fine-tuning a LLM.
  • The agent receives only the image input and outputs action from one of the 12 buttons or 2 joysticks.

AgentGym: Evolving Large Language Model-based Agents across Diverse Environments

  • AgentGym-framework: Generally capable LLM agent with self-evolution ability.
  • Exposes agents to multiple diverse environments, providing a basic trajectory set, and applying the novel AgentEvol method for self-evolution.
  • AgentEvol: Benchmark to evaluate self-evolution capability over new tasks and environments.

5th of June 2024

The Good, the Bad, and the Hulk-like GPT: Analyzing Emotional Decisions of Large Language Models in Cooperation and Bargaining Games

  • Simulates human behaviour using LLMs and finds emotions impact the LLM performance to simulate human-like behaviour.
  • Finds specifically that the angry emotional state aligns surprisingly well with real human behaviour.
  • GPT-4 responds rationally even when prompted with strong emotions.

DriVLMe: Enhancing LLM-based Autonomous Driving Agents with Embodied and Social Experiences

  • DriVLMe: autonomous driving agent, which reads video input, uses route planner for shortest route. The model uses the video token and textual tokens about: current instruction, dialogue history and action history to produce dialogue response and the physical action to the simulator.
  • Identifies several challenges, which are applicable in other domains using LLM agents.

4th of June 2024

Chain of Agents: Large Language Models Collaborating on Long-Context Tasks

  • Chain-of-Agents (CoA): Addresses long-context problems by using multi-agent collaboration to add information and reason with LLMs.
  • Consists of two stages: first, the text is divided into small chunks, each handled by a worker LLM agent, and the worker agents synthesize information sequentially. Then a manager agent consumes the workers' output to produce the final answer.
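
A minimal sketch of this worker/manager flow is shown below. The chunking by word count and the names `split_into_chunks`, `chain_of_agents`, `stub_worker` and `stub_manager` are illustrative assumptions; in practice the worker and manager would be LLM calls.

```python
from typing import Callable, List


def split_into_chunks(text: str, chunk_words: int) -> List[str]:
    """Split text into consecutive chunks of at most chunk_words words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_words]) for i in range(0, len(words), chunk_words)]


def chain_of_agents(
    text: str,
    question: str,
    worker: Callable[[str, str, str], str],
    manager: Callable[[str, str], str],
    chunk_words: int = 50,
) -> str:
    """Workers read chunks in order and pass a running summary down the chain."""
    summary = ""
    for chunk in split_into_chunks(text, chunk_words):
        summary = worker(chunk, summary, question)
    # The manager only sees the last worker's summary, not the full text.
    return manager(summary, question)


def stub_worker(chunk: str, previous_summary: str, question: str) -> str:
    """Hypothetical worker: keeps the first word of each chunk as its 'evidence'."""
    return (previous_summary + " " + chunk.split()[0]).strip()


def stub_manager(summary: str, question: str) -> str:
    """Hypothetical manager: formats the accumulated evidence as the answer."""
    return summary.upper()


answer = chain_of_agents(
    "alpha beta gamma delta epsilon zeta", "toy question", stub_worker, stub_manager, chunk_words=2
)
print(answer)
```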

CoNav: A Benchmark for Human-Centered Collaborative Navigation

  • CoNav-benchmark: 3D-navigation environment, which tests the ability to reason about human intentions and navigate collaboratively.
  • Proposes an intention-aware agent, which observes humans, avoids human collision and navigates to the destination.
  • Uses panoramic depth-camera view (RGB-D images), historical views, history trajectories and agent pose. Includes ResNet-object detector, Intention predictor (long-term and short-term) for intended activity/object/trajectory and agent pose (GPS and compass sensor).

MARS: Benchmarking the Metaphysical Reasoning Abilities of Language Models with a Multi-task Evaluation Dataset

  • Mars (MetAphysical ReaSoning)-benchmark: measures metaphysical reasoning capability: the ability of the agent to adapt to situational transitions triggered by environment changes in order to act in a conscious way with the environment.
  • Agents face a challenge in the environment due to the infinite possible changes triggered by an event. The benchmark systematically reviews reasoning of the LLMs in such situations regarding changes in actions, states caused by changed actions and situational transitions caused by changes in actions.
  • SOTA models struggle even after fine-tuning in this benchmark.

3rd of June 2024

SpatialRGPT: Grounded Spatial Reasoning in Vision Language Model

  • SpatialRGPT: Spatial understanding with VLMs by using depth maps together with RGB images for geometric reasoning.
  • Introduces SpatialBench-benchmark.

2nd of June 2024

A Survey of Useful LLM Evaluation

  • Reviews LLMs core capabilities from three perspectives: reasoning, societal and domain knowledge.

Teams of LLM Agents can Exploit Zero-Day Vulnerabilities

  • HPTSA: a planning agent explores the environment and decides which subagents to use for exploiting zero-day vulnerabilities.

31st of May 2024

SaySelf: Teaching LLMs to Express Confidence with Self-Reflective Rationales

  • SaySelf: produces self-reflective rationales on uncertainty and confidence estimates.

LACIE: Listener-Aware Finetuning for Confidence Calibration in Large Language Models

  • LACIE: LLM listener model, which reviews the confidence of a given answer to a question and is fine-tuned on preference data built from non-expert LLM listener confidence data.

30th of May 2024

Group Robust Preference Optimization in Reward-free RLHF

  • GRPO (Group Robust Preference Optimization): is a method to align LLMs to individual groups' preferences robustly.
  • It seeks a robust policy, maximizes worst-case group performance, adaptively weights groups, prioritizes groups with worse cumulative loss, and is theoretically studied for log-linear policy class.
  • It significantly improves performance for worst-performing groups, reduces loss imbalances, and improves probability accuracies.
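
The idea of adaptively up-weighting groups with worse loss can be illustrated with a small sketch. The exponentiated (multiplicative-weights) update below is an assumption chosen for illustration, not necessarily the exact update used in the paper; `update_group_weights`, `robust_objective` and `step_size` are hypothetical names.

```python
import math
from typing import Dict


def update_group_weights(
    weights: Dict[str, float], group_losses: Dict[str, float], step_size: float = 1.0
) -> Dict[str, float]:
    """Multiplicative-weights step: groups with higher loss receive more weight."""
    scaled = {g: w * math.exp(step_size * group_losses[g]) for g, w in weights.items()}
    total = sum(scaled.values())
    return {g: value / total for g, value in scaled.items()}


def robust_objective(weights: Dict[str, float], group_losses: Dict[str, float]) -> float:
    """Weighted loss the policy would minimize; approaches the worst group as weights concentrate."""
    return sum(weights[g] * group_losses[g] for g in weights)


weights = {"group_a": 0.5, "group_b": 0.5}
losses = {"group_a": 0.2, "group_b": 1.0}
new_weights = update_group_weights(weights, losses)
print(new_weights, robust_objective(new_weights, losses))
```

Repeating this step while the policy minimizes the weighted loss pushes training effort toward the currently worst-performing group.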

Towards Hierarchical Multi-Agent Workflows for Zero-Shot Prompt Optimization

  • HMAW (Hierarchical Multi-Agent Workflow): generic prompt optimization technique, which includes CEO layer, manager prompt, manager layer, worker prompt and worker layer.
  • The HMAW automated prompting method is zero-shot, task agnostic and query-specific.

Nadine: An LLM-driven Intelligent Social Robot with Affective Capabilities and Human-like Memory

  • Nadine: Social robot, LLM agent based on SoR-ReAct. Includes perception, interaction and robot control.
  • Perception includes skeleton tracking, action recognition, face recognition, emotion recognition, audio localization and speech recognition.
  • Interaction module includes world/user representation, long-term memory, knowledge, user interaction, emotional analysis, short-term memory, emotions, mood, personality, internet search, new search, wikipedia, weather search and behaviour generation.
  • Robot control includes gaze, gesture/pose, facial expression, lip synchronization, animation engine, actuator control and speech synthesis.

Parrot: Efficient Serving of LLM-based Applications with Semantic Variable

  • Parrot: E2E LLM service for LLM applications in Python.
  • Proposes "Semantic Variable", to program LLM applications using single pipeline to multiple LLM service providers.
  • Includes interesting insights about serving LLM models / applications when served at large scale.

Auto Arena of LLMs: Automating LLM Evaluations with Agent Peer-battles and Committee Discussions

  • Auto-Arena: automatic evaluation of LLMs.
  • Examiner LLM creates prompts, two LLMs engage in a multi-turn conversation on the prompt to reveal differences in performance, and LLM judges discuss the performance of the different LLM agents to pick the better LLM.

From Words to Actions: Unveiling the Theoretical Underpinnings of LLM-Driven Autonomous Systems

  • PAR (Planner-Actor-Reporter) system with LLM agents: uses hierarchical RL model with LLM handling high-level planning and low level execution.

Large Language Models Can Self-Improve At Web Agent Tasks

  • Reviews LLM agents self-improvement capability.

CausalQuest: Collecting Natural Causal Questions for AI Agents

  • CausalQuest: Trains a classifier for identifying causal questions, reviews causal question types and formalizes the definition of the "causal question". Introduces dataset for causal questions.

Learning to Discuss Strategically: A Case Study on One Night Ultimate Werewolf

  • RL-based LLM agent to play ONUW-game. Includes belief-modelling (observation-belief), discussion tactic selection (discussion tactic candidates, discussion policy) and decision making (action phase).

29th of May 2024

Artificial Intelligence Index Report 2024

  • Yearly AI Index Report 2024.

STAT: Shrinking Transformers After Training

  • STAT: a structured pruning approach that compresses a Transformer into a smaller size without fine-tuning, taking 1 minute to compress a BERT model or 3 hours for a 7B parameter model with 1 GPU.

Adaptive In-conversation Team Building for Language Model Agents

  • Captain Agent: Adaptive team building with LLM agents: Adaptive builder-agent, Reflector-agent and LLM agent team.

Contextual Position Encoding: Learning to Count What's Important

  • CoPE (Contextual Position Encoding): LLM attention mechanism, which can attend to the i-th sentence and not only the i-th token.
  • CoPE solves new tasks on which position embeddings fail.
  • Uses context vectors to count which tokens to pay attention to.
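
The counting idea can be sketched as follows, assuming a formulation in which each earlier token gets a sigmoid gate from the query-key dot product and a token's contextual position is the sum of gates between it and the current token. The function names are hypothetical and the sketch omits the interpolation of position embeddings.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def contextual_positions(query: List[float], keys: List[List[float]]) -> List[float]:
    """Return a fractional position for each key; keys[-1] is the current token.

    Assumed formulation: gate_j = sigmoid(query . key_j) and
    position_j = sum of gates from token j up to the current token.
    """
    gates = [sigmoid(dot(query, key)) for key in keys]
    positions: List[float] = []
    running = 0.0
    for gate in reversed(gates):
        running += gate  # Only tokens whose gate is open increase the count.
        positions.append(running)
    positions.reverse()
    return positions


# The middle token has a closed gate, so it does not advance the count.
positions = contextual_positions([10.0], [[1.0], [-1.0], [1.0]])
print(positions)
```

If the gates open only on sentence-ending tokens, the resulting positions count sentences rather than tokens.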

28th of May 2024

Faithful Logical Reasoning via Symbolic Chain-of-Thought

  • Symbolic CoT: to improve logical reasoning.
  • Uses four step approach.

A Human-Like Reasoning Framework for Multi-Phases Planning Task with Large Language Models

  • Introduces a multi-stage Human-like planning framework with LLM-agents.

27th of May 2024

An Introduction to Vision-Language Modeling

  • Reviews VLMs: VLM model types, training and evaluation of them.

24th of May 2024

Large Language Model Sentinel: Advancing Adversarial Robustness by LLM Agent

  • LLAMOS (Large LAnguage MOdel Sentinel): adversarial attack protection technique, where LLM prompts are reviewed before being sent to the target LLM and, if necessary, the adversarial input is replaced with a purified version.
  • The LLM input is converted into an adversarial example, which the target LLM would interpret as invalid. In such a case, the system would create a purified version of the prompt, which would be accepted by the target LLM.

9th of May 2024

Smurfs: Leveraging Multiple Proficiency Agents with Context-Efficiency for Tool Planning

  • Smurfs: multi-agent LLM: prompting technique for unique roles to facilitate collaboration between specialized agents.
  • Outperforms GPT-4 model performance in ToolBench I2/I3 with Mistral 7B model.
  • Includes: Planning (task decomposition), Executor (choosing/executing tools), Answer, Verifier agents.
  • Uses to-do list, local memory, tool doc and global memory. Tool errors are managed either by deleting the tool or by restarting the tool-step.
  • Executor agent flow includes: hint, thought, tool list, action, local memory, tool doc and action input.
  • Paper includes exact prompts used for each agent.

Supporting Physical Activity Behavior Change with LLM-Based Conversational Agents

  • GPTCoach: Physical activity behaviour change with LLMs. Uses prompt chains: Dialogue state manager, Strategy prediction, Response generation, Tool call prediction, tool call generation and execution of tool call.

Air Gap: Protecting Privacy-Conscious Conversational Agents

  • AirGapAgent: privacy-conscious LLM agent, which limits leaking private data by limiting data (minimization prompts) provided to the agent.
  • Introduces context-hijacking and refers to contextual integrity. Introduces an adversarial threat model attempting to extract private data.
  • Components include User data, Minimizer LM, task, privacy directive, which are sealed by AirGap to minimize user data given to the environment.
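
The data-minimization idea can be illustrated with a toy filter. In the paper the minimizer is a language model; the lookup table `TASK_RELEVANT_FIELDS` and the function `minimize_user_data` below are hypothetical stand-ins used only to show the data flow.

```python
from typing import Dict, Set

# Hypothetical mapping from task to the user-data fields that task legitimately needs.
TASK_RELEVANT_FIELDS: Dict[str, Set[str]] = {
    "book_restaurant": {"name", "dietary_restrictions"},
    "book_doctor": {"name", "health_conditions"},
}


def minimize_user_data(user_data: Dict[str, str], task: str) -> Dict[str, str]:
    """Return only the fields deemed relevant for the task; everything else is withheld."""
    allowed = TASK_RELEVANT_FIELDS.get(task, set())
    return {field: value for field, value in user_data.items() if field in allowed}


user_data = {
    "name": "Alex",
    "dietary_restrictions": "vegetarian",
    "health_conditions": "asthma",
}
shared = minimize_user_data(user_data, "book_restaurant")
print(shared)
```

Because the conversational agent only ever receives `shared`, a later context-hijacking attempt cannot extract fields that were never passed to it.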

Truthful Aggregation of LLMs with an Application to Online Advertising

  • Reviews usage of LLMs as advertising platforms by balancing user satisfaction vs. influencing via ads to LLM responses.

7th of May 2024

NeurDB: An AI-powered Autonomous Data System

  • NeurDB: AI system combining AI model and the DB.
  • Includes interesting discussion and design choices for next generation DBs.

Iterative Experience Refinement of Software-Developing Agents

  • Iterative Experience Refinement: Autonomous agents with LLMs adjust experiences iteratively when executing the task.
  • Introduces two patterns: successive pattern (based on nearest experiences in task batch) and cumulative pattern (acquiring experiences from all task batches).

Unveiling Disparities in Web Task Handling Between Human and Web Agent

  • Studies VLM and LLM capability to perform web tasks.
  • Compares web agent and human-like behaviour.

Deception in Reinforced Autonomous Agents: The Unconventional Rabbit Hat Trick in Legislation

  • Reviews deception by autonomous agents.
  • Highlights a concern in autonomous agents: potentially triggering humans towards its programmed goal.

Verified Neural Compressed Sensing

  • This DeepMind study opens an avenue for neural networks to solve mathematical and scientific problems, which are automatically verifiable to be correct without any human intervention.

Iterative Experience Refinement of Software-Developing Agents

  • Iterative Experience Refinement: SW-Agents adapt and improve iteratively during task execution.
  • Refines from the nearest experience within a task batch and cumulatively acquires experiences from all prior batches. Includes experience elimination, where high-quality experiences are prioritized.

Policy Learning with a Language Bottleneck

  • Policy Learning with Language Bottleneck (PLLB): AI-agents using rule-generation stage (LLMs) and update stage (learn new policies).
  • Demonstrate generalizable behaviour.

6th of May 2024

Advancing Multimodal Medical Capabilities of Gemini

  • Med-Gemini: SOTA-level medical reasoning (medical image classification/VQA/report generation/genomic risk prediction) in 17 out of 20 benchmarks.
  • Different data modalities use one of the three unique visual encoders, which are separated to own models.
  • Med-Gemini-2D (conventional 2D images: chest X-ray/CT slices/pathology patches), Med-Gemini-3D (3D medical data like CT), and Med-Gemini-Polygenic (non image features like genomics).

AlphaMath Almost Zero: process Supervision without process

  • Super Mario (from Alibaba group): Applies a novel AlphaMath-method, which uses MCTS to improve LLM math reasoning skills without human-annotated solution process.
  • The approach objective is to generate a MCTS Value Model, which is able to confidently review partial solution to a math problem, so the LLM can generate the next reasoning steps. The value model training requires definition of reward or Policy model.
  • AlphaMath includes three stages: Data collection of math problems and answer pairs as first step. MCTS evaluation generates solution paths (correct/incorrect) and evaluates node values. Policy model and Value model are optimized with the MCTS generated data and the model is Iteratively trained.
  • Achieves SOTA-level math benchmark results of 81.4 (GSM8K) and 63.7 (MATH) using a 7B parameter model.
  • The training data includes 15k question-answer pairs, but this data does not include human-annotated solutions.

Animate Your Thoughts: Decoupled Reconstruction of Dynamic Natural Vision from Slow Brain Activity

  • Mind Animator: Maps human dynamic vision from brain activity between fMRI (semantic/structural/motion features) and video.
  • Achieves SOTA-level performance.

Enhancing Q-Learning with Large Language Model Heuristics

  • LLM-guided Q-learning.

Large Language Models (LLMs) as Agents for Augmented Democracy

  • LLMs predict individual political preferences with 69%-76% accuracy.

Meta-Evolve: Continuous Robot Evolution for One-to-many Policy Transfer

  • Meta-Evolve-method: transfer expert policy from source robot to multiple target robots using continuous robot evolution.

Position Paper: Leveraging Foundational Models for Black-Box Optimization: Benefits, Challenges, and Future Directions

  • DeepMind research on Black-box optimization.

Conformity, Confabulation, and Impersonation: Persona Inconstancy in Multi-Agent LLM Collaboration

  • Reviews LLMs difficulty to consistently apply specific cultural persona.

Self-Improving Customer Review Response Generation Based on LLMs

  • SCRABLE (Self-improving Customer Review Response Automation Based on LLMs): Self-improves prompts and uses LLM-as-a-Judge-mechanism.
  • Customized and automated prompt engineering (LLM as the prompt generator) increases customer satisfaction/engagement.
  • Iterative refinement prompts LLM to apply insights from the human expert answer.

Select to Perfect: Imitating desired behavior from large multi-agent data

  • AI driving agents using Exchange Value, measuring individual agent collective desirability score.
  • Imitates agents with positive Exchange Value, for example how few traffic incidents the agent causes.

When LLMs Meet Cybersecurity: A Systematic Literature Review

  • Includes a comprehensive review of LLM-cybersecurity research from 180 different research papers.
  • Includes an updated link on LLM-cybersecurity research, which I think is very useful.

FOKE: A Personalized and Explainable Education Framework Integrating Foundation Models, Knowledge Graphs, and Prompt Engineering

  • FOKE: Integrates KGs, LLMs and prompt engineering.

Language-Image Models with 3D Understanding

  • Cube-LLM: 3D-grounded reasoning with LLMs.

Thoughtful Things: Building Human-Centric Smart Devices with Small Language Models

  • Reviews LLMs integrated into smart devices like a lamp, which adjusts the color of light with voice control using Raspberry Pi 5. Applies small fine-tuned LLMs to reason about their (own) device behaviour.

Organizing a Society of Language Models: Structures and Mechanisms for Enhanced Collective Intelligence

  • Reviews collective intelligence in LLMs: hierarchical/flat/dynamic and federated.

Towards a Formal Creativity Theory: Preliminary results in Novelty and Transformativeness

  • Explores formalization of the Creativity theory.
  • Proposes formal definition for "novelty" and "transformational creativity" (Novelty is not necessary/sufficient).
  • Argues, that "inspiring set" (unordered content of the experience sequence) requires novelty for transformational creativity, which differs from sequences of experiences (chronological flow).
  • Other research directions for creativity include semantic transformativeness, formalization of the concept of typicality and whether transformative artifacts must be outside the hypothetical conceptual space.

OmniActions: Predicting Digital Actions in Response to Real-World Multimodal Sensory Inputs with LLMs

  • OmniActions: LLM processes multimodal inputs from users (scene description, object detection, OCR, sound classifier, speech content and contextual information: place/activity) using CoT, to predict follow-up actions.

5th of May 2024

Agent Hospital: A Simulacrum of Hospital with Evolvable Medical Agents

  • Agent Hospital: MedAgent-Zero-method, where LLM-based doctor agents provide SOTA level medical care in MedQA-dataset.
  • Learns to scale knowledge base through inference simulation with doctor agents.
  • MedAgent-Zero-method is a self-evolution method, where medical agents continuously evolve by processing cases and engaging in self-feedback.
  • Uses knowledge database to accumulate successful and unsuccessful treatments performed.

Graphical user interface agents optimization for visual instruction grounding using multi-modal artificial intelligence systems

  • SIC (Search Instruction Coordinates): a multimodal framework to locate objects in a GUI. Includes two approaches: SICocri and SICdirect.
  • SICocri applies fine-tuned YOLO-V8 (object detection to list all items and fine-tuned for GUIs) with an OCR module (identifies in each UI element the specific texts to separate buttons: cancel vs. submit). The buttons and their OCR-recognized texts are combined by matching their coordinates. GPT-4 (LLM used for component name and type extraction) identifies the best match to requested UI element and provides: UI element Id, type, role, and coordinates.
  • SICdirect instead fuses visual embeddings and prompt embeddings into Encoder/Decoder Transformer to obtain the coordinates.
  • Introduces a metric called Central Point Validation (CPV), which checks whether the central coordinates of the predicted bounding box lie inside the ground truth UI element, and converts this boolean value into a percentage over all observations.
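
The CPV metric described above can be computed along these lines. The `(x_min, y_min, x_max, y_max)` box format and the function names are assumptions made for the sketch.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # Assumed format: (x_min, y_min, x_max, y_max)


def center_inside(predicted: Box, ground_truth: Box) -> bool:
    """True if the center of the predicted box lies inside the ground-truth box."""
    center_x = (predicted[0] + predicted[2]) / 2
    center_y = (predicted[1] + predicted[3]) / 2
    return (
        ground_truth[0] <= center_x <= ground_truth[2]
        and ground_truth[1] <= center_y <= ground_truth[3]
    )


def cpv_percent(predictions: List[Box], ground_truths: List[Box]) -> float:
    """Share of predictions whose center falls inside the matching ground-truth element."""
    hits = sum(center_inside(p, t) for p, t in zip(predictions, ground_truths))
    return 100.0 * hits / len(predictions)


predictions = [(10, 10, 30, 30), (100, 100, 120, 120)]
ground_truths = [(0, 0, 40, 40), (0, 0, 40, 40)]
print(cpv_percent(predictions, ground_truths))
```

Unlike IoU, this check only requires that a click at the predicted center would land on the right element, which matches how a GUI agent actually uses the prediction.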

AppAgent v2: Advanced Agent for Flexible Mobile Interactions

  • AppAgent v2: introduces multimodal agent, which emulates human-like interaction on mobile device GUI. Includes exploration (documenting UI elements) and deployment phase (efficient task execution with RAG).

Language Evolution for Evading Social Media Regulation via LLM-based Multi-agent Simulation

  • Language evolution using LLM-based multi-agent simulation.
  • Includes supervisory and participant agents.

Visual grounding for desktop graphical user interfaces

  • Introduces autonomous GUI-agent. Includes a decent overview about autonomous GUI navigation.
  • Proposes visual grounding with LLM using YoloV8/ChatGPT/OCR-module or multi modal IGVDirect-approach.
  • Introduces new metric: Central Point Validation (if center of the predicted bounding box is inside the target GUI element).
  • Includes GUI-perception prompt.

3rd of May 2024

Automating the Enterprise with Foundation Models

  • ECLAIR (Enterprise sCaLe AI for woRkflows): Self-improving enterprise workflow automation system that requires minimal supervision and uses foundation models (FM).
  • Includes three stages: Automatic process mapping (video record flow is converted with FM to Standard Operating Procedure), Robust/flexible reasoning-based (using the Standard Operating Procedure and FM), Automated auditing (FM to rate ok / not ok and self-improve).
  • The github repository includes prompt examples and code.

Neuromorphic Correlates of Artificial Consciousness

  • Reviews AI Consciousness and proposes Neuromorphic Correlates of Artificial Consciousness (NCAC)-framework.
  • The framework consists of Quantification, Simulation, Adaptation, and Implementation.
  • Interesting details in general about consciousness research such as Integrated Information Theory (IIT).

What matters when building vision-language models?

  • Reviews VLMs.
  • Builds 8B parameter Idefics2-model achieving SOTA-level performance at its size.

CodeGRAG: Extracting Composed Syntax Graphs for Retrieval Augmented Cross-Lingual Code Generation

  • CODEGRAG: effective code retrieval method for improving code generation.

Beyond Helpfulness and Harmlessness: Eliciting Diverse Behaviors from Large Language Models with Persona In-Context Learning

  • Persona In-Context Learning (PICLe): LLM method to replicate target persona behaviour using ICL.

Comparative Analysis of Retrieval Systems in the Real World

  • Reviews existing search and retrieval systems for LLMs.

2nd of May 2024

Plan-Seq-Learn: Language Model Guided RL for Solving Long Horizon Robotics Tasks

  • Plan-Seq-Learn (PSL): Consists of three modules: LLM-based high-level planning module, Sequencing the LLM-generated plan with Pose Estimator/Motion planner with RL and Learning RL control policy module.
  • Achieves SOTA level in 25 robotic long horizon tasks from scratch, by a team partly from Mistral.AI and Carnegie Mellon University.
  • RL and LLMs complement each other's strengths, with LLMs able to divide long horizon goals into achievable sub-goals and RL capable of learning low-level robot control strategy.
  • Includes prompt examples.

FLAME: Factuality-Aware Alignment for Large Language Models

  • FLAME (Factuality Aware Alignment): factuality aware SFT and RL with DPO.

Generative Active Learning for the Search of Small-molecule Protein Binders

  • LambdaZero: generative active learning to search new small-molecule protein binders.
  • Includes Inner loop, Outer loop, Compound synthesis, In-vitro validation and Library synthesis.

Efficient Data Generation for Source-grounded Information-seeking Dialogs: A Use Case for Meeting Transcripts

  • MISeD (Meeting Information Seeking Dialogs dataset): combines human annotation with LLMs to generate source-grounded information seeking dialog-datasets.
  • Models fine-tuned with MISeD perform well.

OmniDrive: A Holistic LLM-Agent Framework for Autonomous Driving with 3D Perception, Reasoning and Planning

  • OmniDrive: E2E autonomous driving with LLM-agents, and OmniDrive-nuScenes benchmark.
  • Visual encoder extracts multi-view image features, which are fed into Q-Former3D and finally to the LLM.

CACTUS: Chemistry Agent Connecting Tool-Usage to Science

  • CACTUS: Uses CoT-reasoning with planning, action, execution and observation-phases.

Creative Problem Solving in Large Language and Vision Models -- What Would it Take?

  • Reviews computational creativity.

CoS: Enhancing Personalization and Mitigating Bias with Context Steering

  • CoS (Context Steering): adjusting LLM to context based on likelihood difference between the LLM output when it has seen / not seen the context.
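
The likelihood-difference idea can be sketched on a toy next-token distribution. The sketch assumes the adjustment adds a scaled difference between the log-probabilities with and without context and then renormalizes; `steer_logprobs` and `influence` are hypothetical names, and the toy distributions are made up.

```python
import math
from typing import Dict


def steer_logprobs(
    with_context: Dict[str, float], without_context: Dict[str, float], influence: float
) -> Dict[str, float]:
    """Amplify (influence > 0) or dampen (influence < 0) the effect of the context.

    Assumed rule: steered = logp_with + influence * (logp_with - logp_without),
    followed by renormalization so the result is again a distribution.
    """
    steered = {
        token: with_context[token] + influence * (with_context[token] - without_context[token])
        for token in with_context
    }
    log_norm = math.log(sum(math.exp(value) for value in steered.values()))
    return {token: value - log_norm for token, value in steered.items()}


with_ctx = {"a": math.log(0.6), "b": math.log(0.4)}
without_ctx = {"a": math.log(0.5), "b": math.log(0.5)}
amplified = steer_logprobs(with_ctx, without_ctx, influence=1.0)
print({token: round(math.exp(value), 3) for token, value in amplified.items()})
```

With `influence=0` the contextual distribution is returned unchanged, and `influence=-1` recovers the distribution without context, so a single scalar controls how strongly the context personalizes the output.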

Generative Active Learning for the Search of Small-molecule Protein Binders

  • LambdaZero: generative ai for searching synthesizable molecules with particular type of desired characteristics.

1st of May 2024

Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning

  • Self-improving LLM training with MCTS using Iterative Preference Learning and DPO, which significantly improves math reasoning. Reviews computational optimization of such training method.
  • Combines outcome validation and step-wise self-evaluation and continuous update of the quality assessment of the generated new data.
  • Reviews balancing of reasoning chain length, logical coherence in commonsense reasoning.
  • Reviews existing literature on self-training, guided search for reasoning and iterative learning.

ULLER: A Unified Language for Learning and Reasoning

  • ULLER: Unified neuro-symbolic language learning and reasoning.

GOLD: Geometry Problem Solver with Natural Language Description

  • GOLD: Geometry math problem solver.

Social Life Simulation for Non-Cognitive Skills Learning

  • Emotional intelligence in LLM agents based on narrative.

Can a Hallucinating Model help in Reducing Human "Hallucination"?

  • Compares LLMs with humans in terms of their capability to distinguish logical reasoning errors. LLMs perform better than humans in psychometric assessments. Finds LLMs could be used as personalized LLM-agents to expose misinformation.

"Ask Me Anything": How Comcast Uses LLMs to Assist Agents in Real Time

  • "Ask Me Anything" (AMA): COMCAST applies LLMs (RAG-like) in human-to-human communication in customer support by using LLMs to help resolve client calls in real-time. Led to millions of dollars in savings through reduced call time, with positive evaluation by the customers.

Characterising the Creative Process in Humans and Large Language Models

  • Reviews creativity of LLMs.

29th of April 2024

Capabilities of gemini models in medicine

  • Med-Gemini: Med-Gemini-L 1.0 for medical care reasoning.
  • Uses self-training with search (the model iteratively generates CoT reasoning responses with/without web query and applies in-context expert demonstrations) and Uncertainty-guided search at inference (iteratively generate multiple CoT reasoning paths, filter based on uncertainty and retrieve search results for more accurate responses).
  • SOTA-level model in 10 medical reasoning tasks and surpassing human-expert on some of them.
  • Integrates web-search queries when the model is uncertain.

Reinforcement Learning Problem Solving with Large Language Models

  • Prompts LLM iteratively to solve Markov Decision Process (MDP) RL tasks.
  • Uses prompting technique for simulating episodes and Q-learning.
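
A minimal sketch of the tabular Q-learning update that the prompting technique is described as simulating. The corridor MDP, the reward values and the hyperparameters are illustrative assumptions and are not taken from the paper, which carries out these steps through LLM prompts rather than Python code.

```python
import random

# Hypothetical toy MDP (an assumption for illustration): a 4-cell corridor.
# Action 0 moves left, action 1 moves right; reaching cell 3 gives reward 1.
N_STATES, N_ACTIONS, GOAL = 4, 2, 3

def step(state, action):
    """Deterministic transition: returns (next_state, reward, done)."""
    next_state = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy action selection.
            if rng.random() < epsilon:
                action = rng.randrange(N_ACTIONS)
            else:
                action = max(range(N_ACTIONS), key=lambda a: q[state][a])
            next_state, reward, done = step(state, action)
            # Q-learning target: reward plus discounted best next value.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q

q_table = q_learning()
# Greedy action for each non-terminal state.
greedy_policy = [max(range(N_ACTIONS), key=lambda a: q_table[s][a]) for s in range(GOAL)]
```

In this toy setting the greedy policy moves right in every non-terminal cell, since that is the only way to reach the rewarded state.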

HELPER-X: A Unified Instructable Embodied Agent to Tackle Four Interactive Vision-Language Domains with Memory-Augmented Language Models

  • HELPER-X: VLM-based embodied agent, which takes an image and user input. Uses unified memory-augmented prompting: the top-k in-context examples are retrieved from a shared example memory and inserted into the shared (domain-agnostic) prompt template to query the LLM. The LLM generates a program, the program is then executed and the plan is added to the memory (includes instruction plans, corrective plans and added plans).
  • The prompt retrieval uses a specialized prompt template, which contains role description, task instruction and guides the specific domain (TEACh, ALFRED, DialFRED and Tidy Task).
  • The retrieval is embedding vector-based. All code and prompts are open sourced.

28th of April 2024

From Persona to Personalization: A Survey on Role-Playing Language Agents

  • Reviews Role-Playing Language Agents (RPLAs) with LLMs.
  • Categorizes personas: demographic (statistical), character (established figures), individualized (customized through interactions) personas.

Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant

  • Demonstrates that SOTA-level models trained to act honestly/helpfully sometimes behave deceptively without being prompted to act that way.
  • For example, LLMs may lie in response to auditor questions.

26th of April 2024

Unveiling Thoughts: A Review of Advancements in EEG Brain Signal Decoding into Text

  • Brain signal decoding into text.

24th of April 2024

Retrieval Head Mechanistically Explains Long-Context Factuality

  • How LLMs obtain capacity to retrieve information from long-context?
  • Retrieval-attention heads have the following characteristics: Universal, Sparse, Intrinsic, Dynamically-activated, Causal and Impact heavily on CoT reasoning.

23rd of April 2024

Generate-on-Graph: Treat LLM as both Agent and KG in Incomplete Knowledge Graph Question Answering

  • Generate-on-Graph (GoG): applies selecting/generating/answering-framework for IKGQA (Incomplete Knowledge Graph Question Answering).
  • Help LLMs answer complex questions, even when not able to provide final answer.
  • Generates thoughts, then actions to retrieve knowledge, makes observations from the actions. The thoughts are then processed as thought-chain. The paper includes a detailed GoG-instruction implemented using two LLM-prompts.

Rethinking LLM Memorization through the Lens of Adversarial Compression

  • Reviews memorization of LLMs, which refers to LLMs' capability to reproduce data with a shorter string than the source data.
  • Proposes: Adversarial Compression Ratio (ACR)-metric to measure level of memorization.

Evaluating Tool-Augmented Agents in Remote Sensing Platforms

  • GeoLLM QA-benchmark: measures ability to capture long sequences of UI-click/verbal/visual actions on UI.

22nd of April 2024

A Survey on Self-Evolution of Large Language Models

  • Alibaba's literature survey on Self-Evolving LLMs.
  • Reviews paradigm shift in LLMs from pretraining (2018), SFT (2019) and human alignment (2022) to Self-Evolution (2023).

21st of April 2024

A Survey on the Memory Mechanism of Large Language Model based Agents

  • Huawei's literature review on memory mechanism in LLM-agents.
  • Why memory is required, how to design and evaluate memory-based LLMs?

Accelerating Medical Knowledge Discovery through Automated Knowledge Graph Generation and Enrichment

  • Medical Knowledge Graph Automation (M-KGA)

19th of April 2024

AutoCrawler: A Progressive Understanding Web Agent for Web Crawler Generation

  • AutoCrawler: LLM-based web crawler agent, which automatically defines set of intermediate rules (reusability) / action sequences to extract target information from the website based on varying types of websites and task requirements.
  • Includes Progressive generation-phase (top-down, step-back, action sequence) and Synthesis-phase (set of action sequences).

[Let's Think Dot by Dot: Hidden Computation in Transformer Language Models](https://arxiv.org/abs/2404.15758)

  • Reviews use of "Filler tokens" instead of CoT. Filler token refers to "...".

SOPHON: Non-Fine-Tunable Learning to Restrain Task Transferability For Pre-trained Models

  • SOPHON: Pretraining protection framework to prevent fine-tuning LLMs for adversarial tasks, which results in overhead cost for restricted-domain fine-tuning above that of training the model from scratch.

18th of April 2024

Aligning Language Models to Explicitly Handle Ambiguity

  • Introduces disambiguation procedure for LLMs.
  • Four-step alignment pipeline: Explicit prediction, Implicit ambiguity detection (Self-disambiguation and Measure Information-gain), Data construction (Information-gain > epsilon) and SFT.

mABC: multi-Agent Blockchain-Inspired Collaboration for root cause analysis in micro-services architecture

  • mABC (multi-Agent Blockchain-inspired Collaboration): AI agent workflow, where multiple LLM-agents reach consensus in standardized voting process to manage RCA of microservices.
  • The voting mechanism is blockchain-style.
  • Two workflows: ReAct answer (action, observation and reasoning for real-time/additional data) and Direct answer (reasoning with zero-shot/CoT/N-of-Thought) when external tools are not required.

17th of April 2024

Many-Shot In-Context Learning

  • Introduces Many-shot ICL, which differs from few-shot ICL by increasing significantly the amount of examples provided within the context window.
  • Improves task performance over few-shot prompting across a variety of domains.
  • One of the first attempts to scale in-context learning or "test-time inference".
  • Introduces the concept of Reinforced ICL, where model generated rationales are used for ICL by using zero-shot / few-shot CoTs prompts as examples to sample more examples. The generated examples are filtered to include only those reaching a correct answer (requires ground truth and potentially generates false-positives).
  • Introduces concept of Unsupervised ICL, which omits CoTs and prompts the model using only inputs (includes example problem/list of unsolved problems/zero-shot or few-shot instruction of desired output format). The unsupervised ICL prompt is included in the paper.
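
The Reinforced ICL filtering step described above can be sketched as follows. This is a hedged illustration only: `sample_rationale` and `toy_model` are hypothetical stand-ins for an LLM call, and the prompt layout is an assumption rather than the paper's template.

```python
from typing import Callable, List, Tuple

Problem = Tuple[str, str]  # (question, ground-truth answer)

def build_reinforced_icl_prompt(
    problems: List[Problem],
    sample_rationale: Callable[[str], Tuple[str, str]],
    samples_per_problem: int,
    new_question: str,
) -> str:
    """Keep only model-generated rationales whose final answer matches the
    ground truth, then use them as in-context examples."""
    shots = []
    for question, truth in problems:
        for _ in range(samples_per_problem):
            rationale, answer = sample_rationale(question)
            if answer.strip() == truth.strip():
                shots.append(f"Q: {question}\nReasoning: {rationale}\nA: {answer}")
                break  # keep at most one rationale per problem
    return "\n\n".join(shots + [f"Q: {new_question}\nReasoning:"])

def toy_model(question: str) -> Tuple[str, str]:
    # Hypothetical stand-in for an LLM: right on "2+2", wrong on "3*3".
    if question == "2+2":
        return "2 plus 2 is 4", "4"
    return "3 times 3 is 6", "6"

prompt = build_reinforced_icl_prompt([("2+2", "4"), ("3*3", "9")], toy_model, 2, "5+5")
```

Filtering on the final answer alone is what allows false positives: a rationale with flawed reasoning still passes if it happens to land on the correct answer.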

The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling: A Survey

  • Survey on AI agents.
  • Reviews single- and multi-agent architectures, challenges and future directions.

AgentKit: Flow Engineering with Graphs, not Coding

  • AgentKit: Prompting framework for multifunctional agents. Constructs complex "thought process" from prompts. Consists of nodes.
  • Nodes: prompts for specific task. User compiles Chain-of-Nodes (CoNs), which are structured thought processes in a graph.
  • Agents designed with AgentKit are SOTA-level in WebShop/Crafter-benchmarks.
  • Includes GitHub repository with the code, where the graphs are built.

Octopus v3: Technical Report for On-device Sub-billion Multimodal AI Agent

  • Octopus v3: 1B multimodal AI agent.
  • Uses "functional tokens": represents any function as a token.
  • Applies multi-stage training: first trains image-language, which is followed by the learning of functional tokens and finally the functional tokens provide feedback to keep improving the model with RL and external LLM used as a reward model.
  • Operates on edge devices like Raspberry Pi.

Open-Ended Wargames with Large Language Models

  • Snow Globe: LLM-based multi-agent system, which automatically plays qualitative (open-ended) wargames.
  • Information flows: Incident, Response, Inject and Response. The approach could be used in other domains.

16th of April 2024

Self-playing Adversarial Language Game Enhances LLM Reasoning

  • SPAG (Self-Play Adversarial language Game): LLM plays both "attacker" and "defender" in a language game called "Adversarial Taboo". The "attacker" aims to trigger the "defender" to state the target word only known to it, while the "defender" aims to guess the target word based on communications made by the "attacker".
  • The LLM is fine-tuned using RL with ReST based on the game outcomes from a wide range of topics.
  • This self-play technique improves the LLM's reasoning capabilities over three epochs.

Closed-Loop Open-Vocabulary Mobile Manipulation with GPT-4V

  • COME (Closed-loop Open-vocabulary MobilE Manipulation): VLM-based robot consisting of Active Perception, Situated Commonsense Reasoning and Recover from Failure.
  • Helps to recover from mistakes, handle free-form instructions and follow long-horizon task plans.
  • Improves SOTA-level performance by 25% in real-world tabletop and manipulation tasks, which are Open-Vocabulary Mobile Manipulation (OVMM)-tasks.
  • Step towards autonomous robots in real-world scenarios. The high level-reasoning and planning uses: role, feedback handling, robot setup, APIs, response guidelines and Tips. The paper includes system prompt.

Self-Explore to Avoid the Pit: Improving the Reasoning Capabilities of Language Models with Fine-grained Rewards

  • Self-Explore: LLMs explore Pits (wrong steps) in the reasoning and use these explorations as signals in further exploration.
  • Outperforms SFT on GSM8K/MATH-datasets using three different LLMs.
  • Applies step-level fine-grained reward.

VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time

  • VASA-1: The model produces lip movement based on audio and an image.
  • Visual Affective Skills (VAS): uses diffusion-based holistic facial dynamics.

SCALE: Self-Correcting Visual Navigation for Mobile Robots via Anti-Novelty Estimation

  • SCALE: self-correcting visual navigation using image-goal conditioned implicit Q-learning. When faced with an out-of-distribution observation, the "Localization Recovery" generates possible future trajectories.
  • SOTA-level open-world navigation

N-Agent Ad Hoc Teamwork

  • N-Agent Ad Hoc Teamwork (NAHT): a varying number of unknown autonomous agents interact and cooperate dynamically to maximize return in a task.
  • Policy Optimization with Agent Modelling (POAM)-algorithm: each agent has its policy based on the same underlying parameters. Critic is trained using information both from controlled and uncontrolled agents, while actor is trained using only controlled agents. Critic evaluates how good actions are in the current state, while Actor decides the action to be taken in that state. Both actor and critic use a team vector to capture information from all agents.

Emergent intelligence of buckling-driven elasto-active structures

  • Microbot design using elasticity to control collective motion.
  • Enables autonomous maze navigation by two self-propelled microbots connected by polyester beam (bucklebot) in 25 seconds, which is not possible by an individual microbot.

HLAT: High-quality Large Language Model Pre-trained on AWS Trainium

  • Trains LLMs of 7B and 70B with 1.8T tokens on AWS Trainium accelerators, showing 54% of the cost compared with Nvidia GPUs.
  • Illustrates the approach for training LLMs using AWS Trainium accelerators and AWS Neuron SDK.

Automated Evaluation of Large Vision-Language Models on Self-driving Corner Cases

  • CODA-LM: Vision-Language benchmark for autonomous driving.

White Men Lead, Black Women Help: Uncovering Gender, Racial, and Intersectional Bias in Language Agency

  • Identifies language agency bias in LLMs: gender, racial and intersectional.

Demonstration of DB-GPT: Next Generation Data Interaction System Empowered by Large Language Models

  • DB-GPT: Open-source AI app development framework. Includes: RAG, Generative Business Intelligence, Fine-tuning, Data-driven Multi-agents, Data factory and Data sources, Text-to-SQL module and agents. AWEL: Agentic Workflow Expression Language.

Bootstrapping Linear Models for Fast Online Adaptation in Human-Agent Collaboration

  • BLR-HAC (Bootstrapped Logistic Regression for Human Agent Collaboration): pretrains transformer to generate parameters of a shallow parametrized policy. Update it using human-agent collaboration with online logistic regression.

What is Meant by AGI? On the Definition of Artificial General Intelligence

  • Attempts to define AGI: "An Artificial General Intelligence (AGI) system is a computer that is adaptive to the open environment with limited computational resources and that satisfies certain principles."

Private Attribute Inference from Images with Vision-Language Models

  • VLMs identify personal attributes of the image owners, which may cause privacy risk when misused.

CoTAR: Chain-of-Thought Attribution Reasoning with Multi-level Granularity

  • CoTAR (Attribute-oriented CoT): Identifies most crucial aspects of the given context to answer using direct citations to referenced parts.
  • Three levels: Span guidance, Sentence guidance, Passage guidance

Chinchilla Scaling: A replication attempt

  • Finds Chinchilla-scaling laws inconsistent.

TEL'M: Test and Evaluation of Language Models

  • TEL’M (Test and Evaluation of Language Models): five-step evaluation: Identification of interesting LLM tasks, Identification of Task properties of interest, Identification of task property metrics, Design of measurement experiments, Execution and analysis of experiments.

Deceiving to Enlighten: Coaxing LLMs to Self-Reflection for Enhanced Bias Detection and Mitigation

  • Reduces bias in LLMs by stating that the views are not the LLM's own, which activates the LLM's internal attention to improve sensitivity.

Model-based Offline Quantum Reinforcement Learning

  • First model-based offline quantum RL algorithm

AIGeN: An Adversarial Approach for Instruction Generation in VLN

  • AIGeN: consists of Instruction generator and Instruction discriminator.
  • Instruction generator describes actions needed to navigate to a specific location based on images from the environment.
  • Instruction discriminator classifies inputs as real/fake depending on whether the image descriptions match the provided instruction.

Language Model Cascades: Token-level uncertainty and beyond

  • Cascading LLM: simple queries are guided to "easy"-LLM, while complicated queries are guided to "hard"-LLM. This deferral decision is made by 5-layer MLP model.
  • Applies token-level uncertainty, where length bias is mitigated when making deferral decision. Easy sequences have most tokens in low percentile, while hard sequences have some tokens with high uncertainty.
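
A minimal sketch of a token-level deferral rule in the spirit of the description above. The entry states that the deferral decision is made by a 5-layer MLP; here a single quantile threshold is swapped in as a simplifying assumption, and the `q` and `threshold` values are illustrative only.

```python
import math
from typing import List

def token_uncertainties(token_probs: List[float]) -> List[float]:
    """Per-token uncertainty as negative log-probability."""
    return [-math.log(p) for p in token_probs]

def quantile(values: List[float], q: float) -> float:
    """Simple order-statistic quantile (no interpolation)."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[index]

def should_defer(token_probs: List[float], q: float = 0.9, threshold: float = 1.0) -> bool:
    """Defer to the larger model when a high quantile of token uncertainty
    is above the threshold. A quantile, unlike a summed log-probability,
    does not grow with sequence length."""
    return quantile(token_uncertainties(token_probs), q) > threshold

easy_sequence = [0.9] * 20               # every token is confident
hard_sequence = [0.9] * 17 + [0.1] * 3   # a few highly uncertain tokens
defer_easy = should_defer(easy_sequence)
defer_hard = should_defer(hard_sequence)
```

With these inputs only the second sequence is deferred, because its 90th-percentile token uncertainty is high even though most of its tokens are confident.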

EyeFormer: Predicting Personalized Scanpaths with Transformer-Guided Reinforcement Learning

  • EyeFormer: predictive model for scanpath (human vision attention behaviour) for both natural scenes and user interfaces. Illustrates using of scanpaths for personalized UI optimization.
  • Deep RL with Transformer, which predicts spatial and temporal characteristics of scanpaths about viewer behaviours.

How faithful are RAG models? Quantifying the tug-of-war between RAG and LLMs' internal prior

  • The more likely the LLM is to trust its response without RAG (its prior), the less likely it is to trust the retrieved information.
  • The more unrealistic the perturbed RAG information is, the more likely the LLM is to stick to its prior (knowledge).

Rethinking Software Engineering in the Foundation Model Era: From Task-Driven AI Copilots to Goal-Driven AI Pair Programmers


Vision-and-Language Navigation via Causal Learning


Uncovering Latent Arguments in Social Media Messaging by Employing LLMs-in-the-Loop Strategy


HelixFold-Multimer: Elevating Protein Complex Structure Prediction to New Heights


Continuous Control Reinforcement Learning: Distributed Distributional DrQ Algorithms


Social Choice for AI Alignment: Dealing with Diverse Human Feedback


Engineering software 2.0 by interpolating neural networks: unifying training, solving, and calibration


Future Language Modeling from Temporal Document History


Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs


Prescribing the Right Remedy: Mitigating Hallucinations in Large Vision-Language Models via Targeted Instruction Tuning


Reasoning on Efficient Knowledge Paths:Knowledge Graph Guides Large Language Model for Domain Question Answering


SparseDM: Toward Sparse Efficient Diffusion Models


Advancing Long-Term Multi-Energy Load Forecasting with Patchformer: A Patch and Transformer-Based Approach


DESTEIN: Navigating Detoxification of Language Models via Universal Steering Pairs and Head-wise Activation Fusion


When Emotional Stimuli meet Prompt Designing: An Auto-Prompt Graphical Paradigm


Self-Supervised Visual Preference Alignment


Unveiling the Misuse Potential of Base Large Language Models via In-Context Learning


Generative Text Steganography with Large Language Model


EMC$^2$: Efficient MCMC Negative Sampling for Contrastive Learning with Global Convergence


Continual Offline Reinforcement Learning via Diffusion-based Dual Generative Replay


Question Difficulty Ranking for Multiple-Choice Reading Comprehension


Insight Gained from Migrating a Machine Learning Model to Intelligence Processing Units


MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents


LegalPro-BERT: Classification of Legal Provisions by fine-tuning BERT Large Language Model


Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study


Automating REST API Postman Test Cases Using LLM


Spiral of Silences: How is Large Language Model Killing Information Retrieval? -- A Case Study on Open Domain Question Answering


MEEL: Multi-Modal Event Evolution Learning


Find The Gap: Knowledge Base Reasoning For Visual Question Answering


15th of April 2024

Memory Sharing for Large Language Model based Agents

  • Memory-Sharing (MS)-framework: Multi LLM-agents share Memory Pool of query/response pairs, which improves In-Context Learning. Retriever-model is trained to retrieve memories based on user query.
  • LLM agent answers based on query and retrieved memories. Scorer evaluates query / response. High scoring pairs are added to the Memory Pool, which is queried with cosine similarity.
  • The shared memory helps all agents to learn from each other.
  • The Retriever model is trained using pre-trained sentence similarity model, which retrieves data from jsonl-file to train a model and it is later used to pick relevant memories for each user query.
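
The Memory Pool workflow above can be sketched as below. The bag-of-words `embed` function is a toy stand-in for the trained Retriever model, and the `min_score` threshold and class layout are assumptions for illustration, not the paper's implementation.

```python
import math
from collections import Counter
from typing import List, Tuple

def embed(text: str) -> Counter:
    # Toy bag-of-words embedding; a stand-in for a trained retriever model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class MemoryPool:
    """Shared pool of (query, response) pairs used by several agents."""

    def __init__(self, min_score: float = 0.8):
        self.min_score = min_score  # assumed scorer threshold
        self.memories: List[Tuple[str, str]] = []

    def maybe_add(self, query: str, response: str, score: float) -> bool:
        """Store the pair only if the scorer rated it highly enough."""
        if score >= self.min_score:
            self.memories.append((query, response))
            return True
        return False

    def retrieve(self, query: str, k: int = 2) -> List[Tuple[str, str]]:
        """Return the k stored pairs whose queries are most similar."""
        q = embed(query)
        ranked = sorted(self.memories, key=lambda m: cosine(q, embed(m[0])), reverse=True)
        return ranked[:k]

pool = MemoryPool()
pool.maybe_add("write a poem about rain", "Rain taps softly on the roof.", 0.9)
pool.maybe_add("solve a logic puzzle", "Start from the fixed clue.", 0.95)
pool.maybe_add("write a poem about snow", "low quality draft", 0.3)  # rejected
top = pool.retrieve("poem about rain", k=1)
```

The retrieved pairs would then be placed in the agent's prompt as in-context examples.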

Reimagining Self-Adaptation in the Age of Large Language Models

  • Self-Adaptive SW system: Includes Managed system (operational SW system) and Managing System (handles adaptions).
  • Managing system includes Prompt generator, LLM engine, Response parser, Monitor (logs, metrics), Knowledge/Memory (conversation history, fine-tuned models, system config and system prompts) and Execute (verifier/executor).

Deferred NAM: Low-latency Top-K Context Injection via DeferredContext Encoding for Non-Streaming ASR


ChatShop: Interactive Information Seeking with Language Agents


TabSQLify: Enhancing Reasoning Capabilities of LLMs Through Table Decomposition


LLMorpheus: Mutation Testing using Large Language Models


A Survey on Deep Learning for Theorem Proving


Progressive Knowledge Graph Completion


Synergising Human-like Responses and Machine Intelligence for Planning in Disaster Response


HyperMono: A Monotonicity-aware Approach to Hyper-Relational Knowledge Representation


Action Model Learning with Guarantees


Explainable Generative AI (GenXAI): A Survey, Conceptualization, and Research Agenda


MyGO: Discrete Modality Information as Fine-Grained Tokens for Multi-modal Knowledge Graph Completion


Monte Carlo Search Algorithms Discovering Monte Carlo Tree Search Exploration Terms


Assessing Economic Viability: A Comparative Analysis of Total Cost of Ownership for Domain-Adapted Large Language Models versus State-of-the-art Counterparts in Chip Design Coding Assistance


Handling Reward Misspecification in the Presence of Expectation Mismatch


Generating Games via LLMs: An Investigation with Video Game Description Language


MMInA: Benchmarking Multihop Multimodal Internet Agents


Evolving Interpretable Visual Classifiers with Large Language Models


Compression Represents Intelligence Linearly


Glitch Tokens in Large Language Models: Categorization Taxonomy and Effective Detection


Foundational Challenges in Assuring Alignment and Safety of Large Language Models


Is Table Retrieval a Solved Problem? Join-Aware Multi-Table Retrieval


Empowering Embodied Visual Tracking with Visual Foundation Models and Offline RL


Video2Game: Real-time, Interactive, Realistic and Browser-Compatible Environment from a Single Video


KG-CTG: Citation Generation through Knowledge Graph-guided Large Language Models


Effective Reinforcement Learning Based on Structural Information Principles


Unveiling Imitation Learning: Exploring the Impact of Data Falsity to Large Language Model


Higher Replay Ratio Empowers Sample-Efficient Multi-Agent Reinforcement Learning


Are Large Language Models Reliable Argument Quality Annotators?


LoRAP: Transformer Sub-Layers Deserve Differentiated Structured Compression for Large Language Models


Harnessing GPT-4V(ision) for Insurance: A Preliminary Exploration


Multi-News+: Cost-efficient Dataset Cleansing via LLM-based Data Annotation


All-in-one simulation-based inference


Efficient and accurate neural field reconstruction using resistive memory


A Self-feedback Knowledge Elicitation Approach for Chemical Reaction Predictions


Building Semantic Communication System via Molecules: An End-to-End Training Approach


σ-GPTs: A New Approach to Autoregressive Models


Characterization and Mitigation of Insufficiencies in Automated Driving Systems


Inferring Behavior-Specific Context Improves Zero-Shot Generalization in Reinforcement Learning


State Space Model for New-Generation Network Alternative to Transformers: A Survey


PhyScene: Physically Interactable 3D Scene Synthesis for Embodied AI


Exploring Text-to-Motion Generation with Human Preference


The 8th AI City Challenge


RankCLIP: Ranking-Consistent Language-Image Pretraining


Tasks People Prompt: A Taxonomy of LLM Downstream Tasks in Software Verification and Falsification Approaches


14th of April 2024

Self-Selected Attention Span for Accelerating Large Language Model Inference

  • Fine-tunes LLM to self-identify minimal attention span in each step of the task.
  • Speeds up inference 28% by dynamically adjusting self-attention.
  • Allows LLMs to autonomously optimize computation.

TransformerFAM: Feedback attention is working memory

  • Unlimited context window

Interactive Generative AI Agents for Satellite Networks through a Mixture of Experts Transmission


Confidence Calibration and Rationalization for LLMs via Multi-Agent Deliberation


LLeMpower: Understanding Disparities in the Control and Access of Large Language Models


Towards Practical Tool Usage for Continually Learning LLMs


SNN4Agents: A Framework for Developing Energy-Efficient Embodied Spiking Neural Networks for Autonomous Agents


Text-to-Song: Towards Controllable Music Generation Incorporating Vocals and Accompaniment


TrafficVLM: A Controllable Visual Language Model for Traffic Video Captioning


Task-Driven Exploration: Decoupling and Inter-Task Feedback for Joint Moment Retrieval and Highlight Detection


Knowledgeable Agents by Offline Reinforcement Learning from Large Language Model Rollouts


Towards Fast Inference: Exploring and Improving Blockwise Parallel Drafts


TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models


Prior-agnostic Multi-scale Contrastive Text-Audio Pre-training for Parallelized TTS Frontend Modeling


Survey on Embedding Models for Knowledge Graph and its Applications


GeMQuAD : Generating Multilingual Question Answering Datasets from Large Language Models using Few Shot Learning


Fusion-Mamba for Cross-modality Object Detection


ToNER: Type-oriented Named Entity Recognition with Generative Language Model


Provable Interactive Learning with Hindsight Instruction Feedback


Semantic In-Domain Product Identification for Search Queries


13th of April 2024

LLMSat: A Large Language Model-Based Goal-Oriented Agent for Autonomous Space Exploration

  • LLMSat: LLM-based spacecraft control and space missions.

When Hindsight is Not 20/20: Testing Limits on Reflective Thinking in Large Language Models

"Don't forget to put the milk back!" Dataset for Enabling Embodied Agents to Detect Anomalous Situations


Do LLMs Play Dice? Exploring Probability Distribution Sampling in Large Language Models for Behavioral Simulation


Generative AI Agent for Next-Generation MIMO Design: Fundamentals, Challenges, and Vision


CuriousLLM: Elevating Multi-Document QA with Reasoning-Infused Knowledge Graph Prompting


CodeCloak: A Method for Evaluating and Mitigating Code Leakage by LLM Code Assistants


Exploring Explainability in Video Action Recognition


Adapting Mental Health Prediction Tasks for Cross-lingual Learning via Meta-Training and In-context Learning with Large Language Model


Navigating the Landscape of Large Language Models: A Comprehensive Review and Analysis of Paradigms and Fine-Tuning Strategies


Smart Help: Strategic Opponent Modeling for Proactive and Adaptive Robot Assistance in Households


Intuition-aware Mixture-of-Rank-1-Experts for Parameter Efficient Finetuning


Understanding Multimodal Deep Neural Networks: A Concept Selection View


EIVEN: Efficient Implicit Attribute Value Extraction using Multimodal LLM


An evaluation framework for synthetic data generation models


On Speculative Decoding for Multimodal Large Language Models

12th of April 2024

Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length

  • Megalodon: Unlimited context length.

Is Next Token Prediction Sufficient for GPT? Exploration on Code Logic Comprehension


Aligning LLMs for FL-free Program Repair


LLM In-Context Recall is Prompt Dependent


CATS: Contextually-Aware Thresholding for Sparsity in Large Language Models


Leveraging Multi-AI Agents for Cross-Domain Knowledge Discovery


Augmenting Knowledge Graph Hierarchies Using Neural Transformers


Enhancing Autonomous Vehicle Training with Language Model Integration and Critical Scenario Generation


LLM Agents can Autonomously Exploit One-day Vulnerabilities


Memory Traces: Are Transformers Tulving Machines?


Study of Emotion Concept Formation by Integrating Vision, Physiology, and Word Information using Multilayered Multimodal Latent Dirichlet Allocation


Inverse Kinematics for Neuro-Robotic Grasping with Humanoid Embodied Agents


SQBC: Active Learning using LLM-Generated Synthetic Data for Stance Detection in Online Political Discussions


Training a Vision Language Model as Smartphone Assistant


Apollonion: Profile-centric Dialog Agent


Strategic Interactions between Large Language Models-based Agents in Beauty Contests


Toward a Theory of Tokenization in LLMs


Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions


11th of April 2024

Rho-1: Not All Tokens Are What You Need

  • Rho-1: trains LLM with Selective Language Modelling (SLM) with useful tokens (based on loss pattern).
  • The SLM calculates each token loss using reference model and then selectively removes loss of the unwanted tokens.
  • Rho-1 1B and 7B achieve SOTA results at their size.
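
A minimal sketch of the selective loss described above. Scoring tokens by excess loss (training-model loss minus reference-model loss) and keeping a fixed top fraction are assumptions made for this illustration; the entry itself only states that a reference model is used and that the loss of unwanted tokens is removed.

```python
from typing import List

def selective_lm_loss(train_losses: List[float], ref_losses: List[float], keep_ratio: float) -> float:
    """Average the training loss over the selected tokens only."""
    # Assumed scoring rule: how much worse the model is than the reference.
    excess = [t - r for t, r in zip(train_losses, ref_losses)]
    n_keep = max(1, int(len(excess) * keep_ratio))
    # Indices of the tokens with the largest excess loss.
    keep = sorted(range(len(excess)), key=lambda i: excess[i], reverse=True)[:n_keep]
    # Tokens outside `keep` contribute nothing to the loss.
    return sum(train_losses[i] for i in keep) / n_keep

# Hypothetical per-token losses for a 4-token sequence.
train = [2.0, 0.5, 3.0, 1.0]
reference = [0.5, 0.4, 2.9, 0.2]
loss = selective_lm_loss(train, reference, keep_ratio=0.5)
```

In this example the third token has the highest raw loss but is dropped, because the reference model finds it almost equally hard.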

Large Language Model Can Continue Evolving From Mistakes


Auctions with LLM Summaries


OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

  • OSWorld: scalable multimodal agents for Ubuntu/Windows/MacOS to perform open-ended web/desktop tasks.
  • Discovers humans complete 72% of tasks, while best agent completes only 12%. The main issues are GUI grounding/operational knowledge.

ODA: Observation-Driven Agent for integrating LLMs and Knowledge Graphs

  • ODA: LLM with knowledge graphs (KGs), which iteratively uses observation, action and reflection to help solve tasks.
  • The observation phase uses a global view of the entire KG and selectively picks relevant parts for reasoning.

DesignQA: A Multimodal Benchmark for Evaluating Large Language Models' Understanding of Engineering Documentation

  • DesignQA-benchmark: Measures VLMs' capacity to solve engineering tasks, including CAD images, drawings and engineering requirements. Includes: rule comprehension, rule compliance and rule extraction.

Monte Carlo Tree Search with Boltzmann Exploration

  • Boltzmann Tree Search (BTS): replaces soft values with Bellman values in MENTS.
  • Decaying ENtropy Tree Search (DETS): Interpolates between BTS and MENTS.
  • Alias method samples actions fast, and the approach demonstrates high performance in the game of Go.
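
A hedged sketch of the two generic building blocks mentioned above: a Boltzmann (softmax) policy over action values and alias-method sampling, which draws an action in constant time after a linear-time table build. The Q-values and temperature are made up, and how BTS/DETS wire these pieces into the tree search is not shown here.

```python
import math
import random
from typing import List, Tuple

def boltzmann_policy(q_values: List[float], temperature: float) -> List[float]:
    """Softmax over action values, shifted by the max for stability."""
    m = max(q_values)
    weights = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

def build_alias_table(probs: List[float]) -> Tuple[List[float], List[int]]:
    """Vose's alias method: O(n) construction."""
    n = len(probs)
    scaled = [p * n for p in probs]
    accept, alias = [1.0] * n, list(range(n))
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        accept[s], alias[s] = scaled[s], l
        scaled[l] -= 1.0 - scaled[s]  # the large entry donates its surplus
        (small if scaled[l] < 1.0 else large).append(l)
    return accept, alias  # leftovers keep accept == 1.0

def sample_action(accept: List[float], alias: List[int], rng: random.Random) -> int:
    """O(1) sampling: pick a column, then keep it or take its alias."""
    i = rng.randrange(len(accept))
    return i if rng.random() < accept[i] else alias[i]

probs = boltzmann_policy([1.0, 2.0, 3.0], temperature=1.0)
accept, alias = build_alias_table(probs)
action = sample_action(accept, alias, random.Random(0))
```

The constant-time draw matters in tree search because the same node's policy is sampled many times between updates.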

WESE: Weak Exploration to Strong Exploitation for LLM Agents


Behavior Trees Enable Structured Programming of Language Model Agents


LLoCO: Learning Long Contexts Offline


ChatGPT Can Predict the Future when it Tells Stories Set in the Future About the Past


10th of April 2024

Graph Chain-of-Thought: Augmenting Large Language Models by Reasoning on Graphs


Accelerating Inference in Large Language Models with a Unified Layer Skipping Strategy


Superposition Prompting: Improving and Accelerating Retrieval-Augmented Generation


Not All Contexts Are Equal: Teaching LLMs Credibility-aware Generation


Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention

  • Infini-attention: Infinitely long context window using compressed memory/local attention.
  • The local attention is computed over the in-context tokens. The compressed memory is computed over the out-of-context tokens.
  • Google tests 1B LLM for 1M sequence length, which is difficult for such a small model. I believe there are no existing benchmarks yet for testing context windows longer than 1M tokens.
  • Achieves 114x compression ratio.

GoEX: Perspectives and Designs Towards a Runtime for Autonomous LLM Applications

  • Gorilla Execution Engine (GoEx): open-source runtime to execute LLM actions, apps and microservices.
  • LLMs evolve from dialogue to autonomous agents, which as well make decisions.
  • "Post-facto Validation": human checks correctness of the generated output, instead of intermediate results. Introduces concept of "Undo" and "Damage confinement" to manage unintended risks with autonomous agents.

Vision-Language Model-based Physical Reasoning for Robot Liquid Perception


BISCUIT: Scaffolding LLM-Generated Code with Ephemeral UIs in Computational Notebooks


9th of April 2024

Measuring the Persuasiveness of Language Models

  • Reviews the scaling of LLMs on persuasion tasks. Finds that Claude 3 Opus is statistically as convincing as a human.

Can Feedback Enhance Semantic Grounding in Large Vision-Language Models?


Large Language Models to the Rescue: Deadlock Resolution in Multi-Robot Systems

  • Hierarchical LLM guides robots out of a deadlock situation by assigning a leader-agent and giving it a direction to continue, while a GNN executes the low-level policy.
  • Finds LLMs effective in various environments for high-level planning to resolve deadlocks.

AgentQuest: A Modular Benchmark Framework to Measure Progress and Improve LLM Agents

  • AgentQuest: modular benchmark for multi-step reasoning with possibility via API to extend to different environments.
  • Traditional benchmark includes single environment. AgentQuest uses driver to connect with a specific environment.

AgentsCoDriver: Large Language Model Empowered Collaborative Driving with Lifelong Learning

  • AgentsCoDriver: multi-car collaboration using LLMs.
  • The system includes the following modules: observation, reasoning engine, cognitive memory, reinforcement reflection, and communication.
  • Includes useful designs on prompt generation and module designs.

Autonomous Evaluation and Refinement of Digital Agents

  • Reviews domain-generic automatic evaluators for "digital agents", which improve SOTA performance in the WebArena-benchmark by 29%.
  • Evaluators are applied to improve agents with fine-tuning and inference-time guidance.
  • Policy evaluation works by using a VLM to perform user screen captioning, which is processed by an LLM together with user instructions and the agent trajectory (states/actions). The LLM-reasoner response is evaluated together with the VLM-based reasoner to provide the final failure/success-evaluation.
  • Autonomous refinement uses inference-time guidance (reflexion) and Filtered behaviour cloning.

Wu's Method can Boost Symbolic AI to Rival Silver Medalists and AlphaGeometry to Outperform Gold Medalists at IMO Geometry

  • Combines Wu's method with AlphaGeometry to solve 27/30 IMO geometry problems (SOTA-level), which is 2 more than AlphaGeometry alone, while Wu's method alone solves only 15.
  • First AI (fully symbolic baseline) to outperform a human in IMO geometry problems.

Graph Reinforcement Learning for Combinatorial Optimization: A Survey and Unifying Perspective


Text-Based Reasoning About Vector Graphics


Sandwich attack: Multi-language Mixture Adaptive Attack on LLMs


pfl-research: simulation framework for accelerating research in Private Federated Learning


MuPT: A Generative Symbolic Music Pretrained Transformer


VISION2UI: A Real-World Dataset with Layout for Code Generation from UI Designs


WESE: Weak Exploration to Strong Exploitation for LLM Agents


ActNetFormer: Transformer-ResNet Hybrid Method for Semi-Supervised Action Recognition in Videos


Elephants Never Forget: Memorization and Learning of Tabular Data in Large Language Models


Open-Source AI-based SE Tools: Opportunities and Challenges of Collaborative Software Learning


THOUGHTSCULPT: Reasoning with Intermediate Revision and Search

VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?


8th of April 2024

HAMMR: HierArchical MultiModal React agents for generic VQA

  • HAMMR: Uses multimodal ReAct-based agent, which is hierarchical by letting the agent call other specialized agents.
  • Outperforms PaLI-X VQA by 5%.

Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs

  • Ferret-UI: Outperforms GPT-4V on elementary UI-tasks with capability for referring (widget classification, OCR, icon recognition), grounding (find widget/icon/text and widget listing) and reasoning.
  • "Any resolution" (anyres) enlarges small UI-objects in images like icons within varying screen aspect ratios. Screen capture is divided into two sub-sections. Each UI-element is referenced with type, text and bounding box. Uses 250k examples of training data.

AutoCodeRover: Autonomous Program Improvement

  • AutoCodeRover: autonomous sw engineering by solving GitHub issues (program repair and improvement). Solves 67 GitHub issues within 10 minutes. Future directions could include issue reproducer/semantic artifacts and human involvement.
  • Includes two stages: context retrieval stage to produce buggy locations and Patch generation stage to produce final patch.

Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws

  • Presents 12 insights on LLM training duration, model architecture, quantization, sparsity and data signal-to-noise ratio.
  • Finds junk data significantly reduces model capacity, which can be avoided to large extent by adding special token in the beginning of text. LLM learns to autonomously label data as high-quality.

360°REA: Towards A Reusable Experience Accumulation with 360° Assessment for Multi-Agent System

  • Reusable Experience Accumulation with 360° Assessment (360°REA): a hierarchical multi-agent framework to evaluate and accumulate experience from feedback.
  • Uses Dual-experience pool and 360° performance assessment.
  • Dual-experience pool: helps LLM-agents collect useful experiences in complex tasks using local experience/high-level experience.

Finding Visual Task Vectors

  • Identifies Task Vectors.
  • Uses task vectors to perform different tasks without any sample input.

LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models


LLM-Augmented Retrieval: Enhancing Retrieval Models Through Language Models and Doc-Level Embedding


WILBUR: Adaptive In-Context Learning for Robust and Accurate Web Agents


Attention-Driven Multi-Agent Reinforcement Learning: Enhancing Decisions with Expertise-Informed Tasks


Long-horizon Locomotion and Manipulation on a Quadrupedal Robot with Large Language Models


Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models


Xiwu: A Basis Flexible and Learnable LLM for High Energy Physics


7th of April 2024

AI2Apps: A Visual IDE for Building LLM-based AI Agent Applications


LLM-Based Multi-Agent Systems for Software Engineering: Vision and the Road Ahead


StockGPT: A GenAI Model for Stock Prediction and Trading

Prompting Multi-Modal Tokens to Enhance End-to-End Autonomous Driving Imitation Learning with LLMs


6th of April 2024

Self-organizing Multiagent Target Enclosing under Limited Information and Safety Guarantees


Challenges Faced by Large Language Models in Solving Multi-Agent Flocking


Transform then Explore: a Simple and Effective Technique for Exploratory Combinatorial Optimization with Reinforcement Learning


Autonomous Artificial Intelligence Agents for Clinical Decision Making in Oncology


Do We Really Need a Complex Agent System? Distill Embodied Agent into a Single Model


The Case for Developing a Foundation Model for Planning-like Tasks from Scratch


MACM: Utilizing a Multi-Agent System for Condition Mining in Solving Complex Mathematical Problems


Goal-guided Generative Prompt Injection Attack on Large Language Models


5th of April 2024

Exploring Autonomous Agents through the Lens of Large Language Models: A Review


Increased LLM Vulnerabilities from Fine-tuning and Quantization


Cleared for Takeoff? Compositional & Conditional Reasoning may be the Achilles Heel to (Flight-Booking) Language Agents


ROMA-iQSS: An Objective Alignment Approach via State-Based Value Learning and ROund-Robin Multi-Agent Scheduling


Hypothesis Generation with Large Language Models


KGExplainer: Towards Exploring Connected Subgraph Explanations for Knowledge Graph Completion


4th of April 2024

AutoWebGLM: Bootstrap And Reinforce A Large Language Model-based Web Navigating Agent

  • AutoWebGLM: automated browsing agent using ChatGLM3-6B LLM. Uses html simplification algorithm.
  • Curriculum learning uses a hybrid (human/AI) web browsing dataset with multi-step and single-step data. Data is collected with match rules, LLM prompting, manual annotation and a solver, from real-world and virtual environments and open-source data. RL and Rejection sampling fine tuning (RFT) are applied for browsing comprehension and task decomposition.
  • Introduces AutoWebBench-benchmark on real world web browsing tasks.
  • Tools read DOM and webpage screenshot: Element filter, Element list, OCR module, HTML parse. Observation includes: instruction, HTML and previous action. Action includes: HTML section and action name.

Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models

  • Visualization-of-Thought

Language Model Evolution: An Iterated Learning Perspective


Anticipate & Collab: Data-driven Task Anticipation and Knowledge-driven Planning for Human-robot Collaboration


CONFLARE: CONFormal LArge language model REtrieval


SELF-[IN]CORRECT: LLMs Struggle with Refining Self-Generated Responses


Reason from Fallacy: Enhancing Large Language Models' Logical Reasoning through Logical Fallacy Understanding


Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences


Comprehensible Artificial Intelligence on Knowledge Graphs: A survey


Benchmarking ChatGPT on Algorithmic Reasoning


Capabilities of Large Language Models in Control Engineering: A Benchmark Study on GPT-4, Claude 3 Opus, and Gemini 1.0 Ultra


ReFT: Representation Finetuning for Language Models


CodeEditorBench: Evaluating Code Editing Capability of Large Language Models


A Cause-Effect Look at Alleviating Hallucination of Knowledge-grounded Dialogue Generation


Can Small Language Models Help Large Language Models Reason Better?: LM-Guided Chain-of-Thought


Embodied Neuromorphic Artificial Intelligence for Robotics: Perspectives, Challenges, and Research Development Stack


RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis


3rd of April 2024

MIMIR: A Streamlined Platform for Personalized Agent Tuning in Domain Expertise


Designing for Human-Agent Alignment: Understanding what humans want from their agents


PiSSA: Principal Singular Values and Singular Vectors Adaptation of Large Language Models


Testing the Effect of Code Documentation on Large Language Model Code Understanding


The RealHumanEval: Evaluating Large Language Models' Abilities to Support Programmers


Measuring Social Norms of Large Language Models


Exploring Backdoor Vulnerabilities of Chat Models


2nd of April 2024

Mixture-of-Depths: Dynamically allocating compute in transformer-based language models

  • Mixture-of-Depths (MoD) Transformer: Transformers learn to assign compute dynamically to specific positions in the sequence.
  • Top-k routing: defines tokens participating in block's computation. Learns to route harder tokens through more layers.
  • Helps to speed up
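
Top-k routing can be sketched as follows. The sketch assumes scalar "tokens" for readability and that a routed token receives the block output scaled by its router score on top of the residual path, while all other tokens skip the block. The function name and the scalar setup are assumptions for illustration only.

```python
def mixture_of_depths_block(tokens, router_scores, capacity, block_fn):
    """Apply block_fn only to the `capacity` tokens with the highest router scores."""
    ranked = sorted(range(len(tokens)), key=lambda i: router_scores[i], reverse=True)
    selected = set(ranked[:capacity])
    output = []
    for i, x in enumerate(tokens):
        if i in selected:
            # Routed token: residual plus router-weighted block output
            output.append(x + router_scores[i] * block_fn(x))
        else:
            # Skipped token: passes through the residual connection unchanged
            output.append(x)
    return output


routed = mixture_of_depths_block(
    tokens=[1.0, 2.0, 3.0, 4.0],
    router_scores=[0.1, 0.9, 0.5, 0.2],
    capacity=2,
    block_fn=lambda x: 10.0,  # stand-in for attention + MLP
)
```

Only the second and third tokens pay for the block computation here, which is where the compute saving comes from.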

A Survey on Large Language Model-Based Game Agents

  • Survey about LLM-based Game agents.
  • Unified architecture of LLMGAs: Perception(text, image, state, etc.), Thinking(reasoning, reflection, planning), Memory, Role-playing (role, experience, emotion), Action-module (control, dialogue, API, etc.) and Learning module.

Advancing LLM Reasoning Generalists with Preference Trees

  • Eurus: LLMs optimized for reasoning. Trains reward model using UltraInteract-dataset, which consists of Preference Trees.
  • Preference Tree: Diverse planning strategies in single pattern (such as tool creation, sequential processing). Multi-turn interaction trajectories with environment and the critique (learn to apply feedback and correct prior errors). Paired correct and incorrect actions in a tree structure. The data pair includes: instruction, correct response and incorrect response.
  • DPO (instruction fine-tuned) hurts performance, while KTO and NCA improve performance. Indicates, that DPO may be less suitable for reasoning tasks.
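
The paired data described above can be pictured as a small tree, where every node holds an instruction with one correct and one incorrect response. This is a sketch with hypothetical names (`TurnNode`, `preference_pairs`), not the UltraInteract schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TurnNode:
    instruction: str
    correct: str
    incorrect: str
    # Follow-up turns, for example after feedback on the incorrect response
    children: List["TurnNode"] = field(default_factory=list)


def preference_pairs(node: TurnNode) -> List[Tuple[str, str, str]]:
    """Flatten the tree into (instruction, chosen, rejected) triples."""
    pairs = [(node.instruction, node.correct, node.incorrect)]
    for child in node.children:
        pairs.extend(preference_pairs(child))
    return pairs


root = TurnNode(
    instruction="Compute 12 * 7.",
    correct="84",
    incorrect="74",
    children=[
        TurnNode(
            instruction="Compute 12 * 7. Feedback: 74 is wrong, re-check the product.",
            correct="84",
            incorrect="94",
        )
    ],
)
pairs = preference_pairs(root)
```

Each triple can then be used as one training example for a reward model or a preference-learning objective.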

Self-Organized Agents: A LLM Multi-Agent Framework toward Ultra Large-Scale Code Generation and Optimization

  • SoA (Self-Organized multi-Agent framework): Self-organized LLMs collaborate to generate a code base and dynamically multiply based on complexity. Uses Mother and Child-agents.
  • Helps to scale the SoA to longer context lengths of code generation.

Large Language Models for Orchestrating Bimanual Robots

  • LABOR (LAnguage-model-based Bimanual ORchestration)-agent.

CMAT: A Multi-Agent Collaboration Tuning Framework for Enhancing Small Language Models


InsightLens: Discovering and Exploring Insights from Conversational Contexts in Large-Language-Model-Powered Data Analysis


Helmsman of the Masses? Evaluate the Opinion Leadership of Large Language Models in the Werewolf Game


Collapse of Self-trained Language Models


RAT: Retrieval-Augmented Transformer for Click-Through Rate Prediction


Is Exploration All You Need? Effective Exploration Characteristics for Transfer in Reinforcement Learning


1st of April 2024

Stream of Search (SoS): Learning to Search in Language

  • Stream of Search (SoS): Symbolic reasoning with next-sequence prediction (LLMs).
  • LLM is pretrained on a SoS-dataset of 500k search trajectories (each trajectory is also called SoS) generated with various search strategies (BFS/DFS-based) to learn an internal world model of search, which includes problem solving using exploration and backtracking.
  • Enables generic and adaptive form of search: symbolic search is based on an explicit environmental model, while SoS learns state transitions. The approach is likely to work in real world due to the complex/variable/branching nature of the game.
  • The policy is improved using APA (Advantage-Induced Policy Alignment) and fine-tuning with the STaR-technique for three iterations using 100k correct trajectories.
  • APA is an Actor-Critic RL technique. It creates a copy of the LLM used as value network to enhance the policy in the LLM. Reward function reviews the length and correctness of the generated trajectory.
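
To make the idea of a search trajectory written out as text concrete, the sketch below serialises a depth-first search, including dead ends and backtracking, into one string. The toy graph and the line format are assumptions for illustration; the paper's own task and format differ.

```python
def dfs_stream(graph, start, goal):
    """Serialise a depth-first search, including backtracking, as text."""
    lines = []

    def visit(node, path):
        lines.append(f"explore {node}")
        if node == goal:
            lines.append("goal reached via " + "->".join(path))
            return True
        for child in graph.get(node, []):
            if visit(child, path + [child]):
                return True
            # The child subtree failed: record the backtracking step
            lines.append(f"backtrack to {node}")
        return False

    visit(start, [start])
    return "\n".join(lines)


toy_graph = {"A": ["B", "C"], "B": ["D"], "C": ["E"]}
stream = dfs_stream(toy_graph, start="A", goal="E")
print(stream)
```

A language model trained on many such strings sees the failed branches as well as the final solution path.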

LLM as a Mastermind: A Survey of Strategic Reasoning with Large Language Models

  • Survey about Strategic reasoning of LLMs: methodologies and metrics. These approaches are categorized into: Prompt engineering, Modular enhancements, Theory of Mind and Fine-tuning.
  • Reasoning tasks include: Common Sense reasoning, Mathematical reasoning, Symbolic reasoning, Causal reasoning and Strategic reasoning.
  • Strategic reasoning differs by being a more dynamic form of reasoning with the environment and by the uncertainty of the adversary's actions.
  • Key traits of strategic reasoning are: Goal-oriented, Interactive, Predictive nature and Adaptability.

Large Language Model Evaluation Via Multi AI Agents: Preliminary results




31st of March 2024


CHOPS: CHat with custOmer Profile Systems for Customer Service with LLMs


Algorithmic Collusion by Large Language Models





30th of March 2024

Alignment of brain embeddings and artificial contextual embeddings in natural language points to common geometric patterns

  • Aligns LLM word embeddings with human brain embeddings.
  • Brain embeddings are generated from fine-grained spatiotemporal neural recordings in a continuous embedding space.
  • Aligning is based on similar geometric shapes between brain and LLM word embeddings.

Injecting New Knowledge into Large Language Models via Supervised Fine-Tuning


Language Models are Spacecraft Operators


A Taxonomy for Human-LLM Interaction Modes: An Initial Exploration


Survey on Large Language Model-Enhanced Reinforcement Learning: Concept, Taxonomy, and Methods


Your Co-Workers Matter: Evaluating Collaborative Capabilities of Language Models in Blocks World


29th of March 2024

Gecko: Versatile Text Embeddings Distilled from Large Language Models

  • Gecko: "SOTA level" text embeddings with 768-dimensions with 7x smaller embedding model compared to prior SOTA. Gecko embeddings with 256 dimensions outperform all existing 768-dimension text embeddings in MTEB.
  • Gecko uses FRet (Few-shot Prompted Retrieval dataset)-fine tuning dataset: task description, input query, positive passage, negative passage.
  • FRet generates with LLM the relevant task and query for a passage. The query and task are fed into a pre-trained embedding model to get neighbor passages. LLM scores them either as positive or negative passages.
  • Original passage may not become relevant positive/negative passage.
  • I think the overall idea could work even as prompt-engineering technique, where original passage is sent to LLM to define query/task, generate positive/negative passage and finally use the query, task, positive, negative passage as basis of retrieval.
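
The FRet steps above can be sketched as a small pipeline. The three callables stand in for the LLM task/query generator, the pre-trained embedding retriever and the LLM scorer. All names and the toy stand-ins (fixed query, word-overlap scoring) are hypothetical and only illustrate the data flow.

```python
def build_fret_example(seed_passage, corpus, generate_task_and_query, retrieve, score, k=3):
    # 1) LLM proposes a task and a query for the seed passage
    task, query = generate_task_and_query(seed_passage)
    # 2) Embedding model returns the nearest passages for that query
    neighbors = retrieve(query, corpus, k)
    # 3) LLM scores each neighbor; best becomes positive, worst becomes negative
    ranked = sorted(neighbors, key=lambda p: score(query, p), reverse=True)
    return {"task": task, "query": query, "positive": ranked[0], "negative": ranked[-1]}


# Toy stand-ins so the sketch runs without any model
def toy_generate(passage):
    return "question answering", "what do geckos eat"


def toy_retrieve(query, corpus, k):
    return corpus[:k]


def toy_score(query, passage):
    return len(set(query.split()) & set(passage.split()))


corpus = ["geckos are lizards", "geckos eat insects and fruit", "paris is in france"]
example = build_fret_example(corpus[0], corpus, toy_generate, toy_retrieve, toy_score)
```

In this toy run the positive passage differs from the seed passage, which mirrors the remark that the original passage may not end up as the positive.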

ITCMA: A Generative Agent Based on a Computational Consciousness Structure

  • ITCMA (Internal Time-Consciousness Machine): an architecture for generative agents called ITCMA-agent. It is "a computational consciousness structure" and good at utility and generalization to real world.
  • ITCMA framework includes LLM, VLM, Agents under consciousness channels (composed of retention, primal impression and protention each next time step further) and Memory.
  • Slowness is a downside.

Enhancing the General Agent Capabilities of Low-Parameter LLMs through Tuning and Multi-Branch Reasoning

  • Explores open source 7B/13B LLMs' ability to perform agentic tasks through supervised fine-tuning with task decomposition/backtracking data (multipath reflective reasoning by prompting the LLM to reflect on a path as not optimal).
  • Agent dataset is constructed through: task construction, trajectory interaction and manual filtering. Includes two usage types: task planning and tool usage.
  • Task planning data is generated the following way. LLM is used in three roles: question generator, action maker (offers thoughts/actions based on environmental feedback) and environmental agent. Action maker/Environmental agent keep interacting until task is completed. Requires manual screening after data is generated to ensure task logical consistency.
  • Tool usage data is generated by manually filtering LLM examples of full reasoning trajectories.

28th of March 2024

STaR-GATE: Teaching Language Models to Ask Clarifying Questions

  • STaR(Self-Taught Reasoner)-GATE (Generative Active Task Elicitation)-algorithm: Self-improves LLM's ability to elicit user preference by generating questions and generalises beyond the trained role-player.
  • Fine tunes LLM by generating a synthetic dataset for math problem dialogues with persona-task prompts.
  • Teaches the LLM to ask clarifying questions to provide personalised responses.

MATEval: A Multi-Agent Discussion Framework for Advancing Open-Ended Text Evaluation

  • MATEval: LLM agents emulate human collaborative discussion. Uses self-reflection, CoT and feedback mechanism.
  • Achieves high correlation with human evaluation. Includes evaluator-, feedback (to improve discussion)- and summarizer-agents.

Change-Agent: Towards Interactive Comprehensive Change Interpretation and Analysis from Change Detection and Change Captioning

  • Change-Agent: Change detection and interpretation of earth surface changes using LLM.

Enhancing the General Agent Capabilities of Low-Parameter LLMs through Tuning and Multi-Branch Reasoning


Change-Agent: Towards Interactive Comprehensive Remote Sensing Change Interpretation and Analysis


LLMs as Academic Reading Companions: Extending HCI Through Synthetic Personae






27th of March 2024

Long-form factuality in large language models

  • Search-Augmented Factuality Evaluator (SAFE): long-form factual check with LLM agent using a 38 topic question set (LongFact). Uses multi-step reasoning and determines if factuality is supported by Google search results.
  • The LLM generates an answer to a question and this answer is split into individual facts. Each fact is rewritten to be self-contained, so it can be understood without the rest of the facts. The individual facts are checked with Google search: facts supported by search results are labelled as supported and the rest as not supported. If a fact is not relevant to the question, then it is labelled as irrelevant.
  • Achieves super-human level performance and measures this with an F1-score.
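
The F1-score can be computed from the per-fact labels. The sketch below assumes the F1@K form, as I understand it from the paper: precision is the share of supported facts among relevant facts, and recall is the number of supported facts relative to a chosen target count K, capped at 1. The label strings and the function name are assumptions.

```python
from collections import Counter


def f1_at_k(labels, k):
    """F1@K over per-fact labels; irrelevant facts are ignored."""
    counts = Counter(labels)
    supported = counts["supported"]
    not_supported = counts["not_supported"]
    if supported == 0:
        return 0.0
    precision = supported / (supported + not_supported)
    recall = min(supported / k, 1.0)  # K = number of supported facts a user wants
    return 2 * precision * recall / (precision + recall)


labels = ["supported"] * 8 + ["not_supported"] * 2 + ["irrelevant"] * 3
score = f1_at_k(labels, k=16)
```

For this example precision is 0.8 and recall is 0.5, so a long answer with few supported facts is penalised even when most of its facts are correct.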

What are human values, and how do we align AI to them?


Large Language Models Need Consultants for Reasoning: Becoming an Expert in a Complex Human System Through Behavior Simulation

  • MEOW (MOsaic Expert Observation Wall): improves LLM reasoning with behaviour simulation.
  • Expert model is trained with simulated data from experience of specific task. Tested in communication game.

A Path Towards Legal Autonomy: An interoperable and explainable approach to extracting, transforming, loading and computing legal information using large language models, expert systems and Bayesian networks

  • Reviews the concept of legal autonomy of LLM agents for the first time: extracting, transforming, loading and computing legal information.

A Study of Three Influencer Archetypes for the Control of Opinion Spread in Time-Varying Social Networks

  • Reviews automated agents in social networks for opinion control: opinion inference engine with LLM, content generation using opinion vectors.


26th of March 2024

MAGIS: LLM-Based Multi-Agent Framework for GitHub Issue Resolution

  • MAGIS: Resolves Github issues with multi-agent LLMs: Manager, Repository Custodian, Developer and Quality Assurance engineer.

Depending on yourself when you should: Mentoring LLM with RL agents to become the master in cybersecurity games

  • SecurityBot: role-based multiagent collaborative framework with RL agent as mentors for LLM agent to support cybersecurity operations. Includes modules: profiles, memory, reflection and action using LLMs.
  • Collaboration mechanism: cursor for dynamic suggestions taking, aggregator for multiple mentors suggestion ranking & caller for proactive suggestion asking.

A Study of Three Influencer Archetypes for the Control of Opinion Spread in Time-Varying Social Networks


OVER-NAV: Elevating Iterative Vision-and-Language Navigation with Open-Vocabulary Detection and StructurEd Representation


Compressed Federated Reinforcement Learning with a Generative Model



25th of March 2024

AIOS: LLM Agent Operating System

  • AIOS-architecture for LLM agent OS: AIOS SDK, LLM Kernel (Kernel layer), OS Kernel, Agent applications (Application layer), HW layer.
  • LLM kernel: Agent scheduler, Context manager, Memory manager, Storage manager, Tool manager and Access manager.

RepairAgent: An Autonomous, LLM-Based Agent for Program Repair

  • RepairAgent: Automated program repair with LLMs with dynamically updated prompt format.

CYGENT: A cybersecurity conversational agent with log summarization powered by GPT-3

  • CYGENT: Fine-tunes LLM for cybersecurity tasks. The LLM agent provides/analyzes/summarizes user information from log files and detected events.

TwoStep: Multi-agent Task Planning using Classical Planners and Large Language Models

  • TwoStep: Combines classical planning with LLMs (Helper Plan and Main Plan).

Do LLM Agents Have Regret? A Case Study in Online Learning and Games


An LLM-Based Digital Twin for Optimizing Human-in-the Loop Systems


Harnessing the power of LLMs for normative reasoning in MASs


Norm Violation Detection in Multi-Agent Systems using Large Language Models: A Pilot Study


Towards Automatic Evaluation for LLMs' Clinical Capabilities: Metric, Data, and Algorithm


Re2LLM: Reflective Reinforcement Large Language Model for Session-based Recommendation


RL for Consistency Models: Faster Reward Guided Text-to-Image Generation



24th of March 2024


Combining Fine-Tuning and LLM-based Agents for Intuitive Smart Contract Auditing with Justifications







23rd of March 2024

When LLM-based Code Generation Meets the Software Development Process

  • LCG: Multi-agent LLM consisting of waterfall, scrum and Test-Driven-Development sw development workflows with CoT and Self-refinement.
  • LLM agent includes roles: requirements engineer, architect, developer, tester and scrum master. Uses same prompt, with role-identifier, role-specific instruction and task-information to drive dynamic prompting.
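
The shared prompt with role-identifier, role-specific instruction and task-information can be sketched as one template that is filled per role. The instruction texts and the template wording are invented for illustration and are not taken from the paper.

```python
# Hypothetical role-specific instructions, one per role named in the paper
ROLE_INSTRUCTIONS = {
    "requirements engineer": "Write clear, testable requirements.",
    "architect": "Propose a high-level design for the requirements.",
    "developer": "Implement the design as working code.",
    "tester": "Write tests and report failures.",
    "scrum master": "Summarise progress and decide the next step.",
}

PROMPT_TEMPLATE = "Role: {role}\nInstruction: {instruction}\nTask: {task}"


def build_prompt(role, task):
    # Same template for every role; only the filled-in fields change
    return PROMPT_TEMPLATE.format(
        role=role, instruction=ROLE_INSTRUCTIONS[role], task=task
    )


task = "Build a function that validates email addresses."
prompts = [build_prompt(role, task) for role in ROLE_INSTRUCTIONS]
print(prompts[2])
```

A waterfall workflow would call the roles once in order, while scrum or test-driven workflows would loop over a subset of them.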

Towards a RAG-based Summarization Agent for the Electron-Ion Collider


EduAgent: Generative Student Agents in Learning




22nd of March 2024

Can large language models explore in-context?

  • Reviews, if LLMs can explore effectively in-context, similar to Reinforcement learning-like agents.
  • Suggest need for external summarization, larger models like GPT-4 and careful prompt engineering.

CoLLEGe: Concept Embedding Generation for Large Language Models

  • CoLLEGe (Concept Learning with Language Embedding Generation): few-shot learning for new-concept acquisition and knowledge augmentation for LLMs.
  • Generates concept embedding with CoLLEGe based on two example sentences, where the concept is used, creates a definition-sentence using this concept-embedding and asks LLM to generate the definition of the concept.

LLM-Driven Agents for Influencer Selection in Digital Advertising Campaigns

  • Influencer Dynamics Simulator (IDS): LLM-agent based influencer selection for digital ad campaigns.
  • Includes: Influencer pre-selection, user profile generation, follower behaviour prediction and influencer tracking.

Language Models in Dialogue: Conversational Maxims for Human-AI Interactions

  • Proposes principles for effective human-AI conversation: quantity, quality, relevance, manner, benevolence and transparency.

CACA Agent: Capability Collaboration based AI Agent

  • CACA (Capability Collaboration based AI Agent): LLM agent with the following components: profile capability, reception capability, workflow capability, tool capability, tool service, methodology capability, add domain knowledge and planning capability.
  • Processes: user request, generate plan, search methodology, get profile, discover tool, invoke service, add domain knowledge and register tool service.

Content Knowledge Identification with Multi-Agent Large Language Models (LLMs)


21st of March 2024

ReAct Meets ActRe: Autonomous Annotations of Agent Trajectories for Contrastive Self-Training

  • A^3T (Autonomous Annotation Agent Trajectories): Closed-loop self-improvement for LLM agents.
  • Autonomous annotation of agent trajectories with ReAct for contrastive self-training. Reduces human-effort of data-collection.
  • Agent reasons for actions taken (ActRe-prompting agent). Contrastive self-training uses rewards for decisions made, based on accumulated successful trajectories.
  • The model outperforms GPT-4 and matches human average in the Webshop-benchmark.

ERD: A Framework for Improving LLM Reasoning for Cognitive Distortion Classification

  • ERD: Three step approach to reason cognitive distortions of user input: extraction, reasoning (CoT, Diagnosis of Thought) and debate between two LLM-agents and one LLM-judge.

PeerGPT: Probing the Roles of LLM-based Peer Agents as Team Moderators and Participants in Children's Collaborative Learning

  • PeerGPT: pedagogical agents in Children collaborative learning with peer agent as team moderator or peer agent as a participant.

RoleInteract: Evaluating the Social Interaction of Role-Playing Agents

  • RoleInteract-benchmark: Measures Sociality skills of role-playing LLM-agents. Conversation memory is one aspect to improve conversational agents. Complex group dynamics are still hard.

Polaris: A Safety-focused LLM Constellation Architecture for Healthcare

  • Polaris: 1T parameter LLM as a co-operative agent for patient friendly conversation with multiple specialist agents like nurses/social workers/nutritionists. Uses iterative co-training to optimize diverse objectives. Uses healthcare-related data, including proprietary data.
  • Performs on par with human nurses and significantly outperforms GPT-4.

20th of March 2024

Reverse Training to Nurse the Reversal Curse

  • Reverse training: trains LLMs using reverse order to solve the reversal curse, where the LLM struggles to learn: B is a feature of A.
  • Reversal curse has been a key issue in current LLM training.

Large Language Models meet Network Slicing Management and Orchestration

  • LLM slices isolated virtual network of a Physical infrastructure.

Mapping LLM Security Landscapes: A Comprehensive Stakeholder Risk Assessment Proposal

  • Traditional risk assessment framework for LLMs through 10 categories: prompt injection, insecure plugin design, training data poisoning, model denial of service, supply chain vulnerabilities, sensitive information disclosure, insecure output handling, excessive agency, overreliance and model theft.

19th of March 2024

Agent-FLAN: Designing Data and Methods of Effective Agent Tuning for Large Language Models

  • Agent-FLAN (Finetuned LANguage models for agents): finetuning for agentic tasks.
  • Llama-2 7B model with Agent-FLAN surpasses by 3.5% existing SOTA models. Works both for tool utilization and agentic tasks.
  • Observes: LLMs overfit to specific agentic task formats like JSON, the learning speed of LLMs varies across agentic tasks and current training methods introduce hallucinations.

HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning

  • HYDRA (HYper Dynamic Reasoning Agent): multi-stage dynamic compositional visual reasoning, to make hyper-decisions (fast, strategic and efficient decisions).
  • Three modules: LLM-Planner, RL agent (controller) and LLM-Reasoner (includes code generator and code executor). Includes Memory (code-, instruction- and feedback-history) and LLM-Textualizer (Uses template to create summary).
  • Planner and Reasoner generate instructions/code with LLM. RL agent interacts with these modules and makes high-level decisions from the best instructions based on history. HYDRA adjusts actions from feedback received in reasoning. User queries are deconstructed into three sub-questions processed concurrently. The code executor has access to vision foundational models like BLIP, XVLM and GLIP.
  • RL agent is based on DQN-algorithm.

Characteristic AI Agents via Large Language Models

  • Characteristics AI: simulates real-life individuals in different situations. Releases Character100-dataset.

Embodied LLM Agents Learn to Cooperate in Organized Teams

  • Introduces prompt-based organizational structure. Reduces LLM errors related to redundant information and complying with any instruction. Includes communication- and action phases. Criticize-Reflect architecture.

Contextual Moral Value Alignment Through Context-Based Aggregation

  • CMVA-GS: moral value agents with different profiles pass through contextual aggregator.

LLMs-based Few-Shot Disease Predictions using EHR: A Novel Approach Combining Predictive Agent Reasoning and Critical Agent Instruction


The Use of Generative Search Engines for Knowledge Work and Complex Tasks


18th of March 2024

Multimodal Human-Autonomous Agents Interaction Using Pre-Trained Language and Visual Foundation Models

  • Dual-modality framework: leverages independent LLM/VLM/SR models in order to interact with autonomous robots.
  • Includes components of visual understanding, LLM and Speech recognition.

EnvGen: Generating and Adapting Environments via LLMs for Training Embodied Agents

  • EnvGen-framework: An LLM-agent creates training environments for reasoning, so smaller embodied RL-agents improve their weak skills.
  • Benefits from the LLM-agent's world knowledge and the small, yet capable RL agents.

From Pixels to Insights: A Survey on Automatic Chart Understanding in the Era of Large Foundation Models

  • Chart understanding task (chart Q&A, captioning, fact-checking, -to-table conversion, factual error correction).

Agent3D-Zero: An Agent for Zero-shot 3D Understanding

  • Agent3D-Zero: 3D scene understanding agent with VLM by selecting and analyzing series of viewpoints for 3D understanding.

17th of March 2024

Logic Query of Thoughts: Guiding Large Language Models to Answer Complex Logic Queries with Knowledge Graphs


15th of March 2024

DiPaCo: Distributed Path Composition

  • DiPaCo (DIstributed PAth COmposition): a modular ML paradigm, where computing is distributed by path. Path refers to a sequence of modules defining an input-output function.
  • Paths are small in relation to the overall model. During both training and deployment, a query is routed to replica of a path (sparsely activated), not the entire model.
  • The training phase distributes computation by paths through set of shared modules. The inference phase computes single path.
  • First large-scale, more modular and less synchronous learning, when FLOPs are relatively cheap and communication is relatively expensive.
  • Exceeds 1B parameter dense Transformer by choosing 256 possible paths with size of 150 million parameters.
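
The path idea above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the layout of 2 levels with 16 shared modules each (giving 16 × 16 = 256 paths), the toy module functions and the `route` rule are all assumptions made for the example.

```python
from itertools import product

LEVELS, MODULES_PER_LEVEL = 2, 16  # assumed layout: 16 * 16 = 256 paths

def make_module(level, index):
    # Toy stand-in for a shared module; a real module would be a network block.
    return lambda x: x + (level + 1) * 0.01 * index

modules = [[make_module(level, i) for i in range(MODULES_PER_LEVEL)]
           for level in range(LEVELS)]

# A path is one module choice per level.
paths = list(product(range(MODULES_PER_LEVEL), repeat=LEVELS))

def route(query: str) -> int:
    """Hypothetical router: deterministic mapping from a query to a path id."""
    return sum(query.encode("utf-8")) % len(paths)

def run_path(path_id: int, x: float) -> float:
    """Only the modules on the selected path are executed (sparse activation)."""
    for level, module_index in enumerate(paths[path_id]):
        x = modules[level][module_index](x)
    return x

path_id = route("translate this sentence")
output = run_path(path_id, 1.0)
```

Each query touches only `LEVELS` modules, regardless of how many modules exist in total, which is the property the bullets describe.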

PERL: Parameter Efficient Reinforcement Learning from Human Feedback

  • PERL (Parameter Efficient Reinforcement Learning): Compares reward modelling training and RL using LoRA against traditional RLHF. The study focuses on device UI control, such as sending email.
  • PERL achieves similar level of performance with less training compute and less memory used.
  • Releases two self-dialogue datasets: Taskmaster Coffee and Ticketing. A UI automation dataset called "S-dataset" is planned for release, but still pending. It is unclear whether the NPOV-dataset is kept internal.

AUTONODE: A Neuro-Graphic Self-Learnable Engine for Cognitive GUI Automation

  • AUTONODE (Autonomous User-Interface Transformation through Online Neuro-graphic Operations and Deep Exploration).
  • Integrates Dora (Discovery and mapping Operation for graph Retrieval Agents).

Enhancing Human-Centered Dynamic Scene Understanding via Multiple LLMs Collaborated Reasoning

  • V-HOU Multi-LLMs Collaborated Reasoning: video scene understanding.

Can a GPT4-Powered AI Agent Be a Good Enough Performance Attribution Analyst?

  • LLM agent for performance attribution using CoT and Plan and Solve (PS).

ChatPattern: Layout Pattern Customization via Natural Language


ExeGPT: Constraint-Aware Resource Scheduling for LLM Inference


14th of March 2024

Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking

  • Quiet-STaR: Extension and generalization of the STaR-paper. Significantly improves LLM performance on the GSM8K-benchmark.
  • Uses "meta-tokens" at the start/end of each thought, to learn when to generate a rationale and when it should make a prediction based on that rationale.

Enhancing Trust in Autonomous Agents: An Architecture for Accountability and Explainability through Blockchain and Large Language Models

  • Blockchain-based autonomous agent, which provides not only explanations, but also an auditable record of its interpretations.
  • Components: Autonomous agent, blockchain, Non-expert users, Automatic evaluation, Explainability component and Asynchronous task.

VisionGPT-3D: A Generalized Multimodal Agent for Enhanced 3D Vision Understanding

  • VisionGPT-3D: Multimodal agent optimizing 3D vision understanding by integrating YOLO-, SAM- and DINO-models.
  • Starts by making a depth map from multiple images, converts the depth map into point cloud, then into mesh and finally into a video.

From Skepticism to Acceptance: Simulating the Attitude Dynamics Toward Fake News

  • Fake news Propagation Simulation (FPS)-framework: identifies the usefulness of LLMs in combating fake news. Reviews trends and controls of fake news using multiple agents under different personas (age/name/education/personality traits) with both long/short-term memory and self-reflection. Early and frequent regulation of fake news helps to limit its propagation impact.
  • Dynamic Opinion Agent (DOA) simulates cognitive processes of each agent. Agent Interaction Simulator (AIS) defines how/which agents interact daily and publishes new common knowledge/beliefs to agents.

LLM-based agents for automating the enhancement of user story quality: An early report

  • ALAS (Autonomous LLM-based Agent System): LLM-based system between different agent profiles to develop and maintain high-quality IT user stories.
  • Agent profiles: Product Owner/Requirements Engineer. User story. Task preparation phase: task, sub-tasks, context and vision statement. Task conduction-phase.

USimAgent: Large Language Models for Simulating Search Users

  • USimAgent: generates search interaction sequence through multiple rounds, taking into account context generated in prior rounds, each with steps: reasoning/action, query generation and click behaviour.

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

  • MM1: MLLM training.

13th of March 2024

Gemma: Open Models Based on Gemini Research and Technology


Scaling Instructable Agents Across Many Simulated Worlds

  • SIMA: The Scalable, Instructable, Multiworld Agent based on image from the screen and text instruction provided by user. SIMA agent uses text encoder, image encoder and video encoder to process the input image and text and output only the embodied action.
  • Real-time, embodied agent generalizes in 3D environments to any human task and is coordinated by natural language instructions. Agent trained on multiple games outperformed an agent trained on a single game. Performs nearly as well in new unseen game environments.
  • Data collection from commercial video game environments, Training of SIMA Agent model with text instruction-actions and human evaluation.

SOTOPIA-π: Interactive Learning of Socially Intelligent Language Agents

  • SOTOPIA-π: LLMs with social intelligence engage, act safer and persuade more.
  • Achieves social interaction goal completion capability of GPT-4 using 7B LLM.
  • Starts by generating social tasks with each character with its own social goal. Continues by collecting this training data using behavioural cloning (expert signal) and self-reinforcement (strongly performing signals from itself). Improves the agent policy with the LLM ratings. Generates SOTOPIA tasks with characters and evaluates their interaction with LLM rating and human rating.

AutoGuide: Automated Generation and Selection of State-Aware Guidelines for Large Language Model Agents

  • AutoGuide: the LLM-agent receives task-information, in-context examples, current trajectory and "state-aware guidelines"-retrieval.
  • The "State-aware retrieval" is in short a navigational instruction of the specific section in the web-page, such as clicking the "Forum"-button leads to page, where you can create a new Forum.

TINA: Think, Interaction, and Action Framework for Zero-Shot Vision Language Navigation

  • TINA (Thinking, Interacting and Action)-framework: a zero-shot Vision-Language Navigation (VLN) based LLM-agent, visual perceptor making observations and a memory.
  • Agent inputs include: Task description, Instruction and Memory. Trajectory memorizer summarizes observations/actions to memory.

System for systematic literature review using multiple AI agents: Concept and an empirical evaluation

  • Systematic Literature Reviews (SLRs)-agent: planner, literature identification, data extraction, data compilation, performance validation. The code includes concrete prompts used with each step.

Hierarchical Auto-Organizing System for Open-Ended Multi-Agent Navigation

  • HAS (Hierarchical Auto-organizing System): Auto-organizes LLM-agents to complete navigation tasks using dynamic maps and auto-organizing-mechanism.
  • Centralized planning (planner, describer, critic and deployer) with global multi-modal memory, distributed execution (actor, curriculum, critic and skill) with local-multi-modal memory and multimodal information (vision, audio, object and map) with environment state.

Cultural evolution in populations of Large Language Models

  • Models cultural evolution in LLM-agent population.

CleanAgent: Automating Data Standardization with LLM-based Agents

  • CleanAgent: a data preparation LLM agent.

12th of March 2024

NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning Disentangled Reasoning

  • NavCoT (Navigational CoT): LLM acts as a world model and a navigational reasoning agent.
  • LLM is prompted to forecast the navigational NavCoT: 1. act as world model to imagine the next observation based on instruction, 2. select best aligned candidate observation fitting to the imagination, 3. determine action based on reasoning from prior steps.
  • In the Future Imagination-step (FI), the LLM is prompted to imagine the next observation, such as seeing a Patio. Visual Information Filter (VIF) selects from the available options provided by the VLM (image and description of the action towards it), the best matching to the FI. Action Prediction (AP)-step generates action prediction based on the selected option.

WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?

  • Introduces two benchmarks, WorkArena and BrowserGym, to evaluate LLM-agents interacting with software via a browser.
  • WorkArena (list, form, knowledge base, service catalog, menus) includes 23k tasks to interact with ServiceNow.
  • BrowserGym designs and evaluates web agents in a Python environment, which includes html content, raw pixels and accessibility tree.
  • Illustrates clear difference in web browsing expertise between GPT-3.5 vs. GPT-4.

Transforming Competition into Collaboration: The Revolutionary Role of Multi-Agent Systems and Language Models in Modern Organizations

  • Multiagent Data and AI based platform framework: data, playground, web app, embedding model, multiagent orchestration (rest of the components interact with), data security/privacy, APIs/plugins, LLM & cache, Cloud provider, cloud DBs, Data Ops, MLOps, LLMOps and data strategy/ethics/LLM governance. The paper offers very little apart from this list, but the list does include quite a few of the components.

DexCap: Scalable and Portable Mocap Data Collection System for Dexterous Manipulation

  • DexCap: a hand motion data capture system.

AesopAgent: Agent-driven Evolutionary System on Story-to-Video Production

  • Aesop-agent: Multimodal content generation agent.
  • Includes RAG from database (expert experience/professional knowledge), script generation, image generation, video assembly, utility layer.
  • Reviews prompt optimization.

11th of March 2024

RecAI: Leveraging Large Language Models for Next-Generation Recommender Systems

  • RecAI: Recommender systems based on LLMs, where user makes query, the LLM agent makes tool queries to get the correct items.
  • Includes Profile memory, info query, item retrieval and item ranker.
  • The LLM chain includes: init state, dynamic demo, plan execute and reflection.
  • Refers to planning called Plan-First method, which creates comprehensive execution plan and then strictly follows this plan. The planning input includes: user input, context, tool descriptions and demonstrations for in-context learning to create tool utilization plan.

DriveDreamer-2: LLM-Enhanced World Models for Diverse Driving Video Generation

  • DriveDreamer-2: First world model to generate customized driving videos, including uncommon scenes.
  • LLM generates user-defined driving videos: LLM converts user request into agent-based trajectories, which are used to generate an HDMap (python script creates Bird Eye View (BEV)) while respecting traffic rules. Unified Multi-View Model (UniMVM) improves temporal and spatial coherence of the generated video.

Academically intelligent LLMs are not necessarily socially intelligent

  • SESI (Situational Evaluation of Social Intelligence)-benchmark: Superficial friendliness is principal reason for errors.
  • Reviews: Empathy, Social-cognition, self-presentation, influence and concern.
  • Illustrates interesting insight about GPT-4 not being better in this benchmark than GPT-3.5 turbo and Mistral model outperforming Llama 2.

10th of March 2024

TRAD: Enhancing LLM Agents with Step-Wise Thought Retrieval and Aligned Decision

  • TRAD: Thought Retrieval Aligned Decision.
  • Includes three sub-processes: Temporal Expansion, Relative Order Mark and History Alignment.

ArgMed-Agents: Explainable Clinical Decision Reasoning with Large Language Models via Argumentation Schemes

  • ArgMed-agent: Generator of the Argumentation Schema (AS), Verifier of the AS and Reasoner as symbolic solver.

Reframe Anything: LLM Agent for Open World Video Reframing

  • RAVA (Reframe Any Video Agent): Perception to interpret user query and video content, Planning to determine aspect ratio/reframing strategies and Execution, which uses video editing tools to produce the final video.

9th of March 2024

Cached Model-as-a-Resource: Provisioning Large Language Model Agents for Edge Intelligence in Space-air-ground Integrated Networks

  • Model caching optimization on edge devices. Age of Thought (AoT): to measure the relevance/coherence of intermediate thoughts during CoT inference.

8th of March 2024

RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Horizon Generation

  • Retrieval Augmented Thoughts (RAT): Iterative revising CoTs with retrieval information, which improves LLM reasoning in long-horizon tasks and reduces hallucinations.
  • First generates a CoT answer, then uses this answer with a verification prompt. The verification prompt requests to verify correctness of the given answer to the question with the separately added information query, for example by using Bing/Google search (authors implement a separate get_content function in their Github code).
  • The query is based on the draft answer. The retrieved information is used to revise the draft answer. The next thought is then appended and a new round of revision performed. The process is repeated, until all revised thoughts are obtained and the final answer is provided.
  • The github code includes multiple functions to manage inputs and outputs for the LLMs.
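
The revision loop described above can be sketched as below. This is a minimal illustration under stated assumptions: `rat`, `toy_retrieve` and `toy_revise` are hypothetical names, and the retrieval and revision steps are toy stand-ins for a search call and an LLM call, not the authors' code.

```python
from typing import Callable, List

def rat(question: str,
        draft_thoughts: List[str],
        retrieve: Callable[[str], str],
        revise: Callable[[str, str, str], str]) -> List[str]:
    """Revise thoughts one by one, each time retrieving with the current draft."""
    revised: List[str] = []
    for thought in draft_thoughts:
        draft = " ".join(revised + [thought])       # revised prefix + next thought
        evidence = retrieve(f"{question} {draft}")  # the query is based on the draft
        revised.append(revise(question, thought, evidence))
    return revised

# Toy stand-ins so the sketch runs without any external service.
facts = {"capital": "Canberra is the capital of Australia."}

def toy_retrieve(query: str) -> str:
    return " ".join(v for k, v in facts.items() if k in query.lower())

def toy_revise(question: str, thought: str, evidence: str) -> str:
    if "Sydney" in thought and "Canberra" in evidence:
        return thought.replace("Sydney", "Canberra")
    return thought

steps = rat("What is the capital of Australia?",
            ["The capital is Sydney.", "So the answer is Sydney."],
            toy_retrieve, toy_revise)
```

In a real setup `retrieve` would wrap a search API and `revise` would prompt the LLM with the question, the draft thought and the retrieved text.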

FLAP: Flow Adhering Planning with Constrained Decoding in LLMs

  • FLAP (Flow Adhering Planning): Static planning in task oriented dialogs using constrained decoding algorithm based on lookahead heuristics.
  • The research is static planning, but the authors plan a follow up research with dynamic planning.
  • Aligns suggested plan thoughts using a three-scale score regarding: user intent alignment, permitted flow steps, API selected, API permitted and structurally correct.

Will GPT-4 Run DOOM?

  • Doom-game agent, consisting of a Python-based Manager module connected to Doom code and three modules: Planner, Vision and Agent.
  • Vision module (GPT-4V) receives screenshots from the Manager and provides a text description of them. Planner uses as input the walkthrough and history and outputs a granular plan to be executed. Uses k-level of experts.

7th of March 2024

Acceleron: A Tool to Accelerate Research Ideation

  • Acceleron: LLM agent for research using colleague and mentor personas. Interacts with the researcher to develop a research proposal.
  • Introduces the concept of "Unanswerability", where the LLM should identify when all the retrieved paragraphs are irrelevant.

6th of March 2024

PPTC-R benchmark: Towards Evaluating the Robustness of Large Language Models for PowerPoint Task Completion

  • PowerPoint Task Completion-Robustness (PPTC-R)-benchmark for LLMs PowerPoint completion tasks.

SheetAgent: A Generalist Agent for Spreadsheet Reasoning and Manipulation via Large Language Models

  • SheetAgent: LLM-agent to complete spreadsheet tasks by interacting through iterative task reasoning. Introduces SheetRM-benchmark.
  • Includes three modules: Planner (generates python code to modify the spreadsheet), Informer (produces SQLs to perceive the spreadsheet despite dynamic range) and Retriever (retrieves instructive examples to improve robustness).
  • Includes an interesting concept of an erroneous-code repository stored as a Milvus vector database, in order to perform cosine similarity search in case of erroneous code.

Exploring LLM-based Agents for Root Cause Analysis

  • Introduces LLM-based Root-Cause-Analysis (RCA) agent based on ReAct.

5th of March 2024

Cradle: Empowering Foundation Agents Towards General Computer Control

  • Cradle-framework: introduces MLLM-agent to control GUI using screenshot inputs and outputs executable code to control keyboard/mouse actions (key or button to press/where/duration/speed/location to move). Introduces the term General Computer Control (GCC).
  • Includes modules: information gathering/self-reflection/task inference/skill curator/action planning/memory(episodic for retaining information/procedural for skills).
  • Uses PyDirectInput instead of pyautogui for keyboard control. Includes low-level wrapper, which uses ctypes in windows and AppleScript in Mac to communicate low-level mouse controls.
  • Procedural memory is based on topk matches of the skills (text embeddings).
  • Episodic memory consists of short-term (screenshots/task guidance actions/reasoning) and long-term summary. Short-term memory includes forgetting factor k set to 5 interactions.
  • The long-term memory includes recurrent information summary to avoid losing track of long-horizon task objective while inside short-horizon task: ongoing task/the past entities met/past behaviours.
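
The two memory mechanisms above (a short-term window with a forgetting factor of 5, and top-k skill retrieval by embedding match) can be illustrated with a small sketch. The skill names, the 3-dimensional embeddings and the helper names are hypothetical; cosine similarity is assumed as the matching score.

```python
import math
from collections import deque

K = 5  # forgetting factor from the text: keep the last 5 interactions
short_term = deque(maxlen=K)

def remember(interaction: dict) -> None:
    short_term.append(interaction)  # the oldest entry is dropped automatically

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k_skills(query_vec, skills: dict, k: int = 2):
    """Return the names of the k skills whose embeddings best match the query."""
    ranked = sorted(skills, key=lambda name: cosine(query_vec, skills[name]),
                    reverse=True)
    return ranked[:k]

for step in range(8):
    remember({"step": step, "action": f"action_{step}"})

skills = {  # hypothetical skill embeddings
    "open_map": [1.0, 0.0, 0.0],
    "ride_horse": [0.0, 1.0, 0.0],
    "open_menu": [0.9, 0.1, 0.0],
}
best = top_k_skills([1.0, 0.05, 0.0], skills)
```

After eight interactions only steps 3 to 7 remain in `short_term`, which is why a separate long-term summary is needed to keep the overall objective in view.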

Reaching Consensus in Cooperative Multi-Agent Reinforcement Learning with Goal Imagination

  • MAGI (Multi-Agent Goal Imagination)-framework: agents reach consensus (and cooperatively reaching valuable future states) through imagined common goal.
  • Future states are modeled with CVAE-based self-supervised generative modelling. Samples a common goal with high-potential value for multi-agent consensus to guide policies of all agents.
  • CVAE is self-supervised conditional variational auto-encoder to model the distribution of future states.

Language Guided Exploration for RL Agents in Text Environments

  • Introduces Language Guided Exploration (LGE), which in this study outperforms Behaviour Cloning.
  • Explorer: RL agent with LGE outperforms behaviour cloning by a wide margin. The key component is the Guide-model (LLM), which provides world knowledge to introduce a set of feasible actions, substantially reducing the possible action space.

KnowAgent: Knowledge-Augmented Planning for LLM-Based Agents

  • KnowAgent: LLM-agent to improve planning with explicit action knowledge retrieval. The agent includes Action Knowledge Base (AKB), Planning Path Generation (question, action path, thought and observation) and Knowledgeable Self-Learning.
  • Introduces term planning hallucinations, which refers to agent generating conflicting or unnecessary action sequences.
  • AKB contains information to steer action generation process: action name, definition, rule and knowledge.
  • Knowledgeable Self-Learning phase continuously improves the understanding and usage of action knowledge.

Learning to Use Tools via Cooperative and Interactive Agents

  • ConAgents: Cooperative and interactive agents, which iteratively applies three modules: Grounding, Execution and Observation.
  • Grounding step grounds user query into tool definition and target output. Executing defines required tool arguments and completes returned output. Observing addresses long-form data outputs with IterCal-method: LLM agent self-adapts to feedback from tool environment.
  • IterCal-method uses a pseudo-schema, which is basically a simplified human-readable dictionary of the lengthy output returned from the tool used, see the pseudo-schema in the last page of the paper for quick understanding.

OPEx: A Component-Wise Analysis of LLM-Centric Agents in Embodied Instruction Following

  • OPEx-agent: Includes Observer, Planner and Executor-roles. Observer-agent processes and interprets sensory inputs, such as vision from the environment. Planner integrates dynamically strategic plans and sub-tasks based on perception. Executor implements the plans with skills library.
  • Embodied Instruction Following (EIF): agents follow task instructions by interacting with the environment through observations in an ego-centric way.
  • The agent state basically includes which objects the agent is currently observing, which objects have been found, which observations have been made so far and which previous steps have been completed. In addition, the current objective, thought and action are known.

Android in the Zoo: Chain-of-Action-Thought for GUI Agents

  • Chain-of-Action-Thought (dubbed CoAT): a novel prompting strategy to allow GUI agents to perceive, reason and decide.
  • CoAT includes four parts: Screen context, Action thinking, Action target and Action Result.
  • Screen context explains content of the GUI screenshot. Action thinking takes user query, current screen and history to define possible actions to complete goal. Action target refers to GUI element being actioned such as clicking an icon. Action result maps current screen with next action to future observation.

InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents

  • InjecAgent-benchmark with +1k test cases in 17 tools and 62 attacker tools. Illustrates that Attack Success Rate (ASR) remains high, especially in open source models like Llama 2.
  • This result is surprising, considering "open source" models are often categorized as safer options over closed models.

Entropy-Regularized Token-Level Policy Optimization for Large Language Models

  • Entropy-Regularized Token-level Policy Optimization (ETPO).

ChatCite: LLM Agent with Human Workflow Guidance for Comparative Literature Summary

  • ChatCite: Literature summary LLM-agent. Includes Key-Element Extractor and Reflective Incremental Generator.
  • Key-Element Extractor: Extracts research questions, methodology, results, conclusions, contributions, innovations and limitations. These are stored in memory.
  • Reflective Incremental Generator: Reflective mechanism, Comparative summarizer, Reflective Evaluator and Rank & Select. Iteratively repeated.

4th of March 2024

Trial and Error: Exploration-Based Trajectory Optimization for LLM Agents

  • Exploration-based Trajectory Optimization (ETO): LLM agent collects failure trajectories to update its policy using failure-success trajectories.
  • ETO includes three steps: Explore (SFT-based behavioral cloning LLM agent), Collect Failures (pairs contrastive trajectories from the failures and expert trajectories) and Optimize trajectories (DPO loss on the pairs).
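
The "DPO loss on the pairs" step can be made concrete with a small sketch. It assumes the standard DPO formulation applied to one (success, failure) trajectory pair; the function name, the default `beta` and the example log-probabilities are illustrative, not taken from the paper.

```python
import math

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """DPO loss for one (success, failure) trajectory pair.

    Inputs are summed log-probabilities of each trajectory under the
    current policy and under the frozen reference (SFT) policy.
    """
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))

# Policy identical to the reference: no preference learned yet.
no_preference = dpo_loss(-10.0, -10.0, -10.0, -10.0)

# Policy has shifted probability mass towards the successful trajectory.
prefers_success = dpo_loss(-8.0, -12.0, -10.0, -10.0)
```

The loss falls as the policy raises the likelihood of the successful trajectory relative to the failed one, measured against the reference policy.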

2nd of March 2024

AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks

  • AutoDefense: Introduces multi-agent LLM-jailbreaking prevention framework with input agent, defence agent and output agents.
  • Defence agent includes prompt analyser agent, intention analyser agent, judge agent and coordinator agent.
  • Reduces success rate of prompt attacks.

SceneCraft: An LLM Agent for Synthesizing 3D Scene as Blender Code

  • SceneCraft: LLM agent converts text into Python code for Blender API 3D-scenes.
  • Dual-loop: Inner loop keeps improving scene by writing Blender code, Blender API renders the code and critic-revising this rendered image using Vision-Language Model (VLM).
  • Outer loop learns by updating reusable functions to the library.
  • The beauty of this approach is that having the VLM revise the end result makes it a very generic approach for self-improvement.

1st of March 2024

Playing NetHack with LLMs: Potential & Limitations as Zero-Shot Agents

  • NetPlay: zero-shot agent, which uses agent loop using GPT-4.
  • Constructs prompt including past events, the current observation, a task description with available skills and the desired output format. Retrieve new skill and Execute it. New events are then observed.

28th of February 2024

Human Simulacra: A Step toward the Personification of Large Language Models

  • Creates LLM personification with complete life story to simulate personality and interact with the external world in a human-like manner.
  • Uses multi-agent framework to simulate cognitive functions, memory and psychology-guided evaluation to assess the quality of the human simulation with self-reporting and external observations.

Prospect Personalized Recommendation on Large Language Model-based Agent Platform

  • Rec4Agentverse: Recommender agent with three steps: User-Agent Interaction, Agent-Recommender, Agents Collaboration.

Data Interpreter: An LLM Agent For Data Science

  • Data Interpreter: Data scientist LLM agent with Plan, Code and Verify steps. The pipeline is represented as a DAG-structure.
  • Plan: Real data adaptation using dynamic planning with hierarchical graph structures. Code: Dynamic tool integration to improve code execution. Verify: Logical inconsistency identification through feedback.
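
A pipeline represented as a DAG can be executed in dependency order, as in the sketch below. The task names, the `run_pipeline` helper and the toy executor are hypothetical; the sketch only shows the DAG ordering idea, not the agent's planning or verification logic.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each task maps to the set of tasks it depends on.
task_graph = {
    "load_data": set(),
    "clean_data": {"load_data"},
    "feature_engineering": {"clean_data"},
    "train_model": {"feature_engineering"},
    "evaluate": {"train_model", "clean_data"},
}

def run_pipeline(graph, execute):
    """Execute tasks in dependency order and collect their results."""
    results = {}
    for task in TopologicalSorter(graph).static_order():
        inputs = {dep: results[dep] for dep in graph[task]}
        results[task] = execute(task, inputs)
    return results

def toy_execute(task, inputs):
    # Stand-in for generated code: just record which inputs the task received.
    return f"{task}({', '.join(sorted(inputs))})"

results = run_pipeline(task_graph, toy_execute)
order = list(results)
```

With dynamic planning, a failed task would lead to regenerating the affected part of the graph and re-running only its downstream tasks.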

24th of February 2024

ByteComposer: a Human-like Melody Composition Method based on Language Model Agent

  • ByteComposer: LLM-agent based melody composer with four elements: Conception analysis, Draft composition, Self-evaluation and modification and Aesthetic selection.

23rd of February 2024

Large Multimodal Agents: A Survey

  • Survey on multi-modal AI and LLM agents.

Genie: Generative Interactive Environments

  • Genie: a Foundational World Model. The learning paradigm is unsupervised learning from unlabelled internet video. The approach scales effectively as compute is increased.
  • Includes: 1. Latent Action Model (LAM) for latent action between each video frame in each timestep, 2. Video tokenizer to convert video frames into discrete tokens, 3. Dynamics model to predict next frame.
  • The model/datasets are not released, but the approach is explained in the paper with single GPU implementation details by bringing your own data using the dataset creation instructions provided.

21st of February 2024

Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping

  • Searchformer: Transformer model outperforms A* search algorithm in planning.
  • Two step approach, where Transformer excels in large action spaces and learns heuristics (strategies to guide search) from the training with the data.
  • First step generates synthetic dataset: Imitate A* search by using A* search and recording compute and optimal plan as text token sequences (task description, search tree dynamics, and final plan) with length of thousands of tokens. This dataset includes search dynamics of A* search itself. Train a Transformer model (Searchformer) to generate the text token sequences with optimal plan for a given task. This leads to a transformer model, which has the A* search coded in the model weights.
  • Second step further trains Searchformer using Expert Iteration, which attempts to generate optimal plans with fewer search steps. The resulting model solves Sokoban puzzles with 27% fewer search steps than the A* search algorithm. The idea is to generalize the Transformer model into more generic search beyond A* search.
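
The first step, recording A* search dynamics as a token sequence, can be sketched as below. The grid task, the `create`/`close` token vocabulary and the way the final sequence is assembled are assumptions for illustration; the paper's exact format may differ.

```python
import heapq

def astar_trace(grid, start, goal):
    """Run A* on a 4-connected grid; return (trace tokens, optimal plan)."""
    def h(cell):  # Manhattan distance heuristic
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    trace = []
    frontier = [(h(start), 0, start)]
    parent, cost, closed = {start: None}, {start: 0}, set()
    while frontier:
        _, g, cell = heapq.heappop(frontier)
        if cell in closed:
            continue
        closed.add(cell)
        trace += ["close", str(cell[0]), str(cell[1]), f"c{g}", f"h{h(cell)}"]
        if cell == goal:
            break
        for dr, dc in ((0, 1), (1, 0), (0, -1), (-1, 0)):
            nxt = (cell[0] + dr, cell[1] + dc)
            inside = 0 <= nxt[0] < len(grid) and 0 <= nxt[1] < len(grid[0])
            if not inside or grid[nxt[0]][nxt[1]] == 1:
                continue  # outside the grid or a wall
            if nxt not in cost or g + 1 < cost[nxt]:
                cost[nxt], parent[nxt] = g + 1, cell
                heapq.heappush(frontier, (g + 1 + h(nxt), g + 1, nxt))
                trace += ["create", str(nxt[0]), str(nxt[1]), f"c{g + 1}", f"h{h(nxt)}"]
    plan = []
    cell = goal if goal in closed else None
    while cell is not None:  # walk parents back from the goal
        plan.append(cell)
        cell = parent[cell]
    return trace, plan[::-1]

grid = [[0, 0, 0],
        [1, 1, 0],
        [0, 0, 0]]  # 1 marks a wall
trace, plan = astar_trace(grid, (0, 0), (2, 0))

# Training sequence: task description, search dynamics, then the final plan.
sequence = (["start", "0", "0", "goal", "2", "0"] + trace
            + ["plan"] + [f"{r},{c}" for r, c in plan])
```

A model trained on such sequences has to predict the search dynamics before the plan; the second step then favours shorter traces that still end in an optimal plan.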

User-LLM: Efficient LLM Contextualization with User Embeddings

  • User-LLM: generates user embeddings from user data with multi-feature autoregressive transformer and then fine-tunes the LLM using these embeddings with cross-attention.
  • The method enables inserting the LLM with long-term user history through compressed user embeddings and short term user context through input prompt.
  • Effective approach for LLM personalization and user modelling. Includes good chapter on LLM long context research.

∞Bench: Extending Long Context Evaluation Beyond 100K Tokens

  • Coins a prompting technique called "context recalling", which improves code debug accuracy from +16% (using CoT) to +40% (using context recalling).
  • Context recalling prompts the model to first recall the relevant information, before doing further reasoning.
  • Introduces long context benchmark: ∞BENCH-benchmark for LLMs with above 100k context window.
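
A context recalling prompt could be assembled roughly as follows. The wording of the instructions and the `context_recalling_prompt` helper are assumptions for illustration, not the benchmark's actual prompt.

```python
def context_recalling_prompt(context: str, question: str) -> str:
    """Ask the model to quote the relevant context before it reasons."""
    return (
        f"{context}\n\n"
        f"Question: {question}\n"
        "First, recall and quote the parts of the context that are relevant "
        "to the question.\n"
        "Then, reason step by step using only the recalled parts and give "
        "the final answer."
    )

prompt = context_recalling_prompt(
    "def add(a, b):\n    return a - b  # bug",
    "Which function contains a bug?",
)
```

The ordering matters: the recall instruction comes before the reasoning instruction, so the relevant passage is restated close to where the answer is generated.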

Neeko: Leveraging Dynamic LoRA for Efficient Multi-Character Role-Playing Agent

  • Neeko-agent: Multi-character roleplaying agent with LoRA.
  • Includes Pretraining, Multi-character Role-Playing and Incremental Role-Playing with Fusion and Expansion stages.

20th of February 2024

MuLan: Multimodal-LLM Agent for Progressive Multi-Object Diffusion

  • MuLan: Multimodal LLM agent, addresses text2image generation errors through progressive multiobject generation with LLM-based planning and VLM-based feedback control.
  • MuLan is training free method.

Large Language Model-based Human-Agent Collaboration for Complex Task Solving

  • ReHAC: Human-agent (LLM) collaboration with RL policy model.

19th of February 2024

AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling

  • AnyGPT: Any-to-Any Multimodal Language Model with any input output between text, speech, image and music.
  • Uses only data preprocessing with modality specific tokenizers to tokenize input into discrete tokens and model outputs by de-tokenizing into specific modality outputs.
  • Introduces multimodal alignment dataset made of conversations.

Shall We Talk: Exploring Spontaneous Collaborations of Competing LLM Agents

  • Studies spontaneous collaboration between competing LLM agents.

WorldCoder, a Model-Based LLM Agent: Building World Models by Writing Code and Interacting with the Environment

  • WorldCoder: LLM agent learns World Models (world_model.py) using Python program from interactions with its environment.
  • Outperforms baselines from DeepRL- and ReAct-agents in gridworlds-environment.
  • Includes sample code of the world_model.py.

Comprehensive Cognitive LLM Agent for Smartphone GUI Automation

  • CoCo-Agent: GUI control with VLM/LLM/CLIP, which includes Comprehensive Environment Perception (CEP) and Conditional Action Prediction (CAP). Includes information such as GUI screenshot, GUI layout information, user objective and action history.
  • Offers SOTA-level performance on GUIs, yet high training cost.

LLM Agents for Psychology: A Study on Gamified Assessments

  • PsychoGAT: Gamification of psychological assessment traditionally performed with questionnaires, with superior performance. Includes prompt templates.

Structured Chain-of-Thought Prompting for Few-Shot Generation of Content-Grounded QA Conversations

  • Structured CoT (SCoT): breaks the task down into states, generating actions for each sub-task during the specific state.
  • For example, the first state determines if the question is answerable, the next state identifies required steps for the answer and the next state generates the step answer.

18th of February 2024

LongAgent: Scaling Language Models to 128k Context through Multi-Agent Collaboration

  • LongAgent: Scales LLaMA to 128k context window outperforming GPT-4 through multiagent collaboration using inter-member communication.
  • Leader agent selects agent members of the team based on task description; the agent team collaboratively reasons, deduces the answer and finally resolves conflicts to generate the final answer.

Learning From Failure: Integrating Negative Examples when Fine-tuning Large Language Models as Agents

  • Fine-tuning LLMs with Negative examples enhances performance.

Modelling Political Coalition Negotiations Using LLM-based Agents

  • Political coalition negotiation with LLM agents.

17th of February 2024

LLM can Achieve Self-Regulation via Hyperparameter Aware Generation

  • Hyperparameter Aware Generation (HAG): the LLM learns to modify automatically its hyperparameters (temperature, top_p, top_k, repetition_penalty) for each user task input.
  • Self-regulation of hyperparameters enables the LLM to finetune its responses to different task inputs.
  • Self-regulation takes inspiration from the ability of the human body to regulate itself based on different factors like temperature, blood pressure, adrenaline etc.

16th of February 2024

Robust agents learn causal world models

  • Implies causal understanding is required for robust generalization.
  • Causal models can be learned from adaptive agents.

15th of February 2024

Chain-of-Thought Reasoning Without Prompting

  • CoT-Decoding: CoT without prompting. LLMs inherently possess reasoning abilities.
  • Uses top-k alternative tokens to uncover CoT paths, which are frequently present among these alternative decoding paths.
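
Once several paths have been decoded from the top-k alternative first tokens, one of them has to be selected. The sketch below assumes paths are ranked by the average gap between the top-1 and top-2 token probabilities over the answer tokens; the bullets above do not state the selection rule, so this rule, the helper names and all numbers are illustrative assumptions.

```python
def answer_confidence(answer_token_probs):
    """Mean gap between top-1 and top-2 token probability over the answer tokens."""
    gaps = [top1 - top2 for top1, top2 in answer_token_probs]
    return sum(gaps) / len(gaps)

def pick_path(paths):
    """Choose the decoded path whose answer tokens are most confident."""
    return max(paths, key=lambda p: answer_confidence(p["answer_token_probs"]))

# Hypothetical decoded paths, one per alternative first token (k = 3).
paths = [
    {"first_token": "5", "text": "5 apples",
     "answer_token_probs": [(0.40, 0.35)]},
    {"first_token": "I", "text": "I have 3 apples, my dad has 2 more, so he has 5. Together 8.",
     "answer_token_probs": [(0.95, 0.02)]},
    {"first_token": "We", "text": "We have 5 apples",
     "answer_token_probs": [(0.50, 0.30)]},
]
best = pick_path(paths)
```

In this toy data the path that spells out its reasoning is also the one with the most confident answer, which is the pattern the method relies on.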

A Human-Inspired Reading Agent with Gist Memory of Very Long Contexts

  • ReadAgent: very long context management through gist-memories and pagination for web browsing.
  • ReadAgent: LLM decides what content to store as episode pagination, LLM compresses page memory as shorter gist memory (see fuzzy-trace theory about memory) and LLM decides the pages to look up per given task and the gist memories related to the context of the task. The agent then retrieves the related page information to complete the task.
  • Extends effective context window by 3-20x and keeps failure rate close to 0%, which is significantly less than traversing tree with a MemWalker-like solution.
  • Gist-memory improves Web navigation over using raw html inputs, which is by nature a very long context task.
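
The paginate / gist / look-up flow can be sketched roughly as below. Every function is a hypothetical stand-in for an LLM call: pages are cut at a fixed word count, a gist is just the first words of a page, and look-up ranks gists by keyword overlap with the question.

```python
from typing import List


def paginate(text: str, words_per_page: int = 6) -> List[str]:
    """Stand-in for LLM-chosen page breaks: cut every `words_per_page` words."""
    words = text.split()
    return [" ".join(words[i:i + words_per_page])
            for i in range(0, len(words), words_per_page)]


def gist(page: str, max_words: int = 2) -> str:
    """Stand-in for LLM compression: keep only the first few words."""
    return " ".join(page.split()[:max_words])


def look_up(question: str, gists: List[str], top_n: int = 1) -> List[int]:
    """Stand-in for LLM page selection: rank gists by word overlap with the question."""
    q = set(question.lower().split())
    scores = [len(q & set(g.lower().split())) for g in gists]
    return sorted(range(len(gists)), key=lambda i: scores[i], reverse=True)[:top_n]


text = "harbor ferry leaves at nine sharp radio forecast warns of heavy rain"
pages = paginate(text)
gists = [gist(p) for p in pages]
selected = look_up("what does the forecast say", gists)
# Only the selected raw pages are expanded back into the working context.
context = [pages[i] for i in selected]
print(gists, context)
```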

AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator

  • AI Hospital: LLM acts with doctor, patient, examiner and physician-roles. Categorises medical information into: subjective, objective and Diagnosis/Treatment.
  • MVME-benchmark (Multi-View Medical Evaluation): evaluates LLMs in symptom collection, recommendation analysis and diagnosis.

14th of February 2024

AgentLens: Visual Analysis for Agent Behaviors in LLM-based Autonomous Systems

  • AgentLens: visual analysis of LLM-based autonomous agents and exploration of their behaviours.
  • UI includes Outline view, Agent view and Monitor view. Summarizes raw events, descriptions of generated behaviours, behaviour embeddings and timeline segmentation.
  • The behavioural embeddings enable plotting specific behaviours over time, which is a very effective approach.

Towards better Human-Agent Alignment: Assessing Task Utility in LLM-Powered Applications

  • AgentEval: framework to verify utility of the LLM tool through automatic criteria creation for a given task to review meeting of user needs.
  • Includes CriticAgent to list criteria of accepted values and QuantifierAgent verifying suggested criteria.

DoRA: Weight-Decomposed Low-Rank Adaptation

  • Next generation LoRA. Get more out from your LLM, while not directly related to agents.

13th of February 2024

GLoRe: When, Where, and How to Improve LLM Reasoning via Global and Local Refinements

  • GLoRe: Presents Stepwise Outcome-based Reward Models (SORMs). In contrast to Outcome-Based Reward Models (ORMs) and Process-Based Reward Models (PRMs), SORMs are trained only on synthetic data to approximate the future reward of the optimal policy V*.
  • Uses three step refinement training process: 1. Fine-tune base model for Student policy model, 2. SORM training, 3. Refinement training.

Grounding LLMs For Robot Task Planning Using Closed-loop State Feedback

  • Brain-Body LLM (BB-LLM): Brain-LLM defines high-level plans for the robot. The Body-LLM converts them into low-level planned actions as robot commands.

Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast

  • Agent Smith: "Infectious Jailbreaking" technique, which infects a single LLM agent, which then infects the remaining agents at an exponential growth rate.
  • Concerning technique reminiscent of a traditional computer virus, because the computational/time/resource expenses of infecting a single agent remain low, but it includes the capability of infecting the rest of the agents.

Simulating Human Strategic Behavior: Comparing Single and Multi-agent LLMs

  • Investigation on LLMs capability to simulate human strategic behaviour.
  • Compares Multiagent vs. Single LLM agent performance in the Ultimatum game and finds multiagent system more accurately simulating human behaviour.

Large Language Models as Minecraft Agents

  • Develops Minecraft Builder and Architect LLM agents using JSON-format with capacity to ask clarifying questions from the LLM.

PRompt Optimization in Multi-Step Tasks (PROMST): Integrating Human Feedback and Preference Alignment

  • PROMST: Optimizes prompts. Includes TaskLLM and PromptLLM. PromptLLM generates new prompt suggestions from existing best prompts and their feedbacks. New candidates are selected by score prediction model.

12th of February 2024

T-RAG: Lessons from the LLM Trenches


OS-Copilot: Towards Generalist Computer Agents with Self-Improvement

  • FRIDAY: Self-improving embodied agent to interact with OS.
  • OS-Copilot framework: Planner, Configurator to update or retrieve (Declarative memory for user profile and Semantic knowledge/Procedural memory for tools), Actor (Executor / Critic).
  • Learns to control and self-improve.

Predictive representations: building blocks of intelligence

  • Successor Representation (SR) may function as versatile building blocks of intelligence.

Secret Collusion Among Generative AI Agents

  • Model capability evaluation framework on Secret collusion.

THE COLOSSEUM: A Benchmark for Evaluating Generalization for Robotic Manipulation

  • THE COLOSSEUM benchmark for robot manipulation generalization through 20 diverse tasks.

11th of February 2024

Self-Correcting Self-Consuming Loops for Generative Model Training

  • Self-Correcting Functions using expert knowledge for generative model training.

9th of February 2024


V-STaR: Training Verifiers for Self-Taught Reasoners

  • V-STaR: Enhancement to the STaR-method. During self-improvement it uses not only the correct but also the incorrect generated solutions to train a verifier with DPO, which judges the correctness of model-generated solutions.
  • Iterating V-STaR multiple rounds generates progressively better reasoners and stronger verifiers by increasing GSM8K performance significantly from base STaR-method.
  • Addresses the aspect of data efficiency by being able to improve both from correct and incorrect solutions.
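
One way to picture how both correct and incorrect solutions become verifier training data is to pair them up as preference examples. The sketch below is an assumption-laden illustration: the `prompt`/`chosen`/`rejected` field names and the `is_correct` checker are hypothetical, and the DPO training step itself is not shown.

```python
from itertools import product
from typing import Callable, Dict, List


def build_verifier_pairs(problem: str, solutions: List[str],
                         is_correct: Callable[[str], bool]) -> List[Dict[str, str]]:
    """Pair every correct sampled solution with every incorrect one."""
    correct = [s for s in solutions if is_correct(s)]
    wrong = [s for s in solutions if not is_correct(s)]
    return [{"prompt": problem, "chosen": c, "rejected": w}
            for c, w in product(correct, wrong)]


# Hypothetical sampled solutions for one problem; the checker compares the final answer.
samples = ["2 + 2 = 4", "2 + 2 = 5", "2 + 2 = 22"]
pairs = build_verifier_pairs("What is 2 + 2?", samples, lambda s: s.endswith("= 4"))
print(len(pairs), pairs[0])
```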

Alphazero-like Tree-Search can Guide Large Language Model Decoding and Training

  • TS-LLM: a tree search guided LLM decoding with learned value function applicable for reasoning tasks.

Feedback Loops With Language Models Drive In-Context Reward Hacking

  • LLMs interacting with the real world create feedback loops, where LLM outputs shape the world state, on which the next LLMs are trained.
  • Such feedback loops can cause In-Context Reward Hacking (ICRH): LLM outputs increase BOTH the objective and the negative side-effects.
  • Output-refinement and policy refinement lead to ICRH.

Understanding the Weakness of Large Language Model Agents within a Complex Android Environment

  • AndroidArena benchmark for measuring LLMs capability to control a modern operating system.
  • Main failure modes: understanding, reasoning, exploration, and reflection.

Large Language Models: A Survey

  • Reviews past years LLM research: LLM model families, building of LLMs, using of LLMs, LLM datasets, LLM metrics and future directions and challenges.
  • Includes deployment pipelines, vector databases, prompting pipelines and LLM training/inference frameworks.

Why Solving Multi-agent Path Finding with Large Language Model has not Succeeded Yet

  • Identifies three reasons on why multi-agent path finding with LLMs does not work: model limitation, lack of understanding and lack of reasoning.

8th of February 2024

An Interactive Agent Foundation Model

  • Interactive Agent Foundational Model: A generalist agent. Multi-task, Multi-domain: Healthcare, Gaming AI and Robotics.
  • Interactive Agent framework: action encoder, visual encoder and language encoder. Pretrained to predict masked unified tokens for the three modalities: text token, visual token and action/agent token from each separate token per input type. Effectively generalizes between domains.
  • Defines term "Agent-based AI" as generating dynamic behaviours grounded on the context understanding of uncertain environment. Defines "Embodied Agent-paradigm principles": Perception, Planning and Interaction. Agent actions impact directly task plans by not requiring environment feedback to plan next action.
  • Multimodal systems pretrained cross-modally and grounded in the environment hallucinate less, because they are grounded in the physical/virtual environment, and require smaller model size than models pretrained separately/without grounding.

UFO: A UI-Focused Agent for Windows OS Interaction

  • UI-Focused (UFO) agent: Automatically controlling Windows OS. The system includes two VLM-based agents: AppAgent (Application Selection Agent) and ActAgent (Action Selection Agent).
  • AppAgent uses User input, Desktop screenshot, App information, Examples and Memory. It chooses the application to complete the task and generates a global plan. AppAgent outputs Observation, Thoughts, Selected App, Status, Global plan and Comment.
  • ActAgent takes as input User request, Screenshots (highlighted last action, clean, annotated), Control information, Examples and Memory. ActAgent pursues local plans and actions until meeting the goal / receives observations from apps / interacts with memory. Outputs observation, Thoughts, Labeled control operation, Function, Status, Local plan and Comment.
  • Control Interaction module grounds actions.

Real-World Robot Applications of Foundation Models: A Review

  • A literature review of Robotics Foundation models.
  • Reviews Input/Output relationships of models, perception, motion planning and control.

TimeArena: Shaping Efficient Multitasking Language Agents in a Time-Aware Simulation

  • TimeArena: A textual simulation environment for LLM agents to complete tasks as soon as possible.
  • 30 real world like tasks from household activities to laboratory work. Illustrates, that GPT-4 lacks temporal awareness such as failing to recognize opportunities in parallel processing.

ScreenAgent: A Vision Language Model-driven Computer Control Agent

  • VLM to control a real computer screen/GUI.
  • Includes Planning, Acting and Reflecting phases.

In-Context Principle Learning from Mistakes

  • Learning Principles (LEAP): Intentionally guides the LLM to make mistakes on a few examples, reflect on them and learn task-specific principles.
  • Improves MATH reasoning capability.

Keyframer: Empowering Animation Design using Large Language Models

  • Keyframer: LLM-powered animation generator from SVG images.

Discovering Temporally-Aware Reinforcement Learning Algorithms

  • Reviews Temporally-aware reinforcement learning and Meta-learning.

WebLINX: Real-World Website Navigation with Multi-Turn Dialogue

  • WebLINX: Real-time webpage control with LLMs.
  • Filters relevant web page elements

How Well Can LLMs Negotiate? NegotiationArena Platform and Analysis

  • NegotiationArena benchmark: measures LLMs' ability to negotiate.

Decision Theory-Guided Deep Reinforcement Learning for Fast Learning

  • Decision Theory-guided Deep Reinforcement Learning (DT-guided DRL): addresses cold start problem in RL.
  • Promotes more structural and informed exploration strategy.

7th of February 2024

The Future of Cognitive Strategy-enhanced Persuasive Dialogue Agents: New Perspectives and Trends

  • CogAgent: Persuasion LLM agent framework.
  • Cognitive strategy mining, Cognitive Strategy Prediction for Dialogue Modelling and Application scenarios (bargaining, counselling, debating etc.)

Can Large Language Model Agents Simulate Human Trust Behaviors?

  • Reviews LLM agents ability to simulate Trust.

ScreenAI: A Vision-Language Model for UI and Infographics Understanding

  • ScreenAI: a VLM. Screen user interfaces (UIs) understanding, dataset creation with LLMs.

6th of February 2024

Self-Discover: Large Language Models Self-Compose Reasoning Structures

  • Self-Discover: Self-discovers complex reasoning structures outperforming CoT-Self-Consistency in MATH, while being more compute efficient.
  • Select reasoning modules (for example CoT, etc.), Adapt reasoning modules and Implement reasoning structures as key-value pairs in JSON.
  • Works with multiple LLMs and different types of reasoning scenarios.

AnyTool: Self-Reflective, Hierarchical Agents for Large-Scale API Calls

  • AnyTool: LLM agent utilizing over 16k APIs.
  • API retriever with hierarchical structure with meta-agent, user query solver using candidate APIs and self-reflection mechanism for initial impractical solutions. Uses GPT-4 with function calling.
  • Introduces AnyToolBench-benchmark.
  • Meta-agent is linked with multiple category agents each managing collection of tool agents.

Can Generative Agents Predict Emotion?

  • Reviews LLM agents' capability to align with humans in terms of emotional states, when new events take place.
  • LLM agent framework, where time series text memories are stored in a graph database and summarized. As new events take place, the norm of the past episodic memories is combined with the current context. The LLM agent's emotional state is measured using the pre-existing Positive And Negative Affect Schedule (PANAS)-framework to arrive at a PANAS score of the current emotional state. Finally, the new memory is added to the graph database.
  • The LLM agent acts in a virtual town with multiple agents interacting, for example inviting others to and attending a party. Performance is reviewed using the pre-existing EmotionBench-benchmark. LLM agents lack to some extent the ability to align emotionally like humans.
  • Raises interesting concern, that GPT-3.5 may be biased to provide positive answers and therefore struggle to illustrate negative emotions.

S-Agents: self-organizing agents in open-ended environment

  • S-Agents: Tree-of-Agents, where the leader LLM agent leads a tree-like structure with executor agents.
  • Hourglass agent framework: Monitor progress and Hierarchical planning.
  • Monitor progress: starts with previous plan and perception used to monitor progress against objective.
  • Hierarchical planning: plans long-term (task planner), takes current task and generates actions (action planner) in the environment and agents.

Large Language Models as an Indirect Reasoner: Contrapositive and Contradiction for Automated Reasoning

  • Indirect Reasoning (IR): Uses logic of contrapositives and contradictions for factual reasoning and math proofs.
  • Adding IR to factual reasoning increases overall accuracy compared to Direct Reasoning (DR) only or IR only.

MobileVLM V2: Faster and Stronger Baseline for Vision Language Model

  • Vision Language Model: MobileVLM V2.

QuantAgent: Seeking Holy Grail in Trading by Self-Improving Large Language Model

  • QuantAgent: Includes two LLM agents: Writer and Judge. The Writer-agent retrieves Knowledge Base (KB) and then generates answer based on the KB and submits the answer to real environment for evaluation. The Judge-agent retrieves relevant KB related to the review and it then generates score and feedback used in the next iteration.
  • The iteration continues until maximum number of steps is reached or the score is high enough.
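
The Writer/Judge iteration with its two stopping conditions can be sketched as a simple loop. The `toy_writer` and `toy_judge` functions, the list-based knowledge base and the threshold value are all hypothetical placeholders for the LLM agents, the KB retrieval and the real-environment evaluation described above.

```python
from typing import Callable, List, Tuple


def writer_judge_loop(writer: Callable[[List[str]], str],
                      judge: Callable[[str, List[str]], Tuple[float, str]],
                      max_steps: int = 5,
                      target_score: float = 0.75) -> List[Tuple[str, float]]:
    """Iterate until the score is high enough or the step budget is exhausted."""
    knowledge_base: List[str] = []
    history: List[Tuple[str, float]] = []
    for _ in range(max_steps):
        answer = writer(knowledge_base)                  # Writer reads the KB
        score, feedback = judge(answer, knowledge_base)  # Judge scores and comments
        knowledge_base.append(feedback)                  # feedback feeds the next round
        history.append((answer, score))
        if score >= target_score:
            break
    return history


def toy_writer(kb: List[str]) -> str:
    """Hypothetical writer: each round of feedback yields a new draft."""
    return f"draft-{len(kb) + 1}"


def toy_judge(answer: str, kb: List[str]) -> Tuple[float, str]:
    """Hypothetical judge: later drafts score higher."""
    score = min(1.0, int(answer.split("-")[1]) / 4)
    return score, f"{answer} scored {score}"


history = writer_judge_loop(toy_writer, toy_judge)
print(history)
```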

Beyond Lines and Circles: Unveiling the Geometric Reasoning Gap in Large Language Models

  • Improves LLMs geometric reasoning with self-correction, collaboration and role specialization using geometric tools and four LLM agents.
  • Uses LLM agents with four roles: Natural language solver and validator, Geometric tool Solver and Validator.

In-context learning agents are asymmetric belief updaters

  • In-context learning: framing of the problem significantly impacts success.
  • LLMs learn better from better-than-expected outcomes rather than worse-than-expected outcomes.

Systematic Biases in LLM Simulations of Debates

  • Reviews LLMs capability to generate believable simulation and current LLMs include a simulation bias for political debate.
  • Self-fine tunes LLM to take a specific political stance by using politically-oriented question to reflect answers, which is more effective than prompt-profiling alone.
  • Illustrates the difficulty for LLMs to simulate specific human behaviour like political views.

Prioritizing Safeguarding Over Autonomy: Risks of LLM Agents for Science

  • Takes safety research from LLM safety to LLM agent safety, which is more holistic view.
  • Scientific agent: Reviews LLM agent vulnerabilities within science domain: Data Insufficiency, Planning limitation, Tool limitations, LLM limitations and Lack of measurement.
  • Introduces triangle framework: Human regulation (Intent), Agent alignment (Red teaming) and Agent regulation (environmental feedback).

5th of February 2024

Understanding the planning of LLM agents: A survey

  • LLM-Agent planning: provides a systematic view of LLM-based agents planning, covering recent works aiming to improve planning ability.
  • It categorizes existing works into Task Decomposition, Plan Selection, External Module, Reflection and Memory, and provides comprehensive analysis for each direction.
  • This survey is the first work that comprehensively analyzes LLM-based agents from the planning abilities.

Chain-of-Feedback: Mitigating the Effects of Inconsistency in Responses

  • Recursive Chain-of-Feedback (R-CoF): Recursively breaks down complex reasoning problems into easier and more detailed solutions and re-adjusts original reasoning based on the detailed correct reasoning.
  • Given a problem, asks LLM to generate answer using multiple reasoning steps, then LLM verifies the incorrect reasoning steps, LLM then recursively asks only to solve the incorrect reasoning steps using same approach. If the new answer is correct, it gets added to the higher level answer and otherwise repeats the recursive LLM call.

Vision-Language Models Provide Promptable Representations for Reinforcement Learning

  • Promptable Representations for Reinforcement Learning (PR2L): the model asks the VLM about the game task, for example whether a spider is visible. The VLM responds with semantic features or knowledge, which help the system advance in the game by connecting what is seen with what it needs to do. This ensures that the system's actions are grounded in what is actually happening in the game.
  • Initializes RL policy using VLM representation.
  • PR2L was not trained only to play Minecraft, but it still plays at a level close to models specifically trained on Minecraft.

Guiding Language Model Math Reasoning with Planning Tokens

  • Planning tokens improve LLM reasoning capabilities.
  • Adds planning tokens to the CoT-based answer generated by the LLM at the beginning of each reasoning step, for example a planning token for multiplication when that reasoning step multiplies.

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

  • DeepSeekMath: 7B model comparable with math reasoning of a 70B model, close to Gemini Ultra and GPT-4.
  • Introduces Group Relative Policy Optimization (GRPO).
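
As a rough illustration of the "group relative" part of GRPO: several outputs are sampled for the same prompt, and each output's reward is normalized against the mean and standard deviation of its group instead of a learned value baseline. The sketch below covers only this advantage computation; the use of the population standard deviation and the `eps` term are assumptions, and the clipped policy update and KL term are not shown.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward by the mean and std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one math problem: two correct (reward 1), two wrong (reward 0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in advantages])
```

Outputs that beat their group average get a positive advantage and the others a negative one; a group with identical rewards yields zero advantage everywhere.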

LLM Agents in Interaction: Measuring Personality Consistency and Linguistic Alignment in Interacting Populations of Large Language Models

  • Studies LLM agents capability to follow human personality profiles: analytical vs. creative personality.
  • Each profile demonstrates different levels of consistency towards its profile in writing style and in a personality test.

Graph-enhanced Large Language Models in Asynchronous Plan Reasoning

  • Plan Like a Graph (PLaG): asynchronous plan reasoning with LLM: generates time estimations, identify step dependencies, converts the time estimates and dependencies into a graph processor and finally generate answer.
  • Creates AsyncHow-benchmark: for asynchronous plan reasoning, requiring ability to correctly add time, correctly comparing time durations and ability to solve constrained reasoning.
  • LLMs struggle to efficiently complete complex asynchronous plans without a detailed illustration of how to solve the task.
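
The graph step can be made concrete with a small example: given per-step time estimates and step dependencies, the shortest possible total time of an asynchronous plan is the longest dependency chain. This sketch assumes an acyclic graph and unlimited parallelism; the step names and durations are made up for illustration.

```python
from functools import lru_cache
from typing import Dict, List


def min_total_time(durations: Dict[str, float], deps: Dict[str, List[str]]) -> float:
    """Earliest finish time of the whole plan when independent steps run in parallel."""

    @lru_cache(maxsize=None)
    def finish(step: str) -> float:
        # A step starts once all of its prerequisites have finished.
        start = max((finish(d) for d in deps.get(step, [])), default=0)
        return start + durations[step]

    return max(finish(step) for step in durations)


durations = {"boil water": 10, "chop vegetables": 5, "cook soup": 20}
deps = {"cook soup": ["boil water", "chop vegetables"]}

parallel = min_total_time(durations, deps)
sequential = sum(durations.values())
print(parallel, sequential)
```

Boiling and chopping overlap, so the plan takes 30 time units instead of the 35 needed when every step is done one after another.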

C-RAG: Certified Generation Risks for Retrieval-Augmented Language Models


4th of February 2024

Understanding the planning of LLM agents: A survey

  • Review studies about the LLM agents planning capabilities.
  • Categorizes these planning capabilities into: Task decomposition, Plan selection, External module, Reflection and Memory.
  • Identifies development areas in: evaluating efficiency of the planning, revisiting of planning strategies in multimodality and more realistic evaluations.

Solution-oriented Agent-based Models Generation with Verifier-assisted Iterative In-context Learning

  • SAGE: Modelling and Solving stages with Automatic Design and Generation of ABM.

LLM-Enhanced Data Management

  • LLMDB: Detailed data management framework with LLMs.
  • Components include: Preparation, Request pre-processing, Request parsing, Pipeline executor agent, Vector database and Data/Model management.

Collaborative Agents for Software Engineering

  • CodeAgent: Autonomous Agent, a multi agent code review system.
  • SOTA in code review systems.

3rd of February 2024

More Agents Is All You Need

  • Scaling up LLM-agents increases performance with sampling & majority voting.
  • Performance improvements increase and then decrease as the difficulty level gets harder. Improvements increase as a function of the number of steps. Prior probability of correct answer increases performance gains.
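
Sampling and majority voting can be sketched in a few lines. The `sample` callable stands in for one LLM-agent query, and the fixed answer list is a hypothetical replacement for real model samples.

```python
from collections import Counter
from typing import Callable


def majority_vote(sample: Callable[[], str], n_agents: int) -> str:
    """Query `n_agents` independent samples and return the most frequent answer."""
    answers = [sample() for _ in range(n_agents)]
    return Counter(answers).most_common(1)[0][0]


# Hypothetical agent outputs: individual samples disagree, the majority is "42".
fixed_answers = iter(["42", "41", "42", "42", "40"])
winner = majority_vote(lambda: next(fixed_answers), n_agents=5)
print(winner)
```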

Affordable Generative Agents

  • Affordable Generative Agents (AGA) framework: agent environment interaction and inter-agent interactions.
  • Believable, low cost LLM-agents by replacing repetitive LLM inferences with learned policies. Models social relationships between LLM-agents and compresses auxiliary dialogue information.
  • Emergent believable behaviour: LLM-agents generate finite behaviours in limited environments. Defines "mind wandering"-technique in memory to generate diverse social behaviour by sampling both highly relevant events and randomly sampled unrelated events. The idea is to add randomness and spontaneous responses, like a real person.
  • Social memory: relationship, feeling, events summary between the agents.

2nd of February 2024

K-Level Reasoning with Large Language Models

  • K-level of Reasoning: Recursive reasoning process, which improves dynamic reasoning by integrating cognitive hierarchy theory by recursively predicting and responding to the thoughts and actions of rivals.
  • In essence, multiple LLM agents take a context, reason on it and make a decision at the "k-1"-level. The reasoning is then repeated at the "k"-level by integrating the analysis from the "k-1"-level to arrive at a decision at the "k"-level.
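
The recursion can be illustrated with a hypothetical number-guessing game in which the winning guess is a fixed ratio of the rivals' average guess. A level-0 player guesses a default value; a level-k player first predicts what level-(k-1) rivals would do and then responds to that prediction. The game, the `ratio` and the `level0_guess` are assumptions for illustration, and no LLM is involved.

```python
def level_k_guess(k: int, ratio: float = 0.8, level0_guess: float = 50.0) -> float:
    """Best response of a level-k player who assumes all rivals reason at level k-1."""
    if k == 0:
        return level0_guess                 # no model of the rivals at all
    predicted_rivals = level_k_guess(k - 1, ratio, level0_guess)
    return ratio * predicted_rivals         # respond to the predicted rival guess


guesses = [level_k_guess(k) for k in range(4)]
print([round(g, 2) for g in guesses])
```

Each additional level of reasoning about rivals moves the guess further towards the equilibrium of the game.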

1st of February 2024

Multimodal Embodied Interactive Agent for Cafe Scene

  • MEIA (Multimodal Embodied Interactive Agent): Uses Multimodal Environment Memory (MEM) with LLM and VLM, to store egocentric environmental information (object IDs/coordinates as textual memory and visual observations as image memories) to improve significantly task planning and execution.
  • MEIA is able to perform various tasks such as seating guidance, order taking and environmental adjustments being robust in zero-shot learning for real world tasks.
  • It appears to be the first paper to introduce multimodal memory, which improves significantly performance and increases precision of the planning.
  • Includes two measurement metrics: ESR (Executable Success Rate) and SSL (Success Rate Weighted by Step Length) with formulas included.
  • Uses RGB images (stored in image memory)/depth images/segmentation images.

Efficient Exploration for LLMs

  • Active exploration is used to achieve high performance with less feedback.
  • Uses double Thompson sampling with epistemic neural networks (ENNs) to model reward uncertainty with the least amount of queries.
  • Gemini Nano is used as baseline model, whose output is compared with Best-of-N responses from Gemini Nano based on reward model.

Hello OLMo: A truly open LLM

  • OLMo: First open access data, open weights, open source code LLM.
  • The model training data comes with the need to agree to AI2's license terms with very clearly stated legal implications.

Formal-LLM: Integrating Formal Language and Natural Language for Controllable LLM-based Agents

  • Formal-LLM: Context-Free Grammar (CFG) translates guidance and rules for each relevant task, which LLM text generation must follow when generating the plan.
  • Prevents generating invalid plans.

30th of January 2024

StrokeNUWA: Tokenizing Strokes for Vector Graphic Synthesis

  • StrokeNUWA: Introduces image representations based on vector graphics using "stroke tokens". The approach does not require using raster/pixel representation.
  • Includes components of: Vector-Quantized-Stroke (VQ-Stroke), Scalable Vector Graphics (SVG) compression, Encoder-Decoder LLM for SVG generation and post-processing SVG fixer.
  • Enables 94 times faster inference speed and representing images as more "language like" manner of sequences of strokes.

Efficient Tool Use with Chain-of-Abstraction Reasoning

  • Chain-of-Abstraction (CoA): trains LLMs with decoded reasoning chains using abstract placeholders and then call tools to complete the reasoning chain.
  • CoA learns more generic math reasoning and
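
The placeholder idea can be sketched as follows: the model writes a reasoning chain with abstract variables, and a tool fills in the concrete values afterwards. The bracket syntax `[a op b = yN]` and the tiny calculator are assumptions made for this example, not the paper's exact format.

```python
import operator
import re
from typing import Dict

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}


def fill_chain(chain: str) -> str:
    """Replace each '[a op b = yN]' placeholder with the computed result, left to right."""
    values: Dict[str, str] = {}

    def compute(match: re.Match) -> str:
        expr, name = match.group(1), match.group(2)
        for var, val in values.items():     # substitute earlier results (naive replace)
            expr = expr.replace(var, val)
        left, op, right = expr.split()
        values[name] = f"{OPS[op](float(left), float(right)):g}"
        return f"{expr} = {values[name]}"

    return re.sub(r"\[(.+?) = (y\d+)\]", compute, chain)


abstract_chain = "Total apples: [20 + 35 = y1]. After eating 12: [y1 - 12 = y2]."
filled = fill_chain(abstract_chain)
print(filled)
```

The abstract chain stays the same for any numbers, so the reasoning pattern is decoupled from the arithmetic that the tool performs.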

Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios

  • UltraTool Construction-framework includes three key steps: Query collection, Solution Annotation and Manual refinement.
  • UltraTool: benchmarking LLM performance in using tools in real world.
  • Reviews tool use performance from planning, tool creation awareness, tool creation, tool usage awareness, tool selection and tool usage.

Can Large Language Models be Trusted for Evaluation? Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate

  • Scale-Eval: Meta-evaluation framework using agents debates to reach consensus or align with human answer in various task scenarios.

LLaMP: Large Language Model Made Powerful for High-fidelity Materials Knowledge Retrieval and Distillation

  • LLaMP: ReAct-agents connected with arXiv, Wikipedia and Materials Project-agents. Includes prompts and JSON-formats used with the RAG-pipeline. Reduces hallucinations in materials science queries.

29th of January 2024

Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception

  • Mobile-Agent: Multimodal Large Language Models (MLLM) for mobile devices, which locates visual/textual, plans, decomposes and executes complex tasks.
  • OS agnostic
  • Introduces Mobile-Eval benchmark and open sources code.

Beyond Direct Diagnosis: LLM-based Multi-Specialist Agent Consultation for Automatic Diagnosis

  • Patient consultation with multiple agents, starting with a general practitioner and then LLM agents in specific specialities: surgeon, respiratory doctor, endocrinologist.
  • Includes three stages: individual practitioner consultation, practitioner group consultation and agent-based group decision fusion.

Divide and Conquer: Language Models can Plan and Self-Correct for Compositional Text-to-Image Generation

  • CompAgent: LLM agent manages the task of the entire image generation.
  • The LLM agent is used to plan composition of objects next to each other. Achieves better images for example when prompted to generate image with a red hat next to blue backpack.

28th of January 2024

YODA: Teacher-Student Progressive Learning for Language Models

  • YODA: Human-like progressive learning paradigm for LLMs, where the student agent learns on a fixed dataset by first learning basic questions, then learning to generalize and finally learning harder problems.
  • Teacher agent asks then similar questions from the student agent. The teacher agent gradually adds more complex and more generic questions after each iteration and offers feedback to the student agent for the answers provided.
  • The approach helps the student agent to learn to solve problems and generalize problems comprehensively, which leads to 10% improvement in MATH benchmark from the original Llama 2.

26th of January 2024

Turn-taking and Backchannel Prediction with Acoustic and Large Language Model Fusion

  • Reviews how voice-assistant systems should predict and manage: turn-taking, backchanneling and continued speaking.
  • Continued speaking refers to the other party needing to continue listening to the current speaker. Backchanneling refers to the current listener needing to produce a short utterance of acceptance without meaning to take over the speaker role. Turn-taking refers to the listener being expected to take over the speaking turn from the current speaker.
  • Creates fusion model combining both LLM (GPT-2/RedPajama) and HuBERT-acoustic model.

24th of January 2024

Hi-Core: Hierarchical Knowledge Transfer for Continual Reinforcement Learning

  • Hi-Core: Formulates goals as a high-level policy using LLM reasoning and then low-level policy learning towards these high-level goals. Policy library is used to store policies searchable with embeddings based on policy description.
  • Makes the important point, that to learn high-level human cognitive skills using transfer learning, we need to represent high-level human knowledge effectively to be able to transfer them into models.

23rd of January 2024

Meta-Prompting: Enhancing Language Models with Task-Agnostic Scaffolding

  • Meta-prompting: LLM coordinate and execute multiple independent queries with their responses to generate final answer.

AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents

  • AutoRT: Fleet of robots use VLM and LLM

HAZARD Challenge: Embodied Decision Making in Dynamically Changing Environments

  • HAZARD-benchmark consists of three dynamic challenges for embodied agents: flood, fire and wind, where performance is evaluated in terms of value, steps and damage.
  • Builds LLM-based pipeline for embodied agents by providing it task description, agent status and target info. Agent reads environment information, includes observation memory and LLM-based decision maker to select the next action.

22nd of January 2024

Memory Matters: The Need to Improve Long-Term Memory in LLM-Agents

  • Reviews memory management of LLM-agents with useful insights about using different types of metadata in a vector db alongside the word embeddings as long-term memory.
  • Identifies in past research example ways of storing: thoughts/skills in vector db, but as well gaps in retrieving information, when different memories may contradict the retrieval.

OK-Robot: What Really Matters in Integrating Open-Knowledge Models for Robotics

  • OK-robot (Open-Knowledge): 59% success rate in open ended picking and dropping task.
  • SOTA level in OVMM-benchmark.

WARM: On the Benefits of Weight Averaged Reward Models

  • WARM: Weight Averaged Reward Models, which average the weights of multiple fine-tuned reward models.
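
Weight averaging, as opposed to averaging predictions, can be sketched with plain dictionaries. This is only an illustration of the averaging step under the assumption that all reward models share one architecture and parameter names; the toy linear "reward models" and their values are hypothetical.

```python
from typing import Dict, List

Weights = Dict[str, float]


def average_weights(models: List[Weights]) -> Weights:
    """Parameter-wise mean over models that share the same parameter names."""
    return {name: sum(m[name] for m in models) / len(models) for name in models[0]}


def reward(weights: Weights, feature: float) -> float:
    """Toy linear reward model: w * feature + b."""
    return weights["w"] * feature + weights["b"]


# Two hypothetical reward models fine-tuned from the same starting point.
rm_a = {"w": 1.0, "b": 0.0}
rm_b = {"w": 3.0, "b": 2.0}
warm = average_weights([rm_a, rm_b])
print(warm, reward(warm, 2.0))
```

The merged model is a single set of weights, so scoring a response costs one forward pass rather than one per ensemble member.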

PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety

  • PsySafe: Safety research on LLM agents based on behavioural/psychological-characteristics.

21st of January 2024

AttentionLego: An Open-Source Building Block For Spatially-Scalable Large Language Model Accelerator With Processing-In-Memory Technology

  • AttentionLego: LLM is implemented on Processing-In Memory (PIM) HW.

The Conversation is the Command: Interacting with Real-World Autonomous Robot Through Natural Language

  • Simplistic robotic control using VLM and LLM: VLM for textual object description and scene comprehension, LLM for reasoning and REM-node to translate commands into robot actions.

19th of January 2024

Tool-LMM: A Large Multi-Modal Model for Tool Agent Learning

  • Tool-LMM: LLM agent is able to process multimodal inputs into APIs of the specific modalities.
  • Input modalities include text, audio/text, text/video and text/image. The LLM text output includes recommendation of the API to be used and model information.

A match made in consistency heaven: when large language models meet evolutionary algorithms

  • Compares and finds multiple similarities between GPT-LLMs and Genetic Algorithm (GA)-evolutionary algorithms.

CivRealm: A Learning and Reasoning Odyssey in Civilization for Decision-Making Agents

  • CivRealm: RL agent generalization benchmark, based on video game environment with various players and dynamic game space, imperfect information and random variability.

18th of January 2024

Self-Rewarding Language Models

  • Self-rewarding LLMs: Ability for LLM to follow instructions and Ability to create/evaluate new training data (Self-Instruction creation).
  • LLM-as-a-Judge: LLM acts as a reward model and self-rewards its own responses.
  • Claims to outperform Claude 2/Gemini Pro/GPT-4 0613 with three iterations and ability to keep continuously improving both self-instructions and the reward signal.

R-Judge: Benchmarking Safety Risk Awareness for LLM Agents

  • R-Judge: Safety benchmark for LLM agents (not LLM models) on 27 risk scenarios.

17th of January 2024

Large Language Models Are Neurosymbolic Reasoners

  • LLM agent plays text-based game with access to Symbolic module.

ReFT: Reasoning with Reinforced Fine-Tuning

  • Reinforced Fine-Tuning (ReFT): In the initial SFT-step, the model is trained to produce correct answers to mathematical problems.
  • In the second step, online RL with PPO is used to prompt multiple CoT responses to learn from them.
  • ReFT uses majority voting and reward model reranking.

Scalable Pre-training of Large Autoregressive Image Models

  • AIM: Introduces visual models, which scale with both compute and data.

What makes for a 'good' social actor? Using respect as a lens to evaluate interactions with language agents

  • LLM agent as a social (automated) actor.
  • Identifies what makes a good vs negative social behaviour for LLM agents.

16th of January 2024

Code Generation with AlphaCodium: From Prompt Engineering to Flow Engineering

  • AlphaCodium: Improves code solutions through AI code tests.
  • Iteratively reasons about code tests and reflects on the problem, generates AI tests to improve testing.
  • Two phases: Preprocessing (to reason new AI tests from ranked solutions from public tests) and Code iteration (with public and AI tests).

MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World

  • MultiPLY: Multisensory (temperature, tactile, audio and visuals) embodied agent acts (action tokens such as navigate/select/touch/observe/look around) in a 3D virtual environment.
  • The model, trained with the Multisensory Universe-dataset, performs multiple tasks: navigation, manipulation, tool use and dialogue.
  • Encodes 3D-scenes as object centric representations, generate action token to be taken from current state token (temperature/tactile/sound/object) within the environment to reach new state observation in time. The new state token is fed back to LLM to drive follow up actions.

DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models

  • DoraemonGPT includes task-related symbolic memory, sub-task/knowledge tools and MCTS planner.
  • The task-related symbolic memory will choose either the Spatial or Time-dimension as most relevant based on the LLM.
  • DoraemonGPT collects information before reasoning, reasons over spatial-temporal video and explores different solutions in a large planning space.

Self-Imagine: Effective Unimodal Reasoning with Multimodal Models using Self-Imagination

  • Self-Imagine: VLM creates HTML code about the text question, renders it as an image and uses the image with the question to answer the question with the VLM.

Application of LLM Agents in Recruitment: A Novel Framework for Resume Screening

  • Automated resume screening, where segments from the CV are classified into information types and personal information is removed.
  • The HR grading LLM agent rates these resumes and another HR decision making agent picks the preferred application with an explanation, which is then available for the HR professional.

Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation

  • Contrastive Preference Optimization (CPO): A potential improvement to DPO, applied in machine translation.

15th of January 2024

Exploring the Potential of Large Language Models in Self-adaptive Systems

  • Literature review of Self-Adaptive Systems with LLMs.

A Study on Training and Developing Large Language Models for Behavior Tree Generation

  • LLMs used to generate Behaviour Trees (BT) for agents/robots.

When Large Language Model Agents Meet 6G Networks: Perception, Grounding, and Alignment

  • Least Age-of-Thought (LAoT) model caching algorithm to manage local/global compute/network traffic to avoid model with least valuable thoughts.

14th of January 2024

CodeAgent: Enhancing Code Generation with Tool-Integrated Agent Systems for Real-World Repo-level Coding Challenges

  • Introduces CodeAgent, an LLM agent able to use tools (search, code navigation and code interpreter) to generate code/create repositories (instructions, code dependencies) better than GitHub Copilot.
  • Introduces CodeAgentBench-dataset.
  • Code symbol navigation is key component, to explore: file/module-based parsing and class/function-symbol navigation.

Small LLMs Are Weak Tool Learners: A Multi-LLM Agent

  • α-UMi: Multi-agent LLM, which includes planner/caller and summarizer and tools.

12th of January 2024

ModaVerse: Efficiently Transforming Modalities with LLMs

  • ModaVerse: Introduces Adaptor+Agent framework for training multi-modal LLM able to process content across audio/video/image modalities.
  • Introduces Input/Output (I/O) Alignment: LLM generates language aligned meta-responses, which are instructions to activate specific generative models.
  • This method is capable of converting variety of modalities, while being very efficient to train.

AntEval: Quantitatively Evaluating Informativeness and Expressiveness of Agent Social Interactions

  • AntEval: a framework to evaluate social interactions of LLM agents with two metrics: Information Exchange Precision and Intention Expressiveness Gap.

Mutual Enhancement of Large Language and Reinforcement Learning Models through Bi-Directional Feedback Mechanisms: A Case Study

  • Investigates bi-directional feedback loop, where LLM agent acts as a teacher, while the RL agent acts as a student.

11th of January 2024

EASYTOOL: Enhancing LLM-based Agents with Concise Tool Instruction

  • EASYTOOL: Creates a cleaned version of any tool/API documentation for LLM agent to use via single "tool instruction".
  • Tool documentation is translated into: tool descriptions and tool core functionality. Each are created using specific LLM instructions.
  • Significantly improves tool-based LLM agent performance.

Designing Heterogeneous LLM Agents for Financial Sentiment Analysis

  • Heterogeneous multi-Agent Discussion (HAD): Multiple agents, each instructed to pay attention to specific error category types, form the resulting answer based on a shared discussion. The domain of the research is Financial Sentiment Analysis.
  • Builds on the conclusion, that LLMs are "resources": similar to Minsky's theory about human mind being built from a Resource-cloud to be activated/deactivated on the spot.
  • Defines Kernel Theory-Based Design: Kernel theory, Meta-requirements, Meta-designs, Testable hypothesis.

Evidence to Generate (E2G): A Single-agent Two-step Prompting for Context Grounded and Retrieval Augmented Reasoning

  • Evidence-to-Generation (E2G): Single LLM produces in two-steps answer step-by-step based on evidence from the context/question provided.
  • E2G represents context-aware reasoning.

10th of January 2024

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

  • Adds backdoors on LLMs.
  • Trains deceptive LLMs using data, which "acts" based on being either in training vs inference: demonstrates safe code vs unsafe code.

Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security

  • Reviews systematically "Personal LLM Agents" connected to personal data and devices for personal use.

The Impact of Reasoning Step Length on Large Language Models

  • Adding reasoning steps improves accuracy until the 5th step.

InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks

  • Introduces DABench-benchmark for LLM-based data analysis and open-sources a data analysis agent: DA Agent.

9th of January 2024

Agent Alignment in Evolving Social Norms

  • EvolutionaryAgent: Evaluates LLM agents based on fitness to social norms using observer LLM within EvolvingSociety-environment.
  • LLM agents producing the highest social norm ratings self-evolve and reproduce into new generation LLM agents. Agents either become obsolete or survive.
  • Agent events are recorded within short term memory with a threshold, which defines when long term and higher-level memories are distilled.
  • Defines the initial stage of the EvolvingSociety and the desired direction only.

Exploring Large Language Model based Intelligent Agents: Definitions, Methods, and Prospects

  • Reviews LLM Intelligent agents: definitions, frameworks, single/multiple agents, components, cognitive features etc.

Metacognition is all you need? Using Introspection in Generative Agents to Improve Goal-directed Behavior

  • Adds metacognition to LLM agents for emulating System 1 and System 2 processes. The idea is to let LLMs "think about thinking".
  • The Metacognition module (knowledge about itself, the task and the strategies) gets triggered to ask reflective questions, when the LLM agent is not making significant progress.
  • The metacognition is used throughout the planning, evaluation, monitoring and cognition-steps using reflective questions and then stored in the meta-memory used.

7th of January 2024

Agent AI: Surveying the Horizons of Multimodal Interaction

  • Agent AI system: Perceives and acts in different domains and applications.
  • Multi-modal generalist agent: Environment and Perception with task-planning and skill observation, Agent learning, Memory, Agent action; Cognition.

4th of January 2024

LLaVA-ϕ: Efficient Multi-Modal Assistant with Small Language Model

  • LLaVA-Phi: VLM using Phi-2 as the LLM with CLIP-ViT-L/14 (336x336) as the visual encoder.

Self-Contrast: Better Reflection Through Inconsistent Solving Perspectives

  • Self-Contrast: Explores potential paths, Contrasts differences and Summarizes them into checklist to better reason.
  • Many LLM agent errors are due to inconsistent feedback.

INTERS: Unlocking the Power of Large Language Models in Search with Instruction Tuning

  • Technique to tune LLM for "search": INstruction Tuning datasEt foR Search (INTERS).

3rd of January 2024

Act as You Learn: Adaptive Decision-Making in Non-Stationary Markov Decision Processes

  • Adaptive MCTS (Ada-MCTS): explores using epistemic & aleatoric uncertainties to adapt risk-aversion behaviour vs performance when spending more time in the environment.

Economics Arena for Large Language Models

  • EconArena: Reviews multiple LLM models in their ability to act rationally by comparing performance between models and against Nash Equilibrium (NE) rationality.
  • Better models act more rationally. LLMs are dynamically able to change strategies based on opponent strategy. Game history improves reasoning. Competing with a rational opponent helps to achieve NE quicker.

2nd of January 2024

LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning

  • LLMs have a built-in capability to manage long context, similar to how children manage long context such as books mainly by having seen short context text.
  • Self-Extend: No specific training / finetuning required. Plug in 4 lines of code during inference to the attention mechanism, based on LLM with RoPE and FLOOR-operation.
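
The FLOOR-based position remapping can be sketched in a few lines. This is a minimal illustration, not the paper's code: the names `group_size` and `neighbor_window` and the exact shift term are assumptions made for the example, and the function only computes the relative position that would be passed to RoPE for one query/key pair in a causal model.

```python
def self_extend_relative_position(query_pos: int, key_pos: int,
                                  group_size: int = 4,
                                  neighbor_window: int = 8) -> int:
    """Relative position for one (query, key) pair, assuming key_pos <= query_pos."""
    distance = query_pos - key_pos
    if distance < neighbor_window:
        # Nearby tokens keep their exact relative position.
        return distance
    # Distant tokens share grouped positions via FLOOR division. The shift
    # (an assumption of this sketch) makes the grouped range continue where
    # the neighbor window ends.
    grouped = query_pos // group_size - key_pos // group_size
    return grouped + neighbor_window - neighbor_window // group_size
```

With these illustrative defaults, a distance of 100 tokens maps to a relative position of 31, so far-away tokens stay inside the range of positions seen during training.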

Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models

  • Self-Play fIne-tuNing (SPIN): Fine-tuning LLMs based on Self-play mechanism, where the main player is the to-be learned LLM from the current iteration and its opponent is the same LLM from the previous iteration.

22nd of December 2023

Pangu-Agent: A Fine-Tunable Generalist Agent with Structured Reasoning

  • Pangu-Agent: Introduces a generic RL-based objective to improve the agent's intrinsic and extrinsic functions.

21st of December 2023

AppAgent: Multimodal Agents as Smartphone Users

  • Multimodal VLM agents learn to operate popular smartphone apps by creating a knowledge base through: Autonomous exploration and Human demonstrations.
  • Includes: Exploration phase and Deployment phase.
  • Exploration phase learns smartphone functionalities through trial and error, saves records of the effects of actions and stops, if the current view is unrelated to the assigned task. Exploration stops when the task is finished. Alternatively, these behaviours are shown through human demonstrations, which keep the agent exploration streamlined and efficient.
  • In deployment phase, the VLM agent has access to the UI screenshot and potential actions. The agent generates a summary of the actions taken and interaction history, which are passed to the next step.

Capture the Flag: Uncovering Data Insights with Large Language Models

  • Explores two types of Data Science Agents: Explorer agent and Aggregator agent.

20th of December 2023

AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation

  • AgentCoder: Multi-Agent Assistant Code Generation made from Programmer Agent, Test designer Agent and Test executor Agent
  • Uses Self-Refine with CoT in a Multi-Agent System.

DSPy Assertions: Computational Constraints for Self-Refining Language Model Pipelines

  • LM Assertions: Integrates with DSPy, which integrates reasoning, self-improvement, augmentation, retrieval and tools (DSPy is like challenger for Langchain).
  • To help runtime self-refinement in LM pipelines with boolean type conditions: Assert (hard or critical condition) and Suggest (soft condition).
  • For example, a critical (hard) condition causes the LM pipeline to halt, if the condition is still not met after the maximum number of attempts, while the Suggest-option lets the pipeline continue.
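
The difference between the hard and soft conditions can be sketched as a retry loop. This is a generic illustration and does not use the DSPy API: `run_with_constraints`, `HardConstraintError` and the `(predicate, message, is_hard)` tuple format are hypothetical names invented for this example.

```python
from typing import Callable, List, Tuple


class HardConstraintError(Exception):
    """Raised when an Assert-style constraint still fails after all retries."""


# (predicate on the output, feedback message, is_hard)
Check = Tuple[Callable[[str], bool], str, bool]


def run_with_constraints(generate: Callable[[List[str]], str],
                         checks: List[Check],
                         max_attempts: int = 3) -> str:
    feedback: List[str] = []
    failed: List[Tuple[str, bool]] = []
    output = ""
    for _ in range(max_attempts):
        # The generator sees the feedback from the previous failed attempt.
        output = generate(feedback)
        failed = [(msg, hard) for pred, msg, hard in checks if not pred(output)]
        if not failed:
            return output
        feedback = [msg for msg, _ in failed]
    if any(hard for _, hard in failed):
        # Assert-like behaviour: halt the pipeline.
        raise HardConstraintError("; ".join(feedback))
    # Suggest-like behaviour: continue with the best-effort output.
    return output
```

In this sketch a failed check feeds its message back into the next generation attempt, which is one simple way to realise runtime self-refinement.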

ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation

  • ASSISTGUI: Window mouse / keyboard management with LLM.

Generative agents in the streets: Exploring the use of Large Language Models (LLMs) in collecting urban perceptions

  • Explores generative agents in urban environments: includes memory module, movement module, visual inference module and an LLM module.

dIR -- Discrete Information Retrieval: Conversational Search over Unstructured (and Structured) Data with Large Language Models

  • Discrete Information Retrieval (dIR): Text-queries of SQL databases using LLMs.

19th of December 2023

Large Language Models Play StarCraft II: Benchmarks and A Chain of Summarization Approach

  • Plays Starcraft 2 better than an average player by using Chain of Summarization (CoS), python-sc2 and TextStarCraft II-environment (Observation-to-Text Adapter: and Text-to-Action Adapter).
  • Chain of Summarization (CoS): Improves LLMs capability to extract / analyze information using two components: Single-frame summarization and Multi-frame summarization.
  • TextStarCraft II-environment processes game information into textual format for the LLM, defining macro-actions and a rule-based method for micro-actions.
  • System prompt includes: Situation Overview, Situation Analysis, Strategic Planning, Opponent Strategy, Analysis, Strategic Recommendations, Decision-Making Process.
  • Reduces 10x the need of LLM API calls and improves strategic, analytical and judging capabilities.

Large Language Models Empowered Agent-based Modeling and Simulation: A Survey and Perspectives

  • LLM empowered agent-based modeling and simulation framework: surveys the landscape of utilizing LLMs in agent-based modeling and simulation.
  • Framework examines challenges, future directions, motivation for applying LLMs, environment perception, human alignment, action generation, evaluation, cyber, physical, social, and hybrid domains.
  • This framework provides a comprehensive overview of recent works in this interdisciplinary field.
  • Reviews LLM-based agents on their ability to simulate various human-like capabilities.

18th of December 2023

Agent Assessment of Others Through the Lens of Self

  • Discusses concept of Self-Awareness of Autonomous Agents.

Evaluating Language-Model Agents on Realistic Autonomous Tasks

  • Autonomous Replication and Adaption (ARA) framework: reviews ability of LLM agents to acquire resources, create copies of themselves and adapt to novel situations in the real world.
  • Tests LLM-agents using Scaffolding programs to interact with LLMs.
  • Defines implications of potentially ARA-level agents.

LLM-ARK: Knowledge Graph Reasoning Using Large Language Models via Deep Reinforcement Learning

  • LLM-ARK: LLM reasons from Knowledge Graphs with DRL.

17th of December 2023

Learning to Act without Actions

  • LAPO (Latent Action Policy).

16th of December 2023

ProTIP: Progressive Tool Retrieval Improves Planning

  • Progressive Tool Retrieval Improves Planning (ProTIP): Multi-step planning with external tools, where tasks are decomposed without explicit definition of the sub-task.
  • Addresses the issue, where single-step tool retrieval does not manage to handle dependencies between the tools.

15th of December 2023

ReST meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent

  • Self-Improving LLM without any human-assisted data for fine tuning, achieving significantly better reasoning results with a smaller model, when the synthetic data is used to distill the smaller model.
  • Finetunes LLM with ReST using ReAct-method reasoning-actions.

14th of December 2023

Practices for Governing Agentic AI Systems

  • OpenAI's research on Agentic AI systems with definition of Agentic AI system.
  • Includes level of "Agenticness": the degree of goal complexity, environment complexity, adaptability and independence.

TinyGSM: achieving >80% on GSM8k with small language models

  • First student LLM to reach the performance of its Teacher LLM (GPT-3.5) in mathematical reasoning using synthetic data from the teacher model.
  • TinyGSM: A 1.3B generation LLM paired with a 1.3B verifier LLM achieves SOTA level 81.5% accuracy on GSM8k. The approach consists of a high-quality dataset TinyGSM and a verifier selecting the final answer from multiple output generations.

Modeling Complex Mathematical Reasoning via Large Language Model based MathAgent

  • Planner-Reasoner-Executor-Reflector (PRER) / MathAgent: Planner, Reasoner, Executor and Reflector.
  • Systematic process for solving zero-shot mathematical reasoning with LLM agents.

Rational Sensibility: LLM Enhanced Empathetic Response Generation Guided by Self-presentation Theory

  • Self-Representation with Lamb: Uses semantic label to set tone for the conversation.

LiFT: Unsupervised Reinforcement Learning with Foundation Models as Teachers

  • LiFT: Outperforms significantly VPT/other models in MineDojo-environment.
  • LLM provides task instruction.
  • VLM is used to learn policy and act as a reward model.

LLMind: Orchestrating AI and IoT with LLMs for Complex Task Execution

  • LLMind: Includes a coordinator updating short-term memory/retrieving required AI (IoT) modules with the ability to define, if a script exists for the module, and to generate it, if missing. Coordinator retrieves error / output messages from the executed script, which is handled by the script executor.

Holodeck: Language Guided Generation of 3D Embodied AI Environments

  • Holodeck: Generating 3D embodied environments with LLM: Floor-wall module, doorway-window module, object selection module and layout design module.

Personalized Path Recourse

  • Personalized Path Recourse (PPR): Personalized path of actions to achieve a certain goal with an agent.

Adaptive parameter sharing for multi-agent reinforcement learning

  • AdaPS: Maps agents to different regions of brain/shared network based on identity vectors obtained with VAE and clusters agents to K classes.

Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft

  • RL agent using LLM to act as a Reward designer, Reward critic and a Trajectory designer.

Vision-Language Models as a Source of Rewards

  • VLMs work as reward models and larger scale improves performance of the reward model.

Learning Coalition Structures with Games

  • Coalition Structure Learning (CSL): Learns coalitions of agents via set of games.

12th of December 2023

Medprompt+

  • Medprompt+ extends the Medprompt-method by additionally asking if a scratch-pad is needed and increasing the number of ensembled calls from 5 to 20.

diff History for Long-Context Language Agents

  • Compresses consecutive text observations from environment with Unix "diff"-command, which leads to 700% improvement in game score, outperforming existing agents by 40%, which use visual observations.
  • Similar approach may enable building vastly more generic embodied LLM agents.
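
The compression idea can be sketched with Python's standard `difflib` instead of the Unix `diff` command. This is only an illustration: the function name `diff_history` and the choice to keep the first observation in full are assumptions of this example.

```python
import difflib
from typing import List


def diff_history(observations: List[str]) -> List[str]:
    """Keep the first observation in full; replace each later one with a diff."""
    if not observations:
        return []
    history = [observations[0]]
    for previous, current in zip(observations, observations[1:]):
        delta = list(difflib.unified_diff(previous.splitlines(),
                                          current.splitlines(),
                                          lineterm="", n=0))
        # The first two lines are the '---'/'+++' file headers; keep hunks only.
        # An unchanged observation yields an empty string.
        history.append("\n".join(delta[2:]))
    return history
```

For a text game state where only one field changes between steps, the history entry shrinks to the removed and added lines, which is what saves context length.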

Sequential Planning in Large Partially Observable Environments guided by LLMs

  • Neoplanner: builds state space model of the environment by testing different actions, observations and rewards. Builds a graph memory of learnings from all previous trials using Learner agent.
  • Model provides anytime best policy given the knowledge at that moment. Balances exploration and exploitation.

11th of December 2023

Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models

  • ReSTEM (Expectation-Maximization): LLM generates samples (E-step/Expectation-step) using temperature sampling, filter samples using binary feedback/reward, fine-tune LLM using these feedbacks (M-step/Maximization-step). Repeat few rounds. Improves significantly coding and math benchmark results.
  • Ability to generate multiple correct solutions compared against human-generated data.
  • ReSTEM uses temperature sampling (diverse/creative), compared to STaR-method based on greedy sampling (most-likely), where the rationalization-process leads to false-positive solutions.
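
The E-step/M-step loop described above can be sketched as follows. This is a schematic outline under stated assumptions: `sample`, `reward` and `fine_tune` are hypothetical callables standing in for temperature sampling, the binary feedback and the fine-tuning run, and fine-tuning from the previous round's model is a simplification of this sketch.

```python
from typing import Any, Callable, List, Tuple


def rest_em(model: Any,
            problems: List[Any],
            sample: Callable[[Any, Any], Any],
            reward: Callable[[Any, Any], bool],
            fine_tune: Callable[[Any, List[Tuple[Any, Any]]], Any],
            rounds: int = 3,
            samples_per_problem: int = 4) -> Any:
    for _round in range(rounds):
        # E-step: generate candidates and keep those with positive binary reward.
        kept: List[Tuple[Any, Any]] = []
        for problem in problems:
            for _ in range(samples_per_problem):
                candidate = sample(model, problem)
                if reward(problem, candidate):
                    kept.append((problem, candidate))
        # M-step: fine-tune on the filtered, self-generated samples.
        model = fine_tune(model, kept)
    return model
```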

8th of December 2023

KwaiAgents: Generalized Information-seeking Agent System with Large Language Models

  • KwaiAgents: an autonomous agent loop including three key components: System (KAgentSys), LLMs (KAgentLMs) and Benchmarks (KAgentBench).
  • System includes: Memorybank (Knowledge, Conversation and Task), Tool-library (Factuality-aware, Time-aware and Custom tools) used with Memory update, Task plan, Tool execution and Finish & Conclude-steps.
  • LLM-component includes templates for LLMs, Meta-Agent Tuning (MAT)-framework and LLM services. Benchmarks include both human and LLM-driven profiling.
  • MAT includes six key components to generate prompt templates: system profile, instructions/constraints, tool specification, goal placement, memory allocation and output format.

7th of December 2023

Chain of Code: Reasoning with a Language Model-Augmented Code Emulator

  • Creates answer in two steps: Starts by creating pseudo-code to solve the question, then runs the pseudo-code in code interpreter or LM emulating code, in case no code interpreter is available.
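
The interpreter-with-fallback idea can be sketched as a line-by-line executor. This is a simplified illustration: `lm_emulate` is a hypothetical callable standing in for the language model that simulates a line the interpreter cannot run, and it is assumed to return the variables it would update.

```python
from typing import Any, Callable, Dict, List


def run_chain_of_code(lines: List[str],
                      lm_emulate: Callable[[str, Dict[str, Any]], Dict[str, Any]]
                      ) -> Dict[str, Any]:
    """Execute code line by line; hand lines that fail to the LM emulator."""
    state: Dict[str, Any] = {}
    for line in lines:
        try:
            # Real interpreter first. exec is unsafe for untrusted code;
            # a sandbox would be needed in practice.
            exec(line, {}, state)
        except Exception:
            # Fallback (assumption of this sketch): the LM returns the
            # variable updates it predicts for this line.
            state.update(lm_emulate(line, dict(state)))
    return state
```

For instance, a call to an undefined semantic helper such as `count_fruits(items)` raises `NameError` in the interpreter and is then answered by the emulator, while ordinary arithmetic on the result runs as normal Python.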

AVA: Towards Autonomous Visualization Agents through Visual Perception-Driven Decision-Making

  • Autonomous Visualization Agents (AVAs): User instructions are converted with Visualization agent into actions and the taken actions are converted back to language within visualization tasks.
  • Components include: Visual perception, Action planning and Memory components, working within visualization-perception-action-loop.

Generating Illustrated Instructions

  • StackedDiffusion: Generates illustrated instructions based on text, which helps to train SOTA level multi modal models preferred over human generated articles.

Fortify the Shortest Stave in Attention: Enhancing Context Awareness of Large Language Models for Effective Tool Use

  • Introduces "Attention Buckets", which enable a 7B open source model to achieve GPT-4 level tool use performance by compensating attention peaks between parallel processes in specific context.

6th of December 2023

Generative agent-based modeling with actions grounded in physical, social, or digital space using Concordia

  • Concordia-library: Simulation environment made of multiple agents and Grand Master (GM) inspired by the Dungeons and Dragons game.
  • Agents consume observations and GM agent actions. Agent produces actions and GM event statements (such as physical grounding).
  • Includes long and short term memory, which include state of the world.

LLM as OS (llmao), Agents as Apps: Envisioning AIOS, Agents and the AIOS-Agent Ecosystem

  • AIOS-Agent Ecosystem: Envisions LLMs as OS, Agents as Applications, Natural Language as Programming language and Tools as Devices/Libraries.

5th of December 2023

Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models

  • Answers visual questions by creating programs, that can review the image such as count number of specific types of objects and use tools.
  • Answer is provided with CoT reasoning based on filtered program from many programs executed.

Beyond Isolation: Multi-Agent Synergy for Improving Knowledge Graph Construction

  • Uses three LLM agents for entity, event and relation extraction to build knowledge graph.

Large Knowledge Model: Perspectives and Challenges

  • Large Knowledge Models: Reviews combination of LLMs (neural representation) and Knowledge graphs (symbolic representation) through usage of knowledge graph embeddings and text embeddings with LLMs.

4th of December 2023

Exchange-of-Thought: Enhancing Large Language Model Capabilities through Cross-Model Communication

  • Exchange-of-Thought (EoT): Improvement from CoT and Self-Consistency, where thoughts from other LLMs are considered, outperforming in mathematical reasoning the CoT with Self-Consistency
  • Proposes four communication paradigms to define the setup of the Exchange-of-Thought: Memory, Report, Relay and Debate.
  • For example in Debate-mode: two LLM agents first answer the question and the two rationalizations are provided to a third LLM agent, which debates these solutions in order to provide the right answer.

LLM A*: Human in the Loop Large Language Models Enabled A* Search for Robotics

  • LLM A*: Includes current node, goal node and optimal action, and these three make up the plan.
  • The chat-environment with user defines user inputs: Setting up environment, Setting up Action model, Start and Target Nodes, Heuristic and Rules.
  • Demonstrates the possibility of achieving very good path planning results using mobile embodied agents.

Towards Learning a Generalist Model for Embodied Navigation

  • NaviLLM: Embodied navigation with LLMs using schema-based instruction (task, history, observation and output hint), which generalizes well to unseen navigation tasks.
  • Uses the following Multi-task learning modules: Visual-Language Navigation, Object localization, Trajectory Summarization and 3D Question Summarization.

OpenVoice: Versatile Instant Voice Cloning

  • OpenVoice: Voice cloning almost from instant voice record.

29th of November 2023

Universal Self-Consistency for Large Language Model Generation

  • Universal Self-Consistency (USC): Uses LLMs to select the most consistent answer among multiple candidates working in mathematical reasoning and code generation and unlike the original Self-Consistency, the method works in open-ended questions.
  • This can be used as a more capable component in the STaR-method, which generalizes with Q&A with open-ended answers, not only precise answers.

28th of November 2023

Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine

  • Medprompt: Generalist LLM using MedPrompt outperforms SOTA specialist model.
  • Uses SOTA prompt methods: CoT, Choice Shuffle and Self-Consistency prompting.
  • Introduces Choice Shuffle-technique, which increases diversity of the reasoning paths.
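
Choice Shuffle combined with Self-Consistency voting can be sketched as below. This is an illustration under assumptions: `answer_index` is a hypothetical callable standing in for the LLM, which receives the question and the shuffled options and returns the index of the option it picks.

```python
import random
from collections import Counter
from typing import Callable, List


def choice_shuffle_vote(question: str,
                        choices: List[str],
                        answer_index: Callable[[str, List[str]], int],
                        n_votes: int = 5,
                        seed: int = 0) -> str:
    rng = random.Random(seed)
    votes: Counter = Counter()
    for _ in range(n_votes):
        # Present the options in a fresh random order on every call.
        order = list(range(len(choices)))
        rng.shuffle(order)
        shuffled = [choices[i] for i in order]
        picked = answer_index(question, shuffled)
        # Map the pick back to the original option before counting the vote.
        votes[choices[order[picked]]] += 1
    # Majority vote over the ensembled calls.
    return votes.most_common(1)[0][0]
```

Because votes are counted on the option text rather than its position, a model that simply favours the first listed option spreads its votes across options instead of reinforcing one answer.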

27th of November 2023

Some intuitions about large language models

  • Jason Wei Blog post / Presentation.
  • Learning the relationship from Input to Output is as well Next-word prediction learning.
  • Next-word prediction is massively multi-task learning.

22nd of November 2023

Building the Future of Responsible AI: A Pattern-Oriented Reference Architecture for Designing Large Language Model based Agents

  • Identifies two types of LLM agents: "Agents-as-workers" and "Agents-as-coordinators".

21st of November 2023

System 2 Attention (is something you might need too)

  • System 2 Attention (S2A): Generate interim user question and interim context from the original user input. Finally, generate the final answer by answering to the interim user question from the interim context.
  • Reduces hallucination from irrelevant context by first defining the question and the context and this way separating irrelevant facts from impacting the response generation.
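
The two-step flow can be sketched with a generic `llm` callable. This is a minimal sketch: the prompt wording is invented for illustration and is not the paper's prompt, and `llm` is a hypothetical function mapping a prompt string to a completion string.

```python
from typing import Callable


def system2_attention(llm: Callable[[str], str], user_input: str) -> str:
    # Step 1: regenerate the input, keeping only the relevant context
    # and the actual question.
    rewrite_prompt = (
        "Rewrite the text below. Keep only the context that is relevant and "
        "unbiased, then restate the question being asked.\n\n"
        f"Text: {user_input}"
    )
    regenerated = llm(rewrite_prompt)
    # Step 2: answer from the regenerated context only; the original input
    # is deliberately not included in this prompt.
    answer_prompt = (
        "Answer the question using only the context given.\n\n"
        f"{regenerated}"
    )
    return llm(answer_prompt)
```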

20th of November 2023

Igniting Language Intelligence: The Hitchhiker's Guide From Chain-of-Thought Reasoning to Language Agents

  • Systematic review of research from Chain-of-Thought (CoT) to LLM Agents and identifies gaps in generalization, redundant interactions and customization and more.

17th of November 2023

A Language Agent for Autonomous Driving

  • Agent-Driver: Uses LLM agent for human-like intelligence for autonomous driving.
  • Tool library provides input for: detection, prediction, occupancy and mapping functions. Memory includes Commonsense memory and Experience memory. In addition, there are historical trajectories and ego-states.
  • The reasoning engine includes: CoT reasoning, Task planning, Motion planning and Self-Reflection. These lead to actions and again to environment update.

16th of November 2023

Digital Socrates: Evaluating LLMs through explanation critiques

  • Digital Socrates: evaluates reasoning flaws: giving feedback on why and where?

15th of November 2023

Divergences between Language Models and Human Brains

  • Reviews differences measured with MEG in human brain vs. language models.
  • The study reveals, that LLMs are less good at social/emotional intelligence and physical commonsense reasoning.
  • Finetuning helps to align LLMs to act more in human brain-like manner.

AutoMix: Automatically Mixing Language Models

  • AutoMix: Uses a smaller LLM to generate an initial response and a Meta-Verifier to check its trustworthiness on a rough scale. If the answer is trustworthy, the small LLM answer is used, otherwise a larger LLM is consulted.
  • Uses Incremental Benefit Per Unit Cost (IBC) metric to assess effectiveness of this approach.
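
The routing logic and the metric can be sketched as below. This is a sketch under assumptions: `small_llm`, `large_llm` and `verifier` are hypothetical callables, the `threshold` value is illustrative, and the IBC formula (performance gain over the small model divided by the added cost) is written from memory of the paper and should be checked against it.

```python
from typing import Callable, Tuple


def automix_answer(question: str,
                   small_llm: Callable[[str], str],
                   large_llm: Callable[[str], str],
                   verifier: Callable[[str, str], float],
                   threshold: float = 0.5) -> Tuple[str, str]:
    """Return (answer, which_model) after verifying the small model's draft."""
    draft = small_llm(question)
    confidence = verifier(question, draft)
    if confidence >= threshold:
        return draft, "small"
    # The draft is not trusted: fall back to the larger model.
    return large_llm(question), "large"


def incremental_benefit_per_cost(perf: float, cost: float,
                                 small_perf: float, small_cost: float) -> float:
    """Performance gained per unit of extra cost, relative to the small model."""
    return (perf - small_perf) / (cost - small_cost)
```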

14th of November 2023

DeepThought: An Architecture for Autonomous Self-motivated Systems

  • DeepThought: An architecture for cognitive language agents possessing agency, self-motivation, and partly meta-cognition.
  • Includes supervisor module, Deep Reinforcement Learning module, Attention Schema (long-term memory), Language/Auditory/Vision modules and Embedding store.

9th of November 2023

LLM Augmented Hierarchical Agents

  • Hierarchical agent uses LLM to evaluate, when to use a specific skill to complete a specific sub-level task with long horizon.
  • The resulting model works without the need for an LLM after the training.

Prompt Engineering a Prompt Engineer

  • Guide LLM to prompt engineer prompts automatically
  • The metaprompt uses: prompt engineering tutorial, two-step task description, step-by-step reasoning template and context specification.

8th of November 2023

ADaPT: As-Needed Decomposition and Planning with Language Models

  • ADaPT: Plans and decomposes dynamically complex tasks with LLMs, if the executor is not able to complete the task.
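
As-needed decomposition can be sketched as a short recursion. This is an illustration: `execute` and `decompose` are hypothetical callables standing in for the LLM executor and planner, and requiring all sub-tasks to succeed is a simplifying assumption of this sketch.

```python
from typing import Callable, List


def adapt(task: str,
          execute: Callable[[str], bool],
          decompose: Callable[[str], List[str]],
          depth: int = 0,
          max_depth: int = 3) -> bool:
    # Try the task directly first; decompose only if the executor fails.
    if execute(task):
        return True
    if depth >= max_depth:
        return False
    subtasks = decompose(task)
    if not subtasks:
        return False
    # Recurse: each sub-task may itself be decomposed further if needed.
    return all(adapt(sub, execute, decompose, depth + 1, max_depth)
               for sub in subtasks)
```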

2nd of November 2023

RoboGen: Towards Unleashing Infinite Data for Automated Robot Learning via Generative Simulation

  • RoboGen: Agent using LLMs to define new tasks to learn, create their simulation environments, train on them to acquire diverse & new skills.
  • Agent includes: Task proposal, Scene generation, Training Supervision Generation & Skill learning.

Youtube. Adam Kalai presents "Recursive Self-improving Code Generation" - talk 2.11.2023

  • Adam Kalai's talk on "Self-Taught Optimizers (STOP): Recursively Self-Improving code generation", which in essence attempts to build code for letting LLMs themselves improve (their) own code.
  • I recommend to check this especially from safety-aspects on the point "sandbox-flag" and to better understand the

1st of November 2023

Plug-and-Play Policy Planner for Large Language Model Powered Dialogue Agents

  • Introduces plug-and-play dialogue policy planner (PPDPP).
  • Plans dialogues using self-play with three LLM agents: one acts to achieve a goal such as buying a product at a cheaper price, a second negotiates as the seller for a higher price, and a third scores performance as the reward model.

SAGE: Smart home Agent with Grounded Execution

  • SAGE (Smart home Agent with Grounded Execution).
  • Device interaction: Interaction planner, Attribute retriever, API documentation retriever, Device disambiguation, Device command execution.
  • Personalization: Long-term memory, User profile & Personalization tool.
  • Includes Physical grounding such as light bulbs and External grounding (such as weather forecast) & Personalization.

Efficient Human-AI Coordination via Preparatory Language-based Convention

  • HAPLAN: Human-AI coordination using Conventions. Humans communicate roles & tasks of individuals before starting a task to be completed. Humans create Conventions.
  • Builds a Convention (an action-plan) to guide AI/human using task requirements, human preferences, number of agents and other information for a better understanding of tasks & responsibilities of each agent/human.
  • Assigns sub-problems to own sessions. Convention is first confirmed with human.

31st of October 2023

Generating Sequences by Learning to Self-Correct

  • Self-Correction: A generative LLM, which includes two modules: Generator and Corrector.

Autonomous Robotic Reinforcement Learning with Asynchronous Human Feedback

  • Autonomously explores real world
  • Guided Exploration for Autonomous Reinforcement learning (GEAR): approaches the objective by selecting a promising sub-goal that is close to the final target (Goal Selector) but reachable from the current position using the current policy (Density model).
  • Crowdsourced & Occasional comparative feedback regards user objective vs. available correct/incorrect states.

Towards A Natural Language Interface for Flexible Multi-Agent Task Assignment

  • Programs constraints into task assignments system based on natural language using Multi-agent LLMs.

Leveraging Word Guessing Games to Assess the Intelligence of Large Language Models

  • DEEP: Uses aggressive (truthful) & conservative (disguising) modes to play a spy game that assesses the intelligence of LLMs in describing a target word without stating the word explicitly.

Multi-Agent Consensus Seeking via Large Language Models

  • Consensus seeking in a multi-agent system: LLM agents mainly reason about and change their numerical state values using an average strategy as the consensus strategy.
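
To make the average strategy concrete, here is a small numerical sketch. It is an assumption-laden simplification: in the paper the agents are LLMs that reason about their next value, whereas this sketch replaces each agent with a fixed averaging rule. The `neighbors` topology and both function names are hypothetical.

```python
from typing import Dict, List


def consensus_step(states: List[float], neighbors: Dict[int, List[int]]) -> List[float]:
    """Each agent moves to the mean of its own value and its neighbors' values."""
    new_states = []
    for i, own in enumerate(states):
        group = [own] + [states[j] for j in neighbors[i]]
        new_states.append(sum(group) / len(group))
    return new_states


def run_consensus(
    states: List[float], neighbors: Dict[int, List[int]], rounds: int = 20
) -> List[float]:
    """Repeat the averaging step for a fixed number of rounds."""
    for _ in range(rounds):
        states = consensus_step(states, neighbors)
    return states
```

With a fully connected topology every agent reaches the global mean after one round; sparser topologies need more rounds to settle on a common value.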

26th of October 2023

CompeteAI: Understanding the Competition Behaviors in Large Language Model-based Agents

  • Studies competition between LLM agents and identifies research on competition of LLM agents as being as important as research on co-operation.
  • The initial advantage of an LLM agent creates a feedback cycle leading to the Matthew effect.
  • LLM Agents can operate in competitive environment.
  • LLM Agents learn to imitate and differentiate with other LLM agents.

25th of October 2023

PromptAgent: Strategic Planning with Language Models Enables Expert-level Prompt Optimization

  • PromptAgent: Optimizes prompts using planning algorithms such as MCTS.
  • Creates intermediate prompts, updates them based on error feedback, simulates future rewards and searches higher reward paths.
  • Prompts generated include: Domain knowledge, Task description, Term clarification, Solution Guidance, Exception handling, Priority & Emphasis, Formatting.

24th of October 2023

RCAgent: Cloud Root Cause Analysis by Autonomous Agents with Tool-Augmented Large Language Models

  • Key-value store for observation retrieval, parsed actions are executed by RCAgent or by Expert Agent.

Diverse Conventions for Human-AI Collaboration

  • Mixed-play: generates diverse conventions (arbitrary solutions to recurring cooperation problems) by randomly switching between self-play (maximize reward) and cross-play (minimize reward) actions to maximize mixed-play reward.
  • CoMeDi (Cross-play optimized, Mixed-play enforced Diversity) algorithm is explained.

Woodpecker: Hallucination Correction for Multimodal Large Language Models

  • Woodpecker: Extracts key concepts, formulates questions, validates visual knowledge and generates visual claims using Multimodal Large Language Models (MLLMs) to control hallucinations in LLM responses.

In-Context Learning Creates Task Vectors

  • Training data used with LLMs is compressed into task vectors within LLM. Task vectors are used in 18 tasks.

Instruct and Extract: Instruction Tuning for On-Demand Information Extraction

  • On Demand Information Extraction (ODIE): Extracting information using LLMs from text to present it in structured tabular format.

23rd of October 2023


Function Vectors in Large Language Models

  • LLMs include Function Vectors (FVs) to trigger functions in different contexts.

LLM-Based Agent Society Investigation: Collaboration and Confrontation in Avalon Gameplay

  • Explores social behaviour of LLMs in Avalon-game regarding teamwork and other collaboration.

20th of October 2023

ToolChain*: Efficient Action Space Navigation in Large Language Models with A* Search

  • ToolChain*: Uses the A* search algorithm to navigate an action space as a tree-like structure with an LLM agent.
  • Selects the most promising path, expands follow-up actions in the selected path and updates the tree structure.
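
The select, expand and update loop can be illustrated with a generic A* search. This is a textbook A* sketch rather than the ToolChain* code: `expand`, `is_goal` and `heuristic` are hypothetical callables, where ToolChain* would use an LLM to propose follow-up actions and to estimate costs.

```python
import heapq
from itertools import count
from typing import Callable, Hashable, List, Optional, Tuple


def a_star(
    start: Hashable,
    expand: Callable[[Hashable], List[Tuple[str, float, Hashable]]],
    is_goal: Callable[[Hashable], bool],
    heuristic: Callable[[Hashable], float],
) -> Tuple[Optional[List[str]], float]:
    """Return (list of actions, total cost) for the cheapest path found."""
    tie = count()  # tie-breaker so the heap never compares states directly
    frontier = [(heuristic(start), next(tie), 0.0, start, [])]
    best_cost = {start: 0.0}
    while frontier:
        # Select: pop the node with the lowest cost-so-far + estimated cost-to-go.
        _, _, cost, state, path = heapq.heappop(frontier)
        if is_goal(state):
            return path, cost
        if cost > best_cost.get(state, float("inf")):
            continue  # stale heap entry; a cheaper route was already found
        # Expand: generate follow-up actions from the selected node.
        for action, step_cost, nxt in expand(state):
            new_cost = cost + step_cost
            # Update: keep only the cheapest known route to each state.
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                heapq.heappush(
                    frontier,
                    (new_cost + heuristic(nxt), next(tie), new_cost, nxt, path + [action]),
                )
    return None, float("inf")
```

The heuristic must not overestimate the remaining cost if the returned path is to be optimal; with a heuristic of zero the search degrades to uniform-cost search.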

Democratizing Reasoning Ability: Tailored Learning from Large Language Model

  • Student LM takes an “exam” to gather the mistakes it makes. Teacher LM generates training data based on these mistakes and customizes the feedback for each "exam". Student LM learns to improve through self-reflection on its mistakes and the new training data provided by the Teacher LM. These steps are repeated until Student LM has reached Teacher LM capability.

19th of October 2023

AgentTuning: Enabling Generalized Agent Abilities for LLMs

  • AgentTuning: Improves the agent capability of LLMs through instruction tuning on the AgentInstruct-dataset, resulting in AgentLM.

18th of October 2023

Language Agents for Detecting Implicit Stereotypes in Text-to-image Models at Scale

  • Language agent to automatically identify and quantify the extent of stereotypes in generated images.
  • Planning and Reasoning. Tool usage: Intent understanding, Instruction generation, Instruction retrieval, Prompt optimization & Stereotype score generation.

17th of October 2023

Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

  • Set-of-Mark (SoM) visual prompting technique answers questions by partitioning the image into regions at different levels of granularity and inserting a number for each region.
  • Studies VLM model prompting techniques.

VeRA: Vector-based Random Matrix Adaptation

  • VeRA

The next grand challenge for AI

  • Foundational Agent: Agents, which scale along all three axes: skills, embodiment and realities. If ChatGPT was scaled with data, foundational agents are scaled with realities.

16th of October 2023

Character-LLM: A Trainable Agent for Role-Playing

  • Character-LLM: simulates historical figures using LLMs, which mimic the profile, experiences and emotional states of specific individuals.
  • Applies "Experience Reconstruction" with detailed experiences and memories.
  • Specialises a base model for character generation.
  • Evaluates using a step-by-step LLM-judge approach, evaluating one dimension at each step.

OpenAgents: An Open Platform for Language Agents in the Wild

  • OpenAgents-platform: Data agent, Plugin/Tools and Web agent
  • Automatic tool selection from over 200 tools

Improving Large Language Model Fine-tuning for Solving Math Problems

  • Introduces multi-task sequential fine-tuning method, where solution generation is improved by including solution evaluation as part of the fine-tuning objective together with the generated solution to provide higher-quality guidance to solution generator.
  • Quality and style of the step-by-step solutions used for fine-tuning impact model performance. Solution re-ranking and Majority voting used together are an effective way to improve model performance with fine-tuning.

CLIN: A Continually Learning Language Agent for Rapid Task Adaptation and Generalization

  • A Continually Learning Generative Agent from Interactions (CLIN): Memory generator updates memory, Controller manages tasks and Executor converts it into actions towards the goal.

Theory of Mind for Multi-Agent Collaboration via Large Language Models

  • LLM-based agent manages complex multi-agent collaboration task with performance level comparable with RL agent.

13th of October 2023

A Zero-Shot Language Agent for Computer Control with Structured Reflection

  • Zero-shot agent plans executable actions in the environment and iteratively progresses by learning from mistakes using self-reflection and structured thoughts management.
  • Better generalization, outperforms best iterative-planning agents

12th of October 2023

AgentCF: Collaborative Learning with Autonomous Language Agents for Recommender Systems

  • AgentCF: LLM agent-based recommender system with User and Item Agents.
  • User & Item Agents interact autonomously and the discrepancies between the two are stored in the memory to help guide better future recommendations.

Octopus: Embodied Vision-Language Programmer from Environmental Feedback

  • Octopus: Uses Vision-Language Model with Reinforcement Learning from Environmental Feedback (RLEF).
  • Generates action sequences and executable code.

MemGPT: Towards LLMs as Operating Systems

  • MemGPT: OS-inspired design, in which an LLM-processor manages its own context and long-term memory, uses functions to make changes and uses events to manage the order of processing data.

Promptor: A Conversational and Autonomous Prompt Generation Agent for Intelligent Text Entry Techniques

  • Promptor: Automatic prompt generation.
  • Builds prompts based on: User goals, User Profiles, Data Profile, Contextual information & Output constraints.
  • System prompt includes: instructions, Actions, Facts and Examples.

Towards Robust Multi-Modal Reasoning via Model Selection

  • Dynamic model selection by taking into account input & sub-task dependencies.

11th of October 2023

The Temporal Structure of Language Processing in the Human Brain Corresponds to The Layered Hierarchy of Deep Language Models

  • Evidence about strong correlation between layers activated in Deep Language Models (DLMs) and human brain high-order language areas: auditory, syntactic and semantic areas.
  • Brain and DLMs both process input into multi-dimensional vector embeddings, processed as sequences taking into account the context.
  • Identifies differences. One difference is that the human brain does not perform straightforward linear interpolation between the previous and current words, suggesting RNNs may better mimic human brain language processing. The other difference is that humans do not learn only by reading text, but use data from multiple modalities.

Empowering Psychotherapy with Large Language Models: Cognitive Distortion Detection through Diagnosis of Thought Prompting

  • Diagnosis-of-Thought: Cognitive distortion detection through prompting: Subjective assessment, contrastive reasoning and schema analysis.

LangNav: Language as a Perceptual Representation for Navigation

  • Uses BLIP for image captioning and DETR for object detection on image views to obtain text descriptions, which an LLM agent uses to generate navigation instructions.

10th of October 2023

Towards Mitigating Hallucination in Large Language Models via Self-Reflection

  • Self-Reflection: Introduces self-reflection prompting, similar to "Reflection"-prompting. Evaluates in an LLM loop whether the knowledge in the answer is factual enough and, in a second loop, whether the answer is consistent enough.
  • Human reviewers are asked to evaluate whether each sentence in the answer is generic, fact-inconsistent or fact-consistent. The reviewer is also asked to categorise the answer as question-inconsistent (inconsistent), tangential (consistent, but not on topic) or answerable (consistent and answers the question).

9th of October 2023

FireAct: Toward Language Agent Fine-tuning

  • Fine-tuning LLMs with agent trajectories for better autonomous agents.

8th of October 2023

Walking Down the Memory Maze: Beyond Context Limit through Interactive Reading

  • MemWalker: navigates long context iteratively and constructs memory as a tree-like structure.

7th of October 2023

Crystal: Introspective Reasoners Reinforced with Self-Feedback

  • Introspective reasoning of the knowledge.

Self-Supervised Behavior Cloned Transformers are Path Crawlers for Text Games

  • PathCrawling: Crawl all paths leading to reward (train LLM with these paths) and evaluate generality to unseen tasks. Continue crawling the most general paths.

6th of October 2023

Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models

  • Language Agents Tree Search (LATS): Self-Refine, Memory, Reasoning, Decision Making & Planning.
  • Uses multiple reasoning paths and learns from experience by integrating external feedback & self-reflection.

BrainSCUBA: Fine-Grained Natural Language Captions of Visual Cortex Selectivity

  • BrainScuba (Semantic Captioning Using Brain Alignments): LLM generates interpretable captions.
  • Aligns brain activity pattern with semantic content to generate captions to explain how brain processes visual information.
  • Collects fMRI brain imaging data while a human views visual stimuli and uses BERT to obtain a semantic representation in natural language, which is based on an alignment process. This process maps images to voxel-wise brain activations.

5th of October 2023

Agent Instructs Large Language Models to be General Zero-Shot Reasoners

  • AgentInstruct: generates instructions for the problem and then solves it using these instructions, improving the Chain of Thought (CoT) zero-shot reasoning.

5th of October 2023

Balancing Autonomy and Alignment: A Multi-Dimensional Taxonomy for Autonomous LLM-powered Multi-Agent Architectures

  • Characteristics of Autonomous Agents: Goal-driven task management, Intelligent Agents with LLMs, Multi-Agents collaboration, Context interaction, Balancing Autonomy vs. Alignment.

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

  • DSPy programs (think Langchain as comparison) help create LLM pipelines, which can outperform few-shot prompting techniques.
  • Helps improve solving math word problems or answering complex questions and manages chaining / loops.

3rd of October 2023

Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation

  • Self-Taught Optimizer (STOP): Ask LLM to improve initial program by providing improvement candidates and then output best solution.

Lyfe Agents: Generative agents for low-cost real-time social interactions

  • LyfeAgents Brain: Sensory processing, Internal states, Self-monitor, Action selection and Memory.
  • Internal states are text based: current goal, memory, recent events and sensory inputs.
  • Cognitive controller selects high-level actions. Action model selects actions until termination condition is reached.
  • Self-monitoring maintains and emphasizes recent and novel events towards agent goals
  • Memories are clustered and summarized before moving them to long-term storage (vector database)

EcoAssistant: Using LLM Assistant More Affordably and Accurately

  • EcoAssistant: Enables an LLM agent to converse with a code executor to iteratively produce answers based on the code produced. Hierarchical structure, where a cheaper and weaker LLM is used before trying the stronger and more expensive LLM.
  • Surpasses GPT-4 by 10% in performance at 50% less cost.

Large Language Models as Analogical Reasoners

  • LLM self-generates examples/knowledge related to the task.

Conceptual Framework for Autonomous Cognitive Entities

  • Conceptual framework for Autonomous entities.

OceanGPT: A Large Language Model for Ocean Science Tasks

  • DoInstruct (Domain Instruction): Automatically gathers large amount of domain specific instruction data for multi-agent collaboration.
  • Domain Instruction generation: Agents used as experts in each topic. Instructions are augmented rapidly through agent collaboration, which are annotated and finally inspected for high quality fine-tuning dataset.

2nd of October 2023

Enabling Language Models to Implicitly Learn Self-Improvement

  • ImPlicit Self-ImprovemenT (PIT)-framework: introduces self-improvement, where LLMs improve their own response quality with human preference data without extensive human annotation.

SmartPlay : A Benchmark for LLMs as Intelligent Agents

  • SmartPlay: a benchmark to test LLM-based agents from 9 perspectives.
  • Tests: Reasoning with object dependencies, planning ahead, spatial reasoning, learning from history, and understanding randomness.

GRID: A Platform for General Robot Intelligence Development

  • GRID: General Robot Intelligence Development
  • Solves complex tasks using simulation and/or real-world data.
  • Task specification, robot configuration and sensor/API.
  • Foundation Mosaic: a neural architecture.

1st of October 2023

RoleLLM: Benchmarking, Eliciting, and Enhancing Role-Playing Abilities of Large Language Models

  • RoleLLM: Role-profile constructor, Context-based Instruction generation, Role-based Prompting (RoleGPT), Role-conditioned Instruction-tuning.

29th of September 2023

AutoAgents: A Framework for Automatic Agent Generation

  • AutoAgents: Planner agent receives user input and converts it into a plan. Multiple agent roles take actions in this plan to convert into a result.
  • Observers: Observer agent reviews, if the created agent roles meet the requirements. Plan observer agent reviews, if the plan meets expectations. Action observer reviews, if the action response meets expectations.
  • Includes drafting stage (with agent observer and plan observer agents) and Execution stage (with action observer).

Motif: Intrinsic Motivation from Artificial Intelligence Feedback

  • Motif: Trains a reward function/model from pairs of gameplay captions and LLM observations of these game actions. Then trains an agent using RL with the reward model.
  • Diverse behaviours triggered with the LLM improve performance in a specific domain: for example Gold Collector collects more gold.

28th of September 2023

Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

  • Promptbreeder uses thinking styles and mutation-prompts and is able to improve mutation/task prompts.

24th of September 2023

Let's reward step by step: Step-Level reward model as the Navigators for Reasoning

  • Heuristic Greedy Search for Process-Supervised Reward Model (HGS-PRM): each new reasoning step generated by the LLM is evaluated by the reward model, which decides whether to accept the reasoning step or generate a new one, until the reasoning path is identified.
  • Creates the PRM-Code dataset with Code-LLaMA-7B using a mutation testing technique.
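
The accept-or-regenerate loop can be sketched as below. This is a simplified illustration under stated assumptions, not the paper's algorithm in full: `propose_step`, `score_step` and `is_complete` are hypothetical callables standing in for the LLM, the process reward model and a stopping check, and the `threshold`, `max_steps` and `max_retries` values are arbitrary.

```python
from typing import Callable, List


def greedy_prm_search(
    question: str,
    propose_step: Callable[[str, List[str]], str],       # hypothetical LLM call
    score_step: Callable[[str, List[str], str], float],  # hypothetical reward model
    is_complete: Callable[[List[str]], bool],            # hypothetical stop check
    threshold: float = 0.5,
    max_steps: int = 10,
    max_retries: int = 5,
) -> List[str]:
    """Build a reasoning path one step at a time, keeping only accepted steps."""
    path: List[str] = []
    for _ in range(max_steps):
        accepted = None
        for _ in range(max_retries):
            candidate = propose_step(question, path)
            # The reward model decides: accept this step or ask for a new one.
            if score_step(question, path, candidate) >= threshold:
                accepted = candidate
                break
        if accepted is None:
            break  # no acceptable step found within the retry budget
        path.append(accepted)
        if is_complete(path):
            break
    return path
```

Because the search is greedy, an accepted step is never revisited; the retry budget bounds how much sampling is spent on any single step.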

23rd of September 2023

Natural Language based Context Modeling and Reasoning with LLMs: A Tutorial

  • LLM-driven Context-aware Computing (LCaC) approach.

20th of September 2023

You only look at the screens: Multimodal Chain-of-Action Agents

  • Multimodal Chain-of-Actions Agents (Auto-UI) interacts directly with the UI
  • Chain-of-Action technique using series of action histories and future action plans.

18th of September 2023

MindAgent: Emergent Gaming Interaction

  • MindAgent: Planning skills and Tools use (Agent location, Tool state, Agent holdings, Pending dishes, Timer), LLM dispatcher, Memory history (Environment, Agent State, Actions and Feedback) and Action module (Controller, Human actions, Action validator, Action Types/Patterns/Names).
  • Introduces CuisineWorld-benchmark, where multiple agents play game simultaneously through multi-agent collaboration.

14th of September 2023

The Rise and Potential of Large Language Model Based Agents: A Survey

  • A conceptual framework for LLM-based agents with three components: brain, perception, and action.

Agents: An Open-source Framework for Autonomous Language Agents

  • Multi-agent: Planning, memory, tool usage, multi-agent communication & symbolic control.
  • Open source library.

13th of September 2023

Physically Grounded Vision-Language Models for Robotic Manipulation

  • PhysObjects dataset for physical grounding.
  • VLMs with PhysObjects improves its understanding on physical objects.
  • Improves task success rate.

12th of September 2023

Life-inspired Interoceptive Artificial Intelligence for Autonomous and Adaptive Agents

  • Interoceptive AI: monitoring own internal state of the artificial agent.

Textbooks Are All You Need

  • Sebastien Bubeck explains the insights from the research on Phi-1 regarding coding tasks and Phi-1.5 regarding reasoning tasks, and how the models are able to outperform 1000 times larger LLMs.
  • The talk highlights that the key ingredients are textbook-like training data followed by exercises.
  • Explains that the key ingredient of the "Textbooks are all you need"-paper, regarding the data, is largely based on the TinyStories-paper, whose dataset was used to train a high-performing model to generate fluent and consistent stories in English.

8th of September 2023

Unleashing the Power of Graph Learning through LLM-based Autonomous Agents

  • AutoGraph procedure: data, configuration, searching and tuning agents.

28th of August 2023

RecMind: Large Language Model Powered Agent For Recommendation

  • RecMind: a recommender-focused LLM agent with reasoning, planning to sub-tasks, memory & tools.

22nd of August 2023

A Survey on Large Language Model based Autonomous Agents

  • Systematic review of LLM based Autonomous Agents.
  • Use cases and evaluation strategies and future use cases.

21st of August 2023

AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors

  • AgentVerse: multi-agent collaboration and social behaviours of individual agents.

18th of August 2023

Graph of Thoughts: Solving Elaborate Problems with Large Language Models

  • Graph-of-Thoughts (GoT): Reasoning with LLM using graph-structure with intermediate steps.
  • Introduces Volume-of-Thought metric to inform the scope of information carried by the LLM output.

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

  • AutoGen: An open source framework, where LLM agents converse with one or many other LLM agents, chat with humans and use tools.
  • LLM agents are able to create new chats with other LLM agents.

WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct

  • Improves math reasoning with Reinforcement Learning from Evol-Instruct Feedback (RLEIF): Upward and Downward evolution improve instructions by making questions easier or harder based on their difficulty level.

17th of August 2023

Reinforced Self-Training (ReST) for Language Modeling

  • Introduces Reinforced Self-Training (ReST).
  • Grow step generates data from LLM, Improve step uses this filtered data to fine-tune the LLM. Repeat.
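
The Grow and Improve steps can be sketched as a nested loop. This is a schematic under stated assumptions rather than the ReST implementation: `sample`, `reward` and `fine_tune` are hypothetical callables, the policy is treated as an opaque object, and the rising `thresholds` schedule is an illustrative choice.

```python
from typing import Any, Callable, List, Sequence, Tuple


def rest(
    policy: Any,
    prompts: Sequence[str],
    sample: Callable[[Any, str, int], List[Any]],             # hypothetical: draw outputs
    reward: Callable[[str, Any], float],                      # hypothetical: score an output
    fine_tune: Callable[[Any, List[Tuple[str, Any]]], Any],   # hypothetical: train on data
    grow_steps: int = 2,
    thresholds: Sequence[float] = (0.5, 0.7),
    samples_per_prompt: int = 4,
) -> Any:
    for _ in range(grow_steps):
        # Grow: generate a fresh dataset from the current policy and score it.
        dataset = [
            (p, out, reward(p, out))
            for p in prompts
            for out in sample(policy, p, samples_per_prompt)
        ]
        # Improve: fine-tune on the filtered data, tightening the filter each time.
        for tau in thresholds:
            kept = [(p, out) for p, out, r in dataset if r >= tau]
            if kept:
                policy = fine_tune(policy, kept)
    return policy
```

Sampling happens once per Grow step while several Improve steps reuse the same data, which keeps the expensive generation step infrequent.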

Never-ending Learning of User Interfaces

  • Never-ending UI Learner: automatically installs apps from an appstore and crawls them to learn difficult training examples

3rd of August 2023

Scaling Relationship on Learning Mathematical Reasoning with Large Language Models

  • Proposes Rejection sampling Fine-Tuning (RFT), which generates reasoning paths and collects the correct ones as an augmented fine-tuning dataset.
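
A minimal sketch of the data collection step, assuming each problem comes with a gold answer: `sample_solutions` and `extract_answer` are hypothetical callables standing in for the LLM and an answer parser, and removing exact duplicates is an illustrative choice.

```python
from typing import Callable, List, Sequence, Tuple


def rejection_sample(
    problems: Sequence[Tuple[str, str]],                # (question, gold answer) pairs
    sample_solutions: Callable[[str, int], List[str]],  # hypothetical LLM sampler
    extract_answer: Callable[[str], str],               # hypothetical answer parser
    k: int = 4,
) -> List[Tuple[str, str]]:
    """Keep sampled reasoning paths whose final answer matches the gold answer."""
    dataset: List[Tuple[str, str]] = []
    for question, gold in problems:
        seen = set()
        for solution in sample_solutions(question, k):
            if extract_answer(solution) == gold and solution not in seen:
                seen.add(solution)  # drop exact duplicates of the same path
                dataset.append((question, solution))
    return dataset
```

The returned pairs would then be added to the original training set before fine-tuning.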

25th of July 2023

WebArena: A Realistic Web Environment for Building Autonomous Agents

  • An environment to test autonomous agents with tools and external knowledge.

20th of July 2023

Textbooks Are All You Need

  • Addresses LLM training data to be "text-book-like": clear, self-contained, instructive, and balanced. The method is used in Phi-models.

BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs

  • BuboGPT: Uses the Vicuna LLM, which receives text input together with visual and audio inputs, each inserted separately through a Q-former. The Vicuna output is then processed using the SAM-model for visual grounding.
  • Achieves coherent and grounded descriptions

16th of July 2023

Communicative Agents for Software Development

  • ChatDev: Define task and automatically generate SW designing, coding, testing, and documentation using "Chat Chains", where LLM-based chats include different roles for each sub-task: CEO, programmer, CTO etc.
  • Includes role-assignment, memory and self-reflection.

xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein

  • Protein Language Model: xTrimoPGLM.

14th of July 2023

Large Language Models Understand and Can be Enhanced by Emotional Stimuli

  • EmotionPrompt: adds an emotional stimulus to the prompt, which improves performance by 10.9%.
  • An example of an emotional stimulus is to state that the work is important for career.

23rd of June 2023

LLM Powered Autonomous Agents

  • Lilian Weng from OpenAI article / blog post
  • Covers Planning, Memory and Tool usage of LLM-powered agents.

8th June 2023

ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases

  • Builds a multi-agent simulation environment to generate a dataset of using many real-world APIs.
  • Small models can achieve comparable performance to larger models on tool usage.

6th of June 2023

Enabling Intelligent Interactions between an Agent and an LLM: A Reinforcement Learning Approach

  • When2Ask: RL agent, which learns when to query LLM for high-level plans to complete a task.
  • Planner, Actor and Mediator.

5th June 2023

SELFEVOLVE: A Code Evolution Framework via Large Language Models

  • Generates intermediate code based on input prompt.
  • Use LLM to act as expert programmer to debug the generated code by receiving errors from Python interpreter.

3rd June 2023

Prompt Sapper: LLM-Empowered Software Engineering Infrastructure for AI-Native Services

  • Human-AI collaborative intelligence methodology & technical practices, where the idea is not to have "full Auto-GPT" from user input to direct resolution by LLM, but rather to have a human review the steps in between.
  • User inputs an objective and the LLM asks for clarification. The user then adds clarifications and the LLM constructs an AI chain for the human to review. Finally the LLM executes the AI chain with user acceptance tests.

3rd June 2023

Auto-GPT for Online Decision Making: Benchmarks and Additional Opinions

  • Auto-GPT outperforms supervised state-of-the-art Imitation Learning (IL) models with GPT-4 in WebShop- and ALFWorld-benchmarks in unknown external environments.
  • The Additional Opinions algorithm, which takes into account additional opinions from external expert models, improves performance.

2nd of June 2023

  • MathChat: Describes solid conversational MATH problem solving as a four-step process.
  • Describes the prompts used.

26th of May 2023

Beyond Chain-of-Thought, Effective Graph-of-Thought Reasoning in Large Language Models

  • Graph-of-Thought (GoT) reasoning: To model human thought process as graph instead of chain to improve LLM reasoning capability.

Impossible Distillation: from Low-Quality Model to High-Quality Dataset & Model for Summarization and Paraphrasing

  • Uses low-quality LM to generate High-quality dataset (more diverse and more effective for generalization in unseen domains) to train a high quality model: 770 million parameter model outperforms GPT-3 in multiple tasks evaluated by humans.

25th of May 2023

Voyager: An Open-Ended Embodied Agent with Large Language Models

  • Voyager: open-ended embodied agent with LLM

24th May 2023

Reasoning with Language Model is Planning with World Model

  • RAP (Reasoning via Planning): Uses LLM as both world model and reasoning LLM-agent. Integrates MCTS search planning algorithm.
  • Incrementally generates reasoning tree with LLM in domains of plan generation, math reasoning and logical inference.

Gorilla: Large Language Model Connected with Massive APIs

  • Gorilla is a retrieval-aware finetuned LLaMA-7B model for API calls using self-instruct to generate Instruction-API pairs.

18th of May 2023

Think Outside the Code: Brainstorming Boosts Large Language Models in Code Generation

  • Brainstorm: uses brainstorming step to generate and select diverse thoughts in code generation.
  • Uses three steps: brainstorming, thought selection (trains a thought ranker for this) and writing code.

17th May 2023

Tree of Thoughts: Deliberate Problem Solving with Large Language Models

  • Tree of Thoughts (ToT)-technique makes decisions using multiple different reasoning paths, self-evaluating choices to decide next action with ability to look back/forward for global decisions.
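
One way to picture the search over reasoning paths is a breadth-first variant with a fixed beam, sketched below. This is an illustrative simplification: `propose` and `evaluate` are hypothetical callables standing in for LLM thought generation and LLM self-evaluation, and this variant only looks ahead (it does not backtrack).

```python
from typing import Any, Callable, List


def tree_of_thoughts_bfs(
    root: Any,
    propose: Callable[[Any], List[Any]],  # hypothetical: next thoughts for a state
    evaluate: Callable[[Any], float],     # hypothetical: self-evaluation score
    depth: int = 3,
    beam_width: int = 2,
) -> Any:
    """Keep the `beam_width` best partial solutions at every level of the tree."""
    frontier = [root]
    for _ in range(depth):
        candidates = [child for state in frontier for child in propose(state)]
        if not candidates:
            break
        # Self-evaluate all candidate thoughts and keep the most promising ones.
        candidates.sort(key=evaluate, reverse=True)
        frontier = candidates[:beam_width]
    return max(frontier, key=evaluate)
```

Keeping several candidates alive per level is what distinguishes this from a single chain of thought, which commits to one path from the start.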

Mobile-Env: Building Qualified Evaluation Benchmarks for LLM-GUI Interaction


13th of May 2023

BabyCatAGI: Fast and Feline

  • BabyCatAGI: a modified BabyAGI, which replaces the task manager of BabyBeeAGI with a task creation agent that runs once.
  • Uses Intelligent Agent Tool to combine tools and extract only the information relevant to the next step, such as looping web search and scraping the results to pass only a specific part to another task.

12th of May 2023

TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

  • A breakthrough paper, where synthetic data generated by Teacher-Student LLM is used to train a high-performing model to generate fluent and consistent English stories.
  • Demonstrated the effectiveness of synthetic data in smaller LLMs challenging large SOTA models in domain of English language.
  • Uses GPT-4 to grade content generated by the models as if created by student and being graded by the GPT-4 teacher.

9th of May 2023

ImageBind: One Embedding Space To Bind Them All

  • ImageBind: a joint embedding space for images, text, audio, depth, thermal and IMU data modalities.

3rd of May 2023

Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings

  • Introduces Visual Chain of Thought (VCoT) for data augmentation, where between reasoning steps multimodal data is infilled to obtain better reasoning results.

30th of April 2023

BabyBeeAGI: Task Management and Functionality Expansion on top of BabyAGI

  • BabyBeeAGI: a modification of BabyAGI that tracks statuses of tasks, task dependencies, identification of required new tasks, assigning tools and results in json-format.

26th of April 2023

"Inside OpenAI Entire Talk" by Stanford eCorner

  • Interview of Ilya Sutskever, where he defines a way to perform "a consciousness test" using a very controlled dataset, see "minute 15".

21st of April 2023

Improving Grounded Language Understanding in a Collaborative Environment by Interacting with Agents Through Help Feedback

  • LLM agent self-help with LLM to complete IGLU tasks using clarifying questions.

13th of April 2023

RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment

  • RAFT-finetuning: Samples a batch of data from the LLM, a reward function scores the samples, and high-reward examples are kept as data to finetune the LLM.

11th of April 2023

ChemCrow: Augmenting large-language models with chemistry tools

  • Uses LLM and chemistry tools to plan and execute different chemical tasks.
  • Tools include web and literature search, Python, human-tool to interact with the end user and various molecule tools, safety tools and chemical reaction tools.

Teaching Large Language Models to Self-Debug

  • The model generates new code together with code explanation. The code is then executed and this executed code is sent back as feedback together with the code explanation. This feedback

7th of April 2023

ChatPipe: Orchestrating Data Preparation Program by Optimizing Human-ChatGPT Interactions

  • ChatPipe: iterative data preparation program with ChatGPT using 1. Operation recommendation, 2. Program generation, 3. Version management.
  • Recommends the next data preparation operation. Makes it easy to roll back to a previous program for version control.

6th of April 2023
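
The version management and roll-back idea above can be sketched as a simple program history. This is a minimal sketch, assuming programs are stored as plain strings; the class `ProgramHistory` and its methods are hypothetical and do not reflect ChatPipe's actual implementation.

```python
from typing import List


class ProgramHistory:
    """Keeps every generated version of a data preparation program."""

    def __init__(self) -> None:
        self._versions: List[str] = []

    def commit(self, program: str) -> int:
        """Store a new program version and return its version number."""
        self._versions.append(program)
        return len(self._versions) - 1

    def current(self) -> str:
        """Return the latest program version."""
        return self._versions[-1]

    def rollback(self, version: int) -> str:
        """Discard all versions after `version` and return that version."""
        if not 0 <= version < len(self._versions):
            raise IndexError(f"unknown version {version}")
        self._versions = self._versions[: version + 1]
        return self.current()


history = ProgramHistory()
v0 = history.commit("df = df.dropna()")
v1 = history.commit("df = df.dropna()\ndf = df.drop_duplicates()")
```

Calling `history.rollback(v0)` would restore the first program if the recommended second operation turned out to hurt the result.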

Generative Agents: Interactive Simulacra of Human Behavior

  • Enables believable human behavior through observation, planning, and reflection.
  • An agent wants to throw a Valentine’s Day party. The agents autonomously spread invitations, make new acquaintances, ask each other out on dates to the party, and coordinate to show up for the party together at the right time.
  • GPTeam is inspired by this approach.

31st of March 2023

CAMEL: Communicative Agents for "Mind" Exploration of Large Scale Language Model Society

  • CAMEL attempts to facilitate autonomous cooperation among communicative agents through a role-playing framework.
  • The approach manages to complete tasks with minimal human input.

30th of March 2023

Self-Refine: Iterative Refinement with Self-Feedback

  • Self-Refine refers to iterative refinement with self-feedback: the LLM gives feedback on its original output, and this feedback is passed back to the LLM to refine a new output.
  • The concept is best understood from the blog post "Self-Refine: Iterative Refinement with Self-Feedback", which includes GIFs and code examples.
  • Improves base-model performance in tasks like math reasoning and code generation.
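
The feedback and refine loop above can be sketched as follows. This is a minimal sketch under stated assumptions: `generate`, `feedback`, `refine` and `is_good_enough` are hypothetical callables that would all be prompts to the same LLM, and `max_iters` is an assumed stopping limit. The toy functions only make the example runnable without a model.

```python
from typing import Callable


def self_refine(
    task: str,
    generate: Callable[[str], str],
    feedback: Callable[[str, str], str],
    refine: Callable[[str, str, str], str],
    is_good_enough: Callable[[str], bool],
    max_iters: int = 3,
) -> str:
    """Generate an output, then alternate feedback and refinement."""
    output = generate(task)
    for _ in range(max_iters):
        # Ask the model to critique its own output.
        critique = feedback(task, output)
        if is_good_enough(critique):
            break
        # Pass the critique back to the model to produce a refined output.
        output = refine(task, output, critique)
    return output


# Toy stand-ins (assumptions for illustration only).
def toy_generate(task: str) -> str:
    return task


def toy_feedback(task: str, output: str) -> str:
    return "ok" if output.endswith(".") else "add a final period"


def toy_refine(task: str, output: str, critique: str) -> str:
    return output + "."


def toy_is_good_enough(critique: str) -> bool:
    return critique == "ok"
```

The loop stops either when the feedback signals no further issues or when the iteration limit is reached, so it always terminates.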

HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace

  • An LLM (such as ChatGPT) accesses the HuggingFace community to look for AI models to complete the given task.
  • It can handle multiple modalities by outsourcing tasks like image recognition to a specific image model.

DERA: Enhancing Large Language Model Completions with Dialog-Enabled Resolving Agents

  • Dialog-Enabled Resolving Agents (DERA) uses two roles, Researcher and Decider, and runs a discussion between these two agents.
  • The Researcher role processes information and the Decider role uses judgement.

29th of March 2023

TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs

  • Multimodal conversational foundation model (MCFM). The MCFM generates a textual solution outline, then an API selector chooses the most relevant API from a collection of APIs (each with an API name, parameter list, description, usage example and an example of combining it with another API).
  • The MCFM generates action code using the recommended API and the API call is executed. Finally, the output is provided back to the developer.

28th of March 2023
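
The API selection step above can be sketched as follows. This is a minimal sketch, not the TaskMatrix.AI implementation: the `ApiDoc` fields, the two example APIs and the keyword-overlap scoring in `select_api` are assumptions swapped in for illustration, since the summary does not state how relevance is computed.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ApiDoc:
    """Hypothetical API entry: name, parameter list, description, callable."""
    name: str
    parameters: List[str]
    description: str
    func: Callable[..., str]


def select_api(outline: str, apis: List[ApiDoc]) -> ApiDoc:
    """Pick the API whose description shares the most words with the outline."""
    outline_words = set(outline.lower().split())

    def overlap(api: ApiDoc) -> int:
        return len(outline_words & set(api.description.lower().split()))

    return max(apis, key=overlap)


# Toy API collection (assumptions for illustration only).
APIS = [
    ApiDoc("resize_image", ["path", "width"],
           "resize an image to a given width",
           lambda path, width: f"{path} resized to {width}px"),
    ApiDoc("translate_text", ["text", "language"],
           "translate text into another language",
           lambda text, language: f"[{language}] {text}"),
]

# The solution outline would come from the foundation model.
chosen = select_api("translate the text into french", APIS)
output = chosen.func("hello", "fr")  # stands in for the generated action code
```

A real selector would more likely rank APIs with embeddings or with the model itself; word overlap is used here only to keep the example self-contained.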

Task-driven Autonomous Agent Utilizing GPT-4, Pinecone, and LangChain for Diverse Applications

  • Task-driven autonomous agent with a vector database and LangChain. BabyAGI includes execution, task creation and prioritization.
  • Takes an objective, pulls an item from the task queue and moves it to the execution agent with access to memory.
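
The execution, creation and prioritization loop above can be sketched as follows. This is a minimal sketch under stated assumptions: `execute`, `create_tasks` and `prioritize` are hypothetical stand-ins for the three LLM-backed agents, a plain Python list stands in for the vector database memory, and `max_steps` is an assumed safety limit.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple

Memory = List[Tuple[str, str]]  # (task, result) pairs


def run_agent(
    objective: str,
    first_task: str,
    execute: Callable[[str, str, Memory], str],
    create_tasks: Callable[[str, str, str, List[str]], List[str]],
    prioritize: Callable[[str, List[str]], List[str]],
    max_steps: int = 5,
) -> Memory:
    """Run the task loop until the queue is empty or max_steps is reached."""
    tasks: Deque[str] = deque([first_task])
    memory: Memory = []
    for _ in range(max_steps):
        if not tasks:
            break
        # Pull the next item from the task queue and execute it with memory access.
        task = tasks.popleft()
        result = execute(objective, task, memory)
        memory.append((task, result))
        # Create follow-up tasks from the result, then reprioritize the queue.
        tasks.extend(create_tasks(objective, task, result, list(tasks)))
        tasks = deque(prioritize(objective, list(tasks)))
    return memory


# Toy stand-ins (assumptions for illustration only).
def toy_execute(objective: str, task: str, memory: Memory) -> str:
    return f"done: {task}"


def toy_create(objective: str, task: str, result: str, pending: List[str]) -> List[str]:
    return ["write summary"] if task == "research topic" else []


def toy_prioritize(objective: str, pending: List[str]) -> List[str]:
    return sorted(pending)
```

The step limit matters in practice, because a task creation agent that keeps producing new tasks would otherwise never let the queue drain.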

Sparks of Artificial General Intelligence: Early experiments with GPT-4

  • Argues that GPT-4's capabilities should be viewed as an early and incomplete version of an Artificial General Intelligence (AGI) system, based on multiple metrics comparing it against human-level performance.
  • Argues that LLMs need to move beyond "next-word prediction".