
RLAF: Reinforcement Learning from Agentic Feedback

Python 3.11+ | License: MIT

A unified framework for training AI agents using multi-perspective critic ensembles.

RLAF (Reinforcement Learning from Agentic Feedback) combines innovations from the latest research in agentic reinforcement learning:

  • ARPO (July 2025): Adaptive rollout based on entropy
  • Open-AgentRL (Oct 2025): GRPO-TCR with tool-call reasoning
  • KAT-Dev (Sept 2025): Multi-stage training pipeline

🎯 Why RLAF?

Traditional RL collapses feedback into a single scalar reward. RLAF instead collects structured feedback from a multi-perspective critic ensemble:

# Traditional RL: Single reward
reward = 0.75  # Good? Bad? Why?

# RLAF: Multi-critic feedback
feedbacks = [
    Feedback(critic="accuracy", score=0.9, reasoning="Factually correct"),
    Feedback(critic="policy", score=0.6, reasoning="SLA violation risk"),
    Feedback(critic="efficiency", score=0.8, reasoning="Could be faster"),
]
# Aggregated reward: 0.77 (with rich context!)
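
The aggregated 0.77 above works out to the equal-weight mean of the three critic scores (the default confidence-weighted average reduces to this when confidences are equal). The sketch below reproduces that arithmetic with a stand-in Feedback dataclass, illustrative only and not RLAF's actual class:

from dataclasses import dataclass

# Stand-in for illustration; RLAF ships its own Feedback type.
@dataclass
class Feedback:
    critic: str
    score: float
    reasoning: str

feedbacks = [
    Feedback("accuracy", 0.9, "Factually correct"),
    Feedback("policy", 0.6, "SLA violation risk"),
    Feedback("efficiency", 0.8, "Could be faster"),
]

# Equal-weight mean: (0.9 + 0.6 + 0.8) / 3 ≈ 0.77
reward = sum(f.score for f in feedbacks) / len(feedbacks)
print(f"{reward:.2f}")  # 0.77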

Key Benefits:

  • 🎭 Multi-perspective evaluation - Accuracy, reasoning, tool use, code quality, policy compliance
  • 🔄 Algorithm-agnostic - Supports ARPO, GRPO-TCR, PPO, DPO
  • 🏭 Production-ready - Not just research, built for real applications
  • 🌐 Cross-domain - ITSM, code generation, reasoning tasks, chatbots

📚 Documentation

🚀 Quick Start

Installation

pip install rlaf

Or install from source:

git clone https://github.com/cogniolab/cognio-rlaf.git
cd cognio-rlaf
pip install -e .

Minimal Example (30 seconds)

import asyncio
from rlaf import RLAFTrainer
from rlaf.agents import ActorAgent, CriticAgent, CriticEnsemble
from rlaf.core.trainer import TrainingConfig

async def main():
    # 1. Create actor (agent to train)
    actor = ActorAgent(
        name="my-agent",
        model="claude-3-5-sonnet-20241022",
        api_key="your-api-key"
    )

    # 2. Create multi-critic ensemble
    critics = CriticEnsemble([
        CriticAgent("accuracy-critic", "accuracy", api_key="your-api-key"),
        CriticAgent("reasoning-critic", "reasoning", api_key="your-api-key"),
    ])

    # 3. Configure training
    config = TrainingConfig(algorithm="arpo", max_iterations=10)

    # 4. Train!
    trainer = RLAFTrainer(actor=actor, critics=critics, config=config)
    results = await trainer.train(your_dataset)  # your_dataset: your own iterable of training tasks

asyncio.run(main())

📚 Examples

ITSM Agent Training

Train an IT service management agent to triage incidents:

python examples/itsm_agent.py

Features:

  • Actor: ITSM triage agent
  • Critics: Accuracy, policy compliance, speed
  • Algorithm: ARPO (adaptive exploration)

Code Generation Agent

Train a Python code generation agent:

python examples/code_generation.py

Features:

  • Actor: Code generator
  • Critics: Correctness, code quality, efficiency
  • Algorithm: GRPO-TCR (tool-call reasoning)

Simple Demo

See examples/simple_demo.py for a minimal working example.

🏗️ Architecture

Core Components

rlaf/
├── agents/          # Actor and Critic agents
│   ├── actor.py     # Agent being trained
│   └── critic.py    # Evaluation agents
├── algorithms/      # RL algorithms
│   ├── arpo.py      # Adaptive RPO (entropy-based)
│   ├── grpo_tcr.py  # Tool-call reasoning (Open-AgentRL)
│   ├── ppo.py       # Proximal Policy Optimization
│   └── dpo.py       # Direct Preference Optimization
├── feedback/        # Feedback collection
│   └── collector.py # Multi-critic aggregation
├── rewards/         # Reward computation
│   └── aggregator.py # Feedback → RL rewards
└── core/
    ├── base.py      # Base classes
    └── trainer.py   # Main trainer

Multi-Critic Feedback Flow

Input Task
    ↓
[Actor] generates response
    ↓
[Critics] evaluate from multiple perspectives
    ├─ Accuracy Critic → score: 0.9
    ├─ Reasoning Critic → score: 0.8
    ├─ Tool Use Critic → score: 0.7
    └─ Policy Critic → score: 0.85
    ↓
[Feedback Collector] aggregates (weighted avg, voting, debate)
    ↓
[Reward Aggregator] converts to RL reward (with bonuses/penalties)
    ↓
[Algorithm] updates policy (ARPO/GRPO-TCR/PPO/DPO)
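
Conceptually, one training step walks this pipeline top to bottom. The sketch below is pseudocode for that loop; the method names (generate, evaluate, aggregate, to_reward, update) are illustrative assumptions, not the exact RLAF API:

async def rlaf_step(task, actor, critics, collector, aggregator, algorithm):
    response = await actor.generate(task)               # [Actor] generates response
    feedbacks = await critics.evaluate(task, response)  # [Critics] multi-perspective evaluation
    combined = collector.aggregate(feedbacks)           # [Feedback Collector] weighted avg / voting / debate
    reward = aggregator.to_reward(combined)             # [Reward Aggregator] bonuses / penalties applied
    algorithm.update(task, response, reward)            # [Algorithm] ARPO / GRPO-TCR / PPO / DPO step
    return reward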

🔬 Algorithms

ARPO: Adaptive Reinforcement Policy Optimization

From the July 2025 paper (arXiv:2507.19849)

Key innovation: Entropy-based adaptive rollout

  • High uncertainty → more exploration
  • Low confidence → increase batch size
  • Adaptive learning rate scaling

config = TrainingConfig(
    algorithm="arpo",
    entropy_threshold=0.8,
    adaptive_rollout=True
)
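
The adaptive-rollout idea can be sketched in a few lines: gate the rollout budget on the entropy of the policy's current token distribution. This is an illustrative approximation, not ARPO's exact update rule:

import math

def adaptive_rollout_size(token_probs, base_rollouts=4, max_rollouts=16,
                          entropy_threshold=0.8):
    # Normalized Shannon entropy: near 1.0 = uncertain policy, near 0.0 = confident.
    entropy = -sum(p * math.log(p + 1e-12) for p in token_probs)
    normalized = entropy / math.log(len(token_probs))
    # High uncertainty -> branch into more rollouts (more exploration).
    return max_rollouts if normalized > entropy_threshold else base_rollouts

print(adaptive_rollout_size([0.26, 0.25, 0.25, 0.24]))  # uncertain -> 16
print(adaptive_rollout_size([0.94, 0.02, 0.02, 0.02]))  # confident -> 4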

GRPO-TCR: Tool-Call Reasoning

From Open-AgentRL (Oct 13, 2025)

Key innovation: Deliberative reasoning before tool calls

  • 4B model outperforms 32B models
  • Selective tool use (avoid over-calling)
  • SOTA on AIME, GPQA, LiveCodeBench

config = TrainingConfig(
    algorithm="grpo-tcr",
    tool_call_reasoning=True,
    deliberative_mode=True
)
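
The pattern GRPO-TCR rewards is "deliberate first, then call tools selectively." The sketch below illustrates that control flow; the prompts and the tool interface (name, run) are assumptions, not Open-AgentRL's implementation:

async def answer_with_tcr(actor, task, tools):
    # 1. Deliberate first: reason explicitly about whether any tool is needed.
    plan = await actor.generate(
        f"Task: {task}\nWhich of these tools do you actually need, and why? "
        f"Available: {[t.name for t in tools]} (answer 'none' if none)"
    )
    # 2. Selective tool use: only call tools the deliberation justified.
    observations = [await t.run(task) for t in tools if t.name in plan]
    # 3. Final answer conditioned on the deliberation and any tool output.
    return await actor.generate(
        f"Task: {task}\nPlan: {plan}\nObservations: {observations}\nFinal answer:"
    )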

KAT-Style: Multi-Stage Training

From KAT-Dev (Sept 2025)

3-stage pipeline:

  1. Mid-training: Enhance LLM-as-agent capabilities
  2. RFT: Reinforcement fine-tuning with teacher trajectories
  3. Agentic RL: Full RL with critic ensemble

config = TrainingConfig(
    algorithm="kat",
    multi_stage=True,
    stages=["mid_train", "rft", "agentic_rl"]
)
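
One way to drive the three stages is to run the trainer once per stage with stage-specific data. This is a hypothetical sketch; whether TrainingConfig accepts a single-stage list like this is an assumption:

from rlaf import RLAFTrainer
from rlaf.core.trainer import TrainingConfig

async def kat_style_pipeline(actor, critics, stage_datasets):
    # Run the three stages back to back, reusing the same actor and critics.
    for stage in ["mid_train", "rft", "agentic_rl"]:
        config = TrainingConfig(algorithm="kat", multi_stage=True, stages=[stage])
        trainer = RLAFTrainer(actor=actor, critics=critics, config=config)
        await trainer.train(stage_datasets[stage])  # each stage supplies its own data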

🎨 Critic Perspectives

RLAF supports multiple critic perspectives:

Perspective     Evaluates               Example Use Case
-----------     ---------               ----------------
accuracy        Factual correctness     Q&A, reasoning
reasoning       Logical soundness       Math, planning
tool_use        Tool efficiency         Agent workflows
code_quality    Code quality            Code generation
policy          SLA/rule compliance     ITSM, enterprise
speed           Response efficiency     Real-time systems
safety          Security/ethics         Production deployment

Create custom perspectives:

custom_critic = CriticAgent(
    name="domain-expert",
    perspective="medical_accuracy",  # Custom perspective
    model="claude-3-5-sonnet-20241022",
    api_key="your-key"
)
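
A custom critic slots into an ensemble next to the built-in perspectives, mirroring the Quick Start (the snippet reuses custom_critic from above):

from rlaf.agents import CriticAgent, CriticEnsemble

critics = CriticEnsemble([
    CriticAgent("accuracy-critic", "accuracy", api_key="your-key"),
    custom_critic,  # the medical_accuracy critic defined above
])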

📊 Reward Aggregation Strategies

RLAF offers multiple ways to aggregate multi-critic feedback:

1. Weighted Average (default)

# Confidence-weighted average
config.reward_aggregation = "weighted_average"

2. Voting

# Majority vote on quality threshold
config.reward_aggregation = "voting"

3. Debate

# Highest-confidence critic wins
config.reward_aggregation = "debate"

4. Consensus

# Accept only high-agreement feedback
config.reward_aggregation = "consensus"
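
To make the differences concrete, the sketch below contrasts the four strategies on the same feedback list. The thresholds, the confidence attribute, and the exact formulas are illustrative assumptions, not the behavior of rlaf/rewards/aggregator.py:

def aggregate(feedbacks, strategy="weighted_average", threshold=0.7):
    scores = [f.score for f in feedbacks]
    confidences = [getattr(f, "confidence", 1.0) for f in feedbacks]
    if strategy == "weighted_average":   # confidence-weighted mean
        return sum(s * c for s, c in zip(scores, confidences)) / sum(confidences)
    if strategy == "voting":             # share of critics scoring above the bar
        return sum(s >= threshold for s in scores) / len(scores)
    if strategy == "debate":             # highest-confidence critic wins
        return max(zip(confidences, scores), key=lambda cs: cs[0])[1]
    if strategy == "consensus":          # accept only high-agreement feedback
        mean = sum(scores) / len(scores)
        return mean if max(scores) - min(scores) <= 0.2 else 0.0
    raise ValueError(f"unknown strategy: {strategy}")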

🛠️ Configuration

TrainingConfig

from rlaf.core.trainer import TrainingConfig

config = TrainingConfig(
    # Algorithm
    algorithm="arpo",  # arpo, grpo-tcr, kat, ppo, dpo

    # Training
    max_iterations=1000,
    batch_size=32,
    learning_rate=3e-4,

    # ARPO-specific
    entropy_threshold=0.8,
    adaptive_rollout=True,

    # GRPO-TCR-specific
    tool_call_reasoning=True,
    deliberative_mode=True,

    # Rewards
    reward_aggregation="weighted_average",

    # Logging
    checkpoint_every=100,
    eval_every=50,
)

BaseConfig

from rlaf.core.base import BaseConfig

config = BaseConfig(
    model_name="claude-3-5-sonnet-20241022",
    temperature=0.7,
    max_tokens=2048,
    num_critics=3,
)

🧪 Testing

Run the test suite:

pytest tests/

Run examples:

# Simple demo
python examples/simple_demo.py

# ITSM agent
export ANTHROPIC_API_KEY="your-key"
python examples/itsm_agent.py

# Code generation
python examples/code_generation.py

📈 Benchmarks

Comprehensive benchmarks comparing RLAF with baseline methods are now available!

Quick Results

Method            ITSM Triage   Code Generation   Reasoning   Avg. Score   Training Time
------            -----------   ---------------   ---------   ----------   -------------
RLAF (ARPO)       87.3%         82.5%             79.8%       83.2%        3.2h
RLAF (GRPO-TCR)   85.1%         84.2%             81.3%       83.5%        4.1h
Open-AgentRL      82.4%         80.1%             82.1%       81.5%        5.3h
PPO               76.2%         74.3%             73.1%       74.5%        6.1h
DPO               74.8%         76.5%             71.9%       74.4%        4.8h

Key Findings:

  • 12.4% improvement over supervised fine-tuning
  • 35% faster training than Open-AgentRL
  • 43% cost savings with intelligent model routing
  • 40% fewer samples needed to reach 80% performance vs PPO

See full benchmarks: benchmarks/README.md

Run Benchmarks Yourself

# Run all benchmarks
python benchmarks/run_all.py

# Generate charts
python benchmarks/visualize.py

🤝 Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

Key areas:

  • New critic perspectives
  • Additional RL algorithms
  • Domain-specific examples
  • Performance optimizations

📝 Citation

If you use RLAF in your research, please cite:

@software{rlaf2025,
  title = {RLAF: Reinforcement Learning from Agentic Feedback},
  author = {Cognio Lab},
  year = {2025},
  url = {https://github.com/cogniolab/cognio-rlaf}
}

🔗 Related Work

RLAF builds on these excellent projects:

  • ARPO (arXiv:2507.19849) - entropy-based adaptive rollouts
  • Open-AgentRL (Gen-Verse) - GRPO-TCR with tool-call reasoning
  • KAT-Dev (Skywork/Kuaishou) - multi-stage agentic training pipeline

📄 License

MIT License - see LICENSE file for details.

🙏 Acknowledgments

  • Anthropic for Claude API
  • OpenAI for RL research foundations
  • Open-AgentRL team at Gen-Verse
  • ARPO authors
  • KAT-Dev team at Skywork/Kuaishou

Built with ❤️ by Cognio Lab

Making AI agents smarter through multi-perspective feedback.
