A production-ready AI agent designed to tackle the GAIA benchmark, built for the Hugging Face AI Agents Course (Unit 4).
It was developed with the goal of building an AI system capable of scoring well on this challenging benchmark.
GAIA (General AI Assistants) is a benchmark that evaluates AI systems on real-world assistant tasks that are:
- Simple for humans (92% success rate) but hard for AI (GPT-4: ~15%)
- Multi-modal: requiring text, image, audio, and video understanding
- Tool-dependent: needing web search, file analysis, and code execution
- Objectively measurable: with unambiguous factual answers
To meet these demands, the agent provides:
- Audio transcription (MP3, WAV, M4A files) using OpenAI Whisper
- Image analysis and description using Vision models
- YouTube video content analysis (subtitles + audio extraction)
- Document processing (Excel, CSV, PDF, text files)
- Web search and content extraction using Tavily
- Real-time information gathering from websites
- YouTube video analysis with transcript extraction
- Source verification and fact-checking
- Python code execution for complex calculations
- Spreadsheet analysis and data manipulation
- Multi-step reasoning chains with state management
- File format detection and appropriate tool selection
- Automatic task difficulty assessment (Level 1-3)
- GAIA API integration for benchmark submission
- Autonomous execution without human guidance
- Proper answer formatting for benchmark evaluation
- Rate limiting and error handling for production use (see the sketch after this list)
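As a rough illustration of that last point, the benchmark loop can pace and guard each question along these lines. This is a minimal sketch, assuming an async per-question handler; the real pacing and retry logic lives in src/react_agent/gaia_runner.py:

```python
import asyncio

async def run_with_rate_limit(tasks, handle_task, delay_s: float = 5.0):
    """Run GAIA tasks sequentially, pausing between API calls."""
    results = []
    for task in tasks:
        try:
            results.append(await handle_task(task))
        except Exception as exc:
            # Record the failure and keep going rather than aborting the run
            results.append(f"ERROR: {exc}")
        await asyncio.sleep(delay_s)  # stay within the GAIA API rate limit
    return results
```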
This agent is built using the LangGraph ReAct Agent Template, which provides a robust foundation for reasoning and action agents. Its core components are:
- StateGraph: Manages agent execution flow between reasoning and action
- ToolNode: Handles tool invocation and response processing
- Configuration: Flexible model and parameter settings
- Message Handling: Structured conversation state management
At runtime the agent cycles through the classic ReAct loop (sketched in code below):
- Reason: Agent analyzes the task and plans next steps
- Act: Agent executes chosen tools to gather information
- Observe: Agent processes tool results and updates understanding
- Repeat: Continue until task is complete with final answer
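In LangGraph terms, this loop maps onto a small state machine. The following is a minimal sketch rather than the project's actual graph.py (which adds configuration and error handling); it assumes the TOOLS list exported by src/react_agent/tools.py and an Anthropic tool-calling model:

```python
from langgraph.graph import StateGraph, MessagesState, START, END
from langgraph.prebuilt import ToolNode
from langchain_anthropic import ChatAnthropic

from react_agent.tools import TOOLS  # assumed export

model = ChatAnthropic(model="claude-3-5-sonnet-20240620").bind_tools(TOOLS)

def reason(state: MessagesState):
    # Reason: the model analyzes the conversation and may request tool calls
    return {"messages": [model.invoke(state["messages"])]}

def route(state: MessagesState):
    # Act if the model requested tools; otherwise the task is complete
    return "tools" if state["messages"][-1].tool_calls else END

builder = StateGraph(MessagesState)
builder.add_node("reason", reason)
builder.add_node("tools", ToolNode(TOOLS))  # Act: execute the requested tools
builder.add_edge(START, "reason")
builder.add_conditional_edges("reason", route)
builder.add_edge("tools", "reason")  # Observe: results feed back into reasoning
graph = builder.compile()
```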
| Category | Tools | Purpose |
|---|---|---|
| Web & Search | search, extract_text_from_url | Information gathering and web browsing |
| Media Analysis | transcribe_audio, analyze_youtube_video, analyze_image | Audio/video/image processing |
| File Processing | analyze_file, read_spreadsheet, download_gaia_file | Document analysis and file handling |
| Data Science | python_repl, analyze_spreadsheet_data | Calculations and data analysis |
| GAIA Integration | fetch_gaia_task, list_gaia_tasks | Benchmark interaction and task management |
Assuming you have already installed LangGraph Studio, set up the project as follows:

```bash
# Clone the repository
git clone <your-repo-url>
cd gaia-benchmark-agent

# Create .env file
cp .env.example .env
```

Add the following to your .env file:
```bash
# Required for search functionality
TAVILY_API_KEY=your-tavily-api-key

# Choose your LLM provider
ANTHROPIC_API_KEY=your-anthropic-key
# OR
OPENAI_API_KEY=your-openai-key

# Optional: LangSmith for tracing
LANGSMITH_API_KEY=your-langsmith-key
LANGSMITH_PROJECT=gaia-agent
```

To run the benchmark:

```bash
# Run first 5 questions for testing
python -m react_agent.run_gaia_benchmark
# Edit the script to change username and max_questions
```

For interactive development:

```bash
# Open in LangGraph Studio
langgraph dev

# Or run individual tasks
python -c "
from react_agent import run_all_gaia_tasks
import asyncio

async def test():
    result = await run_all_gaia_tasks(
        username='your_username',
        max_questions=3
    )
    print(result)

asyncio.run(test())
"
```

The agent supports multiple LLM providers. Configure them in LangGraph Studio or via the environment:
```yaml
# Default configuration
model: anthropic/claude-3-5-sonnet-20240620

# Alternative options:
# model: openai/gpt-4o
# model: openai/gpt-4-turbo
```

Performance targets by GAIA difficulty level:

- Level 1: Basic tasks (5-10 steps, 1-2 tools) - Target: >30%
- Level 2: Intermediate tasks (10-15 steps, multiple tools) - Target: >15%
- Level 3: Complex tasks (15+ steps, advanced reasoning) - Target: >5%
- ✅ Autonomous Operation: No human guidance required during execution
- ✅ Multi-modal Processing: Handles text, audio, images, and video
- ✅ Robust Error Handling: Graceful failure recovery and retries
- ✅ Proper Formatting: GAIA-compliant answer format with "FINAL ANSWER:" (see the extraction sketch after this list)
- ✅ Rate Limiting: API compliance with 5-second delays between questions
- ✅ Tool Orchestration: Intelligent tool selection and chaining
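The "FINAL ANSWER:" convention makes the benchmark answer easy to pull out of the model's full response. A minimal extraction helper might look like this; extract_final_answer is a hypothetical sketch, and the project's actual parsing may differ:

```python
import re

def extract_final_answer(response_text: str) -> str:
    """Pull the GAIA answer out of a 'FINAL ANSWER: ...' response."""
    match = re.search(r"FINAL ANSWER:\s*(.+)", response_text,
                      re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else response_text.strip()

print(extract_final_answer("The trial enrolled 90 people.\nFINAL ANSWER: 90"))
# -> "90"
```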
Example tasks by level:

- Level 1: "What was the enrollment count of the H. pylori clinical trial from Jan-May 2018 on NIH website?"
- Level 2: "Analyze this Excel file and calculate total food sales excluding drinks"
- Level 3: "Find the astronaut from NASA Group X who spent least time in space, excluding those with zero time"
Extend the agent's capabilities by adding tools in src/react_agent/tools.py:
```python
async def my_custom_tool(parameter: str) -> str:
    """Description of what this tool does."""
    result = ...  # Your implementation here
    return result

# Add to TOOLS list
TOOLS.append(my_custom_tool)
```
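As a concrete illustration of that template, a tool like the following could fetch a web page's title; get_page_title is hypothetical and not part of the shipped tool set:

```python
import re

import httpx

async def get_page_title(url: str) -> str:
    """Return the <title> of a web page."""
    async with httpx.AsyncClient(follow_redirects=True) as client:
        resp = await client.get(url)
    match = re.search(r"<title[^>]*>(.*?)</title>", resp.text, re.I | re.S)
    return match.group(1).strip() if match else "No title found"

TOOLS.append(get_page_title)
```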
Update the agent's behavior in src/react_agent/prompts.py:

```python
SYSTEM_PROMPT = """
Your custom instructions here...
Remember to always end with: FINAL ANSWER: [answer]
"""
```

Configure different models in src/react_agent/configuration.py or via LangGraph Studio.
Modify src/react_agent/gaia_runner.py to:
- Change submission parameters
- Add custom preprocessing (a hypothetical hook is sketched after this list)
- Implement different execution strategies
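For example, custom preprocessing could normalize each task before the agent sees it. This is a hypothetical sketch; preprocess_task and the task dict layout are assumptions, and gaia_runner.py's actual interfaces may differ:

```python
def preprocess_task(task: dict) -> dict:
    """Normalize a GAIA task dict before handing it to the agent."""
    task["question"] = task["question"].strip()
    # Surface attached files in the prompt so the agent picks the right tool
    if task.get("file_name"):
        task["question"] += f"\n(Attached file: {task['file_name']})"
    return task
```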
```bash
# Start LangGraph Studio for interactive development
langgraph dev

# Run tests
python -m pytest tests/

# Format code
make format

# Lint code
make lint
```

Debugging tips:

- Use LangGraph Studio's state inspection to debug execution flow
- Check src/react_agent/prompts.py for GAIA-specific instructions
- Monitor tool execution in the studio's trace view
- Test individual tools with small examples before full GAIA runs
For detailed tracing and collaboration:
- Set LANGSMITH_API_KEY in .env
- Set LANGSMITH_TRACING=true
- View detailed execution traces in the LangSmith dashboard
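Concretely, the added .env lines would look something like this (the project name is just an example):

```bash
LANGSMITH_API_KEY=your-langsmith-key
LANGSMITH_TRACING=true
LANGSMITH_PROJECT=gaia-agent
```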
```
src/react_agent/
├── __init__.py            # Main exports
├── graph.py               # LangGraph state machine
├── tools.py               # Tool implementations (15+ tools)
├── prompts.py             # GAIA-optimized system prompts
├── configuration.py       # Agent configuration
├── gaia_runner.py         # GAIA benchmark orchestration
├── run_gaia_benchmark.py  # CLI entry point
└── utils.py               # Helper functions

tests/
├── unit_tests/            # Unit tests
└── integration_tests/     # GAIA integration tests
```
This agent is built on the LangGraph ReAct Agent Template, which provides:
- ReAct Pattern: Iterative reasoning and acting loops
- Tool Integration: Seamless tool calling and response handling
- State Management: Robust conversation state tracking
- Error Handling: Automatic retries and graceful failure modes
- Scalability: Production-ready architecture with LangGraph
The core logic, defined in src/react_agent/graph.py, demonstrates a flexible ReAct agent that iteratively reasons about user queries and executes actions, making it ideal for the complex, multi-step reasoning required by GAIA benchmark tasks.
Course: Hugging Face AI Agents Course - Unit 4
Objective: Build production-ready agents capable of scoring on challenging benchmarks
Framework: LangGraph + ReAct pattern for robust agent orchestration