A comprehensive evaluation framework for testing memory-based conversational AI agents across multiple datasets and metrics.
This system evaluates how well AI agents can answer questions based on accumulated conversational memory over time. It supports several benchmark datasets.
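In outline, an evaluation run streams conversation chunks into an agent's memory and then asks questions against that accumulated memory. The following is a toy, self-contained sketch of that loop — the method names mirror the agent interface described later in this README, but the agent logic and scoring are illustrative stand-ins, not the project's actual implementation:

```python
import asyncio

class ConcatMemoryAgent:
    """Toy agent: memory is just the concatenation of every chunk seen so far."""
    def __init__(self):
        self.memory = ""

    async def add_memory_async(self, chunk: str) -> None:
        self.memory += chunk + "\n"

    async def QA_batch_async(self, question: str) -> str:
        # Trivial "retrieval": return the first memory line mentioning the
        # last word of the question (a real agent would query an LLM).
        keyword = question.rstrip("?").split()[-1].lower()
        for line in self.memory.splitlines():
            if keyword in line.lower():
                return line
        return "unknown"

async def evaluate(agent, chunks, qa_pairs):
    for chunk in chunks:  # memory accumulates over time
        await agent.add_memory_async(chunk)
    correct = 0
    for question, gold in qa_pairs:
        response = await agent.QA_batch_async(question)
        correct += int(gold.lower() in response.lower())
    return correct / len(qa_pairs)

accuracy = asyncio.run(evaluate(
    ConcatMemoryAgent(),
    ["Alice moved to Paris.", "Bob adopted a cat named Miso."],
    [("Where is Alice?", "Paris"), ("Who is Miso?", "cat")],
))
```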
This project requires setting up two separate Python environments to handle different dependencies.
- Python 3.10+
- uv package installer (recommended for faster installs)
```bash
# Clone the repository
git clone https://github.com/ictnlp/unified-memory-agent.git
cd unified-memory-agent

# Create the main environment and install dependencies
uv sync
source .venv/bin/activate
```

```bash
# Navigate to the infinity_emb directory
cd external/infinity/libs/infinity_emb

# Create a separate virtual environment
uv venv
source .venv/bin/activate

# Install infinity_emb with all optional dependencies
uv pip install -e ".[all]"

# Return to the project root
cd ../../../..
```

## Deploy
```bash
source .venv/bin/activate

# Serve the policy model with vLLM ($MODEL must be set beforehand)
vllm serve $MODEL -dp 2 -tp 4 --gpu-memory-utilization 0.8 --enforce-eager > vllm.log 2>&1 &

# Serve the embedding model with infinity_emb
source ./external/infinity/.venv/bin/activate
infinity_emb v2 --model-id sentence-transformers/all-MiniLM-L6-v2 --port 8080 > infinity_emb.log 2>&1 &

# Wait for both servers to come up
until curl -s http://localhost:8080/health > /dev/null 2>&1; do
    sleep 2
    echo "waiting for server on port 8080..."
done
until curl -s http://localhost:8000/health > /dev/null 2>&1; do
    sleep 2
    echo "waiting for server on port 8000..."
done
```

## Run Demo

```bash
python test_verl_agent.py
```

Following MemAlpha, download the Memalpha and MemoryAgentBench datasets:
```bash
mkdir -p ./data/raw/

# Download the Memalpha training/test dataset
bash hfd.sh YuWangX/Memalpha --dataset --tool aria2c -x 10
mv Memalpha ./data/raw/memalpha

# Download the MemoryAgentBench evaluation dataset (processed version for this project)
bash hfd.sh YuWangX/Memalpha-Memoryagentbench --dataset --tool aria2c -x 10
mv Memalpha ./data/raw/memoryagentbench
```

Following MemAgent, download the hotpotqa dataset:

```bash
bash hfd.sh BytedTsinghua-SIA/hotpotqa --dataset --tool aria2c -x 10
mv hotpotqa ./data/raw/
```

Download the ConvoMem dataset:

```bash
bash hfd.sh Salesforce/ConvoMem --dataset --tool aria2c -x 10
mv ConvoMem ./data/raw/
```

Set your API URL in `.env` and run:

```bash
python data/EvalDataset.py
```

Optionally, you can also deploy Qwen/Qwen3-30B-A3B-Instruct-2507, modify `./data/synthv1_async.py` accordingly, and then rerun `data/EvalDataset.py` to generate Ledger-QA.
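The `.env` convention used above is plain `KEY=VALUE` lines. A minimal sketch of loading them into the environment — the `API_URL` key here is illustrative; check the project's code for the exact variable names it reads:

```python
import os
import tempfile

def load_env(path: str) -> None:
    """Read KEY=VALUE lines into os.environ (existing values win)."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())

# Demo with a throwaway .env file (key name is hypothetical)
with tempfile.NamedTemporaryFile("w", suffix=".env", delete=False) as f:
    f.write("API_URL=http://localhost:8000/v1\n")
    path = f.name
load_env(path)
print(os.environ["API_URL"])
```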
```bash
# Run full evaluation (generation + scoring)
bash run_evaluation.sh

# Run the concat baseline evaluation
bash baselines/run_concat.sh

# After generation finishes, score the results
bash run_score.sh
```

- Tasks: `convomem`, `locomo`, `longmemeval`, `memalpha`, `hotpotqa`, `msc`, `squad`, `banking77`, `clinic`, `nlu`, `perltqa`, `pubmed_rct`, `trec_coarse`, `trec_fine`
- Agents: `concat`, `memagent`, `memagent_woq`, `mem1`, `memalpha`, `toolmem`
- `responses_*.jsonl`: Real-time JSONL output with questions and responses
- Location: `results/{task}/`
- `evaluated_*.jsonl`: Complete evaluation results with all metrics
- Statistics: Pretty-printed tables (console or text file)
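Because the response files are line-delimited JSON, they can be consumed incrementally. A sketch of loading and scoring such a file — the records below are hypothetical, and the field names (`question`, `response`, `gold`) are assumptions; inspect an actual `responses_*.jsonl` under `results/{task}/` for the real schema:

```python
import io
import json

# Stand-in for open("results/<task>/responses_<run>.jsonl")
sample = io.StringIO(
    '{"question": "Where does Alice live?", "response": "Paris", "gold": "Paris"}\n'
    '{"question": "What pet does Bob have?", "response": "a dog", "gold": "a cat"}\n'
)

records = [json.loads(line) for line in sample if line.strip()]
exact_match = sum(r["response"] == r["gold"] for r in records) / len(records)
print(f"{len(records)} records, exact match = {exact_match:.2f}")
```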
- Memalpha-full

```bash
cd data
python construct_memalpha_verl_dataset.py
python construct_memalpha_verl_dataset.py --val
```

- hotpotqa

Following MemAgent, we construct a hotpotqa dataset containing 8192 items and further chunk each item's context for retrieval. You can download the processed data with:

```bash
bash hfd.sh dp66/hotpotqa-uma --dataset --tool aria2c -x 10
mv hotpotqa-uma ./data/train
```

- Ledger-QA

```bash
bash hfd.sh dp66/ledger-qa-train --dataset --tool aria2c -x 10
mv ledger-qa-train ./data/train
```

Optionally, you can also construct Ledger-QA with the following commands after running `python data/EvalDataset.py`:
```bash
cd data
python construct_processed_verl_dataset.py processed_synth-ss10_train.json ./data/train/ledger-qa-train/train.parquet --dataset-name synth --batch-size 40
```

Similarly, you can construct the validation dataset:

```bash
cd data
python construct_processed_verl_dataset.py processed_synth-ss10.json ./data/train/ledger-qa-train/dev.parquet --dataset-name synth --batch-size 40
```

Before launching training, start the embedding server:

```bash
source ./external/infinity/.venv/bin/activate
infinity_emb v2 --model-id sentence-transformers/all-MiniLM-L6-v2 --port 8080 > infinity_emb.log 2>&1 &

# Wait for the server
until curl -s http://localhost:8080/health > /dev/null 2>&1; do
    sleep 2
    echo "waiting for server on port 8080..."
done
```

### Single node
```bash
source .venv/bin/activate
cd external/verl
bash run_1node.sh
```

### 4 nodes

```bash
source .venv/bin/activate
cd external/verl
bash run_4nodes.sh
```

- Create a new agent class inheriting from `BaseAgent`
- Implement the required methods: `add_memory_async()` and `QA_batch_async()`
- Add it to the agent factory in `evaluate_async.py`
Example:

```python
class CustomAgent(BaseAgent):
    async def add_memory_async(self, chunk: str) -> None:
        # Your implementation
        pass

    async def QA_batch_async(self, question: str) -> str:
        # Your implementation
        pass
```