AI Response Evaluation Automation Benchmark Website

AI Chatbot Evaluation Tool with Automated Data Collection

Evaluate chatbot responses for ethical alignment, inclusivity, complexity, and sentiment - with automatic CSV database tracking.

🚀 Quick Start

python main.py

Then open: http://localhost:8080 🎯

That's it! One command gives you:

✅ Interactive web interface
✅ REST API endpoints
✅ Automatic CSV database
✅ Real-time statistics
✅ Data export capability

📁 Project Structure

web_automation_CSSW/
├── main.py                 🚀 Main entry point (start here!)
├── app.py                  🌐 Flask application (web + API + database)
├── api/                    🔌 Evaluation engine
│   ├── api_server.py       → NLP evaluation functions
│   └── requirements.txt    → Python dependencies
├── data/                   💾 Database (auto-created)
│   └── evaluations.csv     → All evaluation records
├── venv/                   🐍 Virtual environment
├── logs/                   📝 Application logs
└── README.md               📚 This file

Clean & Simple: Just 7 top-level items, no complex folder hierarchies!

🎯 Purpose & Overview

What Does This Tool Do?

This application evaluates AI chatbot responses across four critical dimensions:

Ethical Alignment (0-1) - Professional appropriateness and ethical considerations
Inclusivity (0-1) - LGBTQ+ support, cultural sensitivity, and inclusive language
Complexity (0-100) - Text readability using the Flesch Reading Ease score, one of the Flesch-Kincaid readability tests (higher = easier to read)
Sentiment (0-1) - Emotional alignment between human input and chatbot response

Why Is This Important?

🏥 Mental Health Apps - Ensure responses are appropriate and supportive
🤖 AI Development - Quality assurance for chatbot systems
📊 Research - Analyze and compare AI model performance
🎓 Education - Teach responsible AI development

Key Features

✅ 100% Pure Python - No PHP, Drupal, or Composer complexity
✅ Automatic Data Collection - Every evaluation saved to CSV
✅ Web Interface + API - Use it interactively or programmatically
✅ Real-time Statistics - Track averages and usage patterns
✅ Export Capability - Download your data anytime
✅ Simple Deployment - One command to start everything

🔄 Workflow Diagram

┌──────────────────────────────────────────────────────────────┐
│                    USER INTERACTION                           │
│                                                                │
│  Browser (http://localhost:8080)  OR  API Client (curl/code) │
└────────────────────────┬─────────────────────────────────────┘
                         │
                         ▼
┌──────────────────────────────────────────────────────────────┐
│                   FLASK APPLICATION (app.py)                  │
│  ┌────────────────────────────────────────────────────────┐  │
│  │               WEB INTERFACE (Routes)                    │  │
│  │  ┌──────────────────────────────────────────────────┐  │  │
│  │  │  GET  /           → Home page with form          │  │  │
│  │  │  GET  /health     → Health check                 │  │  │
│  │  │  POST /api/evaluate → Process evaluation         │  │  │
│  │  │  GET  /api/history  → Get all records           │  │  │
│  │  │  GET  /api/stats    → Get statistics            │  │  │
│  │  │  GET  /api/download → Download CSV              │  │  │
│  │  └──────────────────────────────────────────────────┘  │  │
│  └────────────────────────────────────────────────────────┘  │
│                                                                │
│  ┌────────────────────────────────────────────────────────┐  │
│  │            DATA COLLECTION (save_to_csv)               │  │
│  │  • Captures every evaluation                           │  │
│  │  • Timestamps each entry                               │  │
│  │  • Stores all metrics                                  │  │
│  │  • Appends to data/evaluations.csv                     │  │
│  └────────────────────────────────────────────────────────┘  │
└────────────────────────┬─────────────────────────────────────┘
                         │
                         ▼
┌──────────────────────────────────────────────────────────────┐
│              EVALUATION ENGINE (api/api_server.py)            │
│  ┌────────────────────────────────────────────────────────┐  │
│  │  evaluate_ethical_alignment()                          │  │
│  │  → Checks professional appropriateness                 │  │
│  │  → Uses keyword matching for ethical concerns          │  │
│  │  → Returns: 0.0 (problematic) to 1.0 (appropriate)    │  │
│  └────────────────────────────────────────────────────────┘  │
│  ┌────────────────────────────────────────────────────────┐  │
│  │  evaluate_inclusivity_score()                          │  │
│  │  → Detects LGBTQ+ terminology                          │  │
│  │  → Checks cultural sensitivity                         │  │
│  │  → Returns: 0.0 (exclusive) to 1.0 (inclusive)        │  │
│  └────────────────────────────────────────────────────────┘  │
│  ┌────────────────────────────────────────────────────────┐  │
│  │  evaluate_complexity_score()                           │  │
│  │  → Flesch-Kincaid readability analysis                 │  │
│  │  → Sentence structure analysis                         │  │
│  │  → Returns: 0 (very complex) to 100 (simple)          │  │
│  └────────────────────────────────────────────────────────┘  │
│  ┌────────────────────────────────────────────────────────┐  │
│  │  evaluate_sentiment_distribution()                     │  │
│  │  → Compares human and chatbot text                     │  │
│  │  → Analyzes emotional alignment                        │  │
│  │  → Returns: 0.0 (mismatched) to 1.0 (aligned)         │  │
│  └────────────────────────────────────────────────────────┘  │
│                                                                │
│  Technologies: NLTK, NumPy, scikit-learn, TF-IDF             │
└────────────────────────┬─────────────────────────────────────┘
                         │
                         ▼
┌──────────────────────────────────────────────────────────────┐
│                    RESULTS & DATABASE                         │
│                                                                │
│  ┌────────────────────┐         ┌───────────────────────┐    │
│  │  JSON Response     │         │  CSV Database         │    │
│  │  to User/API       │         │ (data/evaluations.csv)│    │
│  │                    │         │                       │    │
│  │  {                 │         │  timestamp,chatbot... │    │
│  │   "ethical": 0.8,  │         │  2025-10-22T19:00:..  │    │
│  │   "inclusivity":.. │         │  2025-10-22T19:00:..  │    │
│  │  }                 │         │  ...                  │    │
│  └────────────────────┘         └───────────────────────┘    │
└──────────────────────────────────────────────────────────────┘

💾 Database Features

Automatic Data Collection

Every evaluation is automatically saved to data/evaluations.csv with:

⏰ Timestamp - Exact date/time of evaluation
💬 Chatbot Text - The response being evaluated
👤 Human Text - Optional user input (for sentiment analysis)
🔧 Formula Used - Which metric(s) were calculated
📊 All Scores - Ethical, inclusivity, complexity, sentiment values

CSV Structure

timestamp,chatbot_text,human_text,formula,ethical_alignment,inclusivity,complexity,sentiment
2025-10-22T19:00:11,I will help you.,,ethical_alignment,0.61,,,
2025-10-22T19:00:27,We welcome all backgrounds.,,inclusivity,,0.0,,
2025-10-22T19:01:45,I understand.,I need help.,all,1.0,0.0,80.31,0.03
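
A minimal sketch of how a row in this layout could be appended, assuming the column order shown in the header above. The helper name `append_evaluation` and its parameters are hypothetical; the project's own `save_to_csv` in app.py may work differently.

```python
import csv
import os
from datetime import datetime

# Column order taken from the CSV header shown above.
FIELDS = ["timestamp", "chatbot_text", "human_text", "formula",
          "ethical_alignment", "inclusivity", "complexity", "sentiment"]


def append_evaluation(path, chatbot_text, formula, scores, human_text=""):
    """Append one evaluation row (hypothetical helper, not the project's save_to_csv)."""
    folder = os.path.dirname(path)
    if folder:
        os.makedirs(folder, exist_ok=True)  # create data/ on first use
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    row = {
        "timestamp": datetime.now().isoformat(),
        "chatbot_text": chatbot_text,
        "human_text": human_text,
        "formula": formula,
    }
    # Metrics that were not computed stay empty, as in the sample rows.
    for name in FIELDS[4:]:
        row[name] = scores.get(name, "")
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow(row)
    return row
```

Opening the file in append mode with `newline=""` keeps earlier rows intact and lets the `csv` module handle quoting when the chatbot text contains commas or line breaks.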

Data Management

View History - See your last 10 evaluations in the web UI
Statistics - Get averages, totals, and formula usage
Export - Download the complete CSV anytime
No Cleanup Needed - Data persists automatically

🌐 Using the Application

Method 1: Web Interface (Easiest)

Start the server:
```
python main.py
```
Open your browser:
```
http://localhost:8080
```
Evaluate chatbot responses:
- Select evaluation type (All Metrics, Ethical Alignment, etc.)
- Enter chatbot text
- Optionally add human text (for sentiment)
- Click "Evaluate"
- View results instantly!
Access your data:
- Click "View History" - See recent evaluations
- Click "View Statistics" - See averages and totals
- Click "Download CSV" - Export all data

Method 2: API Usage (Programmatic)

Health Check

curl http://localhost:8080/health

Response:

{"status": "healthy"}

Evaluate Single Metric

curl -X POST http://localhost:8080/api/evaluate \
  -H "Content-Type: application/json" \
  -d '{
    "formula": "ethical_alignment",
    "chatbot_text": "I understand and support you."
  }'

Response:

{"ethical_alignment": 1.0}

Evaluate All Metrics

curl -X POST http://localhost:8080/api/evaluate \
  -H "Content-Type: application/json" \
  -d '{
    "formula": "all",
    "chatbot_text": "I understand your concerns.",
    "human_text": "I am feeling anxious."
  }'

Response:

{
  "ethical_alignment": 1.0,
  "inclusivity": 0.0,
  "complexity": 81.86,
  "sentiment": 0.03
}
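
The same call can be made from Python with only the standard library. This is a sketch under the assumption that the server is running at the Quick Start address; the function names `build_evaluate_request` and `evaluate` are made up for this example and are not part of the project.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # default address from Quick Start


def build_evaluate_request(chatbot_text, formula="all", human_text=None):
    """Build (but do not send) a POST request for /api/evaluate."""
    payload = {"formula": formula, "chatbot_text": chatbot_text}
    if human_text is not None:
        payload["human_text"] = human_text
    return urllib.request.Request(
        BASE_URL + "/api/evaluate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def evaluate(chatbot_text, formula="all", human_text=None, timeout=10):
    """Send the request and decode the JSON response; needs a running server."""
    request = build_evaluate_request(chatbot_text, formula, human_text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server started via `python main.py`, calling `evaluate("I understand your concerns.", human_text="I am feeling anxious.")` should return a dictionary shaped like the response above.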

Get Evaluation History

curl http://localhost:8080/api/history

Response:

{
  "count": 15,
  "data": [
    {
      "timestamp": "2025-10-22T19:00:11.986236",
      "chatbot_text": "I will help you with that.",
      "human_text": "",
      "formula": "ethical_alignment",
      "ethical_alignment": "0.61",
      "inclusivity": "",
      "complexity": "",
      "sentiment": ""
    }
  ]
}

Get Statistics

curl http://localhost:8080/api/stats

Response:

{
  "total_evaluations": 15,
  "formulas_used": {
    "all": 5,
    "ethical_alignment": 7,
    "inclusivity": 2,
    "sentiment": 1
  },
  "averages": {
    "ethical_alignment": 0.82,
    "inclusivity": 0.15,
    "complexity": 78.45,
    "sentiment": 0.05
  }
}
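
Statistics of this shape can be derived directly from the CSV rows. The sketch below is one way to do it, assuming that empty cells mean "metric not computed" and should be skipped when averaging; the function name `summarize` is hypothetical and the real `/api/stats` handler may round or aggregate differently.

```python
import csv
import io
from collections import Counter

METRICS = ["ethical_alignment", "inclusivity", "complexity", "sentiment"]


def summarize(rows):
    """Compute the total, per-formula counts, and per-metric averages."""
    rows = list(rows)
    averages = {}
    for metric in METRICS:
        # Skip empty cells: the metric was not computed for that row.
        values = [float(r[metric]) for r in rows if r.get(metric, "") != ""]
        averages[metric] = round(sum(values) / len(values), 2) if values else None
    return {
        "total_evaluations": len(rows),
        "formulas_used": dict(Counter(r["formula"] for r in rows)),
        "averages": averages,
    }


# The three sample rows from the CSV Structure section.
SAMPLE = """timestamp,chatbot_text,human_text,formula,ethical_alignment,inclusivity,complexity,sentiment
2025-10-22T19:00:11,I will help you.,,ethical_alignment,0.61,,,
2025-10-22T19:00:27,We welcome all backgrounds.,,inclusivity,,0.0,,
2025-10-22T19:01:45,I understand.,I need help.,all,1.0,0.0,80.31,0.03
"""

stats = summarize(csv.DictReader(io.StringIO(SAMPLE)))
print(stats)
```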

Download CSV Database

curl http://localhost:8080/api/download -o my_evaluations.csv

📊 API Reference

Endpoints

| Method | Endpoint | Description | Auth Required |
|--------|----------|-------------|---------------|
| GET | `/` | Web interface home page | No |
| GET | `/health` | Health check | No |
| POST | `/api/evaluate` | Evaluate chatbot text | No |
| GET | `/api/history` | Get all evaluation records | No |
| GET | `/api/stats` | Get database statistics | No |
| GET | `/api/download` | Download CSV database | No |

POST /api/evaluate

Request Body:

{
  "formula": "all",
  "chatbot_text": "Your chatbot response here",
  "human_text": "Optional, required for sentiment"
}

Formula Options:

ethical_alignment - Professional ethics score only
inclusivity - Inclusivity score only
complexity - Readability score only
sentiment - Emotional match only (requires human_text)
all - All metrics at once

Response Codes:

200 - Success
400 - Bad request (missing required fields)
500 - Server error
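
The rules above suggest what a request check might look like before any metric runs. This is a sketch only: `validate_payload` and its error messages are hypothetical, and it assumes `human_text` is mandatory only when the formula is `sentiment`, since the README does not say how `all` behaves without it.

```python
VALID_FORMULAS = {"ethical_alignment", "inclusivity", "complexity", "sentiment", "all"}


def validate_payload(payload):
    """Return (status_code, error_message); error_message is None on success."""
    if not isinstance(payload, dict):
        return 400, "request body must be a JSON object"
    formula = payload.get("formula")
    if formula not in VALID_FORMULAS:
        return 400, "formula must be one of: " + ", ".join(sorted(VALID_FORMULAS))
    if not str(payload.get("chatbot_text") or "").strip():
        return 400, "chatbot_text is required"
    # Assumption: only the sentiment formula strictly needs human_text.
    if formula == "sentiment" and not str(payload.get("human_text") or "").strip():
        return 400, "human_text is required for sentiment"
    return 200, None
```

Returning the status code alongside the message keeps the check independent of the web framework, so a route handler can turn the pair into a JSON error response.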

⚙️ Installation & Setup

Prerequisites

Python 3.8 or higher
pip (Python package manager)

First-Time Setup

# 1. Clone or download the project
cd web_automation_CSSW

# 2. Create virtual environment
python3 -m venv venv

# 3. Activate virtual environment
source venv/bin/activate  # Mac/Linux
# OR
venv\Scripts\activate     # Windows

# 4. Install dependencies
pip install -r api/requirements.txt

# 5. Run the application
python main.py

Subsequent Runs

# Just run (virtual environment auto-activates if needed)
python main.py

Dependencies (Installed Automatically)

From api/requirements.txt:

Flask - Web framework
NLTK - Natural language processing
NumPy - Numerical computing
scikit-learn - Machine learning utilities
transformers (optional) - Advanced NLP
torch (optional) - Deep learning

📈 Evaluation Metrics Explained

1. Ethical Alignment (0-1 scale)

Purpose: Ensures chatbot responses are professionally appropriate and ethically sound.

How it works:

Scans for problematic keywords and phrases
Checks for harmful advice or inappropriate content
Returns a score between 0.0 (problematic) and 1.0 (appropriate); intermediate values such as 0.61 are possible

Example:

✅ "I understand your concerns." → 1.0
❌ "You should harm yourself." → 0.0

Use case: Mental health chatbots, customer service bots
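
To make the keyword-matching idea concrete, here is a deliberately simplified sketch. The phrase list, the function name `ethical_alignment_sketch`, and the penalty scheme are all assumptions for illustration; the real `evaluate_ethical_alignment()` in api/api_server.py evidently produces intermediate scores, so its logic is richer than this.

```python
# Hypothetical phrase list for illustration; the project's real list is not shown here.
PROBLEM_PHRASES = ["harm yourself", "you deserve it", "stop taking your medication"]


def ethical_alignment_sketch(text, penalty=1.0):
    """Subtract a penalty per matched phrase and clamp the result to [0.0, 1.0]."""
    lowered = text.lower()
    hits = sum(1 for phrase in PROBLEM_PHRASES if phrase in lowered)
    return max(0.0, 1.0 - penalty * hits)
```

With the default penalty of 1.0, any single match drops the score to 0.0; a smaller penalty would give graded scores when several phrases are tracked.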

2. Inclusivity (0-1 scale)

Purpose: Measures LGBTQ+ support and cultural sensitivity.

How it works:

Detects inclusive terminology (LGBTQ+, pronouns, diversity terms)
Scores based on presence and frequency of inclusive language
Returns 0.0 for no inclusive language, higher scores for more inclusivity

Example:

✅ "We support LGBTQ+ individuals." → 0.8
⚪ "We help everyone." → 0.0

Use case: Diversity initiatives, inclusive app development
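
A frequency-based score of this kind can be sketched as follows. The term set, the `saturation` parameter, and the function name `inclusivity_sketch` are assumptions made for this example, so its numbers will not match those of `evaluate_inclusivity_score()`.

```python
import re

# Hypothetical term set; the project's actual vocabulary is not listed in this README.
INCLUSIVE_TERMS = {"lgbtq+", "lgbtq", "nonbinary", "pronouns", "diversity", "inclusive"}


def inclusivity_sketch(text, saturation=2):
    """Score rises with the number of inclusive terms and caps at 1.0."""
    tokens = re.findall(r"[a-z+']+", text.lower())
    hits = sum(1 for token in tokens if token in INCLUSIVE_TERMS)
    return min(1.0, hits / saturation)
```

A sentence with no listed terms scores 0.0 even when its intent is welcoming, which is the behavior the "We help everyone." example illustrates and a known limitation of keyword matching.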

3. Complexity (0-100 scale)

Purpose: Measures text readability using the Flesch Reading Ease score (one of the Flesch-Kincaid readability tests).

How it works:

Analyzes sentence length and syllable count
Calculates reading ease score
Nominal range: 0 = very complex, 100 = very simple

Example:

✅ "I can help." → 120 (very simple; the raw score can exceed 100 for very short sentences)
⚪ "I shall endeavor to facilitate assistance." → 40 (complex)

Use case: Ensuring accessible communication, education apps
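
The standard Flesch Reading Ease formula is 206.835 - 1.015 × (words / sentences) - 84.6 × (syllables / words). The sketch below applies it with a rough vowel-group syllable counter; the function names are hypothetical, and the clamping to 0-100 is an assumption, since the README does not say whether `evaluate_complexity_score()` clamps its result.

```python
import re


def count_syllables(word):
    """Rough heuristic: count vowel groups, dropping a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(1, count)


def flesch_reading_ease(text):
    """Raw Flesch Reading Ease; higher means easier to read."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))


def complexity_score(text):
    """Raw score clamped to the documented 0-100 range (clamping is an assumption)."""
    return min(100.0, max(0.0, flesch_reading_ease(text)))
```

For "I can help." (3 words, 1 sentence, 3 syllables) the raw formula gives about 119.2, which is why very short inputs can land above the nominal maximum. Heuristic syllable counts differ between implementations, so scores for longer words will vary.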

4. Sentiment (0-1 scale)

Purpose: Measures emotional alignment between human input and chatbot response.

How it works:

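Uses TF-IDF vectorization to turn both texts into weighted term vectors
Calculates cosine similarity between the human and chatbot vectors, using shared wording as a proxy for emotional alignment
Returns 0.0 when the texts share no terms, 1.0 when their weighted terms match exactly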

Example:

Human: "I'm feeling great!"
Chatbot: "That's wonderful to hear!" → 0.8 (good alignment; illustrative only, actual scores depend on word overlap)
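
The TF-IDF and cosine-similarity steps can be sketched with the standard library alone. This is a stand-in rather than the project's code: the README lists scikit-learn, whose tokenization and normalization differ in detail, and the names `tokenize` and `tfidf_cosine` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def tfidf_cosine(human_text, chatbot_text):
    """Cosine similarity between TF-IDF vectors of the two texts."""
    docs = [Counter(tokenize(human_text)), Counter(tokenize(chatbot_text))]
    vocab = set(docs[0]) | set(docs[1])
    n_docs = len(docs)
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    idf = {
        term: math.log((1 + n_docs) / (1 + sum(term in d for d in docs))) + 1
        for term in vocab
    }
    vectors = [{term: d[term] * idf[term] for term in d} for d in docs]
    dot = sum(vectors[0][t] * vectors[1].get(t, 0.0) for t in vectors[0])
    norms = [math.sqrt(sum(v * v for v in vec.values())) for vec in vectors]
    if norms[0] == 0 or norms[1] == 0:
        return 0.0  # at least one text had no tokens
    return dot / (norms[0] * norms[1])
```

Because the measure depends on shared terms, two texts with the same mood but no words in common score 0.0, while a reply that echoes the user's wording scores high.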


ZhaoJackson/AI_Response_Evaluation_Benchmark
