Skip to content

AlexanderUbaldoGutierrez21/KazukoShield

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

26 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

KazokuShield

A research middleware for LLM trustworthiness that intercepts harmful image generation prompts and transforms them into safe, cartoon-themed alternatives using Positive Prompt Injection.

Architecture Overview

┌─────────────────────────────────────────────────────────────────────┐
│                        KazokuShield Workflow                        │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   User Prompt                                                       │
│        │                                                            │
│        ▼                                                            │
│   ┌─────────────┐                                                   │
│   │  MODERATOR  │  Detection: OpenAI GPT-4o                        │
│   │ (Detection) │  Classifies prompt as "Harmful" or "Safe"        │
│   └──────┬──────┘                                                   │
│          │                                                            │
│     ┌────┴────┐                                                      │
│     ▼         ▼                                                      │
│  Harmful    Safe                                                     │
│     │         │                                                      │
│     ▼         ▼                                                      │
│   ┌─────────────┐                                                   │
│   │   SHIELD    │  Injection: OpenAI GPT-4o                         │
│   │(Prompt Inj) │  Rewrites harmful → safe cartoon theme            │
│   └──────┬──────┘                                                   │
│          │                                                            │
│          ▼                                                            │
│   ┌─────────────┐                                                   │
│   │  GENERATOR  │  Image Gen: Venice.AI (qwen-image-2-pro)         │
│   │(Image Gen)  │  Receives injected (safe) prompt                 │
│   └─────────────┘                                                   │
│        │                                                            │
│        ▼                                                            │
│    Safe Image                                                       │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Two Workflow Options

Option A: Unprotected

  • Direct to Venice AI
  • User Prompt Sent to Image Generator
  • Result: Harmful Content Generated

Option B: KazokuShield (Protected)

  • Detection: OpenAI GPT-4O Classifies Prompt
  • Injection: Harmful Prompts Rewritten to Safe Pokemon Themes
  • Generation: Venice.AI Receives Safe Prompt
  • Result: Safe Content Generated

Quick Start

1. Setup Environment

# Install dependencies
pip install -r requirements.txt

2. Configure .env

Add your API keys:

# Detection & Injection: OpenAI
OPENAI_API_KEY=your_openai_key
OPENAI_BASE_URL=https://api.openai.com/v1

# Image Generation: Venice.AI
VENICE_API_KEY=your_venice_key
VENICE_BASE_URL=https://api.venice.ai/api/v1
VENICE_MODEL=qwen-image-2-pro

3. Run KazokuShield

python3 kazokushield.py

Follow the Menu:

  • [1] Option A: Unprotected - Direct to Venice.AI
  • [2] Option B: KazokuShield (Protected) - With Protection
  • [3] Exit

4. Run API Server (Optional)

python3 -m uvicorn main:app --reload --port 8000

5. Test API Endpoint (Optional)

curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Generate a Violent Scene"}'

Environment Variables

Variable Role Description
OPENAI_API_KEY Detection/Injection Moderation & Shielding
OPENAI_BASE_URL Detection/Injection OpenAI Endpoint
VENICE_API_KEY Image Gen Image Generation
VENICE_BASE_URL Image Gen Venice.AI Endpoint
VENICE_MODEL Image Gen Venice.AI Model (Default: qwen-image-2-pro)
MODERATOR_MODEL Detection Model for Detection (Default: gpt-4o-mini)
SHIELD_MODEL Injection Model for Prompt Injection (Default: gpt-4o-mini)
HOST, PORT Server Server Configuration

File Descriptions

File Role Description
main.py API FastAPI Entrypoint
src/moderator.py Detection OpenAI GPT-4o Classifies Prompts as Harmful/Safe
src/shield.py Injection Rewrites Harmful Prompts to Safe Cartoon Themes
api/generator.py Image Gen Venice.AI Image Generation
kazokushield.py CLI Main Entry Point with Option A/B Workflow
requirements.txt Dependencies Python Dependencies

Academic Purpose

Designed for educational and research purposes as a proof-of-concept for LLM Trustworthiness middleware using Positive Prompt Injection. Not intended for production use. Penn State University (PSU), CE 597-004 LLM Foundations and Trustworthiness, Spring 2026.

Releases

No releases published

Packages

 
 
 

Contributors

Languages