Skip to content

iitis/hatespeech_classification

Repository files navigation

Audio Transcription Classification

This Python project processes audio transcriptions and classifies them for aggression and hate speech categories using various language models.

Features

  • Binary Classification: Determines if text is aggressive (0 = not aggressive, 1 = aggressive).
  • Multiclass Classification: Categorizes text into one of five classes:
    • 0: Racism (attacks based on race, nationality, or religion)
    • 1: Sexism (attacks directed at women, gender roles, objectification)
    • 2: Hate Speech (general hate speech not fitting other categories)
    • 3: Vulgarism (vulgar language without targeted attacks)
    • 4: Neutral (non-aggressive language)
  • Supports multiple model providers: Local (Ollama), OpenAI, and Google Gemini.
  • Processes input files from the input/ directory and outputs results to output/.
  • Handles errors gracefully (e.g., unreachable models, invalid responses).

Requirements

  • Python 3.7+
  • Required packages (install via pip install -r requirements.txt):
    • litellm
    • requests
    • openai (for OpenAI provider)
    • google-genai (for Gemini provider)
  • For local models: Ollama server running with supported models (e.g., llama3.3, mistral-large).
  • API keys for OpenAI and Gemini if using those providers.

Installation

  1. Clone the repository:

    git clone <repository-url>
    cd Classification
    
  2. Install dependencies:

    pip install -r requirements.txt
    
  3. Set up API keys (if using OpenAI or Gemini):

    • Edit the script files to set OPENAI_API_KEY and GEMINI_API_KEY, or set environment variables.
  4. For local models, ensure Ollama is installed and running:

    • Install Ollama from ollama.ai.
    • Pull required models: ollama pull llama3.3, etc.
    • Set SELF_HOSTED_MODELS_URL to your Ollama API base (e.g., http://localhost:11434).

Usage

  1. Place input files in the input/ directory. Each file should contain lines in the format: filename;text (semicolon-separated).

  2. Run the classification script:

    • For one-shot classification: python classification-oneshot.py
    • For zero-shot classification: python classification-zeroshot.py
  3. Results will be saved in the output/ directory as CSV files, one per model and input file.

Output Format

Output files are CSV with semicolon-separated values:

  • filename: Original filename
  • binary_aggressive: 0 or 1 (or error codes)
  • multiclass_label: 0-4 (or error codes)
  • text: Sanitized input text

Supported Models

  • Local: llama3.3, mistral-large, SpeakLeash/bielik-11b-v2.3-instruct:Q8_0
  • OpenAI: gpt-5 (adjust as needed)
  • Gemini: gemini-2.5-flash

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages