elvismdev/trafilatura-api

 
 

Trafilatura REST API

A REST API wrapper around Trafilatura for extracting article content and metadata from web pages.

Features

  • Extract article text, title, author, date, and other metadata from a URL or raw HTML
  • Featured image extraction
  • Categories and tags detection
  • Language detection
  • API key authentication
  • Docker support
  • Swagger UI documentation

Quick Start

Using Docker Compose

git clone https://github.com/elvismdev/trafilatura-api.git
cd trafilatura-api
docker compose up --build

The API will be available at http://localhost:5000

Using Docker

docker run -d -p 5000:5000 -e API_KEY=your-secret-key ghcr.io/elvismdev/trafilatura-api:latest

Configuration

Environment Variable | Description | Default
API_KEY | Required for /extract endpoint authentication. Set your own secret key for production. | test123
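Since the default key is only meant for testing, here is a minimal sketch of one way to generate a random value for API_KEY using Python's standard library. The function name generate_api_key is hypothetical and not part of this project; any sufficiently random string should work equally well.

```python
import secrets


def generate_api_key(num_bytes: int = 32) -> str:
    """Return a random URL-safe string suitable for use as a secret key.

    Hypothetical helper for illustration; it is not part of this project.
    """
    return secrets.token_urlsafe(num_bytes)


if __name__ == "__main__":
    # Print a fresh key that can be passed to the container as API_KEY.
    print(generate_api_key())
```

The printed value can then be supplied through the API_KEY environment variable, for example with -e API_KEY=... in the docker run command.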

API Endpoints

Health Check

GET /

Returns service status and documentation URL.

Extract Content

POST /extract

Extracts article content and metadata from a URL or raw HTML.

Headers:

Content-Type: application/json
X-API-Key: test123

Request Body:

{
  "url": "https://example.com/article",
  "output_options": {
    "include_tables": true,
    "include_links": true,
    "favor_recall": true
  }
}

Or with raw HTML:

{
  "raw_html": "<html>...</html>",
  "url": "https://example.com/article"
}
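The two request shapes above can be assembled with a small helper. This is a sketch only: build_extract_payload is a hypothetical name, and the check that at least one of url or raw_html is present is an assumption based on the endpoint description, not confirmed server behavior.

```python
import json


def build_extract_payload(url=None, raw_html=None, output_options=None):
    """Build the JSON body for POST /extract (hypothetical client helper).

    Assumption: the request needs at least one of `url` or `raw_html`.
    """
    if not url and not raw_html:
        raise ValueError("provide url, raw_html, or both")

    payload = {}
    if raw_html:
        payload["raw_html"] = raw_html
    if url:
        payload["url"] = url
    if output_options:
        # Copy so later changes to the caller's dict do not leak in.
        payload["output_options"] = dict(output_options)
    return json.dumps(payload)
```

For example, build_extract_payload(raw_html="<html>...</html>", url="https://example.com/article") yields a body equivalent to the raw HTML request shown above.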

Response:

{
  "title": "Article Title",
  "author": "Author Name",
  "date": "2025-12-01",
  "description": "Article excerpt or meta description",
  "sitename": "Example News",
  "hostname": "example.com",
  "url": "https://example.com/article",
  "image": "https://example.com/featured-image.jpg",
  "categories": ["News", "Technology"],
  "tags": ["AI", "Machine Learning"],
  "text": "Full article content...",
  "language": "en"
}
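On the client side, a response like the one above can be loaded into a typed object. The following is a sketch under assumptions: the Article class is hypothetical, the field names are taken from the sample response, and treating every field as optional is a defensive guess, since the README does not say which fields are always present.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class Article:
    """Hypothetical client-side container for an /extract response."""

    title: Optional[str] = None
    author: Optional[str] = None
    date: Optional[str] = None
    description: Optional[str] = None
    sitename: Optional[str] = None
    hostname: Optional[str] = None
    url: Optional[str] = None
    image: Optional[str] = None
    categories: List[str] = field(default_factory=list)
    tags: List[str] = field(default_factory=list)
    text: Optional[str] = None
    language: Optional[str] = None

    @classmethod
    def from_dict(cls, data: dict) -> "Article":
        # Keep only known keys with non-null values; ignore anything else.
        known = {f.name for f in fields(cls)}
        kwargs = {k: v for k, v in data.items() if k in known and v is not None}
        return cls(**kwargs)
```

Ignoring unknown keys keeps the client working if the service later adds fields to the response.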

Output Options:

Option | Type | Description
include_tables | boolean | Include table content
include_links | boolean | Preserve hyperlinks in text
include_formatting | boolean | Keep text formatting
favor_precision | boolean | Prefer less text but higher accuracy
favor_recall | boolean | Prefer more text even if uncertain
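A client may want to sanity-check output_options before sending a request. The sketch below uses a hypothetical helper, check_output_options, with option names taken from the table above. The warning about enabling favor_precision and favor_recall together is a heuristic based on their descriptions; whether the API rejects or tolerates that combination is not stated here.

```python
KNOWN_OPTIONS = {
    "include_tables",
    "include_links",
    "include_formatting",
    "favor_precision",
    "favor_recall",
}


def check_output_options(options: dict) -> list:
    """Return a list of human-readable problems (empty if none found).

    Hypothetical client-side check; the server may validate differently.
    """
    problems = []
    for name, value in options.items():
        if name not in KNOWN_OPTIONS:
            problems.append(f"unknown option: {name}")
        elif not isinstance(value, bool):
            problems.append(f"{name} must be a boolean")

    # Heuristic: the two options describe opposite trade-offs.
    if options.get("favor_precision") is True and options.get("favor_recall") is True:
        problems.append("favor_precision and favor_recall are both enabled")
    return problems
```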

Usage Examples

Basic extraction

curl -X POST http://localhost:5000/extract \
  -H "Content-Type: application/json" \
  -H "X-API-Key: test123" \
  -d '{"url": "https://example.com/article"}'

With output options

curl -X POST http://localhost:5000/extract \
  -H "Content-Type: application/json" \
  -H "X-API-Key: test123" \
  -d '{
    "url": "https://example.com/article",
    "output_options": {
      "include_links": true,
      "favor_recall": true
    }
  }'
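From Python

The same request can be sent from Python with only the standard library. This is a minimal sketch: build_request and extract are hypothetical helper names, the base URL and key mirror the curl examples above, and the response is assumed to be a JSON object as shown in the Response section.

```python
import json
import urllib.request


def build_request(base_url: str, api_key: str, payload: dict) -> urllib.request.Request:
    """Prepare a POST /extract request without sending it."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/extract",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )


def extract(base_url: str, api_key: str, payload: dict, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response.

    urllib raises urllib.error.HTTPError for non-2xx responses.
    """
    request = build_request(base_url, api_key, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Requires the service to be running locally with API_KEY=test123.
    result = extract(
        "http://localhost:5000",
        "test123",
        {"url": "https://example.com/article"},
    )
    print(result.get("title"))
```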

Swagger Documentation

Interactive API documentation is available at:

http://localhost:5000/apidocs

Local Development

# Create virtual environment
python -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Set API key (required for /extract endpoint)
export API_KEY=test123

# Run development server
python -m flask --app app/app.py run

# Run tests
pytest -v

Deployment

QNAP Container Station

  1. Pull image: ghcr.io/elvismdev/trafilatura-api:latest
  2. Create container with:
    • Port mapping: 5000:5000
    • Environment variable: API_KEY = your-secret-key

GitHub Container Registry

Images are automatically built and pushed on every commit to master:

ghcr.io/elvismdev/trafilatura-api:latest

License

MIT
