A REST API wrapper around Trafilatura for extracting article content and metadata from web pages.
- Extract article text, title, author, date, and more from any URL
- Featured image extraction
- Categories and tags detection
- Language detection
- API key authentication
- Docker support
- Swagger UI documentation
git clone https://github.com/elvismdev/trafilatura-api.git
cd trafilatura-api
docker compose up --build

The API will be available at http://localhost:5000
docker run -d -p 5000:5000 -e API_KEY=your-secret-key ghcr.io/elvismdev/trafilatura-api:latest

| Environment Variable | Description | Default |
|---|---|---|
| API_KEY | Required for /extract endpoint authentication. Set your own secret key for production. | test123 |
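To illustrate the behaviour the table describes, here is a minimal sketch of how an `X-API-Key` header could be checked against the `API_KEY` environment variable. It is not taken from this project's source: the function names `expected_api_key` and `is_authorized` are hypothetical, and the constant-time comparison is an assumption about good practice rather than a statement of what the service does.

```python
import hmac
import os

# Fallback matches the default value documented in the table above.
DEFAULT_API_KEY = "test123"


def expected_api_key(environ=os.environ):
    """Return the configured key, falling back to the documented default."""
    return environ.get("API_KEY", DEFAULT_API_KEY)


def is_authorized(headers, environ=os.environ):
    """Hypothetical check: does the X-API-Key header match API_KEY?

    `headers` is a plain dict here, so the lookup is case-sensitive;
    a web framework's header object would normally ignore case.
    """
    supplied = headers.get("X-API-Key", "")
    # compare_digest keeps the comparison time independent of where
    # the two values first differ.
    return hmac.compare_digest(
        supplied.encode("utf-8"),
        expected_api_key(environ).encode("utf-8"),
    )
```

Because the default key is public, a deployment that leaves `API_KEY` unset would accept `test123` from anyone, which is why the table recommends setting your own value in production.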
GET /
Returns service status and documentation URL.
POST /extract
Extracts article content and metadata from a URL or raw HTML.
Headers:
Content-Type: application/json
X-API-Key: test123
Request Body:
{
"url": "https://example.com/article",
"output_options": {
"include_tables": true,
"include_links": true,
"favor_recall": true
}
}

Or with raw HTML:
{
"raw_html": "<html>...</html>",
"url": "https://example.com/article"
}

Response:
{
"title": "Article Title",
"author": "Author Name",
"date": "2025-12-01",
"description": "Article excerpt or meta description",
"sitename": "Example News",
"hostname": "example.com",
"url": "https://example.com/article",
"image": "https://example.com/featured-image.jpg",
"categories": ["News", "Technology"],
"tags": ["AI", "Machine Learning"],
"text": "Full article content...",
"language": "en"
}

Output Options:

| Option | Type | Description |
|---|---|---|
| include_tables | boolean | Include table content |
| include_links | boolean | Preserve hyperlinks in text |
| include_formatting | boolean | Keep text formatting |
| favor_precision | boolean | Prefer less text but higher accuracy |
| favor_recall | boolean | Prefer more text even if uncertain |
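When building requests programmatically, it can help to validate the options before sending them. The sketch below is a hypothetical client-side helper, not part of this API: `build_output_options` and `KNOWN_OPTIONS` are illustrative names, and only the five option names come from the table above. How the server treats unknown or non-boolean options is not documented here, so the strict rejection is an assumption.

```python
# The five option names documented in the table above.
KNOWN_OPTIONS = frozenset(
    {
        "include_tables",
        "include_links",
        "include_formatting",
        "favor_precision",
        "favor_recall",
    }
)


def build_output_options(**flags):
    """Hypothetical helper: validate flags and return an output_options dict."""
    options = {}
    for name, value in flags.items():
        if name not in KNOWN_OPTIONS:
            raise ValueError(f"unknown output option: {name}")
        if not isinstance(value, bool):
            raise TypeError(f"{name} must be a boolean, got {type(value).__name__}")
        options[name] = value
    return options
```

The returned dict can be placed directly under the `output_options` key of the request body.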
curl -X POST http://localhost:5000/extract \
-H "Content-Type: application/json" \
-H "X-API-Key: test123" \
-d '{"url": "https://example.com/article"}'

curl -X POST http://localhost:5000/extract \
-H "Content-Type: application/json" \
-H "X-API-Key: test123" \
-d '{
"url": "https://example.com/article",
"output_options": {
"include_links": true,
"favor_recall": true
}
}'

Interactive API documentation is available at:
http://localhost:5000/apidocs
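For callers who prefer Python over curl, the `POST /extract` request documented above can be sketched with the standard library alone. The helper names `build_extract_request` and `extract` are hypothetical; the endpoint path, headers, and body fields are the ones shown in this README, and the defaults assume the local Docker setup with the default `test123` key.

```python
import json
import urllib.request


def build_extract_request(
    url,
    api_key="test123",
    base_url="http://localhost:5000",
    output_options=None,
):
    """Build (but do not send) a POST /extract request for the given article URL."""
    payload = {"url": url}
    if output_options:
        payload["output_options"] = output_options
    return urllib.request.Request(
        f"{base_url}/extract",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )


def extract(url, **kwargs):
    """Send the request and decode the JSON response (needs a running server)."""
    request = build_extract_request(url, **kwargs)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the container running, `extract("https://example.com/article", output_options={"favor_recall": True})` should return a dict with the fields listed in the response example, such as `title` and `text`. Error handling for non-2xx responses is left out of this sketch.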
# Create virtual environment
python -m venv venv
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Set API key (required for /extract endpoint)
export API_KEY=test123
# Run development server
python -m flask --app app/app.py run
# Run tests
pytest -v

- Pull image: ghcr.io/elvismdev/trafilatura-api:latest
- Create container with:
  - Port mapping: 5000:5000
  - Environment variable: API_KEY=your-secret-key
Images are automatically built and pushed on every commit to master:
ghcr.io/elvismdev/trafilatura-api:latest
MIT