
🌌 Aurelia Web Relay – FastAPI Gateway for Local LLMs

FastAPI gateway for local LLMs with on‑demand web research, time‑anchored context, multilingual trigger detection, and SSE streaming.
It accepts an OpenAI‑style chat payload, optionally performs targeted web research, injects a compact source‑cited context block, and forwards the request to a local upstream (e.g., llama.cpp or LM Studio) while passing SSE tokens through with minimal latency.

Built for Python 3.11+.




Why 💙

Local LLMs are fast and private, but they often lack a recency layer and a consistent time anchor. Aurelia Web Relay adds both without changing your client: it decides when web search is useful, collects and ranks sources, builds a tight <<<CONTEXT>>> block with citations and dates, and forwards your request to a local model over an OpenAI‑compatible /v1/chat/completions upstream. The client simply talks to /relay and receives SSE chunks in real time.


📣 Features

  • Drop‑in gateway for local LLMs — forwards OpenAI‑style chat payloads to an upstream (e.g., llama.cpp, LM Studio), preserving parameters like temperature, top_p, top_k, and penalties.
  • On‑demand web research — a heuristic (need_web) detects time/news/how‑to queries and triggers a research pipeline (Tavily + SerpAPI, page fetching, content extraction, ranking).
  • Multilingual trigger detection (30 languages) — recency/how‑to cues are recognized in many languages (e.g., de, fr, es, zh, ar, hi, …). The phrase lists are maintained in languages.json, hot‑reloaded at runtime, and any 4‑digit year like 2025 is treated as a weak recency signal.
  • Time‑anchored system guidance — inserts a deterministic date/time anchor (UTC + local TZ) so “today/now/currently” are always unambiguous.
  • Streaming, end‑to‑end — Server‑Sent Events (text/event-stream) are passed through 1:1 from the upstream to your client.
  • Country‑aware news prioritization — trusted outlets for your configured country get a small ranking bonus; reliable global outlets are preferred by default.
  • Zero lock‑in — pure FastAPI/uvicorn + httpx; no proprietary SDKs.

🧠 Architecture

sequenceDiagram
    autonumber
    participant Client
    participant Relay as Web Relay (FastAPI)
    participant R as ResearchOrchestrator
    participant Providers as Tavily / SerpAPI
    participant Fetch as Extractor (trafilatura / readability / BS4)
    participant Upstream as Local LLM (/v1/chat/completions)

    Client->>Relay: POST /relay (chat payload, stream=true)
    alt need_web(query) is true
        Relay->>R: research_and_digest(query)
        R->>Providers: search (expanded queries)
        Providers-->>R: candidate links
        R->>Fetch: fetch_and_extract(url...) (concurrent)
        Fetch-->>R: clean text + publish dates
        R-->>Relay: ranked digest + <<<CONTEXT>>>
    end
    Relay->>Upstream: POST /v1/chat/completions (stream=true)
    Upstream-->>Relay: SSE chunks (data: {...})
    Relay-->>Client: SSE chunks (passed through)

Code map

  • app.py — FastAPI app, multilingual need_web heuristic, time‑anchor context, endpoint handlers (/health, /relay, /relay_once), and SSE streaming response.
  • research.py — query expansion, multi‑provider search, dedupe, BM25 + recency + domain scoring, country‑aware boosts, digest & <<<CONTEXT>>> builder.
  • extract.py — robust page fetching and article extraction (trafilatura → readability → BS4), light publish‑date detection.
  • llm_client.py — thin SSE client that forwards upstream /v1/chat/completions streams exactly as received.
  • lang_signals.py + languages.json — multilingual phrase lists (recency/how‑to) with hot‑reload and LANGUAGE_FILE override.

📌 Quickstart

Prerequisites

  • Python 3.11+
  • A local LLM server exposing an OpenAI‑compatible endpoint at /v1/chat/completions (e.g., llama.cpp server or LM Studio)
  • (Optional) API keys for web search:
    • TAVILY_API_KEY (recommended)
    • SERPAPI_API_KEY (optional)

🛠️ Install

# 1) Create & activate a virtual environment
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate

# 2) Install dependencies
pip install -r requirements.txt

🧪 Configure

Create a .env file in the project root:

# Upstream LLM
UPSTREAM_TYPE=llama                    # llama | lmstudio (label only)
UPSTREAM_URL=http://127.0.0.1:8080     # where your local /v1/chat/completions lives
DEFAULT_MODEL=gemma-3-12b-it-ud@q8_k_xl

# Context & time
CONTEXT_BUDGET_CHARS=7000
LOCAL_TZ=Europe/Zurich                 # IANA TZ (fallback: UTC)

# Networking
REQUEST_TIMEOUT=20                     # seconds

# Research (optional but recommended)
TAVILY_API_KEY=                        # recommended; get a key from Tavily
SERPAPI_API_KEY=                       # optional
COUNTRY=CH                             # ISO-3166 alpha-2 (for news prioritization)
FETCH_CONCURRENCY=6

# Multilingual signals
LANGUAGE_FILE=./languages.json         # optional override; auto-reloads on change

🚀 Run

Either run the built‑in launcher:

python app.py --host 0.0.0.0 --port 5100 --reload

Or use uvicorn directly:

uvicorn app:app --host 0.0.0.0 --port 5100 --reload

Check health:

curl http://localhost:5100/health

💻 Configuration

| Variable | Default | Purpose |
|---|---|---|
| UPSTREAM_TYPE | llama | Label to indicate the upstream kind (llama / lmstudio). |
| UPSTREAM_URL | http://127.0.0.1:8080 | Base URL of your local OpenAI‑compatible server. |
| DEFAULT_MODEL | gemma-3-12b-it-ud@q8_k_xl | Model name sent upstream if the request omits model. |
| CONTEXT_BUDGET_CHARS | 7000 | Max characters allocated for the generated <<<CONTEXT>>> block. |
| LOCAL_TZ | Europe/Zurich | IANA timezone for the local time anchor (falls back to UTC). |
| REQUEST_TIMEOUT | 20 | Network timeout (seconds) for upstream & fetching. |
| TAVILY_API_KEY | | Enables Tavily search. |
| SERPAPI_API_KEY | | Enables Google via SerpAPI. |
| COUNTRY | | ISO‑3166 country code for country‑aware news boosts (e.g., CH, DE, US). |
| FETCH_CONCURRENCY | 6 | Max concurrent page fetches during extraction. |
| LANGUAGE_FILE | ./languages.json | Optional path override for multilingual signal lists; auto‑reloads (checked ~30s). |

📣 API

GET /health

Returns basic status and time anchor fields.

{
  "status": "ok",
  "ts_utc": "2025-11-18T09:10:11Z",
  "today_local": "2025-11-18",
  "tz": "Europe/Zurich",
  "upstream": "llama",
  "url": "http://127.0.0.1:8080"
}

POST /relay (streaming)

Accepts an OpenAI‑style chat body and returns SSE with upstream tokens.
Body schema (subset):

{
  "model": "gemma-3-12b-it-ud@q8_k_xl",
  "messages": [{"role": "user", "content": "What's new in Python 3.12?"}],
  "temperature": 0.7,
  "top_p": 0.95,
  "top_k": 60,
  "presence_penalty": 0.0,
  "frequency_penalty": 0.0,
  "stream": true,
  "char_budget": 7000
}

char_budget overrides the server’s CONTEXT_BUDGET_CHARS for this call.

cURL example (SSE):

curl -N http://localhost:5100/relay \
  -H "Content-Type: application/json" \
  -H "Accept: text/event-stream" \
  -d '{
        "model": "gemma-3-12b-it-ud@q8_k_xl",
        "messages": [{"role":"user","content":"Dame los titulares más recientes sobre baterías cuánticas."}],
        "stream": true
      }'

The server will stream lines like:

data: {"id":"...","object":"chat.completion.chunk","model":"...","choices":[{"delta":{"content":"..."}}]}

... (more chunks) ...

data: [DONE]
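
For programmatic use, here is a minimal Python client sketch using httpx (already a project dependency). The endpoint and payload follow the body schema above, and the parsing assumes the standard OpenAI chunk shape shown in the stream example:

import json
import httpx

def stream_relay(prompt: str, url: str = "http://localhost:5100/relay") -> None:
    """Print tokens from the relay's SSE stream as they arrive."""
    # "model" may be omitted; the relay falls back to DEFAULT_MODEL.
    payload = {"messages": [{"role": "user", "content": prompt}], "stream": True}
    headers = {"Accept": "text/event-stream"}
    with httpx.stream("POST", url, json=payload, headers=headers, timeout=None) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line.startswith("data: "):
                continue  # skip blank lines between events
            data = line[len("data: "):]
            if data == "[DONE]":
                break
            delta = json.loads(data)["choices"][0].get("delta", {})
            print(delta.get("content", ""), end="", flush=True)

stream_relay("What's new in Python 3.13?")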

POST /relay_once (non‑streaming)

Returns a single JSON completion after the upstream finishes.

cURL example:

curl http://localhost:5100/relay_once \
  -H "Content-Type: application/json" \
  -d '{
        "model": "gemma-3-12b-it-ud@q8_k_xl",
        "messages": [{"role":"user","content":"Fasse RFC 9457 in zwei Sätzen zusammen."}],
        "stream": false
      }'

Response (shape):

{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "…final answer text…"
      }
    }
  ]
}
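
The same call from Python, as a minimal sketch with httpx:

import httpx

resp = httpx.post(
    "http://localhost:5100/relay_once",
    json={
        "messages": [{"role": "user", "content": "Summarize RFC 9457 in two sentences."}],
        "stream": False,
    },
    timeout=60.0,  # non-streaming calls wait for the full completion
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])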

How web research & context injection works

When the latest user message contains temporal/news/price/how‑to cues (in any supported language), the relay:

  1. Expands queries (adds variants like “current …”, “latest …”, “… 2025”, “… tutorial”).
  2. Searches multiple providers (Tavily and/or SerpAPI) and merges results.
  3. Deduplicates by normalized URL and fuzzy title per domain.
  4. Fetches pages concurrently and extracts clean text via trafilatura → readability → BeautifulSoup (publish dates are extracted from common meta tags where available).
  5. Reranks using BM25 (on extracted text/snippets) + recency decay (half‑life ~30 days) + provider/domain quality (including a small country‑aware news bonus) and domain diversity caps (see the scoring sketch after this list).
  6. Builds a compact <<<CONTEXT>>> block (top ~5 sources): title, domain, detected publish date, 1–3 key bullets per source, canonical URL — plus explicit instructions for the model to cite [1], [2], … and to treat “today/currently/now” relative to the provided time anchor (UTC + local TZ).
  7. Merges messages: the context block is prepended to the last user message; a system line with guidance and the time anchor is injected ahead of the conversation (sketched below).
  8. Streams upstream: request is forwarded to your local model with stream=true, and chunks are passed through unchanged.
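
As a rough picture of step 5, here is a minimal Python sketch. The ~30‑day half‑life matches the description above, but the blend weights, the neutral weight for undated pages, and the helper names are illustrative assumptions (the actual scoring lives in research.py):

from datetime import datetime, timezone

HALF_LIFE_DAYS = 30.0  # recency half-life from the pipeline description

def recency_weight(published: datetime | None, now: datetime) -> float:
    """Exponential decay: a 30-day-old source weighs half as much as a fresh one."""
    if published is None:
        return 0.5  # assumed neutral weight for pages without a detected date
    age_days = max((now - published).total_seconds() / 86400.0, 0.0)
    return 0.5 ** (age_days / HALF_LIFE_DAYS)

def combined_score(bm25: float, published: datetime | None, domain_bonus: float = 0.0) -> float:
    """Blend relevance, freshness, and domain quality (weights are assumptions)."""
    now = datetime.now(timezone.utc)
    return 0.6 * bm25 + 0.3 * recency_weight(published, now) + 0.1 * domain_bonus

With these example weights, a source published 30 days ago loses 0.15 of its score versus a fresh one, which a stronger BM25 match can easily offset.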

If research fails (e.g., a provider is down), the relay still answers without web‑derived context; a short failure note is embedded inside the <<<CONTEXT>>> section for transparency.
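
To make steps 6–7 concrete, a minimal sketch of the time anchor and message merge. The exact anchor wording and message shapes here are assumptions; the real versions live in app.py:

from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def build_anchor(tz_name: str = "Europe/Zurich") -> str:
    """Deterministic time anchor (UTC + local TZ); wording is illustrative."""
    now_utc = datetime.now(timezone.utc)
    now_loc = now_utc.astimezone(ZoneInfo(tz_name))
    return (f"Current time: {now_utc:%Y-%m-%dT%H:%M:%SZ} UTC "
            f"({now_loc:%Y-%m-%d %H:%M} {tz_name}). "
            "Interpret 'today', 'now', and 'currently' relative to this anchor.")

def merge_messages(messages: list[dict], context_block: str, tz_name: str) -> list[dict]:
    """Prepend the <<<CONTEXT>>> block to the last user message and inject
    a system line ahead of the conversation (shapes are illustrative)."""
    merged = [{"role": "system", "content": build_anchor(tz_name)}]
    merged += [dict(m) for m in messages]
    for msg in reversed(merged):
        if msg["role"] == "user":
            msg["content"] = f"{context_block}\n\n{msg['content']}"
            break
    return merged

On a research failure, context_block would carry only the short failure note mentioned above.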


Multilingual behavior

  • The need_web(...) heuristic recognizes recency/how‑to cues in 30 languages via languages.json. Any 4‑digit year (20xx) is treated as a weak recency cue.
  • The lists auto‑reload if the file changes (checked roughly every 30s). You can point to a custom file via LANGUAGE_FILE=path/to/your.json.
  • Matching is substring‑based on lower‑cased input, making it robust across scripts and diacritics.
  • You can extend the lists by adding entries under recency / howto for each language code. A minimal shape:
{
  "metadata": {"version": 1, "updated": "2025-11-19"},
  "languages": [{"code": "de", "name": "German"}, {"code": "es", "name": "Spanish"}],
  "recency": {
    "de": ["heute","aktuell","neueste","preis","gesetz"],
    "es": ["hoy","últimas","precio","ley","calendario"]
  },
  "howto": {
    "de": ["anleitung","leitfaden","wie","schritt für schritt"],
    "es": ["cómo","guía","tutorial","paso a paso"]
  }
}
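
Putting the pieces together, an illustrative re‑implementation of the trigger check: lower‑cased substring matching over the recency/howto lists, plus the weak 20xx year cue. The real version in lang_signals.py also hot‑reloads the file, so the shape below is an assumption:

import json
import re

def need_web(text: str, signals_path: str = "./languages.json") -> bool:
    """Return True if the message carries a recency/how-to cue or a 4-digit year."""
    lowered = text.lower()
    with open(signals_path, encoding="utf-8") as f:
        signals = json.load(f)
    for section in ("recency", "howto"):
        for phrases in signals.get(section, {}).values():
            if any(phrase in lowered for phrase in phrases):
                return True
    return bool(re.search(r"\b20\d{2}\b", text))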

Examples

  • “¿Qué hay de nuevo en Python 3.13?” (“What's new in Python 3.13?”) → recency + year signal → research enabled.
  • “Wie installiere ich Poetry unter Windows?” (“How do I install Poetry on Windows?”) → how‑to signal → research enabled.
  • “Expliquez-moi OpenTelemetry en deux phrases.” (“Explain OpenTelemetry to me in two sentences.”) → no recency/how‑to → no research.

🫂 Security & deployment notes

  • Auth: The relay ships without authentication. Place it behind a reverse proxy (e.g., Traefik / NGINX) and enforce auth/TLS as needed.
  • CORS: Add CORS middleware if you call it from browsers (see the snippet after this list).
  • Timeouts: Tune REQUEST_TIMEOUT for both upstream and page fetching; default is conservative.
  • Rate limiting: Consider a proxy‑level limiter to protect your upstream.
  • Observability: Add structured logging and tracing around /relay and upstream calls in production.
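
For the CORS point above, FastAPI's stock middleware is enough; the origins below are placeholders:

from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()  # in practice, the existing instance in app.py

app.add_middleware(
    CORSMiddleware,
    allow_origins=["https://your-frontend.example"],  # restrict to your frontend
    allow_methods=["GET", "POST"],
    allow_headers=["*"],
)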

🛠️ Troubleshooting

  • Upstream errors / connection refused
    Ensure UPSTREAM_URL points to a live server that implements /v1/chat/completions. Test with a minimal POST (example after this list).
  • No streaming
    Use curl -N and include Accept: text/event-stream. Proxies may buffer SSE; disable buffering where applicable.
  • Research never triggers
    Verify LANGUAGE_FILE is readable and your prompt contains recency/how‑to cues in any supported language, or provide TAVILY_API_KEY / SERPAPI_API_KEY.
  • Empty or low‑quality extractions
    Some sites block scraping or use heavy JS. The pipeline gracefully falls back (readability → BS4), but sources may be skipped if content < 200 chars.
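
A minimal upstream POST for the first item, assuming the default UPSTREAM_URL (some servers also require a model field):

curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"ping"}],"stream":false}'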

💙 Contributing

Issues and PRs are welcome. Please keep changes small and well‑documented. Suggested areas:

  • Provider adapters (additional search engines)
  • Smarter date extraction and language detection
  • Pluggable ranking & diversity rules
  • Observability, metrics, and tests

📄 License (Summary)

Aurelia Web Relay is licensed under the Aurelia Web Relay License (AWRL).

You may:

  • ✅ Use, modify, and share the software for non-commercial purposes only
  • ✅ Fork, study, and run it locally
  • ✅ Build non-commercial tools or demos based on it

You may not:

  • ❌ Use it in any commercial, for-profit, or monetized setting
  • ❌ Offer it as a service (SaaS, hosting, API, chatbot, etc.)
  • ❌ Integrate it into paid products, platforms, or enterprise workflows

To use Aurelia Web Relay commercially, you must obtain a separate written license.
→ Contact: legal@samedia.app

Read the full license here: LICENSE.md
