
🌌 Aurelia Web Relay – FastAPI Gateway for Local LLMs

FastAPI gateway for local LLMs with on‑demand web research, time‑anchored context, multilingual trigger detection, and SSE streaming.
It accepts an OpenAI‑style chat payload, optionally performs targeted web research, injects a compact source‑cited context block, and forwards the request to a local upstream (e.g., llama.cpp or LM Studio) while passing SSE tokens through with minimal latency.

Built for Python 3.11+.




Why 💙

Local LLMs are fast and private, but they often lack a recency layer and a consistent time anchor. Aurelia Web Relay adds both without changing your client: it decides when web search is useful, collects and ranks sources, builds a tight <<<CONTEXT>>> block with citations and dates, and forwards your request to a local model over an OpenAI‑compatible /v1/chat/completions upstream. The client simply talks to /relay and receives SSE chunks in real time.


📣 Features

  • Drop‑in gateway for local LLMs — forwards OpenAI‑style chat payloads to an upstream (e.g., llama.cpp, LM Studio), preserving parameters like temperature, top_p, top_k, and penalties.
  • On‑demand web research — a heuristic (need_web) detects time/news/how‑to queries and triggers a research pipeline (Tavily + SerpAPI, page fetching, content extraction, ranking).
  • Multilingual trigger detection (30 languages) — recency/how‑to cues are recognized in many languages (e.g., de, fr, es, zh, ar, hi, …). The phrase lists are maintained in languages.json, hot‑reloaded at runtime, and any 4‑digit year like 2025 is treated as a weak recency signal.
  • Time‑anchored system guidance — inserts a deterministic date/time anchor (UTC + local TZ) so “today/now/currently” are always unambiguous.
  • Streaming, end‑to‑end — Server‑Sent Events (text/event-stream) are passed through 1:1 from the upstream to your client.
  • Country‑aware news prioritization — trusted outlets for your configured country get a small ranking bonus; reliable global outlets are preferred by default.
  • Zero lock‑in — pure FastAPI/uvicorn + httpx; no proprietary SDKs.

🧠 Architecture

sequenceDiagram
    autonumber
    participant Client
    participant Relay as Web Relay (FastAPI)
    participant R as ResearchOrchestrator
    participant Providers as Tavily / SerpAPI
    participant Fetch as Extractor (trafilatura / readability / BS4)
    participant Upstream as Local LLM (/v1/chat/completions)

    Client->>Relay: POST /relay (chat payload, stream=true)
    alt need_web(query) is true
        Relay->>R: research_and_digest(query)
        R->>Providers: search (expanded queries)
        Providers-->>R: candidate links
        R->>Fetch: fetch_and_extract(url...) (concurrent)
        Fetch-->>R: clean text + publish dates
        R-->>Relay: ranked digest + <<<CONTEXT>>>
    end
    Relay->>Upstream: POST /v1/chat/completions (stream=true)
    Upstream-->>Relay: SSE chunks (data: {...})
    Relay-->>Client: SSE chunks (passed through)

Code map

  • app.py — FastAPI app, multilingual need_web heuristic, time‑anchor context, endpoint handlers (/health, /relay, /relay_once), and SSE streaming response.
  • research.py — query expansion, multi‑provider search, dedupe, BM25 + recency + domain scoring, country‑aware boosts, digest & <<<CONTEXT>>> builder.
  • extract.py — robust page fetching and article extraction (trafilatura → readability → BS4), light publish‑date detection.
  • llm_client.py — thin SSE client that forwards upstream /v1/chat/completions streams exactly as received.
  • lang_signals.py + languages.json — multilingual phrase lists (recency/how‑to) with hot‑reload and LANGUAGE_FILE override.

📌 Quickstart

Prerequisites

  • Python 3.11+
  • A local LLM server exposing an OpenAI‑compatible endpoint at /v1/chat/completions (e.g., llama.cpp server or LM Studio)
  • (Optional) API keys for web search:
    • TAVILY_API_KEY (recommended)
    • SERPAPI_API_KEY (optional)

🛠️ Install

# 1) Create & activate a virtual environment
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate

# 2) Install dependencies
pip install -r requirements.txt

🧪 Configure

Create a .env file in the project root:

# Upstream LLM
UPSTREAM_TYPE=llama                    # llama | lmstudio (label only)
UPSTREAM_URL=http://127.0.0.1:8080     # where your local /v1/chat/completions lives
DEFAULT_MODEL=gemma-3-12b-it-ud@q8_k_xl

# Context & time
CONTEXT_BUDGET_CHARS=7000
LOCAL_TZ=Europe/Zurich                 # IANA TZ (fallback: UTC)

# Networking
REQUEST_TIMEOUT=20                     # seconds

# Research (optional but recommended)
TAVILY_API_KEY=                        # recommended; get a key from Tavily
SERPAPI_API_KEY=                       # optional
COUNTRY=CH                             # ISO-3166 alpha-2 (for news prioritization)
FETCH_CONCURRENCY=6

# Multilingual signals
LANGUAGE_FILE=./languages.json         # optional override; auto-reloads on change

🚀 Run

Either run the built‑in launcher:

python app.py --host 0.0.0.0 --port 5100 --reload

Or use uvicorn directly:

uvicorn app:app --host 0.0.0.0 --port 5100 --reload

Check health:

curl http://localhost:5100/health

💻 Configuration

| Variable | Default | Purpose |
|---|---|---|
| UPSTREAM_TYPE | llama | Label to indicate the upstream kind (llama / lmstudio). |
| UPSTREAM_URL | http://127.0.0.1:8080 | Base URL of your local OpenAI‑compatible server. |
| DEFAULT_MODEL | gemma-3-12b-it-ud@q8_k_xl | Model name sent upstream if the request omits model. |
| CONTEXT_BUDGET_CHARS | 7000 | Max characters allocated for the generated <<<CONTEXT>>> block. |
| LOCAL_TZ | Europe/Zurich | IANA timezone for the local time anchor (falls back to UTC). |
| REQUEST_TIMEOUT | 20 | Network timeout (seconds) for upstream & fetching. |
| TAVILY_API_KEY | | Enables Tavily search. |
| SERPAPI_API_KEY | | Enables Google via SerpAPI. |
| COUNTRY | | ISO‑3166 country code for country‑aware news boosts (e.g., CH, DE, US). |
| FETCH_CONCURRENCY | 6 | Max concurrent page fetches during extraction. |
| LANGUAGE_FILE | ./languages.json | Optional path override for multilingual signal lists; auto‑reloads (checked ~30s). |

📣 API

GET /health

Returns basic status and time anchor fields.

{
  "status": "ok",
  "ts_utc": "2025-11-18T09:10:11Z",
  "today_local": "2025-11-18",
  "tz": "Europe/Zurich",
  "upstream": "llama",
  "url": "http://127.0.0.1:8080"
}

POST /relay (streaming)

Accepts an OpenAI‑style chat body and returns SSE with upstream tokens.
Body schema (subset):

{
  "model": "gemma-3-12b-it-ud@q8_k_xl",
  "messages": [{"role": "user", "content": "What's new in Python 3.12?"}],
  "temperature": 0.7,
  "top_p": 0.95,
  "top_k": 60,
  "presence_penalty": 0.0,
  "frequency_penalty": 0.0,
  "stream": true,
  "char_budget": 7000
}

char_budget overrides the server’s CONTEXT_BUDGET_CHARS for this call.

cURL example (SSE):

curl -N http://localhost:5100/relay \
  -H "Content-Type: application/json" \
  -H "Accept: text/event-stream" \
  -d '{
        "model": "gemma-3-12b-it-ud@q8_k_xl",
        "messages": [{"role":"user","content":"Dame los titulares más recientes sobre baterías cuánticas."}],
        "stream": true
      }'

The server will stream lines like:

data: {"id":"...","object":"chat.completion.chunk","model":"...","choices":[{"delta":{"content":"..."}}]}

... (more chunks) ...

data: [DONE]
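
For programmatic use, here is a minimal Python client sketch using httpx (already a project dependency). The endpoint and payload follow the body schema above, and the parsing assumes the standard OpenAI chunk shape shown in the stream example:

import json
import httpx

def stream_relay(prompt: str, url: str = "http://localhost:5100/relay") -> None:
    """Print tokens from the relay's SSE stream as they arrive."""
    # "model" may be omitted; the relay falls back to DEFAULT_MODEL.
    payload = {"messages": [{"role": "user", "content": prompt}], "stream": True}
    headers = {"Accept": "text/event-stream"}
    with httpx.stream("POST", url, json=payload, headers=headers, timeout=None) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line.startswith("data: "):
                continue  # skip blank lines between events
            data = line[len("data: "):]
            if data == "[DONE]":
                break
            delta = json.loads(data)["choices"][0].get("delta", {})
            print(delta.get("content", ""), end="", flush=True)

stream_relay("What's new in Python 3.13?")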

POST /relay_once (non‑streaming)

Returns a single JSON completion after the upstream finishes.

cURL example:

curl http://localhost:5100/relay_once \
  -H "Content-Type: application/json" \
  -d '{
        "model": "gemma-3-12b-it-ud@q8_k_xl",
        "messages": [{"role":"user","content":"Fasse RFC 9457 in zwei Sätzen zusammen."}],
        "stream": false
      }'

Response (shape):

{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "…final answer text…"
      }
    }
  ]
}
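
The same call from Python, as a minimal sketch with httpx:

import httpx

resp = httpx.post(
    "http://localhost:5100/relay_once",
    json={
        "messages": [{"role": "user", "content": "Summarize RFC 9457 in two sentences."}],
        "stream": False,
    },
    timeout=60.0,  # non-streaming calls wait for the full completion
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])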

How web research & context injection works

When the latest user message contains temporal/news/price/how‑to cues (in any supported language), the relay:

  1. Expands queries (adds variants like “current …”, “latest …”, “… 2025”, “… tutorial”).
  2. Searches multiple providers (Tavily and/or SerpAPI) and merges results.
  3. Deduplicates by normalized URL and fuzzy title per domain.
  4. Fetches pages concurrently and extracts clean text via trafilatura → readability → BeautifulSoup (publish dates are extracted from common meta tags where available).
  5. Reranks using BM25 (on extracted text/snippets) + recency decay (half‑life ~30 days) + provider/domain quality (including a small country‑aware news bonus) and domain diversity caps (see the scoring sketch after this list).
  6. Builds a compact <<<CONTEXT>>> block (top ~5 sources): title, domain, detected publish date, 1–3 key bullets per source, canonical URL — plus explicit instructions for the model to cite [1], [2], … and to treat “today/currently/now” relative to the provided time anchor (UTC + local TZ).
  7. Merges messages: the context block is prepended to the last user message; a system line with guidance and the time anchor is injected ahead of the conversation (sketched below).
  8. Streams upstream: request is forwarded to your local model with stream=true, and chunks are passed through unchanged.
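
As a rough picture of step 5, here is a minimal Python sketch. The ~30‑day half‑life matches the description above, but the blend weights, the neutral weight for undated pages, and the helper names are illustrative assumptions (the actual scoring lives in research.py):

from datetime import datetime, timezone

HALF_LIFE_DAYS = 30.0  # recency half-life from the pipeline description

def recency_weight(published: datetime | None, now: datetime) -> float:
    """Exponential decay: a 30-day-old source weighs half as much as a fresh one."""
    if published is None:
        return 0.5  # assumed neutral weight for pages without a detected date
    age_days = max((now - published).total_seconds() / 86400.0, 0.0)
    return 0.5 ** (age_days / HALF_LIFE_DAYS)

def combined_score(bm25: float, published: datetime | None, domain_bonus: float = 0.0) -> float:
    """Blend relevance, freshness, and domain quality (weights are assumptions)."""
    now = datetime.now(timezone.utc)
    return 0.6 * bm25 + 0.3 * recency_weight(published, now) + 0.1 * domain_bonus

With these example weights, a source published 30 days ago loses 0.15 of its score versus a fresh one, which a stronger BM25 match can easily offset.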

If research fails (e.g., a provider is down), the relay still answers without web‑derived context; a short failure note is embedded inside the <<<CONTEXT>>> section for transparency.
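
To make steps 6–7 concrete, a minimal sketch of the time anchor and message merge. The exact anchor wording and message shapes here are assumptions; the real versions live in app.py:

from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def build_anchor(tz_name: str = "Europe/Zurich") -> str:
    """Deterministic time anchor (UTC + local TZ); wording is illustrative."""
    now_utc = datetime.now(timezone.utc)
    now_loc = now_utc.astimezone(ZoneInfo(tz_name))
    return (f"Current time: {now_utc:%Y-%m-%dT%H:%M:%SZ} UTC "
            f"({now_loc:%Y-%m-%d %H:%M} {tz_name}). "
            "Interpret 'today', 'now', and 'currently' relative to this anchor.")

def merge_messages(messages: list[dict], context_block: str, tz_name: str) -> list[dict]:
    """Prepend the <<<CONTEXT>>> block to the last user message and inject
    a system line ahead of the conversation (shapes are illustrative)."""
    merged = [{"role": "system", "content": build_anchor(tz_name)}]
    merged += [dict(m) for m in messages]
    for msg in reversed(merged):
        if msg["role"] == "user":
            msg["content"] = f"{context_block}\n\n{msg['content']}"
            break
    return merged

On a research failure, context_block would carry only the short failure note mentioned above.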


Multilingual behavior

  • The need_web(...) heuristic recognizes recency/how‑to cues in 30 languages via languages.json. Any 4‑digit year (20xx) is treated as a weak recency cue.
  • The lists auto‑reload if the file changes (checked roughly every 30s). You can point to a custom file via LANGUAGE_FILE=path/to/your.json.
  • Matching is substring‑based on lower‑cased input, making it robust across scripts and diacritics.
  • You can extend the lists by adding entries under recency / howto for each language code. A minimal shape:
{
  "metadata": {"version": 1, "updated": "2025-11-19"},
  "languages": [{"code": "de", "name": "German"}, {"code": "es", "name": "Spanish"}],
  "recency": {
    "de": ["heute","aktuell","neueste","preis","gesetz"],
    "es": ["hoy","últimas","precio","ley","calendario"]
  },
  "howto": {
    "de": ["anleitung","leitfaden","wie","schritt für schritt"],
    "es": ["cómo","guía","tutorial","paso a paso"]
  }
}
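
Putting the pieces together, an illustrative re‑implementation of the trigger check: lower‑cased substring matching over the recency/howto lists, plus the weak 20xx year cue. The real version in lang_signals.py also hot‑reloads the file, so the shape below is an assumption:

import json
import re

def need_web(text: str, signals_path: str = "./languages.json") -> bool:
    """Return True if the message carries a recency/how-to cue or a 4-digit year."""
    lowered = text.lower()
    with open(signals_path, encoding="utf-8") as f:
        signals = json.load(f)
    for section in ("recency", "howto"):
        for phrases in signals.get(section, {}).values():
            if any(phrase in lowered for phrase in phrases):
                return True
    return bool(re.search(r"\b20\d{2}\b", text))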

Examples

  • “¿Qué hay de nuevo en Python 3.13?” (“What's new in Python 3.13?”) → recency + year signal → research enabled.
  • “Wie installiere ich Poetry unter Windows?” (“How do I install Poetry on Windows?”) → how‑to signal → research enabled.
  • “Expliquez-moi OpenTelemetry en deux phrases.” (“Explain OpenTelemetry to me in two sentences.”) → no recency/how‑to → no research.

🫂 Security & deployment notes

  • Auth: The relay ships without authentication. Place it behind a reverse proxy (e.g., Traefik / NGINX) and enforce auth/TLS as needed.
  • CORS: Add CORS middleware if you call it from browsers (see the snippet after this list).
  • Timeouts: Tune REQUEST_TIMEOUT for both upstream and page fetching; default is conservative.
  • Rate limiting: Consider a proxy‑level limiter to protect your upstream.
  • Observability: Add structured logging and tracing around /relay and upstream calls in production.
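
For the CORS point above, FastAPI's stock middleware is enough; the origins below are placeholders:

from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()  # in practice, the existing instance in app.py

app.add_middleware(
    CORSMiddleware,
    allow_origins=["https://your-frontend.example"],  # restrict to your frontend
    allow_methods=["GET", "POST"],
    allow_headers=["*"],
)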

🛠️ Troubleshooting

  • Upstream errors / connection refused
    Ensure UPSTREAM_URL points to a live server that implements /v1/chat/completions. Test with a minimal POST (example after this list).
  • No streaming
    Use curl -N and include Accept: text/event-stream. Proxies may buffer SSE; disable buffering where applicable.
  • Research never triggers
    Verify LANGUAGE_FILE is readable and your prompt contains recency/how‑to cues in any supported language, or provide TAVILY_API_KEY / SERPAPI_API_KEY.
  • Empty or low‑quality extractions
    Some sites block scraping or use heavy JS. The pipeline gracefully falls back (readability → BS4), but sources may be skipped if content < 200 chars.
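
A minimal upstream POST for the first item, assuming the default UPSTREAM_URL (some servers also require a model field):

curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"ping"}],"stream":false}'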

💙 Contributing

Issues and PRs are welcome. Please keep changes small and well‑documented. Suggested areas:

  • Provider adapters (additional search engines)
  • Smarter date extraction and language detection
  • Pluggable ranking & diversity rules
  • Observability, metrics, and tests

📄 License (Summary)

Aurelia Web Relay is licensed under the Aurelia Web Relay License (AWRL).

You may:

  • ✅ Use, modify, and share the software for non-commercial purposes only
  • ✅ Fork, study, and run it locally
  • ✅ Build non-commercial tools or demos based on it

You may not:

  • ❌ Use it in any commercial, for-profit, or monetized setting
  • ❌ Offer it as a service (SaaS, hosting, API, chatbot, etc.)
  • ❌ Integrate it into paid products, platforms, or enterprise workflows

To use Aurelia Web Relay commercially, you must obtain a separate written license.
→ Contact: legal@samedia.app

Read the full license here: LICENSE.md
