FastAPI gateway for local LLMs with on‑demand web research, time‑anchored context, multilingual trigger detection, and SSE streaming.
It accepts an OpenAI‑style chat payload, optionally performs targeted web research, injects a compact source‑cited context block, and forwards the request to a local upstream (e.g., llama.cpp or LM Studio) while passing SSE tokens through with minimal latency.
Built for Python 3.11+.
- Why
- Features
- Architecture
- Quickstart
- Configuration
- API
- How web research & context injection works
- Multilingual behavior
- Security & deployment notes
- Troubleshooting
- Contributing
- License
Local LLMs are fast and private, but they often lack a recency layer and a consistent time anchor. Aurelia Web Relay adds both without changing your client: it decides when web search is useful, collects and ranks sources, builds a tight <<<CONTEXT>>> block with citations and dates, and forwards your request to a local model over an OpenAI‑compatible /v1/chat/completions upstream. The client simply talks to /relay and receives SSE chunks in real time.
- Drop‑in gateway for local LLMs — forwards OpenAI‑style chat payloads to an upstream (e.g., llama.cpp, LM Studio), preserving parameters like `temperature`, `top_p`, `top_k`, and penalties.
- On‑demand web research — a heuristic (`need_web`) detects time/news/how‑to queries and triggers a research pipeline (Tavily + SerpAPI, page fetching, content extraction, ranking).
- Multilingual trigger detection (30 languages) — recency/how‑to cues are recognized in many languages (e.g., de, fr, es, zh, ar, hi, …). The phrase lists are maintained in `languages.json`, hot‑reloaded at runtime, and any 4‑digit year like `2025` is treated as a weak recency signal.
- Time‑anchored system guidance — inserts a deterministic date/time anchor (UTC + local TZ) so “today/now/currently” are always unambiguous (a sketch follows after this list).
- Streaming, end‑to‑end — Server‑Sent Events (`text/event-stream`) are passed through 1:1 from the upstream to your client.
- Country‑aware news prioritization — trusted outlets for your configured country get a small ranking bonus; reliable global outlets are preferred by default.
- Zero lock‑in — pure FastAPI/uvicorn + httpx; no proprietary SDKs.
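To make the time anchor concrete, here is a hypothetical sketch of what such a deterministic anchor could look like; the exact wording `app.py` injects may differ:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Hypothetical sketch of a deterministic time anchor (UTC + local TZ);
# the actual string app.py injects may be worded differently.
def time_anchor(tz_name: str = "Europe/Zurich") -> str:
    now_utc = datetime.now(timezone.utc)
    now_local = now_utc.astimezone(ZoneInfo(tz_name))
    return (
        f"Current UTC time: {now_utc:%Y-%m-%d %H:%M}Z. "
        f"Local time ({tz_name}): {now_local:%Y-%m-%d %H:%M}. "
        "Interpret 'today', 'now', and 'currently' relative to this anchor."
    )
```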
sequenceDiagram
autonumber
participant Client
participant Relay as Web Relay (FastAPI)
participant R as ResearchOrchestrator
participant Providers as Tavily / SerpAPI
participant Fetch as Extractor (trafilatura / readability / BS4)
participant Upstream as Local LLM (/v1/chat/completions)
Client->>Relay: POST /relay (chat payload, stream=true)
alt need_web(query) is true
Relay->>R: research_and_digest(query)
R->>Providers: search (expanded queries)
Providers-->>R: candidate links
R->>Fetch: fetch_and_extract(url...) (concurrent)
Fetch-->>R: clean text + publish dates
R-->>Relay: ranked digest + <<<CONTEXT>>>
end
Relay->>Upstream: POST /v1/chat/completions (stream=true)
Upstream-->>Relay: SSE chunks (data: {...})
Relay-->>Client: SSE chunks (passed through)
Code map
- `app.py` — FastAPI app, multilingual `need_web` heuristic, time‑anchor context, endpoint handlers (`/health`, `/relay`, `/relay_once`), and the SSE streaming response.
- `research.py` — query expansion, multi‑provider search, dedupe, BM25 + recency + domain scoring, country‑aware boosts, digest & `<<<CONTEXT>>>` builder.
- `extract.py` — robust page fetching and article extraction (trafilatura → readability → BS4), light publish‑date detection.
- `llm_client.py` — thin SSE client that forwards upstream `/v1/chat/completions` streams exactly as received.
- `lang_signals.py` + `languages.json` — multilingual phrase lists (recency/how‑to) with hot‑reload and `LANGUAGE_FILE` override.
- Python 3.11+
- A local LLM server exposing an OpenAI‑compatible endpoint at `/v1/chat/completions` (e.g., `llama.cpp` server or LM Studio)
- (Optional) API keys for web search: `TAVILY_API_KEY` (recommended), `SERPAPI_API_KEY` (optional)
# 1) Create & activate a virtual environment
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
# 2) Install dependencies
pip install -r requirements.txt

Create a `.env` file in the project root:
# Upstream LLM
UPSTREAM_TYPE=llama # llama | lmstudio (label only)
UPSTREAM_URL=http://127.0.0.1:8080 # where your local /v1/chat/completions lives
DEFAULT_MODEL=gemma-3-12b-it-ud@q8_k_xl
# Context & time
CONTEXT_BUDGET_CHARS=7000
LOCAL_TZ=Europe/Zurich # IANA TZ (fallback: UTC)
# Networking
REQUEST_TIMEOUT=20 # seconds
# Research (optional but recommended)
TAVILY_API_KEY=                      # recommended; get a key from Tavily
SERPAPI_API_KEY=                     # optional second provider (Google via SerpAPI)
COUNTRY=CH # ISO-3166 alpha-2 (for news prioritization)
FETCH_CONCURRENCY=6
# Multilingual signals
LANGUAGE_FILE=./languages.json       # optional override; auto-reloads on change

Either run the built‑in launcher:
python app.py --host 0.0.0.0 --port 5100 --reload

Or use uvicorn directly:
uvicorn app:app --host 0.0.0.0 --port 5100 --reload

Check health:
curl http://localhost:5100/health

| Variable | Default | Purpose |
|---|---|---|
| `UPSTREAM_TYPE` | `llama` | Label to indicate the upstream kind (llama / lmstudio). |
| `UPSTREAM_URL` | `http://127.0.0.1:8080` | Base URL of your local OpenAI‑compatible server. |
| `DEFAULT_MODEL` | `gemma-3-12b-it-ud@q8_k_xl` | Model name sent upstream if the request omits `model`. |
| `CONTEXT_BUDGET_CHARS` | `7000` | Max characters allocated for the generated `<<<CONTEXT>>>` block. |
| `LOCAL_TZ` | `Europe/Zurich` | IANA timezone for the local time anchor (falls back to UTC). |
| `REQUEST_TIMEOUT` | `20` | Network timeout (seconds) for upstream requests & page fetching. |
| `TAVILY_API_KEY` | — | Enables Tavily search. |
| `SERPAPI_API_KEY` | — | Enables Google search via SerpAPI. |
| `COUNTRY` | — | ISO‑3166 alpha‑2 country code for country‑aware news boosts (e.g., CH, DE, US). |
| `FETCH_CONCURRENCY` | `6` | Max concurrent page fetches during extraction. |
| `LANGUAGE_FILE` | `./languages.json` | Optional path override for multilingual signal lists; auto‑reloads (checked ~every 30s). |
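As a hypothetical illustration of these semantics (not the actual loader in `app.py`), the variables could be read like this, including the documented UTC fallback:

```python
import os
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError

# Hypothetical mirror of the settings above; names match the table,
# and LOCAL_TZ falls back to UTC as documented.
UPSTREAM_URL = os.getenv("UPSTREAM_URL", "http://127.0.0.1:8080")
DEFAULT_MODEL = os.getenv("DEFAULT_MODEL", "gemma-3-12b-it-ud@q8_k_xl")
CONTEXT_BUDGET_CHARS = int(os.getenv("CONTEXT_BUDGET_CHARS", "7000"))
REQUEST_TIMEOUT = float(os.getenv("REQUEST_TIMEOUT", "20"))
FETCH_CONCURRENCY = int(os.getenv("FETCH_CONCURRENCY", "6"))

try:
    LOCAL_TZ = ZoneInfo(os.getenv("LOCAL_TZ", "Europe/Zurich"))
except ZoneInfoNotFoundError:
    LOCAL_TZ = ZoneInfo("UTC")  # documented fallback
```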
GET /health returns basic status and time‑anchor fields.
{
"status": "ok",
"ts_utc": "2025-11-18T09:10:11Z",
"today_local": "2025-11-18",
"tz": "Europe/Zurich",
"upstream": "llama",
"url": "http://127.0.0.1:8080"
}

POST /relay accepts an OpenAI‑style chat body and returns SSE with upstream tokens.
Body schema (subset):
{
"model": "gemma-3-12b-it-ud@q8_k_xl",
"messages": [{"role": "user", "content": "What's new in Python 3.12?"}],
"temperature": 0.7,
"top_p": 0.95,
"top_k": 60,
"presence_penalty": 0.0,
"frequency_penalty": 0.0,
"stream": true,
"char_budget": 7000
}

`char_budget` overrides the server's `CONTEXT_BUDGET_CHARS` for this call.
cURL example (SSE):
curl -N http://localhost:5100/relay -H "Content-Type: application/json" -H "Accept: text/event-stream" -d '{
"model": "gemma-3-12b-it-ud@q8_k_xl",
"messages": [{"role":"user","content":"Dame los titulares más recientes sobre baterías cuánticas."}],
"stream": true
}'

The server will stream lines like:
data: {"id":"...","object":"chat.completion.chunk","model":"...","choices":[{"delta":{"content":"..."}}]}
... (more chunks) ...
data: [DONE]
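For programmatic clients, here is a minimal sketch of consuming this stream with httpx (which the relay itself depends on); the relay URL and model name match the examples above:

```python
import json

import httpx

payload = {
    "model": "gemma-3-12b-it-ud@q8_k_xl",
    "messages": [{"role": "user", "content": "What's new in Python 3.12?"}],
    "stream": True,
}

# Stream SSE chunks from the relay and print tokens as they arrive.
with httpx.stream(
    "POST",
    "http://localhost:5100/relay",
    json=payload,
    headers={"Accept": "text/event-stream"},
    timeout=None,  # keep the connection open until [DONE]
) as response:
    for line in response.iter_lines():
        if not line.startswith("data: "):
            continue  # skip blank separator lines
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0].get("delta", {})
        print(delta.get("content", ""), end="", flush=True)
```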
POST /relay_once returns a single JSON completion after the upstream finishes.
cURL example:
curl http://localhost:5100/relay_once -H "Content-Type: application/json" -d '{
"model": "gemma-3-12b-it-ud@q8_k_xl",
"messages": [{"role":"user","content":"Fasse RFC 9457 in zwei Sätzen zusammen."}],
"stream": false
}'

Response (shape):
{
"choices": [
{
"message": {
"role": "assistant",
"content": "…final answer text…"
}
}
]
}

When the latest user message contains temporal/news/price/how‑to cues (in any supported language), the relay:
- Expands queries (adds variants like “current …”, “latest …”, “… 2025”, “… tutorial”).
- Searches multiple providers (Tavily and/or SerpAPI) and merges results.
- Deduplicates by normalized URL and fuzzy title per domain.
- Fetches pages concurrently and extracts clean text via trafilatura → readability → BeautifulSoup (publish dates are extracted from common meta tags where available).
- Reranks using BM25 (on extracted text/snippets) + recency decay (half‑life ~30 days) + provider/domain quality (including a small country‑aware news bonus) and domain diversity caps (a sketch of this scoring follows below).
- Builds a compact `<<<CONTEXT>>>` block (top ~5 sources): title, domain, detected publish date, 1–3 key bullets per source, canonical URL — plus explicit instructions for the model to cite `[1]`, `[2]`, … and to treat “today/currently/now” relative to the provided time anchor (UTC + local TZ).
- Merges messages: the context block is prepended to the last user message; a system line with guidance and the time anchor is injected ahead of the conversation.
- Streams upstream: the request is forwarded to your local model with `stream=true`, and chunks are passed through unchanged.

If research fails (e.g., a provider is down), the relay still answers without context; a failure note is embedded inside the `<<<CONTEXT>>>` section for transparency.
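As a rough sketch of the ranking step above (hypothetical names and weights; the real logic lives in `research.py`):

```python
from datetime import datetime, timezone

HALF_LIFE_DAYS = 30.0  # recency half-life noted above

def recency_weight(published: datetime | None, now: datetime) -> float:
    """Exponential decay: a 30-day-old source weighs half as much as a fresh one.

    `published` is assumed to be timezone-aware (UTC).
    """
    if published is None:
        return 0.5  # assumption: undated sources get a neutral weight
    age_days = max((now - published).total_seconds() / 86400.0, 0.0)
    return 0.5 ** (age_days / HALF_LIFE_DAYS)

def combined_score(bm25: float, published: datetime | None,
                   domain_quality: float, country_bonus: float = 0.0) -> float:
    """BM25 relevance scaled by freshness, plus domain-quality terms."""
    now = datetime.now(timezone.utc)
    return bm25 * recency_weight(published, now) + domain_quality + country_bonus
```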
- The `need_web(...)` heuristic recognizes recency/how‑to cues in 30 languages via `languages.json`. Any 4‑digit year (`20xx`) is treated as a weak recency cue.
- The lists auto‑reload if the file changes (checked roughly every 30s). You can point to a custom file via `LANGUAGE_FILE=path/to/your.json`.
- Matching is substring‑based on lower‑cased input, making it robust across scripts and diacritics (a sketch follows after the examples below).
- You can extend the lists by adding entries under `recency`/`howto` for each language code. A minimal shape:
```json
{
  "metadata": { "version": 1, "updated": "2025-11-19" },
  "languages": [
    { "code": "de", "name": "German" },
    { "code": "es", "name": "Spanish" }
  ],
  "recency": {
    "de": ["heute", "aktuell", "neueste", "preis", "gesetz"],
    "es": ["hoy", "últimas", "precio", "ley", "calendario"]
  },
  "howto": {
    "de": ["anleitung", "leitfaden", "wie", "schritt für schritt"],
    "es": ["cómo", "guía", "tutorial", "paso a paso"]
  }
}
```
Examples
- “¿Qué hay de nuevo en Python 3.13?” (“What’s new in Python 3.13?”) → recency signal → research enabled.
- “Wie installiere ich Poetry unter Windows?” (“How do I install Poetry on Windows?”) → how‑to signal → research enabled.
- “Expliquez-moi OpenTelemetry en deux phrases.” (“Explain OpenTelemetry to me in two sentences.”) → no recency/how‑to cue → no research.
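A sketch of the substring matching (hypothetical; the real `need_web()` in `app.py` also handles hot‑reload and weighs weak signals differently):

```python
import json

def need_web(text: str, signals: dict) -> bool:
    lowered = text.lower()
    # Substring match against every language's recency and how-to phrase lists.
    for lists in (signals.get("recency", {}), signals.get("howto", {})):
        for phrases in lists.values():
            if any(phrase in lowered for phrase in phrases):
                return True
    # Simplified: treat any 4-digit year 20xx as a (weak) recency cue.
    return any(f"20{i:02d}" in lowered for i in range(100))

with open("languages.json", encoding="utf-8") as f:
    signals = json.load(f)

print(need_web("¿Cuál es el precio del bitcoin hoy?", signals))  # True ("precio", "hoy")
```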
- Auth: The relay ships without authentication. Place it behind a reverse proxy (e.g., Traefik / NGINX) and enforce auth/TLS as needed.
- CORS: Add CORS middleware if you call the relay from browsers (a minimal sketch follows after this list).
- Timeouts: Tune `REQUEST_TIMEOUT` for both upstream requests and page fetching; the default is conservative.
- Rate limiting: Consider a proxy‑level limiter to protect your upstream.
- Observability: Add structured logging and tracing around `/relay` and upstream calls in production.
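For the CORS note above, a minimal sketch (the allowed origin is a placeholder; `app` is the FastAPI instance from `app.py`, as in the `uvicorn app:app` command):

```python
from fastapi.middleware.cors import CORSMiddleware

from app import app  # the FastAPI instance started by `uvicorn app:app`

app.add_middleware(
    CORSMiddleware,
    allow_origins=["https://your-frontend.example"],  # placeholder origin
    allow_methods=["GET", "POST"],
    allow_headers=["*"],
)
```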
- Upstream errors / connection refused: Ensure `UPSTREAM_URL` points to a live server that implements `/v1/chat/completions`. Test with a minimal POST.
- No streaming: Use `curl -N` and include `Accept: text/event-stream`. Proxies may buffer SSE; disable buffering where applicable.
- Research never triggers: Verify that `LANGUAGE_FILE` is readable, that your prompt contains recency/how‑to cues in a supported language, and that at least one of `TAVILY_API_KEY`/`SERPAPI_API_KEY` is set.
- Empty or low‑quality extractions: Some sites block scraping or rely on heavy JavaScript. The pipeline falls back gracefully (readability → BS4), but sources may be skipped if the extracted content is under 200 characters (a sketch of the fallback chain follows below).
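A sketch of that fallback chain (hypothetical helper; the real code in `extract.py` may differ):

```python
import trafilatura
from bs4 import BeautifulSoup
from readability import Document  # readability-lxml

MIN_CHARS = 200  # sources below this are skipped, per the note above

def extract_text(html: str, url: str) -> str | None:
    # 1) trafilatura: best structured extraction for most articles
    text = trafilatura.extract(html, url=url) or ""
    if len(text) < MIN_CHARS:
        # 2) readability: isolate the main content block, then strip tags
        summary_html = Document(html).summary()
        text = BeautifulSoup(summary_html, "html.parser").get_text(" ", strip=True)
    if len(text) < MIN_CHARS:
        # 3) BS4: last resort, plain text of the whole page
        text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
    return text if len(text) >= MIN_CHARS else None
```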
Issues and PRs are welcome. Please keep changes small and well‑documented. Suggested areas:
- Provider adapters (additional search engines)
- Smarter date extraction and language detection
- Pluggable ranking & diversity rules
- Observability, metrics, and tests
Aurelia Web Relay is licensed under the Aurelia Web Relay License (AWRL).
You may:
- ✅ Use, modify, and share the software for non-commercial purposes only
- ✅ Fork, study, and run it locally
- ✅ Build non-commercial tools or demos based on it
You may not:
- ❌ Use it in any commercial, for-profit, or monetized setting
- ❌ Offer it as a service (SaaS, hosting, API, chatbot, etc.)
- ❌ Integrate it into paid products, platforms, or enterprise workflows
To use Aurelia Web Relay commercially, you must obtain a separate written license.
→ Contact: legal@samedia.app
Read the full license here: LICENSE.md
{ "metadata": {"version": 1, "updated": "2025-11-19"}, "languages": [{"code": "de", "name": "German"}, {"code": "es", "name": "Spanish"}], "recency": { "de": ["heute","aktuell","neueste","preis","gesetz"], "es": ["hoy","últimas","precio","ley","calendario"] }, "howto": { "de": ["anleitung","leitfaden","wie","schritt für schritt"], "es": ["cómo","guía","tutorial","paso a paso"] } }