Synaptic Chain‑of‑Thought (SyCoT) - Preview


A Flask web app showcasing a “self-RAG” memory layer for reasoning language models (RLMs). It archives task-conclusion pairs, retrieves them via semantic similarity and thematic clustering, and improves answer consistency across sessions. This preview demonstrates the core concept through a simple user interface.

Why this exists

Modern reasoning models produce long chains of thought (CoT) to solve problems. These CoT traces generate many intermediate “reasoning tokens” that drive the model toward a final answer. In most setups, those reasoning tokens are ephemeral: once the answer is produced, they are discarded. This creates three issues:

  • Latency and cost: longer CoT → more tokens → higher inference time and compute.
  • Waste: the work done to reach a correct conclusion is thrown away, even if a very similar task appears later.
  • Inconsistency: without persistent memory, the same or related tasks can yield different conclusions across sessions.

SyCoT turns that discarded effort into a reusable asset. Instead of saving the raw CoT tokens, SyCoT persistently archives the conclusion along with a sentence embedding of the task. When a new task arrives, SyCoT:

  • Reuses a high‑similarity conclusion instantly (no model call), or
  • Primes the model with concise, related conclusions retrieved by similarity and by thematic clustering.

This keeps CoT benefits while avoiding repeated long reasoning. Result: faster, cheaper, more consistent answers, with a small, transparent memory that improves over time.
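The archive-and-reuse idea can be sketched as follows, using toy 3-dimensional embeddings and cosine similarity (the real app uses 384-dimensional all-MiniLM-L6-v2 embeddings and a compressed on-disk archive; the function name `reuse_or_context`, the entries, and the thresholds shown inline are illustrative):

```python
import numpy as np

# Each archived entry pairs a task embedding with its final conclusion.
archive = [
    {"task": "What is 2 + 2?", "embedding": np.array([1.0, 0.0, 0.0]), "conclusion": "4"},
    {"task": "Explain recursion", "embedding": np.array([0.0, 1.0, 0.0]),
     "conclusion": "A function that calls itself on smaller inputs."},
]

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def reuse_or_context(query_emb, high=0.8, low=0.5):
    """Return an archived conclusion directly, or related conclusions as context."""
    scored = sorted(((cosine(query_emb, e["embedding"]), e) for e in archive),
                    key=lambda s: s[0], reverse=True)
    best_score, best = scored[0]
    if best_score >= high:                      # near-duplicate task: skip the model call
        return {"reuse": best["conclusion"]}
    context = [e["conclusion"] for s, e in scored if s >= low]
    return {"context": context}                 # prime the model with related conclusions
```

A query embedding very close to an archived one returns its conclusion with no model call; a moderately similar one only collects context.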

How it works

Request lifecycle:

  1. Input and session: The browser UI posts your task to /chat.
  2. Decision gate: A lightweight DecisionEngine checks if memory should be used (e.g., questions, short prompts, or follow‑ups → use memory).
  3. Embedding and retrieval: The task is embedded with all‑MiniLM‑L6‑v2 and compared against the compressed on‑disk archive (archive.json.gz).
    • If a very similar task exists, SyCoT returns its archived conclusion immediately with near‑zero latency.
    • Otherwise, SyCoT collects moderately similar conclusions to use as context.
  4. Thematic clustering: K‑Means groups tasks into themes. SyCoT also adds conclusions from the nearest cluster to broaden context.
  5. Prompt assembly: A concise prompt is built from these prior conclusions plus your current task. Coding mode adjusts thresholds and prompt tone.
  6. Model call: The prompt is sent to the NVIDIA Nemotron model via an OpenAI‑compatible API.
  7. Archive and respond: If a final conclusion is detected and auto‑archive is enabled, it is saved to the archive for future reuse, and a polished answer is returned to the UI.

Conceptually, this mimics synaptic consolidation: conclusions are encoded, strengthened by recurrence (similarity), and organised into themes (clusters) that support transfer and generalisation.
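Step 2’s decision gate can be sketched as a simple heuristic (a minimal illustration; the actual rules in decision_engine.py may differ, and the function name and word-count cutoff are assumptions):

```python
def should_use_memory(task: str, is_follow_up: bool = False) -> bool:
    """Heuristic gate: use memory for questions, short prompts, and follow-ups."""
    task = task.strip()
    if is_follow_up:
        return True                 # follow-ups likely relate to archived conclusions
    if task.endswith("?"):
        return True                 # explicit questions benefit from prior answers
    if len(task.split()) <= 12:
        return True                 # short prompts are often recurring tasks
    return False
```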

Positioning

  • Not a new CoT algorithm: SyCoT does not alter a model’s intrinsic chain‑of‑thought algorithm or decoding. It adds a dynamic memory and retrieval layer that supports the reasoning process.
  • Context awareness: Retrieves similar and thematic conclusions (via clustering) to ground responses.
  • Efficiency: On high‑similarity matches, returns conclusions instantly (no model call); otherwise seeds the prompt with concise prior conclusions.

User experience

  • Instant replies on repeats: On high‑similarity matches, SyCoT returns the archived conclusion immediately (no model call).
  • Lower latency and token use: Related tasks reuse prior conclusions or short context instead of long CoT.
  • More consistent answers: Memory reduces drift across sessions.
  • Coding mode: Simple toggle that tunes thresholds and prompt tone for code tasks.

How SyCoT compares to other approaches

  • Faithful Chain‑of‑Thought Reasoning (arXiv 2301.13379)

    • Emphasises faithfulness via symbolic chains and deterministic solvers.
    • SyCoT is complementary: memory improves consistency/efficiency; symbolic layers can be added for verification when needed.
  • A‑Thought: Efficient Reasoning via Bidirectional Compression for Low‑Resource Settings (arXiv 2505.24550)

    • Compresses/searches reasoning paths for efficiency in constrained budgets.
    • SyCoT reduces repeated search by reusing conclusions and providing thematic retrieval for related intents.
  • Resource‑Budgeted Adaptive CoT (arXiv 2505.11896)

    • Adapts CoT depth/length under token/latency budgets with early‑exit style policies.
    • SyCoT is complementary: its memory layer can return instant, zero‑token answers for recurrent queries and only fall back to adaptive CoT when novel reasoning is required.

SyCoT can be used alongside other approaches to maximise efficiency or accuracy by leveraging its memory layer for instant answers on recurring tasks while integrating with adaptive or symbolic methods for novel problems.

In short, many newer methods optimise whether/how much to think now; SyCoT additionally optimises by remembering and reusing what has already been thought, yielding consistent answers, lower latency, and reduced compute for recurring or thematically related tasks.

What’s in this preview

  • preview.py: Flask server and JSON endpoints
  • SyCoT.py: core archive, retrieval, clustering, and model calls
  • decision_engine.py: heuristics for when to use memory
  • templates/index.html: bare UI
  • static/script.js, static/style.css: simple chat frontend
  • requirements.txt, .gitignore, README.md

Features

  • Persistent memory: Compressed JSON archive on disk with deduplication by semantic similarity
  • Hybrid retrieval: Direct “same task” reuse + related examples via similarity and clustering
  • One‑click reset: Clear archive from the UI (or POST /clear-archive)
  • Session‑aware coding mode: Persisted per browser session
  • Lightweight stack: No database required
  • Well‑commented codebase: Clear comments to facilitate quick understanding and easy modification

Architecture

High‑level data flow:

UI → Flask (/chat) → DecisionEngine → embedding + archive retrieval → KMeans clustering → Nemotron model → archive update → UI

Components:

  • UI (HTML/JS) ↔ Flask endpoints (/, /chat, /clear-archive)
  • Archive store: archive.json.gz (plus archive_usage.log)
  • Embeddings: sentence-transformers (all-MiniLM-L6-v2)
  • Clustering: scikit-learn KMeans
  • Reasoning model: NVIDIA Nemotron via OpenAI‑compatible API (chat.completions)
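The clustering component can be illustrated with scikit-learn’s KMeans on toy 2-dimensional embeddings (the real app clusters all-MiniLM-L6-v2 sentence embeddings; the data and the `thematic_context` helper are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy "embeddings": two math-flavoured tasks and two code-flavoured tasks.
embeddings = np.array([
    [1.0, 0.1],   # "integrate x^2"
    [0.9, 0.2],   # "differentiate sin(x)"
    [0.1, 1.0],   # "write a Flask route"
    [0.2, 0.9],   # "fix this Python bug"
])
conclusions = ["x^3/3 + C", "cos(x)", "use @app.route", "check the indentation"]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(embeddings)

def thematic_context(query_emb):
    """Return conclusions from the cluster nearest to the query embedding."""
    label = int(kmeans.predict(query_emb.reshape(1, -1))[0])
    return [c for c, l in zip(conclusions, kmeans.labels_) if l == label]
```

A math-flavoured query retrieves the math cluster’s conclusions, broadening context beyond direct similarity hits.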

Endpoints

  • GET / - serves the minimal chat UI
  • POST /chat - body:
    {
      "task": "string",
      "is_coding_mode": true,
      "auto_archive": true
    }
    response:
    {
      "thinking": "string",
      "conclusion": "string",
      "is_coding_mode": true
    }
  • POST /clear-archive - clears the on‑disk archive and resets clustering
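Assuming the server is running locally on port 5000, the /chat endpoint can be exercised with a small helper that builds the JSON body shown above (the helper name is illustrative):

```python
import json

def make_chat_body(task: str, is_coding_mode: bool = False, auto_archive: bool = True) -> str:
    """Serialise a request body for POST /chat, matching the schema above."""
    return json.dumps({
        "task": task,
        "is_coding_mode": is_coding_mode,
        "auto_archive": auto_archive,
    })

# POST this to http://127.0.0.1:5000/chat with Content-Type: application/json;
# the JSON response carries "thinking", "conclusion", and "is_coding_mode".
body = make_chat_body("Summarise the benefits of a persistent reasoning memory")
```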

Requirements

  • Python 3.10+
  • Windows (tested); other platforms should work with minor changes
  • Internet access for the NVIDIA model API

Setup (Windows / PowerShell)

  1. Create and activate a virtual environment
python -m venv .venv
.\.venv\Scripts\Activate
  2. Install dependencies
pip install -r requirements.txt
  3. Configure the API
  • NVIDIA API key (Nemotron via OpenAI‑compatible endpoint)
    • Go to https://build.nvidia.com/nvidia/llama-3_1-nemotron-ultra-253b-v1, click "View Code" and generate an API key.
    • Open SyCoT.py and replace INSERT_API_KEY_HERE with your key.
  4. Run the preview app
python preview.py

Open http://127.0.0.1:5000 or http://localhost:5000 in your browser.

Using the UI

  • Type a prompt (or paste some code)
  • Toggle Coding Mode to switch to code‑oriented thresholds and prompts.
  • Use Clear Archive to wipe memory (archive.json.gz) and reset clusters.

Notes

  • Similarity thresholds live near the top of SyCoT.py:
    similarity_high = 0.8       # Threshold for direct conclusion reuse
    similarity_low = 0.5        # Minimum similarity for context inclusion
    similarity_high_code = 0.85 # (coding mode)
    similarity_low_code = 0.55  # (coding mode)
  • Archive file paths and logging are defined in SyCoT.py.
  • The model ID and API base are in SyCoT.py (nvidia/llama-3.1-nemotron-ultra-253b-v1).
  • The app silently sends a handshake (“hey”) on page load to sync session state, skipping memory retrieval for the first prompt sent by the user.
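How these thresholds route a retrieved match can be sketched as follows (the threshold constants mirror the values listed above; the `route` helper itself is illustrative):

```python
def thresholds(is_coding_mode: bool):
    """Select similarity thresholds; coding mode tightens both cutoffs."""
    if is_coding_mode:
        return 0.85, 0.55   # similarity_high_code, similarity_low_code
    return 0.8, 0.5         # similarity_high, similarity_low

def route(similarity: float, is_coding_mode: bool = False) -> str:
    high, low = thresholds(is_coding_mode)
    if similarity >= high:
        return "reuse"      # return the archived conclusion, no model call
    if similarity >= low:
        return "context"    # include as prior-conclusion context in the prompt
    return "ignore"         # unrelated: reason from scratch
```

Note that the same similarity score can route differently per mode: a 0.82 match is reused directly in normal mode but only used as context in coding mode.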

License

  • This project is licensed under the MIT License. See the LICENSE file for details.
