| title | ArXiv Concept Tracker |
|---|---|
| emoji | 📚 |
| colorFrom | blue |
| colorTo | indigo |
| sdk | docker |
| pinned | false |
| app_port | 8000 |
Track how research concepts evolve over time using AI-powered semantic embeddings and Kalman filtering.
- 🔍 Search ArXiv papers by keyword
- 📊 Track concept evolution through time windows
- 🧠 Semantic similarity with embeddings (Qwen3 via sentence-transformers)
- 📈 Interactive timeline visualization
- 🎯 Kalman filter for smooth concept tracking
- Linear concept tracking: Follow concept evolution from seed papers forward through time
- Local embeddings: sentence-transformers (no API costs)
- Kalman filtering: Velocity and acceleration constraints prevent unrealistic concept jumps
- ArXiv integration: Automatic paper fetching and metadata extraction
- REST API: FastAPI backend with JSON responses
- Comprehensive caching: Embeddings are cached locally for fast repeated runs
- Clone or navigate to the project directory:
cd /Users/markgewhite/Documents/MyFiles/Projects/training/ztm/llm_web_apps/concept_tracker
- Create and activate a virtual environment:
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
This will install:
- FastAPI & Uvicorn (web framework)
- Qwen3 embeddings via sentence-transformers
- ArXiv API client
- NumPy, scikit-learn for computations
- Pytest for testing
Note: First run will download the Qwen3 model (~400MB) automatically.
The application uses sensible defaults. To customize, copy .env.example to .env and edit:
cp .env.example .env
Key parameters in backend/config.py:
# Kalman Filter Parameters
max_velocity = 0.05 # Max concept drift per time step
max_acceleration = 0.02 # Max change in velocity
# Similarity Thresholds
threshold_auto_include = 0.85 # High confidence (auto-accept)
threshold_strong = 0.75 # Moderate confidence
threshold_moderate = 0.65 # Low confidence (minimum)
Start the server:
uvicorn backend.main:app --reload
The API will be available at http://localhost:8000
Interactive API documentation: http://localhost:8000/docs
Find potential seed papers:
curl "http://localhost:8000/api/search?query=attention%20is%20all%20you%20need&limit=5"
Get details for a specific paper:
curl "http://localhost:8000/api/paper/1706.03762"
Track a concept from seed papers forward:
curl -X POST "http://localhost:8000/api/track" \
-H "Content-Type: application/json" \
-d '{
"seed_paper_ids": ["1706.03762"],
"end_date": "2018-12-31",
"window_months": 6,
"max_papers_per_window": 50
}'
Parameters:
- seed_paper_ids: 1-5 ArXiv IDs to start tracking from
- end_date: End date (ISO format: "YYYY-MM-DD")
- window_months: Time window size (default: 6 months)
- max_papers_per_window: Max papers to fetch per window (default: 50)
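For callers who prefer Python to curl, the same requests can be prepared with the standard library. This is a minimal sketch, not part of the project: the helper names `search_url` and `build_track_request` are hypothetical, and `BASE_URL` assumes the default local server address shown above. Nothing is sent over the network until the prepared request is passed to `urllib.request.urlopen`.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: default local uvicorn address


def search_url(query: str, limit: int = 5) -> str:
    """Build the /api/search URL, percent-encoding spaces as %20."""
    params = urllib.parse.urlencode(
        {"query": query, "limit": limit}, quote_via=urllib.parse.quote
    )
    return f"{BASE_URL}/api/search?{params}"


def build_track_request(seed_paper_ids, end_date, window_months=6,
                        max_papers_per_window=50):
    """Prepare (but do not send) a POST /api/track request."""
    if not 1 <= len(seed_paper_ids) <= 5:
        raise ValueError("seed_paper_ids must contain 1-5 ArXiv IDs")
    payload = {
        "seed_paper_ids": list(seed_paper_ids),
        "end_date": end_date,  # ISO format "YYYY-MM-DD"
        "window_months": window_months,
        "max_papers_per_window": max_papers_per_window,
    }
    return urllib.request.Request(
        f"{BASE_URL}/api/track",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_track_request(["1706.03762"], "2018-12-31")
print(search_url("attention is all you need"))
print(request.get_method(), request.full_url)
# With the server running, send it with:
#   with urllib.request.urlopen(request) as response:
#       result = json.load(response)
```

Keeping request construction separate from sending makes the payload easy to inspect or unit-test without a running server.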
# Track from "Attention is All You Need" (2017) to end of 2018
curl -X POST "http://localhost:8000/api/track" \
-H "Content-Type: application/json" \
-d '{
"seed_paper_ids": ["1706.03762"],
"end_date": "2018-12-31",
"window_months": 6,
"similarity_threshold": 0.65,
"max_papers_per_window": 50
}' | python -m json.tool
Expected output:
{
"seed_papers": [...],
"timeline": [
{
"step_number": 1,
"start_date": "2017-06-12T...",
"end_date": "2017-12-12T...",
"papers": [...],
"avg_similarity": 0.78,
"num_high_confidence": 12,
"num_moderate": 8,
"num_low": 3
},
...
],
"total_papers": 45,
"num_steps": 3
}
- Initialization: Start with 1-5 seed papers (e.g., "Attention is All You Need")
- Embedding: Generate semantic embeddings (title + abstract) using Qwen3
- Time Windows: Move forward in configurable windows (default: 6 months)
- For each window:
  - Fetch candidate papers from ArXiv
  - Generate embeddings (cached after first generation)
  - Kalman Filtering: Evaluate each paper against physics-inspired constraints:
    - Similarity: Must be > 0.65 to current concept vector
    - Velocity: Change must be < 0.05 (prevents sudden jumps)
    - Acceleration: Change in velocity must be < 0.02 (prevents direction shifts)
  - Accept papers that pass all constraints
  - Update concept vector as weighted mean of accepted papers
- Repeat until end date
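The window stepping in the loop above can be sketched with the standard library. This is an illustration under stated assumptions, not the project's `tracker.py`: the helpers `add_months` and `time_windows` are hypothetical, the start date is taken from the example output above, and clamping a trailing short window to the end date is a choice made here that the real tracker may handle differently.

```python
import calendar
from datetime import date


def add_months(d: date, months: int) -> date:
    """Shift a date forward by whole months, clamping the day to month end."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def time_windows(start: date, end: date, window_months: int = 6):
    """Yield (step_number, window_start, window_end) until `end` is reached."""
    step, window_start = 1, start
    while window_start < end:
        # Assumption: a final window shorter than window_months is kept
        # and clamped to the end date rather than dropped.
        window_end = min(add_months(window_start, window_months), end)
        yield step, window_start, window_end
        step, window_start = step + 1, window_end


for step, window_start, window_end in time_windows(date(2017, 6, 12),
                                                   date(2018, 12, 31)):
    print(step, window_start.isoformat(), window_end.isoformat())
```

Because the short final window is kept here, the number of steps this sketch produces may differ from the `num_steps` the server reports.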
The tracker rejects papers that would cause unrealistic concept jumps:
- Similarity < 0.65: Too dissimilar to current concept
- Velocity > 0.05: Concept jumping too fast through embedding space
- Acceleration > 0.02: Sudden change in direction
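The three rejection rules, together with the weighted-mean update of the concept vector, can be sketched as follows. This is not the code in `kalman_tracker.py`: the function names are hypothetical, the README does not say how velocity is derived for a candidate paper (so the sketch takes velocity values as inputs), and treating acceleration as the absolute change in velocity between steps is an assumption. Plain Python lists are used to keep the example dependency-free.

```python
import math

SIMILARITY_MIN = 0.65    # threshold_moderate
MAX_VELOCITY = 0.05      # max_velocity
MAX_ACCELERATION = 0.02  # max_acceleration


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def check_paper(similarity, velocity, prev_velocity):
    """Return None if the paper passes, else the name of the failed rule."""
    if similarity < SIMILARITY_MIN:
        return "similarity"
    if velocity > MAX_VELOCITY:
        return "velocity"
    # Assumption: acceleration is the absolute change in velocity.
    if abs(velocity - prev_velocity) > MAX_ACCELERATION:
        return "acceleration"
    return None


def weighted_mean(vectors, weights):
    """Weighted mean of vectors, e.g. accepted papers weighted by similarity."""
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[i] for v, w in zip(vectors, weights)) / total
            for i in range(dim)]


print(check_paper(similarity=0.80, velocity=0.03, prev_velocity=0.02))
print(check_paper(similarity=0.60, velocity=0.03, prev_velocity=0.02))
print(weighted_mean([[1.0, 0.0], [0.0, 1.0]], weights=[3, 1]))
```

Returning the name of the failed rule rather than a bare boolean mirrors the idea of logging a rejection reason for each paper.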
Check logs for rejection reasons:
uvicorn backend.main:app --log-level=debug
# Run all tests
pytest tests/ -v
# Run specific test file
pytest tests/test_kalman.py -v
# Run slow integration tests (uses real ArXiv data)
pytest tests/test_api.py -v -s --tb=short
- test_arxiv_client.py: ArXiv API integration
- test_kalman.py: Kalman filter constraints
- test_api.py: FastAPI endpoints
concept_tracker/
├── backend/
│ ├── __init__.py
│ ├── main.py # FastAPI app & endpoints
│ ├── config.py # Kalman parameters & settings
│ ├── models.py # Pydantic data models
│ ├── arxiv_client.py # ArXiv API wrapper
│ ├── embedding_service.py # Qwen3 embeddings + cache
│ ├── kalman_tracker.py # Core tracking algorithm
│ ├── tracker.py # Main orchestrator
│ └── utils/
│ ├── __init__.py
│ └── cache.py # Pickle-based cache
├── cache/ # Embedding storage (auto-created)
├── tests/ # Test suite
├── requirements.txt # Python dependencies
├── .env.example # Configuration template
└── README.md # This file
First run:
- Time: 10-15 minutes (one-time embedding generation + download)
- Bottleneck: Qwen3 model download (~400MB) and embedding generation
Subsequent runs (embeddings cached):
- Time: 2-3 minutes
- Bottleneck: ArXiv API queries and Kalman filtering
- All embeddings are permanently cached in cache/embeddings/
- Cache grows ~4KB per paper (1024 floats × 4 bytes)
- 10,000 papers = ~40MB cache (acceptable)
If tracking results are not satisfactory, adjust the constraints.
To accept more papers (looser constraints), edit backend/config.py:
max_velocity = 0.07 # Increase from 0.05
max_acceleration = 0.03 # Increase from 0.02
threshold_moderate = 0.60 # Decrease from 0.65
To accept fewer papers (stricter constraints), edit backend/config.py:
max_velocity = 0.03 # Decrease from 0.05
max_acceleration = 0.01 # Decrease from 0.02
threshold_moderate = 0.70 # Increase from 0.65
Restart the server after changes:
uvicorn backend.main:app --reload
Solution: Ensure you have ~1GB free disk space. The model downloads to ~/.cache/huggingface/
Solution: The client includes rate limiting (3 sec delay). If you still get errors, increase arxiv_rate_limit in config.
Solution:
- Check logs for rejection reasons
- Lower threshold_moderate in config
- Increase max_velocity if velocity rejections are common
Solution: Reduce max_papers_per_window in tracking request
Test with known concept evolution (Transformers 2017-2018):
curl -X POST "http://localhost:8000/api/track" \
-H "Content-Type: application/json" \
-d '{
"seed_paper_ids": ["1706.03762"],
"end_date": "2018-12-31",
"window_months": 6,
"max_papers_per_window": 50
}'
Expected:
- Should find BERT-related papers (1810.04805)
- Should find other transformer variants
- Should NOT jump to unrelated NLP (pure RNN papers)
- Similarity should stay above 0.65
- 2-3 time steps with 10-30 papers each
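These expectations can be checked against the JSON response automatically. The sketch below is an assumption-laden illustration: `validate_tracking_result` is a hypothetical helper, and because the example output elides the contents of `papers`, the per-paper `arxiv_id` key is a guess that should be adjusted to the actual response schema. The "should NOT jump to unrelated NLP" expectation needs human judgment and is not covered.

```python
def validate_tracking_result(result, expected_id="1810.04805",
                             min_similarity=0.65):
    """Return a list of human-readable problems; an empty list means a pass."""
    problems = []
    timeline = result.get("timeline", [])

    if not 2 <= len(timeline) <= 3:
        problems.append(f"expected 2-3 time steps, got {len(timeline)}")

    found_ids = set()
    for step in timeline:
        number = step.get("step_number")
        papers = step.get("papers", [])
        # Assumption: each paper dict exposes its ID under "arxiv_id".
        found_ids.update(paper.get("arxiv_id") for paper in papers)
        if step.get("avg_similarity", 0.0) < min_similarity:
            problems.append(f"step {number}: avg_similarity below {min_similarity}")
        if not 10 <= len(papers) <= 30:
            problems.append(f"step {number}: {len(papers)} papers, expected 10-30")

    if expected_id not in found_ids:
        problems.append(f"paper {expected_id} not found")
    return problems


# Synthetic response for illustration only; not real tracker output.
sample = {
    "timeline": [
        {"step_number": 1, "avg_similarity": 0.78,
         "papers": [{"arxiv_id": f"fake.{i:05d}"} for i in range(12)]},
        {"step_number": 2, "avg_similarity": 0.74,
         "papers": [{"arxiv_id": "1810.04805"}]
                   + [{"arxiv_id": f"fake.{i:05d}"} for i in range(100, 111)]},
    ]
}
print(validate_tracking_result(sample))
```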
- ✅ Linear tracking (current MVP)
- 🔲 Tree branching with HDBSCAN clustering
- 🔲 Web UI with D3.js visualization
- 🔲 Bidirectional tracking (trace concepts to their origins)
- 🔲 Multi-signal validation (citations, author overlap)
MIT License - See LICENSE file
This is an MVP/prototype. For issues or suggestions, please open an issue on GitHub.
- ArXiv for open access to research papers
- Qwen team for the embedding model
- FastAPI and sentence-transformers communities