A research tool for analyzing the geometric properties of AI conversations through embedding trajectory analysis, phase detection, and statistical invariance testing.
This project investigates whether conversations with AI systems exhibit consistent geometric patterns across different embedding models. We analyze conversation trajectories in high-dimensional embedding spaces to identify:
- Geometric invariance across different embedding models (transformer-based and classical)
- Phase transitions in conversation dynamics (exploration → insight → synthesis)
- Transport-theoretic measures of conversation structure
- Statistical validation through hierarchical hypothesis testing
- Python 3.12+
- CUDA-capable GPU (required for embedding generation)
- LaTeX distribution (for paper compilation)
- Clone the repository:
git clone https://github.com/im-knots/gcs.git
cd gcs- Create and activate a virtual environment:
cd analysis
python -m venv conversation_analysis_env
source conversation_analysis_env/bin/activate # On Windows: conversation_analysis_env\Scripts\activate- Install dependencies:
pip install -r requirements.txtBasic usage:
python run_analysis.pyWith options:
python run_analysis.py \
--data-dir /path/to/conversations \
--output-dir results/ \
--max-conversations 100 \
--batch-size 25 \
--figure-format both \
--log-level INFOOptions:
--data-dir: Directory containing conversation JSON files--output-dir: Output directory for results (default:analysis_output)--max-conversations: Limit number of conversations to analyze--batch-size: GPU batch size for embedding generation (default: 25)--figure-format: Output format for figures:png,pdf, orboth--no-checkpoint: Disable checkpointing--clear-checkpoints: Clear existing checkpoints before starting
Conversations should be in JSON format:
{
"metadata": {
"session_id": "unique-id",
"timestamp": "2024-01-01T00:00:00Z"
},
"messages": [
{
"role": "user",
"content": "Message content",
"timestamp": "2024-01-01T00:00:00Z"
},
{
"role": "assistant",
"content": "Response content",
"timestamp": "2024-01-01T00:00:01Z"
}
],
"phases": [ // Optional: annotated phase transitions
{
"turn": 5,
"phase": "exploration"
}
]
}- Ensemble approach: 5 models (3 transformers, 2 classical)
- GPU-accelerated batch processing
- Dimension handling: Automatic alignment for cross-model comparison
- Velocity & acceleration in embedding space
- Curvature with ensemble methods
- Adaptive normalization for numerical stability
- Multiple algorithms: HMM, spectral clustering, change point detection
- Ensemble consensus across methods
- Comparison with human annotations when available
- Wasserstein distance for trajectory comparison
- Sinkhorn divergence for high-dimensional robustness
- Gromov-Wasserstein for isometry-invariant comparison
- GPU acceleration via PyTorch backend
- Hierarchical hypothesis testing (3 tiers)
- Bootstrap confidence intervals (10,000 iterations)
- Null model comparison (phase-scrambled baselines)
- Multiple testing correction (Bonferroni)
Each conversation generates a comprehensive visualization including:
- Distance matrices and self-similarity plots
- Recurrence plots showing temporal patterns
- PCA trajectory projections
- Density evolution over time
- Phase detection results across models
- Transport metric comparisons
- Statistical summaries
Run the test suite:
# All tests
python -m pytest tests/ -v
# Specific test categories
python -m pytest tests/test_transport_metrics.py -v
python -m pytest tests/test_pipeline_integration.py -v
# Skip slow tests
python -m pytest tests/ -m "not slow" -vTo compile the research paper:
cd paper
pdflatex paper.tex
bibtex paper
pdflatex paper.tex
pdflatex paper.texOr using latexmk:
latexmk -pdf paper.texThis project is licensed under the MIT License - see the LICENSE file for details.