This repository implements novel attention mechanisms and optimization techniques for transformer architectures, focusing on natural gradient approximation through attention patterns and critical phenomena in transformer networks.
.
├── natural_attention.py    # Core implementation of the Natural Attention mechanism
├── analysis/               # to be done
│   ├── metrics.py
│   └── visualization.py
├── papers/
│   ├── attention-informed-optimization.pdf
│   ├── hierarchical-fim-preprint.pdf
│   └── transformer-criticality-paper.pdf
└── notebooks/
    └── Atención.ipynb      # Example notebook with GPT-2 training code
The NaturalAttention class in natural_attention.py implements an attention mechanism that:
- Computes and stores raw attention energies
- Provides natural gradient information through attention patterns
- Integrates with standard transformer architectures
Key classes:
- NaturalAttention: Core attention mechanism
- GPT2NaturalAttentionBlock: GPT-2-compatible attention block
- AttentionInformedOptimizer: Custom optimizer leveraging attention patterns
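To make the idea concrete, here is a minimal, self-contained sketch of an attention module that retains its raw (pre-softmax) attention energies after each forward pass so an optimizer can consult them. This is an illustration only, not the actual implementation in natural_attention.py; the class name NaturalAttentionSketch and its constructor arguments are invented, and masking and dropout are omitted for brevity.

import math
import torch
import torch.nn as nn

class NaturalAttentionSketch(nn.Module):
    """Multi-head self-attention that stores raw attention energies."""

    def __init__(self, n_embd, n_head):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.head_dim = n_embd // n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd)
        self.proj = nn.Linear(n_embd, n_embd)
        self.last_energies = None  # raw energies from the most recent forward pass

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (batch, heads, tokens, head_dim)
        q = q.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        # raw attention energies (pre-softmax scores); causal mask omitted here
        energies = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
        self.last_energies = energies.detach()  # kept for attention-informed optimization
        attn = torch.softmax(energies, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, T, C)
        return self.proj(out)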
The training pipeline includes:
- Custom dataset handling for WikiText
- Parallel training of standard and natural attention models
- Integration with Weights & Biases for experiment tracking
- Attention-informed optimization techniques
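A rough sketch of how one epoch of this pipeline might be wired together is shown below, assuming AttentionInformedOptimizer follows the standard torch.optim.Optimizer interface (zero_grad/step) and the model is a Hugging Face causal LM. The helper name train_one_epoch is illustrative and not part of the repository; the same loop would be run for both the standard and the natural-attention model when training them in parallel.

import wandb

def train_one_epoch(model, optimizer, dataloader, device="cpu"):
    model.train()
    for batch in dataloader:
        input_ids = batch["input_ids"].to(device)
        outputs = model(input_ids, labels=input_ids)  # causal LM loss
        optimizer.zero_grad()
        outputs.loss.backward()
        optimizer.step()
        wandb.log({"train/loss": outputs.loss.item()})  # experiment tracking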
pip install wandb transformers datasets torch tqdm
from transformers import GPT2Config
from natural_attention import GPT2NaturalAttentionBlock, AttentionInformedOptimizer
# Configuration
config_dict = {
    'max_length': 32,
    'batch_size': 4,
    'n_embd': 64,
    'n_layer': 2,
    'n_head': 2,
    'learning_rate': 1e-3,
    'epochs': 10,
    'save_every': 2
}
# Initialize models and train
standard_model, natural_model = train_models(config_dict)
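The train_models helper itself is not shown here. As a purely hypothetical illustration of how the flat config_dict could map onto a Hugging Face model configuration, a model-building step might look like the following (build_model is an invented name, not part of the repository):

from transformers import GPT2Config, GPT2LMHeadModel

def build_model(config_dict):
    # Map the flat experiment config onto a small GPT-2 configuration
    gpt2_config = GPT2Config(
        n_embd=config_dict['n_embd'],
        n_layer=config_dict['n_layer'],
        n_head=config_dict['n_head'],
        n_positions=config_dict['max_length'],
    )
    return GPT2LMHeadModel(gpt2_config)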
The repository includes tools for analyzing:
- Attention pattern dynamics
- Training convergence metrics
- Model performance comparisons
- Critical phenomena in transformer behavior
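As an example of the kind of metric the planned analysis/metrics.py could expose, the snippet below computes per-head attention entropy, a simple proxy for how concentrated attention patterns are over training. The function attention_entropy is hypothetical and shown only to illustrate the intended style of analysis.

import torch

def attention_entropy(attn_weights, eps=1e-9):
    """attn_weights: (batch, heads, query, key) post-softmax attention."""
    entropy = -(attn_weights * (attn_weights + eps).log()).sum(dim=-1)
    return entropy.mean(dim=(0, 2))  # average entropy per head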
This implementation is based on three key papers:
- "Attention-Informed Optimization": Introduces the concept of using attention energies for optimization
- "Attention as Natural Gradient": Establishes theoretical connections between attention and Fisher Information
- "Criticality and Phase Transitions": Explores critical phenomena in transformer networks
Our experiments show:
- Improved convergence rates with attention-informed optimization
- More stable attention patterns
- Better perplexity scores on language modeling tasks
- Evidence of critical behavior in transformer training
Contributions are welcome! Areas of particular interest:
- Additional analysis tools
- Performance optimizations
- New attention mechanisms
- Extended theoretical analysis
If you use this code in your research, please cite:
@article{aranda2024natural,
  title={Attention-Informed Optimization: Leveraging Attention Energies for Neural Network Training},
  author={Aranda Barois, Jeronimo},
  year={2024}
}
This project is licensed under the MIT License - see the LICENSE file for details.
For questions and feedback:
- Open an issue in this repository
- Contact the authors through the paper correspondence