### Model description
I would like to implement a new model architecture.
#### Short description
RWKV v2 is an "RNN with transformer-level performance, without using attention. Similar to Apple's Attention Free Transformer. All trained models open-source. Inference is very fast (even on CPUs) and might work on cell phones. There's also a GPT-type implementation." -- (Hochreiter's description)
RWKV v2 is parallelizable because the time-decay of each channel is data-dependent in neither direction: it is trainable, but fixed per channel. In a typical RNN you adjust the time-decay of a channel on the fly, from, say, 0.8 to 0.5 (this is what "gates" do), whereas in RWKV v2 you simply move the information from a W-0.8 channel to a W-0.5 channel to achieve the same effect. RWKV can leverage GPUs, but doesn't need to.
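To make the recurrent/parallel equivalence concrete, here is a minimal sketch (illustrative only, not the actual RWKV-LM code; the names `w`, `x`, and the shapes are my own assumptions) showing that a per-channel, data-independent decay can be evaluated either step by step like an RNN or all at once like a GPT-style forward pass:

```python
import torch

T, C = 8, 4                # sequence length, number of channels
x = torch.randn(T, C)      # per-step inputs (e.g. value projections)
w = torch.rand(C)          # per-channel decay in (0, 1); trainable,
                           # but crucially data-INdependent

# RNN mode: each channel's state decays by its fixed w every step.
state = torch.zeros(C)
rnn_out = []
for t in range(T):
    state = w * state + x[t]
    rnn_out.append(state)
rnn_out = torch.stack(rnn_out)

# GPT mode: since w is constant over time,
# out[t, c] = sum_{s <= t} w[c]**(t - s) * x[s, c],
# so the whole sequence can be computed in parallel.
idx = torch.arange(T)
powers = (idx.view(-1, 1) - idx.view(1, -1)).clamp(min=0).float()  # (T, T) exponents t - s
causal = (idx.view(-1, 1) >= idx.view(1, -1)).float()              # zero out future steps
decay = (w.view(1, 1, C) ** powers.unsqueeze(-1)) * causal.unsqueeze(-1)  # (T, T, C)
gpt_out = torch.einsum("tsc,sc->tc", decay, x)

assert torch.allclose(rnn_out, gpt_out, atol=1e-5)  # both modes agree
```

In a classical RNN the factor multiplying `state` would itself depend on `x[t]` (a gate), which forces the sequential loop; fixing the decay per channel is what unlocks the parallel form.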
### Open source status
- The model implementation is available
- The model weights are available
### Provide useful links for the implementation

#### Implementation and weights
There's an implementation at BlinkDL/RWKV-LM, which also gives a detailed description of the model internals and some performance benchmarks. Model weights are currently being trained on several datasets, including the Pile (see, e.g., BlinkDL/RWKV-v2-RNN-Pile) and, on my end, Danish Gigaword. Both will be openly available; some checkpoints for the Pile already are, even though training is still in progress.
#### Status
The model seems quite exciting, and I've been able to replicate preliminary results. I'm already talking with @BlinkDL about the implementation. I'm happy to implement/port the model architecture (for both the RNN and GPT variants), the tokenizer, and tests myself (and have already started), and would appreciate help and advice.