I'm so excited to announce that RWKV-Infer now supports a hybrid architecture combining RWKV and Transformer layers.
This design brings together the best of both worlds:
- 🌊 RWKV layers: efficient long-context modeling with linear time complexity and minimal memory usage. Ideal for early-stage token mixing and maintaining global coherence.
- ⚡ Transformer (GQA) layers: powerful attention mechanisms retained in the later layers for precise reasoning, structured generation, and knowledge retention.
- Improved long-context capability without increasing memory usage.
- Reduced KV cache size — up to 90% smaller by replacing early Transformer blocks with RWKV.
- Balanced performance: RWKV handles sequence length, while the Transformer layers ensure generation quality (a layout sketch follows below).
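Conceptually, the stack runs RWKV blocks first and keeps attention only near the top, so only the attention layers need a KV cache. A minimal sketch of that layout (the placeholder classes and the 8-of-32 split are illustrative assumptions, not the RWKV-Infer implementation):

```python
import torch.nn as nn

class RWKVLayer(nn.Module):
    """Placeholder for an RWKV time-mix block: O(1) recurrent state, no KV cache."""
    def forward(self, x):
        return x

class GQALayer(nn.Module):
    """Placeholder for a grouped-query attention block: keeps a per-token KV cache."""
    def forward(self, x):
        return x

def build_hybrid_stack(n_layers=32, n_gqa=8):
    # Early layers are RWKV, so only the final n_gqa layers allocate a KV
    # cache; replacing 24 of 32 attention layers shrinks the cache accordingly.
    return nn.Sequential(*[RWKVLayer() for _ in range(n_layers - n_gqa)],
                         *[GQALayer() for _ in range(n_gqa)])
```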
Multi Recurrent State Sampling (MRSS):
MRSS is a novel inference method that combines multiple fine-tuned states with fixed gating weights, enabling more flexible and effective inference.
- Pseudo Mixture of State Experts: by combining multiple states, MRSS integrates knowledge from different "experts," generating richer outputs.
- Separation of elements: knowledge, emotions, and speaking styles can each be fine-tuned independently.
- State reusability: new models can be created efficiently by recombining existing states (see the sketch below).
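One plausible reading of fixed-gating state combination, as a minimal sketch (the weighted-sum form, tensor shapes, and names are assumptions, not the RWKV-Infer internals):

```python
import torch

def mrss_blend(states, gating_weights):
    """Blend several fine-tuned recurrent states with fixed gating weights."""
    # Weights stay fixed during inference (cf. state_gatingweight in the API below).
    return sum(w * s for w, s in zip(gating_weights, states))

# Example: three expert states plus the original base state.
states = [torch.randn(32, 64, 64) for _ in range(4)]   # shapes are illustrative
blended = mrss_blend(states, [0.01, 0.3, 0.4, 0.03])
```

The gating weights correspond to the `state_gatingweight` field used when loading MRSS states via the API further down.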
Mixture of LoRA Experts (MoLE):
Combines multiple LoRA (Low-Rank Adaptation) modules as "experts" that specialize in different tasks or domains.
- Performs inference with MoLE models trained on RWKV-LM-RLHF.
- This is a preliminary verification step towards the upcoming MoE support (a gating sketch follows below).
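As a rough illustration of the gating idea (hypothetical names; the weighted low-rank mixing below is a generic MoLE formulation, not necessarily RWKV-LM-RLHF's):

```python
import torch

def mole_forward(x, W, experts, gates):
    """y = x @ W + sum_i g_i * (x @ A_i @ B_i): a gated mix of LoRA expert deltas."""
    y = x @ W
    for g, (A, B) in zip(gates, experts):
        y = y + g * (x @ A @ B)   # each expert contributes a low-rank update
    return y

d, r = 64, 8
W = torch.randn(d, d)
experts = [(torch.randn(d, r) * 0.02, torch.randn(r, d) * 0.02) for _ in range(3)]
y = mole_forward(torch.randn(2, d), W, experts, gates=[0.5, 0.3, 0.2])
```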
Hot swapping of adapter models:
- Bone (Block Affine Transformation) adapter
- DoRA (Weight-Decomposed Low-Rank Adaptation) adapter
- LoRA adapter (merge/unmerge sketched below)
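Hot swapping means replacing the adapter delta without reloading the base weights. A generic sketch for the LoRA case (Bone and DoRA use different parameterizations; function and variable names are hypothetical):

```python
import torch

def swap_lora(W, old=None, new=None, scale=1.0):
    """Swap adapters by subtracting the old LoRA delta and adding the new one."""
    if old is not None:
        A, B = old
        W = W - scale * (A @ B)   # undo the previous adapter
    if new is not None:
        A, B = new
        W = W + scale * (A @ B)   # merge the replacement adapter
    return W

d, r = 64, 8
W = torch.randn(d, d)
lora_a = (torch.randn(d, r) * 0.02, torch.randn(r, d) * 0.02)
lora_b = (torch.randn(d, r) * 0.02, torch.randn(r, d) * 0.02)
W = swap_lora(W, new=lora_a)               # load the first adapter
W = swap_lora(W, old=lora_a, new=lora_b)   # swap without reloading the model
```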
Quantization Support:
- FP8 (experimental; requires an NVIDIA H100 or Ada-series GPU)
- int8 (Triton gemv-kernel based; best inference speed for a single batch; a generic per-channel sketch follows this list)
- FP6 (slight quality degradation; torchao fpx e3m2)
- FP5 (~10% perplexity degradation; torchao fpx e2m2)
- hqq4 (HQQ int4 via the GemLite Triton kernel)
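For intuition, a minimal per-channel int8 weight-quantization sketch (a generic illustration, not the RWKV-Infer Triton kernel):

```python
import torch

def quantize_int8(w):
    """Symmetric per-output-channel int8 quantization of a weight matrix."""
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0   # one scale per row
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

w = torch.randn(256, 256)
q, scale = quantize_int8(w)
w_hat = q.float() * scale        # dequantize to inspect the error
print((w - w_hat).abs().max())   # reconstruction error stays small
```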
Multi Batch Generation:
- Multi-batch generation with Flash-Linear-Attention (x070, hxa079)
- Multi-batch sampling (sketched below)
- On an RTX 4090, a 7B-parameter model can run more than 256 concurrent batches of inference.
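A generic sketch of batched nucleus (top-p) sampling, drawing one token for every sequence in the batch in a single pass (shapes are illustrative; this is not the RWKV-Infer sampler):

```python
import torch

def sample_top_p(logits, top_p=0.3, temperature=1.0):
    """Top-p sampling over a whole batch of logits at once."""
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_idx = probs.sort(dim=-1, descending=True)
    cumulative = sorted_probs.cumsum(dim=-1)
    # Drop tokens outside the nucleus while always keeping the top-1 token.
    sorted_probs[cumulative - sorted_probs > top_p] = 0.0
    sorted_probs /= sorted_probs.sum(dim=-1, keepdim=True)
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx.gather(-1, choice).squeeze(-1)

logits = torch.randn(256, 65536)     # 256 concurrent sequences, vocab 65536
next_tokens = sample_top_p(logits)   # one sampled token per batch element
```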
Accelerate your RWKV model inference with RWKV-Infer!
Installation:
- Python >= 3.12
- Install PyTorch 2.7+
- In some cases, building the CUDA kernel requires: conda install libstdcxx -c conda-forge --override-channels
- Install requirements with the latest Triton:
pip install -r requirements_fla.txt
- Prepare models in the models folder
- Prepare states in the states folder
- Run the server:
python rwkv_server_fla_fastapi.py --localhost 0.0.0.0 --port 9000 --debug False --workers 64 --dynamic_state_cache_size 512
- Load the model. For quantization, set model_strategy to one of: bf16, fp16, int8, fp8, nf4
curl http://127.0.0.1:9000/loadmodel -X POST -H "Content-Type: application/json" -d '{"model_filename":"models/RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth","model_viewname":"RWKV x060 1B6 Base","model_strategy":""}'
- Enjoy inference via the OpenAI-compatible API! An example request follows below.
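For example, a chat completion request (assuming the standard OpenAI-style /v1/chat/completions route; set model to the view name of a loaded model or state):
curl http://127.0.0.1:9000/v1/chat/completions -X POST -H "Content-Type: application/json" -d '{"model":"RWKV x060 1B6 Base","messages":[{"role":"user","content":"Hello!"}],"max_tokens":128}'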
- Model Load
curl http://127.0.0.1:9000/loadmodel -X POST -H "Content-Type: application/json" -d '{"model_filename":"models/RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth","model_viewname":"RWKV x060 1B6 Base","model_strategy":"","default_temperature":"1.0", "default_top_p":"0.3", "endtoken":"\\n\\n"}'
- Model Load + Adapter
curl http://127.0.0.1:9000/loadmodel -X POST -H "Content-Type: application/json" -d '{"model_filename":"models/RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth","model_viewname":"RWKV x060 1B6 Base","model_strategy":"","adapter_filename":"adapters/rwkv-9-bone.pth","adapter_mode":"bone","default_temperature":"1.0", "default_top_p":"0.3", "endtoken":"\\n\\n"}'
- Add Single state
curl http://127.0.0.1:9000/loadstatemodel -X POST -H "Content-Type: application/json" -d '{"state_filename":"state.pth","state_viewname":"State Test","default_temperature":"1.0", "default_top_p":"0.3"}'
- Add MRSS states
curl http://127.0.0.1:9000/mrss_loadstatemodel -X POST -H "Content-Type: application/json" -d '{"state_viewname":"MRSS Test", "state_filenames":["states/jp7b-bancho.pth","states/ojousama2.pth","states/secret.pth"], "contain_originalstate":"True", "state_gatingweight":["0.01","0.3","0.4","0.03"],"default_temperature":"1.0", "default_top_p":"0.8"}'
- Remove All State
curl http://127.0.0.1:9000/removestatemodel -X POST -H "Content-Type: application/json" -d '{"dummy":"dummy"}'
- Get model names (during inference, setting the request's model field to one of these names enables dynamic state loading)
curl http://127.0.0.1:9000/models -X GET
Acknowledgements:
- RWKV-LM, ChatRWKV @BlinkDL
- rwkv.hpp @harrisonvanderbyl
- RWKV-PEFT @Jl-er
- flash-linear-attention @sustcsonglin

2025 OpenMOSE