Productizing DeepSeek

Productizing DeepSeek is an open-source project code for https://lu.ma/noj3yy63 webinar.

Setup

uv sync --frozen --no-cache
uv run modal setup

Tree of Models

graph TD
    subgraph DeepSeek LLMs
        DS_V3_Base["DeepSeek-V3-Base (671B total, 37B activated)"] -->|Fine-tuned| DS_V3["DeepSeek-V3 (671B total, 37B activated)"]
        DS_V3_Base -->|RL-Only| DS_R1_Zero["DeepSeek-R1-Zero (671B total, 37B activated)"]
        DS_V3_Base -->|SFT+RL| DS_R1["DeepSeek-R1 (671B total, 37B activated)"]
    end

    subgraph Distilled Models
        DS_R1 -->|Distilled| DS_Distill_Q1_5["DeepSeek-R1-Distill-Qwen-1.5B"]
        DS_R1 --> DS_Distill_Q7["DeepSeek-R1-Distill-Qwen-7B"]
        DS_R1 --> DS_Distill_Q14["DeepSeek-R1-Distill-Qwen-14B"]
        DS_R1 --> DS_Distill_Q32["DeepSeek-R1-Distill-Qwen-32B"]
        DS_R1 --> DS_Distill_L8["DeepSeek-R1-Distill-Llama-8B"]
        DS_R1 --> DS_Distill_L70["DeepSeek-R1-Distill-Llama-70B"]
    end

    subgraph External Influence
        DS_R1 -->|Derived from| PPLX_R1["Perplexity-AI R1"]
    end

Managed

Local

https://ollama.com/download/mac

ollama run deepseek-r1:1.5b

Serverless

Platform	R1 Support	R1 Distilled Support	Price per 1M Tokens (Input/Output)	OpenAI Compatible
Deepseek	Yes	No	$0.55 / $2.19	Yes
Hyperbolic	Yes	No	$2.00 / $2.00	Yes
Nebius AI Studio	Yes	Yes	$0.80 / $2.40	Yes
Fireworks	Yes	Yes	$3.00 / $8.00	Yes
Novita	Yes	Yes	$4.00 / $4.00	Yes
Together AI	Yes	Yes	$7.00 / $7.00	Yes

** As of Feb 19, 2025

uv run ./src/productizing_deepseek/clients.py groq 'Write some python code'
uv run ./src/productizing_deepseek/clients.py together 'Write some python code'

Clouds

AWS: reservation for ml.p5e.48xlarge: $41.6116 per hour

Azure: servelsss and free (as of now)

uv run ./src/productizing_deepseek/clients.py azure 'Write some python code'

DIY

flowchart LR
    subgraph Servers
        SGLang
        LMDeploy
        vLLM
        TGI["TGI (Text Generation Inference)"]
        Triton["Triton Inference Server"]
    end
    subgraph Runtimes
        LlamaCPP["Llama.cpp"]
        TensorRT_LLM["TensorRT-LLM"]
    end
    TGI -->|supports| LlamaCPP
    TGI -->|supports| TensorRT_LLM
    Triton -->|supports| TensorRT_LLM
    Triton -->|runs via| vLLM

https://github.com/deepseek-ai/DeepSeek-V3?tab=readme-ov-file#6-how-to-run-locally

SGLang: https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3 LMDeploy: InternLM/lmdeploy#2960 TRT-LLM: https://github.com/NVIDIA/TensorRT-LLM/tree/deepseek/examples/deepseek_v3 vLLM: vllm-project/vllm#11539

vLLM + Modal

Pricing

GPU Model	vRAM	Memory Type	Price/Hour
Nvidia H200	141 GB	HBM3e	?
Nvidia H100	80 GB	HBM3	$4.56
Nvidia A100	80 GB	HBM2e	$3.40
Nvidia A100	40 GB	HBM2	$2.78
Nvidia L40S	48 GB	GDDR6	$1.95
Nvidia A10G	24 GB	GDDR6	$1.10
Nvidia L4	24 GB	GDDR6	$0.80
Nvidia T4	16 GB	GDDR6	$0.59

Estimate

Model	Parameters	Min vRAM Required	Suitable GPUs
DeepSeek-R1	685B	~1,370 GB	Not possible on single GPU - needs multi-node
DeepSeek-R1-Distill-Llama-70B	70B	~140 GB	H200 only
DeepSeek-R1-Distill-Qwen-32B	32B	~64 GB	H200, H100, A100-80GB
DeepSeek-R1-Distill-Qwen-14B	14B	~28 GB	All except T4
DeepSeek-R1-Distill-Llama-8B	8B	~16 GB	All GPUs
DeepSeek-R1-Distill-Qwen-7B	7B	~14 GB	All GPUs
DeepSeek-R1-Distill-Qwen-1.5B	1.5B	~3 GB	All GPUs

Try!

create project

uv run modal environment create productizing-deepseek

download models

uv run modal run --detach  src/productizing_deepseek/custom_load.py --model-name deepseek-ai/DeepSeek-R1
uv run modal run --detach  src/productizing_deepseek/custom_load.py --model-name deepseek-ai/DeepSeek-R1-Distill-Llama-70B
uv run modal run --detach  src/productizing_deepseek/custom_load.py --model-name deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
uv run modal run --detach  src/productizing_deepseek/custom_load.py --model-name deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
uv run modal run --detach  src/productizing_deepseek/custom_load.py --model-name deepseek-ai/DeepSeek-R1-Distill-Llama-8B
uv run modal run --detach  src/productizing_deepseek/custom_load.py --model-name deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
uv run modal run --detach  src/productizing_deepseek/custom_load.py --model-name deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B

uv run modal run --detach  src/productizing_deepseek/custom_load.py --model-name perplexity-ai/r1-1776
uv run modal run --detach  src/productizing_deepseek/custom_load.py --model-name unsloth/DeepSeek-R1
uv run modal run --detach  src/productizing_deepseek/custom_load.py --model-name unsloth/DeepSeek-R1-GGUF
uv run modal run --detach  src/productizing_deepseek/custom_load.py --model-name unsloth/DeepSeek-R1-Distill-Qwen-1.5B-GGUF

uv run modal shell --volume productizing-deepseek

create deployments

uv run modal deploy src/productizing_deepseek/custom_vllm_llama_1b.py
uv run modal deploy src/productizing_deepseek/custom_vllm_llama_8b.py
uv run modal deploy src/productizing_deepseek/custom_vllm_qwen_32b.py
uv run modal deploy src/productizing_deepseek/custom_vllm_llama_70b.py
uv run modal deploy src/productizing_deepseek/custom_vllm_r1.py

clients

uv run python ./src/productizing_deepseek/clients.py modal 'test' "DeepSeek-R1-Distill-Qwen-1.5B" "https://truskovskiyk-productizing-deepseek--distill-llama-1b-serve.modal.run/v1"
uv run python ./src/productizing_deepseek/clients.py modal 'test' "DeepSeek-R1-Distill-Llama-8B" "https://truskovskiyk-productizing-deepseek--distill-llama-8b-serve.modal.run/v1/"
uv run python ./src/productizing_deepseek/clients.py modal 'test' "DeepSeek-R1-Distill-Qwen-32B" "https://truskovskiyk-productizing-deepseek--distill-qwen-32b-serve.modal.run/v1/"
uv run python ./src/productizing_deepseek/clients.py modal 'test' "DeepSeek-R1-Distill-Llama-70B" "https://truskovskiyk-productizing-deepseek--distill-llama-70b-serve.modal.run/v1/"
uv run python ./src/productizing_deepseek/clients.py modal 'test' "DeepSeek-R1" "https://truskovskiyk-productizing-deepseek--r1-serve.modal.run/v1/"

clean up

modal app stop --name r1
modal app stop --name distill-llama-70b 
modal app stop --name distill-qwen-32b 
modal app stop --name distill-llama-8b

Results

Model	GPU Type	Number of GPUs	Price per GPU (per hour)	Total Cost (per hour)	GPU meme req	Start time
distill-llama-1.5b	Nvidia A10 (24 GB)	1	$1.10	$1.10	19GB	1.10
distill-llama-8b	Nvidia A100 (40 GB)	1	$2.78	$2.78	35GB	51.84s
distill-llama-70b	Nvidia A100 (80 GB)	2	$3.40	$6.80	71GB	1m12s
distill-qwen-32b	Nvidia A100 (80 GB)	1	$3.40	$3.40	143GB	2m16s
r1	Nvidia H200 (141 GB)	8	$4.56	$36.48	1003GB	19m58s

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
src/productizing_deepseek		src/productizing_deepseek
.gitignore		.gitignore
.python-version		.python-version
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Productizing DeepSeek

Table of Contents

Setup

Tree of Models

Managed

Local

Serverless

Clouds

AWS: reservation for ml.p5e.48xlarge: $41.6116 per hour

Azure: servelsss and free (as of now)

DIY

vLLM + Modal

Pricing

Estimate

Try!

Results

About

Uh oh!

Releases

Packages

Languages

License

kyryl-opens-ml/productizing-deepseek

Folders and files

Latest commit

History

Repository files navigation

Productizing DeepSeek

Table of Contents

Setup

Tree of Models

Managed

Local

Serverless

Clouds

AWS: reservation for ml.p5e.48xlarge: $41.6116 per hour

Azure: servelsss and free (as of now)

DIY

vLLM + Modal

Pricing

Estimate

Try!

Results

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages