feat: implement Tokenizer API, a simple wrapper over HuggingFace-style tokenizers. #5813

Open · wants to merge 6 commits into main

Conversation

minleminzui (Collaborator)

Motivation

Implement the Tokenizer API, a simple wrapper over HuggingFace-style tokenizers, analogous to vLLM's Tokenizer API: https://docs.vllm.ai/en/v0.8.4_a/serving/openai_compatible_server.html#tokenizer-api
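
To make the "wrapper" idea concrete, here is a minimal standalone sketch of a tokenize/detokenize service built directly on a HuggingFace tokenizer. It is illustrative only: sglang wires these endpoints into its own HTTP server and tokenizer handling, and the field names used below (`prompt`, `tokens`, `count`, `text`) are assumptions, not this PR's actual schema.

```python
# Illustrative sketch only -- NOT the PR's implementation.
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoTokenizer

app = FastAPI()
# Any HuggingFace tokenizer id works here; this one matches the model used in the tests.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")


class TokenizeRequest(BaseModel):
    prompt: str
    add_special_tokens: bool = True  # assumed option, mirroring HF encode()


class DetokenizeRequest(BaseModel):
    tokens: list[int]


@app.post("/tokenize")
def tokenize(req: TokenizeRequest):
    # Encode the prompt into token ids with the HuggingFace tokenizer.
    ids = tokenizer.encode(req.prompt, add_special_tokens=req.add_special_tokens)
    return {"tokens": ids, "count": len(ids)}


@app.post("/detokenize")
def detokenize(req: DetokenizeRequest):
    if not req.tokens:
        # The test log below shows 400 responses for malformed requests;
        # the exact conditions that trigger them in the PR are assumptions here.
        raise HTTPException(status_code=400, detail="tokens must be non-empty")
    # Decode the token ids back into text.
    return {"text": tokenizer.decode(req.tokens)}
```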

Modifications

Add /tokenize and /detokenize endpoints to the HTTP server, along with the unit test test_tokenizer_api.py. To run it:

cd sglang/test/srt
python3 -m unittest test_tokenizer_api.TestTokenizerAPI

output:

command=python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-1B-Instruct --host 127.0.0.1 --port 8000
/usr/local/lib/python3.10/dist-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
No module named 'vllm._version'
  from vllm.version import __version__ as VLLM_VERSION
[2025-04-28 05:38:40] server_args=ServerArgs(model_path='meta-llama/Llama-3.2-1B-Instruct', tokenizer_path='meta-llama/Llama-3.2-1B-Instruct', tokenizer_mode='auto', skip_tokenizer_init=False, enable_tokenizer_batch_encode=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=None, device='cuda', served_model_name='meta-llama/Llama-3.2-1B-Instruct', chat_template=None, completion_template=None, is_embedding=False, revision=None, host='127.0.0.1', port=8000, mem_fraction_static=0.8639842241317652, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=1, stream_interval=1, stream_output=False, random_seed=106198034, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_nccl_nvls=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_multimodal=None, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_deepep_moe=False, deepep_mode='auto', enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=None, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through_selective', flashinfer_mla_disable_ragged=False, warmups=None, moe_dense_tp_size=None, n_share_experts_fusion=0, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_bootstrap_port=8998, disaggregation_transfer_backend='mooncake', disaggregation_ib_device=None)
/usr/local/lib/python3.10/dist-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
No module named 'vllm._version'
  from vllm.version import __version__ as VLLM_VERSION
/usr/local/lib/python3.10/dist-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
No module named 'vllm._version'
  from vllm.version import __version__ as VLLM_VERSION
[2025-04-28 05:38:48 TP0] Attention backend not set. Use fa3 backend by default.
[2025-04-28 05:38:48 TP0] Init torch distributed begin.
[rank0]:[W428 05:38:49.898395559 ProcessGroupGloo.cpp:727] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[2025-04-28 05:38:49 TP0] Init torch distributed ends. mem usage=0.00 GB
[2025-04-28 05:38:49 TP0] Load weight begin. avail mem=94.78 GB
[2025-04-28 05:38:50 TP0] Using model weights format ['*.safetensors']
[2025-04-28 05:38:51 TP0] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.72it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.72it/s]

[2025-04-28 05:38:51 TP0] Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=92.37 GB, mem usage=2.41 GB.
[2025-04-28 05:38:51 TP0] KV Cache is allocated. #tokens: 2604325, K size: 39.74 GB, V size: 39.74 GB
[2025-04-28 05:38:51 TP0] Memory pool end. avail mem=10.78 GB
[2025-04-28 05:38:51 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=10.68 GB
Capturing batches (avail_mem=6.05 GB): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:06<00:00,  5.52it/s]
[2025-04-28 05:38:58 TP0] Capture cuda graph end. Time elapsed: 6.48 s. avail mem=6.04 GB. mem usage=4.64 GB.
[2025-04-28 05:38:58 TP0] max_total_num_tokens=2604325, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=4097, context_len=131072
[2025-04-28 05:38:59] INFO:     Started server process [2615992]
[2025-04-28 05:38:59] INFO:     Waiting for application startup.
[2025-04-28 05:38:59] INFO:     Application startup complete.
[2025-04-28 05:38:59] INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
[2025-04-28 05:39:00] INFO:     127.0.0.1:59424 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-04-28 05:39:00 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-04-28 05:39:01] INFO:     127.0.0.1:59428 - "POST /generate HTTP/1.1" 200 OK
[2025-04-28 05:39:01] The server is fired up and ready to roll!
[2025-04-28 05:39:08 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-04-28 05:39:09] INFO:     127.0.0.1:54282 - "GET /health_generate HTTP/1.1" 200 OK
[2025-04-28 05:39:10] INFO:     127.0.0.1:54296 - "POST /detokenize HTTP/1.1" 200 OK
.[2025-04-28 05:39:10] INFO:     127.0.0.1:54300 - "POST /detokenize HTTP/1.1" 400 Bad Request
.[2025-04-28 05:39:10] INFO:     127.0.0.1:54310 - "POST /detokenize HTTP/1.1" 400 Bad Request
.[2025-04-28 05:39:10] INFO:     127.0.0.1:54322 - "POST /detokenize HTTP/1.1" 200 OK
[2025-04-28 05:39:10] INFO:     127.0.0.1:54330 - "POST /detokenize HTTP/1.1" 200 OK
.[2025-04-28 05:39:10] INFO:     127.0.0.1:54336 - "POST /detokenize HTTP/1.1" 200 OK
.[2025-04-28 05:39:10] INFO:     127.0.0.1:54350 - "POST /tokenize HTTP/1.1" 200 OK
[2025-04-28 05:39:10] INFO:     127.0.0.1:54364 - "POST /detokenize HTTP/1.1" 200 OK
.[2025-04-28 05:39:10] INFO:     127.0.0.1:54366 - "POST /tokenize HTTP/1.1" 200 OK
.[2025-04-28 05:39:10] INFO:     127.0.0.1:54382 - "POST /tokenize HTTP/1.1" 400 Bad Request
.[2025-04-28 05:39:10] INFO:     127.0.0.1:54396 - "POST /tokenize HTTP/1.1" 200 OK
.[2025-04-28 05:39:10] INFO:     127.0.0.1:54406 - "POST /tokenize HTTP/1.1" 200 OK
.[2025-04-28 05:39:10] Child process unexpectedly failed with an exit code 9. pid=2616298

----------------------------------------------------------------------
Ran 10 tests in 34.882s

OK
Running the test file directly also passes:

python3 test_tokenizer_api.py

output:

command=python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-1B-Instruct --host 127.0.0.1 --port 8000
/usr/local/lib/python3.10/dist-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
No module named 'vllm._version'
  from vllm.version import __version__ as VLLM_VERSION
[2025-04-28 05:39:43] server_args=ServerArgs(model_path='meta-llama/Llama-3.2-1B-Instruct', tokenizer_path='meta-llama/Llama-3.2-1B-Instruct', tokenizer_mode='auto', skip_tokenizer_init=False, enable_tokenizer_batch_encode=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=None, device='cuda', served_model_name='meta-llama/Llama-3.2-1B-Instruct', chat_template=None, completion_template=None, is_embedding=False, revision=None, host='127.0.0.1', port=8000, mem_fraction_static=0.8639842241317652, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=1, stream_interval=1, stream_output=False, random_seed=711975287, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_nccl_nvls=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_multimodal=None, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_deepep_moe=False, deepep_mode='auto', enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=None, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through_selective', flashinfer_mla_disable_ragged=False, warmups=None, moe_dense_tp_size=None, n_share_experts_fusion=0, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_bootstrap_port=8998, disaggregation_transfer_backend='mooncake', disaggregation_ib_device=None)
/usr/local/lib/python3.10/dist-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
No module named 'vllm._version'
  from vllm.version import __version__ as VLLM_VERSION
/usr/local/lib/python3.10/dist-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
No module named 'vllm._version'
  from vllm.version import __version__ as VLLM_VERSION
[2025-04-28 05:39:52 TP0] Attention backend not set. Use fa3 backend by default.
[2025-04-28 05:39:52 TP0] Init torch distributed begin.
[rank0]:[W428 05:39:52.206806070 ProcessGroupGloo.cpp:727] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[2025-04-28 05:39:52 TP0] Init torch distributed ends. mem usage=0.00 GB
[2025-04-28 05:39:52 TP0] Load weight begin. avail mem=94.78 GB
[2025-04-28 05:39:54 TP0] Using model weights format ['*.safetensors']
[2025-04-28 05:39:54 TP0] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.73it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.73it/s]

[2025-04-28 05:39:55 TP0] Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=92.37 GB, mem usage=2.41 GB.
[2025-04-28 05:39:55 TP0] KV Cache is allocated. #tokens: 2604325, K size: 39.74 GB, V size: 39.74 GB
[2025-04-28 05:39:55 TP0] Memory pool end. avail mem=10.78 GB
[2025-04-28 05:39:55 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=10.68 GB
Capturing batches (avail_mem=6.05 GB): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:06<00:00,  5.46it/s]
[2025-04-28 05:40:01 TP0] Capture cuda graph end. Time elapsed: 6.53 s. avail mem=6.04 GB. mem usage=4.64 GB.
[2025-04-28 05:40:02 TP0] max_total_num_tokens=2604325, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=4097, context_len=131072
[2025-04-28 05:40:03] INFO:     Started server process [2619481]
[2025-04-28 05:40:03] INFO:     Waiting for application startup.
[2025-04-28 05:40:03] INFO:     Application startup complete.
[2025-04-28 05:40:03] INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
[2025-04-28 05:40:04] INFO:     127.0.0.1:40742 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-04-28 05:40:04 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-04-28 05:40:05] INFO:     127.0.0.1:40756 - "POST /generate HTTP/1.1" 200 OK
[2025-04-28 05:40:05] The server is fired up and ready to roll!
[2025-04-28 05:40:11 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-04-28 05:40:12] INFO:     127.0.0.1:40764 - "GET /health_generate HTTP/1.1" 200 OK
[2025-04-28 05:40:12] INFO:     127.0.0.1:40768 - "POST /detokenize HTTP/1.1" 200 OK
.[2025-04-28 05:40:12] INFO:     127.0.0.1:40776 - "POST /detokenize HTTP/1.1" 400 Bad Request
.[2025-04-28 05:40:12] INFO:     127.0.0.1:40780 - "POST /detokenize HTTP/1.1" 400 Bad Request
.[2025-04-28 05:40:12] INFO:     127.0.0.1:40792 - "POST /detokenize HTTP/1.1" 200 OK
[2025-04-28 05:40:12] INFO:     127.0.0.1:40798 - "POST /detokenize HTTP/1.1" 200 OK
.[2025-04-28 05:40:12] INFO:     127.0.0.1:40806 - "POST /detokenize HTTP/1.1" 200 OK
.[2025-04-28 05:40:12] INFO:     127.0.0.1:40808 - "POST /tokenize HTTP/1.1" 200 OK
[2025-04-28 05:40:12] INFO:     127.0.0.1:40822 - "POST /detokenize HTTP/1.1" 200 OK
.[2025-04-28 05:40:12] INFO:     127.0.0.1:40832 - "POST /tokenize HTTP/1.1" 200 OK
.[2025-04-28 05:40:12] INFO:     127.0.0.1:40844 - "POST /tokenize HTTP/1.1" 400 Bad Request
.[2025-04-28 05:40:12] INFO:     127.0.0.1:40852 - "POST /tokenize HTTP/1.1" 200 OK
.[2025-04-28 05:40:12] INFO:     127.0.0.1:40862 - "POST /tokenize HTTP/1.1" 200 OK
.
----------------------------------------------------------------------
Ran 10 tests in 34.802s

OK
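
Once the server is running as in the output above, the new endpoints can be exercised directly. A small client sketch, assuming the same illustrative field names as before (the PR's exact request/response schema may differ):

```python
import requests

BASE_URL = "http://127.0.0.1:8000"  # matches the launch_server command above

# Tokenize a prompt; field names are assumptions, not the PR's confirmed schema.
resp = requests.post(f"{BASE_URL}/tokenize", json={"prompt": "Hello, world!"})
resp.raise_for_status()
tokens = resp.json()["tokens"]
print("token ids:", tokens)

# Detokenize the ids back into text.
resp = requests.post(f"{BASE_URL}/detokenize", json={"tokens": tokens})
resp.raise_for_status()
print("round-trip text:", resp.json()["text"])
```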

