feat: implement Tokenizer API, a simple wrapper over HuggingFace-style tokenizers. #5813

Open · wants to merge 6 commits into main

Conversation

minleminzui (Collaborator)

Motivation

Implement the Tokenizer API, a simple wrapper over HuggingFace-style tokenizers, analogous to vLLM's Tokenizer API: https://docs.vllm.ai/en/v0.8.4_a/serving/openai_compatible_server.html#tokenizer-api
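
To make the "wrapper" idea concrete, here is a minimal standalone sketch of a tokenize/detokenize service built directly on a HuggingFace tokenizer. It is illustrative only: sglang wires these endpoints into its own HTTP server and tokenizer handling, and the field names used below (`prompt`, `tokens`, `count`, `text`) are assumptions, not this PR's actual schema.

```python
# Illustrative sketch only -- NOT the PR's implementation.
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoTokenizer

app = FastAPI()
# Any HuggingFace tokenizer id works here; this one matches the model used in the tests.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")


class TokenizeRequest(BaseModel):
    prompt: str
    add_special_tokens: bool = True  # assumed option, mirroring HF encode()


class DetokenizeRequest(BaseModel):
    tokens: list[int]


@app.post("/tokenize")
def tokenize(req: TokenizeRequest):
    # Encode the prompt into token ids with the HuggingFace tokenizer.
    ids = tokenizer.encode(req.prompt, add_special_tokens=req.add_special_tokens)
    return {"tokens": ids, "count": len(ids)}


@app.post("/detokenize")
def detokenize(req: DetokenizeRequest):
    if not req.tokens:
        # The test log below shows 400 responses for malformed requests;
        # the exact conditions that trigger them in the PR are assumptions here.
        raise HTTPException(status_code=400, detail="tokens must be non-empty")
    # Decode the token ids back into text.
    return {"text": tokenizer.decode(req.tokens)}
```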

Modifications

Add /tokenize and /detokenize endpoints to the HTTP server, along with the unit test test_tokenizer_api.py. To run it:

cd sglang/test/srt
python3 -m unittest test_tokenizer_api.TestTokenizerAPI

output:

command=python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-1B-Instruct --host 127.0.0.1 --port 8000
/usr/local/lib/python3.10/dist-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
No module named 'vllm._version'
  from vllm.version import __version__ as VLLM_VERSION
[2025-04-28 05:38:40] server_args=ServerArgs(model_path='meta-llama/Llama-3.2-1B-Instruct', tokenizer_path='meta-llama/Llama-3.2-1B-Instruct', tokenizer_mode='auto', skip_tokenizer_init=False, enable_tokenizer_batch_encode=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=None, device='cuda', served_model_name='meta-llama/Llama-3.2-1B-Instruct', chat_template=None, completion_template=None, is_embedding=False, revision=None, host='127.0.0.1', port=8000, mem_fraction_static=0.8639842241317652, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=1, stream_interval=1, stream_output=False, random_seed=106198034, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_nccl_nvls=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_multimodal=None, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_deepep_moe=False, deepep_mode='auto', enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=None, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through_selective', flashinfer_mla_disable_ragged=False, warmups=None, moe_dense_tp_size=None, n_share_experts_fusion=0, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_bootstrap_port=8998, disaggregation_transfer_backend='mooncake', disaggregation_ib_device=None)
/usr/local/lib/python3.10/dist-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
No module named 'vllm._version'
  from vllm.version import __version__ as VLLM_VERSION
/usr/local/lib/python3.10/dist-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
No module named 'vllm._version'
  from vllm.version import __version__ as VLLM_VERSION
[2025-04-28 05:38:48 TP0] Attention backend not set. Use fa3 backend by default.
[2025-04-28 05:38:48 TP0] Init torch distributed begin.
[rank0]:[W428 05:38:49.898395559 ProcessGroupGloo.cpp:727] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[2025-04-28 05:38:49 TP0] Init torch distributed ends. mem usage=0.00 GB
[2025-04-28 05:38:49 TP0] Load weight begin. avail mem=94.78 GB
[2025-04-28 05:38:50 TP0] Using model weights format ['*.safetensors']
[2025-04-28 05:38:51 TP0] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.72it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.72it/s]

[2025-04-28 05:38:51 TP0] Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=92.37 GB, mem usage=2.41 GB.
[2025-04-28 05:38:51 TP0] KV Cache is allocated. #tokens: 2604325, K size: 39.74 GB, V size: 39.74 GB
[2025-04-28 05:38:51 TP0] Memory pool end. avail mem=10.78 GB
[2025-04-28 05:38:51 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=10.68 GB
Capturing batches (avail_mem=6.05 GB): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:06<00:00,  5.52it/s]
[2025-04-28 05:38:58 TP0] Capture cuda graph end. Time elapsed: 6.48 s. avail mem=6.04 GB. mem usage=4.64 GB.
[2025-04-28 05:38:58 TP0] max_total_num_tokens=2604325, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=4097, context_len=131072
[2025-04-28 05:38:59] INFO:     Started server process [2615992]
[2025-04-28 05:38:59] INFO:     Waiting for application startup.
[2025-04-28 05:38:59] INFO:     Application startup complete.
[2025-04-28 05:38:59] INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
[2025-04-28 05:39:00] INFO:     127.0.0.1:59424 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-04-28 05:39:00 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-04-28 05:39:01] INFO:     127.0.0.1:59428 - "POST /generate HTTP/1.1" 200 OK
[2025-04-28 05:39:01] The server is fired up and ready to roll!
[2025-04-28 05:39:08 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-04-28 05:39:09] INFO:     127.0.0.1:54282 - "GET /health_generate HTTP/1.1" 200 OK
[2025-04-28 05:39:10] INFO:     127.0.0.1:54296 - "POST /detokenize HTTP/1.1" 200 OK
.[2025-04-28 05:39:10] INFO:     127.0.0.1:54300 - "POST /detokenize HTTP/1.1" 400 Bad Request
.[2025-04-28 05:39:10] INFO:     127.0.0.1:54310 - "POST /detokenize HTTP/1.1" 400 Bad Request
.[2025-04-28 05:39:10] INFO:     127.0.0.1:54322 - "POST /detokenize HTTP/1.1" 200 OK
[2025-04-28 05:39:10] INFO:     127.0.0.1:54330 - "POST /detokenize HTTP/1.1" 200 OK
.[2025-04-28 05:39:10] INFO:     127.0.0.1:54336 - "POST /detokenize HTTP/1.1" 200 OK
.[2025-04-28 05:39:10] INFO:     127.0.0.1:54350 - "POST /tokenize HTTP/1.1" 200 OK
[2025-04-28 05:39:10] INFO:     127.0.0.1:54364 - "POST /detokenize HTTP/1.1" 200 OK
.[2025-04-28 05:39:10] INFO:     127.0.0.1:54366 - "POST /tokenize HTTP/1.1" 200 OK
.[2025-04-28 05:39:10] INFO:     127.0.0.1:54382 - "POST /tokenize HTTP/1.1" 400 Bad Request
.[2025-04-28 05:39:10] INFO:     127.0.0.1:54396 - "POST /tokenize HTTP/1.1" 200 OK
.[2025-04-28 05:39:10] INFO:     127.0.0.1:54406 - "POST /tokenize HTTP/1.1" 200 OK
.[2025-04-28 05:39:10] Child process unexpectedly failed with an exit code 9. pid=2616298

----------------------------------------------------------------------
Ran 10 tests in 34.882s

OK
Running the test file directly also passes:

python3 test_tokenizer_api.py

output:

command=python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-1B-Instruct --host 127.0.0.1 --port 8000
/usr/local/lib/python3.10/dist-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
No module named 'vllm._version'
  from vllm.version import __version__ as VLLM_VERSION
[2025-04-28 05:39:43] server_args=ServerArgs(model_path='meta-llama/Llama-3.2-1B-Instruct', tokenizer_path='meta-llama/Llama-3.2-1B-Instruct', tokenizer_mode='auto', skip_tokenizer_init=False, enable_tokenizer_batch_encode=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=None, device='cuda', served_model_name='meta-llama/Llama-3.2-1B-Instruct', chat_template=None, completion_template=None, is_embedding=False, revision=None, host='127.0.0.1', port=8000, mem_fraction_static=0.8639842241317652, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=1, stream_interval=1, stream_output=False, random_seed=711975287, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_nccl_nvls=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_multimodal=None, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_deepep_moe=False, deepep_mode='auto', enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=None, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through_selective', flashinfer_mla_disable_ragged=False, warmups=None, moe_dense_tp_size=None, n_share_experts_fusion=0, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_bootstrap_port=8998, disaggregation_transfer_backend='mooncake', disaggregation_ib_device=None)
/usr/local/lib/python3.10/dist-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
No module named 'vllm._version'
  from vllm.version import __version__ as VLLM_VERSION
/usr/local/lib/python3.10/dist-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
No module named 'vllm._version'
  from vllm.version import __version__ as VLLM_VERSION
[2025-04-28 05:39:52 TP0] Attention backend not set. Use fa3 backend by default.
[2025-04-28 05:39:52 TP0] Init torch distributed begin.
[rank0]:[W428 05:39:52.206806070 ProcessGroupGloo.cpp:727] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[2025-04-28 05:39:52 TP0] Init torch distributed ends. mem usage=0.00 GB
[2025-04-28 05:39:52 TP0] Load weight begin. avail mem=94.78 GB
[2025-04-28 05:39:54 TP0] Using model weights format ['*.safetensors']
[2025-04-28 05:39:54 TP0] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.73it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.73it/s]

[2025-04-28 05:39:55 TP0] Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=92.37 GB, mem usage=2.41 GB.
[2025-04-28 05:39:55 TP0] KV Cache is allocated. #tokens: 2604325, K size: 39.74 GB, V size: 39.74 GB
[2025-04-28 05:39:55 TP0] Memory pool end. avail mem=10.78 GB
[2025-04-28 05:39:55 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=10.68 GB
Capturing batches (avail_mem=6.05 GB): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:06<00:00,  5.46it/s]
[2025-04-28 05:40:01 TP0] Capture cuda graph end. Time elapsed: 6.53 s. avail mem=6.04 GB. mem usage=4.64 GB.
[2025-04-28 05:40:02 TP0] max_total_num_tokens=2604325, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=4097, context_len=131072
[2025-04-28 05:40:03] INFO:     Started server process [2619481]
[2025-04-28 05:40:03] INFO:     Waiting for application startup.
[2025-04-28 05:40:03] INFO:     Application startup complete.
[2025-04-28 05:40:03] INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
[2025-04-28 05:40:04] INFO:     127.0.0.1:40742 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-04-28 05:40:04 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-04-28 05:40:05] INFO:     127.0.0.1:40756 - "POST /generate HTTP/1.1" 200 OK
[2025-04-28 05:40:05] The server is fired up and ready to roll!
[2025-04-28 05:40:11 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-04-28 05:40:12] INFO:     127.0.0.1:40764 - "GET /health_generate HTTP/1.1" 200 OK
[2025-04-28 05:40:12] INFO:     127.0.0.1:40768 - "POST /detokenize HTTP/1.1" 200 OK
.[2025-04-28 05:40:12] INFO:     127.0.0.1:40776 - "POST /detokenize HTTP/1.1" 400 Bad Request
.[2025-04-28 05:40:12] INFO:     127.0.0.1:40780 - "POST /detokenize HTTP/1.1" 400 Bad Request
.[2025-04-28 05:40:12] INFO:     127.0.0.1:40792 - "POST /detokenize HTTP/1.1" 200 OK
[2025-04-28 05:40:12] INFO:     127.0.0.1:40798 - "POST /detokenize HTTP/1.1" 200 OK
.[2025-04-28 05:40:12] INFO:     127.0.0.1:40806 - "POST /detokenize HTTP/1.1" 200 OK
.[2025-04-28 05:40:12] INFO:     127.0.0.1:40808 - "POST /tokenize HTTP/1.1" 200 OK
[2025-04-28 05:40:12] INFO:     127.0.0.1:40822 - "POST /detokenize HTTP/1.1" 200 OK
.[2025-04-28 05:40:12] INFO:     127.0.0.1:40832 - "POST /tokenize HTTP/1.1" 200 OK
.[2025-04-28 05:40:12] INFO:     127.0.0.1:40844 - "POST /tokenize HTTP/1.1" 400 Bad Request
.[2025-04-28 05:40:12] INFO:     127.0.0.1:40852 - "POST /tokenize HTTP/1.1" 200 OK
.[2025-04-28 05:40:12] INFO:     127.0.0.1:40862 - "POST /tokenize HTTP/1.1" 200 OK
.
----------------------------------------------------------------------
Ran 10 tests in 34.802s

OK
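
Once the server is running as in the output above, the new endpoints can be exercised directly. A small client sketch, assuming the same illustrative field names as before (the PR's exact request/response schema may differ):

```python
import requests

BASE_URL = "http://127.0.0.1:8000"  # matches the launch_server command above

# Tokenize a prompt; field names are assumptions, not the PR's confirmed schema.
resp = requests.post(f"{BASE_URL}/tokenize", json={"prompt": "Hello, world!"})
resp.raise_for_status()
tokens = resp.json()["tokens"]
print("token ids:", tokens)

# Detokenize the ids back into text.
resp = requests.post(f"{BASE_URL}/detokenize", json={"tokens": tokens})
resp.raise_for_status()
print("round-trip text:", resp.json()["text"])
```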

