1. After we load the original model with `vllm_model = vllm_get_model(vllm_config=vllm_config_for_load)`, `vllm_model` can be inspected with:

   ```python
   for idx, m in enumerate(vllm_model.named_modules()):
       print(idx, '->', m)
   ```

   The full dump is at https://gist.github.com/vanbasten23/56a5cf844c0a527453a37af36efd3193
2. After replacing the layers with LoRA layers (via `load_lora_model`), the same loop prints:

   ```
   0 -> ('', _VllmRunner(
     (vllm_model): Qwen2ForCausalLM(
       (model): Qwen2Model(
         (embed_tokens): VocabParallelEmbedding(num_embeddings=151936, embedding_dim=2048, org_vocab_size=151936, num_embeddings_padded=151936, tp_size=1)
         (layers): ModuleList(
           (0-35): 36 x Qwen2DecoderLayer(
             (self_attn): Qwen2Attention(
               (qkv_proj): MergedQKVParallelLinearWithLoRA(
                 (base_layer): JaxQKVParallelLinear()
               )
   ```
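
For context, the replacement that `load_lora_model` performs follows the standard LoRA pattern: the original projection survives as `base_layer`, and a low-rank update is applied on top of it. Below is a minimal, self-contained sketch of that idea; it is not vLLM's implementation, and the class and parameter names are illustrative only.

```python
import torch
import torch.nn as nn


class LinearWithLoRA(nn.Module):
    """Illustrative LoRA wrapper (hypothetical class, not vLLM's):
    keeps the original projection as `base_layer` and adds a scaled
    low-rank A/B correction, mirroring the `(base_layer): ...` nesting
    visible in the module dump above."""

    def __init__(self, base_layer: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base_layer = base_layer  # original projection, typically frozen
        self.lora_a = nn.Linear(base_layer.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base_layer.out_features, bias=False)
        self.scaling = alpha / rank
        nn.init.zeros_(self.lora_b.weight)  # B starts at zero, so the wrapper is initially a no-op

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # base output + scaled low-rank correction
        return self.base_layer(x) + self.scaling * self.lora_b(self.lora_a(x))


# In-place module replacement, analogous to what the dump shows for qkv_proj:
layer = nn.Linear(2048, 2560)
wrapped = LinearWithLoRA(layer)
print(wrapped)  # shows the same (base_layer): ... nesting as the dump above
```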