[instruct@bastion ~]$ ilab data generate --model /var/home/instruct/.cache/instructlab/models/mistralai/Mixtral-8x7B-Instruct-v0.1 --enable-serving-output
INFO 2024-11-14 09:48:03,104 numexpr.utils:148: Note: NumExpr detected 48 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
INFO 2024-11-14 09:48:03,104 numexpr.utils:161: NumExpr defaulting to 16 threads.
INFO 2024-11-14 09:48:04,236 datasets:59: PyTorch version 2.3.1 available.
INFO 2024-11-14 09:48:05,861 instructlab.model.backends.vllm:105: Trying to connect to model server at http://127.0.0.1:8000/v1
INFO 2024-11-14 09:48:07,345 instructlab.model.backends.vllm:308: vLLM starting up on pid 64 at http://127.0.0.1:44617/v1
INFO 2024-11-14 09:48:07,345 instructlab.model.backends.vllm:114: Starting a temporary vLLM server at http://127.0.0.1:44617/v1
INFO 2024-11-14 09:48:07,345 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:44617/v1, this might take a moment... Attempt: 1/120
INFO 2024-11-14 09:48:10,577 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:44617/v1, this might take a moment... Attempt: 2/120
INFO 11-14 09:48:12 api_server.py:212] vLLM API server version 0.5.2.4
INFO 11-14 09:48:12 api_server.py:213] args: Namespace(host='127.0.0.1', port=44617, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=[LoRAModulePath(name='skill-classifier-v3-clm', local_path='/var/home/instruct/.cache/instructlab/models/skills-adapter-v3'), LoRAModulePath(name='text-classifier-knowledge-v3-clm', local_path='/var/home/instruct/.cache/instructlab/models/knowledge-adapter-v3')], prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/var/home/instruct/.cache/instructlab/models/mistralai/Mixtral-8x7B-Instruct-v0.1', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='bfloat16', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend='mp', worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=True, max_loras=1, max_lora_rank=64, lora_extra_vocab_size=256, lora_dtype='bfloat16', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=True, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, model_loader_extra_config=None, preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 11-14 09:48:12 llm_engine.py:174] Initializing an LLM engine (v0.5.2.4) with config: model='/var/home/instruct/.cache/instructlab/models/mistralai/Mixtral-8x7B-Instruct-v0.1', speculative_config=None, tokenizer='/var/home/instruct/.cache/instructlab/models/mistralai/Mixtral-8x7B-Instruct-v0.1', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/var/home/instruct/.cache/instructlab/models/mistralai/Mixtral-8x7B-Instruct-v0.1, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 11-14 09:48:12 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=85) INFO 11-14 09:48:13 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=83) INFO 11-14 09:48:13 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=84) INFO 11-14 09:48:13 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
INFO 2024-11-14 09:48:13,854 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:44617/v1, this might take a moment... Attempt: 3/120
(VllmWorkerProcess pid=84) INFO 11-14 09:48:14 utils.py:737] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=83) INFO 11-14 09:48:14 utils.py:737] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=84) INFO 11-14 09:48:14 pynccl.py:63] vLLM is using nccl==2.22.3
INFO 11-14 09:48:14 utils.py:737] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=83) INFO 11-14 09:48:14 pynccl.py:63] vLLM is using nccl==2.22.3
INFO 11-14 09:48:14 pynccl.py:63] vLLM is using nccl==2.22.3
(VllmWorkerProcess pid=85) INFO 11-14 09:48:14 utils.py:737] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=85) INFO 11-14 09:48:14 pynccl.py:63] vLLM is using nccl==2.22.3
(VllmWorkerProcess pid=85) WARNING 11-14 09:48:14 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 11-14 09:48:14 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=83) WARNING 11-14 09:48:14 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=84) WARNING 11-14 09:48:14 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=85) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method load_model: CUDA out of memory. Tried to allocate 224.00 MiB. GPU has a total capacity of 21.95 GiB of which 142.12 MiB is free. Process 15487 has 21.80 GiB memory in use. Of the allocated memory 21.48 GiB is allocated by PyTorch, and 21.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables), Traceback (most recent call last):
(VllmWorkerProcess pid=85) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=85) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] output = executor(*args, **kwargs)
(VllmWorkerProcess pid=85) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=85) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/worker.py", line 139, in load_model
(VllmWorkerProcess pid=85) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] self.model_runner.load_model()
(VllmWorkerProcess pid=85) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/model_runner.py", line 256, in load_model
(VllmWorkerProcess pid=85) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] self.model = get_model(model_config=self.model_config,
(VllmWorkerProcess pid=85) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=85) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
(VllmWorkerProcess pid=85) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] return loader.load_model(model_config=model_config,
(VllmWorkerProcess pid=85) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=85) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/model_loader/loader.py", line 267, in load_model
(VllmWorkerProcess pid=85) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] model = _initialize_model(model_config, self.load_config,
(VllmWorkerProcess pid=85) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=85) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/model_loader/loader.py", line 104, in _initialize_model
(VllmWorkerProcess pid=85) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] return model_class(config=model_config.hf_config,
(VllmWorkerProcess pid=85) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=85) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/models/mixtral.py", line 320, in __init__
(VllmWorkerProcess pid=85) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] self.model = MixtralModel(config,
(VllmWorkerProcess pid=85) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=85) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/models/mixtral.py", line 258, in __init__
(VllmWorkerProcess pid=85) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] self.layers = nn.ModuleList([
(VllmWorkerProcess pid=85) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] ^
(VllmWorkerProcess pid=85) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/models/mixtral.py", line 259, in <listcomp>
(VllmWorkerProcess pid=85) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] MixtralDecoderLayer(config,
(VllmWorkerProcess pid=85) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/models/mixtral.py", line 197, in __init__
(VllmWorkerProcess pid=85) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] self.block_sparse_moe = MixtralMoE(
(VllmWorkerProcess pid=85) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] ^^^^^^^^^^^
(VllmWorkerProcess pid=85) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/models/mixtral.py", line 80, in __init__
(VllmWorkerProcess pid=85) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] self.experts = FusedMoE(num_experts=num_experts,
(VllmWorkerProcess pid=85) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=85) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 147, in __init__
(VllmWorkerProcess pid=85) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] self.quant_method.create_weights(
(VllmWorkerProcess pid=85) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 56, in create_weights
(VllmWorkerProcess pid=85) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] w2_weight = torch.nn.Parameter(torch.empty(num_experts,
(VllmWorkerProcess pid=85) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=85) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] File "/opt/app-root/lib64/python3.11/site-packages/torch/utils/_device.py", line 78, in __torch_function__
(VllmWorkerProcess pid=85) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] return func(*args, **kwargs)
(VllmWorkerProcess pid=85) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=85) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU has a total capacity of 21.95 GiB of which 142.12 MiB is free. Process 15487 has 21.80 GiB memory in use. Of the allocated memory 21.48 GiB is allocated by PyTorch, and 21.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
(VllmWorkerProcess pid=85) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226]
(VllmWorkerProcess pid=84) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method load_model: CUDA out of memory. Tried to allocate 224.00 MiB. GPU has a total capacity of 21.95 GiB of which 142.12 MiB is free. Process 15486 has 21.80 GiB memory in use. Of the allocated memory 21.48 GiB is allocated by PyTorch, and 21.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables), Traceback (most recent call last):
(VllmWorkerProcess pid=84) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=84) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] output = executor(*args, **kwargs)
(VllmWorkerProcess pid=84) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=84) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/worker.py", line 139, in load_model
(VllmWorkerProcess pid=84) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] self.model_runner.load_model()
(VllmWorkerProcess pid=84) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/model_runner.py", line 256, in load_model
(VllmWorkerProcess pid=84) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] self.model = get_model(model_config=self.model_config,
(VllmWorkerProcess pid=84) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=84) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
(VllmWorkerProcess pid=84) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] return loader.load_model(model_config=model_config,
(VllmWorkerProcess pid=84) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=84) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/model_loader/loader.py", line 267, in load_model
(VllmWorkerProcess pid=84) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] model = _initialize_model(model_config, self.load_config,
(VllmWorkerProcess pid=84) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=84) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/model_loader/loader.py", line 104, in _initialize_model
(VllmWorkerProcess pid=84) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] return model_class(config=model_config.hf_config,
(VllmWorkerProcess pid=84) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=84) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/models/mixtral.py", line 320, in __init__
(VllmWorkerProcess pid=84) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] self.model = MixtralModel(config,
(VllmWorkerProcess pid=84) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=84) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/models/mixtral.py", line 258, in __init__
(VllmWorkerProcess pid=84) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] self.layers = nn.ModuleList([
(VllmWorkerProcess pid=84) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] ^
(VllmWorkerProcess pid=84) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/models/mixtral.py", line 259, in <listcomp>
(VllmWorkerProcess pid=84) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] MixtralDecoderLayer(config,
(VllmWorkerProcess pid=84) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/models/mixtral.py", line 197, in __init__
(VllmWorkerProcess pid=84) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] self.block_sparse_moe = MixtralMoE(
(VllmWorkerProcess pid=84) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] ^^^^^^^^^^^
(VllmWorkerProcess pid=84) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/models/mixtral.py", line 80, in __init__
(VllmWorkerProcess pid=84) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] self.experts = FusedMoE(num_experts=num_experts,
(VllmWorkerProcess pid=84) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=84) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 147, in __init__
(VllmWorkerProcess pid=84) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] self.quant_method.create_weights(
(VllmWorkerProcess pid=84) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 56, in create_weights
(VllmWorkerProcess pid=84) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] w2_weight = torch.nn.Parameter(torch.empty(num_experts,
(VllmWorkerProcess pid=84) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=84) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] File "/opt/app-root/lib64/python3.11/site-packages/torch/utils/_device.py", line 78, in __torch_function__
(VllmWorkerProcess pid=84) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] return func(*args, **kwargs)
(VllmWorkerProcess pid=84) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=84) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU has a total capacity of 21.95 GiB of which 142.12 MiB is free. Process 15486 has 21.80 GiB memory in use. Of the allocated memory 21.48 GiB is allocated by PyTorch, and 21.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
(VllmWorkerProcess pid=84) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226]
(VllmWorkerProcess pid=83) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method load_model: CUDA out of memory. Tried to allocate 224.00 MiB. GPU has a total capacity of 21.95 GiB of which 142.12 MiB is free. Process 15485 has 21.80 GiB memory in use. Of the allocated memory 21.48 GiB is allocated by PyTorch, and 21.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables), Traceback (most recent call last):
(VllmWorkerProcess pid=83) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=83) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] output = executor(*args, **kwargs)
(VllmWorkerProcess pid=83) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=83) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/worker.py", line 139, in load_model
(VllmWorkerProcess pid=83) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] self.model_runner.load_model()
(VllmWorkerProcess pid=83) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/model_runner.py", line 256, in load_model
(VllmWorkerProcess pid=83) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] self.model = get_model(model_config=self.model_config,
(VllmWorkerProcess pid=83) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=83) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
(VllmWorkerProcess pid=83) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] return loader.load_model(model_config=model_config,
(VllmWorkerProcess pid=83) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=83) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/model_loader/loader.py", line 267, in load_model
(VllmWorkerProcess pid=83) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] model = _initialize_model(model_config, self.load_config,
(VllmWorkerProcess pid=83) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=83) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/model_loader/loader.py", line 104, in _initialize_model
(VllmWorkerProcess pid=83) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] return model_class(config=model_config.hf_config,
(VllmWorkerProcess pid=83) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=83) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/models/mixtral.py", line 320, in __init__
(VllmWorkerProcess pid=83) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] self.model = MixtralModel(config,
(VllmWorkerProcess pid=83) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=83) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/models/mixtral.py", line 258, in __init__
(VllmWorkerProcess pid=83) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] self.layers = nn.ModuleList([
(VllmWorkerProcess pid=83) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] ^
(VllmWorkerProcess pid=83) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/models/mixtral.py", line 259, in <listcomp>
(VllmWorkerProcess pid=83) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] MixtralDecoderLayer(config,
(VllmWorkerProcess pid=83) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/models/mixtral.py", line 197, in __init__
(VllmWorkerProcess pid=83) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] self.block_sparse_moe = MixtralMoE(
(VllmWorkerProcess pid=83) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] ^^^^^^^^^^^
(VllmWorkerProcess pid=83) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/models/mixtral.py", line 80, in __init__
(VllmWorkerProcess pid=83) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] self.experts = FusedMoE(num_experts=num_experts,
(VllmWorkerProcess pid=83) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=83) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 147, in __init__
(VllmWorkerProcess pid=83) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] self.quant_method.create_weights(
(VllmWorkerProcess pid=83) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 56, in create_weights
(VllmWorkerProcess pid=83) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] w2_weight = torch.nn.Parameter(torch.empty(num_experts,
(VllmWorkerProcess pid=83) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=83) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] File "/opt/app-root/lib64/python3.11/site-packages/torch/utils/_device.py", line 78, in __torch_function__
(VllmWorkerProcess pid=83) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] return func(*args, **kwargs)
(VllmWorkerProcess pid=83) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=83) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226] torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU has a total capacity of 21.95 GiB of which 142.12 MiB is free. Process 15485 has 21.80 GiB memory in use. Of the allocated memory 21.48 GiB is allocated by PyTorch, and 21.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
(VllmWorkerProcess pid=83) ERROR 11-14 09:48:14 multiproc_worker_utils.py:226]
[rank0]: Traceback (most recent call last):
[rank0]: File "<frozen runpy>", line 198, in _run_module_as_main
[rank0]: File "<frozen runpy>", line 88, in _run_code
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 282, in <module>
[rank0]: run_server(args)
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 224, in run_server
[rank0]: if llm_engine is not None else AsyncLLMEngine.from_engine_args(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 444, in from_engine_args
[rank0]: engine = cls(
[rank0]: ^^^^
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 373, in __init__
[rank0]: self.engine = self._init_engine(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 525, in _init_engine
[rank0]: return engine_class(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/llm_engine.py", line 249, in __init__
[rank0]: self.model_executor = executor_class(
[rank0]: ^^^^^^^^^^^^^^^
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 158, in __init__
[rank0]: super().__init__(*args, **kwargs)
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
[rank0]: super().__init__(*args, **kwargs)
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/executor_base.py", line 150, in __init__
[rank0]: super().__init__(model_config, cache_config, parallel_config,
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/executor_base.py", line 46, in __init__
[rank0]: self._init_executor()
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 84, in _init_executor
[rank0]: self._run_workers("load_model",
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 135, in _run_workers
[rank0]: driver_worker_output = driver_worker_method(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/worker.py", line 139, in load_model
[rank0]: self.model_runner.load_model()
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/model_runner.py", line 256, in load_model
[rank0]: self.model = get_model(model_config=self.model_config,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
[rank0]: return loader.load_model(model_config=model_config,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/model_loader/loader.py", line 267, in load_model
[rank0]: model = _initialize_model(model_config, self.load_config,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/model_loader/loader.py", line 104, in _initialize_model
[rank0]: return model_class(config=model_config.hf_config,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/models/mixtral.py", line 320, in __init__
[rank0]: self.model = MixtralModel(config,
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/models/mixtral.py", line 258, in __init__
[rank0]: self.layers = nn.ModuleList([
[rank0]: ^
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/models/mixtral.py", line 259, in <listcomp>
[rank0]: MixtralDecoderLayer(config,
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/models/mixtral.py", line 197, in __init__
[rank0]: self.block_sparse_moe = MixtralMoE(
[rank0]: ^^^^^^^^^^^
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/models/mixtral.py", line 80, in __init__
[rank0]: self.experts = FusedMoE(num_experts=num_experts,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 147, in __init__
[rank0]: self.quant_method.create_weights(
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 56, in create_weights
[rank0]: w2_weight = torch.nn.Parameter(torch.empty(num_experts,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/torch/utils/_device.py", line 78, in __torch_function__
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU
ERROR 11-14 09:48:14 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 85 died, exit code: -15
INFO 11-14 09:48:14 multiproc_worker_utils.py:123] Killing local vLLM worker processes
/usr/lib64/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
INFO 2024-11-14 09:48:17,315 instructlab.model.backends.vllm:171: vLLM startup failed. Retrying (1/1)
ERROR 2024-11-14 09:48:17,316 instructlab.model.backends.vllm:176: vLLM failed to start.
INFO 2024-11-14 09:48:17,316 instructlab.model.backends.vllm:105: Trying to connect to model server at http://127.0.0.1:8000/v1
INFO 2024-11-14 09:48:18,572 instructlab.model.backends.vllm:308: vLLM starting up on pid 194 at http://127.0.0.1:43601/v1
INFO 2024-11-14 09:48:18,572 instructlab.model.backends.vllm:114: Starting a temporary vLLM server at http://127.0.0.1:43601/v1
INFO 2024-11-14 09:48:18,572 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:43601/v1, this might take a moment... Attempt: 1/120
INFO 11-14 09:48:21 api_server.py:212] vLLM API server version 0.5.2.4
INFO 11-14 09:48:21 api_server.py:213] args: Namespace(host='127.0.0.1', port=43601, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=[LoRAModulePath(name='skill-classifier-v3-clm', local_path='/var/home/instruct/.cache/instructlab/models/skills-adapter-v3'), LoRAModulePath(name='text-classifier-knowledge-v3-clm', local_path='/var/home/instruct/.cache/instructlab/models/knowledge-adapter-v3')], prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/var/home/instruct/.cache/instructlab/models/mistralai/Mixtral-8x7B-Instruct-v0.1', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='bfloat16', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend='mp', worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=True, max_loras=1, max_lora_rank=64, lora_extra_vocab_size=256, lora_dtype='bfloat16', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=True, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, model_loader_extra_config=None, preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 11-14 09:48:21 llm_engine.py:174] Initializing an LLM engine (v0.5.2.4) with config: model='/var/home/instruct/.cache/instructlab/models/mistralai/Mixtral-8x7B-Instruct-v0.1', speculative_config=None, tokenizer='/var/home/instruct/.cache/instructlab/models/mistralai/Mixtral-8x7B-Instruct-v0.1', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/var/home/instruct/.cache/instructlab/models/mistralai/Mixtral-8x7B-Instruct-v0.1, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 11-14 09:48:21 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
INFO 2024-11-14 09:48:21,852 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:43601/v1, this might take a moment... Attempt: 2/120
(VllmWorkerProcess pid=213) INFO 11-14 09:48:22 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=215) INFO 11-14 09:48:22 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=214) INFO 11-14 09:48:22 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=214) INFO 11-14 09:48:23 utils.py:737] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=214) INFO 11-14 09:48:23 pynccl.py:63] vLLM is using nccl==2.22.3
(VllmWorkerProcess pid=213) INFO 11-14 09:48:23 utils.py:737] Found nccl from library libnccl.so.2
INFO 11-14 09:48:23 utils.py:737] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=213) INFO 11-14 09:48:23 pynccl.py:63] vLLM is using nccl==2.22.3
INFO 11-14 09:48:23 pynccl.py:63] vLLM is using nccl==2.22.3
(VllmWorkerProcess pid=215) INFO 11-14 09:48:23 utils.py:737] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=215) INFO 11-14 09:48:23 pynccl.py:63] vLLM is using nccl==2.22.3
(VllmWorkerProcess pid=213) WARNING 11-14 09:48:23 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=214) WARNING 11-14 09:48:23 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 11-14 09:48:23 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=215) WARNING 11-14 09:48:23 custom_all_reduce.py:118] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
[rank0]: Traceback (most recent call last):
[rank0]: File "<frozen runpy>", line 198, in _run_module_as_main
[rank0]: File "<frozen runpy>", line 88, in _run_code
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 282, in <module>
[rank0]: run_server(args)
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 224, in run_server
[rank0]: if llm_engine is not None else AsyncLLMEngine.from_engine_args(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 444, in from_engine_args
[rank0]: engine = cls(
[rank0]: ^^^^
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 373, in __init__
[rank0]: self.engine = self._init_engine(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 525, in _init_engine
[rank0]: return engine_class(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/llm_engine.py", line 249, in __init__
[rank0]: self.model_executor = executor_class(
[rank0]: ^^^^^^^^^^^^^^^
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 158, in __init__
[rank0]: super().__init__(*args, **kwargs)
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
[rank0]: super().__init__(*args, **kwargs)
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/executor_base.py", line 150, in __init__
[rank0]: super().__init__(model_config, cache_config, parallel_config,
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/executor_base.py", line 46, in __init__
[rank0]: self._init_executor()
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 84, in _init_executor
[rank0]: self._run_workers("load_model",
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 135, in _run_workers
[rank0]: driver_worker_output = driver_worker_method(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/worker.py", line 139, in load_model
[rank0]: self.model_runner.load_model()
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/model_runner.py", line 256, in load_model
[rank0]: self.model = get_model(model_config=self.model_config,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
[rank0]: return loader.load_model(model_config=model_config,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/model_loader/loader.py", line 267, in load_model
[rank0]: model = _initialize_model(model_config, self.load_config,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/model_loader/loader.py", line 104, in _initialize_model
[rank0]: return model_class(config=model_config.hf_config,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/models/mixtral.py", line 320, in __init__
[rank0]: self.model = MixtralModel(config,
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/models/mixtral.py", line 258, in __init__
[rank0]: self.layers = nn.ModuleList([
[rank0]: ^
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/models/mixtral.py", line 259, in <listcomp>
[rank0]: MixtralDecoderLayer(config,
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/models/mixtral.py", line 197, in __init__
[rank0]: self.block_sparse_moe = MixtralMoE(
[rank0]: ^^^^^^^^^^^
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/models/mixtral.py", line 80, in __init__
[rank0]: self.experts = FusedMoE(num_experts=num_experts,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 147, in __init__
[rank0]: self.quant_method.create_weights(
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 56, in create_weights
[rank0]: w2_weight = torch.nn.Parameter(torch.empty(num_experts,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/app-root/lib64/python3.11/site-packages/torch/utils/_device.py", line 78, in __torch_function__
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU
ERROR 11-14 09:48:23 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 215 died, exit code: -15
INFO 11-14 09:48:23 multiproc_worker_utils.py:123] Killing local vLLM worker processes
/usr/lib64/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
Failed to start server: vLLM failed to start.
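
What the log shows: ilab data generate launches a temporary vLLM server (v0.5.2.4) to serve Mixtral-8x7B-Instruct-v0.1 across four GPUs (tensor_parallel_size=4), but every worker fails in load_model with torch.cuda.OutOfMemoryError. Each 21.95 GiB GPU has only about 142 MiB free because another process (reported as 15485, 15486 and 15487 on the worker GPUs) already holds 21.80 GiB, so both startup attempts fail and the run aborts with "Failed to start server: vLLM failed to start."

Below is a minimal triage sketch, assuming NVIDIA GPUs with nvidia-smi available and that the memory is held by a stale serving process left over from an earlier run; <pid> is a placeholder to be taken from the nvidia-smi output, not a value from this log.

# 1. Check which processes currently hold GPU memory.
nvidia-smi

# 2. If a leftover vLLM (or other) server is still resident, stop it so the
#    GPUs are actually free before retrying.
kill <pid>

# 3. Optionally follow the hint printed by PyTorch to reduce fragmentation,
#    then rerun the original command.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
ilab data generate \
  --model /var/home/instruct/.cache/instructlab/models/mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --enable-serving-output

Note that expandable_segments only mitigates fragmentation; with 21.80 of 21.95 GiB already allocated by another process on each GPU, the real fix is to release that memory (or move the run to idle GPUs) before starting the server again.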