Chunk 0:
This file is a merged representation of the entire codebase, combining all repository files into a single document. Generated by Repomix on 2024-12-29T19:55:17.135Z. File Summary. Purpose: This file contains a packed representation of the entire repository's contents. It is designed to be easily consumable by AI systems for analysis, code review, or other automated processes.
Chunk 1:
File Format: The content is organized as follows: 1. This summary section 2. Repository information 3.
Chunk 2:
Directory structure 4. Multiple file entries, each consisting of: a. A separator line b.
Chunk 3:
The file path (File: path/to/file) c. Another separator line d. The full contents of the file e.
Chunk 4:
A blank line. Usage Guidelines: This file should be treated as read-only. Any changes should be made to the original repository files, not this packed version. When processing this file, use the file path to distinguish between different files in the repository.
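For orientation, a single packed file entry as described above looks roughly like the sketch below. The exact separator characters depend on Repomix's configured output style, so treat this layout as an illustration rather than the literal format:

================
File: path/to/file
================
(full contents of the file)
(blank line)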
Chunk 5:
Be aware that this file may contain sensitive information. Handle it with the same level of security as you would the original repository. Notes: Some files may have been excluded based on .gitignore rules and Repomix's configuration.
Chunk 6:
Binary files are not included in this packed representation. Please refer to the Repository Structure section for a complete list of file paths, including binary files. Additional Info: For more information about Repomix, visit https://github.com/yamadashy/repomix

Directory Structure:
.github/
  workflows/
    codeql.yml
    pre-commit.yaml
ci/
  common/
    test_util.py
    util.sh
  L0_additional_outputs_vllm/
    additional_outputs_test.py
    test.sh
  L0_backend_vllm/
    accuracy_test/
      accuracy_test.py
      test.sh
    enabled_stream/
      enabled_stream_test.py
      test.sh
    metrics_test/
      test.sh
      vllm_metrics_test.py
    request_cancellation/
      request_cancellation_test.py
      test.sh
    vllm_backend/
      ensemble_config.pbtxt
      test.sh
      vllm_backend_test.py
    test.sh
  L0_check_health_vllm/
    check_health_test.py
    test.sh
  L0_multi_gpu_vllm/
    multi_lora/
      download.py
      multi_lora_test.py
      test.sh
    vllm_backend/
      test.sh
      vllm_multi_gpu_test.py
    test.sh
docs/
  additional_outputs.md
  health_check.md
  llama_multi_lora_tutorial.md
samples/
  model_repository/
    vllm_model/
      1/
        model.json
      config.pbtxt
  client.py
  prompts.txt
src/
  utils/
    metrics.py
  model.py
.gitignore
.pre-commit-config.yaml
LICENSE
pyproject.toml
README.md

Files

File: .github/workflows/codeql.yml
Copyright 2023, NVIDIA CORPORATION & AFFILIATES.
Chunk 7: | |
All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
Chunk 8: | |
Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |
Chunk 9: | |
name CodeQL on pull_request jobs analyze name Analyze runson ubuntulatest permissions actions read contents read securityevents write strategy failfast false matrix language python CodeQL supports cpp, csharp, go, java, javascript, python, ruby Learn more about CodeQL language support at httpsaka.mscodeqldocslanguagesupport steps name Checkout repository uses actionscheckoutv3 Initializes the CodeQL tools for scanning. name Initialize CodeQL uses githubcodeqlactioninitv2 with languages matrix.language If you wish to specify custom queries, you can do so here or in a config file. By default, queries listed here will override any specified in a config file. | |
Chunk 10: | |
Prefix the list here with to use these queries and those in the config file. Details on CodeQLs query packs refer to httpsdocs.github.comencodesecuritycodescanningautomaticallyscanningyourcodeforvulnerabilitiesanderrorsconfiguringcodescanningusingqueriesinqlpacks queries securityandquality Autobuild attempts to build any compiled languages CC, C, Go, or Java. If this step fails, then you should remove it and run the build manually see below name Autobuild uses githubcodeqlactionautobuildv2 Commandline programs to run using the OS shell. | |
Chunk 11: | |
See httpsdocs.github.comenactionsusingworkflowsworkflowsyntaxforgithubactionsjobsjob_idstepsrun If the Autobuild fails above, remove it and uncomment the following three lines. modify them or add more to build your code if your project, please refer to the EXAMPLE below for guidance. run echo Run, Build Application using script .location_of_script_within_repobuildscript.sh name Perform CodeQL Analysis uses githubcodeqlactionanalyzev2 with category languagematrix.language File .githubworkflowsprecommit.yaml Copyright 2023, NVIDIA CORPORATION AFFILIATES. | |
Chunk 12: | |
All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
Chunk 13: | |
Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |
Chunk 14: | |
name precommit on pull_request jobs precommit runson ubuntu22.04 steps uses actionscheckoutv3 uses actionssetuppythonv3 uses precommitactionv3.0.0 File cicommontest_util.py !usrbinenv python3 Copyright 20182023, NVIDIA CORPORATION AFFILIATES. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. | |
Chunk 15: | |
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. | |
Chunk 16: | |
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import json import queue import unittest import numpy as np import tritonclient.grpc as grpcclient class TestResultCollectorunittest.TestCase TestResultCollector stores test result and prints it to stdout. In order to use this class, unit tests must inherit this class. | |
Chunk 17: | |
Use check_test_results bash function from commonutil.sh to verify the expected number of tests produced by this class classmethod def setResultcls, total, errors, failures cls.total, cls.errors, cls.failures total, errors, failures classmethod def tearDownClasscls this method is called when all the unit tests in a class are finished. json_res total cls.total, errors cls.errors, failures cls.failures with opentest_results.txt, w as f f.writejson.dumpsjson_res def runself, resultNone result argument stores the accumulative test results test_result super.runresult total test_result.testsRun errors lentest_result.errors failures lentest_result.failures self.setResulttotal, errors, failures class AsyncTestResultCollectorunittest.IsolatedAsyncioTestCase TestResultCollector stores test result and prints it to stdout. In order to use this class, unit tests must inherit this class. | |
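The flattened listing above loses indentation, so here is a cleaned-up sketch of the result-collection pattern that ci/common/test_util.py implements; the test_results.txt it writes is what check_test_results in ci/common/util.sh later inspects:

import json
import unittest


class TestResultCollector(unittest.TestCase):
    # Accumulates results across the class and dumps them to test_results.txt.
    @classmethod
    def setResult(cls, total, errors, failures):
        cls.total, cls.errors, cls.failures = total, errors, failures

    @classmethod
    def tearDownClass(cls):
        # Called once all unit tests in the class have finished.
        json_res = {"total": cls.total, "errors": cls.errors, "failures": cls.failures}
        with open("test_results.txt", "w") as f:
            f.write(json.dumps(json_res))

    def run(self, result=None):
        # The result argument stores the accumulated test results.
        test_result = super().run(result)
        self.setResult(
            test_result.testsRun, len(test_result.errors), len(test_result.failures)
        )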
Chunk 18: | |
Use check_test_results bash function from commonutil.sh to verify the expected number of tests produced by this class classmethod def setResultcls, total, errors, failures cls.total, cls.errors, cls.failures total, errors, failures classmethod def tearDownClasscls this method is called when all the unit tests in a class are finished. json_res total cls.total, errors cls.errors, failures cls.failures with opentest_results.txt, w as f f.writejson.dumpsjson_res def runself, resultNone result argument stores the accumulative test results test_result super.runresult total test_result.testsRun errors lentest_result.errors failures lentest_result.failures self.setResulttotal, errors, failures def create_vllm_request prompt, request_id, stream, sampling_parameters, model_name, send_parameters_as_tensorTrue, exclude_input_in_outputNone, inputs inputs.appendgrpcclient.InferInputtext_input, 1, BYTES inputs1.set_data_from_numpynp.arrayprompt.encodeutf8, dtypenp.object_ inputs.appendgrpcclient.InferInputstream, 1, BOOL inputs1.set_data_from_numpynp.arraystream, dtypebool if send_parameters_as_tensor sampling_parameters_data np.array json.dumpssampling_parameters.encodeutf8, dtypenp.object_ inputs.appendgrpcclient.InferInputsampling_parameters, 1, BYTES inputs1.set_data_from_numpysampling_parameters_data if exclude_input_in_output is not None inputs.appendgrpcclient.InferInputexclude_input_in_output, 1, BOOL inputs1.set_data_from_numpynp.arrayexclude_input_in_output, dtypebool outputs grpcclient.InferRequestedOutputtext_output return model_name model_name, inputs inputs, outputs outputs, request_id strrequest_id, parameters sampling_parameters, class UserData def __init__self self._completed_requests queue.Queue def callbackuser_data, result, error if error user_data._completed_requests.puterror else user_data._completed_requests.putresult File cicommonutil.sh !binbash Copyright 20232024, NVIDIA CORPORATION AFFILIATES. All rights reserved. | |
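Likewise, a readable sketch of the create_vllm_request helper from the same file, which the other test suites reuse to build gRPC inputs; indentation and punctuation are inferred from the flattened text:

import json
import numpy as np
import tritonclient.grpc as grpcclient


def create_vllm_request(prompt, request_id, stream, sampling_parameters, model_name,
                        send_parameters_as_tensor=True, exclude_input_in_output=None):
    # Build the input tensors expected by the vLLM backend.
    inputs = []

    inputs.append(grpcclient.InferInput("text_input", [1], "BYTES"))
    inputs[-1].set_data_from_numpy(np.array([prompt.encode("utf-8")], dtype=np.object_))

    inputs.append(grpcclient.InferInput("stream", [1], "BOOL"))
    inputs[-1].set_data_from_numpy(np.array([stream], dtype=bool))

    if send_parameters_as_tensor:
        data = np.array([json.dumps(sampling_parameters).encode("utf-8")], dtype=np.object_)
        inputs.append(grpcclient.InferInput("sampling_parameters", [1], "BYTES"))
        inputs[-1].set_data_from_numpy(data)

    if exclude_input_in_output is not None:
        inputs.append(grpcclient.InferInput("exclude_input_in_output", [1], "BOOL"))
        inputs[-1].set_data_from_numpy(np.array([exclude_input_in_output], dtype=bool))

    outputs = [grpcclient.InferRequestedOutput("text_output")]
    return {
        "model_name": model_name,
        "inputs": inputs,
        "outputs": outputs,
        "request_id": str(request_id),
        "parameters": sampling_parameters,
    }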
Chunk 19: | |
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. | |
Chunk 20: | |
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. SERVERSERVERopttritonserverbintritonserver SERVER_IPADDRTRITONSERVER_IPADDRlocalhost SERVER_LOGSERVER_LOG.server.log SERVER_TIMEOUTSERVER_TIMEOUT120 SERVER_LD_PRELOADSERVER_LD_PRELOAD Run inference server. | |
Chunk 21: | |
Return once servers health endpoint shows ready or timeout expires. Sets SERVER_PID to pid of SERVER, or 0 if error including expired timeout function run_server SERVER_PID0 if z SERVER then echo SERVER must be defined return fi if ! f SERVER then echo SERVER does not exist return fi if z SERVER_LD_PRELOAD then echo Running SERVER SERVER_ARGS else echo Running LD_PRELOADSERVER_LD_PRELOAD SERVER SERVER_ARGS fi LD_PRELOADSERVER_LD_PRELOADLD_PRELOAD SERVER SERVER_ARGS SERVER_LOG 21 SERVER_PID! | |
Chunk 22: | |
wait_for_server_ready SERVER_PID SERVER_TIMEOUT if WAIT_RET ! 0 then Get further debug information about server startup failure gdb_helper true Cleanup kill SERVER_PID devnull 21 true SERVER_PID0 fi Wait until server health endpoint shows ready. Sets WAIT_RET to 0 on success, 1 on failure function wait_for_server_ready local spid1 shift local wait_time_secs130 shift WAIT_RET0 local wait_secswait_time_secs until test wait_secs eq 0 do if ! | |
Chunk 23: | |
kill 0 spid devnull 21 then echo Server not running. WAIT_RET1 return fi sleep 1 set e codecurl s w http_code SERVER_IPADDR8000v2healthready set e if code 200 then return fi wait_secs done echo Timeout wait_time_secs secs. Server not ready. | |
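The run_server and wait_for_server_ready helpers above poll Triton's HTTP readiness endpoint until it returns 200 or the timeout expires. A minimal Python equivalent of that polling loop, assuming the localhost:8000 address used in the script:

import time
import requests


def wait_for_server_ready(timeout_secs=120, url="http://localhost:8000/v2/health/ready"):
    # Poll Triton's readiness endpoint, mirroring wait_for_server_ready in ci/common/util.sh.
    deadline = time.time() + timeout_secs
    while time.time() < deadline:
        try:
            if requests.get(url).status_code == 200:
                return True
        except requests.ConnectionError:
            pass  # server not accepting connections yet
        time.sleep(1)
    return False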
Chunk 24: | |
WAIT_RET1 Check Python unittest results. function check_test_results local log_file1 local expected_num_tests2 if z expected_num_tests then echo expected number of tests must be defined return 1 fi num_failurescat log_file grep E .total.errors.failures. tail n 1 jq .failures num_testscat log_file grep E .total.errors.failures. | |
Chunk 25: | |
tail n 1 jq .total num_errorscat log_file grep E .total.errors.failures. tail n 1 jq .errors Number regular expression re09 if ? ne 0 ! | |
Chunk 26: | |
num_failures re ! num_tests re ! num_errors re then cat log_file echo e nn Test Failed unable to parse test resultsn log_file return 1 fi if num_errors ! | |
Chunk 27: | |
0 num_failures ! 0 num_tests ne expected_num_tests then cat log_file echo e nn Test Failed Expected expected_num_tests tests, num_tests tests executed, num_errors tests had error, and num_failures tests failed. n log_file return 1 fi return 0 function collect_artifacts_from_subdir cp .log core .. | |
Chunk 28: | |
true File ciL0_additional_outputs_vllmadditional_outputs_test.py Copyright 2024, NVIDIA CORPORATION AFFILIATES. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. | |
Chunk 29: | |
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. | |
Chunk 30: | |
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import json import numpy as np import pytest import tritonclient.grpc as grpcclient class TestAdditionalOutputs _grpc_url localhost8001 _model_name vllm_opt _sampling_parameters temperature 0, top_p 1 _prompt In this example, def _get_sampling_parametersself, logprobsNone sampling_parameters self._sampling_parameters.copy if logprobs is not None sampling_parameterslogprobs logprobs return sampling_parameters def _get_inputs self, prompt, streamTrue, sampling_parametersNone, return_finish_reasonNone, return_cumulative_logprobNone, return_logprobsNone, return_num_input_tokensNone, return_num_output_tokensNone, inputs inputs.appendgrpcclient.InferInputtext_input, 1, BYTES inputs1.set_data_from_numpy np.arrayprompt.encodeutf8, dtypenp.object_ inputs.appendgrpcclient.InferInputstream, 1, BOOL inputs1.set_data_from_numpynp.arraystream, dtypebool if sampling_parameters is not None inputs.appendgrpcclient.InferInputsampling_parameters, 1, BYTES inputs1.set_data_from_numpy np.array json.dumpssampling_parameters.encodeutf8, dtypenp.object_ if return_finish_reason is not None inputs.appendgrpcclient.InferInputreturn_finish_reason, 1, BOOL inputs1.set_data_from_numpynp.arrayreturn_finish_reason, dtypebool if return_cumulative_logprob is not None inputs.append grpcclient.InferInputreturn_cumulative_logprob, 1, BOOL inputs1.set_data_from_numpy np.arrayreturn_cumulative_logprob, dtypebool if return_logprobs is not None inputs.appendgrpcclient.InferInputreturn_logprobs, 1, BOOL inputs1.set_data_from_numpynp.arrayreturn_logprobs, dtypebool if return_num_input_tokens is not None inputs.appendgrpcclient.InferInputreturn_num_input_tokens, 1, BOOL inputs1.set_data_from_numpy np.arrayreturn_num_input_tokens, dtypebool if return_num_output_tokens is not None inputs.append grpcclient.InferInputreturn_num_output_tokens, 1, BOOL inputs1.set_data_from_numpy np.arrayreturn_num_output_tokens, dtypebool return inputs def _callbackself, result, error self._responses.appendresult result, error error def _llm_inferself, inputs, sampling_parameters self._responses with grpcclient.InferenceServerClientself._grpc_url as client client.start_streamself._callback client.async_stream_infer self._model_name, inputsinputs, parameterssampling_parameters client.stop_stream assert lenself._responses 0 def _assert_text_output_validself text_output for response in self._responses result, error responseresult, responseerror assert error is None text_output result.as_numpynametext_output0.decodeutf8 assert lentext_output 0, output is empty assert text_output.count 4, output is not a sentence def _assert_finish_reasonself, return_finish_reason for i in rangelenself._responses result, error self._responsesiresult, self._responsesierror assert error is None finish_reason_np result.as_numpynamefinish_reason if return_finish_reason is None or return_finish_reason False assert finish_reason_np is None continue finish_reason finish_reason_np0.decodeutf8 if i lenself._responses 1 assert finish_reason None else assert finish_reason length def 
_assert_cumulative_logprobself, return_cumulative_logprob prev_cumulative_logprob 0.0 for response in self._responses result, error responseresult, responseerror assert error is None cumulative_logprob_np result.as_numpynamecumulative_logprob if return_cumulative_logprob is None or return_cumulative_logprob False assert cumulative_logprob_np is None continue cumulative_logprob cumulative_logprob_np0.astypefloat assert cumulative_logprob ! prev_cumulative_logprob prev_cumulative_logprob cumulative_logprob def _assert_logprobs self, stream, sampling_parameters, return_logprobs, return_num_output_tokens for response in self._responses result, error responseresult, responseerror assert error is None logprobs_np result.as_numpynamelogprobs if return_logprobs is None or return_logprobs False assert logprobs_np is None continue logprobs json.loadslogprobs_np0.decodeutf8 if logprobs not in sampling_parameters assert logprobs is None continue assert isinstancelogprobs, list assert lenlogprobs 1 if return_num_output_tokens True num_output_tokens result.as_numpynamenum_output_tokens0.astype int assert lenlogprobs num_output_tokens text_output_logprobs for logprobs_d in logprobs assert isinstancelogprobs_d, dict assert lenlogprobs_d 1 assert lenlogprobs_d sampling_parameterslogprobs 1 rank_one_found False for token_id, logprob_d in logprobs_d.items assert isinstancetoken_id, str assert lenlogprob_d 3 assert isinstancelogprob_dlogprob, float assert isinstancelogprob_drank, int assert isinstancelogprob_ddecoded_token, str if logprob_drank 1 assert not rank_one_found rank_one_found True text_output_logprobs logprob_ddecoded_token assert rank_one_found text_output result.as_numpynametext_output0.decodeutf8 if not stream given exclude_input_in_output is not set, prepend_input is True if not streaming and False if streaming text_output_logprobs self._prompt text_output_logprobs assert text_output_logprobs text_output def _assert_num_input_tokensself, return_num_input_tokens for response in self._responses result, error responseresult, responseerror assert error is None num_input_tokens_np result.as_numpynamenum_input_tokens if return_num_input_tokens is None or return_num_input_tokens False assert num_input_tokens_np is None continue num_input_tokens num_input_tokens_np.astypeint assert num_input_tokens 0 assert num_input_tokens lenself._prompt def _assert_num_output_tokensself, return_num_output_tokens for response in self._responses result, error responseresult, responseerror assert error is None num_output_tokens_np result.as_numpynamenum_output_tokens if return_num_output_tokens is None or return_num_output_tokens False assert num_output_tokens_np is None continue num_output_tokens num_output_tokens_np0.astypeint assert num_output_tokens 0 pytest.mark.parametrizestream, True, False pytest.mark.parametrizereturn_finish_reason, None, True, False pytest.mark.parametrizereturn_cumulative_logprob, None, True, False pytest.mark.parametrizelogprobs, None, 0, 2 pytest.mark.parametrizereturn_logprobs, None, True, False pytest.mark.parametrizereturn_num_input_tokens, None, True, False pytest.mark.parametrizereturn_num_output_tokens, None, True, False def test_additional_outputs self, stream, return_finish_reason, return_cumulative_logprob, logprobs, return_logprobs, return_num_input_tokens, return_num_output_tokens, sampling_parameters self._get_sampling_parameterslogprobslogprobs inputs self._get_inputs self._prompt, streamstream, sampling_parameterssampling_parameters, return_finish_reasonreturn_finish_reason, 
return_cumulative_logprobreturn_cumulative_logprob, return_logprobsreturn_logprobs, return_num_input_tokensreturn_num_input_tokens, return_num_output_tokensreturn_num_output_tokens, self._llm_inferinputs, sampling_parameters self._assert_text_output_valid self._assert_finish_reasonreturn_finish_reason self._assert_cumulative_logprobreturn_cumulative_logprob self._assert_logprobs stream, sampling_parameters, return_logprobs, return_num_output_tokens self._assert_num_input_tokensreturn_num_input_tokens self._assert_num_output_tokensreturn_num_output_tokens File ciL0_additional_outputs_vllmtest.sh !binbash Copyright 2024, NVIDIA CORPORATION AFFILIATES. | |
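In additional_outputs_test.py above, each optional output (finish_reason, cumulative_logprob, logprobs, num_input_tokens, num_output_tokens) is requested by sending an extra BOOL input tensor with the matching return_* name. A trimmed sketch showing two of those flags, following the same pattern as the test:

import numpy as np
import tritonclient.grpc as grpcclient


def build_additional_output_inputs(return_finish_reason=None, return_num_output_tokens=None):
    # Each optional output is enabled by a BOOL input with the matching name;
    # leaving a flag as None means the corresponding output tensor is not returned.
    inputs = []
    if return_finish_reason is not None:
        inputs.append(grpcclient.InferInput("return_finish_reason", [1], "BOOL"))
        inputs[-1].set_data_from_numpy(np.array([return_finish_reason], dtype=bool))
    if return_num_output_tokens is not None:
        inputs.append(grpcclient.InferInput("return_num_output_tokens", [1], "BOOL"))
        inputs[-1].set_data_from_numpy(np.array([return_num_output_tokens], dtype=bool))
    return inputs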
Chunk 31: | |
All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
Chunk 32: | |
Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |
Chunk 33: | |
export CUDA_VISIBLE_DEVICES0 source ..commonutil.sh pip3 install pytest8.1.1 pip3 install tritonclientgrpc Prepare Model rm rf models vllm_baseline_output.pkl mkdir p models SAMPLE_MODELS_REPO....samplesmodel_repository cp r SAMPLE_MODELS_REPOvllm_model modelsvllm_opt sed i sgpu_memory_utilization 0.5gpu_memory_utilization 0.3 modelsvllm_opt1model.json RET0 Test SERVER_LOGadditional_outputs_test.server.log SERVER_ARGSmodelrepositorymodels run_server if SERVER_PID 0 then echo e nn Failed to start SERVERn cat SERVER_LOG exit 1 fi set e python3 m pytest junitxmltest_additional_outputs.xml s v additional_outputs_test.py if ? ne 0 then echo e nn additional_outputs_test FAILED. n RET1 fi set e kill SERVER_PID wait SERVER_PID if RET eq 0 then echo e nn Test Passedn else echo e nn Test FAILEDn fi exit RET File ciL0_backend_vllmaccuracy_testaccuracy_test.py Copyright 20232024, NVIDIA CORPORATION AFFILIATES. | |
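The test script above patches samples/model_repository/vllm_model/1/model.json with sed, lowering gpu_memory_utilization from 0.5 to 0.3. That file is a plain JSON dictionary of vLLM engine arguments; a minimal sketch, noting that the exact key set in the sample may differ and that facebook/opt-125m is the model used elsewhere in this CI:

{
    "model": "facebook/opt-125m",
    "disable_log_stats": false,
    "gpu_memory_utilization": 0.5
}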
Chunk 34: | |
All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
Chunk 35: | |
Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |
Chunk 36: | |
import argparse import asyncio import pickle import sys import unittest from functools import partial import tritonclient.grpc as grpcclient from tritonclient.utils import from vllm import SamplingParams from vllm.engine.arg_utils import AsyncEngineArgs from vllm.engine.async_llm_engine import AsyncLLMEngine from vllm.utils import random_uuid sys.path.append....common from test_util import TestResultCollector, UserData, callback, create_vllm_request VLLM_ENGINE_CONFIG model facebookopt125m, gpu_memory_utilization 0.3, PROMPTS The most dangerous animal is, The capital of France is, The future of AI is, SAMPLING_PARAMETERS temperature 0, top_p 1 async def generate_python_vllm_outputprompt, llm_engine request_id random_uuid sampling_params SamplingParamsSAMPLING_PARAMETERS python_vllm_output None last_output None async for vllm_output in llm_engine.generateprompt, sampling_params, request_id last_output vllm_output if last_output python_vllm_output prompt output.text.encodeutf8 for output in last_output.outputs return python_vllm_output def prepare_vllm_baseline_outputs Helper function that starts async vLLM engine and generates output for each prompt in PROMPTS. Saves resulted baselines in vllm_baseline_output.pkl for further use. llm_engine AsyncLLMEngine.from_engine_argsAsyncEngineArgsVLLM_ENGINE_CONFIG python_vllm_output for i in rangelenPROMPTS python_vllm_output.extend asyncio.rungenerate_python_vllm_outputPROMPTSi, llm_engine with openvllm_baseline_output.pkl, wb as f pickle.dumppython_vllm_output, f return class VLLMTritonAccuracyTestTestResultCollector def setUpself self.triton_client grpcclient.InferenceServerClienturllocalhost8001 self.vllm_model_name vllm_opt self.python_vllm_output with openvllm_baseline_output.pkl, rb as f self.python_vllm_output pickle.loadf self.assertNotEqual self.python_vllm_output, , Loaded baseline outputs list should not be empty, self.assertIsNotNone self.python_vllm_output, Loaded baseline outputs list should not be None self.assertEqual lenself.python_vllm_output, lenPROMPTS, Unexpected number of baseline outputs loaded, expected , but got .format lenPROMPTS, lenself.python_vllm_output , def test_vllm_modelself user_data UserData stream False triton_vllm_output self.triton_client.start_streamcallbackpartialcallback, user_data for i in rangelenPROMPTS request_data create_vllm_request PROMPTSi, i, stream, SAMPLING_PARAMETERS, self.vllm_model_name self.triton_client.async_stream_infer model_nameself.vllm_model_name, request_idrequest_datarequest_id, inputsrequest_datainputs, outputsrequest_dataoutputs, parametersSAMPLING_PARAMETERS, for i in rangelenPROMPTS result user_data._completed_requests.get self.assertIsNottyperesult, InferenceServerException, strresult output result.as_numpytext_output self.assertIsNotNoneoutput, text_output should not be None triton_vllm_output.extendoutput self.triton_client.stop_stream self.assertEqualself.python_vllm_output.sort, triton_vllm_output.sort def tearDownself self.triton_client.close if __name__ __main__ parser argparse.ArgumentParser parser.add_argument generatebaseline, actionstore_true, requiredFalse, defaultFalse, helpGenerates baseline output for accuracy tests, FLAGS parser.parse_args if FLAGS.generate_baseline prepare_vllm_baseline_outputs exit0 unittest.main File ciL0_backend_vllmaccuracy_testtest.sh !binbash Copyright 20232024, NVIDIA CORPORATION AFFILIATES. | |
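accuracy_test.py above produces its baseline by running vLLM's AsyncLLMEngine directly and pickling the outputs; a condensed sketch of that flow, reconstructed from the flattened listing:

import asyncio
import pickle

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.utils import random_uuid

VLLM_ENGINE_CONFIG = {"model": "facebook/opt-125m", "gpu_memory_utilization": 0.3}
PROMPTS = ["The most dangerous animal is", "The capital of France is", "The future of AI is"]
SAMPLING_PARAMETERS = {"temperature": 0, "top_p": 1}


async def generate_python_vllm_output(prompt, llm_engine):
    # Stream generations for one prompt and keep only the final output.
    request_id = random_uuid()
    sampling_params = SamplingParams(**SAMPLING_PARAMETERS)
    last_output = None
    async for vllm_output in llm_engine.generate(prompt, sampling_params, request_id):
        last_output = vllm_output
    if last_output:
        return [(prompt + output.text).encode("utf-8") for output in last_output.outputs]
    return None


def prepare_vllm_baseline_outputs():
    # Saves the baselines in vllm_baseline_output.pkl for the Triton-side comparison.
    llm_engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(**VLLM_ENGINE_CONFIG))
    python_vllm_output = []
    for prompt in PROMPTS:
        python_vllm_output.extend(asyncio.run(generate_python_vllm_output(prompt, llm_engine)))
    with open("vllm_baseline_output.pkl", "wb") as f:
        pickle.dump(python_vllm_output, f)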
Chunk 37: | |
All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
Chunk 38: | |
Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |
Chunk 39: | |
source ....commonutil.sh TRITON_DIRTRITON_DIRopttritonserver SERVERTRITON_DIRbintritonserver BACKEND_DIRTRITON_DIRbackends SERVER_ARGSmodelrepositorypwdmodels backenddirectoryBACKEND_DIR logverbose1 SERVER_LOG.accuracy_test_server.log CLIENT_LOG.accuracy_test_client.log TEST_RESULT_FILEtest_results.txt CLIENT_PY.accuracy_test.py SAMPLE_MODELS_REPO......samplesmodel_repository VLLM_ENGINE_LOGvllm_engine.log EXPECTED_NUM_TESTS1 rm rf models mkdir p models cp r SAMPLE_MODELS_REPOvllm_model modelsvllm_opt sed i sgpu_memory_utilization 0.5gpu_memory_utilization 0.3 modelsvllm_opt1model.json f vllm_baseline_output.pkl rm vllm_baseline_output.pkl RET0 set e Need to generate baseline first, since running 2 vLLM engines causes memory issues httpsgithub.comvllmprojectvllmissues2248 python3 CLIENT_PY generatebaseline VLLM_ENGINE_LOG 21 BASELINE_PID! wait BASELINE_PID set e run_server if SERVER_PID 0 then cat SERVER_LOG echo e nn Failed to start SERVERn exit 1 fi set e python3 CLIENT_PY CLIENT_LOG 21 if ? ne 0 then cat CLIENT_LOG echo e nn Running CLIENT_PY FAILED. | |
Chunk 40: | |
n RET1 else check_test_results TEST_RESULT_FILE EXPECTED_NUM_TESTS if ? ne 0 then cat CLIENT_LOG echo e nn Test Result Verification FAILED.n RET1 fi fi set e kill SERVER_PID wait SERVER_PID rm rf models if RET eq 1 then cat CLIENT_LOG cat SERVER_LOG echo e nn Accuracy test FAILED. n else echo e nn Accuracy test PASSED. | |
Chunk 41: | |
n fi collect_artifacts_from_subdir exit RET File ciL0_backend_vllmenabled_streamenabled_stream_test.py Copyright 2023, NVIDIA CORPORATION AFFILIATES. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. | |
Chunk 42: | |
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. | |
Chunk 43: | |
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import sys import unittest import tritonclient.grpc.aio as grpcclient from tritonclient.utils import sys.path.append....common from test_util import AsyncTestResultCollector, create_vllm_request PROMPTS The most dangerous animal is SAMPLING_PARAMETERS temperature 0, top_p 1 class VLLMTritonStreamTestAsyncTestResultCollector async def _test_vllm_model self, promptsPROMPTS, sampling_parametersSAMPLING_PARAMETERS, streamTrue, exclude_input_in_outputNone, expected_outputNone, expect_errorFalse, async with grpcclient.InferenceServerClient urllocalhost8001 as triton_client model_name vllm_opt async def request_iterator for i, prompt in enumerateprompts yield create_vllm_request prompt, i, stream, sampling_parameters, model_name, exclude_input_in_outputexclude_input_in_output, response_iterator triton_client.stream_infer inputs_iteratorrequest_iterator final_response async for response in response_iterator result, error response if expect_error self.assertIsInstanceerror, InferenceServerException self.assertEqual error.message, Error generating stream When streaming, exclude_input_in_output False is not allowed., error, return self.assertIsNoneerror, error self.assertIsNotNoneresult, result output result.as_numpytext_output self.assertIsNotNoneoutput, text_output should not be None final_response.appendstroutput0, encodingutf8 if expected_output is not None self.assertEqual final_response, expected_output, Expected to receive the following response , but received ..format expected_output, final_response , async def test_vllm_model_enabled_streamself Verifying that request with multiple prompts runs successfully. prompts The most dangerous animal is, The future of AI is, await self._test_vllm_modelpromptsprompts async def test_vllm_model_enabled_stream_exclude_input_in_output_defaultself Verifying that streaming request returns only generated diffs, which is default behaviour for streamTrue. | |
Chunk 44: | |
expected_output the, one, that, is, most, likely, to, be, killed, by, a, car, ., n, I, m, await self._test_vllm_modelexpected_outputexpected_output async def test_vllm_model_enabled_stream_exclude_input_in_output_falseself Verifying that streaming request returns only generated diffs even if exclude_input_in_output is set to False explicitly. expected_output Error generating stream When streaming, exclude_input_in_output False is not allowed. await self._test_vllm_model exclude_input_in_outputFalse, expected_outputexpected_output, expect_errorTrue, if __name__ __main__ unittest.main File ciL0_backend_vllmenabled_streamtest.sh !binbash Copyright 2023, NVIDIA CORPORATION AFFILIATES. | |
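enabled_stream_test.py above drives the streaming path through tritonclient.grpc.aio. A minimal sketch of that call pattern, reconstructed from the listing; create_vllm_request is the helper from ci/common/test_util.py shown earlier:

import asyncio
import tritonclient.grpc.aio as grpcclient

from test_util import create_vllm_request


async def stream_prompts(prompts, sampling_parameters, model_name="vllm_opt"):
    outputs = []
    async with grpcclient.InferenceServerClient(url="localhost:8001") as client:

        async def request_iterator():
            for i, prompt in enumerate(prompts):
                yield create_vllm_request(prompt, i, True, sampling_parameters, model_name)

        # stream_infer yields (result, error) tuples as responses arrive.
        async for result, error in client.stream_infer(inputs_iterator=request_iterator()):
            assert error is None, error
            outputs.append(result.as_numpy("text_output")[0].decode("utf-8"))
    return outputs

With stream=True and the default exclude_input_in_output, each response carries only the newly generated tokens, which is exactly what the expected_output list in the test checks.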
Chunk 45: | |
All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
Chunk 46: | |
Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |
Chunk 47: | |
source ....commonutil.sh TRITON_DIRTRITON_DIRopttritonserver SERVERTRITON_DIRbintritonserver BACKEND_DIRTRITON_DIRbackends SERVER_ARGSmodelrepositorypwdmodels backenddirectoryBACKEND_DIR logverbose1 SERVER_LOG.enabled_stream_server.log CLIENT_LOG.enabled_stream_client.log TEST_RESULT_FILEtest_results.txt CLIENT_PY.enabled_stream_test.py SAMPLE_MODELS_REPO......samplesmodel_repository EXPECTED_NUM_TESTS3 rm rf models mkdir p models cp r SAMPLE_MODELS_REPOvllm_model modelsvllm_opt RET0 run_server if SERVER_PID 0 then cat SERVER_LOG echo e nn Failed to start SERVERn exit 1 fi set e python3 CLIENT_PY v CLIENT_LOG 21 if ? ne 0 then cat CLIENT_LOG echo e nn Running CLIENT_PY FAILED. n RET1 else check_test_results TEST_RESULT_FILE EXPECTED_NUM_TESTS if ? | |
Chunk 48: | |
ne 0 then cat CLIENT_LOG echo e nn Test Result Verification FAILED.n RET1 fi fi set e kill SERVER_PID wait SERVER_PID rm rf models if RET eq 1 then cat CLIENT_LOG cat SERVER_LOG echo e nn Enabled Stream test FAILED. n else echo e nn Enabled Stream test PASSED. n fi collect_artifacts_from_subdir exit RET File ciL0_backend_vllmmetrics_testtest.sh !binbash Copyright 2024, NVIDIA CORPORATION AFFILIATES. | |
Chunk 49: | |
All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
Chunk 50: | |
Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |
Chunk 51: | |
source ....commonutil.sh TRITON_DIRTRITON_DIRopttritonserver SERVERTRITON_DIRbintritonserver BACKEND_DIRTRITON_DIRbackends SERVER_ARGSmodelrepositorypwdmodels backenddirectoryBACKEND_DIR modelcontrolmodeexplicit loadmodelvllm_opt logverbose1 SERVER_LOG.vllm_metrics_server.log CLIENT_LOG.vllm_metrics_client.log TEST_RESULT_FILEtest_results.txt CLIENT_PY.vllm_metrics_test.py SAMPLE_MODELS_REPO......samplesmodel_repository EXPECTED_NUM_TESTS1 Helpers function copy_model_repository rm rf models mkdir p models cp r SAMPLE_MODELS_REPOvllm_model modelsvllm_opt vllm_opt model will be loaded on server start and stay loaded throughout unittesting. To ensure that vllms memory profiler will not error out on vllm_load_test load, we reduce gpu_memory_utilization for vllm_opt, so that at least 60 of GPU memory was available for other models. sed i sgpu_memory_utilization 0.5gpu_memory_utilization 0.4 modelsvllm_opt1model.json run_test local TEST_CASE1 run_server if SERVER_PID 0 then cat SERVER_LOG echo e nn Failed to start SERVERn exit 1 fi set e python3 CLIENT_PY TEST_CASE v CLIENT_LOG 21 if ? | |
Chunk 52: | |
ne 0 then cat CLIENT_LOG echo e nn Running CLIENT_PY TEST_CASE FAILED. n RET1 else check_test_results TEST_RESULT_FILE EXPECTED_NUM_TESTS if ? ne 0 then cat CLIENT_LOG echo e nn Test Result Verification FAILED.n RET1 fi fi set e TODO Nongraceful shutdown when metrics are enabled. | |
Chunk 53: | |
kill SERVER_PID wait SERVER_PID RET0 Test disabling vLLM metrics reporting without parameter REPORT_CUSTOM_METRICS in config.pbtxt copy_model_repository run_test VLLMTritonMetricsTest.test_vllm_metrics_disabled Test disabling vLLM metrics reporting with parameter REPORT_CUSTOM_METRICS set to false in config.pbtxt copy_model_repository echo e parameters key REPORT_CUSTOM_METRICS value string_value false modelsvllm_optconfig.pbtxt run_test VLLMTritonMetricsTest.test_vllm_metrics_disabled Test vLLM metrics reporting with parameter REPORT_CUSTOM_METRICS set to true in config.pbtxt copy_model_repository cp SAMPLE_MODELS_REPOvllm_modelconfig.pbtxt modelsvllm_opt echo e parameters key REPORT_CUSTOM_METRICS value string_value true modelsvllm_optconfig.pbtxt run_test VLLMTritonMetricsTest.test_vllm_metrics Test vLLM metrics custom sampling parameters Custom sampling parameters may result in different vLLM output depending on the platform. Therefore, these metrics are tests separately. copy_model_repository cp SAMPLE_MODELS_REPOvllm_modelconfig.pbtxt modelsvllm_opt echo e parameters key REPORT_CUSTOM_METRICS value string_value true modelsvllm_optconfig.pbtxt run_test VLLMTritonMetricsTest.test_custom_sampling_params Test enabling vLLM metrics reporting in config.pbtxt but disabling in model.json copy_model_repository jq . | |
Chunk 54: | |
disable_log_stats true modelsvllm_opt1model.json temp.json mv temp.json modelsvllm_opt1model.json echo e parameters key REPORT_CUSTOM_METRICS value string_value true modelsvllm_optconfig.pbtxt run_test VLLMTritonMetricsTest.test_vllm_metrics_disabled Test enabling vLLM metrics reporting in config.pbtxt while disabling in server option copy_model_repository echo e parameters key REPORT_CUSTOM_METRICS value string_value true modelsvllm_optconfig.pbtxt SERVER_ARGSSERVER_ARGS allowmetricsfalse run_test VLLMTritonMetricsTest.test_vllm_metrics_refused rm rf .models temp.json if RET eq 1 then cat CLIENT_LOG cat SERVER_LOG echo e nn vLLM test FAILED. n else echo e nn vLLM test PASSED. n fi collect_artifacts_from_subdir exit RET File ciL0_backend_vllmmetrics_testvllm_metrics_test.py Copyright 2024, NVIDIA CORPORATION AFFILIATES. | |
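The metrics test.sh above enables or disables custom metrics by appending a parameters block to the model's config.pbtxt. Reformatted from the flattened echo command (quoting inferred), the appended block reads:

parameters: {
  key: "REPORT_CUSTOM_METRICS"
  value: {
    string_value: "true"
  }
}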
Chunk 55: | |
All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
Chunk 56: | |
Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |
Chunk 57: | |
import os import re import sys import unittest from functools import partial import requests import tritonclient.grpc as grpcclient from tritonclient.utils import sys.path.append....common from test_util import TestResultCollector, UserData, callback, create_vllm_request class VLLMTritonMetricsTestTestResultCollector def setUpself self.triton_client grpcclient.InferenceServerClienturllocalhost8001 self.tritonserver_ipaddr os.environ.getTRITONSERVER_IPADDR, localhost self.vllm_model_name vllm_opt self.prompts The most dangerous animal is, The capital of France is, The future of AI is, self.sampling_parameters temperature 0, top_p 1 def parse_vllm_metricsself Store vllm metrics in a dictionary. r requests.getfhttpself.tritonserver_ipaddr8002metrics r.raise_for_status Regular expression to match the pattern pattern rvllm ?.? 09. | |
Chunk 58: | |
vllm_dict Find all matches in the text matches re.findallpattern, r.text, re.MULTILINE for match in matches key, value match vllm_dictkey floatvalue if . in value else intvalue return vllm_dict def vllm_infer self, prompts, sampling_parameters, model_name, Helper function to send async stream infer requests to vLLM. user_data UserData number_of_vllm_reqs lenprompts self.triton_client.start_streamcallbackpartialcallback, user_data for i in rangenumber_of_vllm_reqs request_data create_vllm_request promptsi, i, False, sampling_parameters, model_name, True, self.triton_client.async_stream_infer model_namemodel_name, inputsrequest_datainputs, request_idrequest_datarequest_id, outputsrequest_dataoutputs, parameterssampling_parameters, for _ in rangenumber_of_vllm_reqs result user_data._completed_requests.get if typeresult is InferenceServerException printresult.message self.assertIsNottyperesult, InferenceServerException, strresult output result.as_numpytext_output self.assertIsNotNoneoutput, text_output should not be None self.triton_client.stop_stream def test_vllm_metricsself Test vLLM metrics self.vllm_infer promptsself.prompts, sampling_parametersself.sampling_parameters, model_nameself.vllm_model_name, metrics_dict self.parse_vllm_metrics total_prompts lenself.prompts vllmprompt_tokens_total 2, 133, 144, 2702, 3477, 16 2, 133, 812, 9, 1470, 16 2, 133, 499, 9, 4687, 16 self.assertEqualmetrics_dictvllmprompt_tokens_total, 18 vllmgeneration_tokens_total 5, 65, 14, 16, 144, 533, 7, 28, 848, 30, 10, 512, 4, 50118, 100, 437 5, 812, 9, 5, 1515, 3497, 4, 50118, 50118, 133, 812, 9, 1470, 16, 5, 812 11, 5, 1420, 9, 5, 82, 4, 50118, 50118, 133, 499, 9, 4687, 16, 11, 5 self.assertEqualmetrics_dictvllmgeneration_tokens_total, 48 vllmtime_to_first_token_seconds self.assertEqual metrics_dictvllmtime_to_first_token_seconds_count, total_prompts self.assertGreatermetrics_dictvllmtime_to_first_token_seconds_sum, 0 self.assertEqual metrics_dictvllmtime_to_first_token_seconds_bucket, total_prompts vllmtime_per_output_token_seconds self.assertEqualmetrics_dictvllmtime_per_output_token_seconds_count, 45 self.assertGreatermetrics_dictvllmtime_per_output_token_seconds_sum, 0 self.assertEqualmetrics_dictvllmtime_per_output_token_seconds_bucket, 45 vllme2e_request_latency_seconds self.assertEqual metrics_dictvllme2e_request_latency_seconds_count, total_prompts self.assertGreatermetrics_dictvllme2e_request_latency_seconds_sum, 0 self.assertEqual metrics_dictvllme2e_request_latency_seconds_bucket, total_prompts vllmrequest_prompt_tokens self.assertEqual metrics_dictvllmrequest_prompt_tokens_count, total_prompts self.assertEqualmetrics_dictvllmrequest_prompt_tokens_sum, 18 self.assertEqual metrics_dictvllmrequest_prompt_tokens_bucket, total_prompts vllmrequest_generation_tokens self.assertEqual metrics_dictvllmrequest_generation_tokens_count, total_prompts, self.assertEqualmetrics_dictvllmrequest_generation_tokens_sum, 48 self.assertEqual metrics_dictvllmrequest_generation_tokens_bucket, total_prompts, TODO Revisit this test due to the removal of best_of def test_custom_sampling_paramsself Adding sampling parameters for testing metrics. | |
Chunk 59: | |
Definitions can be found here httpsdocs.vllm.aienlatestdevsampling_params.html n, best_of 2, 4 custom_sampling_parameters self.sampling_parameters.copy Changing temperature because best_of must be 1 when using greedy sampling, i.e. temperature 0. custom_sampling_parameters.update n strn, best_of strbest_of, temperature 1 Test vLLM metrics self.vllm_infer promptsself.prompts, sampling_parameterscustom_sampling_parameters, model_nameself.vllm_model_name, metrics_dict self.parse_vllm_metrics total_prompts lenself.prompts vllmrequest_params_best_of self.assertEqual metrics_dictvllmrequest_params_best_of_count, total_prompts self.assertEqual metrics_dictvllmrequest_params_best_of_sum, best_of total_prompts self.assertEqual metrics_dictvllmrequest_params_best_of_bucket, total_prompts vllmrequest_params_n self.assertEqualmetrics_dictvllmrequest_params_n_count, total_prompts self.assertEqualmetrics_dictvllmrequest_params_n_sum, n total_prompts self.assertEqualmetrics_dictvllmrequest_params_n_bucket, total_prompts def test_vllm_metrics_disabledself Test vLLM metrics self.vllm_infer promptsself.prompts, sampling_parametersself.sampling_parameters, model_nameself.vllm_model_name, metrics_dict self.parse_vllm_metrics No vLLM metric found self.assertEquallenmetrics_dict, 0 def test_vllm_metrics_refusedself Test vLLM metrics self.vllm_infer promptsself.prompts, sampling_parametersself.sampling_parameters, model_nameself.vllm_model_name, with self.assertRaisesrequests.exceptions.ConnectionError self.parse_vllm_metrics def tearDownself self.triton_client.close if __name__ __main__ unittest.main File ciL0_backend_vllmrequest_cancellationrequest_cancellation_test.py Copyright 2023, NVIDIA CORPORATION AFFILIATES. | |
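vllm_metrics_test.py above scrapes Triton's metrics endpoint on port 8002 and keeps only the vllm:* series. A condensed sketch of parse_vllm_metrics; the regular expression here is a simplified stand-in for the garbled pattern in the listing:

import re
import requests


def parse_vllm_metrics(host="localhost", port=8002):
    # Scrape the Prometheus-format metrics Triton exposes and keep only vllm:* series.
    r = requests.get(f"http://{host}:{port}/metrics")
    r.raise_for_status()
    vllm_dict = {}
    pattern = r"^(vllm:[a-zA-Z_]+)(?:\{[^}]*\})? ([0-9.e+-]+)$"
    for key, value in re.findall(pattern, r.text, re.MULTILINE):
        vllm_dict[key] = float(value) if "." in value else int(value)
    return vllm_dict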
Chunk 60: | |
All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
Chunk 61: | |
Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |
Chunk 62: | |
import sys import time import unittest from functools import partial import tritonclient.grpc as grpcclient from tritonclient.utils import sys.path.append....common from test_util import TestResultCollector, UserData, callback, create_vllm_request class VLLMRequestCancelTestTestResultCollector def test_request_cancellationself, send_parameters_as_tensorTrue with grpcclient.InferenceServerClienturllocalhost8001 as triton_client log_file_path .request_cancellation_server.log user_data UserData model_name vllm_opt stream False sampling_parameters temperature 0, top_p 1, max_tokens 1500, prompt fWrite an original and creative poem of at least 200 words. triton_client.start_streamcallbackpartialcallback, user_data request_data create_vllm_request prompt, 1, stream, sampling_parameters, model_name, send_parameters_as_tensor, triton_client.async_stream_infer model_namemodel_name, request_idrequest_datarequest_id, inputsrequest_datainputs, outputsrequest_dataoutputs, parameterssampling_parameters, time.sleep1 triton_client.stop_streamcancel_requestsTrue time.sleep1 self.assertFalseuser_data._completed_requests.empty result user_data._completed_requests.get self.assertIsInstanceresult, InferenceServerException self.assertEqualresult.status, StatusCode.CANCELLED self.assertTrueuser_data._completed_requests.empty with openlog_file_path, moder as log_file log_text log_file.read self.assertInvllm Cancelling the request, log_text self.assertInvllm Successfully cancelled the request, log_text if __name__ __main__ unittest.main File ciL0_backend_vllmrequest_cancellationtest.sh !binbash Copyright 2023, NVIDIA CORPORATION AFFILIATES. All rights reserved. | |
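request_cancellation_test.py above starts one long generation (max_tokens 1500) and then cancels it mid-flight. The essential client-side steps, reconstructed from the listing; UserData, callback and create_vllm_request come from ci/common/test_util.py shown earlier:

import time
from functools import partial

import tritonclient.grpc as grpcclient
from tritonclient.utils import InferenceServerException

from test_util import UserData, callback, create_vllm_request


def cancel_long_request(model_name="vllm_opt"):
    user_data = UserData()
    sampling_parameters = {"temperature": "0", "top_p": "1", "max_tokens": "1500"}
    prompt = "Write an original and creative poem of at least 200 words."

    with grpcclient.InferenceServerClient(url="localhost:8001") as client:
        client.start_stream(callback=partial(callback, user_data))
        request = create_vllm_request(prompt, 1, False, sampling_parameters, model_name)
        client.async_stream_infer(
            model_name=model_name,
            request_id=request["request_id"],
            inputs=request["inputs"],
            outputs=request["outputs"],
            parameters=sampling_parameters,
        )
        time.sleep(1)
        # Closing the stream with cancel_requests=True cancels the in-flight request;
        # the callback then receives an InferenceServerException (StatusCode.CANCELLED).
        client.stop_stream(cancel_requests=True)

    result = user_data._completed_requests.get()
    assert isinstance(result, InferenceServerException)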
Chunk 63: | |
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. | |
Chunk 64: | |
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. source ....commonutil.sh TRITON_DIRTRITON_DIRopttritonserver SERVERTRITON_DIRbintritonserver BACKEND_DIRTRITON_DIRbackends SERVER_ARGSmodelrepositorypwdmodels backenddirectoryBACKEND_DIR logverbose1 SERVER_LOG.request_cancellation_server.log CLIENT_LOG.request_cancellation_client.log TEST_RESULT_FILEtest_results.txt CLIENT_PY.request_cancellation_test.py SAMPLE_MODELS_REPO......samplesmodel_repository EXPECTED_NUM_TESTS1 rm rf models mkdir p models cp r SAMPLE_MODELS_REPOvllm_model modelsvllm_opt RET0 run_server if SERVER_PID 0 then cat SERVER_LOG echo e nn Failed to start SERVERn exit 1 fi set e python3 CLIENT_PY v CLIENT_LOG 21 if ? | |
Chunk 65: | |
ne 0 then cat CLIENT_LOG echo e nn Running CLIENT_PY FAILED. n RET1 else check_test_results TEST_RESULT_FILE EXPECTED_NUM_TESTS if ? ne 0 then cat CLIENT_LOG echo e nn Test Result Verification FAILED.n RET1 fi fi set e kill SERVER_PID wait SERVER_PID rm rf models if RET eq 1 then cat CLIENT_LOG cat SERVER_LOG echo e nn Request Cancellation test FAILED. | |
Chunk 66: | |
n else echo e nn Request Cancellation test PASSED. n fi collect_artifacts_from_subdir exit RET File ciL0_backend_vllmvllm_backendensemble_config.pbtxt Copyright 2024, NVIDIA CORPORATION AFFILIATES. All rights reserved. | |
Chunk 67: | |
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. | |
Chunk 68: | |
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. name ensemble_model platform ensemble max_batch_size 1 input name text_input data_type TYPE_STRING dims 1 output name text_output data_type TYPE_STRING dims 1 ensemble_scheduling step model_name vllm_opt model_version 1 input_map key text_input value text_input output_map key text_output value text_output File ciL0_backend_vllmvllm_backendtest.sh !binbash Copyright 20232024, NVIDIA CORPORATION AFFILIATES. | |
Chunk 69: | |
All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
Chunk 70: | |
Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |
Chunk 71: | |
source ....commonutil.sh TRITON_DIRTRITON_DIRopttritonserver SERVERTRITON_DIRbintritonserver BACKEND_DIRTRITON_DIRbackends SERVER_ARGSmodelrepositorypwdmodels backenddirectoryBACKEND_DIR modelcontrolmodeexplicit loadmodelvllm_opt logverbose1 SERVER_LOG.vllm_backend_server.log CLIENT_LOG.vllm_backend_client.log TEST_RESULT_FILEtest_results.txt CLIENT_PY.vllm_backend_test.py SAMPLE_MODELS_REPO......samplesmodel_repository EXPECTED_NUM_TESTS6 Helpers function assert_curl_success message1 if code ! 200 then cat .curl.out echo e nn message line BASH_LINENOn RET1 fi rm rf models mkdir p models cp r SAMPLE_MODELS_REPOvllm_model modelsvllm_opt vllm_opt model will be loaded on server start and stay loaded throughout unittesting. To test vllm model loadunload we use a dedicated vllm_load_test. | |
Chunk 72: | |
To ensure that vllms memory profiler will not error out on vllm_load_test load, we reduce gpu_memory_utilization for vllm_opt, so that at least 60 of GPU memory was available for other models. sed i sgpu_memory_utilization 0.5gpu_memory_utilization 0.4 modelsvllm_opt1model.json cp r modelsvllm_opt modelsvllm_load_test mkdir p modelsadd_sub1 wget P modelsadd_sub1 httpsraw.githubusercontent.comtritoninferenceserverpython_backendmainexamplesadd_submodel.py wget P modelsadd_sub httpsraw.githubusercontent.comtritoninferenceserverpython_backendmainexamplesadd_subconfig.pbtxt Invalid model attribute cp r SAMPLE_MODELS_REPOvllm_model modelsvllm_invalid_1 sed i sdisable_log_requestsinvalid_attribute modelsvllm_invalid_11model.json Invalid model name cp r SAMPLE_MODELS_REPOvllm_model modelsvllm_invalid_2 sed i sfacebookopt125minvalid_model modelsvllm_invalid_21model.json Sanity check ensembles are enabled and can successfully be loaded mkdir p modelsensemble_model1 cp r ensemble_config.pbtxt modelsensemble_modelconfig.pbtxt RET0 run_server if SERVER_PID 0 then cat SERVER_LOG echo e nn Failed to start SERVERn exit 1 fi set e python3 CLIENT_PY v CLIENT_LOG 21 if ? ne 0 then cat CLIENT_LOG echo e nn Running CLIENT_PY FAILED. | |
Chunk 73: | |
n RET1 else check_test_results TEST_RESULT_FILE EXPECTED_NUM_TESTS if ? ne 0 then cat CLIENT_LOG echo e nn Test Result Verification FAILED.n RET1 fi fi set e kill SERVER_PID wait SERVER_PID Test Python backend cmdline parameters are propagated to vllm backend SERVER_ARGSmodelrepositorypwdmodels backenddirectoryBACKEND_DIR backendconfigpython,defaultmaxbatchsize8 SERVER_LOG.vllm_test_cmdline_server.log rm rf .modelsvllm_invalid_1 .modelsvllm_invalid_2 .modelsvllm_load_test run_server if SERVER_PID 0 then cat SERVER_LOG echo e nn Failed to start SERVERn exit 1 fi kill SERVER_PID wait SERVER_PID rm rf .models COUNTgrep c defaultmaxbatchsize8 SERVER_LOG if COUNT ne 2 then echo Cmdline parameters verification Failed fi Test loading multiple vllm models SERVER_ARGSmodelrepositorypwdmodels backenddirectoryBACKEND_DIR modelcontrolmodeexplicit loadmodelvllm_one SERVER_LOG.vllm_test_multi_model.log Create two models, one is just a copy of the other, and make sure gpu utilization is low enough for multiple models to avoid OOM. vLLM changed behavior of their GPU profiler from total to free memory, so to load two small models, we need to start triton server in explicit mode. | |
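For reference, the explicit load that this script performs below with curl against the HTTP repository endpoint (localhost:8000) can also be issued from the Python gRPC client used throughout these tests. A minimal sketch, assuming the server is running with --model-control-mode=explicit and the vllm_one/vllm_two repository laid out as below:
```python
# Minimal sketch: explicitly load/unload a vLLM model when Triton runs in
# explicit model-control mode (model names follow the vllm_one/vllm_two setup above).
import tritonclient.grpc as grpcclient

with grpcclient.InferenceServerClient("localhost:8001") as client:
    client.load_model("vllm_two")              # same effect as the curl POST .../load below
    assert client.is_model_ready("vllm_two")   # readiness mirrors the repository index state

    client.unload_model("vllm_two")            # release the engine and its GPU memory
    assert not client.is_model_ready("vllm_two")
```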
Chunk 74: | |
MODEL1vllm_one MODEL2vllm_two mkdir p models cp r SAMPLE_MODELS_REPOvllm_model modelsMODEL1 cp r modelsMODEL1 modelsMODEL2 sed i sgpu_memory_utilization 0.5gpu_memory_utilization 0.4 modelsMODEL11model.json sed i sgpu_memory_utilization 0.5gpu_memory_utilization 0.4 modelsMODEL21model.json run_server if SERVER_PID 0 then cat SERVER_LOG echo e nn Failed to start SERVERn exit 1 fi Explicitly load model rm f .curl.out set e codecurl s w http_code o .curl.out X POST localhost8000v2repositorymodelsvllm_twoload set e assert_curl_success Failed to load vllm_two model kill SERVER_PID wait SERVER_PID rm rf .models if RET eq 1 then cat CLIENT_LOG cat SERVER_LOG echo e nn vLLM test FAILED. n else echo e nn vLLM test PASSED. n fi collect_artifacts_from_subdir exit RET File ciL0_backend_vllmvllm_backendvllm_backend_test.py Copyright 20232024, NVIDIA CORPORATION AFFILIATES. | |
Chunk 75: | |
All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
Chunk 76: | |
Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |
Chunk 77: | |
import sys import unittest from functools import partial import numpy as np import tritonclient.grpc as grpcclient from tritonclient.utils import sys.path.append....common from test_util import TestResultCollector, UserData, callback, create_vllm_request PROMPTS The most dangerous animal is, The capital of France is, The future of AI is, SAMPLING_PARAMETERS temperature 0, top_p 1 class VLLMTritonBackendTestTestResultCollector def setUpself self.triton_client grpcclient.InferenceServerClienturllocalhost8001 self.vllm_model_name vllm_opt self.python_model_name add_sub self.ensemble_model_name ensemble_model self.vllm_load_test vllm_load_test def test_vllm_triton_backendself Load both vllm and add_sub models self.triton_client.load_modelself.vllm_load_test self.assertTrueself.triton_client.is_model_readyself.vllm_load_test self.triton_client.load_modelself.python_model_name self.assertTrueself.triton_client.is_model_readyself.python_model_name Test to ensure that ensemble models are supported in vllm container. If ensemble support not present, triton will error out at model loading stage. Ensemble Model is a pipeline consisting of 1 model vllm_opt self.triton_client.load_modelself.ensemble_model_name self.assertTrueself.triton_client.is_model_readyself.ensemble_model_name self.triton_client.unload_modelself.ensemble_model_name Unload vllm model and test add_sub model self.triton_client.unload_modelself.vllm_load_test self.assertFalseself.triton_client.is_model_readyself.vllm_load_test self._test_python_model Load vllm model and unload add_sub model self.triton_client.load_modelself.vllm_load_test self.assertTrueself.triton_client.is_model_readyself.vllm_load_test self.triton_client.unload_modelself.python_model_name self.assertFalseself.triton_client.is_model_readyself.python_model_name Test vllm model and unload vllm model self._test_vllm_model promptsPROMPTS, sampling_parametersSAMPLING_PARAMETERS, streamFalse, send_parameters_as_tensorTrue, model_nameself.vllm_load_test, self._test_vllm_model promptsPROMPTS, sampling_parametersSAMPLING_PARAMETERS, streamFalse, send_parameters_as_tensorFalse, model_nameself.vllm_load_test, self.triton_client.unload_modelself.vllm_load_test self.assertFalseself.triton_client.is_model_readyself.vllm_load_test def test_model_with_invalid_attributesself model_name vllm_invalid_1 with self.assertRaisesInferenceServerException self.triton_client.load_modelmodel_name def test_vllm_invalid_model_nameself model_name vllm_invalid_2 with self.assertRaisesInferenceServerException self.triton_client.load_modelmodel_name def test_exclude_input_in_output_defaultself Verifying default behavior for exclude_input_in_output in nonstreaming mode. | |
Chunk 78: | |
Expected result prompt is returned with diffs. prompts The capital of France is, expected_output bThe capital of France is the capital of the French Republic.nnThe capital of France is the capital sampling_parameters temperature 0, top_p 1 self._test_vllm_model prompts, sampling_parameters, streamFalse, send_parameters_as_tensorTrue, expected_outputexpected_output, def test_exclude_input_in_output_falseself Verifying behavior for exclude_input_in_output False in nonstreaming mode. Expected result prompt is returned with diffs. | |
Chunk 79: | |
Test vllm model and unload vllm model prompts The capital of France is, expected_output bThe capital of France is the capital of the French Republic.nnThe capital of France is the capital sampling_parameters temperature 0, top_p 1 self._test_vllm_model prompts, sampling_parameters, streamFalse, send_parameters_as_tensorTrue, exclude_input_in_outputFalse, expected_outputexpected_output, def test_exclude_input_in_output_trueself Verifying behavior for exclude_input_in_output True in nonstreaming mode. Expected result only diffs are returned. Test vllm model and unload vllm model prompts The capital of France is, expected_output b the capital of the French Republic.nnThe capital of France is the capital sampling_parameters temperature 0, top_p 1 self._test_vllm_model prompts, sampling_parameters, streamFalse, send_parameters_as_tensorTrue, exclude_input_in_outputTrue, expected_outputexpected_output, def _test_vllm_model self, prompts, sampling_parameters, stream, send_parameters_as_tensor, exclude_input_in_outputNone, expected_outputNone, model_namevllm_opt, user_data UserData number_of_vllm_reqs lenprompts self.triton_client.start_streamcallbackpartialcallback, user_data for i in rangenumber_of_vllm_reqs request_data create_vllm_request promptsi, i, stream, sampling_parameters, model_name, send_parameters_as_tensor, exclude_input_in_outputexclude_input_in_output, self.triton_client.async_stream_infer model_namemodel_name, request_idrequest_datarequest_id, inputsrequest_datainputs, outputsrequest_dataoutputs, parameterssampling_parameters, for i in rangenumber_of_vllm_reqs result user_data._completed_requests.get if typeresult is InferenceServerException printresult.message self.assertIsNottyperesult, InferenceServerException, strresult output result.as_numpytext_output self.assertIsNotNoneoutput, text_output should not be None if expected_output is not None self.assertEqual output, expected_outputi, Actual and expected outputs do not match.n Expected n Actual.format output, expected_outputi , self.triton_client.stop_stream def _test_python_modelself shape 4 input0_data np.random.randshape.astypenp.float32 input1_data np.random.randshape.astypenp.float32 inputs grpcclient.InferInput INPUT0, input0_data.shape, np_to_triton_dtypeinput0_data.dtype , grpcclient.InferInput INPUT1, input1_data.shape, np_to_triton_dtypeinput1_data.dtype , inputs0.set_data_from_numpyinput0_data inputs1.set_data_from_numpyinput1_data outputs grpcclient.InferRequestedOutputOUTPUT0, grpcclient.InferRequestedOutputOUTPUT1, response self.triton_client.infer self.python_model_name, inputs, request_id10, outputsoutputs self.assertTrue np.allcloseinput0_data input1_data, response.as_numpyOUTPUT0 self.assertTrue np.allcloseinput0_data input1_data, response.as_numpyOUTPUT1 def tearDownself self.triton_client.close if __name__ __main__ unittest.main File ciL0_backend_vllmtest.sh !binbash Copyright 2024, NVIDIA CORPORATION AFFILIATES. | |
Chunk 80: | |
All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
Chunk 81: | |
Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |
Chunk 82: | |
RET0 SUBTESTSaccuracy_test request_cancellation enabled_stream vllm_backend metrics_test python3 m pip install tritonclientgrpc for TEST in SUBTESTS do cd TEST bash ex test.sh cd .. if ? ne 0 then echo Subtest TEST FAILED RET1 fi done if RET eq 0 then echo e nn vLLM Backend Test Passedn else echo e nn vLLM Backend Test FAILEDn fi exit RET File ciL0_check_health_vllmcheck_health_test.py Copyright 2024, NVIDIA CORPORATION AFFILIATES. | |
Chunk 83: | |
All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
Chunk 84: | |
Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |
Chunk 85: | |
import json import numpy as np import tritonclient.grpc as grpcclient class TestCheckHealth _grpc_url localhost8001 _model_name vllm_opt _sampling_parameters temperature 0, top_p 1 _prompt In this example, def _get_inputsself, prompt, streamTrue, sampling_parametersNone inputs inputs.appendgrpcclient.InferInputtext_input, 1, BYTES inputs1.set_data_from_numpy np.arrayprompt.encodeutf8, dtypenp.object_ inputs.appendgrpcclient.InferInputstream, 1, BOOL inputs1.set_data_from_numpynp.arraystream, dtypebool if sampling_parameters is not None inputs.appendgrpcclient.InferInputsampling_parameters, 1, BYTES inputs1.set_data_from_numpy np.array json.dumpssampling_parameters.encodeutf8, dtypenp.object_ return inputs def _callbackself, result, error self._responses.appendresult result, error error def _llm_inferself inputs self._get_inputs self._prompt, streamTrue, sampling_parametersself._sampling_parameters self._responses with grpcclient.InferenceServerClientself._grpc_url as client client.start_streamself._callback client.async_stream_infer self._model_name, inputsinputs, parametersself._sampling_parameters client.stop_stream def _assert_text_output_validself text_output for response in self._responses result, error responseresult, responseerror assert error is None text_output result.as_numpynametext_output0.decodeutf8 assert lentext_output 0, output is empty assert text_output.count 4, output is not a sentence def _assert_infer_exceptionself, expected_exception_message assert lenself._responses 1 for response in self._responses result, error responseresult, responseerror assert result is None assert strerror expected_exception_message def _assert_model_readyself, expected_readiness with grpcclient.InferenceServerClientself._grpc_url as client is_model_ready API assert client.is_model_readyself._model_name expected_readiness get_model_repository_index API model_state None for model_index in client.get_model_repository_index.models if model_index.name self._model_name assert model_state is None, duplicate model index found model_state model_index.state READY assert model_state expected_readiness def test_vllm_is_healthyself num_repeats 3 for i in rangenum_repeats self._assert_model_readyTrue self._llm_infer self._assert_text_output_valid self._assert_model_readyTrue def test_vllm_not_healthyself self._assert_model_readyTrue The 1st infer should complete successfully self._llm_infer self._assert_text_output_valid self._assert_model_readyTrue The 2nd infer should begin with health check failed self._llm_infer self._assert_infer_exception Model is unavailable due to unhealthy vLLM engine self._assert_model_readyFalse The 3rd infer should have model not found self._llm_infer self._assert_infer_exception Request for unknown model vllm_opt has no available versions self._assert_model_readyFalse File ciL0_check_health_vllmtest.sh !binbash Copyright 2024, NVIDIA CORPORATION AFFILIATES. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. | |
Chunk 86: | |
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. | |
Chunk 87: | |
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. export CUDA_VISIBLE_DEVICES0 source ..commonutil.sh pip3 install pytest8.1.1 pip3 install tritonclientgrpc RET0 function setup_model_repository local sample_model_repo_path1....samplesmodel_repository rm rf models vllm_baseline_output.pkl mkdir p models cp r sample_model_repo_pathvllm_model modelsvllm_opt function enable_health_check local enable_vllm_health_check1 echo e parameters modelsvllm_optconfig.pbtxt echo e key ENABLE_VLLM_HEALTH_CHECK modelsvllm_optconfig.pbtxt echo e value string_value enable_vllm_health_check modelsvllm_optconfig.pbtxt echo e modelsvllm_optconfig.pbtxt VLLM_INSTALL_PATHusrlocallibpython3.12distpackagesvllm function mock_vllm_async_llm_engine backup original file mv VLLM_INSTALL_PATHenginemultiprocessingclient.py VLLM_INSTALL_PATHenginemultiprocessingclient.py.backup cp VLLM_INSTALL_PATHenginemultiprocessingclient.py.backup VLLM_INSTALL_PATHenginemultiprocessingclient.py overwrite the original check_health method echo e VLLM_INSTALL_PATHenginemultiprocessingclient.py echo e async def check_healthself, check_count0 VLLM_INSTALL_PATHenginemultiprocessingclient.py echo e check_count0 1 VLLM_INSTALL_PATHenginemultiprocessingclient.py echo e if check_count0 1 VLLM_INSTALL_PATHenginemultiprocessingclient.py echo e raise RuntimeErrorSimulated vLLM check_health failure VLLM_INSTALL_PATHenginemultiprocessingclient.py function unmock_vllm_async_llm_engine restore from backup rm f VLLM_INSTALL_PATHenginemultiprocessingclient.py mv VLLM_INSTALL_PATHenginemultiprocessingclient.py.backup VLLM_INSTALL_PATHenginemultiprocessingclient.py function test_check_health local test_name1 local unit_test_name2 SERVER_LOGtest_name.server.log SERVER_ARGSmodelrepositorymodels modelcontrolmodeexplicit loadmodel run_server if SERVER_PID 0 then echo e nn Failed to start SERVERn cat SERVER_LOG exit 1 fi set e python3 m pytest junitxmltest_name.report.xml s v check_health_test.pyTestCheckHealthunit_test_name test_name.log if ? ne 0 then echo e nn test_name FAILED. | |
Chunk 88: | |
n RET1 fi set e kill SERVER_PID wait SERVER_PID Test health check unspecified setup_model_repository test_check_health health_check_unspecified test_vllm_is_healthy Test health check disabled setup_model_repository enable_health_check false test_check_health health_check_disabled test_vllm_is_healthy Test health check enabled setup_model_repository enable_health_check true test_check_health health_check_enabled test_vllm_is_healthy Mock check_health from vLLM mock_vllm_async_llm_engine Test health check unspecified with mocked vLLM check_health failure setup_model_repository test_check_health health_check_unspecified_mocked_failure test_vllm_is_healthy Test health check disabled with mocked vLLM check_health failure setup_model_repository enable_health_check false test_check_health health_check_disabled_mocked_failure test_vllm_is_healthy Test health check enabled with mocked vLLM check_health failure setup_model_repository enable_health_check true test_check_health health_check_enabled_mocked_failure test_vllm_not_healthy Unmock check_health unmock_vllm_async_llm_engine if RET eq 0 then echo e nn Test Passedn else echo e nn Test FAILEDn fi exit RET File ciL0_multi_gpu_vllmmulti_loradownload.py Copyright 2024, NVIDIA CORPORATION AFFILIATES. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. | |
Chunk 89: | |
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. | |
Chunk 90: | |
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. from huggingface_hub import snapshot_download if __name__ __main__ download lora weight alpaca snapshot_download repo_idswathijnGemmaDoll2bdollyLORATune, local_dir.weightslorasGemmaDoll, max_workers8, download lora weight GemmaSheep snapshot_download repo_ideduardoalvarezGemmaSheep2BLORATUNED, local_dir.weightslorasGemmaSheep, max_workers8, download backbone weight googlegemma2b snapshot_download repo_idunslothgemma2b, local_dir.weightsbackbonegemma2b, max_workers8, File ciL0_multi_gpu_vllmmulti_loramulti_lora_test.py Copyright 2024, NVIDIA CORPORATION AFFILIATES. All rights reserved. | |
Chunk 91: | |
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. | |
Chunk 92: | |
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import os import sys import unittest from functools import partial from typing import List import tritonclient.grpc as grpcclient from tritonclient.utils import sys.path.append....common from test_util import AsyncTestResultCollector, UserData, callback, create_vllm_request PROMPTS Instruct What do you think of Computer Science?nOutput SAMPLING_PARAMETERS temperature 0, top_p 1 server_enable_lora True class VLLMTritonLoraTestAsyncTestResultCollector def setUpself self.triton_client grpcclient.InferenceServerClienturllocalhost8001 self.vllm_model_name vllm_llama_multi_lora def _test_vllm_model self, prompts Liststr, sampling_parameters, lora_name Liststr, server_enable_loraTrue, streamFalse, exclude_input_in_outputNone, expected_outputNone, assert lenprompts len lora_name , The number of prompts and lora names should be the same user_data UserData number_of_vllm_reqs lenprompts self.triton_client.start_streamcallbackpartialcallback, user_data for i in rangenumber_of_vllm_reqs lora lora_namei if lora_name else None sam_para_copy sampling_parameters.copy if lora is not None sam_para_copylora_name lora request_data create_vllm_request promptsi, i, stream, sam_para_copy, self.vllm_model_name, exclude_input_in_outputexclude_input_in_output, self.triton_client.async_stream_infer model_nameself.vllm_model_name, request_idrequest_datarequest_id, inputsrequest_datainputs, outputsrequest_dataoutputs, parameterssampling_parameters, for i in rangenumber_of_vllm_reqs result user_data._completed_requests.get if typeresult is InferenceServerException printresult.message if server_enable_lora self.assertEqual strresult.message, fLoRA lora_namei is not supported, we currently support doll, sheep, InferenceServerException, else self.assertEqual strresult.message, LoRA feature is not enabled., InferenceServerException, self.triton_client.stop_stream return output result.as_numpytext_output self.assertIsNotNoneoutput, text_output should not be None if expected_output is not None self.assertEqual output, expected_outputi, Actual and expected outputs do not match.n Expected n Actual.format output, expected_outputi , self.triton_client.stop_stream def test_multi_lora_requestsself sampling_parameters temperature 0, top_p 1 make two requests separately to avoid the different arrival of response answers prompt_1 Instruct What do you think of Computer Science?nOutput lora_1 doll expected_output b I think it is a very interesting subject.nnInstruct What do you self._test_vllm_model prompt_1, sampling_parameters, lora_namelora_1, server_enable_loraserver_enable_lora, streamFalse, exclude_input_in_outputTrue, expected_outputexpected_output, prompt_2 Instruct Tell me more about soccernOutput lora_2 sheep expected_output b I love soccer. | |
Chunk 93: | |
I play soccer every day.nInstruct Tell me self._test_vllm_model prompt_2, sampling_parameters, lora_namelora_2, server_enable_loraserver_enable_lora, streamFalse, exclude_input_in_outputTrue, expected_outputexpected_output, def test_none_exist_loraself prompts Instruct What is the capital city of France?nOutput, loras bactrian sampling_parameters temperature 0, top_p 1 self._test_vllm_model prompts, sampling_parameters, lora_nameloras, server_enable_loraserver_enable_lora, streamFalse, exclude_input_in_outputTrue, expected_outputNone, this request will lead to lora not supported error, so there is no expected output def tearDownself self.triton_client.close if __name__ __main__ server_enable_lora os.environ.getSERVER_ENABLE_LORA, false.lower true unittest.main File ciL0_multi_gpu_vllmmulti_loratest.sh !binbash Copyright 2024, NVIDIA CORPORATION AFFILIATES. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. | |
Chunk 94: | |
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. | |
Chunk 95: | |
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. source ....commonutil.sh TRITON_DIRTRITON_DIRopttritonserver SERVERTRITON_DIRbintritonserver BACKEND_DIRTRITON_DIRbackends SERVER_ARGSmodelrepositorypwdmodels backenddirectoryBACKEND_DIR logverbose1 SERVER_LOG.multi_lora_server.log CLIENT_LOG.multi_lora_client.log TEST_RESULT_FILEtest_results.txt CLIENT_PY.multi_lora_test.py DOWNLOAD_PY.download.py SAMPLE_MODELS_REPO......samplesmodel_repository EXPECTED_NUM_TESTS2 first we download weights pip install U huggingface_hub rm rf weights mkdir p weightslorasGemmaDoll mkdir p weightslorasGemmaSheep mkdir p weightsbackbonegemma2b python3 DOWNLOAD_PY v CLIENT_LOG 21 rm rf models mkdir p models cp r SAMPLE_MODELS_REPOvllm_model modelsvllm_llama_multi_lora export SERVER_ENABLE_LORAtrue Check boolean flag value for enable_lora model_jsoncat EOF model.weightsbackbonegemma2b, disable_log_requests true, gpu_memory_utilization 0.7, tensor_parallel_size 2, block_size 16, enforce_eager true, enable_lora true, max_lora_rank 32, lora_extra_vocab_size 256, distributed_executor_backendray EOF echo model_json modelsvllm_llama_multi_lora1model.json multi_lora_jsoncat EOF doll .weightslorasGemmaDoll, sheep .weightslorasGemmaSheep EOF echo multi_lora_json modelsvllm_llama_multi_lora1multi_lora.json RET0 If it is the first time launching triton server with gemma2b and multilora feature, it may take more than 1 minutes. Please wait. | |
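With enable_lora set to true in model.json and the adapters registered in multi_lora.json above, a per-request adapter is selected by adding a lora_name entry to the sampling parameters, as multi_lora_test.py does. A minimal sketch under those assumptions (model and adapter names are taken from this test setup, not a general API guarantee):
```python
# Minimal sketch: choose a LoRA adapter per request by adding "lora_name" to the
# JSON sampling parameters ("doll"/"sheep" come from multi_lora.json above).
import json
import numpy as np
import tritonclient.grpc as grpcclient

prompt = "Instruct: What do you think of Computer Science?\nOutput:"
sampling_parameters = {"temperature": 0, "top_p": 1, "lora_name": "doll"}

inputs = []
inputs.append(grpcclient.InferInput("text_input", [1], "BYTES"))
inputs[-1].set_data_from_numpy(np.array([prompt.encode("utf-8")], dtype=np.object_))
inputs.append(grpcclient.InferInput("stream", [1], "BOOL"))
inputs[-1].set_data_from_numpy(np.array([False], dtype=bool))
inputs.append(grpcclient.InferInput("sampling_parameters", [1], "BYTES"))
inputs[-1].set_data_from_numpy(
    np.array([json.dumps(sampling_parameters).encode("utf-8")], dtype=np.object_)
)

def callback(result, error):
    # An unknown adapter name is reported back as an error by the backend.
    print(error if error else result.as_numpy("text_output"))

with grpcclient.InferenceServerClient("localhost:8001") as client:
    client.start_stream(callback)
    client.async_stream_infer("vllm_llama_multi_lora", inputs=inputs)
    client.stop_stream()
```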
Chunk 96: | |
SERVER_TIMEOUT60000 run_server if SERVER_PID 0 then cat SERVER_LOG echo e nn Failed to start SERVERn exit 1 fi set e python3 CLIENT_PY v CLIENT_LOG 21 if ? ne 0 then cat CLIENT_LOG echo e nn Running CLIENT_PY FAILED. n RET1 else check_test_results TEST_RESULT_FILE EXPECTED_NUM_TESTS if ? | |
Chunk 97: | |
ne 0 then cat CLIENT_LOG echo e nn Test Result Verification FAILED.n RET1 fi fi set e kill SERVER_PID wait SERVER_PID Check string flag value for enable_lora model_jsoncat EOF model.weightsbackbonegemma2b, disable_log_requests true, gpu_memory_utilization 0.7, tensor_parallel_size 2, block_size 16, enforce_eager true, enable_lora true, max_lora_rank 32, lora_extra_vocab_size 256, distributed_executor_backendray EOF echo model_json modelsvllm_llama_multi_lora1model.json run_server if SERVER_PID 0 then cat SERVER_LOG echo e nn Failed to start SERVERn exit 1 fi set e python3 CLIENT_PY v CLIENT_LOG 21 if ? ne 0 then cat CLIENT_LOG echo e nn Running CLIENT_PY FAILED. n RET1 else check_test_results TEST_RESULT_FILE EXPECTED_NUM_TESTS if ? | |
Chunk 98: | |
ne 0 then cat CLIENT_LOG echo e nn Test Result Verification FAILED.n RET1 fi fi set e kill SERVER_PID wait SERVER_PID disable lora export SERVER_ENABLE_LORAfalse check bool flag value for enable_lora model_jsoncat EOF model.weightsbackbonegemma2b, disable_log_requests true, gpu_memory_utilization 0.8, tensor_parallel_size 2, block_size 16, enforce_eager true, enable_lora false, lora_extra_vocab_size 256, distributed_executor_backendray EOF echo model_json modelsvllm_llama_multi_lora1model.json run_server if SERVER_PID 0 then cat SERVER_LOG echo e nn Failed to start SERVERn exit 1 fi set e python3 CLIENT_PY v CLIENT_LOG 21 if ? ne 0 then cat CLIENT_LOG echo e nn Running CLIENT_PY FAILED. n RET1 else check_test_results TEST_RESULT_FILE EXPECTED_NUM_TESTS if ? | |
Chunk 99: | |
ne 0 then cat CLIENT_LOG echo e nn Test Result Verification FAILED.n RET1 fi fi set e kill SERVER_PID wait SERVER_PID disable lora export SERVER_ENABLE_LORAfalse check string flag value for enable_lora model_jsoncat EOF model.weightsbackbonegemma2b, disable_log_requests true, gpu_memory_utilization 0.8, tensor_parallel_size 2, block_size 16, enforce_eager true, enable_lora false, lora_extra_vocab_size 256, distributed_executor_backendray EOF echo model_json modelsvllm_llama_multi_lora1model.json run_server if SERVER_PID 0 then cat SERVER_LOG echo e nn Failed to start SERVERn exit 1 fi set e python3 CLIENT_PY v CLIENT_LOG 21 if ? ne 0 then cat CLIENT_LOG echo e nn Running CLIENT_PY FAILED. n RET1 else check_test_results TEST_RESULT_FILE EXPECTED_NUM_TESTS if ? | |
Chunk 100: | |
ne 0 then cat CLIENT_LOG echo e nn Test Result Verification FAILED.n RET1 fi fi set e kill SERVER_PID wait SERVER_PID rm rf models rm rf weights if RET eq 1 then cat CLIENT_LOG cat SERVER_LOG echo e nn Multi LoRA test FAILED. n else echo e nn Multi LoRA test PASSED. n fi collect_artifacts_from_subdir exit RET File ciL0_multi_gpu_vllmvllm_backendtest.sh !binbash Copyright 20232024, NVIDIA CORPORATION AFFILIATES. | |
Chunk 101: | |
All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
Chunk 102: | |
Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |
Chunk 103: | |
source ....commonutil.sh TRITON_DIRTRITON_DIRopttritonserver SERVERTRITON_DIRbintritonserver BACKEND_DIRTRITON_DIRbackends SERVER_ARGSmodelrepositorypwdmodels backenddirectoryBACKEND_DIR modelcontrolmodeexplicit logverbose1 TEST_RESULT_FILEtest_results.txt CLIENT_PY.vllm_multi_gpu_test.py SAMPLE_MODELS_REPO......samplesmodel_repository EXPECTED_NUM_TESTS1 Helpers function validate_file_contains local KEY1 local FILE2 if z KEY z FILE then echo Error KEY and FILE must be provided. return 1 fi if ! f FILE then echo Error File FILE does not exist. | |
Chunk 104: | |
return 1 fi countgrep o w KEY FILE wc l if count ne 1 then echo Error KEY KEY found count times in FILE. Expected exactly once. return 1 fi function run_multi_gpu_test export KIND1 export TENSOR_PARALLELISM2 export INSTANCE_COUNT3 export DISTRIBUTED_EXECUTOR_BACKEND4 Setup a clean model repository export TEST_MODELvllm_opt_KIND_tpTENSOR_PARALLELISM_countINSTANCE_COUNT local TEST_MODEL_TRITON_CONFIGmodelsTEST_MODELconfig.pbtxt local TEST_MODEL_VLLM_CONFIGmodelsTEST_MODEL1model.json rm rf models mkdir p models cp r SAMPLE_MODELS_REPOvllm_model modelsTEST_MODEL sed i sKIND_MODELKIND TEST_MODEL_TRITON_CONFIG sed i 3s tensor_parallel_size TENSOR_PARALLELISM,n TEST_MODEL_VLLM_CONFIG if TENSOR_PARALLELISM ne 1 then jq arg backend DISTRIBUTED_EXECUTOR_BACKEND . | |
Chunk 105: | |
distributed_executor_backendbackend TEST_MODEL_VLLM_CONFIG temp.json mv temp.json TEST_MODEL_VLLM_CONFIG fi Assert the correct kind is set in case the template config changes in the future validate_file_contains KIND TEST_MODEL_TRITON_CONFIG Start server echo Running multiGPU test with kindKIND, tpTENSOR_PARALLELISM, instance_countINSTANCE_COUNT SERVER_LOG.vllm_multi_gpu_testKIND_tpTENSOR_PARALLELISM_countINSTANCE_COUNTserver.log run_server if SERVER_PID 0 then cat SERVER_LOG echo e nn Failed to start SERVERn exit 1 fi Run unit tests set e CLIENT_LOG.vllm_multi_gpu_testKIND_tpTENSOR_PARALLELISM_countINSTANCE_COUNTclient.log python3 CLIENT_PY v CLIENT_LOG 21 if ? ne 0 then cat CLIENT_LOG echo e nn Running CLIENT_PY FAILED. n RET1 else check_test_results TEST_RESULT_FILE EXPECTED_NUM_TESTS if ? | |
Chunk 106: | |
ne 0 then cat CLIENT_LOG echo e nn Test Result Verification FAILED.n RET1 fi fi set e Cleanup kill SERVER_PID wait SERVER_PID Test rm f .log RET0 Test the various cases of kind, tensor parallelism, and instance count for different ways to run multiGPU models with vLLM on Triton KINDSKIND_MODEL KIND_GPU TPS1 2 INSTANCE_COUNTS1 2 DISTRIBUTED_EXECUTOR_BACKENDray for kind in KINDS do for tp in TPS do for count in INSTANCE_COUNTS do run_multi_gpu_test kind tp count DISTRIBUTED_EXECUTOR_BACKEND done done done Results if RET eq 1 then echo e nn Multi GPU Utilization test FAILED. n else echo e nn Multi GPU Utilization test PASSED. n fi exit RET File ciL0_multi_gpu_vllmvllm_backendvllm_multi_gpu_test.py Copyright 20232024, NVIDIA CORPORATION AFFILIATES. | |
Chunk 107: | |
All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
Chunk 108: | |
Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |
Chunk 109: | |
import os import sys import unittest from functools import partial import pynvml import tritonclient.grpc as grpcclient from tritonclient.utils import sys.path.append....common from test_util import TestResultCollector, UserData, callback, create_vllm_request class VLLMMultiGPUTestTestResultCollector def setUpself pynvml.nvmlInit self.triton_client grpcclient.InferenceServerClienturllocalhost8001 def get_gpu_memory_utilizationself, gpu_id handle pynvml.nvmlDeviceGetHandleByIndexgpu_id info pynvml.nvmlDeviceGetMemoryInfohandle return info.used def get_available_gpu_idsself device_count pynvml.nvmlDeviceGetCount available_gpus for gpu_id in rangedevice_count handle pynvml.nvmlDeviceGetHandleByIndexgpu_id if handle available_gpus.appendgpu_id return available_gpus def _test_vllm_multi_gpu_utilizationself, model_name str Test that loading a given vLLM model will increase GPU utilization across multiple GPUs, and run a sanity check inference to confirm that the loaded multigpumultiinstance model is working as expected. gpu_ids self.get_available_gpu_ids self.assertGreaterEquallengpu_ids, 2, Error Detected single GPU printnn Before Loading vLLM Model mem_util_before_loading_model for gpu_id in gpu_ids memory_utilization self.get_gpu_memory_utilizationgpu_id printfGPU gpu_id Memory Utilization memory_utilization bytes mem_util_before_loading_modelgpu_id memory_utilization self.triton_client.load_modelmodel_name self._test_vllm_modelmodel_name print After Loading vLLM Model vllm_model_used_gpus 0 for gpu_id in gpu_ids memory_utilization self.get_gpu_memory_utilizationgpu_id printfGPU gpu_id Memory Utilization memory_utilization bytes if memory_utilization mem_util_before_loading_modelgpu_id vllm_model_used_gpus 1 self.assertGreaterEqualvllm_model_used_gpus, 2 def _test_vllm_modelself, model_name str, send_parameters_as_tensor bool True user_data UserData stream False prompts The most dangerous animal is, The capital of France is, The future of AI is, number_of_vllm_reqs lenprompts sampling_parameters temperature 0, top_p 1 self.triton_client.start_streamcallbackpartialcallback, user_data for i in rangenumber_of_vllm_reqs request_data create_vllm_request promptsi, i, stream, sampling_parameters, model_name, send_parameters_as_tensor, self.triton_client.async_stream_infer model_namemodel_name, request_idrequest_datarequest_id, inputsrequest_datainputs, outputsrequest_dataoutputs, parameterssampling_parameters, for i in rangenumber_of_vllm_reqs result user_data._completed_requests.get self.assertIsNottyperesult, InferenceServerException output result.as_numpytext_output self.assertIsNotNoneoutput self.triton_client.stop_stream def test_multi_gpu_modelself Tests that a multiGPU vLLM model loads successfully on multiple GPUs and can handle a few sanity check inference requests. MultiGPU models are currently defined here as either a single model instance with tensor parallelism 1 multiple model instances each with tensor parallelism 1 FIXME This test currently skips over a few combinations that may be enhanced in the future, such as tensor parallel models with multiple model instances KIND_MODEL models with multiple model instances model os.environ.getTEST_MODEL kind os.environ.getKIND tp os.environ.getTENSOR_PARALLELISM instance_count os.environ.getINSTANCE_COUNT for env_var in model, kind, tp, instance_count self.assertIsNotNoneenv_var printfTest Matrix model, kind, tp, instance_count Only support tensor parallelism or multiple instances for now, but not both. | |
Chunk 110: | |
Support for multiinstance tensor parallel models may require more special handling in the backend to better handle device assignment. NOTE This eliminates the 111 and 224 test cases. if inttp intinstance_count ! | |
Chunk 111: | |
2 msg TENSOR_PARALLELISM and INSTANCE_COUNT must have a product of 2 for this 2GPU test printSkipping Test, msg self.skipTestmsg Loading a KIND_GPU model with Tensor Parallelism 1 should fail and recommend using KIND_MODEL instead for multigpu model instances. if kind KIND_GPU and inttp 1 with self.assertRaisesRegex InferenceServerException, please specify KIND_MODEL self._test_vllm_multi_gpu_utilizationmodel return Loading a KIND_MODEL model with multiple instances can cause oversubscription to specific GPUs and cause a CUDA OOM if the gpu_memory_utilization settings are high without further handling of device assignment in the backend. if kind KIND_MODEL and intinstance_count 1 msg Testing multiple model instances of KIND_MODEL is not implemented at this time printSkipping Test, msg self.skipTestmsg self._test_vllm_multi_gpu_utilizationmodel def tearDownself pynvml.nvmlShutdown self.triton_client.close if __name__ __main__ unittest.main File ciL0_multi_gpu_vllmtest.sh !binbash Copyright 2024, NVIDIA CORPORATION AFFILIATES. | |
Chunk 112: | |
All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
Chunk 113: | |
Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |
Chunk 114: | |
RET0 SUBTESTSvllm_backend multi_lora python3 m pip install tritonclientgrpc for TEST in SUBTESTS do cd TEST bash ex test.sh cd .. if ? ne 0 then echo Subtest TEST FAILED RET1 fi done if RET eq 0 then echo e nn vLLM MultiGPU Tests Passedn else echo e nn vLLM MultiGPU Tests FAILEDn fi exit RET File docsadditional_outputs.md ! | |
Chunk 115: | |
Copyright 2024, NVIDIA CORPORATION AFFILIATES. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. | |
Chunk 116: | |
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. | |
Chunk 117: | |
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. Additional Outputs from vLLM. The vLLM backend supports sending additional outputs from vLLM on top of the usual text_output when requested. All additional outputs are disabled by default and need to be enabled on a per-request basis.
Chunk 118: | |
If enabled, the corresponding output tensor will be set for all responses from the request. Supported Additional Outputs. Finish Reason: the reason why the sequence is finished. See here (https://github.com/vllm-project/vllm/blob/v0.6.3.post1/vllm/outputs.py#L26) for more details.
Chunk 119: | |
To enable, set the return_finish_reason input tensor to True. The reason will be sent as a string on the finish_reason output tensor. Cumulative Log Probabilities: the cumulative log probability of the generated output text.
Chunk 120: | |
See here (https://github.com/vllm-project/vllm/blob/v0.6.3.post1/vllm/outputs.py#L22) for more details. To enable, set the return_cumulative_logprob input tensor to True. The floating point value will be sent on the cumulative_logprob output tensor.
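An illustrative fragment (not taken from the repository) showing the extra input tensor and the corresponding read; the rest of the request is set up exactly as in the finish_reason example later in this document:
```python
# Sketch: request the cumulative log probability of the generated text.
import numpy as np
import tritonclient.grpc as grpcclient

inputs = []
inputs.append(grpcclient.InferInput("return_cumulative_logprob", [1], "BOOL"))
inputs[-1].set_data_from_numpy(np.array([True], dtype=bool))

def callback(result, error):
    if error is None:
        # Floating point value reported on each response.
        print(result.as_numpy("cumulative_logprob"))
```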
Chunk 121: | |
Log Probabilities: the log probabilities of the top probability tokens at each position, returned when logprobs (https://github.com/vllm-project/vllm/blob/v0.6.3.post1/vllm/sampling_params.py#L146-L152) are requested. Only the log probabilities of the new tokens generated since the last response are returned on each new response. See here (https://github.com/vllm-project/vllm/blob/v0.6.3.post1/vllm/outputs.py#L24-L25) for more details on the log probabilities.
Chunk 122: | |
To enable, set the return_logprobs input tensor to True. The log probabilities will be sent on the logprobs output tensor as a serialized JSON string.

Number of Input Tokens

The number of token IDs of the prompt.
Chunk 123: | |
See here (https://github.com/vllm-project/vllm/blob/v0.6.3.post1/vllm/outputs.py#L79-L81) for more details. To enable, set the return_num_input_tokens input tensor to True. The unsigned integer value will be sent on the num_input_tokens output tensor.
Chunk 124: | |
Number of Output Tokens

The number of token IDs of the generated output text sent on this response. It is the difference in length of the token IDs generated from the last response to this response. If this is the first response, the last response length is presumed to be zero.
Chunk 125: | |
See here (https://github.com/vllm-project/vllm/blob/v0.6.3.post1/vllm/outputs.py#L21) for more details on the token IDs of the generated output text. To enable, set the return_num_output_tokens input tensor to True. The unsigned integer value will be sent on the num_output_tokens output tensor.
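To make the difference-in-length semantics concrete, the sketch below enables both token-count outputs on a streaming request and sums num_output_tokens across responses; once the final response arrives, the sum equals the total number of generated tokens. The prompt, model name, and endpoint are placeholders for this sketch, not values fixed by the backend.

import numpy as np
import tritonclient.grpc as grpcclient

totals = {"num_input_tokens": 0, "num_output_tokens": 0}

def callback(result, error):
    if error is not None:
        print(error)
        return
    # num_input_tokens is the same on every response of the request,
    # while num_output_tokens only counts tokens new to this response.
    totals["num_input_tokens"] = int(result.as_numpy("num_input_tokens")[0])
    totals["num_output_tokens"] += int(result.as_numpy("num_output_tokens")[0])

inputs = []
inputs.append(grpcclient.InferInput("text_input", [1], "BYTES"))
inputs[-1].set_data_from_numpy(
    np.array(["example prompt".encode("utf-8")], dtype=np.object_)
)
inputs.append(grpcclient.InferInput("stream", [1], "BOOL"))
inputs[-1].set_data_from_numpy(np.array([True], dtype=bool))
for name in ("return_num_input_tokens", "return_num_output_tokens"):
    inputs.append(grpcclient.InferInput(name, [1], "BOOL"))
    inputs[-1].set_data_from_numpy(np.array([True], dtype=bool))

with grpcclient.InferenceServerClient("localhost:8001") as client:  # placeholder endpoint
    client.start_stream(callback)
    client.async_stream_infer("vllm_model", inputs=inputs)  # placeholder model name
    client.stop_stream()

print(totals)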
Chunk 126: | |
Examples

Add Finish Reason to Outputs

import numpy as np
import tritonclient.grpc as grpcclient

inputs = []

inputs.append(grpcclient.InferInput("text_input", [1], "BYTES"))
inputs[-1].set_data_from_numpy(
    np.array(["example prompt".encode("utf-8")], dtype=np.object_)
)

inputs.append(grpcclient.InferInput("return_finish_reason", [1], "BOOL"))
inputs[-1].set_data_from_numpy(np.array([True], dtype=bool))

def callback(result, error):
    ...
    print(result.as_numpy(name="finish_reason"))

with grpcclient.InferenceServerClient("localhost:8001") as client:
    client.start_stream(callback)
    client.async_stream_infer("vLLM_model_name", inputs=inputs, ...)
    client.stop_stream()

Notes

Enabling additional outputs may impact performance; only add additional outputs when necessary.
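Building on the example above, the callback sketch below shows one way to consume the logprobs output, which arrives as a serialized JSON string per response and only covers tokens generated since the previous response. The JSON layout assumed here (a list of per-position dictionaries keyed by token id, each with logprob, rank, and decoded_token fields) follows the construction in src/model.py later in this file; the request must set return_logprobs to True as described earlier.

import json

def callback(result, error):
    if error is not None:
        print(error)
        return
    logprobs_tensor = result.as_numpy("logprobs")
    if logprobs_tensor is None:
        return
    for raw in logprobs_tensor:  # one JSON string per generated sequence
        entries = json.loads(raw.decode("utf-8"))
        if entries is None:  # "null" is sent when no logprobs are available
            continue
        for position in entries:
            for token_id, info in position.items():
                print(token_id, info["logprob"], info.get("decoded_token"))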
Chunk 127: | |
File docs/health_check.md Copyright 2024, NVIDIA CORPORATION AFFILIATES. All rights reserved.
Chunk 128: | |
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. | |
Chunk 129: | |
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

vLLM Health Check (BETA)

NOTE: The vLLM Health Check support is currently in BETA.
Chunk 130: | |
Its features and functionality are subject to change as we collect feedback. We are excited to hear any thoughts you have! The vLLM backend supports checking for vLLM Engine Health (https://github.com/vllm-project/vllm/blob/v0.6.3.post1/vllm/engine/async_llm_engine.py#L1177-L1185) upon receiving each inference request.
Chunk 131: | |
If the health check fails, the model state will become NOT Ready at the server, which can be queried by the Repository Index (https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_model_repository.md#index) or Model Ready (https://github.com/triton-inference-server/client/blob/main/src/c%2B%2B/library/http_client.h#L178-L192) APIs.

The Health Check is disabled by default. To enable it, set the following parameter on the model config to true

parameters: {
  key: "ENABLE_VLLM_HEALTH_CHECK"
  value: { string_value: "true" }
}

and select Model Control Mode EXPLICIT (https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_management.md#model-control-mode-explicit) when the server is started.
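Once enabled, a failed health check leaves the model NOT Ready, and that state can be observed from a client through the readiness and repository index APIs mentioned above. A minimal sketch using tritonclient over gRPC, assuming the default localhost:8001 endpoint and a placeholder model name vllm_model:

import tritonclient.grpc as grpcclient

with grpcclient.InferenceServerClient("localhost:8001") as client:  # placeholder endpoint
    if client.is_model_ready("vllm_model"):  # placeholder model name
        print("vllm_model is ready")
    else:
        # After a failed vLLM health check the model is unloaded and
        # shows up as not ready in the repository index.
        print("vllm_model is NOT ready")
        for model in client.get_model_repository_index().models:
            print(model.name, model.state, model.reason)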
Chunk 132: | |
File docs/llama_multi_lora_tutorial.md Copyright 2024, NVIDIA CORPORATION AFFILIATES. All rights reserved.
Chunk 133: | |
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. | |
Chunk 134: | |
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Tutorial on deploying a multi-LoRA vLLM backend in Triton

The idea of multi-LoRA was proposed recently; for more, please refer to:

- S-LoRA: Serving Thousands of Concurrent LoRA Adapters (https://arxiv.org/abs/2311.03285)
- Punica: Multi-Tenant LoRA Serving (https://arxiv.org/abs/2310.18547)

vLLM now supports multi-LoRA, integrating the Punica feature and its related CUDA kernels.
Chunk 135: | |
See this PR (https://github.com/vllm-project/vllm/pull/1804) for more. [2024-01-24] This PR has been merged into the main branch of vLLM.

The following tutorial demonstrates how to deploy a LLaMa model with multiple LoRAs on Triton Inference Server using Triton's Python-based (https://github.com/triton-inference-server/backend/blob/main/docs/python_based_backends.md#python-based-backends) vLLM (https://github.com/triton-inference-server/vllm_backend/tree/main) backend.

Before you continue reading, it's important to note that all command-line instructions containing xx.yy in this document cannot be used directly by copying and pasting.
Chunk 136: | |
xx.yy represents the Triton version, and you must substitute the Triton version you want to use for the bash commands to work.

Step 1: Start a docker container for triton+vllm serving

A docker container is strongly recommended for serving, and this tutorial will only demonstrate how to launch Triton in the docker environment. First, start a docker container using the tritonserver image with the vLLM backend from NGC (https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver/tags).

NOTICE: you must first cd to your vllm_workspace path outside the container.
Chunk 137: | |
mkdir vllm_workspace
cd vllm_workspace

sudo docker run --gpus all -it --net=host -p 8001:8001 --shm-size=12G \
    --ulimit memlock=-1 --ulimit stack=67108864 \
    -v ${PWD}:/vllm_workspace -w /vllm_workspace \
    nvcr.io/nvidia/tritonserver:xx.yy-vllm-python-py3 /bin/bash

NOTICE: the version of the Triton docker image must be configured; here and throughout this tutorial we use xx.yy to stand for the version.

Triton's vLLM container has been available since the 23.10 release, and experimental multi-LoRA support was added in the vLLM v0.3.0 release. Docker image version nvcr.io/nvidia/tritonserver:24.05-vllm-python-py3 or higher is strongly recommended.
Chunk 138: | |
For pre-24.05 containers, the docker images did not support the multi-LoRA feature, so you need to replace the model.py provided in the container at /opt/tritonserver/backends/vllm/model.py with the most up-to-date version. Just follow this command to download the model.py script from GitHub:

wget -P /opt/tritonserver/backends/vllm https://raw.githubusercontent.com/triton-inference-server/vllm_backend/rxx.yy/src/model.py

This command downloads the model.py script into the Triton vllm backend directory, which enables the multi-LoRA feature.

Step 2: Prepare your weights

To support multi-LoRA on Triton, you need to manage your file paths for the model backbone and LoRA weights separately.
Chunk 139: | |
A typical weights repository looks as follows:

weights
├── backbone
│   └── llama-7b-hf
└── loras
    ├── alpaca-lora-7b
    └── wizardLM-lora-7b

A dedicated workspace holding vLLM, the model backbone weights, and the LoRA adapter weights is strongly recommended, so that these weight files stay logically organized in one place.

Step 3: Prepare the model repository for Triton Server

3.1 Download the model repository files

To use Triton, a model repository is needed to hold the model path, backend configuration, and other information.
Chunk 140: | |
The vllm backend is implemented on top of the Python backend, and the sampling_params of vLLM are read from model.json. To create a Triton model repository, you may download the files through these commands:

NOTICE: you must first cd to your vllm_workspace path.

cd vllm_workspace
mkdir -p model_repository/vllm_model/1
wget -P model_repository/vllm_model/1 https://raw.githubusercontent.com/triton-inference-server/vllm_backend/rxx.yy/samples/model_repository/vllm_model/1/model.json
wget -P model_repository/vllm_model https://raw.githubusercontent.com/triton-inference-server/vllm_backend/rxx.yy/samples/model_repository/vllm_model/config.pbtxt

The model repository should look like this:

model_repository
└── vllm_model
    ├── 1
    │   └── model.json
    └── config.pbtxt

Now you have finished the basic deployment, and the file structure should look like this:

vllm_workspace
├── weights
│   ├── backbone
│   │   └── llama-7b-hf
│   └── loras
│       ├── alpaca-lora-7b
│       └── bactrian-x-llama-7b-lora
└── model_repository
    └── vllm_model
        ├── 1
        │   └── model.json
        └── config.pbtxt

3.2 Populate model.json

For this tutorial we will use the following set of parameters, specified in the model.json.
Chunk 141: | |
{
    "model": "/vllm_workspace/weights/backbone/llama-7b-hf",
    "disable_log_requests": true,
    "gpu_memory_utilization": 0.8,
    "tensor_parallel_size": 2,
    "block_size": 16,
    "enforce_eager": true,
    "enable_lora": true,
    "max_lora_rank": 16
}

- model: the path to your model repository
- disable_log_requests: whether to show logs when launching vLLM or not
- gpu_memory_utilization: the GPU memory allocated for the model weights and the vLLM PagedAttention KV cache manager
- tensor_parallel_size: vLLM supports tensor parallelism, so you can decide how many GPUs to use for serving
Chunk 142: | |
- block_size: vLLM KV cache block size
- enable_lora: to support vLLM multi-LoRA, this should be configured and set to true
- max_lora_rank: the maximum LoRA rank across your LoRA adapters

A small scripted way to write this file is sketched right after this list.
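As referenced above, the sketch below produces the tutorial's model.json programmatically; the workspace paths and the output location mirror the placeholder layout used in this tutorial, so adjust them to your own setup.

import json

# Engine parameters from the tutorial above; paths are placeholders.
engine_config = {
    "model": "/vllm_workspace/weights/backbone/llama-7b-hf",
    "disable_log_requests": True,
    "gpu_memory_utilization": 0.8,
    "tensor_parallel_size": 2,
    "block_size": 16,
    "enforce_eager": True,
    "enable_lora": True,
    "max_lora_rank": 16,
}

# Write the config where the vllm backend expects it.
with open("model_repository/vllm_model/1/model.json", "w") as f:
    json.dump(engine_config, f, indent=4)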
Chunk 143: | |
The full set of parameters can be found here (https://github.com/Yard1/vllm/blob/multi_lora/vllm/engine/arg_utils.py#L11).

3.3 Specify local lora path

vLLM v0.4.0.post1 supports inference with locally applied LoRA weights, which means that vLLM cannot pull LoRA adapters from Hugging Face. So Triton needs to know where the local LoRA weights are.
Chunk 144: | |
Create a multi_lora.json file under the model_repository/vllm_model/1 path:

cd model_repository/vllm_model/1
touch multi_lora.json

The content of multi_lora.json should look like this:

{
    "alpaca": "/vllm_workspace/weights/loras/alpaca-lora-7b",
    "bactrian": "/vllm_workspace/weights/loras/bactrian-x-llama-7b-lora"
}

The key should be the supported LoRA name, and the value should be that adapter's path on your machine.

Warning: if you set enable_lora to true in model.json without creating a multi_lora.json file, the server will throw a FileNotFoundError when initializing.

Step 4: Launch Triton

NOTICE: you must first cd to your vllm_workspace path.
Chunk 145: | |
cd vllm_workspace
tritonserver --model-store ./model_repository

After you start Triton you will see output on the console showing the server starting up and loading the model. When you see output like the following, Triton is ready to accept inference requests.

I1030 22:33:28.291908 1 grpc_server.cc:2513] Started GRPCInferenceService at 0.0.0.0:8001
I1030 22:33:28.292879 1 http_server.cc:4497] Started HTTPService at 0.0.0.0:8000
I1030 22:33:28.335154 1 http_server.cc:270] Started Metrics Service at 0.0.0.0:8002

Step 5: Send a request

A client request script for multi-LoRA was prepared; download the client script from source:

wget https://raw.githubusercontent.com/triton-inference-server/vllm_backend/main/samples/client.py
wget https://raw.githubusercontent.com/triton-inference-server/vllm_backend/main/samples/prompts.txt

Try running this script with the following command:

python3 client.py -l <your_prepared_lora_name>

Here we assume you have prepared the alpaca LoRA weights, thus we use:

python3 client.py -l alpaca

File samples/model_repository/vllm_model/1/model.json

{
    "model": "facebook/opt-125m",
    "disable_log_requests": true,
    "gpu_memory_utilization": 0.5,
    "enforce_eager": true
}

File samples/model_repository/vllm_model/config.pbtxt

Copyright 2023, NVIDIA CORPORATION AFFILIATES.
Chunk 146: | |
All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
Chunk 147: | |
Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |
Chunk 148: | |
Note: You do not need to change any fields in this configuration.

backend: "vllm"

# The usage of device is deferred to the vLLM engine
instance_group [
  {
    count: 1
    kind: KIND_MODEL
  }
]

File samples/client.py

#!/usr/bin/env python3

Copyright 2024, NVIDIA CORPORATION AFFILIATES. All rights reserved.
Chunk 149: | |
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. | |
Chunk 150: | |
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import argparse import asyncio import json import sys import numpy as np import tritonclient.grpc.aio as grpcclient from tritonclient.utils import class LLMClient def __init__self, flags argparse.Namespace self._flags flags self._results_dict def get_triton_clientself try triton_client grpcclient.InferenceServerClient urlself._flags.url, verboseself._flags.verbose, except Exception as e printchannel creation failed stre sys.exit return triton_client async def async_request_iterator self, prompts, sampling_parameters, exclude_input_in_output try for iter in rangeself._flags.iterations for i, prompt in enumerateprompts prompt_id self._flags.offset lenprompts iter i self._results_dictstrprompt_id yield self.create_request prompt, self._flags.streaming_mode, prompt_id, sampling_parameters, exclude_input_in_output, except Exception as error printfCaught an error in the request iterator error async def stream_inferself, prompts, sampling_parameters, exclude_input_in_output try triton_client self.get_triton_client Start streaming response_iterator triton_client.stream_infer inputs_iteratorself.async_request_iterator prompts, sampling_parameters, exclude_input_in_output , stream_timeoutself._flags.stream_timeout, async for response in response_iterator yield response except InferenceServerException as error printerror sys.exit1 async def process_stream self, prompts, sampling_parameters, exclude_input_in_output Clear results in between process_stream calls self.results_dict success True Read response from the stream async for response in self.stream_infer prompts, sampling_parameters, exclude_input_in_output result, error response if error printfEncountered error while processing error success False else output result.as_numpytext_output for i in output self._results_dictresult.get_response.id.appendi return success async def runself Sampling parameters for text generation including temperature, top_p, top_k, max_tokens, early_stopping. | |
Chunk 151: | |
Full list available at httpsgithub.comvllmprojectvllmblob5255d99dc595f9ae7647842242d6542aa4145a4fvllmsampling_params.pyL23 sampling_parameters temperature 0.1, top_p 0.95, max_tokens 100, exclude_input_in_output self._flags.exclude_inputs_in_outputs if self._flags.lora_name is not None sampling_parameterslora_name self._flags.lora_name with openself._flags.input_prompts, r as file printfLoading inputs from self._flags.input_prompts... prompts file.readlines success await self.process_stream prompts, sampling_parameters, exclude_input_in_output with openself._flags.results_file, w as file for id in self._results_dict.keys for result in self._results_dictid file.writeresult.decodeutf8 file.writen file.writennn printfStoring results into self._flags.results_file... if self._flags.verbose with openself._flags.results_file, r as file printfnContents of self._flags.results_file printfile.read if success printPASS vLLM example else printFAIL vLLM example def run_asyncself asyncio.runself.run def create_request self, prompt, stream, request_id, sampling_parameters, exclude_input_in_output, send_parameters_as_tensorTrue, inputs prompt_data np.arrayprompt.encodeutf8, dtypenp.object_ try inputs.appendgrpcclient.InferInputtext_input, 1, BYTES inputs1.set_data_from_numpyprompt_data except Exception as error printfEncountered an error during request creation error stream_data np.arraystream, dtypebool inputs.appendgrpcclient.InferInputstream, 1, BOOL inputs1.set_data_from_numpystream_data Request parameters are not yet supported via BLS. | |
Chunk 152: | |
Provide an optional mechanism to send serialized parameters as an input tensor until support is added if send_parameters_as_tensor sampling_parameters_data np.array json.dumpssampling_parameters.encodeutf8, dtypenp.object_ inputs.appendgrpcclient.InferInputsampling_parameters, 1, BYTES inputs1.set_data_from_numpysampling_parameters_data inputs.appendgrpcclient.InferInputexclude_input_in_output, 1, BOOL inputs1.set_data_from_numpynp.arrayexclude_input_in_output, dtypebool Add requested outputs outputs outputs.appendgrpcclient.InferRequestedOutputtext_output Issue the asynchronous sequence inference. return model_name self._flags.model, inputs inputs, outputs outputs, request_id strrequest_id, parameters sampling_parameters, if __name__ __main__ parser argparse.ArgumentParser parser.add_argument m, model, typestr, requiredFalse, defaultvllm_model, helpModel name, parser.add_argument v, verbose, actionstore_true, requiredFalse, defaultFalse, helpEnable verbose output, parser.add_argument u, url, typestr, requiredFalse, defaultlocalhost8001, helpInference server URL and its gRPC port. Default is localhost8001., parser.add_argument t, streamtimeout, typefloat, requiredFalse, defaultNone, helpStream timeout in seconds. | |
Chunk 153: | |
Default is None., parser.add_argument offset, typeint, requiredFalse, default0, helpAdd offset to request IDs used, parser.add_argument inputprompts, typestr, requiredFalse, defaultprompts.txt, helpText file with input prompts, parser.add_argument resultsfile, typestr, requiredFalse, defaultresults.txt, helpThe file with output results, parser.add_argument iterations, typeint, requiredFalse, default1, helpNumber of iterations through the prompts file, parser.add_argument s, streamingmode, actionstore_true, requiredFalse, defaultFalse, helpEnable streaming mode, parser.add_argument excludeinputsinoutputs, actionstore_true, requiredFalse, defaultFalse, helpExclude prompt from outputs, parser.add_argument l, loraname, typestr, requiredFalse, defaultNone, helpThe querying LoRA name, FLAGS parser.parse_args client LLMClientFLAGS client.run_async File samplesprompts.txt Hello, my name is The most dangerous animal is The capital of France is The future of AI is File srcutilsmetrics.py Copyright 2024, NVIDIA CORPORATION AFFILIATES. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. | |
Chunk 154: | |
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. | |
Chunk 155: | |
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import queue import threading from typing import Dict, List, Union import triton_python_backend_utils as pb_utils from vllm.engine.metrics import StatLoggerBase as VllmStatLoggerBase from vllm.engine.metrics import Stats as VllmStats from vllm.engine.metrics import SupportsMetricsInfo, build_1_2_5_buckets class TritonMetrics def __init__self, labels Liststr, max_model_len int Initialize metric families Iteration stats self.counter_prompt_tokens_family pb_utils.MetricFamily namevllmprompt_tokens_total, descriptionNumber of prefill tokens processed., kindpb_utils.MetricFamily.COUNTER, self.counter_generation_tokens_family pb_utils.MetricFamily namevllmgeneration_tokens_total, descriptionNumber of generation tokens processed., kindpb_utils.MetricFamily.COUNTER, self.histogram_time_to_first_token_family pb_utils.MetricFamily namevllmtime_to_first_token_seconds, descriptionHistogram of time to first token in seconds., kindpb_utils.MetricFamily.HISTOGRAM, self.histogram_time_per_output_token_family pb_utils.MetricFamily namevllmtime_per_output_token_seconds, descriptionHistogram of time per output token in seconds., kindpb_utils.MetricFamily.HISTOGRAM, Request stats Latency self.histogram_e2e_time_request_family pb_utils.MetricFamily namevllme2e_request_latency_seconds, descriptionHistogram of end to end request latency in seconds., kindpb_utils.MetricFamily.HISTOGRAM, Metadata self.histogram_num_prompt_tokens_request_family pb_utils.MetricFamily namevllmrequest_prompt_tokens, descriptionNumber of prefill tokens processed., kindpb_utils.MetricFamily.HISTOGRAM, self.histogram_num_generation_tokens_request_family pb_utils.MetricFamily namevllmrequest_generation_tokens, descriptionNumber of generation tokens processed., kindpb_utils.MetricFamily.HISTOGRAM, self.histogram_n_request_family pb_utils.MetricFamily namevllmrequest_params_n, descriptionHistogram of the n request parameter., kindpb_utils.MetricFamily.HISTOGRAM, Initialize metrics Iteration stats self.counter_prompt_tokens self.counter_prompt_tokens_family.Metric labelslabels self.counter_generation_tokens self.counter_generation_tokens_family.Metric labelslabels Use the same bucket boundaries from vLLM sample metrics as an example. 
httpsgithub.comvllmprojectvllmblob21313e09e3f9448817016290da20d0db1adf3664vllmenginemetrics.pyL81L96 self.histogram_time_to_first_token self.histogram_time_to_first_token_family.Metric labelslabels, buckets 0.001, 0.005, 0.01, 0.02, 0.04, 0.06, 0.08, 0.1, 0.25, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0, , self.histogram_time_per_output_token self.histogram_time_per_output_token_family.Metric labelslabels, buckets 0.01, 0.025, 0.05, 0.075, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 0.75, 1.0, 2.5, , Request stats Latency self.histogram_e2e_time_request self.histogram_e2e_time_request_family.Metric labelslabels, buckets1.0, 2.5, 5.0, 10.0, 15.0, 20.0, 30.0, 40.0, 50.0, 60.0, Metadata self.histogram_num_prompt_tokens_request self.histogram_num_prompt_tokens_request_family.Metric labelslabels, bucketsbuild_1_2_5_bucketsmax_model_len, self.histogram_num_generation_tokens_request self.histogram_num_generation_tokens_request_family.Metric labelslabels, bucketsbuild_1_2_5_bucketsmax_model_len, self.histogram_n_request self.histogram_n_request_family.Metric labelslabels, buckets1, 2, 5, 10, 20, class VllmStatLoggerVllmStatLoggerBase StatLogger is used as an adapter between vLLM stats collector and Triton metrics provider. | |
Chunk 156: | |
def __init__self, labels Dict, max_model_len int, log_logger None Tracked stats over current local logging interval. local_interval not used here. Its for vLLM logs to stdout. | |
Chunk 157: | |
super.__init__local_interval0 self.metrics TritonMetricslabels, max_model_len self.log_logger log_logger Starting the metrics thread. It allows vLLM to keep making progress while reporting metrics to triton metrics service. self._logger_queue queue.Queue self._logger_thread threading.Threadtargetself.logger_loop self._logger_thread.start def infoself, type str, obj SupportsMetricsInfo None pass def _log_counterself, counter, data Unionint, float None Convenience function for logging to counter. | |
Chunk 158: | |
Args counter A counter metric instance. data An int or float to increment the count metric. Returns None if data ! | |
Chunk 159: | |
0 self._logger_queue.put_nowaitcounter, increment, data def _log_histogramself, histogram, data UnionListint, Listfloat None Convenience function for logging list to histogram. Args histogram A histogram metric instance. data A list of int or float data to observe into the histogram metric. | |
Chunk 160: | |
Returns None for datum in data self._logger_queue.put_nowaithistogram, observe, datum def logself, stats VllmStats None Report stats to Triton metrics server. Args stats Created by LLMEngine for use by VllmStatLogger. Returns None The list of vLLM metrics reporting to Triton is also documented here. | |
Chunk 161: | |
httpsgithub.comtritoninferenceservervllm_backendblobmainREADME.mdtritonmetrics counter_metrics self.metrics.counter_prompt_tokens, stats.num_prompt_tokens_iter, self.metrics.counter_generation_tokens, stats.num_generation_tokens_iter, histogram_metrics self.metrics.histogram_time_to_first_token, stats.time_to_first_tokens_iter, , self.metrics.histogram_time_per_output_token, stats.time_per_output_tokens_iter, , self.metrics.histogram_e2e_time_request, stats.time_e2e_requests, self.metrics.histogram_num_prompt_tokens_request, stats.num_prompt_tokens_requests, , self.metrics.histogram_num_generation_tokens_request, stats.num_generation_tokens_requests, , self.metrics.histogram_n_request, stats.n_requests, for metric, data in counter_metrics self._log_countermetric, data for metric, data in histogram_metrics self._log_histogrammetric, data def logger_loopself while True item self._logger_queue.get To signal shutdown a None item will be added to the queue. if item is None break metric, command, data item if command increment metric.incrementdata elif command observe metric.observedata else self.log_logger.log_errorfUndefined command name command def finalizeself Shutdown the logger thread. self._logger_queue.putNone if self._logger_thread is not None self._logger_thread.join self._logger_thread None File srcmodel.py Copyright 20232024, NVIDIA CORPORATION AFFILIATES. | |
Chunk 162: | |
All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
Chunk 163: | |
Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |
Chunk 164: | |
import asyncio import base64 import gc import json import os import queue import threading from io import BytesIO from typing import Dict, List import numpy as np import torch import triton_python_backend_utils as pb_utils from PIL import Image from vllm.engine.arg_utils import AsyncEngineArgs from vllm.entrypoints.openai.api_server import build_async_engine_client_from_engine_args, from vllm.lora.request import LoRARequest from vllm.sampling_params import SamplingParams from vllm.utils import random_uuid from utils.metrics import VllmStatLogger _VLLM_ENGINE_ARGS_FILENAME model.json _MULTI_LORA_ARGS_FILENAME multi_lora.json class TritonPythonModel classmethod def auto_complete_configcls, auto_complete_model_config Add inputsoutputs to the model config. cls._auto_complete_inputs_and_outputsauto_complete_model_config We need to use decoupled transaction policy for saturating vLLM engine for max throughtput. TODO DLIS5233 Allow asynchronous execution to lift this restriction for cases there is exactly a single response to a single request. | |
Chunk 165: | |
auto_complete_model_config.set_model_transaction_policydictdecoupledTrue Disabling batching in Triton, let vLLM handle the batching on its own. auto_complete_model_config.set_max_batch_size0 return auto_complete_model_config staticmethod def _auto_complete_inputs_and_outputsauto_complete_model_config Inputs expected by the backend. inputs name text_input, data_type TYPE_STRING, dims 1, name image, data_type TYPE_STRING, dims 1, can be multiple images as separate elements optional True, , name stream, data_type TYPE_BOOL, dims 1, optional True, , name sampling_parameters, data_type TYPE_STRING, dims 1, optional True, , name exclude_input_in_output, data_type TYPE_BOOL, dims 1, optional True, , name return_finish_reason, data_type TYPE_BOOL, dims 1, optional True, , name return_cumulative_logprob, data_type TYPE_BOOL, dims 1, optional True, , name return_logprobs, data_type TYPE_BOOL, dims 1, optional True, , name return_num_input_tokens, data_type TYPE_BOOL, dims 1, optional True, , name return_num_output_tokens, data_type TYPE_BOOL, dims 1, optional True, , Outputs expected by the backend. | |
Chunk 166: | |
outputs name text_output, data_type TYPE_STRING, dims 1, name finish_reason, data_type TYPE_STRING, dims 1, name cumulative_logprob, data_type TYPE_FP32, dims 1, name logprobs, data_type TYPE_STRING, dims 1, name num_input_tokens, data_type TYPE_UINT32, dims 1, name num_output_tokens, data_type TYPE_UINT32, dims 1, Collect input and output names from the provided model config. config auto_complete_model_config.as_dict input_names output_names for input in configinput input_names.appendinputname for output in configoutput output_names.appendoutputname Add missing inputs and outputs to the model config. for input in inputs if inputname not in input_names auto_complete_model_config.add_inputinput for output in outputs if outputname not in output_names auto_complete_model_config.add_outputoutput def initializeself, args self.args args self.logger pb_utils.Logger self.model_config json.loadsargsmodel_config output_config pb_utils.get_output_config_by_name self.model_config, text_output self.output_dtype pb_utils.triton_string_to_numpyoutput_configdata_type Setup vLLM engine health check self._enable_health_check self._get_bool_config_param ENABLE_VLLM_HEALTH_CHECK self._is_healthy True Initialize engine arguments TODO Move this into _init_engine, after moving check metrics enabled. | |
Chunk 167: | |
self._init_engine_args Check if metrics are enabled. The ZMQ process cannot be used when metrics are enabled. TODO Move the check into _setup_metrics. | |
Chunk 168: | |
self._enable_metrics self._get_bool_config_paramREPORT_CUSTOM_METRICS and not self._aync_engine_args.disable_log_stats Starting the vLLM engine and its event thread running the AsyncIO event loop. self._init_engine Setup vLLM metrics self._setup_metrics Starting the response thread. It allows vLLM to keep making progress while response senders are sending responses to server frontend. | |
Chunk 169: | |
self._response_queue queue.Queue self._response_thread threading.Threadtargetself._response_loop self._response_thread.start def _init_engine_argsself Currently, Triton needs to use decoupled policy for asynchronously forwarding requests to vLLM engine, so assert it. self.using_decoupled pb_utils.using_decoupled_model_transaction_policy self.model_config assert self.using_decoupled , vLLM Triton backend must be configured to use decoupled model transaction policy engine_args_filepath os.path.join pb_utils.get_model_dir, _VLLM_ENGINE_ARGS_FILENAME assert os.path.isfile engine_args_filepath , f_VLLM_ENGINE_ARGS_FILENAME containing vllm engine args must be provided in pb_utils.get_model_dir with openengine_args_filepath as file self.vllm_engine_config json.loadfile Validate device and multiprocessing settings are currently set based on modelconfigs. self._validate_device_config Check for LoRA config and set it up if enabled self._setup_lora Create an AsyncEngineArgs from the config from JSON self._aync_engine_args AsyncEngineArgsself.vllm_engine_config def _init_engineself Run the engine in a separate thread running the AsyncIO event loop. | |
Chunk 170: | |
self._llm_engine None self._llm_engine_start_cv threading.Condition self._llm_engine_shutdown_event asyncio.Event self._event_thread threading.Thread targetasyncio.run, argsself._run_llm_engine, self._event_thread.start with self._llm_engine_start_cv while self._llm_engine is None self._llm_engine_start_cv.wait The threading.Thread will not raise the exception here should the engine failed to start, so the exception is passed back via the engine variable. if isinstanceself._llm_engine, Exception e self._llm_engine self.logger.log_errorfvllm Failed to start engine e if self._event_thread is not None self._event_thread.join self._event_thread None raise e async def _run_llm_engineself Counter to keep track of ongoing request counts. self._ongoing_request_count 0 try Start the vLLM engine. | |
Chunk 171: | |
The engine lives for the scope of this with statement. TODO Metrics should work with ZMQ enabled. async with build_async_engine_client_from_engine_args engine_argsself._aync_engine_args, disable_frontend_multiprocessingself._enable_metrics, as engine Capture the engine event loop and make it visible to other threads. | |
Chunk 172: | |
self._event_loop asyncio.get_running_loop Signal the engine is started and make it visible to other threads. with self._llm_engine_start_cv self._llm_engine engine self._llm_engine_start_cv.notify_all Wait for the engine shutdown signal. await self._llm_engine_shutdown_event.wait Wait for the ongoing requests to complete. | |
Chunk 173: | |
while self._ongoing_request_count 0 self.logger.log_info vllm Awaiting remaining requests.format self._ongoing_request_count await asyncio.sleep1 Cancel all tasks in the event loop. for task in asyncio.all_tasksloopself._event_loop if task is not asyncio.current_task task.cancel except Exception as e Signal and pass the exception back via the engine variable if the engine failed to start. If the engine has started, reraise the exception. | |
Chunk 174: | |
with self._llm_engine_start_cv if self._llm_engine is None self._llm_engine e self._llm_engine_start_cv.notify_all return raise e self._llm_engine None self.logger.log_infovllm Shutdown complete def _validate_device_configself triton_kind self.argsmodel_instance_kind triton_device_id intself.argsmodel_instance_device_id triton_instance fself.argsmodel_name_triton_device_id Tritons current definition of KIND_GPU makes assumptions that models only use a single GPU. For multiGPU models, the recommendation is to specify KIND_MODEL to acknowledge that the model will take control of the devices made available to it. NOTE Consider other parameters that would indicate multiGPU in the future. | |
Chunk 175: | |
tp_size intself.vllm_engine_config.gettensor_parallel_size, 1 if tp_size 1 and triton_kind GPU raise ValueError KIND_GPU is currently for singleGPU models, please specify KIND_MODEL in the models config.pbtxt for multiGPU models If KIND_GPU is specified, specify the device ID assigned by Triton to ensure that multiple model instances do not oversubscribe the same default device. if triton_kind GPU and triton_device_id 0 self.logger.log_info fDetected KIND_GPU model instance, explicitly setting GPU devicetriton_device_id for triton_instance vLLM doesnt currently v0.4.2 expose device selection in the APIs torch.cuda.set_devicetriton_device_id def _setup_loraself self.enable_lora False Check if enable_lora field is in the model.json, and if it is, read its contents, which can be string or bool. if enable_lora in self.vllm_engine_config.keys and strself.vllm_engine_configenable_lora.lower true create Triton LoRA weights repository multi_lora_args_filepath os.path.join pb_utils.get_model_dir, _MULTI_LORA_ARGS_FILENAME try with openmulti_lora_args_filepath as lora_file lora_repository Dictstr, str json.loadlora_file self.lora_repository lora_repository self.supported_loras Liststr listself.lora_repository.keys self.supported_loras_len lenself.supported_loras self.enable_lora True except FileNotFoundError raise FileNotFoundError fTriton backend cannot find multi_lora_args_filepath. | |
Chunk 176: | |
def _setup_metricsself self._vllm_metrics None TODO Do not read metrics directly from the vLLM engine, read from prometheus client to allow the use of ZMQ process when metrics are enabled. See httpsgithub.comvllmprojectvllmblobv0.6.3.post1vllmentrypointsopenaiapi_server.pyL222L245 if self._enable_metrics try labels model self.argsmodel_name, version self.argsmodel_version, Add vLLM custom metrics engine_config self._llm_engine.engine.model_config self._vllm_metrics VllmStatLogger labels, engine_config.max_model_len, self.logger self._llm_engine.add_loggertriton, self._vllm_metrics except pb_utils.TritonModelException as e if metrics not supported in stre Metrics are disabled at the server self.logger.log_infovllm Metrics not supported else raise e def _get_bool_config_paramself, param_name str bool return param_name in self.model_configparameters and self.model_configparametersparam_namestring_value.lower true def _response_loopself while True item self._response_queue.get To signal shutdown a None item will be added to the queue. if item is None break response_state, response, response_flag item response_sender response_stateresponse_sender try response_sender.sendresponse, response_flag Stop checking for cancellation if the last response is generated. | |
Chunk 177: | |
if not response_statelast_response_generated response_stateis_cancelled response_sender.is_cancelled except Exception as e self.logger.log_error fAn error occurred while sending a response e finally if response_flag pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL self._ongoing_request_count 1 def executeself, requests if self._enable_health_check and not self._check_healthrequests return None for request in requests request self._verify_lorasrequest if request is not None assert self._llm_engine_shutdown_event.is_set is False , Cannot create tasks after shutdown has been requested coro self._generaterequest asyncio.run_coroutine_threadsafecoro, self._event_loop return None async def _generateself, request response_sender request.get_response_sender response_state response_sender response_sender, is_cancelled False, last_response_generated False, last response ready but not yet sent self._ongoing_request_count 1 decrement_ongoing_request_count True try request_id random_uuid prompt, stream, prepend_input, parameters, additional_outputs, self._get_input_tensorsrequest sampling_params_dict self._get_sampling_params_dictparameters lora_name sampling_params_dict.poplora_name, None sampling_params SamplingParamssampling_params_dict lora_request None if lora_name is not None lora_id strself.supported_loras.indexlora_name 1 lora_int_id intlora_id lora_local_path self.lora_repositorylora_name lora_request LoRARequestlora_id, lora_int_id, lora_local_path response_iterator self._llm_engine.generate prompt, sampling_params, request_id, lora_requestlora_request request_output_state async for request_output in response_iterator Cancellation state will be checked by the response loop and written to the response state if streaming. If not streaming, cancellation state needs to be checked here. is_cancelled response_stateis_cancelled if not stream is_cancelled response_sender.is_cancelled if is_cancelled self.logger.log_infovllm Cancelling the request await self._llm_engine.abortrequest_id self.logger.log_infovllm Successfully cancelled the request if stream Add cancelled final response to response loop. | |
Chunk 178: | |
response_statelast_response_generated True response pb_utils.InferenceResponse errorpb_utils.TritonError messageRequest was cancelled, codepb_utils.TritonError.CANCELLED, flags pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL decrement_ongoing_request_count False self._response_queue.put_nowait response_state, response, flags break Send each response if streaming. if stream response self._create_response request_output_state, request_output, prepend_inputFalse, additional_outputsadditional_outputs, flags 0 if request_output.finished response_statelast_response_generated True flags pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL decrement_ongoing_request_count False self._response_queue.put_nowaitresponse_state, response, flags Send the last response which contains all the outputs if not streaming. if not stream response_sender.send self._create_response request_output_state, request_outputrequest_output, prepend_inputprepend_input, additional_outputsadditional_outputs, , flagspb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL, except Exception as e self.logger.log_errorfvllm Error generating stream e error pb_utils.TritonErrorfError generating stream e text_output_tensor pb_utils.Tensor text_output, np.asarrayNA, dtypeself.output_dtype response pb_utils.InferenceResponse output_tensorstext_output_tensor, errorerror response_sender.send response, flagspb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL raise e finally if decrement_ongoing_request_count self._ongoing_request_count 1 def _get_input_tensorsself, request prompt prompt pb_utils.get_input_tensor_by_namerequest, text_input.as_numpy0 if isinstanceprompt, bytes prompt prompt.decodeutf8 image images pb_utils.get_input_tensor_by_namerequest, image if images images_vllm for image_np in images.as_numpy image_b base64.b64decodeimage_np.decodeutf8 image_rgb Image.openBytesIOimage_b.convertRGB images_vllm.appendimage_rgb if lenimages_vllm 0 prompt prompt prompt, multi_modal_data image images_vllm, stream stream pb_utils.get_input_tensor_by_namerequest, stream if stream stream stream.as_numpy0 else stream False prepend_input exclude_input_in_output prepend_input pb_utils.get_input_tensor_by_name request, exclude_input_in_output if prepend_input When exclude_input_in_output is False, we want to prepend input prompt to output, thus prepend_input should be True, and vice versa. | |
Chunk 179: | |
prepend_input not prepend_input.as_numpy0 elif prepend_input is None and stream prepend_input False else prepend_input True if prepend_input and stream raise ValueError When streaming, exclude_input_in_output False is not allowed. parameters sampling_parameters An alternative mechanism to receive serialized parameters as an input tensor, because request parameters are not yet supported via BLS. sampling_parameters pb_utils.get_input_tensor_by_name request, sampling_parameters if sampling_parameters parameters sampling_parameters.as_numpy0.decodeutf8 else parameters request.parameters additional outputs additional_outputs return_finish_reason None, return_cumulative_logprob None, return_logprobs None, return_num_input_tokens None, return_num_output_tokens None, for tensor_name in additional_outputs.keys tensor pb_utils.get_input_tensor_by_namerequest, tensor_name if tensor tensor booltensor.as_numpy0 else tensor False additional_outputstensor_name tensor return prompt, stream, prepend_input, parameters, additional_outputs def _create_response self, request_output_state, request_output, prepend_input, additional_outputs output_tensors text_output prepend_prompt if prev_lens_text_output not in request_output_state this is the first response if prepend_input prepend_prompt request_output.prompt request_output_stateprev_lens_text_output 0 len request_output.outputs prev_lens request_output_stateprev_lens_text_output text_output prepend_prompt output.textprev_len.encodeutf8 for output, prev_len in ziprequest_output.outputs, prev_lens request_output_stateprev_lens_text_output lenoutput.text for output in request_output.outputs output_tensors.append pb_utils.Tensor text_output, np.asarraytext_output, dtypeself.output_dtype finish_reason if additional_outputsreturn_finish_reason finish_reason stroutput.finish_reason for output in request_output.outputs output_tensors.append pb_utils.Tensor finish_reason, np.asarrayfinish_reason, dtypenp.object_ cumulative_logprob if additional_outputsreturn_cumulative_logprob cumulative_logprob output.cumulative_logprob for output in request_output.outputs output_tensors.append pb_utils.Tensor cumulative_logprob, np.asarraycumulative_logprob, dtypenp.float32, logprobs httpsgithub.comvllmprojectvllmblobv0.6.3.post1vllmsequence.pyL37L58 if additional_outputsreturn_logprobs if prev_lens_logprobs not in request_output_state request_output_stateprev_lens_logprobs 0 len request_output.outputs logprobs for i in rangelenrequest_output.outputs output request_output.outputsi if output.logprobs is None logprobs.appendnull.encodeutf8 continue prev_len request_output_stateprev_lens_logprobsi request_output_stateprev_lens_logprobsi lenoutput.logprobs logprobs_py for logprob_d_vllm in output.logprobsprev_len logprob_d_py for token_id, logprob_vllm in logprob_d_vllm.items logprob_d_pytoken_id logprob logprob_vllm.logprob, rank logprob_vllm.rank, decoded_token logprob_vllm.decoded_token, logprobs_py.appendlogprob_d_py logprobs.appendjson.dumpslogprobs_py.encodeutf8 output_tensors.append pb_utils.Tensorlogprobs, np.asarraylogprobs, dtypenp.object_ num_input_tokens if additional_outputsreturn_num_input_tokens num_input_tokens lenrequest_output.prompt_token_ids output_tensors.append pb_utils.Tensor num_input_tokens, np.asarraynum_input_tokens, dtypenp.uint32 num_output_tokens if additional_outputsreturn_num_output_tokens if prev_lens_num_output_tokens not in request_output_state request_output_stateprev_lens_num_output_tokens 0 len request_output.outputs prev_lens 
request_output_stateprev_lens_num_output_tokens num_output_tokens lenoutput.token_ids prev_len for output, prev_len in ziprequest_output.outputs, prev_lens request_output_stateprev_lens_num_output_tokens lenoutput.token_ids for output in request_output.outputs output_tensors.append pb_utils.Tensor num_output_tokens, np.asarraynum_output_tokens, dtypenp.uint32 return pb_utils.InferenceResponseoutput_tensorsoutput_tensors def _get_sampling_params_dictself, params_json params_dict json.loadsparams_json Special parsing for the supported sampling parameters bool_keys ignore_eos, skip_special_tokens, use_beam_search for k in bool_keys if k in params_dict params_dictk boolparams_dictk float_keys frequency_penalty, length_penalty, presence_penalty, temperature, top_p, for k in float_keys if k in params_dict params_dictk floatparams_dictk int_keys best_of, max_tokens, min_tokens, n, top_k for k in int_keys if k in params_dict params_dictk intparams_dictk return params_dict def _verify_lorasself, request We will check if the requested lora exists here, if not we will send a response with LoRA not found information. | |
Chunk 180: | |
In this way we may avoid further processing. verified_request None lora_error None lora_name None parameters_input_tensor pb_utils.get_input_tensor_by_name request, sampling_parameters if parameters_input_tensor parameters parameters_input_tensor.as_numpy0.decodeutf8 sampling_params_dict self._get_sampling_params_dictparameters lora_name sampling_params_dict.poplora_name, None if lora_name is not None if not self.enable_lora lora_error pb_utils.TritonErrorLoRA feature is not enabled. self.logger.log_info vllm LoRA is not enabled, please restart the backend with LoRA enabled. | |
Chunk 181: | |
elif lora_name not in self.supported_loras lora_error pb_utils.TritonError fLoRA lora_name is not supported, we currently support self.supported_loras self.logger.log_infofvllm LoRA lora_name not found. if lora_error is not None output_tensor pb_utils.Tensor text_output, np.asarrayError Unsupported LoRA., dtypeself.output_dtype, response pb_utils.InferenceResponse output_tensorsoutput_tensor, errorlora_error response_sender request.get_response_sender response_sender.send response, flagspb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL else verified_request request return verified_request def _check_healthself, requests coro self._llm_engine.check_health future asyncio.run_coroutine_threadsafecoro, self._event_loop try future.result except Exception as e self.logger.log_error fvllm Engine is not healthy and model will be unloaded e pb_utils.unload_modelself.model_configname nonblocking self._is_healthy False if not self._is_healthy for request in requests request.get_response_sender.send pb_utils.InferenceResponse errorpb_utils.TritonError messageModel is unavailable due to unhealthy vLLM engine, codepb_utils.TritonError.UNAVAILABLE, , flagspb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL, return self._is_healthy def finalizeself self.logger.log_infovllm Issuing finalize to vllm backend self._event_loop.call_soon_threadsafeself._llm_engine_shutdown_event.set Shutdown the event thread. if self._event_thread is not None self._event_thread.join self._event_thread None Shutdown the response thread. | |
Chunk 182: | |
self._response_queue.putNone if self._response_thread is not None self._response_thread.join self._response_thread None Shutdown the metrics thread. if self._vllm_metrics is not None self._vllm_metrics.finalize When using parallel tensors, the stub process may not shutdown due to unreleased references, so manually run the garbage collector once. self.logger.log_infovllm Running Garbage Collector on finalize... | |
Chunk 183: | |
gc.collect self.logger.log_infovllm Garbage Collector on finalize... done File .gitignore Bytecompiled optimized DLL files __pycache__ .pycod py.class C extensions .so Distribution packaging .Python build developeggs dist downloads eggs .eggs lib lib64 parts sdist var wheels sharepythonwheels .egginfo .installed.cfg .egg MANIFEST PyInstaller Usually these files are written by a python script from a template before PyInstaller builds the exe, so as to inject dateother infos into it. .manifest .spec Installer logs piplog.txt pipdeletethisdirectory.txt Unit test coverage reports htmlcov .tox .nox .coverage .coverage. | |
Chunk 184: | |
.cache nosetests.xml coverage.xml .cover .py,cover .hypothesis .pytest_cache cover .out Translations .mo .pot Django stuff .log local_settings.py db.sqlite3 db.sqlite3journal Flask stuff instance .webassetscache Scrapy stuff .scrapy Sphinx documentation docs_build PyBuilder .pybuilder target Jupyter Notebook .ipynb_checkpoints IPython profile_default ipython_config.py pyenv For a library or package, you might want to ignore these files since the code is intended to run in multiple environments otherwise, check them in .pythonversion pipenv According to pypapipenv598, it is recommended to include Pipfile.lock in version control. However, in case of collaboration, if having platformspecific dependencies or dependencies having no crossplatform support, pipenv may install dependencies that dont work, or not install all needed dependencies. Pipfile.lock poetry Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control. | |
Chunk 185: | |
This is especially recommended for binary packages to ensure reproducibility, and is more commonly ignored for libraries. httpspythonpoetry.orgdocsbasicusagecommityourpoetrylockfiletoversioncontrol poetry.lock pdm Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control. pdm.lock pdm stores projectwide configurations in .pdm.toml, but it is recommended to not include it in version control. | |
Chunk 186: | |
httpspdm.fming.devusewithide .pdm.toml PEP 582 used by e.g. github.comDavidOConnorpyflow and github.compdmprojectpdm __pypackages__ Celery stuff celerybeatschedule celerybeat.pid SageMath parsed files .sage.py Environments .env .venv env venv ENV env.bak venv.bak Spyder project settings .spyderproject .spyproject Rope project settings .ropeproject mkdocs documentation site mypy .mypy_cache .dmypy.json dmypy.json Pyre type checker .pyre pytype static type analyzer .pytype Cython debug symbols cython_debug Test result files Miniconda miniconda vllm_env.tar.gz triton_python_backend_stub python_backend results.txt .log PyCharm JetBrains specific template is maintained in a separate JetBrains.gitignore that can be found at httpsgithub.comgithubgitignoreblobmainGlobalJetBrains.gitignore and can be added to the global gitignore or merged into this file. For a more nuclear option not recommended you can uncomment the following to ignore the entire idea folder. | |
Chunk 187: | |
.idea File .precommitconfig.yaml Copyright 2023, NVIDIA CORPORATION AFFILIATES. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. | |
Chunk 188: | |
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. | |
Chunk 189: | |
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. repos repo httpsgithub.comtimothycrosleyisort rev 5.12.0 hooks id isort additional_dependencies toml repo httpsgithub.compsfblack rev 23.1.0 hooks id black types_or python, cython repo httpsgithub.comPyCQAflake8 rev 5.0.4 hooks id flake8 args maxlinelength88, selectC,E,F,W,B,B950, extendignore E203,E501 types_or python, cython repo httpsgithub.comprecommitmirrorsclangformat rev v16.0.5 hooks id clangformat types_or c, c, cuda, proto, textproto, java args fallbackstylenone, stylefile, i repo httpsgithub.comcodespellprojectcodespell rev v2.2.4 hooks id codespell additional_dependencies tomli args toml, pyproject.toml exclude ?x.stemmer..stop_words.CHANGELOG.md More details about these precommit hooks here httpsprecommit.comhooks.html repo httpsgithub.comprecommitprecommithooks rev v4.4.0 hooks id checkcaseconflict id checkexecutableshaveshebangs id checkmergeconflict id checkjson id checktoml id checkyaml exclude deploytemplates. id checkshebangscriptsareexecutable id endoffilefixer types_or c, c, cuda, proto, textproto, java, python id mixedlineending id requirementstxtfixer id trailingwhitespace File LICENSE BSD 3Clause License Copyright c 2023, Triton Inference Server Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met 1. | |
Chunk 190: | |
Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
Chunk 191: | |
3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. | |
Chunk 192: | |
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. File pyproject.toml Copyright 2023, NVIDIA CORPORATION AFFILIATES. All rights reserved. | |
Chunk 193: | |
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. | |
Chunk 194: | |
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. tool.codespell note precommit passes explicit lists of files here, which this skip file list doesnt override this is only to allow you to run codespell interactively skip ..git,..github ignore short words, and typename parameters like OffsetT ignoreregex b.1,4AZwTb use the clear dictionary for unambiguous spelling mistakes builtin clear disable warnings about binary files and wrong encoding quietlevel 3 tool.isort profile black use_parentheses true multi_line_output 3 include_trailing_comma true force_grid_wrap 0 ensure_newline_before_comments true line_length 88 balanced_wrapping true indent skip build File README.md ! | |
Chunk 195: | |
Copyright 20232024, NVIDIA CORPORATION AFFILIATES. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. | |
Chunk 196: | |
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. | |
Chunk 197: | |
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. [License badge: https://img.shields.io/badge/License-BSD3-lightgrey.svg, linking to https://opensource.org/licenses/BSD-3-Clause] vLLM Backend The Triton backend for vLLM (https://github.com/vllm-project/vllm) is designed to run supported models (https://vllm.readthedocs.io/en/latest/models/supported_models.html) on a vLLM engine (https://github.com/vllm-project/vllm/blob/main/vllm/engine/async_llm_engine.py). You can learn more about Triton backends in the backend repo (https://github.com/triton-inference-server/backend).
Chunk 198: | |
This is a Python-based backend (https://github.com/triton-inference-server/backend/blob/main/docs/python_based_backends.md#python-based-backends). When using this backend, all requests are placed on the vLLM AsyncEngine as soon as they are received. In-flight batching and paged attention are handled by the vLLM engine.
Chunk 199: | |
Where can I ask general questions about Triton and Triton backends? Be sure to read all the information below as well as the general Triton documentation (https://github.com/triton-inference-server/server#triton-inference-server) available in the main server repo (https://github.com/triton-inference-server/server). If you don't find your answer there, you can ask questions on the main Triton issues page (https://github.com/triton-inference-server/server/issues).
Chunk 200: | |
Installing the vLLM Backend There are several ways to install and deploy the vLLM backend. Option 1. Use the Pre-Built Docker Container.
Chunk 201: | |
Pull a tritonserver:<xx.yy>-vllm-python-py3 container with the vLLM backend from the NGC registry (https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver). <xx.yy> is the version of Triton that you want to use. Please note that Triton's vLLM container has been available starting from the 23.10 release.
Chunk 202: | |
docker pull nvcr.io/nvidia/tritonserver:<xx.yy>-vllm-python-py3

Option 2. Build a Custom Container From Source You can follow the steps described in the Building With Docker guide (https://github.com/triton-inference-server/server/blob/main/docs/customization_guide/build.md#building-with-docker) and use the build.py script (https://github.com/triton-inference-server/server/blob/main/build.py). A sample command to build a Triton Server container with all options enabled is shown below.
Chunk 203: | |
Feel free to customize the flags according to your needs. Please use the NGC registry (https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver/tags) to get the latest version of the Triton vLLM container, which corresponds to the latest YY.MM (year.month) Triton release (https://github.com/triton-inference-server/server/releases). YY.MM is the version of Triton.
Chunk 204: | |
export TRITON_CONTAINER_VERSION=<YY.MM>
./build.py -v --enable-logging \
    --enable-stats \
    --enable-tracing \
    --enable-metrics \
    --enable-gpu-metrics \
    --enable-cpu-metrics \
    --enable-gpu \
    --filesystem=gcs \
    --filesystem=s3 \
    --filesystem=azure_storage \
    --endpoint=http \
    --endpoint=grpc \
    --endpoint=sagemaker \
    --endpoint=vertexai \
    --upstream-container-version=${TRITON_CONTAINER_VERSION} \
    --backend=python:r${TRITON_CONTAINER_VERSION} \
    --backend=vllm:r${TRITON_CONTAINER_VERSION} \
    --backend=ensemble

Option 3. Add the vLLM Backend to the Default Triton Container You can install the vLLM backend directly into the NGC Triton container. In this case, please install vLLM first.
Chunk 205: | |
You can do so by running pip install vllm==<vLLM_version>. Then, set up the vLLM backend in the container with the following commands:

mkdir -p /opt/tritonserver/backends/vllm
git clone https://github.com/triton-inference-server/vllm_backend.git /tmp/vllm_backend
cp -r /tmp/vllm_backend/src/* /opt/tritonserver/backends/vllm

Using the vLLM Backend You can see an example model_repository (samples/model_repository) in the samples folder. You can use this as is and change the model by changing the model value in model.json.
Chunk 206: | |
model.json represents a key-value dictionary that is fed to vLLM's AsyncLLMEngine when initializing the model. You can see supported arguments in vLLM's arg_utils.py (https://github.com/vllm-project/vllm/blob/main/vllm/engine/arg_utils.py). Specifically, here (https://github.com/vllm-project/vllm/blob/ee8217e5bee5860469204ee57077a91138c9af02/vllm/engine/arg_utils.py#L11) and here (https://github.com/vllm-project/vllm/blob/ee8217e5bee5860469204ee57077a91138c9af02/vllm/engine/arg_utils.py#L201).
Chunk 207: | |
For multi-GPU support, EngineArgs like tensor_parallel_size can be specified in model.json (samples/model_repository/vllm_model/1/model.json). Note: vLLM greedily consumes up to 90% of the GPU's memory under default settings. The sample model updates this behavior by setting gpu_memory_utilization to 50%.
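As an illustration only (a sketch rather than a verbatim copy of the shipped sample; the model name and values below are placeholder assumptions), a model.json along these lines passes the named EngineArgs straight through to vLLM's AsyncLLMEngine:

{
    "model": "facebook/opt-125m",
    "disable_log_requests": true,
    "gpu_memory_utilization": 0.5,
    "tensor_parallel_size": 2,
    "enforce_eager": true
}

Since the dictionary is handed to the engine as-is, every key must be an argument that vLLM's AsyncEngineArgs actually accepts.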
Chunk 208: | |
You can tweak this behavior using fields like gpu_memory_utilization and other settings in model.json (samples/model_repository/vllm_model/1/model.json). Launching Triton Inference Server Once you have the model repository set up, it is time to launch the Triton server. We will use the pre-built Triton container with the vLLM backend (see Option 1 above) from NGC (https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver) in this example.
Chunk 209: | |
docker run --gpus all -it --net=host --rm -p 8001:8001 --shm-size=1G --ulimit memlock=-1 --ulimit stack=67108864 -v ${PWD}:/work -w /work nvcr.io/nvidia/tritonserver:<xx.yy>-vllm-python-py3 tritonserver --model-repository ./model_repository

Replace <xx.yy> with the version of Triton that you want to use. Note that Triton's vLLM container was first published with the 23.10 release. After you start Triton you will see output on the console showing the server starting up and loading the model.
Chunk 210: | |
When you see output like the following, Triton is ready to accept inference requests.

I1030 22:33:28.291908 1 grpc_server.cc:2513] Started GRPCInferenceService at 0.0.0.0:8001
I1030 22:33:28.292879 1 http_server.cc:4497] Started HTTPService at 0.0.0.0:8000
I1030 22:33:28.335154 1 http_server.cc:270] Started Metrics Service at 0.0.0.0:8002

Sending Your First Inference After you start Triton (https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/getting_started/quickstart.html) with the sample model_repository (samples/model_repository), you can quickly run your first inference request with the generate endpoint (https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_generate.md). Try out the command below.
Chunk 211: | |
curl -X POST localhost:8000/v2/models/vllm_model/generate -d '{"text_input": "What is Triton Inference Server?", "parameters": {"stream": false, "temperature": 0}}'

Upon success, you should see a response from the server like this one:

{"model_name":"vllm_model","model_version":"1","text_output":"What is Triton Inference Server?\n\nTriton Inference Server is a server that is used by many"}

In the samples folder, you can also find a sample client, client.py (samples/client.py), which uses Triton's asyncio gRPC client library (https://github.com/triton-inference-server/client#python-asyncio-support-beta-1) to run inference on Triton. Running the Latest vLLM Version You can check the vLLM version included in Triton Inference Server from the Framework Containers Support Matrix (https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html). Note: the vLLM Triton Inference Server container has been introduced starting from the 23.10 release.
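If you prefer Python over curl, the following minimal sketch streams a single prompt over gRPC, mirroring how the repository's own test utilities build requests. It is not the shipped samples/client.py; it assumes Triton is reachable at localhost:8001 and the model is named vllm_model.

import json
import queue

import numpy as np
import tritonclient.grpc as grpcclient

responses = queue.Queue()

def callback(result, error):
    # Streamed results (or errors) arrive here as the decoupled model produces them.
    responses.put(error if error is not None else result)

prompt = "What is Triton Inference Server?"
inputs = [
    grpcclient.InferInput("text_input", [1], "BYTES"),
    grpcclient.InferInput("stream", [1], "BOOL"),
    grpcclient.InferInput("sampling_parameters", [1], "BYTES"),
]
inputs[0].set_data_from_numpy(np.array([prompt.encode("utf-8")], dtype=np.object_))
inputs[1].set_data_from_numpy(np.array([True], dtype=bool))
inputs[2].set_data_from_numpy(
    np.array([json.dumps({"temperature": 0, "top_p": 1}).encode("utf-8")], dtype=np.object_)
)

with grpcclient.InferenceServerClient("localhost:8001") as client:
    client.start_stream(callback)
    client.async_stream_infer("vllm_model", inputs=inputs, request_id="0")
    client.stop_stream()  # blocks until all in-flight responses have been delivered

while not responses.empty():
    item = responses.get()
    if isinstance(item, Exception):
        raise item
    # With stream=True each response carries only the newly generated text diff.
    print(item.as_numpy("text_output")[0].decode("utf-8"), end="")

With stream set to False, the server instead returns a single response containing the full generation.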
Chunk 212: | |
You can use pip install ... within the container to upgrade the vLLM version. Running Multiple Instances of Triton Server If you are running multiple instances of Triton server with a Python-based backend, you need to specify a different shm-region-prefix-name for each server.
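As a rough illustration (the repository paths, ports, and prefix names below are placeholders, and the prefix is assumed to be passed through the Python backend's backend-config option), two servers on the same host could be started along these lines:

tritonserver --model-repository=/models/server_one \
    --backend-config=python,shm-region-prefix-name=prefix1 \
    --http-port 8000 --grpc-port 8001 --metrics-port 8002

tritonserver --model-repository=/models/server_two \
    --backend-config=python,shm-region-prefix-name=prefix2 \
    --http-port 9000 --grpc-port 9001 --metrics-port 9002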
Chunk 213: | |
See here (https://github.com/triton-inference-server/python_backend#running-multiple-instances-of-triton-server) for more information. Additional vLLM Outputs Additional vLLM outputs may be requested optionally on a per-request basis. See this doc (docs/additional_outputs.md) for more information.
Chunk 214: | |
Triton Metrics Starting with the 24.08 release of Triton, users can now obtain specific vLLM metrics by querying the Triton metrics endpoint (see the complete list of vLLM metrics here: https://docs.vllm.ai/en/latest/serving/metrics.html). This can be accomplished by launching a Triton server in any of the ways described above (ensuring the build code / container is 24.08 or later) and querying the server. Upon receiving a successful response, you can query the metrics endpoint by entering the following:

curl localhost:8002/metrics

vLLM stats are reported by the metrics endpoint in fields that are prefixed with vllm:.
Chunk 215: | |
Triton currently supports reporting of the following metrics from vLLM.

# Number of prefill tokens processed.
counter_prompt_tokens
# Number of generation tokens processed.
Chunk 216: | |
counter_generation_tokens
# Histogram of time to first token in seconds.
histogram_time_to_first_token
# Histogram of time per output token in seconds.
histogram_time_per_output_token
# Histogram of end to end request latency in seconds.
Chunk 217: | |
histogram_e2e_time_request
# Number of prefill tokens processed.
histogram_num_prompt_tokens_request
# Number of generation tokens processed.
histogram_num_generation_tokens_request
# Histogram of the best_of request parameter.
Chunk 218: | |
histogram_best_of_request
# Histogram of the n request parameter.
histogram_n_request

Your output for these fields should look similar to the following:

# HELP vllm:prompt_tokens_total Number of prefill tokens processed.
# TYPE vllm:prompt_tokens_total counter
vllm:prompt_tokens_total{model="vllm_model",version="1"} 10
# HELP vllm:generation_tokens_total Number of generation tokens processed.
Chunk 219: | |
# TYPE vllm:generation_tokens_total counter
vllm:generation_tokens_total{model="vllm_model",version="1"} 16
# HELP vllm:time_to_first_token_seconds Histogram of time to first token in seconds.
# TYPE vllm:time_to_first_token_seconds histogram
vllm:time_to_first_token_seconds_count{model="vllm_model",version="1"} 1
vllm:time_to_first_token_seconds_sum{model="vllm_model",version="1"} 0.03233122825622559
vllm:time_to_first_token_seconds_bucket{model="vllm_model",version="1",le="0.001"} 0
...
vllm:time_to_first_token_seconds_bucket{model="vllm_model",version="1",le="+Inf"} 1
# HELP vllm:time_per_output_token_seconds Histogram of time per output token in seconds.
Chunk 220: | |
# TYPE vllm:time_per_output_token_seconds histogram
vllm:time_per_output_token_seconds_count{model="vllm_model",version="1"} 15
vllm:time_per_output_token_seconds_sum{model="vllm_model",version="1"} 0.04501533508300781
vllm:time_per_output_token_seconds_bucket{model="vllm_model",version="1",le="0.01"} 14
...
vllm:time_per_output_token_seconds_bucket{model="vllm_model",version="1",le="+Inf"} 15
# HELP vllm:e2e_request_latency_seconds Histogram of end to end request latency in seconds.
# TYPE vllm:e2e_request_latency_seconds histogram
vllm:e2e_request_latency_seconds_count{model="vllm_model",version="1"} 1
vllm:e2e_request_latency_seconds_sum{model="vllm_model",version="1"} 0.08686184883117676
vllm:e2e_request_latency_seconds_bucket{model="vllm_model",version="1",le="1"} 1
...
Chunk 221: | |
vllm:e2e_request_latency_seconds_bucket{model="vllm_model",version="1",le="+Inf"} 1
# HELP vllm:request_prompt_tokens Number of prefill tokens processed.
# TYPE vllm:request_prompt_tokens histogram
vllm:request_prompt_tokens_count{model="vllm_model",version="1"} 1
vllm:request_prompt_tokens_sum{model="vllm_model",version="1"} 10
vllm:request_prompt_tokens_bucket{model="vllm_model",version="1",le="1"} 0
...
vllm:request_prompt_tokens_bucket{model="vllm_model",version="1",le="+Inf"} 1
# HELP vllm:request_generation_tokens Number of generation tokens processed.
Chunk 222: | |
# TYPE vllm:request_generation_tokens histogram
vllm:request_generation_tokens_count{model="vllm_model",version="1"} 1
vllm:request_generation_tokens_sum{model="vllm_model",version="1"} 16
vllm:request_generation_tokens_bucket{model="vllm_model",version="1",le="1"} 0
...
vllm:request_generation_tokens_bucket{model="vllm_model",version="1",le="+Inf"} 1
# HELP vllm:request_params_best_of Histogram of the best_of request parameter.
# TYPE vllm:request_params_best_of histogram
vllm:request_params_best_of_count{model="vllm_model",version="1"} 1
vllm:request_params_best_of_sum{model="vllm_model",version="1"} 1
vllm:request_params_best_of_bucket{model="vllm_model",version="1",le="1"} 1
...
Chunk 223: | |
vllm:request_params_best_of_bucket{model="vllm_model",version="1",le="+Inf"} 1
# HELP vllm:request_params_n Histogram of the n request parameter.
# TYPE vllm:request_params_n histogram
vllm:request_params_n_count{model="vllm_model",version="1"} 1
vllm:request_params_n_sum{model="vllm_model",version="1"} 1
vllm:request_params_n_bucket{model="vllm_model",version="1",le="1"} 1
...
vllm:request_params_n_bucket{model="vllm_model",version="1",le="+Inf"} 1

To enable the vLLM engine to collect metrics, the disable_log_stats option needs to be either set to false or left empty (false by default) in model.json (https://github.com/triton-inference-server/vllm_backend/blob/main/samples/model_repository/vllm_model/1/model.json).
Chunk 224: | |
"disable_log_stats": false

Note: vLLM metrics are not reported to the Triton metrics server by default due to potential performance slowdowns. To enable a vLLM model's metrics reporting, please add the following lines to its config.pbtxt as well.

parameters: {
  key: "REPORT_CUSTOM_METRICS"
  value: {
    string_value: "true"
  }
}

vLLM Engine Health Check (BETA) vLLM Engine Health Check may be enabled optionally, for more accurate model state reported by the server.
Chunk 225: | |
See this doc (docs/health_check.md) for more information. Referencing the Tutorial You can read further in the vLLM Quick Deploy guide (https://github.com/triton-inference-server/tutorials/tree/main/Quick_Deploy/vLLM) in the tutorials repository (https://github.com/triton-inference-server/tutorials).
name CodeQL on pull_request jobs analyze name Analyze runson ubuntulatest permissions actions read contents read securityevents write strategy failfast false matrix language python CodeQL supports cpp, csharp, go, java, javascript, python, ruby Learn more about CodeQL language support at httpsaka.mscodeqldocslanguagesupport steps name Checkout repository uses actionscheckoutv3 Initializes the CodeQL tools for scanning. name Initialize CodeQL uses githubcodeqlactioninitv2 with languages matrix.language If you wish to specify custom queries, you can do so here or in a config file. By default, queries listed here will override any specified in a config file. | |
--- | |
Prefix the list here with to use these queries and those in the config file. Details on CodeQLs query packs refer to httpsdocs.github.comencodesecuritycodescanningautomaticallyscanningyourcodeforvulnerabilitiesanderrorsconfiguringcodescanningusingqueriesinqlpacks queries securityandquality Autobuild attempts to build any compiled languages CC, C, Go, or Java. If this step fails, then you should remove it and run the build manually see below name Autobuild uses githubcodeqlactionautobuildv2 Commandline programs to run using the OS shell. | |
--- | |
See httpsdocs.github.comenactionsusingworkflowsworkflowsyntaxforgithubactionsjobsjob_idstepsrun If the Autobuild fails above, remove it and uncomment the following three lines. modify them or add more to build your code if your project, please refer to the EXAMPLE below for guidance. run echo Run, Build Application using script .location_of_script_within_repobuildscript.sh name Perform CodeQL Analysis uses githubcodeqlactionanalyzev2 with category languagematrix.language File .githubworkflowsprecommit.yaml Copyright 2023, NVIDIA CORPORATION AFFILIATES. | |
--- | |
All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
--- | |
Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |
--- | |
name precommit on pull_request jobs precommit runson ubuntu22.04 steps uses actionscheckoutv3 uses actionssetuppythonv3 uses precommitactionv3.0.0 File cicommontest_util.py !usrbinenv python3 Copyright 20182023, NVIDIA CORPORATION AFFILIATES. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. | |
--- | |
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. | |
--- | |
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import json import queue import unittest import numpy as np import tritonclient.grpc as grpcclient class TestResultCollectorunittest.TestCase TestResultCollector stores test result and prints it to stdout. In order to use this class, unit tests must inherit this class. | |
--- | |
Use check_test_results bash function from commonutil.sh to verify the expected number of tests produced by this class classmethod def setResultcls, total, errors, failures cls.total, cls.errors, cls.failures total, errors, failures classmethod def tearDownClasscls this method is called when all the unit tests in a class are finished. json_res total cls.total, errors cls.errors, failures cls.failures with opentest_results.txt, w as f f.writejson.dumpsjson_res def runself, resultNone result argument stores the accumulative test results test_result super.runresult total test_result.testsRun errors lentest_result.errors failures lentest_result.failures self.setResulttotal, errors, failures class AsyncTestResultCollectorunittest.IsolatedAsyncioTestCase TestResultCollector stores test result and prints it to stdout. In order to use this class, unit tests must inherit this class. | |
--- | |
Use check_test_results bash function from commonutil.sh to verify the expected number of tests produced by this class classmethod def setResultcls, total, errors, failures cls.total, cls.errors, cls.failures total, errors, failures classmethod def tearDownClasscls this method is called when all the unit tests in a class are finished. json_res total cls.total, errors cls.errors, failures cls.failures with opentest_results.txt, w as f f.writejson.dumpsjson_res def runself, resultNone result argument stores the accumulative test results test_result super.runresult total test_result.testsRun errors lentest_result.errors failures lentest_result.failures self.setResulttotal, errors, failures def create_vllm_request prompt, request_id, stream, sampling_parameters, model_name, send_parameters_as_tensorTrue, exclude_input_in_outputNone, inputs inputs.appendgrpcclient.InferInputtext_input, 1, BYTES inputs1.set_data_from_numpynp.arrayprompt.encodeutf8, dtypenp.object_ inputs.appendgrpcclient.InferInputstream, 1, BOOL inputs1.set_data_from_numpynp.arraystream, dtypebool if send_parameters_as_tensor sampling_parameters_data np.array json.dumpssampling_parameters.encodeutf8, dtypenp.object_ inputs.appendgrpcclient.InferInputsampling_parameters, 1, BYTES inputs1.set_data_from_numpysampling_parameters_data if exclude_input_in_output is not None inputs.appendgrpcclient.InferInputexclude_input_in_output, 1, BOOL inputs1.set_data_from_numpynp.arrayexclude_input_in_output, dtypebool outputs grpcclient.InferRequestedOutputtext_output return model_name model_name, inputs inputs, outputs outputs, request_id strrequest_id, parameters sampling_parameters, class UserData def __init__self self._completed_requests queue.Queue def callbackuser_data, result, error if error user_data._completed_requests.puterror else user_data._completed_requests.putresult File cicommonutil.sh !binbash Copyright 20232024, NVIDIA CORPORATION AFFILIATES. All rights reserved. | |
--- | |
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. | |
--- | |
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. SERVERSERVERopttritonserverbintritonserver SERVER_IPADDRTRITONSERVER_IPADDRlocalhost SERVER_LOGSERVER_LOG.server.log SERVER_TIMEOUTSERVER_TIMEOUT120 SERVER_LD_PRELOADSERVER_LD_PRELOAD Run inference server. | |
--- | |
Return once servers health endpoint shows ready or timeout expires. Sets SERVER_PID to pid of SERVER, or 0 if error including expired timeout function run_server SERVER_PID0 if z SERVER then echo SERVER must be defined return fi if ! f SERVER then echo SERVER does not exist return fi if z SERVER_LD_PRELOAD then echo Running SERVER SERVER_ARGS else echo Running LD_PRELOADSERVER_LD_PRELOAD SERVER SERVER_ARGS fi LD_PRELOADSERVER_LD_PRELOADLD_PRELOAD SERVER SERVER_ARGS SERVER_LOG 21 SERVER_PID! | |
--- | |
wait_for_server_ready SERVER_PID SERVER_TIMEOUT if WAIT_RET ! 0 then Get further debug information about server startup failure gdb_helper true Cleanup kill SERVER_PID devnull 21 true SERVER_PID0 fi Wait until server health endpoint shows ready. Sets WAIT_RET to 0 on success, 1 on failure function wait_for_server_ready local spid1 shift local wait_time_secs130 shift WAIT_RET0 local wait_secswait_time_secs until test wait_secs eq 0 do if ! | |
--- | |
kill 0 spid devnull 21 then echo Server not running. WAIT_RET1 return fi sleep 1 set e codecurl s w http_code SERVER_IPADDR8000v2healthready set e if code 200 then return fi wait_secs done echo Timeout wait_time_secs secs. Server not ready. | |
--- | |
WAIT_RET1 Check Python unittest results. function check_test_results local log_file1 local expected_num_tests2 if z expected_num_tests then echo expected number of tests must be defined return 1 fi num_failurescat log_file grep E .total.errors.failures. tail n 1 jq .failures num_testscat log_file grep E .total.errors.failures. | |
--- | |
tail n 1 jq .total num_errorscat log_file grep E .total.errors.failures. tail n 1 jq .errors Number regular expression re09 if ? ne 0 ! | |
--- | |
num_failures re ! num_tests re ! num_errors re then cat log_file echo e nn Test Failed unable to parse test resultsn log_file return 1 fi if num_errors ! | |
--- | |
0 num_failures ! 0 num_tests ne expected_num_tests then cat log_file echo e nn Test Failed Expected expected_num_tests tests, num_tests tests executed, num_errors tests had error, and num_failures tests failed. n log_file return 1 fi return 0 function collect_artifacts_from_subdir cp .log core .. | |
--- | |
true File ciL0_additional_outputs_vllmadditional_outputs_test.py Copyright 2024, NVIDIA CORPORATION AFFILIATES. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. | |
--- | |
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. | |
--- | |
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import json import numpy as np import pytest import tritonclient.grpc as grpcclient class TestAdditionalOutputs _grpc_url localhost8001 _model_name vllm_opt _sampling_parameters temperature 0, top_p 1 _prompt In this example, def _get_sampling_parametersself, logprobsNone sampling_parameters self._sampling_parameters.copy if logprobs is not None sampling_parameterslogprobs logprobs return sampling_parameters def _get_inputs self, prompt, streamTrue, sampling_parametersNone, return_finish_reasonNone, return_cumulative_logprobNone, return_logprobsNone, return_num_input_tokensNone, return_num_output_tokensNone, inputs inputs.appendgrpcclient.InferInputtext_input, 1, BYTES inputs1.set_data_from_numpy np.arrayprompt.encodeutf8, dtypenp.object_ inputs.appendgrpcclient.InferInputstream, 1, BOOL inputs1.set_data_from_numpynp.arraystream, dtypebool if sampling_parameters is not None inputs.appendgrpcclient.InferInputsampling_parameters, 1, BYTES inputs1.set_data_from_numpy np.array json.dumpssampling_parameters.encodeutf8, dtypenp.object_ if return_finish_reason is not None inputs.appendgrpcclient.InferInputreturn_finish_reason, 1, BOOL inputs1.set_data_from_numpynp.arrayreturn_finish_reason, dtypebool if return_cumulative_logprob is not None inputs.append grpcclient.InferInputreturn_cumulative_logprob, 1, BOOL inputs1.set_data_from_numpy np.arrayreturn_cumulative_logprob, dtypebool if return_logprobs is not None inputs.appendgrpcclient.InferInputreturn_logprobs, 1, BOOL inputs1.set_data_from_numpynp.arrayreturn_logprobs, dtypebool if return_num_input_tokens is not None inputs.appendgrpcclient.InferInputreturn_num_input_tokens, 1, BOOL inputs1.set_data_from_numpy np.arrayreturn_num_input_tokens, dtypebool if return_num_output_tokens is not None inputs.append grpcclient.InferInputreturn_num_output_tokens, 1, BOOL inputs1.set_data_from_numpy np.arrayreturn_num_output_tokens, dtypebool return inputs def _callbackself, result, error self._responses.appendresult result, error error def _llm_inferself, inputs, sampling_parameters self._responses with grpcclient.InferenceServerClientself._grpc_url as client client.start_streamself._callback client.async_stream_infer self._model_name, inputsinputs, parameterssampling_parameters client.stop_stream assert lenself._responses 0 def _assert_text_output_validself text_output for response in self._responses result, error responseresult, responseerror assert error is None text_output result.as_numpynametext_output0.decodeutf8 assert lentext_output 0, output is empty assert text_output.count 4, output is not a sentence def _assert_finish_reasonself, return_finish_reason for i in rangelenself._responses result, error self._responsesiresult, self._responsesierror assert error is None finish_reason_np result.as_numpynamefinish_reason if return_finish_reason is None or return_finish_reason False assert finish_reason_np is None continue finish_reason finish_reason_np0.decodeutf8 if i lenself._responses 1 assert finish_reason None else assert finish_reason length def 
_assert_cumulative_logprobself, return_cumulative_logprob prev_cumulative_logprob 0.0 for response in self._responses result, error responseresult, responseerror assert error is None cumulative_logprob_np result.as_numpynamecumulative_logprob if return_cumulative_logprob is None or return_cumulative_logprob False assert cumulative_logprob_np is None continue cumulative_logprob cumulative_logprob_np0.astypefloat assert cumulative_logprob ! prev_cumulative_logprob prev_cumulative_logprob cumulative_logprob def _assert_logprobs self, stream, sampling_parameters, return_logprobs, return_num_output_tokens for response in self._responses result, error responseresult, responseerror assert error is None logprobs_np result.as_numpynamelogprobs if return_logprobs is None or return_logprobs False assert logprobs_np is None continue logprobs json.loadslogprobs_np0.decodeutf8 if logprobs not in sampling_parameters assert logprobs is None continue assert isinstancelogprobs, list assert lenlogprobs 1 if return_num_output_tokens True num_output_tokens result.as_numpynamenum_output_tokens0.astype int assert lenlogprobs num_output_tokens text_output_logprobs for logprobs_d in logprobs assert isinstancelogprobs_d, dict assert lenlogprobs_d 1 assert lenlogprobs_d sampling_parameterslogprobs 1 rank_one_found False for token_id, logprob_d in logprobs_d.items assert isinstancetoken_id, str assert lenlogprob_d 3 assert isinstancelogprob_dlogprob, float assert isinstancelogprob_drank, int assert isinstancelogprob_ddecoded_token, str if logprob_drank 1 assert not rank_one_found rank_one_found True text_output_logprobs logprob_ddecoded_token assert rank_one_found text_output result.as_numpynametext_output0.decodeutf8 if not stream given exclude_input_in_output is not set, prepend_input is True if not streaming and False if streaming text_output_logprobs self._prompt text_output_logprobs assert text_output_logprobs text_output def _assert_num_input_tokensself, return_num_input_tokens for response in self._responses result, error responseresult, responseerror assert error is None num_input_tokens_np result.as_numpynamenum_input_tokens if return_num_input_tokens is None or return_num_input_tokens False assert num_input_tokens_np is None continue num_input_tokens num_input_tokens_np.astypeint assert num_input_tokens 0 assert num_input_tokens lenself._prompt def _assert_num_output_tokensself, return_num_output_tokens for response in self._responses result, error responseresult, responseerror assert error is None num_output_tokens_np result.as_numpynamenum_output_tokens if return_num_output_tokens is None or return_num_output_tokens False assert num_output_tokens_np is None continue num_output_tokens num_output_tokens_np0.astypeint assert num_output_tokens 0 pytest.mark.parametrizestream, True, False pytest.mark.parametrizereturn_finish_reason, None, True, False pytest.mark.parametrizereturn_cumulative_logprob, None, True, False pytest.mark.parametrizelogprobs, None, 0, 2 pytest.mark.parametrizereturn_logprobs, None, True, False pytest.mark.parametrizereturn_num_input_tokens, None, True, False pytest.mark.parametrizereturn_num_output_tokens, None, True, False def test_additional_outputs self, stream, return_finish_reason, return_cumulative_logprob, logprobs, return_logprobs, return_num_input_tokens, return_num_output_tokens, sampling_parameters self._get_sampling_parameterslogprobslogprobs inputs self._get_inputs self._prompt, streamstream, sampling_parameterssampling_parameters, return_finish_reasonreturn_finish_reason, 
return_cumulative_logprobreturn_cumulative_logprob, return_logprobsreturn_logprobs, return_num_input_tokensreturn_num_input_tokens, return_num_output_tokensreturn_num_output_tokens, self._llm_inferinputs, sampling_parameters self._assert_text_output_valid self._assert_finish_reasonreturn_finish_reason self._assert_cumulative_logprobreturn_cumulative_logprob self._assert_logprobs stream, sampling_parameters, return_logprobs, return_num_output_tokens self._assert_num_input_tokensreturn_num_input_tokens self._assert_num_output_tokensreturn_num_output_tokens File ciL0_additional_outputs_vllmtest.sh !binbash Copyright 2024, NVIDIA CORPORATION AFFILIATES. | |
--- | |
All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
--- | |
Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |
--- | |
export CUDA_VISIBLE_DEVICES0 source ..commonutil.sh pip3 install pytest8.1.1 pip3 install tritonclientgrpc Prepare Model rm rf models vllm_baseline_output.pkl mkdir p models SAMPLE_MODELS_REPO....samplesmodel_repository cp r SAMPLE_MODELS_REPOvllm_model modelsvllm_opt sed i sgpu_memory_utilization 0.5gpu_memory_utilization 0.3 modelsvllm_opt1model.json RET0 Test SERVER_LOGadditional_outputs_test.server.log SERVER_ARGSmodelrepositorymodels run_server if SERVER_PID 0 then echo e nn Failed to start SERVERn cat SERVER_LOG exit 1 fi set e python3 m pytest junitxmltest_additional_outputs.xml s v additional_outputs_test.py if ? ne 0 then echo e nn additional_outputs_test FAILED. n RET1 fi set e kill SERVER_PID wait SERVER_PID if RET eq 0 then echo e nn Test Passedn else echo e nn Test FAILEDn fi exit RET File ciL0_backend_vllmaccuracy_testaccuracy_test.py Copyright 20232024, NVIDIA CORPORATION AFFILIATES. | |
--- | |
All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
--- | |
Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |
--- | |
import argparse import asyncio import pickle import sys import unittest from functools import partial import tritonclient.grpc as grpcclient from tritonclient.utils import from vllm import SamplingParams from vllm.engine.arg_utils import AsyncEngineArgs from vllm.engine.async_llm_engine import AsyncLLMEngine from vllm.utils import random_uuid sys.path.append....common from test_util import TestResultCollector, UserData, callback, create_vllm_request VLLM_ENGINE_CONFIG model facebookopt125m, gpu_memory_utilization 0.3, PROMPTS The most dangerous animal is, The capital of France is, The future of AI is, SAMPLING_PARAMETERS temperature 0, top_p 1 async def generate_python_vllm_outputprompt, llm_engine request_id random_uuid sampling_params SamplingParamsSAMPLING_PARAMETERS python_vllm_output None last_output None async for vllm_output in llm_engine.generateprompt, sampling_params, request_id last_output vllm_output if last_output python_vllm_output prompt output.text.encodeutf8 for output in last_output.outputs return python_vllm_output def prepare_vllm_baseline_outputs Helper function that starts async vLLM engine and generates output for each prompt in PROMPTS. Saves resulted baselines in vllm_baseline_output.pkl for further use. llm_engine AsyncLLMEngine.from_engine_argsAsyncEngineArgsVLLM_ENGINE_CONFIG python_vllm_output for i in rangelenPROMPTS python_vllm_output.extend asyncio.rungenerate_python_vllm_outputPROMPTSi, llm_engine with openvllm_baseline_output.pkl, wb as f pickle.dumppython_vllm_output, f return class VLLMTritonAccuracyTestTestResultCollector def setUpself self.triton_client grpcclient.InferenceServerClienturllocalhost8001 self.vllm_model_name vllm_opt self.python_vllm_output with openvllm_baseline_output.pkl, rb as f self.python_vllm_output pickle.loadf self.assertNotEqual self.python_vllm_output, , Loaded baseline outputs list should not be empty, self.assertIsNotNone self.python_vllm_output, Loaded baseline outputs list should not be None self.assertEqual lenself.python_vllm_output, lenPROMPTS, Unexpected number of baseline outputs loaded, expected , but got .format lenPROMPTS, lenself.python_vllm_output , def test_vllm_modelself user_data UserData stream False triton_vllm_output self.triton_client.start_streamcallbackpartialcallback, user_data for i in rangelenPROMPTS request_data create_vllm_request PROMPTSi, i, stream, SAMPLING_PARAMETERS, self.vllm_model_name self.triton_client.async_stream_infer model_nameself.vllm_model_name, request_idrequest_datarequest_id, inputsrequest_datainputs, outputsrequest_dataoutputs, parametersSAMPLING_PARAMETERS, for i in rangelenPROMPTS result user_data._completed_requests.get self.assertIsNottyperesult, InferenceServerException, strresult output result.as_numpytext_output self.assertIsNotNoneoutput, text_output should not be None triton_vllm_output.extendoutput self.triton_client.stop_stream self.assertEqualself.python_vllm_output.sort, triton_vllm_output.sort def tearDownself self.triton_client.close if __name__ __main__ parser argparse.ArgumentParser parser.add_argument generatebaseline, actionstore_true, requiredFalse, defaultFalse, helpGenerates baseline output for accuracy tests, FLAGS parser.parse_args if FLAGS.generate_baseline prepare_vllm_baseline_outputs exit0 unittest.main File ciL0_backend_vllmaccuracy_testtest.sh !binbash Copyright 20232024, NVIDIA CORPORATION AFFILIATES. | |
--- | |
All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
--- | |
Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |
--- | |
source ....commonutil.sh TRITON_DIRTRITON_DIRopttritonserver SERVERTRITON_DIRbintritonserver BACKEND_DIRTRITON_DIRbackends SERVER_ARGSmodelrepositorypwdmodels backenddirectoryBACKEND_DIR logverbose1 SERVER_LOG.accuracy_test_server.log CLIENT_LOG.accuracy_test_client.log TEST_RESULT_FILEtest_results.txt CLIENT_PY.accuracy_test.py SAMPLE_MODELS_REPO......samplesmodel_repository VLLM_ENGINE_LOGvllm_engine.log EXPECTED_NUM_TESTS1 rm rf models mkdir p models cp r SAMPLE_MODELS_REPOvllm_model modelsvllm_opt sed i sgpu_memory_utilization 0.5gpu_memory_utilization 0.3 modelsvllm_opt1model.json f vllm_baseline_output.pkl rm vllm_baseline_output.pkl RET0 set e Need to generate baseline first, since running 2 vLLM engines causes memory issues httpsgithub.comvllmprojectvllmissues2248 python3 CLIENT_PY generatebaseline VLLM_ENGINE_LOG 21 BASELINE_PID! wait BASELINE_PID set e run_server if SERVER_PID 0 then cat SERVER_LOG echo e nn Failed to start SERVERn exit 1 fi set e python3 CLIENT_PY CLIENT_LOG 21 if ? ne 0 then cat CLIENT_LOG echo e nn Running CLIENT_PY FAILED. | |
--- | |
n RET1 else check_test_results TEST_RESULT_FILE EXPECTED_NUM_TESTS if ? ne 0 then cat CLIENT_LOG echo e nn Test Result Verification FAILED.n RET1 fi fi set e kill SERVER_PID wait SERVER_PID rm rf models if RET eq 1 then cat CLIENT_LOG cat SERVER_LOG echo e nn Accuracy test FAILED. n else echo e nn Accuracy test PASSED. | |
--- | |
n fi collect_artifacts_from_subdir exit RET File ciL0_backend_vllmenabled_streamenabled_stream_test.py Copyright 2023, NVIDIA CORPORATION AFFILIATES. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. | |
--- | |
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. | |
--- | |
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import sys import unittest import tritonclient.grpc.aio as grpcclient from tritonclient.utils import sys.path.append....common from test_util import AsyncTestResultCollector, create_vllm_request PROMPTS The most dangerous animal is SAMPLING_PARAMETERS temperature 0, top_p 1 class VLLMTritonStreamTestAsyncTestResultCollector async def _test_vllm_model self, promptsPROMPTS, sampling_parametersSAMPLING_PARAMETERS, streamTrue, exclude_input_in_outputNone, expected_outputNone, expect_errorFalse, async with grpcclient.InferenceServerClient urllocalhost8001 as triton_client model_name vllm_opt async def request_iterator for i, prompt in enumerateprompts yield create_vllm_request prompt, i, stream, sampling_parameters, model_name, exclude_input_in_outputexclude_input_in_output, response_iterator triton_client.stream_infer inputs_iteratorrequest_iterator final_response async for response in response_iterator result, error response if expect_error self.assertIsInstanceerror, InferenceServerException self.assertEqual error.message, Error generating stream When streaming, exclude_input_in_output False is not allowed., error, return self.assertIsNoneerror, error self.assertIsNotNoneresult, result output result.as_numpytext_output self.assertIsNotNoneoutput, text_output should not be None final_response.appendstroutput0, encodingutf8 if expected_output is not None self.assertEqual final_response, expected_output, Expected to receive the following response , but received ..format expected_output, final_response , async def test_vllm_model_enabled_streamself Verifying that request with multiple prompts runs successfully. prompts The most dangerous animal is, The future of AI is, await self._test_vllm_modelpromptsprompts async def test_vllm_model_enabled_stream_exclude_input_in_output_defaultself Verifying that streaming request returns only generated diffs, which is default behaviour for streamTrue. | |
--- | |
expected_output the, one, that, is, most, likely, to, be, killed, by, a, car, ., n, I, m, await self._test_vllm_modelexpected_outputexpected_output async def test_vllm_model_enabled_stream_exclude_input_in_output_falseself Verifying that streaming request returns only generated diffs even if exclude_input_in_output is set to False explicitly. expected_output Error generating stream When streaming, exclude_input_in_output False is not allowed. await self._test_vllm_model exclude_input_in_outputFalse, expected_outputexpected_output, expect_errorTrue, if __name__ __main__ unittest.main File ciL0_backend_vllmenabled_streamtest.sh !binbash Copyright 2023, NVIDIA CORPORATION AFFILIATES. | |
--- | |
source ....commonutil.sh TRITON_DIRTRITON_DIRopttritonserver SERVERTRITON_DIRbintritonserver BACKEND_DIRTRITON_DIRbackends SERVER_ARGSmodelrepositorypwdmodels backenddirectoryBACKEND_DIR logverbose1 SERVER_LOG.enabled_stream_server.log CLIENT_LOG.enabled_stream_client.log TEST_RESULT_FILEtest_results.txt CLIENT_PY.enabled_stream_test.py SAMPLE_MODELS_REPO......samplesmodel_repository EXPECTED_NUM_TESTS3 rm rf models mkdir p models cp r SAMPLE_MODELS_REPOvllm_model modelsvllm_opt RET0 run_server if SERVER_PID 0 then cat SERVER_LOG echo e nn Failed to start SERVERn exit 1 fi set e python3 CLIENT_PY v CLIENT_LOG 21 if ? ne 0 then cat CLIENT_LOG echo e nn Running CLIENT_PY FAILED. n RET1 else check_test_results TEST_RESULT_FILE EXPECTED_NUM_TESTS if ? | |
--- | |
ne 0 then cat CLIENT_LOG echo e nn Test Result Verification FAILED.n RET1 fi fi set e kill SERVER_PID wait SERVER_PID rm rf models if RET eq 1 then cat CLIENT_LOG cat SERVER_LOG echo e nn Enabled Stream test FAILED. n else echo e nn Enabled Stream test PASSED. n fi collect_artifacts_from_subdir exit RET File ciL0_backend_vllmmetrics_testtest.sh !binbash Copyright 2024, NVIDIA CORPORATION AFFILIATES. | |
--- | |
source ....commonutil.sh TRITON_DIRTRITON_DIRopttritonserver SERVERTRITON_DIRbintritonserver BACKEND_DIRTRITON_DIRbackends SERVER_ARGSmodelrepositorypwdmodels backenddirectoryBACKEND_DIR modelcontrolmodeexplicit loadmodelvllm_opt logverbose1 SERVER_LOG.vllm_metrics_server.log CLIENT_LOG.vllm_metrics_client.log TEST_RESULT_FILEtest_results.txt CLIENT_PY.vllm_metrics_test.py SAMPLE_MODELS_REPO......samplesmodel_repository EXPECTED_NUM_TESTS1 Helpers function copy_model_repository rm rf models mkdir p models cp r SAMPLE_MODELS_REPOvllm_model modelsvllm_opt vllm_opt model will be loaded on server start and stay loaded throughout unittesting. To ensure that vllms memory profiler will not error out on vllm_load_test load, we reduce gpu_memory_utilization for vllm_opt, so that at least 60 of GPU memory was available for other models. sed i sgpu_memory_utilization 0.5gpu_memory_utilization 0.4 modelsvllm_opt1model.json run_test local TEST_CASE1 run_server if SERVER_PID 0 then cat SERVER_LOG echo e nn Failed to start SERVERn exit 1 fi set e python3 CLIENT_PY TEST_CASE v CLIENT_LOG 21 if ? | |
--- | |
ne 0 then cat CLIENT_LOG echo e nn Running CLIENT_PY TEST_CASE FAILED. n RET1 else check_test_results TEST_RESULT_FILE EXPECTED_NUM_TESTS if ? ne 0 then cat CLIENT_LOG echo e nn Test Result Verification FAILED.n RET1 fi fi set e TODO Nongraceful shutdown when metrics are enabled. | |
--- | |
kill SERVER_PID wait SERVER_PID RET0 Test disabling vLLM metrics reporting without parameter REPORT_CUSTOM_METRICS in config.pbtxt copy_model_repository run_test VLLMTritonMetricsTest.test_vllm_metrics_disabled Test disabling vLLM metrics reporting with parameter REPORT_CUSTOM_METRICS set to false in config.pbtxt copy_model_repository echo e parameters key REPORT_CUSTOM_METRICS value string_value false modelsvllm_optconfig.pbtxt run_test VLLMTritonMetricsTest.test_vllm_metrics_disabled Test vLLM metrics reporting with parameter REPORT_CUSTOM_METRICS set to true in config.pbtxt copy_model_repository cp SAMPLE_MODELS_REPOvllm_modelconfig.pbtxt modelsvllm_opt echo e parameters key REPORT_CUSTOM_METRICS value string_value true modelsvllm_optconfig.pbtxt run_test VLLMTritonMetricsTest.test_vllm_metrics Test vLLM metrics custom sampling parameters Custom sampling parameters may result in different vLLM output depending on the platform. Therefore, these metrics are tests separately. copy_model_repository cp SAMPLE_MODELS_REPOvllm_modelconfig.pbtxt modelsvllm_opt echo e parameters key REPORT_CUSTOM_METRICS value string_value true modelsvllm_optconfig.pbtxt run_test VLLMTritonMetricsTest.test_custom_sampling_params Test enabling vLLM metrics reporting in config.pbtxt but disabling in model.json copy_model_repository jq . | |
--- | |
disable_log_stats true modelsvllm_opt1model.json temp.json mv temp.json modelsvllm_opt1model.json echo e parameters key REPORT_CUSTOM_METRICS value string_value true modelsvllm_optconfig.pbtxt run_test VLLMTritonMetricsTest.test_vllm_metrics_disabled Test enabling vLLM metrics reporting in config.pbtxt while disabling in server option copy_model_repository echo e parameters key REPORT_CUSTOM_METRICS value string_value true modelsvllm_optconfig.pbtxt SERVER_ARGSSERVER_ARGS allowmetricsfalse run_test VLLMTritonMetricsTest.test_vllm_metrics_refused rm rf .models temp.json if RET eq 1 then cat CLIENT_LOG cat SERVER_LOG echo e nn vLLM test FAILED. n else echo e nn vLLM test PASSED. n fi collect_artifacts_from_subdir exit RET File ciL0_backend_vllmmetrics_testvllm_metrics_test.py Copyright 2024, NVIDIA CORPORATION AFFILIATES. | |
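Each scenario above toggles metrics reporting by appending a REPORT_CUSTOM_METRICS parameter block to the model's config.pbtxt before restarting the server. A minimal sketch of that step in Python, assuming the repository layout used by the script (models/vllm_opt):

from pathlib import Path

# The parameter the vLLM backend reads to decide whether to report its custom
# metrics; the test script appends an equivalent block with `echo ... >> config.pbtxt`.
param_block = '''
parameters: {
  key: "REPORT_CUSTOM_METRICS"
  value: {
    string_value: "true"
  }
}
'''
config = Path("models/vllm_opt/config.pbtxt")
config.write_text(config.read_text() + param_block)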
--- | |
import os import re import sys import unittest from functools import partial import requests import tritonclient.grpc as grpcclient from tritonclient.utils import sys.path.append....common from test_util import TestResultCollector, UserData, callback, create_vllm_request class VLLMTritonMetricsTestTestResultCollector def setUpself self.triton_client grpcclient.InferenceServerClienturllocalhost8001 self.tritonserver_ipaddr os.environ.getTRITONSERVER_IPADDR, localhost self.vllm_model_name vllm_opt self.prompts The most dangerous animal is, The capital of France is, The future of AI is, self.sampling_parameters temperature 0, top_p 1 def parse_vllm_metricsself Store vllm metrics in a dictionary. r requests.getfhttpself.tritonserver_ipaddr8002metrics r.raise_for_status Regular expression to match the pattern pattern rvllm ?.? 09. | |
--- | |
vllm_dict Find all matches in the text matches re.findallpattern, r.text, re.MULTILINE for match in matches key, value match vllm_dictkey floatvalue if . in value else intvalue return vllm_dict def vllm_infer self, prompts, sampling_parameters, model_name, Helper function to send async stream infer requests to vLLM. user_data UserData number_of_vllm_reqs lenprompts self.triton_client.start_streamcallbackpartialcallback, user_data for i in rangenumber_of_vllm_reqs request_data create_vllm_request promptsi, i, False, sampling_parameters, model_name, True, self.triton_client.async_stream_infer model_namemodel_name, inputsrequest_datainputs, request_idrequest_datarequest_id, outputsrequest_dataoutputs, parameterssampling_parameters, for _ in rangenumber_of_vllm_reqs result user_data._completed_requests.get if typeresult is InferenceServerException printresult.message self.assertIsNottyperesult, InferenceServerException, strresult output result.as_numpytext_output self.assertIsNotNoneoutput, text_output should not be None self.triton_client.stop_stream def test_vllm_metricsself Test vLLM metrics self.vllm_infer promptsself.prompts, sampling_parametersself.sampling_parameters, model_nameself.vllm_model_name, metrics_dict self.parse_vllm_metrics total_prompts lenself.prompts vllmprompt_tokens_total 2, 133, 144, 2702, 3477, 16 2, 133, 812, 9, 1470, 16 2, 133, 499, 9, 4687, 16 self.assertEqualmetrics_dictvllmprompt_tokens_total, 18 vllmgeneration_tokens_total 5, 65, 14, 16, 144, 533, 7, 28, 848, 30, 10, 512, 4, 50118, 100, 437 5, 812, 9, 5, 1515, 3497, 4, 50118, 50118, 133, 812, 9, 1470, 16, 5, 812 11, 5, 1420, 9, 5, 82, 4, 50118, 50118, 133, 499, 9, 4687, 16, 11, 5 self.assertEqualmetrics_dictvllmgeneration_tokens_total, 48 vllmtime_to_first_token_seconds self.assertEqual metrics_dictvllmtime_to_first_token_seconds_count, total_prompts self.assertGreatermetrics_dictvllmtime_to_first_token_seconds_sum, 0 self.assertEqual metrics_dictvllmtime_to_first_token_seconds_bucket, total_prompts vllmtime_per_output_token_seconds self.assertEqualmetrics_dictvllmtime_per_output_token_seconds_count, 45 self.assertGreatermetrics_dictvllmtime_per_output_token_seconds_sum, 0 self.assertEqualmetrics_dictvllmtime_per_output_token_seconds_bucket, 45 vllme2e_request_latency_seconds self.assertEqual metrics_dictvllme2e_request_latency_seconds_count, total_prompts self.assertGreatermetrics_dictvllme2e_request_latency_seconds_sum, 0 self.assertEqual metrics_dictvllme2e_request_latency_seconds_bucket, total_prompts vllmrequest_prompt_tokens self.assertEqual metrics_dictvllmrequest_prompt_tokens_count, total_prompts self.assertEqualmetrics_dictvllmrequest_prompt_tokens_sum, 18 self.assertEqual metrics_dictvllmrequest_prompt_tokens_bucket, total_prompts vllmrequest_generation_tokens self.assertEqual metrics_dictvllmrequest_generation_tokens_count, total_prompts, self.assertEqualmetrics_dictvllmrequest_generation_tokens_sum, 48 self.assertEqual metrics_dictvllmrequest_generation_tokens_bucket, total_prompts, TODO Revisit this test due to the removal of best_of def test_custom_sampling_paramsself Adding sampling parameters for testing metrics. | |
--- | |
Definitions can be found here httpsdocs.vllm.aienlatestdevsampling_params.html n, best_of 2, 4 custom_sampling_parameters self.sampling_parameters.copy Changing temperature because best_of must be 1 when using greedy sampling, i.e. temperature 0. custom_sampling_parameters.update n strn, best_of strbest_of, temperature 1 Test vLLM metrics self.vllm_infer promptsself.prompts, sampling_parameterscustom_sampling_parameters, model_nameself.vllm_model_name, metrics_dict self.parse_vllm_metrics total_prompts lenself.prompts vllmrequest_params_best_of self.assertEqual metrics_dictvllmrequest_params_best_of_count, total_prompts self.assertEqual metrics_dictvllmrequest_params_best_of_sum, best_of total_prompts self.assertEqual metrics_dictvllmrequest_params_best_of_bucket, total_prompts vllmrequest_params_n self.assertEqualmetrics_dictvllmrequest_params_n_count, total_prompts self.assertEqualmetrics_dictvllmrequest_params_n_sum, n total_prompts self.assertEqualmetrics_dictvllmrequest_params_n_bucket, total_prompts def test_vllm_metrics_disabledself Test vLLM metrics self.vllm_infer promptsself.prompts, sampling_parametersself.sampling_parameters, model_nameself.vllm_model_name, metrics_dict self.parse_vllm_metrics No vLLM metric found self.assertEquallenmetrics_dict, 0 def test_vllm_metrics_refusedself Test vLLM metrics self.vllm_infer promptsself.prompts, sampling_parametersself.sampling_parameters, model_nameself.vllm_model_name, with self.assertRaisesrequests.exceptions.ConnectionError self.parse_vllm_metrics def tearDownself self.triton_client.close if __name__ __main__ unittest.main File ciL0_backend_vllmrequest_cancellationrequest_cancellation_test.py Copyright 2023, NVIDIA CORPORATION AFFILIATES. | |
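parse_vllm_metrics above scrapes Triton's Prometheus endpoint and keeps only the vllm-prefixed families before asserting on token counts and latency histograms. A standalone sketch of such a scrape, assuming Triton's default metrics port 8002:

import re

import requests

# Fetch the Prometheus text exposition from Triton's metrics endpoint and keep
# only the vLLM families, e.g. vllm:prompt_tokens_total or
# vllm:time_to_first_token_seconds_count.
resp = requests.get("http://localhost:8002/metrics")
resp.raise_for_status()

vllm_metrics = {}
for name, _labels, value in re.findall(
    r"^(vllm:[^ {]+)(\{[^}]*\})? ([0-9eE+.\-]+)$", resp.text, re.MULTILINE
):
    vllm_metrics[name] = float(value)

print(vllm_metrics.get("vllm:prompt_tokens_total"))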
--- | |
import sys import time import unittest from functools import partial import tritonclient.grpc as grpcclient from tritonclient.utils import sys.path.append....common from test_util import TestResultCollector, UserData, callback, create_vllm_request class VLLMRequestCancelTestTestResultCollector def test_request_cancellationself, send_parameters_as_tensorTrue with grpcclient.InferenceServerClienturllocalhost8001 as triton_client log_file_path .request_cancellation_server.log user_data UserData model_name vllm_opt stream False sampling_parameters temperature 0, top_p 1, max_tokens 1500, prompt fWrite an original and creative poem of at least 200 words. triton_client.start_streamcallbackpartialcallback, user_data request_data create_vllm_request prompt, 1, stream, sampling_parameters, model_name, send_parameters_as_tensor, triton_client.async_stream_infer model_namemodel_name, request_idrequest_datarequest_id, inputsrequest_datainputs, outputsrequest_dataoutputs, parameterssampling_parameters, time.sleep1 triton_client.stop_streamcancel_requestsTrue time.sleep1 self.assertFalseuser_data._completed_requests.empty result user_data._completed_requests.get self.assertIsInstanceresult, InferenceServerException self.assertEqualresult.status, StatusCode.CANCELLED self.assertTrueuser_data._completed_requests.empty with openlog_file_path, moder as log_file log_text log_file.read self.assertInvllm Cancelling the request, log_text self.assertInvllm Successfully cancelled the request, log_text if __name__ __main__ unittest.main File ciL0_backend_vllmrequest_cancellationtest.sh !binbash Copyright 2023, NVIDIA CORPORATION AFFILIATES. All rights reserved. | |
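The cancellation test above starts a long generation (max_tokens 1500) over the gRPC stream and then closes the stream with cancel_requests=True, after which the callback receives a CANCELLED error. A condensed sketch of that flow with the same input names; the helper below is an illustration, not the repository's create_vllm_request:

import json
import time
from functools import partial
from queue import Queue

import numpy as np
import tritonclient.grpc as grpcclient


def build_inputs(prompt, stream, sampling_parameters):
    # Hypothetical stand-in for create_vllm_request: packs the backend's
    # text_input / stream / sampling_parameters input tensors.
    text = grpcclient.InferInput("text_input", [1], "BYTES")
    text.set_data_from_numpy(np.array([prompt.encode("utf-8")], dtype=np.object_))
    stream_in = grpcclient.InferInput("stream", [1], "BOOL")
    stream_in.set_data_from_numpy(np.array([stream], dtype=bool))
    params = grpcclient.InferInput("sampling_parameters", [1], "BYTES")
    params.set_data_from_numpy(
        np.array([json.dumps(sampling_parameters).encode("utf-8")], dtype=np.object_)
    )
    return [text, stream_in, params]


responses = Queue()


def callback(queue, result, error):
    queue.put(error if error is not None else result)


with grpcclient.InferenceServerClient("localhost:8001") as client:
    client.start_stream(callback=partial(callback, responses))
    client.async_stream_infer(
        model_name="vllm_opt",
        request_id="1",
        inputs=build_inputs(
            "Write an original and creative poem of at least 200 words.",
            False,
            {"temperature": 0, "top_p": 1, "max_tokens": 1500},
        ),
    )
    time.sleep(1)
    # Closing the stream with cancel_requests=True cancels the in-flight
    # generation; the queued result is then an InferenceServerException with
    # StatusCode.CANCELLED, which is what the test asserts.
    client.stop_stream(cancel_requests=True)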
--- | |
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. source ....commonutil.sh TRITON_DIRTRITON_DIRopttritonserver SERVERTRITON_DIRbintritonserver BACKEND_DIRTRITON_DIRbackends SERVER_ARGSmodelrepositorypwdmodels backenddirectoryBACKEND_DIR logverbose1 SERVER_LOG.request_cancellation_server.log CLIENT_LOG.request_cancellation_client.log TEST_RESULT_FILEtest_results.txt CLIENT_PY.request_cancellation_test.py SAMPLE_MODELS_REPO......samplesmodel_repository EXPECTED_NUM_TESTS1 rm rf models mkdir p models cp r SAMPLE_MODELS_REPOvllm_model modelsvllm_opt RET0 run_server if SERVER_PID 0 then cat SERVER_LOG echo e nn Failed to start SERVERn exit 1 fi set e python3 CLIENT_PY v CLIENT_LOG 21 if ? | |
--- | |
ne 0 then cat CLIENT_LOG echo e nn Running CLIENT_PY FAILED. n RET1 else check_test_results TEST_RESULT_FILE EXPECTED_NUM_TESTS if ? ne 0 then cat CLIENT_LOG echo e nn Test Result Verification FAILED.n RET1 fi fi set e kill SERVER_PID wait SERVER_PID rm rf models if RET eq 1 then cat CLIENT_LOG cat SERVER_LOG echo e nn Request Cancellation test FAILED. | |
--- | |
n else echo e nn Request Cancellation test PASSED. n fi collect_artifacts_from_subdir exit RET File ciL0_backend_vllmvllm_backendensemble_config.pbtxt Copyright 2024, NVIDIA CORPORATION AFFILIATES. All rights reserved. | |
---
name: "ensemble_model"
platform: "ensemble"
max_batch_size: 1
input [
  {
    name: "text_input"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
output [
  {
    name: "text_output"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
ensemble_scheduling {
  step [
    {
      model_name: "vllm_opt"
      model_version: -1
      input_map {
        key: "text_input"
        value: "text_input"
      }
      output_map {
        key: "text_output"
        value: "text_output"
      }
    }
  ]
}
---
File: ci/L0_backend_vllm/vllm_backend/test.sh
#!/bin/bash
Copyright 2023-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
--- | |
source ....commonutil.sh TRITON_DIRTRITON_DIRopttritonserver SERVERTRITON_DIRbintritonserver BACKEND_DIRTRITON_DIRbackends SERVER_ARGSmodelrepositorypwdmodels backenddirectoryBACKEND_DIR modelcontrolmodeexplicit loadmodelvllm_opt logverbose1 SERVER_LOG.vllm_backend_server.log CLIENT_LOG.vllm_backend_client.log TEST_RESULT_FILEtest_results.txt CLIENT_PY.vllm_backend_test.py SAMPLE_MODELS_REPO......samplesmodel_repository EXPECTED_NUM_TESTS6 Helpers function assert_curl_success message1 if code ! 200 then cat .curl.out echo e nn message line BASH_LINENOn RET1 fi rm rf models mkdir p models cp r SAMPLE_MODELS_REPOvllm_model modelsvllm_opt vllm_opt model will be loaded on server start and stay loaded throughout unittesting. To test vllm model loadunload we use a dedicated vllm_load_test. | |
--- | |
To ensure that vllms memory profiler will not error out on vllm_load_test load, we reduce gpu_memory_utilization for vllm_opt, so that at least 60 of GPU memory was available for other models. sed i sgpu_memory_utilization 0.5gpu_memory_utilization 0.4 modelsvllm_opt1model.json cp r modelsvllm_opt modelsvllm_load_test mkdir p modelsadd_sub1 wget P modelsadd_sub1 httpsraw.githubusercontent.comtritoninferenceserverpython_backendmainexamplesadd_submodel.py wget P modelsadd_sub httpsraw.githubusercontent.comtritoninferenceserverpython_backendmainexamplesadd_subconfig.pbtxt Invalid model attribute cp r SAMPLE_MODELS_REPOvllm_model modelsvllm_invalid_1 sed i sdisable_log_requestsinvalid_attribute modelsvllm_invalid_11model.json Invalid model name cp r SAMPLE_MODELS_REPOvllm_model modelsvllm_invalid_2 sed i sfacebookopt125minvalid_model modelsvllm_invalid_21model.json Sanity check ensembles are enabled and can successfully be loaded mkdir p modelsensemble_model1 cp r ensemble_config.pbtxt modelsensemble_modelconfig.pbtxt RET0 run_server if SERVER_PID 0 then cat SERVER_LOG echo e nn Failed to start SERVERn exit 1 fi set e python3 CLIENT_PY v CLIENT_LOG 21 if ? ne 0 then cat CLIENT_LOG echo e nn Running CLIENT_PY FAILED. | |
--- | |
n RET1 else check_test_results TEST_RESULT_FILE EXPECTED_NUM_TESTS if ? ne 0 then cat CLIENT_LOG echo e nn Test Result Verification FAILED.n RET1 fi fi set e kill SERVER_PID wait SERVER_PID Test Python backend cmdline parameters are propagated to vllm backend SERVER_ARGSmodelrepositorypwdmodels backenddirectoryBACKEND_DIR backendconfigpython,defaultmaxbatchsize8 SERVER_LOG.vllm_test_cmdline_server.log rm rf .modelsvllm_invalid_1 .modelsvllm_invalid_2 .modelsvllm_load_test run_server if SERVER_PID 0 then cat SERVER_LOG echo e nn Failed to start SERVERn exit 1 fi kill SERVER_PID wait SERVER_PID rm rf .models COUNTgrep c defaultmaxbatchsize8 SERVER_LOG if COUNT ne 2 then echo Cmdline parameters verification Failed fi Test loading multiple vllm models SERVER_ARGSmodelrepositorypwdmodels backenddirectoryBACKEND_DIR modelcontrolmodeexplicit loadmodelvllm_one SERVER_LOG.vllm_test_multi_model.log Create two models, one is just a copy of the other, and make sure gpu utilization is low enough for multiple models to avoid OOM. vLLM changed behavior of their GPU profiler from total to free memory, so to load two small models, we need to start triton server in explicit mode. | |
--- | |
MODEL1vllm_one MODEL2vllm_two mkdir p models cp r SAMPLE_MODELS_REPOvllm_model modelsMODEL1 cp r modelsMODEL1 modelsMODEL2 sed i sgpu_memory_utilization 0.5gpu_memory_utilization 0.4 modelsMODEL11model.json sed i sgpu_memory_utilization 0.5gpu_memory_utilization 0.4 modelsMODEL21model.json run_server if SERVER_PID 0 then cat SERVER_LOG echo e nn Failed to start SERVERn exit 1 fi Explicitly load model rm f .curl.out set e codecurl s w http_code o .curl.out X POST localhost8000v2repositorymodelsvllm_twoload set e assert_curl_success Failed to load vllm_two model kill SERVER_PID wait SERVER_PID rm rf .models if RET eq 1 then cat CLIENT_LOG cat SERVER_LOG echo e nn vLLM test FAILED. n else echo e nn vLLM test PASSED. n fi collect_artifacts_from_subdir exit RET File ciL0_backend_vllmvllm_backendvllm_backend_test.py Copyright 20232024, NVIDIA CORPORATION AFFILIATES. | |
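The curl call above drives Triton's model repository HTTP API to load vllm_two while the server runs with --model-control-mode=explicit. The same request in Python, assuming the default HTTP port 8000:

import requests

# Ask a server started with --model-control-mode=explicit to load the model;
# the test script asserts the HTTP status code is 200.
resp = requests.post("http://localhost:8000/v2/repository/models/vllm_two/load")
resp.raise_for_status()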
--- | |
import sys import unittest from functools import partial import numpy as np import tritonclient.grpc as grpcclient from tritonclient.utils import sys.path.append....common from test_util import TestResultCollector, UserData, callback, create_vllm_request PROMPTS The most dangerous animal is, The capital of France is, The future of AI is, SAMPLING_PARAMETERS temperature 0, top_p 1 class VLLMTritonBackendTestTestResultCollector def setUpself self.triton_client grpcclient.InferenceServerClienturllocalhost8001 self.vllm_model_name vllm_opt self.python_model_name add_sub self.ensemble_model_name ensemble_model self.vllm_load_test vllm_load_test def test_vllm_triton_backendself Load both vllm and add_sub models self.triton_client.load_modelself.vllm_load_test self.assertTrueself.triton_client.is_model_readyself.vllm_load_test self.triton_client.load_modelself.python_model_name self.assertTrueself.triton_client.is_model_readyself.python_model_name Test to ensure that ensemble models are supported in vllm container. If ensemble support not present, triton will error out at model loading stage. Ensemble Model is a pipeline consisting of 1 model vllm_opt self.triton_client.load_modelself.ensemble_model_name self.assertTrueself.triton_client.is_model_readyself.ensemble_model_name self.triton_client.unload_modelself.ensemble_model_name Unload vllm model and test add_sub model self.triton_client.unload_modelself.vllm_load_test self.assertFalseself.triton_client.is_model_readyself.vllm_load_test self._test_python_model Load vllm model and unload add_sub model self.triton_client.load_modelself.vllm_load_test self.assertTrueself.triton_client.is_model_readyself.vllm_load_test self.triton_client.unload_modelself.python_model_name self.assertFalseself.triton_client.is_model_readyself.python_model_name Test vllm model and unload vllm model self._test_vllm_model promptsPROMPTS, sampling_parametersSAMPLING_PARAMETERS, streamFalse, send_parameters_as_tensorTrue, model_nameself.vllm_load_test, self._test_vllm_model promptsPROMPTS, sampling_parametersSAMPLING_PARAMETERS, streamFalse, send_parameters_as_tensorFalse, model_nameself.vllm_load_test, self.triton_client.unload_modelself.vllm_load_test self.assertFalseself.triton_client.is_model_readyself.vllm_load_test def test_model_with_invalid_attributesself model_name vllm_invalid_1 with self.assertRaisesInferenceServerException self.triton_client.load_modelmodel_name def test_vllm_invalid_model_nameself model_name vllm_invalid_2 with self.assertRaisesInferenceServerException self.triton_client.load_modelmodel_name def test_exclude_input_in_output_defaultself Verifying default behavior for exclude_input_in_output in nonstreaming mode. | |
--- | |
Expected result prompt is returned with diffs. prompts The capital of France is, expected_output bThe capital of France is the capital of the French Republic.nnThe capital of France is the capital sampling_parameters temperature 0, top_p 1 self._test_vllm_model prompts, sampling_parameters, streamFalse, send_parameters_as_tensorTrue, expected_outputexpected_output, def test_exclude_input_in_output_falseself Verifying behavior for exclude_input_in_output False in nonstreaming mode. Expected result prompt is returned with diffs. | |
--- | |
Test vllm model and unload vllm model prompts The capital of France is, expected_output bThe capital of France is the capital of the French Republic.nnThe capital of France is the capital sampling_parameters temperature 0, top_p 1 self._test_vllm_model prompts, sampling_parameters, streamFalse, send_parameters_as_tensorTrue, exclude_input_in_outputFalse, expected_outputexpected_output, def test_exclude_input_in_output_trueself Verifying behavior for exclude_input_in_output True in nonstreaming mode. Expected result only diffs are returned. Test vllm model and unload vllm model prompts The capital of France is, expected_output b the capital of the French Republic.nnThe capital of France is the capital sampling_parameters temperature 0, top_p 1 self._test_vllm_model prompts, sampling_parameters, streamFalse, send_parameters_as_tensorTrue, exclude_input_in_outputTrue, expected_outputexpected_output, def _test_vllm_model self, prompts, sampling_parameters, stream, send_parameters_as_tensor, exclude_input_in_outputNone, expected_outputNone, model_namevllm_opt, user_data UserData number_of_vllm_reqs lenprompts self.triton_client.start_streamcallbackpartialcallback, user_data for i in rangenumber_of_vllm_reqs request_data create_vllm_request promptsi, i, stream, sampling_parameters, model_name, send_parameters_as_tensor, exclude_input_in_outputexclude_input_in_output, self.triton_client.async_stream_infer model_namemodel_name, request_idrequest_datarequest_id, inputsrequest_datainputs, outputsrequest_dataoutputs, parameterssampling_parameters, for i in rangenumber_of_vllm_reqs result user_data._completed_requests.get if typeresult is InferenceServerException printresult.message self.assertIsNottyperesult, InferenceServerException, strresult output result.as_numpytext_output self.assertIsNotNoneoutput, text_output should not be None if expected_output is not None self.assertEqual output, expected_outputi, Actual and expected outputs do not match.n Expected n Actual.format output, expected_outputi , self.triton_client.stop_stream def _test_python_modelself shape 4 input0_data np.random.randshape.astypenp.float32 input1_data np.random.randshape.astypenp.float32 inputs grpcclient.InferInput INPUT0, input0_data.shape, np_to_triton_dtypeinput0_data.dtype , grpcclient.InferInput INPUT1, input1_data.shape, np_to_triton_dtypeinput1_data.dtype , inputs0.set_data_from_numpyinput0_data inputs1.set_data_from_numpyinput1_data outputs grpcclient.InferRequestedOutputOUTPUT0, grpcclient.InferRequestedOutputOUTPUT1, response self.triton_client.infer self.python_model_name, inputs, request_id10, outputsoutputs self.assertTrue np.allcloseinput0_data input1_data, response.as_numpyOUTPUT0 self.assertTrue np.allcloseinput0_data input1_data, response.as_numpyOUTPUT1 def tearDownself self.triton_client.close if __name__ __main__ unittest.main File ciL0_backend_vllmtest.sh !binbash Copyright 2024, NVIDIA CORPORATION AFFILIATES. | |
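The three exclude_input_in_output tests above differ only in one optional flag: when it is unset or False in non-streaming mode the prompt is echoed back together with the completion, and when it is True only the completion is returned. A sketch of how a client could attach that flag, assuming it is the optional BOOL input named exclude_input_in_output that these tests pass through create_vllm_request:

import numpy as np
import tritonclient.grpc as grpcclient

# Optional BOOL input controlling whether the prompt is echoed back in
# text_output; the tests above exercise the unset, False and True cases.
exclude_input = grpcclient.InferInput("exclude_input_in_output", [1], "BOOL")
exclude_input.set_data_from_numpy(np.array([True], dtype=bool))
# Appended to the same inputs list as text_input, stream and
# sampling_parameters before async_stream_infer() is called.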
--- | |
RET=0
SUBTESTS="accuracy_test request_cancellation enabled_stream vllm_backend metrics_test"

python3 -m pip install tritonclient[grpc]

for TEST in ${SUBTESTS}; do
    (cd ${TEST} && bash -ex test.sh && cd ..)
    if [ $? -ne 0 ]; then
        echo "Subtest ${TEST} FAILED"
        RET=1
    fi
done

if [ $RET -eq 0 ]; then
    echo -e "\n***\n*** vLLM Backend Test Passed\n***"
else
    echo -e "\n***\n*** vLLM Backend Test FAILED\n***"
fi
exit $RET
---
File: ci/L0_check_health_vllm/check_health_test.py
Copyright 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
--- | |
import json import numpy as np import tritonclient.grpc as grpcclient class TestCheckHealth _grpc_url localhost8001 _model_name vllm_opt _sampling_parameters temperature 0, top_p 1 _prompt In this example, def _get_inputsself, prompt, streamTrue, sampling_parametersNone inputs inputs.appendgrpcclient.InferInputtext_input, 1, BYTES inputs1.set_data_from_numpy np.arrayprompt.encodeutf8, dtypenp.object_ inputs.appendgrpcclient.InferInputstream, 1, BOOL inputs1.set_data_from_numpynp.arraystream, dtypebool if sampling_parameters is not None inputs.appendgrpcclient.InferInputsampling_parameters, 1, BYTES inputs1.set_data_from_numpy np.array json.dumpssampling_parameters.encodeutf8, dtypenp.object_ return inputs def _callbackself, result, error self._responses.appendresult result, error error def _llm_inferself inputs self._get_inputs self._prompt, streamTrue, sampling_parametersself._sampling_parameters self._responses with grpcclient.InferenceServerClientself._grpc_url as client client.start_streamself._callback client.async_stream_infer self._model_name, inputsinputs, parametersself._sampling_parameters client.stop_stream def _assert_text_output_validself text_output for response in self._responses result, error responseresult, responseerror assert error is None text_output result.as_numpynametext_output0.decodeutf8 assert lentext_output 0, output is empty assert text_output.count 4, output is not a sentence def _assert_infer_exceptionself, expected_exception_message assert lenself._responses 1 for response in self._responses result, error responseresult, responseerror assert result is None assert strerror expected_exception_message def _assert_model_readyself, expected_readiness with grpcclient.InferenceServerClientself._grpc_url as client is_model_ready API assert client.is_model_readyself._model_name expected_readiness get_model_repository_index API model_state None for model_index in client.get_model_repository_index.models if model_index.name self._model_name assert model_state is None, duplicate model index found model_state model_index.state READY assert model_state expected_readiness def test_vllm_is_healthyself num_repeats 3 for i in rangenum_repeats self._assert_model_readyTrue self._llm_infer self._assert_text_output_valid self._assert_model_readyTrue def test_vllm_not_healthyself self._assert_model_readyTrue The 1st infer should complete successfully self._llm_infer self._assert_text_output_valid self._assert_model_readyTrue The 2nd infer should begin with health check failed self._llm_infer self._assert_infer_exception Model is unavailable due to unhealthy vLLM engine self._assert_model_readyFalse The 3rd infer should have model not found self._llm_infer self._assert_infer_exception Request for unknown model vllm_opt has no available versions self._assert_model_readyFalse File ciL0_check_health_vllmtest.sh !binbash Copyright 2024, NVIDIA CORPORATION AFFILIATES. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. | |
--- | |
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. export CUDA_VISIBLE_DEVICES0 source ..commonutil.sh pip3 install pytest8.1.1 pip3 install tritonclientgrpc RET0 function setup_model_repository local sample_model_repo_path1....samplesmodel_repository rm rf models vllm_baseline_output.pkl mkdir p models cp r sample_model_repo_pathvllm_model modelsvllm_opt function enable_health_check local enable_vllm_health_check1 echo e parameters modelsvllm_optconfig.pbtxt echo e key ENABLE_VLLM_HEALTH_CHECK modelsvllm_optconfig.pbtxt echo e value string_value enable_vllm_health_check modelsvllm_optconfig.pbtxt echo e modelsvllm_optconfig.pbtxt VLLM_INSTALL_PATHusrlocallibpython3.12distpackagesvllm function mock_vllm_async_llm_engine backup original file mv VLLM_INSTALL_PATHenginemultiprocessingclient.py VLLM_INSTALL_PATHenginemultiprocessingclient.py.backup cp VLLM_INSTALL_PATHenginemultiprocessingclient.py.backup VLLM_INSTALL_PATHenginemultiprocessingclient.py overwrite the original check_health method echo e VLLM_INSTALL_PATHenginemultiprocessingclient.py echo e async def check_healthself, check_count0 VLLM_INSTALL_PATHenginemultiprocessingclient.py echo e check_count0 1 VLLM_INSTALL_PATHenginemultiprocessingclient.py echo e if check_count0 1 VLLM_INSTALL_PATHenginemultiprocessingclient.py echo e raise RuntimeErrorSimulated vLLM check_health failure VLLM_INSTALL_PATHenginemultiprocessingclient.py function unmock_vllm_async_llm_engine restore from backup rm f VLLM_INSTALL_PATHenginemultiprocessingclient.py mv VLLM_INSTALL_PATHenginemultiprocessingclient.py.backup VLLM_INSTALL_PATHenginemultiprocessingclient.py function test_check_health local test_name1 local unit_test_name2 SERVER_LOGtest_name.server.log SERVER_ARGSmodelrepositorymodels modelcontrolmodeexplicit loadmodel run_server if SERVER_PID 0 then echo e nn Failed to start SERVERn cat SERVER_LOG exit 1 fi set e python3 m pytest junitxmltest_name.report.xml s v check_health_test.pyTestCheckHealthunit_test_name test_name.log if ? ne 0 then echo e nn test_name FAILED. | |
--- | |
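The mock_vllm_async_llm_engine helper above appends an override of check_health to vLLM's engine/multiprocessing/client.py so that the first health probe succeeds and every later one fails. Rendered as Python, the appended override looks roughly like this:

# Roughly what the mock_vllm_async_llm_engine helper appends: the mutable
# default argument keeps a call counter, the first health check passes, every
# subsequent one raises, so Triton marks vllm_opt unhealthy and unloads it.
async def check_health(self, check_count=[0]) -> None:
    check_count[0] += 1
    if check_count[0] > 1:
        raise RuntimeError("Simulated vLLM check_health failure")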
n RET1 fi set e kill SERVER_PID wait SERVER_PID Test health check unspecified setup_model_repository test_check_health health_check_unspecified test_vllm_is_healthy Test health check disabled setup_model_repository enable_health_check false test_check_health health_check_disabled test_vllm_is_healthy Test health check enabled setup_model_repository enable_health_check true test_check_health health_check_enabled test_vllm_is_healthy Mock check_health from vLLM mock_vllm_async_llm_engine Test health check unspecified with mocked vLLM check_health failure setup_model_repository test_check_health health_check_unspecified_mocked_failure test_vllm_is_healthy Test health check disabled with mocked vLLM check_health failure setup_model_repository enable_health_check false test_check_health health_check_disabled_mocked_failure test_vllm_is_healthy Test health check enabled with mocked vLLM check_health failure setup_model_repository enable_health_check true test_check_health health_check_enabled_mocked_failure test_vllm_not_healthy Unmock check_health unmock_vllm_async_llm_engine if RET eq 0 then echo e nn Test Passedn else echo e nn Test FAILEDn fi exit RET File ciL0_multi_gpu_vllmmulti_loradownload.py Copyright 2024, NVIDIA CORPORATION AFFILIATES. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. | |
---
from huggingface_hub import snapshot_download

if __name__ == "__main__":
    # download lora weight alpaca
    snapshot_download(
        repo_id="swathijn/GemmaDoll-2b-dolly-LORA-Tune",
        local_dir="./weights/loras/GemmaDoll",
        max_workers=8,
    )
    # download lora weight GemmaSheep
    snapshot_download(
        repo_id="eduardo-alvarez/GemmaSheep-2B-LORA-TUNED",
        local_dir="./weights/loras/GemmaSheep",
        max_workers=8,
    )
    # download backbone weight google/gemma-2b
    snapshot_download(
        repo_id="unsloth/gemma-2b",
        local_dir="./weights/backbone/gemma-2b",
        max_workers=8,
    )
---
File: ci/L0_multi_gpu_vllm/multi_lora/multi_lora_test.py
Copyright 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
--- | |
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import os import sys import unittest from functools import partial from typing import List import tritonclient.grpc as grpcclient from tritonclient.utils import sys.path.append....common from test_util import AsyncTestResultCollector, UserData, callback, create_vllm_request PROMPTS Instruct What do you think of Computer Science?nOutput SAMPLING_PARAMETERS temperature 0, top_p 1 server_enable_lora True class VLLMTritonLoraTestAsyncTestResultCollector def setUpself self.triton_client grpcclient.InferenceServerClienturllocalhost8001 self.vllm_model_name vllm_llama_multi_lora def _test_vllm_model self, prompts Liststr, sampling_parameters, lora_name Liststr, server_enable_loraTrue, streamFalse, exclude_input_in_outputNone, expected_outputNone, assert lenprompts len lora_name , The number of prompts and lora names should be the same user_data UserData number_of_vllm_reqs lenprompts self.triton_client.start_streamcallbackpartialcallback, user_data for i in rangenumber_of_vllm_reqs lora lora_namei if lora_name else None sam_para_copy sampling_parameters.copy if lora is not None sam_para_copylora_name lora request_data create_vllm_request promptsi, i, stream, sam_para_copy, self.vllm_model_name, exclude_input_in_outputexclude_input_in_output, self.triton_client.async_stream_infer model_nameself.vllm_model_name, request_idrequest_datarequest_id, inputsrequest_datainputs, outputsrequest_dataoutputs, parameterssampling_parameters, for i in rangenumber_of_vllm_reqs result user_data._completed_requests.get if typeresult is InferenceServerException printresult.message if server_enable_lora self.assertEqual strresult.message, fLoRA lora_namei is not supported, we currently support doll, sheep, InferenceServerException, else self.assertEqual strresult.message, LoRA feature is not enabled., InferenceServerException, self.triton_client.stop_stream return output result.as_numpytext_output self.assertIsNotNoneoutput, text_output should not be None if expected_output is not None self.assertEqual output, expected_outputi, Actual and expected outputs do not match.n Expected n Actual.format output, expected_outputi , self.triton_client.stop_stream def test_multi_lora_requestsself sampling_parameters temperature 0, top_p 1 make two requests separately to avoid the different arrival of response answers prompt_1 Instruct What do you think of Computer Science?nOutput lora_1 doll expected_output b I think it is a very interesting subject.nnInstruct What do you self._test_vllm_model prompt_1, sampling_parameters, lora_namelora_1, server_enable_loraserver_enable_lora, streamFalse, exclude_input_in_outputTrue, expected_outputexpected_output, prompt_2 Instruct Tell me more about soccernOutput lora_2 sheep expected_output b I love soccer. | |
--- | |
I play soccer every day.nInstruct Tell me self._test_vllm_model prompt_2, sampling_parameters, lora_namelora_2, server_enable_loraserver_enable_lora, streamFalse, exclude_input_in_outputTrue, expected_outputexpected_output, def test_none_exist_loraself prompts Instruct What is the capital city of France?nOutput, loras bactrian sampling_parameters temperature 0, top_p 1 self._test_vllm_model prompts, sampling_parameters, lora_nameloras, server_enable_loraserver_enable_lora, streamFalse, exclude_input_in_outputTrue, expected_outputNone, this request will lead to lora not supported error, so there is no expected output def tearDownself self.triton_client.close if __name__ __main__ server_enable_lora os.environ.getSERVER_ENABLE_LORA, false.lower true unittest.main File ciL0_multi_gpu_vllmmulti_loratest.sh !binbash Copyright 2024, NVIDIA CORPORATION AFFILIATES. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. | |
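The multi-LoRA tests above pick an adapter per request by adding lora_name to the sampling parameters; requests naming an unknown adapter are rejected with the "is not supported" error asserted earlier, and a server without enable_lora replies "LoRA feature is not enabled.". A minimal sketch of the request-side parameters, assuming the vllm_llama_multi_lora model from these tests:

import json

# Sampling parameters for a request routed through the "doll" adapter.
sampling_parameters = {
    "temperature": "0",
    "top_p": "1",
    "lora_name": "doll",
}
payload = json.dumps(sampling_parameters).encode("utf-8")
# `payload` is what ends up in the BYTES "sampling_parameters" input tensor of
# the vllm_llama_multi_lora model.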
--- | |
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. | |
--- | |
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. source ....commonutil.sh TRITON_DIRTRITON_DIRopttritonserver SERVERTRITON_DIRbintritonserver BACKEND_DIRTRITON_DIRbackends SERVER_ARGSmodelrepositorypwdmodels backenddirectoryBACKEND_DIR logverbose1 SERVER_LOG.multi_lora_server.log CLIENT_LOG.multi_lora_client.log TEST_RESULT_FILEtest_results.txt CLIENT_PY.multi_lora_test.py DOWNLOAD_PY.download.py SAMPLE_MODELS_REPO......samplesmodel_repository EXPECTED_NUM_TESTS2 first we download weights pip install U huggingface_hub rm rf weights mkdir p weightslorasGemmaDoll mkdir p weightslorasGemmaSheep mkdir p weightsbackbonegemma2b python3 DOWNLOAD_PY v CLIENT_LOG 21 rm rf models mkdir p models cp r SAMPLE_MODELS_REPOvllm_model modelsvllm_llama_multi_lora export SERVER_ENABLE_LORAtrue Check boolean flag value for enable_lora model_jsoncat EOF model.weightsbackbonegemma2b, disable_log_requests true, gpu_memory_utilization 0.7, tensor_parallel_size 2, block_size 16, enforce_eager true, enable_lora true, max_lora_rank 32, lora_extra_vocab_size 256, distributed_executor_backendray EOF echo model_json modelsvllm_llama_multi_lora1model.json multi_lora_jsoncat EOF doll .weightslorasGemmaDoll, sheep .weightslorasGemmaSheep EOF echo multi_lora_json modelsvllm_llama_multi_lora1multi_lora.json RET0 If it is the first time launching triton server with gemma2b and multilora feature, it may take more than 1 minutes. Please wait. | |
--- | |
SERVER_TIMEOUT60000 run_server if SERVER_PID 0 then cat SERVER_LOG echo e nn Failed to start SERVERn exit 1 fi set e python3 CLIENT_PY v CLIENT_LOG 21 if ? ne 0 then cat CLIENT_LOG echo e nn Running CLIENT_PY FAILED. n RET1 else check_test_results TEST_RESULT_FILE EXPECTED_NUM_TESTS if ? | |
--- | |
ne 0 then cat CLIENT_LOG echo e nn Test Result Verification FAILED.n RET1 fi fi set e kill SERVER_PID wait SERVER_PID Check string flag value for enable_lora model_jsoncat EOF model.weightsbackbonegemma2b, disable_log_requests true, gpu_memory_utilization 0.7, tensor_parallel_size 2, block_size 16, enforce_eager true, enable_lora true, max_lora_rank 32, lora_extra_vocab_size 256, distributed_executor_backendray EOF echo model_json modelsvllm_llama_multi_lora1model.json run_server if SERVER_PID 0 then cat SERVER_LOG echo e nn Failed to start SERVERn exit 1 fi set e python3 CLIENT_PY v CLIENT_LOG 21 if ? ne 0 then cat CLIENT_LOG echo e nn Running CLIENT_PY FAILED. n RET1 else check_test_results TEST_RESULT_FILE EXPECTED_NUM_TESTS if ? | |
--- | |
ne 0 then cat CLIENT_LOG echo e nn Test Result Verification FAILED.n RET1 fi fi set e kill SERVER_PID wait SERVER_PID disable lora export SERVER_ENABLE_LORAfalse check bool flag value for enable_lora model_jsoncat EOF model.weightsbackbonegemma2b, disable_log_requests true, gpu_memory_utilization 0.8, tensor_parallel_size 2, block_size 16, enforce_eager true, enable_lora false, lora_extra_vocab_size 256, distributed_executor_backendray EOF echo model_json modelsvllm_llama_multi_lora1model.json run_server if SERVER_PID 0 then cat SERVER_LOG echo e nn Failed to start SERVERn exit 1 fi set e python3 CLIENT_PY v CLIENT_LOG 21 if ? ne 0 then cat CLIENT_LOG echo e nn Running CLIENT_PY FAILED. n RET1 else check_test_results TEST_RESULT_FILE EXPECTED_NUM_TESTS if ? | |
--- | |
ne 0 then cat CLIENT_LOG echo e nn Test Result Verification FAILED.n RET1 fi fi set e kill SERVER_PID wait SERVER_PID disable lora export SERVER_ENABLE_LORAfalse check string flag value for enable_lora model_jsoncat EOF model.weightsbackbonegemma2b, disable_log_requests true, gpu_memory_utilization 0.8, tensor_parallel_size 2, block_size 16, enforce_eager true, enable_lora false, lora_extra_vocab_size 256, distributed_executor_backendray EOF echo model_json modelsvllm_llama_multi_lora1model.json run_server if SERVER_PID 0 then cat SERVER_LOG echo e nn Failed to start SERVERn exit 1 fi set e python3 CLIENT_PY v CLIENT_LOG 21 if ? ne 0 then cat CLIENT_LOG echo e nn Running CLIENT_PY FAILED. n RET1 else check_test_results TEST_RESULT_FILE EXPECTED_NUM_TESTS if ? | |
--- | |
ne 0 then cat CLIENT_LOG echo e nn Test Result Verification FAILED.n RET1 fi fi set e kill SERVER_PID wait SERVER_PID rm rf models rm rf weights if RET eq 1 then cat CLIENT_LOG cat SERVER_LOG echo e nn Multi LoRA test FAILED. n else echo e nn Multi LoRA test PASSED. n fi collect_artifacts_from_subdir exit RET File ciL0_multi_gpu_vllmvllm_backendtest.sh !binbash Copyright 20232024, NVIDIA CORPORATION AFFILIATES. | |
--- | |
All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
--- | |
Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |
--- | |
source ....commonutil.sh TRITON_DIRTRITON_DIRopttritonserver SERVERTRITON_DIRbintritonserver BACKEND_DIRTRITON_DIRbackends SERVER_ARGSmodelrepositorypwdmodels backenddirectoryBACKEND_DIR modelcontrolmodeexplicit logverbose1 TEST_RESULT_FILEtest_results.txt CLIENT_PY.vllm_multi_gpu_test.py SAMPLE_MODELS_REPO......samplesmodel_repository EXPECTED_NUM_TESTS1 Helpers function validate_file_contains local KEY1 local FILE2 if z KEY z FILE then echo Error KEY and FILE must be provided. return 1 fi if ! f FILE then echo Error File FILE does not exist. | |
--- | |
return 1 fi countgrep o w KEY FILE wc l if count ne 1 then echo Error KEY KEY found count times in FILE. Expected exactly once. return 1 fi function run_multi_gpu_test export KIND1 export TENSOR_PARALLELISM2 export INSTANCE_COUNT3 export DISTRIBUTED_EXECUTOR_BACKEND4 Setup a clean model repository export TEST_MODELvllm_opt_KIND_tpTENSOR_PARALLELISM_countINSTANCE_COUNT local TEST_MODEL_TRITON_CONFIGmodelsTEST_MODELconfig.pbtxt local TEST_MODEL_VLLM_CONFIGmodelsTEST_MODEL1model.json rm rf models mkdir p models cp r SAMPLE_MODELS_REPOvllm_model modelsTEST_MODEL sed i sKIND_MODELKIND TEST_MODEL_TRITON_CONFIG sed i 3s tensor_parallel_size TENSOR_PARALLELISM,n TEST_MODEL_VLLM_CONFIG if TENSOR_PARALLELISM ne 1 then jq arg backend DISTRIBUTED_EXECUTOR_BACKEND . | |
--- | |
distributed_executor_backendbackend TEST_MODEL_VLLM_CONFIG temp.json mv temp.json TEST_MODEL_VLLM_CONFIG fi Assert the correct kind is set in case the template config changes in the future validate_file_contains KIND TEST_MODEL_TRITON_CONFIG Start server echo Running multiGPU test with kindKIND, tpTENSOR_PARALLELISM, instance_countINSTANCE_COUNT SERVER_LOG.vllm_multi_gpu_testKIND_tpTENSOR_PARALLELISM_countINSTANCE_COUNTserver.log run_server if SERVER_PID 0 then cat SERVER_LOG echo e nn Failed to start SERVERn exit 1 fi Run unit tests set e CLIENT_LOG.vllm_multi_gpu_testKIND_tpTENSOR_PARALLELISM_countINSTANCE_COUNTclient.log python3 CLIENT_PY v CLIENT_LOG 21 if ? ne 0 then cat CLIENT_LOG echo e nn Running CLIENT_PY FAILED. n RET1 else check_test_results TEST_RESULT_FILE EXPECTED_NUM_TESTS if ? | |
--- | |
ne 0 then cat CLIENT_LOG echo e nn Test Result Verification FAILED.n RET1 fi fi set e Cleanup kill SERVER_PID wait SERVER_PID Test rm f .log RET0 Test the various cases of kind, tensor parallelism, and instance count for different ways to run multiGPU models with vLLM on Triton KINDSKIND_MODEL KIND_GPU TPS1 2 INSTANCE_COUNTS1 2 DISTRIBUTED_EXECUTOR_BACKENDray for kind in KINDS do for tp in TPS do for count in INSTANCE_COUNTS do run_multi_gpu_test kind tp count DISTRIBUTED_EXECUTOR_BACKEND done done done Results if RET eq 1 then echo e nn Multi GPU Utilization test FAILED. n else echo e nn Multi GPU Utilization test PASSED. n fi exit RET File ciL0_multi_gpu_vllmvllm_backendvllm_multi_gpu_test.py Copyright 20232024, NVIDIA CORPORATION AFFILIATES. | |
--- | |
All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
--- | |
Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |
--- | |
import os import sys import unittest from functools import partial import pynvml import tritonclient.grpc as grpcclient from tritonclient.utils import sys.path.append....common from test_util import TestResultCollector, UserData, callback, create_vllm_request class VLLMMultiGPUTestTestResultCollector def setUpself pynvml.nvmlInit self.triton_client grpcclient.InferenceServerClienturllocalhost8001 def get_gpu_memory_utilizationself, gpu_id handle pynvml.nvmlDeviceGetHandleByIndexgpu_id info pynvml.nvmlDeviceGetMemoryInfohandle return info.used def get_available_gpu_idsself device_count pynvml.nvmlDeviceGetCount available_gpus for gpu_id in rangedevice_count handle pynvml.nvmlDeviceGetHandleByIndexgpu_id if handle available_gpus.appendgpu_id return available_gpus def _test_vllm_multi_gpu_utilizationself, model_name str Test that loading a given vLLM model will increase GPU utilization across multiple GPUs, and run a sanity check inference to confirm that the loaded multigpumultiinstance model is working as expected. gpu_ids self.get_available_gpu_ids self.assertGreaterEquallengpu_ids, 2, Error Detected single GPU printnn Before Loading vLLM Model mem_util_before_loading_model for gpu_id in gpu_ids memory_utilization self.get_gpu_memory_utilizationgpu_id printfGPU gpu_id Memory Utilization memory_utilization bytes mem_util_before_loading_modelgpu_id memory_utilization self.triton_client.load_modelmodel_name self._test_vllm_modelmodel_name print After Loading vLLM Model vllm_model_used_gpus 0 for gpu_id in gpu_ids memory_utilization self.get_gpu_memory_utilizationgpu_id printfGPU gpu_id Memory Utilization memory_utilization bytes if memory_utilization mem_util_before_loading_modelgpu_id vllm_model_used_gpus 1 self.assertGreaterEqualvllm_model_used_gpus, 2 def _test_vllm_modelself, model_name str, send_parameters_as_tensor bool True user_data UserData stream False prompts The most dangerous animal is, The capital of France is, The future of AI is, number_of_vllm_reqs lenprompts sampling_parameters temperature 0, top_p 1 self.triton_client.start_streamcallbackpartialcallback, user_data for i in rangenumber_of_vllm_reqs request_data create_vllm_request promptsi, i, stream, sampling_parameters, model_name, send_parameters_as_tensor, self.triton_client.async_stream_infer model_namemodel_name, request_idrequest_datarequest_id, inputsrequest_datainputs, outputsrequest_dataoutputs, parameterssampling_parameters, for i in rangenumber_of_vllm_reqs result user_data._completed_requests.get self.assertIsNottyperesult, InferenceServerException output result.as_numpytext_output self.assertIsNotNoneoutput self.triton_client.stop_stream def test_multi_gpu_modelself Tests that a multiGPU vLLM model loads successfully on multiple GPUs and can handle a few sanity check inference requests. MultiGPU models are currently defined here as either a single model instance with tensor parallelism 1 multiple model instances each with tensor parallelism 1 FIXME This test currently skips over a few combinations that may be enhanced in the future, such as tensor parallel models with multiple model instances KIND_MODEL models with multiple model instances model os.environ.getTEST_MODEL kind os.environ.getKIND tp os.environ.getTENSOR_PARALLELISM instance_count os.environ.getINSTANCE_COUNT for env_var in model, kind, tp, instance_count self.assertIsNotNoneenv_var printfTest Matrix model, kind, tp, instance_count Only support tensor parallelism or multiple instances for now, but not both. | |
--- | |
Support for multiinstance tensor parallel models may require more special handling in the backend to better handle device assignment. NOTE This eliminates the 111 and 224 test cases. if inttp intinstance_count ! | |
--- | |
2 msg TENSOR_PARALLELISM and INSTANCE_COUNT must have a product of 2 for this 2GPU test printSkipping Test, msg self.skipTestmsg Loading a KIND_GPU model with Tensor Parallelism 1 should fail and recommend using KIND_MODEL instead for multigpu model instances. if kind KIND_GPU and inttp 1 with self.assertRaisesRegex InferenceServerException, please specify KIND_MODEL self._test_vllm_multi_gpu_utilizationmodel return Loading a KIND_MODEL model with multiple instances can cause oversubscription to specific GPUs and cause a CUDA OOM if the gpu_memory_utilization settings are high without further handling of device assignment in the backend. if kind KIND_MODEL and intinstance_count 1 msg Testing multiple model instances of KIND_MODEL is not implemented at this time printSkipping Test, msg self.skipTestmsg self._test_vllm_multi_gpu_utilizationmodel def tearDownself pynvml.nvmlShutdown self.triton_client.close if __name__ __main__ unittest.main File ciL0_multi_gpu_vllmtest.sh !binbash Copyright 2024, NVIDIA CORPORATION AFFILIATES. | |
--- | |
All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
--- | |
Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |
--- | |
RET0 SUBTESTSvllm_backend multi_lora python3 m pip install tritonclientgrpc for TEST in SUBTESTS do cd TEST bash ex test.sh cd .. if ? ne 0 then echo Subtest TEST FAILED RET1 fi done if RET eq 0 then echo e nn vLLM MultiGPU Tests Passedn else echo e nn vLLM MultiGPU Tests FAILEDn fi exit RET File docsadditional_outputs.md ! | |
--- | |
Copyright 2024, NVIDIA CORPORATION AFFILIATES. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. | |
--- | |
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. | |
--- | |
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. Additional Outputs from vLLM The vLLM backend supports sending additional outputs from vLLM on top of the usual text_output when requested. All additional outputs are disabled by default and they need to be enabled on a perrequest basis. | |
--- | |
If enabled, the corresponding output tensor will be set for all responses from the request. Supported Additional Outputs Finish Reason The reason why the sequence is finished. See https://github.com/vllm-project/vllm/blob/v0.6.3.post1/vllm/outputs.py#L26 for more details.
--- | |
To enable, set return_finish_reason input tensor to True. The reason will be sent as a string on the finish_reason output tensor. Cumulative Log Probabilities The cumulative log probability of the generated output text. | |
--- | |
See https://github.com/vllm-project/vllm/blob/v0.6.3.post1/vllm/outputs.py#L22 for more details. To enable, set the return_cumulative_logprob input tensor to True. The floating point value will be sent on the cumulative_logprob output tensor.
--- | |
Log Probabilities The log probabilities of the top probability tokens at each position, returned when logprobs (https://github.com/vllm-project/vllm/blob/v0.6.3.post1/vllm/sampling_params.py#L146-L152) are requested. Only the log probabilities of the new tokens generated since the last response are returned on each new response. See https://github.com/vllm-project/vllm/blob/v0.6.3.post1/vllm/outputs.py#L24-L25 for more details on the log probabilities.
--- | |
To enable, set return_logprobs input tensor to True. The log probabilities will be sent on the logprobs output tensor as a serialized JSON string. Number of Input Tokens The number of token IDs of the prompt. | |
--- | |
See https://github.com/vllm-project/vllm/blob/v0.6.3.post1/vllm/outputs.py#L79-L81 for more details. To enable, set the return_num_input_tokens input tensor to True. The unsigned integer value will be sent on the num_input_tokens output tensor.
--- | |
Number of Output Tokens The number of token IDs of the generated output text sent on this response. It is the difference between the number of output token IDs generated up to this response and up to the last response; if this is the first response, the last response length is presumed to be zero. For example, if the first response carries three new tokens and the next response carries two more, the value is 3 and then 2.
--- | |
See https://github.com/vllm-project/vllm/blob/v0.6.3.post1/vllm/outputs.py#L21 for more details on the token IDs of the generated output text. To enable, set the return_num_output_tokens input tensor to True. The unsigned integer value will be sent on the num_output_tokens output tensor.
--- | |
Examples

Add Finish Reason to Outputs

```python
import numpy as np
import tritonclient.grpc as grpcclient

inputs = []

inputs.append(grpcclient.InferInput("text_input", [1], "BYTES"))
inputs[-1].set_data_from_numpy(
    np.array(["example prompt".encode("utf-8")], dtype=np.object_)
)

inputs.append(grpcclient.InferInput("return_finish_reason", [1], "BOOL"))
inputs[-1].set_data_from_numpy(np.array([True], dtype=bool))

def callback(result, error):
    ...
    print(result.as_numpy(name="finish_reason"))

with grpcclient.InferenceServerClient("localhost:8001") as client:
    client.start_stream(callback)
    client.async_stream_infer("vLLM_model_name", inputs=inputs, ...)
    client.stop_stream()
```

Notes

Enabling additional outputs may impact performance; only add additional outputs when necessary.
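Building on the example above, the following is a minimal sketch of a callback that decodes several additional outputs at once. It assumes the matching return_* input tensors were set to True on the request; the tensor names are the ones documented in this file.

```python
import json

def callback(result, error):
    if error is not None:
        print(error)
        return
    # text_output is always present.
    print(result.as_numpy("text_output"))
    # These tensors are only set when the matching return_* input was True;
    # as_numpy returns None for outputs that were not requested.
    print("finish_reason:", result.as_numpy("finish_reason"))
    print("num_input_tokens:", result.as_numpy("num_input_tokens"))
    print("num_output_tokens:", result.as_numpy("num_output_tokens"))
    # logprobs arrives as a serialized JSON string.
    logprobs = result.as_numpy("logprobs")
    if logprobs is not None:
        print("logprobs:", json.loads(logprobs[0]))
```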
--- | |
File docshealth_check.md ! Copyright 2024, NVIDIA CORPORATION AFFILIATES. All rights reserved. | |
--- | |
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. | |
--- | |
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. vLLM Health Check BETA !NOTE The vLLM Health Check support is currently in BETA. | |
--- | |
Its features and functionality are subject to change as we collect feedback. We are excited to hear any thoughts you have! The vLLM backend supports checking for vLLM Engine Health (https://github.com/vllm-project/vllm/blob/v0.6.3.post1/vllm/engine/async_llm_engine.py#L1177-L1185) upon receiving each inference request.
--- | |
If the health check fails, the model state will become NOT Ready at the server, which can be queried by the Repository Index (https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_model_repository.md#index) or Model Ready (https://github.com/triton-inference-server/client/blob/main/src/c%2B%2B/library/http_client.h#L178-L192) APIs. The Health Check is disabled by default. To enable it, set the following parameter on the model config to true

```
parameters: {
  key: "ENABLE_VLLM_HEALTH_CHECK"
  value: { string_value: "true" }
}
```

and select Model Control Mode EXPLICIT (https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_management.md#model-control-mode-explicit) when the server is started.
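For instance, a client can observe the state transition through the model repository APIs. A minimal sketch using tritonclient follows; the model name vllm_model and the server address are placeholders, and the server is assumed to run with --model-control-mode=explicit.

```python
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

# With explicit model control, the client loads the model itself.
client.load_model("vllm_model")

# After a failed vLLM health check the model state becomes NOT Ready,
# so readiness flips to False and the repository index reports the new state.
print("ready:", client.is_model_ready("vllm_model"))
for model in client.get_model_repository_index().models:
    print(model.name, model.state, model.reason)
```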
--- | |
File docsllama_multi_lora_tutorial.md ! Copyright 2024, NVIDIA CORPORATION AFFILIATES. All rights reserved. | |
--- | |
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. | |
--- | |
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Tutorial on deploying multi-lora vLLM backend in Triton

The idea of multi-lora was proposed recently; for more, please refer to S-LoRA: Serving Thousands of Concurrent LoRA Adapters (https://arxiv.org/abs/2311.03285) and Punica: Multi-Tenant LoRA Serving (https://arxiv.org/abs/2310.18547). vLLM now supports multi-lora, integrating the Punica feature and its related CUDA kernels.
--- | |
See this PR (https://github.com/vllm-project/vllm/pull/1804) for more. 2024-01-24: this PR has been merged into the main branch of vLLM. The following tutorial demonstrates how to deploy a LLaMa model with multiple LoRAs on Triton Inference Server using Triton's Python-based (https://github.com/triton-inference-server/backend/blob/main/docs/python_based_backends.md#python-based-backends) vLLM (https://github.com/triton-inference-server/vllm_backend/tree/main) backend. Before you continue reading, it's important to note that all command-line instructions containing xx.yy in the document cannot be used directly by copying and pasting.
--- | |
xx.yy represents the Triton version, and you must specify the Triton version you want to use for the bash command to work.

---

Step 1: Start a docker container for triton+vllm serving

A docker container is strongly recommended for serving, and this tutorial will only demonstrate how to launch Triton in the docker environment. First, start a docker container using the tritonserver image with the vLLM backend from NGC (https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver/tags):

```bash
# NOTICE: you must first cd to your vllm_workspace path outside the container.
mkdir vllm_workspace
cd vllm_workspace

sudo docker run --gpus all -it --net host -p 8001:8001 --shm-size=12G \
  --ulimit memlock=-1 --ulimit stack=67108864 -v ${PWD}:/vllm_workspace \
  -w /vllm_workspace nvcr.io/nvidia/tritonserver:xx.yy-vllm-python-py3 /bin/bash
```

NOTICE: the version of the Triton docker image should be configured; here and throughout this tutorial we use xx.yy to symbolize the version.

Triton's vLLM container has been available starting from the 23.10 release, and experimental multi-lora support was added in the vLLM v0.3.0 release. Docker image version nvcr.io/nvidia/tritonserver:24.05-vllm-python-py3 or higher is strongly recommended.
--- | |
For pre-24.05 containers, the docker images didn't support the multi-lora feature, so you need to replace the model.py provided in the container (/opt/tritonserver/backends/vllm/model.py) with the most up-to-date version. Just follow this command:

Download the model.py script from GitHub:

```bash
wget -P /opt/tritonserver/backends/vllm https://raw.githubusercontent.com/triton-inference-server/vllm_backend/rxx.yy/src/model.py
```

This command will download the model.py script to the Triton vllm backend directory, which will enable the multi-lora feature.

Step 2: Prepare your weights

To support multi-lora on Triton, you need to manage your file paths for the model backbone and the LoRA weights separately.
--- | |
A typical weights repository can be as follows:

```
weights
├── backbone
│   └── llama-7b-hf
└── loras
    ├── alpaca-lora-7b
    └── wizardLM-lora-7b
```

A workspace for vLLM, together with the model backbone weights and the LoRA adapter weights, is strongly recommended. You should lay out the storage of these weight files so that they are logically organized in the workspace.

Step 3: Prepare the model repository for Triton Server

__3.1 Download the model repository files__

To use Triton, a model repository is needed that holds the model path, the backend configuration and other information.
--- | |
The vLLM backend is implemented on top of the Python backend, and the sampling_params of vLLM are read from model.json. To create a Triton model repository, you may download the files through these commands:

```bash
# NOTICE: you must first cd to your vllm_workspace path.
cd vllm_workspace
mkdir -p model_repository/vllm_model/1
wget -P model_repository/vllm_model/1 https://raw.githubusercontent.com/triton-inference-server/vllm_backend/rxx.yy/samples/model_repository/vllm_model/1/model.json
wget -P model_repository/vllm_model https://raw.githubusercontent.com/triton-inference-server/vllm_backend/rxx.yy/samples/model_repository/vllm_model/config.pbtxt
```

The model repository should look like this:

```
model_repository/
└── vllm_model/
    ├── 1/
    │   └── model.json
    └── config.pbtxt
```

Now you have finished the basic deployment, and the file structure should look like this:

```
vllm_workspace/
├── weights/
│   ├── backbone/
│   │   └── llama-7b-hf/
│   └── loras/
│       ├── alpaca-lora-7b/
│       └── bactrian-x-llama-lora-7b/
└── model_repository/
    └── vllm_model/
        ├── 1/
        │   └── model.json
        └── config.pbtxt
```

__3.2 Populate model.json__

For this tutorial we will use the following set of parameters, specified in model.json.
--- | |
```json
{
    "model": "/vllm_workspace/weights/backbone/llama-7b-hf",
    "disable_log_requests": true,
    "gpu_memory_utilization": 0.8,
    "tensor_parallel_size": 2,
    "block_size": 16,
    "enforce_eager": true,
    "enable_lora": true,
    "max_lora_rank": 16
}
```

model: The path to your model repository. disable_log_requests: Whether to show request logs when launching vLLM or not. gpu_memory_utilization: The GPU memory allocated for the model weights and the vLLM PagedAttention KV cache manager. tensor_parallel_size: vLLM now supports tensor parallelism, so you can decide how many GPUs you want to use for serving.
--- | |
block_size: The vLLM KV cache block size. enable_lora: Set this to true if you want to enable vLLM's multi-lora support. max_lora_rank: The maximum LoRA rank among your LoRA adapters.
--- | |
The full set of parameters can be found at https://github.com/Yard1/vllm/blob/multi_lora/vllm/engine/arg_utils.py#L11.
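As a sanity check, every key placed in model.json must map to a vLLM engine argument, because the backend's model.py (included later in this dump) loads the JSON and expands it into AsyncEngineArgs. A rough sketch of that step, using this tutorial's example path rather than the backend's own path resolution:

```python
import json

from vllm.engine.arg_utils import AsyncEngineArgs

# Read the engine configuration that Step 3.2 placed next to the model version.
with open("model_repository/vllm_model/1/model.json") as f:
    vllm_engine_config = json.load(f)

# Every key in model.json must be a valid AsyncEngineArgs field,
# e.g. tensor_parallel_size, enable_lora, max_lora_rank.
engine_args = AsyncEngineArgs(**vllm_engine_config)
print(engine_args.tensor_parallel_size, engine_args.enable_lora)
```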
--- | |
__3.3 Specify local lora path__

vLLM v0.4.0.post1 supports applying locally stored LoRA weights at inference time, which means that vLLM cannot pull any LoRA adapter from Hugging Face, so Triton needs to know where the local LoRA weights are.

Create a multi_lora.json file under the model_repository/vllm_model/1/ path:

```bash
cd model_repository/vllm_model/1
touch multi_lora.json
```

The content of multi_lora.json should look like this:

```json
{
    "alpaca": "/vllm_workspace/weights/loras/alpaca-lora-7b",
    "bactrian": "/vllm_workspace/weights/loras/bactrian-x-llama-7b-lora"
}
```

The key should be the supported LoRA name, and the value should be the specific path on your machine.

Warning: if you set enable_lora to true in model.json without creating a multi_lora.json file, the server will throw a FileNotFoundError when initializing.
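Once the server is up (Step 4 below), a client selects an adapter at request time by adding a lora_name entry to the sampling parameters, as the multi-lora tests earlier in this dump do. A minimal sketch of building such a request with the gRPC client; the prompt text is a placeholder, and the tensor names follow the sample model:

```python
import json

import numpy as np
import tritonclient.grpc as grpcclient

# "alpaca" must be one of the keys defined in multi_lora.json.
sampling_parameters = {"temperature": 0, "top_p": 1, "lora_name": "alpaca"}

inputs = []
inputs.append(grpcclient.InferInput("text_input", [1], "BYTES"))
inputs[-1].set_data_from_numpy(
    np.array(["What is the capital city of France?".encode("utf-8")], dtype=np.object_)
)
inputs.append(grpcclient.InferInput("sampling_parameters", [1], "BYTES"))
inputs[-1].set_data_from_numpy(
    np.array([json.dumps(sampling_parameters).encode("utf-8")], dtype=np.object_)
)
# The model uses the decoupled transaction policy, so the request is sent over
# a stream, e.g. with async_stream_infer as done in samples/client.py.
```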
--- | |
Step 4: Launch Triton

```bash
# NOTICE: you must first cd to your vllm_workspace path.
cd vllm_workspace
tritonserver --model-store ./model_repository
```

After you start Triton you will see output on the console showing the server starting up and loading the model. When you see output like the following, Triton is ready to accept inference requests.

```
I1030 22:33:28.291908 1 grpc_server.cc:2513] Started GRPCInferenceService at 0.0.0.0:8001
I1030 22:33:28.292879 1 http_server.cc:4497] Started HTTPService at 0.0.0.0:8000
I1030 22:33:28.335154 1 http_server.cc:270] Started Metrics Service at 0.0.0.0:8002
```

Step 5: Send a request

A client request script for multi-lora was prepared; download the client script from source:

```bash
wget https://raw.githubusercontent.com/triton-inference-server/vllm_backend/main/samples/client.py
wget https://raw.githubusercontent.com/triton-inference-server/vllm_backend/main/samples/prompts.txt
```

Try running this script with the following command:

```bash
python3 client.py -l your-prepared-lora-name
```

Here we assume you have prepared the alpaca LoRA weights, thus we use:

```bash
python3 client.py -l alpaca
```

File samplesmodel_repositoryvllm_model1model.json

```json
{
    "model": "facebook/opt-125m",
    "disable_log_requests": true,
    "gpu_memory_utilization": 0.5,
    "enforce_eager": true
}
```

File samplesmodel_repositoryvllm_modelconfig.pbtxt Copyright 2023, NVIDIA CORPORATION AFFILIATES.
--- | |
All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
--- | |
Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |
--- | |
Note: You do not need to change any fields in this configuration.

backend: "vllm"

# The usage of device is deferred to the vLLM engine
instance_group [
  {
    count: 1
    kind: KIND_MODEL
  }
]

File samplesclient.py #!/usr/bin/env python3 Copyright 2024, NVIDIA CORPORATION AFFILIATES. All rights reserved.
--- | |
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. | |
--- | |
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import argparse import asyncio import json import sys import numpy as np import tritonclient.grpc.aio as grpcclient from tritonclient.utils import class LLMClient def __init__self, flags argparse.Namespace self._flags flags self._results_dict def get_triton_clientself try triton_client grpcclient.InferenceServerClient urlself._flags.url, verboseself._flags.verbose, except Exception as e printchannel creation failed stre sys.exit return triton_client async def async_request_iterator self, prompts, sampling_parameters, exclude_input_in_output try for iter in rangeself._flags.iterations for i, prompt in enumerateprompts prompt_id self._flags.offset lenprompts iter i self._results_dictstrprompt_id yield self.create_request prompt, self._flags.streaming_mode, prompt_id, sampling_parameters, exclude_input_in_output, except Exception as error printfCaught an error in the request iterator error async def stream_inferself, prompts, sampling_parameters, exclude_input_in_output try triton_client self.get_triton_client Start streaming response_iterator triton_client.stream_infer inputs_iteratorself.async_request_iterator prompts, sampling_parameters, exclude_input_in_output , stream_timeoutself._flags.stream_timeout, async for response in response_iterator yield response except InferenceServerException as error printerror sys.exit1 async def process_stream self, prompts, sampling_parameters, exclude_input_in_output Clear results in between process_stream calls self.results_dict success True Read response from the stream async for response in self.stream_infer prompts, sampling_parameters, exclude_input_in_output result, error response if error printfEncountered error while processing error success False else output result.as_numpytext_output for i in output self._results_dictresult.get_response.id.appendi return success async def runself Sampling parameters for text generation including temperature, top_p, top_k, max_tokens, early_stopping. | |
--- | |
Full list available at httpsgithub.comvllmprojectvllmblob5255d99dc595f9ae7647842242d6542aa4145a4fvllmsampling_params.pyL23 sampling_parameters temperature 0.1, top_p 0.95, max_tokens 100, exclude_input_in_output self._flags.exclude_inputs_in_outputs if self._flags.lora_name is not None sampling_parameterslora_name self._flags.lora_name with openself._flags.input_prompts, r as file printfLoading inputs from self._flags.input_prompts... prompts file.readlines success await self.process_stream prompts, sampling_parameters, exclude_input_in_output with openself._flags.results_file, w as file for id in self._results_dict.keys for result in self._results_dictid file.writeresult.decodeutf8 file.writen file.writennn printfStoring results into self._flags.results_file... if self._flags.verbose with openself._flags.results_file, r as file printfnContents of self._flags.results_file printfile.read if success printPASS vLLM example else printFAIL vLLM example def run_asyncself asyncio.runself.run def create_request self, prompt, stream, request_id, sampling_parameters, exclude_input_in_output, send_parameters_as_tensorTrue, inputs prompt_data np.arrayprompt.encodeutf8, dtypenp.object_ try inputs.appendgrpcclient.InferInputtext_input, 1, BYTES inputs1.set_data_from_numpyprompt_data except Exception as error printfEncountered an error during request creation error stream_data np.arraystream, dtypebool inputs.appendgrpcclient.InferInputstream, 1, BOOL inputs1.set_data_from_numpystream_data Request parameters are not yet supported via BLS. | |
--- | |
Provide an optional mechanism to send serialized parameters as an input tensor until support is added if send_parameters_as_tensor sampling_parameters_data np.array json.dumpssampling_parameters.encodeutf8, dtypenp.object_ inputs.appendgrpcclient.InferInputsampling_parameters, 1, BYTES inputs1.set_data_from_numpysampling_parameters_data inputs.appendgrpcclient.InferInputexclude_input_in_output, 1, BOOL inputs1.set_data_from_numpynp.arrayexclude_input_in_output, dtypebool Add requested outputs outputs outputs.appendgrpcclient.InferRequestedOutputtext_output Issue the asynchronous sequence inference. return model_name self._flags.model, inputs inputs, outputs outputs, request_id strrequest_id, parameters sampling_parameters, if __name__ __main__ parser argparse.ArgumentParser parser.add_argument m, model, typestr, requiredFalse, defaultvllm_model, helpModel name, parser.add_argument v, verbose, actionstore_true, requiredFalse, defaultFalse, helpEnable verbose output, parser.add_argument u, url, typestr, requiredFalse, defaultlocalhost8001, helpInference server URL and its gRPC port. Default is localhost8001., parser.add_argument t, streamtimeout, typefloat, requiredFalse, defaultNone, helpStream timeout in seconds. | |
--- | |
Default is None., parser.add_argument offset, typeint, requiredFalse, default0, helpAdd offset to request IDs used, parser.add_argument inputprompts, typestr, requiredFalse, defaultprompts.txt, helpText file with input prompts, parser.add_argument resultsfile, typestr, requiredFalse, defaultresults.txt, helpThe file with output results, parser.add_argument iterations, typeint, requiredFalse, default1, helpNumber of iterations through the prompts file, parser.add_argument s, streamingmode, actionstore_true, requiredFalse, defaultFalse, helpEnable streaming mode, parser.add_argument excludeinputsinoutputs, actionstore_true, requiredFalse, defaultFalse, helpExclude prompt from outputs, parser.add_argument l, loraname, typestr, requiredFalse, defaultNone, helpThe querying LoRA name, FLAGS parser.parse_args client LLMClientFLAGS client.run_async File samplesprompts.txt Hello, my name is The most dangerous animal is The capital of France is The future of AI is File srcutilsmetrics.py Copyright 2024, NVIDIA CORPORATION AFFILIATES. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. | |
--- | |
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. | |
--- | |
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import queue import threading from typing import Dict, List, Union import triton_python_backend_utils as pb_utils from vllm.engine.metrics import StatLoggerBase as VllmStatLoggerBase from vllm.engine.metrics import Stats as VllmStats from vllm.engine.metrics import SupportsMetricsInfo, build_1_2_5_buckets class TritonMetrics def __init__self, labels Liststr, max_model_len int Initialize metric families Iteration stats self.counter_prompt_tokens_family pb_utils.MetricFamily namevllmprompt_tokens_total, descriptionNumber of prefill tokens processed., kindpb_utils.MetricFamily.COUNTER, self.counter_generation_tokens_family pb_utils.MetricFamily namevllmgeneration_tokens_total, descriptionNumber of generation tokens processed., kindpb_utils.MetricFamily.COUNTER, self.histogram_time_to_first_token_family pb_utils.MetricFamily namevllmtime_to_first_token_seconds, descriptionHistogram of time to first token in seconds., kindpb_utils.MetricFamily.HISTOGRAM, self.histogram_time_per_output_token_family pb_utils.MetricFamily namevllmtime_per_output_token_seconds, descriptionHistogram of time per output token in seconds., kindpb_utils.MetricFamily.HISTOGRAM, Request stats Latency self.histogram_e2e_time_request_family pb_utils.MetricFamily namevllme2e_request_latency_seconds, descriptionHistogram of end to end request latency in seconds., kindpb_utils.MetricFamily.HISTOGRAM, Metadata self.histogram_num_prompt_tokens_request_family pb_utils.MetricFamily namevllmrequest_prompt_tokens, descriptionNumber of prefill tokens processed., kindpb_utils.MetricFamily.HISTOGRAM, self.histogram_num_generation_tokens_request_family pb_utils.MetricFamily namevllmrequest_generation_tokens, descriptionNumber of generation tokens processed., kindpb_utils.MetricFamily.HISTOGRAM, self.histogram_n_request_family pb_utils.MetricFamily namevllmrequest_params_n, descriptionHistogram of the n request parameter., kindpb_utils.MetricFamily.HISTOGRAM, Initialize metrics Iteration stats self.counter_prompt_tokens self.counter_prompt_tokens_family.Metric labelslabels self.counter_generation_tokens self.counter_generation_tokens_family.Metric labelslabels Use the same bucket boundaries from vLLM sample metrics as an example. 
httpsgithub.comvllmprojectvllmblob21313e09e3f9448817016290da20d0db1adf3664vllmenginemetrics.pyL81L96 self.histogram_time_to_first_token self.histogram_time_to_first_token_family.Metric labelslabels, buckets 0.001, 0.005, 0.01, 0.02, 0.04, 0.06, 0.08, 0.1, 0.25, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0, , self.histogram_time_per_output_token self.histogram_time_per_output_token_family.Metric labelslabels, buckets 0.01, 0.025, 0.05, 0.075, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 0.75, 1.0, 2.5, , Request stats Latency self.histogram_e2e_time_request self.histogram_e2e_time_request_family.Metric labelslabels, buckets1.0, 2.5, 5.0, 10.0, 15.0, 20.0, 30.0, 40.0, 50.0, 60.0, Metadata self.histogram_num_prompt_tokens_request self.histogram_num_prompt_tokens_request_family.Metric labelslabels, bucketsbuild_1_2_5_bucketsmax_model_len, self.histogram_num_generation_tokens_request self.histogram_num_generation_tokens_request_family.Metric labelslabels, bucketsbuild_1_2_5_bucketsmax_model_len, self.histogram_n_request self.histogram_n_request_family.Metric labelslabels, buckets1, 2, 5, 10, 20, class VllmStatLoggerVllmStatLoggerBase StatLogger is used as an adapter between vLLM stats collector and Triton metrics provider. | |
--- | |
def __init__self, labels Dict, max_model_len int, log_logger None Tracked stats over current local logging interval. local_interval not used here. Its for vLLM logs to stdout. | |
--- | |
super.__init__local_interval0 self.metrics TritonMetricslabels, max_model_len self.log_logger log_logger Starting the metrics thread. It allows vLLM to keep making progress while reporting metrics to triton metrics service. self._logger_queue queue.Queue self._logger_thread threading.Threadtargetself.logger_loop self._logger_thread.start def infoself, type str, obj SupportsMetricsInfo None pass def _log_counterself, counter, data Unionint, float None Convenience function for logging to counter. | |
--- | |
Args counter A counter metric instance. data An int or float to increment the count metric. Returns None if data ! | |
--- | |
0 self._logger_queue.put_nowaitcounter, increment, data def _log_histogramself, histogram, data UnionListint, Listfloat None Convenience function for logging list to histogram. Args histogram A histogram metric instance. data A list of int or float data to observe into the histogram metric. | |
--- | |
Returns None for datum in data self._logger_queue.put_nowaithistogram, observe, datum def logself, stats VllmStats None Report stats to Triton metrics server. Args stats Created by LLMEngine for use by VllmStatLogger. Returns None The list of vLLM metrics reporting to Triton is also documented here. | |
--- | |
httpsgithub.comtritoninferenceservervllm_backendblobmainREADME.mdtritonmetrics counter_metrics self.metrics.counter_prompt_tokens, stats.num_prompt_tokens_iter, self.metrics.counter_generation_tokens, stats.num_generation_tokens_iter, histogram_metrics self.metrics.histogram_time_to_first_token, stats.time_to_first_tokens_iter, , self.metrics.histogram_time_per_output_token, stats.time_per_output_tokens_iter, , self.metrics.histogram_e2e_time_request, stats.time_e2e_requests, self.metrics.histogram_num_prompt_tokens_request, stats.num_prompt_tokens_requests, , self.metrics.histogram_num_generation_tokens_request, stats.num_generation_tokens_requests, , self.metrics.histogram_n_request, stats.n_requests, for metric, data in counter_metrics self._log_countermetric, data for metric, data in histogram_metrics self._log_histogrammetric, data def logger_loopself while True item self._logger_queue.get To signal shutdown a None item will be added to the queue. if item is None break metric, command, data item if command increment metric.incrementdata elif command observe metric.observedata else self.log_logger.log_errorfUndefined command name command def finalizeself Shutdown the logger thread. self._logger_queue.putNone if self._logger_thread is not None self._logger_thread.join self._logger_thread None File srcmodel.py Copyright 20232024, NVIDIA CORPORATION AFFILIATES. | |
--- | |
All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
--- | |
Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |
--- | |
import asyncio import base64 import gc import json import os import queue import threading from io import BytesIO from typing import Dict, List import numpy as np import torch import triton_python_backend_utils as pb_utils from PIL import Image from vllm.engine.arg_utils import AsyncEngineArgs from vllm.entrypoints.openai.api_server import build_async_engine_client_from_engine_args, from vllm.lora.request import LoRARequest from vllm.sampling_params import SamplingParams from vllm.utils import random_uuid from utils.metrics import VllmStatLogger _VLLM_ENGINE_ARGS_FILENAME model.json _MULTI_LORA_ARGS_FILENAME multi_lora.json class TritonPythonModel classmethod def auto_complete_configcls, auto_complete_model_config Add inputsoutputs to the model config. cls._auto_complete_inputs_and_outputsauto_complete_model_config We need to use decoupled transaction policy for saturating vLLM engine for max throughtput. TODO DLIS5233 Allow asynchronous execution to lift this restriction for cases there is exactly a single response to a single request. | |
--- | |
auto_complete_model_config.set_model_transaction_policydictdecoupledTrue Disabling batching in Triton, let vLLM handle the batching on its own. auto_complete_model_config.set_max_batch_size0 return auto_complete_model_config staticmethod def _auto_complete_inputs_and_outputsauto_complete_model_config Inputs expected by the backend. inputs name text_input, data_type TYPE_STRING, dims 1, name image, data_type TYPE_STRING, dims 1, can be multiple images as separate elements optional True, , name stream, data_type TYPE_BOOL, dims 1, optional True, , name sampling_parameters, data_type TYPE_STRING, dims 1, optional True, , name exclude_input_in_output, data_type TYPE_BOOL, dims 1, optional True, , name return_finish_reason, data_type TYPE_BOOL, dims 1, optional True, , name return_cumulative_logprob, data_type TYPE_BOOL, dims 1, optional True, , name return_logprobs, data_type TYPE_BOOL, dims 1, optional True, , name return_num_input_tokens, data_type TYPE_BOOL, dims 1, optional True, , name return_num_output_tokens, data_type TYPE_BOOL, dims 1, optional True, , Outputs expected by the backend. | |
--- | |
outputs name text_output, data_type TYPE_STRING, dims 1, name finish_reason, data_type TYPE_STRING, dims 1, name cumulative_logprob, data_type TYPE_FP32, dims 1, name logprobs, data_type TYPE_STRING, dims 1, name num_input_tokens, data_type TYPE_UINT32, dims 1, name num_output_tokens, data_type TYPE_UINT32, dims 1, Collect input and output names from the provided model config. config auto_complete_model_config.as_dict input_names output_names for input in configinput input_names.appendinputname for output in configoutput output_names.appendoutputname Add missing inputs and outputs to the model config. for input in inputs if inputname not in input_names auto_complete_model_config.add_inputinput for output in outputs if outputname not in output_names auto_complete_model_config.add_outputoutput def initializeself, args self.args args self.logger pb_utils.Logger self.model_config json.loadsargsmodel_config output_config pb_utils.get_output_config_by_name self.model_config, text_output self.output_dtype pb_utils.triton_string_to_numpyoutput_configdata_type Setup vLLM engine health check self._enable_health_check self._get_bool_config_param ENABLE_VLLM_HEALTH_CHECK self._is_healthy True Initialize engine arguments TODO Move this into _init_engine, after moving check metrics enabled. | |
--- | |
self._init_engine_args Check if metrics are enabled. The ZMQ process cannot be used when metrics are enabled. TODO Move the check into _setup_metrics. | |
--- | |
self._enable_metrics self._get_bool_config_paramREPORT_CUSTOM_METRICS and not self._aync_engine_args.disable_log_stats Starting the vLLM engine and its event thread running the AsyncIO event loop. self._init_engine Setup vLLM metrics self._setup_metrics Starting the response thread. It allows vLLM to keep making progress while response senders are sending responses to server frontend. | |
--- | |
self._response_queue queue.Queue self._response_thread threading.Threadtargetself._response_loop self._response_thread.start def _init_engine_argsself Currently, Triton needs to use decoupled policy for asynchronously forwarding requests to vLLM engine, so assert it. self.using_decoupled pb_utils.using_decoupled_model_transaction_policy self.model_config assert self.using_decoupled , vLLM Triton backend must be configured to use decoupled model transaction policy engine_args_filepath os.path.join pb_utils.get_model_dir, _VLLM_ENGINE_ARGS_FILENAME assert os.path.isfile engine_args_filepath , f_VLLM_ENGINE_ARGS_FILENAME containing vllm engine args must be provided in pb_utils.get_model_dir with openengine_args_filepath as file self.vllm_engine_config json.loadfile Validate device and multiprocessing settings are currently set based on modelconfigs. self._validate_device_config Check for LoRA config and set it up if enabled self._setup_lora Create an AsyncEngineArgs from the config from JSON self._aync_engine_args AsyncEngineArgsself.vllm_engine_config def _init_engineself Run the engine in a separate thread running the AsyncIO event loop. | |
--- | |
self._llm_engine None self._llm_engine_start_cv threading.Condition self._llm_engine_shutdown_event asyncio.Event self._event_thread threading.Thread targetasyncio.run, argsself._run_llm_engine, self._event_thread.start with self._llm_engine_start_cv while self._llm_engine is None self._llm_engine_start_cv.wait The threading.Thread will not raise the exception here should the engine failed to start, so the exception is passed back via the engine variable. if isinstanceself._llm_engine, Exception e self._llm_engine self.logger.log_errorfvllm Failed to start engine e if self._event_thread is not None self._event_thread.join self._event_thread None raise e async def _run_llm_engineself Counter to keep track of ongoing request counts. self._ongoing_request_count 0 try Start the vLLM engine. | |
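The start-up handshake described above (run an asyncio loop in a worker thread, block the caller until the engine or a start-up exception is published) can be illustrated with a generic, runnable sketch; EngineHolder is an illustrative name and object() stands in for the real engine:
import asyncio
import threading

class EngineHolder:
    def __init__(self):
        self._engine = None
        self._loop = None
        self._shutdown = None
        self._start_cv = threading.Condition()
        self._thread = threading.Thread(target=asyncio.run, args=(self._run(),))
        self._thread.start()
        # Block until the worker thread publishes either the engine or an exception.
        with self._start_cv:
            while self._engine is None:
                self._start_cv.wait()
        if isinstance(self._engine, Exception):
            raise self._engine

    async def _run(self):
        try:
            self._loop = asyncio.get_running_loop()
            self._shutdown = asyncio.Event()
            engine = object()  # stand-in for the real async engine
            with self._start_cv:
                self._engine = engine
                self._start_cv.notify_all()
            await self._shutdown.wait()  # keep the loop alive until finalize
        except Exception as e:
            with self._start_cv:
                self._engine = e  # pass the start-up failure back to the constructor
                self._start_cv.notify_all()

    def finalize(self):
        # Signal shutdown from another thread and wait for the loop to exit.
        self._loop.call_soon_threadsafe(self._shutdown.set)
        self._thread.join()

holder = EngineHolder()
holder.finalize()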
--- | |
The engine lives for the scope of this with statement. TODO Metrics should work with ZMQ enabled. async with build_async_engine_client_from_engine_args engine_argsself._aync_engine_args, disable_frontend_multiprocessingself._enable_metrics, as engine Capture the engine event loop and make it visible to other threads. | |
--- | |
self._event_loop asyncio.get_running_loop Signal the engine is started and make it visible to other threads. with self._llm_engine_start_cv self._llm_engine engine self._llm_engine_start_cv.notify_all Wait for the engine shutdown signal. await self._llm_engine_shutdown_event.wait Wait for the ongoing requests to complete. | |
--- | |
while self._ongoing_request_count 0 self.logger.log_info vllm Awaiting remaining requests.format self._ongoing_request_count await asyncio.sleep1 Cancel all tasks in the event loop. for task in asyncio.all_tasksloopself._event_loop if task is not asyncio.current_task task.cancel except Exception as e Signal and pass the exception back via the engine variable if the engine failed to start. If the engine has started, reraise the exception. | |
--- | |
with self._llm_engine_start_cv if self._llm_engine is None self._llm_engine e self._llm_engine_start_cv.notify_all return raise e self._llm_engine None self.logger.log_infovllm Shutdown complete def _validate_device_configself triton_kind self.argsmodel_instance_kind triton_device_id intself.argsmodel_instance_device_id triton_instance fself.argsmodel_name_triton_device_id Tritons current definition of KIND_GPU makes assumptions that models only use a single GPU. For multiGPU models, the recommendation is to specify KIND_MODEL to acknowledge that the model will take control of the devices made available to it. NOTE Consider other parameters that would indicate multiGPU in the future. | |
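For reference, the KIND_MODEL recommendation above corresponds to an instance_group entry in the model's config.pbtxt along these lines (illustrative, not copied from the repo):
instance_group [
  {
    count: 1
    kind: KIND_MODEL
  }
]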
--- | |
tp_size intself.vllm_engine_config.gettensor_parallel_size, 1 if tp_size 1 and triton_kind GPU raise ValueError KIND_GPU is currently for singleGPU models, please specify KIND_MODEL in the models config.pbtxt for multiGPU models If KIND_GPU is specified, specify the device ID assigned by Triton to ensure that multiple model instances do not oversubscribe the same default device. if triton_kind GPU and triton_device_id 0 self.logger.log_info fDetected KIND_GPU model instance, explicitly setting GPU devicetriton_device_id for triton_instance vLLM doesnt currently v0.4.2 expose device selection in the APIs torch.cuda.set_devicetriton_device_id def _setup_loraself self.enable_lora False Check if enable_lora field is in the model.json, and if it is, read its contents, which can be string or bool. if enable_lora in self.vllm_engine_config.keys and strself.vllm_engine_configenable_lora.lower true create Triton LoRA weights repository multi_lora_args_filepath os.path.join pb_utils.get_model_dir, _MULTI_LORA_ARGS_FILENAME try with openmulti_lora_args_filepath as lora_file lora_repository Dictstr, str json.loadlora_file self.lora_repository lora_repository self.supported_loras Liststr listself.lora_repository.keys self.supported_loras_len lenself.supported_loras self.enable_lora True except FileNotFoundError raise FileNotFoundError fTriton backend cannot find multi_lora_args_filepath. | |
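The multi_lora.json that the backend looks for is a flat mapping from LoRA name to a local adapter path; a hypothetical example (names and paths are placeholders):
{
  "lora-a": "/opt/tritonserver/model_repository/vllm_model/1/lora/lora-a",
  "lora-b": "/opt/tritonserver/model_repository/vllm_model/1/lora/lora-b"
}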
--- | |
def _setup_metricsself self._vllm_metrics None TODO Do not read metrics directly from the vLLM engine, read from prometheus client to allow the use of ZMQ process when metrics are enabled. See httpsgithub.comvllmprojectvllmblobv0.6.3.post1vllmentrypointsopenaiapi_server.pyL222L245 if self._enable_metrics try labels model self.argsmodel_name, version self.argsmodel_version, Add vLLM custom metrics engine_config self._llm_engine.engine.model_config self._vllm_metrics VllmStatLogger labels, engine_config.max_model_len, self.logger self._llm_engine.add_loggertriton, self._vllm_metrics except pb_utils.TritonModelException as e if metrics not supported in stre Metrics are disabled at the server self.logger.log_infovllm Metrics not supported else raise e def _get_bool_config_paramself, param_name str bool return param_name in self.model_configparameters and self.model_configparametersparam_namestring_value.lower true def _response_loopself while True item self._response_queue.get To signal shutdown a None item will be added to the queue. if item is None break response_state, response, response_flag item response_sender response_stateresponse_sender try response_sender.sendresponse, response_flag Stop checking for cancellation if the last response is generated. | |
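The _get_bool_config_param helper above follows a common Python-backend pattern: Triton model-config parameters arrive as strings, so "true"/"false" must be parsed explicitly. A standalone sketch of that pattern:
def get_bool_config_param(model_config: dict, param_name: str) -> bool:
    # model_config is the parsed config, e.g. json.loads(args["model_config"]).
    params = model_config.get("parameters", {})
    return (
        param_name in params
        and params[param_name]["string_value"].lower() == "true"
    )

# e.g. get_bool_config_param(model_config, "REPORT_CUSTOM_METRICS")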
--- | |
if not response_statelast_response_generated response_stateis_cancelled response_sender.is_cancelled except Exception as e self.logger.log_error fAn error occurred while sending a response e finally if response_flag pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL self._ongoing_request_count 1 def executeself, requests if self._enable_health_check and not self._check_healthrequests return None for request in requests request self._verify_lorasrequest if request is not None assert self._llm_engine_shutdown_event.is_set is False , Cannot create tasks after shutdown has been requested coro self._generaterequest asyncio.run_coroutine_threadsafecoro, self._event_loop return None async def _generateself, request response_sender request.get_response_sender response_state response_sender response_sender, is_cancelled False, last_response_generated False, last response ready but not yet sent self._ongoing_request_count 1 decrement_ongoing_request_count True try request_id random_uuid prompt, stream, prepend_input, parameters, additional_outputs, self._get_input_tensorsrequest sampling_params_dict self._get_sampling_params_dictparameters lora_name sampling_params_dict.poplora_name, None sampling_params SamplingParamssampling_params_dict lora_request None if lora_name is not None lora_id strself.supported_loras.indexlora_name 1 lora_int_id intlora_id lora_local_path self.lora_repositorylora_name lora_request LoRARequestlora_id, lora_int_id, lora_local_path response_iterator self._llm_engine.generate prompt, sampling_params, request_id, lora_requestlora_request request_output_state async for request_output in response_iterator Cancellation state will be checked by the response loop and written to the response state if streaming. If not streaming, cancellation state needs to be checked here. is_cancelled response_stateis_cancelled if not stream is_cancelled response_sender.is_cancelled if is_cancelled self.logger.log_infovllm Cancelling the request await self._llm_engine.abortrequest_id self.logger.log_infovllm Successfully cancelled the request if stream Add cancelled final response to response loop. | |
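The request path in _generate above turns the serialized sampling_parameters JSON into a vllm.SamplingParams object and, optionally, a LoRARequest; a minimal sketch assuming vllm is installed (build_request is an illustrative helper, not the backend's API):
import json

from vllm import SamplingParams
from vllm.lora.request import LoRARequest

def build_request(params_json: str, supported_loras: list, lora_repository: dict):
    params = json.loads(params_json)  # e.g. '{"temperature": 0, "lora_name": "lora-a"}'
    lora_name = params.pop("lora_name", None)
    sampling_params = SamplingParams(**params)
    lora_request = None
    if lora_name is not None:
        lora_int_id = supported_loras.index(lora_name) + 1  # LoRA ids start at 1
        lora_request = LoRARequest(str(lora_int_id), lora_int_id, lora_repository[lora_name])
    return sampling_params, lora_request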
--- | |
response_statelast_response_generated True response pb_utils.InferenceResponse errorpb_utils.TritonError messageRequest was cancelled, codepb_utils.TritonError.CANCELLED, flags pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL decrement_ongoing_request_count False self._response_queue.put_nowait response_state, response, flags break Send each response if streaming. if stream response self._create_response request_output_state, request_output, prepend_inputFalse, additional_outputsadditional_outputs, flags 0 if request_output.finished response_statelast_response_generated True flags pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL decrement_ongoing_request_count False self._response_queue.put_nowaitresponse_state, response, flags Send the last response which contains all the outputs if not streaming. if not stream response_sender.send self._create_response request_output_state, request_outputrequest_output, prepend_inputprepend_input, additional_outputsadditional_outputs, , flagspb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL, except Exception as e self.logger.log_errorfvllm Error generating stream e error pb_utils.TritonErrorfError generating stream e text_output_tensor pb_utils.Tensor text_output, np.asarrayNA, dtypeself.output_dtype response pb_utils.InferenceResponse output_tensorstext_output_tensor, errorerror response_sender.send response, flagspb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL raise e finally if decrement_ongoing_request_count self._ongoing_request_count 1 def _get_input_tensorsself, request prompt prompt pb_utils.get_input_tensor_by_namerequest, text_input.as_numpy0 if isinstanceprompt, bytes prompt prompt.decodeutf8 image images pb_utils.get_input_tensor_by_namerequest, image if images images_vllm for image_np in images.as_numpy image_b base64.b64decodeimage_np.decodeutf8 image_rgb Image.openBytesIOimage_b.convertRGB images_vllm.appendimage_rgb if lenimages_vllm 0 prompt prompt prompt, multi_modal_data image images_vllm, stream stream pb_utils.get_input_tensor_by_namerequest, stream if stream stream stream.as_numpy0 else stream False prepend_input exclude_input_in_output prepend_input pb_utils.get_input_tensor_by_name request, exclude_input_in_output if prepend_input When exclude_input_in_output is False, we want to prepend input prompt to output, thus prepend_input should be True, and vice versa. | |
--- | |
prepend_input not prepend_input.as_numpy0 elif prepend_input is None and stream prepend_input False else prepend_input True if prepend_input and stream raise ValueError When streaming, exclude_input_in_output False is not allowed. parameters sampling_parameters An alternative mechanism to receive serialized parameters as an input tensor, because request parameters are not yet supported via BLS. sampling_parameters pb_utils.get_input_tensor_by_name request, sampling_parameters if sampling_parameters parameters sampling_parameters.as_numpy0.decodeutf8 else parameters request.parameters additional outputs additional_outputs return_finish_reason None, return_cumulative_logprob None, return_logprobs None, return_num_input_tokens None, return_num_output_tokens None, for tensor_name in additional_outputs.keys tensor pb_utils.get_input_tensor_by_namerequest, tensor_name if tensor tensor booltensor.as_numpy0 else tensor False additional_outputstensor_name tensor return prompt, stream, prepend_input, parameters, additional_outputs def _create_response self, request_output_state, request_output, prepend_input, additional_outputs output_tensors text_output prepend_prompt if prev_lens_text_output not in request_output_state this is the first response if prepend_input prepend_prompt request_output.prompt request_output_stateprev_lens_text_output 0 len request_output.outputs prev_lens request_output_stateprev_lens_text_output text_output prepend_prompt output.textprev_len.encodeutf8 for output, prev_len in ziprequest_output.outputs, prev_lens request_output_stateprev_lens_text_output lenoutput.text for output in request_output.outputs output_tensors.append pb_utils.Tensor text_output, np.asarraytext_output, dtypeself.output_dtype finish_reason if additional_outputsreturn_finish_reason finish_reason stroutput.finish_reason for output in request_output.outputs output_tensors.append pb_utils.Tensor finish_reason, np.asarrayfinish_reason, dtypenp.object_ cumulative_logprob if additional_outputsreturn_cumulative_logprob cumulative_logprob output.cumulative_logprob for output in request_output.outputs output_tensors.append pb_utils.Tensor cumulative_logprob, np.asarraycumulative_logprob, dtypenp.float32, logprobs httpsgithub.comvllmprojectvllmblobv0.6.3.post1vllmsequence.pyL37L58 if additional_outputsreturn_logprobs if prev_lens_logprobs not in request_output_state request_output_stateprev_lens_logprobs 0 len request_output.outputs logprobs for i in rangelenrequest_output.outputs output request_output.outputsi if output.logprobs is None logprobs.appendnull.encodeutf8 continue prev_len request_output_stateprev_lens_logprobsi request_output_stateprev_lens_logprobsi lenoutput.logprobs logprobs_py for logprob_d_vllm in output.logprobsprev_len logprob_d_py for token_id, logprob_vllm in logprob_d_vllm.items logprob_d_pytoken_id logprob logprob_vllm.logprob, rank logprob_vllm.rank, decoded_token logprob_vllm.decoded_token, logprobs_py.appendlogprob_d_py logprobs.appendjson.dumpslogprobs_py.encodeutf8 output_tensors.append pb_utils.Tensorlogprobs, np.asarraylogprobs, dtypenp.object_ num_input_tokens if additional_outputsreturn_num_input_tokens num_input_tokens lenrequest_output.prompt_token_ids output_tensors.append pb_utils.Tensor num_input_tokens, np.asarraynum_input_tokens, dtypenp.uint32 num_output_tokens if additional_outputsreturn_num_output_tokens if prev_lens_num_output_tokens not in request_output_state request_output_stateprev_lens_num_output_tokens 0 len request_output.outputs prev_lens 
request_output_stateprev_lens_num_output_tokens num_output_tokens lenoutput.token_ids prev_len for output, prev_len in ziprequest_output.outputs, prev_lens request_output_stateprev_lens_num_output_tokens lenoutput.token_ids for output in request_output.outputs output_tensors.append pb_utils.Tensor num_output_tokens, np.asarraynum_output_tokens, dtypenp.uint32 return pb_utils.InferenceResponseoutput_tensorsoutput_tensors def _get_sampling_params_dictself, params_json params_dict json.loadsparams_json Special parsing for the supported sampling parameters bool_keys ignore_eos, skip_special_tokens, use_beam_search for k in bool_keys if k in params_dict params_dictk boolparams_dictk float_keys frequency_penalty, length_penalty, presence_penalty, temperature, top_p, for k in float_keys if k in params_dict params_dictk floatparams_dictk int_keys best_of, max_tokens, min_tokens, n, top_k for k in int_keys if k in params_dict params_dictk intparams_dictk return params_dict def _verify_lorasself, request We will check if the requested lora exists here, if not we will send a response with LoRA not found information. | |
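The _create_response logic above streams deltas by remembering how much of each candidate's text has already been sent and emitting only the new suffix; the idea in isolation (runnable, with illustrative names):
def next_text_deltas(output_texts, state):
    # state persists across responses for one request, as request_output_state does above.
    prev_lens = state.setdefault("prev_lens_text_output", [0] * len(output_texts))
    deltas = [text[prev:] for text, prev in zip(output_texts, prev_lens)]
    state["prev_lens_text_output"] = [len(text) for text in output_texts]
    return deltas

state = {}
print(next_text_deltas(["Hel"], state))    # ['Hel']
print(next_text_deltas(["Hello!"], state)) # ['lo!']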
--- | |
In this way we may avoid further processing. verified_request None lora_error None lora_name None parameters_input_tensor pb_utils.get_input_tensor_by_name request, sampling_parameters if parameters_input_tensor parameters parameters_input_tensor.as_numpy0.decodeutf8 sampling_params_dict self._get_sampling_params_dictparameters lora_name sampling_params_dict.poplora_name, None if lora_name is not None if not self.enable_lora lora_error pb_utils.TritonErrorLoRA feature is not enabled. self.logger.log_info vllm LoRA is not enabled, please restart the backend with LoRA enabled. | |
--- | |
elif lora_name not in self.supported_loras lora_error pb_utils.TritonError fLoRA lora_name is not supported, we currently support self.supported_loras self.logger.log_infofvllm LoRA lora_name not found. if lora_error is not None output_tensor pb_utils.Tensor text_output, np.asarrayError Unsupported LoRA., dtypeself.output_dtype, response pb_utils.InferenceResponse output_tensorsoutput_tensor, errorlora_error response_sender request.get_response_sender response_sender.send response, flagspb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL else verified_request request return verified_request def _check_healthself, requests coro self._llm_engine.check_health future asyncio.run_coroutine_threadsafecoro, self._event_loop try future.result except Exception as e self.logger.log_error fvllm Engine is not healthy and model will be unloaded e pb_utils.unload_modelself.model_configname nonblocking self._is_healthy False if not self._is_healthy for request in requests request.get_response_sender.send pb_utils.InferenceResponse errorpb_utils.TritonError messageModel is unavailable due to unhealthy vLLM engine, codepb_utils.TritonError.UNAVAILABLE, , flagspb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL, return self._is_healthy def finalizeself self.logger.log_infovllm Issuing finalize to vllm backend self._event_loop.call_soon_threadsafeself._llm_engine_shutdown_event.set Shutdown the event thread. if self._event_thread is not None self._event_thread.join self._event_thread None Shutdown the response thread. | |
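The health check above bridges Triton's synchronous execute() thread and the engine's asyncio loop with run_coroutine_threadsafe; a reduced sketch (check_engine_health is an illustrative name):
import asyncio

def check_engine_health(engine, event_loop, log_error) -> bool:
    # engine.check_health() is a coroutine; run it on the engine's captured event loop.
    future = asyncio.run_coroutine_threadsafe(engine.check_health(), event_loop)
    try:
        future.result()  # raises if the engine reports itself unhealthy
        return True
    except Exception as e:
        log_error(f"vllm: engine is not healthy: {e}")
        return False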
--- | |
self._response_queue.putNone if self._response_thread is not None self._response_thread.join self._response_thread None Shutdown the metrics thread. if self._vllm_metrics is not None self._vllm_metrics.finalize When using parallel tensors, the stub process may not shutdown due to unreleased references, so manually run the garbage collector once. self.logger.log_infovllm Running Garbage Collector on finalize... | |
--- | |
gc.collect self.logger.log_infovllm Garbage Collector on finalize... done File .gitignore Bytecompiled optimized DLL files __pycache__ .pycod py.class C extensions .so Distribution packaging .Python build developeggs dist downloads eggs .eggs lib lib64 parts sdist var wheels sharepythonwheels .egginfo .installed.cfg .egg MANIFEST PyInstaller Usually these files are written by a python script from a template before PyInstaller builds the exe, so as to inject dateother infos into it. .manifest .spec Installer logs piplog.txt pipdeletethisdirectory.txt Unit test coverage reports htmlcov .tox .nox .coverage .coverage. | |
--- | |
.cache nosetests.xml coverage.xml .cover .py,cover .hypothesis .pytest_cache cover .out Translations .mo .pot Django stuff .log local_settings.py db.sqlite3 db.sqlite3journal Flask stuff instance .webassetscache Scrapy stuff .scrapy Sphinx documentation docs_build PyBuilder .pybuilder target Jupyter Notebook .ipynb_checkpoints IPython profile_default ipython_config.py pyenv For a library or package, you might want to ignore these files since the code is intended to run in multiple environments otherwise, check them in .pythonversion pipenv According to pypapipenv598, it is recommended to include Pipfile.lock in version control. However, in case of collaboration, if having platformspecific dependencies or dependencies having no crossplatform support, pipenv may install dependencies that dont work, or not install all needed dependencies. Pipfile.lock poetry Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control. | |
--- | |
This is especially recommended for binary packages to ensure reproducibility, and is more commonly ignored for libraries. httpspythonpoetry.orgdocsbasicusagecommityourpoetrylockfiletoversioncontrol poetry.lock pdm Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control. pdm.lock pdm stores projectwide configurations in .pdm.toml, but it is recommended to not include it in version control. | |
--- | |
httpspdm.fming.devusewithide .pdm.toml PEP 582 used by e.g. github.comDavidOConnorpyflow and github.compdmprojectpdm __pypackages__ Celery stuff celerybeatschedule celerybeat.pid SageMath parsed files .sage.py Environments .env .venv env venv ENV env.bak venv.bak Spyder project settings .spyderproject .spyproject Rope project settings .ropeproject mkdocs documentation site mypy .mypy_cache .dmypy.json dmypy.json Pyre type checker .pyre pytype static type analyzer .pytype Cython debug symbols cython_debug Test result files Miniconda miniconda vllm_env.tar.gz triton_python_backend_stub python_backend results.txt .log PyCharm JetBrains specific template is maintained in a separate JetBrains.gitignore that can be found at httpsgithub.comgithubgitignoreblobmainGlobalJetBrains.gitignore and can be added to the global gitignore or merged into this file. For a more nuclear option not recommended you can uncomment the following to ignore the entire idea folder. | |
--- | |
.idea File .precommitconfig.yaml Copyright 2023, NVIDIA CORPORATION AFFILIATES. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. | |
--- | |
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. | |
--- | |
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. repos repo httpsgithub.comtimothycrosleyisort rev 5.12.0 hooks id isort additional_dependencies toml repo httpsgithub.compsfblack rev 23.1.0 hooks id black types_or python, cython repo httpsgithub.comPyCQAflake8 rev 5.0.4 hooks id flake8 args maxlinelength88, selectC,E,F,W,B,B950, extendignore E203,E501 types_or python, cython repo httpsgithub.comprecommitmirrorsclangformat rev v16.0.5 hooks id clangformat types_or c, c, cuda, proto, textproto, java args fallbackstylenone, stylefile, i repo httpsgithub.comcodespellprojectcodespell rev v2.2.4 hooks id codespell additional_dependencies tomli args toml, pyproject.toml exclude ?x.stemmer..stop_words.CHANGELOG.md More details about these precommit hooks here httpsprecommit.comhooks.html repo httpsgithub.comprecommitprecommithooks rev v4.4.0 hooks id checkcaseconflict id checkexecutableshaveshebangs id checkmergeconflict id checkjson id checktoml id checkyaml exclude deploytemplates. id checkshebangscriptsareexecutable id endoffilefixer types_or c, c, cuda, proto, textproto, java, python id mixedlineending id requirementstxtfixer id trailingwhitespace File LICENSE BSD 3Clause License Copyright c 2023, Triton Inference Server Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met 1. | |
--- | |
Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
--- | |
3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. | |
--- | |
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. File pyproject.toml Copyright 2023, NVIDIA CORPORATION AFFILIATES. All rights reserved. | |
--- | |
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. | |
--- | |
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. tool.codespell note precommit passes explicit lists of files here, which this skip file list doesnt override this is only to allow you to run codespell interactively skip ..git,..github ignore short words, and typename parameters like OffsetT ignoreregex b.1,4AZwTb use the clear dictionary for unambiguous spelling mistakes builtin clear disable warnings about binary files and wrong encoding quietlevel 3 tool.isort profile black use_parentheses true multi_line_output 3 include_trailing_comma true force_grid_wrap 0 ensure_newline_before_comments true line_length 88 balanced_wrapping true indent skip build File README.md ! | |
--- | |
Copyright 20232024, NVIDIA CORPORATION AFFILIATES. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. | |
--- | |
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. | |
--- | |
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. !Licensehttpsimg.shields.iobadgeLicenseBSD3lightgrey.svghttpsopensource.orglicensesBSD3Clause vLLM Backend The Triton backend for vLLMhttpsgithub.comvllmprojectvllm is designed to run supported modelshttpsvllm.readthedocs.ioenlatestmodelssupported_models.html on a vLLM enginehttpsgithub.comvllmprojectvllmblobmainvllmengineasync_llm_engine.py. You can learn more about Triton backends in the backend repohttpsgithub.comtritoninferenceserverbackend. | |
--- | |
This is a Python-based backend (https://github.com/triton-inference-server/backend/blob/main/docs/python_based_backends.md#python-based-backends). When using this backend, all requests are placed on the vLLM AsyncEngine as soon as they are received. In-flight batching and paged attention are handled by the vLLM engine. | |
--- | |
Where can I ask general questions about Triton and Triton backends? Be sure to read all the information below as well as the general Triton documentation available in the main server repo (https://github.com/triton-inference-server/server). If you don't find your answer there, you can ask questions on the main Triton issues page (https://github.com/triton-inference-server/server/issues). | |
--- | |
Installing the vLLM Backend There are several ways to install and deploy the vLLM backend. Option 1. Use the Pre-Built Docker Container. | |
--- | |
Pull a tritonserver:<xx.yy>-vllm-python-py3 container with the vLLM backend from the NGC registry (https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver). <xx.yy> is the version of Triton that you want to use. Please note that Triton's vLLM container has been available starting from the 23.10 release. | |
--- | |
docker pull nvcr.io/nvidia/tritonserver:<xx.yy>-vllm-python-py3
Option 2. Build a Custom Container From Source You can follow the steps described in the Building With Docker guide (https://github.com/triton-inference-server/server/blob/main/docs/customization_guide/build.md#building-with-docker) and use the build.py script (https://github.com/triton-inference-server/server/blob/main/build.py). A sample command to build a Triton Server container with all options enabled is shown below. | |
--- | |
Feel free to customize the flags according to your needs. Please use the NGC registry (https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver/tags) to get the latest version of the Triton vLLM container, which corresponds to the latest YY.MM (year.month) Triton release (https://github.com/triton-inference-server/server/releases). YY.MM is the version of Triton. | |
--- | |
export TRITON_CONTAINER_VERSION=<YY.MM>
./build.py -v --enable-logging --enable-stats --enable-tracing --enable-metrics --enable-gpu-metrics --enable-cpu-metrics --enable-gpu --filesystem=gcs --filesystem=s3 --filesystem=azure_storage --endpoint=http --endpoint=grpc --endpoint=sagemaker --endpoint=vertexai --upstream-container-version=${TRITON_CONTAINER_VERSION} --backend=python:r${TRITON_CONTAINER_VERSION} --backend=vllm:r${TRITON_CONTAINER_VERSION} --backend=ensemble
Option 3. Add the vLLM Backend to the Default Triton Container You can install the vLLM backend directly into the NGC Triton container. In this case, please install vLLM first. | |
--- | |
You can do so by running pip install vllm==<vLLM_version>. Then, set up the vLLM backend in the container with the following commands:
mkdir -p /opt/tritonserver/backends/vllm
git clone https://github.com/triton-inference-server/vllm_backend.git /tmp/vllm_backend
cp -r /tmp/vllm_backend/src/* /opt/tritonserver/backends/vllm
Using the vLLM Backend You can see an example model_repository in the samples folder. You can use it as is and change the model by changing the model value in model.json. | |
--- | |
model.json represents a key-value dictionary that is fed to vLLM's AsyncLLMEngine when initializing the model. You can see the supported arguments in vLLM's arg_utils.py (https://github.com/vllm-project/vllm/blob/main/vllm/engine/arg_utils.py), specifically around https://github.com/vllm-project/vllm/blob/ee8217e5bee5860469204ee57077a91138c9af02/vllm/engine/arg_utils.py#L11 and #L201. A minimal example is sketched below. | |
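For illustration, a minimal model.json along the lines of the sample shipped in this repository might look like the following (the model name and values are placeholders, not a prescription):
{
    "model": "facebook/opt-125m",
    "disable_log_requests": true,
    "gpu_memory_utilization": 0.5,
    "enforce_eager": true
}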
--- | |
For multi-GPU support, EngineArgs like tensor_parallel_size can be specified in model.json (samples/model_repository/vllm_model/1/model.json). Note: vLLM greedily consumes up to 90% of the GPU's memory under default settings. The sample model changes this behavior by setting gpu_memory_utilization to 50%. | |
--- | |
You can tweak this behavior using fields like gpu_memory_utilization and other settings in model.json (samples/model_repository/vllm_model/1/model.json). Launching Triton Inference Server Once you have the model repository set up, it is time to launch the Triton server. We will use the pre-built Triton container with the vLLM backend (see Option 1 above) from NGC (https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver) in this example. | |
--- | |
docker run --gpus all -it --net=host --rm -p 8001:8001 --shm-size=1G --ulimit memlock=-1 --ulimit stack=67108864 -v ${PWD}:/work -w /work nvcr.io/nvidia/tritonserver:<xx.yy>-vllm-python-py3 tritonserver --model-repository ./model_repository
Replace <xx.yy> with the version of Triton that you want to use. Note that Triton's vLLM container was first published with the 23.10 release. After you start Triton you will see output on the console showing the server starting up and loading the model. | |
--- | |
When you see output like the following, Triton is ready to accept inference requests.
I1030 22:33:28.291908 1 grpc_server.cc:2513] Started GRPCInferenceService at 0.0.0.0:8001
I1030 22:33:28.292879 1 http_server.cc:4497] Started HTTPService at 0.0.0.0:8000
I1030 22:33:28.335154 1 http_server.cc:270] Started Metrics Service at 0.0.0.0:8002
Sending Your First Inference After you start Triton (https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/getting_started/quickstart.html) with the sample model_repository, you can quickly run your first inference request with the generate endpoint (https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_generate.md). Try out the command below. | |
--- | |
curl -X POST localhost:8000/v2/models/vllm_model/generate -d '{"text_input": "What is Triton Inference Server?", "parameters": {"stream": false, "temperature": 0}}'
Upon success, you should see a response from the server like this one:
{"model_name":"vllm_model","model_version":"1","text_output":"What is Triton Inference Server?\n\nTriton Inference Server is a server that is used by many"}
In the samples folder, you can also find a sample client, client.py, which uses Triton's asyncio gRPC client library (beta) to run inference on Triton. Running the Latest vLLM Version You can check the vLLM version included in Triton Inference Server from the Framework Containers Support Matrix (https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html). Note: the vLLM Triton Inference Server container has been available starting from the 23.10 release. | |
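Returning to the generate endpoint above, the same request can also be sent from Python; a small sketch, assuming the requests package is installed and Triton is listening on localhost:8000:
import requests

response = requests.post(
    "http://localhost:8000/v2/models/vllm_model/generate",
    json={
        "text_input": "What is Triton Inference Server?",
        "parameters": {"stream": False, "temperature": 0},
    },
)
response.raise_for_status()
print(response.json()["text_output"])  # generated continuation of the prompt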
--- | |
You can use pip install ... within the container to upgrade the vLLM version. Running Multiple Instances of Triton Server If you are running multiple instances of Triton server with a Python-based backend, you need to specify a different shm-region-prefix-name for each server. | |
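For example, two servers might be launched with distinct prefixes (hypothetical values; the flag is described in the python_backend documentation linked below):
tritonserver --model-repository ./model_repository --backend-config=python,shm-region-prefix-name=prefix0 ...
tritonserver --model-repository ./model_repository --backend-config=python,shm-region-prefix-name=prefix1 ...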
--- | |
See https://github.com/triton-inference-server/python_backend#running-multiple-instances-of-triton-server for more information. Additional vLLM Outputs Additional vLLM outputs may be requested optionally on a per-request basis. See docs/additional_outputs.md for more information. | |
--- | |
Triton Metrics Starting with the 24.08 release of Triton, users can obtain specific vLLM metrics by querying the Triton metrics endpoint (see the complete list of vLLM metrics at https://docs.vllm.ai/en/latest/serving/metrics.html). This can be accomplished by launching a Triton server in any of the ways described above (ensuring the build code / container is 24.08 or later) and querying the server. Upon receiving a successful response, you can query the metrics endpoint by entering the following:
curl localhost:8002/metrics
vLLM stats are reported by the metrics endpoint in fields that are prefixed with vllm:. | |
--- | |
Triton currently supports reporting of the following metrics from vLLM:
counter_prompt_tokens: Number of prefill tokens processed.
counter_generation_tokens: Number of generation tokens processed.
histogram_time_to_first_token: Histogram of time to first token in seconds.
histogram_time_per_output_token: Histogram of time per output token in seconds.
histogram_e2e_time_request: Histogram of end-to-end request latency in seconds.
histogram_num_prompt_tokens_request: Number of prefill tokens processed.
histogram_num_generation_tokens_request: Number of generation tokens processed.
histogram_best_of_request: Histogram of the best_of request parameter.
histogram_n_request: Histogram of the n request parameter.
Your output for these fields should look similar to the following:
# HELP vllm:prompt_tokens_total Number of prefill tokens processed.
# TYPE vllm:prompt_tokens_total counter
vllm:prompt_tokens_total{model="vllm_model",version="1"} 10
# HELP vllm:generation_tokens_total Number of generation tokens processed.
# TYPE vllm:generation_tokens_total counter
vllm:generation_tokens_total{model="vllm_model",version="1"} 16
# HELP vllm:time_to_first_token_seconds Histogram of time to first token in seconds.
# TYPE vllm:time_to_first_token_seconds histogram
vllm:time_to_first_token_seconds_count{model="vllm_model",version="1"} 1
vllm:time_to_first_token_seconds_sum{model="vllm_model",version="1"} 0.03233122825622559
vllm:time_to_first_token_seconds_bucket{model="vllm_model",version="1",le="0.001"} 0
...
vllm:time_to_first_token_seconds_bucket{model="vllm_model",version="1",le="+Inf"} 1
# HELP vllm:time_per_output_token_seconds Histogram of time per output token in seconds.
# TYPE vllm:time_per_output_token_seconds histogram
vllm:time_per_output_token_seconds_count{model="vllm_model",version="1"} 15
vllm:time_per_output_token_seconds_sum{model="vllm_model",version="1"} 0.04501533508300781
vllm:time_per_output_token_seconds_bucket{model="vllm_model",version="1",le="0.01"} 14
...
vllm:time_per_output_token_seconds_bucket{model="vllm_model",version="1",le="+Inf"} 15
# HELP vllm:e2e_request_latency_seconds Histogram of end to end request latency in seconds.
# TYPE vllm:e2e_request_latency_seconds histogram
vllm:e2e_request_latency_seconds_count{model="vllm_model",version="1"} 1
vllm:e2e_request_latency_seconds_sum{model="vllm_model",version="1"} 0.08686184883117676
vllm:e2e_request_latency_seconds_bucket{model="vllm_model",version="1",le="1"} 1
...
vllm:e2e_request_latency_seconds_bucket{model="vllm_model",version="1",le="+Inf"} 1
# HELP vllm:request_prompt_tokens Number of prefill tokens processed.
# TYPE vllm:request_prompt_tokens histogram
vllm:request_prompt_tokens_count{model="vllm_model",version="1"} 1
vllm:request_prompt_tokens_sum{model="vllm_model",version="1"} 10
vllm:request_prompt_tokens_bucket{model="vllm_model",version="1",le="1"} 0
...
vllm:request_prompt_tokens_bucket{model="vllm_model",version="1",le="+Inf"} 1
# HELP vllm:request_generation_tokens Number of generation tokens processed.
# TYPE vllm:request_generation_tokens histogram
vllm:request_generation_tokens_count{model="vllm_model",version="1"} 1
vllm:request_generation_tokens_sum{model="vllm_model",version="1"} 16
vllm:request_generation_tokens_bucket{model="vllm_model",version="1",le="1"} 0
...
vllm:request_generation_tokens_bucket{model="vllm_model",version="1",le="+Inf"} 1
# HELP vllm:request_params_best_of Histogram of the best_of request parameter.
# TYPE vllm:request_params_best_of histogram
vllm:request_params_best_of_count{model="vllm_model",version="1"} 1
vllm:request_params_best_of_sum{model="vllm_model",version="1"} 1
vllm:request_params_best_of_bucket{model="vllm_model",version="1",le="1"} 1
...
vllm:request_params_best_of_bucket{model="vllm_model",version="1",le="+Inf"} 1
# HELP vllm:request_params_n Histogram of the n request parameter.
# TYPE vllm:request_params_n histogram
vllm:request_params_n_count{model="vllm_model",version="1"} 1
vllm:request_params_n_sum{model="vllm_model",version="1"} 1
vllm:request_params_n_bucket{model="vllm_model",version="1",le="1"} 1
...
vllm:request_params_n_bucket{model="vllm_model",version="1",le="+Inf"} 1
To enable the vLLM engine to collect metrics, the disable_log_stats option needs to be either false or left empty (false by default) in model.json (https://github.com/triton-inference-server/vllm_backend/blob/main/samples/model_repository/vllm_model/1/model.json). | |
--- | |
"disable_log_stats": false
Note: vLLM metrics are not reported to the Triton metrics server by default due to potential performance slowdowns. To enable metrics reporting for a vLLM model, please add the following lines to its config.pbtxt as well.
parameters: {
  key: "REPORT_CUSTOM_METRICS"
  value: {
    string_value: "true"
  }
}
vLLM Engine Health Check (BETA) The vLLM Engine Health Check may be enabled optionally, for more accurate model state reported by the server. | |
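Based on the ENABLE_VLLM_HEALTH_CHECK parameter read in model.py above, enabling it presumably means adding a parameter block like the following to the model's config.pbtxt (an assumption here; see the health check doc referenced below for the authoritative instructions):
parameters: {
  key: "ENABLE_VLLM_HEALTH_CHECK"
  value: {
    string_value: "true"
  }
}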
--- | |
See docs/health_check.md for more information. Referencing the Tutorial You can read further in the vLLM Quick Deploy guide (https://github.com/triton-inference-server/tutorials/tree/main/Quick_Deploy/vLLM) in the tutorials repository (https://github.com/triton-inference-server/tutorials). | |
--- | |