Chunk 0:
This file is a merged representation of the entire codebase, combining all repository files into a single document. Generated by Repomix on 2024-12-29T19:55:17.135Z. File Summary. Purpose: This file contains a packed representation of the entire repository's contents. It is designed to be easily consumable by AI systems for analysis, code review, or other automated processes.
Chunk 1:
File Format: The content is organized as follows: 1. This summary section 2. Repository information 3.
Chunk 2:
Directory structure 4. Multiple file entries, each consisting of: a. A separator line b.
Chunk 3:
The file path (File: path/to/file) c. Another separator line d. The full contents of the file e.
Chunk 4:
A blank line. Usage Guidelines: This file should be treated as read-only. Any changes should be made to the original repository files, not this packed version. When processing this file, use the file path to distinguish between different files in the repository.
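For orientation, a single packed file entry as described above looks roughly like the sketch below. The exact separator characters depend on Repomix's configured output style, so treat this layout as an illustration rather than the literal format:

================
File: path/to/file
================
(full contents of the file)
(blank line)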
Chunk 5:
Be aware that this file may contain sensitive information. Handle it with the same level of security as you would the original repository. Notes: Some files may have been excluded based on .gitignore rules and Repomix's configuration.
Chunk 6:
Binary files are not included in this packed representation. Please refer to the Repository Structure section for a complete list of file paths, including binary files. Additional Info: For more information about Repomix, visit https://github.com/yamadashy/repomix

Directory Structure:
.github/
  workflows/
    codeql.yml
    pre-commit.yaml
ci/
  common/
    test_util.py
    util.sh
  L0_additional_outputs_vllm/
    additional_outputs_test.py
    test.sh
  L0_backend_vllm/
    accuracy_test/
      accuracy_test.py
      test.sh
    enabled_stream/
      enabled_stream_test.py
      test.sh
    metrics_test/
      test.sh
      vllm_metrics_test.py
    request_cancellation/
      request_cancellation_test.py
      test.sh
    vllm_backend/
      ensemble_config.pbtxt
      test.sh
      vllm_backend_test.py
    test.sh
  L0_check_health_vllm/
    check_health_test.py
    test.sh
  L0_multi_gpu_vllm/
    multi_lora/
      download.py
      multi_lora_test.py
      test.sh
    vllm_backend/
      test.sh
      vllm_multi_gpu_test.py
    test.sh
docs/
  additional_outputs.md
  health_check.md
  llama_multi_lora_tutorial.md
samples/
  model_repository/
    vllm_model/
      1/
        model.json
      config.pbtxt
  client.py
  prompts.txt
src/
  utils/
    metrics.py
  model.py
.gitignore
.pre-commit-config.yaml
LICENSE
pyproject.toml
README.md

Files

File: .github/workflows/codeql.yml
Copyright 2023, NVIDIA CORPORATION & AFFILIATES.
Chunk 7: | |
All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
Chunk 8: | |
Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |
Chunk 9: | |
name CodeQL on pull_request jobs analyze name Analyze runson ubuntulatest permissions actions read contents read securityevents write strategy failfast false matrix language python CodeQL supports cpp, csharp, go, java, javascript, python, ruby Learn more about CodeQL language support at httpsaka.mscodeqldocslanguagesupport steps name Checkout repository uses actionscheckoutv3 Initializes the CodeQL tools for scanning. name Initialize CodeQL uses githubcodeqlactioninitv2 with languages matrix.language If you wish to specify custom queries, you can do so here or in a config file. By default, queries listed here will override any specified in a config file. | |
Chunk 10: | |
Prefix the list here with to use these queries and those in the config file. Details on CodeQLs query packs refer to httpsdocs.github.comencodesecuritycodescanningautomaticallyscanningyourcodeforvulnerabilitiesanderrorsconfiguringcodescanningusingqueriesinqlpacks queries securityandquality Autobuild attempts to build any compiled languages CC, C, Go, or Java. If this step fails, then you should remove it and run the build manually see below name Autobuild uses githubcodeqlactionautobuildv2 Commandline programs to run using the OS shell. | |
Chunk 11: | |
See httpsdocs.github.comenactionsusingworkflowsworkflowsyntaxforgithubactionsjobsjob_idstepsrun If the Autobuild fails above, remove it and uncomment the following three lines. modify them or add more to build your code if your project, please refer to the EXAMPLE below for guidance. run echo Run, Build Application using script .location_of_script_within_repobuildscript.sh name Perform CodeQL Analysis uses githubcodeqlactionanalyzev2 with category languagematrix.language File .githubworkflowsprecommit.yaml Copyright 2023, NVIDIA CORPORATION AFFILIATES. | |
Chunk 12: | |
All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
Chunk 13: | |
Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |
Chunk 14: | |
name precommit on pull_request jobs precommit runson ubuntu22.04 steps uses actionscheckoutv3 uses actionssetuppythonv3 uses precommitactionv3.0.0 File cicommontest_util.py !usrbinenv python3 Copyright 20182023, NVIDIA CORPORATION AFFILIATES. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. | |
Chunk 15: | |
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. | |
Chunk 16: | |
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import json import queue import unittest import numpy as np import tritonclient.grpc as grpcclient class TestResultCollectorunittest.TestCase TestResultCollector stores test result and prints it to stdout. In order to use this class, unit tests must inherit this class. | |
Chunk 17: | |
Use check_test_results bash function from commonutil.sh to verify the expected number of tests produced by this class classmethod def setResultcls, total, errors, failures cls.total, cls.errors, cls.failures total, errors, failures classmethod def tearDownClasscls this method is called when all the unit tests in a class are finished. json_res total cls.total, errors cls.errors, failures cls.failures with opentest_results.txt, w as f f.writejson.dumpsjson_res def runself, resultNone result argument stores the accumulative test results test_result super.runresult total test_result.testsRun errors lentest_result.errors failures lentest_result.failures self.setResulttotal, errors, failures class AsyncTestResultCollectorunittest.IsolatedAsyncioTestCase TestResultCollector stores test result and prints it to stdout. In order to use this class, unit tests must inherit this class. | |
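The flattened listing above loses indentation, so here is a cleaned-up sketch of the result-collection pattern that ci/common/test_util.py implements; the test_results.txt it writes is what check_test_results in ci/common/util.sh later inspects:

import json
import unittest


class TestResultCollector(unittest.TestCase):
    # Accumulates results across the class and dumps them to test_results.txt.
    @classmethod
    def setResult(cls, total, errors, failures):
        cls.total, cls.errors, cls.failures = total, errors, failures

    @classmethod
    def tearDownClass(cls):
        # Called once all unit tests in the class have finished.
        json_res = {"total": cls.total, "errors": cls.errors, "failures": cls.failures}
        with open("test_results.txt", "w") as f:
            f.write(json.dumps(json_res))

    def run(self, result=None):
        # The result argument stores the accumulated test results.
        test_result = super().run(result)
        self.setResult(
            test_result.testsRun, len(test_result.errors), len(test_result.failures)
        )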
Chunk 18: | |
Use check_test_results bash function from commonutil.sh to verify the expected number of tests produced by this class classmethod def setResultcls, total, errors, failures cls.total, cls.errors, cls.failures total, errors, failures classmethod def tearDownClasscls this method is called when all the unit tests in a class are finished. json_res total cls.total, errors cls.errors, failures cls.failures with opentest_results.txt, w as f f.writejson.dumpsjson_res def runself, resultNone result argument stores the accumulative test results test_result super.runresult total test_result.testsRun errors lentest_result.errors failures lentest_result.failures self.setResulttotal, errors, failures def create_vllm_request prompt, request_id, stream, sampling_parameters, model_name, send_parameters_as_tensorTrue, exclude_input_in_outputNone, inputs inputs.appendgrpcclient.InferInputtext_input, 1, BYTES inputs1.set_data_from_numpynp.arrayprompt.encodeutf8, dtypenp.object_ inputs.appendgrpcclient.InferInputstream, 1, BOOL inputs1.set_data_from_numpynp.arraystream, dtypebool if send_parameters_as_tensor sampling_parameters_data np.array json.dumpssampling_parameters.encodeutf8, dtypenp.object_ inputs.appendgrpcclient.InferInputsampling_parameters, 1, BYTES inputs1.set_data_from_numpysampling_parameters_data if exclude_input_in_output is not None inputs.appendgrpcclient.InferInputexclude_input_in_output, 1, BOOL inputs1.set_data_from_numpynp.arrayexclude_input_in_output, dtypebool outputs grpcclient.InferRequestedOutputtext_output return model_name model_name, inputs inputs, outputs outputs, request_id strrequest_id, parameters sampling_parameters, class UserData def __init__self self._completed_requests queue.Queue def callbackuser_data, result, error if error user_data._completed_requests.puterror else user_data._completed_requests.putresult File cicommonutil.sh !binbash Copyright 20232024, NVIDIA CORPORATION AFFILIATES. All rights reserved. | |
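Likewise, a readable sketch of the create_vllm_request helper from the same file, which the other test suites reuse to build gRPC inputs; indentation and punctuation are inferred from the flattened text:

import json
import numpy as np
import tritonclient.grpc as grpcclient


def create_vllm_request(prompt, request_id, stream, sampling_parameters, model_name,
                        send_parameters_as_tensor=True, exclude_input_in_output=None):
    # Build the input tensors expected by the vLLM backend.
    inputs = []

    inputs.append(grpcclient.InferInput("text_input", [1], "BYTES"))
    inputs[-1].set_data_from_numpy(np.array([prompt.encode("utf-8")], dtype=np.object_))

    inputs.append(grpcclient.InferInput("stream", [1], "BOOL"))
    inputs[-1].set_data_from_numpy(np.array([stream], dtype=bool))

    if send_parameters_as_tensor:
        data = np.array([json.dumps(sampling_parameters).encode("utf-8")], dtype=np.object_)
        inputs.append(grpcclient.InferInput("sampling_parameters", [1], "BYTES"))
        inputs[-1].set_data_from_numpy(data)

    if exclude_input_in_output is not None:
        inputs.append(grpcclient.InferInput("exclude_input_in_output", [1], "BOOL"))
        inputs[-1].set_data_from_numpy(np.array([exclude_input_in_output], dtype=bool))

    outputs = [grpcclient.InferRequestedOutput("text_output")]
    return {
        "model_name": model_name,
        "inputs": inputs,
        "outputs": outputs,
        "request_id": str(request_id),
        "parameters": sampling_parameters,
    }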
Chunk 19: | |
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. | |
Chunk 20: | |
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. SERVERSERVERopttritonserverbintritonserver SERVER_IPADDRTRITONSERVER_IPADDRlocalhost SERVER_LOGSERVER_LOG.server.log SERVER_TIMEOUTSERVER_TIMEOUT120 SERVER_LD_PRELOADSERVER_LD_PRELOAD Run inference server. | |
Chunk 21: | |
Return once servers health endpoint shows ready or timeout expires. Sets SERVER_PID to pid of SERVER, or 0 if error including expired timeout function run_server SERVER_PID0 if z SERVER then echo SERVER must be defined return fi if ! f SERVER then echo SERVER does not exist return fi if z SERVER_LD_PRELOAD then echo Running SERVER SERVER_ARGS else echo Running LD_PRELOADSERVER_LD_PRELOAD SERVER SERVER_ARGS fi LD_PRELOADSERVER_LD_PRELOADLD_PRELOAD SERVER SERVER_ARGS SERVER_LOG 21 SERVER_PID! | |
Chunk 22: | |
wait_for_server_ready SERVER_PID SERVER_TIMEOUT if WAIT_RET ! 0 then Get further debug information about server startup failure gdb_helper true Cleanup kill SERVER_PID devnull 21 true SERVER_PID0 fi Wait until server health endpoint shows ready. Sets WAIT_RET to 0 on success, 1 on failure function wait_for_server_ready local spid1 shift local wait_time_secs130 shift WAIT_RET0 local wait_secswait_time_secs until test wait_secs eq 0 do if ! | |
Chunk 23: | |
kill 0 spid devnull 21 then echo Server not running. WAIT_RET1 return fi sleep 1 set e codecurl s w http_code SERVER_IPADDR8000v2healthready set e if code 200 then return fi wait_secs done echo Timeout wait_time_secs secs. Server not ready. | |
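The run_server and wait_for_server_ready helpers above poll Triton's HTTP readiness endpoint until it returns 200 or the timeout expires. A minimal Python equivalent of that polling loop, assuming the localhost:8000 address used in the script:

import time
import requests


def wait_for_server_ready(timeout_secs=120, url="http://localhost:8000/v2/health/ready"):
    # Poll Triton's readiness endpoint, mirroring wait_for_server_ready in ci/common/util.sh.
    deadline = time.time() + timeout_secs
    while time.time() < deadline:
        try:
            if requests.get(url).status_code == 200:
                return True
        except requests.ConnectionError:
            pass  # server not accepting connections yet
        time.sleep(1)
    return False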
Chunk 24: | |
WAIT_RET1 Check Python unittest results. function check_test_results local log_file1 local expected_num_tests2 if z expected_num_tests then echo expected number of tests must be defined return 1 fi num_failurescat log_file grep E .total.errors.failures. tail n 1 jq .failures num_testscat log_file grep E .total.errors.failures. | |
Chunk 25: | |
tail n 1 jq .total num_errorscat log_file grep E .total.errors.failures. tail n 1 jq .errors Number regular expression re09 if ? ne 0 ! | |
Chunk 26: | |
num_failures re ! num_tests re ! num_errors re then cat log_file echo e nn Test Failed unable to parse test resultsn log_file return 1 fi if num_errors ! | |
Chunk 27: | |
0 num_failures ! 0 num_tests ne expected_num_tests then cat log_file echo e nn Test Failed Expected expected_num_tests tests, num_tests tests executed, num_errors tests had error, and num_failures tests failed. n log_file return 1 fi return 0 function collect_artifacts_from_subdir cp .log core .. | |
Chunk 28: | |
true File ciL0_additional_outputs_vllmadditional_outputs_test.py Copyright 2024, NVIDIA CORPORATION AFFILIATES. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. | |
Chunk 29: | |
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. | |
Chunk 30: | |
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import json import numpy as np import pytest import tritonclient.grpc as grpcclient class TestAdditionalOutputs _grpc_url localhost8001 _model_name vllm_opt _sampling_parameters temperature 0, top_p 1 _prompt In this example, def _get_sampling_parametersself, logprobsNone sampling_parameters self._sampling_parameters.copy if logprobs is not None sampling_parameterslogprobs logprobs return sampling_parameters def _get_inputs self, prompt, streamTrue, sampling_parametersNone, return_finish_reasonNone, return_cumulative_logprobNone, return_logprobsNone, return_num_input_tokensNone, return_num_output_tokensNone, inputs inputs.appendgrpcclient.InferInputtext_input, 1, BYTES inputs1.set_data_from_numpy np.arrayprompt.encodeutf8, dtypenp.object_ inputs.appendgrpcclient.InferInputstream, 1, BOOL inputs1.set_data_from_numpynp.arraystream, dtypebool if sampling_parameters is not None inputs.appendgrpcclient.InferInputsampling_parameters, 1, BYTES inputs1.set_data_from_numpy np.array json.dumpssampling_parameters.encodeutf8, dtypenp.object_ if return_finish_reason is not None inputs.appendgrpcclient.InferInputreturn_finish_reason, 1, BOOL inputs1.set_data_from_numpynp.arrayreturn_finish_reason, dtypebool if return_cumulative_logprob is not None inputs.append grpcclient.InferInputreturn_cumulative_logprob, 1, BOOL inputs1.set_data_from_numpy np.arrayreturn_cumulative_logprob, dtypebool if return_logprobs is not None inputs.appendgrpcclient.InferInputreturn_logprobs, 1, BOOL inputs1.set_data_from_numpynp.arrayreturn_logprobs, dtypebool if return_num_input_tokens is not None inputs.appendgrpcclient.InferInputreturn_num_input_tokens, 1, BOOL inputs1.set_data_from_numpy np.arrayreturn_num_input_tokens, dtypebool if return_num_output_tokens is not None inputs.append grpcclient.InferInputreturn_num_output_tokens, 1, BOOL inputs1.set_data_from_numpy np.arrayreturn_num_output_tokens, dtypebool return inputs def _callbackself, result, error self._responses.appendresult result, error error def _llm_inferself, inputs, sampling_parameters self._responses with grpcclient.InferenceServerClientself._grpc_url as client client.start_streamself._callback client.async_stream_infer self._model_name, inputsinputs, parameterssampling_parameters client.stop_stream assert lenself._responses 0 def _assert_text_output_validself text_output for response in self._responses result, error responseresult, responseerror assert error is None text_output result.as_numpynametext_output0.decodeutf8 assert lentext_output 0, output is empty assert text_output.count 4, output is not a sentence def _assert_finish_reasonself, return_finish_reason for i in rangelenself._responses result, error self._responsesiresult, self._responsesierror assert error is None finish_reason_np result.as_numpynamefinish_reason if return_finish_reason is None or return_finish_reason False assert finish_reason_np is None continue finish_reason finish_reason_np0.decodeutf8 if i lenself._responses 1 assert finish_reason None else assert finish_reason length def 
_assert_cumulative_logprobself, return_cumulative_logprob prev_cumulative_logprob 0.0 for response in self._responses result, error responseresult, responseerror assert error is None cumulative_logprob_np result.as_numpynamecumulative_logprob if return_cumulative_logprob is None or return_cumulative_logprob False assert cumulative_logprob_np is None continue cumulative_logprob cumulative_logprob_np0.astypefloat assert cumulative_logprob ! prev_cumulative_logprob prev_cumulative_logprob cumulative_logprob def _assert_logprobs self, stream, sampling_parameters, return_logprobs, return_num_output_tokens for response in self._responses result, error responseresult, responseerror assert error is None logprobs_np result.as_numpynamelogprobs if return_logprobs is None or return_logprobs False assert logprobs_np is None continue logprobs json.loadslogprobs_np0.decodeutf8 if logprobs not in sampling_parameters assert logprobs is None continue assert isinstancelogprobs, list assert lenlogprobs 1 if return_num_output_tokens True num_output_tokens result.as_numpynamenum_output_tokens0.astype int assert lenlogprobs num_output_tokens text_output_logprobs for logprobs_d in logprobs assert isinstancelogprobs_d, dict assert lenlogprobs_d 1 assert lenlogprobs_d sampling_parameterslogprobs 1 rank_one_found False for token_id, logprob_d in logprobs_d.items assert isinstancetoken_id, str assert lenlogprob_d 3 assert isinstancelogprob_dlogprob, float assert isinstancelogprob_drank, int assert isinstancelogprob_ddecoded_token, str if logprob_drank 1 assert not rank_one_found rank_one_found True text_output_logprobs logprob_ddecoded_token assert rank_one_found text_output result.as_numpynametext_output0.decodeutf8 if not stream given exclude_input_in_output is not set, prepend_input is True if not streaming and False if streaming text_output_logprobs self._prompt text_output_logprobs assert text_output_logprobs text_output def _assert_num_input_tokensself, return_num_input_tokens for response in self._responses result, error responseresult, responseerror assert error is None num_input_tokens_np result.as_numpynamenum_input_tokens if return_num_input_tokens is None or return_num_input_tokens False assert num_input_tokens_np is None continue num_input_tokens num_input_tokens_np.astypeint assert num_input_tokens 0 assert num_input_tokens lenself._prompt def _assert_num_output_tokensself, return_num_output_tokens for response in self._responses result, error responseresult, responseerror assert error is None num_output_tokens_np result.as_numpynamenum_output_tokens if return_num_output_tokens is None or return_num_output_tokens False assert num_output_tokens_np is None continue num_output_tokens num_output_tokens_np0.astypeint assert num_output_tokens 0 pytest.mark.parametrizestream, True, False pytest.mark.parametrizereturn_finish_reason, None, True, False pytest.mark.parametrizereturn_cumulative_logprob, None, True, False pytest.mark.parametrizelogprobs, None, 0, 2 pytest.mark.parametrizereturn_logprobs, None, True, False pytest.mark.parametrizereturn_num_input_tokens, None, True, False pytest.mark.parametrizereturn_num_output_tokens, None, True, False def test_additional_outputs self, stream, return_finish_reason, return_cumulative_logprob, logprobs, return_logprobs, return_num_input_tokens, return_num_output_tokens, sampling_parameters self._get_sampling_parameterslogprobslogprobs inputs self._get_inputs self._prompt, streamstream, sampling_parameterssampling_parameters, return_finish_reasonreturn_finish_reason, 
return_cumulative_logprobreturn_cumulative_logprob, return_logprobsreturn_logprobs, return_num_input_tokensreturn_num_input_tokens, return_num_output_tokensreturn_num_output_tokens, self._llm_inferinputs, sampling_parameters self._assert_text_output_valid self._assert_finish_reasonreturn_finish_reason self._assert_cumulative_logprobreturn_cumulative_logprob self._assert_logprobs stream, sampling_parameters, return_logprobs, return_num_output_tokens self._assert_num_input_tokensreturn_num_input_tokens self._assert_num_output_tokensreturn_num_output_tokens File ciL0_additional_outputs_vllmtest.sh !binbash Copyright 2024, NVIDIA CORPORATION AFFILIATES. | |
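In additional_outputs_test.py above, each optional output (finish_reason, cumulative_logprob, logprobs, num_input_tokens, num_output_tokens) is requested by sending an extra BOOL input tensor with the matching return_* name. A trimmed sketch showing two of those flags, following the same pattern as the test:

import numpy as np
import tritonclient.grpc as grpcclient


def build_additional_output_inputs(return_finish_reason=None, return_num_output_tokens=None):
    # Each optional output is enabled by a BOOL input with the matching name;
    # leaving a flag as None means the corresponding output tensor is not returned.
    inputs = []
    if return_finish_reason is not None:
        inputs.append(grpcclient.InferInput("return_finish_reason", [1], "BOOL"))
        inputs[-1].set_data_from_numpy(np.array([return_finish_reason], dtype=bool))
    if return_num_output_tokens is not None:
        inputs.append(grpcclient.InferInput("return_num_output_tokens", [1], "BOOL"))
        inputs[-1].set_data_from_numpy(np.array([return_num_output_tokens], dtype=bool))
    return inputs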
Chunk 31: | |
All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
Chunk 32: | |
Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |
Chunk 33: | |
export CUDA_VISIBLE_DEVICES0 source ..commonutil.sh pip3 install pytest8.1.1 pip3 install tritonclientgrpc Prepare Model rm rf models vllm_baseline_output.pkl mkdir p models SAMPLE_MODELS_REPO....samplesmodel_repository cp r SAMPLE_MODELS_REPOvllm_model modelsvllm_opt sed i sgpu_memory_utilization 0.5gpu_memory_utilization 0.3 modelsvllm_opt1model.json RET0 Test SERVER_LOGadditional_outputs_test.server.log SERVER_ARGSmodelrepositorymodels run_server if SERVER_PID 0 then echo e nn Failed to start SERVERn cat SERVER_LOG exit 1 fi set e python3 m pytest junitxmltest_additional_outputs.xml s v additional_outputs_test.py if ? ne 0 then echo e nn additional_outputs_test FAILED. n RET1 fi set e kill SERVER_PID wait SERVER_PID if RET eq 0 then echo e nn Test Passedn else echo e nn Test FAILEDn fi exit RET File ciL0_backend_vllmaccuracy_testaccuracy_test.py Copyright 20232024, NVIDIA CORPORATION AFFILIATES. | |
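The test script above patches samples/model_repository/vllm_model/1/model.json with sed, lowering gpu_memory_utilization from 0.5 to 0.3. That file is a plain JSON dictionary of vLLM engine arguments; a minimal sketch, noting that the exact key set in the sample may differ and that facebook/opt-125m is the model used elsewhere in this CI:

{
    "model": "facebook/opt-125m",
    "disable_log_stats": false,
    "gpu_memory_utilization": 0.5
}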
Chunk 34: | |
All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
Chunk 35: | |
Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |
Chunk 36: | |
import argparse import asyncio import pickle import sys import unittest from functools import partial import tritonclient.grpc as grpcclient from tritonclient.utils import from vllm import SamplingParams from vllm.engine.arg_utils import AsyncEngineArgs from vllm.engine.async_llm_engine import AsyncLLMEngine from vllm.utils import random_uuid sys.path.append....common from test_util import TestResultCollector, UserData, callback, create_vllm_request VLLM_ENGINE_CONFIG model facebookopt125m, gpu_memory_utilization 0.3, PROMPTS The most dangerous animal is, The capital of France is, The future of AI is, SAMPLING_PARAMETERS temperature 0, top_p 1 async def generate_python_vllm_outputprompt, llm_engine request_id random_uuid sampling_params SamplingParamsSAMPLING_PARAMETERS python_vllm_output None last_output None async for vllm_output in llm_engine.generateprompt, sampling_params, request_id last_output vllm_output if last_output python_vllm_output prompt output.text.encodeutf8 for output in last_output.outputs return python_vllm_output def prepare_vllm_baseline_outputs Helper function that starts async vLLM engine and generates output for each prompt in PROMPTS. Saves resulted baselines in vllm_baseline_output.pkl for further use. llm_engine AsyncLLMEngine.from_engine_argsAsyncEngineArgsVLLM_ENGINE_CONFIG python_vllm_output for i in rangelenPROMPTS python_vllm_output.extend asyncio.rungenerate_python_vllm_outputPROMPTSi, llm_engine with openvllm_baseline_output.pkl, wb as f pickle.dumppython_vllm_output, f return class VLLMTritonAccuracyTestTestResultCollector def setUpself self.triton_client grpcclient.InferenceServerClienturllocalhost8001 self.vllm_model_name vllm_opt self.python_vllm_output with openvllm_baseline_output.pkl, rb as f self.python_vllm_output pickle.loadf self.assertNotEqual self.python_vllm_output, , Loaded baseline outputs list should not be empty, self.assertIsNotNone self.python_vllm_output, Loaded baseline outputs list should not be None self.assertEqual lenself.python_vllm_output, lenPROMPTS, Unexpected number of baseline outputs loaded, expected , but got .format lenPROMPTS, lenself.python_vllm_output , def test_vllm_modelself user_data UserData stream False triton_vllm_output self.triton_client.start_streamcallbackpartialcallback, user_data for i in rangelenPROMPTS request_data create_vllm_request PROMPTSi, i, stream, SAMPLING_PARAMETERS, self.vllm_model_name self.triton_client.async_stream_infer model_nameself.vllm_model_name, request_idrequest_datarequest_id, inputsrequest_datainputs, outputsrequest_dataoutputs, parametersSAMPLING_PARAMETERS, for i in rangelenPROMPTS result user_data._completed_requests.get self.assertIsNottyperesult, InferenceServerException, strresult output result.as_numpytext_output self.assertIsNotNoneoutput, text_output should not be None triton_vllm_output.extendoutput self.triton_client.stop_stream self.assertEqualself.python_vllm_output.sort, triton_vllm_output.sort def tearDownself self.triton_client.close if __name__ __main__ parser argparse.ArgumentParser parser.add_argument generatebaseline, actionstore_true, requiredFalse, defaultFalse, helpGenerates baseline output for accuracy tests, FLAGS parser.parse_args if FLAGS.generate_baseline prepare_vllm_baseline_outputs exit0 unittest.main File ciL0_backend_vllmaccuracy_testtest.sh !binbash Copyright 20232024, NVIDIA CORPORATION AFFILIATES. | |
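accuracy_test.py above produces its baseline by running vLLM's AsyncLLMEngine directly and pickling the outputs; a condensed sketch of that flow, reconstructed from the flattened listing:

import asyncio
import pickle

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.utils import random_uuid

VLLM_ENGINE_CONFIG = {"model": "facebook/opt-125m", "gpu_memory_utilization": 0.3}
PROMPTS = ["The most dangerous animal is", "The capital of France is", "The future of AI is"]
SAMPLING_PARAMETERS = {"temperature": 0, "top_p": 1}


async def generate_python_vllm_output(prompt, llm_engine):
    # Stream generations for one prompt and keep only the final output.
    request_id = random_uuid()
    sampling_params = SamplingParams(**SAMPLING_PARAMETERS)
    last_output = None
    async for vllm_output in llm_engine.generate(prompt, sampling_params, request_id):
        last_output = vllm_output
    if last_output:
        return [(prompt + output.text).encode("utf-8") for output in last_output.outputs]
    return None


def prepare_vllm_baseline_outputs():
    # Saves the baselines in vllm_baseline_output.pkl for the Triton-side comparison.
    llm_engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(**VLLM_ENGINE_CONFIG))
    python_vllm_output = []
    for prompt in PROMPTS:
        python_vllm_output.extend(asyncio.run(generate_python_vllm_output(prompt, llm_engine)))
    with open("vllm_baseline_output.pkl", "wb") as f:
        pickle.dump(python_vllm_output, f)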
Chunk 37: | |
All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
Chunk 38: | |
Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |
Chunk 39: | |
source ....commonutil.sh TRITON_DIRTRITON_DIRopttritonserver SERVERTRITON_DIRbintritonserver BACKEND_DIRTRITON_DIRbackends SERVER_ARGSmodelrepositorypwdmodels backenddirectoryBACKEND_DIR logverbose1 SERVER_LOG.accuracy_test_server.log CLIENT_LOG.accuracy_test_client.log TEST_RESULT_FILEtest_results.txt CLIENT_PY.accuracy_test.py SAMPLE_MODELS_REPO......samplesmodel_repository VLLM_ENGINE_LOGvllm_engine.log EXPECTED_NUM_TESTS1 rm rf models mkdir p models cp r SAMPLE_MODELS_REPOvllm_model modelsvllm_opt sed i sgpu_memory_utilization 0.5gpu_memory_utilization 0.3 modelsvllm_opt1model.json f vllm_baseline_output.pkl rm vllm_baseline_output.pkl RET0 set e Need to generate baseline first, since running 2 vLLM engines causes memory issues httpsgithub.comvllmprojectvllmissues2248 python3 CLIENT_PY generatebaseline VLLM_ENGINE_LOG 21 BASELINE_PID! wait BASELINE_PID set e run_server if SERVER_PID 0 then cat SERVER_LOG echo e nn Failed to start SERVERn exit 1 fi set e python3 CLIENT_PY CLIENT_LOG 21 if ? ne 0 then cat CLIENT_LOG echo e nn Running CLIENT_PY FAILED. | |
Chunk 40: | |
n RET1 else check_test_results TEST_RESULT_FILE EXPECTED_NUM_TESTS if ? ne 0 then cat CLIENT_LOG echo e nn Test Result Verification FAILED.n RET1 fi fi set e kill SERVER_PID wait SERVER_PID rm rf models if RET eq 1 then cat CLIENT_LOG cat SERVER_LOG echo e nn Accuracy test FAILED. n else echo e nn Accuracy test PASSED. | |
Chunk 41: | |
n fi collect_artifacts_from_subdir exit RET File ciL0_backend_vllmenabled_streamenabled_stream_test.py Copyright 2023, NVIDIA CORPORATION AFFILIATES. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. | |
Chunk 42: | |
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. | |
Chunk 43: | |
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import sys import unittest import tritonclient.grpc.aio as grpcclient from tritonclient.utils import sys.path.append....common from test_util import AsyncTestResultCollector, create_vllm_request PROMPTS The most dangerous animal is SAMPLING_PARAMETERS temperature 0, top_p 1 class VLLMTritonStreamTestAsyncTestResultCollector async def _test_vllm_model self, promptsPROMPTS, sampling_parametersSAMPLING_PARAMETERS, streamTrue, exclude_input_in_outputNone, expected_outputNone, expect_errorFalse, async with grpcclient.InferenceServerClient urllocalhost8001 as triton_client model_name vllm_opt async def request_iterator for i, prompt in enumerateprompts yield create_vllm_request prompt, i, stream, sampling_parameters, model_name, exclude_input_in_outputexclude_input_in_output, response_iterator triton_client.stream_infer inputs_iteratorrequest_iterator final_response async for response in response_iterator result, error response if expect_error self.assertIsInstanceerror, InferenceServerException self.assertEqual error.message, Error generating stream When streaming, exclude_input_in_output False is not allowed., error, return self.assertIsNoneerror, error self.assertIsNotNoneresult, result output result.as_numpytext_output self.assertIsNotNoneoutput, text_output should not be None final_response.appendstroutput0, encodingutf8 if expected_output is not None self.assertEqual final_response, expected_output, Expected to receive the following response , but received ..format expected_output, final_response , async def test_vllm_model_enabled_streamself Verifying that request with multiple prompts runs successfully. prompts The most dangerous animal is, The future of AI is, await self._test_vllm_modelpromptsprompts async def test_vllm_model_enabled_stream_exclude_input_in_output_defaultself Verifying that streaming request returns only generated diffs, which is default behaviour for streamTrue. | |
Chunk 44: | |
expected_output the, one, that, is, most, likely, to, be, killed, by, a, car, ., n, I, m, await self._test_vllm_modelexpected_outputexpected_output async def test_vllm_model_enabled_stream_exclude_input_in_output_falseself Verifying that streaming request returns only generated diffs even if exclude_input_in_output is set to False explicitly. expected_output Error generating stream When streaming, exclude_input_in_output False is not allowed. await self._test_vllm_model exclude_input_in_outputFalse, expected_outputexpected_output, expect_errorTrue, if __name__ __main__ unittest.main File ciL0_backend_vllmenabled_streamtest.sh !binbash Copyright 2023, NVIDIA CORPORATION AFFILIATES. | |
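enabled_stream_test.py above drives the streaming path through tritonclient.grpc.aio. A minimal sketch of that call pattern, reconstructed from the listing; create_vllm_request is the helper from ci/common/test_util.py shown earlier:

import asyncio
import tritonclient.grpc.aio as grpcclient

from test_util import create_vllm_request


async def stream_prompts(prompts, sampling_parameters, model_name="vllm_opt"):
    outputs = []
    async with grpcclient.InferenceServerClient(url="localhost:8001") as client:

        async def request_iterator():
            for i, prompt in enumerate(prompts):
                yield create_vllm_request(prompt, i, True, sampling_parameters, model_name)

        # stream_infer yields (result, error) tuples as responses arrive.
        async for result, error in client.stream_infer(inputs_iterator=request_iterator()):
            assert error is None, error
            outputs.append(result.as_numpy("text_output")[0].decode("utf-8"))
    return outputs

With stream=True and the default exclude_input_in_output, each response carries only the newly generated tokens, which is exactly what the expected_output list in the test checks.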
Chunk 45: | |
All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
Chunk 46: | |
Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |
Chunk 47: | |
source ....commonutil.sh TRITON_DIRTRITON_DIRopttritonserver SERVERTRITON_DIRbintritonserver BACKEND_DIRTRITON_DIRbackends SERVER_ARGSmodelrepositorypwdmodels backenddirectoryBACKEND_DIR logverbose1 SERVER_LOG.enabled_stream_server.log CLIENT_LOG.enabled_stream_client.log TEST_RESULT_FILEtest_results.txt CLIENT_PY.enabled_stream_test.py SAMPLE_MODELS_REPO......samplesmodel_repository EXPECTED_NUM_TESTS3 rm rf models mkdir p models cp r SAMPLE_MODELS_REPOvllm_model modelsvllm_opt RET0 run_server if SERVER_PID 0 then cat SERVER_LOG echo e nn Failed to start SERVERn exit 1 fi set e python3 CLIENT_PY v CLIENT_LOG 21 if ? ne 0 then cat CLIENT_LOG echo e nn Running CLIENT_PY FAILED. n RET1 else check_test_results TEST_RESULT_FILE EXPECTED_NUM_TESTS if ? | |
Chunk 48: | |
ne 0 then cat CLIENT_LOG echo e nn Test Result Verification FAILED.n RET1 fi fi set e kill SERVER_PID wait SERVER_PID rm rf models if RET eq 1 then cat CLIENT_LOG cat SERVER_LOG echo e nn Enabled Stream test FAILED. n else echo e nn Enabled Stream test PASSED. n fi collect_artifacts_from_subdir exit RET File ciL0_backend_vllmmetrics_testtest.sh !binbash Copyright 2024, NVIDIA CORPORATION AFFILIATES. | |
Chunk 49: | |
All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
Chunk 50: | |
Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |
Chunk 51: | |
source ....commonutil.sh TRITON_DIRTRITON_DIRopttritonserver SERVERTRITON_DIRbintritonserver BACKEND_DIRTRITON_DIRbackends SERVER_ARGSmodelrepositorypwdmodels backenddirectoryBACKEND_DIR modelcontrolmodeexplicit loadmodelvllm_opt logverbose1 SERVER_LOG.vllm_metrics_server.log CLIENT_LOG.vllm_metrics_client.log TEST_RESULT_FILEtest_results.txt CLIENT_PY.vllm_metrics_test.py SAMPLE_MODELS_REPO......samplesmodel_repository EXPECTED_NUM_TESTS1 Helpers function copy_model_repository rm rf models mkdir p models cp r SAMPLE_MODELS_REPOvllm_model modelsvllm_opt vllm_opt model will be loaded on server start and stay loaded throughout unittesting. To ensure that vllms memory profiler will not error out on vllm_load_test load, we reduce gpu_memory_utilization for vllm_opt, so that at least 60 of GPU memory was available for other models. sed i sgpu_memory_utilization 0.5gpu_memory_utilization 0.4 modelsvllm_opt1model.json run_test local TEST_CASE1 run_server if SERVER_PID 0 then cat SERVER_LOG echo e nn Failed to start SERVERn exit 1 fi set e python3 CLIENT_PY TEST_CASE v CLIENT_LOG 21 if ? | |
Chunk 52: | |
ne 0 then cat CLIENT_LOG echo e nn Running CLIENT_PY TEST_CASE FAILED. n RET1 else check_test_results TEST_RESULT_FILE EXPECTED_NUM_TESTS if ? ne 0 then cat CLIENT_LOG echo e nn Test Result Verification FAILED.n RET1 fi fi set e TODO Nongraceful shutdown when metrics are enabled. | |
Chunk 53: | |
kill SERVER_PID wait SERVER_PID RET0 Test disabling vLLM metrics reporting without parameter REPORT_CUSTOM_METRICS in config.pbtxt copy_model_repository run_test VLLMTritonMetricsTest.test_vllm_metrics_disabled Test disabling vLLM metrics reporting with parameter REPORT_CUSTOM_METRICS set to false in config.pbtxt copy_model_repository echo e parameters key REPORT_CUSTOM_METRICS value string_value false modelsvllm_optconfig.pbtxt run_test VLLMTritonMetricsTest.test_vllm_metrics_disabled Test vLLM metrics reporting with parameter REPORT_CUSTOM_METRICS set to true in config.pbtxt copy_model_repository cp SAMPLE_MODELS_REPOvllm_modelconfig.pbtxt modelsvllm_opt echo e parameters key REPORT_CUSTOM_METRICS value string_value true modelsvllm_optconfig.pbtxt run_test VLLMTritonMetricsTest.test_vllm_metrics Test vLLM metrics custom sampling parameters Custom sampling parameters may result in different vLLM output depending on the platform. Therefore, these metrics are tests separately. copy_model_repository cp SAMPLE_MODELS_REPOvllm_modelconfig.pbtxt modelsvllm_opt echo e parameters key REPORT_CUSTOM_METRICS value string_value true modelsvllm_optconfig.pbtxt run_test VLLMTritonMetricsTest.test_custom_sampling_params Test enabling vLLM metrics reporting in config.pbtxt but disabling in model.json copy_model_repository jq . | |
Chunk 54: | |
disable_log_stats true modelsvllm_opt1model.json temp.json mv temp.json modelsvllm_opt1model.json echo e parameters key REPORT_CUSTOM_METRICS value string_value true modelsvllm_optconfig.pbtxt run_test VLLMTritonMetricsTest.test_vllm_metrics_disabled Test enabling vLLM metrics reporting in config.pbtxt while disabling in server option copy_model_repository echo e parameters key REPORT_CUSTOM_METRICS value string_value true modelsvllm_optconfig.pbtxt SERVER_ARGSSERVER_ARGS allowmetricsfalse run_test VLLMTritonMetricsTest.test_vllm_metrics_refused rm rf .models temp.json if RET eq 1 then cat CLIENT_LOG cat SERVER_LOG echo e nn vLLM test FAILED. n else echo e nn vLLM test PASSED. n fi collect_artifacts_from_subdir exit RET File ciL0_backend_vllmmetrics_testvllm_metrics_test.py Copyright 2024, NVIDIA CORPORATION AFFILIATES. | |
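The metrics test.sh above enables or disables custom metrics by appending a parameters block to the model's config.pbtxt. Reformatted from the flattened echo command (quoting inferred), the appended block reads:

parameters: {
  key: "REPORT_CUSTOM_METRICS"
  value: {
    string_value: "true"
  }
}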
Chunk 55: | |
All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
Chunk 56: | |
Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |
Chunk 57: | |
import os import re import sys import unittest from functools import partial import requests import tritonclient.grpc as grpcclient from tritonclient.utils import sys.path.append....common from test_util import TestResultCollector, UserData, callback, create_vllm_request class VLLMTritonMetricsTestTestResultCollector def setUpself self.triton_client grpcclient.InferenceServerClienturllocalhost8001 self.tritonserver_ipaddr os.environ.getTRITONSERVER_IPADDR, localhost self.vllm_model_name vllm_opt self.prompts The most dangerous animal is, The capital of France is, The future of AI is, self.sampling_parameters temperature 0, top_p 1 def parse_vllm_metricsself Store vllm metrics in a dictionary. r requests.getfhttpself.tritonserver_ipaddr8002metrics r.raise_for_status Regular expression to match the pattern pattern rvllm ?.? 09. | |
Chunk 58: | |
vllm_dict Find all matches in the text matches re.findallpattern, r.text, re.MULTILINE for match in matches key, value match vllm_dictkey floatvalue if . in value else intvalue return vllm_dict def vllm_infer self, prompts, sampling_parameters, model_name, Helper function to send async stream infer requests to vLLM. user_data UserData number_of_vllm_reqs lenprompts self.triton_client.start_streamcallbackpartialcallback, user_data for i in rangenumber_of_vllm_reqs request_data create_vllm_request promptsi, i, False, sampling_parameters, model_name, True, self.triton_client.async_stream_infer model_namemodel_name, inputsrequest_datainputs, request_idrequest_datarequest_id, outputsrequest_dataoutputs, parameterssampling_parameters, for _ in rangenumber_of_vllm_reqs result user_data._completed_requests.get if typeresult is InferenceServerException printresult.message self.assertIsNottyperesult, InferenceServerException, strresult output result.as_numpytext_output self.assertIsNotNoneoutput, text_output should not be None self.triton_client.stop_stream def test_vllm_metricsself Test vLLM metrics self.vllm_infer promptsself.prompts, sampling_parametersself.sampling_parameters, model_nameself.vllm_model_name, metrics_dict self.parse_vllm_metrics total_prompts lenself.prompts vllmprompt_tokens_total 2, 133, 144, 2702, 3477, 16 2, 133, 812, 9, 1470, 16 2, 133, 499, 9, 4687, 16 self.assertEqualmetrics_dictvllmprompt_tokens_total, 18 vllmgeneration_tokens_total 5, 65, 14, 16, 144, 533, 7, 28, 848, 30, 10, 512, 4, 50118, 100, 437 5, 812, 9, 5, 1515, 3497, 4, 50118, 50118, 133, 812, 9, 1470, 16, 5, 812 11, 5, 1420, 9, 5, 82, 4, 50118, 50118, 133, 499, 9, 4687, 16, 11, 5 self.assertEqualmetrics_dictvllmgeneration_tokens_total, 48 vllmtime_to_first_token_seconds self.assertEqual metrics_dictvllmtime_to_first_token_seconds_count, total_prompts self.assertGreatermetrics_dictvllmtime_to_first_token_seconds_sum, 0 self.assertEqual metrics_dictvllmtime_to_first_token_seconds_bucket, total_prompts vllmtime_per_output_token_seconds self.assertEqualmetrics_dictvllmtime_per_output_token_seconds_count, 45 self.assertGreatermetrics_dictvllmtime_per_output_token_seconds_sum, 0 self.assertEqualmetrics_dictvllmtime_per_output_token_seconds_bucket, 45 vllme2e_request_latency_seconds self.assertEqual metrics_dictvllme2e_request_latency_seconds_count, total_prompts self.assertGreatermetrics_dictvllme2e_request_latency_seconds_sum, 0 self.assertEqual metrics_dictvllme2e_request_latency_seconds_bucket, total_prompts vllmrequest_prompt_tokens self.assertEqual metrics_dictvllmrequest_prompt_tokens_count, total_prompts self.assertEqualmetrics_dictvllmrequest_prompt_tokens_sum, 18 self.assertEqual metrics_dictvllmrequest_prompt_tokens_bucket, total_prompts vllmrequest_generation_tokens self.assertEqual metrics_dictvllmrequest_generation_tokens_count, total_prompts, self.assertEqualmetrics_dictvllmrequest_generation_tokens_sum, 48 self.assertEqual metrics_dictvllmrequest_generation_tokens_bucket, total_prompts, TODO Revisit this test due to the removal of best_of def test_custom_sampling_paramsself Adding sampling parameters for testing metrics. | |
Chunk 59: | |
Definitions can be found here httpsdocs.vllm.aienlatestdevsampling_params.html n, best_of 2, 4 custom_sampling_parameters self.sampling_parameters.copy Changing temperature because best_of must be 1 when using greedy sampling, i.e. temperature 0. custom_sampling_parameters.update n strn, best_of strbest_of, temperature 1 Test vLLM metrics self.vllm_infer promptsself.prompts, sampling_parameterscustom_sampling_parameters, model_nameself.vllm_model_name, metrics_dict self.parse_vllm_metrics total_prompts lenself.prompts vllmrequest_params_best_of self.assertEqual metrics_dictvllmrequest_params_best_of_count, total_prompts self.assertEqual metrics_dictvllmrequest_params_best_of_sum, best_of total_prompts self.assertEqual metrics_dictvllmrequest_params_best_of_bucket, total_prompts vllmrequest_params_n self.assertEqualmetrics_dictvllmrequest_params_n_count, total_prompts self.assertEqualmetrics_dictvllmrequest_params_n_sum, n total_prompts self.assertEqualmetrics_dictvllmrequest_params_n_bucket, total_prompts def test_vllm_metrics_disabledself Test vLLM metrics self.vllm_infer promptsself.prompts, sampling_parametersself.sampling_parameters, model_nameself.vllm_model_name, metrics_dict self.parse_vllm_metrics No vLLM metric found self.assertEquallenmetrics_dict, 0 def test_vllm_metrics_refusedself Test vLLM metrics self.vllm_infer promptsself.prompts, sampling_parametersself.sampling_parameters, model_nameself.vllm_model_name, with self.assertRaisesrequests.exceptions.ConnectionError self.parse_vllm_metrics def tearDownself self.triton_client.close if __name__ __main__ unittest.main File ciL0_backend_vllmrequest_cancellationrequest_cancellation_test.py Copyright 2023, NVIDIA CORPORATION AFFILIATES. | |
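vllm_metrics_test.py above scrapes Triton's metrics endpoint on port 8002 and keeps only the vllm:* series. A condensed sketch of parse_vllm_metrics; the regular expression here is a simplified stand-in for the garbled pattern in the listing:

import re
import requests


def parse_vllm_metrics(host="localhost", port=8002):
    # Scrape the Prometheus-format metrics Triton exposes and keep only vllm:* series.
    r = requests.get(f"http://{host}:{port}/metrics")
    r.raise_for_status()
    vllm_dict = {}
    pattern = r"^(vllm:[a-zA-Z_]+)(?:\{[^}]*\})? ([0-9.e+-]+)$"
    for key, value in re.findall(pattern, r.text, re.MULTILINE):
        vllm_dict[key] = float(value) if "." in value else int(value)
    return vllm_dict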
Chunk 60: | |
All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
Chunk 61: | |
Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |
Chunk 62: | |
import sys import time import unittest from functools import partial import tritonclient.grpc as grpcclient from tritonclient.utils import sys.path.append....common from test_util import TestResultCollector, UserData, callback, create_vllm_request class VLLMRequestCancelTestTestResultCollector def test_request_cancellationself, send_parameters_as_tensorTrue with grpcclient.InferenceServerClienturllocalhost8001 as triton_client log_file_path .request_cancellation_server.log user_data UserData model_name vllm_opt stream False sampling_parameters temperature 0, top_p 1, max_tokens 1500, prompt fWrite an original and creative poem of at least 200 words. triton_client.start_streamcallbackpartialcallback, user_data request_data create_vllm_request prompt, 1, stream, sampling_parameters, model_name, send_parameters_as_tensor, triton_client.async_stream_infer model_namemodel_name, request_idrequest_datarequest_id, inputsrequest_datainputs, outputsrequest_dataoutputs, parameterssampling_parameters, time.sleep1 triton_client.stop_streamcancel_requestsTrue time.sleep1 self.assertFalseuser_data._completed_requests.empty result user_data._completed_requests.get self.assertIsInstanceresult, InferenceServerException self.assertEqualresult.status, StatusCode.CANCELLED self.assertTrueuser_data._completed_requests.empty with openlog_file_path, moder as log_file log_text log_file.read self.assertInvllm Cancelling the request, log_text self.assertInvllm Successfully cancelled the request, log_text if __name__ __main__ unittest.main File ciL0_backend_vllmrequest_cancellationtest.sh !binbash Copyright 2023, NVIDIA CORPORATION AFFILIATES. All rights reserved. | |
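request_cancellation_test.py above starts one long generation (max_tokens 1500) and then cancels it mid-flight. The essential client-side steps, reconstructed from the listing; UserData, callback and create_vllm_request come from ci/common/test_util.py shown earlier:

import time
from functools import partial

import tritonclient.grpc as grpcclient
from tritonclient.utils import InferenceServerException

from test_util import UserData, callback, create_vllm_request


def cancel_long_request(model_name="vllm_opt"):
    user_data = UserData()
    sampling_parameters = {"temperature": "0", "top_p": "1", "max_tokens": "1500"}
    prompt = "Write an original and creative poem of at least 200 words."

    with grpcclient.InferenceServerClient(url="localhost:8001") as client:
        client.start_stream(callback=partial(callback, user_data))
        request = create_vllm_request(prompt, 1, False, sampling_parameters, model_name)
        client.async_stream_infer(
            model_name=model_name,
            request_id=request["request_id"],
            inputs=request["inputs"],
            outputs=request["outputs"],
            parameters=sampling_parameters,
        )
        time.sleep(1)
        # Closing the stream with cancel_requests=True cancels the in-flight request;
        # the callback then receives an InferenceServerException (StatusCode.CANCELLED).
        client.stop_stream(cancel_requests=True)

    result = user_data._completed_requests.get()
    assert isinstance(result, InferenceServerException)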
Chunk 63: | |
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. | |
Chunk 64: | |
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. source ....commonutil.sh TRITON_DIRTRITON_DIRopttritonserver SERVERTRITON_DIRbintritonserver BACKEND_DIRTRITON_DIRbackends SERVER_ARGSmodelrepositorypwdmodels backenddirectoryBACKEND_DIR logverbose1 SERVER_LOG.request_cancellation_server.log CLIENT_LOG.request_cancellation_client.log TEST_RESULT_FILEtest_results.txt CLIENT_PY.request_cancellation_test.py SAMPLE_MODELS_REPO......samplesmodel_repository EXPECTED_NUM_TESTS1 rm rf models mkdir p models cp r SAMPLE_MODELS_REPOvllm_model modelsvllm_opt RET0 run_server if SERVER_PID 0 then cat SERVER_LOG echo e nn Failed to start SERVERn exit 1 fi set e python3 CLIENT_PY v CLIENT_LOG 21 if ? | |
Chunk 65: | |
ne 0 then cat CLIENT_LOG echo e nn Running CLIENT_PY FAILED. n RET1 else check_test_results TEST_RESULT_FILE EXPECTED_NUM_TESTS if ? ne 0 then cat CLIENT_LOG echo e nn Test Result Verification FAILED.n RET1 fi fi set e kill SERVER_PID wait SERVER_PID rm rf models if RET eq 1 then cat CLIENT_LOG cat SERVER_LOG echo e nn Request Cancellation test FAILED. | |
Chunk 66: | |
n else echo e nn Request Cancellation test PASSED. n fi collect_artifacts_from_subdir exit RET File ciL0_backend_vllmvllm_backendensemble_config.pbtxt Copyright 2024, NVIDIA CORPORATION AFFILIATES. All rights reserved. | |
Chunk 67: | |
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. | |
Chunk 68: | |
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. name ensemble_model platform ensemble max_batch_size 1 input name text_input data_type TYPE_STRING dims 1 output name text_output data_type TYPE_STRING dims 1 ensemble_scheduling step model_name vllm_opt model_version 1 input_map key text_input value text_input output_map key text_output value text_output File ciL0_backend_vllmvllm_backendtest.sh !binbash Copyright 20232024, NVIDIA CORPORATION AFFILIATES. | |
Chunk 69: | |
All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
Chunk 70: | |
Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |
Chunk 71: | |
source ....commonutil.sh TRITON_DIRTRITON_DIRopttritonserver SERVERTRITON_DIRbintritonserver BACKEND_DIRTRITON_DIRbackends SERVER_ARGSmodelrepositorypwdmodels backenddirectoryBACKEND_DIR modelcontrolmodeexplicit loadmodelvllm_opt logverbose1 SERVER_LOG.vllm_backend_server.log CLIENT_LOG.vllm_backend_client.log TEST_RESULT_FILEtest_results.txt CLIENT_PY.vllm_backend_test.py SAMPLE_MODELS_REPO......samplesmodel_repository EXPECTED_NUM_TESTS6 Helpers function assert_curl_success message1 if code ! 200 then cat .curl.out echo e nn message line BASH_LINENOn RET1 fi rm rf models mkdir p models cp r SAMPLE_MODELS_REPOvllm_model modelsvllm_opt vllm_opt model will be loaded on server start and stay loaded throughout unittesting. To test vllm model loadunload we use a dedicated vllm_load_test. | |
Chunk 72: | |
To ensure that vllms memory profiler will not error out on vllm_load_test load, we reduce gpu_memory_utilization for vllm_opt, so that at least 60 of GPU memory was available for other models. sed i sgpu_memory_utilization 0.5gpu_memory_utilization 0.4 modelsvllm_opt1model.json cp r modelsvllm_opt modelsvllm_load_test mkdir p modelsadd_sub1 wget P modelsadd_sub1 httpsraw.githubusercontent.comtritoninferenceserverpython_backendmainexamplesadd_submodel.py wget P modelsadd_sub httpsraw.githubusercontent.comtritoninferenceserverpython_backendmainexamplesadd_subconfig.pbtxt Invalid model attribute cp r SAMPLE_MODELS_REPOvllm_model modelsvllm_invalid_1 sed i sdisable_log_requestsinvalid_attribute modelsvllm_invalid_11model.json Invalid model name cp r SAMPLE_MODELS_REPOvllm_model modelsvllm_invalid_2 sed i sfacebookopt125minvalid_model modelsvllm_invalid_21model.json Sanity check ensembles are enabled and can successfully be loaded mkdir p modelsensemble_model1 cp r ensemble_config.pbtxt modelsensemble_modelconfig.pbtxt RET0 run_server if SERVER_PID 0 then cat SERVER_LOG echo e nn Failed to start SERVERn exit 1 fi set e python3 CLIENT_PY v CLIENT_LOG 21 if ? ne 0 then cat CLIENT_LOG echo e nn Running CLIENT_PY FAILED. | |
Chunk 73: | |
n RET1 else check_test_results TEST_RESULT_FILE EXPECTED_NUM_TESTS if ? ne 0 then cat CLIENT_LOG echo e nn Test Result Verification FAILED.n RET1 fi fi set e kill SERVER_PID wait SERVER_PID Test Python backend cmdline parameters are propagated to vllm backend SERVER_ARGSmodelrepositorypwdmodels backenddirectoryBACKEND_DIR backendconfigpython,defaultmaxbatchsize8 SERVER_LOG.vllm_test_cmdline_server.log rm rf .modelsvllm_invalid_1 .modelsvllm_invalid_2 .modelsvllm_load_test run_server if SERVER_PID 0 then cat SERVER_LOG echo e nn Failed to start SERVERn exit 1 fi kill SERVER_PID wait SERVER_PID rm rf .models COUNTgrep c defaultmaxbatchsize8 SERVER_LOG if COUNT ne 2 then echo Cmdline parameters verification Failed fi Test loading multiple vllm models SERVER_ARGSmodelrepositorypwdmodels backenddirectoryBACKEND_DIR modelcontrolmodeexplicit loadmodelvllm_one SERVER_LOG.vllm_test_multi_model.log Create two models, one is just a copy of the other, and make sure gpu utilization is low enough for multiple models to avoid OOM. vLLM changed behavior of their GPU profiler from total to free memory, so to load two small models, we need to start triton server in explicit mode. | |
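For reference, the explicit load that this script performs below with curl against the HTTP repository endpoint (localhost:8000) can also be issued from the Python gRPC client used throughout these tests. A minimal sketch, assuming the server is running with --model-control-mode=explicit and the vllm_one/vllm_two repository laid out as below:
```python
# Minimal sketch: explicitly load/unload a vLLM model when Triton runs in
# explicit model-control mode (model names follow the vllm_one/vllm_two setup above).
import tritonclient.grpc as grpcclient

with grpcclient.InferenceServerClient("localhost:8001") as client:
    client.load_model("vllm_two")              # same effect as the curl POST .../load below
    assert client.is_model_ready("vllm_two")   # readiness mirrors the repository index state

    client.unload_model("vllm_two")            # release the engine and its GPU memory
    assert not client.is_model_ready("vllm_two")
```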
Chunk 74: | |
MODEL1vllm_one MODEL2vllm_two mkdir p models cp r SAMPLE_MODELS_REPOvllm_model modelsMODEL1 cp r modelsMODEL1 modelsMODEL2 sed i sgpu_memory_utilization 0.5gpu_memory_utilization 0.4 modelsMODEL11model.json sed i sgpu_memory_utilization 0.5gpu_memory_utilization 0.4 modelsMODEL21model.json run_server if SERVER_PID 0 then cat SERVER_LOG echo e nn Failed to start SERVERn exit 1 fi Explicitly load model rm f .curl.out set e codecurl s w http_code o .curl.out X POST localhost8000v2repositorymodelsvllm_twoload set e assert_curl_success Failed to load vllm_two model kill SERVER_PID wait SERVER_PID rm rf .models if RET eq 1 then cat CLIENT_LOG cat SERVER_LOG echo e nn vLLM test FAILED. n else echo e nn vLLM test PASSED. n fi collect_artifacts_from_subdir exit RET File ciL0_backend_vllmvllm_backendvllm_backend_test.py Copyright 20232024, NVIDIA CORPORATION AFFILIATES. | |
Chunk 75: | |
All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
Chunk 76: | |
Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |
Chunk 77: | |
import sys import unittest from functools import partial import numpy as np import tritonclient.grpc as grpcclient from tritonclient.utils import sys.path.append....common from test_util import TestResultCollector, UserData, callback, create_vllm_request PROMPTS The most dangerous animal is, The capital of France is, The future of AI is, SAMPLING_PARAMETERS temperature 0, top_p 1 class VLLMTritonBackendTestTestResultCollector def setUpself self.triton_client grpcclient.InferenceServerClienturllocalhost8001 self.vllm_model_name vllm_opt self.python_model_name add_sub self.ensemble_model_name ensemble_model self.vllm_load_test vllm_load_test def test_vllm_triton_backendself Load both vllm and add_sub models self.triton_client.load_modelself.vllm_load_test self.assertTrueself.triton_client.is_model_readyself.vllm_load_test self.triton_client.load_modelself.python_model_name self.assertTrueself.triton_client.is_model_readyself.python_model_name Test to ensure that ensemble models are supported in vllm container. If ensemble support not present, triton will error out at model loading stage. Ensemble Model is a pipeline consisting of 1 model vllm_opt self.triton_client.load_modelself.ensemble_model_name self.assertTrueself.triton_client.is_model_readyself.ensemble_model_name self.triton_client.unload_modelself.ensemble_model_name Unload vllm model and test add_sub model self.triton_client.unload_modelself.vllm_load_test self.assertFalseself.triton_client.is_model_readyself.vllm_load_test self._test_python_model Load vllm model and unload add_sub model self.triton_client.load_modelself.vllm_load_test self.assertTrueself.triton_client.is_model_readyself.vllm_load_test self.triton_client.unload_modelself.python_model_name self.assertFalseself.triton_client.is_model_readyself.python_model_name Test vllm model and unload vllm model self._test_vllm_model promptsPROMPTS, sampling_parametersSAMPLING_PARAMETERS, streamFalse, send_parameters_as_tensorTrue, model_nameself.vllm_load_test, self._test_vllm_model promptsPROMPTS, sampling_parametersSAMPLING_PARAMETERS, streamFalse, send_parameters_as_tensorFalse, model_nameself.vllm_load_test, self.triton_client.unload_modelself.vllm_load_test self.assertFalseself.triton_client.is_model_readyself.vllm_load_test def test_model_with_invalid_attributesself model_name vllm_invalid_1 with self.assertRaisesInferenceServerException self.triton_client.load_modelmodel_name def test_vllm_invalid_model_nameself model_name vllm_invalid_2 with self.assertRaisesInferenceServerException self.triton_client.load_modelmodel_name def test_exclude_input_in_output_defaultself Verifying default behavior for exclude_input_in_output in nonstreaming mode. | |
Chunk 78: | |
Expected result prompt is returned with diffs. prompts The capital of France is, expected_output bThe capital of France is the capital of the French Republic.nnThe capital of France is the capital sampling_parameters temperature 0, top_p 1 self._test_vllm_model prompts, sampling_parameters, streamFalse, send_parameters_as_tensorTrue, expected_outputexpected_output, def test_exclude_input_in_output_falseself Verifying behavior for exclude_input_in_output False in nonstreaming mode. Expected result prompt is returned with diffs. | |
Chunk 79: | |
Test vllm model and unload vllm model prompts The capital of France is, expected_output bThe capital of France is the capital of the French Republic.nnThe capital of France is the capital sampling_parameters temperature 0, top_p 1 self._test_vllm_model prompts, sampling_parameters, streamFalse, send_parameters_as_tensorTrue, exclude_input_in_outputFalse, expected_outputexpected_output, def test_exclude_input_in_output_trueself Verifying behavior for exclude_input_in_output True in nonstreaming mode. Expected result only diffs are returned. Test vllm model and unload vllm model prompts The capital of France is, expected_output b the capital of the French Republic.nnThe capital of France is the capital sampling_parameters temperature 0, top_p 1 self._test_vllm_model prompts, sampling_parameters, streamFalse, send_parameters_as_tensorTrue, exclude_input_in_outputTrue, expected_outputexpected_output, def _test_vllm_model self, prompts, sampling_parameters, stream, send_parameters_as_tensor, exclude_input_in_outputNone, expected_outputNone, model_namevllm_opt, user_data UserData number_of_vllm_reqs lenprompts self.triton_client.start_streamcallbackpartialcallback, user_data for i in rangenumber_of_vllm_reqs request_data create_vllm_request promptsi, i, stream, sampling_parameters, model_name, send_parameters_as_tensor, exclude_input_in_outputexclude_input_in_output, self.triton_client.async_stream_infer model_namemodel_name, request_idrequest_datarequest_id, inputsrequest_datainputs, outputsrequest_dataoutputs, parameterssampling_parameters, for i in rangenumber_of_vllm_reqs result user_data._completed_requests.get if typeresult is InferenceServerException printresult.message self.assertIsNottyperesult, InferenceServerException, strresult output result.as_numpytext_output self.assertIsNotNoneoutput, text_output should not be None if expected_output is not None self.assertEqual output, expected_outputi, Actual and expected outputs do not match.n Expected n Actual.format output, expected_outputi , self.triton_client.stop_stream def _test_python_modelself shape 4 input0_data np.random.randshape.astypenp.float32 input1_data np.random.randshape.astypenp.float32 inputs grpcclient.InferInput INPUT0, input0_data.shape, np_to_triton_dtypeinput0_data.dtype , grpcclient.InferInput INPUT1, input1_data.shape, np_to_triton_dtypeinput1_data.dtype , inputs0.set_data_from_numpyinput0_data inputs1.set_data_from_numpyinput1_data outputs grpcclient.InferRequestedOutputOUTPUT0, grpcclient.InferRequestedOutputOUTPUT1, response self.triton_client.infer self.python_model_name, inputs, request_id10, outputsoutputs self.assertTrue np.allcloseinput0_data input1_data, response.as_numpyOUTPUT0 self.assertTrue np.allcloseinput0_data input1_data, response.as_numpyOUTPUT1 def tearDownself self.triton_client.close if __name__ __main__ unittest.main File ciL0_backend_vllmtest.sh !binbash Copyright 2024, NVIDIA CORPORATION AFFILIATES. | |
Chunk 80: | |
All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
Chunk 81: | |
Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |
Chunk 82: | |
RET0 SUBTESTSaccuracy_test request_cancellation enabled_stream vllm_backend metrics_test python3 m pip install tritonclientgrpc for TEST in SUBTESTS do cd TEST bash ex test.sh cd .. if ? ne 0 then echo Subtest TEST FAILED RET1 fi done if RET eq 0 then echo e nn vLLM Backend Test Passedn else echo e nn vLLM Backend Test FAILEDn fi exit RET File ciL0_check_health_vllmcheck_health_test.py Copyright 2024, NVIDIA CORPORATION AFFILIATES. | |
Chunk 83: | |
All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
Chunk 84: | |
Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |
Chunk 85: | |
import json import numpy as np import tritonclient.grpc as grpcclient class TestCheckHealth _grpc_url localhost8001 _model_name vllm_opt _sampling_parameters temperature 0, top_p 1 _prompt In this example, def _get_inputsself, prompt, streamTrue, sampling_parametersNone inputs inputs.appendgrpcclient.InferInputtext_input, 1, BYTES inputs1.set_data_from_numpy np.arrayprompt.encodeutf8, dtypenp.object_ inputs.appendgrpcclient.InferInputstream, 1, BOOL inputs1.set_data_from_numpynp.arraystream, dtypebool if sampling_parameters is not None inputs.appendgrpcclient.InferInputsampling_parameters, 1, BYTES inputs1.set_data_from_numpy np.array json.dumpssampling_parameters.encodeutf8, dtypenp.object_ return inputs def _callbackself, result, error self._responses.appendresult result, error error def _llm_inferself inputs self._get_inputs self._prompt, streamTrue, sampling_parametersself._sampling_parameters self._responses with grpcclient.InferenceServerClientself._grpc_url as client client.start_streamself._callback client.async_stream_infer self._model_name, inputsinputs, parametersself._sampling_parameters client.stop_stream def _assert_text_output_validself text_output for response in self._responses result, error responseresult, responseerror assert error is None text_output result.as_numpynametext_output0.decodeutf8 assert lentext_output 0, output is empty assert text_output.count 4, output is not a sentence def _assert_infer_exceptionself, expected_exception_message assert lenself._responses 1 for response in self._responses result, error responseresult, responseerror assert result is None assert strerror expected_exception_message def _assert_model_readyself, expected_readiness with grpcclient.InferenceServerClientself._grpc_url as client is_model_ready API assert client.is_model_readyself._model_name expected_readiness get_model_repository_index API model_state None for model_index in client.get_model_repository_index.models if model_index.name self._model_name assert model_state is None, duplicate model index found model_state model_index.state READY assert model_state expected_readiness def test_vllm_is_healthyself num_repeats 3 for i in rangenum_repeats self._assert_model_readyTrue self._llm_infer self._assert_text_output_valid self._assert_model_readyTrue def test_vllm_not_healthyself self._assert_model_readyTrue The 1st infer should complete successfully self._llm_infer self._assert_text_output_valid self._assert_model_readyTrue The 2nd infer should begin with health check failed self._llm_infer self._assert_infer_exception Model is unavailable due to unhealthy vLLM engine self._assert_model_readyFalse The 3rd infer should have model not found self._llm_infer self._assert_infer_exception Request for unknown model vllm_opt has no available versions self._assert_model_readyFalse File ciL0_check_health_vllmtest.sh !binbash Copyright 2024, NVIDIA CORPORATION AFFILIATES. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. | |
Chunk 86: | |
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. | |
Chunk 87: | |
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. export CUDA_VISIBLE_DEVICES0 source ..commonutil.sh pip3 install pytest8.1.1 pip3 install tritonclientgrpc RET0 function setup_model_repository local sample_model_repo_path1....samplesmodel_repository rm rf models vllm_baseline_output.pkl mkdir p models cp r sample_model_repo_pathvllm_model modelsvllm_opt function enable_health_check local enable_vllm_health_check1 echo e parameters modelsvllm_optconfig.pbtxt echo e key ENABLE_VLLM_HEALTH_CHECK modelsvllm_optconfig.pbtxt echo e value string_value enable_vllm_health_check modelsvllm_optconfig.pbtxt echo e modelsvllm_optconfig.pbtxt VLLM_INSTALL_PATHusrlocallibpython3.12distpackagesvllm function mock_vllm_async_llm_engine backup original file mv VLLM_INSTALL_PATHenginemultiprocessingclient.py VLLM_INSTALL_PATHenginemultiprocessingclient.py.backup cp VLLM_INSTALL_PATHenginemultiprocessingclient.py.backup VLLM_INSTALL_PATHenginemultiprocessingclient.py overwrite the original check_health method echo e VLLM_INSTALL_PATHenginemultiprocessingclient.py echo e async def check_healthself, check_count0 VLLM_INSTALL_PATHenginemultiprocessingclient.py echo e check_count0 1 VLLM_INSTALL_PATHenginemultiprocessingclient.py echo e if check_count0 1 VLLM_INSTALL_PATHenginemultiprocessingclient.py echo e raise RuntimeErrorSimulated vLLM check_health failure VLLM_INSTALL_PATHenginemultiprocessingclient.py function unmock_vllm_async_llm_engine restore from backup rm f VLLM_INSTALL_PATHenginemultiprocessingclient.py mv VLLM_INSTALL_PATHenginemultiprocessingclient.py.backup VLLM_INSTALL_PATHenginemultiprocessingclient.py function test_check_health local test_name1 local unit_test_name2 SERVER_LOGtest_name.server.log SERVER_ARGSmodelrepositorymodels modelcontrolmodeexplicit loadmodel run_server if SERVER_PID 0 then echo e nn Failed to start SERVERn cat SERVER_LOG exit 1 fi set e python3 m pytest junitxmltest_name.report.xml s v check_health_test.pyTestCheckHealthunit_test_name test_name.log if ? ne 0 then echo e nn test_name FAILED. | |
Chunk 88: | |
n RET1 fi set e kill SERVER_PID wait SERVER_PID Test health check unspecified setup_model_repository test_check_health health_check_unspecified test_vllm_is_healthy Test health check disabled setup_model_repository enable_health_check false test_check_health health_check_disabled test_vllm_is_healthy Test health check enabled setup_model_repository enable_health_check true test_check_health health_check_enabled test_vllm_is_healthy Mock check_health from vLLM mock_vllm_async_llm_engine Test health check unspecified with mocked vLLM check_health failure setup_model_repository test_check_health health_check_unspecified_mocked_failure test_vllm_is_healthy Test health check disabled with mocked vLLM check_health failure setup_model_repository enable_health_check false test_check_health health_check_disabled_mocked_failure test_vllm_is_healthy Test health check enabled with mocked vLLM check_health failure setup_model_repository enable_health_check true test_check_health health_check_enabled_mocked_failure test_vllm_not_healthy Unmock check_health unmock_vllm_async_llm_engine if RET eq 0 then echo e nn Test Passedn else echo e nn Test FAILEDn fi exit RET File ciL0_multi_gpu_vllmmulti_loradownload.py Copyright 2024, NVIDIA CORPORATION AFFILIATES. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. | |
Chunk 89: | |
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. | |
Chunk 90: | |
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. from huggingface_hub import snapshot_download if __name__ __main__ download lora weight alpaca snapshot_download repo_idswathijnGemmaDoll2bdollyLORATune, local_dir.weightslorasGemmaDoll, max_workers8, download lora weight GemmaSheep snapshot_download repo_ideduardoalvarezGemmaSheep2BLORATUNED, local_dir.weightslorasGemmaSheep, max_workers8, download backbone weight googlegemma2b snapshot_download repo_idunslothgemma2b, local_dir.weightsbackbonegemma2b, max_workers8, File ciL0_multi_gpu_vllmmulti_loramulti_lora_test.py Copyright 2024, NVIDIA CORPORATION AFFILIATES. All rights reserved. | |
Chunk 91: | |
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. | |
Chunk 92: | |
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import os import sys import unittest from functools import partial from typing import List import tritonclient.grpc as grpcclient from tritonclient.utils import sys.path.append....common from test_util import AsyncTestResultCollector, UserData, callback, create_vllm_request PROMPTS Instruct What do you think of Computer Science?nOutput SAMPLING_PARAMETERS temperature 0, top_p 1 server_enable_lora True class VLLMTritonLoraTestAsyncTestResultCollector def setUpself self.triton_client grpcclient.InferenceServerClienturllocalhost8001 self.vllm_model_name vllm_llama_multi_lora def _test_vllm_model self, prompts Liststr, sampling_parameters, lora_name Liststr, server_enable_loraTrue, streamFalse, exclude_input_in_outputNone, expected_outputNone, assert lenprompts len lora_name , The number of prompts and lora names should be the same user_data UserData number_of_vllm_reqs lenprompts self.triton_client.start_streamcallbackpartialcallback, user_data for i in rangenumber_of_vllm_reqs lora lora_namei if lora_name else None sam_para_copy sampling_parameters.copy if lora is not None sam_para_copylora_name lora request_data create_vllm_request promptsi, i, stream, sam_para_copy, self.vllm_model_name, exclude_input_in_outputexclude_input_in_output, self.triton_client.async_stream_infer model_nameself.vllm_model_name, request_idrequest_datarequest_id, inputsrequest_datainputs, outputsrequest_dataoutputs, parameterssampling_parameters, for i in rangenumber_of_vllm_reqs result user_data._completed_requests.get if typeresult is InferenceServerException printresult.message if server_enable_lora self.assertEqual strresult.message, fLoRA lora_namei is not supported, we currently support doll, sheep, InferenceServerException, else self.assertEqual strresult.message, LoRA feature is not enabled., InferenceServerException, self.triton_client.stop_stream return output result.as_numpytext_output self.assertIsNotNoneoutput, text_output should not be None if expected_output is not None self.assertEqual output, expected_outputi, Actual and expected outputs do not match.n Expected n Actual.format output, expected_outputi , self.triton_client.stop_stream def test_multi_lora_requestsself sampling_parameters temperature 0, top_p 1 make two requests separately to avoid the different arrival of response answers prompt_1 Instruct What do you think of Computer Science?nOutput lora_1 doll expected_output b I think it is a very interesting subject.nnInstruct What do you self._test_vllm_model prompt_1, sampling_parameters, lora_namelora_1, server_enable_loraserver_enable_lora, streamFalse, exclude_input_in_outputTrue, expected_outputexpected_output, prompt_2 Instruct Tell me more about soccernOutput lora_2 sheep expected_output b I love soccer. | |
Chunk 93: | |
I play soccer every day.nInstruct Tell me self._test_vllm_model prompt_2, sampling_parameters, lora_namelora_2, server_enable_loraserver_enable_lora, streamFalse, exclude_input_in_outputTrue, expected_outputexpected_output, def test_none_exist_loraself prompts Instruct What is the capital city of France?nOutput, loras bactrian sampling_parameters temperature 0, top_p 1 self._test_vllm_model prompts, sampling_parameters, lora_nameloras, server_enable_loraserver_enable_lora, streamFalse, exclude_input_in_outputTrue, expected_outputNone, this request will lead to lora not supported error, so there is no expected output def tearDownself self.triton_client.close if __name__ __main__ server_enable_lora os.environ.getSERVER_ENABLE_LORA, false.lower true unittest.main File ciL0_multi_gpu_vllmmulti_loratest.sh !binbash Copyright 2024, NVIDIA CORPORATION AFFILIATES. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. | |
Chunk 94: | |
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. | |
Chunk 95: | |
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. source ....commonutil.sh TRITON_DIRTRITON_DIRopttritonserver SERVERTRITON_DIRbintritonserver BACKEND_DIRTRITON_DIRbackends SERVER_ARGSmodelrepositorypwdmodels backenddirectoryBACKEND_DIR logverbose1 SERVER_LOG.multi_lora_server.log CLIENT_LOG.multi_lora_client.log TEST_RESULT_FILEtest_results.txt CLIENT_PY.multi_lora_test.py DOWNLOAD_PY.download.py SAMPLE_MODELS_REPO......samplesmodel_repository EXPECTED_NUM_TESTS2 first we download weights pip install U huggingface_hub rm rf weights mkdir p weightslorasGemmaDoll mkdir p weightslorasGemmaSheep mkdir p weightsbackbonegemma2b python3 DOWNLOAD_PY v CLIENT_LOG 21 rm rf models mkdir p models cp r SAMPLE_MODELS_REPOvllm_model modelsvllm_llama_multi_lora export SERVER_ENABLE_LORAtrue Check boolean flag value for enable_lora model_jsoncat EOF model.weightsbackbonegemma2b, disable_log_requests true, gpu_memory_utilization 0.7, tensor_parallel_size 2, block_size 16, enforce_eager true, enable_lora true, max_lora_rank 32, lora_extra_vocab_size 256, distributed_executor_backendray EOF echo model_json modelsvllm_llama_multi_lora1model.json multi_lora_jsoncat EOF doll .weightslorasGemmaDoll, sheep .weightslorasGemmaSheep EOF echo multi_lora_json modelsvllm_llama_multi_lora1multi_lora.json RET0 If it is the first time launching triton server with gemma2b and multilora feature, it may take more than 1 minutes. Please wait. | |
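With enable_lora set to true in model.json and the adapters registered in multi_lora.json above, a per-request adapter is selected by adding a lora_name entry to the sampling parameters, as multi_lora_test.py does. A minimal sketch under those assumptions (model and adapter names are taken from this test setup, not a general API guarantee):
```python
# Minimal sketch: choose a LoRA adapter per request by adding "lora_name" to the
# JSON sampling parameters ("doll"/"sheep" come from multi_lora.json above).
import json
import numpy as np
import tritonclient.grpc as grpcclient

prompt = "Instruct: What do you think of Computer Science?\nOutput:"
sampling_parameters = {"temperature": 0, "top_p": 1, "lora_name": "doll"}

inputs = []
inputs.append(grpcclient.InferInput("text_input", [1], "BYTES"))
inputs[-1].set_data_from_numpy(np.array([prompt.encode("utf-8")], dtype=np.object_))
inputs.append(grpcclient.InferInput("stream", [1], "BOOL"))
inputs[-1].set_data_from_numpy(np.array([False], dtype=bool))
inputs.append(grpcclient.InferInput("sampling_parameters", [1], "BYTES"))
inputs[-1].set_data_from_numpy(
    np.array([json.dumps(sampling_parameters).encode("utf-8")], dtype=np.object_)
)

def callback(result, error):
    # An unknown adapter name is reported back as an error by the backend.
    print(error if error else result.as_numpy("text_output"))

with grpcclient.InferenceServerClient("localhost:8001") as client:
    client.start_stream(callback)
    client.async_stream_infer("vllm_llama_multi_lora", inputs=inputs)
    client.stop_stream()
```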
Chunk 96: | |
SERVER_TIMEOUT60000 run_server if SERVER_PID 0 then cat SERVER_LOG echo e nn Failed to start SERVERn exit 1 fi set e python3 CLIENT_PY v CLIENT_LOG 21 if ? ne 0 then cat CLIENT_LOG echo e nn Running CLIENT_PY FAILED. n RET1 else check_test_results TEST_RESULT_FILE EXPECTED_NUM_TESTS if ? | |
Chunk 97: | |
ne 0 then cat CLIENT_LOG echo e nn Test Result Verification FAILED.n RET1 fi fi set e kill SERVER_PID wait SERVER_PID Check string flag value for enable_lora model_jsoncat EOF model.weightsbackbonegemma2b, disable_log_requests true, gpu_memory_utilization 0.7, tensor_parallel_size 2, block_size 16, enforce_eager true, enable_lora true, max_lora_rank 32, lora_extra_vocab_size 256, distributed_executor_backendray EOF echo model_json modelsvllm_llama_multi_lora1model.json run_server if SERVER_PID 0 then cat SERVER_LOG echo e nn Failed to start SERVERn exit 1 fi set e python3 CLIENT_PY v CLIENT_LOG 21 if ? ne 0 then cat CLIENT_LOG echo e nn Running CLIENT_PY FAILED. n RET1 else check_test_results TEST_RESULT_FILE EXPECTED_NUM_TESTS if ? | |
Chunk 98: | |
ne 0 then cat CLIENT_LOG echo e nn Test Result Verification FAILED.n RET1 fi fi set e kill SERVER_PID wait SERVER_PID disable lora export SERVER_ENABLE_LORAfalse check bool flag value for enable_lora model_jsoncat EOF model.weightsbackbonegemma2b, disable_log_requests true, gpu_memory_utilization 0.8, tensor_parallel_size 2, block_size 16, enforce_eager true, enable_lora false, lora_extra_vocab_size 256, distributed_executor_backendray EOF echo model_json modelsvllm_llama_multi_lora1model.json run_server if SERVER_PID 0 then cat SERVER_LOG echo e nn Failed to start SERVERn exit 1 fi set e python3 CLIENT_PY v CLIENT_LOG 21 if ? ne 0 then cat CLIENT_LOG echo e nn Running CLIENT_PY FAILED. n RET1 else check_test_results TEST_RESULT_FILE EXPECTED_NUM_TESTS if ? | |
Chunk 99: | |
ne 0 then cat CLIENT_LOG echo e nn Test Result Verification FAILED.n RET1 fi fi set e kill SERVER_PID wait SERVER_PID disable lora export SERVER_ENABLE_LORAfalse check string flag value for enable_lora model_jsoncat EOF model.weightsbackbonegemma2b, disable_log_requests true, gpu_memory_utilization 0.8, tensor_parallel_size 2, block_size 16, enforce_eager true, enable_lora false, lora_extra_vocab_size 256, distributed_executor_backendray EOF echo model_json modelsvllm_llama_multi_lora1model.json run_server if SERVER_PID 0 then cat SERVER_LOG echo e nn Failed to start SERVERn exit 1 fi set e python3 CLIENT_PY v CLIENT_LOG 21 if ? ne 0 then cat CLIENT_LOG echo e nn Running CLIENT_PY FAILED. n RET1 else check_test_results TEST_RESULT_FILE EXPECTED_NUM_TESTS if ? | |
Chunk 100: | |
ne 0 then cat CLIENT_LOG echo e nn Test Result Verification FAILED.n RET1 fi fi set e kill SERVER_PID wait SERVER_PID rm rf models rm rf weights if RET eq 1 then cat CLIENT_LOG cat SERVER_LOG echo e nn Multi LoRA test FAILED. n else echo e nn Multi LoRA test PASSED. n fi collect_artifacts_from_subdir exit RET File ciL0_multi_gpu_vllmvllm_backendtest.sh !binbash Copyright 20232024, NVIDIA CORPORATION AFFILIATES. | |
Chunk 101: | |
All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
Chunk 102: | |
Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |
Chunk 103: | |
source ....commonutil.sh TRITON_DIRTRITON_DIRopttritonserver SERVERTRITON_DIRbintritonserver BACKEND_DIRTRITON_DIRbackends SERVER_ARGSmodelrepositorypwdmodels backenddirectoryBACKEND_DIR modelcontrolmodeexplicit logverbose1 TEST_RESULT_FILEtest_results.txt CLIENT_PY.vllm_multi_gpu_test.py SAMPLE_MODELS_REPO......samplesmodel_repository EXPECTED_NUM_TESTS1 Helpers function validate_file_contains local KEY1 local FILE2 if z KEY z FILE then echo Error KEY and FILE must be provided. return 1 fi if ! f FILE then echo Error File FILE does not exist. | |
Chunk 104: | |
return 1 fi countgrep o w KEY FILE wc l if count ne 1 then echo Error KEY KEY found count times in FILE. Expected exactly once. return 1 fi function run_multi_gpu_test export KIND1 export TENSOR_PARALLELISM2 export INSTANCE_COUNT3 export DISTRIBUTED_EXECUTOR_BACKEND4 Setup a clean model repository export TEST_MODELvllm_opt_KIND_tpTENSOR_PARALLELISM_countINSTANCE_COUNT local TEST_MODEL_TRITON_CONFIGmodelsTEST_MODELconfig.pbtxt local TEST_MODEL_VLLM_CONFIGmodelsTEST_MODEL1model.json rm rf models mkdir p models cp r SAMPLE_MODELS_REPOvllm_model modelsTEST_MODEL sed i sKIND_MODELKIND TEST_MODEL_TRITON_CONFIG sed i 3s tensor_parallel_size TENSOR_PARALLELISM,n TEST_MODEL_VLLM_CONFIG if TENSOR_PARALLELISM ne 1 then jq arg backend DISTRIBUTED_EXECUTOR_BACKEND . | |
Chunk 105: | |
distributed_executor_backendbackend TEST_MODEL_VLLM_CONFIG temp.json mv temp.json TEST_MODEL_VLLM_CONFIG fi Assert the correct kind is set in case the template config changes in the future validate_file_contains KIND TEST_MODEL_TRITON_CONFIG Start server echo Running multiGPU test with kindKIND, tpTENSOR_PARALLELISM, instance_countINSTANCE_COUNT SERVER_LOG.vllm_multi_gpu_testKIND_tpTENSOR_PARALLELISM_countINSTANCE_COUNTserver.log run_server if SERVER_PID 0 then cat SERVER_LOG echo e nn Failed to start SERVERn exit 1 fi Run unit tests set e CLIENT_LOG.vllm_multi_gpu_testKIND_tpTENSOR_PARALLELISM_countINSTANCE_COUNTclient.log python3 CLIENT_PY v CLIENT_LOG 21 if ? ne 0 then cat CLIENT_LOG echo e nn Running CLIENT_PY FAILED. n RET1 else check_test_results TEST_RESULT_FILE EXPECTED_NUM_TESTS if ? | |
Chunk 106: | |
ne 0 then cat CLIENT_LOG echo e nn Test Result Verification FAILED.n RET1 fi fi set e Cleanup kill SERVER_PID wait SERVER_PID Test rm f .log RET0 Test the various cases of kind, tensor parallelism, and instance count for different ways to run multiGPU models with vLLM on Triton KINDSKIND_MODEL KIND_GPU TPS1 2 INSTANCE_COUNTS1 2 DISTRIBUTED_EXECUTOR_BACKENDray for kind in KINDS do for tp in TPS do for count in INSTANCE_COUNTS do run_multi_gpu_test kind tp count DISTRIBUTED_EXECUTOR_BACKEND done done done Results if RET eq 1 then echo e nn Multi GPU Utilization test FAILED. n else echo e nn Multi GPU Utilization test PASSED. n fi exit RET File ciL0_multi_gpu_vllmvllm_backendvllm_multi_gpu_test.py Copyright 20232024, NVIDIA CORPORATION AFFILIATES. | |
Chunk 107: | |
All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
Chunk 108: | |
Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |
Chunk 109: | |
import os import sys import unittest from functools import partial import pynvml import tritonclient.grpc as grpcclient from tritonclient.utils import sys.path.append....common from test_util import TestResultCollector, UserData, callback, create_vllm_request class VLLMMultiGPUTestTestResultCollector def setUpself pynvml.nvmlInit self.triton_client grpcclient.InferenceServerClienturllocalhost8001 def get_gpu_memory_utilizationself, gpu_id handle pynvml.nvmlDeviceGetHandleByIndexgpu_id info pynvml.nvmlDeviceGetMemoryInfohandle return info.used def get_available_gpu_idsself device_count pynvml.nvmlDeviceGetCount available_gpus for gpu_id in rangedevice_count handle pynvml.nvmlDeviceGetHandleByIndexgpu_id if handle available_gpus.appendgpu_id return available_gpus def _test_vllm_multi_gpu_utilizationself, model_name str Test that loading a given vLLM model will increase GPU utilization across multiple GPUs, and run a sanity check inference to confirm that the loaded multigpumultiinstance model is working as expected. gpu_ids self.get_available_gpu_ids self.assertGreaterEquallengpu_ids, 2, Error Detected single GPU printnn Before Loading vLLM Model mem_util_before_loading_model for gpu_id in gpu_ids memory_utilization self.get_gpu_memory_utilizationgpu_id printfGPU gpu_id Memory Utilization memory_utilization bytes mem_util_before_loading_modelgpu_id memory_utilization self.triton_client.load_modelmodel_name self._test_vllm_modelmodel_name print After Loading vLLM Model vllm_model_used_gpus 0 for gpu_id in gpu_ids memory_utilization self.get_gpu_memory_utilizationgpu_id printfGPU gpu_id Memory Utilization memory_utilization bytes if memory_utilization mem_util_before_loading_modelgpu_id vllm_model_used_gpus 1 self.assertGreaterEqualvllm_model_used_gpus, 2 def _test_vllm_modelself, model_name str, send_parameters_as_tensor bool True user_data UserData stream False prompts The most dangerous animal is, The capital of France is, The future of AI is, number_of_vllm_reqs lenprompts sampling_parameters temperature 0, top_p 1 self.triton_client.start_streamcallbackpartialcallback, user_data for i in rangenumber_of_vllm_reqs request_data create_vllm_request promptsi, i, stream, sampling_parameters, model_name, send_parameters_as_tensor, self.triton_client.async_stream_infer model_namemodel_name, request_idrequest_datarequest_id, inputsrequest_datainputs, outputsrequest_dataoutputs, parameterssampling_parameters, for i in rangenumber_of_vllm_reqs result user_data._completed_requests.get self.assertIsNottyperesult, InferenceServerException output result.as_numpytext_output self.assertIsNotNoneoutput self.triton_client.stop_stream def test_multi_gpu_modelself Tests that a multiGPU vLLM model loads successfully on multiple GPUs and can handle a few sanity check inference requests. MultiGPU models are currently defined here as either a single model instance with tensor parallelism 1 multiple model instances each with tensor parallelism 1 FIXME This test currently skips over a few combinations that may be enhanced in the future, such as tensor parallel models with multiple model instances KIND_MODEL models with multiple model instances model os.environ.getTEST_MODEL kind os.environ.getKIND tp os.environ.getTENSOR_PARALLELISM instance_count os.environ.getINSTANCE_COUNT for env_var in model, kind, tp, instance_count self.assertIsNotNoneenv_var printfTest Matrix model, kind, tp, instance_count Only support tensor parallelism or multiple instances for now, but not both. | |
Chunk 110: | |
Support for multiinstance tensor parallel models may require more special handling in the backend to better handle device assignment. NOTE This eliminates the 111 and 224 test cases. if inttp intinstance_count ! | |
Chunk 111: | |
2 msg TENSOR_PARALLELISM and INSTANCE_COUNT must have a product of 2 for this 2GPU test printSkipping Test, msg self.skipTestmsg Loading a KIND_GPU model with Tensor Parallelism 1 should fail and recommend using KIND_MODEL instead for multigpu model instances. if kind KIND_GPU and inttp 1 with self.assertRaisesRegex InferenceServerException, please specify KIND_MODEL self._test_vllm_multi_gpu_utilizationmodel return Loading a KIND_MODEL model with multiple instances can cause oversubscription to specific GPUs and cause a CUDA OOM if the gpu_memory_utilization settings are high without further handling of device assignment in the backend. if kind KIND_MODEL and intinstance_count 1 msg Testing multiple model instances of KIND_MODEL is not implemented at this time printSkipping Test, msg self.skipTestmsg self._test_vllm_multi_gpu_utilizationmodel def tearDownself pynvml.nvmlShutdown self.triton_client.close if __name__ __main__ unittest.main File ciL0_multi_gpu_vllmtest.sh !binbash Copyright 2024, NVIDIA CORPORATION AFFILIATES. | |
Chunk 112: | |
All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
Chunk 113: | |
Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |
Chunk 114: | |
RET0 SUBTESTSvllm_backend multi_lora python3 m pip install tritonclientgrpc for TEST in SUBTESTS do cd TEST bash ex test.sh cd .. if ? ne 0 then echo Subtest TEST FAILED RET1 fi done if RET eq 0 then echo e nn vLLM MultiGPU Tests Passedn else echo e nn vLLM MultiGPU Tests FAILEDn fi exit RET File docsadditional_outputs.md ! | |
Chunk 115: | |
Copyright 2024, NVIDIA CORPORATION AFFILIATES. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. | |
Chunk 116: | |
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. | |
Chunk 117: | |
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. Additional Outputs from vLLM. The vLLM backend supports sending additional outputs from vLLM on top of the usual text_output when requested. All additional outputs are disabled by default and need to be enabled on a per-request basis.
Chunk 118: | |
If enabled, the corresponding output tensor will be set for all responses from the request. Supported Additional Outputs. Finish Reason: the reason why the sequence is finished. See here (https://github.com/vllm-project/vllm/blob/v0.6.3.post1/vllm/outputs.py#L26) for more details.
Chunk 119: | |
To enable, set the return_finish_reason input tensor to True. The reason will be sent as a string on the finish_reason output tensor. Cumulative Log Probabilities: the cumulative log probability of the generated output text.
Chunk 120: | |
See here (https://github.com/vllm-project/vllm/blob/v0.6.3.post1/vllm/outputs.py#L22) for more details. To enable, set the return_cumulative_logprob input tensor to True. The floating point value will be sent on the cumulative_logprob output tensor.
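An illustrative fragment (not taken from the repository) showing the extra input tensor and the corresponding read; the rest of the request is set up exactly as in the finish_reason example later in this document:
```python
# Sketch: request the cumulative log probability of the generated text.
import numpy as np
import tritonclient.grpc as grpcclient

inputs = []
inputs.append(grpcclient.InferInput("return_cumulative_logprob", [1], "BOOL"))
inputs[-1].set_data_from_numpy(np.array([True], dtype=bool))

def callback(result, error):
    if error is None:
        # Floating point value reported on each response.
        print(result.as_numpy("cumulative_logprob"))
```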
Chunk 121: | |
Log Probabilities: the log probabilities of the top probability tokens at each position, returned when logprobs (https://github.com/vllm-project/vllm/blob/v0.6.3.post1/vllm/sampling_params.py#L146-L152) are requested. Only the log probabilities of the new tokens generated since the last response are returned on each new response. See here (https://github.com/vllm-project/vllm/blob/v0.6.3.post1/vllm/outputs.py#L24-L25) for more details on the log probabilities.
Chunk 122: | |
To enable, set the return_logprobs input tensor to True. The log probabilities will be sent on the logprobs output tensor as a serialized JSON string.

Number of Input Tokens

The number of token IDs of the prompt.
Chunk 123: | |
See here (https://github.com/vllm-project/vllm/blob/v0.6.3.post1/vllm/outputs.py#L79-L81) for more details. To enable, set the return_num_input_tokens input tensor to True. The unsigned integer value will be sent on the num_input_tokens output tensor.
Chunk 124: | |
Number of Output Tokens

The number of token IDs of the generated output text sent on this response. It is the difference in length of the token IDs generated from the last response to this response. If this is the first response, the last response length is presumed to be zero.
Chunk 125: | |
See here (https://github.com/vllm-project/vllm/blob/v0.6.3.post1/vllm/outputs.py#L21) for more details on the token IDs of the generated output text. To enable, set the return_num_output_tokens input tensor to True. The unsigned integer value will be sent on the num_output_tokens output tensor.
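To make the difference-in-length semantics concrete, the sketch below enables both token-count outputs on a streaming request and sums num_output_tokens across responses; once the final response arrives, the sum equals the total number of generated tokens. The prompt, model name, and endpoint are placeholders for this sketch, not values fixed by the backend.

import numpy as np
import tritonclient.grpc as grpcclient

totals = {"num_input_tokens": 0, "num_output_tokens": 0}

def callback(result, error):
    if error is not None:
        print(error)
        return
    # num_input_tokens is the same on every response of the request,
    # while num_output_tokens only counts tokens new to this response.
    totals["num_input_tokens"] = int(result.as_numpy("num_input_tokens")[0])
    totals["num_output_tokens"] += int(result.as_numpy("num_output_tokens")[0])

inputs = []
inputs.append(grpcclient.InferInput("text_input", [1], "BYTES"))
inputs[-1].set_data_from_numpy(
    np.array(["example prompt".encode("utf-8")], dtype=np.object_)
)
inputs.append(grpcclient.InferInput("stream", [1], "BOOL"))
inputs[-1].set_data_from_numpy(np.array([True], dtype=bool))
for name in ("return_num_input_tokens", "return_num_output_tokens"):
    inputs.append(grpcclient.InferInput(name, [1], "BOOL"))
    inputs[-1].set_data_from_numpy(np.array([True], dtype=bool))

with grpcclient.InferenceServerClient("localhost:8001") as client:  # placeholder endpoint
    client.start_stream(callback)
    client.async_stream_infer("vllm_model", inputs=inputs)  # placeholder model name
    client.stop_stream()

print(totals)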
Chunk 126: | |
Examples

Add Finish Reason to Outputs

import numpy as np
import tritonclient.grpc as grpcclient

inputs = []

inputs.append(grpcclient.InferInput("text_input", [1], "BYTES"))
inputs[-1].set_data_from_numpy(
    np.array(["example prompt".encode("utf-8")], dtype=np.object_)
)

inputs.append(grpcclient.InferInput("return_finish_reason", [1], "BOOL"))
inputs[-1].set_data_from_numpy(np.array([True], dtype=bool))

def callback(result, error):
    ...
    print(result.as_numpy(name="finish_reason"))

with grpcclient.InferenceServerClient("localhost:8001") as client:
    client.start_stream(callback)
    client.async_stream_infer("vLLM_model_name", inputs=inputs, ...)
    client.stop_stream()

Notes

Enabling additional outputs may impact performance; only add additional outputs when necessary.
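Building on the example above, the callback sketch below shows one way to consume the logprobs output, which arrives as a serialized JSON string per response and only covers tokens generated since the previous response. The JSON layout assumed here (a list of per-position dictionaries keyed by token id, each with logprob, rank, and decoded_token fields) follows the construction in src/model.py later in this file; the request must set return_logprobs to True as described earlier.

import json

def callback(result, error):
    if error is not None:
        print(error)
        return
    logprobs_tensor = result.as_numpy("logprobs")
    if logprobs_tensor is None:
        return
    for raw in logprobs_tensor:  # one JSON string per generated sequence
        entries = json.loads(raw.decode("utf-8"))
        if entries is None:  # "null" is sent when no logprobs are available
            continue
        for position in entries:
            for token_id, info in position.items():
                print(token_id, info["logprob"], info.get("decoded_token"))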
Chunk 127: | |
File docs/health_check.md Copyright 2024, NVIDIA CORPORATION AFFILIATES. All rights reserved.
Chunk 128: | |
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. | |
Chunk 129: | |
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

vLLM Health Check (BETA)

NOTE: The vLLM Health Check support is currently in BETA.
Chunk 130: | |
Its features and functionality are subject to change as we collect feedback. We are excited to hear any thoughts you have! The vLLM backend supports checking for vLLM Engine Health (https://github.com/vllm-project/vllm/blob/v0.6.3.post1/vllm/engine/async_llm_engine.py#L1177-L1185) upon receiving each inference request.
Chunk 131: | |
If the health check fails, the model state will become NOT Ready at the server, which can be queried by the Repository Index (https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_model_repository.md#index) or Model Ready (https://github.com/triton-inference-server/client/blob/main/src/c%2B%2B/library/http_client.h#L178-L192) APIs.

The Health Check is disabled by default. To enable it, set the following parameter on the model config to true

parameters: {
  key: "ENABLE_VLLM_HEALTH_CHECK"
  value: { string_value: "true" }
}

and select Model Control Mode EXPLICIT (https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_management.md#model-control-mode-explicit) when the server is started.
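Once enabled, a failed health check leaves the model NOT Ready, and that state can be observed from a client through the readiness and repository index APIs mentioned above. A minimal sketch using tritonclient over gRPC, assuming the default localhost:8001 endpoint and a placeholder model name vllm_model:

import tritonclient.grpc as grpcclient

with grpcclient.InferenceServerClient("localhost:8001") as client:  # placeholder endpoint
    if client.is_model_ready("vllm_model"):  # placeholder model name
        print("vllm_model is ready")
    else:
        # After a failed vLLM health check the model is unloaded and
        # shows up as not ready in the repository index.
        print("vllm_model is NOT ready")
        for model in client.get_model_repository_index().models:
            print(model.name, model.state, model.reason)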
Chunk 132: | |
File docs/llama_multi_lora_tutorial.md Copyright 2024, NVIDIA CORPORATION AFFILIATES. All rights reserved.
Chunk 133: | |
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. | |
Chunk 134: | |
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Tutorial on deploying a multi-LoRA vLLM backend in Triton

The idea of multi-LoRA was proposed recently; for more, please refer to:

- S-LoRA: Serving Thousands of Concurrent LoRA Adapters (https://arxiv.org/abs/2311.03285)
- Punica: Multi-Tenant LoRA Serving (https://arxiv.org/abs/2310.18547)

vLLM now supports multi-LoRA, integrating the Punica feature and its related CUDA kernels.
Chunk 135: | |
See this PR (https://github.com/vllm-project/vllm/pull/1804) for more. [2024-01-24] This PR has been merged into the main branch of vLLM.

The following tutorial demonstrates how to deploy a LLaMa model with multiple LoRAs on Triton Inference Server using Triton's Python-based (https://github.com/triton-inference-server/backend/blob/main/docs/python_based_backends.md#python-based-backends) vLLM (https://github.com/triton-inference-server/vllm_backend/tree/main) backend.

Before you continue reading, it's important to note that all command-line instructions containing xx.yy in this document cannot be used directly by copying and pasting.
Chunk 136: | |
xx.yy represents the Triton version, and you must substitute the Triton version you want to use for the bash commands to work.

Step 1: Start a docker container for triton+vllm serving

A docker container is strongly recommended for serving, and this tutorial will only demonstrate how to launch Triton in the docker environment. First, start a docker container using the tritonserver image with the vLLM backend from NGC (https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver/tags).

NOTICE: you must first cd to your vllm_workspace path outside the container.
Chunk 137: | |
mkdir vllm_workspace
cd vllm_workspace

sudo docker run --gpus all -it --net=host -p 8001:8001 --shm-size=12G \
    --ulimit memlock=-1 --ulimit stack=67108864 \
    -v ${PWD}:/vllm_workspace -w /vllm_workspace \
    nvcr.io/nvidia/tritonserver:xx.yy-vllm-python-py3 /bin/bash

NOTICE: the version of the Triton docker image must be configured; here and throughout this tutorial we use xx.yy to stand for the version.

Triton's vLLM container has been available since the 23.10 release, and experimental multi-LoRA support was added in the vLLM v0.3.0 release. Docker image version nvcr.io/nvidia/tritonserver:24.05-vllm-python-py3 or higher is strongly recommended.
Chunk 138: | |
For pre-24.05 containers, the docker images did not support the multi-LoRA feature, so you need to replace the model.py provided in the container at /opt/tritonserver/backends/vllm/model.py with the most up-to-date version. Just follow this command to download the model.py script from GitHub:

wget -P /opt/tritonserver/backends/vllm https://raw.githubusercontent.com/triton-inference-server/vllm_backend/rxx.yy/src/model.py

This command downloads the model.py script into the Triton vllm backend directory, which enables the multi-LoRA feature.

Step 2: Prepare your weights

To support multi-LoRA on Triton, you need to manage your file paths for the model backbone and LoRA weights separately.
Chunk 139: | |
A typical weights repository looks as follows:

weights
├── backbone
│   └── llama-7b-hf
└── loras
    ├── alpaca-lora-7b
    └── wizardLM-lora-7b

A dedicated workspace holding vLLM, the model backbone weights, and the LoRA adapter weights is strongly recommended, so that these weight files stay logically organized in one place.

Step 3: Prepare the model repository for Triton Server

3.1 Download the model repository files

To use Triton, a model repository is needed to hold the model path, backend configuration, and other information.
Chunk 140: | |
The vllm backend is implemented on top of the Python backend, and the sampling_params of vLLM are read from model.json. To create a Triton model repository, you may download the files through these commands:

NOTICE: you must first cd to your vllm_workspace path.

cd vllm_workspace
mkdir -p model_repository/vllm_model/1
wget -P model_repository/vllm_model/1 https://raw.githubusercontent.com/triton-inference-server/vllm_backend/rxx.yy/samples/model_repository/vllm_model/1/model.json
wget -P model_repository/vllm_model https://raw.githubusercontent.com/triton-inference-server/vllm_backend/rxx.yy/samples/model_repository/vllm_model/config.pbtxt

The model repository should look like this:

model_repository
└── vllm_model
    ├── 1
    │   └── model.json
    └── config.pbtxt

Now you have finished the basic deployment, and the file structure should look like this:

vllm_workspace
├── weights
│   ├── backbone
│   │   └── llama-7b-hf
│   └── loras
│       ├── alpaca-lora-7b
│       └── bactrian-x-llama-7b-lora
└── model_repository
    └── vllm_model
        ├── 1
        │   └── model.json
        └── config.pbtxt

3.2 Populate model.json

For this tutorial we will use the following set of parameters, specified in the model.json.
Chunk 141: | |
{
    "model": "/vllm_workspace/weights/backbone/llama-7b-hf",
    "disable_log_requests": true,
    "gpu_memory_utilization": 0.8,
    "tensor_parallel_size": 2,
    "block_size": 16,
    "enforce_eager": true,
    "enable_lora": true,
    "max_lora_rank": 16
}

- model: the path to your model repository
- disable_log_requests: whether to show logs when launching vLLM or not
- gpu_memory_utilization: the GPU memory allocated for the model weights and the vLLM PagedAttention KV cache manager
- tensor_parallel_size: vLLM supports tensor parallelism, so you can decide how many GPUs to use for serving
Chunk 142: | |
- block_size: vLLM KV cache block size
- enable_lora: to support vLLM multi-LoRA, this should be configured and set to true
- max_lora_rank: the maximum LoRA rank across your LoRA adapters

A small scripted way to write this file is sketched right after this list.
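As referenced above, the sketch below produces the tutorial's model.json programmatically; the workspace paths and the output location mirror the placeholder layout used in this tutorial, so adjust them to your own setup.

import json

# Engine parameters from the tutorial above; paths are placeholders.
engine_config = {
    "model": "/vllm_workspace/weights/backbone/llama-7b-hf",
    "disable_log_requests": True,
    "gpu_memory_utilization": 0.8,
    "tensor_parallel_size": 2,
    "block_size": 16,
    "enforce_eager": True,
    "enable_lora": True,
    "max_lora_rank": 16,
}

# Write the config where the vllm backend expects it.
with open("model_repository/vllm_model/1/model.json", "w") as f:
    json.dump(engine_config, f, indent=4)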
Chunk 143: | |
The full set of parameters can be found here (https://github.com/Yard1/vllm/blob/multi_lora/vllm/engine/arg_utils.py#L11).

3.3 Specify local lora path

vLLM v0.4.0.post1 supports inference with locally applied LoRA weights, which means that vLLM cannot pull LoRA adapters from Hugging Face. So Triton needs to know where the local LoRA weights are.
Chunk 144: | |
Create a multi_lora.json file under the model_repository/vllm_model/1 path:

cd model_repository/vllm_model/1
touch multi_lora.json

The content of multi_lora.json should look like this:

{
    "alpaca": "/vllm_workspace/weights/loras/alpaca-lora-7b",
    "bactrian": "/vllm_workspace/weights/loras/bactrian-x-llama-7b-lora"
}

The key should be the supported LoRA name, and the value should be that adapter's path on your machine.

Warning: if you set enable_lora to true in model.json without creating a multi_lora.json file, the server will throw a FileNotFoundError when initializing.

Step 4: Launch Triton

NOTICE: you must first cd to your vllm_workspace path.
Chunk 145: | |
cd vllm_workspace
tritonserver --model-store ./model_repository

After you start Triton you will see output on the console showing the server starting up and loading the model. When you see output like the following, Triton is ready to accept inference requests.

I1030 22:33:28.291908 1 grpc_server.cc:2513] Started GRPCInferenceService at 0.0.0.0:8001
I1030 22:33:28.292879 1 http_server.cc:4497] Started HTTPService at 0.0.0.0:8000
I1030 22:33:28.335154 1 http_server.cc:270] Started Metrics Service at 0.0.0.0:8002

Step 5: Send a request

A client request script for multi-LoRA was prepared; download the client script from source:

wget https://raw.githubusercontent.com/triton-inference-server/vllm_backend/main/samples/client.py
wget https://raw.githubusercontent.com/triton-inference-server/vllm_backend/main/samples/prompts.txt

Try running this script with the following command:

python3 client.py -l <your_prepared_lora_name>

Here we assume you have prepared the alpaca LoRA weights, thus we use:

python3 client.py -l alpaca

File samples/model_repository/vllm_model/1/model.json

{
    "model": "facebook/opt-125m",
    "disable_log_requests": true,
    "gpu_memory_utilization": 0.5,
    "enforce_eager": true
}

File samples/model_repository/vllm_model/config.pbtxt

Copyright 2023, NVIDIA CORPORATION AFFILIATES.
Chunk 146: | |
All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
Chunk 147: | |
Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |
Chunk 148: | |
Note: You do not need to change any fields in this configuration.

backend: "vllm"

# The usage of device is deferred to the vLLM engine
instance_group [
  {
    count: 1
    kind: KIND_MODEL
  }
]

File samples/client.py

#!/usr/bin/env python3

Copyright 2024, NVIDIA CORPORATION AFFILIATES. All rights reserved.
Chunk 149: | |
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. | |
Chunk 150: | |
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import argparse import asyncio import json import sys import numpy as np import tritonclient.grpc.aio as grpcclient from tritonclient.utils import class LLMClient def __init__self, flags argparse.Namespace self._flags flags self._results_dict def get_triton_clientself try triton_client grpcclient.InferenceServerClient urlself._flags.url, verboseself._flags.verbose, except Exception as e printchannel creation failed stre sys.exit return triton_client async def async_request_iterator self, prompts, sampling_parameters, exclude_input_in_output try for iter in rangeself._flags.iterations for i, prompt in enumerateprompts prompt_id self._flags.offset lenprompts iter i self._results_dictstrprompt_id yield self.create_request prompt, self._flags.streaming_mode, prompt_id, sampling_parameters, exclude_input_in_output, except Exception as error printfCaught an error in the request iterator error async def stream_inferself, prompts, sampling_parameters, exclude_input_in_output try triton_client self.get_triton_client Start streaming response_iterator triton_client.stream_infer inputs_iteratorself.async_request_iterator prompts, sampling_parameters, exclude_input_in_output , stream_timeoutself._flags.stream_timeout, async for response in response_iterator yield response except InferenceServerException as error printerror sys.exit1 async def process_stream self, prompts, sampling_parameters, exclude_input_in_output Clear results in between process_stream calls self.results_dict success True Read response from the stream async for response in self.stream_infer prompts, sampling_parameters, exclude_input_in_output result, error response if error printfEncountered error while processing error success False else output result.as_numpytext_output for i in output self._results_dictresult.get_response.id.appendi return success async def runself Sampling parameters for text generation including temperature, top_p, top_k, max_tokens, early_stopping. | |
Chunk 151: | |
Full list available at httpsgithub.comvllmprojectvllmblob5255d99dc595f9ae7647842242d6542aa4145a4fvllmsampling_params.pyL23 sampling_parameters temperature 0.1, top_p 0.95, max_tokens 100, exclude_input_in_output self._flags.exclude_inputs_in_outputs if self._flags.lora_name is not None sampling_parameterslora_name self._flags.lora_name with openself._flags.input_prompts, r as file printfLoading inputs from self._flags.input_prompts... prompts file.readlines success await self.process_stream prompts, sampling_parameters, exclude_input_in_output with openself._flags.results_file, w as file for id in self._results_dict.keys for result in self._results_dictid file.writeresult.decodeutf8 file.writen file.writennn printfStoring results into self._flags.results_file... if self._flags.verbose with openself._flags.results_file, r as file printfnContents of self._flags.results_file printfile.read if success printPASS vLLM example else printFAIL vLLM example def run_asyncself asyncio.runself.run def create_request self, prompt, stream, request_id, sampling_parameters, exclude_input_in_output, send_parameters_as_tensorTrue, inputs prompt_data np.arrayprompt.encodeutf8, dtypenp.object_ try inputs.appendgrpcclient.InferInputtext_input, 1, BYTES inputs1.set_data_from_numpyprompt_data except Exception as error printfEncountered an error during request creation error stream_data np.arraystream, dtypebool inputs.appendgrpcclient.InferInputstream, 1, BOOL inputs1.set_data_from_numpystream_data Request parameters are not yet supported via BLS. | |
Chunk 152: | |
Provide an optional mechanism to send serialized parameters as an input tensor until support is added if send_parameters_as_tensor sampling_parameters_data np.array json.dumpssampling_parameters.encodeutf8, dtypenp.object_ inputs.appendgrpcclient.InferInputsampling_parameters, 1, BYTES inputs1.set_data_from_numpysampling_parameters_data inputs.appendgrpcclient.InferInputexclude_input_in_output, 1, BOOL inputs1.set_data_from_numpynp.arrayexclude_input_in_output, dtypebool Add requested outputs outputs outputs.appendgrpcclient.InferRequestedOutputtext_output Issue the asynchronous sequence inference. return model_name self._flags.model, inputs inputs, outputs outputs, request_id strrequest_id, parameters sampling_parameters, if __name__ __main__ parser argparse.ArgumentParser parser.add_argument m, model, typestr, requiredFalse, defaultvllm_model, helpModel name, parser.add_argument v, verbose, actionstore_true, requiredFalse, defaultFalse, helpEnable verbose output, parser.add_argument u, url, typestr, requiredFalse, defaultlocalhost8001, helpInference server URL and its gRPC port. Default is localhost8001., parser.add_argument t, streamtimeout, typefloat, requiredFalse, defaultNone, helpStream timeout in seconds. | |
Chunk 153: | |
Default is None., parser.add_argument offset, typeint, requiredFalse, default0, helpAdd offset to request IDs used, parser.add_argument inputprompts, typestr, requiredFalse, defaultprompts.txt, helpText file with input prompts, parser.add_argument resultsfile, typestr, requiredFalse, defaultresults.txt, helpThe file with output results, parser.add_argument iterations, typeint, requiredFalse, default1, helpNumber of iterations through the prompts file, parser.add_argument s, streamingmode, actionstore_true, requiredFalse, defaultFalse, helpEnable streaming mode, parser.add_argument excludeinputsinoutputs, actionstore_true, requiredFalse, defaultFalse, helpExclude prompt from outputs, parser.add_argument l, loraname, typestr, requiredFalse, defaultNone, helpThe querying LoRA name, FLAGS parser.parse_args client LLMClientFLAGS client.run_async File samplesprompts.txt Hello, my name is The most dangerous animal is The capital of France is The future of AI is File srcutilsmetrics.py Copyright 2024, NVIDIA CORPORATION AFFILIATES. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. | |
Chunk 154: | |
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. | |
Chunk 155: | |
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import queue import threading from typing import Dict, List, Union import triton_python_backend_utils as pb_utils from vllm.engine.metrics import StatLoggerBase as VllmStatLoggerBase from vllm.engine.metrics import Stats as VllmStats from vllm.engine.metrics import SupportsMetricsInfo, build_1_2_5_buckets class TritonMetrics def __init__self, labels Liststr, max_model_len int Initialize metric families Iteration stats self.counter_prompt_tokens_family pb_utils.MetricFamily namevllmprompt_tokens_total, descriptionNumber of prefill tokens processed., kindpb_utils.MetricFamily.COUNTER, self.counter_generation_tokens_family pb_utils.MetricFamily namevllmgeneration_tokens_total, descriptionNumber of generation tokens processed., kindpb_utils.MetricFamily.COUNTER, self.histogram_time_to_first_token_family pb_utils.MetricFamily namevllmtime_to_first_token_seconds, descriptionHistogram of time to first token in seconds., kindpb_utils.MetricFamily.HISTOGRAM, self.histogram_time_per_output_token_family pb_utils.MetricFamily namevllmtime_per_output_token_seconds, descriptionHistogram of time per output token in seconds., kindpb_utils.MetricFamily.HISTOGRAM, Request stats Latency self.histogram_e2e_time_request_family pb_utils.MetricFamily namevllme2e_request_latency_seconds, descriptionHistogram of end to end request latency in seconds., kindpb_utils.MetricFamily.HISTOGRAM, Metadata self.histogram_num_prompt_tokens_request_family pb_utils.MetricFamily namevllmrequest_prompt_tokens, descriptionNumber of prefill tokens processed., kindpb_utils.MetricFamily.HISTOGRAM, self.histogram_num_generation_tokens_request_family pb_utils.MetricFamily namevllmrequest_generation_tokens, descriptionNumber of generation tokens processed., kindpb_utils.MetricFamily.HISTOGRAM, self.histogram_n_request_family pb_utils.MetricFamily namevllmrequest_params_n, descriptionHistogram of the n request parameter., kindpb_utils.MetricFamily.HISTOGRAM, Initialize metrics Iteration stats self.counter_prompt_tokens self.counter_prompt_tokens_family.Metric labelslabels self.counter_generation_tokens self.counter_generation_tokens_family.Metric labelslabels Use the same bucket boundaries from vLLM sample metrics as an example. 
httpsgithub.comvllmprojectvllmblob21313e09e3f9448817016290da20d0db1adf3664vllmenginemetrics.pyL81L96 self.histogram_time_to_first_token self.histogram_time_to_first_token_family.Metric labelslabels, buckets 0.001, 0.005, 0.01, 0.02, 0.04, 0.06, 0.08, 0.1, 0.25, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0, , self.histogram_time_per_output_token self.histogram_time_per_output_token_family.Metric labelslabels, buckets 0.01, 0.025, 0.05, 0.075, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 0.75, 1.0, 2.5, , Request stats Latency self.histogram_e2e_time_request self.histogram_e2e_time_request_family.Metric labelslabels, buckets1.0, 2.5, 5.0, 10.0, 15.0, 20.0, 30.0, 40.0, 50.0, 60.0, Metadata self.histogram_num_prompt_tokens_request self.histogram_num_prompt_tokens_request_family.Metric labelslabels, bucketsbuild_1_2_5_bucketsmax_model_len, self.histogram_num_generation_tokens_request self.histogram_num_generation_tokens_request_family.Metric labelslabels, bucketsbuild_1_2_5_bucketsmax_model_len, self.histogram_n_request self.histogram_n_request_family.Metric labelslabels, buckets1, 2, 5, 10, 20, class VllmStatLoggerVllmStatLoggerBase StatLogger is used as an adapter between vLLM stats collector and Triton metrics provider. | |
Chunk 156: | |
def __init__self, labels Dict, max_model_len int, log_logger None Tracked stats over current local logging interval. local_interval not used here. Its for vLLM logs to stdout. | |
Chunk 157: | |
super.__init__local_interval0 self.metrics TritonMetricslabels, max_model_len self.log_logger log_logger Starting the metrics thread. It allows vLLM to keep making progress while reporting metrics to triton metrics service. self._logger_queue queue.Queue self._logger_thread threading.Threadtargetself.logger_loop self._logger_thread.start def infoself, type str, obj SupportsMetricsInfo None pass def _log_counterself, counter, data Unionint, float None Convenience function for logging to counter. | |
Chunk 158: | |
Args counter A counter metric instance. data An int or float to increment the count metric. Returns None if data ! | |
Chunk 159: | |
0 self._logger_queue.put_nowaitcounter, increment, data def _log_histogramself, histogram, data UnionListint, Listfloat None Convenience function for logging list to histogram. Args histogram A histogram metric instance. data A list of int or float data to observe into the histogram metric. | |
Chunk 160: | |
Returns None for datum in data self._logger_queue.put_nowaithistogram, observe, datum def logself, stats VllmStats None Report stats to Triton metrics server. Args stats Created by LLMEngine for use by VllmStatLogger. Returns None The list of vLLM metrics reporting to Triton is also documented here. | |
Chunk 161: | |
httpsgithub.comtritoninferenceservervllm_backendblobmainREADME.mdtritonmetrics counter_metrics self.metrics.counter_prompt_tokens, stats.num_prompt_tokens_iter, self.metrics.counter_generation_tokens, stats.num_generation_tokens_iter, histogram_metrics self.metrics.histogram_time_to_first_token, stats.time_to_first_tokens_iter, , self.metrics.histogram_time_per_output_token, stats.time_per_output_tokens_iter, , self.metrics.histogram_e2e_time_request, stats.time_e2e_requests, self.metrics.histogram_num_prompt_tokens_request, stats.num_prompt_tokens_requests, , self.metrics.histogram_num_generation_tokens_request, stats.num_generation_tokens_requests, , self.metrics.histogram_n_request, stats.n_requests, for metric, data in counter_metrics self._log_countermetric, data for metric, data in histogram_metrics self._log_histogrammetric, data def logger_loopself while True item self._logger_queue.get To signal shutdown a None item will be added to the queue. if item is None break metric, command, data item if command increment metric.incrementdata elif command observe metric.observedata else self.log_logger.log_errorfUndefined command name command def finalizeself Shutdown the logger thread. self._logger_queue.putNone if self._logger_thread is not None self._logger_thread.join self._logger_thread None File srcmodel.py Copyright 20232024, NVIDIA CORPORATION AFFILIATES. | |
Chunk 162: | |
All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
Chunk 163: | |
Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |
Chunk 164: | |
import asyncio import base64 import gc import json import os import queue import threading from io import BytesIO from typing import Dict, List import numpy as np import torch import triton_python_backend_utils as pb_utils from PIL import Image from vllm.engine.arg_utils import AsyncEngineArgs from vllm.entrypoints.openai.api_server import build_async_engine_client_from_engine_args, from vllm.lora.request import LoRARequest from vllm.sampling_params import SamplingParams from vllm.utils import random_uuid from utils.metrics import VllmStatLogger _VLLM_ENGINE_ARGS_FILENAME model.json _MULTI_LORA_ARGS_FILENAME multi_lora.json class TritonPythonModel classmethod def auto_complete_configcls, auto_complete_model_config Add inputsoutputs to the model config. cls._auto_complete_inputs_and_outputsauto_complete_model_config We need to use decoupled transaction policy for saturating vLLM engine for max throughtput. TODO DLIS5233 Allow asynchronous execution to lift this restriction for cases there is exactly a single response to a single request. | |
Chunk 165: | |
auto_complete_model_config.set_model_transaction_policydictdecoupledTrue Disabling batching in Triton, let vLLM handle the batching on its own. auto_complete_model_config.set_max_batch_size0 return auto_complete_model_config staticmethod def _auto_complete_inputs_and_outputsauto_complete_model_config Inputs expected by the backend. inputs name text_input, data_type TYPE_STRING, dims 1, name image, data_type TYPE_STRING, dims 1, can be multiple images as separate elements optional True, , name stream, data_type TYPE_BOOL, dims 1, optional True, , name sampling_parameters, data_type TYPE_STRING, dims 1, optional True, , name exclude_input_in_output, data_type TYPE_BOOL, dims 1, optional True, , name return_finish_reason, data_type TYPE_BOOL, dims 1, optional True, , name return_cumulative_logprob, data_type TYPE_BOOL, dims 1, optional True, , name return_logprobs, data_type TYPE_BOOL, dims 1, optional True, , name return_num_input_tokens, data_type TYPE_BOOL, dims 1, optional True, , name return_num_output_tokens, data_type TYPE_BOOL, dims 1, optional True, , Outputs expected by the backend. | |
Chunk 166: | |
outputs name text_output, data_type TYPE_STRING, dims 1, name finish_reason, data_type TYPE_STRING, dims 1, name cumulative_logprob, data_type TYPE_FP32, dims 1, name logprobs, data_type TYPE_STRING, dims 1, name num_input_tokens, data_type TYPE_UINT32, dims 1, name num_output_tokens, data_type TYPE_UINT32, dims 1, Collect input and output names from the provided model config. config auto_complete_model_config.as_dict input_names output_names for input in configinput input_names.appendinputname for output in configoutput output_names.appendoutputname Add missing inputs and outputs to the model config. for input in inputs if inputname not in input_names auto_complete_model_config.add_inputinput for output in outputs if outputname not in output_names auto_complete_model_config.add_outputoutput def initializeself, args self.args args self.logger pb_utils.Logger self.model_config json.loadsargsmodel_config output_config pb_utils.get_output_config_by_name self.model_config, text_output self.output_dtype pb_utils.triton_string_to_numpyoutput_configdata_type Setup vLLM engine health check self._enable_health_check self._get_bool_config_param ENABLE_VLLM_HEALTH_CHECK self._is_healthy True Initialize engine arguments TODO Move this into _init_engine, after moving check metrics enabled. | |
Chunk 167: | |
self._init_engine_args Check if metrics are enabled. The ZMQ process cannot be used when metrics are enabled. TODO Move the check into _setup_metrics. | |
Chunk 168: | |
self._enable_metrics self._get_bool_config_paramREPORT_CUSTOM_METRICS and not self._aync_engine_args.disable_log_stats Starting the vLLM engine and its event thread running the AsyncIO event loop. self._init_engine Setup vLLM metrics self._setup_metrics Starting the response thread. It allows vLLM to keep making progress while response senders are sending responses to server frontend. | |
Chunk 169: | |
self._response_queue queue.Queue self._response_thread threading.Threadtargetself._response_loop self._response_thread.start def _init_engine_argsself Currently, Triton needs to use decoupled policy for asynchronously forwarding requests to vLLM engine, so assert it. self.using_decoupled pb_utils.using_decoupled_model_transaction_policy self.model_config assert self.using_decoupled , vLLM Triton backend must be configured to use decoupled model transaction policy engine_args_filepath os.path.join pb_utils.get_model_dir, _VLLM_ENGINE_ARGS_FILENAME assert os.path.isfile engine_args_filepath , f_VLLM_ENGINE_ARGS_FILENAME containing vllm engine args must be provided in pb_utils.get_model_dir with openengine_args_filepath as file self.vllm_engine_config json.loadfile Validate device and multiprocessing settings are currently set based on modelconfigs. self._validate_device_config Check for LoRA config and set it up if enabled self._setup_lora Create an AsyncEngineArgs from the config from JSON self._aync_engine_args AsyncEngineArgsself.vllm_engine_config def _init_engineself Run the engine in a separate thread running the AsyncIO event loop. | |
Chunk 170: | |
self._llm_engine None self._llm_engine_start_cv threading.Condition self._llm_engine_shutdown_event asyncio.Event self._event_thread threading.Thread targetasyncio.run, argsself._run_llm_engine, self._event_thread.start with self._llm_engine_start_cv while self._llm_engine is None self._llm_engine_start_cv.wait The threading.Thread will not raise the exception here should the engine failed to start, so the exception is passed back via the engine variable. if isinstanceself._llm_engine, Exception e self._llm_engine self.logger.log_errorfvllm Failed to start engine e if self._event_thread is not None self._event_thread.join self._event_thread None raise e async def _run_llm_engineself Counter to keep track of ongoing request counts. self._ongoing_request_count 0 try Start the vLLM engine. | |
Chunk 171: | |
The engine lives for the scope of this with statement. TODO Metrics should work with ZMQ enabled. async with build_async_engine_client_from_engine_args engine_argsself._aync_engine_args, disable_frontend_multiprocessingself._enable_metrics, as engine Capture the engine event loop and make it visible to other threads. | |
Chunk 172: | |
self._event_loop asyncio.get_running_loop Signal the engine is started and make it visible to other threads. with self._llm_engine_start_cv self._llm_engine engine self._llm_engine_start_cv.notify_all Wait for the engine shutdown signal. await self._llm_engine_shutdown_event.wait Wait for the ongoing requests to complete. | |
Chunk 173: | |
while self._ongoing_request_count 0 self.logger.log_info vllm Awaiting remaining requests.format self._ongoing_request_count await asyncio.sleep1 Cancel all tasks in the event loop. for task in asyncio.all_tasksloopself._event_loop if task is not asyncio.current_task task.cancel except Exception as e Signal and pass the exception back via the engine variable if the engine failed to start. If the engine has started, reraise the exception. | |
Chunk 174: | |
with self._llm_engine_start_cv if self._llm_engine is None self._llm_engine e self._llm_engine_start_cv.notify_all return raise e self._llm_engine None self.logger.log_infovllm Shutdown complete def _validate_device_configself triton_kind self.argsmodel_instance_kind triton_device_id intself.argsmodel_instance_device_id triton_instance fself.argsmodel_name_triton_device_id Tritons current definition of KIND_GPU makes assumptions that models only use a single GPU. For multiGPU models, the recommendation is to specify KIND_MODEL to acknowledge that the model will take control of the devices made available to it. NOTE Consider other parameters that would indicate multiGPU in the future. | |
Chunk 175: | |
tp_size intself.vllm_engine_config.gettensor_parallel_size, 1 if tp_size 1 and triton_kind GPU raise ValueError KIND_GPU is currently for singleGPU models, please specify KIND_MODEL in the models config.pbtxt for multiGPU models If KIND_GPU is specified, specify the device ID assigned by Triton to ensure that multiple model instances do not oversubscribe the same default device. if triton_kind GPU and triton_device_id 0 self.logger.log_info fDetected KIND_GPU model instance, explicitly setting GPU devicetriton_device_id for triton_instance vLLM doesnt currently v0.4.2 expose device selection in the APIs torch.cuda.set_devicetriton_device_id def _setup_loraself self.enable_lora False Check if enable_lora field is in the model.json, and if it is, read its contents, which can be string or bool. if enable_lora in self.vllm_engine_config.keys and strself.vllm_engine_configenable_lora.lower true create Triton LoRA weights repository multi_lora_args_filepath os.path.join pb_utils.get_model_dir, _MULTI_LORA_ARGS_FILENAME try with openmulti_lora_args_filepath as lora_file lora_repository Dictstr, str json.loadlora_file self.lora_repository lora_repository self.supported_loras Liststr listself.lora_repository.keys self.supported_loras_len lenself.supported_loras self.enable_lora True except FileNotFoundError raise FileNotFoundError fTriton backend cannot find multi_lora_args_filepath. | |
Chunk 176: | |
def _setup_metricsself self._vllm_metrics None TODO Do not read metrics directly from the vLLM engine, read from prometheus client to allow the use of ZMQ process when metrics are enabled. See httpsgithub.comvllmprojectvllmblobv0.6.3.post1vllmentrypointsopenaiapi_server.pyL222L245 if self._enable_metrics try labels model self.argsmodel_name, version self.argsmodel_version, Add vLLM custom metrics engine_config self._llm_engine.engine.model_config self._vllm_metrics VllmStatLogger labels, engine_config.max_model_len, self.logger self._llm_engine.add_loggertriton, self._vllm_metrics except pb_utils.TritonModelException as e if metrics not supported in stre Metrics are disabled at the server self.logger.log_infovllm Metrics not supported else raise e def _get_bool_config_paramself, param_name str bool return param_name in self.model_configparameters and self.model_configparametersparam_namestring_value.lower true def _response_loopself while True item self._response_queue.get To signal shutdown a None item will be added to the queue. if item is None break response_state, response, response_flag item response_sender response_stateresponse_sender try response_sender.sendresponse, response_flag Stop checking for cancellation if the last response is generated. | |
Chunk 177: | |
if not response_statelast_response_generated response_stateis_cancelled response_sender.is_cancelled except Exception as e self.logger.log_error fAn error occurred while sending a response e finally if response_flag pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL self._ongoing_request_count 1 def executeself, requests if self._enable_health_check and not self._check_healthrequests return None for request in requests request self._verify_lorasrequest if request is not None assert self._llm_engine_shutdown_event.is_set is False , Cannot create tasks after shutdown has been requested coro self._generaterequest asyncio.run_coroutine_threadsafecoro, self._event_loop return None async def _generateself, request response_sender request.get_response_sender response_state response_sender response_sender, is_cancelled False, last_response_generated False, last response ready but not yet sent self._ongoing_request_count 1 decrement_ongoing_request_count True try request_id random_uuid prompt, stream, prepend_input, parameters, additional_outputs, self._get_input_tensorsrequest sampling_params_dict self._get_sampling_params_dictparameters lora_name sampling_params_dict.poplora_name, None sampling_params SamplingParamssampling_params_dict lora_request None if lora_name is not None lora_id strself.supported_loras.indexlora_name 1 lora_int_id intlora_id lora_local_path self.lora_repositorylora_name lora_request LoRARequestlora_id, lora_int_id, lora_local_path response_iterator self._llm_engine.generate prompt, sampling_params, request_id, lora_requestlora_request request_output_state async for request_output in response_iterator Cancellation state will be checked by the response loop and written to the response state if streaming. If not streaming, cancellation state needs to be checked here. is_cancelled response_stateis_cancelled if not stream is_cancelled response_sender.is_cancelled if is_cancelled self.logger.log_infovllm Cancelling the request await self._llm_engine.abortrequest_id self.logger.log_infovllm Successfully cancelled the request if stream Add cancelled final response to response loop. | |
Chunk 178: | |
response_statelast_response_generated True response pb_utils.InferenceResponse errorpb_utils.TritonError messageRequest was cancelled, codepb_utils.TritonError.CANCELLED, flags pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL decrement_ongoing_request_count False self._response_queue.put_nowait response_state, response, flags break Send each response if streaming. if stream response self._create_response request_output_state, request_output, prepend_inputFalse, additional_outputsadditional_outputs, flags 0 if request_output.finished response_statelast_response_generated True flags pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL decrement_ongoing_request_count False self._response_queue.put_nowaitresponse_state, response, flags Send the last response which contains all the outputs if not streaming. if not stream response_sender.send self._create_response request_output_state, request_outputrequest_output, prepend_inputprepend_input, additional_outputsadditional_outputs, , flagspb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL, except Exception as e self.logger.log_errorfvllm Error generating stream e error pb_utils.TritonErrorfError generating stream e text_output_tensor pb_utils.Tensor text_output, np.asarrayNA, dtypeself.output_dtype response pb_utils.InferenceResponse output_tensorstext_output_tensor, errorerror response_sender.send response, flagspb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL raise e finally if decrement_ongoing_request_count self._ongoing_request_count 1 def _get_input_tensorsself, request prompt prompt pb_utils.get_input_tensor_by_namerequest, text_input.as_numpy0 if isinstanceprompt, bytes prompt prompt.decodeutf8 image images pb_utils.get_input_tensor_by_namerequest, image if images images_vllm for image_np in images.as_numpy image_b base64.b64decodeimage_np.decodeutf8 image_rgb Image.openBytesIOimage_b.convertRGB images_vllm.appendimage_rgb if lenimages_vllm 0 prompt prompt prompt, multi_modal_data image images_vllm, stream stream pb_utils.get_input_tensor_by_namerequest, stream if stream stream stream.as_numpy0 else stream False prepend_input exclude_input_in_output prepend_input pb_utils.get_input_tensor_by_name request, exclude_input_in_output if prepend_input When exclude_input_in_output is False, we want to prepend input prompt to output, thus prepend_input should be True, and vice versa. | |
Chunk 179: | |
prepend_input not prepend_input.as_numpy0 elif prepend_input is None and stream prepend_input False else prepend_input True if prepend_input and stream raise ValueError When streaming, exclude_input_in_output False is not allowed. parameters sampling_parameters An alternative mechanism to receive serialized parameters as an input tensor, because request parameters are not yet supported via BLS. sampling_parameters pb_utils.get_input_tensor_by_name request, sampling_parameters if sampling_parameters parameters sampling_parameters.as_numpy0.decodeutf8 else parameters request.parameters additional outputs additional_outputs return_finish_reason None, return_cumulative_logprob None, return_logprobs None, return_num_input_tokens None, return_num_output_tokens None, for tensor_name in additional_outputs.keys tensor pb_utils.get_input_tensor_by_namerequest, tensor_name if tensor tensor booltensor.as_numpy0 else tensor False additional_outputstensor_name tensor return prompt, stream, prepend_input, parameters, additional_outputs def _create_response self, request_output_state, request_output, prepend_input, additional_outputs output_tensors text_output prepend_prompt if prev_lens_text_output not in request_output_state this is the first response if prepend_input prepend_prompt request_output.prompt request_output_stateprev_lens_text_output 0 len request_output.outputs prev_lens request_output_stateprev_lens_text_output text_output prepend_prompt output.textprev_len.encodeutf8 for output, prev_len in ziprequest_output.outputs, prev_lens request_output_stateprev_lens_text_output lenoutput.text for output in request_output.outputs output_tensors.append pb_utils.Tensor text_output, np.asarraytext_output, dtypeself.output_dtype finish_reason if additional_outputsreturn_finish_reason finish_reason stroutput.finish_reason for output in request_output.outputs output_tensors.append pb_utils.Tensor finish_reason, np.asarrayfinish_reason, dtypenp.object_ cumulative_logprob if additional_outputsreturn_cumulative_logprob cumulative_logprob output.cumulative_logprob for output in request_output.outputs output_tensors.append pb_utils.Tensor cumulative_logprob, np.asarraycumulative_logprob, dtypenp.float32, logprobs httpsgithub.comvllmprojectvllmblobv0.6.3.post1vllmsequence.pyL37L58 if additional_outputsreturn_logprobs if prev_lens_logprobs not in request_output_state request_output_stateprev_lens_logprobs 0 len request_output.outputs logprobs for i in rangelenrequest_output.outputs output request_output.outputsi if output.logprobs is None logprobs.appendnull.encodeutf8 continue prev_len request_output_stateprev_lens_logprobsi request_output_stateprev_lens_logprobsi lenoutput.logprobs logprobs_py for logprob_d_vllm in output.logprobsprev_len logprob_d_py for token_id, logprob_vllm in logprob_d_vllm.items logprob_d_pytoken_id logprob logprob_vllm.logprob, rank logprob_vllm.rank, decoded_token logprob_vllm.decoded_token, logprobs_py.appendlogprob_d_py logprobs.appendjson.dumpslogprobs_py.encodeutf8 output_tensors.append pb_utils.Tensorlogprobs, np.asarraylogprobs, dtypenp.object_ num_input_tokens if additional_outputsreturn_num_input_tokens num_input_tokens lenrequest_output.prompt_token_ids output_tensors.append pb_utils.Tensor num_input_tokens, np.asarraynum_input_tokens, dtypenp.uint32 num_output_tokens if additional_outputsreturn_num_output_tokens if prev_lens_num_output_tokens not in request_output_state request_output_stateprev_lens_num_output_tokens 0 len request_output.outputs prev_lens 
request_output_stateprev_lens_num_output_tokens num_output_tokens lenoutput.token_ids prev_len for output, prev_len in ziprequest_output.outputs, prev_lens request_output_stateprev_lens_num_output_tokens lenoutput.token_ids for output in request_output.outputs output_tensors.append pb_utils.Tensor num_output_tokens, np.asarraynum_output_tokens, dtypenp.uint32 return pb_utils.InferenceResponseoutput_tensorsoutput_tensors def _get_sampling_params_dictself, params_json params_dict json.loadsparams_json Special parsing for the supported sampling parameters bool_keys ignore_eos, skip_special_tokens, use_beam_search for k in bool_keys if k in params_dict params_dictk boolparams_dictk float_keys frequency_penalty, length_penalty, presence_penalty, temperature, top_p, for k in float_keys if k in params_dict params_dictk floatparams_dictk int_keys best_of, max_tokens, min_tokens, n, top_k for k in int_keys if k in params_dict params_dictk intparams_dictk return params_dict def _verify_lorasself, request We will check if the requested lora exists here, if not we will send a response with LoRA not found information. | |
Chunk 180: | |
In this way we may avoid further processing. verified_request None lora_error None lora_name None parameters_input_tensor pb_utils.get_input_tensor_by_name request, sampling_parameters if parameters_input_tensor parameters parameters_input_tensor.as_numpy0.decodeutf8 sampling_params_dict self._get_sampling_params_dictparameters lora_name sampling_params_dict.poplora_name, None if lora_name is not None if not self.enable_lora lora_error pb_utils.TritonErrorLoRA feature is not enabled. self.logger.log_info vllm LoRA is not enabled, please restart the backend with LoRA enabled. | |
Chunk 181: | |
elif lora_name not in self.supported_loras lora_error pb_utils.TritonError fLoRA lora_name is not supported, we currently support self.supported_loras self.logger.log_infofvllm LoRA lora_name not found. if lora_error is not None output_tensor pb_utils.Tensor text_output, np.asarrayError Unsupported LoRA., dtypeself.output_dtype, response pb_utils.InferenceResponse output_tensorsoutput_tensor, errorlora_error response_sender request.get_response_sender response_sender.send response, flagspb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL else verified_request request return verified_request def _check_healthself, requests coro self._llm_engine.check_health future asyncio.run_coroutine_threadsafecoro, self._event_loop try future.result except Exception as e self.logger.log_error fvllm Engine is not healthy and model will be unloaded e pb_utils.unload_modelself.model_configname nonblocking self._is_healthy False if not self._is_healthy for request in requests request.get_response_sender.send pb_utils.InferenceResponse errorpb_utils.TritonError messageModel is unavailable due to unhealthy vLLM engine, codepb_utils.TritonError.UNAVAILABLE, , flagspb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL, return self._is_healthy def finalizeself self.logger.log_infovllm Issuing finalize to vllm backend self._event_loop.call_soon_threadsafeself._llm_engine_shutdown_event.set Shutdown the event thread. if self._event_thread is not None self._event_thread.join self._event_thread None Shutdown the response thread. | |
Chunk 182: | |
self._response_queue.putNone if self._response_thread is not None self._response_thread.join self._response_thread None Shutdown the metrics thread. if self._vllm_metrics is not None self._vllm_metrics.finalize When using parallel tensors, the stub process may not shutdown due to unreleased references, so manually run the garbage collector once. self.logger.log_infovllm Running Garbage Collector on finalize... | |
Chunk 183: | |
gc.collect self.logger.log_infovllm Garbage Collector on finalize... done File .gitignore Bytecompiled optimized DLL files __pycache__ .pycod py.class C extensions .so Distribution packaging .Python build developeggs dist downloads eggs .eggs lib lib64 parts sdist var wheels sharepythonwheels .egginfo .installed.cfg .egg MANIFEST PyInstaller Usually these files are written by a python script from a template before PyInstaller builds the exe, so as to inject dateother infos into it. .manifest .spec Installer logs piplog.txt pipdeletethisdirectory.txt Unit test coverage reports htmlcov .tox .nox .coverage .coverage. | |
Chunk 184: | |
.cache nosetests.xml coverage.xml .cover .py,cover .hypothesis .pytest_cache cover .out Translations .mo .pot Django stuff .log local_settings.py db.sqlite3 db.sqlite3journal Flask stuff instance .webassetscache Scrapy stuff .scrapy Sphinx documentation docs_build PyBuilder .pybuilder target Jupyter Notebook .ipynb_checkpoints IPython profile_default ipython_config.py pyenv For a library or package, you might want to ignore these files since the code is intended to run in multiple environments otherwise, check them in .pythonversion pipenv According to pypapipenv598, it is recommended to include Pipfile.lock in version control. However, in case of collaboration, if having platformspecific dependencies or dependencies having no crossplatform support, pipenv may install dependencies that dont work, or not install all needed dependencies. Pipfile.lock poetry Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control. | |
Chunk 185: | |
This is especially recommended for binary packages to ensure reproducibility, and is more commonly ignored for libraries. httpspythonpoetry.orgdocsbasicusagecommityourpoetrylockfiletoversioncontrol poetry.lock pdm Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control. pdm.lock pdm stores projectwide configurations in .pdm.toml, but it is recommended to not include it in version control. | |
Chunk 186: | |
httpspdm.fming.devusewithide .pdm.toml PEP 582 used by e.g. github.comDavidOConnorpyflow and github.compdmprojectpdm __pypackages__ Celery stuff celerybeatschedule celerybeat.pid SageMath parsed files .sage.py Environments .env .venv env venv ENV env.bak venv.bak Spyder project settings .spyderproject .spyproject Rope project settings .ropeproject mkdocs documentation site mypy .mypy_cache .dmypy.json dmypy.json Pyre type checker .pyre pytype static type analyzer .pytype Cython debug symbols cython_debug Test result files Miniconda miniconda vllm_env.tar.gz triton_python_backend_stub python_backend results.txt .log PyCharm JetBrains specific template is maintained in a separate JetBrains.gitignore that can be found at httpsgithub.comgithubgitignoreblobmainGlobalJetBrains.gitignore and can be added to the global gitignore or merged into this file. For a more nuclear option not recommended you can uncomment the following to ignore the entire idea folder. | |
Chunk 187: | |
.idea File .precommitconfig.yaml Copyright 2023, NVIDIA CORPORATION AFFILIATES. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. | |
Chunk 188: | |
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. | |
Chunk 189: | |
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. repos repo httpsgithub.comtimothycrosleyisort rev 5.12.0 hooks id isort additional_dependencies toml repo httpsgithub.compsfblack rev 23.1.0 hooks id black types_or python, cython repo httpsgithub.comPyCQAflake8 rev 5.0.4 hooks id flake8 args maxlinelength88, selectC,E,F,W,B,B950, extendignore E203,E501 types_or python, cython repo httpsgithub.comprecommitmirrorsclangformat rev v16.0.5 hooks id clangformat types_or c, c, cuda, proto, textproto, java args fallbackstylenone, stylefile, i repo httpsgithub.comcodespellprojectcodespell rev v2.2.4 hooks id codespell additional_dependencies tomli args toml, pyproject.toml exclude ?x.stemmer..stop_words.CHANGELOG.md More details about these precommit hooks here httpsprecommit.comhooks.html repo httpsgithub.comprecommitprecommithooks rev v4.4.0 hooks id checkcaseconflict id checkexecutableshaveshebangs id checkmergeconflict id checkjson id checktoml id checkyaml exclude deploytemplates. id checkshebangscriptsareexecutable id endoffilefixer types_or c, c, cuda, proto, textproto, java, python id mixedlineending id requirementstxtfixer id trailingwhitespace File LICENSE BSD 3Clause License Copyright c 2023, Triton Inference Server Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met 1. | |
Chunk 190: | |
Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
Chunk 191: | |
3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. | |
Chunk 192: | |
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. File pyproject.toml Copyright 2023, NVIDIA CORPORATION AFFILIATES. All rights reserved. | |
Chunk 193: | |
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. | |
Chunk 194: | |
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. tool.codespell note precommit passes explicit lists of files here, which this skip file list doesnt override this is only to allow you to run codespell interactively skip ..git,..github ignore short words, and typename parameters like OffsetT ignoreregex b.1,4AZwTb use the clear dictionary for unambiguous spelling mistakes builtin clear disable warnings about binary files and wrong encoding quietlevel 3 tool.isort profile black use_parentheses true multi_line_output 3 include_trailing_comma true force_grid_wrap 0 ensure_newline_before_comments true line_length 88 balanced_wrapping true indent skip build File README.md ! | |
Chunk 195: | |
Copyright 20232024, NVIDIA CORPORATION AFFILIATES. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. | |
Chunk 196: | |
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. | |
Chunk 197: | |
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. [License badge: https://img.shields.io/badge/License-BSD3-lightgrey.svg, linking to https://opensource.org/licenses/BSD-3-Clause] vLLM Backend The Triton backend for vLLM (https://github.com/vllm-project/vllm) is designed to run supported models (https://vllm.readthedocs.io/en/latest/models/supported_models.html) on a vLLM engine (https://github.com/vllm-project/vllm/blob/main/vllm/engine/async_llm_engine.py). You can learn more about Triton backends in the backend repo (https://github.com/triton-inference-server/backend).
Chunk 198: | |
This is a Python-based backend (https://github.com/triton-inference-server/backend/blob/main/docs/python_based_backends.md#python-based-backends). When using this backend, all requests are placed on the vLLM AsyncEngine as soon as they are received. In-flight batching and paged attention are handled by the vLLM engine.
Chunk 199: | |
Where can I ask general questions about Triton and Triton backends? Be sure to read all the information below as well as the general Triton documentation (https://github.com/triton-inference-server/server#triton-inference-server) available in the main server repo (https://github.com/triton-inference-server/server). If you don't find your answer there, you can ask questions on the main Triton issues page (https://github.com/triton-inference-server/server/issues).
Chunk 200: | |
Installing the vLLM Backend There are several ways to install and deploy the vLLM backend. Option 1. Use the Pre-Built Docker Container.
Chunk 201: | |
Pull a tritonserver:<xx.yy>-vllm-python-py3 container with the vLLM backend from the NGC registry (https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver). <xx.yy> is the version of Triton that you want to use. Please note that Triton's vLLM container has been available starting from the 23.10 release.
Chunk 202: | |
docker pull nvcr.io/nvidia/tritonserver:<xx.yy>-vllm-python-py3

Option 2. Build a Custom Container From Source You can follow the steps described in the Building With Docker guide (https://github.com/triton-inference-server/server/blob/main/docs/customization_guide/build.md#building-with-docker) and use the build.py script (https://github.com/triton-inference-server/server/blob/main/build.py). A sample command to build a Triton Server container with all options enabled is shown below.
Chunk 203: | |
Feel free to customize the flags according to your needs. Please use the NGC registry (https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver/tags) to get the latest version of the Triton vLLM container, which corresponds to the latest YY.MM (year.month) Triton release (https://github.com/triton-inference-server/server/releases). YY.MM is the version of Triton.
Chunk 204: | |
export TRITON_CONTAINER_VERSION=<YY.MM>
./build.py -v --enable-logging \
    --enable-stats \
    --enable-tracing \
    --enable-metrics \
    --enable-gpu-metrics \
    --enable-cpu-metrics \
    --enable-gpu \
    --filesystem=gcs \
    --filesystem=s3 \
    --filesystem=azure_storage \
    --endpoint=http \
    --endpoint=grpc \
    --endpoint=sagemaker \
    --endpoint=vertexai \
    --upstream-container-version=${TRITON_CONTAINER_VERSION} \
    --backend=python:r${TRITON_CONTAINER_VERSION} \
    --backend=vllm:r${TRITON_CONTAINER_VERSION} \
    --backend=ensemble

Option 3. Add the vLLM Backend to the Default Triton Container You can install the vLLM backend directly into the NGC Triton container. In this case, please install vLLM first.
Chunk 205: | |
You can do so by running pip install vllm==<vLLM_version>. Then, set up the vLLM backend in the container with the following commands:

mkdir -p /opt/tritonserver/backends/vllm
git clone https://github.com/triton-inference-server/vllm_backend.git /tmp/vllm_backend
cp -r /tmp/vllm_backend/src/* /opt/tritonserver/backends/vllm

Using the vLLM Backend You can see an example model_repository (samples/model_repository) in the samples folder. You can use this as is and change the model by changing the model value in model.json.
Chunk 206: | |
model.json represents a key-value dictionary that is fed to vLLM's AsyncLLMEngine when initializing the model. You can see supported arguments in vLLM's arg_utils.py (https://github.com/vllm-project/vllm/blob/main/vllm/engine/arg_utils.py). Specifically, here (https://github.com/vllm-project/vllm/blob/ee8217e5bee5860469204ee57077a91138c9af02/vllm/engine/arg_utils.py#L11) and here (https://github.com/vllm-project/vllm/blob/ee8217e5bee5860469204ee57077a91138c9af02/vllm/engine/arg_utils.py#L201).
Chunk 207: | |
For multi-GPU support, EngineArgs like tensor_parallel_size can be specified in model.json (samples/model_repository/vllm_model/1/model.json). Note: vLLM greedily consumes up to 90% of the GPU's memory under default settings. The sample model updates this behavior by setting gpu_memory_utilization to 50%.
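As an illustration only (a sketch rather than a verbatim copy of the shipped sample; the model name and values below are placeholder assumptions), a model.json along these lines passes the named EngineArgs straight through to vLLM's AsyncLLMEngine:

{
    "model": "facebook/opt-125m",
    "disable_log_requests": true,
    "gpu_memory_utilization": 0.5,
    "tensor_parallel_size": 2,
    "enforce_eager": true
}

Since the dictionary is handed to the engine as-is, every key must be an argument that vLLM's AsyncEngineArgs actually accepts.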
Chunk 208: | |
You can tweak this behavior using fields like gpu_memory_utilization and other settings in model.json (samples/model_repository/vllm_model/1/model.json). Launching Triton Inference Server Once you have the model repository set up, it is time to launch the Triton server. We will use the pre-built Triton container with the vLLM backend (see Option 1 above) from NGC (https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver) in this example.
Chunk 209: | |
docker run --gpus all -it --net=host --rm -p 8001:8001 --shm-size=1G --ulimit memlock=-1 --ulimit stack=67108864 -v ${PWD}:/work -w /work nvcr.io/nvidia/tritonserver:<xx.yy>-vllm-python-py3 tritonserver --model-repository ./model_repository

Replace <xx.yy> with the version of Triton that you want to use. Note that Triton's vLLM container was first published with the 23.10 release. After you start Triton you will see output on the console showing the server starting up and loading the model.
Chunk 210: | |
When you see output like the following, Triton is ready to accept inference requests.

I1030 22:33:28.291908 1 grpc_server.cc:2513] Started GRPCInferenceService at 0.0.0.0:8001
I1030 22:33:28.292879 1 http_server.cc:4497] Started HTTPService at 0.0.0.0:8000
I1030 22:33:28.335154 1 http_server.cc:270] Started Metrics Service at 0.0.0.0:8002

Sending Your First Inference After you start Triton (https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/getting_started/quickstart.html) with the sample model_repository (samples/model_repository), you can quickly run your first inference request with the generate endpoint (https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_generate.md). Try out the command below.
Chunk 211: | |
curl -X POST localhost:8000/v2/models/vllm_model/generate -d '{"text_input": "What is Triton Inference Server?", "parameters": {"stream": false, "temperature": 0}}'

Upon success, you should see a response from the server like this one:

{"model_name":"vllm_model","model_version":"1","text_output":"What is Triton Inference Server?\n\nTriton Inference Server is a server that is used by many"}

In the samples folder, you can also find a sample client, client.py (samples/client.py), which uses Triton's asyncio gRPC client library (https://github.com/triton-inference-server/client#python-asyncio-support-beta-1) to run inference on Triton. Running the Latest vLLM Version You can check the vLLM version included in Triton Inference Server from the Framework Containers Support Matrix (https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html). Note: the vLLM Triton Inference Server container has been introduced starting from the 23.10 release.
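If you prefer Python over curl, the following minimal sketch streams a single prompt over gRPC, mirroring how the repository's own test utilities build requests. It is not the shipped samples/client.py; it assumes Triton is reachable at localhost:8001 and the model is named vllm_model.

import json
import queue

import numpy as np
import tritonclient.grpc as grpcclient

responses = queue.Queue()

def callback(result, error):
    # Streamed results (or errors) arrive here as the decoupled model produces them.
    responses.put(error if error is not None else result)

prompt = "What is Triton Inference Server?"
inputs = [
    grpcclient.InferInput("text_input", [1], "BYTES"),
    grpcclient.InferInput("stream", [1], "BOOL"),
    grpcclient.InferInput("sampling_parameters", [1], "BYTES"),
]
inputs[0].set_data_from_numpy(np.array([prompt.encode("utf-8")], dtype=np.object_))
inputs[1].set_data_from_numpy(np.array([True], dtype=bool))
inputs[2].set_data_from_numpy(
    np.array([json.dumps({"temperature": 0, "top_p": 1}).encode("utf-8")], dtype=np.object_)
)

with grpcclient.InferenceServerClient("localhost:8001") as client:
    client.start_stream(callback)
    client.async_stream_infer("vllm_model", inputs=inputs, request_id="0")
    client.stop_stream()  # blocks until all in-flight responses have been delivered

while not responses.empty():
    item = responses.get()
    if isinstance(item, Exception):
        raise item
    # With stream=True each response carries only the newly generated text diff.
    print(item.as_numpy("text_output")[0].decode("utf-8"), end="")

With stream set to False, the server instead returns a single response containing the full generation.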
Chunk 212: | |
You can use pip install ... within the container to upgrade the vLLM version. Running Multiple Instances of Triton Server If you are running multiple instances of Triton server with a Python-based backend, you need to specify a different shm-region-prefix-name for each server.
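As a rough illustration (the repository paths, ports, and prefix names below are placeholders, and the prefix is assumed to be passed through the Python backend's backend-config option), two servers on the same host could be started along these lines:

tritonserver --model-repository=/models/server_one \
    --backend-config=python,shm-region-prefix-name=prefix1 \
    --http-port 8000 --grpc-port 8001 --metrics-port 8002

tritonserver --model-repository=/models/server_two \
    --backend-config=python,shm-region-prefix-name=prefix2 \
    --http-port 9000 --grpc-port 9001 --metrics-port 9002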
Chunk 213: | |
See here (https://github.com/triton-inference-server/python_backend#running-multiple-instances-of-triton-server) for more information. Additional vLLM Outputs Additional vLLM outputs may be requested optionally on a per-request basis. See this doc (docs/additional_outputs.md) for more information.
Chunk 214: | |
Triton Metrics Starting with the 24.08 release of Triton, users can now obtain specific vLLM metrics by querying the Triton metrics endpoint (see the complete list of vLLM metrics here: https://docs.vllm.ai/en/latest/serving/metrics.html). This can be accomplished by launching a Triton server in any of the ways described above (ensuring the build code / container is 24.08 or later) and querying the server. Upon receiving a successful response, you can query the metrics endpoint by entering the following:

curl localhost:8002/metrics

vLLM stats are reported by the metrics endpoint in fields that are prefixed with vllm:.
Chunk 215: | |
Triton currently supports reporting of the following metrics from vLLM.

# Number of prefill tokens processed.
counter_prompt_tokens
# Number of generation tokens processed.
Chunk 216: | |
counter_generation_tokens
# Histogram of time to first token in seconds.
histogram_time_to_first_token
# Histogram of time per output token in seconds.
histogram_time_per_output_token
# Histogram of end to end request latency in seconds.
Chunk 217: | |
histogram_e2e_time_request
# Number of prefill tokens processed.
histogram_num_prompt_tokens_request
# Number of generation tokens processed.
histogram_num_generation_tokens_request
# Histogram of the best_of request parameter.
Chunk 218: | |
histogram_best_of_request
# Histogram of the n request parameter.
histogram_n_request

Your output for these fields should look similar to the following:

# HELP vllm:prompt_tokens_total Number of prefill tokens processed.
# TYPE vllm:prompt_tokens_total counter
vllm:prompt_tokens_total{model="vllm_model",version="1"} 10
# HELP vllm:generation_tokens_total Number of generation tokens processed.
Chunk 219: | |
# TYPE vllm:generation_tokens_total counter
vllm:generation_tokens_total{model="vllm_model",version="1"} 16
# HELP vllm:time_to_first_token_seconds Histogram of time to first token in seconds.
# TYPE vllm:time_to_first_token_seconds histogram
vllm:time_to_first_token_seconds_count{model="vllm_model",version="1"} 1
vllm:time_to_first_token_seconds_sum{model="vllm_model",version="1"} 0.03233122825622559
vllm:time_to_first_token_seconds_bucket{model="vllm_model",version="1",le="0.001"} 0
...
vllm:time_to_first_token_seconds_bucket{model="vllm_model",version="1",le="+Inf"} 1
# HELP vllm:time_per_output_token_seconds Histogram of time per output token in seconds.
Chunk 220: | |
# TYPE vllm:time_per_output_token_seconds histogram
vllm:time_per_output_token_seconds_count{model="vllm_model",version="1"} 15
vllm:time_per_output_token_seconds_sum{model="vllm_model",version="1"} 0.04501533508300781
vllm:time_per_output_token_seconds_bucket{model="vllm_model",version="1",le="0.01"} 14
...
vllm:time_per_output_token_seconds_bucket{model="vllm_model",version="1",le="+Inf"} 15
# HELP vllm:e2e_request_latency_seconds Histogram of end to end request latency in seconds.
# TYPE vllm:e2e_request_latency_seconds histogram
vllm:e2e_request_latency_seconds_count{model="vllm_model",version="1"} 1
vllm:e2e_request_latency_seconds_sum{model="vllm_model",version="1"} 0.08686184883117676
vllm:e2e_request_latency_seconds_bucket{model="vllm_model",version="1",le="1"} 1
...
Chunk 221: | |
vllm:e2e_request_latency_seconds_bucket{model="vllm_model",version="1",le="+Inf"} 1
# HELP vllm:request_prompt_tokens Number of prefill tokens processed.
# TYPE vllm:request_prompt_tokens histogram
vllm:request_prompt_tokens_count{model="vllm_model",version="1"} 1
vllm:request_prompt_tokens_sum{model="vllm_model",version="1"} 10
vllm:request_prompt_tokens_bucket{model="vllm_model",version="1",le="1"} 0
...
vllm:request_prompt_tokens_bucket{model="vllm_model",version="1",le="+Inf"} 1
# HELP vllm:request_generation_tokens Number of generation tokens processed.
Chunk 222: | |
# TYPE vllm:request_generation_tokens histogram
vllm:request_generation_tokens_count{model="vllm_model",version="1"} 1
vllm:request_generation_tokens_sum{model="vllm_model",version="1"} 16
vllm:request_generation_tokens_bucket{model="vllm_model",version="1",le="1"} 0
...
vllm:request_generation_tokens_bucket{model="vllm_model",version="1",le="+Inf"} 1
# HELP vllm:request_params_best_of Histogram of the best_of request parameter.
# TYPE vllm:request_params_best_of histogram
vllm:request_params_best_of_count{model="vllm_model",version="1"} 1
vllm:request_params_best_of_sum{model="vllm_model",version="1"} 1
vllm:request_params_best_of_bucket{model="vllm_model",version="1",le="1"} 1
...
Chunk 223: | |
vllm:request_params_best_of_bucket{model="vllm_model",version="1",le="+Inf"} 1
# HELP vllm:request_params_n Histogram of the n request parameter.
# TYPE vllm:request_params_n histogram
vllm:request_params_n_count{model="vllm_model",version="1"} 1
vllm:request_params_n_sum{model="vllm_model",version="1"} 1
vllm:request_params_n_bucket{model="vllm_model",version="1",le="1"} 1
...
vllm:request_params_n_bucket{model="vllm_model",version="1",le="+Inf"} 1

To enable the vLLM engine to collect metrics, the disable_log_stats option needs to be either set to false or left empty (false by default) in model.json (https://github.com/triton-inference-server/vllm_backend/blob/main/samples/model_repository/vllm_model/1/model.json).
Chunk 224: | |
"disable_log_stats": false

Note: vLLM metrics are not reported to the Triton metrics server by default due to potential performance slowdowns. To enable a vLLM model's metrics reporting, please add the following lines to its config.pbtxt as well.

parameters: {
  key: "REPORT_CUSTOM_METRICS"
  value: {
    string_value: "true"
  }
}

vLLM Engine Health Check (BETA) vLLM Engine Health Check may be enabled optionally, for more accurate model state reported by the server.
Chunk 225: | |
See this doc (docs/health_check.md) for more information. Referencing the Tutorial You can read further in the vLLM Quick Deploy guide (https://github.com/triton-inference-server/tutorials/tree/main/Quick_Deploy/vLLM) in the tutorials repository (https://github.com/triton-inference-server/tutorials).
name CodeQL on pull_request jobs analyze name Analyze runson ubuntulatest permissions actions read contents read securityevents write strategy failfast false matrix language python CodeQL supports cpp, csharp, go, java, javascript, python, ruby Learn more about CodeQL language support at httpsaka.mscodeqldocslanguagesupport steps name Checkout repository uses actionscheckoutv3 Initializes the CodeQL tools for scanning. name Initialize CodeQL uses githubcodeqlactioninitv2 with languages matrix.language If you wish to specify custom queries, you can do so here or in a config file. By default, queries listed here will override any specified in a config file. | |
--- | |
Prefix the list here with to use these queries and those in the config file. Details on CodeQLs query packs refer to httpsdocs.github.comencodesecuritycodescanningautomaticallyscanningyourcodeforvulnerabilitiesanderrorsconfiguringcodescanningusingqueriesinqlpacks queries securityandquality Autobuild attempts to build any compiled languages CC, C, Go, or Java. If this step fails, then you should remove it and run the build manually see below name Autobuild uses githubcodeqlactionautobuildv2 Commandline programs to run using the OS shell. | |
--- | |
See httpsdocs.github.comenactionsusingworkflowsworkflowsyntaxforgithubactionsjobsjob_idstepsrun If the Autobuild fails above, remove it and uncomment the following three lines. modify them or add more to build your code if your project, please refer to the EXAMPLE below for guidance. run echo Run, Build Application using script .location_of_script_within_repobuildscript.sh name Perform CodeQL Analysis uses githubcodeqlactionanalyzev2 with category languagematrix.language File .githubworkflowsprecommit.yaml Copyright 2023, NVIDIA CORPORATION AFFILIATES. | |
--- | |
All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
--- | |
Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |
--- | |
name precommit on pull_request jobs precommit runson ubuntu22.04 steps uses actionscheckoutv3 uses actionssetuppythonv3 uses precommitactionv3.0.0 File cicommontest_util.py !usrbinenv python3 Copyright 20182023, NVIDIA CORPORATION AFFILIATES. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. | |
--- | |
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. | |
--- | |
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import json import queue import unittest import numpy as np import tritonclient.grpc as grpcclient class TestResultCollectorunittest.TestCase TestResultCollector stores test result and prints it to stdout. In order to use this class, unit tests must inherit this class. | |
--- | |
Use check_test_results bash function from commonutil.sh to verify the expected number of tests produced by this class classmethod def setResultcls, total, errors, failures cls.total, cls.errors, cls.failures total, errors, failures classmethod def tearDownClasscls this method is called when all the unit tests in a class are finished. json_res total cls.total, errors cls.errors, failures cls.failures with opentest_results.txt, w as f f.writejson.dumpsjson_res def runself, resultNone result argument stores the accumulative test results test_result super.runresult total test_result.testsRun errors lentest_result.errors failures lentest_result.failures self.setResulttotal, errors, failures class AsyncTestResultCollectorunittest.IsolatedAsyncioTestCase TestResultCollector stores test result and prints it to stdout. In order to use this class, unit tests must inherit this class. | |
--- | |
Use check_test_results bash function from commonutil.sh to verify the expected number of tests produced by this class classmethod def setResultcls, total, errors, failures cls.total, cls.errors, cls.failures total, errors, failures classmethod def tearDownClasscls this method is called when all the unit tests in a class are finished. json_res total cls.total, errors cls.errors, failures cls.failures with opentest_results.txt, w as f f.writejson.dumpsjson_res def runself, resultNone result argument stores the accumulative test results test_result super.runresult total test_result.testsRun errors lentest_result.errors failures lentest_result.failures self.setResulttotal, errors, failures def create_vllm_request prompt, request_id, stream, sampling_parameters, model_name, send_parameters_as_tensorTrue, exclude_input_in_outputNone, inputs inputs.appendgrpcclient.InferInputtext_input, 1, BYTES inputs1.set_data_from_numpynp.arrayprompt.encodeutf8, dtypenp.object_ inputs.appendgrpcclient.InferInputstream, 1, BOOL inputs1.set_data_from_numpynp.arraystream, dtypebool if send_parameters_as_tensor sampling_parameters_data np.array json.dumpssampling_parameters.encodeutf8, dtypenp.object_ inputs.appendgrpcclient.InferInputsampling_parameters, 1, BYTES inputs1.set_data_from_numpysampling_parameters_data if exclude_input_in_output is not None inputs.appendgrpcclient.InferInputexclude_input_in_output, 1, BOOL inputs1.set_data_from_numpynp.arrayexclude_input_in_output, dtypebool outputs grpcclient.InferRequestedOutputtext_output return model_name model_name, inputs inputs, outputs outputs, request_id strrequest_id, parameters sampling_parameters, class UserData def __init__self self._completed_requests queue.Queue def callbackuser_data, result, error if error user_data._completed_requests.puterror else user_data._completed_requests.putresult File cicommonutil.sh !binbash Copyright 20232024, NVIDIA CORPORATION AFFILIATES. All rights reserved. | |
--- | |
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. | |
--- | |
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. SERVERSERVERopttritonserverbintritonserver SERVER_IPADDRTRITONSERVER_IPADDRlocalhost SERVER_LOGSERVER_LOG.server.log SERVER_TIMEOUTSERVER_TIMEOUT120 SERVER_LD_PRELOADSERVER_LD_PRELOAD Run inference server. | |
--- | |
Return once servers health endpoint shows ready or timeout expires. Sets SERVER_PID to pid of SERVER, or 0 if error including expired timeout function run_server SERVER_PID0 if z SERVER then echo SERVER must be defined return fi if ! f SERVER then echo SERVER does not exist return fi if z SERVER_LD_PRELOAD then echo Running SERVER SERVER_ARGS else echo Running LD_PRELOADSERVER_LD_PRELOAD SERVER SERVER_ARGS fi LD_PRELOADSERVER_LD_PRELOADLD_PRELOAD SERVER SERVER_ARGS SERVER_LOG 21 SERVER_PID! | |
--- | |
wait_for_server_ready SERVER_PID SERVER_TIMEOUT if WAIT_RET ! 0 then Get further debug information about server startup failure gdb_helper true Cleanup kill SERVER_PID devnull 21 true SERVER_PID0 fi Wait until server health endpoint shows ready. Sets WAIT_RET to 0 on success, 1 on failure function wait_for_server_ready local spid1 shift local wait_time_secs130 shift WAIT_RET0 local wait_secswait_time_secs until test wait_secs eq 0 do if ! | |
--- | |
kill 0 spid devnull 21 then echo Server not running. WAIT_RET1 return fi sleep 1 set e codecurl s w http_code SERVER_IPADDR8000v2healthready set e if code 200 then return fi wait_secs done echo Timeout wait_time_secs secs. Server not ready. | |
--- | |
WAIT_RET1 Check Python unittest results. function check_test_results local log_file1 local expected_num_tests2 if z expected_num_tests then echo expected number of tests must be defined return 1 fi num_failurescat log_file grep E .total.errors.failures. tail n 1 jq .failures num_testscat log_file grep E .total.errors.failures. | |
--- | |
tail n 1 jq .total num_errorscat log_file grep E .total.errors.failures. tail n 1 jq .errors Number regular expression re09 if ? ne 0 ! | |
--- | |
num_failures re ! num_tests re ! num_errors re then cat log_file echo e nn Test Failed unable to parse test resultsn log_file return 1 fi if num_errors ! | |
--- | |
0 num_failures ! 0 num_tests ne expected_num_tests then cat log_file echo e nn Test Failed Expected expected_num_tests tests, num_tests tests executed, num_errors tests had error, and num_failures tests failed. n log_file return 1 fi return 0 function collect_artifacts_from_subdir cp .log core .. | |
--- | |
true File ciL0_additional_outputs_vllmadditional_outputs_test.py Copyright 2024, NVIDIA CORPORATION AFFILIATES. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. | |
--- | |
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. | |
--- | |
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import json import numpy as np import pytest import tritonclient.grpc as grpcclient class TestAdditionalOutputs _grpc_url localhost8001 _model_name vllm_opt _sampling_parameters temperature 0, top_p 1 _prompt In this example, def _get_sampling_parametersself, logprobsNone sampling_parameters self._sampling_parameters.copy if logprobs is not None sampling_parameterslogprobs logprobs return sampling_parameters def _get_inputs self, prompt, streamTrue, sampling_parametersNone, return_finish_reasonNone, return_cumulative_logprobNone, return_logprobsNone, return_num_input_tokensNone, return_num_output_tokensNone, inputs inputs.appendgrpcclient.InferInputtext_input, 1, BYTES inputs1.set_data_from_numpy np.arrayprompt.encodeutf8, dtypenp.object_ inputs.appendgrpcclient.InferInputstream, 1, BOOL inputs1.set_data_from_numpynp.arraystream, dtypebool if sampling_parameters is not None inputs.appendgrpcclient.InferInputsampling_parameters, 1, BYTES inputs1.set_data_from_numpy np.array json.dumpssampling_parameters.encodeutf8, dtypenp.object_ if return_finish_reason is not None inputs.appendgrpcclient.InferInputreturn_finish_reason, 1, BOOL inputs1.set_data_from_numpynp.arrayreturn_finish_reason, dtypebool if return_cumulative_logprob is not None inputs.append grpcclient.InferInputreturn_cumulative_logprob, 1, BOOL inputs1.set_data_from_numpy np.arrayreturn_cumulative_logprob, dtypebool if return_logprobs is not None inputs.appendgrpcclient.InferInputreturn_logprobs, 1, BOOL inputs1.set_data_from_numpynp.arrayreturn_logprobs, dtypebool if return_num_input_tokens is not None inputs.appendgrpcclient.InferInputreturn_num_input_tokens, 1, BOOL inputs1.set_data_from_numpy np.arrayreturn_num_input_tokens, dtypebool if return_num_output_tokens is not None inputs.append grpcclient.InferInputreturn_num_output_tokens, 1, BOOL inputs1.set_data_from_numpy np.arrayreturn_num_output_tokens, dtypebool return inputs def _callbackself, result, error self._responses.appendresult result, error error def _llm_inferself, inputs, sampling_parameters self._responses with grpcclient.InferenceServerClientself._grpc_url as client client.start_streamself._callback client.async_stream_infer self._model_name, inputsinputs, parameterssampling_parameters client.stop_stream assert lenself._responses 0 def _assert_text_output_validself text_output for response in self._responses result, error responseresult, responseerror assert error is None text_output result.as_numpynametext_output0.decodeutf8 assert lentext_output 0, output is empty assert text_output.count 4, output is not a sentence def _assert_finish_reasonself, return_finish_reason for i in rangelenself._responses result, error self._responsesiresult, self._responsesierror assert error is None finish_reason_np result.as_numpynamefinish_reason if return_finish_reason is None or return_finish_reason False assert finish_reason_np is None continue finish_reason finish_reason_np0.decodeutf8 if i lenself._responses 1 assert finish_reason None else assert finish_reason length def 
_assert_cumulative_logprobself, return_cumulative_logprob prev_cumulative_logprob 0.0 for response in self._responses result, error responseresult, responseerror assert error is None cumulative_logprob_np result.as_numpynamecumulative_logprob if return_cumulative_logprob is None or return_cumulative_logprob False assert cumulative_logprob_np is None continue cumulative_logprob cumulative_logprob_np0.astypefloat assert cumulative_logprob ! prev_cumulative_logprob prev_cumulative_logprob cumulative_logprob def _assert_logprobs self, stream, sampling_parameters, return_logprobs, return_num_output_tokens for response in self._responses result, error responseresult, responseerror assert error is None logprobs_np result.as_numpynamelogprobs if return_logprobs is None or return_logprobs False assert logprobs_np is None continue logprobs json.loadslogprobs_np0.decodeutf8 if logprobs not in sampling_parameters assert logprobs is None continue assert isinstancelogprobs, list assert lenlogprobs 1 if return_num_output_tokens True num_output_tokens result.as_numpynamenum_output_tokens0.astype int assert lenlogprobs num_output_tokens text_output_logprobs for logprobs_d in logprobs assert isinstancelogprobs_d, dict assert lenlogprobs_d 1 assert lenlogprobs_d sampling_parameterslogprobs 1 rank_one_found False for token_id, logprob_d in logprobs_d.items assert isinstancetoken_id, str assert lenlogprob_d 3 assert isinstancelogprob_dlogprob, float assert isinstancelogprob_drank, int assert isinstancelogprob_ddecoded_token, str if logprob_drank 1 assert not rank_one_found rank_one_found True text_output_logprobs logprob_ddecoded_token assert rank_one_found text_output result.as_numpynametext_output0.decodeutf8 if not stream given exclude_input_in_output is not set, prepend_input is True if not streaming and False if streaming text_output_logprobs self._prompt text_output_logprobs assert text_output_logprobs text_output def _assert_num_input_tokensself, return_num_input_tokens for response in self._responses result, error responseresult, responseerror assert error is None num_input_tokens_np result.as_numpynamenum_input_tokens if return_num_input_tokens is None or return_num_input_tokens False assert num_input_tokens_np is None continue num_input_tokens num_input_tokens_np.astypeint assert num_input_tokens 0 assert num_input_tokens lenself._prompt def _assert_num_output_tokensself, return_num_output_tokens for response in self._responses result, error responseresult, responseerror assert error is None num_output_tokens_np result.as_numpynamenum_output_tokens if return_num_output_tokens is None or return_num_output_tokens False assert num_output_tokens_np is None continue num_output_tokens num_output_tokens_np0.astypeint assert num_output_tokens 0 pytest.mark.parametrizestream, True, False pytest.mark.parametrizereturn_finish_reason, None, True, False pytest.mark.parametrizereturn_cumulative_logprob, None, True, False pytest.mark.parametrizelogprobs, None, 0, 2 pytest.mark.parametrizereturn_logprobs, None, True, False pytest.mark.parametrizereturn_num_input_tokens, None, True, False pytest.mark.parametrizereturn_num_output_tokens, None, True, False def test_additional_outputs self, stream, return_finish_reason, return_cumulative_logprob, logprobs, return_logprobs, return_num_input_tokens, return_num_output_tokens, sampling_parameters self._get_sampling_parameterslogprobslogprobs inputs self._get_inputs self._prompt, streamstream, sampling_parameterssampling_parameters, return_finish_reasonreturn_finish_reason, 
return_cumulative_logprobreturn_cumulative_logprob, return_logprobsreturn_logprobs, return_num_input_tokensreturn_num_input_tokens, return_num_output_tokensreturn_num_output_tokens, self._llm_inferinputs, sampling_parameters self._assert_text_output_valid self._assert_finish_reasonreturn_finish_reason self._assert_cumulative_logprobreturn_cumulative_logprob self._assert_logprobs stream, sampling_parameters, return_logprobs, return_num_output_tokens self._assert_num_input_tokensreturn_num_input_tokens self._assert_num_output_tokensreturn_num_output_tokens File ciL0_additional_outputs_vllmtest.sh !binbash Copyright 2024, NVIDIA CORPORATION AFFILIATES. | |
--- | |
All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
--- | |
Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |
--- | |
export CUDA_VISIBLE_DEVICES0 source ..commonutil.sh pip3 install pytest8.1.1 pip3 install tritonclientgrpc Prepare Model rm rf models vllm_baseline_output.pkl mkdir p models SAMPLE_MODELS_REPO....samplesmodel_repository cp r SAMPLE_MODELS_REPOvllm_model modelsvllm_opt sed i sgpu_memory_utilization 0.5gpu_memory_utilization 0.3 modelsvllm_opt1model.json RET0 Test SERVER_LOGadditional_outputs_test.server.log SERVER_ARGSmodelrepositorymodels run_server if SERVER_PID 0 then echo e nn Failed to start SERVERn cat SERVER_LOG exit 1 fi set e python3 m pytest junitxmltest_additional_outputs.xml s v additional_outputs_test.py if ? ne 0 then echo e nn additional_outputs_test FAILED. n RET1 fi set e kill SERVER_PID wait SERVER_PID if RET eq 0 then echo e nn Test Passedn else echo e nn Test FAILEDn fi exit RET File ciL0_backend_vllmaccuracy_testaccuracy_test.py Copyright 20232024, NVIDIA CORPORATION AFFILIATES. | |
--- | |
All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
--- | |
Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |
--- | |
import argparse import asyncio import pickle import sys import unittest from functools import partial import tritonclient.grpc as grpcclient from tritonclient.utils import from vllm import SamplingParams from vllm.engine.arg_utils import AsyncEngineArgs from vllm.engine.async_llm_engine import AsyncLLMEngine from vllm.utils import random_uuid sys.path.append....common from test_util import TestResultCollector, UserData, callback, create_vllm_request VLLM_ENGINE_CONFIG model facebookopt125m, gpu_memory_utilization 0.3, PROMPTS The most dangerous animal is, The capital of France is, The future of AI is, SAMPLING_PARAMETERS temperature 0, top_p 1 async def generate_python_vllm_outputprompt, llm_engine request_id random_uuid sampling_params SamplingParamsSAMPLING_PARAMETERS python_vllm_output None last_output None async for vllm_output in llm_engine.generateprompt, sampling_params, request_id last_output vllm_output if last_output python_vllm_output prompt output.text.encodeutf8 for output in last_output.outputs return python_vllm_output def prepare_vllm_baseline_outputs Helper function that starts async vLLM engine and generates output for each prompt in PROMPTS. Saves resulted baselines in vllm_baseline_output.pkl for further use. llm_engine AsyncLLMEngine.from_engine_argsAsyncEngineArgsVLLM_ENGINE_CONFIG python_vllm_output for i in rangelenPROMPTS python_vllm_output.extend asyncio.rungenerate_python_vllm_outputPROMPTSi, llm_engine with openvllm_baseline_output.pkl, wb as f pickle.dumppython_vllm_output, f return class VLLMTritonAccuracyTestTestResultCollector def setUpself self.triton_client grpcclient.InferenceServerClienturllocalhost8001 self.vllm_model_name vllm_opt self.python_vllm_output with openvllm_baseline_output.pkl, rb as f self.python_vllm_output pickle.loadf self.assertNotEqual self.python_vllm_output, , Loaded baseline outputs list should not be empty, self.assertIsNotNone self.python_vllm_output, Loaded baseline outputs list should not be None self.assertEqual lenself.python_vllm_output, lenPROMPTS, Unexpected number of baseline outputs loaded, expected , but got .format lenPROMPTS, lenself.python_vllm_output , def test_vllm_modelself user_data UserData stream False triton_vllm_output self.triton_client.start_streamcallbackpartialcallback, user_data for i in rangelenPROMPTS request_data create_vllm_request PROMPTSi, i, stream, SAMPLING_PARAMETERS, self.vllm_model_name self.triton_client.async_stream_infer model_nameself.vllm_model_name, request_idrequest_datarequest_id, inputsrequest_datainputs, outputsrequest_dataoutputs, parametersSAMPLING_PARAMETERS, for i in rangelenPROMPTS result user_data._completed_requests.get self.assertIsNottyperesult, InferenceServerException, strresult output result.as_numpytext_output self.assertIsNotNoneoutput, text_output should not be None triton_vllm_output.extendoutput self.triton_client.stop_stream self.assertEqualself.python_vllm_output.sort, triton_vllm_output.sort def tearDownself self.triton_client.close if __name__ __main__ parser argparse.ArgumentParser parser.add_argument generatebaseline, actionstore_true, requiredFalse, defaultFalse, helpGenerates baseline output for accuracy tests, FLAGS parser.parse_args if FLAGS.generate_baseline prepare_vllm_baseline_outputs exit0 unittest.main File ciL0_backend_vllmaccuracy_testtest.sh !binbash Copyright 20232024, NVIDIA CORPORATION AFFILIATES. | |
--- | |
All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
--- | |
Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |
--- | |
source ....commonutil.sh TRITON_DIRTRITON_DIRopttritonserver SERVERTRITON_DIRbintritonserver BACKEND_DIRTRITON_DIRbackends SERVER_ARGSmodelrepositorypwdmodels backenddirectoryBACKEND_DIR logverbose1 SERVER_LOG.accuracy_test_server.log CLIENT_LOG.accuracy_test_client.log TEST_RESULT_FILEtest_results.txt CLIENT_PY.accuracy_test.py SAMPLE_MODELS_REPO......samplesmodel_repository VLLM_ENGINE_LOGvllm_engine.log EXPECTED_NUM_TESTS1 rm rf models mkdir p models cp r SAMPLE_MODELS_REPOvllm_model modelsvllm_opt sed i sgpu_memory_utilization 0.5gpu_memory_utilization 0.3 modelsvllm_opt1model.json f vllm_baseline_output.pkl rm vllm_baseline_output.pkl RET0 set e Need to generate baseline first, since running 2 vLLM engines causes memory issues httpsgithub.comvllmprojectvllmissues2248 python3 CLIENT_PY generatebaseline VLLM_ENGINE_LOG 21 BASELINE_PID! wait BASELINE_PID set e run_server if SERVER_PID 0 then cat SERVER_LOG echo e nn Failed to start SERVERn exit 1 fi set e python3 CLIENT_PY CLIENT_LOG 21 if ? ne 0 then cat CLIENT_LOG echo e nn Running CLIENT_PY FAILED. | |
--- | |
n RET1 else check_test_results TEST_RESULT_FILE EXPECTED_NUM_TESTS if ? ne 0 then cat CLIENT_LOG echo e nn Test Result Verification FAILED.n RET1 fi fi set e kill SERVER_PID wait SERVER_PID rm rf models if RET eq 1 then cat CLIENT_LOG cat SERVER_LOG echo e nn Accuracy test FAILED. n else echo e nn Accuracy test PASSED. | |
--- | |
n fi collect_artifacts_from_subdir exit RET File ciL0_backend_vllmenabled_streamenabled_stream_test.py Copyright 2023, NVIDIA CORPORATION AFFILIATES. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. | |
--- | |
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. | |
--- | |
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import sys import unittest import tritonclient.grpc.aio as grpcclient from tritonclient.utils import sys.path.append....common from test_util import AsyncTestResultCollector, create_vllm_request PROMPTS The most dangerous animal is SAMPLING_PARAMETERS temperature 0, top_p 1 class VLLMTritonStreamTestAsyncTestResultCollector async def _test_vllm_model self, promptsPROMPTS, sampling_parametersSAMPLING_PARAMETERS, streamTrue, exclude_input_in_outputNone, expected_outputNone, expect_errorFalse, async with grpcclient.InferenceServerClient urllocalhost8001 as triton_client model_name vllm_opt async def request_iterator for i, prompt in enumerateprompts yield create_vllm_request prompt, i, stream, sampling_parameters, model_name, exclude_input_in_outputexclude_input_in_output, response_iterator triton_client.stream_infer inputs_iteratorrequest_iterator final_response async for response in response_iterator result, error response if expect_error self.assertIsInstanceerror, InferenceServerException self.assertEqual error.message, Error generating stream When streaming, exclude_input_in_output False is not allowed., error, return self.assertIsNoneerror, error self.assertIsNotNoneresult, result output result.as_numpytext_output self.assertIsNotNoneoutput, text_output should not be None final_response.appendstroutput0, encodingutf8 if expected_output is not None self.assertEqual final_response, expected_output, Expected to receive the following response , but received ..format expected_output, final_response , async def test_vllm_model_enabled_streamself Verifying that request with multiple prompts runs successfully. prompts The most dangerous animal is, The future of AI is, await self._test_vllm_modelpromptsprompts async def test_vllm_model_enabled_stream_exclude_input_in_output_defaultself Verifying that streaming request returns only generated diffs, which is default behaviour for streamTrue. | |
--- | |
expected_output the, one, that, is, most, likely, to, be, killed, by, a, car, ., n, I, m, await self._test_vllm_modelexpected_outputexpected_output async def test_vllm_model_enabled_stream_exclude_input_in_output_falseself Verifying that streaming request returns only generated diffs even if exclude_input_in_output is set to False explicitly. expected_output Error generating stream When streaming, exclude_input_in_output False is not allowed. await self._test_vllm_model exclude_input_in_outputFalse, expected_outputexpected_output, expect_errorTrue, if __name__ __main__ unittest.main File ciL0_backend_vllmenabled_streamtest.sh !binbash Copyright 2023, NVIDIA CORPORATION AFFILIATES. | |
--- | |
source ....commonutil.sh TRITON_DIRTRITON_DIRopttritonserver SERVERTRITON_DIRbintritonserver BACKEND_DIRTRITON_DIRbackends SERVER_ARGSmodelrepositorypwdmodels backenddirectoryBACKEND_DIR logverbose1 SERVER_LOG.enabled_stream_server.log CLIENT_LOG.enabled_stream_client.log TEST_RESULT_FILEtest_results.txt CLIENT_PY.enabled_stream_test.py SAMPLE_MODELS_REPO......samplesmodel_repository EXPECTED_NUM_TESTS3 rm rf models mkdir p models cp r SAMPLE_MODELS_REPOvllm_model modelsvllm_opt RET0 run_server if SERVER_PID 0 then cat SERVER_LOG echo e nn Failed to start SERVERn exit 1 fi set e python3 CLIENT_PY v CLIENT_LOG 21 if ? ne 0 then cat CLIENT_LOG echo e nn Running CLIENT_PY FAILED. n RET1 else check_test_results TEST_RESULT_FILE EXPECTED_NUM_TESTS if ? | |
--- | |
ne 0 then cat CLIENT_LOG echo e nn Test Result Verification FAILED.n RET1 fi fi set e kill SERVER_PID wait SERVER_PID rm rf models if RET eq 1 then cat CLIENT_LOG cat SERVER_LOG echo e nn Enabled Stream test FAILED. n else echo e nn Enabled Stream test PASSED. n fi collect_artifacts_from_subdir exit RET File ciL0_backend_vllmmetrics_testtest.sh !binbash Copyright 2024, NVIDIA CORPORATION AFFILIATES. | |
--- | |
source ....commonutil.sh TRITON_DIRTRITON_DIRopttritonserver SERVERTRITON_DIRbintritonserver BACKEND_DIRTRITON_DIRbackends SERVER_ARGSmodelrepositorypwdmodels backenddirectoryBACKEND_DIR modelcontrolmodeexplicit loadmodelvllm_opt logverbose1 SERVER_LOG.vllm_metrics_server.log CLIENT_LOG.vllm_metrics_client.log TEST_RESULT_FILEtest_results.txt CLIENT_PY.vllm_metrics_test.py SAMPLE_MODELS_REPO......samplesmodel_repository EXPECTED_NUM_TESTS1 Helpers function copy_model_repository rm rf models mkdir p models cp r SAMPLE_MODELS_REPOvllm_model modelsvllm_opt vllm_opt model will be loaded on server start and stay loaded throughout unittesting. To ensure that vllms memory profiler will not error out on vllm_load_test load, we reduce gpu_memory_utilization for vllm_opt, so that at least 60 of GPU memory was available for other models. sed i sgpu_memory_utilization 0.5gpu_memory_utilization 0.4 modelsvllm_opt1model.json run_test local TEST_CASE1 run_server if SERVER_PID 0 then cat SERVER_LOG echo e nn Failed to start SERVERn exit 1 fi set e python3 CLIENT_PY TEST_CASE v CLIENT_LOG 21 if ? | |
--- | |
ne 0 then cat CLIENT_LOG echo e nn Running CLIENT_PY TEST_CASE FAILED. n RET1 else check_test_results TEST_RESULT_FILE EXPECTED_NUM_TESTS if ? ne 0 then cat CLIENT_LOG echo e nn Test Result Verification FAILED.n RET1 fi fi set e TODO Nongraceful shutdown when metrics are enabled. | |
--- | |
kill SERVER_PID wait SERVER_PID RET0 Test disabling vLLM metrics reporting without parameter REPORT_CUSTOM_METRICS in config.pbtxt copy_model_repository run_test VLLMTritonMetricsTest.test_vllm_metrics_disabled Test disabling vLLM metrics reporting with parameter REPORT_CUSTOM_METRICS set to false in config.pbtxt copy_model_repository echo e parameters key REPORT_CUSTOM_METRICS value string_value false modelsvllm_optconfig.pbtxt run_test VLLMTritonMetricsTest.test_vllm_metrics_disabled Test vLLM metrics reporting with parameter REPORT_CUSTOM_METRICS set to true in config.pbtxt copy_model_repository cp SAMPLE_MODELS_REPOvllm_modelconfig.pbtxt modelsvllm_opt echo e parameters key REPORT_CUSTOM_METRICS value string_value true modelsvllm_optconfig.pbtxt run_test VLLMTritonMetricsTest.test_vllm_metrics Test vLLM metrics custom sampling parameters Custom sampling parameters may result in different vLLM output depending on the platform. Therefore, these metrics are tests separately. copy_model_repository cp SAMPLE_MODELS_REPOvllm_modelconfig.pbtxt modelsvllm_opt echo e parameters key REPORT_CUSTOM_METRICS value string_value true modelsvllm_optconfig.pbtxt run_test VLLMTritonMetricsTest.test_custom_sampling_params Test enabling vLLM metrics reporting in config.pbtxt but disabling in model.json copy_model_repository jq . | |
--- | |
disable_log_stats true modelsvllm_opt1model.json temp.json mv temp.json modelsvllm_opt1model.json echo e parameters key REPORT_CUSTOM_METRICS value string_value true modelsvllm_optconfig.pbtxt run_test VLLMTritonMetricsTest.test_vllm_metrics_disabled Test enabling vLLM metrics reporting in config.pbtxt while disabling in server option copy_model_repository echo e parameters key REPORT_CUSTOM_METRICS value string_value true modelsvllm_optconfig.pbtxt SERVER_ARGSSERVER_ARGS allowmetricsfalse run_test VLLMTritonMetricsTest.test_vllm_metrics_refused rm rf .models temp.json if RET eq 1 then cat CLIENT_LOG cat SERVER_LOG echo e nn vLLM test FAILED. n else echo e nn vLLM test PASSED. n fi collect_artifacts_from_subdir exit RET File ciL0_backend_vllmmetrics_testvllm_metrics_test.py Copyright 2024, NVIDIA CORPORATION AFFILIATES. | |
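Each scenario above toggles metrics reporting by appending a REPORT_CUSTOM_METRICS parameter block to the model's config.pbtxt before restarting the server. A minimal sketch of that step in Python, assuming the repository layout used by the script (models/vllm_opt):

from pathlib import Path

# The parameter the vLLM backend reads to decide whether to report its custom
# metrics; the test script appends an equivalent block with `echo ... >> config.pbtxt`.
param_block = '''
parameters: {
  key: "REPORT_CUSTOM_METRICS"
  value: {
    string_value: "true"
  }
}
'''
config = Path("models/vllm_opt/config.pbtxt")
config.write_text(config.read_text() + param_block)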
--- | |
import os import re import sys import unittest from functools import partial import requests import tritonclient.grpc as grpcclient from tritonclient.utils import sys.path.append....common from test_util import TestResultCollector, UserData, callback, create_vllm_request class VLLMTritonMetricsTestTestResultCollector def setUpself self.triton_client grpcclient.InferenceServerClienturllocalhost8001 self.tritonserver_ipaddr os.environ.getTRITONSERVER_IPADDR, localhost self.vllm_model_name vllm_opt self.prompts The most dangerous animal is, The capital of France is, The future of AI is, self.sampling_parameters temperature 0, top_p 1 def parse_vllm_metricsself Store vllm metrics in a dictionary. r requests.getfhttpself.tritonserver_ipaddr8002metrics r.raise_for_status Regular expression to match the pattern pattern rvllm ?.? 09. | |
--- | |
vllm_dict Find all matches in the text matches re.findallpattern, r.text, re.MULTILINE for match in matches key, value match vllm_dictkey floatvalue if . in value else intvalue return vllm_dict def vllm_infer self, prompts, sampling_parameters, model_name, Helper function to send async stream infer requests to vLLM. user_data UserData number_of_vllm_reqs lenprompts self.triton_client.start_streamcallbackpartialcallback, user_data for i in rangenumber_of_vllm_reqs request_data create_vllm_request promptsi, i, False, sampling_parameters, model_name, True, self.triton_client.async_stream_infer model_namemodel_name, inputsrequest_datainputs, request_idrequest_datarequest_id, outputsrequest_dataoutputs, parameterssampling_parameters, for _ in rangenumber_of_vllm_reqs result user_data._completed_requests.get if typeresult is InferenceServerException printresult.message self.assertIsNottyperesult, InferenceServerException, strresult output result.as_numpytext_output self.assertIsNotNoneoutput, text_output should not be None self.triton_client.stop_stream def test_vllm_metricsself Test vLLM metrics self.vllm_infer promptsself.prompts, sampling_parametersself.sampling_parameters, model_nameself.vllm_model_name, metrics_dict self.parse_vllm_metrics total_prompts lenself.prompts vllmprompt_tokens_total 2, 133, 144, 2702, 3477, 16 2, 133, 812, 9, 1470, 16 2, 133, 499, 9, 4687, 16 self.assertEqualmetrics_dictvllmprompt_tokens_total, 18 vllmgeneration_tokens_total 5, 65, 14, 16, 144, 533, 7, 28, 848, 30, 10, 512, 4, 50118, 100, 437 5, 812, 9, 5, 1515, 3497, 4, 50118, 50118, 133, 812, 9, 1470, 16, 5, 812 11, 5, 1420, 9, 5, 82, 4, 50118, 50118, 133, 499, 9, 4687, 16, 11, 5 self.assertEqualmetrics_dictvllmgeneration_tokens_total, 48 vllmtime_to_first_token_seconds self.assertEqual metrics_dictvllmtime_to_first_token_seconds_count, total_prompts self.assertGreatermetrics_dictvllmtime_to_first_token_seconds_sum, 0 self.assertEqual metrics_dictvllmtime_to_first_token_seconds_bucket, total_prompts vllmtime_per_output_token_seconds self.assertEqualmetrics_dictvllmtime_per_output_token_seconds_count, 45 self.assertGreatermetrics_dictvllmtime_per_output_token_seconds_sum, 0 self.assertEqualmetrics_dictvllmtime_per_output_token_seconds_bucket, 45 vllme2e_request_latency_seconds self.assertEqual metrics_dictvllme2e_request_latency_seconds_count, total_prompts self.assertGreatermetrics_dictvllme2e_request_latency_seconds_sum, 0 self.assertEqual metrics_dictvllme2e_request_latency_seconds_bucket, total_prompts vllmrequest_prompt_tokens self.assertEqual metrics_dictvllmrequest_prompt_tokens_count, total_prompts self.assertEqualmetrics_dictvllmrequest_prompt_tokens_sum, 18 self.assertEqual metrics_dictvllmrequest_prompt_tokens_bucket, total_prompts vllmrequest_generation_tokens self.assertEqual metrics_dictvllmrequest_generation_tokens_count, total_prompts, self.assertEqualmetrics_dictvllmrequest_generation_tokens_sum, 48 self.assertEqual metrics_dictvllmrequest_generation_tokens_bucket, total_prompts, TODO Revisit this test due to the removal of best_of def test_custom_sampling_paramsself Adding sampling parameters for testing metrics. | |
--- | |
Definitions can be found here httpsdocs.vllm.aienlatestdevsampling_params.html n, best_of 2, 4 custom_sampling_parameters self.sampling_parameters.copy Changing temperature because best_of must be 1 when using greedy sampling, i.e. temperature 0. custom_sampling_parameters.update n strn, best_of strbest_of, temperature 1 Test vLLM metrics self.vllm_infer promptsself.prompts, sampling_parameterscustom_sampling_parameters, model_nameself.vllm_model_name, metrics_dict self.parse_vllm_metrics total_prompts lenself.prompts vllmrequest_params_best_of self.assertEqual metrics_dictvllmrequest_params_best_of_count, total_prompts self.assertEqual metrics_dictvllmrequest_params_best_of_sum, best_of total_prompts self.assertEqual metrics_dictvllmrequest_params_best_of_bucket, total_prompts vllmrequest_params_n self.assertEqualmetrics_dictvllmrequest_params_n_count, total_prompts self.assertEqualmetrics_dictvllmrequest_params_n_sum, n total_prompts self.assertEqualmetrics_dictvllmrequest_params_n_bucket, total_prompts def test_vllm_metrics_disabledself Test vLLM metrics self.vllm_infer promptsself.prompts, sampling_parametersself.sampling_parameters, model_nameself.vllm_model_name, metrics_dict self.parse_vllm_metrics No vLLM metric found self.assertEquallenmetrics_dict, 0 def test_vllm_metrics_refusedself Test vLLM metrics self.vllm_infer promptsself.prompts, sampling_parametersself.sampling_parameters, model_nameself.vllm_model_name, with self.assertRaisesrequests.exceptions.ConnectionError self.parse_vllm_metrics def tearDownself self.triton_client.close if __name__ __main__ unittest.main File ciL0_backend_vllmrequest_cancellationrequest_cancellation_test.py Copyright 2023, NVIDIA CORPORATION AFFILIATES. | |
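parse_vllm_metrics above scrapes Triton's Prometheus endpoint and keeps only the vllm-prefixed families before asserting on token counts and latency histograms. A standalone sketch of such a scrape, assuming Triton's default metrics port 8002:

import re

import requests

# Fetch the Prometheus text exposition from Triton's metrics endpoint and keep
# only the vLLM families, e.g. vllm:prompt_tokens_total or
# vllm:time_to_first_token_seconds_count.
resp = requests.get("http://localhost:8002/metrics")
resp.raise_for_status()

vllm_metrics = {}
for name, _labels, value in re.findall(
    r"^(vllm:[^ {]+)(\{[^}]*\})? ([0-9eE+.\-]+)$", resp.text, re.MULTILINE
):
    vllm_metrics[name] = float(value)

print(vllm_metrics.get("vllm:prompt_tokens_total"))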
--- | |
import sys import time import unittest from functools import partial import tritonclient.grpc as grpcclient from tritonclient.utils import sys.path.append....common from test_util import TestResultCollector, UserData, callback, create_vllm_request class VLLMRequestCancelTestTestResultCollector def test_request_cancellationself, send_parameters_as_tensorTrue with grpcclient.InferenceServerClienturllocalhost8001 as triton_client log_file_path .request_cancellation_server.log user_data UserData model_name vllm_opt stream False sampling_parameters temperature 0, top_p 1, max_tokens 1500, prompt fWrite an original and creative poem of at least 200 words. triton_client.start_streamcallbackpartialcallback, user_data request_data create_vllm_request prompt, 1, stream, sampling_parameters, model_name, send_parameters_as_tensor, triton_client.async_stream_infer model_namemodel_name, request_idrequest_datarequest_id, inputsrequest_datainputs, outputsrequest_dataoutputs, parameterssampling_parameters, time.sleep1 triton_client.stop_streamcancel_requestsTrue time.sleep1 self.assertFalseuser_data._completed_requests.empty result user_data._completed_requests.get self.assertIsInstanceresult, InferenceServerException self.assertEqualresult.status, StatusCode.CANCELLED self.assertTrueuser_data._completed_requests.empty with openlog_file_path, moder as log_file log_text log_file.read self.assertInvllm Cancelling the request, log_text self.assertInvllm Successfully cancelled the request, log_text if __name__ __main__ unittest.main File ciL0_backend_vllmrequest_cancellationtest.sh !binbash Copyright 2023, NVIDIA CORPORATION AFFILIATES. All rights reserved. | |
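The cancellation test above starts a long generation (max_tokens 1500) over the gRPC stream and then closes the stream with cancel_requests=True, after which the callback receives a CANCELLED error. A condensed sketch of that flow with the same input names; the helper below is an illustration, not the repository's create_vllm_request:

import json
import time
from functools import partial
from queue import Queue

import numpy as np
import tritonclient.grpc as grpcclient


def build_inputs(prompt, stream, sampling_parameters):
    # Hypothetical stand-in for create_vllm_request: packs the backend's
    # text_input / stream / sampling_parameters input tensors.
    text = grpcclient.InferInput("text_input", [1], "BYTES")
    text.set_data_from_numpy(np.array([prompt.encode("utf-8")], dtype=np.object_))
    stream_in = grpcclient.InferInput("stream", [1], "BOOL")
    stream_in.set_data_from_numpy(np.array([stream], dtype=bool))
    params = grpcclient.InferInput("sampling_parameters", [1], "BYTES")
    params.set_data_from_numpy(
        np.array([json.dumps(sampling_parameters).encode("utf-8")], dtype=np.object_)
    )
    return [text, stream_in, params]


responses = Queue()


def callback(queue, result, error):
    queue.put(error if error is not None else result)


with grpcclient.InferenceServerClient("localhost:8001") as client:
    client.start_stream(callback=partial(callback, responses))
    client.async_stream_infer(
        model_name="vllm_opt",
        request_id="1",
        inputs=build_inputs(
            "Write an original and creative poem of at least 200 words.",
            False,
            {"temperature": 0, "top_p": 1, "max_tokens": 1500},
        ),
    )
    time.sleep(1)
    # Closing the stream with cancel_requests=True cancels the in-flight
    # generation; the queued result is then an InferenceServerException with
    # StatusCode.CANCELLED, which is what the test asserts.
    client.stop_stream(cancel_requests=True)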
--- | |
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. source ....commonutil.sh TRITON_DIRTRITON_DIRopttritonserver SERVERTRITON_DIRbintritonserver BACKEND_DIRTRITON_DIRbackends SERVER_ARGSmodelrepositorypwdmodels backenddirectoryBACKEND_DIR logverbose1 SERVER_LOG.request_cancellation_server.log CLIENT_LOG.request_cancellation_client.log TEST_RESULT_FILEtest_results.txt CLIENT_PY.request_cancellation_test.py SAMPLE_MODELS_REPO......samplesmodel_repository EXPECTED_NUM_TESTS1 rm rf models mkdir p models cp r SAMPLE_MODELS_REPOvllm_model modelsvllm_opt RET0 run_server if SERVER_PID 0 then cat SERVER_LOG echo e nn Failed to start SERVERn exit 1 fi set e python3 CLIENT_PY v CLIENT_LOG 21 if ? | |
--- | |
ne 0 then cat CLIENT_LOG echo e nn Running CLIENT_PY FAILED. n RET1 else check_test_results TEST_RESULT_FILE EXPECTED_NUM_TESTS if ? ne 0 then cat CLIENT_LOG echo e nn Test Result Verification FAILED.n RET1 fi fi set e kill SERVER_PID wait SERVER_PID rm rf models if RET eq 1 then cat CLIENT_LOG cat SERVER_LOG echo e nn Request Cancellation test FAILED. | |
--- | |
n else echo e nn Request Cancellation test PASSED. n fi collect_artifacts_from_subdir exit RET File ciL0_backend_vllmvllm_backendensemble_config.pbtxt Copyright 2024, NVIDIA CORPORATION AFFILIATES. All rights reserved. | |
---
name: "ensemble_model"
platform: "ensemble"
max_batch_size: 1
input [
  {
    name: "text_input"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
output [
  {
    name: "text_output"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
ensemble_scheduling {
  step [
    {
      model_name: "vllm_opt"
      model_version: -1
      input_map {
        key: "text_input"
        value: "text_input"
      }
      output_map {
        key: "text_output"
        value: "text_output"
      }
    }
  ]
}
---
File: ci/L0_backend_vllm/vllm_backend/test.sh
#!/bin/bash
Copyright 2023-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
--- | |
source ....commonutil.sh TRITON_DIRTRITON_DIRopttritonserver SERVERTRITON_DIRbintritonserver BACKEND_DIRTRITON_DIRbackends SERVER_ARGSmodelrepositorypwdmodels backenddirectoryBACKEND_DIR modelcontrolmodeexplicit loadmodelvllm_opt logverbose1 SERVER_LOG.vllm_backend_server.log CLIENT_LOG.vllm_backend_client.log TEST_RESULT_FILEtest_results.txt CLIENT_PY.vllm_backend_test.py SAMPLE_MODELS_REPO......samplesmodel_repository EXPECTED_NUM_TESTS6 Helpers function assert_curl_success message1 if code ! 200 then cat .curl.out echo e nn message line BASH_LINENOn RET1 fi rm rf models mkdir p models cp r SAMPLE_MODELS_REPOvllm_model modelsvllm_opt vllm_opt model will be loaded on server start and stay loaded throughout unittesting. To test vllm model loadunload we use a dedicated vllm_load_test. | |
--- | |
To ensure that vllms memory profiler will not error out on vllm_load_test load, we reduce gpu_memory_utilization for vllm_opt, so that at least 60 of GPU memory was available for other models. sed i sgpu_memory_utilization 0.5gpu_memory_utilization 0.4 modelsvllm_opt1model.json cp r modelsvllm_opt modelsvllm_load_test mkdir p modelsadd_sub1 wget P modelsadd_sub1 httpsraw.githubusercontent.comtritoninferenceserverpython_backendmainexamplesadd_submodel.py wget P modelsadd_sub httpsraw.githubusercontent.comtritoninferenceserverpython_backendmainexamplesadd_subconfig.pbtxt Invalid model attribute cp r SAMPLE_MODELS_REPOvllm_model modelsvllm_invalid_1 sed i sdisable_log_requestsinvalid_attribute modelsvllm_invalid_11model.json Invalid model name cp r SAMPLE_MODELS_REPOvllm_model modelsvllm_invalid_2 sed i sfacebookopt125minvalid_model modelsvllm_invalid_21model.json Sanity check ensembles are enabled and can successfully be loaded mkdir p modelsensemble_model1 cp r ensemble_config.pbtxt modelsensemble_modelconfig.pbtxt RET0 run_server if SERVER_PID 0 then cat SERVER_LOG echo e nn Failed to start SERVERn exit 1 fi set e python3 CLIENT_PY v CLIENT_LOG 21 if ? ne 0 then cat CLIENT_LOG echo e nn Running CLIENT_PY FAILED. | |
--- | |
n RET1 else check_test_results TEST_RESULT_FILE EXPECTED_NUM_TESTS if ? ne 0 then cat CLIENT_LOG echo e nn Test Result Verification FAILED.n RET1 fi fi set e kill SERVER_PID wait SERVER_PID Test Python backend cmdline parameters are propagated to vllm backend SERVER_ARGSmodelrepositorypwdmodels backenddirectoryBACKEND_DIR backendconfigpython,defaultmaxbatchsize8 SERVER_LOG.vllm_test_cmdline_server.log rm rf .modelsvllm_invalid_1 .modelsvllm_invalid_2 .modelsvllm_load_test run_server if SERVER_PID 0 then cat SERVER_LOG echo e nn Failed to start SERVERn exit 1 fi kill SERVER_PID wait SERVER_PID rm rf .models COUNTgrep c defaultmaxbatchsize8 SERVER_LOG if COUNT ne 2 then echo Cmdline parameters verification Failed fi Test loading multiple vllm models SERVER_ARGSmodelrepositorypwdmodels backenddirectoryBACKEND_DIR modelcontrolmodeexplicit loadmodelvllm_one SERVER_LOG.vllm_test_multi_model.log Create two models, one is just a copy of the other, and make sure gpu utilization is low enough for multiple models to avoid OOM. vLLM changed behavior of their GPU profiler from total to free memory, so to load two small models, we need to start triton server in explicit mode. | |
--- | |
MODEL1vllm_one MODEL2vllm_two mkdir p models cp r SAMPLE_MODELS_REPOvllm_model modelsMODEL1 cp r modelsMODEL1 modelsMODEL2 sed i sgpu_memory_utilization 0.5gpu_memory_utilization 0.4 modelsMODEL11model.json sed i sgpu_memory_utilization 0.5gpu_memory_utilization 0.4 modelsMODEL21model.json run_server if SERVER_PID 0 then cat SERVER_LOG echo e nn Failed to start SERVERn exit 1 fi Explicitly load model rm f .curl.out set e codecurl s w http_code o .curl.out X POST localhost8000v2repositorymodelsvllm_twoload set e assert_curl_success Failed to load vllm_two model kill SERVER_PID wait SERVER_PID rm rf .models if RET eq 1 then cat CLIENT_LOG cat SERVER_LOG echo e nn vLLM test FAILED. n else echo e nn vLLM test PASSED. n fi collect_artifacts_from_subdir exit RET File ciL0_backend_vllmvllm_backendvllm_backend_test.py Copyright 20232024, NVIDIA CORPORATION AFFILIATES. | |
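The curl call above drives Triton's model repository HTTP API to load vllm_two while the server runs with --model-control-mode=explicit. The same request in Python, assuming the default HTTP port 8000:

import requests

# Ask a server started with --model-control-mode=explicit to load the model;
# the test script asserts the HTTP status code is 200.
resp = requests.post("http://localhost:8000/v2/repository/models/vllm_two/load")
resp.raise_for_status()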
--- | |
import sys import unittest from functools import partial import numpy as np import tritonclient.grpc as grpcclient from tritonclient.utils import sys.path.append....common from test_util import TestResultCollector, UserData, callback, create_vllm_request PROMPTS The most dangerous animal is, The capital of France is, The future of AI is, SAMPLING_PARAMETERS temperature 0, top_p 1 class VLLMTritonBackendTestTestResultCollector def setUpself self.triton_client grpcclient.InferenceServerClienturllocalhost8001 self.vllm_model_name vllm_opt self.python_model_name add_sub self.ensemble_model_name ensemble_model self.vllm_load_test vllm_load_test def test_vllm_triton_backendself Load both vllm and add_sub models self.triton_client.load_modelself.vllm_load_test self.assertTrueself.triton_client.is_model_readyself.vllm_load_test self.triton_client.load_modelself.python_model_name self.assertTrueself.triton_client.is_model_readyself.python_model_name Test to ensure that ensemble models are supported in vllm container. If ensemble support not present, triton will error out at model loading stage. Ensemble Model is a pipeline consisting of 1 model vllm_opt self.triton_client.load_modelself.ensemble_model_name self.assertTrueself.triton_client.is_model_readyself.ensemble_model_name self.triton_client.unload_modelself.ensemble_model_name Unload vllm model and test add_sub model self.triton_client.unload_modelself.vllm_load_test self.assertFalseself.triton_client.is_model_readyself.vllm_load_test self._test_python_model Load vllm model and unload add_sub model self.triton_client.load_modelself.vllm_load_test self.assertTrueself.triton_client.is_model_readyself.vllm_load_test self.triton_client.unload_modelself.python_model_name self.assertFalseself.triton_client.is_model_readyself.python_model_name Test vllm model and unload vllm model self._test_vllm_model promptsPROMPTS, sampling_parametersSAMPLING_PARAMETERS, streamFalse, send_parameters_as_tensorTrue, model_nameself.vllm_load_test, self._test_vllm_model promptsPROMPTS, sampling_parametersSAMPLING_PARAMETERS, streamFalse, send_parameters_as_tensorFalse, model_nameself.vllm_load_test, self.triton_client.unload_modelself.vllm_load_test self.assertFalseself.triton_client.is_model_readyself.vllm_load_test def test_model_with_invalid_attributesself model_name vllm_invalid_1 with self.assertRaisesInferenceServerException self.triton_client.load_modelmodel_name def test_vllm_invalid_model_nameself model_name vllm_invalid_2 with self.assertRaisesInferenceServerException self.triton_client.load_modelmodel_name def test_exclude_input_in_output_defaultself Verifying default behavior for exclude_input_in_output in nonstreaming mode. | |
--- | |
Expected result prompt is returned with diffs. prompts The capital of France is, expected_output bThe capital of France is the capital of the French Republic.nnThe capital of France is the capital sampling_parameters temperature 0, top_p 1 self._test_vllm_model prompts, sampling_parameters, streamFalse, send_parameters_as_tensorTrue, expected_outputexpected_output, def test_exclude_input_in_output_falseself Verifying behavior for exclude_input_in_output False in nonstreaming mode. Expected result prompt is returned with diffs. | |
--- | |
Test vllm model and unload vllm model prompts The capital of France is, expected_output bThe capital of France is the capital of the French Republic.nnThe capital of France is the capital sampling_parameters temperature 0, top_p 1 self._test_vllm_model prompts, sampling_parameters, streamFalse, send_parameters_as_tensorTrue, exclude_input_in_outputFalse, expected_outputexpected_output, def test_exclude_input_in_output_trueself Verifying behavior for exclude_input_in_output True in nonstreaming mode. Expected result only diffs are returned. Test vllm model and unload vllm model prompts The capital of France is, expected_output b the capital of the French Republic.nnThe capital of France is the capital sampling_parameters temperature 0, top_p 1 self._test_vllm_model prompts, sampling_parameters, streamFalse, send_parameters_as_tensorTrue, exclude_input_in_outputTrue, expected_outputexpected_output, def _test_vllm_model self, prompts, sampling_parameters, stream, send_parameters_as_tensor, exclude_input_in_outputNone, expected_outputNone, model_namevllm_opt, user_data UserData number_of_vllm_reqs lenprompts self.triton_client.start_streamcallbackpartialcallback, user_data for i in rangenumber_of_vllm_reqs request_data create_vllm_request promptsi, i, stream, sampling_parameters, model_name, send_parameters_as_tensor, exclude_input_in_outputexclude_input_in_output, self.triton_client.async_stream_infer model_namemodel_name, request_idrequest_datarequest_id, inputsrequest_datainputs, outputsrequest_dataoutputs, parameterssampling_parameters, for i in rangenumber_of_vllm_reqs result user_data._completed_requests.get if typeresult is InferenceServerException printresult.message self.assertIsNottyperesult, InferenceServerException, strresult output result.as_numpytext_output self.assertIsNotNoneoutput, text_output should not be None if expected_output is not None self.assertEqual output, expected_outputi, Actual and expected outputs do not match.n Expected n Actual.format output, expected_outputi , self.triton_client.stop_stream def _test_python_modelself shape 4 input0_data np.random.randshape.astypenp.float32 input1_data np.random.randshape.astypenp.float32 inputs grpcclient.InferInput INPUT0, input0_data.shape, np_to_triton_dtypeinput0_data.dtype , grpcclient.InferInput INPUT1, input1_data.shape, np_to_triton_dtypeinput1_data.dtype , inputs0.set_data_from_numpyinput0_data inputs1.set_data_from_numpyinput1_data outputs grpcclient.InferRequestedOutputOUTPUT0, grpcclient.InferRequestedOutputOUTPUT1, response self.triton_client.infer self.python_model_name, inputs, request_id10, outputsoutputs self.assertTrue np.allcloseinput0_data input1_data, response.as_numpyOUTPUT0 self.assertTrue np.allcloseinput0_data input1_data, response.as_numpyOUTPUT1 def tearDownself self.triton_client.close if __name__ __main__ unittest.main File ciL0_backend_vllmtest.sh !binbash Copyright 2024, NVIDIA CORPORATION AFFILIATES. | |
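The three exclude_input_in_output tests above differ only in one optional flag: when it is unset or False in non-streaming mode the prompt is echoed back together with the completion, and when it is True only the completion is returned. A sketch of how a client could attach that flag, assuming it is the optional BOOL input named exclude_input_in_output that these tests pass through create_vllm_request:

import numpy as np
import tritonclient.grpc as grpcclient

# Optional BOOL input controlling whether the prompt is echoed back in
# text_output; the tests above exercise the unset, False and True cases.
exclude_input = grpcclient.InferInput("exclude_input_in_output", [1], "BOOL")
exclude_input.set_data_from_numpy(np.array([True], dtype=bool))
# Appended to the same inputs list as text_input, stream and
# sampling_parameters before async_stream_infer() is called.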
--- | |
RET=0
SUBTESTS="accuracy_test request_cancellation enabled_stream vllm_backend metrics_test"

python3 -m pip install tritonclient[grpc]

for TEST in ${SUBTESTS}; do
    (cd ${TEST} && bash -ex test.sh && cd ..)
    if [ $? -ne 0 ]; then
        echo "Subtest ${TEST} FAILED"
        RET=1
    fi
done

if [ $RET -eq 0 ]; then
    echo -e "\n***\n*** vLLM Backend Test Passed\n***"
else
    echo -e "\n***\n*** vLLM Backend Test FAILED\n***"
fi
exit $RET
---
File: ci/L0_check_health_vllm/check_health_test.py
Copyright 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
--- | |
import json import numpy as np import tritonclient.grpc as grpcclient class TestCheckHealth _grpc_url localhost8001 _model_name vllm_opt _sampling_parameters temperature 0, top_p 1 _prompt In this example, def _get_inputsself, prompt, streamTrue, sampling_parametersNone inputs inputs.appendgrpcclient.InferInputtext_input, 1, BYTES inputs1.set_data_from_numpy np.arrayprompt.encodeutf8, dtypenp.object_ inputs.appendgrpcclient.InferInputstream, 1, BOOL inputs1.set_data_from_numpynp.arraystream, dtypebool if sampling_parameters is not None inputs.appendgrpcclient.InferInputsampling_parameters, 1, BYTES inputs1.set_data_from_numpy np.array json.dumpssampling_parameters.encodeutf8, dtypenp.object_ return inputs def _callbackself, result, error self._responses.appendresult result, error error def _llm_inferself inputs self._get_inputs self._prompt, streamTrue, sampling_parametersself._sampling_parameters self._responses with grpcclient.InferenceServerClientself._grpc_url as client client.start_streamself._callback client.async_stream_infer self._model_name, inputsinputs, parametersself._sampling_parameters client.stop_stream def _assert_text_output_validself text_output for response in self._responses result, error responseresult, responseerror assert error is None text_output result.as_numpynametext_output0.decodeutf8 assert lentext_output 0, output is empty assert text_output.count 4, output is not a sentence def _assert_infer_exceptionself, expected_exception_message assert lenself._responses 1 for response in self._responses result, error responseresult, responseerror assert result is None assert strerror expected_exception_message def _assert_model_readyself, expected_readiness with grpcclient.InferenceServerClientself._grpc_url as client is_model_ready API assert client.is_model_readyself._model_name expected_readiness get_model_repository_index API model_state None for model_index in client.get_model_repository_index.models if model_index.name self._model_name assert model_state is None, duplicate model index found model_state model_index.state READY assert model_state expected_readiness def test_vllm_is_healthyself num_repeats 3 for i in rangenum_repeats self._assert_model_readyTrue self._llm_infer self._assert_text_output_valid self._assert_model_readyTrue def test_vllm_not_healthyself self._assert_model_readyTrue The 1st infer should complete successfully self._llm_infer self._assert_text_output_valid self._assert_model_readyTrue The 2nd infer should begin with health check failed self._llm_infer self._assert_infer_exception Model is unavailable due to unhealthy vLLM engine self._assert_model_readyFalse The 3rd infer should have model not found self._llm_infer self._assert_infer_exception Request for unknown model vllm_opt has no available versions self._assert_model_readyFalse File ciL0_check_health_vllmtest.sh !binbash Copyright 2024, NVIDIA CORPORATION AFFILIATES. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. | |
--- | |
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. export CUDA_VISIBLE_DEVICES0 source ..commonutil.sh pip3 install pytest8.1.1 pip3 install tritonclientgrpc RET0 function setup_model_repository local sample_model_repo_path1....samplesmodel_repository rm rf models vllm_baseline_output.pkl mkdir p models cp r sample_model_repo_pathvllm_model modelsvllm_opt function enable_health_check local enable_vllm_health_check1 echo e parameters modelsvllm_optconfig.pbtxt echo e key ENABLE_VLLM_HEALTH_CHECK modelsvllm_optconfig.pbtxt echo e value string_value enable_vllm_health_check modelsvllm_optconfig.pbtxt echo e modelsvllm_optconfig.pbtxt VLLM_INSTALL_PATHusrlocallibpython3.12distpackagesvllm function mock_vllm_async_llm_engine backup original file mv VLLM_INSTALL_PATHenginemultiprocessingclient.py VLLM_INSTALL_PATHenginemultiprocessingclient.py.backup cp VLLM_INSTALL_PATHenginemultiprocessingclient.py.backup VLLM_INSTALL_PATHenginemultiprocessingclient.py overwrite the original check_health method echo e VLLM_INSTALL_PATHenginemultiprocessingclient.py echo e async def check_healthself, check_count0 VLLM_INSTALL_PATHenginemultiprocessingclient.py echo e check_count0 1 VLLM_INSTALL_PATHenginemultiprocessingclient.py echo e if check_count0 1 VLLM_INSTALL_PATHenginemultiprocessingclient.py echo e raise RuntimeErrorSimulated vLLM check_health failure VLLM_INSTALL_PATHenginemultiprocessingclient.py function unmock_vllm_async_llm_engine restore from backup rm f VLLM_INSTALL_PATHenginemultiprocessingclient.py mv VLLM_INSTALL_PATHenginemultiprocessingclient.py.backup VLLM_INSTALL_PATHenginemultiprocessingclient.py function test_check_health local test_name1 local unit_test_name2 SERVER_LOGtest_name.server.log SERVER_ARGSmodelrepositorymodels modelcontrolmodeexplicit loadmodel run_server if SERVER_PID 0 then echo e nn Failed to start SERVERn cat SERVER_LOG exit 1 fi set e python3 m pytest junitxmltest_name.report.xml s v check_health_test.pyTestCheckHealthunit_test_name test_name.log if ? ne 0 then echo e nn test_name FAILED. | |
--- | |
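The mock_vllm_async_llm_engine helper above appends an override of check_health to vLLM's engine/multiprocessing/client.py so that the first health probe succeeds and every later one fails. Rendered as Python, the appended override looks roughly like this:

# Roughly what the mock_vllm_async_llm_engine helper appends: the mutable
# default argument keeps a call counter, the first health check passes, every
# subsequent one raises, so Triton marks vllm_opt unhealthy and unloads it.
async def check_health(self, check_count=[0]) -> None:
    check_count[0] += 1
    if check_count[0] > 1:
        raise RuntimeError("Simulated vLLM check_health failure")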
n RET1 fi set e kill SERVER_PID wait SERVER_PID Test health check unspecified setup_model_repository test_check_health health_check_unspecified test_vllm_is_healthy Test health check disabled setup_model_repository enable_health_check false test_check_health health_check_disabled test_vllm_is_healthy Test health check enabled setup_model_repository enable_health_check true test_check_health health_check_enabled test_vllm_is_healthy Mock check_health from vLLM mock_vllm_async_llm_engine Test health check unspecified with mocked vLLM check_health failure setup_model_repository test_check_health health_check_unspecified_mocked_failure test_vllm_is_healthy Test health check disabled with mocked vLLM check_health failure setup_model_repository enable_health_check false test_check_health health_check_disabled_mocked_failure test_vllm_is_healthy Test health check enabled with mocked vLLM check_health failure setup_model_repository enable_health_check true test_check_health health_check_enabled_mocked_failure test_vllm_not_healthy Unmock check_health unmock_vllm_async_llm_engine if RET eq 0 then echo e nn Test Passedn else echo e nn Test FAILEDn fi exit RET File ciL0_multi_gpu_vllmmulti_loradownload.py Copyright 2024, NVIDIA CORPORATION AFFILIATES. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. | |
---
from huggingface_hub import snapshot_download

if __name__ == "__main__":
    # download lora weight alpaca
    snapshot_download(
        repo_id="swathijn/GemmaDoll-2b-dolly-LORA-Tune",
        local_dir="./weights/loras/GemmaDoll",
        max_workers=8,
    )
    # download lora weight GemmaSheep
    snapshot_download(
        repo_id="eduardo-alvarez/GemmaSheep-2B-LORA-TUNED",
        local_dir="./weights/loras/GemmaSheep",
        max_workers=8,
    )
    # download backbone weight google/gemma-2b
    snapshot_download(
        repo_id="unsloth/gemma-2b",
        local_dir="./weights/backbone/gemma-2b",
        max_workers=8,
    )
---
File: ci/L0_multi_gpu_vllm/multi_lora/multi_lora_test.py
Copyright 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
--- | |
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import os import sys import unittest from functools import partial from typing import List import tritonclient.grpc as grpcclient from tritonclient.utils import sys.path.append....common from test_util import AsyncTestResultCollector, UserData, callback, create_vllm_request PROMPTS Instruct What do you think of Computer Science?nOutput SAMPLING_PARAMETERS temperature 0, top_p 1 server_enable_lora True class VLLMTritonLoraTestAsyncTestResultCollector def setUpself self.triton_client grpcclient.InferenceServerClienturllocalhost8001 self.vllm_model_name vllm_llama_multi_lora def _test_vllm_model self, prompts Liststr, sampling_parameters, lora_name Liststr, server_enable_loraTrue, streamFalse, exclude_input_in_outputNone, expected_outputNone, assert lenprompts len lora_name , The number of prompts and lora names should be the same user_data UserData number_of_vllm_reqs lenprompts self.triton_client.start_streamcallbackpartialcallback, user_data for i in rangenumber_of_vllm_reqs lora lora_namei if lora_name else None sam_para_copy sampling_parameters.copy if lora is not None sam_para_copylora_name lora request_data create_vllm_request promptsi, i, stream, sam_para_copy, self.vllm_model_name, exclude_input_in_outputexclude_input_in_output, self.triton_client.async_stream_infer model_nameself.vllm_model_name, request_idrequest_datarequest_id, inputsrequest_datainputs, outputsrequest_dataoutputs, parameterssampling_parameters, for i in rangenumber_of_vllm_reqs result user_data._completed_requests.get if typeresult is InferenceServerException printresult.message if server_enable_lora self.assertEqual strresult.message, fLoRA lora_namei is not supported, we currently support doll, sheep, InferenceServerException, else self.assertEqual strresult.message, LoRA feature is not enabled., InferenceServerException, self.triton_client.stop_stream return output result.as_numpytext_output self.assertIsNotNoneoutput, text_output should not be None if expected_output is not None self.assertEqual output, expected_outputi, Actual and expected outputs do not match.n Expected n Actual.format output, expected_outputi , self.triton_client.stop_stream def test_multi_lora_requestsself sampling_parameters temperature 0, top_p 1 make two requests separately to avoid the different arrival of response answers prompt_1 Instruct What do you think of Computer Science?nOutput lora_1 doll expected_output b I think it is a very interesting subject.nnInstruct What do you self._test_vllm_model prompt_1, sampling_parameters, lora_namelora_1, server_enable_loraserver_enable_lora, streamFalse, exclude_input_in_outputTrue, expected_outputexpected_output, prompt_2 Instruct Tell me more about soccernOutput lora_2 sheep expected_output b I love soccer. | |
--- | |
I play soccer every day.nInstruct Tell me self._test_vllm_model prompt_2, sampling_parameters, lora_namelora_2, server_enable_loraserver_enable_lora, streamFalse, exclude_input_in_outputTrue, expected_outputexpected_output, def test_none_exist_loraself prompts Instruct What is the capital city of France?nOutput, loras bactrian sampling_parameters temperature 0, top_p 1 self._test_vllm_model prompts, sampling_parameters, lora_nameloras, server_enable_loraserver_enable_lora, streamFalse, exclude_input_in_outputTrue, expected_outputNone, this request will lead to lora not supported error, so there is no expected output def tearDownself self.triton_client.close if __name__ __main__ server_enable_lora os.environ.getSERVER_ENABLE_LORA, false.lower true unittest.main File ciL0_multi_gpu_vllmmulti_loratest.sh !binbash Copyright 2024, NVIDIA CORPORATION AFFILIATES. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. | |
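The multi-LoRA tests above pick an adapter per request by adding lora_name to the sampling parameters; requests naming an unknown adapter are rejected with the "is not supported" error asserted earlier, and a server without enable_lora replies "LoRA feature is not enabled.". A minimal sketch of the request-side parameters, assuming the vllm_llama_multi_lora model from these tests:

import json

# Sampling parameters for a request routed through the "doll" adapter.
sampling_parameters = {
    "temperature": "0",
    "top_p": "1",
    "lora_name": "doll",
}
payload = json.dumps(sampling_parameters).encode("utf-8")
# `payload` is what ends up in the BYTES "sampling_parameters" input tensor of
# the vllm_llama_multi_lora model.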
--- | |
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. | |
--- | |
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. source ....commonutil.sh TRITON_DIRTRITON_DIRopttritonserver SERVERTRITON_DIRbintritonserver BACKEND_DIRTRITON_DIRbackends SERVER_ARGSmodelrepositorypwdmodels backenddirectoryBACKEND_DIR logverbose1 SERVER_LOG.multi_lora_server.log CLIENT_LOG.multi_lora_client.log TEST_RESULT_FILEtest_results.txt CLIENT_PY.multi_lora_test.py DOWNLOAD_PY.download.py SAMPLE_MODELS_REPO......samplesmodel_repository EXPECTED_NUM_TESTS2 first we download weights pip install U huggingface_hub rm rf weights mkdir p weightslorasGemmaDoll mkdir p weightslorasGemmaSheep mkdir p weightsbackbonegemma2b python3 DOWNLOAD_PY v CLIENT_LOG 21 rm rf models mkdir p models cp r SAMPLE_MODELS_REPOvllm_model modelsvllm_llama_multi_lora export SERVER_ENABLE_LORAtrue Check boolean flag value for enable_lora model_jsoncat EOF model.weightsbackbonegemma2b, disable_log_requests true, gpu_memory_utilization 0.7, tensor_parallel_size 2, block_size 16, enforce_eager true, enable_lora true, max_lora_rank 32, lora_extra_vocab_size 256, distributed_executor_backendray EOF echo model_json modelsvllm_llama_multi_lora1model.json multi_lora_jsoncat EOF doll .weightslorasGemmaDoll, sheep .weightslorasGemmaSheep EOF echo multi_lora_json modelsvllm_llama_multi_lora1multi_lora.json RET0 If it is the first time launching triton server with gemma2b and multilora feature, it may take more than 1 minutes. Please wait. | |
--- | |
SERVER_TIMEOUT60000 run_server if SERVER_PID 0 then cat SERVER_LOG echo e nn Failed to start SERVERn exit 1 fi set e python3 CLIENT_PY v CLIENT_LOG 21 if ? ne 0 then cat CLIENT_LOG echo e nn Running CLIENT_PY FAILED. n RET1 else check_test_results TEST_RESULT_FILE EXPECTED_NUM_TESTS if ? | |
--- | |
ne 0 then cat CLIENT_LOG echo e nn Test Result Verification FAILED.n RET1 fi fi set e kill SERVER_PID wait SERVER_PID Check string flag value for enable_lora model_jsoncat EOF model.weightsbackbonegemma2b, disable_log_requests true, gpu_memory_utilization 0.7, tensor_parallel_size 2, block_size 16, enforce_eager true, enable_lora true, max_lora_rank 32, lora_extra_vocab_size 256, distributed_executor_backendray EOF echo model_json modelsvllm_llama_multi_lora1model.json run_server if SERVER_PID 0 then cat SERVER_LOG echo e nn Failed to start SERVERn exit 1 fi set e python3 CLIENT_PY v CLIENT_LOG 21 if ? ne 0 then cat CLIENT_LOG echo e nn Running CLIENT_PY FAILED. n RET1 else check_test_results TEST_RESULT_FILE EXPECTED_NUM_TESTS if ? | |
--- | |
ne 0 then cat CLIENT_LOG echo e nn Test Result Verification FAILED.n RET1 fi fi set e kill SERVER_PID wait SERVER_PID disable lora export SERVER_ENABLE_LORAfalse check bool flag value for enable_lora model_jsoncat EOF model.weightsbackbonegemma2b, disable_log_requests true, gpu_memory_utilization 0.8, tensor_parallel_size 2, block_size 16, enforce_eager true, enable_lora false, lora_extra_vocab_size 256, distributed_executor_backendray EOF echo model_json modelsvllm_llama_multi_lora1model.json run_server if SERVER_PID 0 then cat SERVER_LOG echo e nn Failed to start SERVERn exit 1 fi set e python3 CLIENT_PY v CLIENT_LOG 21 if ? ne 0 then cat CLIENT_LOG echo e nn Running CLIENT_PY FAILED. n RET1 else check_test_results TEST_RESULT_FILE EXPECTED_NUM_TESTS if ? | |
--- | |
ne 0 then cat CLIENT_LOG echo e nn Test Result Verification FAILED.n RET1 fi fi set e kill SERVER_PID wait SERVER_PID disable lora export SERVER_ENABLE_LORAfalse check string flag value for enable_lora model_jsoncat EOF model.weightsbackbonegemma2b, disable_log_requests true, gpu_memory_utilization 0.8, tensor_parallel_size 2, block_size 16, enforce_eager true, enable_lora false, lora_extra_vocab_size 256, distributed_executor_backendray EOF echo model_json modelsvllm_llama_multi_lora1model.json run_server if SERVER_PID 0 then cat SERVER_LOG echo e nn Failed to start SERVERn exit 1 fi set e python3 CLIENT_PY v CLIENT_LOG 21 if ? ne 0 then cat CLIENT_LOG echo e nn Running CLIENT_PY FAILED. n RET1 else check_test_results TEST_RESULT_FILE EXPECTED_NUM_TESTS if ? | |
--- | |
ne 0 then cat CLIENT_LOG echo e nn Test Result Verification FAILED.n RET1 fi fi set e kill SERVER_PID wait SERVER_PID rm rf models rm rf weights if RET eq 1 then cat CLIENT_LOG cat SERVER_LOG echo e nn Multi LoRA test FAILED. n else echo e nn Multi LoRA test PASSED. n fi collect_artifacts_from_subdir exit RET File ciL0_multi_gpu_vllmvllm_backendtest.sh !binbash Copyright 20232024, NVIDIA CORPORATION AFFILIATES. | |
--- | |
All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
--- | |
Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |
--- | |
source ....commonutil.sh TRITON_DIRTRITON_DIRopttritonserver SERVERTRITON_DIRbintritonserver BACKEND_DIRTRITON_DIRbackends SERVER_ARGSmodelrepositorypwdmodels backenddirectoryBACKEND_DIR modelcontrolmodeexplicit logverbose1 TEST_RESULT_FILEtest_results.txt CLIENT_PY.vllm_multi_gpu_test.py SAMPLE_MODELS_REPO......samplesmodel_repository EXPECTED_NUM_TESTS1 Helpers function validate_file_contains local KEY1 local FILE2 if z KEY z FILE then echo Error KEY and FILE must be provided. return 1 fi if ! f FILE then echo Error File FILE does not exist. | |
--- | |
return 1 fi countgrep o w KEY FILE wc l if count ne 1 then echo Error KEY KEY found count times in FILE. Expected exactly once. return 1 fi function run_multi_gpu_test export KIND1 export TENSOR_PARALLELISM2 export INSTANCE_COUNT3 export DISTRIBUTED_EXECUTOR_BACKEND4 Setup a clean model repository export TEST_MODELvllm_opt_KIND_tpTENSOR_PARALLELISM_countINSTANCE_COUNT local TEST_MODEL_TRITON_CONFIGmodelsTEST_MODELconfig.pbtxt local TEST_MODEL_VLLM_CONFIGmodelsTEST_MODEL1model.json rm rf models mkdir p models cp r SAMPLE_MODELS_REPOvllm_model modelsTEST_MODEL sed i sKIND_MODELKIND TEST_MODEL_TRITON_CONFIG sed i 3s tensor_parallel_size TENSOR_PARALLELISM,n TEST_MODEL_VLLM_CONFIG if TENSOR_PARALLELISM ne 1 then jq arg backend DISTRIBUTED_EXECUTOR_BACKEND . | |
--- | |
distributed_executor_backendbackend TEST_MODEL_VLLM_CONFIG temp.json mv temp.json TEST_MODEL_VLLM_CONFIG fi Assert the correct kind is set in case the template config changes in the future validate_file_contains KIND TEST_MODEL_TRITON_CONFIG Start server echo Running multiGPU test with kindKIND, tpTENSOR_PARALLELISM, instance_countINSTANCE_COUNT SERVER_LOG.vllm_multi_gpu_testKIND_tpTENSOR_PARALLELISM_countINSTANCE_COUNTserver.log run_server if SERVER_PID 0 then cat SERVER_LOG echo e nn Failed to start SERVERn exit 1 fi Run unit tests set e CLIENT_LOG.vllm_multi_gpu_testKIND_tpTENSOR_PARALLELISM_countINSTANCE_COUNTclient.log python3 CLIENT_PY v CLIENT_LOG 21 if ? ne 0 then cat CLIENT_LOG echo e nn Running CLIENT_PY FAILED. n RET1 else check_test_results TEST_RESULT_FILE EXPECTED_NUM_TESTS if ? | |
--- | |
ne 0 then cat CLIENT_LOG echo e nn Test Result Verification FAILED.n RET1 fi fi set e Cleanup kill SERVER_PID wait SERVER_PID Test rm f .log RET0 Test the various cases of kind, tensor parallelism, and instance count for different ways to run multiGPU models with vLLM on Triton KINDSKIND_MODEL KIND_GPU TPS1 2 INSTANCE_COUNTS1 2 DISTRIBUTED_EXECUTOR_BACKENDray for kind in KINDS do for tp in TPS do for count in INSTANCE_COUNTS do run_multi_gpu_test kind tp count DISTRIBUTED_EXECUTOR_BACKEND done done done Results if RET eq 1 then echo e nn Multi GPU Utilization test FAILED. n else echo e nn Multi GPU Utilization test PASSED. n fi exit RET File ciL0_multi_gpu_vllmvllm_backendvllm_multi_gpu_test.py Copyright 20232024, NVIDIA CORPORATION AFFILIATES. | |
--- | |
All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
--- | |
Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |
--- | |
import os import sys import unittest from functools import partial import pynvml import tritonclient.grpc as grpcclient from tritonclient.utils import sys.path.append....common from test_util import TestResultCollector, UserData, callback, create_vllm_request class VLLMMultiGPUTestTestResultCollector def setUpself pynvml.nvmlInit self.triton_client grpcclient.InferenceServerClienturllocalhost8001 def get_gpu_memory_utilizationself, gpu_id handle pynvml.nvmlDeviceGetHandleByIndexgpu_id info pynvml.nvmlDeviceGetMemoryInfohandle return info.used def get_available_gpu_idsself device_count pynvml.nvmlDeviceGetCount available_gpus for gpu_id in rangedevice_count handle pynvml.nvmlDeviceGetHandleByIndexgpu_id if handle available_gpus.appendgpu_id return available_gpus def _test_vllm_multi_gpu_utilizationself, model_name str Test that loading a given vLLM model will increase GPU utilization across multiple GPUs, and run a sanity check inference to confirm that the loaded multigpumultiinstance model is working as expected. gpu_ids self.get_available_gpu_ids self.assertGreaterEquallengpu_ids, 2, Error Detected single GPU printnn Before Loading vLLM Model mem_util_before_loading_model for gpu_id in gpu_ids memory_utilization self.get_gpu_memory_utilizationgpu_id printfGPU gpu_id Memory Utilization memory_utilization bytes mem_util_before_loading_modelgpu_id memory_utilization self.triton_client.load_modelmodel_name self._test_vllm_modelmodel_name print After Loading vLLM Model vllm_model_used_gpus 0 for gpu_id in gpu_ids memory_utilization self.get_gpu_memory_utilizationgpu_id printfGPU gpu_id Memory Utilization memory_utilization bytes if memory_utilization mem_util_before_loading_modelgpu_id vllm_model_used_gpus 1 self.assertGreaterEqualvllm_model_used_gpus, 2 def _test_vllm_modelself, model_name str, send_parameters_as_tensor bool True user_data UserData stream False prompts The most dangerous animal is, The capital of France is, The future of AI is, number_of_vllm_reqs lenprompts sampling_parameters temperature 0, top_p 1 self.triton_client.start_streamcallbackpartialcallback, user_data for i in rangenumber_of_vllm_reqs request_data create_vllm_request promptsi, i, stream, sampling_parameters, model_name, send_parameters_as_tensor, self.triton_client.async_stream_infer model_namemodel_name, request_idrequest_datarequest_id, inputsrequest_datainputs, outputsrequest_dataoutputs, parameterssampling_parameters, for i in rangenumber_of_vllm_reqs result user_data._completed_requests.get self.assertIsNottyperesult, InferenceServerException output result.as_numpytext_output self.assertIsNotNoneoutput self.triton_client.stop_stream def test_multi_gpu_modelself Tests that a multiGPU vLLM model loads successfully on multiple GPUs and can handle a few sanity check inference requests. MultiGPU models are currently defined here as either a single model instance with tensor parallelism 1 multiple model instances each with tensor parallelism 1 FIXME This test currently skips over a few combinations that may be enhanced in the future, such as tensor parallel models with multiple model instances KIND_MODEL models with multiple model instances model os.environ.getTEST_MODEL kind os.environ.getKIND tp os.environ.getTENSOR_PARALLELISM instance_count os.environ.getINSTANCE_COUNT for env_var in model, kind, tp, instance_count self.assertIsNotNoneenv_var printfTest Matrix model, kind, tp, instance_count Only support tensor parallelism or multiple instances for now, but not both. | |
--- | |
Support for multiinstance tensor parallel models may require more special handling in the backend to better handle device assignment. NOTE This eliminates the 111 and 224 test cases. if inttp intinstance_count ! | |
--- | |
2 msg TENSOR_PARALLELISM and INSTANCE_COUNT must have a product of 2 for this 2GPU test printSkipping Test, msg self.skipTestmsg Loading a KIND_GPU model with Tensor Parallelism 1 should fail and recommend using KIND_MODEL instead for multigpu model instances. if kind KIND_GPU and inttp 1 with self.assertRaisesRegex InferenceServerException, please specify KIND_MODEL self._test_vllm_multi_gpu_utilizationmodel return Loading a KIND_MODEL model with multiple instances can cause oversubscription to specific GPUs and cause a CUDA OOM if the gpu_memory_utilization settings are high without further handling of device assignment in the backend. if kind KIND_MODEL and intinstance_count 1 msg Testing multiple model instances of KIND_MODEL is not implemented at this time printSkipping Test, msg self.skipTestmsg self._test_vllm_multi_gpu_utilizationmodel def tearDownself pynvml.nvmlShutdown self.triton_client.close if __name__ __main__ unittest.main File ciL0_multi_gpu_vllmtest.sh !binbash Copyright 2024, NVIDIA CORPORATION AFFILIATES. | |
--- | |
All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
--- | |
Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |
--- | |
RET0 SUBTESTSvllm_backend multi_lora python3 m pip install tritonclientgrpc for TEST in SUBTESTS do cd TEST bash ex test.sh cd .. if ? ne 0 then echo Subtest TEST FAILED RET1 fi done if RET eq 0 then echo e nn vLLM MultiGPU Tests Passedn else echo e nn vLLM MultiGPU Tests FAILEDn fi exit RET File docsadditional_outputs.md ! | |
--- | |
Copyright 2024, NVIDIA CORPORATION AFFILIATES. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. | |
--- | |
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. | |
--- | |
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. Additional Outputs from vLLM The vLLM backend supports sending additional outputs from vLLM on top of the usual text_output when requested. All additional outputs are disabled by default and they need to be enabled on a perrequest basis. | |
--- | |
If enabled, the corresponding output tensor will be set for all responses from the request. Supported Additional Outputs Finish Reason The reason why the sequence is finished. See https://github.com/vllm-project/vllm/blob/v0.6.3.post1/vllm/outputs.py#L26 for more details.
--- | |
To enable, set return_finish_reason input tensor to True. The reason will be sent as a string on the finish_reason output tensor. Cumulative Log Probabilities The cumulative log probability of the generated output text. | |
--- | |
See https://github.com/vllm-project/vllm/blob/v0.6.3.post1/vllm/outputs.py#L22 for more details. To enable, set the return_cumulative_logprob input tensor to True. The floating point value will be sent on the cumulative_logprob output tensor.
--- | |
Log Probabilities The log probabilities of the top probability tokens at each position, returned when logprobs (https://github.com/vllm-project/vllm/blob/v0.6.3.post1/vllm/sampling_params.py#L146-L152) are requested. Only the log probabilities of the new tokens generated since the last response are returned on each new response. See https://github.com/vllm-project/vllm/blob/v0.6.3.post1/vllm/outputs.py#L24-L25 for more details on the log probabilities.
--- | |
To enable, set return_logprobs input tensor to True. The log probabilities will be sent on the logprobs output tensor as a serialized JSON string. Number of Input Tokens The number of token IDs of the prompt. | |
--- | |
See https://github.com/vllm-project/vllm/blob/v0.6.3.post1/vllm/outputs.py#L79-L81 for more details. To enable, set the return_num_input_tokens input tensor to True. The unsigned integer value will be sent on the num_input_tokens output tensor.
--- | |
Number of Output Tokens The number of token IDs of the generated output text sent on this response. It is the difference between the number of output token IDs generated up to this response and up to the last response; if this is the first response, the last response length is presumed to be zero. For example, if the first response carries three new tokens and the next response carries two more, the value is 3 and then 2.
--- | |
See https://github.com/vllm-project/vllm/blob/v0.6.3.post1/vllm/outputs.py#L21 for more details on the token IDs of the generated output text. To enable, set the return_num_output_tokens input tensor to True. The unsigned integer value will be sent on the num_output_tokens output tensor.
--- | |
Examples

Add Finish Reason to Outputs

```python
import numpy as np
import tritonclient.grpc as grpcclient

inputs = []

inputs.append(grpcclient.InferInput("text_input", [1], "BYTES"))
inputs[-1].set_data_from_numpy(
    np.array(["example prompt".encode("utf-8")], dtype=np.object_)
)

inputs.append(grpcclient.InferInput("return_finish_reason", [1], "BOOL"))
inputs[-1].set_data_from_numpy(np.array([True], dtype=bool))

def callback(result, error):
    ...
    print(result.as_numpy(name="finish_reason"))

with grpcclient.InferenceServerClient("localhost:8001") as client:
    client.start_stream(callback)
    client.async_stream_infer("vLLM_model_name", inputs=inputs, ...)
    client.stop_stream()
```

Notes

Enabling additional outputs may impact performance; only add additional outputs when necessary.
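Building on the example above, the following is a minimal sketch of a callback that decodes several additional outputs at once. It assumes the matching return_* input tensors were set to True on the request; the tensor names are the ones documented in this file.

```python
import json

def callback(result, error):
    if error is not None:
        print(error)
        return
    # text_output is always present.
    print(result.as_numpy("text_output"))
    # These tensors are only set when the matching return_* input was True;
    # as_numpy returns None for outputs that were not requested.
    print("finish_reason:", result.as_numpy("finish_reason"))
    print("num_input_tokens:", result.as_numpy("num_input_tokens"))
    print("num_output_tokens:", result.as_numpy("num_output_tokens"))
    # logprobs arrives as a serialized JSON string.
    logprobs = result.as_numpy("logprobs")
    if logprobs is not None:
        print("logprobs:", json.loads(logprobs[0]))
```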
--- | |
File docshealth_check.md ! Copyright 2024, NVIDIA CORPORATION AFFILIATES. All rights reserved. | |
--- | |
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. | |
--- | |
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. vLLM Health Check BETA !NOTE The vLLM Health Check support is currently in BETA. | |
--- | |
Its features and functionality are subject to change as we collect feedback. We are excited to hear any thoughts you have! The vLLM backend supports checking for vLLM Engine Health (https://github.com/vllm-project/vllm/blob/v0.6.3.post1/vllm/engine/async_llm_engine.py#L1177-L1185) upon receiving each inference request.
--- | |
If the health check fails, the model state will become NOT Ready at the server, which can be queried by the Repository Index (https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_model_repository.md#index) or Model Ready (https://github.com/triton-inference-server/client/blob/main/src/c%2B%2B/library/http_client.h#L178-L192) APIs. The Health Check is disabled by default. To enable it, set the following parameter on the model config to true

```
parameters: {
  key: "ENABLE_VLLM_HEALTH_CHECK"
  value: { string_value: "true" }
}
```

and select Model Control Mode EXPLICIT (https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_management.md#model-control-mode-explicit) when the server is started.
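For instance, a client can observe the state transition through the model repository APIs. A minimal sketch using tritonclient follows; the model name vllm_model and the server address are placeholders, and the server is assumed to run with --model-control-mode=explicit.

```python
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

# With explicit model control, the client loads the model itself.
client.load_model("vllm_model")

# After a failed vLLM health check the model state becomes NOT Ready,
# so readiness flips to False and the repository index reports the new state.
print("ready:", client.is_model_ready("vllm_model"))
for model in client.get_model_repository_index().models:
    print(model.name, model.state, model.reason)
```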
--- | |
File docsllama_multi_lora_tutorial.md ! Copyright 2024, NVIDIA CORPORATION AFFILIATES. All rights reserved. | |
--- | |
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. | |
--- | |
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Tutorial on deploying multi-lora vLLM backend in Triton

The idea of multi-lora was proposed recently; for more, please refer to S-LoRA: Serving Thousands of Concurrent LoRA Adapters (https://arxiv.org/abs/2311.03285) and Punica: Multi-Tenant LoRA Serving (https://arxiv.org/abs/2310.18547). vLLM now supports multi-lora, integrating the Punica feature and its related CUDA kernels.
--- | |
See this PR (https://github.com/vllm-project/vllm/pull/1804) for more. 2024-01-24: this PR has been merged into the main branch of vLLM. The following tutorial demonstrates how to deploy a LLaMa model with multiple LoRAs on Triton Inference Server using Triton's Python-based (https://github.com/triton-inference-server/backend/blob/main/docs/python_based_backends.md#python-based-backends) vLLM (https://github.com/triton-inference-server/vllm_backend/tree/main) backend. Before you continue reading, it's important to note that all command-line instructions containing xx.yy in the document cannot be used directly by copying and pasting.
--- | |
xx.yy represents the Triton version, and you must specify the Triton version you want to use for the bash command to work.

---

Step 1: Start a docker container for triton+vllm serving

A docker container is strongly recommended for serving, and this tutorial will only demonstrate how to launch Triton in the docker environment. First, start a docker container using the tritonserver image with the vLLM backend from NGC (https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver/tags):

```bash
# NOTICE: you must first cd to your vllm_workspace path outside the container.
mkdir vllm_workspace
cd vllm_workspace

sudo docker run --gpus all -it --net host -p 8001:8001 --shm-size=12G \
  --ulimit memlock=-1 --ulimit stack=67108864 -v ${PWD}:/vllm_workspace \
  -w /vllm_workspace nvcr.io/nvidia/tritonserver:xx.yy-vllm-python-py3 /bin/bash
```

NOTICE: the version of the Triton docker image should be configured; here and throughout this tutorial we use xx.yy to symbolize the version.

Triton's vLLM container has been available starting from the 23.10 release, and experimental multi-lora support was added in the vLLM v0.3.0 release. Docker image version nvcr.io/nvidia/tritonserver:24.05-vllm-python-py3 or higher is strongly recommended.
--- | |
For pre-24.05 containers, the docker images didn't support the multi-lora feature, so you need to replace the model.py provided in the container (/opt/tritonserver/backends/vllm/model.py) with the most up-to-date version. Just follow this command:

Download the model.py script from GitHub:

```bash
wget -P /opt/tritonserver/backends/vllm https://raw.githubusercontent.com/triton-inference-server/vllm_backend/rxx.yy/src/model.py
```

This command will download the model.py script to the Triton vllm backend directory, which will enable the multi-lora feature.

Step 2: Prepare your weights

To support multi-lora on Triton, you need to manage your file paths for the model backbone and the LoRA weights separately.
--- | |
A typical weights repository can be as follows:

```
weights
├── backbone
│   └── llama-7b-hf
└── loras
    ├── alpaca-lora-7b
    └── wizardLM-lora-7b
```

A workspace for vLLM, together with the model backbone weights and the LoRA adapter weights, is strongly recommended. You should lay out the storage of these weight files so that they are logically organized in the workspace.

Step 3: Prepare the model repository for Triton Server

__3.1 Download the model repository files__

To use Triton, a model repository is needed that holds the model path, the backend configuration and other information.
--- | |
The vLLM backend is implemented on top of the Python backend, and the sampling_params of vLLM are read from model.json. To create a Triton model repository, you may download the files through these commands:

```bash
# NOTICE: you must first cd to your vllm_workspace path.
cd vllm_workspace
mkdir -p model_repository/vllm_model/1
wget -P model_repository/vllm_model/1 https://raw.githubusercontent.com/triton-inference-server/vllm_backend/rxx.yy/samples/model_repository/vllm_model/1/model.json
wget -P model_repository/vllm_model https://raw.githubusercontent.com/triton-inference-server/vllm_backend/rxx.yy/samples/model_repository/vllm_model/config.pbtxt
```

The model repository should look like this:

```
model_repository/
└── vllm_model/
    ├── 1/
    │   └── model.json
    └── config.pbtxt
```

Now you have finished the basic deployment, and the file structure should look like this:

```
vllm_workspace/
├── weights/
│   ├── backbone/
│   │   └── llama-7b-hf/
│   └── loras/
│       ├── alpaca-lora-7b/
│       └── bactrian-x-llama-lora-7b/
└── model_repository/
    └── vllm_model/
        ├── 1/
        │   └── model.json
        └── config.pbtxt
```

__3.2 Populate model.json__

For this tutorial we will use the following set of parameters, specified in model.json.
--- | |
```json
{
    "model": "/vllm_workspace/weights/backbone/llama-7b-hf",
    "disable_log_requests": true,
    "gpu_memory_utilization": 0.8,
    "tensor_parallel_size": 2,
    "block_size": 16,
    "enforce_eager": true,
    "enable_lora": true,
    "max_lora_rank": 16
}
```

model: The path to your model repository. disable_log_requests: Whether to show request logs when launching vLLM or not. gpu_memory_utilization: The GPU memory allocated for the model weights and the vLLM PagedAttention KV cache manager. tensor_parallel_size: vLLM now supports tensor parallelism, so you can decide how many GPUs you want to use for serving.
--- | |
block_size: The vLLM KV cache block size. enable_lora: Set this to true if you want to enable vLLM's multi-lora support. max_lora_rank: The maximum LoRA rank among your LoRA adapters.
--- | |
The full set of parameters can be found at https://github.com/Yard1/vllm/blob/multi_lora/vllm/engine/arg_utils.py#L11.
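As a sanity check, every key placed in model.json must map to a vLLM engine argument, because the backend's model.py (included later in this dump) loads the JSON and expands it into AsyncEngineArgs. A rough sketch of that step, using this tutorial's example path rather than the backend's own path resolution:

```python
import json

from vllm.engine.arg_utils import AsyncEngineArgs

# Read the engine configuration that Step 3.2 placed next to the model version.
with open("model_repository/vllm_model/1/model.json") as f:
    vllm_engine_config = json.load(f)

# Every key in model.json must be a valid AsyncEngineArgs field,
# e.g. tensor_parallel_size, enable_lora, max_lora_rank.
engine_args = AsyncEngineArgs(**vllm_engine_config)
print(engine_args.tensor_parallel_size, engine_args.enable_lora)
```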
--- | |
__3.3 Specify local lora path__

vLLM v0.4.0.post1 supports applying locally stored LoRA weights at inference time, which means that vLLM cannot pull any LoRA adapter from Hugging Face, so Triton needs to know where the local LoRA weights are.

Create a multi_lora.json file under the model_repository/vllm_model/1/ path:

```bash
cd model_repository/vllm_model/1
touch multi_lora.json
```

The content of multi_lora.json should look like this:

```json
{
    "alpaca": "/vllm_workspace/weights/loras/alpaca-lora-7b",
    "bactrian": "/vllm_workspace/weights/loras/bactrian-x-llama-7b-lora"
}
```

The key should be the supported LoRA name, and the value should be the specific path on your machine.

Warning: if you set enable_lora to true in model.json without creating a multi_lora.json file, the server will throw a FileNotFoundError when initializing.
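Once the server is up (Step 4 below), a client selects an adapter at request time by adding a lora_name entry to the sampling parameters, as the multi-lora tests earlier in this dump do. A minimal sketch of building such a request with the gRPC client; the prompt text is a placeholder, and the tensor names follow the sample model:

```python
import json

import numpy as np
import tritonclient.grpc as grpcclient

# "alpaca" must be one of the keys defined in multi_lora.json.
sampling_parameters = {"temperature": 0, "top_p": 1, "lora_name": "alpaca"}

inputs = []
inputs.append(grpcclient.InferInput("text_input", [1], "BYTES"))
inputs[-1].set_data_from_numpy(
    np.array(["What is the capital city of France?".encode("utf-8")], dtype=np.object_)
)
inputs.append(grpcclient.InferInput("sampling_parameters", [1], "BYTES"))
inputs[-1].set_data_from_numpy(
    np.array([json.dumps(sampling_parameters).encode("utf-8")], dtype=np.object_)
)
# The model uses the decoupled transaction policy, so the request is sent over
# a stream, e.g. with async_stream_infer as done in samples/client.py.
```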
--- | |
Step 4: Launch Triton

```bash
# NOTICE: you must first cd to your vllm_workspace path.
cd vllm_workspace
tritonserver --model-store ./model_repository
```

After you start Triton you will see output on the console showing the server starting up and loading the model. When you see output like the following, Triton is ready to accept inference requests.

```
I1030 22:33:28.291908 1 grpc_server.cc:2513] Started GRPCInferenceService at 0.0.0.0:8001
I1030 22:33:28.292879 1 http_server.cc:4497] Started HTTPService at 0.0.0.0:8000
I1030 22:33:28.335154 1 http_server.cc:270] Started Metrics Service at 0.0.0.0:8002
```

Step 5: Send a request

A client request script for multi-lora was prepared; download the client script from source:

```bash
wget https://raw.githubusercontent.com/triton-inference-server/vllm_backend/main/samples/client.py
wget https://raw.githubusercontent.com/triton-inference-server/vllm_backend/main/samples/prompts.txt
```

Try running this script with the following command:

```bash
python3 client.py -l your-prepared-lora-name
```

Here we assume you have prepared the alpaca LoRA weights, thus we use:

```bash
python3 client.py -l alpaca
```

File samplesmodel_repositoryvllm_model1model.json

```json
{
    "model": "facebook/opt-125m",
    "disable_log_requests": true,
    "gpu_memory_utilization": 0.5,
    "enforce_eager": true
}
```

File samplesmodel_repositoryvllm_modelconfig.pbtxt Copyright 2023, NVIDIA CORPORATION AFFILIATES.
--- | |
All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
--- | |
Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |
--- | |
Note: You do not need to change any fields in this configuration.

backend: "vllm"

# The usage of device is deferred to the vLLM engine
instance_group [
  {
    count: 1
    kind: KIND_MODEL
  }
]

File samplesclient.py #!/usr/bin/env python3 Copyright 2024, NVIDIA CORPORATION AFFILIATES. All rights reserved.
--- | |
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. | |
--- | |
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import argparse import asyncio import json import sys import numpy as np import tritonclient.grpc.aio as grpcclient from tritonclient.utils import class LLMClient def __init__self, flags argparse.Namespace self._flags flags self._results_dict def get_triton_clientself try triton_client grpcclient.InferenceServerClient urlself._flags.url, verboseself._flags.verbose, except Exception as e printchannel creation failed stre sys.exit return triton_client async def async_request_iterator self, prompts, sampling_parameters, exclude_input_in_output try for iter in rangeself._flags.iterations for i, prompt in enumerateprompts prompt_id self._flags.offset lenprompts iter i self._results_dictstrprompt_id yield self.create_request prompt, self._flags.streaming_mode, prompt_id, sampling_parameters, exclude_input_in_output, except Exception as error printfCaught an error in the request iterator error async def stream_inferself, prompts, sampling_parameters, exclude_input_in_output try triton_client self.get_triton_client Start streaming response_iterator triton_client.stream_infer inputs_iteratorself.async_request_iterator prompts, sampling_parameters, exclude_input_in_output , stream_timeoutself._flags.stream_timeout, async for response in response_iterator yield response except InferenceServerException as error printerror sys.exit1 async def process_stream self, prompts, sampling_parameters, exclude_input_in_output Clear results in between process_stream calls self.results_dict success True Read response from the stream async for response in self.stream_infer prompts, sampling_parameters, exclude_input_in_output result, error response if error printfEncountered error while processing error success False else output result.as_numpytext_output for i in output self._results_dictresult.get_response.id.appendi return success async def runself Sampling parameters for text generation including temperature, top_p, top_k, max_tokens, early_stopping. | |
--- | |
Full list available at httpsgithub.comvllmprojectvllmblob5255d99dc595f9ae7647842242d6542aa4145a4fvllmsampling_params.pyL23 sampling_parameters temperature 0.1, top_p 0.95, max_tokens 100, exclude_input_in_output self._flags.exclude_inputs_in_outputs if self._flags.lora_name is not None sampling_parameterslora_name self._flags.lora_name with openself._flags.input_prompts, r as file printfLoading inputs from self._flags.input_prompts... prompts file.readlines success await self.process_stream prompts, sampling_parameters, exclude_input_in_output with openself._flags.results_file, w as file for id in self._results_dict.keys for result in self._results_dictid file.writeresult.decodeutf8 file.writen file.writennn printfStoring results into self._flags.results_file... if self._flags.verbose with openself._flags.results_file, r as file printfnContents of self._flags.results_file printfile.read if success printPASS vLLM example else printFAIL vLLM example def run_asyncself asyncio.runself.run def create_request self, prompt, stream, request_id, sampling_parameters, exclude_input_in_output, send_parameters_as_tensorTrue, inputs prompt_data np.arrayprompt.encodeutf8, dtypenp.object_ try inputs.appendgrpcclient.InferInputtext_input, 1, BYTES inputs1.set_data_from_numpyprompt_data except Exception as error printfEncountered an error during request creation error stream_data np.arraystream, dtypebool inputs.appendgrpcclient.InferInputstream, 1, BOOL inputs1.set_data_from_numpystream_data Request parameters are not yet supported via BLS. | |
--- | |
Provide an optional mechanism to send serialized parameters as an input tensor until support is added if send_parameters_as_tensor sampling_parameters_data np.array json.dumpssampling_parameters.encodeutf8, dtypenp.object_ inputs.appendgrpcclient.InferInputsampling_parameters, 1, BYTES inputs1.set_data_from_numpysampling_parameters_data inputs.appendgrpcclient.InferInputexclude_input_in_output, 1, BOOL inputs1.set_data_from_numpynp.arrayexclude_input_in_output, dtypebool Add requested outputs outputs outputs.appendgrpcclient.InferRequestedOutputtext_output Issue the asynchronous sequence inference. return model_name self._flags.model, inputs inputs, outputs outputs, request_id strrequest_id, parameters sampling_parameters, if __name__ __main__ parser argparse.ArgumentParser parser.add_argument m, model, typestr, requiredFalse, defaultvllm_model, helpModel name, parser.add_argument v, verbose, actionstore_true, requiredFalse, defaultFalse, helpEnable verbose output, parser.add_argument u, url, typestr, requiredFalse, defaultlocalhost8001, helpInference server URL and its gRPC port. Default is localhost8001., parser.add_argument t, streamtimeout, typefloat, requiredFalse, defaultNone, helpStream timeout in seconds. | |
--- | |
Default is None., parser.add_argument offset, typeint, requiredFalse, default0, helpAdd offset to request IDs used, parser.add_argument inputprompts, typestr, requiredFalse, defaultprompts.txt, helpText file with input prompts, parser.add_argument resultsfile, typestr, requiredFalse, defaultresults.txt, helpThe file with output results, parser.add_argument iterations, typeint, requiredFalse, default1, helpNumber of iterations through the prompts file, parser.add_argument s, streamingmode, actionstore_true, requiredFalse, defaultFalse, helpEnable streaming mode, parser.add_argument excludeinputsinoutputs, actionstore_true, requiredFalse, defaultFalse, helpExclude prompt from outputs, parser.add_argument l, loraname, typestr, requiredFalse, defaultNone, helpThe querying LoRA name, FLAGS parser.parse_args client LLMClientFLAGS client.run_async File samplesprompts.txt Hello, my name is The most dangerous animal is The capital of France is The future of AI is File srcutilsmetrics.py Copyright 2024, NVIDIA CORPORATION AFFILIATES. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. | |
--- | |
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. | |
--- | |
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import queue import threading from typing import Dict, List, Union import triton_python_backend_utils as pb_utils from vllm.engine.metrics import StatLoggerBase as VllmStatLoggerBase from vllm.engine.metrics import Stats as VllmStats from vllm.engine.metrics import SupportsMetricsInfo, build_1_2_5_buckets class TritonMetrics def __init__self, labels Liststr, max_model_len int Initialize metric families Iteration stats self.counter_prompt_tokens_family pb_utils.MetricFamily namevllmprompt_tokens_total, descriptionNumber of prefill tokens processed., kindpb_utils.MetricFamily.COUNTER, self.counter_generation_tokens_family pb_utils.MetricFamily namevllmgeneration_tokens_total, descriptionNumber of generation tokens processed., kindpb_utils.MetricFamily.COUNTER, self.histogram_time_to_first_token_family pb_utils.MetricFamily namevllmtime_to_first_token_seconds, descriptionHistogram of time to first token in seconds., kindpb_utils.MetricFamily.HISTOGRAM, self.histogram_time_per_output_token_family pb_utils.MetricFamily namevllmtime_per_output_token_seconds, descriptionHistogram of time per output token in seconds., kindpb_utils.MetricFamily.HISTOGRAM, Request stats Latency self.histogram_e2e_time_request_family pb_utils.MetricFamily namevllme2e_request_latency_seconds, descriptionHistogram of end to end request latency in seconds., kindpb_utils.MetricFamily.HISTOGRAM, Metadata self.histogram_num_prompt_tokens_request_family pb_utils.MetricFamily namevllmrequest_prompt_tokens, descriptionNumber of prefill tokens processed., kindpb_utils.MetricFamily.HISTOGRAM, self.histogram_num_generation_tokens_request_family pb_utils.MetricFamily namevllmrequest_generation_tokens, descriptionNumber of generation tokens processed., kindpb_utils.MetricFamily.HISTOGRAM, self.histogram_n_request_family pb_utils.MetricFamily namevllmrequest_params_n, descriptionHistogram of the n request parameter., kindpb_utils.MetricFamily.HISTOGRAM, Initialize metrics Iteration stats self.counter_prompt_tokens self.counter_prompt_tokens_family.Metric labelslabels self.counter_generation_tokens self.counter_generation_tokens_family.Metric labelslabels Use the same bucket boundaries from vLLM sample metrics as an example. 
httpsgithub.comvllmprojectvllmblob21313e09e3f9448817016290da20d0db1adf3664vllmenginemetrics.pyL81L96 self.histogram_time_to_first_token self.histogram_time_to_first_token_family.Metric labelslabels, buckets 0.001, 0.005, 0.01, 0.02, 0.04, 0.06, 0.08, 0.1, 0.25, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0, , self.histogram_time_per_output_token self.histogram_time_per_output_token_family.Metric labelslabels, buckets 0.01, 0.025, 0.05, 0.075, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 0.75, 1.0, 2.5, , Request stats Latency self.histogram_e2e_time_request self.histogram_e2e_time_request_family.Metric labelslabels, buckets1.0, 2.5, 5.0, 10.0, 15.0, 20.0, 30.0, 40.0, 50.0, 60.0, Metadata self.histogram_num_prompt_tokens_request self.histogram_num_prompt_tokens_request_family.Metric labelslabels, bucketsbuild_1_2_5_bucketsmax_model_len, self.histogram_num_generation_tokens_request self.histogram_num_generation_tokens_request_family.Metric labelslabels, bucketsbuild_1_2_5_bucketsmax_model_len, self.histogram_n_request self.histogram_n_request_family.Metric labelslabels, buckets1, 2, 5, 10, 20, class VllmStatLoggerVllmStatLoggerBase StatLogger is used as an adapter between vLLM stats collector and Triton metrics provider. | |
--- | |
def __init__self, labels Dict, max_model_len int, log_logger None Tracked stats over current local logging interval. local_interval not used here. Its for vLLM logs to stdout. | |
--- | |
super.__init__local_interval0 self.metrics TritonMetricslabels, max_model_len self.log_logger log_logger Starting the metrics thread. It allows vLLM to keep making progress while reporting metrics to triton metrics service. self._logger_queue queue.Queue self._logger_thread threading.Threadtargetself.logger_loop self._logger_thread.start def infoself, type str, obj SupportsMetricsInfo None pass def _log_counterself, counter, data Unionint, float None Convenience function for logging to counter. | |
--- | |
Args counter A counter metric instance. data An int or float to increment the count metric. Returns None if data ! | |
--- | |
0 self._logger_queue.put_nowaitcounter, increment, data def _log_histogramself, histogram, data UnionListint, Listfloat None Convenience function for logging list to histogram. Args histogram A histogram metric instance. data A list of int or float data to observe into the histogram metric. | |
--- | |
Returns None for datum in data self._logger_queue.put_nowaithistogram, observe, datum def logself, stats VllmStats None Report stats to Triton metrics server. Args stats Created by LLMEngine for use by VllmStatLogger. Returns None The list of vLLM metrics reporting to Triton is also documented here. | |
--- | |
httpsgithub.comtritoninferenceservervllm_backendblobmainREADME.mdtritonmetrics counter_metrics self.metrics.counter_prompt_tokens, stats.num_prompt_tokens_iter, self.metrics.counter_generation_tokens, stats.num_generation_tokens_iter, histogram_metrics self.metrics.histogram_time_to_first_token, stats.time_to_first_tokens_iter, , self.metrics.histogram_time_per_output_token, stats.time_per_output_tokens_iter, , self.metrics.histogram_e2e_time_request, stats.time_e2e_requests, self.metrics.histogram_num_prompt_tokens_request, stats.num_prompt_tokens_requests, , self.metrics.histogram_num_generation_tokens_request, stats.num_generation_tokens_requests, , self.metrics.histogram_n_request, stats.n_requests, for metric, data in counter_metrics self._log_countermetric, data for metric, data in histogram_metrics self._log_histogrammetric, data def logger_loopself while True item self._logger_queue.get To signal shutdown a None item will be added to the queue. if item is None break metric, command, data item if command increment metric.incrementdata elif command observe metric.observedata else self.log_logger.log_errorfUndefined command name command def finalizeself Shutdown the logger thread. self._logger_queue.putNone if self._logger_thread is not None self._logger_thread.join self._logger_thread None File srcmodel.py Copyright 20232024, NVIDIA CORPORATION AFFILIATES. | |
--- | |
All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
--- | |
Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |
--- | |
import asyncio import base64 import gc import json import os import queue import threading from io import BytesIO from typing import Dict, List import numpy as np import torch import triton_python_backend_utils as pb_utils from PIL import Image from vllm.engine.arg_utils import AsyncEngineArgs from vllm.entrypoints.openai.api_server import build_async_engine_client_from_engine_args, from vllm.lora.request import LoRARequest from vllm.sampling_params import SamplingParams from vllm.utils import random_uuid from utils.metrics import VllmStatLogger _VLLM_ENGINE_ARGS_FILENAME model.json _MULTI_LORA_ARGS_FILENAME multi_lora.json class TritonPythonModel classmethod def auto_complete_configcls, auto_complete_model_config Add inputsoutputs to the model config. cls._auto_complete_inputs_and_outputsauto_complete_model_config We need to use decoupled transaction policy for saturating vLLM engine for max throughtput. TODO DLIS5233 Allow asynchronous execution to lift this restriction for cases there is exactly a single response to a single request. | |
--- | |
auto_complete_model_config.set_model_transaction_policydictdecoupledTrue Disabling batching in Triton, let vLLM handle the batching on its own. auto_complete_model_config.set_max_batch_size0 return auto_complete_model_config staticmethod def _auto_complete_inputs_and_outputsauto_complete_model_config Inputs expected by the backend. inputs name text_input, data_type TYPE_STRING, dims 1, name image, data_type TYPE_STRING, dims 1, can be multiple images as separate elements optional True, , name stream, data_type TYPE_BOOL, dims 1, optional True, , name sampling_parameters, data_type TYPE_STRING, dims 1, optional True, , name exclude_input_in_output, data_type TYPE_BOOL, dims 1, optional True, , name return_finish_reason, data_type TYPE_BOOL, dims 1, optional True, , name return_cumulative_logprob, data_type TYPE_BOOL, dims 1, optional True, , name return_logprobs, data_type TYPE_BOOL, dims 1, optional True, , name return_num_input_tokens, data_type TYPE_BOOL, dims 1, optional True, , name return_num_output_tokens, data_type TYPE_BOOL, dims 1, optional True, , Outputs expected by the backend. | |
--- | |
outputs name text_output, data_type TYPE_STRING, dims 1, name finish_reason, data_type TYPE_STRING, dims 1, name cumulative_logprob, data_type TYPE_FP32, dims 1, name logprobs, data_type TYPE_STRING, dims 1, name num_input_tokens, data_type TYPE_UINT32, dims 1, name num_output_tokens, data_type TYPE_UINT32, dims 1, Collect input and output names from the provided model config. config auto_complete_model_config.as_dict input_names output_names for input in configinput input_names.appendinputname for output in configoutput output_names.appendoutputname Add missing inputs and outputs to the model config. for input in inputs if inputname not in input_names auto_complete_model_config.add_inputinput for output in outputs if outputname not in output_names auto_complete_model_config.add_outputoutput def initializeself, args self.args args self.logger pb_utils.Logger self.model_config json.loadsargsmodel_config output_config pb_utils.get_output_config_by_name self.model_config, text_output self.output_dtype pb_utils.triton_string_to_numpyoutput_configdata_type Setup vLLM engine health check self._enable_health_check self._get_bool_config_param ENABLE_VLLM_HEALTH_CHECK self._is_healthy True Initialize engine arguments TODO Move this into _init_engine, after moving check metrics enabled. | |
--- | |
self._init_engine_args Check if metrics are enabled. The ZMQ process cannot be used when metrics are enabled. TODO Move the check into _setup_metrics. | |
--- | |
self._enable_metrics self._get_bool_config_paramREPORT_CUSTOM_METRICS and not self._aync_engine_args.disable_log_stats Starting the vLLM engine and its event thread running the AsyncIO event loop. self._init_engine Setup vLLM metrics self._setup_metrics Starting the response thread. It allows vLLM to keep making progress while response senders are sending responses to server frontend. | |
--- | |
self._response_queue queue.Queue self._response_thread threading.Threadtargetself._response_loop self._response_thread.start def _init_engine_argsself Currently, Triton needs to use decoupled policy for asynchronously forwarding requests to vLLM engine, so assert it. self.using_decoupled pb_utils.using_decoupled_model_transaction_policy self.model_config assert self.using_decoupled , vLLM Triton backend must be configured to use decoupled model transaction policy engine_args_filepath os.path.join pb_utils.get_model_dir, _VLLM_ENGINE_ARGS_FILENAME assert os.path.isfile engine_args_filepath , f_VLLM_ENGINE_ARGS_FILENAME containing vllm engine args must be provided in pb_utils.get_model_dir with openengine_args_filepath as file self.vllm_engine_config json.loadfile Validate device and multiprocessing settings are currently set based on modelconfigs. self._validate_device_config Check for LoRA config and set it up if enabled self._setup_lora Create an AsyncEngineArgs from the config from JSON self._aync_engine_args AsyncEngineArgsself.vllm_engine_config def _init_engineself Run the engine in a separate thread running the AsyncIO event loop. | |
--- | |
self._llm_engine None self._llm_engine_start_cv threading.Condition self._llm_engine_shutdown_event asyncio.Event self._event_thread threading.Thread targetasyncio.run, argsself._run_llm_engine, self._event_thread.start with self._llm_engine_start_cv while self._llm_engine is None self._llm_engine_start_cv.wait The threading.Thread will not raise the exception here should the engine failed to start, so the exception is passed back via the engine variable. if isinstanceself._llm_engine, Exception e self._llm_engine self.logger.log_errorfvllm Failed to start engine e if self._event_thread is not None self._event_thread.join self._event_thread None raise e async def _run_llm_engineself Counter to keep track of ongoing request counts. self._ongoing_request_count 0 try Start the vLLM engine. | |
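The start-up handshake described above (run an asyncio loop in a worker thread, block the caller until the engine or a start-up exception is published) can be illustrated with a generic, runnable sketch; EngineHolder is an illustrative name and object() stands in for the real engine:
import asyncio
import threading

class EngineHolder:
    def __init__(self):
        self._engine = None
        self._loop = None
        self._shutdown = None
        self._start_cv = threading.Condition()
        self._thread = threading.Thread(target=asyncio.run, args=(self._run(),))
        self._thread.start()
        # Block until the worker thread publishes either the engine or an exception.
        with self._start_cv:
            while self._engine is None:
                self._start_cv.wait()
        if isinstance(self._engine, Exception):
            raise self._engine

    async def _run(self):
        try:
            self._loop = asyncio.get_running_loop()
            self._shutdown = asyncio.Event()
            engine = object()  # stand-in for the real async engine
            with self._start_cv:
                self._engine = engine
                self._start_cv.notify_all()
            await self._shutdown.wait()  # keep the loop alive until finalize
        except Exception as e:
            with self._start_cv:
                self._engine = e  # pass the start-up failure back to the constructor
                self._start_cv.notify_all()

    def finalize(self):
        # Signal shutdown from another thread and wait for the loop to exit.
        self._loop.call_soon_threadsafe(self._shutdown.set)
        self._thread.join()

holder = EngineHolder()
holder.finalize()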
--- | |
The engine lives for the scope of this with statement. TODO Metrics should work with ZMQ enabled. async with build_async_engine_client_from_engine_args engine_argsself._aync_engine_args, disable_frontend_multiprocessingself._enable_metrics, as engine Capture the engine event loop and make it visible to other threads. | |
--- | |
self._event_loop asyncio.get_running_loop Signal the engine is started and make it visible to other threads. with self._llm_engine_start_cv self._llm_engine engine self._llm_engine_start_cv.notify_all Wait for the engine shutdown signal. await self._llm_engine_shutdown_event.wait Wait for the ongoing requests to complete. | |
--- | |
while self._ongoing_request_count 0 self.logger.log_info vllm Awaiting remaining requests.format self._ongoing_request_count await asyncio.sleep1 Cancel all tasks in the event loop. for task in asyncio.all_tasksloopself._event_loop if task is not asyncio.current_task task.cancel except Exception as e Signal and pass the exception back via the engine variable if the engine failed to start. If the engine has started, reraise the exception. | |
--- | |
with self._llm_engine_start_cv if self._llm_engine is None self._llm_engine e self._llm_engine_start_cv.notify_all return raise e self._llm_engine None self.logger.log_infovllm Shutdown complete def _validate_device_configself triton_kind self.argsmodel_instance_kind triton_device_id intself.argsmodel_instance_device_id triton_instance fself.argsmodel_name_triton_device_id Tritons current definition of KIND_GPU makes assumptions that models only use a single GPU. For multiGPU models, the recommendation is to specify KIND_MODEL to acknowledge that the model will take control of the devices made available to it. NOTE Consider other parameters that would indicate multiGPU in the future. | |
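For reference, the KIND_MODEL recommendation above corresponds to an instance_group entry in the model's config.pbtxt along these lines (illustrative, not copied from the repo):
instance_group [
  {
    count: 1
    kind: KIND_MODEL
  }
]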
--- | |
tp_size intself.vllm_engine_config.gettensor_parallel_size, 1 if tp_size 1 and triton_kind GPU raise ValueError KIND_GPU is currently for singleGPU models, please specify KIND_MODEL in the models config.pbtxt for multiGPU models If KIND_GPU is specified, specify the device ID assigned by Triton to ensure that multiple model instances do not oversubscribe the same default device. if triton_kind GPU and triton_device_id 0 self.logger.log_info fDetected KIND_GPU model instance, explicitly setting GPU devicetriton_device_id for triton_instance vLLM doesnt currently v0.4.2 expose device selection in the APIs torch.cuda.set_devicetriton_device_id def _setup_loraself self.enable_lora False Check if enable_lora field is in the model.json, and if it is, read its contents, which can be string or bool. if enable_lora in self.vllm_engine_config.keys and strself.vllm_engine_configenable_lora.lower true create Triton LoRA weights repository multi_lora_args_filepath os.path.join pb_utils.get_model_dir, _MULTI_LORA_ARGS_FILENAME try with openmulti_lora_args_filepath as lora_file lora_repository Dictstr, str json.loadlora_file self.lora_repository lora_repository self.supported_loras Liststr listself.lora_repository.keys self.supported_loras_len lenself.supported_loras self.enable_lora True except FileNotFoundError raise FileNotFoundError fTriton backend cannot find multi_lora_args_filepath. | |
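The multi_lora.json that the backend looks for is a flat mapping from LoRA name to a local adapter path; a hypothetical example (names and paths are placeholders):
{
  "lora-a": "/opt/tritonserver/model_repository/vllm_model/1/lora/lora-a",
  "lora-b": "/opt/tritonserver/model_repository/vllm_model/1/lora/lora-b"
}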
--- | |
def _setup_metricsself self._vllm_metrics None TODO Do not read metrics directly from the vLLM engine, read from prometheus client to allow the use of ZMQ process when metrics are enabled. See httpsgithub.comvllmprojectvllmblobv0.6.3.post1vllmentrypointsopenaiapi_server.pyL222L245 if self._enable_metrics try labels model self.argsmodel_name, version self.argsmodel_version, Add vLLM custom metrics engine_config self._llm_engine.engine.model_config self._vllm_metrics VllmStatLogger labels, engine_config.max_model_len, self.logger self._llm_engine.add_loggertriton, self._vllm_metrics except pb_utils.TritonModelException as e if metrics not supported in stre Metrics are disabled at the server self.logger.log_infovllm Metrics not supported else raise e def _get_bool_config_paramself, param_name str bool return param_name in self.model_configparameters and self.model_configparametersparam_namestring_value.lower true def _response_loopself while True item self._response_queue.get To signal shutdown a None item will be added to the queue. if item is None break response_state, response, response_flag item response_sender response_stateresponse_sender try response_sender.sendresponse, response_flag Stop checking for cancellation if the last response is generated. | |
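The _get_bool_config_param helper above follows a common Python-backend pattern: Triton model-config parameters arrive as strings, so "true"/"false" must be parsed explicitly. A standalone sketch of that pattern:
def get_bool_config_param(model_config: dict, param_name: str) -> bool:
    # model_config is the parsed config, e.g. json.loads(args["model_config"]).
    params = model_config.get("parameters", {})
    return (
        param_name in params
        and params[param_name]["string_value"].lower() == "true"
    )

# e.g. get_bool_config_param(model_config, "REPORT_CUSTOM_METRICS")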
--- | |
if not response_statelast_response_generated response_stateis_cancelled response_sender.is_cancelled except Exception as e self.logger.log_error fAn error occurred while sending a response e finally if response_flag pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL self._ongoing_request_count 1 def executeself, requests if self._enable_health_check and not self._check_healthrequests return None for request in requests request self._verify_lorasrequest if request is not None assert self._llm_engine_shutdown_event.is_set is False , Cannot create tasks after shutdown has been requested coro self._generaterequest asyncio.run_coroutine_threadsafecoro, self._event_loop return None async def _generateself, request response_sender request.get_response_sender response_state response_sender response_sender, is_cancelled False, last_response_generated False, last response ready but not yet sent self._ongoing_request_count 1 decrement_ongoing_request_count True try request_id random_uuid prompt, stream, prepend_input, parameters, additional_outputs, self._get_input_tensorsrequest sampling_params_dict self._get_sampling_params_dictparameters lora_name sampling_params_dict.poplora_name, None sampling_params SamplingParamssampling_params_dict lora_request None if lora_name is not None lora_id strself.supported_loras.indexlora_name 1 lora_int_id intlora_id lora_local_path self.lora_repositorylora_name lora_request LoRARequestlora_id, lora_int_id, lora_local_path response_iterator self._llm_engine.generate prompt, sampling_params, request_id, lora_requestlora_request request_output_state async for request_output in response_iterator Cancellation state will be checked by the response loop and written to the response state if streaming. If not streaming, cancellation state needs to be checked here. is_cancelled response_stateis_cancelled if not stream is_cancelled response_sender.is_cancelled if is_cancelled self.logger.log_infovllm Cancelling the request await self._llm_engine.abortrequest_id self.logger.log_infovllm Successfully cancelled the request if stream Add cancelled final response to response loop. | |
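The request path in _generate above turns the serialized sampling_parameters JSON into a vllm.SamplingParams object and, optionally, a LoRARequest; a minimal sketch assuming vllm is installed (build_request is an illustrative helper, not the backend's API):
import json

from vllm import SamplingParams
from vllm.lora.request import LoRARequest

def build_request(params_json: str, supported_loras: list, lora_repository: dict):
    params = json.loads(params_json)  # e.g. '{"temperature": 0, "lora_name": "lora-a"}'
    lora_name = params.pop("lora_name", None)
    sampling_params = SamplingParams(**params)
    lora_request = None
    if lora_name is not None:
        lora_int_id = supported_loras.index(lora_name) + 1  # LoRA ids start at 1
        lora_request = LoRARequest(str(lora_int_id), lora_int_id, lora_repository[lora_name])
    return sampling_params, lora_request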
--- | |
response_statelast_response_generated True response pb_utils.InferenceResponse errorpb_utils.TritonError messageRequest was cancelled, codepb_utils.TritonError.CANCELLED, flags pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL decrement_ongoing_request_count False self._response_queue.put_nowait response_state, response, flags break Send each response if streaming. if stream response self._create_response request_output_state, request_output, prepend_inputFalse, additional_outputsadditional_outputs, flags 0 if request_output.finished response_statelast_response_generated True flags pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL decrement_ongoing_request_count False self._response_queue.put_nowaitresponse_state, response, flags Send the last response which contains all the outputs if not streaming. if not stream response_sender.send self._create_response request_output_state, request_outputrequest_output, prepend_inputprepend_input, additional_outputsadditional_outputs, , flagspb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL, except Exception as e self.logger.log_errorfvllm Error generating stream e error pb_utils.TritonErrorfError generating stream e text_output_tensor pb_utils.Tensor text_output, np.asarrayNA, dtypeself.output_dtype response pb_utils.InferenceResponse output_tensorstext_output_tensor, errorerror response_sender.send response, flagspb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL raise e finally if decrement_ongoing_request_count self._ongoing_request_count 1 def _get_input_tensorsself, request prompt prompt pb_utils.get_input_tensor_by_namerequest, text_input.as_numpy0 if isinstanceprompt, bytes prompt prompt.decodeutf8 image images pb_utils.get_input_tensor_by_namerequest, image if images images_vllm for image_np in images.as_numpy image_b base64.b64decodeimage_np.decodeutf8 image_rgb Image.openBytesIOimage_b.convertRGB images_vllm.appendimage_rgb if lenimages_vllm 0 prompt prompt prompt, multi_modal_data image images_vllm, stream stream pb_utils.get_input_tensor_by_namerequest, stream if stream stream stream.as_numpy0 else stream False prepend_input exclude_input_in_output prepend_input pb_utils.get_input_tensor_by_name request, exclude_input_in_output if prepend_input When exclude_input_in_output is False, we want to prepend input prompt to output, thus prepend_input should be True, and vice versa. | |
--- | |
prepend_input not prepend_input.as_numpy0 elif prepend_input is None and stream prepend_input False else prepend_input True if prepend_input and stream raise ValueError When streaming, exclude_input_in_output False is not allowed. parameters sampling_parameters An alternative mechanism to receive serialized parameters as an input tensor, because request parameters are not yet supported via BLS. sampling_parameters pb_utils.get_input_tensor_by_name request, sampling_parameters if sampling_parameters parameters sampling_parameters.as_numpy0.decodeutf8 else parameters request.parameters additional outputs additional_outputs return_finish_reason None, return_cumulative_logprob None, return_logprobs None, return_num_input_tokens None, return_num_output_tokens None, for tensor_name in additional_outputs.keys tensor pb_utils.get_input_tensor_by_namerequest, tensor_name if tensor tensor booltensor.as_numpy0 else tensor False additional_outputstensor_name tensor return prompt, stream, prepend_input, parameters, additional_outputs def _create_response self, request_output_state, request_output, prepend_input, additional_outputs output_tensors text_output prepend_prompt if prev_lens_text_output not in request_output_state this is the first response if prepend_input prepend_prompt request_output.prompt request_output_stateprev_lens_text_output 0 len request_output.outputs prev_lens request_output_stateprev_lens_text_output text_output prepend_prompt output.textprev_len.encodeutf8 for output, prev_len in ziprequest_output.outputs, prev_lens request_output_stateprev_lens_text_output lenoutput.text for output in request_output.outputs output_tensors.append pb_utils.Tensor text_output, np.asarraytext_output, dtypeself.output_dtype finish_reason if additional_outputsreturn_finish_reason finish_reason stroutput.finish_reason for output in request_output.outputs output_tensors.append pb_utils.Tensor finish_reason, np.asarrayfinish_reason, dtypenp.object_ cumulative_logprob if additional_outputsreturn_cumulative_logprob cumulative_logprob output.cumulative_logprob for output in request_output.outputs output_tensors.append pb_utils.Tensor cumulative_logprob, np.asarraycumulative_logprob, dtypenp.float32, logprobs httpsgithub.comvllmprojectvllmblobv0.6.3.post1vllmsequence.pyL37L58 if additional_outputsreturn_logprobs if prev_lens_logprobs not in request_output_state request_output_stateprev_lens_logprobs 0 len request_output.outputs logprobs for i in rangelenrequest_output.outputs output request_output.outputsi if output.logprobs is None logprobs.appendnull.encodeutf8 continue prev_len request_output_stateprev_lens_logprobsi request_output_stateprev_lens_logprobsi lenoutput.logprobs logprobs_py for logprob_d_vllm in output.logprobsprev_len logprob_d_py for token_id, logprob_vllm in logprob_d_vllm.items logprob_d_pytoken_id logprob logprob_vllm.logprob, rank logprob_vllm.rank, decoded_token logprob_vllm.decoded_token, logprobs_py.appendlogprob_d_py logprobs.appendjson.dumpslogprobs_py.encodeutf8 output_tensors.append pb_utils.Tensorlogprobs, np.asarraylogprobs, dtypenp.object_ num_input_tokens if additional_outputsreturn_num_input_tokens num_input_tokens lenrequest_output.prompt_token_ids output_tensors.append pb_utils.Tensor num_input_tokens, np.asarraynum_input_tokens, dtypenp.uint32 num_output_tokens if additional_outputsreturn_num_output_tokens if prev_lens_num_output_tokens not in request_output_state request_output_stateprev_lens_num_output_tokens 0 len request_output.outputs prev_lens 
request_output_stateprev_lens_num_output_tokens num_output_tokens lenoutput.token_ids prev_len for output, prev_len in ziprequest_output.outputs, prev_lens request_output_stateprev_lens_num_output_tokens lenoutput.token_ids for output in request_output.outputs output_tensors.append pb_utils.Tensor num_output_tokens, np.asarraynum_output_tokens, dtypenp.uint32 return pb_utils.InferenceResponseoutput_tensorsoutput_tensors def _get_sampling_params_dictself, params_json params_dict json.loadsparams_json Special parsing for the supported sampling parameters bool_keys ignore_eos, skip_special_tokens, use_beam_search for k in bool_keys if k in params_dict params_dictk boolparams_dictk float_keys frequency_penalty, length_penalty, presence_penalty, temperature, top_p, for k in float_keys if k in params_dict params_dictk floatparams_dictk int_keys best_of, max_tokens, min_tokens, n, top_k for k in int_keys if k in params_dict params_dictk intparams_dictk return params_dict def _verify_lorasself, request We will check if the requested lora exists here, if not we will send a response with LoRA not found information. | |
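The _create_response logic above streams deltas by remembering how much of each candidate's text has already been sent and emitting only the new suffix; the idea in isolation (runnable, with illustrative names):
def next_text_deltas(output_texts, state):
    # state persists across responses for one request, as request_output_state does above.
    prev_lens = state.setdefault("prev_lens_text_output", [0] * len(output_texts))
    deltas = [text[prev:] for text, prev in zip(output_texts, prev_lens)]
    state["prev_lens_text_output"] = [len(text) for text in output_texts]
    return deltas

state = {}
print(next_text_deltas(["Hel"], state))    # ['Hel']
print(next_text_deltas(["Hello!"], state)) # ['lo!']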
--- | |
In this way we may avoid further processing. verified_request None lora_error None lora_name None parameters_input_tensor pb_utils.get_input_tensor_by_name request, sampling_parameters if parameters_input_tensor parameters parameters_input_tensor.as_numpy0.decodeutf8 sampling_params_dict self._get_sampling_params_dictparameters lora_name sampling_params_dict.poplora_name, None if lora_name is not None if not self.enable_lora lora_error pb_utils.TritonErrorLoRA feature is not enabled. self.logger.log_info vllm LoRA is not enabled, please restart the backend with LoRA enabled. | |
--- | |
elif lora_name not in self.supported_loras lora_error pb_utils.TritonError fLoRA lora_name is not supported, we currently support self.supported_loras self.logger.log_infofvllm LoRA lora_name not found. if lora_error is not None output_tensor pb_utils.Tensor text_output, np.asarrayError Unsupported LoRA., dtypeself.output_dtype, response pb_utils.InferenceResponse output_tensorsoutput_tensor, errorlora_error response_sender request.get_response_sender response_sender.send response, flagspb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL else verified_request request return verified_request def _check_healthself, requests coro self._llm_engine.check_health future asyncio.run_coroutine_threadsafecoro, self._event_loop try future.result except Exception as e self.logger.log_error fvllm Engine is not healthy and model will be unloaded e pb_utils.unload_modelself.model_configname nonblocking self._is_healthy False if not self._is_healthy for request in requests request.get_response_sender.send pb_utils.InferenceResponse errorpb_utils.TritonError messageModel is unavailable due to unhealthy vLLM engine, codepb_utils.TritonError.UNAVAILABLE, , flagspb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL, return self._is_healthy def finalizeself self.logger.log_infovllm Issuing finalize to vllm backend self._event_loop.call_soon_threadsafeself._llm_engine_shutdown_event.set Shutdown the event thread. if self._event_thread is not None self._event_thread.join self._event_thread None Shutdown the response thread. | |
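The health check above bridges Triton's synchronous execute() thread and the engine's asyncio loop with run_coroutine_threadsafe; a reduced sketch (check_engine_health is an illustrative name):
import asyncio

def check_engine_health(engine, event_loop, log_error) -> bool:
    # engine.check_health() is a coroutine; run it on the engine's captured event loop.
    future = asyncio.run_coroutine_threadsafe(engine.check_health(), event_loop)
    try:
        future.result()  # raises if the engine reports itself unhealthy
        return True
    except Exception as e:
        log_error(f"vllm: engine is not healthy: {e}")
        return False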
--- | |
self._response_queue.putNone if self._response_thread is not None self._response_thread.join self._response_thread None Shutdown the metrics thread. if self._vllm_metrics is not None self._vllm_metrics.finalize When using parallel tensors, the stub process may not shutdown due to unreleased references, so manually run the garbage collector once. self.logger.log_infovllm Running Garbage Collector on finalize... | |
--- | |
gc.collect self.logger.log_infovllm Garbage Collector on finalize... done File .gitignore Bytecompiled optimized DLL files __pycache__ .pycod py.class C extensions .so Distribution packaging .Python build developeggs dist downloads eggs .eggs lib lib64 parts sdist var wheels sharepythonwheels .egginfo .installed.cfg .egg MANIFEST PyInstaller Usually these files are written by a python script from a template before PyInstaller builds the exe, so as to inject dateother infos into it. .manifest .spec Installer logs piplog.txt pipdeletethisdirectory.txt Unit test coverage reports htmlcov .tox .nox .coverage .coverage. | |
--- | |
.cache nosetests.xml coverage.xml .cover .py,cover .hypothesis .pytest_cache cover .out Translations .mo .pot Django stuff .log local_settings.py db.sqlite3 db.sqlite3journal Flask stuff instance .webassetscache Scrapy stuff .scrapy Sphinx documentation docs_build PyBuilder .pybuilder target Jupyter Notebook .ipynb_checkpoints IPython profile_default ipython_config.py pyenv For a library or package, you might want to ignore these files since the code is intended to run in multiple environments otherwise, check them in .pythonversion pipenv According to pypapipenv598, it is recommended to include Pipfile.lock in version control. However, in case of collaboration, if having platformspecific dependencies or dependencies having no crossplatform support, pipenv may install dependencies that dont work, or not install all needed dependencies. Pipfile.lock poetry Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control. | |
--- | |
This is especially recommended for binary packages to ensure reproducibility, and is more commonly ignored for libraries. httpspythonpoetry.orgdocsbasicusagecommityourpoetrylockfiletoversioncontrol poetry.lock pdm Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control. pdm.lock pdm stores projectwide configurations in .pdm.toml, but it is recommended to not include it in version control. | |
--- | |
httpspdm.fming.devusewithide .pdm.toml PEP 582 used by e.g. github.comDavidOConnorpyflow and github.compdmprojectpdm __pypackages__ Celery stuff celerybeatschedule celerybeat.pid SageMath parsed files .sage.py Environments .env .venv env venv ENV env.bak venv.bak Spyder project settings .spyderproject .spyproject Rope project settings .ropeproject mkdocs documentation site mypy .mypy_cache .dmypy.json dmypy.json Pyre type checker .pyre pytype static type analyzer .pytype Cython debug symbols cython_debug Test result files Miniconda miniconda vllm_env.tar.gz triton_python_backend_stub python_backend results.txt .log PyCharm JetBrains specific template is maintained in a separate JetBrains.gitignore that can be found at httpsgithub.comgithubgitignoreblobmainGlobalJetBrains.gitignore and can be added to the global gitignore or merged into this file. For a more nuclear option not recommended you can uncomment the following to ignore the entire idea folder. | |
--- | |
.idea File .precommitconfig.yaml Copyright 2023, NVIDIA CORPORATION AFFILIATES. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. | |
--- | |
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. | |
--- | |
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. repos repo httpsgithub.comtimothycrosleyisort rev 5.12.0 hooks id isort additional_dependencies toml repo httpsgithub.compsfblack rev 23.1.0 hooks id black types_or python, cython repo httpsgithub.comPyCQAflake8 rev 5.0.4 hooks id flake8 args maxlinelength88, selectC,E,F,W,B,B950, extendignore E203,E501 types_or python, cython repo httpsgithub.comprecommitmirrorsclangformat rev v16.0.5 hooks id clangformat types_or c, c, cuda, proto, textproto, java args fallbackstylenone, stylefile, i repo httpsgithub.comcodespellprojectcodespell rev v2.2.4 hooks id codespell additional_dependencies tomli args toml, pyproject.toml exclude ?x.stemmer..stop_words.CHANGELOG.md More details about these precommit hooks here httpsprecommit.comhooks.html repo httpsgithub.comprecommitprecommithooks rev v4.4.0 hooks id checkcaseconflict id checkexecutableshaveshebangs id checkmergeconflict id checkjson id checktoml id checkyaml exclude deploytemplates. id checkshebangscriptsareexecutable id endoffilefixer types_or c, c, cuda, proto, textproto, java, python id mixedlineending id requirementstxtfixer id trailingwhitespace File LICENSE BSD 3Clause License Copyright c 2023, Triton Inference Server Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met 1. | |
--- | |
Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. | |
--- | |
3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. | |
--- | |
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. File pyproject.toml Copyright 2023, NVIDIA CORPORATION AFFILIATES. All rights reserved. | |
--- | |
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. | |
--- | |
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. tool.codespell note precommit passes explicit lists of files here, which this skip file list doesnt override this is only to allow you to run codespell interactively skip ..git,..github ignore short words, and typename parameters like OffsetT ignoreregex b.1,4AZwTb use the clear dictionary for unambiguous spelling mistakes builtin clear disable warnings about binary files and wrong encoding quietlevel 3 tool.isort profile black use_parentheses true multi_line_output 3 include_trailing_comma true force_grid_wrap 0 ensure_newline_before_comments true line_length 88 balanced_wrapping true indent skip build File README.md ! | |
--- | |
Copyright 20232024, NVIDIA CORPORATION AFFILIATES. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. | |
--- | |
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation andor other materials provided with the distribution. Neither the name of NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. | |
--- | |
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE, DATA, OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. !Licensehttpsimg.shields.iobadgeLicenseBSD3lightgrey.svghttpsopensource.orglicensesBSD3Clause vLLM Backend The Triton backend for vLLMhttpsgithub.comvllmprojectvllm is designed to run supported modelshttpsvllm.readthedocs.ioenlatestmodelssupported_models.html on a vLLM enginehttpsgithub.comvllmprojectvllmblobmainvllmengineasync_llm_engine.py. You can learn more about Triton backends in the backend repohttpsgithub.comtritoninferenceserverbackend. | |
--- | |
This is a Python-based backend (https://github.com/triton-inference-server/backend/blob/main/docs/python_based_backends.md#python-based-backends). When using this backend, all requests are placed on the vLLM AsyncEngine as soon as they are received. In-flight batching and paged attention are handled by the vLLM engine. | |
--- | |
Where can I ask general questions about Triton and Triton backends? Be sure to read all the information below as well as the general Triton documentation available in the main server repo (https://github.com/triton-inference-server/server). If you don't find your answer there, you can ask questions on the main Triton issues page (https://github.com/triton-inference-server/server/issues). | |
--- | |
Installing the vLLM Backend There are several ways to install and deploy the vLLM backend. Option 1. Use the Pre-Built Docker Container. | |
--- | |
Pull a tritonserver:<xx.yy>-vllm-python-py3 container with the vLLM backend from the NGC registry (https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver). <xx.yy> is the version of Triton that you want to use. Please note that Triton's vLLM container has been available starting from the 23.10 release. | |
--- | |
docker pull nvcr.io/nvidia/tritonserver:<xx.yy>-vllm-python-py3
Option 2. Build a Custom Container From Source You can follow the steps described in the Building With Docker guide (https://github.com/triton-inference-server/server/blob/main/docs/customization_guide/build.md#building-with-docker) and use the build.py script (https://github.com/triton-inference-server/server/blob/main/build.py). A sample command to build a Triton Server container with all options enabled is shown below. | |
--- | |
Feel free to customize the flags according to your needs. Please use the NGC registry (https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver/tags) to get the latest version of the Triton vLLM container, which corresponds to the latest YY.MM (year.month) Triton release (https://github.com/triton-inference-server/server/releases). YY.MM is the version of Triton. | |
--- | |
export TRITON_CONTAINER_VERSION=<YY.MM>
./build.py -v --enable-logging --enable-stats --enable-tracing --enable-metrics --enable-gpu-metrics --enable-cpu-metrics --enable-gpu --filesystem=gcs --filesystem=s3 --filesystem=azure_storage --endpoint=http --endpoint=grpc --endpoint=sagemaker --endpoint=vertexai --upstream-container-version=${TRITON_CONTAINER_VERSION} --backend=python:r${TRITON_CONTAINER_VERSION} --backend=vllm:r${TRITON_CONTAINER_VERSION} --backend=ensemble
Option 3. Add the vLLM Backend to the Default Triton Container You can install the vLLM backend directly into the NGC Triton container. In this case, please install vLLM first. | |
--- | |
You can do so by running pip install vllm==<vLLM_version>. Then, set up the vLLM backend in the container with the following commands:
mkdir -p /opt/tritonserver/backends/vllm
git clone https://github.com/triton-inference-server/vllm_backend.git /tmp/vllm_backend
cp -r /tmp/vllm_backend/src/* /opt/tritonserver/backends/vllm
Using the vLLM Backend You can see an example model_repository in the samples folder. You can use it as is and change the model by changing the model value in model.json. | |
--- | |
model.json represents a key-value dictionary that is fed to vLLM's AsyncLLMEngine when initializing the model. You can see the supported arguments in vLLM's arg_utils.py (https://github.com/vllm-project/vllm/blob/main/vllm/engine/arg_utils.py), specifically around https://github.com/vllm-project/vllm/blob/ee8217e5bee5860469204ee57077a91138c9af02/vllm/engine/arg_utils.py#L11 and #L201. A minimal example is sketched below. | |
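For illustration, a minimal model.json along the lines of the sample shipped in this repository might look like the following (the model name and values are placeholders, not a prescription):
{
    "model": "facebook/opt-125m",
    "disable_log_requests": true,
    "gpu_memory_utilization": 0.5,
    "enforce_eager": true
}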
--- | |
For multi-GPU support, EngineArgs like tensor_parallel_size can be specified in model.json (samples/model_repository/vllm_model/1/model.json). Note: vLLM greedily consumes up to 90% of the GPU's memory under default settings. The sample model changes this behavior by setting gpu_memory_utilization to 50%. | |
--- | |
You can tweak this behavior using fields like gpu_memory_utilization and other settings in model.json (samples/model_repository/vllm_model/1/model.json). Launching Triton Inference Server Once you have the model repository set up, it is time to launch the Triton server. We will use the pre-built Triton container with the vLLM backend (see Option 1 above) from NGC (https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver) in this example. | |
--- | |
docker run --gpus all -it --net=host --rm -p 8001:8001 --shm-size=1G --ulimit memlock=-1 --ulimit stack=67108864 -v ${PWD}:/work -w /work nvcr.io/nvidia/tritonserver:<xx.yy>-vllm-python-py3 tritonserver --model-repository ./model_repository
Replace <xx.yy> with the version of Triton that you want to use. Note that Triton's vLLM container was first published with the 23.10 release. After you start Triton you will see output on the console showing the server starting up and loading the model. | |
--- | |
When you see output like the following, Triton is ready to accept inference requests.
I1030 22:33:28.291908 1 grpc_server.cc:2513] Started GRPCInferenceService at 0.0.0.0:8001
I1030 22:33:28.292879 1 http_server.cc:4497] Started HTTPService at 0.0.0.0:8000
I1030 22:33:28.335154 1 http_server.cc:270] Started Metrics Service at 0.0.0.0:8002
Sending Your First Inference After you start Triton (https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/getting_started/quickstart.html) with the sample model_repository, you can quickly run your first inference request with the generate endpoint (https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_generate.md). Try out the command below. | |
--- | |
curl -X POST localhost:8000/v2/models/vllm_model/generate -d '{"text_input": "What is Triton Inference Server?", "parameters": {"stream": false, "temperature": 0}}'
Upon success, you should see a response from the server like this one:
{"model_name":"vllm_model","model_version":"1","text_output":"What is Triton Inference Server?\n\nTriton Inference Server is a server that is used by many"}
In the samples folder, you can also find a sample client, client.py, which uses Triton's asyncio gRPC client library (beta) to run inference on Triton. Running the Latest vLLM Version You can check the vLLM version included in Triton Inference Server from the Framework Containers Support Matrix (https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html). Note: the vLLM Triton Inference Server container has been available starting from the 23.10 release. | |
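Returning to the generate endpoint above, the same request can also be sent from Python; a small sketch, assuming the requests package is installed and Triton is listening on localhost:8000:
import requests

response = requests.post(
    "http://localhost:8000/v2/models/vllm_model/generate",
    json={
        "text_input": "What is Triton Inference Server?",
        "parameters": {"stream": False, "temperature": 0},
    },
)
response.raise_for_status()
print(response.json()["text_output"])  # generated continuation of the prompt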
--- | |
You can use pip install ... within the container to upgrade the vLLM version. Running Multiple Instances of Triton Server If you are running multiple instances of Triton server with a Python-based backend, you need to specify a different shm-region-prefix-name for each server. | |
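For example, two servers might be launched with distinct prefixes (hypothetical values; the flag is described in the python_backend documentation linked below):
tritonserver --model-repository ./model_repository --backend-config=python,shm-region-prefix-name=prefix0 ...
tritonserver --model-repository ./model_repository --backend-config=python,shm-region-prefix-name=prefix1 ...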
--- | |
See https://github.com/triton-inference-server/python_backend#running-multiple-instances-of-triton-server for more information. Additional vLLM Outputs Additional vLLM outputs may be requested optionally on a per-request basis. See docs/additional_outputs.md for more information. | |
--- | |
Triton Metrics Starting with the 24.08 release of Triton, users can obtain specific vLLM metrics by querying the Triton metrics endpoint (see the complete list of vLLM metrics at https://docs.vllm.ai/en/latest/serving/metrics.html). This can be accomplished by launching a Triton server in any of the ways described above (ensuring the build code / container is 24.08 or later) and querying the server. Upon receiving a successful response, you can query the metrics endpoint by entering the following:
curl localhost:8002/metrics
vLLM stats are reported by the metrics endpoint in fields that are prefixed with vllm:. | |
--- | |
Triton currently supports reporting of the following metrics from vLLM:
counter_prompt_tokens: Number of prefill tokens processed.
counter_generation_tokens: Number of generation tokens processed.
histogram_time_to_first_token: Histogram of time to first token in seconds.
histogram_time_per_output_token: Histogram of time per output token in seconds.
histogram_e2e_time_request: Histogram of end-to-end request latency in seconds.
histogram_num_prompt_tokens_request: Number of prefill tokens processed.
histogram_num_generation_tokens_request: Number of generation tokens processed.
histogram_best_of_request: Histogram of the best_of request parameter.
histogram_n_request: Histogram of the n request parameter.
Your output for these fields should look similar to the following:
# HELP vllm:prompt_tokens_total Number of prefill tokens processed.
# TYPE vllm:prompt_tokens_total counter
vllm:prompt_tokens_total{model="vllm_model",version="1"} 10
# HELP vllm:generation_tokens_total Number of generation tokens processed.
# TYPE vllm:generation_tokens_total counter
vllm:generation_tokens_total{model="vllm_model",version="1"} 16
# HELP vllm:time_to_first_token_seconds Histogram of time to first token in seconds.
# TYPE vllm:time_to_first_token_seconds histogram
vllm:time_to_first_token_seconds_count{model="vllm_model",version="1"} 1
vllm:time_to_first_token_seconds_sum{model="vllm_model",version="1"} 0.03233122825622559
vllm:time_to_first_token_seconds_bucket{model="vllm_model",version="1",le="0.001"} 0
...
vllm:time_to_first_token_seconds_bucket{model="vllm_model",version="1",le="+Inf"} 1
# HELP vllm:time_per_output_token_seconds Histogram of time per output token in seconds.
# TYPE vllm:time_per_output_token_seconds histogram
vllm:time_per_output_token_seconds_count{model="vllm_model",version="1"} 15
vllm:time_per_output_token_seconds_sum{model="vllm_model",version="1"} 0.04501533508300781
vllm:time_per_output_token_seconds_bucket{model="vllm_model",version="1",le="0.01"} 14
...
vllm:time_per_output_token_seconds_bucket{model="vllm_model",version="1",le="+Inf"} 15
# HELP vllm:e2e_request_latency_seconds Histogram of end to end request latency in seconds.
# TYPE vllm:e2e_request_latency_seconds histogram
vllm:e2e_request_latency_seconds_count{model="vllm_model",version="1"} 1
vllm:e2e_request_latency_seconds_sum{model="vllm_model",version="1"} 0.08686184883117676
vllm:e2e_request_latency_seconds_bucket{model="vllm_model",version="1",le="1"} 1
...
vllm:e2e_request_latency_seconds_bucket{model="vllm_model",version="1",le="+Inf"} 1
# HELP vllm:request_prompt_tokens Number of prefill tokens processed.
# TYPE vllm:request_prompt_tokens histogram
vllm:request_prompt_tokens_count{model="vllm_model",version="1"} 1
vllm:request_prompt_tokens_sum{model="vllm_model",version="1"} 10
vllm:request_prompt_tokens_bucket{model="vllm_model",version="1",le="1"} 0
...
vllm:request_prompt_tokens_bucket{model="vllm_model",version="1",le="+Inf"} 1
# HELP vllm:request_generation_tokens Number of generation tokens processed.
# TYPE vllm:request_generation_tokens histogram
vllm:request_generation_tokens_count{model="vllm_model",version="1"} 1
vllm:request_generation_tokens_sum{model="vllm_model",version="1"} 16
vllm:request_generation_tokens_bucket{model="vllm_model",version="1",le="1"} 0
...
vllm:request_generation_tokens_bucket{model="vllm_model",version="1",le="+Inf"} 1
# HELP vllm:request_params_best_of Histogram of the best_of request parameter.
# TYPE vllm:request_params_best_of histogram
vllm:request_params_best_of_count{model="vllm_model",version="1"} 1
vllm:request_params_best_of_sum{model="vllm_model",version="1"} 1
vllm:request_params_best_of_bucket{model="vllm_model",version="1",le="1"} 1
...
vllm:request_params_best_of_bucket{model="vllm_model",version="1",le="+Inf"} 1
# HELP vllm:request_params_n Histogram of the n request parameter.
# TYPE vllm:request_params_n histogram
vllm:request_params_n_count{model="vllm_model",version="1"} 1
vllm:request_params_n_sum{model="vllm_model",version="1"} 1
vllm:request_params_n_bucket{model="vllm_model",version="1",le="1"} 1
...
vllm:request_params_n_bucket{model="vllm_model",version="1",le="+Inf"} 1
To enable the vLLM engine to collect metrics, the disable_log_stats option needs to be either false or left empty (false by default) in model.json (https://github.com/triton-inference-server/vllm_backend/blob/main/samples/model_repository/vllm_model/1/model.json). | |
--- | |
"disable_log_stats": false
Note: vLLM metrics are not reported to the Triton metrics server by default due to potential performance slowdowns. To enable metrics reporting for a vLLM model, please add the following lines to its config.pbtxt as well.
parameters: {
  key: "REPORT_CUSTOM_METRICS"
  value: {
    string_value: "true"
  }
}
vLLM Engine Health Check (BETA) The vLLM Engine Health Check may be enabled optionally, for more accurate model state reported by the server. | |
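Based on the ENABLE_VLLM_HEALTH_CHECK parameter read in model.py above, enabling it presumably means adding a parameter block like the following to the model's config.pbtxt (an assumption here; see the health check doc referenced below for the authoritative instructions):
parameters: {
  key: "ENABLE_VLLM_HEALTH_CHECK"
  value: {
    string_value: "true"
  }
}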
--- | |
See docs/health_check.md for more information. Referencing the Tutorial You can read further in the vLLM Quick Deploy guide (https://github.com/triton-inference-server/tutorials/tree/main/Quick_Deploy/vLLM) in the tutorials repository (https://github.com/triton-inference-server/tutorials). | |
--- | |