Tritonserver tritonserver:23.02-py3 flags
Since this is nowhere else to be found, here is the full output of tritonserver --help from the tritonserver:23.02-py3 container.
=============================
== Triton Inference Server ==
=============================
NVIDIA Release 23.02 (build 53616260)
Triton Server Version 2.31.0
Copyright (c) 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available.
Use the NVIDIA Container Toolkit to start this container with GPU support; see
https://docs.nvidia.com/datacenter/cloud-native/ .
root@611a66e9c322:/opt/tritonserver# tritonserver --help
Usage: tritonserver [options]
--help
  Print usage
--log-verbose <integer>
  Set verbose logging level. Zero (0) disables verbose logging
  and values >= 1 enable verbose logging.
--log-info <boolean>
  Enable/disable info-level logging.
--log-warning <boolean>
  Enable/disable warning-level logging.
--log-error <boolean>
  Enable/disable error-level logging.
--log-format <string>
  Set the logging format. Options are "default" and "ISO8601".
  The default is "default". For "default", the log severity (L) and
  timestamp will be logged as "LMMDD hh:mm:ss.ssssss". For "ISO8601",
  the log format will be "YYYY-MM-DDThh:mm:ssZ L".
--log-file <string>
  Set the name of the log output file. If specified, log
  outputs will be saved to this file. If not specified, log outputs will
  stream to the console.
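
For example, a hypothetical invocation that enables verbose logging with ISO8601 timestamps and writes the log to a file; the repository and log paths are placeholders:

  tritonserver --model-repository=/models \
      --log-verbose=1 \
      --log-format=ISO8601 \
      --log-file=/tmp/triton.log
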
--id <string>
  Identifier for this server.
--model-store <string>
  Equivalent to --model-repository.
--model-repository <string>
  Path to model repository directory. It may be specified
  multiple times to add multiple model repositories. Note that if a model
  is not unique across all model repositories at any time, the model
  will not be available.
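
Because --model-repository can be repeated, one server can serve several repositories at once; a sketch with two assumed directories:

  tritonserver --model-repository=/models/common \
      --model-repository=/models/experimental
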
--exit-on-error <boolean>
  Exit the inference server if an error occurs during
  initialization.
--disable-auto-complete-config
  If set, disables Triton and the backends from auto-completing
  model configuration files. Model configuration files must be
  provided and all required configuration settings must be specified.
--strict-model-config <boolean>
  DEPRECATED: If true, model configuration files must be
  provided and all required configuration settings must be specified. If
  false, the model configuration may be absent or only partially specified
  and the server will attempt to derive the missing required
  configuration.
--strict-readiness <boolean>
  If true, the /v2/health/ready endpoint indicates ready if the
  server is responsive and all models are available. If false, the
  /v2/health/ready endpoint indicates ready if the server is responsive even if
  some/all models are unavailable.
--allow-http <boolean>
  Allow the server to listen for HTTP requests.
--http-port <integer>
  The port for the server to listen on for HTTP requests.
--reuse-http-port <boolean>
  Allow multiple servers to listen on the same HTTP port when
  every server has this option set. If you plan to use this option as
  a way to load balance between different Triton servers, the same
  model repository or set of models must be used for every server.
--http-address <string>
  The address for the HTTP server to bind to.
--http-thread-count <integer>
  Number of threads handling HTTP requests.
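
A sketch of a server whose HTTP endpoint is bound to localhost on a non-default port with extra worker threads; the address, port and thread count are illustrative choices, not recommendations:

  tritonserver --model-repository=/models \
      --http-address=127.0.0.1 \
      --http-port=8500 \
      --http-thread-count=16
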
--allow-grpc <boolean>
  Allow the server to listen for GRPC requests.
--grpc-port <integer>
  The port for the server to listen on for GRPC requests.
--reuse-grpc-port <boolean>
  Allow multiple servers to listen on the same GRPC port when
  every server has this option set. If you plan to use this option as
  a way to load balance between different Triton servers, the same
  model repository or set of models must be used for every server.
--grpc-address <string>
  The address for the GRPC server to bind to.
--grpc-infer-allocation-pool-size <integer>
  The maximum number of inference request/response objects
  that remain allocated for reuse. As long as the number of in-flight
  requests doesn't exceed this value there will be no
  allocation/deallocation of request/response objects.
--grpc-use-ssl <boolean>
  Use SSL authentication for GRPC requests. Default is false.
--grpc-use-ssl-mutual <boolean>
  Use mutual SSL authentication for GRPC requests. Default is
  false.
--grpc-server-cert <string>
  File holding PEM-encoded server certificate. Ignored unless
  --grpc-use-ssl is true.
--grpc-server-key <string>
  File holding PEM-encoded server key. Ignored unless
  --grpc-use-ssl is true.
--grpc-root-cert <string>
  File holding PEM-encoded root certificate. Ignored unless
  --grpc-use-ssl is true.
--grpc-infer-response-compression-level <string>
  The compression level to be used while returning the infer
  response to the peer. Allowed values are none, low, medium and high.
  By default, compression level is selected as none.
--grpc-keepalive-time <integer>
  The period (in milliseconds) after which a keepalive ping is
  sent on the transport. Default is 7200000 (2 hours).
--grpc-keepalive-timeout <integer>
  The period (in milliseconds) the sender of the keepalive
  ping waits for an acknowledgement. If it does not receive an
  acknowledgment within this time, it will close the connection. Default is
  20000 (20 seconds).
--grpc-keepalive-permit-without-calls <boolean>
  Allows keepalive pings to be sent even if there are no calls
  in flight (0 : false; 1 : true). Default is 0 (false).
--grpc-http2-max-pings-without-data <integer>
  The maximum number of pings that can be sent when there is
  no data/header frame to be sent. gRPC Core will not continue sending
  pings if we run over the limit. Setting it to 0 allows sending pings
  without such a restriction. Default is 2.
--grpc-http2-min-recv-ping-interval-without-data <integer>
  If there are no data/header frames being sent on the
  transport, this channel argument on the server side controls the minimum
  time (in milliseconds) that gRPC Core would expect between receiving
  successive pings. If the time between successive pings is less than
  this time, then the ping will be considered a bad ping from the peer.
  Such a ping counts as a 'ping strike'. Default is 300000 (5
  minutes).
--grpc-http2-max-ping-strikes <integer>
  Maximum number of bad pings that the server will tolerate
  before sending an HTTP2 GOAWAY frame and closing the transport.
  Setting it to 0 allows the server to accept any number of bad pings.
  Default is 2.
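
A sketch that enables TLS on the GRPC endpoint and tightens keepalive behavior; the certificate paths are placeholders and the keepalive values are arbitrary examples:

  tritonserver --model-repository=/models \
      --grpc-port=8001 \
      --grpc-use-ssl=true \
      --grpc-server-cert=/certs/server.crt \
      --grpc-server-key=/certs/server.key \
      --grpc-root-cert=/certs/ca.crt \
      --grpc-keepalive-time=60000 \
      --grpc-keepalive-timeout=10000
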
--allow-sagemaker <boolean>
  Allow the server to listen for SageMaker requests. Default
  is false.
--sagemaker-port <integer>
  The port for the server to listen on for SageMaker requests.
  Default is 8080.
--sagemaker-safe-port-range <<integer>-<integer>>
  Set the allowed port range for endpoints other than the
  SageMaker endpoints.
--sagemaker-thread-count <integer>
  Number of threads handling SageMaker requests. Default is 8.
--allow-vertex-ai <boolean>
  Allow the server to listen for Vertex AI requests. Default
  is true if AIP_MODE=PREDICTION, false otherwise.
--vertex-ai-port <integer>
  The port for the server to listen on for Vertex AI requests.
  Default is AIP_HTTP_PORT if set, 8080 otherwise.
--vertex-ai-thread-count <integer>
  Number of threads handling Vertex AI requests. Default is 8.
--vertex-ai-default-model <string>
  The name of the model to use for single-model inference
  requests.
--allow-metrics <boolean>
  Allow the server to provide Prometheus metrics.
--allow-gpu-metrics <boolean>
  Allow the server to provide GPU metrics. Ignored unless
  --allow-metrics is true.
--allow-cpu-metrics <boolean>
  Allow the server to provide CPU metrics. Ignored unless
  --allow-metrics is true.
--metrics-port <integer>
  The port reporting Prometheus metrics.
--metrics-interval-ms <float>
  Metrics will be collected once every <metrics-interval-ms>
  milliseconds. Default is 2000 milliseconds.
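
A sketch that keeps Prometheus metrics enabled but turns off GPU metrics (sensible for a CPU-only run like the one above) and samples every second; the port and interval shown are illustrative:

  tritonserver --model-repository=/models \
      --allow-metrics=true \
      --allow-gpu-metrics=false \
      --metrics-port=8002 \
      --metrics-interval-ms=1000
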
--trace-file <string>
  Set the file where trace output will be saved. If
  --trace-log-frequency is also specified, this argument value will be the
  prefix of the files to save the trace output. See --trace-log-frequency
  for detail.
--trace-level <string>
  Specify a trace level. OFF to disable tracing, TIMESTAMPS to
  trace timestamps, TENSORS to trace tensors. It may be specified
  multiple times to enable multiple trace levels. Default is OFF.
--trace-rate <integer>
  Set the trace sampling rate. Default is 1000.
--trace-count <integer>
  Set the number of traces to be sampled. If the value is -1,
  the number of traces to be sampled will not be limited. Default is
  -1.
--trace-log-frequency <integer>
  Set the trace log frequency. If the value is 0, Triton will
  only log the trace output to <trace-file> when shutting down.
  Otherwise, Triton will log the trace output to <trace-file>.<idx> when it
  collects the specified number of traces. For example, if the log
  frequency is 100, when Triton collects the 100th trace, it logs the
  traces to file <trace-file>.0, and when it collects the 200th trace,
  it logs the 101st through 200th traces to file <trace-file>.1.
  Default is 0.
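
A sketch of timestamp tracing in which roughly one of every 100 requests is traced and traces are flushed to numbered files after every 50 collected traces; the file prefix and numbers are assumptions:

  tritonserver --model-repository=/models \
      --trace-file=/tmp/trace.json \
      --trace-level=TIMESTAMPS \
      --trace-rate=100 \
      --trace-count=-1 \
      --trace-log-frequency=50
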
--model-control-mode <string>
  Specify the mode for model management. Options are "none",
  "poll" and "explicit". The default is "none". For "none", the server
  will load all models in the model repository(s) at startup and will
  not make any changes to the loaded models after that. For "poll", the
  server will poll the model repository(s) to detect changes and will
  load/unload models based on those changes. The poll rate is
  controlled by 'repository-poll-secs'. For "explicit", model load and unload
  is initiated by using the model control APIs, and only models
  specified with --load-model will be loaded at startup.
--repository-poll-secs <integer>
  Interval in seconds between each poll of the model
  repository to check for changes. Valid only when --model-control-mode=poll is
  specified.
--load-model <string>
  Name of the model to be loaded on server startup. It may be
  specified multiple times to add multiple models. To load ALL models
  at startup, specify '*' as the model name with --load-model=* as the
  ONLY --load-model argument; this does not imply any pattern
  matching. Specifying --load-model=* in conjunction with another
  --load-model argument will result in an error. Note that this option only
  takes effect if --model-control-mode=explicit is specified.
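
A sketch of explicit model control that pre-loads two hypothetically named models; any further loads or unloads would go through the model control APIs:

  tritonserver --model-repository=/models \
      --model-control-mode=explicit \
      --load-model=resnet50 \
      --load-model=bert_base
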
--rate-limit <string>
  Specify the mode for rate limiting. Options are
  "execution_count" and "off". The default is "off". For "execution_count", the
  server will determine the instance using the configured priority and the
  number of times the instance has been used to run inference. The
  inference will finally be executed once the required resources are
  available. For "off", the server will ignore any rate limiter config and
  run inference as soon as an instance is ready.
--rate-limit-resource <<string>:<integer>:<integer>>
  The number of resources available to the server. The format
  of this flag is
  --rate-limit-resource=<resource_name>:<count>:<device>. The <device> is optional and if not listed the count will be applied to
  every device. If the resource is specified as "GLOBAL" in the model
  configuration, the resource is considered shared among all the devices
  in the system. The <device> property is ignored for such resources.
  This flag can be specified multiple times to specify each resource
  and its availability. By default, the max across all instances
  that list the resource is selected as its availability. The values for
  this flag are case-insensitive.
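
A sketch of execution_count rate limiting with a made-up resource budget: resource R1 gets 4 units on device 0 and 2 units on device 1, following the <resource_name>:<count>:<device> format described above:

  tritonserver --model-repository=/models \
      --rate-limit=execution_count \
      --rate-limit-resource=R1:4:0 \
      --rate-limit-resource=R1:2:1
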
--pinned-memory-pool-byte-size <integer>
  The total byte size that can be allocated as pinned system
  memory. If GPU support is enabled, the server will allocate pinned
  system memory to accelerate data transfer between host and devices
  until it exceeds the specified byte size. If 'numa-node' is configured
  via --host-policy, a pinned system memory pool of this size will be
  allocated on each NUMA node. This option will not affect the
  allocation conducted by the backend frameworks. Default is 256 MB.
--cuda-memory-pool-byte-size <<integer>:<integer>>
  The total byte size that can be allocated as CUDA memory for
  the GPU device. If GPU support is enabled, the server will allocate
  CUDA memory to minimize data transfer between host and devices
  until it exceeds the specified byte size. This option will not affect
  the allocation conducted by the backend frameworks. The argument
  should be 2 integers separated by colons in the format <GPU device
  ID>:<pool byte size>. This option can be used multiple times, but only
  once per GPU device. Subsequent uses will overwrite previous uses for
  the same GPU device. Default is 64 MB.
--response-cache-byte-size <integer>
  The size in bytes to allocate for a request/response cache.
  When non-zero, Triton allocates the requested size in CPU memory and
  shares the cache across all inference requests and across all
  models. For a given model to use request caching, the model must enable
  request caching in the model configuration. By default, no model uses
  request caching even if the request cache is enabled with the
  --response-cache-byte-size flag. Default is 0.
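
A sketch that enlarges the pinned memory pool to 512 MB, gives GPU 0 a 128 MB CUDA pool, and reserves a 64 MB response cache; the sizes are arbitrary, and models still have to opt into caching in their own configuration:

  tritonserver --model-repository=/models \
      --pinned-memory-pool-byte-size=536870912 \
      --cuda-memory-pool-byte-size=0:134217728 \
      --response-cache-byte-size=67108864
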
--min-supported-compute-capability <float>
  The minimum supported CUDA compute capability. GPUs that
  don't support this compute capability will not be used by the server.
--exit-timeout-secs <integer>
  Timeout (in seconds) when exiting to wait for in-flight
  inferences to finish. After the timeout expires the server exits even if
  inferences are still in flight.
--backend-directory <string>
  The global directory searched for backend shared libraries.
  Default is '/opt/tritonserver/backends'.
--repoagent-directory <string>
  The global directory searched for repository agent shared
  libraries. Default is '/opt/tritonserver/repoagents'.
--buffer-manager-thread-count <integer>
  The number of threads used to accelerate copies and other
  operations required to manage input and output tensor contents.
  Default is 0.
--model-load-thread-count <integer>
  The number of threads used to concurrently load models in
  model repositories. Default is 2*<num_cpu_cores>.
--backend-config <<string>,<string>=<string>>
  Specify a backend-specific configuration setting. The format
  of this flag is --backend-config=<backend_name>,<setting>=<value>,
  where <backend_name> is the name of the backend, such as 'tensorrt'.
--host-policy <<string>,<string>=<string>>
  Specify a host policy setting associated with a policy name.
  The format of this flag is
  --host-policy=<policy_name>,<setting>=<value>. Currently supported settings are 'numa-node' and 'cpu-cores'.
  Note that the 'numa-node' setting affects pinned memory pool behavior;
  see --pinned-memory-pool-byte-size for more detail.
--model-load-gpu-limit <<device_id>:<fraction>>
  Specify the limit on GPU memory usage as a fraction. If
  model loading on the device is requested and the current memory usage
  exceeds the limit, the load will be rejected. If not specified, the
  limit will not be set.
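
Finally, a sketch combining a backend-specific setting, a host policy and a GPU load limit; the policy name and core range are placeholders, and the tensorflow version setting is an assumption based on common Triton usage (check each backend's documentation for the settings it actually accepts):

  tritonserver --model-repository=/models \
      --backend-config=tensorflow,version=2 \
      --host-policy=numa0,numa-node=0 \
      --host-policy=numa0,cpu-cores=0-7 \
      --model-load-gpu-limit=0:0.8
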