@samos123
Created September 22, 2024 22:33
spec:
  args:
  - --max-model-len=65536
  - --max-num-batched-tokens=65536
  - --gpu-memory-utilization=0.9
  - --tensor-parallel-size=2
  - --enable-prefix-caching
  - --disable-log-requests
  - --max-num-seqs=1024
  engine: VLLM
  features:
  - TextGeneration
  maxReplicas: 8
  minReplicas: 0
  owner: ""
  replicas: 8
  resourceProfile: nvidia-gpu-h100:2
  scaleDownDelaySeconds: 30
  targetRequests: 500
  url: hf://neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8
status:
  replicas:
    all: 8
    ready: 7
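The `targetRequests`, `minReplicas`, and `maxReplicas` fields drive request-based autoscaling. A minimal sketch of the likely scaling rule, assuming replicas are computed as the ceiling of in-flight requests divided by the per-replica target and clamped to the configured bounds (the function name and exact semantics here are an assumption, not KubeAI's actual implementation):

```python
import math

def desired_replicas(active_requests: float, target_requests: int,
                     min_replicas: int, max_replicas: int) -> int:
    """Hypothetical request-based autoscaling rule:
    ceil(active_requests / target_requests), clamped to [min, max]."""
    raw = math.ceil(active_requests / target_requests)
    return max(min_replicas, min(max_replicas, raw))

# With the spec above: targetRequests=500, minReplicas=0, maxReplicas=8
print(desired_replicas(3500, 500, 0, 8))  # 7 replicas for ~3500 in-flight requests
print(desired_replicas(0, 500, 0, 8))     # 0, i.e. scale to zero after scaleDownDelaySeconds
```

With `minReplicas: 0`, the deployment can scale to zero when idle, and `scaleDownDelaySeconds: 30` damps flapping by delaying scale-down decisions.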