Step 1. Update the host_port key in the serve section of config.yaml so the server listens on all interfaces.
...
serve:
backend: vllm
chat_template: auto
host_port: 0.0.0.0:8000
llama_cpp:
gpu_layers: -1
llm_family: ''
max_ctx_size: 4096
model_path: /var/home/cloud-user/.cache/instructlab/models/granite-7b-redhat-lab
vllm:
gpus: null
llm_family: ''
max_startup_attempts: null
vllm_args:
- --tensor-parallel-size
- '4'
...
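After editing, a quick check confirms the new value before restarting the server so the change takes effect. This assumes ilab config show is available in your version of the CLI:
# host_port should now read 0.0.0.0:8000
ilab config show | grep host_port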
Step 2. Create API key (Section 4.2.1)
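The client-side examples below assume the key created in Section 4.2.1 is exported in the shell; the variable name API_KEY is just a convention used here.
# Placeholder only; substitute the key created in Section 4.2.1
export API_KEY='<your-api-key>'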
Step 3. Running ilab model serve as a service (Section 4.1.1)
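For orientation, here is a minimal sketch of a systemd user unit wrapping ilab model serve; the unit name and ExecStart path are assumptions, and Section 4.1.1 describes the supported procedure.
# ~/.config/systemd/user/ilab-serve.service (hypothetical unit name)
[Unit]
Description=InstructLab model serving (ilab model serve)

[Service]
# Adjust the path to match the output of `which ilab`
ExecStart=/usr/bin/ilab model serve
Restart=on-failure

[Install]
WantedBy=default.target

# Reload systemd and start the unit
systemctl --user daemon-reload
systemctl --user enable --now ilab-serve.service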
Step 4. Testing external access to inference endpoint
- Find the name or IP address of the node
# Hostname
echo $HOSTNAME
ip-172-31-23-148.ec2.internal
# Get the IP address of the eth0 interface
IPADDR=$(ip -4 -o a show eth0 | cut -d ' ' -f 7 | cut -d '/' -f 1)
echo $IPADDR
172.31.23.148
The inference serving endpoint will be http://$HOSTNAME:8000/v1/ or http://$IPADDR:8000/v1/.
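Before testing chat completions, a reachability check against the models endpoint confirms the server answers on the network-facing address; the Authorization header is shown in case your setup enforces the API key from Step 2.
# List the served models over the node address
curl -s --header "Authorization: Bearer $API_KEY" "http://$IPADDR:8000/v1/models" | jq .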
- Test the chat endpoint using curl
# Note: the full model path should be used as the model name
curl --location 'http://localhost:8000/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
"model": "/var/home/cloud-user/.cache/instructlab/models/granite-7b-redhat-lab",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Hello!"
}
]
}' | jq .
The response will be similar to:
{
"id": "cmpl-03452d543a734e808f14e195a040403b",
"object": "chat.completion",
"created": 1728526412,
"model": "/var/home/cloud-user/.cache/instructlab/models/granite-7b-redhat-lab",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Hello! How can I assist you today?",
"tool_calls": []
},
"logprobs": null,
"finish_reason": "stop",
"stop_reason": null
}
],
"usage": {
"prompt_tokens": 19,
"total_tokens": 29,
"completion_tokens": 10
}
}
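To confirm external access, repeat the request from another machine, pointing at the node's hostname or IP address instead of localhost. This sketch assumes the API key from Step 2 is exported as API_KEY on the client.
# NODE is the hostname or IP address found in the previous step
NODE=ip-172-31-23-148.ec2.internal
curl --location "http://$NODE:8000/v1/chat/completions" \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer $API_KEY" \
--data '{
  "model": "/var/home/cloud-user/.cache/instructlab/models/granite-7b-redhat-lab",
  "messages": [
    {"role": "user", "content": "Hello!"}
  ]
}' | jq .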