Step 1. Update the host_port key in the serve section of config.yaml so the server listens on all interfaces.
...
serve:
backend: vllm
chat_template: auto
host_port: 0.0.0.0:8000
llama_cpp:
gpu_layers: -1
llm_family: ''
max_ctx_size: 4096
model_path: /var/home/cloud-user/.cache/instructlab/models/granite-7b-redhat-lab
vllm:
gpus: null
llm_family: ''
max_startup_attempts: null
vllm_args:
- --tensor-parallel-size
- '4'
...
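After editing, a quick check confirms the new value before restarting the server so the change takes effect. This assumes ilab config show is available in your version of the CLI:
# host_port should now read 0.0.0.0:8000
ilab config show | grep host_port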
Step 2. Create API key (Section 4.2.1)
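The client-side examples below assume the key created in Section 4.2.1 is exported in the shell; the variable name API_KEY is just a convention used here.
# Placeholder only; substitute the key created in Section 4.2.1
export API_KEY='<your-api-key>'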
Step 3. Running ilab model serve as a service (Section 4.1.1)
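For orientation, here is a minimal sketch of a systemd user unit wrapping ilab model serve; the unit name and ExecStart path are assumptions, and Section 4.1.1 describes the supported procedure.
# ~/.config/systemd/user/ilab-serve.service (hypothetical unit name)
[Unit]
Description=InstructLab model serving (ilab model serve)

[Service]
# Adjust the path to match the output of `which ilab`
ExecStart=/usr/bin/ilab model serve
Restart=on-failure

[Install]
WantedBy=default.target

# Reload systemd and start the unit
systemctl --user daemon-reload
systemctl --user enable --now ilab-serve.service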
Step 4. Testing external access to inference endpoint
- Find the name or IP address of the node
# Hostname
echo $HOSTNAME
ip-172-31-23-148.ec2.internal
# Get the IP address of the eth0 interface
IPADDR=$(ip -4 -o a show eth0 | cut -d ' ' -f 7 | cut -d '/' -f 1)
echo $IPADDR
172.31.23.148
The inference serving endpoint will be http://$HOSTNAME:8000/v1/ or http://$IPADDR:8000/v1/.
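Before testing chat completions, a reachability check against the models endpoint confirms the server answers on the network-facing address; the Authorization header is shown in case your setup enforces the API key from Step 2.
# List the served models over the node address
curl -s --header "Authorization: Bearer $API_KEY" "http://$IPADDR:8000/v1/models" | jq .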
- Test the chat endpoint using curl
# Note: the full model path should be used as the model name
curl --location 'http://localhost:8000/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
"model": "/var/home/cloud-user/.cache/instructlab/models/granite-7b-redhat-lab",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Hello!"
}
]
}' | jq .
The response will be similar to:
{
"id": "cmpl-03452d543a734e808f14e195a040403b",
"object": "chat.completion",
"created": 1728526412,
"model": "/var/home/cloud-user/.cache/instructlab/models/granite-7b-redhat-lab",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Hello! How can I assist you today?",
"tool_calls": []
},
"logprobs": null,
"finish_reason": "stop",
"stop_reason": null
}
],
"usage": {
"prompt_tokens": 19,
"total_tokens": 29,
"completion_tokens": 10
}
}
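To confirm external access, repeat the request from another machine, pointing at the node's hostname or IP address instead of localhost. This sketch assumes the API key from Step 2 is exported as API_KEY on the client.
# NODE is the hostname or IP address found in the previous step
NODE=ip-172-31-23-148.ec2.internal
curl --location "http://$NODE:8000/v1/chat/completions" \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer $API_KEY" \
--data '{
  "model": "/var/home/cloud-user/.cache/instructlab/models/granite-7b-redhat-lab",
  "messages": [
    {"role": "user", "content": "Hello!"}
  ]
}' | jq .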