
Envoy AI Gateway: Chat Completion Translation Architecture

An analysis of how the Envoy AI Gateway converts between native third-party AI provider formats and a unified OpenAI-compatible API for chat completion endpoints.

Repository: envoyproxy/ai-gateway
Documentation: aigateway.envoyproxy.io


1. Architecture Overview

The gateway uses a pluggable translator pattern where all incoming requests arrive in OpenAI-compatible format and are translated to each provider's native format (and vice versa for responses). At runtime, the gateway operates as a gRPC external processor for Envoy Proxy, using Envoy's ext_proc (External Processing) filter.

The core design principles:

  • Unified client interface: Clients always speak OpenAI-compatible API format
  • Pluggable backends: Each provider has a dedicated translator implementing a common interface
  • Two-phase processing: Routing is separated from schema translation via two ext_proc filter levels
  • Retry-aware: The architecture supports transparent failover between providers

2. Envoy ext_proc Integration

The AI Gateway runs as a standalone gRPC server implementing envoy.service.ext_proc.v3.ExternalProcessor. It communicates with Envoy over a Unix Domain Socket (/tmp/extproc.sock) for low-latency IPC.

The control plane dynamically injects the ext_proc filter configuration into Envoy's xDS configuration. The gateway does not use a WASM filter, Lua script, or custom C++ filter — it relies entirely on ext_proc's gRPC streaming protocol to intercept and mutate requests and responses as they flow through Envoy.

gRPC Processing Protocol

The ext_proc protocol is a bidirectional gRPC stream. Envoy sends ProcessingRequest messages at various phases of the HTTP lifecycle, and the gateway responds with ProcessingResponse messages containing mutations:

| ProcessingRequest Type | Purpose |
|---|---|
| RequestHeaders | Client's HTTP request headers arrive |
| RequestBody | Client's HTTP request body arrives (buffered) |
| ResponseHeaders | Backend's HTTP response headers arrive |
| ResponseBody | Backend's HTTP response body arrives (buffered or streamed) |

Each response can include header mutations, body mutations, and dynamic metadata for downstream Envoy filters (e.g., rate limiting, cost tracking).
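The phase-to-handler dispatch can be sketched as a bidirectional stream loop. This is a minimal illustration with simplified stand-in types; the real gateway uses the generated `envoy.service.ext_proc.v3` protobufs from go-control-plane, and the handler bodies here are placeholders.

```go
package main

import "fmt"

// Simplified stand-ins for the ext_proc protobuf messages.
type ProcessingRequest struct {
	Phase string // "request_headers", "request_body", "response_headers", "response_body"
	Body  []byte
}

type ProcessingResponse struct {
	HeaderMutations map[string]string
	BodyMutation    []byte
}

// processorStream abstracts the Recv/Send pair of the bidirectional gRPC stream.
type processorStream interface {
	Recv() (*ProcessingRequest, error)
	Send(*ProcessingResponse) error
}

// dispatch maps one HTTP lifecycle phase to a response, mirroring the
// phase table above. Handler bodies are placeholders, not real logic.
func dispatch(req *ProcessingRequest) *ProcessingResponse {
	switch req.Phase {
	case "request_headers":
		return &ProcessingResponse{HeaderMutations: map[string]string{}}
	case "request_body", "response_body":
		// Schema translation would mutate the buffered body here.
		return &ProcessingResponse{BodyMutation: req.Body}
	default: // response_headers and anything else: no mutation
		return &ProcessingResponse{}
	}
}

// serve is the canonical stream loop: read a phase message, answer it.
func serve(s processorStream) error {
	for {
		req, err := s.Recv()
		if err != nil {
			return err
		}
		if err := s.Send(dispatch(req)); err != nil {
			return err
		}
	}
}

func main() {
	resp := dispatch(&ProcessingRequest{Phase: "request_body", Body: []byte(`{"model":"gpt-4"}`)})
	fmt.Println(string(resp.BodyMutation))
}
```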


3. Two-Level Filter Design

The key architectural decision is a two-level ext_proc filter hierarchy:

Router Filter (HTTP Connection Manager level)

  • Receives the client request headers and body
  • Parses the OpenAI-format request and extracts the model name
  • Starts an OpenTelemetry tracing span
  • Stores the original request body for later use by the upstream filter
  • Does not perform schema translation — its role is routing context and state management

Configuration: RequestBodyMode: BUFFERED, ResponseBodyMode: BUFFERED

Upstream Filter (Upstream Cluster level)

  • Fires after Envoy has selected a backend cluster
  • Receives the backend identity from Envoy request attributes
  • Selects the appropriate Translator based on the backend's APISchemaName
  • Calls translator.RequestBody() to convert OpenAI format to provider-native format
  • Applies backend-specific authentication (AWS SigV4, GCP tokens, API keys, etc.)
  • On response, calls translator.ResponseBody() to convert back to OpenAI format
  • Handles streaming via ModeOverride to STREAMED body processing

Configuration: RequestHeaderMode: SEND only (body already in memory from router phase)

Why Two Levels?

This separation allows Envoy to handle retries and failover between backends. If a request to one provider fails, Envoy can retry against a different backend cluster. The upstream filter re-translates the original request (preserved by the router filter) for each retry attempt, potentially targeting a completely different provider with a different schema.
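The retry behavior can be made concrete with a sketch: the preserved OpenAI body is re-translated for whichever backend each attempt targets. `toBedrock` below is an illustrative stand-in that only renames the model field, not the gateway's real Converse translation.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// toBedrock is a hypothetical, drastically simplified translation:
// Bedrock's Converse API identifies the model via "modelId".
func toBedrock(original []byte) ([]byte, error) {
	var m map[string]any
	if err := json.Unmarshal(original, &m); err != nil {
		return nil, err
	}
	if model, ok := m["model"]; ok {
		delete(m, "model")
		m["modelId"] = model
	}
	return json.Marshal(m)
}

// translateForAttempt converts the preserved original body for the
// schema of the backend Envoy selected on this attempt.
func translateForAttempt(original []byte, schema string) ([]byte, error) {
	switch schema {
	case "OpenAI":
		return original, nil // passthrough
	case "AWSBedrock":
		return toBedrock(original)
	}
	return nil, fmt.Errorf("unknown schema %q", schema)
}

func main() {
	orig := []byte(`{"model":"claude-3-sonnet","messages":[]}`)
	// Attempt 1 against an OpenAI-compatible backend, retry against Bedrock:
	first, _ := translateForAttempt(orig, "OpenAI")
	retry, _ := translateForAttempt(orig, "AWSBedrock")
	fmt.Println(string(first))
	fmt.Println(string(retry))
}
```

Because translation happens per attempt rather than once at ingress, a failover from an OpenAI-compatible backend to Bedrock needs no client involvement.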


4. Translator Interface

The core abstraction lives in internal/translator/translator.go:

type Translator[ReqT any, SpanT any] interface {
    RequestBody(raw []byte, body *ReqT, flag bool) (
        newHeaders []Header, mutatedBody []byte, err error,
    )

    ResponseHeaders(headers map[string]string) (
        newHeaders []Header, err error,
    )

    ResponseBody(respHeaders map[string]string, body io.Reader,
        endOfStream bool, span SpanT) (
        newHeaders []Header, mutatedBody []byte,
        tokenUsage TokenUsage, responseModel string, err error,
    )

    ResponseError(respHeaders map[string]string, body io.Reader) (
        newHeaders []Header, mutatedBody []byte, err error,
    )
}

A type alias ties the generic interface to chat completions:

type OpenAIChatCompletionTranslator = Translator[
    openai.ChatCompletionRequest,
    tracingapi.ChatCompletionSpan,
]

Translator Method Mapping to ext_proc Phases

| Envoy Phase | Filter Level | Translator Call |
|---|---|---|
| Request Body | Router | Parse body, store state (no translator call) |
| Request Headers | Upstream | translator.RequestBody() |
| Response Headers | Upstream | translator.ResponseHeaders() |
| Response Body (success) | Upstream | translator.ResponseBody() |
| Response Body (error) | Upstream | translator.ResponseError() |

5. Supported Chat Completion Translators

All translator implementations live in internal/translator/:

| Provider | Source File | Factory Function |
|---|---|---|
| OpenAI (and compatible: Groq, Mistral, DeepSeek, Together AI, etc.) | openai_openai.go | NewChatCompletionOpenAIToOpenAITranslator() |
| AWS Bedrock (Converse API) | openai_awsbedrock.go | NewChatCompletionOpenAIToAWSBedrockTranslator() |
| AWS Bedrock (Anthropic) | openai_awsanthropic.go | NewChatCompletionOpenAIToAWSAnthropicTranslator() |
| Azure OpenAI | openai_azureopenai.go | NewChatCompletionOpenAIToAzureOpenAITranslator() |
| GCP Vertex AI (Gemini) | openai_gcpvertexai.go | NewChatCompletionOpenAIToGCPVertexAITranslator() |
| GCP Vertex AI (Anthropic) | openai_gcpanthropic.go | NewChatCompletionOpenAIToGCPAnthropicTranslator() |
| Anthropic (Native) | anthropic_anthropic.go | NewAnthropicToAnthropicTranslator() |

Providers that natively support the OpenAI API format (Groq, Grok, Mistral, Together AI, DeepSeek, Cohere, SambaNova, Google Gemini on AI Studio, DeepInfra, and self-hosted models like vLLM) use the OpenAI-to-OpenAI translator, which is mostly a passthrough that adjusts the URL path prefix.

Translator Selection (Factory Pattern)

The factory in internal/endpointspec/endpointspec.go dispatches based on the configured APISchemaName:

func (ChatCompletionsEndpointSpec) GetTranslator(
    schema filterapi.VersionedAPISchema,
    modelNameOverride string,
) (translator.OpenAIChatCompletionTranslator, error) {
    switch schema.Name {
    case filterapi.APISchemaOpenAI:
        return translator.NewChatCompletionOpenAIToOpenAITranslator(...)
    case filterapi.APISchemaAWSBedrock:
        return translator.NewChatCompletionOpenAIToAWSBedrockTranslator(...)
    case filterapi.APISchemaGCPVertexAI:
        return translator.NewChatCompletionOpenAIToGCPVertexAITranslator(...)
    case filterapi.APISchemaAzureOpenAI:
        return translator.NewChatCompletionOpenAIToAzureOpenAITranslator(...)
    // ... more providers
    }
}

API Schema Registry

Schema names are defined in internal/filterapi/filterconfig.go:

const (
    APISchemaOpenAI       = "OpenAI"
    APISchemaAWSBedrock   = "AWSBedrock"
    APISchemaAWSAnthropic = "AWSAnthropic"
    APISchemaAzureOpenAI  = "AzureOpenAI"
    APISchemaGCPVertexAI  = "GCPVertexAI"
    APISchemaGCPAnthropic = "GCPAnthropic"
    APISchemaAnthropic    = "Anthropic"
    APISchemaCohere       = "Cohere"
)

6. API Schema Data Models

Provider-specific request/response types live in internal/apischema/:

internal/apischema/
├── openai/       # ChatCompletionRequest/Response (unified client format)
├── anthropic/    # MessagesRequest/Response
├── awsbedrock/   # ConverseInput/ConverseOutput
├── gcp/          # Gemini GenerateContentRequest/Response
└── cohere/       # Rerank models

The OpenAI schema (internal/apischema/openai/openai.go) is the richest, defining:

  • ChatCompletionRequest — the unified request format
  • ChatCompletionResponse — the unified response format
  • Chat message union types (system, user, assistant, tool, developer)
  • Content part unions (text, image_url, input_audio, file)
  • Tool/function call definitions
  • Token usage structures
  • Vendor-specific extension fields (Anthropic thinking config, GCP extensions)

7. Request/Response Lifecycle

Request Translation Pipeline

Client OpenAI Request
  |
  v
Router Filter (routerProcessor.ProcessRequestBody):
  1. Parse JSON into *openai.ChatCompletionRequest
  2. Extract original model name
  3. Detect if streaming
  4. Start tracing span
  5. Set internal headers (x-ai-eg-model, x-ai-eg-original-path)
  6. Return CONTINUE with header mutations
  |
  v
Envoy Route Decision:
  Select upstream cluster, attach backend name in request attributes
  |
  v
Upstream Filter (upstreamProcessor.ProcessRequestHeaders):
  1. Look up backend from request attributes
  2. Select translator based on backend's APISchemaName
  3. Call translator.RequestBody():
     - Validate parameters
     - Map OpenAI fields to provider format
     - Handle model name overrides
     - Set appropriate HTTP path and headers
  4. Apply backend authentication
  5. Apply route-level header/body mutations
  6. Build dynamic metadata (cost tracking)
  7. Return CONTINUE_AND_REPLACE with mutations
  |
  v
Backend (provider-native format)

Response Translation Pipeline

Backend Response (provider-native format)
  |
  v
Upstream Filter (upstreamProcessor.ProcessResponseHeaders):
  1. Store response headers
  2. Call translator.ResponseHeaders()
  3. If streaming: set ModeOverride to STREAMED
  4. Return CONTINUE with header mutations
  |
  v
Upstream Filter (upstreamProcessor.ProcessResponseBody):
  1. Decompress if needed (gzip, deflate)
  2. Check HTTP status code:
     - Error path: call translator.ResponseError()
     - Success path: call translator.ResponseBody()
  3. Extract token usage and response model
  4. Record metrics (token counts, latency)
  5. Build dynamic metadata (cost data)
  6. Return CONTINUE with body mutations
  |
  v
Client Response (OpenAI format)

Example: OpenAI to AWS Bedrock

Request conversion (internal/translator/openai_awsbedrock.go):

OpenAI Request:                      AWS Bedrock Request:
{                                    {
  "model": "claude-3-sonnet",          "modelId": "anthropic.claude-3-sonnet",
  "messages": [{                       "messages": [{
    "role": "user",                      "role": "user",
    "content": [{                        "content": [{
      "type": "text",                      "text": "Hello"
      "text": "Hello"                    }, {
    }, {                                   "image": {
      "type": "image_url",                  "format": "png",
      "image_url": {                        "source": {
        "url": "data:image/png;..."           "bytes": [...]
      }                                    }
    }]                                   }
  }]                                   }]
}                                    }]
                                   }
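Note the image conversion above: the OpenAI side carries a base64 data URI, while Bedrock wants a format plus raw bytes. A minimal sketch of that split, assuming well-formed `data:image/<fmt>;base64,<payload>` input (this is illustrative, not the gateway's actual parsing code):

```go
package main

import (
	"encoding/base64"
	"fmt"
	"strings"
)

// parseImageDataURI splits an OpenAI-style image data URI into the
// image format and decoded bytes that Bedrock-style APIs expect.
func parseImageDataURI(uri string) (format string, data []byte, err error) {
	rest, ok := strings.CutPrefix(uri, "data:image/")
	if !ok {
		return "", nil, fmt.Errorf("not an image data URI")
	}
	format, payload, ok := strings.Cut(rest, ";base64,")
	if !ok {
		return "", nil, fmt.Errorf("missing base64 payload")
	}
	data, err = base64.StdEncoding.DecodeString(payload)
	return format, data, err
}

func main() {
	f, b, err := parseImageDataURI("data:image/png;base64,aGVsbG8=")
	if err != nil {
		panic(err)
	}
	fmt.Println(f, len(b), "bytes")
}
```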

Response conversion:

AWS Bedrock Response:                OpenAI Response:
{                                    {
  "output": {                          "model": "claude-3-sonnet",
    "message": {                       "choices": [{
      "role": "assistant",               "message": {
      "content": [{                        "role": "assistant",
        "text": "Hello!"                   "content": "Hello!",
      }, {                                 "tool_calls": [{
        "toolUse": {                         "id": "...",
          "toolUseId": "...",                "type": "function",
          "name": "search",                  "function": {
          "input": {...}                       "name": "search",
        }                                      "arguments": "..."
      }]                                     }
    }                                      }]
  },                                     },
  "usage": {                             "finish_reason": "stop"
    "inputTokens": 10,                }],
    "outputTokens": 5                  "usage": {
  },                                     "prompt_tokens": 10,
  "stopReason": "end_turn"               "completion_tokens": 5,
}                                        "total_tokens": 15
                                       }
                                     }

8. What Gets Translated

Each translator handles conversion of:

| Aspect | Details |
|---|---|
| Messages | Role mapping, content block structure, multi-modal content (text, images, audio) |
| Tool/function calls | Tool definition format, tool choice strategy (auto/required/none), tool call results |
| Thinking/reasoning | Vendor-specific thinking mode configuration (e.g., Anthropic extended thinking) |
| Streaming | SSE format differences, stateful chunk accumulation, event type mapping |
| Token usage | Field name mapping (input_tokens vs prompt_tokens), total calculation |
| Finish reasons | E.g., Anthropic end_turn → OpenAI stop, Bedrock end_turn → stop |
| Error responses | Provider error format → unified error format |
| Content types | Base64 image parsing, data URI handling, audio format conversion |
| Cache control | Vendor-specific prompt caching annotations |
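Finish-reason normalization reduces to a small lookup. The `end_turn → stop` mapping comes from this document; the other input values below are common Anthropic/Bedrock stop reasons added here as assumptions, and the real helper may cover a different set.

```go
package main

import "fmt"

// toOpenAIFinishReason normalizes provider stop reasons to OpenAI's
// finish_reason vocabulary ("stop", "length", "tool_calls").
func toOpenAIFinishReason(reason string) string {
	switch reason {
	case "end_turn", "stop_sequence":
		return "stop"
	case "max_tokens":
		return "length"
	case "tool_use":
		return "tool_calls"
	}
	return reason // pass unknown values through unchanged
}

func main() {
	fmt.Println(toOpenAIFinishReason("end_turn"))
}
```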

Streaming Response Handling

Each provider has different streaming formats requiring stateful parsers:

  • Anthropic: anthropicStreamParser in anthropic_helper.go — parses SSE events (content_block_start, content_block_delta, message_delta) and converts to OpenAI SSE format (data: {"choices":[{"delta":{...}}]})
  • AWS Bedrock: Uses AWS EventStream protocol (binary framing) decoded into ConverseStreamEvent chunks, converted to OpenAI chat completion chunks
  • GCP Vertex AI (Gemini): Parses Gemini streaming responses and maps to OpenAI chunk format
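One step of that conversion can be sketched in isolation: turning a single Anthropic content_block_delta payload into an OpenAI-style SSE chunk line. The real anthropicStreamParser is stateful and handles many more event types; the struct below keeps only the text delta.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// anthropicDelta models just the text portion of a content_block_delta event.
type anthropicDelta struct {
	Delta struct {
		Text string `json:"text"`
	} `json:"delta"`
}

// toOpenAIChunk wraps the delta text in an OpenAI chat-completion-chunk
// shape and frames it as an SSE data line.
func toOpenAIChunk(data []byte) (string, error) {
	var d anthropicDelta
	if err := json.Unmarshal(data, &d); err != nil {
		return "", err
	}
	chunk := map[string]any{
		"choices": []map[string]any{
			{"delta": map[string]string{"content": d.Delta.Text}},
		},
	}
	b, err := json.Marshal(chunk)
	if err != nil {
		return "", err
	}
	return "data: " + string(b) + "\n\n", nil
}

func main() {
	out, err := toOpenAIChunk([]byte(`{"type":"content_block_delta","delta":{"type":"text_delta","text":"Hi"}}`))
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```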

Common Helper Functions

Shared conversion utilities in internal/translator/:

  • anthropic_helper.go: openAIToAnthropicMessages(), translateOpenAItoAnthropicTools(), anthropicToolUseToOpenAICalls(), anthropicToOpenAIFinishReason()
  • gemini_helper.go: openAIMessagesToGeminiContents(), message and content conversion utilities

9. Envoy Configuration Injection

The control plane (internal/extensionserver/post_translate_modify.go) dynamically generates Envoy xDS configuration with the following components:

ext_proc UDS Cluster

Cluster {
  Name: "ai-gateway-extproc-uds"
  Type: STATIC
  ConnectTimeout: 10s
  LoadAssignment {
    Endpoints {
      LbEndpoints {
        Endpoint {
          Address: Pipe { Path: "/tmp/extproc.sock" }
        }
      }
    }
  }
  PerConnectionBufferLimitBytes: 50Mi
  Http2ProtocolOptions {
    InitialConnectionWindowSize: 1Mi
    InitialStreamWindowSize: 64Ki
  }
}

Router Filter ext_proc Configuration

ExternalProcessor {
  GrpcService {
    EnvoyGrpc { ClusterName: "ai-gateway-extproc-uds" }
    Timeout: 30s
  }
  ProcessingMode {
    RequestHeaderMode: SEND
    RequestBodyMode: BUFFERED
    ResponseHeaderMode: SEND
    ResponseBodyMode: BUFFERED
  }
  MessageTimeout: 10s
  FailureModeAllow: false
  AllowModeOverride: true
}

Upstream Filter ext_proc Configuration

ExternalProcessor {
  GrpcService { ClusterName: "ai-gateway-extproc-uds" }
  ProcessingMode {
    RequestHeaderMode: SEND
    RequestBodyMode: NONE       # Body already available from router phase
    ResponseHeaderMode: SKIP
    ResponseBodyMode: NONE
  }
  MessageTimeout: 10s
}

Header Mutation Filter

Applied after ext_proc to inject dynamic metadata into headers:

HeaderMutation {
  Mutations {
    RequestMutations [{
      Append {
        Header {
          Key: "content-length"
          Value: "%DYNAMIC_METADATA(ai-gateway:content_length)%"
        }
      }
    }]
  }
}

10. Server Entry Point

The gRPC server starts in cmd/extproc/mainlib/main.go:

server, err := extproc.NewServer(l, flags.enableRedaction)
server.Register(path, extproc.NewFactory(...))

s := grpc.NewServer(grpc.MaxRecvMsgSize(flags.maxRecvMsgSize))
extprocv3.RegisterExternalProcessorServer(s, server)
grpc_health_v1.RegisterHealthServer(s, server)
s.Serve(extProcLis)

Registered Endpoint Paths

| Path | Processor |
|---|---|
| /v1/chat/completions | Chat completion |
| /v1/completions | Text completion |
| /v1/embeddings | Embeddings |
| /v1/messages | Anthropic native messages |
| /v1/models | Model listing |
| /v2/rerank | Cohere rerank |

Startup Configuration

| Flag | Default | Purpose |
|---|---|---|
| configPath | | Configuration YAML file path |
| extProcAddr | :1063 | gRPC server address (supports UDS: unix:///tmp/ext_proc.sock) |
| adminPort | 1064 | HTTP admin port (metrics, health) |
| mcpAddr | | Optional MCP proxy address |
| maxRecvMsgSize | unlimited | Max gRPC message size |
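A sketch of how these flags might be wired through the standard flag package, using the names and defaults from the table; the real parsing in cmd/extproc/mainlib may differ beyond what the table states.

```go
package main

import (
	"flag"
	"fmt"
)

// parseFlags declares a subset of the startup flags and parses the
// given argument list (program name excluded, as flag.FlagSet expects).
func parseFlags(args []string) (configPath, extProcAddr string, adminPort int, err error) {
	fs := flag.NewFlagSet("extproc", flag.ContinueOnError)
	cp := fs.String("configPath", "", "Configuration YAML file path")
	addr := fs.String("extProcAddr", ":1063", "gRPC server address (TCP or unix:// UDS)")
	admin := fs.Int("adminPort", 1064, "HTTP admin port (metrics, health)")
	if err := fs.Parse(args); err != nil {
		return "", "", 0, err
	}
	return *cp, *addr, *admin, nil
}

func main() {
	cp, addr, admin, err := parseFlags([]string{"-configPath", "/etc/ai-gateway/config.yaml"})
	if err != nil {
		panic(err)
	}
	fmt.Println(cp, addr, admin)
}
```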

11. End-to-End Request Flow Diagram

                        CLIENT REQUEST
                POST /v1/chat/completions
                {model: "gpt-4", messages: [...]}
                            |
                            v
            +-------------------------------+
            |   ENVOY ROUTER FILTER         |
            |   Matches route, calls        |
            |   ext_proc (SEND + BUFFERED)  |
            +---------------+---------------+
                            |
               Request Headers + Body
                            |
                            v
            +-------------------------------+
            |  routerProcessor              |
            |  .ProcessRequestBody()        |
            |  ----------------------------  |
            |  1. Parse as OpenAI request   |
            |  2. Extract model name        |
            |  3. Start tracing span        |
            |  4. Store original body       |
            |  5. Set internal headers      |
            |  6. Return CONTINUE           |
            +---------------+---------------+
                            |
                            v
            +-------------------------------+
            |   ENVOY ROUTE DECISION        |
            |   Select upstream cluster     |
            |   Attach backend name in      |
            |   request attributes          |
            +---------------+---------------+
                            |
                            v
            +-------------------------------+
            |   ENVOY UPSTREAM FILTER       |
            |   Calls ext_proc (SEND only)  |
            +---------------+---------------+
                            |
                   Request Headers
                            |
                            v
            +-------------------------------+
            |  upstreamProcessor            |
            |  .ProcessRequestHeaders()     |
            |  ----------------------------  |
            |  1. Get backend from attrs    |
            |  2. Select translator         |
            |  3. translator.RequestBody()  |
            |     (OpenAI -> Bedrock/etc.)  |
            |  4. Apply backend auth        |
            |  5. Apply route mutations     |
            |  6. Build dynamic metadata    |
            |  7. Return CONTINUE_AND_      |
            |     REPLACE                   |
            +---------------+---------------+
                            |
                            v
            +-------------------------------+
            |   BACKEND (Provider-native)   |
            |   Returns response            |
            +---------------+---------------+
                            |
                   Response Headers
                            |
                            v
            +-------------------------------+
            |  upstreamProcessor            |
            |  .ProcessResponseHeaders()    |
            |  ----------------------------  |
            |  1. translator.Response-      |
            |     Headers()                 |
            |  2. Detect streaming          |
            |  3. Set STREAMED mode if      |
            |     streaming                 |
            |  4. Return CONTINUE           |
            +---------------+---------------+
                            |
                   Response Body
              (per chunk if streaming)
                            |
                            v
            +-------------------------------+
            |  upstreamProcessor            |
            |  .ProcessResponseBody()       |
            |  ----------------------------  |
            |  1. Decompress if needed      |
            |  2. translator.ResponseBody() |
            |     (Bedrock/etc. -> OpenAI)  |
            |  3. Extract token usage       |
            |  4. Record metrics            |
            |  5. Build cost metadata       |
            |  6. Return CONTINUE           |
            +---------------+---------------+
                            |
                            v
            +-------------------------------+
            |   ENVOY ROUTER (response)     |
            |   Mutations applied           |
            +---------------+---------------+
                            |
                            v
                    CLIENT RESPONSE
              {choices: [{message: {...}}]}
              {usage: {prompt_tokens: ...}}

12. Key Files Reference

Core ext_proc Integration

| File | Purpose |
|---|---|
| cmd/extproc/main.go | Entry point, signal handling |
| cmd/extproc/mainlib/main.go | Server initialization, listener setup, processor factory registration |
| internal/extproc/server.go | gRPC ExternalProcessor service, stream lifecycle management |
| internal/extproc/processor.go | Processor interface definition |
| internal/extproc/processor_impl.go | routerProcessor and upstreamProcessor implementations |
| internal/extproc/util.go | Helpers (header mutations, dynamic metadata) |

Translator Layer

| File | Purpose |
|---|---|
| internal/translator/translator.go | Core Translator interface, type aliases |
| internal/translator/openai_openai.go | OpenAI-to-OpenAI passthrough |
| internal/translator/openai_awsbedrock.go | OpenAI-to-AWS Bedrock Converse |
| internal/translator/openai_awsanthropic.go | OpenAI-to-AWS Bedrock Anthropic |
| internal/translator/openai_azureopenai.go | OpenAI-to-Azure OpenAI |
| internal/translator/openai_gcpvertexai.go | OpenAI-to-GCP Vertex AI Gemini |
| internal/translator/openai_gcpanthropic.go | OpenAI-to-GCP Vertex AI Anthropic |
| internal/translator/anthropic_anthropic.go | Anthropic-to-Anthropic passthrough |
| internal/translator/anthropic_helper.go | OpenAI/Anthropic conversion utilities, streaming parser |
| internal/translator/gemini_helper.go | OpenAI/Gemini conversion utilities |

Data Models

| File | Purpose |
|---|---|
| internal/apischema/openai/openai.go | OpenAI ChatCompletionRequest/Response definitions |
| internal/apischema/anthropic/anthropic.go | Anthropic MessagesRequest/Response definitions |
| internal/apischema/awsbedrock/awsbedrock.go | AWS Bedrock ConverseInput/ConverseOutput definitions |
| internal/apischema/gcp/gcp.go | GCP Gemini type definitions |

Configuration & Control Plane

| File | Purpose |
|---|---|
| internal/filterapi/filterconfig.go | APISchemaName constants, VersionedAPISchema, backend config |
| internal/endpointspec/endpointspec.go | Endpoint spec interface, translator factory dispatch |
| internal/extensionserver/post_translate_modify.go | xDS modification to inject ext_proc filters into Envoy config |