
Envoy AI Gateway: Chat Completion Translation Architecture

An analysis of how the Envoy AI Gateway converts between native third-party AI provider formats and a unified OpenAI-compatible API for chat completion endpoints.

Repository: envoyproxy/ai-gateway
Documentation: aigateway.envoyproxy.io


1. Architecture Overview

The gateway uses a pluggable translator pattern where all incoming requests arrive in OpenAI-compatible format and are translated to each provider's native format (and vice versa for responses). At runtime, the gateway operates as a gRPC external processor for Envoy Proxy, using Envoy's ext_proc (External Processing) filter.

The core design principles:

  • Unified client interface: Clients always speak OpenAI-compatible API format
  • Pluggable backends: Each provider has a dedicated translator implementing a common interface
  • Two-phase processing: Routing is separated from schema translation via two ext_proc filter levels
  • Retry-aware: The architecture supports transparent failover between providers

2. Envoy ext_proc Integration

The AI Gateway runs as a standalone gRPC server implementing envoy.service.ext_proc.v3.ExternalProcessor. It communicates with Envoy over a Unix Domain Socket (/tmp/extproc.sock) for low-latency IPC.

The control plane dynamically injects the ext_proc filter configuration into Envoy's xDS configuration. The gateway does not use a WASM filter, Lua script, or custom C++ filter — it relies entirely on ext_proc's gRPC streaming protocol to intercept and mutate requests and responses as they flow through Envoy.

gRPC Processing Protocol

The ext_proc protocol is a bidirectional gRPC stream. Envoy sends ProcessingRequest messages at various phases of the HTTP lifecycle, and the gateway responds with ProcessingResponse messages containing mutations:

| ProcessingRequest Type | Purpose |
|---|---|
| RequestHeaders | Client's HTTP request headers arrive |
| RequestBody | Client's HTTP request body arrives (buffered) |
| ResponseHeaders | Backend's HTTP response headers arrive |
| ResponseBody | Backend's HTTP response body arrives (buffered or streamed) |

Each response can include header mutations, body mutations, and dynamic metadata for downstream Envoy filters (e.g., rate limiting, cost tracking).
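The phase-to-handler dispatch can be sketched as a bidirectional stream loop. This is a minimal illustration with simplified stand-in types; the real gateway uses the generated `envoy.service.ext_proc.v3` protobufs from go-control-plane, and the handler bodies here are placeholders.

```go
package main

import "fmt"

// Simplified stand-ins for the ext_proc protobuf messages.
type ProcessingRequest struct {
	Phase string // "request_headers", "request_body", "response_headers", "response_body"
	Body  []byte
}

type ProcessingResponse struct {
	HeaderMutations map[string]string
	BodyMutation    []byte
}

// processorStream abstracts the Recv/Send pair of the bidirectional gRPC stream.
type processorStream interface {
	Recv() (*ProcessingRequest, error)
	Send(*ProcessingResponse) error
}

// dispatch maps one HTTP lifecycle phase to a response, mirroring the
// phase table above. Handler bodies are placeholders, not real logic.
func dispatch(req *ProcessingRequest) *ProcessingResponse {
	switch req.Phase {
	case "request_headers":
		return &ProcessingResponse{HeaderMutations: map[string]string{}}
	case "request_body", "response_body":
		// Schema translation would mutate the buffered body here.
		return &ProcessingResponse{BodyMutation: req.Body}
	default: // response_headers and anything else: no mutation
		return &ProcessingResponse{}
	}
}

// serve is the canonical stream loop: read a phase message, answer it.
func serve(s processorStream) error {
	for {
		req, err := s.Recv()
		if err != nil {
			return err
		}
		if err := s.Send(dispatch(req)); err != nil {
			return err
		}
	}
}

func main() {
	resp := dispatch(&ProcessingRequest{Phase: "request_body", Body: []byte(`{"model":"gpt-4"}`)})
	fmt.Println(string(resp.BodyMutation))
}
```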


3. Two-Level Filter Design

The key architectural decision is a two-level ext_proc filter hierarchy:

Router Filter (HTTP Connection Manager level)

  • Receives the client request headers and body
  • Parses the OpenAI-format request and extracts the model name
  • Starts an OpenTelemetry tracing span
  • Stores the original request body for later use by the upstream filter
  • Does not perform schema translation — its role is routing context and state management

Configuration: RequestBodyMode: BUFFERED, ResponseBodyMode: BUFFERED

Upstream Filter (Upstream Cluster level)

  • Fires after Envoy has selected a backend cluster
  • Receives the backend identity from Envoy request attributes
  • Selects the appropriate Translator based on the backend's APISchemaName
  • Calls translator.RequestBody() to convert OpenAI format to provider-native format
  • Applies backend-specific authentication (AWS SigV4, GCP tokens, API keys, etc.)
  • On response, calls translator.ResponseBody() to convert back to OpenAI format
  • Handles streaming via ModeOverride to STREAMED body processing

Configuration: RequestHeaderMode: SEND only (body already in memory from router phase)

Why Two Levels?

This separation allows Envoy to handle retries and failover between backends. If a request to one provider fails, Envoy can retry against a different backend cluster. The upstream filter re-translates the original request (preserved by the router filter) for each retry attempt, potentially targeting a completely different provider with a different schema.
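The retry behavior can be made concrete with a sketch: the preserved OpenAI body is re-translated for whichever backend each attempt targets. `toBedrock` below is an illustrative stand-in that only renames the model field, not the gateway's real Converse translation.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// toBedrock is a hypothetical, drastically simplified translation:
// Bedrock's Converse API identifies the model via "modelId".
func toBedrock(original []byte) ([]byte, error) {
	var m map[string]any
	if err := json.Unmarshal(original, &m); err != nil {
		return nil, err
	}
	if model, ok := m["model"]; ok {
		delete(m, "model")
		m["modelId"] = model
	}
	return json.Marshal(m)
}

// translateForAttempt converts the preserved original body for the
// schema of the backend Envoy selected on this attempt.
func translateForAttempt(original []byte, schema string) ([]byte, error) {
	switch schema {
	case "OpenAI":
		return original, nil // passthrough
	case "AWSBedrock":
		return toBedrock(original)
	}
	return nil, fmt.Errorf("unknown schema %q", schema)
}

func main() {
	orig := []byte(`{"model":"claude-3-sonnet","messages":[]}`)
	// Attempt 1 against an OpenAI-compatible backend, retry against Bedrock:
	first, _ := translateForAttempt(orig, "OpenAI")
	retry, _ := translateForAttempt(orig, "AWSBedrock")
	fmt.Println(string(first))
	fmt.Println(string(retry))
}
```

Because translation happens per attempt rather than once at ingress, a failover from an OpenAI-compatible backend to Bedrock needs no client involvement.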


4. Translator Interface

The core abstraction lives in internal/translator/translator.go:

type Translator[ReqT any, SpanT any] interface {
    RequestBody(raw []byte, body *ReqT, flag bool) (
        newHeaders []Header, mutatedBody []byte, err error,
    )

    ResponseHeaders(headers map[string]string) (
        newHeaders []Header, err error,
    )

    ResponseBody(respHeaders map[string]string, body io.Reader,
        endOfStream bool, span SpanT) (
        newHeaders []Header, mutatedBody []byte,
        tokenUsage TokenUsage, responseModel string, err error,
    )

    ResponseError(respHeaders map[string]string, body io.Reader) (
        newHeaders []Header, mutatedBody []byte, err error,
    )
}

A type alias ties the generic interface to chat completions:

type OpenAIChatCompletionTranslator = Translator[
    openai.ChatCompletionRequest,
    tracingapi.ChatCompletionSpan,
]

Translator Method Mapping to ext_proc Phases

| Envoy Phase | Filter Level | Translator Call |
|---|---|---|
| Request Body | Router | Parse body, store state (no translator call) |
| Request Headers | Upstream | translator.RequestBody() |
| Response Headers | Upstream | translator.ResponseHeaders() |
| Response Body (success) | Upstream | translator.ResponseBody() |
| Response Body (error) | Upstream | translator.ResponseError() |

5. Supported Chat Completion Translators

All translator implementations live in internal/translator/:

| Provider | Source File | Factory Function |
|---|---|---|
| OpenAI (and compatible: Groq, Mistral, DeepSeek, Together AI, etc.) | openai_openai.go | NewChatCompletionOpenAIToOpenAITranslator() |
| AWS Bedrock (Converse API) | openai_awsbedrock.go | NewChatCompletionOpenAIToAWSBedrockTranslator() |
| AWS Bedrock (Anthropic) | openai_awsanthropic.go | NewChatCompletionOpenAIToAWSAnthropicTranslator() |
| Azure OpenAI | openai_azureopenai.go | NewChatCompletionOpenAIToAzureOpenAITranslator() |
| GCP Vertex AI (Gemini) | openai_gcpvertexai.go | NewChatCompletionOpenAIToGCPVertexAITranslator() |
| GCP Vertex AI (Anthropic) | openai_gcpanthropic.go | NewChatCompletionOpenAIToGCPAnthropicTranslator() |
| Anthropic (Native) | anthropic_anthropic.go | NewAnthropicToAnthropicTranslator() |

Providers that natively support the OpenAI API format (Groq, Grok, Mistral, Together AI, DeepSeek, Cohere, SambaNova, Google Gemini on AI Studio, DeepInfra, and self-hosted models like vLLM) use the OpenAI-to-OpenAI translator, which is mostly a passthrough that adjusts the URL path prefix.

Translator Selection (Factory Pattern)

The factory in internal/endpointspec/endpointspec.go dispatches based on the configured APISchemaName:

func (ChatCompletionsEndpointSpec) GetTranslator(
    schema filterapi.VersionedAPISchema,
    modelNameOverride string,
) (translator.OpenAIChatCompletionTranslator, error) {
    switch schema.Name {
    case filterapi.APISchemaOpenAI:
        return translator.NewChatCompletionOpenAIToOpenAITranslator(...)
    case filterapi.APISchemaAWSBedrock:
        return translator.NewChatCompletionOpenAIToAWSBedrockTranslator(...)
    case filterapi.APISchemaGCPVertexAI:
        return translator.NewChatCompletionOpenAIToGCPVertexAITranslator(...)
    case filterapi.APISchemaAzureOpenAI:
        return translator.NewChatCompletionOpenAIToAzureOpenAITranslator(...)
    // ... more providers
    }
}

API Schema Registry

Schema names are defined in internal/filterapi/filterconfig.go:

const (
    APISchemaOpenAI       = "OpenAI"
    APISchemaAWSBedrock   = "AWSBedrock"
    APISchemaAWSAnthropic = "AWSAnthropic"
    APISchemaAzureOpenAI  = "AzureOpenAI"
    APISchemaGCPVertexAI  = "GCPVertexAI"
    APISchemaGCPAnthropic = "GCPAnthropic"
    APISchemaAnthropic    = "Anthropic"
    APISchemaCohere       = "Cohere"
)

6. API Schema Data Models

Provider-specific request/response types live in internal/apischema/:

internal/apischema/
├── openai/       # ChatCompletionRequest/Response (unified client format)
├── anthropic/    # MessagesRequest/Response
├── awsbedrock/   # ConverseInput/ConverseOutput
├── gcp/          # Gemini GenerateContentRequest/Response
└── cohere/       # Rerank models

The OpenAI schema (internal/apischema/openai/openai.go) is the richest, defining:

  • ChatCompletionRequest — the unified request format
  • ChatCompletionResponse — the unified response format
  • Chat message union types (system, user, assistant, tool, developer)
  • Content part unions (text, image_url, input_audio, file)
  • Tool/function call definitions
  • Token usage structures
  • Vendor-specific extension fields (Anthropic thinking config, GCP extensions)

7. Request/Response Lifecycle

Request Translation Pipeline

Client OpenAI Request
  |
  v
Router Filter (routerProcessor.ProcessRequestBody):
  1. Parse JSON into *openai.ChatCompletionRequest
  2. Extract original model name
  3. Detect if streaming
  4. Start tracing span
  5. Set internal headers (x-ai-eg-model, x-ai-eg-original-path)
  6. Return CONTINUE with header mutations
  |
  v
Envoy Route Decision:
  Select upstream cluster, attach backend name in request attributes
  |
  v
Upstream Filter (upstreamProcessor.ProcessRequestHeaders):
  1. Look up backend from request attributes
  2. Select translator based on backend's APISchemaName
  3. Call translator.RequestBody():
     - Validate parameters
     - Map OpenAI fields to provider format
     - Handle model name overrides
     - Set appropriate HTTP path and headers
  4. Apply backend authentication
  5. Apply route-level header/body mutations
  6. Build dynamic metadata (cost tracking)
  7. Return CONTINUE_AND_REPLACE with mutations
  |
  v
Backend (provider-native format)

Response Translation Pipeline

Backend Response (provider-native format)
  |
  v
Upstream Filter (upstreamProcessor.ProcessResponseHeaders):
  1. Store response headers
  2. Call translator.ResponseHeaders()
  3. If streaming: set ModeOverride to STREAMED
  4. Return CONTINUE with header mutations
  |
  v
Upstream Filter (upstreamProcessor.ProcessResponseBody):
  1. Decompress if needed (gzip, deflate)
  2. Check HTTP status code:
     - Error path: call translator.ResponseError()
     - Success path: call translator.ResponseBody()
  3. Extract token usage and response model
  4. Record metrics (token counts, latency)
  5. Build dynamic metadata (cost data)
  6. Return CONTINUE with body mutations
  |
  v
Client Response (OpenAI format)

Example: OpenAI to AWS Bedrock

Request conversion (internal/translator/openai_awsbedrock.go):

OpenAI Request:                      AWS Bedrock Request:
{                                    {
  "model": "claude-3-sonnet",          "modelId": "anthropic.claude-3-sonnet",
  "messages": [{                       "messages": [{
    "role": "user",                      "role": "user",
    "content": [{                        "content": [{
      "type": "text",                      "text": "Hello"
      "text": "Hello"                    }, {
    }, {                                   "image": {
      "type": "image_url",                  "format": "png",
      "image_url": {                        "source": {
        "url": "data:image/png;..."           "bytes": [...]
      }                                    }
    }]                                   }
  }]                                   }]
}                                    }]
                                   }
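Note the image conversion above: the OpenAI side carries a base64 data URI, while Bedrock wants a format plus raw bytes. A minimal sketch of that split, assuming well-formed `data:image/<fmt>;base64,<payload>` input (this is illustrative, not the gateway's actual parsing code):

```go
package main

import (
	"encoding/base64"
	"fmt"
	"strings"
)

// parseImageDataURI splits an OpenAI-style image data URI into the
// image format and decoded bytes that Bedrock-style APIs expect.
func parseImageDataURI(uri string) (format string, data []byte, err error) {
	rest, ok := strings.CutPrefix(uri, "data:image/")
	if !ok {
		return "", nil, fmt.Errorf("not an image data URI")
	}
	format, payload, ok := strings.Cut(rest, ";base64,")
	if !ok {
		return "", nil, fmt.Errorf("missing base64 payload")
	}
	data, err = base64.StdEncoding.DecodeString(payload)
	return format, data, err
}

func main() {
	f, b, err := parseImageDataURI("data:image/png;base64,aGVsbG8=")
	if err != nil {
		panic(err)
	}
	fmt.Println(f, len(b), "bytes")
}
```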

Response conversion:

AWS Bedrock Response:                OpenAI Response:
{                                    {
  "output": {                          "model": "claude-3-sonnet",
    "message": {                       "choices": [{
      "role": "assistant",               "message": {
      "content": [{                        "role": "assistant",
        "text": "Hello!"                   "content": "Hello!",
      }, {                                 "tool_calls": [{
        "toolUse": {                         "id": "...",
          "toolUseId": "...",                "type": "function",
          "name": "search",                  "function": {
          "input": {...}                       "name": "search",
        }                                      "arguments": "..."
      }]                                     }
    }                                      }]
  },                                     },
  "usage": {                             "finish_reason": "stop"
    "inputTokens": 10,                }],
    "outputTokens": 5                  "usage": {
  },                                     "prompt_tokens": 10,
  "stopReason": "end_turn"               "completion_tokens": 5,
}                                        "total_tokens": 15
                                       }
                                     }

8. What Gets Translated

Each translator handles conversion of:

| Aspect | Details |
|---|---|
| Messages | Role mapping, content block structure, multi-modal content (text, images, audio) |
| Tool/function calls | Tool definition format, tool choice strategy (auto/required/none), tool call results |
| Thinking/reasoning | Vendor-specific thinking mode configuration (e.g., Anthropic extended thinking) |
| Streaming | SSE format differences, stateful chunk accumulation, event type mapping |
| Token usage | Field name mapping (input_tokens vs prompt_tokens), total calculation |
| Finish reasons | E.g., Anthropic end_turn → OpenAI stop, Bedrock end_turn → stop |
| Error responses | Provider error format → unified error format |
| Content types | Base64 image parsing, data URI handling, audio format conversion |
| Cache control | Vendor-specific prompt caching annotations |
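Finish-reason normalization reduces to a small lookup. The `end_turn → stop` mapping comes from this document; the other input values below are common Anthropic/Bedrock stop reasons added here as assumptions, and the real helper may cover a different set.

```go
package main

import "fmt"

// toOpenAIFinishReason normalizes provider stop reasons to OpenAI's
// finish_reason vocabulary ("stop", "length", "tool_calls").
func toOpenAIFinishReason(reason string) string {
	switch reason {
	case "end_turn", "stop_sequence":
		return "stop"
	case "max_tokens":
		return "length"
	case "tool_use":
		return "tool_calls"
	}
	return reason // pass unknown values through unchanged
}

func main() {
	fmt.Println(toOpenAIFinishReason("end_turn"))
}
```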

Streaming Response Handling

Each provider has different streaming formats requiring stateful parsers:

  • Anthropic: anthropicStreamParser in anthropic_helper.go — parses SSE events (content_block_start, content_block_delta, message_delta) and converts to OpenAI SSE format (data: {"choices":[{"delta":{...}}]})
  • AWS Bedrock: Uses AWS EventStream protocol (binary framing) decoded into ConverseStreamEvent chunks, converted to OpenAI chat completion chunks
  • GCP Vertex AI (Gemini): Parses Gemini streaming responses and maps to OpenAI chunk format
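One step of that conversion can be sketched in isolation: turning a single Anthropic content_block_delta payload into an OpenAI-style SSE chunk line. The real anthropicStreamParser is stateful and handles many more event types; the struct below keeps only the text delta.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// anthropicDelta models just the text portion of a content_block_delta event.
type anthropicDelta struct {
	Delta struct {
		Text string `json:"text"`
	} `json:"delta"`
}

// toOpenAIChunk wraps the delta text in an OpenAI chat-completion-chunk
// shape and frames it as an SSE data line.
func toOpenAIChunk(data []byte) (string, error) {
	var d anthropicDelta
	if err := json.Unmarshal(data, &d); err != nil {
		return "", err
	}
	chunk := map[string]any{
		"choices": []map[string]any{
			{"delta": map[string]string{"content": d.Delta.Text}},
		},
	}
	b, err := json.Marshal(chunk)
	if err != nil {
		return "", err
	}
	return "data: " + string(b) + "\n\n", nil
}

func main() {
	out, err := toOpenAIChunk([]byte(`{"type":"content_block_delta","delta":{"type":"text_delta","text":"Hi"}}`))
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```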

Common Helper Functions

Shared conversion utilities in internal/translator/:

  • anthropic_helper.go: openAIToAnthropicMessages(), translateOpenAItoAnthropicTools(), anthropicToolUseToOpenAICalls(), anthropicToOpenAIFinishReason()
  • gemini_helper.go: openAIMessagesToGeminiContents(), message and content conversion utilities

9. Envoy Configuration Injection

The control plane (internal/extensionserver/post_translate_modify.go) dynamically generates Envoy xDS configuration with the following components:

ext_proc UDS Cluster

Cluster {
  Name: "ai-gateway-extproc-uds"
  Type: STATIC
  ConnectTimeout: 10s
  LoadAssignment {
    Endpoints {
      LbEndpoints {
        Endpoint {
          Address: Pipe { Path: "/tmp/extproc.sock" }
        }
      }
    }
  }
  PerConnectionBufferLimitBytes: 50Mi
  Http2ProtocolOptions {
    InitialConnectionWindowSize: 1Mi
    InitialStreamWindowSize: 64Ki
  }
}

Router Filter ext_proc Configuration

ExternalProcessor {
  GrpcService {
    EnvoyGrpc { ClusterName: "ai-gateway-extproc-uds" }
    Timeout: 30s
  }
  ProcessingMode {
    RequestHeaderMode: SEND
    RequestBodyMode: BUFFERED
    ResponseHeaderMode: SEND
    ResponseBodyMode: BUFFERED
  }
  MessageTimeout: 10s
  FailureModeAllow: false
  AllowModeOverride: true
}

Upstream Filter ext_proc Configuration

ExternalProcessor {
  GrpcService { ClusterName: "ai-gateway-extproc-uds" }
  ProcessingMode {
    RequestHeaderMode: SEND
    RequestBodyMode: NONE       # Body already available from router phase
    ResponseHeaderMode: SKIP
    ResponseBodyMode: NONE
  }
  MessageTimeout: 10s
}

Header Mutation Filter

Applied after ext_proc to inject dynamic metadata into headers:

HeaderMutation {
  Mutations {
    RequestMutations [{
      Append {
        Header {
          Key: "content-length"
          Value: "%DYNAMIC_METADATA(ai-gateway:content_length)%"
        }
      }
    }]
  }
}

10. Server Entry Point

The gRPC server starts in cmd/extproc/mainlib/main.go:

server, err := extproc.NewServer(l, flags.enableRedaction)
server.Register(path, extproc.NewFactory(...))

s := grpc.NewServer(grpc.MaxRecvMsgSize(flags.maxRecvMsgSize))
extprocv3.RegisterExternalProcessorServer(s, server)
grpc_health_v1.RegisterHealthServer(s, server)
s.Serve(extProcLis)

Registered Endpoint Paths

| Path | Processor |
|---|---|
| /v1/chat/completions | Chat completion |
| /v1/completions | Text completion |
| /v1/embeddings | Embeddings |
| /v1/messages | Anthropic native messages |
| /v1/models | Model listing |
| /v2/rerank | Cohere rerank |

Startup Configuration

| Flag | Default | Purpose |
|---|---|---|
| configPath | | Configuration YAML file path |
| extProcAddr | :1063 | gRPC server address (supports UDS: unix:///tmp/ext_proc.sock) |
| adminPort | 1064 | HTTP admin port (metrics, health) |
| mcpAddr | | Optional MCP proxy address |
| maxRecvMsgSize | unlimited | Max gRPC message size |
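A sketch of how these flags might be wired through the standard flag package, using the names and defaults from the table; the real parsing in cmd/extproc/mainlib may differ beyond what the table states.

```go
package main

import (
	"flag"
	"fmt"
)

// parseFlags declares a subset of the startup flags and parses the
// given argument list (program name excluded, as flag.FlagSet expects).
func parseFlags(args []string) (configPath, extProcAddr string, adminPort int, err error) {
	fs := flag.NewFlagSet("extproc", flag.ContinueOnError)
	cp := fs.String("configPath", "", "Configuration YAML file path")
	addr := fs.String("extProcAddr", ":1063", "gRPC server address (TCP or unix:// UDS)")
	admin := fs.Int("adminPort", 1064, "HTTP admin port (metrics, health)")
	if err := fs.Parse(args); err != nil {
		return "", "", 0, err
	}
	return *cp, *addr, *admin, nil
}

func main() {
	cp, addr, admin, err := parseFlags([]string{"-configPath", "/etc/ai-gateway/config.yaml"})
	if err != nil {
		panic(err)
	}
	fmt.Println(cp, addr, admin)
}
```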

11. End-to-End Request Flow Diagram

                        CLIENT REQUEST
                POST /v1/chat/completions
                {model: "gpt-4", messages: [...]}
                            |
                            v
            +-------------------------------+
            |   ENVOY ROUTER FILTER         |
            |   Matches route, calls        |
            |   ext_proc (SEND + BUFFERED)  |
            +---------------+---------------+
                            |
               Request Headers + Body
                            |
                            v
            +-------------------------------+
            |  routerProcessor              |
            |  .ProcessRequestBody()        |
            |  ----------------------------  |
            |  1. Parse as OpenAI request   |
            |  2. Extract model name        |
            |  3. Start tracing span        |
            |  4. Store original body       |
            |  5. Set internal headers      |
            |  6. Return CONTINUE           |
            +---------------+---------------+
                            |
                            v
            +-------------------------------+
            |   ENVOY ROUTE DECISION        |
            |   Select upstream cluster     |
            |   Attach backend name in      |
            |   request attributes          |
            +---------------+---------------+
                            |
                            v
            +-------------------------------+
            |   ENVOY UPSTREAM FILTER       |
            |   Calls ext_proc (SEND only)  |
            +---------------+---------------+
                            |
                   Request Headers
                            |
                            v
            +-------------------------------+
            |  upstreamProcessor            |
            |  .ProcessRequestHeaders()     |
            |  ----------------------------  |
            |  1. Get backend from attrs    |
            |  2. Select translator         |
            |  3. translator.RequestBody()  |
            |     (OpenAI -> Bedrock/etc.)  |
            |  4. Apply backend auth        |
            |  5. Apply route mutations     |
            |  6. Build dynamic metadata    |
            |  7. Return CONTINUE_AND_      |
            |     REPLACE                   |
            +---------------+---------------+
                            |
                            v
            +-------------------------------+
            |   BACKEND (Provider-native)   |
            |   Returns response            |
            +---------------+---------------+
                            |
                   Response Headers
                            |
                            v
            +-------------------------------+
            |  upstreamProcessor            |
            |  .ProcessResponseHeaders()    |
            |  ----------------------------  |
            |  1. translator.Response-      |
            |     Headers()                 |
            |  2. Detect streaming          |
            |  3. Set STREAMED mode if      |
            |     streaming                 |
            |  4. Return CONTINUE           |
            +---------------+---------------+
                            |
                   Response Body
              (per chunk if streaming)
                            |
                            v
            +-------------------------------+
            |  upstreamProcessor            |
            |  .ProcessResponseBody()       |
            |  ----------------------------  |
            |  1. Decompress if needed      |
            |  2. translator.ResponseBody() |
            |     (Bedrock/etc. -> OpenAI)  |
            |  3. Extract token usage       |
            |  4. Record metrics            |
            |  5. Build cost metadata       |
            |  6. Return CONTINUE           |
            +---------------+---------------+
                            |
                            v
            +-------------------------------+
            |   ENVOY ROUTER (response)     |
            |   Mutations applied           |
            +---------------+---------------+
                            |
                            v
                    CLIENT RESPONSE
              {choices: [{message: {...}}]}
              {usage: {prompt_tokens: ...}}

12. Key Files Reference

Core ext_proc Integration

| File | Purpose |
|---|---|
| cmd/extproc/main.go | Entry point, signal handling |
| cmd/extproc/mainlib/main.go | Server initialization, listener setup, processor factory registration |
| internal/extproc/server.go | gRPC ExternalProcessor service, stream lifecycle management |
| internal/extproc/processor.go | Processor interface definition |
| internal/extproc/processor_impl.go | routerProcessor and upstreamProcessor implementations |
| internal/extproc/util.go | Helpers (header mutations, dynamic metadata) |

Translator Layer

| File | Purpose |
|---|---|
| internal/translator/translator.go | Core Translator interface, type aliases |
| internal/translator/openai_openai.go | OpenAI-to-OpenAI passthrough |
| internal/translator/openai_awsbedrock.go | OpenAI-to-AWS Bedrock Converse |
| internal/translator/openai_awsanthropic.go | OpenAI-to-AWS Bedrock Anthropic |
| internal/translator/openai_azureopenai.go | OpenAI-to-Azure OpenAI |
| internal/translator/openai_gcpvertexai.go | OpenAI-to-GCP Vertex AI Gemini |
| internal/translator/openai_gcpanthropic.go | OpenAI-to-GCP Vertex AI Anthropic |
| internal/translator/anthropic_anthropic.go | Anthropic-to-Anthropic passthrough |
| internal/translator/anthropic_helper.go | OpenAI/Anthropic conversion utilities, streaming parser |
| internal/translator/gemini_helper.go | OpenAI/Gemini conversion utilities |

Data Models

| File | Purpose |
|---|---|
| internal/apischema/openai/openai.go | OpenAI ChatCompletionRequest/Response definitions |
| internal/apischema/anthropic/anthropic.go | Anthropic MessagesRequest/Response definitions |
| internal/apischema/awsbedrock/awsbedrock.go | AWS Bedrock ConverseInput/ConverseOutput definitions |
| internal/apischema/gcp/gcp.go | GCP Gemini type definitions |

Configuration & Control Plane

| File | Purpose |
|---|---|
| internal/filterapi/filterconfig.go | APISchemaName constants, VersionedAPISchema, backend config |
| internal/endpointspec/endpointspec.go | Endpoint spec interface, translator factory dispatch |
| internal/extensionserver/post_translate_modify.go | xDS modification to inject ext_proc filters into Envoy config |