An analysis of how the Envoy AI Gateway converts between native third-party AI provider formats and a unified OpenAI-compatible API for chat completion endpoints.
Repository: envoyproxy/ai-gateway · Documentation: aigateway.envoyproxy.io
- 1. Architecture Overview
- 2. Envoy ext_proc Integration
- 3. Two-Level Filter Design
- 4. Translator Interface
- 5. Supported Chat Completion Translators
- 6. API Schema Data Models
- 7. Request/Response Lifecycle
- 8. What Gets Translated
- 9. Envoy Configuration Injection
- 10. Server Entry Point
- 11. End-to-End Request Flow Diagram
- 12. Key Files Reference
The gateway uses a pluggable translator pattern where all incoming requests arrive in OpenAI-compatible format and are translated to each provider's native format (and vice versa for responses). At runtime, the gateway operates as a gRPC external processor for Envoy Proxy, using Envoy's ext_proc (External Processing) filter.
The core design principles:
- Unified client interface: Clients always speak OpenAI-compatible API format
- Pluggable backends: Each provider has a dedicated translator implementing a common interface
- Two-phase processing: Routing is separated from schema translation via two ext_proc filter levels
- Retry-aware: The architecture supports transparent failover between providers
The AI Gateway runs as a standalone gRPC server implementing `envoy.service.ext_proc.v3.ExternalProcessor`. It communicates with Envoy over a Unix Domain Socket (`/tmp/extproc.sock`) for low-latency IPC.
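As a concrete sketch of that setup, the following is roughly what serving the processor over the UDS looks like with stock grpc-go and the go-control-plane bindings. The `server` type here is a stand-in for the gateway's actual implementation (its `Process` method is sketched in the next section), and error handling is trimmed:

```go
package main

import (
	"log"
	"net"
	"os"

	extprocv3 "github.com/envoyproxy/go-control-plane/envoy/service/ext_proc/v3"
	"google.golang.org/grpc"
)

// server stands in for the gateway's processor implementation.
type server struct {
	extprocv3.UnimplementedExternalProcessorServer
}

func main() {
	const sock = "/tmp/extproc.sock"
	_ = os.Remove(sock) // clear a stale socket from a previous run
	lis, err := net.Listen("unix", sock)
	if err != nil {
		log.Fatal(err)
	}
	s := grpc.NewServer()
	extprocv3.RegisterExternalProcessorServer(s, &server{})
	log.Fatal(s.Serve(lis))
}
```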
The control plane dynamically injects the ext_proc filter configuration into Envoy's xDS configuration. The gateway does not use a WASM filter, Lua script, or custom C++ filter — it relies entirely on ext_proc's gRPC streaming protocol to intercept and mutate requests and responses as they flow through Envoy.
The ext_proc protocol is a bidirectional gRPC stream. Envoy sends ProcessingRequest messages at various phases of the HTTP lifecycle, and the gateway responds with ProcessingResponse messages containing mutations:
| ProcessingRequest Type | Purpose |
|---|---|
| `RequestHeaders` | Client's HTTP request headers arrive |
| `RequestBody` | Client's HTTP request body arrives (buffered) |
| `ResponseHeaders` | Backend's HTTP response headers arrive |
| `ResponseBody` | Backend's HTTP response body arrives (buffered or streamed) |
Each response can include header mutations, body mutations, and dynamic metadata for downstream Envoy filters (e.g., rate limiting, cost tracking).
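Continuing the sketch above, a minimal `Process` loop shows the shape of the protocol: receive a phase message, answer it with a (here empty) mutation, repeat until Envoy closes the stream. An empty `HeadersResponse`/`BodyResponse` means CONTINUE with no changes; the gateway's real processors do their parsing and translation inside these switch arms:

```go
import (
	"io"

	extprocv3 "github.com/envoyproxy/go-control-plane/envoy/service/ext_proc/v3"
)

// Process handles one bidirectional ext_proc stream from Envoy.
func (s *server) Process(stream extprocv3.ExternalProcessor_ProcessServer) error {
	for {
		req, err := stream.Recv()
		if err == io.EOF {
			return nil // Envoy closed the stream
		}
		if err != nil {
			return err
		}
		resp := &extprocv3.ProcessingResponse{}
		switch req.Request.(type) {
		case *extprocv3.ProcessingRequest_RequestHeaders:
			resp.Response = &extprocv3.ProcessingResponse_RequestHeaders{RequestHeaders: &extprocv3.HeadersResponse{}}
		case *extprocv3.ProcessingRequest_RequestBody:
			resp.Response = &extprocv3.ProcessingResponse_RequestBody{RequestBody: &extprocv3.BodyResponse{}}
		case *extprocv3.ProcessingRequest_ResponseHeaders:
			resp.Response = &extprocv3.ProcessingResponse_ResponseHeaders{ResponseHeaders: &extprocv3.HeadersResponse{}}
		case *extprocv3.ProcessingRequest_ResponseBody:
			resp.Response = &extprocv3.ProcessingResponse_ResponseBody{ResponseBody: &extprocv3.BodyResponse{}}
		}
		if err := stream.Send(resp); err != nil {
			return err
		}
	}
}
```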
The key architectural decision is a two-level ext_proc filter hierarchy:
Router filter (first level):

- Receives the client request headers and body
- Parses the OpenAI-format request and extracts the model name
- Starts an OpenTelemetry tracing span
- Stores the original request body for later use by the upstream filter
- Does not perform schema translation — its role is routing context and state management
Configuration: `RequestBodyMode: BUFFERED`, `ResponseBodyMode: BUFFERED`
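For illustration, a sketch of the kind of mutation the router phase returns, using go-control-plane types: it surfaces the parsed model name as the internal `x-ai-eg-model` header (named in the lifecycle walkthrough below) so Envoy's route table can match on it. The function name and wiring are schematic, not the repository's literal code:

```go
import (
	corev3 "github.com/envoyproxy/go-control-plane/envoy/config/core/v3"
	extprocv3 "github.com/envoyproxy/go-control-plane/envoy/service/ext_proc/v3"
)

// routerHeaderMutation answers the RequestBody phase with a header
// mutation exposing the model parsed from the OpenAI request body.
func routerHeaderMutation(modelName string) *extprocv3.ProcessingResponse {
	return &extprocv3.ProcessingResponse{
		Response: &extprocv3.ProcessingResponse_RequestBody{
			RequestBody: &extprocv3.BodyResponse{
				Response: &extprocv3.CommonResponse{
					HeaderMutation: &extprocv3.HeaderMutation{
						SetHeaders: []*corev3.HeaderValueOption{{
							Header: &corev3.HeaderValue{
								Key:      "x-ai-eg-model",
								RawValue: []byte(modelName),
							},
						}},
					},
				},
			},
		},
	}
}
```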
Upstream filter (second level):

- Fires after Envoy has selected a backend cluster
- Receives the backend identity from Envoy request attributes
- Selects the appropriate `Translator` based on the backend's `APISchemaName`
- Calls `translator.RequestBody()` to convert OpenAI format to provider-native format
- Applies backend-specific authentication (AWS SigV4, GCP tokens, API keys, etc.)
- On response, calls `translator.ResponseBody()` to convert back to OpenAI format
- Handles streaming via `ModeOverride` to `STREAMED` body processing
Configuration: `RequestHeaderMode: SEND` only (body already in memory from the router phase)
This separation allows Envoy to handle retries and failover between backends. If a request to one provider fails, Envoy can retry against a different backend cluster. The upstream filter re-translates the original request (preserved by the router filter) for each retry attempt, potentially targeting a completely different provider with a different schema.
The core abstraction lives in `internal/translator/translator.go`:

```go
type Translator[ReqT any, SpanT any] interface {
RequestBody(raw []byte, body *ReqT, flag bool) (
newHeaders []Header, mutatedBody []byte, err error,
)
ResponseHeaders(headers map[string]string) (
newHeaders []Header, err error,
)
ResponseBody(respHeaders map[string]string, body io.Reader,
endOfStream bool, span SpanT) (
newHeaders []Header, mutatedBody []byte,
tokenUsage TokenUsage, responseModel string, err error,
)
ResponseError(respHeaders map[string]string, body io.Reader) (
newHeaders []Header, mutatedBody []byte, err error,
)
}
```

A type alias ties the generic interface to chat completions:

```go
type OpenAIChatCompletionTranslator = Translator[
openai.ChatCompletionRequest,
tracingapi.ChatCompletionSpan,
]
```

| Envoy Phase | Filter Level | Translator Call |
|---|---|---|
| Request Body | Router | Parse body, store state (no translator call) |
| Request Headers | Upstream | `translator.RequestBody()` |
| Response Headers | Upstream | `translator.ResponseHeaders()` |
| Response Body (success) | Upstream | `translator.ResponseBody()` |
| Response Body (error) | Upstream | `translator.ResponseError()` |
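To make the contract concrete, here is a hypothetical no-op implementation as it might look inside the translator package (where `Header` and `TokenUsage` are defined, and `io`, `openai`, and `tracingapi` are imported as in the repository). Every real translator replaces these bodies with schema mapping:

```go
// noopTranslator is a hypothetical passthrough used only to illustrate
// the interface; compare the real OpenAI-to-OpenAI translator.
type noopTranslator struct{}

func (noopTranslator) RequestBody(raw []byte, _ *openai.ChatCompletionRequest, _ bool) (
	newHeaders []Header, mutatedBody []byte, err error) {
	return nil, raw, nil // forward the OpenAI body unchanged
}

func (noopTranslator) ResponseHeaders(map[string]string) (newHeaders []Header, err error) {
	return nil, nil // no header rewrites needed
}

func (noopTranslator) ResponseBody(_ map[string]string, body io.Reader,
	_ bool, _ tracingapi.ChatCompletionSpan) (
	newHeaders []Header, mutatedBody []byte,
	tokenUsage TokenUsage, responseModel string, err error) {
	mutatedBody, err = io.ReadAll(body) // body is already OpenAI-shaped
	return nil, mutatedBody, TokenUsage{}, "", err
}

func (noopTranslator) ResponseError(_ map[string]string, body io.Reader) (
	newHeaders []Header, mutatedBody []byte, err error) {
	mutatedBody, err = io.ReadAll(body)
	return nil, mutatedBody, err
}

// Compile-time check that the sketch satisfies the chat completion alias.
var _ OpenAIChatCompletionTranslator = noopTranslator{}
```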
All translator implementations live in `internal/translator/`:

| Provider | Source File | Factory Function |
|---|---|---|
| OpenAI (and compatible: Groq, Mistral, DeepSeek, Together AI, etc.) | `openai_openai.go` | `NewChatCompletionOpenAIToOpenAITranslator()` |
| AWS Bedrock (Converse API) | `openai_awsbedrock.go` | `NewChatCompletionOpenAIToAWSBedrockTranslator()` |
| AWS Bedrock (Anthropic) | `openai_awsanthropic.go` | `NewChatCompletionOpenAIToAWSAnthropicTranslator()` |
| Azure OpenAI | `openai_azureopenai.go` | `NewChatCompletionOpenAIToAzureOpenAITranslator()` |
| GCP Vertex AI (Gemini) | `openai_gcpvertexai.go` | `NewChatCompletionOpenAIToGCPVertexAITranslator()` |
| GCP Vertex AI (Anthropic) | `openai_gcpanthropic.go` | `NewChatCompletionOpenAIToGCPAnthropicTranslator()` |
| Anthropic (Native) | `anthropic_anthropic.go` | `NewAnthropicToAnthropicTranslator()` |
Providers that natively support the OpenAI API format (Groq, Grok, Mistral, Together AI, DeepSeek, Cohere, SambaNova, Google Gemini on AI Studio, DeepInfra, and self-hosted models like vLLM) use the OpenAI-to-OpenAI translator, which is mostly a passthrough that adjusts the URL path prefix.
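The "mostly a passthrough" part can be pictured as little more than the following. This helper is hypothetical (the real translator also handles model name overrides and streaming bookkeeping):

```go
import "strings"

// rewritePath re-mounts the incoming OpenAI path under the backend's
// base prefix, e.g. "/openai" + "/v1/chat/completions".
func rewritePath(prefix, origPath string) string {
	return strings.TrimSuffix(prefix, "/") + origPath
}
```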
The factory in `internal/endpointspec/endpointspec.go` dispatches based on the configured `APISchemaName`:

```go
func (ChatCompletionsEndpointSpec) GetTranslator(
schema filterapi.VersionedAPISchema,
modelNameOverride string,
) (translator.OpenAIChatCompletionTranslator, error) {
switch schema.Name {
case filterapi.APISchemaOpenAI:
return translator.NewChatCompletionOpenAIToOpenAITranslator(...)
case filterapi.APISchemaAWSBedrock:
return translator.NewChatCompletionOpenAIToAWSBedrockTranslator(...)
case filterapi.APISchemaGCPVertexAI:
return translator.NewChatCompletionOpenAIToGCPVertexAITranslator(...)
case filterapi.APISchemaAzureOpenAI:
return translator.NewChatCompletionOpenAIToAzureOpenAITranslator(...)
// ... more providers
}
}
```

Schema names are defined in `internal/filterapi/filterconfig.go`:

```go
const (
APISchemaOpenAI = "OpenAI"
APISchemaAWSBedrock = "AWSBedrock"
APISchemaAWSAnthropic = "AWSAnthropic"
APISchemaAzureOpenAI = "AzureOpenAI"
APISchemaGCPVertexAI = "GCPVertexAI"
APISchemaGCPAnthropic = "GCPAnthropic"
APISchemaAnthropic = "Anthropic"
APISchemaCohere = "Cohere"
)
```
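A hedged usage sketch of the dispatch above, with the packages named as in the repository; the constructor arguments are elided in the excerpt, so this stays schematic:

```go
// translatorFor resolves the chat-completion translator for a backend's
// declared schema; "" means no model name override.
func translatorFor(schema filterapi.VersionedAPISchema) (translator.OpenAIChatCompletionTranslator, error) {
	var spec endpointspec.ChatCompletionsEndpointSpec
	return spec.GetTranslator(schema, "")
}
```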
Provider-specific request/response types live in `internal/apischema/`:

```
internal/apischema/
├── openai/ # ChatCompletionRequest/Response (unified client format)
├── anthropic/ # MessagesRequest/Response
├── awsbedrock/ # ConverseInput/ConverseOutput
├── gcp/ # Gemini GenerateContentRequest/Response
└── cohere/ # Rerank models
```

The OpenAI schema (`internal/apischema/openai/openai.go`) is the richest, defining:
- `ChatCompletionRequest` — the unified request format
- `ChatCompletionResponse` — the unified response format
- Chat message union types (system, user, assistant, tool, developer)
- Content part unions (text, image_url, input_audio, file)
- Tool/function call definitions
- Token usage structures
- Vendor-specific extension fields (Anthropic thinking config, GCP extensions)
```
Client OpenAI Request
|
v
Router Filter (routerProcessor.ProcessRequestBody):
1. Parse JSON into *openai.ChatCompletionRequest
2. Extract original model name
3. Detect if streaming
4. Start tracing span
5. Set internal headers (x-ai-eg-model, x-ai-eg-original-path)
6. Return CONTINUE with header mutations
|
v
Envoy Route Decision:
Select upstream cluster, attach backend name in request attributes
|
v
Upstream Filter (upstreamProcessor.ProcessRequestHeaders):
1. Look up backend from request attributes
2. Select translator based on backend's APISchemaName
3. Call translator.RequestBody():
- Validate parameters
- Map OpenAI fields to provider format
- Handle model name overrides
- Set appropriate HTTP path and headers
4. Apply backend authentication
5. Apply route-level header/body mutations
6. Build dynamic metadata (cost tracking)
7. Return CONTINUE_AND_REPLACE with mutations
|
v
Backend (provider-native format)
Backend Response (provider-native format)
|
v
Upstream Filter (upstreamProcessor.ProcessResponseHeaders):
1. Store response headers
2. Call translator.ResponseHeaders()
3. If streaming: set ModeOverride to STREAMED
4. Return CONTINUE with header mutations
|
v
Upstream Filter (upstreamProcessor.ProcessResponseBody):
1. Decompress if needed (gzip, deflate)
2. Check HTTP status code:
- Error path: call translator.ResponseError()
- Success path: call translator.ResponseBody()
3. Extract token usage and response model
4. Record metrics (token counts, latency)
5. Build dynamic metadata (cost data)
6. Return CONTINUE with body mutations
|
v
Client Response (OpenAI format)
```
Request conversion (`internal/translator/openai_awsbedrock.go`):

```
OpenAI Request:                              AWS Bedrock Request:
{ {
"model": "claude-3-sonnet", "modelId": "anthropic.claude-3-sonnet",
"messages": [{ "messages": [{
"role": "user", "role": "user",
"content": [{ "content": [{
"type": "text", "text": "Hello"
"text": "Hello" }, {
}, { "image": {
"type": "image_url", "format": "png",
"image_url": { "source": {
"url": "data:image/png;..." "bytes": [...]
} }
}] }
}] }]
} }]
}
```
Response conversion:

```
AWS Bedrock Response:                        OpenAI Response:
{ {
"output": { "model": "claude-3-sonnet",
"message": { "choices": [{
"role": "assistant", "message": {
"content": [{ "role": "assistant",
"text": "Hello!" "content": "Hello!",
}, { "tool_calls": [{
"toolUse": { "id": "...",
"toolUseId": "...", "type": "function",
"name": "search", "function": {
"input": {...} "name": "search",
} "arguments": "..."
}] }
} }]
}, },
"usage": { "finish_reason": "stop"
"inputTokens": 10, }],
"outputTokens": 5 "usage": {
}, "prompt_tokens": 10,
"stopReason": "end_turn" "completion_tokens": 5,
} "total_tokens": 15
}
}
```
Each translator handles conversion of:
| Aspect | Details |
|---|---|
| Messages | Role mapping, content block structure, multi-modal content (text, images, audio) |
| Tool/function calls | Tool definition format, tool choice strategy (auto/required/none), tool call results |
| Thinking/reasoning | Vendor-specific thinking mode configuration (e.g., Anthropic extended thinking) |
| Streaming | SSE format differences, stateful chunk accumulation, event type mapping |
| Token usage | Field name mapping (`input_tokens` vs `prompt_tokens`), total calculation |
| Finish reasons | E.g., Anthropic `end_turn` → OpenAI `stop`, Bedrock `end_turn` → `stop` |
| Error responses | Provider error format → unified error format |
| Content types | Base64 image parsing, data URI handling, audio format conversion |
| Cache control | Vendor-specific prompt caching annotations |
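A sketch of the finish-reason normalization the table describes. The `end_turn` → `stop` mapping comes from the table; the `max_tokens` and `tool_use` cases are assumed mappings, shown only for shape:

```go
// toOpenAIFinishReason maps a provider stop reason onto OpenAI's
// finish_reason vocabulary (hypothetical helper).
func toOpenAIFinishReason(provider string) string {
	switch provider {
	case "end_turn": // Anthropic and Bedrock
		return "stop"
	case "max_tokens": // assumed mapping
		return "length"
	case "tool_use": // assumed mapping
		return "tool_calls"
	default:
		return "stop"
	}
}
```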
Each provider has different streaming formats requiring stateful parsers:
- Anthropic: `anthropicStreamParser` in `anthropic_helper.go` — parses SSE events (`content_block_start`, `content_block_delta`, `message_delta`) and converts to OpenAI SSE format (`data: {"choices":[{"delta":{...}}]}`)
- AWS Bedrock: uses the AWS EventStream protocol (binary framing), decoded into `ConverseStreamEvent` chunks and converted to OpenAI chat completion chunks
- GCP Vertex AI (Gemini): parses Gemini streaming responses and maps them to OpenAI chunk format
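A minimal sketch of that SSE re-framing in the spirit of `anthropicStreamParser`: take one Anthropic `content_block_delta` payload and emit an OpenAI-style chunk line. The struct models only enough of the event for illustration, and the helper name is hypothetical:

```go
import "encoding/json"

// anthropicDelta models just enough of a content_block_delta event for
// this sketch; real events carry more fields and event types.
type anthropicDelta struct {
	Delta struct {
		Text string `json:"text"`
	} `json:"delta"`
}

// deltaToOpenAIChunk re-frames one Anthropic SSE data payload as an
// OpenAI chat.completion.chunk SSE line.
func deltaToOpenAIChunk(eventData []byte) ([]byte, error) {
	var ev anthropicDelta
	if err := json.Unmarshal(eventData, &ev); err != nil {
		return nil, err
	}
	chunk := map[string]any{
		"object": "chat.completion.chunk",
		"choices": []map[string]any{{
			"index": 0,
			"delta": map[string]any{"content": ev.Delta.Text},
		}},
	}
	b, err := json.Marshal(chunk)
	if err != nil {
		return nil, err
	}
	return append(append([]byte("data: "), b...), "\n\n"...), nil
}
```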
Shared conversion utilities in `internal/translator/`:

- `anthropic_helper.go`: `openAIToAnthropicMessages()`, `translateOpenAItoAnthropicTools()`, `anthropicToolUseToOpenAICalls()`, `anthropicToOpenAIFinishReason()`
- `gemini_helper.go`: `openAIMessagesToGeminiContents()`, message and content conversion utilities
The control plane (`internal/extensionserver/post_translate_modify.go`) dynamically generates Envoy xDS configuration with the following components.

ExtProc cluster (STATIC, pointing at the Unix Domain Socket):

```
Cluster {
Name: "ai-gateway-extproc-uds"
Type: STATIC
ConnectTimeout: 10s
LoadAssignment {
Endpoints {
LbEndpoints {
Endpoint {
Address: Pipe { Path: "/tmp/extproc.sock" }
}
}
}
}
PerConnectionBufferLimitBytes: 50Mi
Http2ProtocolOptions {
InitialConnectionWindowSize: 1Mi
InitialStreamWindowSize: 64Ki
}
}
```

Router-level ext_proc filter (buffered bodies, mode override allowed):

```
ExternalProcessor {
GrpcService {
EnvoyGrpc { ClusterName: "ai-gateway-extproc-uds" }
Timeout: 30s
}
ProcessingMode {
RequestHeaderMode: SEND
RequestBodyMode: BUFFERED
ResponseHeaderMode: SEND
ResponseBodyMode: BUFFERED
}
MessageTimeout: 10s
FailureModeAllow: false
AllowModeOverride: true
}
```

Upstream-level ext_proc filter:

```
ExternalProcessor {
GrpcService { ClusterName: "ai-gateway-extproc-uds" }
ProcessingMode {
RequestHeaderMode: SEND
RequestBodyMode: NONE # Body already available from router phase
ResponseHeaderMode: SKIP
ResponseBodyMode: NONE
}
MessageTimeout: 10s
}
```

A header mutation filter is applied after ext_proc to inject dynamic metadata into headers:

```
HeaderMutation {
Mutations {
RequestMutations [{
Append {
Header {
Key: "content-length"
Value: "%DYNAMIC_METADATA(ai-gateway:content_length)%"
}
}
}]
}
}
```
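On the processor side, the metadata this filter reads is attached to the ProcessingResponse. A sketch with go-control-plane types, where the `ai-gateway` namespace and `content_length` key mirror the substitution above and the helper name is hypothetical:

```go
import (
	extprocv3 "github.com/envoyproxy/go-control-plane/envoy/service/ext_proc/v3"
	"google.golang.org/protobuf/types/known/structpb"
)

// withContentLengthMetadata attaches the translated body's length as
// dynamic metadata so the header mutation filter can substitute it.
func withContentLengthMetadata(resp *extprocv3.ProcessingResponse, n int) *extprocv3.ProcessingResponse {
	md, err := structpb.NewStruct(map[string]any{
		"ai-gateway": map[string]any{"content_length": float64(n)},
	})
	if err == nil {
		resp.DynamicMetadata = md
	}
	return resp
}
```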
The gRPC server starts in `cmd/extproc/mainlib/main.go`:

```go
server, err := extproc.NewServer(l, flags.enableRedaction)
server.Register(path, extproc.NewFactory(...))
s := grpc.NewServer(grpc.MaxRecvMsgSize(flags.maxRecvMsgSize))
extprocv3.RegisterExternalProcessorServer(s, server)
grpc_health_v1.RegisterHealthServer(s, server)
s.Serve(extProcLis)
```

Each registered path is handled by a dedicated processor:

| Path | Processor |
|---|---|
| `/v1/chat/completions` | Chat completion |
| `/v1/completions` | Text completion |
| `/v1/embeddings` | Embeddings |
| `/v1/messages` | Anthropic native messages |
| `/v1/models` | Model listing |
| `/v2/rerank` | Cohere rerank |
| Flag | Default | Purpose |
|---|---|---|
| `configPath` | — | Configuration YAML file path |
| `extProcAddr` | `:1063` | gRPC server address (supports UDS: `unix:///tmp/ext_proc.sock`) |
| `adminPort` | `1064` | HTTP admin port (metrics, health) |
| `mcpAddr` | — | Optional MCP proxy address |
| `maxRecvMsgSize` | unlimited | Max gRPC message size |
```
CLIENT REQUEST
POST /v1/chat/completions
{model: "gpt-4", messages: [...]}
|
v
+-------------------------------+
| ENVOY ROUTER FILTER |
| Matches route, calls |
| ext_proc (SEND + BUFFERED) |
+---------------+---------------+
|
Request Headers + Body
|
v
+-------------------------------+
| routerProcessor |
| .ProcessRequestBody() |
| ---------------------------- |
| 1. Parse as OpenAI request |
| 2. Extract model name |
| 3. Start tracing span |
| 4. Store original body |
| 5. Set internal headers |
| 6. Return CONTINUE |
+---------------+---------------+
|
v
+-------------------------------+
| ENVOY ROUTE DECISION |
| Select upstream cluster |
| Attach backend name in |
| request attributes |
+---------------+---------------+
|
v
+-------------------------------+
| ENVOY UPSTREAM FILTER |
| Calls ext_proc (SEND only) |
+---------------+---------------+
|
Request Headers
|
v
+-------------------------------+
| upstreamProcessor |
| .ProcessRequestHeaders() |
| ---------------------------- |
| 1. Get backend from attrs |
| 2. Select translator |
| 3. translator.RequestBody() |
| (OpenAI -> Bedrock/etc.) |
| 4. Apply backend auth |
| 5. Apply route mutations |
| 6. Build dynamic metadata |
| 7. Return CONTINUE_AND_ |
| REPLACE |
+---------------+---------------+
|
v
+-------------------------------+
| BACKEND (Provider-native) |
| Returns response |
+---------------+---------------+
|
Response Headers
|
v
+-------------------------------+
| upstreamProcessor |
| .ProcessResponseHeaders() |
| ---------------------------- |
| 1. translator.Response- |
| Headers() |
| 2. Detect streaming |
| 3. Set STREAMED mode if |
| streaming |
| 4. Return CONTINUE |
+---------------+---------------+
|
Response Body
(per chunk if streaming)
|
v
+-------------------------------+
| upstreamProcessor |
| .ProcessResponseBody() |
| ---------------------------- |
| 1. Decompress if needed |
| 2. translator.ResponseBody() |
| (Bedrock/etc. -> OpenAI) |
| 3. Extract token usage |
| 4. Record metrics |
| 5. Build cost metadata |
| 6. Return CONTINUE |
+---------------+---------------+
|
v
+-------------------------------+
| ENVOY ROUTER (response) |
| Mutations applied |
+---------------+---------------+
|
v
CLIENT RESPONSE
{choices: [{message: {...}}]}
{usage: {prompt_tokens: ...}}
```
Server and processors:

| File | Purpose |
|---|---|
| `cmd/extproc/main.go` | Entry point, signal handling |
| `cmd/extproc/mainlib/main.go` | Server initialization, listener setup, processor factory registration |
| `internal/extproc/server.go` | gRPC ExternalProcessor service, stream lifecycle management |
| `internal/extproc/processor.go` | Processor interface definition |
| `internal/extproc/processor_impl.go` | `routerProcessor` and `upstreamProcessor` implementations |
| `internal/extproc/util.go` | Helpers (header mutations, dynamic metadata) |
Translators:

| File | Purpose |
|---|---|
| `internal/translator/translator.go` | Core `Translator` interface, type aliases |
| `internal/translator/openai_openai.go` | OpenAI-to-OpenAI passthrough |
| `internal/translator/openai_awsbedrock.go` | OpenAI-to-AWS Bedrock Converse |
| `internal/translator/openai_awsanthropic.go` | OpenAI-to-AWS Bedrock Anthropic |
| `internal/translator/openai_azureopenai.go` | OpenAI-to-Azure OpenAI |
| `internal/translator/openai_gcpvertexai.go` | OpenAI-to-GCP Vertex AI Gemini |
| `internal/translator/openai_gcpanthropic.go` | OpenAI-to-GCP Vertex AI Anthropic |
| `internal/translator/anthropic_anthropic.go` | Anthropic-to-Anthropic passthrough |
| `internal/translator/anthropic_helper.go` | OpenAI/Anthropic conversion utilities, streaming parser |
| `internal/translator/gemini_helper.go` | OpenAI/Gemini conversion utilities |
API schemas:

| File | Purpose |
|---|---|
| `internal/apischema/openai/openai.go` | OpenAI ChatCompletionRequest/Response definitions |
| `internal/apischema/anthropic/anthropic.go` | Anthropic MessagesRequest/Response definitions |
| `internal/apischema/awsbedrock/awsbedrock.go` | AWS Bedrock ConverseInput/ConverseOutput definitions |
| `internal/apischema/gcp/gcp.go` | GCP Gemini type definitions |
Configuration and control plane:

| File | Purpose |
|---|---|
| `internal/filterapi/filterconfig.go` | `APISchemaName` constants, `VersionedAPISchema`, backend config |
| `internal/endpointspec/endpointspec.go` | Endpoint spec interface, translator factory dispatch |
| `internal/extensionserver/post_translate_modify.go` | xDS modification to inject ext_proc filters into Envoy config |