These are some cleaned-up notes that may help fill in gaps in the official docs; full real wire examples are nice ;)
Connect to the WebSocket endpoint:
wss://generativelanguage.googleapis.com/ws/google.ai.generativelanguage.v1alpha.GenerativeService.BidiGenerateContent?key=YOUR_API_KEY
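A minimal way to open that connection from Python, as a sketch (assumes the third-party websockets package, installed with pip install websockets; YOUR_API_KEY is a placeholder for a real key):

import asyncio
import websockets  # assumed dependency: pip install websockets

URI = (
    "wss://generativelanguage.googleapis.com/ws/"
    "google.ai.generativelanguage.v1alpha.GenerativeService."
    "BidiGenerateContent?key=YOUR_API_KEY"
)

async def main():
    # Everything below (setup, audio, text) happens inside this block.
    async with websockets.connect(URI) as ws:
        ...

asyncio.run(main())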
The protocol follows a bidirectional communication pattern where:
- Client establishes connection and sends setup message
- Server acknowledges with setup completion
- Client can then stream audio chunks or send text messages
- Server responds with audio or text responses
All messages are JSON-encoded.
The client must send this setup message immediately after connecting:
{
  "setup": {
    "model": "models/gemini-2.0-flash-exp",
    "generationConfig": {
      "responseModalities": "audio",
      "speechConfig": {
        "voiceConfig": {
          "prebuiltVoiceConfig": {
            "voiceName": "Aoede"
          }
        }
      }
    },
    "systemInstruction": {
      "parts": [{"text": "You are my helpful assistant."}]
    }
  }
}
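Sending that over the socket is a single json.dumps; a sketch, continuing with the ws connection from above (send_setup is just an illustrative name):

import json

async def send_setup(ws):
    # Same payload as the wire example above.
    setup_msg = {
        "setup": {
            "model": "models/gemini-2.0-flash-exp",
            "generationConfig": {
                "responseModalities": "audio",
                "speechConfig": {
                    "voiceConfig": {
                        "prebuiltVoiceConfig": {"voiceName": "Aoede"}
                    }
                },
            },
            "systemInstruction": {
                "parts": [{"text": "You are my helpful assistant."}]
            },
        }
    }
    await ws.send(json.dumps(setup_msg))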
Server acknowledges with:
{
  "setupComplete": {}
}
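The client should wait for that acknowledgement before streaming anything else; a minimal check might look like this (sketch; wait_setup_complete is an illustrative name):

import json

async def wait_setup_complete(ws):
    # The first server frame should be the setup acknowledgement.
    raw = await ws.recv()
    if "setupComplete" not in json.loads(raw):
        raise RuntimeError(f"expected setupComplete, got: {raw!r}")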
- Send one audio chunk per message (not sure why; possibly an alpha limitation)
- Audio must be raw PCM, 16 kHz sample rate, 16-bit
- Audio data must be base64-encoded (see the encoding sketch after the example below)
{
  "realtimeInput": {
    "mediaChunks": [
      {
        "mimeType": "audio/pcm;rate=16000",
        "data": "<base64_encoded_audio_data>"
      }
    ]
  }
}
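Building that envelope from raw audio is mostly base64 plumbing; a sketch, assuming pcm_chunk holds headerless 16-bit / 16 kHz PCM bytes (mono is my assumption; the notes don't say):

import base64
import json

async def send_audio_chunk(ws, pcm_chunk: bytes):
    # pcm_chunk: raw 16-bit little-endian PCM samples at 16 kHz --
    # no WAV/container header, just samples (mono assumed).
    # One chunk per message, per the note above.
    await ws.send(json.dumps({
        "realtimeInput": {
            "mediaChunks": [{
                "mimeType": "audio/pcm;rate=16000",
                "data": base64.b64encode(pcm_chunk).decode("ascii"),
            }]
        }
    }))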
The server responds with PCM audio at a 24 kHz sample rate (note: higher than the 16 kHz input):
{
  "serverContent": {
    "modelTurn": {
      "parts": [
        {
          "inlineData": {
            "mimeType": "audio/pcm;rate=24000",
            "data": "<base64_encoded_audio_data>"
          }
        }
      ]
    }
  }
}
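Decoding the reply stream is the mirror image; a sketch that just reads frames until the socket closes and accumulates the 24 kHz PCM (a real client would stop on some end-of-turn signal instead):

import base64
import json

async def collect_audio(ws) -> bytes:
    # Accumulate raw 24 kHz 16-bit PCM from inlineData parts.
    pcm_out = bytearray()
    async for raw in ws:  # a websockets connection is async-iterable
        msg = json.loads(raw)
        parts = (msg.get("serverContent", {})
                    .get("modelTurn", {})
                    .get("parts", []))
        for part in parts:
            blob = part.get("inlineData")
            if blob and blob.get("mimeType", "").startswith("audio/pcm"):
                pcm_out.extend(base64.b64decode(blob["data"]))
    return bytes(pcm_out)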
Client can also send text messages:
{
  "clientContent": {
    "turns": [
      {
        "role": "user",
        "parts": [{"text": "hello"}]
      }
    ],
    "turnComplete": true
  }
}
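On the wire that is again a single JSON frame; turnComplete: true signals that the user's turn is done so the model can respond (sketch, continuing from above):

import json

async def send_text(ws, text: str):
    await ws.send(json.dumps({
        "clientContent": {
            "turns": [{"role": "user", "parts": [{"text": text}]}],
            "turnComplete": True,  # lets the model start its reply
        }
    }))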