Kwindla Hultman Kramer kwindla

Advice on Voice AI, June 2025

My top three pieces of advice for people getting started with voice agents.

1. Spend time up front understanding why latency and instruction-following accuracy drive voice AI tech choices.
2. You will need to add significant tooling complexity as you go from proof of concept to production. Prepare for that. Especially important: build lightweight evals as early as you can.
3. The right path is: start with a proven, "best practices" tech stack -> get everything working one piece at a time -> deploy to real-world users and collect data -> then think about optimizing cost, latency, and so on.
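As a rough illustration of what a "lightweight eval" could look like, here is a minimal sketch that runs scripted user turns through an agent function and records both a pass/fail check (instruction following) and wall-clock latency. Everything in it is an assumption made for illustration: `EvalCase`, `run_evals`, `summarize`, and `stub_agent` are hypothetical names, and the stub stands in for a real LLM or voice pipeline call.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    """One scripted user turn plus a check applied to the agent's reply."""
    user_text: str
    check: Callable[[str], bool]


@dataclass
class EvalResult:
    user_text: str
    passed: bool
    latency_ms: float


def run_evals(agent: Callable[[str], str], cases: List[EvalCase]) -> List[EvalResult]:
    """Call the agent once per case, timing each call and applying its check."""
    results = []
    for case in cases:
        start = time.perf_counter()
        reply = agent(case.user_text)
        latency_ms = (time.perf_counter() - start) * 1000.0
        results.append(EvalResult(case.user_text, case.check(reply), latency_ms))
    return results


def summarize(results: List[EvalResult]) -> Dict[str, float]:
    """Reduce results to a pass rate and the worst observed latency."""
    if not results:
        return {"pass_rate": 0.0, "max_latency_ms": 0.0}
    passed = sum(1 for r in results if r.passed)
    return {
        "pass_rate": passed / len(results),
        "max_latency_ms": max(r.latency_ms for r in results),
    }


def stub_agent(user_text: str) -> str:
    """Hypothetical stand-in for a real agent; replace with your own call."""
    if "hours" in user_text.lower():
        return "We are open from 9 am to 5 pm."
    return "Sorry, I can only help with opening hours."


# Hypothetical cases: one on-task question, one off-task request.
CASES = [
    EvalCase("What are your hours?", lambda reply: "9 am" in reply),
    EvalCase("Tell me a joke.", lambda reply: "only help" in reply),
]

if __name__ == "__main__":
    print(summarize(run_evals(stub_agent, CASES)))
```

Swapping `stub_agent` for a function that calls your real model keeps the harness unchanged, so the same cases can be re-run whenever a prompt, model, or pipeline component changes.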

Task

I've been experimenting as much as I can with Gemini video understanding. Multiple videos, multi-step prompts, etc.

Prompt

Analyze this YouTube video.

https://www.youtube.com/watch?v=PgyJs0jfp_o

Funny GPT-4o inference result.

Audible in this video: https://youtu.be/PgyJs0jfp_o?si=43CJgmk954kulmgl&t=863

Output

It sounds like we're on an intriguing mission! I'm going to scan through the grand chandeliers and ornate carpets of the hotel for traces. Be right back with the results!

	from google import genai

	import os

	client = genai.Client(api_key=os.getenv("GOOGLE_API_KEY"))

	# filename_for_upload = "/Users/khkramer/Downloads/maven-lightning-trimmed.mp4"
	# myfile = client.files.upload(file=filename_for_upload)
	#
	# print("My files:")

	# double transcription events
	# pip install 'pipecat-ai[daily,silero,openai,cartesia]'==0.0.59 dotenv
	#
	# transcription events as expected
	# pip install 'pipecat-ai[daily,silero,openai,cartesia]'==0.0.58 dotenv

	import asyncio
	import sys
	import os

	#
	# Copyright (c) 2024–2025, Daily
	#
	# SPDX-License-Identifier: BSD 2-Clause License
	#

	import argparse
	import asyncio
	import os


	import asyncio

	from openai import AsyncOpenAI
	from openai.helpers import LocalAudioPlayer

	import wave
	import numpy as np

	openai = AsyncOpenAI()

	2025-04-22 12:27:57.538 | INFO | 07-interruptible:run_bot:28 - Starting bot
	2025-04-22 12:27:57.538 | DEBUG | pipecat.audio.vad.silero:__init__:111 - Loading Silero VAD model...
	2025-04-22 12:27:57.564 | DEBUG | pipecat.audio.vad.silero:__init__:133 - Loaded Silero VAD
	2025-04-22 12:27:57.583 | DEBUG | pipecat.processors.frame_processor:link:177 - Linking PipelineSource#0 -> SmallWebRTCInputTransport#0
	2025-04-22 12:27:57.583 | DEBUG | pipecat.processors.frame_processor:link:177 - Linking SmallWebRTCInputTransport#0 -> DeepgramSTTService#0
	2025-04-22 12:27:57.583 | DEBUG | pipecat.processors.frame_processor:link:177 - Linking DeepgramSTTService#0 -> OpenAIUserContextAggregator#0
	2025-04-22 12:27:57.583 | DEBUG | pipecat.processors.frame_processor:link:177 - Linking OpenAIUserContextAggregator#0 -> OpenAILLMService#0
	2025-04-22 12:27:57.583 | DEBUG | pipecat.processors.frame_processor:link:177 - Linking OpenAILLMService#0 -> CartesiaTTSService#0
	2025-04-22 12:27:57.583 | DEBUG | pip