
Case Study: Iris - An Omni-Assistant by Aubergine AI Labs

Introduction

Hello, I'm Abhishek, creator of Iris at Aubergine AI Labs. We specialize in creating natural conversational experiences through advanced AI. Iris is a voice-agent-as-a-service that can see, speak, and hear, and it integrates into any digital platform that has a microphone and a speaker: mobile apps, backend servers, web interfaces, even microcontrollers.

Why Iris?

In today’s fast-paced world, the demand for intuitive, rapid, and engaging human-computer interactions is higher than ever. Iris bridges the gap by providing a powerful and intelligent voice and vision solution. From making appointments and providing customer support to handling inquiries and interactions, Iris is a game-changer.

Key Features

  1. Turbo Latency Optimizations: Utilizes intelligent caching and optimized GPU inference to ensure real-time interactions and low-latency audio streaming.

  2. Natural Interruptions: Effectively manages pauses and stops during conversations to ensure fluid interaction.

  3. Proprietary Endpoint Model: Enhances speed and responsiveness by detecting when a speaker has finished, minimizing interruptions during natural conversational pauses.

  4. Advanced Function Calling: Empowers agents to book appointments, perform data lookups, fill forms, and more, giving them access to the outside world (a minimal sketch follows this list).

  5. On-prem Provider Deployments: Keeps latency consistent and reliable with dedicated infrastructure.

  6. Wide Multilingual Support: Communicates in over 100 languages including English, Spanish, German, Hindi, and Portuguese.
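
As an illustration of how function calling might look with the backend SDK shown later, here is a minimal sketch. The `register_function` method and its signature are assumptions for illustration only; consult the Iris SDK documentation for the actual registration API.

from Iris import VoiceAssistant, Model, VoiceModelProvider

# Hypothetical tool the agent can call; replace the body with a real
# call to your booking system.
async def book_appointment(name: str, date: str, time: str) -> str:
    return f"Booked {name} for {date} at {time}."

voice_assistant = VoiceAssistant(
    llm=Model.GPT_4o,
    voice_model_provider=VoiceModelProvider.OPENAI,
    conversation_id="demo-conversation",
    consumer=None,  # a Channels consumer in a real deployment (see below)
)

# `register_function` is an assumed name for illustration; the description
# tells the LLM when the tool is appropriate to call.
voice_assistant.register_function(
    book_appointment,
    description="Book an appointment for a customer",
)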

Use Cases

Inbound Calls:

  • Barbershop: Manage availability, bookings, and inquiries.

  • Dentist Appointments: Handle scheduling and patient FAQs.

  • Restaurant: Manage reservations and menu inquiries.

  • SaaS Websites: Provide support, product information, and troubleshooting.

  • Realtor Offices: Manage property inquiries and viewings.

  • Insurance Companies: Support for claims, policy inquiries, and general assistance.

Outbound Calls:

  • Satisfaction Surveys: Gather customer insights.

  • Qualifying Leads: Screen potential clients.

  • Debt Collection: Facilitate repayment negotiations.

  • Transportation Logistics: Provide shipment status updates.

  • Telehealth Check-ins: Monitor patient health.

  • Food Delivery: Inform customers about delivery progress.

Voice Products:

  • Sales & Conversation Roleplay: Train new employees and managers.

  • Digital Employees: Enable conversational workplace AI agents.

  • Mock Interviews & AI Therapy: Prepare for job interviews and provide supportive conversations.

  • AI Companions: Offer interactive emotional support.

Voice IoT:

  • AI Toys & Home Assistants: Provide smart playtime companions and voice-activated home controls.

  • Last-Mile Robots & Cars: Enhance delivery experiences and power in-car assistants.

  • Smart Mirrors & Elderly Care: Track health metrics and deliver medication reminders.

Implementation

Integrating Iris into your application is straightforward with our SDKs. Here's an example of a simple web implementation:

Frontend SDK Setup:

<!DOCTYPE html>
<html lang="en">

<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Iris</title>
    <link href="https://fonts.googleapis.com/icon?family=Material+Icons" rel="stylesheet">
    <link href="https://fonts.googleapis.com/css2?family=Barlow:wght@500&display=swap" rel="stylesheet">
    <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/materialize/1.0.0/css/materialize.min.css">
    <script src="https://ai.auberginesolutions.com/static/iris_agent.js"></script>
    <style>
        body,
        html {
            margin: 0;
            padding: 0;
            height: 100%;
            display: flex;
            justify-content: center;
            align-items: center;
            font-family: Arial, sans-serif;
            background-color: #212529;
        }

        .container {
            display: flex;
            flex-direction: column;
            align-items: center;
            text-align: center;
        }

        .video-visualizer-container {
            position: absolute;
            top: 196px;
            display: flex;
            gap: 20px;
            /* Added space between video elements */
            left: 50%;
            transform: translateX(-50%);
        }

        .full-conversation-container {
            position: relative;
            top: 150px;
            width: 500px;
            height: 203px;
        }

        .full-conversation-container::before {
            content: '';
            position: absolute;
            top: 0;
            left: 0;
            right: 0;
            height: 50px;
            background: linear-gradient(to bottom, rgba(33, 37, 41, 1), rgba(33, 37, 41, 0));
            pointer-events: none;
        }

        .conversation-container {
            padding: 20px;
            border-radius: 10px;
            height: 100%;
            overflow-y: auto;
        }

        .conversation-container::-webkit-scrollbar {
            display: none;
        }

        .conversation-container ul {
            list-style-type: none;
            padding: 0;
            margin: 0;
        }

        .conversation-container li {
            color: #fff;
            margin: 10px 0;
            font-family: 'Barlow', sans-serif;
            font-size: 18px;
            font-weight: 500;
        }

        #execution_time {
            transition: background-color 0.3s;
            position: absolute;
            bottom: 150px;
            left: 50%;
            transform: translateX(-50%);
            color: white;
            font-family: 'Barlow', sans-serif;
            font-size: 17px;
        }

        .mic-button,
        .camera-button,
        .screen-button {
            background-color: #33EBEB;
            border: none;
            border-radius: 50%;
            width: 60px;
            height: 60px;
            display: flex;
            justify-content: center;
            align-items: center;
            cursor: pointer;
            font-size: 24px;
            color: #000000;
            transition: background-color 0.3s;
            position: absolute;
            bottom: 50px;
        }

        .mic-button:hover,
        .camera-button:hover,
        .screen-button:hover {
            background-color: #383F46;
            color: #fff;
        }

        .mic-button:disabled,
        .camera-button:disabled,
        .screen-button:disabled {
            background-color: #595959;
            cursor: not-allowed;
        }

        .mic-button.recording {
            background-color: #4D5761;
            color: #fff;
        }

        .mic-button.recording:hover {
            background-color: #E04848;
        }

        .screen-button {
            left: 68%;
        }

        .mic-button {
            left: 48%;
        }

        .camera-button {
            left: 28%;
        }

        @keyframes pulse {
            0% {
                transform: scale(1);
            }

            50% {
                transform: scale(1.1);
            }

            100% {
                transform: scale(1);
            }
        }

        video {
            width: 300px;
            height: 200px;
            background: black;
        }
    </style>
    <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.0.0-beta3/css/all.min.css">
</head>

<body>

    <div class="container">
        <!-- Added another video element for screen sharing -->
        <div class="video-visualizer-container">
            <video id="videoVisualizer" autoplay muted></video>
            <video id="screenVisualizer" autoplay muted></video>
        </div>
        <div class="full-conversation-container">
            <div class="conversation-container">
                <ul id="conversationList">
                </ul>
            </div>
        </div>
        <div id="execution_time" style="color: #fff;" hidden>0 ms</div>
        <button id="micButton" class="mic-button">
            <i class="fas fa-microphone"></i>
        </button>
        <button id="cameraButton" class="camera-button" disabled>
            <i class="fas fa-camera"></i>
        </button>
        <button id="screenButton" class="screen-button" disabled>
            <i class="fas fa-desktop"></i>
        </button>
    </div>

    <audio id="audioPlayer" controls hidden></audio>

    <script src="https://cdnjs.cloudflare.com/ajax/libs/materialize/1.0.0/js/materialize.min.js"></script>
    
    <script>
        document.addEventListener('DOMContentLoaded', () => {
            const apiKey = '<api_key>'; // Replace with your Iris API key
            const websocketDomain = 'wss://ai.auberginesolutions.com/ws';
            const irisAgent = new IrisAgent(apiKey, websocketDomain);

            irisAgent.setupMicButton(document.getElementById('micButton'));
            irisAgent.setupCameraButton(document.getElementById('cameraButton'));
            irisAgent.setupScreenButton(document.getElementById('screenButton'));
            irisAgent.setupConversationList(document.getElementById('conversationList'));
            irisAgent.setupAudioPlayer(document.getElementById('audioPlayer'));
            irisAgent.setupVideoVisualizer(document.getElementById('videoVisualizer'));
            irisAgent.setupScreenVisualizer(document.getElementById('screenVisualizer'));
            irisAgent.setupExecutionTimeElement(document.getElementById('execution_time'));
        });
    </script>
</body>
</html>

Backend SDK:

Code to implement a voice-only assistant in your Django project (consumers.py):

import os
import django
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'iris.settings')
django.setup()

from channels.generic.websocket import AsyncWebsocketConsumer
from Iris import VoiceAssistant, Model, VoiceModelProvider, OpenAIVoice
from voice.models import Messages, Personas
from rest_framework.authtoken.models import Token
from datetime import datetime
import json
from constance import config

class IrisVoiceAgentConsumer(AsyncWebsocketConsumer):

    async def connect(self):

        token = self.scope['url_route']['kwargs']['token']
        try:
            # select_related avoids a lazy (synchronous) user lookup later
            token_obj = await Token.objects.select_related('user').aget(key=token)
        except Token.DoesNotExist:
            await self.close()
            return  # stop handling a rejected connection

        personas_object = await Personas.objects.aget(is_active=True)
        self.first_name = token_obj.user.first_name
        self.messages_object, _ = await Messages.objects.aget_or_create(user=token_obj.user)
        current_date = datetime.now().date().strftime("%B %d, %Y")
        self.messages_object.messages = [
            {
                "role": "system",
                "content": personas_object.prompt_template + f" Today's date: {current_date}"
            }
        ]
        await self.messages_object.asave()
        await self.accept()

        # send message history
        await self.send(text_data=json.dumps({"messages": self.messages_object.messages[1:]}))

        
        self.voice_assistant = VoiceAssistant(
            llm=Model.GPT_4o,
            voice_model_provider=VoiceModelProvider.OPENAI,
            conversation_id=self.messages_object.id,
            consumer=self,
        )
        
        await self.voice_assistant.start()
        self.first_byte = True


    async def disconnect(self, close_code):
        # voice_assistant may not exist if connect() rejected the socket
        if getattr(self, 'voice_assistant', None):
            await self.voice_assistant.stop()
        
    async def receive(self, text_data: str = None, bytes_data: bytes = None):
        if self.first_byte:
            self.initial_message = f"Hey {self.first_name}! I'm Iris, developed by Aubergine AI Labs, and you can talk to me like a person!"
            await self.voice_assistant.say(self.initial_message,
                                           voice_speed=1,
                                           voice=config.ACTIVE_VOICE)
            self.first_byte = False
        await self.voice_assistant.listen(bytes_data)

routing.py

from django.urls import path
from voice.consumers import IrisVoiceAgentConsumer

websocket_urlpatterns = [
    path('ws/iris-voice-agent/<str:token>/', IrisVoiceAgentConsumer.as_asgi()),
]
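
For completeness, the websocket_urlpatterns above plug into a standard Django Channels ASGI application. A minimal asgi.py sketch, assuming the project's settings module is iris.settings as in the consumers above:

import os
from django.core.asgi import get_asgi_application

os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'iris.settings')

# Initialize Django before importing anything that touches models
django_asgi_app = get_asgi_application()

from channels.routing import ProtocolTypeRouter, URLRouter
from voice.routing import websocket_urlpatterns

application = ProtocolTypeRouter({
    # Regular HTTP requests go to the standard Django ASGI app
    "http": django_asgi_app,
    # WebSocket connections are routed to the Iris consumers
    "websocket": URLRouter(websocket_urlpatterns),
})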

Code to implement the omni assistant (voice + vision) in your Django project (consumers.py):

import os
import django
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'iris.settings')
django.setup()

from channels.generic.websocket import AsyncWebsocketConsumer
from Iris import VoiceAssistant, Model, VoiceModelProvider, OpenAIVoice
from voice.models import Messages, Personas
from rest_framework.authtoken.models import Token
from datetime import datetime
import json
from constance import config

class IrisOmniAgentConsumer(AsyncWebsocketConsumer):

    async def connect(self):

        token = self.scope['url_route']['kwargs']['token']
        try:
            # select_related avoids a lazy (synchronous) user lookup later
            token_obj = await Token.objects.select_related('user').aget(key=token)
        except Token.DoesNotExist:
            await self.close()
            return  # stop handling a rejected connection

        personas_object = await Personas.objects.aget(is_active=True)
        self.first_name = token_obj.user.first_name
        self.messages_object, _ = await Messages.objects.aget_or_create(user=token_obj.user)
        current_date = datetime.now().date().strftime("%B %d, %Y")
        self.messages_object.messages = [
            {
                "role": "system",
                "content": personas_object.prompt_template + f" Today's date: {current_date}"
            }
        ]
        await self.messages_object.asave()
        await self.accept()

        # send message history
        await self.send(text_data=json.dumps({"messages": self.messages_object.messages[1:]}))

        
        self.voice_assistant = VoiceAssistant(
            llm=Model.GPT_4o,
            voice_model_provider=VoiceModelProvider.OPENAI,
            conversation_id=self.messages_object.id,
            consumer=self,
        )
        
        await self.voice_assistant.start()
        self.first_byte = True


    async def disconnect(self, close_code):
        # voice_assistant may not exist if connect() rejected the socket
        if getattr(self, 'voice_assistant', None):
            await self.voice_assistant.stop()
        
    async def receive(self, text_data: str = None, bytes_data: bytes = None):
        
        if text_data:
            data = json.loads(text_data)
            content = data.get("content", None)
            if data["data_type"] == "vision":
                await self.voice_assistant.update_vision_stream(data["media_source"], content)
            elif data["data_type"] == "text":
                await self.voice_assistant.prepare_answer(transcript=content)
            elif data["data_type"] in ["camera", "screen"]:
                await self.voice_assistant.update_device_state(device_type=data["data_type"], state=content)
            
        else:
            if self.first_byte:
                self.initial_message = f"Hey {self.first_name}! I'm Iris, developed by Aubergine AI Labs. How can I help you today?"
                await self.voice_assistant.say(self.initial_message, 
                                               voice_speed=1,
                                               voice=config.ACTIVE_VOICE)
                
                # await self.voice_assistant.say("And on the left, you can share your camera so that I can see the outside world.", voice_speed=0.9,voice=OpenAIVoice.ALLOY)
                self.first_byte = False
            await self.voice_assistant.listen(bytes_data)

routing.py

from django.urls import path
from voice.consumers import IrisOmniAgentConsumer

websocket_urlpatterns = [
    path('ws/iris-omni-agent/<str:token>/', IrisOmniAgentConsumer.as_asgi()),
]
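
To make the omni agent's message protocol concrete, here is a minimal client sketch. It assumes the third-party websockets library; the endpoint URL and token are placeholders, while the JSON fields (data_type, media_source, content) mirror exactly what IrisOmniAgentConsumer.receive() parses above.

import asyncio
import json
import websockets

async def main():
    uri = "wss://example.com/ws/iris-omni-agent/<token>/"
    async with websockets.connect(uri) as ws:
        # Tell the agent the camera is now on
        await ws.send(json.dumps({"data_type": "camera", "content": "on"}))

        # Push an encoded camera frame into the vision stream
        await ws.send(json.dumps({
            "data_type": "vision",
            "media_source": "camera",
            "content": "<base64-encoded frame>",
        }))

        # Ask a question as plain text instead of audio
        await ws.send(json.dumps({"data_type": "text",
                                  "content": "What do you see on my screen?"}))

        # Raw microphone audio chunks are sent as binary frames
        await ws.send(b"\x00\x01")  # placeholder audio bytes

asyncio.run(main())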

Technical Architecture for the Voice-Only Agent

Here’s a graphical representation of the architecture to help visualize the flow:

sequenceDiagram
    User ->> WebApp: Click to initiate recording
    WebApp ->> IrisAgent: Start recording
    IrisAgent ->> Backend: Stream audio data (WS)
    Backend ->> VoiceAssistant: Forward audio data
    VoiceAssistant ->> STT: Process and transcribe (STT)
    Note over STT, VoiceAssistant: Cancel ongoing tasks<br/>Send transcript to client<br/>Prepare answer in background
    STT -->> STT: Cancel ongoing Transcript Task & PrepareAnswer Task
    STT -->> STT: Create new Transcript Task
    STT -->> STT: Create new PrepareAnswer Task
    STT -->> Backend: Stream transcription (WS)
    Backend ->> IrisAgent: Stream transcription (WS)
    STT ->> PrepareAnswer: Stream user's query
    PrepareAnswer ->> TTS: Prepare Iris's answer as a stream for text-to-speech input
    TTS -->> LLM: Pass the query stream
    LLM ->> TTS: Response stream from the model
    TTS ->> Chunking Algorithm: Stream the model's response for segmentation
    Chunking Algorithm ->> TTS: Stream chunks of meaningful words and sentences
    TTS -->> TTS: Process text-to-speech
    TTS -->> Backend: Audio stream
    Backend ->> IrisAgent: Stream audio (WS)
    IrisAgent ->> WebApp: Play audio response
    Note over User, WebApp: User can interrupt<br/>the conversation
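
The chunking step above is worth a closer look: because the LLM emits tokens one at a time, the stream must be buffered into natural-sounding segments before it reaches text-to-speech. Below is a minimal sketch of one way to do this, using sentence-boundary buffering; the splitting rule is an illustrative assumption, not the production chunking algorithm.

import re

# Illustrative sketch only: buffer an LLM token stream into sentence-sized
# chunks for TTS. Shows the general buffering idea, not Iris's actual
# proprietary implementation.
SENTENCE_END = re.compile(r'([.!?])\s')

async def chunk_for_tts(token_stream):
    """Yield speakable chunks as soon as a sentence boundary appears."""
    buffer = ""
    async for token in token_stream:
        buffer += token
        # Emit every completed sentence, keep the remainder buffered
        while (match := SENTENCE_END.search(buffer)):
            end = match.end()
            yield buffer[:end].strip()
            buffer = buffer[end:]
    if buffer.strip():
        yield buffer.strip()  # flush whatever is left at stream end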

Innovative Approach

Out-of-the-box thinking:

  • Interrupt Handling: Advanced algorithms for managing conversational pauses and resuming accurately (a simplified sketch follows this list).

  • Concurrency: Optimal multi-threading ensures smooth handling of video and audio streams.

  • Scalability: Designed to scale horizontally, maintaining performance across diverse workloads.
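
As a simplified illustration of the interrupt-handling idea (barge-in), incoming user speech can cancel any in-flight answer task before a new one starts. This sketch uses plain asyncio task cancellation with a hypothetical prepare_and_speak coroutine; it is not the production algorithm.

import asyncio

# Simplified barge-in sketch: when the user starts speaking again, any
# in-flight answer/playback task is cancelled before starting a new one.
class InterruptibleAgent:
    def __init__(self):
        self._current_task: asyncio.Task | None = None

    async def prepare_and_speak(self, transcript: str):
        await asyncio.sleep(2)          # placeholder for LLM + TTS work
        print(f"Spoke answer to: {transcript!r}")

    async def on_user_speech(self, transcript: str):
        # The user interrupted: cancel whatever Iris was about to say
        if self._current_task and not self._current_task.done():
            self._current_task.cancel()
            try:
                await self._current_task
            except asyncio.CancelledError:
                pass  # cancellation is the expected outcome here
        self._current_task = asyncio.create_task(
            self.prepare_and_speak(transcript))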

Future Implementations

  • Healthcare: Personalized care and real-time health monitoring.

  • Automobile Assistants: Advanced driver assistance systems (ADAS).

  • Smart Cities: Intelligent kiosks and public assistance devices.

Demos

For a demo of the voice assistant, visit Iris Voice Assistant Demo. For a demo of the omni assistant integrating vision with voice, visit Iris Omni Assistant Demo.

Conclusion

Iris by Aubergine AI Labs is more than just a technological innovation; it's a vision for the future, making human-computer interactions more intuitive, seamless, and human-like. With its robust capabilities, easy integration, and powerful features, Iris is set to revolutionize everyday digital interactions.

Ready to bring Iris into your world? Explore what Iris can do for you today.
