@ruvnet
Created November 16, 2024 15:07

Introduction:

This project integrates OpenAI's GPT-4o large language model with Power Automate Desktop to create an advanced AI-powered automation system. It uses real-time streaming via WebSockets to enable the AI to observe and interact with your desktop, allowing for dynamic and intelligent automation of tasks.

Key Features:

  • Real-time desktop streaming to GPT-4o via WebSockets
  • AI-powered analysis and decision making for desktop automation
  • Easy setup with guided installation script
  • Customizable automation actions and workflows
  • Seamless integration with Power Automate Desktop

Usage Examples:

  1. Automated data entry: The AI observes forms and enters data intelligently
  2. Smart email management: Categorize, respond to, and file emails automatically
  3. Document analysis and summarization: Extract key information from documents on screen
  4. Workflow optimization: Suggest and implement more efficient ways to perform tasks
  5. Intelligent troubleshooting: Diagnose and resolve software issues by observing error messages

Main PowerShell Script (save as gpt4o_desktop_automation.ps1):

# GPT-4o Desktop Automation with Power Automate Desktop

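# Import required modules
Import-Module PowerShellAI
# Module name carried over from the original script; verify that a module
# exposing Invoke-PADFlow is actually available in your environment.
Import-Module Microsoft.PowerAutomate.Desktop

# Assemblies for screen capture and input simulation
Add-Type -AssemblyName System.Windows.Forms
Add-Type -AssemblyName System.Drawing

# WebSocket client for real-time streaming (ignored if the runtime already provides the type)
Add-Type -AssemblyName System.Net.WebSockets.Client -ErrorAction SilentlyContinue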

# Function to capture the primary screen and stream it as JPEG frames
function Stream-Desktop {
    param (
        [string]$WebSocketUrl
    )

    # Loaded here as well because Start-Job runs this function in a separate process
    Add-Type -AssemblyName System.Windows.Forms
    Add-Type -AssemblyName System.Drawing

    $client = New-Object System.Net.WebSockets.ClientWebSocket
    $cts = New-Object System.Threading.CancellationTokenSource

    try {
        $client.ConnectAsync([Uri]$WebSocketUrl, $cts.Token).Wait()
        $bounds = [System.Windows.Forms.Screen]::PrimaryScreen.Bounds

        while ($client.State -eq [System.Net.WebSockets.WebSocketState]::Open) {
            # Capture the primary screen into a bitmap
            $screenshot = New-Object System.Drawing.Bitmap($bounds.Width, $bounds.Height)
            $graphics = [System.Drawing.Graphics]::FromImage($screenshot)
            $graphics.CopyFromScreen($bounds.Location, [System.Drawing.Point]::Empty, $bounds.Size)
            $graphics.Dispose()

            # Encode the frame as JPEG
            $ms = New-Object System.IO.MemoryStream
            $screenshot.Save($ms, [System.Drawing.Imaging.ImageFormat]::Jpeg)
            $imageBytes = $ms.ToArray()
            $ms.Dispose()
            $screenshot.Dispose()

            # Send the frame as a single binary message
            $segment = New-Object ArraySegment[byte] -ArgumentList @(,$imageBytes)
            $client.SendAsync($segment, [System.Net.WebSockets.WebSocketMessageType]::Binary, $true, $cts.Token).Wait()

            # Roughly 10 frames per second
            Start-Sleep -Milliseconds 100
        }
    }
    finally {
        if ($client.State -eq [System.Net.WebSockets.WebSocketState]::Open) {
            $client.CloseAsync([System.Net.WebSockets.WebSocketCloseStatus]::NormalClosure, "", $cts.Token).Wait()
        }
        $client.Dispose()
        $cts.Dispose()
    }
}

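# Native mouse input: SendKeys only sends keystrokes, so clicks go through user32.dll
Add-Type -Namespace Native -Name Mouse -MemberDefinition @'
[DllImport("user32.dll")]
public static extern void mouse_event(uint dwFlags, uint dx, uint dy, uint dwData, UIntPtr dwExtraInfo);
'@

# Function to process GPT-4o responses and execute actions
function Process-GPT4oResponse {
    param (
        [string]$Response
    )

    try {
        $actions = $Response | ConvertFrom-Json -ErrorAction Stop
    }
    catch {
        Write-Warning "Response was not valid JSON; skipping it. $_"
        return
    }

    foreach ($action in $actions) {
        switch ($action.type) {
            "click" {
                [System.Windows.Forms.Cursor]::Position = New-Object System.Drawing.Point($action.x, $action.y)
                [Native.Mouse]::mouse_event(0x0002, 0, 0, 0, [UIntPtr]::Zero)  # left button down
                [Native.Mouse]::mouse_event(0x0004, 0, 0, 0, [UIntPtr]::Zero)  # left button up
            }
            "type" {
                # Escape SendKeys control characters so the text is typed literally
                $literal = $action.text -replace '([+^%~(){}\[\]])', '{$1}'
                [System.Windows.Forms.SendKeys]::SendWait($literal)
            }
            "keypress" {
                # Expects SendKeys notation, for example "{ENTER}" or "^s"
                [System.Windows.Forms.SendKeys]::SendWait($action.key)
            }
            "openapp" {
                Start-Process $action.app
            }
            "runflow" {
                # Invoke-PADFlow is taken from the original script; confirm it exists before relying on it
                Invoke-PADFlow -FlowName $action.flow
            }
            default {
                Write-Warning "Unknown action type: $($action.type)"
            }
        }
    }
}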

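# Main execution loop
function Start-GPT4oAutomation {
    param (
        [string]$ApiKey,
        [string]$WebSocketUrl
    )

    # Cmdlet names follow the PowerShellAI module as used in the original script;
    # check them (and the expected type of -Key) against your installed version.
    Set-OpenAIKey -Key $ApiKey

    $streamingJob = Start-Job -ScriptBlock ${function:Stream-Desktop} -ArgumentList $WebSocketUrl

    try {
        while ($true) {
            # This request is text-only: as written, the frames sent by Stream-Desktop are not attached to it
            $gpt4oResponse = Get-GPT3Completion "Analyze the current desktop state and suggest actions to automate the visible task. Respond with a JSON array of actions."
            Process-GPT4oResponse -Response $gpt4oResponse

            Start-Sleep -Seconds 1
        }
    }
    finally {
        # Runs on Ctrl+C or on error, so the background job is always cleaned up
        $streamingJob | Stop-Job
        $streamingJob | Remove-Job
    }
}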

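# Start the automation
$apiKey = $env:OPENAI_API_KEY
if (-not $apiKey) {
    throw "OPENAI_API_KEY is not set. Run the installation script first."
}
# URL kept from the original script. It is not a documented OpenAI WebSocket endpoint;
# treat it as a placeholder for a service that accepts streamed image frames.
$webSocketUrl = "wss://api.openai.com/v1/engines/gpt-4o/completions"
Start-GPT4oAutomation -ApiKey $apiKey -WebSocketUrl $webSocketUrl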
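
Process-GPT4oResponse assumes the reply is bare JSON. Chat models sometimes wrap JSON in a Markdown code fence or add a sentence around it, so it may help to extract the payload first. The following is a minimal sketch, not part of the original script; the helper name Get-JsonPayload is hypothetical.

```powershell
# Hypothetical helper (not part of the original script): returns the JSON text
# inside a Markdown code fence, or the trimmed input when no fence is present.
function Get-JsonPayload {
    param([string]$Text)

    # Build the fence marker at run time so this listing is easy to embed.
    $fence = '`' * 3
    # (?s) lets "." span newlines; an optional "json" tag after the opening fence is skipped.
    $pattern = "(?s)$fence(?:json)?\s*(.*?)\s*$fence"

    $match = [regex]::Match($Text, $pattern)
    if ($match.Success) {
        return $match.Groups[1].Value
    }
    return $Text.Trim()
}

# Usage: a reply that wraps the action list in a fence
$reply = "Here are the actions:`n" + ('`' * 3) + "json`n" + '[{"type":"keypress","key":"{ENTER}"}]' + "`n" + ('`' * 3)
$actions = Get-JsonPayload -Text $reply | ConvertFrom-Json
$actions[0].type
```

In the main loop, the result of Get-JsonPayload could be passed to Process-GPT4oResponse in place of the raw reply.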

Installation Script (save as install_gpt4o_automation.ps1):

# Installation script for GPT-4o Desktop Automation

# Check for administrator privileges
if (-NOT ([Security.Principal.WindowsPrincipal][Security.Principal.WindowsIdentity]::GetCurrent()).IsInRole([Security.Principal.WindowsBuiltInRole] "Administrator")) {
    Write-Warning "Please run this script as an Administrator!"
    Exit
}

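# Install required modules
Install-Module -Name PowerShellAI -Force
# Module name carried over from the original script; confirm it is published before relying on this step.
Install-Module -Name Microsoft.PowerAutomate.Desktop -Force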

# Download main script
$mainScriptUrl = "https://raw.githubusercontent.com/yourusername/gpt4o-desktop-automation/main/gpt4o_desktop_automation.ps1"
$mainScriptPath = "$env:USERPROFILE\Documents\gpt4o_desktop_automation.ps1"
Invoke-WebRequest -Uri $mainScriptUrl -OutFile $mainScriptPath

# Prompt for OpenAI API Key
$apiKey = Read-Host "Enter your OpenAI API Key" -AsSecureString
$BSTR = [System.Runtime.InteropServices.Marshal]::SecureStringToBSTR($apiKey)
try {
    $apiKeyPlain = [System.Runtime.InteropServices.Marshal]::PtrToStringBSTR($BSTR)
}
finally {
    # Zero and free the unmanaged copy of the key
    [System.Runtime.InteropServices.Marshal]::ZeroFreeBSTR($BSTR)
}

# Set environment variable for API Key (stored as plain text in the user profile)
[System.Environment]::SetEnvironmentVariable("OPENAI_API_KEY", $apiKeyPlain, "User")

# Create shortcut on desktop
$WshShell = New-Object -ComObject WScript.Shell
$Shortcut = $WshShell.CreateShortcut("$env:USERPROFILE\Desktop\GPT-4o Desktop Automation.lnk")
$Shortcut.TargetPath = "powershell.exe"
$Shortcut.Arguments = "-ExecutionPolicy Bypass -File `"$mainScriptPath`""
$Shortcut.Save()

Write-Host "Installation complete! You can now run the GPT-4o Desktop Automation from the shortcut on your desktop."

To set up and use the GPT-4o Desktop Automation:

  1. Save both scripts to your computer.
  2. Run PowerShell as Administrator.
  3. Navigate to the directory containing the installation script.
  4. Run the installation script:
    .\install_gpt4o_automation.ps1
    
  5. Follow the prompts to enter your OpenAI API key.
  6. Once installation is complete, you can launch the automation by double-clicking the desktop shortcut or running the main script directly.
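
Before launching, it can be useful to confirm that the installer left the expected pieces in place. The following is a minimal sketch under the assumption that the installer's default paths were used; the function name Get-SetupProblem is hypothetical and not part of the original scripts.

```powershell
# Hypothetical pre-flight check: returns a list of problems, empty when setup looks complete.
function Get-SetupProblem {
    param(
        [string]$ApiKey,
        [string]$ScriptPath
    )

    $problems = @()
    if ([string]::IsNullOrWhiteSpace($ApiKey)) {
        $problems += 'OPENAI_API_KEY is not set for the current user.'
    }
    if ([string]::IsNullOrWhiteSpace($ScriptPath) -or -not (Test-Path -LiteralPath $ScriptPath)) {
        $problems += "Main script not found at '$ScriptPath'."
    }
    return $problems
}

# Usage: read the key from the user-level environment, because a variable set by
# the installer is not visible in PowerShell sessions that were already open.
$storedKey = [Environment]::GetEnvironmentVariable('OPENAI_API_KEY', 'User')
$scriptPath = "$env:USERPROFILE\Documents\gpt4o_desktop_automation.ps1"
$problems = Get-SetupProblem -ApiKey $storedKey -ScriptPath $scriptPath
if ($problems) {
    $problems | ForEach-Object { Write-Warning $_ }
}
else {
    Write-Host 'Setup looks complete.'
}
```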

The system then observes your desktop in real time, analyzes the content using GPT-4o, and performs actions to assist with your tasks. You can customize the automation behavior by modifying the main script or by creating Power Automate Desktop flows that the AI can trigger through the "runflow" action.
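
When customizing, it helps to know the shape of the reply the main script reads: a JSON array whose elements carry a type field plus the fields that action needs (x and y, text, key, app, or flow). The sketch below shows one such payload with illustrative values and a hypothetical Select-KnownAction helper that drops unrecognized action types before anything is executed. Neither the sample nor the helper comes from the original script.

```powershell
# Example reply in the shape Process-GPT4oResponse reads (values are illustrative).
$sample = @'
[
  { "type": "openapp", "app": "notepad.exe" },
  { "type": "type", "text": "Hello from the automation" },
  { "type": "click", "x": 200, "y": 150 },
  { "type": "shutdown" }
]
'@

# Hypothetical helper: keep only the action types the main script handles.
function Select-KnownAction {
    param([string]$Json)

    $known = 'click', 'type', 'keypress', 'openapp', 'runflow'
    $parsed = $Json | ConvertFrom-Json

    # foreach works whether ConvertFrom-Json returned one object or an array.
    $accepted = foreach ($action in $parsed) {
        if ($known -contains $action.type) { $action }
    }
    return @($accepted)
}

# Usage: the unsupported "shutdown" entry is filtered out.
$accepted = @(Select-KnownAction -Json $sample)
$accepted | ForEach-Object { $_.type }
```

Filtering against an allow-list like this keeps a new action type from running until a matching case has been added to the switch in Process-GPT4oResponse.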
