@afawcett
Last active November 4, 2025 09:05
Cursor Command to help Cursor Read Salesforce Docs

Background: I find Cursor struggles to retrieve content from Salesforce documentation links due to the various techniques those pages use, such as shadow DOM and dynamic loading. This Cursor command leverages Cursor's Browser Automation tool (which needs to be explicitly enabled) to give it a few more clues on how to deal with these pages. I initially built a shell script for it to call, but then I discovered Cursor commands. So I built this command using Cursor itself once we had both figured this out - effectively it wrote its own instructions. Use at your own risk.

Usage: In Cursor, type /, select Create Command, enter sfdoc, and press Enter. Then paste the content below into the file. To use it, write prompts like Read this https://developer.salesforce.com/docs... using /sfdoc. Make sure you have the chromium npm library installed globally, as the Node.js scripts below require it. Also make sure you have given Cursor permission to use your browser - there is an icon at the bottom right of the chat window to enable this in the latest version - it may also need enabling in the Cursor MCP settings.


Extract Salesforce documentation content from [URL] using browser automation. The Salesforce docs use heavy JavaScript rendering with custom web components and shadow DOM. Follow this comprehensive approach, based on the successful sfdoc.sh implementation:

Initial Setup:

  1. Launch browser in headless mode to avoid popup windows
  2. Navigate to the Salesforce documentation URL
  3. Wait for network idle state: await page.waitForLoadState('networkidle')
  4. Accept cookies if prompted (click "Accept All Cookies" button)
  5. Wait for the page to fully load (5+ seconds for dynamic content)
  6. CRITICAL: Operate in SILENT MODE - do NOT provide step-by-step feedback, status updates, or intermediate messages. Only show the final extracted documentation content.

Content Extraction Strategy (4-Tier Approach):

  1. Strategy 1 - Shadow DOM Access (Primary):

    const docXmlContent = document.querySelector('doc-xml-content');
    if (docXmlContent && docXmlContent.shadowRoot) {
        const shadowRoot = docXmlContent.shadowRoot;
        const docContent = shadowRoot.querySelector('doc-content');
        if (docContent && docContent.shadowRoot) {
            const deeperShadowRoot = docContent.shadowRoot;
            const mainContent = deeperShadowRoot.querySelector('.main-container') || 
                              deeperShadowRoot.querySelector('main') ||
                              deeperShadowRoot.querySelector('[class*="main"]') ||
                              deeperShadowRoot;
            return mainContent.innerHTML;
        }
    }
  2. Strategy 2 - Direct Content Selectors:

    const contentSelectors = [
        'main', '[role="main"]', '.content', '.article-content', 
        '.help-content', 'article', '.body', '[class*="content"]'
    ];
    
    for (const selector of contentSelectors) {
        const element = document.querySelector(selector);
        if (element && element.textContent && element.textContent.length > 500) {
            // Skip cookie banners and other non-content elements
            const text = element.textContent.toLowerCase();
            if (text.includes('cookie') && text.includes('privacy')) continue;
            return element.innerHTML;
        }
    }
  3. Strategy 3 - Fallback Element Search:

    const allElements = document.querySelectorAll('*');
    for (const el of allElements) {
        if (el.textContent && el.textContent.length > 1000 && 
            !el.querySelector('script') && !el.querySelector('style') &&
            el.tagName !== 'SCRIPT' && el.tagName !== 'STYLE') {
            
            const text = el.textContent.toLowerCase();
            if (text.includes('cookie') && text.includes('privacy')) continue;
            
            // Look for Salesforce-specific content patterns
            if (text.includes('external client app') || 
                text.includes('metadata api') ||
                text.includes('scratch org') ||
                text.includes('salesforce help') ||
                text.includes('tooling api') ||
                text.includes('contact center')) {
                return el.innerHTML;
            }
        }
    }
  4. Strategy 4 - Body Content (Last Resort):

    const body = document.body;
    if (body && body.textContent && body.textContent.length > 500) {
        return body.innerHTML;
    }
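
Strategies 2 and 3 above share the same two filters: require substantial text and skip cookie/privacy banners. As a standalone sketch (the helper name looksLikeContent is illustrative, not part of the gist):

```javascript
// Hypothetical helper combining the filters used in Strategies 2 and 3:
// require substantial text and skip cookie/privacy banners.
function looksLikeContent(text, minLength = 500) {
    if (!text || text.length <= minLength) return false;
    const lower = text.toLowerCase();
    // Skip cookie banners and other non-content elements
    if (lower.includes('cookie') && lower.includes('privacy')) return false;
    return true;
}
```

Pass minLength = 1000 to get the stricter threshold used by Strategy 3.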

Implementation Pattern:

// Inside an async function. Launch browser in headless mode.
const browser = await chromium.launch({ headless: true });
try {
    const page = await browser.newPage();

    // Navigate and wait for network idle
    await page.goto(url);
    await page.waitForLoadState('networkidle');

    // Accept cookies if present
    try {
        await page.click('button:has-text("Accept All Cookies")');
    } catch (e) {
        // No cookie banner present
    }

    // Wait for content to load
    await page.waitForTimeout(5000);

    // The strategies read the DOM, so run them in the page context
    const result = await page.evaluate(() => {
        const strategies = [
            () => { /* Strategy 1 - Shadow DOM */ },
            () => { /* Strategy 2 - Direct Selectors */ },
            () => { /* Strategy 3 - Fallback Search */ },
            () => { /* Strategy 4 - Body Content */ }
        ];

        for (const strategy of strategies) {
            try {
                const result = strategy();
                if (result && result.success) {
                    return result;
                }
            } catch (e) {
                continue;
            }
        }
        return { success: false };
    });

    return result;
} finally {
    // Always close the browser, even on failure
    await browser.close();
}

Key Technical Details:

  • Headless Mode: Always use chromium.launch({ headless: true }) to avoid popup windows
  • Use page.waitForLoadState('networkidle') for proper loading
  • Skip cookie banners: text.includes('cookie') && text.includes('privacy')
  • Look for substantial content: textContent.length > 500 (Strategy 2) or > 1000 (Strategy 3)
  • Exclude scripts and styles: !el.querySelector('script') && !el.querySelector('style')
  • Target Salesforce-specific patterns: "metadata api", "tooling api", "contact center", etc.
  • SILENT MODE: Do NOT show step-by-step progress, status updates, or intermediate messages. Only display the final extracted documentation content.
  • Clean Browser: Always close browser with await browser.close()
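
The Salesforce-specific pattern check from Strategy 3 can be factored into a small predicate (the names SALESFORCE_PATTERNS and hasSalesforcePattern are illustrative, not part of the gist):

```javascript
// Illustrative: the content patterns Strategy 3 looks for, as a reusable list.
const SALESFORCE_PATTERNS = [
    'external client app', 'metadata api', 'scratch org',
    'salesforce help', 'tooling api', 'contact center'
];

// True if the text mentions any known Salesforce documentation topic.
function hasSalesforcePattern(text) {
    const lower = text.toLowerCase();
    return SALESFORCE_PATTERNS.some(pattern => lower.includes(pattern));
}
```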

Error Handling:

  • Try each strategy sequentially
  • Catch and continue on strategy failures
  • Always return structured result with success/failure status
  • Include method used, content length, and error details
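
The error-handling bullets above can be sketched as a pure strategy runner, independent of the browser (the function name runStrategies is mine, not from the gist):

```javascript
// Sketch of the sequential try-each-strategy loop with structured results.
// Each strategy is a function that returns { success, content, method, ... }
// on success, returns a falsy value, or throws.
function runStrategies(strategies) {
    const errors = [];
    for (const strategy of strategies) {
        try {
            const result = strategy();
            if (result && result.success) return result;
        } catch (e) {
            errors.push(e.message);  // record the failure and keep going
            continue;
        }
    }
    // No strategy succeeded: return a structured failure with error details
    return { success: false, errors };
}
```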

Output Structure:

{
    success: true/false,
    title: document.title,
    content: extractedHTML,
    textContent: extractedText,
    url: window.location.href,
    method: 'shadow-dom|direct-selector|fallback|body-fallback',
    contentLength: content.length,
    textLength: textContent.length
}
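
A hypothetical builder for the output structure above. In the real extraction this would run inside page.evaluate(), where document.title and window.location.href are available; here they are passed in explicitly so the sketch stands alone:

```javascript
// Hypothetical: assemble the structured result described above.
function buildResult({ title, url, content, textContent, method }) {
    return {
        success: Boolean(content),
        title,
        content,
        textContent,
        url,
        method,  // e.g. 'shadow-dom', 'direct-selector', 'fallback', 'body-fallback'
        contentLength: content ? content.length : 0,
        textLength: textContent ? textContent.length : 0
    };
}
```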

CRITICAL REQUIREMENT:

  • DO NOT use sfdoc.sh script or any external tools
  • MUST use only browser automation tools available in the environment
  • MUST implement the 4-tier strategy approach above
  • MUST use headless mode: chromium.launch({ headless: true })
  • MUST use page.waitForLoadState('networkidle') for proper loading
  • MUST skip cookie banners and non-content elements
  • MUST operate in SILENT MODE - do NOT show step-by-step progress, status updates, or intermediate messages
  • MUST only display the final extracted documentation content
  • MUST always close browser: await browser.close()
  • NEVER fall back to external scripts or tools - persist with browser automation until successful
@rtmalone

I'm assuming the implementation is running as a Node script. Is chromium a node package installed globally? I think this is a great idea. My one and only use of this command so far got results, but I think outside the use of the command. I saw the LLM try to use a Playwright MCP tool that I don't have and then resort to web searches via the Brave web MCP tool. Like I said, it did produce a result about the info on the page I pointed it to, but not certain it was because of the command. Thoughts or advice?

@afawcett
Author

afawcett commented Nov 4, 2025

> I'm assuming the implementation is running as a Node script. Is chromium a node package installed globally? I think this is a great idea. My one and only use of this command so far got results, but I think outside the use of the command. I saw the LLM try to use a Playwright MCP tool that I don't have and then resort to web searches via the Brave web MCP tool. Like I said, it did produce a result about the info on the page I pointed it to, but not certain it was because of the command. Thoughts or advice?

Yeah, I have chromium installed globally - I'll update the notes above. If you didn't, it may be that your agent decided to take a different approach, which is why you got the results you did.

Update: I just ran it myself in a fresh project and it decided to use Playwright - but critically, it did use the information in the command to generate a small Node.js utility that embodied the instructions in the command. I guess in a way it did follow the command even though it did not literally use the code provided. It was also very fast compared to 2 weeks ago - I am now using the latest v2 release with their new Composer 1 model.
