Created
July 1, 2025 12:57
scraper-development-guide-en.md
# Universal Bank Scraper Development Guide
## Core Development Methodology
### 1. Real-time DOM Analysis with Playwright MCP Tools
#### Why This Approach is Revolutionary
- **Real-time interaction**: See page structure directly without guessing selectors
- **Precise targeting**: Get accurate element references (ref) and attributes
- **Instant validation**: Test selectors immediately
- **Avoid blind development**: Reduce trial-and-error time
#### Standard Workflow:
```javascript
// Step 1: Navigate to target page
await mcp__playwright__browser_navigate({ url: 'target-bank-login-url' });
// Step 2: Analyze page snapshot, identify key elements
// Page snapshot will show structure like:
// - textbox "Username field description" [ref=eXX]
// - textbox "Password field description" [ref=eYY]
// - button "Login button text" [ref=eZZ]
// Step 3: Test interactions using found references
await mcp__playwright__browser_type({
  element: "descriptive name",
  ref: "actual ref value",
  text: "test text"
});
// Step 4: Observe page changes, adjust strategy
```
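If you want to keep a record of the refs found in Step 2, the snapshot lines can be parsed mechanically. The following is a minimal sketch, assuming every line has the shape `- role "name" [ref=eNN]` shown in the comments above. `parseSnapshotLine` and `parseSnapshot` are hypothetical helpers, not part of Playwright or the MCP tools, and real snapshots may carry extra attributes that this pattern does not handle.

```javascript
// Hypothetical helper: parse one snapshot line of the assumed form
//   - textbox "Username" [ref=e12]
// Returns null when the line does not match that shape.
function parseSnapshotLine(line) {
  const match = /^\s*-\s*(\w+)\s+"([^"]*)"\s+\[ref=([^\]]+)\]/.exec(line);
  if (!match) return null;
  const [, role, name, ref] = match;
  return { role, name, ref };
}

// Collect every parseable element from a multi-line snapshot string.
function parseSnapshot(snapshot) {
  return snapshot
    .split('\n')
    .map(line => parseSnapshotLine(line))
    .filter(entry => entry !== null);
}

// Sample input written by hand for illustration, not captured from a real site.
const sample = [
  '- textbox "Username" [ref=e12]',
  '- textbox "Password" [ref=e15]',
  '- button "Log in" [ref=e20]',
].join('\n');

console.log(parseSnapshot(sample));
```

Each returned object carries the `role` and `name` you would later pass to `getByRole`, plus the `ref` needed for the MCP interaction calls.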
### 2. Selector Strategy Priority
#### Recommended Order (Most to Least Stable):
1. **getByRole Selectors** (Most Recommended)
```javascript
// Locators are created synchronously; only actions on them (fill, click) need await
const field = page.getByRole('textbox', { name: 'Field Label' });
const button = page.getByRole('button', { name: 'Button Text' });
```
2. **aria-label Attributes**
```javascript
const labelled = page.locator('[aria-label="Specific Label"]');
```
3. **Stable ID or name Attributes**
```javascript
const byId = page.locator('#stableId');
const byName = page.locator('[name="fieldName"]');
```
4. **Text Content** (For Buttons)
```javascript
const loginButton = page.locator('button:has-text("Login")');
```
5. **CSS Classes** (Least Recommended - Changes Frequently)
### 3. Universal Page Analysis Process
#### 3.1 Initial Exploration
```javascript
// Promise-based fs API, so writeFile can be awaited.
// The awaits below assume this snippet runs inside an async function.
const fs = require('fs/promises');

// Save page HTML for offline analysis
const pageContent = await page.content();
await fs.writeFile('/tmp/bank-page-debug.html', pageContent);
// Get all input field information
const inputs = await page.$$eval('input', elements =>
  elements.map(el => ({
    id: el.id,
    name: el.name,
    type: el.type,
    placeholder: el.placeholder,
    ariaLabel: el.getAttribute('aria-label'),
    visible: el.offsetParent !== null
  }))
);
// Get all button information
const buttons = await page.$$eval('button, input[type="submit"]', elements =>
  elements.map(el => ({
    text: el.textContent || el.value,
    type: el.type,
    id: el.id,
    className: el.className
  }))
);
```
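The `inputs` array collected above can feed the selector priority from section 2. The sketch below shows one way to do that, assuming the object shape produced by the `$$eval` call (`id`, `name`, `ariaLabel`, `placeholder`). `buildCandidateSelectors` and `quoteAttr` are hypothetical helpers, and the resulting order is a suggestion rather than a guarantee of stability.

```javascript
// Escape backslashes and double quotes so a value can sit inside [attr="..."].
function quoteAttr(value) {
  return value.replace(/\\/g, '\\\\').replace(/"/g, '\\"');
}

// Hypothetical helper: turn one entry of `inputs` into CSS selectors,
// ordered from most to least stable following the priority list above.
// [id="..."] is used instead of #id so unusual characters need no extra escaping.
function buildCandidateSelectors(info) {
  const selectors = [];
  if (info.ariaLabel) selectors.push(`[aria-label="${quoteAttr(info.ariaLabel)}"]`);
  if (info.id) selectors.push(`[id="${quoteAttr(info.id)}"]`);
  if (info.name) selectors.push(`[name="${quoteAttr(info.name)}"]`);
  if (info.placeholder) selectors.push(`[placeholder="${quoteAttr(info.placeholder)}"]`);
  return selectors;
}

// Example entry with made-up values.
const candidates = buildCandidateSelectors({
  id: 'user',
  name: 'username',
  type: 'text',
  placeholder: '',
  ariaLabel: 'Customer ID',
  visible: true,
});
console.log(candidates);
```

Empty or missing attributes are skipped, so an input with no usable attributes yields an empty list and needs a different strategy, such as `getByRole`.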
#### 3.2 Element Location Strategy
```javascript
// Universal element finding function
async function findAndFillField(page, fieldPurpose, value, possibleSelectors) {
  // Try getByRole first; a short timeout keeps a miss from blocking
  // on the default action timeout before the fallbacks run
  try {
    await page
      .getByRole('textbox', { name: new RegExp(fieldPurpose, 'i') })
      .fill(value, { timeout: 2000 });
    return true;
  } catch (e) {
    // Silent fail, try other methods
  }
  // Then try provided selector list
  for (const selector of possibleSelectors) {
    try {
      await page.waitForSelector(selector, { timeout: 2000 });
      await page.fill(selector, value);
      return true;
    } catch (e) {
      continue;
    }
  }
  throw new Error(`Cannot find ${fieldPurpose} input field`);
}
```
### 4. Account Data Extraction Patterns
#### 4.1 Exploratory Parsing
```javascript
// Don't assume specific class names or structure
// Find based on content characteristics
// Step 1: Find potential account containers
const potentialContainers = await page.$$('div, tr, section, article, li');
// Step 2: Identify account elements by content features
for (const container of potentialContainers) {
  const text = await container.textContent();
  // Common account element features:
  // - Contains currency symbols ($, ¥, €, etc.)
  // - Contains account number formats
  // - Contains account type keywords
  // (this.looksLikeAccount assumes the code runs inside a scraper class
  // that defines that method)
  if (text && this.looksLikeAccount(text)) {
    // Parse further
  }
}
```
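The guide does not show `looksLikeAccount`, so here is a minimal sketch of what such a heuristic could look like, based on the three content features listed in the comments. The keyword list and the 500-character cutoff are assumptions chosen for illustration, and the function is written standalone rather than as a class method.

```javascript
// Hypothetical heuristic: a text block "looks like" an account when it has
// a currency amount plus either an account-number-like digit run or a keyword.
const CURRENCY_AMOUNT = /[¥$€£₹]\s?\d[\d,]*(\.\d+)?|\d[\d,]*(\.\d+)?\s?[¥$€£₹]/;
const ACCOUNT_NUMBER = /\d{3}-\d{3}\s+\d{4,}|\d{6,16}/;
// Assumed keyword list; extend it for the bank's own product names.
const ACCOUNT_KEYWORDS = /\b(savings|checking|current|credit|deposit|loan)\b/i;

function looksLikeAccount(text) {
  // Very long blocks are usually page-level wrappers, not a single account row.
  if (text.length > 500) return false;
  const hasAmount = CURRENCY_AMOUNT.test(text);
  const hasNumber = ACCOUNT_NUMBER.test(text);
  const hasKeyword = ACCOUNT_KEYWORDS.test(text);
  return hasAmount && (hasNumber || hasKeyword);
}

console.log(looksLikeAccount('Everyday Savings 123-456 78901 $1,204.55')); // true
console.log(looksLikeAccount('Contact us for help with online banking')); // false
```

The length cutoff matters because the container query in Step 1 matches nested elements: an outer `div` wrapping the whole page contains every account's text and would otherwise pass the check.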
#### 4.2 Intelligent Data Extraction
```javascript
// Universal account info extractor
function extractAccountInfo(text) {
  const info = {
    name: '',
    number: '',
    balance: 0,
    currency: 'USD'
  };
  // Extract various account number formats with regex
  const accountNumberPatterns = [
    /\d{3}-\d{3}\s+\d{4,}/,          // Format: 123-456 78901
    /\d{4}\s+\d{4}\s+\d{4}\s+\d{4}/, // Credit card format
    /\d{6,16}/                       // Continuous digits
  ];
  // Extract amounts (supports multiple currencies)
  const amountPattern = /([¥$€£₹]\s?[\d,]+\.?\d*)|(\d+\.?\d*\s?[¥$€£₹])/;
  // Smart extraction logic...
  return info;
}
```
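The extraction step itself is elided in the block above. Purely as an illustration of how patterns like these could be applied, here is a self-contained sketch. The function name `extractAccountInfoSketch`, the symbol-to-currency mapping, and the rule that the text before the account number is the account name are all assumptions, not the author's logic.

```javascript
// Assumption: '$' means USD and '¥' means JPY, although both symbols are
// shared by several currencies.
const CURRENCY_BY_SYMBOL = { '$': 'USD', '€': 'EUR', '£': 'GBP', '¥': 'JPY', '₹': 'INR' };

const ACCOUNT_NUMBER_PATTERNS = [
  /\d{3}-\d{3}\s+\d{4,}/,          // Format: 123-456 78901
  /\d{4}\s+\d{4}\s+\d{4}\s+\d{4}/, // Credit card format
  /\d{6,16}/                       // Continuous digits
];
const PREFIX_AMOUNT = /([¥$€£₹])\s?(\d[\d,]*(?:\.\d+)?)/; // $1,204.55
const SUFFIX_AMOUNT = /(\d[\d,]*(?:\.\d+)?)\s?([¥$€£₹])/; // 1,500.00 €

function extractAccountInfoSketch(text) {
  const info = { name: '', number: '', balance: 0, currency: 'USD' };
  let rest = text;

  // First pattern that matches wins, so the list runs most specific first.
  for (const pattern of ACCOUNT_NUMBER_PATTERNS) {
    const match = pattern.exec(text);
    if (match) {
      info.number = match[0];
      // Assumed convention: the text before the number is the account name.
      info.name = text.slice(0, match.index).trim();
      // Remove the number so its digits are not mistaken for an amount.
      rest = text.slice(0, match.index) + ' ' + text.slice(match.index + match[0].length);
      break;
    }
  }

  // Prefer symbol-first amounts, then fall back to symbol-last.
  const prefix = PREFIX_AMOUNT.exec(rest);
  const suffix = prefix ? null : SUFFIX_AMOUNT.exec(rest);
  if (prefix || suffix) {
    const symbol = prefix ? prefix[1] : suffix[2];
    const digits = prefix ? prefix[2] : suffix[1];
    info.balance = Number(digits.replace(/,/g, ''));
    info.currency = CURRENCY_BY_SYMBOL[symbol];
  }
  return info;
}

console.log(extractAccountInfoSketch('Everyday Savings 123-456 78901 $1,204.55'));
```

The account number is removed before the amount search because a digit run followed by a currency symbol, such as `78901 $`, would otherwise be read as a symbol-last amount. This sketch also treats commas as thousands separators only, so amounts written with a decimal comma need separate handling.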
### 5. Error Handling and Retry Strategy
```javascript
class RobustScraper {
  async safeOperation(operation, operationName, maxRetries = 3) {
    let lastError;
    for (let i = 0; i < maxRetries; i++) {
      try {
        return await operation();
      } catch (error) {
        lastError = error;
        // `logger` is assumed to be provided by the surrounding application
        logger.warn(`${operationName} failed, attempt ${i + 1}/${maxRetries}`);
        // Smart error analysis
        if (!this.isRecoverableError(error)) {
          throw error; // Throw unrecoverable errors immediately
        }
        if (i < maxRetries - 1) {
          await this.delay(2000 * 2 ** i); // Exponential backoff: 2s, 4s, 8s, ...
        }
      }
    }
    throw lastError; // All retries exhausted: surface the last failure
  }
  delay(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
  isRecoverableError(error) {
    const recoverablePatterns = [
      'timeout',
      'network',
      'navigation',
      'element not found'
    ];
    return recoverablePatterns.some(pattern =>
      error.message.toLowerCase().includes(pattern)
    );
  }
}
```
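To see how a retry loop of this kind behaves without launching a browser, the sketch below restates the idea as a standalone function. `retryWithBackoff`, its option names, and the injectable `sleep` are hypothetical and exist only so the example can run quickly on its own; they are not part of the class above.

```javascript
// Hypothetical standalone variant of the retry loop. `sleep` is injectable
// so the backoff schedule can be observed without real waiting.
async function retryWithBackoff(operation, options = {}) {
  const {
    maxRetries = 3,
    baseDelayMs = 2000,
    isRecoverable = () => true,
    sleep = ms => new Promise(resolve => setTimeout(resolve, ms)),
  } = options;

  let lastError;
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (!isRecoverable(error)) throw error;
      // Wait baseDelayMs, then 2x, 4x, ... but not after the final attempt.
      if (attempt < maxRetries - 1) await sleep(baseDelayMs * 2 ** attempt);
    }
  }
  throw lastError;
}

// Usage: an operation that times out twice, then succeeds.
async function demo() {
  const waits = [];
  let calls = 0;
  const result = await retryWithBackoff(
    async () => {
      calls += 1;
      if (calls < 3) throw new Error('Timeout waiting for selector');
      return 'ok';
    },
    { sleep: async ms => { waits.push(ms); } } // record delays instead of sleeping
  );
  return { result, calls, waits };
}

demo().then(outcome => console.log(outcome));
```

In the demo the recorded delays are 2000 ms and 4000 ms: the wait doubles after each failed attempt, and no wait follows the attempt that succeeds.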
### 6. Standard Process for New Bank Development
#### Phase 1: Exploration and Understanding
1. Use MCP tools to visit bank website
2. Screenshot all key pages
3. Analyze page structure, identify patterns
4. Document all possible selectors
#### Phase 2: Prototype Development
1. Create minimal viable login flow
2. Use `headless: false` to observe behavior
3. Gradually add error handling
4. Validate data extraction logic
#### Phase 3: Robustness Improvements
1. Add retry mechanisms
2. Handle edge cases
3. Optimize performance (reduce wait times)
4. Add detailed logging
#### Phase 4: Production Ready
1. Remove debug code
2. Add security measures
3. Implement monitoring and alerts
4. Write documentation
### 7. Universal Best Practices
#### Security
- Never hardcode credentials in code
- Use environment variables or secure storage
- Implement rate limiting to avoid bans
- Add user agent rotation
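As a sketch of the first three points, the helpers below read credentials from environment variables and space out requests. The variable names `BANK_USERNAME` and `BANK_PASSWORD` and both function names are assumptions made for illustration. A production setup might use a secrets manager instead.

```javascript
// Hypothetical helper: read credentials from an env-like object and fail
// loudly if any are missing. The variable names are assumptions.
function loadCredentials(env = process.env) {
  const required = ['BANK_USERNAME', 'BANK_PASSWORD'];
  const missing = required.filter(key => !env[key]);
  if (missing.length > 0) {
    // Report only the variable names, never their values.
    throw new Error(`Missing environment variables: ${missing.join(', ')}`);
  }
  return { username: env.BANK_USERNAME, password: env.BANK_PASSWORD };
}

// Minimal rate limiter: each call resolves no sooner than minIntervalMs
// after the previous call's slot, so consecutive requests are spaced out.
// The returned function resolves to the number of milliseconds it waited.
function createRateLimiter(minIntervalMs) {
  let nextAllowed = 0;
  return async function waitTurn() {
    const now = Date.now();
    const waitMs = Math.max(0, nextAllowed - now);
    nextAllowed = Math.max(now, nextAllowed) + minIntervalMs;
    if (waitMs > 0) await new Promise(resolve => setTimeout(resolve, waitMs));
    return waitMs;
  };
}
```

A scraper would call `await waitTurn()` before each navigation. Passing an explicit object to `loadCredentials` instead of relying on `process.env` keeps the function easy to unit test.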
#### Maintainability
- Centralize selector management
- Use config files for bank-specific info
- Implement comprehensive logging
- Write unit tests
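Centralized selector management could look like the sketch below: a single config object keyed by bank, plus a lookup that fails with a clear message. The bank key, URL, and selectors are placeholders rather than real endpoints, and `getSelectors` is a hypothetical helper.

```javascript
// Hypothetical per-bank configuration. The bank key, URL, and selectors are
// placeholders for illustration only.
const BANK_CONFIGS = {
  examplebank: {
    loginUrl: 'https://bank.example/login',
    selectors: {
      username: ['[aria-label="Customer ID"]', '[name="username"]'],
      password: ['[name="password"]', '#password'],
      submit: ['button:has-text("Log in")'],
    },
  },
};

// Look up the candidate selectors for one field of one bank.
function getSelectors(bankId, field) {
  const bank = BANK_CONFIGS[bankId];
  if (!bank) throw new Error(`Unknown bank: ${bankId}`);
  const selectors = bank.selectors[field];
  if (!selectors) throw new Error(`No selectors for ${bankId}.${field}`);
  return selectors;
}

console.log(getSelectors('examplebank', 'username'));
```

The returned array has the shape that `findAndFillField` in section 3.2 expects for `possibleSelectors`, so a selector change at the bank becomes a config edit rather than a code change.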
#### Performance Optimization
- Only wait for necessary elements
- Use `waitForLoadState('domcontentloaded')` instead of `'networkidle'`
- Process multiple accounts in parallel
- Implement smart caching
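Parallel processing is usually safer with a cap on how many accounts are handled at once. The sketch below shows one way to bound concurrency. `mapWithConcurrency` is a hypothetical helper, and the worker in the demo is a timer standing in for real page work. Whether a given bank tolerates concurrent sessions is something to verify before relying on this.

```javascript
// Run `worker` over `items` with at most `limit` calls in flight.
// Results keep the order of `items`.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let nextIndex = 0;

  // Each runner repeatedly claims the next unprocessed index.
  async function runner() {
    while (nextIndex < items.length) {
      const index = nextIndex++;
      results[index] = await worker(items[index], index);
    }
  }

  const runnerCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}

// Demo: five placeholder account IDs, two at a time.
async function demoParallel() {
  let inFlight = 0;
  let maxInFlight = 0;
  const statuses = await mapWithConcurrency(['A', 'B', 'C', 'D', 'E'], 2, async id => {
    inFlight += 1;
    maxInFlight = Math.max(maxInFlight, inFlight);
    await new Promise(resolve => setTimeout(resolve, 10)); // stand-in for page work
    inFlight -= 1;
    return `${id}-done`;
  });
  return { statuses, maxInFlight };
}

demoParallel().then(outcome => console.log(outcome));
```

In a real scraper each worker would typically operate on its own page or browser context so that navigation in one account does not disturb another.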
### 8. Debugging Tips
```javascript
const fs = require('fs/promises'); // promise-based API, so writeFile can be awaited

// Development helper function
async function debugPage(page) {
  // Save screenshot
  await page.screenshot({ path: 'debug-screenshot.png' });
  // Save HTML
  const html = await page.content();
  await fs.writeFile('debug-page.html', html);
  // Print interactable elements
  const interactables = await page.$$eval(
    'input, button, select, textarea, a',
    els => els.map(el => ({
      tag: el.tagName,
      text: el.textContent || el.value,
      visible: el.offsetParent !== null
    }))
  );
  console.log('Interactable elements:', interactables);
}
```
## Summary
Key success factors for bank scraper development:
1. **Use MCP tools** for real-time exploration instead of blind coding
2. **Prioritize semantic selectors** (getByRole) over fragile CSS selectors
3. **Extract data based on content** rather than structure
4. **Implement smart retries** and error recovery
5. **Keep code generic** for easy extension to new banks

This methodology has been validated across multiple bank implementations and can reduce new bank integration time from days to hours.