@WomB0ComB0
Created April 30, 2025 01:55
fire-crawl-parser.ts and related files - with AI-generated descriptions

import { file, readableStreamToJSON, argv, write } from 'bun';
import { z } from 'zod';

const fireSchema = z.object({
  markdown: z.string(),
  metadata: z.object({
    generator: z.string(),
    viewport: z.string(),
    language: z.string(),
    description: z.string(),
    title: z.string(),
    scrapeId: z.string(),
    sourceURL: z.string(),
    url: z.string(),
    statusCode: z.number(),
  }),
  warning: z.string().optional(),
});

type FireCrawl = z.infer<typeof fireSchema>;

/**
 * Validates command line arguments
 * @returns An object with validated arguments, or exits the process if the input is invalid
 */
function validateArgs(): { filepath: string; filterPattern: string; limit: number } {
  if (argv.length < 3) {
    console.error('Error: Missing required arguments');
    console.log('Usage: bun fire-crawl-parser.ts <file_path> [filter_pattern] [limit]');
    process.exit(1);
  }

  const [filepath, filterPattern = "", limitStr = "0"] = argv.slice(2);
  const limit = parseInt(limitStr, 10);

  if (isNaN(limit)) {
    console.error(`Error: Invalid limit value: ${limitStr}`);
    process.exit(1);
  }

  return { filepath, filterPattern, limit };
}

/**
 * Reads and parses a JSON file
 * @param filepath Path to the JSON file
 * @returns Parsed and validated data
 */
async function readAndParseFile(filepath: string): Promise<FireCrawl[]> {
  const fileObj = file(filepath);

  if (!(await fileObj.exists())) {
    console.error(`Error: File not found: ${filepath}`);
    process.exit(1);
  }

  try {
    const fileData = await readableStreamToJSON(fileObj.stream());
    const validationResult = z.array(fireSchema).safeParse(fileData);

    if (!validationResult.success) {
      console.error('Error: Invalid data format');
      console.error(validationResult.error.format());
      process.exit(1);
    }

    return validationResult.data;
  } catch (error) {
    console.error(`Error reading or parsing file: ${filepath}`);
    console.error(error instanceof Error ? error.message : String(error));
    process.exit(1);
  }
}

/**
 * Filters data based on a pattern and limit
 * @param data Array of FireCrawl objects
 * @param filterPattern Pattern to filter URLs by
 * @param limit Maximum number of results (0 for unlimited)
 * @returns Filtered data
 */
function filterData(data: FireCrawl[], filterPattern: string, limit: number): FireCrawl[] {
  // Keep any entry with an HTTP status code in the 200-599 range
  // (2xx success, 3xx redirection, 4xx client errors, 5xx server errors).
  let filtered = data.filter(item => {
    const statusCode = item.metadata.statusCode;
    return statusCode >= 200 && statusCode < 600;
  });

  // Apply the URL pattern filter on top of the status-code filter.
  if (filterPattern) {
    const patterns = filterPattern.split('|');
    filtered = filtered.filter(item =>
      patterns.some(pattern => item.metadata.url.includes(pattern))
    );
  }

  if (limit > 0) filtered = filtered.slice(0, limit);

  return filtered;
}

/**
 * Outputs the results
 * @param data Filtered data to output
 */
async function outputResults(data: FireCrawl[]): Promise<void> {
  console.log(`Found ${data.length} matching results.`);

  const outputPath = `crawl-results-${Date.now()}.json`;
  // Await the write so the file is fully flushed before the process exits.
  await write(outputPath, JSON.stringify(data, null, 2));
  console.log(`Results written to ${outputPath}`);

  data.forEach((item, index) => {
    console.log(`\nResult ${index + 1}:`);
    console.log(`  Title: ${item.metadata.title}`);
    console.log(`  URL: ${item.metadata.url}`);
    console.log(`  Status Code: ${item.metadata.statusCode}`);
  });
}

/**
 * Main function that orchestrates the parsing process
 */
async function main(): Promise<void> {
  try {
    const { filepath, filterPattern, limit } = validateArgs();

    console.log(`Processing file: ${filepath}`);
    if (filterPattern) console.log(`Using filter: ${filterPattern}`);
    if (limit > 0) console.log(`Limiting results to: ${limit}`);

    const data = await readAndParseFile(filepath);
    console.log(`Successfully parsed ${data.length} entries.`);

    const filteredData = filterData(data, filterPattern, limit);
    await outputResults(filteredData);

    process.exit(0);
  } catch (error) {
    console.error('Unexpected error occurred:');
    console.error(error instanceof Error ? error.message : String(error));
    process.exit(1);
  }
}

// Only run when this file is executed directly, not when imported.
if (import.meta.main) {
  main().catch(console.error);
}

fire-crawl-parser.ts Description

File Type: ts

Generated Description:

fire-crawl-parser.ts Analysis

This TypeScript file (fire-crawl-parser.ts) is a command-line tool that parses a JSON file of web scraping results, filters the data by specified criteria, and outputs the filtered results to both the console and a new JSON file. It uses Bun's built-in file I/O and Zod for schema validation.

Summary

The script takes a file path as a command-line argument, plus an optional URL filter pattern and an optional limit on the number of results. It reads the JSON file at that path, validates its structure against a predefined schema, filters the entries by HTTP status code and the optional URL pattern, then prints a summary to the console and writes the filtered data to a new JSON file.
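
For context, here is a minimal sketch of a single input record that would satisfy fireSchema. Every field value below is hypothetical; a real input file would contain a JSON array of such records.

  // Hypothetical example of one crawl record in the input file.
  // The parser expects the file itself to be a JSON array of these objects.
  const sampleRecord = {
    markdown: '# Example Domain\n\nThis domain is for use in examples.',
    metadata: {
      generator: 'Firecrawl',
      viewport: 'width=device-width, initial-scale=1',
      language: 'en',
      description: 'An example page',
      title: 'Example Domain',
      scrapeId: 'abc123',                 // hypothetical scrape ID
      sourceURL: 'https://example.com/',
      url: 'https://example.com/',
      statusCode: 200,
    },
    // warning is optional in the schema and omitted here
  };

Assuming such records are saved to a file named crawl.json (a hypothetical name), the script could be run as bun fire-crawl-parser.ts crawl.json example.com 10 to keep at most ten entries whose URL contains example.com.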

Key Components and Functions

  • fireSchema (Zod Schema): Defines the expected structure of each object within the input JSON file using Zod, ensuring data integrity. This schema validates the presence and type of fields like markdown, metadata (containing title, url, statusCode, etc.), and an optional warning.

  • validateArgs(): Parses and validates command-line arguments. It checks for the minimum required arguments (filepath), handles optional filter patterns and a limit on the number of results, and exits with an error if input is invalid.

  • readAndParseFile(filepath): Asynchronously reads the JSON file from the specified filepath, parses it, and validates it against the fireSchema using Zod's safeParse. It handles file not found errors and invalid JSON data gracefully, exiting with appropriate error messages.

  • filterData(data, filterPattern, limit): Filters the parsed data on up to three criteria (a sketch combining validation and filtering follows this list):

    • Status Code: Keeps only entries whose status code falls in the 200-599 range (success, redirection, client-error, and server-error responses); anything outside that range is dropped.
    • Filter Pattern: If a filterPattern (a pipe-separated string of patterns) is provided, only entries whose metadata.url contains at least one of the patterns are kept.
    • Limit: Caps the number of results at the specified limit (0 means no limit).

  • outputResults(data): Outputs the filtered data. It writes the filtered data to a new JSON file named crawl-results-<timestamp>.json and prints a summary including the number of results and key information (title, URL, status code) for each result to the console.

  • main(): The main function orchestrates the entire process: validates arguments, reads and parses the file, filters the data, and outputs the results. It includes comprehensive error handling.
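
To make the validation and filtering steps concrete, here is a hedged sketch of how they compose. It assumes fireSchema and filterData are exported from fire-crawl-parser.ts (the gist above does not export them), and the record and URL patterns are hypothetical.

  import { z } from 'zod';
  // Hypothetical import: the gist keeps these definitions private, so
  // exporting them is an assumption made for this sketch.
  import { fireSchema, filterData } from './fire-crawl-parser';

  // One hypothetical entry, shaped like the input the script reads from disk.
  const raw = {
    markdown: '# Example Domain',
    metadata: {
      generator: 'Firecrawl',
      viewport: 'width=device-width, initial-scale=1',
      language: 'en',
      description: 'An example page',
      title: 'Example Domain',
      scrapeId: 'abc123',
      sourceURL: 'https://example.com/',
      url: 'https://example.com/',
      statusCode: 200,
    },
  };

  // safeParse never throws; it returns either the typed data or a structured error.
  const result = z.array(fireSchema).safeParse([raw]);
  if (result.success) {
    // Keep entries whose URL contains either pattern, capped at 5 results.
    const filtered = filterData(result.data, 'example.com|example.org', 5);
    console.log(filtered.length); // 1 for this hypothetical input
  } else {
    console.error(result.error.format());
  }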

Notable Patterns and Techniques

  • Data Validation (Zod): Uses Zod for robust schema validation, ensuring data integrity and preventing unexpected errors due to malformed input data.
  • Asynchronous Operations (async/await): Employs async/await for efficient handling of asynchronous file I/O operations.
  • Command-line Argument Handling: Processes command-line arguments effectively, providing clear usage instructions and error messages.
  • Error Handling: Includes comprehensive error handling at various stages of the process, preventing unexpected crashes and providing informative error messages (a small sketch of the error-narrowing pattern follows this list).
  • Modular Design: The code is organized into well-defined functions, improving readability, maintainability, and testability.
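
As a small illustration of the error-narrowing pattern the script repeats in its catch blocks (describeError is a hypothetical helper written for this sketch, not part of the gist):

  // A caught value in TypeScript is `unknown`, so the script narrows it before
  // printing: use the message if it is an Error, otherwise stringify it.
  function describeError(error: unknown): string {
    return error instanceof Error ? error.message : String(error);
  }

  try {
    JSON.parse('{not valid json');
  } catch (error) {
    console.error(describeError(error)); // prints the JSON syntax error message
  }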

Potential Use Cases

  • Analyzing Web Scraping Results: Process large datasets of web scraping results to identify successful crawls, filter for specific URLs or domains, and analyze the distribution of status codes.
  • Data Cleaning and Preprocessing: Clean and filter web scraping data before further processing or analysis.
  • Monitoring Website Status: Track the status codes of a set of URLs over time to monitor website availability and performance.
  • Building a Web Crawler Pipeline: Integrate this parser as a component in a larger web crawling pipeline to process and analyze scraped data.

The script is well structured, handles errors gracefully, and is efficient, making it suitable for a variety of data-processing tasks involving web scraping results.

Description generated on 4/29/2025, 9:55:40 PM
