File Type: ts
Generated Description:
This TypeScript file (`fire-crawl-parser.ts`) is a command-line tool that parses a JSON file of web scraping results, filters the data against specified criteria, and outputs the filtered results to both the console and a new JSON file. It leverages Bun's built-in file I/O and Zod for data validation.
The script takes a filepath as a command-line argument, plus an optional filter pattern and an optional limit on the number of results. It reads the JSON file at that path, validates its structure against a predefined schema, filters the data by status code and the optional URL pattern, then prints a summary to the console and writes the filtered data to a new JSON file. The main components are listed below, followed by illustrative sketches of each one.
- `fireSchema` (Zod schema): Defines the expected structure of each object in the input JSON file, ensuring data integrity. It validates the presence and type of fields such as `markdown`, `metadata` (containing `title`, `url`, `statusCode`, etc.), and an optional `warning`.
- `validateArgs()`: Parses and validates command-line arguments. It checks for the minimum required argument (the filepath), handles the optional filter pattern and result limit, and exits with an error if the input is invalid.
- `readAndParseFile(filepath)`: Asynchronously reads the JSON file from the specified `filepath`, parses it, and validates it against `fireSchema` using Zod's `safeParse`. It handles missing files and invalid JSON gracefully, exiting with appropriate error messages.
- `filterData(data, filterPattern, limit)`: Filters the parsed data on three criteria:
  - Status code: drops entries whose status codes fall outside the 200-599 range.
  - Filter pattern: if a `filterPattern` (a pipe-separated string of patterns) is provided, keeps only entries whose `metadata.url` includes at least one of the patterns.
  - Limit: caps the number of results at the specified `limit` (0 means no limit).
- `outputResults(data)`: Writes the filtered data to a new JSON file named `crawl-results-<timestamp>.json` and prints a console summary that includes the number of results and key information (title, URL, status code) for each result.
- `main()`: Orchestrates the entire process: validates arguments, reads and parses the file, filters the data, and outputs the results, with comprehensive error handling throughout.
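The description only names a handful of fields, so a minimal sketch of `fireSchema` might look like the following; the exact field list, which fields are optional, and the assumption that the input file is a JSON array of these objects are guesses rather than the file's actual schema.

```ts
import { z } from "zod";

// Minimal sketch of the schema described above. Fields beyond markdown,
// metadata.title, metadata.url, metadata.statusCode, and warning are
// omitted; optionality is assumed.
const fireSchema = z.object({
  markdown: z.string(),
  metadata: z.object({
    title: z.string().optional(), // assumed optional
    url: z.string(),
    statusCode: z.number(),
  }),
  warning: z.string().optional(),
});

// The input file is assumed to be a JSON array of these objects.
const fireArraySchema = z.array(fireSchema);

type FireResult = z.infer<typeof fireSchema>;
```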
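A sketch of `validateArgs()`, assuming positional arguments in the order `<filepath> [filterPattern] [limit]` and Bun's `Bun.argv` layout (runtime path, script path, then user arguments); the usage string and argument order are assumptions.

```ts
interface CliArgs {
  filepath: string;
  filterPattern?: string;
  limit: number; // 0 means no limit
}

function validateArgs(): CliArgs {
  // Bun.argv[0] is the runtime and Bun.argv[1] is the script path.
  const args = Bun.argv.slice(2);

  if (args.length < 1) {
    console.error("Usage: bun fire-crawl-parser.ts <filepath> [filterPattern] [limit]");
    process.exit(1);
  }

  const [filepath, filterPattern, limitArg] = args;
  const limit = limitArg ? Number.parseInt(limitArg, 10) : 0;

  if (Number.isNaN(limit) || limit < 0) {
    console.error(`Invalid limit: ${limitArg}`);
    process.exit(1);
  }

  return { filepath, filterPattern, limit };
}
```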
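A sketch of `readAndParseFile()` using `Bun.file()` and Zod's `safeParse`, as described above; the exact error messages and exit behavior are assumptions.

```ts
async function readAndParseFile(filepath: string): Promise<FireResult[]> {
  const file = Bun.file(filepath);

  // Bun.file().exists() lets us report a missing file without throwing.
  if (!(await file.exists())) {
    console.error(`File not found: ${filepath}`);
    process.exit(1);
  }

  let raw: unknown;
  try {
    raw = await file.json(); // reads and parses the file as JSON
  } catch {
    console.error(`Invalid JSON in file: ${filepath}`);
    process.exit(1);
  }

  // safeParse returns a result object instead of throwing on bad data.
  const parsed = fireArraySchema.safeParse(raw);
  if (!parsed.success) {
    console.error("Input did not match the expected schema:", parsed.error.issues);
    process.exit(1);
  }

  return parsed.data;
}
```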
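A sketch of `filterData()` applying the three criteria described (status code range, URL pattern, limit); treating a limit of 0 as "no limit" follows the description, while the order of the checks is an assumption.

```ts
function filterData(
  data: FireResult[],
  filterPattern: string | undefined,
  limit: number,
): FireResult[] {
  // Keep entries whose status code falls within 200-599.
  let filtered = data.filter(
    (entry) => entry.metadata.statusCode >= 200 && entry.metadata.statusCode <= 599,
  );

  // With a pipe-separated pattern string, keep entries whose URL contains
  // at least one of the patterns.
  if (filterPattern) {
    const patterns = filterPattern.split("|");
    filtered = filtered.filter((entry) =>
      patterns.some((pattern) => entry.metadata.url.includes(pattern)),
    );
  }

  // A limit of 0 means no limit.
  return limit > 0 ? filtered.slice(0, limit) : filtered;
}
```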
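A sketch of `outputResults()` writing the timestamped output file with `Bun.write()` and printing a per-result summary; the timestamp format and the exact console layout are assumptions.

```ts
async function outputResults(data: FireResult[]): Promise<void> {
  // Write the filtered data to crawl-results-<timestamp>.json; the
  // timestamp format (epoch milliseconds) is an assumption.
  const outPath = `crawl-results-${Date.now()}.json`;
  await Bun.write(outPath, JSON.stringify(data, null, 2));

  // Print a short summary: count, then status code / title / URL per entry.
  console.log(`Wrote ${data.length} results to ${outPath}`);
  for (const entry of data) {
    console.log(
      `${entry.metadata.statusCode}  ${entry.metadata.title ?? "(no title)"}  ${entry.metadata.url}`,
    );
  }
}
```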
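Finally, a sketch of `main()` wiring the pieces together under a top-level error handler; since Bun supports top-level `await`, the script can simply await the call.

```ts
async function main(): Promise<void> {
  try {
    const { filepath, filterPattern, limit } = validateArgs();
    const data = await readAndParseFile(filepath);
    const filtered = filterData(data, filterPattern, limit);
    await outputResults(filtered);
  } catch (error) {
    console.error("Unexpected error:", error);
    process.exit(1);
  }
}

await main();
```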
- Data Validation (Zod): Uses Zod for robust schema validation, ensuring data integrity and preventing unexpected errors due to malformed input data.
- Asynchronous Operations (`async/await`): Employs `async/await` for efficient handling of asynchronous file I/O operations.
- Command-line Argument Handling: Processes command-line arguments effectively, providing clear usage instructions and error messages.
- Error Handling: Includes comprehensive error handling at various stages of the process, preventing unexpected crashes and providing informative error messages.
- Modular Design: The code is organized into well-defined functions, improving readability, maintainability, and testability.
- Analyzing Web Scraping Results: Process large datasets of web scraping results to identify successful crawls, filter for specific URLs or domains, and analyze the distribution of status codes.
- Data Cleaning and Preprocessing: Clean and filter web scraping data before further processing or analysis.
- Monitoring Website Status: Track the status codes of a set of URLs over time to monitor website availability and performance.
- Building a Web Crawler Pipeline: Integrate this parser as a component in a larger web crawling pipeline to process and analyze scraped data.
The script is well structured and efficient, and it handles errors gracefully, making it suitable for a variety of data processing tasks involving web scraping results.
Description generated on 4/29/2025, 9:55:40 PM