This Python script allows you to scrape specific elements from a given website using the Parsera library, powered by OpenAI's models.
- Python 3.7+
- An OpenAI API Key
- Parsera library
Install Parsera and Required Libraries:
pip install parsera
playwright install
The script can be used directly from the command line. You can pass elements to extract either as key-value pairs or as a JSON string.
- --url: The URL of the website to scrape (required).
- --elements: The elements to extract, provided as key-value pairs or a JSON string (required).
- --openai_api_key: The OpenAI API key (optional if set as an environment variable).
- --output_file: Optional file path to save the results.
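The script's source is not reproduced in this README, so the following is only a minimal sketch of how the four arguments above could be declared with argparse. The function name build_parser and the use of nargs="+" for --elements are assumptions made for illustration, not the script's confirmed implementation.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the four command-line arguments described above (illustrative sketch)."""
    parser = argparse.ArgumentParser(
        description="Scrape elements from a web page with Parsera."
    )
    parser.add_argument("--url", required=True, help="URL of the website to scrape")
    # Assumption: nargs="+" lets --elements take either several key=value
    # tokens or one JSON string as a single token.
    parser.add_argument(
        "--elements",
        required=True,
        nargs="+",
        help="Elements to extract, as key=value pairs or a JSON string",
    )
    parser.add_argument(
        "--openai_api_key",
        default=None,
        help="OpenAI API key (optional if OPENAI_API_KEY is set)",
    )
    parser.add_argument(
        "--output_file",
        default=None,
        help="Optional file path to save the results",
    )
    return parser


# Example: parse an explicit argument list instead of sys.argv.
example_args = build_parser().parse_args(
    ["--url", "https://news.ycombinator.com/", "--elements", "Title=News title"]
)
```

Passing an explicit list to parse_args, as in the last lines, makes the parser easy to exercise without touching the real command line.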
Using Key-Value Pairs:
python ai_scrape_with_parsera.py --url "https://news.ycombinator.com/" --elements Title="News title" Points="Number of points" Comments="Number of comments" --openai_api_key "your_openai_api_key"
Using a JSON String:
python ai_scrape_with_parsera.py --url "https://news.ycombinator.com/" --elements '{"Title": "News title", "Points": "Number of points", "Comments": "Number of comments"}' --openai_api_key "your_openai_api_key"
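The two --elements forms shown above can both be reduced to one dictionary of field names and descriptions. The helper below is a hedged sketch of that normalization; parse_elements is a hypothetical name, and the rule "a single token starting with { is JSON" is an assumption rather than the script's documented behavior.

```python
import json
from typing import Dict, List


def parse_elements(tokens: List[str]) -> Dict[str, str]:
    """Turn --elements tokens into a {name: description} dict (illustrative sketch)."""
    # Assumption: one token that starts with "{" is a JSON object.
    if len(tokens) == 1 and tokens[0].lstrip().startswith("{"):
        # json.loads raises json.JSONDecodeError (a ValueError) on malformed input.
        parsed = json.loads(tokens[0])
        return {str(key): str(value) for key, value in parsed.items()}

    elements: Dict[str, str] = {}
    for token in tokens:
        # The shell strips the quotes, so Title="News title" arrives as
        # the single token 'Title=News title'.
        key, separator, value = token.partition("=")
        if not separator or not key or not value:
            raise ValueError(f"Invalid key-value pair: {token!r}")
        elements[key] = value
    return elements
```

Both input styles raise ValueError on bad input, so a caller could handle them with a single except clause.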
Saving Results to a File:
python ai_scrape_with_parsera.py --url "https://news.ycombinator.com/" --elements Title="News title" Points="Number of points" Comments="Number of comments" --output_file "scraped_results.json"
Using Environment Variable for API Key:
export OPENAI_API_KEY="your_openai_api_key"
python ai_scrape_with_parsera.py --url "https://news.ycombinator.com/" --elements Title="News title" Points="Number of points" Comments="Number of comments"
If an output file is specified with --output_file, the results will be saved as a JSON file. Otherwise, the results will be printed to the console.
[{"Title": "Hacking the largest airline and hotel rewards platform (2023)", "Points": "104", "Comments": "24"}, {"Title": "Another interesting article", "Points": "75", "Comments": "13"}]
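The save-or-print behavior described above might look roughly like the sketch below. The function name emit_results and the choice of indented, UTF-8 encoded JSON are assumptions; the script's actual formatting may differ.

```python
import json
from typing import Any, Optional


def emit_results(results: Any, output_file: Optional[str] = None) -> str:
    """Write results to output_file as JSON, or print them (illustrative sketch)."""
    # ensure_ascii=False keeps non-ASCII titles readable in the output.
    text = json.dumps(results, indent=2, ensure_ascii=False)
    if output_file:
        with open(output_file, "w", encoding="utf-8") as handle:
            handle.write(text)
    else:
        print(text)
    return text
```

Returning the serialized text as well is a convenience of this sketch; it lets the caller log or inspect exactly what was written.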
- The script provides detailed logging to help diagnose issues during scraping. Logs are printed to the console.
- The script handles common errors such as invalid input, network issues, and JSON parsing errors.
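As a rough illustration of console logging of this kind, the sketch below uses only the standard logging module. The logger name, level, and message format are assumptions chosen for the example, not values taken from the script.

```python
import logging


def configure_logging(level: int = logging.INFO) -> logging.Logger:
    """Return a logger that writes timestamped messages to the console (sketch)."""
    # Hypothetical logger name, chosen to match the script's file name.
    logger = logging.getLogger("ai_scrape_with_parsera")
    logger.setLevel(level)
    # Attach the handler only once so repeated calls do not duplicate output.
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```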
To avoid exposing your OpenAI API key in the command line, you can set it as an environment variable:
export OPENAI_API_KEY="your_openai_api_key"
This will allow the script to run without requiring the --openai_api_key argument.
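One way to implement this fallback is sketched below. The name resolve_api_key is hypothetical, and the precedence (a command-line value wins over the environment variable) is an assumption, since the README does not state which takes priority when both are set.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    cli_value: Optional[str], env: Mapping[str, str] = os.environ
) -> str:
    """Pick the API key from the CLI value or OPENAI_API_KEY (illustrative sketch)."""
    # Assumption: an explicit --openai_api_key overrides the environment.
    key = cli_value or env.get("OPENAI_API_KEY")
    if not key:
        raise ValueError(
            "No OpenAI API key found: pass --openai_api_key or set OPENAI_API_KEY"
        )
    return key
```

Accepting the environment as a parameter keeps the function testable with a plain dictionary.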
- Invalid Key-Value Pair: Ensure that each element is provided as a valid key-value pair in the format key="value".
- JSON Parsing Errors: If using a JSON string, ensure it is properly formatted.
- Network Issues: The script includes a retry mechanism to handle transient network errors. If issues persist, check your network connection.
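The README does not describe how the retry mechanism works. A common pattern for transient network errors is retrying with exponential backoff, sketched below. The helper name with_retries, the three-attempt default, the doubling delay, and the choice of ConnectionError and TimeoutError as retryable errors are all assumptions for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    func: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on transient network errors (illustrative sketch)."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except (ConnectionError, TimeoutError):
            # Re-raise once the final attempt has failed.
            if attempt == attempts:
                raise
            # Exponential backoff: base_delay, then 2x, then 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("unreachable")  # keeps type checkers satisfied
```

The sleep parameter exists so the delay can be replaced in tests; in normal use it can be left at its default.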
Add the following to your requirements.txt:
parsera
This script offers a flexible and powerful way to scrape websites using Parsera and OpenAI. It’s designed to be user-friendly and robust, with clear error handling and logging to assist in a wide range of scenarios.