@cyberpunk042
Last active August 16, 2024 22:44
Snippet to scrape a website for targeted elements. The input can be key-value pair or JSON string. The output can be console or file.

Web Scraping with Parsera

This Python script lets you scrape specific elements from a given website using the Parsera library, powered by OpenAI's models.

Requirements

  • Python 3.7+
  • An OpenAI API Key
  • Parsera library

Installation

Install Parsera and Required Libraries:

pip install parsera
playwright install

Usage

The script can be used directly from the command line. You can pass elements to extract either as key-value pairs or as a JSON string.

Command-Line Arguments

  • --url: The URL of the website to scrape (required).
  • --elements: The elements to extract, provided as key-value pairs or a JSON string (required).
  • --openai_api_key: The OpenAI API Key (optional if set as an environment variable).
  • --output_file: Optional file path to save the results.

Example Usage

Using Key-Value Pairs:

python ai_scrape_with_parsera.py --url "https://news.ycombinator.com/" --elements Title="News title" Points="Number of points" Comments="Number of comments" --openai_api_key "your_openai_api_key"
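Internally, the script converts these pairs into a dictionary before handing them to Parsera. A minimal sketch of that conversion (the quotes around each value are stripped by the shell before the script sees them):

```python
# Key=value pairs as received from the command line
pairs = ["Title=News title", "Points=Number of points", "Comments=Number of comments"]

# Split each pair on the first "=" only, so values may themselves contain "="
elements = dict(pair.split("=", 1) for pair in pairs)
print(elements)
# → {'Title': 'News title', 'Points': 'Number of points', 'Comments': 'Number of comments'}
```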

Using a JSON String:

python ai_scrape_with_parsera.py --url "https://news.ycombinator.com/" --elements '{"Title": "News title", "Points": "Number of points", "Comments": "Number of comments"}' --openai_api_key "your_openai_api_key"

Saving Results to a File:

python ai_scrape_with_parsera.py --url "https://news.ycombinator.com/" --elements Title="News title" Points="Number of points" Comments="Number of comments" --output_file "scraped_results.json"

Using Environment Variable for API Key:

export OPENAI_API_KEY="your_openai_api_key"
python ai_scrape_with_parsera.py --url "https://news.ycombinator.com/" --elements Title="News title" Points="Number of points" Comments="Number of comments"

Output

If an output file is specified with --output_file, the results will be saved as a JSON file. Otherwise, the results will be printed to the console.
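The save-or-print decision is a small branch on `--output_file`; a minimal sketch mirroring the `save_results` function in the script below, using a temporary file for illustration:

```python
import json
import os
import tempfile

def save_results(result, output_file=None):
    """Write results as pretty-printed JSON to output_file, or print to stdout."""
    if output_file:
        with open(output_file, "w") as f:
            json.dump(result, f, indent=4)
    else:
        print(json.dumps(result, indent=4))

# Example: save to a temporary file, then read it back
path = os.path.join(tempfile.gettempdir(), "scraped_results.json")
save_results([{"Title": "Example", "Points": "1"}], path)
with open(path) as f:
    print(f.read())
```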

Sample Output (Console)

{"Title":"Hackingthelargestairlineandhotelrewardsplatform(2023)","Points":"104","Comments":"24"},{"Title":"Anotherinterestingarticle","Points":"75","Comments":"13"}

Error Handling and Logging

  • The script provides detailed logging to help diagnose issues during scraping. Logs are printed to the console.
  • The script handles common errors such as invalid input, network issues, and JSON parsing errors.

Environment Variable

To avoid exposing your OpenAI API key in the command line, you can set it as an environment variable:

export OPENAI_API_KEY="your_openai_api_key"

This will allow the script to run without requiring the --openai_api_key argument.
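The precedence is simple: an explicit `--openai_api_key` wins, otherwise the environment variable is used. A minimal sketch of that lookup, matching the fallback logic at the bottom of the script:

```python
import os

def resolve_api_key(cli_key=None):
    """Prefer the command-line key; fall back to the environment variable."""
    return cli_key or os.getenv("OPENAI_API_KEY")

os.environ["OPENAI_API_KEY"] = "env_key"
print(resolve_api_key())           # → env_key (falls back to the environment)
print(resolve_api_key("cli_key"))  # → cli_key (explicit key wins)
```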

Troubleshooting

  • Invalid Key-Value Pair: Ensure that each element is provided as a valid key-value pair in the format key="value".
  • JSON Parsing Errors: If using a JSON string, ensure it is properly formatted.
  • Network Issues: The script includes a retry mechanism to handle transient network errors. If issues persist, check your network connection.
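The retry mechanism mirrors the `scrape_website` function in the script below: each network failure is retried a fixed number of times with a delay between attempts. A minimal sketch with a stubbed-out fetch standing in for the real scrape call:

```python
import time

def fetch_with_retries(fetch, retries=3, delay=0.01):
    """Call fetch(); on failure, retry up to `retries` times with a delay."""
    last_error = None
    for attempt in range(retries):
        try:
            return fetch()
        except Exception as e:
            last_error = e
            time.sleep(delay)
    raise Exception(f"Failed after {retries} retries: {last_error}")

# Stub that fails twice with a transient error, then succeeds
calls = {"n": 0}
def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network error")
    return {"Title": "News title"}

print(fetch_with_retries(flaky_fetch))  # succeeds on the third attempt
```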

Requirements File

Add the following to your requirements.txt:

parsera
requests

Conclusion

This script offers a flexible and powerful way to scrape websites using Parsera and OpenAI. It’s designed to be user-friendly and robust, with clear error handling and logging to assist in a wide range of scenarios.

import argparse
import json
import logging
import os
import time

from parsera import Parsera
from requests.exceptions import RequestException

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def initialize_parsera():
    """
    Initialize the Parsera scraper with the default OpenAI model.

    Returns:
        Parsera: The initialized Parsera instance.

    Raises:
        Exception: If Parsera initialization fails.
    """
    try:
        scraper = Parsera()
        logger.info("Parsera initialized successfully.")
        return scraper
    except Exception as e:
        logger.error(f"Failed to initialize Parsera: {e}")
        raise


def parse_elements(elements_input):
    """
    Parse the elements input, which can be a JSON string or a list of
    key=value pairs.

    Args:
        elements_input (str or list): JSON string or list of key=value pairs.

    Returns:
        dict: Parsed elements as a dictionary.

    Raises:
        ValueError: If the input is not valid JSON or key=value pairs.
    """
    # With nargs="+", a JSON string arrives as a single-item list
    if isinstance(elements_input, list) and len(elements_input) == 1:
        elements_input = elements_input[0]
    if isinstance(elements_input, str):
        try:
            elements = json.loads(elements_input)
            logger.info("Parsed elements from JSON string.")
        except json.JSONDecodeError:
            # Fall back to key=value parsing if JSON parsing fails
            elements = dict(pair.split("=", 1) for pair in elements_input.split(" "))
            logger.info("Parsed elements from key-value string.")
    else:
        try:
            elements = dict(pair.split("=", 1) for pair in elements_input)
        except ValueError:
            raise ValueError('Each element must be a key=value pair, e.g. Title="News title".')
        logger.info("Parsed elements from key-value pairs.")
    return elements


def scrape_website(scraper, url, elements, retries=3, delay=5):
    """
    Scrape the specified elements from the given URL.

    Args:
        scraper (Parsera): The initialized Parsera instance.
        url (str): The URL to scrape.
        elements (dict): The elements to extract.
        retries (int): Number of retries in case of failure.
        delay (int): Delay in seconds between retries.

    Returns:
        dict: The scraped results.

    Raises:
        Exception: If scraping fails after all retries.
    """
    for attempt in range(retries):
        try:
            result = scraper.run(url=url, elements=elements)
            logger.info(f"Scraping completed successfully for {url}.")
            return result
        except RequestException as e:
            logger.warning(f"Network-related error: {e}. Retrying {attempt + 1}/{retries}...")
            time.sleep(delay)
        except Exception as e:
            logger.error(f"Failed to scrape website on attempt {attempt + 1}: {e}")
            if attempt == retries - 1:
                raise
    raise Exception("Failed to scrape website after multiple retries.")


def save_results(result, output_file=None):
    """
    Save the scraping results to a file or print them to the console.

    Args:
        result (dict): The scraping result to save or print.
        output_file (str): The file path to save the results. If None,
            results are printed.

    Raises:
        IOError: If saving the file fails.
    """
    if output_file:
        try:
            with open(output_file, "w") as file:
                json.dump(result, file, indent=4)
            logger.info(f"Results saved to file: {output_file}")
        except IOError as e:
            logger.error(f"Failed to save results to file: {e}")
            raise
    else:
        print(json.dumps(result, indent=4))


def main(url, elements_input, api_key, output_file=None):
    """
    Orchestrate the web scraping.

    Args:
        url (str): The URL to scrape.
        elements_input (str or list): The elements to extract, as key=value
            pairs or a JSON string.
        api_key (str): OpenAI API key for the default model.
        output_file (str): Optional file path to save the results.

    Raises:
        ValueError: If the API key is not provided.
    """
    if not api_key:
        raise ValueError(
            "OpenAI API Key is required. Please provide it via --openai_api_key "
            "or set the OPENAI_API_KEY environment variable."
        )
    os.environ["OPENAI_API_KEY"] = api_key
    scraper = initialize_parsera()
    elements = parse_elements(elements_input)
    result = scrape_website(scraper, url, elements)
    save_results(result, output_file)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Scrape a website using Parsera with OpenAI.")
    parser.add_argument("--url", required=True, help="The URL of the website to scrape.")
    # nargs="+" accepts multiple key=value pairs or a single JSON string
    parser.add_argument(
        "--elements",
        required=True,
        nargs="+",
        help="Elements to extract as key=value pairs or a single JSON string.",
    )
    parser.add_argument(
        "--openai_api_key",
        help="OpenAI API Key. Can also be set via the OPENAI_API_KEY environment variable.",
    )
    parser.add_argument("--output_file", help="Optional file path to save the results.")
    args = parser.parse_args()

    # Retrieve the OpenAI API key from arguments or the environment
    api_key = args.openai_api_key or os.getenv("OPENAI_API_KEY")

    main(
        url=args.url,
        elements_input=args.elements,
        api_key=api_key,
        output_file=args.output_file,
    )