@sinebeef
Created November 28, 2019 17:39
Python script for extracting a list of indexed URLs from saved site:domain result pages
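from bs4 import BeautifulSoup

# Saved copies of the site:domain result pages to scan.
pages = ['index1.html', 'index2.html', 'index3.html']

for page in pages:
    # The with-block closes the file on exit, so no explicit close() is needed.
    with open(page, "r") as f:
        html = f.read()
    soup = BeautifulSoup(html, 'html.parser')
    # Result links are looked up inside <div class="r"> elements.
    for div in soup.find_all('div', 'r'):
        # href=True skips <a> tags without an href, which would raise KeyError.
        for link in div.find_all('a', href=True):
            href = link['href']
            # Skip in-page anchors and links that point back to Google.
            if '#' not in href and 'google' not in href:
                print(href)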
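
The substring test for 'google' in the script above also discards any result whose path or query happens to contain that word, for example a blog post about Google Analytics. One possible refinement is to check only the host part of each link. The sketch below uses a hypothetical helper, `is_external_result`, which is not part of the original script. It assumes that any host containing "google" belongs to Google, and that only absolute http(s) links are wanted.

```python
from urllib.parse import urlparse


def is_external_result(href):
    """Return True for links that look like indexed result URLs.

    Hypothetical helper for illustration; not part of the original script.
    """
    # Same rule as the script: anything containing '#' is treated as an anchor link.
    if "#" in href:
        return False
    parts = urlparse(href)
    # Keep absolute http(s) links only; relative links such as "/search?q=..." are dropped.
    if parts.scheme not in ("http", "https"):
        return False
    host = (parts.hostname or "").lower()
    # Assumption: a host containing "google" is one of Google's own domains.
    return "google" not in host


if __name__ == "__main__":
    samples = [
        "https://example.com/blog/google-analytics",
        "https://translate.google.com/translate?u=x",
    ]
    for url in samples:
        print(url, is_external_result(url))
```

In the script, this check could stand in for the two substring tests on each href before printing it.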