Last active
October 25, 2018 01:08
Python beautifulsoup example
import requests, bs4
import time
# for n in range(0, 5937):
#     url = 'https://www.fiercebiotech.com/biotech?page=0%2C' + str(n)
#     data = requests.get(url)
# use this to prevent ddos
#     time.sleep(10)
# example loading google.com
url = "http://google.com/"
# "data" contains the raw html from the website
data = requests.get(url)
data.raise_for_status()
#print(data.text)
# "soupdata" contains the processed html
soupdata = bs4.BeautifulSoup(data.text, features="html.parser")
# this pulls all <a> tags into a list
links = soupdata.select('a')
# loops through all links
for link in links:
    # loop through each link and print its url and text
    print(link.get('href'), "\t", link.string)
    print("-"*20)
In the "for n in range(0, 5937):" loop, the "url" variable does not hold a list of all the strings generated. It only holds one string at a time, when I expected it to list thousands. Is there a way to assign all of the output to one variable as a list?
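One possible way to do this, sketched here rather than taken from the gist author: each pass through the loop overwrites "url", so only the most recent string survives. Building a list keeps every value. The names BASE_URL, build_page_urls, and urls are hypothetical and chosen for illustration; the address and the page count of 5937 come from the commented-out loop in the gist. The sketch only builds the strings and makes no network requests.

```python
# Base address copied from the commented-out loop in the gist.
BASE_URL = "https://www.fiercebiotech.com/biotech?page=0%2C"


def build_page_urls(page_count):
    """Return one URL per page number, from 0 up to page_count - 1."""
    # A list comprehension collects every generated string instead of
    # overwriting a single variable on each pass through the loop.
    return [BASE_URL + str(n) for n in range(page_count)]


# "urls" is a list holding all 5937 strings (hypothetical variable name).
urls = build_page_urls(5937)

print(len(urls))   # 5937
print(urls[0])     # https://www.fiercebiotech.com/biotech?page=0%2C0
print(urls[-1])    # https://www.fiercebiotech.com/biotech?page=0%2C5936
```

The same idea should carry over to the downloaded pages: start with an empty list, loop over "urls", call requests.get on each one, and append data.text to that list, keeping the time.sleep(10) pause between requests.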