This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| main_url = "http://books.toscrape.com/index.html" | |
| import requests | |
| result = requests.get(main_url) | |
| result.text[:1000] |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| from bs4 import BeautifulSoup | |
| soup = BeautifulSoup(result.text, 'html.parser') | |
| print(soup.prettify()[:1000]) |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| def getAndParseURL(url): | |
| result = requests.get(url) | |
| soup = BeautifulSoup(result.text, 'html.parser') | |
| return(soup) |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| soup.find("article", class_ = "product_pod") |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| soup.find("article", class_ = "product_pod").div.a |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| soup.find("article", class_ = "product_pod").div.a.get('href') |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| main_page_products_urls = [x.div.a.get('href') for x in soup.findAll("article", class_ = "product_pod")] | |
| print(str(len(main_page_products_urls)) + " fetched products URLs") | |
| print("One example:") | |
| main_page_products_urls[0] |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| def getBooksURLs(url): | |
| soup = getAndParseURL(url) | |
| # remove the index.html part of the base url before returning the results | |
| return(["/".join(url.split("/")[:-1]) + "/" + x.div.a.get('href') for x in soup.findAll("article", class_ = "product_pod")]) |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| import re | |
| categories_urls = [main_url + x.get('href') for x in soup.find_all("a", href=re.compile("catalogue/category/books"))] | |
| categories_urls = categories_urls[1:] # we remove the first one because it corresponds to all the books | |
| print(str(len(categories_urls)) + " fetched categories URLs") | |
| print("Some examples:") | |
| categories_urls[:5] |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| # store all the results into a list | |
| pages_urls = [main_url] | |
| soup = getAndParseURL(pages_urls[0]) | |
| # while we get two matches, this means that the webpage contains a 'previous' and a 'next' button | |
| # if there is only one button, this means that we are either on the first page or on the last page | |
| # we stop when we get to the last page | |
| while len(soup.findAll("a", href=re.compile("page"))) == 2 or len(pages_urls) == 1: |
OlderNewer