Created September 5, 2022 04:28
Gist: victormurcia/5121186f1c59f10bca2c3c0ed816f1d0
decode downloaded html and extract all <a href=""> links
# decode downloaded html and extract all <a href=""> links
from bs4 import BeautifulSoup

def get_urls_from_html(content):
    # decode the provided content (bytes) as UTF-8 text
    html = content.decode('utf-8')
    # parse the document as best we can
    soup = BeautifulSoup(html, 'html.parser')
    # find all of the <a> tags in the document
    atags = soup.find_all('a')
    # get the href of each <a> tag (None for tags without an href attribute)
    return [tag.get('href') for tag in atags]
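The list returned by `get_urls_from_html` can contain `None` (for `<a>` tags with no `href` attribute) as well as relative paths and in-page fragments. A minimal sketch of one way to tidy that list is shown here. It assumes the page's own URL is known. The helper name `clean_links`, the `base_url` parameter, and the sample values are hypothetical and not part of the original gist.

```python
from urllib.parse import urljoin

def clean_links(hrefs, base_url):
    """Hypothetical helper: drop missing hrefs and resolve the rest against base_url."""
    cleaned = []
    for href in hrefs:
        # tag.get('href') yields None when an <a> tag has no href attribute
        if not href:
            continue
        # skip in-page fragments, which do not point to another document
        if href.startswith('#'):
            continue
        # resolve relative paths against the page URL; absolute URLs pass through
        cleaned.append(urljoin(base_url, href))
    return cleaned

# Made-up input shaped like the output of get_urls_from_html
sample = ['/about', None, 'https://example.org/x', '#top', 'page2.html']
links = clean_links(sample, 'https://example.com/docs/index.html')
print(links)
```

Whether to drop fragment links or keep them depends on what the links are used for. For crawling they are usually noise, so this sketch skips them.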