Last active March 16, 2018 08:44
get_text_from_html using beautifulsoup; skip comments in style tags from MS Word
from bs4 import BeautifulSoup


def get_text_from_html(html_str):
    r"""
    Given a string of HTML, return the text content,
    removing HTML tags and style artifacts.

    This function works around an issue when pasting from Word:
    <style> tags can contain HTML comments that Beautiful Soup 4
    doesn't skip over when calling get_text().

    It also collapses each run of adjacent whitespace into one space;
    \r\n[space][tab][space][space] would become [space].

    :param html_str: string of HTML
    :return: text string, with whitespace runs collapsed to one space
    """
    # Name the parser explicitly so results don't depend on which
    # parsers happen to be installed (and to avoid the bs4 warning).
    soup = BeautifulSoup(html_str, "html.parser")
    # Drop every <style> element, including any comments inside it.
    for style in soup.find_all("style"):
        style.decompose()
    text = soup.get_text()
    # split() with no arguments splits on any whitespace run, so an
    # empty document naturally yields "".
    return " ".join(text.split())
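
The whitespace handling in the last step of get_text_from_html can be tried on its own, without Beautiful Soup. The sketch below is only an illustration: collapse_whitespace is a hypothetical helper name that is not part of the gist, and the sample strings are made up.

```python
def collapse_whitespace(text):
    """Collapse every run of whitespace into one space and trim the ends."""
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, \r, \n, non-breaking spaces) and drops empty strings,
    # so joining the pieces with " " normalizes the text.
    return " ".join(text.split())


# Hypothetical sample resembling text pulled out of pasted HTML.
sample = "Hello,\r\n \t  world  "
print(repr(collapse_whitespace(sample)))
print(repr(collapse_whitespace("")))
```

Because split() on an empty or whitespace-only string returns an empty list, the join yields an empty string, so no separate empty-text check is needed for this step.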