@PyYoshi
Created December 8, 2011 08:52
Remove tags from HTML data
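from lxml.html import fromstring


def strip_tags(html):
    """
    Extract the text from HTML data with all tags removed.
    Note: text inside script and style tags is ignored.
    Args:
        html: str, the HTML data to parse
    Returns:
        text: str, the text with tags removed
    """
    et = fromstring(html)
    # Select every text node whose parent element is neither <script> nor <style>.
    xpath = r'//text()[name(..)!="script"][name(..)!="style"]'
    # Drop whitespace-only text nodes and concatenate the rest.
    text = ''.join(node for node in et.xpath(xpath) if node.strip())
    return text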
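For environments where lxml is not installed, roughly the same filtering can be sketched with the standard library's html.parser. This is not the gist's method: the names `_TextExtractor` and `strip_tags_stdlib` are hypothetical, and because html.parser may split text into chunks differently from lxml, whitespace handling can differ slightly on some inputs.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text chunks, skipping anything inside <script> or <style>.

    Hypothetical helper for illustration; not part of the original gist.
    """

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only chunks that contain something other than whitespace.
        if not self._skip_depth and data.strip():
            self.chunks.append(data)


def strip_tags_stdlib(html):
    """Return the text of `html` with tags, scripts, and styles removed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


if __name__ == "__main__":
    sample = (
        "<html><head><style>p{}</style><script>var a=1;</script></head>"
        "<body><p>Hello, <b>world</b>!</p>\n</body></html>"
    )
    print(strip_tags_stdlib(sample))
```

As in the lxml version, chunks are joined with an empty string, so text from adjacent block elements runs together with no separator.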