@matt-peters
Created February 10, 2017 19:11
Walking the parse tree of extracted blocks in Dragnet
import requests
from dragnet.models import content_extractor
u = 'https://github.com/seomoz/dragnet'
html = requests.get(u).content
blocks = content_extractor.analyze(html, blocks=True)
block_text = [block.text for block in blocks]
# block.features is a dict with interesting things extracted from the block
start_elements = [block.features['block_start_element'] for block in blocks]
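# the element where the first block extracted by dragnet starts
element = start_elements[0]
# all descendant elements of that element; the leading './' keeps the
# query relative to `element` ('//p' would search the whole document)
element.xpath('./descendant::*')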
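The descendant query above returns a flat list. When the nesting matters, a small recursive generator can walk the tree and report each element's depth. This is only a sketch, not part of dragnet: `walk` is a hypothetical helper, and the sample tree is built with the standard library's `xml.etree.ElementTree` as a stand-in for a block's start element. It assumes the start element supports the same `tag`, `text`, and child-iteration interface.

```python
import xml.etree.ElementTree as ET


def walk(element, depth=0):
    """Yield (depth, tag, text) for element and each descendant, in document order."""
    # element.text is None when the element has no leading text
    text = (element.text or '').strip()
    yield depth, element.tag, text
    for child in element:
        yield from walk(child, depth + 1)


# Stand-in for a block's start element (assumption: same Element interface)
sample = ET.fromstring(
    '<div><p>Dragnet <b>extracts</b> content</p><p>Second</p></div>'
)

rows = list(walk(sample))
for depth, tag, text in rows:
    # indent each tag by its depth in the tree
    print('  ' * depth + tag, repr(text))
```

Only the text that precedes an element's first child is reported here; text that follows a child element lives in that child's `tail` attribute and would need separate handling.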