@7stud
Last active August 29, 2015 14:14
data.txt:
------------------------------
<h1>1</h1> <!-- Level One -->
<h4>1.1</h4> <!-- Level Two -->
<div class='x'>1.1.1</div> <!-- Level Three -->
<div class='x'>1.1.2</div>
<div class='x'>1.1.3</div>
<div class='x'>1.1.4</div>
<h4>1.2</h4>
<div class='x'>1.2.1</div>
<div class='x'>1.2.2</div>
<div class='x'>1.2.3</div>
<div class='x'>1.2.4</div>
<h4>1.3</h4>
<div class='x'>1.3.1</div>
<div class='x'>1.3.2</div>
<div class='x'>1.3.3</div>
<div class='x'>1.3.4</div>
<h4>1.4</h4>
<div class='x'>1.4.1</div>
<div class='x'>1.4.2</div>
<div class='x'>1.4.3</div>
<div class='x'>1.4.4</div>
<h1>2</h1>
<h4>2.1</h4>
<div class='x'>2.1.1</div>
<div class='x'>2.1.2</div>
<div class='x'>2.1.3</div>
<div class='x'>2.1.4</div>
<h4>2.2</h4>
<div class='x'>2.2.1</div>
<div class='x'>2.2.2</div>
<div class='x'>2.2.3</div>
<div class='x'>2.2.4</div>
<h4>2.3</h4>
<div class='x'>2.3.1</div>
<div class='x'>2.3.2</div>
<div class='x'>2.3.3</div>
<div class='x'>2.3.4</div>
<h4>2.4</h4>
<div class='x'>2.4.1</div>
<div class='x'>2.4.2</div>
<div class='x'>2.4.3</div>
<div class='x'>2.4.4</div>
----------------------
from bs4 import BeautifulSoup

with open('data.txt') as f:
    html = f.read()

# Name the parser explicitly; without it bs4 picks one and emits a warning.
soup = BeautifulSoup(html, 'html.parser')

elmt = soup.find('h1')
h1_text = elmt.text
h4_text = None  # stays None until the first h4 is seen

while True:
    elmt = elmt.next_sibling
    if elmt is None: break  # no more siblings

    # Tags have names like 'h1', 'h4', 'div'. Anything without a usable name,
    # e.g. a Comment or the whitespace between tags, yields None here.
    tag_name = getattr(elmt, 'name', None)
    if not tag_name: continue  # skip the comments and whitespace in the html

    if tag_name == 'h1':
        h1_text = elmt.text
    elif tag_name == 'h4':
        h4_text = elmt.text
    elif tag_name == 'div':
        print("{}:{}:{}".format(h1_text, h4_text, elmt.text))
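
A possible variation on the sibling walk above, offered as a sketch rather than as the gist's method: `find_all` with a list of tag names returns the matching tags in document order, so comments and whitespace never reach the loop. The function name `label_rows` and the shortened `SAMPLE` string are made up for illustration, and the sketch assumes the markup is flat as in data.txt (nested divs would also match).

```python
from bs4 import BeautifulSoup

# Shortened stand-in for data.txt (hypothetical sample, same shape).
SAMPLE = """\
<h1>1</h1> <!-- Level One -->
<h4>1.1</h4> <!-- Level Two -->
<div class='x'>1.1.1</div> <!-- Level Three -->
<div class='x'>1.1.2</div>
<h4>1.2</h4>
<div class='x'>1.2.1</div>
<h1>2</h1>
<h4>2.1</h4>
<div class='x'>2.1.1</div>
"""


def label_rows(html):
    """Return one 'h1:h4:div' string per <div>, in document order."""
    soup = BeautifulSoup(html, 'html.parser')
    h1_text = h4_text = None
    rows = []
    # Only h1, h4 and div tags are returned, so there is nothing to skip.
    for tag in soup.find_all(['h1', 'h4', 'div']):
        if tag.name == 'h1':
            h1_text = tag.text
        elif tag.name == 'h4':
            h4_text = tag.text
        else:  # a div: pair it with the most recent headings
            rows.append("{}:{}:{}".format(h1_text, h4_text, tag.text))
    return rows


rows = label_rows(SAMPLE)
for row in rows:
    print(row)
```

Returning a list instead of printing inside the loop makes the result easier to reuse or check; the printed lines have the same `h1:h4:div` shape as the loop above.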