Created May 31, 2018 10:49
Separate xml files removing tags (CRUDE)
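import xml.etree.ElementTree as ET

# Parse the input document and grab its root element.
tree = ET.parse('xmlsdev.xml')
root = tree.getroot()

with open('corrected', 'w', encoding='utf-8') as file:
    i = 0
    for sentence in root.iter('sentence'):
        # Empty each direct <del> child, keeping the text that follows it.
        for child in sentence.findall('del'):
            tail = child.tail  # clear() also resets tail, so save it first
            child.clear()
            child.tail = tail
        # Write the remaining text of the sentence, one sentence per line.
        file.write("".join(sentence.itertext()))
        file.write('\n')
        # Progress indicator every 100 sentences.
        if i % 100 == 0:
            print(i)
        i += 1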
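
The tail handling in the script can be tried on a small in-memory input. The following is a minimal sketch, not part of the original gist: the `SAMPLE` string and the `strip_deleted` helper are hypothetical names introduced for illustration, and the real layout of `xmlsdev.xml` is assumed to resemble this (`sentence` elements with direct `del` children).

```python
import xml.etree.ElementTree as ET

# Hypothetical sample input; the real xmlsdev.xml is not shown in the gist.
SAMPLE = "<doc><sentence>I<del> has</del> have a cat.</sentence></doc>"


def strip_deleted(sentence):
    """Return the sentence text with the contents of direct <del> children removed."""
    for child in sentence.findall('del'):
        tail = child.tail  # text after </del> is stored on the <del> element
        child.clear()      # resets text, attributes, children and tail
        child.tail = tail  # restore the text that followed the tag
    return "".join(sentence.itertext())


root = ET.fromstring(SAMPLE)
lines = [strip_deleted(s) for s in root.iter('sentence')]
print(lines)  # ['I have a cat.']
```

Without restoring `tail`, the text " have a cat." would be lost along with the deleted span, because ElementTree stores the text that follows a closing tag on the element itself.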