Last active
April 19, 2018 04:16
Extract themes from the GDELT SET_EVENTPATTERNS.xml file
"""
The patterns file is here: https://github.com/ahalterman/GKG-Themes/blob/master/SET_EVENTPATTERNS.xml
It is not valid XML, so this script parses it with a regex instead of an XML parser.
There are non-theme entries in this file that are not considered here. The globals section at the top of the
file should be taken into account when processing documents with pattern matches.
"""
import re

# Match each THEME category and capture its name and the raw contents of its TERMS block.
p = re.compile(r'^<CATEGORY NAME="([^"]+)" TYPE="THEME">\s*<TERMS>([^<]+)</TERMS>', re.M|re.S)
themes = {}
with open('SET_EVENTPATTERNS.xml') as f:
    for theme, terms in p.findall(f.read()):
        # Keep only lines that split into exactly two tab-separated fields.
        terms = [tuple(t.split('\t')) for t in terms.split('\n') if t and len(t.split('\t')) == 2]
        if terms:
            themes[theme] = terms
print(themes)
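
To see the shape of the result without downloading the patterns file, here is a minimal sketch that runs the same regex over an in-memory string. The `parse_themes` helper, the `SAMPLE` text, and the theme names in it are made up for illustration. They are not taken from the real file, and the sketch makes no claim about what the two tab-separated fields mean.

```python
import re

# Same pattern as the script above: a THEME category followed by its TERMS block.
THEME_PATTERN = re.compile(
    r'^<CATEGORY NAME="([^"]+)" TYPE="THEME">\s*<TERMS>([^<]+)</TERMS>',
    re.M | re.S,
)


def parse_themes(text):
    """Return {theme name: [(field1, field2), ...]} for each THEME category in text.

    Hypothetical helper: it wraps the script's loop so it can run on any string.
    """
    themes = {}
    for theme, terms in THEME_PATTERN.findall(text):
        pairs = []
        for line in terms.split('\n'):
            fields = line.split('\t')
            # Keep only lines with exactly two tab-separated fields.
            if len(fields) == 2:
                pairs.append(tuple(fields))
        if pairs:
            themes[theme] = pairs
    return themes


# Hypothetical sample, invented for illustration; not copied from the real file.
SAMPLE = (
    '<CATEGORY NAME="EXAMPLE_THEME" TYPE="THEME">\n'
    '<TERMS>\n'
    'first term\t1\n'
    'second term\t2\n'
    'malformed line without a tab\n'
    '</TERMS>\n'
    '<CATEGORY NAME="EXAMPLE_OTHER" TYPE="OTHER">\n'
    '<TERMS>\n'
    'ignored term\t1\n'
    '</TERMS>\n'
)

themes = parse_themes(SAMPLE)
print(themes)
```

Categories whose type is not `THEME` are skipped by the pattern itself, and lines without exactly one tab are dropped, so only well-formed theme entries reach the dictionary.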