Last active
December 15, 2015 20:41
Parse an NLM MeSH Trees file into something usable as a Drupal taxonomy
from __future__ import print_function

# The data comes from the NLM's MeSH Trees file, which can be downloaded here: https://www.nlm.nih.gov/mesh/filelist.html

# Maps each term's index (tree number, e.g. 'A01.101') to its full hierarchical term
tree_array = {}

# 'with' closes both files even if parsing fails partway through
with open('mtrees2013.bin') as tree_file, open('mtrees2013-parsed.txt', 'w') as tree_outfile:
    for row in tree_file:
        # Strip the line ending (also safe when the last line has no trailing newline)
        row = row.rstrip('\r\n')
        if not row:
            continue
        # The term name is whatever precedes the semicolon; its 'index' is whatever follows it
        term_name, _, term_index = row.partition(';')
        # The term's parent (if any) will be whatever has the address one level up the hierarchy,
        # i.e. A01.101.202's parent would be A01.101. Top-level indexes such as 'A01' contain
        # no '.', so rpartition yields an empty parent index for them.
        parent_index = term_index.rpartition('.')[0]
        # If a parent term has already been parsed, put that at the front of the 'term' to provide a hierarchy
        if parent_index in tree_array:
            term = ','.join([tree_array[parent_index], term_name])
        # Otherwise, just take whatever precedes the semicolon as the 'term'
        else:
            term = term_name
        # Add this term to our array (for future parent searches)
        tree_array[term_index] = term
        # Finally, write this term to our text file
        print(term, file=tree_outfile)
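
To see what the parent lookup produces without downloading the MeSH file, here is a minimal sketch of the same logic applied to a few in-memory rows. The function name `build_term_paths` and the sample rows are illustrative assumptions, not part of the original script. The rows only mimic the `Term;tree.number` layout and, like the script, assume a parent row appears before its children.

```python
def build_term_paths(rows):
    """Return one comma-joined hierarchy path per 'Term;tree.number' row.

    Hypothetical helper for illustration; it mirrors the script's parent
    lookup but works on any iterable of strings instead of a file.
    """
    paths_by_index = {}  # tree number -> full path built so far
    paths = []
    for row in rows:
        row = row.rstrip('\r\n')
        if not row:
            continue
        name, _, index = row.partition(';')
        # 'A01.047.025' -> 'A01.047'; a top-level index like 'A01' -> ''
        parent_index = index.rpartition('.')[0]
        if parent_index in paths_by_index:
            path = ','.join([paths_by_index[parent_index], name])
        else:
            path = name
        paths_by_index[index] = path
        paths.append(path)
    return paths


# Illustrative rows in the same layout as the trees file (assumed sample data)
sample_rows = [
    'Body Regions;A01\n',
    'Abdomen;A01.047\n',
    'Abdominal Cavity;A01.047.025\n',
]

for path in build_term_paths(sample_rows):
    print(path)
# Body Regions
# Body Regions,Abdomen
# Body Regions,Abdomen,Abdominal Cavity
```

A row whose parent has not been seen yet is written as a bare term, so the input order matters.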