lxml_homework_yu.md

Parse PubMed XML using Python (NLTK class)

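You need the lxml library to parse the XML file (use pip install lxml or sudo pip install lxml). The script also uses pandas to write the CSV file, so install it the same way if it is missing.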

import pandas as pd
from lxml import etree
from itertools import chain

def stringify_children(node):
    """
    Join the text of a node with the text and tails of its direct children,
    skipping any None values. Text nested deeper than one level is not included.
    ref: http://stackoverflow.com/questions/4624062/get-all-text-inside-a-tag-in-lxml
    """
    parts = ([node.text] +
             list(chain(*([c.text, c.tail] for c in node))) +  # iterating a node yields its children (getchildren() is deprecated)
             [node.tail])
    return ''.join(filter(None, parts))


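tree = etree.parse('pubmed_result-2.xml')  # read xml file
rows = []
# Look up title and abstract inside each article, so that an article without
# an abstract cannot shift the pairing for the articles that follow it.
for i, article in enumerate(tree.xpath('//PubmedArticle')):
    title = article.find('.//Title')  # in PubMed XML, <Title> is the journal title; use './/ArticleTitle' for the article title
    abstract = article.find('.//Abstract')
    rows.append((i,
                 title.text if title is not None else '',
                 stringify_children(abstract) if abstract is not None else ''))  # tuples of text
df = pd.DataFrame(rows, columns=['index', 'title', 'abstract'])  # transform to dataframe
df.to_csv('abstracts.csv', index=False, header=True, encoding='utf-8')  # save to csv file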
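
Two things are easy to get wrong when pulling text out of PubMed XML: pairing flat lists of titles and abstracts with zip (one article without an abstract shifts every later pair), and reading only the text of direct children (text inside nested inline tags is missed). The sketch below illustrates both on a small hand-made document, not real PubMed data. The helper name `all_text` and the sample XML are assumptions made for illustration, and the fallback to the standard library parser is only there so the snippet runs without lxml installed.

```python
try:
    from lxml import etree  # the library used in this document
except ImportError:
    # Assumption: the standard library parser is enough for this small sketch
    import xml.etree.ElementTree as etree


def all_text(node):
    """Join every text fragment under `node`, in document order (hypothetical helper)."""
    return ''.join(node.itertext())


# Hand-made sample: the first article has no <Abstract>,
# the second has inline tags nested inside <AbstractText>.
sample = etree.fromstring(
    '<PubmedArticleSet>'
    '<PubmedArticle><Title>Journal A</Title></PubmedArticle>'
    '<PubmedArticle><Title>Journal B</Title>'
    '<Abstract><AbstractText>Levels of H<sub>2</sub>O were <i>high</i>.'
    '</AbstractText></Abstract></PubmedArticle>'
    '</PubmedArticleSet>'
)

# Flat lists: zip pairs the first title with the first abstract found anywhere,
# so "Journal A" ends up next to the abstract that belongs to "Journal B".
titles = [t.text for t in sample.iter('Title')]
abstracts = [all_text(a) for a in sample.iter('Abstract')]
flat_pairs = list(zip(titles, abstracts))

# Per-article lookup: each title stays with its own abstract (or an empty string).
per_article = []
for article in sample.iter('PubmedArticle'):
    title = article.find('.//Title')
    abstract = article.find('.//Abstract')
    per_article.append((
        title.text if title is not None else '',
        all_text(abstract) if abstract is not None else '',
    ))

print(flat_pairs)
print(per_article)
```

In this sample the flat pairing yields a single, mismatched row, while the per-article loop yields one row per article with an empty abstract where none exists. `itertext()` also recovers the text inside `<sub>` and `<i>`.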

titipata/lxml_homework_yu.md
