@titipata
Created March 12, 2016 02:07
Parse PubMed XML using Python (NLTK class)

You need the lxml library to parse the XML file (install it with pip install lxml, or sudo pip install lxml for a system-wide install). The script also uses pandas to write the CSV file.

import pandas as pd
from lxml import etree
from lxml.etree import tostring
from itertools import chain

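def stringify_children(node):
    """
    Return all text inside node, including text nested in its child elements,
    skipping empty or missing (None) texts and tails
    ref: http://stackoverflow.com/questions/4624062/get-all-text-inside-a-tag-in-lxml
    """
    # itertext() also collects text nested deeper than direct children (e.g. <i>, <sup>);
    # node.tail is left out because it is text after node's closing tag, not inside it
    parts = ([node.text] +
             list(chain(*([''.join(c.itertext()), c.tail] for c in node))))
    return ''.join(filter(None, parts))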


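tree = etree.parse('pubmed_result-2.xml') # read xml file
# walk each article so a title is always paired with its own abstract (some records have no abstract)
abstracts_idx = []
for i, article in enumerate(tree.xpath('//PubmedArticle')):
    title = article.find('.//Title') # in PubMed XML this is the journal title; use './/ArticleTitle' for the article's own title
    abstract = article.find('.//Abstract')
    abstracts_idx.append((i,
                          title.text if title is not None else '',
                          stringify_children(abstract) if abstract is not None else '')) # tuples of text
df = pd.DataFrame(abstracts_idx, columns=['index', 'title', 'abstract']) # transform to dataframe
df.to_csv('abstracts.csv', index=False, header=True, encoding='utf-8') # save to csv file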
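
To check how titles and abstracts pair up without downloading anything from PubMed, here is a minimal sketch that runs the same kind of extraction on a small in-memory sample. Several things here are assumptions and not part of the script above: the sample XML is invented for illustration, the helper name extract_records is hypothetical, the sketch reads ArticleTitle (the article's own title), and it uses the standard library's xml.etree.ElementTree in place of lxml so it runs with no extra install (the iter, find and itertext calls used here also exist on lxml elements).

```python
import xml.etree.ElementTree as ET

# Invented sample: the second article deliberately has no <Abstract>.
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <Article>
        <Journal><Title>Journal A</Title></Journal>
        <ArticleTitle>First paper</ArticleTitle>
        <Abstract>
          <AbstractText>Uses <i>lxml</i> to parse.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <Article>
        <Journal><Title>Journal B</Title></Journal>
        <ArticleTitle>Second paper</ArticleTitle>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def all_text(element):
    """Join every piece of text inside element and collapse runs of whitespace."""
    if element is None:
        return ''
    return ' '.join(''.join(element.itertext()).split())


def extract_records(root):
    """Return one (index, title, abstract) tuple per PubmedArticle element."""
    records = []
    for i, article in enumerate(root.iter('PubmedArticle')):
        # Look up both fields inside the same article so they cannot drift apart.
        title = article.find('.//ArticleTitle')
        abstract = article.find('.//Abstract')
        records.append((i, all_text(title), all_text(abstract)))
    return records


records = extract_records(ET.fromstring(SAMPLE_XML))
for record in records:
    print(record)
```

Because both fields are looked up inside each PubmedArticle, an article with no abstract gets an empty string instead of shifting every later abstract up by one row. The resulting list of tuples can be passed to pd.DataFrame with the same three column names as in the script.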