IPython notebook for the name-PDF scraper
{
"metadata": {
"name": "Scraping a PDF with Scraperwiki's PDFtoXML"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": "While for simple single or double-page tables [tabula](http://jazzido.github.io/tabula/) is a viable option - if you have PDFs with tables over multiple pages you'll soon grow old marking them.\n\nThis is where you'll need some scripting. Thanks to [scraperwikis library](https://pypi.python.org/pypi/scraperwiki) (```pip install scraperwiki```) and the included pdftoxml - scraping PDFs has become a feasible task in python. On a recent Hacks/Hackers event we run into a candidate - that was quite tricky to scrape - I decided to protocol the process here." | |
},
{
"cell_type": "markdown",
"metadata": {},
"source": "First we import the scraperwiki library and urllib2, since the file we're using lives on a webserver.\n"
},
{
"cell_type": "code",
"collapsed": false,
"input": "import scraperwiki, urllib2",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 1
},
{
"cell_type": "code",
"collapsed": false,
"input": "u=urllib2.urlopen(\"http://images.derstandard.at/2013/08/12/VN2p_2012.pdf\") #open the url for the PDF\nx=scraperwiki.pdftoxml(u.read()) # interpret it as xml\nprint x[:1024] # let's see what's in there abbreviated...\n",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE pdf2xml SYSTEM \"pdf2xml.dtd\">\n\n<pdf2xml producer=\"poppler\" version=\"0.22.5\">\n<page number=\"1\" position=\"absolute\" top=\"0\" left=\"0\" height=\"1263\" width=\"892\">\n\t<fontspec id=\"0\" size=\"8\" family=\"Times\" color=\"#000000\"/>\n\t<fontspec id=\"1\" size=\"7\" family=\"Times\" color=\"#000000\"/>\n<text top=\"42\" left=\"64\" width=\"787\" height=\"12\" font=\"0\"><b>TABELLE VN2Ap/1 30/07/13 11.38.44 BLATT 1 </b></text>\n<text top=\"58\" left=\"64\" width=\"718\" height=\"12\" font=\"0\"><b>STATISTIK ALLER VORNAMEN (TEILWEISE PHONETISCH ZUSAMMENGEFASST, ALPHABETISCH SORTIERT) F\u00dcR NEUGEBORENE KNABEN MIT </b></text>\n<text top=\"73\" left=\"64\" width=\"340\" height=\"12\" font=\"0\"><b>\u00d6STERREICHISCHER STAATSB\u00dcRGERSCHAFT 2012 - \u00d6STERREICH </b></text>\n<text top=\"89\" left=\"64\" width=\"6\" height=\"12\" font=\"0\"><b> </b></text>\n<text top=\"104\" left=\"64\" width=\"769\" height=\"12\" font=\"0\"><b>VORNAMEN ABSOLUT % \n" | |
}
],
"prompt_number": 35
},
{
"cell_type": "markdown",
"metadata": {},
"source": "As you can see above, we have successfully loaded the PDF as xml (take a look at the PDF by just opening the url given, it should give you an idea how it is structured). \n\nThe basic structure of a pdf parsed this way will always be ```page``` tags followed by ```text``` tags contianing the information, positioning and font information. The positioning and font information can often help to get the table we want - however not in this case: everything is font=\"0\" and left=\"64\". \n\nWe can now use [xpath](http://en.wikipedia.org/wiki/XPath) to query our document..." | |
},
{
"cell_type": "code",
"collapsed": false,
"input": "import lxml\nr=lxml.etree.fromstring(x)\nr.xpath('//page[@number=\"1\"]')", | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "pyout", | |
"prompt_number": 4, | |
"text": "[<Element page at 0x31c32d0>]" | |
} | |
], | |
"prompt_number": 4 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "and also get some lines out of it\n" | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": "r.xpath('//text[@left=\"64\"]/b')[0:10] #array abbreviated for legibility", | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "pyout", | |
"prompt_number": 32, | |
"text": "[<Element b at 0x31c3320>,\n <Element b at 0x31c3550>,\n <Element b at 0x31c35a0>,\n <Element b at 0x31c35f0>,\n <Element b at 0x31c3640>,\n <Element b at 0x31c3690>,\n <Element b at 0x31c36e0>,\n <Element b at 0x31c3730>,\n <Element b at 0x31c3780>,\n <Element b at 0x31c37d0>]" | |
} | |
], | |
"prompt_number": 32 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": "r.xpath('//text[@left=\"64\"]/b')[8].text", | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "pyout", | |
"prompt_number": 7, | |
"text": "u'Aaron * 64 0,19 91 Aim\\xe9 1 0,00 959 '" | |
} | |
], | |
"prompt_number": 7 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "Great - this will help us. If we look at the document you'll notice that there are all boys names from page 1-20 and girls names from page 21-43 - let's get them seperately..." | |
},
{
"cell_type": "code",
"collapsed": false,
"input": "boys=r.xpath('//page[@number<=\"20\"]/text[@left=\"64\"]/b')\ngirls=r.xpath('//page[@number>\"20\" and @number<=\"43\"]/text[@left=\"64\"]/b')\nprint boys[8].text\nprint girls[8].text",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "Aaron * 64 0,19 91 Aim\u00e9 1 0,00 959 \nAarina 1 0,00 1.156 Ala\u00efa 1 0,00 1.156 \n"
}
],
"prompt_number": 13
},
{
"cell_type": "markdown",
"metadata": {},
"source": "fantastic - but you'll also notice something - the columns are all there, sperated by whitespaces. And also Aaron has an asterisk - we want to remove it (the asterisk is explained in the original doc).\n\nTo split it up into columns I'll create a small function using regexes to split it." | |
},
{
"cell_type": "code",
"collapsed": false,
"input": "import re\n\ndef split_entry(e):\n    return re.split(\"[ ]+\",e.text.replace(\"*\",\"\")) # we're removing the asterisk here as well...",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 10
},
{
"cell_type": "markdown",
"metadata": {},
"source": "now let's apply it to boys and girls" | |
},
{
"cell_type": "code",
"collapsed": false,
"input": "boys=[split_entry(i) for i in boys]\ngirls=[split_entry(i) for i in girls]\nprint boys[8]\nprint girls[8]",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "[u'Aaron', u'64', u'0,19', u'91', u'Aim\\xe9', u'1', u'0,00', u'959', u'']\n[u'Aarina', u'1', u'0,00', u'1.156', u'Ala\\xefa', u'1', u'0,00', u'1.156', u'']\n"
}
],
"prompt_number": 14
},
{
"cell_type": "markdown",
"metadata": {},
"source": "That worked!. Notice the empty string u'' at the end? I'd like to filter it. I'll do this using the ifilter function from itertools" | |
},
{
"cell_type": "code",
"collapsed": false,
"input": "import itertools\nboys=[[i for i in itertools.ifilter(lambda x: x!=\"\",j)] for j in boys]\ngirls=[[i for i in itertools.ifilter(lambda x: x!=\"\",j)] for j in girls]\nprint boys[8]\nprint girls[8]",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "[u'Aaron', u'64', u'0,19', u'91', u'Aim\\xe9', u'1', u'0,00', u'959']\n[u'Aarina', u'1', u'0,00', u'1.156', u'Ala\\xefa', u'1', u'0,00', u'1.156']\n"
}
],
"prompt_number": 16
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Worked, this cleaned up our boys and girls arrays. We want to make them properly though - there are two columns each four fields wide. I'll do this with a little function" | |
},
{
"cell_type": "code",
"collapsed": false,
"input": "def take4(x):\n    if (len(x)>5):\n        return [x[0:4],x[4:]]\n    else:\n        return [x[0:4]]\n    \nboys=[take4(i) for i in boys]\ngirls=[take4(i) for i in girls]\nprint boys[8]\nprint girls[8]",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "[[u'Aaron', u'64', u'0,19', u'91'], [u'Aim\\xe9', u'1', u'0,00', u'959']]\n[[u'Aarina', u'1', u'0,00', u'1.156'], [u'Ala\\xefa', u'1', u'0,00', u'1.156']]\n"
}
],
"prompt_number": 17
},
{
"cell_type": "markdown",
"metadata": {},
"source": "ah that worked nicely! - now let's make sure it's one array with both options in it -for this i'll use reduce" | |
},
{
"cell_type": "code",
"collapsed": false,
"input": "boys=reduce(lambda x,y: x+y, boys, [])\ngirls=reduce(lambda x,y: x+y, girls,[])\nprint boys[10]\nprint girls[10]",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "['Aiden', '2', '0,01', '667']\n['Alaa', '1', '0,00', '1.156']\n"
}
],
"prompt_number": 18
},
{
"cell_type": "markdown",
"metadata": {},
"source": "perfect - now let's add a gender to the entries\n" | |
},
{
"cell_type": "code",
"collapsed": false,
"input": "for x in boys:\n    x.append(\"m\")\n\nfor x in girls:\n    x.append(\"f\")\n\nprint boys[10]\nprint girls[10]",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "['Aiden', '2', '0,01', '667', 'm']\n['Alaa', '1', '0,00', '1.156', 'f']\n"
}
],
"prompt_number": 19
},
{
"cell_type": "markdown",
"metadata": {},
"source": "We got that! For further processing I'll join the arrays up"
},
{
"cell_type": "code",
"collapsed": false,
"input": "names=boys+girls\nprint names[10]",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "['Aiden', '2', '0,01', '667', 'm']\n"
}
],
"prompt_number": 29
},
{
"cell_type": "markdown",
"metadata": {},
"source": "let's take a look at the full array..." | |
},
{
"cell_type": "code",
"collapsed": false,
"input": "names[0:10]",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 30,
"text": "[['TABELLE', 'VN2Ap/1', '30/07/13', '11.38.44', 'm'],\n ['BLATT', '1', 'm'],\n [u'STATISTIK', u'ALLER', u'VORNAMEN', u'(TEILWEISE', 'm'],\n [u'PHONETISCH',\n  u'ZUSAMMENGEFASST,',\n  u'ALPHABETISCH',\n  u'SORTIERT)',\n  u'F\\xdcR',\n  u'NEUGEBORENE',\n  u'KNABEN',\n  u'MIT',\n  'm'],\n [u'\\xd6STERREICHISCHER', u'STAATSB\\xdcRGERSCHAFT', u'2012', u'-', 'm'],\n ['m'],\n ['VORNAMEN', 'ABSOLUT', '%', 'RANG', 'm'],\n ['VORNAMEN', 'ABSOLUT', '%', 'RANG', 'm'],\n ['m'],\n ['INSGESAMT', '34.017', '100,00', '.', 'm']]"
}
],
"prompt_number": 30
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Notice there is still quite a bit of mess in there: basically all the lines starting with an all caps entry, \"der\", \"m\" or \"f\". Let's remove them...." | |
},
{
"cell_type": "code",
"collapsed": false,
"input": "names=itertools.ifilter(lambda x: not x[0].isupper(),names) # remove allcaps entries\nnames=[i for i in itertools.ifilter(lambda x: not (x[0] in [\"der\",\"m\",\"f\"]),names)] # remove all entries that are \"der\",\"m\" or \"f\"\nnames[0:10]",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 31,
"text": "[['Aiden', '2', '0,01', '667', 'm'],\n ['Aiman', '3', '0,01', '532', 'm'],\n [u'Aaron', u'64', u'0,19', u'91', 'm'],\n [u'Aim\\xe9', u'1', u'0,00', u'959', 'm'],\n ['Abbas', '2', '0,01', '667', 'm'],\n ['Ajan', '2', '0,01', '667', 'm'],\n ['Abdallrhman', '1', '0,00', '959', 'm'],\n ['Ajdin', '15', '0,04', '225', 'm'],\n ['Abdel', '1', '0,00', '959', 'm'],\n ['Ajnur', '1', '0,00', '959', 'm']]"
}
],
"prompt_number": 31
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Woohoo - we have a cleaned up list. Now let's write it as csv...." | |
},
{
"cell_type": "code",
"collapsed": false,
"input": "import csv\nf=open(\"names.csv\",\"wb\") #open file for writing\nw=csv.writer(f) #open a csv writer\n\nw.writerow([\"Name\",\"Count\",\"Percent\",\"Rank\",\"Gender\"]) #write the header\n\nfor n in names:\n    w.writerow([i.encode(\"utf-8\") for i in n]) #write each row\n\nf.close()",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 27
},
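{
"cell_type": "markdown",
"metadata": {},
"source": "As a quick sanity check - assuming names.csv was written by the cell above - we can read the file back and print the first few rows:"
},
{
"cell_type": "code",
"collapsed": false,
"input": "f=open(\"names.csv\",\"rb\") # reopen the file we just wrote\nrows=[row for row in csv.reader(f)] # csv was already imported above\nf.close()\nprint len(rows)-1 # number of name entries, not counting the header row\nprint rows[0:5] # header plus the first few entries",
"language": "python",
"metadata": {},
"outputs": []
},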
{
"cell_type": "markdown",
"metadata": {},
"source": "Done, We've scraped a multi-page PDF using python. All in all this was a fairly quick way to get the data out of a PDF using scraperwiki tools.\n" | |
},
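{
"cell_type": "markdown",
"metadata": {},
"source": "The names list is also handy for quick analysis directly in Python. As a small sketch - the top_names helper below is just an illustration, assuming the [Name, Count, Percent, Rank, Gender] layout built above, with \".\" as the thousands separator in the counts - let's pull out the most common names per gender:"
},
{
"cell_type": "code",
"collapsed": false,
"input": "def top_names(entries,n=5):\n    # counts like u'1.156' use \".\" as a thousands separator, so strip the dots before converting\n    return sorted(entries, key=lambda e: int(e[1].replace(\".\",\"\")), reverse=True)[:n]\n\nboy_entries=[e for e in names if len(e)==5 and e[4]==\"m\"] # keep only well-formed boys' rows\ngirl_entries=[e for e in names if len(e)==5 and e[4]==\"f\"] # keep only well-formed girls' rows\nprint top_names(boy_entries) # most common boys' names\nprint top_names(girl_entries) # most common girls' names",
"language": "python",
"metadata": {},
"outputs": []
},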
{
"cell_type": "code",
"collapsed": false,
"input": "",
"language": "python",
"metadata": {},
"outputs": []
}
],
"metadata": {}
}
]
}