IPython notebook for the name-PDF scraper
{
"metadata": {
"name": "Scraping a PDF with Scraperwiki's PDFtoXML"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": "While for simple single or double-page tables [tabula](http://jazzido.github.io/tabula/) is a viable option - if you have PDFs with tables over multiple pages you'll soon grow old marking them.\n\nThis is where you'll need some scripting. Thanks to [scraperwikis library](https://pypi.python.org/pypi/scraperwiki) (```pip install scraperwiki```) and the included pdftoxml - scraping PDFs has become a feasible task in python. On a recent Hacks/Hackers event we run into a candidate - that was quite tricky to scrape - I decided to protocol the process here." | |
},
{
"cell_type": "markdown",
"metadata": {},
"source": "First we import the scraperwiki library and urllib2, since the file we're using lives on a webserver.\n"
},
{
"cell_type": "code",
"collapsed": false,
"input": "import scraperwiki, urllib2",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 1
},
{
"cell_type": "code",
"collapsed": false,
"input": "u=urllib2.urlopen(\"http://images.derstandard.at/2013/08/12/VN2p_2012.pdf\") #open the url for the PDF\nx=scraperwiki.pdftoxml(u.read()) # interpret it as xml\nprint x[:1024] # let's see what's in there abbreviated...\n",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE pdf2xml SYSTEM \"pdf2xml.dtd\">\n\n<pdf2xml producer=\"poppler\" version=\"0.22.5\">\n<page number=\"1\" position=\"absolute\" top=\"0\" left=\"0\" height=\"1263\" width=\"892\">\n\t<fontspec id=\"0\" size=\"8\" family=\"Times\" color=\"#000000\"/>\n\t<fontspec id=\"1\" size=\"7\" family=\"Times\" color=\"#000000\"/>\n<text top=\"42\" left=\"64\" width=\"787\" height=\"12\" font=\"0\"><b>TABELLE VN2Ap/1 30/07/13 11.38.44 BLATT 1 </b></text>\n<text top=\"58\" left=\"64\" width=\"718\" height=\"12\" font=\"0\"><b>STATISTIK ALLER VORNAMEN (TEILWEISE PHONETISCH ZUSAMMENGEFASST, ALPHABETISCH SORTIERT) F\u00dcR NEUGEBORENE KNABEN MIT </b></text>\n<text top=\"73\" left=\"64\" width=\"340\" height=\"12\" font=\"0\"><b>\u00d6STERREICHISCHER STAATSB\u00dcRGERSCHAFT 2012 - \u00d6STERREICH </b></text>\n<text top=\"89\" left=\"64\" width=\"6\" height=\"12\" font=\"0\"><b> </b></text>\n<text top=\"104\" left=\"64\" width=\"769\" height=\"12\" font=\"0\"><b>VORNAMEN ABSOLUT % \n" | |
}
],
"prompt_number": 35
},
{
"cell_type": "markdown",
"metadata": {},
"source": "As you can see above, we have successfully loaded the PDF as xml (take a look at the PDF by just opening the url given, it should give you an idea how it is structured). \n\nThe basic structure of a pdf parsed this way will always be ```page``` tags followed by ```text``` tags contianing the information, positioning and font information. The positioning and font information can often help to get the table we want - however not in this case: everything is font=\"0\" and left=\"64\". \n\nWe can now use [xpath](http://en.wikipedia.org/wiki/XPath) to query our document..." | |
},
{
"cell_type": "code",
"collapsed": false,
"input": "import lxml\nr=lxml.etree.fromstring(x)\nr.xpath('//page[@number=\"1\"]')", | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "pyout", | |
"prompt_number": 4, | |
"text": "[<Element page at 0x31c32d0>]" | |
} | |
], | |
"prompt_number": 4 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "and also get some lines out of it\n" | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": "r.xpath('//text[@left=\"64\"]/b')[0:10] #array abbreviated for legibility", | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "pyout", | |
"prompt_number": 32, | |
"text": "[<Element b at 0x31c3320>,\n <Element b at 0x31c3550>,\n <Element b at 0x31c35a0>,\n <Element b at 0x31c35f0>,\n <Element b at 0x31c3640>,\n <Element b at 0x31c3690>,\n <Element b at 0x31c36e0>,\n <Element b at 0x31c3730>,\n <Element b at 0x31c3780>,\n <Element b at 0x31c37d0>]" | |
} | |
], | |
"prompt_number": 32 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": "r.xpath('//text[@left=\"64\"]/b')[8].text", | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "pyout", | |
"prompt_number": 7, | |
"text": "u'Aaron * 64 0,19 91 Aim\\xe9 1 0,00 959 '" | |
} | |
], | |
"prompt_number": 7 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "Great - this will help us. If we look at the document you'll notice that there are all boys names from page 1-20 and girls names from page 21-43 - let's get them seperately..." | |
},
{
"cell_type": "code",
"collapsed": false,
"input": "boys=r.xpath('//page[@number<=\"20\"]/text[@left=\"64\"]/b')\ngirls=r.xpath('//page[@number>\"20\" and @number<=\"43\"]/text[@left=\"64\"]/b')\nprint boys[8].text\nprint girls[8].text",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "Aaron * 64 0,19 91 Aim\u00e9 1 0,00 959 \nAarina 1 0,00 1.156 Ala\u00efa 1 0,00 1.156 \n"
}
],
"prompt_number": 13
},
{
"cell_type": "markdown",
"metadata": {},
"source": "fantastic - but you'll also notice something - the columns are all there, sperated by whitespaces. And also Aaron has an asterisk - we want to remove it (the asterisk is explained in the original doc).\n\nTo split it up into columns I'll create a small function using regexes to split it." | |
},
{
"cell_type": "code",
"collapsed": false,
"input": "import re\n\ndef split_entry(e):\n    return re.split(\"[ ]+\",e.text.replace(\"*\",\"\")) # we're removing the asterisk here as well...",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 10
},
{
"cell_type": "markdown",
"metadata": {},
"source": "now let's apply it to boys and girls" | |
},
{
"cell_type": "code",
"collapsed": false,
"input": "boys=[split_entry(i) for i in boys]\ngirls=[split_entry(i) for i in girls]\nprint boys[8]\nprint girls[8]",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "[u'Aaron', u'64', u'0,19', u'91', u'Aim\\xe9', u'1', u'0,00', u'959', u'']\n[u'Aarina', u'1', u'0,00', u'1.156', u'Ala\\xefa', u'1', u'0,00', u'1.156', u'']\n"
}
],
"prompt_number": 14
},
{
"cell_type": "markdown",
"metadata": {},
"source": "That worked!. Notice the empty string u'' at the end? I'd like to filter it. I'll do this using the ifilter function from itertools" | |
},
{
"cell_type": "code",
"collapsed": false,
"input": "import itertools\nboys=[[i for i in itertools.ifilter(lambda x: x!=\"\",j)] for j in boys]\ngirls=[[i for i in itertools.ifilter(lambda x: x!=\"\",j)] for j in girls]\nprint boys[8]\nprint girls[8]",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "[u'Aaron', u'64', u'0,19', u'91', u'Aim\\xe9', u'1', u'0,00', u'959']\n[u'Aarina', u'1', u'0,00', u'1.156', u'Ala\\xefa', u'1', u'0,00', u'1.156']\n"
}
],
"prompt_number": 16
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Worked, this cleaned up our boys and girls arrays. We want to make them properly though - there are two columns each four fields wide. I'll do this with a little function" | |
},
{
"cell_type": "code",
"collapsed": false,
"input": "def take4(x):\n    if (len(x)>5):\n        return [x[0:4],x[4:]]\n    else:\n        return [x[0:4]]\n    \nboys=[take4(i) for i in boys]\ngirls=[take4(i) for i in girls]\nprint boys[8]\nprint girls[8]",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "[[u'Aaron', u'64', u'0,19', u'91'], [u'Aim\\xe9', u'1', u'0,00', u'959']]\n[[u'Aarina', u'1', u'0,00', u'1.156'], [u'Ala\\xefa', u'1', u'0,00', u'1.156']]\n"
}
],
"prompt_number": 17
},
{
"cell_type": "markdown",
"metadata": {},
"source": "ah that worked nicely! - now let's make sure it's one array with both options in it -for this i'll use reduce" | |
},
{
"cell_type": "code",
"collapsed": false,
"input": "boys=reduce(lambda x,y: x+y, boys, [])\ngirls=reduce(lambda x,y: x+y, girls,[])\nprint boys[10]\nprint girls[10]",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "['Aiden', '2', '0,01', '667']\n['Alaa', '1', '0,00', '1.156']\n"
}
],
"prompt_number": 18
},
{
"cell_type": "markdown",
"metadata": {},
"source": "perfect - now let's add a gender to the entries\n" | |
},
{
"cell_type": "code",
"collapsed": false,
"input": "for x in boys:\n    x.append(\"m\")\n\nfor x in girls:\n    x.append(\"f\")\n\nprint boys[10]\nprint girls[10]",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "['Aiden', '2', '0,01', '667', 'm']\n['Alaa', '1', '0,00', '1.156', 'f']\n"
}
],
"prompt_number": 19
},
{
"cell_type": "markdown",
"metadata": {},
"source": "We got that! For further processing I'll join the arrays up"
},
{
"cell_type": "code",
"collapsed": false,
"input": "names=boys+girls\nprint names[10]",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "['Aiden', '2', '0,01', '667', 'm']\n"
}
],
"prompt_number": 29
},
{
"cell_type": "markdown",
"metadata": {},
"source": "let's take a look at the full array..." | |
},
{
"cell_type": "code",
"collapsed": false,
"input": "names[0:10]",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 30,
"text": "[['TABELLE', 'VN2Ap/1', '30/07/13', '11.38.44', 'm'],\n ['BLATT', '1', 'm'],\n [u'STATISTIK', u'ALLER', u'VORNAMEN', u'(TEILWEISE', 'm'],\n [u'PHONETISCH',\n  u'ZUSAMMENGEFASST,',\n  u'ALPHABETISCH',\n  u'SORTIERT)',\n  u'F\\xdcR',\n  u'NEUGEBORENE',\n  u'KNABEN',\n  u'MIT',\n  'm'],\n [u'\\xd6STERREICHISCHER', u'STAATSB\\xdcRGERSCHAFT', u'2012', u'-', 'm'],\n ['m'],\n ['VORNAMEN', 'ABSOLUT', '%', 'RANG', 'm'],\n ['VORNAMEN', 'ABSOLUT', '%', 'RANG', 'm'],\n ['m'],\n ['INSGESAMT', '34.017', '100,00', '.', 'm']]"
}
],
"prompt_number": 30
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Notice there is still quite a bit of mess in there: basically all the lines starting with an all caps entry, \"der\", \"m\" or \"f\". Let's remove them...." | |
},
{
"cell_type": "code",
"collapsed": false,
"input": "names=itertools.ifilter(lambda x: not x[0].isupper(),names) # remove allcaps entries\nnames=[i for i in itertools.ifilter(lambda x: not (x[0] in [\"der\",\"m\",\"f\"]),names)] # remove all entries that are \"der\",\"m\" or \"f\"\nnames[0:10]",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 31,
"text": "[['Aiden', '2', '0,01', '667', 'm'],\n ['Aiman', '3', '0,01', '532', 'm'],\n [u'Aaron', u'64', u'0,19', u'91', 'm'],\n [u'Aim\\xe9', u'1', u'0,00', u'959', 'm'],\n ['Abbas', '2', '0,01', '667', 'm'],\n ['Ajan', '2', '0,01', '667', 'm'],\n ['Abdallrhman', '1', '0,00', '959', 'm'],\n ['Ajdin', '15', '0,04', '225', 'm'],\n ['Abdel', '1', '0,00', '959', 'm'],\n ['Ajnur', '1', '0,00', '959', 'm']]"
}
],
"prompt_number": 31
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Woohoo - we have a cleaned up list. Now let's write it as csv...." | |
},
{
"cell_type": "code",
"collapsed": false,
"input": "import csv\nf=open(\"names.csv\",\"wb\") #open file for writing\nw=csv.writer(f) #open a csv writer\n\nw.writerow([\"Name\",\"Count\",\"Percent\",\"Rank\",\"Gender\"]) #write the header\n\nfor n in names:\n    w.writerow([i.encode(\"utf-8\") for i in n]) #write each row\n\nf.close()",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 27
},
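{
"cell_type": "markdown",
"metadata": {},
"source": "As a quick sanity check - assuming names.csv was written by the cell above - we can read the file back and print the first few rows:"
},
{
"cell_type": "code",
"collapsed": false,
"input": "f=open(\"names.csv\",\"rb\") # reopen the file we just wrote\nrows=[row for row in csv.reader(f)] # csv was already imported above\nf.close()\nprint len(rows)-1 # number of name entries, not counting the header row\nprint rows[0:5] # header plus the first few entries",
"language": "python",
"metadata": {},
"outputs": []
},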
{
"cell_type": "markdown",
"metadata": {},
"source": "Done, We've scraped a multi-page PDF using python. All in all this was a fairly quick way to get the data out of a PDF using scraperwiki tools.\n" | |
},
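{
"cell_type": "markdown",
"metadata": {},
"source": "The names list is also handy for quick analysis directly in Python. As a small sketch - the top_names helper below is just an illustration, assuming the [Name, Count, Percent, Rank, Gender] layout built above, with \".\" as the thousands separator in the counts - let's pull out the most common names per gender:"
},
{
"cell_type": "code",
"collapsed": false,
"input": "def top_names(entries,n=5):\n    # counts like u'1.156' use \".\" as a thousands separator, so strip the dots before converting\n    return sorted(entries, key=lambda e: int(e[1].replace(\".\",\"\")), reverse=True)[:n]\n\nboy_entries=[e for e in names if len(e)==5 and e[4]==\"m\"] # keep only well-formed boys' rows\ngirl_entries=[e for e in names if len(e)==5 and e[4]==\"f\"] # keep only well-formed girls' rows\nprint top_names(boy_entries) # most common boys' names\nprint top_names(girl_entries) # most common girls' names",
"language": "python",
"metadata": {},
"outputs": []
},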
{
"cell_type": "code",
"collapsed": false,
"input": "",
"language": "python",
"metadata": {},
"outputs": []
}
],
"metadata": {}
}
]
}