Python Training - September 26-28, 2013
{
 "metadata": {
  "name": ""
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Python Training - Notes\n",
      "\n",
      "Additional notes provided after the training."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Parsing XML"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The Python standard library has good support for parsing XML files.\n",
      "\n",
      "Let's try to parse an XML response from Solr."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import xml.etree.ElementTree as ET\n",
      "\n",
      "xml = \"\"\"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n",
      "<response>\n",
      "<lst name=\"responseHeader\">\n",
      "  <int name=\"status\">0</int>\n",
      "  <int name=\"QTime\">0</int>\n",
      "  <lst name=\"params\">\n",
      "    <str name=\"indent\">on</str>\n",
      "    <str name=\"fl\">title,key</str>\n",
      "    <str name=\"q\">html5</str>\n",
      "  </lst>\n",
      "</lst>\n",
      "<result name=\"response\" numFound=\"2\" start=\"0\">\n",
      "  <doc>\n",
      "    <str name=\"key\">/books/OL24405389M</str>\n",
      "    <str name=\"title\">HTML5 For Web Designers</str>\n",
      "  </doc>\n",
      "  <doc>\n",
      "    <str name=\"key\">/works/OL15437498W</str>\n",
      "    <str name=\"title\">HTML5 For Web Designers</str>\n",
      "  </doc>\n",
      "</result>\n",
      "</response>\n",
      "\"\"\"\n",
      "\n",
      "# construct an ElementTree node from the xml string\n",
      "root = ET.fromstring(xml)\n",
      "\n",
      "# find the result node\n",
      "result = root.find(\"result\")\n",
      "\n",
      "# find all doc nodes from result\n",
      "docs = result.findall(\"doc\")\n",
      "\n",
      "# iterate over each doc and print the key and title\n",
      "for doc in docs:\n",
      "    # It is also possible to iterate over all children using a for loop\n",
      "    for field in doc:\n",
      "        # node.attrib gives a dictionary of all attributes of the node\n",
      "        # and node.text gives the text of that node\n",
      "        print field.attrib['name'], field.text\n",
      "    print\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "key /books/OL24405389M\n",
        "title HTML5 For Web Designers\n",
        "\n",
        "key /works/OL15437498W\n",
        "title HTML5 For Web Designers\n",
        "\n"
       ]
      }
     ],
     "prompt_number": 1
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "While there are many other ways of parsing XML in the Python standard library, `xml.etree` provides the best API. You can find more about the `etree` module in the Python documentation:\n",
      "\n",
      "<http://docs.python.org/2/library/xml.etree.elementtree.html#module-xml.etree.ElementTree>\n",
      "\n",
      "If performance is an issue, try the third-party module `lxml`, which is API compatible with `xml.etree.ElementTree` and adds some more features.\n",
      "\n",
      "See the [lxml tutorial](http://lxml.de/tutorial.html) for more info about `lxml`."
     ]
    },
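    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "`xml.etree` can also build XML documents, not just parse them. The sketch below is not part of the original training material; the element names simply mirror the Solr response above. It constructs a small response with `ET.SubElement` and serializes it with `ET.tostring`."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import xml.etree.ElementTree as ET\n",
      "\n",
      "# build the tree top-down: Element makes the root, SubElement appends children\n",
      "response = ET.Element(\"response\")\n",
      "result = ET.SubElement(response, \"result\", {\"name\": \"response\", \"numFound\": \"1\"})\n",
      "doc = ET.SubElement(result, \"doc\")\n",
      "title = ET.SubElement(doc, \"str\", {\"name\": \"title\"})\n",
      "title.text = \"HTML5 For Web Designers\"\n",
      "\n",
      "# tostring serializes the tree back to an XML string\n",
      "print ET.tostring(response)\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },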
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Handling Unicode\n",
      "\n",
      "Python has very good support for unicode, but things get confusing if you don't pay attention.\n",
      "\n",
      "Unicode strings in Python are written like u'\\u0C05'. The `u` prefix denotes a unicode string and `\\u` is the escape code for writing unicode code points. For example, the following is the word \"telugu\" written in Telugu.\n",
      "\n",
      "    >>> x = u'\\u0c24\\u0c46\\u0c32\\u0c41\\u0c17\\u0c41'\n",
      "\n",
      "But if you try to print it, it usually fails with a `UnicodeEncodeError`, because Python tries to encode the string using a codec supported by the terminal.\n",
      "\n",
      "    >>> print x\n",
      "    Traceback (most recent call last):\n",
      "      File \"<stdin>\", line 1, in <module>\n",
      "    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)\n",
      "\n",
      "You also need to encode it when you want to store it in a file, and decode it when reading it back. The most commonly used encoding is `utf-8`. In UTF-8, ASCII characters keep their usual single-byte values and all other characters are encoded as multi-byte sequences using bytes outside the ASCII range.\n",
      "\n",
      "    >>> y = x.encode('utf-8')\n",
      "    >>> y\n",
      "    '\\xe0\\xb0\\xa4\\xe0\\xb1\\x86\\xe0\\xb0\\xb2\\xe0\\xb1\\x81\\xe0\\xb0\\x97\\xe0\\xb1\\x81'\n",
      "    >>> print y\n",
      "    \u0c24\u0c46\u0c32\u0c41\u0c17\u0c41\n",
      "    >>> y.decode('utf-8')\n",
      "    u'\\u0c24\\u0c46\\u0c32\\u0c41\\u0c17\\u0c41'"
     ]
    },
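    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The encode/decode round trip above can be run end to end, including file I/O. This is a sketch (the file name is made up); `io.open` is used because it takes an `encoding` argument and encodes/decodes transparently, unlike the plain built-in `open` in Python 2."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import io\n",
      "\n",
      "word = u'\\u0c24\\u0c46\\u0c32\\u0c41\\u0c17\\u0c41'\n",
      "\n",
      "# encode gives a utf-8 byte string; decode gives the unicode string back\n",
      "data = word.encode('utf-8')\n",
      "assert data.decode('utf-8') == word\n",
      "\n",
      "# when an encoding is given, io.open encodes on write and decodes on read\n",
      "with io.open('telugu.txt', 'w', encoding='utf-8') as f:\n",
      "    f.write(word)\n",
      "\n",
      "with io.open('telugu.txt', encoding='utf-8') as f:\n",
      "    assert f.read() == word\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },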
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Parsing HTML"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The library that I recommend for parsing HTML is BeautifulSoup. You can install it using:\n",
      "\n",
      "    pip install beautifulsoup4\n",
      "\n",
      "Let's try a small example of parsing some HTML."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from bs4 import BeautifulSoup\n",
      "\n",
      "html = \"\"\"\n",
      "<html><head><title>Indian Railways</title></head>\n",
      "<body>\n",
      "<div class=\"navigation\">\n",
      "  <a href=\"pnr_Enq.html\">PNR Status</a>\n",
      "  <a href=\"between_Imp_Stations.html\">Train Between Important Stations</a>\n",
      "  <a href=\"seat_Avail.html\"><b>Seat Availability</b></a>\n",
      "</div>\n",
      "...\n",
      "</body>\n",
      "\"\"\"\n",
      "\n",
      "soup = BeautifulSoup(html)\n",
      "# print the title\n",
      "# The .string attribute gives the text of the node\n",
      "print soup.head.title.string\n",
      "print soup.find('title').string\n",
      "\n",
      "# Find all the links\n",
      "# The find_all method returns all the matching nodes\n",
      "for a in soup.find_all(\"a\"):\n",
      "    # The node behaves like a dictionary for accessing attributes\n",
      "    # The get_text() method extracts the text from all sub-nodes\n",
      "    print a['href'], a.get_text()\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Indian Railways\n",
        "Indian Railways\n",
        "pnr_Enq.html PNR Status\n",
        "between_Imp_Stations.html Train Between Important Stations\n",
        "seat_Avail.html Seat Availability\n"
       ]
      }
     ],
     "prompt_number": 35
    },
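    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "BeautifulSoup can also filter on attributes. A small sketch with some made-up HTML: keyword arguments to `find_all` match attribute values, `href=True` matches any tag that has that attribute at all, and `class_` (with a trailing underscore, since `class` is a Python keyword) matches the class attribute."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from bs4 import BeautifulSoup\n",
      "\n",
      "html = '<div class=\"navigation\"><a href=\"pnr_Enq.html\">PNR Status</a><a name=\"top\">anchor with no href</a></div>'\n",
      "soup = BeautifulSoup(html)\n",
      "\n",
      "# href=True matches only tags that actually have an href attribute\n",
      "for a in soup.find_all('a', href=True):\n",
      "    print a['href']\n",
      "\n",
      "# class_ filters on the class attribute\n",
      "print soup.find('div', class_='navigation')\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },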
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "You can do a lot more with BeautifulSoup. Go read the documentation at:\n",
      "\n",
      "http://www.crummy.com/software/BeautifulSoup/bs4/doc/\n",
      "\n",
      "A problem for you to work on:\n",
      "\n",
      "**Problem:** Find all the languages in which Wikipedia has an instance. You can find them on the Wikipedia website http://wikipedia.org/.\n",
      "\n",
      "Hint: You can get the html using `html = urllib.urlopen(url).read()` and use that with BeautifulSoup.\n",
      "\n",
      "Beware! You are going to hit unicode issues."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    }
   ],
   "metadata": {}
  }
 ]
}