@dogweather
Last active April 11, 2022 02:19
################################################
# Each Article has a title and a number.
# But three of them are very difficult to parse.
################################################
# Map Article titles to corrected numbers.
EXCEPTIONS = {
    "Elements of Crimes": "9",
    "Ne bis in idem": "20",
    "Individual criminal responsibility": "25",
}
# Parse the title & number and possibly correct the number.
article_title = parse_title(html)
raw_number = parse_number(html)
article_number = EXCEPTIONS.get(article_title, raw_number)
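
The snippet above calls `parse_title` and `parse_number` without defining them. Below is a minimal, self-contained sketch of the same lookup-with-fallback pattern. The two regex-based parsers and the helper `article_number_for` are hypothetical stand-ins invented for illustration, as is the assumed HTML shape (`<h1>Article N</h1><h2>Title</h2>`); the real parsing logic is not shown in the source.

```python
import re

# Map Article titles to corrected numbers (taken from the snippet above).
EXCEPTIONS = {
    "Elements of Crimes": "9",
    "Ne bis in idem": "20",
    "Individual criminal responsibility": "25",
}


def parse_title(html):
    """Hypothetical parser: return the text of the first <h2>, or ''."""
    match = re.search(r"<h2>(.*?)</h2>", html, re.S)
    return match.group(1).strip() if match else ""


def parse_number(html):
    """Hypothetical parser: return the digits right after 'Article', or ''."""
    match = re.search(r"Article\s+(\d+)", html)
    return match.group(1) if match else ""


def article_number_for(html):
    """Use the parsed number unless the title has a known correction."""
    title = parse_title(html)
    raw_number = parse_number(html)
    return EXCEPTIONS.get(title, raw_number)
```

The design choice is that `dict.get` with a default keeps the common case (trust the parsed number) and the rare case (override by title) in a single expression, so no `if` chain is needed as more exceptions are added.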
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Scraping Webpages with BeautifulSoup\n",
"===================================="
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Lets try to get a list of all the years of all of Amitabh Bachchan movies! If you don't know, he's kind of the Sean Connery of India.\n",
"\n",
"BeautifulSoup lets you download webpages and search them for specific HTML entities. You can use this ability to scrape data out of the webpage, or a series of webpages. It is fast and works well. Their [documentation](http://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a handy reference."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Getting the Content\n",
"------------------\n",
"First you gotta grab the content (I like to use [requests](http://docs.python-requests.org/en/latest/) for this)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import requests\n",
"r = requests.get('http://www.imdb.com/name/nm0000821') # lets look at Amitabh Bachchan's list of movies"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How you can make your \"beautiful soup\"! This turns the HTML into a DOM tree that you can navigate with code."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from bs4 import BeautifulSoup\n",
"webpage = BeautifulSoup(r.text, \"html.parser\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Scraping the Info You Want\n",
"------------------------\n",
"Now there are a few ways to get content out. For instance, to get the title you could treat it like an object:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"u'Amitabh Bachchan - IMDb'"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"webpage.title.text"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Or you can search for specific tags. This would get all the links (as DOM elements):"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"671"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(webpage.find_all('a'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Or you can use good old CSS selectors, to actually find all the years his movies were made in:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"341"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(webpage.select('div.filmo-row span.year_column'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Of course, we really want to turn this into a list of years... not DOM elements"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"raw_year_list = [e.text.strip() for e in webpage.select('div.filmo-row span.year_column')]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Cleaning and Analyzing the Data\n",
"-----------------------------\n",
"So we can check if he made any films in a particular year"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"'1972' in raw_year_list"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"And we can look for messy data:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[u'',\n",
" u'',\n",
" u'2014/I',\n",
" u'2013/I',\n",
" u'2013/I',\n",
" u'2003/I',\n",
" u'2003/I',\n",
" u'1983/I',\n",
" u'1980/I',\n",
" u'',\n",
" u'2014/I',\n",
" u'2015-2016',\n",
" u'2000-2012']"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"[year for year in raw_year_list if not year.isnumeric()]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And we can remove these messy entries (even though that isn't the best thing to do):"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"u'2016,2016,2016,2016,2015,2015,2014,2014,2014,2014,2013,2013,2013,2012,2012,2012,2012,2011,2011,2010,2010,2010,2009,2009,2009,2009,2008,2008,2008,2008,2008,2007,2007,2007,2007,2007,2007,2007,2007,2007,2007,2006,2006,2006,2006,2006,2005,2005,2005,2005,2005,2005,2005,2005,2005,2004,2004,2004,2004,2004,2004,2004,2004,2004,2004,2004,2003,2003,2003,2003,2003,2002,2002,2002,2002,2002,2001,2001,2001,2001,2000,1999,1999,1999,1999,1999,1998,1998,1998,1997,1997,1997,1996,1994,1994,1993,1992,1991,1991,1991,1991,1990,1990,1989,1989,1989,1989,1988,1988,1988,1986,1985,1985,1985,1984,1984,1984,1984,1984,1983,1983,1983,1983,1982,1982,1982,1982,1982,1982,1981,1981,1981,1981,1981,1981,1981,1981,1980,1980,1980,1979,1979,1979,1979,1979,1979,1979,1978,1978,1978,1978,1978,1977,1977,1977,1977,1977,1977,1977,1977,1976,1976,1976,1976,1976,1975,1975,1975,1975,1975,1975,1974,1974,1974,1974,1974,1974,1974,1973,1973,1973,1973,1973,1973,1972,1972,1972,1972,1972,1972,1972,1972,1972,1971,1971,1971,1971,1970,2012,2012,2011,2009,2009,2008,2007,2007,2006,2004,2003,2003,2002,2001,2001,1999,1999,1989,1989,1989,1984,1983,1983,1981,2016,2015,2012,2009,2009,2009,2008,2007,2006,2004,2003,2001,1999,1999,1991,1989,1981,1981,1981,1979,1979,1978,1977,1976,2011,2005,2001,1998,1997,1997,1996,1998,1998,2015,2012,2010,2009,2009,2008,2007,2005,2004,2001,1999,1993,1989,2014,2013,2012,2012,2012,2011,2011,2011,2010,2010,2008,2008,2007,2007,2007,2006,2005,2005,2005,2005,2004,2004,2004,2004,2004,2003,2003,2002,2001,2000,1999,1996,1993,1992,1991,1990,1990,1988,1988,1988,1988,1988,1987,1986,1985,1985,1984,1984,1981,1981,1979,1979,1977,1975,1975,1971,2013,2008,2004,1983'"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"year_list = [year for year in raw_year_list if year.isnumeric()]\n",
"','.join(year_list)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1970: +\n",
"1971: +++++\n",
"1972: +++++++++\n",
"1973: ++++++\n",
"1974: +++++++\n",
"1975: ++++++++\n",
"1976: ++++++\n",
"1977: ++++++++++\n",
"1978: ++++++\n",
"1979: +++++++++++\n",
"1980: +++\n",
"1981: ++++++++++++++\n",
"1982: ++++++\n",
"1983: +++++++\n",
"1984: ++++++++\n",
"1985: +++++\n",
"1986: ++\n",
"1987: +\n",
"1988: ++++++++\n",
"1989: +++++++++\n",
"1990: ++++\n",
"1991: ++++++\n",
"1992: ++\n",
"1993: +++\n",
"1994: ++\n",
"1996: +++\n",
"1997: +++++\n",
"1998: ++++++\n",
"1999: +++++++++++\n",
"2000: ++\n",
"2001: ++++++++++\n",
"2002: +++++++\n",
"2003: ++++++++++\n",
"2004: ++++++++++++++++++++\n",
"2005: +++++++++++++++\n",
"2006: ++++++++\n",
"2007: +++++++++++++++++\n",
"2008: +++++++++++\n",
"2009: +++++++++++\n",
"2010: ++++++\n",
"2011: +++++++\n",
"2012: +++++++++++\n",
"2013: +++++\n",
"2014: +++++\n",
"2015: ++++\n",
"2016: +++++\n"
]
}
],
"source": [
"import collections\n",
"year_freq = collections.Counter(year_list)\n",
"for year in sorted(year_freq.keys()):\n",
" print str(year)+': '+('+'*year_freq[year])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.10"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
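
The notebook drops non-numeric entries such as `2014/I` and `2015-2016` and notes that this isn't the best thing to do. As one possible alternative, here is a rough Python 3 sketch that keeps those entries by pulling out the first four-digit year. The helper `extract_year` is a hypothetical name, the sample list is a small illustrative subset rather than the scraped data, and counting a range like `2015-2016` only under its first year is an assumption, not the notebook's method.

```python
import collections
import re


def extract_year(raw):
    """Return the first four-digit run in raw as a string, or None."""
    match = re.search(r"\d{4}", raw)
    return match.group(0) if match else None


# Illustrative sample only; the real list comes from the scraped page.
raw_year_list = ["2016", "", "2014/I", "2015-2016", "1972", "2014"]

# Normalize each entry and drop the ones with no recognizable year.
year_list = [y for y in map(extract_year, raw_year_list) if y is not None]

# Same text histogram as the notebook, one '+' per film.
year_freq = collections.Counter(year_list)
for year in sorted(year_freq):
    print(year + ": " + "+" * year_freq[year])
```

Blank entries are still discarded, since they carry no year to recover.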