@nrrb
Created October 16, 2014 16:41
Intro to Scraping - Centro Careers
{
"metadata": {
"name": "",
"signature": "sha256:8c376e93a42e2bbb5f5a73f37c0f9c0749d7a10826815892bfcd25bc9d941908"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's explore web scraping using the [requests](http://docs.python-requests.org/en/latest/) and [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/) libraries. We'll be roughly following [this guide to web scraping with Python](http://blog.miguelgrinberg.com/post/easy-web-scraping-with-python). We'll scrape the job listings from the [Centro Careers page](http://www.centro.net/careers).\n",
"\n",
"As in the tutorial, we'll use the `select` method on the BeautifulSoup object ([documentation here](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors)), which takes [CSS selectors](https://developer.mozilla.org/en-US/docs/Web/Guide/CSS/Getting_started/Selectors?redirectlocale=en-US&redirectslug=CSS%2FGetting_Started%2FSelectors) and returns the list of matching HTML elements."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import requests\n",
"from bs4 import BeautifulSoup"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 1
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"response = requests.get(\"http://www.centro.net/careers\")\n",
"if response.status_code == requests.codes.ok:\n",
"    print(\"Hooray! We got an OK response ({code}) from the server!\".format(code=response.status_code))\n",
"else:\n",
"    print(\"Uh oh, the status code {code} indicates something is not quite right...\".format(code=response.status_code))"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Hooray! We got an OK response (200) from the server!\n"
]
}
],
"prompt_number": 9
},
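{
"cell_type": "code",
"collapsed": false,
"input": [
"# Name the parser explicitly so the result does not depend on which parsers happen to be installed.\n",
"soup = BeautifulSoup(response.content, \"html.parser\")\n",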
"job_links = soup.select(\"a.positions__link\")\n",
"print(\"There are {} jobs posted at Centro!\".format(len(job_links)))"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"There are 53 jobs posted at Centro!\n"
]
}
],
"prompt_number": 10
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print(\"\"\"This is what the job link HTML looks like:\n",
"\n",
"{html}\"\"\".format(html=job_links[0]))"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"This is what the job link HTML looks like:\n",
"\n",
"<a class=\"positions__link\" href=\"careers/oO9FZfwl\">\n",
"<span class=\"positions__title\">Account Manager</span>\n",
"<span class=\"positions__location\">Denver, CO, United States</span>\n",
"</a>\n"
]
}
],
"prompt_number": 12
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Each position has three potentially interesting pieces: the link to the position, the title, and the\n",
"# location. Let's pull them out of the first position here; after that we'll extract them from every\n",
"# position and collect the results in a Python list.\n",
"print(\"Link: {link}\".format(link=job_links[0].attrs[\"href\"]))\n",
"print(\"Title: {title}\".format(title=job_links[0].select(\"span.positions__title\")[0].text))\n",
"print(\"Location: {location}\".format(location=job_links[0].select(\"span.positions__location\")[0].text))"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Link: careers/oO9FZfwl\n",
"Title: Account Manager\n",
"Location: Denver, CO, United States\n"
]
}
],
"prompt_number": 18
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# This is equivalent to all_positions = [], just a little more explicit.\n",
"all_positions = list()\n",
"for job_link in job_links:\n",
"    link = job_link.attrs[\"href\"]\n",
"    title = job_link.select(\"span.positions__title\")[0].text\n",
"    location = job_link.select(\"span.positions__location\")[0].text\n",
"    all_positions.append(dict(link=link, title=title, location=location))"
],
"language": "python",
"metadata": {},
"outputs": [
{
"ename": "IndexError",
"evalue": "list index out of range",
"output_type": "pyerr",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[0;31mIndexError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-20-6105fd192364>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m 3\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mjob_link\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mjob_links\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[0mlink\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mjob_link\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mattrs\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m\"href\"\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 5\u001b[0;31m \u001b[0mtitle\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mjob_link\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mselect\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"span.positions__title\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtext\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 6\u001b[0m \u001b[0mlocation\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mjob_link\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mselect\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"span.positions__location\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtext\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 7\u001b[0m \u001b[0mall_positions\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mappend\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdict\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mlink\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mlink\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtitle\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mtitle\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlocation\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mlocation\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mIndexError\u001b[0m: list index out of range"
]
}
],
"prompt_number": 20
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# This error means one of the links is not in the format we expected. Let's see what it is. Since the\n",
"# for loop broke on that error, we can reference the last value of job_link to see what it was and inspect\n",
"# what might have caused the error.\n",
"print(job_link)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"<a class=\"positions__link\" href=\"http://www.jobvite.com\" target=\"__blank\">Powered by Jobvite</a>\n"
]
}
],
"prompt_number": 22
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# This shows us that not every matching link is a job: this one is a \"Powered by Jobvite\" credit with no\n",
"# title or location spans. Let's redo the loop, this time checking that both elements exist before we\n",
"# try to access their text.\n",
"all_positions = list()\n",
"for job_link in job_links:\n",
"    titles = job_link.select(\"span.positions__title\")\n",
"    locations = job_link.select(\"span.positions__location\")\n",
"    if titles and locations:\n",
"        job_details = dict(link=job_link.attrs[\"href\"], title=titles[0].text, location=locations[0].text)\n",
"        print(\"Found a job: {details}\".format(details=job_details))\n",
"        all_positions.append(job_details)\n",
"    else:\n",
"        print(\"This is not a real job listing: {link}\".format(link=job_link))"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Found a job: {'link': 'careers/oO9FZfwl', 'location': u'Denver, CO, United States', 'title': u'Account Manager'}\n",
"Found a job: {'link': 'careers/o2GJZfwa', 'location': u'New York, NY, United States', 'title': u'Account Manager'}\n",
"Found a job: {'link': 'careers/oudfZfwF', 'location': u'Chicago, IL, United States', 'title': u'Corporate Business Analyst'}\n",
"Found a job: {'link': 'careers/oxlyZfw9', 'location': u'Chicago, IL, United States', 'title': u'Corporate Program Manager'}\n",
"Found a job: {'link': 'careers/oZtGZfwR', 'location': u'New York, NY, United States', 'title': u'Campaign Manager'}\n",
"Found a job: {'link': 'careers/oh4FZfwJ', 'location': u'San Francisco, CA, United States', 'title': u'Data Scientist'}\n",
"Found a job: {'link': 'careers/oG4FZfw8', 'location': u'Chicago, IL, United States', 'title': u'Data Warehouse Engineer'}\n",
"Found a job: {'link': 'careers/on5FZfwQ', 'location': u'Chicago, IL, United States', 'title': u'Senior Data Scientist'}\n",
"Found a job: {'link': 'careers/oA1jYfwC', 'location': u'San Francisco, CA, United States', 'title': u'Senior Data Scientist'}\n",
"Found a job: {'link': 'careers/okbDZfwR', 'location': u'Chicago, IL, United States', 'title': u'Director of Social and Native Solutions'}\n",
"Found a job: {'link': 'careers/oYA2Yfwi', 'location': u'Chicago, IL, United States', 'title': u'Staff Accountant'}\n",
"Found a job: {'link': 'careers/oYPzZfw5', 'location': u'Toronto, Canada', 'title': u'Staff Accountant'}\n",
"Found a job: {'link': 'careers/oDIoZfws', 'location': u'Chicago, IL, United States', 'title': u'Client Analytics Manager'}\n",
"Found a job: {'link': 'careers/oNIDZfwR', 'location': u'New York, NY, United States', 'title': u'Paid Search Supervisor'}\n",
"Found a job: {'link': 'careers/oIaxZfw8', 'location': u'Chicago, IL, United States', 'title': u'Programmatic Buyer'}\n",
"Found a job: {'link': 'careers/opj6Yfww', 'location': u'New York, NY, United States', 'title': u'Search Strategist'}\n",
"Found a job: {'link': 'careers/oKnFZfwv', 'location': u'Chicago, IL, United States', 'title': u'Social Media Strategist'}\n",
"Found a job: {'link': 'careers/oLZ3Yfwv', 'location': u'Toronto, Canada', 'title': u'Associate Software Engineer'}\n",
"Found a job: {'link': 'careers/oU4FZfwm', 'location': u'Chicago, IL, United States', 'title': u'Data Analyst'}\n",
"Found a job: {'link': 'careers/on6SYfw3', 'location': u'Chicago, IL, United States', 'title': u'Developer Operations Engineer'}\n",
"Found a job: {'link': 'careers/oAzmYfwd', 'location': u'Chicago, IL, United States', 'title': u'Product Manager'}\n",
"Found a job: {'link': 'careers/o58hZfwd', 'location': u'Toronto, Canada', 'title': u'Product Manager'}\n",
"Found a job: {'link': 'careers/oOHiYfwv', 'location': u'Toronto, Canada', 'title': u'Quality Assurance Engineer'}\n",
"Found a job: {'link': 'careers/oyHGZfwE', 'location': u'Chicago, IL, United States', 'title': u'Senior Product Manager'}\n",
"Found a job: {'link': 'careers/oRriYfwi', 'location': u'San Francisco, CA, United States', 'title': u'Senior Product Manager'}\n",
"Found a job: {'link': 'careers/oHHiYfwo', 'location': u'Chicago, IL, United States', 'title': u'Senior Software Engineer in Test'}\n",
"Found a job: {'link': 'careers/oNHiYfwu', 'location': u'Toronto, Canada', 'title': u'Senior Software Engineer in Test'}\n",
"Found a job: {'link': 'careers/oz5FZfw2', 'location': u'Chicago, IL, United States', 'title': u'Senior Software Engineer, Big Data'}\n",
"Found a job: {'link': 'careers/oB7GZfw7', 'location': u'San Francisco, CA, United States', 'title': u'Senior Software Engineer, Big Data'}\n",
"Found a job: {'link': 'careers/oxqaYfwP', 'location': u'Toronto, Canada', 'title': u'Senior Software Engineer, JavaScript'}\n",
"Found a job: {'link': 'careers/oOKvZfwM', 'location': u'Chicago, IL, United States', 'title': u'Senior Software Engineer, Rails'}\n",
"Found a job: {'link': 'careers/oVKiYfwF', 'location': u'San Francisco, CA, United States', 'title': u'Senior Software Engineer, Rails'}\n",
"Found a job: {'link': 'careers/oSKiYfwC', 'location': u'San Francisco, CA, United States', 'title': u'Senior Software Engineer, UI'}\n",
"Found a job: {'link': 'careers/oWHiYfwD', 'location': u'Toronto, Canada', 'title': u'Senior Software Engineer, UI'}\n",
"Found a job: {'link': 'careers/olZXYfwZ', 'location': u'Toronto, Canada', 'title': u'Senior Systems Administrator'}\n",
"Found a job: {'link': 'careers/oBTDZfwQ', 'location': u'Chicago, IL, United States', 'title': u'Senior UX Designer'}\n",
"Found a job: {'link': 'careers/ojY8Yfw7', 'location': u'San Francisco, CA, United States', 'title': u'Senior Vice President, Data'}\n",
"Found a job: {'link': 'careers/oOZ3Yfwy', 'location': u'Toronto, Canada', 'title': u'Software Engineer'}\n",
"Found a job: {'link': 'careers/oR6HZfwn', 'location': u'Toronto, Canada', 'title': u'Software Engineer'}\n",
"Found a job: {'link': 'careers/okShZfwc', 'location': u'Toronto, Canada', 'title': u'Software Engineer in Test'}\n",
"Found a job: {'link': 'careers/owKwZfwv', 'location': u'Chicago, IL, United States', 'title': u'Software Engineer, Rails'}\n",
"Found a job: {'link': 'careers/o7KiYfwR', 'location': u'San Francisco, CA, United States', 'title': u'Software Engineer, Rails'}\n",
"Found a job: {'link': 'careers/oZYwZfwc', 'location': u'Chicago, IL, United States', 'title': u'Software Engineer, UI'}\n",
"Found a job: {'link': 'careers/op7wZfwL', 'location': u'Toronto, Canada', 'title': u'Technical Recruiter - Contract'}\n",
"Found a job: {'link': 'careers/oSvnYfws', 'location': u'Chicago, IL, United States', 'title': u'Technical Support Analyst'}\n",
"Found a job: {'link': 'careers/ouItZfwo', 'location': u'Chicago, IL, United States', 'title': u'UX Designer'}\n",
"Found a job: {'link': 'careers/oatFZfw1', 'location': u'Los Angeles, CA, United States', 'title': u'Channel Sales Manager'}\n",
"Found a job: {'link': 'careers/oVpFZfwI', 'location': u'Atlanta, GA, United States', 'title': u'Account Executive'}\n",
"Found a job: {'link': 'careers/oWvFZfwP', 'location': u'Chicago, IL, United States', 'title': u'Account Executive'}\n",
"Found a job: {'link': 'careers/oDuyZfwo', 'location': u'Phoenix, AZ, United States', 'title': u'Account Executive'}\n",
"Found a job: {'link': 'careers/oADGZfwC', 'location': u'San Francisco, CA, United States', 'title': u'Account Executive'}\n",
"Found a job: {'link': 'careers/orqFZfwf', 'location': u'Seattle, WA, United States', 'title': u'Regional Sales Manager'}\n",
"This is not a real job listing: <a class=\"positions__link\" href=\"http://www.jobvite.com\" target=\"__blank\">Powered by Jobvite</a>\n"
]
}
],
"prompt_number": 23
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# How many job listings do we have now?\n",
"print(\"Found {job_count} job listings.\".format(job_count=len(all_positions)))"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Found 52 job listings.\n"
]
}
],
"prompt_number": 24
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Now we have a list of dictionaries. Let's do a quick operation on it: keep only the job listings\n",
"# that are in Chicago. A list comprehension returns a list on both Python 2 and 3, so len() works.\n",
"chicago_positions = [position for position in all_positions if \"Chicago\" in position[\"location\"]]\n",
"print(\"Found {job_count} jobs in Chicago:\".format(job_count=len(chicago_positions)))\n",
"# Python has a module called pprint, short for pretty print. It displays data structures with some\n",
"# indentation so they're easier to read.\n",
"from pprint import pprint as pretty_print\n",
"pretty_print(chicago_positions)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Found 22 jobs in Chicago:\n",
"[{'link': 'careers/oudfZfwF',\n",
" 'location': u'Chicago, IL, United States',\n",
" 'title': u'Corporate Business Analyst'},\n",
" {'link': 'careers/oxlyZfw9',\n",
" 'location': u'Chicago, IL, United States',\n",
" 'title': u'Corporate Program Manager'},\n",
" {'link': 'careers/oG4FZfw8',\n",
" 'location': u'Chicago, IL, United States',\n",
" 'title': u'Data Warehouse Engineer'},\n",
" {'link': 'careers/on5FZfwQ',\n",
" 'location': u'Chicago, IL, United States',\n",
" 'title': u'Senior Data Scientist'},\n",
" {'link': 'careers/okbDZfwR',\n",
" 'location': u'Chicago, IL, United States',\n",
" 'title': u'Director of Social and Native Solutions'},\n",
" {'link': 'careers/oYA2Yfwi',\n",
" 'location': u'Chicago, IL, United States',\n",
" 'title': u'Staff Accountant'},\n",
" {'link': 'careers/oDIoZfws',\n",
" 'location': u'Chicago, IL, United States',\n",
" 'title': u'Client Analytics Manager'},\n",
" {'link': 'careers/oIaxZfw8',\n",
" 'location': u'Chicago, IL, United States',\n",
" 'title': u'Programmatic Buyer'},\n",
" {'link': 'careers/oKnFZfwv',\n",
" 'location': u'Chicago, IL, United States',\n",
" 'title': u'Social Media Strategist'},\n",
" {'link': 'careers/oU4FZfwm',\n",
" 'location': u'Chicago, IL, United States',\n",
" 'title': u'Data Analyst'},\n",
" {'link': 'careers/on6SYfw3',\n",
" 'location': u'Chicago, IL, United States',\n",
" 'title': u'Developer Operations Engineer'},\n",
" {'link': 'careers/oAzmYfwd',\n",
" 'location': u'Chicago, IL, United States',\n",
" 'title': u'Product Manager'},\n",
" {'link': 'careers/oyHGZfwE',\n",
" 'location': u'Chicago, IL, United States',\n",
" 'title': u'Senior Product Manager'},\n",
" {'link': 'careers/oHHiYfwo',\n",
" 'location': u'Chicago, IL, United States',\n",
" 'title': u'Senior Software Engineer in Test'},\n",
" {'link': 'careers/oz5FZfw2',\n",
" 'location': u'Chicago, IL, United States',\n",
" 'title': u'Senior Software Engineer, Big Data'},\n",
" {'link': 'careers/oOKvZfwM',\n",
" 'location': u'Chicago, IL, United States',\n",
" 'title': u'Senior Software Engineer, Rails'},\n",
" {'link': 'careers/oBTDZfwQ',\n",
" 'location': u'Chicago, IL, United States',\n",
" 'title': u'Senior UX Designer'},\n",
" {'link': 'careers/owKwZfwv',\n",
" 'location': u'Chicago, IL, United States',\n",
" 'title': u'Software Engineer, Rails'},\n",
" {'link': 'careers/oZYwZfwc',\n",
" 'location': u'Chicago, IL, United States',\n",
" 'title': u'Software Engineer, UI'},\n",
" {'link': 'careers/oSvnYfws',\n",
" 'location': u'Chicago, IL, United States',\n",
" 'title': u'Technical Support Analyst'},\n",
" {'link': 'careers/ouItZfwo',\n",
" 'location': u'Chicago, IL, United States',\n",
" 'title': u'UX Designer'},\n",
" {'link': 'careers/oWvFZfwP',\n",
" 'location': u'Chicago, IL, United States',\n",
" 'title': u'Account Executive'}]\n"
]
}
],
"prompt_number": 27
},
{
"cell_type": "code",
"collapsed": false,
"input": [],
"language": "python",
"metadata": {},
"outputs": []
}
],
"metadata": {}
}
]
}
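
The notebook stops at a list of dictionaries whose `link` values are relative paths such as `careers/oO9FZfwl`. As a possible next step, here is a minimal sketch (Python 3, standard library only) that turns those paths into absolute URLs and counts listings per location. It assumes the links are relative to the page URL that was requested; the names `BASE_URL`, `sample_positions`, `with_absolute_links`, and `count_by_location` are hypothetical and not part of the original notebook. The sample rows are copied from the notebook's output.

```python
from collections import Counter
from urllib.parse import urljoin

# Assumption: relative links resolve against the URL the notebook requested.
BASE_URL = "http://www.centro.net/careers"

# A few rows copied from the notebook's output, standing in for all_positions.
sample_positions = [
    {"link": "careers/oO9FZfwl", "title": "Account Manager",
     "location": "Denver, CO, United States"},
    {"link": "careers/oudfZfwF", "title": "Corporate Business Analyst",
     "location": "Chicago, IL, United States"},
    {"link": "careers/oYPzZfw5", "title": "Staff Accountant",
     "location": "Toronto, Canada"},
    {"link": "careers/oG4FZfw8", "title": "Data Warehouse Engineer",
     "location": "Chicago, IL, United States"},
]


def with_absolute_links(positions, base_url=BASE_URL):
    """Return copies of the position dicts with 'link' resolved against base_url."""
    return [dict(position, link=urljoin(base_url, position["link"]))
            for position in positions]


def count_by_location(positions):
    """Count how many positions are listed for each location string."""
    return Counter(position["location"] for position in positions)


absolute_positions = with_absolute_links(sample_positions)
location_counts = count_by_location(sample_positions)

for position in absolute_positions:
    print("{title}: {link}".format(**position))
for location, count in location_counts.most_common():
    print("{count} listing(s) in {location}".format(count=count, location=location))
```

Building new dictionaries rather than modifying the originals keeps `sample_positions` unchanged, so earlier steps such as the Chicago filter can be rerun on the same data.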