@elyase
Last active December 23, 2015 22:29
scrape victor: View at http://nbviewer.ipython.org/6703739
{
"metadata": {
"name": "Scrape Victor"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": "Download data"
},
{
"cell_type": "code",
"collapsed": false,
"input": "from pyquery import PyQuery as pq\n\nurl = 'https://www.nuans.com/RTS2/en/jur_codes-codes_jur_en.cgi#Example_of_report_layouts'\nd = pq(url)",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 62
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": "Method 1: Traversing the tables"
},
{
"cell_type": "code",
"collapsed": false,
"input": "%%timeit\n\nl = [[td.text(), td.next().text()]\n     for table in d('.borderless').items()\n     for td in table('td:nth-child(1)').items()  # left column\n     if table('th:first').text() == 'NUANS Reports & Preliminary Searches'\n     and td.next().text() in ('Active', 'Inactive')]",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "10 loops, best of 3: 172 ms per loop\n"
}
],
"prompt_number": 143
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": "Method 2: Traversing the left 'td' elements"
},
{
"cell_type": "code",
"collapsed": false,
"input": "%%timeit\n\nl = []\nfor td in d.items('.borderless td:nth-child(1)'):  # left-column cells\n    left = td.text()\n    right = td.next().text()  # sibling cell in the right column\n    tr = td.parent()\n    tbody = tr.parent()\n    title = tbody('th:first').text()  # first header cell\n    if title == 'NUANS Reports & Preliminary Searches' and right in ['Active', 'Inactive']:\n        l.append([left, right])",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "1 loops, best of 3: 199 ms per loop\n"
}
],
"prompt_number": 140
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": "Write results to csv"
},
{
"cell_type": "code",
"collapsed": false,
"input": "import csv\n\n# Python 2: the csv module expects the file opened in binary mode\nwith open('vic_scrape.csv', 'wb') as csvfile:\n    csv.writer(csvfile, delimiter=',').writerows(l)",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 108
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": "Test"
},
{
"cell_type": "code",
"collapsed": false,
"input": "l[:10]",
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 132,
"text": "[['Amlgmtd', 'Inactive'],\n ['Bankrupt', 'Active'],\n ['Cancelled', 'Inactive'],\n ['Cnttn_Out', 'Inactive'],\n ['Deleted', 'Inactive'],\n ['Dissolved', 'Inactive'],\n ['Historic', 'Inactive'],\n ['Lqdtd', 'Active'],\n ['LT_CrtOrd', 'Active'],\n ['Pnd_Rstrn', 'Active']]"
}
],
"prompt_number": 132
},
{
"cell_type": "code",
"collapsed": false,
"input": "!head vic_scrape.csv",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "Amlgmtd,Inactive\r\r\nBankrupt,Active\r\r\nCancelled,Inactive\r\r\nCnttn_Out,Inactive\r\r\nDeleted,Inactive\r\r\nDissolved,Inactive\r\r\nHistoric,Inactive\r\r\nLqdtd,Active\r\r\nLT_CrtOrd,Active\r\r\nPnd_Rstrn,Active\r\r\n"
}
],
"prompt_number": 109
}
],
"metadata": {}
}
]
}
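
The notebook's "Write results to csv" cell opens the file with mode 'wb', which is the Python 2 idiom. A minimal sketch of the same step under Python 3 might look like the following, assuming the scraped rows are already in a list of two-item lists. The helper names `write_rows` and `read_rows` and the temporary directory are hypothetical and used only for illustration; the two sample rows are copied from the notebook's test output.

```python
import csv
import os
import tempfile

# Two sample rows taken from the notebook's test output; the full list
# would come from the scraping step.
rows = [['Amlgmtd', 'Inactive'], ['Bankrupt', 'Active']]


def write_rows(path, rows):
    """Write rows to a CSV file (hypothetical helper, Python 3)."""
    # In Python 3 the file is opened in text mode; newline='' leaves
    # line-ending handling to the csv module.
    with open(path, 'w', newline='') as csvfile:
        csv.writer(csvfile, delimiter=',').writerows(rows)


def read_rows(path):
    """Read the CSV back into a list of lists (hypothetical helper)."""
    with open(path, newline='') as csvfile:
        return list(csv.reader(csvfile))


# Write to a temporary directory so the sketch does not touch real files.
path = os.path.join(tempfile.mkdtemp(), 'vic_scrape.csv')
write_rows(path, rows)
print(read_rows(path))
```

Reading the file back is a quick way to confirm that the rows survive the round trip unchanged.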