Last active
July 27, 2016 16:32
-
-
Save rmax/087ef3dbcc499c63f1b4a2248161423e to your computer and use it in GitHub Desktop.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Extracting content from tables is a highly common requirement not only web scraping projects but also for casual users and data analyst who want to capture. The project `extruct` could aim to be a good-enough content extractor. It supports Json+ld and microfortmats, and we can add more common structures extractors.\n", | |
"\n", | |
"Below is shown a prototype for a `tables` extractor. The initial objective is to be better than `pandas.read_html` and based on `lxml` so it can be used along with `parsel`.\n", | |
"\n", | |
"The current prototype does a naive recognition of rows and headers, supporting multiple column headers. Further improvements could be:\n", | |
"\n", | |
"* Allow parse callbacks (i.e.: allow the user to process the `td` content).\n", | |
"* Support of vertical headers (if defined with `th`)." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"import requests\n", | |
"import pandas as pd\n", | |
"from extruct.tables import TablesExtractor\n", | |
"te = TablesExtractor()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# extract.tables" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": { | |
"collapsed": false, | |
"scrolled": true | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th colspan=\"3\" halign=\"left\">Team size</th>\n", | |
" <th colspan=\"8\" halign=\"left\"></th>\n", | |
" <th colspan=\"4\" halign=\"left\">Awards</th>\n", | |
" <th colspan=\"2\" halign=\"left\"></th>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th></th>\n", | |
" <th>Country</th>\n", | |
" <th>All</th>\n", | |
" <th>M</th>\n", | |
" <th>F</th>\n", | |
" <th>P1</th>\n", | |
" <th>P2</th>\n", | |
" <th>P3</th>\n", | |
" <th>P4</th>\n", | |
" <th>P5</th>\n", | |
" <th>P6</th>\n", | |
" <th>Total</th>\n", | |
" <th>Rank</th>\n", | |
" <th>G</th>\n", | |
" <th>S</th>\n", | |
" <th>B</th>\n", | |
" <th>HM</th>\n", | |
" <th>Leader</th>\n", | |
" <th>Deputy leader</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>United States of America</td>\n", | |
" <td>6</td>\n", | |
" <td>6</td>\n", | |
" <td></td>\n", | |
" <td>42</td>\n", | |
" <td>32</td>\n", | |
" <td>31</td>\n", | |
" <td>42</td>\n", | |
" <td>25</td>\n", | |
" <td>13</td>\n", | |
" <td>185</td>\n", | |
" <td>1</td>\n", | |
" <td>5</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>Po-Shen Loh</td>\n", | |
" <td>John Berman</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>People's Republic of China</td>\n", | |
" <td>6</td>\n", | |
" <td>6</td>\n", | |
" <td></td>\n", | |
" <td>42</td>\n", | |
" <td>36</td>\n", | |
" <td>12</td>\n", | |
" <td>42</td>\n", | |
" <td>22</td>\n", | |
" <td>27</td>\n", | |
" <td>181</td>\n", | |
" <td>2</td>\n", | |
" <td>4</td>\n", | |
" <td>2</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>Bin Xiong</td>\n", | |
" <td>Qiusheng Li</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>Republic of Korea</td>\n", | |
" <td>6</td>\n", | |
" <td>6</td>\n", | |
" <td></td>\n", | |
" <td>42</td>\n", | |
" <td>15</td>\n", | |
" <td>18</td>\n", | |
" <td>42</td>\n", | |
" <td>25</td>\n", | |
" <td>19</td>\n", | |
" <td>161</td>\n", | |
" <td>3</td>\n", | |
" <td>3</td>\n", | |
" <td>1</td>\n", | |
" <td>2</td>\n", | |
" <td>0</td>\n", | |
" <td>Yongjin Song</td>\n", | |
" <td>Bo-Hae Im</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>Democratic People's Republic of Korea</td>\n", | |
" <td>6</td>\n", | |
" <td>6</td>\n", | |
" <td></td>\n", | |
" <td>42</td>\n", | |
" <td>25</td>\n", | |
" <td>17</td>\n", | |
" <td>42</td>\n", | |
" <td>22</td>\n", | |
" <td>8</td>\n", | |
" <td>156</td>\n", | |
" <td>4</td>\n", | |
" <td>3</td>\n", | |
" <td>3</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>Yong Chol Ham</td>\n", | |
" <td>Ryong Gol Yom</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>Vietnam</td>\n", | |
" <td>6</td>\n", | |
" <td>5</td>\n", | |
" <td>1</td>\n", | |
" <td>35</td>\n", | |
" <td>21</td>\n", | |
" <td>23</td>\n", | |
" <td>42</td>\n", | |
" <td>28</td>\n", | |
" <td>2</td>\n", | |
" <td>151</td>\n", | |
" <td>5</td>\n", | |
" <td>2</td>\n", | |
" <td>3</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>Bá Khánh Trình Lê</td>\n", | |
" <td>Anh Vinh Lê</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Team size \\\n", | |
" Country All M F P1 P2 P3 P4 P5 \n", | |
"0 United States of America 6 6 42 32 31 42 25 \n", | |
"1 People's Republic of China 6 6 42 36 12 42 22 \n", | |
"2 Republic of Korea 6 6 42 15 18 42 25 \n", | |
"3 Democratic People's Republic of Korea 6 6 42 25 17 42 22 \n", | |
"4 Vietnam 6 5 1 35 21 23 42 28 \n", | |
"\n", | |
" Awards \n", | |
" P6 Total Rank G S B HM Leader Deputy leader \n", | |
"0 13 185 1 5 1 0 0 Po-Shen Loh John Berman \n", | |
"1 27 181 2 4 2 0 0 Bin Xiong Qiusheng Li \n", | |
"2 19 161 3 3 1 2 0 Yongjin Song Bo-Hae Im \n", | |
"3 8 156 4 3 3 0 0 Yong Chol Ham Ryong Gol Yom \n", | |
"4 2 151 5 2 3 1 0 Bá Khánh Trình Lê Anh Vinh Lê " | |
] | |
}, | |
"execution_count": 2, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"r = requests.get('https://www.imo-official.org/year_country_r.aspx?year=2015')\n", | |
"te.extract_tables(r.text)[0].to_pandas().head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>date</th>\n", | |
" <th>type</th>\n", | |
" <th>registration</th>\n", | |
" <th>operator</th>\n", | |
" <th>fat.</th>\n", | |
" <th>location</th>\n", | |
" <th></th>\n", | |
" <th>pic</th>\n", | |
" <th>cat</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>02-JAN-2016</td>\n", | |
" <td>Cessna 208B Grand Caravan</td>\n", | |
" <td>N540ME</td>\n", | |
" <td>Wright Air Service</td>\n", | |
" <td>0</td>\n", | |
" <td>near Anaktuvuk Pa...</td>\n", | |
" <td></td>\n", | |
" <td></td>\n", | |
" <td>A2</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>03-JAN-2016</td>\n", | |
" <td>Canadair Regional Jet CRJ-702</td>\n", | |
" <td>N164GJ</td>\n", | |
" <td>GoJet Airlines, opf. United Express</td>\n", | |
" <td>0</td>\n", | |
" <td>Chicago-O'Ha...</td>\n", | |
" <td></td>\n", | |
" <td></td>\n", | |
" <td>A2</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>04-JAN-2016</td>\n", | |
" <td>Beechcraft 200 Super King Air</td>\n", | |
" <td>N275X</td>\n", | |
" <td>Skyway Aircraft Inc.</td>\n", | |
" <td>0</td>\n", | |
" <td>St. Petersbu...</td>\n", | |
" <td></td>\n", | |
" <td></td>\n", | |
" <td>A2</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>05-JAN-2016</td>\n", | |
" <td>Basler BT-67 Turbo 67 (AC-47T)</td>\n", | |
" <td>FAC-1658</td>\n", | |
" <td>Colombian AF</td>\n", | |
" <td>0</td>\n", | |
" <td>Tres Esquina...</td>\n", | |
" <td></td>\n", | |
" <td></td>\n", | |
" <td>A2</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>08-JAN-2016</td>\n", | |
" <td>Canadair Regional Jet CRJ-200PF</td>\n", | |
" <td>SE-DUX</td>\n", | |
" <td>West Air Sweden</td>\n", | |
" <td>2</td>\n", | |
" <td>near Akkajaure</td>\n", | |
" <td></td>\n", | |
" <td></td>\n", | |
" <td>A1</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" date type registration \\\n", | |
"0 02-JAN-2016 Cessna 208B Grand Caravan N540ME \n", | |
"1 03-JAN-2016 Canadair Regional Jet CRJ-702 N164GJ \n", | |
"2 04-JAN-2016 Beechcraft 200 Super King Air N275X \n", | |
"3 05-JAN-2016 Basler BT-67 Turbo 67 (AC-47T) FAC-1658 \n", | |
"4 08-JAN-2016 Canadair Regional Jet CRJ-200PF SE-DUX \n", | |
"\n", | |
" operator fat. location pic cat \n", | |
"0 Wright Air Service 0 near Anaktuvuk Pa... A2 \n", | |
"1 GoJet Airlines, opf. United Express 0 Chicago-O'Ha... A2 \n", | |
"2 Skyway Aircraft Inc. 0 St. Petersbu... A2 \n", | |
"3 Colombian AF 0 Tres Esquina... A2 \n", | |
"4 West Air Sweden 2 near Akkajaure A1 " | |
] | |
}, | |
"execution_count": 3, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"r = requests.get('https://aviation-safety.net/database/dblist.php?Year=2016')\n", | |
"te.extract_tables(r.text)[0].to_pandas().head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>0</th>\n", | |
" <th>1</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>Status:</td>\n", | |
" <td>Preliminary - official</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>Date:</td>\n", | |
" <td>Saturday 2 January 2016</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>Time:</td>\n", | |
" <td>12:05</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>Type:</td>\n", | |
" <td>Cessna 208B Grand Caravan</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>Operator:</td>\n", | |
" <td>Wright Air Service</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>5</th>\n", | |
" <td>Registration:</td>\n", | |
" <td>N540ME</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>6</th>\n", | |
" <td>C/n / msn:</td>\n", | |
" <td>208B-0540</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>7</th>\n", | |
" <td>First flight:</td>\n", | |
" <td>1996</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>8</th>\n", | |
" <td>Engines:</td>\n", | |
" <td>1 Pratt & Whitney Canada PT6A-114A</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>9</th>\n", | |
" <td>Crew:</td>\n", | |
" <td>Fatalities: 0 / Occupants: 1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>10</th>\n", | |
" <td>Passengers:</td>\n", | |
" <td>Fatalities: 0 / Occupants: 7</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>11</th>\n", | |
" <td>Total:</td>\n", | |
" <td>Fatalities: 0 / Occupants: 8</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>12</th>\n", | |
" <td>Airplane damage:</td>\n", | |
" <td>Substantial</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>13</th>\n", | |
" <td>Location:</td>\n", | |
" <td>10 km (6.3 mls) SW of Anaktuvuk Pass Airport, ...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>14</th>\n", | |
" <td>Phase:</td>\n", | |
" <td>En route (ENR)</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>15</th>\n", | |
" <td>Nature:</td>\n", | |
" <td>Domestic Scheduled Passenger</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>16</th>\n", | |
" <td>Departure airport:</td>\n", | |
" <td>Fairbanks International Airport, AK (FAI/PAFA)...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>17</th>\n", | |
" <td>Destination airport:</td>\n", | |
" <td>Anaktuvuk Pass Airport, AK (AKP/PAKP), United ...</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" 0 1\n", | |
"0 Status: Preliminary - official\n", | |
"1 Date: Saturday 2 January 2016\n", | |
"2 Time: 12:05\n", | |
"3 Type: Cessna 208B Grand Caravan\n", | |
"4 Operator: Wright Air Service\n", | |
"5 Registration: N540ME\n", | |
"6 C/n / msn: 208B-0540\n", | |
"7 First flight: 1996\n", | |
"8 Engines: 1 Pratt & Whitney Canada PT6A-114A\n", | |
"9 Crew: Fatalities: 0 / Occupants: 1\n", | |
"10 Passengers: Fatalities: 0 / Occupants: 7\n", | |
"11 Total: Fatalities: 0 / Occupants: 8\n", | |
"12 Airplane damage: Substantial\n", | |
"13 Location: 10 km (6.3 mls) SW of Anaktuvuk Pass Airport, ...\n", | |
"14 Phase: En route (ENR)\n", | |
"15 Nature: Domestic Scheduled Passenger\n", | |
"16 Departure airport: Fairbanks International Airport, AK (FAI/PAFA)...\n", | |
"17 Destination airport: Anaktuvuk Pass Airport, AK (AKP/PAKP), United ..." | |
] | |
}, | |
"execution_count": 4, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"r = requests.get('https://aviation-safety.net/database/record.php?id=20160102-0')\n", | |
"pd.concat([t.to_pandas() for t in te.extract_tables(r.text)])" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"collapsed": true | |
}, | |
"source": [ | |
"## Comparing with pandas.read_html " | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 15, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Country</th>\n", | |
" <th>Team size</th>\n", | |
" <th>P1</th>\n", | |
" <th>P2</th>\n", | |
" <th>P3</th>\n", | |
" <th>P4</th>\n", | |
" <th>P5</th>\n", | |
" <th>P6</th>\n", | |
" <th>Total</th>\n", | |
" <th>Rank</th>\n", | |
" <th>Awards</th>\n", | |
" <th>Leader</th>\n", | |
" <th>Deputy leader</th>\n", | |
" <th>All</th>\n", | |
" <th>M</th>\n", | |
" <th>F</th>\n", | |
" <th>G</th>\n", | |
" <th>S</th>\n", | |
" <th>B</th>\n", | |
" <th>HM</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>United States of America</td>\n", | |
" <td>6</td>\n", | |
" <td>6</td>\n", | |
" <td>NaN</td>\n", | |
" <td>42</td>\n", | |
" <td>32</td>\n", | |
" <td>31</td>\n", | |
" <td>42</td>\n", | |
" <td>25</td>\n", | |
" <td>13</td>\n", | |
" <td>185</td>\n", | |
" <td>1</td>\n", | |
" <td>5</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>Po-Shen Loh</td>\n", | |
" <td>John Berman</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>People's Republic of China</td>\n", | |
" <td>6</td>\n", | |
" <td>6</td>\n", | |
" <td>NaN</td>\n", | |
" <td>42</td>\n", | |
" <td>36</td>\n", | |
" <td>12</td>\n", | |
" <td>42</td>\n", | |
" <td>22</td>\n", | |
" <td>27</td>\n", | |
" <td>181</td>\n", | |
" <td>2</td>\n", | |
" <td>4</td>\n", | |
" <td>2</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>Bin Xiong</td>\n", | |
" <td>Qiusheng Li</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>Republic of Korea</td>\n", | |
" <td>6</td>\n", | |
" <td>6</td>\n", | |
" <td>NaN</td>\n", | |
" <td>42</td>\n", | |
" <td>15</td>\n", | |
" <td>18</td>\n", | |
" <td>42</td>\n", | |
" <td>25</td>\n", | |
" <td>19</td>\n", | |
" <td>161</td>\n", | |
" <td>3</td>\n", | |
" <td>3</td>\n", | |
" <td>1</td>\n", | |
" <td>2</td>\n", | |
" <td>0</td>\n", | |
" <td>Yongjin Song</td>\n", | |
" <td>Bo-Hae Im</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>Democratic People's Republic of Korea</td>\n", | |
" <td>6</td>\n", | |
" <td>6</td>\n", | |
" <td>NaN</td>\n", | |
" <td>42</td>\n", | |
" <td>25</td>\n", | |
" <td>17</td>\n", | |
" <td>42</td>\n", | |
" <td>22</td>\n", | |
" <td>8</td>\n", | |
" <td>156</td>\n", | |
" <td>4</td>\n", | |
" <td>3</td>\n", | |
" <td>3</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>Yong Chol Ham</td>\n", | |
" <td>Ryong Gol Yom</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>Vietnam</td>\n", | |
" <td>6</td>\n", | |
" <td>5</td>\n", | |
" <td>1.0</td>\n", | |
" <td>35</td>\n", | |
" <td>21</td>\n", | |
" <td>23</td>\n", | |
" <td>42</td>\n", | |
" <td>28</td>\n", | |
" <td>2</td>\n", | |
" <td>151</td>\n", | |
" <td>5</td>\n", | |
" <td>2</td>\n", | |
" <td>3</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>Bá Khánh Trình Lê</td>\n", | |
" <td>Anh Vinh Lê</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Country Team size P1 P2 P3 P4 P5 P6 \\\n", | |
"0 United States of America 6 6 NaN 42 32 31 42 \n", | |
"1 People's Republic of China 6 6 NaN 42 36 12 42 \n", | |
"2 Republic of Korea 6 6 NaN 42 15 18 42 \n", | |
"3 Democratic People's Republic of Korea 6 6 NaN 42 25 17 42 \n", | |
"4 Vietnam 6 5 1.0 35 21 23 42 \n", | |
"\n", | |
" Total Rank Awards Leader Deputy leader All M F G \\\n", | |
"0 25 13 185 1 5 1 0 0 Po-Shen Loh \n", | |
"1 22 27 181 2 4 2 0 0 Bin Xiong \n", | |
"2 25 19 161 3 3 1 2 0 Yongjin Song \n", | |
"3 22 8 156 4 3 3 0 0 Yong Chol Ham \n", | |
"4 28 2 151 5 2 3 1 0 Bá Khánh Trình Lê \n", | |
"\n", | |
" S B HM \n", | |
"0 John Berman NaN NaN \n", | |
"1 Qiusheng Li NaN NaN \n", | |
"2 Bo-Hae Im NaN NaN \n", | |
"3 Ryong Gol Yom NaN NaN \n", | |
"4 Anh Vinh Lê NaN NaN " | |
] | |
}, | |
"execution_count": 15, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"pd.read_html('https://www.imo-official.org/year_country_r.aspx?year=2015', tupleize_cols=True)[0].head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 10, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>0</th>\n", | |
" <th>1</th>\n", | |
" <th>2</th>\n", | |
" <th>3</th>\n", | |
" <th>4</th>\n", | |
" <th>5</th>\n", | |
" <th>6</th>\n", | |
" <th>7</th>\n", | |
" <th>8</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>date</td>\n", | |
" <td>type</td>\n", | |
" <td>registration</td>\n", | |
" <td>operator</td>\n", | |
" <td>fat.</td>\n", | |
" <td>location</td>\n", | |
" <td>NaN</td>\n", | |
" <td>pic</td>\n", | |
" <td>cat</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>02-JAN-2016</td>\n", | |
" <td>Cessna 208B Grand Caravan</td>\n", | |
" <td>N540ME</td>\n", | |
" <td>Wright Air Service</td>\n", | |
" <td>0</td>\n", | |
" <td>near Anaktuvuk Pa...</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>A2</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>03-JAN-2016</td>\n", | |
" <td>Canadair Regional Jet CRJ-702</td>\n", | |
" <td>N164GJ</td>\n", | |
" <td>GoJet Airlines, opf. United Express</td>\n", | |
" <td>0</td>\n", | |
" <td>Chicago-O'Ha...</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>A2</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>04-JAN-2016</td>\n", | |
" <td>Beechcraft 200 Super King Air</td>\n", | |
" <td>N275X</td>\n", | |
" <td>Skyway Aircraft Inc.</td>\n", | |
" <td>0</td>\n", | |
" <td>St. Petersbu...</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>A2</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>05-JAN-2016</td>\n", | |
" <td>Basler BT-67 Turbo 67 (AC-47T)</td>\n", | |
" <td>FAC-1658</td>\n", | |
" <td>Colombian AF</td>\n", | |
" <td>0</td>\n", | |
" <td>Tres Esquina...</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>A2</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" 0 1 2 \\\n", | |
"0 date type registration \n", | |
"1 02-JAN-2016 Cessna 208B Grand Caravan N540ME \n", | |
"2 03-JAN-2016 Canadair Regional Jet CRJ-702 N164GJ \n", | |
"3 04-JAN-2016 Beechcraft 200 Super King Air N275X \n", | |
"4 05-JAN-2016 Basler BT-67 Turbo 67 (AC-47T) FAC-1658 \n", | |
"\n", | |
" 3 4 5 6 7 \\\n", | |
"0 operator fat. location NaN pic \n", | |
"1 Wright Air Service 0 near Anaktuvuk Pa... NaN NaN \n", | |
"2 GoJet Airlines, opf. United Express 0 Chicago-O'Ha... NaN NaN \n", | |
"3 Skyway Aircraft Inc. 0 St. Petersbu... NaN NaN \n", | |
"4 Colombian AF 0 Tres Esquina... NaN NaN \n", | |
"\n", | |
" 8 \n", | |
"0 cat \n", | |
"1 A2 \n", | |
"2 A2 \n", | |
"3 A2 \n", | |
"4 A2 " | |
] | |
}, | |
"execution_count": 10, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"pd.read_html('https://aviation-safety.net/database/dblist.php?Year=2016')[0].head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 14, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>0</th>\n", | |
" <th>1</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>Status:</td>\n", | |
" <td>Preliminary - official</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>Date:</td>\n", | |
" <td>Saturday 2 January 2016</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>Time:</td>\n", | |
" <td>12:05</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>Type:</td>\n", | |
" <td>Cessna 208B Grand Caravan</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>Operator:</td>\n", | |
" <td>Wright Air Service</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>5</th>\n", | |
" <td>Registration:</td>\n", | |
" <td>N540ME</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>6</th>\n", | |
" <td>C/n / msn:</td>\n", | |
" <td>208B-0540</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>7</th>\n", | |
" <td>First flight:</td>\n", | |
" <td>1996</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>8</th>\n", | |
" <td>Engines:</td>\n", | |
" <td>1 Pratt & Whitney Canada PT6A-114A</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>9</th>\n", | |
" <td>Crew:</td>\n", | |
" <td>Fatalities: 0 / Occupants: 1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>10</th>\n", | |
" <td>Passengers:</td>\n", | |
" <td>Fatalities: 0 / Occupants: 7</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>11</th>\n", | |
" <td>Total:</td>\n", | |
" <td>Fatalities: 0 / Occupants: 8</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>12</th>\n", | |
" <td>Airplane damage:</td>\n", | |
" <td>Substantial</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>13</th>\n", | |
" <td>Location:</td>\n", | |
" <td>10 km (6.3 mls) SW of Anaktuvuk Pass Airport, ...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>14</th>\n", | |
" <td>Phase:</td>\n", | |
" <td>En route (ENR)</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>15</th>\n", | |
" <td>Nature:</td>\n", | |
" <td>Domestic Scheduled Passenger</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>16</th>\n", | |
" <td>Departure airport:</td>\n", | |
" <td>Fairbanks International Airport, AK (FAI/PAFA)...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>17</th>\n", | |
" <td>Destination airport:</td>\n", | |
" <td>Anaktuvuk Pass Airport, AK (AKP/PAKP), United ...</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" 0 1\n", | |
"0 Status: Preliminary - official\n", | |
"1 Date: Saturday 2 January 2016\n", | |
"2 Time: 12:05\n", | |
"3 Type: Cessna 208B Grand Caravan\n", | |
"4 Operator: Wright Air Service\n", | |
"5 Registration: N540ME\n", | |
"6 C/n / msn: 208B-0540\n", | |
"7 First flight: 1996\n", | |
"8 Engines: 1 Pratt & Whitney Canada PT6A-114A\n", | |
"9 Crew: Fatalities: 0 / Occupants: 1\n", | |
"10 Passengers: Fatalities: 0 / Occupants: 7\n", | |
"11 Total: Fatalities: 0 / Occupants: 8\n", | |
"12 Airplane damage: Substantial\n", | |
"13 Location: 10 km (6.3 mls) SW of Anaktuvuk Pass Airport, ...\n", | |
"14 Phase: En route (ENR)\n", | |
"15 Nature: Domestic Scheduled Passenger\n", | |
"16 Departure airport: Fairbanks International Airport, AK (FAI/PAFA)...\n", | |
"17 Destination airport: Anaktuvuk Pass Airport, AK (AKP/PAKP), United ..." | |
] | |
}, | |
"execution_count": 14, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"pd.concat(pd.read_html('https://aviation-safety.net/database/record.php?id=20160102-0'))" | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "DS (py3)", | |
"language": "python", | |
"name": "python3-ds" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.5.1" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 0 | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment