A set of Python scripts to download an HTML page and generate JSON fragments for error codes.
Sources
Todo list:
- Download source
- Parse HTML data and generate JSON
- Import JSON into search engine
Progress:
We downloaded the HTML using the Python script. Now it's time to use the REPL to identify the interesting bits of information that we need from the page.
Using bpython and the following helper functions, we load the page into memory:

from pathlib import Path
from bs4 import BeautifulSoup

def get_soup(html_page):
    return BeautifulSoup(html_page, 'html.parser')

def read_file(file_path):
    return Path(file_path).read_text(encoding='utf8')
bpython -i
>>> page = get_soup(read_file('file.html'))
>>> row = page.select('table tbody tr')[0]
>>> row
<tr id="10001">
<td rowspan="3">10001</td>
<td rowspan="3"><code>DoExpressCheckoutPayment</code>
<br/><code>DoNonReferencedCredit</code>
<br/><code>DoUATPExpressCheckoutPayment</code>
<br/><code>GetExpressCheckoutDetails</code>
<br/><code>MassPay</code>
<br/><code>RefundTransaction</code>
<br/><code>SetExpressCheckout</code>
<br/><code>TransactionSearch</code>
</td>
<td><strong><small>SHORT</small></strong></td>
<td><code>ButtonSource</code> value truncated.</td>
</tr>
We are interested in the error code, which is 10001 in the fragment above.
>>> row['id'] # gives us the error code
But when we print out all the row ids, we find that some rows have no id:
>>> [row.get('id') for row in page.select('table tbody tr')]
Let's look at a few of these:
>>> [row for row in page.select('table tbody tr') if row.get('id') is None][0:5]
We seem to be selecting all the nested tr elements as well, when we only need the top-level tr elements directly inside tbody. The total number of tr elements that we get:
>>> len(page.select('table tbody tr'))
1928
If we only select the ones with ids:
>>> len(page.select('table tbody tr[id]'))
746
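To sanity-check the attribute selector, here is a minimal sketch with a made-up nested table: the descendant selector matches both rows, while filtering on the id attribute keeps only the top-level one.

```python
from bs4 import BeautifulSoup

# Tiny made-up document: one top-level row with an id,
# and one nested row (inside an inner table) without one.
html = (
    '<table><tbody>'
    '<tr id="10001"><td>'
    '<table><tbody><tr><td>nested</td></tr></tbody></table>'
    '</td></tr>'
    '</tbody></table>'
)
soup = BeautifulSoup(html, 'html.parser')
print(len(soup.select('table tbody tr')))      # descendant selector: 2 (both rows)
print(len(soup.select('table tbody tr[id]')))  # id filter: 1 (top-level row only)
```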
That looks better, so let's print out all the ids:
>>> [row['id'] for row in page.select('table tbody tr[id]')]
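With the id-filtered rows in hand, the next todo item, generating JSON fragments, could be sketched as below. The field names (`code`, `apis`, `severity`, `message`) and the cell layout are assumptions based on the single row fragment shown above, not a confirmed schema:

```python
import json
from pathlib import Path
from bs4 import BeautifulSoup

def rows_to_json(html_page):
    """Turn each id-carrying row into a dict ready for JSON serialisation.

    Cell layout assumed from the sample row: error code, affected API
    calls, severity, and a short message.
    """
    soup = BeautifulSoup(html_page, 'html.parser')
    fragments = []
    for row in soup.select('table tbody tr[id]'):
        cells = row.find_all('td', recursive=False)
        fragments.append({
            'code': row['id'],
            'apis': [c.get_text() for c in cells[1].find_all('code')] if len(cells) > 1 else [],
            'severity': cells[2].get_text(strip=True) if len(cells) > 2 else None,
            'message': cells[3].get_text(' ', strip=True) if len(cells) > 3 else None,
        })
    return fragments

def write_fragments(html_page, out_path='errors.json'):
    """Dump the extracted fragments to a JSON file for the search engine import."""
    Path(out_path).write_text(json.dumps(rows_to_json(html_page), indent=2), encoding='utf8')
```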