[Developing data collector for Response codes] #python #scraping

Data collector

A set of Python scripts to download the HTML page and generate JSON fragments for error codes.

Paypal

Sources

https://developer.paypal.com/docs/classic/api/errors/

Todo list:

Download source
Parse HTML data and generate JSON
Import JSON into search engine
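
The first item, downloading the source, might look roughly like the following. This is a minimal sketch using only the standard library, not the script actually used for these notes; the function name `download_page` and the `User-Agent` header are assumptions made for illustration.

```python
from pathlib import Path
from urllib.request import Request, urlopen

ERRORS_URL = "https://developer.paypal.com/docs/classic/api/errors/"


def download_page(url, file_path, timeout=30):
    """Fetch url and save the decoded HTML to file_path; return the saved path."""
    # Assumption: a generic User-Agent is sent in case the default one is rejected.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        html = response.read().decode("utf8")
    target = Path(file_path)
    target.write_text(html, encoding="utf8")
    return target


# Usage (needs network access):
# download_page(ERRORS_URL, "file.html")
```

Saving the page to disk once means the parsing experiments can be repeated without hitting the site again.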

Progress:

We downloaded the HTML using the Python script. Next, we use a REPL to identify the pieces of information that we need from the page.

Using bpython and the following functions, we load the page into memory:

from pathlib import Path
from bs4 import BeautifulSoup


def get_soup(html_page):
    return BeautifulSoup(html_page, 'html.parser')  # parser bundled with Python


def read_file(file_path):
    return Path(file_path).read_text(encoding='utf8')

_bpython -i _

>>> page = get_soup(read_file('file.html'))
>>> row = page.select('table tbody tr')[0]
>>> row
<tr id="10001">
<td rowspan="3">10001</td>
<td rowspan="3"><code>DoExpressCheckoutPayment</code>
<br/><code>DoNonReferencedCredit</code>
<br/><code>DoUATPExpressCheckoutPayment</code>
<br/><code>GetExpressCheckoutDetails</code>
<br/><code>MassPay</code>
<br/><code>RefundTransaction</code>
<br/><code>SetExpressCheckout</code>
<br/><code>TransactionSearch</code>
</td>
<td><strong><small>SHORT</small></strong></td>
<td><code>ButtonSource</code> value truncated.</td>
</tr>

We are interested in the error code, which is 10001 in the above fragment.

>>> row['id'] # gives us the error code

But if we print out all the row ids, some rows turn out to have no id:

>>> [row.get('id') for row in page.select('table tbody tr')]

Let's look at a few of these:

>>> [row for row in page.select('table tbody tr') if row.get('id') is None][:5]

We seem to be selecting all the nested tr elements, whereas we only need the top-level tr elements inside tbody.

The total number of tr elements that we get with the original selector:

>>> len(page.select('table tbody tr'))
1928

If we only select the ones with ids

>>> len(page.select('table tbody tr[id]'))
746

That looks better, so we print out all the ids from the rows that have one:

>>> [row.get('id') for row in page.select('table tbody tr[id]')]
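
With only the rows that carry an id selected, the next todo item (generating JSON) could be sketched as below. This is an illustrative sketch rather than the final script: the field names `code` and `apis`, the helper names, and the trimmed sample markup (including the id-less second row) are assumptions, and it takes the second cell of each row to hold the API names, as in the 10001 fragment above.

```python
import json

from bs4 import BeautifulSoup

# Trimmed sample modelled on the 10001 row; the second row is a hypothetical row without an id.
SAMPLE = """
<table><tbody>
<tr id="10001">
<td rowspan="3">10001</td>
<td rowspan="3"><code>MassPay</code>
<br/><code>RefundTransaction</code>
</td>
<td><strong><small>SHORT</small></strong></td>
<td><code>ButtonSource</code> value truncated.</td>
</tr>
<tr><td>row without an id</td></tr>
</tbody></table>
"""


def row_to_fragment(row):
    """Turn one <tr id="..."> element into a plain dict."""
    cells = row.find_all("td")
    return {
        "code": row["id"],
        # Second cell lists the API operations, one per <code> element
        "apis": [code.get_text(strip=True) for code in cells[1].find_all("code")],
    }


def fragments(html_page):
    """Build a fragment for every row that has an id attribute."""
    soup = BeautifulSoup(html_page, "html.parser")
    return [row_to_fragment(row) for row in soup.select("table tbody tr[id]")]


print(json.dumps(fragments(SAMPLE), indent=2))
```

Rows without an id are skipped by the selector itself, so no extra filtering is needed before serialising.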
