@namuan
Created August 26, 2018 18:19
[Developing data collector for Response codes] #python #scraping

Data collector

A set of Python scripts that download an HTML page and generate JSON fragments for the error codes it lists.

PayPal

Sources

Todo list:

  • Download source
  • Parse HTML data and generate JSON
  • Import JSON into search engine
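The download step in the list above can be sketched with the standard library alone. This is only a rough outline and not necessarily how the original script works: the function name `download_page` and the destination file name are assumptions, and the source URL is left as a parameter because it is not given in these notes.

```python
from pathlib import Path
from urllib.request import urlopen


def download_page(url, dest_path):
    """Fetch url and save the raw response bytes to dest_path.

    Hypothetical helper: the name and signature are not from the original script.
    """
    with urlopen(url) as response:
        data = response.read()
    dest = Path(dest_path)
    dest.write_bytes(data)
    return dest
```

Calling `download_page(some_url, 'file.html')` would then leave a local copy that the parsing step can read as often as needed without hitting the site again.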

Progress:

We downloaded the HTML using the Python script. The next step is to use a REPL to identify the interesting bits of information that we need from the page.

Using bpython and the following functions, we load the page into memory:

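from pathlib import Path

from bs4 import BeautifulSoup


def get_soup(html_page):
    # Parse the HTML text with Python's built-in parser
    return BeautifulSoup(html_page, 'html.parser')


def read_file(file_path):
    return Path(file_path).read_text(encoding='utf8')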

_bpython -i _

>>> page = get_soup(read_file('file.html'))
>>> row = page.select('table tbody tr')[0]
>>> row
<tr id="10001">
<td rowspan="3">10001</td>
<td rowspan="3"><code>DoExpressCheckoutPayment</code>
<br/><code>DoNonReferencedCredit</code>
<br/><code>DoUATPExpressCheckoutPayment</code>
<br/><code>GetExpressCheckoutDetails</code>
<br/><code>MassPay</code>
<br/><code>RefundTransaction</code>
<br/><code>SetExpressCheckout</code>
<br/><code>TransactionSearch</code>
</td>
<td><strong><small>SHORT</small></strong></td>
<td><code>ButtonSource</code> value truncated.</td>
</tr>

We are interested in the error code, which is 10001 in the fragment above.

>>> row['id'] # gives us the error code
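Beyond the id, the same row carries the API names and the short message, so a whole record could be pulled out in one go. The following is a minimal sketch against a shortened copy of the row shown above. The helper name `row_to_record` and the dictionary keys are assumptions, and it relies on the cell order of this one fragment (code, API list, message type, message), which may not hold for every row on the page.

```python
from bs4 import BeautifulSoup

# Shortened copy of the row shown above (only two of the API names kept).
ROW_HTML = """
<table><tbody>
<tr id="10001">
<td rowspan="3">10001</td>
<td rowspan="3"><code>DoExpressCheckoutPayment</code>
<br/><code>MassPay</code>
</td>
<td><strong><small>SHORT</small></strong></td>
<td><code>ButtonSource</code> value truncated.</td>
</tr>
</tbody></table>
"""


def row_to_record(row):
    """Turn one <tr> that has an id into a plain dict (key names are made up)."""
    cells = row.find_all('td', recursive=False)
    return {
        'code': row['id'],
        'apis': [tag.get_text(strip=True) for tag in cells[1].find_all('code')],
        'message_type': cells[2].get_text(strip=True),
        'message': cells[3].get_text(' ', strip=True),
    }


soup = BeautifulSoup(ROW_HTML, 'html.parser')
record = row_to_record(soup.select('table tbody tr')[0])
print(record)
```

A dict like this maps directly onto a JSON object, which is the shape the later import step would need.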

But when we print out all the row ids, some rows turn out to have no id:

>>> [row.get('id') for row in page.select('table tbody tr')]

Let's look at a few of these:

>>> [row for row in page.select('table tbody tr') if row.get('id') is None][:5]

We seem to be selecting all the nested tr elements, whereas we only need the top-level tr elements inside tbody.

The total number of tr elements that we get with this selector:
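If the rows without an id really do come from nested tables, one way to keep only the top-level rows is to walk down one level at a time with `recursive=False` instead of using a descendant selector. The sketch below uses a made-up two-row page to show the difference. Note that this is only one possible explanation: the `rowspan="3"` cells in the fragment above suggest the id-less rows could also be continuation rows at the same level, in which case this approach would not reduce the count.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the first outer row contains a nested table.
HTML = """
<table id="outer"><tbody>
<tr id="10001"><td>10001</td><td>
  <table><tbody><tr><td>nested</td></tr></tbody></table>
</td></tr>
<tr id="10002"><td>10002</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(HTML, 'html.parser')

# Descendant selector: matches the row of the nested table as well.
all_rows = soup.select('table tbody tr')

# Direct children only: first table, its own tbody, then that tbody's own rows.
outer_tbody = soup.find('table').find('tbody', recursive=False)
top_rows = outer_tbody.find_all('tr', recursive=False)

print(len(all_rows), len(top_rows))
print([row.get('id') for row in all_rows])
```

A plain child combinator such as `table > tbody > tr` would not be enough here, because the nested table is itself a `table` with its own `tbody > tr` chain.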

>>> len(page.select('table tbody tr'))
1928

If we only select the ones with ids

>>> len(page.select('table tbody tr[id]'))
746

That looks better, so we print out all the ids, this time selecting only the rows that have one:

>>> [row.get('id') for row in page.select('table tbody tr[id]')]
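Once only the rows with an id are selected, turning them into JSON fragments is a small step. Here is a rough sketch on a made-up three-row table. The field names `provider` and `code`, the one-object-per-line layout, and the contents of the id-less row are all assumptions, since these notes do not say what the search engine expects.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical table: two rows with an id and one row without.
HTML = """
<table><tbody>
<tr id="10001"><td>10001</td><td>SHORT</td></tr>
<tr><td>LONG</td></tr>
<tr id="10002"><td>10002</td><td>SHORT</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(HTML, 'html.parser')

# tr[id] keeps only the rows that carry an id attribute.
codes = [row['id'] for row in soup.select('table tbody tr[id]')]

# One JSON object per line; 'provider' and 'code' are made-up field names.
fragments = [json.dumps({'provider': 'paypal', 'code': code}) for code in codes]
print('\n'.join(fragments))
```

Keeping each fragment on its own line makes it easy to stream the output into a bulk import later, whatever format that import ends up needing.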