Last active
March 21, 2024 04:51
How to extract Wikipedia infoboxes and wikitables using Pandas
# -*- coding: utf-8 -*-
# <nbformat>3.0</nbformat>
# <codecell>
from pandas.io.html import read_html
page = 'https://en.wikipedia.org/wiki/University_of_California,_Berkeley'
# Select tables by their HTML class; index_col=0 uses the first column as row labels.
infoboxes = read_html(page, index_col=0, infer_types=False, attrs={"class":"infobox"})
wikitables = read_html(page, index_col=0, infer_types=False, attrs={"class":"wikitable"})
print("Extracted {num} infoboxes".format(num=len(infoboxes)))
print("Extracted {num} wikitables".format(num=len(wikitables)))
# <codecell>
infoboxes[0]
# <codecell>
infoboxes[1]
# <codecell>
wikitables[0]
# <codecell>
wikitables[1]
# <markdowncell>
# The `infer_types=False` argument turns off automatic type inference, which is needed for Pandas <0.14. Otherwise, if date-like text appears anywhere in the table, the whole column is inferred as a date, not just that particular entry, and every non-date entry becomes `NaT`. In version >=0.14 the `infer_types` argument will be removed, so it will no longer cause this kind of problem.
# <codecell>
malformed = read_html(page, index_col=0, attrs={"class":"infobox"})
malformed[0]
# <markdowncell>
# The `index_col=0` argument turns the zeroth column into a set of labels for the table rows, which is what we want for infoboxes and some (but not all) wikitables.
#
# Using `index_col` means that we can then refer directly to the entries by their label, e.g.:
# <codecell>
infoboxes[0].xs(u'Motto').values[0]
# <markdowncell>
# If the argument is left out, the rows get a numeric index instead and the zeroth column becomes part of the data.
# <codecell>
no_lefthand_labels = read_html(page, infer_types=False, attrs={"class":"infobox"})
no_lefthand_labels[0]
thanks!
Updated for new syntax (basically drop the infer_types arg):
import pandas as pd
page = 'https://en.wikipedia.org/wiki/University_of_California,_Berkeley'
infoboxes = pd.read_html(page, index_col=0, attrs={"class":"infobox"})
wikitables = pd.read_html(page, index_col=0, attrs={"class":"wikitable"})
print("Extracted {num} infoboxes".format(num=len(infoboxes)))
print("Extracted {num} wikitables".format(num=len(wikitables)))
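For anyone who wants to try the `attrs` and `index_col` behaviour without fetching the live page, here is a minimal offline sketch. The HTML string is a made-up stand-in for an infobox (it is not the real Wikipedia markup), the variable names are only illustrative, and it assumes a reasonably recent pandas with an HTML parser such as lxml installed.

```python
import io

import pandas as pd

# Hypothetical stand-in for a Wikipedia page: one infobox-style table and one
# unrelated table, so the class filter has something to exclude.
html = """
<table class="infobox">
  <tr><th>Motto</th><td>Fiat lux</td></tr>
  <tr><th>Type</th><td>Public</td></tr>
</table>
<table class="other">
  <tr><td>a</td><td>1</td></tr>
</table>
"""

# attrs keeps only tables whose class attribute matches;
# index_col=0 turns the left-hand column into row labels.
infoboxes = pd.read_html(io.StringIO(html), index_col=0, attrs={"class": "infobox"})
infobox = infoboxes[0]
motto = infobox.loc["Motto"].iloc[0]

# Without index_col the labels stay in the data as column 0.
plain = pd.read_html(io.StringIO(html), attrs={"class": "infobox"})[0]
first_label = plain.iloc[0, 0]

print("Extracted {num} infoboxes".format(num=len(infoboxes)))
print(motto)
```

Wrapping the markup in `io.StringIO` is a deliberate choice here, since newer pandas releases warn when a literal HTML string is passed directly to `read_html`.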
cool