Last active October 25, 2023 08:28
Scrape a table from Wikipedia using Python. Allows for cells spanning multiple rows and/or columns. Outputs a tab-separated (.tsv) file for each table.
# -*- coding: utf-8 -*-
"""
Scrape a table from Wikipedia using Python. Allows for cells spanning multiple rows and/or columns.
Outputs a tab-separated (.tsv) file for each table.

authors: panford, wassname, muzzled, Yossi
license: MIT
"""
from bs4 import BeautifulSoup
import requests
import os
import codecs

wiki = ""
header = {
    'User-Agent': 'Mozilla/5.0'
}  # Needed to prevent 403 error on Wikipedia
page = requests.get(wiki, headers=header)
# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(page.content, 'html.parser')

tables = soup.findAll("table", {"class": "wikitable"})

# show tables
for i, table in enumerate(tables):
    print("#"*10 + "Table {}".format(i) + '#'*10)

for tn, table in enumerate(tables):

    # preinit list of lists: an nrows x ncols grid of empty strings
    rows = table.findAll("tr")
    row_lengths = [len(r.findAll(['th', 'td'])) for r in rows]
    ncols = max(row_lengths)
    nrows = len(rows)
    data = []
    for i in range(nrows):
        rowD = []
        for j in range(ncols):
            rowD.append('')
        data.append(rowD)

    # process html
    for i in range(len(rows)):
        row = rows[i]
        rowD = []
        cells = row.findAll(["td", "th"])
        for j in range(len(cells)):
            cell = cells[j]

            # lots of cells span cols and rows so lets deal with that
            cspan = int(cell.get('colspan', 1))
            rspan = int(cell.get('rowspan', 1))
            l = 0
            for k in range(rspan):
                # Shifts to the first empty cell of this row
                while data[i + k][j + l]:
                    l += 1
                for m in range(cspan):
                    cell_n = j + l + m
                    row_n = i + k
                    # in some cases the colspan can overflow the table, in those cases just get the last item
                    cell_n = min(cell_n, len(data[row_n])-1)
                    data[row_n][cell_n] += cell.text

    # write data out to tab separated format
    page = os.path.split(wiki)[1]
    fname = 'output_{}_t{}.tsv'.format(page, tn)
    # utf-8 so that non-English characters are written correctly
    f = codecs.open(fname, 'w', encoding='utf-8')
    for i in range(nrows):
        rowStr = '\t'.join(data[i])
        rowStr = rowStr.replace('\n', '')
        f.write(rowStr + '\n')
    f.close()

gkcng commented Mar 24, 2018

Hi, great work! However, this doesn't work on some tables with multiple row spans in different columns: the cells get inserted in the wrong places, e.g.

This will fix it, at least for the tables within the above page. Instead of:

            for k in range(rspan):
                for l in range(cspan):

Do not append to already filled cells:

            l = 0
            for k in range(rspan):
                # Shifts to the first empty cell of this row
                while data[i+k][j+l]:
                    l += 1
                for m in range(cspan):
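A minimal sketch of why skipping already-filled cells matters, using plain tuples instead of parsed HTML so it runs without network access. The `fill_grid` helper, its `skip_filled` flag, the `(text, rowspan, colspan)` cell format, and the sample table are all assumptions made for illustration; they are not part of the gist.

```python
def fill_grid(rows, ncols, skip_filled=True):
    """Place (text, rowspan, colspan) cells into a len(rows) x ncols grid.

    Hypothetical helper: `rows` is a list of rows, each a list of
    (text, rowspan, colspan) tuples in HTML source order.
    """
    data = [[''] * ncols for _ in range(len(rows))]
    for i, cells in enumerate(rows):
        for j, (text, rspan, cspan) in enumerate(cells):
            l = 0
            for k in range(rspan):
                if skip_filled:
                    # Shift right past cells already claimed by an earlier rowspan
                    while data[i + k][j + l]:
                        l += 1
                for m in range(cspan):
                    data[i + k][j + l + m] += text
    return data


# "A" spans two rows, so the second row only lists two cells in the HTML.
sample = [
    [("A", 2, 1), ("B", 1, 1), ("C", 1, 1)],
    [("D", 1, 1), ("E", 1, 1)],
]

print(fill_grid(sample, 3))
# [['A', 'B', 'C'], ['A', 'D', 'E']]
print(fill_grid(sample, 3, skip_filled=False))
# [['A', 'B', 'C'], ['AD', 'E', '']]
```

Without the shift, "D" is appended to the cell that "A" already occupies and every later cell in that row lands one column too far left.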

panford commented Jun 12, 2019

Great code! It was of great help to me. However, I was scraping a wiki table whose figures already contained commas (e.g. 8,133,000), and after writing, those figures were split at each comma and displaced into separate cells. So I changed the delimiter from ',' to '\t' in the last for loop, which now looks like this: rowStr = '\t'.join(data[i])
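To illustrate the problem described here: joining fields with ',' makes a value like 8,133,000 indistinguishable from three separate fields, whereas a tab delimiter keeps it intact. Quoting via the standard `csv` module is another option. The sample row below is made up for illustration.

```python
import csv
import io

row = ["Exampleland", "8,133,000"]  # made-up row with commas inside a figure

# Naive comma join: the figure is split into extra fields when read back.
naive_fields = ",".join(row).split(",")

# Tab join, as in the change described above: commas inside fields are harmless.
tab_fields = "\t".join(row).split("\t")

# Alternative: csv.writer quotes any field that contains the delimiter.
buf = io.StringIO()
csv.writer(buf).writerow(row)
quoted = buf.getvalue().strip()
parsed = next(csv.reader(io.StringIO(quoted)))

print(naive_fields)  # ['Exampleland', '8', '133', '000']
print(tab_fields)    # ['Exampleland', '8,133,000']
print(quoted)        # Exampleland,"8,133,000"
print(parsed)        # ['Exampleland', '8,133,000']
```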

Oh people are using this, that's great. Thanks for the improvements panford and muzzled, I tested and incorporated them into the gist above and they help a lot.

Yossi commented Sep 10, 2019

lines 29 and 30 can be replaced with:
for tn, table in enumerate(tables):
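For readers unfamiliar with it, `enumerate` yields the index and the item together, so no separate index lookup is needed. The index-based loop below is only an assumed example of what such a loop typically looks like; the two lines being replaced are not shown in this thread.

```python
tables = ["first", "second", "third"]  # stand-ins for parsed table elements

# Index-based loop (assumed shape of the older code, for comparison only)
by_index = []
for tn in range(len(tables)):
    table = tables[tn]
    by_index.append((tn, table))

# Equivalent loop using enumerate
by_enumerate = []
for tn, table in enumerate(tables):
    by_enumerate.append((tn, table))

assert by_index == by_enumerate
```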

wassname commented Sep 11, 2019

Thanks, I added that change.

dsvrsec commented Sep 13, 2019

When I try to use the code for the link "", I get the error below. Can you please help me solve this?

Traceback (most recent call last):

File "", line 65, in
while data[i + k][j + l]:

IndexError: list index out of range
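One plausible cause of this error, offered as an assumption since the page in question is not shown: the `while data[i + k][j + l]` scan runs off the end of a row that is already full, or a `rowspan` claims more rows than the table contains. A bounds-checked variant might look like the sketch below. The `first_free_column` and `place_cell` helpers are hypothetical and operate on a plain list-of-lists grid.

```python
def first_free_column(row, start):
    """Return the index of the first empty cell in `row` at or after `start`,
    or None if there is none. Hypothetical helper."""
    col = start
    while col < len(row):
        if not row[col]:
            return col
        col += 1
    return None


def place_cell(data, i, j, text, rspan=1, cspan=1):
    """Write `text` into the grid, skipping any part of the span that
    falls outside the grid instead of raising IndexError."""
    col = j
    for k in range(rspan):
        if i + k >= len(data):
            break  # rowspan overflows the bottom of the table
        row = data[i + k]
        free = first_free_column(row, col)
        if free is None:
            continue  # this row has no empty cell left
        col = free  # keep the shift for the following rows of the span
        for m in range(cspan):
            if col + m < len(row):
                row[col + m] += text


grid = [["x", "y"], ["", ""]]
place_cell(grid, 0, 0, "z")               # row 0 is full: nothing written, no exception
place_cell(grid, 1, 0, "tall", rspan=3)   # rowspan of 3, but only one row remains
print(grid)  # [['x', 'y'], ['tall', '']]
```

Dropping the overflow silently can hide a malformed table, so logging the skipped cell may be a better choice in practice.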

Worked for me, and I notice your line number doesn't correspond to the latest version. Perhaps try the latest code?

dsvrsec commented Sep 13, 2019

I am trying to extract an infobox, so I used class="infobox". It works for some Wikipedia pages, but for this type of company infobox it throws an error.

#37 to #42 could be replaced by: data = [[''] * ncols for i in range(nrows)]
#45: for i in range(nrows):
#47, #68: all instances of rowD should be removed
#73: f = codecs.open(fname, mode='w', encoding='utf-8') <- to deal with non-English characters
Thank you very much for this great gist!!!
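A quick check that the suggested comprehension builds the same grid as nested loops, plus a related pitfall worth knowing when shortening it further. The sizes below are arbitrary.

```python
nrows, ncols = 3, 4  # arbitrary sizes for illustration

# Nested-loop initialisation
data_loops = []
for i in range(nrows):
    rowD = []
    for j in range(ncols):
        rowD.append('')
    data_loops.append(rowD)

# Suggested one-line replacement
data = [[''] * ncols for i in range(nrows)]
assert data == data_loops

# Each row is a distinct list, so writing one cell leaves other rows alone.
data[0][0] = 'a'
assert data[1][0] == ''

# Pitfall: multiplying the outer list repeats the SAME row object.
aliased = [[''] * ncols] * nrows
aliased[0][0] = 'a'
assert aliased[1][0] == 'a'
```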
