Last active
June 19, 2021 13:31
Top Wikipedia stories in 2014
import re
import urllib.request

import pandas as pd
from bs4 import BeautifulSoup, element

# Read the HTML from the Wikipedia stats page and convert it to soup
soup = BeautifulSoup(urllib.request.urlopen('http://stats.wikimedia.org/EN/TablesWikipediaEN.htm').read())

# Find all the bold tags whose text mentions 2014
_p = soup.find_all('b', string=re.compile('2014'))

# Keep only the parents with exactly 152 children, which correspond to the top-25 lists
_p2014 = [t.parent for t in _p if len(t.parent) == 152]

# Get the text out of the child tags as a list of lists, skipping bare strings
parsed = [[t.text for t in p.children if isinstance(t, element.Tag)] for p in _p2014]

# Convert to a dictionary keyed by month abbreviation, with the remaining text fields as values
parsed = {month[0].split('\xa0')[0]: month[1:] for month in parsed}

# Group each month's flat field list into (rank, editors, article) triples
# by zipping three strided slices, and turn each triple into a record
parsed = {k: [{'rank': int(a), 'editors': int(b), 'article': c}
              for a, b, c in zip(v[0::3], v[1::3], v[2::3])]
          for k, v in parsed.items()}

# Convert each month into a DataFrame with the month in the index, concatenate
# the frames, and sort so the rows with the most editors come first
ranked = pd.concat([pd.DataFrame(records, index=[month] * len(records))
                    for month, records in parsed.items()])
ranked = ranked.sort_values('editors', ascending=False).reset_index()

# Rename the reset index to something meaningful
ranked.rename(columns={'index': 'month'}, inplace=True)

# Group the articles by name and compute aggregate statistics, then
# rank on the number of months in the top 25 and the total number of editors
top_articles = (ranked.groupby('article')
                .agg({'month': 'count', 'editors': 'sum', 'rank': 'min'})
                .sort_values(['month', 'editors'], ascending=False))
top_articles
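The record-building step in the script relies on strided slices: `v[0::3]`, `v[1::3]` and `v[2::3]` pick out every third field starting at offsets 0, 1 and 2, and `zip` pairs them back up into (rank, editors, article) triples. A minimal sketch of that idea in isolation, assuming a flat list laid out the same way; the helper name `to_records` and the field values are made up for illustration and are not taken from the Wikimedia page:

```python
# Hypothetical flat list of text fields for one month:
# rank, editors, article, repeated once per article.
fields = ['1', '120', 'Article A', '2', '95', 'Article B', '3', '60', 'Article C']


def to_records(fields):
    """Group a flat [rank, editors, article, ...] list into one dict per article."""
    return [
        {'rank': int(rank), 'editors': int(editors), 'article': article}
        for rank, editors, article in zip(fields[0::3], fields[1::3], fields[2::3])
    ]


records = to_records(fields)
for record in records:
    print(record)
```

Because `zip` stops at the shortest input, a trailing incomplete triple would be dropped silently rather than raising an error, which is worth keeping in mind if the page layout ever changes.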
brianckeegan (author) commented on Dec 20, 2014:
Article | Total editors across months | Minimum rank | Total months in Top 25 |
---|---|---|---|
Deaths in 2014 | 1848 | 1 | 10 |
Islamic State of Iraq and the Levant | 752 | 2 | 5 |
Malaysia Airlines Flight 370 | 1147 | 1 | 4 |
Ebola virus epidemic in West Africa | 758 | 1 | 4 |
Ukraine | 416 | 7 | 4 |
Frozen (2013 film) | 344 | 7 | 4 |
2014 Israel-Gaza conflict | 675 | 2 | 3 |
2014 pro-Russian unrest in Ukraine | 314 | 8 | 3 |
War in Donbass | 273 | 12 | 3 |
Malaysia Airlines Flight 17 | 755 | 1 | 2 |
2014 FIFA World Cup | 488 | 1 | 2 |
2014 Crimean crisis | 476 | 2 | 2 |
Ebola virus disease | 296 | 6 | 2 |
2014 Russian military intervention in Ukraine | 294 | 3 | 2 |
2014 Winter Olympics | 275 | 2 | 2 |
2014 FIFA World Cup squads | 261 | 6 | 2 |
SummerSlam (2014) | 250 | 8 | 2 |
2014 Ukrainian revolution | 240 | 4 | 2 |
2014 Hong Kong protests | 236 | 5 | 2 |
Super Bowl XLVIII | 233 | 4 | 2 |
Eurovision Song Contest 2014 | 231 | 4 | 2 |
Indian general election, 2014 | 210 | 8 | 2 |
Gamergate controversy | 201 | 9 | 2 |
Narendra Modi | 187 | 9 | 2 |
Transformers: Age of Extinction | 185 | 15 | 2 |
Big Brother 16 (U.S.) | 181 | 15 | 2 |
Euromaidan | 179 | 10 | 2 |
Kick (2014 film) | 177 | 14 | 2 |
The Amazing Spider-Man 2 | 177 | 12 | 2 |
2014 military intervention against the Islamic State of Iraq and the Levant | 165 | 10 | 2 |
FIFA 15 | 158 | 12 | 2 |
Nash Grier | 156 | 19 | 2 |
Bitcoin | 130 | 21 | 2 |
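The three numeric columns in the table are per-article aggregates over the monthly top-25 rows: a sum of editors, a minimum rank, and a count of months. The sketch below shows one way to produce them with pandas named aggregation, which is a swapped-in alternative to the dict-based `agg` call in the script. The rows and the output column names (`total_editors`, `min_rank`, `months_in_top25`) are invented for illustration and are not Wikimedia data.

```python
import pandas as pd

# Made-up monthly top-25 rows, for illustration only.
ranked = pd.DataFrame({
    'month':   ['Jan', 'Feb', 'Mar', 'Jan', 'Feb'],
    'article': ['Article A', 'Article A', 'Article A', 'Article B', 'Article B'],
    'editors': [100, 80, 60, 150, 90],
    'rank':    [2, 3, 5, 1, 2],
})

# One row per article: total editors, best (lowest) rank, and months listed.
summary = (
    ranked.groupby('article')
    .agg(total_editors=('editors', 'sum'),
         min_rank=('rank', 'min'),
         months_in_top25=('month', 'count'))
    # Same ordering as the table: months in the top 25 first, then total editors.
    .sort_values(['months_in_top25', 'total_editors'], ascending=False)
)
print(summary)
```

Sorting on months before editors explains why, in the table above, an article with fewer total editors can appear above one with more.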