@brianckeegan
Created April 3, 2017 04:50
Debugging baseball-reference standings scrape
from bs4 import BeautifulSoup
import requests
# Get the data
raw = requests.get('http://www.baseball-reference.com/leagues/MLB/2016-standings.shtml').text
# Parse it
soup = BeautifulSoup(raw,'html.parser')
# Find the table
table = soup.find_all('table',{'id':'expanded_standings_overall'})
# Should be true, but there's nothing in there
len(table) > 0
@brianckeegan

It turns out the table is wrapped in an HTML comment, so the parser treats it as comment text rather than as a tag, and find_all never sees it.

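from io import StringIO
from bs4 import BeautifulSoup, Comment
import pandas as pd

# Reuses `soup` from the snippet above.
# Collect every HTML comment node (`string=` is the current name for the older `text=` argument)
comments = soup.find_all(string=lambda e: isinstance(e, Comment))

# The table is in the comment at index 14 (this position can shift if the page layout changes)
table_soup = BeautifulSoup(comments[14],'lxml')
table_html = table_soup.find_all('table',{'id':'expanded_standings_overall'})[0]

# Wrap the markup in StringIO: recent pandas versions deprecate passing literal HTML strings
pd.read_html(StringIO(str(table_html)),index_col=0)[0]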
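
Hard-coding the comment index is fragile, so here is a rough sketch of looking the table up by its id inside whichever comment contains it. It is not tested against the live page: SAMPLE_HTML is a made-up miniature page with invented team names and numbers, and find_commented_table and table_rows are hypothetical helper names introduced only for this illustration.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical miniature page: the table sits inside an HTML comment,
# mimicking the situation described above. All values are made up.
SAMPLE_HTML = """
<html><body>
<!-- unrelated comment -->
<div>
<!--
<table id="expanded_standings_overall">
  <tr><th>Tm</th><th>W</th><th>L</th></tr>
  <tr><td>AAA</td><td>10</td><td>5</td></tr>
  <tr><td>BBB</td><td>7</td><td>8</td></tr>
</table>
-->
</div>
</body></html>
"""


def find_commented_table(soup, table_id):
    """Return the first <table id=table_id> found inside any HTML comment, or None."""
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        # Cheap text check first, so unrelated comments are not re-parsed
        if table_id not in comment:
            continue
        inner = BeautifulSoup(comment, 'html.parser')
        table = inner.find('table', {'id': table_id})
        if table is not None:
            return table
    return None


def table_rows(table):
    """Flatten a table tag into a list of rows, each a list of cell strings."""
    return [
        [cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
        for tr in table.find_all('tr')
    ]


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# The direct lookup comes back empty because the table is only comment text
visible = soup.find_all('table', {'id': 'expanded_standings_overall'})

# Searching inside the comments finds it regardless of its position
hidden = find_commented_table(soup, 'expanded_standings_overall')
rows = table_rows(hidden)
```

On a real page, the tag returned by find_commented_table could be handed to pd.read_html in the same way as table_html above, instead of being flattened by hand.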
