Last active
August 29, 2015 14:18
A Python function that fetches Netflix movie info and parses out the relevant data, returning it as a tuple as described in the docstring.
"""
Example usage:
>>> netflix("60024942")
(u'Catch Me If You Can', u'2002', u'Thu Jan 09 08:00:00 UTC 2003', u'M', u'140 minutes', u'http://cdn2.nflximg.net/images/0432/12050432.jpg', u'An FBI agent makes it his mission to put cunning con man Frank Abagnale Jr. behind bars. But Frank not only eludes capture, he revels in the pursuit.', {u'director': u'Steven Spielberg', u'genre': u'Dramas', u'language': u'English', u'starring': u'Leonardo DiCaprio, Tom Hanks'})

The ID can be obtained through trivial URL parsing with the standard library.
Here, I'll do an example:
    from urlparse import urlparse, parse_qs  # urllib.parse in Python 3
    r = parse_qs(urlparse("http://www.netflix.com/WiPlayer?movieid=60024942").query)
    # r == {'movieid': ['60024942']}
A real URL will be much longer, but r["movieid"][0] still gives you the ID, which you can pass straight to netflix().
"""
import requests
from bs4 import BeautifulSoup as Soup  # assumed: Beautiful Soup 4; the original omits its imports


def netflix(id):
    """ Returns a tuple: (title, year, date-published, MPAA-rating, duration, boxart-url, description, moreinfo)
    The first seven items are strings, or None when the element is missing.
    moreinfo is a dict that may contain genre, language, actor and director info, depending on what's available"""
    # The endpoint returns JSON whose "html" field holds the markup to parse
    soup = Soup(requests.get("http://www.netflix.com/JSON/BOB?movieid=" + id).json()["html"], "html.parser")
    data = (
        soup.find(attrs={'class': 'title'}).text.strip() if soup.find(attrs={'class': 'title'}) else None,
        soup.find(attrs={'class': 'year'}).text.strip() if soup.find(attrs={'class': 'year'}) else None,
        soup.find(attrs={'itemprop': 'datePublished'})["content"] if soup.find(attrs={'itemprop': 'datePublished'}) else None,
        soup.find(attrs={'class': 'mpaaRating'}).text.strip() if soup.find(attrs={'class': 'mpaaRating'}) else None,
        soup.find(attrs={'class': 'duration'}).text.strip() if soup.find(attrs={'class': 'duration'}) else None,
        soup.find(attrs={'itemprop': 'thumbnailUrl'})["src"] if soup.find(attrs={'itemprop': 'thumbnailUrl'}) else None,
        # Description: the text node right after the box art element (attrs must be a dict, not a set)
        soup.find(attrs={'class': 'boxShot'}).next_sibling.strip() if soup.find(attrs={'class': 'boxShot'}) else None,
        # Pair each <dt> label (trailing ":" dropped, lowercased) with its <dd> value (whitespace collapsed)
        {k.text.strip()[:-1].lower(): " ".join(v.text.strip().split()) for k, v in zip(soup.find_all('dt'), soup.find_all('dd'))}
    )
    return data
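
The docstrings above describe two things in prose: pulling the movie ID out of a URL, and the order of the fields in the returned tuple. Here is a minimal sketch of both, assuming Python 3 (`urllib.parse` instead of `urlparse`). The names `movie_id_from_url` and `MovieInfo` are hypothetical helpers and not part of the gist. The sample tuple is copied from the docstring example so the sketch runs without calling the endpoint.

```python
from collections import namedtuple
from urllib.parse import urlparse, parse_qs  # the `urlparse` module on Python 2

# Hypothetical field names, in the order given by netflix()'s docstring.
MovieInfo = namedtuple(
    "MovieInfo",
    "title year date_published mpaa_rating duration boxart_url description moreinfo",
)


def movie_id_from_url(url):
    """Return the `movieid` query parameter of a URL, or None if it is absent."""
    values = parse_qs(urlparse(url).query).get("movieid")
    return values[0] if values else None


movie_id = movie_id_from_url("http://www.netflix.com/WiPlayer?movieid=60024942")

# With network access this would be MovieInfo(*netflix(movie_id)).
info = MovieInfo(
    "Catch Me If You Can",
    "2002",
    "Thu Jan 09 08:00:00 UTC 2003",
    "M",
    "140 minutes",
    "http://cdn2.nflximg.net/images/0432/12050432.jpg",
    "An FBI agent makes it his mission to put cunning con man Frank Abagnale Jr. "
    "behind bars. But Frank not only eludes capture, he revels in the pursuit.",
    {"director": "Steven Spielberg", "genre": "Dramas",
     "language": "English", "starring": "Leonardo DiCaprio, Tom Hanks"},
)

print(movie_id, info.title, info.moreinfo.get("director"))
```

Wrapping the result in a named tuple keeps it compatible with plain tuple unpacking while letting callers write `info.title` instead of `data[0]`.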