Last active
January 8, 2021 08:32
A command-line script to find the tropes that two or more media items have in common. To use it, install the Python libraries 'pattern' and 'pyquery', then pass in media names or links to TV Tropes pages:
> python tv_tropes_common_tropes.py 'My Little Pony Friendship is Magic' 'Hamlet'
#!/usr/bin/python

# a script to get all the common tropes for media from tv tropes
# usage:
#     python tv_tropes_matcher.py name1 name2 [name3...nameN]
# please put names with spaces or special characters in quotes
# you can also pass in the urls if it won't automatch by name.

# pip install pattern
# pip install pyquery

# note: this is Python 2 code (print statements, builtin reduce, urllib.quote)

import sys
from pprint import pprint
import re
import urllib

from pattern import web
from pyquery import PyQuery

names = sys.argv[1:]

# matches links to alphabetical sub-index pages (hrefs ending in e.g. "AToD" or "A-D")
spider_regex = re.compile(r'[A-Z](To|-)[A-Z]$')

# CSS selectors tried in order until one of them yields enough tropes
queries = ['#wikitext > ul > li > a:first-child', '#wikitext div > ul > li > a:first-child', '#wikitext > ul > li > ul a:first-child']

# maps trope name -> trope url, filled in as pages are scraped
trope_urls = {}

def get_tropes_by_url(url):
    page = web.URL(url).download()
    pq_page = PyQuery(page)
    print 'Page title:', pq_page('title').text()
    tropes = set()
    for query in queries:
        # only fall through to the next selector if this page gave (almost) nothing
        if len(tropes) < 3:
            for a in pq_page(query):
                pq_a = PyQuery(a)
                href = pq_a.attr('href')
                if not href:
                    # anchor without an href: nothing to follow or record
                    continue
                if spider_regex.search(href):
                    # sub-index page: recurse and merge its tropes
                    tropes = tropes.union(get_tropes_by_url(href))
                else:
                    trope_urls[pq_a.text()] = href
                    tropes.add(pq_a.text())
    return tropes

def get_tropes(name):
    # TODO: turn name into url somehow
    if name.startswith(('http://', 'https://')):
        url = name
    else:
        # let the search engine redirect to the best-matching tv tropes page
        url = 'http://www.google.com/search?ie=UTF-8&oe=UTF-8&sourceid=navclient&gfns=1&q={}'.format(urllib.quote('tv tropes ' + name))
    print url
    return get_tropes_by_url(url)

def trope_intersection(tropes1, tropes2):
    return tropes1.intersection(tropes2)

if len(names) > 1:
    common_tropes = reduce(trope_intersection, (get_tropes(name) for name in names))
    if common_tropes:
        print 'Common matches are: {}'.format(len(common_tropes))
        for trope in common_tropes:
            print '\t', trope #, '\t', trope_urls[trope]
    else:
        print 'There are no common tropes!'
else:
    print 'Please enter 2 or more shows/movies/books/etc'
    sys.exit(1)
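The script above is Python 2. For anyone trying it on Python 3, here is a minimal sketch of how the standard-library pieces might be spelled there. It leaves the `pattern` and `pyquery` calls out entirely, since how those behave on Python 3 is not covered here. The helper names `is_index_subpage`, `build_search_url`, and `common_tropes` are made up for illustration and are not part of the original script, and the sample hrefs in the usage note are invented.

```python
import re
from functools import reduce          # reduce is no longer a builtin in Python 3
from urllib.parse import quote        # urllib.quote moved here in Python 3

# Same pattern as the script: hrefs ending in e.g. "AToD" or "A-D".
spider_regex = re.compile(r'[A-Z](To|-)[A-Z]$')


def is_index_subpage(href):
    """Return True if href looks like an alphabetical sub-index page."""
    return bool(href) and spider_regex.search(href) is not None


def build_search_url(name):
    """Build the same search URL as the script, using urllib.parse.quote."""
    base = ('http://www.google.com/search?ie=UTF-8&oe=UTF-8'
            '&sourceid=navclient&gfns=1&q={}')
    return base.format(quote('tv tropes ' + name))


def common_tropes(trope_sets):
    """Intersect two or more sets of trope names."""
    return reduce(lambda a, b: a & b, trope_sets)
```

With these, `is_index_subpage('Hamlet/TropesAToD')` is true while `is_index_subpage('Main/ChekhovsGun')` is not, and `common_tropes` accepts any iterable of sets, so it works for two names or ten. The `print` statements in the script would also need to become `print(...)` calls.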
ok! I wrote this as a one-off tool during the MIT mystery hunt, so I'm not surprised it's incomplete. Hope you still found it useful — that's why I put it up!
Also some questions:
- Why is this not written with BeautifulSoup? Which tutorial do you think is "da best" for writing it?
- Would you consider the option that the other pages on the same medium are "useful"?
I am still screaming about the CSS tags being absurd and annoying. OUCH
There are some issues with a naive "direct page" scraper, since:
Also, there is an idea that this tool can be generalized to compare multiple pieces of media.
Reference to a triple-media analysis proposal: rhgarcia/tropescraper#11
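A strict intersection gets small quickly as more media are added, so one possible way to generalize the comparison is to keep tropes shared by at least `k` of the items rather than by all of them. The sketch below is only an assumption about what such a generalization could look like; it is not taken from the referenced proposal, and `tropes_shared_by_at_least` is a hypothetical helper name. It is written for Python 3 and takes already-scraped sets of trope names as input.

```python
from collections import Counter


def tropes_shared_by_at_least(trope_sets, k):
    """Return the tropes that appear in at least k of the given sets."""
    counts = Counter()
    for tropes in trope_sets:
        # set() guards against a trope being counted twice for one item
        counts.update(set(tropes))
    return {trope for trope, n in counts.items() if n >= k}
```

Setting `k` to the number of items gives the same result as the script's full intersection, while `k=2` with three items returns everything that any two of them have in common.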