Created March 30, 2015 05:34
Google Scholar Scoreboard. See more at https://github.com/titipata/google_scholar_scoreboard
{ | |
"metadata": { | |
"name": "", | |
"signature": "sha256:8918c770e5f0319be5b9003a936c20e387589b76d64fca0d1a9d2d130e695954" | |
}, | |
"nbformat": 3, | |
"nbformat_minor": 0, | |
"worksheets": [ | |
{ | |
"cells": [ | |
{ | |
"cell_type": "heading", | |
"level": 1, | |
"metadata": {}, | |
"source": [ | |
"Google Scholar - Kording Lab" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"This IPython notebook builds a real-time scoreboard by scraping citation statistics from Google Scholar.\n", | |
"We use our lab members' Google Scholar profiles. The code is divided into 3 sections: libraries and functions, running the Scholar update, and closing the figure and stopping the timer.\n", | |
"\n", | |
"- Original code by Daniel Acuna (in Mathematica)\n", | |
"- Created by Titipat Achakulvisut with great help from Daniel Acuna\n", | |
"\n", | |
"HISTORY:\n", | |
"- Created on: 27 Aug 2014\n", | |
"- Updated:\n", | |
"  - 29 Aug 2014: switched web scraping from regular expressions to lxml\n", | |
"  - 9 Sep 2014: minor changes in order to put on GitHub\n", | |
"- Version 0.1" | |
] | |
}, | |
{ | |
"cell_type": "heading", | |
"level": 1, | |
"metadata": {}, | |
"source": [ | |
"Libraries and Functions" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Import all libraries and define the functions needed to run the real-time Google Scholar scoreboard.\n", | |
"\n", | |
"Requirements:\n", | |
"- ipython + notebook\n", | |
"- numpy\n", | |
"- lxml\n", | |
"- pygame\n", | |
"- pandas\n", | |
"- urllib2\n", | |
"- matplotlib\n", | |
"\n", | |
"Notes:\n", | |
"- A real-time Twitter feed is also available if you install the 'twitter' library and obtain Twitter API credentials\n", | |
"- The plot can be adjusted through the 'params_*' variables, depending on the size of the screen you display it on\n", | |
"\n", | |
"P.S.\n", | |
"- If you installed Python with Anaconda, only 'pygame' needs to be installed separately\n", | |
"- If you installed Python on Mac OS X using MacPorts, see our document, which has a section on installing from the command line: http://klab.smpp.northwestern.edu/wiki/images/e/e6/Macport.pdf" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"%pylab qt4" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"# import libraries to scrape the website\n", | |
"from urllib2 import urlopen\n", | |
"import numpy as np\n", | |
"import pandas as pd\n", | |
"import time\n", | |
"from lxml import etree # read html\n", | |
"from lxml import html\n", | |
"\n", | |
"# import libraries to plot and draw\n", | |
"import matplotlib.pyplot as plt\n", | |
"import cStringIO # get image from website\n", | |
"from PIL import Image\n", | |
"\n", | |
"# use pygame to play music\n", | |
"import pygame\n", | |
"\n", | |
"\n", | |
"#import twitter\n", | |
"#api = twitter.Api(consumer_key='',\n", | |
"#                  consumer_secret='',\n", | |
"#                  access_token_key='',\n", | |
"#                  access_token_secret='')\n", | |
"\n", | |
"#### CONSTANTS ####\n", | |
"# name list\n", | |
"NAME = ['Konrad',\n", | |
"        'Daniel',\n", | |
"        'Pavan',\n", | |
"        'Josh',\n", | |
"        'Ted',\n", | |
"        'Pat',\n", | |
"        'Eva',\n", | |
"        'Mohammad',\n", | |
"        'Hugo',\n", | |
"        'Luca',\n", | |
"        'Sohrob',\n", | |
"        'Steve',\n", | |
"        'Iris']\n", | |
"\n", | |
"# URL of each lab member (same order as NAME)\n", | |
"BASE_URL = ['http://scholar.google.com/citations?user=MiFqJGcAAAAJ&hl=en',\n", | |
"            'http://scholar.google.com/citations?user=GAi23ssAAAAJ&hl=en',\n", | |
"            'http://scholar.google.com/citations?user=JtltLUAAAAAJ&hl=en',\n", | |
"            'http://scholar.google.com/citations?user=tbfWCDgAAAAJ&hl=en',\n", | |
"            'http://scholar.google.com/citations?user=T8W-5LsAAAAJ&hl=en',\n", | |
"            'http://scholar.google.com/citations?user=jjvixpcAAAAJ&hl=en',\n", | |
"            'http://scholar.google.com/citations?user=wdFV87UAAAAJ&hl=en',\n", | |
"            'http://scholar.google.com/citations?user=AlTQrFcAAAAJ&hl=en',\n", | |
"            'http://scholar.google.com/citations?user=JG7xb2AAAAAJ&hl=en',\n", | |
"            'http://scholar.google.com/citations?user=xxDk3-EAAAAJ&hl=en',\n", | |
"            'http://scholar.google.co.uk/citations?user=9jqURCEAAAAJ&hl=en',\n", | |
"            'http://scholar.google.com/citations?user=uwpOnSAAAAAJ&hl=en&oi=sra',\n", | |
"            'http://scholar.google.com/citations?user=Ztwn608AAAAJ&hl=en']\n", | |
"\n", | |
"\n", | |
"# image link to display when someone gets cited\n", | |
"IMG_LINK = {'Konrad': 'http://www.qwantz.com/patreon/p3.png',\n", | |
"            'Daniel': 'http://www.quickmeme.com/img/45/451ab8e56df6f66c37c7eda8e36765f743cbabf1e1dbee5dffd648f47dde54d1.jpg',\n", | |
"            'Mohammad': 'http://mybroadband.co.za/vb/attachment.php?s=c864c202183cf1b3d2c57f738e78fce8&attachmentid=103452&d=1394014893',\n", | |
"            'Pavan': 'http://www.nbc.com/sites/nbcunbc/files/files/styles/nbc_bio_image/public/images/2013/11/08/azizAnsari_tomHaverford.jpg?itok=PCowY6uk',\n", | |
"            'Hugo': 'http://public.media.smithsonianmag.com/legacy_blog/dinosaur-comic-strip.jpg',\n", | |
"            'Ted': 'http://lovestats.files.wordpress.com/2012/07/r-square-success-kid.jpg',\n", | |
"            'Pat': 'http://m.memegen.com/x6259d.jpg',\n", | |
"            'Josh': 'http://cdn.mhpbooks.com/uploads/2013/10/Success-Kid.jpg',\n", | |
"            'Eva': 'http://veryhilarious.com/wp-content/uploads/2012/07/indy-hipster.jpg',\n", | |
"            'Luca': 'http://i1.cpcache.com/product_zoom/510200164/bunga_bunga_berlusconi_classic_thong.jpg?color=White&height=460&width=460&padToSquare=true',\n", | |
"            'Sohrob': 'http://4.bp.blogspot.com/-rjyBJpjUizw/U35ON10k8eI/AAAAAAAAfUM/KMFJEgwJhM8/s1600/stupid-meme-stalin-obama-2.jpg',\n", | |
"            'Steve': 'http://ct.fra.bz/ol/fz/sw/i58/2/5/25/frabz-giant-burrito-man-will-make-you-fire-torpedoes-of-another-kind-a01c69.jpg',\n", | |
"            'Iris': 'http://public.media.smithsonianmag.com/legacy_blog/dinosaur-comic-strip.jpg'\n", | |
"            }\n", | |
"\n", | |
"\n", | |
"# music snippet to play when anyone gets cited\n", | |
"MUSIC_DIR = '/Users/titipat/Desktop/Amazon Web Service/snippet.wav'\n", | |
"\n", | |
"# parameters for plotting\n", | |
"params_cite = {'fontsize': 22, 'color': 'green', 'fontweight':'bold'}\n", | |
"params_hindex = {'fontsize': 22, 'color': 'red', 'fontweight':'bold'}\n", | |
"params_date = {'fontsize': 20, 'color': 'blue', 'fontweight':'bold'}\n", | |
"params_others = {'fontsize': 30, 'color': 'black'}\n", | |
"params_gini = {'fontsize': 20, 'color': 'blue', 'fontweight':'bold'}\n", | |
"params_gini_val = {'fontsize': 20, 'color': 'black'}\n", | |
"params_tweet = {'fontsize': 20, 'color': 'red', 'fontweight':'bold'}\n", | |
"\n", | |
"def get_citation_matrix():\n", | |
"    ''' Get a dataframe of citation counts and h-indices from Google Scholar '''\n", | |
"    all_people = pd.DataFrame(columns=['name', 'citation', 'h_index','url', 'hn_index'])\n", | |
"    all_people['name'] = NAME\n", | |
"    all_people['url'] = BASE_URL\n", | |
"\n", | |
"    for i in range(len(all_people)):\n", | |
"        tree = html.parse(all_people.url[i])\n", | |
"        # cells of the statistics table; cell 0 is read as total citations and cell 2 as the h-index\n", | |
"        cit = tree.xpath(\"/html/body/div[@id='gs_top']/div[@id='gsc_bdy']/div[@id='gsc_rsb']/div[@class='gsc_rsb_s']/table[@id='gsc_rsb_st']//tr//td[@class='gsc_rsb_std']\")\n", | |
"        citations = np.int(cit[0].text)\n", | |
"        h_index = np.int(cit[2].text)\n", | |
"\n", | |
"        all_people['citation'][i] = np.int(citations)\n", | |
"        all_people['h_index'][i] = np.int(h_index)\n", | |
"        all_people['hn_index'][i] = float(h_index**2)/(float(citations) + 1) # ratio suggested by Mohammad and Hugo\n", | |
"\n", | |
"    return all_people # return dataframe of citations\n", | |
"\n", | |
"def sort_citation(all_people):\n", | |
"    ''' Sort the given dataframe by h-index, then by citation count, in descending order '''\n", | |
"    # note: pandas >= 0.17 offers sort_values(by=[...]) instead; DataFrame.sort was removed in pandas 0.20\n", | |
"    all_people_sorted = all_people.sort(columns=['h_index', 'citation'], ascending=False)\n", | |
"    all_people_sorted.index = np.arange(len(all_people))\n", | |
"    return all_people_sorted\n", | |
"\n", | |
"def get_options(all_people, all_people_new):\n", | |
"    ''' Compare the old and new citation dataframes and return an options dataframe (including the differences) '''\n", | |
"    # get difference\n", | |
"    cite_diff = all_people_new.citation - all_people.citation\n", | |
"    index = cite_diff.nonzero()[0]\n", | |
"    name_diff = list(all_people.name[index]) # names whose citation count changed\n", | |
"\n", | |
"    # build the table of options describing the differences\n", | |
"    options = pd.DataFrame(columns=['name', 'citation_diff', 'h_index_diff','sign'])\n", | |
"    options['name'] = all_people['name'] # refer to new update\n", | |
"    options['citation_diff'] = 0\n", | |
"    options['h_index_diff'] = 0\n", | |
"    options['sign'] = ''\n", | |
"\n", | |
"    # for each person in name_diff\n", | |
"    for j in range(len(name_diff)):\n", | |
"        idx_new = np.where(name_diff[j] == all_people_new['name'])[0]\n", | |
"        idx_old = np.where(name_diff[j] == all_people['name'])[0]\n", | |
"        diff_citation = all_people_new.citation[idx_new] - all_people.citation[idx_old]\n", | |
"        diff_hindex = all_people_new.h_index[idx_new] - all_people.h_index[idx_old]\n", | |
"        #print sign(all_people_sorted_new.citation[idx_new] - all_people_sorted.citation[idx_old]) # get sign +1 or -1\n", | |
"        options['citation_diff'][idx_new] = diff_citation\n", | |
"        options['h_index_diff'][idx_new] = diff_hindex\n", | |
"        sign_pm = np.int(np.sign(diff_citation))\n", | |
"        if sign_pm == 1:\n", | |
"            options['sign'][idx_new] = '+'\n", | |
"        else:\n", | |
"            # str() of a negative difference already carries '-', so adding one here would display '(--3)'\n", | |
"            options['sign'][idx_new] = ''\n", | |
"\n", | |
"    # index by name so rows can be looked up with options.loc[name]\n", | |
"    options.index = options.name\n", | |
"\n", | |
"    # return dataframe of options and the names that have a different citation count\n", | |
"    return options, name_diff\n", | |
"\n", | |
"def is_different(all_people_sorted, all_people_sorted_new):\n", | |
"    ''' Return True if the h-index or citation columns of the two dataframes differ '''\n", | |
"    result = np.max(all_people_sorted_new.h_index != all_people_sorted.h_index) or np.max(all_people_sorted_new.citation != all_people_sorted.citation)\n", | |
"    return result\n", | |
"\n", | |
"def play_music():\n", | |
"    ''' Play the WAV snippet at MUSIC_DIR (the Everything is Awesome track) '''\n", | |
"    pygame.init()\n", | |
"    pygame.mixer.music.load(MUSIC_DIR)\n", | |
"    pygame.mixer.music.play()\n", | |
"\n", | |
"def compute_gini(y):\n", | |
"    ''' Compute the Gini index of the values in y, rounded up to 3 decimal places '''\n", | |
"    N = len(y)\n", | |
"    # float() replaces pylab's double() so the function does not depend on %pylab\n", | |
"    gini = float(2*np.dot(sorted(y, reverse=False), np.arange(1, N+1)))/float(float(N)*np.sum(y)) - ((N+1.0)/float(N))\n", | |
"    gini = np.ceil(gini * 1000) / 1000.0\n", | |
"    return gini\n", | |
"\n", | |
"def draw_citation_table(): #all_people_new_sorted\n", | |
"    global all_people, all_people_new, all_people_sorted, all_people_new_sorted, options, options_old, name_diff # use the module-level variables\n", | |
"\n", | |
"    # get new citation\n", | |
"    all_people_new = get_citation_matrix()\n", | |
"    all_people_new_sorted = sort_citation(all_people_new)\n", | |
"\n", | |
"    # if different, obtain new options\n", | |
"    if is_different(all_people_sorted, all_people_new_sorted):\n", | |
"        options_old, name_diff = get_options(all_people, all_people_new)\n", | |
"        options = options_old\n", | |
"    else:\n", | |
"        options = options_old\n", | |
"\n", | |
"\n", | |
"    # DRAW TITLE/ HEADERS\n", | |
"    fig = plt.gcf() # get current figure\n", | |
"    fig.clf() # clear current figure\n", | |
"    fig.suptitle('Bayesian Behavior Lab Citations',\n", | |
"                 fontsize=35, fontweight='bold',\n", | |
"                 color='gray', style='italic')\n", | |
"\n", | |
"    plt.text(0.2, 0.94, 'Citations', **params_cite)\n", | |
"    plt.text(0.5, 0.94, 'h-index', **params_hindex)\n", | |
"    plt.axis('off')\n", | |
"\n", | |
"    for i in range(len(all_people_new_sorted)):\n", | |
"        name = all_people_new_sorted.name[i] # get name\n", | |
"        if (name in name_diff):\n", | |
"            option_cit = '('+ options.loc[name].sign + str(options.loc[name].citation_diff) + ')'\n", | |
"            option_h = '(' + options.loc[name].sign + str(options.loc[name].h_index_diff) + ')'\n", | |
"            # if the citation or h-index difference is 0, show nothing\n", | |
"            if np.int(options.loc[name].citation_diff) == 0:\n", | |
"                option_cit = ''\n", | |
"            if np.int(options.loc[name].h_index_diff) == 0:\n", | |
"                option_h = ''\n", | |
"        else:\n", | |
"            option_cit = ''\n", | |
"            option_h = ''\n", | |
"\n", | |
"        # DRAW Name; the vertical spacing between rows (0.065) is set here\n", | |
"        plt.text(-0.1, 0.85-0.065*i, str(all_people_new_sorted.name[i]) , **params_others)\n", | |
"        # DRAW Citation\n", | |
"        plt.text(0.2, 0.85-0.065*i, str(all_people_new_sorted.citation[i]) + option_cit, **params_others)\n", | |
"        # DRAW H-index\n", | |
"        plt.text(0.5, 0.85-0.065*i, str(all_people_new_sorted.h_index[i]) + option_h, **params_others)\n", | |
"\n", | |
"    plt.text(0.0, -0.1, 'Last citation update: ' + time.strftime('%X %b %d, %Y'), **params_date)\n", | |
"    plt.text(0.65, 0.6, 'Gini (citation)', **params_gini)\n", | |
"    plt.text(0.92, 0.6, str(compute_gini(all_people_new_sorted.citation)), **params_gini_val)\n", | |
"    plt.text(0.65, 0.53, 'Gini (h-index)', **params_gini)\n", | |
"    plt.text(0.92, 0.53, str(compute_gini(all_people_new_sorted.h_index)), **params_gini_val)\n", | |
"\n", | |
"    # DRAW New Twitter Feed from Kording Lab\n", | |
"    #statuses = api.GetUserTimeline(screen_name=\"KordingLab\") # getting all tweets\n", | |
"    #plt.text(-0.1, -0.05, 'Tweets: ', **params_gini)\n", | |
"    #try:\n", | |
"    #    new_tweet = str(statuses[0].text.encode('ascii', 'ignore'))\n", | |
"    #    plt.text(0.1, -0.05, new_tweet, **params_gini_val)\n", | |
"    #except ValueError:\n", | |
"    #    plt.text(0.1, -0.05, \"Can't retrieve tweet...\", **params_gini_val)\n", | |
"\n", | |
"\n", | |
"    # DRAW IMAGE\n", | |
"    if len(name_diff) > 0:\n", | |
"        URL = IMG_LINK[name_diff[-1]] # show the image of the last name in name_diff\n", | |
"    else:\n", | |
"        URL = 'http://www.qwantz.com/patreon/p3.png' # default dinosaur image\n", | |
"    file = cStringIO.StringIO(urlopen(URL).read())\n", | |
"\n", | |
"    img = Image.open(file)\n", | |
"    axicon = fig.add_axes([0.6,0.15,0.33,0.33])\n", | |
"    plt.imshow(img)\n", | |
"    plt.axis('off')\n", | |
"\n", | |
"    fig.canvas.draw()\n", | |
"    fig.canvas.activateWindow()\n", | |
"    plt.draw()\n", | |
"    plt.show()\n", | |
"\n", | |
"\n", | |
"    # Play music and update part!\n", | |
"    if is_different(all_people_sorted, all_people_new_sorted):\n", | |
"        play_music()\n", | |
"        all_people = all_people_new # replace all people with new one\n", | |
"        all_people_sorted = sort_citation(all_people) # sorted again\n" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "heading", | |
"level": 1, | |
"metadata": {}, | |
"source": [ | |
"Run Google Scholar" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"After running the libraries and functions section, run this cell to show the citation scoreboard and refresh it every 15 minutes" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"# Draw the First citation image\n", | |
"plt.close('all')\n", | |
"fig = plt.figure(facecolor='white') # or 'white' depending on bg we want\n", | |
"\n", | |
"all_people = get_citation_matrix() # get citation from provided url\n", | |
"options, name_diff = get_options(all_people, all_people)\n", | |
"options_old = options # assign value to get rid of conflict\n", | |
"all_people_sorted = sort_citation(all_people)\n", | |
"all_people_new = get_citation_matrix()\n", | |
"all_people_new_sorted = sort_citation(all_people_new)\n", | |
"draw_citation_table() # draw first K-lab citation\n", | |
"\n", | |
"# timer to run code every some amount of time\n", | |
"timer = fig.canvas.new_timer(interval=1000*60*15) # run every 15 minutes (1000*60*15 milli-seconds)\n", | |
"timer.add_callback(draw_citation_table)\n", | |
"timer.start()" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "heading", | |
"level": 1, | |
"metadata": {}, | |
"source": [ | |
"Close figures and Stop Timer" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"This section stops the timer and closes the real-time citation figure" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"# close all figure and stop timer\n", | |
"timer.stop()\n", | |
"plt.close('all')" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "heading", | |
"level": 2, | |
"metadata": {}, | |
"source": [ | |
"Adding a custom CSS file for NBViewer" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"from IPython.core.display import HTML\n", | |
"HTML(open(\"./custom_nb.css\", \"r\").read())" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [] | |
} | |
], | |
"metadata": {} | |
} | |
] | |
} |
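The notebook's `compute_gini` uses the rank-weighted form of the Gini index: with the values sorted in ascending order, G = 2 * sum(i * y_i) / (N * sum(y)) - (N + 1) / N. Below is a minimal standalone sketch of that same formula using only the standard library. The name `gini_index` is hypothetical, the sample inputs are made up, and unlike the notebook it does not round the result up to three decimals.

```python
def gini_index(values):
    """Gini index via the rank-weighted formula (hypothetical helper)."""
    ordered = sorted(values)  # ascending order, as in compute_gini
    n = len(ordered)
    total = sum(ordered)
    if n == 0 or total == 0:
        raise ValueError("need at least one value and a non-zero sum")
    # weight each value by its 1-based rank in the sorted order
    weighted = sum(rank * value for rank, value in enumerate(ordered, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


# made-up inputs: a perfectly equal group and a maximally unequal one
equal = gini_index([10, 10, 10, 10])
skewed = gini_index([0, 0, 0, 40])
```

With this formula a perfectly equal group scores 0, and a group where one member holds everything scores (N - 1) / N rather than 1.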
View the rendered notebook in the notebook viewer here: http://nbviewer.ipython.org/urls/gist.githubusercontent.com/titipata/f061743ccc48c1db3a33/raw/a9de4d7e3a072af36ed899a854ed4637e83f4272/gs_scoreboard.ipynb
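The notebook's `get_options` compares an old and a new citation snapshot and produces the "(+3)"-style labels drawn next to each count. The sketch below shows one way that comparison could look with plain dictionaries keyed by name. It is not the notebook's implementation: `citation_changes` is a hypothetical helper, and the counts are made-up sample numbers, not real Google Scholar data.

```python
def citation_changes(old, new):
    """Return {name: signed label} for members whose count changed (hypothetical helper)."""
    changes = {}
    for name, new_count in new.items():
        # a name missing from the old snapshot is treated as unchanged
        delta = new_count - old.get(name, new_count)
        if delta != 0:
            # '+d' always prints the sign, so no separate sign column is needed
            changes[name] = "({:+d})".format(delta)
    return changes


# made-up snapshots for three names from the NAME list
old_counts = {"Konrad": 100, "Daniel": 50, "Pat": 7}
new_counts = {"Konrad": 103, "Daniel": 50, "Pat": 6}
changes = citation_changes(old_counts, new_counts)
```

Matching by name rather than by row position means the result does not depend on the two snapshots being in the same order.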