@titipata
Created March 30, 2015 05:34
{
"metadata": {
"name": "",
"signature": "sha256:8918c770e5f0319be5b9003a936c20e387589b76d64fca0d1a9d2d130e695954"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Google Scholar - Kording Lab"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is ipython notebook code to create real time web scraper using information from Google Scholar.\n",
"We use information from our lab members' Google Scholar. Code is divided into 3 sections - library and function, run scholar update and clear, delete figure.\n",
"\n",
"- Original Code by Daniel Acuna (on Mathematica)\n",
"- Created by Titipat Achakulvisut with great help of Daniel Acuna\n",
"\n",
"HISTORY:\n",
"- Created on: 27 Aug 2014\n",
"- Updated:\n",
" - 29 Aug 2014 update webscraping using lxml instead of regular expression\n",
" - 9 Sep 2014 minor changes in order to put on github\n",
"- Version 0.1"
]
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Libraries and Functions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Import all libraries and functions to run the real time google scholar\n",
"\n",
"Requirement:\n",
"- ipython +notebook\n",
"- numpy\n",
"- lxml\n",
"- pygame\n",
"- pandas\n",
"- urllib2\n",
"- matplotlib\n",
"\n",
"Notice\n",
"- We also have real time twitter if you download library 'twitter' and get twitter api online\n",
"- We can change the plot in variable 'parameters' depending on screen size you display on\n",
"\n",
"ps. \n",
"- if install python with Anaconda, you need only 'pygame' that is separately installed\n",
"- if install python on MacOSX using Macports, feel free to read our documents where we have the section that we can install from command line http://klab.smpp.northwestern.edu/wiki/images/e/e6/Macport.pdf"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%pylab qt4"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# import library to scrape website\n",
"from urllib2 import urlopen\n",
"import numpy as np\n",
"import pandas as pd\n",
"import time\n",
"from lxml import etree # read html\n",
"from lxml import html\n",
"\n",
"# import library to plot and draw\n",
"import matplotlib.pyplot as plt\n",
"import cStringIO # get image from website\n",
"from PIL import Image\n",
"\n",
"# use pygame to play music\n",
"import pygame\n",
"\n",
"\n",
"#import twitter\n",
"#api = twitter.Api(consumer_key='',\n",
"# consumer_secret='',\n",
"# access_token_key='',\n",
"# access_token_secret='')\n",
"\n",
"#### CONSTANT ####\n",
"# name list\n",
"NAME = ['Konrad', \n",
" 'Daniel', \n",
" 'Pavan',\n",
" 'Josh', \n",
" 'Ted', \n",
" 'Pat',\n",
" 'Eva', \n",
" 'Mohammad', \n",
" 'Hugo',\n",
" 'Luca',\n",
" 'Sohrob',\n",
" 'Steve',\n",
" 'Iris']\n",
"\n",
"# url of each lab member (based on name lists)\n",
"BASE_URL = ['http://scholar.google.com/citations?user=MiFqJGcAAAAJ&hl=en',\n",
" 'http://scholar.google.com/citations?user=GAi23ssAAAAJ&hl=en',\n",
" 'http://scholar.google.com/citations?user=JtltLUAAAAAJ&hl=en',\n",
" 'http://scholar.google.com/citations?user=tbfWCDgAAAAJ&hl=en',\n",
" 'http://scholar.google.com/citations?user=T8W-5LsAAAAJ&hl=en',\n",
" 'http://scholar.google.com/citations?user=jjvixpcAAAAJ&hl=en',\n",
" 'http://scholar.google.com/citations?user=wdFV87UAAAAJ&hl=en',\n",
" 'http://scholar.google.com/citations?user=AlTQrFcAAAAJ&hl=en',\n",
" 'http://scholar.google.com/citations?user=JG7xb2AAAAAJ&hl=en',\n",
" 'http://scholar.google.com/citations?user=xxDk3-EAAAAJ&hl=en',\n",
" 'http://scholar.google.co.uk/citations?user=9jqURCEAAAAJ&hl=en',\n",
" 'http://scholar.google.com/citations?user=uwpOnSAAAAAJ&hl=en&oi=sra',\n",
" 'http://scholar.google.com/citations?user=Ztwn608AAAAJ&hl=en']\n",
"\n",
"\n",
"# image link that we want to display if someone get cited\n",
"IMG_LINK = {'Konrad': 'http://www.qwantz.com/patreon/p3.png',\n",
" 'Daniel': 'http://www.quickmeme.com/img/45/451ab8e56df6f66c37c7eda8e36765f743cbabf1e1dbee5dffd648f47dde54d1.jpg',\n",
" 'Mohammad': 'http://mybroadband.co.za/vb/attachment.php?s=c864c202183cf1b3d2c57f738e78fce8&attachmentid=103452&d=1394014893',\n",
" 'Pavan': 'http://www.nbc.com/sites/nbcunbc/files/files/styles/nbc_bio_image/public/images/2013/11/08/azizAnsari_tomHaverford.jpg?itok=PCowY6uk',\n",
" 'Hugo': 'http://public.media.smithsonianmag.com/legacy_blog/dinosaur-comic-strip.jpg',\n",
" 'Ted': 'http://lovestats.files.wordpress.com/2012/07/r-square-success-kid.jpg',\n",
" 'Pat': 'http://m.memegen.com/x6259d.jpg',\n",
" 'Josh': 'http://cdn.mhpbooks.com/uploads/2013/10/Success-Kid.jpg',\n",
" 'Eva': 'http://veryhilarious.com/wp-content/uploads/2012/07/indy-hipster.jpg',\n",
" 'Luca': 'http://i1.cpcache.com/product_zoom/510200164/bunga_bunga_berlusconi_classic_thong.jpg?color=White&height=460&width=460&padToSquare=true',\n",
" 'Sohrob': 'http://4.bp.blogspot.com/-rjyBJpjUizw/U35ON10k8eI/AAAAAAAAfUM/KMFJEgwJhM8/s1600/stupid-meme-stalin-obama-2.jpg',\n",
" 'Steve': 'http://ct.fra.bz/ol/fz/sw/i58/2/5/25/frabz-giant-burrito-man-will-make-you-fire-torpedoes-of-another-kind-a01c69.jpg',\n",
" 'Iris': 'http://public.media.smithsonianmag.com/legacy_blog/dinosaur-comic-strip.jpg'\n",
" }\n",
"\n",
"\n",
"# music snippet that you want to play if anyone got cited\n",
"MUSIC_DIR = '/Users/titipat/Desktop/Amazon Web Service/snippet.wav'\n",
"\n",
"# parameter for plotting\n",
"params_cite = {'fontsize': 22, 'color': 'green', 'fontweight':'bold'}\n",
"params_hindex = {'fontsize': 22, 'color': 'red', 'fontweight':'bold'}\n",
"params_date = {'fontsize': 20, 'color': 'blue', 'fontweight':'bold'}\n",
"params_others = {'fontsize': 30, 'color': 'black'}\n",
"params_gini = {'fontsize': 20, 'color': 'blue', 'fontweight':'bold'}\n",
"params_gini_val = {'fontsize': 20, 'color': 'black'}\n",
"params_tweet = {'fontsize': 20, 'color': 'red', 'fontweight':'bold'}\n",
"\n",
"def get_citation_matrix():\n",
" ''' Get all citation datafram from Google Scholar '''\n",
" all_people = pd.DataFrame(columns=['name', 'citation', 'h_index','url', 'hn_index'])\n",
" all_people['name'] = NAME\n",
" all_people['url'] = BASE_URL\n",
"\n",
" for i in range(len(all_people)):\n",
" tree = html.parse(all_people.url[i])\n",
" cit = tree.xpath(\"/html/body/div[@id='gs_top']/div[@id='gsc_bdy']/div[@id='gsc_rsb']/div[@class='gsc_rsb_s']/table[@id='gsc_rsb_st']//tr//td[@class='gsc_rsb_std']\")\n",
" citations = np.int(cit[0].text)\n",
" h_index = np.int(cit[2].text)\n",
" \n",
" all_people['citation'][i] = np.int(citations)\n",
" all_people['h_index'][i] = np.int(h_index)\n",
" all_people['hn_index'][i] = float(h_index**2)/(float(citations) + 1) # suggested ratio by Mohammad and Hugo\n",
"\n",
" return all_people # return matrix of citation\n",
"\n",
"def sort_citation(all_people):\n",
" ''' Sort the given dataframe by citation '''\n",
" all_people_sorted = all_people.sort(columns=['h_index', 'citation'], ascending=False)\n",
" all_people_sorted.index = np.arange(len(all_people))\n",
" return all_people_sorted\n",
"\n",
"def get_options(all_people, all_people_new):\n",
" ''' Get citation and new citation then return option dataframe (including difference) '''\n",
" # get difference\n",
" cite_diff = all_people_new.citation - all_people.citation\n",
" index = cite_diff.nonzero()[0]\n",
" name_diff = list(all_people.name[index]) # list of name different\n",
" \n",
" # get table of options if there is different\n",
" options = pd.DataFrame(columns=['name', 'citation_diff', 'h_index_diff','sign'])\n",
" options['name'] = all_people['name'] # refer to new update\n",
" options['citation_diff'] = 0\n",
" options['h_index_diff'] = 0\n",
" options['sign'] = ''\n",
"\n",
" # for one people in name_diff\n",
" for j in range(len(name_diff)):\n",
" idx_new = np.where(name_diff[j] == all_people_new['name'])[0]\n",
" idx_old = np.where(name_diff[j] == all_people['name'])[0]\n",
" diff_citation = all_people_new.citation[idx_new] - all_people.citation[idx_old]\n",
" diff_hindex = all_people_new.h_index[idx_new] - all_people.h_index[idx_old]\n",
" #print sign(all_people_sorted_new.citation[idx_new] - all_people_sorted.citation[idx_old]) # get sign +1 or -1\n",
" options['citation_diff'][idx_new] = diff_citation\n",
" options['h_index_diff'][idx_new] = diff_hindex\n",
" sign_pm = np.int(np.sign(diff_citation))\n",
" if sign_pm == 1:\n",
" options['sign'][idx_new] = '+'\n",
" else:\n",
" options['sign'][idx_new] = '-'\n",
"\n",
" # change index\n",
" options.index = options.name # do this one!\n",
" \n",
" # return dataframe of options and name that have different citation\n",
" return options, name_diff\n",
" \n",
"def is_different(all_people_sorted, all_people_sorted_new):\n",
" ''' Find if two dataframe are equal or not '''\n",
" result = np.max(all_people_sorted_new.h_index != all_people_sorted.h_index) or np.max(all_people_sorted_new.citation != all_people_sorted.citation)\n",
" return result\n",
"\n",
"def play_music():\n",
" ''' Play Everything is Awesome wavfile track '''\n",
" pygame.init()\n",
" pygame.mixer.music.load(MUSIC_DIR)\n",
" pygame.mixer.music.play()\n",
" \n",
"def compute_gini(y):\n",
" ''' function to compute Gini index '''\n",
" N = len(y)\n",
" gini = double(2*np.dot(sorted(y, reverse=False), np.arange(1, N+1)))/double(double(N)*np.sum(y)) - ((N+1.0)/double(N))\n",
" gini = np.ceil(gini * 1000) / 1000.0\n",
" return gini\n",
" \n",
"def draw_citation_table(): #all_people_new_sorted\n",
" global all_people, all_people_new, all_people_sorted, all_people_new_sorted, options, options_old, name_diff # see the outer variable\n",
" \n",
" # get new citation\n",
" all_people_new = get_citation_matrix()\n",
" all_people_new_sorted = sort_citation(all_people_new)\n",
" \n",
" # if different optain new options\n",
" if is_different(all_people_sorted, all_people_new_sorted):\n",
" options_old, name_diff = get_options(all_people, all_people_new)\n",
" options = options_old\n",
" else:\n",
" options = options_old\n",
" \n",
" \n",
" # DRAW TITLE/ HEADERS\n",
" fig = plt.gcf() # get current figure\n",
" fig.clf() # clear current figure\n",
" fig.suptitle('Bayesian Behavior Lab Citations',\n",
" fontsize=35, fontweight='bold',\n",
" color='gray', style='italic')\n",
" \n",
" plt.text(0.2, 0.94, 'Citations', **params_cite)\n",
" plt.text(0.5, 0.94, 'h-index', **params_hindex)\n",
" plt.axis('off')\n",
"\n",
" for i in range(len(all_people_new_sorted)):\n",
" name = all_people_new_sorted.name[i] # get name\n",
" if (name in name_diff):\n",
" option_cit = '('+ options.loc[name].sign + str(options.loc[name].citation_diff) + ')'\n",
" option_h = '(' + options.loc[name].sign + str(options.loc[name].h_index_diff) + ')'\n",
" # if citation or h-index different is 0, turn to blank\n",
" if np.int(options.loc[name].citation_diff) == 0:\n",
" option_cit = ''\n",
" if np.int(options.loc[name].h_index_diff) == 0:\n",
" option_h = ''\n",
" else:\n",
" option_cit = ''\n",
" option_h = ''\n",
"\n",
" # DRAW Name, distance between shown name lists is here\n",
" plt.text(-0.1, 0.85-0.065*i, str(all_people_new_sorted.name[i]) , **params_others)\n",
" # DRAW Citation\n",
" plt.text(0.2, 0.85-0.065*i, str(all_people_new_sorted.citation[i]) + option_cit, **params_others)\n",
" # DRAW H-index\n",
" plt.text(0.5, 0.85-0.065*i, str(all_people_new_sorted.h_index[i]) + option_h, **params_others)\n",
"\n",
" plt.text(0.0, -0.1, 'Last citation update: ' + time.strftime('%X %b %d, %Y'), **params_date)\n",
" plt.text(0.65, 0.6, 'Gini (citation)', **params_gini)\n",
" plt.text(0.92, 0.6, str(compute_gini(all_people_new_sorted.citation)), **params_gini_val)\n",
" plt.text(0.65, 0.53, 'Gini (h-index)', **params_gini)\n",
" plt.text(0.92, 0.53, str(compute_gini(all_people_new_sorted.h_index)), **params_gini_val)\n",
"\n",
" # DRAW New Twitter Feed from Kording Lab\n",
" #statuses = api.GetUserTimeline(screen_name=\"KordingLab\") # getting all tweets\n",
" #plt.text(-0.1, -0.05, 'Tweets: ', **params_gini)\n",
" #try:\n",
" # new_tweet = str(statuses[0].text.encode('ascii', 'ignore'))\n",
" # plt.text(0.1, -0.05, new_tweet, **params_gini_val)\n",
" #except ValueError:\n",
" # plt.text(0.1, -0.05, \"Can't retrive tweet...\", **params_gini_val)\n",
" \n",
" \n",
" # DRAW IMAGE\n",
" if len(name_diff) > 0:\n",
" URL = IMG_LINK[name_diff[-1]] # use lowested name to show image\n",
" else:\n",
" URL = 'http://www.qwantz.com/patreon/p3.png' # default dinosaur images \n",
" file = cStringIO.StringIO(urlopen(URL).read())\n",
"\n",
" img = Image.open(file)\n",
" axicon = fig.add_axes([0.6,0.15,0.33,0.33])\n",
" plt.imshow(img)\n",
" plt.axis('off')\n",
" \n",
" fig.canvas.draw()\n",
" fig.canvas.activateWindow()\n",
" plt.draw()\n",
" plt.show()\n",
" \n",
" \n",
" # Play music and update part!\n",
" if is_different(all_people_sorted, all_people_new_sorted):\n",
" play_music()\n",
" all_people = all_people_new # replace all people with new one\n",
" all_people_sorted = sort_citation(all_people) # sorted again\n"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Run Google Scholar"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After running library and functions part, run this line to show the citation update"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Draw the First citation image\n",
"plt.close('all')\n",
"fig = plt.figure(facecolor='white') # or 'white' depending on bg we want\n",
"\n",
"all_people = get_citation_matrix() # get citation from provided url\n",
"options, name_diff = get_options(all_people, all_people)\n",
"options_old = options # assign value to get rid of conflict\n",
"all_people_sorted = sort_citation(all_people)\n",
"all_people_new = get_citation_matrix()\n",
"all_people_new_sorted = sort_citation(all_people_new)\n",
"draw_citation_table() # draw first K-lab citation\n",
"\n",
"# timer to run code every some amount of time\n",
"timer = fig.canvas.new_timer(interval=1000*60*15) # run every 15 minutes (1000*60*15 milli-seconds)\n",
"timer.add_callback(draw_citation_table)\n",
"timer.start()"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Close figures and Stop Timer"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"this section is to close the real time citation"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# close all figure and stop timer\n",
"timer.stop()\n",
"plt.close('all')"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Adding customize css file for NBViewer"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from IPython.core.display import HTML\n",
"HTML(open(\"./custom_nb.css\", \"r\").read())"
],
"language": "python",
"metadata": {},
"outputs": []
}
],
"metadata": {}
}
]
}