Dario Taraborelli dartar
@dartar
dartar / scholar.py
Created May 19, 2012 16:49
Modified version of scholar.py (a Python Google Scholar parser by Christian Kreibich) that exposes the direct PDF download URL when available
#! /usr/bin/env python
"""
This module provides classes for querying Google Scholar and parsing
returned results. It currently *only* processes the first results
page. It is not a recursive crawler.
"""
# Version: 1.3 -- $Date: 2012-02-01 16:51:16 -0800 (Wed, 01 Feb 2012) $
#
# ChangeLog
# ---------
@dartar
dartar / gist:e3bbc61970d66acdcf88
Created November 18, 2014 18:33
Monthly PV ex-US
SELECT
LEFT(timestamp,7) AS month,
SUM(CASE WHEN country NOT IN ("US", "Inv") THEN pageviews END) AS ex_us_total,
SUM(CASE WHEN is_spider = 0 AND `is_automata` = 0 AND country NOT IN ("US", "Inv") THEN pageviews END) AS ex_us_human,
SUM(CASE WHEN is_spider = 0 AND `is_automata` = 0 AND country NOT IN ("US", "Inv") AND `access_method` = "Desktop" THEN pageviews END) AS ex_us_human_desktop,
SUM(CASE WHEN is_spider = 0 AND `is_automata` = 0 AND country NOT IN ("US", "Inv") AND `access_method` = "Mobile web" THEN pageviews END) AS ex_us_human_mobile,
SUM(CASE WHEN is_spider = 0 AND `is_automata` = 0 AND country NOT IN ("US", "Inv") AND `refering_site` = "Google" THEN pageviews END) AS ex_us_human_google,
SUM(CASE WHEN is_spider = 0 AND `is_automata` = 0 AND country NOT IN ("US", "Inv") AND `access_method` = "Desktop" AND `refering_site` = "Google" THEN pageviews END) AS ex_us_human_desktop_google,
SUM(CASE WHEN is_spider = 0 AND `is_automata` = 0 AND country NOT IN ("US", "Inv") AND `access_meth
ggplot(data=tgm, aes(x=month, y=value, colour=series, group=series)) +
  xlab("Month") +
  ylab("Pageviews (billion)") +
  ggtitle(tgtitle) +
  geom_line() +
  stat_smooth(method="lm", linetype='dotted', alpha=0, aes(outfit=fit<<-..y..)) +
  theme_bw() +
  theme(plot.title = element_text(size = rel(2.2)),
        axis.title = element_text(size = rel(1.6)),
        axis.text = element_text(size = rel(1.2)),
        legend.title = element_text(size = rel(1.5)),
        legend.text = element_text(size = rel(1.3))) +
  scale_x_discrete(breaks=c('2013-05', '2013-07', '2013-09', '2013-11', '2014-01',
                            '2014-03', '2014-05', '2014-07', '2014-09', '2014-11'))
#!/usr/bin/python
import httplib
import sys
base_url = "dx.doi.org"
for line in sys.stdin:
    doi = line.strip()
    url = base_url + doi
    conn = httplib.HTTPConnection(base_url)
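
The preview above stops right after the connection is opened, so what the script does next is not shown. As a rough illustration only, here is a minimal Python 3 sketch of one way a DOI could be resolved against the same host. It assumes the resolver answers a HEAD request with a 3xx status and a Location header. The names `doi_path`, `resolve_doi`, and `connection_factory` are hypothetical and do not come from the gist.

```python
import http.client
from urllib.parse import quote


def doi_path(doi):
    """Build the request path for a DOI, e.g. '10.1000/182' -> '/10.1000/182'."""
    return "/" + quote(doi.strip(), safe="/")


def resolve_doi(doi, host="dx.doi.org",
                connection_factory=http.client.HTTPConnection):
    """Return the redirect target reported by the resolver, or None.

    Assumption: the resolver replies to HEAD with a 3xx status and a
    Location header. This is a sketch, not the gist's own logic.
    """
    conn = connection_factory(host)
    try:
        conn.request("HEAD", doi_path(doi))
        response = conn.getresponse()
        if 300 <= response.status < 400:
            return response.getheader("Location")
        return None
    finally:
        conn.close()
```

Passing the connection class as a parameter keeps the function testable without network access; a stub object with `request`, `getresponse`, and `close` methods is enough.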
@dartar
dartar / abstract.MD
Created July 10, 2018 10:12
Abstract

Visualizing the topic and accessibility of scholarly articles cited in Wikipedia

Building on a dataset we previously released of citations with identifiers across all Wikipedia language editions, we explore the distribution of DOIs cited in Wikipedia by topic and accessibility.

Topic

We assign a topic to each publication by looking at the main topic(s) of the Wikipedia article that cites it. Topics are determined by...
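
The assignment step can be sketched as follows, under the assumption that the topic(s) of each citing article are already known. How those topics are derived is left open by the abstract and is not modeled here. The function name `topics_by_doi` and both input shapes are hypothetical.

```python
from collections import defaultdict


def topics_by_doi(citations, article_topics):
    """Collect, for each DOI, the topics of every article that cites it.

    citations: iterable of (article_title, doi) pairs.
    article_topics: dict mapping article_title -> list of topic labels
        (assumed to be given; its construction is out of scope).
    """
    collected = defaultdict(set)
    for article, doi in citations:
        # A DOI cited by an article with no known topic still gets an entry.
        collected[doi].update(article_topics.get(article, ()))
    return {doi: sorted(topics) for doi, topics in collected.items()}
```

A publication cited from several articles inherits the union of their topics, which is one possible reading of "main topic(s)".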

Accessibility

We determine the accessibility of each publication (Open Access vs Closed Access) by looking up the DOI in data provided by Unpaywall.
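
As an illustration of the lookup, the sketch below labels DOIs from a local table of Unpaywall-style records. It assumes each record carries a boolean `is_oa` field and that the table is keyed by lowercase DOI. The function name `classify_access` and the extra "unknown" label for DOIs missing from the data are assumptions added for the example.

```python
def classify_access(dois, unpaywall_records):
    """Label each DOI as 'open', 'closed', or 'unknown'.

    unpaywall_records: dict mapping lowercase DOI -> record dict with an
        "is_oa" boolean (assumed record shape).
    """
    labels = {}
    for doi in dois:
        record = unpaywall_records.get(doi.strip().lower())
        if record is None:
            labels[doi] = "unknown"   # DOI absent from the data
        elif record.get("is_oa"):
            labels[doi] = "open"
        else:
            labels[doi] = "closed"
    return labels
```

Keeping "unknown" separate from "closed" avoids counting unmatched DOIs as Closed Access.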

@dartar
dartar / GPT-wiki.txt
Created December 4, 2022 16:16
Write an essay analyzing the negative consequences of content generated by ChatGPT being reused in Wikipedia, being displayed in Google search and used for training a new LLM
The use of content generated by ChatGPT in Wikipedia, Google search results, and for training a new language model could have negative consequences. In the case of Wikipedia, the use of machine-generated content could lead to the inclusion of false or misleading information on the site, undermining its credibility and reliability. Additionally, the use of machine-generated content could make it more difficult for human editors to verify the accuracy and reliability of the information on the site, potentially leading to a decline in the overall quality of Wikipedia's content.
Similarly, the use of content generated by ChatGPT in Google search results could lead to the inclusion of false or misleading information in search results, undermining the credibility and reliability of the search engine. Additionally, the use of machine-generated content could make it more difficult for users to verify the accuracy and reliability of the information they find through Google, potentially leading to a decline in the ove