@mjpost
Last active February 14, 2024 12:58
Uses the Semantic Scholar API (with Anthology support!) to get paper citation counts for an Anthology volume
#!/usr/bin/env python3
"""Uses the Semantic Scholar API to get citation counts for all papers in
an ACL volume. Assumes old-style IDs (e.g., P96-1).

Mad props to Semantic Scholar for making this so easy.
"""

import json
import shutil
import sys
import urllib.request


def retrieve_url(remote_url: str, local_path: str):
    """
    Saves a URL to a local path. Can handle cookies, e.g., those
    used downloading PDFs from MIT Press (TACL, CL).

    :param remote_url: The URL to download from. Currently supports http only.
    :param local_path: Where to save the file to.
    """
    if remote_url.startswith("http"):
        cookieProcessor = urllib.request.HTTPCookieProcessor()
        opener = urllib.request.build_opener(cookieProcessor)
        request = urllib.request.Request(
            remote_url, headers={'User-Agent': 'Mozilla/5.0'}
        )
        with opener.open(request, timeout=1000) as url, open(
            local_path, mode="wb"
        ) as output_file_fh:
            output_file_fh.write(url.read())
    else:
        # Not a URL: treat it as a local file path and copy it.
        shutil.copyfile(remote_url, local_path)

    return True


def main(args):
    # Old-style IDs append a three-digit paper number to the volume ID.
    # The loop stops at the first paper number that cannot be fetched.
    for num in range(1, 1000):
        url = f"https://api.semanticscholar.org/v1/paper/ACL:{args.volume}{num:03d}"
        try:
            retrieve_url(url, "t.json")
            with open("t.json") as json_fh:
                d = json.load(json_fh)
            """
            Here are all the supported keys:
            dict_keys(['abstract',
                       'arxivId',
                       'authors',
                       'citationVelocity',
                       'citations',
                       'corpusId',
                       'doi',
                       'fieldsOfStudy',
                       'influentialCitationCount',
                       'isOpenAccess',
                       'isPublisherLicensed',
                       'is_open_access',
                       'is_publisher_licensed',
                       'numCitedBy',
                       'numCiting',
                       'paperId',
                       'references',
                       'title',
                       'topics',
                       'url',
                       'venue',
                       'year'])
            """
            print(len(d["citations"]), d["influentialCitationCount"], d["title"], sep="\t")
        except Exception:
            print(f"* couldn't find URL {url}, quitting.", file=sys.stderr)
            sys.exit(1)


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("volume")
    args = parser.parse_args()

    main(args)
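
The two pieces of logic inside `main` (building the request URL from an old-style ID, and turning a response into one output line) can be tried without network access. The sketch below is only an illustration: `paper_url` and `format_row` are hypothetical helper names that do not appear in the script, and the sample record is made up to have the same three keys the script reads.

```python
def paper_url(volume: str, num: int) -> str:
    """Build the lookup URL the script requests for paper `num` of `volume`.

    Hypothetical helper; the script builds this string inline.
    """
    return f"https://api.semanticscholar.org/v1/paper/ACL:{volume}{num:03d}"


def format_row(record: dict) -> str:
    """Format one decoded response as: citations, influential citations, title.

    Hypothetical helper; mirrors the print() call in main().
    """
    return "\t".join(
        [
            str(len(record["citations"])),
            str(record["influentialCitationCount"]),
            record["title"],
        ]
    )


# Made-up record with only the keys the script uses.
example_record = {
    "citations": [{}, {}, {}],
    "influentialCitationCount": 1,
    "title": "An Example Title",
}

print(paper_url("P96-1", 44))
print(format_row(example_record))
```

Note that the citation count is the length of the `citations` list in the response, not a separate numeric field.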

mjpost commented Jun 8, 2021

Running on volume 1 of ACL 1996 gives the following. The columns are (citations, influential citations, title).

./get_citation_counts.py P96-1
40      1       Higher-Order Coloured Unification and Natural Language Semantics
40      3       A Model-Theoretic Framework for Theories of Syntax
173     12      Noun-Phrase Analysis in Unrestricted Text for Information Retrieval
43      5       Morphological Cues for Lexical Semantics
44      0       From Submit to Submitted via Submission: On Lexical Rules in Large-Scale Lexicon Acquisition
559     45      Integrating Multiple Knowledge Sources to Disambiguate Word Sense: An Exemplar-Based Approach
4       0       Using Eye Movements to Study Spoken Language Comprehension: Evidence for Incremental Interpretation (Invited Talk)
153     11      A Fully Statistical Approach to Natural Language Interfaces
249     8       A Robust System for Natural Spoken Dialogue
149     11      Combining Trigram-based and Feature-based Methods for Context-Sensitive Spelling Correction
110     8       Efficient Normal-Form Parsing for Combinatory Categorial Grammar
13      1       Another Facet of LIG Parsing
4       1       Parsing for Semidirectional Lambek Grammar is NP-Complete
15      2       Computing Optimal Descriptions for Optimality Theory Grammars with Context-Free Position Structures
76      9       Directed Replacement
30      0       Synchronous Models of Language
4       0       Coordination as a Direct Process
47      7       High-Performance Bilingual Text Alignment Using Statistical and Dictionary Information
16      0       An Iterative Algorithm to Build Chinese Language Models
36      0       Pattern-Based Context-Free Grammars for Machine Translation
178     15      A Polynomial-Time Algorithm for Statistical Machine Translation
11      0       SEMHE: A Generalised Two-Level System
76      3       Head Automata and Bilingual Tiling: Translation with Minimal Representations
164     10      Parsing Algorithms and Metrics
704     70      A New Statistical Parser Based on Bigram Lexical Dependencies
14      0       Two Sources of Control over the Generation of Software Instructions
141     5       Chart Generation
9       0       Evaluating the Portability of Revision Rules for Incremental Summary Generation
47      2       Compilation of Weighted Finite-State Transducers from Decision Trees
42      2       Fast Parsing Using Pruning and Grammar Specialization
155     16      An Efficient Compiler for Weighted Rewrite Rules
24      0       Efficient Tabular LR Parsing
8       1       Magic for Filter Optimization in Dynamic Bottom-up Processing
16      0       Efficient Transformation-Based Parsing
17      0       Resolving Anaphors in Embedded Sentences
114     6       Functional Centering
39      2       Mechanisms for Mixed-Initiative Human-Computer Collaborative Discourse
205     21      A Prosodic Analysis of Discourse Segments in Direction-Giving Monologues
44      1       An Information Structural Approach to Spoken Language Generation
3       1       The Rhythm of Lexical Stress in Prose
235     3       An Empirical Study of Smoothing Techniques for Language Modeling
101     4       Minimizing Manual Annotation Cost in Supervised Training from Corpora
33      0       Unsupervised Learning of Word-Category Guessing Rules
* couldn't find URL https://api.semanticscholar.org/v1/paper/ACL:P96-1044, quitting.
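
Since the output is three whitespace-separated columns, it is easy to post-process. The following is a minimal sketch, assuming the output has been captured as text. `Row` and `parse_rows` are hypothetical names, not part of the script, and the sample uses three rows copied from the listing above.

```python
from typing import List, NamedTuple


class Row(NamedTuple):
    citations: int
    influential: int
    title: str


def parse_rows(text: str) -> List[Row]:
    """Parse lines of the form '<citations> <influential> <title>'."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "* couldn't find URL ..." message.
        if not line or line.startswith("*"):
            continue
        # Split on the first two whitespace runs only; titles contain spaces.
        cites, influential, title = line.split(None, 2)
        rows.append(Row(int(cites), int(influential), title))
    return rows


sample = """\
4\t0\tCoordination as a Direct Process
704\t70\tA New Statistical Parser Based on Bigram Lexical Dependencies
173\t12\tNoun-Phrase Analysis in Unrestricted Text for Information Retrieval
* couldn't find URL https://api.semanticscholar.org/v1/paper/ACL:P96-1044, quitting.
"""

# Rank papers by citation count, highest first.
ranked = sorted(parse_rows(sample), key=lambda r: r.citations, reverse=True)
for row in ranked:
    print(row.citations, row.influential, row.title, sep="\t")
```

The script writes the final "couldn't find URL" line to stderr, so it would normally not appear in redirected stdout. The sketch skips it anyway in case both streams were captured together.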
