Combined Annotation Dependent Depletion (CADD) is a useful tool for querying SNPs of interest. The following is a small client for the CADD API that performs batch queries. The code is provided in a Jupyter notebook, which can be run on its own or reused as part of a larger program.
There are several important caveats to keep in mind:
- The API is, by definition, experimental, and is not intended for retrieving thousands or millions of variants. Do NOT remove the lines of code that provide a pause, and do NOT use this for more than 1000 queries at a time. Doing so risks crashing the server, and will result in some very irate researchers at the University of Washington.
- The API is in its early stages and may change significantly at a later date, requiring the code to be updated as well.
- The Jupyter notebook is written to accommodate only a single SNP position per query, but the API also supports a SNP range, such as 22:44044001-44044002. Modifying the code to handle ranges is fairly straightforward, and has been left as an exercise for anyone so inclined.
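As a starting point for the range exercise, here is a minimal sketch of a URL builder that accepts either form. The helper name `build_cadd_url` and the validation pattern are hypothetical, and the sketch assumes a range is appended to the base URL in the same way as a single position; check this against the current CADD API documentation before relying on it.

```python
import re

# Base URL as used elsewhere in this notebook.
CADD_BASE = "https://cadd.gs.washington.edu/api/v1.0/v1.3/"

# Assumed shape: chromosome, a colon, a position, and an optional "-end".
_COORD_RE = re.compile(r"([0-9A-Za-z]+):([0-9]+)(?:-([0-9]+))?")


def build_cadd_url(coord):
    """Return the query URL for 'chr:pos' or 'chr:start-end', else raise ValueError."""
    text = coord.strip()
    match = _COORD_RE.fullmatch(text)
    if match is None:
        raise ValueError("not in chr:position or chr:start-end form: %r" % coord)
    start, end = match.group(2), match.group(3)
    # Reject reversed ranges before they ever reach the server.
    if end is not None and int(end) < int(start):
        raise ValueError("range end precedes start: %r" % coord)
    return CADD_BASE + text


print(build_cadd_url("5:2003402"))
print(build_cadd_url("22:44044001-44044002"))
```

Because the function only builds a string, it can be tested without touching the server. The pause between requests still applies to whatever code ends up sending the URL.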
Step 1: Import file of coordinates. The file must be a single-column text file with one coordinate per line, in the format chr:position.
Example:
5:2003402
5:2003609
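Since every malformed line costs a wasted request, it can help to check the file contents against this format before anything is sent. The following is a minimal sketch, not part of the original notebook: `parse_coords` and `MAX_QUERIES` are hypothetical names, and the pattern is only an assumption about what a valid chr:position looks like.

```python
import re

MAX_QUERIES = 1000  # the per-run ceiling stated in the caveats above
_SINGLE_POS = re.compile(r"[0-9A-Za-z]+:[0-9]+")  # assumed shape of chr:position


def parse_coords(lines, limit=MAX_QUERIES):
    """Return cleaned coordinates, skipping blank lines and rejecting malformed ones."""
    coords = []
    for line_no, raw in enumerate(lines, start=1):
        text = raw.strip()
        if not text:
            continue  # tolerate blank lines, e.g. at the end of the file
        if _SINGLE_POS.fullmatch(text) is None:
            raise ValueError("line %d is not chr:position: %r" % (line_no, text))
        coords.append(text)
    if len(coords) > limit:
        raise ValueError("%d coordinates exceed the limit of %d" % (len(coords), limit))
    return coords


print(parse_coords(["5:2003402\n", "5:2003609\r\n", "\n"]))
```

The function accepts any iterable of lines, so an open file handle can be passed to it directly.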
coordsList = []
FILENAME = 'debugCADD.txt'
with open(FILENAME, 'r') as fOpen:
    for i in fOpen:
        i = i.rstrip('\r\n')
        coordsList.append(i)
# simple bash implementation that is also useful for debugging; run within the Jupyter notebook
# output will be in the file 'urlOut.json.txt'
# because of -i, the file starts with the HTTP response headers, followed by the JSON body
!curl -i -L https://cadd.gs.washington.edu/api/v1.0/v1.3/5:2003402 > urlOut.json.txt
Python Code
import random
import time

import requests

def lookupOnCADD(cID):
    # The documentation explicitly states that the API is experimental ONLY.
    # The next 2 lines are required unless you want to be the person responsible for crashing a server:
    # they pause for a random 1.5 to 2.5 seconds before every request.
    n = random.random() + 1.5
    time.sleep(n)
    server = "https://cadd.gs.washington.edu/api/v1.0/v1.3/"
    # The requests line is the actual query.
    # It forms the string https://cadd.gs.washington.edu/api/v1.0/v1.3/cID, where cID is the identifier passed to the function,
    # and submits it as an HTTP GET request; the response body is then parsed as JSON.
    q = server + cID
    r = requests.get(q, headers={"Content-Type": "application/json"})
    if not r.ok:
        # if there's an error in the HTTP request, print it with the identifier.
        # usually this is from a position that cannot be found
        print(str(cID) + "\t" + "ERROR")
        return (None, 0)
    try:
        decoded = r.json()
    except ValueError:
        # r.json() raises a ValueError subclass when the body is not valid JSON
        decoded = None
    if not decoded:
        # if the JSON is invalid or empty, output the identifier and the error.
        # Note that this will be rare, or indicative of a connection/network problem
        print("Error decoding JSON for " + str(cID))
        return (None, -1)
    # if there are no errors, return the parsed JSON
    return (1, decoded)
# The data from lookupOnCADD() will be in JSON format as [{k1:val1, k2:val2,...}, {k1:val1, k2:val2},...]
# There are many ways to output it; below is one example.
# Note: do NOT rely on the order of the keys. Python dicts only guarantee insertion order
# from version 3.7 onward, so look fields up by key rather than by position.
for i in coordsList:
    res = lookupOnCADD(i)
    if res[0]:
        for itm in res[1]:
            for k, v in itm.items():
                # str() guards against non-string values such as numbers or null
                print(k + ',' + str(v))
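If the results are headed for a spreadsheet or a downstream script, one option is to write them as CSV with an explicit column order, which avoids the key-ordering question altogether. The following is a minimal sketch under that assumption; `records_to_csv` is a hypothetical helper, and the sample records are made up for illustration and are not real CADD output.

```python
import csv
import io


def records_to_csv(records, out):
    """Write a list of dicts as CSV with a header row and a reproducible column order."""
    # Collect every key that appears, then sort so the column order never varies.
    fieldnames = sorted({key for rec in records for key in rec})
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(records)


# Made-up records for illustration only.
sample = [
    {"id": "a", "score": "1.0"},
    {"id": "b", "score": "2.5", "note": "x"},
]
buffer = io.StringIO()
records_to_csv(sample, buffer)
print(buffer.getvalue())
```

In the loop above, the list returned in `res[1]` could be passed to this function in place of `sample`, with an open file in place of `buffer`. Records that lack a field get an empty cell rather than raising an error.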