@disulfidebond
Created September 23, 2019 17:33
CADD API

Overview

Combined Annotation Dependent Depletion (CADD) is a useful tool for querying SNPs of interest. The following is an implementation of the CADD API for performing batch queries. The code is provided in a Jupyter notebook, which can be run on its own or reused as part of a larger program.

There are several important caveats to keep in mind:

  • The API is explicitly experimental and is not intended for retrieving thousands or millions of variants. Do NOT remove the lines of code that provide a pause, and do NOT use this for more than 1000 queries at a time. Doing so will result in the server crashing, and some very irate researchers at the University of Washington.
  • The API is in the early stages, and may change significantly at a later date, requiring the code to be updated as well.
  • The Jupyter notebook is written to accommodate only a single SNP position per query, but the API also supports a SNP range, such as 22:44044001-44044002. Modifying the code to handle ranges is fairly straightforward, and has been left as an exercise to anyone so inclined.
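As a starting point for the range exercise, the query string itself can be built with a small helper. This is only a minimal sketch: `build_query` is a hypothetical name that does not appear in the notebook, and the only format assumed is the `chr:start-end` pattern from the example above. It does not contact the server, and how the response to a range query should be handled is left open.

```python
def build_query(chrom, start, end=None):
    """Return 'chr:pos' for a single position or 'chr:start-end' for a range.

    Hypothetical helper: only the identifier format shown in the text is assumed.
    """
    start = int(start)
    if end is None:
        return f"{chrom}:{start}"
    end = int(end)
    if end < start:
        raise ValueError("range end must not be smaller than range start")
    return f"{chrom}:{start}-{end}"


print(build_query(22, 44044001, 44044002))
print(build_query(5, 2003402))
```

The returned string could then be passed wherever a single `chr:position` identifier is used today, provided the pause between requests is kept.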

Step 1: Import the file of coordinates. The file must be a single-column text file with one coordinate per line, in the format chr:position

  • Example:

      5:2003402
      5:2003609
    

    coordsList = []
    FILENAME = 'debugCADD.txt'
    with open(FILENAME, 'r') as fOpen:
        for i in fOpen:
            i = i.rstrip('\r\n')
            coordsList.append(i)
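Before any requests are sent, it may help to check the loaded coordinates against the expected `chr:position` shape and against the 1000-query ceiling from the caveats. The following is a sketch under assumptions: `check_coords`, `MAX_QUERIES`, and `COORD_PATTERN` are hypothetical names, and the regular expression is a guess at what a valid identifier looks like (an alphanumeric chromosome label, a colon, then digits).

```python
import re

MAX_QUERIES = 1000  # ceiling taken from the caveats above
# Assumed shape of a single-position identifier, e.g. 5:2003402 or X:12345
COORD_PATTERN = re.compile(r"[0-9A-Za-z]+:[0-9]+")


def check_coords(coords, limit=MAX_QUERIES):
    """Split coords into (valid, invalid) and refuse batches larger than limit."""
    valid = [c for c in coords if COORD_PATTERN.fullmatch(c)]
    invalid = [c for c in coords if not COORD_PATTERN.fullmatch(c)]
    if len(valid) > limit:
        raise ValueError(f"{len(valid)} coordinates exceed the limit of {limit}")
    return valid, invalid


valid, invalid = check_coords(["5:2003402", "5:2003609", "", "chr5 2003402"])
print(valid)
print(invalid)
```

Blank lines and malformed entries end up in `invalid`, so they can be reported instead of being sent to the server as doomed requests.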

    # simple bash implementation that is also useful for debugging, run within jupyter notebook
    # output will be in the file 'urlOut.json.txt'
    !curl -i -L https://cadd.gs.washington.edu/api/v1.0/v1.3/5:2003402 > urlOut.json.txt

Python Code

    import requests
    import time
    import random


    def lookupOnCADD(cID):
        # The documentation explicitly stated that it was experimental ONLY
        # the next 2 lines are required unless you want to be the person responsible for crashing a server
        n = random.random() + 1.5
        time.sleep(n)
        server = "https://cadd.gs.washington.edu/api/v1.0/v1.3/"
        # the requests line is the actual query.
        # it forms a string https://cadd.gs.washington.edu/api/v1.0/v1.3/cID where cID is the identifier passed to the function
        # then submits this as an HTTP GET request; the response body is parsed as JSON further down.
        q = server+cID
        r = requests.get(q, headers={ "Content-Type" : "application/json"})
        if not r.ok:
            # if there's an error in the HTTP request, print it with the identifier.
            # usually this is from a position that cannot be found
            print(str(cID)+"\t"+"ERROR")
            return (None, 0)
        else:
            try:
                decoded = r.json()
            except ValueError:
                # r.json() raises ValueError when the body is not valid JSON
                decoded = None
            if not decoded:
                # malformed or empty JSON: output the identifier and the error.
                # Note that this will be rare, or indicative of a connection/network problem
                print("Error decoding JSON for " + str(cID))
                return (None, -1)
            else:
                # if there are no errors, return the JSON object
                return (1, decoded)

    # The data from lookupOnCADD() will be in JSON format as [{k1:val1, k2:val2,...}, {k1:val1, k2:val2},...]
    # this is one of many ways to output it; below is one example.
    # Note: Python 3.7+ dicts keep the key order of the parsed JSON, but older versions
    # do not, so do NOT rely on key ordering if the code may run on an older Python.
    for i in coordsList:
        res = lookupOnCADD(i)
        if res[0]:
            for itm in res[1]:
                for k,v in itm.items():
                    # str() guards against non-string JSON values such as numbers
                    print(str(k) + ',' + str(v))
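Another way to output the results is to write the list of dicts as CSV with an explicit header, which avoids depending on key order at all. This is a sketch, not part of the notebook: `write_records` is a hypothetical helper, and the sample records use placeholder keys and values rather than real CADD output.

```python
import csv
import io


def write_records(records, handle):
    """Write a list of dicts as CSV, using the sorted union of keys as the header."""
    fieldnames = sorted({key for rec in records for key in rec})
    # restval fills in columns that a given record does not contain
    writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(records)
    return fieldnames


# Placeholder records standing in for one decoded response (not real CADD data)
sample = [{"pos": "2003402", "score": 1.5}, {"pos": "2003609", "alt": "T"}]
buf = io.StringIO()
header = write_records(sample, buf)
print(buf.getvalue())
```

In the loop above, `res[1]` could be passed in place of `sample`, with a file opened via `open(path, 'w', newline='')` in place of the in-memory buffer.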