Extract Claims Data from Collective Health.

Free your Collective Claim Data(!)

Found it much easier to hijack their APIs using Chrome (as opposed to browser scraping).

Broke the process into two parts:

  1. Getting a list of all relevant claims
  2. Retrieving PDFs for said claims

(1) Getting a list of all relevant claims

  • Go to '' (log in if prompted) and navigate to the person whose claims you want.
  • Open Chrome's inspector to the Network panel
  • Refresh the page
  • One of the requests is called 'claim'. Right-click it and select "Copy > Copy as cURL".
  • This will give you something like:
curl '' -H 'DNT: 1' -H 'Accept-Encoding: gzip, deflate, br' - ....
  • Set the limit parameter in the URL to a sufficiently large value (I used 5000); it controls how many claims are returned.

  • Save that output to a file by adding -o claims.json to the end of the curl command.
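The limit tweak can also be scripted instead of edited by hand. A minimal sketch, assuming the copied URL carries limit as an ordinary query parameter; the example.invalid URL and the with_limit helper are placeholders for illustration, not the real endpoint:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def with_limit(url, limit):
    """Return url with its 'limit' query parameter set to limit."""
    parts = urlsplit(url)
    # Keep every other query parameter as-is, dropping any existing limit.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != "limit"]
    query.append(("limit", str(limit)))
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    # Placeholder URL, for illustration only.
    print(with_limit("https://example.invalid/claim?offset=0&limit=20", 5000))
```

Paste the result back into the copied curl command in place of the original URL.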

You now have the raw data for all your claims!
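Before filtering, it can help to see what the dump contains. A rough sketch, assuming only what the jq filter in (2) relies on: a top-level data list whose entries carry a claimType field. The count_claim_types helper is a made-up name:

```python
import json
import sys
from collections import Counter


def count_claim_types(payload):
    """Count claims per claimType in a parsed claims.json payload."""
    # Claims without a claimType are grouped under "unknown".
    return Counter(claim.get("claimType", "unknown") for claim in payload["data"])


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: python count_types.py claims.json  (the script name is arbitrary)
    with open(sys.argv[1]) as f:
        for claim_type, n in count_claim_types(json.load(f)).most_common():
            print(claim_type, n)
```

If the total is exactly equal to the limit you set, the limit was probably too small.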

(2) Retrieving PDFs for said claims

(a) I use the following incantation to reduce the JSON data to a smaller structure that I find easier to read.

cat claims.json | jq '[.data[] | select(.claimType == "professionalMedicalClaim" or .claimType == "institutionalMedicalClaim") | {id: .id, claimSystemId: .claimSystemId, dateOfServiceStart: .dateOfServiceStart, claimDescription:.claimDescription, displayProvider:, billedAmount: .billedAmount, planPaid: .planPaid, patientResponsibility: .patientResponsibility, filename: ((.dateOfServiceStart | gsub("-";"")) + "_" + .claimSystemId + "_" + (.claimDescription | gsub(" ";"-") | ascii_downcase) + "_" + (| gsub(" ";"-") | ascii_downcase)) }]' > claims-filtered.json

(if you want to do so, I highly recommend using to explore your claims file. It's incredible!)
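For anyone who prefers Python to jq, here is a rough equivalent of the filter in (a). It is a sketch, not a drop-in replacement: provider_key is an assumption, since the raw field holding the provider's display name has to be supplied by you, and filter_claims and slug are made-up helper names.

```python
MEDICAL_TYPES = {"professionalMedicalClaim", "institutionalMedicalClaim"}


def slug(text):
    """Lowercase and hyphenate spaces (jq's ascii_downcase only touches ASCII)."""
    return text.replace(" ", "-").lower()


def filter_claims(payload, provider_key):
    """Keep medical claims and reduce each to the fields used for naming PDFs.

    provider_key is an ASSUMPTION: pass whichever raw field holds the
    provider's display name as a plain string.
    """
    slim = []
    for claim in payload["data"]:
        if claim.get("claimType") not in MEDICAL_TYPES:
            continue
        provider = claim[provider_key]
        slim.append({
            "id": claim.get("id"),
            "claimSystemId": claim.get("claimSystemId"),
            "dateOfServiceStart": claim.get("dateOfServiceStart"),
            "claimDescription": claim.get("claimDescription"),
            "displayProvider": provider,
            "billedAmount": claim.get("billedAmount"),
            "planPaid": claim.get("planPaid"),
            "patientResponsibility": claim.get("patientResponsibility"),
            # Same shape as the jq filename: date_systemId_description_provider
            "filename": "_".join([
                claim["dateOfServiceStart"].replace("-", ""),
                claim["claimSystemId"],
                slug(claim["claimDescription"]),
                slug(provider),
            ]),
        })
    return slim
```

Load claims.json with json.load, pass the result to filter_claims, and write the returned list to claims-filtered.json with json.dump.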

(b) On '', find an example claim you want to download a PDF for. As in (1), find the request called download and copy its cURL. It'll look something like:

curl '' -H 'DNT: 1' -H 'Accept-Encoding: gzip, deflate, br' -H 'CH-Login-Token: ...' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36' -H 'Accept: application/json' -H 'Cookie: ...' -H 'Connection: keep-alive' --compressed

(c) Create a file called with contents:

#!/usr/bin/env python

import json
import sys
import subprocess

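if len(sys.argv) != 3:
  print("Usage: %s <filename> <curl-params>" % sys.argv[0])
  sys.exit(1)

# First argument: the filtered claims file; second: the curl flags copied from (b).
filename, curl_params = sys.argv[1], sys.argv[2]
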
with open(filename, "r") as read_file:
    data = json.load(read_file)
    for claim in data:
      response = subprocess.check_output('''curl '' %s''' % (
        claim["id"], curl_params), shell=True)
      jresponse = json.loads(response)
      subprocess.check_output('''curl -X GET "%s" -o %s.pdf''' % (
        jresponse["url"], claim["filename"]), shell=True)

Copy everything after the URL in the cURL command from (b) into the command below.

chmod +x
./ claims-filtered.json "STUFF_AFTER_URL_FROM_CURL"
# this takes a second
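The script above pastes claim["filename"] and the curl flags straight into a shell command, so a claim description containing a slash or a quote can break the download. One possible hardening, sketched here with made-up helper names (safe_name, pdf_command, metadata_command), is to build argument lists and skip the shell entirely:

```python
import re
import shlex


def safe_name(name):
    """Replace characters that are awkward in file names with hyphens."""
    return re.sub(r"[^A-Za-z0-9._-]+", "-", name).strip("-")


def pdf_command(url, filename):
    """Build the curl argument list for one PDF download (no shell involved)."""
    return ["curl", "-X", "GET", url, "-o", safe_name(filename) + ".pdf"]


def metadata_command(url, curl_params):
    """Build the curl argument list for the request that returns the PDF URL.

    curl_params is the string of flags copied from the browser; shlex.split
    parses it the way a POSIX shell would, so quoted headers stay intact.
    """
    return ["curl", url] + shlex.split(curl_params)
```

Either list can be passed to subprocess.check_output as-is, without shell=True.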

(time to ... profit!)


  • The steps above download PDFs for all the claims. You can modify the claims.json filtering to pick whichever claims you see fit.

E.g. find all the claims for a given provider:

cat claims-filtered.json | jq '[.[] | select(.displayProvider | startswith("PROVIDER_NAME"))]'

  • This only works for medical claims (not Rx, etc.), which is why the filter in (a) selects on claimType.
