Found it much easier to hijack their APIs using Chrome (as opposed to browser scraping).
Broke the process into two parts:
- Getting a list of all relevant claims
- Retrieving PDFs for said claims
- Go to 'https://my.collectivehealth.com/activity?patientId=0' (and log in if prompted), then navigate to the person whose claims you want.
- Open Chrome's inspector to the Network panel
- Refresh the page
- One of the requests is called 'claim'; right-click it and select "Copy > Copy as cURL"
- This will give you something like:
curl 'https://my.collectivehealth.com/api/v2/person/...&skip=0&limit=20' -H 'DNT: 1' -H 'Accept-Encoding: gzip, deflate, br' - ....
- Modify the limit parameter to a sufficiently large value (I set mine to 5000); it controls the number of claims to scrape.
- Save that output to a file by adding -o claims.json to the end of the curl command.
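Putting those two tweaks together, the command you paste back into your terminal should look roughly like this (all the copied headers kept as-is, elided here):
curl 'https://my.collectivehealth.com/api/v2/person/...&skip=0&limit=5000' -H 'DNT: 1' ... -o claims.json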
You now have the raw data for all your claims!
(a) I use the following incantation to turn the JSON data into a smaller structure that I find easier to read.
cat claims.json | jq '[.data[] | select(.claimType == "professionalMedicalClaim" or .claimType == "institutionalMedicalClaim") | {id: .id, claimSystemId: .claimSystemId, dateOfServiceStart: .dateOfServiceStart, claimDescription:.claimDescription, displayProvider: .displayProvider.name, billedAmount: .billedAmount, planPaid: .planPaid, patientResponsibility: .patientResponsibility, filename: ((.dateOfServiceStart | gsub("-";"")) + "_" + .claimSystemId + "_" + (.claimDescription | gsub(" ";"-") | ascii_downcase) + "_" + (.displayProvider.name| gsub(" ";"-") | ascii_downcase)) }]' > claims-filtered.json
(If you want to explore your claims file, I highly recommend http://visidata.org/. It's incredible!)
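To spot-check the filtered file, you can peek at its first entry; each entry should carry the fields selected above, plus the computed filename field (date + claim system id + description + provider, hyphenated and lowercased):
jq '.[0]' claims-filtered.json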
(b) Find an example claim you want to download a PDF for on 'https://my.collectivehealth.com/activity?patientId=0'. Similar to what we did for the 'claim' request above, find the request called download and copy its cURL. It'll look something like:
curl 'https://my.collectivehealth.com/api/v1/claim/1234567/download' -H 'DNT: 1' -H 'Accept-Encoding: gzip, deflate, br' -H 'CH-Login-Token: ...' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36' -H 'Accept: application/json' -H 'Cookie: ...' -H 'Connection: keep-alive' --compressed
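If you want to sanity-check that request before scripting it, the response is a small JSON object containing a URL for the actual PDF (the "url" field the script in (c) reads). Something like this, with your own copied headers in place of the elided ones, should print that URL:
curl 'https://my.collectivehealth.com/api/v1/claim/1234567/download' -H 'CH-Login-Token: ...' -H 'Cookie: ...' --compressed | jq -r '.url'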
(c) Create a file called download.py with the following contents:
#!/usr/bin/env python
import json
import subprocess
import sys

if len(sys.argv) != 3:
    print("Usage: %s <filename> <curl-params>" % sys.argv[0])
    sys.exit(1)

filename = sys.argv[1]
curl_params = sys.argv[2]

with open(filename, "r") as read_file:
    data = json.load(read_file)

for claim in data:
    # Ask the download endpoint for this claim; it returns JSON containing a URL for the PDF.
    response = subprocess.check_output(
        '''curl 'https://my.collectivehealth.com/api/v1/claim/%d/download' %s''' % (
            claim["id"], curl_params), shell=True)
    jresponse = json.loads(response)
    # Fetch the PDF itself, saved under the filename field we built with jq.
    subprocess.check_output('''curl -X GET "%s" -o "%s.pdf"''' % (
        jresponse["url"], claim["filename"]), shell=True)
Copy everything after the URL from the cURL command in (b) into the command below.
chmod +x download.py
./download.py claims-filtered.json "STUFF_AFTER_URL_FROM_CURL"
# this takes a second
(time to ... profit!)
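As an optional sanity check once it finishes, compare the number of PDFs on disk with the number of entries in the filtered file; the counts should match:
jq 'length' claims-filtered.json
ls *.pdf | wc -l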
- The steps above download PDFs for all the claims. You can modify the claims.json filtering to pick whichever claims you see fit.
E.g., find all the claims for a given provider:
cat claims-filtered.json | jq '[.[] | select(.displayProvider | startswith("PROVIDER_NAME"))]'
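Or, say, everything since the start of a given year (assuming dateOfServiceStart is in YYYY-MM-DD form, which the filename construction above relies on):
cat claims-filtered.json | jq '[.[] | select(.dateOfServiceStart >= "2019-01-01")]'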
- This only works for medical claims (not Rx, etc.), hence the claimType filter in the jq command above.
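If you're curious which claim types show up in your raw data (the same .data[].claimType field the jq filter in (a) keys on), this lists the distinct values:
cat claims.json | jq '[.data[].claimType] | unique'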