@Mr0grog
Mr0grog / example_urls.txt
Last active July 22, 2024 13:18
Check the HTTP status codes for a list of URLs in a file. Undoubtedly there are fancier ways to do this with wget or some other common Linux utilities, but I’m not good enough at bash scripting to find a way to do that with minimal memory usage (if the list of URLs is millions long) and with nice formatting, so I just wrote a script.
www.energy.gov
gobbledegook
apple.com
www.gpo.gov
httpstat.us/404
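The description above can be sketched roughly as follows. This is not the gist's actual script, only a minimal illustration under some assumptions: lines without a scheme are treated as `http://` URLs, a `HEAD` request is enough to learn the status, and the helper names (`normalize_url`, `check_status`, `check_file`) are made up for this example. Reading the file line by line and yielding results keeps memory use flat even for a very long list.

```python
import http.client
import urllib.error
import urllib.request


def normalize_url(line):
    """Strip whitespace and add a scheme when the line has none (assumption: http)."""
    url = line.strip()
    if url and "://" not in url:
        url = "http://" + url
    return url


def check_status(url, timeout=10):
    """Return the HTTP status code as an int, or an 'error: ...' string on failure."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx and 5xx responses arrive as exceptions but still carry a status code.
        return error.code
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError) as error:
        # DNS failures, refused connections, malformed URLs, and similar problems.
        return f"error: {error}"


def check_file(path):
    """Lazily yield (url, status) pairs so the whole file is never held in memory."""
    with open(path, encoding="utf-8") as lines:
        for line in lines:
            url = normalize_url(line)
            if url:
                yield url, check_status(url)
```

A caller could loop over `check_file("example_urls.txt")` and print each pair as it arrives. Note that `urlopen` follows redirects, so the reported status is that of the final response.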
@Mr0grog
Mr0grog / webmonitoring_active_urls.json
Last active December 16, 2018 03:15
EDGI Web Monitoring Active URLs
This file has been truncated, but you can view the full file.
[
"http://www.nrel.gov/esif",
"http://www.nrel.gov/sustainable_nrel/rsf.html",
"http://www.nrel.gov/esi/vehicle-grid-integration.html",
"http://www.nrel.gov/esi/esi-news-201606.html",
"http://www.nrel.gov/research/library.html",
"http://www.nrel.gov/transportation/data_resources.html",
"http://www.nrel.gov/security.html",
"http://www.nrel.gov/buildings/sunrel.html",
"http://www.nrel.gov/about/visiting-nrel.html",
@Mr0grog
Mr0grog / custom-error.js
Last active February 21, 2019 19:58
Getting custom error classes right in JavaScript
export default class CustomError extends Error {
/**
* CustomError is a base class that helps set up well considered custom Error
* types for use in libraries.
* @param {string} message An error message
* @param {any} options
* @param {string} options.code Sets the `code` property on the error. If not
* set, it defaults to the class name.
* @param {Function} options.trimStack In V8, trim everything from this
* function and deeper off the stack trace in the error.
@Mr0grog
Mr0grog / web-monitoring-pages-per-domain.csv
Last active October 22, 2019 04:44
Web Monitoring monitored domains (with count of pages per domain)
www.epa.gov 7087
energy.gov 1569
www.globalchange.gov 1209
science.energy.gov 1182
www.nrel.gov 849
www.ferc.gov 638
www.eia.gov 610
www.blm.gov 588
earthdata.nasa.gov 558
arpa-e.energy.gov 541
@Mr0grog
Mr0grog / allterms-count.csv
Created October 23, 2019 19:54
Changed terms across all EDGI-monitored pages
term,page_count
information,5789
may,5767
energy,5099
national,4829
u,4404
page,4400
new,4376
office,4260
use,4057
@Mr0grog
Mr0grog / topterms-count.csv
Created October 23, 2019 19:56
Top changed terms across EDGI-monitored pages
term,page_count
information,5789
may,5767
energy,5099
national,4829
u,4404
page,4400
new,4376
office,4260
use,4057
@Mr0grog
Mr0grog / stdout_stderr_combined.py
Created January 25, 2020 07:39
Experiments in capturing combined output streams from subprocesses in Python.
# Experiments in capturing combined output streams from subprocesses.
#
# It turns out it's kind of hard to get the interleaved results of stdout and
# stderr in Python. However, in a lot of situations where Python is calling
# out to other processes, you probably want to swallow the child process's
# stderr when things go right and print it when things go wrong. Since some
# programs may be outputting results on stdout and warnings and errors on
# stderr, it makes sense that you'd want it all, and all in the order it was
# printed when things go wrong.
#
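One common way to get the behavior described in the comment above is to point the child's stderr at the same pipe as its stdout. The sketch below is an illustration of that single approach and not necessarily what the gist's experiments settled on; `run_quietly` is a hypothetical helper name. It assumes you do not need to tell the two streams apart afterwards, since merging them loses that distinction.

```python
import subprocess
import sys


def run_quietly(command):
    """Run command, capture stdout and stderr interleaved, and print only on failure."""
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # route stderr into the same pipe as stdout
        text=True,
    )
    if result.returncode != 0:
        # Something went wrong: show everything the child printed, in order.
        sys.stderr.write(result.stdout)
    return result.returncode, result.stdout
```

The ordering is only as reliable as the child's own buffering: a child that buffers stdout but not stderr can still emit lines out of order before they ever reach the pipe.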
@Mr0grog
Mr0grog / .env
Created April 13, 2020 05:52
List pages that are errors from web-monitoring-db
export WEB_MONITORING_DB_URL='https://api.monitoring.envirodatagov.org'
export WEB_MONITORING_DB_EMAIL='YOUR EMAIL HERE'
export WEB_MONITORING_DB_PASSWORD='YOUR PASSWORD HERE'
@Mr0grog
Mr0grog / edgi-non-gov-mil-us-seeds-broad.csv
Last active October 15, 2020 20:47
EDGI 2020 EoT seeds from non- .gov/.mil/.us hosts
url
http://mdl-mom5.herokuapp.com/
https://www.instagram.com/cleanetwork/
https://sercc.com/
https://serc.carleton.edu/
https://cires.colorado.edu/
http://geomag.colorado.edu/
https://lasp.colorado.edu/
http://www.cloudsat.cira.colostate.edu/
http://rammb.cira.colostate.edu/
@Mr0grog
Mr0grog / summarize.py
Last active January 26, 2023 18:45
Summarize log files from EDGI Wayback imports
from datetime import timedelta
import dateutil.parser
from pathlib import Path
import re
START_LINE = re.compile(r'^\[([^\]]+)\] Starting Internet Archive Import')
END_LINE = re.compile(r'^\s*Internet Archive import completed at (.+)')
SUMMARY_START = re.compile(r'^\s*Loaded (\d+) CDX records:')
SUMMARY_ITEM = re.compile(r'^\s*(\d+)\s([\s\w\-]+)\s\(')
IMPORT_ERRORS = re.compile(r'^\s*Total:\s*(\d+)\serrors')
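To show what these patterns capture, here is a small, hedged sketch that applies them to individual lines. The regular expressions are copied from the preview above; the `classify_line` helper, the label names, and the sample log lines are invented for illustration and are not taken from real import logs or from the rest of the script.

```python
import re

# Patterns copied from the preview above.
START_LINE = re.compile(r'^\[([^\]]+)\] Starting Internet Archive Import')
END_LINE = re.compile(r'^\s*Internet Archive import completed at (.+)')
SUMMARY_START = re.compile(r'^\s*Loaded (\d+) CDX records:')
SUMMARY_ITEM = re.compile(r'^\s*(\d+)\s([\s\w\-]+)\s\(')
IMPORT_ERRORS = re.compile(r'^\s*Total:\s*(\d+)\serrors')

# Hypothetical labels, checked in this order.
PATTERNS = [
    ("start", START_LINE),
    ("end", END_LINE),
    ("summary_start", SUMMARY_START),
    ("summary_item", SUMMARY_ITEM),
    ("errors", IMPORT_ERRORS),
]


def classify_line(line):
    """Return (label, captured_groups) for the first matching pattern, else None."""
    for label, pattern in PATTERNS:
        match = pattern.match(line)
        if match:
            return label, match.groups()
    return None


# Made-up sample lines shaped to fit the patterns.
sample = [
    "[2023-01-26 18:45:00] Starting Internet Archive Import",
    "  Loaded 120 CDX records:",
    "    100 successes (83.3%)",
    "  Total: 3 errors",
]
results = [classify_line(line) for line in sample]
```

A summarizer built on this would typically accumulate the captured counts between each "start" and "end" line.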