Ed Summers edsu

@edsu
edsu / 2023-12-20.txt
Last active December 20, 2023 16:33
A count of albums in the lists at https://aoty.hubmed.org/ for 2023
[13] Sufjan Stevens - Javelin [Clash, The Fader, PopMatters, Pitchfork, The Line of Best Fit, Consequence, Rolling Stone, Exclaim, Paste, Mojo, Uncut, Piccadilly Records, Rough Trade]
[12] Kelela - Raven [Clash, The Fader, The Forty-Five, PopMatters, Pitchfork, Crack, The Line of Best Fit, Consequence, Rolling Stone, Exclaim, Paste, The Quietus]
[12] Wednesday - Rat Saw God [Clash, The Fader, The Forty-Five, PopMatters, Pitchfork, The Line of Best Fit, Consequence, Rolling Stone, Exclaim, Paste, Uncut, Rough Trade]
[11] Noname - Sundial [Clash, The Fader, The Forty-Five, The Wire, PopMatters, Pitchfork, Crack, The Line of Best Fit, Rolling Stone, Paste, The Quietus]
[9] Mitski - The Land Is Inhospitable and So Are We [Clash, The Fader, PopMatters, Pitchfork, The Line of Best Fit, Consequence, Rolling Stone, Exclaim, Mojo]
[9] Lankum - False Lankum [Clash, Concrete Islands, Crack, The Line of Best Fit, Fast 'n' Bulbous, Louder Than War, Mojo, Uncut, The Quietus]
[8] Amaarae - Fountain Baby [Clash, The Fader, T
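A tally like the one above could be reproduced with a few lines of Python. This is a hypothetical sketch — the input structure and names are illustrative, not taken from the gist itself:

```python
# Count how many year-end lists mention each album. The mapping of
# publication -> album titles is an illustrative stand-in for the
# scraped lists at aoty.hubmed.org.
from collections import Counter

def tally(lists):
    counts = Counter()
    for publication, albums in lists.items():
        # use a set so one list can't count the same album twice
        counts.update(set(albums))
    return counts.most_common()
```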
@edsu
edsu / json_shapes.py
Last active November 17, 2023 23:22
Feed in some JSONL and get a report of the patterns present in the data.
#!/usr/bin/env python3
import csv
import json
from collections import OrderedDict
from collections import Counter

def trace(data, shape=None):
    if isinstance(data, dict):
        new_dict = OrderedDict()
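The preview cuts off before the interesting part. A minimal sketch of the same idea — a guess at the approach, not the gist's actual code, and it drops the `shape` parameter for brevity:

```python
import json
from collections import Counter

def shape(data):
    # reduce a JSON value to its "shape": dicts keep their keys,
    # every leaf collapses to its type name
    if isinstance(data, dict):
        return {k: shape(v) for k, v in sorted(data.items())}
    if isinstance(data, list):
        return [shape(v) for v in data]
    return type(data).__name__

def report(lines):
    # count how often each shape occurs across a JSONL stream
    counts = Counter(json.dumps(shape(json.loads(line))) for line in lines)
    return counts.most_common()
```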
@edsu
edsu / nytimes-gptbot.sh
Last active September 7, 2023 19:14
Uses the Wayback Machine to show (approximately) when the New York Times started telling OpenAI to stop scraping them.
#!/bin/bash
#
# Use the Internet Archive Wayback Machine to demonstrate roughly when the
# NYTimes started blocking GPTBot.
#
# See: https://www.theverge.com/2023/8/21/23840705/new-york-times-openai-web-crawler-ai-gpt
#
wget -q -O robots-20230817.txt https://web.archive.org/web/20230817012138id_/https://www.nytimes.com/robots.txt
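The wget line above grabs a single dated snapshot. A hedged sketch of how the comparison might continue, assuming several snapshots have been saved with the same robots-YYYYMMDD.txt naming pattern:

```shell
#!/bin/bash
# After fetching a few dated snapshots as robots-YYYYMMDD.txt, grep each
# one for GPTBot to bracket when the block first appears.
for f in robots-*.txt; do
  if grep -q "GPTBot" "$f"; then
    echo "$f: blocks GPTBot"
  else
    echo "$f: no GPTBot rule"
  fi
done
```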
@edsu
edsu / example.md
Last active August 24, 2023 14:31

A history like "Org A split off of Org B, Org B split into Org C & Org D, Org A and Org D merged into Org E"

can be turned into Mermaid notation:

graph TD;
  B --> A;
  B --> C;
  B --> D;
  A --> E;
This file has been truncated.
collection: fatal-encounters
generateWACZ: true
workers: 4
screencastPort: 9037
seeds:
  - url: https://fatalencounters.org/
    scopeType: prefix
  - url: https://www.wsoctv.com/news/1-person-dead-after-attempting-escape-police-troopers-say/QXA244QPUZGJ5GAGRADGDWBAEU/
    scopeType: page
  - url: https://www.wtok.com/2022/01/01/officer-involved-shooting/
@edsu
edsu / check.py
Created August 3, 2023 14:51
Check a specific WARC file that is being discussed in IIPC Slack
#!/usr/bin/env python
from warcio.archiveiterator import ArchiveIterator

with open('archive/rec-20230722210008512613-81a34b41ee13.warc.gz', 'rb') as stream:
    for i, record in enumerate(ArchiveIterator(stream)):
        print(i, record.rec_headers.get_header('WARC-Target-URI'))
        if record.rec_type == 'response':
            content = record.content_stream().read()
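One check the loop above might go on to perform — this is a hypothetical continuation, not the gist's actual code — is comparing the declared HTTP Content-Length against the payload actually read, to spot truncated response records:

```python
# Hypothetical helper for the loop in check.py: flag records whose
# declared Content-Length disagrees with the payload that was read.
def payload_truncated(declared_length, payload):
    """declared_length is the Content-Length header value (str or None)."""
    if declared_length is None:
        return False
    return int(declared_length) != len(payload)

# inside the record loop, something like:
#   declared = record.http_headers.get_header('Content-Length')
#   if payload_truncated(declared, content):
#       print('mismatch:', record.rec_headers.get_header('WARC-Target-URI'))
```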
@edsu
edsu / writer.py
Created July 26, 2023 16:01
A little example of writing files as resource records to a WARC file.
from warcio.warcwriter import WARCWriter

with open('test.warc.gz', 'wb') as output:
    writer = WARCWriter(output, gzip=True)

    # write some metadata for the WARC as an info record
    rec = writer.create_warcinfo_record('test.warc.gz', {
        'software': 'warcio',
        'description': 'An example of packaging up two images in a WARC'
    })
@edsu
edsu / warc2mbox.py
Last active July 15, 2023 19:07
Convert Yahoo Groups WARC archive files to MBOX files: see https://archive.org/search?query=subject%3A%22yahoo+groups%22
#!/usr/bin/env python3
# run like this:
#
# $ python3 warc2mbox.py yahoo-groups-2016-03-20T12:45:19Z-nyzp9w.warc.gz
#
# and it will generate an mbox file for each Yahoo Group:
#
# $ ls -l mboxes
# -rw-r--r-- 1 edsummers staff 12522488 Jul 15 14:14 amicigranata.mbox
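One piece such a converter needs is deriving the group name from each message's URL, so that every group gets its own mbox file. The URL layout below is an assumption for illustration, not taken from the gist itself:

```python
#!/usr/bin/env python3
# Hypothetical helper: pull the Yahoo Group name out of an API message
# URL (assumed to contain a /groups/<name>/ path segment).
import re

def group_name(uri):
    m = re.search(r'/groups/([^/]+)/', uri)
    return m.group(1) if m else None
```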
@edsu
edsu / swap_check.py
Last active July 13, 2023 16:14
Reads a text file of URLs and writes out a CSV report of whether the URL is in swap.stanford.edu
#!/usr/bin/env python3
import csv
import sys
import json
import time
import requests

def get_snapshots(url):
    url = f"https://swap.stanford.edu/was/cdx?url={url}&output=json"
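The preview cuts off before the CDX response is interpreted. A sketch of how the result might be turned into a yes/no answer — assuming the endpoint returns JSON rows with a header row first, which is the common CDX-server convention but an assumption here:

```python
# Hypothetical helper for swap_check.py: decide whether a CDX JSON
# response lists at least one capture (first row assumed to be the
# field-name header).
import json

def has_snapshot(cdx_json_text):
    rows = json.loads(cdx_json_text)
    return len(rows) > 1
```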