Zach Young zacharysyoung

OMG, I was feeling so low about an hour ago:

I was looking at a super-slick shell script written by a pro (I'll call this person "Ace") using tools I hadn't seen before
The structure of the NC file was even more daunting at first blush than the PRT files—the PRT was massive, but it was flat
A basic idea like the values being the same between the PRT and NC files was eluding me

And I was about to email you this message:

So, here's my analysis of what's going on...

This shit is so far over my head!

Import rows of data into individual PDFs

How to get data like this...

Name	Age	Street Address	City	State	Zip
Tami	23	123 Main St	Anytown	Anystate	11111
John	54	456 Second Ave	Anytown	Anystate	22222
Troy	39	789 Last Cir	Anytown	Anystate	99999

Validate an Address in Airtable

Validate an address with SmartyStreets from a script in Airtable. You can even use your "free 250 lookups per month"!

SmartyStreet Configuration

Fill in SS_KEY and SS_LICENSE with your SmartyStreets info.

name	POS	id	Ref	ALT	Frequency	CDS	Start	End	sequence_cc
chrM	41	.	C	T	0.002498	CDS	3307	4262	-3265
chrM	42	rs377245343	T	TC	0.001562	CDS	3307	4262	-3264
chrM	55	.	TA	T	0.00406	CDS	4470	5511	-4414
chrM	55	.	T	C	0.001874	CDS	4470	5511	-4414

Two ways to get all (sub)pages

You load a "base" URL, and if there are more than 100 results, you load subsequent pages until you have ALL results.

You can accomplish this a number of ways, but two stick out for me:

with recursion: scrape_it() sees there are more pages to scrape and calls itself with the next page
with a while loop: you assume you might need multiple fetches and do all the work in a loop that continues to run as long as there is a next page

For either method, I mocked up "sample" pages. I hope they're not too abstract, and that you can see there's some data you really care about, reports, and some metadata that tells you there are more reports to be had, next_url:
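Here is a minimal sketch of both approaches in Python. The `PAGES` dictionary, the `fetch()` helper, and the name `scrape_all()` are assumptions made up for illustration; only `scrape_it()`, `reports`, and `next_url` come from the description above. A real scraper would replace `fetch()` with an HTTP request.

```python
# Hypothetical mock pages: each one carries some reports plus a next_url
# that is None on the last page.
PAGES = {
    "/reports?page=1": {"reports": ["a", "b"], "next_url": "/reports?page=2"},
    "/reports?page=2": {"reports": ["c", "d"], "next_url": "/reports?page=3"},
    "/reports?page=3": {"reports": ["e"], "next_url": None},
}


def fetch(url):
    """Stand-in for a real HTTP request; looks the page up in PAGES."""
    return PAGES[url]


def scrape_it(url):
    """Recursion: collect this page, then call ourselves for the next one."""
    page = fetch(url)
    reports = list(page["reports"])
    if page["next_url"]:
        reports += scrape_it(page["next_url"])
    return reports


def scrape_all(url):
    """While loop: keep fetching as long as there is a next page."""
    reports = []
    while url:
        page = fetch(url)
        reports.extend(page["reports"])
        url = page["next_url"]
    return reports


if __name__ == "__main__":
    print(scrape_it("/reports?page=1"))
    print(scrape_all("/reports?page=1"))
```

Both functions return the same list. The loop version avoids Python's recursion limit, which could matter if there are very many pages.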

	#!/bin/bash

	# scan all files (*.PRT) and filter each file by the text "INSTANT CH4"
	grep 'INSTANT CH4' *.PRT |
	# there are two different datasets per file with your variables; this takes the
	# 'INSTANT CH4' line from the first dataset
	awk 'NR % 2 == 1 { print }' |
	# cut out everything but the filename/year (first 7 characters) and the column
	# for the data point you care about (characters 67 to 76)
	cut -c 1-7,67-76 |
	# `cut` leaves spaces between the year and data columns; `sed` replaces
	# the first run of spaces with a comma for CSV
	sed -E 's/ +/,/' > INSTANT_CH4.csv # save the output to a CSV file
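For anyone more comfortable in Python, the same four steps (grep, awk, cut, sed) can be sketched roughly as below. The function name `extract_instant_ch4` is made up for illustration, and it assumes the input lines already look like `grep` output across several files, with the year in the first 7 characters and the value in characters 67 to 76, as in the script above.

```python
import re


def extract_instant_ch4(grep_lines):
    """Turn grep-style lines into "year,value" CSV rows.

    Hypothetical helper: mirrors the shell pipeline step by step.
    """
    # grep 'INSTANT CH4': keep only lines containing the marker text
    matches = [line for line in grep_lines if "INSTANT CH4" in line]
    rows = []
    # awk 'NR % 2 == 1': keep the 1st, 3rd, 5th, ... matching line,
    # i.e. the first of the two datasets in each file
    for line in matches[0::2]:
        # cut -c 1-7,67-76: characters 1-7 and 67-76 (1-based, inclusive)
        picked = line[0:7] + line[66:76]
        # sed -E 's/ +/,/': replace only the first run of spaces
        rows.append(re.sub(r" +", ",", picked, count=1))
    return rows


if __name__ == "__main__":
    import fileinput

    # Usage sketch: grep 'INSTANT CH4' *.PRT | python this_script.py
    for row in extract_instant_ch4(l.rstrip("\n") for l in fileinput.input()):
        print(row)
```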

	/*
	You can run this in Chrome by:
	1. going to View > Developer > Developer Tools
	2. finding the "Console" tab
	3. copying all the stuff below in one chunk and pasting it into the console
	4. hitting <Enter>

	After that, you can modify a line by copying it, pasting it onto a new line, and hitting <Enter> to re-run that line
	*/

	// Available variables:
	// - Machine
	// - interpret
	// - assign
	// - send
	// - sendParent
	// - spawn
	// - raise
	// - actions
	// - XState (all XState exports)

	const fetchMachine = Machine({
	  id: 'distinct & valid',
	  initial: 'new',
	  states: {
	    'new': {
	      on: {
	        FOUND_DISTINCT: 'distinct',
	        FOUND_NOT_DISTINCT: 'no'
	      }
	    },

	import argparse
	from collections import defaultdict
	import csv


	class Actor(object):
	    """An actor with bounded rationality.

	    The methods on this class such as u_success, u_failure, eu_challenge are
	    meant to be calculated from the actor's perspective, which in practice