Processing 30GB of text

Working out how to read in 30GB of text that has no newline record separator (at least, there was no newline in this short example):

28807644'~'0'~'Maun FCU'~'US#@#@#28855353'~'0'~'WNB Holdings LLC'~'US#@#@#29212330'~'0'~'Idaho First Bank'~'US#@#@#29278777'~'0'~'Republic Bank of Arizona'~'US#@#@#29633181'~'0'~'Friendly Hills Bank'~'US#@#@#29760145'~'0'~'The Freedom Bank of Virginia'~'US#@#@#100504846'~'0'~'Community First Fund Federal Credit Union'~'US#@#@#

To make this right:

  • replace field separator '~' with ,
  • replace record separator #@#@# with \n
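The two replacements above can be done in a streaming fashion so the 30GB never has to fit in memory. The following is only a minimal sketch, assuming the input is plain text that can be read in chunks; the names `iter_records`, `convert`, and `chunk_size` are hypothetical. It writes through `csv.writer` rather than substituting a bare comma, so a field that itself contains a comma gets quoted.

```python
import csv
import io

FIELD_SEP = "'~'"     # field separator from the sample data
RECORD_SEP = "#@#@#"  # record separator from the sample data


def iter_records(src, chunk_size=1 << 20):
    """Yield one record at a time from a stream that has no newlines."""
    buf = ""
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        # Everything before the last separator is complete; the tail may
        # hold a partial record (or a partial separator), so keep it.
        *complete, buf = buf.split(RECORD_SEP)
        yield from complete
    if buf.strip():
        yield buf  # final record without a trailing separator


def convert(src, dst, chunk_size=1 << 20):
    """Write each record to dst as one CSV row."""
    writer = csv.writer(dst)
    for record in iter_records(src, chunk_size):
        writer.writerow(record.split(FIELD_SEP))


# Small demo on two records from the sample; the tiny chunk size forces
# separators to straddle chunk boundaries.
sample = "28807644'~'0'~'Maun FCU'~'US#@#@#28855353'~'0'~'WNB Holdings LLC'~'US#@#@#"
out = io.StringIO()
convert(io.StringIO(sample), out, chunk_size=8)
print(out.getvalue())
```

For real files, both the input and the output would presumably be opened with `newline=""` so that the `csv` module controls the line endings.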
import re
# Replace embedded escaped unicode with their actual unicode values:
#
# `\Not wanted backslashes\ unicode: \u2019\u2026`
#
# to:
#
# `\Not wanted backslashes\ unicode: ’…`
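One way to do the replacement described in the comment above is a regular expression that matches only `\uXXXX` sequences and leaves every other backslash alone. This is a sketch, not necessarily the approach the original script takes; the function name `unescape_unicode` is hypothetical, and surrogate pairs are not combined.

```python
import re

# A literal backslash, the letter u, then exactly four hex digits.
_ESCAPED = re.compile(r"\\u([0-9a-fA-F]{4})")


def unescape_unicode(text):
    """Replace each escaped \\uXXXX sequence with the character it names."""
    return _ESCAPED.sub(lambda m: chr(int(m.group(1), 16)), text)


print(unescape_unicode(r"\Not wanted backslashes\ unicode: \u2019\u2026"))
```

A targeted pattern is used here because a general-purpose decoder such as the `unicode_escape` codec would also try to interpret the other backslashes in the text.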
@zacharysyoung
zacharysyoung / Better_CSV_doc.md
Last active February 4, 2022 16:39
CSVs: reading, processing, writing w/Python

Welcome to CSV w/Python!

CSV files are a good way to share tables of data, and Python's CSV module makes working with them straightforward.

This guide will quickly move you through all the concepts you need to fill in this basic CSV program:

# Open the CSV
# Read all the data
# Extract the header
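As a rough illustration of where the skeleton above is heading, the three steps might be filled in as follows. This is a sketch under assumptions, not the guide's own solution: the helper names `split_header` and `load_csv` are hypothetical, and the file is assumed to be non-empty with a header in its first row.

```python
import csv


def split_header(lines):
    """Return (header, data) from an iterable of CSV lines, e.g. an open file."""
    # Read all the data
    rows = list(csv.reader(lines))
    # Extract the header (assumes at least one row exists)
    header, data = rows[0], rows[1:]
    return header, data


def load_csv(path):
    # Open the CSV; newline="" lets the csv module handle line endings
    with open(path, newline="") as f:
        return split_header(f)
```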
#!/usr/bin/env python3
import csv
import random
# Used to characterize answer for https://stackoverflow.com/questions/75578992
with open("input.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["RowNum", "ID"])
    for i in range(20_000_000):

Merge hundreds of CSVs, each with millions of rows

500 CSVs, each with over 1 million rows, need to be merged into one CSV.

  • each CSV represents a sensor that recorded a value and the timestamp of the recording, with millions of timestamp/value rows
  • all CSVs have the same number of rows

How can we "merge" the CSVs such that each sensor's value-column is added to the merged CSV (500 value columns), and the timestamps for each row for each sensor are averaged into a single column?
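One possible approach is to read all the files in lockstep, one row from each at a time, so memory use stays flat no matter how many rows there are. The sketch below makes several assumptions that the description does not confirm: each file has a header row followed by `timestamp,value` rows, and timestamps are plain numbers that can be averaged directly. The names `merged_rows`, `merge_files`, and the `valueN` column labels are hypothetical.

```python
import csv
from contextlib import ExitStack
from statistics import fmean


def merged_rows(sources):
    """Yield merged rows from iterables of CSV lines (one per sensor)."""
    readers = [csv.reader(src) for src in sources]
    for reader in readers:
        next(reader)  # skip each sensor's header row
    yield ["timestamp"] + [f"value{i}" for i in range(1, len(readers) + 1)]
    # zip() advances every reader by one row per step
    for rows in zip(*readers):
        avg_ts = fmean(float(row[0]) for row in rows)  # assumes numeric timestamps
        yield [avg_ts] + [row[1] for row in rows]


def merge_files(paths, out_path):
    """Open every input at once and stream the merged rows to out_path."""
    with ExitStack() as stack:
        files = [stack.enter_context(open(p, newline="")) for p in paths]
        out = stack.enter_context(open(out_path, "w", newline=""))
        csv.writer(out).writerows(merged_rows(files))
```

Keeping 500 files open at the same time may exceed the default open-file limit on some systems, in which case the limit would need raising or the merge done in batches.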

sensor1.csv
#!/bin/sh
# Join Part-A and Part-B
gocsv join -c 'label' -outer file1.csv file2.csv > joined.csv
echo 'Joined'
gocsv view joined.csv
# Rename the two identically named 'label' columns to unique names
gocsv rename -c 1 -names 'Label_A' joined.csv | gocsv rename -c 3 -names 'Label_B' > renamed.csv
echo 'Renamed key cols'
#!/usr/bin/env python3
# https://stackoverflow.com/a/71784820/246801
# Misses "interior" extraneous whitespaces
block = ["Line 1\n", " Line 2\n", "Line 3\n"]
list_comp = [x.strip() for x in block]
func_chain = "".join(block).strip().split("\n")
zacharysyoung / make_editable.js
Last active May 11, 2022 17:34
Make all PDF form fields editable, in Acrobat.
/* globals getField */
// From https://answers.acrobatusers.com/Script-change-fields-read-specific-fields-q296813.aspx
for(var i = 0; i < this.numFields; i++)
{
    var fieldName = this.getNthFieldName(i);
    getField(fieldName).readonly = false;
    getField(fieldName).locked = false;
}
console.println('\nDone');
zacharysyoung / README.md
Last active July 5, 2023 17:27
Start an HTTP server and listen for a response, but only for so long

Shutting down HTTP servers

Shutdown with WaitGroup and Context

wg_context.go

From

zacharysyoung / main.go
Created June 16, 2022 10:47
Go VS Python, silly but real metrics
package main
func main() {
	for i := 0; i < N; i++ {
	}
}