Matt Post mjpost

@mjpost
mjpost / columnize.pl
Last active August 9, 2017 06:26
I used this for turning a sorted list of reviewers for the NAACL 2013 proceedings into 3 columns of sorted reviewers spanning multiple pages. Two arguments control (a) the number of rows on the first page and (b) the number of rows on subsequent pages.
#!/usr/bin/env perl
# Matt Post <[email protected]>
# April 2013
# This script turns a single column of sorted text into LaTeX-formatted multiple columns spanning
# multiple PDF pages.
# Its input is a single column of text on STDIN (each line is a complete entry). Two optional
# arguments specify (a) the number of rows on the first page and (b) the number of rows on the
# remaining pages, with both defaulting to 45. The last page will be adjusted automatically.
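
The layout described in these comments can be sketched in Python. This is only an illustration under stated assumptions, not the Perl script itself: the names `paginate_columns` and `to_latex_rows` are hypothetical, entries are assumed to run down each column before moving to the next, and the last page is assumed to shrink so that its columns stay balanced.

```python
import math

def paginate_columns(entries, first_page_rows=45, other_page_rows=45, ncols=3):
    """Split sorted entries into pages of rows, filling each page column by column."""
    pages = []
    pos = 0
    rows = first_page_rows
    while pos < len(entries):
        remaining = len(entries) - pos
        # Assumption: the last page shrinks so its columns stay balanced.
        page_rows = min(rows, math.ceil(remaining / ncols))
        chunk = entries[pos:pos + page_rows * ncols]
        page = []
        for r in range(page_rows):
            # Entry k of the chunk sits in column k // page_rows, row k % page_rows.
            row = [chunk[c * page_rows + r] if c * page_rows + r < len(chunk) else ""
                   for c in range(ncols)]
            page.append(row)
        pages.append(page)
        pos += len(chunk)
        rows = other_page_rows
    return pages

def to_latex_rows(page):
    """Render one page as the body rows of a LaTeX tabular."""
    return "\n".join(" & ".join(row) + r" \\" for row in page)
```

With ten entries and two rows per page, the first page holds six entries in three columns and the second page holds the remaining four, leaving the third column empty.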
mjpost / mid
Created March 19, 2014 17:34
UNIX has `head` and `tail`, but what about when you need a line number in the middle? `mid` provides just that.
#!/usr/bin/perl
# Returns the requested line number from a file or list of files.
# If the line number is given as i:j or i-j, selects that range.
# If no file is given, we read from STDIN.
my $arg = shift;
($num1,$split,$num2) = split(/([:\-\+])/,$arg);
die usage() unless $arg and (! $split or $num2);
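
The selection logic in the header comments can be sketched in Python as follows. It is a minimal illustration, not a port: `parse_spec` and `mid` are hypothetical names, only the documented `i`, `i:j`, and `i-j` forms are handled, and the `+` separator that the Perl regex also accepts is left out because its meaning is not stated here.

```python
import re

def parse_spec(spec):
    """Parse 'i', 'i:j' or 'i-j' into an inclusive 1-based (start, end) pair."""
    m = re.fullmatch(r"(\d+)(?:[:-](\d+))?", spec)
    if not m:
        raise ValueError(f"bad line spec: {spec!r}")
    start = int(m.group(1))
    end = int(m.group(2)) if m.group(2) else start
    return start, end

def mid(lines, spec):
    """Yield the lines whose 1-based number falls in the requested range."""
    start, end = parse_spec(spec)
    for lineno, line in enumerate(lines, 1):
        if lineno > end:
            break  # past the range, so stop reading
        if lineno >= start:
            yield line
```

Passing `sys.stdin` as `lines` would cover the "no file given" case.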
mjpost / unicode_header.py
Last active September 10, 2015 13:33
Standard Python header
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Python *sucks* at UTF-8 (don't tell me "It's fixed in Python 3"; I don't care, plus no one uses Python 3)
# If you put this at the top of every Python script, however, it gets rid of most of the headaches dealing with STDIN
# and STDOUT (basically, akin to "perl -C31"). I don't know if it's all necessary; I just know that if I put it at
# the top of my scripts, most of the problems go away, and I can stop thinking about it.
import sys
import codecs
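
The preview stops at the imports, so what the header does with `codecs` is not shown. One common approach, sketched here for Python 3 as an assumption rather than as the gist's actual code, is to wrap the byte streams in UTF-8 readers and writers. The helper name `utf8_streams` is hypothetical.

```python
import codecs
import io

def utf8_streams(raw_in, raw_out):
    """Wrap byte streams so reads decode from, and writes encode to, UTF-8."""
    return codecs.getreader("utf-8")(raw_in), codecs.getwriter("utf-8")(raw_out)

# Demonstration on in-memory byte streams. A script would pass
# sys.stdin.buffer and sys.stdout.buffer instead.
raw_in = io.BytesIO("café\n".encode("utf-8"))
raw_out = io.BytesIO()
reader, writer = utf8_streams(raw_in, raw_out)
for line in reader:
    writer.write(line.upper())
```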
mjpost / regenerate_ics.py
Created August 20, 2015 13:23
Rebuilds Apple Calendar *.ics files so they can be safely reimported
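
The rebuild this gist describes (drop the `X-` keys, assign a new UUID) can be sketched on a single calendar string. This is an illustration under assumptions, not the gist's code: `rebuild_ics` is a hypothetical name, the UUID is assumed to belong in the `UID` property, and every `UID` line is given its own fresh value.

```python
import uuid

def rebuild_ics(text):
    """Drop X- properties and assign a fresh UID in an iCalendar document."""
    out = []
    drop_continuation = False  # True while inside a dropped or replaced property
    for line in text.splitlines():
        if line.startswith((" ", "\t")):
            # Folded continuation of the previous property.
            if not drop_continuation:
                out.append(line)
            continue
        upper = line.upper()
        drop_continuation = upper.startswith(("X-", "UID:"))
        if upper.startswith("X-"):
            continue
        if upper.startswith("UID:"):
            line = "UID:" + str(uuid.uuid4()).upper()
        out.append(line)
    return "\r\n".join(out) + "\r\n"
```

Applying it to each `*.ics` file in a directory would only need a loop that reads the file, calls `rebuild_ics`, and writes the result back.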
#!/usr/bin/env python
"""
Looks at all the *.ics files in the current directory, removes the X- keys,
and generates a new UUID. This is used for restoring an accidentally-deleted
calendar in Apple's Calendar program; it is a rewrite of the node.js version
that is linked to from here:
http://fokkezb.nl/2015/01/13/how-to-restore-a-deleted-icloud-calendar/
"""
mjpost / parallel.sh
Last active October 27, 2022 13:19
Command line use of GNU parallel
# I can never remember the syntax for GNU parallel
## Treat each line of STDIN as a command to run, with at most j (here 10) running in parallel
cat commands.txt | parallel -j 10
## Download a long list of files in parallel
cat files.txt | parallel -j 10 wget -q {}
## Start 10 parallel instances of COMMAND with FLAGS. Feed STDIN to these commands in blocks of about 10 MB (--block-size 10m). Assemble the outputs in order (-k).
cat large_input.txt | parallel -j 10 --pipe -k --block-size 10m COMMAND FLAGS > output.txt
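
For comparison, the first recipe (a pool of commands with a cap on concurrency) can be approximated in Python. This is a rough sketch of the behavior, not of how GNU parallel works internally. The name `run_pool` is hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_pool(commands, jobs=10):
    """Run shell command lines, at most `jobs` at a time; return stdout in input order."""
    def run(cmd):
        return subprocess.run(cmd, shell=True, capture_output=True, text=True).stdout
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # map() yields results in submission order, similar in spirit to -k.
        return list(pool.map(run, commands))
```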
mjpost / trim_fairseq_model.py
Created May 15, 2020 14:37
Removes ADAM optimizer state from fairseq models, greatly reducing their size
#!/usr/bin/env python3
"""
This is code to take a trained Fairseq model and discard the ADAM optimizer state,
which is not needed at test time. It can reduce a model's size by ~70%.
Original author: Brian Thompson
"""
from fairseq import checkpoint_utils
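
The core of the idea can be sketched on a plain dictionary standing in for a loaded checkpoint. This is an illustration under assumptions, not the gist's code: `strip_optimizer_state` is a hypothetical name, the optimizer state is assumed to live under the `last_optimizer_state` key, and the key is set to `None` rather than deleted in case the loading code expects it to be present.

```python
def strip_optimizer_state(state):
    """Return a copy of a checkpoint dict with its optimizer state cleared."""
    trimmed = dict(state)  # shallow copy; model tensors are shared, not duplicated
    # Assumption: the optimizer state is stored under "last_optimizer_state".
    trimmed["last_optimizer_state"] = None
    return trimmed

# Hypothetical usage with PyTorch (file names are placeholders, not run here):
#   state = torch.load("checkpoint.pt", map_location="cpu")
#   torch.save(strip_optimizer_state(state), "checkpoint.trimmed.pt")
```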
#!/usr/bin/env python3
"""
Takes a list of collection IDs as arguments, and outputs a TSV
(name, Anthology ID, paper title) containing every person who
is the first author of a paper and has no other papers in the
Anthology.
Place in acl-anthology/bin and run
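
The selection rule in this docstring can be sketched on plain data. Everything about the data shape is an assumption: `papers` is a hypothetical list of dicts standing in for the Anthology, and people are matched by name string, which the real Anthology data may handle differently.

```python
from collections import Counter

def single_paper_first_authors(papers, collection_ids):
    """Yield (name, paper_id, title) for first authors with exactly one paper overall.

    `papers` is a hypothetical list of dicts with keys "id", "collection",
    "title" and "authors" (an ordered list of names).
    """
    # Count every paper each person appears on, in any author position.
    paper_counts = Counter(name for p in papers for name in p["authors"])
    wanted = set(collection_ids)
    for p in papers:
        if p["collection"] in wanted and p["authors"]:
            first = p["authors"][0]
            if paper_counts[first] == 1:
                yield first, p["id"], p["title"]

def to_tsv(rows):
    """Join rows of strings into tab-separated lines."""
    return "\n".join("\t".join(row) for row in rows)
```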
#!/usr/bin/env python3
import sys
import sacremoses
def main(args):
    """Tokenizes, preserving tabs"""
    mt = sacremoses.MosesTokenizer(lang=args.lang)
    def tok(s):
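
The preview cuts off before the body of `tok`, so the following is only a guess at the general technique: tokenize each tab-separated field on its own and rejoin with tabs. `tokenize_fields` and `simple_tokenize` are hypothetical names, and the regex tokenizer is a stand-in so the sketch runs without sacremoses installed.

```python
import re

def tokenize_fields(line, tokenize):
    """Tokenize each tab-separated field on its own so the tabs survive."""
    fields = line.rstrip("\n").split("\t")
    return "\t".join(tokenize(field) for field in fields)

def simple_tokenize(text):
    """Stand-in tokenizer: separate words from punctuation marks."""
    return " ".join(re.findall(r"\w+|[^\w\s]", text))

# With sacremoses installed, the Moses tokenizer could be plugged in instead:
#   mt = sacremoses.MosesTokenizer(lang="en")
#   tokenize_fields(line, lambda s: mt.tokenize(s, return_str=True))
```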
#!/usr/bin/env python3
import sys
from sacremoses.normalize import MosesPunctNormalizer
def main(args):
    normalizer = MosesPunctNormalizer(lang=args.lang, penn=args.penn)
    for line in sys.stdin:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
#
# Copyright 2019--2021 Matt Post <[email protected]>
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0