@jasonrig
jasonrig / GoogleCloudStorage-wiki.en.vec.tsv
Created April 27, 2018 15:32
Upload wiki.en.vec to Google Cloud Storage
TsvHttpData-1.0
https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.en.vec	6597238061	RBesTphc+YsGkOD1gBEqaA==
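Each row of a TsvHttpData-1.0 URL list holds a URL, the object size in bytes, and the base64-encoded MD5 digest of the content, separated by tabs. The following is a minimal sketch of how such a row could be computed from a local copy of a file. The helper name `tsv_row` and its parameters are hypothetical and not part of the gist.

```python
import base64
import hashlib
import os


def tsv_row(url, path, chunk_size=1 << 20):
    """Build one URL-list row: URL, size in bytes, base64-encoded MD5."""
    md5 = hashlib.md5()
    # Read in chunks so that multi-gigabyte files do not need to fit in memory.
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            md5.update(chunk)
    # The list expects the raw 16-byte digest in base64, not the hex string.
    digest = base64.b64encode(md5.digest()).decode("ascii")
    return "\t".join([url, str(os.path.getsize(path)), digest])
```

Called with the S3 URL above and a local download of `wiki.en.vec`, this should reproduce the row shown, assuming the local copy is byte-identical to the hosted file.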
jasonrig / spark-2.4.3-hadoop2.7.tcl
Created August 9, 2019 03:45
Example modulefile for Spark on HPC clusters
#%Module1.0
## Spark 2.4.3 modulefile
##
proc ModulesHelp { } {
    puts stderr "Provides the Spark environment for large-scale data processing"
    puts stderr "This version has been prebuilt with Scala 2.11."
}
module-whatis "Apache Spark™ is a unified analytics engine for large-scale data processing."
jasonrig / run_spark_cluster.sh
Created August 9, 2019 03:47
Example SLURM job script to start a Spark cluster
#!/bin/bash
#SBATCH --job-name spark-cluster
#SBATCH --account=qh82
#SBATCH --time=02:00:00
# --- Master resources ---
#SBATCH --nodes=1
#SBATCH --mem-per-cpu=1G
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=1
# --- Worker resources ---
jasonrig / remac.sh
Created November 9, 2019 03:34
A shell script to toggle NIC MAC addresses between the factory default and a new random one
#!/bin/bash
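# Require exactly one argument: the network device name
if [ "$#" -ne 1 ]; then
    echo "Network device must be specified"
    exit 1
fi
DEVICE="$1"
# Read the device's current MAC address from the ifconfig output
OLD_MAC=$(ifconfig | grep "$DEVICE:" -A3 | grep ether | head -1 | cut -d ' ' -f 2)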
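The description mentions switching to a new random MAC address, but the preview stops before that step, so how the script generates one is not shown. One common approach, sketched here in Python as an assumption rather than as the gist's method, is to draw six random bytes, then set the locally administered bit and clear the multicast bit of the first octet. `random_mac` is a hypothetical helper.

```python
import random


def random_mac(rng=None):
    """Return a random unicast, locally administered MAC address."""
    if rng is None:
        # SystemRandom draws from the operating system's entropy source.
        rng = random.SystemRandom()
    octets = [rng.randrange(256) for _ in range(6)]
    # Set bit 1 (locally administered) and clear bit 0 (multicast).
    octets[0] = (octets[0] | 0x02) & 0xFE
    return ":".join(f"{octet:02x}" for octet in octets)
```

Passing a seeded `random.Random` instance makes the result reproducible, which can help when testing.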
jasonrig / parse_pubmed.py
Created March 21, 2023 12:21
Verify and extract abstracts from PubMed
import csv
import glob
import gzip
import hashlib
import xml.sax

import tqdm

file_list = glob.glob("./pubmed/*.xml.gz")
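The preview ends before the parsing logic, so the gist's own handler is not shown. As an illustration of how abstracts can be pulled out with `xml.sax`, here is a minimal sketch. The class name `AbstractHandler` and its attributes are hypothetical; the element names `PubmedArticle`, `PMID`, and `AbstractText` are those used in PubMed's XML.

```python
import xml.sax


class AbstractHandler(xml.sax.ContentHandler):
    """Collect (PMID, abstract text) pairs from PubMed XML."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._pmid = None
        self._abstract = []
        self._buffer = None  # list of text pieces while inside a tracked element

    def startElement(self, name, attrs):
        if name == "PubmedArticle":
            self._pmid, self._abstract = None, []
        elif name in ("PMID", "AbstractText"):
            self._buffer = []

    def characters(self, content):
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "PMID":
            # Keep only the first PMID of an article; later ones refer to other records.
            if self._pmid is None and self._buffer is not None:
                self._pmid = "".join(self._buffer).strip()
            self._buffer = None
        elif name == "AbstractText":
            if self._buffer is not None:
                self._abstract.append("".join(self._buffer).strip())
            self._buffer = None
        elif name == "PubmedArticle":
            self.records.append((self._pmid, " ".join(self._abstract)))
```

A handler like this could be fed each compressed file with `xml.sax.parse(gzip.open(path, "rb"), handler)`, after which `handler.records` holds one tuple per article.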