I hereby claim:
- I am mlin on github.
- I am mlin (https://keybase.io/mlin) on keybase.
- I have a public key ASCO-NadYMiwqGxb9_4cD-VFjMbVqrk7ors-n9seZl_A5wo
To claim this, I am signing this object:

#!/usr/bin/env python3
import sys
import time
import docker
import multiprocessing
from argparse import ArgumentParser, REMAINDER

def swarmsub(image, command=None, cpu=1, mounts=None):
    client = docker.from_env()
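
The preview cuts off at the top of swarmsub. For orientation, here is a minimal sketch (not the gist's actual code, and reusing the imports above) of how a function like this could submit the command as a one-shot Docker Swarm service with the docker SDK; the CPU-reservation math and polling loop are illustrative assumptions:

# Illustrative sketch only, not the original gist: run `command` in `image` as a
# one-shot Swarm service reserving `cpu` CPUs, then wait for its task to finish.
def swarmsub_sketch(image, command=None, cpu=1, mounts=None):
    client = docker.from_env()
    service = client.services.create(
        image,
        command=command,
        mounts=mounts or [],  # e.g. ["/data:/data:ro"]
        resources=docker.types.Resources(cpu_reservation=int(cpu * 1e9)),  # nanoCPUs
        restart_policy=docker.types.RestartPolicy(condition="none"),
    )
    try:
        while True:
            tasks = service.tasks()
            if tasks and tasks[0]["Status"]["State"] in ("complete", "failed", "rejected"):
                return tasks[0]["Status"]["State"] == "complete"
            time.sleep(1)
    finally:
        service.remove()
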
version 1.0

task split_vcf_for_spark {
    # Quickly split a large .vcf.gz file into a specified number of compressed partitions.
    #
    # Motivation: calling SparkContext.textFile on a single large vcf.gz can be painfully slow,
    # because it's decompressed and parsed in ~1 thread. Use this to first split it up (with a
    # faster multithreaded pipeline); then tell Spark to parallel-load the data using textFile on a
    # glob pattern.
    #
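
A minimal PySpark sketch of the second half of that recipe, assuming the split files land somewhere like part-*.vcf.gz (the path, app name, and partition naming here are placeholders):

# Illustrative only: once the VCF has been split into many .vcf.gz partitions,
# textFile on a glob lets Spark decompress/parse them in parallel, one task per file.
from pyspark import SparkContext

sc = SparkContext(appName="load_split_vcf")
lines = sc.textFile("hdfs:///tmp/my_cohort_split/part-*.vcf.gz")
records = lines.filter(lambda line: not line.startswith("#"))  # skip VCF header lines
print(records.count())
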
#!/bin/bash
# Running inside a docker container, periodically read the container's CPU/memory usage counters
# and log them to standard error. Fields:
#
#   cpu_pct       average user %CPU usage over the most recent period
#   mem_MiB       container's current RSS (excludes file cache), in mebibytes (= 2**20 bytes)
#   cpu_total_s   container's user CPU time consumption since this script started, in seconds
#   elapsed_s     wall time elapsed since this script started, in seconds
#
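
A rough Python rendering of the same sampling loop, assuming the cgroup v1 accounting files as Docker typically mounts them; the paths, the 60-second period, and the USER_HZ handling are assumptions to verify on the host in question:

# Illustrative sketch: sample the container's user CPU time and RSS from cgroup v1
# and log the same four fields to standard error once per period.
import os
import sys
import time

HZ = os.sysconf("SC_CLK_TCK")  # cpuacct.stat reports USER_HZ ticks (usually 100/s)
PERIOD = 60  # seconds between samples

def read_stat(path):
    with open(path) as f:
        return dict(line.split() for line in f)

def user_cpu_s():
    return int(read_stat("/sys/fs/cgroup/cpuacct/cpuacct.stat")["user"]) / HZ

def rss_mib():
    return int(read_stat("/sys/fs/cgroup/memory/memory.stat")["rss"]) / 2**20

t0 = time.time()
cpu0 = prev_cpu = user_cpu_s()
prev_t = t0
while True:
    time.sleep(PERIOD)
    now, cpu = time.time(), user_cpu_s()
    print(
        f"cpu_pct={100 * (cpu - prev_cpu) / (now - prev_t):.0f}"
        f" mem_MiB={rss_mib():.0f}"
        f" cpu_total_s={cpu - cpu0:.0f}"
        f" elapsed_s={now - t0:.0f}",
        file=sys.stderr,
    )
    prev_t, prev_cpu = now, cpu
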
#!/usr/bin/env python3
"""
Generate a standalone WDL document from a given workflow that uses imported tasks. Requires: miniwdl

    python3 paste_wdl_imports.py [-o STANDALONE.wdl] WORKFLOW.wdl

For each "call imported_namespace.task_name [as alias]" in the workflow, appends the task's source
code with the task name changed to "imported_namespace__task_name", and rewrites the call to refer
to this new name (keeping the original alias). Also blanks out the import statements.
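
As a toy illustration of the renaming it describes (the real script goes through miniwdl's parser rather than regexes), a sketch:

# Illustrative only: rewrite "call ns.task" to "call ns__task" in WDL source text;
# the actual gist resolves calls via miniwdl's AST instead of pattern matching.
import re

def rewrite_calls(wdl_source: str) -> str:
    return re.sub(r"\bcall\s+(\w+)\.(\w+)", r"call \1__\2", wdl_source)

print(rewrite_calls("call lib.say_hello as hi { input: name = name }"))
# -> call lib__say_hello as hi { input: name = name }
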
<!DOCTYPE html>
<html>
<head>
    <meta charset="utf8" />
    <title>htsget</title>
    <!-- needed for adaptive design -->
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <style>
      body {

#!/usr/bin/env python3
# run this script using LD_LIBRARY_PATH to manipulate the SQLite3 library version
import os
import random
import time
import sqlite3

N = 100000
random.seed(42)
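
The preview ends before the timing loop. A hypothetical continuation (reusing the imports and N above) just to show the kind of workload one might time while switching SQLite builds via LD_LIBRARY_PATH; the table name and statement mix are made up:

# Hypothetical continuation, not the original gist: time N random upserts so the
# same script can be rerun under different LD_LIBRARY_PATH settings and compared.
print("SQLite library version:", sqlite3.sqlite_version)
con = sqlite3.connect("bench.db")
con.execute("CREATE TABLE IF NOT EXISTS bench(k INTEGER PRIMARY KEY, v REAL)")
t0 = time.time()
with con:
    con.executemany(
        "INSERT OR REPLACE INTO bench(k, v) VALUES(?, ?)",
        ((random.randrange(N * 10), random.random()) for _ in range(N)),
    )
print(f"{N} inserts in {time.time() - t0:.2f}s")
con.close()
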
Context: static.wiki and Show HN post
We downloaded static.wiki's 40.3 GiB SQLite database of English Wikipedia and created a compressed version of it with sqlite_zstd_vfs, our read/write Zstandard compression layer for SQLite3. The compressed version is 10.4 GiB (26%), and the VFS supports HTTP random access in the spirit of the original (although we don't yet have a WebAssembly build; it's a library for CLI & desktop apps for now). You can try it out on Linux or macOS x86-64:
pip3 install genomicsqlite
genomicsqlite https://f000.backblazeb2.com/file/mlin-public/static.wiki/en.zstd.db \
"select text from wiki_articles where title = 'SQLite'"
FROM ubuntu:20.04

RUN apt-get -qq update && DEBIAN_FRONTEND=noninteractive apt-get install -y \
        wget curl python3-pip python-is-python3
RUN pip3 install --system miniwdl==1.4.2

ENV UDOCKER_VERSION=1.3.1
WORKDIR /usr/local
RUN wget -nv https://github.com/indigo-dc/udocker/releases/download/v${UDOCKER_VERSION}/udocker-${UDOCKER_VERSION}.tar.gz \
    && tar zxf udocker-${UDOCKER_VERSION}.tar.gz \
    && rm udocker-${UDOCKER_VERSION}.tar.gz

<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8" />
    <meta name="generator" content="pandoc" />
    <meta http-equiv="X-UA-Compatible" content="IE=EDGE" />