Skip to content

Instantly share code, notes, and snippets.

View lh3's full-sized avatar

Heng Li lh3

View GitHub Profile
@lh3
lh3 / 00README.md
Last active February 9, 2016 18:38

This gist implements a proof-of-concept web server to provide remote access to multiple BAMs. You can launch the server with:

go run hts-server.go -e /path/to/samtools bam-dir1 bam-dir2

and test it with:

curl -s http://127.0.0.1:8000/

The server regards bam-dir1 and bam-dir2 as study accessions. For a BAM file bam-dir1/myfile1.bam, the file accession is myfile1. It is required that every file has a unique file name across all input directories (i.e. even in two directories, two files must be named differently).

#!/usr/bin/env perl
use strict;
use warnings;
use Getopt::Std;
my %opts = (t=>8, k=>17, m=>0.6, s=>200, p=>'wtasm');
getopts('t:p:k:m:s:', \%opts);
die (qq/Usage: smartdenovo.pl [options] <reads.fa>
Options:
===> Where do you most often perform bioinformatics analysis <===
1 Personal computer (laptop/desktop)
2 Lab server
3 Departmental server
4 University server/cluster
5 Cloud
6 Other
===> Type of problems <===

Writing a program for efficient matrix multiplication is tricky. A naive implementation usually looks like this (I will use a square matrix for simplicity):

for (i = 0; i < n; ++i)
  for (j = 0; j < n; ++i)
    for (k = 0, m[i][j] = 0.; k < n; ++i)
      m[i][j] += a[i][k] * b[k][j];

The problem is the inner loop a[i][k] * b[k][j] where b[k][j] and b[k+1][j] are not adjacent in memory. This leads to frequent cache misses for large matrices. The better implementation is to transpose matrix b. The implementation will look like:

for (i = 0; i &lt; n; ++i) // transpose
var max_leftover = 0.1, max_diff = 0.05, min_mapq = 20;
var file = arguments.length == 0? new File() : new File(arguments[0]);
var buf = new Bytes();
var re = /(\d+)([MIDSH])/g;
var decoy = {};
while (file.readline(buf) >= 0) {
var m, line = buf.toString();
if ((m = /^@SQ.*\tSN:(\S+).*\tLN:(\d+)/.exec(line)) != null) {
@lh3
lh3 / mastermind.c
Last active August 29, 2015 14:23
MasterMind game, with the computer as the code breaker. Dedicated to my daughter.
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <time.h>
const char *mm_colors = "RGBWOPYSE";
const char *mm_numbers = "123456789";
long *mm_enumerate(int m, int n, int rep, long *_z) // m colors, n columns

Querying Genotypes with Standard SQL

This document demonstrates that we can use three conceptual SQL tables for variant annotations, sample phenotypes and genotypes, respectively, to achieve flexible query of genotype data. The tables are conceptual in that they are primarily used for constructing queries but not for storing real data as they are not efficient enough for large-scale data.

We will use an example to show the structure of the three tables and how to query data in standard SQL. You can find all the SQL statements in

@lh3
lh3 / vqsr-b37.mak
Last active August 29, 2015 14:19
Makefile for VQSR
# VQSR for single human sample mapped to b37. Usage:
#
# make -f vqsr-b37.mak prefix=my-vcf-prefix
#
# For this to work, the input VCF must be named as "my-vcf-prefix.vcf" and
# there must be "GenomeAnalysisTK.jar" and "b37" directory from the GATK
# resource bundle in the working directory.
#
# References:
#
@lh3
lh3 / 00_cmd.sh
Last active December 16, 2021 13:02
Scripts and command lines to create universal mask for hs37d5
#
# generate 01compositional.bed.gz
#
# low-complexity by mDUST
mdust hs37d5.fa -c -w7 -v28 \
| hs37d5.mdust-w7-v28.txt \
| cut -f1,3,4 \
| gawk -vOFS="\t" '{--$2;print}' \
| bgzip > 01compositional/hs37d5.mdust-w7-v28.bed.gz
@lh3
lh3 / occ1.c
Last active August 29, 2015 14:14
Get a k-mer count from the fermi2 index
/* To compile this program, you need rld0.{h,c} from ropebwt2 or fermi2:
*
* gcc -g -O2 -Wall -o occ1 rld0.c occ1.c
*/
#include <string.h>
#include <stdint.h>
#include <stdio.h>
#include "rld0.h"