Skip to content

Instantly share code, notes, and snippets.

View brentp's full-sized avatar

Brent Pedersen brentp

View GitHub Profile
@brentp
brentp / 0_reuse_code.js
Created September 29, 2013 19:25
Here are some things you can do with Gists in GistBox.
// Use Gists to store code you would like to remember later on
console.log(window); // log the "window" object to the console
@brentp
brentp / fq.split.py
Last active December 23, 2015 10:59
split a fastq file into evenly size random chunks.
"""
split a single fastq file in to random, non-overlapping subsets
arguments:
+ fastq file
+ number of splits
+ number of reps
e.g.:
python fq.split.py input.fastq 3 4
@brentp
brentp / ziller-meth.md
Created August 14, 2013 14:34
notes from the nature letter.
Charting a dynamic DNA methylation landscape of the human genome
Nature (2013) doi:10.1038/nature12433

Summary

Ziller at al looked at 42 whole genome bisulfite sequence datasets from 30 cell-lines and tissues (which seems indicative of the practice of not having biological replicates for WGBS).

When requring a difference of 30% (FDR 0.104), they find only 5.6million CpGs in 716K regions with "dynamic" methylation and use this to

import sys
from cruzdb import Genome
from itertools import groupby
#g = Genome('mm9')
#g.mirror(('kgXref', 'knownGene'), 'sqlite:///mm9.db')
g = Genome('sqlite:///mm9.db')
# python ken-rrbs.py Ken_List_For_RRBS.txt
import pandas as pd
from StringIO import StringIO
import numpy as np
import re
import string
txt = """\
a b c d e f g h i
0 0 1 0 0 1 0 0 0
1 0 0 0 0 0 0 0 0
@brentp
brentp / skat-combined.txt
Last active December 20, 2015 02:18
summary of "Sequence Kernel Association Tests for the Combined Effect of Rare and Common Variants." http://www.ncbi.nlm.nih.gov/pubmed/23684009
The adaptive-Sum test now in SKAT as described by Ionita-Laza et al (2013) is a test for the combined effect of rare
and common variants. Generally, with common variants, e.g. those on a GWAS chip, we use a logistic
regression of
disease ~ age + ... + genotype
and determine the p-value for genotype on e.g. 500K sites on the chip. For rare variants, there is not enough power to detect with such single-variant tests and so there are many methods to "combine" rare variants in some local area to look for a "burden" of mutations in that region. Often, these local areas are centered around genes or moving windows. This also reduces the testing burden (so to speak) from 500K SNPs to 25K genes.
SKAT is a method for detecting rare variants associated with disease. SKAT generally takes a matrix with rows of SNPs and columns of samples. Each SNP can be given any weight. The default is to use
@brentp
brentp / bootstrip.R
Created May 3, 2013 19:52
calculate p-value for a particular beta (or any measure), using a bootstrap-like measure. allows subsampling where as bootstrap is sampling with replacement. allows shuffling of residuals for linear model. the .iter version of each function will perform 200 iterations on the full dataset, then 5000 iterations on those with a simulated p-value < …
options(stringsAsFactors=FALSE)
library(limma)
library(parallel)
# fn returns the statistic of interest from a limma fit object.
bootstrip = function(mat, mod, fn, iterations=100, p_samples=0.5, mc.cores=12){
stopifnot(nrow(mod) == ncol(mat))
ta = mclapply(1:iterations, function(core_i){
idx = sample.int(ncol(mat), p_samples * ncol(mat), replace=TRUE)
#! /bin/bash
# Align paired bisulfite-converted DNA reads to a genome.
# This assumes that reads1.fastq are all from the converted strand
# (i.e. they have C->T conversions) and reads2.fastq are all from the
# reverse-complement (i.e. they have G->A conversions).
# "GNU parallel" needs to be installed.
@brentp
brentp / linear_model.py
Created April 10, 2013 15:57
calculate t statistics and p-values for coefficients in Linear Model in python, using scikit-learn framework.
from sklearn import linear_model
from scipy import stats
import numpy as np
class LinearRegression(linear_model.LinearRegression):
"""
LinearRegression class after sklearn's, but calculate t-statistics
and p-values for model coefficients (betas).
Additional attributes available after .fit()
@brentp
brentp / ens-ex.pl
Created April 5, 2013 13:46
example of using the Ensembl perl API to see the effect of a SNP on predicted transciption factor binding affinity
#!/usr/bin/perl
use strict;
use warnings;
use Bio::EnsEMBL::Registry;
use Bio::EnsEMBL::ApiVersion;
printf(STDERR "The API version used is %s\n", software_version() );
my $registry = 'Bio::EnsEMBL::Registry';
$registry->load_registry_from_db(