Brent Pedersen brentp

Charting a dynamic DNA methylation landscape of the human genome
Nature (2013) doi:10.1038/nature12433

Summary

Ziller at al looked at 42 whole genome bisulfite sequence datasets from 30 cell-lines and tissues (which seems indicative of the practice of not having biological replicates for WGBS).

When requring a difference of 30% (FDR 0.104), they find only 5.6million CpGs in 716K regions with "dynamic" methylation and use this to

	"""
	split a single fastq file in to random, non-overlapping subsets
	arguments:
	+ fastq file
	+ number of splits
	+ number of reps

	e.g.:

	python fq.split.py input.fastq 3 4

	import sys
	from cruzdb import Genome
	from itertools import groupby

	#g = Genome('mm9')
	#g.mirror(('kgXref', 'knownGene'), 'sqlite:///mm9.db')

	g = Genome('sqlite:///mm9.db')

	# python ken-rrbs.py Ken_List_For_RRBS.txt

	import pandas as pd
	from StringIO import StringIO
	import numpy as np
	import re
	import string

	txt = """\
	a b c d e f g h i
	0 0 1 0 0 1 0 0 0
	1 0 0 0 0 0 0 0 0

	The adaptive-Sum test now in SKAT as described by Ionita-Laza et al (2013) is a test for the combined effect of rare
	and common variants. Generally, with common variants, e.g. those on a GWAS chip, we use a logistic
	regression of

	disease ~ age + ... + genotype

	and determine the p-value for genotype on e.g. 500K sites on the chip. For rare variants, there is not enough power to detect with such single-variant tests and so there are many methods to "combine" rare variants in some local area to look for a "burden" of mutations in that region. Often, these local areas are centered around genes or moving windows. This also reduces the testing burden (so to speak) from 500K SNPs to 25K genes.

	SKAT is a method for detecting rare variants associated with disease. SKAT generally takes a matrix with rows of SNPs and columns of samples. Each SNP can be given any weight. The default is to use

	options(stringsAsFactors=FALSE)
	library(limma)
	library(parallel)

	# fn returns the statistic of interest from a limma fit object.
	bootstrip = function(mat, mod, fn, iterations=100, p_samples=0.5, mc.cores=12){
	stopifnot(nrow(mod) == ncol(mat))

	ta = mclapply(1:iterations, function(core_i){
	idx = sample.int(ncol(mat), p_samples * ncol(mat), replace=TRUE)

	// Use Gists to store code you would like to remember later on
	console.log(window); // log the "window" object to the console

	#! /bin/bash

	# Align paired bisulfite-converted DNA reads to a genome.

	# This assumes that reads1.fastq are all from the converted strand
	# (i.e. they have C->T conversions) and reads2.fastq are all from the
	# reverse-complement (i.e. they have G->A conversions).

	# "GNU parallel" needs to be installed.

	from sklearn import linear_model
	from scipy import stats
	import numpy as np


	class LinearRegression(linear_model.LinearRegression):
	"""
	LinearRegression class after sklearn's, but calculate t-statistics
	and p-values for model coefficients (betas).
	Additional attributes available after .fit()

	#!/usr/bin/perl
	use strict;
	use warnings;
	use Bio::EnsEMBL::Registry;

	use Bio::EnsEMBL::ApiVersion;
	printf(STDERR "The API version used is %s\n", software_version() );

	my $registry = 'Bio::EnsEMBL::Registry';
	$registry->load_registry_from_db(