Skip to content

Instantly share code, notes, and snippets.

View crazyhottommy's full-sized avatar
🎯
Focusing

Ming Tang crazyhottommy

🎯
Focusing
View GitHub Profile
#! /bin/bash
# this script is used to decompose vcf file and normalize the vcf file
# see here https://gemini.readthedocs.org/en/latest/index.html
# and load it to gemini database
# put the following three lines to every bash script to catch errors
set -e
set -u
set -o pipefail -o errexit -o nounset

compare:

cat raw_expression.txt | head
cat raw_expression.txt | awk -F"\t" '{if(NR==1) $1="NAME"FS$1}1' OFS="\t" | head 

$1 is the first field for each line.
$1="NAME" FS $1 assigns NAME to the first field and seperate by a field seprator (FS), which is a tab.
1 is always TRUE, so awk prints out the rest of the lines.

then, add a second column with a dummy "na":

m.row.sum<- cbind(m1, rowSums(m1))
o1<- rev(order(m.row.sum[,602]))
m.row.sum<- m.row.sum[o1,]
bk = unique(c(seq(-0.1,3, length=100),seq(3,10.35,length=100)))
hmcols<- colorRampPalette(c("white","red"))(length(bk)-1)
pheatmap( m.row.sum[,1:601], cluster_rows = F, cluster_cols = F, col= hmcols, breaks = bk, legend=FALSE, show_rownames=FALSE, show_colnames=FALSE)
#The X axis is -3 kb to 3 kb around TSS.
#It turns out that the heatmap.2 function use default hclust ( Hierachical Clustering) to cluster the #matrix.
#alternatively, we can use K-means clustering to cluster the data and to see what's the pattern look like.
km<- kmeans(m1,2) # determine how many cluster you want, I specify 2 here
m1.kmeans<- cbind(m1, km$cluster) # combine the cluster with the matrix
dim(m1.kmeans)
# This R script is to generate the TF or histone modification heatmap
# at certain genomic features (TSS, enhancers) from the ChIP-seq data
# the input matrix is got from Homer software. alternative to R, use cluster3 to cluster, and visualize by # java Treeviewer
# generate the matrix by Homer: annotatePeaks.pl peak_file.txt hg19 -size 6000 -hist 10 -ghist -d TF1/ # > outputfile_matrix.txt
# see several posts for heatmap:
# http://davetang.org/muse/2010/12/06/making-a-heatmap-with-r/
# http://www.r-bloggers.com/r-using-rcolorbrewer-to-colour-your-figures-in-r/
# 08/20/13 by Tommy Tang
# it is such a simple script but took me several days to get it work...I mean the desired
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### I am going to demonstrate how to use ipython notebook bash_kernal to do reproducible research.\n",
"I can do command line in the notebook and take notes along the way.\n",
"Let's go to the directory first."
]
## get all the promoter sequences for human hg19 genome
## Author: Ming Tang (Tommy)
## Date: 04/30/2015
## load the libraries
library(GenomicRanges)
library(BSgenome.Hsapiens.UCSC.hg19)
BSgenome.Hsapiens.UCSC.hg19
# or
Hsapiens
options(stringsAsFactors=F)
library(gdata)
library(parallel)
files = list.files(path='ctx/',pattern='*.bd$')
meta = read.csv("WGS.coverage.csv")
mclapply (files, function(f) {
dat = read.delim(sprintf('ctx/%s', f),comment.char='#',header=F,as.is=T)[,-(12:14)]
message(sprintf("File: %s, Dim: (%s)", f, paste(dim(dat), collapse=",")))
## Overview
# central limit theorem (CLT) states that, given certain conditions, the arithmetic mean of a sufficiently large number of iterates of independent random variables, each with a well-defined expected value and well-defined variance, will be approximately normally distributed, regardless of the underlying distribution.
# I am going to draw 40 numbers from exponential distribution for 1000 times, and calcuate the mean
# of each draw (we will have 1000 means), and through this simulation to test if the
# distribution of the means will be normal or not.
## start simulation
# number of simulation, sample size and lambda
nosim<- 1000
# creat a test file
$time seq 1 10000000 > ten_million.txt
seq 1 10000000 > ten_million.txt 3.51s user 0.13s system 99% cpu 3.663 total
# it is a "big" file with size of 109M
$ls -lh ten_million.txt
-rw-r--r-- 1 Tammy staff 109M Mar 22 20:49 ten_million.txt
$man gshuf
# randomly select 1000 lines from it