Ming Tang crazyhottommy

library(tidyverse)

## read in the mutect files
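## `pattern` is a regular expression, not a shell glob, so match the extension explicitly
mix.files <- as.list(dir(".", pattern = "\\.tsv$"))

## need to add the file name into a column
mix_mutect_datlist <- lapply(mix.files, function(f) {
        dat <- read.table(f, header = TRUE, sep = "\t", quote = "\"")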

backup dotfiles

The following copies all dotfiles and dot-directories (including their contents) that sit directly under the home directory.
To avoid copying caches and other local configuration (web browser, Java apps, etc.), first check directory sizes across $HOME with ncdu $HOME or a similar tool.
Exclude the large directories with rsync options of the form --exclude=.local --exclude=.cache
Do not rsync passwords, SSH keys, .bash_history, etc. if you are uploading to GitHub or a similar service.
rsync the home dotfiles and configs as follows:

# on your local machine

use diffbind to get diff sites

library(DiffBind)
UCI.H3K27ac.dba<- dba(sampleSheet="H3K27ac_diffbind.csv", scoreCol= 7, filter=80, peakFormat = "macs")

UCI_H3K27ac_RPKM <- dba.count(UCI.H3K27ac.dba, minOverlap = 2,
                              fragmentSize = 200, bParallel = TRUE,
                              score = DBA_SCORE_RPKM)

This is the default behavior for SSH. It protects user keys by enforcing rwx------ on $HOME/.ssh and by ensuring that only the owner has write permission on $HOME. If a user other than the owner had write permission on the $HOME directory, they could maliciously modify the permissions on $HOME/.ssh and potentially hijack the user keys, known_hosts, or something similar. In summary, any of the following permissions on $HOME is sufficient for SSH to work.

rwx------

rwxr-x---

rwxr-xr-x

>SSH will not work correctly and will send warnings to the log facilities if any variation of g+w or o+w exists on the $HOME directory.
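
The rule above (no g+w or o+w on $HOME) can be checked with a few lines of Python. This is a minimal sketch of the rule as stated here, not a reimplementation of what sshd does internally; the function name `has_group_or_other_write` is made up for this example.

```python
import os
import stat


def has_group_or_other_write(path):
    """Return True if `path` carries the group-write or other-write bit."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return bool(mode & (stat.S_IWGRP | stat.S_IWOTH))


def describe(path):
    """Return a one-line summary such as '/home/me: drwxr-xr-x -> ok'."""
    perms = stat.filemode(os.stat(path).st_mode)
    verdict = "too permissive" if has_group_or_other_write(path) else "ok"
    return f"{path}: {perms} -> {verdict}"
```

Calling `describe(os.path.expanduser("~"))` reports the current home directory. The three modes listed above (rwx------, rwxr-x---, rwxr-xr-x) all pass, while something like rwxrwxr-x is flagged.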

## DESeq2 built-in function
plotPCA(vsd.fast, intgroup=c("subtype"))

## SVD to get PCs manually
X<- assay(vsd.fast)

## center X
X<- t(scale(t(X),center=TRUE,scale=FALSE))

From Mike Love: https://gist.github.com/mikelove/f539631f9e187a8931d34779436a1c01

An R implementation of the rule:

Archive-generated fastq files are organised by run accession number under the vol1/fastq directory on ftp.sra.ebi.ac.uk:

ftp://ftp.sra.ebi.ac.uk/vol1/fastq/<dir1>[/<dir2>]/<run accession>

<dir1> is the first 6 letters and numbers of the run accession (e.g. ERR000 for ERR000916),
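
As an illustration of the first-level directory part of the rule (the first 6 letters and numbers of the run accession), here is a minimal Python sketch. The function names `ena_dir1` and `ena_fastq_prefix` are made up for this example, and the sketch deliberately stops at the first-level directory because the rest of the rule is not spelled out in this preview.

```python
ENA_FASTQ_ROOT = "ftp://ftp.sra.ebi.ac.uk/vol1/fastq"


def ena_dir1(run_accession):
    """Return the first 6 characters of a run accession, e.g. 'ERR000'."""
    if len(run_accession) < 6:
        raise ValueError(f"run accession too short: {run_accession!r}")
    return run_accession[:6]


def ena_fastq_prefix(run_accession):
    """Return the FTP URL of the first-level directory for a run accession."""
    return f"{ENA_FASTQ_ROOT}/{ena_dir1(run_accession)}/"
```

For the accession used in the text, `ena_dir1("ERR000916")` gives `ERR000`.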

Install

The latest version of VEP is on GitHub; see the installer documentation: http://www.ensembl.org/info/docs/tools/vep/script/vep_download.html#installer

It was version 89 when this gist was written (bioinformatics tools evolve too fast!).

Check this gist as well: https://gist.github.com/ckandoth/f265ea7c59a880e28b1e533a6e935697

Some reading for the basics

Cores, CPUs and threads:
http://www.slac.stanford.edu/comp/unix/package/lsf/currdoc/lsf_admin/index.htm?lim_core_detection.html~main
Traditionally, the value of ncpus has been equal to the number of physical CPUs. However, many CPUs consist of multiple cores and threads, so the traditional 1:1 mapping is no longer useful. A more useful approach is to set ncpus to equal one of the following:

The number of processors
Cores—the number of cores (per processor) * the number of processors (this is the ncpus default setting)
Threads—the number of threads (per core) * the number of cores (per processor) * the number of processors
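
The three candidate values for ncpus are simple products, which a short Python sketch makes concrete. The function name `ncpus_options` and the example host (2 processors, 8 cores each, 2 threads per core) are invented for illustration; they do not come from the LSF documentation.

```python
def ncpus_options(processors, cores_per_processor, threads_per_core):
    """Return the three candidate ncpus values described above."""
    cores = cores_per_processor * processors
    threads = threads_per_core * cores
    return {
        "processors": processors,  # number of physical processors
        "cores": cores,            # cores per processor * processors
        "threads": threads,        # threads per core * cores per processor * processors
    }
```

For the hypothetical host above, `ncpus_options(2, 8, 2)` yields 2 processors, 16 cores and 32 threads.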

	library(TCGAbiolinks)
	query.maf.hg19 <- GDCquery(project = "TCGA-COAD",
	                           data.category = "Simple nucleotide variation",
	                           data.type = "Simple somatic mutation",
	                           access = "open",
	                           legacy = TRUE)
	# Check which MAF files are available
	knitr::kable(getResults(query.maf.hg19)[, c("created_datetime", "file_name")])

	query.maf.hg19 <- GDCquery(project = "TCGA-COAD",
	data.category = "Simple nucleotide variation",

	# This code will get all clinical indexed data from TCGA
	library(TCGAbiolinks)
	library(data.table)   # provides rbindlist()
	library(magrittr)     # provides the %>% pipe
	clinical <- TCGAbiolinks:::getGDCprojects()$project_id %>%
	    regexPipes::grep("TCGA", value = TRUE) %>%
	    sort %>%
	    plyr::alply(1, GDCquery_clinic, .progress = "text") %>%
	    rbindlist
	# Pass the output file positionally; its argument name differs between readr versions
	readr::write_csv(clinical, "all_clin_indexed.csv")