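library(tidyverse)
## read in the mutect files
## dir() takes a regular expression, not a shell glob
mix.files <- as.list(dir(".", pattern = "\\.tsv$"))
## need to add the file name into a column
mix_mutect_datlist <- lapply(mix.files, function(f) {
  dat <- read.table(f, header = TRUE, sep = "\t", quote = "\"")
  ## assumption: the column name file_name is illustrative; the original snippet is cut off here
  dat$file_name <- f
  dat
})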
- The following copies all dot (~/.) files and directories, including their contents, that sit directly underneath the home directory.
- To avoid copying caches and other local configs (e.g., those of web browsers, Java apps, etc.), first check directory sizes across the entire home directory $HOME/ using ncdu $HOME or a similar tool.
- Exclude all those large directories using the rsync --exclude=.local --exclude=.cache format.
- Avoid rsyncing passwords, ssh keys, .bash_history, etc. if you are uploading to GitHub or a similar service.
- rsync home dotfiles and configs as follows:
# in your local machine
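The command itself is missing from the notes above, so what follows is only a sketch of how the exclude list could be assembled. The helper name build_excludes and the destination path are hypothetical, and the include/exclude filter order in the commented usage line is an assumption about how one might restrict the copy to top-level dot entries.

```shell
#!/usr/bin/env bash
# build_excludes is a hypothetical helper: it prints one --exclude=NAME
# option per argument, matching the format used in the notes above.
build_excludes() {
  local name
  for name in "$@"; do
    printf -- '--exclude=%s\n' "$name"
  done
}

# Possible usage (dry run first; /path/to/dotfiles-backup/ is a placeholder):
#   rsync -av --dry-run $(build_excludes .local .cache .ssh .bash_history) \
#     --include='/.*' --exclude='/*' "$HOME"/ /path/to/dotfiles-backup/
```

Running with --dry-run first shows what would be copied before anything is written, which makes it easier to spot a large or sensitive directory that still needs an exclude.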
query.maf.hg19 <- GDCquery(project = "TCGA-COAD",
                           data.category = "Simple nucleotide variation",
                           data.type = "Simple somatic mutation",
                           access = "open",
                           legacy = TRUE)
# Check available MAF files
knitr::kable(getResults(query.maf.hg19)[, c("created_datetime", "file_name")])
This is the default behavior for SSH. It protects user keys by enforcing rwx------ on $HOME/.ssh and ensuring only the owner has write permissions to $HOME. If a user other than the respective owner has write permission on the $HOME directory, they could maliciously modify the permissions on $HOME/.ssh, potentially hijacking the user keys, known_hosts, or something similar. In summary, the following permissions on $HOME will be sufficient for SSH to work.
- rwx------
- rwxr-x---
- rwxr-xr-x
>SSH will not work correctly and will send warnings to the log facilities if any variation of g+w or o+w exists on the $HOME directory.
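As a rough way to check a directory against the rule above, the following sketch reads the mode string from ls -ld and flags any group or other write bit. The function name home_perms_ok is hypothetical and not part of OpenSSH; it only approximates the check described here.

```shell
#!/usr/bin/env bash
# home_perms_ok is a hypothetical helper, not an OpenSSH tool.
# It fails if the directory is group-writable (g+w) or world-writable (o+w).
home_perms_ok() {
  local dir="$1" perms
  # ls -ld prints a mode string such as drwxr-xr-x; keep the first 10 characters
  perms=$(ls -ld "$dir" | cut -c1-10)
  # character 6 is the group write bit, character 9 is the other write bit
  if [ "${perms:5:1}" = "w" ] || [ "${perms:8:1}" = "w" ]; then
    echo "unsafe: $perms"
    return 1
  fi
  echo "ok: $perms"
}

# Possible usage:
#   home_perms_ok "$HOME" || chmod go-w "$HOME"
```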
# This code will get all clinical indexed data from TCGA
library(TCGAbiolinks)
library(data.table)
library(magrittr) # provides %>%
clinical <- TCGAbiolinks:::getGDCprojects()$project_id %>%
  regexPipes::grep("TCGA", value = TRUE) %>%
  sort %>%
  plyr::alply(1, GDCquery_clinic, .progress = "text") %>%
  rbindlist
readr::write_csv(clinical, "all_clin_indexed.csv")
## DESeq2 built-in function
plotPCA(vsd.fast, intgroup = c("subtype"))
## SVD to get PCs manually
X<- assay(vsd.fast)
## center X
X<- t(scale(t(X),center=TRUE,scale=FALSE))
From Mike Love: https://gist.github.com/mikelove/f539631f9e187a8931d34779436a1c01
An R implementation of the rule:
Archive generated fastq files are organised by run accession number under vol1/fastq directory in ftp.sra.ebi.ac.uk:
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/<dir1>[/<dir2>]/<run accession>
<dir1> is the first 6 letters and numbers of the run accession (e.g. ERR000 for ERR000916).
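The rule can be sketched as a small bash function. The note above is cut off after the first directory level, so the handling of the optional second level is an assumption based on the ENA documentation as I recall it: accessions with six digits have no second level, and longer ones use the last digits zero-padded to three characters. The function name ena_fastq_dir is hypothetical.

```shell
#!/usr/bin/env bash
# ena_fastq_dir is a hypothetical helper that builds the FTP directory
# for a run accession such as ERR000916 or SRR1016916.
ena_fastq_dir() {
  local acc="$1"
  local base="ftp://ftp.sra.ebi.ac.uk/vol1/fastq"
  local dir1="${acc:0:6}"            # first 6 letters and numbers
  local ndigits=$(( ${#acc} - 3 ))   # digits after the 3-letter prefix
  local dir2=""
  # Assumption: second-level directory rule as described in the lead-in
  case "$ndigits" in
    6) dir2="" ;;
    7) dir2="00${acc: -1}" ;;
    8) dir2="0${acc: -2}" ;;
    9) dir2="${acc: -3}" ;;
    *) echo "unsupported accession: $acc" >&2; return 1 ;;
  esac
  if [ -n "$dir2" ]; then
    echo "$base/$dir1/$dir2/$acc/"
  else
    echo "$base/$dir1/$acc/"
  fi
}

# Possible usage:
#   ena_fastq_dir ERR000916
```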
The latest version of VEP and its installer are described at http://www.ensembl.org/info/docs/tools/vep/script/vep_download.html#installer
It was version 89 when this gist was written (bioinformatics tools evolve too fast!).
Check this gist as well: https://gist.github.com/ckandoth/f265ea7c59a880e28b1e533a6e935697
Cores, CPUs and threads:
http://www.slac.stanford.edu/comp/unix/package/lsf/currdoc/lsf_admin/index.htm?lim_core_detection.html~main
Traditionally, the value of ncpus has been equal to the number of physical CPUs. However, many CPUs consist of multiple cores and threads, so the traditional 1:1 mapping is no longer useful. A more useful approach is to set ncpus to equal one of the following:
- Processors: the number of processors
- Cores: the number of cores (per processor) * the number of processors (this is the ncpus default setting)
- Threads: the number of threads (per core) * the number of cores (per processor) * the number of processors
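To make the three definitions concrete, here is a small arithmetic sketch. The function name ncpus_for and the mode keywords procs, cores and threads are hypothetical labels for this example, not LSF configuration values.

```shell
#!/usr/bin/env bash
# ncpus_for is a hypothetical helper illustrating the three ways of counting.
# Arguments: mode, processors, cores per processor, threads per core
ncpus_for() {
  local mode="$1" procs="$2" cores_per_proc="$3" threads_per_core="$4"
  case "$mode" in
    procs)   echo "$procs" ;;
    cores)   echo $(( procs * cores_per_proc )) ;;
    threads) echo $(( procs * cores_per_proc * threads_per_core )) ;;
    *)       echo "unknown mode: $mode" >&2; return 1 ;;
  esac
}

# Possible usage, for a host with 2 processors, 8 cores each, 2 threads per core:
#   ncpus_for cores 2 8 2
```

For that example host the three definitions give 2, 16 and 32 respectively, which shows how much the reported ncpus depends on the definition chosen.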