Ming Tang crazyhottommy

# somewhat hackish solution to:
# https://twitter.com/EamonCaddigan/status/646759751242620928
# based mostly on copy/pasting from ggplot2 geom_violin source:

load libraries

#install.packages("tidyverse")
#install.packages("Seurat")
# the here package removes the need for absolute paths
# see details https://github.com/jennybc/here_here
#install.packages("here")
# interact with file systems https://github.com/r-lib/fs

# stop using biocLite https://twitter.com/strnr/status/1022451016736927745?lang=en
# more details https://cran.r-project.org/web/packages/BiocManager/vignettes/BiocManager.html
# requires R >= 3.5.0; if you have a lower version of R, you may still use biocLite to install
# Bioconductor packages

install.packages("BiocManager")
BiocManager::install("TxDb.Mmusculus.UCSC.mm9.knownGene")

R solution

library(purrr)
library(tidyverse)
f_files <- list.files("results/superEnhancer/rna_expression/MSTC", pattern = "featureCount.txt", full.names = TRUE)

read_in_feature_counts<- function(file){
        cnt <- read_tsv(file, col_names = TRUE, comment = "#")
        cnt<- cnt %>% dplyr::select(-Chr, -Start, -End, -Strand, -Length)

Get the peaks that are shared by multiple samples

There has been a lot of discussion of this topic on Biostars:

https://www.biostars.org/p/172566/
https://www.biostars.org/p/13516/

Two tools can be used: the glorious bedtools intersect and the under-appreciated bedmap.

I found bedmap more flexible for my needs. I have 48 super-enhancer peak BED files generated by ROSE and want to find peaks that are present in a given proportion of the samples.
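To make the goal concrete, here is a minimal pure-Python sketch of the idea. It is not how bedmap or bedtools work internally: it simply pools all peaks, merges overlapping ones into union regions, and keeps the regions touched by at least a chosen fraction of samples. The function name `shared_regions`, the `min_fraction` parameter, and the toy coordinates are all hypothetical.

```python
from collections import defaultdict


def shared_regions(peaks_by_sample, min_fraction=0.5):
    """Return merged regions overlapped by at least min_fraction of samples.

    peaks_by_sample maps a sample name to a list of (chrom, start, end)
    tuples in BED-style 0-based, half-open coordinates.
    Each result is (chrom, start, end, number_of_samples).
    """
    # Pool peaks from every sample, remembering which sample each came from.
    pooled = defaultdict(list)
    for sample, peaks in peaks_by_sample.items():
        for chrom, start, end in peaks:
            pooled[chrom].append((start, end, sample))

    n_samples = len(peaks_by_sample)
    kept = []
    for chrom in sorted(pooled):
        cur_start = cur_end = None
        cur_samples = set()
        for start, end, sample in sorted(pooled[chrom]):
            if cur_end is not None and start < cur_end:
                # Overlaps the region being built: extend it.
                cur_end = max(cur_end, end)
                cur_samples.add(sample)
                continue
            # No overlap: emit the finished region if enough samples hit it.
            if cur_end is not None and len(cur_samples) / n_samples >= min_fraction:
                kept.append((chrom, cur_start, cur_end, len(cur_samples)))
            cur_start, cur_end, cur_samples = start, end, {sample}
        # Emit the last region on this chromosome.
        if cur_end is not None and len(cur_samples) / n_samples >= min_fraction:
            kept.append((chrom, cur_start, cur_end, len(cur_samples)))
    return kept


# Toy input (made-up coordinates): three samples instead of 48.
toy = {
    "s1": [("chr1", 100, 200), ("chr1", 500, 600)],
    "s2": [("chr1", 150, 250)],
    "s3": [("chr1", 180, 300), ("chr2", 10, 50)],
}
print(shared_regions(toy, min_fraction=2 / 3))
```

For real data the dedicated tools are the better choice; the sketch is only meant to show what "present in a proportion of the samples" means operationally.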

Three gotchas when using R for Genomic data analysis

In my daily work with R for genomic data analysis, I have run into several cases where R gave me some (bad) surprises.

1. The devil: 1-based and 0-based coordinate systems

Read the details here: https://github.com/crazyhottommy/DNA-seq-analysis#tips-and-lessons-learned-during-my-dna-seq-data-analysis-journey

Some files, such as BED files, are 0-based. Two genomic regions:
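As a hedged illustration of the off-by-one issue (hypothetical helper names and made-up coordinates, not the original example), the sketch below converts between BED-style 0-based, half-open intervals and the 1-based, closed intervals used by formats such as GFF and by GRanges objects in R.

```python
def bed_to_one_based(start, end):
    """Convert a 0-based, half-open interval to 1-based, closed."""
    return start + 1, end


def one_based_to_bed(start, end):
    """Convert a 1-based, closed interval to 0-based, half-open."""
    return start - 1, end


def bed_width(start, end):
    """Width of a 0-based, half-open interval."""
    return end - start


def one_based_width(start, end):
    """Width of a 1-based, closed interval."""
    return end - start + 1


# The first 100 bases of a chromosome in both systems.
bed = (0, 100)
one_based = bed_to_one_based(*bed)
print(one_based)  # (1, 100)

# Both systems describe the same 100 bases.
print(bed_width(*bed), one_based_width(*one_based))  # 100 100
```

Only the start coordinate shifts; the end stays the same, which is why a forgotten conversion tends to produce regions that are silently one base off.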

When chromHMM is run with a bin size of, say, 1000 bp, consecutive bins that share the same state are merged into one segment.

I want to tile the segment file before merging. https://support.bioconductor.org/p/102775/#102777

library(GenomicRanges)
library(rtracklayer)
chromHMM_seg <- import("data/chromHMM/SKCM-M852-P008_18_segments.bed", format = "BED")
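The tiling idea itself is independent of R. As a rough sketch, assuming segments are plain (chrom, start, end, state) tuples in 0-based, half-open coordinates and using the 1000 bp bin size mentioned above, splitting merged segments back into bins could look like this. The function name `tile_segment` and the toy segments are hypothetical.

```python
def tile_segment(chrom, start, end, state, bin_size=1000):
    """Split one segment into consecutive bins of at most bin_size bp."""
    bins = []
    for bin_start in range(start, end, bin_size):
        # The last bin is clipped so it never runs past the segment end.
        bin_end = min(bin_start + bin_size, end)
        bins.append((chrom, bin_start, bin_end, state))
    return bins


# Toy segments (made-up): one merged 3 kb segment and one single bin.
segments = [("chr1", 0, 3000, "E1"), ("chr1", 3000, 4000, "E2")]
tiled = [b for seg in segments for b in tile_segment(*seg)]
for row in tiled:
    print(*row, sep="\t")
```

Each bin keeps the state of the segment it came from, so the tiled output has one row per bin rather than one row per run of identical states.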

	# somewhat hackish solution to:
	# https://twitter.com/EamonCaddigan/status/646759751242620928
	# based mostly on copy/pasting from ggplot2 geom_violin source:
	# https://github.com/hadley/ggplot2/blob/master/R/geom-violin.r

	library(ggplot2)
	library(dplyr)


	"%||%" <- function(a, b) {

	#API created by @apfejes (Anthony Fejes) on top of my half-cooked script
	#python ebi_url_from_srr.py --file srr_list.txt | xargs -I {} wget {}

	import argparse
	def prepareURL(srr_name, prefix="ftp://ftp.sra.ebi.ac.uk/vol1/fastq/"):
	    dir_1 = srr_name[:6]
	    dir_2 = ""
	    url = ""
	    num_digits = sum(s.isdigit() for s in srr_name)
	    if num_digits == 6: