@melissamlwong
Forked from bobthecat/bigcorPar.r
Last active May 13, 2020 00:53
#!/usr/bin/env Rscript
#Correlation calculation for a large dataset (tested on ~120k columns)
#Modified by Melissa M.L. Wong on 19 July 2018
#Modification 1: Removed the ff matrix because of its size limit of ~45k columns. Converting the ff matrix to ffdf and writing it to a file takes forever.
#Modification 2: Print each Pearson correlation to the console; redirect the output to a file using bash
#Modification 3: The user selects columns x to y, which are compared against all other columns
#Modification 4: Results are printed as they are computed instead of being stored in memory. Memory usage is about 4 GB.
#Comment: This is faster than an all-vs-all comparison. The task can be split into multiple chunks and saved in multiple files
#Usage: Rscript -e 't<-read.table("matrix.dat",sep=" ",header=T, stringsAsFactors=F);a<-as.matrix(sapply(t, as.numeric));source("bigcorPar.r");bigcorPar(a, ncore=64,x=1,y=1000)' >> matrix_cor.txt
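bigcorPar <- function(a, ncore = 64, x, y) {
  # x and y are required: first and last index of the selected columns
  # doMC registers a fork-based multicore backend for foreach
  library(doMC)
  registerDoMC(cores = ncore)
  column <- colnames(a)
  chr <- x:y
  # All columns outside x:y; pairs within x:y are not compared
  oth <- setdiff(seq_len(ncol(a)), chr)
  foreach(i = chr) %dopar% {
    for (j in oth) {
      COR <- cor(a[, i], a[, j])
      # One tab-separated line per pair: name 1, name 2, Pearson r
      cat(sprintf("%s\t%s\t%.7f\n", column[i], column[j], COR))
      flush.console()
    }
  }
  # Returned invisibly so the gc() table is not printed into the redirected output
  invisible(gc(verbose = FALSE))
}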
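
The header comment says the task can be split into multiple chunks, each saved to its own file. A minimal sketch of how the x and y values for each run might be derived is shown below. `chunk_ranges`, `n_cols` and `chunk_size` are hypothetical names that do not appear in the original script, and the column count of 2500 is only an illustration.

```r
# Hypothetical helper (not part of the original script): split columns
# 1..n_cols into consecutive ranges of at most chunk_size columns each.
chunk_ranges <- function(n_cols, chunk_size) {
  starts <- seq(1, n_cols, by = chunk_size)
  # The last range is clipped to the final column
  ends <- pmin(starts + chunk_size - 1, n_cols)
  data.frame(x = starts, y = ends)
}

# Illustrative numbers only: 2500 columns in chunks of 1000
ranges <- chunk_ranges(n_cols = 2500, chunk_size = 1000)
print(ranges)
```

Each row gives the x and y for one bigcorPar() run, which can be redirected to its own file in the same way as the usage line. Because every run compares its chunk against all columns outside it, a pair whose columns fall in different chunks is reported once by each of the two runs, while a pair whose columns fall in the same chunk is not reported at all.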