cgpu · September 9, 2018 08:08
diff --git a/imputeLCMD_vignette.Rmd b/imputeLCMD_vignette.Rmd
 ---
 title: 'Imputation of missing data: `impute.MinDet` function from the R package `imputeLCMD`, `missForest` for Random Forest imputation, `impute.knn` function for kNN imputation'
 author:
 - name: "cristina "
  affiliation: 
  email: ""
 date: ""
 toc: true
 output:
  rmarkdown::github_document:
    toc: true # table of content true
    toc_depth: 3  # up to three levels of headings (specified by #, ## and ###)
    toc_float: true
    number_sections: true  ## if you want number sections at each table header
    theme: flatly     # many options for theme: https://rmarkdown.rstudio.com/html_document_format.html#tabbed_sections
    highlight: zenburn #espresso  # specifies the syntax highlighting style
    #css: my.css       # you can add your custom css, should be in same folder
    df_print: paged
 ---

 ```{r}
 library(imputeLCMD)
 ```


 We will use the `imputeLCMD` R package, slightly adapting how it is applied, to impute missing values in metabolomics data with a deterministic minimum value observed for the feature of interest. The code that generates the simulated dataset with missing values comes from the documentation of the `imputeLCMD` R package: https://www.rdocumentation.org/packages/imputeLCMD/versions/2.0/topics/impute.MinDet

 # Testing how imputeLCMD package works


 ## Generating the simulated expression data matrix

 ```{r}
 library(imputeLCMD)
 # Testing how imputeLCMD package works:
 #https://www.rdocumentation.org/packages/imputeLCMD/versions/2.0/topics/impute.MinDet

 # generate expression data matrix
 # Use fewer features and samples for preview convenience
 exprsDataObj = generate.ExpressionData(nSamples1 = 4, 
                                       nSamples2 = 2,
                                       meanSamples = 0,
                                       sdSamples = 0.2,
                                       nFeatures = 6, 
                                       nFeaturesUp = 2, 
                                       nFeaturesDown = 4,
                                       meanDynRange = 20, 
                                       sdDynRange = 1,
                                       meanDiffAbund = 1, 
                                       sdDiffAbund = 0.2)

 # Access the simulated dataset:
 exprsData = exprsDataObj[[1]]
 dim(exprsData) # 6 features x 6 samples (nSamples1 + nSamples2)

 colnames(exprsData) <- c('AD1', 'AD2', 'ctrlAD1', 'SQ1', 'SQ2', 'ctrlSQ1')
 rownames(exprsData) <- c('feat_1' , 'feat_2', 'feat_3', 'feat_4' , 'feat_5', 'feat_6')


 # Preview dataset,  features as rows
 exprsData
 ```

 ## Introduce missing values (55% MNAR) in the dataset

 ```{r}
 set.seed(12345)

 # Mean censoring threshold at the 40th percentile of the data,
 # so roughly 40% of the values end up missing
 m.THR = quantile(exprsData, 
                  probs = 0.40)
 sd.THR = 0.1

 # 55% of the missing values are missing not at random (MNAR):
 MNAR.rate = 55
 exprsData.MD.obj = insertMVs(exprsData, m.THR, sd.THR, MNAR.rate)
 exprsData.MD = exprsData.MD.obj[[2]]


 colnames(exprsData.MD) <- c('AD1', 'AD2', 'ctrlAD1', 'SQ1', 'SQ2', 'ctrlSQ1')
 rownames(exprsData.MD) <- c('feat_1' , 'feat_2', 'feat_3', 'feat_4' , 'feat_5', 'feat_6')

 exprsData.MD
 ```
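
 To make the idea of intensity-dependent (MNAR) missingness concrete, here is a minimal base R sketch of left-censoring below a noisy threshold. It is a simplified illustration only, not the algorithm used by `insertMVs`; the helper `censor_low` and the toy matrix `x` are hypothetical names introduced for this example.

 ```{r}
 set.seed(1)

 # Toy intensity matrix: 5 features (rows) x 4 samples (columns)
 x <- matrix(rnorm(20, mean = 20, sd = 1), nrow = 5,
             dimnames = list(paste0("feat_", 1:5), paste0("s", 1:4)))

 # Hypothetical helper: draw one threshold per cell from N(thr_mean, thr_sd)
 # and set every value that falls below its threshold to NA
 censor_low <- function(m, thr_mean, thr_sd) {
   thr <- matrix(rnorm(length(m), mean = thr_mean, sd = thr_sd), nrow = nrow(m))
   m[m < thr] <- NA
   m
 }

 x_md <- censor_low(x, thr_mean = quantile(x, probs = 0.40), thr_sd = 0.1)
 x_md
 mean(is.na(x_md))  # fraction of cells that were removed
 ```

 Because only low values are removed, the observed values are biased upwards, which is why replacing the gaps with a small value per feature is a common choice for this kind of missingness.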

 # Use `colMeans` on exprsData.MD to check the fraction of missingness per feature

 ```{r}
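 # Transposing puts features in columns, so colMeans() gives the fraction of NAs per feature
 missingness_percent_per_feat <- colMeans(is.na(t(exprsData.MD)))
 (missingness_percent_per_feat)
 # Features with at least 50% missing values
 missness_df <- data.frame('over50' = which(missingness_percent_per_feat >= 0.5))
 list_of_over50_missingness <- rownames(missness_df)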
 ```
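
 As a small self-contained check of the calculation above, the sketch below uses a hand-made toy matrix (an assumption for illustration, not the vignette data) to show that `colMeans` on the transposed matrix is the same as `rowMeans` on the original, and how the features at or above 50% missingness can be picked out by name.

 ```{r}
 # Toy matrix: 3 features (rows) x 4 samples (columns)
 m <- matrix(c( 1, NA,  3, NA,
               NA, NA, NA,  4,
                5,  6,  7,  8),
             nrow = 3, byrow = TRUE,
             dimnames = list(c("feat_1", "feat_2", "feat_3"),
                             c("s1", "s2", "s3", "s4")))

 # Fraction of missing values per feature (row)
 frac_missing <- rowMeans(is.na(m))
 frac_missing

 # Same result as transposing first and using colMeans()
 all.equal(frac_missing, colMeans(is.na(t(m))))

 # Names of the features with at least 50% missing values
 over50 <- names(frac_missing)[frac_missing >= 0.5]
 over50  # feat_1 and feat_2
 ```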



 ```{r}
 data_file = "C:/Users/bruno/Documents/test_data_MetImp.csv"
 uncorrected_df <- read.csv( data_file, 
                         sep =',', 
                         check.names = FALSE)

 sum(is.na(uncorrected_df))
 ```


 ```{r}
 slice_uncorrected_df <- uncorrected_df[1:8,1:8]
 slice_uncorrected_df
 uncorrected_df$group
 ```


 # Use good ol' base R
 ```{r}
 # Zero imputation: replace every NA with 0 (base R only, no extra packages needed)
 base_imput <- exprsData.MD
 base_imput[is.na(base_imput)] <- 0
 base_imput
 ```
 # Count all zeros in the matrix

 ```{r}
 sum(base_imput == 0) 
 ```

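 Find rows that contain zeros and report the percentage of zero values:
 ```{r}
 datamatrix <- base_imput
 # `XCMS_data` is not created in this vignette; "simulated" is a placeholder label used when it is absent
 dataset_ID <- if (exists("XCMS_data")) substr(colnames(XCMS_data[1]), start = 1, stop = 9) else "simulated"

 # Rows (features) with at least one zero
 has.zero <- apply(datamatrix, 1, function(row) any(row == 0))
 nZeros <- length(which(has.zero))
 nZeros
 has.zero

 # Percentage of zero cells over the whole matrix
 pctZeros <- 100 * sum(datamatrix == 0) / (dim(datamatrix)[1] * dim(datamatrix)[2])
 paste0(round(pctZeros, 1), "% zero values in the retention time filtered ", dataset_ID, " XCMS peaklist.")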
 ```


 # Try with Random Forests

 ```{r}
 # https://cran.r-project.org/web/packages/missForest/missForest.pdf
 # https://github.com/stekhoven/missForest/blob/master/vignettes/missForest_1.4.pdf
 library('missForest')

 # Pass a data frame (not a matrix), with samples as rows and variables (features) as columns
 RF <- missForest(data.frame(t(as.matrix(exprsData.MD))))
 cat("\n\n")
 t(as.matrix(exprsData.MD))  # data before imputation
 cat("\n\n")
 as.matrix(RF$ximp)          # data after imputation
 ```


 # Try with Tibshirani's kNN implementation
 R package: "impute - Imputation for microarray data"

 `data`: an expression matrix with genes as rows and samples as columns,
 i.e. features x observations ---> JAD_data.T

 ```{r}
 k_neighbours = 3
 library("impute") # Tibshirani 
 imputed_Tibshi_knn <- impute.knn( data      = as.matrix(exprsData.MD) , #genes as rows, samples in the columns 
                                   k        = k_neighbours,
                                   rowmax   = 0.5, 
                                   colmax   = 0.8, 
                                   maxp     = 1500, 
                                   rng.seed = 12345)
 imputed_Tibshi_knn$data
 ```

 # Perform missing data imputation on the feats x samples data matrix

 ```{r}
 # features as rows, samples as columns: the imputed value is taken per sample (column)
 (exprsData.imputed = impute.MinDet(exprsData.MD))
 ```

 Note that the imputation is performed column-wise: a missing value for feature X1 in sample_01, for example, is replaced by the lowest value found among the other features measured in sample_01.

 This is not what we want, especially with unscaled data, where the range of values can differ drastically between features.

 We want the imputation to be performed by feature instead, so that a missing value in any sample is replaced by the minimum value measured for that feature across all subjects.


 Thus, the expression matrix given as an argument to the `imputeLCMD::impute.MinDet` function should be a data matrix with subjects as rows and features as columns.
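
 The effect of the matrix orientation can be reproduced with base R alone. The sketch below uses a hypothetical helper, `impute_col_min`, that fills each column's gaps with that column's strict minimum. This is a simplification: according to the package documentation, `impute.MinDet` takes a low quantile (argument `q`) of each column, which is close to, but not necessarily equal to, the minimum. The toy matrix is also an assumption made for illustration.

 ```{r}
 # Hypothetical helper: replace the NAs of each column by that column's minimum
 impute_col_min <- function(m) {
   for (j in seq_len(ncol(m))) {
     m[is.na(m[, j]), j] <- min(m[, j], na.rm = TRUE)
   }
   m
 }

 # Two features on very different scales (rows), three samples (columns)
 x <- matrix(c(1000, NA, 1200,
                  1,  2,   NA),
             nrow = 2, byrow = TRUE,
             dimnames = list(c("feat_big", "feat_small"), c("s1", "s2", "s3")))

 # features x samples: feat_big in s2 is filled with 2, borrowed from feat_small
 by_sample <- impute_col_min(x)
 by_sample

 # samples x features: feat_big in s2 is filled with 1000, its own minimum
 by_feature <- t(impute_col_min(t(x)))
 by_feature
 ```

 Transposing before and after the call keeps the output in the original features x samples layout while still taking the minimum per feature.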

 Let's check again, with the transposed data matrix. Remember:<br>
 - n of samples = 6<br>
 - n of features = 6

 # Imputation with a transposed data matrix (samples x features)
 ```{r}

 # Transpose so that subjects are rows and features are columns;
 # the minimum is then taken per feature
 (exprsData.MD.T <- t(exprsData.MD))

 ```

 # Feature-wise imputation :)

 samples x features

 ```{r}
 (exprsData.imputed = impute.MinDet(exprsData.MD.T))
 ```

 As we observe above, with the expression matrix arranged as __samples x features__, the desired imputation is achieved.
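
 Since the data are simulated, the complete matrix is known, so the imputation methods tried here could also be compared numerically. A minimal sketch, assuming a root mean squared error over the imputed cells as the metric (the helper `imputation_error` and the toy matrices are hypothetical, introduced only for this example):

 ```{r}
 # Hypothetical helper: RMSE between true and imputed values,
 # computed only over the cells that were missing
 imputation_error <- function(truth, with_na, imputed) {
   mask <- is.na(with_na)
   sqrt(mean((truth[mask] - imputed[mask])^2))
 }

 # Toy complete data: 3 samples (rows) x 2 features (columns)
 truth <- matrix(c(10, 11, 12, 20, 21, 22), nrow = 3,
                 dimnames = list(c("s1", "s2", "s3"), c("feat_1", "feat_2")))

 # Remove one value per feature
 with_na <- truth
 with_na["s1", "feat_1"] <- NA
 with_na["s2", "feat_2"] <- NA

 # Zero imputation
 zero_imp <- with_na
 zero_imp[is.na(zero_imp)] <- 0

 # Per-feature (column-wise) minimum imputation
 min_imp <- apply(with_na, 2, function(col) {
   col[is.na(col)] <- min(col, na.rm = TRUE)
   col
 })

 err_zero <- imputation_error(truth, with_na, zero_imp)
 err_min  <- imputation_error(truth, with_na, min_imp)
 c(zero = err_zero, min = err_min)
 ```

 In this vignette the same helper could, for instance, be called as `imputation_error(t(exprsData), exprsData.MD.T, exprsData.imputed)`, since all three matrices are then in the samples x features orientation.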

 # Session info {.unnumbered}

 ```{r}
 sessioninfo::session_info()
 ```