Last active
September 9, 2018 08:08
Imputation with `imputeLCMD::impute.MinDet`, used with a transposed data frame so that the q-th quantile minimal value is drawn from the feature's distribution of observed values and not from the instance's (aka patient, subject, animal, individual etc.).
---
title: 'Imputation of missing data: `impute.MinDet` function from the R package `imputeLCMD`, `missForest` for Random Forest imputation, `impute.knn` function for kNN imputation'
author:
- name: "cristina "
  affiliation:
  email: ""
date: ""
toc: true
output:
  rmarkdown::github_document:
    toc: true # table of contents
    toc_depth: 3 # up to three levels of headings (specified by #, ## and ###)
    toc_float: true
    number_sections: true # number the section headings
    theme: flatly # many options for theme: https://rmarkdown.rstudio.com/html_document_format.html#tabbed_sections
    highlight: zenburn #espresso # specifies the syntax highlighting style
    #css: my.css # you can add your custom css, should be in same folder
    df_print: paged
---
```{r}
library(imputeLCMD)
```
We will use the `imputeLCMD` R package and slightly adjust its use to impute missing values of metabolomics data with a deterministic minimum value observed in the feature of interest. The code that generates the simulated dataset with missing values comes from the documentation of the `imputeLCMD` R package: https://www.rdocumentation.org/packages/imputeLCMD/versions/2.0/topics/impute.MinDet
# Testing how imputeLCMD package works
## Generating the simulated expression data matrix
```{r}
library(imputeLCMD)
# Testing how the imputeLCMD package works:
# https://www.rdocumentation.org/packages/imputeLCMD/versions/2.0/topics/impute.MinDet
# Generate the expression data matrix
# Use fewer features and samples for preview convenience
exprsDataObj = generate.ExpressionData(nSamples1 = 4,
                                       nSamples2 = 2,
                                       meanSamples = 0,
                                       sdSamples = 0.2,
                                       nFeatures = 6,
                                       nFeaturesUp = 2,
                                       nFeaturesDown = 4,
                                       meanDynRange = 20,
                                       sdDynRange = 1,
                                       meanDiffAbund = 1,
                                       sdDiffAbund = 0.2)
# Access the simulated dataset:
exprsData = exprsDataObj[[1]]
dim(exprsData) # 6 features x 6 samples (nSamples1 + nSamples2)
colnames(exprsData) <- c('AD1', 'AD2', 'ctrlAD1', 'SQ1', 'SQ2', 'ctrlSQ1')
rownames(exprsData) <- c('feat_1', 'feat_2', 'feat_3', 'feat_4', 'feat_5', 'feat_6')
# Preview dataset, features as rows
exprsData
```
## Introduce missing values (partly MNAR) in the dataset
```{r}
set.seed(12345)
# Censoring threshold: its mean is set to the 40th percentile of all values
m.THR = quantile(exprsData,
                 probs = 0.40)
sd.THR = 0.1
# Percentage of the inserted missing values that are missing not at random (MNAR):
MNAR.rate = 55
exprsData.MD.obj = insertMVs(exprsData, m.THR, sd.THR, MNAR.rate)
exprsData.MD = exprsData.MD.obj[[2]]
colnames(exprsData.MD) <- c('AD1', 'AD2', 'ctrlAD1', 'SQ1', 'SQ2', 'ctrlSQ1')
rownames(exprsData.MD) <- c('feat_1', 'feat_2', 'feat_3', 'feat_4', 'feat_5', 'feat_6')
exprsData.MD
```
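To make the idea of left-censored (MNAR) missingness more concrete, here is a minimal base-R sketch. It is not the `insertMVs` implementation; the names `toy`, `toy_thr` and `toy_censored` are made up for illustration, and the sketch only shows the general idea that values falling below a noisy threshold are the ones that go missing.

```{r}
# Illustrative sketch only: NOT the insertMVs() algorithm
set.seed(1)
toy <- matrix(rnorm(24, mean = 20, sd = 1), nrow = 6,
              dimnames = list(paste0("feat_", 1:6), paste0("s", 1:4)))
# One noisy censoring threshold per cell, centered on the 40th percentile
toy_thr <- matrix(rnorm(length(toy), mean = quantile(toy, 0.40), sd = 0.1),
                  nrow = nrow(toy))
toy_censored <- toy
toy_censored[toy < toy_thr] <- NA   # low values are the ones that go missing
toy_censored
# Compare the mean of the censored values with the mean of the retained ones
c(censored = mean(toy[is.na(toy_censored)]),
  retained = mean(toy[!is.na(toy_censored)]))
```

Because the missing entries are concentrated at the low end of the distribution, replacing them with a low value (as `impute.MinDet` does) is a more natural choice than replacing them with a mean or median.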
# Use `colMeans` on exprsData.MD to check the fraction of missingness per feature:
```{r}
# Transpose so that features are columns; values are fractions between 0 and 1
missingness_percent_per_feat <- colMeans(is.na(t(exprsData.MD)))
(missingness_percent_per_feat)
# Keep the names of the features with at least 50% missing values
missness_df <- data.frame('over50' = which(missingness_percent_per_feat >= 0.5))
list_of_over50_missingness <- rownames(missness_df)
```
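The same per-feature check can also be written without the transpose. The sketch below uses a small hand-made matrix (the names `toy_na` and `frac_missing` are hypothetical) to show that `rowMeans` on a features x samples matrix gives the same numbers, and that `names(which(...))` returns the flagged features directly.

```{r}
# Hand-made features x samples matrix with known missingness
toy_na <- matrix(c(1,  NA, NA, 4,
                   5,  6,  7,  8,
                   NA, NA, NA, 12),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("feat_a", "feat_b", "feat_c"), paste0("s", 1:4)))
# Fraction of missing values per feature (row)
frac_missing <- rowMeans(is.na(toy_na))
frac_missing
# Same numbers as the transposed colMeans() call
all.equal(frac_missing, colMeans(is.na(t(toy_na))))
# Names of the features with at least 50% missing values
names(which(frac_missing >= 0.5))
```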
```{r}
# Local file on the author's machine; point `data_file` to your own CSV
data_file = "C:/Users/bruno/Documents/test_data_MetImp.csv"
uncorrected_df <- read.csv(data_file,
                           sep = ',',
                           check.names = FALSE)
# Total number of missing values
sum(is.na(uncorrected_df))
```
```{r}
# Preview the first 8 rows and 8 columns
slice_uncorrected_df <- uncorrected_df[1:8, 1:8]
slice_uncorrected_df
uncorrected_df$group
```
# Use good ol' R base:
```{r}
# Replace every missing value with 0 (no extra packages needed)
base_imput <- exprsData.MD
base_imput[is.na(base_imput)] <- 0
base_imput
```
# Count all zeros in dataframe
```{r}
sum(base_imput == 0)
```
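Find the rows that contain zeros and report the percentage of zero values:
```{r}
datamatrix <- base_imput
# `XCMS_data` is not defined in this document; fall back to a generic label if it is absent
dataset_ID <- if (exists("XCMS_data")) substr(colnames(XCMS_data[1]), start = 1, stop = 9) else "simulated"
# Flag rows (features) that contain at least one zero
has.zero <- apply(datamatrix, 1, function(row) any(row == 0))
nZeros <- length(which(has.zero)) # number of rows with at least one zero
nZeros
has.zero
# Percentage of zero cells over all cells of the matrix
pct_zero <- 100 * sum(datamatrix == 0) / prod(dim(datamatrix))
paste0(round(pct_zero, 1), "% zero values in the retention time filtered ", dataset_ID, " XCMS peaklist.")
```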
# Try with Random Forests
```{r}
# https://cran.r-project.org/web/packages/missForest/missForest.pdf
# https://github.com/stekhoven/missForest/blob/master/vignettes/missForest_1.4.pdf
library('missForest')
# Pass a data frame, not a matrix, with variables (features) as columns
RF <- missForest(data.frame(t(as.matrix(exprsData.MD))))
cat("\n\n")
# Input with NA values (samples x features)
t(as.matrix(exprsData.MD))
cat("\n\n")
# Imputed data
as.matrix(RF$ximp)
```
# Try with Tibshirani's kNN implementation:
R package: "impute - Imputation for microarray data".<br>
The `data` argument is an expression matrix with genes as rows and samples as columns.<br>
AKA: features x observations ---> JAD_data.T
```{r}
k_neighbours = 3
library("impute") # Tibshirani
imputed_Tibshi_knn <- impute.knn(data = as.matrix(exprsData.MD), # genes as rows, samples as columns
                                 k = k_neighbours,
                                 rowmax = 0.5,
                                 colmax = 0.8,
                                 maxp = 1500,
                                 rng.seed = 12345)
imputed_Tibshi_knn$data
```
# Perform missing data imputation on the feats x samples data matrix
```{r}
# Features as rows here, so the minimal value is taken within each sample (column):
(exprsData.imputed = impute.MinDet(exprsData.MD))
```
Note that the imputation is performed column-wise: a missing value for feature X1 in sample_01, for example, is replaced by a minimal value (the q-th quantile, `q = 0.01` by default) of the values observed for the other features in sample_01.
This is not what we want, especially with non-scaled data, where the range of values can differ drastically between features.
To impute by feature instead, a missing value in any sample should be replaced by the minimal value measured for that feature across all subjects.
Thus, the expression matrix given as an argument to the `imputeLCMD::impute.MinDet` function should be a data matrix with subjects as rows and features as columns.
Let's check again, with the transposed data matrix. Remember:<br>
- n of samples = 6 (4 + 2)<br>
- n of features = 6
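
The effect of the matrix orientation can be seen on a tiny example in base R. The helper `min_det_sketch` below is hypothetical: it mirrors the documented behaviour of `impute.MinDet` (each `NA` is replaced by the q-th quantile of the observed values in its column) but it is not the package code, and the matrix `toy_fs` is made up for illustration.

```{r}
# Hypothetical helper mirroring the documented behaviour of impute.MinDet():
# each NA is replaced by the q-th quantile of the observed values in its column
min_det_sketch <- function(x, q = 0.01) {
  low <- apply(x, 2, quantile, probs = q, na.rm = TRUE)
  for (j in seq_len(ncol(x))) x[is.na(x[, j]), j] <- low[j]
  x
}
# Two features on very different scales, three samples
toy_fs <- matrix(c(100, NA, 120,
                   1,   2,  NA),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("feat_big", "feat_small"), c("s1", "s2", "s3")))
# features x samples: NAs are filled from the same sample (column)
min_det_sketch(toy_fs)
# samples x features, then transposed back: NAs are filled from the same feature
t(min_det_sketch(t(toy_fs)))
```

In the first call the missing `feat_big` value in `s2` is filled with 2, a value borrowed from `feat_small`, which is on a completely different scale. In the second call it is filled with 100.2, the 1% quantile of the observed `feat_big` values.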
# Imputation with a transposed data matrix (samples x features)
```{r}
# Transpose so that subjects are rows and the minimal value is taken per feature (column):
(exprsData.MD.T <- t(exprsData.MD))
```
# Feature-wise imputation :)
samples x features
```{r}
(exprsData.imputed = impute.MinDet(exprsData.MD.T))
```
As observed above, with the expression matrix arranged as __samples x features__ the desired imputation is achieved.
# Session info {.unnumbered}
```{r}
sessioninfo::session_info()
```