@cgpu
Last active September 9, 2018 08:08
Imputation with `imputeLCMD::impute.MinDet`. Used with a transposed data frame so that the minimal value (the q-th quantile) is drawn from the feature's distribution of observed values and not from the instance's (aka patient, subject, animal, individual etc.).
---
title: 'Imputation of missing data:`impute.MinDet` function from the R package `imputeLCMD`, `missForest` for Random Forest imputation, `impute.knn` function for knn imputation'
author:
- name: "cristina "
affiliation:
email: ""
date: ""
toc: true
output:
  rmarkdown::github_document:
    toc: true # table of content true
    toc_depth: 3 # up to three depths of headings (specified by #, ## and ###)
    toc_float: true
    number_sections: true ## if you want number sections at each table header
    theme: flatly # many options for theme: https://rmarkdown.rstudio.com/html_document_format.html#tabbed_sections
    highlight: zenburn #espresso # specifies the syntax highlighting style
    #css: my.css # you can add your custom css, should be in same folder
    df_print: paged
---
```{r}
library(imputeLCMD)
```
We will use the `imputeLCMD` R package and slightly adjust its usage to impute missing values in metabolomics data with a deterministic minimal value observed in the feature of interest. The code that generates the simulated dataset with missing values comes from the documentation of the `imputeLCMD` R package: https://www.rdocumentation.org/packages/imputeLCMD/versions/2.0/topics/impute.MinDet
# Testing how imputeLCMD package works
## Generating the simulated expression data matrix
```{r}
library(imputeLCMD)
# Testing how imputeLCMD package works:
#https://www.rdocumentation.org/packages/imputeLCMD/versions/2.0/topics/impute.MinDet
# generate expression data matrix
# Use fewer features and samples for preview convenience
exprsDataObj = generate.ExpressionData(nSamples1 = 4,
                                       nSamples2 = 2,
                                       meanSamples = 0,
                                       sdSamples = 0.2,
                                       nFeatures = 6,
                                       nFeaturesUp = 2,
                                       nFeaturesDown = 4,
                                       meanDynRange = 20,
                                       sdDynRange = 1,
                                       meanDiffAbund = 1,
                                       sdDiffAbund = 0.2)
# Access the simulated dataset:
exprsData = exprsDataObj[[1]]
dim(exprsData) # 6 features x 6 samples (nSamples1 + nSamples2)
colnames(exprsData) <- c('AD1', 'AD2', 'ctrlAD1', 'SQ1', 'SQ2', 'ctrlSQ1')
rownames(exprsData) <- c('feat_1' , 'feat_2', 'feat_3', 'feat_4' , 'feat_5', 'feat_6')
# Preview dataset, features as rows
exprsData
```
## Introduce % MNAR missing values in the dataset
```{r}
set.seed(12345)
# Use the 40th percentile of the data as the mean censoring threshold
m.THR = quantile(exprsData,
                 probs = 0.40)
sd.THR = 0.1
# 55% of the inserted missing values are missing not at random (MNAR):
MNAR.rate = 55
exprsData.MD.obj = insertMVs(exprsData, m.THR, sd.THR, MNAR.rate)
exprsData.MD = exprsData.MD.obj[[2]]
colnames(exprsData.MD) <- c('AD1', 'AD2', 'ctrlAD1', 'SQ1', 'SQ2', 'ctrlSQ1')
rownames(exprsData.MD) <- c('feat_1' , 'feat_2', 'feat_3', 'feat_4' , 'feat_5', 'feat_6')
exprsData.MD
```
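To make the idea of "missing not at random" concrete, here is a minimal base R sketch of left-censoring: every value below a threshold is set to `NA`, so missingness depends on the value itself. This is only an illustration under a simplifying assumption (a single hard threshold); it is not how `insertMVs` is implemented, and the `toy` and `toy_mnar` objects are hypothetical names introduced for this example.

```r
set.seed(1)
# Hypothetical toy data: 5 features x 4 samples
toy <- matrix(rnorm(20, mean = 20, sd = 1),
              nrow = 5,
              dimnames = list(paste0("feat_", 1:5), paste0("s", 1:4)))

# Hard censoring threshold at the 40th percentile of all values
threshold <- quantile(toy, probs = 0.40)

# Left-censoring: low values go missing, high values are always observed
toy_mnar <- toy
toy_mnar[toy < threshold] <- NA

toy_mnar
# Fraction of missing cells in the censored matrix
mean(is.na(toy_mnar))
```

Because only the low end of the distribution is removed, the observed values are biased upwards, which is why a low-value replacement such as a per-feature minimum is a common choice for this kind of missingness.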
# Use `colMeans` on exprsData.MD to check percent of missingness per feature:
```{r}
# Fraction (0 to 1) of missing values per feature; after t() the features are the columns
missingness_percent_per_feat <- colMeans(is.na(t(exprsData.MD)))
(missingness_percent_per_feat)
# Names of the features with at least 50% missing values
list_of_over50_missingness <- names(which(missingness_percent_per_feat >= 0.5))
```
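As a small sanity check, the sketch below shows on a hypothetical toy matrix (features as rows) that `colMeans` on the transposed matrix gives the same per-feature missingness as `rowMeans` on the original, so either form can be used. The object names are made up for this example.

```r
# Hypothetical toy matrix: 2 features x 4 samples
toy <- matrix(c(1, NA, NA, 4,
                5,  6,  7, NA),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("feat_a", "feat_b"), paste0("s", 1:4)))

# Per-feature fraction of missing values, computed two ways
by_row      <- rowMeans(is.na(toy))
by_col_of_t <- colMeans(is.na(t(toy)))
all.equal(by_row, by_col_of_t)

# Features with at least 50% missing values
over50 <- names(which(by_row >= 0.5))
over50
```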
```{r}
data_file = "C:/Users/bruno/Documents/test_data_MetImp.csv"
uncorrected_df <- read.csv( data_file,
sep =',',
check.names = FALSE)
sum(is.na(uncorrected_df))
```
```{r}
slice_uncorrected_df <- uncorrected_df[1:8,1:8]
slice_uncorrected_df
uncorrected_df$group
```
# Use good ol' R base:
```{r}
# Replace every missing value with 0 using base R indexing (no extra packages needed)
base_imput <- exprsData.MD
base_imput[is.na(base_imput)] <- 0
base_imput
```
# Count all zeros in dataframe
```{r}
sum(base_imput == 0)
```
Find rows containing zeros and report the percentage of zero values:
```{r}
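datamatrix <- base_imput
# NOTE: `XCMS_data` is not defined in this document; fall back to an empty ID if it is absent
dataset_ID <- if (exists("XCMS_data")) substr(colnames(XCMS_data[1]), start = 1, stop = 9) else ""
# Flag rows (features) that contain at least one zero
has.zero <- apply(datamatrix, 1, function(row) any(row == 0))
# Number of rows with at least one zero
nZeros <- length(which(has.zero))
nZeros
has.zero
# Percentage of zero cells over the whole matrix
percent_zero <- 100 * sum(datamatrix == 0) / (dim(datamatrix)[1] * dim(datamatrix)[2])
paste0(percent_zero, "% zero values in the retention time filtered ", dataset_ID, " XCMS peaklist.")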
```
# Try with Random Forests
```{r}
# https://cran.r-project.org/web/packages/missForest/missForest.pdf
# https://github.com/stekhoven/missForest/blob/master/vignettes/missForest_1.4.pdf
library('missForest')
# missForest expects variables (features) as columns, so transpose first.
# Have it as a data frame, not a matrix!
RF <- missForest(data.frame(t(as.matrix(exprsData.MD)))) # data with NA values, columns == variables
cat("\n\n")
t(as.matrix(exprsData.MD))
cat("\n\n")
as.matrix(RF$ximp)
```
# Try with Tibshirani's kNN implementation:
R package: `impute` ("Imputation for microarray data").
The `data` argument is an expression matrix with genes as rows and samples as columns,
AKA features x observations ---> JAD_data.T
```{r}
k_neighbours = 3
library("impute") # Tibshirani
imputed_Tibshi_knn <- impute.knn(data = as.matrix(exprsData.MD), # genes as rows, samples in the columns
                                 k = k_neighbours,
                                 rowmax = 0.5,
                                 colmax = 0.8,
                                 maxp = 1500,
                                 rng.seed = 12345)
imputed_Tibshi_knn$data
```
# Perform missing data imputation on the feats x samples data matrix
```{r}
# features are rows here, so the minimal value is taken per sample (column):
(exprsData.imputed = impute.MinDet(exprsData.MD))
```
Note that the imputation is performed column-wise: a missing value for feature X1 in sample_01, for example, is replaced by a minimal value (the `q`-th quantile, `q = 0.01` by default) of the values observed for the other features in sample_01.
This is not what we want, especially with unscaled data, where the range of values can differ drastically between features.
To impute by feature instead, a missing value in any sample should be replaced by the minimal value measured for that feature across all subjects.
Thus, the expression matrix passed to `imputeLCMD::impute.MinDet` should be a data matrix with subjects as rows and features as columns.
Let's check again with the transposed data matrix. Remember:<br>
- n of samples = 6 (4 + 2)<br>
- n of features = 6
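The difference between the two orientations can be sketched in base R without the package. The helper `min_det_by_column` below is a hypothetical stand-in written for this illustration (it is not the code of `impute.MinDet`); it assumes the behaviour described above, namely that each column's missing values are filled with that column's low quantile. With `q = 0` the quantile is exactly the column minimum, which makes the toy result easy to read.

```r
# Hypothetical helper: fill NAs in each column with that column's q-th quantile
min_det_by_column <- function(x, q = 0.01) {
  low <- apply(x, 2, quantile, probs = q, na.rm = TRUE)
  for (j in seq_len(ncol(x))) {
    x[is.na(x[, j]), j] <- low[j]
  }
  x
}

# Toy matrix, features as rows; the two features live on very different scales
toy <- matrix(c(  1,  2,  NA,
                100, NA, 300),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("feat_a", "feat_b"), c("s1", "s2", "s3")))

# Features x samples: NAs are filled from the same sample (other features)
by_sample <- min_det_by_column(toy, q = 0)

# Samples x features (transposed): NAs are filled from the same feature
by_feature <- t(min_det_by_column(t(toy), q = 0))

by_sample
by_feature
```

In `by_sample` the missing `feat_b` value is filled from `feat_a`'s scale and vice versa, while in `by_feature` each gap is filled with a value from the same feature, which is the behaviour we are after.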
# Imputation with a transposed data matrix (samples x features)
```{r}
# transpose so that subjects are rows and features are columns,
# which makes the minimal value feature-wise:
(exprsData.MD.T <- t(exprsData.MD))
```
# Feature - wise imputation :)
samples x features
```{r}
(exprsData.imputed = impute.MinDet(exprsData.MD.T))
```
As observed above, with the expression matrix arranged as __samples x features__, the desired feature-wise imputation is achieved.
# Session info {.unnumbered}
```{r}
sessioninfo::session_info()
```