[TOC]

duplicates

table(complete.cases(df)) # counts rows that are complete (TRUE) vs. contain at least one NA (FALSE)
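The heading mentions duplicates, so here is a minimal base R sketch for finding and dropping duplicated rows. The data frame `df_demo` and the variable names are hypothetical, not from the original notes.

```r
# hypothetical toy data; row 3 repeats row 1
df_demo <- data.frame(id = c(1, 2, 1, 3), x = c("a", "b", "a", "c"))

dup_rows  <- duplicated(df_demo)   # logical vector, TRUE for every repeated row
n_dups    <- sum(dup_rows)         # number of duplicated rows
df_unique <- df_demo[!dup_rows, ]  # keep the first occurrence of each row
```

`unique(df_demo)` should give the same rows as `df_unique`; the `duplicated()` form is handy when you also want to inspect which rows were repeats.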

Imputing

impute missing values with linear regression

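# assumes full_imp is a data.table with a numeric column `age`
imput_age <- lm(age ~ ., data = full_imp) # rows with NA in age are dropped when fitting
summary(imput_age)
# predict age for all rows where age is NA
imp_age <- predict(imput_age, newdata = full_imp[is.na(age)])
# write the predictions back into the missing cells
full_imp[is.na(age), age := imp_age]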
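The same idea can be sketched in base R without data.table. The toy data frame `toy` and its columns are made up for illustration; here `age` is exactly `10 * fare`, so the imputed values are easy to check by eye.

```r
# hypothetical toy data: age = 10 * fare, with two ages missing
toy <- data.frame(
  age  = c(20, 30, NA, 50, NA, 70),
  fare = c(2, 3, 4, 5, 6, 7)
)

fit  <- lm(age ~ fare, data = toy)  # lm() drops rows with NA in age by default
miss <- is.na(toy$age)              # rows that need imputing
toy$age[miss] <- predict(fit, newdata = toy[miss, , drop = FALSE])
```

Note that `predict()` returns NA for any row whose predictors are themselves missing, so with real data some ages may stay NA after this step.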

Missing data

overview

sapply(dataframe, function(x) sum(is.na(x)))
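A small sketch of the same overview on a made-up data frame (`df_na` is hypothetical), using `colSums()` and `colMeans()` to get both the count and the share of missing values per column:

```r
# hypothetical toy data with a few missing values
df_na <- data.frame(
  a = c(1, NA, 3, NA),
  b = c("x", "y", NA, "z"),
  c = 1:4
)

na_count <- colSums(is.na(df_na))   # number of NAs per column
na_share <- colMeans(is.na(df_na))  # fraction of NAs per column
```

The share is often easier to read than the raw count when columns are compared against a cutoff such as 70%.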

all missing

mvc <- sapply(ds[vars], function(x) sum(is.na(x)))
mvn <- names(which(mvc == nrow(ds)))
ignore <- union(ignore, mvn)

at least 70% missing

mvc <- sapply(ds[vars], function(x) sum(is.na(x)))
mvn <- names(which(mvc >= 0.7*nrow(ds)))
ignore <- union(ignore, mvn)
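The two snippets above differ only in the cutoff, so they can be folded into one helper. This is a sketch under assumptions: `high_na_vars` and `toy_ds` are hypothetical names, and `threshold = 1` corresponds to the "all missing" case.

```r
# hypothetical helper: names of columns whose NA share is >= threshold
high_na_vars <- function(ds, vars = names(ds), threshold = 0.7) {
  na_share <- colMeans(is.na(ds[vars]))
  names(na_share)[na_share >= threshold]
}

# hypothetical toy data
toy_ds <- data.frame(
  full      = 1:10,
  mostly_na = c(1, 2, rep(NA, 8)),  # 80% missing
  all_na    = NA_real_              # 100% missing
)

ignore <- character(0)
ignore <- union(ignore, high_na_vars(toy_ds, threshold = 0.7))
```

Calling `high_na_vars(toy_ds, threshold = 1)` would flag only the column that is entirely missing.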

Normalize

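library(rattle) # provides normVarNames()
factors <- which(sapply(ds[vars], is.factor))
# loop over column names: the positions in `factors` refer to ds[vars], not to ds
for (f in names(factors)) levels(ds[[f]]) <- normVarNames(levels(ds[[f]]))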

remove rows

(NA values)

na.omit(merge) # drop rows with an NA in any column
merge[complete.cases(merge[,2:3]),] # keep rows that have no NA in columns 2 and 3
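To see the difference between the two calls, a minimal sketch on a made-up data frame (`m` and its columns are hypothetical):

```r
# hypothetical toy data: only row 4 is complete in every column
m <- data.frame(
  id = 1:4,
  v1 = c(1, NA, 3, 4),
  v2 = c("a", "b", NA, "d"),
  v3 = c(NA, 1, 1, 1)
)

all_complete <- na.omit(m)                    # NA in any column drops the row
cols23       <- m[complete.cases(m[, 2:3]), ] # only columns 2 and 3 (v1, v2) are checked
```

Row 1 survives the second call because its only NA sits in `v3`, which is not among the checked columns.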

remove rows with missing target

ds <- ds[!is.na(ds[[target]]), ] # target holds the name of the target column
sum(is.na(ds[[target]])) # should now be 0