Last active
January 13, 2016 17:10
A function to back fill NAs in a vector with the last non-NA value
# A function to back fill NAs in a vector with the last non-NA value
# https://gist.github.com/garyfeng/27e7f8e406192a8cb33a
backFillNA <- function(x) {
  nas <- which(is.na(x))
  # nothing to fill: return early when there are no NAs
  if (length(nas) == 0) return(x)
  # split the NA positions into runs of consecutive indices
  # trick from http://stackoverflow.com/questions/24837401/find-consecutive-values-in-vector-in-r
  naList <- split(nas, cumsum(c(1, diff(nas) != 1)))
  # get the backfill values, one run at a time
  valueList <- lapply(naList, function(nalist) {
    prev <- nalist[1] - 1
    #print(nalist)
    # a run starting at position 1 has no previous value; x[1] is NA, so the run stays NA
    if (prev <= 0) prev <- 1
    x[nalist] <- x[prev]
    return(x[nalist])
  })
  # now back fill
  # can't use unlist(), as it converts everything to a common class.
  # see http://stackoverflow.com/questions/13859905/returning-a-vector-of-class-posixct-with-vapply
  #x[unlist(naList)] <- unlist(valueList)
  x[unlist(naList)] <- do.call(c, valueList)
  return(x)
}

x <- c("A", "B", NA, NA, "C", NA, NA, NA, NA, "D", NA, NA)
#> x
# [1] "A" "B" NA NA "C" NA NA NA NA "D" NA NA
#> backFillNA(x)
# [1] "A" "B" "B" "B" "C" "C" "C" "C" "C" "D" "D" "D"
########################
# The following recursive solution is taken from:
# https://stat.ethz.ch/pipermail/r-help/2008-July/169199.html
# It uses Recall(), R's mechanism for recursive calls, to do the back filling.
# It works well for short vectors, but in a test on a vector of 20k elements with many NAs
# the recursion failed after a long while.
# Note: each call fills only the first NA of every run, so the recursion depth grows with
# the length of the longest NA run. A leading NA is never filled, so in that case the
# recursion cannot terminate on its own.
rna <- function(z) {
  y <- c(NA, head(z, -1))
  z <- ifelse(is.na(z), y, z)
  if (any(is.na(z))) Recall(z) else z
}
#> x
# [1] "A" "B" NA NA "C" NA NA NA NA "D" NA NA
#> rna(x)
# [1] "A" "B" "B" "B" "C" "C" "C" "C" "C" "D" "D" "D"
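The recursive version shifts the vector by one position per call, so a long run of NAs needs many nested calls. A minimal sketch of the same shift-and-fill idea written as a loop is below. The name `rnaIter` is hypothetical and not part of the original gist; the extra check for "no progress" is an added assumption so that leading NAs do not cause an endless loop.

```r
# Iterative variant of the shift-and-fill idea (hypothetical helper, not from the gist)
rnaIter <- function(z) {
  repeat {
    na <- is.na(z)
    # done when nothing is left to fill
    if (!any(na)) return(z)
    # value of the previous element at each position
    shifted <- c(NA, head(z, -1))
    # fill each NA from its left neighbour (which may itself still be NA)
    z[na] <- shifted[na]
    # no progress means only leading NAs remain; stop instead of looping forever
    if (identical(is.na(z), na)) return(z)
  }
}

x <- c("A", "B", NA, NA, "C", NA, NA, NA, NA, "D", NA, NA)
rnaIter(x)
```

This avoids deep recursion, but it still makes one pass over the vector per element of the longest NA run, so it is not expected to be fast on data with long runs.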
Author
Compared the non-recursive version with a simple for-loop-based method:
loopFillNA <- function(x) {
  # last non-NA value seen so far
  lastV <- x[1]
  # seq_along() is safe for zero-length input, unlike 1:length(x)
  for (i in seq_along(x)) {
    if (is.na(x[i])) x[i] <- lastV else lastV <- x[i]
  }
  x
}
We have a large numeric vector t0 with mostly NAs.
Testing results:
> system.time(loopFillNA(t0[1:10000]))
   user  system elapsed
  1.366   0.977   2.344
> system.time(rna(t0[1:10000]))
   user  system elapsed
  0.961   0.238   1.198
> system.time(backFillNA(t0[1:10000]))
   user  system elapsed
  0.014   0.001   0.015
Author
Hmm... the performance of the non-recursive version may depend on how many NAs there are, or more precisely on how many runs of consecutive NAs there are in the data, since backFillNA() does one lapply() iteration per run. We need to test this.
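As a starting point for such a test, here is a small sketch that counts NA runs with `rle()` and builds two synthetic vectors with the same number of NAs but very different run counts. The helper name `countNARuns` and the vectors `oneRun` and `manyRuns` are made up for illustration; they are not the `t0` data used above.

```r
# Count the runs of consecutive NAs in a vector (hypothetical helper)
countNARuns <- function(x) {
  r <- rle(is.na(x))
  # r$values is TRUE for runs of NAs, FALSE for runs of non-NA values
  sum(r$values)
}

n <- 10000
# 5000 NAs in a single long run
oneRun <- c(1, rep(NA, n / 2), seq_len(n / 2 - 1))
# 5000 NAs, each in its own run of length 1
manyRuns <- rep(c(1, NA), n / 2)

countNARuns(oneRun)    # 1
countNARuns(manyRuns)  # 5000
```

Timing `backFillNA()` on both vectors with `system.time()` would show whether the run count, rather than the NA count, drives the cost.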
Author
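cf. the pandas implementation of forward and backward NA filling at http://pandas.pydata.org/pandas-docs/stable/basics.html#basics-reindexing and http://pandas.pydata.org/pandas-docs/stable/missing_data.html#cleaning-filling-missing-data. Note that what this gist calls "back filling" (carrying the last non-NA value forward) is what pandas calls a forward fill.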
Tried this with a vector of 29K elements, and the recursion was too deep.
Thinking about a non-recursive approach:
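One possible non-recursive sketch, using only base R, is a fully vectorized fill based on `cummax()`. This is a different technique from the run-splitting used in `backFillNA()`; the function name `fillNAForward` is hypothetical. It indexes into `x` rather than combining values, so it should keep the class of `x`, and it leaves leading NAs as NA.

```r
# Vectorized fill with the last non-NA value (hypothetical helper, base R only)
fillNAForward <- function(x) {
  # position of each non-NA element, 0 where the element is NA
  pos <- seq_along(x) * !is.na(x)
  # running maximum gives the index of the most recent non-NA element
  idx <- cummax(pos)
  # 0 means no non-NA value has been seen yet; index with NA so the result stays NA
  idx[idx == 0] <- NA
  x[idx]
}

x <- c("A", "B", NA, NA, "C", NA, NA, NA, NA, "D", NA, NA)
fillNAForward(x)
# [1] "A" "B" "B" "B" "C" "C" "C" "C" "C" "D" "D" "D"
```

Because there is no loop over elements or runs, the cost should not depend on how the NAs are grouped.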