The Z_r averaging transform in R; very useful for studying the statistical properties of sparse data
# Z_r (or "averaging") transform functions, based on:
#
# Kenneth W. Church and William A. Gale. 1991. A comparison of the enhanced
# Good-Turing and deleted estimation methods for estimating probabilities of
# English bigrams. Computer Speech and Language 5(1):19--54
#
# Kyle Gorman <[email protected]>
#
# Church and Gale do not say what is to be done about points at the edges. I
# have chosen to average them with respect to only the inward-facing frequency,
# which seems consistent to me with what Church and Gale had in mind. Comments
# are welcome about this choice, of course.
#
# I am making this code available because several people have told me that it's
# not obvious.
#
# There is one version for r/n_r vectors, and one for a single integer frequency
# vector f (if it's already probabilities, multiply it out)
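Zr.nr <- function(r, nr) {
    # compute a smoothed freq distribution using Z_r statistic
    # r: distinct frequencies, assumed to be in increasing order
    # nr: number of types observed with each frequency in r
    # the edge rules below need at least two points
    stopifnot(length(r) == length(nr), length(nr) >= 2)
    zro <- nr
    # left edge: divide by the gap to the only (inward) neighbour
    zro[1] <- zro[1] / (r[2] - r[1])
    L <- length(nr)
    i <- 2
    while (i < L) {
        # interior points: divide by half the gap between the two neighbours
        zro[i] <- 2 * zro[i] / (r[i + 1] - r[i - 1])
        i <- i + 1
    }
    # right edge: again use only the inward gap
    zro[L] <- zro[L] / (r[L] - r[L - 1])
    return(zro)
}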
Zr.f <- function(f) {
    # f is a vector of integer frequencies. returns a data frame for plotting
    q <- rle(sort(f))
    r <- q$values
    nr <- q$lengths
    Zr <- Zr.nr(r, nr)
    return(data.frame(r, nr, Zr))
}
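
# The following is a minimal, self-contained sketch of the same transform
# written without an explicit loop. It is not part of the original code:
# the name Zr.nr.vec is hypothetical, and the sketch assumes that r is in
# increasing order and has at least two entries. It keeps the edge choice
# described above (edge points use only the inward-facing gap).
Zr.nr.vec <- function(r, nr) {
    L <- length(nr)
    stopifnot(length(r) == L, L >= 2)
    lower <- c(r[1], r[-L])  # r[i - 1]; the left edge falls back to r[1]
    upper <- c(r[-1], r[L])  # r[i + 1]; the right edge falls back to r[L]
    # interior points: half the gap between the two neighbours
    width <- (upper - lower) / 2
    # edge points: the full inward gap, so undo the halving there
    width[c(1, L)] <- 2 * width[c(1, L)]
    nr / width
}
# Made-up example data, for illustration only:
#   Zr.nr.vec(c(1, 2, 3, 5, 10), c(120, 40, 24, 13, 2))
# The interior point r = 3 has neighbours 2 and 5, so its value is
# 2 * 24 / (5 - 2) = 16; the last point uses only the gap 10 - 5, giving
# 2 / 5 = 0.4.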