Daniel J. Hocking djhocking

Predict vs simulate in lme4

For this investigation we are going to use the sleepstudy data set from the lme4 package. Here is the head of the data frame:
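
As a minimal sketch of that setup, assuming the intended data set is `sleepstudy` (the sleep-deprivation data that ships with lme4) and using a random-slope model chosen here purely for illustration, the following shows the head of the data frame and the basic difference between `predict()` and `simulate()`:

```r
library(lme4)

# sleepstudy ships with lme4: reaction time by days of sleep deprivation
data(sleepstudy)
head(sleepstudy)

# Illustrative model (an assumption, not necessarily the one used in the post):
# random intercept and slope for each subject
fit <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)

# predict() returns deterministic fitted values, one per row,
# conditional on the estimated random effects by default
pred <- predict(fit)

# simulate() draws new responses from the fitted model, so each
# column is a different random realization
set.seed(1)
sims <- simulate(fit, nsim = 5)

length(pred)  # one prediction per observation
dim(sims)     # rows = observations, columns = simulated data sets
```

Running `predict()` twice gives identical values, while each call to `simulate()` without a fixed seed gives a different draw.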

	# make fake dataset
	df <- data.frame(x = runif(100, 0, 1), y = rnorm(100, 10, 3), z = rpois(100, 10))

	# subset dataframe
	df_sub <- df[which(df$x >= 0.75), ]

	# subset using dplyr
	library(dplyr)

	df_sub2 <- df %>%
	  filter(x >= 0.75) # same subset as df_sub above

	# ggplot boxplot groups of continuous x

	### Daniel J. Hocking

	I am trying to make a boxplot with ggplot2 in R where the x-axis is continuous but there are paired boxplots for each value on the x-axis, based on another factor with two possible values. I want boxplots arranged by number of survey years on the x-axis and paired by spatialTF (2 boxplots for every value of n_years), even though the n_years values are not evenly spaced.

	This plot gets the paired boxplots correct by year but the years on the x-axis are evenly spaced and don't reflect the actual (continuous) time between years.

	```
	ggplot(df_converged, aes(factor(n_years), mean_N_est)) +
	```
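
One way to get paired boxplots on a truly continuous axis, sketched here with made-up data because `df_converged` is not shown, is to keep `n_years` numeric on the x-axis and set the `group` aesthetic explicitly so that each combination of year and `spatialTF` gets its own box. The column names follow the question; the data values, the `df_demo` name, and the `width` setting are assumptions.

```r
library(ggplot2)

# Hypothetical stand-in for df_converged: unevenly spaced survey lengths
set.seed(42)
df_demo <- expand.grid(n_years = c(2, 4, 5, 10),
                       spatialTF = c(TRUE, FALSE),
                       rep = 1:20)
df_demo$mean_N_est <- rnorm(nrow(df_demo), mean = 50 + df_demo$n_years, sd = 5)

# Keep n_years numeric so spacing reflects the real gaps between values,
# and group by the year x spatialTF combination to get paired boxes
p <- ggplot(df_demo, aes(x = n_years, y = mean_N_est,
                         fill = spatialTF,
                         group = interaction(n_years, spatialTF))) +
  geom_boxplot(width = 0.8) +
  scale_x_continuous(breaks = sort(unique(df_demo$n_years)))
p
```

If values of `n_years` sit close together, a smaller `width` may be needed to keep neighboring pairs from overlapping.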

	# table references
	tbl_locations <- tbl(db, 'locations') %>%
	rename(location_id=id, location_name=name, location_description=description) %>%
	select(-created_at, -updated_at)
	tbl_series <- tbl(db, 'series') %>%
	rename(series_id=id) %>%
	select(-created_at, -updated_at)
	tbl_variables <- tbl(db, 'variables') %>%
	rename(variable_id=id, variable_name=name, variable_description=description) %>%

	# Get list of unique catchments with daymet data in our database
	library(RPostgreSQL) # provides the "PostgreSQL" driver and loads DBI
	drv <- dbDriver("PostgreSQL")
	con <- dbConnect(drv, dbname=...)
	# con <- dbConnect(drv, dbname=...)
	qry <- "SELECT DISTINCT featureid FROM daymet;"
	result <- dbSendQuery(con, qry)
	catchments <- fetch(result, n=-1)
	catchments <- as.character(catchments$featureid)

	# get daymet data for a subset of catchments
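
A minimal sketch of that step, assuming `featureid` values are integer-like identifiers and reusing the `daymet` table and `featureid` column from the query above. The helper name `build_daymet_query` is hypothetical, and the resulting string would still need to be sent to the database, for example with `dbGetQuery(con, qry_sub)` on a live connection.

```r
# Hypothetical helper: build a query for a subset of catchments.
# Assumes featureid values are integer-like; anything else is rejected
# rather than pasted into the SQL string.
build_daymet_query <- function(ids) {
  ids <- as.character(ids)
  stopifnot(length(ids) > 0, all(grepl("^[0-9]+$", ids)))
  paste0("SELECT * FROM daymet WHERE featureid IN (",
         paste(ids, collapse = ", "), ");")
}

# Example with made-up IDs; in practice use e.g. catchments[1:10]
qry_sub <- build_daymet_query(c("730912", "730913"))
qry_sub
```

Validating the IDs before pasting them into the query is a cheap guard against malformed input; a parameterized query would be another option.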

	# fetch temperature data
	tbl_values <- left_join(tbl_series,
	select(tbl_variables, variable_id, variable_name),
	by=c('variable_id'='variable_id')) %>%
	select(-file_id) %>%
	filter(location_id %in% df_locations$location_id,
	variable_name=="TEMP") %>%
	left_join(tbl_values,
	by=c('series_id'='series_id')) %>%
	left_join(select(tbl_locations, location_id, location_name, latitude, longitude, featureid=catchment_id),

	We assumed stream temperature measurements were normally distributed following,

	\\[ t_{s,h,d,y} \sim \mathcal{N}(\mu_{s,h,d,y}, \sigma) \\]

	where $t_{s,h,d,y}$ is the observed stream water temperature at site $s$ within the sub-basin identified by the 8-digit Hydrologic Unit Code (HUC8; $h$) on day $d$ of year $y$. The normal distribution is parameterized by its standard deviation ($\sigma$). The expected temperature follows a linear trend

	\\[ \omega_{s,h,d,y} = X^0 B^0 + X_{h}^{huc} B_{h}^{huc} + X_{s,h}^{site} B_{s,h}^{site} + X_{y}^{year} B_{y}^{year} \\]

	but the expected temperature ($\mu_{s,h,d,y}$) is adjusted based on the residual error from the previous day
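
	A common form for this kind of adjustment, assuming a first-order autoregressive (AR1) error with a hypothetical coefficient $\delta$ that is not defined in the text above, would be

	\\[ \mu_{s,h,d,y} = \omega_{s,h,d,y} + \delta (t_{s,h,d-1,y} - \omega_{s,h,d-1,y}) \\]

	when an observation exists for the previous day, with $\mu_{s,h,d,y} = \omega_{s,h,d,y}$ otherwise.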

	#purling is in the knitr package
	library(knitr)


	setwd("C:/ALR/Models/boo") #example using local windows directory, can easily switch

	#it can be really simple
	purl("script1.Rmd", output = "script1.R")
	# just specify the Rmd filename, then the R filename, with extensions

	one <- 1:10
	two <- rnorm(10)
	three <- runif(10, 1, 2)
	four <- -10:-1

	df <- data.frame(one, two, three)
	df2 <- data.frame(one, two, three, four)

	str(df)

	# Function for getting bootstrapped glmer predictions in parallel
	glmmBoot <- function(dat, form, R, nc){
	# dat = data for glmer (lme4) logistic regression
	# form = formula of glmer equation for fitting
	# R = total number of bootstrap draws - should be multiple of nc b/c divided among cores evenly
	# nc = number of cores to use in parallel

	library(parallel)
	cl <- makeCluster(nc) # Request # cores
	clusterExport(cl, c("dat", "form", "nc", "R"), envir = environment()) # Make these available to each core