Last active
November 17, 2017 08:45
-
-
Save OrenBochman/c14e01dd265d5cd7271508da3f7232c0 to your computer and use it in GitHub Desktop.
crash course in simulation in R
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| --- | |
| title: "R Notebook" | |
| output: | |
| html_notebook: | |
| fig_caption: yes | |
| toc: yes | |
| --- | |
| ```{r setup} | |
| rm(list=ls()) # Housekeeping | |
| setwd("~/workspace/R") | |
| #install.packages("randomNames") | |
| library(randomNames) | |
| #install.packages("kableExtra") | |
| library(kableExtra) | |
| #install.packages("knitr") | |
| library(knitr) | |
| #install.packages("flexsurv") | |
| #library(flexsurv) | |
| #install.packages("VGAM") | |
| library(VGAM) | |
| options(knitr.table.format = "html") | |
| numcases <- 500 | |
| ``` | |
| ```{r sim gender} | |
| gender_min <- 0 | |
| gender_max <- 1 | |
| gender <- as.integer(rbinom(numcases,size = 1,prob = .45)) # simulate random indepedent variables | |
| par(mfrow=c(2,1)) | |
| hist(gender,main=paste( numcases," draws of gender"), breaks=seq(gender_min-.5,gender_max+.5,1)) | |
| #hist(gender,main=paste( numcases," draws of gender"), breaks=cut(gender,2)) | |
| boxplot(gender, horizontal=TRUE,range=1) | |
| title("boxplot of a binomial random distribution") | |
| ``` | |
| age0 is uniformly distributd between age_min and age_max | |
| age1 is based on the gompertz distribution | |
| age2 is based on the makeham-gompertz distribution | |
| There are a couple of scenarios for simulating age: | |
| In insurence one might want to simulate life expectency (age at death) as well as cause of death etc. | |
| Life expectency should be bi model based on gender. | |
| In marketing one might be want to simulate the proportion of the ages in one's market. Age should be based on some type of contingency table | |
| ```{r sim life_expectency} | |
| age_min <- 18 | |
| age_max <- 80 | |
| age0 <- as.integer(runif(numcases,age_min,age_max)) # simulate random indepedent variables | |
| lifeExpectency1 <-as.integer(rgompertz(numcases, shape=.085, scale = .001)) | |
| lifeExpectency2 <-as.integer(rmakeham(numcases, shape=.085, scale=.001, epsilon = 0)) | |
| age1_min <- min(age1) | |
| age1_max <- max(age1) | |
| age2_min <- min(age2) | |
| age2_max <- max(age2) | |
| par(mfrow=c(2,2)) | |
| hist(age1,main=paste( numcases," draws of ages"),breaks=seq(age1_min-.5,age1_max+.5,1)) | |
| boxplot(age1, horizontal=TRUE,range=1) | |
| hist(age2,main=paste( numcases," draws of ages"),breaks=seq(age2_min-.5,age2_max+.5,1)) | |
| boxplot(age2, horizontal=TRUE,range=1) | |
| title("boxplot for ages") | |
| ``` | |
| this is a first attempt at bimodal age distibution, should be updated to use as merkam-gompez distibutions | |
| and then the data should be based on this factor | |
| 0-14 years: 18.73% (male 31,255,995/female 29,919,938) | |
| 15-24 years: 13.27% (male 22,213,952/female 21,137,826) | |
| 25-54 years: 39.45% (male 64,528,673/female 64,334,499) | |
| 55-64 years: 12.91% (male 20,357,880/female 21,821,976) | |
| 65 years and over: 15.63% (male 22,678,235/female 28,376,817) (2017 est.) | |
| ```{r sim age} | |
| mu1 <- 55 | |
| mu2 <- 75 | |
| mu3 <- (65) | |
| sig1 <- 3 | |
| sig2 <- 3 | |
| sig3 <- (15) | |
| cpct <- 0.6 | |
| bimodalDistFunc <- function (n,cpct, mu1, mu2,mu3, sig1, sig2,sig3) { | |
| y0 = rnorm(n,mean=mu1, sd = sig1) | |
| y1 = rnorm(n,mean=mu2, sd = sig2) | |
| y3= runif(n,min=0, max = 25) | |
| #flag <- rbinom(n,size=1,prob=cpct) | |
| flag = gender | |
| y = (y0*(1 - flag) + y1*flag )+(y3) | |
| } | |
| age <- (bimodalDistFunc(n=numcases,cpct,mu1,mu2,mu3, sig1,sig2,sig3))/2 | |
| par(mfrow=c(2,1)) | |
| hist(age,breaks=300) | |
| #hist(age,main=paste( numcases," draws of ages"),breaks=seq(age_min-.5,age_max+.5,1)) | |
| boxplot((age), horizontal=TRUE,range=1) | |
| title("boxplot for ages") | |
| ``` | |
| Ethnic groups are problematic currently the package makes use of the CIA worldbook's terminology. | |
| the simulation currently is based on 2010 us data. also the probabilty is unnormalised which does not break things. | |
| Simulation: | |
| 1. ethnic groups, religion and first language as factors | |
| 2. we sample independently from the groups's list using thier proportion in the population and replace. | |
| We use this primerily to make generation of names more realistic | |
| religious are more of the same | |
| some room for improvement: | |
| mixed ethnic groups should be allowed. | |
| ethnic group and their proportion data is also available for other countries. | |
| add languages based on: English 79%, Spanish 13%, other Indo-European 3.7%, Asian and Pacific island 3.4%, other 1% (2015 est.) | |
| language race and language should probably not be independent though the joint distributions may not be available. | |
| ```{r sim ethnicity} | |
| ethnicity_min <- 1 | |
| ethnicity_max <- 7 | |
| ethnic_groups <- c("Amerindian", "Asian", "Black", "Hispanic", "Caucasian","Arab") | |
| prob_ethnicity=c(0.9,4.8,12.5,16.3,72.4,1.1) | |
| ethnicity<-sample(x=ethnic_groups,size = numcases,replace = TRUE, prob= prob_ethnicity) | |
| #https://www.cia.gov/library/publications/the-world-factbook/fields/2075.html | |
| #white %, black 12.6%, Asian 4.8%, Amerindian and Alaska native 0.9%, native Hawaiian and other Pacific islander 0.2%, other 6.2%, two or more races 2.9% | |
| #ethnicity <- as.integer(runif(numcases,ethnicity_min,ethnicity_max)) # simulate random indepedent variables | |
| fethnicity <- factor(ethnicity, levels=ethnic_groups) | |
| #table(fethnicity) | |
| kable(t(as.matrix(table(fethnicity)))) %>% | |
| kable_styling(bootstrap_options = "striped", full_width = F, position = "left") | |
| religious_groups <- c('Protestant' , 'Catholic' , 'Jewish' , 'Mormon' , 'other Christian' , 'Muslim', 'Jehovah\'s Witness' , 'Buddhist' , 'Hindu', 'other', 'unaffiliated') | |
| propb_religion<-c(46.5,20.8,1.9,1.6,0.9,0.9,0.8,0.7,0.7,1.8,23.4) | |
| #again https://www.cia.gov/library/publications/the-world-factbook/geos/us.html | |
| religion<-sample(x=religious_groups,size = numcases,replace = TRUE, prob= propb_religion) | |
| freligion <- factor(religion, levels=religious_groups) | |
| kable(t(as.matrix(table(freligion)))) %>% | |
| kable_styling(bootstrap_options = "striped", full_width = F, position = "left") | |
| par(mfrow=c(2,1)) | |
| barplot(table(fethnicity), las=2 ,space=1) | |
| par(mar = c(7, 4, 2, 2) + 0.2) | |
| barplot(table(freligion), las=2 ,space=1) | |
| #barplot(table(fethnicity)) | |
| #boxplot(x=levels(fethnicity), horizontal=TRUE) | |
| #title("boxplot of a uniform random distribution") | |
| ``` | |
| ```{r sim education} | |
| education_min <- 5 | |
| education_max <- 25 | |
| education <- as.integer(rbinom(numcases,(education_max-education_min),.4) + education_min) # simulate random indepedent variables | |
| max(education) | |
| par(mfrow=c(2,1)) | |
| hist(education,main=paste(numcases," draws of education"), breaks=seq(education_min-.5,education_max+.5,1)) | |
| boxplot(education, horizontal=TRUE,range=1) | |
| title("boxplot of a binomeal random distribution") | |
| ``` | |
| ```{r sim wealth} | |
| wealth <- as.integer(abs(rcauchy(numcases,location = 10*age, scale = 10*education))) # simulate random indepedent variables | |
| wealth_min <- min(wealth) | |
| wealth_max <- max(wealth) | |
| par(mfrow=c(2,1)) | |
| hist(wealth,main=paste(numcases," draws of wealth"), breaks=seq(wealth_min-.5,wealth_max+.5,1)) | |
| boxplot(wealth, horizontal=TRUE,range=1) | |
| title("boxplot of a cachy random distribution") | |
| ``` | |
| ```{r sim names} | |
| names <- randomNames(numcases,gender,ethnicity,which.names="both", | |
| name.order="first.last",name.sep=" ",sample.with.replacement=TRUE,return.complete.data=FALSE) # simulate random indepedent variables | |
| ``` | |
| Height is a function of gender and changes with age so idealy height should have a model rather than a simple ditribution. | |
| Weight depends on height but varies by body type, via a BMI. | |
| ```{r height_bmi_weight} | |
| height <- as.integer(rnorm(n=numcases,mean = 170,sd=10)) | |
| height_max<-max(height) | |
| height_min<-min(height) | |
| bmi <- rnorm(n=numcases,mean = 26,sd=2.5) | |
| bmi_max<-max(bmi) | |
| bmi_min<-min(bmi) | |
| weight = bmi * height * height / 10000 | |
| weight_max<-max(weight) | |
| weight_min<-min(weight) | |
| par(mfrow=c(3,2)) | |
| hist(height,main=paste(numcases," draws of height"), breaks=seq(height_min-.5,height_max+.5,1)) | |
| boxplot(height, horizontal=TRUE,range=1) | |
| title("boxplot of a normaly disributed height") | |
| hist(bmi,main=paste(numcases," draws of bmi"), breaks=seq(bmi_min-1,bmi_max+1,1)) | |
| boxplot(bmi, horizontal=TRUE,range=1) | |
| title("boxplot of a normaly distributed bmi") | |
| hist(weight,main=paste(numcases," draws of weight"), breaks=seq(weight_min-1,weight_max+1,1)) | |
| boxplot(weight, horizontal=TRUE,range=1) | |
| title("boxplot of a weight") | |
| ``` | |
| see also: https://en.wikipedia.org/wiki/List_of_average_human_height_worldwide | |
| ```{r frame} | |
| demographic <- data.frame(name=names, gender=gender, age=age, | |
| education=education,ethnicity=ethnicity, religion=religion, | |
| wealth=wealth, weight=weight,height=height,bmi=bmi) | |
| summary(demographic) | |
| head(demographic) | |
| ``` | |
| ```{r} | |
| world = read.csv('http://www.stat.berkeley.edu/classes/s133/data/world.txt',header=TRUE,stringsAsFactors=FALSE) | |
| head(world) | |
| kable(world) %>% | |
| kable_styling(bootstrap_options = "striped", full_width = F, position = "left") | |
| ``` |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment