OrenBochman · November 17, 2017 08:45
diff --git a/quicksim.R b/quicksim.R
 ---
 title: "R Notebook"
 output: 
  html_notebook: 
    fig_caption: yes
    toc: yes
 ---


 ```{r setup}
 rm(list=ls())					 # Housekeeping
 setwd("~/workspace/R")
 #install.packages("randomNames")
 library(randomNames)
 #install.packages("kableExtra")
 library(kableExtra)
 #install.packages("knitr")
 library(knitr)
 #install.packages("flexsurv")
 #library(flexsurv)
 #install.packages("VGAM")
 library(VGAM)

 options(knitr.table.format = "html") 
 numcases <- 500
 ```


 ```{r sim gender}

 gender_min  <-  0
 gender_max  <-  1
 gender <- as.integer(rbinom(numcases,size = 1,prob = .45)) # simulate random indepedent variables

 par(mfrow=c(2,1))
 hist(gender,main=paste( numcases," draws of gender"), breaks=seq(gender_min-.5,gender_max+.5,1))
 #hist(gender,main=paste( numcases," draws of gender"), breaks=cut(gender,2))      
 boxplot(gender, horizontal=TRUE,range=1)           
 title("boxplot of a binomial random distribution")
 ```

 age0 is uniformly distributd between age_min and age_max
 age1 is based on the gompertz distribution
 age2 is based on the makeham-gompertz distribution

 There are a couple of scenarios for simulating age:
 In insurence one might want to simulate life expectency (age at death) as well as cause of death etc.
 Life expectency should be bi model based on gender.

 In marketing one might be want to simulate the proportion of the ages in one's market. Age should be based on some type of contingency table


 ```{r sim life_expectency}


 age_min <- 18
 age_max <- 80
 age0 <- as.integer(runif(numcases,age_min,age_max)) # simulate random indepedent variables
 lifeExpectency1 <-as.integer(rgompertz(numcases, shape=.085, scale = .001))
 lifeExpectency2 <-as.integer(rmakeham(numcases,  shape=.085, scale=.001, epsilon = 0))

 age1_min <- min(age1)
 age1_max <- max(age1)
 age2_min <- min(age2)
 age2_max <- max(age2)
 par(mfrow=c(2,2))
 hist(age1,main=paste( numcases," draws of ages"),breaks=seq(age1_min-.5,age1_max+.5,1))      
 boxplot(age1, horizontal=TRUE,range=1)           
 hist(age2,main=paste( numcases," draws of ages"),breaks=seq(age2_min-.5,age2_max+.5,1))      
 boxplot(age2, horizontal=TRUE,range=1)           

 title("boxplot for ages")
 ```
 this is a first attempt at bimodal age distibution, should be updated to use as merkam-gompez distibutions
 and then the data should be based on this factor

 0-14 years: 18.73% (male 31,255,995/female 29,919,938)
 15-24 years: 13.27% (male 22,213,952/female 21,137,826)
 25-54 years: 39.45% (male 64,528,673/female 64,334,499)
 55-64 years: 12.91% (male 20,357,880/female 21,821,976)
 65 years and over: 15.63% (male 22,678,235/female 28,376,817) (2017 est.)


 ```{r sim age}
 mu1 <- 55   
 mu2 <- 75
 mu3 <- (65)
 sig1 <- 3
 sig2 <- 3
 sig3 <- (15)
 cpct <- 0.6   

 bimodalDistFunc <- function (n,cpct, mu1, mu2,mu3, sig1, sig2,sig3) {
  y0 = rnorm(n,mean=mu1, sd = sig1)
  y1 = rnorm(n,mean=mu2, sd = sig2)
  y3= runif(n,min=0, max = 25)
  #flag <- rbinom(n,size=1,prob=cpct)
  flag = gender
  y = (y0*(1 - flag) + y1*flag )+(y3)
 }
 age <- (bimodalDistFunc(n=numcases,cpct,mu1,mu2,mu3, sig1,sig2,sig3))/2
 par(mfrow=c(2,1))
 hist(age,breaks=300)
 #hist(age,main=paste( numcases," draws of ages"),breaks=seq(age_min-.5,age_max+.5,1))      
 boxplot((age), horizontal=TRUE,range=1)           
 title("boxplot for ages")
 ```


 Ethnic groups are problematic currently the package makes use of the CIA worldbook's terminology.

 the simulation currently is based on 2010 us data. also the probabilty is unnormalised which does not break things.

 Simulation:

 1. ethnic groups, religion and first language as factors
 2. we sample independently from the groups's list using thier proportion in the population and replace. 

 We use this primerily to make generation of names more realistic
 religious are more of the same


 some room for improvement:
 mixed ethnic groups should be allowed.
 ethnic group and their proportion data is also available for other countries.
 add languages based on: English 79%, Spanish 13%, other Indo-European 3.7%, Asian and Pacific island 3.4%, other 1% (2015 est.)
 language race and language should probably not be independent though the joint distributions may not be available.

 ```{r sim ethnicity}
 ethnicity_min  <-  1
 ethnicity_max  <-  7
 ethnic_groups <- c("Amerindian", "Asian", "Black", "Hispanic", "Caucasian","Arab")
 prob_ethnicity=c(0.9,4.8,12.5,16.3,72.4,1.1)
 ethnicity<-sample(x=ethnic_groups,size = numcases,replace = TRUE, prob= prob_ethnicity)

 #https://www.cia.gov/library/publications/the-world-factbook/fields/2075.html
 #white %, black 12.6%, Asian 4.8%, Amerindian and Alaska native 0.9%, native Hawaiian and other Pacific islander 0.2%, other 6.2%, two or more races 2.9% 

 #ethnicity <- as.integer(runif(numcases,ethnicity_min,ethnicity_max)) # simulate random indepedent variables
 fethnicity  <- factor(ethnicity, levels=ethnic_groups)
 #table(fethnicity)
 kable(t(as.matrix(table(fethnicity)))) %>%
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")

 religious_groups <- c('Protestant' , 'Catholic' , 'Jewish' , 'Mormon' , 'other Christian' , 'Muslim', 'Jehovah\'s Witness' , 'Buddhist' , 'Hindu', 'other', 'unaffiliated')
 propb_religion<-c(46.5,20.8,1.9,1.6,0.9,0.9,0.8,0.7,0.7,1.8,23.4)
 #again https://www.cia.gov/library/publications/the-world-factbook/geos/us.html
 religion<-sample(x=religious_groups,size = numcases,replace = TRUE, prob= propb_religion)

 freligion  <- factor(religion, levels=religious_groups)
 kable(t(as.matrix(table(freligion)))) %>%
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")

 par(mfrow=c(2,1))
 barplot(table(fethnicity),  las=2 ,space=1)

 par(mar = c(7, 4, 2, 2) + 0.2) 
 barplot(table(freligion), las=2 ,space=1)

 #barplot(table(fethnicity))
 #boxplot(x=levels(fethnicity), horizontal=TRUE)           
 #title("boxplot of a uniform random distribution")
 ```


 ```{r sim education}

 education_min  <-  5
 education_max  <-  25
 education <- as.integer(rbinom(numcases,(education_max-education_min),.4) + education_min) # simulate random indepedent variables
 max(education)

 par(mfrow=c(2,1))

 hist(education,main=paste(numcases," draws of education"), breaks=seq(education_min-.5,education_max+.5,1))  

 boxplot(education, horizontal=TRUE,range=1)           

 title("boxplot of a binomeal random distribution")
 ```

 ```{r sim wealth}


 wealth <- as.integer(abs(rcauchy(numcases,location = 10*age, scale = 10*education))) # simulate random indepedent variables
 wealth_min  <-  min(wealth)
 wealth_max  <-  max(wealth)

 par(mfrow=c(2,1))
 hist(wealth,main=paste(numcases," draws of wealth"), breaks=seq(wealth_min-.5,wealth_max+.5,1))      
 boxplot(wealth, horizontal=TRUE,range=1)           
 title("boxplot of a cachy random distribution")
 ```


 ```{r sim names}


 names <- randomNames(numcases,gender,ethnicity,which.names="both",
 name.order="first.last",name.sep=" ",sample.with.replacement=TRUE,return.complete.data=FALSE)  # simulate random indepedent variables
 ```

 Height is a function of gender and changes with age so idealy height should have a model rather than a simple ditribution.

 Weight depends on height but varies by body type, via a BMI.

 ```{r height_bmi_weight}

 height <- as.integer(rnorm(n=numcases,mean = 170,sd=10))
 height_max<-max(height)
 height_min<-min(height)

 bmi <- rnorm(n=numcases,mean = 26,sd=2.5)
 bmi_max<-max(bmi)
 bmi_min<-min(bmi)

 weight = bmi * height * height / 10000
 weight_max<-max(weight)
 weight_min<-min(weight)

 par(mfrow=c(3,2))
 hist(height,main=paste(numcases," draws of height"), breaks=seq(height_min-.5,height_max+.5,1))      
 boxplot(height, horizontal=TRUE,range=1)           
 title("boxplot of a normaly disributed height")
 hist(bmi,main=paste(numcases," draws of bmi"), breaks=seq(bmi_min-1,bmi_max+1,1))      
 boxplot(bmi, horizontal=TRUE,range=1)           
 title("boxplot of a normaly distributed bmi")
 hist(weight,main=paste(numcases," draws of weight"), breaks=seq(weight_min-1,weight_max+1,1))      
 boxplot(weight, horizontal=TRUE,range=1)           
 title("boxplot of a weight")

 ```

 see also: https://en.wikipedia.org/wiki/List_of_average_human_height_worldwide

 ```{r frame}
 demographic <- data.frame(name=names, gender=gender, age=age, 
                          education=education,ethnicity=ethnicity, religion=religion,
                          wealth=wealth, weight=weight,height=height,bmi=bmi)

 summary(demographic)

 head(demographic)

 ```
 ```{r}
 world = read.csv('http://www.stat.berkeley.edu/classes/s133/data/world.txt',header=TRUE,stringsAsFactors=FALSE)
 head(world)

 kable(world) %>%
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")

 ```
	---
	title: "R Notebook"
	output:
	html_notebook:
	fig_caption: yes
	toc: yes
	---


	```{r setup}
	rm(list=ls()) # Housekeeping
	setwd("~/workspace/R")
	#install.packages("randomNames")
	library(randomNames)
	#install.packages("kableExtra")
	library(kableExtra)
	#install.packages("knitr")
	library(knitr)
	#install.packages("flexsurv")
	#library(flexsurv)
	#install.packages("VGAM")
	library(VGAM)

	options(knitr.table.format = "html")
	numcases <- 500
	```


	```{r sim gender}

	gender_min <- 0
	gender_max <- 1
	gender <- as.integer(rbinom(numcases,size = 1,prob = .45)) # simulate random indepedent variables

	par(mfrow=c(2,1))
	hist(gender,main=paste( numcases," draws of gender"), breaks=seq(gender_min-.5,gender_max+.5,1))
	#hist(gender,main=paste( numcases," draws of gender"), breaks=cut(gender,2))
	boxplot(gender, horizontal=TRUE,range=1)
	title("boxplot of a binomial random distribution")
	```

	age0 is uniformly distributd between age_min and age_max
	age1 is based on the gompertz distribution
	age2 is based on the makeham-gompertz distribution

	There are a couple of scenarios for simulating age:
	In insurence one might want to simulate life expectency (age at death) as well as cause of death etc.
	Life expectency should be bi model based on gender.

	In marketing one might be want to simulate the proportion of the ages in one's market. Age should be based on some type of contingency table


	```{r sim life_expectency}


	age_min <- 18
	age_max <- 80
	age0 <- as.integer(runif(numcases,age_min,age_max)) # simulate random indepedent variables
	lifeExpectency1 <-as.integer(rgompertz(numcases, shape=.085, scale = .001))
	lifeExpectency2 <-as.integer(rmakeham(numcases, shape=.085, scale=.001, epsilon = 0))

	age1_min <- min(age1)
	age1_max <- max(age1)
	age2_min <- min(age2)
	age2_max <- max(age2)
	par(mfrow=c(2,2))
	hist(age1,main=paste( numcases," draws of ages"),breaks=seq(age1_min-.5,age1_max+.5,1))
	boxplot(age1, horizontal=TRUE,range=1)
	hist(age2,main=paste( numcases," draws of ages"),breaks=seq(age2_min-.5,age2_max+.5,1))
	boxplot(age2, horizontal=TRUE,range=1)

	title("boxplot for ages")
	```
	this is a first attempt at bimodal age distibution, should be updated to use as merkam-gompez distibutions
	and then the data should be based on this factor

	0-14 years: 18.73% (male 31,255,995/female 29,919,938)
	15-24 years: 13.27% (male 22,213,952/female 21,137,826)
	25-54 years: 39.45% (male 64,528,673/female 64,334,499)
	55-64 years: 12.91% (male 20,357,880/female 21,821,976)
	65 years and over: 15.63% (male 22,678,235/female 28,376,817) (2017 est.)


	```{r sim age}
	mu1 <- 55
	mu2 <- 75
	mu3 <- (65)
	sig1 <- 3
	sig2 <- 3
	sig3 <- (15)
	cpct <- 0.6

	bimodalDistFunc <- function (n,cpct, mu1, mu2,mu3, sig1, sig2,sig3) {
	y0 = rnorm(n,mean=mu1, sd = sig1)
	y1 = rnorm(n,mean=mu2, sd = sig2)
	y3= runif(n,min=0, max = 25)
	#flag <- rbinom(n,size=1,prob=cpct)
	flag = gender
	y = (y0(1 - flag) + y1flag )+(y3)
	}
	age <- (bimodalDistFunc(n=numcases,cpct,mu1,mu2,mu3, sig1,sig2,sig3))/2
	par(mfrow=c(2,1))
	hist(age,breaks=300)
	#hist(age,main=paste( numcases," draws of ages"),breaks=seq(age_min-.5,age_max+.5,1))
	boxplot((age), horizontal=TRUE,range=1)
	title("boxplot for ages")
	```


	Ethnic groups are problematic currently the package makes use of the CIA worldbook's terminology.

	the simulation currently is based on 2010 us data. also the probabilty is unnormalised which does not break things.

	Simulation:

	1. ethnic groups, religion and first language as factors
	2. we sample independently from the groups's list using thier proportion in the population and replace.

	We use this primerily to make generation of names more realistic
	religious are more of the same


	some room for improvement:
	mixed ethnic groups should be allowed.
	ethnic group and their proportion data is also available for other countries.
	add languages based on: English 79%, Spanish 13%, other Indo-European 3.7%, Asian and Pacific island 3.4%, other 1% (2015 est.)
	language race and language should probably not be independent though the joint distributions may not be available.

	```{r sim ethnicity}
	ethnicity_min <- 1
	ethnicity_max <- 7
	ethnic_groups <- c("Amerindian", "Asian", "Black", "Hispanic", "Caucasian","Arab")
	prob_ethnicity=c(0.9,4.8,12.5,16.3,72.4,1.1)
	ethnicity<-sample(x=ethnic_groups,size = numcases,replace = TRUE, prob= prob_ethnicity)

	#https://www.cia.gov/library/publications/the-world-factbook/fields/2075.html
	#white %, black 12.6%, Asian 4.8%, Amerindian and Alaska native 0.9%, native Hawaiian and other Pacific islander 0.2%, other 6.2%, two or more races 2.9%

	#ethnicity <- as.integer(runif(numcases,ethnicity_min,ethnicity_max)) # simulate random indepedent variables
	fethnicity <- factor(ethnicity, levels=ethnic_groups)
	#table(fethnicity)
	kable(t(as.matrix(table(fethnicity)))) %>%
	kable_styling(bootstrap_options = "striped", full_width = F, position = "left")

	religious_groups <- c('Protestant' , 'Catholic' , 'Jewish' , 'Mormon' , 'other Christian' , 'Muslim', 'Jehovah\'s Witness' , 'Buddhist' , 'Hindu', 'other', 'unaffiliated')
	propb_religion<-c(46.5,20.8,1.9,1.6,0.9,0.9,0.8,0.7,0.7,1.8,23.4)
	#again https://www.cia.gov/library/publications/the-world-factbook/geos/us.html
	religion<-sample(x=religious_groups,size = numcases,replace = TRUE, prob= propb_religion)

	freligion <- factor(religion, levels=religious_groups)
	kable(t(as.matrix(table(freligion)))) %>%
	kable_styling(bootstrap_options = "striped", full_width = F, position = "left")

	par(mfrow=c(2,1))
	barplot(table(fethnicity), las=2 ,space=1)

	par(mar = c(7, 4, 2, 2) + 0.2)
	barplot(table(freligion), las=2 ,space=1)

	#barplot(table(fethnicity))
	#boxplot(x=levels(fethnicity), horizontal=TRUE)
	#title("boxplot of a uniform random distribution")
	```


	```{r sim education}

	education_min <- 5
	education_max <- 25
	education <- as.integer(rbinom(numcases,(education_max-education_min),.4) + education_min) # simulate random indepedent variables
	max(education)

	par(mfrow=c(2,1))

	hist(education,main=paste(numcases," draws of education"), breaks=seq(education_min-.5,education_max+.5,1))

	boxplot(education, horizontal=TRUE,range=1)

	title("boxplot of a binomeal random distribution")
	```

	```{r sim wealth}


	wealth <- as.integer(abs(rcauchy(numcases,location = 10age, scale = 10education))) # simulate random indepedent variables
	wealth_min <- min(wealth)
	wealth_max <- max(wealth)

	par(mfrow=c(2,1))
	hist(wealth,main=paste(numcases," draws of wealth"), breaks=seq(wealth_min-.5,wealth_max+.5,1))
	boxplot(wealth, horizontal=TRUE,range=1)
	title("boxplot of a cachy random distribution")
	```


	```{r sim names}


	names <- randomNames(numcases,gender,ethnicity,which.names="both",
	name.order="first.last",name.sep=" ",sample.with.replacement=TRUE,return.complete.data=FALSE) # simulate random indepedent variables
	```

	Height is a function of gender and changes with age so idealy height should have a model rather than a simple ditribution.

	Weight depends on height but varies by body type, via a BMI.

	```{r height_bmi_weight}

	height <- as.integer(rnorm(n=numcases,mean = 170,sd=10))
	height_max<-max(height)
	height_min<-min(height)

	bmi <- rnorm(n=numcases,mean = 26,sd=2.5)
	bmi_max<-max(bmi)
	bmi_min<-min(bmi)

	weight = bmi * height * height / 10000
	weight_max<-max(weight)
	weight_min<-min(weight)

	par(mfrow=c(3,2))
	hist(height,main=paste(numcases," draws of height"), breaks=seq(height_min-.5,height_max+.5,1))
	boxplot(height, horizontal=TRUE,range=1)
	title("boxplot of a normaly disributed height")
	hist(bmi,main=paste(numcases," draws of bmi"), breaks=seq(bmi_min-1,bmi_max+1,1))
	boxplot(bmi, horizontal=TRUE,range=1)
	title("boxplot of a normaly distributed bmi")
	hist(weight,main=paste(numcases," draws of weight"), breaks=seq(weight_min-1,weight_max+1,1))
	boxplot(weight, horizontal=TRUE,range=1)
	title("boxplot of a weight")

	```

	see also: https://en.wikipedia.org/wiki/List_of_average_human_height_worldwide

	```{r frame}
	demographic <- data.frame(name=names, gender=gender, age=age,
	education=education,ethnicity=ethnicity, religion=religion,
	wealth=wealth, weight=weight,height=height,bmi=bmi)

	summary(demographic)

	head(demographic)

	```
	```{r}
	world = read.csv('http://www.stat.berkeley.edu/classes/s133/data/world.txt',header=TRUE,stringsAsFactors=FALSE)
	head(world)

	kable(world) %>%
	kable_styling(bootstrap_options = "striped", full_width = F, position = "left")

	```
No results found