saptarshiguha · June 9, 2017 20:12
diff --git a/mozilla110_110_112_113_part1.py b/mozilla110_110_112_113_part1.py
 ---
 title: Churn Analyses for mozilla100 vs mozilla111
 author: Saptarshi Guha <[email protected]>
 date: "`r format(Sys.time(), '%B %d, %Y')`"
 output:
 html_document:
    toc: true
    toc_float: true
    mathjax: default
    self_contained: true
    theme: readable
    highlight: haddock
    number_sections: true
 ---

 <style>
 pre code, pre, code {
  white-space: pre !important;
  overflow-x: scroll !important;
  word-break: keep-all !important;
  word-wrap: initial !important;
 }

 .section p {
  padding-left: 1em;
  padding-right:1em;
 }
  
 .tocify-extend-page {
    display: none
 }
 </style>


 ```{r global_options, include=FALSE}
 knitr::opts_chunk$set(fig.height=10,
                      fig.width=20,
                      fig.align='center',
                      fig.retina=FALSE,
                      echo=TRUE,
                      cache=TRUE,
                      knitron.profile='default',
                      warning=FALSE,
                      message=FALSE)
 ```


 # Introduction mozilla110 vs mozilla111

 ```{r echo=FALSE}
 sampFrac <- 1
 ```


 ```{r eval=FALSE, echo=FALSE}
 ex <- ti(map=function(a,b){ if(grepl("(mozilla110|mozilla111)",b$did[1])){
                                if(length(unique(b$did))==1){
                                    b[,id:=a]
                                    b[,isdef:=sapply(isdef,function(s) if(s=='false') 0 else 1)]
                                    if (runif(1)<sampFrac) rhcollect(a,b)
                                }
                            }},reduce=0)
 set.seed(20)
 exdataOrig <- ex$take(-1)
 exdataOrig<- rbindlist(lapply(exdataOrig,"[[",2))
 ENDD <- as.Date('2017-05-20')+6*7
 exdataOrig <- exdataOrig[date<=as.character(ENDD) & date>=pcd,] ## only worry about retention till 6 weeks
 invisible(exdataOrig[, ":="(opp=as.numeric(as.Date(ENDD) - pcd[1]+1), used=.N)
                   , by=list(did,id)][, p:=used/opp])

 middleFrac <- 1
 if(middleFrac < 1){
    exdata <- exdataOrig[, list(n=1),by=list(did,id)][, list(id = sample(id, .N*middleFrac)),by=did]
    exdata <- exdataOrig[id %in% exdata$id,]
 }else{
    exdata <- exdataOrig
 }
 finalFrac <- sampFrac*middleFrac
 invisible(exdata2 <- exdata[, list(seng=seng[1],
                                   pcd=pcd[1],opp=opp[1],isdef=isdef[1],
                                   used=used[1],p=(used[1]-1)/(opp[1]-1),
                                   hr=sum(hr),u=sum(turi),
                                   ts=sum(search),ttabs=sum(ttabs)
                                   ),by=list(did,id)])

 ki2 <- exdata[,pcd:=as.Date(pcd)][,{
    con <- c()
    for(i in 0:6){
        con <- c(con, 1*(any(date>=pcd + i*7 & date<=(pcd+i*7+6))))
    }
    data.table(week=0:6, used=con)
 }, by=list(id,did)]

 ki3 <- ki2[,{
    z <- t.test(used~did,alternative='two.sided',conf.level=1-0.05/14)
    p1 <- mean(used[did=='mozilla110'])
    p2 <- mean(used[did=='mozilla111'])
    list(p1=p1,p2=p2,delta=z$est[1]-z$est[2], lo=z$conf.int[1],up=z$conf.int[2])
 },by=list(week)][order(week),]

 ki <- exdata[, {
    week6 = 1*any( date>=(as.Date(pcd) + 6*7) & date<=(as.Date(pcd)+6*7+6))
    list(active6 = week6)
 },by=list(id,did)]

 ```

 ```{r echo=FALSE}
 v <- exdataOrig[, did[1]   ,by=id][, table(V1)/finalFrac]
 h <- function(x) formatC(round(x, 2), big.mark=",",drop0trailing=TRUE, format="f") 
 ```

 *mozilla110* and *mozilla111* were two different funnelcakes.  I am not aware
 of which is the control or the test so this analysis is blinded. The
 variation consisted of different messaging in the download page.

 The experiment was designed with **almost equal split** between *mozilla110* and
 *mozilla111* (comparisons with the other two funnelcakes mozilla112 and
 mozilla113 will be done). The experiment was available, that is downloads
 between 2017-05-07 to 2017-05-20 are part of the experiment randomly assigned to
 either *mozilla110* or *mozilla111*. We have `r h(v[['mozilla110']])` profiles
 (`r (round(v['mozilla110']/sum(v)*100,1))`%) and `r h(v['mozilla111'])` 
 (`r (round(v['mozilla111']/sum(v)*100,1))`%) profiles who downloaded, installed and
 pinged us from *mozilla110* and *mozilla111* respectively based on a
 `r finalFrac*100`% sample.


 Is there a retention difference? This is typically answered using 6 week
 retention, i.e. percentage of users in week 0 still using week 6. I will try and
 answer the question a bit more holistically.

 # Summary

 I compared,using a `r 100*finalFrac` %  sample, the distributions on multiple fronts

 - in the 6th week since profile creation date, there is a difference in
  likelihood of being active
 - there is a difference in proportion of days active
 - little difference in search engine default
 - no difference in total hours/session given large usage, no difference
  for *mozilla110* to have low usage (<1hr)
  

 Though there is significance, it is important to consider if this difference
 is practically useful.

 # Contributions

 I believe we ought compare the effects of treatment and control not just
 on one week, but their entire usage up to given week.

 Ultimately every form of retention will be related to higher usage. If a
 user is likely to come back then by consequence she will use the browser
 more and this will be reflected in increased usage.

 Other merits

 - We consider the entire 6 week period on a per profile basis
 - we control for days of available to them (i.e. we need no restrict to 6
  weeks, and instead take days on file as a explanatory variable)
 - have the ability to control for several explanatory variables(e.g. opportunity
  of days of use)
 - can describe the differences on multiple outcomes (e.g. total usage, days
    used etc)  
 - demonstrate difference using multiple methods: graphical, non parametric
  and parametric.


 # Analysis


 ## Likelihood of using in the 6th Week and Throughout the 6 weeks

 ```{r echo=FALSE,cache=TRUE}
 mx <- glm( active6 ~ did, data=ki, family='binomial')
 oddsRatioconfIntDelta <- function(mx){
    b2=coef(mx)['didmozilla111']
    grad=exp(b2)
    vb2 <- vcov(mx)['didmozilla111','didmozilla111']
    svG <- sqrt(grad %*% vb2 %*% grad)
    r <- pmax(0,c(grad - qnorm(1-0.05/2)*svG,grad + qnorm(1-0.05/2)*svG))
    r2 <- sapply(r,function(k) if(k>1) k-1 else 1+k)
    list(r,r2)
 }
 oddsCI <- function(mx){
    r <- cbind(exp(stats::coef(mx)),exp(stats::confint(mx, level=level)),summary(mx)$coefficients[,4])
 }
 r <- oddsCI(mx)
                                        #j <- round((oddsRatioconfIntDelta(mx)[[2]])*100,2)
 r1 <- r['didmozilla111',c('2.5 %','97.5 %')]
 r1s <- if(1 %between% r1) "Since the effect of mozilla111 is both an increase and decrease, there isn't a signficant effect of mozilla111."
 r2 <- sapply(r['didmozilla111',c('2.5 %','97.5 %')],function(k){
    if(k>1){
        sprintf("%s %% increase",round(100*(k-1),2))
    } else{
        sprintf("%s %% drop",round(100*(1-k),2))
    }})
                                                             

 #mxconf <- confint(mx,"didmozilla111")
 ```

 For *mozilla111*, there is a between  `r r2[1]`% to `r r2[2]`% 
 odds of being active in the 6th week. `r r1s`.


 In the following plot, I've plotted the absolute difference in proportions
 active for every week and the confidence intervals have been Bonferoroni
 corrected (95% confidence intervals). There is very little difference between
 *mozilla110* and *mozilla111* (though technically speaking, *mozilla111* is
 every slightly higher, negligibly so)

 ```{r  echo=FALSE, fig.cap='Plot of Difference in Proportions',fig.height=10, fig.width=19,fig.align='center'}
 xyplot( delta ~ week, type='l',lwd=2,xlab="Weeks Since PCD",
       ylab='Difference in Proportion of Profiles Active (mozilla110-mozilla111)',
       panel=function(x,y,...){
           panel.polygon(c(x,rev(x)),c(ki3$lo,rev(ki3$up)),
                         col='#55555540',border=FALSE)
           panel.grid()
           panel.abline(h=0)
           panel.xyplot(x,y,...)
       },data=ki3,aspect=0.7,scales=list(cex=1.5,tick.num=10),cex=1.5,ylim= range(-0.09,0.09))
 summary(mx)
 ```

 #### Plot of Proportion Active In A Week

 And we can't forgo a plot of the proportion of profiles active in a week (p1
 is *mozilla110* and p2 is *mozilla111*. There is pretty much no difference.

 ```{r echo=FALSE,fig.height=10, fig.width=19,fig.align='center'}
 xyplot( p1 + p2  ~ week, type='l',lwd=2,xlab="Weeks Since PCD",
       ylab='Proportion of Profiles Active',auto.key=list(columns=2)
       ,panel=function(x,y,...){
           panel.grid()
           panel.xyplot(x,y,...)
       },data=ki3,aspect=0.7,scales=list(cex=1.5,tick.num=10),cex=1.5)
 ```

 #### Cumulative Active Plot

 ```{r echo=FALSE, warning=FALSE, message=FALSE}

 ll <- ki2[, {
    fp <- min(which(used==1))
    list(first=if(fp<Inf) week[fp] else NA_integer_)
 },by=list(did,id)]
 llx <- ll[!is.na(first), list(week=0, T=.N-(sum(first<=0)),PctCameAfterFirstWeek = (sum(first<=0))*100 / .N),by=did]
 llx0 <- llx[, mean(PctCameAfterFirstWeek)]
 ll2 <- rbindlist(lapply(1:6,function(J){
    ll[!is.na(first), list(week=J
                          ,PctCameAfterFirstWeek = (-sum(first<=0) + sum(first<=J))*100 / llx[did==.BY$did,T])
     , by=list(did)]
 }))

 ```


 The following is a cumulative active chart. Per week, what proportion of our
 control/test group have become active given they didn't in the first week.
 (For the record, `r round(llx0,2)`% of profiles use the browser in the first week)
 *mozilla111* is higher which implies they end to use it more rapidly compared to
 *mozilla110* but this seems mostly true for the first week and less so thereafter.

 ```{r echo=FALSE,fig.cap='Cumulative Actives Given Not used In first week',fig.height=10, fig.width=19,fig.align='center'}

 xyplot( PctCameAfterFirstWeek ~ week, group=did,data=ll2, auto.key=list(column=2),type=c('l','g')
       ,scale=list(cex=1.5))
 ```


 ## Overall Usage: proportion of days used across observed life

 ### By Bootstrap
 We can compute the ratio $pu = \frac{used-1}{opp-1}$ which is the proportion of days
 available to the profile used. We subtract 1 from both since everyone is active
 for at least one day. The following figure indicates a *small* difference
 between test and control.

 ```{r echo=FALSE,fig.cap='Boxplot of Proportion of Days Active',fig.height=10, fig.width=19,fig.align='center'}


 bwplot( did ~ p, data=exdata2, xlab='Proportion of Days Active', ylab='distribution'
      ,scale=list(cex=1.5,x=list(tick.num=20)))
 ```

 ```{r echo=FALSE, cache=TRUE}

 library(boot)
 bfunction <- function(x,d){
    xx <- x[d,]
    delta <- xx[did=='mozilla111',mean(p)] - xx[did=='mozilla110',mean(p)]
    delta
 }
 pb <- boot(exdata2, bfunction, R=1000,strata=exdata2$did=='mozilla110')
 xx <-  round(as.numeric(quantile(pb$t, c(0.025,0.975))),3)
 xx3 <- mean(pb$t)
 ```

 Without making any parametric assumptions, we compute 1000 bootstrap samples of
 the mean $pu$ for both *mozilla110* and *mozilla111*. The difference
 (*mozilla111* - *mozilla110*) is computed
 and I've plotted the histogram of this $pu$. The vertical lines correspond to
 the 95% bootstrap CI $(`r xx[1]`,`r xx[2]`)$ and the central vertical line the
 bootstrap estimate of the mean difference in proportion of days used: (`r round(xx3,3)`).

 **Since this interval does not  0, we are led to believe that there is a
 difference in $pu$.**


 ```{r echo=FALSE,fig.cap='Proportion of Opportunity Active vs Opportunity',fig.height=10, fig.width=19,fig.align='center'}
 densityplot( ~ pb$t, panel=function(...){
    m1 <- mean(pb$t)
    m2 <- quantile(pb$t,0.025)
    m3 <- quantile(pb$t,0.975)
    panel.densityplot(...)
    panel.abline(v=c(m2,m1,m3), lwd=1,lty=3,col='#888888')
 },lwd=2,
 scale=list(cex=1.5)
 ,xlab='Bootstrap Replicates of difference in proportion of days active',ylab='Density',bw=0.01,aspect=0.75)

 ```

 ```{r echo=FALSE, cache=TRUE}

 bfunction <- function(x,d){
    xx <- x[d,]
    N <- nrow(xx); 
    delta <- xx[1:(N/2),mean(p)] - xx[(N/2):N,mean(p)]
    delta
 }
 pbnull <- boot(exdata2, bfunction, R=1000)
 observeddiff <- exdata2[, list(p=mean(p)),by=did]
 observeddiff <- abs(observeddiff[did=='mozilla111',p]-observeddiff[did=='mozilla110',p])
 res <- mean(abs(pbnull$t) >= observeddiff)

 ```

 We can confirm this with a bootstrap test. Under the null hypothesis of no
 difference we can compute the null distribution of the differences.  We then
 compute the proportion of bootstrap differences that are greater in absolute
 sense than our observed mean difference in $pu$ between *mozilla110* and
 *mozilla111* (which is `r round(observeddiff,5)`).  **This proportion is 
 `r res` which is  less than 0.05 and conclude that there is
 significant difference** in the proportion of days used by profiles of
 *mozilla110* and *mozilla111*. That said the difference is extremely small 
 (`r round(xx3,4)`) and very likely negligible. 

 ### Using A Statistical Model

 ```{r}

 invisible(mdl <- glm( used ~ (opp)+did+log(ttabs+1)+log(u+1), data=exdata2,family='quasipoisson'))
 j2 <- round((oddsRatioconfIntDelta(mdl)[[1]]-1)*100,2)

 mdl2 <- glm( used ~ (opp)+did*isdef+log(ttabs+1)+log(u+1), data=exdata2,family='quasipoisson')

 ```

 Instead of using the bootstrap, we'll use a statistical model. In the above we
 model the dependency of *days used* controlling for *number of days available to
 profile* , *distribution_id* and other usage (tabs opened, uris visited). For
 *mozilla111* there is between `r j2[1]`% to `r j2[2]`% lift in the days used
 (given same opportunity and session usage). This matches the result we got in
 the bootstrap approach: that *mozilla111* has a slightly less rate of use.


 ```{r echo=FALSE}
 ## http://www.theanalysisfactor.com/interactions-main-effects-not-significant/
 summary( mdl2 )
 ```



 `r round(100*exdata2[, mean(isdef)],2)`% have Firefox has default.
 The table of proportion of usage follows here (notice the difference for non
 default profiles and default profiles).

 (Side note to author: in the above model, including `isdefault` doesn't make
 much sense since default status is very much correlated to hours/profile/day and
 tabs opened. Hence in the following proportion table we see a difference between
 default and non default users but it becomes negligible when we control for
 usage.

 ```{r echo=FALSE}

 exdata2[, mean(p),by=list(isdef,did)][order(isdef,did),][, list(Default=isdef, Distribution=did,ProportionDaysUSed=V1)]

 ```                                               
                                               
 ## Usage Characteristics between *mozilla110* and *mozilla111*

 We'll compare the two distributions using search engine default and  total hours
 used.

 ### Search Engine default

 ```{r}
 ox <- (exdata2[, prop.table(table(seng,did),2)*100])
 ox[ox[,'mozilla110']>1 & ox[,'mozilla111']>1,]
 ```

 ```{r echo=FALSE}
 #ox <- exdata2[, (chisq.test(table(seng,did),simulate.p.value = TRUE))]
 ```

 ### Total Hours/Profile Used

 Seems the hour usage is multimodal (mostly because of a preponderance of low
 usage profiles). We need to take this into account.

 ```{r echo=FALSE,fig.cap='Histogram of Total Hours used by Distribution',fig.height=10, fig.width=19,fig.align='center'}
 histogram(~ hr,groups=did, data=exdata2[hr>0,]
         ,xlab='Total Hours(log 2 scale)',scale=list(x=list(tick.num=20,cex=1.5,log=2)))
 ```

 Looking at the above figure, we consider 1 hr as the cut off and model this
 separately.

 We model the likelihood of using the browser for less than one hour. We see that
 *mozilla111* users are less likely to use it for less than one hour. It seems
 that *mozilla111* have lesser likelihood of using the browser for less than an
 hour 


 ```{r }
 mdlp <- glm( I(hr<1)  ~ used + did,data=exdata2,family='binomial')
 (summary(mdlp))
 ## 95% confidence interval for odds ratio
 print(oddsCI(mdlp)) 
 ```

 but given they use it for more than an hour, *mozilla111* don't have a
 significant hours/profile difference (controlling for days used and uris (`u`) visited)

 ```{r }
 mdl2 <- glm( log(hr) ~ used+did+u,data=exdata2[hr>=1,])
 print(summary(mdl2))
 #print(oddsCI(mdl2))
 ```






 # Appendix

 ## Python Extraction Code



 ```{python WhatIsThis, eval=FALSE}


 # run with pyspark

 import sys
 import datetime
 import random
 import subprocess
 import mozillametricstools.common.functions as mozfun

 # "active_addons"
 mozfun.register_udf(sqlContext
                    , lambda arr:  sum(arr) if arr else 0, "array_sum"
                    , pyspark.sql.types.IntegerType())
 ms = sqlContext.read.load("s3://telemetry-parquet/main_summary/v4", "parquet"
                          , mergeSchema=True)
 sqlContext.registerDataFrameAsTable(ms,"ms")


 #



 ms3 = sqlContext.sql("""
   select
   sample_id,client_id, distribution_id,subsession_start_date,
   profile_creation_date,default_search_engine,
   subsession_length, scalar_parent_browser_engagement_total_uri_count , is_default_browser,
   scalar_parent_browser_engagement_tab_open_event_count, search_counts
   from ms
   where
   distribution_id in ( 'mozilla111','mozilla112','mozilla113','mozilla110')
   and profile_creation_date >= 17292 and profile_creation_date <= 17306
   and substring(submission_date, 1,10)>='2017-05-07'
   and channel = 'release'
   and app_name = 'Firefox'
 """)
 sqlContext.registerDataFrameAsTable(ms3, "ms3")
 # sqlContext.sql("""
 # select distribution_id, count(distinct(client_id))*4
 # from ms3 where sample_id <='25'  group by distribution_id 
 # """).toPandas()





 ms4 = sqlContext.sql("""
   select 
   client_id as cid,
   case 
       when distribution_id is null then 'miss' 
       else distribution_id
   end as did,
   substring(subsession_start_date,1,10) as date,
   from_unixtime(profile_creation_date*86400,'yyyy-MM-dd') as pcd,
   count(*) as ns,
   last(case when default_search_engine is not null 
            then default_search_engine else 'miss' end) as seng,
   sum(case when subsession_length is null 
            then 0 else subsession_length end) as ssl,
   sum(case when  scalar_parent_browser_engagement_total_uri_count is null 
            then  0 else  scalar_parent_browser_engagement_total_uri_count end ) as turi,
   max(case when is_default_browser is null 
            then False else is_default_browser end) as isdef,
   sum(case when  scalar_parent_browser_engagement_tab_open_event_count is null 
            then 0 else  scalar_parent_browser_engagement_tab_open_event_count end) as ttabs,
   sum(case 
     when search_counts is not null then array_sum(search_counts.count) else 0 
   end)  as tsearch
   from ms3
   group by 1,2,3,4
 """)

 import subprocess
 O = "s3://mozilla-metrics/sguha/tmp/tmpfiles"
 subprocess.call(["aws", "s3", "rm", "--recursive", O])
 write = pyspark.sql.DataFrameWriter(ms4.coalesce(400))
 write.csv(path=O
          ,mode='overwrite'
          ,nullValue="NA"
          ,header=False)
 print(1)


 ```


 ## R Modification Code

 ```{r eval=FALSE}

 ## Convert the above data set to RHIPE format where each 'row'
 ## corresponds to `client_id` and it's value is a data table. I really
 ## should move this to python, but not really bothered about
 ## it. Coding in R is easier and i can consider doing some fancier
 ## statistical approaches across subsets.

 source("~/rhipe.r")
 z <- rhwatch(
    map=expression({
        tryCatch({
            z <- fread(paste(unlist(map.values),collapse="\n")
                     , colClasses=c('character','character','character','character',
                                    'character','numeric','numeric','numeric','character','numeric',
                                    'numeric'))
            setnames(z,c("cid","did","date","pcd",
                         'ns',"seng","hr","turi","isdef",'ttabs',"search"))
            z[, date:=as.Date(date,"%Y-%m-%d")]
            z[, pcd:=as.Date(pcd,"%Y-%m-%d")]
            z[, hr:=hr/3600]
            z[, rhcollect(.BY$cid, .SD),by=cid]
    }, error=function(e) { rhcounter("errors",as.character(e),1)})
    }),
 reduce=dtbinder(), input=rhfmt("text",folders="s3://mozilla-metrics/sguha/tmp/tmpfiles")
 ,output="s3://mozilla-metrics/sguha/tmp/mozilla110vs111vs112vs113"
 ,mapred=list(mapred.reduce.tasks=300)
 ,read=FALSE
 ,setup=E)

 I <- 's3://mozilla-metrics/sguha/tmp/mozilla110vs111vs112vs113'
 ti <- rh(I,E)

 ## Should I read all? Or take a sample. Use at least 50K profiles
 ti(map=function(a,b) { rhcollect(list(did=b$did[1]),1) })$collect()

 ```
	---
	title: Churn Analyses for mozilla100 vs mozilla111
	author: Saptarshi Guha <[email protected]>
	date: "`r format(Sys.time(), '%B %d, %Y')`"
	output:
	html_document:
	toc: true
	toc_float: true
	mathjax: default
	self_contained: true
	theme: readable
	highlight: haddock
	number_sections: true
	---

	<style>
	pre code, pre, code {
	white-space: pre !important;
	overflow-x: scroll !important;
	word-break: keep-all !important;
	word-wrap: initial !important;
	}

	.section p {
	padding-left: 1em;
	padding-right:1em;
	}

	.tocify-extend-page {
	display: none
	}
	</style>


	```{r global_options, include=FALSE}
	knitr::opts_chunk$set(fig.height=10,
	fig.width=20,
	fig.align='center',
	fig.retina=FALSE,
	echo=TRUE,
	cache=TRUE,
	knitron.profile='default',
	warning=FALSE,
	message=FALSE)
	```


	# Introduction mozilla110 vs mozilla111

	```{r echo=FALSE}
	sampFrac <- 1
	```


	```{r eval=FALSE, echo=FALSE}
	ex <- ti(map=function(a,b){ if(grepl("(mozilla110\|mozilla111)",b$did[1])){
	if(length(unique(b$did))==1){
	b[,id:=a]
	b[,isdef:=sapply(isdef,function(s) if(s=='false') 0 else 1)]
	if (runif(1)<sampFrac) rhcollect(a,b)
	}
	}},reduce=0)
	set.seed(20)
	exdataOrig <- ex$take(-1)
	exdataOrig<- rbindlist(lapply(exdataOrig,"[[",2))
	ENDD <- as.Date('2017-05-20')+6*7
	exdataOrig <- exdataOrig[date<=as.character(ENDD) & date>=pcd,] ## only worry about retention till 6 weeks
	invisible(exdataOrig[, ":="(opp=as.numeric(as.Date(ENDD) - pcd[1]+1), used=.N)
	, by=list(did,id)][, p:=used/opp])

	middleFrac <- 1
	if(middleFrac < 1){
	exdata <- exdataOrig[, list(n=1),by=list(did,id)][, list(id = sample(id, .N*middleFrac)),by=did]
	exdata <- exdataOrig[id %in% exdata$id,]
	}else{
	exdata <- exdataOrig
	}
	finalFrac <- sampFrac*middleFrac
	invisible(exdata2 <- exdata[, list(seng=seng[1],
	pcd=pcd[1],opp=opp[1],isdef=isdef[1],
	used=used[1],p=(used[1]-1)/(opp[1]-1),
	hr=sum(hr),u=sum(turi),
	ts=sum(search),ttabs=sum(ttabs)
	),by=list(did,id)])

	ki2 <- exdata[,pcd:=as.Date(pcd)][,{
	con <- c()
	for(i in 0:6){
	con <- c(con, 1(any(date>=pcd + i7 & date<=(pcd+i*7+6))))
	}
	data.table(week=0:6, used=con)
	}, by=list(id,did)]

	ki3 <- ki2[,{
	z <- t.test(used~did,alternative='two.sided',conf.level=1-0.05/14)
	p1 <- mean(used[did=='mozilla110'])
	p2 <- mean(used[did=='mozilla111'])
	list(p1=p1,p2=p2,delta=z$est[1]-z$est[2], lo=z$conf.int[1],up=z$conf.int[2])
	},by=list(week)][order(week),]

	ki <- exdata[, {
	week6 = 1any( date>=(as.Date(pcd) + 67) & date<=(as.Date(pcd)+6*7+6))
	list(active6 = week6)
	},by=list(id,did)]

	```

	```{r echo=FALSE}
	v <- exdataOrig[, did[1] ,by=id][, table(V1)/finalFrac]
	h <- function(x) formatC(round(x, 2), big.mark=",",drop0trailing=TRUE, format="f")
	```

	mozilla110 and mozilla111 were two different funnelcakes. I am not aware
	of which is the control or the test so this analysis is blinded. The
	variation consisted of different messaging in the download page.

	The experiment was designed with almost equal split between mozilla110 and
	mozilla111 (comparisons with the other two funnelcakes mozilla112 and
	mozilla113 will be done). The experiment was available, that is downloads
	between 2017-05-07 to 2017-05-20 are part of the experiment randomly assigned to
	either mozilla110 or mozilla111. We have `r h(v[['mozilla110']])` profiles
	(`r (round(v['mozilla110']/sum(v)*100,1))`%) and `r h(v['mozilla111'])`
	(`r (round(v['mozilla111']/sum(v)*100,1))`%) profiles who downloaded, installed and
	pinged us from mozilla110 and mozilla111 respectively based on a
	`r finalFrac*100`% sample.


	Is there a retention difference? This is typically answered using 6 week
	retention, i.e. percentage of users in week 0 still using week 6. I will try and
	answer the question a bit more holistically.

	# Summary

	I compared,using a `r 100*finalFrac` % sample, the distributions on multiple fronts

	- in the 6th week since profile creation date, there is a difference in
	likelihood of being active
	- there is a difference in proportion of days active
	- little difference in search engine default
	- no difference in total hours/session given large usage, no difference
	for mozilla110 to have low usage (<1hr)


	Though there is significance, it is important to consider if this difference
	is practically useful.

	# Contributions

	I believe we ought compare the effects of treatment and control not just
	on one week, but their entire usage up to given week.

	Ultimately every form of retention will be related to higher usage. If a
	user is likely to come back then by consequence she will use the browser
	more and this will be reflected in increased usage.

	Other merits

	- We consider the entire 6 week period on a per profile basis
	- we control for days of available to them (i.e. we need no restrict to 6
	weeks, and instead take days on file as a explanatory variable)
	- have the ability to control for several explanatory variables(e.g. opportunity
	of days of use)
	- can describe the differences on multiple outcomes (e.g. total usage, days
	used etc)
	- demonstrate difference using multiple methods: graphical, non parametric
	and parametric.


	# Analysis


	## Likelihood of using in the 6th Week and Throughout the 6 weeks

	```{r echo=FALSE,cache=TRUE}
	mx <- glm( active6 ~ did, data=ki, family='binomial')
	oddsRatioconfIntDelta <- function(mx){
	b2=coef(mx)['didmozilla111']
	grad=exp(b2)
	vb2 <- vcov(mx)['didmozilla111','didmozilla111']
	svG <- sqrt(grad %% vb2 %% grad)
	r <- pmax(0,c(grad - qnorm(1-0.05/2)svG,grad + qnorm(1-0.05/2)svG))
	r2 <- sapply(r,function(k) if(k>1) k-1 else 1+k)
	list(r,r2)
	}
	oddsCI <- function(mx){
	r <- cbind(exp(stats::coef(mx)),exp(stats::confint(mx, level=level)),summary(mx)$coefficients[,4])
	}
	r <- oddsCI(mx)
	#j <- round((oddsRatioconfIntDelta(mx)[[2]])*100,2)
	r1 <- r['didmozilla111',c('2.5 %','97.5 %')]
	r1s <- if(1 %between% r1) "Since the effect of mozilla111 is both an increase and decrease, there isn't a signficant effect of mozilla111."
	r2 <- sapply(r['didmozilla111',c('2.5 %','97.5 %')],function(k){
	if(k>1){
	sprintf("%s %% increase",round(100*(k-1),2))
	} else{
	sprintf("%s %% drop",round(100*(1-k),2))
	}})


	#mxconf <- confint(mx,"didmozilla111")
	```

	For mozilla111, there is a between `r r2[1]`% to `r r2[2]`%
	odds of being active in the 6th week. `r r1s`.


	In the following plot, I've plotted the absolute difference in proportions
	active for every week and the confidence intervals have been Bonferoroni
	corrected (95% confidence intervals). There is very little difference between
	mozilla110 and mozilla111 (though technically speaking, mozilla111 is
	every slightly higher, negligibly so)

	```{r echo=FALSE, fig.cap='Plot of Difference in Proportions',fig.height=10, fig.width=19,fig.align='center'}
	xyplot( delta ~ week, type='l',lwd=2,xlab="Weeks Since PCD",
	ylab='Difference in Proportion of Profiles Active (mozilla110-mozilla111)',
	panel=function(x,y,...){
	panel.polygon(c(x,rev(x)),c(ki3$lo,rev(ki3$up)),
	col='#55555540',border=FALSE)
	panel.grid()
	panel.abline(h=0)
	panel.xyplot(x,y,...)
	},data=ki3,aspect=0.7,scales=list(cex=1.5,tick.num=10),cex=1.5,ylim= range(-0.09,0.09))
	summary(mx)
	```

	#### Plot of Proportion Active In A Week

	And we can't forgo a plot of the proportion of profiles active in a week (p1
	is mozilla110 and p2 is mozilla111. There is pretty much no difference.

	```{r echo=FALSE,fig.height=10, fig.width=19,fig.align='center'}
	xyplot( p1 + p2 ~ week, type='l',lwd=2,xlab="Weeks Since PCD",
	ylab='Proportion of Profiles Active',auto.key=list(columns=2)
	,panel=function(x,y,...){
	panel.grid()
	panel.xyplot(x,y,...)
	},data=ki3,aspect=0.7,scales=list(cex=1.5,tick.num=10),cex=1.5)
	```

	#### Cumulative Active Plot

	```{r echo=FALSE, warning=FALSE, message=FALSE}

	ll <- ki2[, {
	fp <- min(which(used==1))
	list(first=if(fp<Inf) week[fp] else NA_integer_)
	},by=list(did,id)]
	llx <- ll[!is.na(first), list(week=0, T=.N-(sum(first<=0)),PctCameAfterFirstWeek = (sum(first<=0))*100 / .N),by=did]
	llx0 <- llx[, mean(PctCameAfterFirstWeek)]
	ll2 <- rbindlist(lapply(1:6,function(J){
	ll[!is.na(first), list(week=J
	,PctCameAfterFirstWeek = (-sum(first<=0) + sum(first<=J))*100 / llx[did==.BY$did,T])
	, by=list(did)]
	}))

	```


	The following is a cumulative active chart. Per week, what proportion of our
	control/test group have become active given they didn't in the first week.
	(For the record, `r round(llx0,2)`% of profiles use the browser in the first week)
	mozilla111 is higher which implies they end to use it more rapidly compared to
	mozilla110 but this seems mostly true for the first week and less so thereafter.

	```{r echo=FALSE,fig.cap='Cumulative Actives Given Not used In first week',fig.height=10, fig.width=19,fig.align='center'}

	xyplot( PctCameAfterFirstWeek ~ week, group=did,data=ll2, auto.key=list(column=2),type=c('l','g')
	,scale=list(cex=1.5))
	```


	## Overall Usage: proportion of days used across observed life

	### By Bootstrap
	We can compute the ratio $pu = \frac{used-1}{opp-1}$ which is the proportion of days
	available to the profile used. We subtract 1 from both since everyone is active
	for at least one day. The following figure indicates a small difference
	between test and control.

	```{r echo=FALSE,fig.cap='Boxplot of Proportion of Days Active',fig.height=10, fig.width=19,fig.align='center'}


	bwplot( did ~ p, data=exdata2, xlab='Proportion of Days Active', ylab='distribution'
	,scale=list(cex=1.5,x=list(tick.num=20)))
	```

	```{r echo=FALSE, cache=TRUE}

	library(boot)
	bfunction <- function(x,d){
	xx <- x[d,]
	delta <- xx[did=='mozilla111',mean(p)] - xx[did=='mozilla110',mean(p)]
	delta
	}
	pb <- boot(exdata2, bfunction, R=1000,strata=exdata2$did=='mozilla110')
	xx <- round(as.numeric(quantile(pb$t, c(0.025,0.975))),3)
	xx3 <- mean(pb$t)
	```

	Without making any parametric assumptions, we compute 1000 bootstrap samples of
	the mean $pu$ for both mozilla110 and mozilla111. The difference
	(mozilla111 - mozilla110) is computed
	and I've plotted the histogram of this $pu$. The vertical lines correspond to
	the 95% bootstrap CI $(`r xx[1]`,`r xx[2]`)$ and the central vertical line the
	bootstrap estimate of the mean difference in proportion of days used: (`r round(xx3,3)`).

	**Since this interval does not 0, we are led to believe that there is a
	difference in $pu$.**


	```{r echo=FALSE,fig.cap='Proportion of Opportunity Active vs Opportunity',fig.height=10, fig.width=19,fig.align='center'}
	densityplot( ~ pb$t, panel=function(...){
	m1 <- mean(pb$t)
	m2 <- quantile(pb$t,0.025)
	m3 <- quantile(pb$t,0.975)
	panel.densityplot(...)
	panel.abline(v=c(m2,m1,m3), lwd=1,lty=3,col='#888888')
	},lwd=2,
	scale=list(cex=1.5)
	,xlab='Bootstrap Replicates of difference in proportion of days active',ylab='Density',bw=0.01,aspect=0.75)

	```

	```{r echo=FALSE, cache=TRUE}

	bfunction <- function(x,d){
	xx <- x[d,]
	N <- nrow(xx);
	delta <- xx[1:(N/2),mean(p)] - xx[(N/2):N,mean(p)]
	delta
	}
	pbnull <- boot(exdata2, bfunction, R=1000)
	observeddiff <- exdata2[, list(p=mean(p)),by=did]
	observeddiff <- abs(observeddiff[did=='mozilla111',p]-observeddiff[did=='mozilla110',p])
	res <- mean(abs(pbnull$t) >= observeddiff)

	```

	We can confirm this with a bootstrap test. Under the null hypothesis of no
	difference we can compute the null distribution of the differences. We then
	compute the proportion of bootstrap differences that are greater in absolute
	sense than our observed mean difference in $pu$ between mozilla110 and
	mozilla111 (which is `r round(observeddiff,5)`). **This proportion is
	`r res` which is less than 0.05 and conclude that there is
	significant difference** in the proportion of days used by profiles of
	mozilla110 and mozilla111. That said the difference is extremely small
	(`r round(xx3,4)`) and very likely negligible.

	### Using A Statistical Model

	```{r}

	invisible(mdl <- glm( used ~ (opp)+did+log(ttabs+1)+log(u+1), data=exdata2,family='quasipoisson'))
	j2 <- round((oddsRatioconfIntDelta(mdl)[[1]]-1)*100,2)

	mdl2 <- glm( used ~ (opp)+did*isdef+log(ttabs+1)+log(u+1), data=exdata2,family='quasipoisson')

	```

	Instead of using the bootstrap, we'll use a statistical model. In the above we
	model the dependency of days used controlling for *number of days available to
	profile* , distribution_id and other usage (tabs opened, uris visited). For
	mozilla111 there is between `r j2[1]`% to `r j2[2]`% lift in the days used
	(given same opportunity and session usage). This matches the result we got in
	the bootstrap approach: that mozilla111 has a slightly less rate of use.


	```{r echo=FALSE}
	## http://www.theanalysisfactor.com/interactions-main-effects-not-significant/
	summary( mdl2 )
	```



	`r round(100*exdata2[, mean(isdef)],2)`% have Firefox has default.
	The table of proportion of usage follows here (notice the difference for non
	default profiles and default profiles).

	(Side note to author: in the above model, including `isdefault` doesn't make
	much sense since default status is very much correlated to hours/profile/day and
	tabs opened. Hence in the following proportion table we see a difference between
	default and non default users but it becomes negligible when we control for
	usage.

	```{r echo=FALSE}

	exdata2[, mean(p),by=list(isdef,did)][order(isdef,did),][, list(Default=isdef, Distribution=did,ProportionDaysUSed=V1)]

	```

	## Usage Characteristics between mozilla110 and mozilla111

	We'll compare the two distributions using search engine default and total hours
	used.

	### Search Engine default

	```{r}
	ox <- (exdata2[, prop.table(table(seng,did),2)*100])
	ox[ox[,'mozilla110']>1 & ox[,'mozilla111']>1,]
	```

	```{r echo=FALSE}
	#ox <- exdata2[, (chisq.test(table(seng,did),simulate.p.value = TRUE))]
	```

	### Total Hours/Profile Used

	Seems the hour usage is multimodal (mostly because of a preponderance of low
	usage profiles). We need to take this into account.

	```{r echo=FALSE,fig.cap='Histogram of Total Hours used by Distribution',fig.height=10, fig.width=19,fig.align='center'}
	histogram(~ hr,groups=did, data=exdata2[hr>0,]
	,xlab='Total Hours(log 2 scale)',scale=list(x=list(tick.num=20,cex=1.5,log=2)))
	```

	Looking at the above figure, we consider 1 hr as the cut off and model this
	separately.

	We model the likelihood of using the browser for less than one hour. We see that
	mozilla111 users are less likely to use it for less than one hour. It seems
	that mozilla111 have lesser likelihood of using the browser for less than an
	hour


	```{r }
	mdlp <- glm( I(hr<1) ~ used + did,data=exdata2,family='binomial')
	(summary(mdlp))
	## 95% confidence interval for odds ratio
	print(oddsCI(mdlp))
	```

	but given they use it for more than an hour, mozilla111 don't have a
	significant hours/profile difference (controlling for days used and uris (`u`) visited)

	```{r }
	mdl2 <- glm( log(hr) ~ used+did+u,data=exdata2[hr>=1,])
	print(summary(mdl2))
	#print(oddsCI(mdl2))
	```






	# Appendix

	## Python Extraction Code



	```{python WhatIsThis, eval=FALSE}


	# run with pyspark

	import sys
	import datetime
	import random
	import subprocess
	import mozillametricstools.common.functions as mozfun

	# "active_addons"
	mozfun.register_udf(sqlContext
	, lambda arr: sum(arr) if arr else 0, "array_sum"
	, pyspark.sql.types.IntegerType())
	ms = sqlContext.read.load("s3://telemetry-parquet/main_summary/v4", "parquet"
	, mergeSchema=True)
	sqlContext.registerDataFrameAsTable(ms,"ms")


	#



	ms3 = sqlContext.sql("""
	select
	sample_id,client_id, distribution_id,subsession_start_date,
	profile_creation_date,default_search_engine,
	subsession_length, scalar_parent_browser_engagement_total_uri_count , is_default_browser,
	scalar_parent_browser_engagement_tab_open_event_count, search_counts
	from ms
	where
	distribution_id in ( 'mozilla111','mozilla112','mozilla113','mozilla110')
	and profile_creation_date >= 17292 and profile_creation_date <= 17306
	and substring(submission_date, 1,10)>='2017-05-07'
	and channel = 'release'
	and app_name = 'Firefox'
	""")
	sqlContext.registerDataFrameAsTable(ms3, "ms3")
	# sqlContext.sql("""
	# select distribution_id, count(distinct(client_id))*4
	# from ms3 where sample_id <='25' group by distribution_id
	# """).toPandas()





	ms4 = sqlContext.sql("""
	select
	client_id as cid,
	case
	when distribution_id is null then 'miss'
	else distribution_id
	end as did,
	substring(subsession_start_date,1,10) as date,
	from_unixtime(profile_creation_date*86400,'yyyy-MM-dd') as pcd,
	count(*) as ns,
	last(case when default_search_engine is not null
	then default_search_engine else 'miss' end) as seng,
	sum(case when subsession_length is null
	then 0 else subsession_length end) as ssl,
	sum(case when scalar_parent_browser_engagement_total_uri_count is null
	then 0 else scalar_parent_browser_engagement_total_uri_count end ) as turi,
	max(case when is_default_browser is null
	then False else is_default_browser end) as isdef,
	sum(case when scalar_parent_browser_engagement_tab_open_event_count is null
	then 0 else scalar_parent_browser_engagement_tab_open_event_count end) as ttabs,
	sum(case
	when search_counts is not null then array_sum(search_counts.count) else 0
	end) as tsearch
	from ms3
	group by 1,2,3,4
	""")

	import subprocess
	O = "s3://mozilla-metrics/sguha/tmp/tmpfiles"
	subprocess.call(["aws", "s3", "rm", "--recursive", O])
	write = pyspark.sql.DataFrameWriter(ms4.coalesce(400))
	write.csv(path=O
	,mode='overwrite'
	,nullValue="NA"
	,header=False)
	print(1)


	```


	## R Modification Code

	```{r eval=FALSE}

	## Convert the above data set to RHIPE format where each 'row'
	## corresponds to `client_id` and it's value is a data table. I really
	## should move this to python, but not really bothered about
	## it. Coding in R is easier and i can consider doing some fancier
	## statistical approaches across subsets.

	source("~/rhipe.r")
	z <- rhwatch(
	map=expression({
	tryCatch({
	z <- fread(paste(unlist(map.values),collapse="\n")
	, colClasses=c('character','character','character','character',
	'character','numeric','numeric','numeric','character','numeric',
	'numeric'))
	setnames(z,c("cid","did","date","pcd",
	'ns',"seng","hr","turi","isdef",'ttabs',"search"))
	z[, date:=as.Date(date,"%Y-%m-%d")]
	z[, pcd:=as.Date(pcd,"%Y-%m-%d")]
	z[, hr:=hr/3600]
	z[, rhcollect(.BY$cid, .SD),by=cid]
	}, error=function(e) { rhcounter("errors",as.character(e),1)})
	}),
	reduce=dtbinder(), input=rhfmt("text",folders="s3://mozilla-metrics/sguha/tmp/tmpfiles")
	,output="s3://mozilla-metrics/sguha/tmp/mozilla110vs111vs112vs113"
	,mapred=list(mapred.reduce.tasks=300)
	,read=FALSE
	,setup=E)

	I <- 's3://mozilla-metrics/sguha/tmp/mozilla110vs111vs112vs113'
	ti <- rh(I,E)

	## Should I read all? Or take a sample. Use at least 50K profiles
	ti(map=function(a,b) { rhcollect(list(did=b$did[1]),1) })$collect()

	```