Created
June 9, 2017 20:12
-
-
Save saptarshiguha/73e5e4c5c7d3d5086cf282f84901f612 to your computer and use it in GitHub Desktop.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
--- | |
title: Churn Analyses for mozilla100 vs mozilla111 | |
author: Saptarshi Guha <[email protected]> | |
date: "`r format(Sys.time(), '%B %d, %Y')`" | |
output: | |
html_document: | |
toc: true | |
toc_float: true | |
mathjax: default | |
self_contained: true | |
theme: readable | |
highlight: haddock | |
number_sections: true | |
--- | |
<style> | |
pre code, pre, code { | |
white-space: pre !important; | |
overflow-x: scroll !important; | |
word-break: keep-all !important; | |
word-wrap: initial !important; | |
} | |
.section p { | |
padding-left: 1em; | |
padding-right:1em; | |
} | |
.tocify-extend-page { | |
display: none | |
} | |
</style> | |
```{r global_options, include=FALSE} | |
knitr::opts_chunk$set(fig.height=10, | |
fig.width=20, | |
fig.align='center', | |
fig.retina=FALSE, | |
echo=TRUE, | |
cache=TRUE, | |
knitron.profile='default', | |
warning=FALSE, | |
message=FALSE) | |
``` | |
# Introduction mozilla110 vs mozilla111 | |
```{r echo=FALSE} | |
sampFrac <- 1 | |
``` | |
```{r eval=FALSE, echo=FALSE} | |
ex <- ti(map=function(a,b){ if(grepl("(mozilla110|mozilla111)",b$did[1])){ | |
if(length(unique(b$did))==1){ | |
b[,id:=a] | |
b[,isdef:=sapply(isdef,function(s) if(s=='false') 0 else 1)] | |
if (runif(1)<sampFrac) rhcollect(a,b) | |
} | |
}},reduce=0) | |
set.seed(20) | |
exdataOrig <- ex$take(-1) | |
exdataOrig<- rbindlist(lapply(exdataOrig,"[[",2)) | |
ENDD <- as.Date('2017-05-20')+6*7 | |
exdataOrig <- exdataOrig[date<=as.character(ENDD) & date>=pcd,] ## only worry about retention till 6 weeks | |
invisible(exdataOrig[, ":="(opp=as.numeric(as.Date(ENDD) - pcd[1]+1), used=.N) | |
, by=list(did,id)][, p:=used/opp]) | |
middleFrac <- 1 | |
if(middleFrac < 1){ | |
exdata <- exdataOrig[, list(n=1),by=list(did,id)][, list(id = sample(id, .N*middleFrac)),by=did] | |
exdata <- exdataOrig[id %in% exdata$id,] | |
}else{ | |
exdata <- exdataOrig | |
} | |
finalFrac <- sampFrac*middleFrac | |
invisible(exdata2 <- exdata[, list(seng=seng[1], | |
pcd=pcd[1],opp=opp[1],isdef=isdef[1], | |
used=used[1],p=(used[1]-1)/(opp[1]-1), | |
hr=sum(hr),u=sum(turi), | |
ts=sum(search),ttabs=sum(ttabs) | |
),by=list(did,id)]) | |
ki2 <- exdata[,pcd:=as.Date(pcd)][,{ | |
con <- c() | |
for(i in 0:6){ | |
con <- c(con, 1*(any(date>=pcd + i*7 & date<=(pcd+i*7+6)))) | |
} | |
data.table(week=0:6, used=con) | |
}, by=list(id,did)] | |
ki3 <- ki2[,{ | |
z <- t.test(used~did,alternative='two.sided',conf.level=1-0.05/14) | |
p1 <- mean(used[did=='mozilla110']) | |
p2 <- mean(used[did=='mozilla111']) | |
list(p1=p1,p2=p2,delta=z$est[1]-z$est[2], lo=z$conf.int[1],up=z$conf.int[2]) | |
},by=list(week)][order(week),] | |
ki <- exdata[, { | |
week6 = 1*any( date>=(as.Date(pcd) + 6*7) & date<=(as.Date(pcd)+6*7+6)) | |
list(active6 = week6) | |
},by=list(id,did)] | |
``` | |
```{r echo=FALSE} | |
v <- exdataOrig[, did[1] ,by=id][, table(V1)/finalFrac] | |
h <- function(x) formatC(round(x, 2), big.mark=",",drop0trailing=TRUE, format="f") | |
``` | |
*mozilla110* and *mozilla111* were two different funnelcakes. I am not aware | |
of which is the control or the test so this analysis is blinded. The | |
variation consisted of different messaging in the download page. | |
The experiment was designed with **almost equal split** between *mozilla110* and | |
*mozilla111* (comparisons with the other two funnelcakes mozilla112 and | |
mozilla113 will be done). The experiment was available, that is downloads | |
between 2017-05-07 to 2017-05-20 are part of the experiment randomly assigned to | |
either *mozilla110* or *mozilla111*. We have `r h(v[['mozilla110']])` profiles | |
(`r (round(v['mozilla110']/sum(v)*100,1))`%) and `r h(v['mozilla111'])` | |
(`r (round(v['mozilla111']/sum(v)*100,1))`%) profiles who downloaded, installed and | |
pinged us from *mozilla110* and *mozilla111* respectively based on a | |
`r finalFrac*100`% sample. | |
Is there a retention difference? This is typically answered using 6 week | |
retention, i.e. percentage of users in week 0 still using week 6. I will try and | |
answer the question a bit more holistically. | |
# Summary | |
I compared,using a `r 100*finalFrac` % sample, the distributions on multiple fronts | |
- in the 6th week since profile creation date, there is a difference in | |
likelihood of being active | |
- there is a difference in proportion of days active | |
- little difference in search engine default | |
- no difference in total hours/session given large usage, no difference | |
for *mozilla110* to have low usage (<1hr) | |
Though there is significance, it is important to consider if this difference | |
is practically useful. | |
# Contributions | |
I believe we ought compare the effects of treatment and control not just | |
on one week, but their entire usage up to given week. | |
Ultimately every form of retention will be related to higher usage. If a | |
user is likely to come back then by consequence she will use the browser | |
more and this will be reflected in increased usage. | |
Other merits | |
- We consider the entire 6 week period on a per profile basis | |
- we control for days of available to them (i.e. we need no restrict to 6 | |
weeks, and instead take days on file as a explanatory variable) | |
- have the ability to control for several explanatory variables(e.g. opportunity | |
of days of use) | |
- can describe the differences on multiple outcomes (e.g. total usage, days | |
used etc) | |
- demonstrate difference using multiple methods: graphical, non parametric | |
and parametric. | |
# Analysis | |
## Likelihood of using in the 6th Week and Throughout the 6 weeks | |
```{r echo=FALSE,cache=TRUE} | |
mx <- glm( active6 ~ did, data=ki, family='binomial') | |
oddsRatioconfIntDelta <- function(mx){ | |
b2=coef(mx)['didmozilla111'] | |
grad=exp(b2) | |
vb2 <- vcov(mx)['didmozilla111','didmozilla111'] | |
svG <- sqrt(grad %*% vb2 %*% grad) | |
r <- pmax(0,c(grad - qnorm(1-0.05/2)*svG,grad + qnorm(1-0.05/2)*svG)) | |
r2 <- sapply(r,function(k) if(k>1) k-1 else 1+k) | |
list(r,r2) | |
} | |
oddsCI <- function(mx){ | |
r <- cbind(exp(stats::coef(mx)),exp(stats::confint(mx, level=level)),summary(mx)$coefficients[,4]) | |
} | |
r <- oddsCI(mx) | |
#j <- round((oddsRatioconfIntDelta(mx)[[2]])*100,2) | |
r1 <- r['didmozilla111',c('2.5 %','97.5 %')] | |
r1s <- if(1 %between% r1) "Since the effect of mozilla111 is both an increase and decrease, there isn't a signficant effect of mozilla111." | |
r2 <- sapply(r['didmozilla111',c('2.5 %','97.5 %')],function(k){ | |
if(k>1){ | |
sprintf("%s %% increase",round(100*(k-1),2)) | |
} else{ | |
sprintf("%s %% drop",round(100*(1-k),2)) | |
}}) | |
#mxconf <- confint(mx,"didmozilla111") | |
``` | |
For *mozilla111*, there is a between `r r2[1]`% to `r r2[2]`% | |
odds of being active in the 6th week. `r r1s`. | |
In the following plot, I've plotted the absolute difference in proportions | |
active for every week and the confidence intervals have been Bonferoroni | |
corrected (95% confidence intervals). There is very little difference between | |
*mozilla110* and *mozilla111* (though technically speaking, *mozilla111* is | |
every slightly higher, negligibly so) | |
```{r echo=FALSE, fig.cap='Plot of Difference in Proportions',fig.height=10, fig.width=19,fig.align='center'} | |
xyplot( delta ~ week, type='l',lwd=2,xlab="Weeks Since PCD", | |
ylab='Difference in Proportion of Profiles Active (mozilla110-mozilla111)', | |
panel=function(x,y,...){ | |
panel.polygon(c(x,rev(x)),c(ki3$lo,rev(ki3$up)), | |
col='#55555540',border=FALSE) | |
panel.grid() | |
panel.abline(h=0) | |
panel.xyplot(x,y,...) | |
},data=ki3,aspect=0.7,scales=list(cex=1.5,tick.num=10),cex=1.5,ylim= range(-0.09,0.09)) | |
summary(mx) | |
``` | |
#### Plot of Proportion Active In A Week | |
And we can't forgo a plot of the proportion of profiles active in a week (p1 | |
is *mozilla110* and p2 is *mozilla111*. There is pretty much no difference. | |
```{r echo=FALSE,fig.height=10, fig.width=19,fig.align='center'} | |
xyplot( p1 + p2 ~ week, type='l',lwd=2,xlab="Weeks Since PCD", | |
ylab='Proportion of Profiles Active',auto.key=list(columns=2) | |
,panel=function(x,y,...){ | |
panel.grid() | |
panel.xyplot(x,y,...) | |
},data=ki3,aspect=0.7,scales=list(cex=1.5,tick.num=10),cex=1.5) | |
``` | |
#### Cumulative Active Plot | |
```{r echo=FALSE, warning=FALSE, message=FALSE} | |
ll <- ki2[, { | |
fp <- min(which(used==1)) | |
list(first=if(fp<Inf) week[fp] else NA_integer_) | |
},by=list(did,id)] | |
llx <- ll[!is.na(first), list(week=0, T=.N-(sum(first<=0)),PctCameAfterFirstWeek = (sum(first<=0))*100 / .N),by=did] | |
llx0 <- llx[, mean(PctCameAfterFirstWeek)] | |
ll2 <- rbindlist(lapply(1:6,function(J){ | |
ll[!is.na(first), list(week=J | |
,PctCameAfterFirstWeek = (-sum(first<=0) + sum(first<=J))*100 / llx[did==.BY$did,T]) | |
, by=list(did)] | |
})) | |
``` | |
The following is a cumulative active chart. Per week, what proportion of our | |
control/test group have become active given they didn't in the first week. | |
(For the record, `r round(llx0,2)`% of profiles use the browser in the first week) | |
*mozilla111* is higher which implies they end to use it more rapidly compared to | |
*mozilla110* but this seems mostly true for the first week and less so thereafter. | |
```{r echo=FALSE,fig.cap='Cumulative Actives Given Not used In first week',fig.height=10, fig.width=19,fig.align='center'} | |
xyplot( PctCameAfterFirstWeek ~ week, group=did,data=ll2, auto.key=list(column=2),type=c('l','g') | |
,scale=list(cex=1.5)) | |
``` | |
## Overall Usage: proportion of days used across observed life | |
### By Bootstrap | |
We can compute the ratio $pu = \frac{used-1}{opp-1}$ which is the proportion of days | |
available to the profile used. We subtract 1 from both since everyone is active | |
for at least one day. The following figure indicates a *small* difference | |
between test and control. | |
```{r echo=FALSE,fig.cap='Boxplot of Proportion of Days Active',fig.height=10, fig.width=19,fig.align='center'} | |
bwplot( did ~ p, data=exdata2, xlab='Proportion of Days Active', ylab='distribution' | |
,scale=list(cex=1.5,x=list(tick.num=20))) | |
``` | |
```{r echo=FALSE, cache=TRUE} | |
library(boot) | |
bfunction <- function(x,d){ | |
xx <- x[d,] | |
delta <- xx[did=='mozilla111',mean(p)] - xx[did=='mozilla110',mean(p)] | |
delta | |
} | |
pb <- boot(exdata2, bfunction, R=1000,strata=exdata2$did=='mozilla110') | |
xx <- round(as.numeric(quantile(pb$t, c(0.025,0.975))),3) | |
xx3 <- mean(pb$t) | |
``` | |
Without making any parametric assumptions, we compute 1000 bootstrap samples of | |
the mean $pu$ for both *mozilla110* and *mozilla111*. The difference | |
(*mozilla111* - *mozilla110*) is computed | |
and I've plotted the histogram of this $pu$. The vertical lines correspond to | |
the 95% bootstrap CI $(`r xx[1]`,`r xx[2]`)$ and the central vertical line the | |
bootstrap estimate of the mean difference in proportion of days used: (`r round(xx3,3)`). | |
**Since this interval does not 0, we are led to believe that there is a | |
difference in $pu$.** | |
```{r echo=FALSE,fig.cap='Proportion of Opportunity Active vs Opportunity',fig.height=10, fig.width=19,fig.align='center'} | |
densityplot( ~ pb$t, panel=function(...){ | |
m1 <- mean(pb$t) | |
m2 <- quantile(pb$t,0.025) | |
m3 <- quantile(pb$t,0.975) | |
panel.densityplot(...) | |
panel.abline(v=c(m2,m1,m3), lwd=1,lty=3,col='#888888') | |
},lwd=2, | |
scale=list(cex=1.5) | |
,xlab='Bootstrap Replicates of difference in proportion of days active',ylab='Density',bw=0.01,aspect=0.75) | |
``` | |
```{r echo=FALSE, cache=TRUE} | |
bfunction <- function(x,d){ | |
xx <- x[d,] | |
N <- nrow(xx); | |
delta <- xx[1:(N/2),mean(p)] - xx[(N/2):N,mean(p)] | |
delta | |
} | |
pbnull <- boot(exdata2, bfunction, R=1000) | |
observeddiff <- exdata2[, list(p=mean(p)),by=did] | |
observeddiff <- abs(observeddiff[did=='mozilla111',p]-observeddiff[did=='mozilla110',p]) | |
res <- mean(abs(pbnull$t) >= observeddiff) | |
``` | |
We can confirm this with a bootstrap test. Under the null hypothesis of no | |
difference we can compute the null distribution of the differences. We then | |
compute the proportion of bootstrap differences that are greater in absolute | |
sense than our observed mean difference in $pu$ between *mozilla110* and | |
*mozilla111* (which is `r round(observeddiff,5)`). **This proportion is | |
`r res` which is less than 0.05 and conclude that there is | |
significant difference** in the proportion of days used by profiles of | |
*mozilla110* and *mozilla111*. That said the difference is extremely small | |
(`r round(xx3,4)`) and very likely negligible. | |
### Using A Statistical Model | |
```{r} | |
invisible(mdl <- glm( used ~ (opp)+did+log(ttabs+1)+log(u+1), data=exdata2,family='quasipoisson')) | |
j2 <- round((oddsRatioconfIntDelta(mdl)[[1]]-1)*100,2) | |
mdl2 <- glm( used ~ (opp)+did*isdef+log(ttabs+1)+log(u+1), data=exdata2,family='quasipoisson') | |
``` | |
Instead of using the bootstrap, we'll use a statistical model. In the above we | |
model the dependency of *days used* controlling for *number of days available to | |
profile* , *distribution_id* and other usage (tabs opened, uris visited). For | |
*mozilla111* there is between `r j2[1]`% to `r j2[2]`% lift in the days used | |
(given same opportunity and session usage). This matches the result we got in | |
the bootstrap approach: that *mozilla111* has a slightly less rate of use. | |
```{r echo=FALSE} | |
## http://www.theanalysisfactor.com/interactions-main-effects-not-significant/ | |
summary( mdl2 ) | |
``` | |
`r round(100*exdata2[, mean(isdef)],2)`% have Firefox has default. | |
The table of proportion of usage follows here (notice the difference for non | |
default profiles and default profiles). | |
(Side note to author: in the above model, including `isdefault` doesn't make | |
much sense since default status is very much correlated to hours/profile/day and | |
tabs opened. Hence in the following proportion table we see a difference between | |
default and non default users but it becomes negligible when we control for | |
usage. | |
```{r echo=FALSE} | |
exdata2[, mean(p),by=list(isdef,did)][order(isdef,did),][, list(Default=isdef, Distribution=did,ProportionDaysUSed=V1)] | |
``` | |
## Usage Characteristics between *mozilla110* and *mozilla111* | |
We'll compare the two distributions using search engine default and total hours | |
used. | |
### Search Engine default | |
```{r} | |
ox <- (exdata2[, prop.table(table(seng,did),2)*100]) | |
ox[ox[,'mozilla110']>1 & ox[,'mozilla111']>1,] | |
``` | |
```{r echo=FALSE} | |
#ox <- exdata2[, (chisq.test(table(seng,did),simulate.p.value = TRUE))] | |
``` | |
### Total Hours/Profile Used | |
Seems the hour usage is multimodal (mostly because of a preponderance of low | |
usage profiles). We need to take this into account. | |
```{r echo=FALSE,fig.cap='Histogram of Total Hours used by Distribution',fig.height=10, fig.width=19,fig.align='center'} | |
histogram(~ hr,groups=did, data=exdata2[hr>0,] | |
,xlab='Total Hours(log 2 scale)',scale=list(x=list(tick.num=20,cex=1.5,log=2))) | |
``` | |
Looking at the above figure, we consider 1 hr as the cut off and model this | |
separately. | |
We model the likelihood of using the browser for less than one hour. We see that | |
*mozilla111* users are less likely to use it for less than one hour. It seems | |
that *mozilla111* have lesser likelihood of using the browser for less than an | |
hour | |
```{r } | |
mdlp <- glm( I(hr<1) ~ used + did,data=exdata2,family='binomial') | |
(summary(mdlp)) | |
## 95% confidence interval for odds ratio | |
print(oddsCI(mdlp)) | |
``` | |
but given they use it for more than an hour, *mozilla111* don't have a | |
significant hours/profile difference (controlling for days used and uris (`u`) visited) | |
```{r } | |
mdl2 <- glm( log(hr) ~ used+did+u,data=exdata2[hr>=1,]) | |
print(summary(mdl2)) | |
#print(oddsCI(mdl2)) | |
``` | |
# Appendix | |
## Python Extraction Code | |
```{python WhatIsThis, eval=FALSE} | |
# run with pyspark | |
import sys | |
import datetime | |
import random | |
import subprocess | |
import mozillametricstools.common.functions as mozfun | |
# "active_addons" | |
mozfun.register_udf(sqlContext | |
, lambda arr: sum(arr) if arr else 0, "array_sum" | |
, pyspark.sql.types.IntegerType()) | |
ms = sqlContext.read.load("s3://telemetry-parquet/main_summary/v4", "parquet" | |
, mergeSchema=True) | |
sqlContext.registerDataFrameAsTable(ms,"ms") | |
# | |
ms3 = sqlContext.sql(""" | |
select | |
sample_id,client_id, distribution_id,subsession_start_date, | |
profile_creation_date,default_search_engine, | |
subsession_length, scalar_parent_browser_engagement_total_uri_count , is_default_browser, | |
scalar_parent_browser_engagement_tab_open_event_count, search_counts | |
from ms | |
where | |
distribution_id in ( 'mozilla111','mozilla112','mozilla113','mozilla110') | |
and profile_creation_date >= 17292 and profile_creation_date <= 17306 | |
and substring(submission_date, 1,10)>='2017-05-07' | |
and channel = 'release' | |
and app_name = 'Firefox' | |
""") | |
sqlContext.registerDataFrameAsTable(ms3, "ms3") | |
# sqlContext.sql(""" | |
# select distribution_id, count(distinct(client_id))*4 | |
# from ms3 where sample_id <='25' group by distribution_id | |
# """).toPandas() | |
ms4 = sqlContext.sql(""" | |
select | |
client_id as cid, | |
case | |
when distribution_id is null then 'miss' | |
else distribution_id | |
end as did, | |
substring(subsession_start_date,1,10) as date, | |
from_unixtime(profile_creation_date*86400,'yyyy-MM-dd') as pcd, | |
count(*) as ns, | |
last(case when default_search_engine is not null | |
then default_search_engine else 'miss' end) as seng, | |
sum(case when subsession_length is null | |
then 0 else subsession_length end) as ssl, | |
sum(case when scalar_parent_browser_engagement_total_uri_count is null | |
then 0 else scalar_parent_browser_engagement_total_uri_count end ) as turi, | |
max(case when is_default_browser is null | |
then False else is_default_browser end) as isdef, | |
sum(case when scalar_parent_browser_engagement_tab_open_event_count is null | |
then 0 else scalar_parent_browser_engagement_tab_open_event_count end) as ttabs, | |
sum(case | |
when search_counts is not null then array_sum(search_counts.count) else 0 | |
end) as tsearch | |
from ms3 | |
group by 1,2,3,4 | |
""") | |
import subprocess | |
O = "s3://mozilla-metrics/sguha/tmp/tmpfiles" | |
subprocess.call(["aws", "s3", "rm", "--recursive", O]) | |
write = pyspark.sql.DataFrameWriter(ms4.coalesce(400)) | |
write.csv(path=O | |
,mode='overwrite' | |
,nullValue="NA" | |
,header=False) | |
print(1) | |
``` | |
## R Modification Code | |
```{r eval=FALSE} | |
## Convert the above data set to RHIPE format where each 'row' | |
## corresponds to `client_id` and it's value is a data table. I really | |
## should move this to python, but not really bothered about | |
## it. Coding in R is easier and i can consider doing some fancier | |
## statistical approaches across subsets. | |
source("~/rhipe.r") | |
z <- rhwatch( | |
map=expression({ | |
tryCatch({ | |
z <- fread(paste(unlist(map.values),collapse="\n") | |
, colClasses=c('character','character','character','character', | |
'character','numeric','numeric','numeric','character','numeric', | |
'numeric')) | |
setnames(z,c("cid","did","date","pcd", | |
'ns',"seng","hr","turi","isdef",'ttabs',"search")) | |
z[, date:=as.Date(date,"%Y-%m-%d")] | |
z[, pcd:=as.Date(pcd,"%Y-%m-%d")] | |
z[, hr:=hr/3600] | |
z[, rhcollect(.BY$cid, .SD),by=cid] | |
}, error=function(e) { rhcounter("errors",as.character(e),1)}) | |
}), | |
reduce=dtbinder(), input=rhfmt("text",folders="s3://mozilla-metrics/sguha/tmp/tmpfiles") | |
,output="s3://mozilla-metrics/sguha/tmp/mozilla110vs111vs112vs113" | |
,mapred=list(mapred.reduce.tasks=300) | |
,read=FALSE | |
,setup=E) | |
I <- 's3://mozilla-metrics/sguha/tmp/mozilla110vs111vs112vs113' | |
ti <- rh(I,E) | |
## Should I read all? Or take a sample. Use at least 50K profiles | |
ti(map=function(a,b) { rhcollect(list(did=b$did[1]),1) })$collect() | |
``` | |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment