
@joakin
Last active June 7, 2018 08:30

Things we currently do to set up an A/B test as an engineer

Planning

  • Review the tasks
    • Understand what is being asked to be tracked and why
      • Analyze the feasibility of the implementation, provide feedback, and raise questions with Product and Analysts
  • Flesh out the technical information (see Development) for those tasks
  • Ensure there are separate tasks created for implementation of tracking, and deployment of tracking

Development

  • Ensure both variants are accessible to users at the same time
    • If they aren't, add code paths so that both variants can work at the same time
  • Create configuration variables for the experiment with the buckets
    • Name the configuration variables and give them default values
    • If needed, read the config on the server to decide whether to ship certain code bundles to the clients
    • Read the config in the client
  • Create user bucketing
    • If needed, write feature-specific user targeting code (anonymous, logged in, etc.). This may be based on additional configuration flags (per-wiki configs)
    • Configure mw.experiments with the test config to bucket users (see the sketch after this list)
    • Configure variants appropriately depending on the bucket
  • Create EventLogging schema in wikitech
    • Document all fields
    • Review with analyst
    • Add configuration for schema name and version into code base
  • Add test-specific instrumentation to the variants
    • Code event sending for the specific events needed by the schema
      • For the different code paths
      • For the different buckets
      • With the data necessary for each type of user or event
  • Sampling - if needed (it usually is)
    • Add configuration for event logging sampling rate
    • If sampling is a probability over individual events, configure event sending accordingly
    • If sampling is something more complex (it usually is), like receiving events for all pages in a session, or per page, or by some other factor
      • Code up event sampling by the specific factors, and connect it to the event sending
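
As a rough illustration of how the pieces above fit together on the client, here is a minimal sketch. The config variable wgMyFeatureABTest, the experiment name my-feature-ab-test, the schema name MyFeatureSchema, the bucket names and the ext.myFeature.newVariant module are all hypothetical; the real names, bucket sizes and sampling strategy depend on the experiment.

```js
// Minimal sketch only; names and values are hypothetical examples.
( function () {
	// Experiment config exported by the server, e.g.
	// { enabled: true, buckets: { control: 0.5, treatment: 0.5 }, samplingRate: 0.1 }
	var config = mw.config.get( 'wgMyFeatureABTest' ) ||
		{ enabled: false, buckets: {}, samplingRate: 0 };

	// Bucket the user with mw.experiments, keyed on a stable per-session token
	// so a given session always lands in the same bucket.
	var bucket = mw.experiments.getBucket( {
		name: 'my-feature-ab-test',
		enabled: config.enabled,
		buckets: config.buckets
	}, mw.user.sessionId() );

	// Switch between the two code paths based on the bucket.
	if ( bucket === 'treatment' ) {
		// Hypothetical placeholder: load the new variant's module.
		mw.loader.using( 'ext.myFeature.newVariant' );
	}

	// Send EventLogging events for this test, respecting the sampling rate.
	// This sketch samples per event with a simple probability; per-session or
	// per-page sampling would key off a persistent token instead.
	function logTestEvent( action ) {
		if ( Math.random() >= config.samplingRate ) {
			return;
		}
		mw.eventLog.logEvent( 'MyFeatureSchema', {
			action: action,
			bucket: bucket,
			isAnon: mw.user.isAnon()
		} );
	}

	logTestEvent( 'init' );
}() );
```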

Testing

Run tests for the event logging. The verification step is complex because of configuration needs and how events can be checked, so manually run through each scenario, for all combinations of:

test variants · user buckets · sampling mechanisms · event-sending scenarios
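
The combinations multiply quickly, so it can help to generate the checklist with a throwaway snippet like the one below; the dimension values are placeholders to be replaced with the real variants, buckets, sampling modes and events for the experiment.

```js
// Throwaway helper: print every combination to check off during manual testing.
// The dimension values are placeholders for the real experiment.
var variants = [ 'control', 'treatment' ];
var userBuckets = [ 'anon', 'logged-in' ];
var samplingModes = [ 'sampled-in', 'sampled-out' ];
var eventScenarios = [ 'impression', 'click' ];

variants.forEach( function ( variant ) {
	userBuckets.forEach( function ( user ) {
		samplingModes.forEach( function ( sampling ) {
			eventScenarios.forEach( function ( event ) {
				console.log( [ variant, user, sampling, event ].join( ' · ' ) );
			} );
		} );
	} );
} );
```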

Starting an A/B test

For each wiki we want to deploy to (sometimes, for wikis of similar size, this can be done as a group):

  • Review A/B test start task and ensure all information is there
    • to which wikis
    • when
    • if there is a sampling rate needed for event logging
    • if the experiment config is well specified and makes sense
  • Schedule a SWAT deploy and make the patch with the config change
    • Test the config change locally
  • Deploy a config change for the needed wiki(s) enabling the experiment
    • Test and verify the experiment is on and live (see the console sketch after this list)
  • Update the Schema talk page with details of which sampling rate was deployed to which wiki
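
A quick, informal way to spot-check that the experiment is live after the deploy is to poke at it from the browser console on the target wiki; wgMyFeatureABTest and my-feature-ab-test are again hypothetical names. This complements, rather than replaces, verifying that valid events actually arrive (e.g. on the beta cluster or in Hive).

```js
// Run in the browser console on the target wiki after the config deploy.
// 'wgMyFeatureABTest' and 'my-feature-ab-test' are hypothetical names.
var cfg = mw.config.get( 'wgMyFeatureABTest' );
console.log( 'Experiment config:', cfg );
console.log( 'Bucket for this session:', mw.experiments.getBucket( {
	name: 'my-feature-ab-test',
	enabled: !!( cfg && cfg.enabled ),
	buckets: ( cfg && cfg.buckets ) || {}
}, mw.user.sessionId() ) );
```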

Follow-up

  • Tasks need to be created for disabling the test at a certain date
  • We track event logging error rates for validation
  • Analyst verifies data reception is solid; if there are bugs, go back to Development
  • Analyst verifies event volume is consistent with expectations

Stopping an A/B test

  • Schedule a SWAT deploy and make the patch with the config change
    • Test the config change locally
  • Deploy a config change for the needed wiki(s) disabling the experiment
    • Test and verify the experiment is off and data collection is off
  • Update the Schema talk page with details of the sampling rate change and which wiki it was deployed to

Sunsetting

When all A/B tests have been run, and no more are expected, we keep an eye on adding sunsetting tasks to get rid of the A/B testing code and old variants. Unused code is a liability and creates maintenance problems.

  • Create task to get rid of the A/B test implementation
    • Flesh it out with as much detail as possible about what needs to be removed
      • Config variables, their usages, and where they are read
      • Bucketing logic and experiment configuration
      • If the event logging is not going to be needed
        • Event sampling and event sending on the different code paths
  • Create task to get rid of the unused variant
    • Remove variant code path and tests
  • Test that everything works after all these removals
@niedzielski

Cool! It would be neat to mention also:

  • Some of the different technologies used (Grafana, EventLogging, ...)
  • How we verify (beta cluster EventLogging, visual EventLogging, Hive, ...)
  • Common problems and debugging (looking at the wrong schema version's table in EventLogging, bad bucket sizes, invalid schema shape, ...)
  • And the engineer's perspective on where it's reported and how those decisions feed back into the process (this would be the loop from the end into the beginning).

@jdlrobson

"Understand what is being asked to be tracked and why"
specifically, what is the question or questions you want answered, and what is the simplest way we can find that answer. https://www.mediawiki.org/wiki/Extension:EventLogging/Guide#Posing_a_question

Also the "who" - anons? logged in? Editors with a certain edit count?

https://www.mediawiki.org/wiki/Extension:EventLogging/Guide is a great resource.

"Ensure both variants are accessible to users at the same time"
Might want to add something around caching, e.g. if the A/B test changes HTML for anonymous users, this creates challenges with a flash of unstyled content.

"Create EventLogging schema in wikitech"
It's actually meta.wikimedia.org.
Ideally this should also be done upfront as part of analysis, before any development occurs, not during development. We should do better on that front ourselves!

"Starting an A/B test"
I don't know if it's feasible, but ideally I'd like to see us do dummy A/B tests where we generate data for, say, a 2-day period and then vet that data. As we discovered with page previews, there is potential for issues that can evade QA and disrupt data quality (e.g. duplicates from a browser bug, issues with sampling).

@phuedx

phuedx commented Jun 7, 2018

Even though I'm late, I wanted to acknowledge that I've read this and the feedback from @niedzielski and @jdlrobson and it LGTM.
