bearloga · June 25, 2024 21:26
diff --git a/demo_data.sql b/demo_data.sql
 -- Usage: spark3-sql -f demo_data.sql -d db=bearloga

 DROP TABLE IF EXISTS ${db}.demo_data_v1;
 CREATE TABLE IF NOT EXISTS ${db}.demo_data_v1 (
  user_id int,
  action string,
  experiments array<map<string, string>>
 )
 LOCATION '/tmp/${db}/demo_data_v1';

 INSERT INTO ${db}.demo_data_v1 (user_id, action, experiments) VALUES
  (1, 'click', ARRAY(MAP('name', 'test1', 'group', 'a'), MAP('name', 'test2', 'group', 'b'))),
  (2, 'click', ARRAY()),
  (3, 'click', ARRAY(MAP('name', 'test2', 'group', 'a'))),
  (4, 'click', ARRAY(MAP('name', 'test1', 'group', 'b'), MAP('name', 'test2', 'group', 'a')))
 ;

 DROP TABLE IF EXISTS ${db}.demo_data_v2;
 CREATE TABLE IF NOT EXISTS ${db}.demo_data_v2 (
  user_id int,
  action string,
  experiments struct<enrolled:array<string>, assigned:map<string, string>>
 )
 LOCATION '/tmp/${db}/demo_data_v2';

 INSERT INTO ${db}.demo_data_v2 (user_id, action, experiments) VALUES
  (1, 'click', NAMED_STRUCT('enrolled', ARRAY('test1', 'test2'), 'assigned', MAP('test1', 'a', 'test2', 'b'))),
  (2, 'click', NAMED_STRUCT('enrolled', ARRAY(), 'assigned', NULL)),
  (3, 'click', NAMED_STRUCT('enrolled', ARRAY('test2'), 'assigned', MAP('test2', 'a'))),
  (4, 'click', NAMED_STRUCT('enrolled', ARRAY('test1', 'test2'), 'assigned', MAP('test1', 'b', 'test2', 'a')))
 ;
diff --git a/demo_queries.ipynb b/demo_queries.ipynb
 {
 "cells": [
  {
   "cell_type": "markdown",
   "id": "2f88cb9e-c0c8-45c0-9c7d-a90b7db82126",
   "metadata": {},
   "source": [
    "# Data modeling multiple experiments\n",
    "\n",
    "Phab: [T368326](https://phabricator.wikimedia.org/T368326)\n",
    "\n",
    "> What if the user is enrolled in multiple, non-overlapping (hopefully) experiments that are running on the same page? Should we rather add an experiments property, which is an array of experiment objects? If so, how would that impact querying the data?\n",
    "\n",
    "Let's explore two potential ways to model this data:\n",
    "- **Version 1**: array of objects (more-or-less)\n",
    "- **Version 2**: object of arrays (more-or-less)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "2a865170-be1c-44a2-85cb-069dae17707e",
   "metadata": {},
   "outputs": [],
   "source": [
    "import wmfdata as wmf"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8064d741-c9d0-4584-9c7c-368482da9a36",
   "metadata": {},
   "source": [
    "## Version 1\n",
    "\n",
    "In this version `experiments` is an array of objects with properties:\n",
    "- `name` (name of experiment the subject is enrolled in)\n",
    "- `group` (group/variant the subject is assigned to)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "07bd466d-d009-476a-ab85-e4f5513da13a",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>user_id</th>\n",
       "      <th>action</th>\n",
       "      <th>experiments</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>click</td>\n",
       "      <td>[{'name': 'test1', 'group': 'a'}, {'name': 'te...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4</td>\n",
       "      <td>click</td>\n",
       "      <td>[{'name': 'test1', 'group': 'b'}, {'name': 'te...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>click</td>\n",
       "      <td>[{'name': 'test2', 'group': 'a'}]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2</td>\n",
       "      <td>click</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   user_id action                                        experiments\n",
       "0        1  click  [{'name': 'test1', 'group': 'a'}, {'name': 'te...\n",
       "1        4  click  [{'name': 'test1', 'group': 'b'}, {'name': 'te...\n",
       "2        3  click                  [{'name': 'test2', 'group': 'a'}]\n",
       "3        2  click                                                 []"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "wmf.spark.run(\"SELECT * FROM bearloga.demo_data_v1\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7e6008a7-4243-4beb-b3a4-aa04ecfed89a",
   "metadata": {},
   "source": [
    "**NOTE**: Event Platform does not support evolution of structs in arrays (cf. [Complex array element and map value type evolution is not well supported](https://wikitech.wikimedia.org/wiki/Event_Platform/Schemas/Guidelines#Complex_array_element_and_map_value_type_evolution_is_not_well_supported)). So in version 1 we can't have `experiments` be of type `array<struct<name:string,group:string>>` because that can't changed to, say, `array<struct<id:int,name:string,group:string>>` (an `id` field/property added), whereas if `experiments` is `array<map<string,string>>` then we have the flexibility to expand what information we collect about each experiment that the subject is in."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cd0e1fe3-4fa5-4ae2-95f7-33190e049a74",
   "metadata": {
    "jp-MarkdownHeadingCollapsed": true,
    "tags": []
   },
   "source": [
    "### Querying with Spark SQL\n",
    "\n",
    "Using the `LATERAL VIEW EXPLODE()` pattern (https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-lateral-view.html) to count number of users in each group of the \"test2\" experiment:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "26af2881-2b08-4629-9370-991b39038bc4",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>group</th>\n",
       "      <th>user_count</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>a</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>b</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  group  user_count\n",
       "0     a           2\n",
       "1     b           1"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "wmf.spark.run(\"\"\"\n",
    "SELECT\n",
    "  experiment.group,\n",
    "  COUNT(DISTINCT user_id) AS user_count\n",
    "FROM bearloga.demo_data_v1\n",
    "LATERAL VIEW EXPLODE(experiments) AS experiment\n",
    "WHERE experiment.name = 'test2'\n",
    "GROUP BY experiment.group\n",
    "\"\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7f78ff23-12ab-412e-9342-8fe5cbc4000b",
   "metadata": {},
   "source": [
    "### Querying with Presto\n",
    "\n",
    "Using `CROSS JOIN UNNEST()` pattern (https://prestodb.io/docs/current/sql/select.html#unnest) to count number of users in each group of the \"test2\" experiment:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "ad2e4a4e-9884-4160-877d-8f4fbf70a646",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>group</th>\n",
       "      <th>user_count</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>b</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>a</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  group  user_count\n",
       "0     b           1\n",
       "1     a           2"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "wmf.presto.run(\"\"\"\n",
    "SELECT\n",
    "  experiment['group'] AS \"group\",\n",
    "  COUNT(DISTINCT user_id) AS user_count\n",
    "FROM bearloga.demo_data_v1\n",
    "CROSS JOIN UNNEST(experiments) AS t (experiment)\n",
    "WHERE experiment['name'] = 'test2'\n",
    "GROUP BY 1\n",
    "\"\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "548aae4e-93a7-483c-b385-bb4b02ea3d46",
   "metadata": {},
   "source": [
    "Alternatively, we can use some [array functions](https://prestodb.io/docs/current/functions/array.html):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "5ad53a23-b1e0-4a1b-a89b-dc5c94c0237d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>group</th>\n",
       "      <th>user_count</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>b</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>a</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  group  user_count\n",
       "0     b           1\n",
       "1     a           2"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "wmf.presto.run(\"\"\"\n",
    "SELECT\n",
    "  filter(experiments, x -> x['name'] = 'test2')[1]['group'] AS \"group\",\n",
    "  COUNT(DISTINCT user_id) AS user_count\n",
    "FROM bearloga.demo_data_v1\n",
    "WHERE any_match(experiments, x -> x['name'] = 'test2')\n",
    "GROUP BY 1\n",
    "\"\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "49ffe0b8-c38d-4c7d-9086-e6fb78a4dcd9",
   "metadata": {},
   "source": [
    "## Version 2\n",
    "\n",
    "In this version, `experiments` is instead an object with properties:\n",
    "- `enrolled` (array of names of experiments the subject is enrolled in)\n",
    "- `assigned` (map of groups the subject is assigned to with the experiments' names as keys)\n",
    "\n",
    "**NOTE**: We assume that each experiment has a unique ID/name that identifies it and that group assignments can safely be stored in a `map` with the experiment names/IDs as keys."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "78c17d4a-4f60-47e0-b92f-1e231a65b148",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>user_id</th>\n",
       "      <th>action</th>\n",
       "      <th>experiments</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>3</td>\n",
       "      <td>click</td>\n",
       "      <td>([test2], {'test2': 'a'})</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>click</td>\n",
       "      <td>([test1, test2], {'test1': 'a', 'test2': 'b'})</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4</td>\n",
       "      <td>click</td>\n",
       "      <td>([test1, test2], {'test1': 'b', 'test2': 'a'})</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2</td>\n",
       "      <td>click</td>\n",
       "      <td>([], None)</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   user_id action                                     experiments\n",
       "0        3  click                       ([test2], {'test2': 'a'})\n",
       "1        1  click  ([test1, test2], {'test1': 'a', 'test2': 'b'})\n",
       "2        4  click  ([test1, test2], {'test1': 'b', 'test2': 'a'})\n",
       "3        2  click                                      ([], None)"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "wmf.spark.run(\"SELECT * FROM bearloga.demo_data_v2\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d1766abd-654e-436b-9394-c8a6e8edd344",
   "metadata": {},
   "source": [
    "### Querying with Spark SQL\n",
    "\n",
    "Counting number of users in each group of the \"test2\" experiment:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "6278e0cc-4ad9-4cee-af43-c93360ee5f34",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>group</th>\n",
       "      <th>user_count</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>a</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>b</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  group  user_count\n",
       "0     a           2\n",
       "1     b           1"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "wmf.spark.run(\"\"\"\n",
    "SELECT\n",
    "  experiments.assigned['test2'] AS group,\n",
    "  COUNT(DISTINCT user_id) AS user_count\n",
    "FROM bearloga.demo_data_v2\n",
    "WHERE ARRAY_CONTAINS(experiments.enrolled, 'test2')\n",
    "GROUP BY experiments.assigned['test2'];\n",
    "\"\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9a9230fc-23bb-41ff-a557-dd55e15afbf3",
   "metadata": {},
   "source": [
    "**NOTE**: Yes, [map_contains_key](https://spark.apache.org/docs/latest/api/sql/#map_contains_key) exists but only since 3.3.0. We're currently running 3.2.0. If our Spark SQL version was more recent then we could get rid of `experiments.enrolled` array and  instead look through `experiments.assigned`'s keys:\n",
    "\n",
    "```\n",
    "SELECT\n",
    "  experiments.assigned['test2'] AS group,\n",
    "  COUNT(DISTINCT user_id) AS user_count\n",
    "FROM bearloga.demo_data_v2\n",
    "WHERE MAP_CONTAINS_KEY(experiments.assigned, 'test2')\n",
    "GROUP BY experiments.assigned['test2'];\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7977225d-9285-401f-b465-a051ddff64aa",
   "metadata": {},
   "source": [
    "### Querying with Presto\n",
    "\n",
    "The Presto query is essentially the same as the Spark SQL version:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "402a95de-4048-4129-9884-0fd660ec29b5",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>group</th>\n",
       "      <th>user_count</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>a</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>b</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  group  user_count\n",
       "0     a           2\n",
       "1     b           1"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "wmf.presto.run(\"\"\"\n",
    "SELECT\n",
    "  experiments.assigned['test2'] AS \"group\",\n",
    "  COUNT(DISTINCT user_id) AS user_count\n",
    "FROM bearloga.demo_data_v2\n",
    "WHERE contains(experiments.enrolled, 'test2')\n",
    "GROUP BY 1\n",
    "\"\"\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "95cf5cdd-db13-44bc-8674-6c27aee8b81c",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
 }
	-- Usage: spark3-sql -f demo_data.sql -d db=bearloga

	DROP TABLE IF EXISTS ${db}.demo_data_v1;
	CREATE TABLE IF NOT EXISTS ${db}.demo_data_v1 (
	user_id int,
	action string,
	experiments array<map<string, string>>
	)
	LOCATION '/tmp/${db}/demo_data_v1';

	INSERT INTO ${db}.demo_data_v1 (user_id, action, experiments) VALUES
	(1, 'click', ARRAY(MAP('name', 'test1', 'group', 'a'), MAP('name', 'test2', 'group', 'b'))),
	(2, 'click', ARRAY()),
	(3, 'click', ARRAY(MAP('name', 'test2', 'group', 'a'))),
	(4, 'click', ARRAY(MAP('name', 'test1', 'group', 'b'), MAP('name', 'test2', 'group', 'a')))
	;

	DROP TABLE IF EXISTS ${db}.demo_data_v2;
	CREATE TABLE IF NOT EXISTS ${db}.demo_data_v2 (
	user_id int,
	action string,
	experiments struct<enrolled:array<string>, assigned:map<string, string>>
	)
	LOCATION '/tmp/${db}/demo_data_v2';

	INSERT INTO ${db}.demo_data_v2 (user_id, action, experiments) VALUES
	(1, 'click', NAMED_STRUCT('enrolled', ARRAY('test1', 'test2'), 'assigned', MAP('test1', 'a', 'test2', 'b'))),
	(2, 'click', NAMED_STRUCT('enrolled', ARRAY(), 'assigned', NULL)),
	(3, 'click', NAMED_STRUCT('enrolled', ARRAY('test2'), 'assigned', MAP('test2', 'a'))),
	(4, 'click', NAMED_STRUCT('enrolled', ARRAY('test1', 'test2'), 'assigned', MAP('test1', 'b', 'test2', 'a')))
	;
	{
	"cells": [
	{
	"cell_type": "markdown",
	"id": "2f88cb9e-c0c8-45c0-9c7d-a90b7db82126",
	"metadata": {},
	"source": [
	"# Data modeling multiple experiments\n",
	"\n",
	"Phab: [T368326](https://phabricator.wikimedia.org/T368326)\n",
	"\n",
	"> What if the user is enrolled in multiple, non-overlapping (hopefully) experiments that are running on the same page? Should we rather add an experiments property, which is an array of experiment objects? If so, how would that impact querying the data?\n",
	"\n",
	"Let's explore two potential ways to model this data:\n",
	"- Version 1: array of objects (more-or-less)\n",
	"- Version 2: object of arrays (more-or-less)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 1,
	"id": "2a865170-be1c-44a2-85cb-069dae17707e",
	"metadata": {},
	"outputs": [],
	"source": [
	"import wmfdata as wmf"
	]
	},
	{
	"cell_type": "markdown",
	"id": "8064d741-c9d0-4584-9c7c-368482da9a36",
	"metadata": {},
	"source": [
	"## Version 1\n",
	"\n",
	"In this version `experiments` is an array of objects with properties:\n",
	"- `name` (name of experiment the subject is enrolled in)\n",
	"- `group` (group/variant the subject is assigned to)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 2,
	"id": "07bd466d-d009-476a-ab85-e4f5513da13a",
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/html": [
	"<div>\n",
	"<style scoped>\n",
	" .dataframe tbody tr th:only-of-type {\n",
	" vertical-align: middle;\n",
	" }\n",
	"\n",
	" .dataframe tbody tr th {\n",
	" vertical-align: top;\n",
	" }\n",
	"\n",
	" .dataframe thead th {\n",
	" text-align: right;\n",
	" }\n",
	"</style>\n",
	"<table border=\"1\" class=\"dataframe\">\n",
	" <thead>\n",
	" <tr style=\"text-align: right;\">\n",
	" <th></th>\n",
	" <th>user_id</th>\n",
	" <th>action</th>\n",
	" <th>experiments</th>\n",
	" </tr>\n",
	" </thead>\n",
	" <tbody>\n",
	" <tr>\n",
	" <th>0</th>\n",
	" <td>1</td>\n",
	" <td>click</td>\n",
	" <td>[{'name': 'test1', 'group': 'a'}, {'name': 'te...</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>1</th>\n",
	" <td>4</td>\n",
	" <td>click</td>\n",
	" <td>[{'name': 'test1', 'group': 'b'}, {'name': 'te...</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>2</th>\n",
	" <td>3</td>\n",
	" <td>click</td>\n",
	" <td>[{'name': 'test2', 'group': 'a'}]</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>3</th>\n",
	" <td>2</td>\n",
	" <td>click</td>\n",
	" <td>[]</td>\n",
	" </tr>\n",
	" </tbody>\n",
	"</table>\n",
	"</div>"
	],
	"text/plain": [
	" user_id action experiments\n",
	"0 1 click [{'name': 'test1', 'group': 'a'}, {'name': 'te...\n",
	"1 4 click [{'name': 'test1', 'group': 'b'}, {'name': 'te...\n",
	"2 3 click [{'name': 'test2', 'group': 'a'}]\n",
	"3 2 click []"
	]
	},
	"execution_count": 2,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"wmf.spark.run(\"SELECT * FROM bearloga.demo_data_v1\")"
	]
	},
	{
	"cell_type": "markdown",
	"id": "7e6008a7-4243-4beb-b3a4-aa04ecfed89a",
	"metadata": {},
	"source": [
	"NOTE: Event Platform does not support evolution of structs in arrays (cf. [Complex array element and map value type evolution is not well supported](https://wikitech.wikimedia.org/wiki/Event_Platform/Schemas/Guidelines#Complex_array_element_and_map_value_type_evolution_is_not_well_supported)). So in version 1 we can't have `experiments` be of type `array<struct<name:string,group:string>>` because that can't changed to, say, `array<struct<id:int,name:string,group:string>>` (an `id` field/property added), whereas if `experiments` is `array<map<string,string>>` then we have the flexibility to expand what information we collect about each experiment that the subject is in."
	]
	},
	{
	"cell_type": "markdown",
	"id": "cd0e1fe3-4fa5-4ae2-95f7-33190e049a74",
	"metadata": {
	"jp-MarkdownHeadingCollapsed": true,
	"tags": []
	},
	"source": [
	"### Querying with Spark SQL\n",
	"\n",
	"Using the `LATERAL VIEW EXPLODE()` pattern (https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-lateral-view.html) to count number of users in each group of the \"test2\" experiment:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 3,
	"id": "26af2881-2b08-4629-9370-991b39038bc4",
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/html": [
	"<div>\n",
	"<style scoped>\n",
	" .dataframe tbody tr th:only-of-type {\n",
	" vertical-align: middle;\n",
	" }\n",
	"\n",
	" .dataframe tbody tr th {\n",
	" vertical-align: top;\n",
	" }\n",
	"\n",
	" .dataframe thead th {\n",
	" text-align: right;\n",
	" }\n",
	"</style>\n",
	"<table border=\"1\" class=\"dataframe\">\n",
	" <thead>\n",
	" <tr style=\"text-align: right;\">\n",
	" <th></th>\n",
	" <th>group</th>\n",
	" <th>user_count</th>\n",
	" </tr>\n",
	" </thead>\n",
	" <tbody>\n",
	" <tr>\n",
	" <th>0</th>\n",
	" <td>a</td>\n",
	" <td>2</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>1</th>\n",
	" <td>b</td>\n",
	" <td>1</td>\n",
	" </tr>\n",
	" </tbody>\n",
	"</table>\n",
	"</div>"
	],
	"text/plain": [
	" group user_count\n",
	"0 a 2\n",
	"1 b 1"
	]
	},
	"execution_count": 3,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"wmf.spark.run(\"\"\"\n",
	"SELECT\n",
	" experiment.group,\n",
	" COUNT(DISTINCT user_id) AS user_count\n",
	"FROM bearloga.demo_data_v1\n",
	"LATERAL VIEW EXPLODE(experiments) AS experiment\n",
	"WHERE experiment.name = 'test2'\n",
	"GROUP BY experiment.group\n",
	"\"\"\")"
	]
	},
	{
	"cell_type": "markdown",
	"id": "7f78ff23-12ab-412e-9342-8fe5cbc4000b",
	"metadata": {},
	"source": [
	"### Querying with Presto\n",
	"\n",
	"Using `CROSS JOIN UNNEST()` pattern (https://prestodb.io/docs/current/sql/select.html#unnest) to count number of users in each group of the \"test2\" experiment:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 4,
	"id": "ad2e4a4e-9884-4160-877d-8f4fbf70a646",
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/html": [
	"<div>\n",
	"<style scoped>\n",
	" .dataframe tbody tr th:only-of-type {\n",
	" vertical-align: middle;\n",
	" }\n",
	"\n",
	" .dataframe tbody tr th {\n",
	" vertical-align: top;\n",
	" }\n",
	"\n",
	" .dataframe thead th {\n",
	" text-align: right;\n",
	" }\n",
	"</style>\n",
	"<table border=\"1\" class=\"dataframe\">\n",
	" <thead>\n",
	" <tr style=\"text-align: right;\">\n",
	" <th></th>\n",
	" <th>group</th>\n",
	" <th>user_count</th>\n",
	" </tr>\n",
	" </thead>\n",
	" <tbody>\n",
	" <tr>\n",
	" <th>0</th>\n",
	" <td>b</td>\n",
	" <td>1</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>1</th>\n",
	" <td>a</td>\n",
	" <td>2</td>\n",
	" </tr>\n",
	" </tbody>\n",
	"</table>\n",
	"</div>"
	],
	"text/plain": [
	" group user_count\n",
	"0 b 1\n",
	"1 a 2"
	]
	},
	"execution_count": 4,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"wmf.presto.run(\"\"\"\n",
	"SELECT\n",
	" experiment['group'] AS \"group\",\n",
	" COUNT(DISTINCT user_id) AS user_count\n",
	"FROM bearloga.demo_data_v1\n",
	"CROSS JOIN UNNEST(experiments) AS t (experiment)\n",
	"WHERE experiment['name'] = 'test2'\n",
	"GROUP BY 1\n",
	"\"\"\")"
	]
	},
	{
	"cell_type": "markdown",
	"id": "548aae4e-93a7-483c-b385-bb4b02ea3d46",
	"metadata": {},
	"source": [
	"Alternatively, we can use some [array functions](https://prestodb.io/docs/current/functions/array.html):"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 5,
	"id": "5ad53a23-b1e0-4a1b-a89b-dc5c94c0237d",
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/html": [
	"<div>\n",
	"<style scoped>\n",
	" .dataframe tbody tr th:only-of-type {\n",
	" vertical-align: middle;\n",
	" }\n",
	"\n",
	" .dataframe tbody tr th {\n",
	" vertical-align: top;\n",
	" }\n",
	"\n",
	" .dataframe thead th {\n",
	" text-align: right;\n",
	" }\n",
	"</style>\n",
	"<table border=\"1\" class=\"dataframe\">\n",
	" <thead>\n",
	" <tr style=\"text-align: right;\">\n",
	" <th></th>\n",
	" <th>group</th>\n",
	" <th>user_count</th>\n",
	" </tr>\n",
	" </thead>\n",
	" <tbody>\n",
	" <tr>\n",
	" <th>0</th>\n",
	" <td>b</td>\n",
	" <td>1</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>1</th>\n",
	" <td>a</td>\n",
	" <td>2</td>\n",
	" </tr>\n",
	" </tbody>\n",
	"</table>\n",
	"</div>"
	],
	"text/plain": [
	" group user_count\n",
	"0 b 1\n",
	"1 a 2"
	]
	},
	"execution_count": 5,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"wmf.presto.run(\"\"\"\n",
	"SELECT\n",
	" filter(experiments, x -> x['name'] = 'test2')[1]['group'] AS \"group\",\n",
	" COUNT(DISTINCT user_id) AS user_count\n",
	"FROM bearloga.demo_data_v1\n",
	"WHERE any_match(experiments, x -> x['name'] = 'test2')\n",
	"GROUP BY 1\n",
	"\"\"\")"
	]
	},
	{
	"cell_type": "markdown",
	"id": "49ffe0b8-c38d-4c7d-9086-e6fb78a4dcd9",
	"metadata": {},
	"source": [
	"## Version 2\n",
	"\n",
	"In this version, `experiments` is instead an object with properties:\n",
	"- `enrolled` (array of names of experiments the subject is enrolled in)\n",
	"- `assigned` (map of groups the subject is assigned to with the experiments' names as keys)\n",
	"\n",
	"NOTE: We assume that each experiment has a unique ID/name that identifies it and that group assignments can safely be stored in a `map` with the experiment names/IDs as keys."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 6,
	"id": "78c17d4a-4f60-47e0-b92f-1e231a65b148",
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/html": [
	"<div>\n",
	"<style scoped>\n",
	" .dataframe tbody tr th:only-of-type {\n",
	" vertical-align: middle;\n",
	" }\n",
	"\n",
	" .dataframe tbody tr th {\n",
	" vertical-align: top;\n",
	" }\n",
	"\n",
	" .dataframe thead th {\n",
	" text-align: right;\n",
	" }\n",
	"</style>\n",
	"<table border=\"1\" class=\"dataframe\">\n",
	" <thead>\n",
	" <tr style=\"text-align: right;\">\n",
	" <th></th>\n",
	" <th>user_id</th>\n",
	" <th>action</th>\n",
	" <th>experiments</th>\n",
	" </tr>\n",
	" </thead>\n",
	" <tbody>\n",
	" <tr>\n",
	" <th>0</th>\n",
	" <td>3</td>\n",
	" <td>click</td>\n",
	" <td>([test2], {'test2': 'a'})</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>1</th>\n",
	" <td>1</td>\n",
	" <td>click</td>\n",
	" <td>([test1, test2], {'test1': 'a', 'test2': 'b'})</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>2</th>\n",
	" <td>4</td>\n",
	" <td>click</td>\n",
	" <td>([test1, test2], {'test1': 'b', 'test2': 'a'})</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>3</th>\n",
	" <td>2</td>\n",
	" <td>click</td>\n",
	" <td>([], None)</td>\n",
	" </tr>\n",
	" </tbody>\n",
	"</table>\n",
	"</div>"
	],
	"text/plain": [
	" user_id action experiments\n",
	"0 3 click ([test2], {'test2': 'a'})\n",
	"1 1 click ([test1, test2], {'test1': 'a', 'test2': 'b'})\n",
	"2 4 click ([test1, test2], {'test1': 'b', 'test2': 'a'})\n",
	"3 2 click ([], None)"
	]
	},
	"execution_count": 6,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"wmf.spark.run(\"SELECT * FROM bearloga.demo_data_v2\")"
	]
	},
	{
	"cell_type": "markdown",
	"id": "d1766abd-654e-436b-9394-c8a6e8edd344",
	"metadata": {},
	"source": [
	"### Querying with Spark SQL\n",
	"\n",
	"Counting number of users in each group of the \"test2\" experiment:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 7,
	"id": "6278e0cc-4ad9-4cee-af43-c93360ee5f34",
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/html": [
	"<div>\n",
	"<style scoped>\n",
	" .dataframe tbody tr th:only-of-type {\n",
	" vertical-align: middle;\n",
	" }\n",
	"\n",
	" .dataframe tbody tr th {\n",
	" vertical-align: top;\n",
	" }\n",
	"\n",
	" .dataframe thead th {\n",
	" text-align: right;\n",
	" }\n",
	"</style>\n",
	"<table border=\"1\" class=\"dataframe\">\n",
	" <thead>\n",
	" <tr style=\"text-align: right;\">\n",
	" <th></th>\n",
	" <th>group</th>\n",
	" <th>user_count</th>\n",
	" </tr>\n",
	" </thead>\n",
	" <tbody>\n",
	" <tr>\n",
	" <th>0</th>\n",
	" <td>a</td>\n",
	" <td>2</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>1</th>\n",
	" <td>b</td>\n",
	" <td>1</td>\n",
	" </tr>\n",
	" </tbody>\n",
	"</table>\n",
	"</div>"
	],
	"text/plain": [
	" group user_count\n",
	"0 a 2\n",
	"1 b 1"
	]
	},
	"execution_count": 7,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"wmf.spark.run(\"\"\"\n",
	"SELECT\n",
	" experiments.assigned['test2'] AS group,\n",
	" COUNT(DISTINCT user_id) AS user_count\n",
	"FROM bearloga.demo_data_v2\n",
	"WHERE ARRAY_CONTAINS(experiments.enrolled, 'test2')\n",
	"GROUP BY experiments.assigned['test2'];\n",
	"\"\"\")"
	]
	},
	{
	"cell_type": "markdown",
	"id": "9a9230fc-23bb-41ff-a557-dd55e15afbf3",
	"metadata": {},
	"source": [
	"NOTE: Yes, [map_contains_key](https://spark.apache.org/docs/latest/api/sql/#map_contains_key) exists but only since 3.3.0. We're currently running 3.2.0. If our Spark SQL version was more recent then we could get rid of `experiments.enrolled` array and instead look through `experiments.assigned`'s keys:\n",
	"\n",
	"```\n",
	"SELECT\n",
	" experiments.assigned['test2'] AS group,\n",
	" COUNT(DISTINCT user_id) AS user_count\n",
	"FROM bearloga.demo_data_v2\n",
	"WHERE MAP_CONTAINS_KEY(experiments.assigned, 'test2')\n",
	"GROUP BY experiments.assigned['test2'];\n",
	"```"
	]
	},
	{
	"cell_type": "markdown",
	"id": "7977225d-9285-401f-b465-a051ddff64aa",
	"metadata": {},
	"source": [
	"### Querying with Presto\n",
	"\n",
	"The Presto query is essentially the same as the Spark SQL version:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 8,
	"id": "402a95de-4048-4129-9884-0fd660ec29b5",
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/html": [
	"<div>\n",
	"<style scoped>\n",
	" .dataframe tbody tr th:only-of-type {\n",
	" vertical-align: middle;\n",
	" }\n",
	"\n",
	" .dataframe tbody tr th {\n",
	" vertical-align: top;\n",
	" }\n",
	"\n",
	" .dataframe thead th {\n",
	" text-align: right;\n",
	" }\n",
	"</style>\n",
	"<table border=\"1\" class=\"dataframe\">\n",
	" <thead>\n",
	" <tr style=\"text-align: right;\">\n",
	" <th></th>\n",
	" <th>group</th>\n",
	" <th>user_count</th>\n",
	" </tr>\n",
	" </thead>\n",
	" <tbody>\n",
	" <tr>\n",
	" <th>0</th>\n",
	" <td>a</td>\n",
	" <td>2</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>1</th>\n",
	" <td>b</td>\n",
	" <td>1</td>\n",
	" </tr>\n",
	" </tbody>\n",
	"</table>\n",
	"</div>"
	],
	"text/plain": [
	" group user_count\n",
	"0 a 2\n",
	"1 b 1"
	]
	},
	"execution_count": 8,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"wmf.presto.run(\"\"\"\n",
	"SELECT\n",
	" experiments.assigned['test2'] AS \"group\",\n",
	" COUNT(DISTINCT user_id) AS user_count\n",
	"FROM bearloga.demo_data_v2\n",
	"WHERE contains(experiments.enrolled, 'test2')\n",
	"GROUP BY 1\n",
	"\"\"\")"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"id": "95cf5cdd-db13-44bc-8674-6c27aee8b81c",
	"metadata": {},
	"outputs": [],
	"source": []
	}
	],
	"metadata": {
	"kernelspec": {
	"display_name": "Python 3 (ipykernel)",
	"language": "python",
	"name": "python3"
	},
	"language_info": {
	"codemirror_mode": {
	"name": "ipython",
	"version": 3
	},
	"file_extension": ".py",
	"mimetype": "text/x-python",
	"name": "python",
	"nbconvert_exporter": "python",
	"pygments_lexer": "ipython3",
	"version": "3.10.13"
	}
	},
	"nbformat": 4,
	"nbformat_minor": 5
	}