EverydayRobotics_aiinaction.ipynb
| { | |
| "nbformat": 4, | |
| "nbformat_minor": 0, | |
| "metadata": { | |
| "colab": { | |
| "provenance": [], | |
| "name": "EverydayRobotics_aiinaction.ipynb", | |
| "collapsed_sections": [ | |
| "tupLE8aZXYNw" | |
| ], | |
| "include_colab_link": true | |
| }, | |
| "kernelspec": { | |
| "name": "python3", | |
| "display_name": "Python 3" | |
| }, | |
| "language_info": { | |
| "name": "python" | |
| } | |
| }, | |
| "cells": [ | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "id": "view-in-github", | |
| "colab_type": "text" | |
| }, | |
| "source": [ | |
| "<a href=\"https://colab.research.google.com/gist/dmklee/aa0045490dda1bdd223882e962b5cd4e/everydayrobotics_aiinaction.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "source": [ | |
| "# **Before you proceed:** Click *Runtime->Run all*\n" | |
| ], | |
| "metadata": { | |
| "id": "tupLE8aZXYNw" | |
| } | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "%%capture\n", | |
| "!pip install ipympl pybullet\n", | |
| "!wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=1gbYTyz2JB9xuF90AH9SF3GsiELPHxykI' -O ai_in_action.zip\n", | |
| "!unzip -o ai_in_action.zip" | |
| ], | |
| "metadata": { | |
| "id": "NnQ_L9h7upIy" | |
| }, | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "from google.colab import output\n", | |
| "output.enable_custom_widget_manager()" | |
| ], | |
| "metadata": { | |
| "id": "nBptE0HHus6B" | |
| }, | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "%matplotlib ipympl" | |
| ], | |
| "metadata": { | |
| "id": "3L6wEJ3PuyYu" | |
| }, | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "import sys\n", | |
| "import string\n", | |
| "import matplotlib.pyplot as plt\n", | |
| "from PIL import Image\n", | |
| "import time\n", | |
| "import numpy as np\n", | |
| "from matplotlib import rc\n", | |
| "rc('animation', html='jshtml')\n", | |
| "import matplotlib.animation as animation\n", | |
| "from tqdm import tqdm\n", | |
| "import IPython\n", | |
| "from base64 import b64encode\n", | |
| "import torch\n", | |
| "from torch import nn\n", | |
| "from simulator import Env, Simulator, Agent, train, argmax2d, show_predictions" | |
| ], | |
| "metadata": { | |
| "id": "AtKzjNHSu1fO" | |
| }, | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "source": [ | |
| "# Robots to the Rescue 🛢️ 🦆 🤖\n", | |
| "There's been a horrible (hypothetical) oil spill!!! We need your help to develop a fleet of robot arms to save the poor ducks from the oil. Can you use reinforcement learning to train the robots to spot and grab the ducks in the oil?\n", | |
| "\n", | |
| "We built a simulator that you will use to train the reinforcement learning agent (without needing to harm real ducks). Run the cell to see the simulator." | |
| ], | |
| "metadata": { | |
| "id": "eaOulQCK_BMY" | |
| } | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "# @title\n", | |
| "# read video file and render it in HTML\n", | |
| "mp4 = open('assets/duck_grasp.mp4','rb').read()\n", | |
| "data_url = \"data:video/mp4;base64,\" + b64encode(mp4).decode()\n", | |
| "IPython.display.HTML(\"\"\"<video width=600 controls><source src=\"%s\" type=\"video/mp4\"></video>\"\"\" % data_url)" | |
| ], | |
| "metadata": { | |
| "id": "uhTkjjDBuEIc", | |
| "cellView": "form" | |
| }, | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "source": [ | |
| "### Interactive Reinforcement Learning\n", | |
| "We can formulate the duck rescue task as a reinforcement learning problem.\n", | |
| "\n", | |
| "- **State**: Camera image of the robot's workspace\n", | |
| "- **Action**: XY-location to perform a grasp\n", | |
| "- **Reward**: $+1$ if duck is picked up, otherwise $0$. *...is this a good idea?*\n", | |
| "\n", | |
| "As we gather experience, we will maintain a **value function** that describes the value of each action for a given state. The value of an action is described by the expected reward of performing the action.\n", | |
| "\n", | |
| "In the next cell, you get to perform pick actions manually. After each action, a reward is received and the value function is updated. The **`learning_rate`** determines how much the value function is updated with each reward. With enough actions, the value converges to the average reward received for that action.\n" | |
| ], | |
| "metadata": { | |
| "id": "Bip0IzKvYaHn" | |
| } | |
| }, | |
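| { | |
| "cell_type": "markdown", | |
| "source": [ | |
| "The interactive cell above handles the plotting; the learning itself boils down to one line. Below is a minimal, standalone sketch of that incremental update. The grid size, `learning_rate`, and reward are illustrative assumptions here, not the simulator's actual values." | |
| ], | |
| "metadata": {} | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "# Minimal sketch of the incremental value update (illustrative values only).\n", | |
| "import numpy as np\n", | |
| "\n", | |
| "n_bins = 8                    # assumed grid size, matching the 8x8 workspace above\n", | |
| "learning_rate = 0.2           # assumed value; the slider above controls the real one\n", | |
| "v_func = np.full((n_bins, n_bins), 0.5)  # neutral initial value estimates\n", | |
| "\n", | |
| "# suppose we grasp at cell (row, col) and the simulated grasp succeeds (reward 1)\n", | |
| "row, col, r = 3, 5, 1.0\n", | |
| "# move the value estimate a fraction of the way toward the observed reward\n", | |
| "v_func[row, col] += learning_rate * (r - v_func[row, col])\n", | |
| "print(v_func[row, col])       # 0.5 -> 0.6 after one successful grasp" | |
| ], | |
| "metadata": {}, | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |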
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "def interactive_grasping():\n", | |
| " img = np.array(Image.open('assets/duck_grasping.png'))\n", | |
| " reward_func = np.array(Image.open('assets/v_func.png')).astype(bool)\n", | |
| "\n", | |
| " fig, ax = plt.subplots(1, 2, figsize=(9, 4))\n", | |
| " ax[0].imshow(img)\n", | |
| "\n", | |
| " n_bins = 8\n", | |
| " ticks = np.linspace(0, img.shape[1], num=n_bins+1)[:-1]\n", | |
| " xticklabels = np.arange(n_bins)\n", | |
| " yticklabels = string.ascii_uppercase[:n_bins]\n", | |
| " ax[0].set_xticks(ticks+ticks[1]/2, labels=xticklabels)\n", | |
| " ax[0].set_yticks(ticks+ticks[1]/2, labels=yticklabels)\n", | |
| "\n", | |
| " [ax[0].axhline(h, color='w') for h in np.linspace(0, img.shape[1],num=n_bins+1)[1:-1]]\n", | |
| " [ax[0].axvline(h, color='w') for h in np.linspace(0, img.shape[1],num=n_bins+1)[1:-1]]\n", | |
| " plt.tight_layout(rect=(0,0,1, 0.95))\n", | |
| " ax[0].set_title('Click on cell to simulate grasp')\n", | |
| "\n", | |
| " my_v_func = 0.1*np.random.random((n_bins, n_bins)) + 0.5\n", | |
| " v_func_obj = ax[1].imshow(my_v_func, cmap=\"RdYlGn\", vmin=0, vmax=1)\n", | |
| " cbar = fig.colorbar(v_func_obj, ax=ax[1], ticks=[0, 0.5, 1])\n", | |
| " ax[1].axis('off')\n", | |
| " ax[1].set_title('Value Function')\n", | |
| " [ax[1].axhline(h-0.5, color='k') for h in np.arange(n_bins+1)]\n", | |
| " [ax[1].axvline(h-0.5, color='k') for h in np.arange(n_bins+1)]\n", | |
| "\n", | |
| " learning_rate = 0.2 # @param {type:\"slider\", min: 0, max: 1, step:0.1}\n", | |
| "\n", | |
| " def onclick(event):\n", | |
| " if event.inaxes == ax[0]:\n", | |
| " ix, iy = event.xdata, event.ydata\n", | |
| " xbin, ybin = int(ix // ticks[1]), int(iy // ticks[1])\n", | |
| " r = reward_func[ybin, xbin]\n", | |
| " label = \"SUCCESS\" if r else \"FAIL\"\n", | |
| " my_v_func[ybin, xbin] += learning_rate * (r - my_v_func[ybin, xbin] )\n", | |
| " v_func_obj.set_data(my_v_func)\n", | |
| " fig.canvas.flush_events()\n", | |
| "\n", | |
| " output.clear(output_tags='some_outputs')\n", | |
| " with output.use_tags('some_outputs'):\n", | |
| " sys.stdout.write(f\"Pick at {yticklabels[ybin]}{xticklabels[xbin]}: {label}\\n\")\n", | |
| " sys.stdout.flush()\n", | |
| "\n", | |
| " cid = fig.canvas.mpl_connect('button_press_event', onclick)\n", | |
| "\n", | |
| "interactive_grasping()" | |
| ], | |
| "metadata": { | |
| "id": "AenlRiv5uHLQ", | |
| "cellView": "form" | |
| }, | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "source": [ | |
| "*Do you think its better to update the value of high reward or low reward actions? Why?*\n", | |
| "\n", | |
| "*Could you imagine a task where it would be bad to have a high learning rate?*" | |
| ], | |
| "metadata": { | |
| "id": "eKEItNOsip-x" | |
| } | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "source": [ | |
| "## Exploration vs. Exploitation\n", | |
| "A simple approach to balancing exploration and exploitation is the **epsilon-greedy algorithm**. The programmer determines a variable, epsilon,\n", | |
| "that determines the fraction of actions that are explorative. For instance, when epsilon is 0, no exploration is performed.\n", | |
| "\n", | |
| "Play around with different values of **`epsilon`** and **`learning_rate`** to see the effects on what value function is learned, and what rewards are received.\n", | |
| "\n", | |
| "You need to re-run the cell when you change the sliders." | |
| ], | |
| "metadata": { | |
| "id": "oVUCiXa7tL3M" | |
| } | |
| }, | |
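| { | |
| "cell_type": "markdown", | |
| "source": [ | |
| "Below is a minimal, standalone sketch of epsilon-greedy action selection over a value grid. The grid size and `epsilon` are illustrative assumptions; the animated cell that follows uses its own sliders." | |
| ], | |
| "metadata": {} | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "# Minimal sketch of epsilon-greedy action selection (illustrative values only).\n", | |
| "import numpy as np\n", | |
| "\n", | |
| "n_bins, epsilon = 8, 0.2\n", | |
| "v_func = np.random.uniform(0.4, 0.6, (n_bins, n_bins))\n", | |
| "\n", | |
| "if np.random.random() < epsilon:\n", | |
| "    # explore: pick a random cell\n", | |
| "    row, col = np.random.randint(n_bins, size=2)\n", | |
| "else:\n", | |
| "    # exploit: pick one of the highest-valued cells (ties broken at random)\n", | |
| "    best = np.argwhere(v_func == v_func.max())\n", | |
| "    row, col = best[np.random.randint(len(best))]\n", | |
| "print(f\"grasp at row {row}, column {col}\")" | |
| ], | |
| "metadata": {}, | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |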
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "def tabular_rl():\n", | |
| " epsilon = 0.8 # @param {type:\"slider\", min: 0, max: 1, step:0.05}\n", | |
| " value_init = \"random\" # param [\"random\", \"optimistic\", \"pessimistic\"]\n", | |
| " num_steps = 200 # param {type:\"slider\", min: 0, max: 500, step:10}\n", | |
| " learning_rate = 0.5 # @param {type:\"slider\", min: 0, max: 1, step:0.1}\n", | |
| "\n", | |
| " img = np.array(Image.open('assets/duck_grasping.png'))\n", | |
| " reward_func = np.array(Image.open('assets/v_func.png')).astype(bool)\n", | |
| " n_bins = reward_func.shape[1]\n", | |
| "\n", | |
| " if value_init == \"optimistic\":\n", | |
| " v_func = np.ones((n_bins, n_bins))\n", | |
| " elif value_init == \"pessimistic\":\n", | |
| " v_func = np.zeros((n_bins, n_bins))\n", | |
| " else:\n", | |
| " v_func = np.random.uniform(0.4, 0.6, (n_bins, n_bins))\n", | |
| "\n", | |
| " fig, ax = plt.subplots(1, 2, gridspec_kw={'width_ratios': [1, 1.4]}, figsize=(8, 4))\n", | |
| " v_func_obj = ax[0].imshow(v_func, cmap=\"RdYlGn\", vmin=0, vmax=1)\n", | |
| " cbar = fig.colorbar(v_func_obj, ax=ax[0], ticks=[0, 0.5, 1], shrink=0.8)\n", | |
| " pick_loc_obj, = ax[0].plot([],[], 'k+', linewidth=4)\n", | |
| " [ax[0].axhline(h+0.5, color='k') for h in np.arange(n_bins-1)]\n", | |
| " [ax[0].axvline(h+0.5, color='k') for h in np.arange(n_bins-1)]\n", | |
| " ax[0].axis('off')\n", | |
| " ax[0].set_title('Value Function')\n", | |
| "\n", | |
| " learning_curve = []\n", | |
| " lc_obj, = ax[1].plot([], [], '-')\n", | |
| " ax[1].set_xlim(0, num_steps)\n", | |
| " ax[1].set_ylim(0, 1.05)\n", | |
| " ax[1].set_title('Learning Curve')\n", | |
| " ax[1].set_xlabel('Num. Grasp Attempts')\n", | |
| " ax[1].set_ylabel('Success Rate')\n", | |
| " plt.tight_layout()\n", | |
| "\n", | |
| " def update(i):\n", | |
| " if np.random.random() > epsilon:\n", | |
| " argmax_ind = np.argwhere(v_func == np.max(v_func))\n", | |
| " x, y = argmax_ind[np.random.randint(len(argmax_ind))]\n", | |
| " else:\n", | |
| " x, y = np.random.randint(n_bins, size=2)\n", | |
| "\n", | |
| " pick_loc_obj.set_data([y], [x])\n", | |
| "\n", | |
| " r = reward_func[x, y]\n", | |
| " v_func[x, y] += learning_rate * (r - v_func[x, y] )\n", | |
| " v_func_obj.set_data(v_func)\n", | |
| "\n", | |
| " learning_curve.append(r)\n", | |
| " W = min(10, len(learning_curve))\n", | |
| " if len(learning_curve) >= W:\n", | |
| " avg_lc = np.convolve(learning_curve, np.ones(W)/W, mode=\"valid\")\n", | |
| " lc_obj.set_data(np.arange(len(avg_lc)), avg_lc)\n", | |
| " return v_func_obj, pick_loc_obj, lc_obj\n", | |
| "\n", | |
| " anim = animation.FuncAnimation(\n", | |
| " fig, update, frames=num_steps, blit=False, repeat=True\n", | |
| " )\n", | |
| " IPython.display.display(anim)\n", | |
| " plt.close()\n", | |
| "\n", | |
| " anim\n", | |
| "\n", | |
| "tabular_rl()" | |
| ], | |
| "metadata": { | |
| "id": "bwvdEt-tod3z", | |
| "cellView": "form" | |
| }, | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "source": [ | |
| "*Can you think of a better strategy for balancing exploration and exploitation?*" | |
| ], | |
| "metadata": { | |
| "id": "asT7_H_ei0-3" | |
| } | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "source": [ | |
| "## Generalizing Actions with Deep Reinforcement Learning\n", | |
| "Above we learned a useful value function that could be used to select high-reward actions. However, that value function is only valid when the duck is in that exact position. *How can we learn a value function that generalizes to other scenarios?*\n", | |
| "\n", | |
| "We will use **Deep Reinforcement Learning**, which uses a *neural network* to predict the value function. The network is updated based on rewards recieved for each action, like above. Here you will train a deep RL agent that uses a *convolutional neural network (CNN)* as its value function." | |
| ], | |
| "metadata": { | |
| "id": "0hUknokSnpPv" | |
| } | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "source": [ | |
| "#### Convolutional Neural Network as Value Function\n", | |
| "A Convolutional Neural Network is designed to process images. Here, we create a small, two-layer network. You have some control over the design parameters of the network:\n", | |
| "- **`kernel_size`**: controls receptive field of each operation (e.g. how local the representation is)\n", | |
| "- **`hidden_dim`**: controls representational capacity of network\n", | |
| "- **`use_relu`**: whether or not to add non-linearity between layers; Non-linearities allow network to encode more complex relations\n", | |
| "\n", | |
| "You need to re-run this cell after updating the value!" | |
| ], | |
| "metadata": { | |
| "id": "ZiWcW41mjvyA" | |
| } | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "kernel_size = 5 # @param [1, 3, 5, 7]{type:\"raw\"}\n", | |
| "hidden_dim = 16 # @param [8, 16, 32]{type:\"raw\"}\n", | |
| "use_relu = True # @param {type:\"boolean\"}\n", | |
| "\n", | |
| "network = nn.Sequential(\n", | |
| " nn.Conv2d(3, hidden_dim, kernel_size=kernel_size, padding=(kernel_size-1)//2),\n", | |
| " nn.ReLU(True) if use_relu else nn.Identity(),\n", | |
| " nn.Conv2d(hidden_dim, 1, kernel_size, padding=(kernel_size-1)//2),\n", | |
| ")\n", | |
| "print('Neural Network:')\n", | |
| "[print(f\" {m}\") for m in network.modules() if not isinstance(m, nn.Sequential)];" | |
| ], | |
| "metadata": { | |
| "id": "mTCXyU5BQS5m", | |
| "cellView": "form" | |
| }, | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
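| { | |
| "cell_type": "markdown", | |
| "source": [ | |
| "To make the idea concrete, here is a rough, self-contained sketch of how a network like the one above could drive grasping: the CNN maps a camera image to a per-pixel value map, and the grasp location is the argmax of that map. The image size and (untrained) weights are illustrative, and the actual `Agent` in `simulator.py` may select actions differently." | |
| ], | |
| "metadata": {} | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "# Rough sketch: a CNN turns an image into a per-pixel value map; grasp at the argmax.\n", | |
| "# Illustrative sizes and untrained weights; the real Agent in simulator.py may differ.\n", | |
| "import torch\n", | |
| "from torch import nn\n", | |
| "\n", | |
| "sketch_net = nn.Sequential(\n", | |
| "    nn.Conv2d(3, 16, kernel_size=5, padding=2),\n", | |
| "    nn.ReLU(True),\n", | |
| "    nn.Conv2d(16, 1, kernel_size=5, padding=2),\n", | |
| ")\n", | |
| "\n", | |
| "image = torch.rand(1, 3, 64, 64)         # stand-in camera image (batch, RGB, H, W)\n", | |
| "with torch.no_grad():\n", | |
| "    value_map = sketch_net(image)[0, 0]  # (H, W) predicted value of grasping each pixel\n", | |
| "row, col = divmod(value_map.argmax().item(), value_map.shape[1])\n", | |
| "print(value_map.shape, f\"-> grasp at pixel ({row}, {col})\")" | |
| ], | |
| "metadata": {}, | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |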
| { | |
| "cell_type": "markdown", | |
| "source": [ | |
| "#### Train Network\n", | |
| "Run this cell to train your network to learn the value function. The goal is to achieve a high-success rate with as few training steps (= grasp attempts) as possible.\n", | |
| "\n", | |
| "In addition to modifying the network, you can also change the training hyperparameters:\n", | |
| "- **`epsilon`**: controls fraction of exploration actions taken during training\n", | |
| "- **`num_steps`**: each step is an attempted grasp followed by an optimization step of the network\n", | |
| "- **`learning_rate`**: affects how much network updates its weights based on each reward recieved (we recommend between 0.01 and 0.0001).\n", | |
| "\n", | |
| "Once the network is trained, you will see the learning curve, a plot of the loss (network prediction error), and some example predictions of the network for different positions of ducks." | |
| ], | |
| "metadata": { | |
| "id": "CcOqiJBwj3pg" | |
| } | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "torch.manual_seed(0)\n", | |
| "np.random.seed(0)\n", | |
| "\n", | |
| "# make sure network is initialized\n", | |
| "for m in network.modules():\n", | |
| " if isinstance(m, torch.nn.Conv2d):\n", | |
| " torch.nn.init.xavier_uniform_(m.weight)\n", | |
| " m.bias.data.fill_(0.01)\n", | |
| "\n", | |
| "agent = Agent(network)\n", | |
| "epsilon = 0.2 # @param {type:\"slider\", min: 0, max: 1, step:0.01}\n", | |
| "num_steps = 4000 # @param {type:\"slider\", min: 0, max: 10000, step:100}\n", | |
| "learning_rate = 0.001 # @param {type:\"number\"}\n", | |
| "\n", | |
| "agent = train(agent, num_steps=num_steps, lr=learning_rate, epsilon=epsilon)\n", | |
| "\n", | |
| "show_predictions(agent)" | |
| ], | |
| "metadata": { | |
| "id": "lLyl8xx8ekYY", | |
| "cellView": "form" | |
| }, | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |
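| { | |
| "cell_type": "markdown", | |
| "source": [ | |
| "One way to picture a single training step: regress the network's predicted value at the chosen grasp pixel toward the reward the simulator returned. The sketch below is illustrative only (its network, sizes, and hyperparameters are assumptions); the actual `train` function in `simulator.py` may differ." | |
| ], | |
| "metadata": {} | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "# Rough sketch of one training step: regress the predicted value of the chosen\n", | |
| "# action toward the observed reward. Illustrative only; train() may differ.\n", | |
| "import torch\n", | |
| "from torch import nn\n", | |
| "\n", | |
| "sketch_net = nn.Sequential(\n", | |
| "    nn.Conv2d(3, 16, kernel_size=5, padding=2),\n", | |
| "    nn.ReLU(True),\n", | |
| "    nn.Conv2d(16, 1, kernel_size=5, padding=2),\n", | |
| ")\n", | |
| "optimizer = torch.optim.Adam(sketch_net.parameters(), lr=1e-3)\n", | |
| "\n", | |
| "image = torch.rand(1, 3, 64, 64)  # stand-in camera image\n", | |
| "row, col = 20, 30                 # grasp pixel chosen (e.g. epsilon-greedily)\n", | |
| "reward = torch.tensor(1.0)        # pretend the simulated grasp succeeded\n", | |
| "\n", | |
| "pred = sketch_net(image)[0, 0, row, col]     # predicted value of that action\n", | |
| "loss = nn.functional.mse_loss(pred, reward)  # prediction error vs. observed reward\n", | |
| "optimizer.zero_grad()\n", | |
| "loss.backward()\n", | |
| "optimizer.step()\n", | |
| "print(f\"loss: {loss.item():.4f}\")" | |
| ], | |
| "metadata": {}, | |
| "execution_count": null, | |
| "outputs": [] | |
| }, | |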
| { | |
| "cell_type": "markdown", | |
| "source": [ | |
| "*Does the predicted value function look accurate? Do you notice any changes to it with different values of epsilon?*\n", | |
| "\n", | |
| "*What would you expect to happen if there were other objects in the oil? Do you think your RL agent would avoid them? How would you make the agent avoid picking up other things?*\n", | |
| "\n", | |
| "\n", | |
| "*What is the effect of kernel_size on success rate? Why?*" | |
| ], | |
| "metadata": { | |
| "id": "WMm28BeLj9kA" | |
| } | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "source": [ | |
| "## What Next?\n", | |
| "If you want to learn more about RL, consider checking out the following resources:\n", | |
| "\n", | |
| "- Northeastern's CS 4180/5180: Reinforcement Learning\n", | |
| "- [RL Course Lecture Videos](https://www.youtube.com/watch?v=2pWv7GOvuf0)\n", | |
| "- [PyTorch Walkthrough of Deep Q-Network](https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html)" | |
| ], | |
| "metadata": { | |
| "id": "RhpfLj8qNQb3" | |
| } | |
| } | |
| ] | |
| } |