{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "Generator of building batch and data window shift by 1.ipynb",
"provenance": [],
"authorship_tag": "ABX9TyNHf1+B6cNs+Njps9oe0gf/",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/XinyueZ/0fe0a7c6e0c9b1eca4b29593fa26e2a7/generator-of-building-batch-and-data-window-shift-by-1.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"source": [
"Build a data-batch generator with build-in `yield`\n",
"====\n",
"\n",
"Use `yield` to build a data batch generator which can pump out batch of data with rows and each row contains specific number of data point.\n",
"\n",
"- Check bonus\n",
" - In `tensorflow` check `DataSet` by `window`.\n",
" - In `pytorch` check `torch.utils.data.DataLoader`."
],
"metadata": {
"id": "DkF_03U0dWI9"
}
},
{
"cell_type": "markdown",
"source": [
"# Warm-up\n",
"\n",
"Firstly we can recap (if you know them already you can ignore them) how those 👇 work."
],
"metadata": {
"id": "-oY41hKte6ju"
}
},
{
"cell_type": "markdown",
"source": [
"## How the numpy extracts same length of data from each row of matrix simultaneously "
],
"metadata": {
"id": "dIJRrWtpczve"
}
},
{
"cell_type": "code",
"source": [
"import numpy as np\n",
"\n",
"def extract_matrix(matrix, len_2_extract):\n",
" data = []\n",
" max_len = len(matrix[0, :])\n",
" \n",
" start = 0\n",
" end = start + len_2_extract\n",
"\n",
" print(f\"max_len: {max_len}, start: {start}, end: {end}\\n\")\n",
"\n",
" while(end < max_len + 1):\n",
" data.append(matrix[:, start:end])\n",
" start = start + len_2_extract\n",
" end = start + len_2_extract\n",
" print(f\"max_len: {max_len}, start: {start}, end: {end}\\n\")\n",
" return data\n",
"\n",
"matrix = np.random.randint(0, 11, size=(10, 6)) \n",
"print(f\"matrix:\\n{matrix}\")\n",
"print()\n",
"\n",
"print(extract_matrix(matrix, 2))"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "1wH4C0D8cwvh",
"outputId": "a79c9541-f781-4d66-dfd1-aed428fa5b14"
},
"execution_count": 29,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"matrix:\n",
"[[ 8 1 1 3 2 7]\n",
" [ 1 5 2 2 2 9]\n",
" [ 5 3 0 5 2 8]\n",
" [ 1 0 6 1 8 9]\n",
" [ 1 4 9 10 6 0]\n",
" [ 9 10 5 2 3 2]\n",
" [ 3 10 0 5 7 4]\n",
" [ 1 10 4 5 1 3]\n",
" [ 4 0 8 1 8 8]\n",
" [10 3 8 7 1 2]]\n",
"\n",
"max_len: 6, start: 0, end: 2\n",
"\n",
"max_len: 6, start: 2, end: 4\n",
"\n",
"max_len: 6, start: 4, end: 6\n",
"\n",
"max_len: 6, start: 6, end: 8\n",
"\n",
"[array([[ 8, 1],\n",
" [ 1, 5],\n",
" [ 5, 3],\n",
" [ 1, 0],\n",
" [ 1, 4],\n",
" [ 9, 10],\n",
" [ 3, 10],\n",
" [ 1, 10],\n",
" [ 4, 0],\n",
" [10, 3]]), array([[ 1, 3],\n",
" [ 2, 2],\n",
" [ 0, 5],\n",
" [ 6, 1],\n",
" [ 9, 10],\n",
" [ 5, 2],\n",
" [ 0, 5],\n",
" [ 4, 5],\n",
" [ 8, 1],\n",
" [ 8, 7]]), array([[2, 7],\n",
" [2, 9],\n",
" [2, 8],\n",
" [8, 9],\n",
" [6, 0],\n",
" [3, 2],\n",
" [7, 4],\n",
" [1, 3],\n",
" [8, 8],\n",
" [1, 2]])]\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"The matrix of 10 by 6 will be extracted into to 3 parts with the length to extract `2`:\n",
"\n",
"```python\n",
"matrix[:, 0:2]\n",
"matrix[:, 2:4]\n",
"matrix[:, 4:6]\n",
"matrix[:, 6:8]\n",
"```\n",
"The result of `extract_matrix` is returned synchronously, that is, after it has finished executing."
],
"metadata": {
"id": "NCVfGH__6qiG"
}
},
{
"cell_type": "markdown",
"source": [
"## How `yield` works"
],
"metadata": {
"id": "tN43HSe8c7es"
}
},
{
"cell_type": "code",
"source": [
"import numpy as np\n",
"\n",
"def pump_out_matrix(matrix, len_2_extract):\n",
" print(\"pump_out_matrix: only call ONE time...\")\n",
"\n",
"\n",
" max_len = len(matrix[0, :])\n",
" \n",
" start = 0\n",
" end = start + len_2_extract\n",
"\n",
" print(f\"max_len: {max_len}, start: {start}, end: {end}\")\n",
"\n",
" while(end < max_len + 1):\n",
" yield matrix[:, start:end]\n",
" start = start + len_2_extract\n",
" end = start + len_2_extract\n",
" print(f\"max_len: {max_len}, start: {start}, end: {end}\")\n",
"\n",
"\n",
"matrix = np.random.randint(0, 11, size=(10, 6)) \n",
"print(f\"matrix:\\n{matrix}\")\n",
"print()\n",
"\n",
"itor = iter(pump_out_matrix(matrix, 2))\n",
"print(\"pump start:\")\n",
"for it in itor:\n",
" print(it)\n",
" print(\"pump next:\")\n",
"print(\"pump end\")"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "27G0q54W7TCm",
"outputId": "7761d2d2-7504-44f9-bf9e-b81edd989669"
},
"execution_count": 30,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"matrix:\n",
"[[ 8 0 6 2 0 10]\n",
" [ 8 0 5 10 0 6]\n",
" [ 2 9 2 1 0 10]\n",
" [ 3 9 9 6 1 2]\n",
" [ 1 8 6 6 5 4]\n",
" [ 5 0 3 7 6 9]\n",
" [ 6 5 1 1 2 1]\n",
" [ 8 8 1 5 9 2]\n",
" [10 7 0 3 10 1]\n",
" [ 2 6 0 2 6 8]]\n",
"\n",
"pump start:\n",
"pump_out_matrix: only call ONE time...\n",
"max_len: 6, start: 0, end: 2\n",
"[[ 8 0]\n",
" [ 8 0]\n",
" [ 2 9]\n",
" [ 3 9]\n",
" [ 1 8]\n",
" [ 5 0]\n",
" [ 6 5]\n",
" [ 8 8]\n",
" [10 7]\n",
" [ 2 6]]\n",
"pump next:\n",
"max_len: 6, start: 2, end: 4\n",
"[[ 6 2]\n",
" [ 5 10]\n",
" [ 2 1]\n",
" [ 9 6]\n",
" [ 6 6]\n",
" [ 3 7]\n",
" [ 1 1]\n",
" [ 1 5]\n",
" [ 0 3]\n",
" [ 0 2]]\n",
"pump next:\n",
"max_len: 6, start: 4, end: 6\n",
"[[ 0 10]\n",
" [ 0 6]\n",
" [ 0 10]\n",
" [ 1 2]\n",
" [ 5 4]\n",
" [ 6 9]\n",
" [ 2 1]\n",
" [ 9 2]\n",
" [10 1]\n",
" [ 6 8]]\n",
"pump next:\n",
"max_len: 6, start: 6, end: 8\n",
"pump end\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"The matrix of 10 by 6 will be pumped out with 3 parts with the length to extract `2` separately:\n",
"\n",
"```python\n",
"matrix[:, 0:2]\n",
"matrix[:, 2:4]\n",
"matrix[:, 4:6]\n",
"matrix[:, 6:8]\n",
"```\n",
"The result of `pump_out_matrix` is returned asynchronously. \n",
"\n",
"Simple to say:\n",
"\n",
"`pump_out_matrix` is a subject.\n",
"The client program subscribes on this subject as a subscriber.\n",
"As long as the `pump_out_matrix` extracts one part, ie: `matrix[:, 0:2]`, it will push the result back to the subscriber.\n",
"\n",
"# This is observer pattern, right?\n",
"https://en.wikipedia.org/wiki/Observer_pattern"
],
"metadata": {
"id": "HOHKOlx88f72"
}
},
{
"cell_type": "markdown",
"source": [
"# Example\n",
"\n",
"Build staggered pairs feature and label batches. \n",
"\n",
"## Method parameters\n",
"\n",
"We'd like to have a data batch generator that can pump out a batch of data with `batch_rows` and each row has `steps` data points. \n",
"\n",
"For simply demo, we input `numpy` `array` as `source`\n",
"\n",
"For machine learning purposes, we also want to have a 2 elements tuple pumped by the generator. \n",
"\n",
"The 1st and 2nd elements have the same `batch_rows` (actually the same layout) to represent as **features** and **labels** respectively.\n",
"\n",
"In order to demo simply, each data point isn't high dimensional, it is just a scalar:\n",
"\n",
"### Eg:\n",
"\n",
"`source`: `[ 6 3 9 0 7 3 3 10 2 1 6 5 2 0 5 0 3 2 3 8 0 0 3 10\n",
" 0 0 9 4 3 0 4 0 0]`\n",
"\n",
"`batch_rows`: 4\n",
"\n",
"We expect to have a **base batch**:\n",
"\n",
"```python\n",
"[[ 6 3 9 0 7 3 3 10]\n",
" [ 2 1 6 5 2 0 5 0]\n",
" [ 3 2 3 8 0 0 3 10]\n",
" [ 0 0 9 4 3 0 4 0]]\n",
"```\n",
"\n",
"`steps`: 3\n",
"\n",
"The generator will extract the `steps` from the `base batch`, thanks `numpy` that we can do this quite easily. \n",
"\n",
"Check the warm-up example ☝.\n",
"\n",
"## Layout of features and label:\n",
"\n",
"**Feature**: each data point\n",
"**Label**: the next data point\n",
"\n",
"### Eg:\n",
"\n",
"in the 1st row the `6` is a feature, then the next data point `3` is its label.\n",
"in the 4th row the `3` is a feature, then the next data point `0` is its label.\n",
"\n",
"and so on.\n",
"\n",
"*When you are familiar with NLP or Time-series tasks, you should know this quite well, however, we don't cover this in too much detail.*\n",
"\n",
"## Generator works\n",
"\n",
"In first iteration the generator extracts from column `0` to column `steps - 1`:\n",
"In python it is `[:, 0:steps]`\n",
"\n",
"```python\n",
"features:\n",
"[[6 3 9]\n",
" [2 1 6]\n",
" [3 2 3]\n",
" [0 0 9]]\n",
"labels:\n",
"[[3 9 0]\n",
" [1 6 5]\n",
" [2 3 8]\n",
" [0 9 4]]\n",
"```\n",
"\n",
"Zoom in to 1st row of features:\n",
"\n",
"`[6 3 9]`\n",
"\n",
"\n",
"Zoom in to 1st row of labels:\n",
"\n",
"`[3 9 0]`\n",
"\n",
"**The data points between feature and label are staggered pairs.**\n",
"\n",
"In the next iteration the generator does from `steps` to `steps + steps`:\n",
"\n",
"In python it is `[:, steps : steps + steps]`\n",
"\n",
"```\n",
"features:\n",
"[[0 7 3]\n",
" [5 2 0]\n",
" [8 0 0]\n",
" [4 3 0]]\n",
"labels:\n",
"[[7 3 3]\n",
" [2 0 5]\n",
" [0 0 3]\n",
" [3 0 4]]\n",
"```\n",
"\n",
"## So the working rule is: \n",
"\n",
"```pseudo\n",
"\n",
"# Given we have generate_times to pump out batches:\n",
"# It will be a loop:\n",
"\n",
"Loop i to generate_times:\n",
" start = i * steps\n",
" end = start + steps\n",
"\n",
" x <- extract matrix [all rows, start : end) \n",
" y <- extract matrix [all rows, start + 1 : end + 1) \n",
"\n",
" yield (x,y)\n",
"\n",
"```\n",
"\n",
"Remember that both feature and label are stored in a tuple.\n",
"\n",
"The generator runs automatically until there aren't enough (`steps`) data points. Those data which cannot be pumped out will drop out. \n",
"\n",
"### Eg:\n",
"\n",
"```\n",
"[[10]\n",
" [0]\n",
" [10]\n",
" [0]]\n",
"```\n",
"\n",
"Will not be pumped out.\n",
"\n",
"\n",
"`yield` pumps out the pair `(x, y)`, you can check the warm-up example to get how `yield` works ☝.\n",
"\n",
"*In case want to use those dropped out data, we can post-pad or pre-pad some `zero`s to create `steps` long manually, however, we don't cover this in too much detail.*\n",
"\n",
"\n"
],
"metadata": {
"id": "gzV8uoWUTs2c"
}
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"id": "_zTkrXa3zWPI"
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def build_batch_generator(source, batch_rows, steps, shift=1):\n",
" row_len = len(source) // batch_rows # Base steps if the batch has .\n",
" print(f\"row_len: {row_len}\")\n",
" print()\n",
"\n",
" base_batch = np.zeros(shape=(batch_rows, row_len), dtype=\"int8\")\n",
" print(base_batch)\n",
" print()\n",
"\n",
" for i in range(batch_rows):\n",
" start = i * row_len \n",
" end = start + row_len \n",
"\n",
" print(f\"start: {start}, end: {end}\")\n",
" \n",
" base_batch[i] = source[start : end]\n",
"\n",
" print(base_batch)\n",
" print()\n",
"\n",
" generate_times = row_len // steps # How many times the generator will run.\n",
" print(f\"Generate time: {generate_times}\")\n",
" print()\n",
"\n",
" # This impl. of extracting columns in matrix is different from what we see in extract_matrix().\n",
" # However, the extract_matrix() is the preferred solution that can save the generate_times.\n",
" # Use the below solution in order to demo the idea clearly that the generator pumps out data iteratively.\n",
" for i in range(generate_times):\n",
" start = i * steps # eg: [0 -> steps) | [1 * steps -> 1 * step + steps) | .... \n",
" end = start + steps \n",
"\n",
" print(f\"start: {start}, end: {end}\")\n",
"\n",
" x = base_batch[:, start : end]\n",
" y = base_batch[:, start + 1 : end + 1]\n",
"\n",
" yield (x, y)"
]
},
{
"cell_type": "code",
"source": [
"source = np.random.randint(0, 11, 33)\n",
"print(f\"source: {source}\")\n",
"print()\n",
"\n",
"batch_gen = build_batch_generator(\n",
" source,\n",
" batch_rows=4, # Rows (features+labels) in every batch.\n",
" steps=3) # How many steps the generator will generate.\n",
"\n",
"for it in iter(batch_gen):\n",
" print(\"features:\")\n",
" print(it[0])\n",
" print(\"labels:\")\n",
" print(it[1])\n",
" print()"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "Z_URJYIdRdtv",
"outputId": "4daf478d-7d4f-47f4-be54-ad7c21aafaeb"
},
"execution_count": 32,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"source: [10 2 3 0 5 4 7 0 5 0 7 6 7 3 9 6 1 10 8 2 9 9 2 9\n",
" 4 7 10 5 6 6 0 10 9]\n",
"\n",
"row_len: 8\n",
"\n",
"[[0 0 0 0 0 0 0 0]\n",
" [0 0 0 0 0 0 0 0]\n",
" [0 0 0 0 0 0 0 0]\n",
" [0 0 0 0 0 0 0 0]]\n",
"\n",
"start: 0, end: 8\n",
"start: 8, end: 16\n",
"start: 16, end: 24\n",
"start: 24, end: 32\n",
"[[10 2 3 0 5 4 7 0]\n",
" [ 5 0 7 6 7 3 9 6]\n",
" [ 1 10 8 2 9 9 2 9]\n",
" [ 4 7 10 5 6 6 0 10]]\n",
"\n",
"Generate time: 2\n",
"\n",
"start: 0, end: 3\n",
"features:\n",
"[[10 2 3]\n",
" [ 5 0 7]\n",
" [ 1 10 8]\n",
" [ 4 7 10]]\n",
"labels:\n",
"[[ 2 3 0]\n",
" [ 0 7 6]\n",
" [10 8 2]\n",
" [ 7 10 5]]\n",
"\n",
"start: 3, end: 6\n",
"features:\n",
"[[0 5 4]\n",
" [6 7 3]\n",
" [2 9 9]\n",
" [5 6 6]]\n",
"labels:\n",
"[[5 4 7]\n",
" [7 3 9]\n",
" [9 9 2]\n",
" [6 6 0]]\n",
"\n"
]
}
]
}
]
}