Last active
March 10, 2022 23:01
Generator of building batch and data window shift by 1.ipynb
| { | |
| "nbformat": 4, | |
| "nbformat_minor": 0, | |
| "metadata": { | |
| "colab": { | |
| "name": "Generator of building batch and data window shift by 1.ipynb", | |
| "provenance": [], | |
| "authorship_tag": "ABX9TyNHf1+B6cNs+Njps9oe0gf/", | |
| "include_colab_link": true | |
| }, | |
| "kernelspec": { | |
| "name": "python3", | |
| "display_name": "Python 3" | |
| }, | |
| "language_info": { | |
| "name": "python" | |
| } | |
| }, | |
| "cells": [ | |
| { | |
| "cell_type": "markdown", | |
| "metadata": { | |
| "id": "view-in-github", | |
| "colab_type": "text" | |
| }, | |
| "source": [ | |
| "<a href=\"https://colab.research.google.com/gist/XinyueZ/0fe0a7c6e0c9b1eca4b29593fa26e2a7/generator-of-building-batch-and-data-window-shift-by-1.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "source": [ | |
| "Build a data-batch generator with the built-in `yield`\n", | |
| "====\n", | |
| "\n", | |
| "Use `yield` to build a data-batch generator that pumps out batches of rows, where each row contains a fixed number of data points.\n", | |
| "\n", | |
| "- Bonus: related tools worth checking\n", | |
| "  - In `tensorflow`, see `tf.data.Dataset.window`.\n", | |
| "  - In `pytorch`, see `torch.utils.data.DataLoader`." | |
| ], | |
| "metadata": { | |
| "id": "DkF_03U0dWI9" | |
| } | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "source": [ | |
| "# Warm-up\n", | |
| "\n", | |
| "First, a recap of how the following 👇 work (skip it if you already know them)." | |
| ], | |
| "metadata": { | |
| "id": "-oY41hKte6ju" | |
| } | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "source": [ | |
| "## How `numpy` extracts same-length slices from every row of a matrix at once" | |
| ], | |
| "metadata": { | |
| "id": "dIJRrWtpczve" | |
| } | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "import numpy as np\n", | |
| "\n", | |
| "def extract_matrix(matrix, len_2_extract):\n", | |
| "    data = []\n", | |
| "    max_len = matrix.shape[1]  # number of columns\n", | |
| "\n", | |
| "    start = 0\n", | |
| "    end = start + len_2_extract\n", | |
| "\n", | |
| "    print(f\"max_len: {max_len}, start: {start}, end: {end}\\n\")\n", | |
| "\n", | |
| "    # Collect column slices until the next slice would run past the last column.\n", | |
| "    while end <= max_len:\n", | |
| "        data.append(matrix[:, start:end])\n", | |
| "        start = start + len_2_extract\n", | |
| "        end = start + len_2_extract\n", | |
| "        print(f\"max_len: {max_len}, start: {start}, end: {end}\\n\")\n", | |
| "    return data\n", | |
| "\n", | |
| "matrix = np.random.randint(0, 11, size=(10, 6))\n", | |
| "print(f\"matrix:\\n{matrix}\")\n", | |
| "print()\n", | |
| "\n", | |
| "print(extract_matrix(matrix, 2))" | |
| ], | |
| "metadata": { | |
| "colab": { | |
| "base_uri": "https://localhost:8080/" | |
| }, | |
| "id": "1wH4C0D8cwvh", | |
| "outputId": "a79c9541-f781-4d66-dfd1-aed428fa5b14" | |
| }, | |
| "execution_count": 29, | |
| "outputs": [ | |
| { | |
| "output_type": "stream", | |
| "name": "stdout", | |
| "text": [ | |
| "matrix:\n", | |
| "[[ 8 1 1 3 2 7]\n", | |
| " [ 1 5 2 2 2 9]\n", | |
| " [ 5 3 0 5 2 8]\n", | |
| " [ 1 0 6 1 8 9]\n", | |
| " [ 1 4 9 10 6 0]\n", | |
| " [ 9 10 5 2 3 2]\n", | |
| " [ 3 10 0 5 7 4]\n", | |
| " [ 1 10 4 5 1 3]\n", | |
| " [ 4 0 8 1 8 8]\n", | |
| " [10 3 8 7 1 2]]\n", | |
| "\n", | |
| "max_len: 6, start: 0, end: 2\n", | |
| "\n", | |
| "max_len: 6, start: 2, end: 4\n", | |
| "\n", | |
| "max_len: 6, start: 4, end: 6\n", | |
| "\n", | |
| "max_len: 6, start: 6, end: 8\n", | |
| "\n", | |
| "[array([[ 8, 1],\n", | |
| " [ 1, 5],\n", | |
| " [ 5, 3],\n", | |
| " [ 1, 0],\n", | |
| " [ 1, 4],\n", | |
| " [ 9, 10],\n", | |
| " [ 3, 10],\n", | |
| " [ 1, 10],\n", | |
| " [ 4, 0],\n", | |
| " [10, 3]]), array([[ 1, 3],\n", | |
| " [ 2, 2],\n", | |
| " [ 0, 5],\n", | |
| " [ 6, 1],\n", | |
| " [ 9, 10],\n", | |
| " [ 5, 2],\n", | |
| " [ 0, 5],\n", | |
| " [ 4, 5],\n", | |
| " [ 8, 1],\n", | |
| " [ 8, 7]]), array([[2, 7],\n", | |
| " [2, 9],\n", | |
| " [2, 8],\n", | |
| " [8, 9],\n", | |
| " [6, 0],\n", | |
| " [3, 2],\n", | |
| " [7, 4],\n", | |
| " [1, 3],\n", | |
| " [8, 8],\n", | |
| " [1, 2]])]\n" | |
| ] | |
| } | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "source": [ | |
| "With an extraction length of `2`, the 10-by-6 matrix is split into 3 parts:\n", | |
| "\n", | |
| "```python\n", | |
| "matrix[:, 0:2]\n", | |
| "matrix[:, 2:4]\n", | |
| "matrix[:, 4:6]\n", | |
| "```\n", | |
| "The last log line shows `start: 6, end: 8`, but that slice is never extracted because `end` exceeds `max_len` and the loop stops.\n", | |
| "\n", | |
| "`extract_matrix` returns its result eagerly: the caller receives the complete list only after the whole loop has finished." | |
| ], | |
| "metadata": { | |
| "id": "NCVfGH__6qiG" | |
| } | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "source": [ | |
| "## How `yield` works" | |
| ], | |
| "metadata": { | |
| "id": "tN43HSe8c7es" | |
| } | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "import numpy as np\n", | |
| "\n", | |
| "def pump_out_matrix(matrix, len_2_extract):\n", | |
| "    print(\"pump_out_matrix: only call ONE time...\")\n", | |
| "\n", | |
| "    max_len = matrix.shape[1]  # number of columns\n", | |
| "\n", | |
| "    start = 0\n", | |
| "    end = start + len_2_extract\n", | |
| "\n", | |
| "    print(f\"max_len: {max_len}, start: {start}, end: {end}\")\n", | |
| "\n", | |
| "    # Hand back one column slice per iteration; execution pauses at each `yield`.\n", | |
| "    while end <= max_len:\n", | |
| "        yield matrix[:, start:end]\n", | |
| "        start = start + len_2_extract\n", | |
| "        end = start + len_2_extract\n", | |
| "        print(f\"max_len: {max_len}, start: {start}, end: {end}\")\n", | |
| "\n", | |
| "\n", | |
| "matrix = np.random.randint(0, 11, size=(10, 6))\n", | |
| "print(f\"matrix:\\n{matrix}\")\n", | |
| "print()\n", | |
| "\n", | |
| "itor = iter(pump_out_matrix(matrix, 2))\n", | |
| "print(\"pump start:\")\n", | |
| "for it in itor:\n", | |
| "    print(it)\n", | |
| "    print(\"pump next:\")\n", | |
| "print(\"pump end\")" | |
| ], | |
| "metadata": { | |
| "colab": { | |
| "base_uri": "https://localhost:8080/" | |
| }, | |
| "id": "27G0q54W7TCm", | |
| "outputId": "7761d2d2-7504-44f9-bf9e-b81edd989669" | |
| }, | |
| "execution_count": 30, | |
| "outputs": [ | |
| { | |
| "output_type": "stream", | |
| "name": "stdout", | |
| "text": [ | |
| "matrix:\n", | |
| "[[ 8 0 6 2 0 10]\n", | |
| " [ 8 0 5 10 0 6]\n", | |
| " [ 2 9 2 1 0 10]\n", | |
| " [ 3 9 9 6 1 2]\n", | |
| " [ 1 8 6 6 5 4]\n", | |
| " [ 5 0 3 7 6 9]\n", | |
| " [ 6 5 1 1 2 1]\n", | |
| " [ 8 8 1 5 9 2]\n", | |
| " [10 7 0 3 10 1]\n", | |
| " [ 2 6 0 2 6 8]]\n", | |
| "\n", | |
| "pump start:\n", | |
| "pump_out_matrix: only call ONE time...\n", | |
| "max_len: 6, start: 0, end: 2\n", | |
| "[[ 8 0]\n", | |
| " [ 8 0]\n", | |
| " [ 2 9]\n", | |
| " [ 3 9]\n", | |
| " [ 1 8]\n", | |
| " [ 5 0]\n", | |
| " [ 6 5]\n", | |
| " [ 8 8]\n", | |
| " [10 7]\n", | |
| " [ 2 6]]\n", | |
| "pump next:\n", | |
| "max_len: 6, start: 2, end: 4\n", | |
| "[[ 6 2]\n", | |
| " [ 5 10]\n", | |
| " [ 2 1]\n", | |
| " [ 9 6]\n", | |
| " [ 6 6]\n", | |
| " [ 3 7]\n", | |
| " [ 1 1]\n", | |
| " [ 1 5]\n", | |
| " [ 0 3]\n", | |
| " [ 0 2]]\n", | |
| "pump next:\n", | |
| "max_len: 6, start: 4, end: 6\n", | |
| "[[ 0 10]\n", | |
| " [ 0 6]\n", | |
| " [ 0 10]\n", | |
| " [ 1 2]\n", | |
| " [ 5 4]\n", | |
| " [ 6 9]\n", | |
| " [ 2 1]\n", | |
| " [ 9 2]\n", | |
| " [10 1]\n", | |
| " [ 6 8]]\n", | |
| "pump next:\n", | |
| "max_len: 6, start: 6, end: 8\n", | |
| "pump end\n" | |
| ] | |
| } | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "source": [ | |
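| "With an extraction length of `2`, the 10-by-6 matrix is pumped out in 3 parts, one at a time:\n", | |
| "\n", | |
| "```python\n", | |
| "matrix[:, 0:2]\n", | |
| "matrix[:, 2:4]\n", | |
| "matrix[:, 4:6]\n", | |
| "```\n", | |
| "`pump_out_matrix` returns its results lazily rather than asynchronously: the function body runs only while the caller asks for the next part, and it pauses at each `yield`. That is why `pump start:` is printed before `pump_out_matrix: only call ONE time...` in the output above.\n", | |
| "\n", | |
| "Simply put:\n", | |
| "\n", | |
| "`pump_out_matrix` acts like a subject.\n", | |
| "The client program acts like a subscriber to this subject.\n", | |
| "As soon as `pump_out_matrix` extracts one part, e.g. `matrix[:, 0:2]`, it hands the result back to the subscriber.\n", | |
| "\n", | |
| "# This is the observer pattern, right?\n", | |
| "https://en.wikipedia.org/wiki/Observer_pattern\n", | |
| "\n", | |
| "Only loosely: in the observer pattern the subject pushes updates to its subscribers, whereas here the `for` loop pulls each part from the generator, which is closer to the iterator pattern." | |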
| ], | |
| "metadata": { | |
| "id": "HOHKOlx88f72" | |
| } | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "source": [ | |
| "# Example\n", | |
| "\n", | |
| "Build batches of staggered feature and label pairs.\n", | |
| "\n", | |
| "## Method parameters\n", | |
| "\n", | |
| "We'd like a data-batch generator that pumps out batches with `batch_rows` rows, where each row has `steps` data points.\n", | |
| "\n", | |
| "To keep the demo simple, the input `source` is a `numpy` `array`.\n", | |
| "\n", | |
| "For machine learning purposes, the generator should pump out a 2-element tuple.\n", | |
| "\n", | |
| "The 1st and 2nd elements have the same layout (`batch_rows` by `steps`) and represent **features** and **labels** respectively.\n", | |
| "\n", | |
| "To keep the demo simple, each data point is a scalar rather than a high-dimensional vector:\n", | |
| "\n", | |
| "### Eg:\n", | |
| "\n", | |
| "`source`: `[ 6 3 9 0 7 3 3 10 2 1 6 5 2 0 5 0 3 2 3 8 0 0 3 10\n", | |
| " 0 0 9 4 3 0 4 0 0]`\n", | |
| "\n", | |
| "`batch_rows`: 4\n", | |
| "\n", | |
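| "With 33 data points and 4 rows, each row gets `33 // 4 = 8` data points and the last data point is left over, so we expect this **base batch**:\n", | |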
| "\n", | |
| "```python\n", | |
| "[[ 6 3 9 0 7 3 3 10]\n", | |
| " [ 2 1 6 5 2 0 5 0]\n", | |
| " [ 3 2 3 8 0 0 3 10]\n", | |
| " [ 0 0 9 4 3 0 4 0]]\n", | |
| "```\n", | |
| "\n", | |
| "`steps`: 3\n", | |
| "\n", | |
| "The generator extracts `steps` columns at a time from the base batch; `numpy` slicing makes this easy.\n", | |
| "\n", | |
| "Check the warm-up example ☝.\n", | |
| "\n", | |
| "## Layout of features and labels:\n", | |
| "\n", | |
| "- **Feature**: each data point\n", | |
| "- **Label**: the next data point\n", | |
| "\n", | |
| "### Eg:\n", | |
| "\n", | |
| "- In the 1st row, `6` is a feature and the next data point, `3`, is its label.\n", | |
| "- In the 4th row, `3` is a feature and the next data point, `0`, is its label.\n", | |
| "\n", | |
| "And so on.\n", | |
| "\n", | |
| "*If you are familiar with NLP or time-series tasks, you will know this layout well; we don't cover it in more detail here.*\n", | |
| "\n", | |
| "## How the generator works\n", | |
| "\n", | |
| "In the first iteration the generator extracts features from column `0` to column `steps - 1` and labels from column `1` to column `steps`.\n", | |
| "In Python these are `[:, 0:steps]` and `[:, 1:steps + 1]`.\n", | |
| "\n", | |
| "```python\n", | |
| "features:\n", | |
| "[[6 3 9]\n", | |
| " [2 1 6]\n", | |
| " [3 2 3]\n", | |
| " [0 0 9]]\n", | |
| "labels:\n", | |
| "[[3 9 0]\n", | |
| " [1 6 5]\n", | |
| " [2 3 8]\n", | |
| " [0 9 4]]\n", | |
| "```\n", | |
| "\n", | |
| "Zoom in on the 1st row of features:\n", | |
| "\n", | |
| "`[6 3 9]`\n", | |
| "\n", | |
| "\n", | |
| "Zoom in on the 1st row of labels:\n", | |
| "\n", | |
| "`[3 9 0]`\n", | |
| "\n", | |
| "**Features and labels form staggered pairs: each label is the data point one position after its feature.**\n", | |
| "\n", | |
| "In the next iteration the generator extracts features from column `steps` to column `steps + steps - 1`:\n", | |
| "\n", | |
| "In Python it is `[:, steps : steps + steps]`\n", | |
| "\n", | |
| "```\n", | |
| "features:\n", | |
| "[[0 7 3]\n", | |
| " [5 2 0]\n", | |
| " [8 0 0]\n", | |
| " [4 3 0]]\n", | |
| "labels:\n", | |
| "[[7 3 3]\n", | |
| " [2 0 5]\n", | |
| " [0 0 3]\n", | |
| " [3 0 4]]\n", | |
| "```\n", | |
| "\n", | |
| "## So the working rule is: \n", | |
| "\n", | |
| "```pseudo\n", | |
| "\n", | |
| "# Given generate_times batches to pump out,\n", | |
| "# the generator is a loop:\n", | |
| "\n", | |
| "Loop i from 0 to generate_times - 1:\n", | |
| "    start = i * steps\n", | |
| "    end = start + steps\n", | |
| "\n", | |
| "    x <- extract matrix [all rows, start : end)\n", | |
| "    y <- extract matrix [all rows, start + 1 : end + 1)\n", | |
| "\n", | |
| "    yield (x, y)\n", | |
| "\n", | |
| "```\n", | |
| "\n", | |
| "Remember that features and labels are returned together in one tuple.\n", | |
| "\n", | |
| "The generator runs until the remaining data points can no longer fill a full batch. The leftover data points are dropped.\n", | |
| "\n", | |
| "### Eg:\n", | |
| "\n", | |
| "```\n", | |
| "[[10]\n", | |
| " [0]\n", | |
| " [10]\n", | |
| " [0]]\n", | |
| "```\n", | |
| "\n", | |
| "Will not be pumped out.\n", | |
| "\n", | |
| "\n", | |
| "`yield` pumps out the pair `(x, y)`; see the warm-up example ☝ for how `yield` works.\n", | |
| "\n", | |
| "*To use the dropped data points, we could pre-pad or post-pad them with zeros up to `steps` long; we don't cover this in more detail here.*\n", | |
| "\n", | |
| "\n" | |
| ], | |
| "metadata": { | |
| "id": "gzV8uoWUTs2c" | |
| } | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 31, | |
| "metadata": { | |
| "id": "_zTkrXa3zWPI" | |
| }, | |
| "outputs": [], | |
| "source": [ | |
| "import numpy as np\n", | |
| "\n", | |
| "def build_batch_generator(source, batch_rows, steps, shift=1):\n", | |
| "    row_len = len(source) // batch_rows  # Data points per row of the base batch; leftovers are dropped.\n", | |
| "    print(f\"row_len: {row_len}\")\n", | |
| "    print()\n", | |
| "\n", | |
| "    base_batch = np.zeros(shape=(batch_rows, row_len), dtype=\"int8\")\n", | |
| "    print(base_batch)\n", | |
| "    print()\n", | |
| "\n", | |
| "    for i in range(batch_rows):\n", | |
| "        start = i * row_len\n", | |
| "        end = start + row_len\n", | |
| "\n", | |
| "        print(f\"start: {start}, end: {end}\")\n", | |
| "\n", | |
| "        base_batch[i] = source[start : end]\n", | |
| "\n", | |
| "    print(base_batch)\n", | |
| "    print()\n", | |
| "\n", | |
| "    # How many times the generator will run. Reserve `shift` columns so the last labels slice is full width.\n", | |
| "    generate_times = (row_len - shift) // steps\n", | |
| "    print(f\"Generate time: {generate_times}\")\n", | |
| "    print()\n", | |
| "\n", | |
| "    # This way of extracting columns differs from the while loop in extract_matrix().\n", | |
| "    # The while loop needs no precomputed generate_times; the for loop is used here\n", | |
| "    # to show clearly that the generator pumps out data iteratively.\n", | |
| "    for i in range(generate_times):\n", | |
| "        start = i * steps  # eg: [0 -> steps) | [1 * steps -> 1 * steps + steps) | ....\n", | |
| "        end = start + steps\n", | |
| "\n", | |
| "        print(f\"start: {start}, end: {end}\")\n", | |
| "\n", | |
| "        x = base_batch[:, start : end]  # features\n", | |
| "        y = base_batch[:, start + shift : end + shift]  # labels: features shifted by `shift`\n", | |
| "\n", | |
| "        yield (x, y)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "source": [ | |
| "source = np.random.randint(0, 11, 33)\n", | |
| "print(f\"source: {source}\")\n", | |
| "print()\n", | |
| "\n", | |
| "batch_gen = build_batch_generator(\n", | |
| "    source,\n", | |
| "    batch_rows=4,  # Rows in every features/labels batch.\n", | |
| "    steps=3)  # Data points per row in every batch.\n", | |
| "\n", | |
| "for features, labels in batch_gen:\n", | |
| "    print(\"features:\")\n", | |
| "    print(features)\n", | |
| "    print(\"labels:\")\n", | |
| "    print(labels)\n", | |
| "    print()" | |
| ], | |
| "metadata": { | |
| "colab": { | |
| "base_uri": "https://localhost:8080/" | |
| }, | |
| "id": "Z_URJYIdRdtv", | |
| "outputId": "4daf478d-7d4f-47f4-be54-ad7c21aafaeb" | |
| }, | |
| "execution_count": 32, | |
| "outputs": [ | |
| { | |
| "output_type": "stream", | |
| "name": "stdout", | |
| "text": [ | |
| "source: [10 2 3 0 5 4 7 0 5 0 7 6 7 3 9 6 1 10 8 2 9 9 2 9\n", | |
| " 4 7 10 5 6 6 0 10 9]\n", | |
| "\n", | |
| "row_len: 8\n", | |
| "\n", | |
| "[[0 0 0 0 0 0 0 0]\n", | |
| " [0 0 0 0 0 0 0 0]\n", | |
| " [0 0 0 0 0 0 0 0]\n", | |
| " [0 0 0 0 0 0 0 0]]\n", | |
| "\n", | |
| "start: 0, end: 8\n", | |
| "start: 8, end: 16\n", | |
| "start: 16, end: 24\n", | |
| "start: 24, end: 32\n", | |
| "[[10 2 3 0 5 4 7 0]\n", | |
| " [ 5 0 7 6 7 3 9 6]\n", | |
| " [ 1 10 8 2 9 9 2 9]\n", | |
| " [ 4 7 10 5 6 6 0 10]]\n", | |
| "\n", | |
| "Generate time: 2\n", | |
| "\n", | |
| "start: 0, end: 3\n", | |
| "features:\n", | |
| "[[10 2 3]\n", | |
| " [ 5 0 7]\n", | |
| " [ 1 10 8]\n", | |
| " [ 4 7 10]]\n", | |
| "labels:\n", | |
| "[[ 2 3 0]\n", | |
| " [ 0 7 6]\n", | |
| " [10 8 2]\n", | |
| " [ 7 10 5]]\n", | |
| "\n", | |
| "start: 3, end: 6\n", | |
| "features:\n", | |
| "[[0 5 4]\n", | |
| " [6 7 3]\n", | |
| " [2 9 9]\n", | |
| " [5 6 6]]\n", | |
| "labels:\n", | |
| "[[5 4 7]\n", | |
| " [7 3 9]\n", | |
| " [9 9 2]\n", | |
| " [6 6 0]]\n", | |
| "\n" | |
| ] | |
| } | |
| ] | |
| } | |
| ] | |
| } |
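
As an addendum to the warm-up on `yield`, here is a minimal pure-Python sketch (no `numpy`) of how a generator runs only when the caller asks for the next value. The function `pump_out_chunks` and its `log` parameter are hypothetical names introduced for illustration; they are not part of the notebook.

```python
def pump_out_chunks(values, chunk_len, log):
    """Yield consecutive chunks of `values`, recording each step in `log`."""
    log.append("started")
    # Stop before a chunk would run past the end; leftover items are dropped.
    for start in range(0, len(values) - chunk_len + 1, chunk_len):
        log.append(f"yielding {start}:{start + chunk_len}")
        yield values[start:start + chunk_len]
    log.append("finished")


log = []
gen = pump_out_chunks([1, 2, 3, 4, 5, 6, 7], chunk_len=2, log=log)

# Creating the generator runs none of its body yet.
assert log == []

first = next(gen)   # runs the body up to the first `yield`
second = next(gen)  # resumes right after the first `yield`
rest = list(gen)    # drains the remaining chunks; the trailing 7 is dropped

print(first, second, rest)
print(log)
```

Each `next()` call resumes the function where it last paused, which is the same pull-driven behaviour the `for` loops in the notebook rely on.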