{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "Generator of building batch and data window shift by 1.ipynb",
"provenance": [],
"authorship_tag": "ABX9TyNHf1+B6cNs+Njps9oe0gf/",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/XinyueZ/0fe0a7c6e0c9b1eca4b29593fa26e2a7/generator-of-building-batch-and-data-window-shift-by-1.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"source": [
"Build a data-batch generator with build-in `yield`\n",
"====\n",
"\n",
"Use `yield` to build a data batch generator which can pump out batch of data with rows and each row contains specific number of data point.\n",
"\n",
"- Check bonus\n",
" - In `tensorflow` check `DataSet` by `window`.\n",
" - In `pytorch` check `torch.utils.data.DataLoader`."
],
"metadata": {
"id": "DkF_03U0dWI9"
}
},
{
"cell_type": "markdown",
"source": [
"# Warm-up\n",
"\n",
"Firstly we can recap (if you know them already you can ignore them) how those 👇 work."
],
"metadata": {
"id": "-oY41hKte6ju"
}
},
{
"cell_type": "markdown",
"source": [
"## How the numpy extracts same length of data from each row of matrix simultaneously "
],
"metadata": {
"id": "dIJRrWtpczve"
}
},
{
"cell_type": "code",
"source": [
"import numpy as np\n",
"\n",
"def extract_matrix(matrix, len_2_extract):\n",
" data = []\n",
" max_len = len(matrix[0, :])\n",
" \n",
" start = 0\n",
" end = start + len_2_extract\n",
"\n",
" print(f\"max_len: {max_len}, start: {start}, end: {end}\\n\")\n",
"\n",
" while(end < max_len + 1):\n",
" data.append(matrix[:, start:end])\n",
" start = start + len_2_extract\n",
" end = start + len_2_extract\n",
" print(f\"max_len: {max_len}, start: {start}, end: {end}\\n\")\n",
" return data\n",
"\n",
"matrix = np.random.randint(0, 11, size=(10, 6)) \n",
"print(f\"matrix:\\n{matrix}\")\n",
"print()\n",
"\n",
"print(extract_matrix(matrix, 2))"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "1wH4C0D8cwvh",
"outputId": "a79c9541-f781-4d66-dfd1-aed428fa5b14"
},
"execution_count": 29,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"matrix:\n",
"[[ 8 1 1 3 2 7]\n",
" [ 1 5 2 2 2 9]\n",
" [ 5 3 0 5 2 8]\n",
" [ 1 0 6 1 8 9]\n",
" [ 1 4 9 10 6 0]\n",
" [ 9 10 5 2 3 2]\n",
" [ 3 10 0 5 7 4]\n",
" [ 1 10 4 5 1 3]\n",
" [ 4 0 8 1 8 8]\n",
" [10 3 8 7 1 2]]\n",
"\n",
"max_len: 6, start: 0, end: 2\n",
"\n",
"max_len: 6, start: 2, end: 4\n",
"\n",
"max_len: 6, start: 4, end: 6\n",
"\n",
"max_len: 6, start: 6, end: 8\n",
"\n",
"[array([[ 8, 1],\n",
" [ 1, 5],\n",
" [ 5, 3],\n",
" [ 1, 0],\n",
" [ 1, 4],\n",
" [ 9, 10],\n",
" [ 3, 10],\n",
" [ 1, 10],\n",
" [ 4, 0],\n",
" [10, 3]]), array([[ 1, 3],\n",
" [ 2, 2],\n",
" [ 0, 5],\n",
" [ 6, 1],\n",
" [ 9, 10],\n",
" [ 5, 2],\n",
" [ 0, 5],\n",
" [ 4, 5],\n",
" [ 8, 1],\n",
" [ 8, 7]]), array([[2, 7],\n",
" [2, 9],\n",
" [2, 8],\n",
" [8, 9],\n",
" [6, 0],\n",
" [3, 2],\n",
" [7, 4],\n",
" [1, 3],\n",
" [8, 8],\n",
" [1, 2]])]\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"The matrix of 10 by 6 will be extracted into to 3 parts with the length to extract `2`:\n",
"\n",
"```python\n",
"matrix[:, 0:2]\n",
"matrix[:, 2:4]\n",
"matrix[:, 4:6]\n",
"matrix[:, 6:8]\n",
"```\n",
"The result of `extract_matrix` is returned synchronously, that is, after it has finished executing."
],
"metadata": {
"id": "NCVfGH__6qiG"
}
},
{
"cell_type": "markdown",
"source": [
"## How `yield` works"
],
"metadata": {
"id": "tN43HSe8c7es"
}
},
{
"cell_type": "code",
"source": [
"import numpy as np\n",
"\n",
"def pump_out_matrix(matrix, len_2_extract):\n",
" print(\"pump_out_matrix: only call ONE time...\")\n",
"\n",
"\n",
" max_len = len(matrix[0, :])\n",
" \n",
" start = 0\n",
" end = start + len_2_extract\n",
"\n",
" print(f\"max_len: {max_len}, start: {start}, end: {end}\")\n",
"\n",
" while(end < max_len + 1):\n",
" yield matrix[:, start:end]\n",
" start = start + len_2_extract\n",
" end = start + len_2_extract\n",
" print(f\"max_len: {max_len}, start: {start}, end: {end}\")\n",
"\n",
"\n",
"matrix = np.random.randint(0, 11, size=(10, 6)) \n",
"print(f\"matrix:\\n{matrix}\")\n",
"print()\n",
"\n",
"itor = iter(pump_out_matrix(matrix, 2))\n",
"print(\"pump start:\")\n",
"for it in itor:\n",
" print(it)\n",
" print(\"pump next:\")\n",
"print(\"pump end\")"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "27G0q54W7TCm",
"outputId": "7761d2d2-7504-44f9-bf9e-b81edd989669"
},
"execution_count": 30,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"matrix:\n",
"[[ 8 0 6 2 0 10]\n",
" [ 8 0 5 10 0 6]\n",
" [ 2 9 2 1 0 10]\n",
" [ 3 9 9 6 1 2]\n",
" [ 1 8 6 6 5 4]\n",
" [ 5 0 3 7 6 9]\n",
" [ 6 5 1 1 2 1]\n",
" [ 8 8 1 5 9 2]\n",
" [10 7 0 3 10 1]\n",
" [ 2 6 0 2 6 8]]\n",
"\n",
"pump start:\n",
"pump_out_matrix: only call ONE time...\n",
"max_len: 6, start: 0, end: 2\n",
"[[ 8 0]\n",
" [ 8 0]\n",
" [ 2 9]\n",
" [ 3 9]\n",
" [ 1 8]\n",
" [ 5 0]\n",
" [ 6 5]\n",
" [ 8 8]\n",
" [10 7]\n",
" [ 2 6]]\n",
"pump next:\n",
"max_len: 6, start: 2, end: 4\n",
"[[ 6 2]\n",
" [ 5 10]\n",
" [ 2 1]\n",
" [ 9 6]\n",
" [ 6 6]\n",
" [ 3 7]\n",
" [ 1 1]\n",
" [ 1 5]\n",
" [ 0 3]\n",
" [ 0 2]]\n",
"pump next:\n",
"max_len: 6, start: 4, end: 6\n",
"[[ 0 10]\n",
" [ 0 6]\n",
" [ 0 10]\n",
" [ 1 2]\n",
" [ 5 4]\n",
" [ 6 9]\n",
" [ 2 1]\n",
" [ 9 2]\n",
" [10 1]\n",
" [ 6 8]]\n",
"pump next:\n",
"max_len: 6, start: 6, end: 8\n",
"pump end\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"The matrix of 10 by 6 will be pumped out with 3 parts with the length to extract `2` separately:\n",
"\n",
"```python\n",
"matrix[:, 0:2]\n",
"matrix[:, 2:4]\n",
"matrix[:, 4:6]\n",
"matrix[:, 6:8]\n",
"```\n",
"The result of `pump_out_matrix` is returned asynchronously. \n",
"\n",
"Simple to say:\n",
"\n",
"`pump_out_matrix` is a subject.\n",
"The client program subscribes on this subject as a subscriber.\n",
"As long as the `pump_out_matrix` extracts one part, ie: `matrix[:, 0:2]`, it will push the result back to the subscriber.\n",
"\n",
"# This is observer pattern, right?\n",
"https://en.wikipedia.org/wiki/Observer_pattern"
],
"metadata": {
"id": "HOHKOlx88f72"
}
},
{
"cell_type": "markdown",
"source": [
"# Example\n",
"\n",
"Build staggered pairs feature and label batches. \n",
"\n",
"## Method parameters\n",
"\n",
"We'd like to have a data batch generator that can pump out a batch of data with `batch_rows` and each row has `steps` data points. \n",
"\n",
"For simply demo, we input `numpy` `array` as `source`\n",
"\n",
"For machine learning purposes, we also want to have a 2 elements tuple pumped by the generator. \n",
"\n",
"The 1st and 2nd elements have the same `batch_rows` (actually the same layout) to represent as **features** and **labels** respectively.\n",
"\n",
"In order to demo simply, each data point isn't high dimensional, it is just a scalar:\n",
"\n",
"### Eg:\n",
"\n",
"`source`: `[ 6 3 9 0 7 3 3 10 2 1 6 5 2 0 5 0 3 2 3 8 0 0 3 10\n",
" 0 0 9 4 3 0 4 0 0]`\n",
"\n",
"`batch_rows`: 4\n",
"\n",
"We expect to have a **base batch**:\n",
"\n",
"```python\n",
"[[ 6 3 9 0 7 3 3 10]\n",
" [ 2 1 6 5 2 0 5 0]\n",
" [ 3 2 3 8 0 0 3 10]\n",
" [ 0 0 9 4 3 0 4 0]]\n",
"```\n",
"\n",
"`steps`: 3\n",
"\n",
"The generator will extract the `steps` from the `base batch`, thanks `numpy` that we can do this quite easily. \n",
"\n",
"Check the warm-up example ☝.\n",
"\n",
"## Layout of features and label:\n",
"\n",
"**Feature**: each data point\n",
"**Label**: the next data point\n",
"\n",
"### Eg:\n",
"\n",
"in the 1st row the `6` is a feature, then the next data point `3` is its label.\n",
"in the 4th row the `3` is a feature, then the next data point `0` is its label.\n",
"\n",
"and so on.\n",
"\n",
"*When you are familiar with NLP or Time-series tasks, you should know this quite well, however, we don't cover this in too much detail.*\n",
"\n",
"## Generator works\n",
"\n",
"In first iteration the generator extracts from column `0` to column `steps - 1`:\n",
"In python it is `[:, 0:steps]`\n",
"\n",
"```python\n",
"features:\n",
"[[6 3 9]\n",
" [2 1 6]\n",
" [3 2 3]\n",
" [0 0 9]]\n",
"labels:\n",
"[[3 9 0]\n",
" [1 6 5]\n",
" [2 3 8]\n",
" [0 9 4]]\n",
"```\n",
"\n",
"Zoom in to 1st row of features:\n",
"\n",
"`[6 3 9]`\n",
"\n",
"\n",
"Zoom in to 1st row of labels:\n",
"\n",
"`[3 9 0]`\n",
"\n",
"**The data points between feature and label are staggered pairs.**\n",
"\n",
"In the next iteration the generator does from `steps` to `steps + steps`:\n",
"\n",
"In python it is `[:, steps : steps + steps]`\n",
"\n",
"```\n",
"features:\n",
"[[0 7 3]\n",
" [5 2 0]\n",
" [8 0 0]\n",
" [4 3 0]]\n",
"labels:\n",
"[[7 3 3]\n",
" [2 0 5]\n",
" [0 0 3]\n",
" [3 0 4]]\n",
"```\n",
"\n",
"## So the working rule is: \n",
"\n",
"```pseudo\n",
"\n",
"# Given we have generate_times to pump out batches:\n",
"# It will be a loop:\n",
"\n",
"Loop i to generate_times:\n",
" start = i * steps\n",
" end = start + steps\n",
"\n",
" x <- extract matrix [all rows, start : end) \n",
" y <- extract matrix [all rows, start + 1 : end + 1) \n",
"\n",
" yield (x,y)\n",
"\n",
"```\n",
"\n",
"Remember that both feature and label are stored in a tuple.\n",
"\n",
"The generator runs automatically until there aren't enough (`steps`) data points. Those data which cannot be pumped out will drop out. \n",
"\n",
"### Eg:\n",
"\n",
"```\n",
"[[10]\n",
" [0]\n",
" [10]\n",
" [0]]\n",
"```\n",
"\n",
"Will not be pumped out.\n",
"\n",
"\n",
"`yield` pumps out the pair `(x, y)`, you can check the warm-up example to get how `yield` works ☝.\n",
"\n",
"*In case want to use those dropped out data, we can post-pad or pre-pad some `zero`s to create `steps` long manually, however, we don't cover this in too much detail.*\n",
"\n",
"\n"
],
"metadata": {
"id": "gzV8uoWUTs2c"
}
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"id": "_zTkrXa3zWPI"
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def build_batch_generator(source, batch_rows, steps, shift=1):\n",
" row_len = len(source) // batch_rows # Base steps if the batch has .\n",
" print(f\"row_len: {row_len}\")\n",
" print()\n",
"\n",
" base_batch = np.zeros(shape=(batch_rows, row_len), dtype=\"int8\")\n",
" print(base_batch)\n",
" print()\n",
"\n",
" for i in range(batch_rows):\n",
" start = i * row_len \n",
" end = start + row_len \n",
"\n",
" print(f\"start: {start}, end: {end}\")\n",
" \n",
" base_batch[i] = source[start : end]\n",
"\n",
" print(base_batch)\n",
" print()\n",
"\n",
" generate_times = row_len // steps # How many times the generator will run.\n",
" print(f\"Generate time: {generate_times}\")\n",
" print()\n",
"\n",
" # This impl. of extracting columns in matrix is different from what we see in extract_matrix().\n",
" # However, the extract_matrix() is the preferred solution that can save the generate_times.\n",
" # Use the below solution in order to demo the idea clearly that the generator pumps out data iteratively.\n",
" for i in range(generate_times):\n",
" start = i * steps # eg: [0 -> steps) | [1 * steps -> 1 * step + steps) | .... \n",
" end = start + steps \n",
"\n",
" print(f\"start: {start}, end: {end}\")\n",
"\n",
" x = base_batch[:, start : end]\n",
" y = base_batch[:, start + 1 : end + 1]\n",
"\n",
" yield (x, y)"
]
},
{
"cell_type": "code",
"source": [
"source = np.random.randint(0, 11, 33)\n",
"print(f\"source: {source}\")\n",
"print()\n",
"\n",
"batch_gen = build_batch_generator(\n",
" source,\n",
" batch_rows=4, # Rows (features+labels) in every batch.\n",
" steps=3) # How many steps the generator will generate.\n",
"\n",
"for it in iter(batch_gen):\n",
" print(\"features:\")\n",
" print(it[0])\n",
" print(\"labels:\")\n",
" print(it[1])\n",
" print()"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "Z_URJYIdRdtv",
"outputId": "4daf478d-7d4f-47f4-be54-ad7c21aafaeb"
},
"execution_count": 32,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"source: [10 2 3 0 5 4 7 0 5 0 7 6 7 3 9 6 1 10 8 2 9 9 2 9\n",
" 4 7 10 5 6 6 0 10 9]\n",
"\n",
"row_len: 8\n",
"\n",
"[[0 0 0 0 0 0 0 0]\n",
" [0 0 0 0 0 0 0 0]\n",
" [0 0 0 0 0 0 0 0]\n",
" [0 0 0 0 0 0 0 0]]\n",
"\n",
"start: 0, end: 8\n",
"start: 8, end: 16\n",
"start: 16, end: 24\n",
"start: 24, end: 32\n",
"[[10 2 3 0 5 4 7 0]\n",
" [ 5 0 7 6 7 3 9 6]\n",
" [ 1 10 8 2 9 9 2 9]\n",
" [ 4 7 10 5 6 6 0 10]]\n",
"\n",
"Generate time: 2\n",
"\n",
"start: 0, end: 3\n",
"features:\n",
"[[10 2 3]\n",
" [ 5 0 7]\n",
" [ 1 10 8]\n",
" [ 4 7 10]]\n",
"labels:\n",
"[[ 2 3 0]\n",
" [ 0 7 6]\n",
" [10 8 2]\n",
" [ 7 10 5]]\n",
"\n",
"start: 3, end: 6\n",
"features:\n",
"[[0 5 4]\n",
" [6 7 3]\n",
" [2 9 9]\n",
" [5 6 6]]\n",
"labels:\n",
"[[5 4 7]\n",
" [7 3 9]\n",
" [9 9 2]\n",
" [6 6 0]]\n",
"\n"
]
}
]
}
]
}