moodoki · June 30, 2021 06:11
diff --git a/high-dimensional-data-visualization.ipynb b/high-dimensional-data-visualization.ipynb
 {
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "High Dimensional Data Visualization",
      "private_outputs": true,
      "provenance": [],
      "collapsed_sections": [],
      "authorship_tag": "ABX9TyOHlP8IO3UimnXV2lez9NSy",
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/gist/moodoki/851004bd4ac24e49e833d74da2162d1e/high-dimensional-data-visualization.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "0b50qqqhWu1-"
      },
      "source": [
        "!pip install -q umap-learn"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ZUTUnULCXnnE"
      },
      "source": [
        "General imports"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "nTPilY4CWpI-"
      },
      "source": [
        "import matplotlib.pyplot as plt\n",
        "import seaborn as sns\n",
        "import pandas as pd"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Z6M8VuODXtWQ"
      },
      "source": [
        "Get some data to work with. We will use the MNIST digits dataset from OpenML.\n",
        "\n",
        "We will create a pandas DataFrame with a column `y` for the labels and one column for each pixel in the image. We'll also normalize the pixel values to [0, 1] by dividing by 255.\n",
        "\n",
        "The dimension reduction implementations expect an array-like with each row representing a sample and each column a dimension. After creating our DataFrame, we can get this array using `df[feat_cols]`."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "59Ct3FKGW0WF"
      },
      "source": [
        "from sklearn.datasets import fetch_openml\n",
        "\n",
        "mnist = fetch_openml(\"mnist_784\")\n",
        "x = mnist.data / 255.0\n",
        "y = mnist.target\n",
        "\n",
        "# One column name per pixel: pixel1 ... pixel784\n",
        "feat_cols = [f'pixel{i + 1}' for i in range(x.shape[1])]\n",
        "df = pd.DataFrame(x, columns=feat_cols)\n",
        "df['y'] = y\n",
        "# String copy of the labels\n",
        "df['label'] = df['y'].astype(str)\n",
        "\n",
        "df.head()"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "1zd5UK8LZ-kA"
      },
      "source": [
        "# PCA"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Y8sGgOfsZKug"
      },
      "source": [
        "from sklearn.decomposition import PCA\n",
        "\n",
        "pca = PCA(n_components=2)\n",
        "pca_result = pca.fit_transform(df[feat_cols])\n",
        "df['pca_0'] = pca_result[:, 0]\n",
        "df['pca_1'] = pca_result[:, 1]\n",
        "print(f'Explained var: {pca.explained_variance_ratio_}')\n",
        "\n",
        "plt.figure(figsize=(16,10))\n",
        "sns.scatterplot(\n",
        "    x='pca_0', y='pca_1',\n",
        "    hue=\"y\",\n",
        "    palette=sns.color_palette(\"colorblind\", 10),\n",
        "    data=df,\n",
        "    legend=\"full\",\n",
        "    alpha=0.3\n",
        ")"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "7QIGcNlaahkJ"
      },
      "source": [
        "# t-SNE"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ttIRSlI-hjt2"
      },
      "source": [
        "Note: t-SNE takes considerable time to run. 70,000 examples are far too many to run on Colab, so we'll work with only the first 5,000 samples."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "ZbQ_fW1WZ4Pc"
      },
      "source": [
        "from sklearn.manifold import TSNE\n",
        "\n",
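        "# Copy the subset so that adding columns does not trigger a SettingWithCopyWarning\n",
        "df_sub = df.head(5000).copy()\n",
        "\n",
        "# Note: recent scikit-learn releases rename n_iter to max_iter\n",
        "tsne = TSNE(n_components=2, verbose=1, perplexity=50, n_iter=300)\n",
        "tsne_result = tsne.fit_transform(df_sub[feat_cols])\n",
        "df_sub['tsne_0'] = tsne_result[:, 0]\n",
        "df_sub['tsne_1'] = tsne_result[:, 1]\n",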
        "\n",
        "plt.figure(figsize=(16,10))\n",
        "sns.scatterplot(\n",
        "    x='tsne_0', y='tsne_1',\n",
        "    hue=\"y\",\n",
        "    palette=sns.color_palette(\"colorblind\", 10),\n",
        "    data=df_sub,\n",
        "    legend=\"full\",\n",
        "    alpha=0.3\n",
        ")"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "qg_HeP9Qa2my"
      },
      "source": [
        "# UMAP"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "L5qbeU_PavNr"
      },
      "source": [
        "from umap import UMAP\n",
        "\n",
        "umap_reducer = UMAP()\n",
        "umap_result = umap_reducer.fit_transform(df[feat_cols])\n",
        "\n",
        "df['umap_0'] = umap_result[:, 0]\n",
        "df['umap_1'] = umap_result[:, 1]\n",
        "\n",
        "plt.figure(figsize=(16,10))\n",
        "sns.scatterplot(\n",
        "    x='umap_0', y='umap_1',\n",
        "    hue=\"y\",\n",
        "    palette=sns.color_palette(\"colorblind\", 10),\n",
        "    data=df,\n",
        "    legend=\"full\",\n",
        "    alpha=0.3\n",
        ")"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "qRiSswQmbpJQ"
      },
      "source": [
        ""
      ],
      "execution_count": null,
      "outputs": []
    }
  ]
 }
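
The notebook relies on the convention that each reducer takes an array-like with one row per sample and one column per dimension, and returns one row per sample with `n_components` columns. A minimal sketch of that shape contract follows. It assumes random data as a stand-in for MNIST, and the names `toy_df`, `features`, and `embedding` are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical stand-in for the MNIST frame: 100 samples, 784 pixel columns.
rng = np.random.default_rng(0)
feat_cols = [f"pixel{i + 1}" for i in range(784)]
toy_df = pd.DataFrame(rng.random((100, 784)), columns=feat_cols)
toy_df["y"] = rng.integers(0, 10, size=100).astype(str)

# Selecting only the feature columns gives the (n_samples, n_dims) input.
features = toy_df[feat_cols]

# The reducer returns one row per sample and n_components columns.
embedding = PCA(n_components=2).fit_transform(features)

print(features.shape, embedding.shape)
```

The same input layout should carry over to `TSNE` and `UMAP`, since both are called with `fit_transform(df[feat_cols])` in the notebook.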