@moodoki
Last active June 30, 2021 06:11
High Dimensional Data Visualization
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "High Dimensional Data Visualization",
"private_outputs": true,
"provenance": [],
"collapsed_sections": [],
"authorship_tag": "ABX9TyOHlP8IO3UimnXV2lez9NSy",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/moodoki/851004bd4ac24e49e833d74da2162d1e/high-dimensional-data-visualization.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "code",
"metadata": {
"id": "0b50qqqhWu1-"
},
"source": [
"!pip install -q umap-learn"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "ZUTUnULCXnnE"
},
"source": [
"General imports"
]
},
{
"cell_type": "code",
"metadata": {
"id": "nTPilY4CWpI-"
},
"source": [
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"import pandas as pd"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "Z6M8VuODXtWQ"
},
"source": [
"Get some data to work with. We will use the MNIST digits dataset from OpenML.\n",
"\n",
"We will create a pandas DataFrame with a column `y` for the labels and one column for each pixel in the image. We'll also normalize the pixel values to [0, 1] by dividing by 255.\n",
"\n",
"The dimension reduction implementations expect an array-like with each row representing a sample and each column a dimension. After creating our DataFrame, we can get this array using `df[feat_cols]`."
]
},
{
"cell_type": "code",
"metadata": {
"id": "59Ct3FKGW0WF"
},
"source": [
"from sklearn.datasets import fetch_openml\n",
"\n",
"mnist = fetch_openml(\"mnist_784\")\n",
"x = mnist.data / 255.0  # scale pixel values to [0, 1]\n",
"y = mnist.target\n",
"\n",
"# One column per pixel: pixel1 ... pixel784\n",
"feat_cols = ['pixel' + str(i + 1) for i in range(x.shape[1])]\n",
"df = pd.DataFrame(x, columns=feat_cols)\n",
"df['y'] = y\n",
"df['label'] = df['y'].astype(str)  # labels as plain strings\n",
"\n",
"df.head()"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "1zd5UK8LZ-kA"
},
"source": [
"# PCA"
]
},
{
"cell_type": "code",
"metadata": {
"id": "Y8sGgOfsZKug"
},
"source": [
"from sklearn.decomposition import PCA\n",
"\n",
"# Project the 784 pixel features onto the first two principal components\n",
"pca = PCA(n_components=2)\n",
"pca_result = pca.fit_transform(df[feat_cols])\n",
"df['pca_0'] = pca_result[:, 0]\n",
"df['pca_1'] = pca_result[:, 1]\n",
"print(f'Explained var: {pca.explained_variance_ratio_}')\n",
"\n",
"plt.figure(figsize=(16,10))\n",
"sns.scatterplot(\n",
" x='pca_0', y='pca_1',\n",
" hue=\"y\",\n",
" palette=sns.color_palette(\"colorblind\", 10),\n",
" data=df,\n",
" legend=\"full\",\n",
" alpha=0.3\n",
")"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "7QIGcNlaahkJ"
},
"source": [
"# t-SNE"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ttIRSlI-hjt2"
},
"source": [
"Note: t-SNE takes considerable time to run. 70,000 examples are far too many to process on Colab, so we'll work with just the first 5,000 samples."
]
},
{
"cell_type": "code",
"metadata": {
"id": "ZbQ_fW1WZ4Pc"
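},
"source": [
"from sklearn.manifold import TSNE\n",
"\n",
"# Copy the subset so the new columns below are added to an independent DataFrame\n",
"df_sub = df.head(5000).copy()\n",
"\n",
"# Note: newer scikit-learn releases deprecate `n_iter` in favor of `max_iter`\n",
"tsne = TSNE(n_components=2, verbose=1, perplexity=50, n_iter=300)\n",
"tsne_result = tsne.fit_transform(df_sub[feat_cols])\n",
"df_sub['tsne_0'] = tsne_result[:, 0]\n",
"df_sub['tsne_1'] = tsne_result[:, 1]\n",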
"\n",
"plt.figure(figsize=(16,10))\n",
"sns.scatterplot(\n",
" x='tsne_0', y='tsne_1',\n",
" hue=\"y\",\n",
" palette=sns.color_palette(\"colorblind\", 10),\n",
" data=df_sub,\n",
" legend=\"full\",\n",
" alpha=0.3\n",
")"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "qg_HeP9Qa2my"
},
"source": [
"# UMAP"
]
},
{
"cell_type": "code",
"metadata": {
"id": "L5qbeU_PavNr"
},
"source": [
"from umap import UMAP\n",
"\n",
"# UMAP with default settings, which embed into two dimensions\n",
"umap_reducer = UMAP()\n",
"umap_result = umap_reducer.fit_transform(df[feat_cols])\n",
"\n",
"df['umap_0'] = umap_result[:, 0]\n",
"df['umap_1'] = umap_result[:, 1]\n",
"\n",
"plt.figure(figsize=(16,10))\n",
"sns.scatterplot(\n",
" x='umap_0', y='umap_1',\n",
" hue=\"y\",\n",
" palette=sns.color_palette(\"colorblind\", 10),\n",
" data=df,\n",
" legend=\"full\",\n",
" alpha=0.3\n",
")"
],
"execution_count": null,
"outputs": []
}
]
}
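
A side note on the t-SNE subsample: `df.head(5000)` takes the first 5000 rows in stored order. If the stored order of a dataset happens to be unrepresentative (for example, rows grouped by class), a seeded random draw with pandas' `DataFrame.sample` may be a safer default. The sketch below illustrates the difference on a small synthetic frame; `df_demo`, `df_head`, `df_rand`, and the row/feature counts are hypothetical stand-ins rather than part of the notebook, and the class-sorted labels are an assumption made only to show the effect.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the MNIST DataFrame: 1000 rows, 4 "pixel" columns.
rng = np.random.default_rng(0)
n_rows, n_feats = 1000, 4
feat_cols = ['pixel' + str(i + 1) for i in range(n_feats)]
df_demo = pd.DataFrame(rng.random((n_rows, n_feats)), columns=feat_cols)

# Assumption for illustration only: labels stored sorted by class,
# so the first rows all belong to a single class.
df_demo['y'] = np.repeat(np.arange(10), n_rows // 10).astype(str)

# head() keeps the first rows in stored order.
df_head = df_demo.head(100)

# sample() draws rows uniformly at random without replacement;
# random_state makes the draw repeatable, copy() gives an independent frame.
df_rand = df_demo.sample(n=100, random_state=0).copy()

print('classes in head():  ', df_head['y'].nunique())
print('classes in sample():', df_rand['y'].nunique())
```

In the notebook itself, the analogous change would be replacing `df.head(5000)` with `df.sample(n=5000, random_state=0)`; whether that matters depends on how the rows are ordered in the source data.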