High Dimensional Data Visualization
{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "High Dimensional Data Visualization",
      "private_outputs": true,
      "provenance": [],
      "collapsed_sections": [],
      "authorship_tag": "ABX9TyOHlP8IO3UimnXV2lez9NSy",
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/gist/moodoki/851004bd4ac24e49e833d74da2162d1e/high-dimensional-data-visualization.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "0b50qqqhWu1-"
      },
      "source": [
        "!pip install -q umap-learn"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ZUTUnULCXnnE"
      },
      "source": [
        "General imports"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "nTPilY4CWpI-"
      },
      "source": [
        "import matplotlib.pyplot as plt\n",
        "import seaborn as sns\n",
        "import pandas as pd"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Z6M8VuODXtWQ"
      },
      "source": [
        "Get some data to work with. We will use the MNIST digits dataset from OpenML.\n",
        "\n",
        "We will create a pandas DataFrame with a column `y` for the labels and one column for each pixel in the image. We'll also normalize the pixel values to [0, 1] by dividing by 255.\n",
        "\n",
        "The dimensionality reduction implementations expect an array-like with each row representing a sample and each column a dimension. After creating our DataFrame, we can get this array using `df[feat_cols]`."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "59Ct3FKGW0WF"
      },
      "source": [
        "from sklearn.datasets import fetch_openml\n",
        "\n",
        "# as_frame=False returns NumPy arrays, so the column names are set explicitly below\n",
        "mnist = fetch_openml(\"mnist_784\", as_frame=False)\n",
        "x = mnist.data / 255.0  # scale pixel values to [0, 1]\n",
        "y = mnist.target  # digit labels\n",
        "\n",
        "# One column per pixel: pixel1 ... pixel784\n",
        "feat_cols = [f'pixel{i + 1}' for i in range(x.shape[1])]\n",
        "df = pd.DataFrame(x, columns=feat_cols)\n",
        "df['y'] = y\n",
        "df['label'] = df['y'].astype(str)\n",
        "\n",
        "df.head()"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "1zd5UK8LZ-kA"
      },
      "source": [
        "# PCA"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Y8sGgOfsZKug"
      },
      "source": [
        "from sklearn.decomposition import PCA\n",
        "\n",
        "# Project the 784 pixel columns onto the first two principal components\n",
        "pca = PCA(n_components=2)\n",
        "pca_result = pca.fit_transform(df[feat_cols])\n",
        "df['pca_0'] = pca_result[:, 0]\n",
        "df['pca_1'] = pca_result[:, 1]\n",
        "print(f'Explained var: {pca.explained_variance_ratio_}')\n",
        "\n",
        "plt.figure(figsize=(16,10))\n",
        "sns.scatterplot(\n",
        "    x='pca_0', y='pca_1',\n",
        "    hue=\"y\",\n",
        "    palette=sns.color_palette(\"colorblind\", 10),\n",
        "    data=df,\n",
        "    legend=\"full\",\n",
        "    alpha=0.3\n",
        ")"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "7QIGcNlaahkJ"
      },
      "source": [
        "# t-SNE"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ttIRSlI-hjt2"
      },
      "source": [
        "Note: t-SNE takes considerable time to run. 70,000 examples are far too many to process on Colab, so we'll work with only the first 5,000 samples."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "ZbQ_fW1WZ4Pc"
      },
      "source": [
        "from sklearn.manifold import TSNE\n",
        "\n",
        "# Copy the subset so that adding columns does not trigger SettingWithCopyWarning\n",
        "df_sub = df.head(5000).copy()\n",
        "\n",
        "# Note: scikit-learn 1.5 and later rename n_iter to max_iter\n",
        "tsne = TSNE(n_components=2, verbose=1, perplexity=50, n_iter=300)\n",
        "tsne_result = tsne.fit_transform(df_sub[feat_cols])\n",
        "df_sub['tsne_0'] = tsne_result[:, 0]\n",
        "df_sub['tsne_1'] = tsne_result[:, 1]\n",
        "\n",
        "plt.figure(figsize=(16,10))\n",
        "sns.scatterplot(\n",
        "    x='tsne_0', y='tsne_1',\n",
        "    hue=\"y\",\n",
        "    palette=sns.color_palette(\"colorblind\", 10),\n",
        "    data=df_sub,\n",
        "    legend=\"full\",\n",
        "    alpha=0.3\n",
        ")"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "qg_HeP9Qa2my"
      },
      "source": [
        "# UMAP"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "L5qbeU_PavNr"
      },
      "source": [
        "from umap import UMAP\n",
        "\n",
        "# UMAP with default settings, fitted on the full dataset\n",
        "umap_reducer = UMAP()\n",
        "umap_result = umap_reducer.fit_transform(df[feat_cols])\n",
        "\n",
        "df['umap_0'] = umap_result[:, 0]\n",
        "df['umap_1'] = umap_result[:, 1]\n",
        "\n",
        "plt.figure(figsize=(16,10))\n",
        "sns.scatterplot(\n",
        "    x='umap_0', y='umap_1',\n",
        "    hue=\"y\",\n",
        "    palette=sns.color_palette(\"colorblind\", 10),\n",
        "    data=df,\n",
        "    legend=\"full\",\n",
        "    alpha=0.3\n",
        ")"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "qRiSswQmbpJQ"
      },
      "source": [
        ""
      ],
      "execution_count": null,
      "outputs": []
    }
  ]
}
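
The PCA cell in the notebook prints `pca.explained_variance_ratio_` without saying what the numbers mean. As a rough illustration only, the sketch below computes the same kind of ratio by hand for a tiny 2D dataset, using the closed-form eigenvalues of a 2x2 covariance matrix. The function name `explained_variance_ratio_2d` and the sample points are hypothetical and not part of the notebook, and this is not how scikit-learn computes the value internally; it only shows that each ratio is one covariance eigenvalue divided by the total variance.

```python
import math


def explained_variance_ratio_2d(points):
    """Return the two PCA explained-variance ratios for a list of (x, y) points.

    Hypothetical helper for illustration; works only for 2D data.
    """
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n

    # Entries of the sample covariance matrix [[sxx, sxy], [sxy, syy]].
    # The (n - 1) divisor cancels in the final ratio.
    sxx = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)

    # Closed-form eigenvalues of a symmetric 2x2 matrix, largest first.
    half_trace = (sxx + syy) / 2
    spread = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    eigenvalues = [half_trace + spread, half_trace - spread]

    # Each ratio is the share of the total variance along one principal axis.
    total_variance = sxx + syy
    return [ev / total_variance for ev in eigenvalues]


# Points lying exactly on the line y = x: all variance is along one direction.
points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
ratios = explained_variance_ratio_2d(points)
print(ratios)
```

For these collinear points the first ratio comes out close to 1 and the second close to 0. On MNIST, with 784 dimensions reduced to 2, the two printed ratios sum to well under 1, which is one way to read how much structure the 2D PCA scatter plot leaves out.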