Created
April 8, 2018 19:32
PCA analysis on Swiss Bank Notes
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"The file SwissBankNote_HW2_BINF5112.txt contains six variables measured on 200 old Swiss 1,000-franc bank notes. The first 100 notes are genuine and the next 100 are counterfeit. The six variables are: length of the bank note; height of the bank note, measured on the left; height of the bank note, measured on the right; distance of the inner frame to the lower border; distance of the inner frame to the upper border; and length of the diagonal." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Length</th>\n", | |
" <th>Left</th>\n", | |
" <th>Right</th>\n", | |
" <th>Bottom</th>\n", | |
" <th>Top</th>\n", | |
" <th>Diagonal</th>\n", | |
" <th>Y</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>214.8</td>\n", | |
" <td>131.0</td>\n", | |
" <td>131.1</td>\n", | |
" <td>9.0</td>\n", | |
" <td>9.7</td>\n", | |
" <td>141.0</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>214.6</td>\n", | |
" <td>129.7</td>\n", | |
" <td>129.7</td>\n", | |
" <td>8.1</td>\n", | |
" <td>9.5</td>\n", | |
" <td>141.7</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>214.8</td>\n", | |
" <td>129.7</td>\n", | |
" <td>129.7</td>\n", | |
" <td>8.7</td>\n", | |
" <td>9.6</td>\n", | |
" <td>142.2</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>214.8</td>\n", | |
" <td>129.7</td>\n", | |
" <td>129.6</td>\n", | |
" <td>7.5</td>\n", | |
" <td>10.4</td>\n", | |
" <td>142.0</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>5</th>\n", | |
" <td>215.0</td>\n", | |
" <td>129.6</td>\n", | |
" <td>129.7</td>\n", | |
" <td>10.4</td>\n", | |
" <td>7.7</td>\n", | |
" <td>141.8</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Length Left Right Bottom Top Diagonal Y\n", | |
"1 214.8 131.0 131.1 9.0 9.7 141.0 0\n", | |
"2 214.6 129.7 129.7 8.1 9.5 141.7 0\n", | |
"3 214.8 129.7 129.7 8.7 9.6 142.2 0\n", | |
"4 214.8 129.7 129.6 7.5 10.4 142.0 0\n", | |
"5 215.0 129.6 129.7 10.4 7.7 141.8 0" | |
] | |
}, | |
"execution_count": 1, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"import numpy as np\n", | |
"import pandas as pd\n", | |
"\n", | |
"# Read the whitespace-delimited data file\n", | |
"df = pd.read_csv(\"SwissBankNote_HW2_BINF5112.txt\", sep=r\"\\s+\")\n", | |
"\n", | |
"df.head()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Discuss which matrix (the sample covariance matrix or the sample correlation matrix) seems more appropriate for PCA of this example." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"A correlation matrix is usually more appropriate when the variables are measured in different units or on very different scales, because using it is equivalent to standardizing each variable before PCA. In this case all six variables share the same unit, so it may be acceptable to use the covariance matrix, provided their variances are comparable." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Length</th>\n", | |
" <th>Left</th>\n", | |
" <th>Right</th>\n", | |
" <th>Bottom</th>\n", | |
" <th>Top</th>\n", | |
" <th>Diagonal</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>Length</th>\n", | |
" <td>0.141793</td>\n", | |
" <td>0.031443</td>\n", | |
" <td>0.023091</td>\n", | |
" <td>-0.103246</td>\n", | |
" <td>-0.177737</td>\n", | |
" <td>0.084306</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Left</th>\n", | |
" <td>0.031443</td>\n", | |
" <td>0.130339</td>\n", | |
" <td>0.108427</td>\n", | |
" <td>0.215803</td>\n", | |
" <td>-0.024207</td>\n", | |
" <td>-0.209342</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Right</th>\n", | |
" <td>0.023091</td>\n", | |
" <td>0.108427</td>\n", | |
" <td>0.163274</td>\n", | |
" <td>0.284132</td>\n", | |
" <td>0.067082</td>\n", | |
" <td>-0.240470</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Bottom</th>\n", | |
" <td>-0.103246</td>\n", | |
" <td>0.215803</td>\n", | |
" <td>0.284132</td>\n", | |
" <td>2.086878</td>\n", | |
" <td>0.117303</td>\n", | |
" <td>-1.036996</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Top</th>\n", | |
" <td>-0.177737</td>\n", | |
" <td>-0.024207</td>\n", | |
" <td>0.067082</td>\n", | |
" <td>0.117303</td>\n", | |
" <td>30.915678</td>\n", | |
" <td>-0.100771</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Diagonal</th>\n", | |
" <td>0.084306</td>\n", | |
" <td>-0.209342</td>\n", | |
" <td>-0.240470</td>\n", | |
" <td>-1.036996</td>\n", | |
" <td>-0.100771</td>\n", | |
" <td>1.327716</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Length Left Right Bottom Top Diagonal\n", | |
"Length 0.141793 0.031443 0.023091 -0.103246 -0.177737 0.084306\n", | |
"Left 0.031443 0.130339 0.108427 0.215803 -0.024207 -0.209342\n", | |
"Right 0.023091 0.108427 0.163274 0.284132 0.067082 -0.240470\n", | |
"Bottom -0.103246 0.215803 0.284132 2.086878 0.117303 -1.036996\n", | |
"Top -0.177737 -0.024207 0.067082 0.117303 30.915678 -0.100771\n", | |
"Diagonal 0.084306 -0.209342 -0.240470 -1.036996 -0.100771 1.327716" | |
] | |
}, | |
"execution_count": 2, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"features = df.iloc[:, 0:6]\n", | |
"\n", | |
"# Display covariance matrix\n", | |
"features.cov()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAP4AAAECCAYAAADesWqHAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4xLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvAOZPmwAACmtJREFUeJzt3d+LXPUdxvHnyTbrhkQrRStipClYBBGqZQnFQKHBSvyB7aWCXgl704LSgtRL/wHxpjdBpS1aRVChWLUGNNiAv5IYrRotRSwNCusPRJOqIbtPL3ZS8mPtnk3mO+fo5/2CJbvJMPuQ5L1nZnZnjpMIQC1r+h4AYPIIHyiI8IGCCB8oiPCBgggfKGiw4dveZvtt2/+0/dsB7LnP9rzt1/vecpTtC20/a3u/7Tds3zqATTO2X7L96mjTnX1vOsr2lO1XbD/e95ajbL9r+++299nePbHPO8Tv49uekvQPST+TdEDSy5JuTPJmj5t+IumgpD8mubSvHceyfb6k85PstX2mpD2SftHz35MlrU9y0PZaSbsk3Zrkhb42HWX715JmJZ2V5Lq+90hL4UuaTfLhJD/vUI/4myX9M8k7SQ5LekjSz/sclOQ5SR/3ueFESd5Psnf0/meS9ku6oOdNSXJw9OHa0VvvRxfbGyVdK+mevrcMwVDDv0DSv4/5+IB6/g89dLY3Sbpc0ov9LvnfTep9kuYl7UjS+yZJd0u6XdJi30NOEElP295je25Sn3So4XuZ3+v9qDFUtjdIekTSbUk+7XtPkoUkl0naKGmz7V7vGtm+TtJ8kj197vgKW5L8SNLVkn45ukvZ3FDDPyDpwmM+3ijpvZ62DNrofvQjkh5I8mjfe46V5BNJOyVt63nKFknXj+5PPyRpq+37+520JMl7o1/nJT2mpbu5zQ01/Jcl/cD2921PS7pB0p973jQ4owfS7pW0P8ldfe+RJNvn2j579P46SVdKeqvPTUnuSLIxySYt/V96JslNfW6SJNvrRw/KyvZ6SVdJmsh3jQYZfpIjkn4l6a9aesDq4SRv9LnJ9oOSnpd0se0Dtm/pc8/IFkk3a+kItm/0dk3Pm86X9Kzt17T0BXxHksF8+2xgzpO0y/arkl6S9JckT03iEw/y23kA2hrkER9AW4QPFET4QEGEDxRE+EBBgw5/kj/C2NUQN0nD3MWmbvrYNOjwJQ3uH0nD3CQNcxebuiF8AO01+QGeac9k3ZoNp309h/OFpj0zhkXjM8RN0jB3jXXTmP6fHtaXmtYZY7kuSZKXez7Z6ozz7+nzxYM6nC9WHPWtsXy2E6xbs0E/3nB9i6s+dYtDezbmgA3xpzmH+u83NdX3guO88J9uPx3NTX2gIMIHCiJ8oCDCBwoifKAgwgcKInygIMIHCiJ8oCDCBwoifKAgwgcKInygoE7hD+1c9QBOz4rhj85V/zstndTvEkk32r6k9TAA7XQ54g/uXPUATk+X8DlXPfAN0+UVeDqdq370SqFzkjTj9ac5C0BLXY74nc5Vn2R7ktkks0N77TcAx+sSPueqB75hVrypn+SI7aPnqp+SdF/f56oHcHo6vcpukickPdF4C4AJ4Sf3gIIIHyiI8IGCCB8oiPCBgggfKIjwgYIIHyiI8IGCCB8oiPCBgggfKKjTk3ROyeJis6s+JWv4GteVvdxrr/QrCwt9T1je0HblpNfIWRY1AAURPlAQ4QMFET5QEOEDBRE+UBDhAwURPlAQ4QMFET5QEOEDBRE+UBDhAwURPlAQ4QMFrRi+7ftsz9t+fRKDALTX5Yj/e0nbGu8AMEErhp/kOUkfT2ALgAnhPj5Q0Nhec8/2nKQ5SZrx+nFdLYAGxnbET7I9yWyS2WnPjOtqATTATX2goC7fzntQ0vOSLrZ9wPYt7WcBaGnF+/hJbpzEEACTw019oCDCBwoifKAgwgcKInygIMIHCiJ8oCDCBwoifKAgwgcKInygIM
IHCiJ8oCDCBwoifKAgwgcKInygIMIHCiJ8oCDCBwoifKAgwgcKInygIMIHCiJ8oCDCBwoifKAgwgcK6nK23AttP2t7v+03bN86iWEA2lnxbLmSjkj6TZK9ts+UtMf2jiRvNt4GoJEVj/hJ3k+yd/T+Z5L2S7qg9TAA7azqPr7tTZIul/RiizEAJqPLTX1Jku0Nkh6RdFuST5f58zlJc5I04/VjGwhg/Dod8W2v1VL0DyR5dLnLJNmeZDbJ7LRnxrkRwJh1eVTfku6VtD/JXe0nAWityxF/i6SbJW21vW/0dk3jXQAaWvE+fpJdkjyBLQAmhJ/cAwoifKAgwgcKInygIMIHCiJ8oCDCBwoifKAgwgcKInygIMIHCiJ8oCDCBwrq/Ao8q5Y0u+pTsfSyAujC53yn7wknyQcf9T1hWYuHDvU94Tjp2B1HfKAgwgcKInygIMIHCiJ8oCDCBwoifKAgwgcKInygIMIHCiJ8oCDCBwoifKAgwgcK6nKa7BnbL9l+1fYbtu+cxDAA7XR5Pv6XkrYmOWh7raRdtp9M8kLjbQAa6XKa7Eg6OPpw7ehtWK+yAWBVOt3Htz1le5+keUk7krzYdhaAljqFn2QhyWWSNkrabPvSEy9je872btu7D+eLce8EMEarelQ/ySeSdkratsyfbU8ym2R22jNjmgeghS6P6p9r++zR++skXSnprdbDALTT5VH98yX9wfaUlr5QPJzk8bazALTU5VH91yRdPoEtACaEn9wDCiJ8oCDCBwoifKAgwgcKInygIMIHCiJ8oCDCBwoifKAgwgcKInygIMIHCurytNzVS6TFxSZXfaqysND3hK+NfPBR3xNO8uTbf+t7wrKuvuiKviccx593O5ZzxAcKInygIMIHCiJ8oCDCBwoifKAgwgcKInygIMIHCiJ8oCDCBwoifKAgwgcKInygIMIHCuocvu0p26/Y5hTZwNfcao74t0ra32oIgMnpFL7tjZKulXRP2zkAJqHrEf9uSbdL+srX07I9Z3u37d2H9eVYxgFoY8XwbV8naT7Jnv93uSTbk8wmmZ3WGWMbCGD8uhzxt0i63va7kh6StNX2/U1XAWhqxfCT3JFkY5JNkm6Q9EySm5ovA9AM38cHClrV6+on2SlpZ5MlACaGIz5QEOEDBRE+UBDhAwURPlAQ4QMFET5QEOEDBRE+UBDhAwURPlAQ4QMFET5Q0KqendeZLU1NNbnqU7aw0PeCr43FQ4f6nnCSqy+6ou8Jy1rz7bP6nnC8w92O5RzxgYIIHyiI8IGCCB8oiPCBgggfKIjwgYIIHyiI8IGCCB8oiPCBgggfKIjwgYIIHyio09Nybb8r6TNJC5KOJJltOQpAW6t5Pv5Pk3zYbAmAieGmPlBQ1/Aj6Wnbe2zPtRwEoL2uN/W3JHnP9ncl7bD9VpLnjr3A6AvCnCTNeP2YZwIYp05H/CTvjX6dl/SYpM3LXGZ7ktkks9OeGe9KAGO1Yvi219s+8+j7kq6S9HrrYQDa6XJT/zxJj9k+evk/JXmq6SoATa0YfpJ3JP1wAlsATAjfzgMKInygIMIHCiJ8oCDCBwoifKAgwgcKInygIMIHCiJ8oCDCBwoifKAgwgcKcpLxX6n9gaR/jeGqzpE0tBf4HOImaZi72NTNODd9L8m5K12oSfjjYnv30F7Ke4ibpGHuYlM3fWzipj5QEOEDBQ09/O19D1jGEDdJw9zFpm4mvmnQ9/EBtDH0Iz6ABggfKIjwgYIIHyiI8IGC/gslsIPze/KYBAAAAABJRU5ErkJggg==\n", | |
"text/plain": [ | |
"<matplotlib.figure.Figure at 0x7f20c7ca22e8>" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"import matplotlib.pyplot as plt\n", | |
"\n", | |
"# Display covariance heatmap\n", | |
"plt.matshow(features.cov())\n", | |
"plt.show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"In this display, we can easily see that one entry dominates the covariance matrix: the variance of \"Top\" (about 30.9) is far larger than any other entry, which suggests an outlier in that variable." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Length</th>\n", | |
" <th>Left</th>\n", | |
" <th>Right</th>\n", | |
" <th>Bottom</th>\n", | |
" <th>Top</th>\n", | |
" <th>Diagonal</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>Length</th>\n", | |
" <td>1.000000</td>\n", | |
" <td>0.231293</td>\n", | |
" <td>0.151763</td>\n", | |
" <td>-0.189801</td>\n", | |
" <td>-0.084891</td>\n", | |
" <td>0.194301</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Left</th>\n", | |
" <td>0.231293</td>\n", | |
" <td>1.000000</td>\n", | |
" <td>0.743263</td>\n", | |
" <td>0.413781</td>\n", | |
" <td>-0.012059</td>\n", | |
" <td>-0.503229</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Right</th>\n", | |
" <td>0.151763</td>\n", | |
" <td>0.743263</td>\n", | |
" <td>1.000000</td>\n", | |
" <td>0.486758</td>\n", | |
" <td>0.029858</td>\n", | |
" <td>-0.516476</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Bottom</th>\n", | |
" <td>-0.189801</td>\n", | |
" <td>0.413781</td>\n", | |
" <td>0.486758</td>\n", | |
" <td>1.000000</td>\n", | |
" <td>0.014604</td>\n", | |
" <td>-0.622983</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Top</th>\n", | |
" <td>-0.084891</td>\n", | |
" <td>-0.012059</td>\n", | |
" <td>0.029858</td>\n", | |
" <td>0.014604</td>\n", | |
" <td>1.000000</td>\n", | |
" <td>-0.015729</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Diagonal</th>\n", | |
" <td>0.194301</td>\n", | |
" <td>-0.503229</td>\n", | |
" <td>-0.516476</td>\n", | |
" <td>-0.622983</td>\n", | |
" <td>-0.015729</td>\n", | |
" <td>1.000000</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Length Left Right Bottom Top Diagonal\n", | |
"Length 1.000000 0.231293 0.151763 -0.189801 -0.084891 0.194301\n", | |
"Left 0.231293 1.000000 0.743263 0.413781 -0.012059 -0.503229\n", | |
"Right 0.151763 0.743263 1.000000 0.486758 0.029858 -0.516476\n", | |
"Bottom -0.189801 0.413781 0.486758 1.000000 0.014604 -0.622983\n", | |
"Top -0.084891 -0.012059 0.029858 0.014604 1.000000 -0.015729\n", | |
"Diagonal 0.194301 -0.503229 -0.516476 -0.622983 -0.015729 1.000000" | |
] | |
}, | |
"execution_count": 5, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Display correlation matrix\n", | |
"features.corr()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAP4AAAECCAYAAADesWqHAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4xLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvAOZPmwAACyNJREFUeJzt3V+IpQUdxvHncRz/NGPslmuIu7QFYYhixrA3C0JSYSVWdwp1JaxhgVYgdeltUHiT2FJSkbUEthH90yXdREhtVtd/rUWY0bbRrG6Lzmrb7szTxZyNXffYvKPve963ft8PDDvjHt55GPc77zlz5pzjJAJQyxl9DwAweYQPFET4QEGEDxRE+EBBhA8UNNjwbV9t+/e2/2j7SwPYc5ftBdtP973lBNubbD9ge5/tZ2zfPIBN59h+1PYTo0239b3pBNtTth+3/dO+t5xg+3nbT9nea3t+Yp93iPfj256S9AdJH5K0X9JvJV2f5Hc9brpS0qKk7ya5tK8dJ7N9oaQLkzxm+zxJeyR9ouevkyXNJFm0PS3pIUk3J3m4r00n2P6CpDlJb01yTd97pJXwJc0leWGSn3eoZ/wtkv6Y5Lkk/5K0Q9LH+xyU5EFJh/rc8FpJ/pbksdH7L0vaJ+minjclyeLow+nRW+9nF9sbJX1M0jf73jIEQw3/Ikl/Oenj/er5H/TQ2d4s6QpJj/S75D9XqfdKWpC0K0nvmyTdLulWSct9D3mNSLrP9h7b2yb1SYcavsf8t97PGkNle1bSPZJuSfJS33uSLCV5n6SNkrbY7vWmke1rJC0k2dPnjtexNcn7JX1E0mdHNyk7N9Tw90vadNLHGyUd6GnLoI1uR98j6e4kP+p7z8mSHJa0W9LVPU/ZKuna0e3pHZKusv29fietSHJg9OeCpJ1auZnbuaGG/1tJ77H9LttnSbpO0k963jQ4ox+kfUvSviRf63uPJNneYHvd6P1zJX1Q0rN9bkry5SQbk2zWyr+l+5N8qs9NkmR7ZvRDWdmekfRhSRO512iQ4Sc5Lulzku7Vyg+sfpjkmT432f6BpN9Iutj2fts39LlnZKukT2vlDLZ39PbRnjddKOkB209q5Rv4riSDuftsYN4h6SHbT0h6VNLPkvxyEp94kHfnAejWIM/4ALpF+EBBhA8URPhAQYQPFDTo8Cf5K4xNDXGTNMxdbGqmj02DDl/S4P4naZibpGHuYlMzhA+ge538As/5b5vK5k3Tb/o4B19c0oa3T7WwSHrq8IZWjrO0uKip2dlWjrVywHYOs3zkiM6YmWnlWNOvtHIYHTu6qOmz2/laLbfzz0DH/3lEZ57TztdJkpbObeEYi0c0NdvOpuOHDmlp8ci4B7md4sxWPttrbN40rUfv3bT6BSfo3Ttv7HvCWFOLw7vSdcH80B65Kh1dN7yvkyT945Jh/ebrga/e3uhyw/xqAugU4QMFET5QEOEDBRE+UBDhAwURPlAQ4QMFET5QEOEDBRE+UBDhAwURPlBQo/CH9lr1AN6cVcMfvVb917Xyon6XSLre9iVdDwPQnSZn/MG9Vj2AN6dJ+LxWPfB/pkn4jV6r3vY22/O25w++2NLzSQHoRJPwG71WfZLtSeaSzLX1PHkAutEkfF6rHvg/s+qTbSY5bvvEa9VPSbqr79eqB/DmNHqW3SQ/l/TzjrcAmBB+cw8oiPCBgggfKIjwgYIIHyiI8IGCCB8oiPCBgggfKIjwgYIIHyiI8IGCGj1IZ62eOrxB7955YxeHfsOe++Q3+p4w1p2Hh/dkRndcfGXfE07jX6/ve8JY7/3Kn/qecIpDB482uhxnfKAgwgcKInygIMIHCiJ8oCDCBwoifKAgwgcKInygIMIHCiJ8oCDCBwoifKAgwgcKInygoFXDt32X7QXbT09iEIDuNTnjf1vS1R3vADBBq4af5EFJhyawBcCEcBsfKKi18G1vsz1ve35pcbGtwwLoQG
vhJ9meZC7J3NTsbFuHBdABruoDBTW5O+8Hkn4j6WLb+23f0P0sAF1a9Xn1k1w/iSEAJoer+kBBhA8URPhAQYQPFET4QEGEDxRE+EBBhA8URPhAQYQPFET4QEGEDxRE+EBBqz467w1ZkqYWh/U95c7DF/U9YazPrPtr3xNOs+Mtr/Y94TQvnLW+7wljZXm57wmnSrOLDatOABNB+EBBhA8URPhAQYQPFET4QEGEDxRE+EBBhA8URPhAQYQPFET4QEGEDxRE+EBBTV4td5PtB2zvs/2M7ZsnMQxAd5o8Hv+4pC8mecz2eZL22N6V5HcdbwPQkVXP+En+luSx0fsvS9onaZjPagGgkTXdxre9WdIVkh7pYgyAyWgcvu1ZSfdIuiXJS2P+fpvtedvzy0eOtLkRQMsahW97WivR353kR+Muk2R7krkkc2fMzLS5EUDLmvxU35K+JWlfkq91PwlA15qc8bdK+rSkq2zvHb19tONdADq06t15SR6S5AlsATAh/OYeUBDhAwURPlAQ4QMFET5QEOEDBRE+UBDhAwURPlAQ4QMFET5QEOEDBRE+UFCTJ9tcs+lXpAvml7s49Bt2x8VX9j1hrB1vebXvCafZfemP+55wmst23dT3hLGW/r7Q94RTJMcbXY4zPlAQ4QMFET5QEOEDBRE+UBDhAwURPlAQ4QMFET5QEOEDBRE+UBDhAwURPlAQ4QMFNXmZ7HNsP2r7CdvP2L5tEsMAdKfJ4/GPSroqyaLtaUkP2f5Fkoc73gagI01eJjuSFkcfTo/e0uUoAN1qdBvf9pTtvZIWJO1K8ki3swB0qVH4SZaSvE/SRklbbF/62svY3mZ73vb8saOLpx8EwGCs6af6SQ5L2i3p6jF/tz3JXJK56bNnW5oHoAtNfqq/wfa60fvnSvqgpGe7HgagO01+qn+hpO/YntLKN4ofJvlpt7MAdKnJT/WflHTFBLYAmBB+cw8oiPCBgggfKIjwgYIIHyiI8IGCCB8oiPCBgggfKIjwgYIIHyiI8IGCCB8oqMnDctdseUo6um5Y31P86/V9TxjrhbOGt+uyXTf1PeE0T33+jr4njHX5sWF9rY7d3ew5cIdVJ4CJIHygIMIHCiJ8oCDCBwoifKAgwgcKInygIMIHCiJ8oCDCBwoifKAgwgcKInygIMIHCmocvu0p24/b5iWygf9xaznj3yxpX1dDAExOo/Btb5T0MUnf7HYOgEloesa/XdKtkpZf7wK2t9metz1//J9HWhkHoBurhm/7GkkLSfb8t8sl2Z5kLsncmefMtDYQQPuanPG3SrrW9vOSdki6yvb3Ol0FoFOrhp/ky0k2Jtks6TpJ9yf5VOfLAHSG+/GBgtb0vPpJdkva3ckSABPDGR8oiPCBgggfKIjwgYIIHyiI8IGCCB8oiPCBgggfKIjwgYIIHyiI8IGCCB8oaE2Pzmtq6VzpH5eki0O/Ye/9yp/6njBWll/32cx6s/T3hb4nnObyYzf1PWGsJ269o+8Jp9jyq4ONLscZHyiI8IGCCB8oiPCBgggfKIjwgYIIHyiI8IGCCB8oiPCBgggfKIjwgYIIHyiI8IGCGj0s1/bzkl6WtCTpeJK5LkcB6NZaHo//gSQvdLYEwMRwVR8oqGn4kXSf7T22t3U5CED3ml7V35rkgO0LJO2y/WySB0++wOgbwjZJmlq/vuWZANrU6Iyf5MDozwVJOyVtGXOZ7UnmksxNzc60uxJAq1YN3/aM7fNOvC/pw5Ke7noYgO40uar/Dkk7bZ+4/PeT/LLTVQA6tWr4SZ6TdPkEtgCYEO7OAwoifKAgwgcKInygIMIHCiJ8oCDCBwoifKAgwgcKInygIMIHCiJ8oCDCBwpykvYPah+U9OcWDnW+pKE9wecQN0nD3MWmZtrc9M4kG1a7UCfht8X2/NCeynuIm6Rh7mJTM31s4qo+UBDhAwUNPfztfQ8YY4ibpGHuYlMzE9806Nv4ALox9DM+gA4QPlAQ4QMFET5QEO
EDBf0b6RKxXPb9DeMAAAAASUVORK5CYII=\n", | |
"text/plain": [ | |
"<matplotlib.figure.Figure at 0x7f20c7c2bdd8>" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"# Display correlation heatmap\n", | |
"plt.matshow(features.corr())\n", | |
"plt.show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"In this case, the correlation values are spread much more evenly between their minimum and maximum, and no single variable dominates. Hence, for this data the correlation matrix is probably the better choice for PCA." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Consulting the sample correlation matrix, discuss the usefulness of PCA for dimension reduction." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"PCA re-expresses the data in a new set of coordinate axes, along the eigenvectors of the covariance or correlation matrix. Since these axes are orthogonal to each other, we can reduce dimensions simply by dropping the axes that contribute the least to the total variance.\n", | |
"\n", | |
"In the correlation matrix and heat map, we can see that some variable pairs are much more strongly correlated than others. For example, \"Left\" and \"Right\" are strongly correlated (0.74), and \"Diagonal\" is moderately negatively correlated with \"Left\", \"Right\", and \"Bottom\" (-0.50 to -0.62). Correlated variables carry partly redundant information, which is what allows PCA to summarize them with fewer components. \"Top\", by contrast, is nearly uncorrelated with every other variable." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Carry out a PCA with 200 bank notes. Choose the number of PCs you want to keep based on the cumulative proportion (of your choice) or the scree diagram. Is PCA effective for dimensional reduction?" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"PCA can be very effective for dimensional reduction, but it is important to keep in mind data normalization.\n", | |
"\n", | |
"Here we define a function for calculating PCA:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import sys\n", | |
"\n", | |
"# Print arrays in full; recent NumPy releases reject np.nan as a threshold\n", | |
"np.set_printoptions(threshold=sys.maxsize)\n", | |
"\n", | |
"def PCA(data, matrix_type, threshold):\n", | |
"    \"\"\"Function to compute PCA.\n", | |
"    Supports both covariance and correlation matrices.\"\"\"\n", | |
"    # Center each column on its mean\n", | |
"    data = data - np.mean(data, axis=0)\n", | |
"    if matrix_type == \"covariance\":\n", | |
"        # Compute covariance matrix\n", | |
"        mat = np.cov(data, rowvar=False)\n", | |
"    elif matrix_type == \"correlation\":\n", | |
"        # Compute correlation matrix\n", | |
"        mat = np.corrcoef(data, rowvar=False)\n", | |
"    else:\n", | |
"        raise ValueError(\"matrix_type must be 'covariance' or 'correlation'\")\n", | |
"    evals, evecs = np.linalg.eigh(mat)\n", | |
"    # Sort by decreasing eigenvalue\n", | |
"    idx = np.argsort(evals)[::-1]\n", | |
"    evals = evals[idx]\n", | |
"    evecs = evecs[:, idx]\n", | |
"    # Cumulative fraction of the variance retained by the first k components\n", | |
"    variance_retained = np.cumsum(evals) / np.sum(evals)\n", | |
"    print(\"Cumulative variance retained:\", variance_retained)\n", | |
"    print(\"_\" * 70)\n", | |
"    print(\"All components:\")\n", | |
"    print(\"eigenvals:\", evals)\n", | |
"    print(\"eigenvecs:\")\n", | |
"    print(evecs)\n", | |
"    # Keep the fewest components whose cumulative variance reaches the threshold\n", | |
"    index = np.argmax(variance_retained >= threshold)\n", | |
"    evals = evals[:index + 1]\n", | |
"    evecs = evecs[:, :index + 1]\n", | |
"    # Project the centered data onto the retained components (not returned)\n", | |
"    reduced_data = np.dot(evecs.T, data.T).T\n", | |
"    print(\"_\" * 70)\n", | |
"    print(\"Top components with at least {0:.0f}% of the variance:\".format(threshold * 100))\n", | |
"    print(\"eigenvals:\", evals)\n", | |
"    print(\"eigenvecs:\")\n", | |
"    print(evecs)\n", | |
"    # Transformation matrix: each retained eigenvector scaled by its eigenvalue\n", | |
"    coefficients = np.dot(evecs, np.diag(evals))\n", | |
"    return coefficients" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Cumulative variance retained: [ 0.44112039 0.65656584 0.82062644 0.89786062 0.95929831 1. ]\n", | |
"______________________________________________________________________\n", | |
"All components:\n", | |
"eigenvals: [ 2.64672233 1.29267268 0.98436364 0.46340507 0.36862616 0.24421012]\n", | |
"eigenvecs:\n", | |
"[[-0.00608924 0.79765636 0.16172325 0.54422009 0.1865996 0.080981 ]\n", | |
" [-0.50728436 0.31018807 0.08184041 -0.3802684 0.0009923 -0.70366401]\n", | |
" [-0.5244321 0.21444822 0.10525394 -0.33628115 -0.33334729 0.66610743]\n", | |
" [-0.46980657 -0.29743752 -0.12458421 0.65338254 -0.47176944 -0.16067461]\n", | |
" [-0.01439042 -0.25131735 0.9639207 0.07506323 0.03200387 -0.02882109]\n", | |
" [ 0.49666001 0.2644053 0.10679422 -0.11658551 -0.79402049 -0.16719134]]\n", | |
"______________________________________________________________________\n", | |
"Top components with at least 90% of the variance:\n", | |
"eigenvals: [ 2.64672233 1.29267268 0.98436364 0.46340507 0.36862616]\n", | |
"eigenvecs:\n", | |
"[[-0.00608924 0.79765636 0.16172325 0.54422009 0.1865996 ]\n", | |
" [-0.50728436 0.31018807 0.08184041 -0.3802684 0.0009923 ]\n", | |
" [-0.5244321 0.21444822 0.10525394 -0.33628115 -0.33334729]\n", | |
" [-0.46980657 -0.29743752 -0.12458421 0.65338254 -0.47176944]\n", | |
" [-0.01439042 -0.25131735 0.9639207 0.07506323 0.03200387]\n", | |
" [ 0.49666001 0.2644053 0.10679422 -0.11658551 -0.79402049]]\n" | |
] | |
} | |
], | |
"source": [ | |
"# Calculate PCA with a cumulative proportion of 0.90\n", | |
"pca_coeff = PCA(features, \"correlation\", .9)\n", | |
"# Covariance-matrix variant, for comparison:\n", | |
"#pca_coeff = PCA(features, \"covariance\", .9)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Interpret PC1 and PC2" | |
] | |
}, | |
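{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Each principal component is a linear combination of the original features. PC1 is the direction of greatest variance (about 44% of the total) and PC2 is the direction of second-greatest variance (about 22%). In the eigenvectors printed above, PC1 weights \"Left\", \"Right\", and \"Bottom\" (about -0.5 each) against \"Diagonal\" (about +0.5), so it contrasts the two heights and the lower margin with the diagonal length. PC2 is dominated by \"Length\" (about 0.80)." | |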
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"[[ -1.61165282e-02 1.03110859e+00 1.59194486e-01 2.52194350e-01\n", | |
" 6.87854928e-02]\n", | |
" [ -1.34264085e+00 4.00971638e-01 8.05607256e-02 -1.76218306e-01\n", | |
" 3.65786904e-04]\n", | |
" [ -1.38802616e+00 2.77211351e-01 1.03608155e-01 -1.55834392e-01\n", | |
" -1.22880533e-01]\n", | |
" [ -1.24344754e+00 -3.84489361e-01 -1.22636168e-01 3.02780783e-01\n", | |
" -1.73906558e-01]\n", | |
" [ -3.80874363e-02 -3.24871072e-01 9.48848490e-01 3.47846835e-02\n", | |
" 1.17974626e-02]\n", | |
" [ 1.31452115e+00 3.41789511e-01 1.05124347e-01 -5.40263144e-02\n", | |
" -2.92696726e-01]]\n" | |
] | |
}, | |
{ | |
"data": { | |
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAQwAAAE5CAYAAABlOx6NAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4xLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvAOZPmwAAEVhJREFUeJzt3XusZWV9xvHvM2dGBhgEp4IXQMYLXogFTEZqQ1sjYoNXUrUGUq1W6zTeCo3WgKHxUttq26j9g9SOimJFiSi0SixIVcBbhQGR2+ANuUyhAUW5jBWZ4dc/9qI5jsM5754566y9z3w/yc5Ze5211/vMyZlnv2udvddOVSFJLZYNHUDS9LAwJDWzMCQ1szAkNbMwJDWzMCQ1W9KFkeSYJN9N8oMkJw2dZz5JTktyW5Krh87SKsmBSb6SZGOSa5KcMHSmuSRZmeSSJN/p8r5z6Eytkswk+XaSc4fKsGQLI8kMcCrwXOAQ4Pgkhwybal4fA44ZOsSYtgBvrqqnAM8A3jDhP+d7gaOq6jDgcOCYJM8YOFOrE4CNQwZYsoUBHAH8oKqur6pfAmcCxw6caU5VdTFwx9A5xlFVt1bV5d3y3Yx+ofcfNtWDq5F7ursrutvEv3oxyQHA84EPD5ljKRfG/sDNs+5vYoJ/kZeCJGuApwHfGjbJ3Lqp/RXAbcAFVTXReTsfAN4K3D9kiKVcGNnOuol/JplWSVYBnwVOrKq7hs4zl6raWlWHAwcARyR56tCZ5pLkBcBtVXXZ0FmWcmFsAg6cdf8A4JaBsixpSVYwKoszqursofO0qqqfARcy+eeNjgRelOQGRofWRyX5xBBBlnJhXAocnOSxSR4CHAd8buBMS06SAB8BNlbV+4bOM58k+ybZp1veHTgauG7YVHOrqpOr6oCqWsPo9/jLVfXyIbIs2cKoqi3AG4HzGZ2I+3RVXTNsqrkl+RTwTeBJSTYlec3QmRocCbyC0bPeFd3teUOHmsOjgK8kuZLRk8oFVTXYnymnTXx7u6RWS3aGIWnhWRiSmlkYkppZGJKaWRiSmu0ShZFk3dAZxjVtmactL0xf5knIu0sUBjD4D3oHTFvmacsL05d58Ly7SmFIWgAT9cKtmVV71vLVqxd8v1vv2czMqj0XfL8AB+19ey/7vfOOrey9embB93vjPQ9f8H0CbL17MzN79fMzzrJ+fke33rWZmYf2k7m2bO+9jzunz9/jLT/5KVvv2Txv6OW9jL6Dlq9ezaPfcuLQMcZy6gs/NHSEsbz2q68aOsLYlq+8b+gIY7vvZyuHjjCW//mbf2razkMSSc0sDEnNLAxJzSwMSc0sDEnNLAxJzSwMSc0sDEnNLAxJzSwMSc0sDEnNLAxJzSwMSc0sDEnNLAxJzSwMSc0sDEnNei2MJMck+W6SHyQ5qc+xJPWvt8JIMgOcCjwXOAQ4PskhfY0nqX99zjCOAH5QVddX1S+BM4FjexxPUs/6LIz9gZtn3d/UrfsVSdYl2ZBkw9Z7NvcYR9LO6rMwtnfJ8l+7XnxVra+qtVW1tq9LqEtaGH0WxibgwFn3DwBu6XE8ST3rszAuBQ5O8tgkDwGOAz7X43iSetbbBxlV1ZYkbwTOB2aA06rqmr7Gk9S/Xj/5rKq+AHyhzzEkLR5f6SmpmYUhqZmFIamZhSGpmYUhqZmFIamZhSGpmYUhqZmFIamZhSGpmYUhqZmFIamZhSGpmYUhqZmFIamZhSGpmYUhqVmvV9wa126bfs7Bf3n50DHGcvLBLx46wljq3ul7jlh20/RdTT773D90hPHU9i7y/+um77dH0mAsDEnNLAxJzSwMSc0sDEnNLAxJzSwMSc0sDEnNLAxJzSwMSc0sDEnNLAxJzSwMSc0sDEnNLAxJzSwMSc0sDEnNeiuMJKcluS3J1X2NIWlx9TnD+BhwTI/7l7TIeiuMqroYuKOv/UtafJ7DkNRs8KuGJ1kHrANYyR
4Dp5E0l8FnGFW1vqrWVtXaFVk5dBxJcxi8MCRNjz7/rPop4JvAk5JsSvKavsaStDh6O4dRVcf3tW9Jw/CQRFIzC0NSMwtDUjMLQ1IzC0NSMwtDUjMLQ1IzC0NSMwtDUjMLQ1IzC0NSMwtDUjMLQ1IzC0NSMwtDUjMLQ1IzC0NSs8GvGv4rqqj7fjl0irHctXnKLlw8hU8RNYWZl2+ertC5v2276fpXSRqUhSGpmYUhqZmFIamZhSGpmYUhqZmFIamZhSGpmYUhqZmFIamZhSGpmYUhqZmFIamZhSGpmYUhqZmFIamZhSGpmYUhqVlvhZHkwCRfSbIxyTVJTuhrLEmLo89rem4B3lxVlyfZC7gsyQVVdW2PY0rqUW8zjKq6taou75bvBjYC+/c1nqT+Lco5jCRrgKcB31qM8ST1o/ePGUiyCvgscGJV3bWd768D1gGsZI++40jaCb3OMJKsYFQWZ1TV2dvbpqrWV9Xaqlq7gt36jCNpJ41dGEkeluTQhu0CfATYWFXv25FwkiZLU2EkuTDJQ5OsBr4DfDTJfCVwJPAK4KgkV3S35+1kXkkDaj2HsXdV3ZXkT4GPVtXbk1w51wOq6mtAdjqhpInRekiyPMmjgJcB5/aYR9IEay2MdwHnAz+sqkuTPA74fn+xJE2ipkOSqjoLOGvW/euBl/QVStJkaj3p+cQkX0pydXf/0CSn9BtN0qRpPST5EHAycB9AVV0JHNdXKEmTqbUw9qiqS7ZZt2Whw0iabK2F8eMkjwcKIMlLgVt7SyVpIrW+DuMNwHrgyUn+G/gR8Ee9pZI0keYtjCTLgLVVdXSSPYFl3dvVJe1i5j0kqar7gTd2y5stC2nX1XoO44Ikb+kuu7f6gVuvySRNnNZzGK/uvr5h1roCHrewcSRNstZXej627yCSJl9TYST54+2tr6qPL2wcSZOs9ZDk6bOWVwLPBi4HLAxpF9J6SPKm2feT7A38ay+JJE2sHb2m58+BgxcyiKTJ13oO4/N0LwtnVDKHMOvt7gsly2eY2We6/lq72zf2GjrCWO5/RM2/0YRZ/pRfu9j8xNty7UOHjjCexl+L1nMY/zhreQtwY1VtGjOSpCnXekjyvKq6qLt9vao2JXlvr8kkTZzWwnjOdtY9dyGDSJp8cx6SJHkd8HrgcdtcJXwv4Ot9BpM0eeY7h/FJ4D+AvwNOmrX+7qq6o7dUkibSnIVRVXcCdwLHAyTZj9ELt1YlWVVVN/UfUdKkaL0I8AuTfJ/RhXMuAm5gNPOQtAtpPen5buAZwPe6N6I9G89hSLuc1sK4r6p+AixLsqyqvgIc3mMuSROo9YVbP0uyCvgqcEaS2/Cq4dIup3WGcSyj94+cCJwH/BB4YV+hJE2m1nerbk5yEHBwVZ2eZA9gpt9okiZN619JXgt8BviXbtX+wL/1FUrSZGo9JHkDcCRwF0BVfR/Yr69QkiZTa2HcW1W/fOBOkuU0vyFW0lLRWhgXJXkbsHuS5zC6Fsbn+4slaRK1FsZJwO3AVcCfAV8ATukrlKTJNN+7VR9TVTd1n372oe7WJMlK4GJgt26cz1TV23cmrKRhzTfD+P+/hCT57Jj7vhc4qqoOY/Sq0GOSPGPMfUiaIPO9DiOzlsf6lLOqKuCe7u6K7uaJUmmKzTfDqAdZbpJkJskVwG3ABVX1rXH3IWlyzDfDOCzJXYxmGrt3y3T3q6rmvDRyVW0FDk+yD3BOkqdW1dWzt0myDlgHsHLZqh35N0haJPNdQGdBXv5dVT9LciFwDHD1Nt9bD6wH2HvFvh6ySBNsRz/IaF5J9u1mFiTZHTgauK6v8ST1r/Xt7TviUcDpSWYYFdOnq+rcHseT1LPeCqOqrgSe1tf+JS2+3g5JJC09FoakZhaGpGYWhqRmFoakZhaGpGYWhqRmFoakZhaGpGYWhqRmFoakZhaGpGYWhqRmFoakZhaGpGYWhqRmFoakZhaGpGZ9XtNzbL945Eq+++YnDh1jLHvcOnSC8b
zrxWcOHWFsb/vPlw0dYXz73D90grG0fj6AMwxJzSwMSc0sDEnNLAxJzSwMSc0sDEnNLAxJzSwMSc0sDEnNLAxJzSwMSc0sDEnNLAxJzSwMSc0sDEnNLAxJzSwMSc16L4wkM0m+neTcvseS1K/FmGGcAGxchHEk9azXwkhyAPB84MN9jiNpcfQ9w/gA8FZguq6IKmm7eiuMJC8Abquqy+bZbl2SDUk2bN28ua84khZAnzOMI4EXJbkBOBM4Kskntt2oqtZX1dqqWjuz5549xpG0s3orjKo6uaoOqKo1wHHAl6vq5X2NJ6l/vg5DUrNF+eSzqroQuHAxxpLUH2cYkppZGJKaWRiSmlkYkppZGJKaWRiSmlkYkppZGJKaWRiSmlkYkppZGJKaWRiSmlkYkppZGJKaWRiSmlkYkppZGJKaLcoVt1r95urbueS4Dw4dYyyHXnL80BHGcvLFLx06wthmfpGhI4xt5e3T9Vy87L7G7fqNIWkpsTAkNbMwJDWzMCQ1szAkNbMwJDWzMCQ1szAkNbMwJDWzMCQ1szAkNbMwJDWzMCQ1szAkNbMwJDWzMCQ1szAkNbMwJDXr9RJ9SW4A7ga2Aluqam2f40nq12Jc0/NZVfXjRRhHUs88JJHUrO/CKOCLSS5Lsm57GyRZl2RDkg23/2Rrz3Ek7Yy+D0mOrKpbkuwHXJDkuqq6ePYGVbUeWA+w9rCV1XMeSTuh1xlGVd3Sfb0NOAc4os/xJPWrt8JIsmeSvR5YBn4fuLqv8ST1r89DkkcA5yR5YJxPVtV5PY4nqWe9FUZVXQ8c1tf+JS0+/6wqqZmFIamZhSGpmYUhqZmFIamZhSGpmYUhqZmFIamZhSGpmYUhqZmFIamZhSGpmYUhqZmFIamZhSGpmYUhqZmFIalZqibnQt1Jbgdu7GHXDwem7cOUpi3ztOWF6cvcZ96Dqmrf+TaaqMLoS5IN0/YxjdOWedrywvRlnoS8HpJIamZhSGq2qxTG+qED7IDtZk6yNckVSa5OclaSPbr1j0xyZpIfJrk2yReSPHHW4/4iyS+S7L2YeYeU5G3zbDJxmecxeN5d4hzGUpLknqpa1S2fAVwGvB/4BnB6VX2w+97hwF5V9dXu/iXAvcBHqupjQ2RfbLN/VloYu8oMY6n6KvAE4FnAfQ+UBUBVXTGrLB4PrAJOAY5/sJ0leWuSq5J8J8l7unWHJ/mvJFcmOSfJw7r1FyZ5f5KLk2xM8vQkZyf5fpJ3d9usSXJdktO7x39m1ozo2Um+3Y13WpLduvU3JHlnksu77z25W79nt92l3eOO7da/qhv3vG7sv+/WvwfYvZuNnbGgP/VdWVV5m6IbcE/3dTnw78DrgD8H3j/HY04B/orRE8QNwH7b2ea5jGYpe3T3V3dfrwSe2S2/C/hAt3wh8N5u+QTgFuBRwG7AJuA3gDVAMfpQboDTgLcAK4GbgSd26z8OnNgt3wC8qVt+PfDhbvlvgZd3y/sA3wP2BF4FXA/s3e33RuDA2T8rbwt3c4YxfXZPcgWwAbgJ+EjDY44Dzqyq+4GzgT/czjZHAx+tqp8DVNUd3fmOfarqom6b04Hfm/WYz3VfrwKuqapbq+peRv+BD+y+d3NVfb1b/gTwO8CTgB9V1fceZL9nd18vY1Q6MPps3pO6f/uFjMrhMd33vlRVd1bVL4BrgYPm+XloB/X52arqx/9W1eGzVyS5Bnjp9jZOcihwMHBB9zm3D2H0H/rUbTdlNBsYx73d1/tnLT9w/4HfrW33Wd1YLfvdOms/AV5SVd+dvWGS39pm7NmP0QJzhrE0fBnYLclrH1jRnVN4JqNzFu+oqjXd7dHA/km2fRb+IvDqWecYVlfVncBPk/xut80rgIsYz2OS/Ha3fDzwNeA6YE2SJ4yx3/OBN6VrvSRPaxj7viQrxsyrOVgYS0CNDtj/AHhO92fVa4B3MDqvcBxwzjYPOadbP3sf5zE6xNjQTfvf0n3rlcA/JLkSOJzReYxxbA
Re2T1+NfDP3aHDnwBnJbmK0Yzkg3PsA+CvgRXAlUmu7u7PZ323vSc9F4h/VlVvkqwBzq2qpw4cRQvEGYakZs4wJDVzhiGpmYUhqZmFIamZhSGpmYUhqZmFIanZ/wHZ5pRtH3WLMQAAAABJRU5ErkJggg==\n", | |
"text/plain": [ | |
"<matplotlib.figure.Figure at 0x7f20c7bd3278>" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"# Show transformation matrix\n", | |
"print(pca_coeff)\n", | |
"\n", | |
"# Plot transformation matrix\n", | |
"plt.matshow(pca_coeff)\n", | |
"plt.xlabel(\"PCA component\")\n", | |
"plt.ylabel(\"Features\")\n", | |
"plt.show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Make a scatter plot of PC scores (use different symbols for genuine and counterfeit bank note) for PC1 vs. PC2. Which PC variable is more important in discriminating counterfeit vs. genuine? Furthermore, which of the six original measurement variables contribute the most for the counterfeit detection?" | |
] | |
}, | |
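{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"From the heat map above, we can observe that feature 5, corresponding to \"Diagonal\", has the largest positive coefficient in PC1 (1.314). \"Left\" (-1.343), \"Right\" (-1.388), and \"Bottom\" (-1.243) have coefficients of similar magnitude but opposite sign, so PC1 appears to separate the data mainly through the contrast between \"Diagonal\" and these three variables." | |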
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 10, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Transform original feature matrix\n", | |
"z = features.dot(pca_coeff)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 11, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"image/png": "\n", | |
"text/plain": [ | |
"<matplotlib.figure.Figure at 0x7f20c58b1470>" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"true_labels = df['Y']\n", | |
"\n", | |
"# Plot the first two components\n", | |
"plt.scatter(z[0], z[1], c=true_labels)\n", | |
"plt.xlabel(\"PC 1\")\n", | |
"plt.ylabel(\"PC 2\")\n", | |
"plt.show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"From this plot, we can observe that PC1, which carries the largest share of the variance, is the best separator of the data. We can also observe that there is one outlier value in PC2." | |
] | |
}, | |
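{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Using the scikit-learn library (**scikit-learn's PCA centers the data but does not scale it, so it corresponds to the covariance matrix**; standardizing the features first would give the correlation-matrix result):" | |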
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 12, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"from sklearn.decomposition import PCA as SklearnPCA\n", | |
"\n", | |
"# The alias avoids clashing with the PCA function defined above\n", | |
"pca = SklearnPCA(n_components=2)\n", | |
"features_pca = pca.fit_transform(features)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 13, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"image/png": "\n", | |
"text/plain": [ | |
"<matplotlib.figure.Figure at 0x7f20bad59630>" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"plt.scatter(features_pca[:, 0], features_pca[:, 1], c=true_labels)\n", | |
"plt.xlabel(\"PC 1\")\n", | |
"plt.ylabel(\"PC 2\")\n", | |
"plt.show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"In this plot of the first two principal components of the covariance matrix, the outlier effect is more pronounced. Most likely because the inflated variance of \"Top\" dominates the covariance matrix, the axes appear rotated by 90 degrees relative to the previous plot, so now PC2 is the best separator of the data." | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.6.3" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 2 | |
} |
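
The two numeric steps the notebook relies on, rescaling a covariance matrix into a correlation matrix and picking the number of components from a cumulative-variance threshold, can be checked by hand. The sketch below is a minimal illustration using only the standard library. The numbers are copied from the covariance matrix and eigenvalues printed in the notebook; the helper names `cov_to_corr` and `components_needed` are hypothetical and not part of the original code.

```python
import math
from itertools import accumulate

# Values copied from the notebook's printed covariance matrix
# (row/column order: Length, Left, Right, Bottom, Top, Diagonal).
COV = [
    [0.141793, 0.031443, 0.023091, -0.103246, -0.177737, 0.084306],
    [0.031443, 0.130339, 0.108427, 0.215803, -0.024207, -0.209342],
    [0.023091, 0.108427, 0.163274, 0.284132, 0.067082, -0.240470],
    [-0.103246, 0.215803, 0.284132, 2.086878, 0.117303, -1.036996],
    [-0.177737, -0.024207, 0.067082, 0.117303, 30.915678, -0.100771],
    [0.084306, -0.209342, -0.240470, -1.036996, -0.100771, 1.327716],
]

# Eigenvalues copied from the notebook's correlation-matrix PCA output.
EIGENVALUES = [2.64672233, 1.29267268, 0.98436364,
               0.46340507, 0.36862616, 0.24421012]


def cov_to_corr(cov):
    """Rescale a covariance matrix: corr[i][j] = cov[i][j] / (sd_i * sd_j)."""
    n = len(cov)
    sd = [math.sqrt(cov[i][i]) for i in range(n)]
    return [[cov[i][j] / (sd[i] * sd[j]) for j in range(n)] for i in range(n)]


def components_needed(eigenvalues, threshold):
    """Smallest k whose first k eigenvalues reach `threshold` of the total."""
    total = sum(eigenvalues)
    for k, running in enumerate(accumulate(eigenvalues), start=1):
        if running / total >= threshold:
            return k
    return len(eigenvalues)


corr = cov_to_corr(COV)

# Fraction of the total variance (trace of the covariance matrix) held by Top.
top_share = COV[4][4] / sum(COV[i][i] for i in range(len(COV)))

print("corr(Left, Right):", round(corr[1][2], 3))
print("share of total variance held by Top:", round(top_share, 3))
print("components needed for 90%:", components_needed(EIGENVALUES, 0.90))
```

Dividing each covariance by the product of the two standard deviations removes the scale of each variable. This is why the very large variance of Top dominates a covariance-based PCA but not a correlation-based one.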