@stabgan
Created January 31, 2021 10:15
3col_heart.ipynb
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "3col_heart.ipynb",
"provenance": [],
"toc_visible": true,
"authorship_tag": "ABX9TyPfTt+2CcMu+91XHzHusfKE",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"accelerator": "TPU"
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/stabgan/e82270cff4c1fce850d5d994152c985b/3col_heart.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Yrf6H65S1rEQ"
},
"source": [
"# Heart.xlsx data analysis and feature extraction - A Study\n",
"by Kaustabh Ganguly"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "4l48x2KY2D5Q"
},
"source": [
"## Importing Necessary Functions and Libraries"
]
},
{
"cell_type": "code",
"metadata": {
"id": "0KVrFbvpBEzH"
},
"source": [
"import pandas as pd\n",
"from scipy.stats import skewnorm\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"# Silence library warnings by replacing warnings.warn with a no-op.\n",
"def warn(*args, **kwargs):\n",
"    pass\n",
"import warnings\n",
"warnings.warn = warn\n",
"from imblearn.over_sampling import SMOTE, BorderlineSMOTE, SVMSMOTE, ADASYN, RandomOverSampler\n",
"from imblearn.combine import SMOTEENN,SMOTETomek\n",
"from imblearn.ensemble import BalancedRandomForestClassifier\n",
"from sklearn.decomposition import PCA\n",
"from google.colab import files\n",
"import io\n",
"from sklearn.preprocessing import StandardScaler\n",
"from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"from xgboost import XGBClassifier\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"from sklearn.inspection import permutation_importance"
],
"execution_count": 1,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "zoeW8uQD2K5S"
},
"source": [
"Uploading the data file from the local machine to the Colab runtime"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 72
},
"id": "mltOEg2kCFUy",
"outputId": "8e2557f9-5013-4fc4-8e8d-5a193a3ae48e"
},
"source": [
"uploaded = files.upload()"
],
"execution_count": 2,
"outputs": [
{
"output_type": "stream",
"text": [
"Saving 3 col heart.xlsx to 3 col heart (1).xlsx\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "KBbJYkR22TUt"
},
"source": [
"Reading the uploaded Excel file into a pandas DataFrame"
]
},
{
"cell_type": "code",
"metadata": {
"id": "oSK7KBDLCMoY"
},
"source": [
"data = pd.read_excel(io.BytesIO(uploaded['3 col heart.xlsx']))"
],
"execution_count": 3,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "C1vrN7sC2XJS"
},
"source": [
"Checking the dimensions of the data"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "iQTlPueoH48F",
"outputId": "1cacf628-5b2a-4f2a-d6f3-1c43d77021ec"
},
"source": [
"data.shape"
],
"execution_count": 4,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"(270, 4)"
]
},
"metadata": {
"tags": []
},
"execution_count": 4
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "P-nOmZUg2ady"
},
"source": [
"Counting the absence (1) and presence (2) of heart disease in the last column"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "CvHLPEA4IF6u",
"outputId": "2faa7b11-35ce-4cda-ad76-d3efd8132297"
},
"source": [
"data[\"Absence (1) or presence (2) of heart disease\"].value_counts()"
],
"execution_count": 5,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"1 150\n",
"2 120\n",
"Name: Absence (1) or presence (2) of heart disease, dtype: int64"
]
},
"metadata": {
"tags": []
},
"execution_count": 5
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "dmAeAnjV2qbU"
},
"source": [
"# Data"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
},
"id": "HlEPNIStCMr6",
"outputId": "57aaaf30-0688-4ab4-a043-963e12f30dd0"
},
"source": [
"data.head()"
],
"execution_count": 6,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>blood pressure</th>\n",
" <th>cholesterol</th>\n",
" <th>maximum heart rate</th>\n",
" <th>Absence (1) or presence (2) of heart disease</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>130</td>\n",
" <td>322</td>\n",
" <td>109</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>115</td>\n",
" <td>564</td>\n",
" <td>160</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>124</td>\n",
" <td>261</td>\n",
" <td>141</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>128</td>\n",
" <td>263</td>\n",
" <td>105</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>120</td>\n",
" <td>269</td>\n",
" <td>121</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" blood pressure ... Absence (1) or presence (2) of heart disease\n",
"0 130 ... 2\n",
"1 115 ... 1\n",
"2 124 ... 2\n",
"3 128 ... 1\n",
"4 120 ... 1\n",
"\n",
"[5 rows x 4 columns]"
]
},
"metadata": {
"tags": []
},
"execution_count": 6
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "t0lC1AqW2tZL"
},
"source": [
"Recoding the target variable (last column) from 1/2 to -1/1 and renaming the column accordingly for later use"
]
},
{
"cell_type": "code",
"metadata": {
"id": "uP4aSKMXCMur"
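},
"source": [
"# Assign the recoded column back instead of calling replace(..., inplace=True) on a column selection,\n",
"# which may leave the DataFrame unchanged in recent pandas versions.\n",
"data[\"Absence (1) or presence (2) of heart disease\"] = data[\"Absence (1) or presence (2) of heart disease\"].replace({2: 1, 1: -1})\n",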
"data = data.rename(columns={\"Absence (1) or presence (2) of heart disease\":\"Target (1= Presence of heart disease)\"})"
],
"execution_count": 7,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "riKzqhEs3A55"
},
"source": [
"Converting DataFrame objects to NumPy arrays"
]
},
{
"cell_type": "code",
"metadata": {
"id": "tbk9SaYOEJdn"
},
"source": [
"X = data.iloc[:, 0:-1].values\n",
"y = data.iloc[:, -1].values"
],
"execution_count": 8,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "2VU1Xrzs3EqM"
},
"source": [
"Using SMOTE for oversampling and Tomek links for undersampling the data points"
]
},
{
"cell_type": "code",
"metadata": {
"id": "TxWqww3SFYg4"
},
"source": [
"sm = SMOTETomek(random_state=4)\n",
"# fit_resample replaces fit_sample, which was removed from imbalanced-learn\n",
"X, y = sm.fit_resample(X, y)"
],
"execution_count": 9,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "aYz9p2QT3NBm"
},
"source": [
"Checking the dimensions of our NumPy arrays"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "G3wg3aSVF3_9",
"outputId": "b5295ac6-725b-43ba-ad8f-33aecd63df1b"
},
"source": [
"X.shape, y.shape"
],
"execution_count": 10,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"((250, 3), (250,))"
]
},
"metadata": {
"tags": []
},
"execution_count": 10
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "tlQm0ZEX3YkR"
},
"source": [
"Checking that both target classes have the same count"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "xV4fBPvwF_LO",
"outputId": "dcc6a4e1-e54f-4c47-b532-aa4ba9d3b246"
},
"source": [
"pd.DataFrame(y).value_counts()"
],
"execution_count": 11,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" 1 125\n",
"-1 125\n",
"dtype: int64"
]
},
"metadata": {
"tags": []
},
"execution_count": 11
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "AW24GKge3emY"
},
"source": [
"Standardising all the independent variables"
]
},
{
"cell_type": "code",
"metadata": {
"id": "XUDgsPz5L2EQ"
},
"source": [
"sc = StandardScaler()\n",
"X_stan = sc.fit_transform(X)"
],
"execution_count": 12,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "7PKWXaVp3nKp"
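},
"source": [
"Principal Component Analysis of the standardised data, extracting the explained variance ratio of each principal component (the ratios sum to 1). Each ratio describes a component, which is a combination of all three features, not a single original feature."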
]
},
{
"cell_type": "code",
"metadata": {
"id": "MFqakuiOL311"
},
"source": [
"pca = PCA(n_components=3)\n",
"X_pca = pca.fit_transform(X_stan)\n",
"explained_variance = pca.explained_variance_ratio_"
],
"execution_count": 14,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "SbHXrC3b3yjn"
},
"source": [
"# PCA Analysis of Important Features (Unsupervised)"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 437
},
"id": "OIz52hwgLO-t",
"outputId": "081e0bff-dee8-4e91-eace-34e721f2c637"
},
"source": [
"features = data.iloc[:, 0:-1].columns.to_numpy()\n",
"weights_pca = explained_variance\n",
"# explained_variance_ratio_ has one entry per principal component (sorted in\n",
"# decreasing order), not per original feature, so label the bars by component.\n",
"components = [\"PC{}\".format(k + 1) for k in range(len(weights_pca))]\n",
"for i,j in zip(components,weights_pca) :\n",
"    print(\"{} explains {}% of the variance\".format(i,j*100))\n",
"fig = plt.figure()\n",
"ax = fig.add_axes([0,0,1,1])\n",
"ax.bar(components,weights_pca)\n",
"plt.setp(ax.get_xticklabels(), rotation=30, horizontalalignment='right')\n",
"plt.show()"
],
"execution_count": 15,
"outputs": [
{
"output_type": "stream",
"text": [
"PC1 explains 39.354536029924596% of the variance\n",
"PC2 explains 33.24287008456666% of the variance\n",
"PC3 explains 27.402593885508736% of the variance\n"
],
"name": "stdout"
},
{
"output_type": "display_data",
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"tags": [],
"needs_background": "light"
}
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "4W7YW0dD37Ya"
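},
"source": [
"Linear Discriminant Analysis of the standardised data, extracting the coefficients of the trained LDA to act as our score for each feature. The coefficients are signed: the magnitude reflects how strongly a feature separates the two classes and the sign gives the direction."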
]
},
{
"cell_type": "code",
"metadata": {
"id": "T2NUT4pGibHq"
},
"source": [
"# With two classes LDA yields at most n_classes - 1 = 1 discriminant component\n",
"lda = LDA(solver='eigen', shrinkage = 'auto',n_components=1)\n",
"X_lda = lda.fit_transform(X_stan, y)\n",
"weights_lda = lda.coef_[0]"
],
"execution_count": 16,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "_Q4pDCgzAHTz",
"outputId": "0057c766-092c-4308-f885-b907c8f093be"
},
"source": [
"weights_lda"
],
"execution_count": 17,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"array([ 0.38010567, 0.21838097, -1.3470873 ])"
]
},
"metadata": {
"tags": []
},
"execution_count": 17
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "eegwyD1J44Rj"
},
"source": [
"# **LDA** analysis of important features (Supervised)"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 437
},
"id": "6OqPj9rNktV7",
"outputId": "18d47651-d324-4d80-c139-4da38e02209f"
},
"source": [
"vis = zip(features,weights_lda)\n",
"for i,j in vis :\n",
" print(\"Feature {} has Score {}%\".format(i,j*100))\n",
"fig = plt.figure()\n",
"ax = fig.add_axes([0,0,1,1])\n",
"ax.bar(features,weights_lda)\n",
"plt.setp(ax.get_xticklabels(), rotation=30, horizontalalignment='right')\n",
"plt.show()"
],
"execution_count": 18,
"outputs": [
{
"output_type": "stream",
"text": [
"Feature blood pressure has Score 38.010567313803044%\n",
"Feature cholesterol has Score 21.838096891499255%\n",
"Feature maximum heart rate has Score -134.70873037980527%\n"
],
"name": "stdout"
},
{
"output_type": "display_data",
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"tags": [],
"needs_background": "light"
}
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "KeX1rYAT5AIa"
},
"source": [
"# Using **Random Forest Classifier** to train our model and extracting the feature_importances_ attribute of the trained model, which is our score"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 437
},
"id": "K8RXDKLBnMAQ",
"outputId": "60760ea3-1276-4afd-883e-5e52ffbd682e"
},
"source": [
"model = RandomForestClassifier()\n",
"# fit the model\n",
"model.fit(X, y)\n",
"# get importance\n",
"importance = model.feature_importances_\n",
"# summarize feature importance\n",
"for i,v in zip(features,importance):\n",
"\tprint(\"Feature: {}, Score: {}\".format(i,v*100))\n",
"# plot feature importance\n",
"fig = plt.figure()\n",
"ax = fig.add_axes([0,0,1,1])\n",
"weights = importance\n",
"ax.bar(features,importance)\n",
"plt.setp(ax.get_xticklabels(), rotation=30, horizontalalignment='right')\n",
"plt.show()"
],
"execution_count": 19,
"outputs": [
{
"output_type": "stream",
"text": [
"Feature: blood pressure, Score: 23.36408424069707\n",
"Feature: cholesterol, Score: 30.88035483411681\n",
"Feature: maximum heart rate, Score: 45.755560925186124\n"
],
"name": "stdout"
},
{
"output_type": "display_data",
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"tags": [],
"needs_background": "light"
}
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Tj4bPFih5cRp"
},
"source": [
"# Using **Logistic Regression** to train our model and exponentiating the coef_ attribute of the trained model (odds ratios), which is our score"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 437
},
"id": "sbu5Hf6Etv7s",
"outputId": "d8607f52-8af1-4386-cc7b-70588f5d69bb"
},
"source": [
"model = LogisticRegression()\n",
"# fit the model\n",
"model.fit(X, y)\n",
"# get importance\n",
"importance = np.exp(model.coef_[0])\n",
"# summarize feature importance\n",
"for i,v in zip(features,importance):\n",
"\tprint(\"Feature: {}, Score: {}\".format(i,v*100))\n",
"# plot feature importance\n",
"fig = plt.figure()\n",
"ax = fig.add_axes([0,0,1,1])\n",
"weights = importance\n",
"ax.bar(features,importance)\n",
"plt.setp(ax.get_xticklabels(), rotation=30, horizontalalignment='right')\n",
"plt.show()"
],
"execution_count": 20,
"outputs": [
{
"output_type": "stream",
"text": [
"Feature: blood pressure, Score: 102.54244322624493\n",
"Feature: cholesterol, Score: 100.44932424725899\n",
"Feature: maximum heart rate, Score: 94.39095599530435\n"
],
"name": "stdout"
},
{
"output_type": "display_data",
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"tags": [],
"needs_background": "light"
}
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "sc3YmIW95lkv"
},
"source": [
"# Using **Decision Tree Classifier** (CART) to train our model and extracting the feature_importances_ attribute of the trained model, which is our score"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 437
},
"id": "kt4aBXL5wOEB",
"outputId": "a7d7b5a6-1e2e-41d5-947a-1622ac11721b"
},
"source": [
"model = DecisionTreeClassifier()\n",
"# fit the model\n",
"model.fit(X, y)\n",
"# get importance\n",
"importance = model.feature_importances_\n",
"for i,v in zip(features,importance):\n",
"\tprint(\"Feature: {}, Score: {}\".format(i,v*100))\n",
"# plot feature importance\n",
"fig = plt.figure()\n",
"ax = fig.add_axes([0,0,1,1])\n",
"weights = importance\n",
"ax.bar(features,importance)\n",
"plt.setp(ax.get_xticklabels(), rotation=30, horizontalalignment='right')\n",
"plt.show()"
],
"execution_count": 21,
"outputs": [
{
"output_type": "stream",
"text": [
"Feature: blood pressure, Score: 25.088112025284303\n",
"Feature: cholesterol, Score: 26.511470690688167\n",
"Feature: maximum heart rate, Score: 48.40041728402753\n"
],
"name": "stdout"
},
{
"output_type": "display_data",
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"tags": [],
"needs_background": "light"
}
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "kGHE53tF5r6r"
},
"source": [
"# Using **XGBoost Classifier** to train our model and extracting the feature_importances_ attribute of the trained model, which is our score"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 437
},
"id": "Z1ydBmg7wsht",
"outputId": "e40027e7-005a-4267-b46c-eed981db4061"
},
"source": [
"model = XGBClassifier()\n",
"# fit the model; recent XGBoost versions expect class labels 0..n-1, so map -1/1 to 0/1\n",
"model.fit(X, (y == 1).astype(int))\n",
"# get importance\n",
"importance = model.feature_importances_\n",
"for i,v in zip(features,importance):\n",
"\tprint(\"Feature: {}, Score: {}\".format(i,v*100))\n",
"# plot feature importance\n",
"fig = plt.figure()\n",
"ax = fig.add_axes([0,0,1,1])\n",
"weights = importance\n",
"ax.bar(features,importance)\n",
"plt.setp(ax.get_xticklabels(), rotation=30, horizontalalignment='right')\n",
"plt.show()"
],
"execution_count": 22,
"outputs": [
{
"output_type": "stream",
"text": [
"Feature: blood pressure, Score: 25.387638807296753\n",
"Feature: cholesterol, Score: 21.72376662492752\n",
"Feature: maximum heart rate, Score: 52.88859009742737\n"
],
"name": "stdout"
},
{
"output_type": "display_data",
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"tags": [],
"needs_background": "light"
}
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "LmI-04MR5xOE"
},
"source": [
"# Permutation Feature Importance\n",
"Permutation feature importance is a technique for calculating relative importance scores that is independent of the model used.\n",
"\n",
"First, a model is fit on the dataset, such as a model that does not support native feature importance scores. Then the model is used to make predictions on a dataset in which the values of one feature (column) have been scrambled; the resulting drop in score measures how much the model relies on that feature. This is repeated for each feature in the dataset. Then this whole process is repeated 3, 5, 10 or more times. The result is a mean importance score for each input feature (and a distribution of scores given the repeats).\n",
"\n",
"This approach can be used for regression or classification and requires that a performance metric be chosen as the basis of the importance score, such as the mean squared error for regression and accuracy for classification.\n",
"\n",
"Permutation feature importance can be computed with the permutation_importance() function, which takes a fitted model, a dataset (train or test dataset is fine), and a scoring function."
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 437
},
"id": "Pg-txpAoxH5L",
"outputId": "3ee7a6e2-03a9-4f1e-a45b-d1a88d6bc89b"
},
"source": [
"model = KNeighborsClassifier()\n",
"# fit the model\n",
"model.fit(X, y)\n",
"# perform permutation importance\n",
"results = permutation_importance(model, X, y, scoring='accuracy')\n",
"# get importance\n",
"importance = results.importances_mean\n",
"for i,v in zip(features,importance):\n",
"\tprint(\"Feature: {}, Score: {}\".format(i,v*100))\n",
"# plot feature importance\n",
"fig = plt.figure()\n",
"ax = fig.add_axes([0,0,1,1])\n",
"weights = importance\n",
"ax.bar(features,importance)\n",
"plt.setp(ax.get_xticklabels(), rotation=30, horizontalalignment='right')\n",
"plt.show()"
],
"execution_count": 23,
"outputs": [
{
"output_type": "stream",
"text": [
"Feature: blood pressure, Score: 7.680000000000006\n",
"Feature: cholesterol, Score: 11.120000000000005\n",
"Feature: maximum heart rate, Score: 24.880000000000006\n"
],
"name": "stdout"
},
{
"output_type": "display_data",
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"tags": [],
"needs_background": "light"
}
}
]
}
]
}
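The permutation feature importance procedure described in the notebook's last section can be sketched in plain Python. This is only an illustration under stated assumptions, not scikit-learn's implementation: the helper names `accuracy` and `permutation_importance_sketch`, the toy data, and the fixed `predict` rule standing in for a fitted model are all hypothetical and do not come from the notebook.

```python
import random


def accuracy(predict, X, y):
    """Fraction of rows for which predict(row) equals the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance_sketch(predict, X, y, n_repeats=5, seed=0):
    """Mean drop in accuracy when each column of X is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            # Scramble one column while leaving the others untouched.
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            X_perm = [row[:col] + [v] + row[col + 1:]
                      for row, v in zip(X, shuffled)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy data (made up for illustration): the label depends only on column 0.
data_rng = random.Random(42)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [1 if row[0] > 0.5 else -1 for row in X]


# Hypothetical stand-in for a fitted model: a rule that reads only column 0.
def predict(row):
    return 1 if row[0] > 0.5 else -1


scores = permutation_importance_sketch(predict, X, y)
for name, score in zip(["informative", "noise"], scores):
    print("Feature: {}, Score: {}".format(name, score * 100))
```

Shuffling the column the rule ignores leaves every prediction unchanged, so its score is zero, while shuffling the informative column costs a large share of the accuracy. The notebook's `permutation_importance(model, X, y, scoring='accuracy')` call reports the same kind of mean drop through `results.importances_mean`.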