@ClebsonDantasUchoa
Last active August 22, 2018 03:37
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Classifying whether or not an individual has diabetes\n",
"## Problem definition:\n",
"### The dataset describes Pima Indian patients (Native Americans), with the following attributes:\n",
"1. Number of times pregnant\n",
"2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test\n",
"3. Diastolic blood pressure (mm Hg)\n",
"4. Triceps skin fold thickness (mm)\n",
"5. 2-hour serum insulin (mu U/ml)\n",
"6. Body mass index (weight in kg / (height in m)^2)\n",
"7. Diabetes pedigree function\n",
"8. Age (years)\n",
"9. Class variable (0 or 1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Importing the libraries"
]
},
{
"cell_type": "code",
"execution_count": 207,
"metadata": {},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import pandas as pd\n",
"import numpy as np\n",
"from sklearn import tree\n",
"from sklearn import metrics"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Loading the dataset and defining its columns"
]
},
{
"cell_type": "code",
"execution_count": 208,
"metadata": {},
"outputs": [],
"source": [
"cols = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']\n",
"df = pd.read_csv('diabetes.data', header=None, names=cols)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Viewing and describing the data"
]
},
{
"cell_type": "code",
"execution_count": 209,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>preg</th>\n",
" <th>plas</th>\n",
" <th>pres</th>\n",
" <th>skin</th>\n",
" <th>test</th>\n",
" <th>mass</th>\n",
" <th>pedi</th>\n",
" <th>age</th>\n",
" <th>class</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>6</td>\n",
" <td>148</td>\n",
" <td>72</td>\n",
" <td>35</td>\n",
" <td>0</td>\n",
" <td>33.6</td>\n",
" <td>0.627</td>\n",
" <td>50</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>85</td>\n",
" <td>66</td>\n",
" <td>29</td>\n",
" <td>0</td>\n",
" <td>26.6</td>\n",
" <td>0.351</td>\n",
" <td>31</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>8</td>\n",
" <td>183</td>\n",
" <td>64</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>23.3</td>\n",
" <td>0.672</td>\n",
" <td>32</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>89</td>\n",
" <td>66</td>\n",
" <td>23</td>\n",
" <td>94</td>\n",
" <td>28.1</td>\n",
" <td>0.167</td>\n",
" <td>21</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0</td>\n",
" <td>137</td>\n",
" <td>40</td>\n",
" <td>35</td>\n",
" <td>168</td>\n",
" <td>43.1</td>\n",
" <td>2.288</td>\n",
" <td>33</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>5</td>\n",
" <td>116</td>\n",
" <td>74</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>25.6</td>\n",
" <td>0.201</td>\n",
" <td>30</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>3</td>\n",
" <td>78</td>\n",
" <td>50</td>\n",
" <td>32</td>\n",
" <td>88</td>\n",
" <td>31.0</td>\n",
" <td>0.248</td>\n",
" <td>26</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>10</td>\n",
" <td>115</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>35.3</td>\n",
" <td>0.134</td>\n",
" <td>29</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>2</td>\n",
" <td>197</td>\n",
" <td>70</td>\n",
" <td>45</td>\n",
" <td>543</td>\n",
" <td>30.5</td>\n",
" <td>0.158</td>\n",
" <td>53</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>8</td>\n",
" <td>125</td>\n",
" <td>96</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0.0</td>\n",
" <td>0.232</td>\n",
" <td>54</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" preg plas pres skin test mass pedi age class\n",
"0 6 148 72 35 0 33.6 0.627 50 1\n",
"1 1 85 66 29 0 26.6 0.351 31 0\n",
"2 8 183 64 0 0 23.3 0.672 32 1\n",
"3 1 89 66 23 94 28.1 0.167 21 0\n",
"4 0 137 40 35 168 43.1 2.288 33 1\n",
"5 5 116 74 0 0 25.6 0.201 30 0\n",
"6 3 78 50 32 88 31.0 0.248 26 1\n",
"7 10 115 0 0 0 35.3 0.134 29 0\n",
"8 2 197 70 45 543 30.5 0.158 53 1\n",
"9 8 125 96 0 0 0.0 0.232 54 1"
]
},
"execution_count": 209,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head(10)"
]
},
{
"cell_type": "code",
"execution_count": 210,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>preg</th>\n",
" <th>plas</th>\n",
" <th>pres</th>\n",
" <th>skin</th>\n",
" <th>test</th>\n",
" <th>mass</th>\n",
" <th>pedi</th>\n",
" <th>age</th>\n",
" <th>class</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>768.000000</td>\n",
" <td>768.000000</td>\n",
" <td>768.000000</td>\n",
" <td>768.000000</td>\n",
" <td>768.000000</td>\n",
" <td>768.000000</td>\n",
" <td>768.000000</td>\n",
" <td>768.000000</td>\n",
" <td>768.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>3.845052</td>\n",
" <td>120.894531</td>\n",
" <td>69.105469</td>\n",
" <td>20.536458</td>\n",
" <td>79.799479</td>\n",
" <td>31.992578</td>\n",
" <td>0.471876</td>\n",
" <td>33.240885</td>\n",
" <td>0.348958</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>3.369578</td>\n",
" <td>31.972618</td>\n",
" <td>19.355807</td>\n",
" <td>15.952218</td>\n",
" <td>115.244002</td>\n",
" <td>7.884160</td>\n",
" <td>0.331329</td>\n",
" <td>11.760232</td>\n",
" <td>0.476951</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.078000</td>\n",
" <td>21.000000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>1.000000</td>\n",
" <td>99.000000</td>\n",
" <td>62.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>27.300000</td>\n",
" <td>0.243750</td>\n",
" <td>24.000000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>3.000000</td>\n",
" <td>117.000000</td>\n",
" <td>72.000000</td>\n",
" <td>23.000000</td>\n",
" <td>30.500000</td>\n",
" <td>32.000000</td>\n",
" <td>0.372500</td>\n",
" <td>29.000000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>6.000000</td>\n",
" <td>140.250000</td>\n",
" <td>80.000000</td>\n",
" <td>32.000000</td>\n",
" <td>127.250000</td>\n",
" <td>36.600000</td>\n",
" <td>0.626250</td>\n",
" <td>41.000000</td>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>17.000000</td>\n",
" <td>199.000000</td>\n",
" <td>122.000000</td>\n",
" <td>99.000000</td>\n",
" <td>846.000000</td>\n",
" <td>67.100000</td>\n",
" <td>2.420000</td>\n",
" <td>81.000000</td>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" preg plas pres skin test mass \\\n",
"count 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 \n",
"mean 3.845052 120.894531 69.105469 20.536458 79.799479 31.992578 \n",
"std 3.369578 31.972618 19.355807 15.952218 115.244002 7.884160 \n",
"min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 \n",
"25% 1.000000 99.000000 62.000000 0.000000 0.000000 27.300000 \n",
"50% 3.000000 117.000000 72.000000 23.000000 30.500000 32.000000 \n",
"75% 6.000000 140.250000 80.000000 32.000000 127.250000 36.600000 \n",
"max 17.000000 199.000000 122.000000 99.000000 846.000000 67.100000 \n",
"\n",
" pedi age class \n",
"count 768.000000 768.000000 768.000000 \n",
"mean 0.471876 33.240885 0.348958 \n",
"std 0.331329 11.760232 0.476951 \n",
"min 0.078000 21.000000 0.000000 \n",
"25% 0.243750 24.000000 0.000000 \n",
"50% 0.372500 29.000000 0.000000 \n",
"75% 0.626250 41.000000 1.000000 \n",
"max 2.420000 81.000000 1.000000 "
]
},
"execution_count": 210,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.describe()"
]
},
{
"cell_type": "code",
"execution_count": 211,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"class\n",
"0 500\n",
"1 268\n",
"dtype: int64\n"
]
}
],
"source": [
"print(df.groupby('class').size())"
]
},
{
"cell_type": "code",
"execution_count": 212,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 720x720 with 9 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df.hist(figsize=(10,10));"
]
},
{
"cell_type": "code",
"execution_count": 213,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>preg</th>\n",
" <th>plas</th>\n",
" <th>pres</th>\n",
" <th>skin</th>\n",
" <th>test</th>\n",
" <th>mass</th>\n",
" <th>pedi</th>\n",
" <th>age</th>\n",
" <th>class</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>preg</th>\n",
" <td>1.000000</td>\n",
" <td>0.129459</td>\n",
" <td>0.141282</td>\n",
" <td>-0.081672</td>\n",
" <td>-0.073535</td>\n",
" <td>0.017683</td>\n",
" <td>-0.033523</td>\n",
" <td>0.544341</td>\n",
" <td>0.221898</td>\n",
" </tr>\n",
" <tr>\n",
" <th>plas</th>\n",
" <td>0.129459</td>\n",
" <td>1.000000</td>\n",
" <td>0.152590</td>\n",
" <td>0.057328</td>\n",
" <td>0.331357</td>\n",
" <td>0.221071</td>\n",
" <td>0.137337</td>\n",
" <td>0.263514</td>\n",
" <td>0.466581</td>\n",
" </tr>\n",
" <tr>\n",
" <th>pres</th>\n",
" <td>0.141282</td>\n",
" <td>0.152590</td>\n",
" <td>1.000000</td>\n",
" <td>0.207371</td>\n",
" <td>0.088933</td>\n",
" <td>0.281805</td>\n",
" <td>0.041265</td>\n",
" <td>0.239528</td>\n",
" <td>0.065068</td>\n",
" </tr>\n",
" <tr>\n",
" <th>skin</th>\n",
" <td>-0.081672</td>\n",
" <td>0.057328</td>\n",
" <td>0.207371</td>\n",
" <td>1.000000</td>\n",
" <td>0.436783</td>\n",
" <td>0.392573</td>\n",
" <td>0.183928</td>\n",
" <td>-0.113970</td>\n",
" <td>0.074752</td>\n",
" </tr>\n",
" <tr>\n",
" <th>test</th>\n",
" <td>-0.073535</td>\n",
" <td>0.331357</td>\n",
" <td>0.088933</td>\n",
" <td>0.436783</td>\n",
" <td>1.000000</td>\n",
" <td>0.197859</td>\n",
" <td>0.185071</td>\n",
" <td>-0.042163</td>\n",
" <td>0.130548</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mass</th>\n",
" <td>0.017683</td>\n",
" <td>0.221071</td>\n",
" <td>0.281805</td>\n",
" <td>0.392573</td>\n",
" <td>0.197859</td>\n",
" <td>1.000000</td>\n",
" <td>0.140647</td>\n",
" <td>0.036242</td>\n",
" <td>0.292695</td>\n",
" </tr>\n",
" <tr>\n",
" <th>pedi</th>\n",
" <td>-0.033523</td>\n",
" <td>0.137337</td>\n",
" <td>0.041265</td>\n",
" <td>0.183928</td>\n",
" <td>0.185071</td>\n",
" <td>0.140647</td>\n",
" <td>1.000000</td>\n",
" <td>0.033561</td>\n",
" <td>0.173844</td>\n",
" </tr>\n",
" <tr>\n",
" <th>age</th>\n",
" <td>0.544341</td>\n",
" <td>0.263514</td>\n",
" <td>0.239528</td>\n",
" <td>-0.113970</td>\n",
" <td>-0.042163</td>\n",
" <td>0.036242</td>\n",
" <td>0.033561</td>\n",
" <td>1.000000</td>\n",
" <td>0.238356</td>\n",
" </tr>\n",
" <tr>\n",
" <th>class</th>\n",
" <td>0.221898</td>\n",
" <td>0.466581</td>\n",
" <td>0.065068</td>\n",
" <td>0.074752</td>\n",
" <td>0.130548</td>\n",
" <td>0.292695</td>\n",
" <td>0.173844</td>\n",
" <td>0.238356</td>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" preg plas pres skin test mass pedi \\\n",
"preg 1.000000 0.129459 0.141282 -0.081672 -0.073535 0.017683 -0.033523 \n",
"plas 0.129459 1.000000 0.152590 0.057328 0.331357 0.221071 0.137337 \n",
"pres 0.141282 0.152590 1.000000 0.207371 0.088933 0.281805 0.041265 \n",
"skin -0.081672 0.057328 0.207371 1.000000 0.436783 0.392573 0.183928 \n",
"test -0.073535 0.331357 0.088933 0.436783 1.000000 0.197859 0.185071 \n",
"mass 0.017683 0.221071 0.281805 0.392573 0.197859 1.000000 0.140647 \n",
"pedi -0.033523 0.137337 0.041265 0.183928 0.185071 0.140647 1.000000 \n",
"age 0.544341 0.263514 0.239528 -0.113970 -0.042163 0.036242 0.033561 \n",
"class 0.221898 0.466581 0.065068 0.074752 0.130548 0.292695 0.173844 \n",
"\n",
" age class \n",
"preg 0.544341 0.221898 \n",
"plas 0.263514 0.466581 \n",
"pres 0.239528 0.065068 \n",
"skin -0.113970 0.074752 \n",
"test -0.042163 0.130548 \n",
"mass 0.036242 0.292695 \n",
"pedi 0.033561 0.173844 \n",
"age 1.000000 0.238356 \n",
"class 0.238356 1.000000 "
]
},
"execution_count": 213,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.corr()"
]
},
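{
"cell_type": "code",
"execution_count": 214,
"metadata": {},
"outputs": [],
"source": [
"# Convert the DataFrame to a NumPy array; rows stay in file order because the shuffle below is disabled\n",
"df = df.values\n",
"#np.random.seed(1)\n",
"#np.random.shuffle(df)"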
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating the dataset"
]
},
{
"cell_type": "code",
"execution_count": 215,
"metadata": {},
"outputs": [],
"source": [
"# Columns 0-7 are the features; column 8 is the class label\n",
"dados = df[:, 0:8]\n",
"diabetes = df[:, 8]"
]
},
{
"cell_type": "code",
"execution_count": 216,
"metadata": {},
"outputs": [],
"source": [
"# Use the first 75% of the rows for training and the remaining 25% for testing (no shuffling)\n",
"n_train = int(round(len(diabetes) * 0.75))\n",
"dados_treino = dados[:n_train,:]\n",
"diabetes_treino = diabetes[:n_train]\n",
"dados_teste = dados[n_train:,:]\n",
"diabetes_teste = diabetes[n_train:]"
]
},
{
"cell_type": "code",
"execution_count": 217,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"((768, 8), (768,), (576, 8), (576,), (192, 8), (192,))"
]
},
"execution_count": 217,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dados.shape, diabetes.shape, dados_treino.shape, diabetes_treino.shape, dados_teste.shape, diabetes_teste.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating a decision tree and fitting it to the data"
]
},
{
"cell_type": "code",
"execution_count": 218,
"metadata": {},
"outputs": [],
"source": [
"# Fit a decision tree on the training rows, then predict the class of each held-out test row\n",
"clf = tree.DecisionTreeClassifier()\n",
"clf.fit(dados_treino, diabetes_treino)\n",
"resposta = clf.predict(dados_teste)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Results"
]
},
{
"cell_type": "code",
"execution_count": 219,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" precision recall f1-score support\n",
"\n",
" 0.0 0.76 0.80 0.78 122\n",
" 1.0 0.61 0.56 0.58 70\n",
"\n",
"avg / total 0.70 0.71 0.71 192\n",
"\n"
]
}
],
"source": [
"print(metrics.classification_report(diabetes_teste, resposta))"
]
},
{
"cell_type": "code",
"execution_count": 220,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.7083333333333334\n"
]
}
],
"source": [
"accuracy = metrics.accuracy_score(diabetes_teste, resposta)\n",
"print(accuracy)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
@regispires

Hi, Clebson. I liked it!
However, you have an hour, which is enough time to go into more detail.

So here are some observations...

  • It is not fair to predict the first row, because the model already knows it: that row was part of the training set. Your example would be OK if the first row had not been used during fit.
  • I suggest splitting the data into a training set (70%) and a test set (30%) -> there is an example of this split at https://github.com/ciencia-de-dados-pratica/GEAM/blob/master/001/iris-notebook.ipynb
  • Train (fit) the model with the training data.
  • Predict the class for each row of the test data.
  • Report how many rows were classified correctly relative to the total, that is, compute the model's accuracy.
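
The steps above can be sketched roughly as follows. This is only an illustration, not the notebook's code: it uses a synthetic stand-in for the data (an assumption, since `diabetes.data` is not included here), and the names `X`, `y` and the `random_state` values are hypothetical. `train_test_split` shuffles the rows by default; passing `stratify=y` is an extra choice made here so that both sets keep a similar class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the 768 x 8 feature matrix and the 0/1 labels.
# This is an assumption for illustration; swap in the real arrays
# (`dados` and `diabetes` in the notebook) when running on the dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(768, 8))
y = (X[:, 1] + 0.5 * rng.normal(size=768) > 0).astype(int)

# 70% training / 30% test split, shuffled and stratified by class.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

# Fit on the training rows only, so no test row is seen during fit.
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Predict the class of every test row.
y_pred = clf.predict(X_test)

# Accuracy = correctly classified rows / total test rows.
n_correct = int((y_pred == y_test).sum())
accuracy = n_correct / len(y_test)
assert np.isclose(accuracy, accuracy_score(y_test, y_pred))

print(f"{n_correct} of {len(y_test)} test rows classified correctly "
      f"(accuracy = {accuracy:.3f})")
```

Fixing `random_state` in both the split and the tree makes the result repeatable between runs; without it, the reported accuracy can change each time the cells are executed.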
