Created
August 5, 2022 17:08
Text Classifier Model
{ | |
"nbformat": 4, | |
"nbformat_minor": 0, | |
"metadata": { | |
"colab": { | |
"name": "Text Classifier Model", | |
"private_outputs": true, | |
"provenance": [], | |
"collapsed_sections": [], | |
"authorship_tag": "ABX9TyNMCvUSvmrBTQ6rJd9A9Rpe", | |
"include_colab_link": true | |
}, | |
"kernelspec": { | |
"name": "python3", | |
"display_name": "Python 3" | |
}, | |
"language_info": { | |
"name": "python" | |
} | |
}, | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "view-in-github", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"<a href=\"https://colab.research.google.com/gist/RodolfoFerro/5a2880fede2faa0b8c19199707356d12/text-classifier-model.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"# Tweet classifier model" | |
], | |
"metadata": { | |
"id": "fSBdg2x1ldjB" | |
} | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"## Structure of our model\n", | |
"\n", | |
"For our model we will use a neuron similar to the previous ones, with a sigmoid activation function." | |
], | |
"metadata": { | |
"id": "45Nuhu_7so2z" | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"import math\n", | |
"\n", | |
"import numpy as np\n", | |
"\n", | |
"\n", | |
"class neuron_model():\n", | |
"    def __init__(self, X, Y):\n", | |
"        \"\"\"Constructor of the class.\"\"\"\n", | |
"\n", | |
"        # Set seed and set data attributes\n", | |
"        np.random.seed(123)\n", | |
"\n", | |
"        # X arrives as (n_samples, n_features); it is stored as (n_features, n_samples)\n", | |
"        self.n_data = X.shape[0]\n", | |
"        self.X = X.T\n", | |
"        self.Y = Y.T\n", | |
"\n", | |
"        # Weights random initialization in [-0.1, 0.1], drawn from the seeded NumPy generator\n", | |
"        self.w = np.random.randint(-10, 11, size=(self.X.shape[0], 1)) * 0.01\n", | |
"\n", | |
"        # Bias initialization\n", | |
"        self.b = 0\n", | |
"\n", | |
"\n", | |
"    def sigmoid(self, x):\n", | |
"        \"\"\"Sigmoid function.\"\"\"\n", | |
"\n", | |
"        return 1.0 / (1.0 + np.exp(-x))\n", | |
"\n", | |
"\n", | |
"    def predict(self, x):\n", | |
"        \"\"\"Prediction function. Expects x with shape (n_samples, n_features).\"\"\"\n", | |
"\n", | |
"        return self.sigmoid(np.dot(x, self.w) + self.b)\n", | |
"\n", | |
"\n", | |
"    def train(self, iterations, learning_rate=0.1):\n", | |
"        \"\"\"Training function (batch gradient descent on the cross-entropy cost).\"\"\"\n", | |
"\n", | |
"        # Cost initialization\n", | |
"        cost = 0\n", | |
"        for i in range(iterations):\n", | |
"            # Forward pass: out has shape (1, n_samples)\n", | |
"            out = self.sigmoid(np.dot(self.w.T, self.X) + self.b)\n", | |
"            cost = (-1. / self.n_data) * np.sum((self.Y * np.log(out)) + (1. - self.Y) * np.log(1. - out))\n", | |
"\n", | |
"            print(f'[INFO] Iteration: {i + 1}/{iterations}, cost: {cost}')\n", | |
"\n", | |
"            if math.isnan(cost):\n", | |
"                break\n", | |
"\n", | |
"            # Gradients of the cost with respect to the weights and the bias\n", | |
"            dw = (1. / self.n_data) * np.dot(self.X, (out - self.Y).T)\n", | |
"            db = (1. / self.n_data) * np.sum(out - self.Y)\n", | |
"            self.w = self.w - dw * learning_rate\n", | |
"            self.b = self.b - db * learning_rate\n", | |
"\n", | |
"        print('[INFO] Training succeeded!')" | |
], | |
"metadata": { | |
"id": "h9icUJ3ngDY2" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"## Downloading the _dataset_\n", | |
"\n", | |
"The dataset we will use consists of movie reviews and comes from the _Internet Movie Database (IMDB)_: https://www.imdb.com/interfaces/\n", | |
"\n", | |
"We will download the dataset from a public repository using the following line of code. It contains 50,000 reviews of various well-known movies." | |
], | |
"metadata": { | |
"id": "2Nu2TyfDmfDj" | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"!wget https://raw.githubusercontent.com/Ankit152/IMDB-sentiment-analysis/master/IMDB-Dataset.csv" | |
], | |
"metadata": { | |
"id": "jjmQlIh2ksuP" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"### Data transformation\n", | |
"\n", | |
"The data must be transformed into numeric vectors before we can model it." | |
], | |
"metadata": { | |
"id": "zoDeIpU7nRsq" | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"import pandas as pd\n", | |
"\n", | |
"\n", | |
"df = pd.read_csv('IMDB-Dataset.csv')\n", | |
"df.head(10)" | |
], | |
"metadata": { | |
"id": "E1vr_2pZmesI" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"# Select each column as a 1-D array so every element is a plain string\n", | |
"sentiment = df['sentiment'].values\n", | |
"review = df['review'].values\n", | |
"\n", | |
"sentiment = [1 if value == 'positive' else 0 for value in sentiment]\n", | |
"review = [str(text) for text in review]" | |
], | |
"metadata": { | |
"id": "zHOvifbdnlfY" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"from sklearn.feature_extraction.text import HashingVectorizer\n", | |
"\n", | |
"# Create the transformation\n", | |
"vectorizer = HashingVectorizer(n_features=20)\n", | |
"vectorizer.fit(review)" | |
], | |
"metadata": { | |
"id": "2EyWR7K0xmdQ" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"# encode document\n", | |
"vector = vectorizer.transform([review[0]])\n", | |
"\n", | |
"# summarize encoded vector\n", | |
"print(vector.shape)\n", | |
"print(vector.toarray())" | |
], | |
"metadata": { | |
"id": "lMwwpyjgykBA" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"We prepare the final data" | |
], | |
"metadata": { | |
"id": "rPkJ6Q_jzCAL" | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"X = vectorizer.transform(review).toarray()\n", | |
"Y = np.array(sentiment)" | |
], | |
"metadata": { | |
"id": "TWYZeH4syv6x" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"## Creating a model\n", | |
"\n", | |
"Now that we have loaded our data, we can create a model and train it on that data." | |
], | |
"metadata": { | |
"id": "F4g7opXTs0RR" | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"model = neuron_model(X, Y)\n", | |
"model.train(10000, learning_rate=0.0001)" | |
], | |
"metadata": { | |
"id": "sd-j5OjNr-ow" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"## Prediction with the model\n", | |
"\n", | |
"Once the model is trained, we can make predictions with it. To do so, we can load some example data for testing." | |
], | |
"metadata": { | |
"id": "Jtwy-gTOtCOy" | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"phrases = [\n", | |
"    'I do not consider this comment as something positive.',\n", | |
"    'I really love life and people and animals and everything. I enjoy being happy.',\n", | |
"    'I hate people.'\n", | |
"]\n", | |
"\n", | |
"x = vectorizer.transform(phrases).toarray()\n", | |
"model.predict(x)\n" | |
], | |
"metadata": { | |
"id": "eeaSISKGrs0b" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"from sklearn.metrics import accuracy_score\n", | |
"\n", | |
"res_X = model.predict(X)\n", | |
"# Threshold the predicted probabilities at 0.5 and flatten to shape (n_samples,)\n", | |
"res_X = (res_X >= 0.5).astype(int).ravel()\n", | |
"\n", | |
"accuracy_score(Y, res_X)" | |
], | |
"metadata": { | |
"id": "eHlVRsLH9t6Q" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"model.w" | |
], | |
"metadata": { | |
"id": "yZRPzb4v-sh4" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"model.b" | |
], | |
"metadata": { | |
"id": "QydZL8ke-u2t" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"--------\n", | |
"\n", | |
"> Content created by **Rodolfo Ferro** for [CdeCMx](https://clubesdeciencia.mx/), 2022. <br>\n", | |
"> You can contact me via Insta ([@rodo_ferro](https://www.instagram.com/rodo_ferro/)) or Twitter ([@rodo_ferro](https://twitter.com/rodo_ferro))." | |
], | |
"metadata": { | |
"id": "LzaGoXnfuQjC" | |
} | |
} | |
] | |
} |
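The notebook above turns each review into a 20-dimensional vector with `HashingVectorizer(n_features=20)` but does not show what happens inside. The sketch below is a minimal, standard-library illustration of the underlying idea (the hashing trick). It is an assumption-laden simplification, not scikit-learn's implementation: the helper name `hash_features` is hypothetical, and it uses MD5 rather than the MurmurHash3 function scikit-learn relies on, so the bucket indices will not match the notebook's vectors.

```python
import hashlib
import math
import re


def hash_features(text, n_features=20):
    """Map a text to a fixed-length vector using the hashing trick.

    Hypothetical helper for illustration only: it hashes with MD5, so its
    output will NOT match scikit-learn's HashingVectorizer.
    """
    vector = [0.0] * n_features
    # Lowercase the text and keep tokens of two or more word characters
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        value = int.from_bytes(digest[:8], "big")
        # The hash value picks the bucket; no vocabulary has to be stored
        index = value % n_features
        # One bit of the hash acts as a sign, so collisions tend to cancel out
        sign = 1.0 if (value >> 63) & 1 == 0 else -1.0
        vector[index] += sign
    # L2-normalize so long and short texts end up on the same scale
    norm = math.sqrt(sum(v * v for v in vector))
    if norm > 0:
        vector = [v / norm for v in vector]
    return vector


features = hash_features("I really love life and people and animals.")
print(len(features))
```

Because the bucket index depends only on the token's hash, there is nothing to learn from the corpus, which is why a hashing-based vectorizer can transform new phrases without a fitted vocabulary. With only 20 buckets, many different words share a bucket, which limits how much a single neuron can learn from these features.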