{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "powered-thong"
},
"source": [
"### Applied Data Science and Machine Intelligence\n",
"### A program by IIT Madras and TalentSprint\n",
"### Module 2 Mini Project: Sentiment Analysis using linear classifiers and unsupervised clustering."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "1nK0fzdQzk0g"
},
"source": [
"## Learning Objectives\n",
"\n",
"At the end of the mini project, you will be able to -\n",
"\n",
"* use a real world dataset.\n",
"* undertake several important steps like cleaning the data and normalizing the data points.\n",
"* do sentiment classification.\n",
"* compare between different types of classification methods and their pros and cons. \n",
"* compare between supervised and unsupervised (clustering) techniques. "
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "FvhRny6xpviu"
},
"source": [
"### Goal of the project\n",
"The goal of this project is to train linear classification models that can recognize the sentiment of the reviewer. In this project we will be dealing with only positive and negative sentiments (binary classification).\n",
"\n",
"**Disclaimer**: \n",
"There are multiple ways to solve this problem, as there is no unique formula to solve.\n",
"This is just one such approach.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "hungry-accident"
},
"source": [
"**Packages used:** \n",
"* `Pandas` for data frames and easy to read csv files \n",
"* `Numpy` for array and matrix mathematics functions \n",
"* `Matplotlib` and `Seaborn` for visualization\n",
"* `sklearn` for the metrics and pre-processing\n",
"* `scipy` for helper functions required at various stages of the project.\n",
"* `warnings` is used to supress warnings from different libraries used in the project."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "u-g3SLm-DLEW"
},
"source": [
"### Importing the packages"
]
},
{
"cell_type": "code",
"execution_count": 263,
"metadata": {
"id": "e_V0z4eaDILg"
},
"outputs": [],
"source": [
"# Importing standard libraries\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import matplotlib.cm as cm\n",
"import seaborn as sns\n",
"import pandas as pd\n",
"import scipy\n",
"import math\n",
"import random\n",
"\n",
"# Importing linear classification algorithms\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"from sklearn import svm\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.discriminant_analysis import LinearDiscriminantAnalysis \n",
"from sklearn.tree import DecisionTreeClassifier \n",
"from sklearn.naive_bayes import GaussianNB\n",
"from sklearn import tree\n",
"from sklearn.ensemble import VotingClassifier, RandomForestClassifier\n",
"\n",
"# Importing the clustering algorithms\n",
"from sklearn.cluster import MiniBatchKMeans\n",
"from sklearn.cluster import AgglomerativeClustering\n",
"from sklearn.mixture import GaussianMixture\n",
"\n",
"# Importing preprocessing functions\n",
"from sklearn.preprocessing import MinMaxScaler\n",
"from sklearn.preprocessing import StandardScaler\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"from sklearn.decomposition import PCA\n",
"from sklearn.decomposition import TruncatedSVD\n",
"\n",
"# Importing metrics\n",
"from sklearn.metrics import accuracy_score\n",
"from sklearn.metrics import f1_score\n",
"\n",
"# Suppressing warnings\n",
"import warnings\n",
"warnings.filterwarnings('ignore')"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [],
"source": [
"%aimport helpers\n",
"import helpers as h"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "0L9eRrC6ABcy"
},
"source": [
"### Downloading a dataset containing amazon review information along with ratings\n",
"To download the data, we will use **`!gdown`**. \n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "GWP8wZ-2ABcy",
"outputId": "e0132a8b-5a91-459c-a776-6fd81f8b4e15"
},
"outputs": [],
"source": [
"# Downloading the dataset from the Google Drive link.\n",
"# !gdown https://drive.google.com/uc?id=1kd0RZvI4ur2ehkv4zAriXg6g2W1Mh3xO"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "9padyuWLABc0"
},
"source": [
"## How does the dataset look like?\n",
"Lets use a standard dataset from Amazon which contains reviews and ratings from the customer. The original dataset has three features: name(name of the products), review(Customer reviews of the products), and rating(rating of the customer of a product ranging from 1 to 5). The review column will be the input column and the rating column will be used to understand the sentiments of the review. Here are some important data preprocessing steps:\n",
"The dataset has about 183,500 rows of data. There are 1147 null values which will be removed.\n",
"As the dataset is pretty big, it takes a lot of time to run some machine learning algorithms. We will use 30% of the data in this project which is still 54,000+ data points! The sample will be representative of the whole dataset.\n",
"If the rating is 1 and 2 that will be considered a negative review. And if the review is 3, 4, and 5, the review will be considered as a positive review. We add a new column named ‘sentiments’ to the dataset that will use 1 for the positive reviews and 0 for the negative reviews. We read and display the contents of the dataset down below."
]
},
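{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the preprocessing rule described above, assuming only the column names from the dataset (the variable name `df` is illustrative; the exercises below use helper functions instead):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch of the described preprocessing (assumes amazon_baby.csv is present)\n",
"# df = pd.read_csv('amazon_baby.csv').dropna()        # drop the 1147 null values\n",
"# df = df.sample(frac=0.3, random_state=0)            # 30% representative sample\n",
"# df['sentiments'] = (df['rating'] >= 3).astype(int)  # 1-2 -> 0 (negative), 3-5 -> 1 (positive)"
]
},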
{
"cell_type": "markdown",
"metadata": {
"id": "o2tXX5YAqcqX"
},
"source": [
"**Exercise 1**: Load the data and perform the following (1 points)\n",
"- Exploratory Data Analysis \n",
"- Preprocessing \n",
"\n",
"\n",
"**Hints:** \n",
"\n",
"- checking for the number of rows and columns\n",
"- summary of the dataset\n",
"- statistical description of the features \n",
"- check for the duplicate values\n",
"- Show the top 5 and the last 5 rows of the data\n",
"- check for the null values, and handle them if *any*"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "s7F0lQeoxJa3"
},
"source": [
"For Exercises, 2 to 9, use sklearn library to model, fir, train and see the metrics [Accuracy and F1_score]. Writing your own custom functions not required.\n",
"\n",
"\n",
"1. **Exercise 1**: Load the data and perform the following : (1 point)\n",
" - Exploratory Data Analysis \n",
" - Preprocessing \n",
"2. **Exercise 2**: **Implementation using K-Nearest Neighbor (KNN) Classifier**: (1 point)\n",
"\n",
"3. **Exercise 3**: **Implementation using Support Vector Machines (SVM) Classifier**: (3 points)\n",
" - First Reduce the features using PCA\n",
" - use Hard-Margin Classifier\n",
" - use Soft-Margin Classifier\n",
" - use Kernel SVM Classifier\n",
"4. **Exercise 4**: **Implementation using Decision Trees**: (1 point)\n",
"5. **Exercise 5**: **Implementation using Ensemble Classifier**: (1 point) \n",
" - use LogisticRegression, KNN, SVM, Naive Bayes and VotingClassifier as the weak classifiers\n",
"\n",
"6. **Exercise 6**: **Implementation using Random Forest Classifier**: (1 point)\n",
" - use LogisticRegression, KNN, SVM, Naive Bayes and VotingClassifier as the weak classifiers\n",
"7. **Exercise 7**: **Implementation using Clustering**: (1 point)\n",
" - k Means Clustering\n",
" - Gaussian Mixture Models\n",
"8. **Exercise 8**: **Test your own sentence**: (1 point)\n",
" - Input your sentences ( One for positive and negative each)\n",
" - Print the output sentiment."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "k7Raxweqx6UL"
},
"source": [
"**Sample code using Logistic Regression**\n",
"\n",
"The logistic function, more popularly called the sigmoid function was to describe properties of population growth in ecology, rising quickly and maxing out at the carrying capacity of the environment. \n",
"\n",
"It’s an S-shaped curve that can take any real-valued number and map it into a value between 0 and 1, but never exactly at those limits.\n",
"\n",
"$\\frac{1}{ (1 + e^{-value})}$\n",
"\n",
"Where $e$ is the base of the natural logarithms and value is the actual numerical value that you want to transform. Below is a plot of the numbers between $-5$ and $5$ transformed into the range $0$ and $1$ using the logistic function.\n",
"\n"
]
},
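{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of that plot, using only `numpy` and `matplotlib` (already imported above):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Plot the logistic (sigmoid) function over [-5, 5]\n",
"values = np.linspace(-5, 5, 200)\n",
"sigmoid = 1 / (1 + np.exp(-values))\n",
"plt.plot(values, sigmoid)\n",
"plt.xlabel('value')\n",
"plt.ylabel('sigmoid(value)')\n",
"plt.title('Logistic function')\n",
"plt.show()"
]
},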
{
"cell_type": "code",
"execution_count": 60,
"metadata": {
"id": "uoH7Z9uiwbbd"
},
"outputs": [],
"source": [
"def _test_logistic(X_train_vec, X_test_vec, y_train, y_test):\n",
" # Logistic regression model is defined\n",
" logistic_regression = LogisticRegression()\n",
"\n",
" # Training the logistic regression classifier\n",
" logistic_regression.fit(X_train_vec, y_train)\n",
"\n",
" # Calculating accuracy on the logistic regression classifier\n",
" # The accuracy is within 0 and 1 in this snippet\n",
" lr_score = logistic_regression.score(X_test_vec, y_test)\n",
" print(\"Accuracy of the sentiment classification using the Logistic Regression based classifier: \", lr_score)\n",
"\n",
" # Predicting on the test set\n",
" y_pred_lr = logistic_regression.predict(X_test_vec)\n",
"\n",
" # F1 score calculation\n",
" lr_f1_score = f1_score(y_pred_lr, y_test)\n",
"\n",
" print (\"F1 Score for sentiment classification using the Logistic Regression based classifier: \", lr_f1_score)"
]
},
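{
"cell_type": "markdown",
"metadata": {},
"source": [
"A hedged usage sketch for `_test_logistic`: the `X_train`/`X_test`/`y_train`/`y_test` names are assumptions standing in for any train/test split of the review texts and sentiment labels, and the TF-IDF step mirrors the vectorization used elsewhere in this notebook:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical usage (X_train, X_test, y_train, y_test assumed defined)\n",
"# vectorizer = TfidfVectorizer()\n",
"# X_train_vec = vectorizer.fit_transform(X_train)\n",
"# X_test_vec = vectorizer.transform(X_test)\n",
"# _test_logistic(X_train_vec, X_test_vec, y_train, y_test)"
]
},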
{
"cell_type": "markdown",
"metadata": {
"id": "pOMTJSEH3_JY"
},
"source": [
"### **Exercise 1**: Load the data and perform the following: (1 point)\n",
"\n",
"- Exploratory Data Analysis (Use Pandas, Seaborn)\n",
"- Preprocessing (Use Pandas)\n",
"\n",
"**Hints:** \n",
"\n",
"- checking for the number of rows and columns\n",
"- summary of the dataset\n",
"- statistical description of the features \n",
"- check for the duplicate values\n",
"- Show the top 5 and the last 5 rows of the data\n",
"- check for the null values, and handle them if *any*"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"id": "_iKegvGG3_TP"
},
"outputs": [],
"source": [
"# YOUR CODE(s) HERE\n",
"RAW = pd.read_csv('amazon_baby.csv')"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 183531 entries, 0 to 183530\n",
"Data columns (total 3 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 name 183213 non-null object\n",
" 1 review 182702 non-null object\n",
" 2 rating 183531 non-null int64 \n",
"dtypes: int64(1), object(2)\n",
"memory usage: 4.2+ MB\n"
]
}
],
"source": [
"RAW.info()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>rating</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>183531.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>4.120448</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>1.285017</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>4.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>5.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>5.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>5.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" rating\n",
"count 183531.000000\n",
"mean 4.120448\n",
"std 1.285017\n",
"min 1.000000\n",
"25% 4.000000\n",
"50% 5.000000\n",
"75% 5.000000\n",
"max 5.000000"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"RAW.describe()"
]
},
{
"cell_type": "code",
"execution_count": 168,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 164145 entries, 0 to 164144\n",
"Data columns (total 3 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 name 164145 non-null object\n",
" 1 review 164145 non-null object\n",
" 2 rating 164145 non-null int64 \n",
"dtypes: int64(1), object(2)\n",
"memory usage: 3.8+ MB\n"
]
}
],
"source": [
"train, test = h.split_data(h.cleaned_data(RAW))\n",
"train.info()"
]
},
{
"cell_type": "code",
"execution_count": 169,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>name</th>\n",
" <th>review</th>\n",
" <th>rating</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>123653</th>\n",
" <td>Philips AVENT BPA Free Classic Infant Starter ...</td>\n",
" <td>Excellent product! It came perfectly organized...</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>156549</th>\n",
" <td>Meeno Babies Walk Mee - The Original Handheld ...</td>\n",
" <td>good therapy tool my little guy loves to walk ...</td>\n",
" <td>5</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" name \\\n",
"123653 Philips AVENT BPA Free Classic Infant Starter ... \n",
"156549 Meeno Babies Walk Mee - The Original Handheld ... \n",
"\n",
" review rating \n",
"123653 Excellent product! It came perfectly organized... 5 \n",
"156549 good therapy tool my little guy loves to walk ... 5 "
]
},
"execution_count": 169,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train[train.duplicated()]\n"
]
},
{
"cell_type": "code",
"execution_count": 171,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>name</th>\n",
" <th>review</th>\n",
" <th>rating</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Planetwise Flannel Wipes</td>\n",
" <td>These flannel wipes are OK, but in my opinion ...</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Planetwise Wipe Pouch</td>\n",
" <td>it came early and was not disappointed. i love...</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Annas Dream Full Quilt with 2 Shams</td>\n",
" <td>Very soft and comfortable and warmer than it l...</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Stop Pacifier Sucking without tears with Thumb...</td>\n",
" <td>This is a product well worth the purchase. I ...</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Stop Pacifier Sucking without tears with Thumb...</td>\n",
" <td>All of my kids have cried non-stop when I trie...</td>\n",
" <td>5</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" name \\\n",
"0 Planetwise Flannel Wipes \n",
"1 Planetwise Wipe Pouch \n",
"2 Annas Dream Full Quilt with 2 Shams \n",
"3 Stop Pacifier Sucking without tears with Thumb... \n",
"4 Stop Pacifier Sucking without tears with Thumb... \n",
"\n",
" review rating \n",
"0 These flannel wipes are OK, but in my opinion ... 3 \n",
"1 it came early and was not disappointed. i love... 5 \n",
"2 Very soft and comfortable and warmer than it l... 5 \n",
"3 This is a product well worth the purchase. I ... 5 \n",
"4 All of my kids have cried non-stop when I trie... 5 "
]
},
"execution_count": 171,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train.head()"
]
},
{
"cell_type": "code",
"execution_count": 170,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>name</th>\n",
" <th>review</th>\n",
" <th>rating</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>164140</th>\n",
" <td>&amp;quot;A Little Pillow Company&amp;quot; Hypoallerg...</td>\n",
" <td>If this pillow were any larger I would worry a...</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>164141</th>\n",
" <td>&amp;quot;A Little Pillow Company&amp;quot; Hypoallerg...</td>\n",
" <td>The pillow was just as advertised. My grandso...</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>164142</th>\n",
" <td>&amp;quot;A Little Pillow Company&amp;quot; Hypoallerg...</td>\n",
" <td>Perfect size for my toddler! She got it when s...</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>164143</th>\n",
" <td>&amp;quot;A Little Pillow Company&amp;quot; Hypoallerg...</td>\n",
" <td>My grandkids love this pillow - I should say p...</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>164144</th>\n",
" <td>&amp;quot;A Little Pillow Company&amp;quot; Hypoallerg...</td>\n",
" <td>We got this pillow for our 19-month old and sh...</td>\n",
" <td>5</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" name \\\n",
"164140 &quot;A Little Pillow Company&quot; Hypoallerg... \n",
"164141 &quot;A Little Pillow Company&quot; Hypoallerg... \n",
"164142 &quot;A Little Pillow Company&quot; Hypoallerg... \n",
"164143 &quot;A Little Pillow Company&quot; Hypoallerg... \n",
"164144 &quot;A Little Pillow Company&quot; Hypoallerg... \n",
"\n",
" review rating \n",
"164140 If this pillow were any larger I would worry a... 4 \n",
"164141 The pillow was just as advertised. My grandso... 5 \n",
"164142 Perfect size for my toddler! She got it when s... 5 \n",
"164143 My grandkids love this pillow - I should say p... 5 \n",
"164144 We got this pillow for our 19-month old and sh... 5 "
]
},
"execution_count": 170,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train.tail()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "EX9ah2Jc4X_M"
},
"source": [
"### **Exercise 2**: **Implementation using K-Nearest Neighbor (KNN) Classifier**: (1 point)\n",
"\n",
"\n",
"[Refer to the Logistic Regression Example in the above cells]\n",
"\n",
"- Define the KNN classifier with Number of neighbours=5 using sklearn's **KNeighborsClassifier** function\n",
"- Train the KNN classifier\n",
"- Predict the test set\n",
"- Calculate accuracy on the KNN classifier\n",
"- Compute the F1 score"
]
},
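{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, a plain-sklearn sketch of these steps (the vectorized inputs `X_train_vec`/`X_test_vec` and the labels are assumptions, as in the logistic regression example; the cell below solves the exercise with the helper pipeline instead):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch of the hinted steps (inputs assumed to be TF-IDF vectorized)\n",
"# knn_clf = KNeighborsClassifier(n_neighbors=5)\n",
"# knn_clf.fit(X_train_vec, y_train)\n",
"# y_pred_knn = knn_clf.predict(X_test_vec)\n",
"# print('Accuracy:', accuracy_score(y_test, y_pred_knn))\n",
"# print('F1 Score:', f1_score(y_test, y_pred_knn, average='weighted'))"
]
},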
{
"cell_type": "code",
"execution_count": 103,
"metadata": {
"id": "omrI_SQk4jA8"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy: 0.5241228070175439\n",
"F1 Score: 0.44807353748243356\n"
]
}
],
"source": [
"# YOUR CODE(s) HERE\n",
"knn = h.build_model(KNeighborsClassifier())\n",
"knn = h.train_model(knn, train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Logistic Regression\n"
]
},
{
"cell_type": "code",
"execution_count": 267,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy: 0.6886468309086922\n",
"F1 Score: 0.6593863098460689\n"
]
}
],
"source": [
"log_clf = h.build_model(LogisticRegression())\n",
"log_clf = h.train_model(log_clf, train, frac=1)"
]
},
{
"cell_type": "code",
"execution_count": 270,
"metadata": {},
"outputs": [],
"source": [
"h.save_model(log_clf, 'logistic')"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "LHgubZ294yfU"
},
"source": [
"### **Exercise 3**: **Implementation using Support Vector Machines (SVM) Classifier**: (3 points)\n",
" - First Reduce the features using PCA\n",
" - use Hard-Margin Classifier\n",
" - use Soft-Margin Classifier\n",
" - use Kernel SVM Classifier\n",
"\n",
"\n",
"\n",
"Background:\n",
"The next classifier we look into are support vector machines. \n",
"\n",
"![wget](https://cdn.talentsprint.com/aiml/aiml_2020_b14_hyd/experiment_details_backup/linear_data.png)\n",
"\n",
"While the other classifiers such as the perceptron and the logistic regression uses a similar concept of finding a boundary between two classes using a straight line, SVMs aim to maximize this boundary. Therefore, not only the SVM tries to find a boundary, it tries to find the best boundary that separates the two classes. Again, with very simple tricks the two class classification can be easily extended to a multiclass classification. The formal formulation of a SVM is,\n",
"\n",
"$g(x) = w^Tx + b$, is the equation of the line we want to find with weights $w$ and a bias $b$.\n",
"\n",
"Now as seen from the figure, $g(x) = k$ and $g(x) = -k$ will give two worst lines for classification as they are right at the boundary of one of the classes. We need to maximize the distance of the line from both of the classes.\n",
"\n",
"Therefore,\n",
"\n",
"Maximize $k$ such that :\n",
"\n",
"$-w^Tx + b \\geq k \\: for \\: d_i == 1$\n",
"\n",
"$-w^Tx + b \\leq k \\: for \\: d_i == -1$\n",
"\n",
"We keep $g(x) \\geq 1$ and minimize $||w||$.\n",
"\n",
"We finally write the final minimization function (uses Lagrangians to come to this solution).\n",
"\n",
"Minimize: $J(w, b, \\alpha) = \\frac{1}{2}w^Tw - \\Sigma_{i=1}^{N}(\\alpha_id_i(w^Tx_i + b)) + \\Sigma_{i=1}^{N}(\\alpha_i)$\n",
"\n",
"There are multiple types of SVM. We first use the standard linear SVM and check the performance of the model. However, SVM cannot be directly used on this dataset. "
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "75AdOM0W9qJz"
},
"source": [
"The data is too large and the normal SVM function from `sklearn` will take a lot of time to run. Therefore, we first apply a PCA based dimensionality reduction technique on the input data. This will be followed by different types of SVM techniques and the performance can be compared. Since, dimensionality reduction is applied, a slight drop in performance is expected. However, with the improvement in the time taken for training a SVM in mind, it is important we first apply PCA based dimensionality reduction.\n",
"\n",
"In principal component analysis, this relationship is quantified by finding a list of the principal axes in the data, and using those axes to describe the dataset.Using PCA for dimensionality reduction involves zeroing out one or more of the smallest principal components, resulting in a lower-dimensional projection of the data that preserves the maximal data variance.\n",
"\n",
"\n",
"**Hints**\n",
"- Define the PCA model using sklearn's **TruncatedSVD**\n",
"- Fit the training data using **model.fit**\n",
"- Reduce the dimensions of the training data using **model.transform**\n",
"- Reduce the dimensions of the testing data using **model.transform**\n",
"\n",
"\n",
"- Use sklearn's **svm.SVC**. Appropriately choose the arguments - *kernel*, *gamma*, and *C* for hard-margin, soft-margin and kernel SVM classifiers.\n",
"\n"
]
},
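{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of the hinted reduction step (the 100-component choice is an assumption, matching the helper module's `build_svm`; `X_train_vec`/`X_test_vec` are assumed TF-IDF matrices):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: reduce the sparse TF-IDF features before fitting an SVM\n",
"# svd = TruncatedSVD(n_components=100, random_state=0)\n",
"# svd.fit(X_train_vec)\n",
"# X_train_red = svd.transform(X_train_vec)\n",
"# X_test_red = svd.transform(X_test_vec)"
]
},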
{
"cell_type": "code",
"execution_count": 252,
"metadata": {
"id": "rHfsD6lR4yfU"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy: 0.6374695863746959\n",
"F1 Score: 0.5480687598884553\n"
]
}
],
"source": [
"# YOUR CODE(s) HERE\n",
"_linear_svm = h.build_model(svm.SVC(kernel='linear'))\n",
"_linear_svm = h.train_model(_linear_svm, train)"
]
},
{
"cell_type": "code",
"execution_count": 255,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy: 0.6778273265589589\n",
"F1 Score: 0.6451545763785558\n"
]
}
],
"source": [
"linear_svm = h.build_model(svm.LinearSVC())\n",
"linear_svm = h.train_model(linear_svm, train, frac=1)"
]
},
{
"cell_type": "code",
"execution_count": 272,
"metadata": {},
"outputs": [],
"source": [
"h.save_model(linear_svm, 'linear_svm')"
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy: 0.5328947368421053\n",
"F1 Score: 0.370510503727129\n"
]
}
],
"source": [
"hard_svm = h.build_model(svm.SVC(C=0.01, kernel='linear'))\n",
"hard_svm = h.train_model(hard_svm, train)"
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy: 0.5745614035087719\n",
"F1 Score: 0.5302167099114147\n"
]
}
],
"source": [
"soft_svm = h.build_model(svm.SVC(C=10, kernel='linear'))\n",
"soft_svm = h.train_model(soft_svm, train)"
]
},
{
"cell_type": "code",
"execution_count": 80,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy: 0.5350877192982456\n",
"F1 Score: 0.37537628638798975\n"
]
}
],
"source": [
"kernel_svm = h.build_model(svm.SVC(kernel='rbf'))\n",
"kernel_svm = h.train_model(kernel_svm, train)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "RD_CHfNy4zCM"
},
"source": [
"### **Exercise 4**: **Implementation using Decision Trees**: (1 point)\n",
"\n",
"Decision Trees are supervised Machine Learning algorithms that can perform both classification and regression tasks and even multioutput tasks. They can handle complex datasets. As the name shows, it uses a tree-like model to make decisions in order to classify or predict according to the problem. It is an ML algorithm that progressively divides datasets into smaller data groups based on a descriptive feature until it reaches sets that are small enough to be described by some label.\n",
"\n",
"The most important part of a decision tree is its explainability!\n",
"\n",
"The importance of decision tree algorithm is that it has many applications in the real world. For example:\n",
"\n",
"1. In the Healthcare sector: To develop Clinical Decision Analysis tools which allow decision-makers to apply for evidence-based medicine and make objective clinical decisions when faced with complex situations.\n",
"2. Virtual Assistants (Chatbots): To develop chatbots that provide information and assistance to customers in any required domain.\n",
"3. Retail and Marketing: Sentiment analysis detects the pulse of customer feedback and emotions and allows organizations to learn about customer choices and drives decisions.\n",
"\n",
"**Hint**\n",
"Use sklearn's **DecisionTreeClassifier** function"
]
},
{
"cell_type": "code",
"execution_count": 84,
"metadata": {
"id": "o2yJdTmm4zCN"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy: 0.45394736842105265\n",
"F1 Score: 0.4277133322795446\n"
]
}
],
"source": [
"# YOUR CODE(s) HERE\n",
"dt_clf = h.build_model(DecisionTreeClassifier())\n",
"dt_clf = h.train_model(dt_clf, train)"
]
},
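{
"cell_type": "markdown",
"metadata": {},
"source": [
"The explainability claim can be made concrete with `sklearn.tree.export_text`, a sketch assuming `dt_clf` above is a fitted pipeline whose step names (`'model'`, `'vectorize_words'`) come from the helper module:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Print the top levels of the learned rule tree (a sketch)\n",
"# fitted_tree = dt_clf.named_steps['model']\n",
"# words = dt_clf.named_steps['vectorize_words'].get_feature_names_out()\n",
"# print(tree.export_text(fitted_tree, feature_names=list(words), max_depth=2))"
]
},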
{
"cell_type": "markdown",
"metadata": {
"id": "IJncymcr42fg"
},
"source": [
"### **Exercise 5**: **Implementation using Ensemble Classifier**: (1 point) \n",
"- use LogisticRegression, KNN, SVM, Naive Bayes and VotingClassifier as the weak classifiers"
]
},
{
"cell_type": "code",
"execution_count": 97,
"metadata": {
"id": "18F2cxVG40F2"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy: 0.5869565217391305\n",
"F1 Score: 0.43418701608100063\n"
]
}
],
"source": [
"# YOUR CODE(s) HERE\n",
"from sklearn.naive_bayes import MultinomialNB\n",
"_vt_clf_template = VotingClassifier([('logistic', LogisticRegression()), ('naive_bayes', MultinomialNB()), ('knn', KNeighborsClassifier()), ('svm', svm.SVC(probability=True))], voting='soft')\n",
"vt_clf = h.build_model(_vt_clf_template)\n",
"vt_clf = h.train_model(vt_clf, train)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "fW-wW8Or40zA"
},
"source": [
"### **Exercise 6**: **Implementation using Random Forest Classifier**: (1 point)\n",
"\n",
"A random forest is a collection of decision trees whose results are aggregated into one final result. Random Forest is a supervised classification algorithm. There is a direct relationship between the number of trees in the forest and the results it can get: the larger the number of trees, the more accurate the result. But here creating the forest is not the same as constructing the decision tree with the information gain or gain index approach.\n",
"Steps:\n",
"1. Randomly select “k” features from total “m” features where k << m as shown in the figure below\n",
"2. Among the “k” features, calculate the node “d” using the best split point\n",
"3. Split the node into leaf nodes using the best split\n",
"4. Repeat the 1 to 3 steps until “l” number of nodes has been reached.\n",
"5. Build forest by repeating steps 1 to 4 for “n” number times to create “n” number of trees.\n",
"6. Take the test features and use the rules of each randomly created decision tree to predict the outcome and stores the predicted outcome (target)\n",
"7. Calculate the votes for each predicted target\n",
"8. Consider the high voted predicted target as the final prediction from the random forest algorithm\n",
"\n",
"**Hint**:\n",
"- Use sklearn's **RandomForestClassifier**\n",
"- Experiment with n_estimators, max_depth, max_leaf_nodes"
]
},
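{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of the hinted hyperparameter experiment (the specific values are assumptions to vary, not recommendations):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: vary n_estimators / max_depth / max_leaf_nodes and compare the printed scores\n",
"# for n in (50, 100, 200):\n",
"#     rf = h.build_model(RandomForestClassifier(n_estimators=n, max_depth=20, max_leaf_nodes=500, random_state=0))\n",
"#     rf = h.train_model(rf, train)"
]
},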
{
"cell_type": "code",
"execution_count": 85,
"metadata": {
"id": "SvvDAwWe40zB"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy: 0.5350877192982456\n",
"F1 Score: 0.37537628638798975\n"
]
}
],
"source": [
"# YOUR CODE(s) HERE\n",
"# YOUR CODE(s) HERE\n",
"rf_clf = h.build_model(RandomForestClassifier())\n",
"rf_clf = h.train_model(dt_clf, train)"
]
},
{
"cell_type": "code",
"execution_count": 145,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy: 0.581140350877193\n",
"F1 Score: 0.4271905491885052\n"
]
}
],
"source": [
"nb_clf = h.build_model(MultinomialNB())\n",
"nb_clf = h.train_model(nb_clf, train, frac=.1)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ohCDKM-Z47AQ"
},
"source": [
"### **Exercise 7**: **Implementation using Clustering**: (1 point)\n",
"- k Means Clustering, with and without PCA=2\n",
"- Gaussian Mixture Models\n",
"\n",
"**Hints**:\n",
"- Use sklearn's **MiniBatchKMeans**\n",
"- Use sklearn's **GaussianMixture**"
]
},
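{
"cell_type": "markdown",
"metadata": {},
"source": [
"A plain-sklearn sketch of the two hinted models (the cluster/component count of 5 follows the five rating levels; the dense conversion is needed because `GaussianMixture` cannot consume sparse TF-IDF matrices; `X_train_vec` is an assumed TF-IDF matrix):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: cluster the TF-IDF features directly\n",
"# km = MiniBatchKMeans(n_clusters=5, random_state=0).fit(X_train_vec)\n",
"# gm = GaussianMixture(n_components=5, random_state=0).fit(X_train_vec.toarray())\n",
"# km_labels = km.predict(X_train_vec)\n",
"# gm_labels = gm.predict(X_train_vec.toarray())"
]
},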
{
"cell_type": "code",
"execution_count": 210,
"metadata": {},
"outputs": [],
"source": [
"def _get_preds(model, df):\n",
" X = df['review'].to_numpy()\n",
" t_labels = df['rating'].to_numpy()\n",
" pred = model.predict(X)\n",
" return t_labels, pred"
]
},
{
"cell_type": "code",
"execution_count": 223,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy: 0.06082725060827251\n",
"F1 Score: 0.04651513795149036\n"
]
}
],
"source": [
"kmeans = h.build_kmeans(MiniBatchKMeans(n_clusters=5, random_state=0), with_pca=True)\n",
"kmeans = h.train_model(kmeans, train)"
]
},
{
"cell_type": "code",
"execution_count": 224,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy: 0.0827250608272506\n",
"F1 Score: 0.052191432867575785\n"
]
}
],
"source": [
"kmeans = h.build_kmeans(MiniBatchKMeans(n_clusters=5, random_state=0), with_pca=False)\n",
"kmeans = h.train_model(kmeans, train)"
]
},
{
"cell_type": "code",
"execution_count": 215,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[3, -1, -1, 5, 5]"
]
},
"execution_count": 215,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"h.gen_labels(5, *_get_preds(kmeans, train.head()))"
]
},
{
"cell_type": "code",
"execution_count": 220,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy: 0.25\n",
"F1 Score: 0.16666666666666666\n"
]
}
],
"source": [
"gmm = h.build_gmm()\n",
"gmm = h.train_model(gmm, train, 0.0001, verbose=True)"
]
},
{
"cell_type": "code",
"execution_count": 214,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>name</th>\n",
" <th>review</th>\n",
" <th>rating</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Planetwise Flannel Wipes</td>\n",
" <td>These flannel wipes are OK, but in my opinion ...</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Planetwise Wipe Pouch</td>\n",
" <td>it came early and was not disappointed. i love...</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Annas Dream Full Quilt with 2 Shams</td>\n",
" <td>Very soft and comfortable and warmer than it l...</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Stop Pacifier Sucking without tears with Thumb...</td>\n",
" <td>This is a product well worth the purchase. I ...</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Stop Pacifier Sucking without tears with Thumb...</td>\n",
" <td>All of my kids have cried non-stop when I trie...</td>\n",
" <td>5</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" name \\\n",
"0 Planetwise Flannel Wipes \n",
"1 Planetwise Wipe Pouch \n",
"2 Annas Dream Full Quilt with 2 Shams \n",
"3 Stop Pacifier Sucking without tears with Thumb... \n",
"4 Stop Pacifier Sucking without tears with Thumb... \n",
"\n",
" review rating \n",
"0 These flannel wipes are OK, but in my opinion ... 3 \n",
"1 it came early and was not disappointed. i love... 5 \n",
"2 Very soft and comfortable and warmer than it l... 5 \n",
"3 This is a product well worth the purchase. I ... 5 \n",
"4 All of my kids have cried non-stop when I trie... 5 "
]
},
"execution_count": 214,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train.head()\n"
]
},
{
"cell_type": "code",
"execution_count": 213,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0, 0, 0, 0, 0], dtype=int64)"
]
},
"execution_count": 213,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"gmm.predict(train.head().review)"
]
},
{
"cell_type": "code",
"execution_count": 211,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[5, -1, -1, -1, -1]"
]
},
"execution_count": 211,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"h.gen_labels(5, *_get_preds(gmm, train.head()))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ZF5k-DwF4838"
},
"source": [
"### **Exercise 8**: **Test your own sentence**: (1 point)\n",
"- Input your sentences ( One for positive and negative each)\n",
"- Print the output sentiment.**Exercise**"
]
},
{
"cell_type": "code",
"execution_count": 256,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>name</th>\n",
" <th>review</th>\n",
" <th>rating</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>164157</th>\n",
" <td>Gerber First Essentials Soft Center Latex Stan...</td>\n",
" <td>I ordered the first essential nooks in blue 4 ...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>164159</th>\n",
" <td>aden + anais 2 Pack Rayon From Bamboo Issie Se...</td>\n",
" <td>Ordered these thinking they were the Issies wi...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>164176</th>\n",
" <td>aden + anais Rayon From Bamboo Crib Sheet, Azu...</td>\n",
" <td>This is a very beautiful, soft sheet but it do...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>164195</th>\n",
" <td>Disney Princess Potty Seat by The First Years ...</td>\n",
" <td>This is complete crap bought it without lookin...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>164218</th>\n",
" <td>Diaper Safari One Size Pocket Diaper - Cardinal</td>\n",
" <td>I really, really wanted to like this diaper. T...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" name \\\n",
"164157 Gerber First Essentials Soft Center Latex Stan... \n",
"164159 aden + anais 2 Pack Rayon From Bamboo Issie Se... \n",
"164176 aden + anais Rayon From Bamboo Crib Sheet, Azu... \n",
"164195 Disney Princess Potty Seat by The First Years ... \n",
"164218 Diaper Safari One Size Pocket Diaper - Cardinal \n",
"\n",
" review rating \n",
"164157 I ordered the first essential nooks in blue 4 ... 1 \n",
"164159 Ordered these thinking they were the Issies wi... 1 \n",
"164176 This is a very beautiful, soft sheet but it do... 1 \n",
"164195 This is complete crap bought it without lookin... 1 \n",
"164218 I really, really wanted to like this diaper. T... 1 "
]
},
"execution_count": 256,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"r_1 = test.query('rating == 1')#value_counts()\n",
"r_1.head()"
]
},
{
"cell_type": "code",
"execution_count": 271,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([5, 5, 5], dtype=int64)"
]
},
"execution_count": 271,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"log_clf.predict(r_1.head())"
]
},
{
"cell_type": "code",
"execution_count": 273,
"metadata": {},
"outputs": [],
"source": [
"def test_model(model, X: np.ndarray, y: np.ndarray):\n",
" pred = model.predict(X)\n",
" print('accuracy', accuracy_score(pred, y), 'f1', f1_score(pred, y, average='weighted'))\n",
" return pred"
]
},
{
"cell_type": "code",
"execution_count": 280,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([1], dtype=int64)"
]
},
"execution_count": 280,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"linear_svm.predict(['the product I ordered is not what I got'])"
]
}
],
"metadata": {
"colab": {
"collapsed_sections": [],
"name": "M2_MP1_NB_LinearClassification.ipynb",
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.2"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
# helpers.py: helper module imported by the notebook above via 'import helpers as h'
import os
from functools import cache
import numpy as np
import pandas as pd
import joblib
from scipy import stats
from sklearn import (
base,
pipeline,
)
from sklearn.feature_extraction.text import (
TfidfVectorizer,
)
from sklearn.preprocessing import FunctionTransformer
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split
from sklearn.decomposition import (
PCA,
TruncatedSVD,
)
from sklearn.metrics import (
accuracy_score,
f1_score,
)
# ----------- LOADING --------------------
@cache
def load_data(file: str = 'amazon_baby.csv') -> pd.DataFrame:
return pd.read_csv(file)
def cleaned_data(df: pd.DataFrame) -> pd.DataFrame:
_dropped_na = df.dropna()
cleaned = _dropped_na.set_index(pd.RangeIndex(0, len(_dropped_na)))
return cleaned
def split_data(
    df: pd.DataFrame, test_frac: float = 0.1
) -> tuple[pd.DataFrame, pd.DataFrame]:
    # Shuffle row positions reproducibly so the train/test split is random.
    positions = np.arange(len(df))
    rng = np.random.default_rng(0)
    rng.shuffle(positions)
    cutoff = int(len(df) * (1 - test_frac))
    train = df.iloc[positions[:cutoff], :]
    test = df.iloc[positions[cutoff:], :]
    return train, test
def sample(df: pd.DataFrame, frac: float) -> pd.DataFrame:
return df.sample(frac=frac, random_state=0)
# ---------- TRAINING --------------------
def build_model(model: base.BaseEstimator) -> pipeline.Pipeline:
return pipeline.Pipeline([('vectorize_words', TfidfVectorizer()), ('model', model)])
def build_svm(model: base.ClassifierMixin) -> pipeline.Pipeline:
return pipeline.Pipeline(
[
('vectorize_words', TfidfVectorizer()),
('pca', TruncatedSVD(100, random_state=0)),
('model', model),
]
)
def build_kmeans(model: base.ClusterMixin, with_pca: bool = False) -> pipeline.Pipeline:
steps = [('vectorize_words', TfidfVectorizer()), ('model', model)]
if with_pca:
steps.insert(1, ('pca', TruncatedSVD(2, random_state=0)))
return pipeline.Pipeline(steps)
def build_gmm() -> pipeline.Pipeline:
return pipeline.Pipeline(
[
('vectorize_words', TfidfVectorizer()),
        # GaussianMixture needs dense input; toarray() yields an ndarray
        ('todense', FunctionTransformer(lambda x: x.toarray(), accept_sparse=True)),
('model', GaussianMixture(n_components=5)),
]
)
# def build_gmm(df: pd.DataFrame) -> np.ndarray:
# _X_train, _X_test, _, _ = train_test_split(df.loc[:, 'review'].to_numpy(), df['rating'].to_numpy(), random_state=0)
# tf_vec = TfidfVectorizer()
# X_train = tf_vec.fit_transform(_X_train).toarray().reshape(-1, 1)
# X_test = tf_vec.transform(_X_test).toarray().reshape(-1, 1)
# model = GaussianMixture(n_components=5)
# model.fit(
# X_train,
# )
# pred = model.predict(X_test)
# return pred
# # print(model.score(pred, y_test.reshape(-1, 1)))
def train_model(
model: pipeline.Pipeline,
df: pd.DataFrame,
frac: float = 0.01,
*,
verbose: bool = True,
) -> pipeline.Pipeline:
    """Fit `model` on a random sample of `df`, print held-out Accuracy/F1, and return it."""
    df = df.sample(frac=frac, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
df['review'].to_numpy(), df['rating'].to_numpy(), random_state=0
)
model.fit(X_train, y_train)
pred = model.predict(X_test)
if verbose:
print(f"Accuracy: {accuracy_score(y_test, pred)}")
print(f"F1 Score: {f1_score(y_test, pred, average='weighted')}")
return model
def extract_word_vec(arr: np.ndarray | pd.Series) -> np.ndarray:
tf_vec = TfidfVectorizer()
extracted = tf_vec.fit_transform(arr)
return extracted.toarray()
def gen_labels(n_clusters: int, real_labels: np.ndarray, labels: np.ndarray):
"""Label the test predictions."""
permutation = []
for i in range(n_clusters):
idx = labels == i
if not idx.any():
label = -1
else:
            # Choose the most common label among the data points in the cluster
            label = stats.mode(real_labels[idx]).mode[0]
permutation.append(label)
return permutation
def save_model(model: base.BaseEstimator, name: str) -> None:
filename = f'{name}.joblib'
if os.path.exists(filename):
raise FileExistsError(f'{filename} already exists!')
joblib.dump(model, filename)
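
# Example end-to-end usage of this module (a sketch; assumes amazon_baby.csv is
# present and that the caller imports LogisticRegression from sklearn.linear_model):
#
#     train, test = split_data(cleaned_data(load_data()))
#     clf = train_model(build_model(LogisticRegression()), train)
#     save_model(clf, 'logistic')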