analyticsindiamagazine · October 4, 2019 10:58
diff --git a/Predict_The_Book_Price_Solution.ipynb b/Predict_The_Book_Price_Solution.ipynb
 {
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "Article-Book_price.ipynb",
      "provenance": [],
      "collapsed_sections": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "accelerator": "GPU"
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "FIvlJKqoP1Zq",
        "colab_type": "text"
      },
      "source": [
        "# The Complete Solution To MachineHack's Predict The Book Price Hackathon\n",
        "\n",
        "\n",
        "---\n",
        "\n",
        "### Predict The Book Price Hackathon\n",
        "\n",
        "“The so-called paradoxes of an author, to which a reader takes exception, often exist not in the author’s book at all, but rather in the reader’s head.” – Friedrich Nietzsche\n",
        "\n",
        "Books are open doors to the unimagined worlds which is unique to every person. It is more than just a hobby for many. There are many among us who prefer to spend more time with books than anything else.\n",
        "\n",
        "Here we explore a big database of books. Books of different genres, from thousands of authors. In this challenge, participants are required to use the dataset to build a Machine Learning model to predict the price of books based on a given set of features.\n",
        "\n",
        "Size of training set: 6237 records\n",
        "Size of test set: 1560 records\n",
        "\n",
        "FEATURES:\n",
        "\n",
        "* Title: The title of the book\n",
        "* Author: The author(s) of the book.\n",
        "* Edition: The edition of the book eg (Paperback,– Import, 26 Apr 2018)\n",
        "* Reviews: The customer reviews about the book\n",
        "* Ratings: The customer ratings of the book\n",
        "* Synopsis: The synopsis of the book\n",
        "* Genre: The genre the book belongs to\n",
        "* BookCategory: The department the book is usually available at.\n",
        "* Price: The price of the book (Target variable)\n",
        "\n",
        "\n",
        "Click [here](https://www.machinehack.com/course/predict-the-price-of-books/) to participate in the hackathon.\n",
        "\n",
        "---\n",
        "\n",
        "This python notebook contains the complete step by step guide to work on the above mentioned hackathon.Use this notebook to learn and adapt this work to better your score.\n",
        "\n",
        "### Approach\n",
        "\n",
        "1. Exploring The Data Sets\n",
        "2. Cleaning, Processing and Generating New Features\n",
        "1. Building A Regressor \n",
        "2. Optimizing The Hyperparameters Using Bayesian Optimization\n",
        "\n",
        "The above steps are explained in detail as follows."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "H3K1ySNMS7b1",
        "colab_type": "text"
      },
      "source": [
        "## 1. Exploring The Data Sets\n",
        "\n",
        "\n",
        "---\n",
        "\n",
        "\n",
        "In this step, we will import the datasets and will do a simple analysis that will help us process the data before predictive modeling.\n",
        "\n",
        "This block involves:\n",
        "\n",
        "* Importing the data\n",
        "* Understanding the features and their characterstics \n",
        "* Noting key observations from the data.\n",
        "\n",
        "\n",
        "\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "VaXTvorZKITI",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "import pandas as pd"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "efVgoYL3Lnn4",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "train = pd.read_excel(\"Data/Data_Train.xlsx\")\n",
        "test = pd.read_excel(\"Data/Data_Test.xlsx\")"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "CSXtnNyy0ndp",
        "colab_type": "code",
        "outputId": "a9a15fb4-aa89-4604-c0f1-5c17aed061a3",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 1000
        }
      },
      "source": [
        "train.head(50)"
      ],
      "execution_count": 19,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>Title</th>\n",
              "      <th>Author</th>\n",
              "      <th>Edition</th>\n",
              "      <th>Reviews</th>\n",
              "      <th>Ratings</th>\n",
              "      <th>Synopsis</th>\n",
              "      <th>Genre</th>\n",
              "      <th>BookCategory</th>\n",
              "      <th>Price</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>The Prisoner's Gold (The Hunters 3)</td>\n",
              "      <td>Chris Kuzneski</td>\n",
              "      <td>Paperback,– 10 Mar 2016</td>\n",
              "      <td>4.0 out of 5 stars</td>\n",
              "      <td>8 customer reviews</td>\n",
              "      <td>THE HUNTERS return in their third brilliant no...</td>\n",
              "      <td>Action &amp; Adventure (Books)</td>\n",
              "      <td>Action &amp; Adventure</td>\n",
              "      <td>220.00</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>Guru Dutt: A Tragedy in Three Acts</td>\n",
              "      <td>Arun Khopkar</td>\n",
              "      <td>Paperback,– 7 Nov 2012</td>\n",
              "      <td>3.9 out of 5 stars</td>\n",
              "      <td>14 customer reviews</td>\n",
              "      <td>A layered portrait of a troubled genius for wh...</td>\n",
              "      <td>Cinema &amp; Broadcast (Books)</td>\n",
              "      <td>Biographies, Diaries &amp; True Accounts</td>\n",
              "      <td>202.93</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>Leviathan (Penguin Classics)</td>\n",
              "      <td>Thomas Hobbes</td>\n",
              "      <td>Paperback,– 25 Feb 1982</td>\n",
              "      <td>4.8 out of 5 stars</td>\n",
              "      <td>6 customer reviews</td>\n",
              "      <td>\"During the time men live without a common Pow...</td>\n",
              "      <td>International Relations</td>\n",
              "      <td>Humour</td>\n",
              "      <td>299.00</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>A Pocket Full of Rye (Miss Marple)</td>\n",
              "      <td>Agatha Christie</td>\n",
              "      <td>Paperback,– 5 Oct 2017</td>\n",
              "      <td>4.1 out of 5 stars</td>\n",
              "      <td>13 customer reviews</td>\n",
              "      <td>A handful of grain is found in the pocket of a...</td>\n",
              "      <td>Contemporary Fiction (Books)</td>\n",
              "      <td>Crime, Thriller &amp; Mystery</td>\n",
              "      <td>180.00</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>LIFE 70 Years of Extraordinary Photography</td>\n",
              "      <td>Editors of Life</td>\n",
              "      <td>Hardcover,– 10 Oct 2006</td>\n",
              "      <td>5.0 out of 5 stars</td>\n",
              "      <td>1 customer review</td>\n",
              "      <td>For seven decades, \"Life\" has been thrilling t...</td>\n",
              "      <td>Photography Textbooks</td>\n",
              "      <td>Arts, Film &amp; Photography</td>\n",
              "      <td>965.62</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>5</th>\n",
              "      <td>ChiRunning: A Revolutionary Approach to Effort...</td>\n",
              "      <td>Danny Dreyer</td>\n",
              "      <td>Paperback,– 5 May 2009</td>\n",
              "      <td>4.5 out of 5 stars</td>\n",
              "      <td>8 customer reviews</td>\n",
              "      <td>The revised edition of the bestselling ChiRunn...</td>\n",
              "      <td>Healthy Living &amp; Wellness (Books)</td>\n",
              "      <td>Sports</td>\n",
              "      <td>900.00</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>6</th>\n",
              "      <td>Death on the Nile (Poirot)</td>\n",
              "      <td>Agatha Christie</td>\n",
              "      <td>Paperback,– 5 Oct 2017</td>\n",
              "      <td>4.4 out of 5 stars</td>\n",
              "      <td>72 customer reviews</td>\n",
              "      <td>Agatha Christie’s most exotic murder mystery\\n...</td>\n",
              "      <td>Crime, Thriller &amp; Mystery (Books)</td>\n",
              "      <td>Crime, Thriller &amp; Mystery</td>\n",
              "      <td>224.00</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>7</th>\n",
              "      <td>Yoga Your Home Practice Companion: A Complete ...</td>\n",
              "      <td>Sivananda Yoga Vedanta Centre</td>\n",
              "      <td>Hardcover,– Import, 1 Mar 2018</td>\n",
              "      <td>4.7 out of 5 stars</td>\n",
              "      <td>16 customer reviews</td>\n",
              "      <td>Achieve a healthy body, mental alertness, and ...</td>\n",
              "      <td>Sports Training &amp; Coaching (Books)</td>\n",
              "      <td>Sports</td>\n",
              "      <td>836.00</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>8</th>\n",
              "      <td>Karmayogi: A Biography of E. Sreedharan</td>\n",
              "      <td>M S Ashokan</td>\n",
              "      <td>Paperback,– 15 Dec 2015</td>\n",
              "      <td>4.2 out of 5 stars</td>\n",
              "      <td>111 customer reviews</td>\n",
              "      <td>Karmayogi is the dramatic and inspiring story ...</td>\n",
              "      <td>Biographies &amp; Autobiographies (Books)</td>\n",
              "      <td>Biographies, Diaries &amp; True Accounts</td>\n",
              "      <td>130.00</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>9</th>\n",
              "      <td>The Iron King (The Accursed Kings, Book 1)</td>\n",
              "      <td>Maurice Druon</td>\n",
              "      <td>Paperback,– 26 Mar 2013</td>\n",
              "      <td>4.0 out of 5 stars</td>\n",
              "      <td>1 customer review</td>\n",
              "      <td>‘This is the original game of thrones’ George ...</td>\n",
              "      <td>Action &amp; Adventure (Books)</td>\n",
              "      <td>Action &amp; Adventure</td>\n",
              "      <td>695.00</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>10</th>\n",
              "      <td>Battle for Sanskrit: Is Sanskrit Political or ...</td>\n",
              "      <td>Rajiv Malhotra</td>\n",
              "      <td>Paperback,– 20 Jan 2017</td>\n",
              "      <td>4.9 out of 5 stars</td>\n",
              "      <td>132 customer reviews</td>\n",
              "      <td>There is a new awakening in India that is chal...</td>\n",
              "      <td>Asian History</td>\n",
              "      <td>Language, Linguistics &amp; Writing</td>\n",
              "      <td>373.00</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>11</th>\n",
              "      <td>Blockchain Revolution: How the Technology Behi...</td>\n",
              "      <td>Don Tapscott, Alex Tapscott</td>\n",
              "      <td>Paperback,– Import, 14 Jun 2018</td>\n",
              "      <td>3.5 out of 5 stars</td>\n",
              "      <td>17 customer reviews</td>\n",
              "      <td>THE DEFINITIVE BOOK ON HOW THE TECHNOLOGY BEHI...</td>\n",
              "      <td>Banks &amp; Banking</td>\n",
              "      <td>Computing, Internet &amp; Digital Media</td>\n",
              "      <td>309.00</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>12</th>\n",
              "      <td>Tai-Pan: The Second Novel of the Asian Saga</td>\n",
              "      <td>James Clavell</td>\n",
              "      <td>Paperback,– 1 Jul 1999</td>\n",
              "      <td>4.1 out of 5 stars</td>\n",
              "      <td>4 customer reviews</td>\n",
              "      <td>Set in the turbulent days of the founding of H...</td>\n",
              "      <td>Action &amp; Adventure (Books)</td>\n",
              "      <td>Action &amp; Adventure</td>\n",
              "      <td>379.00</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>13</th>\n",
              "      <td>The Art of Shaolin Kung Fu: The Secrets of Kun...</td>\n",
              "      <td>Wong Kiew Kit</td>\n",
              "      <td>Paperback,– 15 Nov 2002</td>\n",
              "      <td>5.0 out of 5 stars</td>\n",
              "      <td>3 customer reviews</td>\n",
              "      <td>The Art of Shaolin Kung Fu is the ultimate gui...</td>\n",
              "      <td>Asian History</td>\n",
              "      <td>Sports</td>\n",
              "      <td>1066.00</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>14</th>\n",
              "      <td>Anil's Ghost</td>\n",
              "      <td>Michael Ondaatje</td>\n",
              "      <td>Paperback,– 1 Sep 2011</td>\n",
              "      <td>3.8 out of 5 stars</td>\n",
              "      <td>5 customer reviews</td>\n",
              "      <td>Anil's Ghost transports us to Sri Lanka, a cou...</td>\n",
              "      <td>Action &amp; Adventure (Books)</td>\n",
              "      <td>Romance</td>\n",
              "      <td>381.22</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>15</th>\n",
              "      <td>Superman: An Origin Story (DC Comics Super Her...</td>\n",
              "      <td>Matthew K Manning</td>\n",
              "      <td>Paperback,– 26 Feb 2015</td>\n",
              "      <td>5.0 out of 5 stars</td>\n",
              "      <td>2 customer reviews</td>\n",
              "      <td>One day, an alien orphan crash-lands on Earth ...</td>\n",
              "      <td>Comics &amp; Mangas (Books)</td>\n",
              "      <td>Comics &amp; Mangas</td>\n",
              "      <td>287.00</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>16</th>\n",
              "      <td>My First Book of London</td>\n",
              "      <td>Charlotte Guillain, Roland Dry</td>\n",
              "      <td>Hardcover,– 8 Mar 2018</td>\n",
              "      <td>5.0 out of 5 stars</td>\n",
              "      <td>1 customer review</td>\n",
              "      <td>London is one of the most exciting cities in t...</td>\n",
              "      <td>Children's Mysteries &amp; Curiosities (Books)</td>\n",
              "      <td>Crime, Thriller &amp; Mystery</td>\n",
              "      <td>162.00</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>17</th>\n",
              "      <td>Naruto: Itachi's Story, Vol. 1: Daylight</td>\n",
              "      <td>Takashi Yano</td>\n",
              "      <td>Paperback,– 1 Nov 2016</td>\n",
              "      <td>4.9 out of 5 stars</td>\n",
              "      <td>23 customer reviews</td>\n",
              "      <td>A new series of prose novels, straight from th...</td>\n",
              "      <td>Mangas</td>\n",
              "      <td>Comics &amp; Mangas</td>\n",
              "      <td>587.00</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>18</th>\n",
              "      <td>The Story of Philosophy</td>\n",
              "      <td>Will Durant</td>\n",
              "      <td>Mass Market Paperback,– 1 Jan 1991</td>\n",
              "      <td>4.5 out of 5 stars</td>\n",
              "      <td>76 customer reviews</td>\n",
              "      <td>A brilliant and concise account of the lives a...</td>\n",
              "      <td>Biographies &amp; Autobiographies (Books)</td>\n",
              "      <td>Biographies, Diaries &amp; True Accounts</td>\n",
              "      <td>291.00</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>19</th>\n",
              "      <td>Introducing Data Science: Big Data, Machine Le...</td>\n",
              "      <td>Davy Cielen, Arno D.B. Meysman, Mohamed Ali</td>\n",
              "      <td>Paperback,– 2016</td>\n",
              "      <td>4.3 out of 5 stars</td>\n",
              "      <td>5 customer reviews</td>\n",
              "      <td>Introducing Data Science explains vital data s...</td>\n",
              "      <td>Artificial Intelligence</td>\n",
              "      <td>Computing, Internet &amp; Digital Media</td>\n",
              "      <td>352.00</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>20</th>\n",
              "      <td>The Travelling Cat Chronicles</td>\n",
              "      <td>Hiro Arikawa</td>\n",
              "      <td>Hardcover,– 24 Nov 2018</td>\n",
              "      <td>4.9 out of 5 stars</td>\n",
              "      <td>10 customer reviews</td>\n",
              "      <td>A stunning hardback edition of this surprise h...</td>\n",
              "      <td>Action &amp; Adventure (Books)</td>\n",
              "      <td>Action &amp; Adventure</td>\n",
              "      <td>339.00</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>21</th>\n",
              "      <td>Messi: Updated Edition (Luca Caioli)</td>\n",
              "      <td>Luca Caioli</td>\n",
              "      <td>Paperback,– Import, 4 Oct 2018</td>\n",
              "      <td>5.0 out of 5 stars</td>\n",
              "      <td>2 customer reviews</td>\n",
              "      <td>FROM THE BESTSELLING AUTHOR OF RONALDO AND NEY...</td>\n",
              "      <td>Biographies &amp; Autobiographies (Books)</td>\n",
              "      <td>Sports</td>\n",
              "      <td>309.00</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>22</th>\n",
              "      <td>The Dark Arena</td>\n",
              "      <td>Mario Puzo</td>\n",
              "      <td>Paperback,– 5 Jul 2012</td>\n",
              "      <td>3.1 out of 5 stars</td>\n",
              "      <td>2 customer reviews</td>\n",
              "      <td>MARIO PUZO'S FIRST ACCLAIMED NOVEL, BEFORE HIS...</td>\n",
              "      <td>Action &amp; Adventure (Books)</td>\n",
              "      <td>Action &amp; Adventure</td>\n",
              "      <td>262.00</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>23</th>\n",
              "      <td>Sap Fico Beginner's Handbook: Step By Step Acr...</td>\n",
              "      <td>Murugesan Ramaswamy</td>\n",
              "      <td>Paperback,– 1 Nov 2014</td>\n",
              "      <td>3.1 out of 5 stars</td>\n",
              "      <td>10 customer reviews</td>\n",
              "      <td>Step by Step Screenshots Guided Handholding Ap...</td>\n",
              "      <td>Software &amp; Business Applications (Books)</td>\n",
              "      <td>Computing, Internet &amp; Digital Media</td>\n",
              "      <td>607.00</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>24</th>\n",
              "      <td>German Grammar You Really Need To Know: Teach ...</td>\n",
              "      <td>Jenny Russ</td>\n",
              "      <td>Paperback,– 31 Aug 2012</td>\n",
              "      <td>4.8 out of 5 stars</td>\n",
              "      <td>9 customer reviews</td>\n",
              "      <td>Comprehensive and clear explanations of key gr...</td>\n",
              "      <td>German</td>\n",
              "      <td>Language, Linguistics &amp; Writing</td>\n",
              "      <td>536.00</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>25</th>\n",
              "      <td>Stealth of Nations: The Global Rise of the Inf...</td>\n",
              "      <td>Robert Neuwirth</td>\n",
              "      <td>Hardcover,– Deckle Edge, 18 Oct 2011</td>\n",
              "      <td>4.0 out of 5 stars</td>\n",
              "      <td>1 customer review</td>\n",
              "      <td>• Thousands of Africans head to China each yea...</td>\n",
              "      <td>International Business</td>\n",
              "      <td>Politics</td>\n",
              "      <td>621.00</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>26</th>\n",
              "      <td>Fixed!: Cash and Corruption in Cricket</td>\n",
              "      <td>Shantanu Guha Ray</td>\n",
              "      <td>Paperback,– 1 Mar 2016</td>\n",
              "      <td>4.3 out of 5 stars</td>\n",
              "      <td>15 customer reviews</td>\n",
              "      <td>Who killed Hansie Cronje and Bob Woolmer? Have...</td>\n",
              "      <td>Cricket (Books)</td>\n",
              "      <td>Sports</td>\n",
              "      <td>286.98</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>27</th>\n",
              "      <td>The Buddha Box Set</td>\n",
              "      <td>Osamu Tezuka</td>\n",
              "      <td>Paperback,– Box set, 15 Jun 2014</td>\n",
              "      <td>4.3 out of 5 stars</td>\n",
              "      <td>34 customer reviews</td>\n",
              "      <td>The classic eight volume graphic novel series ...</td>\n",
              "      <td>Comics &amp; Graphic Novels (Books)</td>\n",
              "      <td>Comics &amp; Mangas</td>\n",
              "      <td>3779.00</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>28</th>\n",
              "      <td>30 Years of WrestleMania (Wwe)</td>\n",
              "      <td>Brian Shields</td>\n",
              "      <td>Hardcover,– 15 Sep 2014</td>\n",
              "      <td>5.0 out of 5 stars</td>\n",
              "      <td>17 customer reviews</td>\n",
              "      <td>From the creators of WWE 50 and the official W...</td>\n",
              "      <td>PC &amp; Video Games (Books)</td>\n",
              "      <td>Sports</td>\n",
              "      <td>802.00</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>29</th>\n",
              "      <td>Memories, Dreams, Reflections (Vintage)</td>\n",
              "      <td>C. G. Jung, Aniela Jaffe, Clara Winston, Richa...</td>\n",
              "      <td>Paperback,– 23 Apr 1989</td>\n",
              "      <td>5.0 out of 5 stars</td>\n",
              "      <td>9 customer reviews</td>\n",
              "      <td>An eye-opening biography of one of the most in...</td>\n",
              "      <td>Biographies &amp; Autobiographies (Books)</td>\n",
              "      <td>Biographies, Diaries &amp; True Accounts</td>\n",
              "      <td>588.26</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>30</th>\n",
              "      <td>The Hit (Will Robie series)</td>\n",
              "      <td>David Baldacci</td>\n",
              "      <td>Paperback,– 21 Nov 2013</td>\n",
              "      <td>4.0 out of 5 stars</td>\n",
              "      <td>32 customer reviews</td>\n",
              "      <td>The Hit is David Baldacci's blockbuster follow...</td>\n",
              "      <td>Contemporary Fiction (Books)</td>\n",
              "      <td>Crime, Thriller &amp; Mystery</td>\n",
              "      <td>340.00</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>31</th>\n",
              "      <td>The Archer Files: The Complete Short Stories o...</td>\n",
              "      <td>Ross Macdonald, Tom Nolan</td>\n",
              "      <td>Paperback,– 21 Jul 2015</td>\n",
              "      <td>3.9 out of 5 stars</td>\n",
              "      <td>2 customer reviews</td>\n",
              "      <td>No matter what cases private eye Lew Archer ta...</td>\n",
              "      <td>Short Stories (Books)</td>\n",
              "      <td>Humour</td>\n",
              "      <td>799.00</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>32</th>\n",
              "      <td>Light on Life (Arkana)</td>\n",
              "      <td>Hart Defouw</td>\n",
              "      <td>Paperback,– 14 Oct 2000</td>\n",
              "      <td>4.4 out of 5 stars</td>\n",
              "      <td>49 customer reviews</td>\n",
              "      <td>Light on Life brings the insight and wisdom of...</td>\n",
              "      <td>Astrology</td>\n",
              "      <td>Biographies, Diaries &amp; True Accounts</td>\n",
              "      <td>395.10</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>33</th>\n",
              "      <td>The Doomsday Conspiracy</td>\n",
              "      <td>Sidney Sheldon</td>\n",
              "      <td>Paperback,– 5 Sep 2005</td>\n",
              "      <td>4.2 out of 5 stars</td>\n",
              "      <td>49 customer reviews</td>\n",
              "      <td>The Doomsday Conspiracy, by Sidney Sheldon, is...</td>\n",
              "      <td>Romance (Books)</td>\n",
              "      <td>Romance</td>\n",
              "      <td>225.00</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>34</th>\n",
              "      <td>The Art of Uncharted 4: A Thief's End</td>\n",
              "      <td>Naughty Dog</td>\n",
              "      <td>Hardcover,– 10 May 2016</td>\n",
              "      <td>4.3 out of 5 stars</td>\n",
              "      <td>10 customer reviews</td>\n",
              "      <td>Journey alongside Nathan Drake once again, as ...</td>\n",
              "      <td>Design</td>\n",
              "      <td>Comics &amp; Mangas</td>\n",
              "      <td>1780.00</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>35</th>\n",
              "      <td>HANNIBAL RISING</td>\n",
              "      <td>Thomas Harris</td>\n",
              "      <td>Paperback,– 2019</td>\n",
              "      <td>4.3 out of 5 stars</td>\n",
              "      <td>8 customer reviews</td>\n",
              "      <td>_________________________ hannibal lecter wasn...</td>\n",
              "      <td>Contemporary Fiction (Books)</td>\n",
              "      <td>Crime, Thriller &amp; Mystery</td>\n",
              "      <td>309.00</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>36</th>\n",
              "      <td>Data Structures Using C</td>\n",
              "      <td>Reema Thareja</td>\n",
              "      <td>Paperback,– 11 Jun 2014</td>\n",
              "      <td>4.4 out of 5 stars</td>\n",
              "      <td>62 customer reviews</td>\n",
              "      <td>This second edition of Data Structures Using C...</td>\n",
              "      <td>Introductory &amp; Beginning Programming</td>\n",
              "      <td>Language, Linguistics &amp; Writing</td>\n",
              "      <td>559.00</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>37</th>\n",
              "      <td>Don't Ask Any Old Bloke for Directions</td>\n",
              "      <td>Palden Gyatso Tenzing</td>\n",
              "      <td>Paperback,– 17 Apr 2009</td>\n",
              "      <td>3.9 out of 5 stars</td>\n",
              "      <td>61 customer reviews</td>\n",
              "      <td>Exploring a karmic network in 25,320 kilometre...</td>\n",
              "      <td>Travel (Books)</td>\n",
              "      <td>Biographies, Diaries &amp; True Accounts</td>\n",
              "      <td>224.00</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>38</th>\n",
              "      <td>Prince of Fire</td>\n",
              "      <td>Daniel Silva</td>\n",
              "      <td>Paperback,– 30 Nov 2006</td>\n",
              "      <td>5.0 out of 5 stars</td>\n",
              "      <td>1 customer review</td>\n",
              "      <td>On a bright morning in Rome, a terrible explos...</td>\n",
              "      <td>Action &amp; Adventure (Books)</td>\n",
              "      <td>Crime, Thriller &amp; Mystery</td>\n",
              "      <td>565.00</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>39</th>\n",
              "      <td>HBR's 10 Must Reads: On Making Smart Decisions...</td>\n",
              "      <td>HBR</td>\n",
              "      <td>Paperback,– 1 Dec 2013</td>\n",
              "      <td>4.6 out of 5 stars</td>\n",
              "      <td>8 customer reviews</td>\n",
              "      <td>NEW from the bestselling HBR's 10 Must Reads s...</td>\n",
              "      <td>Sports (Books)</td>\n",
              "      <td>Sports</td>\n",
              "      <td>511.00</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>40</th>\n",
              "      <td>Politics and the English Language (Penguin Mod...</td>\n",
              "      <td>George Orwell</td>\n",
              "      <td>Paperback,– 3 Jan 2013</td>\n",
              "      <td>5.0 out of 5 stars</td>\n",
              "      <td>7 customer reviews</td>\n",
              "      <td>'Politics and the English Language' is widely ...</td>\n",
              "      <td>Communications</td>\n",
              "      <td>Language, Linguistics &amp; Writing</td>\n",
              "      <td>144.00</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>41</th>\n",
              "      <td>S.</td>\n",
              "      <td>Doug Dorst</td>\n",
              "      <td>Hardcover,– 28 Sep 2013</td>\n",
              "      <td>4.7 out of 5 stars</td>\n",
              "      <td>5 customer reviews</td>\n",
              "      <td>One book. Two readers. A world of mystery, men...</td>\n",
              "      <td>Contemporary Fiction (Books)</td>\n",
              "      <td>Crime, Thriller &amp; Mystery</td>\n",
              "      <td>1455.90</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>42</th>\n",
              "      <td>Oxford Learner's Thesaurus</td>\n",
              "      <td>Dict.</td>\n",
              "      <td>Paperback,– 20 Aug 2008</td>\n",
              "      <td>3.8 out of 5 stars</td>\n",
              "      <td>18 customer reviews</td>\n",
              "      <td>A dictionary of synonyms and opposites that he...</td>\n",
              "      <td>Foreign Languages</td>\n",
              "      <td>Language, Linguistics &amp; Writing</td>\n",
              "      <td>645.00</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>43</th>\n",
              "      <td>Lost in Translation</td>\n",
              "      <td>Ella Frances Sanders</td>\n",
              "      <td>Hardcover,– 8 Jul 2015</td>\n",
              "      <td>4.5 out of 5 stars</td>\n",
              "      <td>16 customer reviews</td>\n",
              "      <td>Did you know that the Japanese have a word to ...</td>\n",
              "      <td>Linguistics (Books)</td>\n",
              "      <td>Language, Linguistics &amp; Writing</td>\n",
              "      <td>427.00</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>44</th>\n",
              "      <td>Daisy Jones and The Six</td>\n",
              "      <td>Taylor Jenkins Reid</td>\n",
              "      <td>Hardcover,– 2019</td>\n",
              "      <td>4.6 out of 5 stars</td>\n",
              "      <td>6 customer reviews</td>\n",
              "      <td>picked as &lt; u&gt; one to watch in 2019&lt;/u&gt; by &lt;th...</td>\n",
              "      <td>Music Books</td>\n",
              "      <td>Romance</td>\n",
              "      <td>560.00</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>45</th>\n",
              "      <td>Bear Grylls: Two All-Action Adventures: Facing...</td>\n",
              "      <td>Bear Grylls</td>\n",
              "      <td>Paperback,– 3 Jul 2014</td>\n",
              "      <td>5.0 out of 5 stars</td>\n",
              "      <td>3 customer reviews</td>\n",
              "      <td>Bear Grylls is one of the world's most famous ...</td>\n",
              "      <td>Outdoor Survival Skills (Books)</td>\n",
              "      <td>Sports</td>\n",
              "      <td>395.10</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>46</th>\n",
              "      <td>My First Book Of Beethoven: Favorite Pieces In...</td>\n",
              "      <td>David Dutkanicz</td>\n",
              "      <td>Paperback,– 29 Dec 2006</td>\n",
              "      <td>5.0 out of 5 stars</td>\n",
              "      <td>2 customer reviews</td>\n",
              "      <td>Specially arranged and simplified, these piece...</td>\n",
              "      <td>Music Books</td>\n",
              "      <td>Arts, Film &amp; Photography</td>\n",
              "      <td>386.00</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>47</th>\n",
              "      <td>Byculla to Bangkok</td>\n",
              "      <td>S. Hussain Zaidi</td>\n",
              "      <td>Paperback,– 22 Feb 2014</td>\n",
              "      <td>4.1 out of 5 stars</td>\n",
              "      <td>98 customer reviews</td>\n",
              "      <td>The much-awaited sequel to Dongri to Dubai Aft...</td>\n",
              "      <td>True Accounts (Books)</td>\n",
              "      <td>Biographies, Diaries &amp; True Accounts</td>\n",
              "      <td>226.00</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>48</th>\n",
              "      <td>Assassin's Creed: Forsaken</td>\n",
              "      <td>Oliver Bowden</td>\n",
              "      <td>Paperback,– 1 Nov 2012</td>\n",
              "      <td>4.8 out of 5 stars</td>\n",
              "      <td>12 customer reviews</td>\n",
              "      <td>The new novelization based on the bestselling ...</td>\n",
              "      <td>Action &amp; Adventure (Books)</td>\n",
              "      <td>Action &amp; Adventure</td>\n",
              "      <td>303.00</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>49</th>\n",
              "      <td>Mastering Manga with Mark Crilley: 30 Drawing ...</td>\n",
              "      <td>Mark Crilley</td>\n",
              "      <td>Paperback,– 30 Mar 2012</td>\n",
              "      <td>4.8 out of 5 stars</td>\n",
              "      <td>14 customer reviews</td>\n",
              "      <td>It's THE book on manga from YouTube's most pop...</td>\n",
              "      <td>Mangas</td>\n",
              "      <td>Arts, Film &amp; Photography</td>\n",
              "      <td>1383.00</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "                                                Title  ...    Price\n",
              "0                 The Prisoner's Gold (The Hunters 3)  ...   220.00\n",
              "1                  Guru Dutt: A Tragedy in Three Acts  ...   202.93\n",
              "2                        Leviathan (Penguin Classics)  ...   299.00\n",
              "3                  A Pocket Full of Rye (Miss Marple)  ...   180.00\n",
              "4          LIFE 70 Years of Extraordinary Photography  ...   965.62\n",
              "5   ChiRunning: A Revolutionary Approach to Effort...  ...   900.00\n",
              "6                          Death on the Nile (Poirot)  ...   224.00\n",
              "7   Yoga Your Home Practice Companion: A Complete ...  ...   836.00\n",
              "8             Karmayogi: A Biography of E. Sreedharan  ...   130.00\n",
              "9          The Iron King (The Accursed Kings, Book 1)  ...   695.00\n",
              "10  Battle for Sanskrit: Is Sanskrit Political or ...  ...   373.00\n",
              "11  Blockchain Revolution: How the Technology Behi...  ...   309.00\n",
              "12        Tai-Pan: The Second Novel of the Asian Saga  ...   379.00\n",
              "13  The Art of Shaolin Kung Fu: The Secrets of Kun...  ...  1066.00\n",
              "14                                       Anil's Ghost  ...   381.22\n",
              "15  Superman: An Origin Story (DC Comics Super Her...  ...   287.00\n",
              "16                            My First Book of London  ...   162.00\n",
              "17           Naruto: Itachi's Story, Vol. 1: Daylight  ...   587.00\n",
              "18                            The Story of Philosophy  ...   291.00\n",
              "19  Introducing Data Science: Big Data, Machine Le...  ...   352.00\n",
              "20                      The Travelling Cat Chronicles  ...   339.00\n",
              "21               Messi: Updated Edition (Luca Caioli)  ...   309.00\n",
              "22                                     The Dark Arena  ...   262.00\n",
              "23  Sap Fico Beginner's Handbook: Step By Step Acr...  ...   607.00\n",
              "24  German Grammar You Really Need To Know: Teach ...  ...   536.00\n",
              "25  Stealth of Nations: The Global Rise of the Inf...  ...   621.00\n",
              "26             Fixed!: Cash and Corruption in Cricket  ...   286.98\n",
              "27                                 The Buddha Box Set  ...  3779.00\n",
              "28                     30 Years of WrestleMania (Wwe)  ...   802.00\n",
              "29            Memories, Dreams, Reflections (Vintage)  ...   588.26\n",
              "30                        The Hit (Will Robie series)  ...   340.00\n",
              "31  The Archer Files: The Complete Short Stories o...  ...   799.00\n",
              "32                             Light on Life (Arkana)  ...   395.10\n",
              "33                            The Doomsday Conspiracy  ...   225.00\n",
              "34              The Art of Uncharted 4: A Thief's End  ...  1780.00\n",
              "35                                    HANNIBAL RISING  ...   309.00\n",
              "36                            Data Structures Using C  ...   559.00\n",
              "37             Don't Ask Any Old Bloke for Directions  ...   224.00\n",
              "38                                     Prince of Fire  ...   565.00\n",
              "39  HBR's 10 Must Reads: On Making Smart Decisions...  ...   511.00\n",
              "40  Politics and the English Language (Penguin Mod...  ...   144.00\n",
              "41                                                 S.  ...  1455.90\n",
              "42                         Oxford Learner's Thesaurus  ...   645.00\n",
              "43                                Lost in Translation  ...   427.00\n",
              "44                            Daisy Jones and The Six  ...   560.00\n",
              "45  Bear Grylls: Two All-Action Adventures: Facing...  ...   395.10\n",
              "46  My First Book Of Beethoven: Favorite Pieces In...  ...   386.00\n",
              "47                                 Byculla to Bangkok  ...   226.00\n",
              "48                         Assassin's Creed: Forsaken  ...   303.00\n",
              "49  Mastering Manga with Mark Crilley: 30 Drawing ...  ...  1383.00\n",
              "\n",
              "[50 rows x 9 columns]"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 19
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Kkl8cxrfL6dj",
        "colab_type": "code",
        "outputId": "508b59aa-96d2-45d4-bb16-8e54e19a1a81",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 272
        }
      },
      "source": [
        "print(train.info())"
      ],
      "execution_count": 20,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "<class 'pandas.core.frame.DataFrame'>\n",
            "RangeIndex: 6237 entries, 0 to 6236\n",
            "Data columns (total 9 columns):\n",
            "Title           6237 non-null object\n",
            "Author          6237 non-null object\n",
            "Edition         6237 non-null object\n",
            "Reviews         6237 non-null object\n",
            "Ratings         6237 non-null object\n",
            "Synopsis        6237 non-null object\n",
            "Genre           6237 non-null object\n",
            "BookCategory    6237 non-null object\n",
            "Price           6237 non-null float64\n",
            "dtypes: float64(1), object(8)\n",
            "memory usage: 438.6+ KB\n",
            "None\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "SspSSxPEv8gq",
        "colab_type": "code",
        "outputId": "3594f69b-3a74-4f11-b84d-3580d9e25bd8",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 111
        }
      },
      "source": [
        "train.describe(include = 'all').head(2)"
      ],
      "execution_count": 21,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>Title</th>\n",
              "      <th>Author</th>\n",
              "      <th>Edition</th>\n",
              "      <th>Reviews</th>\n",
              "      <th>Ratings</th>\n",
              "      <th>Synopsis</th>\n",
              "      <th>Genre</th>\n",
              "      <th>BookCategory</th>\n",
              "      <th>Price</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>count</th>\n",
              "      <td>6237</td>\n",
              "      <td>6237</td>\n",
              "      <td>6237</td>\n",
              "      <td>6237</td>\n",
              "      <td>6237</td>\n",
              "      <td>6237</td>\n",
              "      <td>6237</td>\n",
              "      <td>6237</td>\n",
              "      <td>6237.0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>unique</th>\n",
              "      <td>5568</td>\n",
              "      <td>3679</td>\n",
              "      <td>3370</td>\n",
              "      <td>36</td>\n",
              "      <td>342</td>\n",
              "      <td>5549</td>\n",
              "      <td>345</td>\n",
              "      <td>11</td>\n",
              "      <td>NaN</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "       Title Author Edition Reviews Ratings Synopsis Genre BookCategory   Price\n",
              "count   6237   6237    6237    6237    6237     6237  6237         6237  6237.0\n",
              "unique  5568   3679    3370      36     342     5549   345           11     NaN"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 21
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "U3MCre4TwiZc",
        "colab_type": "code",
        "outputId": "39fc7de7-3721-41b8-b03e-ec2de9e55358",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 68
        }
      },
      "source": [
        "print(train.columns)"
      ],
      "execution_count": 22,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Index(['Title', 'Author', 'Edition', 'Reviews', 'Ratings', 'Synopsis', 'Genre',\n",
            "       'BookCategory', 'Price'],\n",
            "      dtype='object')\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "b1ZRckAF3Frk",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Dropping Synopsis, since we are not going to use this feature here\n",
        "\n",
        "train = train[['Title', 'Author', 'Edition', 'Reviews', 'Ratings','Genre',\n",
        "               'BookCategory', 'Price']]\n",
        "\n",
        "test = test[['Title', 'Author', 'Edition', 'Reviews', 'Ratings','Genre',\n",
        "               'BookCategory']]"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "YmktsksI1cR5",
        "colab_type": "code",
        "outputId": "8d30ad31-a035-404c-dac3-f9b6d7bb2ec0",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 204
        }
      },
      "source": [
        "train.head()"
      ],
      "execution_count": 24,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>Title</th>\n",
              "      <th>Author</th>\n",
              "      <th>Edition</th>\n",
              "      <th>Reviews</th>\n",
              "      <th>Ratings</th>\n",
              "      <th>Genre</th>\n",
              "      <th>BookCategory</th>\n",
              "      <th>Price</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>The Prisoner's Gold (The Hunters 3)</td>\n",
              "      <td>Chris Kuzneski</td>\n",
              "      <td>Paperback,– 10 Mar 2016</td>\n",
              "      <td>4.0 out of 5 stars</td>\n",
              "      <td>8 customer reviews</td>\n",
              "      <td>Action &amp; Adventure (Books)</td>\n",
              "      <td>Action &amp; Adventure</td>\n",
              "      <td>220.00</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>Guru Dutt: A Tragedy in Three Acts</td>\n",
              "      <td>Arun Khopkar</td>\n",
              "      <td>Paperback,– 7 Nov 2012</td>\n",
              "      <td>3.9 out of 5 stars</td>\n",
              "      <td>14 customer reviews</td>\n",
              "      <td>Cinema &amp; Broadcast (Books)</td>\n",
              "      <td>Biographies, Diaries &amp; True Accounts</td>\n",
              "      <td>202.93</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>Leviathan (Penguin Classics)</td>\n",
              "      <td>Thomas Hobbes</td>\n",
              "      <td>Paperback,– 25 Feb 1982</td>\n",
              "      <td>4.8 out of 5 stars</td>\n",
              "      <td>6 customer reviews</td>\n",
              "      <td>International Relations</td>\n",
              "      <td>Humour</td>\n",
              "      <td>299.00</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>A Pocket Full of Rye (Miss Marple)</td>\n",
              "      <td>Agatha Christie</td>\n",
              "      <td>Paperback,– 5 Oct 2017</td>\n",
              "      <td>4.1 out of 5 stars</td>\n",
              "      <td>13 customer reviews</td>\n",
              "      <td>Contemporary Fiction (Books)</td>\n",
              "      <td>Crime, Thriller &amp; Mystery</td>\n",
              "      <td>180.00</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>LIFE 70 Years of Extraordinary Photography</td>\n",
              "      <td>Editors of Life</td>\n",
              "      <td>Hardcover,– 10 Oct 2006</td>\n",
              "      <td>5.0 out of 5 stars</td>\n",
              "      <td>1 customer review</td>\n",
              "      <td>Photography Textbooks</td>\n",
              "      <td>Arts, Film &amp; Photography</td>\n",
              "      <td>965.62</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "                                        Title  ...   Price\n",
              "0         The Prisoner's Gold (The Hunters 3)  ...  220.00\n",
              "1          Guru Dutt: A Tragedy in Three Acts  ...  202.93\n",
              "2                Leviathan (Penguin Classics)  ...  299.00\n",
              "3          A Pocket Full of Rye (Miss Marple)  ...  180.00\n",
              "4  LIFE 70 Years of Extraordinary Photography  ...  965.62\n",
              "\n",
              "[5 rows x 8 columns]"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 24
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "ZHq8QtcL_tD3",
        "colab_type": "code",
        "outputId": "40a32bf6-0a47-42ad-c78f-e0c5a6dab844",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 170
        }
      },
      "source": [
        "train.isnull().sum()"
      ],
      "execution_count": 25,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "Title           0\n",
              "Author          0\n",
              "Edition         0\n",
              "Reviews         0\n",
              "Ratings         0\n",
              "Genre           0\n",
              "BookCategory    0\n",
              "Price           0\n",
              "dtype: int64"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 25
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "WqhA0WyMXMD4",
        "colab_type": "text"
      },
      "source": [
        "#### KEY OBSERVATIONS\n",
        "\n",
        "* No null values in the dataset to treat.\n",
        "\n",
        "* Some books have multiple authors in the Author column, which need to be processed and separated.\n",
        "\n",
        "* The Edition column can be split into 3 different features (Type, Month and Year).\n",
        "\n",
        "* The Reviews and Ratings columns are mislabelled: Reviews holds the star rating (e.g. '4.0 out of 5 stars') and Ratings holds the review count (e.g. '8 customer reviews').\n",
        "\n",
        "* Both Reviews and Ratings need to be cleaned into numeric values: a float for the star rating and an integer for the review count.\n",
        "\n",
        "* Like authors, a book may belong to multiple categories and genres. Thus we will need to split both the Genre and BookCategory columns.\n",
        "\n"
      ]
    },
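    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text"
      },
      "source": [
        "A minimal sketch of the numeric clean-up mentioned above, assuming every value starts with the number, as in the rows displayed earlier ('4.0 out of 5 stars', '8 customer reviews'). The helper names `parse_star_rating` and `parse_review_count` are hypothetical, and the handling of thousands separators is an assumption about the data.\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Illustrative helpers only (hypothetical names, not part of the original solution)\n",
        "\n",
        "def parse_star_rating(text):\n",
        "  # '4.0 out of 5 stars' -> 4.0\n",
        "  return float(text.split()[0])\n",
        "\n",
        "def parse_review_count(text):\n",
        "  # '8 customer reviews' -> 8; commas are dropped in case counts use thousands separators\n",
        "  return int(text.split()[0].replace(',', ''))\n",
        "\n",
        "print(parse_star_rating('4.0 out of 5 stars'))\n",
        "print(parse_review_count('8 customer reviews'))"
      ],
      "execution_count": 0,
      "outputs": []
    },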
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "7aRbnRnAUfmf",
        "colab_type": "text"
      },
      "source": [
        "## Processing The Data\n",
        "---\n",
        "\n",
        "In this stage we will clean the data and make it ready for modeling.\n",
        "\n",
        "This stage involves:\n",
        "\n",
        "* Cleaning the data and generating new features\n",
        "* Encoding all categorical variables\n",
        "* Scaling the data\n",
        "\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "A_6ocQREGz-2",
        "colab_type": "text"
      },
      "source": [
        "### Cleaning And Generating New Features"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "U9zwu4rzqxGd",
        "colab_type": "text"
      },
      "source": [
        "#### Splitting Edition Column\n",
        "\n",
        "---\n",
        "\n",
        "We will clean the Edition column and create 3 new features from it: Type, Month and Year.\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "9FYwQDv4poys",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# A method to clean and restructure the Edition column\n",
        "\n",
        "def split_edition(data):\n",
        "  \n",
        "  edition = list(data)\n",
        "  \n",
        "  # The part before ',– ' is the binding type, e.g. 'PAPERBACK'\n",
        "  ed_type = [i.split(\",– \")[0].strip().upper() for i in edition]\n",
        "  \n",
        "  # The part after ',– ' holds the publication date, e.g. '10 Mar 2016'\n",
        "  edit_date = [i.split(\",– \")[1].strip() for i in edition]\n",
        "  \n",
        "  # Keep only the last two tokens, which are the month and the year\n",
        "  m_y = [i.split()[-2:] for i in edit_date]\n",
        "  \n",
        "  # If only one token is present (e.g. just a year), mark the month as 'NA'\n",
        "  for i in range(len(m_y)):\n",
        "    if len(m_y[i]) == 1:\n",
        "      m_y[i].insert(0,'NA')\n",
        "      \n",
        "  # Based on the given dataset, below is the list of possible values for the month\n",
        "  \n",
        "  months =  ['Apr','Aug','Dec','Feb', 'Jan', 'Jul','Jun','Mar','May','NA','Nov','Oct','Sep']\n",
        "  \n",
        "  # Unrecognised months become 'NA'; non-numeric years become 0\n",
        "  ed_month = [m_y[i][0].upper() if m_y[i][0] in months else 'NA' for i in range(len(m_y))]\n",
        "  ed_year = [int(m_y[i][1].strip()) if m_y[i][1].isdigit() else 0 for i in range(len(m_y))]\n",
        "  \n",
        "  return ed_type, ed_month, ed_year"
      ],
      "execution_count": 0,
      "outputs": []
    },
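    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text"
      },
      "source": [
        "As a quick sanity check, `split_edition` can be tried on a few Edition strings of the kind shown earlier. This is only an illustrative sketch: the sample list `sample_editions` is a hypothetical name, and the cell assumes the function above has already been defined.\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Illustrative check only; the sample values mirror Edition strings seen in the data\n",
        "sample_editions = ['Paperback,– 10 Mar 2016',\n",
        "                   'Hardcover,– 2019',\n",
        "                   'Paperback,– Import, 26 Apr 2018']\n",
        "\n",
        "ed_type, ed_month, ed_year = split_edition(sample_editions)\n",
        "\n",
        "print(ed_type)   # ['PAPERBACK', 'HARDCOVER', 'PAPERBACK']\n",
        "print(ed_month)  # ['MAR', 'NA', 'APR'] (a year-only edition has no month)\n",
        "print(ed_year)   # [2016, 2019, 2018]"
      ],
      "execution_count": 0,
      "outputs": []
    },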
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "G4QApAYejy8F",
        "colab_type": "text"
      },
      "source": [
        "#### Splitting Author Column\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "6qWwA0KIm8O1",
        "colab_type": "text"
      },
      "source": [
        "To split a column into multiple features, we must first determine how many features the existing column can account for. Hence, to split the Author column into multiple authors, we must know the maximum number of authors for a single book in the given datasets. We will combine the test and training sets to do so.\n",
        "\n",
        "We will also store the name of every author, which we will later need for label encoding.\n",
        "\n",
        "We will apply the same principles for the Genre as well as the BookCategory columns."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Bg4iBFtiMFJJ",
        "colab_type": "code",
        "outputId": "e85740e0-c6f9-4615-b762-9500b963659e",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 51
        }
      },
      "source": [
        "# Identifying the maximum number of authors for a single book from the given datasets\n",
        "authors_1 = list(train['Author'])\n",
        "authors_2 = list(test['Author'])\n",
        "\n",
        "authors_1.extend(authors_2)\n",
        "\n",
        "authorslis = [i.split(\",\") for i in authors_1]\n",
        "\n",
        "max_authors = 1\n",
        "for i in authorslis:\n",
        "  if len(i) >= max_authors:\n",
        "    max_authors = len(i)\n",
        "print(\"Max. number of authors for a single book = \",max_authors)\n",
        "\n",
        "# Print the index of every record that has the maximum number of authors\n",
        "for i in range(len(authorslis)):\n",
        "  if len(authorslis[i]) == max_authors:\n",
        "    print(i)    \n",
        "    \n",
        "all_authors = [author.strip().upper() for listin in authorslis for author in listin]\n",
        "    "
      ],
      "execution_count": 26,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Max. number of authors for a single book =  7\n",
            "7008\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "kAdns8NkPPdV",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# A method to split the Author column into 7 new columns\n",
        "def split_authors(data):\n",
        "  \n",
        "  authors = list(data)\n",
        "  \n",
        "  A1 = []\n",
        "  A2 = []\n",
        "  A3 = []\n",
        "  A4 = []\n",
        "  A5 = []\n",
        "  A6 = []\n",
        "  A7 = []\n",
        "  for i in authors:\n",
        "    \n",
        "    try :\n",
        "      A1.append(i.split(',')[0].strip().upper())\n",
        "    except :\n",
        "      A1.append('NONE')\n",
        "      \n",
        "    try :\n",
        "      A2.append(i.split(',')[1].strip().upper())\n",
        "    except :\n",
        "      A2.append('NONE')\n",
        "        \n",
        "    try :\n",
        "      A3.append(i.split(',')[2].strip().upper())\n",
        "    except :\n",
        "      A3.append('NONE')\n",
        "        \n",
        "    try :\n",
        "      A4.append(i.split(',')[3].strip().upper())\n",
        "    except :\n",
        "      A4.append('NONE')\n",
        "        \n",
        "    try :\n",
        "      A5.append(i.split(',')[4].strip().upper())\n",
        "    except :\n",
        "      A5.append('NONE')\n",
        "      \n",
        "    try :\n",
        "      A6.append(i.split(',')[5].strip().upper())\n",
        "    except :\n",
        "      A6.append('NONE')\n",
        "     \n",
        "    try :\n",
        "      A7.append(i.split(',')[6].strip().upper())\n",
        "    except :\n",
        "      A7.append('NONE')\n",
        "\n",
        "      \n",
        "  return A1,A2,A3,A4,A5,A6,A7\n",
        "  \n",
        "all_authors.append('NONE')"
      ],
      "execution_count": 0,
      "outputs": []
    },
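    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text"
      },
      "source": [
        "The split functions in this section share one pattern: split on commas, clean each part, and pad with 'NONE' up to a fixed number of slots. As an alternative sketch (not the code used in this notebook), the pattern can be written once and reused. The name `split_padded` and its parameters `width` and `fill` are hypothetical.\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Alternative sketch only (hypothetical helper, not part of the original solution)\n",
        "\n",
        "def split_padded(values, width, fill='NONE'):\n",
        "  rows = []\n",
        "  for value in values:\n",
        "    # Split on commas, clean each part and keep at most `width` parts\n",
        "    parts = [p.strip().upper() for p in value.split(',')][:width]\n",
        "    # Pad short rows so that every row has exactly `width` entries\n",
        "    rows.append(parts + [fill] * (width - len(parts)))\n",
        "  # Transpose the rows into one list per slot, like A1..A7 in split_authors\n",
        "  return [list(col) for col in zip(*rows)]\n",
        "\n",
        "demo = split_padded(['Chris Kuzneski', 'First Author, Second Author'], width=3)\n",
        "print(demo)\n",
        "# [['CHRIS KUZNESKI', 'FIRST AUTHOR'], ['NONE', 'SECOND AUTHOR'], ['NONE', 'NONE']]"
      ],
      "execution_count": 0,
      "outputs": []
    },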
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "YAiwJ3Bnj89I",
        "colab_type": "text"
      },
      "source": [
        "#### Splitting Genre Column\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "VgE7l4QrY_6k",
        "colab_type": "code",
        "outputId": "8847298f-f86d-4a0a-d0ba-d3e2874f3d39",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        }
      },
      "source": [
        "# Identifying the maximum number of genres for a single book from the given datasets\n",
        "\n",
        "genre_1 = list(train['Genre'])\n",
        "genre_2 = list(test['Genre'])\n",
        "\n",
        "genre_1.extend(genre_2)\n",
        "\n",
        "genre_lis = [i.split(\",\") for i in genre_1]\n",
        "\n",
        "\n",
        "max_genres = 1\n",
        "for i in genre_lis:\n",
        "  if len(i) >= max_genres:\n",
        "    max_genres = len(i)\n",
        "print(\"Max. number of genres for a single book = \",max_genres)\n",
        "      \n",
        "all_genres = [genre.strip().upper() for listin in genre_lis for genre in listin]\n",
        "    \n"
      ],
      "execution_count": 28,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Max. number of genres for a single book =  2\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "6PZi_VEGkEi0",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# A method to split the Genre column into 2 new columns\n",
        "\n",
        "def split_genres(data):\n",
        "  \n",
        "  genres = list(data)\n",
        "  \n",
        "  G1 = []\n",
        "  G2 = []\n",
        "  \n",
        "  for i in genres:\n",
        "    \n",
        "    try :\n",
        "      G1.append(i.split(',')[0].strip().upper())\n",
        "      \n",
        "    except :\n",
        "      G1.append('NONE')\n",
        "      \n",
        "    try :\n",
        "      G2.append(i.split(',')[1].strip().upper())\n",
        "    except :\n",
        "      G2.append('NONE')\n",
        "\n",
        "\n",
        "      \n",
        "  return G1,G2\n",
        "  \n",
        "all_genres.append('NONE')"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "nigiT_6KkGM4",
        "colab_type": "text"
      },
      "source": [
        "#### Splitting BookCategory Column\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "_eRU4S1MkNDI",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        },
        "outputId": "07d4e611-6e13-4156-db04-d1679d743e71"
      },
      "source": [
        "# Identifying the maximum number of categories for a single book from the given datasets\n",
        "\n",
        "cat_1 = list(train['BookCategory'])\n",
        "cat_2 = list(test['BookCategory'])\n",
        "\n",
        "cat_1.extend(cat_2)\n",
        "\n",
        "cat_lis = [i.split(\",\") for i in cat_1]\n",
        "\n",
        "\n",
        "max_categories = 1\n",
        "for i in cat_lis:\n",
        "  if len(i) >= max_categories:\n",
        "    max_categories = len(i)\n",
        "print(\"Max. number of Categories for a single book = \",max_categories)\n",
        "\n",
        "all_categories = [cat.strip().upper() for listin in cat_lis for cat in listin]\n",
        "    "
      ],
      "execution_count": 31,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Max. number of Categories for a single book =  2\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Pc16PenecGMq",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# A method to split the BookCategory column into 2 new columns (Category1, Category2)\n",
        "\n",
        "def split_categories(data):\n",
        "  \n",
        "  cat = list(data)\n",
        "  \n",
        "  C1 = []\n",
        "  C2 = []\n",
        "\n",
        "  for i in cat:\n",
        "    \n",
        "    # First category; fall back to 'NONE' if the value is not a string (e.g. NaN)\n",
        "    try :\n",
        "      C1.append(i.split(',')[0].strip().upper())\n",
        "    except AttributeError:\n",
        "      C1.append('NONE')\n",
        "      \n",
        "    # Second category; 'NONE' when the book lists only one category\n",
        "    try :\n",
        "      C2.append(i.split(',')[1].strip().upper())\n",
        "    except (AttributeError, IndexError):\n",
        "      C2.append('NONE')\n",
        "\n",
        "  return C1,C2\n",
        "  \n",
        "all_categories.append('NONE')\n"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "YhigqezfkP1A",
        "colab_type": "text"
      },
      "source": [
        "#### Cleaning & Restructuring The Datasets"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "hpBVzD13o8Ma",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# A method to clean and restructure the datasets\n",
        "\n",
        "import re\n",
        "\n",
        "def restructure(data):\n",
        "  \n",
        "  #Cleaning Title Column\n",
        "  titles = list(data['Title'])\n",
        "  titles = [title.strip().upper() for title in titles]\n",
        "  \n",
        "  #Cleaning & Restructuring Author Column\n",
        "  a1,a2,a3,a4,a5,a6,a7 = split_authors(data['Author']) \n",
        "  \n",
        "  #Cleaning & Restructuring Edition Column\n",
        "  ed_type, ed_month, ed_year = split_edition(data['Edition'])\n",
        "  \n",
        "  #Cleaning Ratings Column\n",
        "  # Note: the raw 'Reviews' column holds the star rating text (\"... out of 5 stars\")\n",
        "  ratings = list(data['Reviews'])\n",
        "  ratings = [float(re.sub(\" out of 5 stars\", \"\", i).strip()) for i in ratings]\n",
        "  \n",
        "  #Cleaning Reviews Column\n",
        "  # Note: the raw 'Ratings' column holds the review count text (\"... customer review(s)\")\n",
        "  reviews = list(data['Ratings'])\n",
        "  reviews = [re.sub(\" customer reviews?\", \"\", i) for i in reviews]\n",
        "  reviews = [int(re.sub(\",\", \"\", i).strip()) for i in reviews ]\n",
        "  \n",
        "\n",
        "  #Cleaning & Restructuring Genre Column\n",
        "  g1, g2 = split_genres(data['Genre'])\n",
        "  \n",
        "  #Cleaning & Restructuring BookCategory Column\n",
        "  c1,c2 = split_categories(data['BookCategory'])\n",
        "\n",
        "  # Forming the Structured dataset\n",
        "  structured_data = pd.DataFrame({'Title': titles,\n",
        "                                  'Author1': a1,\n",
        "                                  'Author2': a2,\n",
        "                                  'Author3': a3,\n",
        "                                  'Author4': a4,\n",
        "                                  'Author5': a5,\n",
        "                                  'Author6': a6,\n",
        "                                  'Author7': a7,\n",
        "                                  'Edition_Type': ed_type,\n",
        "                                  'Edition_Month': ed_month,\n",
        "                                  'Edition_Year': ed_year,\n",
        "                                  'Ratings': ratings,\n",
        "                                  'Reviews': reviews,\n",
        "                                  'Genre1': g1,\n",
        "                                  'Genre2': g2,\n",
        "                                  'Category1': c1,\n",
        "                                  'Category2': c2\n",
        "                                  \n",
        "                               })\n",
        "  \n",
        "  return structured_data\n",
        "\n",
        " "
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "yIEpd23TSuNh",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 227
        },
        "outputId": "2a81c418-f804-412d-ff7e-9a2f606cacce"
      },
      "source": [
        "restructure(train).head(3)"
      ],
      "execution_count": 35,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>Title</th>\n",
              "      <th>Author1</th>\n",
              "      <th>Author2</th>\n",
              "      <th>Author3</th>\n",
              "      <th>Author4</th>\n",
              "      <th>Author5</th>\n",
              "      <th>Author6</th>\n",
              "      <th>Author7</th>\n",
              "      <th>Edition_Type</th>\n",
              "      <th>Edition_Month</th>\n",
              "      <th>Edition_Year</th>\n",
              "      <th>Ratings</th>\n",
              "      <th>Reviews</th>\n",
              "      <th>Genre1</th>\n",
              "      <th>Genre2</th>\n",
              "      <th>Category1</th>\n",
              "      <th>Category2</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>THE PRISONER'S GOLD (THE HUNTERS 3)</td>\n",
              "      <td>CHRIS KUZNESKI</td>\n",
              "      <td>NONE</td>\n",
              "      <td>NONE</td>\n",
              "      <td>NONE</td>\n",
              "      <td>NONE</td>\n",
              "      <td>NONE</td>\n",
              "      <td>NONE</td>\n",
              "      <td>PAPERBACK</td>\n",
              "      <td>MAR</td>\n",
              "      <td>2016</td>\n",
              "      <td>4.0</td>\n",
              "      <td>8</td>\n",
              "      <td>ACTION &amp; ADVENTURE (BOOKS)</td>\n",
              "      <td>NONE</td>\n",
              "      <td>ACTION &amp; ADVENTURE</td>\n",
              "      <td>NONE</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>GURU DUTT: A TRAGEDY IN THREE ACTS</td>\n",
              "      <td>ARUN KHOPKAR</td>\n",
              "      <td>NONE</td>\n",
              "      <td>NONE</td>\n",
              "      <td>NONE</td>\n",
              "      <td>NONE</td>\n",
              "      <td>NONE</td>\n",
              "      <td>NONE</td>\n",
              "      <td>PAPERBACK</td>\n",
              "      <td>NOV</td>\n",
              "      <td>2012</td>\n",
              "      <td>3.9</td>\n",
              "      <td>14</td>\n",
              "      <td>CINEMA &amp; BROADCAST (BOOKS)</td>\n",
              "      <td>NONE</td>\n",
              "      <td>BIOGRAPHIES</td>\n",
              "      <td>DIARIES &amp; TRUE ACCOUNTS</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>LEVIATHAN (PENGUIN CLASSICS)</td>\n",
              "      <td>THOMAS HOBBES</td>\n",
              "      <td>NONE</td>\n",
              "      <td>NONE</td>\n",
              "      <td>NONE</td>\n",
              "      <td>NONE</td>\n",
              "      <td>NONE</td>\n",
              "      <td>NONE</td>\n",
              "      <td>PAPERBACK</td>\n",
              "      <td>FEB</td>\n",
              "      <td>1982</td>\n",
              "      <td>4.8</td>\n",
              "      <td>6</td>\n",
              "      <td>INTERNATIONAL RELATIONS</td>\n",
              "      <td>NONE</td>\n",
              "      <td>HUMOUR</td>\n",
              "      <td>NONE</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "                                 Title  ...                Category2\n",
              "0  THE PRISONER'S GOLD (THE HUNTERS 3)  ...                     NONE\n",
              "1   GURU DUTT: A TRAGEDY IN THREE ACTS  ...  DIARIES & TRUE ACCOUNTS\n",
              "2         LEVIATHAN (PENGUIN CLASSICS)  ...                     NONE\n",
              "\n",
              "[3 rows x 17 columns]"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 35
        }
      ]
    },
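    {
      "cell_type": "code",
      "metadata": {
        "id": "NusnSRPbCGqO",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Restructured features for the training and test sets\n",
        "X_train = restructure(train)\n",
        "\n",
        "# Target values: taken from the last column of the raw training set\n",
        "Y_train = train.iloc[:, -1].values\n",
        "\n",
        "X_test = restructure(test)\n"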
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "elvbEa4fkgLp",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 424
        },
        "outputId": "c072e4ee-f1ad-45c4-d298-c761443660f3"
      },
      "source": [
        "X_train.describe(include = 'all')"
      ],
      "execution_count": 37,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>Title</th>\n",
              "      <th>Author1</th>\n",
              "      <th>Author2</th>\n",
              "      <th>Author3</th>\n",
              "      <th>Author4</th>\n",
              "      <th>Author5</th>\n",
              "      <th>Author6</th>\n",
              "      <th>Author7</th>\n",
              "      <th>Edition_Type</th>\n",
              "      <th>Edition_Month</th>\n",
              "      <th>Edition_Year</th>\n",
              "      <th>Ratings</th>\n",
              "      <th>Reviews</th>\n",
              "      <th>Genre1</th>\n",
              "      <th>Genre2</th>\n",
              "      <th>Category1</th>\n",
              "      <th>Category2</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>count</th>\n",
              "      <td>6237</td>\n",
              "      <td>6237</td>\n",
              "      <td>6237</td>\n",
              "      <td>6237</td>\n",
              "      <td>6237</td>\n",
              "      <td>6237</td>\n",
              "      <td>6237</td>\n",
              "      <td>6237</td>\n",
              "      <td>6237</td>\n",
              "      <td>6237</td>\n",
              "      <td>6237.000000</td>\n",
              "      <td>6237.000000</td>\n",
              "      <td>6237.000000</td>\n",
              "      <td>6237</td>\n",
              "      <td>6237</td>\n",
              "      <td>6237</td>\n",
              "      <td>6237</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>unique</th>\n",
              "      <td>5564</td>\n",
              "      <td>3633</td>\n",
              "      <td>264</td>\n",
              "      <td>73</td>\n",
              "      <td>21</td>\n",
              "      <td>5</td>\n",
              "      <td>1</td>\n",
              "      <td>1</td>\n",
              "      <td>19</td>\n",
              "      <td>13</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>345</td>\n",
              "      <td>27</td>\n",
              "      <td>11</td>\n",
              "      <td>6</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>top</th>\n",
              "      <td>CASINO ROYALE: JAMES BOND 007 (VINTAGE)</td>\n",
              "      <td>AGATHA CHRISTIE</td>\n",
              "      <td>NONE</td>\n",
              "      <td>NONE</td>\n",
              "      <td>NONE</td>\n",
              "      <td>NONE</td>\n",
              "      <td>NONE</td>\n",
              "      <td>NONE</td>\n",
              "      <td>PAPERBACK</td>\n",
              "      <td>OCT</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>ACTION &amp; ADVENTURE (BOOKS)</td>\n",
              "      <td>NONE</td>\n",
              "      <td>ACTION &amp; ADVENTURE</td>\n",
              "      <td>NONE</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>freq</th>\n",
              "      <td>4</td>\n",
              "      <td>69</td>\n",
              "      <td>5929</td>\n",
              "      <td>6159</td>\n",
              "      <td>6214</td>\n",
              "      <td>6233</td>\n",
              "      <td>6237</td>\n",
              "      <td>6237</td>\n",
              "      <td>5193</td>\n",
              "      <td>639</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>947</td>\n",
              "      <td>5594</td>\n",
              "      <td>818</td>\n",
              "      <td>3297</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>mean</th>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>2005.101972</td>\n",
              "      <td>4.293202</td>\n",
              "      <td>35.984287</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>std</th>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>116.821510</td>\n",
              "      <td>0.662501</td>\n",
              "      <td>149.995031</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>min</th>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>0.000000</td>\n",
              "      <td>1.000000</td>\n",
              "      <td>1.000000</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>25%</th>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>2010.000000</td>\n",
              "      <td>4.000000</td>\n",
              "      <td>2.000000</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>50%</th>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>2014.000000</td>\n",
              "      <td>4.400000</td>\n",
              "      <td>7.000000</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>75%</th>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>2017.000000</td>\n",
              "      <td>4.800000</td>\n",
              "      <td>22.000000</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>max</th>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>2019.000000</td>\n",
              "      <td>5.000000</td>\n",
              "      <td>6090.000000</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "                                          Title  ... Category2\n",
              "count                                      6237  ...      6237\n",
              "unique                                     5564  ...         6\n",
              "top     CASINO ROYALE: JAMES BOND 007 (VINTAGE)  ...      NONE\n",
              "freq                                          4  ...      3297\n",
              "mean                                        NaN  ...       NaN\n",
              "std                                         NaN  ...       NaN\n",
              "min                                         NaN  ...       NaN\n",
              "25%                                         NaN  ...       NaN\n",
              "50%                                         NaN  ...       NaN\n",
              "75%                                         NaN  ...       NaN\n",
              "max                                         NaN  ...       NaN\n",
              "\n",
              "[11 rows x 17 columns]"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 37
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "A0tjLxd1CaZw",
        "colab_type": "code",
        "outputId": "c12e426f-fcc6-4252-d1d2-123743078d27",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 391
        }
      },
      "source": [
        "X_train.info()"
      ],
      "execution_count": 38,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "<class 'pandas.core.frame.DataFrame'>\n",
            "RangeIndex: 6237 entries, 0 to 6236\n",
            "Data columns (total 17 columns):\n",
            "Title            6237 non-null object\n",
            "Author1          6237 non-null object\n",
            "Author2          6237 non-null object\n",
            "Author3          6237 non-null object\n",
            "Author4          6237 non-null object\n",
            "Author5          6237 non-null object\n",
            "Author6          6237 non-null object\n",
            "Author7          6237 non-null object\n",
            "Edition_Type     6237 non-null object\n",
            "Edition_Month    6237 non-null object\n",
            "Edition_Year     6237 non-null int64\n",
            "Ratings          6237 non-null float64\n",
            "Reviews          6237 non-null int64\n",
            "Genre1           6237 non-null object\n",
            "Genre2           6237 non-null object\n",
            "Category1        6237 non-null object\n",
            "Category2        6237 non-null object\n",
            "dtypes: float64(1), int64(2), object(14)\n",
            "memory usage: 828.4+ KB\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "x5NiiOGZCmLh",
        "colab_type": "code",
        "outputId": "a29bc96b-6ff1-4365-f2bd-df26f8f8aa43",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 391
        }
      },
      "source": [
        "X_test.info()"
      ],
      "execution_count": 39,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "<class 'pandas.core.frame.DataFrame'>\n",
            "RangeIndex: 1560 entries, 0 to 1559\n",
            "Data columns (total 17 columns):\n",
            "Title            1560 non-null object\n",
            "Author1          1560 non-null object\n",
            "Author2          1560 non-null object\n",
            "Author3          1560 non-null object\n",
            "Author4          1560 non-null object\n",
            "Author5          1560 non-null object\n",
            "Author6          1560 non-null object\n",
            "Author7          1560 non-null object\n",
            "Edition_Type     1560 non-null object\n",
            "Edition_Month    1560 non-null object\n",
            "Edition_Year     1560 non-null int64\n",
            "Ratings          1560 non-null float64\n",
            "Reviews          1560 non-null int64\n",
            "Genre1           1560 non-null object\n",
            "Genre2           1560 non-null object\n",
            "Category1        1560 non-null object\n",
            "Category2        1560 non-null object\n",
            "dtypes: float64(1), int64(2), object(14)\n",
            "memory usage: 207.3+ KB\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "5ACDwNtkUpv3",
        "colab_type": "text"
      },
      "source": [
        "### Encoding Categorical Features"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "hU7meCrmE2gd",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# A method for finding the unique items across two lists\n",
        "def unique_items(list1, list2):\n",
        "  # Build a set union instead of extending list1, so the caller's list is not mutated\n",
        "  return list(set(list1) | set(list2))"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "wbG0hWZt_fOY",
        "colab_type": "code",
        "outputId": "cde54847-e3db-4678-e424-afe73805cfd8",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        }
      },
      "source": [
        "from sklearn.preprocessing import LabelEncoder\n",
        "\n",
        "le_Title = LabelEncoder()\n",
        "all_titles = unique_items(list(X_train.Title),list(X_test.Title))\n",
        "le_Title.fit(all_titles)\n",
        "\n",
        "le_Edition_Type = LabelEncoder()\n",
        "all_etypes = unique_items(list(X_train.Edition_Type),list(X_test.Edition_Type))\n",
        "le_Edition_Type.fit(all_etypes)\n",
        "\n",
        "\n",
        "le_Edition_Month = LabelEncoder()\n",
        "all_em = unique_items(list(X_train.Edition_Month),list(X_test.Edition_Month))\n",
        "le_Edition_Month.fit(all_em)\n",
        "\n",
        "le_Author = LabelEncoder()\n",
        "all_Authors = list(set(all_authors))\n",
        "le_Author.fit(all_Authors)\n",
        "\n",
        "le_Genre = LabelEncoder()\n",
        "all_Genres = list(set(all_genres))\n",
        "le_Genre.fit(all_Genres)\n",
        "\n",
        "le_Category = LabelEncoder()\n",
        "all_Categories = list(set(all_categories))\n",
        "le_Category.fit(all_Categories)\n"
      ],
      "execution_count": 41,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "LabelEncoder()"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 41
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "tuuGL--qImdF",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Encode each categorical column with its fitted LabelEncoder\n",
        "X_train['Title'] = le_Title.transform(X_train['Title'])\n",
        "X_train['Edition_Type'] = le_Edition_Type.transform(X_train['Edition_Type'])\n",
        "X_train['Edition_Month'] = le_Edition_Month.transform(X_train['Edition_Month'])\n",
        "\n",
        "# The author, genre and category columns each share one encoder\n",
        "for col in ['Author1', 'Author2', 'Author3', 'Author4', 'Author5', 'Author6', 'Author7']:\n",
        "  X_train[col] = le_Author.transform(X_train[col])\n",
        "\n",
        "for col in ['Genre1', 'Genre2']:\n",
        "  X_train[col] = le_Genre.transform(X_train[col])\n",
        "\n",
        "for col in ['Category1', 'Category2']:\n",
        "  X_train[col] = le_Category.transform(X_train[col])\n"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "cwCtnU6Ey-ZQ",
        "colab_type": "code",
        "outputId": "3327e69c-4385-4a2c-8211-586bcf4aa183",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 204
        }
      },
      "source": [
        "X_train.head()"
      ],
      "execution_count": 43,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>Title</th>\n",
              "      <th>Author1</th>\n",
              "      <th>Author2</th>\n",
              "      <th>Author3</th>\n",
              "      <th>Author4</th>\n",
              "      <th>Author5</th>\n",
              "      <th>Author6</th>\n",
              "      <th>Author7</th>\n",
              "      <th>Edition_Type</th>\n",
              "      <th>Edition_Month</th>\n",
              "      <th>Edition_Year</th>\n",
              "      <th>Ratings</th>\n",
              "      <th>Reviews</th>\n",
              "      <th>Genre1</th>\n",
              "      <th>Genre2</th>\n",
              "      <th>Category1</th>\n",
              "      <th>Category2</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>5802</td>\n",
              "      <td>797</td>\n",
              "      <td>3073</td>\n",
              "      <td>3073</td>\n",
              "      <td>3073</td>\n",
              "      <td>3073</td>\n",
              "      <td>3073</td>\n",
              "      <td>3073</td>\n",
              "      <td>13</td>\n",
              "      <td>7</td>\n",
              "      <td>2016</td>\n",
              "      <td>4.0</td>\n",
              "      <td>8</td>\n",
              "      <td>0</td>\n",
              "      <td>267</td>\n",
              "      <td>0</td>\n",
              "      <td>12</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>2120</td>\n",
              "      <td>391</td>\n",
              "      <td>3073</td>\n",
              "      <td>3073</td>\n",
              "      <td>3073</td>\n",
              "      <td>3073</td>\n",
              "      <td>3073</td>\n",
              "      <td>3073</td>\n",
              "      <td>13</td>\n",
              "      <td>10</td>\n",
              "      <td>2012</td>\n",
              "      <td>3.9</td>\n",
              "      <td>14</td>\n",
              "      <td>80</td>\n",
              "      <td>267</td>\n",
              "      <td>2</td>\n",
              "      <td>6</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>2984</td>\n",
              "      <td>4353</td>\n",
              "      <td>3073</td>\n",
              "      <td>3073</td>\n",
              "      <td>3073</td>\n",
              "      <td>3073</td>\n",
              "      <td>3073</td>\n",
              "      <td>3073</td>\n",
              "      <td>13</td>\n",
              "      <td>3</td>\n",
              "      <td>1982</td>\n",
              "      <td>4.8</td>\n",
              "      <td>6</td>\n",
              "      <td>211</td>\n",
              "      <td>267</td>\n",
              "      <td>8</td>\n",
              "      <td>12</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>189</td>\n",
              "      <td>78</td>\n",
              "      <td>3073</td>\n",
              "      <td>3073</td>\n",
              "      <td>3073</td>\n",
              "      <td>3073</td>\n",
              "      <td>3073</td>\n",
              "      <td>3073</td>\n",
              "      <td>13</td>\n",
              "      <td>11</td>\n",
              "      <td>2017</td>\n",
              "      <td>4.1</td>\n",
              "      <td>13</td>\n",
              "      <td>98</td>\n",
              "      <td>267</td>\n",
              "      <td>5</td>\n",
              "      <td>16</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>2987</td>\n",
              "      <td>1221</td>\n",
              "      <td>3073</td>\n",
              "      <td>3073</td>\n",
              "      <td>3073</td>\n",
              "      <td>3073</td>\n",
              "      <td>3073</td>\n",
              "      <td>3073</td>\n",
              "      <td>8</td>\n",
              "      <td>11</td>\n",
              "      <td>2006</td>\n",
              "      <td>5.0</td>\n",
              "      <td>1</td>\n",
              "      <td>284</td>\n",
              "      <td>267</td>\n",
              "      <td>1</td>\n",
              "      <td>7</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "   Title  Author1  Author2  Author3  ...  Genre1  Genre2  Category1  Category2\n",
              "0   5802      797     3073     3073  ...       0     267          0         12\n",
              "1   2120      391     3073     3073  ...      80     267          2          6\n",
              "2   2984     4353     3073     3073  ...     211     267          8         12\n",
              "3    189       78     3073     3073  ...      98     267          5         16\n",
              "4   2987     1221     3073     3073  ...     284     267          1          7\n",
              "\n",
              "[5 rows x 17 columns]"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 43
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "5EDSYQRwDODF",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "\n",
        "X_test['Title'] = le_Title.transform(X_test['Title'])\n",
        "\n",
        "X_test['Edition_Type'] = le_Edition_Type.transform(X_test['Edition_Type'])\n",
        "\n",
        "\n",
        "\n",
        "X_test['Edition_Month'] = le_Edition_Month.transform(X_test['Edition_Month'])\n",
        "\n",
        "X_test['Author1'] = le_Author.transform(X_test['Author1'])\n",
        "X_test['Author2'] = le_Author.transform(X_test['Author2'])\n",
        "X_test['Author3'] = le_Author.transform(X_test['Author3'])\n",
        "X_test['Author4'] = le_Author.transform(X_test['Author4'])\n",
        "X_test['Author5'] = le_Author.transform(X_test['Author5'])\n",
        "X_test['Author6'] = le_Author.transform(X_test['Author6'])\n",
        "X_test['Author7'] = le_Author.transform(X_test['Author7'])\n",
        "\n",
        "\n",
        "X_test['Genre1'] = le_Genre.transform(X_test['Genre1'])\n",
        "X_test['Genre2'] = le_Genre.transform(X_test['Genre2'])\n",
        "\n",
        "\n",
        "X_test['Category1'] = le_Category.transform(X_test['Category1'])\n",
        "X_test['Category2'] = le_Category.transform(X_test['Category2'])"
      ],
      "execution_count": 0,
      "outputs": []
    },
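    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text"
      },
      "source": [
        "`LabelEncoder.transform` raises a `ValueError` when it meets a label it did not see during `fit`, so the test-set transforms above only succeed if each encoder was fitted on every value that occurs in both sets. Below is a minimal sketch of that idea on toy data; the label values and the name `le_demo` are made up for illustration and are not taken from the notebook.\n",
        "\n",
        "```python\n",
        "from sklearn.preprocessing import LabelEncoder\n",
        "\n",
        "# Toy stand-ins for one categorical column in the train and test frames\n",
        "train_col = ['Paperback', 'Hardcover', 'Paperback']\n",
        "test_col = ['Hardcover', 'Board book']  # 'Board book' never occurs in train\n",
        "\n",
        "le_demo = LabelEncoder()\n",
        "# Fit on the union of both columns so every test label is known to the encoder\n",
        "le_demo.fit(train_col + test_col)\n",
        "\n",
        "train_codes = le_demo.transform(train_col)\n",
        "test_codes = le_demo.transform(test_col)\n",
        "```\n",
        "\n",
        "Fitting on the training column alone would make the second call fail on `'Board book'`."
      ]
    },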
    {
      "cell_type": "code",
      "metadata": {
        "id": "z4_3jjKJD5dC",
        "colab_type": "code",
        "outputId": "1ca85fbf-0bac-4ed0-ace1-9df6c6dd0875",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 204
        }
      },
      "source": [
        "X_test.head()"
      ],
      "execution_count": 45,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>Title</th>\n",
              "      <th>Author1</th>\n",
              "      <th>Author2</th>\n",
              "      <th>Author3</th>\n",
              "      <th>Author4</th>\n",
              "      <th>Author5</th>\n",
              "      <th>Author6</th>\n",
              "      <th>Author7</th>\n",
              "      <th>Edition_Type</th>\n",
              "      <th>Edition_Month</th>\n",
              "      <th>Edition_Year</th>\n",
              "      <th>Ratings</th>\n",
              "      <th>Reviews</th>\n",
              "      <th>Genre1</th>\n",
              "      <th>Genre2</th>\n",
              "      <th>Category1</th>\n",
              "      <th>Category2</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>5082</td>\n",
              "      <td>4058</td>\n",
              "      <td>3073</td>\n",
              "      <td>3073</td>\n",
              "      <td>3073</td>\n",
              "      <td>3073</td>\n",
              "      <td>3073</td>\n",
              "      <td>3073</td>\n",
              "      <td>12</td>\n",
              "      <td>11</td>\n",
              "      <td>1986</td>\n",
              "      <td>4.4</td>\n",
              "      <td>960</td>\n",
              "      <td>324</td>\n",
              "      <td>267</td>\n",
              "      <td>5</td>\n",
              "      <td>16</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>2906</td>\n",
              "      <td>1401</td>\n",
              "      <td>3073</td>\n",
              "      <td>3073</td>\n",
              "      <td>3073</td>\n",
              "      <td>3073</td>\n",
              "      <td>3073</td>\n",
              "      <td>3073</td>\n",
              "      <td>13</td>\n",
              "      <td>0</td>\n",
              "      <td>2018</td>\n",
              "      <td>5.0</td>\n",
              "      <td>1</td>\n",
              "      <td>273</td>\n",
              "      <td>267</td>\n",
              "      <td>4</td>\n",
              "      <td>9</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>751</td>\n",
              "      <td>949</td>\n",
              "      <td>3073</td>\n",
              "      <td>3073</td>\n",
              "      <td>3073</td>\n",
              "      <td>3073</td>\n",
              "      <td>3073</td>\n",
              "      <td>3073</td>\n",
              "      <td>13</td>\n",
              "      <td>7</td>\n",
              "      <td>2011</td>\n",
              "      <td>5.0</td>\n",
              "      <td>4</td>\n",
              "      <td>314</td>\n",
              "      <td>267</td>\n",
              "      <td>14</td>\n",
              "      <td>12</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>6232</td>\n",
              "      <td>169</td>\n",
              "      <td>3073</td>\n",
              "      <td>3073</td>\n",
              "      <td>3073</td>\n",
              "      <td>3073</td>\n",
              "      <td>3073</td>\n",
              "      <td>3073</td>\n",
              "      <td>13</td>\n",
              "      <td>9</td>\n",
              "      <td>2016</td>\n",
              "      <td>4.1</td>\n",
              "      <td>11</td>\n",
              "      <td>295</td>\n",
              "      <td>267</td>\n",
              "      <td>4</td>\n",
              "      <td>9</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>3790</td>\n",
              "      <td>3505</td>\n",
              "      <td>3073</td>\n",
              "      <td>3073</td>\n",
              "      <td>3073</td>\n",
              "      <td>3073</td>\n",
              "      <td>3073</td>\n",
              "      <td>3073</td>\n",
              "      <td>13</td>\n",
              "      <td>2</td>\n",
              "      <td>2011</td>\n",
              "      <td>4.4</td>\n",
              "      <td>9</td>\n",
              "      <td>235</td>\n",
              "      <td>267</td>\n",
              "      <td>10</td>\n",
              "      <td>11</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "   Title  Author1  Author2  Author3  ...  Genre1  Genre2  Category1  Category2\n",
              "0   5082     4058     3073     3073  ...     324     267          5         16\n",
              "1   2906     1401     3073     3073  ...     273     267          4          9\n",
              "2    751      949     3073     3073  ...     314     267         14         12\n",
              "3   6232      169     3073     3073  ...     295     267          4          9\n",
              "4   3790     3505     3073     3073  ...     235     267         10         11\n",
              "\n",
              "[5 rows x 17 columns]"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 45
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "QHxuxxl_UxVE",
        "colab_type": "text"
      },
      "source": [
        "### Scaling The Features"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "vOotF7qUz2HF",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Feature Scaling\n",
        "\n",
        "from sklearn.preprocessing import StandardScaler\n",
        "sc = StandardScaler()\n",
        "\n",
        "X_train = sc.fit_transform(X_train)\n",
        "\n",
        "X_test = sc.transform(X_test)\n",
        "\n",
        "# Reshaping to fit the scaler, which expects a 2-D array\n",
        "Y_train = Y_train.reshape((len(Y_train), 1))\n",
        "\n",
        "# Refitting sc on the target: from here on sc holds the target's mean and scale\n",
        "Y_train = sc.fit_transform(Y_train)\n",
        "\n",
        "#Restoring the original shape after scaling\n",
        "Y_train = Y_train.ravel()"
      ],
      "execution_count": 0,
      "outputs": []
    },
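    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text"
      },
      "source": [
        "The cell above fits `sc` twice, first on the features and then on the target, so afterwards `sc` only remembers the target statistics. Keeping one scaler per array makes that explicit. Below is a minimal sketch on toy data; `x_scaler`, `y_scaler` and the arrays are hypothetical names and values, not part of the notebook.\n",
        "\n",
        "```python\n",
        "import numpy as np\n",
        "from sklearn.preprocessing import StandardScaler\n",
        "\n",
        "# Toy stand-ins for the notebook's X_train and Y_train (values are made up)\n",
        "X_demo = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])\n",
        "y_demo = np.array([100.0, 250.0, 400.0])\n",
        "\n",
        "x_scaler = StandardScaler()  # hypothetical name: scaler for the features\n",
        "y_scaler = StandardScaler()  # hypothetical name: scaler for the target\n",
        "\n",
        "X_scaled = x_scaler.fit_transform(X_demo)\n",
        "# StandardScaler expects a 2-D array, so reshape the target to a single column\n",
        "y_scaled = y_scaler.fit_transform(y_demo.reshape(-1, 1)).ravel()\n",
        "\n",
        "# Map scaled target values back to the original price scale\n",
        "y_restored = y_scaler.inverse_transform(y_scaled.reshape(-1, 1)).ravel()\n",
        "```\n",
        "\n",
        "With one scaler per array, predictions made on the scaled target can be mapped back through `y_scaler` without touching how new feature rows are scaled."
      ]
    },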
    {
      "cell_type": "code",
      "metadata": {
        "id": "sd80FmEw97yE",
        "colab_type": "code",
        "outputId": "f9bbeefc-c021-4ee7-cd66-87dfcd16568e",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        }
      },
      "source": [
        "X_train.shape #SC"
      ],
      "execution_count": 47,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "(6237, 17)"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 47
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "xIikkIR1_bZY",
        "colab_type": "code",
        "outputId": "b1691d65-7391-4810-923b-35e4cb659c23",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        }
      },
      "source": [
        "Y_train.shape #SC"
      ],
      "execution_count": 48,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "(6237,)"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 48
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "TEXau171tDqU",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 238
        },
        "outputId": "746931a5-edcd-4223-adde-3ec3b6bbde0a"
      },
      "source": [
        "X_train"
      ],
      "execution_count": 49,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "array([[ 1.22489462, -1.11014305,  0.10361038, ..., -0.04054244,\n",
              "        -1.21387465,  0.3167796 ],\n",
              "       [-0.64983923, -1.40876249,  0.10361038, ..., -0.04054244,\n",
              "        -0.82060991, -1.88135243],\n",
              "       [-0.20992341,  1.50535127,  0.10361038, ..., -0.04054244,\n",
              "         0.35918432,  0.3167796 ],\n",
              "       ...,\n",
              "       [ 0.91379674, -0.11278358,  0.10361038, ..., -0.04054244,\n",
              "         1.53897855,  0.3167796 ],\n",
              "       [-0.75676322, -1.56395633,  0.10361038, ..., -0.04054244,\n",
              "        -1.21387465,  0.3167796 ],\n",
              "       [ 0.95096556, -0.2951915 ,  0.10361038, ..., -0.04054244,\n",
              "        -1.21387465,  0.3167796 ]])"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 49
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "W3mD60ErtFrK",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 238
        },
        "outputId": "9506d3d8-df30-4d05-84c3-a886441330cc"
      },
      "source": [
        "X_test"
      ],
      "execution_count": 55,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "array([[ 0.8582981 ,  1.2883741 ,  0.10361038, ..., -0.04054244,\n",
              "        -0.23071279,  1.78220095],\n",
              "       [-0.24963804, -0.66589149,  0.10361038, ..., -0.04054244,\n",
              "        -0.42734517, -0.78228642],\n",
              "       [-1.34688178, -0.99834465,  0.10361038, ..., -0.04054244,\n",
              "         1.53897855,  0.3167796 ],\n",
              "       ...,\n",
              "       [ 1.06145367,  0.00342793,  0.10361038, ..., -0.04054244,\n",
              "         0.35918432,  0.3167796 ],\n",
              "       [ 0.20911677, -0.49672284,  0.10361038, ..., -0.04054244,\n",
              "        -0.82060991, -1.88135243],\n",
              "       [-1.15085447, -1.35139225,  0.10361038, ..., -0.04054244,\n",
              "         0.75244907, -0.04957574]])"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 55
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "kdGM1rpS4Ly2",
        "colab_type": "text"
      },
      "source": [
        "## Building a Regression Model\n",
        "\n",
        "---\n",
        "\n",
        "At this stage, the data is ready to be fed into a regressor. We will build a simple XGBoost regressor and fit it on the training data. We will then use the model to predict the prices of the books in the test set.\n",
        "\n",
        "* Building a simple XGBoost regressor\n",
        "* Testing the regressor on the validation set\n",
        "* Predicting the prices for the test set\n",
        "* Saving the predictions into an Excel file\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "oxpRLSvCKRJg",
        "colab_type": "text"
      },
      "source": [
        "### Creating Training & Validation sets"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "gnoeicDYInty",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "from sklearn.model_selection import train_test_split\n",
        "\n",
        "train_x, val_x, train_y, val_y = train_test_split(X_train, Y_train, test_size = 0.1, random_state = 123)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Mjt8t2zCJ1W0",
        "colab_type": "code",
        "outputId": "4992ca5a-6c0f-4109-812c-7dc2c8508064",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 85
        }
      },
      "source": [
        "print(train_x.shape)\n",
        "print(train_y.shape)\n",
        "print(val_x.shape)\n",
        "print(val_y.shape)"
      ],
      "execution_count": 68,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "(5613, 17)\n",
            "(5613,)\n",
            "(624, 17)\n",
            "(624,)\n"
          ],
          "name": "stdout"
        }
      ]
    },
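    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text"
      },
      "source": [
        "The shapes above follow from how `train_test_split` sizes a fractional split: the held-out count is the ceiling of `test_size * n_samples`, and the remaining rows go to training. A quick check of that arithmetic for this dataset, written as a small sketch with illustrative variable names:\n",
        "\n",
        "```python\n",
        "import math\n",
        "\n",
        "n_samples = 6237   # rows in X_train before the split\n",
        "test_size = 0.1    # fraction passed to train_test_split\n",
        "\n",
        "n_val = math.ceil(test_size * n_samples)  # 623.7 rounds up to 624\n",
        "n_train = n_samples - n_val               # remaining rows\n",
        "```\n",
        "\n",
        "This gives 624 validation rows and 5613 training rows, matching the printed shapes."
      ]
    },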
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "svtQWm8SCwg0",
        "colab_type": "text"
      },
      "source": [
        "### XGBoost"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "3fgH-qtZNOLu",
        "colab_type": "text"
      },
      "source": [
        "### Validating The Model\n",
        "\n",
        "---\n",
        "\n",
        "We will first train the model and validate it using RMSLE (Root Mean Squared Logarithmic Error). Once validation is done, we will use both the training and validation samples to train the final model, which will be used to predict prices for the test set."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "6XCp0ZKv7idh",
        "outputId": "6aadca88-df94-48b4-85d1-819c3d66b8a2",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        }
      },
      "source": [
        "from xgboost import XGBRegressor\n",
        "import numpy as np\n",
        "\n",
        "xgb=XGBRegressor( objective='reg:squarederror', max_depth=6, learning_rate=0.1, n_estimators=100, booster = 'gbtree', n_jobs = -1,random_state = 1)\n",
        "xgb.fit(train_x,train_y)\n",
        "\n",
        "# inverse_transform expects a 2-D array, so reshape to a column and flatten afterwards\n",
        "y_pred = sc.inverse_transform(xgb.predict(val_x).reshape(-1, 1)).ravel()\n",
        "y_true = sc.inverse_transform(val_y.reshape(-1, 1)).ravel()\n",
        "\n",
        "error = np.square(np.log10(y_pred +1) - np.log10(y_true +1)).mean() ** 0.5\n",
        "score = 1 - error\n",
        "\n",
        "print(\"RMLSE Score = \", score)"
      ],
      "execution_count": 69,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "RMLSE Score =  0.7163958922112829\n"
          ],
          "name": "stdout"
        }
      ]
    },
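    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text"
      },
      "source": [
        "The score printed above is `1 - RMSLE`, with the logarithm taken in base 10 as in the cell's formula. Below is a minimal sketch of the same computation wrapped in a function; the name `rmsle_score` and the sample prices are made up for illustration.\n",
        "\n",
        "```python\n",
        "import numpy as np\n",
        "\n",
        "def rmsle_score(y_true, y_pred):\n",
        "    # Root mean squared difference between log10(1 + value) of predictions and targets\n",
        "    y_true = np.asarray(y_true, dtype=float)\n",
        "    y_pred = np.asarray(y_pred, dtype=float)\n",
        "    error = np.sqrt(np.mean(np.square(np.log10(y_pred + 1) - np.log10(y_true + 1))))\n",
        "    # Higher is better; a perfect prediction scores 1.0\n",
        "    return 1 - error\n",
        "\n",
        "perfect = rmsle_score([99.0, 999.0], [99.0, 999.0])\n",
        "# Each prediction here is off by one unit in log10(1 + value), so the error is 1\n",
        "one_log_unit_off = rmsle_score([9.0, 99.0], [99.0, 999.0])\n",
        "```\n",
        "\n",
        "Because the error is measured on a log scale, the score penalizes relative rather than absolute price differences."
      ]
    },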
    {
      "cell_type": "code",
      "metadata": {
        "id": "NRDm0O5dNykQ",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 136
        },
        "outputId": "959f0d33-7c02-4cd1-b1a9-c710fd20db46"
      },
      "source": [
        "# Fitting the complete training set (including val_x and val_y)\n",
        "xgb.fit(X_train,Y_train)\n"
      ],
      "execution_count": 70,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n",
              "             colsample_bynode=1, colsample_bytree=1, gamma=0,\n",
              "             importance_type='gain', learning_rate=0.1, max_delta_step=0,\n",
              "             max_depth=6, min_child_weight=1, missing=None, n_estimators=100,\n",
              "             n_jobs=-1, nthread=None, objective='reg:squarederror',\n",
              "             random_state=1, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,\n",
              "             seed=None, silent=None, subsample=1, verbosity=1)"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 70
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "v89PDQklKymM",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Predicting for test set\n",
        "y_pred_xgb = sc.inverse_transform(xgb.predict(X_test).reshape(-1, 1)).ravel()"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "tqwjEHDU7idq",
        "colab": {}
      },
      "source": [
        "# Saving the predictions in an Excel file\n",
        "\n",
        "solution = pd.DataFrame(y_pred_xgb, columns = ['Price'])\n",
        "solution.to_excel('Predict_Book_Price_Soln.xlsx', index = False)\n"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "ms2jR1CaOeIg",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 359
        },
        "outputId": "e9c78145-6d7d-49dd-db18-15460724cf6d"
      },
      "source": [
        "solution.head(10)"
      ],
      "execution_count": 74,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>Price</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>214.681122</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>1330.917114</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>624.322449</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>844.769043</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>425.502533</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>5</th>\n",
              "      <td>890.277893</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>6</th>\n",
              "      <td>956.208252</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>7</th>\n",
              "      <td>339.453979</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>8</th>\n",
              "      <td>558.580017</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>9</th>\n",
              "      <td>467.108673</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "         Price\n",
              "0   214.681122\n",
              "1  1330.917114\n",
              "2   624.322449\n",
              "3   844.769043\n",
              "4   425.502533\n",
              "5   890.277893\n",
              "6   956.208252\n",
              "7   339.453979\n",
              "8   558.580017\n",
              "9   467.108673"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 74
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "5i4R-fZT90gW",
        "colab_type": "text"
      },
      "source": [
        "## Bayesian Optimization on XGBoost\n",
        "\n",
        "---\n",
        "\n",
        "In this step, we will use Bayesian Optimization to tune hyperparameters such as gamma, learning_rate, max_depth and n_estimators.\n",
        "\n",
        "\n",
        "We will use a pre-built library called bayesian-optimization."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "LU2T5Jvr8RCl",
        "colab_type": "code",
        "outputId": "8ce4251a-bf1f-4d20-e19a-4bd6ccb91138",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 238
        }
      },
      "source": [
        "!pip install bayesian-optimization"
      ],
      "execution_count": 76,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Collecting bayesian-optimization\n",
            "  Downloading https://files.pythonhosted.org/packages/72/0c/173ac467d0a53e33e41b521e4ceba74a8ac7c7873d7b857a8fbdca88302d/bayesian-optimization-1.0.1.tar.gz\n",
            "Requirement already satisfied: numpy>=1.9.0 in /usr/local/lib/python3.6/dist-packages (from bayesian-optimization) (1.16.5)\n",
            "Requirement already satisfied: scipy>=0.14.0 in /usr/local/lib/python3.6/dist-packages (from bayesian-optimization) (1.3.1)\n",
            "Requirement already satisfied: scikit-learn>=0.18.0 in /usr/local/lib/python3.6/dist-packages (from bayesian-optimization) (0.21.3)\n",
            "Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.6/dist-packages (from scikit-learn>=0.18.0->bayesian-optimization) (0.13.2)\n",
            "Building wheels for collected packages: bayesian-optimization\n",
            "  Building wheel for bayesian-optimization (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
            "  Created wheel for bayesian-optimization: filename=bayesian_optimization-1.0.1-cp36-none-any.whl size=10032 sha256=8fd26880064a093cff7a7ea3de5daeb04582531c5362de7b4848e56e5d73439b\n",
            "  Stored in directory: /root/.cache/pip/wheels/1d/0d/3b/6b9d4477a34b3905f246ff4e7acf6aafd4cc9b77d473629b77\n",
            "Successfully built bayesian-optimization\n",
            "Installing collected packages: bayesian-optimization\n",
            "Successfully installed bayesian-optimization-1.0.1\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "L_GOZGBx9OXa",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "from bayes_opt import BayesianOptimization\n",
        "import xgboost as xgb\n",
        "#from sklearn.metrics import mean_squared_error,mean_squared_log_error"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "2iFA25n4-blG",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "dtrain = xgb.DMatrix(X_train, label= Y_train)\n"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "NXOpETUc9hZ3",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "def bo_tune_xgb(max_depth, gamma, n_estimators, learning_rate):\n",
        "    params = {'max_depth': int(max_depth),\n",
        "              'gamma': gamma,\n",
        "              # n_estimators is a scikit-learn wrapper argument; xgb.cv takes the\n",
        "              # number of boosting rounds from num_boost_round instead\n",
        "              'n_estimators': int(n_estimators),\n",
        "              # learning_rate and eta are aliases for the same setting\n",
        "              'learning_rate':learning_rate,\n",
        "              'subsample': 0.8,\n",
        "              'eta': 0.1,\n",
        "              'eval_metric': 'rmse'}\n",
        "    \n",
        "    # Cross-validating with the specified parameters in 10 folds and 100 boosting rounds\n",
        "    cv_result = xgb.cv(params, dtrain, num_boost_round=100, nfold=10)\n",
        "    \n",
        "    # Return the negative RMSE, since the optimizer maximizes its target\n",
        "    return -1.0 * cv_result['test-rmse-mean'].iloc[-1]"
      ],
      "execution_count": 0,
      "outputs": []
    },
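    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text"
      },
      "source": [
        "The objective casts `max_depth` and `n_estimators` with `int()` because the optimizer proposes every parameter as a float, and it returns the negative RMSE because the optimizer maximizes its target. The same two conversions apply when reading a result back. Below is a minimal sketch, assuming a result dictionary with `target` and `params` keys like the one exposed by the `max` property in `bayes_opt`; the values and the helper name `to_model_params` are illustrative, not results from this run.\n",
        "\n",
        "```python\n",
        "def to_model_params(best, int_keys=('max_depth', 'n_estimators')):\n",
        "    # Copy the proposed parameters so the optimizer's record is not modified\n",
        "    params = dict(best['params'])\n",
        "    for key in int_keys:\n",
        "        # Tree depth and tree count must be whole numbers\n",
        "        params[key] = int(params[key])\n",
        "    # Undo the sign flip to recover the RMSE that was minimized\n",
        "    rmse = -best['target']\n",
        "    return params, rmse\n",
        "\n",
        "# Illustrative stand-in for the optimizer's best record (made-up values)\n",
        "best_demo = {'target': -0.92,\n",
        "             'params': {'gamma': 0.8, 'learning_rate': 0.12,\n",
        "                        'max_depth': 195.2, 'n_estimators': 685.8}}\n",
        "\n",
        "params_demo, rmse_demo = to_model_params(best_demo)\n",
        "```\n",
        "\n",
        "The cleaned dictionary can then be passed as keyword arguments to a regressor, while `rmse_demo` is the cross-validated error on the scaled target."
      ]
    },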
    {
      "cell_type": "code",
      "metadata": {
        "id": "Ef_-mQ1N9lT3",
        "colab_type": "code",
        "outputId": "b4484c85-d12f-4e62-c598-9aa9790da8d5",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 408
        }
      },
      "source": [
        "\n",
        "xgb_bo = BayesianOptimization(bo_tune_xgb, {'max_depth': (1, 300), \n",
        "                                             'gamma': (0, 1),\n",
        "                                             'learning_rate':(0,1),\n",
        "                                             'n_estimators':(1,1000)\n",
        "                                            })\n",
        "\n",
        "\n",
        "xgb_bo.maximize(n_iter=10, init_points=10, acq='ei')"
      ],
      "execution_count": 80,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "|   iter    |  target   |   gamma   | learni... | max_depth | n_esti... |\n",
            "-------------------------------------------------------------------------\n",
            "| \u001b[0m 1       \u001b[0m | \u001b[0m-1.009   \u001b[0m | \u001b[0m 0.2697  \u001b[0m | \u001b[0m 0.5214  \u001b[0m | \u001b[0m 51.2    \u001b[0m | \u001b[0m 389.1   \u001b[0m |\n",
            "| \u001b[95m 2       \u001b[0m | \u001b[95m-0.9282  \u001b[0m | \u001b[95m 0.06396 \u001b[0m | \u001b[95m 0.07131 \u001b[0m | \u001b[95m 240.2   \u001b[0m | \u001b[95m 947.1   \u001b[0m |\n",
            "| \u001b[0m 3       \u001b[0m | \u001b[0m-0.9759  \u001b[0m | \u001b[0m 0.6246  \u001b[0m | \u001b[0m 0.3789  \u001b[0m | \u001b[0m 76.9    \u001b[0m | \u001b[0m 530.3   \u001b[0m |\n",
            "| \u001b[95m 4       \u001b[0m | \u001b[95m-0.9172  \u001b[0m | \u001b[95m 0.8354  \u001b[0m | \u001b[95m 0.1186  \u001b[0m | \u001b[95m 195.2   \u001b[0m | \u001b[95m 685.8   \u001b[0m |\n",
            "| \u001b[0m 5       \u001b[0m | \u001b[0m-0.9174  \u001b[0m | \u001b[0m 0.9739  \u001b[0m | \u001b[0m 0.1673  \u001b[0m | \u001b[0m 8.738   \u001b[0m | \u001b[0m 905.6   \u001b[0m |\n",
            "| \u001b[0m 6       \u001b[0m | \u001b[0m-1.074   \u001b[0m | \u001b[0m 0.6809  \u001b[0m | \u001b[0m 0.7988  \u001b[0m | \u001b[0m 42.74   \u001b[0m | \u001b[0m 197.3   \u001b[0m |\n",
            "| \u001b[0m 7       \u001b[0m | \u001b[0m-0.9909  \u001b[0m | \u001b[0m 0.1068  \u001b[0m | \u001b[0m 0.405   \u001b[0m | \u001b[0m 33.39   \u001b[0m | \u001b[0m 796.0   \u001b[0m |\n",
            "| \u001b[0m 8       \u001b[0m | \u001b[0m-0.943   \u001b[0m | \u001b[0m 0.6288  \u001b[0m | \u001b[0m 0.2134  \u001b[0m | \u001b[0m 231.1   \u001b[0m | \u001b[0m 114.2   \u001b[0m |\n",
            "| \u001b[0m 9       \u001b[0m | \u001b[0m-0.9464  \u001b[0m | \u001b[0m 0.2038  \u001b[0m | \u001b[0m 0.2343  \u001b[0m | \u001b[0m 152.1   \u001b[0m | \u001b[0m 38.35   \u001b[0m |\n",
            "| \u001b[0m 10      \u001b[0m | \u001b[0m-0.9398  \u001b[0m | \u001b[0m 0.7739  \u001b[0m | \u001b[0m 0.209   \u001b[0m | \u001b[0m 256.2   \u001b[0m | \u001b[0m 416.5   \u001b[0m |\n",
            "| \u001b[0m 11      \u001b[0m | \u001b[0m-1.112   \u001b[0m | \u001b[0m 1.0     \u001b[0m | \u001b[0m 0.0     \u001b[0m | \u001b[0m 1.0     \u001b[0m | \u001b[0m 1e+03   \u001b[0m |\n",
            "| \u001b[0m 12      \u001b[0m | \u001b[0m-1.112   \u001b[0m | \u001b[0m 1.0     \u001b[0m | \u001b[0m 0.0     \u001b[0m | \u001b[0m 300.0   \u001b[0m | \u001b[0m 779.8   \u001b[0m |\n",
            "| \u001b[0m 13      \u001b[0m | \u001b[0m-1.152   \u001b[0m | \u001b[0m 1.0     \u001b[0m | \u001b[0m 1.0     \u001b[0m | \u001b[0m 300.0   \u001b[0m | \u001b[0m 1.0     \u001b[0m |\n",
            "| \u001b[0m 14      \u001b[0m | \u001b[0m-1.112   \u001b[0m | \u001b[0m 1.0     \u001b[0m | \u001b[0m 0.0     \u001b[0m | \u001b[0m 125.2   \u001b[0m | \u001b[0m 898.1   \u001b[0m |\n",
            "| \u001b[0m 15      \u001b[0m | \u001b[0m-1.135   \u001b[0m | \u001b[0m 0.0     \u001b[0m | \u001b[0m 1.0     \u001b[0m | \u001b[0m 300.0   \u001b[0m | \u001b[0m 1e+03   \u001b[0m |\n",
            "| \u001b[0m 16      \u001b[0m | \u001b[0m-1.112   \u001b[0m | \u001b[0m 0.0     \u001b[0m | \u001b[0m 0.0     \u001b[0m | \u001b[0m 1.0     \u001b[0m | \u001b[0m 1.0     \u001b[0m |\n",
            "| \u001b[0m 17      \u001b[0m | \u001b[0m-1.112   \u001b[0m | \u001b[0m 0.0     \u001b[0m | \u001b[0m 0.0     \u001b[0m | \u001b[0m 230.3   \u001b[0m | \u001b[0m 570.6   \u001b[0m |\n",
            "| \u001b[0m 18      \u001b[0m | \u001b[0m-1.112   \u001b[0m | \u001b[0m 1.0     \u001b[0m | \u001b[0m 0.0     \u001b[0m | \u001b[0m 300.0   \u001b[0m | \u001b[0m 283.2   \u001b[0m |\n",
            "| \u001b[0m 19      \u001b[0m | \u001b[0m-1.152   \u001b[0m | \u001b[0m 1.0     \u001b[0m | \u001b[0m 1.0     \u001b[0m | \u001b[0m 174.7   \u001b[0m | \u001b[0m 356.2   \u001b[0m |\n",
            "| \u001b[0m 20      \u001b[0m | \u001b[0m-1.112   \u001b[0m | \u001b[0m 1.0     \u001b[0m | \u001b[0m 0.0     \u001b[0m | \u001b[0m 1.0     \u001b[0m | \u001b[0m 610.3   \u001b[0m |\n",
            "=========================================================================\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "muarWA_b9nwh",
        "colab_type": "code",
        "outputId": "c119acf6-4006-4e6d-ffce-9fa6ea3b71f1",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 51
        }
      },
      "source": [
        "#Extracting the best parameters\n",
        "params = xgb_bo.max['params']\n",
        "\n",
        "print(params)\n",
        "\n",
        "#Converting the max_depth and n_estimators values from float to int\n",
        "params['max_depth']= int(round(params['max_depth']))\n",
        "params['n_estimators']= int(round(params['n_estimators']))\n",
        "\n",
        "print(params)\n"
      ],
      "execution_count": 81,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "{'gamma': 0.835366902965147, 'learning_rate': 0.11864947002102888, 'max_depth': 195.21748298318235, 'n_estimators': 685.7597094777982}\n",
            "{'gamma': 0.835366902965147, 'learning_rate': 0.11864947002102888, 'max_depth': 195, 'n_estimators': 686}\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "ogWf5ZKO9x5p",
        "colab_type": "code",
        "outputId": "5afce368-5ecc-4d98-ee64-8afa4caea574",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        }
      },
      "source": [
        "#Initialize an XGB with the tuned parameters and fit the training data\n",
        "from xgboost import XGBRegressor\n",
        "reg = XGBRegressor(**params).fit(X_train,Y_train)\n",
        "\n",
        "y_pred_reg = sc.inverse_transform(reg.predict(X_test))"
      ],
      "execution_count": 83,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "[10:19:41] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "NFU-H-icP5uD",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 359
        },
        "outputId": "a9fdd176-11d9-4e6f-8ef8-29b48c83f0bd"
      },
      "source": [
        "solution_bo = pd.DataFrame(y_pred_reg, columns = ['Price'])\n",
        "\n",
        "solution_bo.head(10)"
      ],
      "execution_count": 90,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>Price</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>241.728058</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>1828.147339</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>429.341614</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>858.137817</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>372.580841</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>5</th>\n",
              "      <td>503.807220</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>6</th>\n",
              "      <td>587.503967</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>7</th>\n",
              "      <td>476.958710</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>8</th>\n",
              "      <td>399.994812</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>9</th>\n",
              "      <td>335.585876</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "         Price\n",
              "0   241.728058\n",
              "1  1828.147339\n",
              "2   429.341614\n",
              "3   858.137817\n",
              "4   372.580841\n",
              "5   503.807220\n",
              "6   587.503967\n",
              "7   476.958710\n",
              "8   399.994812\n",
              "9   335.585876"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 90
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "wUCplBBRQLqI",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "solution_bo.to_excel('Predict_Book_Prices_BO_Soln.xlsx', index = False)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "AljdBCrgZ14J",
        "colab_type": "text"
      },
      "source": [
        "Once you have your solution files, upload them to MachineHack to see your score.\n",
        "\n",
        "Good luck!"
      ]
    }
  ]
 }