Created
October 4, 2019 10:58
MachineHack recently launched its latest hackathon, Predict The Price Of Books. This article is a complete step-by-step guide to the solution.
{ | |
"nbformat": 4, | |
"nbformat_minor": 0, | |
"metadata": { | |
"colab": { | |
"name": "Article-Book_price.ipynb", | |
"provenance": [], | |
"collapsed_sections": [] | |
}, | |
"kernelspec": { | |
"name": "python3", | |
"display_name": "Python 3" | |
}, | |
"accelerator": "GPU" | |
}, | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "FIvlJKqoP1Zq", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"# The Complete Solution To MachineHack's Predict The Book Price Hackathon\n", | |
"\n", | |
"\n", | |
"---\n", | |
"\n", | |
"### Predict The Book Price Hackathon\n", | |
"\n", | |
"“The so-called paradoxes of an author, to which a reader takes exception, often exist not in the author’s book at all, but rather in the reader’s head.” – Friedrich Nietzsche\n", | |
"\n", | |
"Books are open doors to the unimagined worlds which is unique to every person. It is more than just a hobby for many. There are many among us who prefer to spend more time with books than anything else.\n", | |
"\n", | |
"Here we explore a big database of books. Books of different genres, from thousands of authors. In this challenge, participants are required to use the dataset to build a Machine Learning model to predict the price of books based on a given set of features.\n", | |
"\n", | |
"Size of training set: 6237 records\n", | |
"Size of test set: 1560 records\n", | |
"\n", | |
"FEATURES:\n", | |
"\n", | |
"* Title: The title of the book\n", | |
"* Author: The author(s) of the book.\n", | |
"* Edition: The edition of the book eg (Paperback,– Import, 26 Apr 2018)\n", | |
"* Reviews: The customer reviews about the book\n", | |
"* Ratings: The customer ratings of the book\n", | |
"* Synopsis: The synopsis of the book\n", | |
"* Genre: The genre the book belongs to\n", | |
"* BookCategory: The department the book is usually available at.\n", | |
"* Price: The price of the book (Target variable)\n", | |
"\n", | |
"\n", | |
"Click [here](https://www.machinehack.com/course/predict-the-price-of-books/) to participate in the hackathon.\n", | |
"\n", | |
"---\n", | |
"\n", | |
"This python notebook contains the complete step by step guide to work on the above mentioned hackathon.Use this notebook to learn and adapt this work to better your score.\n", | |
"\n", | |
"### Approach\n", | |
"\n", | |
"1. Exploring The Data Sets\n", | |
"2. Cleaning, Processing and Generating New Features\n", | |
"1. Building A Regressor \n", | |
"2. Optimizing The Hyperparameters Using Bayesian Optimization\n", | |
"\n", | |
"The above steps are explained in detail as follows." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "H3K1ySNMS7b1", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"## 1. Exploring The Data Sets\n", | |
"\n", | |
"\n", | |
"---\n", | |
"\n", | |
"\n", | |
"In this step, we will import the datasets and will do a simple analysis that will help us process the data before predictive modeling.\n", | |
"\n", | |
"This block involves:\n", | |
"\n", | |
"* Importing the data\n", | |
"* Understanding the features and their characterstics \n", | |
"* Noting key observations from the data.\n", | |
"\n", | |
"\n", | |
"\n", | |
"\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "VaXTvorZKITI", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"import pandas as pd" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "efVgoYL3Lnn4", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"train = pd.read_excel(\"Data/Data_Train.xlsx\")\n", | |
"test = pd.read_excel(\"Data/Data_Test.xlsx\")" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "CSXtnNyy0ndp", | |
"colab_type": "code", | |
"outputId": "a9a15fb4-aa89-4604-c0f1-5c17aed061a3", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 1000 | |
} | |
}, | |
"source": [ | |
"train.head(50)" | |
], | |
"execution_count": 19, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Title</th>\n", | |
" <th>Author</th>\n", | |
" <th>Edition</th>\n", | |
" <th>Reviews</th>\n", | |
" <th>Ratings</th>\n", | |
" <th>Synopsis</th>\n", | |
" <th>Genre</th>\n", | |
" <th>BookCategory</th>\n", | |
" <th>Price</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>The Prisoner's Gold (The Hunters 3)</td>\n", | |
" <td>Chris Kuzneski</td>\n", | |
" <td>Paperback,– 10 Mar 2016</td>\n", | |
" <td>4.0 out of 5 stars</td>\n", | |
" <td>8 customer reviews</td>\n", | |
" <td>THE HUNTERS return in their third brilliant no...</td>\n", | |
" <td>Action & Adventure (Books)</td>\n", | |
" <td>Action & Adventure</td>\n", | |
" <td>220.00</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>Guru Dutt: A Tragedy in Three Acts</td>\n", | |
" <td>Arun Khopkar</td>\n", | |
" <td>Paperback,– 7 Nov 2012</td>\n", | |
" <td>3.9 out of 5 stars</td>\n", | |
" <td>14 customer reviews</td>\n", | |
" <td>A layered portrait of a troubled genius for wh...</td>\n", | |
" <td>Cinema & Broadcast (Books)</td>\n", | |
" <td>Biographies, Diaries & True Accounts</td>\n", | |
" <td>202.93</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>Leviathan (Penguin Classics)</td>\n", | |
" <td>Thomas Hobbes</td>\n", | |
" <td>Paperback,– 25 Feb 1982</td>\n", | |
" <td>4.8 out of 5 stars</td>\n", | |
" <td>6 customer reviews</td>\n", | |
" <td>\"During the time men live without a common Pow...</td>\n", | |
" <td>International Relations</td>\n", | |
" <td>Humour</td>\n", | |
" <td>299.00</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>A Pocket Full of Rye (Miss Marple)</td>\n", | |
" <td>Agatha Christie</td>\n", | |
" <td>Paperback,– 5 Oct 2017</td>\n", | |
" <td>4.1 out of 5 stars</td>\n", | |
" <td>13 customer reviews</td>\n", | |
" <td>A handful of grain is found in the pocket of a...</td>\n", | |
" <td>Contemporary Fiction (Books)</td>\n", | |
" <td>Crime, Thriller & Mystery</td>\n", | |
" <td>180.00</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>LIFE 70 Years of Extraordinary Photography</td>\n", | |
" <td>Editors of Life</td>\n", | |
" <td>Hardcover,– 10 Oct 2006</td>\n", | |
" <td>5.0 out of 5 stars</td>\n", | |
" <td>1 customer review</td>\n", | |
" <td>For seven decades, \"Life\" has been thrilling t...</td>\n", | |
" <td>Photography Textbooks</td>\n", | |
" <td>Arts, Film & Photography</td>\n", | |
" <td>965.62</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>5</th>\n", | |
" <td>ChiRunning: A Revolutionary Approach to Effort...</td>\n", | |
" <td>Danny Dreyer</td>\n", | |
" <td>Paperback,– 5 May 2009</td>\n", | |
" <td>4.5 out of 5 stars</td>\n", | |
" <td>8 customer reviews</td>\n", | |
" <td>The revised edition of the bestselling ChiRunn...</td>\n", | |
" <td>Healthy Living & Wellness (Books)</td>\n", | |
" <td>Sports</td>\n", | |
" <td>900.00</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>6</th>\n", | |
" <td>Death on the Nile (Poirot)</td>\n", | |
" <td>Agatha Christie</td>\n", | |
" <td>Paperback,– 5 Oct 2017</td>\n", | |
" <td>4.4 out of 5 stars</td>\n", | |
" <td>72 customer reviews</td>\n", | |
" <td>Agatha Christie’s most exotic murder mystery\\n...</td>\n", | |
" <td>Crime, Thriller & Mystery (Books)</td>\n", | |
" <td>Crime, Thriller & Mystery</td>\n", | |
" <td>224.00</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>7</th>\n", | |
" <td>Yoga Your Home Practice Companion: A Complete ...</td>\n", | |
" <td>Sivananda Yoga Vedanta Centre</td>\n", | |
" <td>Hardcover,– Import, 1 Mar 2018</td>\n", | |
" <td>4.7 out of 5 stars</td>\n", | |
" <td>16 customer reviews</td>\n", | |
" <td>Achieve a healthy body, mental alertness, and ...</td>\n", | |
" <td>Sports Training & Coaching (Books)</td>\n", | |
" <td>Sports</td>\n", | |
" <td>836.00</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>8</th>\n", | |
" <td>Karmayogi: A Biography of E. Sreedharan</td>\n", | |
" <td>M S Ashokan</td>\n", | |
" <td>Paperback,– 15 Dec 2015</td>\n", | |
" <td>4.2 out of 5 stars</td>\n", | |
" <td>111 customer reviews</td>\n", | |
" <td>Karmayogi is the dramatic and inspiring story ...</td>\n", | |
" <td>Biographies & Autobiographies (Books)</td>\n", | |
" <td>Biographies, Diaries & True Accounts</td>\n", | |
" <td>130.00</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>9</th>\n", | |
" <td>The Iron King (The Accursed Kings, Book 1)</td>\n", | |
" <td>Maurice Druon</td>\n", | |
" <td>Paperback,– 26 Mar 2013</td>\n", | |
" <td>4.0 out of 5 stars</td>\n", | |
" <td>1 customer review</td>\n", | |
" <td>‘This is the original game of thrones’ George ...</td>\n", | |
" <td>Action & Adventure (Books)</td>\n", | |
" <td>Action & Adventure</td>\n", | |
" <td>695.00</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>10</th>\n", | |
" <td>Battle for Sanskrit: Is Sanskrit Political or ...</td>\n", | |
" <td>Rajiv Malhotra</td>\n", | |
" <td>Paperback,– 20 Jan 2017</td>\n", | |
" <td>4.9 out of 5 stars</td>\n", | |
" <td>132 customer reviews</td>\n", | |
" <td>There is a new awakening in India that is chal...</td>\n", | |
" <td>Asian History</td>\n", | |
" <td>Language, Linguistics & Writing</td>\n", | |
" <td>373.00</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>11</th>\n", | |
" <td>Blockchain Revolution: How the Technology Behi...</td>\n", | |
" <td>Don Tapscott, Alex Tapscott</td>\n", | |
" <td>Paperback,– Import, 14 Jun 2018</td>\n", | |
" <td>3.5 out of 5 stars</td>\n", | |
" <td>17 customer reviews</td>\n", | |
" <td>THE DEFINITIVE BOOK ON HOW THE TECHNOLOGY BEHI...</td>\n", | |
" <td>Banks & Banking</td>\n", | |
" <td>Computing, Internet & Digital Media</td>\n", | |
" <td>309.00</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>12</th>\n", | |
" <td>Tai-Pan: The Second Novel of the Asian Saga</td>\n", | |
" <td>James Clavell</td>\n", | |
" <td>Paperback,– 1 Jul 1999</td>\n", | |
" <td>4.1 out of 5 stars</td>\n", | |
" <td>4 customer reviews</td>\n", | |
" <td>Set in the turbulent days of the founding of H...</td>\n", | |
" <td>Action & Adventure (Books)</td>\n", | |
" <td>Action & Adventure</td>\n", | |
" <td>379.00</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>13</th>\n", | |
" <td>The Art of Shaolin Kung Fu: The Secrets of Kun...</td>\n", | |
" <td>Wong Kiew Kit</td>\n", | |
" <td>Paperback,– 15 Nov 2002</td>\n", | |
" <td>5.0 out of 5 stars</td>\n", | |
" <td>3 customer reviews</td>\n", | |
" <td>The Art of Shaolin Kung Fu is the ultimate gui...</td>\n", | |
" <td>Asian History</td>\n", | |
" <td>Sports</td>\n", | |
" <td>1066.00</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>14</th>\n", | |
" <td>Anil's Ghost</td>\n", | |
" <td>Michael Ondaatje</td>\n", | |
" <td>Paperback,– 1 Sep 2011</td>\n", | |
" <td>3.8 out of 5 stars</td>\n", | |
" <td>5 customer reviews</td>\n", | |
" <td>Anil's Ghost transports us to Sri Lanka, a cou...</td>\n", | |
" <td>Action & Adventure (Books)</td>\n", | |
" <td>Romance</td>\n", | |
" <td>381.22</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>15</th>\n", | |
" <td>Superman: An Origin Story (DC Comics Super Her...</td>\n", | |
" <td>Matthew K Manning</td>\n", | |
" <td>Paperback,– 26 Feb 2015</td>\n", | |
" <td>5.0 out of 5 stars</td>\n", | |
" <td>2 customer reviews</td>\n", | |
" <td>One day, an alien orphan crash-lands on Earth ...</td>\n", | |
" <td>Comics & Mangas (Books)</td>\n", | |
" <td>Comics & Mangas</td>\n", | |
" <td>287.00</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>16</th>\n", | |
" <td>My First Book of London</td>\n", | |
" <td>Charlotte Guillain, Roland Dry</td>\n", | |
" <td>Hardcover,– 8 Mar 2018</td>\n", | |
" <td>5.0 out of 5 stars</td>\n", | |
" <td>1 customer review</td>\n", | |
" <td>London is one of the most exciting cities in t...</td>\n", | |
" <td>Children's Mysteries & Curiosities (Books)</td>\n", | |
" <td>Crime, Thriller & Mystery</td>\n", | |
" <td>162.00</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>17</th>\n", | |
" <td>Naruto: Itachi's Story, Vol. 1: Daylight</td>\n", | |
" <td>Takashi Yano</td>\n", | |
" <td>Paperback,– 1 Nov 2016</td>\n", | |
" <td>4.9 out of 5 stars</td>\n", | |
" <td>23 customer reviews</td>\n", | |
" <td>A new series of prose novels, straight from th...</td>\n", | |
" <td>Mangas</td>\n", | |
" <td>Comics & Mangas</td>\n", | |
" <td>587.00</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>18</th>\n", | |
" <td>The Story of Philosophy</td>\n", | |
" <td>Will Durant</td>\n", | |
" <td>Mass Market Paperback,– 1 Jan 1991</td>\n", | |
" <td>4.5 out of 5 stars</td>\n", | |
" <td>76 customer reviews</td>\n", | |
" <td>A brilliant and concise account of the lives a...</td>\n", | |
" <td>Biographies & Autobiographies (Books)</td>\n", | |
" <td>Biographies, Diaries & True Accounts</td>\n", | |
" <td>291.00</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>19</th>\n", | |
" <td>Introducing Data Science: Big Data, Machine Le...</td>\n", | |
" <td>Davy Cielen, Arno D.B. Meysman, Mohamed Ali</td>\n", | |
" <td>Paperback,– 2016</td>\n", | |
" <td>4.3 out of 5 stars</td>\n", | |
" <td>5 customer reviews</td>\n", | |
" <td>Introducing Data Science explains vital data s...</td>\n", | |
" <td>Artificial Intelligence</td>\n", | |
" <td>Computing, Internet & Digital Media</td>\n", | |
" <td>352.00</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>20</th>\n", | |
" <td>The Travelling Cat Chronicles</td>\n", | |
" <td>Hiro Arikawa</td>\n", | |
" <td>Hardcover,– 24 Nov 2018</td>\n", | |
" <td>4.9 out of 5 stars</td>\n", | |
" <td>10 customer reviews</td>\n", | |
" <td>A stunning hardback edition of this surprise h...</td>\n", | |
" <td>Action & Adventure (Books)</td>\n", | |
" <td>Action & Adventure</td>\n", | |
" <td>339.00</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>21</th>\n", | |
" <td>Messi: Updated Edition (Luca Caioli)</td>\n", | |
" <td>Luca Caioli</td>\n", | |
" <td>Paperback,– Import, 4 Oct 2018</td>\n", | |
" <td>5.0 out of 5 stars</td>\n", | |
" <td>2 customer reviews</td>\n", | |
" <td>FROM THE BESTSELLING AUTHOR OF RONALDO AND NEY...</td>\n", | |
" <td>Biographies & Autobiographies (Books)</td>\n", | |
" <td>Sports</td>\n", | |
" <td>309.00</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>22</th>\n", | |
" <td>The Dark Arena</td>\n", | |
" <td>Mario Puzo</td>\n", | |
" <td>Paperback,– 5 Jul 2012</td>\n", | |
" <td>3.1 out of 5 stars</td>\n", | |
" <td>2 customer reviews</td>\n", | |
" <td>MARIO PUZO'S FIRST ACCLAIMED NOVEL, BEFORE HIS...</td>\n", | |
" <td>Action & Adventure (Books)</td>\n", | |
" <td>Action & Adventure</td>\n", | |
" <td>262.00</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>23</th>\n", | |
" <td>Sap Fico Beginner's Handbook: Step By Step Acr...</td>\n", | |
" <td>Murugesan Ramaswamy</td>\n", | |
" <td>Paperback,– 1 Nov 2014</td>\n", | |
" <td>3.1 out of 5 stars</td>\n", | |
" <td>10 customer reviews</td>\n", | |
" <td>Step by Step Screenshots Guided Handholding Ap...</td>\n", | |
" <td>Software & Business Applications (Books)</td>\n", | |
" <td>Computing, Internet & Digital Media</td>\n", | |
" <td>607.00</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>24</th>\n", | |
" <td>German Grammar You Really Need To Know: Teach ...</td>\n", | |
" <td>Jenny Russ</td>\n", | |
" <td>Paperback,– 31 Aug 2012</td>\n", | |
" <td>4.8 out of 5 stars</td>\n", | |
" <td>9 customer reviews</td>\n", | |
" <td>Comprehensive and clear explanations of key gr...</td>\n", | |
" <td>German</td>\n", | |
" <td>Language, Linguistics & Writing</td>\n", | |
" <td>536.00</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>25</th>\n", | |
" <td>Stealth of Nations: The Global Rise of the Inf...</td>\n", | |
" <td>Robert Neuwirth</td>\n", | |
" <td>Hardcover,– Deckle Edge, 18 Oct 2011</td>\n", | |
" <td>4.0 out of 5 stars</td>\n", | |
" <td>1 customer review</td>\n", | |
" <td>• Thousands of Africans head to China each yea...</td>\n", | |
" <td>International Business</td>\n", | |
" <td>Politics</td>\n", | |
" <td>621.00</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>26</th>\n", | |
" <td>Fixed!: Cash and Corruption in Cricket</td>\n", | |
" <td>Shantanu Guha Ray</td>\n", | |
" <td>Paperback,– 1 Mar 2016</td>\n", | |
" <td>4.3 out of 5 stars</td>\n", | |
" <td>15 customer reviews</td>\n", | |
" <td>Who killed Hansie Cronje and Bob Woolmer? Have...</td>\n", | |
" <td>Cricket (Books)</td>\n", | |
" <td>Sports</td>\n", | |
" <td>286.98</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>27</th>\n", | |
" <td>The Buddha Box Set</td>\n", | |
" <td>Osamu Tezuka</td>\n", | |
" <td>Paperback,– Box set, 15 Jun 2014</td>\n", | |
" <td>4.3 out of 5 stars</td>\n", | |
" <td>34 customer reviews</td>\n", | |
" <td>The classic eight volume graphic novel series ...</td>\n", | |
" <td>Comics & Graphic Novels (Books)</td>\n", | |
" <td>Comics & Mangas</td>\n", | |
" <td>3779.00</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>28</th>\n", | |
" <td>30 Years of WrestleMania (Wwe)</td>\n", | |
" <td>Brian Shields</td>\n", | |
" <td>Hardcover,– 15 Sep 2014</td>\n", | |
" <td>5.0 out of 5 stars</td>\n", | |
" <td>17 customer reviews</td>\n", | |
" <td>From the creators of WWE 50 and the official W...</td>\n", | |
" <td>PC & Video Games (Books)</td>\n", | |
" <td>Sports</td>\n", | |
" <td>802.00</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>29</th>\n", | |
" <td>Memories, Dreams, Reflections (Vintage)</td>\n", | |
" <td>C. G. Jung, Aniela Jaffe, Clara Winston, Richa...</td>\n", | |
" <td>Paperback,– 23 Apr 1989</td>\n", | |
" <td>5.0 out of 5 stars</td>\n", | |
" <td>9 customer reviews</td>\n", | |
" <td>An eye-opening biography of one of the most in...</td>\n", | |
" <td>Biographies & Autobiographies (Books)</td>\n", | |
" <td>Biographies, Diaries & True Accounts</td>\n", | |
" <td>588.26</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>30</th>\n", | |
" <td>The Hit (Will Robie series)</td>\n", | |
" <td>David Baldacci</td>\n", | |
" <td>Paperback,– 21 Nov 2013</td>\n", | |
" <td>4.0 out of 5 stars</td>\n", | |
" <td>32 customer reviews</td>\n", | |
" <td>The Hit is David Baldacci's blockbuster follow...</td>\n", | |
" <td>Contemporary Fiction (Books)</td>\n", | |
" <td>Crime, Thriller & Mystery</td>\n", | |
" <td>340.00</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>31</th>\n", | |
" <td>The Archer Files: The Complete Short Stories o...</td>\n", | |
" <td>Ross Macdonald, Tom Nolan</td>\n", | |
" <td>Paperback,– 21 Jul 2015</td>\n", | |
" <td>3.9 out of 5 stars</td>\n", | |
" <td>2 customer reviews</td>\n", | |
" <td>No matter what cases private eye Lew Archer ta...</td>\n", | |
" <td>Short Stories (Books)</td>\n", | |
" <td>Humour</td>\n", | |
" <td>799.00</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>32</th>\n", | |
" <td>Light on Life (Arkana)</td>\n", | |
" <td>Hart Defouw</td>\n", | |
" <td>Paperback,– 14 Oct 2000</td>\n", | |
" <td>4.4 out of 5 stars</td>\n", | |
" <td>49 customer reviews</td>\n", | |
" <td>Light on Life brings the insight and wisdom of...</td>\n", | |
" <td>Astrology</td>\n", | |
" <td>Biographies, Diaries & True Accounts</td>\n", | |
" <td>395.10</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>33</th>\n", | |
" <td>The Doomsday Conspiracy</td>\n", | |
" <td>Sidney Sheldon</td>\n", | |
" <td>Paperback,– 5 Sep 2005</td>\n", | |
" <td>4.2 out of 5 stars</td>\n", | |
" <td>49 customer reviews</td>\n", | |
" <td>The Doomsday Conspiracy, by Sidney Sheldon, is...</td>\n", | |
" <td>Romance (Books)</td>\n", | |
" <td>Romance</td>\n", | |
" <td>225.00</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>34</th>\n", | |
" <td>The Art of Uncharted 4: A Thief's End</td>\n", | |
" <td>Naughty Dog</td>\n", | |
" <td>Hardcover,– 10 May 2016</td>\n", | |
" <td>4.3 out of 5 stars</td>\n", | |
" <td>10 customer reviews</td>\n", | |
" <td>Journey alongside Nathan Drake once again, as ...</td>\n", | |
" <td>Design</td>\n", | |
" <td>Comics & Mangas</td>\n", | |
" <td>1780.00</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>35</th>\n", | |
" <td>HANNIBAL RISING</td>\n", | |
" <td>Thomas Harris</td>\n", | |
" <td>Paperback,– 2019</td>\n", | |
" <td>4.3 out of 5 stars</td>\n", | |
" <td>8 customer reviews</td>\n", | |
" <td>_________________________ hannibal lecter wasn...</td>\n", | |
" <td>Contemporary Fiction (Books)</td>\n", | |
" <td>Crime, Thriller & Mystery</td>\n", | |
" <td>309.00</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>36</th>\n", | |
" <td>Data Structures Using C</td>\n", | |
" <td>Reema Thareja</td>\n", | |
" <td>Paperback,– 11 Jun 2014</td>\n", | |
" <td>4.4 out of 5 stars</td>\n", | |
" <td>62 customer reviews</td>\n", | |
" <td>This second edition of Data Structures Using C...</td>\n", | |
" <td>Introductory & Beginning Programming</td>\n", | |
" <td>Language, Linguistics & Writing</td>\n", | |
" <td>559.00</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>37</th>\n", | |
" <td>Don't Ask Any Old Bloke for Directions</td>\n", | |
" <td>Palden Gyatso Tenzing</td>\n", | |
" <td>Paperback,– 17 Apr 2009</td>\n", | |
" <td>3.9 out of 5 stars</td>\n", | |
" <td>61 customer reviews</td>\n", | |
" <td>Exploring a karmic network in 25,320 kilometre...</td>\n", | |
" <td>Travel (Books)</td>\n", | |
" <td>Biographies, Diaries & True Accounts</td>\n", | |
" <td>224.00</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>38</th>\n", | |
" <td>Prince of Fire</td>\n", | |
" <td>Daniel Silva</td>\n", | |
" <td>Paperback,– 30 Nov 2006</td>\n", | |
" <td>5.0 out of 5 stars</td>\n", | |
" <td>1 customer review</td>\n", | |
" <td>On a bright morning in Rome, a terrible explos...</td>\n", | |
" <td>Action & Adventure (Books)</td>\n", | |
" <td>Crime, Thriller & Mystery</td>\n", | |
" <td>565.00</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>39</th>\n", | |
" <td>HBR's 10 Must Reads: On Making Smart Decisions...</td>\n", | |
" <td>HBR</td>\n", | |
" <td>Paperback,– 1 Dec 2013</td>\n", | |
" <td>4.6 out of 5 stars</td>\n", | |
" <td>8 customer reviews</td>\n", | |
" <td>NEW from the bestselling HBR's 10 Must Reads s...</td>\n", | |
" <td>Sports (Books)</td>\n", | |
" <td>Sports</td>\n", | |
" <td>511.00</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>40</th>\n", | |
" <td>Politics and the English Language (Penguin Mod...</td>\n", | |
" <td>George Orwell</td>\n", | |
" <td>Paperback,– 3 Jan 2013</td>\n", | |
" <td>5.0 out of 5 stars</td>\n", | |
" <td>7 customer reviews</td>\n", | |
" <td>'Politics and the English Language' is widely ...</td>\n", | |
" <td>Communications</td>\n", | |
" <td>Language, Linguistics & Writing</td>\n", | |
" <td>144.00</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>41</th>\n", | |
" <td>S.</td>\n", | |
" <td>Doug Dorst</td>\n", | |
" <td>Hardcover,– 28 Sep 2013</td>\n", | |
" <td>4.7 out of 5 stars</td>\n", | |
" <td>5 customer reviews</td>\n", | |
" <td>One book. Two readers. A world of mystery, men...</td>\n", | |
" <td>Contemporary Fiction (Books)</td>\n", | |
" <td>Crime, Thriller & Mystery</td>\n", | |
" <td>1455.90</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>42</th>\n", | |
" <td>Oxford Learner's Thesaurus</td>\n", | |
" <td>Dict.</td>\n", | |
" <td>Paperback,– 20 Aug 2008</td>\n", | |
" <td>3.8 out of 5 stars</td>\n", | |
" <td>18 customer reviews</td>\n", | |
" <td>A dictionary of synonyms and opposites that he...</td>\n", | |
" <td>Foreign Languages</td>\n", | |
" <td>Language, Linguistics & Writing</td>\n", | |
" <td>645.00</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>43</th>\n", | |
" <td>Lost in Translation</td>\n", | |
" <td>Ella Frances Sanders</td>\n", | |
" <td>Hardcover,– 8 Jul 2015</td>\n", | |
" <td>4.5 out of 5 stars</td>\n", | |
" <td>16 customer reviews</td>\n", | |
" <td>Did you know that the Japanese have a word to ...</td>\n", | |
" <td>Linguistics (Books)</td>\n", | |
" <td>Language, Linguistics & Writing</td>\n", | |
" <td>427.00</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>44</th>\n", | |
" <td>Daisy Jones and The Six</td>\n", | |
" <td>Taylor Jenkins Reid</td>\n", | |
" <td>Hardcover,– 2019</td>\n", | |
" <td>4.6 out of 5 stars</td>\n", | |
" <td>6 customer reviews</td>\n", | |
" <td>picked as < u> one to watch in 2019</u> by <th...</td>\n", | |
" <td>Music Books</td>\n", | |
" <td>Romance</td>\n", | |
" <td>560.00</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>45</th>\n", | |
" <td>Bear Grylls: Two All-Action Adventures: Facing...</td>\n", | |
" <td>Bear Grylls</td>\n", | |
" <td>Paperback,– 3 Jul 2014</td>\n", | |
" <td>5.0 out of 5 stars</td>\n", | |
" <td>3 customer reviews</td>\n", | |
" <td>Bear Grylls is one of the world's most famous ...</td>\n", | |
" <td>Outdoor Survival Skills (Books)</td>\n", | |
" <td>Sports</td>\n", | |
" <td>395.10</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>46</th>\n", | |
" <td>My First Book Of Beethoven: Favorite Pieces In...</td>\n", | |
" <td>David Dutkanicz</td>\n", | |
" <td>Paperback,– 29 Dec 2006</td>\n", | |
" <td>5.0 out of 5 stars</td>\n", | |
" <td>2 customer reviews</td>\n", | |
" <td>Specially arranged and simplified, these piece...</td>\n", | |
" <td>Music Books</td>\n", | |
" <td>Arts, Film & Photography</td>\n", | |
" <td>386.00</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>47</th>\n", | |
" <td>Byculla to Bangkok</td>\n", | |
" <td>S. Hussain Zaidi</td>\n", | |
" <td>Paperback,– 22 Feb 2014</td>\n", | |
" <td>4.1 out of 5 stars</td>\n", | |
" <td>98 customer reviews</td>\n", | |
" <td>The much-awaited sequel to Dongri to Dubai Aft...</td>\n", | |
" <td>True Accounts (Books)</td>\n", | |
" <td>Biographies, Diaries & True Accounts</td>\n", | |
" <td>226.00</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>48</th>\n", | |
" <td>Assassin's Creed: Forsaken</td>\n", | |
" <td>Oliver Bowden</td>\n", | |
" <td>Paperback,– 1 Nov 2012</td>\n", | |
" <td>4.8 out of 5 stars</td>\n", | |
" <td>12 customer reviews</td>\n", | |
" <td>The new novelization based on the bestselling ...</td>\n", | |
" <td>Action & Adventure (Books)</td>\n", | |
" <td>Action & Adventure</td>\n", | |
" <td>303.00</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>49</th>\n", | |
" <td>Mastering Manga with Mark Crilley: 30 Drawing ...</td>\n", | |
" <td>Mark Crilley</td>\n", | |
" <td>Paperback,– 30 Mar 2012</td>\n", | |
" <td>4.8 out of 5 stars</td>\n", | |
" <td>14 customer reviews</td>\n", | |
" <td>It's THE book on manga from YouTube's most pop...</td>\n", | |
" <td>Mangas</td>\n", | |
" <td>Arts, Film & Photography</td>\n", | |
" <td>1383.00</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Title ... Price\n", | |
"0 The Prisoner's Gold (The Hunters 3) ... 220.00\n", | |
"1 Guru Dutt: A Tragedy in Three Acts ... 202.93\n", | |
"2 Leviathan (Penguin Classics) ... 299.00\n", | |
"3 A Pocket Full of Rye (Miss Marple) ... 180.00\n", | |
"4 LIFE 70 Years of Extraordinary Photography ... 965.62\n", | |
"5 ChiRunning: A Revolutionary Approach to Effort... ... 900.00\n", | |
"6 Death on the Nile (Poirot) ... 224.00\n", | |
"7 Yoga Your Home Practice Companion: A Complete ... ... 836.00\n", | |
"8 Karmayogi: A Biography of E. Sreedharan ... 130.00\n", | |
"9 The Iron King (The Accursed Kings, Book 1) ... 695.00\n", | |
"10 Battle for Sanskrit: Is Sanskrit Political or ... ... 373.00\n", | |
"11 Blockchain Revolution: How the Technology Behi... ... 309.00\n", | |
"12 Tai-Pan: The Second Novel of the Asian Saga ... 379.00\n", | |
"13 The Art of Shaolin Kung Fu: The Secrets of Kun... ... 1066.00\n", | |
"14 Anil's Ghost ... 381.22\n", | |
"15 Superman: An Origin Story (DC Comics Super Her... ... 287.00\n", | |
"16 My First Book of London ... 162.00\n", | |
"17 Naruto: Itachi's Story, Vol. 1: Daylight ... 587.00\n", | |
"18 The Story of Philosophy ... 291.00\n", | |
"19 Introducing Data Science: Big Data, Machine Le... ... 352.00\n", | |
"20 The Travelling Cat Chronicles ... 339.00\n", | |
"21 Messi: Updated Edition (Luca Caioli) ... 309.00\n", | |
"22 The Dark Arena ... 262.00\n", | |
"23 Sap Fico Beginner's Handbook: Step By Step Acr... ... 607.00\n", | |
"24 German Grammar You Really Need To Know: Teach ... ... 536.00\n", | |
"25 Stealth of Nations: The Global Rise of the Inf... ... 621.00\n", | |
"26 Fixed!: Cash and Corruption in Cricket ... 286.98\n", | |
"27 The Buddha Box Set ... 3779.00\n", | |
"28 30 Years of WrestleMania (Wwe) ... 802.00\n", | |
"29 Memories, Dreams, Reflections (Vintage) ... 588.26\n", | |
"30 The Hit (Will Robie series) ... 340.00\n", | |
"31 The Archer Files: The Complete Short Stories o... ... 799.00\n", | |
"32 Light on Life (Arkana) ... 395.10\n", | |
"33 The Doomsday Conspiracy ... 225.00\n", | |
"34 The Art of Uncharted 4: A Thief's End ... 1780.00\n", | |
"35 HANNIBAL RISING ... 309.00\n", | |
"36 Data Structures Using C ... 559.00\n", | |
"37 Don't Ask Any Old Bloke for Directions ... 224.00\n", | |
"38 Prince of Fire ... 565.00\n", | |
"39 HBR's 10 Must Reads: On Making Smart Decisions... ... 511.00\n", | |
"40 Politics and the English Language (Penguin Mod... ... 144.00\n", | |
"41 S. ... 1455.90\n", | |
"42 Oxford Learner's Thesaurus ... 645.00\n", | |
"43 Lost in Translation ... 427.00\n", | |
"44 Daisy Jones and The Six ... 560.00\n", | |
"45 Bear Grylls: Two All-Action Adventures: Facing... ... 395.10\n", | |
"46 My First Book Of Beethoven: Favorite Pieces In... ... 386.00\n", | |
"47 Byculla to Bangkok ... 226.00\n", | |
"48 Assassin's Creed: Forsaken ... 303.00\n", | |
"49 Mastering Manga with Mark Crilley: 30 Drawing ... ... 1383.00\n", | |
"\n", | |
"[50 rows x 9 columns]" | |
] | |
}, | |
"metadata": { | |
"tags": [] | |
}, | |
"execution_count": 19 | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "Kkl8cxrfL6dj", | |
"colab_type": "code", | |
"outputId": "508b59aa-96d2-45d4-bb16-8e54e19a1a81", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 272 | |
} | |
}, | |
"source": [ | |
"print(train.info())" | |
], | |
"execution_count": 20, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": [ | |
"<class 'pandas.core.frame.DataFrame'>\n", | |
"RangeIndex: 6237 entries, 0 to 6236\n", | |
"Data columns (total 9 columns):\n", | |
"Title 6237 non-null object\n", | |
"Author 6237 non-null object\n", | |
"Edition 6237 non-null object\n", | |
"Reviews 6237 non-null object\n", | |
"Ratings 6237 non-null object\n", | |
"Synopsis 6237 non-null object\n", | |
"Genre 6237 non-null object\n", | |
"BookCategory 6237 non-null object\n", | |
"Price 6237 non-null float64\n", | |
"dtypes: float64(1), object(8)\n", | |
"memory usage: 438.6+ KB\n", | |
"None\n" | |
], | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "SspSSxPEv8gq", | |
"colab_type": "code", | |
"outputId": "3594f69b-3a74-4f11-b84d-3580d9e25bd8", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 111 | |
} | |
}, | |
"source": [ | |
"train.describe(include = 'all').head(2)" | |
], | |
"execution_count": 21, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Title</th>\n", | |
" <th>Author</th>\n", | |
" <th>Edition</th>\n", | |
" <th>Reviews</th>\n", | |
" <th>Ratings</th>\n", | |
" <th>Synopsis</th>\n", | |
" <th>Genre</th>\n", | |
" <th>BookCategory</th>\n", | |
" <th>Price</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>count</th>\n", | |
" <td>6237</td>\n", | |
" <td>6237</td>\n", | |
" <td>6237</td>\n", | |
" <td>6237</td>\n", | |
" <td>6237</td>\n", | |
" <td>6237</td>\n", | |
" <td>6237</td>\n", | |
" <td>6237</td>\n", | |
" <td>6237.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>unique</th>\n", | |
" <td>5568</td>\n", | |
" <td>3679</td>\n", | |
" <td>3370</td>\n", | |
" <td>36</td>\n", | |
" <td>342</td>\n", | |
" <td>5549</td>\n", | |
" <td>345</td>\n", | |
" <td>11</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Title Author Edition Reviews Ratings Synopsis Genre BookCategory Price\n", | |
"count 6237 6237 6237 6237 6237 6237 6237 6237 6237.0\n", | |
"unique 5568 3679 3370 36 342 5549 345 11 NaN" | |
] | |
}, | |
"metadata": { | |
"tags": [] | |
}, | |
"execution_count": 21 | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "U3MCre4TwiZc", | |
"colab_type": "code", | |
"outputId": "39fc7de7-3721-41b8-b03e-ec2de9e55358", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 68 | |
} | |
}, | |
"source": [ | |
"print(train.columns)" | |
], | |
"execution_count": 22, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": [ | |
"Index(['Title', 'Author', 'Edition', 'Reviews', 'Ratings', 'Synopsis', 'Genre',\n", | |
" 'BookCategory', 'Price'],\n", | |
" dtype='object')\n" | |
], | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "b1ZRckAF3Frk", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"## Removing Synopsis since here we are not going to use this feature\n", | |
"\n", | |
"train = train[['Title', 'Author', 'Edition', 'Reviews', 'Ratings','Genre',\n", | |
" 'BookCategory', 'Price']]\n", | |
"\n", | |
"test = test[['Title', 'Author', 'Edition', 'Reviews', 'Ratings','Genre',\n", | |
" 'BookCategory']]" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "YmktsksI1cR5", | |
"colab_type": "code", | |
"outputId": "8d30ad31-a035-404c-dac3-f9b6d7bb2ec0", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 204 | |
} | |
}, | |
"source": [ | |
"train.head()" | |
], | |
"execution_count": 24, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Title</th>\n", | |
" <th>Author</th>\n", | |
" <th>Edition</th>\n", | |
" <th>Reviews</th>\n", | |
" <th>Ratings</th>\n", | |
" <th>Genre</th>\n", | |
" <th>BookCategory</th>\n", | |
" <th>Price</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>The Prisoner's Gold (The Hunters 3)</td>\n", | |
" <td>Chris Kuzneski</td>\n", | |
" <td>Paperback,– 10 Mar 2016</td>\n", | |
" <td>4.0 out of 5 stars</td>\n", | |
" <td>8 customer reviews</td>\n", | |
" <td>Action & Adventure (Books)</td>\n", | |
" <td>Action & Adventure</td>\n", | |
" <td>220.00</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>Guru Dutt: A Tragedy in Three Acts</td>\n", | |
" <td>Arun Khopkar</td>\n", | |
" <td>Paperback,– 7 Nov 2012</td>\n", | |
" <td>3.9 out of 5 stars</td>\n", | |
" <td>14 customer reviews</td>\n", | |
" <td>Cinema & Broadcast (Books)</td>\n", | |
" <td>Biographies, Diaries & True Accounts</td>\n", | |
" <td>202.93</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>Leviathan (Penguin Classics)</td>\n", | |
" <td>Thomas Hobbes</td>\n", | |
" <td>Paperback,– 25 Feb 1982</td>\n", | |
" <td>4.8 out of 5 stars</td>\n", | |
" <td>6 customer reviews</td>\n", | |
" <td>International Relations</td>\n", | |
" <td>Humour</td>\n", | |
" <td>299.00</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>A Pocket Full of Rye (Miss Marple)</td>\n", | |
" <td>Agatha Christie</td>\n", | |
" <td>Paperback,– 5 Oct 2017</td>\n", | |
" <td>4.1 out of 5 stars</td>\n", | |
" <td>13 customer reviews</td>\n", | |
" <td>Contemporary Fiction (Books)</td>\n", | |
" <td>Crime, Thriller & Mystery</td>\n", | |
" <td>180.00</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>LIFE 70 Years of Extraordinary Photography</td>\n", | |
" <td>Editors of Life</td>\n", | |
" <td>Hardcover,– 10 Oct 2006</td>\n", | |
" <td>5.0 out of 5 stars</td>\n", | |
" <td>1 customer review</td>\n", | |
" <td>Photography Textbooks</td>\n", | |
" <td>Arts, Film & Photography</td>\n", | |
" <td>965.62</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Title ... Price\n", | |
"0 The Prisoner's Gold (The Hunters 3) ... 220.00\n", | |
"1 Guru Dutt: A Tragedy in Three Acts ... 202.93\n", | |
"2 Leviathan (Penguin Classics) ... 299.00\n", | |
"3 A Pocket Full of Rye (Miss Marple) ... 180.00\n", | |
"4 LIFE 70 Years of Extraordinary Photography ... 965.62\n", | |
"\n", | |
"[5 rows x 8 columns]" | |
] | |
}, | |
"metadata": { | |
"tags": [] | |
}, | |
"execution_count": 24 | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "ZHq8QtcL_tD3", | |
"colab_type": "code", | |
"outputId": "40a32bf6-0a47-42ad-c78f-e0c5a6dab844", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 170 | |
} | |
}, | |
"source": [ | |
"train.isnull().sum()" | |
], | |
"execution_count": 25, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/plain": [ | |
"Title 0\n", | |
"Author 0\n", | |
"Edition 0\n", | |
"Reviews 0\n", | |
"Ratings 0\n", | |
"Genre 0\n", | |
"BookCategory 0\n", | |
"Price 0\n", | |
"dtype: int64" | |
] | |
}, | |
"metadata": { | |
"tags": [] | |
}, | |
"execution_count": 25 | |
} | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "WqhA0WyMXMD4", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"#### KEY OBSERVATIONS\n", | |
"\n", | |
"* No null values in the dataset to treat.\n", | |
"\n", | |
"* Some books have multiple authors in the Author column, which needs to be processed and separated.\n", | |
"\n", | |
"* The Edition column can be split into 3 different features (Type, Month and Year).\n", | |
"\n", | |
"* The Reviews and Ratings columns are mislabelled: Reviews holds the star rating (e.g. `4.0 out of 5 stars`) and Ratings holds the number of customer reviews.\n", | |
"\n", | |
"* Both columns need to be cleaned so that the review count is an integer and the star rating is a float.\n", | |
"\n", | |
"* Like authors, a book may belong to multiple categories and genres, so we will also need to split the Genre and BookCategory columns.\n", | |
"\n" | |
] | |
}, | |
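{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"The cleaning of the two mislabelled columns can be sketched as follows. This is only an illustration, assuming values shaped like `4.0 out of 5 stars` and `14 customer reviews`; the helper names `parse_stars` and `parse_review_count` are hypothetical and are not used elsewhere in this notebook.\n", | |
"\n", | |
"```python\n", | |
"def parse_stars(text):\n", | |
"    # '4.0 out of 5 stars' -> 4.0 (the first token is the star rating)\n", | |
"    return float(text.split()[0])\n", | |
"\n", | |
"\n", | |
"def parse_review_count(text):\n", | |
"    # '14 customer reviews' -> 14; commas are dropped in case a count\n", | |
"    # is written with thousands separators (an assumption about the data)\n", | |
"    return int(text.split()[0].replace(',', ''))\n", | |
"\n", | |
"\n", | |
"print(parse_stars('4.0 out of 5 stars'))\n", | |
"print(parse_review_count('1 customer review'))\n", | |
"```\n", | |
"\n", | |
"Taking the first whitespace-separated token handles both the singular `review` and the plural `reviews` suffix without special cases.\n" | |
] | |
}, | |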
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "7aRbnRnAUfmf", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"## Processing The Data\n", | |
"---\n", | |
"\n", | |
"In this stage we will process the data by cleaning and making it ready for modeling.\n", | |
"\n", | |
"This stage involves:\n", | |
"\n", | |
"* Cleaning the data and generating new features\n", | |
"* Encoding all categorical variables\n", | |
"* Scaling the data\n", | |
"\n", | |
"\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "A_6ocQREGz-2", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"### Cleaning And Generating New Features" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "U9zwu4rzqxGd", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"#### Splitting Edition Column\n", | |
"\n", | |
"---\n", | |
"\n", | |
"We will clean the Edition column and create 3 new features from it: Type, Month and Year. Each value has the form `Paperback,– 10 Mar 2016`, so the text before `,– ` gives the type and the last two tokens of the date give the month and year.\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "9FYwQDv4poys", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"#A method to clean and restructure the Edition column\n", | |
"\n", | |
"def split_edition(data):\n", | |
"\n", | |
"    edition = list(data)\n", | |
"\n", | |
"    # Text before the separator is the edition type, e.g. PAPERBACK\n", | |
"    ed_type = [i.split(\",– \")[0].strip().upper() for i in edition]\n", | |
"\n", | |
"    # Text after the separator is the date part, e.g. 10 Mar 2016\n", | |
"    edit_date = [i.split(\",– \")[1].strip() for i in edition]\n", | |
"\n", | |
"    # Keep the last two tokens, which are expected to be the month and the year\n", | |
"    m_y = [i.split()[-2:] for i in edit_date]\n", | |
"\n", | |
"    # If only one token is present, pad the month position with 'NA'\n", | |
"    for i in range(len(m_y)):\n", | |
"        if len(m_y[i]) == 1:\n", | |
"            m_y[i].insert(0,'NA')\n", | |
"\n", | |
"    # Based on the given dataset below is the list of possible values for Months\n", | |
"\n", | |
"    months = ['Apr','Aug','Dec','Feb', 'Jan', 'Jul','Jun','Mar','May','NA','Nov','Oct','Sep']\n", | |
"\n", | |
"    # Unrecognised months become 'NA' and non-numeric years become 0\n", | |
"    ed_month = [m_y[i][0].upper() if m_y[i][0] in months else 'NA' for i in range(len(m_y))]\n", | |
"    ed_year = [int(m_y[i][1].strip()) if m_y[i][1].isdigit() else 0 for i in range(len(m_y))]\n", | |
"\n", | |
"    return ed_type, ed_month, ed_year" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "G4QApAYejy8F", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"#### Splitting Author Columns\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "6qWwA0KIm8O1", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"To split a column into multiple features, we must first determine how many features the existing column can account for. Hence, to split the Author column into multiple author columns, we must know the maximum number of authors for a single book in the given datasets. We will combine the test and training sets to find it.\n", | |
"\n", | |
"We will also store the name of every author, which we will need later for label encoding.\n", | |
"\n", | |
"We will apply the same principles for the Genre as well as the BookCategory columns." | |
] | |
}, | |
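{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"The split-and-pad idea described above can be sketched in a few lines. This is a minimal illustration, assuming comma-separated values as in the Author column; the helper `split_and_pad` and the sample names are hypothetical and are not part of the solution code in this notebook.\n", | |
"\n", | |
"```python\n", | |
"def split_and_pad(values, filler='NONE'):\n", | |
"    # Split each comma-separated entry and normalise every part\n", | |
"    parts = [[p.strip().upper() for p in v.split(',')] for v in values]\n", | |
"    # The longest row decides how many new columns are needed\n", | |
"    width = max(len(p) for p in parts)\n", | |
"    # Pad shorter rows so that every row has exactly `width` entries\n", | |
"    return [p + [filler] * (width - len(p)) for p in parts]\n", | |
"\n", | |
"\n", | |
"rows = split_and_pad(['Agatha Christie', 'A. Author, B. Writer'])\n", | |
"print(rows)\n", | |
"```\n", | |
"\n", | |
"Padding with a fixed placeholder keeps every row the same length, which is why the placeholder itself must also be known to the label encoder later on.\n" | |
] | |
}, | |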
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "Bg4iBFtiMFJJ", | |
"colab_type": "code", | |
"outputId": "e85740e0-c6f9-4615-b762-9500b963659e", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 51 | |
} | |
}, | |
"source": [ | |
"#Identifying the maximum number of authors for a single book from the given datasets\n", | |
"authors_1 = list(train['Author'])\n", | |
"authors_2 = list(test['Author'])\n", | |
"\n", | |
"authors_1.extend(authors_2)\n", | |
"\n", | |
"authorslis = [i.split(\",\") for i in authors_1]\n", | |
"\n", | |
"# Named max_authors so that the built-in max() is not shadowed\n", | |
"max_authors = 1\n", | |
"for i in authorslis:\n", | |
"    if len(i) >= max_authors:\n", | |
"        max_authors = len(i)\n", | |
"print(\"Max. number of authors for a single boook = \",max_authors)\n", | |
"\n", | |
"# Print the row index of every book that has the maximum number of authors\n", | |
"for i in range(len(authorslis)):\n", | |
"    if len(authorslis[i]) == max_authors:\n", | |
"        print(i)\n", | |
" \n", | |
"all_authors = [author.strip().upper() for listin in authorslis for author in listin]\n", | |
" " | |
], | |
"execution_count": 26, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": [ | |
"Max. number of authors for a single boook = 7\n", | |
"7008\n" | |
], | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "kAdns8NkPPdV", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"# A method to split the Author column in to 7 new columns\n", | |
"def split_authors(data):\n", | |
"\n", | |
"    authors = list(data)\n", | |
"\n", | |
"    # One output list per author position (Author1 ... Author7)\n", | |
"    columns = [[] for _ in range(7)]\n", | |
"\n", | |
"    for i in authors:\n", | |
"        for k in range(7):\n", | |
"            # Use the k-th author if present, otherwise the 'NONE' placeholder\n", | |
"            try:\n", | |
"                columns[k].append(i.split(',')[k].strip().upper())\n", | |
"            except (IndexError, AttributeError):\n", | |
"                columns[k].append('NONE')\n", | |
"\n", | |
"    A1,A2,A3,A4,A5,A6,A7 = columns\n", | |
"\n", | |
"    return A1,A2,A3,A4,A5,A6,A7\n", | |
"\n", | |
"all_authors.append('NONE')" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "YAiwJ3Bnj89I", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"#### Splitting Genre Columns\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "VgE7l4QrY_6k", | |
"colab_type": "code", | |
"outputId": "8847298f-f86d-4a0a-d0ba-d3e2874f3d39", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 34 | |
} | |
}, | |
"source": [ | |
"#Identifying the maximum number of Genres for a single book from the given datasets\n", | |
"\n", | |
"genre_1 = list(train['Genre'])\n", | |
"genre_2 = list(test['Genre'])\n", | |
"\n", | |
"genre_1.extend(genre_2)\n", | |
"\n", | |
"genre_lis = [i.split(\",\") for i in genre_1]\n", | |
"\n", | |
"\n", | |
"# Named max_genres so that the built-in max() is not shadowed\n", | |
"max_genres = 1\n", | |
"for i in genre_lis:\n", | |
"    if len(i) >= max_genres:\n", | |
"        max_genres = len(i)\n", | |
"print(\"Max. number of genres for a single boook = \",max_genres)\n", | |
" \n", | |
"all_genres = [genre.strip().upper() for listin in genre_lis for genre in listin]\n", | |
" \n" | |
], | |
"execution_count": 28, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": [ | |
"Max. number of genres for a single boook = 2\n" | |
], | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "6PZi_VEGkEi0", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"# A method to split the Genre column in to 2 new columns\n", | |
"\n", | |
"def split_genres(data):\n", | |
"\n", | |
"    genres = list(data)\n", | |
"\n", | |
"    G1 = []\n", | |
"    G2 = []\n", | |
"\n", | |
"    for i in genres:\n", | |
"\n", | |
"        # First genre, or 'NONE' if it is missing\n", | |
"        try:\n", | |
"            G1.append(i.split(',')[0].strip().upper())\n", | |
"        except (IndexError, AttributeError):\n", | |
"            G1.append('NONE')\n", | |
"\n", | |
"        # Second genre, or 'NONE' if the book has only one\n", | |
"        try:\n", | |
"            G2.append(i.split(',')[1].strip().upper())\n", | |
"        except (IndexError, AttributeError):\n", | |
"            G2.append('NONE')\n", | |
"\n", | |
"    return G1,G2\n", | |
"\n", | |
"all_genres.append('NONE')" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "nigiT_6KkGM4", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"#### Splitting BookCategory Column\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "_eRU4S1MkNDI", | |
"colab_type": "code", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 34 | |
}, | |
"outputId": "07d4e611-6e13-4156-db04-d1679d743e71" | |
}, | |
"source": [ | |
"#Identifying the maximum number of Categories for a single book from the given datasets\n", | |
"\n", | |
"cat_1 = list(train['BookCategory'])\n", | |
"cat_2 = list(test['BookCategory'])\n", | |
"\n", | |
"cat_1.extend(cat_2)\n", | |
"\n", | |
"cat_lis = [i.split(\",\") for i in cat_1]\n", | |
"\n", | |
"\n", | |
"# Named max_categories so that the built-in max() is not shadowed\n", | |
"max_categories = 1\n", | |
"for i in cat_lis:\n", | |
"    if len(i) >= max_categories:\n", | |
"        max_categories = len(i)\n", | |
"print(\"Max. number of Categories for a single boook = \",max_categories)\n", | |
"\n", | |
"all_categories = [cat.strip().upper() for listin in cat_lis for cat in listin]\n", | |
" " | |
], | |
"execution_count": 31, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": [ | |
"Max. number of Categories for a single boook = 2\n" | |
], | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "Pc16PenecGMq", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"# A method to split the BookCategory column in to 2 new columns\n", | |
"\n", | |
"def split_categories(data):\n", | |
"\n", | |
"    cat = list(data)\n", | |
"\n", | |
"    C1 = []\n", | |
"    C2 = []\n", | |
"\n", | |
"    for i in cat:\n", | |
"\n", | |
"        # First category, or 'NONE' if it is missing\n", | |
"        try:\n", | |
"            C1.append(i.split(',')[0].strip().upper())\n", | |
"        except (IndexError, AttributeError):\n", | |
"            C1.append('NONE')\n", | |
"\n", | |
"        # Second category, or 'NONE' if the book has only one\n", | |
"        try:\n", | |
"            C2.append(i.split(',')[1].strip().upper())\n", | |
"        except (IndexError, AttributeError):\n", | |
"            C2.append('NONE')\n", | |
"\n", | |
"    return C1,C2\n", | |
"\n", | |
"all_categories.append('NONE')\n" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "YhigqezfkP1A", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"#### Cleaning & Restructuring The Datasets" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "hpBVzD13o8Ma", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"# A method to clean and restructure the datasets\n", | |
"\n", | |
"import re\n", | |
"\n", | |
"def restructure(data):\n", | |
"\n", | |
"    #Cleaning Title Column\n", | |
"    titles = list(data['Title'])\n", | |
"    titles = [title.strip().upper() for title in titles]\n", | |
"\n", | |
"    #Cleaning & Restructuring Author Column\n", | |
"    a1,a2,a3,a4,a5,a6,a7 = split_authors(data['Author'])\n", | |
"\n", | |
"    #Cleaning & Restructuring Edition Column\n", | |
"    ed_type, ed_month, ed_year = split_edition(data['Edition'])\n", | |
"\n", | |
"    #Cleaning Ratings Column (the star ratings are stored in the mislabelled 'Reviews' column)\n", | |
"    ratings = list(data['Reviews'])\n", | |
"    ratings = [float(re.sub(\" out of 5 stars\", \"\", i).strip()) for i in ratings]\n", | |
"\n", | |
"    #Cleaning Reviews Column (the review counts are stored in the mislabelled 'Ratings' column)\n", | |
"    reviews = list(data['Ratings'])\n", | |
"    plu = ' customer reviews'\n", | |
"    reviews = [re.sub(\" customer reviews\", \"\", i) if plu in i else re.sub(\" customer review\", \"\", i) for i in reviews ]\n", | |
"    reviews = [int(re.sub(\",\", \"\", i).strip()) for i in reviews ]\n", | |
"\n", | |
"    #Cleaning & Restructuring Genre Column\n", | |
"    g1, g2 = split_genres(data['Genre'])\n", | |
"\n", | |
"    #Cleaning & Restructuring BookCategory Column\n", | |
"    c1,c2 = split_categories(data['BookCategory'])\n", | |
"\n", | |
"    # Forming the Structured dataset\n", | |
"    structured_data = pd.DataFrame({'Title': titles,\n", | |
"                                    'Author1': a1,\n", | |
"                                    'Author2': a2,\n", | |
"                                    'Author3': a3,\n", | |
"                                    'Author4': a4,\n", | |
"                                    'Author5': a5,\n", | |
"                                    'Author6': a6,\n", | |
"                                    'Author7': a7,\n", | |
"                                    'Edition_Type': ed_type,\n", | |
"                                    'Edition_Month': ed_month,\n", | |
"                                    'Edition_Year': ed_year,\n", | |
"                                    'Ratings': ratings,\n", | |
"                                    'Reviews': reviews,\n", | |
"                                    'Genre1': g1,\n", | |
"                                    'Genre2': g2,\n", | |
"                                    'Category1': c1,\n", | |
"                                    'Category2': c2\n", | |
"                                    })\n", | |
"\n", | |
"    return structured_data\n" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "yIEpd23TSuNh", | |
"colab_type": "code", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 227 | |
}, | |
"outputId": "2a81c418-f804-412d-ff7e-9a2f606cacce" | |
}, | |
"source": [ | |
"restructure(train).head(3)" | |
], | |
"execution_count": 35, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Title</th>\n", | |
" <th>Author1</th>\n", | |
" <th>Author2</th>\n", | |
" <th>Author3</th>\n", | |
" <th>Author4</th>\n", | |
" <th>Author5</th>\n", | |
" <th>Author6</th>\n", | |
" <th>Author7</th>\n", | |
" <th>Edition_Type</th>\n", | |
" <th>Edition_Month</th>\n", | |
" <th>Edition_Year</th>\n", | |
" <th>Ratings</th>\n", | |
" <th>Reviews</th>\n", | |
" <th>Genre1</th>\n", | |
" <th>Genre2</th>\n", | |
" <th>Category1</th>\n", | |
" <th>Category2</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>THE PRISONER'S GOLD (THE HUNTERS 3)</td>\n", | |
" <td>CHRIS KUZNESKI</td>\n", | |
" <td>NONE</td>\n", | |
" <td>NONE</td>\n", | |
" <td>NONE</td>\n", | |
" <td>NONE</td>\n", | |
" <td>NONE</td>\n", | |
" <td>NONE</td>\n", | |
" <td>PAPERBACK</td>\n", | |
" <td>MAR</td>\n", | |
" <td>2016</td>\n", | |
" <td>4.0</td>\n", | |
" <td>8</td>\n", | |
" <td>ACTION & ADVENTURE (BOOKS)</td>\n", | |
" <td>NONE</td>\n", | |
" <td>ACTION & ADVENTURE</td>\n", | |
" <td>NONE</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>GURU DUTT: A TRAGEDY IN THREE ACTS</td>\n", | |
" <td>ARUN KHOPKAR</td>\n", | |
" <td>NONE</td>\n", | |
" <td>NONE</td>\n", | |
" <td>NONE</td>\n", | |
" <td>NONE</td>\n", | |
" <td>NONE</td>\n", | |
" <td>NONE</td>\n", | |
" <td>PAPERBACK</td>\n", | |
" <td>NOV</td>\n", | |
" <td>2012</td>\n", | |
" <td>3.9</td>\n", | |
" <td>14</td>\n", | |
" <td>CINEMA & BROADCAST (BOOKS)</td>\n", | |
" <td>NONE</td>\n", | |
" <td>BIOGRAPHIES</td>\n", | |
" <td>DIARIES & TRUE ACCOUNTS</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>LEVIATHAN (PENGUIN CLASSICS)</td>\n", | |
" <td>THOMAS HOBBES</td>\n", | |
" <td>NONE</td>\n", | |
" <td>NONE</td>\n", | |
" <td>NONE</td>\n", | |
" <td>NONE</td>\n", | |
" <td>NONE</td>\n", | |
" <td>NONE</td>\n", | |
" <td>PAPERBACK</td>\n", | |
" <td>FEB</td>\n", | |
" <td>1982</td>\n", | |
" <td>4.8</td>\n", | |
" <td>6</td>\n", | |
" <td>INTERNATIONAL RELATIONS</td>\n", | |
" <td>NONE</td>\n", | |
" <td>HUMOUR</td>\n", | |
" <td>NONE</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Title ... Category2\n", | |
"0 THE PRISONER'S GOLD (THE HUNTERS 3) ... NONE\n", | |
"1 GURU DUTT: A TRAGEDY IN THREE ACTS ... DIARIES & TRUE ACCOUNTS\n", | |
"2 LEVIATHAN (PENGUIN CLASSICS) ... NONE\n", | |
"\n", | |
"[3 rows x 17 columns]" | |
] | |
}, | |
"metadata": { | |
"tags": [] | |
}, | |
"execution_count": 35 | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "NusnSRPbCGqO", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"\n", | |
"X_train = restructure(train)\n", | |
"\n", | |
"Y_train = train.iloc[:, -1].values\n", | |
"\n", | |
"X_test = restructure(test)\n" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "elvbEa4fkgLp", | |
"colab_type": "code", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 424 | |
}, | |
"outputId": "c072e4ee-f1ad-45c4-d298-c761443660f3" | |
}, | |
"source": [ | |
"X_train.describe(include = 'all')" | |
], | |
"execution_count": 37, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Title</th>\n", | |
" <th>Author1</th>\n", | |
" <th>Author2</th>\n", | |
" <th>Author3</th>\n", | |
" <th>Author4</th>\n", | |
" <th>Author5</th>\n", | |
" <th>Author6</th>\n", | |
" <th>Author7</th>\n", | |
" <th>Edition_Type</th>\n", | |
" <th>Edition_Month</th>\n", | |
" <th>Edition_Year</th>\n", | |
" <th>Ratings</th>\n", | |
" <th>Reviews</th>\n", | |
" <th>Genre1</th>\n", | |
" <th>Genre2</th>\n", | |
" <th>Category1</th>\n", | |
" <th>Category2</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>count</th>\n", | |
" <td>6237</td>\n", | |
" <td>6237</td>\n", | |
" <td>6237</td>\n", | |
" <td>6237</td>\n", | |
" <td>6237</td>\n", | |
" <td>6237</td>\n", | |
" <td>6237</td>\n", | |
" <td>6237</td>\n", | |
" <td>6237</td>\n", | |
" <td>6237</td>\n", | |
" <td>6237.000000</td>\n", | |
" <td>6237.000000</td>\n", | |
" <td>6237.000000</td>\n", | |
" <td>6237</td>\n", | |
" <td>6237</td>\n", | |
" <td>6237</td>\n", | |
" <td>6237</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>unique</th>\n", | |
" <td>5564</td>\n", | |
" <td>3633</td>\n", | |
" <td>264</td>\n", | |
" <td>73</td>\n", | |
" <td>21</td>\n", | |
" <td>5</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>19</td>\n", | |
" <td>13</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>345</td>\n", | |
" <td>27</td>\n", | |
" <td>11</td>\n", | |
" <td>6</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>top</th>\n", | |
" <td>CASINO ROYALE: JAMES BOND 007 (VINTAGE)</td>\n", | |
" <td>AGATHA CHRISTIE</td>\n", | |
" <td>NONE</td>\n", | |
" <td>NONE</td>\n", | |
" <td>NONE</td>\n", | |
" <td>NONE</td>\n", | |
" <td>NONE</td>\n", | |
" <td>NONE</td>\n", | |
" <td>PAPERBACK</td>\n", | |
" <td>OCT</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>ACTION & ADVENTURE (BOOKS)</td>\n", | |
" <td>NONE</td>\n", | |
" <td>ACTION & ADVENTURE</td>\n", | |
" <td>NONE</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>freq</th>\n", | |
" <td>4</td>\n", | |
" <td>69</td>\n", | |
" <td>5929</td>\n", | |
" <td>6159</td>\n", | |
" <td>6214</td>\n", | |
" <td>6233</td>\n", | |
" <td>6237</td>\n", | |
" <td>6237</td>\n", | |
" <td>5193</td>\n", | |
" <td>639</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>947</td>\n", | |
" <td>5594</td>\n", | |
" <td>818</td>\n", | |
" <td>3297</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>mean</th>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>2005.101972</td>\n", | |
" <td>4.293202</td>\n", | |
" <td>35.984287</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>std</th>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>116.821510</td>\n", | |
" <td>0.662501</td>\n", | |
" <td>149.995031</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>min</th>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>1.000000</td>\n", | |
" <td>1.000000</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>25%</th>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>2010.000000</td>\n", | |
" <td>4.000000</td>\n", | |
" <td>2.000000</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>50%</th>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>2014.000000</td>\n", | |
" <td>4.400000</td>\n", | |
" <td>7.000000</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>75%</th>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>2017.000000</td>\n", | |
" <td>4.800000</td>\n", | |
" <td>22.000000</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>max</th>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>2019.000000</td>\n", | |
" <td>5.000000</td>\n", | |
" <td>6090.000000</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Title ... Category2\n", | |
"count 6237 ... 6237\n", | |
"unique 5564 ... 6\n", | |
"top CASINO ROYALE: JAMES BOND 007 (VINTAGE) ... NONE\n", | |
"freq 4 ... 3297\n", | |
"mean NaN ... NaN\n", | |
"std NaN ... NaN\n", | |
"min NaN ... NaN\n", | |
"25% NaN ... NaN\n", | |
"50% NaN ... NaN\n", | |
"75% NaN ... NaN\n", | |
"max NaN ... NaN\n", | |
"\n", | |
"[11 rows x 17 columns]" | |
] | |
}, | |
"metadata": { | |
"tags": [] | |
}, | |
"execution_count": 37 | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "A0tjLxd1CaZw", | |
"colab_type": "code", | |
"outputId": "c12e426f-fcc6-4252-d1d2-123743078d27", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 391 | |
} | |
}, | |
"source": [ | |
"X_train.info()" | |
], | |
"execution_count": 38, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": [ | |
"<class 'pandas.core.frame.DataFrame'>\n", | |
"RangeIndex: 6237 entries, 0 to 6236\n", | |
"Data columns (total 17 columns):\n", | |
"Title 6237 non-null object\n", | |
"Author1 6237 non-null object\n", | |
"Author2 6237 non-null object\n", | |
"Author3 6237 non-null object\n", | |
"Author4 6237 non-null object\n", | |
"Author5 6237 non-null object\n", | |
"Author6 6237 non-null object\n", | |
"Author7 6237 non-null object\n", | |
"Edition_Type 6237 non-null object\n", | |
"Edition_Month 6237 non-null object\n", | |
"Edition_Year 6237 non-null int64\n", | |
"Ratings 6237 non-null float64\n", | |
"Reviews 6237 non-null int64\n", | |
"Genre1 6237 non-null object\n", | |
"Genre2 6237 non-null object\n", | |
"Category1 6237 non-null object\n", | |
"Category2 6237 non-null object\n", | |
"dtypes: float64(1), int64(2), object(14)\n", | |
"memory usage: 828.4+ KB\n" | |
], | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "x5NiiOGZCmLh", | |
"colab_type": "code", | |
"outputId": "a29bc96b-6ff1-4365-f2bd-df26f8f8aa43", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 391 | |
} | |
}, | |
"source": [ | |
"X_test.info()" | |
], | |
"execution_count": 39, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": [ | |
"<class 'pandas.core.frame.DataFrame'>\n", | |
"RangeIndex: 1560 entries, 0 to 1559\n", | |
"Data columns (total 17 columns):\n", | |
"Title 1560 non-null object\n", | |
"Author1 1560 non-null object\n", | |
"Author2 1560 non-null object\n", | |
"Author3 1560 non-null object\n", | |
"Author4 1560 non-null object\n", | |
"Author5 1560 non-null object\n", | |
"Author6 1560 non-null object\n", | |
"Author7 1560 non-null object\n", | |
"Edition_Type 1560 non-null object\n", | |
"Edition_Month 1560 non-null object\n", | |
"Edition_Year 1560 non-null int64\n", | |
"Ratings 1560 non-null float64\n", | |
"Reviews 1560 non-null int64\n", | |
"Genre1 1560 non-null object\n", | |
"Genre2 1560 non-null object\n", | |
"Category1 1560 non-null object\n", | |
"Category2 1560 non-null object\n", | |
"dtypes: float64(1), int64(2), object(14)\n", | |
"memory usage: 207.3+ KB\n" | |
], | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "5ACDwNtkUpv3", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"### Encoding Categorical Features" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "hU7meCrmE2gd", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"# A method for Finding Unique items for all columns\n", | |
"def unique_items(list1, list2):\n", | |
" a = list1\n", | |
" b = list2\n", | |
" a.extend(b)\n", | |
" return list(set(a)) " | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
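To illustrate why the encoders in the next cell are fitted on the union of the training and test values, here is a minimal pure-Python sketch of label encoding. `fit_labels` and `transform_labels` are hypothetical helpers written for this illustration, not scikit-learn's API, and the toy edition types are made up. They mimic the idea of mapping each distinct value to its position in the sorted list of classes.

```python
def fit_labels(values):
    """Map each distinct value to its index in sorted order."""
    classes = sorted(set(values))
    return {label: index for index, label in enumerate(classes)}


def transform_labels(mapping, values):
    """Encode values; raises KeyError for a label not seen during fitting."""
    return [mapping[value] for value in values]


# Toy data, made up for this example
train_types = ["Paperback", "Hardcover", "Paperback"]
test_types = ["Paperback", "Board book"]

# Fitting on the training values alone fails on the unseen test label
train_only = fit_labels(train_types)
try:
    transform_labels(train_only, test_types)
    unseen_ok = True
except KeyError:
    unseen_ok = False

# Fitting on the union covers both sets with one consistent mapping
mapping = fit_labels(train_types + test_types)
encoded_train = transform_labels(mapping, train_types)
encoded_test = transform_labels(mapping, test_types)
```

Sharing one mapping also means the same author, genre or category gets the same integer in every column it appears in, which is presumably why a single encoder is reused for Author1 to Author7.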
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "wbG0hWZt_fOY", | |
"colab_type": "code", | |
"outputId": "cde54847-e3db-4678-e424-afe73805cfd8", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 34 | |
} | |
}, | |
"source": [ | |
"from sklearn.preprocessing import LabelEncoder\n", | |
"\n", | |
"le_Title = LabelEncoder()\n", | |
"all_titles = unique_items(list(X_train.Title),list(X_test.Title))\n", | |
"le_Title.fit(all_titles)\n", | |
"\n", | |
"le_Edition_Type = LabelEncoder()\n", | |
"all_etypes = unique_items(list(X_train.Edition_Type),list(X_test.Edition_Type))\n", | |
"le_Edition_Type.fit(all_etypes)\n", | |
"\n", | |
"\n", | |
"le_Edition_Month = LabelEncoder()\n", | |
"all_em = unique_items(list(X_train.Edition_Month),list(X_test.Edition_Month))\n", | |
"le_Edition_Month.fit(all_em)\n", | |
"\n", | |
"le_Author = LabelEncoder()\n", | |
"all_Authors = list(set(all_authors))\n", | |
"le_Author.fit(all_Authors)\n", | |
"\n", | |
"le_Genre = LabelEncoder()\n", | |
"all_Genres = list(set(all_genres))\n", | |
"le_Genre.fit(all_Genres)\n", | |
"\n", | |
"le_Category = LabelEncoder()\n", | |
"all_Categories = list(set(all_categories))\n", | |
"le_Category.fit(all_Categories)\n" | |
], | |
"execution_count": 41, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/plain": [ | |
"LabelEncoder()" | |
] | |
}, | |
"metadata": { | |
"tags": [] | |
}, | |
"execution_count": 41 | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "tuuGL--qImdF", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"\n", | |
"X_train['Title'] = le_Title.transform(X_train['Title'])\n", | |
"\n", | |
"X_train['Edition_Type'] = le_Edition_Type.transform(X_train['Edition_Type'])\n", | |
"\n", | |
"\n", | |
"\n", | |
"X_train['Edition_Month'] = le_Edition_Month.transform(X_train['Edition_Month'])\n", | |
"\n", | |
"X_train['Author1'] = le_Author.transform(X_train['Author1'])\n", | |
"X_train['Author2'] = le_Author.transform(X_train['Author2'])\n", | |
"X_train['Author3'] = le_Author.transform(X_train['Author3'])\n", | |
"X_train['Author4'] = le_Author.transform(X_train['Author4'])\n", | |
"X_train['Author5'] = le_Author.transform(X_train['Author5'])\n", | |
"X_train['Author6'] = le_Author.transform(X_train['Author6'])\n", | |
"X_train['Author7'] = le_Author.transform(X_train['Author7'])\n", | |
"\n", | |
"\n", | |
"X_train['Genre1'] = le_Genre.transform(X_train['Genre1'])\n", | |
"X_train['Genre2'] = le_Genre.transform(X_train['Genre2'])\n", | |
"\n", | |
"\n", | |
"X_train['Category1'] = le_Category.transform(X_train['Category1'])\n", | |
"X_train['Category2'] = le_Category.transform(X_train['Category2'])\n", | |
"\n" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "cwCtnU6Ey-ZQ", | |
"colab_type": "code", | |
"outputId": "3327e69c-4385-4a2c-8211-586bcf4aa183", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 204 | |
} | |
}, | |
"source": [ | |
"X_train.head()" | |
], | |
"execution_count": 43, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Title</th>\n", | |
" <th>Author1</th>\n", | |
" <th>Author2</th>\n", | |
" <th>Author3</th>\n", | |
" <th>Author4</th>\n", | |
" <th>Author5</th>\n", | |
" <th>Author6</th>\n", | |
" <th>Author7</th>\n", | |
" <th>Edition_Type</th>\n", | |
" <th>Edition_Month</th>\n", | |
" <th>Edition_Year</th>\n", | |
" <th>Ratings</th>\n", | |
" <th>Reviews</th>\n", | |
" <th>Genre1</th>\n", | |
" <th>Genre2</th>\n", | |
" <th>Category1</th>\n", | |
" <th>Category2</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>5802</td>\n", | |
" <td>797</td>\n", | |
" <td>3073</td>\n", | |
" <td>3073</td>\n", | |
" <td>3073</td>\n", | |
" <td>3073</td>\n", | |
" <td>3073</td>\n", | |
" <td>3073</td>\n", | |
" <td>13</td>\n", | |
" <td>7</td>\n", | |
" <td>2016</td>\n", | |
" <td>4.0</td>\n", | |
" <td>8</td>\n", | |
" <td>0</td>\n", | |
" <td>267</td>\n", | |
" <td>0</td>\n", | |
" <td>12</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>2120</td>\n", | |
" <td>391</td>\n", | |
" <td>3073</td>\n", | |
" <td>3073</td>\n", | |
" <td>3073</td>\n", | |
" <td>3073</td>\n", | |
" <td>3073</td>\n", | |
" <td>3073</td>\n", | |
" <td>13</td>\n", | |
" <td>10</td>\n", | |
" <td>2012</td>\n", | |
" <td>3.9</td>\n", | |
" <td>14</td>\n", | |
" <td>80</td>\n", | |
" <td>267</td>\n", | |
" <td>2</td>\n", | |
" <td>6</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>2984</td>\n", | |
" <td>4353</td>\n", | |
" <td>3073</td>\n", | |
" <td>3073</td>\n", | |
" <td>3073</td>\n", | |
" <td>3073</td>\n", | |
" <td>3073</td>\n", | |
" <td>3073</td>\n", | |
" <td>13</td>\n", | |
" <td>3</td>\n", | |
" <td>1982</td>\n", | |
" <td>4.8</td>\n", | |
" <td>6</td>\n", | |
" <td>211</td>\n", | |
" <td>267</td>\n", | |
" <td>8</td>\n", | |
" <td>12</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>189</td>\n", | |
" <td>78</td>\n", | |
" <td>3073</td>\n", | |
" <td>3073</td>\n", | |
" <td>3073</td>\n", | |
" <td>3073</td>\n", | |
" <td>3073</td>\n", | |
" <td>3073</td>\n", | |
" <td>13</td>\n", | |
" <td>11</td>\n", | |
" <td>2017</td>\n", | |
" <td>4.1</td>\n", | |
" <td>13</td>\n", | |
" <td>98</td>\n", | |
" <td>267</td>\n", | |
" <td>5</td>\n", | |
" <td>16</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>2987</td>\n", | |
" <td>1221</td>\n", | |
" <td>3073</td>\n", | |
" <td>3073</td>\n", | |
" <td>3073</td>\n", | |
" <td>3073</td>\n", | |
" <td>3073</td>\n", | |
" <td>3073</td>\n", | |
" <td>8</td>\n", | |
" <td>11</td>\n", | |
" <td>2006</td>\n", | |
" <td>5.0</td>\n", | |
" <td>1</td>\n", | |
" <td>284</td>\n", | |
" <td>267</td>\n", | |
" <td>1</td>\n", | |
" <td>7</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Title Author1 Author2 Author3 ... Genre1 Genre2 Category1 Category2\n", | |
"0 5802 797 3073 3073 ... 0 267 0 12\n", | |
"1 2120 391 3073 3073 ... 80 267 2 6\n", | |
"2 2984 4353 3073 3073 ... 211 267 8 12\n", | |
"3 189 78 3073 3073 ... 98 267 5 16\n", | |
"4 2987 1221 3073 3073 ... 284 267 1 7\n", | |
"\n", | |
"[5 rows x 17 columns]" | |
] | |
}, | |
"metadata": { | |
"tags": [] | |
}, | |
"execution_count": 43 | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "5EDSYQRwDODF", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"\n", | |
"X_test['Title'] = le_Title.transform(X_test['Title'])\n", | |
"\n", | |
"X_test['Edition_Type'] = le_Edition_Type.transform(X_test['Edition_Type'])\n", | |
"\n", | |
"\n", | |
"\n", | |
"X_test['Edition_Month'] = le_Edition_Month.transform(X_test['Edition_Month'])\n", | |
"\n", | |
"X_test['Author1'] = le_Author.transform(X_test['Author1'])\n", | |
"X_test['Author2'] = le_Author.transform(X_test['Author2'])\n", | |
"X_test['Author3'] = le_Author.transform(X_test['Author3'])\n", | |
"X_test['Author4'] = le_Author.transform(X_test['Author4'])\n", | |
"X_test['Author5'] = le_Author.transform(X_test['Author5'])\n", | |
"X_test['Author6'] = le_Author.transform(X_test['Author6'])\n", | |
"X_test['Author7'] = le_Author.transform(X_test['Author7'])\n", | |
"\n", | |
"\n", | |
"X_test['Genre1'] = le_Genre.transform(X_test['Genre1'])\n", | |
"X_test['Genre2'] = le_Genre.transform(X_test['Genre2'])\n", | |
"\n", | |
"\n", | |
"X_test['Category1'] = le_Category.transform(X_test['Category1'])\n", | |
"X_test['Category2'] = le_Category.transform(X_test['Category2'])" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "z4_3jjKJD5dC", | |
"colab_type": "code", | |
"outputId": "1ca85fbf-0bac-4ed0-ace1-9df6c6dd0875", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 204 | |
} | |
}, | |
"source": [ | |
"X_test.head()" | |
], | |
"execution_count": 45, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Title</th>\n", | |
" <th>Author1</th>\n", | |
" <th>Author2</th>\n", | |
" <th>Author3</th>\n", | |
" <th>Author4</th>\n", | |
" <th>Author5</th>\n", | |
" <th>Author6</th>\n", | |
" <th>Author7</th>\n", | |
" <th>Edition_Type</th>\n", | |
" <th>Edition_Month</th>\n", | |
" <th>Edition_Year</th>\n", | |
" <th>Ratings</th>\n", | |
" <th>Reviews</th>\n", | |
" <th>Genre1</th>\n", | |
" <th>Genre2</th>\n", | |
" <th>Category1</th>\n", | |
" <th>Category2</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>5082</td>\n", | |
" <td>4058</td>\n", | |
" <td>3073</td>\n", | |
" <td>3073</td>\n", | |
" <td>3073</td>\n", | |
" <td>3073</td>\n", | |
" <td>3073</td>\n", | |
" <td>3073</td>\n", | |
" <td>12</td>\n", | |
" <td>11</td>\n", | |
" <td>1986</td>\n", | |
" <td>4.4</td>\n", | |
" <td>960</td>\n", | |
" <td>324</td>\n", | |
" <td>267</td>\n", | |
" <td>5</td>\n", | |
" <td>16</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>2906</td>\n", | |
" <td>1401</td>\n", | |
" <td>3073</td>\n", | |
" <td>3073</td>\n", | |
" <td>3073</td>\n", | |
" <td>3073</td>\n", | |
" <td>3073</td>\n", | |
" <td>3073</td>\n", | |
" <td>13</td>\n", | |
" <td>0</td>\n", | |
" <td>2018</td>\n", | |
" <td>5.0</td>\n", | |
" <td>1</td>\n", | |
" <td>273</td>\n", | |
" <td>267</td>\n", | |
" <td>4</td>\n", | |
" <td>9</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>751</td>\n", | |
" <td>949</td>\n", | |
" <td>3073</td>\n", | |
" <td>3073</td>\n", | |
" <td>3073</td>\n", | |
" <td>3073</td>\n", | |
" <td>3073</td>\n", | |
" <td>3073</td>\n", | |
" <td>13</td>\n", | |
" <td>7</td>\n", | |
" <td>2011</td>\n", | |
" <td>5.0</td>\n", | |
" <td>4</td>\n", | |
" <td>314</td>\n", | |
" <td>267</td>\n", | |
" <td>14</td>\n", | |
" <td>12</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>6232</td>\n", | |
" <td>169</td>\n", | |
" <td>3073</td>\n", | |
" <td>3073</td>\n", | |
" <td>3073</td>\n", | |
" <td>3073</td>\n", | |
" <td>3073</td>\n", | |
" <td>3073</td>\n", | |
" <td>13</td>\n", | |
" <td>9</td>\n", | |
" <td>2016</td>\n", | |
" <td>4.1</td>\n", | |
" <td>11</td>\n", | |
" <td>295</td>\n", | |
" <td>267</td>\n", | |
" <td>4</td>\n", | |
" <td>9</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>3790</td>\n", | |
" <td>3505</td>\n", | |
" <td>3073</td>\n", | |
" <td>3073</td>\n", | |
" <td>3073</td>\n", | |
" <td>3073</td>\n", | |
" <td>3073</td>\n", | |
" <td>3073</td>\n", | |
" <td>13</td>\n", | |
" <td>2</td>\n", | |
" <td>2011</td>\n", | |
" <td>4.4</td>\n", | |
" <td>9</td>\n", | |
" <td>235</td>\n", | |
" <td>267</td>\n", | |
" <td>10</td>\n", | |
" <td>11</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Title Author1 Author2 Author3 ... Genre1 Genre2 Category1 Category2\n", | |
"0 5082 4058 3073 3073 ... 324 267 5 16\n", | |
"1 2906 1401 3073 3073 ... 273 267 4 9\n", | |
"2 751 949 3073 3073 ... 314 267 14 12\n", | |
"3 6232 169 3073 3073 ... 295 267 4 9\n", | |
"4 3790 3505 3073 3073 ... 235 267 10 11\n", | |
"\n", | |
"[5 rows x 17 columns]" | |
] | |
}, | |
"metadata": { | |
"tags": [] | |
}, | |
"execution_count": 45 | |
} | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "QHxuxxl_UxVE", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"### Sclaing The Features" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "vOotF7qUz2HF", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"# Feature Scaling\n", | |
"\n", | |
"from sklearn.preprocessing import StandardScaler\n", | |
"sc = StandardScaler()\n", | |
"\n", | |
"X_train = sc.fit_transform(X_train)\n", | |
"\n", | |
"X_test = sc.transform(X_test)\n", | |
"\n", | |
"#Reshaping ti fit the scaler\n", | |
"Y_train = Y_train.reshape((len(Y_train), 1)) \n", | |
"\n", | |
"Y_train = sc.fit_transform(Y_train)\n", | |
"\n", | |
"#Restoring the original shape after scaling\n", | |
"Y_train = Y_train.ravel()" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
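As a rough sketch of what the scaling step does to the target, the snippet below standardizes a few made-up prices and then maps them back. The helper names (`fit_scaler`, `scale`, `unscale`) are assumptions for illustration only. It uses the population standard deviation, which is what `StandardScaler` computes. Because the notebook re-fits `sc` on `Y_train` last, the later `sc.inverse_transform` calls perform the `unscale` step on predicted prices.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Return the mean and population standard deviation of the values."""
    return mean(values), pstdev(values)


def scale(values, center, spread):
    """Standardize: z = (x - mean) / std."""
    return [(v - center) / spread for v in values]


def unscale(values, center, spread):
    """Invert the standardization: x = z * std + mean."""
    return [v * spread + center for v in values]


# Made-up prices for this example
prices = [200.0, 400.0, 600.0]
price_mean, price_std = fit_scaler(prices)

scaled = scale(prices, price_mean, price_std)
restored = unscale(scaled, price_mean, price_std)
```

After scaling, the values have zero mean and unit standard deviation, and `restored` matches the original prices up to floating-point error.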
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "sd80FmEw97yE", | |
"colab_type": "code", | |
"outputId": "f9bbeefc-c021-4ee7-cd66-87dfcd16568e", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 34 | |
} | |
}, | |
"source": [ | |
"X_train.shape #SC" | |
], | |
"execution_count": 47, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/plain": [ | |
"(6237, 17)" | |
] | |
}, | |
"metadata": { | |
"tags": [] | |
}, | |
"execution_count": 47 | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "xIikkIR1_bZY", | |
"colab_type": "code", | |
"outputId": "b1691d65-7391-4810-923b-35e4cb659c23", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 34 | |
} | |
}, | |
"source": [ | |
"Y_train.shape #SC" | |
], | |
"execution_count": 48, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/plain": [ | |
"(6237,)" | |
] | |
}, | |
"metadata": { | |
"tags": [] | |
}, | |
"execution_count": 48 | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "TEXau171tDqU", | |
"colab_type": "code", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 238 | |
}, | |
"outputId": "746931a5-edcd-4223-adde-3ec3b6bbde0a" | |
}, | |
"source": [ | |
"X_train" | |
], | |
"execution_count": 49, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/plain": [ | |
"array([[ 1.22489462, -1.11014305, 0.10361038, ..., -0.04054244,\n", | |
" -1.21387465, 0.3167796 ],\n", | |
" [-0.64983923, -1.40876249, 0.10361038, ..., -0.04054244,\n", | |
" -0.82060991, -1.88135243],\n", | |
" [-0.20992341, 1.50535127, 0.10361038, ..., -0.04054244,\n", | |
" 0.35918432, 0.3167796 ],\n", | |
" ...,\n", | |
" [ 0.91379674, -0.11278358, 0.10361038, ..., -0.04054244,\n", | |
" 1.53897855, 0.3167796 ],\n", | |
" [-0.75676322, -1.56395633, 0.10361038, ..., -0.04054244,\n", | |
" -1.21387465, 0.3167796 ],\n", | |
" [ 0.95096556, -0.2951915 , 0.10361038, ..., -0.04054244,\n", | |
" -1.21387465, 0.3167796 ]])" | |
] | |
}, | |
"metadata": { | |
"tags": [] | |
}, | |
"execution_count": 49 | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "W3mD60ErtFrK", | |
"colab_type": "code", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 238 | |
}, | |
"outputId": "9506d3d8-df30-4d05-84c3-a886441330cc" | |
}, | |
"source": [ | |
"X_test" | |
], | |
"execution_count": 55, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/plain": [ | |
"array([[ 0.8582981 , 1.2883741 , 0.10361038, ..., -0.04054244,\n", | |
" -0.23071279, 1.78220095],\n", | |
" [-0.24963804, -0.66589149, 0.10361038, ..., -0.04054244,\n", | |
" -0.42734517, -0.78228642],\n", | |
" [-1.34688178, -0.99834465, 0.10361038, ..., -0.04054244,\n", | |
" 1.53897855, 0.3167796 ],\n", | |
" ...,\n", | |
" [ 1.06145367, 0.00342793, 0.10361038, ..., -0.04054244,\n", | |
" 0.35918432, 0.3167796 ],\n", | |
" [ 0.20911677, -0.49672284, 0.10361038, ..., -0.04054244,\n", | |
" -0.82060991, -1.88135243],\n", | |
" [-1.15085447, -1.35139225, 0.10361038, ..., -0.04054244,\n", | |
" 0.75244907, -0.04957574]])" | |
] | |
}, | |
"metadata": { | |
"tags": [] | |
}, | |
"execution_count": 55 | |
} | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "kdGM1rpS4Ly2", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"## Building a Regression Model\n", | |
"\n", | |
"---\n", | |
"\n", | |
"At this stage, we are all ready with the data which can now be fed in to a regressor. We will build a simple XGBoost regressor and will fit the training data. We will then use the model to predict the prices of the Books in the test set.\n", | |
"\n", | |
"\n", | |
"\n", | |
"\n", | |
"* Building a simple XGBoost regressor\n", | |
"\n", | |
"* Testing the regressor on validation set\n", | |
"* Predicting the prices for test set data\n", | |
"* Saving the predictioons into an excel file\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "oxpRLSvCKRJg", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"### Creating Training & Valiation sets" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "gnoeicDYInty", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"from sklearn.model_selection import train_test_split\n", | |
"\n", | |
"train_x, val_x, train_y, val_y = train_test_split(X_train, Y_train, test_size = 0.1, random_state = 123)" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "Mjt8t2zCJ1W0", | |
"colab_type": "code", | |
"outputId": "4992ca5a-6c0f-4109-812c-7dc2c8508064", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 85 | |
} | |
}, | |
"source": [ | |
"print(train_x.shape)\n", | |
"print(train_y.shape)\n", | |
"print(val_x.shape)\n", | |
"print(val_y.shape)" | |
], | |
"execution_count": 68, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": [ | |
"(5613, 17)\n", | |
"(5613,)\n", | |
"(624, 17)\n", | |
"(624,)\n" | |
], | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "svtQWm8SCwg0", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"### XGBoost" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "3fgH-qtZNOLu", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"### Validating The Model\n", | |
"\n", | |
"---\n", | |
"\n", | |
"We will fist train and validate the model using RMLSE(Root Mean Squared Logerithmic Error).Once the validation is done we will use both the train and validation samples to train the final model which will be used to predict for the test set." | |
] | |
}, | |
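The metric can be sketched in plain Python as follows. This mirrors the base-10 logarithm used in the validation cell; many RMSLE definitions use the natural logarithm instead, so treat the base as a property of this notebook rather than of the metric in general. The function name `rmsle_log10` and the sample values are made up for the example.

```python
import math


def rmsle_log10(y_true, y_pred):
    """Root mean squared error between log10(pred + 1) and log10(true + 1)."""
    squared = [
        (math.log10(p + 1) - math.log10(t + 1)) ** 2
        for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared) / len(squared))


# Made-up values: the first prediction is exact, the second is off by one
# order of magnitude, i.e. by exactly 1 on the log10 scale
y_true = [99.0, 999.0]
y_pred = [99.0, 9999.0]

error = rmsle_log10(y_true, y_pred)
score = 1 - error  # higher is better, as in the notebook's printed score
```

Because the error is measured on a log scale, being off by a factor of ten costs the same whether the true price is 100 or 10,000.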
{ | |
"cell_type": "code", | |
"metadata": { | |
"colab_type": "code", | |
"id": "6XCp0ZKv7idh", | |
"outputId": "6aadca88-df94-48b4-85d1-819c3d66b8a2", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 34 | |
} | |
}, | |
"source": [ | |
"from xgboost import XGBRegressor\n", | |
"import numpy as np\n", | |
"\n", | |
"xgb=XGBRegressor( objective='reg:squarederror', max_depth=6, learning_rate=0.1, n_estimators=100, booster = 'gbtree', n_jobs = -1,random_state = 1)\n", | |
"xgb.fit(train_x,train_y)\n", | |
"\n", | |
"y_pred = sc.inverse_transform(xgb.predict(val_x))\n", | |
"y_true = sc.inverse_transform(val_y)\n", | |
"\n", | |
"error = np.square(np.log10(y_pred +1) - np.log10(y_true +1)).mean() ** 0.5\n", | |
"score = 1 - error\n", | |
"\n", | |
"print(\"RMLSE Score = \", score)" | |
], | |
"execution_count": 69, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": [ | |
"RMLSE Score = 0.7163958922112829\n" | |
], | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "NRDm0O5dNykQ", | |
"colab_type": "code", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 136 | |
}, | |
"outputId": "959f0d33-7c02-4cd1-b1a9-c710fd20db46" | |
}, | |
"source": [ | |
"# Fitting the complete training set (inclusing val_x and val_y)\n", | |
"xgb.fit(X_train,Y_train)\n" | |
], | |
"execution_count": 70, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/plain": [ | |
"XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n", | |
" colsample_bynode=1, colsample_bytree=1, gamma=0,\n", | |
" importance_type='gain', learning_rate=0.1, max_delta_step=0,\n", | |
" max_depth=6, min_child_weight=1, missing=None, n_estimators=100,\n", | |
" n_jobs=-1, nthread=None, objective='reg:squarederror',\n", | |
" random_state=1, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,\n", | |
" seed=None, silent=None, subsample=1, verbosity=1)" | |
] | |
}, | |
"metadata": { | |
"tags": [] | |
}, | |
"execution_count": 70 | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "v89PDQklKymM", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"# Predicting for test set\n", | |
"y_pred_xgb = sc.inverse_transform(xgb.predict(X_test))" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"colab_type": "code", | |
"id": "tqwjEHDU7idq", | |
"colab": {} | |
}, | |
"source": [ | |
"# Saving the predictions in excel file\n", | |
"\n", | |
"solution = pd.DataFrame(y_pred_xgb, columns = ['Price'])\n", | |
"solution.to_excel('Predict_Book_Price_Soln.xlsx', index = False)\n" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "ms2jR1CaOeIg", | |
"colab_type": "code", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 359 | |
}, | |
"outputId": "e9c78145-6d7d-49dd-db18-15460724cf6d" | |
}, | |
"source": [ | |
"solution.head(10)" | |
], | |
"execution_count": 74, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Price</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>214.681122</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>1330.917114</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>624.322449</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>844.769043</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>425.502533</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>5</th>\n", | |
" <td>890.277893</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>6</th>\n", | |
" <td>956.208252</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>7</th>\n", | |
" <td>339.453979</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>8</th>\n", | |
" <td>558.580017</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>9</th>\n", | |
" <td>467.108673</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Price\n", | |
"0 214.681122\n", | |
"1 1330.917114\n", | |
"2 624.322449\n", | |
"3 844.769043\n", | |
"4 425.502533\n", | |
"5 890.277893\n", | |
"6 956.208252\n", | |
"7 339.453979\n", | |
"8 558.580017\n", | |
"9 467.108673" | |
] | |
}, | |
"metadata": { | |
"tags": [] | |
}, | |
"execution_count": 74 | |
} | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "5i4R-fZT90gW", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"##Bayesian Optimization on XGBoost \n", | |
"\n", | |
"---\n", | |
"\n", | |
"In this step will will use Bayesian Optimization to optimize the hypermeters such as gamma, learning_rate, max_depth, n_estimators.\n", | |
"\n", | |
"\n", | |
"We will use a pre-built library called bayesian-optimization." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "LU2T5Jvr8RCl", | |
"colab_type": "code", | |
"outputId": "8ce4251a-bf1f-4d20-e19a-4bd6ccb91138", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 238 | |
} | |
}, | |
"source": [ | |
"!pip install bayesian-optimization" | |
], | |
"execution_count": 76, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": [ | |
"Collecting bayesian-optimization\n", | |
" Downloading https://files.pythonhosted.org/packages/72/0c/173ac467d0a53e33e41b521e4ceba74a8ac7c7873d7b857a8fbdca88302d/bayesian-optimization-1.0.1.tar.gz\n", | |
"Requirement already satisfied: numpy>=1.9.0 in /usr/local/lib/python3.6/dist-packages (from bayesian-optimization) (1.16.5)\n", | |
"Requirement already satisfied: scipy>=0.14.0 in /usr/local/lib/python3.6/dist-packages (from bayesian-optimization) (1.3.1)\n", | |
"Requirement already satisfied: scikit-learn>=0.18.0 in /usr/local/lib/python3.6/dist-packages (from bayesian-optimization) (0.21.3)\n", | |
"Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.6/dist-packages (from scikit-learn>=0.18.0->bayesian-optimization) (0.13.2)\n", | |
"Building wheels for collected packages: bayesian-optimization\n", | |
" Building wheel for bayesian-optimization (setup.py) ... \u001b[?25l\u001b[?25hdone\n", | |
" Created wheel for bayesian-optimization: filename=bayesian_optimization-1.0.1-cp36-none-any.whl size=10032 sha256=8fd26880064a093cff7a7ea3de5daeb04582531c5362de7b4848e56e5d73439b\n", | |
" Stored in directory: /root/.cache/pip/wheels/1d/0d/3b/6b9d4477a34b3905f246ff4e7acf6aafd4cc9b77d473629b77\n", | |
"Successfully built bayesian-optimization\n", | |
"Installing collected packages: bayesian-optimization\n", | |
"Successfully installed bayesian-optimization-1.0.1\n" | |
], | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "L_GOZGBx9OXa", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"from bayes_opt import BayesianOptimization\n", | |
"import xgboost as xgb\n", | |
"#from sklearn.metrics import mean_squared_error,mean_squared_log_error" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "2iFA25n4-blG", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"dtrain = xgb.DMatrix(X_train, label= Y_train)\n" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "NXOpETUc9hZ3", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"def bo_tune_xgb(max_depth, gamma, n_estimators ,learning_rate):\n", | |
" params = {'max_depth': int(max_depth),\n", | |
" 'gamma': gamma, \n", | |
" 'n_estimators': int(n_estimators),\n", | |
" 'learning_rate':learning_rate,\n", | |
" 'subsample': 0.8,\n", | |
" 'eta': 0.1,\n", | |
" 'eval_metric': 'rmse'}\n", | |
" \n", | |
" #Cross validating with the specified parameters in 5 folds and 70 iterations\n", | |
" cv_result = xgb.cv(params, dtrain, num_boost_round=100, nfold=10) \n", | |
" \n", | |
" #Return the negative RMSE\n", | |
" return -1.0 * cv_result['test-rmse-mean'].iloc[-1]" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
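The objective above follows a common pattern: the optimizer proposes floating-point values, the wrapper casts the integer hyperparameters, and the loss is negated because the optimizer maximizes its target. The sketch below shows that pattern in isolation. It is not Bayesian optimization: a plain random search stands in for the optimizer, and `toy_loss`, `objective` and `random_search` are hypothetical names with a made-up loss surface.

```python
import random


def toy_loss(max_depth, learning_rate):
    """Made-up loss surface with its minimum at max_depth=6, learning_rate=0.1."""
    return (max_depth - 6) ** 2 * 0.01 + (learning_rate - 0.1) ** 2


def objective(max_depth, learning_rate):
    # The search proposes floats, so integer hyperparameters are cast here
    depth = int(max_depth)
    # Negate the loss because the search maximizes its target
    return -toy_loss(depth, learning_rate)


def random_search(func, bounds, n_trials, seed=0):
    """Sample uniformly within the bounds and keep the best target seen."""
    rng = random.Random(seed)
    best_params, best_target = None, float("-inf")
    for _ in range(n_trials):
        params = {
            name: rng.uniform(low, high) for name, (low, high) in bounds.items()
        }
        target = func(**params)
        if target > best_target:
            best_params, best_target = params, target
    return best_params, best_target


bounds = {"max_depth": (1, 12), "learning_rate": (0.01, 1.0)}
best_params, best_target = random_search(objective, bounds, n_trials=50)
```

A Bayesian optimizer replaces the uniform sampling with proposals guided by a surrogate model of past results, but it calls the objective in the same way.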
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "Ef_-mQ1N9lT3", | |
"colab_type": "code", | |
"outputId": "b4484c85-d12f-4e62-c598-9aa9790da8d5", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 408 | |
} | |
}, | |
"source": [ | |
"\n", | |
"xgb_bo = BayesianOptimization(bo_tune_xgb, {'max_depth': (1, 300), \n", | |
" 'gamma': (0, 1),\n", | |
" 'learning_rate':(0,1),\n", | |
" 'n_estimators':(1,1000)\n", | |
" })\n", | |
"\n", | |
"\n", | |
"xgb_bo.maximize(n_iter=10, init_points=10, acq='ei')" | |
], | |
"execution_count": 80, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": [ | |
"| iter | target | gamma | learni... | max_depth | n_esti... |\n", | |
"-------------------------------------------------------------------------\n", | |
"| \u001b[0m 1 \u001b[0m | \u001b[0m-1.009 \u001b[0m | \u001b[0m 0.2697 \u001b[0m | \u001b[0m 0.5214 \u001b[0m | \u001b[0m 51.2 \u001b[0m | \u001b[0m 389.1 \u001b[0m |\n", | |
"| \u001b[95m 2 \u001b[0m | \u001b[95m-0.9282 \u001b[0m | \u001b[95m 0.06396 \u001b[0m | \u001b[95m 0.07131 \u001b[0m | \u001b[95m 240.2 \u001b[0m | \u001b[95m 947.1 \u001b[0m |\n", | |
"| \u001b[0m 3 \u001b[0m | \u001b[0m-0.9759 \u001b[0m | \u001b[0m 0.6246 \u001b[0m | \u001b[0m 0.3789 \u001b[0m | \u001b[0m 76.9 \u001b[0m | \u001b[0m 530.3 \u001b[0m |\n", | |
"| \u001b[95m 4 \u001b[0m | \u001b[95m-0.9172 \u001b[0m | \u001b[95m 0.8354 \u001b[0m | \u001b[95m 0.1186 \u001b[0m | \u001b[95m 195.2 \u001b[0m | \u001b[95m 685.8 \u001b[0m |\n", | |
"| \u001b[0m 5 \u001b[0m | \u001b[0m-0.9174 \u001b[0m | \u001b[0m 0.9739 \u001b[0m | \u001b[0m 0.1673 \u001b[0m | \u001b[0m 8.738 \u001b[0m | \u001b[0m 905.6 \u001b[0m |\n", | |
"| \u001b[0m 6 \u001b[0m | \u001b[0m-1.074 \u001b[0m | \u001b[0m 0.6809 \u001b[0m | \u001b[0m 0.7988 \u001b[0m | \u001b[0m 42.74 \u001b[0m | \u001b[0m 197.3 \u001b[0m |\n", | |
"| \u001b[0m 7 \u001b[0m | \u001b[0m-0.9909 \u001b[0m | \u001b[0m 0.1068 \u001b[0m | \u001b[0m 0.405 \u001b[0m | \u001b[0m 33.39 \u001b[0m | \u001b[0m 796.0 \u001b[0m |\n", | |
"| \u001b[0m 8 \u001b[0m | \u001b[0m-0.943 \u001b[0m | \u001b[0m 0.6288 \u001b[0m | \u001b[0m 0.2134 \u001b[0m | \u001b[0m 231.1 \u001b[0m | \u001b[0m 114.2 \u001b[0m |\n", | |
"| \u001b[0m 9 \u001b[0m | \u001b[0m-0.9464 \u001b[0m | \u001b[0m 0.2038 \u001b[0m | \u001b[0m 0.2343 \u001b[0m | \u001b[0m 152.1 \u001b[0m | \u001b[0m 38.35 \u001b[0m |\n", | |
"| \u001b[0m 10 \u001b[0m | \u001b[0m-0.9398 \u001b[0m | \u001b[0m 0.7739 \u001b[0m | \u001b[0m 0.209 \u001b[0m | \u001b[0m 256.2 \u001b[0m | \u001b[0m 416.5 \u001b[0m |\n", | |
"| \u001b[0m 11 \u001b[0m | \u001b[0m-1.112 \u001b[0m | \u001b[0m 1.0 \u001b[0m | \u001b[0m 0.0 \u001b[0m | \u001b[0m 1.0 \u001b[0m | \u001b[0m 1e+03 \u001b[0m |\n", | |
"| \u001b[0m 12 \u001b[0m | \u001b[0m-1.112 \u001b[0m | \u001b[0m 1.0 \u001b[0m | \u001b[0m 0.0 \u001b[0m | \u001b[0m 300.0 \u001b[0m | \u001b[0m 779.8 \u001b[0m |\n", | |
"| \u001b[0m 13 \u001b[0m | \u001b[0m-1.152 \u001b[0m | \u001b[0m 1.0 \u001b[0m | \u001b[0m 1.0 \u001b[0m | \u001b[0m 300.0 \u001b[0m | \u001b[0m 1.0 \u001b[0m |\n", | |
"| \u001b[0m 14 \u001b[0m | \u001b[0m-1.112 \u001b[0m | \u001b[0m 1.0 \u001b[0m | \u001b[0m 0.0 \u001b[0m | \u001b[0m 125.2 \u001b[0m | \u001b[0m 898.1 \u001b[0m |\n", | |
"| \u001b[0m 15 \u001b[0m | \u001b[0m-1.135 \u001b[0m | \u001b[0m 0.0 \u001b[0m | \u001b[0m 1.0 \u001b[0m | \u001b[0m 300.0 \u001b[0m | \u001b[0m 1e+03 \u001b[0m |\n", | |
"| \u001b[0m 16 \u001b[0m | \u001b[0m-1.112 \u001b[0m | \u001b[0m 0.0 \u001b[0m | \u001b[0m 0.0 \u001b[0m | \u001b[0m 1.0 \u001b[0m | \u001b[0m 1.0 \u001b[0m |\n", | |
"| \u001b[0m 17 \u001b[0m | \u001b[0m-1.112 \u001b[0m | \u001b[0m 0.0 \u001b[0m | \u001b[0m 0.0 \u001b[0m | \u001b[0m 230.3 \u001b[0m | \u001b[0m 570.6 \u001b[0m |\n", | |
"| \u001b[0m 18 \u001b[0m | \u001b[0m-1.112 \u001b[0m | \u001b[0m 1.0 \u001b[0m | \u001b[0m 0.0 \u001b[0m | \u001b[0m 300.0 \u001b[0m | \u001b[0m 283.2 \u001b[0m |\n", | |
"| \u001b[0m 19 \u001b[0m | \u001b[0m-1.152 \u001b[0m | \u001b[0m 1.0 \u001b[0m | \u001b[0m 1.0 \u001b[0m | \u001b[0m 174.7 \u001b[0m | \u001b[0m 356.2 \u001b[0m |\n", | |
"| \u001b[0m 20 \u001b[0m | \u001b[0m-1.112 \u001b[0m | \u001b[0m 1.0 \u001b[0m | \u001b[0m 0.0 \u001b[0m | \u001b[0m 1.0 \u001b[0m | \u001b[0m 610.3 \u001b[0m |\n", | |
"=========================================================================\n" | |
], | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "muarWA_b9nwh", | |
"colab_type": "code", | |
"outputId": "c119acf6-4006-4e6d-ffce-9fa6ea3b71f1", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 51 | |
} | |
}, | |
"source": [ | |
"#Extracting the best parameters\n", | |
"params = xgb_bo.max['params']\n", | |
"\n", | |
"print(params)\n", | |
"\n", | |
"#Converting the max_depth and n_estimators values from float to int\n", | |
"params['max_depth'] = int(round(params['max_depth']))\n", | |
"params['n_estimators'] = int(round(params['n_estimators']))\n", | |
"\n", | |
"print(params)\n" | |
], | |
"execution_count": 81, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": [ | |
"{'gamma': 0.835366902965147, 'learning_rate': 0.11864947002102888, 'max_depth': 195.21748298318235, 'n_estimators': 685.7597094777982}\n", | |
"{'gamma': 0.835366902965147, 'learning_rate': 0.11864947002102888, 'max_depth': 195, 'n_estimators': 686}\n" | |
], | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "ogWf5ZKO9x5p", | |
"colab_type": "code", | |
"outputId": "5afce368-5ecc-4d98-ee64-8afa4caea574", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 34 | |
} | |
}, | |
"source": [ | |
"#Initialize an XGBRegressor with the tuned parameters and fit it on the training data\n", | |
"from xgboost import XGBRegressor\n", | |
"reg = XGBRegressor(**params).fit(X_train, Y_train)\n", | |
"\n", | |
"y_pred_reg = sc.inverse_transform(reg.predict(X_test))" | |
], | |
"execution_count": 83, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": [ | |
"[10:19:41] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.\n" | |
], | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "NFU-H-icP5uD", | |
"colab_type": "code", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 359 | |
}, | |
"outputId": "a9fdd176-11d9-4e6f-8ef8-29b48c83f0bd" | |
}, | |
"source": [ | |
"solution_bo = pd.DataFrame(y_pred_reg, columns = ['Price'])\n", | |
"\n", | |
"solution_bo.head(10)" | |
], | |
"execution_count": 90, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Price</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>241.728058</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>1828.147339</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>429.341614</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>858.137817</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>372.580841</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>5</th>\n", | |
" <td>503.807220</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>6</th>\n", | |
" <td>587.503967</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>7</th>\n", | |
" <td>476.958710</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>8</th>\n", | |
" <td>399.994812</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>9</th>\n", | |
" <td>335.585876</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Price\n", | |
"0 241.728058\n", | |
"1 1828.147339\n", | |
"2 429.341614\n", | |
"3 858.137817\n", | |
"4 372.580841\n", | |
"5 503.807220\n", | |
"6 587.503967\n", | |
"7 476.958710\n", | |
"8 399.994812\n", | |
"9 335.585876" | |
] | |
}, | |
"metadata": { | |
"tags": [] | |
}, | |
"execution_count": 90 | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "wUCplBBRQLqI", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"solution_bo.to_excel('Predict_Book_Prices_BO_Soln.xlsx', index = False)" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "AljdBCrgZ14J", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"Once you have your solution files, upload them to MachineHack to see your score.\n", | |
"\n", | |
"Good luck!" | |
] | |
} | |
] | |
} |
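
The notebook's parameter-conversion step (rounding `max_depth` and `n_estimators` from float to int before building the regressor) could be wrapped in a small reusable helper. What follows is only a minimal sketch: the helper name `cast_integer_params` and the `INTEGER_PARAMS` tuple are hypothetical and do not appear in the notebook, and the floor of 1 is an added assumption to guard against a rounded value of 0. The sample dictionary reuses the best parameters printed in the notebook output.

```python
# Hypothetical helper, not part of the original notebook: round the
# integer-valued hyperparameters that a Bayesian optimizer returns as floats.
INTEGER_PARAMS = ("max_depth", "n_estimators")  # names taken from the notebook's search space


def cast_integer_params(params, integer_keys=INTEGER_PARAMS):
    """Return a copy of `params` with the listed keys rounded to int."""
    converted = dict(params)  # copy so the optimizer's result is left untouched
    for key in integer_keys:
        if key in converted:
            # Round to the nearest integer; keep at least 1 (assumption: a
            # depth or tree count of 0 is not a useful setting).
            converted[key] = max(1, int(round(converted[key])))
    return converted


# Best parameters as printed in the notebook output.
best = {
    "gamma": 0.835366902965147,
    "learning_rate": 0.11864947002102888,
    "max_depth": 195.21748298318235,
    "n_estimators": 685.7597094777982,
}

tuned = cast_integer_params(best)
print(tuned)
```

The returned dictionary could then be passed on in the same way the notebook does, for example `XGBRegressor(**tuned)`. Returning a copy rather than editing in place keeps the optimizer's raw result available for inspection.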