Last active
April 23, 2021 06:36
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Load the dataset" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import numpy as np\n", | |
"import pandas as pd\n", | |
"import matplotlib.pyplot as plt\n", | |
"import re\n", | |
"import seaborn as sns\n", | |
"%matplotlib inline" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import warnings\n", | |
"warnings.filterwarnings(\"ignore\", category=DeprecationWarning)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"anime = pd.read_csv(\"anime.csv\")" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 12, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>anime_id</th>\n", | |
" <th>name</th>\n", | |
" <th>genre</th>\n", | |
" <th>type</th>\n", | |
" <th>episodes</th>\n", | |
" <th>rating</th>\n", | |
" <th>members</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>32281</td>\n", | |
" <td>Kimi no Na wa.</td>\n", | |
" <td>Drama, Romance, School, Supernatural</td>\n", | |
" <td>Movie</td>\n", | |
" <td>1</td>\n", | |
" <td>9.37</td>\n", | |
" <td>200630</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>5114</td>\n", | |
" <td>Fullmetal Alchemist: Brotherhood</td>\n", | |
" <td>Action, Adventure, Drama, Fantasy, Magic, Mili...</td>\n", | |
" <td>TV</td>\n", | |
" <td>64</td>\n", | |
" <td>9.26</td>\n", | |
" <td>793665</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>28977</td>\n", | |
" <td>Gintama°</td>\n", | |
" <td>Action, Comedy, Historical, Parody, Samurai, S...</td>\n", | |
" <td>TV</td>\n", | |
" <td>51</td>\n", | |
" <td>9.25</td>\n", | |
" <td>114262</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>9253</td>\n", | |
" <td>Steins;Gate</td>\n", | |
" <td>Sci-Fi, Thriller</td>\n", | |
" <td>TV</td>\n", | |
" <td>24</td>\n", | |
" <td>9.17</td>\n", | |
" <td>673572</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>9969</td>\n", | |
" <td>Gintama&#039;</td>\n", | |
" <td>Action, Comedy, Historical, Parody, Samurai, S...</td>\n", | |
" <td>TV</td>\n", | |
" <td>51</td>\n", | |
" <td>9.16</td>\n", | |
" <td>151266</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" anime_id name \\\n", | |
"0 32281 Kimi no Na wa. \n", | |
"1 5114 Fullmetal Alchemist: Brotherhood \n", | |
"2 28977 Gintama° \n", | |
"3 9253 Steins;Gate \n", | |
"4 9969 Gintama' \n", | |
"\n", | |
" genre type episodes rating \\\n", | |
"0 Drama, Romance, School, Supernatural Movie 1 9.37 \n", | |
"1 Action, Adventure, Drama, Fantasy, Magic, Mili... TV 64 9.26 \n", | |
"2 Action, Comedy, Historical, Parody, Samurai, S... TV 51 9.25 \n", | |
"3 Sci-Fi, Thriller TV 24 9.17 \n", | |
"4 Action, Comedy, Historical, Parody, Samurai, S... TV 51 9.16 \n", | |
"\n", | |
" members \n", | |
"0 200630 \n", | |
"1 793665 \n", | |
"2 114262 \n", | |
"3 673572 \n", | |
"4 151266 " | |
] | |
}, | |
"execution_count": 12, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"anime.head()\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 13, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"anime_id 0\n", | |
"name 0\n", | |
"genre 62\n", | |
"type 25\n", | |
"episodes 0\n", | |
"rating 230\n", | |
"members 0\n", | |
"dtype: int64" | |
] | |
}, | |
"execution_count": 13, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"anime.isnull().sum()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Data preprocessing \n", | |
"\n", | |
"## Episodes\n", | |
"\n", | |
"Many animes have unknown number of episodes even if they have similar rating. On top of that many super popular animes such as Naruto Shippuden, Attack on Titan Season 2 were ongoing when the data was collected, thus their number of episodes was considered as \"Unknown\". For some of my favorite animes I've filled in the episode numbers manually. For the other anime's, I had to make some educated guesses. Changes I've made are :\n", | |
"\n", | |
"Animes that are grouped under Hentai Categories generally have 1 episode in my experience. So I've filled the unknown values with 1.\n", | |
"\n", | |
"Animes that are grouped are \"OVA\" stands for \"Original Video Animation\". These are generally one/two episode long animes(often the popular ones have 2/3 episodes though), but I've decided to fill the unknown numbers of episodes with 1 again.\n", | |
"\n", | |
"Animes that are grouped under \"Movies\" are considered as '1' episode as per the dataset overview goes.\n", | |
"\n", | |
"For all the other animes with unknown number of episodes, I've filled the na values with the median which is 2.\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 14, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>anime_id</th>\n", | |
" <th>name</th>\n", | |
" <th>genre</th>\n", | |
" <th>type</th>\n", | |
" <th>episodes</th>\n", | |
" <th>rating</th>\n", | |
" <th>members</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>74</th>\n", | |
" <td>21</td>\n", | |
" <td>One Piece</td>\n", | |
" <td>Action, Adventure, Comedy, Drama, Fantasy, Sho...</td>\n", | |
" <td>TV</td>\n", | |
" <td>Unknown</td>\n", | |
" <td>8.58</td>\n", | |
" <td>504862</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>252</th>\n", | |
" <td>235</td>\n", | |
" <td>Detective Conan</td>\n", | |
" <td>Adventure, Comedy, Mystery, Police, Shounen</td>\n", | |
" <td>TV</td>\n", | |
" <td>Unknown</td>\n", | |
" <td>8.25</td>\n", | |
" <td>114702</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>615</th>\n", | |
" <td>1735</td>\n", | |
" <td>Naruto: Shippuuden</td>\n", | |
" <td>Action, Comedy, Martial Arts, Shounen, Super P...</td>\n", | |
" <td>TV</td>\n", | |
" <td>Unknown</td>\n", | |
" <td>7.94</td>\n", | |
" <td>533578</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" anime_id name \\\n", | |
"74 21 One Piece \n", | |
"252 235 Detective Conan \n", | |
"615 1735 Naruto: Shippuuden \n", | |
"\n", | |
" genre type episodes rating \\\n", | |
"74 Action, Adventure, Comedy, Drama, Fantasy, Sho... TV Unknown 8.58 \n", | |
"252 Adventure, Comedy, Mystery, Police, Shounen TV Unknown 8.25 \n", | |
"615 Action, Comedy, Martial Arts, Shounen, Super P... TV Unknown 7.94 \n", | |
"\n", | |
" members \n", | |
"74 504862 \n", | |
"252 114702 \n", | |
"615 533578 " | |
] | |
}, | |
"execution_count": 14, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"anime[anime['episodes']=='Unknown'].head(3)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 15, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"anime.loc[(anime[\"genre\"]==\"Hentai\") & (anime[\"episodes\"]==\"Unknown\"),\"episodes\"] = \"1\"\n", | |
"anime.loc[(anime[\"type\"]==\"OVA\") & (anime[\"episodes\"]==\"Unknown\"),\"episodes\"] = \"1\"\n", | |
"\n", | |
"anime.loc[(anime[\"type\"] == \"Movie\") & (anime[\"episodes\"] == \"Unknown\")] = \"1\"" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 16, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"known_animes = {\"Naruto Shippuuden\":500, \"One Piece\":784,\"Detective Conan\":854, \"Dragon Ball Super\":86,\n", | |
" \"Crayon Shin chan\":942, \"Yu Gi Oh Arc V\":148,\"Shingeki no Kyojin Season 2\":25,\n", | |
" \"Boku no Hero Academia 2nd Season\":25,\"Little Witch Academia TV\":25}\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 17, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"for k,v in known_animes.items(): \n", | |
" anime.loc[anime[\"name\"]==k,\"episodes\"] = v" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 18, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"anime[\"episodes\"] = anime[\"episodes\"].map(lambda x:np.nan if x==\"Unknown\" else x)\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 19, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"anime[\"episodes\"].fillna(anime[\"episodes\"].median(),inplace = True)\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Type" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 20, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>type_1</th>\n", | |
" <th>type_Movie</th>\n", | |
" <th>type_Music</th>\n", | |
" <th>type_ONA</th>\n", | |
" <th>type_OVA</th>\n", | |
" <th>type_Special</th>\n", | |
" <th>type_TV</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>0</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" type_1 type_Movie type_Music type_ONA type_OVA type_Special type_TV\n", | |
"0 0 1 0 0 0 0 0\n", | |
"1 0 0 0 0 0 0 1\n", | |
"2 0 0 0 0 0 0 1\n", | |
"3 0 0 0 0 0 0 1\n", | |
"4 0 0 0 0 0 0 1" | |
] | |
}, | |
"execution_count": 20, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"pd.get_dummies(anime[[\"type\"]]).head()\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Rating, Members and Genre\n", | |
"\n", | |
"For members feature, I Just converted the strings to float.Episode numbers, members and rating are different from categorical variables and very different in values. Rating ranges from 0-10 in the dataset while the episode number can be even 800+ episodes long when it comes to long running popular animes such as One Piece, Naruto etc. So I ended up using sklearn.preprocessing.MinMaxScaler as it scales the values from 0-1.Many animes have unknown ratings. These were filled with the median of the ratings.\n", | |
"\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 21, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"anime[\"rating\"] = anime[\"rating\"].astype(float)\n", | |
"anime[\"rating\"].fillna(anime[\"rating\"].median(),inplace = True)\n", | |
"anime[\"members\"] = anime[\"members\"].astype(float)\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 22, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Adventure</th>\n", | |
" <th>Cars</th>\n", | |
" <th>Comedy</th>\n", | |
" <th>Dementia</th>\n", | |
" <th>Demons</th>\n", | |
" <th>Drama</th>\n", | |
" <th>Ecchi</th>\n", | |
" <th>Fantasy</th>\n", | |
" <th>Game</th>\n", | |
" <th>Harem</th>\n", | |
" <th>...</th>\n", | |
" <th>type_1</th>\n", | |
" <th>type_Movie</th>\n", | |
" <th>type_Music</th>\n", | |
" <th>type_ONA</th>\n", | |
" <th>type_OVA</th>\n", | |
" <th>type_Special</th>\n", | |
" <th>type_TV</th>\n", | |
" <th>rating</th>\n", | |
" <th>members</th>\n", | |
" <th>episodes</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>...</td>\n", | |
" <td>0</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>9.37</td>\n", | |
" <td>200630.0</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>...</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>1</td>\n", | |
" <td>9.26</td>\n", | |
" <td>793665.0</td>\n", | |
" <td>64</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>...</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>1</td>\n", | |
" <td>9.25</td>\n", | |
" <td>114262.0</td>\n", | |
" <td>51</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>...</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>1</td>\n", | |
" <td>9.17</td>\n", | |
" <td>673572.0</td>\n", | |
" <td>24</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>...</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>1</td>\n", | |
" <td>9.16</td>\n", | |
" <td>151266.0</td>\n", | |
" <td>51</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"<p>5 rows × 93 columns</p>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Adventure Cars Comedy Dementia Demons Drama Ecchi Fantasy \\\n", | |
"0 0 0 0 0 0 0 0 0 \n", | |
"1 1 0 0 0 0 1 0 1 \n", | |
"2 0 0 1 0 0 0 0 0 \n", | |
"3 0 0 0 0 0 0 0 0 \n", | |
"4 0 0 1 0 0 0 0 0 \n", | |
"\n", | |
" Game Harem ... type_1 type_Movie type_Music type_ONA \\\n", | |
"0 0 0 ... 0 1 0 0 \n", | |
"1 0 0 ... 0 0 0 0 \n", | |
"2 0 0 ... 0 0 0 0 \n", | |
"3 0 0 ... 0 0 0 0 \n", | |
"4 0 0 ... 0 0 0 0 \n", | |
"\n", | |
" type_OVA type_Special type_TV rating members episodes \n", | |
"0 0 0 0 9.37 200630.0 1 \n", | |
"1 0 0 1 9.26 793665.0 64 \n", | |
"2 0 0 1 9.25 114262.0 51 \n", | |
"3 0 0 1 9.17 673572.0 24 \n", | |
"4 0 0 1 9.16 151266.0 51 \n", | |
"\n", | |
"[5 rows x 93 columns]" | |
] | |
}, | |
"execution_count": 22, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Scaling\n", | |
"\n", | |
"anime_features = pd.concat([anime[\"genre\"].str.get_dummies(sep=\",\"),\n", | |
" pd.get_dummies(anime[[\"type\"]]),\n", | |
" anime[[\"rating\"]],anime[[\"members\"]],anime[\"episodes\"]],axis=1)\n", | |
"anime[\"name\"] = anime[\"name\"].map(lambda name:re.sub('[^A-Za-z0-9]+', \" \", name))\n", | |
"anime_features.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 23, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"Index([' Adventure', ' Cars', ' Comedy', ' Dementia', ' Demons', ' Drama',\n", | |
" ' Ecchi', ' Fantasy', ' Game', ' Harem', ' Hentai', ' Historical',\n", | |
" ' Horror', ' Josei', ' Kids', ' Magic', ' Martial Arts', ' Mecha',\n", | |
" ' Military', ' Music', ' Mystery', ' Parody', ' Police',\n", | |
" ' Psychological', ' Romance', ' Samurai', ' School', ' Sci-Fi',\n", | |
" ' Seinen', ' Shoujo', ' Shoujo Ai', ' Shounen', ' Shounen Ai',\n", | |
" ' Slice of Life', ' Space', ' Sports', ' Super Power', ' Supernatural',\n", | |
" ' Thriller', ' Vampire', ' Yaoi', ' Yuri', '1', 'Action', 'Adventure',\n", | |
" 'Cars', 'Comedy', 'Dementia', 'Demons', 'Drama', 'Ecchi', 'Fantasy',\n", | |
" 'Game', 'Harem', 'Hentai', 'Historical', 'Horror', 'Josei', 'Kids',\n", | |
" 'Magic', 'Martial Arts', 'Mecha', 'Military', 'Music', 'Mystery',\n", | |
" 'Parody', 'Police', 'Psychological', 'Romance', 'Samurai', 'School',\n", | |
" 'Sci-Fi', 'Seinen', 'Shoujo', 'Shounen', 'Slice of Life', 'Space',\n", | |
" 'Sports', 'Super Power', 'Supernatural', 'Thriller', 'Vampire', 'Yaoi',\n", | |
" 'type_1', 'type_Movie', 'type_Music', 'type_ONA', 'type_OVA',\n", | |
" 'type_Special', 'type_TV', 'rating', 'members', 'episodes'],\n", | |
" dtype='object')" | |
] | |
}, | |
"execution_count": 23, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"anime_features.columns\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 24, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"from sklearn.preprocessing import MinMaxScaler\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 25, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"min_max_scaler = MinMaxScaler()\n", | |
"anime_features = min_max_scaler.fit_transform(anime_features)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 30, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"array([[ 0. , 0. , 0. , ..., 0.93, 0.2 , 0. ],\n", | |
" [ 1. , 0. , 0. , ..., 0.92, 0.78, 0.03],\n", | |
" [ 0. , 0. , 1. , ..., 0.92, 0.11, 0.03],\n", | |
" ..., \n", | |
" [ 0. , 0. , 0. , ..., 0.43, 0. , 0. ],\n", | |
" [ 0. , 0. , 0. , ..., 0.44, 0. , 0. ],\n", | |
" [ 0. , 0. , 0. , ..., 0.5 , 0. , 0. ]])" | |
] | |
}, | |
"execution_count": 30, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"np.round(anime_features,2)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Fit Nearest Neighbor To Data" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 17, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"from sklearn.neighbors import NearestNeighbors\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 18, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"nbrs = NearestNeighbors(n_neighbors=6, algorithm='ball_tree').fit(anime_features)\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 19, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"distances, indices = nbrs.kneighbors(anime_features)\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Query examples and helper functions\n", | |
"\n", | |
"Many anime names have not been documented properly and in many cases the names are in Japanese instead of English and the spelling is often different. For that reason I've also created another helper function get_id_from_partial_name to find out ids of the animes from part of names." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 20, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"def get_index_from_name(name):\n", | |
" return anime[anime[\"name\"]==name].index.tolist()[0]\n", | |
" " | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 24, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"all_anime_names = list(anime.name.values)\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 25, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"def get_id_from_partial_name(partial):\n", | |
" for name in all_anime_names:\n", | |
" if partial in name:\n", | |
" print(name,all_anime_names.index(name))" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 26, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"\"\"\" print_similar_query can search for similar animes both by id and by name. \"\"\"\n", | |
"\n", | |
"def print_similar_animes(query=None,id=None):\n", | |
" if id:\n", | |
" for id in indices[id][1:]:\n", | |
" print(anime.ix[id][\"name\"])\n", | |
" if query:\n", | |
" found_id = get_index_from_name(query)\n", | |
" for id in indices[found_id][1:]:\n", | |
" print(anime.ix[id][\"name\"])" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Query Examples " | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 27, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Naruto Shippuuden\n", | |
"Katekyo Hitman Reborn \n", | |
"Bleach\n", | |
"Dragon Ball Z\n", | |
"Boku no Hero Academia\n" | |
] | |
} | |
], | |
"source": [ | |
"print_similar_animes(query=\"Naruto\")\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 28, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Noragami Aragoto\n", | |
"JoJo no Kimyou na Bouken TV \n", | |
"JoJo no Kimyou na Bouken Stardust Crusaders\n", | |
"JoJo no Kimyou na Bouken Stardust Crusaders 2nd Season\n", | |
"Yumekui Merry\n" | |
] | |
} | |
], | |
"source": [ | |
"print_similar_animes(\"Noragami\")\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 29, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Mushishi Zoku Shou\n", | |
"Mushishi Zoku Shou 2nd Season\n", | |
"Mushishi Special Hihamukage\n", | |
"Mushishi Zoku Shou Odoro no Michi\n", | |
"Mushishi Zoku Shou Suzu no Shizuku\n" | |
] | |
} | |
], | |
"source": [ | |
"print_similar_animes(\"Mushishi\")\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 30, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Gintama 039 \n", | |
"Gintama \n", | |
"Gintama 039 Enchousen\n", | |
"Gintama 2017 \n", | |
"Gintama Movie Kanketsu hen Yorozuya yo Eien Nare\n" | |
] | |
} | |
], | |
"source": [ | |
"print_similar_animes(\"Gintama\")\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 31, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Fairy Tail 2014 \n", | |
"Magi The Labyrinth of Magic\n", | |
"Magi The Kingdom of Magic\n", | |
"Densetsu no Yuusha no Densetsu\n", | |
"Magi Sinbad no Bouken TV \n" | |
] | |
} | |
], | |
"source": [ | |
"print_similar_animes(\"Fairy Tail\")\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 32, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Boruto Naruto the Movie 486\n", | |
"Naruto Shippuuden 615\n", | |
"The Last Naruto the Movie 719\n", | |
"Naruto Shippuuden Movie 6 Road to Ninja 784\n", | |
"Naruto 841\n", | |
"Boruto Naruto the Movie Naruto ga Hokage ni Natta Hi 1103\n", | |
"Naruto Shippuuden Movie 5 Blood Prison 1237\n", | |
"Naruto x UT 1343\n", | |
"Naruto Shippuuden Movie 4 The Lost Tower 1472\n", | |
"Naruto Shippuuden Movie 3 Hi no Ishi wo Tsugu Mono 1573\n", | |
"Naruto Shippuuden Movie 1 1827\n", | |
"Naruto Shippuuden Movie 2 Kizuna 1828\n", | |
"Naruto Shippuuden Shippuu quot Konoha Gakuen quot Den 2374\n", | |
"Naruto Honoo no Chuunin Shiken Naruto vs Konohamaru 2416\n", | |
"Naruto SD Rock Lee no Seishun Full Power Ninden 2457\n", | |
"Naruto Shippuuden Sunny Side Battle 2458\n", | |
"Naruto Movie 1 Dai Katsugeki Yuki Hime Shinobu Houjou Dattebayo 2756\n", | |
"Naruto Soyokazeden Movie Naruto to Mashin to Mitsu no Onegai Dattebayo 2997\n", | |
"Naruto Movie 2 Dai Gekitotsu Maboroshi no Chiteiiseki Dattebayo 3449\n", | |
"Naruto Dai Katsugeki Yuki Hime Shinobu Houjou Dattebayo Special Konoha Annual Sports Festival 3529\n", | |
"Naruto Movie 3 Dai Koufun Mikazuki Jima no Animaru Panikku Dattebayo 3560\n", | |
"Naruto The Cross Roads 3561\n", | |
"Naruto Narutimate Hero 3 Tsuini Gekitotsu Jounin vs Genin Musabetsu Dairansen taikai Kaisai 3838\n", | |
"Naruto Takigakure no Shitou Ore ga Eiyuu Dattebayo 3984\n", | |
"Naruto Akaki Yotsuba no Clover wo Sagase 5111\n" | |
] | |
} | |
], | |
"source": [ | |
"get_id_from_partial_name(\"Naruto\")\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 33, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Naruto Shippuuden Movie 6 Road to Ninja\n", | |
"Boruto Naruto the Movie\n", | |
"Naruto Shippuuden Movie 4 The Lost Tower\n", | |
"Naruto Shippuuden Movie 3 Hi no Ishi wo Tsugu Mono\n", | |
"Naruto Honoo no Chuunin Shiken Naruto vs Konohamaru \n" | |
] | |
} | |
], | |
"source": [ | |
"print_similar_animes(id=719)\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 34, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Kokoro ga Sakebitagatterunda \n", | |
"Harmonie\n", | |
"Air Movie\n", | |
"Hotarubi no Mori e\n", | |
"Momo e no Tegami\n" | |
] | |
} | |
], | |
"source": [ | |
"print_similar_animes(\"Kimi no Na wa \")\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.5.4" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 2 | |
} |
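The scaling and neighbour-lookup steps in the notebook above can be illustrated without pandas or scikit-learn. What follows is a minimal pure-Python sketch, not the notebook's actual implementation: `toy_names`, `toy_rows`, `min_max_scale`, `nearest_neighbors` and `similar_titles` are hypothetical names, and the feature values are made up rather than taken from anime.csv. It assumes Euclidean distance, which is the default metric of `NearestNeighbors`.

```python
import math

# Hypothetical toy data, NOT taken from anime.csv.
# Columns: [is_action, is_comedy, rating, members, episodes]
toy_names = ["Show A", "Show B", "Show C", "Show D"]
toy_rows = [
    [1, 0, 9.0, 800000.0, 64.0],
    [1, 0, 8.8, 750000.0, 51.0],
    [0, 1, 7.0, 10000.0, 12.0],
    [0, 1, 6.8, 12000.0, 13.0],
]


def min_max_scale(rows):
    """Rescale every column to [0, 1], the same idea as MinMaxScaler's default range."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [
            (value - lo) / (hi - lo) if hi > lo else 0.0  # constant column -> 0
            for value, lo, hi in zip(row, lows, highs)
        ]
        for row in rows
    ]


def nearest_neighbors(rows, k):
    """Brute-force k nearest rows (Euclidean) for every row.

    Each row is its own closest neighbour (distance 0), so it comes first.
    """
    result = []
    for i, row in enumerate(rows):
        order = sorted(
            range(len(rows)),
            key=lambda j: (math.dist(row, rows[j]), j != i),  # self wins ties
        )
        result.append(order[:k])
    return result


def similar_titles(name, names, neighbor_indices):
    """Look up a title's row and return its neighbours, skipping the title itself."""
    position = names.index(name)
    return [names[j] for j in neighbor_indices[position][1:]]


scaled = min_max_scale(toy_rows)
indices = nearest_neighbors(scaled, k=2)
print(similar_titles("Show A", toy_names, indices))  # ['Show B']
```

Without scaling, the `members` column would dominate every distance because its raw values are several orders of magnitude larger than the ratings. The first neighbour of each row is the row itself, which is why the notebook slices `indices[...][1:]` and asks for `n_neighbors=6` to get five recommendations.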