Created
July 7, 2019 07:48
K-Means Clustering with Python and Scikit-Learn
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# K-Means Clustering with Python and Scikit-Learn\n", | |
"\n", | |
"\n", | |
"K-Means clustering is one of the most popular unsupervised machine learning algorithms. It is used to find intrinsic groups within an unlabelled dataset and to draw inferences from them. I have used the `Facebook Live Sellers in Thailand` dataset for this project. I implement K-Means clustering to find intrinsic groups within this dataset that display the same `status_type` behaviour. The `status_type` variable records the type of each post (video, photo, status or link)." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Table of Contents\n", | |
"\n", | |
"\n", | |
"1.\tIntroduction to K-Means Clustering\n", | |
"2.\tK-Means Clustering intuition\n", | |
"3.\tChoosing the value of K\n", | |
"4.\tThe elbow method\n", | |
"5.\tThe problem statement\n", | |
"6.\tDataset description\n", | |
"7.\tImport libraries\n", | |
"8.\tImport dataset\n", | |
"9.\tExploratory data analysis\n", | |
"10.\tDeclare feature vector and target variable\n", | |
"11.\tConvert categorical variable into integers\n", | |
"12.\tFeature scaling\n", | |
"13.\tK-Means model with two clusters\n", | |
"14.\tK-Means model parameters study\n", | |
"15.\tCheck quality of weak classification by the model\n", | |
"16.\tUse elbow method to find optimal number of clusters\n", | |
"17.\tK-Means model with different clusters\n", | |
"18.\tResults and conclusion\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## 1. Introduction to K-Means Clustering\n", | |
"\n", | |
"\n", | |
"Machine learning algorithms can be broadly classified into two categories: supervised and unsupervised learning. There are other categories as well, such as semi-supervised learning and reinforcement learning, but most algorithms are classified as supervised or unsupervised. The difference between the two lies in the presence of a target variable. In supervised learning, the dataset contains a target variable that the model learns to predict. In unsupervised learning, there is no target variable; the dataset only has input variables which describe the data.\n", | |
"\n", | |
"**K-Means clustering** is one of the most popular unsupervised learning algorithms. It is used when we have unlabelled data, that is, data without defined categories or groups. The algorithm partitions a given dataset into a certain number of clusters, K, fixed a priori. It works iteratively to assign each data point to one of the K groups based on the features that are provided. Data points are clustered based on feature similarity.\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## 2. K-Means Clustering intuition\n", | |
"\n", | |
"\n", | |
"K-Means clustering is used to find intrinsic groups within an unlabelled dataset and to draw inferences from them. It is a centroid-based clustering algorithm.\n", | |
"\n", | |
"\n", | |
"**Centroid** - A centroid is the point at the centre of a cluster, computed as the mean of the data points in that cluster; it need not be one of the data points. In centroid-based clustering, each cluster is represented by its centroid, and the notion of similarity is derived from how close a data point is to the centroid of the cluster.\n", | |
"\n", | |
"\n", | |
"K-Means clustering works as follows:-\n", | |
"\n", | |
"\n", | |
"The K-Means clustering algorithm uses an iterative procedure to deliver a final result. The algorithm requires the number of clusters K and the dataset as input. The dataset is a collection of features for each data point. The algorithm starts with initial estimates for the K centroids. It then iterates between two steps:-\n", | |
"\n", | |
"\n", | |
"**1. Data assignment step**\n", | |
"\n", | |
"\n", | |
"Each centroid defines one of the clusters. In this step, each data point is assigned to its nearest centroid, based on the squared Euclidean distance. So, if $C$ is the set of centroids, each data point $x$ is assigned to the cluster whose centroid $c_i \\in C$ minimises $\\lVert x - c_i \\rVert^2$.\n", | |
"\n", | |
"\n", | |
"\n", | |
"**2. Centroid update step**\n", | |
"\n", | |
"\n", | |
"In this step, the centroids are recomputed and updated. This is done by taking the mean of all data points assigned to that centroid’s cluster. \n", | |
"\n", | |
"\n", | |
"The algorithm then iterates between step 1 and step 2 until a stopping criterion is met. The stopping criterion is that no data points change clusters, the sum of the distances is minimized, or some maximum number of iterations is reached.\n", | |
"\n", | |
"\n", | |
"This algorithm is guaranteed to converge to a result. The result may be a local optimum, meaning that running the algorithm more than once with randomized starting centroids may give a better outcome.\n" | |
] | |
}, | |
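{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"The two steps described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the scikit-learn implementation: the function name `kmeans_sketch`, the toy points and the choice of random data points as initial centroids are all hypothetical.\n", | |
"\n", | |
"```python\n", | |
"import numpy as np\n", | |
"\n", | |
"def kmeans_sketch(points, k, n_iter=100, seed=0):\n", | |
"    # Hypothetical helper for illustration only.\n", | |
"    rng = np.random.default_rng(seed)\n", | |
"    # Initial estimates: k distinct data points chosen at random.\n", | |
"    centroids = points[rng.choice(len(points), size=k, replace=False)]\n", | |
"    for _ in range(n_iter):\n", | |
"        # 1. Data assignment step: squared Euclidean distance to every centroid.\n", | |
"        sq_dist = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)\n", | |
"        labels = sq_dist.argmin(axis=1)\n", | |
"        # 2. Centroid update step: mean of the points assigned to each cluster\n", | |
"        #    (an empty cluster keeps its previous centroid).\n", | |
"        new_centroids = np.array([\n", | |
"            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]\n", | |
"            for j in range(k)\n", | |
"        ])\n", | |
"        # Stopping criterion: the centroids no longer move.\n", | |
"        if np.allclose(new_centroids, centroids):\n", | |
"            break\n", | |
"        centroids = new_centroids\n", | |
"    return centroids, labels\n", | |
"\n", | |
"toy_points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],\n", | |
"                       [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])\n", | |
"toy_centroids, toy_labels = kmeans_sketch(toy_points, k=2)\n", | |
"print(toy_labels)\n", | |
"```\n", | |
"\n", | |
"On this toy data the first three points should end up in one cluster and the last three in the other; which cluster receives label 0 depends on the random initialisation." | |
] | |
}, | |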
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## 3. Choosing the value of K\n", | |
"\n", | |
"\n", | |
"For a pre-defined value of K, the K-Means algorithm finds the clusters and the data labels. To find the number of clusters in the data, we need to run the K-Means clustering algorithm for different values of K and compare the results. So, the performance of the K-Means algorithm depends upon the value of K. We should choose the value of K that gives us the best performance. There are different techniques available to find the optimal value of K. The most common technique is the **elbow method**, which is described below.\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## 4. The elbow method\n", | |
"\n", | |
"\n", | |
"The elbow method is used to determine the optimal number of clusters in K-means clustering. The elbow method plots the value of the cost function produced by different values of K. \n", | |
"\n", | |
"As K increases, each cluster has fewer constituent instances and the instances are closer to their respective centroids, so the average distortion decreases. However, the improvements in average distortion decline as K increases. The value of K at which the improvement in distortion declines the most is called the elbow; at this point we should stop dividing the data into further clusters.\n" | |
] | |
}, | |
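{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"As a rough sketch of the elbow method described above, the snippet below fits scikit-learn's `KMeans` for several values of K and records the cost. The synthetic three-blob data, the variable names and the range of K are assumptions made purely for illustration; `inertia_` is scikit-learn's name for the within-cluster sum of squared distances.\n", | |
"\n", | |
"```python\n", | |
"import numpy as np\n", | |
"from sklearn.cluster import KMeans\n", | |
"\n", | |
"# Synthetic data (an assumption for illustration): three well-separated 2-D blobs.\n", | |
"rng = np.random.default_rng(0)\n", | |
"blob_centres = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])\n", | |
"elbow_points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in blob_centres])\n", | |
"\n", | |
"# Fit one model per candidate K and keep its cost.\n", | |
"k_values = list(range(1, 7))\n", | |
"inertias = []\n", | |
"for k in k_values:\n", | |
"    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(elbow_points)\n", | |
"    inertias.append(model.inertia_)\n", | |
"\n", | |
"for k, value in zip(k_values, inertias):\n", | |
"    print(k, round(value, 1))\n", | |
"```\n", | |
"\n", | |
"Plotting `inertias` against `k_values` for this synthetic data would be expected to show a sharp bend at K = 3, after which adding clusters reduces the cost only slightly." | |
] | |
}, | |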
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## 5. The problem statement\n", | |
"\n", | |
"\n", | |
"In this project, I implement K-Means clustering with Python and Scikit-Learn. As mentioned earlier, K-Means clustering is used to find intrinsic groups within an unlabelled dataset and to draw inferences from them. I have used the `Facebook Live Sellers in Thailand` dataset for this project. I implement K-Means clustering to find intrinsic groups within this dataset that display the same `status_type` behaviour. The `status_type` variable records the type of each post (video, photo, status or link)." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## 6. Dataset description\n", | |
"\n", | |
"\n", | |
"In this project, I have used the `Facebook Live Sellers in Thailand` dataset, downloaded from the UCI Machine Learning Repository. The dataset can be found at the following URL:\n", | |
"\n", | |
"\n", | |
"https://archive.ics.uci.edu/ml/datasets/Facebook+Live+Sellers+in+Thailand\n", | |
"\n", | |
"\n", | |
"The dataset covers the Facebook pages of 10 Thai fashion and cosmetics retail sellers. The `status_type` variable records the type of each post (video, photo, status or link). The dataset also contains engagement metrics: comments, shares and reactions.\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## 7. Import libraries" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import numpy as np\n", | |
"import pandas as pd\n", | |
"import matplotlib.pyplot as plt\n", | |
"import seaborn as sns\n", | |
"%matplotlib inline" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Ignore warnings\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import warnings\n", | |
"\n", | |
"warnings.filterwarnings('ignore')" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## 8. Import dataset\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"data = 'C:/datasets/Live.csv'\n", | |
"\n", | |
"df = pd.read_csv(data)\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## 9. Exploratory data analysis" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Check shape of the dataset" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"(7050, 16)" | |
] | |
}, | |
"execution_count": 4, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df.shape" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"We can see that there are 7050 instances and 16 attributes in the dataset. The dataset description states that there are 7051 instances and 12 attributes.\n", | |
"\n", | |
"So, we can infer that the description counts the header row as an instance and that the file contains 4 extra attributes. Next, we should take a look at the dataset to gain more insight into it." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Preview the dataset" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>status_id</th>\n", | |
" <th>status_type</th>\n", | |
" <th>status_published</th>\n", | |
" <th>num_reactions</th>\n", | |
" <th>num_comments</th>\n", | |
" <th>num_shares</th>\n", | |
" <th>num_likes</th>\n", | |
" <th>num_loves</th>\n", | |
" <th>num_wows</th>\n", | |
" <th>num_hahas</th>\n", | |
" <th>num_sads</th>\n", | |
" <th>num_angrys</th>\n", | |
" <th>Column1</th>\n", | |
" <th>Column2</th>\n", | |
" <th>Column3</th>\n", | |
" <th>Column4</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>246675545449582_1649696485147474</td>\n", | |
" <td>video</td>\n", | |
" <td>4/22/2018 6:00</td>\n", | |
" <td>529</td>\n", | |
" <td>512</td>\n", | |
" <td>262</td>\n", | |
" <td>432</td>\n", | |
" <td>92</td>\n", | |
" <td>3</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>246675545449582_1649426988507757</td>\n", | |
" <td>photo</td>\n", | |
" <td>4/21/2018 22:45</td>\n", | |
" <td>150</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>150</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>246675545449582_1648730588577397</td>\n", | |
" <td>video</td>\n", | |
" <td>4/21/2018 6:17</td>\n", | |
" <td>227</td>\n", | |
" <td>236</td>\n", | |
" <td>57</td>\n", | |
" <td>204</td>\n", | |
" <td>21</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>246675545449582_1648576705259452</td>\n", | |
" <td>photo</td>\n", | |
" <td>4/21/2018 2:29</td>\n", | |
" <td>111</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>111</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>246675545449582_1645700502213739</td>\n", | |
" <td>photo</td>\n", | |
" <td>4/18/2018 3:22</td>\n", | |
" <td>213</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>204</td>\n", | |
" <td>9</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" <td>NaN</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" status_id status_type status_published \\\n", | |
"0 246675545449582_1649696485147474 video 4/22/2018 6:00 \n", | |
"1 246675545449582_1649426988507757 photo 4/21/2018 22:45 \n", | |
"2 246675545449582_1648730588577397 video 4/21/2018 6:17 \n", | |
"3 246675545449582_1648576705259452 photo 4/21/2018 2:29 \n", | |
"4 246675545449582_1645700502213739 photo 4/18/2018 3:22 \n", | |
"\n", | |
" num_reactions num_comments num_shares num_likes num_loves num_wows \\\n", | |
"0 529 512 262 432 92 3 \n", | |
"1 150 0 0 150 0 0 \n", | |
"2 227 236 57 204 21 1 \n", | |
"3 111 0 0 111 0 0 \n", | |
"4 213 0 0 204 9 0 \n", | |
"\n", | |
" num_hahas num_sads num_angrys Column1 Column2 Column3 Column4 \n", | |
"0 1 1 0 NaN NaN NaN NaN \n", | |
"1 0 0 0 NaN NaN NaN NaN \n", | |
"2 1 0 0 NaN NaN NaN NaN \n", | |
"3 0 0 0 NaN NaN NaN NaN \n", | |
"4 0 0 0 NaN NaN NaN NaN " | |
] | |
}, | |
"execution_count": 5, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df.head()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### View summary of dataset" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"<class 'pandas.core.frame.DataFrame'>\n", | |
"RangeIndex: 7050 entries, 0 to 7049\n", | |
"Data columns (total 16 columns):\n", | |
"status_id 7050 non-null object\n", | |
"status_type 7050 non-null object\n", | |
"status_published 7050 non-null object\n", | |
"num_reactions 7050 non-null int64\n", | |
"num_comments 7050 non-null int64\n", | |
"num_shares 7050 non-null int64\n", | |
"num_likes 7050 non-null int64\n", | |
"num_loves 7050 non-null int64\n", | |
"num_wows 7050 non-null int64\n", | |
"num_hahas 7050 non-null int64\n", | |
"num_sads 7050 non-null int64\n", | |
"num_angrys 7050 non-null int64\n", | |
"Column1 0 non-null float64\n", | |
"Column2 0 non-null float64\n", | |
"Column3 0 non-null float64\n", | |
"Column4 0 non-null float64\n", | |
"dtypes: float64(4), int64(9), object(3)\n", | |
"memory usage: 881.3+ KB\n" | |
] | |
} | |
], | |
"source": [ | |
"df.info()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Check for missing values in dataset" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"status_id 0\n", | |
"status_type 0\n", | |
"status_published 0\n", | |
"num_reactions 0\n", | |
"num_comments 0\n", | |
"num_shares 0\n", | |
"num_likes 0\n", | |
"num_loves 0\n", | |
"num_wows 0\n", | |
"num_hahas 0\n", | |
"num_sads 0\n", | |
"num_angrys 0\n", | |
"Column1 7050\n", | |
"Column2 7050\n", | |
"Column3 7050\n", | |
"Column4 7050\n", | |
"dtype: int64" | |
] | |
}, | |
"execution_count": 7, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df.isnull().sum()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"We can see that `Column1`, `Column2`, `Column3` and `Column4` contain only missing values (7050 nulls each), so these 4 columns are redundant. We should drop them before proceeding further." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Drop redundant columns" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"df.drop(['Column1', 'Column2', 'Column3', 'Column4'], axis=1, inplace=True)" | |
] | |
}, | |
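{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"The cell above drops the four columns by name. As a hedged alternative, columns that contain nothing but missing values can also be removed without listing them, using `DataFrame.dropna` with `axis=1` and `how='all'`. The toy frame and its values below are hypothetical and only mimic the shape of the problem.\n", | |
"\n", | |
"```python\n", | |
"import numpy as np\n", | |
"import pandas as pd\n", | |
"\n", | |
"# Toy frame (hypothetical values) with two columns that contain only NaN.\n", | |
"toy_df = pd.DataFrame({\n", | |
"    'num_likes': [432, 150, 204],\n", | |
"    'num_loves': [92, 0, 21],\n", | |
"    'Column1': [np.nan, np.nan, np.nan],\n", | |
"    'Column2': [np.nan, np.nan, np.nan],\n", | |
"})\n", | |
"\n", | |
"# how='all' drops a column only when every value in it is missing.\n", | |
"cleaned_df = toy_df.dropna(axis=1, how='all')\n", | |
"print(list(cleaned_df.columns))\n", | |
"```\n", | |
"\n", | |
"Dropping by name, as done above, is more explicit; the `dropna` form is convenient when the number of empty columns is not known in advance." | |
] | |
}, | |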
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Again view summary of dataset" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"<class 'pandas.core.frame.DataFrame'>\n", | |
"RangeIndex: 7050 entries, 0 to 7049\n", | |
"Data columns (total 12 columns):\n", | |
"status_id 7050 non-null object\n", | |
"status_type 7050 non-null object\n", | |
"status_published 7050 non-null object\n", | |
"num_reactions 7050 non-null int64\n", | |
"num_comments 7050 non-null int64\n", | |
"num_shares 7050 non-null int64\n", | |
"num_likes 7050 non-null int64\n", | |
"num_loves 7050 non-null int64\n", | |
"num_wows 7050 non-null int64\n", | |
"num_hahas 7050 non-null int64\n", | |
"num_sads 7050 non-null int64\n", | |
"num_angrys 7050 non-null int64\n", | |
"dtypes: int64(9), object(3)\n", | |
"memory usage: 661.0+ KB\n" | |
] | |
} | |
], | |
"source": [ | |
"df.info()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Now, we can see that the redundant columns have been removed from the dataset.\n", | |
"\n", | |
"We can see that there are 3 character variables (data type = object) and the remaining 9 are numerical variables (data type = int64).\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### View the statistical summary of numerical variables" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 10, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>num_reactions</th>\n", | |
" <th>num_comments</th>\n", | |
" <th>num_shares</th>\n", | |
" <th>num_likes</th>\n", | |
" <th>num_loves</th>\n", | |
" <th>num_wows</th>\n", | |
" <th>num_hahas</th>\n", | |
" <th>num_sads</th>\n", | |
" <th>num_angrys</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>count</th>\n", | |
" <td>7050.000000</td>\n", | |
" <td>7050.000000</td>\n", | |
" <td>7050.000000</td>\n", | |
" <td>7050.000000</td>\n", | |
" <td>7050.000000</td>\n", | |
" <td>7050.000000</td>\n", | |
" <td>7050.000000</td>\n", | |
" <td>7050.000000</td>\n", | |
" <td>7050.000000</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>mean</th>\n", | |
" <td>230.117163</td>\n", | |
" <td>224.356028</td>\n", | |
" <td>40.022553</td>\n", | |
" <td>215.043121</td>\n", | |
" <td>12.728652</td>\n", | |
" <td>1.289362</td>\n", | |
" <td>0.696454</td>\n", | |
" <td>0.243688</td>\n", | |
" <td>0.113191</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>std</th>\n", | |
" <td>462.625309</td>\n", | |
" <td>889.636820</td>\n", | |
" <td>131.599965</td>\n", | |
" <td>449.472357</td>\n", | |
" <td>39.972930</td>\n", | |
" <td>8.719650</td>\n", | |
" <td>3.957183</td>\n", | |
" <td>1.597156</td>\n", | |
" <td>0.726812</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>min</th>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>25%</th>\n", | |
" <td>17.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>17.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>50%</th>\n", | |
" <td>59.500000</td>\n", | |
" <td>4.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>58.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>75%</th>\n", | |
" <td>219.000000</td>\n", | |
" <td>23.000000</td>\n", | |
" <td>4.000000</td>\n", | |
" <td>184.750000</td>\n", | |
" <td>3.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>max</th>\n", | |
" <td>4710.000000</td>\n", | |
" <td>20990.000000</td>\n", | |
" <td>3424.000000</td>\n", | |
" <td>4710.000000</td>\n", | |
" <td>657.000000</td>\n", | |
" <td>278.000000</td>\n", | |
" <td>157.000000</td>\n", | |
" <td>51.000000</td>\n", | |
" <td>31.000000</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" num_reactions num_comments num_shares num_likes num_loves \\\n", | |
"count 7050.000000 7050.000000 7050.000000 7050.000000 7050.000000 \n", | |
"mean 230.117163 224.356028 40.022553 215.043121 12.728652 \n", | |
"std 462.625309 889.636820 131.599965 449.472357 39.972930 \n", | |
"min 0.000000 0.000000 0.000000 0.000000 0.000000 \n", | |
"25% 17.000000 0.000000 0.000000 17.000000 0.000000 \n", | |
"50% 59.500000 4.000000 0.000000 58.000000 0.000000 \n", | |
"75% 219.000000 23.000000 4.000000 184.750000 3.000000 \n", | |
"max 4710.000000 20990.000000 3424.000000 4710.000000 657.000000 \n", | |
"\n", | |
" num_wows num_hahas num_sads num_angrys \n", | |
"count 7050.000000 7050.000000 7050.000000 7050.000000 \n", | |
"mean 1.289362 0.696454 0.243688 0.113191 \n", | |
"std 8.719650 3.957183 1.597156 0.726812 \n", | |
"min 0.000000 0.000000 0.000000 0.000000 \n", | |
"25% 0.000000 0.000000 0.000000 0.000000 \n", | |
"50% 0.000000 0.000000 0.000000 0.000000 \n", | |
"75% 0.000000 0.000000 0.000000 0.000000 \n", | |
"max 278.000000 157.000000 51.000000 31.000000 " | |
] | |
}, | |
"execution_count": 10, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df.describe()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"There are 3 categorical variables in the dataset. I will explore them one by one." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Explore `status_id` variable" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 11, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"array(['246675545449582_1649696485147474',\n", | |
" '246675545449582_1649426988507757',\n", | |
" '246675545449582_1648730588577397', ...,\n", | |
" '1050855161656896_1060126464063099',\n", | |
" '1050855161656896_1058663487542730',\n", | |
" '1050855161656896_1050858841656528'], dtype=object)" | |
] | |
}, | |
"execution_count": 11, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# view the labels in the variable\n", | |
"\n", | |
"df['status_id'].unique()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 12, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"6997" | |
] | |
}, | |
"execution_count": 12, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# count the distinct labels in the variable\n", | |
"\n", | |
"len(df['status_id'].unique())" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"We can see that there are 6997 unique labels in the `status_id` variable, while the total number of instances in the dataset is 7050. So, it is approximately a unique identifier for each instance and carries no grouping information that we can use. Hence, I will drop it." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Explore `status_published` variable" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 13, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"array(['4/22/2018 6:00', '4/21/2018 22:45', '4/21/2018 6:17', ...,\n", | |
" '9/21/2016 23:03', '9/20/2016 0:43', '9/10/2016 10:30'],\n", | |
" dtype=object)" | |
] | |
}, | |
"execution_count": 13, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# view the labels in the variable\n", | |
"\n", | |
"df['status_published'].unique()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 14, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"6913" | |
] | |
}, | |
"execution_count": 14, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# count the distinct labels in the variable\n", | |
"\n", | |
"len(df['status_published'].unique())" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Again, we can see that there are 6913 unique labels in the `status_published` variable, while the total number of instances in the dataset is 7050. So, it is also approximately a unique identifier for each instance and is not a variable that we can use. Hence, I will drop it as well." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Explore `status_type` variable" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 15, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"array(['video', 'photo', 'link', 'status'], dtype=object)" | |
] | |
}, | |
"execution_count": 15, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# view the labels in the variable\n", | |
"\n", | |
"df['status_type'].unique()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 16, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"4" | |
] | |
}, | |
"execution_count": 16, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# count the distinct labels in the variable\n", | |
"\n", | |
"len(df['status_type'].unique())" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"We can see that there are 4 categories of labels in the `status_type` variable." | |
] | |
}, | |
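{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"The reasoning applied above to `status_id`, `status_published` and `status_type` can be sketched as a single cardinality check with `DataFrame.nunique`. The toy frame, its values and the 0.5 threshold are assumptions chosen for illustration, not values taken from the dataset.\n", | |
"\n", | |
"```python\n", | |
"import pandas as pd\n", | |
"\n", | |
"# Toy frame (hypothetical values) mimicking the three object columns.\n", | |
"toy_posts = pd.DataFrame({\n", | |
"    'status_id': ['a_1', 'a_2', 'a_3', 'a_4', 'a_4'],\n", | |
"    'status_published': ['4/22/2018 6:00', '4/21/2018 22:45', '4/21/2018 6:17',\n", | |
"                         '4/21/2018 2:29', '4/21/2018 2:29'],\n", | |
"    'status_type': ['video', 'photo', 'video', 'photo', 'photo'],\n", | |
"})\n", | |
"\n", | |
"# Share of distinct values per column: close to 1 means a near-unique identifier.\n", | |
"unique_ratio = toy_posts.nunique() / len(toy_posts)\n", | |
"print(unique_ratio)\n", | |
"\n", | |
"# Columns above an arbitrary threshold (0.5, an assumption) are candidates to drop.\n", | |
"id_like = unique_ratio[unique_ratio > 0.5].index.tolist()\n", | |
"print(id_like)\n", | |
"```\n", | |
"\n", | |
"A column whose ratio is close to 1 behaves like an identifier and is unlikely to help a clustering model, whereas a low ratio, as for `status_type`, indicates a genuine categorical variable." | |
] | |
}, | |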
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Drop the `status_id` and `status_published` variables from the dataset" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 17, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"df.drop(['status_id', 'status_published'], axis=1, inplace=True)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### View the summary of dataset again" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 18, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"<class 'pandas.core.frame.DataFrame'>\n", | |
"RangeIndex: 7050 entries, 0 to 7049\n", | |
"Data columns (total 10 columns):\n", | |
"status_type 7050 non-null object\n", | |
"num_reactions 7050 non-null int64\n", | |
"num_comments 7050 non-null int64\n", | |
"num_shares 7050 non-null int64\n", | |
"num_likes 7050 non-null int64\n", | |
"num_loves 7050 non-null int64\n", | |
"num_wows 7050 non-null int64\n", | |
"num_hahas 7050 non-null int64\n", | |
"num_sads 7050 non-null int64\n", | |
"num_angrys 7050 non-null int64\n", | |
"dtypes: int64(9), object(1)\n", | |
"memory usage: 550.9+ KB\n" | |
] | |
} | |
], | |
"source": [ | |
"df.info()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Preview the dataset again" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 19, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>status_type</th>\n", | |
" <th>num_reactions</th>\n", | |
" <th>num_comments</th>\n", | |
" <th>num_shares</th>\n", | |
" <th>num_likes</th>\n", | |
" <th>num_loves</th>\n", | |
" <th>num_wows</th>\n", | |
" <th>num_hahas</th>\n", | |
" <th>num_sads</th>\n", | |
" <th>num_angrys</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>video</td>\n", | |
" <td>529</td>\n", | |
" <td>512</td>\n", | |
" <td>262</td>\n", | |
" <td>432</td>\n", | |
" <td>92</td>\n", | |
" <td>3</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>photo</td>\n", | |
" <td>150</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>150</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>video</td>\n", | |
" <td>227</td>\n", | |
" <td>236</td>\n", | |
" <td>57</td>\n", | |
" <td>204</td>\n", | |
" <td>21</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>photo</td>\n", | |
" <td>111</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>111</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>photo</td>\n", | |
" <td>213</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>204</td>\n", | |
" <td>9</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" status_type num_reactions num_comments num_shares num_likes num_loves \\\n", | |
"0 video 529 512 262 432 92 \n", | |
"1 photo 150 0 0 150 0 \n", | |
"2 video 227 236 57 204 21 \n", | |
"3 photo 111 0 0 111 0 \n", | |
"4 photo 213 0 0 204 9 \n", | |
"\n", | |
" num_wows num_hahas num_sads num_angrys \n", | |
"0 3 1 1 0 \n", | |
"1 0 0 0 0 \n", | |
"2 1 1 0 0 \n", | |
"3 0 0 0 0 \n", | |
"4 0 0 0 0 " | |
] | |
}, | |
"execution_count": 19, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df.head()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"We can see that there is one non-numeric column, `status_type`, in the dataset. I will convert it into integer equivalents." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## 10. Declare feature vector and target variable" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 20, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"X = df\n", | |
"\n", | |
"y = df['status_type']" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## 11. Convert categorical variable into integers" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 21, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"from sklearn.preprocessing import LabelEncoder\n", | |
"\n", | |
"le = LabelEncoder()\n", | |
"\n", | |
"X['status_type'] = le.fit_transform(X['status_type'])\n", | |
"\n", | |
"y = le.transform(y)" | |
] | |
}, | |
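To make the encoding step concrete, here is a minimal plain-Python sketch of the idea behind `LabelEncoder`: the distinct values are sorted and each one is replaced by its position in that sorted list. The helper name `encode_labels` is hypothetical. The values `video` and `photo` appear in the preview above, while `link` and `status` are assumed spellings of the other two post types.

```python
def encode_labels(values):
    """Map each distinct value to an integer, using sorted order of the distinct values."""
    classes = sorted(set(values))
    mapping = {label: index for index, label in enumerate(classes)}
    return [mapping[v] for v in values], classes


# 'video' and 'photo' come from the preview; 'link' and 'status' are assumed spellings
sample = ["video", "photo", "video", "photo", "photo", "link", "status"]
codes, classes = encode_labels(sample)

print(classes)  # ['link', 'photo', 'status', 'video']
print(codes)    # [3, 1, 3, 1, 1, 0, 2]
```

This is consistent with the encoded preview, where `video` becomes 3 and `photo` becomes 1. The integers are only codes: their order follows alphabetical sorting and says nothing about how similar two post types are.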
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### View the summary of X" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 22, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"<class 'pandas.core.frame.DataFrame'>\n", | |
"RangeIndex: 7050 entries, 0 to 7049\n", | |
"Data columns (total 10 columns):\n", | |
"status_type 7050 non-null int32\n", | |
"num_reactions 7050 non-null int64\n", | |
"num_comments 7050 non-null int64\n", | |
"num_shares 7050 non-null int64\n", | |
"num_likes 7050 non-null int64\n", | |
"num_loves 7050 non-null int64\n", | |
"num_wows 7050 non-null int64\n", | |
"num_hahas 7050 non-null int64\n", | |
"num_sads 7050 non-null int64\n", | |
"num_angrys 7050 non-null int64\n", | |
"dtypes: int32(1), int64(9)\n", | |
"memory usage: 523.3 KB\n" | |
] | |
} | |
], | |
"source": [ | |
"X.info()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Preview the dataset X" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 23, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>status_type</th>\n", | |
" <th>num_reactions</th>\n", | |
" <th>num_comments</th>\n", | |
" <th>num_shares</th>\n", | |
" <th>num_likes</th>\n", | |
" <th>num_loves</th>\n", | |
" <th>num_wows</th>\n", | |
" <th>num_hahas</th>\n", | |
" <th>num_sads</th>\n", | |
" <th>num_angrys</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>3</td>\n", | |
" <td>529</td>\n", | |
" <td>512</td>\n", | |
" <td>262</td>\n", | |
" <td>432</td>\n", | |
" <td>92</td>\n", | |
" <td>3</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>1</td>\n", | |
" <td>150</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>150</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>3</td>\n", | |
" <td>227</td>\n", | |
" <td>236</td>\n", | |
" <td>57</td>\n", | |
" <td>204</td>\n", | |
" <td>21</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>1</td>\n", | |
" <td>111</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>111</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>1</td>\n", | |
" <td>213</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>204</td>\n", | |
" <td>9</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" status_type num_reactions num_comments num_shares num_likes num_loves \\\n", | |
"0 3 529 512 262 432 92 \n", | |
"1 1 150 0 0 150 0 \n", | |
"2 3 227 236 57 204 21 \n", | |
"3 1 111 0 0 111 0 \n", | |
"4 1 213 0 0 204 9 \n", | |
"\n", | |
" num_wows num_hahas num_sads num_angrys \n", | |
"0 3 1 1 0 \n", | |
"1 0 0 0 0 \n", | |
"2 1 1 0 0 \n", | |
"3 0 0 0 0 \n", | |
"4 0 0 0 0 " | |
] | |
}, | |
"execution_count": 23, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"X.head()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## 12. Feature scaling" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 24, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"cols = X.columns" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 25, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"from sklearn.preprocessing import MinMaxScaler\n", | |
"\n", | |
"ms = MinMaxScaler()\n", | |
"\n", | |
"X = ms.fit_transform(X)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 26, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# pass cols directly; wrapping it in a list would create MultiIndex columns\n", | |
"X = pd.DataFrame(X, columns=cols)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 27, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead tr th {\n", | |
" text-align: left;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr>\n", | |
" <th></th>\n", | |
" <th>status_type</th>\n", | |
" <th>num_reactions</th>\n", | |
" <th>num_comments</th>\n", | |
" <th>num_shares</th>\n", | |
" <th>num_likes</th>\n", | |
" <th>num_loves</th>\n", | |
" <th>num_wows</th>\n", | |
" <th>num_hahas</th>\n", | |
" <th>num_sads</th>\n", | |
" <th>num_angrys</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>1.000000</td>\n", | |
" <td>0.112314</td>\n", | |
" <td>0.024393</td>\n", | |
" <td>0.076519</td>\n", | |
" <td>0.091720</td>\n", | |
" <td>0.140030</td>\n", | |
" <td>0.010791</td>\n", | |
" <td>0.006369</td>\n", | |
" <td>0.019608</td>\n", | |
" <td>0.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>0.333333</td>\n", | |
" <td>0.031847</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.031847</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>1.000000</td>\n", | |
" <td>0.048195</td>\n", | |
" <td>0.011243</td>\n", | |
" <td>0.016647</td>\n", | |
" <td>0.043312</td>\n", | |
" <td>0.031963</td>\n", | |
" <td>0.003597</td>\n", | |
" <td>0.006369</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>0.333333</td>\n", | |
" <td>0.023567</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.023567</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>0.333333</td>\n", | |
" <td>0.045223</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.043312</td>\n", | |
" <td>0.013699</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.0</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" status_type num_reactions num_comments num_shares num_likes num_loves \\\n", | |
"0 1.000000 0.112314 0.024393 0.076519 0.091720 0.140030 \n", | |
"1 0.333333 0.031847 0.000000 0.000000 0.031847 0.000000 \n", | |
"2 1.000000 0.048195 0.011243 0.016647 0.043312 0.031963 \n", | |
"3 0.333333 0.023567 0.000000 0.000000 0.023567 0.000000 \n", | |
"4 0.333333 0.045223 0.000000 0.000000 0.043312 0.013699 \n", | |
"\n", | |
" num_wows num_hahas num_sads num_angrys \n", | |
"0 0.010791 0.006369 0.019608 0.0 \n", | |
"1 0.000000 0.000000 0.000000 0.0 \n", | |
"2 0.003597 0.006369 0.000000 0.0 \n", | |
"3 0.000000 0.000000 0.000000 0.0 \n", | |
"4 0.000000 0.000000 0.000000 0.0 " | |
] | |
}, | |
"execution_count": 27, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"X.head()" | |
] | |
}, | |
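The scaled preview can be checked by hand. The sketch below is a plain-Python version of min-max scaling to the range [0, 1], using the formula (x - min) / (max - min) per column. The helper name `min_max_scale` is hypothetical, and mapping a constant column to 0.0 is a choice made for this sketch.

```python
def min_max_scale(column):
    """Rescale a list of numbers to [0, 1] with (x - min) / (max - min)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # a constant column has no spread; this sketch maps it to 0.0
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


# the encoded status_type column takes the values 0, 1, 2 and 3
scaled = min_max_scale([0, 1, 2, 3])
print([round(v, 6) for v in scaled])  # [0.0, 0.333333, 0.666667, 1.0]
```

This matches the preview, where an encoded `status_type` of 1 becomes 0.333333 and 3 becomes 1.000000.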
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## 13. K-Means model with two clusters" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 28, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,\n", | |
" n_clusters=2, n_init=10, n_jobs=None, precompute_distances='auto',\n", | |
" random_state=0, tol=0.0001, verbose=0)" | |
] | |
}, | |
"execution_count": 28, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"from sklearn.cluster import KMeans\n", | |
"\n", | |
"kmeans = KMeans(n_clusters=2, random_state=0) \n", | |
"\n", | |
"kmeans.fit(X)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## 14. K-Means model parameters study" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 29, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"array([[3.28506857e-01, 3.90710874e-02, 7.54854864e-04, 7.53667113e-04,\n", | |
" 3.85438884e-02, 2.17448568e-03, 2.43721364e-03, 1.20039760e-03,\n", | |
" 2.75348016e-03, 1.45313276e-03],\n", | |
" [9.54921576e-01, 6.46330441e-02, 2.67028654e-02, 2.93171709e-02,\n", | |
" 5.71231462e-02, 4.71007076e-02, 8.18581889e-03, 9.65207685e-03,\n", | |
" 8.04219428e-03, 7.19501847e-03]])" | |
] | |
}, | |
"execution_count": 29, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"kmeans.cluster_centers_" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"- The KMeans algorithm clusters data by trying to separate samples into n groups of equal variance, minimizing a criterion known as **inertia**, or the within-cluster sum of squares. Inertia can be read as a measure of how internally coherent the clusters are.\n", | |
"\n", | |
"\n", | |
"- The k-means algorithm divides a set of N samples X into K disjoint clusters C, each described by the mean $\\mu_j$ of the samples in the cluster. The means are commonly called the cluster **centroids**.\n", | |
"\n", | |
"\n", | |
"- The K-means algorithm aims to choose centroids that minimize the inertia, or within-cluster sum-of-squares criterion." | |
] | |
}, | |
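The definitions above can be sketched in a few lines of plain Python: a centroid is the coordinate-wise mean of the points in a cluster, and inertia is the sum of squared Euclidean distances from each point to the centroid of its assigned cluster. The function names and the toy points are made up for illustration and are not part of the notebook's data.

```python
def centroid(cluster_points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(cluster_points)
    return [sum(coords) / n for coords in zip(*cluster_points)]


def inertia(points, centroids, labels):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    total = 0.0
    for point, label in zip(points, labels):
        center = centroids[label]
        total += sum((p - c) ** 2 for p, c in zip(point, center))
    return total


# toy data with two obvious groups (illustrative values only)
points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
labels = [0, 0, 1, 1]
centroids = [
    centroid([p for p, lab in zip(points, labels) if lab == k]) for k in (0, 1)
]

print(centroids)                           # [[0.0, 1.0], [10.0, 1.0]]
print(inertia(points, centroids, labels))  # 4.0
```

Each toy point lies exactly one unit from its centroid, so the four squared distances add up to 4.0.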
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Inertia\n", | |
"\n", | |
"\n", | |
"- **Inertia** is not a normalized metric. \n", | |
"\n", | |
"- Lower values of inertia are better, and zero is optimal. \n", | |
"\n", | |
"- In very high-dimensional spaces, however, Euclidean distances tend to become inflated (this is an instance of the curse of dimensionality). \n", | |
"\n", | |
"- Running a dimensionality reduction algorithm such as PCA prior to k-means clustering can alleviate this problem and speed up the computations.\n", | |
"\n", | |
"- We can calculate the model inertia as follows:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 30, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"237.75726404419564" | |
] | |
}, | |
"execution_count": 30, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"kmeans.inertia_" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"- The lower the model inertia, the better the model fit.\n", | |
"\n", | |
"- The model has an inertia of about 237.76. Because inertia is not normalized, this value is hard to judge in isolation and is best compared with the inertia obtained for other values of k." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## 15. Check quality of weak classification by the model" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 31, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Result: 63 out of 7050 samples were correctly labeled.\n" | |
] | |
} | |
], | |
"source": [ | |
"labels = kmeans.labels_\n", | |
"\n", | |
"# check how many of the samples were correctly labeled\n", | |
"correct_labels = sum(y == labels)\n", | |
"\n", | |
"print(\"Result: %d out of %d samples were correctly labeled.\" % (correct_labels, y.size))\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 32, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Accuracy score: 0.01\n" | |
] | |
} | |
], | |
"source": [ | |
"print('Accuracy score: {0:0.2f}'.format(correct_labels / float(y.size)))" | |
] | |
}, | |
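{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Our unsupervised model has achieved a weak classification accuracy of about 1%. Note that K-Means numbers its clusters arbitrarily, so a direct comparison between cluster ids and the encoded `status_type` labels is only a rough check." | |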
] | |
}, | |
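Because cluster ids are arbitrary, one common alternative is to map each cluster to the most frequent true label among its members before scoring. The sketch below shows that idea in plain Python. It is not the scoring used in this notebook, the helper name `majority_vote_accuracy` is hypothetical, and the toy labels are made up for illustration.

```python
from collections import Counter


def majority_vote_accuracy(true_labels, cluster_ids):
    """Map each cluster to its most common true label, then score the mapped labels."""
    members = {}
    for truth, cluster in zip(true_labels, cluster_ids):
        members.setdefault(cluster, []).append(truth)
    cluster_to_label = {
        cluster: Counter(truths).most_common(1)[0][0]
        for cluster, truths in members.items()
    }
    hits = sum(cluster_to_label[c] == t for t, c in zip(true_labels, cluster_ids))
    return hits / len(true_labels), cluster_to_label


# toy example: true classes are 1 and 3, but the clusters happen to be numbered 0 and 1
true_labels = [1, 1, 1, 3, 3, 3, 3, 1]
cluster_ids = [0, 0, 0, 1, 1, 1, 0, 0]

raw = sum(t == c for t, c in zip(true_labels, cluster_ids)) / len(true_labels)
acc, mapping = majority_vote_accuracy(true_labels, cluster_ids)

print(raw)      # 0.0
print(mapping)  # {0: 1, 1: 3}
print(acc)      # 0.875
```

In this toy case the direct comparison scores 0.0 even though the clusters separate the two classes almost perfectly, which is why a direct comparison can understate how well the clusters line up with the labels.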
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## 16. Use elbow method to find optimal number of clusters" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 33, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"image/png": "\n", | |
"text/plain": [ | |
"<Figure size 432x288 with 1 Axes>" | |
] | |
}, | |
"metadata": { | |
"needs_background": "light" | |
}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"from sklearn.cluster import KMeans\n", | |
"\n", | |
"# fit K-Means for k = 1..10 and record the inertia (within-cluster sum of squares) of each fit\n", | |
"cs = []\n", | |
"for i in range(1, 11):\n", | |
"    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)\n", | |
"    kmeans.fit(X)\n", | |
"    cs.append(kmeans.inertia_)\n", | |
"\n", | |
"# plot inertia against k; the elbow is where the curve stops dropping sharply\n", | |
"plt.plot(range(1, 11), cs)\n", | |
"plt.title('The Elbow Method')\n", | |
"plt.xlabel('Number of clusters')\n", | |
"plt.ylabel('Inertia (CS)')\n", | |
"plt.show()\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"- In the above plot, we can see that there is a kink at k=2. \n", | |
"\n", | |
"- Hence k=2 can be considered a good number of clusters for this data.\n", | |
"\n", | |
"- But we have seen that k=2 gives a weak classification accuracy of 1%.\n", | |
"\n", | |
"- I will write the required code with k=2 again for convenience." | |
] | |
}, | |
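Reading the kink off a plot is a judgment call. As a rough cross-check, one simple heuristic is to pick the k where the drop in inertia slows down the most, that is, where the second difference of the inertia curve is largest. The sketch below is a plain-Python illustration of that heuristic, not the method used in this notebook. The function name is hypothetical and the inertia values are made up, so they are not the values behind the plot above.

```python
def elbow_by_second_difference(ks, inertias):
    """Return the k at which the decrease in inertia slows down the most."""
    best_k, best_bend = None, float("-inf")
    # the first and last k have no neighbour on one side, so they are skipped
    for i in range(1, len(ks) - 1):
        drop_before = inertias[i - 1] - inertias[i]
        drop_after = inertias[i] - inertias[i + 1]
        bend = drop_before - drop_after
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k


# illustrative inertia values for k = 1..5 (made up, not taken from the notebook)
ks = [1, 2, 3, 4, 5]
toy_inertias = [1000.0, 400.0, 300.0, 250.0, 220.0]

print(elbow_by_second_difference(ks, toy_inertias))  # 2
```

With the real data, the list `cs` collected in the loop above could be passed in place of `toy_inertias`, together with `list(range(1, 11))` for `ks`.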
{ | |
"cell_type": "code", | |
"execution_count": 34, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Result: 63 out of 7050 samples were correctly labeled.\n", | |
"Accuracy score: 0.01\n" | |
] | |
} | |
], | |
"source": [ | |
"from sklearn.cluster import KMeans\n", | |
"\n", | |
"kmeans = KMeans(n_clusters=2, random_state=0)\n", | |
"\n", | |
"kmeans.fit(X)\n", | |
"\n", | |
"labels = kmeans.labels_\n", | |
"\n", | |
"# check how many of the samples were correctly labeled\n", | |
"\n", | |
"correct_labels = sum(y == labels)\n", | |
"\n", | |
"print(\"Result: %d out of %d samples were correctly labeled.\" % (correct_labels, y.size))\n", | |
"\n", | |
"print('Accuracy score: {0:0.2f}'.format(correct_labels / float(y.size)))" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"So, our unsupervised model achieved a very weak classification accuracy of 1%." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"I will check the model accuracy with different numbers of clusters." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## 17. K-Means model with different clusters" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### K-Means model with 3 clusters" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 35, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Result: 138 out of 7050 samples were correctly labeled.\n", | |
"Accuracy score: 0.02\n" | |
] | |
} | |
], | |
"source": [ | |
"kmeans = KMeans(n_clusters=3, random_state=0)\n", | |
"\n", | |
"kmeans.fit(X)\n", | |
"\n", | |
"# check how many of the samples were correctly labeled\n", | |
"labels = kmeans.labels_\n", | |
"\n", | |
"correct_labels = sum(y == labels)\n", | |
"print(\"Result: %d out of %d samples were correctly labeled.\" % (correct_labels, y.size))\n", | |
"print('Accuracy score: {0:0.2f}'.format(correct_labels / float(y.size)))" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### K-Means model with 4 clusters" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 36, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Result: 4340 out of 7050 samples were correctly labeled.\n", | |
"Accuracy score: 0.62\n" | |
] | |
} | |
], | |
"source": [ | |
"kmeans = KMeans(n_clusters=4, random_state=0)\n", | |
"\n", | |
"kmeans.fit(X)\n", | |
"\n", | |
"# check how many of the samples were correctly labeled\n", | |
"labels = kmeans.labels_\n", | |
"\n", | |
"correct_labels = sum(y == labels)\n", | |
"print(\"Result: %d out of %d samples were correctly labeled.\" % (correct_labels, y.size))\n", | |
"print('Accuracy score: {0:0.2f}'.format(correct_labels / float(y.size)))" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"We have achieved a relatively high accuracy of 62% with k=4." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## 18. Results and conclusion\n", | |
"\n", | |
"\n", | |
"1.\tIn this project, I have implemented one of the most popular unsupervised clustering techniques, **K-Means Clustering**.\n", | |
"\n", | |
"2.\tI applied the elbow method and found that k=2 (k is the number of clusters) can be considered a good number of clusters for this data.\n", | |
"\n", | |
"3.\tI found that the k=2 model has an inertia of 237.7572. Inertia is not normalized, so this value is mainly useful for comparison across values of k; combined with the weak accuracy, it suggests that the k=2 model is not a good fit to the data.\n", | |
"\n", | |
"4.\tThe unsupervised model achieved a weak classification accuracy of 1% with k=2.\n", | |
"\n", | |
"5.\tSo, I changed the value of k and found a relatively higher classification accuracy of 62% with k=4.\n", | |
"\n", | |
"6.\tHence, we can conclude that k=4 is the optimal number of clusters among the values tried.\n" | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.7.0" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 2 | |
} |
Great work. Is there any other way (technique, algorithm, ...) to initiate K-Means?
Thanks,