Created
November 6, 2016 01:15
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Homework with Hacker News data\n", | |
"\n", | |
"## About Hacker News\n", | |
"\n", | |
"[Hacker News](https://news.ycombinator.com/) is a popular \"social news\" website run by the startup incubator Y Combinator. It primarily includes news about technology, but also includes job postings and community-generated questions.\n", | |
"\n", | |
"Any user can [submit](https://news.ycombinator.com/submit) a post to Hacker News. There are two types of posts: articles and discussions. To submit an **article**, the user includes a title and a URL. To submit a **discussion**, the user includes a title and additional text.\n", | |
"\n", | |
"Users can upvote posts that they find interesting. Every post starts at 1 point, and each upvote adds an additional point. The most popular recent posts appear on the front page of Hacker News.\n", | |
"\n", | |
"## Description of the data\n", | |
"\n", | |
"A [dataset of Hacker News posts](https://www.kaggle.com/hacker-news/hacker-news-posts) is hosted on Kaggle Datasets. It includes about one year of data, ending in September 2016. The following fields are included in the dataset:\n", | |
"\n", | |
"- **title:** title of the post\n", | |
"- **url:** URL of the post (if any)\n", | |
"- **num_points:** number of points that the post received\n", | |
"- **num_comments:** number of user comments on the post\n", | |
"- **author:** name of the user that submitted the post\n", | |
"- **created_at:** date and time the post was submitted\n", | |
"\n", | |
"## Problem statement\n", | |
"\n", | |
"Your goal is to predict the likelihood that a post will be \"popular\", based on the data that is available at the time the post is submitted." | |
] | |
}, | |
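One way to read "available at the time the post is submitted" is as a constraint on the feature columns: `num_points` and `num_comments` only accumulate after submission, so they can define the label but should not be model inputs. A minimal sketch under that reading, assuming the column names from the field list above. The names `SUBMIT_TIME_COLUMNS`, `POST_SUBMIT_COLUMNS`, and `submit_time_features` are hypothetical, and the toy row is made up for illustration:

```python
import pandas as pd

# Hypothetical grouping of the dataset's fields (illustrative names, not part of the assignment).
SUBMIT_TIME_COLUMNS = ["title", "url", "author", "created_at"]  # known when the post is made
POST_SUBMIT_COLUMNS = ["num_points", "num_comments"]            # accumulate afterwards


def submit_time_features(df):
    """Return a copy holding only the columns that exist at submission time."""
    return df[SUBMIT_TIME_COLUMNS].copy()


# Made-up example row, for illustration only.
toy = pd.DataFrame({
    "title": ["Example post"],
    "url": ["http://example.com"],
    "num_points": [12],
    "num_comments": [3],
    "author": ["someone"],
    "created_at": pd.to_datetime(["2016-01-01 12:00:00"]),
})

features = submit_time_features(toy)
print(list(features.columns))
```

Keeping the two groups in separate lists makes it harder to leak `num_points` or `num_comments` into the model by accident.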
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Task 1: Get the data\n", | |
"\n", | |
"1. Go to the [Kaggle Datasets](https://www.kaggle.com/hacker-news/hacker-news-posts) page, and click the download button.\n", | |
"2. Unzip **`hacker-news-posts.zip`**, and then move **`HN_posts_year_to_Sep_26_2016.csv`** to a directory where you can easily access it.\n", | |
"3. Read the file into a pandas DataFrame called **\"hn\"**.\n", | |
"4. Either while reading the file or afterwards, convert the **created_at** column to datetime format.\n", | |
"\n", | |
" - **Hint:** [How do I work with dates and times in pandas?](https://www.youtube.com/watch?v=yCgJGsg0Xa4&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=25) explains how to do this." | |
] | |
}, | |
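The two conversion routes mentioned in step 4 can be compared on a small in-memory file. This is only a sketch: the two-row CSV text below is made up and stands in for the real download, whose date format may differ.

```python
import io

import pandas as pd

# Made-up two-row CSV standing in for the real dataset file.
csv_text = (
    "id,title,created_at\n"
    "1,First post,2016-09-26 03:26:00\n"
    "2,Second post,2015-09-06 05:50:00\n"
)

# Route 1: convert while reading.
during = pd.read_csv(io.StringIO(csv_text), parse_dates=["created_at"])

# Route 2: read the column as text, then convert it afterwards.
after = pd.read_csv(io.StringIO(csv_text))
after["created_at"] = pd.to_datetime(after["created_at"])

print(during["created_at"].dtype, after["created_at"].dtype)
```

Either route should leave `created_at` as a datetime column, which is what makes the date comparisons in Task 2 possible.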
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"import pandas as pd\n", | |
"import numpy as np\n", | |
"%matplotlib inline\n", | |
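"# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it\n", | |
"# and is aliased to the old name here\n", | |
"from sklearn import model_selection as cross_validation\n", | |
"from sklearn import naive_bayes, feature_extraction, metrics" | |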
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>id</th>\n", | |
" <th>title</th>\n", | |
" <th>url</th>\n", | |
" <th>num_points</th>\n", | |
" <th>num_comments</th>\n", | |
" <th>author</th>\n", | |
" <th>created_at</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>12579008</td>\n", | |
" <td>You have two days to comment if you want stem ...</td>\n", | |
" <td>http://www.regulations.gov/document?D=FDA-2015...</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>altstar</td>\n", | |
" <td>2016-09-26 03:26:00</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>12579005</td>\n", | |
" <td>SQLAR the SQLite Archiver</td>\n", | |
" <td>https://www.sqlite.org/sqlar/doc/trunk/README.md</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>blacksqr</td>\n", | |
" <td>2016-09-26 03:24:00</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>12578997</td>\n", | |
" <td>What if we just printed a flatscreen televisio...</td>\n", | |
" <td>https://medium.com/vanmoof/our-secrets-out-f21...</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>pavel_lishin</td>\n", | |
" <td>2016-09-26 03:19:00</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>12578989</td>\n", | |
" <td>algorithmic music</td>\n", | |
" <td>http://cacm.acm.org/magazines/2011/7/109891-al...</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>poindontcare</td>\n", | |
" <td>2016-09-26 03:16:00</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>12578979</td>\n", | |
" <td>How the Data Vault Enables the Next-Gen Data W...</td>\n", | |
" <td>https://www.talend.com/blog/2016/05/12/talend-...</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>markgainor1</td>\n", | |
" <td>2016-09-26 03:14:00</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" id title \\\n", | |
"0 12579008 You have two days to comment if you want stem ... \n", | |
"1 12579005 SQLAR the SQLite Archiver \n", | |
"2 12578997 What if we just printed a flatscreen televisio... \n", | |
"3 12578989 algorithmic music \n", | |
"4 12578979 How the Data Vault Enables the Next-Gen Data W... \n", | |
"\n", | |
" url num_points \\\n", | |
"0 http://www.regulations.gov/document?D=FDA-2015... 1 \n", | |
"1 https://www.sqlite.org/sqlar/doc/trunk/README.md 1 \n", | |
"2 https://medium.com/vanmoof/our-secrets-out-f21... 1 \n", | |
"3 http://cacm.acm.org/magazines/2011/7/109891-al... 1 \n", | |
"4 https://www.talend.com/blog/2016/05/12/talend-... 1 \n", | |
"\n", | |
" num_comments author created_at \n", | |
"0 0 altstar 2016-09-26 03:26:00 \n", | |
"1 0 blacksqr 2016-09-26 03:24:00 \n", | |
"2 0 pavel_lishin 2016-09-26 03:19:00 \n", | |
"3 0 poindontcare 2016-09-26 03:16:00 \n", | |
"4 0 markgainor1 2016-09-26 03:14:00 " | |
] | |
}, | |
"execution_count": 2, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"col_names = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']\n", | |
"# hn = pd.read_csv('../data/hacker-news-posts/HN_posts_year_to_Sep_26_2016.csv', nrows=1000, skiprows=200000, names=col_names)\n", | |
"hn = pd.read_csv('../data/HN_posts_year_to_Sep_26_2016.csv', parse_dates=['created_at'])\n", | |
"hn.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"269452" | |
] | |
}, | |
"execution_count": 3, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# hn.groupby('title').count()\n", | |
"hn.title.nunique()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"links = (hn.url.str.startswith('http'))\n", | |
"# hn[::-1]\n", | |
"# hn.shape\n", | |
"hn[(links == False)].head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"id int64\n", | |
"title object\n", | |
"url object\n", | |
"num_points int64\n", | |
"num_comments int64\n", | |
"author object\n", | |
"created_at datetime64[ns]\n", | |
"dtype: object" | |
] | |
}, | |
"execution_count": 4, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"from datetime import datetime\n", | |
"# created_at was already parsed by read_csv(parse_dates=...), so this conversion is only a safeguard\n", | |
"hn['created_at'] = pd.to_datetime(hn['created_at'])\n", | |
"# print('memory usage: ', hn.memory_usage(deep=True))\n", | |
"hn.dtypes" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Task 2: Prepare the data\n", | |
"\n", | |
"1. Create a new column called **\"popular\"** that is **1** if the post received more than 5 points, and **0** otherwise. This is the response variable that you are trying to predict.\n", | |
"2. Split the **hn** DataFrame into two separate DataFrames. The first DataFrame should be called **\"train\"**, and should contain all posts before July 1, 2016. The second DataFrame should be called **\"new\"**, and should contain the remaining posts.\n", | |
"\n", | |
" - **Hint:** [How do I work with dates and times in pandas?](https://www.youtube.com/watch?v=yCgJGsg0Xa4&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=25) explains how to do this. Make sure that all rows from **hn** are in either **train** or **new**, but not both.\n", | |
" - **Hint:** When you are creating **train** and **new**, you should use the [`DataFrame.copy()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.copy.html) method to make sure that you are creating separate objects (rather than references to the **hn** DataFrame).\n", | |
" - **Note:** You will be building a model using the **train** DataFrame, and making predictions for posts in the **new** DataFrame, which is our simulated future data." | |
] | |
}, | |
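The two steps above can be sketched on a toy frame that includes the boundary cases: a post with exactly 5 points and a post stamped exactly at midnight on July 1, 2016. The frame and the names `hn_toy`, `train_toy`, and `new_toy` are made up for illustration. The sketch assumes "before July 1" means strictly earlier than midnight on that date.

```python
import pandas as pd

# Made-up posts straddling the July 1, 2016 cutoff.
hn_toy = pd.DataFrame({
    "num_points": [1, 5, 6, 40],
    "created_at": pd.to_datetime([
        "2016-06-30 23:59:00",
        "2016-07-01 00:00:00",
        "2016-03-15 08:00:00",
        "2016-09-01 10:30:00",
    ]),
})

# 1 when the post has strictly more than 5 points, else 0.
hn_toy["popular"] = (hn_toy["num_points"] > 5).astype(int)

# Posts on or after the cutoff are the simulated future data.
cutoff = pd.Timestamp("2016-07-01")
is_new = hn_toy["created_at"] >= cutoff

train_toy = hn_toy.loc[~is_new].copy()
new_toy = hn_toy.loc[is_new].copy()

# Because of .copy(), editing train_toy leaves hn_toy untouched.
train_toy["popular"] = -1
print(hn_toy["popular"].tolist())
```

Using one boolean mask and its negation is what guarantees that every row lands in exactly one of the two frames.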
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>id</th>\n", | |
" <th>title</th>\n", | |
" <th>url</th>\n", | |
" <th>num_points</th>\n", | |
" <th>num_comments</th>\n", | |
" <th>author</th>\n", | |
" <th>created_at</th>\n", | |
" <th>popular</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>12579008</td>\n", | |
" <td>You have two days to comment if you want stem ...</td>\n", | |
" <td>http://www.regulations.gov/document?D=FDA-2015...</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>altstar</td>\n", | |
" <td>2016-09-26 03:26:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>12579005</td>\n", | |
" <td>SQLAR the SQLite Archiver</td>\n", | |
" <td>https://www.sqlite.org/sqlar/doc/trunk/README.md</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>blacksqr</td>\n", | |
" <td>2016-09-26 03:24:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>12578997</td>\n", | |
" <td>What if we just printed a flatscreen televisio...</td>\n", | |
" <td>https://medium.com/vanmoof/our-secrets-out-f21...</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>pavel_lishin</td>\n", | |
" <td>2016-09-26 03:19:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" id title \\\n", | |
"0 12579008 You have two days to comment if you want stem ... \n", | |
"1 12579005 SQLAR the SQLite Archiver \n", | |
"2 12578997 What if we just printed a flatscreen televisio... \n", | |
"\n", | |
" url num_points \\\n", | |
"0 http://www.regulations.gov/document?D=FDA-2015... 1 \n", | |
"1 https://www.sqlite.org/sqlar/doc/trunk/README.md 1 \n", | |
"2 https://medium.com/vanmoof/our-secrets-out-f21... 1 \n", | |
"\n", | |
" num_comments author created_at popular \n", | |
"0 0 altstar 2016-09-26 03:26:00 0 \n", | |
"1 0 blacksqr 2016-09-26 03:24:00 0 \n", | |
"2 0 pavel_lishin 2016-09-26 03:19:00 0 " | |
] | |
}, | |
"execution_count": 5, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Create the response column: 1 if the post has more than 5 points, else 0\n", | |
"hn['popular'] = (hn.num_points > 5).astype(int)\n", | |
"hn.head(3)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>id</th>\n", | |
" <th>title</th>\n", | |
" <th>url</th>\n", | |
" <th>num_points</th>\n", | |
" <th>num_comments</th>\n", | |
" <th>author</th>\n", | |
" <th>created_at</th>\n", | |
" <th>popular</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>293114</th>\n", | |
" <td>10176919</td>\n", | |
" <td>Ask HN: What is/are your favorite quote(s)?</td>\n", | |
" <td>NaN</td>\n", | |
" <td>15</td>\n", | |
" <td>20</td>\n", | |
" <td>kumarski</td>\n", | |
" <td>2015-09-06 06:02:00</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>293115</th>\n", | |
" <td>10176917</td>\n", | |
" <td>Attention and awareness in stage magic: turnin...</td>\n", | |
" <td>http://people.cs.uchicago.edu/~luitien/nrn2473...</td>\n", | |
" <td>14</td>\n", | |
" <td>0</td>\n", | |
" <td>stakent</td>\n", | |
" <td>2015-09-06 06:01:00</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>293116</th>\n", | |
" <td>10176908</td>\n", | |
" <td>Dying vets fuck you letter (2013)</td>\n", | |
" <td>http://dangerousminds.net/comments/dying_vets_...</td>\n", | |
" <td>10</td>\n", | |
" <td>2</td>\n", | |
" <td>mycodebreaks</td>\n", | |
" <td>2015-09-06 05:56:00</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>293117</th>\n", | |
" <td>10176907</td>\n", | |
" <td>PHP 7 Coolest Features: Space Ships, Type Hint...</td>\n", | |
" <td>https://www.zend.com/en/resources/php-7</td>\n", | |
" <td>2</td>\n", | |
" <td>0</td>\n", | |
" <td>Garbage</td>\n", | |
" <td>2015-09-06 05:55:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>293118</th>\n", | |
" <td>10176903</td>\n", | |
" <td>Toyota Establishes Research Centers with MIT a...</td>\n", | |
" <td>http://newsroom.toyota.co.jp/en/detail/9233109/</td>\n", | |
" <td>4</td>\n", | |
" <td>0</td>\n", | |
" <td>tim_sw</td>\n", | |
" <td>2015-09-06 05:50:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" id title \\\n", | |
"293114 10176919 Ask HN: What is/are your favorite quote(s)? \n", | |
"293115 10176917 Attention and awareness in stage magic: turnin... \n", | |
"293116 10176908 Dying vets fuck you letter (2013) \n", | |
"293117 10176907 PHP 7 Coolest Features: Space Ships, Type Hint... \n", | |
"293118 10176903 Toyota Establishes Research Centers with MIT a... \n", | |
"\n", | |
" url num_points \\\n", | |
"293114 NaN 15 \n", | |
"293115 http://people.cs.uchicago.edu/~luitien/nrn2473... 14 \n", | |
"293116 http://dangerousminds.net/comments/dying_vets_... 10 \n", | |
"293117 https://www.zend.com/en/resources/php-7 2 \n", | |
"293118 http://newsroom.toyota.co.jp/en/detail/9233109/ 4 \n", | |
"\n", | |
" num_comments author created_at popular \n", | |
"293114 20 kumarski 2015-09-06 06:02:00 1 \n", | |
"293115 0 stakent 2015-09-06 06:01:00 1 \n", | |
"293116 2 mycodebreaks 2015-09-06 05:56:00 1 \n", | |
"293117 0 Garbage 2015-09-06 05:55:00 0 \n", | |
"293118 0 tim_sw 2015-09-06 05:50:00 0 " | |
] | |
}, | |
"execution_count": 6, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"hn.tail()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"(293119, 8)" | |
] | |
}, | |
"execution_count": 7, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"hn.shape" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"(64119, 8)" | |
] | |
}, | |
"execution_count": 8, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
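"# Posts on or after July 1, 2016 form the simulated future data\n", | |
"cutoff_date = datetime(2016, 7, 1)\n", | |
"# >= keeps a post stamped exactly at midnight on July 1 out of train, as the task requires\n", | |
"date_filter = hn.created_at >= cutoff_date\n", | |
"date_filter.sum()\n", | |
"\n", | |
"new = hn.loc[date_filter, :].copy()\n", | |
"new.shape\n", | |
"# type(hn.created_at[0])" | |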
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"(229000, 8)" | |
] | |
}, | |
"execution_count": 9, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Rows not selected for new become the training set\n", | |
"(~date_filter).sum()\n", | |
"train = hn.loc[~date_filter, :].copy()\n", | |
"train.shape" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 10, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"assert hn.shape[0] == train.shape[0] + new.shape[0], 'sizes should match'" | |
] | |
}, | |
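The size check above confirms that no rows were lost, but not that the two pieces are disjoint. A slightly stricter check could compare index labels, as in the sketch below. The helper name `is_partition` is hypothetical, the toy frame is made up, and the check assumes the full frame has unique index labels (as with the default index from `read_csv`).

```python
import pandas as pd


def is_partition(full, first, second):
    """True when first and second share no index labels and together cover full's index."""
    a, b = set(first.index), set(second.index)
    same_size = len(first) + len(second) == len(full)
    return same_size and a.isdisjoint(b) and (a | b) == set(full.index)


# Made-up frame for illustration.
full = pd.DataFrame({"x": [10, 20, 30, 40]})
mask = full["x"] > 20

print(is_partition(full, full[~mask], full[mask]))  # complementary pieces
print(is_partition(full, full[~mask], full))        # overlapping pieces
```

Applied to the notebook's frames, the equivalent call would be `is_partition(hn, train, new)`.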
{ | |
"cell_type": "code", | |
"execution_count": 11, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>id</th>\n", | |
" <th>title</th>\n", | |
" <th>url</th>\n", | |
" <th>num_points</th>\n", | |
" <th>num_comments</th>\n", | |
" <th>author</th>\n", | |
" <th>created_at</th>\n", | |
" <th>popular</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>12579008</td>\n", | |
" <td>You have two days to comment if you want stem ...</td>\n", | |
" <td>http://www.regulations.gov/document?D=FDA-2015...</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>altstar</td>\n", | |
" <td>2016-09-26 03:26:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>12579005</td>\n", | |
" <td>SQLAR the SQLite Archiver</td>\n", | |
" <td>https://www.sqlite.org/sqlar/doc/trunk/README.md</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>blacksqr</td>\n", | |
" <td>2016-09-26 03:24:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>12578997</td>\n", | |
" <td>What if we just printed a flatscreen televisio...</td>\n", | |
" <td>https://medium.com/vanmoof/our-secrets-out-f21...</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>pavel_lishin</td>\n", | |
" <td>2016-09-26 03:19:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>12578989</td>\n", | |
" <td>algorithmic music</td>\n", | |
" <td>http://cacm.acm.org/magazines/2011/7/109891-al...</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>poindontcare</td>\n", | |
" <td>2016-09-26 03:16:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>12578979</td>\n", | |
" <td>How the Data Vault Enables the Next-Gen Data W...</td>\n", | |
" <td>https://www.talend.com/blog/2016/05/12/talend-...</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>markgainor1</td>\n", | |
" <td>2016-09-26 03:14:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>5</th>\n", | |
" <td>12578975</td>\n", | |
" <td>Saving the Hassle of Shopping</td>\n", | |
" <td>https://blog.menswr.com/2016/09/07/whats-new-w...</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>bdoux</td>\n", | |
" <td>2016-09-26 03:13:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>6</th>\n", | |
" <td>12578954</td>\n", | |
" <td>Macalifa A new open-source music app for UWP ...</td>\n", | |
" <td>http://forums.windowscentral.com/windows-phone...</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>thecodrr</td>\n", | |
" <td>2016-09-26 03:06:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>7</th>\n", | |
" <td>12578942</td>\n", | |
" <td>GitHub theweavrs/Macalifa: A music player wri...</td>\n", | |
" <td>https://github.com/theweavrs/Macalifa</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>thecodrr</td>\n", | |
" <td>2016-09-26 03:04:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>8</th>\n", | |
" <td>12578919</td>\n", | |
" <td>Google Allo first Impression</td>\n", | |
" <td>http://prodissues.com/2016/09/google-allo-firs...</td>\n", | |
" <td>3</td>\n", | |
" <td>0</td>\n", | |
" <td>jandll</td>\n", | |
" <td>2016-09-26 02:57:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>9</th>\n", | |
" <td>12578918</td>\n", | |
" <td>Advanced Multimedia on the Linux Command Line</td>\n", | |
" <td>https://avi.alkalay.net/2016/09/multimedia-lin...</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>mynameislegion</td>\n", | |
" <td>2016-09-26 02:56:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>10</th>\n", | |
" <td>12578908</td>\n", | |
" <td>Ask HN: What TLD do you use for local developm...</td>\n", | |
" <td>NaN</td>\n", | |
" <td>4</td>\n", | |
" <td>7</td>\n", | |
" <td>Sevrene</td>\n", | |
" <td>2016-09-26 02:53:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>11</th>\n", | |
" <td>12578893</td>\n", | |
" <td>Muroc Maru</td>\n", | |
" <td>http://www.weirdca.com/location.php?location=511</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>x43b</td>\n", | |
" <td>2016-09-26 02:46:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>12</th>\n", | |
" <td>12578879</td>\n", | |
" <td>Why companies make their products worse</td>\n", | |
" <td>https://www.1843magazine.com/ideas/the-daily/w...</td>\n", | |
" <td>4</td>\n", | |
" <td>0</td>\n", | |
" <td>RachelF</td>\n", | |
" <td>2016-09-26 02:40:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>13</th>\n", | |
" <td>12578866</td>\n", | |
" <td>Tuning AWS SQS Queues</td>\n", | |
" <td>http://blog.simontaranto.com/post/2016-09-25-t...</td>\n", | |
" <td>3</td>\n", | |
" <td>0</td>\n", | |
" <td>srt32</td>\n", | |
" <td>2016-09-26 02:37:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>14</th>\n", | |
" <td>12578857</td>\n", | |
" <td>The Promise of GitHub</td>\n", | |
" <td>http://constantbetasoftware.com/2016/09/26/git...</td>\n", | |
" <td>2</td>\n", | |
" <td>0</td>\n", | |
" <td>ttam</td>\n", | |
" <td>2016-09-26 02:34:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>15</th>\n", | |
" <td>12578834</td>\n", | |
" <td>Joint R&D Has Its Ups and Downs</td>\n", | |
" <td>http://semiengineering.com/joint-rd-has-its-up...</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>Lind5</td>\n", | |
" <td>2016-09-26 02:28:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>16</th>\n", | |
" <td>12578831</td>\n", | |
" <td>IBM announces next implementation of Apples Sw...</td>\n", | |
" <td>https://9to5mac.com/2016/09/25/ibm-announces-n...</td>\n", | |
" <td>2</td>\n", | |
" <td>0</td>\n", | |
" <td>phodo</td>\n", | |
" <td>2016-09-26 02:28:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>17</th>\n", | |
" <td>12578822</td>\n", | |
" <td>Amazons Algorithms Dont Find You the Best Deals</td>\n", | |
" <td>https://www.technologyreview.com/s/602442/amaz...</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>yarapavan</td>\n", | |
" <td>2016-09-26 02:26:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>18</th>\n", | |
" <td>12578816</td>\n", | |
" <td>Ruffled Feathers</td>\n", | |
" <td>http://www.texasmonthly.com/articles/whooping-...</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>Thevet</td>\n", | |
" <td>2016-09-26 02:23:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>19</th>\n", | |
" <td>12578806</td>\n", | |
" <td>The Veil of Ignorance Design and Accessbility</td>\n", | |
" <td>https://blog.marvelapp.com/the-veil-of-ignorance/</td>\n", | |
" <td>3</td>\n", | |
" <td>0</td>\n", | |
" <td>muratmutlu</td>\n", | |
" <td>2016-09-26 02:21:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>20</th>\n", | |
" <td>12578796</td>\n", | |
" <td>OMeta#: Who? What? When? Where? Why? (2008)</td>\n", | |
" <td>http://www.moserware.com/2008/06/ometa-who-wha...</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>adamnemecek</td>\n", | |
" <td>2016-09-26 02:18:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>21</th>\n", | |
" <td>12578791</td>\n", | |
" <td>Burning Ship fractal</td>\n", | |
" <td>https://en.wikipedia.org/wiki/Burning_Ship_fra...</td>\n", | |
" <td>2</td>\n", | |
" <td>0</td>\n", | |
" <td>colinprince</td>\n", | |
" <td>2016-09-26 02:17:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>22</th>\n", | |
" <td>12578786</td>\n", | |
" <td>From Hiroko to Susie: The untold stories of Ja...</td>\n", | |
" <td>http://www.washingtonpost.com/sf/national/2016...</td>\n", | |
" <td>2</td>\n", | |
" <td>0</td>\n", | |
" <td>kawera</td>\n", | |
" <td>2016-09-26 02:16:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>23</th>\n", | |
" <td>12578753</td>\n", | |
" <td>ROBOLUTION:Robocalyptic Themed Machine Learnin...</td>\n", | |
" <td>http://robolution.co/</td>\n", | |
" <td>2</td>\n", | |
" <td>0</td>\n", | |
" <td>v3ss0n</td>\n", | |
" <td>2016-09-26 02:07:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>24</th>\n", | |
" <td>12578738</td>\n", | |
" <td>Segas Plans for World Domination (1993)</td>\n", | |
" <td>https://www.wired.com/1993/06/sega/</td>\n", | |
" <td>2</td>\n", | |
" <td>0</td>\n", | |
" <td>luu</td>\n", | |
" <td>2016-09-26 02:04:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>25</th>\n", | |
" <td>12578725</td>\n", | |
" <td>Google Car: Sense and Money Impasse</td>\n", | |
" <td>https://mondaynote.com/google-car-sense-and-mo...</td>\n", | |
" <td>4</td>\n", | |
" <td>0</td>\n", | |
" <td>kawera</td>\n", | |
" <td>2016-09-26 02:01:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>26</th>\n", | |
" <td>12578705</td>\n", | |
" <td>Why an open Web is important when sea levels a...</td>\n", | |
" <td>https://changelog.com/221/</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>mynameislegion</td>\n", | |
" <td>2016-09-26 01:58:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>27</th>\n", | |
" <td>12578700</td>\n", | |
" <td>Forever 23: The Rapid Rise and Sudden Disappea...</td>\n", | |
" <td>http://pictorial.jezebel.com/forever-23-the-ra...</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>samclemens</td>\n", | |
" <td>2016-09-26 01:57:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>28</th>\n", | |
" <td>12578694</td>\n", | |
" <td>Emergency dose of epinephrine that does not co...</td>\n", | |
" <td>http://m.imgur.com/gallery/th6Ua</td>\n", | |
" <td>2</td>\n", | |
" <td>1</td>\n", | |
" <td>dredmorbius</td>\n", | |
" <td>2016-09-26 01:54:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>29</th>\n", | |
" <td>12578681</td>\n", | |
" <td>Abu Ashraf Masnun: Introduction to Django Chan...</td>\n", | |
" <td>http://masnun.rocks/2016/09/25/introduction-to...</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>samber</td>\n", | |
" <td>2016-09-26 01:49:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>...</th>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>64089</th>\n", | |
" <td>12013115</td>\n", | |
" <td>Tours tech company Zerve to shut down</td>\n", | |
" <td>https://www.tnooz.com/article/tours-zerve-shut...</td>\n", | |
" <td>2</td>\n", | |
" <td>0</td>\n", | |
" <td>danso</td>\n", | |
" <td>2016-07-01 00:59:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>64090</th>\n", | |
" <td>12013098</td>\n", | |
" <td>I worked in the CIA under Bush. Obama is right...</td>\n", | |
" <td>http://www.vox.com/2016/6/28/12046626/phrase-i...</td>\n", | |
" <td>4</td>\n", | |
" <td>0</td>\n", | |
" <td>jseliger</td>\n", | |
" <td>2016-07-01 00:54:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>64091</th>\n", | |
" <td>12013095</td>\n", | |
" <td>What is Storj?</td>\n", | |
" <td>http://blog.storj.io/post/146711695563/what-is...</td>\n", | |
" <td>2</td>\n", | |
" <td>0</td>\n", | |
" <td>gk1</td>\n", | |
" <td>2016-07-01 00:53:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>64092</th>\n", | |
" <td>12013087</td>\n", | |
" <td>Data Scien" | |
], | |
"text/plain": [ | |
" id title \\\n", | |
"0 12579008 You have two days to comment if you want stem ... \n", | |
"1 12579005 SQLAR the SQLite Archiver \n", | |
"2 12578997 What if we just printed a flatscreen televisio... \n", | |
"3 12578989 algorithmic music \n", | |
"4 12578979 How the Data Vault Enables the Next-Gen Data W... \n", | |
"5 12578975 Saving the Hassle of Shopping \n", | |
"6 12578954 Macalifa A new open-source music app for UWP ... \n", | |
"7 12578942 GitHub theweavrs/Macalifa: A music player wri... \n", | |
"8 12578919 Google Allo first Impression \n", | |
"9 12578918 Advanced Multimedia on the Linux Command Line \n", | |
"10 12578908 Ask HN: What TLD do you use for local developm... \n", | |
"11 12578893 Muroc Maru \n", | |
"12 12578879 Why companies make their products worse \n", | |
"13 12578866 Tuning AWS SQS Queues \n", | |
"14 12578857 The Promise of GitHub \n", | |
"15 12578834 Joint R&D Has Its Ups and Downs \n", | |
"16 12578831 IBM announces next implementation of Apples Sw... \n", | |
"17 12578822 Amazons Algorithms Dont Find You the Best Deals \n", | |
"18 12578816 Ruffled Feathers \n", | |
"19 12578806 The Veil of Ignorance Design and Accessbility \n", | |
"20 12578796 OMeta#: Who? What? When? Where? Why? (2008) \n", | |
"21 12578791 Burning Ship fractal \n", | |
"22 12578786 From Hiroko to Susie: The untold stories of Ja... \n", | |
"23 12578753 ROBOLUTION:Robocalyptic Themed Machine Learnin... \n", | |
"24 12578738 Segas Plans for World Domination (1993) \n", | |
"25 12578725 Google Car: Sense and Money Impasse \n", | |
"26 12578705 Why an open Web is important when sea levels a... \n", | |
"27 12578700 Forever 23: The Rapid Rise and Sudden Disappea... \n", | |
"28 12578694 Emergency dose of epinephrine that does not co... \n", | |
"29 12578681 Abu Ashraf Masnun: Introduction to Django Chan... \n", | |
"... ... ... \n", | |
"64089 12013115 Tours tech company Zerve to shut down \n", | |
"64090 12013098 I worked in the CIA under Bush. Obama is right... \n", | |
"64091 12013095 What is Storj? \n", | |
"64092 12013087 Data Science Weekly Newsletter Edition #136 \n", | |
"64093 12013085 Workday acquires Zaption \n", | |
"64094 12013084 Cato Institute admits We are warming our plane... \n", | |
"64095 12013074 Tekserve, Precursor to the Apple Store, to Clo... \n", | |
"64096 12013072 Man killed in Tesla crash had previously recor... \n", | |
"64097 12013054 A Women's History of Silicon Valley \n", | |
"64098 12013026 Cloud Application Security Tips What Changes ... \n", | |
"64099 12013025 From not working to neural networking \n", | |
"64100 12013022 Unlimited shelljs commands with ES6 proxies \n", | |
"64101 12013020 Mozilla involves the community in its open-sou... \n", | |
"64102 12013018 Sam Brownback's funding plan for Kansas is bad... \n", | |
"64103 12013005 Show HN: Classic Board Games with a Modern UI ... \n", | |
"64104 12012989 Has Anyone Gotten a Job After Self-Teaching Pr... \n", | |
"64105 12012987 Investing in Analytics The Paradox of Choice \n", | |
"64106 12012977 Devastating Amazon hardware review of a wirele... \n", | |
"64107 12012972 Elon Musk Is Wrong. We Aren't Living in a Simu... \n", | |
"64108 12012968 Oracle Ordered to Pay HP $3B by Jury Over Itan... \n", | |
"64109 12012960 Spotify: Apple is holding up app approval to s... \n", | |
"64110 12012944 Car drives sideways \n", | |
"64111 12012942 Certificate Transparency and the Certificate A... \n", | |
"64112 12012931 Brain: An esoteric modern computer language ba... \n", | |
"64113 12012927 Ask HN: Designers and Developers of HN, how ca... \n", | |
"64114 12012924 Zenefits Loses Over Half of Its Value \n", | |
"64115 12012918 The Blackbird: First fully adjustable car rig ... \n", | |
"64116 12012913 Ask HN: As a Python developer, what am I missi... \n", | |
"64117 12012912 Ask HN: What do you build, what tools and edit... \n", | |
"64118 12012897 Mattermark Daily Thursday, June 30th, 2016 \n", | |
"\n", | |
" url num_points \\\n", | |
"0 http://www.regulations.gov/document?D=FDA-2015... 1 \n", | |
"1 https://www.sqlite.org/sqlar/doc/trunk/README.md 1 \n", | |
"2 https://medium.com/vanmoof/our-secrets-out-f21... 1 \n", | |
"3 http://cacm.acm.org/magazines/2011/7/109891-al... 1 \n", | |
"4 https://www.talend.com/blog/2016/05/12/talend-... 1 \n", | |
"5 https://blog.menswr.com/2016/09/07/whats-new-w... 1 \n", | |
"6 http://forums.windowscentral.com/windows-phone... 1 \n", | |
"7 https://github.com/theweavrs/Macalifa 1 \n", | |
"8 http://prodissues.com/2016/09/google-allo-firs... 3 \n", | |
"9 https://avi.alkalay.net/2016/09/multimedia-lin... 1 \n", | |
"10 NaN 4 \n", | |
"11 http://www.weirdca.com/location.php?location=511 1 \n", | |
"12 https://www.1843magazine.com/ideas/the-daily/w... 4 \n", | |
"13 http://blog.simontaranto.com/post/2016-09-25-t... 3 \n", | |
"14 http://constantbetasoftware.com/2016/09/26/git... 2 \n", | |
"15 http://semiengineering.com/joint-rd-has-its-up... 1 \n", | |
"16 https://9to5mac.com/2016/09/25/ibm-announces-n... 2 \n", | |
"17 https://www.technologyreview.com/s/602442/amaz... 1 \n", | |
"18 http://www.texasmonthly.com/articles/whooping-... 1 \n", | |
"19 https://blog.marvelapp.com/the-veil-of-ignorance/ 3 \n", | |
"20 http://www.moserware.com/2008/06/ometa-who-wha... 1 \n", | |
"21 https://en.wikipedia.org/wiki/Burning_Ship_fra... 2 \n", | |
"22 http://www.washingtonpost.com/sf/national/2016... 2 \n", | |
"23 http://robolution.co/ 2 \n", | |
"24 https://www.wired.com/1993/06/sega/ 2 \n", | |
"25 https://mondaynote.com/google-car-sense-and-mo... 4 \n", | |
"26 https://changelog.com/221/ 1 \n", | |
"27 http://pictorial.jezebel.com/forever-23-the-ra... 1 \n", | |
"28 http://m.imgur.com/gallery/th6Ua 2 \n", | |
"29 http://masnun.rocks/2016/09/25/introduction-to... 1 \n", | |
"... ... ... \n", | |
"64089 https://www.tnooz.com/article/tours-zerve-shut... 2 \n", | |
"64090 http://www.vox.com/2016/6/28/12046626/phrase-i... 4 \n", | |
"64091 http://blog.storj.io/post/146711695563/what-is... 2 \n", | |
"64092 http://www.datascienceweekly.org/newsletters/d... 2 \n", | |
"64093 http://blog.zaption.com/post/146724427719/zapt... 3 \n", | |
"64094 http://www.freetochoose.tv/program.php?id=dead... 2 \n", | |
"64095 http://www.nytimes.com/2016/06/30/nyregion/tek... 1 \n", | |
"64096 http://jalopnik.com/man-killed-in-self-driving... 5 \n", | |
"64097 https://backchannel.com/a-womens-history-of-si... 1 \n", | |
"64098 http://www.happyapps.io/blog/2016-06-28-cloud-... 1 \n", | |
"64099 http://www.economist.com/news/special-report/2... 1 \n", | |
"64100 https://github.com/nfischer/shelljs-exec-proxy 5 \n", | |
"64101 https://www.designweek.co.uk/issues/27-june-4-... 2 \n", | |
"64102 http://www.slate.com/blogs/moneybox/2016/06/30... 1 \n", | |
"64103 https://boardom.io 2 \n", | |
"64104 https://www.reddit.com/r/learnprogramming/comm... 2 \n", | |
"64105 https://vijaybhat.com/2016/06/28/investing-in-... 1 \n", | |
"64106 https://www.amazon.com/gp/review/R2JVRCO8T1ON0R 118 \n", | |
"64107 http://motherboard.vice.com/read/we-dont-live-... 3 \n", | |
"64108 http://www.bloomberg.com/news/articles/2016-06... 2 \n", | |
"64109 https://www.engadget.com/2016/06/30/spotify-cl... 2 \n", | |
"64110 http://www.theverge.com/2016/6/30/12064724/omn... 4 \n", | |
"64111 https://www.mjt.me.uk/posts/certificate-transp... 2 \n", | |
"64112 https://github.com/luizperes/brain/issues 1 \n", | |
"64113 NaN 1 \n", | |
"64114 http://fortune.com/2016/06/30/zenefits-loses-o... 191 \n", | |
"64115 http://www.themill.com/portfolio/3002/the-blac... 2 \n", | |
"64116 NaN 33 \n", | |
"64117 NaN 3 \n", | |
"64118 https://mattermark.com/mattermark-daily-thursd... 1 \n", | |
"\n", | |
" num_comments author created_at popular \n", | |
"0 0 altstar 2016-09-26 03:26:00 0 \n", | |
"1 0 blacksqr 2016-09-26 03:24:00 0 \n", | |
"2 0 pavel_lishin 2016-09-26 03:19:00 0 \n", | |
"3 0 poindontcare 2016-09-26 03:16:00 0 \n", | |
"4 0 markgainor1 2016-09-26 03:14:00 0 \n", | |
"5 1 bdoux 2016-09-26 03:13:00 0 \n", | |
"6 0 thecodrr 2016-09-26 03:06:00 0 \n", | |
"7 0 thecodrr 2016-09-26 03:04:00 0 \n", | |
"8 0 jandll 2016-09-26 02:57:00 0 \n", | |
"9 0 mynameislegion 2016-09-26 02:56:00 0 \n", | |
"10 7 Sevrene 2016-09-26 02:53:00 0 \n", | |
"11 0 x43b 2016-09-26 02:46:00 0 \n", | |
"12 0 RachelF 2016-09-26 02:40:00 0 \n", | |
"13 0 srt32 2016-09-26 02:37:00 0 \n", | |
"14 0 ttam 2016-09-26 02:34:00 0 \n", | |
"15 0 Lind5 2016-09-26 02:28:00 0 \n", | |
"16 0 phodo 2" | |
] | |
}, | |
"execution_count": 11, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"hn[date_filter]" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Task 3: Explore the training data\n", | |
"\n", | |
"Explore the **train** DataFrame to gain an understanding of the dataset.\n", | |
"\n", | |
"**Note:** Never explore the **new** DataFrame: it is our simulated future data, which you would not have access to in the \"real world\"."
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 12, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"(229000, 8)\n" | |
] | |
} | |
], | |
"source": [ | |
"print(train.shape)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 13, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>id</th>\n", | |
" <th>title</th>\n", | |
" <th>url</th>\n", | |
" <th>num_points</th>\n", | |
" <th>num_comments</th>\n", | |
" <th>author</th>\n", | |
" <th>created_at</th>\n", | |
" <th>popular</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>64119</th>\n", | |
" <td>12012874</td>\n", | |
" <td>The Master JavaScript Course Has Been Released...</td>\n", | |
" <td>http://www.masterjavascript.io/lp/master-javas...</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>erikgrueter</td>\n", | |
" <td>2016-06-30 23:58:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>64120</th>\n", | |
" <td>12012865</td>\n", | |
" <td>Deeply Learn the JavaScript Scope Chain</td>\n", | |
" <td>http://www.masterjavascript.io/blog/2016/05/22...</td>\n", | |
" <td>2</td>\n", | |
" <td>0</td>\n", | |
" <td>erikgrueter</td>\n", | |
" <td>2016-06-30 23:57:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>64121</th>\n", | |
" <td>12012863</td>\n", | |
" <td>Deeply Learning JavaScript Closures</td>\n", | |
" <td>http://www.masterjavascript.io/blog/2016/04/24...</td>\n", | |
" <td>2</td>\n", | |
" <td>0</td>\n", | |
" <td>erikgrueter</td>\n", | |
" <td>2016-06-30 23:57:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" id title \\\n", | |
"64119 12012874 The Master JavaScript Course Has Been Released... \n", | |
"64120 12012865 Deeply Learn the JavaScript Scope Chain \n", | |
"64121 12012863 Deeply Learning JavaScript Closures \n", | |
"\n", | |
" url num_points \\\n", | |
"64119 http://www.masterjavascript.io/lp/master-javas... 1 \n", | |
"64120 http://www.masterjavascript.io/blog/2016/05/22... 2 \n", | |
"64121 http://www.masterjavascript.io/blog/2016/04/24... 2 \n", | |
"\n", | |
" num_comments author created_at popular \n", | |
"64119 1 erikgrueter 2016-06-30 23:58:00 0 \n", | |
"64120 0 erikgrueter 2016-06-30 23:57:00 0 \n", | |
"64121 0 erikgrueter 2016-06-30 23:57:00 0 " | |
] | |
}, | |
"execution_count": 13, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"train.head(3)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 14, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"ingve 2119\n", | |
"jonbaer 2117\n", | |
"prostoalex 1224\n", | |
"dnetesn 1212\n", | |
"jseliger 935\n", | |
"bootload 822\n", | |
"DiabloD3 709\n", | |
"williswee 706\n", | |
"doener 686\n", | |
"walterbell 642\n", | |
"Name: author, dtype: int64" | |
] | |
}, | |
"execution_count": 14, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Count posts per author and list the 10 most frequent submitters.\n",
"# train['author'] = train['author'].astype('category')\n",
"train['author'].value_counts().head(10)"
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 15, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"<matplotlib.axes._subplots.AxesSubplot at 0x7f7acc3ee470>" | |
] | |
}, | |
"execution_count": 15, | |
"metadata": {}, | |
"output_type": "execute_result" | |
}, | |
{ | |
"data": { | |
"image/png": "[base64-encoded PNG omitted: bar chart of post counts for the 10 most frequent authors in train]",
"text/plain": [ | |
"<matplotlib.figure.Figure at 0x7f7acc3d90f0>" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"# Bar chart of post counts for the 10 most frequent authors.\n",
"train['author'].value_counts().head(10).plot(kind='bar')"
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 16, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>id</th>\n", | |
" <th>title</th>\n", | |
" <th>url</th>\n", | |
" <th>num_points</th>\n", | |
" <th>num_comments</th>\n", | |
" <th>author</th>\n", | |
" <th>created_at</th>\n", | |
" <th>popular</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>64119</th>\n", | |
" <td>12012874</td>\n", | |
" <td>The Master JavaScript Course Has Been Released...</td>\n", | |
" <td>http://www.masterjavascript.io/lp/master-javas...</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>erikgrueter</td>\n", | |
" <td>2016-06-30 23:58:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>64120</th>\n", | |
" <td>12012865</td>\n", | |
" <td>Deeply Learn the JavaScript Scope Chain</td>\n", | |
" <td>http://www.masterjavascript.io/blog/2016/05/22...</td>\n", | |
" <td>2</td>\n", | |
" <td>0</td>\n", | |
" <td>erikgrueter</td>\n", | |
" <td>2016-06-30 23:57:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>64121</th>\n", | |
" <td>12012863</td>\n", | |
" <td>Deeply Learning JavaScript Closures</td>\n", | |
" <td>http://www.masterjavascript.io/blog/2016/04/24...</td>\n", | |
" <td>2</td>\n", | |
" <td>0</td>\n", | |
" <td>erikgrueter</td>\n", | |
" <td>2016-06-30 23:57:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>64122</th>\n", | |
" <td>12012861</td>\n", | |
" <td>You want more than product/market fit</td>\n", | |
" <td>https://justinjackson.ca/want/</td>\n", | |
" <td>4</td>\n", | |
" <td>0</td>\n", | |
" <td>wocg</td>\n", | |
" <td>2016-06-30 23:57:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>64123</th>\n", | |
" <td>12012859</td>\n", | |
" <td>What's the Best Way to Learn JavaScript???</td>\n", | |
" <td>http://www.masterjavascript.io/blog/2016/06/05...</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>erikgrueter</td>\n", | |
" <td>2016-06-30 23:56:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>64124</th>\n", | |
" <td>12012856</td>\n", | |
" <td>Mastering First Class Functions in JavaScript</td>\n", | |
" <td>http://www.masterjavascript.io/blog/2016/06/13...</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>erikgrueter</td>\n", | |
" <td>2016-06-30 23:56:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>64125</th>\n", | |
" <td>12012854</td>\n", | |
" <td>How ECMAScript 6 Does Classes</td>\n", | |
" <td>http://www.masterjavascript.io/blog/2016/06/20...</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>erikgrueter</td>\n", | |
" <td>2016-06-30 23:56:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>64126</th>\n", | |
" <td>12012845</td>\n", | |
" <td>Deep Dive into Function Constructors in JavaSc...</td>\n", | |
" <td>http://www.masterjavascript.io/blog/2016/06/26...</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>erikgrueter</td>\n", | |
" <td>2016-06-30 23:55:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>64127</th>\n", | |
" <td>12012831</td>\n", | |
" <td>Building a Realtime Collaborative Editor with ...</td>\n", | |
" <td>http://tutorials.pluralsight.com/node-js/build...</td>\n", | |
" <td>2</td>\n", | |
" <td>0</td>\n", | |
" <td>prtkgpt</td>\n", | |
" <td>2016-06-30 23:53:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>64128</th>\n", | |
" <td>12012827</td>\n", | |
" <td>107 Nobel laureates sign letter blasting Green...</td>\n", | |
" <td>https://www.washingtonpost.com/news/speaking-o...</td>\n", | |
" <td>111</td>\n", | |
" <td>116</td>\n", | |
" <td>larion1</td>\n", | |
" <td>2016-06-30 23:52:00</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>64129</th>\n", | |
" <td>12012825</td>\n", | |
" <td>Sean Ellis on how growth hacking will outlive ...</td>\n", | |
" <td>https://blog.mixpanel.com/2016/06/30/sean-elli...</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>trey_swann</td>\n", | |
" <td>2016-06-30 23:52:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>64130</th>\n", | |
" <td>12012818</td>\n", | |
" <td>Introducing the YMARK specification</td>\n", | |
" <td>https://pvieito.com/2016/06/ymark</td>\n", | |
" <td>2</td>\n", | |
" <td>0</td>\n", | |
" <td>pvieito</td>\n", | |
" <td>2016-06-30 23:51:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>64131</th>\n", | |
" <td>12012797</td>\n", | |
" <td>Side by side animated comparisons of various s...</td>\n", | |
" <td>http://sorting-algorithms.com</td>\n", | |
" <td>2</td>\n", | |
" <td>0</td>\n", | |
" <td>arigatuso</td>\n", | |
" <td>2016-06-30 23:48:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>64132</th>\n", | |
" <td>12012774</td>\n", | |
" <td>Functional patterns Identity element</td>\n", | |
" <td>http://philipnilsson.github.io/Badness10k/post...</td>\n", | |
" <td>3</td>\n", | |
" <td>0</td>\n", | |
" <td>alipang</td>\n", | |
" <td>2016-06-30 23:44:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>64133</th>\n", | |
" <td>12012765</td>\n", | |
" <td>Zenefits Compensates Investors Over Past Misco...</td>\n", | |
" <td>http://www.nytimes.com/2016/07/01/technology/z...</td>\n", | |
" <td>3</td>\n", | |
" <td>0</td>\n", | |
" <td>jackgavigan</td>\n", | |
" <td>2016-06-30 23:42:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>64134</th>\n", | |
" <td>12012761</td>\n", | |
" <td>Enhanced Apache Spark in Domino</td>\n", | |
" <td>https://blog.dominodatalab.com/enhanced-apache...</td>\n", | |
" <td>2</td>\n", | |
" <td>0</td>\n", | |
" <td>gk1</td>\n", | |
" <td>2016-06-30 23:42:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>64135</th>\n", | |
" <td>12012752</td>\n", | |
" <td>Meet Furby Connect: Always-connected, yes, it ...</td>\n", | |
" <td>http://www.cnet.com/products/hasbro-furby-conn...</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>ourmandave</td>\n", | |
" <td>2016-06-30 23:40:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>64136</th>\n", | |
" <td>12012741</td>\n", | |
" <td>AMD Buys HiAlgo to Boost Radeon Software Suite</td>\n", | |
" <td>http://www.eweek.com/pc-hardware/amd-launches-...</td>\n", | |
" <td>2</td>\n", | |
" <td>0</td>\n", | |
" <td>siberianbear</td>\n", | |
" <td>2016-06-30 23:38:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>64137</th>\n", | |
" <td>12012728</td>\n", | |
" <td>Tesla's autopilot probed by government after f...</td>\n", | |
" <td>http://money.cnn.com/2016/06/30/technology/tes...</td>\n", | |
" <td>5</td>\n", | |
" <td>0</td>\n", | |
" <td>Trisell</td>\n", | |
" <td>2016-06-30 23:35:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>64138</th>\n", | |
" <td>12012727</td>\n", | |
" <td>The First Fatal Crash in a Self-Driving Car Ha...</td>\n", | |
" <td>http://jalopnik.com/first-fatal-tesla-autopilo...</td>\n", | |
" <td>9</td>\n", | |
" <td>0</td>\n", | |
" <td>chase202</td>\n", | |
" <td>2016-06-30 23:35:00</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>64139</th>\n", | |
" <td>12012724</td>\n", | |
" <td>Querying a Redux Store</td>\n", | |
" <td>https://medium.com/@adamrackis/querying-a-redu...</td>\n", | |
" <td>3</td>\n", | |
" <td>0</td>\n", | |
" <td>felipellrocha</td>\n", | |
" <td>2016-06-30 23:34:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>64140</th>\n", | |
" <td>12012723</td>\n", | |
" <td>Massachusetts House approves bill limiting non...</td>\n", | |
" <td>http://news.wgbh.org/2016/06/29/politics-gover...</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>eogas</td>\n", | |
" <td>2016-06-30 23:34:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>64141</th>\n", | |
" <td>12012708</td>\n", | |
" <td>AMD Zen Processor Naples with 32 Cores / 64 Th...</td>\n", | |
" <td>http://wccftech.com/amd-naples-32-core-zen/</td>\n", | |
" <td>4</td>\n", | |
" <td>0</td>\n", | |
" <td>mrb</td>\n", | |
" <td>2016-06-30 23:31:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>64142</th>\n", | |
" <td>12012698</td>\n", | |
" <td>Zoox raises $200M Series A for self-driving cars</td>\n", | |
" <td>http://www.businessinsider.com/zoox-raises-200...</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>coloneltcb</td>\n", | |
" <td>2016-06-30 23:29:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>64143</th>\n", | |
" <td>12012689</td>\n", | |
" <td>The first fully adjustable car rig that create...</td>\n", | |
" <td>http://www.themill.com/portfolio/3002/the-blac...</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>Someone</td>\n", | |
" <td>2016-06-30 23:28:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>64144</th>\n", | |
" <td>12012676</td>\n", | |
" <td>Self-Driving Tesla Was Involved in Fatal Crash...</td>\n", | |
" <td>http://www.nytimes.com/2016/07/01/business/sel...</td>\n", | |
" <td>31</td>\n", | |
" <td>9</td>\n", | |
" <td>thisjustinm</td>\n", | |
" <td>2016-06-30 23:24:00</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>64145</th>\n", | |
" <td>12012674</td>\n", | |
" <td>Zenefits Slashes Valuation to $2B in Deal with...</td>\n", | |
" <td>https://www.buzzfeed.com/williamalden/zenefits...</td>\n", | |
" <td>13</td>\n", | |
" <td>2</td>\n", | |
" <td>coloneltcb</td>\n", | |
" <td>2016-06-30 23:24:00</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>64146</th>\n", | |
" <td>12012669</td>\n", | |
" <td>Zenefits Adjusts Valuation to $2B</td>\n", | |
" <td>https://techcrunch.com/2016/06/30/zenefits-rev...</td>\n", | |
" <td>2</td>\n", | |
" <td>0</td>\n", | |
" <td>shaaaaawn</td>\n", | |
" <td>2016-06-30 23:23:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>64147</th>\n", | |
" <td>12012645</td>\n", | |
" <td>Why we need an alternative to venture capital</td>\n", | |
" <td>https://medium.com/@stefanobernardi/why-we-nee...</td>\n", | |
" <td>5</td>\n", | |
" <td>0</td>\n", | |
" <td>Sainth</td>\n", | |
" <td>2016-06-30 23:18:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>64148</th>\n", | |
" <td>12012641</td>\n", | |
" <td>[Tutorial] Creating Map Visualisations in Pyth...</td>\n", | |
" <td>http://www.datadependence.com/2016/06/creating...</td>\n", | |
" <td>3</td>\n", | |
" <td>0</td>\n", | |
" <td>Jmoir</td>\n", | |
" <td>2016-06-30 23:18:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>...</th>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>293089</th>\n", | |
" <td>10177071</td>\n", | |
" <td>Bulk Collection of Signals Intelligence: Techn...</td>\n", | |
" <td>http://www.nap.edu/catalog/19414/bulk-collecti...</td>\n", | |
" <td>3</td>\n", | |
" <td>1</td>\n", | |
" <td>mindcrime</td>\n", | |
" <td>2015-09-06 08:02:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>293090</th>\n", | |
" <td>10177065</td>\n", | |
" <td>Free Sitemap Generator Tool (Beta)</td>\n", | |
" <td>https://www.codepunker.com/tools/sitemap-gener...</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>codepunker</td>\n", | |
" <td>2015-09-06 07:58:00</td>" | |
], | |
"text/plain": [ | |
" id title \\\n", | |
"64119 12012874 The Master JavaScript Course Has Been Released... \n", | |
"64120 12012865 Deeply Learn the JavaScript Scope Chain \n", | |
"64121 12012863 Deeply Learning JavaScript Closures \n", | |
"64122 12012861 You want more than product/market fit \n", | |
"64123 12012859 What's the Best Way to Learn JavaScript??? \n", | |
"64124 12012856 Mastering First Class Functions in JavaScript \n", | |
"64125 12012854 How ECMAScript 6 Does Classes \n", | |
"64126 12012845 Deep Dive into Function Constructors in JavaSc... \n", | |
"64127 12012831 Building a Realtime Collaborative Editor with ... \n", | |
"64128 12012827 107 Nobel laureates sign letter blasting Green... \n", | |
"64129 12012825 Sean Ellis on how growth hacking will outlive ... \n", | |
"64130 12012818 Introducing the YMARK specification \n", | |
"64131 12012797 Side by side animated comparisons of various s... \n", | |
"64132 12012774 Functional patterns Identity element \n", | |
"64133 12012765 Zenefits Compensates Investors Over Past Misco... \n", | |
"64134 12012761 Enhanced Apache Spark in Domino \n", | |
"64135 12012752 Meet Furby Connect: Always-connected, yes, it ... \n", | |
"64136 12012741 AMD Buys HiAlgo to Boost Radeon Software Suite \n", | |
"64137 12012728 Tesla's autopilot probed by government after f... \n", | |
"64138 12012727 The First Fatal Crash in a Self-Driving Car Ha... \n", | |
"64139 12012724 Querying a Redux Store \n", | |
"64140 12012723 Massachusetts House approves bill limiting non... \n", | |
"64141 12012708 AMD Zen Processor Naples with 32 Cores / 64 Th... \n", | |
"64142 12012698 Zoox raises $200M Series A for self-driving cars \n", | |
"64143 12012689 The first fully adjustable car rig that create... \n", | |
"64144 12012676 Self-Driving Tesla Was Involved in Fatal Crash... \n", | |
"64145 12012674 Zenefits Slashes Valuation to $2B in Deal with... \n", | |
"64146 12012669 Zenefits Adjusts Valuation to $2B \n", | |
"64147 12012645 Why we need an alternative to venture capital \n", | |
"64148 12012641 [Tutorial] Creating Map Visualisations in Pyth... \n", | |
"... ... ... \n", | |
"293089 10177071 Bulk Collection of Signals Intelligence: Techn... \n", | |
"293090 10177065 Free Sitemap Generator Tool (Beta) \n", | |
"293091 10177062 What Microsoft Got Right That JetBrains Didnt \n", | |
"293092 10177048 The Microservices Way Weekly Microserivces Ne... \n", | |
"293093 10177041 Snapchat's video push clicks with users \n", | |
"293094 10177034 Implement Elasticsearch in Jekyll blog \n", | |
"293095 10177013 HTTP/2 demo \n", | |
"293096 10177011 Video Poker Hackers Cleared of Federal Charges \n", | |
"293097 10177010 Canadian photographer publishes art book of So... \n", | |
"293098 10177004 You cannot have at-least-once broadcast \n", | |
"293099 10176994 Law for the Commons \n", | |
"293100 10176983 Time Cube is gone \n", | |
"293101 10176981 Why exactly did Bitcoin take off? \n", | |
"293102 10176980 How over a million Americans live on $2/day \n", | |
"293103 10176976 My Keyboard \n", | |
"293104 10176974 Google's new logo was created by Russian desig... \n", | |
"293105 10176971 Example time scale of system latencies \n", | |
"293106 10176962 The Interdependency of Stanford and Silicon Va... \n", | |
"293107 10176960 Hands-On with Googles OnHub Router \n", | |
"293108 10176959 Top Gear trio werent worth the money says Netflix \n", | |
"293109 10176951 Banking on Radical Honesty \n", | |
"293110 10176942 JSF ViewState and CSRF Hacker Attacks \n", | |
"293111 10176938 Play Framework template using Spring and Jinq ... \n", | |
"293112 10176926 Chemozart: molecule editor and visualizer with... \n", | |
"293113 10176923 Why we aren't tempted to use ACLs on our Unix ... \n", | |
"293114 10176919 Ask HN: What is/are your favorite quote(s)? \n", | |
"293115 10176917 Attention and awareness in stage magic: turnin... \n", | |
"293116 10176908 Dying vets fuck you letter (2013) \n", | |
"293117 10176907 PHP 7 Coolest Features: Space Ships, Type Hint... \n", | |
"293118 10176903 Toyota Establishes Research Centers with MIT a... \n", | |
"\n", | |
" url num_points \\\n", | |
"64119 http://www.masterjavascript.io/lp/master-javas... 1 \n", | |
"64120 http://www.masterjavascript.io/blog/2016/05/22... 2 \n", | |
"64121 http://www.masterjavascript.io/blog/2016/04/24... 2 \n", | |
"64122 https://justinjackson.ca/want/ 4 \n", | |
"64123 http://www.masterjavascript.io/blog/2016/06/05... 1 \n", | |
"64124 http://www.masterjavascript.io/blog/2016/06/13... 1 \n", | |
"64125 http://www.masterjavascript.io/blog/2016/06/20... 1 \n", | |
"64126 http://www.masterjavascript.io/blog/2016/06/26... 1 \n", | |
"64127 http://tutorials.pluralsight.com/node-js/build... 2 \n", | |
"64128 https://www.washingtonpost.com/news/speaking-o... 111 \n", | |
"64129 https://blog.mixpanel.com/2016/06/30/sean-elli... 1 \n", | |
"64130 https://pvieito.com/2016/06/ymark 2 \n", | |
"64131 http://sorting-algorithms.com 2 \n", | |
"64132 http://philipnilsson.github.io/Badness10k/post... 3 \n", | |
"64133 http://www.nytimes.com/2016/07/01/technology/z... 3 \n", | |
"64134 https://blog.dominodatalab.com/enhanced-apache... 2 \n", | |
"64135 http://www.cnet.com/products/hasbro-furby-conn... 1 \n", | |
"64136 http://www.eweek.com/pc-hardware/amd-launches-... 2 \n", | |
"64137 http://money.cnn.com/2016/06/30/technology/tes... 5 \n", | |
"64138 http://jalopnik.com/first-fatal-tesla-autopilo... 9 \n", | |
"64139 https://medium.com/@adamrackis/querying-a-redu... 3 \n", | |
"64140 http://news.wgbh.org/2016/06/29/politics-gover... 1 \n", | |
"64141 http://wccftech.com/amd-naples-32-core-zen/ 4 \n", | |
"64142 http://www.businessinsider.com/zoox-raises-200... 1 \n", | |
"64143 http://www.themill.com/portfolio/3002/the-blac... 1 \n", | |
"64144 http://www.nytimes.com/2016/07/01/business/sel... 31 \n", | |
"64145 https://www.buzzfeed.com/williamalden/zenefits... 13 \n", | |
"64146 https://techcrunch.com/2016/06/30/zenefits-rev... 2 \n", | |
"64147 https://medium.com/@stefanobernardi/why-we-nee... 5 \n", | |
"64148 http://www.datadependence.com/2016/06/creating... 3 \n", | |
"... ... ... \n", | |
"293089 http://www.nap.edu/catalog/19414/bulk-collecti... 3 \n", | |
"293090 https://www.codepunker.com/tools/sitemap-gener... 1 \n", | |
"293091 http://blog.dmitryleskov.com/business-of-softw... 4 \n", | |
"293092 https://www.getrevue.co/profile/microservices 1 \n", | |
"293093 http://www.latimes.com/business/la-fi-snapchat... 7 \n", | |
"293094 http://botleg.com/stories/implement-elasticsea... 3 \n", | |
"293095 https://http2.akamai.com/demo 1 \n", | |
"293096 http://www.wired.com/2013/11/video--poker-case/ 23 \n", | |
"293097 http://www.rt.com/news/314528-soviet-bus-stops... 1 \n", | |
"293098 http://250bpm.com/blog:61 11 \n", | |
"293099 http://wiki.commonstransition.org/wiki/Law_for... 27 \n", | |
"293100 https://www.theverge.com/2015/9/2/9247913/time... 2 \n", | |
"293101 https://bitcoinrevolt.wordpress.com/2015/09/06... 1 \n", | |
"293102 http://www.vox.com/2015/9/2/9248801/extreme-po... 5 \n", | |
"293103 http://zyghost.com/articles/My-Keyboard.html 144 \n", | |
"293104 http://www.dailytech.com/Exclusive+Googles+New... 25 \n", | |
"293105 https://twitter.com/chrisjrn/status/6402978640... 2 \n", | |
"293106 http://techcrunch.com/2015/09/04/what-will-sta... 2 \n", | |
"293107 http://techcrunch.com/2015/09/05/hands-on-with... 1 \n", | |
"293108 http://www.t3.com/news/top-gear-trio-weren-t-w... 2 \n", | |
"293109 http://www.inkworthy.com/s/55e9e5b9b092c75e002... 20 \n", | |
"293110 http://www.beyondjava.net/blog/jsf-viewstate-a... 1 \n", | |
"293111 https://www.typesafe.com/activator/template/pl... 2 \n", | |
"293112 https://github.com/mohebifar/chemozart 6 \n", | |
"293113 https://utcc.utoronto.ca/~cks/space/blog/sysad... 34 \n", | |
"293114 NaN 15 \n", | |
"293115 http://people.cs.uchicago.edu/~luitien/nrn2473... 14 \n", | |
"293116 http://dangerousminds.net/comments/dying_vets_... 10 \n", | |
"293117 https://www.zend.com/en/resources/php-7 2 \n", | |
"293118 http://newsroom.toyota.co.jp/en/detail/9233109/ 4 \n", | |
"\n", | |
" num_comments author created_at popular \n", | |
"64119 1 erikgrueter 2016-06-30 23:58:00 0 \n", | |
"64120 0 erikgrueter 2016-06-30 23:57:00 0 \n", | |
"64121 0 erikgrueter 2016-06-30 23:57:00 0 \n", | |
"64122 0 wocg 2016-06-30 23:57:00 0 \n", | |
"64123 0 erikgrueter 2016-06-30 23:56:00 0 \n", | |
"64124 0 erikgrueter 2016-06-30 23:56:00 0 \n", | |
"64125 0 erikgrueter 2016-06-30 23:56:00 0 \n", | |
"64126 0 erikgrueter 2016-06-30 23:55:00 0 \n", | |
"64127 0 prtkgpt 2016-06-30 23:53:00 0 \n", | |
"64128 116 larion1 2016-06-30 23:52:00 1 \n", | |
"64129 0 trey_swann 2016-06-30 23:52:00 0 \n", | |
"64130 0 pvieito 2016-06-30 23:51:00 0 \n", | |
"64131 0 arigatuso 2016-06-30 23:48:00 0 \n", | |
"64132 0 alipang 2016-06-30 23:44:00 0 \n", | |
"64133 0 jackgavigan 2016-06-30 23:42:00 0 " | |
] | |
}, | |
"execution_count": 16, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"train.author.unique()\n", | |
"title_nunique = train.title.nunique()\n", | |
"title_nunique\n", | |
"train\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 17, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>id</th>\n", | |
" <th>num_points</th>\n", | |
" <th>num_comments</th>\n", | |
" <th>popular</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>count</th>\n", | |
" <td>2.290000e+05</td>\n", | |
" <td>229000.000000</td>\n", | |
" <td>229000.000000</td>\n", | |
" <td>229000.000000</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>mean</th>\n", | |
" <td>1.106136e+07</td>\n", | |
" <td>14.879712</td>\n", | |
" <td>6.360956</td>\n", | |
" <td>0.205262</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>std</th>\n", | |
" <td>5.308650e+05</td>\n", | |
" <td>58.340932</td>\n", | |
" <td>29.779061</td>\n", | |
" <td>0.403894</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>min</th>\n", | |
" <td>1.017690e+07</td>\n", | |
" <td>1.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>25%</th>\n", | |
" <td>1.059370e+07</td>\n", | |
" <td>1.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>50%</th>\n", | |
" <td>1.104797e+07</td>\n", | |
" <td>2.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>75%</th>\n", | |
" <td>1.151814e+07</td>\n", | |
" <td>4.000000</td>\n", | |
" <td>1.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>max</th>\n", | |
" <td>1.201287e+07</td>\n", | |
" <td>5771.000000</td>\n", | |
" <td>2531.000000</td>\n", | |
" <td>1.000000</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" id num_points num_comments popular\n", | |
"count 2.290000e+05 229000.000000 229000.000000 229000.000000\n", | |
"mean 1.106136e+07 14.879712 6.360956 0.205262\n", | |
"std 5.308650e+05 58.340932 29.779061 0.403894\n", | |
"min 1.017690e+07 1.000000 0.000000 0.000000\n", | |
"25% 1.059370e+07 1.000000 0.000000 0.000000\n", | |
"50% 1.104797e+07 2.000000 0.000000 0.000000\n", | |
"75% 1.151814e+07 4.000000 1.000000 0.000000\n", | |
"max 1.201287e+07 5771.000000 2531.000000 1.000000" | |
] | |
}, | |
"execution_count": 17, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"train.num_comments.describe()\n", | |
"train.describe().stack().unstack()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 18, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"count 229000\n", | |
"unique 167920\n", | |
"top 2016-04-09 22:40:00\n", | |
"freq 11\n", | |
"first 2015-09-06 05:50:00\n", | |
"last 2016-06-30 23:58:00\n", | |
"Name: created_at, dtype: object" | |
] | |
}, | |
"execution_count": 18, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"train.created_at.describe()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 19, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"18198" | |
] | |
}, | |
"execution_count": 19, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# first feature I can think of is 'author'\n", | |
"# second feature is 'num_comments'\n", | |
"train.duplicated(['title']).sum()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 20, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"gb = train.groupby(['title', 'author'])\n", | |
"# gb.describe()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 21, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>id</th>\n", | |
" <th>title</th>\n", | |
" <th>url</th>\n", | |
" <th>num_points</th>\n", | |
" <th>num_comments</th>\n", | |
" <th>author</th>\n", | |
" <th>created_at</th>\n", | |
" <th>popular</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>269178</th>\n", | |
" <td>10349749</td>\n", | |
" <td>#FFFFFF Diversity</td>\n", | |
" <td>https://medium.com/this-is-hard/ffffff-diversi...</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>Amorymeltzer</td>\n", | |
" <td>2015-10-07 23:01:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>268786</th>\n", | |
" <td>10352453</td>\n", | |
" <td>#FFFFFF Diversity</td>\n", | |
" <td>https://medium.com/this-is-hard/ffffff-diversi...</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>Archio</td>\n", | |
" <td>2015-10-08 13:03:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>265780</th>\n", | |
" <td>10377044</td>\n", | |
" <td>#FFFFFF Diversity</td>\n", | |
" <td>https://medium.com/this-is-hard/ffffff-diversi...</td>\n", | |
" <td>14</td>\n", | |
" <td>0</td>\n", | |
" <td>mrstorm</td>\n", | |
" <td>2015-10-12 21:26:00</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>263787</th>\n", | |
" <td>10390786</td>\n", | |
" <td>#FFFFFF Diversity</td>\n", | |
" <td>https://medium.com/this-is-hard/ffffff-diversi...</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>doppp</td>\n", | |
" <td>2015-10-15 01:31:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>262759</th>\n", | |
" <td>10397555</td>\n", | |
" <td>#FFFFFF Diversity</td>\n", | |
" <td>https://medium.com/this-is-hard/ffffff-diversi...</td>\n", | |
" <td>56</td>\n", | |
" <td>91</td>\n", | |
" <td>Amorymeltzer</td>\n", | |
" <td>2015-10-16 05:02:00</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>290704</th>\n", | |
" <td>10193864</td>\n", | |
" <td>#NAME?</td>\n", | |
" <td>http://%22+https://datanice.wordpress.com/2015...</td>\n", | |
" <td>2</td>\n", | |
" <td>0</td>\n", | |
" <td>hassenc</td>\n", | |
" <td>2015-09-09 19:39:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>197434</th>\n", | |
" <td>10902030</td>\n", | |
" <td>#NAME?</td>\n", | |
" <td>https://medium.com/app-a-day/headstrong-a9f4eb...</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>thindjinn</td>\n", | |
" <td>2016-01-14 15:43:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>267243</th>\n", | |
" <td>10364765</td>\n", | |
" <td>#Node.js: A quick optimization advice</td>\n", | |
" <td>https://medium.com/@c2c/nodejs-a-quick-optimiz...</td>\n", | |
" <td>5</td>\n", | |
" <td>0</td>\n", | |
" <td>ot</td>\n", | |
" <td>2015-10-10 07:00:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>262904</th>\n", | |
" <td>10396644</td>\n", | |
" <td>#Node.js: A quick optimization advice</td>\n", | |
" <td>https://top.fse.guru/nodejs-a-quick-optimizati...</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>atriix</td>\n", | |
" <td>2015-10-15 23:31:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>171079</th>\n", | |
" <td>11108462</td>\n", | |
" <td>#Node.js: A quick optimization advice</td>\n", | |
" <td>https://top.fse.guru/nodejs-a-quick-optimizati...</td>\n", | |
" <td>2</td>\n", | |
" <td>0</td>\n", | |
" <td>kiyanwang</td>\n", | |
" <td>2016-02-16 07:56:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>283724</th>\n", | |
" <td>10242238</td>\n", | |
" <td>$1 Unistroke Recognizer</td>\n", | |
" <td>https://depts.washington.edu/aimgroup/proj/dol...</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>tambourine_man</td>\n", | |
" <td>2015-09-18 21:45:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>283238</th>\n", | |
" <td>10245928</td>\n", | |
" <td>$1 Unistroke Recognizer</td>\n", | |
" <td>https://depts.washington.edu/aimgroup/proj/dol...</td>\n", | |
" <td>138</td>\n", | |
" <td>41</td>\n", | |
" <td>tambourine_man</td>\n", | |
" <td>2015-09-19 23:30:00</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>117539</th>\n", | |
" <td>11548239</td>\n", | |
" <td>$10 router blamed in Bangladesh bank hack</td>\n", | |
" <td>http://www.bbc.co.uk/news/technology-36110421</td>\n", | |
" <td>2</td>\n", | |
" <td>0</td>\n", | |
" <td>0xbadf00d</td>\n", | |
" <td>2016-04-22 10:53:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>117527</th>\n", | |
" <td>11548318</td>\n", | |
" <td>$10 router blamed in Bangladesh bank hack</td>\n", | |
" <td>http://www.bbc.com/news/technology-36110421</td>\n", | |
" <td>2</td>\n", | |
" <td>1</td>\n", | |
" <td>ghosh</td>\n", | |
" <td>2016-04-22 11:19:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>117137</th>\n", | |
" <td>11552182</td>\n", | |
" <td>$10 router blamed in Bangladesh bank hack</td>\n", | |
" <td>http://www.bbc.co.uk/news/technology-36110421</td>\n", | |
" <td>43</td>\n", | |
" <td>9</td>\n", | |
" <td>andygambles</td>\n", | |
" <td>2016-04-22 20:00:00</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>268893</th>\n", | |
" <td>10351737</td>\n", | |
" <td>$2.200 to win for the best JavaScript coder</td>\n", | |
" <td>http://ibm-bluemix.coderpower.com/#/?utm_sourc...</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>galina</td>\n", | |
" <td>2015-10-08 09:58:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>264458</th>\n", | |
" <td>10386096</td>\n", | |
" <td>$2.200 to win for the best JavaScript coder</td>\n", | |
" <td>http://ibm-bluemix.coderpower.com/#/?utm_sourc...</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>galina</td>\n", | |
" <td>2015-10-14 12:03:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>181087</th>\n", | |
" <td>11027159</td>\n", | |
" <td>$250k of DigitalOcean credits for YC startups</td>\n", | |
" <td>http://blog.ycombinator.com/$250k-of-digitaloc...</td>\n", | |
" <td>12</td>\n", | |
" <td>2</td>\n", | |
" <td>abritishguy</td>\n", | |
" <td>2016-02-03 16:01:00</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>180615</th>\n", | |
" <td>11031039</td>\n", | |
" <td>$250k of DigitalOcean credits for YC startups</td>\n", | |
" <td>http://blog.ycombinator.com/$250k-of-digitaloc...</td>\n", | |
" <td>262</td>\n", | |
" <td>215</td>\n", | |
" <td>lacorp</td>\n", | |
" <td>2016-02-04 00:19:00</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>284165</th>\n", | |
" <td>10238884</td>\n", | |
" <td>$25K in book sales, and I'm almost ready to pu...</td>\n", | |
" <td>https://servercheck.in/blog/25k-book-sales-and...</td>\n", | |
" <td>3</td>\n", | |
" <td>0</td>\n", | |
" <td>geerlingguy</td>\n", | |
" <td>2015-09-18 12:58:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>283609</th>\n", | |
" <td>10243142</td>\n", | |
" <td>$25K in book sales, and I'm almost ready to pu...</td>\n", | |
" <td>https://servercheck.in/blog/25k-book-sales-and...</td>\n", | |
" <td>2</td>\n", | |
" <td>0</td>\n", | |
" <td>geerlingguy</td>\n", | |
" <td>2015-09-19 03:07:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>169561</th>\n", | |
" <td>11119379</td>\n", | |
" <td>$5M AI Xprize from IBM and TED</td>\n", | |
" <td>http://ai.xprize.com</td>\n", | |
" <td>2</td>\n", | |
" <td>0</td>\n", | |
" <td>dlg</td>\n", | |
" <td>2016-02-17 17:16:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>169534</th>\n", | |
" <td>11119601</td>\n", | |
" <td>$5M AI Xprize from IBM and TED</td>\n", | |
" <td>http://AI.xprize.org</td>\n", | |
" <td>4</td>\n", | |
" <td>0</td>\n", | |
" <td>dlg</td>\n", | |
" <td>2016-02-17 17:41:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>258984</th>\n", | |
" <td>10424421</td>\n", | |
" <td>'10-second' hack jogs Fitbits into malware-spr...</td>\n", | |
" <td>http://www.theregister.co.uk/2015/10/21/fitbit...</td>\n", | |
" <td>3</td>\n", | |
" <td>0</td>\n", | |
" <td>ColinWright</td>\n", | |
" <td>2015-10-21 09:31:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>258064</th>\n", | |
" <td>10430712</td>\n", | |
" <td>'10-second' hack jogs Fitbits into malware-spr...</td>\n", | |
" <td>http://www.theregister.co.uk/2015/10/21/fitbit...</td>\n", | |
" <td>2</td>\n", | |
" <td>0</td>\n", | |
" <td>kameit00</td>\n", | |
" <td>2015-10-22 06:32:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>256200</th>\n", | |
" <td>10442722</td>\n", | |
" <td>'10-second' hack jogs Fitbits into malware-spr...</td>\n", | |
" <td>http://www.theregister.co.uk/2015/10/21/fitbit...</td>\n", | |
" <td>13</td>\n", | |
" <td>1</td>\n", | |
" <td>kameit00</td>\n", | |
" <td>2015-10-24 06:50:00</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>101076</th>\n", | |
" <td>11689447</td>\n", | |
" <td>'Android VR' confirmed by Google developer site</td>\n", | |
" <td>http://www.engadget.com/2016/05/13/android-vr-...</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>T-A</td>\n", | |
" <td>2016-05-13 09:38:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>99277</th>\n", | |
" <td>11705217</td>\n", | |
" <td>'Android VR' confirmed by Google developer site</td>\n", | |
" <td>http://www.engadget.com/2016/05/13/android-vr-...</td>\n", | |
" <td>2</td>\n", | |
" <td>0</td>\n", | |
" <td>stesch</td>\n", | |
" <td>2016-05-16 09:28:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>107157</th>\n", | |
" <td>11636704</td>\n", | |
" <td>'Bitcoin creator': I do not have the courage</td>\n", | |
" <td>http://www.bbc.com/news/technology-36213588</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>csomar</td>\n", | |
" <td>2016-05-05 15:03:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>107012</th>\n", | |
" <td>11637973</td>\n", | |
" <td>'Bitcoin creator': I do not have the courage</td>\n", | |
" <td>http://www.bbc.co.uk/news/technology-36213588</td>\n", | |
" <td>3</td>\n", | |
" <td>2</td>\n", | |
" <td>kartikkumar</td>\n", | |
" <td>2016-05-05 17:16:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>...</th>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>209920</th>\n", | |
" <td>10796874</td>\n", | |
" <td>iOS App Icon Colors in the Year 2015</td>\n", | |
" <td>https://growthbug.com/ios-app-icon-colors-in-t...</td>\n", | |
" <td>2</td>\n", | |
" <td>0</td>\n", | |
" <td>hboon</td>\n", | |
" <td>2015-12-27 08:45:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>287941</th>\n", | |
" <td>10213500</td>\n", | |
" <td>iOS App Reverse Engineering</td>\n", | |
" <td>https://github.com/iosre/iOSAppReverseEngineering</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>sanjeetsuhag</td>\n", | |
" <td>2015-09-14 03:03:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>287875</th>\n", | |
" <td>10213883</td>\n", | |
" <td>iOS App Reverse Engineering</td>\n", | |
" <td>https://github.com/iosre/iOSAppReverseEngineering</td>\n", | |
" <td>131</td>\n", | |
" <td>8</td>\n", | |
" <td>snakeninny</td>\n", | |
" <td>2015-09-14 06:51:00</td>\n", | |
" <" | |
], | |
"text/plain": [ | |
" id title \\\n", | |
"269178 10349749 #FFFFFF Diversity \n", | |
"268786 10352453 #FFFFFF Diversity \n", | |
"265780 10377044 #FFFFFF Diversity \n", | |
"263787 10390786 #FFFFFF Diversity \n", | |
"262759 10397555 #FFFFFF Diversity \n", | |
"290704 10193864 #NAME? \n", | |
"197434 10902030 #NAME? \n", | |
"267243 10364765 #Node.js: A quick optimization advice \n", | |
"262904 10396644 #Node.js: A quick optimization advice \n", | |
"171079 11108462 #Node.js: A quick optimization advice \n", | |
"283724 10242238 $1 Unistroke Recognizer \n", | |
"283238 10245928 $1 Unistroke Recognizer \n", | |
"117539 11548239 $10 router blamed in Bangladesh bank hack \n", | |
"117527 11548318 $10 router blamed in Bangladesh bank hack \n", | |
"117137 11552182 $10 router blamed in Bangladesh bank hack \n", | |
"268893 10351737 $2.200 to win for the best JavaScript coder \n", | |
"264458 10386096 $2.200 to win for the best JavaScript coder \n", | |
"181087 11027159 $250k of DigitalOcean credits for YC startups \n", | |
"180615 11031039 $250k of DigitalOcean credits for YC startups \n", | |
"284165 10238884 $25K in book sales, and I'm almost ready to pu... \n", | |
"283609 10243142 $25K in book sales, and I'm almost ready to pu... \n", | |
"169561 11119379 $5M AI Xprize from IBM and TED \n", | |
"169534 11119601 $5M AI Xprize from IBM and TED \n", | |
"258984 10424421 '10-second' hack jogs Fitbits into malware-spr... \n", | |
"258064 10430712 '10-second' hack jogs Fitbits into malware-spr... \n", | |
"256200 10442722 '10-second' hack jogs Fitbits into malware-spr... \n", | |
"101076 11689447 'Android VR' confirmed by Google developer site \n", | |
"99277 11705217 'Android VR' confirmed by Google developer site \n", | |
"107157 11636704 'Bitcoin creator': I do not have the courage \n", | |
"107012 11637973 'Bitcoin creator': I do not have the courage \n", | |
"... ... ... \n", | |
"209920 10796874 iOS App Icon Colors in the Year 2015 \n", | |
"287941 10213500 iOS App Reverse Engineering \n", | |
"287875 10213883 iOS App Reverse Engineering \n", | |
"280002 10270665 iOS at Facebook [pdf] \n", | |
"279850 10271551 iOS at Facebook [pdf] \n", | |
"279093 10277101 iOS at Facebook [pdf] \n", | |
"233068 10615487 iPad Pro: Wrong Questions \n", | |
"232050 10622604 iPad Pro: Wrong Questions \n", | |
"222653 10695695 iPhone 6s Smart Battery Case \n", | |
"222156 10698723 iPhone 6s Smart Battery Case \n", | |
"222076 10699378 iPhone 6s Smart Battery Case \n", | |
"168484 11128085 iPhone Safari Remote Crash \n", | |
"168369 11129023 iPhone Safari Remote Crash \n", | |
"179107 11043643 iPhones 'disabled' if Apple detects third-part... \n", | |
"178745 11047004 iPhones 'disabled' if Apple detects third-part... \n", | |
"178691 11047359 iPhones 'disabled' if Apple detects third-part... \n", | |
"174730 11079435 iPhones 'disabled' if Apple detects third-part... \n", | |
"187883 10975995 iPhones dont matter anymore \n", | |
"187285 10980318 iPhones dont matter anymore \n", | |
"284110 10239189 iStats CLI: OS X Hardware Stats from the Comma... \n", | |
"164624 11159600 iStats CLI: OS X Hardware Stats from the Comma... \n", | |
"248489 10499700 iTunes Terms and Conditions: The Graphic Novel \n", | |
"247805 10504326 iTunes Terms and Conditions: The Graphic Novel \n", | |
"149413 11280999 sweep Scanning LiDAR \n", | |
"147063 11298674 sweep Scanning LiDAR \n", | |
"269791 10345746 £1984: does a cashless economy make for a sur... \n", | |
"267028 10366343 £1984: does a cashless economy make for a sur... \n", | |
"227765 10656242 £1984: does a cashless economy make for a sur... \n", | |
"256563 10440403 Ã\n", | |
"zone Futures Market \n", | |
"253478 10461573 Ã\n", | |
"zone Futures Market \n", | |
"\n", | |
" url num_points \\\n", | |
"269178 https://medium.com/this-is-hard/ffffff-diversi... 1 \n", | |
"268786 https://medium.com/this-is-hard/ffffff-diversi... 1 \n", | |
"265780 https://medium.com/this-is-hard/ffffff-diversi... 14 \n", | |
"263787 https://medium.com/this-is-hard/ffffff-diversi... 1 \n", | |
"262759 https://medium.com/this-is-hard/ffffff-diversi... 56 \n", | |
"290704 http://%22+https://datanice.wordpress.com/2015... 2 \n", | |
"197434 https://medium.com/app-a-day/headstrong-a9f4eb... 1 \n", | |
"267243 https://medium.com/@c2c/nodejs-a-quick-optimiz... 5 \n", | |
"262904 https://top.fse.guru/nodejs-a-quick-optimizati... 1 \n", | |
"171079 https://top.fse.guru/nodejs-a-quick-optimizati... 2 \n", | |
"283724 https://depts.washington.edu/aimgroup/proj/dol... 1 \n", | |
"283238 https://depts.washington.edu/aimgroup/proj/dol... 138 \n", | |
"117539 http://www.bbc.co.uk/news/technology-36110421 2 \n", | |
"117527 http://www.bbc.com/news/technology-36110421 2 \n", | |
"117137 http://www.bbc.co.uk/news/technology-36110421 43 \n", | |
"268893 http://ibm-bluemix.coderpower.com/#/?utm_sourc... 1 \n", | |
"264458 http://ibm-bluemix.coderpower.com/#/?utm_sourc... 1 \n", | |
"181087 http://blog.ycombinator.com/$250k-of-digitaloc... 12 \n", | |
"180615 http://blog.ycombinator.com/$250k-of-digitaloc... 262 \n", | |
"284165 https://servercheck.in/blog/25k-book-sales-and... 3 \n", | |
"283609 https://servercheck.in/blog/25k-book-sales-and... 2 \n", | |
"169561 http://ai.xprize.com 2 \n", | |
"169534 http://AI.xprize.org 4 \n", | |
"258984 http://www.theregister.co.uk/2015/10/21/fitbit... 3 \n", | |
"258064 http://www.theregister.co.uk/2015/10/21/fitbit... 2 \n", | |
"256200 http://www.theregister.co.uk/2015/10/21/fitbit... 13 \n", | |
"101076 http://www.engadget.com/2016/05/13/android-vr-... 1 \n", | |
"99277 http://www.engadget.com/2016/05/13/android-vr-... 2 \n", | |
"107157 http://www.bbc.com/news/technology-36213588 1 \n", | |
"107012 http://www.bbc.co.uk/news/technology-36213588 3 \n", | |
"... ... ... \n", | |
"209920 https://growthbug.com/ios-app-icon-colors-in-t... 2 \n", | |
"287941 https://github.com/iosre/iOSAppReverseEngineering 1 \n", | |
"287875 https://github.com/iosre/iOSAppReverseEngineering 131 \n", | |
"280002 https://static1.squarespace.com/static/5463eca... 2 \n", | |
"279850 https://chris-price-b2rp.squarespace.com/s/Sim... 2 \n", | |
"279093 https://static1.squarespace.com/static/5463eca... 160 \n", | |
"233068 http://www.mondaynote.com/2015/11/23/ipad-pro-... 3 \n", | |
"232050 http://www.mondaynote.com/2015/11/23/ipad-pro-... 1 \n", | |
"222653 http://www.apple.com/shop/product/MGQM2LL/A/ip... 6 \n", | |
"222156 http://www.apple.com/shop/product/MGQM2LL/A/ip... 2 \n", | |
"222076 http://www.apple.com/shop/product/MGQL2LL/A/ip... 1 \n", | |
"168484 https://medium.com/@s3yfullah/iphone-safari-re... 1 \n", | |
"168369 https://medium.com/@s3yfullah/iphone-safari-re... 2 \n", | |
"179107 http://www.bbc.co.uk/news/technology-35502030 9 \n", | |
"178745 http://www.bbc.co.uk/news/technology-35502030 12 \n", | |
"178691 http://www.theguardian.com/money/2016/feb/05/e... 438 \n", | |
"174730 http://www.bbc.co.uk/news/technology-35502030 2 \n", | |
"187883 http://www.computerworld.com/article/3026186/a... 2 \n", | |
"187285 http://www.computerworld.com/article/3026186/a... 1 \n", | |
"284110 https://github.com/Chris911/iStats 1 \n", | |
"164624 https://github.com/Chris911/iStats 1 \n", | |
"248489 http://itunestandc.tumblr.com/ 1 \n", | |
"247805 http://itunestandc.tumblr.com/ 1 \n", | |
"149413 https://www.kickstarter.com/projects/scanse/sw... 10 \n", | |
"147063 https://www.kickstarter.com/projects/scanse/sw... 1 \n", | |
"269791 http://www.theguardian.com/sustainable-busines... 6 \n", | |
"267028 http://www.theguardian.com/sustainable-busines... 5 \n", | |
"227765 http://www.theguardian.com/sustainable-busines... 17 \n", | |
"256563 http://azone.guggenheim.org/ 1 \n", | |
"253478 http://azone.guggenheim.org/ 1 \n", | |
"\n", | |
" num_comments author created_at popular \n", | |
"269178 0 Amorymeltzer 2015-10-07 23:01:00 0 \n", | |
"268786 0 Archio 2015-10-08 13:03:00 0 \n", | |
"265780 0 mrstorm 2015-10-12 21:26:00 1 \n", | |
"263787 0 doppp 2015-10-15 01:31:00 0 \n", | |
"262759 91 Amorymeltzer 2015-10-16 05:02:00 1 \n", | |
"290704 0 hassenc 2015-09-09 19:39:00 0 \n", | |
"197434 0 thindjinn 2016-01-14 15:43:00 0 \n", | |
"267243 0 ot 2015-10-10 07:00:00 0 \n", | |
"262904 0 atriix 2015-10-15 23:31:00 0 \n", | |
"171079 0 kiyanwang 2016-02-16 07:56:00 0 \n", | |
"283724 0 tambourine_man 2015-09-18 21:45:00 0 \n", | |
"283238 41 tambourine_man 2015-09-19 23:30:00 1 \n", | |
"117539 0 0xbadf00d 2016-04-22 10:53:00 0 \n", | |
"117527 1 ghosh 2016-04-22 11:19:00 0 \n", | |
"117137 9 andygambles 2016-04-22 20" | |
] | |
}, | |
"execution_count": 21, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"train[train.duplicated(['title'], keep=False)].sort_values(['title', 'created_at'])" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 22, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>id</th>\n", | |
" <th>title</th>\n", | |
" <th>url</th>\n", | |
" <th>num_points</th>\n", | |
" <th>num_comments</th>\n", | |
" <th>author</th>\n", | |
" <th>created_at</th>\n", | |
" <th>popular</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>268786</th>\n", | |
" <td>10352453</td>\n", | |
" <td>#FFFFFF Diversity</td>\n", | |
" <td>https://medium.com/this-is-hard/ffffff-diversi...</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>Archio</td>\n", | |
" <td>2015-10-08 13:03:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>265780</th>\n", | |
" <td>10377044</td>\n", | |
" <td>#FFFFFF Diversity</td>\n", | |
" <td>https://medium.com/this-is-hard/ffffff-diversi...</td>\n", | |
" <td>14</td>\n", | |
" <td>0</td>\n", | |
" <td>mrstorm</td>\n", | |
" <td>2015-10-12 21:26:00</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>263787</th>\n", | |
" <td>10390786</td>\n", | |
" <td>#FFFFFF Diversity</td>\n", | |
" <td>https://medium.com/this-is-hard/ffffff-diversi...</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>doppp</td>\n", | |
" <td>2015-10-15 01:31:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>262759</th>\n", | |
" <td>10397555</td>\n", | |
" <td>#FFFFFF Diversity</td>\n", | |
" <td>https://medium.com/this-is-hard/ffffff-diversi...</td>\n", | |
" <td>56</td>\n", | |
" <td>91</td>\n", | |
" <td>Amorymeltzer</td>\n", | |
" <td>2015-10-16 05:02:00</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>197434</th>\n", | |
" <td>10902030</td>\n", | |
" <td>#NAME?</td>\n", | |
" <td>https://medium.com/app-a-day/headstrong-a9f4eb...</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>thindjinn</td>\n", | |
" <td>2016-01-14 15:43:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>262904</th>\n", | |
" <td>10396644</td>\n", | |
" <td>#Node.js: A quick optimization advice</td>\n", | |
" <td>https://top.fse.guru/nodejs-a-quick-optimizati...</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>atriix</td>\n", | |
" <td>2015-10-15 23:31:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>171079</th>\n", | |
" <td>11108462</td>\n", | |
" <td>#Node.js: A quick optimization advice</td>\n", | |
" <td>https://top.fse.guru/nodejs-a-quick-optimizati...</td>\n", | |
" <td>2</td>\n", | |
" <td>0</td>\n", | |
" <td>kiyanwang</td>\n", | |
" <td>2016-02-16 07:56:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>283238</th>\n", | |
" <td>10245928</td>\n", | |
" <td>$1 Unistroke Recognizer</td>\n", | |
" <td>https://depts.washington.edu/aimgroup/proj/dol...</td>\n", | |
" <td>138</td>\n", | |
" <td>41</td>\n", | |
" <td>tambourine_man</td>\n", | |
" <td>2015-09-19 23:30:00</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>117527</th>\n", | |
" <td>11548318</td>\n", | |
" <td>$10 router blamed in Bangladesh bank hack</td>\n", | |
" <td>http://www.bbc.com/news/technology-36110421</td>\n", | |
" <td>2</td>\n", | |
" <td>1</td>\n", | |
" <td>ghosh</td>\n", | |
" <td>2016-04-22 11:19:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>117137</th>\n", | |
" <td>11552182</td>\n", | |
" <td>$10 router blamed in Bangladesh bank hack</td>\n", | |
" <td>http://www.bbc.co.uk/news/technology-36110421</td>\n", | |
" <td>43</td>\n", | |
" <td>9</td>\n", | |
" <td>andygambles</td>\n", | |
" <td>2016-04-22 20:00:00</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>264458</th>\n", | |
" <td>10386096</td>\n", | |
" <td>$2.200 to win for the best JavaScript coder</td>\n", | |
" <td>http://ibm-bluemix.coderpower.com/#/?utm_sourc...</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>galina</td>\n", | |
" <td>2015-10-14 12:03:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>180615</th>\n", | |
" <td>11031039</td>\n", | |
" <td>$250k of DigitalOcean credits for YC startups</td>\n", | |
" <td>http://blog.ycombinator.com/$250k-of-digitaloc...</td>\n", | |
" <td>262</td>\n", | |
" <td>215</td>\n", | |
" <td>lacorp</td>\n", | |
" <td>2016-02-04 00:19:00</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>283609</th>\n", | |
" <td>10243142</td>\n", | |
" <td>$25K in book sales, and I'm almost ready to pu...</td>\n", | |
" <td>https://servercheck.in/blog/25k-book-sales-and...</td>\n", | |
" <td>2</td>\n", | |
" <td>0</td>\n", | |
" <td>geerlingguy</td>\n", | |
" <td>2015-09-19 03:07:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>169534</th>\n", | |
" <td>11119601</td>\n", | |
" <td>$5M AI Xprize from IBM and TED</td>\n", | |
" <td>http://AI.xprize.org</td>\n", | |
" <td>4</td>\n", | |
" <td>0</td>\n", | |
" <td>dlg</td>\n", | |
" <td>2016-02-17 17:41:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>258064</th>\n", | |
" <td>10430712</td>\n", | |
" <td>'10-second' hack jogs Fitbits into malware-spr...</td>\n", | |
" <td>http://www.theregister.co.uk/2015/10/21/fitbit...</td>\n", | |
" <td>2</td>\n", | |
" <td>0</td>\n", | |
" <td>kameit00</td>\n", | |
" <td>2015-10-22 06:32:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>256200</th>\n", | |
" <td>10442722</td>\n", | |
" <td>'10-second' hack jogs Fitbits into malware-spr...</td>\n", | |
" <td>http://www.theregister.co.uk/2015/10/21/fitbit...</td>\n", | |
" <td>13</td>\n", | |
" <td>1</td>\n", | |
" <td>kameit00</td>\n", | |
" <td>2015-10-24 06:50:00</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>99277</th>\n", | |
" <td>11705217</td>\n", | |
" <td>'Android VR' confirmed by Google developer site</td>\n", | |
" <td>http://www.engadget.com/2016/05/13/android-vr-...</td>\n", | |
" <td>2</td>\n", | |
" <td>0</td>\n", | |
" <td>stesch</td>\n", | |
" <td>2016-05-16 09:28:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>107012</th>\n", | |
" <td>11637973</td>\n", | |
" <td>'Bitcoin creator': I do not have the courage</td>\n", | |
" <td>http://www.bbc.co.uk/news/technology-36213588</td>\n", | |
" <td>3</td>\n", | |
" <td>2</td>\n", | |
" <td>kartikkumar</td>\n", | |
" <td>2016-05-05 17:16:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>263186</th>\n", | |
" <td>10394555</td>\n", | |
" <td>'Bizarre' star may suggest existence of 'alien...</td>\n", | |
" <td>http://www.bbc.co.uk/newsbeat/articles/34540449</td>\n", | |
" <td>8</td>\n", | |
" <td>0</td>\n", | |
" <td>rikkipitt</td>\n", | |
" <td>2015-10-15 17:30:00</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>150814</th>\n", | |
" <td>11266850</td>\n", | |
" <td>'Body Hacking' Movement Rises Ahead of Moral A...</td>\n", | |
" <td>http://www.npr.org/sections/alltechconsidered/...</td>\n", | |
" <td>2</td>\n", | |
" <td>0</td>\n", | |
" <td>prostoalex</td>\n", | |
" <td>2016-03-11 14:43:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>129437</th>\n", | |
" <td>11449495</td>\n", | |
" <td>'Devastating' bug pops secure doors at airport...</td>\n", | |
" <td>http://www.theregister.co.uk/2016/04/04/devast...</td>\n", | |
" <td>3</td>\n", | |
" <td>3</td>\n", | |
" <td>wglb</td>\n", | |
" <td>2016-04-07 18:36:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>286242</th>\n", | |
" <td>10223458</td>\n", | |
" <td>'Dislike' button coming to Facebook</td>\n", | |
" <td>http://www.bbc.com/news/technology-34264624</td>\n", | |
" <td>2</td>\n", | |
" <td>0</td>\n", | |
" <td>inm</td>\n", | |
" <td>2015-09-15 22:04:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>286236</th>\n", | |
" <td>10223480</td>\n", | |
" <td>'Dislike' button coming to Facebook</td>\n", | |
" <td>http://www.bbc.com/news/technology-34264624</td>\n", | |
" <td>16</td>\n", | |
" <td>14</td>\n", | |
" <td>elie_CH</td>\n", | |
" <td>2015-09-15 22:11:00</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>192285</th>\n", | |
" <td>10943909</td>\n", | |
" <td>'Extinct' tree frog rediscovered in India afte...</td>\n", | |
" <td>http://www.bbc.com/news/world-asia-india-35368...</td>\n", | |
" <td>2</td>\n", | |
" <td>0</td>\n", | |
" <td>ghosh</td>\n", | |
" <td>2016-01-21 07:38:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>268826</th>\n", | |
" <td>10352189</td>\n", | |
" <td>'Extreme poverty' to fall below 10% of world p...</td>\n", | |
" <td>http://www.theguardian.com/society/2015/oct/05...</td>\n", | |
" <td>283</td>\n", | |
" <td>240</td>\n", | |
" <td>hliyan</td>\n", | |
" <td>2015-10-08 12:19:00</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>169801</th>\n", | |
" <td>11117607</td>\n", | |
" <td>'Five-dimensional' glass discs can store data ...</td>\n", | |
" <td>http://www.theverge.com/2016/2/16/11018018/5d-...</td>\n", | |
" <td>2</td>\n", | |
" <td>0</td>\n", | |
" <td>jonbaer</td>\n", | |
" <td>2016-02-17 13:17:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>166999</th>\n", | |
" <td>11140484</td>\n", | |
" <td>'Five-dimensional' glass discs can store data ...</td>\n", | |
" <td>http://www.theverge.com/2016/2/16/11018018/5d-...</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>raphar</td>\n", | |
" <td>2016-02-20 15:45:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>219189</th>\n", | |
" <td>10721633</td>\n", | |
" <td>'Get in the Van' and Other Tips for Getting Me...</td>\n", | |
" <td>http://firstround.com/review/the-power-of-inte...</td>\n", | |
" <td>2</td>\n", | |
" <td>0</td>\n", | |
" <td>kareemm</td>\n", | |
" <td>2015-12-12 03:08:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>163047</th>\n", | |
" <td>11171530</td>\n", | |
" <td>'Ghost Protest' in Seoul Uses Holograms, Not P...</td>\n", | |
" <td>http://www.npr.org/sections/parallels/2016/02/...</td>\n", | |
" <td>2</td>\n", | |
" <td>0</td>\n", | |
" <td>Doolwind</td>\n", | |
" <td>2016-02-25 00:42:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>211207</th>\n", | |
" <td>10785950</td>\n", | |
" <td>'Hateful Eight' Pirated Screener Traced Back t...</td>\n", | |
" <td>http://www.hollywoodreporter.com/news/hateful-...</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>nkurz</td>\n", | |
" <td>2015-12-23 22:01:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>...</th>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>109601</th>\n", | |
" <td>11615726</td>\n", | |
" <td>[openssl-announce] Forthcoming OpenSSL releases</td>\n", | |
" <td>https://mta.openssl.org/pipermail/openssl-anno...</td>\n", | |
" <td>2</td>\n", | |
" <td>0</td>\n", | |
" <td>deweerdt</td>\n", | |
" <td>2016-05-02 21:18:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>196189</th>\n", | |
" <td>10912879</td>\n", | |
" <td>iAd App Network Will Be Discontinued</td>\n", | |
" <td>https://developer.apple.com/news/?id=01152016a</td>\n", | |
" <td>2</td>\n", | |
" <td>0</td>\n", | |
" <td>zdw</td>\n", | |
" <td>2016-01-15 22:52:00</td>\n", | |
" <td>0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>195967</th>\n", | |
" <td>10914803</td>\n", | |
" <td>iAd App Network Will Be Discontinued</td>\n", | |
" <t" | |
], | |
"text/plain": [ | |
" id title \\\n", | |
"268786 10352453 #FFFFFF Diversity \n", | |
"265780 10377044 #FFFFFF Diversity \n", | |
"263787 10390786 #FFFFFF Diversity \n", | |
"262759 10397555 #FFFFFF Diversity \n", | |
"197434 10902030 #NAME? \n", | |
"262904 10396644 #Node.js: A quick optimization advice \n", | |
"171079 11108462 #Node.js: A quick optimization advice \n", | |
"283238 10245928 $1 Unistroke Recognizer \n", | |
"117527 11548318 $10 router blamed in Bangladesh bank hack \n", | |
"117137 11552182 $10 router blamed in Bangladesh bank hack \n", | |
"264458 10386096 $2.200 to win for the best JavaScript coder \n", | |
"180615 11031039 $250k of DigitalOcean credits for YC startups \n", | |
"283609 10243142 $25K in book sales, and I'm almost ready to pu... \n", | |
"169534 11119601 $5M AI Xprize from IBM and TED \n", | |
"258064 10430712 '10-second' hack jogs Fitbits into malware-spr... \n", | |
"256200 10442722 '10-second' hack jogs Fitbits into malware-spr... \n", | |
"99277 11705217 'Android VR' confirmed by Google developer site \n", | |
"107012 11637973 'Bitcoin creator': I do not have the courage \n", | |
"263186 10394555 'Bizarre' star may suggest existence of 'alien... \n", | |
"150814 11266850 'Body Hacking' Movement Rises Ahead of Moral A... \n", | |
"129437 11449495 'Devastating' bug pops secure doors at airport... \n", | |
"286242 10223458 'Dislike' button coming to Facebook \n", | |
"286236 10223480 'Dislike' button coming to Facebook \n", | |
"192285 10943909 'Extinct' tree frog rediscovered in India afte... \n", | |
"268826 10352189 'Extreme poverty' to fall below 10% of world p... \n", | |
"169801 11117607 'Five-dimensional' glass discs can store data ... \n", | |
"166999 11140484 'Five-dimensional' glass discs can store data ... \n", | |
"219189 10721633 'Get in the Van' and Other Tips for Getting Me... \n", | |
"163047 11171530 'Ghost Protest' in Seoul Uses Holograms, Not P... \n", | |
"211207 10785950 'Hateful Eight' Pirated Screener Traced Back t... \n", | |
"... ... ... \n", | |
"109601 11615726 [openssl-announce] Forthcoming OpenSSL releases \n", | |
"196189 10912879 iAd App Network Will Be Discontinued \n", | |
"195967 10914803 iAd App Network Will Be Discontinued \n", | |
"194469 10926748 iAd App Network Will Be Discontinued \n", | |
"173308 11089640 iAd App Network Will Be Discontinued \n", | |
"275122 10306159 iFixit App Pulled from Apples Store \n", | |
"274471 10310842 iKe: Browser-based k-family language IDE \n", | |
"264626 10385112 iMac: Then and Now \n", | |
"77271 11898921 iOS 10 Human Interface Guidelines \n", | |
"265610 10378667 iOS 9 GUI (iPhone) \n", | |
"284501 10236456 iOS 9 adblocker apps shoot to top of charts on... \n", | |
"217826 10732672 iOS 9.2 Update: The Fall of URI Schemes and th... \n", | |
"209920 10796874 iOS App Icon Colors in the Year 2015 \n", | |
"287875 10213883 iOS App Reverse Engineering \n", | |
"279850 10271551 iOS at Facebook [pdf] \n", | |
"279093 10277101 iOS at Facebook [pdf] \n", | |
"232050 10622604 iPad Pro: Wrong Questions \n", | |
"222156 10698723 iPhone 6s Smart Battery Case \n", | |
"222076 10699378 iPhone 6s Smart Battery Case \n", | |
"168369 11129023 iPhone Safari Remote Crash \n", | |
"178745 11047004 iPhones 'disabled' if Apple detects third-part... \n", | |
"178691 11047359 iPhones 'disabled' if Apple detects third-part... \n", | |
"174730 11079435 iPhones 'disabled' if Apple detects third-part... \n", | |
"187285 10980318 iPhones dont matter anymore \n", | |
"164624 11159600 iStats CLI: OS X Hardware Stats from the Comma... \n", | |
"247805 10504326 iTunes Terms and Conditions: The Graphic Novel \n", | |
"147063 11298674 sweep Scanning LiDAR \n", | |
"267028 10366343 £1984: does a cashless economy make for a sur... \n", | |
"227765 10656242 £1984: does a cashless economy make for a sur... \n", | |
"253478 10461573 Ã\n", | |
"zone Futures Market \n", | |
"\n", | |
" url num_points \\\n", | |
"268786 https://medium.com/this-is-hard/ffffff-diversi... 1 \n", | |
"265780 https://medium.com/this-is-hard/ffffff-diversi... 14 \n", | |
"263787 https://medium.com/this-is-hard/ffffff-diversi... 1 \n", | |
"262759 https://medium.com/this-is-hard/ffffff-diversi... 56 \n", | |
"197434 https://medium.com/app-a-day/headstrong-a9f4eb... 1 \n", | |
"262904 https://top.fse.guru/nodejs-a-quick-optimizati... 1 \n", | |
"171079 https://top.fse.guru/nodejs-a-quick-optimizati... 2 \n", | |
"283238 https://depts.washington.edu/aimgroup/proj/dol... 138 \n", | |
"117527 http://www.bbc.com/news/technology-36110421 2 \n", | |
"117137 http://www.bbc.co.uk/news/technology-36110421 43 \n", | |
"264458 http://ibm-bluemix.coderpower.com/#/?utm_sourc... 1 \n", | |
"180615 http://blog.ycombinator.com/$250k-of-digitaloc... 262 \n", | |
"283609 https://servercheck.in/blog/25k-book-sales-and... 2 \n", | |
"169534 http://AI.xprize.org 4 \n", | |
"258064 http://www.theregister.co.uk/2015/10/21/fitbit... 2 \n", | |
"256200 http://www.theregister.co.uk/2015/10/21/fitbit... 13 \n", | |
"99277 http://www.engadget.com/2016/05/13/android-vr-... 2 \n", | |
"107012 http://www.bbc.co.uk/news/technology-36213588 3 \n", | |
"263186 http://www.bbc.co.uk/newsbeat/articles/34540449 8 \n", | |
"150814 http://www.npr.org/sections/alltechconsidered/... 2 \n", | |
"129437 http://www.theregister.co.uk/2016/04/04/devast... 3 \n", | |
"286242 http://www.bbc.com/news/technology-34264624 2 \n", | |
"286236 http://www.bbc.com/news/technology-34264624 16 \n", | |
"192285 http://www.bbc.com/news/world-asia-india-35368... 2 \n", | |
"268826 http://www.theguardian.com/society/2015/oct/05... 283 \n", | |
"169801 http://www.theverge.com/2016/2/16/11018018/5d-... 2 \n", | |
"166999 http://www.theverge.com/2016/2/16/11018018/5d-... 1 \n", | |
"219189 http://firstround.com/review/the-power-of-inte... 2 \n", | |
"163047 http://www.npr.org/sections/parallels/2016/02/... 2 \n", | |
"211207 http://www.hollywoodreporter.com/news/hateful-... 1 \n", | |
"... ... ... \n", | |
"109601 https://mta.openssl.org/pipermail/openssl-anno... 2 \n", | |
"196189 https://developer.apple.com/news/?id=01152016a 2 \n", | |
"195967 https://developer.apple.com/news/?id=01152016a 1 \n", | |
"194469 https://developer.apple.com/news/?id=01152016a 4 \n", | |
"173308 https://developer.apple.com/news/?id=01152016a 1 \n", | |
"275122 http://ifixit.org/blog/7401/ifixit-app-pulled/ 211 \n", | |
"274471 http://johnearnest.github.io/ok/ike/ike.html?g... 33 \n", | |
"264626 http://www.apple.com/imac/then-and-now/?utm_so... 2 \n", | |
"77271 https://developer.apple.com/ios/human-interfac... 2 \n", | |
"265610 http://facebook.github.io/design/ios9.html 3 \n", | |
"284501 http://www.theguardian.com/technology/2015/sep... 3 \n", | |
"217826 https://blog.branch.io/ios-9.2-redirection-upd... 1 \n", | |
"209920 https://growthbug.com/ios-app-icon-colors-in-t... 2 \n", | |
"287875 https://github.com/iosre/iOSAppReverseEngineering 131 \n", | |
"279850 https://chris-price-b2rp.squarespace.com/s/Sim... 2 \n", | |
"279093 https://static1.squarespace.com/static/5463eca... 160 \n", | |
"232050 http://www.mondaynote.com/2015/11/23/ipad-pro-... 1 \n", | |
"222156 http://www.apple.com/shop/product/MGQM2LL/A/ip... 2 \n", | |
"222076 http://www.apple.com/shop/product/MGQL2LL/A/ip... 1 \n", | |
"168369 https://medium.com/@s3yfullah/iphone-safari-re... 2 \n", | |
"178745 http://www.bbc.co.uk/news/technology-35502030 12 \n", | |
"178691 http://www.theguardian.com/money/2016/feb/05/e... 438 \n", | |
"174730 http://www.bbc.co.uk/news/technology-35502030 2 \n", | |
"187285 http://www.computerworld.com/article/3026186/a... 1 \n", | |
"164624 https://github.com/Chris911/iStats 1 \n", | |
"247805 http://itunestandc.tumblr.com/ 1 \n", | |
"147063 https://www.kickstarter.com/projects/scanse/sw... 1 \n", | |
"267028 http://www.theguardian.com/sustainable-busines... 5 \n", | |
"227765 http://www.theguardian.com/sustainable-busines... 17 \n", | |
"253478 http://azone.guggenheim.org/ 1 \n", | |
"\n", | |
" num_comments author created_at popular \n", | |
"268786 0 Archio 2015-10-08 13:03:00 0 \n", | |
"265780 0 mrstorm 2015-10-12 21:26:00 1 \n", | |
"263787 0 doppp 2015-10-15 01:31:00 0 \n", | |
"262759 91 Amorymeltzer 2015-10-16 05:02:00 1 \n", | |
"197434 0 thindjinn 2016-01-14 15:43:00 0 \n", | |
"262904 0 atriix 2015-10-15 23:31:00 0 \n", | |
"171079 0 kiyanwang 2016-02-16 07:56:00 0 \n", | |
"283238 41 tambourine_man 2015-09-19 23:30:00 1 \n", | |
"117527 1 ghosh 2016-04-22 11:19:00 0 \n", | |
"117137 9 andygambles 2016-04-22 20:00:00 1 \n", | |
"264458 0 galina 2015-10-14 12:03:00 0 \n", | |
"180615 215 lacorp 2016-02-04 00:19:00 1 \n", | |
"283609 0 geerlingguy 2015-09-19 03:07:00 0 \n", | |
"169534 0 dlg 2016-02-17 17:41:00 0 \n", | |
"258064 0 kameit00 2015-10-22 06" | |
] | |
}, | |
"execution_count": 22, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"train[train.duplicated(['title'], keep='last')].sort_values(['title', 'created_at'])" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Task 4: Feature engineering\n", | |
"\n", | |
"Create new features in **train** that you think might be relevant to predicting the response, **popular**. After creating each feature, check whether it is likely to be a useful feature.\n", | |
"\n", | |
"For this task, don't use **`CountVectorizer`**.\n", | |
"\n", | |
"**Note:** Think very carefully about which features you would be \"allowed\" to use in the real world. If a feature incorporates future data that would not be available **at the time of post submission**, then it can't be used in your model." | |
] | |
}, | |
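{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"A minimal sketch of a feature that respects the note above, assuming posts are sorted by **created_at**: count only the posts an author made *before* the current one. The toy DataFrame and the column name `prior_posts` are illustrative assumptions, not part of the assignment.\n", | |
"\n", | |
"```python\n", | |
"import pandas as pd\n", | |
"\n", | |
"# Toy data standing in for train (values are made up for illustration)\n", | |
"toy = pd.DataFrame({\n", | |
"    'author': ['a', 'b', 'a', 'a', 'b'],\n", | |
"    'created_at': pd.to_datetime(['2016-01-01', '2016-01-02', '2016-01-03',\n", | |
"                                  '2016-01-04', '2016-01-05']),\n", | |
"    'popular': [0, 1, 0, 1, 0],\n", | |
"})\n", | |
"toy = toy.sort_values('created_at')\n", | |
"\n", | |
"# cumcount() numbers each author's posts 0, 1, 2, ... in row order,\n", | |
"# so every row only sees that author's earlier posts\n", | |
"toy['prior_posts'] = toy.groupby('author').cumcount()\n", | |
"\n", | |
"# Rough usefulness check: mean of the response at each feature value\n", | |
"print(toy.groupby('prior_posts')['popular'].mean())\n", | |
"```\n", | |
"\n", | |
"Comparing the mean of **popular** across feature values (or bins) is one quick way to judge whether a feature carries any signal." | |
] | |
}, | |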
{ | |
"cell_type": "code", | |
"execution_count": 23, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"44057" | |
] | |
}, | |
"execution_count": 23, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# train.author.map(authors.title.agg('count'))\n", | |
"train.author.nunique()\n", | |
"# authors.author.count()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Task 5: Define **`make_features()`**\n", | |
"\n", | |
"1. Define a function, **`make_features()`**, that accepts a DataFrame and returns a DataFrame with your engineered features added. You should only include features that you think might be useful for predicting the response.\n", | |
"2. Re-split the **hn** DataFrame into **train** and **new** (using the code from Task 2) to return them to their original contents.\n", | |
"3. Run **`make_features()`** on **train** and **new**, and check that your features were successfully created." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 24, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"# title_set = set(train['title'])\n", | |
"# title_set = set(train['title'])" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 25, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"title_author = {}\n", | |
"# test_set = title_set.copy()\n", | |
"\n", | |
"def find_dupes(df):\n", | |
" author = df['author']\n", | |
" entry = df['title']\n", | |
"# print(author, entry)\n", | |
"# check if there is an entry in the title_author dictionary\n", | |
" if entry in title_author:\n", | |
" #see if the same author had the post for upvoting\n", | |
" if title_author[entry] == author:\n", | |
" return 0\n", | |
" else:\n", | |
" return 1\n", | |
" else:\n", | |
"# test_set.add(entry)\n", | |
"# there is no entry in title dict. Add an entry and return 0 \n", | |
" title_author[entry] = author\n", | |
" return 0" | |
] | |
}, | |
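{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"`find_dupes()` above depends on a module-level dictionary and on the row order. As a sketch of an alternative with no shared state, assuming the same **title**, **author**, and **created_at** columns, the flag can be computed per DataFrame with `groupby`. The helper name `flag_reposts` is hypothetical, and unlike `find_dupes()` it does not remember titles seen in an earlier call.\n", | |
"\n", | |
"```python\n", | |
"import pandas as pd\n", | |
"\n", | |
"def flag_reposts(df):\n", | |
"    # Hypothetical helper: 1 if the earliest post with this title in df\n", | |
"    # came from a different author, else 0\n", | |
"    ordered = df.sort_values('created_at')\n", | |
"    first_author = ordered.groupby('title')['author'].transform('first')\n", | |
"    return (ordered['author'] != first_author).astype(int)\n", | |
"\n", | |
"# Toy data (made up for illustration)\n", | |
"toy = pd.DataFrame({\n", | |
"    'title': ['x', 'x', 'y', 'x'],\n", | |
"    'author': ['a', 'b', 'c', 'a'],\n", | |
"    'created_at': pd.to_datetime(['2016-01-01', '2016-01-02',\n", | |
"                                  '2016-01-03', '2016-01-04']),\n", | |
"})\n", | |
"flags = flag_reposts(toy)\n", | |
"```\n", | |
"\n", | |
"The result keeps the input's index labels, so `df['duplicate'] = flag_reposts(df)` aligns by index." | |
] | |
}, | |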
{ | |
"cell_type": "code", | |
"execution_count": 26, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"def make_features(df, unfair=False):\n", | |
" df.sort_values(by=['created_at'], inplace=True)\n", | |
" #groupby author\n", | |
" authors = df.groupby('author') \n", | |
" author_total_posts = authors.title.count() \n", | |
" # auth_popularity_dict = auth_popularity.to_dict() \n", | |
" #see if the title of the topic repeats and mark them \n", | |
" df['duplicate'] = df[['title', 'author']].apply(find_dupes, axis=1)\n", | |
"# df['duplicate'] = df.duplicated('title', keep='last').astype(int) \n", | |
" #total posts\n", | |
" df['total_posts'] = df.author.map(author_total_posts)\n", | |
" #use length as a feature\n", | |
" df['length_title'] = df.title.apply(len)\n", | |
" if unfair:\n", | |
" author_comments = authors.num_comments.agg(['mean', 'count']) \n", | |
" author_points = authors.num_points.agg(['mean', 'count'])\n", | |
" #no of comments\n", | |
" df['mean_of_comments'] = df.author.map(author_comments['mean'])\n", | |
" df['count_of_comments'] = df.author.map(author_comments['count'])\n", | |
" #no of points\n", | |
" df['mean_of_points'] = df.author.map(author_points['mean'])\n", | |
" df['count_of_points'] = df.author.map(author_points['count']) \n", | |
" return df\n", | |
"\n", | |
"train = make_features(train) \n", | |
"new = make_features(new)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 27, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"14557" | |
] | |
}, | |
"execution_count": 27, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"train.duplicate.sum()\n", | |
"# title_author.keys()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 28, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# title_author.keys()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 29, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>id</th>\n", | |
" <th>title</th>\n", | |
" <th>url</th>\n", | |
" <th>num_points</th>\n", | |
" <th>num_comments</th>\n", | |
" <th>author</th>\n", | |
" <th>created_at</th>\n", | |
" <th>popular</th>\n", | |
" <th>duplicate</th>\n", | |
" <th>total_posts</th>\n", | |
" <th>length_title</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>64118</th>\n", | |
" <td>12012897</td>\n", | |
" <td>Mattermark Daily Thursday, June 30th, 2016</td>\n", | |
" <td>https://mattermark.com/mattermark-daily-thursd...</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>nickfrost</td>\n", | |
" <td>2016-07-01 00:03:00</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>111</td>\n", | |
" <td>43</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>64117</th>\n", | |
" <td>12012912</td>\n", | |
" <td>Ask HN: What do you build, what tools and edit...</td>\n", | |
" <td>NaN</td>\n", | |
" <td>3</td>\n", | |
" <td>3</td>\n", | |
" <td>nojvek</td>\n", | |
" <td>2016-07-01 00:05:00</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>3</td>\n", | |
" <td>61</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>64116</th>\n", | |
" <td>12012913</td>\n", | |
" <td>Ask HN: As a Python developer, what am I missi...</td>\n", | |
" <td>NaN</td>\n", | |
" <td>33</td>\n", | |
" <td>41</td>\n", | |
" <td>15DCFA8F</td>\n", | |
" <td>2016-07-01 00:05:00</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>1</td>\n", | |
" <td>68</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>64115</th>\n", | |
" <td>12012918</td>\n", | |
" <td>The Blackbird: First fully adjustable car rig ...</td>\n", | |
" <td>http://www.themill.com/portfolio/3002/the-blac...</td>\n", | |
" <td>2</td>\n", | |
" <td>0</td>\n", | |
" <td>rubyn00bie</td>\n", | |
" <td>2016-07-01 00:06:00</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>2</td>\n", | |
" <td>76</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>64114</th>\n", | |
" <td>12012924</td>\n", | |
" <td>Zenefits Loses Over Half of Its Value</td>\n", | |
" <td>http://fortune.com/2016/06/30/zenefits-loses-o...</td>\n", | |
" <td>191</td>\n", | |
" <td>81</td>\n", | |
" <td>prostoalex</td>\n", | |
" <td>2016-07-01 00:08:00</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>432</td>\n", | |
" <td>37</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" id title \\\n", | |
"64118 12012897 Mattermark Daily Thursday, June 30th, 2016 \n", | |
"64117 12012912 Ask HN: What do you build, what tools and edit... \n", | |
"64116 12012913 Ask HN: As a Python developer, what am I missi... \n", | |
"64115 12012918 The Blackbird: First fully adjustable car rig ... \n", | |
"64114 12012924 Zenefits Loses Over Half of Its Value \n", | |
"\n", | |
" url num_points \\\n", | |
"64118 https://mattermark.com/mattermark-daily-thursd... 1 \n", | |
"64117 NaN 3 \n", | |
"64116 NaN 33 \n", | |
"64115 http://www.themill.com/portfolio/3002/the-blac... 2 \n", | |
"64114 http://fortune.com/2016/06/30/zenefits-loses-o... 191 \n", | |
"\n", | |
" num_comments author created_at popular duplicate \\\n", | |
"64118 0 nickfrost 2016-07-01 00:03:00 0 0 \n", | |
"64117 3 nojvek 2016-07-01 00:05:00 0 0 \n", | |
"64116 41 15DCFA8F 2016-07-01 00:05:00 1 0 \n", | |
"64115 0 rubyn00bie 2016-07-01 00:06:00 0 0 \n", | |
"64114 81 prostoalex 2016-07-01 00:08:00 1 0 \n", | |
"\n", | |
" total_posts length_title \n", | |
"64118 111 43 \n", | |
"64117 3 61 \n", | |
"64116 1 68 \n", | |
"64115 2 76 \n", | |
"64114 432 37 " | |
] | |
}, | |
"execution_count": 29, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"new.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 30, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"4723" | |
] | |
}, | |
"execution_count": 30, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"new.duplicate.sum()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 31, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# len(new) - len(test_set - title_set)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Task 6: Evaluate your model using cross-validation\n", | |
"\n", | |
"1. Define **X** and **y** using your chosen feature columns from **train**.\n", | |
"2. Choose a classification model, and then use **`cross_val_score`** to evaluate your model. Use the parameter **`scoring='roc_auc'`**, since we're going to use AUC as the evaluation metric.\n", | |
"3. **Optional:** Try adding features to your model that would not be \"allowed\" in the real world (because they incorporate information about the future), and see how that affects your AUC. (Be sure to remove these features from your model before moving on to the next task!)\n", | |
"\n", | |
" - **Note:** An AUC of 1.0 represents a perfect model, and an AUC of 0.5 represents random guessing. You can think of 0.5 as the AUC of the \"null model\". (My [blog post and video](http://www.dataschool.io/roc-curves-and-auc-explained/) explain AUC in more depth.)" | |
] | |
}, | |
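The note in Task 6 says that an AUC of 1.0 is a perfect model and 0.5 is random guessing. One way to see why: AUC equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. Below is a minimal pure-Python sketch of that definition. The function name `pairwise_auc` and the toy labels and scores are made up for illustration and are not part of the homework data.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs that are ranked correctly.

    A tie between a positive and a negative score counts as half a correct pair.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Toy labels (1 = popular) and three hypothetical sets of scores.
labels = [0, 0, 1, 1]
perfect = pairwise_auc(labels, [0.1, 0.2, 0.8, 0.9])   # every positive outranks every negative
constant = pairwise_auc(labels, [0.5, 0.5, 0.5, 0.5])  # scores carry no information
mixed = pairwise_auc(labels, [0.1, 0.8, 0.2, 0.9])     # one of the four pairs is ranked wrongly
print(perfect, constant, mixed)
```

A model that assigns the same score to every post lands at 0.5, which is why 0.5 serves as the AUC of the "null model".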
{ | |
"cell_type": "code", | |
"execution_count": 32, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# train.columns\n", | |
"# Larger feature sets, left commented out:\n", | |
"# X = train[['duplicate', 'length_title', 'total_posts', 'mean_of_comments', 'count_of_comments', 'count_of_points', 'mean_of_points']]\n", | |
"# X = train[['duplicate', 'total_posts', 'mean_of_comments', 'count_of_comments', 'count_of_points', 'mean_of_points']]\n", | |
"X = train[['duplicate', 'total_posts']]\n", | |
"y = train['popular']\n", | |
"\n", | |
"# hold out a test split for a quick single-split check of the AUC\n", | |
"X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, random_state=1)\n", | |
"\n", | |
"from sklearn.neighbors import KNeighborsClassifier\n", | |
"\n", | |
"knn = KNeighborsClassifier(n_neighbors=100)\n", | |
"knn.fit(X_train, y_train)\n", | |
"\n", | |
"# despite its name, y_pred_class holds predicted probabilities of class 1, which is what AUC needs\n", | |
"y_pred_class = knn.predict_proba(X_test)[:, 1]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 33, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"0.56127191062054815" | |
] | |
}, | |
"execution_count": 33, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# AUC on the held-out split (y_pred_class holds probabilities, so accuracy_score would not apply)\n", | |
"metrics.roc_auc_score(y_test, y_pred_class)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 34, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"array([ 0.54134119, 0.5524376 , 0.55398633, 0.55042044, 0.55210614])" | |
] | |
}, | |
"execution_count": 34, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# 5-fold cross-validated AUC\n", | |
"cross_validation.cross_val_score(knn, X, y, scoring='roc_auc', cv=5)" | |
] | |
}, | |
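The cells for Task 6 import from `sklearn.cross_validation`, which was deprecated in scikit-learn 0.18 and removed in 0.20. On a recent release the same evaluation can be written with `sklearn.model_selection`. The sketch below assumes a recent scikit-learn and uses synthetic data in place of **train**; `X_demo`, `y_demo`, and `knn_demo` are illustrative names, and the scores it prints say nothing about the Hacker News data.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for two numeric feature columns and a binary target.
rng = np.random.RandomState(1)
X_demo = rng.randint(0, 50, size=(300, 2)).astype(float)
y_demo = (X_demo[:, 1] + rng.normal(0, 15, size=300) > 25).astype(int)

knn_demo = KNeighborsClassifier(n_neighbors=20)

# One AUC per fold; for classifiers the folds are stratified by default.
scores = cross_val_score(knn_demo, X_demo, y_demo, scoring='roc_auc', cv=5)
print(scores)
print(scores.mean())
```

Averaging the five fold scores gives a single number that is less sensitive to one lucky or unlucky split than the single train/test AUC.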
{ | |
"cell_type": "code", | |
"execution_count": 35, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"#null model\n", | |
"# y_test.value_counts().head(1)/len(y_test)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Task 7: Tune your model using grid search\n", | |
"\n", | |
"Use **`GridSearchCV`** to find the optimal tuning parameters for your model." | |
] | |
}, | |
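In scikit-learn 0.20 and later, `GridSearchCV` lives in `sklearn.model_selection`, and per-candidate results are read from `cv_results_` rather than `grid_scores_`. The following is a small sketch of the Task 7 search under that assumption. It runs on synthetic data, and the names `X_demo`, `y_demo`, `param_grid_demo`, and `grid_demo` are placeholders rather than anything from the homework.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the feature matrix and the binary target.
rng = np.random.RandomState(1)
X_demo = rng.randint(0, 50, size=(300, 2)).astype(float)
y_demo = (X_demo[:, 1] + rng.normal(0, 15, size=300) > 25).astype(int)

param_grid_demo = {'n_neighbors': [5, 20, 50]}
grid_demo = GridSearchCV(KNeighborsClassifier(), param_grid_demo, cv=5, scoring='roc_auc')
grid_demo.fit(X_demo, y_demo)

# cv_results_ is a dict of arrays with one entry per parameter combination.
results = grid_demo.cv_results_
for params, mean, std in zip(results['params'],
                             results['mean_test_score'],
                             results['std_test_score']):
    print(params, round(mean, 3), round(std, 3))

print(grid_demo.best_params_, grid_demo.best_score_)
```

`best_score_` is the highest mean cross-validated AUC among the candidates, so it is an estimate from the training data only and tends to be somewhat optimistic.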
{ | |
"cell_type": "code", | |
"execution_count": 36, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"from sklearn.grid_search import GridSearchCV\n", | |
"param_grid = {'n_neighbors': [100,200]}\n", | |
"grid = GridSearchCV(knn, param_grid, cv=5, scoring='roc_auc')\n", | |
"before = dir(grid)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 37, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"grid.fit(X, y)\n", | |
"after = dir(grid)\n", | |
"# grid.best_score_" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 38, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"['grid.grid_scores_', 'grid.best_score_', 'grid.best_params_', 'grid.scorer_', 'grid.best_estimator_']\n" | |
] | |
}, | |
{ | |
"data": { | |
"text/plain": [ | |
"[[mean: 0.55006, std: 0.00450, params: {'n_neighbors': 100},\n", | |
" mean: 0.55514, std: 0.00599, params: {'n_neighbors': 200}],\n", | |
" 0.55514187726343167,\n", | |
" {'n_neighbors': 200},\n", | |
" make_scorer(roc_auc_score, needs_threshold=True),\n", | |
" KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',\n", | |
" metric_params=None, n_jobs=1, n_neighbors=200, p=2,\n", | |
" weights='uniform')]" | |
] | |
}, | |
"execution_count": 38, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
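], | |
"source": [ | |
"# extra: list the attributes that fit() added to the grid object, then show their values\n", | |
"cmd_list = ['grid.{}'.format(x) for x in set(after) - set(before)]\n", | |
"print(cmd_list)\n", | |
"# getattr looks each attribute up directly instead of calling eval on a constructed string\n", | |
"[getattr(grid, x[len('grid.'):]) for x in cmd_list]" | |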
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 39, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"new_param_grid = {'n_neighbors' : list(range(10, 151, 10))}\n", | |
"grid = GridSearchCV(knn, new_param_grid, cv=5, scoring='roc_auc')" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 40, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"CPU times: user 4min 6s, sys: 976 ms, total: 4min 7s\n", | |
"Wall time: 4min 7s\n" | |
] | |
}, | |
{ | |
"data": { | |
"text/plain": [ | |
"[mean: 0.53247, std: 0.00438, params: {'n_neighbors': 10},\n", | |
" mean: 0.53421, std: 0.00696, params: {'n_neighbors': 20},\n", | |
" mean: 0.53984, std: 0.00841, params: {'n_neighbors': 30},\n", | |
" mean: 0.54042, std: 0.00721, params: {'n_neighbors': 40},\n", | |
" mean: 0.54696, std: 0.00761, params: {'n_neighbors': 50},\n", | |
" mean: 0.54826, std: 0.00747, params: {'n_neighbors': 60},\n", | |
" mean: 0.55032, std: 0.00607, params: {'n_neighbors': 70},\n", | |
" mean: 0.54900, std: 0.00540, params: {'n_neighbors': 80},\n", | |
" mean: 0.54983, std: 0.00398, params: {'n_neighbors': 90},\n", | |
" mean: 0.55006, std: 0.00450, params: {'n_neighbors': 100},\n", | |
" mean: 0.55070, std: 0.00413, params: {'n_neighbors': 110},\n", | |
" mean: 0.55180, std: 0.00521, params: {'n_neighbors': 120},\n", | |
" mean: 0.55260, std: 0.00457, params: {'n_neighbors': 130},\n", | |
" mean: 0.55278, std: 0.00516, params: {'n_neighbors': 140},\n", | |
" mean: 0.55280, std: 0.00510, params: {'n_neighbors': 150}]" | |
] | |
}, | |
"execution_count": 40, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"%time grid.fit(X, y)\n", | |
"grid.grid_scores_" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 41, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"(0.55280148150806863,\n", | |
" KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',\n", | |
" metric_params=None, n_jobs=1, n_neighbors=150, p=2,\n", | |
" weights='uniform'))" | |
] | |
}, | |
"execution_count": 41, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"grid.best_score_, grid.best_estimator_" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Task 8: Make predictions for the new data\n", | |
"\n", | |
"1. Create a DataFrame called **\"X_new\"** that includes the same feature columns you used to train your model.\n", | |
"2. Train your best model (found during grid search) using **X** and **y**.\n", | |
"3. Calculate the predicted probability of popularity for all posts in **X_new**.\n", | |
"4. Calculate the AUC of your model by comparing your predicted probabilities against the **popular** column in the **new** DataFrame. (See how that compares to the AUC that was output by **`GridSearchCV`**.)\n", | |
"\n", | |
" - **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains how to calculate predicted probabilities and AUC." | |
] | |
}, | |
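The four steps of Task 8 can be sketched as follows. This is an illustration on synthetic data, not the homework solution: `make_data`, `X_old`, `X_later`, and `best_knn` are hypothetical names, and `n_neighbors=30` is an arbitrary stand-in for whatever value the grid search selects.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)


def make_data(n):
    """Return n synthetic rows with two numeric features and a noisy binary target."""
    X = rng.randint(0, 50, size=(n, 2)).astype(float)
    y = (X[:, 1] + rng.normal(0, 15, size=n) > 25).astype(int)
    return X, y


X_old, y_old = make_data(400)      # plays the role of X and y from train
X_later, y_later = make_data(200)  # plays the role of X_new and new['popular']

# Refit the chosen model on all of the training data.
best_knn = KNeighborsClassifier(n_neighbors=30)
best_knn.fit(X_old, y_old)

# Column 1 of predict_proba is the predicted probability of class 1.
prob_popular = best_knn.predict_proba(X_later)[:, 1]
auc_new = roc_auc_score(y_later, prob_popular)
print(auc_new)
```

Comparing this AUC with the cross-validated one shows how well the estimate from the training period carries over to data the model has never seen.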
{ | |
"cell_type": "code", | |
"execution_count": 42, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"duplicate 0\n", | |
"total_posts 0\n", | |
"dtype: int64" | |
] | |
}, | |
"execution_count": 42, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# same feature columns that were used to train the model\n", | |
"X_new = new[['duplicate', 'total_posts']]\n", | |
"# confirm that neither feature has missing values\n", | |
"X_new.isnull().sum()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 43, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>id</th>\n", | |
" <th>title</th>\n", | |
" <th>url</th>\n", | |
" <th>num_points</th>\n", | |
" <th>num_comments</th>\n", | |
" <th>created_at</th>\n", | |
" <th>popular</th>\n", | |
" <th>duplicate</th>\n", | |
" <th>total_posts</th>\n", | |
" <th>length_title</th>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>author</th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>00taffe</th>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>04rob</th>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>1</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>0uate</th>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>0x0</th>\n", | |
" <td>11</td>\n", | |
" <td>11</td>\n", | |
" <td>11</td>\n", | |
" <td>11</td>\n", | |
" <td>11</td>\n", | |
" <td>11</td>\n", | |
" <td>11</td>\n", | |
" <td>11</td>\n", | |
" <td>11</td>\n", | |
" <td>11</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>0x142857</th>\n", | |
" <td>12</td>\n", | |
" <td>12</td>\n", | |
" <td>12</td>\n", | |
" <td>12</td>\n", | |
" <td>12</td>\n", | |
" <td>12</td>\n", | |
" <td>12</td>\n", | |
" <td>12</td>\n", | |
" <td>12</td>\n", | |
" <td>12</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>0x23</th>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>0x4139</th>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>0x54MUR41</th>\n", | |
" <td>14</td>\n", | |
" <td>14</td>\n", | |
" <td>12</td>\n", | |
" <td>14</td>\n", | |
" <td>14</td>\n", | |
" <td>14</td>\n", | |
" <td>14</td>\n", | |
" <td>14</td>\n", | |
" <td>14</td>\n", | |
" <td>14</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>0x70run</th>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>0x7fffffff</th>\n", | |
" <td>13</td>\n", | |
" <td>13</td>\n", | |
" <td>13</td>\n", | |
" <td>13</td>\n", | |
" <td>13</td>\n", | |
" <td>13</td>\n", | |
" <td>13</td>\n", | |
" <td>13</td>\n", | |
" <td>13</td>\n", | |
" <td>13</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>0xAX</th>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>0xCMP</th>\n", | |
" <td>3</td>\n", | |
" <td>3</td>\n", | |
" <td>2</td>\n", | |
" <td>3</td>\n", | |
" <td>3</td>\n", | |
" <td>3</td>\n", | |
" <td>3</td>\n", | |
" <td>3</td>\n", | |
" <td>3</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>0xFFC</th>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>0xbadf00d</th>\n", | |
" <td>11</td>\n", | |
" <td>11</td>\n", | |
" <td>11</td>\n", | |
" <td>11</td>\n", | |
" <td>11</td>\n", | |
" <td>11</td>\n", | |
" <td>11</td>\n", | |
" <td>11</td>\n", | |
" <td>11</td>\n", | |
" <td>11</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>0xmohit</th>\n", | |
" <td>31</td>\n", | |
" <td>31</td>\n", | |
" <td>31</td>\n", | |
" <td>31</td>\n", | |
" <td>31</td>\n", | |
" <td>31</td>\n", | |
" <td>31</td>\n", | |
" <td>31</td>\n", | |
" <td>31</td>\n", | |
" <td>31</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>0xsky</th>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>0xsnowcrash</th>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1.0203E+11</th>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1.11E+13</th>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>10098</th>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>100Mcenturies</th>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>100ideas</th>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>100k</th>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1024core</th>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1123581321</th>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>1</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>11thEarlOfMar</th>\n", | |
" <td>15</td>\n", | |
" <td>15</td>\n", | |
" <td>13</td>\n", | |
" <td>15</td>\n", | |
" <td>15</td>\n", | |
" <td>15</td>\n", | |
" <td>15</td>\n", | |
" <td>15</td>\n", | |
" <td>15</td>\n", | |
" <td>15</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>123456</th>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>12345671</th>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>127001brewer</th>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1337biz</th>\n", | |
" <td>3</td>\n", | |
" <td>3</td>\n", | |
" <td>3</td>\n", | |
" <td>3</td>\n", | |
" <td>3</td>\n", | |
" <td>3</td>\n", | |
" <td>3</td>\n", | |
" <td>3</td>\n", | |
" <td>3</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>...</th>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>zubster</th>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>zuck9</th>\n", | |
" <td>4</td>\n", | |
" <td>4</td>\n", | |
" <td>3</td>\n", | |
" <td>4</td>\n", | |
" <td>4</td>\n", | |
" <td>4</td>\n", | |
" <td>4</td>\n", | |
" <td>4</td>\n", | |
" <td>4</td>\n", | |
" <td>4</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>zufallsheld</th>\n", | |
" <td>3</td>\n", | |
" <td>3</td>\n", | |
" <td>3</td>\n", | |
" <td>3</td>\n", | |
" <td>3</td>\n", | |
" <td>3</td>\n", | |
" <td>3</td>\n", | |
" <td>3</td>\n", | |
" <td>3</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>zunzun</th>\n", | |
" <td>4</td>\n", | |
" <td>4</td>\n", | |
" <td>4</td>\n", | |
" <td>4</td>\n", | |
" <td>4</td>\n", | |
" <td>4</td>\n", | |
" <td>4</td>\n", | |
" <td>4</td>\n", | |
" <td>4</td>\n", | |
" <td>4</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>zura</th>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>zuzuleinen</th>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>zvrba</th>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>zw123456</th>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>zweiterlinde</th>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>zwieback</th>\n", | |
" <td>3</td>\n", | |
" <td>3</td>\n", | |
" <td>3</td>\n", | |
" <td>3</td>\n", | |
" <td>3</td>\n", | |
" <td>3</td>\n", | |
" <td>3</td>\n", | |
" <td>3</td>\n", | |
" <td>3</td>\n", | |
" <td>3</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>zwilliamson</th>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>zwischenzug</th>\n", | |
" <td>5</td>\n", | |
" <td>5</td>\n", | |
" <td>5</td>\n", | |
" <td>5</td>\n", | |
" <td>5</td>\n", | |
" <td>5</td>\n", | |
" <td>5</td>\n", | |
" <td>5</td>\n", | |
" <td>5</td>\n", | |
" <td>5</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>zwlee28</th>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" <td>2</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" id title url num_points num_comments created_at popular \\\n", | |
"author \n", | |
"00taffe 1 1 0 1 1 1 1 \n", | |
"04rob 2 2 1 2 2 2 2 \n", | |
"0uate 1 1 1 1 1 1 1 \n", | |
"0x0 11 11 11 11 11 11 11 \n", | |
"0x142857 12 12 12 12 12 12 12 \n", | |
"0x23 1 1 1 1 1 1 1 \n", | |
"0x4139 1 1 1 1 1 1 1 \n", | |
"0x54MUR41 14 14 12 14 14 14 14 \n", | |
"0x70run 1 1 0 1 1 1 1 \n", | |
"0x7fffffff 13 13 13 13 13 13 13 \n", | |
"0xAX 2 2 2 2 2 2 2 \n", | |
"0xCMP 3 3 2 3 3 3 3 \n", | |
"0xFFC 2 2 2 2 2 2 2 \n", | |
"0xbadf00d 11 11 11 11 11 11 11 \n", | |
"0xmohit 31 31 31 31 31 31 31 \n", | |
"0xsky 1 1 1 1 1 1 1 \n", | |
"0xsnowcrash 1 1 1 1 1 1 1 \n", | |
"1.0203E+11 1 1 1 1 1 1 1 \n", | |
"1.11E+13 1 1 1 1 1 1 1 \n", | |
"10098 1 1 1 1 1 1 1 \n", | |
"100Mcenturies 1 1 0 1 1 1 1 \n", | |
"100ideas 1 1 1 1 1 1 1 \n", | |
"100k 1 1 1 1 1 1 1 \n", | |
"1024core 1 1 1 1 1 1 1 \n", | |
"1123581321 2 2 1 2 2 2 2 \n", | |
"11thEarlOfMar 15 15 13 15 15 15 15 \n", | |
"123456 1 1 1 1 1 1 1 \n", | |
"12345671 2 2 2 2 2 2 2 \n", | |
"127001brewer 2 2 2 2 2 2 2 \n", | |
"1337biz 3 3 3 3 3 3 3 \n", | |
"... .. ... ... ... ... ... ... \n", | |
"zubster 1 1 1 1 1 1 1 \n", | |
"zuck9 4 4 3 4 4 4 4 \n", | |
"zufallsheld 3 3 3 3 3 3 3 \n", | |
"zunzun 4 4 4 4 4 4 4 \n", | |
"zura 1 1 1 1 1 1 1 \n", | |
"zuzuleinen 1 1 1 1 1 1 1 \n", | |
"zvrba 1 1 1 1 1 1 1 \n", | |
"zw123456 2 2 2 2 2 2 2 \n", | |
"zweiterlinde 1 1 1 1 1 1 1 \n", | |
"zwieback 3 3 3 3 3 3 3 \n", | |
"zwilliamson 2 2 2 2 2 2 2 \n", | |
"zwischenzug 5 5 5 5 5 5 5 \n", | |
"zwlee28 2 2 2 2 2 2 2 \n", | |
"zwrt 1 1 1 1 1 1 1 \n", | |
"zx2c4 3 3 3 3 3 3 3 \n", | |
"zxcv45 1 1 1 1 1 1 1 \n", | |
"zxcvvcxz 5 5 4 5 5 5 5 \n", | |
"zxlk21e 1 1 1 1 1 1 1 \n", | |
"zxombie 1 1 1 1 1 1 1 \n", | |
"zxv 32 32 32 32 32 32 32 \n", | |
"zy1t 1 1 1 1 1 1 1 \n", | |
"zyedidia 1 1 1 1 1 1 1 \n", | |
"zygimantasdev 1 1 0 1 1 1 1 \n", | |
"zymhan 1 1 1 1 1 1 1 \n", | |
"zyngaro 1 1 0 1 1 1 1 \n", | |
"zzarcon 3 3 3 3 3 3 3 \n", | |
"zzleeper 1 1 1 1 1 1 1 \n", | |
"zzy8200 1 1 1 1 1 1 1 \n", | |
"zzzbra 1 1 1 1 1 1 1 \n", | |
"zzzhan 5 5 5 5 5 5 5 \n", | |
"\n", | |
" duplicate total_posts length_title \n", | |
"author \n", | |
"00taffe 1 1 1 \n", | |
"04rob 2 2 2 \n", | |
"0uate 1 1 1 \n", | |
"0x0 11 11 11 \n", | |
"0x142857 12 12 12 \n", | |
"0x23 1 1 1 \n", | |
"0x4139 1 1 1 \n", | |
"0x54MUR41 14 14 14 \n", | |
"0x70run 1 1 1 \n", | |
"0x7fffffff 13 13 13 \n", | |
"0xAX 2 2 2 \n", | |
"0xCMP 3 3 3 \n", | |
"0xFFC 2 2 2 \n", | |
"0xbadf00d 11 11 11 \n", | |
"0xmohit 31 31 31 \n", | |
"0xsky 1 1 1 \n", | |
"0xsnowcrash 1 1 1 \n", | |
"1.0203E+11 1 1 1 \n", | |
"1.11E+13 1 1 1 \n", | |
"10098 1 1 1 \n", | |
"100Mcenturies 1 1 1 \n", | |
"100ideas 1 1 1 \n", | |
"100k 1 1 1 \n", | |
"1024core 1 1 1 \n", | |
"1123581321 2 2 2 \n", | |
"11thEarlOfMar 15 15 15 \n", | |
"123456 1 1 1 \n", | |
"12345671 2 2 2 \n", | |
"127001brewer 2 2 2 \n", | |
"1337biz 3 3 3 \n", | |
"... ... ... ... \n", | |
"zubster 1 1 1 \n", | |
"zuck9 4 4 4 \n", | |
"zufallsheld 3 3 3 \n", | |
"zunzun 4 4 4 \n", | |
"zura 1 1 1 \n", | |
"zuzuleinen 1 1 1 \n", | |
"zvrba 1 1 1 \n", | |
"zw123456 2 2 2 \n", | |
"zweiterlinde 1 1 1 \n", | |
"zwieback 3 3 3 \n", | |
"zwilliamson 2 2 2 \n", | |
"zwischenzug 5 5 5 \n", | |
"zwlee28 2 2 2 \n", | |
"zwrt 1 1 1 \n", | |
"zx2c4 3 3 3 \n", | |
"zxcv45 1 1 1 \n", | |
"zxcvvcxz 5 5 5 \n", | |
"zxlk21e 1 1 1 \n", | |
"zxombie 1 1 1 \n", | |
"zxv 32 32 32 \n", | |
"zy1t 1 1 1 \n", | |
"zyedidia 1 1 1 \n", | |
"zygimantasdev 1 1 1 \n", | |
"zymhan 1 1 1 \n", | |
"zyngaro 1 1 1 \n", | |
"zzarcon 3 3 3 \n", | |
"zzleeper 1 1 1 \n", | |
"zzy8200 1 1 1 \n", | |
"zzzbra 1 1 1 \n", | |
"zzzhan 5 5 5 \n", | |
"\n", | |
"[19002 rows x 10 columns]" | |
] | |
}, | |
"execution_count": 43, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"new.isnull().sum()\n", | |
"new.groupby('author').count()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 44, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>id</th>\n", | |
" <th>title</th>\n", | |
" <th>url</th>\n", | |
" <th>num_points</th>\n", | |
" <th>num_comments</th>\n", | |
" <th>author</th>\n", | |
" <th>created_at</th>\n", | |
" <th>popular</th>\n", | |
" <th>duplicate</th>\n", | |
" <th>total_posts</th>\n", | |
" <th>length_title</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>64114</th>\n", | |
" <td>12012924</td>\n", | |
" <td>Zenefits Loses Over Half of Its Value</td>\n", | |
" <td>http://fortune.com/2016/06/30/zenefits-loses-o...</td>\n", | |
" <td>191</td>\n", | |
" <td>81</td>\n", | |
" <td>prostoalex</td>\n", | |
" <td>2016-07-01 00:08:00</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>432</td>\n", | |
" <td>37</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>64113</th>\n", | |
" <td>12012927</td>\n", | |
" <td>Ask HN: Designers and Developers of HN, how ca...</td>\n", | |
" <td>NaN</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>nojvek</td>\n", | |
" <td>2016-07-01 00:09:00</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>3</td>\n", | |
" <td>80</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>64112</th>\n", | |
" <td>12012931</td>\n", | |
" <td>Brain: An esoteric modern computer language ba...</td>\n", | |
" <td>https://github.com/luizperes/brain/issues</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>lerax</td>\n", | |
" <td>2016-07-01 00:10:00</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>5</td>\n", | |
" <td>62</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" id title \\\n", | |
"64114 12012924 Zenefits Loses Over Half of Its Value \n", | |
"64113 12012927 Ask HN: Designers and Developers of HN, how ca... \n", | |
"64112 12012931 Brain: An esoteric modern computer language ba... \n", | |
"\n", | |
" url num_points \\\n", | |
"64114 http://fortune.com/2016/06/30/zenefits-loses-o... 191 \n", | |
"64113 NaN 1 \n", | |
"64112 https://github.com/luizperes/brain/issues 1 \n", | |
"\n", | |
" num_comments author created_at popular duplicate \\\n", | |
"64114 81 prostoalex 2016-07-01 00:08:00 1 0 \n", | |
"64113 0 nojvek 2016-07-01 00:09:00 0 0 \n", | |
"64112 0 lerax 2016-07-01 00:10:00 0 0 \n", | |
"\n", | |
" total_posts length_title \n", | |
"64114 432 37 \n", | |
"64113 3 80 \n", | |
"64112 5 62 " | |
] | |
}, | |
"execution_count": 44, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"X_new.isnull().sum()\n", | |
"# X_new.shape\n", | |
"X_new[X_new.total_posts.isnull()]\n", | |
"X_new.head(5)\n", | |
"new[4:7]\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 45, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"0.50565537723495524" | |
] | |
}, | |
"execution_count": 45, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
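], | |
"source": [ | |
"# refit KNN on all of the training data with the n_neighbors value chosen by the grid search\n", | |
"knn = KNeighborsClassifier(n_neighbors=150, p=2, leaf_size=30)\n", | |
"knn.fit(X, y)\n", | |
"# predicted probability of the popular class for each new post\n", | |
"y_new = knn.predict_proba(X_new)[:, 1]\n", | |
"metrics.roc_auc_score(new['popular'], y_new)" | |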
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Task 9: Use text as the input data instead\n", | |
"\n", | |
"1. Define a new **X** using the **title** column from **train**.\n", | |
"2. Create a **`Pipeline`** of **`CountVectorizer`** and the model of your choice.\n", | |
"3. Use **`cross_val_score`** to properly evaluate the AUC of your pipeline.\n", | |
"4. **Optional:** See if you can increase the AUC by changing what you use as the input text.\n", | |
"5. Train the pipeline on **X** and **y**, calculate predicted probabilities for all posts in the **new** DataFrame, and calculate the AUC." | |
] | |
}, | |
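Task 9 asks for a **`Pipeline`** so that the vectorizer is refit inside each cross-validation fold. The vocabulary is then learned from the training folds only, which is what makes the AUC a proper estimate. The sketch below assumes a recent scikit-learn (`sklearn.model_selection`) and uses a handful of invented titles and labels; `titles`, `popular_demo`, and `pipe_demo` are illustrative names, and the toy data is far too small for the scores to mean anything.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented titles and labels (1 = popular), used only to make the example runnable.
titles = [
    'Show HN: A tiny Python web framework',
    'Ask HN: How do you learn a new language?',
    'Show HN: My weekend project in Rust',
    'Why startups fail',
    'Ask HN: Best books on statistics?',
    'A new JavaScript build tool',
    'Show HN: Open source database written in Go',
    'Quarterly earnings report',
    'Ask HN: What editor do you use?',
    'Company announces new office',
    'Show HN: Visualizing sorting algorithms',
    'Press release: product update',
]
popular_demo = [1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0]

# The pipeline vectorizes the raw strings and then fits the classifier.
pipe_demo = make_pipeline(CountVectorizer(), MultinomialNB())

# Each fold fits CountVectorizer on its own training titles only.
scores = cross_val_score(pipe_demo, titles, popular_demo, scoring='roc_auc', cv=3)
print(scores)

# Train on everything, then score an unseen title.
pipe_demo.fit(titles, popular_demo)
prob = pipe_demo.predict_proba(['Show HN: A new Python tool'])[:, 1]
print(prob)
```

Passing raw strings to `cross_val_score` works because the pipeline's first step does the conversion to a document-term matrix.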
{ | |
"cell_type": "code", | |
"execution_count": 46, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"X = train['title']\n", | |
"y = train['popular']\n", | |
"from sklearn.pipeline import Pipeline, make_pipeline\n", | |
"from sklearn.feature_extraction.text import CountVectorizer\n", | |
"cv = CountVectorizer()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 47, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"from sklearn.naive_bayes import MultinomialNB\n", | |
"nb = MultinomialNB()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 48, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"knn = KNeighborsClassifier(n_neighbors=150, p=2, leaf_size=30)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 49, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# pipe = make_pipeline(cv, knn)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 50, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"pipe = make_pipeline(cv, nb)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 51, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"{'countvectorizer': CountVectorizer(analyzer='word', binary=False, decode_error='strict',\n", | |
" dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',\n", | |
" lowercase=True, max_df=1.0, max_features=None, min_df=1,\n", | |
" ngram_range=(1, 1), preprocessor=None, stop_words=None,\n", | |
" strip_accents=None, token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n", | |
" tokenizer=None, vocabulary=None),\n", | |
" 'multinomialnb': MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)}" | |
] | |
}, | |
"execution_count": 51, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"pipe.named_steps" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 52, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"Pipeline(steps=[('countvectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',\n", | |
" dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',\n", | |
" lowercase=True, max_df=1.0, max_features=None, min_df=1,\n", | |
" ngram_range=(1, 1), preprocessor=None, stop_words=None,\n", | |
" strip_accents=None, token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n", | |
" tokenizer=None, vocabulary=None)), ('multinomialnb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])" | |
] | |
}, | |
"execution_count": 52, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"pipe.fit(X, y)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 55, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"y_pred_pipe = pipe.predict_proba(X)[:, 1]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 56, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"0.73754160309287498" | |
] | |
}, | |
"execution_count": 56, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
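"# AUC on the training data itself; optimistic, since the pipeline was fit on these same rows.\n", | |
"metrics.roc_auc_score(y, y_pred_pipe)" | |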
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# Evaluate the pipeline with 5-fold cross-validated AUC.\n", | |
"# The roc_auc scorer needs the binary target, so pass y (train['popular']) rather than num_points.\n", | |
"from sklearn.model_selection import cross_val_score\n", | |
"cross_val_score(pipe, X, y, scoring='roc_auc', cv=5).mean()" | |
] | |
} | |
], | |
"metadata": { | |
"gist": { | |
"data": { | |
"description": "MLtext3/submissions/05_hacker_news_homework.ipynb", | |
"public": false | |
}, | |
"id": "" | |
}, | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.5.1" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 0 | |
} |
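The AUC values in this notebook come from `metrics.roc_auc_score`. As a rough illustration of what that number measures, the sketch below computes AUC directly as the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The helper name `pairwise_auc` and the toy labels and scores are assumptions made up for this illustration; they are not part of the notebook or of scikit-learn, and the quadratic loop is only meant for small inputs.

```python
def pairwise_auc(labels, scores):
    """Return the share of (positive, negative) pairs ranked correctly.

    labels: iterable of 0/1 class labels (hypothetical toy data here).
    scores: iterable of predicted scores, e.g. probabilities of class 1.
    Ties between a positive and a negative count as half a correct pair.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")

    correct = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                correct += 1.0
            elif pos_score == neg_score:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Toy example: three of the four positive/negative pairs are ordered correctly.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(toy_labels, toy_scores))  # 0.75
```

Because this definition splits the rows into positives and negatives, the target handed to an AUC scorer has to be a binary column such as `popular`, not a raw count such as `num_points`.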