Last active
April 9, 2020 19:24
Group13_BUDT758X_Project_Update.ipynb
{ | |
"nbformat": 4, | |
"nbformat_minor": 0, | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.7.3" | |
}, | |
"colab": { | |
"name": "Group13_BUDT758X_Project_Update.ipynb", | |
"provenance": [], | |
"collapsed_sections": [], | |
"include_colab_link": true | |
} | |
}, | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "view-in-github", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"<a href=\"https://colab.research.google.com/gist/allen-li1231/4d1cdc452a73b7301b211915ff78e346/group13_budt758x_project_update-1.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "fbhRDBqSTG_n", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"# BUDT 758X Project Update: Data Exploration of Yelp\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "14VER1QhTG_o", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"## Group 13 ( FUN ): Lei Han, Xiaoyou Zhou, Zhonghao (Allen) Li, Yutian Luo, Cindy Chang" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "u1jq-HwrTG_p", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"CONTEXT OF THE DATA" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "kSm19eQJTG_q", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"This dataset is a subset of Yelp’s businesses, reviews, and user data. It was originally put together for the Yelp Dataset Challenge which is a chance for students to conduct research or analysis on Yelp's data and share their discoveries.In the dataset we’ll find information about businesses across 11 metropolitan areas in four countries." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "5tgi3mAYTG_r", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"We have chosen to pick Yelp dataset for three main reasons:" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "2gYzwRLMTG_r", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"(1).The data is feasible and has potential due to large volumes (5GB).\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "qsnNwilQTG_s", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"(2).Since the information is gathered from the Yelp website,which is one of the most renowned review platform who has a monthly average of 76.7 million unique visitors via its mobile website in 2019, it is authentic and will help us develop practical insights." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "JdU07C5iTG_t", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"(3).The datasets include 5.2 million user reviews of multiple restaurants in 11 metropolitan areas, which enriches the quality of the data." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "ppaXFZ1UTG_t", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"## Project Overview" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "eSL8uj1ZTG_u", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"### 1. Project Objectives" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "a1WiI_P1TG_u", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"The goal is to perform data analysis on the Yelp restaurant dataset consisting of customer review texts, restaurant rankings, location information, etc. from a variety of restaurants. What inspired the project is that when people look for new restaurants to go to, they may be lost in the sea of options. So they tend to make decisions based on the reviews and overall rating of a specific restaurant." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "40maMoMyTG_v", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"By gathering insights about the customer base and developing strategic factors that would influence a customer’s decision to visit a particular restaurant, not only can we help diners make the best choices for restaurants but also we can provide recommendations for restaurants to expand their business by attracting more customers and improve clients experiences." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "reoKLg0iTG_w", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"### 2. Exploration Methods" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "HUFfVe-PTG_x", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"To analyze, cluster, classify and predict different labels corresponding to restaurant/review/customers on Yelp,we will use\n", | |
"- cluster and classification models to partition restaurants for specific needs, for instance, restaurants that are suitable for children/business environments. \n", | |
"- sentiment analysis & traditional NLP analytics tools to add more dimension on current data\n", | |
"- prediction models to forecast a newly opened restaurant’s ranking based on scarced reviews and provide first-handed references for our valued customers." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "3naduGabTG_y", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"### 3. Questions of Interests" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "CWyC45iiTG_y", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"a. Which state has the highest review counts? We expect that states with high populations such as New York, California, or Florida to have the highest review count. And We are hoping to use the review count to select a subset of the Yelp data to conduct our analysis on.\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "DOT_7of1TG_z", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"b. What is the accuracy of the customer having more friends rating? (Are these customers’ ratings close to average business stars).We expect that the ratings are pretty close. Based on these analysis, we can find out the impact of ratings from customers with more friends on restaurants.\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "jTPyHY9jTG_z", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"c. What is the accuracy of the customer ratings and their review reflection? (Are customers’ ratings really show their thoughts?) We expect their rating will. In this part, we will introduce a new numerical variable called sentiment, which is calculated by the positive and negative word found in review text." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "dWfq6SyjTG_0", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"d. What are the features the restaurants with high ratings will have? We expect that restaurants with features like free WiFi, high review counts, and serving alcohol will have higher ratings." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "EpXfAzlKTG_1", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"e. What is the relation between restaurants opening time and their review sentiments? We expect that restaurants with longer opening time will have higher sentiment ratings. \n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "Av-7YBsFTG_1", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"f. What are words with higher showing times in customers’ review? We expect words like: ‘best’, ‘great’, ‘yummy’, ‘love’, ‘tasty’ will show more. And for these words, Yelp can use them as their cities tags for states, which might improve customer using experience." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "88uob_MpTG_2", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"### 4. Data Description" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "QcrO9RXMTG_3", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"This dataset contains seven CSV files. In total, there are 5,200,000 user reviews, information on 174,000 businesses, and data spans 11 metropolitan areas." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "QCUgXDPHTG_3", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "UhdDQbmDTG_7", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"## Data Processing and Analysis" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "S4KzPdNaTG_7", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"MILESTONES AND PROGRESS " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "bzas0IwSTG_8", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"(1). Data collection - scrape the data from kaggle so we have the base to solve the problem." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "g6gvuVtpTG_8", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"(2). Data preparation - which includes data cleansing and data transformation" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "6jG7Q7gmTG_9", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"- Transfer json into pandas dataframe with proper indexing \n", | |
"- Deal with missing values.\n", | |
"- Merge and reshape multiple dataframes.\n", | |
"- Delete unnecessary columns which could add ambiguity based on logical assumptions.\n", | |
"- Delete duplicate data about restaurants and combine the values.\n", | |
"- Fix typographical errors in the dataset.\n", | |
"- Transfer the review texts into lowercase and remove punctuations." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "ZolHAJeLTG_9", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "qPFOjFSrTG_-", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"NEXT STEP" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "VoJCiJwwTG_-", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "jNa6EyvBTG__", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "UY3d_EEjTHAA", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"### 1. Data Collection" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "_lcEt0e0THAA", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
} | |
] | |
} |