Created
August 11, 2014 22:37
Yelp Dataset
### Installation (Ubuntu OS, others similar)
```bash
sudo apt-get install mongodb
sudo pip install pymongo # or sudo easy_install pymongo
```
### Restore the data into MongoDB
```bash
mongorestore --db yelp ./dump/yelp
```
The data is restored into a database named `yelp`, which contains 5 collections:
```
businesses
chekins
reviews
tips
users
```
These collections hold all the metadata that can be used for estimation.
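To confirm that the restore worked, one could count the documents in each collection. The following is a minimal sketch rather than part of the original setup: `collection_sizes` is a hypothetical helper, and it assumes a pymongo version (3.7 or later) in which `list_collection_names()` and `count_documents()` are available.

```python
def collection_sizes(db):
    """Return {collection name: document count} for a pymongo-style database.

    Assumption: `db` exposes list_collection_names() and db[name].count_documents(),
    as pymongo's Database and Collection objects do in version 3.7 and later.
    """
    return {name: db[name].count_documents({})
            for name in sorted(db.list_collection_names())}


# Usage against the restored data (requires a running mongod):
#   import pymongo
#   print(collection_sizes(pymongo.MongoClient().yelp))
```

If the restore succeeded, the returned dictionary should have one entry per collection listed above.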
### Restore the pickle graph
For the Yelp user graph, I preprocessed all the users in the `users` collection and stored the whole connected, undirected graph as `yelp_users.pickle`, so you can access the user graph with
```python
import pickle
with open('./yelp_users.pickle', 'rb') as f:  # pickle data is binary
    graph = pickle.load(f)
print graph.nodes()[0]  # The node's label is a unicode string, e.g. u'MtX0WZ4bqMfFuYvtupgRqg'
# Graph info
# Number of nodes: 119839
# Number of edges: 954116
# Average degree: 15.9233
```
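The average degree reported above follows from the node and edge counts: in an undirected graph every edge adds one to the degree of two nodes, so the average degree is 2E/N. A quick check in plain Python (the helper name `average_degree` is only for illustration):

```python
def average_degree(num_nodes, num_edges):
    """Average degree of an undirected graph: each edge counts toward two nodes."""
    return 2.0 * num_edges / num_nodes


# Counts taken from the graph info above.
print(round(average_degree(119839, 954116), 4))  # 15.9233
```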
At this point you can run your program as before; the graph is a plain networkx graph. If you want richer per-user features, follow the code below.
### Rich info
Each user is stored as the following structure:
```json
{
    'type': 'user',
    'user_id': (unique user identifier),
    'name': (first name, last initial, like 'Matt J.'),
    'review_count': (review count),
    'average_stars': (floating point average, like 4.31),
    'votes': {
        'useful': (count of useful votes across all reviews),
        'funny': (count of funny votes across all reviews),
        'cool': (count of cool votes across all reviews)
    }
}
```
For example (actual records also carry fields not listed above, such as `yelping_since`, `friends`, `fans`, `compliments`, and `elite`):
```json
{
    "_id":ObjectId("53e282f680183c24b7fb0dda"),
    "yelping_since":"2013-11",
    "votes":{
        "funny":0,
        "useful":2,
        "cool":0
    },
    "review_count":9,
    "name":"Clarinda",
    "user_id":"as22TLsZn_SwVv4oCjxdMg",
    "friends":[
        "ND6DMIKxM8Q1ShEMZuA5rA",
        "M-O0tasOl0SGiUsxdO5cZw",
        "1cDIfb6TSh71n99Ark5n3A",
        "NYhXvxMqbVsB7lcFrgNqow",
        "bFe-sUMDaDm2Q9u5Cve1tQ",
        "7P9okhRRYG0hz02Fk7tRMw",
        "enR0fiE0u_jfmW2x3aqxUA",
        "W4YsRCa1Xq4wrRRf53E5ZQ",
        "b89mmlWnUfrIzpuftP3cgQ",
        "nutDqAZ0fyOmz8yAqbCvFw",
        "BPbrH2VQASzR9Le6oC4DpQ"
    ],
    "fans":0,
    "average_stars":4,
    "type":"user",
    "compliments":{
        "plain":1
    },
    "elite":[
    ]
}
```
So one can sample users to estimate the average of fields such as `average_stars`, `review_count`, `votes.funny`, `votes.useful`, and `votes.cool`.
Given a `user_id`, we can query MongoDB to get the info for that user.
```python
import pymongo
import networkx as nx
import pickle
import random
# Prepare db and graph
db = pymongo.MongoClient().yelp
with open('./yelp_users.pickle', 'rb') as f:  # pickle data is binary
    graph = pickle.load(f)
print nx.info(graph)
print nx.is_connected(graph)
# Given a user u, find its info
u = random.choice(graph.nodes())  # here, u is a user_id string,
                                  # e.g. u"ND6DMIKxM8Q1ShEMZuA5rA"
u_info = db.users.find_one({'user_id': u})
print u_info
print u_info['average_stars']
print u_info['review_count']
```
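Building on the single-user lookup, the sampling idea mentioned earlier could be sketched as follows. This is only one simple baseline (uniform sampling of nodes without replacement), not necessarily the estimator intended here; `get_field` and `sample_mean` are hypothetical helper names, and `lookup` is assumed to be any callable that maps a `user_id` to that user's document.

```python
import random


def get_field(doc, dotted_key):
    """Follow a dotted key such as 'votes.useful' through nested dicts."""
    value = doc
    for part in dotted_key.split('.'):
        value = value[part]
    return value


def sample_mean(user_ids, lookup, field, k, seed=None):
    """Estimate the mean of `field` from k user IDs drawn uniformly
    without replacement. `lookup` maps a user_id to a user document."""
    rng = random.Random(seed)
    chosen = rng.sample(list(user_ids), k)
    return sum(get_field(lookup(u), field) for u in chosen) / float(k)


# Usage with the objects from the snippet above:
#   lookup = lambda u: db.users.find_one({'user_id': u})
#   print(sample_mean(graph.nodes(), lookup, 'average_stars', k=100))
```

Each sampled user costs one `find_one` round trip, so for large `k` it may be worth fetching the documents in a single query instead.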
### Dataset reference
[Yelp Dataset Challenge](http://www.yelp.com/dataset_challenge)