@zhouzhuojie
Created August 11, 2014 22:37
Yelp Dataset
### Installation (Ubuntu OS, others similar)
```bash
sudo apt-get install mongodb
sudo pip install pymongo # or sudo easy_install pymongo
```
### Restore the data into mongodb
```bash
mongorestore --db yelp ./dump/yelp
```
The data is restored into a database named `yelp`, which contains 5 collections:
```
businesses
chekins
reviews
tips
users
```
These collections hold all the metadata that can be used for estimation.
### Restore the pickle graph
For the Yelp user graph, I have preprocessed all the users in the `users` collection and stored the whole connected, undirected graph as `yelp_users.pickle`, so you can access the user graph with:
```python
import pickle
graph = pickle.load(open('./yelp_users.pickle', 'rb'))  # pickles must be opened in binary mode
print graph.nodes()[0]  # The node's label is a unicode string, e.g. u'MtX0WZ4bqMfFuYvtupgRqg'
# Graph info
# Number of nodes: 119839
# Number of edges: 954116
# Average degree: 15.9233
```
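The average degree quoted above follows from the node and edge counts: in an undirected graph each edge touches two nodes, so the average degree is `2 * edges / nodes`. A minimal sanity check in plain Python (the helper name `average_degree` is only illustrative, not part of the gist):

```python
def average_degree(num_nodes, num_edges):
    """Average degree of an undirected graph: every edge contributes to two nodes."""
    return 2.0 * num_edges / num_nodes

# Counts taken from the graph info above
avg = average_degree(119839, 954116)
print(round(avg, 4))
```

If the value printed for your copy of the pickle differs from `nx.info(graph)`, the graph file is probably not the one described here.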
At this point, you can run your program as before; the graph is just a plain networkx graph. If you want richer features or info about the users, follow the code below.
### Rich info
Each user is stored as the following structure:
```json
{
'type': 'user',
'user_id': (unique user identifier),
'name': (first name, last initial, like 'Matt J.'),
'review_count': (review count),
'average_stars': (floating point average, like 4.31),
'votes': {
'useful': (count of useful votes across all reviews),
'funny': (count of funny votes across all reviews),
'cool': (count of cool votes across all reviews)
}
}
```
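For example (actual documents also carry fields such as `yelping_since`, `friends`, `fans`, `compliments`, and `elite`),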
```json
{
"_id":ObjectId("53e282f680183c24b7fb0dda"),
"yelping_since":"2013-11",
"votes":{
"funny":0,
"useful":2,
"cool":0
},
"review_count":9,
"name":"Clarinda",
"user_id":"as22TLsZn_SwVv4oCjxdMg",
"friends":[
"ND6DMIKxM8Q1ShEMZuA5rA",
"M-O0tasOl0SGiUsxdO5cZw",
"1cDIfb6TSh71n99Ark5n3A",
"NYhXvxMqbVsB7lcFrgNqow",
"bFe-sUMDaDm2Q9u5Cve1tQ",
"7P9okhRRYG0hz02Fk7tRMw",
"enR0fiE0u_jfmW2x3aqxUA",
"W4YsRCa1Xq4wrRRf53E5ZQ",
"b89mmlWnUfrIzpuftP3cgQ",
"nutDqAZ0fyOmz8yAqbCvFw",
"BPbrH2VQASzR9Le6oC4DpQ"
],
"fans":0,
"average_stars":4,
"type":"user",
"compliments":{
"plain":1
},
"elite":[
]
}
```
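The `friends` array in the example is presumably what the user graph was derived from, although the gist does not show the preprocessing step. The following is only a sketch of one way to turn such documents into undirected edges; the function `build_edges` and the toy documents are hypothetical, and the rule of dropping friends that are not themselves in the collection is an assumption.

```python
def build_edges(user_docs):
    """Return a set of undirected edges (sorted 2-tuples) from user documents.

    Only friendships whose two endpoints both appear in user_docs are kept.
    """
    docs = list(user_docs)  # allow cursors/generators to be scanned twice
    known = {doc["user_id"] for doc in docs}
    edges = set()
    for doc in docs:
        u = doc["user_id"]
        for v in doc.get("friends", []):
            if v in known and v != u:
                # Sorting makes (u, v) and (v, u) the same edge
                edges.add(tuple(sorted((u, v))))
    return edges

# Hypothetical toy documents, shaped like the example above
docs = [
    {"user_id": "a", "friends": ["b", "c"]},
    {"user_id": "b", "friends": ["a"]},
    {"user_id": "c", "friends": ["a", "zzz"]},  # "zzz" is not a known user
]
edges = build_edges(docs)
print(sorted(edges))
```

The resulting edge set could be passed to networkx with `Graph.add_edges_from`. Since the pickled graph is described as connected, some further step (for example, keeping only the largest connected component) was presumably applied; that is a guess, not something the gist states.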
So one can sample users and estimate the average of attributes such as `average_stars`, `review_count`, `votes.funny`, `votes.useful`, and `votes.cool`.
Given a `user_id`, we can query MongoDB to get the info for that user:
```python
import pymongo
import networkx as nx
import pickle
import random
# Prepare db and graph
db = pymongo.MongoClient().yelp
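graph = pickle.load(open('./yelp_users.pickle', 'rb'))  # binary mode for pickles
print nx.info(graph)          # node/edge counts and average degree
print nx.is_connected(graph)  # should print True for this graph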
# Given a user u, find its info
u = random.choice(graph.nodes()) # here, u is a user_id string,
# e.g. u"ND6DMIKxM8Q1ShEMZuA5rA"
u_info = db.users.find_one({'user_id': u})
print u_info
print u_info['average_stars']
print u_info['review_count']
```
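To go from a single lookup to an estimate over many users, one option is to draw a uniform random sample of node ids and average a field over the returned documents. The sketch below assumes plain uniform sampling without replacement, which may differ from the sampling scheme you actually want to evaluate. The names `get_field`, `sample_mean`, and `fake_users` are hypothetical; the lookup is passed in as a function so the example runs without a database.

```python
import random

def get_field(doc, dotted_key):
    """Read a possibly nested field such as 'votes.funny' from a dict."""
    value = doc
    for part in dotted_key.split("."):
        value = value[part]
    return value

def sample_mean(user_ids, lookup, field, k, seed=None):
    """Estimate the mean of `field` from k users drawn uniformly without replacement.

    `lookup` maps a user_id to its document; with the database it could be
    lambda u: db.users.find_one({'user_id': u}).
    """
    rng = random.Random(seed)
    sampled = rng.sample(list(user_ids), k)
    values = [get_field(lookup(u), field) for u in sampled]
    return sum(values) / float(len(values))

# Hypothetical in-memory stand-in for db.users
fake_users = {
    "u1": {"average_stars": 4.0, "votes": {"funny": 0}},
    "u2": {"average_stars": 3.0, "votes": {"funny": 2}},
    "u3": {"average_stars": 5.0, "votes": {"funny": 4}},
}
estimate = sample_mean(fake_users, fake_users.get, "average_stars", k=3, seed=0)
print(estimate)
```

With the real data, `user_ids` would be `graph.nodes()` and `lookup` the `find_one` query shown in the docstring. Issuing one query per sampled user is simple but slow for large samples.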
### Dataset reference
[Yelp Dataset Challenge](http://www.yelp.com/dataset_challenge)