""" | |
The account was scraped using the following Twitter library: https://github.com/sixohsix/twitter | |
I prefer this library over Tweepy as it is much "closer" to the API (less abstraction). | |
The way the following script works is basically as follows: | |
- First, I get all of the user's friend (I get max 200 friend per call) and store their the friends' names in a list | |
- Once I have the list of all the user's friends, I iterate through it to get the friends of the friends | |
- In a second script, I count the number of times a particular username occurs, and then return the sorted list of | |
friends (the most followed ones). | |
- That's it, more or less | |
AltShiftX needed his personal account scraped (not his account AltShiftX). The account @AltShiftX can be scraped in less | |
than 45 calls which takes about 30 minutes. He follows ~420 people on his personal account, so it took longer. | |
I made some calculations and the minimum time it could take (that is, if all his friends followed < 200 people) is | |
7.5 hours. The maximum time it could take (that is if all your friends followed > 1000 people as I'm only getting a list | |
up to a 1000) is about 33.5. I ran the script from my Raspberry Pi starting Friday 3/4pm and the script was finished | |
on Saturday circa 1pm. | |
There are way to improve this: if I had used friend/ids instead of friends/list, I could have scraped 5,000 friends per | |
call, so for example, 420 friends take 3 friends/lists calls (200+200+20) but a single friend/ids call. The disadvantage | |
is that friends/list give you the full details of the accounts scraped, while ids only return their id number. However, | |
You could then use users/lookup to get the full details of up to 100 ids in 1 call, so it would be pretty fast to get | |
a decent amount of details as 15 calls will get you 1500 (100 * 15) user details, which is a lot :-) | |
""" | |
import time
import json
from twitter import *
import datetime

print("Starting to scrape! " + str(datetime.datetime.now()))
with open("config.json") as data_file: # just a config file I have to save my keys | |
data = json.load(data_file) | |
data = data["MyAccountName"] # Account whose keys I will use (I have a couple of accounts) | |
CONSUMER_KEY = data["consumer_key"] | |
SECRET_CONSUMER_KEY = data["secret_consumer_key"] | |
ACCESS_TOKEN = data["access_token"] | |
SECRET_ACCESS_TOKEN = data["secret_access_token"] | |
t = Twitter(auth=OAuth(ACCESS_TOKEN, SECRET_ACCESS_TOKEN, CONSUMER_KEY, SECRET_CONSUMER_KEY))

account_to_be_scraped = "Insert account to be scraped here"
file_out = "recommendations.txt"  # the friends of friends will be appended to this file
friends = []  # holds the list of friends of account_to_be_scraped
next_page_loc = -1  # cursor: -1 requests the first page; a next_cursor of 0 means there is no further page
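
# First pass: page through account_to_be_scraped's own friend list, 200 names per call.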
friends_list = t.friends.list(screen_name=account_to_be_scraped,
                              count=200,
                              skip_status=True)

while next_page_loc != 0:
    next_page_loc = friends_list['next_cursor']  # get the location of the next page
    for friend in friends_list['users']:
        friends.append(friend['screen_name'])
    if next_page_loc == 0:
        break  # we reached the end of the user's friend list
    else:
        friends_list = t.friends.list(screen_name=account_to_be_scraped,
                                      cursor=next_page_loc,
                                      count=200,
                                      skip_status=True)

with open(file_out, "a") as out:
    for friend in friends:
        out.write(friend + "\n")
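
# Second pass: for each friend of the target account, fetch up to 1,000 of their own
# friends (5 pages of 200) and append the screen names to file_out.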
for friend in friends:
    friend_friends_list = t.friends.list(screen_name=friend,
                                         count=200,
                                         skip_status=True)
    count = 0  # needed for friends who follow a large number of people; we only scrape the first 1,000 (5 * 200)
    while count < 5:
        next_page_loc = friend_friends_list['next_cursor']
        for follow in friend_friends_list['users']:
            with open(file_out, "a") as out:
                out.write(follow['screen_name'] + "\n")
        remaining_calls = t.application.rate_limit_status(resources="friends")["resources"]["friends"]["/friends/list"]["remaining"]
        if remaining_calls <= 1:
            print("Got sleepy while scraping {}'s data...".format(friend))
            time.sleep(60 * 15)  # wait 15 minutes for the API limit to replenish
            remaining_calls = t.application.rate_limit_status(resources="friends")["resources"]["friends"]["/friends/list"]["remaining"]
            print("Sleep over! I now have {} calls left".format(remaining_calls))
        if next_page_loc == 0:
            break  # end of this friend's list
        else:
            friend_friends_list = t.friends.list(screen_name=friend,
                                                 cursor=next_page_loc,
                                                 count=200,
                                                 skip_status=True)
        count += 1
print("over and out :-)" + str(datetime.datetime.now())) |