@schochastics
Last active September 22, 2021 08:40
Get tweets from the Academic Research product track.
# start_time: %Y-%m-%dT%H:%M:%SZ
# end_time: %Y-%m-%dT%H:%M:%SZ
# needs jsonlite and httr
# next_token can be obtained from meta$next_token to paginate through results
get_tweets <- function(q = "", n = 10, start_time, end_time, token, next_token = "") {
  # the endpoint accepts 10-500 results per request; clamp n to that range
  if (n > 500) {
    warning("n too big. Using 500 instead")
    n <- 500
  }
  if (n < 10) {
    warning("n too small. Using 10 instead")
    n <- 10
  }
  if (missing(token)) {
    stop("bearer token must be specified.")
  }
  # default time window: from midnight today until now
  if (missing(end_time)) {
    end_time <- gsub(" ", "T", paste0(as.character(Sys.time()), "Z"))
  }
  if (missing(start_time)) {
    start_time <- paste0(Sys.Date(), "T00:00:00Z")
  }
  # accept the token with or without the "Bearer " prefix
  if (substr(token, 1, 7) == "Bearer ") {
    bearer <- token
  } else {
    bearer <- paste0("Bearer ", token)
  }
  # endpoint
  url <- "https://api.twitter.com/2/tweets/search/all"
  # parameters
  params <- list(
    "query" = q,
    "max_results" = n,
    "start_time" = start_time,
    "end_time" = end_time,
    "tweet.fields" = "attachments,author_id,context_annotations,conversation_id,created_at,entities,geo,id,in_reply_to_user_id,lang,public_metrics,possibly_sensitive,referenced_tweets,source,text,withheld",
    "user.fields" = "created_at,description,entities,id,location,name,pinned_tweet_id,profile_image_url,protected,public_metrics,url,username,verified,withheld",
    "expansions" = "author_id,entities.mentions.username,geo.place_id,in_reply_to_user_id,referenced_tweets.id,referenced_tweets.id.author_id",
    "place.fields" = "contained_within,country,country_code,full_name,geo,id,name,place_type"
  )
  if (next_token != "") {
    params[["next_token"]] <- next_token
  }
  r <- httr::GET(url, httr::add_headers(Authorization = bearer), query = params)
  # retry with increasing backoff to get around random 503 errors
  count <- 0
  while (httr::status_code(r) == 503 && count < 4) {
    count <- count + 1
    Sys.sleep(count * 5)
    r <- httr::GET(url, httr::add_headers(Authorization = bearer), query = params)
  }
  if (httr::status_code(r) != 200) {
    stop(paste("something went wrong. Status code:", httr::status_code(r)))
  }
  # warn when the current 15-minute rate limit window is about to be exhausted
  if (httr::headers(r)$`x-rate-limit-remaining` == "1") {
    warning(paste("x-rate-limit-remaining=1. Resets at",
                  as.POSIXct(as.numeric(httr::headers(r)$`x-rate-limit-reset`), origin = "1970-01-01")))
  }
  dat <- jsonlite::fromJSON(httr::content(r, "text"))
  dat
}
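
A minimal call, for illustration (the query and time window are made up; the bearer token is assumed to live in an environment variable with a hypothetical name):

# fetch up to 100 tweets matching "rstats" from January 2021
bearer <- Sys.getenv("TWITTER_BEARER") # hypothetical env var holding the token
res <- get_tweets(q = "rstats", n = 100,
                  start_time = "2021-01-01T00:00:00Z",
                  end_time   = "2021-01-31T23:59:59Z",
                  token = bearer)
head(res$data$text) # tweet payload
res$meta$next_token # pass back in to fetch the next page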

schochastics commented Feb 5, 2021

issues:

  • meaningful error handling
  • doesn't seem to return all fields (dat$data -> tweet data, dat$includes -> user data); see the sketch after this list
  • fix random "503 Service Unavailable" errors
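
One way to line the two up: dat$includes$users holds one row per referenced user, keyed by id, and can be merged onto the tweets by author_id. A minimal sketch, assuming dat is the return value of get_tweets():

# attach author profiles to tweets; users are keyed by id, tweets by author_id
tweets <- dat$data
users  <- dat$includes$users
merged <- merge(tweets, users, by.x = "author_id", by.y = "id",
                suffixes = c("", "_user"))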


schochastics commented Feb 5, 2021

simple pagination (roughly 150,000 tweets per 15 minutes without going over the rate limit: one request of up to 500 tweets every ~3 seconds)

next_token <- ""
k <- 0
while (k < 15 * 60) {
  df <- get_tweets(q, n, start_time, end_time, bearer, next_token)
  # save each page, named after the id of the last tweet it contains
  jsonlite::write_json(df$data, paste0("data/", "data_", df$data$id[nrow(df$data)], ".json"))
  jsonlite::write_json(df$includes, paste0("data/", "includes_", df$data$id[nrow(df$data)], ".json"))
  next_token <- df$meta$next_token # this is NULL if there are no pages left
  if (is.null(next_token)) break   # stop instead of passing NULL back in
  Sys.sleep(3.1)                   # ~1 request every 3 seconds stays under the limit
  k <- k + 3
  cat(k, ": ", "(", nrow(df$data), ") ", df$data$created_at[nrow(df$data)], "\n", sep = "")
}
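
To stitch the saved pages back together later, the data_*.json files can be read and row-bound. A minimal sketch, assuming all pages were written to data/ as above and share the same columns (field availability can vary across pages, so a more tolerant bind may be needed in practice):

# read every saved page and combine into one data frame
files <- list.files("data", pattern = "^data_.*\\.json$", full.names = TRUE)
pages <- lapply(files, jsonlite::fromJSON)
tweets <- do.call(rbind, pages)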

@aksoyundan

Thank you very much for this!

@luisignaciomenendez

Hello!

Could you provide a simple example of how to scrape some tweets? I have been trying to apply the function myself to obtain a simple sample of tweets, but I can't figure out what I am doing wrong.

Thanks a lot in advance!


schochastics commented Mar 19, 2021

@luisignaciomenendez This code has been turned into a package: https://github.com/cjbarrie/academictwitteR
The package should be easier to use than the code above.
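
For a quick start with the package, something along these lines should work (a sketch based on the package's get_all_tweets(); argument names may differ between versions):

# install.packages("academictwitteR")
library(academictwitteR)
tweets <- get_all_tweets(
  query = "rstats",
  start_tweets = "2021-01-01T00:00:00Z",
  end_tweets = "2021-01-31T23:59:59Z",
  bearer_token = Sys.getenv("TWITTER_BEARER") # hypothetical env var holding the token
)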

@jeffcsauer

Having a bit of trouble getting this off the ground (as both a standalone .R file and via the academictwitteR package). Any thoughts? I'm getting a 400 status code with the following:

next_token <- ""
k <- 0
while (k < 3 * 3) {
  df <- get_tweets(
    "beyonce",
    n = 500,
    start_time = "2010-01-01T00:00:00Z0",
    end_time = "2010-10-01T00:00:00Z",
    token = bearer_token
  )
  jsonlite::write_json(df$data, paste0("data/", "data_", df$data$id[nrow(df$data)], ".json"))
  jsonlite::write_json(df$includes,
                       paste0("data/", "includes_", df$data$id[nrow(df$data)], ".json"))
  next_token <-
    df$meta$next_token #this is NULL if there are no pages left
  Sys.sleep(3.1)
  k <- k + 3
  cat(k,
      ": ",
      "(",
      nrow(df$data),
      ") ",
      df$data$created_at[nrow(df$data)],
      "\n",
      sep = "")
}
