@andymithamclarke
Last active September 21, 2022 06:02
Collecting Tweets from the Twitter API using tweepy.
{
"cells": [
{
"cell_type": "markdown",
"source": "# Twitter Search API 🔎💬",
"metadata": {
"id": "D8dyDeluNACR",
"cell_id": "00001-41dd5298-8cc4-498c-948f-58809891a2b0",
"deepnote_cell_type": "markdown"
}
},
{
"cell_type": "markdown",
"source": "This is a Graphext notebook made in conjunction with <a href=\"https://www.graphext.com/docs/collecting-twitter-data\" target=\"_blank\">our guide on collecting data from Twitter</a>.\n\nThe notebook uses <a href=\"http://docs.tweepy.org/en/v3.5.0/getting_started.html\" target=\"_blank\">Tweepy</a> to get data from the Twitter API. \n\nWithin the notebook, you can set a search query using the <a href=\"https://gist.github.com/andyclarkemedia/3b4e062a45323138bd28ec52d80eb7b1\" target=\"_blank\">Twitter query language</a> to return specific tweets. The results returned will be tweets matching your query that were posted within the last week.",
"metadata": {
"id": "ryqNyAz_NACW",
"cell_id": "00002-43d00c08-e941-436b-bf2b-b0f2f24f6629",
"deepnote_cell_type": "markdown"
}
},
{
"cell_type": "code",
"metadata": {
"id": "WHW_T20-NACY",
"cell_id": "00009-7c45fad0-3fba-49c6-b03d-8d19f2c7c34a",
"deepnote_to_be_reexecuted": false,
"source_hash": "b623e53d",
"execution_millis": 3,
"output_cleared": false,
"execution_start": 1615294745705,
"deepnote_cell_type": "code"
},
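"source": "# Install the 'tweepy' package\n# This notebook uses the Tweepy v3 interface (AppAuthHandler, api.search, wait_on_rate_limit_notify), so stay below v4\n!pip install \"tweepy<4\"\n# Import the 'tweepy' package\nimport tweepy",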
"execution_count": 12,
"outputs": []
},
{
"cell_type": "markdown",
"source": "## Setup ⚙️",
"metadata": {
"tags": [],
"cell_id": "00004-b47e35ec-92c2-49aa-88da-612638ded490",
"deepnote_cell_type": "markdown"
}
},
{
"cell_type": "markdown",
"source": "The following steps configure `tweepy` to make API calls to Twitter. \n\nYou need to provide your own Twitter API keys: `api_key` and `api_secret`. Since we are using _App Authentication_, you **don't need** the `access_token` and `access_token_secret` that are usually required for retrieving data from the Twitter API with _User Authentication_.\n\nYou must sign up for a <a href=\"https://developer.twitter.com/en/apply-for-access\" target=\"_blank\">Twitter developer account</a> in order to get an `api_key` and an `api_secret`. \n\nFor details on how to do this, <a href=\"https://www.graphext.com/help-center-articles/collecting-twitter-data\" target=\"_blank\">follow our guide</a>.",
"metadata": {
"tags": [],
"cell_id": "00005-89269c3e-6309-4019-8118-0e238f28cd19",
"deepnote_cell_type": "markdown"
}
},
{
"cell_type": "code",
"metadata": {
"id": "nB_94U7YNACY",
"cell_id": "00011-4d117785-1691-41c2-ba13-b6bf56519ff1",
"deepnote_to_be_reexecuted": false,
"source_hash": "6e4c8aea",
"execution_millis": 0,
"execution_start": 1615294277266,
"deepnote_cell_type": "code"
},
"source": "# Set your API keys - This information is yours alone and should be kept private.\nimport os\napi_key = ' '\napi_secret = ' '",
"execution_count": 2,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "P6WYJbZYNACZ",
"cell_id": "00012-fe0e4989-9f36-43e3-879b-176c347bdb5e",
"deepnote_to_be_reexecuted": false,
"source_hash": "4fca230f",
"execution_millis": 109,
"execution_start": 1615294279858,
"deepnote_cell_type": "code"
},
"source": "# Authorize your api key and api secret \nauth = tweepy.AppAuthHandler(api_key, api_secret)\n# Call the API \napi = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True) ",
"execution_count": 3,
"outputs": []
},
{
"cell_type": "markdown",
"source": "## Getting Tweets Using a Search Query 🐦",
"metadata": {
"id": "Xd-LuzRNNACZ",
"cell_id": "00016-081150c6-422b-4ff1-b619-dfa9326931eb",
"deepnote_cell_type": "markdown"
}
},
{
"cell_type": "markdown",
"source": "Next, you will search the Twitter API for tweets using the <a href=\"https://gist.github.com/andyclarkemedia/3b4e062a45323138bd28ec52d80eb7b1\" target=\"_blank\">Twitter query language</a>.\n\nThe tweets matching your query will be saved in a dictionary using each tweet's ID as the key.\n\nThe information returned from your query is as follows. The text inside `< ... >` tells Graphext the variable type of values in this field.\n\n\n>`id<gx:number>`: The ID of the tweet, used as the key for each dictionary entry.\n\n>`text<gx:text>`: The text of the tweet.\n\n>`created_at<gx:date>`: The date the tweet was posted.\n\n>`author_id<gx:number>`: The ID of the user that posted the tweet.\n\n>`author_name<gx:category>`: The name of the user holding the account that posted the tweet.\n\n>`author_handler<gx:category>`: The handle of the account that posted the tweet.\n\n>`author_user_agent<gx:category>`: The client application (source) used to post the tweet.\n\n>`user_description<gx:text>`: The description of the account that posted the tweet.\n\n>`user_location<gx:text>`: The location, if provided, of the user that posted the tweet.\n\n>`author_avatar<gx:url>`: The image link of the profile picture used by the account that posted the tweet.\n\n>`user_followers_count<gx:number>`: The number of followers the account that posted the tweet has.\n\n>`user_created_at<gx:date>`: The creation date of the account that posted the tweet.\n\n>`user_following_count<gx:number>`: The number of accounts that the account posting the tweet follows.\n\n>`user_verified<gx:boolean>`: A `True` or `False` value denoting whether the account posting the tweet is a verified account.\n\n>`lang<gx:category>`: The language of the tweet.\n\n>`tweet_hashtags<gx:list[category]>`: The hashtags used inside the tweet.\n\n>`tweet_symbols<gx:list[category]>`: The symbols (cashtags such as `$TWTR`) used inside the tweet.\n\n>`mention_names<gx:list[category]>`: The handles of accounts mentioned inside the tweet.\n\n>`mention_ids<gx:list[number]>`: The IDs of the accounts mentioned inside the tweet.\n\n>`n_retweets<gx:number>`: The number of retweets received by the tweet.\n\n>`n_favorites<gx:number>`: The number of times the tweet was favorited by other users.\n\n>`is_retweet<gx:boolean>`: A `True` or `False` value denoting whether the tweet is a retweet or not.\n\n>`original_tweet_user_handle<gx:category>`: If `is_retweet` is `True`, this value provides the handle of the user that originally posted the tweet.\n\n>`original_tweet_id<gx:number>`: If `is_retweet` is `True`, this value provides the ID of the original tweet.\n",
"metadata": {
"id": "xU6eHoUVNACa",
"cell_id": "00018-500919c2-eede-4269-a0bf-fa26c79558ef",
"deepnote_cell_type": "markdown"
}
},
{
"cell_type": "markdown",
"source": "##### Check out the <a href=\"https://gist.github.com/andyclarkemedia/3b4e062a45323138bd28ec52d80eb7b1\" target=\"_blank\">Twitter query language</a> to learn how to format a search query using Twitter's advanced search operators.",
"metadata": {
"tags": [],
"cell_id": "00010-93106e85-d0d5-47d5-aa6a-705fd134af91",
"deepnote_cell_type": "markdown"
}
},
{
"cell_type": "code",
"metadata": {
"id": "QD-UAaUwNACa",
"cell_id": "00022-d43469b3-9d38-47d7-b4b6-45254fbe12a6",
"deepnote_to_be_reexecuted": false,
"source_hash": "cd1292af",
"execution_millis": 0,
"execution_start": 1615294331058,
"deepnote_cell_type": "code"
},
"source": "# Declare your search query 🔎👇\nsearch_query = \"#Euro2020\"",
"execution_count": 8,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "iONOtMzTBqSA",
"cell_id": "00023-cb490c00-2136-4788-bcf4-97e8f0852670",
"deepnote_to_be_reexecuted": false,
"source_hash": "61b0bd2e",
"execution_millis": 2,
"execution_start": 1615294332723,
"deepnote_cell_type": "code"
},
"source": "# Max Tweet ID - allows you to pause and resume tweet collection from the tweet you last collected\n# Each time you run a NEW collection process using a different query - Run this command to reset the ID of the tweet you want to start collecting from\nmax_id = None",
"execution_count": 9,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "HSMeNcb6NACb",
"cell_id": "00024-5429b43e-0ca7-4f09-a0b2-4851f33ee803",
"deepnote_to_be_reexecuted": false,
"source_hash": "572c10a4",
"execution_millis": 472,
"execution_start": 1615294334550,
"deepnote_cell_type": "code"
},
"source": "# Make the Twitter API request\n# The message printed at the end reports how many tweets you have collected\n\n# Note that this process might take some time.\n# Additionally, the loop will pause each time you have exceeded Twitter's rate limit for collecting tweets.\n# It will resume progress automatically once enough time has passed to begin collecting tweets again.\n\n# Start a new dictionary for a new collection; keep the tweets already collected when resuming\nif max_id is None:\n    tweets = {}\n\ntry:\n\n    # Use the 'lang' parameter to set a specific language for your query. E.g. 'es' for Spanish\n    parameters = {\n        \"q\": search_query,\n        \"count\": 100,\n        \"lang\": \"\",\n        \"max_id\": max_id\n    }\n\n    # Enter a number inside the '.items()' brackets to limit the size of your results.\n    # Rate limits: 100 tweets max per request; 180 requests per 15 min with user auth or 450 with app auth. (We are using app auth)\n    # That is up to 45K tweets per 15 min. Remove the number from '.items()' if you want to obtain all matching tweets.\n    # https://developer.twitter.com/en/docs/twitter-api/v1/tweets/search/api-reference/get-search-tweets\n\n    tweet_requests = tweepy.Cursor(api.search, **parameters).items(10)\n\n    print(tweet_requests)\n\n    for tweet in tweet_requests:\n\n        # Add new tweet to dictionary\n        tweets[tweet.id] = {\n            'text<gx:text>': tweet.text,\n            'created_at<gx:date>': tweet.created_at,\n            'author_id<gx:number>': tweet.user.id,\n            'author_name<gx:category>': tweet.user.name,\n            'author_handler<gx:category>': str(tweet.user.screen_name),\n            'author_user_agent<gx:category>': tweet.source,\n            'user_description<gx:text>': tweet.user.description,\n            'user_location<gx:text>': tweet.user.location,\n            'author_avatar<gx:url>': tweet.user.profile_image_url,\n            'user_followers_count<gx:number>': tweet.user.followers_count,\n            'user_created_at<gx:date>': tweet.user.created_at,\n            'user_following_count<gx:number>': tweet.user.friends_count,\n            'user_verified<gx:boolean>': tweet.user.verified,\n            'lang<gx:category>': tweet.lang,\n            'tweet_hashtags<gx:list[category]>': tweet.entities['hashtags'],\n            'tweet_symbols<gx:list[category]>': tweet.entities['symbols'],\n            'mention_names<gx:list[category]>': [\"@\" + d['screen_name'] for d in tweet.entities['user_mentions'] if 'screen_name' in d],\n            'mention_ids<gx:list[number]>': [d['id'] for d in tweet.entities['user_mentions'] if 'id' in d],\n            'n_retweets<gx:number>': tweet.retweet_count,\n            'n_favorites<gx:number>': tweet.favorite_count,\n            'is_retweet<gx:boolean>': hasattr(tweet, 'retweeted_status')\n        }\n\n        if tweets[tweet.id]['is_retweet<gx:boolean>']:\n            tweets[tweet.id]['original_tweet_user_handle<gx:category>'] = tweet.retweeted_status.user.screen_name\n            tweets[tweet.id]['original_tweet_id<gx:number>'] = str(tweet.retweeted_status.id)\n\n        # Resume from just below the latest tweet ID so that tweet is not requested again\n        max_id = tweet.id - 1\n\n# Catches any error\nexcept Exception as e:\n    print(\"Something went wrong. Run the command again to continue from the ID of the tweet you last collected. You can also download the {} tweets you've collected so far by running the rest of the notebook. \\n \\n The error message:\".format(len(tweets)), e)\n\n# When process completes\nelse:\n    print(\"\\nHurray! You've collected all tweets matching your query. You have {} tweets ready to export. Now run the rest of the notebook to download your data.\".format(len(tweets)))",
"execution_count": 10,
"outputs": [
{
"name": "stdout",
"text": "<tweepy.cursor.ItemIterator object at 0x7ff0263467d0>\n\nHurray! You've collected all tweets matching your query. You have 10 tweets ready to export. Now run the rest of the notebook to download your data.\n",
"output_type": "stream"
}
]
},
{
"cell_type": "markdown",
"source": "## Exporting the Data ⬇️",
"metadata": {
"id": "KwQD2DkvNACb",
"cell_id": "00026-5a6d79eb-308e-463a-b414-bce8525ecca1",
"deepnote_cell_type": "markdown"
}
},
{
"cell_type": "markdown",
"source": "After making the request you will now export the data inside your dictionary to a csv file. First, load the dictionary into a dataframe. Then, export the dataframe as a csv file.",
"metadata": {
"id": "sQNIF8F-NACb",
"cell_id": "00027-1f1c8ea3-491b-44e2-afa4-ed33a6e8745a",
"deepnote_cell_type": "markdown"
}
},
{
"cell_type": "code",
"metadata": {
"id": "GTAuc9UiNACc",
"cell_id": "00028-32b52cca-4910-42b7-90c3-69c108fdea47",
"deepnote_to_be_reexecuted": false,
"source_hash": "c4c9c7d3",
"execution_millis": 0,
"execution_start": 1615294736142,
"deepnote_cell_type": "code"
},
"source": "# Import pandas\nimport pandas as pd\n\n# Save dictionary as a dataframe\ndf = pd.DataFrame.from_dict(tweets, orient='index')\n\n# Convert ID index to 'id' column\ndf['id<gx:number>'] = df.index\n\n# Export as csv\ndf.to_csv(search_query + ' '+ 'tweets.csv', index=False)\n\n# Inspect your dataset\ndf",
"execution_count": 12,
"outputs": []
}
],
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
},
"colab": {
"name": "Collecting-Twitter-Data.ipynb",
"provenance": [],
"collapsed_sections": []
},
"deepnote_notebook_id": "5b9519ee-f1f8-494f-b3fd-dc66a6e975f5",
"deepnote": {},
"deepnote_execution_queue": []
}
}
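The pause-and-resume behaviour that the notebook builds around `max_id` can be illustrated without calling Twitter at all. The following is only a sketch under stated assumptions: `fake_search` and `collect` are hypothetical helpers written for this illustration (they are not part of Tweepy or the Twitter API), and tweets are represented by bare integer IDs. It shows why resuming from one below the last collected ID avoids fetching the same tweet twice.

```python
# Hypothetical stand-in for the search endpoint (not a Tweepy function):
# returns up to `count` tweet IDs that are <= max_id, newest first.
def fake_search(all_ids, count, max_id=None):
    eligible = [i for i in sorted(all_ids, reverse=True)
                if max_id is None or i <= max_id]
    return eligible[:count]


def collect(all_ids, count, max_id=None, limit=None):
    """Page through results and return (collected_ids, next_max_id)."""
    collected = []
    while limit is None or len(collected) < limit:
        page = fake_search(all_ids, count, max_id)
        if not page:
            break  # nothing older is left to collect
        if limit is not None:
            page = page[: limit - len(collected)]
        collected.extend(page)
        # Subtract 1 so the next request starts below the last tweet collected
        max_id = page[-1] - 1
    return collected, max_id


ids = [101, 102, 103, 104, 105, 106, 107]

# First run stops early, as if the collection had been interrupted
first, resume_from = collect(ids, count=3, limit=4)

# Second run resumes from the saved max_id and picks up the remaining tweets
rest, _ = collect(ids, count=3, max_id=resume_from)

print(first)
print(rest)
```

Because the ID passed as `max_id` is treated as inclusive in this sketch, carrying over the last ID unchanged would return that tweet again on the next run; keying the results by tweet ID, as the notebook's dictionary does, would also absorb such a duplicate.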