Last active
September 21, 2022 06:02
Collecting Tweets from the Twitter API using tweepy.
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"source": "# Twitter Search API 🔎💬", | |
"metadata": { | |
"id": "D8dyDeluNACR", | |
"cell_id": "00001-41dd5298-8cc4-498c-948f-58809891a2b0", | |
"deepnote_cell_type": "markdown" | |
} | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": "This is a Graphext notebook made in conjunction <a href=\"https://www.graphext.com/docs/collecting-twitter-data\" target=\"_blank\">our guide on collecting data from Twitter</a>.\n\nThe notebook uses <a href=\"http://docs.tweepy.org/en/v3.5.0/getting_started.html\" target=\"_blank\">Tweepy</a> to get data from the Twitter API. \n\nWithin the notebook, you can set a search query using the <a href=\"https://gist.github.com/andyclarkemedia/3b4e062a45323138bd28ec52d80eb7b1\" target=\"_blank\">Twitter query language</a> to return specific tweets. The results returned will be tweets matching your query that were posted within the last week.", | |
"metadata": { | |
"id": "ryqNyAz_NACW", | |
"cell_id": "00002-43d00c08-e941-436b-bf2b-b0f2f24f6629", | |
"deepnote_cell_type": "markdown" | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "WHW_T20-NACY", | |
"cell_id": "00009-7c45fad0-3fba-49c6-b03d-8d19f2c7c34a", | |
"deepnote_to_be_reexecuted": false, | |
"source_hash": "b623e53d", | |
"execution_millis": 3, | |
"output_cleared": false, | |
"execution_start": 1615294745705, | |
"deepnote_cell_type": "code" | |
}, | |
"source": "# Install 'tweepy' package\n!pip install tweepy\n# Import 'tweepy' package\nimport tweepy", | |
"execution_count": 12, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": "## Setup ⚙️", | |
"metadata": { | |
"tags": [], | |
"cell_id": "00004-b47e35ec-92c2-49aa-88da-612638ded490", | |
"deepnote_cell_type": "markdown" | |
} | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": "The following steps configure `tweepy` to make API calls to Twitter. \n\nYou need to provide your own Twitter API keys; `api_key` and `api_secret`. Since we are using _App Authentication_ you **don't need** the `access_token` and `access_token_secret` that is usually required for retrieving data from the Twitter API with _User Authentication_ .\n\nYou must sign up for a <a href=\"https://developer.twitter.com/en/apply-for-access\" target=\"_blank\">Twitter developers account </a> in order to get an `api_key` and an `api_secret`. \n\nFor details on how to do this, <a href=\"https://www.graphext.com/help-center-articles/collecting-twitter-data\" target=\"_blank\">follow our guide</a>.", | |
"metadata": { | |
"tags": [], | |
"cell_id": "00005-89269c3e-6309-4019-8118-0e238f28cd19", | |
"deepnote_cell_type": "markdown" | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "nB_94U7YNACY", | |
"cell_id": "00011-4d117785-1691-41c2-ba13-b6bf56519ff1", | |
"deepnote_to_be_reexecuted": false, | |
"source_hash": "6e4c8aea", | |
"execution_millis": 0, | |
"execution_start": 1615294277266, | |
"deepnote_cell_type": "code" | |
}, | |
"source": "# Set your API keys - This information is yours alone and should be kept private.\nimport os\napi_key = ' '\napi_secret = ' '", | |
"execution_count": 2, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "P6WYJbZYNACZ", | |
"cell_id": "00012-fe0e4989-9f36-43e3-879b-176c347bdb5e", | |
"deepnote_to_be_reexecuted": false, | |
"source_hash": "4fca230f", | |
"execution_millis": 109, | |
"execution_start": 1615294279858, | |
"deepnote_cell_type": "code" | |
}, | |
"source": "# Authorize your api key and api secret \nauth = tweepy.AppAuthHandler(api_key, api_secret)\n# Call the API \napi = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True) ", | |
"execution_count": 3, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": "## Getting Tweets Using a Search Query 🐦", | |
"metadata": { | |
"id": "Xd-LuzRNNACZ", | |
"cell_id": "00016-081150c6-422b-4ff1-b619-dfa9326931eb", | |
"deepnote_cell_type": "markdown" | |
} | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": "Next, you will search the Twitter API for tweets using the <a href=\"https://gist.github.com/andyclarkemedia/3b4e062a45323138bd28ec52d80eb7b1\" target=\"_blank\">Twitter query language</a>.\n\nThe tweets matching your query will be saved in a dictionary using the 'ID' of the tweet as a key.\n\nThe information returned from your query is as follows. The text inside of `< ... >` tells Graphext the variable type of values in this field.\n\n\n>`id<gx:number>`: The ID of the tweet used as the key for each dictionary entry.\n\n>`text<gx:text>`: The text of the tweet.\n\n>`created_at<gx:date>`: The date the tweet was posted.\n\n>`author_id<gx:number>`: The ID of the user that posted the tweet.\n\n>`author_name<gx:category>`: The name of the user holding the account that posted the tweet.\n\n>`author_handler<gx:category>`: The handle of the account that posted the tweet.\n\n>`author_user_agent<gx:category>`: The type of device used to post the tweet.\n\n>`user_description<gx:text>`: The description of the account that posted the tweet.\n\n>`user_location<gx:text>`: The location, if provided, of the user that posted the tweet.\n\n>`author_avatar<gx:url>`: The image link of the profile picture used by the account that posted the tweet.\n\n>`user_followers_count<gx:number>`: The number of followers the account that posted the tweet has.\n\n>`user_created_at<gx:date>`: The creation date of the account that posted the tweet.\n\n>`user_following_count<gx:number>`: The number of accounts that the account posting the tweet follows.\n\n>`user_verified<gx:boolean>`: A `True` or `False` value denoting whether the account posting the tweet is a verified account.\n\n>`lang<gx:category>`: The language of the tweet.\n\n>`tweet_hashtags<gx:list[category]>`: The hashtags used inside of the tweet.\n\n>`tweet_symbols<gx:list[category]>`: The emojis or symbols used inside of the tweets.\n\n>`mention_names<gx:list[category]>`: The handles of accounts mentioned inside of the 
tweet.\n\n>`mention_ids<gx:list[category]>`: The ids of the accounts mentioned inside of the tweet.\n\n>`n_retweets<gx:number>`: The number of retweets received by the tweet.\n\n>`n_favorites<gx:number>`: The number of times the tweet was favorited by other users.\n\n>`is_retweet<gx:boolean>`: A `True` or `False` value denoting whether the tweet is a retweet or not.\n\n>`original_tweet_user_handle<gx:category>`: If `is_retweet` is `True`, this value provides the handle of the user that originally posted the tweet.\n\n>`original_tweet_id<gx:number>`: If `is_retweet` is `True`, this value provides the id of the original tweet.\n", | |
"metadata": { | |
"id": "xU6eHoUVNACa", | |
"cell_id": "00018-500919c2-eede-4269-a0bf-fa26c79558ef", | |
"deepnote_cell_type": "markdown" | |
} | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": "##### Check out the <a href=\"https://gist.github.com/andyclarkemedia/3b4e062a45323138bd28ec52d80eb7b1\" target=\"_blank\">Twitter query language</a> to learn how to format a search query using Twitter's advanced search operators.", | |
"metadata": { | |
"tags": [], | |
"cell_id": "00010-93106e85-d0d5-47d5-aa6a-705fd134af91", | |
"deepnote_cell_type": "markdown" | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "QD-UAaUwNACa", | |
"cell_id": "00022-d43469b3-9d38-47d7-b4b6-45254fbe12a6", | |
"deepnote_to_be_reexecuted": false, | |
"source_hash": "cd1292af", | |
"execution_millis": 0, | |
"execution_start": 1615294331058, | |
"deepnote_cell_type": "code" | |
}, | |
"source": "# Declare your search query 🔎👇\nsearch_query = \"#Euro2020\"", | |
"execution_count": 8, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "iONOtMzTBqSA", | |
"cell_id": "00023-cb490c00-2136-4788-bcf4-97e8f0852670", | |
"deepnote_to_be_reexecuted": false, | |
"source_hash": "61b0bd2e", | |
"execution_millis": 2, | |
"execution_start": 1615294332723, | |
"deepnote_cell_type": "code" | |
}, | |
"source": "# Max Tweet ID - allows you to pause and resume tweet collection from the tweet you last collected\n# Each time you run a NEW collection process using a different query - Run this command to reset the ID of the tweet you want to start collecting from\nmax_id = None", | |
"execution_count": 9, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "HSMeNcb6NACb", | |
"cell_id": "00024-5429b43e-0ca7-4f09-a0b2-4851f33ee803", | |
"deepnote_to_be_reexecuted": false, | |
"source_hash": "572c10a4", | |
"execution_millis": 472, | |
"execution_start": 1615294334550, | |
"deepnote_cell_type": "code" | |
}, | |
"source": "# Make the Twitter API request\n# The progress bar will indicate how many tweets you have collected\n\n# Note that this process might take some time.\n# Additionally, the loop will pause each time you have exceeeded Twitter's rate limit for collecting tweets.\n# It will resume progress automatically once enough time has passed to begin collecting tweets again.\n\n# Declare a dictionary to save incoming tweets\ntweets = {}\n\ntry:\n\n # Use the 'lang' parameter to set a specific language for your query. Eg. 'es' for Spanish\n parameters = {\n \"q\": search_query,\n \"count\":100, \n \"lang\": \"\",\n \"max_id\": max_id\n }\n \n # Enter a number inside the '.items()' brackets to limit the size of your results.\n # Rate limits: 100 max per request, user auth 180 requests per 15 min or 450 per app auth. (We are using app auth)\n # 45K tweets per 15 min. Delete items if you want to obtain all of them \n # https://developer.twitter.com/en/docs/twitter-api/v1/tweets/search/api-reference/get-search-tweets\n \n tweet_requests = tweepy.Cursor(api.search, **parameters).items(10)\n\n print(tweet_requests)\n\n for tweet in tweet_requests:\n \n # Add new tweet to dictionary\n tweets[tweet.id] = {\n 'text<gx:text>': tweet.text,\n 'created_at<gx:date>': tweet.created_at,\n 'author_id<gx:number>': tweet.user.id,\n 'author_name<gx:category>': tweet.user.name,\n 'author_handler<gx:category>': str(tweet.user.screen_name),\n 'author_user_agent<gx:category>': tweet.source,\n 'user_description<gx:text>': tweet.user.description,\n 'user_location<gx:text>': tweet.user.location,\n 'author_avatar<gx:url>': tweet.user.profile_image_url,\n 'user_followers_count<gx:number>': tweet.user.followers_count,\n 'user_created_at<gx:date>': tweet.user.created_at,\n 'user_following_count<gx:number>': tweet.user.friends_count,\n 'user_verified<gx:boolean>': tweet.user.verified,\n 'lang<gx:category>': tweet.lang,\n 'tweet_hashtags<gx:list[category]>': tweet.entities['hashtags'],\n 
'tweet_symbols<gx:list[category]>': tweet.entities['symbols'],\n 'mention_names<gx:list[category]>': [\"@\" + d['screen_name'] for d in tweet.entities['user_mentions'] if 'screen_name' in d],\n 'mention_ids<gx:list[number]>': [d['id'] for d in tweet.entities['user_mentions'] if 'id' in d],\n 'n_retweets<gx:number>': tweet.retweet_count,\n 'n_favorites<gx:number>': tweet.favorite_count,\n 'is_retweet<gx:boolean>': hasattr(tweet, 'retweeted_status')\n }\n\n if tweets[tweet.id]['is_retweet<gx:boolean>']:\n tweets[tweet.id]['original_tweet_user_handle<gx:category>'] = tweet.retweeted_status.user.screen_name\n tweets[tweet.id]['original_tweet_id<gx:number>'] = str(tweet.retweeted_status.id)\n\n \n # Set the latest tweet ID as the tweet to resume from\n max_id = tweet.id\n \n\n# Catches any error\nexcept Exception as e:\n print(\"Something went wrong. Run the command again to continue from the ID of the tweet you last collected. You can also download the {} tweets you've collected so far by runnning the rest of the notebook. \\n \\n The error message:\".format(len(tweets)), e)\n\n# When process completes \nelse:\n print(\"\\nHurray! You've collected all tweets matching your query. You have {} tweets ready to export. Now run the rest of the notebook to download your data.\".format(len(tweets)))", | |
"execution_count": 10, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"text": "<tweepy.cursor.ItemIterator object at 0x7ff0263467d0>\n\nHurray! You've collected all tweets matching your query. You have 10 tweets ready to export. Now run the rest of the notebook to download your data.\n", | |
"output_type": "stream" | |
} | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": "## Exporting the Data ⬇️", | |
"metadata": { | |
"id": "KwQD2DkvNACb", | |
"cell_id": "00026-5a6d79eb-308e-463a-b414-bce8525ecca1", | |
"deepnote_cell_type": "markdown" | |
} | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": "After making the request you will now export the data inside your dictionary to a csv file. First, load the dictionary into a dataframe. Then, export the dataframe as a csv file.", | |
"metadata": { | |
"id": "sQNIF8F-NACb", | |
"cell_id": "00027-1f1c8ea3-491b-44e2-afa4-ed33a6e8745a", | |
"deepnote_cell_type": "markdown" | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "GTAuc9UiNACc", | |
"cell_id": "00028-32b52cca-4910-42b7-90c3-69c108fdea47", | |
"deepnote_to_be_reexecuted": false, | |
"source_hash": "c4c9c7d3", | |
"execution_millis": 0, | |
"execution_start": 1615294736142, | |
"deepnote_cell_type": "code" | |
}, | |
"source": "# Import pandas\nimport pandas as pd\n\n# Save dictionary as a dataframe\ndf = pd.DataFrame.from_dict(tweets, orient='index')\n\n# Convert ID index to 'id' column\ndf['id<gx:number>'] = df.index\n\n# Export as csv\ndf.to_csv(search_query + ' '+ 'tweets.csv', index=False)\n\n# Inspect your dataset\ndf", | |
"execution_count": 12, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "tRw5ke9jBftS", | |
"cell_id": "00031-aba5660a-2d82-417a-ba26-9035d08a7b9d", | |
"deepnote_to_be_reexecuted": false, | |
"source_hash": null, | |
"execution_millis": 371, | |
"execution_start": 1615223932570, | |
"output_cleared": true, | |
"deepnote_cell_type": "code" | |
}, | |
"source": "", | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": "<a style='text-decoration:none;line-height:16px;display:flex;color:#5B5B62;padding:10px;justify-content:end;' href='https://deepnote.com?utm_source=created-in-deepnote-cell&projectId=f47e8d8d-e9a0-451f-b929-ba54b6b81628' target=\"_blank\">\n<img style='display:inline;max-height:16px;margin:0px;margin-right:7.5px;' src='' > </img>\nCreated in <span style='font-weight:600;margin-left:4px;'>Deepnote</span></a>", | |
"metadata": { | |
"tags": [], | |
"created_in_deepnote_cell": true, | |
"deepnote_cell_type": "markdown" | |
} | |
} | |
], | |
"nbformat": 4, | |
"nbformat_minor": 0, | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.7.3" | |
}, | |
"colab": { | |
"name": "Collecting-Twitter-Data.ipynb", | |
"provenance": [], | |
"collapsed_sections": [] | |
}, | |
"deepnote_notebook_id": "5b9519ee-f1f8-494f-b3fd-dc66a6e975f5", | |
"deepnote": {}, | |
"deepnote_execution_queue": [] | |
} | |
} |