How I archive twitter accounts and hashtags and how you can too
Hello!
======
Table of Contents:
The Fast Way
The Thorough Way
Wait, so how do I read this file?
Archiving Twitter accounts and hashtags is easy, even without an API key. Here's how!
/####\
|STOP| Twitter is usually very lenient on bans. However, it's still possible to get banned,
|SIGN| especially as snscrape (mis)uses the public API used by the web-app. I take no responsibility for this.
\####/
########
| NOTE | Most of this guide assumes a sane UNIX environment with Python 3 installed.
|      | If you don't have that, this guide won't be helpful.
|______| On Windows you might use WSL, Cygwin, or Git Bash. WSL is recommended. Or just, y'know, stop using Windows.
Number One: The Fast Way
------------------------
The easiest way is to fire up [snscrape](https://github.com/JustAnotherArchivist/snscrape) with the --jsonl parameter.
* The --jsonl option makes snscrape output each post's full data instead of only its URL.
Of course, you'll want to pipe snscrape's output to a file.
Example for most POSIX-compliant shells (AKA anything but Windows):
snscrape --jsonl --progress twitter-user Ukraine >> tweets.jsonl
Or:
snscrape --jsonl --progress twitter-hashtag BeautifulUkraine >> tweets.jsonl
The `>>' tells your shell to append to the file (creating it if it doesn't exist). You could also use `>' to overwrite the file.
   /^\
  / | \  Don't do this to the same file in multiple different shells/terminals! It will likely corrupt the file.
 /__.__\
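
Once the scrape finishes, it's worth sanity-checking the output. Here's a minimal sketch in Python, assuming each line of
tweets.jsonl is one JSON object; the "date" and "content" field names match recent snscrape versions but may differ in yours:

```
# A quick sanity check of snscrape's JSONL output. Each line is one JSON
# object; the "date" and "content" field names below match recent snscrape
# versions, but may differ in yours.
import json

count = 0
with open("tweets.jsonl", encoding="utf-8") as f:
    for line in f:
        tweet = json.loads(line)
        count += 1
        if count == 1:
            print("First post:", tweet.get("date"), "-", tweet.get("content"))
print(f"Total posts scraped: {count}")
```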
Number Two: The Thorough Way
----------------------------
This is my *preferred* way to do it. This way scrapes a post's replies, its parent's replies, its parent's replies' replies, its
replies' replies, and so on. In other words, it's recursive (there's a sketch of the idea just below).
* This does not traverse hashtags/@mentions. You'll have to do that yourself if you want it, but you'd probably end up scraping
the whole website by doing that :P
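
To make "recursive" concrete, here's a rough sketch of the traversal idea. This is NOT the actual batch-get-tweets code;
fetch_thread() is a hypothetical helper standing in for however the real script queries snscrape:

```
# Conceptual sketch of the recursive traversal, not the repo's actual code.
# fetch_thread() is a hypothetical helper that returns a tweet's parent ID
# (or None) and a list of its reply IDs.
from collections import deque

def crawl(seed_ids, fetch_thread):
    seen = set()
    queue = deque(seed_ids)
    while queue:
        tweet_id = queue.popleft()
        if tweet_id in seen:
            continue
        seen.add(tweet_id)
        parent_id, reply_ids = fetch_thread(tweet_id)
        if parent_id is not None:
            queue.append(parent_id)  # walk up to the parent...
        queue.extend(reply_ids)      # ...and down into every reply
    return seen
```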
This is a multi-step process.
First, you need to scrape the post IDs for your search term.
Example for most POSIX-compliant shells (AKA anything but Windows):
snscrape -f "{id}" --progress twitter-user Ukraine >> twitterids
Or:
snscrape -f "{id}" --progress twitter-hashtag BeautifulUkraine >> twitterids
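
Since `>>' appends, running the scrape more than once can leave duplicate IDs in `twitterids'. A minimal sketch for
deduplicating the file, assuming one ID per line:

```
# Remove duplicate tweet IDs (e.g. from appending twice), keeping the
# original order.
seen = set()
unique = []
with open("twitterids") as f:
    for line in f:
        tweet_id = line.strip()
        if tweet_id and tweet_id not in seen:
            seen.add(tweet_id)
            unique.append(tweet_id)
with open("twitterids", "w") as f:
    f.write("\n".join(unique) + "\n")
print(f"Kept {len(unique)} unique IDs")
```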
This will take about as long as the fast method, since it runs exactly the same scrape; the only difference is that only the
post ID is saved. The second step is what will take a while. Fortunately, the second step can resume crawls, so long as the
"Archive File" (the `finished.json' file) is intact. This Archive File keeps track of which tweets it's already downloaded,
so it can skip them. Ain't that neat?
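
The Archive File pattern itself is simple. Here's a conceptual sketch, not the repo's actual code; I'm assuming
`finished.json' holds a JSON list of already-downloaded tweet IDs:

```
# Sketch of the resume pattern: skip IDs already recorded in the archive
# file, and rewrite the archive after every tweet so a crash loses nothing.
import json
import os

archive_path = "finished.json"
done = set()
if os.path.exists(archive_path):
    with open(archive_path) as f:
        done = set(json.load(f))

with open("twitterids") as f:
    for line in f:
        tweet_id = line.strip()
        if not tweet_id or tweet_id in done:
            continue  # already downloaded on a previous run; skip it
        # ... download the tweet and its reply tree here ...
        done.add(tweet_id)
        with open(archive_path, "w") as out:
            json.dump(sorted(done), out)
```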
Clone the https://github.com/archivist-rs/batch-get-tweets repo and enter it (don't forget which folder you saved twitterids in!)
For simplicity's sake, I'll store the script's path in a variable. Assuming a sane Unix system, run:
```
export ARCHIVIST_TWEETS_SCRIPT="$(pwd)/get_tweets.py"
```
Now go back to the folder where `twitterids' is. Next, run:
```
python3 -m pip install alive_progress
python3 "$ARCHIVIST_TWEETS_SCRIPT"
```
You should now see a progress bar on your screen. This process is going to take a *very long time*. It's not unheard of for it to
run overnight, and it may even take multiple days. With multiple days come multiple crashes, which is why there's the resume
functionality. It's a great way to find bugs in snscrape :-)
Now, this may also be very quick. It all depends on three factors:
1. Most importantly: the number of Twitter IDs.
2. Similarly importantly: the number of *replies* to those Twitter IDs (and their replies, etc.).
3. Your Internet connection.
...And now you're done!
Wait, so how do I read this file?
---------------------------------
Reading tweets.jsonl isn't straightforward. However, you've got pretty much all the tweet data there; it's just in a very machine-
readable format.
I'm working on a viewer; progress is at https://github.com/archivist-rs/archivist. It's still not great, though, and I could use
some help, especially from someone who can actually design something. I'm _decent_, but I'm still utterly shit.
The tweets.jsonl file has the raw JSONL produced by snscrape. Good luck!
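
If you just want to skim the file without the viewer, a few lines of Python go a long way (again, the exact field names
depend on your snscrape version):

```
# Print each tweet's date, author, and text from tweets.jsonl.
# The "date", "user", "username", and "content" field names match recent
# snscrape versions but may vary in yours.
import json

with open("tweets.jsonl", encoding="utf-8") as f:
    for line in f:
        t = json.loads(line)
        user = (t.get("user") or {}).get("username", "?")
        print(f"{t.get('date')} @{user}: {t.get('content')}")
```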
Thanks for reading this guide! It's one of my longest... writings ever. I hope it was helpful to you, and that you do great things
with it.
Ciao!