How I archive Twitter accounts and hashtags, and how you can too
Hello!
======
Table of Contents:
* The Fast Way
* The Thorough Way
* Wait, how do I read this file?
Archiving Twitter accounts and hashtags is easy, even without an API key. Here's how!
/####\
|STOP| Twitter is usually very lenient on bans. However, it's still possible to get banned,
|SIGN| especially as snscrape (mis)uses the public API used by the web app. I take no responsibility for this.
\####/
 ######
| NOTE | Most of this guide assumes a sane UNIX environment with Python 3 installed.
|      | If you don't have that, this guide won't be helpful.
|______| On Windows you might use WSL, Cygwin, or Git Bash. WSL is recommended. Or just, y'know, stop using Windows.
Number One: The Fast Way
----------------------------
The easiest way is to fire up [snscrape](https://github.com/JustAnotherArchivist/snscrape) with the --jsonl parameter.
* The --jsonl option makes snscrape output each post's full data as one JSON object per line, instead of only its URL.
Of course you'll want to redirect snscrape's output to a file.
Example for most POSIX-compliant shells (AKA anything but Windows):
snscrape --jsonl --progress twitter-user Ukraine >> tweets.jsonl
Or:
snscrape --jsonl --progress twitter-hashtag BeautifulUkraine >> tweets.jsonl
The `>>' tells your shell to append to the file (creating it if it doesn't exist). You could also use `>' to overwrite the file.
   /^\
  / | \  Don't do this to the same file in multiple different shells/terminals! It will likely corrupt the file.
 /__.__\
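By the way, if you'd rather drive snscrape from Python than from the shell, it also has a Python API. Upstream explicitly considers that API unstable, so treat this as a sketch that may need adjusting for your snscrape version:
```
import snscrape.modules.twitter as sntwitter

# Swap in TwitterHashtagScraper("BeautifulUkraine") to scrape a hashtag instead.
scraper = sntwitter.TwitterUserScraper("Ukraine")

# "a" = append, matching the `>>' redirection above.
with open("tweets.jsonl", "a", encoding="utf-8") as f:
    for tweet in scraper.get_items():
        # Each item serialises to the same JSON that --jsonl emits.
        f.write(tweet.json() + "\n")
```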
Number Two: The Thorough Way
----------------------------
This is my *preferred* way to do it. This way scrapes a post's replies, its parent's replies, its parent's replies'
replies, its replies' replies, etc. In other words, it's recursive (there's a sketch of the idea just below).
* This does not traverse hashtags/@mentions. You'll have to do that yourself if you want it, but you'd probably end up scraping
the whole website by doing that :P
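To give you an idea of what that recursion looks like: snscrape can walk a whole conversation starting from a single tweet ID. A minimal sketch, again using the unstable Python API (double-check the mode name against your snscrape version):
```
import snscrape.modules.twitter as sntwitter

TWEET_ID = 1234567890  # placeholder -- use a real tweet ID

# RECURSE mode follows the thread in both directions: the tweet's
# replies, its parents, the parents' replies, and so on.
scraper = sntwitter.TwitterTweetScraper(
    TWEET_ID, mode=sntwitter.TwitterTweetScraperMode.RECURSE
)
for tweet in scraper.get_items():
    print(tweet.id, tweet.date)
```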
This is a multi-step process.
First, you need to scrape the post IDs for your search term.
Example for most POSIX-compliant shells (AKA anything but Windows):
snscrape -f "{id}" --progress twitter-user Ukraine >> twitterids
Or:
snscrape -f "{id}" --progress twitter-hashtag BeautifulUkraine >> twitterids
This step takes about as long as the Fast Way did, because it's exactly the same scrape; the only difference is that only
the post ID gets saved. The second step is what will take a while. Fortunately, the second step can resume crawls, so long
as the "Archive File" (the `finished.json' file) is intact. The Archive File keeps track of which tweets have already been
downloaded, so it can skip them. Ain't that neat?
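I haven't documented the exact format of `finished.json' here, but the resume idea itself is simple enough to sketch. This is hypothetical code illustrating the general pattern, not what batch-get-tweets actually does:
```
import json
import os

ARCHIVE_PATH = "finished.json"  # hypothetical stand-in for the real Archive File

# Load the set of IDs already fetched (empty on a fresh run).
if os.path.exists(ARCHIVE_PATH):
    with open(ARCHIVE_PATH) as f:
        finished = set(json.load(f))
else:
    finished = set()

def mark_done(tweet_id):
    # Rewrite the archive after every tweet, so a crash loses almost nothing.
    finished.add(tweet_id)
    with open(ARCHIVE_PATH, "w") as f:
        json.dump(sorted(finished), f)

with open("twitterids") as f:
    for line in f:
        tweet_id = int(line.strip())
        if tweet_id in finished:
            continue  # already archived on a previous run -- skip it
        # ... fetch and save the tweet here ...
        mark_done(tweet_id)
```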
Clone the https://github.com/archivist-rs/batch-get-tweets repo and enter it (and don't forget which folder you saved twitterids in!)
For simplicity's sake, I'll store the script's path in a variable. Assuming a sane Unix system, run this from inside the repo:
```
export ARCHIVIST_TWEETS_SCRIPT="$(pwd)/get_tweets.py"
```
Now go back to the folder where `twitterids' is, then run:
```
python3 -m pip install alive_progress
python3 $ARCHIVIST_TWEETS_SCRIPT
```
You should now see a progress bar on your screen. This process is going to take a *very long time*. It's not unheard of for it to
run overnight, and it may even take multiple days. With multiple days come multiple crashes, which is why the resume
functionality exists. It's a great way to find bugs in snscrape :-)
Now, this may also be very quick. It all depends on three factors:
1. Most importantly: the number of Twitter IDs.
2. Similarly importantly: the number of *replies* to those tweets (and their replies, etc).
3. Your Internet connection.
...And, now you're done!
Wait, so how do I read this file?
---------------------------------
Reading tweets.jsonl isn't straightforward. However, you've got pretty much all the tweet data there; it's just in a very
machine-readable format.
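That said, if you just want to poke at the data, a few lines of Python go a long way. A minimal sketch; note that the field names vary between snscrape versions (older ones use `content', newer ones `rawContent'), hence the fallbacks:
```
import json

with open("tweets.jsonl", encoding="utf-8") as f:
    for line in f:
        tweet = json.loads(line)
        # Fall back across snscrape versions' field names.
        text = tweet.get("rawContent") or tweet.get("content", "")
        user = (tweet.get("user") or {}).get("username", "?")
        print(f"@{user} ({tweet.get('date')}): {text[:80]}")
```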
I'm working on a viewer; progress is at https://github.com/archivist-rs/archivist. It's still not great, though, and I could use
some help, especially from someone who can actually design something. I'm _decent_, but I'm still utterly shit.
The tweets.jsonl file has the raw JSONL produced by snscrape. Good luck!
Thanks for reading this guide! It's one of my longest... writings ever. I hope it was helpful to you, and that you do great things
with it.
Ciao!