How I archive twitter accounts and hashtags and how you can too
Hello!
======
Table of Contents:
The Fast Way
The Thorough Way
Wait, so how do I read this file?
Archiving Twitter accounts and hashtags is easy, even without an API key. Here's how!
/####\
|STOP| Twitter is usually very lenient on bans. However, it's still possible to get banned,
|SIGN| especially as snscrape (mis)uses the public API used by the web-app. I take no responsibility for this.
\####/
########
| NOTE | Most of this guide assumes a sane UNIX environment with Python 3 installed.
|      | If you don't have that, this guide won't be helpful.
|______| On Windows you might use WSL, Cygwin, or Git Bash. WSL is recommended. Or just, y'know, stop using Windows.
Number One: The Fast Way
------------------------
The easiest way is to fire up [snscrape](https://github.com/JustAnotherArchivist/snscrape) with the --jsonl parameter.
* The --jsonl option makes snscrape output each post's full data instead of only its URL.
Of course, you'll want to pipe snscrape's output to a file.
Example for most POSIX-compliant shells (AKA anything but Windows):
snscrape --jsonl --progress twitter-user Ukraine >> tweets.jsonl
Or:
snscrape --jsonl --progress twitter-hashtag BeautifulUkraine >> tweets.jsonl
The `>>' tells your shell to append to the file (creating it if it doesn't exist). You could also use `>' to overwrite the file.
   /^\
  / | \  Don't do this to the same file in multiple different shells/terminals! It will likely corrupt the file.
 /__.__\
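
Once the scrape finishes, it's worth sanity-checking the output. Here's a minimal sketch in Python, assuming each line of
tweets.jsonl is one JSON object; the "date" and "content" field names match recent snscrape versions but may differ in yours:

```
# A quick sanity check of snscrape's JSONL output. Each line is one JSON
# object; the "date" and "content" field names below match recent snscrape
# versions, but may differ in yours.
import json

count = 0
with open("tweets.jsonl", encoding="utf-8") as f:
    for line in f:
        tweet = json.loads(line)
        count += 1
        if count == 1:
            print("First post:", tweet.get("date"), "-", tweet.get("content"))
print(f"Total posts scraped: {count}")
```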
Number Two: The Thorough Way
----------------------------
This is my *preferred* way to do it. This way scrapes a post's replies, its parent's replies, its parent's replies' replies, its
replies' replies, and so on. In other words, it's recursive (there's a sketch of the idea just below).
* This does not traverse hashtags/@mentions. You'll have to do that yourself if you want it, but you'd probably end up scraping
the whole website by doing that :P
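
To make "recursive" concrete, here's a rough sketch of the traversal idea. This is NOT the actual batch-get-tweets code;
fetch_thread() is a hypothetical helper standing in for however the real script queries snscrape:

```
# Conceptual sketch of the recursive traversal, not the repo's actual code.
# fetch_thread() is a hypothetical helper that returns a tweet's parent ID
# (or None) and a list of its reply IDs.
from collections import deque

def crawl(seed_ids, fetch_thread):
    seen = set()
    queue = deque(seed_ids)
    while queue:
        tweet_id = queue.popleft()
        if tweet_id in seen:
            continue
        seen.add(tweet_id)
        parent_id, reply_ids = fetch_thread(tweet_id)
        if parent_id is not None:
            queue.append(parent_id)  # walk up to the parent...
        queue.extend(reply_ids)      # ...and down into every reply
    return seen
```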
This is a multi-step process.
First, you need to scrape the post IDs for your search term.
Example for most POSIX-compliant shells (AKA anything but Windows):
snscrape -f "{id}" --progress twitter-user Ukraine >> twitterids
Or:
snscrape -f "{id}" --progress twitter-hashtag BeautifulUkraine >> twitterids
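
Since `>>' appends, running the scrape more than once can leave duplicate IDs in `twitterids'. A minimal sketch for
deduplicating the file, assuming one ID per line:

```
# Remove duplicate tweet IDs (e.g. from appending twice), keeping the
# original order.
seen = set()
unique = []
with open("twitterids") as f:
    for line in f:
        tweet_id = line.strip()
        if tweet_id and tweet_id not in seen:
            seen.add(tweet_id)
            unique.append(tweet_id)
with open("twitterids", "w") as f:
    f.write("\n".join(unique) + "\n")
print(f"Kept {len(unique)} unique IDs")
```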
This will take about as long as the fast method, since it runs exactly the same scrape; the only difference is that only the
post ID is saved. The second step is what will take a while. Fortunately, the second step can resume crawls, so long as the
"Archive File" (the `finished.json' file) is intact. This Archive File keeps track of which tweets it's already downloaded,
so it can skip them. Ain't that neat?
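
The Archive File pattern itself is simple. Here's a conceptual sketch, not the repo's actual code; I'm assuming
`finished.json' holds a JSON list of already-downloaded tweet IDs:

```
# Sketch of the resume pattern: skip IDs already recorded in the archive
# file, and rewrite the archive after every tweet so a crash loses nothing.
import json
import os

archive_path = "finished.json"
done = set()
if os.path.exists(archive_path):
    with open(archive_path) as f:
        done = set(json.load(f))

with open("twitterids") as f:
    for line in f:
        tweet_id = line.strip()
        if not tweet_id or tweet_id in done:
            continue  # already downloaded on a previous run; skip it
        # ... download the tweet and its reply tree here ...
        done.add(tweet_id)
        with open(archive_path, "w") as out:
            json.dump(sorted(done), out)
```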
Clone the https://github.com/archivist-rs/batch-get-tweets repo and enter it (don't forget which folder you saved twitterids in!)
For simplicity's sake, I'll store the script's path in a variable. Assuming a sane Unix system, run:
```
export ARCHIVIST_TWEETS_SCRIPT="$(pwd)/get_tweets.py"
```
Now go back to the folder where `twitterids' is. Next, run:
```
python3 -m pip install alive_progress
python3 "$ARCHIVIST_TWEETS_SCRIPT"
```
You should now see a progress bar on your screen. This process is going to take a *very long time*. It's not unheard of for it to
run overnight, and it may even take multiple days. With multiple days come multiple crashes, which is why there's the resume
functionality. It's a great way to find bugs in snscrape :-)
Now, this may also be very quick. It all depends on three factors:
1. Most importantly: the number of Twitter IDs.
2. Similarly importantly: the number of *replies* to those Twitter IDs (and their replies, etc.).
3. Your Internet connection.
...And now you're done!
Wait, so how do I read this file?
---------------------------------
Reading tweets.jsonl isn't straightforward. However, you've got pretty much all the tweet data there; it's just in a very machine-
readable format.
I'm working on a viewer; progress is at https://github.com/archivist-rs/archivist. It's still not great, though, and I could use
some help, especially from someone who can actually design something. I'm _decent_, but I'm still utterly shit.
The tweets.jsonl file has the raw JSONL produced by snscrape. Good luck!
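
If you just want to skim the file without the viewer, a few lines of Python go a long way (again, the exact field names
depend on your snscrape version):

```
# Print each tweet's date, author, and text from tweets.jsonl.
# The "date", "user", "username", and "content" field names match recent
# snscrape versions but may vary in yours.
import json

with open("tweets.jsonl", encoding="utf-8") as f:
    for line in f:
        t = json.loads(line)
        user = (t.get("user") or {}).get("username", "?")
        print(f"{t.get('date')} @{user}: {t.get('content')}")
```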
Thanks for reading this guide! It's one of my longest... writings ever. I hope it was helpful to you, and that you do great things
with it.
Ciao!