Twitter lets you download an archive of your tweets: request it from your account settings, and when the email with the download link arrives, download the zip file.
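For example (a minimal sketch; the zip's exact name varies by account and export date):
unzip twitter-archive.zip -d Twitter/
Note that the raw archive ships its data as JavaScript files (window.YTD.* assignments) rather than plain JSON; the rest of this post assumes you've already converted yours into a single Twitter/twitter_archive.json.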
Top-level keys of Twitter/twitter_archive.json (via jq 'keys' Twitter/twitter_archive.json):
[
"account",
"community-tweet",
"follower",
"following",
"like",
"note-tweet",
"profile",
"tweets",
"upload-options"
]
We can get the shape of the tweets with:
jq '
def shape:
if type == "object" then with_entries(.value |= shape)
elif type == "array" then if length == 0 then [] else [ (map(shape) | add // {}) ] end
else type end;
.tweets | map(.tweet // .) | (map(shape) | add)
' Twitter/twitter_archive.json > Twitter/tweets_shape.json
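The output maps each field to its type. For the fields this post relies on, an excerpt might look like this (illustrative; run the command above to see your archive's actual shape):
{
  "id_str": "string",
  "full_text": "string",
  "retweet_count": "string",
  "favorite_count": "string",
  "in_reply_to_status_id": "string",
  "in_reply_to_screen_name": "string",
  "extended_entities": {
    "media": [ ... ]
  }
}
Note that the count fields are strings, which is why the sorting steps below pipe them through tonumber.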
To embed and search tweets, make sure you have llm installed and API keys set for whatever provider(s) you want to use.
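For example, assuming OpenAI as the provider:
pip install llm  # or: pipx install llm / brew install llm
llm keys set openai  # paste your API key when prompted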
Also, set a default embedding model available from that provider:
llm embed-models default text-embedding-3-small # an OpenAI model
llm embed-multi requires the text to be embedded to be a JSON array of objects with "id" and "content" fields:
[
{"id": "one", "content": "This is the first item"},
{"id": "two", "content": "This is the second item"}
]
We can do this with:
jq -r '
.tweets
| map(.tweet // .)
| map({
id: .id_str,
content: .full_text
})
' Twitter/twitter_archive.json > Twitter/tweets_to_embed.json
That will give us a file Twitter/tweets_to_embed.json. Then we can embed the tweets with:
llm embed-multi tweets Twitter/tweets_to_embed.json --store # store plain text with embeddings
Now you can run a semantic search over the tweets:
llm similar tweets -c "Python is good"
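llm similar also accepts -n to control how many matches come back, e.g.:
llm similar tweets -c "Python is good" -n 5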
Or chain llm commands to do RAG:
llm similar tweets -c "Python is good" -p | \
llm "Look at my previous tweets about Python and write a new tweet in my voice/style about why Python is good"
Say we only want to embed the 100 most engaging tweets. We can do this by sorting the tweets by the sum of retweet_count and favorite_count, then taking the top 100:
jq -r '
.tweets
| map(.tweet // .)
| sort_by(-((.retweet_count | tonumber) + (.favorite_count | tonumber)))
| .[0:100]
| map({
id: .id_str,
content: .full_text,
metadata: {
retweets: .retweet_count,
likes: .favorite_count
}
})
' Twitter/twitter_archive.json > Twitter/bangers_to_embed.json
Then, we can embed the bangers:
llm embed-multi bangers Twitter/bangers_to_embed.json --store
Say we want to exclude replies from our analysis. We can do this by filtering on the in_reply_to_status_id field: map(select(.in_reply_to_status_id == null)). Similarly, we can exclude link and quote tweets by testing whether full_text contains a t.co URL (quote tweets embed the quoted tweet's URL in full_text, so one test catches both): map(select(.full_text | test("https://t\\.co/") | not)). And we can exclude media tweets by checking that the extended_entities.media array is absent: map(select(.extended_entities.media == null)).
So, to embed the top 100 bangers that are not replies, link/quote tweets, or media tweets, we can do:
jq -r '
.tweets
| map(.tweet // .)
| map(select(.in_reply_to_status_id == null))
| map(select(.full_text | test("https://t\\.co/") | not))
| map(select(.extended_entities.media == null))
| sort_by(-((.retweet_count | tonumber) + (.favorite_count | tonumber)))
| .[0:100]
| map({
id: .id_str,
content: .full_text,
metadata: {
retweets: .retweet_count,
likes: .favorite_count
}
})
' Twitter/twitter_archive.json > Twitter/filtered_bangers_to_embed.json
Maybe we'd store these in a separate filtered_bangers collection:
llm embed-multi filtered_bangers Twitter/filtered_bangers_to_embed.json --store
Say we want to generate a reply to some tweet, and we want to provide the LLM examples of past replies to similar tweets. We can do this by embedding the tweets we've previously replied to, storing our own replies as metadata.
The llm tool supports storing metadata alongside the embeddings. What we need is a JSON array or NDJSON file with "id", "content", and "metadata" fields, where the "metadata" field is a JSON object with arbitrary key-value pairs:
[
{
"id": "https://x.com/user/status/1234567890",
"content": "Python is gross and disgusting and anyone who uses it is a loser.",
"metadata": {"reply_text": "But I love Python 😭."}
},
{
"id": "https://x.com/user/status/1234567890",
"content": "Python? I don't like it. It's not my favorite language.",
"metadata": {"reply_text": "wym, Python is great!"}
}
]
Assuming we have this data saved in a file Twitter/replied_tweets.json, we can embed the file with:
llm embed-multi replied_to_tweets Twitter/replied_tweets.json --store
Now, say we want to reply to a tweet that says, "Python is bad". We can find similar replied-to tweets to that one:
llm similar replied_to_tweets -c "Python is bad"
This will return the most similar results as newline-delimited JSON, including the stored content and metadata.
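Each line is a JSON object shaped like this (values illustrative):
{"id": "https://x.com/user/status/1234567890", "score": 0.83, "content": "Python is gross and disgusting and anyone who uses it is a loser.", "metadata": {"reply_text": "But I love Python 😭."}}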
We can pipe these results to an LLM to generate a reply in the same style as the retrieved replies:
llm similar replied_to_tweets -c "Python is bad" | \
llm "Look at my past replies to tweets about Python and write a new reply in my voice/style to the tweet 'Python is bad'."
The tricky part here is getting the text of replied-to tweets and putting it in the required format.
First, we'll enrich our tweet data with information about the tweets being replied to, if any:
jq -r '
.tweets
| map(.tweet // .)
| map(select(.in_reply_to_status_id != null))
| map({
id: .id_str,
content: .full_text,
replying_to_id: .in_reply_to_status_id,
replying_to_user: .in_reply_to_screen_name,
replying_to_url: (
if .in_reply_to_screen_name and .in_reply_to_status_id then
"https://x.com/\(.in_reply_to_screen_name)/status/\(.in_reply_to_status_id)"
else
null
end
)
})
' Twitter/twitter_archive.json > Twitter/tweets_to_embed_with_replies.json
That gives us an array of objects like:
[
{
"id": "1891535193061683419",
"content": "Python is very special to me.",
"replying_to_id": "1234567890",
"replying_to_user": "another_user",
"replying_to_url": "https://x.com/another_user/status/1234567890"
}
]
Then, we'll have to run a scraping workflow to get the text of the tweets being replied to.
The Twitter API allows 100 reads per month on the free tier, and you can use my x-cli tool to query up to 100 tweets by ID with a single API call if you have a free developer account. After installing and configuring the tool according to the instructions in the README, you can run x-cli get <id1> <id2> ... --json > Twitter/replied_tweets_raw.json to get an array of JSON data for the tweets matching those IDs. Building that command from the first 100 unique replying_to_ids in Twitter/tweets_to_embed_with_replies.json and running it looks like this:
jq -r '
map(.replying_to_id)
| unique
| .[0:100]
| "x-cli get " + join(" ") + " --json"
' Twitter/tweets_to_embed_with_replies.json | \
xargs -I {} bash -c "{} > Twitter/replied_tweets_raw.json"
We can then extract the text of the tweets and store it in a new file:
jq -r '
.data
| map({
id: .id,
content: .text
})
' Twitter/replied_tweets_raw.json > Twitter/replied_tweets.json
Finally, we need to map the replying_to_ids to the ids of our reply tweets, and inject the reply text as metadata:
jq -n '
# Load replied-to tweets as an object indexed by id
(input
| map({key: .id, value: {id: .id, content: .content}})
| from_entries
) as $replied
|
# Load tweets with replies, but only those that have matching replied-to tweets
(input
| map(select(
.replying_to_id != null and
($replied[.replying_to_id] // null) != null
))
) as $matching_replies
|
# Create the final output with metadata
$matching_replies
| map(
. as $reply
| $replied[.replying_to_id]
| {
id: .id,
content: .content,
metadata: {reply_text: $reply.content}
}
)
' Twitter/replied_tweets.json Twitter/tweets_to_embed_with_replies.json > Twitter/replied_tweets_with_metadata.json
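The result has the same id/content/metadata shape as the replied_tweets.json example above, so we can embed it the same way:
llm embed-multi replied_to_tweets Twitter/replied_tweets_with_metadata.json --store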
Alternatively, we can run my scrape_tweet.js script (assuming bun is installed): bun run Twitter/scrape_tweet.js https://x.com/another_user/status/1234567890. It will use puppeteer to scrape the text of a single tweet by URL and return a JSON object containing the tweet content, with its URL as the id. The script supports an optional --metadata-json flag to attach arbitrary metadata to the output (e.g., the text of your reply). It will return a JSON object like:
{
"id": "https://x.com/christophcsmith/status/1919038899428393413",
"content": "I am well aware that I, a developer who uses Python to build web applications, am the one who is wrong.",
"metadata": { "reply_text": "So wrong it's right!" }
}
You can run this script in a loop over all the replying_to_url URLs in Twitter/tweets_to_embed_with_replies.json to get an NDJSON file with the texts of the tweets being replied to, passing your original reply text as metadata:
# Create NDJSON of replied-to tweets, injecting reply text as metadata
jq -c '
  .[]
  | select(.replying_to_url != null)
  | {url: .replying_to_url, metadata: {reply_text: .content}}
' Twitter/tweets_to_embed_with_replies.json | \
while read -r line; do
  # jq -c guarantees one JSON object per line, so read -r is safe here
  url=$(jq -r '.url' <<< "$line")
  metadata=$(jq -c '.metadata' <<< "$line")
  # </dev/null keeps the scraper from consuming the loop's stdin
  bun run Twitter/scrape_tweet.js "$url" --metadata-json "$metadata" </dev/null
done > Twitter/replied_tweets.ndjson
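Since llm embed-multi also accepts NDJSON, the scraped file can then be embedded just like before:
llm embed-multi replied_to_tweets Twitter/replied_tweets.ndjson --store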