
@chriscarrollsmith
Last active August 10, 2025 22:35
Workflows for using Simon Willison's 'llm' CLI tool to embed your downloadable Twitter archive for semantic search and RAG

Embedding your downloaded Twitter archive data for semantic search and RAG with llm

Getting your tweets

Twitter lets you download an archive of your tweets: go to your account settings and request your archive, then download the zip file from the link emailed to you.

Exploring the data

Top-level keys of Twitter/twitter_archive.json (via jq 'keys' Twitter/twitter_archive.json):

[
  "account",
  "community-tweet",
  "follower",
  "following",
  "like",
  "note-tweet",
  "profile",
  "tweets",
  "upload-options"
]

We can get the shape of the tweets with:

jq '
def shape:
  if type == "object" then with_entries(.value |= shape)
  elif type == "array" then if length == 0 then [] else [ (map(shape) | add // {}) ] end
  else type end;
.tweets | map(.tweet // .) | (map(shape) | add)
' Twitter/twitter_archive.json > Twitter/tweets_shape.json
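To see what the shape function produces, here it is applied to a tiny inline sample (the single made-up tweet stands in for the real archive):

```shell
# Collapse a minimal sample archive to its shape: leaf values become their type names
echo '{"tweets": [{"tweet": {"id_str": "1", "retweet_count": "2"}}]}' | jq -c '
def shape:
  if type == "object" then with_entries(.value |= shape)
  elif type == "array" then (if length == 0 then [] else [ (map(shape) | add // {}) ] end)
  else type end;
.tweets | map(.tweet // .) | (map(shape) | add)
'
# {"id_str":"string","retweet_count":"string"}
```

Note that the engagement counts come through as strings, which is why later recipes pipe them through tonumber before sorting.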

Embedding tweets

To embed and search tweets, make sure you have llm installed and API keys set for whatever provider(s) you want to use.

Also, set a default embedding model available from that provider:

llm embed-models default text-embedding-3-small  # an OpenAI model
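If you're starting from scratch, setup looks roughly like this (assuming pip and an OpenAI key; substitute your own provider and install method):

```shell
pip install llm       # or: brew install llm / uv tool install llm
llm keys set openai   # prompts for your API key
llm embed-models      # list the embedding models now available
```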

llm's embed-multi command expects the text to be embedded as a JSON array of objects with "id" and "content" fields:

[
  {"id": "one", "content": "This is the first item"},
  {"id": "two", "content": "This is the second item"}
]

We can do this with:

jq -r '
  .tweets 
  | map(.tweet // .) 
  | map({
      id: .id_str, 
      content: .full_text
    })
' Twitter/twitter_archive.json > Twitter/tweets_to_embed.json
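Before running this over the full archive, the transform can be sanity-checked on an inline sample (the tweet is made up):

```shell
# The same id/content mapping, applied to a one-tweet sample archive
echo '{"tweets": [{"tweet": {"id_str": "1", "full_text": "hello world"}}]}' |
  jq -c '.tweets | map(.tweet // .) | map({id: .id_str, content: .full_text})'
# [{"id":"1","content":"hello world"}]
```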

That will give us a file Twitter/tweets_to_embed.json. Then we can embed the tweets with:

llm embed-multi tweets Twitter/tweets_to_embed.json --store  # store plain text with embeddings

Now you can run a semantic search over the tweets:

llm similar tweets -c "Python is good"

Or chain llm commands to do RAG:

llm similar tweets -c "Python is good" -p | \
  llm "Look at my previous tweets about Python and write a new tweet in my voice/style about why Python is good"

Advanced Usage

Embedding only nth-percentile high-engagement bangers

Say we only want to embed the 100 most engaging tweets. We can do this by sorting the tweets by the sum of retweet_count and favorite_count, then taking the top 100:

jq -r '
  .tweets 
  | map(.tweet // .) 
  | sort_by(-((.retweet_count | tonumber) + (.favorite_count | tonumber))) 
  | .[0:100] 
  | map({
      id: .id_str, 
      content: .full_text, 
      metadata: {
        retweets: .retweet_count, 
        likes: .favorite_count
      }
    })
' Twitter/twitter_archive.json > Twitter/bangers_to_embed.json

Then, we can embed the bangers:

llm embed-multi bangers Twitter/bangers_to_embed.json --store

Excluding replies and media tweets

Say we want to exclude replies from our analysis. We can drop them by selecting only tweets where the in_reply_to_status_id field is null: map(select(.in_reply_to_status_id == null)). Similarly, we can exclude link and quote tweets by testing whether the full_text contains a t.co link: map(select(.full_text | test("https://t\\.co/") | not)). And we can exclude media tweets by checking that the extended_entities.media array is absent: map(select(.extended_entities.media == null)).
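Here are those three filters applied together to a small made-up sample; only the plain tweet (no reply, no link, no media) survives:

```shell
echo '[
  {"full_text": "just an opinion", "in_reply_to_status_id": null},
  {"full_text": "look: https://t.co/abc", "in_reply_to_status_id": null},
  {"full_text": "agreed!", "in_reply_to_status_id": "123"},
  {"full_text": "pic", "in_reply_to_status_id": null, "extended_entities": {"media": [{}]}}
]' | jq -c '
  map(select(.in_reply_to_status_id == null))
  | map(select(.full_text | test("https://t\\.co/") | not))
  | map(select(.extended_entities.media == null))
  | map(.full_text)
'
# ["just an opinion"]
```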

So, to embed the top 100 bangers that are not replies, link/quote tweets, or media tweets, we can do:

jq -r '
  .tweets 
  | map(.tweet // .) 
  | map(select(.in_reply_to_status_id == null)) 
  | map(select(.full_text | test("https://t\\.co/") | not)) 
  | map(select(.extended_entities.media == null)) 
  | sort_by(-((.retweet_count | tonumber) + (.favorite_count | tonumber))) 
  | .[0:100] 
  | map({
      id: .id_str, 
      content: .full_text, 
      metadata: {
        retweets: .retweet_count, 
        likes: .favorite_count
      }
    })
' Twitter/twitter_archive.json > Twitter/filtered_bangers_to_embed.json

Maybe we'd store these in a filtered_bangers embedding group:

llm embed-multi filtered_bangers Twitter/filtered_bangers_to_embed.json --store

Embedding tweets being replied to

Say we want to generate a reply to some tweet, and we want to provide the LLM examples of past replies to similar tweets. We can do this by embedding the tweet being replied to.

The llm tool supports storing metadata alongside the embeddings. What we need is a JSON array or NDJSON file with "id", "content", and "metadata" fields, where the "metadata" field is a JSON object with arbitrary key-value pairs:

[
  {
    "id": "https://x.com/user/status/1234567890",
    "content": "Python is gross and disgusting and anyone who uses it is a loser.",
    "metadata": {"reply_text": "But I love Python 😭."}
  },
  {
    "id": "https://x.com/user/status/1234567891",
    "content": "Python? I don't like it. It's not my favorite language.",
    "metadata": {"reply_text": "wym, Python is great!"}
  }
]

Assuming we have this data saved in a file Twitter/replied_tweets.json, we can embed the file with:

llm embed-multi replied_to_tweets Twitter/replied_tweets.json --store

Now, say we want to reply to a tweet that says, "Python is bad". We can find similar replied-to tweets to that one:

llm similar replied_to_tweets -c "Python is bad"

This will return the most similar results as JSON, one object per line, including any stored metadata.

We can pipe these results to an LLM to generate a reply in the same style as the retrieved replies:

llm similar replied_to_tweets -c "Python is bad" | \
  llm "Look at my past replies to tweets about Python and write a new reply in my voice/style to the tweet 'Python is bad'."

Getting data in the required format

The tricky part here is getting the text of replied-to tweets and putting it in the required format.

First, we'll enrich our tweet data with information about the tweets being replied to, if any:

jq -r '
  .tweets 
  | map(.tweet // .) 
  | map(select(.in_reply_to_status_id != null)) 
  | map({
      id: .id_str, 
      content: .full_text,
      replying_to_id: .in_reply_to_status_id,
      replying_to_user: .in_reply_to_screen_name,
      replying_to_url: (
        if .in_reply_to_screen_name and .in_reply_to_status_id then 
          "https://x.com/\(.in_reply_to_screen_name)/status/\(.in_reply_to_status_id)"
        else 
          null 
        end
      )
    })
' Twitter/twitter_archive.json > Twitter/tweets_to_embed_with_replies.json

That gives us an array of objects like:

[
  {
    "id": "1891535193061683419",
    "content": "Python is very special to me.",
    "replying_to_id": "1234567890",
    "replying_to_user": "another_user",
    "replying_to_url": "https://x.com/another_user/status/1234567890"
  }
]

Then, we'll have to run a scraping workflow to get the text of the tweets being replied to.

Scraping replied-to tweets with x-cli

The Twitter API's free tier allows 100 reads per month, and you can use my x-cli tool to query up to 100 tweets by ID in a single API call if you have a free developer account.

After installing and configuring the tool according to the instructions in the README, you can run x-cli get <id1> <id2> ... --json > Twitter/replied_tweets_raw.json to get an array of JSON data for the tweets matching those IDs. Building that command from the first 100 unique replying_to_id values in Twitter/tweets_to_embed_with_replies.json looks like this:

jq -r '
  map(.replying_to_id) 
  | unique 
  | .[0:100] 
  | "x-cli get " + join(" ") + " --json"
' Twitter/tweets_to_embed_with_replies.json | \
  xargs -I {} bash -c "{} > Twitter/replied_tweets_raw.json"

We can then extract the text of the tweets and store it in a new file:

jq -r '
  .data 
  | map({
      id: .id, 
      content: .text
    })
' Twitter/replied_tweets_raw.json > Twitter/replied_tweets.json

Finally, we need to map the replying_to_ids to the ids of our reply tweets, and inject the reply text as metadata:

jq -n '
  # Load replied-to tweets as an object indexed by id
  (input 
    | map({key: .id, value: {id: .id, content: .content}}) 
    | from_entries
  ) as $replied
  |
  # Load tweets with replies, but only those that have matching replied-to tweets
  (input 
    | map(select(
        .replying_to_id != null and 
        ($replied[.replying_to_id] // null) != null
      ))
  ) as $matching_replies
  |
  # Create the final output with metadata
  $matching_replies 
  | map(
      . as $reply 
      | $replied[.replying_to_id] 
      | {
          id: .id,
          content: .content,
          metadata: {reply_text: $reply.content}
        }
    )
' Twitter/replied_tweets.json Twitter/tweets_to_embed_with_replies.json > Twitter/replied_tweets_with_metadata.json
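The join is the fiddliest step, so it's worth verifying on tiny stand-in files first (paths, ids, and text below are illustrative):

```shell
# One replied-to tweet and one reply that points at it
cat > /tmp/replied_sample.json <<'EOF'
[{"id": "100", "content": "original tweet"}]
EOF
cat > /tmp/replies_sample.json <<'EOF'
[{"id": "200", "content": "my reply", "replying_to_id": "100"}]
EOF

jq -nc '
  (input | map({key: .id, value: {id: .id, content: .content}}) | from_entries) as $replied
  | (input | map(select(.replying_to_id != null and ($replied[.replying_to_id] // null) != null)))
  | map(. as $reply | $replied[.replying_to_id]
        | {id, content, metadata: {reply_text: $reply.content}})
' /tmp/replied_sample.json /tmp/replies_sample.json
# [{"id":"100","content":"original tweet","metadata":{"reply_text":"my reply"}}]
```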

Scraping replied-to tweets with scrape_tweet.js

Alternatively, we can run my scrape_tweet.js script (assuming bun is installed): bun run Twitter/scrape_tweet.js https://x.com/another_user/status/1234567890. It will use puppeteer to scrape the text of a single tweet by URL and return a JSON object containing the tweet content, with its url as the id.

The script supports an optional --metadata-json flag to attach arbitrary metadata to the output (e.g., the text of your reply). It will return a JSON object like:

{
  "id": "https://x.com/christophcsmith/status/1919038899428393413",
  "content": "I am well aware that I, a developer who uses Python to build web applications, am the one who is wrong.",
  "metadata": { "reply_text": "So wrong it's right!" }
}

You can run this script in a loop over all the replying_to_url values in Twitter/tweets_to_embed_with_replies.json to get a new JSON file with the texts of the tweets being replied to, while passing your original reply text as metadata:

# Create NDJSON of replied-to tweets, injecting reply text as metadata
jq -r '
  .[]
  | select(.replying_to_url != null)
  | [.replying_to_url, (.content | @json)]
  | @tsv
' Twitter/tweets_to_embed_with_replies.json | \
while IFS=$'\t' read -r url reply_json; do
  # reply_json is a JSON-escaped string, safe to embed into JSON
  bun run Twitter/scrape_tweet.js "$url" --metadata-json "{\"reply_text\":$reply_json}"
done > Twitter/replied_tweets.ndjson
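Since embed-multi also accepts newline-delimited JSON, the scraped file can then be embedded into its own collection (the collection name is up to you), mirroring the earlier command:

```shell
llm embed-multi replied_to_tweets Twitter/replied_tweets.ndjson --store
```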