[TIx 3] Extracting links from text with URI::extract

I have a simple Rails app that collects all the tweets I favorite on Twitter so I can sort and search through them at my leisure. Many of those favorites contain links I'd like to refer to, so I wrote a helper method that converts them to clickable anchor tags. It looked like this:

# app/helpers/favorites_helper.rb
module FavoritesHelper

  # snip
  
  def text_to_true_link(tweet_text)
    urls = tweet_text.scan(/https*:\/\/t.co\/\w+/)
    urls.each do |url|
      tweet_text.gsub!(url, "<a href=#{url} target='_blank'>#{url}</a>")
    end
    tweet_text.html_safe
  end
end

The text_to_true_link method

takes raw tweet_text as a string,
scans through it looking for Twitter shortlinks with a regex (which should have used ? in place of * there),
stores the link text in an array called urls,
substitutes each link with an anchor tag for that link, and
returns the newly formatted tweet_text with clickable links.
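
To make the aside about ? versus * concrete, here is a small sketch comparing the pattern from the helper with a stricter variant. The constant names and sample strings are made up for illustration. The stricter pattern also escapes the dot in t.co, which the original leaves as a match-anything wildcard.

```ruby
# Pattern as written in the helper: "s*" allows any number of s characters,
# and the unescaped "." matches any character.
LOOSE = /https*:\/\/t.co\/\w+/

# Illustrative stricter variant: "s?" allows zero or one s, "\." is a literal dot.
STRICT = /https?:\/\/t\.co\/\w+/

# Hypothetical sample strings, not taken from real tweets.
SAMPLES = [
  "http://t.co/abc123",
  "https://t.co/abc123",
  "httpsss://t.co/abc123",
  "https://tXco/abc123"
]

SAMPLES.each do |sample|
  puts "#{sample}: loose=#{LOOSE.match?(sample)} strict=#{STRICT.match?(sample)}"
end
```

Both patterns accept the first two samples. Only the loose one accepts the last two.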

I thought this was a pretty clever hack, but while looking at my oldest tweets, I realized that they had links that predated the standard t.co shortlink and consequently were not being converted into clickable links. So I did what you'd expect an inexperienced developer to do: I started looking for a Goldilocks regex, neither too complex nor too liberal, that would be adequate for my URI matching purposes.

While doing this, I stumbled upon a Stack Overflow answer that mentioned URI::regexp, and a comment on that answer mentioned URI::extract. What does URI::extract do? Why, exactly what I want: it extracts URIs from text.

At first, I tried using urls = URI.extract(tweet_text), which seemed to work. On further inspection, however, it was also capturing any word that ended in a colon, e.g.,

tweet_text = "Kleisli: common monads in Ruby https://github.com/txus/kleisli"
urls = URI.extract(tweet_text) # => ["Kleisli:", "https://github.com/txus/kleisli"]

Looking more closely at the documentation, I found that URI::extract takes an optional second argument that limits matches to a specific set of schemes.

tweet_text = "Kleisli: common monads in Ruby https://github.com/txus/kleisli"
urls = URI.extract(tweet_text, %w(http https)) # => ["https://github.com/txus/kleisli"]

This led me to my current adequate implementation:

# app/helpers/favorites_helper.rb
module FavoritesHelper

  # snip
  
  def text_to_true_link(tweet_text)
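    # uniq keeps a URL that appears more than once from being substituted twice
    urls = URI.extract(tweet_text, %w(http https)).uniq
    urls.each do |url|
      # quote the href value, as target already is
      tweet_text.gsub!(url, "<a href='#{url}' target='_blank'>#{url}</a>")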
    end
    tweet_text.html_safe
  end
end

Normally, I think I do a good job of checking (or knowing) whether Ruby has a method that does what I want before I try to implement my own solution. Thinking more deeply about why I missed URI::extract, I realized that while I have a pretty good command of Ruby's core libraries, I haven't spent nearly as much time exploring its standard libraries. I'd like to dig into more of the latter from here on out.

Questions I still have:

Is there a better way to replace embedded links in text with their clickable counterparts?
How does a large site like Twitter or Facebook implement this?
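
As one possible answer to the first question, here is a rough sketch, not a definitive implementation. It walks the text once, HTML-escapes everything, and wraps each match in an anchor tag, so the input string is never mutated and each occurrence is handled exactly once. The method name linkify is hypothetical. It relies on URI.regexp (the method from the Stack Overflow answer) and on ERB::Util.html_escape from the standard library. Note that the Ruby docs mark both URI.regexp and URI.extract as obsolete, so treat this as an illustration under that caveat.

```ruby
require 'erb'
require 'uri'

# Hypothetical standalone helper: returns a new, HTML-escaped string in which
# every URI with one of the given schemes is wrapped in an anchor tag.
def linkify(text, schemes = %w[http https])
  pattern = URI.regexp(schemes)
  html = ''.dup
  pos = 0

  text.scan(pattern) do
    match = Regexp.last_match
    # Escape the plain text between the previous match and this one.
    html << ERB::Util.html_escape(text[pos...match.begin(0)])
    # Escape the URL too, since it ends up inside an attribute value.
    url = ERB::Util.html_escape(match[0])
    html << %(<a href="#{url}" target="_blank">#{url}</a>)
    pos = match.end(0)
  end

  # Escape whatever follows the last match.
  html << ERB::Util.html_escape(text[pos..-1])
end

# Usage (sample text adapted from the tweet above):
# linkify("Kleisli: <b>monads</b> https://github.com/txus/kleisli")
```

Because everything outside the generated anchor tags is escaped, a Rails helper built on this could call html_safe on the result with less risk than marking raw tweet text as safe.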
