Ruby’s URI module contains tools for manipulating URIs (which any normal person would call URLs – there are reasons for distinguishing between URLs and URIs, but I’m not sure they’re worth the confusion caused by the existence of both terms).
If you need to grab the query string from a URI, or determine whether it’s relative, you’re supposed to use URI instead of some ad hoc regular expression:
[Dev]> uri = URI.parse("http://google.com/search?q=hi")
=> #<URI::HTTP:0x104daf9a8 URL:http://google.com/search?q=hi>
[Dev]> uri.host
=> "google.com"
[Dev]> uri.relative?
=> false
[Dev]> uri.query
=> "q=hi"

To prevent spam on Rap Genius, I want to nofollow all user-entered links. However, users frequently link to content within Rap Genius, and I do want search engines to crawl those. Here was my first try:
doc = Hpricot(explanation_html)
doc.search('a').each do |a|
  target = URI.parse(a['href'])
  unless target.relative? || target.host.match(/rapgenius\.com$/)
    a['rel'] = "nofollow"
  end
end

I.e., unless the link’s href is relative or it explicitly points to a page on Rap Genius, we add rel="nofollow".
Unfortunately, testing with actual user data showed that there are (at least) 4 problems with this code:
1) URI.parse can’t handle URIs surrounded by spaces
Allow me to demonstrate:
[Dev]> URI.parse(" http://google.com")
URI::InvalidURIError: bad URI(is not URI?): http://google.com
from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/uri/common.rb:436:in `split'
from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/uri/common.rb:485:in `parse'
from (irb):23

This is a design error in URI – maybe there should be a “strict” mode, but the default behavior should be more forgiving.
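The workaround used later in this post is to strip the whitespace before parsing. A minimal sketch (the variable name raw is only for illustration):

```ruby
require "uri"

raw = " http://google.com "   # href as a user might paste it, spaces included
uri = URI.parse(raw.strip)    # String#strip removes leading and trailing whitespace
puts uri.host                 # prints "google.com"
```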
2) URI.parse barfs when it sees an invalid URI
Suppose a user mistakenly enters <a href=")http://google.com">Google</a>. Obviously he meant “http://google.com”, not “)http://google.com” – and yet URI.parse throws an exception on this input.
To solve this, I pre-processed URI.parse’s input with URI.extract, which grabs URIs from a blob of text. (Though remember: it only extracts absolute URIs!)
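Roughly, the pre-processing looks like the sketch below. The variable names are illustrative, and the fallback to the original string is my assumption about how to keep relative URLs flowing through. Newer Ruby versions mark URI.extract as obsolete, though as far as I know it is still available.

```ruby
require "uri"

href = ")http://google.com"          # the malformed href from the example above
candidates = URI.extract(href)       # array of absolute URIs found in the string
cleaned = candidates.first || href   # nothing extracted (e.g. relative URL): keep the input
target = URI.parse(cleaned)
puts target.host                     # prints "google.com"
```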
3) URI.parse barfs when it sees certain characters
I don’t think http://www.google.com/search?q=% is technically a valid URI (since the % has a special meaning), but all modern browsers interpret it correctly. URI.parse, on the other hand, throws an exception.
The trick here is to pre-process inputs to URI.parse with URI.encode, which smartly encodes the special characters (in this case, replacing % with %25).
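One caveat if you try this today: URI.encode was removed in Ruby 3.0. The sketch below swaps in a plain gsub that escapes only percent signs that don’t begin a valid escape sequence. It is a much narrower stand-in, not a reimplementation of URI.encode, and the helper name escape_stray_percents is hypothetical.

```ruby
require "uri"

# Hypothetical helper: escape "%" only when it is NOT followed by two hex
# digits, so an already-encoded sequence such as "%20" is left untouched.
def escape_stray_percents(str)
  str.gsub(/%(?![0-9A-Fa-f]{2})/, "%25")
end

fixed = escape_stray_percents("http://www.google.com/search?q=%")
puts fixed                    # http://www.google.com/search?q=%25
puts URI.parse(fixed).query   # q=%25
```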
4) URI.parse can’t handle sub-domains that contain underscores
[Dev]> URI.parse("http://a_b.google.com")
URI::InvalidURIError: the scheme http does not accept registry part: a_b.google.com (or bad hostname?)

Again, maybe there’s some RFC that proves this is “correct” behavior, but this isn’t helpful when I need to handle a subdomain with an underscore (of which there are many!).
I don’t have a good workaround for this one.
Here’s the version of my rel="nofollow"-izer that fixes issues 1-3 above:
doc = Hpricot(explanation_html)
doc.search('a').each do |a|
  uri = URI.encode(a['href'].strip)
  uri = URI.extract(uri).first if URI.extract(uri).first
  target = URI.parse(uri)
  unless target.relative? || target.host.match(/rapgenius\.com$/)
    a['rel'] = "nofollow"
  end
end

The if URI.extract(uri).first part is there because URI.extract returns [] when you feed it a relative URL. (This also means that the parsing of relative URLs will be unnecessarily strict, but whatever.)
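This loop still blows up whenever URI.parse rejects the input (issue 4, for instance), and target.host is nil for links like mailto:, which makes the .match call fail. Here is a hedged sketch of a more forgiving check that treats anything unparseable as external. The names nofollow? and SITE_HOST are hypothetical, the stricter host pattern is my own choice, and Hpricot is left out so the snippet runs on its own:

```ruby
require "uri"

# Hypothetical pattern: matches rapgenius.com and its subdomains, but not
# lookalike hosts such as "notrapgenius.com".
SITE_HOST = /(\A|\.)rapgenius\.com\z/i

# Hypothetical helper: true when a user-entered href should get rel="nofollow".
def nofollow?(href)
  target = URI.parse(href.to_s.strip)
  return false if target.relative?   # relative links stay on the site
  host = target.host                 # nil for hostless schemes such as mailto:
  host.nil? || host !~ SITE_HOST
rescue URI::InvalidURIError
  true                               # unparseable input: assume it is external
end
```

Inside the loop, the body would then shrink to a['rel'] = "nofollow" if nofollow?(a['href']).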
Good libraries solve real-life problems, not “well-formed user input” fantasy-land problems. In this respect, Hpricot (for example) is a good library, and URI is not.