Recently I started using Dalli for memcache in Rap Genius because memcached-northscale was being a dick
This caused some problems because memcache can’t handle cache keys with spaces or other special characters, and Dalli doesn’t do anything to protect you from this (or at least it doesn’t in version 1.0.4, which is the most recent version compatible with Rails 2.3.x (which Rap Genius still uses))
Cache keys with spaces come up because of the way Rails generates them. For example, this block:
cache [@song, 'meta description'] do
meta = @song.lyrics_as_text.gsub(/\n+/, ' / ').strip[0,180]
endgenerates the cache key:
258/258/songs/66266-20120328010114/258/meta description
Here’s what each element represents:
258 – the “cache version”. This is appended to every cache key across the application so that I can easily invalidate everything by incrementing it (it appears multiple times in the cache key because of the way Rails constructs it – i.e., for no particular reason)
songs – the model name
66266 – the id of the current song
20120328010114 – the current song’s updated_at
meta description – the name of the specific thing we’re caching
The idea is that you encode all this information in the cache key so that you don’t have to worry about invalidating the cache when something changes. For example, if I update the song, the cache key will change and the next time Rails hits this block it will look for a non-existent cache key and regenerate the song’s meta description. This technique is known as “generational caching”
BUT ANYWAY, Dalli throws up when you feed it a cache key that contains a space. How do we handle this?
One thought is to replace spaces with chill characters (e.g., dashes) in all cache keys. But this is fraught because then “Dan Berger” and “Dan-Berger” will appears as the same hash key to Dalli, which could cause collisions
What we need is a way to map every hash key to a unique string consisting of permissible characters. And for this we use a hash. E.g.:
> Digest::SHA1.hexdigest("Dan Berger")
=> "d86a8cfad42d11b58c5d369b235b78345e0eaa01"
> Digest::SHA1.hexdigest("Dan-Berger")
=> "9ea3c32819f1625b926937e5a8a59a23a4d9aae5"Now we need some code that sits between Dalli and Rap Genius, transparently hashing every cache key. I.e., when I write
Rails.cache.write("Dan Berger", "hilarious")Dalli will see it as
Rails.cache.write("d86a8cfad42d11b58c5d369b235b78345e0eaa01", "hilarious")This way, when I’m programming Rap Genius I don’t want to have to remember “okay, I gotta hash this string before I sent it to Dalli!”. Instead, I want Dalli to behave as if it’s using a special upgraded memcache that accepts any hash key.
SO: how do we accomplish this? Well first we need to find the method that Dalli uses to actually interact with memcache. Fortunately, Dalli uses a single method for both reading and writing so it doesn’t have to duplicate its error checking code. That method is called perform:
module Dalli
class Client
# Chokepoint method for instrumentation
def perform(op, key, *args)
key = key.to_s
validate_key(key)
key = key_with_namespace(key)
begin
server = ring.server_for_key(key)
server.request(op, key, *args)
rescue NetworkError => e
Dalli.logger.debug { e.message }
Dalli.logger.debug { "retrying request with new server" }
retry
end
end
end
endWe need to make sure the key that we feed perform is always hashed. This is fairly simple to accomplish with alias_method:
module Dalli
class Client
def perform_with_hash(op, key, *args)
perform_without_hash(op, Digest::SHA1.hexdigest(key), *args)
end
alias_method :perform_without_hash, :perform
alias_method :perform, :perform_with_hash
end
endThe idea here is to tell Ruby:
perform method with the name perform_without_hashperform_with_hash method with the name perform (thus, when the outside world THINKS they’re calling perform, they’re really calling perform_with_hash)perform_with_hash to pass along its arguments unchanged to perform_without_hash (i.e., the original perform), except, let’s hash the key firstNow, if that weren’t trippy enough already, Rails provides an abstraction for this common “method wrapper” pattern:
module Dalli
class Client
def perform_with_hash(op, key, *args)
perform_without_hash(op, Digest::SHA1.hexdigest(key), *args)
end
alias_method_chain :perform, :hash
end
endAnd that’s it – we’re done! Now we can use any cache key we want, as if there were never a character restriction. Why doesn’t Dalli do this out of the box? I have no clue lol!
Before you can prevent users from adding duplicate artists to Rap Genius, you have to be able to identify duplicate artists.
This is harder than it sounds – you can’t just look for duplicate artist names because different users will invariably use different names for the same artist. For example, Lil' Wayne and Lil Wayne (i.e., with and without the apostrophe) both refer to the same person.
So different names does not necessarily imply different artists. The question is how different do two names have to be before we can infer that they refer to different things?
The apostrophe case gives us a start: two names must differ by more than just punctuation – specifically, they must differ by at least one letter or number – in order to refer to different things.
How do we use this fact to help us detect duplicates? Instead of comparing two names directly, we will instead compare their essences – where a name’s essence is calculated by removing all its non-letters and non-numbers.
Here’s a Ruby implementation of this idea:
class String
def essence
strip. # remove leading and trailing whitespace
downcase. # make everything lowercase
gsub(/[^a-z0-9]/, '') # remove all non-alpha numerics
end
endHere’s how this works:
"Lil' Wayne".essence # => "lilwayne"
"Lil' Wayne" == "Lil Wayne" # => false
"Lil' Wayne".essence == "Lil Wayne".essence # => trueThe names aren’t equal, but their essences are, so they refer to the same thing
Does our implementation of essence completely capture the idea of “sameness”? Is it possible for two names to contain different alpha-numeric characters and yet still refer to the same thing? Anything is possible when you’re dealing with natural language!
Here are some modifications to String#essence I made based on experimenting with real-world data:
Here’s a version of String#essence that accounts for these observations. Am I missing anything?
class String
def essence
strip.
downcase.
gsub('&', 'and').
gsub('$', 's').
gsub('+', 'plus').
gsub(/\bda\b/i, 'the').
gsub(/([a-y\-])z\b/i, '\1s').
sub(/^(th[ea]|a|an)\s+/i, '').
gsub(/[^a-z0-9]/, '')
end
end(Note: Essence doesn’t help you at all in the Tupac / 2Pac case. For this you have to manually compile a list of alternate names for each artist! lol!)
Ruby’s URI module contains tools for manipulating URIs (which any normal person would call URLs – there are reasons for distinguishing between URLs and URIs, but I’m not sure they’re worth the confusion caused by the existence of both terms)
If you need to grab the query string from a URI, or determine whether it’s relative, you’re supposed to use URI instead of some ad hoc regular expression:
[Dev]> uri = URI.parse("http://google.com/search?q=hi")
=> #<URI::HTTP:0x104daf9a8 URL:http://google.com/search?q=hi>
[Dev]> uri.host
=> "google.com"
[Dev]> uri.relative?
=> false
[Dev]> uri.query
=> "q=hi"To prevent spam on Rap Genius, I want to nofollow all user-entered links. However, users frequently link to content within Rap Genius, and I do want search engines to crawl those. Here was my first try:
doc = Hpricot(explanation_html)
doc.search('a').each do |a|
target = URI.parse(a['href'])
unless target.relative? || target.host.match(/rapgenius\.com$/)
a['rel'] = "nofollow"
end
endI.e., unless the link’s href is relative or it explicitly points to a page Rap Genius, we add rel="nofollow"
Unfortunately, testing with actual user data showed that there are (at least) 4 problems with this code
1) URI.parse can’t handle URIs surrounded by spaces
Allow me to demonstrate:
[Dev]> URI.parse(" http://google.com")
URI::InvalidURIError: bad URI(is not URI?): http://google.com
from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/uri/common.rb:436:in `split'
from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/uri/common.rb:485:in `parse'
from (irb):23This is a design error in URI – maybe there should be a “strict” mode, but the default behavior should be more forgiving.
2) URI.parse barfs when it sees an invalid URI
Suppose a user mistakenly enters <a href=")http://google.com">Google<a>. Obviously he meant “http://google.com”, not “)http://google.com” – and yet URI.parse throws an exception on this input.
To solve this, I pre-processed URI.parse’s input with URI.extract, which grabs URIs from a blob of text. (Though remember: it only extracts absolute URIs!)
3) URI.parse barfs when it sees certain characters
I don’t think http://www.google.com/search?q=% is technically a valid URI (since the % has a special meaning), but all modern browsers interpret it correctly. URI.parse on the other hand, throws an exception.
The trick here is to pre-process inputs to URI.parse with URI.encode, which smartly encodes the special characters. (In this case replacing % with %25)
4) URI.parse can’t handle sub-domains that contain underscores
[Dev]> URI.parse("http://a_b.google.com")
URI::InvalidURIError: the scheme http does not accept registry part: a_b.google.com (or bad hostname?)Again, maybe there’s some RFC that proves this is “correct” behavior, but this isn’t helpful when I need to handle a subdomain with an underscore (of which there are many!)
I don’t have a good workaround for this one.
Here’s the version of my rel="nofollow"-izer that fixes issues 1-3 above:
doc = Hpricot(explanation_html)
doc.search('a').each do |a|
uri = URI.encode(a['href'].strip)
uri = URI.extract(uri).first if URI.extract(uri).first
target = URI.parse(uri)
unless target.relative? || target.host.match(/rapgenius\.com$/)
a['rel'] = "nofollow"
end
endThe if URI.extract(uri).first part is there because URI.extract returns [] when you feed it a relative URL. (This also means that the parsing of relative URLs will be unnecessarily strict, but whatever)
Good libraries solve real life problems, not “well-formed user input” fantasy-land problems. In this respect, Hpricot (e.g.) is a good library, and URI is not.