Identifying duplicate songs, artists, and albums on Rap Genius: not so easy!

Before you can prevent users from adding duplicate artists to Rap Genius, you have to be able to identify duplicate artists.

This is harder than it sounds – you can’t just look for duplicate artist names because different users will invariably use different names for the same artist. For example, Lil' Wayne and Lil Wayne (i.e., with and without the apostrophe) both refer to the same person.

So different names do not necessarily imply different artists. The question is: how different do two names have to be before we can infer that they refer to different things?

String “Essence”

The apostrophe case gives us a start: two names must differ by more than just punctuation – specifically, they must differ by at least one letter or number – in order to refer to different things.

How do we use this fact to help us detect duplicates? Instead of comparing two names directly, we compare their essences, where a name’s essence is calculated by removing everything that isn’t a letter or a number.

Here’s a Ruby implementation of this idea:

class String
  def essence
    strip. # remove leading and trailing whitespace
    downcase. # make everything lowercase
    gsub(/[^a-z0-9]/, '') # remove all non-alpha numerics
  end
end

Here’s how this works:

"Lil' Wayne".essence # => "lilwayne"
"Lil' Wayne" == "Lil Wayne" # => false
"Lil' Wayne".essence == "Lil Wayne".essence # => true

The names aren’t equal, but their essences are, so they refer to the same thing.
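To turn this into an actual duplicate finder, one option is to group names by essence and keep the groups with more than one member. This is only a sketch: it repeats the simple String#essence from above so it runs on its own, and duplicate_groups and the sample names are made up for illustration.

```ruby
class String
  # Same simple essence as above: trim, lowercase, keep only letters and digits
  def essence
    strip.downcase.gsub(/[^a-z0-9]/, '')
  end
end

# Hypothetical helper: returns the groups of names that share an essence
def duplicate_groups(names)
  names.group_by(&:essence).values.select { |group| group.size > 1 }
end

names = ["Lil' Wayne", "Lil Wayne", "N.W.A.", "NWA", "Nas"]
duplicate_groups(names)
# => [["Lil' Wayne", "Lil Wayne"], ["N.W.A.", "NWA"]]
```

Each group is a set of candidate duplicates that a human (or a merge script) can then review.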

Improving Essence

Does our implementation of essence completely capture the idea of “sameness”? Is it possible for two names to contain different alpha-numeric characters and yet still refer to the same thing? Anything is possible when you’re dealing with natural language!

Here are some modifications to String#essence I made based on experimenting with real-world data:

  • Puff Daddy & The Family refers to the same thing as Puff Daddy and The Family
    • So essence should treat & and and the same. Similarly, essence should treat + and plus the same
  • Nipsey Hu$$le should equal Nipsey Hussle
    • Essence should treat $ and s the same (at least in the case of rap names)
  • The Hot Boyz should equal Hot Boyz
    • Essence should ignore leading articles
  • The Hot Boyz should equal The Hot Boys
    • Essence should treat trailing z’s the same as trailing s’s
  • Da Hot Boyz should equal The Hot Boyz
    • Essence should treat Da (and Tha) like The. (Again, at least for rap names)

The finished product

Here’s a version of String#essence that accounts for these observations. Am I missing anything?

class String
  def essence
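    strip.                          # remove leading and trailing whitespace
    downcase.                       # make everything lowercase
    gsub('&', 'and').               # treat "&" like "and"
    gsub('$', 's').                 # treat "$" like "s" ("hu$$le" => "hussle")
    gsub('+', 'plus').              # treat "+" like "plus"
    gsub(/\bda\b/i, 'the').         # treat "da" (as a whole word) like "the"
    gsub(/([a-y\-])z\b/i, '\1s').   # a word-final "z" after a-y or "-" becomes "s"
    sub(/^(th[ea]|a|an)\s+/i, '').  # drop a leading "the", "tha", "a", or "an"
    gsub(/[^a-z0-9]/, '')           # remove everything that isn't a letter or digit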
  end
end
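A quick way to sanity-check this version is to run it against the observations listed earlier. The sketch below repeats the method so it runs on its own; same_artist? is a hypothetical helper for illustration, not something from the Rap Genius codebase.

```ruby
class String
  def essence
    strip.
    downcase.
    gsub('&', 'and').
    gsub('$', 's').
    gsub('+', 'plus').
    gsub(/\bda\b/i, 'the').
    gsub(/([a-y\-])z\b/i, '\1s').
    sub(/^(th[ea]|a|an)\s+/i, '').
    gsub(/[^a-z0-9]/, '')
  end
end

# Hypothetical helper: two names match when their essences are equal
def same_artist?(a, b)
  a.essence == b.essence
end

same_artist?("Puff Daddy & The Family", "Puff Daddy and The Family") # => true
same_artist?("Nipsey Hu$$le", "Nipsey Hussle")                       # => true
same_artist?("The Hot Boyz", "Hot Boyz")                             # => true
same_artist?("The Hot Boyz", "The Hot Boys")                         # => true
same_artist?("Da Hot Boyz", "The Hot Boyz")                          # => true
same_artist?("Tupac", "2Pac")                                        # => false
```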

(Note: Essence doesn’t help you at all in the Tupac / 2Pac case. For this you have to manually compile a list of alternate names for each artist! lol!)
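For what it’s worth, that manual list could be as simple as a hash from alternate essences to a canonical one. This is a rough sketch, not how Rap Genius actually stores aliases: ALIASES, canonical_essence, and simple_essence (a stand-in for String#essence so the snippet runs on its own) are all hypothetical names, and the table entries are just examples.

```ruby
# Stand-in for String#essence; swap in the full version in real use
def simple_essence(name)
  name.strip.downcase.gsub(/[^a-z0-9]/, '')
end

# Hypothetical hand-compiled table: alternate essence => canonical essence
ALIASES = {
  "2pac"     => "tupac",
  "makaveli" => "tupac"
}.freeze

# Look the essence up in the alias table, falling back to the essence itself
def canonical_essence(name)
  essence = simple_essence(name)
  ALIASES.fetch(essence, essence)
end

canonical_essence("2Pac") == canonical_essence("Tupac")     # => true
canonical_essence("Makaveli") == canonical_essence("Tupac") # => true
canonical_essence("Nas")                                    # => "nas"
```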

Posted March 25th, 2011

Enforcing rel='nofollow' with Ruby's URI Library: What a Pain!

An Overview of Ruby’s URI Module

Ruby’s URI module contains tools for manipulating URIs (which any normal person would call URLs; there are reasons for distinguishing between URLs and URIs, but I’m not sure they’re worth the confusion caused by the existence of both terms).

If you need to grab the query string from a URI, or determine whether it’s relative, you’re supposed to use URI instead of some ad hoc regular expression:

[Dev]> uri = URI.parse("http://google.com/search?q=hi")
=> #<URI::HTTP:0x104daf9a8 URL:http://google.com/search?q=hi>
[Dev]> uri.host
=> "google.com"
[Dev]> uri.relative?
=> false
[Dev]> uri.query
=> "q=hi"

Using the URI Module

To prevent spam on Rap Genius, I want to nofollow all user-entered links. However, users frequently link to content within Rap Genius, and I do want search engines to crawl those. Here was my first try:

doc = Hpricot(explanation_html)
doc.search('a').each do |a|
  target = URI.parse(a['href'])
  unless target.relative? || target.host.match(/rapgenius\.com$/)
    a['rel'] = "nofollow"
  end
end

I.e., unless the link’s href is relative or it explicitly points to a page on Rap Genius, we add rel="nofollow".

Somebody Made A Mess

Unfortunately, testing with actual user data showed that there are (at least) 4 problems with this code:

1) URI.parse can’t handle URIs surrounded by spaces

Allow me to demonstrate:

[Dev]> URI.parse(" http://google.com")
URI::InvalidURIError: bad URI(is not URI?):  http://google.com
    from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/uri/common.rb:436:in `split'
    from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/uri/common.rb:485:in `parse'
    from (irb):23

This is a design error in URI – maybe there should be a “strict” mode, but the default behavior should be more forgiving.

2) URI.parse barfs when it sees an invalid URI

Suppose a user mistakenly enters <a href=")http://google.com">Google</a>. Obviously he meant “http://google.com”, not “)http://google.com”, and yet URI.parse throws an exception on this input.

To solve this, I pre-processed URI.parse’s input with URI.extract, which grabs URIs from a blob of text. (Though remember: it only extracts absolute URIs!)

3) URI.parse barfs when it sees certain characters

I don’t think http://www.google.com/search?q=% is technically a valid URI (since the % has a special meaning), but all modern browsers interpret it correctly. URI.parse, on the other hand, throws an exception.

The trick here is to pre-process inputs to URI.parse with URI.encode, which smartly encodes the special characters. (In this case replacing % with %25)

4) URI.parse can’t handle sub-domains that contain underscores

[Dev]> URI.parse("http://a_b.google.com")
URI::InvalidURIError: the scheme http does not accept registry part: a_b.google.com (or bad hostname?)

Again, maybe there’s some RFC that proves this is “correct” behavior, but this isn’t helpful when I need to handle a subdomain with an underscore (of which there are many!).

I don’t have a good workaround for this one.

The Final Product

Here’s the version of my rel="nofollow"-izer that fixes issues 1-3 above:

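doc = Hpricot(explanation_html)
doc.search('a').each do |a|
  # issues 1 and 3: trim surrounding whitespace, then escape special characters
  uri = URI.encode(a['href'].strip)
  # issue 2: pull the absolute URI out of any surrounding junk, if there is one
  uri = URI.extract(uri).first if URI.extract(uri).first
  target = URI.parse(uri)
  # leave relative links and links to rapgenius.com alone; nofollow the rest
  unless target.relative? || target.host.match(/rapgenius\.com$/)
    a['rel'] = "nofollow"
  end
end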

The if URI.extract(uri).first part is there because URI.extract returns [] when you feed it a relative URL. (This also means that the parsing of relative URLs will be unnecessarily strict, but whatever.)
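For inputs that still won’t parse (issue 4, for example), one possible fallback is to treat anything URI.parse rejects as external and nofollow it, since a spammy link slipping through is worse than an over-cautious nofollow. The sketch below is an assumption-laden variation, not the code above: nofollow? and INTERNAL_HOST are hypothetical names, it works on a bare href string rather than an Hpricot element, and it anchors the host pattern so that a lookalike host such as notrapgenius.com isn’t treated as internal.

```ruby
require 'uri'

# Hypothetical pattern: rapgenius.com itself or any subdomain of it
INTERNAL_HOST = /(\A|\.)rapgenius\.com\z/i

# Hypothetical helper: should this href get rel="nofollow"?
def nofollow?(href)
  target = URI.parse(href.to_s.strip)
  # A plain relative link (no scheme, no host) stays on the site
  return false if target.host.nil? && target.relative?
  # Anything else is external unless its host is a Rap Genius host
  target.host.to_s !~ INTERNAL_HOST
rescue URI::InvalidURIError
  # Unparseable input: assume the worst and nofollow it
  true
end

nofollow?(" http://google.com ")            # => true
nofollow?("/Lil-wayne-6-foot-7-foot-lyrics") # => false
nofollow?("http://rapgenius.com/songs")     # => false
nofollow?("http://notrapgenius.com/")       # => true
nofollow?("http://a b.com")                 # => true (URI.parse raises)
```

Inside the Hpricot loop this would reduce to a['rel'] = "nofollow" if nofollow?(a['href']).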

The Lesson

Good libraries solve real life problems, not “well-formed user input” fantasy-land problems. In this respect, Hpricot (e.g.) is a good library, and URI is not.

Posted November 22nd, 2010

String#to_file: the easiest way to write a string to a file in Ruby

The Ruby file API has always struck me as inelegant (i.e., I’m constantly looking up its syntax). So I wrote String#to_file to make the common operation of writing a string to a file easy:

class String
  def to_file(filename)
    File.open(filename, 'w') {|f| f.write self }
  end
end

Now when you do this:

"some string".to_file("testing.txt")

You’ll get a file called “testing.txt” that contains “some string”. Easy, no?

You can use the same method to download files:

require 'open-uri'
url = 'http://blog.stackoverflow.com/audio/stackoverflow-podcast-001.mp3'
open(url).read.to_file(url.split('/').last)

And boom, you’ve downloaded the file to “stackoverflow-podcast-001.mp3”.
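One caveat for the download case: an MP3 is binary data, and mode 'w' opens the file in text mode, which can mangle bytes on Windows. Here is a hedged sketch of a binary-safe sibling, assuming Ruby 2.0 or later (File.binwrite and String#b don’t exist in 1.8); to_binary_file is a hypothetical name, and the sample bytes are made up.

```ruby
require 'tmpdir'

class String
  # Hypothetical binary-safe variant of to_file: writes the bytes untouched
  def to_binary_file(filename)
    File.binwrite(filename, self)
  end
end

Dir.mktmpdir do |dir|
  path  = File.join(dir, "testing.bin")
  bytes = "\xFF\xD8\r\nsome string".b # arbitrary binary content
  bytes.to_binary_file(path)          # returns the number of bytes written
  File.binread(path) == bytes         # => true
end
```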

Posted November 18th, 2010