Thursday, January 30, 2014

A Fast, Cheap, and Inaccurate Fuzzy Short String Match for Ruby

The following is an extremely naive solution in Ruby for fuzzy matching strings, which may be fine if you have a limited number of strings to compare and know your dataset will work with it:

(a.chars & b.chars).size

e.g.

2.1.0p0 :001 > a = "from_this"
 => "from_this" 
2.1.0p0 :002 > b = "to_this"
 => "to_this" 
2.1.0p0 :003 > c = "from_those"
 => "from_those" 
2.1.0p0 :004 > (a.chars & b.chars).size
 => 6 
2.1.0p0 :005 > (a.chars & c.chars).size
 => 8 
2.1.0p0 :006 > (b.chars & c.chars).size
 => 5 
2.1.0p0 :007 > (a.chars & c.chars).size
 => 8 

By comparing the number of matching chars, you get a number you can sort on to get the most similar match last, e.g.

2.1.0 :001 > a = ['from_this','from_that']
 => ["from_this", "from_that"] 
2.1.0 :002 > b = ['to_that','to_this']
 => ["to_that", "to_this"] 
2.1.0 :003 > Hash[a.collect { |a_name| [a_name, b.sort_by { |b_name| (b_name.chars & a_name.chars).size }.last]}]
 => {"from_this"=>"to_this", "from_that"=>"to_that"} 

I tried using it like this as a quick-and-dirty to fuzzy match a sym in another list of syms:

sister_assoc_name = possible_sister_assoc_names.sort_by { |name| (name.to_s.chars & assoc_name.to_s.chars).size }.last

It was fun, but overly simplistic. It only worked for about 1/2 of the dataset I had in a test to ensure inverse_of options are setup correctly in models.

As of Jan. 2014, a better option for Ruby 2.1 is still fuzzy_match. Now the last example I had to fuzzy match one symbol in a list of symbols looks like:

sister_assoc_name = FuzzyMatch.new(possible_sister_assoc_names.map(&:to_s)).find(assoc_name.to_s).try(:to_sym)

No comments: