[lnkForumImage]
TotalShareware - Download Free Software

Confronta i prezzi di migliaia di prodotti.
Asp Forum
 Home | Login | Register | Search 


 

Forums >

comp.lang.ruby

Detecting similar strings

Dylan Markow

8/1/2006 6:26:00 AM

Is there a way to take two strings, and decide if they are "similar."
I'm creating a contact system in Rails, and am having a large problem
with my users punching in duplicate entries with the last names spelled
slightly different.

Is there a way to check if 2 strings are "identical" up to a certain
percentage, such as only having 1 or 2 characters different?

--
Posted via http://www.ruby-....

9 Answers

Farrel Lifson

8/1/2006 6:43:00 AM

0

On 01/08/06, Dylan Markow <dylan@dylanmarkow.com> wrote:
> Is there a way to take two strings, and decide if they are "similar."
> I'm creating a contact system in Rails, and am having a large problem
> with my users punching in duplicate entries with the last names spelled
> slightly different.
>
> Is there a way to check if 2 strings are "identical" up to a certain
> percentage, such as only having 1 or 2 characters different?

Here's a Perl module that does something similar. You might try
porting it over to Ruby.
http://search.cpan.org/~mlehmann/String-Similarity-1.02/Sim...

Farrel

Farrel Lifson

8/1/2006 6:47:00 AM

0

On 01/08/06, Farrel Lifson <farrel.lifson@gmail.com> wrote:
> On 01/08/06, Dylan Markow <dylan@dylanmarkow.com> wrote:
> > Is there a way to take two strings, and decide if they are "similar."
> > I'm creating a contact system in Rails, and am having a large problem
> > with my users punching in duplicate entries with the last names spelled
> > slightly different.
> >
> > Is there a way to check if 2 strings are "identical" up to a certain
> > percentage, such as only having 1 or 2 characters different?
>
> Here's a Perl module that does something similar. You might try
> porting it over to Ruby.
> http://search.cpan.org/~mlehmann/String-Similarity-1.02/Sim...
>
> Farrel

And here is a nice description of the Levenstein Distance:
http://en.wikipedia.org/wiki/Levenshtei...

Farrel

Alexandru E. Ungur

8/1/2006 7:23:00 AM

0

>>> sender: "Dylan Markow" date: "Tue, Aug 01, 2006 at 03:25:59PM +0900" <<<EOQ
> Is there a way to take two strings, and decide if they are "similar."
> I'm creating a contact system in Rails, and am having a large problem
> with my users punching in duplicate entries with the last names spelled
> slightly different.
>
> Is there a way to check if 2 strings are "identical" up to a certain
> percentage, such as only having 1 or 2 characters different?
Looks like there is a soundex implementation for Ruby:
http://raa.ruby-lang.org/search.rhtml?sear...

If by any chance you are using MySQL you could use the soundex function
builtin into it as well.

Good luck,
Alex

benjohn

8/1/2006 7:36:00 AM

0

>>>> sender: "Dylan Markow" date: "Tue, Aug 01, 2006 at 03:25:59PM +0900"
>>>> <<<EOQ
>> Is there a way to take two strings, and decide if they are "similar."
>> I'm creating a contact system in Rails, and am having a large problem
>> with my users punching in duplicate entries with the last names
>> spelled
>> slightly different.
>>
>> Is there a way to check if 2 strings are "identical" up to a certain
>> percentage, such as only having 1 or 2 characters different?
> Looks like there is a soundex implementation for Ruby:
> http://raa.ruby-lang.org/search.rhtml?sear...
>
> If by any chance you are using MySQL you could use the soundex function
> builtin into it as well.

If you check on Google under "ruby soundex module", you'll find a few
likely hits:
http://tinyurl...

I remember a friend saying that there are superior algorithms to soundex
that are no more complex, so you might want to explore about a little.



benjohn

8/1/2006 8:04:00 AM

0

On 01/08/06, Farrel Lifson <farrel.lifson@gmail.com> wrote:
> And here is a nice description of the Levenstein Distance:
> http://en.wikipedia.org/wiki/Levenshtei...

You know, there are lots of implementations there, but the Ruby one
seems to be missing :) [There's no reason to restrict it to working on
strings. If you duck, it'll work just as nicely on arrays of what have
you.]




Michal Suchanek

8/1/2006 8:21:00 AM

0

On 8/1/06, benjohn@fysh.org <benjohn@fysh.org> wrote:
> >>>> sender: "Dylan Markow" date: "Tue, Aug 01, 2006 at 03:25:59PM +0900"
> >>>> <<<EOQ
> >> Is there a way to take two strings, and decide if they are "similar."
> >> I'm creating a contact system in Rails, and am having a large problem
> >> with my users punching in duplicate entries with the last names
> >> spelled
> >> slightly different.
> >>
> >> Is there a way to check if 2 strings are "identical" up to a certain
> >> percentage, such as only having 1 or 2 characters different?
> > Looks like there is a soundex implementation for Ruby:
> > http://raa.ruby-lang.org/search.rhtml?sear...
> >
> > If by any chance you are using MySQL you could use the soundex function
> > builtin into it as well.
>
> If you check on Google under "ruby soundex module", you'll find a few
> likely hits:
> http://tinyurl...
>
> I remember a friend saying that there are superior algorithms to soundex
> that are no more complex, so you might want to explore about a little.
>

If the names might be from different languages I would rather use
Levenstein than soundex. Levenstein is probably good at describing
typos as "very close" but soundex might be somewhat language specific.

But I haven't tried so I might be wrong.

Thanks

Michal

Tom Reilly

8/1/2006 1:55:00 PM

0

Here is an implementation of levenstein distance. It seems to me that
there is another in ruby gems somewhere but I don't remember where.

# Determine Levenshtein distance of two strings

def Ld(s,t)

#s string #1
#t string #2

n = s.size
m = t.size
a = Array.new

if n != 0 && m != 0

#2 create array
r = Array.new
rz = Array.new

0.upto(m) {|x| r.push(0)}

0.upto(n) {|x|a.push(r.dup)}
a.each_index {|x| a[x][0] = x}
0.upto(m) {|x| a[0][x] = x}

#a.each {|x| p x}

cost = 0
1.upto(n) do |i|
1.upto(m) do |j|
if s[i] == t[j]
cost =0
else
cost = 1
end
a[i][j] = [a[ i- 1][j] +1,a[i][j - 1] + 1,a[i - 1][j -
1] + cost].min
end
end
a[n][m]
#a.each {|x| p x}
else
0
end
end



Gene Tani

8/1/2006 1:55:00 PM

0


Dylan Markow wrote:
> Is there a way to take two strings, and decide if they are "similar."
> I'm creating a contact system in Rails, and am having a large problem

Interesting your name is almost "Markov" ;-}

Anyway, besides algorithms mentioned here, I have notes on Double
Metaphone, NYSIIS, Phonex. For sequence analysis in general, google
McIlroy-Hunt, Ratcliff/Obershelp:

http://docs.python.org/lib/module-difflib.htmlhttp://docs.python.org/lib/module-di...


Gene Tani

8/1/2006 2:06:00 PM

0


Gene Tani wrote:
> Dylan Markow wrote:
> > Is there a way to take two strings, and decide if they are "similar."

http://homepages.cs.ncl.ac.uk/brian.randell/Genealogy/NameMa...