Asp Forum - Help with string matching algorythm

Tomislav Kralj

8/9/2007 12:28:00 PM

Hello,

I would be so gratefull to ANYONE who can help me!

I'm writting a program in C++.
A part of my program is to compare two strings and as a result I have to
get a number (range:0-1) which represents a similarity beetwen those two
strings.

Which algorythm to use?
I searched Web, but there's a million of them, and I don't know which to
use.
I don't need a solution in C++, just a hint which algorythm to use to
implement this concept.

Thanx!
--
Posted via http://www.ruby-....

9 Answers

Todd Burch

8/9/2007 12:32:00 PM

Tomislav Kralj wrote:
>
> I'm writting a program in C++.
> A part of my program is to compare two strings and as a result I have to
> get a number (range:0-1) which represents a similarity beetwen those two
> strings.
>
> Which algorythm to use?
> I searched Web, but there's a million of them, and I don't know which to
> use.
> I don't need a solution in C++, just a hint which algorythm to use to
> implement this concept.
>
> Thanx!

strcmp()

Todd
--
Posted via http://www.ruby-....

Robert Klemme

8/9/2007 2:45:00 PM

2007/8/9, Tomislav Kralj <rex14012001@yahoo.com>:
> Hello,
>
> I would be so gratefull to ANYONE who can help me!
>
> I'm writting a program in C++.
> A part of my program is to compare two strings and as a result I have to
> get a number (range:0-1) which represents a similarity beetwen those two
> strings.
>
> Which algorythm to use?
> I searched Web, but there's a million of them, and I don't know which to
> use.
> I don't need a solution in C++, just a hint which algorythm to use to
> implement this concept.

There is no general answer to your question. It depends on what you
want to do with the result. There must be some requirements or at
least more information about the nature of your problem. There is no
general definition of the term "similarity" for text strings - it
really depends on the application case.

Kind regards

robert

Dan Zwell

8/9/2007 3:40:00 PM

Tomislav Kralj wrote:
> Hello,
>
> I would be so gratefull to ANYONE who can help me!
>
> I'm writting a program in C++.
> A part of my program is to compare two strings and as a result I have to
> get a number (range:0-1) which represents a similarity beetwen those two
> strings.
>
> Which algorythm to use?
> I searched Web, but there's a million of them, and I don't know which to
> use.
> I don't need a solution in C++, just a hint which algorythm to use to
> implement this concept.
>
> Thanx!

That sounds like a fun problem. As someone already said in a reply, the
algorithm depends on your requirements, and what type of similarity you
want. For example, if you wanted to use this information to attempt a
new type of sorting, the algorithm I'm about to suggest would be useless.

That being said, here's what I would do--it's conceptually very simple:
1. Find the "longest common subsequence". What distinguishes this from
"longest common substring" (and makes it harder) is that the matching
letters don't need to be adjacent. For example, the longest common
subsequence of "aaaacccceeee" and "aaaabbbbccc" is "aaaacccc". This is
best calculated with dynamic programming, but you can probably find
guidance on that on the internet.

2. Compare the length of this substring with the lengths of the two
original strings. Perhaps something simple like "Percent similarity =
(length of common subsequence) / (average length of two original strings)".

Hope this helps,
Dan

Phrogz

8/9/2007 3:54:00 PM

On Aug 9, 6:28 am, Tomislav Kralj <rex14012...@yahoo.com> wrote:
> A part of my program is to compare two strings and as a result I have to
> get a number (range:0-1) which represents a similarity beetwen those two
> strings.

You might want to google for string "correlation" algorithms.
(Depending on what sort of similarity you want.)

Marcin Mielzynski

8/9/2007 4:12:00 PM

Phrogz pisze:
> On Aug 9, 6:28 am, Tomislav Kralj <rex14012...@yahoo.com> wrote:
>> A part of my program is to compare two strings and as a result I have to
>> get a number (range:0-1) which represents a similarity beetwen those two
>> strings.
>
> You might want to google for string "correlation" algorithms.
> (Depending on what sort of similarity you want.)
>

You might try: http://amatch.ruby...

lopex

Adam Shelly

8/9/2007 5:46:00 PM

n 8/9/07, Robert Klemme <shortcutter@googlemail.com> wrote:
> 2007/8/9, Tomislav Kralj <rex14012001@yahoo.com>:
> > A part of my program is to compare two strings and as a result I have to
> > get a number (range:0-1) which represents a similarity beetwen those two
> > strings.
>
> There is no general answer to your question. It depends on what you
> want to do with the result. There must be some requirements or at
> least more information about the nature of your problem. There is no
> general definition of the term "similarity" for text strings - it
> really depends on the application case.
>
The problem description made me think of bioinformatics - especially
comparing genetic distances. You can measure similarity as the number
of changes needed to transform one string into another. If that
sounds like the type of similarity you need, look up Levenshtein
Distance: http://en.wikipedia.org/wiki/Levenshtei...

In fact, there was a Ruby Quiz dealing with a similar problem: Word
Chains - http://www.rubyquiz.com/q.... The difference was that
the quiz only allowed changes that resulted in valid dictionary words.

-Adam

Olivier

8/14/2007 9:09:00 AM

Tomislav Kralj a Ã©crit :
> Hello,
>
> I would be so gratefull to ANYONE who can help me!
>
> I'm writting a program in C++.
> A part of my program is to compare two strings and as a result I have to
> get a number (range:0-1) which represents a similarity beetwen those two
> strings.
>
> Which algorythm to use?
> I searched Web, but there's a million of them, and I don't know which to
> use.
> I don't need a solution in C++, just a hint which algorythm to use to
> implement this concept.
>
> Thanx!
>
You problem is similar to finding the edit distance between two strings.
Have a look at http://en.wikipedia.org/wiki/Edi... and
http://en.wikipedia.org/wiki/Stri....
I don't know which one could give a result in the range [0,1], however.

--
Olivier Renaud

Olivier

8/14/2007 9:20:00 AM

Olivier Renaud a Ã©crit :
> Tomislav Kralj a Ã©crit :
>> Hello,
>>
>> I would be so gratefull to ANYONE who can help me!
>>
>> I'm writting a program in C++.
>> A part of my program is to compare two strings and as a result I have to
>> get a number (range:0-1) which represents a similarity beetwen those two
>> strings.
>>
>> Which algorythm to use?
>> I searched Web, but there's a million of them, and I don't know which to
>> use.
>> I don't need a solution in C++, just a hint which algorythm to use to
>> implement this concept.
>>
>> Thanx!
>>
> You problem is similar to finding the edit distance between two
> strings. Have a look at http://en.wikipedia.org/wiki/Edi... and
> http://en.wikipedia.org/wiki/Stri....
> I don't know which one could give a result in the range [0,1], however.
>
I think using the levenshtein distance this way should do the trick :
levenshtein_distance(a, b) / max(a.size, b.size)

since the result of the levenshtein distance is at most the length of
the longer string.

--
Olivier Renaud

Han Holl

8/14/2007 10:57:00 AM

As others already said, it depends on your needs.
But nobody mentioned soundex yet:
http://raa.ruby-lang.org/projec...
which may or may not, be what you want.

Han Holl

comp.lang.ruby

Help with string matching algorythm

Tomislav Kralj

Todd Burch

Robert Klemme

Dan Zwell

Phrogz

Marcin Mielzynski

Adam Shelly

Olivier

Olivier

Han Holl

x Login to ForumsZone