Dan Zwell
8/9/2007 3:40:00 PM
Tomislav Kralj wrote:
> Hello,
>
> I would be so gratefull to ANYONE who can help me!
>
> I'm writting a program in C++.
> A part of my program is to compare two strings and as a result I have to
> get a number (range:0-1) which represents a similarity beetwen those two
> strings.
>
> Which algorythm to use?
> I searched Web, but there's a million of them, and I don't know which to
> use.
> I don't need a solution in C++, just a hint which algorythm to use to
> implement this concept.
>
> Thanx!
That sounds like a fun problem. As someone already said in a reply, the
algorithm depends on your requirements, and what type of similarity you
want. For example, if you wanted to use this information to attempt a
new type of sorting, the algorithm I'm about to suggest would be useless.
That being said, here's what I would do--it's conceptually very simple:
1. Find the "longest common subsequence". What distinguishes this from
"longest common substring" (and makes it harder) is that the matching
letters don't need to be adjacent. For example, the longest common
subsequence of "aaaacccceeee" and "aaaabbbbccc" is "aaaacccc". This is
best calculated with dynamic programming, but you can probably find
guidance on that on the internet.
2. Compare the length of this substring with the lengths of the two
original strings. Perhaps something simple like "Percent similarity =
(length of common subsequence) / (average length of two original strings)".
Hope this helps,
Dan