Michael Fellinger
5/15/2008 2:40:00 AM
On Thu, May 15, 2008 at 10:25 AM, John <john.d.perkins@gmail.com> wrote:
> I am trying to discover similar files to reduce redundancy on a large
> project. The 'Text' gem works well for this, but even short strings
> take a long time. Large strings - like 20k HTML files - take an
> amazing amount of time. My script looks like this:
>
> require 'rubygems'
> require 'text'
>
> a = file_one
> b = file_two
>
> puts Text::Levenshtein.distance(a, b)
>
>
> It would be nice to be able to short-circuit the comparison when the
> distance crossed a max value, but that isn't possible. It would be
> even BETTER to be able to compare long strings like with PHP's
> similar_text, which has nice percentage output. I have to do a lot of
> comparisons, about 40 million. Is there something already written?
Take a look at the source of the Text gem. The algorithm behind
Text::Levenshtein::distance is neither long nor hard to read, so maybe
you can just modify it to suit your needs?
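
As a rough idea of what such a modification could look like, here is a
minimal sketch of a row-by-row Levenshtein that gives up once the distance
is certain to exceed a limit. It is not the Text gem's code: the names
bounded_levenshtein and similarity_percent are made up for illustration,
and the percentage is a simple normalisation by the longer string's length,
which is not the algorithm PHP's similar_text uses.

```ruby
# Hypothetical helper (not part of the Text gem): returns the Levenshtein
# distance between a and b, or nil as soon as it must exceed max.
def bounded_levenshtein(a, b, max)
  s = a.codepoints.to_a
  t = b.codepoints.to_a

  # The difference in length is a lower bound on the distance.
  return nil if (s.length - t.length).abs > max

  prev = (0..t.length).to_a            # distances from "" to each prefix of t
  s.each_with_index do |sc, i|
    curr = [i + 1]                     # distance from s[0..i] to ""
    t.each_with_index do |tc, j|
      cost = sc == tc ? 0 : 1
      curr << [prev[j + 1] + 1,        # deletion
               curr[j] + 1,            # insertion
               prev[j] + cost].min     # substitution or match
    end
    # The row minimum never decreases from one row to the next, so once
    # every cell is above max the final distance must be above max too.
    return nil if curr.min > max
    prev = curr
  end

  prev.last > max ? nil : prev.last
end

# Hypothetical helper: similarity as a percentage of the longer string's
# length. Returns nil when the comparison was cut off by max.
def similarity_percent(a, b, max = nil)
  longest = [a.length, b.length].max
  return 100.0 if longest.zero?
  d = bounded_levenshtein(a, b, max || longest)
  d && 100.0 * (longest - d) / longest
end

puts bounded_levenshtein("kitten", "sitting", 5).inspect  # => 3
puts bounded_levenshtein("kitten", "sitting", 2).inspect  # => nil
```

The early exit only saves time on pairs that really are far apart. For 40
million comparisons, the length check at the top is probably the cheapest
filter, since it rejects many pairs without touching the matrix at all.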
^ manveru