Asp Forum - 40 million levenshtein distances for two long strings

John

5/15/2008 1:20:00 AM

I am trying to discover similar files to reduce redundancy on a large
project. The 'Text' gem works well for this, but even short strings
take a long time. Large strings - like 20k HTML files - take an
amazing amount of time. My script looks like this:

require 'rubygems'
require 'text'

a = file_one
b = file_two

puts Text::Levenshtein.distance(a, b)

It would be nice to be able to short-circuit the comparison when the
distance crossed a max value, but that isn't possible. It would be
even BETTER to be able to compare long stings like with PHPs
similar_text, which has nice percentage output. I have to do a lot of
comparisons, about 40 million. Is there something already written?

11 Answers

Michael Fellinger

5/15/2008 2:40:00 AM

On Thu, May 15, 2008 at 10:25 AM, John <john.d.perkins@gmail.com> wrote:
> I am trying to discover similar files to reduce redundancy on a large
> project. The 'Text' gem works well for this, but even short strings
> take a long time. Large strings - like 20k HTML files - take an
> amazing amount of time. My script looks like this:
>
> require 'rubygems'
> require 'text'
>
> a = file_one
> b = file_two
>
> puts Text::Levenshtein.distance(a, b)
>
>
> It would be nice to be able to short-circuit the comparison when the
> distance crossed a max value, but that isn't possible. It would be
> even BETTER to be able to compare long stings like with PHPs
> similar_text, which has nice percentage output. I have to do a lot of
> comparisons, about 40 million. Is there something already written?

Take a look at the source of the Text gem, the algorithm for
Text::Levenshtein::distance is not too hard to read or long, maybe you
can just modify that to suit your needs?

^ manveru

ara.t.howard

5/15/2008 6:12:00 AM

On May 14, 2008, at 7:25 PM, John wrote:

> It would be nice to be able to short-circuit the comparison when the
> distance crossed a max value

diff file|head -$max

??

a @ http://codeforp...
--
we can deny everything, except that we have the possibility of being
better. simply reflect on that.
h.h. the 14th dalai lama

Erik Veenstra

5/15/2008 11:25:00 AM

> I have to do a lot of comparisons, about 40 million.

40 million or 400 million?... :}

I've once written a Levenshtein implementation in C, because of
this speed issue. It compares 2 20 KB files in 3 seconds. And
it does come with a threshold.

Interested?

gegroet,
Erik V. - http://www.erikve...

----------------------------------------------------------------

$ ll lev[12]
-rw-r--r-- 1 erik erik 19275 May 15 12:10 lev1
-rw-r--r-- 1 erik erik 19275 May 15 12:12 lev2

$ cat lev.rb
require "ev/levenshtein"

s1 = File.read("lev1")
s2 = File.read("lev2")

p EV.levenshtein(s1, s2)

$ time ruby lev.rb
12

real 0m3.105s
user 0m2.843s
sys 0m0.281s

----------------------------------------------------------------

Axel Etzold

5/15/2008 12:36:00 PM

-------- Original-Nachricht --------
> Datum: Thu, 15 May 2008 20:24:42 +0900
> Von: Erik Veenstra <erikveen@gmail.com>
> An: ruby-talk@ruby-lang.org
> Betreff: Re: 40 million levenshtein distances for two long strings

> > I have to do a lot of comparisons, about 40 million.
>
> 40 million or 400 million?... :}
>
> I've once written a Levenshtein implementation in C, because of
> this speed issue. It compares 2 20 KB files in 3 seconds. And
> it does come with a threshold.
>
> Interested?

Hi Erik,

Could you post your implementation here ?

Thank you,

Axel

>
> gegroet,
> Erik V. - http://www.erikve...
>
> ----------------------------------------------------------------
>
> $ ll lev[12]
> -rw-r--r-- 1 erik erik 19275 May 15 12:10 lev1
> -rw-r--r-- 1 erik erik 19275 May 15 12:12 lev2
>
> $ cat lev.rb
> require "ev/levenshtein"
>
> s1 = File.read("lev1")
> s2 = File.read("lev2")
>
> p EV.levenshtein(s1, s2)
>
> $ time ruby lev.rb
> 12
>
> real 0m3.105s
> user 0m2.843s
> sys 0m0.281s
>
> ----------------------------------------------------------------

--
249 Spiele für nur 1 Preis. Die GMX Spieleflatrate schon ab 9,90 Euro.
Neu: Asterix bei den Olympischen Spielen: http://flat.ga...

John

5/15/2008 6:30:00 PM

....48,270,225 to be exact, which (without multithreading) would take a
bit over 300 days. I'm with Axel, I think more than just a couple
people would be interested in that slice of C.

Vidar Hokstad

5/16/2008 9:58:00 AM

On May 15, 7:29 pm, John <john.d.perk...@gmail.com> wrote:
> ...48,270,225 to be exact, which (without multithreading) would take a
> bit over 300 days. I'm with Axel, I think more than just a couple
> people would be interested in that slice of C.

The wikipedia article on Levenshtein distance has a reasonable C
implementation - using Ruby inline on it is fairly trivial.

I used the Wikipedia version from Ruby for some sanity checks on
results for my MSc thesis, but unfortunately I can't find my cleaned
up code right now (at work - I can look for it later today if you
like, and unless someone else has posted a full version by then)

Vidar

Erik Veenstra

5/16/2008 2:39:00 PM

I've requested a new RubyForge project. Until it's available,
I'll paste the code on PasteBin.

Comments and suggestions are welcome. To be honest, its' my
first and only C extensions 'till now...

I've never released a C extension as a GEM. Anybody willing to
help?

gegroet,
Erik V. - http://www.erikve...

----------------------------------------------------------------

Ruby library: http://pastebin.com...
C extension: http://pastebin.com...
extconf.rb: http://pastebin.com...
Unit Tests: http://pastebin.com...

----------------------------------------------------------------

Erik Veenstra

5/16/2008 2:50:00 PM

The Ruby version of Levenshtein#distance didn't pass the
tests... :}

Here's a new version.

gegroet,
Erik V. - http://www.erikve...

----------------------------------------------------------------

Ruby Library: http://pastebin.com...

----------------------------------------------------------------

John

5/16/2008 8:39:00 PM

Let us know when your project is approved! Good work sir.

Erik Veenstra

5/24/2008 3:23:00 PM

I've released the Levenshtein module as a gem. The most
expensive part of the algorithm is written in C. But before
entering the most expensive part of the algorithm, it tries to
optimize as much as possible (in Ruby land).

There's a TAR.GZ archive and a GEM for Linux (tested), OS X
(untested) and Cygwin (tested). I wasn't able to compile a
version that worked with the One-Click Ruby Installer. I'm not
familiar with Windows, nor with its development environment.
Anyone willing to help?

gegroet,
Erik V. - http://www.erikve...

----------------------------------------------------------------

The Levenshtein distance is a metric for measuring the amount
of difference between two sequences (i.e., the so called edit
distance). The Levenshtein distance between two strings is
given by the minimum number of operations needed to transform
one string into the other, where an operation is an insertion,
deletion, or substitution of a single character.

More information about the Levenshtein distance algorithm:
http://en.wikipedia.org/wiki/Levenshtei... .

More information about the library:
http://www.erikve...levenshtein/doc/index.html .

----------------------------------------------------------------

comp.lang.ruby

40 million levenshtein distances for two long strings

John

Michael Fellinger

ara.t.howard

Erik Veenstra

Axel Etzold

John

Vidar Hokstad

Erik Veenstra

Erik Veenstra

John

Erik Veenstra

x Login to ForumsZone