Asp Forum - Re: Edit Distance at Wikipedia

Mauricio Fernández

10/16/2006 10:25:00 AM

On Mon, Oct 16, 2006 at 07:15:52PM +0900, Minkoo wrote:
> Hi list.
>
> I read an article posted at Wikipedia about Levenshtein distance (aka edit
> distance).
> The location of the document is
> http://en.wikipedia.org/wiki/Levenshtei...
>
> In the document, a sample Ruby code goes like the following:
>
> class String
> def levenshtein(comparator)
> a, b = self.unpack('U*'), comparator.unpack('U*')
> n, m = a.length, b.length
> a, b, n, m = b, a, m, n if n > m
> current = [*0..n]
> 1.upto(m) do |i|
> previous, current = current, [i]+[0]*n
> 1.upto(n) do |j|
> add, delete = previous[j]+1, current[j-1]+1
> change = previous[j-1]
> change += 1 if a[j-1] != b[i-1]
> current[j] = [add, delete, change].min
> end
> end
> current[n]
> end
> end
>
> In the code, there's a parameter called comparator which seems to be used to
> decode given parameter. But, I can't understand what exactly the comparator
> is doing.
>
> Does anybody know the detail?

It's simply the string you're comparing against; unpack('U*') just turns the
UTF-8 characters into unsigned integers:

class String
def levenshtein(comparator)
a, b = self.unpack('U*'), comparator.unpack('U*')
b # => [102, 111, 111, 98, 97, 114]
n, m = a.length, b.length
a, b, n, m = b, a, m, n if n > m
current = [*0..n]
1.upto(m) do |i|
previous, current = current, [i]+[0]*n
1.upto(n) do |j|
add, delete = previous[j]+1, current[j-1]+1
change = previous[j-1]
change += 1 if a[j-1] != b[i-1]
current[j] = [add, delete, change].min
end
end
current[n]
end
end

"foo".levenshtein("foobar") # => 3

--
Mauricio Fernandez - http://eige... - singular Ruby

3 Answers

Mauricio Fernández

10/16/2006 9:04:00 PM

On Mon, Oct 16, 2006 at 10:11:51PM +0900, Minkoo wrote:
> On 10/16/06, Mauricio Fernandez <mfp@acm.org> wrote:
> >It's simply the string you're comparing against; unpack('U*') just turns
> >the UTF-8 characters into unsigned integers:
> >
> > class String
> > def levenshtein(comparator)
> > a, b = self.unpack('U*'), comparator.unpack('U*')
> > b # => [102, 111,
> >111, 98, 97, 114]
[...]
> >
> > "foo".levenshtein("foobar") # => 3
> >
> I'm afraid that I'm not used to character encodings. Does Ruby use UTF-8 by
> default?

Ruby's Strings as of 1.8 are just a sequence of bytes. If a String happens to
hold a UTF-8 sequence, some operations will work properly when you set
$KCODE="u", e.g.

s = "\303\241" # "á" == acute
s.scan(/\w/) # => []
$KCODE="u"
s.scan(/\w/) # => ["á"]

Some other methods are covered by jcode.rb.

> In other words, suppose that I've launched irb and fired
> "foo".levenshtein("foobar").
> In that case, is the string "foo" encoded as utf-8?

This depends on your locale; Ruby's Strings do not know about encodings in 1.8
(they will very soon in 1.9 according to matz' latest messages in ruby-core;
search the ruby-talk archives for extensive discussions of the upcoming M17N).

> Do I always have to unpack the string like the code shown above?

In this particular case, it depends on two things:
* the encoding you're using
* whether the Levenshtein should be relative to characters or bytes

In general, it's up to you to keep track of the encodings and handle Strings
appropriately.

--
Mauricio Fernandez - http://eige... - singular Ruby

Dido Sevilla

10/17/2006 6:45:00 AM

On 10/16/06, Minkoo <minkoo.seo@gmail.com> wrote:
> I'm afraid that I'm not used to character encodings. Does Ruby use UTF-8 by
> default?
>

As of Ruby 1.8, it doesn't, but see the other responses.

> In other words, suppose that I've launched irb and fired
> "foo".levenshtein("foobar").
> In that case, is the string "foo" encoded as utf-8?

The code makes that assumption, correct or not. If you're one of those
ignorant yokels who believes that characters are bytes, it's also a
safe assumption. ;) Now, if you're one of those people who does i18n,
l10n, and m17n before breakfast, you'll have alarm bells going off
inside your head immediately, as such assumptions are in general very
dangerous to make. A cardinal rule in this kind of programming is that
strings are meaningless without an attached encoding.

> Do I always have to
> unpack the string like
> the code shown above?

It would be better, at least that keeps your options open when you
attempt to internationalize your program. You wind up supporting
Unicode at the very least, and that makes the transition to
internationalization a bit easier, as Unicode is a reasonable choice
as a character set if one wishes to do internationalization.

Paul Battley

10/17/2006 8:59:00 AM

On 17/10/06, Dido Sevilla <dido.sevilla@gmail.com> wrote:
> The code makes that assumption, correct or not. If you're one of those
> ignorant yokels who believes that characters are bytes, it's also a
> safe assumption. ;) Now, if you're one of those people who does i18n,
> l10n, and m17n before breakfast, you'll have alarm bells going off
> inside your head immediately, as such assumptions are in general very
> dangerous to make. A cardinal rule in this kind of programming is that
> strings are meaningless without an attached encoding.

<blatant plug>

The version of Levenshtein in the Text project
(http://rubyforge.org/pro...) will work on either UTF-8
codepoints or bytes depending on the value of $KCODE. It's the best we
can do for now; in the future, of course, we will be able to decide
based on the strings' encodings themselves.

</blatant plug>

Paul.

comp.lang.ruby

Re: Edit Distance at Wikipedia

Mauricio Fernández

Mauricio Fernández

Dido Sevilla

Paul Battley

x Login to ForumsZone