[lnkForumImage]
TotalShareware - Download Free Software

Confronta i prezzi di migliaia di prodotti.
Asp Forum
 Home | Login | Register | Search 


 

Forums >

comp.lang.ruby

How to Write a Spelling Corrector

Brian Adkins

4/10/2007 1:05:00 AM

Peter Norvig wrote a simple spelling corrector in 20 lines of Python 2.5,
so I thought I'd see what it looks like in Ruby. I'm not too pleased with
my version, if anyone can make it more elegant, that would be great. Some
of the sticking points were:

1) List comprehensions in Python made the edits1 function more elegant
IMO. Hopefully someone can improve that function.

2) The boolean expression in the correct function evaluates empty
sets/arrays as false in Python but not in Ruby, so I had to add the
"result.empty? ? nil : result" expression to several functions. I expect
there's a better way to handle this also.

3) Minor point, but apparently Python has a built in constant for the set
of lower case characters "string.lowercase", so I just defined a constant.

Otherwise, the translation was pretty straightforward.

Here's a link to Norvig's page: http://www.norvig.com/spell-co...

That page includes a link to a text file that I saved locally as
holmes.txt: http://www.norvig.com/...

Note: I wrapped a few of the longer lines for posting.

def words text
text.downcase.scan(/[a-z]+/)
end

def train features
model = Hash.new(1)
features.each {|f| model[f] += 1 }
return model
end

NWORDS = train(words(File.new('holmes.txt').read))
LETTERS = 'abcdefghijklmnopqrstuvwxyz'

def edits1 word
n = word.length
deletion = (0...n).collect {|i| word[0...i]+word[i+1..-1] }
transposition = (0...n-1).collect {
|i| word[0...i]+word[i+1,1]+word[i,1]+word[i+2..-1] }
alteration = []
n.times {|i| LETTERS.each_byte {
|l| alteration << word[0...i]+l.chr+word[i+1..-1] } }
insertion = []
(n+1).times {|i| LETTERS.each_byte {
|l| insertion << word[0...i]+l.chr+word[i..-1] } }
result = deletion + transposition + alteration + insertion
result.empty? ? nil : result
end

def known_edits2 word
result = []
edits1(word).each {|e1| edits1(e1).each {
|e2| result << e2 if NWORDS.has_key?(e2) }}
result.empty? ? nil : result
end

def known words
result = words.find_all {|w| NWORDS.has_key?(w) }
result.empty? ? nil : result
end

def correct word
(known([word]) or known(edits1(word)) or known_edits2(word) or
[word]).max {|a,b| NWORDS[a] <=> NWORDS[b] }
end


Brian Adkins
http://lojic...

12 Answers

Joel VanderWerf

4/10/2007 3:38:00 AM

0

Brian Adkins wrote:
> 3) Minor point, but apparently Python has a built in constant for the set
> of lower case characters "string.lowercase", so I just defined a constant.
...
> LETTERS = 'abcdefghijklmnopqrstuvwxyz'

A minor comment: you can construct this string with

("a".."z").to_a.join
=> "abcdefghijklmnopqrstuvwxyz"

--
vjoel : Joel VanderWerf : path berkeley edu : 510 665 3407

Robert Dober

4/10/2007 7:52:00 AM

0

On 4/10/07, Joel VanderWerf <vjoel@path.berkeley.edu> wrote:
> Brian Adkins wrote:
> > 3) Minor point, but apparently Python has a built in constant for the set
> > of lower case characters "string.lowercase", so I just defined a constant.
> ...
> > LETTERS = 'abcdefghijklmnopqrstuvwxyz'
>
> A minor comment: you can construct this string with
>
> ("a".."z").to_a.join
> => "abcdefghijklmnopqrstuvwxyz"

Almost nobody, not even the gurus ever use [*a..b] instead of (a..b).to_a.
The performance is about the same, the former seems faster for short
arrays/ranges ~ 500
and from about 1k elements the later gets slightly faster.
I feel that [*a..b] is the right/better/more logical construct to use
if one wants an array, or do I miss something?

Cheers
Robert
>
> --
> vjoel : Joel VanderWerf : path berkeley edu : 510 665 3407
>
>


--
You see things; and you say Why?
But I dream things that never were; and I say Why not?
-- George Bernard Shaw

Brian Adkins

4/10/2007 3:03:00 PM

0

On Tue, 10 Apr 2007 16:52:03 +0900, Robert Dober wrote:

> On 4/10/07, Joel VanderWerf <vjoel@path.berkeley.edu> wrote:
>> Brian Adkins wrote:
>> > 3) Minor point, but apparently Python has a built in constant for the set
>> > of lower case characters "string.lowercase", so I just defined a constant.
>> ...
>> > LETTERS = 'abcdefghijklmnopqrstuvwxyz'
>>
>> A minor comment: you can construct this string with
>>
>> ("a".."z").to_a.join
>> => "abcdefghijklmnopqrstuvwxyz"
>
> Almost nobody, not even the gurus ever use [*a..b] instead of (a..b).to_a.

Interesting. I didn't realize that would work with ranges. Here's what I
read in "Programming Ruby" p. 348:

"Following these parameters may be a single parameter prefixed with an
asterisk. If this parameter is an array, Ruby replaces it with zero or
more parameters corresponding to the elements of the array."

I guess to_a is called on the range implicitly prior to evaluating *.


> The performance is about the same, the former seems faster for short
> arrays/ranges ~ 500
> and from about 1k elements the later gets slightly faster.
> I feel that [*a..b] is the right/better/more logical construct to use
> if one wants an array, or do I miss something?
>
> Cheers
> Robert
>>
>> --
>> vjoel : Joel VanderWerf : path berkeley edu : 510 665 3407
>>
>>

Robert Dober

4/10/2007 4:11:00 PM

0

On 4/10/07, Brian Adkins <lojicdotcomNOSPAM@gmail.com> wrote:
> On Tue, 10 Apr 2007 16:52:03 +0900, Robert Dober wrote:
>
> > On 4/10/07, Joel VanderWerf <vjoel@path.berkeley.edu> wrote:
> >> Brian Adkins wrote:
> >> > 3) Minor point, but apparently Python has a built in constant for the set
> >> > of lower case characters "string.lowercase", so I just defined a constant.
> >> ...
> >> > LETTERS = 'abcdefghijklmnopqrstuvwxyz'
> >>
> >> A minor comment: you can construct this string with
> >>
> >> ("a".."z").to_a.join
> >> => "abcdefghijklmnopqrstuvwxyz"
> >
> > Almost nobody, not even the gurus ever use [*a..b] instead of (a..b).to_a.
>
> Interesting. I didn't realize that would work with ranges. Here's what I
> read in "Programming Ruby" p. 348:
>
> "Following these parameters may be a single parameter prefixed with an
> asterisk. If this parameter is an array, Ruby replaces it with zero or
> more parameters corresponding to the elements of the array."
>
> I guess to_a is called on the range implicitly prior to evaluating *.
Hmm I guessed rather not, but I am wrong:

429/5 > cat ast.rb
# vim: sts=2 sw=2 tw=0 expandtab nu:

a = [*1..2]
robert@PC:~/log/ruby/tests/theory 18:09:04
430/6 > ruby -rprofile ast.rb
% cumulative self self total
time seconds seconds calls ms/call ms/call name
0.00 0.00 0.00 1 0.00 0.00 Enumerable.to_a
0.00 0.00 0.00 1 0.00 0.00 Range#each
0.00 0.01 0.00 1 0.00 10.00 #toplevel

Good thinking there Brian !!
Cheers
Robert
>
>
> > The performance is about the same, the former seems faster for short
> > arrays/ranges ~ 500
> > and from about 1k elements the later gets slightly faster.
> > I feel that [*a..b] is the right/better/more logical construct to use
> > if one wants an array, or do I miss something?
Well I did as shown above, but I still rather go for the shorter ;)
> >
> > Cheers
> > Robert
> >>
> >> --
> >> vjoel : Joel VanderWerf : path berkeley edu : 510 665 3407
> >>
> >>
>
>
>


--
You see things; and you say Why?
But I dream things that never were; and I say Why not?
-- George Bernard Shaw

William James

4/11/2007 8:55:00 AM

0


Brian Adkins wrote:
> Peter Norvig wrote a simple spelling corrector in 20 lines of Python 2.5,
> so I thought I'd see what it looks like in Ruby. I'm not too pleased with
> my version, if anyone can make it more elegant, that would be great. Some
> of the sticking points were:
>
> 1) List comprehensions in Python made the edits1 function more elegant
> IMO. Hopefully someone can improve that function.
>
> 2) The boolean expression in the correct function evaluates empty
> sets/arrays as false in Python but not in Ruby, so I had to add the
> "result.empty? ? nil : result" expression to several functions. I expect
> there's a better way to handle this also.
>
> 3) Minor point, but apparently Python has a built in constant for the set
> of lower case characters "string.lowercase", so I just defined a constant.
>
> Otherwise, the translation was pretty straightforward.
>
> Here's a link to Norvig's page: http://www.norvig.com/spell-co...
>
> That page includes a link to a text file that I saved locally as
> holmes.txt: http://www.norvig.com/...
>
> Note: I wrapped a few of the longer lines for posting.
>
> def words text
> text.downcase.scan(/[a-z]+/)
> end
>
> def train features
> model = Hash.new(1)
> features.each {|f| model[f] += 1 }
> return model
> end
>
> NWORDS = train(words(File.new('holmes.txt').read))
> LETTERS = 'abcdefghijklmnopqrstuvwxyz'
>
> def edits1 word
> n = word.length
> deletion = (0...n).collect {|i| word[0...i]+word[i+1..-1] }
> transposition = (0...n-1).collect {
> |i| word[0...i]+word[i+1,1]+word[i,1]+word[i+2..-1] }
> alteration = []
> n.times {|i| LETTERS.each_byte {
> |l| alteration << word[0...i]+l.chr+word[i+1..-1] } }
> insertion = []
> (n+1).times {|i| LETTERS.each_byte {
> |l| insertion << word[0...i]+l.chr+word[i..-1] } }
> result = deletion + transposition + alteration + insertion
> result.empty? ? nil : result
> end

Letters = ('a'..'z').to_a

class String
def replace( which, what )
self[0...which] + what + self[which+1 .. -1]
end
end

def edits1 word
n = word.size
return nil if n.zero?
deletion = (0...n).map{|pos| word.replace(pos, "") }
transposition = (0...n-1).map{|i|
word[0...i]+word[i+1,1]+word[i,1]+word[i+2..-1] }
alteration = (0 ... n).to_a.map{|pos|
Letters.map{|ch| word.replace(pos, ch) }}.flatten
insertion = (0 .. n).to_a.map{|pos|
Letters.map{|ch| word[0...pos] + ch + word[pos..-1] }}.flatten
deletion + transposition + alteration + insertion
end

Gavin Kistner

4/11/2007 11:10:00 PM

0

On Apr 11, 2:54 am, "William James" <w_a_x_...@yahoo.com> wrote:

> class String
> def replace( which, what )
> self[0...which] + what + self[which+1 .. -1]
> end
> end

Cleaner, IMHO:

class String
def replace( where, what )
s2 = dup
s2[ where ] = what
s2
end
end

Drew Olson

4/11/2007 11:51:00 PM

0

This may be a bit OT, but I am really struck as to how elegantly both of
these languages (and the programmers as well) handle this seemingly
difficult problem. I've been working with ruby mainly for fun over the
past 8-9 months and seeing this solution (in Python and ruby) really
heightens my respect for the elegance and expressiveness of both
languages.

</rave>


--
Posted via http://www.ruby-....

Nicolas Buet

4/12/2007 4:05:00 PM

0

I am so glad I find this topic here! I was on the plane yesterday, and
just like Peter Norwig I managed to write a spelling corrector. An
obvious difference is that he started from nothing, while I started
from his code... Anyway, I did it in Ruby, it works, but... it's dead
slow. If I need to compute the second array (the one with 2 mistakes),
it takes about 1 second per suggestion.

My edit1 function is like this:
----------------------------
def edits1(word)
n = word.length

return ( (0..n-1).collect {|i| word[0,i] + word[i+1, n-1]} + #deletion
(0..n-2).collect {|i| word[0,i] + word[i+1,1] +word[i,1] +
word[i+2, n-1]} + #transposition
('a'..'z').collect {|c| (0..n-1).collect { |i| word[0,i] + c +
word[i+1, n-1]} } + #alteration
('a'..'z').collect {|c| (0..n).collect { |i| word[0,i] + c +
word[i, n]} } ).flatten #insertion

end
---------------------------------------------

I believe this is where I lose most of the time? when I call it, it
looks like that:

--------------------------------------------
edits2 = (edits1(word).collect {|i| edits1(i) & NWORDS }).flatten
--------------------------------------------

Can anyone see here a flaw in logic, or some function that should be
avoided because it's known as slow?

On 4/12/07, Drew Olson <olsonas@gmail.com> wrote:
> This may be a bit OT, but I am really struck as to how elegantly both of
> these languages (and the programmers as well) handle this seemingly
> difficult problem. I've been working with ruby mainly for fun over the
> past 8-9 months and seeing this solution (in Python and ruby) really
> heightens my respect for the elegance and expressiveness of both
> languages.
>
> </rave>
>
>
> --
> Posted via http://www.ruby-....
>
>

Pedro Fortuny Ayuso

4/12/2007 6:37:00 PM

0

At the risk of being off-topic or uninformed, I guess this

> --------------------------------------------
> edits2 = (edits1(word).collect {|i| edits1(i) & NWORDS }).flatten
> --------------------------------------------

is overkill. The thing is that there are WAY too many off-by-two
mistakes but too few words to be worth that computation.
Moreover, most of the "corrections" are nonsense: "xx" "zx"
"hh" "jk" "bb" almost surely. This is where dictionary metrics
must have come into being, I think. (I have not the original post,
sorry, so this may be out of place).

Take into account that for a 4-letter word like "riot" you
are letting the user have up to 2 mistakes, that means from:

"at" to (at least) "tricot"

which ... seems too loose to me. I am utterly unaware of the way
dictionaries work, but I bet they use hashes (in the sense of
"a (key, value) list" and in the proper classical sense of
"a good function for distinguishing and identifying items"),
as well as ad-hoc metrics A LOT. I'd rather stick with words of
the same length and at most off-by-one mistakes (or in obvious
cases, for longer words, other possibilities). There must be
a kind of metric-hash pair to play with the duality: number of
mistakes "allowed" / number of "similar" real words / length
of the word. Obviously,

"termodynamics"

gives rise to few doubts, but what about

"lykes"

"bykes" "Mykes" "Bites" "Bytes" "Likes" "Like" "Myke" "Mike"
"Byke"... and this without thinking.

I guess the hash/metric problem is the real game in Dictionary
software. I'd be glad to help if I find the time, though. But
it seems a trodden path to me (apart from the doing it for the joy
of it, of course).


2c

Pedro

El 12/04/2007, a las 18:04, Nicolas Buet escribió:

> I am so glad I find this topic here! I was on the plane yesterday, and
> just like Peter Norwig I managed to write a spelling corrector. An
> obvious difference is that he started from nothing, while I started
> from his code... Anyway, I did it in Ruby, it works, but... it's dead
> slow. If I need to compute the second array (the one with 2 mistakes),
> it takes about 1 second per suggestion.
>
> My edit1 function is like this:
> ----------------------------
> def edits1(word)
> n = word.length
>
> return ( (0..n-1).collect {|i| word[0,i] + word[i+1, n-1]} +
> #deletion
> (0..n-2).collect {|i| word[0,i] + word[i+1,1] +word[i,1] +
> word[i+2, n-1]} + #transposition
> ('a'..'z').collect {|c| (0..n-1).collect { |i| word[0,i] + c +
> word[i+1, n-1]} } + #alteration
> ('a'..'z').collect {|c| (0..n).collect { |i| word[0,i] + c +
> word[i, n]} } ).flatten #insertion
>
> end
> ---------------------------------------------
>
> I believe this is where I lose most of the time? when I call it, it
> looks like that:
>
> --------------------------------------------
> edits2 = (edits1(word).collect {|i| edits1(i) & NWORDS }).flatten
> --------------------------------------------
>
> Can anyone see here a flaw in logic, or some function that should be
> avoided because it's known as slow?
>
> On 4/12/07, Drew Olson <olsonas@gmail.com> wrote:
>> This may be a bit OT, but I am really struck as to how elegantly
>> both of
>> these languages (and the programmers as well) handle this seemingly
>> difficult problem. I've been working with ruby mainly for fun over
>> the
>> past 8-9 months and seeing this solution (in Python and ruby) really
>> heightens my respect for the elegance and expressiveness of both
>> languages.
>>
>> </rave>
>>
>>
>> --
>> Posted via http://www.ruby-....
>>
>>
>

Pedro Fortuny Ayuso
pfortuny@gmail.com http://pfortuny....
C/Capuchinos 14, 1-S. 47006 Valladolid. SPAIN



Florian Groß

4/12/2007 7:36:00 PM

0

> I am so glad I find this topic here! I was on the plane yesterday, and
> just like Peter Norwig I managed to write a spelling corrector. An
> obvious difference is that he started from nothing, while I started
> from his code... Anyway, I did it in Ruby, it works, but... it's dead
> slow. If I need to compute the second array (the one with 2 mistakes),
> it takes about 1 second per suggestion.

I avoided doing two-character edits. Instead I use Text::Metaphone and
mutate that string and Text::Levenshtein for filtering by distance. It
seems to work fairly well. (You will need the text gem to try this
out.)

The code (together with the holmes.txt as a sample for training it) is
available at http://flgr.0x42.net/code/spel...