Tom Reilly
4/12/2005 1:47:00 AM
I happened to notice your posting about classifier.
Here is my problem; I wonder whether your program would be useful.
I am an MD taking care of nursing home patients. I wrote a database
program to keep track of all of the phone calls we get. We have used
the program for 2 years. We have over 80,000 phone records, each
containing the problem the nursing home called about and the
recommended treatment.
It occurred to me that, given these messages, there ought to be some
way to classify them by problem type, and that the summary could be
used to determine which problems a given nursing home is not handling
very well.
Using Hash.new, I determined that there are about 22,000 distinct
words: some abbreviations, some correctly spelled, others misspelled.
There are on average 20 words per message, though many of the words
are adjectives, prepositions, and verbs, which don't help
classification.
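A minimal sketch of that kind of word count is below. The stop-word
list, the method name word_counts, and the two sample messages are all
made up for illustration; the real list of words to ignore would be
much longer, and the real tokenizing would have to cope with the
abbreviations.

```ruby
# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = %w[a an the of to in on for with and or is was has had not].freeze

# Count how often each word occurs across all messages, skipping stop words.
def word_counts(messages, stop_words = STOP_WORDS)
  counts = Hash.new(0)               # missing keys start at 0
  messages.each do |message|
    # Lowercase the message and pull out runs of letters as words.
    message.downcase.scan(/[a-z]+/) do |word|
      counts[word] += 1 unless stop_words.include?(word)
    end
  end
  counts
end

# Made-up sample messages, not real records.
messages = [
  "Pt has a fever and cough",
  "Fever of 101, pt not eating"
]

counts = word_counts(messages)
# Print the most frequent words first.
counts.sort_by { |word, n| [-n, word] }.each { |word, n| puts "#{word}: #{n}" }
```

Dropping the stop words before counting keeps the vocabulary smaller,
which should also cut down the number of distance comparisons needed
later.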
Using a Levenshtein distance algorithm on the larger words does a
pretty good job of eliminating misspellings, though it works quite
poorly on 3-, 4-, and 5-character words.
# Determine the Levenshtein distance of two strings
def Ld(s, t)
  n = s.size
  m = t.size
  # If either string is empty, the distance is the other's length
  return m if n == 0
  return n if m == 0
  # Create an (n+1) x (m+1) table; a[i][j] is the distance between
  # the first i characters of s and the first j characters of t
  a = Array.new(n + 1) { Array.new(m + 1, 0) }
  0.upto(n) { |i| a[i][0] = i }
  0.upto(m) { |j| a[0][j] = j }
  1.upto(n) do |i|
    1.upto(m) do |j|
      # Strings are 0-indexed, so cell (i, j) compares s[i - 1], t[j - 1]
      cost = s[i - 1] == t[j - 1] ? 0 : 1
      # Cheapest of deletion, insertion, and substitution
      a[i][j] = [a[i - 1][j] + 1, a[i][j - 1] + 1, a[i - 1][j - 1] + cost].min
    end
  end
  a[n][m]
end
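
A rough sketch of how a distance function like this might be used to
fold misspellings into their most common spelling follows. It is only
one possible approach: the names levenshtein, max_edits, and canonical,
the length cut-offs, and the sample counts are all assumptions made up
for illustration. The sketch carries its own two-row version of the
distance so that it runs on its own.

```ruby
# Levenshtein distance, keeping only two rows of the table at a time.
def levenshtein(s, t)
  return t.size if s.empty?
  return s.size if t.empty?
  prev = (0..t.size).to_a
  s.each_char.with_index(1) do |sc, i|
    curr = [i]
    t.each_char.with_index(1) do |tc, j|
      cost = sc == tc ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Hypothetical rule: allow more edits for longer words and none for
# very short ones, where a single edit often gives a different word.
def max_edits(word)
  case word.size
  when 0..4 then 0
  when 5..8 then 1
  else 2
  end
end

# Map a word to the most frequent known word within its edit budget.
def canonical(word, counts)
  best = counts.keys
               .select { |known| levenshtein(word, known) <= max_edits(word) }
               .max_by { |known| counts[known] }
  best || word
end

# Made-up word counts, not real data.
counts = { "pneumonia" => 120, "pnemonia" => 3, "fever" => 400, "fevr" => 2 }
puts canonical("pnemonia", counts)
puts canonical("fevr", counts)
```

Scaling the allowed distance with word length is one way around the
poor results on short words: with a budget of zero edits they are
simply left alone rather than merged with unrelated neighbours.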
I'd appreciate any comments you might have.
Thanks
Tom Reilly.