John Machin
2/7/2008 10:32:00 PM
On Feb 7, 10:37 pm, Matthew_WAR...@bnpparibas.com wrote:
> > On Wed, 06 Feb 2008 17:32:53 -0600, Robert Kern wrote:
>
> > > Jeff Schwab wrote:
> > ...
> > >> If the strings happen to be the same length, the Levenshtein distance
> > >> is equivalent to the Hamming distance.
>
> Is this really what the OP was asking for? If I understand it correctly,
> Levenshtein distance works out the number of edits required to transform
> the string to the target string. The smaller the distance, the closer the
> match, but with
> the OP's problem I would expect
>
> table1    table2
> brian     briam
>           erian
>
> I think the OP would like to guess at 'briam' rather than 'erian', but
> Levenshtein would rate them equally good guesses?
>
> I know this is pushing it more toward phonetic analysis of the words or
> something similar, and that's orders of magnitude more complex.
>
Not very. The edit distance idea can be generalised by having variable
penalties for replacement and for insertion/deletion.
E.g. n/m would get a low replacement penalty because the two letters are
phonetically very similar AND adjacent on some keyboards.
Google "zobel editex" for some ideas.
Insertion/deletion: a good tweak is to use a low (even zero) penalty
for omitting a doubled letter e.g. Matthew / Mathew.
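A rough sketch of both tweaks, in case it helps. This is not Editex and
not what febrl does; the CHEAP_SUBS table, the 0.5 penalty and the
function names are made up for illustration, and the doubled-letter rule
is the simplest one I could think of (a letter that repeats its
predecessor can be inserted or deleted at the "doubled" cost).

```python
# Hypothetical penalty table: letter pairs treated as "nearly the same".
# The pair and the 0.5 value are illustrative assumptions only.
CHEAP_SUBS = {frozenset("nm"): 0.5}


def sub_cost(a, b):
    """Replacement penalty: 0 for a match, reduced for listed pairs, else 1."""
    if a == b:
        return 0.0
    return CHEAP_SUBS.get(frozenset((a, b)), 1.0)


def weighted_distance(s, t, indel=1.0, doubled=0.0):
    """Edit distance with variable replacement and insert/delete penalties.

    A character that repeats the one before it in the same string is
    inserted or deleted at the `doubled` cost instead of `indel`.
    """
    def gap_cost(text, k):
        # Cost of inserting/deleting text[k-1] (k is a 1-based position).
        if k > 1 and text[k - 1] == text[k - 2]:
            return doubled
        return indel

    # Row 0: build each prefix of t from the empty string by insertions.
    prev = [0.0]
    for j in range(1, len(t) + 1):
        prev.append(prev[j - 1] + gap_cost(t, j))

    for i in range(1, len(s) + 1):
        cur = [prev[0] + gap_cost(s, i)]
        for j in range(1, len(t) + 1):
            cur.append(min(
                prev[j] + gap_cost(s, i),                    # delete from s
                cur[j - 1] + gap_cost(t, j),                 # insert from t
                prev[j - 1] + sub_cost(s[i - 1], t[j - 1]),  # replace/match
            ))
        prev = cur
    return prev[-1]


for candidate in ("briam", "erian"):
    print(candidate, weighted_distance("brian", candidate))
print("Mathew", weighted_distance("Matthew", "Mathew"))
```

With that table 'briam' scores lower than 'erian' against 'brian', and
Matthew / Mathew come out at distance zero. Real penalty tables would
need tuning against your own data.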
Google "febrl" for a Python package for record matching -- the authors
have a recent paper where they compare various name-matching methods.
HTH,
John
> This message
[big snip]
has astonishingly large multi-lingual carbuncles on its rump. Please
consider posting from home.