comp.lang.python

Re: Looking for library to estimate likeness of two strings

Matthew_WARREN

2/7/2008 11:37:00 AM








> On Wed, 06 Feb 2008 17:32:53 -0600, Robert Kern wrote:
>
> > Jeff Schwab wrote:
> ...
> >> If the strings happen to be the same length, the Levenshtein distance
> >> is equivalent to the Hamming distance.

Is this really what the OP was asking for? If I understand it correctly,
Levenshtein distance counts the number of edits required to transform
one string into the target string. The smaller the distance, the closer
the match, but for the OP's problem, consider:


table1    table2
brian     briam
          erian


I think the OP would like to guess at 'briam' rather than 'erian', but
Levenshtein would rate them as equally good guesses?
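
As a quick check of that point, here is a minimal sketch of plain unit-cost
Levenshtein distance in Python. The function name is illustrative and not
taken from any library mentioned in this thread; every insertion, deletion
and substitution is assumed to cost 1.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance: insert, delete and substitute all cost 1."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),   # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


for candidate in ("briam", "erian"):
    print(candidate, levenshtein("brian", candidate))
# briam 1
# erian 1
```

Both candidates are one substitution away from 'brian', so the unit-cost
measure alone cannot prefer one over the other.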

I know this is pushing it more toward phonetic analysis of the words or
something similar, and that's orders of magnitude more complex.

just in case,

http://www.linguistlist.org/sp/Softwa...

might be a good place to start looking into it, along with the NLTK
libraries here

http://nltk.sourceforge.net/index.php/Doc...



Matt.


1 Answer

John Machin

2/7/2008 10:32:00 PM

On Feb 7, 10:37 pm, Matthew_WAR...@bnpparibas.com wrote:
> > On Wed, 06 Feb 2008 17:32:53 -0600, Robert Kern wrote:
>
> > > Jeff Schwab wrote:
> > ...
> > >> If the strings happen to be the same length, the Levenshtein distance
> > >> is equivalent to the Hamming distance.
>
> Is this really what the OP was asking for? If I understand it correctly,
> Levenshtein distance counts the number of edits required to transform
> one string into the target string. The smaller the distance, the closer
> the match, but for the OP's problem, consider:
>
> table1    table2
> brian     briam
>           erian
>
> I think the OP would like to guess at 'briam' rather than 'erian', but
> Levenshtein would rate them as equally good guesses?
>
> I know this is pushing it more toward phonetic analysis of the words or
> something similar, and that's orders of magnitude more complex.
>

Not very. The edit distance idea can be generalised by having variable
penalties for replacement and for insertion/deletion.

E.g. replacing n with m (or m with n) gets a low penalty because the two
letters are phonetically very similar AND adjacent on some keyboards.
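
A rough sketch of that generalisation in Python might look like the
following. The penalty table, the 0.5 value and the function names are all
assumptions made up for illustration; a real table would have to be built
from phonetic groupings and keyboard layouts.

```python
# Hypothetical penalty table: unordered letter pairs judged "close" get a
# reduced substitution cost. The 0.5 value is an arbitrary illustration.
CLOSE_PAIRS = {frozenset("nm"): 0.5}


def sub_cost(x: str, y: str) -> float:
    """Cost of replacing character x with character y."""
    if x == y:
        return 0.0
    return CLOSE_PAIRS.get(frozenset((x, y)), 1.0)


def weighted_distance(a: str, b: str, indel: float = 1.0) -> float:
    """Edit distance with variable substitution and insert/delete penalties."""
    prev = [j * indel for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * indel]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + indel,                  # delete ca
                curr[j - 1] + indel,              # insert cb
                prev[j - 1] + sub_cost(ca, cb),   # substitute
            ))
        prev = curr
    return prev[-1]


for candidate in ("briam", "erian"):
    print(candidate, weighted_distance("brian", candidate))
# briam 0.5
# erian 1.0
```

With this table 'briam' now ranks ahead of 'erian' as a guess for 'brian'.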

Google "zobel editex" for some ideas.

Insertion/deletion: a good tweak is to use a low (even zero) penalty
for omitting a doubled letter e.g. Matthew / Mathew.
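
One possible way to sketch the doubled-letter tweak in Python is below.
Treating a letter as "doubled" when it equals the letter immediately before
it in the same string is an assumption of this sketch, as are the function
name and the 0.2 default penalty.

```python
def doubled_aware_distance(a: str, b: str, doubled: float = 0.2) -> float:
    """Edit distance where dropping or adding the second of two identical
    adjacent letters costs `doubled` instead of the usual 1.0."""

    def indel_cost(s: str, k: int) -> float:
        # Cost of inserting/deleting s[k]: cheap if it repeats s[k - 1].
        return doubled if k > 0 and s[k] == s[k - 1] else 1.0

    # First row: build b[:j] from the empty string by insertions.
    prev = [0.0]
    for j in range(1, len(b) + 1):
        prev.append(prev[-1] + indel_cost(b, j - 1))

    for i in range(1, len(a) + 1):
        # First column: reduce a[:i] to the empty string by deletions.
        curr = [prev[0] + indel_cost(a, i - 1)]
        for j in range(1, len(b) + 1):
            curr.append(min(
                prev[j] + indel_cost(a, i - 1),       # delete a[i - 1]
                curr[j - 1] + indel_cost(b, j - 1),   # insert b[j - 1]
                prev[j - 1] + (0.0 if a[i - 1] == b[j - 1] else 1.0),
            ))
        prev = curr
    return prev[-1]


print(doubled_aware_distance("Matthew", "Mathew"))  # 0.2
print(doubled_aware_distance("Mathew", "Mahew"))    # 1.0
```

Passing doubled=0.0 gives the zero-penalty variant, in which 'Matthew' and
'Mathew' are treated as identical.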

Google "febrl" for a Python package for record matching -- the authors
have a recent paper where they compare various name-matching methods.

HTH,
John

> This message
[big snip]
has astonishingly large multi-lingual carbuncles on its rump. Please
consider posting from home.

> Ce message et toutes les pieces jointes (ci-apres le

[big snip]