Robert Klemme
1/29/2008 7:31:00 PM
On 29.01.2008 19:03, Phil Rhoades wrote:
> People,
>
> I am not sure if this is an appropriate place to ask this sort of
> question - there are probably dozens of different solutions with greatly
> varying amounts of time, effort and efficiency involved but since I like
> doing things in Ruby, I thought I would ask the gurus here:
>
> I periodically receive new mailing lists in CSV format and I have to
> check for duplications of individual mailing addresses in the new list
> and the current list. The problem is that there is no common format in
> the new lists because they come from different organisations - one list
> might have all data in capital letters, another might have last name and
> only initials, another might have last name and first name, another
> might have full state names and others a two character field - there are
> lots of variations. About the only thing that can be relied on (ignoring
> case) is that the last name would be the same in both lists if there is
> a duplication. If I want to pattern match from the new list to the
> existing list, I have to be fairly flexible, i.e. it is better to get false
> positives (because they can be quickly ignored by eyeballing) than false
> negatives (someone is mailed twice in the new merged list).
>
> Suggestions? ideas? Should I just use the regular shell tools?
<brainstorming>
Maybe a two-step approach:
1. normalize the data (e.g. strip all whitespace and punctuation, or
just keep letters and digits, ignoring case)
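2. calculate something like the Hamming distance between every pair of
entries (an edit distance is probably a better fit, since Hamming
distance is only defined for strings of equal length) and flag those
pairs whose distance is below a certain threshold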
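
A minimal Ruby sketch of the two steps, under some assumptions: each
entry is a single string (e.g. one joined CSV row), Levenshtein edit
distance stands in for the Hamming distance because entries will rarely
have equal length, and the method names and the threshold of 3 are made
up for illustration and would need tuning against real data.

```ruby
# Step 1: lower-case the entry and keep only letters and digits.
def normalize(text)
  text.downcase.gsub(/[^[:alnum:]]/, "")
end

# Levenshtein (edit) distance between two strings, computed row by row.
# Swapped in for Hamming distance so that strings of different lengths
# can be compared.
def edit_distance(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Step 2: compare every new entry with every current entry and return
# the [new, current] pairs whose distance is below the threshold.
# The threshold value is an arbitrary assumption.
def likely_duplicates(new_entries, current_entries, threshold = 3)
  current_keys = current_entries.map { |entry| [entry, normalize(entry)] }
  new_entries.flat_map do |entry|
    key = normalize(entry)
    current_keys
      .select { |_, current_key| edit_distance(key, current_key) < threshold }
      .map { |current, _| [entry, current] }
  end
end
```

A higher threshold gives more false positives and fewer false
negatives, which matches the preference stated in the question.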
Downside is that step 2 takes O(n*n)...
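
One way to soften the quadratic cost, sketched here as an assumption
rather than a tested recipe: since the question says the last name
(ignoring case) is the one field that can be relied on, entries could
be grouped by normalized last name first, so that the fuzzy comparison
only runs inside each group. The "last_name" key and both method names
are hypothetical; rows are assumed to be Hash-like, e.g. parsed CSV
rows with headers.

```ruby
# Normalized last name used as the grouping key.
# "last_name" is a hypothetical column name; real lists will likely
# need a per-source column mapping.
def last_name_key(row)
  row["last_name"].to_s.downcase.gsub(/[^[:alnum:]]/, "")
end

# Returns [new_row, current_row] pairs that share a last name.
# Only these candidate pairs need the more expensive fuzzy comparison.
def candidates_by_last_name(new_rows, current_rows)
  index = current_rows.group_by { |row| last_name_key(row) }
  new_rows.each_with_object([]) do |row, pairs|
    key = last_name_key(row)
    next if key.empty? # rows without a last name cannot be matched this way
    index.fetch(key, []).each { |match| pairs << [row, match] }
  end
end
```

Building the index is a single pass over the current list, and each new
entry is then compared only against entries with the same last name.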
</brainstorming>
Kind regards
robert