Jan Fischer
10/8/2008 8:14:00 AM
Hello again,
Robert Klemme wrote on 6.10.08 at 13:05:
> 2008/10/6 Martin DeMello <martindemello@gmail.com>:
>> On Mon, Oct 6, 2008 at 1:03 AM, Jan Fischer <janfischer@mail.org> wrote:
>>> At the moment I compare two strings by making them lowercase, deleting dots
>>> etc., deleting the substrings 'Inc', 'Ltd' etc. and then building the
>>> Levenshtein-Distance of the metaphone-key of the two strings.
>>> It doesn't work very well and is damn slow, but it's okay and the best I
>>> could figure out. (Nevertheless, your hints on that are welcome too.)
>>>
>>> My problem is that I don't know how to apply my compare method in an
>>> efficient way. What I'm doing now is selecting the first row where Group is
>>> NULL, then selecting each row (where Group is NULL) in my database in a
>>> loop again, comparing it with the first selected string, and setting Group
>>> to a certain number if the compare method says they match.
>> The problem with relying on a compare function for grouping is that it
>> leads naturally to an O(n^2) solution. Ideally, what you want is some
>> sort of "characteristic" function, so that two entries with the same
>> characteristic are likely to be in the same grouping. Of course,
>> coming up with the characteristic is not always easy, and may not even
>> be possible, but it's a useful direction to think in. Look up fuzzy
>> searching techniques, too, and see if they have anything adaptable.
>
> Also, look into your database's feature set. Most modern RDBMSs come
> with text-searching capabilities, and dealing with large volumes of
> data is where those beasts excel. If such a characteristic function
> can be found, chances are that you can implement it as a function in
> the DB and resolve the issue via database queries / updates.
thank you very much for your hints. I think you both suggest using something
like the soundex function to characterize each single row in a certain
way and then grouping by this characteristic. (Brian Chandler suggests
the same in another post; thanks, Brian.)
Fact is that I already tried the "soundex solution", but the results
don't even come close to what I expected. I also tried building the
metaphone key of every string and grouping by that, but that wasn't
enough either.
I think a mix of your hints and what Ragev Satish suggests in his post
will probably lead me in the right direction:
The first step is to "normalize" my strings (strip "Ltd." etc.), build the
metaphone key, and write that back into the database instead of
recalculating everything for each comparison. That should be O(n), right?
(I read about Landau notation for the first time...)
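That normalization step might look roughly like this in Ruby. The suffix
list and cleanup rules below are just assumptions for illustration; the
metaphone key itself could come from the `text` gem (Text::Metaphone) in
a real run, and the result would then be written back into the DB column.

```ruby
# Assumed list of legal-form suffixes to strip; extend as needed.
SUFFIXES = /\b(inc|ltd|gmbh|co|corp|llc)\b/

def normalize(name)
  key = name.downcase
  key = key.gsub(/[.,&'-]/, ' ')  # drop common punctuation
  key = key.gsub(SUFFIXES, ' ')   # strip legal-form suffixes
  key.squeeze(' ').strip          # collapse runs of whitespace
end

normalize("Acme Widgets, Inc.")  # => "acme widgets"
```

Running this once per row and storing the result (plus its metaphone key)
keeps the preprocessing at O(n), so later comparisons only read the
precomputed keys.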
After this step I'll try to group my strings by other information -
maybe by country of origin, zip code (if available), or first letter -
before comparing every string to each other.
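A rough sketch of that blocking idea: partition the records by a cheap
key (here: country plus first letter of the precomputed key), then only
compare pairs inside each block. The record layout (hashes with :key and
:country) and the placeholder similarity check are assumptions; in my
case the check would really be the Levenshtein distance on the metaphone
keys.

```ruby
# Placeholder similarity check (first four characters of the
# precomputed key); stands in for Levenshtein on metaphone keys.
def similar?(a, b)
  a[:key][0, 4] == b[:key][0, 4]
end

def group_records(records)
  groups = {}
  gid = 0
  # Blocking key: country plus first letter of the normalized key.
  # Pairwise comparison then happens only inside each block.
  records.group_by { |r| [r[:country], r[:key][0]] }.each_value do |block|
    block.each do |rec|
      match = block.find { |o| groups[o.object_id] && similar?(o, rec) }
      groups[rec.object_id] = match ? groups[match.object_id] : (gid += 1)
    end
  end
  records.map { |r| groups[r.object_id] }
end
```

Each block still costs O(m^2) for its m members, but with decent
blocking keys the blocks stay small and the total is far below a full
pairwise pass.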
I have to be honest: I had never heard of n-gram or bigram schemes, so I
think I'll have to read up on them first. But it sounds very interesting
and I'm very thankful for these hints.
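From what I've read so far, the basic idea can be shown in a few lines:
split each string into overlapping two-character windows (bigrams) and
score a pair by how many windows they share (this variant is the
Sorensen-Dice coefficient; the choice of coefficient is just one option).

```ruby
# Overlapping two-character windows of a string.
def bigrams(s)
  (0..s.length - 2).map { |i| s[i, 2] }
end

# Sorensen-Dice coefficient: 2 * |shared bigrams| / (total bigrams).
def dice(a, b)
  ba, bb = bigrams(a), bigrams(b)
  overlap = (ba & bb).length  # distinct bigrams present in both
  2.0 * overlap / (ba.length + bb.length)
end

dice("night", "nacht")  # => 0.25 (only "ht" is shared)
```

Because each string's bigram set can be precomputed and stored, this
also fits the "characteristic" approach: candidates that share no
bigrams can be skipped without an expensive edit-distance call.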
Kind regards
Jan