Jan Fischer
10/8/2008 8:14:00 AM
Hello again,
Robert Klemme wrote on 6.10.08 at 13:05:
> 2008/10/6 Martin DeMello <martindemello@gmail.com>:
>> On Mon, Oct 6, 2008 at 1:03 AM, Jan Fischer <janfischer@mail.org> wrote:
>>> At the moment I compare two strings by making them lowercase, deleting dots
>>> etc., deleting the substrings 'Inc', 'Ltd' etc. and then building the
>>> Levenshtein-Distance of the metaphone-key of the two strings.
>>> It doesn't work very well and is damn slow, but it's okay and the best I
>>> could figure out. (Nevertheless, your hints on that are welcome too.)
>>>
>>> My problem is that I don't know how to apply my compare method in an
>>> efficient way. What I'm doing now is selecting the first row where Group is
>>> NULL, then selecting each row (where Group is NULL) in my database in a
>>> loop again, comparing it with the first selected string, and setting Group
>>> to a certain number if the compare method says they match.
>> The problem with relying on a compare function for grouping is that it
>> leads naturally to an O(n^2) solution. Ideally, what you want is some
>> sort of "characteristic" function, so that two entries with the same
>> characteristic are likely to be in the same grouping. Of course,
>> coming up with the characteristic is not always easy, and may not even
>> be possible, but it's a useful direction to think in. Look up fuzzy
>> searching techniques, too, and see if they have anything adaptable.
>
> Also, look into your database's feature set. Most modern RDBMSs come
> with text-searching capabilities, and dealing with large volumes of
> data is where those beasts excel. If such a characteristic function
> can be found, chances are that you can implement it as a function in
> the DB and resolve the issue via database queries / updates.
thank you very much for your hints. I think you both suggest using something
like the soundex function to characterize each single row in a certain
way and then grouping by this characteristic. (Brian Chandler suggests
the same in another post; thanks, Brian.)
Fact is that I already tried the "soundex solution", but the results
don't even come close to what I expected. I also tried building the
metaphone key of every string and grouping by that, but that wasn't
enough either.
I think a mix of your hints and what Ragev Satish suggests in his post
will probably lead me in the right direction:
The first step is to "normalize" my strings (strip "Ltd." etc.), build the
metaphone key, and write that back into the database instead of
recalculating everything for each comparison. That should be O(n), right?
(I read about Landau notation for the first time...)
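That normalization step might look roughly like this in Ruby. The suffix
list and cleanup rules below are just assumptions for illustration; the
metaphone key itself could come from the `text` gem (Text::Metaphone) in
a real run, and the result would then be written back into the DB column.

```ruby
# Assumed list of legal-form suffixes to strip; extend as needed.
SUFFIXES = /\b(inc|ltd|gmbh|co|corp|llc)\b/

def normalize(name)
  key = name.downcase
  key = key.gsub(/[.,&'-]/, ' ')  # drop common punctuation
  key = key.gsub(SUFFIXES, ' ')   # strip legal-form suffixes
  key.squeeze(' ').strip          # collapse runs of whitespace
end

normalize("Acme Widgets, Inc.")  # => "acme widgets"
```

Running this once per row and storing the result (plus its metaphone key)
keeps the preprocessing at O(n), so later comparisons only read the
precomputed keys.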
After this step I'll try to group my strings by other information -
maybe by country of origin, zip code (if available), or first letter -
before comparing every string to each other.
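A rough sketch of that blocking idea: partition the records by a cheap
key (here: country plus first letter of the precomputed key), then only
compare pairs inside each block. The record layout (hashes with :key and
:country) and the placeholder similarity check are assumptions; in my
case the check would really be the Levenshtein distance on the metaphone
keys.

```ruby
# Placeholder similarity check (first four characters of the
# precomputed key); stands in for Levenshtein on metaphone keys.
def similar?(a, b)
  a[:key][0, 4] == b[:key][0, 4]
end

def group_records(records)
  groups = {}
  gid = 0
  # Blocking key: country plus first letter of the normalized key.
  # Pairwise comparison then happens only inside each block.
  records.group_by { |r| [r[:country], r[:key][0]] }.each_value do |block|
    block.each do |rec|
      match = block.find { |o| groups[o.object_id] && similar?(o, rec) }
      groups[rec.object_id] = match ? groups[match.object_id] : (gid += 1)
    end
  end
  records.map { |r| groups[r.object_id] }
end
```

Each block still costs O(m^2) for its m members, but with decent
blocking keys the blocks stay small and the total is far below a full
pairwise pass.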
I have to be honest: I had never heard of n-gram or bigram schemes, so I
think I'll have to read up on them first. But it sounds very interesting
and I'm very thankful for these hints.
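From what I've read so far, the basic idea can be shown in a few lines:
split each string into overlapping two-character windows (bigrams) and
score a pair by how many windows they share (this variant is the
Sorensen-Dice coefficient; the choice of coefficient is just one option).

```ruby
# Overlapping two-character windows of a string.
def bigrams(s)
  (0..s.length - 2).map { |i| s[i, 2] }
end

# Sorensen-Dice coefficient: 2 * |shared bigrams| / (total bigrams).
def dice(a, b)
  ba, bb = bigrams(a), bigrams(b)
  overlap = (ba & bb).length  # distinct bigrams present in both
  2.0 * overlap / (ba.length + bb.length)
end

dice("night", "nacht")  # => 0.25 (only "ht" is shared)
```

Because each string's bigram set can be precomputed and stored, this
also fits the "characteristic" approach: candidates that share no
bigrams can be skipped without an expensive edit-distance call.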
Kind regards
Jan