Hello Olaf.
On 22/08/2011 10:00 a.m., Schmidt wrote:
> On 22.08.2011 09:37, Eduardo wrote:
>
>> And Access would let me add the indexes... (there would
>> be fewer than 32 in each table).
>>
>> But if I go for the SQL Mid(Filename,2), could it be faster?
> No - in this case (using builtin SQL-Functions) there's always
> at least one fulltable-scan necessary.
OK.
> But didn't we have this topic already "on the table"
> (the thread, where you were asking "find a similar word")?
Yes and no. At that time, the goal was to compare against a small set
of 5 to 50 words that I keep in an array in memory.
They are also Bible words, but that case was for the words used in a
verse or several verses.
That is already working fine.
Now I'm working on that program again, and I found that I also need to
find a word when no verse is specified. That's the difference: I need
to search the whole dictionary.
>
> In short, if there are only about 8000 Word-entries,
> which need "enhanced, unsharp scanning", then you
> would be faster, if you handle these wordscans
> not directly per "normal DB-Functions and table-scans"
> (in case, the DB-engine in question does not
> offer builtin "unsharp search-functions", which
> are more capable than for example 'Like').
Yes, I remember that you suggested moving to a SQLite database...
The only databases I have worked with so far are Access databases,
so I'm trying to stay with Access unless I see that I really
must migrate to another one (and also learn everything related to
distribution and deployment).
>
> So the best approach would be, to move your
> 8000 words over into a String-Array and perform
> your comparisons on this (or also additional,
> differently ordered or shortened StringArrays)
> simply by index, directly in your own App-Loops.
This is something that I'm considering, but in the end there will be
8000 * 60 entries because of all the variants.
In fact there are two tables: ~8000 entries are the Hebrew words and
~5000 are the Greek words.
Loading ~13,000 x 60 entries into memory at the start of the program
will take time.
In that case I would search each array with a successive-approximation
routine.
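
As a rough illustration of that idea, assuming "successive approximations" means a binary search over a sorted in-memory array, here is a minimal sketch in Python (chosen only for illustration). The function names and the tiny word list are hypothetical, not taken from the real program:

```python
from bisect import bisect_left

def find_first(sorted_words, target):
    """Return the index of the first entry equal to target, or -1 if absent."""
    i = bisect_left(sorted_words, target)  # binary search: O(log n) comparisons
    if i < len(sorted_words) and sorted_words[i] == target:
        return i
    return -1

def neighbours(sorted_words, target, span=2):
    """Return up to `span` entries on each side of the position where
    target would be inserted, i.e. its contiguous neighbours."""
    i = bisect_left(sorted_words, target)
    return sorted_words[max(0, i - span):min(len(sorted_words), i + span)]

# Hypothetical sample; the real arrays would hold one variant per vector.
words = sorted(["allos", "kalos", "salos", "aulos", "analos"])
```

With about 13,000 entries per array, each lookup of this kind needs roughly 14 comparisons, so the search itself should be cheap compared with loading the data.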
>
> The problem is an old one - and your current
> approach (same word-content in different
> "shapes", shortened from right, shortened from
> left, etc. ... and then multiple comparison-loops)
> is the one, which perhaps everyone starts with.
>
> But eventually one will find, that this "naive"
> approach might be sufficient for a small amount
> of words, but multiple array-scans are costly in
> terms of performance and need a lot of memory
> when the "unsharp requirements" grow - or the
> wordcount gets higher.
In this case, the dictionary won't ever change.
>
> And that's where all these more advanced,
> unsharp algorithms come into play (metaphone,
> ratcliff & Co.)
>
> Did you already try one of them against your
> set of words?
Hmm, no. What I'm doing is shortening from the right and shortening from
the left (with variations as well: the search word shortened by 2
characters, the database word shortened by one, and so on), and then
doing a "FindFirst" on the field in question. If I find a match, I
compare that word (the complete word, not the shortened one) with the
word I'm searching for (also complete) using an algorithm based on
"Levenshtein Distance".
Then I move first to the previous and then to the next contiguous
records until I reach a number of words that don't match by more than
50%, at which point I stop (and go on to the next search cycle in the
loop).
At the end, I sort the list of words that were found by match
percentage, and if there are more than 20, I keep only the 20 most
relevant.
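
The scoring and ranking step can be sketched roughly as follows (in Python, for illustration only). The percentage formula here, 100 * (1 - distance / length of the longer word), is an assumption; the routine that produced the results below is only "based on" Levenshtein distance and evidently weights things differently, so this sketch will not reproduce those numbers. All function names are hypothetical:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Assumed percentage: 100 * (1 - distance / longer length), case-insensitive."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a.lower(), b.lower()) / longest)

def best_matches(query, candidates, threshold=50.0, limit=20):
    """Score every candidate, drop those below threshold, keep the top `limit`."""
    scored = [(similarity(query, w), w) for w in candidates]
    kept = [(s, w) for s, w in scored if s >= threshold]
    kept.sort(key=lambda sw: -sw[0])  # highest percentage first (stable sort)
    return kept[:limit]
```

For example, `best_matches("alos", ["allos", "kalos", "megalos", "Amos", "theopneustos"])` keeps the first four words and drops "theopneustos", which falls below the 50% threshold.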
>
> Or could you give a list of (maybe 10-20 should
> be sufficient) words - and then a few searchterms,
> for which you define the word (out of your
> 10-20 words-list) that should be returned - just that
> we can see, how tolerant/unsharp the algorithm should be...
>
> Olaf
OK, I'll copy some results from what I've done so far.
Suppose that the user wants to find the word "anupokritos", but he or
she doesn't remember how to spell it, and writes it "as it sounds":
******************
word: anaupocritos (Greek)
It took 159,06 seconds
Word ID: G 505 anupokritos percentage: 82
Word ID: G 506 anupotaktos percentage: 65
Word ID: G 5273 hupokrites percentage: 58
Word ID: G 379 anapologetos percentage: 58
Word ID: G 368 anantirrhetos percentage: 58
Word ID: G 369 anantirrhetos percentage: 58
Word ID: G 178 akatakritos percentage: 58
Word ID: G 799 Asugkritos percentage: 57
Word ID: G 5580 pseudochristos percentage: 56
Word ID: G 319 anagnorizomai percentage: 53
Word ID: G 338 anaitios percentage: 50
Word ID: G 402 anachoreo percentage: 50
Word ID: G 87 adiakritos percentage: 50
Word ID: G 361 anamartetos percentage: 50
Word ID: G 526 apallotrioo percentage: 50
Word ID: G 3480 Nazoraios percentage: 50
Word ID: G 4182 polupoikilos percentage: 50
Word ID: G 125 Aiguptos percentage: 50
Word ID: G 378 anapleroo percentage: 50
Word ID: G 377 anapipto percentage: 50
******************
Some more samples:
******************
word: neumaticós (Greek)
It took 124,41 seconds
Word ID: G 4153 pneumatikos percentage: 81
Word ID: G 4152 pneumatikos percentage: 81
Word ID: G 1193 dermatinos percentage: 66
Word ID: G 3020 Leuitikos percentage: 60
Word ID: G 4984 somatikos percentage: 60
Word ID: G 4985 somatikos percentage: 60
Word ID: G 1444 Hebraikos percentage: 60
Word ID: G 2122 eukairos percentage: 57
Word ID: G 2121 eukairos percentage: 57
Word ID: G 2739 kaumatizo percentage: 57
Word ID: G 3524 nephaleos percentage: 55
Word ID: G 5538 chrematismos percentage: 54
Word ID: G 2773 kermatistes percentage: 54
Word ID: G 3566 numphios percentage: 53
Word ID: G 3512 neoterikos percentage: 50
Word ID: G 2452 Ioudaikos percentage: 50
Word ID: G 2451 Ioudaikos percentage: 50
Word ID: G 2441 himatismos percentage: 50
Word ID: G 1054 Galatikos percentage: 50
Word ID: G 4262 probatikos percentage: 50
******************
word: teofenustos (Greek)
It took 152,21 seconds
Word ID: G 2315 theopneustos percentage: 65
Word ID: G 3504 neophutos percentage: 60
Word ID: G 1675 Hellenistes percentage: 54
Word ID: G 1354 Dionusios percentage: 53
Word ID: G 5118 tosoutos percentage: 53
Word ID: G 5108 toioutos percentage: 53
Word ID: G 5082 telikoutos percentage: 51
Word ID: G 4339 proselutos percentage: 51
Word ID: G 5533 chreopheiletes percentage: 50
Word ID: G 2459 Ioustos percentage: 50
Word ID: G 2312 theodidaktos percentage: 49
Word ID: G 2180 Ephesios percentage: 48
Word ID: G 1721 emphutos percentage: 48
******************
word: alos (Greek)
It took 16,13 seconds
Word ID: G 243 allos percentage: 97
Word ID: G 247 allos percentage: 97
Word ID: G 2570 kalos percentage: 79
Word ID: G 4535 salos percentage: 79
Word ID: G 2573 kalos percentage: 77
Word ID: G 216 alalos percentage: 66
Word ID: G 5194 hualos percentage: 66
Word ID: G 358 analos percentage: 66
Word ID: G 527 apalos percentage: 65
Word ID: G 836 aulos percentage: 55
Word ID: G 259 halosis percentage: 54
Word ID: G 3171 megalos percentage: 54
Word ID: G 806 asphalos percentage: 54
Word ID: G 250 aloe percentage: 53
Word ID: G 301 Amos percentage: 50
******************
word: Jehová Nisi (Hebrew)
It took 215,4 seconds
Word ID: H 3071 Yehovah nicciy percentage: 72
Word ID: H 3070 Yehovah yireh percentage: 54
******************
word: Jehová Sitquenu (Hebrew)
It took 292,88 seconds
Word ID: H 3072 Yehovah tsidqenuw percentage: 72
******************
word: Elojim (Hebrew)
It took 59,43 seconds
Word ID: H 430 'elohiym percentage: 97
Word ID: H 440 'Elowniy percentage: 53
******************
word: nefes (Hebrew)
It took 36,44 seconds
Word ID: H 5315 nephesh percentage: 80
Word ID: H 657 'ephec percentage: 77
Word ID: H 5311 nephets percentage: 62
Word ID: H 5233 nekec percentage: 62
Word ID: H 5298 Nepheg percentage: 60
Word ID: H 5309 nephel percentage: 60
Word ID: H 5316 nepheth percentage: 60
Word ID: H 660 'eph`eh percentage: 57
Word ID: H 7516 rephesh percentage: 49
Word ID: H 2665 chephes percentage: 49
******************
In all cases the desired word was the first one in the list.
Thanks.