microsoft.public.vb.general.discussion

Olaf: dhRichClient3 to index many documents?

(Mike Mitchell)

9/27/2011 5:15:00 PM

I had a hernia op ten days ago and am only now beginning to feel like
approaching the PC again. I've dug out the last tests I ran with your
dhSQLite and have worked out how to set and read the hidden rowid
column in a virtual FTS3 table.

The actual task is to index a bunch of documents. Let's assume they're
text files of varying sizes up to 5kb.

The quick way would be to read each text file into a string and whack
the string into a column, e.g.

TxtContent = "Entire contents of one text file..........."
NewID = Cnn.UniqueID64
Cmd.SetInt64 1, NewID
Cmd.SetText 2, TxtContent
Cmd.Execute

Then repeat the above with the next text file and so on.
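
A rough sketch of that loop, written with Python's standard sqlite3 module as a stand-in, since it is not the dhRichClient3 API. The table name docs, the path column and the function name index_folder are illustrative assumptions, and the sketch presumes the linked SQLite build has FTS3 compiled in.

```python
import sqlite3
from pathlib import Path


def index_folder(folder, db_path=":memory:"):
    """Index every *.txt file in `folder` into an FTS3 table named docs."""
    cnn = sqlite3.connect(db_path)
    # Assumption: FTS3 is available in this SQLite build.
    cnn.execute("CREATE VIRTUAL TABLE docs USING fts3(path, content)")
    with cnn:  # one transaction for the whole batch, committed on success
        files = sorted(Path(folder).glob("*.txt"))
        for doc_id, path in enumerate(files, start=1):
            text = path.read_text(encoding="utf-8", errors="replace")
            # docid is the FTS3 alias for the hidden rowid column.
            cnn.execute(
                "INSERT INTO docs(docid, path, content) VALUES (?, ?, ?)",
                (doc_id, path.name, text),
            )
    return cnn
```

Wrapping the inserts in a single transaction usually matters more for speed than anything else when many small files are added.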

Possibly, first run a quick parser over each text file to remove
'noise' words.

Sound OK to you?

MM

Schmidt

9/27/2011 5:52:00 PM

On 27.09.2011 19:14, MM wrote:

> The quick way would be to read each text file into a
> string and whack the string into a column, e.g.
>
> TxtContent = "Entire contents of one text file..........."
> NewID = Cnn.UniqueID64
> Cmd.SetInt64 1, NewID
> Cmd.SetText 2, TxtContent
> Cmd.Execute
>
> Then repeat the above with the next text file and so on.

Yep, as long as the target-table is an FTS3-table
already, then this should be fine.

> Possibly, first run a quick parser over each text file
> to remove 'noise' words.
>
> Sound OK to you?

When the target-table is the FTS3-Table, then there's
no need IMO, to remove the "usual" noise-words yourself -
the internal FTS3-Tokenizer can do that for you, in
case you will not use the (default) "simple" Tokenizer,
but the also already built-in "porter/stemmer algorithm"...

Please read about it here:
http://www.sqlite.org/fts3.html...

So, maybe the specification of:
"... using FTS3(YourTxtContentColumn, tokenize=porter)"
in your "virtual" Create Table Statement is already
enough for your purposes.

Olaf
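
For illustration, a small sketch comparing the two built-in tokenizers, using Python's standard sqlite3 module rather than dhRichClient3. As far as the FTS3 documentation describes it, porter applies stemming (so "run" also finds "running" and "runs"); it does not discard common words, so a separate stop-word pass may still be worth considering if noise words matter. The names build_index, search and SAMPLE are hypothetical, and FTS3 is assumed to be compiled into the SQLite build.

```python
import sqlite3

SAMPLE = [
    (1, "The indexer was running over many documents"),
    (2, "A quick parser removes noise words"),
    (3, "He runs the index every night"),
]


def build_index(docs, tokenizer="porter"):
    """Create an in-memory FTS3 table and index (docid, text) pairs."""
    cnn = sqlite3.connect(":memory:")
    # tokenizer is either "simple" (the default) or "porter".
    cnn.execute(
        f"CREATE VIRTUAL TABLE docs USING fts3(content, tokenize={tokenizer})"
    )
    cnn.executemany("INSERT INTO docs(docid, content) VALUES (?, ?)", docs)
    return cnn


def search(cnn, term):
    """Return the docids whose content matches the full-text query `term`."""
    rows = cnn.execute(
        "SELECT docid FROM docs WHERE docs MATCH ? ORDER BY docid", (term,)
    )
    return [r[0] for r in rows]
```

With SAMPLE, search(build_index(SAMPLE), "run") finds documents 1 and 3, while the same query against a "simple" index finds nothing; a query for "the" still matches under porter, which shows that common words remain in the index.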



(Mike Mitchell)

9/29/2011 2:12:00 PM

Thanks.

Further question: The sqlite36_engine.dll in the RichClient toolset
downloads is not the latest version as per
http://www.sqlite.org/dow... which is delivering ver. 3.7.8.

However, I believe I'm right in saying that the official download
doesn't incorporate FTS3/4 as standard. So do you plan on issuing a
version of 3.7.8 compiled with the FTS3/4 stuff or should I just use
your 36 version?

Actually, I've just noticed that the ref to sqlite36_engine.dll at
http://www.thecommon.... says it is "StdCall-compile of
sqlite-version 3.7.5", which makes me wonder why you named it
sqlite36?

MM

Schmidt

9/30/2011 10:02:00 AM

On 29.09.2011 16:11, MM wrote:

> Actually, I've just noticed that the ref to
> sqlite36_engine.dll at http://www.thecommon....
> says it is "StdCall-compile of sqlite-version 3.7.5",
> which makes me wonder why you named it sqlite36?

Yep, that naming of the companion-dll was not all that
well-thought - but there's some "historical reasons".

The first wrapper I wrote was:
dhSQLite.dll (companion-dll: sqlite3_engine.dll)

Then, when I've put the wrapper into the RichClient:
dhRichClient.dll (companion: sqlite35_engine.dll)

*There* was already the mistake (but at this point
in time I've brought out RichClient-Versions
faster than the sqlite-devs made their minor-
version-jumps).

Then with RichClient3 the sqlite-versions were at
3.6xxx and so you got your sqlite36_engine.dll.

Now RC3 was my first "interface-stable" release
with BinComp on - and it is out there now for
a few years (only receiving internal fixes, and
no interface-enhancements) - and so the sqlite-
devs outran me.

You can check for the real version by doing e.g.:

Private Sub Form_Load()
    Dim Cnn As New cConnection '<- define a DB-Connection
    Caption = Cnn.Version
End Sub

This puts out 3.7.5 for the latest RichClient3-
Binaries-package.

And 3.7.7.1 for the latest RichClient4-Package,
so with RC4 I'm only one minor sqlite-release
behind.
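
The same check can be sketched outside VB6 with Python's standard sqlite3 module, which also allows asking the engine whether FTS was compiled in. This is only an illustration against whatever SQLite library Python links to, not against sqlite36_engine.dll or vb_cairo_sqlite.dll; engine_info is a hypothetical helper name.

```python
import sqlite3


def engine_info():
    """Return (version string, list of ENABLE_FTS* compile options)."""
    cnn = sqlite3.connect(":memory:")
    version = cnn.execute("SELECT sqlite_version()").fetchone()[0]
    # PRAGMA compile_options lists the build flags, one per row.
    # It may come back empty if the build omits these diagnostics.
    options = [row[0] for row in cnn.execute("PRAGMA compile_options")]
    fts = sorted(o for o in options if o.startswith("ENABLE_FTS"))
    cnn.close()
    return version, fts


if __name__ == "__main__":
    version, fts = engine_info()
    print("SQLite", version, "FTS options:", fts or "none reported")
```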

All my efforts are going into RC4 now - and that
includes new compiles of the enhanced companion-
dll (which is now named vb_cairo_sqlite.dll).

When sqlite-version 3.7.9 comes out, then I will
do a recompile of vb_cairo_sqlite.dll (I will leave
out any other sqlite-step in-between, to save
some work on my end).

So, when you want to stay up-to-date with sqlite-
versions, then you should use the newer RC4 instead
of RC3 ... you can download its latest Base-Package
(without the optional WebKit-Browser-Dlls) from here:
http://www.datenhaus.de/Downloads/vbRC4Ba...

Usage in your (formerly RC3-based) Application
should only require a change of the Project-
reference from 'dhRichClient3' to 'vbRichClient4'.

....

How are you coming along with your FullText-search?
Is the more enhanced "porter" tokenizer sufficient
for your purposes (filtering out "noise-words") -
or do you still need your own "word-tokenizing"?

If so, then I could take a look at the tokenizer
C-Interface SQLite offers, for "rolling your own" -
it should be adaptable to VB6-*.bas modules,
using 'AddressOf' for the callbacks.


Olaf

(Mike Mitchell)

9/30/2011 11:01:00 AM

On Fri, 30 Sep 2011 12:02:28 +0200, Schmidt <sss@online.de> wrote:

>How are you coming along with your FullText-search?
>Is the more enhanced "porter" tokenizer sufficient
>for your purposes (filtering out "noise-words") -
>or do you still need your own "word-tokenizing"?
>
>If so, then I could take a look at the tokenizer
>C-Interface SQLite offers, for "rolling your own" -
>it should be adaptable to VB6-*.bas modules,
>using 'AddressOf' for the callbacks.
>
>
>Olaf

Thanks for the update.

Re tokenizer: I've taken your advice and read all that stuff on Porter
Stemming and so on at the link you gave and certainly to begin with
that will be all I need for initial prototyping.

My main problem at the moment is finding the physical and mental
energy! Man, did that operation knock me for six. I suppose when one
is younger the body gets used to it more quickly. But, for example,
I've just driven my car for only the second time in two weeks to get
some shopping locally and I feel like I've run a half-marathon. Better
than Wednesday, when it was more like a full marathon.

MM

ralph

9/30/2011 10:02:00 PM

On Fri, 30 Sep 2011 12:02:28 +0200, Schmidt <sss@online.de> wrote:

>On 29.09.2011 16:11, MM wrote:
>
>> Actually, I've just noticed that the ref to
>> sqlite36_engine.dll at http://www.thecommon....
>> says it is "StdCall-compile of sqlite-version 3.7.5",
>> which makes me wonder why you named it sqlite36?
>
>Yep, that naming of the companion-dll was not all that
>well-thought - but there's some "historical reasons".
>

Ha.

You are in good company. There are "historical reasons" behind MS's
magical jump to "6" for its VC and other development products with the
release of Visual Studio 6.

I found it interesting that when MS changed from the separate MDAC
packages to an O/S supplied DAC - the new version, while functionally
equivalent to MDAC 2.8, became DAC version "6".

Perhaps you should just rename your stuff version 6? <bg>

-ralph
[ After 30 years of working with collaborating libraries and trying to
maintain progressive and logical versioning, I've found it all seems
to go to h*ll at some point, and that point always seems to be around
version 3.5+ or what would be version 4.
Partly out of superstition and partly to avoid additional anguish at
that point I automatically pop them all to a common higher version, or
re-package with new names and start over from "1".]

(Mike Mitchell)

10/1/2011 2:19:00 PM

On Fri, 30 Sep 2011 12:02:28 +0200, Schmidt <sss@online.de> wrote:

>How are you coming along with your FullText-search?

One problem I've hit in the past few minutes is
NewID = Cnn.UniqueID64

That generates a number that is too large for a VB6 Long (obviously!)
and can't be stored in, say, List1.ItemData.

Isn't there a NewID = Cnn.UniqueID32 ?

I've got round it by:

Function GenerateSeed() As Long
    GenerateSeed = (Now - DateSerial(1970, 1, 1)) * 86400
End Function

and then:

Sub AddFiles....
    etc etc ........
    NewID = GenerateSeed()
    Do
        Cmd.SetInt32 1, NewID
        Cmd.SetText 2, TxtContent
        Cmd.Execute
        ........ etc etc

        NewID = NewID + 1
    Loop
End Sub

So every time a new block of files is added, it creates a new seed,
then increments NewID within that block.

I did try to fathom the SQLite docs re AutoIncrement, but it's like
trying to understand Einstein... Why the heck did they make RowID a
hidden column?
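
One alternative worth noting, sketched here with Python's standard sqlite3 module and not with dhRichClient3: when an insert leaves the rowid (docid in FTS3) out, SQLite assigns one itself, normally one more than the largest existing value, so the ids start at 1 and stay far below the limit of a VB6 Long. The helper name add_documents and the table name docs are assumptions for the example, and FTS3 is assumed to be compiled in.

```python
import sqlite3

INT32_MAX = 2**31 - 1  # largest value a VB6 Long can hold


def add_documents(cnn, texts):
    """Insert texts without an explicit id; return the docids SQLite chose."""
    before = cnn.execute(
        "SELECT coalesce(max(docid), 0) FROM docs"
    ).fetchone()[0]
    with cnn:  # commit the batch as one transaction
        cnn.executemany(
            "INSERT INTO docs(content) VALUES (?)", [(t,) for t in texts]
        )
    rows = cnn.execute(
        "SELECT docid FROM docs WHERE docid > ? ORDER BY docid", (before,)
    )
    return [r[0] for r in rows]


if __name__ == "__main__":
    cnn = sqlite3.connect(":memory:")
    cnn.execute("CREATE VIRTUAL TABLE docs USING fts3(content)")
    ids = add_documents(cnn, ["first file", "second file"])
    print(ids, all(i <= INT32_MAX for i in ids))
```

By comparison, a seed based on seconds since 1970 only fits a signed 32-bit value until January 2038.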

Anyway, it works! I've got three text files loaded and can search for
any word in any of the files. (Well, I haven't tested it extensively
yet, as it's only been running for the first time in the past ten
minutes, but it's looking quite promising.)

MM