comp.lang.python

fast method accessing large, simple structured data

agc

2/2/2008 8:37:00 PM

Hi,

I'm looking for a fast way of accessing some simple (structured) data.

The data is like this:
Approx. 6-10 GB of simple XML files; the only elements
I really care about are the <title> and <article> ones.

So what I'm hoping to do is put this data in a format such
that I can access it as fast as possible for a given request
(an HTTP request to a Python web server) that specifies just the title,
so that I can return the article content.

Is there some good format that is optimized for searching on
just one attribute (title) and then returning the corresponding article?

I've thought about putting this data in a SQLite database because,
from what I know, SQLite has very fast reads (no network latency, etc.)
but not as fast writes, which is fine because I probably won't be doing
much writing (I won't ever care about the speed of any writes).
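
A rough sketch of that SQLite idea (file and table names are just
placeholders):

import sqlite3
import xml.etree.ElementTree as ET

db = sqlite3.connect("articles.db")
db.execute("CREATE TABLE IF NOT EXISTS articles (title TEXT, body TEXT)")

def records(path):
    # iterparse streams the file instead of loading 6-10 GB at once
    title = None
    for _, elem in ET.iterparse(path):
        if elem.tag == "title":
            title = elem.text
        elif elem.tag == "article":
            yield title, elem.text
            elem.clear()   # free the (possibly large) article body

db.executemany("INSERT INTO articles VALUES (?, ?)", records("articles.xml"))
db.commit()
db.close()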

So is a database the way to go, or is there some other,
more specialized format that would be better?

Thanks,
Alex
6 Answers

Diez B. Roggisch

2/2/2008 8:52:00 PM


agc wrote:
> So is a database the way to go, or is there some other,
> more specialized format that would be better?


Database it is. Make sure you have proper indexing.
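
For instance, with SQLite (the database file and names below are just
examples):

import sqlite3

db = sqlite3.connect("articles.db")

# The index on the lookup column is what keeps a point query fast.
db.execute("CREATE INDEX IF NOT EXISTS idx_articles_title ON articles (title)")

def get_article(title):
    row = db.execute("SELECT body FROM articles WHERE title = ?",
                     (title,)).fetchone()
    return row[0] if row else None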


Diez

John Machin

2/2/2008 9:51:00 PM


agc wrote:
> So is a database the way to go, or is there some other,
> more specialized format that would be better?

"Database" without any further qualification indicates exact matching,
which doesn't seem to be very practical in the context of titles of
articles. There is an enormous body of literature on inexact/fuzzy
matching, and lots of deployed applications -- it's not a Python-related
question, really.
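
Just to illustrate the difference, here is a standard-library sketch of
inexact matching (the titles are made up):

import difflib

titles = ["A Brief History of Time", "A History of Western Philosophy"]
query = "a brief history of time"

# Exact matching: only a character-for-character match succeeds.
print(query in titles)    # False -- the case alone breaks it

# Inexact matching still finds the close candidate.
print(difflib.get_close_matches(query, titles, n=1, cutoff=0.6))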

M.-A. Lemburg

2/2/2008 11:21:00 PM


On 2008-02-02 21:36, agc wrote:
> Is there some good format that is optimized for searching on
> just one attribute (title) and then returning the corresponding article?

Depends on what you want to search and how, e.g. whether
a search for title substrings should give results, whether
stemming is needed, etc.

If all you want is a simple mapping of full title to article
string, an on-disk dictionary is probably the way to go,
e.g. mxBeeBase (part of the eGenix mx Base Distribution).
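
As a stand-in for such an on-disk dictionary (this sketch uses the
standard library's shelve module, not mxBeeBase itself):

import shelve

# Build once: keys must be strings, values can be any picklable object.
with shelve.open("articles.shelf") as db:
    db["Some Exact Title"] = "The full article body..."

# Look up later: disk-backed, so the 6-10 GB never has to fit in memory.
with shelve.open("articles.shelf", flag="r") as db:
    print(db.get("Some Exact Title"))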

For more complex searches, you're better off with a tool that
indexes the titles based on words, i.e. a full-text search
engine such as Lucene.

Databases can also handle this, but they often have problems when
it comes to more complex queries where their indexes no longer
help them speed up the query and they have to resort to
a table scan, i.e. a sequential search of all rows.
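
The difference is easy to see in SQLite itself (the exact plan text
varies by version; names are just examples):

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE articles (title TEXT, body TEXT)")
db.execute("CREATE INDEX idx_title ON articles (title)")

# Exact match: the planner can use the index.
print(db.execute("EXPLAIN QUERY PLAN SELECT body FROM articles "
                 "WHERE title = ?", ("x",)).fetchall())

# Substring match with a leading wildcard: no usable index, so SQLite
# falls back to scanning the whole table.
print(db.execute("EXPLAIN QUERY PLAN SELECT body FROM articles "
                 "WHERE title LIKE ?", ("%x%",)).fetchall())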

Some databases provide special full-text extensions, but
those are of varying quality. Better to use a specialized
tool such as Lucene for this.
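
For what it's worth, this is roughly what such an extension looks like
in SQLite (assuming a build with the FTS5 module compiled in; table and
column names are just examples):

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
db.execute("INSERT INTO docs VALUES (?, ?)",
           ("A Made Up Title", "Some article text about full-text search."))

# Word-based search over the indexed columns.
print(db.execute("SELECT title FROM docs WHERE docs MATCH ?",
                 ("search",)).fetchall())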

For more background on the problems of full-text search, see e.g.

http://www.ibm.com/developerworks/opensource/library/l-...

--
Marc-Andre Lemburg
eGenix.com

agc

2/3/2008 4:42:00 AM


On Feb 2, 1:50 pm, John Machin <sjmac...@lexicon.net> wrote:
> "Database" without any further qualification indicates exact matching,
> which doesn't seem to be very practical in the context of titles of
> articles. There is an enormous body of literature on inexact/fuzzy
> matching, and lots of deployed applications -- it's not a Python-related
> question, really.

Yes, you are right that in some sense this question is not truly
Python related, but I am looking to solve this problem in a way that
plays as nicely as possible with Python:

I guess an important feature of what I'm looking for is
some kind of mapping from *exact* title to corresponding article,
i.e. if my data set wasn't so large, I would just keep all my
data in an in-memory Python dictionary, which would be very fast.

But I have about 2 million article titles mapping to approx. 6-10 GB
of article bodies, so I think this would be just too big for a
simple Python dictionary.

Does anyone have any advice on the feasibility of using
just an in-memory dictionary? The dataset just seems too big,
but maybe there is a related method?
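
For example, one middle ground would be to keep only a
title -> (offset, length) dict in memory and read the article bodies
from a flat file on demand. A rough sketch (the file layout and the
parse_articles() helper are hypothetical):

def parse_articles():
    # stand-in for the real XML parsing step; yields (title, body) pairs
    yield "Example Title", "Example article body..."

index = {}   # title -> (offset, length); ~2 million small entries fit in RAM
with open("articles.dat", "wb") as out:
    for title, body in parse_articles():
        data = body.encode("utf-8")
        index[title] = (out.tell(), len(data))
        out.write(data)

def get_article(title, path="articles.dat"):
    # one seek and one read per request; the 6-10 GB of bodies stays on disk
    offset, length = index[title]
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length).decode("utf-8")

print(get_article("Example Title"))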

Thanks,
Alex

Ivan Illarionov

2/3/2008 1:55:00 PM


> Is there some good format that is optimized for searching on
> just one attribute (title) and then returning the corresponding article?

I would use Durus (http://www.mems-exchange.org/softw...), a
simple, Pythonic object database, and store this data as a persistent
Python dict with title keys and article values.

Stefan Behnel

2/3/2008 5:42:00 PM


agc wrote:
> But I have about 2 million article titles mapping to approx. 6-10 GB
> of article bodies, so I think this would be just too big for a
> simple Python dictionary.

Then use a database table that maps titles to articles, and make sure you
create an index over the title column.

Stefan