comp.lang.ruby

should I use a database or a flat file?

James Dinkel

4/1/2008 3:12:00 PM

I need to store some information with my ruby program and I am not sure
on what would be the best method. I'm mostly concerned about what would
be the most efficient use of cpu resources.

Basically, I will have a list of names each belonging to one of 5
categories. Sort of like this:

Cat1
-name1
-name2
-name3
-etc...

Cat2
-name4
-name5
-name6
-etc...

Cat3
-name7
-name8
-name9
-etc...

There will be hundreds of names, evenly divided between the categories.
But each name will go in only one category, there is no relation between
categories or anything like that. All the information will be
completely rewritten once a day and then read several times throughout
the day.

My choices for storage are an sqlite database (using ActiveRecord), a
flat text file of my own design, a YAML file, or an XML file.
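
For the "flat text file of my own design" option, a minimal sketch of a reader for the layout shown above, assuming one category name per line followed by its names prefixed with "-". The method name `parse_flat` is hypothetical and not part of the original script.

```ruby
# Parse the "CatN" / "-name" layout into a Hash of category => [names].
# Assumes a line starting with "-" is a name and any other non-blank line
# starts a new category.
def parse_flat(text)
  categories = {}
  current = nil
  text.each_line do |line|
    line = line.strip
    next if line.empty?
    if line.start_with?("-")
      raise ArgumentError, "name before any category: #{line}" if current.nil?
      categories[current] << line[1..-1]
    else
      current = line
      categories[current] = []
    end
  end
  categories
end
```

Reading a thousand or so lines this way is a single pass over the file, so CPU cost should be negligible at this size.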
--
Posted via http://www.ruby-....

20 Answers

Robert Klemme

4/1/2008 3:18:00 PM

2008/4/1, James Dinkel <jdinkel@gmail.com>:
> I need to store some information with my ruby program and I am not sure
> on what would be the best method. I'm mostly concerned about what would
> be the most efficient use of cpu resources.
>
> Basically, I will have a list of names each belonging to one of 5
> categories. Sort of like this:
>
> Cat1
> -name1
> -name2
> -name3
> -etc...
>
> Cat2
> -name4
> -name5
> -name6
> -etc...
>
> Cat3
> -name7
> -name8
> -name9
> -etc...
>
> There will be hundreds of names, evenly divided between the categories.

That's not much. I'd probably use XML - but that also depends on what
generates the data and what needs to be able to read it. You can
efficiently generate it and read it (using a stream parser for
example, but that seems unnecessary for hundreds of names only).

But ultimately it depends on what you want to do with the data. In
some cases a DB might be a better choice, for example if your volume is
going to increase dramatically.

> But each name will go in only one category, there is no relation between
> categories or anything like that. All the information will be
> completely rewritten once a day and then read several times throughout
> the day.
>
> My choices for storage are an sqlite database (using ActiveRecord), a
> flat text file of my own design, a YAML file, or an XML file.

YAML is another nice alternative because it is human readable. And
you can use Marshal if producer and consumer of the data are Ruby
programs.

Kind regards

robert

--
use.inject do |as, often| as.you_can - without end

Lionel Bouton

4/1/2008 3:32:00 PM

James Dinkel wrote:
> I need to store some information with my ruby program and I am not sure
> on what would be the best method. I'm mostly concerned about what would
> be the most efficient use of cpu resources.
>
> Basically, I will have a list of names each belonging to one of 5
> categories. Sort of like this:
>
> Cat1
> -name1
> -name2
> -name3
> -etc...
>
> Cat2
> -name4
> -name5
> -name6
> -etc...
>
> Cat3
> -name7
> -name8
> -name9
> -etc...
>
> There will be hundreds of names, evenly divided between the categories.
> But each name will go in only one category, there is no relation between
> categories or anything like that. All the information will be
> completely rewritten once a day and then read several times throughout
> the day.
>
> My choices for storage are an sqlite database (using ActiveRecord), a
> flat text file of my own design, a YAML file, or an XML file.

IMHO Databases are best when you have concurrent access to data being
modified regularly and want to enforce constraints during concurrent
write accesses.

In your case, the data is mostly static and constraints are easily
handled outside the storage layer (you overwrite all data with another
consistent version in one pass). I'd advise using the simplest storage
method, which is probably a YAML dump of an object holding all this data.

Marshal.dump/load is an option too. It may be faster than YAML if this
matters to you (I've not benchmarked it, so you'd better do that yourself
if you need fast reads/writes). It's not human-readable, which can be a
drawback when debugging.
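
A rough way to run that benchmark yourself, as a sketch only: the data shape (5 categories of 250 names) and the helper names `sample_data` and `timed_round_trip` are assumptions for illustration. It times one dump-and-load round trip per format and makes no claim about which one wins on your machine.

```ruby
require "yaml"

# Hypothetical data shaped like the original post: a Hash of
# category name => array of names.
def sample_data(categories: 5, names_per_category: 250)
  (1..categories).each_with_object({}) do |c, h|
    h["Cat#{c}"] = (1..names_per_category).map { |n| "name#{c}_#{n}" }
  end
end

# Time one dump + load round trip; returns [seconds, reloaded_copy].
def timed_round_trip(data, dumper, loader)
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  copy = loader.call(dumper.call(data))
  [Process.clock_gettime(Process::CLOCK_MONOTONIC) - start, copy]
end

data = sample_data
yaml_secs, _yaml_copy = timed_round_trip(data, YAML.method(:dump), YAML.method(:safe_load))
marshal_secs, _marshal_copy = timed_round_trip(data, Marshal.method(:dump), Marshal.method(:load))

puts format("YAML:    %.6fs", yaml_secs)
puts format("Marshal: %.6fs", marshal_secs)
```

A single round trip is noisy, so repeat the measurement a few times before drawing conclusions.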

That was the code/integration complexity side of your problem.

For the performance side of the problem:

If you dump your data in a temporary file and then rename it to
overwrite the final destination, you can use a neat hack for
long-running processes needing fresh data: you can design a little cache
that checks the mtime of the backing store (the final destination) on
read accesses and reloads it when it changes.
mtime checks are cheap and simple to code, and if the need arises for
really high throughput you can minimize them by adding TTL logic.
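
The temp-file-plus-rename write and the mtime-checked cache could be sketched roughly as follows. `write_atomically` and `FileCache` are hypothetical names, the file is assumed to hold a YAML dump of plain hashes, arrays, and strings, and the temp file is assumed to live on the same filesystem as the destination so that the rename replaces it in one step.

```ruby
require "yaml"

# Write the whole data set to a temp file next to the destination, then
# rename it over the destination so readers never see a half-written file.
def write_atomically(path, data)
  tmp = "#{path}.tmp.#{Process.pid}"
  File.write(tmp, YAML.dump(data))
  File.rename(tmp, path)
end

# Hypothetical cache: reloads the file only when its mtime has changed,
# and looks at the mtime at most once every `ttl` seconds.
class FileCache
  def initialize(path, ttl: 0)
    @path = path
    @ttl = ttl
    @data = nil
    @mtime = nil
    @checked_at = nil
  end

  def data
    now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    if @checked_at.nil? || now - @checked_at >= @ttl
      @checked_at = now
      mtime = File.mtime(@path)
      if mtime != @mtime
        @data = YAML.safe_load(File.read(@path))
        @mtime = mtime
      end
    end
    @data
  end
end
```

One caveat with this design: two rewrites that land within the filesystem's timestamp granularity can share an mtime, so the second could be missed. With one rewrite per day that should not matter.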

Lionel

James Dinkel

4/1/2008 3:32:00 PM


> But ultimately it depends on what you want to do with the data.

yeah, it's kinda hard to describe without just posting my entire script,
which I doubt people will want to read.

The data will be accessed by one ruby script, running on one computer.
The data will be read in, then the file closed and done for a couple
hours. So no concurrent access, no relations, no keeping the connection
open for extended periods of time, which is why I thought a database
would probably be overkill and just add overhead.

But I didn't know if maybe reading a file into memory would take more
effort than reading entries from a database. Also, I was a little off
on the numbers, I meant to say that there are hundreds of names per
category, so total names could be over a thousand. That size will
likely never ever change beyond +/- 100 at the most.

Thanks for the info. I'm really a newb at this, so any thoughts on
storing data using any of these methods is helpful.

James.
--
Posted via http://www.ruby-....

Zundra Daniel

4/1/2008 3:39:00 PM

Seems like the type of problem that's perfect for YAML.


Todd Benson

4/1/2008 4:37:00 PM

On Tue, Apr 1, 2008 at 10:32 AM, James Dinkel <jdinkel@gmail.com> wrote:
>
> > But ultimately it depends on what you want to do with the data.
>
> yeah, it's kinda hard to describe without just posting my entire script,
> which I doubt people will want to read.
>
> The data will be accessed by one ruby script, running on one computer.
> The data will be read in, then the file closed and done for a couple
> hours. So no concurrent access, no relations, no keeping the connection
> open for extended periods of time, which is why I thought a database
> would probably be overkill and just add overhead.
>
> But I didn't know if maybe reading a file into memory would take more
> effort than reading entries from a database. Also, I was a little off
> on the numbers, I meant to say that there are hundreds of names per
> category, so total names could be over a thousand. That size will
> likely never ever change beyond +/- 100 at the most.
>
> Thanks for the info. I'm really a newb at this, so any thoughts on
> storing data using any of these methods is helpful.
>
> James.

I'm going to slightly disagree with Lionel -- and also Robert -- on
this one. First of all, a database is not necessarily just for
concurrency. It's for data integrity and allows the ability to build
reports on that data that you can trust because of the strict nature
of the underlying data store (I'm talking about RDBMS, but I've kept
my eyes open about OO databases as well; stay away from Pick,
though!!).

Here's the problem with relational databases (RDBMSs), though: it's
hard to model a hierarchy (which you can pull off somewhat clumsily
with XML).

If you are not going to do serious queries and inserts on the db, and
your data isn't complex, then a flat file approach might work. It
works, after all, for software builds. I strongly recommend against
it in higher-level languages, though, even for small apps. And, no, I am
not a database vendor.

I always tell people they should learn SQL, but nowadays I'm getting the
cold shoulder, especially from OO people :)

The other important thing that I've noticed about data and storage is:
what do you want to do with it and how often? Store it, query it (and
how), add to it, move it around, archive it, etc. These are important
factors to consider.

Todd

Kyle Schmitt

4/1/2008 4:39:00 PM

Oh wait, Lionel already suggested that.

Kyle Schmitt

4/1/2008 4:44:00 PM

Don't forget: you could put the data into a hash, and marshal it to
disk. Not a DB, but better than a flat file!
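
A minimal sketch of that idea, with hypothetical helper names `save_marshaled` and `load_marshaled`. Marshal output is binary, so the file is written and read in binary mode.

```ruby
# Dump any marshalable object (here, a Hash of category => names) to disk.
def save_marshaled(path, data)
  File.binwrite(path, Marshal.dump(data))
end

# Read it back. Only do this with files your own program wrote:
# Marshal.load can instantiate arbitrary objects from untrusted input.
def load_marshaled(path)
  Marshal.load(File.binread(path))
end
```

Usage would look like `save_marshaled("names.dat", categories)` once a day and `load_marshaled("names.dat")` on each read, where `names.dat` is just an example file name.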

Lionel Bouton

4/1/2008 4:52:00 PM

Todd Benson wrote:

> I'm going to slightly disagree with Lionel -- and also Robert -- on
> this one. First of all, a database is not necessarily just for
> concurrency. It's for data integrity

Yes, I agree (as explained below, concurrency is what I consider the main
problem to solve to enforce data integrity). That said, if you write your
data in one pass as the OP does, you don't need data integrity in the
storage layer... rename is atomic: you either renamed the temp file to
its final position before a crash or you didn't.

The problem is partial updates, where you need to maintain consistency.
And off the top of my head, the only problems with partial updates are:
- concurrent accesses (most common, counting both concurrent read and
write accesses),
- crashes (fortunately less common, and they can even be addressed by
backups in many cases).

This is why I disagree with people wanting to push all the consistency
logic into the application layer on database-backed applications with
concurrent access (as is often advocated for Rails). It's simply not
doable without recoding the whole concurrent access manager and
log-based/MVCC/... crash resistance of the database in the application
layer (good luck with that).

Lionel.

Todd Benson

4/1/2008 5:14:00 PM

On Tue, Apr 1, 2008 at 11:55 AM, Lionel Bouton
<lionel-subscription@bouton.name> wrote:
> Todd Benson wrote:
>
> > I'm going to slightly disagree with Lionel -- and also Robert -- on
> > this one. First of all, a database is not necessarily just for
> > concurrency. It's for data integrity
>
> Yes, I agree (as explained below, concurrency is what I consider the main
> problem to solve to enforce data integrity). That said, if you write your
> data in one pass as the OP does, you don't need data integrity in the
> storage layer... rename is atomic: you either renamed the temp file to
> its final position before a crash or you didn't.
>
> The problem is partial updates, where you need to maintain consistency.
> And off the top of my head, the only problems with partial updates are:
> - concurrent accesses (most common, counting both concurrent read and
> write accesses),
> - crashes (fortunately less common, and they can even be addressed by
> backups in many cases).
>
> This is why I disagree with people wanting to push all the consistency
> logic into the application layer on database-backed applications with
> concurrent access (as is often advocated for Rails). It's simply not
> doable without recoding the whole concurrent access manager and
> log-based/MVCC/... crash resistance of the database in the application
> layer (good luck with that).
>
> Lionel.

Maybe we are talking about different things. By data integrity, I
mean you can be certain not just that the data was entered correctly,
but also that it coincides with the relationships present. In a
modified version of the OP's model, for example...

Cat1
-name1
-name2
-name3
-etc...

Cat2
-name4
-name5
-name6
-etc...

Cat3
-name1
-name2
-name3
-etc...

Note the same names, but in different categories.

Now, surely, you can say, "Well, the application logic will take care
of that ambiguity." But I say we should continue to separate
application logic from data logic.

I'm no CS guy, so I don't know the correct terms for this, but I do
see the potential pitfalls.

There certainly is a time and place for this, but I've generally found
its usefulness to be limited.
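
For comparison, this is roughly what the application-side version of that check looks like when the data is a plain Hash of category => names. It is only a sketch with a hypothetical method name, `duplicated_names`; in an RDBMS the same rule would typically be expressed as a uniqueness constraint on the name column instead.

```ruby
# Returns a Hash of name => [categories] for every name that is listed
# under more than one category.
def duplicated_names(categories)
  owners = Hash.new { |h, k| h[k] = [] }
  categories.each do |category, names|
    names.each { |name| owners[name] << category }
  end
  owners.select { |_name, cats| cats.uniq.size > 1 }
end
```

Running it before the daily rewrite and refusing to write when the result is non-empty would catch the ambiguous case in the example above.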

Todd

Todd Benson

4/1/2008 5:20:00 PM

On Tue, Apr 1, 2008 at 12:13 PM, Todd Benson <caduceass@gmail.com> wrote:
>
> Maybe we are talking about different things. By data integrity, I
> mean you can be certain not just that the data was entered correctly,
> but also that it coincides with the relationships present. In a
> modified version of the OP's model, for example...
>
>
> Cat1
> -name1
> -name2
> -name3
> -etc...
>
> Cat2
> -name4
> -name5
> -name6
> -etc...
>
> Cat3
> -name1
> -name2
> -name3
> -etc...
>
> Note the same names, but in different categories.
>
> Now, surely, you can say, "Well, the application logic will take care
> of that ambiguity." But I say we should continue to separate
> application logic from data logic.
>
> I'm no CS guy, so I don't know the correct terms for this, but I do
> see the potential pitfalls.
>
> There certainly is a time and place for this, but I've generally found
> its usefulness to be limited.
>
> Todd
>

Sorry Lionel; missed the OP's "But each name will go in only one
category". I do still think it wouldn't be that bad to use a DB.

Todd