comp.lang.c++

dealing with large csv files

brzak

11/30/2008 2:04:00 AM

hi

I was hoping for a few pointers on how to best go about
processing large csv files.

The files:
a typical file has 100K-500K records
approx 150 chars per line (10 'fields')
so file sizes of 15MB-75MB.

The processing:
summarise numerical fields based on conditions applied to other
fields

the resulting summary is a table;
the column headers of which are the list of unique values of one
field
the row headers are decided upon by conditions on other fields
(this may include lookups, exclusions, reclassifications)

Taking into account the size of the files, and number of operations
required on each record...

What, if any, of these considerations do I need to take into account:

-is the file read in line by line / or in one go?
+if in one go, there would be issues with available memory?
+if it's line by line, is there a significant difference in time
taken to process? (i.e. from my limited personal experience with
VBA, reading/writing a cell at a time in a spreadsheet is far
slower than reading/writing in 'batches')
+or would it be an idea to read a limited number in one go?
e.g. deal with 20,000 at a time in memory
i suppose this question demonstrates a lack of experience with C++
but hey, that's why i'm posting in the learner's forum :)

-however much of the file is read, is it worth writing a bespoke solution
or looking for a parser/class that's been written for csv files?
+perhaps there is a module that i can import?
+since the csv files are *supposed* to be of a standard format, would
there be much to gain in writing something specific to this - this
would be done with the aim of reducing processing time

-data types... should i read the value fields as floating point
numbers (range approx. +/- 500000.00)
+will using floating point data types save memory?


As anyone reading would be able to tell, I'm still quite new to this
language, and am missing some of the basics which I've had a bit of
trouble locating solutions to.

Any advice would be much appreciated!

Brz
6 Answers

LR

11/30/2008 6:32:00 AM

brzak wrote:
> hi
>
> I was hoping for a few pointers on how to best go about
> processing large csv files.

I don't think that you've given enough information about your particular
problem, but I'll give it a try.

>
> The files:
> a typical file has 100K-500K records
> approx 150 chars per line (10 'fields')
> so file sizes of 15MB-75MB.
>
> The processing:
> summarise numerical fields based on conditions applied to other
> fields
>
> the resulting summary is a table;
> the column headers of which are the list of unique values of one
> field
> the row headers are decided upon by conditions on other fields
> (this may include lookups, exclusions, reclassifications)

Do you mean a hierarchical file, or some other structure?


>
> Taking into account the size of the files, and number of operations
> required on each record...
>
> What, if any, of these considerations do I need to take into account:
>
> -is the file read in line by line / or in one go?

I'd probably try to do it line by line first and get the processing
right, and then if you need to, switch to all in one go.


> +if in one go, there would be issues with available memory?

There will be issues no matter how you write your code, but the question
is whether you have enough memory to store the entire text, plus the
space required for the variables you're going to manipulate.


> +if it's line by line, is there a significant difference in time
> taken to process? (i.e. from my limited personal experience with
> VBA, reading/writing a cell at a time in a spreadsheet is far
> slower than reading/writing in 'batches')

Very system dependent. And "batch" size dependent. You'll have to try it
and see.


> +or would it be an idea to read a limited number in one go?
> e.g. deal with 20,000 at a time in memory

Depends on what your data is like and what you have to do with it.

> i suppose this question demonstrates a lack of experience with C++
> but hey, that's why i'm posting in the learner's forum :)
>
> -however much of the file is read, is it worth writing a bespoke
> solution
> or look for a parser/class that's been written for csv files?

Reading in csv files isn't difficult. You do mean comma separated
values? What have you tried so far?


> +perhaps there is a module that i can import?

Have you looked at std::getline? You can use that not only to read a
line at a time, but once you have a line in a std::string to easily get
values that are separated by commas using a std::istringstream. Unless
you have quoted fields with commas in them. In which case you might be
more interested in something that can deal with regular expressions.
You might want to look at www.boost.org, or perhaps, if your compiler has
support for it, TR1. I did a quick google(tm) and found this, but I
don't know how good it is: http://www.johndcook.com/cpp_...

But I think that std::getline should be something to take a look at first.
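
To make that concrete, here is a minimal sketch of the std::getline /
std::istringstream approach described above (no handling of quoted fields
containing commas; "data.csv" is just a placeholder name):

#include <fstream>
#include <sstream>
#include <string>
#include <vector>
#include <iostream>

int main()
{
    std::ifstream in("data.csv");             // hypothetical input file
    std::string line;

    while (std::getline(in, line)) {          // one record per line
        std::istringstream ss(line);
        std::vector<std::string> fields;
        std::string field;
        while (std::getline(ss, field, ','))  // split the line on commas
            fields.push_back(field);

        // ... apply the conditions to 'fields' and update the summary here ...
        std::cout << "read " << fields.size() << " fields\n";
    }
}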

> +since the csv files are *supposed* to be of a standard format,

What standard are they following? And do you suspect they don't follow a
standard?

> would there be much to gain in writing something specific to this - this
> would be done with the aim of reducing processing time

Have you tried this already and found that your processing time is too long?

>
> -data types... should i read the value fields as floating point
> numbers (range approx. +/- 500000.00)

Probably that or double. I think double would be better. But as you
point out...

> +will using floating point data types save memory?

Yes, well probably, but I suspect it will take more time to use float.
System dependent. But I don't know. Best to benchmark if you're
concerned. Also, I don't recall, but I think that sizeof(float) can be
equal to sizeof(double), although I can't think of a system where it is.

I think that most problems have a space time trade off. You may have to
decide which one is more important to you.


> As anyone reading would be able to tell, I'm still quite new to this
> language, and am missing some of the basics which I've had a bit of
> trouble locating solutions to.
>
> Any advice would be much appreciated!

Sounds like you might need a good book. You might want to check out the
reviews at www.accu.org.

LR

James Kanze

11/30/2008 10:14:00 AM

On Nov 30, 3:04 am, brzak <brz...@gmail.com> wrote:

> I was hoping for a few pointers on how to best go about
> processing large csv files.

> The files:
>     a typical file has 100K-500K records
>     approx 150 chars per line (10 'fields')
>     so file sizes of 15MB-75MB.

> The processing:
>     summarise numerical fields based on conditions applied to other
> fields

>     the resulting summary is a table;
>       the column headers of which are the list of unique values of one
> field
>       the row headers are decided upon by conditions on other fields
>         (this may include lookups, exclusions, reclassifications)

> Taking into account the size of the files, and number of operations
> required on each record...

> What, if any, of these considerations do I need to take into account:

>   -is the file read in line by line / or in one go?
>     +if in one go, there would be issues with available memory?
>     +if it's line by line, is there a significant difference in time
>          taken to process? (i.e. from my limited personal experience with
>          VBA, reading/writing a cell at a time in a spreadsheet is far
>          slower than reading/writing in 'batches')
>     +or would it be an idea to read a limited number in one go?
>       e.g. deal with 20,000 at a time in memory
>   i suppose this question demonstrates a lack of experience with C++
>    but hey, that's why i'm posting in the learner's forum :)

For such a small file, it probably doesn't matter. Reading one
character at a time might, or even unit buffered, but otherwise,
the buffering in ifstream should be largely adequate. If you do
find that I/O is a bottleneck, you can try memory mapping
the file, but there's no guarantee that that will improve
anything.
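
(For reference, a bare-bones POSIX-only sketch of memory mapping a file
looks something like this; the file name is a placeholder, and whether it
buys anything is exactly the open question above.)

// POSIX only: map the whole file read-only and scan it as one big buffer.
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main()
{
    int fd = open("data.csv", O_RDONLY);              // hypothetical input file
    if (fd == -1) { std::perror("open"); return 1; }

    struct stat sb;
    if (fstat(fd, &sb) == -1) { std::perror("fstat"); close(fd); return 1; }

    char* data = static_cast<char*>(
        mmap(0, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
    if (data == MAP_FAILED) { std::perror("mmap"); close(fd); return 1; }

    // ... walk data[0] .. data[sb.st_size - 1], splitting on '\n' and ',' ...

    munmap(data, sb.st_size);
    close(fd);
}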

>   -however much of the file is read, is it worth writing a bespoke
> solution
>    or look for a parser/class that's been written for csv files?
>     +perhaps there is a  module that i can import?
>     +since the csv files are *supposed* to be of a standard format,
> would
>          there be much to gain in writing something specific to this - this
>          would be done with the aim of reducing processing time

If you can find a generalized CSV parser, use it. I suspect,
however, that CSV is so simple that most people just do it by
hand; if you know up front which fields contain what types, it's
a lot easier (and faster).
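
A sketch of that "by hand" idea, assuming (purely for illustration) a known
layout where the first two fields are text and the third is numeric; the
field names here are invented:

#include <sstream>
#include <string>
#include <cstdlib>

struct Record {
    std::string account;    // field 1 (hypothetical name)
    std::string category;   // field 2 (hypothetical name)
    double      amount;     // field 3, numeric
};

// Returns false if the line doesn't have at least the three expected fields.
bool parseLine(const std::string& line, Record& rec)
{
    std::istringstream ss(line);
    std::string amountText;
    if (!std::getline(ss, rec.account, ','))  return false;
    if (!std::getline(ss, rec.category, ',')) return false;
    if (!std::getline(ss, amountText, ','))   return false;
    rec.amount = std::strtod(amountText.c_str(), 0);
    return true;
}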

>   -data types... should i read the value fields as floating point
>    numbers (range approx. +/- 500000.00)
>     +will using floating point data types save memory?

Compared to what? A float takes less space than a double,
but again, we're talking about a fairly small data set, so it
probably doesn't matter.

--
James Kanze (GABI Software) email:james.kanze@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Erik Wikström

11/30/2008 10:27:00 AM

On 2008-11-30 03:04, brzak wrote:
> hi
>
> I was hoping for a few pointers on how to best go about
> processing large csv files.
>
> The files:
> a typical file has 100K-500K records
> approx 150 chars per line (10 'fields')
> so file sizes of 15MB-75MB.
>
> The processing:
> summarise numerical fields based on conditions applied to other
> fields
>
> the resulting summary is a table;
> the column headers of which are the list of unique values of one
> field
> the row headers are decided upon by conditions on other fields
> (this may include lookups, exclusions, reclassifications)
>
> Taking into account the size of the files, and number of operations
> required on each record...

It all very much depends on what kind of operations you want to perform,
and whether you can do the operations "on the fly" or if you need access
to the whole table before you begin.

> What, if any, of these considerations do I need to take into account:
>
> -is the file read in line by line / or in one go?

If memory usage is important, read line by line; if performance is more
important, then you probably want to do it in one go. You might also
want to consider memory mapping the file.
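
For the read-it-in-one-go option, a minimal sketch (slurping the whole file
into a single std::string and splitting it up in memory afterwards; the
file name is a placeholder):

#include <fstream>
#include <iterator>
#include <string>
#include <iostream>

int main()
{
    std::ifstream in("data.csv", std::ios::binary);   // hypothetical input file

    // Read the entire file into one string in a single pass.
    std::string contents((std::istreambuf_iterator<char>(in)),
                         std::istreambuf_iterator<char>());

    // Records can then be split on '\n', and fields on ',', entirely in memory.
    std::cout << "read " << contents.size() << " bytes\n";
}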

> +if in one go, there would be issues with available memory?

Depends on your target environment; 75MB is not very much on a modern
desktop or server, but for other targets it might be.

> +if it's line by line, is there a significant difference in time
> taken to process? (i.e. from my limited personal experience with
> VBA, reading/writing a cell at a time in a spreadsheet is far
> slower than reading/writing in 'batches')

Not necessarily, but I'd say probably.

> +or would it be an idea to read a limited number in one go?
> e.g. deal with 20,000 at a time in memory

Since it's hard to know where the content of a field starts and ends
without some processing, it is probably not possible to read anything
other than full lines (one or more) without risking getting only half
the contents of a field.

> i suppose this question demonstrates a lack of experience with C++
> but hey, that's why i'm posting in the learner's forum :)
>
> -however much of the file is read, is it worth writing a bespoke
> solution
> or look for a parser/class that's been written for csv files?

It's really not that difficult to write it yourself, and if you do you
can customise it to your needs. Just a tip, though: try not to limit
yourself by hardcoding the number of columns or anything like that.

> -data types... should i read the value fields as floating point
> numbers (range approx. +/- 500000.00)
> +will using floating point data types save memory?

Float will probably save you memory; the question is whether it is worth
the loss in precision.

> As anyone reading would be able to tell, I'm still quite new to this
> language, and am missing some of the basics which I've had a bit of
> trouble locating solutions to.

If performance is critical you should profile your solution to see where
the bottlenecks are. You might also check out the difference in
performance between C++ streams and the C I/O functions; some
implementations of the C++ standard library have poor stream performance.

--
Erik Wikström

Daniel T.

11/30/2008 2:38:00 PM

brzak <brzaak@gmail.com> wrote:

> I was hoping for a few pointers on how to best go about
> processing large csv files.

I think you have gotten about as much advice as we can give already, but
I'll add my 2p just in case. :-)

> The files:
> a typical file has 100K-500K records
> approx 150 chars per line (10 'fields')
> so file sizes of 15MB-75MB.
>
> The processing:
> summarise numerical fields based on conditions applied to other
> fields

Other fields in the same column, or other fields in different columns?
I'm assuming the former...

> What, if any, of these considerations do I need to take into account:
>
> -is the file read in line by line / or in one go?
> +if in one go, there would be issues with available memory?
> +if it's line by line, is there a significant difference in time
> taken to process? (i.e. from my limited personal experience with
> VBA, reading/writing a cell at a time in a spreadsheet is far
> slower than reading/writing in 'batches')
> +or would it be an idea to read a limited number in one go?
> e.g. deal with 20,000 at a time in memory
> i suppose this question demonstrates a lack of experience with C++
> but hey, that's why i'm posting in the learner's forum :)

Even if you read the file line by line, the underlying code is
probably buffering large chunks of the file (i.e., if you write your
code to do the second option above, the system will probably actually be
using the third option). It's possible that the first option would be
faster, but it depends on what your hardware consists of.

I'd say use the second option.

> -however much of the file is read, is it worth writing a bespoke
> solution
> or look for a parser/class that's been written for csv files?
> +perhaps there is a module that i can import?
> +since the csv files are *supposed* to be of a standard format,
> would
> there be much to gain in writing something specific to this - this
> would be done with the aim of reducing processing time

This is the kind of problem that teachers often use in class to teach
C++, so I doubt you will find public libraries that parse them.

That said, http://www.trumphurst.com/cpp... is your friend.

> -data types... should i read the value fields as floating point
> numbers (range approx. +/- 500000.00)
> +will using floating point data types save memory?

Read the fields in whatever format makes the most sense. Are they
integral values that will fit in an 'int'? Then read them in as ints.
Are they fixed point types (like money)? Then read them in using a fixed
point/money class. Otherwise use double.
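
For the fixed-point case, one common trick is to store the values as
integer hundredths; a small sketch (the two-decimal assumption is mine,
not something given in the thread):

#include <cmath>
#include <iostream>
#include <sstream>
#include <string>

// Parse a value such as "-1234.56" into integer hundredths (-123456),
// assuming at most two decimal places in the input.
long toHundredths(const std::string& field)
{
    std::istringstream ss(field);
    double value = 0.0;
    ss >> value;
    return static_cast<long>(std::floor(value * 100.0 + 0.5));  // round to nearest
}

int main()
{
    std::cout << toHundredths("-1234.56") << '\n';   // prints -123456
}

With the range mentioned in the original post (approx. +/- 500000.00), the
hundredths fit comfortably in a long.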

Raymond Häb

12/1/2008 3:45:00 AM

brzak wrote:
> I was hoping for a few pointers on how to best go about
> processing large csv files.
>
> The files:
> a typical file has 100K-500K records
> approx 150 chars per line (10 'fields')
> so file sizes of 15MB-75MB.
> [...]

I just want to add a pointer to boost::spirit, boost::xpressive
and the file_iterator:

http://www.boost.org/doc/libs/1_37_0/libs/spirit/classic/...
http://www.boost.org/doc/libs/1_37_0/libs/spirit/classic/doc/file_ite...
http://www.boost.org/doc/libs/1_37_0/doc/html/xpre...

But maybe learning all this is "overkill" for such an easy task. But if
you are frequently confronted to similar problems you can produce very
efficient solutions with it (both for development time and runtime (but
sadly not compile time)).


Raymond

Boogie

12/1/2008 12:01:00 PM

On Dec 1, 4:45 am, Raymond Häb <ray.h...@gmx.de> wrote:
> I just want to add a hint to boost::spirit, boost::xpressive
> and the fileiterator
>
> But maybe learning all this is "overkill" for such an easy task. But if
> you are frequently confronted to similar problems you can produce very
> efficient solutions with it (both for development time and runtime (but
> sadly not compile time)).
>
> Raymond

I think that would be overkill.
I would suggest:

Tokenizer for parsing (read a line of text and then tokenize it):
http://www.boost.org/doc/libs/1_37_0/libs/tokenizer/escaped_list_sep...

Lexical cast for string to float (double) conversion:
http://www.boost.org/doc/libs/1_37_0/libs/conversion/lexica...

Easy to use, and doesn't require writing a lot of code ;)
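
A minimal sketch of that combination, assuming Boost is installed (the
file name and the "sum the numeric fields" step are just illustrations):

#include <boost/tokenizer.hpp>
#include <boost/lexical_cast.hpp>
#include <fstream>
#include <iostream>
#include <string>

int main()
{
    std::ifstream in("data.csv");   // hypothetical input file
    std::string line;

    // escaped_list_separator handles commas, double quotes and backslash
    // escapes, which the plain getline splitting earlier in the thread does not.
    typedef boost::tokenizer<boost::escaped_list_separator<char> > Tokenizer;

    while (std::getline(in, line)) {
        Tokenizer tok(line);

        double sum = 0.0;
        for (Tokenizer::iterator it = tok.begin(); it != tok.end(); ++it) {
            try {
                sum += boost::lexical_cast<double>(*it);   // numeric field
            } catch (boost::bad_lexical_cast&) {
                // non-numeric field: skipped in this sketch
            }
        }
        std::cout << "numeric total for this line: " << sum << '\n';
    }
}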

Boogie