James Kanze
11/30/2008 10:14:00 AM
On Nov 30, 3:04 am, brzak <brz...@gmail.com> wrote:
> I was hoping for a few pointers on how to best go about
> processing large csv files.
> The files:
> a typical file has 100K-500K records
> approx 150 chars per line (10 'fields')
> so file sizes of 15MB-75MB.
> The processing:
> summarise numerical fields based on conditions applied to other
> fields
> the resulting summary is a table;
> the column headers of which are the list of unique values of one
> field
> the row headers are decided upon by conditions on other fields
> (this may include lookups, exclusions, reclassifications)
> Taking into account the size of the files, and the number of
> operations required on each record...
> Which, if any, of these considerations do I need to take into account:
> -is the file read in line by line / or in one go?
> +if in one go, there would be issues with available memory?
> +if it's line by line, is there a significant difference in time
> taken to process? (i.e. from my limited personal experience with
> VBA, reading/writing a cell at a time in a spreadsheet is far
> slower than reading/writing in 'batches')
> +or would it be an idea to read a limited number in one go?
> e.g. deal with 20,000 at a time in memory
> I suppose this question demonstrates a lack of experience with C++,
> but hey, that's why I'm posting in the learner's forum :)
For such a small file, it probably doesn't matter. Reading one
character at a time might matter, as might unit-buffered I/O, but
otherwise the buffering in ifstream should be largely adequate.
If you do find that I/O is a bottleneck, you can try memory
mapping the file, but there's no guarantee that that will improve
anything.
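
Something like the following is usually all it takes (a minimal
sketch; the file name "data.csv" and the per-record processing are
placeholders you'd fill in):

    #include <fstream>
    #include <iostream>
    #include <string>

    int main()
    {
        std::ifstream in("data.csv");   // placeholder file name
        if (!in) {
            std::cerr << "cannot open data.csv\n";
            return 1;
        }
        std::string line;
        while (std::getline(in, line)) {
            // summarize the record here; ifstream's own buffering
            // makes this line-by-line loop reasonably fast
        }
        return 0;
    }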
> -however much of the file is read, is it worth writing a bespoke
> solution, or looking for a parser/class that's been written for
> csv files?
> +perhaps there is a module that I can import?
> +since the csv files are *supposed* to be of a standard format,
> would there be much to gain in writing something specific to
> this? This would be done with the aim of reducing processing time
If you can find a generalized CSV parser, use it. I suspect,
however, that CSV is so simple that most people just do it by
hand; if you know up front which fields contain what types, it's
a lot easier (and faster).
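
By hand, the splitting can be as simple as this (a sketch that
assumes no quoted fields with embedded commas; handling full CSV
quoting takes more care):

    #include <sstream>
    #include <string>
    #include <vector>

    // Split one CSV record into its fields at the commas.
    std::vector<std::string> splitCsv(std::string const& line)
    {
        std::vector<std::string> fields;
        std::istringstream src(line);
        std::string field;
        while (std::getline(src, field, ',')) {
            fields.push_back(field);
        }
        if (!line.empty() && line[line.size() - 1] == ',') {
            fields.push_back("");   // keep a trailing empty field
        }
        return fields;
    }

Once the fields are strings, converting the numeric ones with
strtod (or an istringstream) and accumulating into a std::map
keyed on the grouping field gives you the summary table.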
> -data types... should I read the value fields as floating point
> numbers (range approx. +/- 500000.00)?
> +will using floating point data types save memory?
Compared to what? A float takes less space than a double, but
again, we're talking about a fairly small data set, so it
probably doesn't matter.
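
For what it's worth (a minimal sketch; the exact sizes are
implementation-defined, though 4 and 8 bytes are typical, and a
float's roughly 7 significant decimal digits are marginal for
values around +/-500000.00 if the cents must survive exactly):

    #include <iostream>

    int main()
    {
        std::cout << "sizeof(float)  = " << sizeof(float)  << '\n';
        std::cout << "sizeof(double) = " << sizeof(double) << '\n';

        // With typical IEEE 754 floats, the nearest representable
        // value to 499999.99 is 500000.0: the cents are lost.
        float f = 499999.99f;
        std::cout.precision(10);
        std::cout << "499999.99 as float: " << f << '\n';
        return 0;
    }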
--
James Kanze (GABI Software) email:james.kanze@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34