comp.lang.c++

File Processing

Jeff

9/30/2008 6:45:00 PM

Hello

I want to read and process and rewrite a very large disk based file
(>3Gbytes) as quickly as possible.
The processing effectively involves finding certain strings and replacing
them with other strings of equal length such that the file size is unaltered
(the file is uncompressed btw). I wondered if anyone could advise me of the
best way to do this and also of things to avoid. More specifically I was
wondering :-

-Is it best to open a single file for read-write access and overwrite the
changed bytes or would it be better to create a new file?
-Is there any point in buffering bytes in rather than reading one byte at a
time or does this just defeat the buffering that's done by the OS anyway?
-Would this benefit from multi-threading - read, process, write?

And finally could anyone point me to any sample code which already does this
sort of thing in the fastest possible way?

Many Thanks
Jeff


6 Answers

Victor Bazarov

9/30/2008 7:35:00 PM


Jeff wrote:
> I want to read and process and rewrite a very large disk based file
> (>3Gbytes) as quickly as possible.
> The processing effectively involves finding certain strings and replacing
> them with other strings of equal length such that the file size is unaltered
> (the file is uncompressed btw). I wondered if anyone could advise me of the
> best way to do this and also of things to avoid. More specifically I was
> wondering :-
>
> -Is it best to open a single file for read-write access and overwrite the
> changed bytes or would it be better to create a new file?

It is always a good idea to leave the old file intact, unless you
somehow can ensure that a single write operation will never fail and
that an incomplete set of find/replace operations is still OK. Ask in
any database development newsgroup.
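
Schematically, the safe variant looks something like this (a sketch
only; the 1 MB chunk size is arbitrary, and process_chunk() is a
made-up stand-in for your find/replace step, not anything you
specified):

#include <cstddef>
#include <cstdio>     // std::rename
#include <fstream>
#include <vector>

// Placeholder for the find-and-replace step; equal-length replacement
// means the output is exactly the same size as the input.
void process_chunk(std::vector<char>& buf, std::size_t n)
{
    (void)buf; (void)n;
}

bool rewrite_via_copy(const char* src, const char* tmp)
{
    std::ifstream in(src, std::ios::binary);
    std::ofstream out(tmp, std::ios::binary);
    if (!in || !out) return false;

    std::vector<char> buf(1 << 20);               // 1 MB at a time
    for (;;) {
        in.read(&buf[0], static_cast<std::streamsize>(buf.size()));
        std::size_t got = static_cast<std::size_t>(in.gcount());
        if (got == 0) break;
        process_chunk(buf, got);
        out.write(&buf[0], static_cast<std::streamsize>(got));
    }
    out.close();
    if (!out) return false;   // write failed: the original is untouched
    // On POSIX, rename() replaces the target atomically; on Windows it
    // fails if the target exists, so remove the original first there.
    return std::rename(tmp, src) == 0;
}

The point is that the original file is never modified until the very
last step, so a failure anywhere leaves it intact.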

> -Is there any point in buffering bytes in rather than reading one byte at a
> time or does this just defeat the buffering that's done by the OS anyway?

You'd have to experiment. The C++ language does not define any
buffering as far as the OS is concerned.

> -Would this benefit from multi-threading - read, process, write?

Unlikely. Processing will take so little time compared to the I/O, and
I/O is going to be the bottleneck anyway, so...

> [..]

V
--
Please remove capital 'A's when replying by e-mail
I do not respond to top-posted replies, please don't ask

James Kanze

10/1/2008 9:13:00 AM


On Sep 30, 9:35 pm, Victor Bazarov <v.Abaza...@comAcast.net> wrote:
> Jeff wrote:
> > I want to read and process and rewrite a very large disk based file
> > (>3Gbytes) as quickly as possible.
> > The processing effectively involves finding certain strings and replacing
> > them with other strings of equal length such that the file size is unaltered
> > (the file is uncompressed btw). I wondered if anyone could advise me of the
> > best way to do this and also of things to avoid. More specifically I was
> > wondering :-

> > -Is it best to open a single file for read-write access and overwrite the
> > changed bytes or would it be better to create a new file?

> It is always a good idea to leave the old file intact, unless you
> somehow can ensure that a single write operation will never fail and
> that an incomplete set of find/replace operations is still OK. Ask in
> any database development newsgroup.

This is generally true, but he said a "very large" file. I'd
have some hesitations about making a copy if the file size were,
say, 100 Gigabytes.

As always, you have to weigh the trade offs. Making a copy is
certainly a safer solution, if you can afford it.

> > -Is there any point in buffering bytes in rather than
> > reading one byte at a time or does this just defeat the
> > buffering that's done by the OS anyway?

> You'd have to experiment. The C++ language does not define any
> buffering as far as the OS is concerned.

C++ does define buffering in iostreams. But the fastest
solution will almost certainly involve platform specific
requests. I'd probably start by using mmap on a Unix system, or
CreateFileMapping/MapViewOfFile under Windows. If performance
is really an issue, he'll probably have to experiment with
different solutions, but I'd be surprised if anything was
significantly faster than using a memory mapped file, modified
in place.
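
Schematically, something like this (a bare-bones POSIX sketch, most
error handling elided; 'from' and 'to' stand in for whatever strings
he's replacing and are assumed to have equal length):

#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Replace every occurrence of 'from' with the equal-length 'to',
// directly in the mapped file. Needs free address space >= file size.
int replace_in_place(const char* path, const char* from, const char* to)
{
    const size_t n = std::strlen(from);       // == std::strlen(to)
    int fd = open(path, O_RDWR);
    if (fd < 0) return -1;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return -1; }

    char* p = static_cast<char*>(mmap(0, st.st_size,
                  PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));
    if (p == MAP_FAILED) { close(fd); return -1; }

    char* end = p + st.st_size;
    for (char* q = p; n != 0 && q + n <= end; ) {
        // Scan only positions where a full match could still fit.
        q = static_cast<char*>(std::memchr(q, from[0], end - q - n + 1));
        if (q == 0) break;
        if (std::memcmp(q, from, n) == 0) {
            std::memcpy(q, to, n);            // same length: size unchanged
            q += n;
        } else {
            ++q;
        }
    }

    msync(p, st.st_size, MS_SYNC);            // force the write-back
    munmap(p, st.st_size);
    close(fd);
    return 0;
}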

But of course, as you pointed out above, this solution doesn't
provide transactional integrity. And it only works if the
process has enough available address space to map the file.
(Probably no problem on a 64 bit processor, but likely not the
case on 32 bit one.)

> > -Would this benefit from multi-threading - read, process, write?

> Unlikely. Processing will take so little time compared to the
> I/O, and I/O is going to be the bottleneck anyway, so...

If he uses memory mapping, the system will take care of all of
the IO behind his back anyway. Otherwise, some sort of
asynchronous I/O can sometimes improve performance.
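
The shape of the read-ahead idea, as a sketch (written with C++11's
std::async purely for brevity; the era-appropriate equivalents would
be aio_read() or Windows overlapped I/O, and process() is an
illustrative placeholder):

#include <cstddef>
#include <fstream>
#include <future>
#include <utility>
#include <vector>

void process(std::vector<char>& buf, std::size_t n) { (void)buf; (void)n; }

std::size_t read_chunk(std::ifstream& in, std::vector<char>& buf)
{
    in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
    return static_cast<std::size_t>(in.gcount());
}

void pipeline(const char* path)
{
    std::ifstream in(path, std::ios::binary);
    std::vector<char> cur(1 << 20), nxt(1 << 20);

    std::size_t n = read_chunk(in, cur);
    while (n > 0) {
        // Start reading the next chunk while this one is processed.
        std::future<std::size_t> pending =
            std::async(std::launch::async, read_chunk,
                       std::ref(in), std::ref(nxt));
        process(cur, n);
        n = pending.get();        // wait for the read-ahead
        std::swap(cur, nxt);      // vectors swap pointers, no copy
    }
}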

--
James Kanze (GABI Software) email:james.kanze@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

jacek.dziedzic

10/1/2008 12:25:00 PM


On Sep 30, 8:44 pm, "Jeff" <some...@somewhere.com> wrote:
> Hello
>
> I want to read and process and rewrite a very large disk based file
> (>3Gbytes) as quickly as possible.
> The processing effectively involves finding certain strings and replacing
> them with other strings of equal length such that the file size is unaltered
> (the file is uncompressed btw).  I wondered if anyone could advise me of the
> best way to do this and also of things to avoid. More specifically I was
> wondering :-
>
> -Is it best to open a single file for read-write access and overwrite the
> changed bytes or would it be better to create a new file?

Are you asking about performance or safety? As Victor pointed out
already, it's always safer to work on a copy. Performance-wise,
overwriting the bytes in the one file you have will be way faster
than copying the file.

> -Is there any point in buffering bytes in rather than reading one byte at a
> time or does this just defeat the buffering that's done by the OS anyway?

There is. If you intend to issue 3000000000 read() calls to read a
3 GB file, one byte at a time, you're wasting quite a lot of time
doing the calls. Reading in, say, 1 MB chunks would make it faster,
although it complicates looking for the strings (chunk boundaries) --
see the sketch below.
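
Roughly like this (just a sketch; the names and the 1 MB figure are
arbitrary, and mapping window offsets back to file offsets is left
out): keep the last pattern-length-minus-one bytes of each chunk and
prepend them to the next, so a string straddling the boundary still
lands inside one window.

#include <algorithm>
#include <cstddef>
#include <cstring>
#include <fstream>
#include <string>
#include <vector>

// Report each occurrence of pat inside the window [data, data + n).
void scan_buffer(const char* data, std::size_t n, const std::string& pat)
{
    const char* p = data;
    const char* end = data + n;
    while ((p = std::search(p, end, pat.begin(), pat.end())) != end) {
        // match found at window offset (p - data)
        ++p;
    }
}

void scan_file(const char* path, const std::string& pat)
{
    if (pat.empty()) return;
    std::ifstream in(path, std::ios::binary);
    const std::size_t chunk = 1 << 20;         // 1 MB reads
    const std::size_t carry = pat.size() - 1;  // overlap between windows
    std::vector<char> buf(carry + chunk);
    std::size_t have = 0;                      // bytes carried over

    for (;;) {
        in.read(&buf[have], static_cast<std::streamsize>(chunk));
        std::size_t got = static_cast<std::size_t>(in.gcount());
        if (got == 0) break;
        std::size_t window = have + got;
        scan_buffer(&buf[0], window, pat);
        // The carried tail is shorter than pat, so no match can lie
        // entirely inside it -- nothing is ever reported twice.
        std::size_t keep = window < carry ? window : carry;
        std::memmove(&buf[0], &buf[0] + (window - keep), keep);
        have = keep;
    }
}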

> -Would this benefit from multi-threading - read, process, write?

Not to any significant degree, unless you're doing a *lot* of
processing to find the strings you need (like complex regexen or
such). Very likely you're way I/O-bound here.

> And finally could anyone point me to any sample code which already does this
> sort of thing in the fastest possible way?

No, but I would strongly advise you to look into memory-mapped I/O,
if your system supports it. This is not portable in the C++ sense,
and hence OT for this newsgroup, but it is most likely the fastest
you can get, and -- as a bonus -- you avoid all read() and write()
calls, and need no buffering. Google for the mmap() call.

HTH,
- J.

James Kanze

10/1/2008 7:44:00 PM


On Oct 1, 2:24 pm, jacek.dzied...@gmail.com wrote:
> On Sep 30, 8:44 pm, "Jeff" <some...@somewhere.com> wrote:

> No, but I would strongly advise you to look into memory-mapped
> I/O, if your system supports it. This is not portable in C++
> sense, and hence OT for this newsgroup, but it is most likely
> the fastest you can get, and -- as a bonus -- you avoid all
> read() and write() calls, and need no buffering. Google for
> the mmap() call.

While it's true that mmap is usually faster than naïve file
handling, the buffering, reading and writing are still there.
The only difference is that it's the OS which takes care of them
(with a bit of help from the hardware), and not you. Typically,
*IF* you're a real expert, and you're willing to invest a lot of
time and effort, you can do better for any specific use.
Typically, not much better, however, and typically, you're not a
real expert (the real experts are busy implementing the code in
the OS), and the slight gains you get aren't worth the cost.

--
James Kanze (GABI Software) email:james.kanze@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

AnonMail2005@gmail.com

10/2/2008 2:43:00 AM


On Sep 30, 2:44 pm, "Jeff" <some...@somewhere.com> wrote:
> Hello
>
> I want to read and process and rewrite a very large disk based file
> (>3Gbytes) as quickly as possible.
> The processing effectively involves finding certain strings and replacing
> them with other strings of equal length such that the file size is unaltered
> (the file is uncompressed btw).  I wondered if anyone could advise me of the
> best way to do this and also of things to avoid. More specifically I was
> wondering :-
>
> -Is it best to open a single file for read-write access and overwrite the
> changed bytes or would it be better to create a new file?
> -Is there any point in buffering bytes in rather than reading one byte at a
> time or does this just defeat the buffering that's done by the OS anyway?
> -Would this benefit from multi-threading - read, process, write?
>
> And finally could anyone point me to any sample code which already does this
> sort of thing in the fastest possible way?
>
> Many Thanks
> Jeff

First cut, I would look into unix text processing tools like grep
and sed. Why reinvent the wheel? Also, these tools are available for
use in non-unix environments like the PC.

HTH

Jeff

10/2/2008 9:55:00 AM


Thanks a million for the very helpful replies.

I'm still experimenting, but I already found that I can make
significant (>10x) improvements in speed by reading the file in
buffered chunks rather than byte by byte.

Jeff