comp.lang.ruby

*Fast* way to process large files line by line

Devesh Agrawal

11/15/2006 7:21:00 PM

Hi Folks,

I am using ruby to analyse a huge (around 60G) amount of my networking
experiment data. Let me briefly describe my technique: I have to read
around 40 files (of around 1.5G each) named f1, f2, ... Each file fi
contains traceroutes to lots of destinations at different times, i.e. a
file is basically a list of traceroutes launched from a given src (src =
filename) at different times. I want to get a structure like the
following: (list of all traceroutes from *all* src's at time 1), (list
of all traceroutes from *all* src's at time 2), and so on.

For this I am using the following pseudocode:

outputfile = File.open(output_path, "wb")      # outputfile.open
files = filenames.map { |fn| File.open(fn) }   # open all files f1..fn
h = Hash.new { |hash, t| hash[t] = [] }

until files.all? { |f| f.eof? }
  files.each do |f|
    next if f.eof?
    line = f.readline
    p = parse(line)     # parse the line, and get a structure P out of it
    h[p.time] << p      # put P into a hashtable: H[P.time] << P

    # if H has more than k keys (ie it has become very large),
    # dump it to disk in time order and drop the dumped entries
    if h.size > k
      h.keys.sort.each do |t|
        outputfile << Marshal.dump(h[t])
        h.delete(t)
      end
    end
  end
end
files.each { |f| f.close }   # close all files
outputfile.close

//Btw I can't use an array instead of a hashtable H, as the P.time
values read across all files needn't be the same.

This is performing miserably SLOW. I have the following questions:

i. How fast is f.readline? I want to use the maximum buffering
possible for the largest speed gains. In Ruby, how do I set the buffer
size? I looked through io.c, and it seems that readline essentially uses
getc (stopping when it gets a newline). How can I set the buffer size
for the underlying libc FILE*? Oh, btw, each line is approx 200-400 bytes.

ii. Marshal.dump is also very slow. Is there an alternative? YAML is
even worse.

iii. Is it bad to have around 40-50 files open at the same time?

iv. The program does use a lot of memory, but not that much: it uses
around 30-40 percent of the RAM on a 1G machine. So I think paging
in/out is not a problem.

v. Would coding the readline part in C using rubyinline offer me speed
advantages?

vi. I am thinking of trying the following to reduce the time it takes;
I would very much welcome your comments:

a. Remove Marshal.dump [I don't need to strictly serialize objects,
only dump the data and read it back] and replace it with some string
form which is more compact. Actually, is it possible to have something
like fixed-length structures as in C? For example, I would want P to be
like this: struct P { char foo[100]; int a[100]; }. This way I think
the IO would be faster, as I could just dump a fixed number of bytes to
a file.

b. Try to reduce the memory consumption of this by reducing k further,
so that the program doesn't page in/out.

c. Can someone point me to some good sample code for reading a file
line by line in C and then putting it into a ruby hashtable?
d. How much of the slowness is due to the fact that it is ruby and not
C?

To give you an idea of how slow this actually is: just reading all the
files line by line takes around 8-9 hrs, whereas the above easily takes
5-6 days! And I am quite unable to run a profiler on my code as it is
just too slow.

I would be very grateful for your comments, and particularly if you have
any suggestions/experience on doing this in a fast way.

--Devesh Agrawal



--
Posted via http://www.ruby-....


Farrel Lifson

11/15/2006 7:27:00 PM


On 15/11/06, Devesh Agrawal <dagrawal@cs.umass.edu> wrote:
> Hi Folks,
>
> I am using ruby to analyse a huge (around 60G) amount of my networking
> experiment data. Let me briefly describe my technique: I have to read
> around 40 files (of around 1.5G each) named f1,f2 ... .Each file fi
> contains traceroutes to lots of destinations at different times. I.E a
> file is basically a list of traceroutes launched from a given src (src =
> filename) launched at diff times. I want to get a structure like
> following: (list of all traceroutes from *all* src's at time 1), (list
> of all traceroutes from *all* src's at time 2)... and so on.
>
> For this I am using the following psuedocode:
>
> outputfile.open
> open all files f1..fn
> while (!(all files have eof))
> (f1..fn).each{|f|
> next if f.eof
> line = f.readline
> parse the line, and get a structure P out of it
> put P into a hashtable: H[P.time] << P
>
> check for eof conditions on f
>
> if (H has more than k keys ? (ie has it become very large))
> H.keys.sort{|t|
> outputfile << Marshal.dump(H[t])
> H.delete(t)
> }
> end
> }
> end
> close all files
>
> //Btw I can't use an array instead of a hashtable H, as the P.time's
> read across all files needn't be same.
>
> This is performing miserbly SLOW. I have the following questions:
>
> i. How fast is f.readline ?. I want to use the maximum buffering
> possible for largest speed gains. In ruby how do I set the buffer size.
> I looked through io.c, and it seems that readline essentially uses getc
> (stopping when it gets a newline). How can I set the buffer size for the
> underlying libc FILE* ? Oh btw, each line is approx 200-400 bytes.
>
> ii. Marshal.dump is also very slow. Is there an alternative, Yaml is
> even worse.
>
> iii. Is it bad to have around 40-50 files opened at the same time ?.
>
> iv. The program does use a lot of memory but not so much, around 30-40
> pc of 1G ram machine is used by it. So I think paging in/out is not a
> problem.
>
> v. Would coding the realine part in C using rubyinline offer me speed
> advantages ?
>
> vi. I am thinking of trying the following to reduce the time it takes,
> I would very much welcome your comments:
>
> a. Remove Marshal.dump [I don't need to strictly serialize objects,
> only dump the data and read it back] and replace it with some string
> form which is more compact. Actually is it possible to have something
> like fixed length structures like in C: Example I would want P to be
> like this: Struct P{ char foo[100], int a[100]} ?. So this way I think
> the IO would be faster as I could just dump a fixed number of bytes to a
> file.
>
> b. Try to reduce the memory consumption of this by reducing k further
> so as the program doesn't page in/out.
>
> c. Can someone point me to a good sample code for reading a file line
> by line in C and then putting it into a ruby hashtable ?.
> d. How much of the slowness is due to the fact that it is ruby and not
> C ?
>
> To give you an idea of how slow this is actually: Just reading all the
> files
> line by line takes around 8-9 hrs. Whereas the above thing easily takes
> 5-6
> days !!. And I am quite unable to run profile on my code as it is just
> too slow.
>
> I would be very grateful for your comments, and particularly if you have
> any suggestions/experience on doing this in a fast way.
>
> --Devesh Agrawal
>
>
>
> --
> Posted via http://www.ruby-....
>
>

Could you not parallelise the processing of each file? Perhaps using
something like starfish (http://www.rufy.com/sta...)?

Farrel

Devesh Agrawal

11/15/2006 7:37:00 PM


Hi,
Farrel Lifson wrote:
> On 15/11/06, Devesh Agrawal <dagrawal@cs.umass.edu> wrote:
>>
>>
>> close all files
>> underlying libc FILE* ? Oh btw, each line is approx 200-400 bytes.
>> v. Would coding the realine part in C using rubyinline offer me speed
>> the IO would be faster as I could just dump a fixed number of bytes to a
>> To give you an idea of how slow this is actually: Just reading all the
>>
>>
>>
>> --
>> Posted via http://www.ruby-....
>>
>>
>
> Could you not parrallelise the processing of each file? Perhaps using
> something like starfish (http://www.rufy.com/sta...)?

Did you mean parallelizing across multiple files or parallelizing the
processing of one file?

Yes and No. But this involves me getting deeper into a description of my
problem:

Each file has traceroutes at lots of times t1, t2, ... The objective is
to collect all the traceroutes that happened at a given time into one
structure. Hence I could do something like this: using ruby threads (or
whatever), read each file and store it into one *common* hashtable, and
then sync the hashtable to the disk once it has grown large enough.

I was more hoping for some kind of fast way to do readlines, using say
mmap or something like that. I read a few posts about how using mmap
helped someone else. I will look into this starfish; I rejected ruby
threads as, unlike pthreads, they aren't really true threads.

Thanks for replying. Is there something wrong or inherently slow with
the things I am doing? Can it be sped up?

> Farrel


--
Posted via http://www.ruby-....

Farrel Lifson

11/15/2006 7:54:00 PM


On 15/11/06, Devesh Agrawal <dagrawal@cs.umass.edu> wrote:
> Hi,
> Farrel Lifson wrote:
> > On 15/11/06, Devesh Agrawal <dagrawal@cs.umass.edu> wrote:
> >>
> >>
> >> close all files
> >> underlying libc FILE* ? Oh btw, each line is approx 200-400 bytes.
> >> v. Would coding the realine part in C using rubyinline offer me speed
> >> the IO would be faster as I could just dump a fixed number of bytes to a
> >> To give you an idea of how slow this is actually: Just reading all the
> >>
> >>
> >>
> >> --
> >> Posted via http://www.ruby-....
> >>
> >>
> >
> > Could you not parrallelise the processing of each file? Perhaps using
> > something like starfish (http://www.rufy.com/sta...)?
>
> Did you mean parrallelizing across multiple files or parrallelizing the
> processing of one file ?
>
> Yes and No. But this involves me getting deeper into a description of my
> problem:
>
> Each file has traceroutes at lots of times t1,t2.... The objective is to
> collect all traceroutes that happened at that time into one structure.
> Hence I could do something like this: Using ruby threads (or whatever)
> read each file, and store it into one *common* hashtable. And then call
> the syncing of the hashtable to the disk incase it has grown large
> enough.
>
> I was more hoping for some kind of a fast way to readlines, using say
> mmap or something like that. I read a few posts about how using mmap
> helped someone else. I will look into this starfish, I rejected ruby
> threads as unlike pthreads they weren't really true threads.
>
> Thanks for replying. Is there something wrong or inherently slow with
> the things I am doing ?. Can It be speeded up ?.
>
> > Farrel
>
>
> --
> Posted via http://www.ruby-....
>
>

I think this
if (H has more than k keys ? (ie has it become very large))
H.keys.sort{|t|
outputfile << Marshal.dump(H[t])
H.delete(t)
}
end
can just be changed to
if (H has more than k keys ?)
outputfile << Marshal.dump(H)
end
H = {}

Farrel

Farrel Lifson

11/15/2006 7:55:00 PM


On 15/11/06, Farrel Lifson <farrel.lifson@gmail.com> wrote:
> On 15/11/06, Devesh Agrawal <dagrawal@cs.umass.edu> wrote:
> > Hi,
> > Farrel Lifson wrote:
> > > On 15/11/06, Devesh Agrawal <dagrawal@cs.umass.edu> wrote:
> > >>
> > >>
> > >> close all files
> > >> underlying libc FILE* ? Oh btw, each line is approx 200-400 bytes.
> > >> v. Would coding the realine part in C using rubyinline offer me speed
> > >> the IO would be faster as I could just dump a fixed number of bytes to a
> > >> To give you an idea of how slow this is actually: Just reading all the
> > >>
> > >>
> > >>
> > >> --
> > >> Posted via http://www.ruby-....
> > >>
> > >>
> > >
> > > Could you not parrallelise the processing of each file? Perhaps using
> > > something like starfish (http://www.rufy.com/sta...)?
> >
> > Did you mean parrallelizing across multiple files or parrallelizing the
> > processing of one file ?
> >
> > Yes and No. But this involves me getting deeper into a description of my
> > problem:
> >
> > Each file has traceroutes at lots of times t1,t2.... The objective is to
> > collect all traceroutes that happened at that time into one structure.
> > Hence I could do something like this: Using ruby threads (or whatever)
> > read each file, and store it into one *common* hashtable. And then call
> > the syncing of the hashtable to the disk incase it has grown large
> > enough.
> >
> > I was more hoping for some kind of a fast way to readlines, using say
> > mmap or something like that. I read a few posts about how using mmap
> > helped someone else. I will look into this starfish, I rejected ruby
> > threads as unlike pthreads they weren't really true threads.
> >
> > Thanks for replying. Is there something wrong or inherently slow with
> > the things I am doing ?. Can It be speeded up ?.
> >
> > > Farrel
> >
> >
> > --
> > Posted via http://www.ruby-....
> >
> >
>
> I think this
> if (H has more than k keys ? (ie has it become very large))
> H.keys.sort{|t|
> outputfile << Marshal.dump(H[t])
> H.delete(t)
> }
> end
> can just be changed to
> if (H has more than k keys ?)
> outputfile << Marshal.dump(H)
> end
> H = {}
>
> Farrel
>

Whoops! make that
if (H has more than k keys ?)
outputfile << Marshal.dump(H)
H={}
end
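
In plain Ruby, with h for the hashtable and k for the key threshold,
that is roughly:

  if h.size > k
    outputfile << Marshal.dump(h)
    h = {}
  end

i.e. dump the whole hash in one go and start a fresh one, instead of
sorting the keys and deleting them one by one.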

Eric Hodel

11/15/2006 8:05:00 PM



On Nov 15, 2006, at 11:21 AM, Devesh Agrawal wrote:

> Hi Folks,
>
> I am using ruby to analyse a huge (around 60G) amount of my
> networking
> experiment data. Let me briefly describe my technique: I have to read
> around 40 files (of around 1.5G each) named f1,f2 ... .Each file fi
> contains traceroutes to lots of destinations at different times. I.E a
> file is basically a list of traceroutes launched from a given src
> (src =
> filename) launched at diff times. I want to get a structure like
> following: (list of all traceroutes from *all* src's at time 1), (list
> of all traceroutes from *all* src's at time 2)... and so on.
>
> For this I am using the following psuedocode:
>
> outputfile.open
> open all files f1..fn
> while (!(all files have eof))
> (f1..fn).each{|f|
> next if f.eof
> line = f.readline
> parse the line, and get a structure P out of it
> put P into a hashtable: H[P.time] << P
>
> check for eof conditions on f
>
> if (H has more than k keys ? (ie has it become very large))
> H.keys.sort{|t|
> outputfile << Marshal.dump(H[t])
> H.delete(t)
> }
> end
> }
> end
> close all files
>
> //Btw I can't use an array instead of a hashtable H, as the P.time's
> read across all files needn't be same.
>
> This is performing miserbly SLOW. I have the following questions:

Have you profiled? Where is your time really coming from?

Repost with a profile and then we can give some real suggestions.

> i. How fast is f.readline ?. I want to use the maximum buffering
> possible for largest speed gains. In ruby how do I set the buffer
> size.
> I looked through io.c, and it seems that readline essentially uses
> getc
> (stopping when it gets a newline). How can I set the buffer size
> for the
> underlying libc FILE* ? Oh btw, each line is approx 200-400 bytes.

I seriously doubt that this is your choke-point.

> ii. Marshal.dump is also very slow. Is there an alternative, Yaml is
> even worse.

Marshal.dump is pretty fast, probably as fast as you're going to get
for a serialization format. _why did some benchmarks back in the day
and it beat out the other P languages.

That said, why are you even using it? Why not just add raw strings?

> v. Would coding the realine part in C using rubyinline offer me speed
> advantages ?

No.

(or, very unlikely)

> vi. I am thinking of trying the following to reduce the time it
> takes,
> I would very much welcome your comments:

Profile, profile, profile.

> a. Remove Marshal.dump [I don't need to strictly serialize objects,
> only dump the data and read it back] and replace it with some string
> form which is more compact. Actually is it possible to have something
> like fixed length structures like in C: Example I would want P to be
> like this: Struct P{ char foo[100], int a[100]} ?. So this way I think
> the IO would be faster as I could just dump a fixed number of bytes
> to a
> file.

Yes, do this, simpler is better.

Try #pack and #unpack.
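
For instance, with a made-up fixed layout (a 15-byte source name, a
32-bit timestamp and up to 30 hop addresses stored as 32-bit ints),
something like this gives you fixed-size records and no parsing on the
way back in:

  FORMAT = "a15NN30"              # src, time, 30 hop addresses
  RECORD_SIZE = 15 + 4 + 30 * 4   # 139 bytes per record

  def write_record(io, src, time, hops)
    padded = hops + [0] * (30 - hops.size)   # assumes <= 30 hops
    io << ([src, time] + padded).pack(FORMAT)
  end

  def read_record(io)
    src, time, *hops = io.read(RECORD_SIZE).unpack(FORMAT)
    [src.delete("\0"), time, hops]
  end

Reading it back is then just io.read(RECORD_SIZE) in a loop.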

> b. Try to reduce the memory consumption of this by reducing k
> further so as the program doesn't page in/out.

You already said it isn't paging...

> c. Can someone point me to a good sample code for reading a file
> line by line in C and then putting it into a ruby hashtable ?.

No. Profile, profile, profile.

> d. How much of the slowness is due to the fact that it is ruby
> and not C ?

We can't tell you without a profile. Profile, profile, profile.

> To give you an idea of how slow this is actually: Just reading all the
> files line by line takes around 8-9 hrs. Whereas the above thing
> easily takes
> 5-6 days !!. And I am quite unable to run profile on my code as it
> is just
> too slow.

Lies.

Use a reduced dataset with ruby-prof or zenprofile.

You know nothing without a profile.
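
Something along these lines, with ruby-prof and a hypothetical
process_files method wrapping your main loop:

  require 'rubygems'
  require 'ruby-prof'

  result = RubyProf.profile do
    process_files(Dir['sample/*.dat'])   # a small subset of the data
  end

  RubyProf::FlatPrinter.new(result).print(STDOUT)

The flat profile will tell you whether the time is going into IO,
parsing or Marshal.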

> I would be very grateful for your comments, and particularly if you
> have
> any suggestions/experience on doing this in a fast way.

Profile it, you can't make sane changes without one.

--
Eric Hodel - drbrain@segment7.net - http://blog.se...
This implementation is HODEL-HASH-9600 compliant

http://trackmap.rob...



Farrel Lifson

11/15/2006 8:08:00 PM


On 15/11/06, Devesh Agrawal <dagrawal@cs.umass.edu> wrote:
> Hi,
> Farrel Lifson wrote:
> > On 15/11/06, Devesh Agrawal <dagrawal@cs.umass.edu> wrote:
> >>
> >>
> >> close all files
> >> underlying libc FILE* ? Oh btw, each line is approx 200-400 bytes.
> >> v. Would coding the realine part in C using rubyinline offer me speed
> >> the IO would be faster as I could just dump a fixed number of bytes to a
> >> To give you an idea of how slow this is actually: Just reading all the
> >>
> >>
> >>
> >> --
> >> Posted via http://www.ruby-....
> >>
> >>
> >
> > Could you not parrallelise the processing of each file? Perhaps using
> > something like starfish (http://www.rufy.com/sta...)?
>
> Did you mean parrallelizing across multiple files or parrallelizing the
> processing of one file ?
>
> Yes and No. But this involves me getting deeper into a description of my
> problem:
>
> Each file has traceroutes at lots of times t1,t2.... The objective is to
> collect all traceroutes that happened at that time into one structure.
> Hence I could do something like this: Using ruby threads (or whatever)
> read each file, and store it into one *common* hashtable. And then call
> the syncing of the hashtable to the disk incase it has grown large
> enough.
>
> I was more hoping for some kind of a fast way to readlines, using say
> mmap or something like that. I read a few posts about how using mmap
> helped someone else. I will look into this starfish, I rejected ruby
> threads as unlike pthreads they weren't really true threads.
>
> Thanks for replying. Is there something wrong or inherently slow with
> the things I am doing ?. Can It be speeded up ?.
>
> > Farrel
>
>
> --
> Posted via http://www.ruby-....
>
>

Also try running your code with some sample data through the ruby
profiler (just run 'ruby -rprofile yourcode.rb') and it should give
you an idea where your program is spending its time.

Farrel

Jano Svitok

11/15/2006 9:09:00 PM


On 11/15/06, Devesh Agrawal <dagrawal@cs.umass.edu> wrote:
> Hi,
> Farrel Lifson wrote:
> > On 15/11/06, Devesh Agrawal <dagrawal@cs.umass.edu> wrote:
> >>
> >>
> >> close all files
> >> underlying libc FILE* ? Oh btw, each line is approx 200-400 bytes.
> >> v. Would coding the realine part in C using rubyinline offer me speed
> >> the IO would be faster as I could just dump a fixed number of bytes to a
> >> To give you an idea of how slow this is actually: Just reading all the
> >>
> >>
> >>
> >> --
> >> Posted via http://www.ruby-....
> >>
> >>
> >
> > Could you not parrallelise the processing of each file? Perhaps using
> > something like starfish (http://www.rufy.com/sta...)?
>
> Did you mean parrallelizing across multiple files or parrallelizing the
> processing of one file ?
>
> Yes and No. But this involves me getting deeper into a description of my
> problem:
>
> Each file has traceroutes at lots of times t1,t2.... The objective is to
> collect all traceroutes that happened at that time into one structure.
> Hence I could do something like this: Using ruby threads (or whatever)
> read each file, and store it into one *common* hashtable. And then call
> the syncing of the hashtable to the disk incase it has grown large
> enough.
>
> I was more hoping for some kind of a fast way to readlines, using say
> mmap or something like that. I read a few posts about how using mmap
> helped someone else. I will look into this starfish, I rejected ruby
> threads as unlike pthreads they weren't really true threads.
>
> Thanks for replying. Is there something wrong or inherently slow with
> the things I am doing ?. Can It be speeded up ?.
>
> > Farrel

Maybe it's possible to preprocess the files with grep or something
similar (like splitting into hour-long slices, etc.), using ruby just
for the sorting and merging - that way you could make it parallel, and
besides, grep&co are supposed to be much faster.

Robert Klemme

11/15/2006 9:24:00 PM


Devesh Agrawal wrote:
> I am using ruby to analyse a huge (around 60G) amount of my networking
> experiment data. Let me briefly describe my technique: I have to read
> around 40 files (of around 1.5G each) named f1,f2 ... .Each file fi
> contains traceroutes to lots of destinations at different times. I.E a
> file is basically a list of traceroutes launched from a given src (src =
> filename) launched at diff times. I want to get a structure like
> following: (list of all traceroutes from *all* src's at time 1), (list
> of all traceroutes from *all* src's at time 2)... and so on.

> //Btw I can't use an array instead of a hashtable H, as the P.time's
> read across all files needn't be same.
>
> This is performing miserbly SLOW. I have the following questions:

First I have a question: why do you read those files in parallel in the
first place?

> i. How fast is f.readline ?. I want to use the maximum buffering
> possible for largest speed gains. In ruby how do I set the buffer size.
> I looked through io.c, and it seems that readline essentially uses getc
> (stopping when it gets a newline). How can I set the buffer size for the
> underlying libc FILE* ? Oh btw, each line is approx 200-400 bytes.
>
> ii. Marshal.dump is also very slow. Is there an alternative, Yaml is
> even worse.

No, Marshal is actually pretty fast. It may be due to the other IO you
do or because of the data you write.

> iii. Is it bad to have around 40-50 files opened at the same time ?.

No, but reading from all those files in /parallel/ is. It is of course
platform dependent how the IO subsystem deals with that, but chances are
that the disk heads have to move back and forth between all the files.

> iv. The program does use a lot of memory but not so much, around 30-40
> pc of 1G ram machine is used by it. So I think paging in/out is not a
> problem.

It is better to not believe but know that paging is not an issue.

> v. Would coding the realine part in C using rubyinline offer me speed
> advantages ?

Very unlikely.

> vi. I am thinking of trying the following to reduce the time it takes,
> I would very much welcome your comments:
>
> a. Remove Marshal.dump [I don't need to strictly serialize objects,
> only dump the data and read it back] and replace it with some string
> form which is more compact. Actually is it possible to have something
> like fixed length structures like in C: Example I would want P to be
> like this: Struct P{ char foo[100], int a[100]} ?. So this way I think
> the IO would be faster as I could just dump a fixed number of bytes to a
> file.

Don't do the writing to a temp file while you are reading. This puts
even more burden on your IO subsystem. Btw, what filesystem do you use?
You don't happen to be on Suse with ReiserFS?

> b. Try to reduce the memory consumption of this by reducing k further
> so as the program doesn't page in/out.

Without /knowing/ that paging is an issue this does not make sense.

> c. Can someone point me to a good sample code for reading a file line
> by line in C and then putting it into a ruby hashtable ?.
> d. How much of the slowness is due to the fact that it is ruby and not
> C ?

Here's my assessment: you do not have a programming language but a
design problem. Reading from multiple large files at the same time is
inefficient.

> To give you an idea of how slow this is actually: Just reading all the
> files
> line by line takes around 8-9 hrs.

You need 8 hours just for reading 60GB? That's 2.1MB/s - this seems
unrealistically slow. Is this old hardware? Do you have drive
problems? Are there any other disk intensive tasks going on (a busy DB
or web server) on that machine?

> Whereas the above thing easily takes
> 5-6
> days !!. And I am quite unable to run profile on my code as it is just
> too slow.

Clearly.

> I would be very grateful for your comments, and particularly if you have
> any suggestions/experience on doing this in a fast way.

Here's my advice: rewrite the program to read those files sequentially
because that is likely faster on most systems. Remove the temp file
writing while files are read. And find out why your IO system is
performing so badly.
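
A rough sketch of what I mean, assuming a parse method that returns
objects with a time accessor and that one file's worth of grouped data
fits in memory (otherwise flush in chunks between reads, not during
them):

  filenames.each do |fn|
    by_time = Hash.new { |h, t| h[t] = [] }

    # read this file start to finish, with nothing else touching the disk
    File.open(fn) do |f|
      f.each_line do |line|
        p = parse(line)
        by_time[p.time] << p
      end
    end

    # only now write this file's groups out, in time order
    File.open("#{fn}.by_time", "wb") do |out|
      by_time.keys.sort.each { |t| out << Marshal.dump([t, by_time[t]]) }
    end
  end

A final cheap merge pass over the per-file outputs then gives you the
grouping by time that you want.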

Kind regards

robert

devesh

11/15/2006 10:35:00 PM


Hi Everyone, Thanks for replying.

First a couple of stupid things that I am doing, then a couple more
questions.

Stupid things:
i. I parse each line, but that essentially amounts to splitting it and
such. So maybe I would be better off using scanf kind of stuff and also
compiling regexps once and using them for each line. I heard that helps
(see the sketch after this list).

ii. Going thru my hash H to delete keys is rather stupid. A better way
is to just dump it and trash it.

iii. When I said it takes 8-9 hours, I meant 8-9 hours to read each
line and parse it (with my rather inefficient parse function). And I
did this one file at a time, in sequence.

iv. I will definitely try to profile my code. I did use -rprofile, but
that kind of revealed the obvious: most of my time was spent in the
file-reading loop > marshalling > parsing. Plus it was rather slow for
even 100MB of stuff.
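
Re point i, this is the kind of thing I mean by compiling the regexp
once (the line format here is made up, mine is different):

  # built once, outside the per-line loop
  LINE_RE = /\A(\S+)\s+(\d+)\s+(.*)/

  def parse(line)
    return nil unless LINE_RE =~ line
    [$1, $2.to_i, $3]   # src, timestamp, rest of the traceroute
  end

instead of building the pattern (or calling split with one) on every
single line.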

Oh and btw, the disk I am using is an LVM mapped ext3 local disk. The
server runs Centos. And I think it is pretty well managed.

Now my questions:

i. I was earlier doing this: read each file in sequence, and then
write out a similar hash, containing data from one src for each time.
Even with this hash, I used to dump it to temporary files (one per hash
key = time instant), and after doing this for every file (src), I
would then merge all the data in the temporary files into what I want.
This was slow (but in retrospect not as slow as what I am doing now by
reading in parallel). The reason I thought opening multiple files and
reading from them line by line would be faster is that I wouldn't have
to open/close all those thousands of temporary files.
I still don't see why doing things in parallel would screw things up,
as my assumption is that the disk head is pretty wild anyway, since the
system involves a lot of IO between context switches.

ii. Also, why do you think that writing another file in the midst of
reading (one or more) others is a bad idea? The only way I can avoid
this is to chunk the files up into smaller units and then process them,
writing their results onto temp files and then proceeding further with
the other files. Which one do you think is the better thing to do?

I will now try the following (based on your suggestions):
i. Avoid heavyweight regexps in parsing lines
ii. Avoid marshaling/dumping
iii. Focus on doing things a file at a time.
iv. If all else fails, I will try using something like starfish etc.

Btw, will using something like an mmap extension for ruby speed things
up for me?

Thanks a lot to all of you. I am sorry if I sound lame, but I am pretty
new to using ruby.



Paul Lutus

11/15/2006 11:50:00 PM


devesh wrote:

/ ...

> I still don't see the reason why doing things in parrallel would screw
> things up ?

If you read from more than one file simultaneously, the drive heads must
constantly slew back and forth from the location of one file to another.
The more files open, the worse this becomes. Files are laid out on a drive
in a logical way, and the data should be read in that same logical way --
one file at a time.

Imagine shopping for groceries and washing your car simultaneously. A lot of
moving back and forth to do them at once, don't you think? Maybe washing
your car, then shopping, would be a better use of your time.

> As my assumption is that the disk head is anyway pretty
> wild as the system will involve a lot of IO b/w context switches.

No matter how bad it is, you can always make it worse by opening multiple
files.
>
> ii. Also why do you think that writing another file in the midst of
> reading (one or more) one is a bad idea ?.

Same reason -- disk thrashing.

--
Paul Lutus
http://www.ara...