devesh
11/15/2006 10:35:00 PM
Hi Everyone, Thanks for replying.
First, a couple of stupid things that I am doing, then a couple more
questions.
Stupid things:
i. I parse each line, but that essentially amounts to splitting it and
such. So maybe I would be better off using scanf-style parsing, and also
compiling my regexps once and reusing them for every line; I hear that
helps. (There's a rough sketch of what I mean right after this list.)
ii. Going through my hash H to delete keys one by one is rather stupid. A
better way is to just throw the whole hash away and let it be garbage
collected.
iii. When I said it takes 8-9 hours, I meant 8-9 hours to read every
line and parse it (with my rather inefficient parse function). And I did
this one file at a time, in sequence.
iv. I will definitely try to profile my code. I did use -rprofile, but
that mostly revealed the obvious: most of my time was spent in the
file-reading loop > marshalling > parsing. Plus the profiler was rather
slow even on 100MB of data.
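
Here is roughly what I mean for (i) -- just a sketch, since my real
traceroute lines are more involved; the line format and field names below
are made up:

# Sketch only: pretend each line looks like "<time> <dst_ip> <hop_count> ..."
LINE_RE = /\A(\d+)\s+(\S+)\s+(\d+)/   # compiled once, reused for every line

def parse_line(line)
  if m = LINE_RE.match(line)
    [m[1].to_i, m[2], m[3].to_i]      # [time, dst, hops]
  end
end

# ... instead of building a fresh regexp (or split-ing and cleaning up)
# inside the loop for every one of the hundreds of millions of lines.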
Oh, and btw, the disk I am using is an LVM-mapped ext3 local disk. The
server runs CentOS, and I think it is pretty well managed.
Now my questions:
i. I was earlier doing this: read each file in sequence, and for each
file build a similar hash containing the data from that one src, keyed by
time. I would dump that hash to temporary files (one per hash key = time
instant), and after doing this for every file (src), I would merge all the
data in the temporary files into what I want. This was slow (though in
retrospect not as slow as what I am doing now by reading in parallel). The
reason I thought opening multiple files and reading from them line by line
would be faster is that I wouldn't have to open/close all those thousands
of temporary files.
I still don't see why doing things in parallel would screw things up. My
assumption is that the disk head is going to jump around anyway, since the
system does a lot of IO between context switches.
ii. Also, why do you think that writing another file in the middle of
reading one (or more) is a bad idea? The only way I can avoid this is to
chunk the input files into smaller units, process each chunk, write its
results to temp files, and only then move on to the other files. Which of
the two do you think is better? (A rough sketch of the sequential scheme
follows right after this question.)
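
Just to make (ii) concrete, this is the sequential, one-file-at-a-time
scheme I am comparing against -- only a sketch; parse_line and the file
names are placeholders for my real code:

require 'fileutils'

SOURCES = Dir["f*"]          # the ~40 per-source trace files
TMP_DIR = "tmp_buckets"
FileUtils.mkdir_p(TMP_DIR)

SOURCES.each do |src|
  buckets = Hash.new { |h, k| h[k] = [] }
  File.open(src) do |f|
    f.each_line do |line|
      time, rest = parse_line(line)   # placeholder: time + rest of the record
      buckets[time] << rest
    end
  end
  # flush this source's data only after its file is closed, so reads and
  # writes are not interleaved on the same disk
  buckets.each do |time, records|
    File.open(File.join(TMP_DIR, "t_#{time}"), "a") do |out|
      records.each { |r| out.puts r }
    end
  end
end
# afterwards each tmp_buckets/t_<time> file holds the data from *all*
# src's for that time instant and can be merged one file at a time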
I will now try the following (based on your suggestions):
i. Avoid heavyweight regexps when parsing lines.
ii. Avoid marshalling/dumping, maybe replacing it with a more compact
fixed-width record format (rough sketch right after this list).
iii. Focus on doing things one file at a time.
iv. If all else fails, I will try using something like sharkfish etc.
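
For (ii) above, what I have in mind instead of Marshal is something like
fixed-width packed records -- only a sketch, the field widths are
invented:

# Each record becomes a fixed number of bytes, so dumping and reading it
# back is a plain byte copy instead of Marshal/YAML.
RECORD_FMT  = "A16NN"        # 16-byte dst string, two 32-bit unsigned ints
RECORD_SIZE = 16 + 4 + 4

def write_record(io, dst, time, hops)
  io.write([dst, time, hops].pack(RECORD_FMT))
end

def read_record(io)
  raw = io.read(RECORD_SIZE) or return nil
  raw.unpack(RECORD_FMT)     # => [dst, time, hops]
end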
Btw, will using something like an mmap extension for Ruby speed things
up for me?
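
And in case mmap doesn't pan out, another thing I might try is reading
each file in big chunks and splitting the lines myself, instead of one
readline call per line -- again just a sketch, the 4MB chunk size is a
guess:

CHUNK = 4 * 1024 * 1024

def each_line_chunked(path)
  File.open(path) do |f|
    leftover = ""
    while chunk = f.read(CHUNK)
      lines = (leftover + chunk).split("\n", -1)
      leftover = lines.pop              # possibly incomplete last line
      lines.each { |line| yield line }
    end
    yield leftover unless leftover.empty?
  end
end

each_line_chunked("f1") { |line| parse_line(line) }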
Thanks a lot to all of you. I am sorry if I sound lame, but I am pretty
new to Ruby.
On Nov 15, 4:24 pm, Robert Klemme <shortcut...@googlemail.com> wrote:
> Devesh Agrawal wrote:
> > I am using ruby to analyse a huge (around 60G) amount of my networking
> > experiment data. Let me briefly describe my technique: I have to read
> > around 40 files (of around 1.5G each) named f1,f2 ... .Each file fi
> > contains traceroutes to lots of destinations at different times. I.E a
> > file is basically a list of traceroutes launched from a given src (src =
> > filename) launched at diff times. I want to get a structure like
> > following: (list of all traceroutes from *all* src's at time 1), (list
> > of all traceroutes from *all* src's at time 2)... and so on.
> > //Btw I can't use an array instead of a hashtable H, as the P.time's
> > read across all files needn't be same.
>
> > This is performing miserably SLOW. I have the following questions:
>
> First I have a question: why do you read those files in parallel in the
> first place?
>
> > i. How fast is f.readline ?. I want to use the maximum buffering
> > possible for largest speed gains. In ruby how do I set the buffer size.
> > I looked through io.c, and it seems that readline essentially uses getc
> > (stopping when it gets a newline). How can I set the buffer size for the
> > underlying libc FILE* ? Oh btw, each line is approx 200-400 bytes.
>
> > ii. Marshal.dump is also very slow. Is there an alternative, Yaml is
> > even worse.
>
> No, Marshal is actually pretty fast. It may be due to the other IO you
> do or because of the data you write.
>
> > iii. Is it bad to have around 40-50 files opened at the same time ?.
>
> No, but reading from all those files in /parallel/ is. It is of course
> platform dependent how the IO subsystem deals with that but chances are
> that the disk heads have to move back and forth between all the files.
>
> > iv. The program does use a lot of memory but not so much, around 30-40
> > pc of 1G ram machine is used by it. So I think paging in/out is not a
> > problem.
>
> It is better to not believe but know that paging is not an issue.
>
> > v. Would coding the readline part in C using rubyinline offer me speed
> > advantages ?
>
> Very unlikely.
>
> > vi. I am thinking of trying the following to reduce the time it takes,
> > I would very much welcome your comments:
>
> > a. Remove Marshal.dump [I don't need to strictly serialize objects,
> > only dump the data and read it back] and replace it with some string
> > form which is more compact. Actually is it possible to have something
> > like fixed length structures like in C: Example I would want P to be
> > like this: Struct P{ char foo[100], int a[100]} ?. So this way I think
> > the IO would be faster as I could just dump a fixed number of bytes to a
> > file.
>
> Don't do the writing in a temp file while you are reading. This poses
> even more burden on your IO subsystem. Btw, what filesystem do you use?
> You're not happening to be on Suse with ReiserFS?
>
> > b. Try to reduce the memory consumption of this by reducing k further
> > so as the program doesn't page in/out.
>
> Without /knowing/ that paging is an issue this does not make sense.
>
> > c. Can someone point me to a good sample code for reading a file line
> > by line in C and then putting it into a ruby hashtable ?.
> > d. How much of the slowness is due to the fact that it is ruby and not
> > C ?
>
> Here's my assessment: you do not have a programming language problem but a
> design problem. Reading from multiple large files at the same time is
> inefficient.
>
> > To give you an idea of how slow this is actually: Just reading all the
> > files line by line takes around 8-9 hrs.
>
> You need 8 hours just for reading 60GB? That's 2.1MB/s - this seems
> unrealistically slow. Is this old hardware? Do you have drive
> problems? Are there any other disk intensive tasks going on (a busy DB
> or web server) on that machine?
>
> > Whereas the above thing easily takes 5-6
> > days !!. And I am quite unable to run profile on my code as it is just
> > too slow.
>
> Clearly.
>
> > I would be very grateful for your comments, and particularly if you have
> > any suggestions/experience on doing this in a fast way.
>
> Here's my advice: rewrite the program to read those files sequentially
> because that is likely faster on most systems. Remove the temp file
> writing while files are read. And find out why your IO system is
> performing so badly.
>
> Kind regards
>
> robert