Ian Hobson
3/11/2009 12:20:00 PM
Miroslaw Niegowski wrote:
> Hi,
> I have following log in simple format:
>
> -- START - DATE (YYYY-MM-DD HH:MM:SS.sss)
> <xml>
> ....
> </xml>
> -- END - DATE (YYYY-MM-DD HH:MM:SS.sss)
>
> -- START - DATE (YYYY-MM-DD HH:MM:SS.sss)
> <xml>
> ....
> </xml>
> -- END - DATE (YYYY-MM-DD HH:MM:SS.sss)
>
> -- START - DATE (YYYY-MM-DD HH:MM:SS.sss)
> <xml>
> ....
> </xml>
> -- END - DATE (YYYY-MM-DD HH:MM:SS.sss)
>
> -- START - DATE (YYYY-MM-DD HH:MM:SS.sss)
> <xml>
> ....
> </xml>
> -- END - DATE (YYYY-MM-DD HH:MM:SS.sss)
>
> What is the most efficient way to sort it in ruby by START DATE & TIME?
> Log size is generally smaller than 1GB...
> I was thinking of :
> 1) loading the file into memory
> 2) splitting into hash (date | xml)
> 3) sorting key values
> 4) output
>
> Is there a better way to do this ? I'm mostly concerned with memory
> consumption...
>
>
Hi, Miroslaw,
That would require lots of RAM - but should work. (If it triggers
swapping, then it will be slow).
I would read the file twice. The first time using pos to note the
position of the start of each "-- START.." line. Record the date and
position in a tag record.
Then sort these tags in to order. You might do this in RAM or write the
data to a file, sort that with a utility and read it back again.
And finally, process the tags in order, using pos= to position back to
the beginning of the section, and copy it out to the results file.
A final thought - the entries are presumably written in ending date
order by N processes. Therefore the oldest start, must be one of the
last N. If you remove that, the next oldest will again be one of the
last N. Continue by induction. Therefore, if you were to read the file
in reverse order, you could use a "heap" structure of size N to delay
writing the sections until you knew which was the oldest. This would
mean you could write an output file in reverse order of start time.
Another phase then reads this temp file, again in reverse, writing out
in the correct order.
You can't work from the beginning of the file, because a long running
process may be delayed more than N sections before it reports.
That would solve your problem with no sorting, and only holding N
sections in memory. You would need to code your own readpreviousline
routine.
Regards
Ian