comp.lang.ruby

Moving a large number of files, 1,750,000+

Sebastian Newstream

11/9/2008 5:05:00 PM

Hello fellow Rubyists!

I'm trying to impress my boss and co-workers with Ruby so we
hopefully can start to use it at work more often. I was given
the task of moving a *large* repository of images from one
source to the next. The repository consists of around 1,750,000
images and requires around 350GB of space.
I thought this would be no match for Ruby!
And while it proved no match for Ruby, it was quite a match for me. =)

I have attached the source code with this post.
Please be gentle on me, I'm quite new to Ruby. =D

So far I have run a test on my local machine and it took around 47s to
copy 4,211 items. *calculating* At this speed it would take around
13 hours to copy the whole repository. That's a lot of time.
If I present this to my co-workers I know they will instantly blame Ruby
for this, even though I am the one to blame.

My question is this: How do I speed up my application?
I reused my filehandler and skipped the printing to the console,
but it is still taking time.

Also, if anyone has previous experience handling this many files,
any kind of tips are welcome. I'm quite worried that the array
containing the paths to all the files will use too much memory.
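Since the attachment link is truncated, here is a minimal sketch of what such a copy pass might look like (the name copy_tree is made up for illustration). Find.find walks the tree lazily, yielding one path at a time, so the full list of 1,750,000 paths never has to sit in an array:

```ruby
require 'fileutils'
require 'find'

# Hypothetical sketch of the copy pass. Find.find yields each path as
# it is discovered, so memory use stays flat no matter how many files
# the repository contains.
def copy_tree(source, target)
  Find.find(source) do |path|
    rel  = path.sub(/\A#{Regexp.escape(source)}\/?/, '')
    dest = File.join(target, rel)
    if File.directory?(path)
      FileUtils.mkdir_p(dest)   # recreate the directory structure
    else
      FileUtils.cp(path, dest)  # copy one image
    end
  end
end
```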

Thanks in advance and my regards.
//Sebastian

Attachments:
http://www.ruby-...attachment/2908/eXt...

--
Posted via http://www.ruby-....

14 Answers

Robert Klemme

11/9/2008 6:13:00 PM


On 09.11.2008 18:04, Sebastian Newstream wrote:
> Hello fellow Rubyists!
>
> I'm trying to impress my boss and co-workers with Ruby so we
> hopefully can start to use it at work more often. I was given
> the task of moving a *large* repository of images from one
> source to the next. The repository consists of around 1,750,000
> images and requires around 350GB of space.

> My question is this: How do I speed up my application?
> I reused my filehandler and skipped the printing to the console,
> but it is still taking time.
>
> Also, if anyone has previous experience handling this many files,
> any kind of tips are welcome. I'm quite worried that the array
> containing the paths to all the files will use too much memory.

Sorry to disappoint you, but this amount of copying won't be really fast
regardless of programming language. You do not mention what a "source"
in your case is, what operating systems are involved and what transport
medium you intend to use (local, network). If you need to
transfer over a network, in my experience tar with a pipe works pretty
well. But no matter what you do, the slowest link will determine your
throughput: you cannot go faster than network speed or the speed at which
your "sources" can read or write.

Here's the tar variant; since you copy images I assume the data is
already compressed and does not need compression (on your favorite Unix
shell prompt):

$> ( cd "$source" && tar cf - . ) | ssh user@target "cd '$target' && tar xf -"

If you can physically move the source disk to the target host and then
do a local copy with cp -a that's probably the fastest you can go -
unless the physical move takes ages (e.g. to the moon or another remote
location).

Kind regards

robert

Randy Kramer

11/9/2008 6:36:00 PM


On Sunday 09 November 2008 01:12 pm, Robert Klemme wrote:
> Sorry to disappoint you, but this amount of copying won't be really fast
> regardless of programming language. You do not mention what a "source"
> in your case is, what operating systems are involved and what transport
> medium you intend to use (local, network). If you need to
> transfer over a network, in my experience tar with a pipe works pretty
> well. But no matter what you do, the slowest link will determine your
> throughput: you cannot go faster than network speed or the speed at which
> your "sources" can read or write.
>
> Here's the tar variant; since you copy images I assume the data is
> already compressed and does not need compression (on your favorite Unix
> shell prompt):
>
> $> ( cd "$source" && tar cf - . ) | ssh user@target "cd '$target' && tar xf -"
>
> If you can physically move the source disk to the target host and then
> do a local copy with cp -a that's probably the fastest you can go -
> unless the physical move takes ages (e.g. to the moon or another remote
> location).

I agree with Robert, but before I saw his response I did some
calculations. Assuming all the images are the same size (about 200
KB), moving 4,211 of them in 47 seconds is a data rate close to 18
MB/s--that's faster than 100 Mbit/s Ethernet can carry, not counting any
overhead due to collisions.
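Those numbers check out; a quick Ruby sketch (the 200 KB average file size is an assumption, not a measured value):

```ruby
# Back-of-envelope check of the throughput estimate above.
files   = 4211
avg_kb  = 200.0                                 # assumed average image size
seconds = 47.0

mb_per_sec  = files * avg_kb / 1024 / seconds   # ~17.5 MB/s observed
ethernet_mb = 100.0 / 8                         # 100 Mbit/s is ~12.5 MB/s raw
```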

That's pretty fast for most channels. Are you moving data from one disk
to another on the same computer? Or over a high speed connection
between two computers? What is the raw hardware speed of the
interconnect?

I wouldn't be too worried about the 13 hours, you've got a lot of data
to move.

Randy Kramer
--
I didn't have time to write a short letter, so I created a video
instead.--with apologies to Cicero, et al.

Randy Kramer

11/9/2008 9:13:00 PM


On Sunday 09 November 2008 01:35 pm, Randy Kramer wrote:
> I wouldn't be too worried about the 13 hours, you've got a lot of data
> to move.

PS: I wish I had added: since all you're doing is copying files, do it
from the CLI (as Robert suggested)--no need to involve a programming
language, which is just added overhead. Then let us know how many hours
it takes that way, for comparison.

Randy Kramer
--
I didn't have time to write a short letter, so I created a video
instead.--with apologies to Cicero, et al.

Sebastian Newstream

11/10/2008 7:53:00 AM


First of all, thanks for your quick answer!
I was a bit tired when I asked the question, so I'm sorry
for the missing information.

Robert Klemme wrote:
> On 09.11.2008 18:04, Sebastian Newstream wrote:
>> Hello fellow Rubyists!
>>
>> I'm trying to impress my boss and co-workers with Ruby so we
>> hopefully can start to use it at work more often. I was given
>> the task of moving a *large* repository of images from one
>> source to the next. The repository consists of around 1,750,000
>> images and requires around 350GB of space.
>
>> My question is this: How do I speed up my application?
>> I reused my filehandler and skipped the printing to the console,
>> but it is still taking time.
>>
>> Also, if anyone has previous experience handling this many files,
>> any kind of tips are welcome. I'm quite worried that the array
>> containing the paths to all the files will use too much memory.
>
> Sorry to disappoint you, but this amount of copying won't be really fast
> regardless of programming language. You do not mention what a "source"
> in your case is, what operating systems are involved and what transport
> medium you intend to use (local, network). If you need to
> transfer over a network, in my experience tar with a pipe works pretty
> well. But no matter what you do, the slowest link will determine your
> throughput: you cannot go faster than network speed or the speed at which
> your "sources" can read or write.

The target system I will use is a virtual Windows 2003 server with a
mounted network drive. Unfortunately I have no access to any of the
hardware, but I know there is at least a 100 Mbit Ethernet connection
between the server and the mounted disk.

>
> Here's the tar variant; since you copy images I assume the data is
> already compressed and does not need compression (on your favorite Unix
> shell prompt):
>
> $> ( cd "$source" && tar cf - . ) | ssh user@target "cd '$target' && tar xf -"

Thanks for your tips, but it's a Windows system.

>
> If you can physically move the source disk to the target host and then
> do a local copy with cp -a that's probably the fastest you can go -
> unless the physical move takes ages (e.g. to the moon or another remote
> location).

Since our company outsourced the hardware maintenance, the moon or across
the street makes no difference. =(

>
> Kind regards
>
> robert

What I meant to ask was: in what way can I change my source code to be
more efficient?
Thanks a lot for your time.
//Sebastian
--
Posted via http://www.ruby-....

Sebastian Newstream

11/10/2008 7:59:00 AM


Thank you as well, Kramer! I will try to clarify...

Randy Kramer wrote:
> On Sunday 09 November 2008 01:12 pm, Robert Klemme wrote:
>> Sorry to disappoint you, but this amount of copying won't be really fast
>> regardless of programming language. You do not mention what a "source"
>> in your case is, what operating systems are involved and what transport
>> medium you intend to use (local, network). If you need to
>> transfer over a network, in my experience tar with a pipe works pretty
>> well.
>>
>> If you can physically move the source disk to the target host and then
>> do a local copy with cp -a that's probably the fastest you can go -
>> unless the physical move takes ages (e.g. to the moon or another remote
>> location).
>
> I agree with Robert, but before I saw his response I did some
> calculations. Assuming all the images are the same size (about 200
> KB), moving 4,211 of them in 47 seconds is a data rate close to 18
> MB/s--that's faster than 100 Mbit/s Ethernet can carry, not counting any
> overhead due to collisions.
>
> That's pretty fast for most channels. Are you moving data from one disk
> to another on the same computer? Or over a high speed connection
> between two computers? What is the raw hardware speed of the
> interconnect?

I know it is a very rough estimate, and the test I performed was on
my MacBook Pro, from one folder to another. Of course when I run this
live the environment will be very different. I just wanted to estimate
a minimum time for the copy.

>
> I wouldn't be too worried about the 13 hours, you've got a lot of data
> to move.
>
> Randy Kramer

--
Posted via http://www.ruby-....

Sebastian Newstream

11/10/2008 8:11:00 AM


Randy Kramer wrote:
> On Sunday 09 November 2008 01:35 pm, Randy Kramer wrote:
>> I wouldn't be too worried about the 13 hours, you've got a lot of data
>> to move.
>
You're probably right. I will start the job on a Friday evening and let it
take its time.

> PS: I wish I had added: Since all you're doing is copying files, do it
> from the CLI (as Robert suggested)--no need to involve any programming
> language which is just added overhead. Then let us know how many hours
> it takes that way, for comparison.
>
> Randy Kramer

You're probably right about this as well, but I can't back out of the Ruby
corner now. I have already talked up Ruby too much; if I
change my method now it will make Ruby look really bad. =(

This is what I succeeded with:
* I removed all of the console prints for each file. (This lowered the
time by about 20s! I had no idea that output was so demanding.)
* I kept the filehandle open for writing to the process.log.
* I also removed every line of unnecessary code in the critical part of my
application.
This lowered the time to around 17s. I will now try to run the test in
the right environment.
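The kept-open filehandle point can be sketched like this (the method name and log name are made up for illustration):

```ruby
require 'fileutils'

# Copy each file and record it in a log, opening the log handle once
# for the whole run. Ruby buffers the writes, which is far cheaper
# than reopening the log, or printing to the console, for every file.
def copy_with_log(paths, target, logfile)
  File.open(logfile, 'w') do |log|
    paths.each do |path|
      FileUtils.cp(path, target)
      log.puts(path)            # buffered write, flushed when the block ends
    end
  end
end
```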

Of course I will post the results here for you guys to see.
Thanks again for your time.
//Sebastian
--
Posted via http://www.ruby-....

Saji N. Hameed

11/10/2008 8:28:00 AM


* Sebastian Newstream <abeansits@gmail.com> [2008-11-10 17:11:08 +0900]:

> Randy Kramer wrote:
> > On Sunday 09 November 2008 01:35 pm, Randy Kramer wrote:
> >> I wouldn't be too worried about the 13 hours, you've got a lot of data
> >> to move.
> >
> You're probably right. I will start the job on a Friday evening and let it
> take its time.
>
> > PS: I wish I had added: Since all you're doing is copying files, do it
> > from the CLI (as Robert suggested)--no need to involve any programming
> > language which is just added overhead. Then let us know how many hours
> > it takes that way, for comparison.
> >
> > Randy Kramer
>
> You're probably right about this as well, but I can't back out of the Ruby
> corner now. I have already talked up Ruby too much; if I
> change my method now it will make Ruby look really bad. =(
>
> This is what I succeeded with:
> * I removed all of the console prints for each file. (This lowered the
> time by about 20s! I had no idea that output was so demanding.)
> * I kept the filehandle open for writing to the process.log.
> * I also removed every line of unnecessary code in the critical part of my
> application.
> This lowered the time to around 17s. I will now try to run the test in
> the right environment.
>
> Of course I will post the results here for you guys to see.
> Thanks again for your time.
> //Sebastian


This may be a naive suggestion, but it may be worthwhile to see if there
is a benefit in parallelizing the process using threads (split the transfer
jobs among multiple threads? ...)
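A minimal version of that idea, assuming a plain queue of paths and FileUtils.cp (measure before trusting it; with MRI's interpreter lock, threads only help while the copies are blocked on I/O):

```ruby
require 'fileutils'

# Sketch: a few worker threads pull paths off a shared queue and copy
# them to the target directory. Queue#pop(true) is non-blocking and
# raises ThreadError when the queue is empty, which we turn into nil
# to end each worker's loop.
def parallel_copy(paths, target, workers: 4)
  queue = Queue.new
  paths.each { |p| queue << p }
  threads = workers.times.map do
    Thread.new do
      while (path = (queue.pop(true) rescue nil))
        FileUtils.cp(path, target)
      end
    end
  end
  threads.each(&:join)
end
```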

saji
--
Saji N. Hameed

APEC Climate Center +82 51 668 7470
National Pension Corporation Busan Building 12F
Yeonsan 2-dong, Yeonje-gu, BUSAN 611705 saji@apcc21.net
KOREA



Jano Svitok

11/10/2008 12:49:00 PM


On Mon, Nov 10, 2008 at 09:28, Saji N. Hameed <saji@apcc21.net> wrote:
> This may be a naive suggestion, but it may be worthwhile to see if there
> is a benefit in parallelizing the process using threads (split the transfer
> jobs among multiple threads? ...)

I guess Ara Howard's threadify
(http://codeforpeople.com/lib/ruby/...) might be handy.

How useful more threads are depends on network saturation: measure
your network/disk throughput using a plain system copy (maybe several
parallel ones), then measure what your script does.
I'm afraid that if you're going over Ethernet, one thread will be enough.

I'd also suggest using File.directory? to test whether a path is a
directory, instead of searching for '.'
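For example (Dir.pwd is used here only as a directory that is known to exist):

```ruby
# File.directory? asks the filesystem directly, and returns false both
# for regular files and for paths that don't exist at all.
File.directory?(Dir.pwd)           # => true
File.directory?('no/such/path')    # => false
```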

Jano

Robert Klemme

11/10/2008 1:13:00 PM


2008/11/10 Sebastian Newstream <abeansits@gmail.com>:
> Robert Klemme wrote:
>> On 09.11.2008 18:04, Sebastian Newstream wrote:
> The target system I will use is a virtual Windows 2003 server with a
> mounted network drive. Unfortunatly I have no access to any of the
> hardware.
> But I know there is at least a 100Mbit Ethernet connection between the
> server and the mounted disk.
>
>>
>> Here's the tar variant, since you copy images I assume data is
>> compressed and does not need compression (on your favorite Unix shell
>> prompt):
>>
>> $> ( cd "$source" && tar cf - . ) | ssh user@target "cd '$target' && tar xf -"
>
> Thanks for your tips, but it's a Windows system.

The command above works in a Cygwin shell. Alternatively you can use
XCOPY or the Windows shell (Explorer) directly.

>> If you can physically move the source disk to the target host and then
>> do a local copy with cp -a that's probably the fastest you can go -
>> unless the physical takes ages (e.g. to the moon or other remote
>> locations).
>
> Since our company outsourced the hardware maintenance the moon or across
> the street makes no difference. =(

:-)

> What I meant to ask was, I what way can I change my source code to be
> more effective?

And the answer is and was: don't bother too much, because your transfer
is I/O bound regardless of the programming language or tool used.

Cheers

robert


--
remember.guy do |as, often| as.you_can - without end

Sebastian Newstream

11/10/2008 2:08:00 PM


Robert Klemme wrote:
> 2008/11/10 Sebastian Newstream <abeansits@gmail.com>:
>>> compressed and does not need compression (on your favorite Unix shell
>>> prompt):
>>>
>>> $> ( cd "$source" && tar cf - . ) | ssh user@target "cd '$target' && tar xf -"
>>
>> Thanks for your tips, but it's a Windows system.
>
> The command above works in a Cygwin shell. Alternatively you can use
> XCOPY or the Windows shell (Explorer) directly.

OK, great tip. I will keep it as a backup plan.
The thing is, I need logging of all files being transferred so I know if
something is missing.
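If the logging is mainly there to prove nothing was lost, one option (a sketch; the helper names are made up) is to diff the relative path sets of source and target after the run, instead of logging every copy:

```ruby
require 'find'

# Collect every path under root, expressed relative to root, sorted.
def relative_paths(root)
  paths = []
  Find.find(root) do |p|
    paths << p.sub(/\A#{Regexp.escape(root)}\/?/, '') unless p == root
  end
  paths.sort
end

# Paths present in source but absent from target after the copy.
def missing_files(source, target)
  relative_paths(source) - relative_paths(target)
end
```

This only compares names, not contents; checksums would be needed to catch truncated copies.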

>
>>> If you can physically move the source disk to the target host and then
>>> do a local copy with cp -a that's probably the fastest you can go -
>>> unless the physical takes ages (e.g. to the moon or other remote
>>> locations).
>>
>> Since our company outsourced the hardware maintenance the moon or across
>> the street makes no difference. =(
>
> :-)
>
>> What I meant to ask was, I what way can I change my source code to be
>> more effective?
>
> And the answer is and was: don't bother too much because your transfer
> is IO bound regardless of programming language or tool used.

OK! I will listen to your tips.
Thanks for all your input Robert.
Best regards
//Sebastian

>
> Cheers
>
> robert

--
Posted via http://www.ruby-....