[lnkForumImage]
TotalShareware - Download Free Software

Confronta i prezzi di migliaia di prodotti.
Asp Forum
 Home | Login | Register | Search 


 

Forums >

comp.lang.ruby

Efficient parsing of large Excel documents in Ruby

Wes Gamble

10/28/2006 7:25:00 AM

All,

I am currently using the parseexcel gem to parse an Excel file so that I
can save it's data into a database (in a Rails app.).

When I run it against an Excel file with ~42000 rows and 11 columns (the
spreadsheet is about 10MB in size), it takes the better part of 10
minutes to parse the file and uses about 200 MB of memory. I'm not
actually sure if the parsing completes successfully - the app. stops
right after the parsing appears to finish.

It would seem that I have two issues - memory usage and performance.

Is anyone aware of a much faster way to parse large Excel files?

Is anyone aware of a "windowing" scheme that uses a set amount of memory
to transfer portions of the data to a client requesting the data so as
to keep the memory required at a fixed level?

Should I look into using the Win32OLE libraries with the hope that it
would be more efficient?

I am transferring this data into a SQL Server database, so perhaps I
could look into some M$-native method of moving this data.

Any thoughts or advice is appreciated.

Thanks,
Wes Gamble

--
Posted via http://www.ruby-....

16 Answers

Christian Madsen

10/28/2006 7:55:00 AM

0

You should definitely use win32ole. This way, you can cut down memory
usage to the minimal needs by only storing the current row. I've been
using it for storing 11000 rows by 10 columns with great success.

However, there seems to be some kind of round trip delay in the OLE
communication: writing 11000 rows takes about 90 seconds, but that
should still be OK for you. I overcame this by simply storing an 11000
by 10 array, which took ~1 second. This then required the memory usage
of an 11000 by 10 array, but not 200MB...

See this link: http://wiki.rubygarden.org/Ruby/page/show/Scri...

So in the end, you should use win32ole and probably make an Excel
wrapper class for instantiating/closing Excel, appending rows etc.

-Christian

Wes Gamble skrev:
> All,
>
> I am currently using the parseexcel gem to parse an Excel file so that I
> can save it's data into a database (in a Rails app.).
>
> When I run it against an Excel file with ~42000 rows and 11 columns (the
> spreadsheet is about 10MB in size), it takes the better part of 10
> minutes to parse the file and uses about 200 MB of memory. I'm not
> actually sure if the parsing completes successfully - the app. stops
> right after the parsing appears to finish.
>
> It would seem that I have two issues - memory usage and performance.
>
> Is anyone aware of a much faster way to parse large Excel files?
>
> Is anyone aware of a "windowing" scheme that uses a set amount of memory
> to transfer portions of the data to a client requesting the data so as
> to keep the memory required at a fixed level?
>
> Should I look into using the Win32OLE libraries with the hope that it
> would be more efficient?
>
> I am transferring this data into a SQL Server database, so perhaps I
> could look into some M$-native method of moving this data.
>
> Any thoughts or advice is appreciated.
>
> Thanks,
> Wes Gamble
>
> --
> Posted via http://www.ruby-....

matt

10/28/2006 4:37:00 PM

0

Christian Madsen <doktormadsen@gmail.com> wrote:

> See this link: http://wiki.rubygarden.org/Ruby/page/show/Scri...

Interesting. Just as a note, everything that happens on that page can
now be done on Mac using almost the same language, thanks to
rb-appscript (or RubyOSA, which I have not yet tried). It would be
interesting to write a Mac version of the page. m.

--
matt neuburg, phd = matt@tidbits.com, http://www.tidbits...
Tiger - http://www.takecontrolbooks.com/tiger-custom...
AppleScript - http://www.amazon.com/gp/product/...
Read TidBITS! It's free and smart. http://www.t...

Sam Smoot

10/28/2006 4:43:00 PM

0

I agree with Christian, but with a twist.

I like to use Excel Automation (w/ WIN32OLE) to convert the document to
a format that's easier to work with. That way, after the initial
conversion, you can handle CSV, XML, etc, and be working with a fast
native library instead of the slow interop.

Another reason I prefer this method is that Excel interop is sometimes
hard to get "just right". There's processes I've written with it
running for over a year that I've never had to touch, but there's
another that used to require I login to the server and kill orphaned
Excel processes. Someone changes a column, you don't code defensively
enough, and BAM!

So the simplicity of the Load/Convert process, and getting out of Excel
as quickly as possible, conserving resources and avoiding potential
bugs that might result in orphaned processes is a big advantage IMO.

Plus it's almost gauranteed to be the fastest, and simplest to develop.
(As long as you're OK with having to run on Windows and having Excel
installed anyways.)

Wes Gamble

10/28/2006 5:23:00 PM

0

I was definitely thinking about using win32ole/Excel to convert this
spreadsheet to CSV and then process it with a CSV parser.

I'm currently running on Linux though, so now I have to figure out any
issues around a Windows production platform.

Thanks,
Wes


--
Posted via http://www.ruby-....

M. Edward (Ed) Borasky

10/28/2006 6:53:00 PM

0

Wes Gamble wrote:
> I was definitely thinking about using win32ole/Excel to convert this
> spreadsheet to CSV and then process it with a CSV parser.
>
> I'm currently running on Linux though, so now I have to figure out any
> issues around a Windows production platform.
>
> Thanks,
> Wes
>
>

Have you tried unixODBC? It seems to work fairly well, although there
are some issues with most Linux/Unix tools when the spreadsheet has tabs
or column headers with spaces in them.

It's fairly easy to leave the Excel spreadsheets on a Windows system and
hack together a simple "server" for them. I think all you have to do is
"publish them to the intranet", but I've never done it. Then you can use
something like Hpricot to parse them.

M. Edward (Ed) Borasky

10/28/2006 7:09:00 PM

0

Wes Gamble wrote:

> I am transferring this data into a SQL Server database, so perhaps I
> could look into some M$-native method of moving this data.

Yes, there is a native way of doing this in VBA from a macro, which you
can store either in your personal macro workbook on in the spreadsheets
themselves. I have a colleague where I work that built an application
that does it.

But unfortunately, I have no idea what the technique is called. Almost
anyone who's been through some advanced Microsoft training ought to be
able to help you find it, or you might search the Microsoft knowledge
base for "efficient transfer of data to SQL Server".

If you don't have an answer by Tuesday, let me know and I'll ask her
what it's called.

And no, I don't think it involves installing CygWin. :)

Wes Gamble

10/28/2006 7:27:00 PM

0

Ed,

Not a bad idea. I'm already using unixODBC to connect to SQL Server
from the Linux box. I totally forgot that you can get to Excel via
ODBC.

Don't I need an ODBC driver for Excel though?

What about a DBI driver for Excel - does one exist?

Thanks,
Wes

--
Posted via http://www.ruby-....

Charles Oliver Nutter

10/28/2006 11:00:00 PM

0

Wes Gamble wrote:
> All,
>
> I am currently using the parseexcel gem to parse an Excel file so that I
> can save it's data into a database (in a Rails app.).

Another rough option would be using JRuby or one of the Java bridges to
call out to POI, Apache's Office-document library.

--
Charles Oliver Nutter, JRuby Core Developer
Blogging on Ruby and Java @ headius.blogspot.com
Help spec out Ruby today! @ www.headius.com/rubyspec
headius@headius.com -- charles.nutter@sun.com

Wes Gamble

10/29/2006 6:58:00 AM

0

Charles Oliver Nutter wrote:
> Wes Gamble wrote:
>> All,
>>
>> I am currently using the parseexcel gem to parse an Excel file so that I
>> can save it's data into a database (in a Rails app.).
>
> Another rough option would be using JRuby or one of the Java bridges to
> call out to POI, Apache's Office-document library.

Charles,

I would love to use JRuby for this app., but don't feel it's quite ready
yet (although you guys are moving it forward at an incredible rate and I
expect it will be a viable production option in less than a year - keep
up the good work).

Does anyone know if the POI Ruby bindings are a potential solution? I
see
"Implement support for reading Excel files (easy)" under the list of
TODOs so not sure if I could use POI-Ruby to read...

Anyone have any experience with POI-Ruby?

Wes


--
Posted via http://www.ruby-....

Wes Gamble

10/29/2006 7:05:00 AM

0

M. Edward (Ed) Borasky wrote:
> Wes Gamble wrote:
>> I was definitely thinking about using win32ole/Excel to convert this
>> spreadsheet to CSV and then process it with a CSV parser.
>>
>> I'm currently running on Linux though, so now I have to figure out any
>> issues around a Windows production platform.
>>
>> Thanks,
>> Wes
>>
>>
>
> Have you tried unixODBC? It seems to work fairly well, although there
> are some issues with most Linux/Unix tools when the spreadsheet has tabs
> or column headers with spaces in them.
>
> It's fairly easy to leave the Excel spreadsheets on a Windows system and
> hack together a simple "server" for them. I think all you have to do is
> "publish them to the intranet", but I've never done it. Then you can use
> something like Hpricot to parse them.

Hpricot? I thought that only parsed HTML? Can you say a little more
about this? Would the "server" you mention above parse the Excel into
some intermediate format that I would then process with something else
(like Hpricot or ???)?

Thanks,
Wes

--
Posted via http://www.ruby-....