Robert Klemme
5/30/2006 11:10:00 AM
subimage wrote:
> Hey all...
>
> I'm working on a massive Rails site that does heavy data import daily.
> A lot of this data is in XML files of various sizes ranging from 100k
> to 400mb, and totaling around 2gb for all sources. I'd like to keep the
> entire project using Ruby.
>
> At first, I wrote my parsers using REXML, but found that to be DOG
> SLOW, especially for the large files. I tried REXML::parse_stream but
> couldn't find any good documentation for handling parsing that way. It
> was taking around 30 minutes to an hour to even _open_ the larger files
> on a p4 1.8ghz test machine.
>
> After that exercise I switched to libxml, which is a lot speedier, but
> still slow (no numbers to back it up yet, I can just tell by the speed of
> data inserts into my DB).
>
> I'm wondering if there's some other lib out there that I'm missing? Can
> someone point me in the right direction? Is there anything faster I'm
> missing out on?
>
> Are there any "gotchas" with using libxml that I should be aware of
> speed-wise?
>
> Any and all help is much appreciated...thanks!
Since you insert data into a DB: are you absolutely positive that it's the
XML parsing part that's slow? Here's what I'd do: use two threads connected
with a bounded queue, one thread reading the XML with REXML's stream parser
and one thread inserting into the DB. That way you can use the CPU for
parsing XML while your process waits for the DB call to return. If
possible, use bulk insertions. A rough sketch is below.
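
Something along these lines - untested, just to show the idea; the element
name 'record', the field names and the insert_record call are only
placeholders for whatever your data and DB layer actually look like:

require 'rexml/document'
require 'rexml/streamlistener'
require 'thread'

queue = SizedQueue.new(100)   # bounded, so the parser can't run far ahead of the DB

# Collects one <record> element at a time and hands it to the consumer.
class RecordListener
  include REXML::StreamListener

  def initialize(queue)
    @queue = queue
  end

  def tag_start(name, attrs)
    if name == 'record'
      @record = {}
    elsif @record
      @field = name
    end
  end

  def text(text)
    @record[@field] = text if @record && @field
  end

  def tag_end(name)
    if name == 'record'
      @queue.push(@record)
      @record = nil
    else
      @field = nil
    end
  end
end

producer = Thread.new do
  File.open('huge.xml') do |f|
    REXML::Document.parse_stream(f, RecordListener.new(queue))
  end
  queue.push(nil)               # sentinel: tells the consumer we're done
end

consumer = Thread.new do
  while (record = queue.pop)
    # insert_record(record)     # your DB insert goes here; batch if you can
  end
end

producer.join
consumer.join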
Alternatively, write out a CSV file and use the DB's bulk loader to pump
the data into the DB; a second sketch for that is below. HTH
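
The CSV side could look something like this (again untested; the field
names are made up and the bulk load command depends on your DB):

require 'csv'

# 'records' stands in for whatever collection of parsed hashes you end up with
records = [{ 'name' => 'foo', 'price' => '1.23' }]

CSV.open('records.csv', 'w') do |csv|
  records.each do |record|
    csv << [record['name'], record['price']]
  end
end

# Then use the DB's own bulk loader, e.g.
#   MySQL:      LOAD DATA INFILE 'records.csv' INTO TABLE records FIELDS TERMINATED BY ',';
#   PostgreSQL: COPY records FROM '/path/to/records.csv' WITH CSV;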
Kind regards
robert