Brian Candler
2/8/2007 3:50:00 PM
On Fri, Feb 09, 2007 at 12:11:10AM +0900, Drew Olson wrote:
> Pit Capitain wrote:
> > Drew Olson schrieb:
> >> datasets here and want to streamline the code as much as possible.
> > Drew, I would try to perform the comparison inside the database,
> > especially if your datasets are "very large". (How large is this
> > actually?) Given that you are working with Oracle, I would use Oracle's
> > database links to get access to both tables.
> >
> > Regards,
> > Pit
>
> Pit -
>
> This does make sense, but I'm using ruby in this situation for a reason.
> I had previously written some validation scripts that ran on .csv dumps
> from the database and I leveraged this previous work to get these
> scripts up and running quickly by introducing ActiveRecord. Also, I'm
> writing an error report to .csv using FasterCSV and doing quite a bit of
> data manipulation during these compares. In short, I'd really like to
> continue using ruby/ActiveRecord here. However, I want to make sure that
> the way I'm going about it is as "efficiently as possible". It's not a
> __huge__ deal, however if I'm make some massive error that would save
> 50% when running my scripts, it would be nice to change them.
>
> As far as record size, we're talking close to 1 million records, more in
> some cases.
I'd suggest:
1. sort both the datasets by the primary key (or SELECT ... ORDER BY)
2. run through both lists with a pointer to each. This lets you efficiently
find records which are in A but not in B, and vice versa. This is a
linear-time operation.
3. then, for records which are in both A and B, you can compare them
attribute-by-attribute.
You can do the whole lot externally: write out both datasets to CSV files,
sort them (using the Unix 'sort' command), and then open both files and read
in a line at a time. Your 'pointer' is then just your current position in
the file, and this avoids having to read the whole datasets into memory. A
million records should be quite doable.
If you just want to find rows which exist in one but not the other, the Unix
'join' command may do what you need. This also needs the files to be
pre-sorted.
Doing the search in SQL will probably be more efficient, but the query is a
pain to write. Something like
SELECT table1.key
FROM table1 LEFT JOIN table2
ON table1.key = table2.key
WHERE table2.key IS NULL;
will find rows which exist in table A but not in table B. Turn it around to
find rows in B but not A. Do a third query which is a normal join to compare
all the fields of corresponding rows. There may be a better way that this,
but my SQL foo doesn't extend that far.
HTH,
Brian.