comp.lang.ruby

General Approach to Data Validation

Drew Olson

2/8/2007 2:16:00 PM

Hey all-

I posted some questions yesterday about getting ActiveRecord working
with Oracle 10g, and I'm happy to say I was successful. ActiveRecord
makes pulling and manipulating data so easy. Now, I'm writing some data
validation scripts. Here's the goal: given two data sources, confirm
that specific fields in the first table are the same as the fields in
the 2nd for the same record, check to see if there are records in the
1st table that aren't in the 2nd, and vice versa. I was wondering if my
general approach is the fastest way to do this, as I'm not new to ruby
but generally new to writing high performance sql (and thus the
corresponding ActiveRecord statements). I'm working with very large
datasets here and want to streamline the code as much as possible.
Here's my general strategy (assume the objects are ActiveRecord
objects), advice is much appreciated:

Table1.find(:all).each do |rec1|
  rec2 = Table2.find(:first, :conditions => {:key1 => rec1.key1, :key2 => rec1.key2})
  if rec2
    if rec1.field1 != rec2.field2
      # throw value mismatch here
    end
  else
    # throw missing record error here
  end
end

Table2.find(:all).each do |rec2|
  rec1 = Table1.find(:first, :conditions => {:key1 => rec2.key1, :key2 => rec2.key2})
  if !rec1
    # throw missing record error here
  end
end


11 Answers

Pit Capitain

2/8/2007 2:42:00 PM


Drew Olson wrote:
> I posted some questions yesterday about getting ActiveRecord working
> with Oracle 10g, and I'm happy to say I was successful. ActiveRecord
> makes pulling and manipulating data so easy. Now, I'm writing some data
> validation scripts. Here's the goal: given two data sources, confirm
> that specific fields in the first table are the same as the fields in
> the 2nd for the same record, check to see if there are records in the
> 1st table that aren't in the 2nd, and vice versa. I was wondering if my
> general approach is the fastest way to do this, as I'm not new to ruby
> but generally new to writing high performance sql (and thus the
> corresponding ActiveRecord statements). I'm working with very large
> datasets here and want to streamline the code as much as possible.

Drew, I would try to perform the comparison inside the database,
especially if your datasets are "very large". (How large is this
actually?) Given that you are working with Oracle, I would use Oracle's
database links to get access to both tables.
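Something like this might work (an untested sketch; it assumes a database
link named remote_db already exists, and it reuses the key columns from
your code):

  # The set difference is computed entirely inside Oracle; only the rows
  # that are missing from table2 come back to Ruby.
  missing = ActiveRecord::Base.connection.select_all(<<-SQL)
    SELECT key1, key2 FROM table1
    MINUS
    SELECT key1, key2 FROM table2@remote_db
  SQL

  missing.each { |row| puts "missing from table2: #{row.values.join(',')}" }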

Regards,
Pit

Drew Olson

2/8/2007 3:11:00 PM


Pit Capitain wrote:
> Drew Olson wrote:
>> datasets here and want to streamline the code as much as possible.
> Drew, I would try to perform the comparison inside the database,
> especially if your datasets are "very large". (How large is this
> actually?) Given that you are working with Oracle, I would use Oracle's
> database links to get access to both tables.
>
> Regards,
> Pit

Pit -

This does make sense, but I'm using ruby in this situation for a reason.
I had previously written some validation scripts that ran on .csv dumps
from the database and I leveraged this previous work to get these
scripts up and running quickly by introducing ActiveRecord. Also, I'm
writing an error report to .csv using FasterCSV and doing quite a bit of
data manipulation during these compares. In short, I'd really like to
continue using ruby/ActiveRecord here. However, I want to make sure that
I'm going about it as efficiently as possible. It's not a __huge__ deal,
but if I'm making some massive error whose fix would save 50% of my
scripts' runtime, it would be nice to change them.

As far as record size, we're talking close to 1 million records, more in
some cases.

-Drew


Brian Candler

2/8/2007 3:50:00 PM


On Fri, Feb 09, 2007 at 12:11:10AM +0900, Drew Olson wrote:
> This does make sense, but I'm using ruby in this situation for a reason.
> I had previously written some validation scripts that ran on .csv dumps
> from the database and I leveraged this previous work to get these
> scripts up and running quickly by introducing ActiveRecord. Also, I'm
> writing an error report to .csv using FasterCSV and doing quite a bit of
> data manipulation during these compares. In short, I'd really like to
> continue using ruby/ActiveRecord here. However, I want to make sure that
> I'm going about it as efficiently as possible. It's not a __huge__ deal,
> but if I'm making some massive error whose fix would save 50% of my
> scripts' runtime, it would be nice to change them.
>
> As far as record size, we're talking close to 1 million records, more in
> some cases.

I'd suggest:

1. sort both the datasets by the primary key (or SELECT ... ORDER BY)

2. run through both lists with a pointer to each. This lets you efficiently
find records which are in A but not in B, and vice versa. This is a
linear-time operation.

3. then, for records which are in both A and B, you can compare them
attribute-by-attribute.

You can do the whole lot externally: write out both datasets to CSV files,
sort them (using the Unix 'sort' command), and then open both files and read
in a line at a time. Your 'pointer' is then just your current position in
the file, and this avoids having to read the whole datasets into memory. A
million records should be quite doable.
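For instance, a minimal sketch of that merge over two pre-sorted CSV dumps
using FasterCSV (file names are hypothetical; it assumes the key is the
first column and there are no header rows):

  require 'rubygems'
  require 'fastercsv'

  t1 = FasterCSV.open("table1_sorted.csv", "r")
  t2 = FasterCSV.open("table2_sorted.csv", "r")

  row1 = t1.shift
  row2 = t2.shift

  until row1.nil? && row2.nil?
    key1 = row1 && row1[0]
    key2 = row2 && row2[0]

    if row2.nil? || (row1 && key1 < key2)
      puts "#{key1} missing from table2"
      row1 = t1.shift
    elsif row1.nil? || key2 < key1
      puts "#{key2} missing from table1"
      row2 = t2.shift
    else
      # present in both files: compare the remaining fields
      puts "#{key1} field mismatch" if row1[1..-1] != row2[1..-1]
      row1 = t1.shift
      row2 = t2.shift
    end
  end

  t1.close
  t2.close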

If you just want to find rows which exist in one but not the other, the Unix
'join' command may do what you need. This also needs the files to be
pre-sorted.

Doing the search in SQL will probably be more efficient, but the query is a
pain to write. Something like

SELECT table1.key
FROM table1 LEFT JOIN table2
ON table1.key = table2.key
WHERE table2.key IS NULL;

will find rows which exist in table A but not in table B. Turn it around to
find rows in B but not A. Do a third query which is a normal join to compare
all the fields of corresponding rows. There may be a better way than this,
but my SQL foo doesn't extend that far.
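In ActiveRecord you could run these raw queries with find_by_sql; roughly
(table and column names are placeholders):

  # Rows in table2 with no match in table1 (the first query turned around).
  only_in_t2 = Table2.find_by_sql(<<-SQL)
    SELECT table2.*
    FROM table2 LEFT JOIN table1 ON table1.key = table2.key
    WHERE table1.key IS NULL
  SQL

  # Rows present in both tables whose fields differ (a normal join plus a filter).
  mismatched = Table1.find_by_sql(<<-SQL)
    SELECT table1.*
    FROM table1 JOIN table2 ON table1.key = table2.key
    WHERE table1.field1 <> table2.field1
  SQL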

HTH,

Brian.

Pit Capitain

2/8/2007 4:27:00 PM


Drew Olson wrote:
> This does make sense, but I'm using ruby in this situation for a reason.
> I had previously written some validation scripts that ran on .csv dumps
> from the database and I leveraged this previous work to get these
> scripts up and running quickly by introducing ActiveRecord. Also, I'm
> writing an error report to .csv using FasterCSV and doing quite a bit of
> data manipulation during these compares. In short, I'd really like to
> continue using ruby/ActiveRecord here. However, I want to make sure that
> I'm going about it as efficiently as possible. It's not a __huge__ deal,
> but if I'm making some massive error whose fix would save 50% of my
> scripts' runtime, it would be nice to change them.
>
> As far as record size, we're talking close to 1 million records, more in
> some cases.

Drew, I don't think the client is the right place for this job. But if
you really want to do it there, I'd try one of the following approaches.
Note that I've never used ActiveRecord before, so I don't know whether
you actually can do this.

One way would be to read all the records at once and then use Ruby to
compare the two datasets. This requires lots of RAM, and it doesn't
scale well if you get more records. You could try to split the
datasets into smaller disjoint parts and then compare only those parts
in order to reduce the memory needed.

The other way would be to read the records one after the other (using
cursors in db terminology) in an appropriate sort order. If ActiveRecord
allows you to do this in parallel with both tables, you could compare
the tables like this:

table1 = open_cursor_for "table1"
table2 = open_cursor_for "table2"

record1 = table1.next_record
record2 = table2.next_record

until record1.nil? and record2.nil?
  case relevant_fields(record1) <=> relevant_fields(record2)
  when -1
    puts "#{record1} is missing in table2"
    record1 = table1.next_record
  when +1
    puts "#{record2} is missing in table1"
    record2 = table2.next_record
  else
    puts "#{record1} exists in both tables"
    record1 = table1.next_record
    record2 = table2.next_record
  end
end

table1.close
table2.close

(I see that Brian suggested the same algorithm.)

The code you have shown in your first post performs roughly 2 million
database queries. Try this with 100, 1000, 10000 queries, and then
estimate how long it would take for the real job. If this is no problem
for you, your code should be fine.
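A quick way to get that estimate (a rough sketch; the 1000 is an arbitrary
sample size and Benchmark comes from the standard library):

  require 'benchmark'

  sample = Table1.find(:all, :limit => 1000)

  seconds = Benchmark.realtime do
    sample.each do |rec1|
      Table2.find(:first, :conditions => {:key1 => rec1.key1, :key2 => rec1.key2})
    end
  end

  puts "#{seconds} seconds for 1000 lookups"
  puts "roughly #{seconds * 2_000_000 / 1000 / 60} minutes for ~2 million"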

Regards,
Pit

Drew Olson

2/8/2007 4:33:00 PM


Brian Candler wrote:
> I'd suggest:
>
> 1. sort both the datasets by the primary key (or SELECT ... ORDER BY)
>
> 2. run through both lists with a pointer to each. This lets you efficiently
> find records which are in A but not in B, and vice versa. This is a
> linear-time operation.
>
> 3. then, for records which are in both A and B, you can compare them
> attribute-by-attribute.

I think you've hit the nail on the head with this. It will drastically
reduce the number of queries I make. I'm coding it up now and taking a
look at the performance difference.

Thanks for your help,
Drew


Drew Olson

2/8/2007 4:51:00 PM


Pit Capitain wrote:
> The other way would be to read the records one after the other (using
> cursors in db terminology) in an appropriate sort order. If ActiveRecord
> allows you to do this in parallel with both tables, you could compare
> the tables like this:

This is the exact approach I originally wanted to take. I would be very
interested to know if ActiveRecord supports this type of operation. I
have yet to see a way to do it from what I've read of the documentation.

-Drew


Gary Wright

2/8/2007 7:46:00 PM



On Feb 8, 2007, at 11:51 AM, Drew Olson wrote:

> This is the exact approach I originally wanted to take. I would be very
> interested to know if ActiveRecord supports this type of operation. I
> have yet to see a way to do it from what I've read of the documentation.

A while back (last year?) I asked why ActiveRecord didn't support
'lazy' iteration over query results. AR slurps up the entire
result set into memory and then wraps objects (i.e. instances of
ActiveRecord::Base or subclasses) around each row.

This approach obviously has problems if you have large data sets.

Perhaps this has changed since I last posed the question?


Gary Wright




Ezra Zygmuntowicz

2/9/2007 5:33:00 AM


Hi~

On Feb 8, 2007, at 11:46 AM, Gary Wright wrote:

> A while back (last year?) I asked why ActiveRecord didn't support
> 'lazy' iteration over query results. AR slurps up the entire
> result set into memory and then wraps objects (i.e. instances of
> ActiveRecord::Base or subclasses) around each row.
>
> This approach obviously has problems if you have large data sets.
>
> Perhaps this has changed since I last posed the question?
>
>
> Gary Wright


ActiveRecord will start to consume a huge amount of memory and CPU
if you fetch more than a thousand or so records at a time. This is
because it loads all results into memory at once and then wraps
each row in an AR object, which is expensive with the number of records
being talked about in this thread. There is an AR plugin somewhere
called paginating find that works sort of like a cursor but doesn't
use a cursor; it uses limits and offsets.

But I do think there is an ActiveRecordExtensions[1] project that
fixes some of these issues as well.
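A minimal sketch of that limit/offset idea, hand-rolled rather than via the
plugin (batch size and ordering columns are arbitrary):

  batch_size = 1000
  offset = 0

  loop do
    # Only one page of records is instantiated at a time; ordering by the
    # key keeps the pages stable between queries.
    batch = Table1.find(:all, :order => "key1, key2",
                        :limit => batch_size, :offset => offset)
    break if batch.empty?

    batch.each do |rec1|
      # ... process rec1 here ...
    end

    offset += batch_size
  end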

Cheers-
-- Ezra Zygmuntowicz
-- Lead Rails Evangelist
-- ez@engineyard.com
-- Engine Yard, Serious Rails Hosting
-- (866) 518-YARD (9273)

[1] http://rubyforge.org/proje...

Ara.T.Howard

2/9/2007 6:31:00 AM


Ezra Zygmuntowicz

2/9/2007 7:18:00 AM


Hey Ara-

On Feb 8, 2007, at 10:31 PM, ara.t.howard@noaa.gov wrote:

> hey ezra-
>
> fyi, nearly all the database apis support
>
>   db.execute(sql) do |tuple|
>     p tuple
>   end
>
> which does the obvious - only one tuple is in memory at a time. i sent in a
> patch for rails to use any given block as meaning: construct the AR object
> and yield it, one at a time. it largely consisted of a bunch of '&b's, but
> there was no interest at the time. i'm shocked that this is still an issue
> in the rails core. haven't people been hitting this as rails is used on
> bigger projects with bigger databases?
>
> have you used any of the patches/extensions?

I've used the AR extensions project; it adds a lot of the stuff you
need to work with large data sets. Hosting lots of rails apps, the
error I see most often is either the ActiveRecord n+1 query problem,
where each page results in hundreds of sql queries, or too many
:include joins that end up pulling hundreds or thousands of objects
into memory at once, killing the mongrels the app runs in.
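For reference, the two patterns in miniature (hypothetical models):

  # n+1: one query for the posts, then one query per post for its comments.
  Post.find(:all).each do |post|
    post.comments.each { |c| puts c.body }
  end

  # :include eager-loads the comments in fewer queries, but every associated
  # object is instantiated in memory at once, which hurts on large sets.
  Post.find(:all, :include => :comments).each do |post|
    post.comments.each { |c| puts c.body }
  end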


>
> ps. wore my engine yard shirt today!

Awesome!

>
> regards.
>
> -a
> --
> we can deny everything, except that we have the possibility of
> being better.
> simply reflect on that.
> - the dalai lama
>


Cheers-
-- Ezra Zygmuntowicz
-- Lead Rails Evangelist
-- ez@engineyard.com
-- Engine Yard, Serious Rails Hosting
-- (866) 518-YARD (9273)